Unlike regression, where you predict a continuous number, classification is used to predict a category. Classification has diverse applications ranging from medicine to marketing. Popular classification models include both linear models like Logistic Regression and Support Vector Machines (SVM) and nonlinear models like K-Nearest Neighbors (K-NN), Kernel SVM, and Random Forests.
While classification itself is a broad task, one of its best-known business applications is churn modeling.
A churn model is a type of predictive model used to identify customers who are likely to stop using a product or service, referred to as "churn." This model is commonly employed in industries like telecommunications, subscription services, and e-commerce to retain customers and reduce turnover.
Key Elements of Churn Modeling:
Objective:
Predict which customers are at risk of leaving or stopping usage.
Understand factors contributing to churn.
Applications:
Customer Retention: Companies use churn predictions to implement targeted retention strategies, such as offering discounts or personalized services.
Marketing Campaigns: Identifying high-risk customers helps in optimizing marketing budgets by focusing on customers with the highest churn probability.
Features:
Demographic Data: Age, income, location, etc.
Behavioral Data: Frequency of service use, past purchases, complaints, etc.
Engagement Metrics: Login frequency, feature usage, or activity on the platform.
Models Used:
Logistic Regression, Decision Trees, Random Forests, Gradient Boosting, and Neural Networks are commonly used algorithms.
For example, if a telecom company wants to predict churn, it might use data such as call duration, number of dropped calls, subscription plan details, and customer service interactions. By analyzing these variables, the churn model predicts which customers are likely to leave, enabling the company to take proactive measures.
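To make this concrete, here is a minimal sketch of such a churn model in Python. The table of customers, the column names, and the values are illustrative assumptions rather than a real dataset, and a Random Forest is used simply as one of the commonly listed algorithms.
# A sketch of a churn classifier on a tiny, made-up telecom table (illustrative values only)
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

customers = pd.DataFrame({
    'call_duration': [120, 45, 300, 60, 15, 210, 90, 10],   # average monthly minutes
    'dropped_calls': [1, 5, 0, 3, 8, 1, 2, 9],
    'support_calls': [0, 4, 1, 2, 6, 0, 1, 7],
    'churned':       [0, 1, 0, 0, 1, 0, 0, 1],              # 1 = customer left
})

X = customers.drop(columns='churned')   # behavioral features
y = customers['churned']                # target: did the customer leave?

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Which behaviors contribute most to the churn prediction?
print(dict(zip(X.columns, model.feature_importances_)))

# Estimated probability that a new customer (low usage, many dropped and support calls) will churn
new_customer = pd.DataFrame([[30, 7, 5]], columns=X.columns)
print(model.predict_proba(new_customer)[0][1])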
Examples of classification problems include:
Email spam detection (classifying emails as spam or not spam).
Image classification (e.g., distinguishing between cats and dogs).
Classification Algorithms
In this section, you will learn how to implement the following Machine Learning Classification models:
Logistic Regression
K-Nearest Neighbors (K-NN)
Support Vector Machine (SVM)
Kernel SVM
Naive Bayes
Decision Tree Classification
Random Forest Classification
Logistic Regression
Logistic Regression is a statistical model used to predict a categorical dependent variable based on one or more independent variables. For example, an insurance company might want to predict whether a person will buy insurance (dependent variable) based on factors like age, income, etc. (independent variables).
Key Points:
Sigmoid Curve: The logistic regression model uses a sigmoid function to map predicted values to probabilities, ranging between 0 and 1.
The sigmoid function is given by:
p = \frac{1}{1 + e^{-z}}
Where:
p is the predicted probability.
z = b_0 + b_1 x_1 + b_2 x_2 + \dots
b_0 is the intercept, and b_1, b_2, \dots are the coefficients of the independent variables x_1, x_2, \dots.
Log-Odds Transformation: The logistic regression equation can be written in terms of log-odds:
\ln\left(\frac{p}{1-p}\right) = z = b_0 + b_1 x_1 + b_2 x_2 + \dots
Here, \ln\left(\frac{p}{1-p}\right) represents the log-odds of the event.
Maximum Likelihood Estimation (MLE): Logistic regression optimizes the parameters b_0, b_1, \dots to find the curve that best fits the data by maximizing the likelihood function.
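To make these formulas concrete, here is a small numerical sketch; the coefficient values b0, b1, b2 are made up purely for illustration and are not estimated from any dataset.
import numpy as np

# Hypothetical fitted coefficients: intercept b0 and slopes b1, b2 (made-up values)
b0, b1, b2 = -1.0, 0.8, 0.5

def predict_probability(x1, x2):
    z = b0 + b1 * x1 + b2 * x2      # the linear combination (the log-odds)
    return 1 / (1 + np.exp(-z))     # sigmoid maps z to a probability between 0 and 1

p = predict_probability(1.2, 0.4)
print(p)                            # roughly 0.54
print(np.log(p / (1 - p)))          # recovers z = 0.16, the log-odds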
Example Scenario:
Suppose you have a dataset with two features: Age and Estimated Salary.
The goal is to predict whether a person will make a purchase (1 = Yes, 0 = No).
By applying logistic regression, the model determines the probability that a given individual belongs to a specific category (e.g., will make the purchase).
Code Implementation with Explanations
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
x = dataset.iloc[:, :-1].values # Features (independent variables)
y = dataset.iloc[:, -1].values # Target (dependent variable)
# Splitting the dataset into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
# Feature Scaling
# Standardizing the features to bring them to the same scale.
# This improves the performance of many machine learning algorithms.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test) # Note: Use transform, not fit_transform, for test set
# Training the Logistic Regression model on the training set
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression(random_state=0)
LR.fit(x_train, y_train)
# Predicting a new result
# Example: Predict if a 30-year-old with a salary of 87,000 will buy the product
# print(LR.predict(sc.transform([[30, 87000]])))
# Predicting the test set results
y_pred = LR.predict(x_test)
# Printing predictions alongside actual values for comparison
print(np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), axis=1))
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm) # Confusion Matrix: Shows true positives, false positives, etc.
print(accuracy_score(y_test, y_pred)) # Accuracy score of the model
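# Optional: inspect the predicted probabilities behind these labels.
# predict_proba returns, for each observation, the probability of class 0 and class 1;
# y_pred above is simply these probabilities thresholded at 0.5.
print(LR.predict_proba(x_test[:5]))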
# Visualizing the training set results (Optional, requires additional code)
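One possible way to fill in that optional visualization step is sketched below. It assumes the dataset's two features are Age and Estimated Salary (as in the scenario above) and plots the decision regions learned by LR over the scaled training data; the colors and grid step are arbitrary choices.
from matplotlib.colors import ListedColormap

# Build a fine grid over the (scaled) feature space and classify every grid point
X_set, y_set = x_train, y_train
x1_grid, x2_grid = np.meshgrid(
    np.arange(X_set[:, 0].min() - 1, X_set[:, 0].max() + 1, 0.01),
    np.arange(X_set[:, 1].min() - 1, X_set[:, 1].max() + 1, 0.01)
)
grid_predictions = LR.predict(np.c_[x1_grid.ravel(), x2_grid.ravel()]).reshape(x1_grid.shape)

# Shade the two decision regions, then overlay the actual training points
plt.contourf(x1_grid, x2_grid, grid_predictions, alpha=0.3, cmap=ListedColormap(('red', 'green')))
for label, color in zip(np.unique(y_set), ('red', 'green')):
    plt.scatter(X_set[y_set == label, 0], X_set[y_set == label, 1], color=color, label=str(label))
plt.title('Logistic Regression (Training set)')
plt.xlabel('Age (scaled)')
plt.ylabel('Estimated Salary (scaled)')
plt.legend()
plt.show()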
Explanation of Key Steps:
Data Splitting:
The dataset is split into training (75%) and test (25%) sets using train_test_split.
Feature Scaling:
StandardScaler standardizes features by removing the mean and scaling to unit variance. This is important for models like Logistic Regression that are sensitive to feature magnitudes.
Logistic Regression Model:
The LogisticRegression class from scikit-learn is used to train the model on the scaled training data.
Predictions:
Predictions are made on the test set, and results are printed alongside actual values for comparison.
Confusion Matrix and Accuracy:
The confusion matrix provides detailed performance metrics, while the accuracy score summarizes the overall performance.
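Beyond a single accuracy number, it is often worth looking at per-class precision and recall. A minimal sketch, reusing the y_test and y_pred arrays from the code above:
from sklearn.metrics import classification_report

# Precision, recall, and F1-score for each class (0 = no purchase, 1 = purchase)
print(classification_report(y_test, y_pred))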