Machine Learning Preprocessing

Machine Learning Preprocessing: Comprehensive Notes

Preprocessing turns raw data into a form that a model can be trained and evaluated on. These notes cover four steps: handling missing data, encoding categorical variables, splitting the dataset, and feature scaling. Each section explains the concept and shows the corresponding code:

1. Features and Dependent Variables

Features are the columns that help in making predictions (independent variables, x).
The dependent variable (y) is the target value we aim to predict.

We create x (features matrix) and y (dependent variable) during the data preprocessing phase:

import pandas as pd
import numpy as np

dataset = pd.read_csv('data.csv')
x = dataset.iloc[:, :-1].values  # Features matrix
y = dataset.iloc[:, -1].values  # Dependent variable
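
To see what the two iloc slices select without needing a data.csv file, here is a minimal sketch. The CSV text, its column names (Country, Age, Salary, Purchased), and its values are made up for illustration and are not the original dataset:

```python
import io
import pandas as pd

# Hypothetical stand-in for data.csv; columns and values are made up.
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
"""
dataset = pd.read_csv(io.StringIO(csv_text))

x = dataset.iloc[:, :-1].values  # every column except the last
y = dataset.iloc[:, -1].values   # only the last column

print(x.shape)  # (3, 3)
print(y)        # the three Purchased values
```

This layout assumes the target is the last column of the file. If it is not, select the columns by name instead.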

2. Handling Missing Data

Missing data can cause errors during training or distort predictions. To handle it, we can either delete the affected rows or replace the missing values. Here, we replace missing values in the numeric columns (indices 1 and 2, selected by the slice 1:3) with the column's mean using the SimpleImputer class from sklearn.impute:

from sklearn.impute import SimpleImputer

# Check for missing values
null = dataset.iloc[:, :-1].isnull().sum()
print(null)  # Displays the count of missing values in each column

# Replace missing values with column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])

Here:

fit() calculates the mean of the specified columns.
transform() replaces missing values with the calculated mean.
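
The split between fit() and transform() can be checked on a small array. This is a sketch on made-up numbers; the two columns stand in for numeric features such as age and salary:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up numeric columns with one gap in each.
values = np.array([
    [40.0, 60000.0],
    [np.nan, 50000.0],
    [20.0, np.nan],
])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(values)                 # learns one mean per column, ignoring NaN
print(imputer.statistics_)          # column means: 30.0 and 55000.0
filled = imputer.transform(values)  # NaN entries replaced by those means
print(filled)
```

The learned means are stored in statistics_, so the same imputer can later be applied to new data without recomputing them.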

3. Encoding Categorical Data

a. One-Hot Encoding

One-hot encoding transforms categorical data into binary vectors. For example, a "Country" column with values like ["France", "Spain", "Germany"] will be converted into three separate columns with binary values:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoding to the first column (e.g., Country)
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
x = np.array(ct.fit_transform(x))

The parameter remainder='passthrough' ensures that non-transformed columns are retained in the result.
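
A minimal sketch of the resulting layout, using made-up rows with a country name and one numeric feature. It assumes the categorical column is at index 0, as in the snippet above:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up rows: a country name followed by one numeric feature.
x_demo = np.array([
    ['France', 44.0],
    ['Spain', 27.0],
    ['Germany', 30.0],
    ['Spain', 38.0],
], dtype=object)

ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [0])],
    remainder='passthrough',
)
encoded = np.array(ct.fit_transform(x_demo))

# Categories are sorted alphabetically: France, Germany, Spain.
print(ct.named_transformers_['encoder'].categories_)
print(encoded)  # three one-hot columns first, then the passthrough column
```

The one-hot columns are placed first and the untouched columns follow, which is why later steps index the numeric features from column 3 onward.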

b. Label Encoding

Label encoding is applied to the dependent variable (y) to convert categorical values into numerical labels:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(y)
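
As a small illustration on made-up Yes/No labels, the sketch below shows which integer each class receives and how to map the integers back:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

y_demo = np.array(['No', 'Yes', 'No', 'Yes', 'Yes'])  # made-up labels

le = LabelEncoder()
y_encoded = le.fit_transform(y_demo)

print(le.classes_)  # sorted classes; a label's code is its index here
print(y_encoded)    # [0 1 0 1 1]
print(le.inverse_transform(y_encoded))  # back to the original strings
```

Classes are numbered in sorted order, so 'No' becomes 0 and 'Yes' becomes 1.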

4. Splitting the Dataset

Splitting the dataset into training and testing subsets lets the model be evaluated on data it did not see during training:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

Here:

test_size=0.2 reserves 20% of the dataset for testing.
random_state=1 fixes the seed used to shuffle the rows, so the same split is produced on every run.
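
Both parameters can be checked on a toy array. This is a sketch with 10 made-up samples, not the original dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

x_demo = np.arange(20).reshape(10, 2)  # 10 made-up samples, 2 features
y_demo = np.arange(10)

x_train, x_test, y_train, y_test = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=1
)
print(x_train.shape, x_test.shape)  # (8, 2) (2, 2)

# Repeating the call with the same seed gives the same split.
_, _, _, y_test_again = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=1
)
print(np.array_equal(y_test, y_test_again))  # True
```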

5. Feature Scaling

Feature scaling puts the independent variables on a comparable scale so that features with large numeric ranges do not dominate the model. Two common methods are:

Standardization: Transforms data to have a mean of 0 and a standard deviation of 1. Formula: $x' = \frac{x - \text{mean}(x)}{\text{std}(x)}$
Normalization: Scales data between 0 and 1. Formula: $x' = \frac{x - \text{min}(x)}{\text{max}(x) - \text{min}(x)}$
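
The two formulas can be worked through by hand in NumPy and compared with scikit-learn's scalers. This is a sketch on one made-up feature column. MinMaxScaler is not used elsewhere in these notes and appears here only as the scikit-learn counterpart of the normalization formula:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

col = np.array([[20.0], [30.0], [40.0], [50.0]])  # one made-up feature column

# Standardization: subtract the mean, divide by the population std (ddof=0).
standardized = (col - col.mean()) / col.std()
# Normalization: shift by the minimum, divide by the range.
normalized = (col - col.min()) / (col.max() - col.min())

print(np.allclose(standardized, StandardScaler().fit_transform(col)))  # True
print(np.allclose(normalized, MinMaxScaler().fit_transform(col)))      # True
```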

In this example, we use standardization with the StandardScaler class:

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
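# Columns 0-2 hold the one-hot columns (assuming three countries); scale only the numeric columns after them
x_train[:, 3:] = sc.fit_transform(x_train[:, 3:])  # Fit on training data, then scale it
x_test[:, 3:] = sc.transform(x_test[:, 3:])  # Scale test data with the training statistics (no fit here)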

print(x_test[:, 3:])  # Scaled testing data

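Note: Fit the scaler on the training data only, then apply that same fitted transformation to both the training and test sets. Fitting on the test set (or on the full dataset before splitting) leaks information about the test data into training.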
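
A small sketch of what this means in practice, using a made-up training column and a single made-up test value. The test value is scaled with the mean and standard deviation learned from the training column, not with its own statistics:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[10.0], [20.0], [30.0]])  # made-up training column
test = np.array([[40.0]])                   # made-up test column

sc = StandardScaler()
train_scaled = sc.fit_transform(train)  # mean and std learned from train only
test_scaled = sc.transform(test)        # reuses the training mean and std

print(sc.mean_)     # training mean: 20.0
print(test_scaled)  # (40 - 20) / std(train), roughly 2.45
```

Because the scaler never sees the test value during fit, the scaled test data can fall outside the range of the scaled training data, which is expected.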

Key Takeaways

Imputation fills missing values (here with the column mean) so that estimators which cannot handle NaN can be trained.
One-hot encoding and label encoding convert categorical variables into the numerical form most models require.
Holding out a test set gives an estimate of performance on data the model has not seen.
Feature scaling puts variables on a comparable scale so that features with large ranges do not dominate the model.

Complete Code

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

# Load dataset
dataset = pd.read_csv('data.csv')
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Handle missing data
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])

# Encode categorical data
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
x = np.array(ct.fit_transform(x))

le = LabelEncoder()
y = le.fit_transform(y)

# Split dataset
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

# Feature scaling
sc = StandardScaler()
x_train[:, 3:] = sc.fit_transform(x_train[:, 3:])
x_test[:, 3:] = sc.transform(x_test[:, 3:])

print(x_test[:, 3:])
