Multiple Linear Regression Notes
Multiple linear regression is used when the dependent variable (y) is continuous and there are multiple independent variables (x1, x2, ..., xn). This approach models the relationship between y and the predictors as:
y = b_0 + b_1x_1 + b_2x_2 + b_3x_3 + ... + b_nx_n
Here, we discuss:
Assumptions of multiple linear regression.
Handling categorical variables using dummy variables.
Building regression models with statistical significance methods.
Assumptions of Multiple Linear Regression
Linearity: Relationship between y and each x is linear.
Homoscedasticity: Equal variance of errors.
Multivariate Normality: Errors are normally distributed.
Independence: Observations are independent (no autocorrelation).
Lack of Multicollinearity: Predictors are not highly correlated with one another (a quick check is sketched after this list).
Outlier Check: While not an assumption, identifying and managing outliers improves model accuracy.
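As a quick multicollinearity check, one option is the variance inflation factor (VIF) from statsmodels. This is a minimal sketch, not part of the original note; the variable name X and the toy data are illustrative stand-ins for your own predictor matrix.
# Multicollinearity check: variance inflation factor (VIF) per predictor
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
# X is an illustrative numeric predictor matrix (replace with your own features)
X = np.array([[1.0, 2.0], [2.0, 1.5], [3.0, 4.0], [4.0, 3.5], [5.0, 6.0]])
X_const = sm.add_constant(X)  # include an intercept column before computing VIFs
for i in range(1, X_const.shape[1]):  # skip the intercept at index 0
    print(f"feature {i}: VIF = {variance_inflation_factor(X_const, i):.2f}")
# Rule of thumb: a VIF above ~10 signals problematic multicollinearity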
Handling Categorical Variables
Categorical variables cannot be used directly in regression models. We encode categories as dummy variables, i.e., binary indicators (0 or 1):
For each category, a column is created.
Only n − 1 dummy variables are used to avoid the dummy variable trap, where multicollinearity arises if all dummy variables are included.
Example:
A "State" column with values [New York, California, Florida][\text{New York, California, Florida}] would be encoded as two dummy variables: D1D_1 (New York) and D2D_2 (California).
Statistical Significance: Building Models
p-value
The p-value is the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true. A smaller p-value indicates stronger evidence against the null hypothesis (a common threshold is p < 0.05).
Model Building Methods
All-In:
Include all variables if prior knowledge or constraints necessitate it.
Backward Elimination:
Start with all predictors and iteratively remove the one with the highest p-value above the significance level (e.g., 5%), refitting after each removal until every remaining predictor is significant (a sketch follows this list).
Forward Selection:
Start with no predictors and repeatedly add the variable with the lowest p-value, stopping once no remaining candidate is significant.
Bidirectional Elimination:
Combine forward selection and backward elimination using separate significance levels for adding and removing variables.
Score Comparison:
Evaluate all possible models using a goodness-of-fit criterion (e.g., Akaike Information Criterion).
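Backward elimination can be sketched with statsmodels, whose OLS results expose one p-value per coefficient. The helper below is illustrative, not from the note; it assumes X is a numeric feature matrix, y is the target, and uses the 5% significance level mentioned above.
# Backward elimination sketch: drop the least significant predictor until all are significant
import numpy as np
import statsmodels.api as sm

def backward_elimination(X, y, significance_level=0.05):
    X = sm.add_constant(X)                            # column 0 is the intercept
    while X.shape[1] > 1:
        model = sm.OLS(y, X).fit()
        worst = int(np.argmax(model.pvalues[1:])) + 1  # never consider dropping the intercept
        if model.pvalues[worst] <= significance_level:
            break                                     # all remaining predictors are significant
        X = np.delete(X, worst, axis=1)               # drop the least significant column and refit
    return sm.OLS(y, X).fit()
For example, on a numeric feature matrix (with one dummy column already removed to avoid the trap), backward_elimination(features, target).summary() would print the reduced model's coefficients and p-values.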
Implementation in Python
# Importing libraries
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Loading the dataset
dataset = pd.read_csv('50_Startups.csv')
x = dataset.iloc[:, :-1].values  # all columns except the last (features)
y = dataset.iloc[:, -1].values   # last column (the Profit target)
# Encoding categorical data (column index 3 is the categorical State column)
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
x = np.array(ct.fit_transform(x))
# Splitting the dataset
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
# Training the model
regressor = LinearRegression()
regressor.fit(x_train, y_train)
# Predicting results
y_pred = regressor.predict(x_test)
# Displaying predictions vs actual values
np.set_printoptions(precision=3)
print(np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), axis=1))
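To quantify fit on the test set, one option (not in the original note) is the coefficient of determination from scikit-learn:
# Evaluating test-set fit with R^2 (1.0 is a perfect fit)
from sklearn.metrics import r2_score
print(r2_score(y_test, y_pred))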
Key Points
Feature Scaling: Not required for multiple linear regression in scikit-learn.
Dummy Variable Trap: You do not need to drop a dummy column manually; scikit-learn's LinearRegression handles the redundant column.
Feature Selection: LinearRegression fits all of the features it is given; if you need a reduced model, apply a method such as backward elimination separately.
Visualization: With many features, results are compared as vectors rather than 2D plots (e.g., predicted vs. actual values); a small plotting sketch follows this list.
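A minimal matplotlib sketch of that comparison, assuming the y_test and y_pred arrays from the code above (the axis labels assume the 50_Startups profit target):
# Predicted vs. actual scatter plot; points on the diagonal are perfect predictions
import matplotlib.pyplot as plt
plt.scatter(y_test, y_pred)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()])  # reference diagonal
plt.xlabel('Actual profit')
plt.ylabel('Predicted profit')
plt.show()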
This structured note simplifies multiple linear regression, emphasizing key concepts and practical steps in Python.