Decision Tree Intuition
A decision tree is a model used for both classification and regression tasks, hence the name CART (Classification and Regression Trees).
How It Works
Decision trees work by splitting the data into subsets.
The algorithm determines the best splits to maximize homogeneity within subsets, i.e., it tries to group similar categories (e.g., maximizing the presence of a specific class in a split).
This is achieved by choosing, at each split, the feature and threshold that most reduce an impurity measure such as entropy or Gini impurity.
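To make the idea of minimizing entropy concrete, here is a minimal sketch that computes the entropy of a set of labels and the information gain of a candidate split. The helper names `entropy` and `information_gain` and the toy labels are assumptions made for illustration; they are not part of Scikit-learn.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Hypothetical toy labels: four samples of class 0 and four of class 1
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# A split that separates the two classes perfectly
pure_gain = information_gain(parent, parent[:4], parent[4:])

# A split whose children are exactly as mixed as the parent
mixed_gain = information_gain(parent, np.array([0, 0, 1, 1]), np.array([0, 0, 1, 1]))

print(pure_gain, mixed_gain)
```

A split with a higher information gain leaves purer subsets, so the tree-building procedure prefers the first split over the second.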
Key Insights
The algorithm selects splits so that each resulting subset is dominated by a single category (for example, a subset made up mostly of the "red characters" class).
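Implementation Using Scikit-learn
The DecisionTreeClassifier from the Scikit-learn library is commonly used for decision trees. By default, it uses the Gini criterion; here we set criterion='entropy' so that splits are chosen by entropy reduction (information gain), matching the description above.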
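As a rough illustration of the two criteria, the sketch below checks the classifier's default criterion and compares Gini impurity with entropy for a few class distributions. The helpers `gini` and `entropy_bits` are hypothetical names written for this example, not Scikit-learn functions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def gini(probs):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    probs = np.asarray(probs, dtype=float)
    return float(1.0 - np.sum(probs ** 2))

def entropy_bits(probs):
    """Shannon entropy in bits, skipping zero-probability classes."""
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]
    return float(-np.sum(probs * np.log2(probs)))

default_tree = DecisionTreeClassifier()
entropy_tree = DecisionTreeClassifier(criterion='entropy', random_state=0)
print(default_tree.criterion, entropy_tree.criterion)

# Both measures are zero for a pure node and largest for an even mix
for dist in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(dist, gini(dist), entropy_bits(dist))
```

Both measures rank a pure node as best and an evenly mixed node as worst, which is why the two criteria often produce similar trees.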
Code Breakdown
1. Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
2. Loading and Preparing the Dataset
# Load dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
# Separate features (X) and target variable (y)
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
3. Splitting Data into Train and Test Sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
5. Training the Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier
# Use 'entropy' as the criterion
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)
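To see what a fitted tree has learned, one option is to print its rules as text. The sketch below uses a hypothetical one-feature toy dataset (not Social_Network_Ads.csv) so that it runs on its own; `export_text`, `get_depth` and `get_n_leaves` are existing Scikit-learn utilities.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: one feature, class 1 whenever the value is above 5
x_toy = np.array([[1], [2], [3], [4], [6], [7], [8], [9]])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

toy_tree = DecisionTreeClassifier(criterion='entropy', random_state=0)
toy_tree.fit(x_toy, y_toy)

# Print the learned decision rules and the size of the tree
print(export_text(toy_tree, feature_names=['feature_0']))
print(toy_tree.get_depth(), toy_tree.get_n_leaves())
```

Because a single threshold separates the two classes here, one split is enough and both leaves are pure.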
6. Predicting Test Results
y_pred = classifier.predict(x_test)
# Show predictions (left column) next to the actual labels (right column)
print(np.column_stack((y_pred, y_test)))
Summary
Decision trees split the data recursively, choosing each split to reduce entropy and thereby increase homogeneity within the resulting subsets.
Scikit-learn’s DecisionTreeClassifier offers flexibility with criteria like Gini or Entropy.
The model's performance can be evaluated using tools like confusion matrix and accuracy score.
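The evaluation step mentioned above can be sketched as follows. The label arrays are hypothetical stand-ins for y_test and y_pred so that the example runs without the dataset; `confusion_matrix` and `accuracy_score` come from sklearn.metrics.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical labels standing in for y_test and y_pred
y_true_demo = np.array([0, 0, 0, 1, 1, 1, 0, 1])
y_pred_demo = np.array([0, 0, 1, 1, 1, 0, 0, 1])

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)

# Fraction of predictions that match the actual labels
acc = accuracy_score(y_true_demo, y_pred_demo)

print(cm)
print(acc)
```

With the real model, the same two calls would take y_test and y_pred as arguments.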
Code Explanation
1. Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pandas: Used for data manipulation and analysis. We use it to load and process the dataset.
numpy: Provides support for numerical computations and array manipulations.
matplotlib.pyplot: A plotting library used for data visualization.
2. Loading and Preparing the Dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
x = dataset.iloc[:, :-1].values # Select all columns except the last one as features
y = dataset.iloc[:, -1].values # Select the last column as the target variable
pd.read_csv(): Reads the CSV file and loads it as a DataFrame.
x: Represents the features (independent variables) from the dataset.
y: Represents the target variable (dependent variable) that the model will predict.
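The iloc slicing can be checked on a small stand-in table. The DataFrame below is a hypothetical example with the target in the last column; its column names and values are assumptions, not the contents of Social_Network_Ads.csv.

```python
import pandas as pd

# Hypothetical stand-in for the CSV: two feature columns, target column last
toy = pd.DataFrame({
    'Age': [19, 35, 26, 47],
    'EstimatedSalary': [19000, 20000, 43000, 25000],
    'Purchased': [0, 0, 0, 1],
})

x = toy.iloc[:, :-1].values  # every column except the last, as a 2-D array
y = toy.iloc[:, -1].values   # only the last column, as a 1-D array

print(x.shape, y.shape)
```

This slicing relies on the target being the last column; if the file is laid out differently, the column selection has to change accordingly.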
3. Splitting Data into Train and Test Sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
train_test_split: Splits the dataset into two parts:
Training set: Used to train the model (80% of data here).
Test set: Used to evaluate the model's performance (20% of data here).
Parameters:
test_size=0.2: 20% of the data is allocated for testing.
random_state=0: Ensures reproducibility by fixing the random splitting.
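The effect of these two parameters can be verified on a small synthetic array. The sketch below uses hypothetical data (10 samples) and checks the resulting set sizes as well as the reproducibility that a fixed random_state provides.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each, alternating labels
x_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10) % 2

split_a = train_test_split(x_demo, y_demo, test_size=0.2, random_state=0)
split_b = train_test_split(x_demo, y_demo, test_size=0.2, random_state=0)

x_train_demo, x_test_demo, y_train_demo, y_test_demo = split_a
print(x_train_demo.shape, x_test_demo.shape)

# The same random_state yields the same partition on every call
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
```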
4. Feature Scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train) # Fit and transform the training set
x_test = scaler.transform(x_test) # Transform the test set
Why Feature Scaling?
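Decision trees do not actually require feature scaling: each split compares a single feature against a threshold, so rescaling a feature shifts the threshold without changing the resulting partition. Scaling matters for distance-based or gradient-based models (e.g., k-NN, SVM, logistic regression), where a feature with a larger range (e.g., income in thousands vs age in years) can otherwise dominate. Applying it here is harmless and keeps the preprocessing reusable if the classifier is later swapped for one of those models.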
StandardScaler:
Standardizes each feature by subtracting its mean and dividing by its standard deviation, so that it has zero mean and unit variance on the training set.
fit_transform(): Computes the mean and standard deviation of the training data, then scales it.
transform(): Scales the test data using the same parameters computed from the training set.
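As a quick check of what fit_transform() and transform() compute, the sketch below compares StandardScaler against the formula applied by hand. The arrays are hypothetical values chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: columns could stand for age and salary
train_demo = np.array([[20.0, 20000.0], [30.0, 40000.0], [40.0, 60000.0]])
test_demo = np.array([[30.0, 80000.0]])

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train_demo)  # learns mean and std from the training data
test_scaled = scaler_demo.transform(test_demo)        # reuses the training mean and std

# Manual version: (value - training mean) / training standard deviation
mean = train_demo.mean(axis=0)
std = train_demo.std(axis=0)
manual_train = (train_demo - mean) / std
manual_test = (test_demo - mean) / std

print(np.allclose(train_scaled, manual_train), np.allclose(test_scaled, manual_test))
```

Fitting only on the training set keeps information about the test set out of the preprocessing step.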