Decision Tree Intuition

nishan thapa

Jan 24, 2025

A decision tree is a model that handles both classification and regression tasks, which is why the approach is often referred to as CART (Classification and Regression Trees).
How It Works
Decision trees work by recursively splitting the data into subsets based on feature values.
At each node, the algorithm chooses the split that makes the resulting subsets as homogeneous as possible, i.e., each subset should contain mostly one class.
It does so by minimizing an impurity measure, such as entropy, over the subsets each candidate split produces.
Key Insights
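The algorithm selects each split so that the resulting subsets are as pure as possible, for example a subset made up almost entirely of the "red" category.
Entropy measures disorder: the lower the entropy of the subsets after a split, the better the split.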
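
To make the entropy idea concrete, here is a minimal sketch in plain Python. It assumes the usual Shannon entropy in bits and defines information gain as the parent's entropy minus the size-weighted entropy of the two children. The function names and the toy "red"/"green" labels are illustrative only and do not come from the dataset used later.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


# Hypothetical toy labels: 4 "red" and 4 "green" points.
parent = ["red"] * 4 + ["green"] * 4

# A split that separates the two classes perfectly.
pure_gain = information_gain(parent, ["red"] * 4, ["green"] * 4)

# A split that leaves both children as mixed as the parent.
mixed_gain = information_gain(
    parent, ["red", "red", "green", "green"], ["red", "red", "green", "green"]
)

print("parent entropy:", entropy(parent))
print("gain of pure split:", pure_gain)
print("gain of mixed split:", mixed_gain)
```

The perfectly separating split removes all of the parent's entropy, while the mixed split removes none, which is why the tree would prefer the first one.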
Implementation Using Scikit-learn
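The DecisionTreeClassifier class from the Scikit-learn library is commonly used to build decision trees. By default it uses the Gini criterion; here we set criterion='entropy' so that splits are chosen by entropy reduction, matching the intuition above. Both criteria measure impurity and usually produce similar trees.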
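
As a rough illustration of how the two criteria relate, the sketch below evaluates Gini impurity and entropy for a two-class node at a few class proportions. The helper functions are written here for illustration and are not part of Scikit-learn.

```python
import math


def gini(proportions):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in proportions)


def entropy(proportions):
    """Shannon entropy in bits; classes with zero proportion contribute nothing."""
    return sum(-p * math.log2(p) for p in proportions if p > 0)


# Class proportions for a two-class node, from pure to perfectly mixed.
for p in (0.0, 0.1, 0.3, 0.5):
    dist = (p, 1.0 - p)
    print(f"p={p:.1f}  gini={gini(dist):.3f}  entropy={entropy(dist):.3f}")
```

Both measures are zero for a pure node and peak at a 50/50 mix, so they tend to rank candidate splits in a similar order.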

Code Breakdown
1. Importing Libraries
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```
2. Loading and Preparing the Dataset
```python
# Load dataset
dataset = pd.read_csv('Social_Network_Ads.csv')

# Separate features (X) and target variable (y)
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
```
3. Splitting Data into Train and Test Sets
```python
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
```
4. Feature Scaling
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
```
5. Training the Decision Tree Classifier
```python
from sklearn.tree import DecisionTreeClassifier

# Use 'entropy' as the criterion
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)
```
6. Predicting Test Results
```python
y_pred = classifier.predict(x_test)

# Compare predictions with actual results (left column: predicted, right column: actual)
print(np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), axis=1))
```
7. Evaluating Performance
```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

# Accuracy Score
print("Accuracy Score:", accuracy_score(y_test, y_pred))
```
Summary
Decision trees split data iteratively to minimize entropy and maximize homogeneity.
Scikit-learn’s DecisionTreeClassifier offers flexibility with criteria like Gini or Entropy.
The model's performance can be evaluated with a confusion matrix and an accuracy score.

Code Explanation
1. Importing Libraries
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```
pandas: Used for data manipulation and analysis. We use it to load and process the dataset.
numpy: Provides support for numerical computations and array manipulations.
matplotlib.pyplot: A plotting library used for data visualization. It is imported here but not used in the steps shown in this walkthrough.
2. Loading and Preparing the Dataset
```python
dataset = pd.read_csv('Social_Network_Ads.csv')
x = dataset.iloc[:, :-1].values  # Select all columns except the last one as features
y = dataset.iloc[:, -1].values   # Select the last column as the target variable
```
pd.read_csv(): Reads the CSV file and loads it as a DataFrame.
x: Represents the features (independent variables) from the dataset.
y: Represents the target variable (dependent variable) that the model will predict.
3. Splitting Data into Train and Test Sets
```python
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
```
train_test_split: Splits the dataset into two parts:
Training set: Used to train the model (80% of data here).
Test set: Used to evaluate the model's performance (20% of data here).
Parameters:
test_size=0.2: 20% of the data is allocated for testing.
random_state=0: Ensures reproducibility by fixing the random splitting.
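
The effect of test_size and a fixed seed can be sketched in plain Python. This is only an illustration of the idea (shuffle with a seed, then cut off a fraction). The helper simple_split is hypothetical, and Scikit-learn's train_test_split uses its own random number generator, so it would select different rows.

```python
import math
import random


def simple_split(rows, test_size=0.2, seed=0):
    """Shuffle row indices with a fixed seed, then cut off a test fraction.

    Illustrative only: this does not reproduce the exact rows that
    scikit-learn's train_test_split would pick.
    """
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(rows) * test_size)
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(10))  # stand-in for 10 dataset rows

train_a, test_a = simple_split(rows, test_size=0.2, seed=0)
train_b, test_b = simple_split(rows, test_size=0.2, seed=0)

print(len(train_a), len(test_a))           # sizes of the two parts
print("same split twice:", test_a == test_b)
```

Running the split twice with the same seed gives identical train and test sets, which is what makes results comparable between runs.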
4. Feature Scaling
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)  # Fit and transform the training set
x_test = scaler.transform(x_test)        # Transform the test set
```
Why Feature Scaling?
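Decision trees themselves do not need feature scaling: each split compares a single feature against a threshold, so rescaling a feature shifts the threshold but leaves the resulting partition unchanged. Scaling matters for distance-based or gradient-based models (e.g., k-NN, SVM, logistic regression), where a feature on a larger scale (e.g., income in thousands vs age in years) would otherwise dominate. Keeping the step here is harmless and lets the same preprocessing be reused with such models.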
StandardScaler:
Standardizes each feature by subtracting its mean and dividing by its standard deviation, so the scaled training features have zero mean and unit variance.
fit_transform(): Computes the mean and standard deviation of the training data, then scales it.
transform(): Scales the test data using the same parameters computed from the training set.
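
The fit-on-train, transform-on-test pattern can be sketched for a single feature in plain Python. The sketch assumes the population standard deviation (which is what StandardScaler uses). The helper names and the age values are made up for illustration.

```python
import statistics


def fit_scaler(values):
    """Return the mean and population standard deviation of one feature."""
    return statistics.fmean(values), statistics.pstdev(values)


def transform(values, mean, std):
    """Standardize values with previously fitted parameters."""
    return [(v - mean) / std for v in values]


# Hypothetical ages; not taken from Social_Network_Ads.csv.
train_age = [20.0, 30.0, 40.0, 50.0]
test_age = [35.0, 60.0]

mean, std = fit_scaler(train_age)              # learned from training data only
train_scaled = transform(train_age, mean, std)
test_scaled = transform(test_age, mean, std)   # reuses the training parameters

print("mean, std:", mean, std)
print("train:", train_scaled)
print("test:", test_scaled)
```

The test values are scaled with the training mean and standard deviation, so no information from the test set leaks into preprocessing.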
5. Training the Decision Tree Classifier
```python
from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)
```
DecisionTreeClassifier:
A Scikit-learn class used to create decision tree models.
criterion='entropy': Specifies the splitting criterion to minimize entropy (measure of impurity). Another option is gini (default).
random_state=0: Ensures reproducibility.
fit(x_train, y_train):
Trains the decision tree using the training dataset.
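
Conceptually, training repeats one step at every node: search for the threshold that gives the lowest weighted entropy. The sketch below does this for a single feature. It is a simplified illustration, not Scikit-learn's implementation, which also searches across all features and recurses into the children. The function best_threshold and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Try midpoints between sorted unique values; keep the lowest weighted entropy."""
    unique = sorted(set(values))
    n = len(labels)
    best = (None, float("inf"))
    for lo, hi in zip(unique, unique[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical single feature (age) with a binary purchase label.
ages = [22, 25, 31, 38, 45, 52]
bought = [0, 0, 0, 1, 1, 1]

threshold, score = best_threshold(ages, bought)
print("best threshold:", threshold)
print("weighted entropy after split:", score)
```

Here the midpoint between 31 and 38 separates the two classes completely, so the weighted entropy after the split drops to zero.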
6. Predicting Test Results
```python
y_pred = classifier.predict(x_test)
```
predict():
Predicts the target variable (y) for the test features (x_test) based on the trained decision tree.
y_pred:
An array containing the model’s predicted values for the test set.
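
Prediction amounts to walking each row from the root of the tree to a leaf. The sketch below uses a hand-written tree stored as nested dictionaries. The structure, thresholds, and rows are invented for illustration and do not reflect how Scikit-learn stores a fitted tree internally.

```python
# A hypothetical fitted tree: internal nodes hold a feature index and a
# threshold, leaves hold the predicted class.
tree = {
    "feature": 0, "threshold": 34.5,
    "left": {"leaf": 0},
    "right": {
        "feature": 1, "threshold": 40000.0,
        "left": {"leaf": 0},
        "right": {"leaf": 1},
    },
}


def predict_one(node, row):
    """Walk from the root to a leaf, going left when the feature is <= threshold."""
    while "leaf" not in node:
        branch = "left" if row[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]


# Made-up rows in the form [age, salary].
rows = [[25, 60000.0], [45, 30000.0], [45, 90000.0]]
predictions = [predict_one(tree, row) for row in rows]
print(predictions)
```

Each prediction costs only as many comparisons as the depth of the path it follows.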
7. Evaluating Performance
```python
from sklearn.metrics import confusion_matrix, accuracy_score

cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)
print("Accuracy Score:", accuracy_score(y_test, y_pred))
```
confusion_matrix():
Compares the predicted values (y_pred) with the actual test labels (y_test) to create a matrix:
True Positives (TP): Correctly predicted positives.
True Negatives (TN): Correctly predicted negatives.
False Positives (FP): Incorrectly predicted as positive.
False Negatives (FN): Incorrectly predicted as negative.
Layout of the confusion matrix for binary labels 0 and 1 (rows are actual classes, columns are predicted classes):
```
[[TN FP]
 [FN TP]]
```
accuracy_score():
Calculates the proportion of correctly predicted values:
$$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{Total Predictions}}$$
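
The four counts and the accuracy can be worked through by hand on a small example. The labels below are hypothetical and are not results from the model trained above. The helper confusion_counts is written for illustration and assumes binary labels with 1 as the positive class.

```python
def confusion_counts(y_true, y_pred):
    """Count TN, FP, FN, TP for binary labels where 1 is the positive class."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 0:
            tn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        else:
            fn += 1
    return tn, fp, fn, tp


# Hypothetical actual and predicted labels.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_hat = [0, 1, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_hat)
accuracy = (tp + tn) / len(y_true)

print([[tn, fp], [fn, tp]])
print("accuracy:", accuracy)
```

With 3 true negatives and 3 true positives out of 8 predictions, the accuracy comes to 6/8 = 0.75.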
