Decision Tree Regression: A Complete Guide

Introduction

A decision tree is a supervised machine learning algorithm used for both classification and regression tasks. It divides the dataset into smaller subsets based on decision rules derived from the data features. This guide focuses on Decision Tree Regression, where the goal is to predict a continuous numeric target.

1. Decision Tree Intuition

What is a Decision Tree?

A decision tree splits the data into regions by asking sequential questions based on the input features.
For regression, the model predicts the average of the dependent variable (target) within each region.
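To make the "average within each region" idea concrete, here is a minimal sketch on a hypothetical six-point dataset (x_toy and y_toy are invented for illustration and are not the Position Salaries data). A tree limited to depth 1 asks a single question, so it produces exactly two regions, and each prediction should match the mean of the training targets in that region.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: one feature, six samples (not the guide's dataset)
x_toy = np.array([[1], [2], [3], [10], [11], [12]])
y_toy = np.array([10.0, 12.0, 14.0, 50.0, 52.0, 54.0])

# A depth-1 tree asks a single question, so it creates exactly two regions
stump = DecisionTreeRegressor(max_depth=1, random_state=0)
stump.fit(x_toy, y_toy)

# The learned question at the root is "x <= threshold?"
threshold = stump.tree_.threshold[0]

# Mean of the training targets on each side of the threshold
left_mean = y_toy[x_toy[:, 0] <= threshold].mean()
right_mean = y_toy[x_toy[:, 0] > threshold].mean()

print("Split threshold:", threshold)
print("Prediction at x=2: ", stump.predict([[2]])[0], "| left mean: ", left_mean)
print("Prediction at x=11:", stump.predict([[11]])[0], "| right mean:", right_mean)
```

Deeper trees repeat the same step inside each region, so every leaf still predicts the mean of the samples that end up in it.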

Types of Decision Trees

Classification Trees:
Used to predict categorical outcomes (e.g., spam vs. not spam).
Regression Trees:
Used to predict continuous outcomes (e.g., salary, house price).

Advantages of Decision Trees

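Works with both numerical and categorical data, although scikit-learn's implementation requires categorical features to be encoded as numbers first.
No need for feature scaling, because splits depend only on the ordering of feature values (unlike distance- or margin-based algorithms such as SVM).
Ranks features implicitly: each split picks the most informative feature, so irrelevant features are used rarely or not at all.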
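The claim about feature scaling can be checked directly. The sketch below uses hypothetical toy data (levels 1 to 10 with squared targets, not the guide's dataset), trains one tree on the raw feature and one on the same feature multiplied by 1000, and compares predictions at matching query points. The scaling factor and query values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data (not the guide's dataset)
x_raw = np.arange(1, 11, dtype=float).reshape(-1, 1)
y_toy = x_raw[:, 0] ** 2

# Rescale the feature by a large factor, as if its unit had changed
x_scaled = x_raw * 1000.0

tree_raw = DecisionTreeRegressor(random_state=0).fit(x_raw, y_toy)
tree_scaled = DecisionTreeRegressor(random_state=0).fit(x_scaled, y_toy)

# Query the same points, expressed in each unit system
x_query = np.array([[1.2], [4.7], [8.8]])
pred_raw = tree_raw.predict(x_query)
pred_scaled = tree_scaled.predict(x_query * 1000.0)

print("Raw feature predictions:   ", pred_raw)
print("Scaled feature predictions:", pred_scaled)
print("Identical:", np.allclose(pred_raw, pred_scaled))
```

Rescaling moves the split thresholds but leaves the ordering of the samples unchanged, so both trees partition the data the same way.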

Limitations

Prone to overfitting: An unrestricted tree keeps splitting until it memorizes the training data, noise included.
Unstable: Small changes in the training data can produce a very different tree.
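The overfitting limitation can be seen in a small experiment. This is a sketch on hypothetical synthetic data (a noisy sine curve, not the guide's dataset). It compares an unrestricted tree with one capped at max_depth=3, which is an arbitrary illustrative value. The exact scores depend on the random seed, so treat the printed numbers as indicative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy data (not the guide's dataset): a sine curve plus noise
rng = np.random.default_rng(0)
x_noisy = rng.uniform(0, 10, size=(200, 1))
y_noisy = np.sin(x_noisy[:, 0]) + rng.normal(0, 0.3, size=200)

x_train, x_test, y_train, y_test = train_test_split(
    x_noisy, y_noisy, test_size=0.25, random_state=0
)

# Unrestricted tree: keeps splitting until the leaves cannot be split further
deep_tree = DecisionTreeRegressor(random_state=0).fit(x_train, y_train)

# Depth-limited tree: max_depth=3 is an arbitrary illustrative choice
shallow_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(x_train, y_train)

for name, model in [("unrestricted", deep_tree), ("max_depth=3", shallow_tree)]:
    print(f"{name}: leaves={model.get_n_leaves()}, "
          f"train R^2={model.score(x_train, y_train):.3f}, "
          f"test R^2={model.score(x_test, y_test):.3f}")
```

A training score near 1.0 combined with a clearly lower test score is the usual sign that the tree has memorized noise.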

2. Dataset Overview

We will use the Position Salaries dataset, where:

Independent Variable (x): Position level (e.g., level 1, level 2).
Dependent Variable (y): Salary corresponding to each position.

3. Implementation

Step 1: Importing Libraries

# Essential libraries for data processing and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Step 2: Loading and Preparing the Data

# Load the dataset
dataset = pd.read_csv('Position_Salaries.csv')

# Extract independent and dependent variables
x = dataset.iloc[:, 1:-1].values  # Position levels (independent variable)
y = dataset.iloc[:, -1].values    # Salaries (dependent variable)

# Display the data
print("Independent Variable (x):\n", x)
print("Dependent Variable (y):\n", y)

Step 3: Training the Decision Tree Regressor

from sklearn.tree import DecisionTreeRegressor

# Initialize the Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=0)

# Train the model on the entire dataset
regressor.fit(x, y)

Step 4: Making Predictions

# Predict the salary for a specific position level (e.g., 6.5)
predicted_salary = regressor.predict([[6.5]])
print(f"Predicted Salary for Position Level 6.5: {predicted_salary[0]}")

Why use [[6.5]]?
The regressor.predict method expects a 2D array of shape (n_samples, n_features). The double brackets in [[6.5]] create one sample with one feature, i.e. shape (1, 1).
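The shape requirement is easier to see with a few inputs side by side. The sketch below trains on hypothetical toy data (levels 1 to 10 with squared targets, not the guide's dataset); the names model, single, and levels are illustrative only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data (not the guide's dataset)
x_toy = np.arange(1, 11).reshape(-1, 1)   # shape (10, 1): 10 samples, 1 feature
y_toy = x_toy[:, 0] ** 2

model = DecisionTreeRegressor(random_state=0).fit(x_toy, y_toy)

# One sample, one feature: shape (1, 1)
single = np.array([[6.5]])
print(single.shape, model.predict(single))

# Several samples at once: reshape a flat array into a single column
levels = np.array([2.5, 6.5, 9.5]).reshape(-1, 1)   # shape (3, 1)
print(levels.shape, model.predict(levels))

# A flat 1D input is rejected with a ValueError
try:
    model.predict([6.5])
except ValueError as err:
    print("1D input rejected:", type(err).__name__)
```

The reshape(-1, 1) idiom is the same one used for x_grid in the visualization step: it turns a flat array into one column with one row per sample.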

Step 5: Visualizing the Results

To see the model's predictions and understand how it fits the data:

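# Create a high-resolution grid for visualization (step of 0.01 between levels)
# x.min() / x.max() return scalars; the built-in min(x) / max(x) return 1-element arrays
x_grid = np.arange(x.min(), x.max(), 0.01).reshape(-1, 1)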

# Scatter plot of the original data
plt.scatter(x, y, color='red', label='Actual Data')

# Line plot of decision tree predictions
plt.plot(x_grid, regressor.predict(x_grid), color='blue', label='Model Prediction')

# Add titles and labels
plt.title('Truth or Bluff (Decision Tree Regression)')
plt.xlabel('Position Level')
plt.ylabel('Salary')
plt.legend()
plt.show()
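The plot produced by this step looks like a staircase rather than a smooth curve, because a tree can only output one value per leaf. The following sketch checks this numerically on hypothetical toy data (levels 1 to 10 with cubed targets, not the guide's dataset), using the same kind of dense grid as the plot.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data (not the guide's dataset): levels 1..10
x_toy = np.arange(1, 11).reshape(-1, 1)
y_toy = (x_toy[:, 0] ** 3).astype(float)

model = DecisionTreeRegressor(random_state=0).fit(x_toy, y_toy)

# Dense grid with a step of 0.01, like the grid used for plotting
x_dense = np.arange(x_toy.min(), x_toy.max(), 0.01).reshape(-1, 1)
y_dense = model.predict(x_dense)

# The curve can take at most as many distinct values as the tree has leaves
n_levels = np.unique(y_dense).size
print("Leaves:", model.get_n_leaves())
print("Distinct predicted values on the grid:", n_levels)

# Grid positions where the prediction jumps (the vertical steps in the plot)
jumps = x_dense[1:, 0][np.diff(y_dense) != 0]
print("Jump locations:", np.round(jumps, 2))
```

This is also why a fine grid matters: plotting only at the original levels would connect the points with sloped lines and hide the steps.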

4. Complete Code

For clarity, here’s the complete code in one place:

# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor

# Load the dataset
dataset = pd.read_csv('Position_Salaries.csv')
x = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values

# Train the Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=0)
regressor.fit(x, y)

# Make a prediction
predicted_salary = regressor.predict([[6.5]])
print(f"Predicted Salary for Position Level 6.5: {predicted_salary[0]}")

# Visualize the results
x_grid = np.arange(x.min(), x.max(), 0.01).reshape(-1, 1)
plt.scatter(x, y, color='red', label='Actual Data')
plt.plot(x_grid, regressor.predict(x_grid), color='blue', label='Model Prediction')
plt.title('Truth or Bluff (Decision Tree Regression)')
plt.xlabel('Position Level')
plt.ylabel('Salary')
plt.legend()
plt.show()

5. Key Takeaways

Feature Scaling: Not needed for decision trees.
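Encoding: Encode categorical features numerically before training, for example with OneHotEncoder or OrdinalEncoder (LabelEncoder is intended for target labels, not input features).
Overfitting: Limit the tree's growth (for example with max_depth or min_samples_leaf) or prune it (ccp_alpha) if it fits the training data almost perfectly but generalizes poorly.
Performance: Decision trees train quickly and are easy to interpret, but with a single feature, as here, the prediction is a step function rather than a smooth curve.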
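The encoding and overfitting takeaways can be combined in one short sketch. Everything here is hypothetical: the department, level, and salary columns are invented for illustration, and max_depth=3 is an arbitrary cap. The categorical column is one-hot encoded inside a pipeline, so the same encoding is applied automatically at prediction time.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data with one categorical and one numeric feature
data = pd.DataFrame({
    "department": ["sales", "engineering", "sales", "hr", "engineering", "hr"],
    "level": [1, 2, 3, 1, 4, 2],
    "salary": [40000, 70000, 60000, 38000, 110000, 45000],
})

# One-hot encode the categorical column; pass the numeric column through unchanged
preprocess = ColumnTransformer(
    transformers=[("dept", OneHotEncoder(handle_unknown="ignore"), ["department"])],
    remainder="passthrough",
)

# max_depth=3 is an arbitrary cap that keeps the tree small
model = make_pipeline(preprocess, DecisionTreeRegressor(max_depth=3, random_state=0))
model.fit(data[["department", "level"]], data["salary"])

# Predict for new rows using the same column names
new_rows = pd.DataFrame({"department": ["sales", "hr"], "level": [2, 3]})
predictions = model.predict(new_rows)
print(predictions)
```

Keeping the encoder and the tree in one pipeline avoids the common mistake of encoding training and prediction data differently.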