Hierarchical clustering is an unsupervised algorithm that groups data points into clusters based on their similarity. It is conceptually similar to k-means clustering but differs in its methodology; in some cases, the two methods produce the same clustering result.
Types of Hierarchical Clustering
Agglomerative (Bottom-Up Approach):
Starts with each data point as its own individual cluster.
Clusters are iteratively merged based on their similarity until there is only one cluster.
Divisive (Top-Down Approach):
Starts with all data points in a single cluster.
Splits clusters iteratively into smaller clusters until each data point forms its own cluster.
Distance Metric
Hierarchical clustering typically uses Euclidean distance to measure the distance between data points or clusters. Other metrics, such as Manhattan distance or cosine distance, can also be used but are less common in this context.
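As a quick illustration, here is a minimal NumPy sketch comparing Euclidean and Manhattan distances between two points; the feature values are invented for this example.

import numpy as np

# Two made-up customers described by two features
# (e.g., annual income and spending score).
a = np.array([15.0, 39.0])
b = np.array([16.0, 81.0])

euclidean = np.linalg.norm(a - b)   # straight-line distance
manhattan = np.abs(a - b).sum()     # sum of absolute coordinate differences
print(euclidean, manhattan)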
Steps in Agglomerative Hierarchical Clustering
Initialization:
Treat each data point as a separate cluster. At this stage, there are N clusters for N data points.
Merge Closest Clusters:
Identify the two closest clusters based on a distance metric (e.g., Euclidean distance) and merge them. This reduces the number of clusters to N−1.
Repeat:
Continue merging the two closest clusters iteratively until only one cluster remains.
Create a Dendrogram:
The sequence of merges is recorded in a dendrogram, a tree-like diagram that represents the hierarchy of clusters.
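For a concrete, if tiny, illustration of this procedure, the following sketch runs SciPy's agglomerative linkage on a handful of made-up 2-D points and prints the recorded merges; the data and the choice of Ward's method are assumptions for this example.

import numpy as np
from scipy.cluster.hierarchy import linkage

# Six made-up 2-D points; each starts as its own cluster.
rng = np.random.default_rng(0)
points = rng.random((6, 2))

# 'ward' merges the pair of clusters whose union has the smallest
# increase in within-cluster variance at each step.
merges = linkage(points, method='ward')

# Each row records one merge: the indices of the two clusters joined,
# the distance at which they were joined, and the new cluster's size.
print(merges)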
Dendrograms
A dendrogram visually represents the clustering process. It records how clusters are formed at each step and shows the hierarchy of clusters.
How it works:
The vertical axis represents the distance or dissimilarity between clusters.
The horizontal axis lists individual data points or clusters.
Merges are represented as horizontal lines, where the height of the line corresponds to the distance at which the merge occurred.
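A short, self-contained sketch of plotting such a dendrogram on toy data (the axis labels are illustrative choices):

import numpy as np
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch

# Toy data: six made-up 2-D points.
rng = np.random.default_rng(0)
points = rng.random((6, 2))

# Plot the merge history as a dendrogram.
sch.dendrogram(sch.linkage(points, method='ward'))
plt.xlabel('Data point index')
plt.ylabel('Merge distance')
plt.show()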
Purpose of Hierarchical Clustering
To identify patterns and structure in data by organizing it into a hierarchy of clusters.
To create a dendrogram that provides insights into the relationship and distance between clusters.
How to Make Predictions Using Hierarchical Clustering
To determine the optimal number of clusters:
Set a Threshold:
Draw a horizontal line across the dendrogram at a specific height (based on a distance metric such as Euclidean distance). This threshold determines how clusters are formed.
The number of clusters is equal to the number of vertical lines crossed by this threshold line.
Find the Optimal Number of Clusters:
One standard approach is to locate the largest vertical distance in the dendrogram that does not intersect any horizontal lines. This gap indicates the best place to "cut" the dendrogram for clustering.
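Cutting the dendrogram at a chosen height can also be done programmatically with SciPy's fcluster; the sketch below uses toy data and an arbitrary threshold value purely for illustration.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data and linkage matrix, as in the earlier sketches.
rng = np.random.default_rng(0)
points = rng.random((10, 2))
merges = linkage(points, method='ward')

# 'Cut' the dendrogram at height 0.5 (an arbitrary threshold here):
# merges above this distance are ignored, and the remaining
# sub-trees become the clusters.
labels = fcluster(merges, t=0.5, criterion='distance')
print(labels)  # cluster label for each data point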
Key Definitions
Distance Between Clusters:
The distance between two clusters can be measured in several ways:
Single Linkage: The shortest distance between any two points in the clusters.
Complete Linkage: The longest distance between any two points in the clusters.
Average Linkage: The average distance between all points in the two clusters.
Centroid Distance: The distance between the centroids (mean points) of the two clusters.
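These definitions map directly onto SciPy's method argument; a small sketch comparing them on the same toy data (values invented) could look like this:

import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
points = rng.random((8, 2))

# The same data, clustered with different definitions of the
# distance between clusters.
for method in ['single', 'complete', 'average', 'centroid']:
    merges = linkage(points, method=method)
    # The third column of the last row is the distance of the final merge.
    print(method, merges[-1, 2])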
Threshold Setting:
Setting the threshold depends on the dataset and the problem context. It involves balancing the trade-off between too few clusters (under-clustering) and too many clusters (over-clustering).
Example Questions
What is a dendrogram?
A dendrogram is a diagram that records the hierarchical merging of clusters during the clustering process. It visually represents the clustering hierarchy and is used to determine the number of clusters.
How do we find the optimal number of clusters?
By identifying the largest vertical gap in the dendrogram that does not intersect horizontal lines. This gap indicates the best threshold for splitting clusters.
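One rough way to automate this rule is to look at the merge distances recorded in the linkage matrix and find the largest jump between consecutive merges; the sketch below is a simple heuristic on toy data, not a standard library routine.

import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
points = rng.random((12, 2))
merges = linkage(points, method='ward')

# Merge distances in the order the merges happened (third column).
heights = merges[:, 2]
gaps = np.diff(heights)

# The largest gap suggests where to cut: stopping just before the merge
# after that gap leaves n_points - (index of the gap + 1) clusters.
k = len(points) - (np.argmax(gaps) + 1)
print('suggested number of clusters:', k)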
How does hierarchical clustering differ from k-means?
Hierarchical clustering does not require the number of clusters to be specified in advance, while k-means does.
Hierarchical clustering produces a dendrogram, whereas k-means provides final cluster assignments without a hierarchy.
Summary
Hierarchical clustering is a versatile technique for grouping data points based on similarity, providing both a clustering solution and a dendrogram for analysis. By interpreting the dendrogram and setting appropriate thresholds, we can extract meaningful clusters and gain insights into the data structure.
Step-by-Step Breakdown of the Code
1. Importing the Necessary Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch # For dendrogram
pandas is used for data manipulation (e.g., reading the dataset).
numpy is used for numerical operations (though it’s not heavily used in the code provided).
matplotlib.pyplot is used for plotting graphs (we use it to visualize the clusters and the dendrogram).
scipy.cluster.hierarchy is used for the hierarchical clustering methods (particularly for generating the dendrogram).
2. Loading the Dataset
dataset = pd.read_csv('Mall_Customers.csv')
x = dataset.iloc[:, [3, 4]].values # Selecting the relevant columns
pd.read_csv('Mall_Customers.csv') loads the dataset, which is a CSV file containing customer data (such as annual income and spending score).
x = dataset.iloc[:, [3, 4]].values selects the columns containing Annual Income and Spending Score. These two features are used to determine the similarity between customers for clustering.
iloc[:, [3, 4]] picks the columns at positions 3 and 4 (the 4th and 5th columns, since indexing is zero-based), which are assumed to be the annual income and spending score.
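3. Creating the Dendrogram
A minimal sketch of this step, consistent with the explanation below and reusing the imports and x from the previous steps (the plot title and axis labels are illustrative choices):

# Build the linkage matrix with Ward's method and plot the dendrogram.
dendrogram = sch.dendrogram(sch.linkage(x, method='ward'))
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distance')
plt.show()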
sch.linkage(x, method='ward'): This performs the actual hierarchical clustering. It takes the matrix of data points (x) and uses the Ward linkage method to calculate the distances between clusters.
Ward’s method is a way to minimize the variance (or the spread) within the clusters. It merges clusters in such a way that the resulting clusters have the least internal variance.
sch.dendrogram(): This function plots the dendrogram, which is a tree-like diagram that shows how clusters are formed step-by-step.
The vertical axis represents the distance or dissimilarity between clusters. The higher the vertical line, the more different the clusters are.
The horizontal axis represents the data points. Each point starts as its own cluster at the bottom, and as clusters merge, the branches of the tree rise.
Dendrogram Analysis: The key to hierarchical clustering is looking at the dendrogram. By cutting the tree at a certain level (threshold), you decide how many clusters to form. For example, if you cut it at a lower level, you’ll get more clusters; cutting it higher gives fewer clusters.
4. Fitting the Agglomerative Clustering Model
from sklearn.cluster import AgglomerativeClustering
hc = AgglomerativeClustering(n_clusters=5, metric='euclidean', linkage='ward')  # 'metric' replaces the older 'affinity' parameter in recent scikit-learn versions
y_hc = hc.fit_predict(x)
AgglomerativeClustering: This is the algorithm that performs the hierarchical clustering.
n_clusters=5: This sets the number of clusters you want to form. You've chosen to create 5 clusters.
metric='euclidean' (named affinity in older scikit-learn versions): This defines the distance metric used to measure the dissimilarity between points. Euclidean distance is the straight-line distance between two points in the feature space (income and spending score in this case); it is also the only metric allowed when linkage='ward'.
linkage='ward': This specifies the method to calculate the distance between clusters. You're using Ward's linkage, which minimizes the variance within clusters.
y_hc = hc.fit_predict(x): This fits the hierarchical clustering model to your data (x) and predicts the cluster assignment for each data point. The result, y_hc, is an array where each value corresponds to the cluster label assigned to each customer.
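5. Visualizing the Clusters
A sketch of the plotting step explained below, assuming the five clusters and the two features selected earlier (colors, marker sizes, and labels are illustrative choices):

# One scatter call per cluster, each with its own color and label.
plt.scatter(x[y_hc == 0, 0], x[y_hc == 0, 1], s=100, c='red', label='Cluster 1')
plt.scatter(x[y_hc == 1, 0], x[y_hc == 1, 1], s=100, c='blue', label='Cluster 2')
plt.scatter(x[y_hc == 2, 0], x[y_hc == 2, 1], s=100, c='green', label='Cluster 3')
plt.scatter(x[y_hc == 3, 0], x[y_hc == 3, 1], s=100, c='cyan', label='Cluster 4')
plt.scatter(x[y_hc == 4, 0], x[y_hc == 4, 1], s=100, c='magenta', label='Cluster 5')
plt.title('Clusters of customers')
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.legend()
plt.show()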
plt.scatter(): This function is used to create a scatter plot. For each cluster (determined by y_hc), you plot the corresponding points (data points that belong to that cluster).
x[y_hc == 0, 0]: This selects the annual income (feature 0) of every point assigned to the first cluster (label 0, plotted as Cluster 1); x[y_hc == 0, 1] selects the spending score (feature 1) of those same points.
The same logic is applied for other clusters (Cluster 2, Cluster 3, etc.), each with a different color and label.
Plot Details:
The scatter plot shows the Annual Income on the X-axis and Spending Score on the Y-axis.
Different colors represent different clusters, helping you visualize how the data points are grouped based on these features.