Introduction to Unsupervised Learning: Clustering
Clustering is a type of unsupervised learning, where the goal is to group data into clusters based on similarity. Unlike supervised learning, where models learn from labeled data, in clustering, the model discovers patterns and relationships on its own.
Key Characteristics of Clustering:
Goal: To identify segments or clusters in your data without prior knowledge of the groups.
Difference from Classification: In classification, we know the target labels and train the model to predict them. In clustering, we explore the data to discover hidden patterns or structures.
Unexpected Discoveries: Clustering algorithms can reveal surprising patterns or groupings, helping uncover structures in the data you might not have anticipated.
K-Means Clustering
How Does K-Means Work?
Choose the Number of Clusters (K):
Decide how many clusters you want to form.
Initialize Centroids:
Place K centroids randomly within the data space.
Assign Points to Clusters:
Assign each data point to its nearest centroid (conceptually, the equidistant boundary between two centroids decides which side a point falls on). This forms the initial clusters.
Calculate New Centroids:
Compute the center of mass (average position) for each cluster and move the centroid to this position.
Repeat Until Convergence:
Repeat the assignment and recalculation process until the centroids no longer move significantly.
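The loop described above can be sketched in a few lines of NumPy. This is a toy illustration of the algorithm, not scikit-learn's implementation; `n_iter`, `tol`, and the random seed are illustrative choices:

```python
import numpy as np

def kmeans(points, k, n_iter=100, tol=1e-6, seed=0):
    """Toy K-Means: random init, then alternate assignment / update steps."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking K distinct data points at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to its cluster's center of mass
        # (keep the old centroid if a cluster happens to be empty)
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        # Stop once the centroids no longer move significantly
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return labels, centroids
```

In practice you would use `sklearn.cluster.KMeans` (as in the code below), which adds smarter initialization and multiple restarts.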
Deciding the Number of Clusters (K)
The Elbow Method helps determine the optimal number of clusters:
WCSS (Within-Cluster Sum of Squares): Measures the sum of squared distances between data points and their respective centroids.
As the number of clusters increases, WCSS decreases. The optimal number of clusters is where the decrease in WCSS slows down significantly, forming an “elbow” in the graph.
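In symbols, with clusters C_k and centroids μ_k, the quantity being plotted is:

```latex
\mathrm{WCSS} = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2
```

This is the value scikit-learn reports as `inertia_` after fitting a `KMeans` model.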
Random Initialization Trap
Poor initialization of centroids can lead to suboptimal clustering results. To avoid this, use the K-Means++ Initialization Algorithm:
Step 1: Select the first centroid randomly from the data points.
Step 2: For each remaining data point, compute its distance (D) to the nearest selected centroid.
Step 3: Use weighted random selection to choose the next centroid, with weights proportional to D².
Step 4: Repeat until all K centroids are chosen.
Step 5: Proceed with standard K-Means clustering.
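The weighted selection in steps 1–4 can be sketched directly. This is a toy version of the idea; in practice you simply pass `init='k-means++'` to scikit-learn's `KMeans`, as the code below does:

```python
import numpy as np

def kmeans_pp_init(points, k, seed=0):
    """Toy K-Means++ init: first centroid at random, the rest weighted by D^2."""
    rng = np.random.default_rng(seed)
    # Step 1: choose the first centroid uniformly at random from the data
    centroids = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        # Step 2: squared distance from each point to its nearest chosen centroid
        d2 = np.min(np.linalg.norm(points[:, None, :]
                                   - np.array(centroids)[None, :, :],
                                   axis=2) ** 2, axis=1)
        # Step 3: weighted random selection, probability proportional to D^2
        probs = d2 / d2.sum()
        centroids.append(points[rng.choice(len(points), p=probs)])
    return np.array(centroids)  # Step 4: all K centroids chosen
```

Because already-chosen centroids have D² = 0, they can never be picked twice, and far-away points are favored, spreading the initial centroids across the data.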
Code Implementation
1. Importing Libraries and Dataset
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Mall_Customers.csv')
x = dataset.iloc[:, [3, 4]].values  # feature columns: Annual Income and Spending Score
2. Using the Elbow Method to Find Optimal Clusters
from sklearn.cluster import KMeans

wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', n_init=10, random_state=42)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)  # inertia_ is the WCSS for this value of K
# Plotting the Elbow Method
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.show()
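3. Training the K-Means Model and Visualizing the Clusters

With K chosen from the elbow plot, the last two steps are fitting the final model and plotting the result. A minimal sketch, assuming the elbow suggests K = 5; synthetic two-feature data stands in for `x` here so the snippet runs on its own:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Stand-in for x from step 1: five synthetic groups of two-feature points
rng = np.random.default_rng(42)
x = np.vstack([rng.normal(loc=c, scale=4.0, size=(40, 2))
               for c in [(20, 20), (20, 80), (50, 50), (85, 20), (85, 80)]])

# Training the K-Means model with the K chosen from the elbow plot (K = 5 here)
kmeans = KMeans(n_clusters=5, init='k-means++', n_init=10, random_state=42)
y_kmeans = kmeans.fit_predict(x)

# Visualizing each cluster and the final centroids
for cluster in range(5):
    plt.scatter(x[y_kmeans == cluster, 0], x[y_kmeans == cluster, 1],
                s=30, label=f'Cluster {cluster + 1}')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=200, c='yellow', edgecolors='black', label='Centroids')
plt.title('Clusters of Customers')
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.legend()
plt.show()
```

With the real dataset, drop the synthetic data and reuse the `x` array loaded from `Mall_Customers.csv` in step 1.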
Key Points to Remember
Difference from Regression/Classification: Unlike supervised learning tasks, clustering doesn't predict a known target; it discovers patterns and structures in the data.
Dependent Variable Creation: In effect, clustering creates a new dependent variable: the cluster label assigned to each data point.
Role of Initialization: Proper centroid initialization (e.g., K-Means++) avoids the random initialization trap, ensuring better results.
WCSS: Tracks the within-cluster variance, helping determine the optimal number of clusters using the Elbow Method.
Visualization: Visualizing clusters and centroids helps interpret and validate the results.
By following these steps, you can apply K-Means clustering effectively to uncover patterns and insights in your data!