Introduction to Unsupervised Learning: Clustering
Clustering is a type of unsupervised learning, where the goal is to group data into clusters based on similarity. Unlike supervised learning, where models learn from labeled data, in clustering, the model discovers patterns and relationships on its own.
Key Characteristics of Clustering:
Goal: To identify segments or clusters in your data without prior knowledge of the groups.
Difference from Classification: In classification, we know the target labels and train the model to predict them. In clustering, we explore the data to discover hidden patterns or structures.
Unexpected Discoveries: Clustering algorithms can reveal surprising patterns or groupings, helping uncover structures in the data you might not have anticipated.
K-Means Clustering
How Does K-Means Work?
Choose the Number of Clusters (K):
Decide how many clusters you want to form.
Initialize Centroids:
Place K centroids randomly within the data space.
Assign Points to Clusters:
Assign each data point to its nearest centroid (conceptually, the equidistant boundary between two centroids decides which side a point falls on). This forms the initial clusters.
Calculate New Centroids:
Compute the center of mass (average position) for each cluster and move the centroid to this position.
Repeat Until Convergence:
Repeat the assignment and recalculation process until the centroids no longer move significantly.
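The loop described above can be sketched in a few lines of NumPy. This is a toy illustration of the algorithm, not scikit-learn's implementation; `n_iter`, `tol`, and the random seed are illustrative choices:

```python
import numpy as np

def kmeans(points, k, n_iter=100, tol=1e-6, seed=0):
    """Toy K-Means: random init, then alternate assignment / update steps."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking K distinct data points at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to its cluster's center of mass
        # (keep the old centroid if a cluster happens to be empty)
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        # Stop once the centroids no longer move significantly
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return labels, centroids
```

In practice you would use `sklearn.cluster.KMeans` (as in the code below), which adds smarter initialization and multiple restarts.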
Deciding the Number of Clusters (K)
The Elbow Method helps determine the optimal number of clusters:
WCSS (Within-Cluster Sum of Squares): Measures the sum of squared distances between data points and their respective centroids.
As the number of clusters increases, WCSS decreases. The optimal number of clusters is where the decrease in WCSS slows down significantly, forming an “elbow” in the graph.
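In symbols, with clusters C_k and centroids μ_k, the quantity being plotted is:

```latex
\mathrm{WCSS} = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2
```

This is the value scikit-learn reports as `inertia_` after fitting a `KMeans` model.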
Random Initialization Trap
Poor initialization of centroids can lead to suboptimal clustering results. To avoid this, use the K-Means++ Initialization Algorithm:
Step 1: Select the first centroid randomly from the data points.
Step 2: For each remaining data point, compute its distance (D) to the nearest selected centroid.
Step 3: Use weighted random selection to choose the next centroid, with weights proportional to D².
Step 4: Repeat until all K centroids are chosen.
Step 5: Proceed with standard K-Means clustering.
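The weighted selection in steps 1–4 can be sketched directly. This is a toy version of the idea; in practice you simply pass `init='k-means++'` to scikit-learn's `KMeans`, as the code below does:

```python
import numpy as np

def kmeans_pp_init(points, k, seed=0):
    """Toy K-Means++ init: first centroid at random, the rest weighted by D^2."""
    rng = np.random.default_rng(seed)
    # Step 1: choose the first centroid uniformly at random from the data
    centroids = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        # Step 2: squared distance from each point to its nearest chosen centroid
        d2 = np.min(np.linalg.norm(points[:, None, :]
                                   - np.array(centroids)[None, :, :],
                                   axis=2) ** 2, axis=1)
        # Step 3: weighted random selection, probability proportional to D^2
        probs = d2 / d2.sum()
        centroids.append(points[rng.choice(len(points), p=probs)])
    return np.array(centroids)  # Step 4: all K centroids chosen
```

Because already-chosen centroids have D² = 0, they can never be picked twice, and far-away points are favored, spreading the initial centroids across the data.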
Code Implementation
1. Importing Libraries and Dataset
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Mall_Customers.csv')
x = dataset.iloc[:, [3, 4]].values  # feature columns: Annual Income and Spending Score
2. Using the Elbow Method to Find Optimal Clusters
from sklearn.cluster import KMeans

wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', n_init=10, random_state=42)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)  # inertia_ is the WCSS for this value of K
# Plotting the Elbow Method
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.show()
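3. Training the K-Means Model and Visualizing the Clusters

With K chosen from the elbow plot, the last two steps are fitting the final model and plotting the result. A minimal sketch, assuming the elbow suggests K = 5; synthetic two-feature data stands in for `x` here so the snippet runs on its own:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Stand-in for x from step 1: five synthetic groups of two-feature points
rng = np.random.default_rng(42)
x = np.vstack([rng.normal(loc=c, scale=4.0, size=(40, 2))
               for c in [(20, 20), (20, 80), (50, 50), (85, 20), (85, 80)]])

# Training the K-Means model with the K chosen from the elbow plot (K = 5 here)
kmeans = KMeans(n_clusters=5, init='k-means++', n_init=10, random_state=42)
y_kmeans = kmeans.fit_predict(x)

# Visualizing each cluster and the final centroids
for cluster in range(5):
    plt.scatter(x[y_kmeans == cluster, 0], x[y_kmeans == cluster, 1],
                s=30, label=f'Cluster {cluster + 1}')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=200, c='yellow', edgecolors='black', label='Centroids')
plt.title('Clusters of Customers')
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.legend()
plt.show()
```

With the real dataset, drop the synthetic data and reuse the `x` array loaded from `Mall_Customers.csv` in step 1.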
Key Points to Remember
Difference from Regression/Classification: Unlike supervised learning tasks, clustering doesn't predict a known target; it discovers patterns and structures in the data.
Dependent Variable Creation: In effect, clustering creates a new dependent variable: the cluster label assigned to each data point.
Role of Initialization: Proper centroid initialization (e.g., K-Means++) avoids the random initialization trap, ensuring better results.
WCSS: Tracks the within-cluster variance, helping determine the optimal number of clusters using the Elbow Method.
Visualization: Visualizing clusters and centroids helps interpret and validate the results.
By following these steps, you can apply K-Means clustering effectively to uncover patterns and insights in your data!