K-NN is a simple, supervised machine learning algorithm used for classification and regression tasks. It relies on the idea of similarity (distance metrics) to predict the class or value of a new data point based on its neighbors.
Key Features of K-NN
Instance-Based Learning: K-NN is a lazy learning algorithm. It doesn't build an explicit model during training but directly uses the training data for predictions.
Non-Parametric: It makes no assumptions about the underlying data distribution, making it flexible for many types of data.
Simplicity: The algorithm is easy to implement and intuitive to understand.
How Does K-NN Work?
Choosing K: Decide the number of neighbors K. This is often done through cross-validation.
K=1: The model may overfit the data.
K too large: It may oversmooth and lose local detail.
Distance Metric: Calculate the distance from the new data point to all existing points. The most common metric is Euclidean distance (defined later in this article).
For Classification: Assign the category that appears most frequently among the K nearest neighbors.
For Regression: Take the average value of the K nearest neighbors.
Make Predictions: Assign the class or compute the value for the new data point. A minimal sketch of these steps follows below.
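As a rough illustration of these steps, here is a minimal from-scratch sketch in Python/NumPy. The function name and the toy values are invented for this example and are not part of any library.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # 1. Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # 2. Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # 3. Majority vote among the labels of those neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy data: two features per point, two classes (0 and 1)
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.2, 1.9]), k=3))  # -> 0
```

For regression, the only change would be to return the mean of the neighbors' values (`y_train[nearest].mean()`) instead of the majority vote.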
Example of K-NN for Classification
Imagine a dataset of fruits with features like weight and sweetness:
Training Data:
Apple: Weight=200g, Sweetness=6
Orange: Weight=150g, Sweetness=8
Banana: Weight=120g, Sweetness=7
New Fruit: Weight=160g, Sweetness=7
Using K=3, the algorithm:
Calculates distances to all fruits in the training set.
Finds the three nearest neighbors.
Assigns the new fruit to the category with the majority of the neighbors. A small worked computation follows below.
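To make the arithmetic concrete, here is a small worked sketch. With only one fruit per class a K=3 vote would tie, so the toy training set below adds a second (invented) example of each fruit purely for illustration.

```python
import numpy as np

# Toy training set (weight in grams, sweetness score); the extra fruits beyond
# the three in the text are invented so that a K=3 majority vote is meaningful.
X_train = np.array([
    [200, 6], [190, 5],   # apples
    [150, 8], [155, 9],   # oranges
    [120, 7], [115, 6],   # bananas
], dtype=float)
y_train = np.array(["apple", "apple", "orange", "orange", "banana", "banana"])

x_new = np.array([160.0, 7.0])  # the new fruit from the example

distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
nearest = np.argsort(distances)[:3]  # the 3 closest fruits
for label, dist in zip(y_train[nearest], distances[nearest]):
    print(f"{label}: distance {dist:.1f}")

labels, counts = np.unique(y_train[nearest], return_counts=True)
print("Prediction:", labels[np.argmax(counts)])  # majority class among the 3
```

With these numbers the two oranges are the closest neighbors, so the new fruit is classified as an orange.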
Advantages of K-NN
Simplicity: Easy to implement and interpret.
Versatility: Works well for classification and regression.
No Training Phase: Fast training since it doesn’t build a model.
Handles Multi-Class Data: Works seamlessly for datasets with multiple categories (see the short example below).
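As a quick illustration of the multi-class point, here is a minimal scikit-learn sketch on the built-in three-class iris dataset; the choice of K=5 and the train/test split are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Three classes (setosa, versicolor, virginica), four numeric features
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```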
Disadvantages of K-NN
Computationally Expensive:
It computes distances for all points in the dataset, which can be slow for large datasets.
Storage Requirements:
Requires storing the entire training dataset.
Sensitive to Irrelevant Features:
Features with no predictive value can affect distance calculations.
Imbalanced Data:
Classes with more examples can dominate predictions.
Choosing K and Distance Metric:
The choice of K and metric is critical and often requires tuning.
Improving K-NN
Feature Scaling:
Normalize features to ensure fair contributions to distance calculations.
Common methods: Min-Max Scaling, Standardization.
Weighted K-NN:
Assign higher weights to closer neighbors to give them more importance.
Dimensionality Reduction:
Use techniques like PCA (Principal Component Analysis) to reduce the number of features before computing distances.
Efficient Search Techniques:
Use data structures like KD-Trees or Ball Trees to speed up neighbor searches. Several of these improvements are combined in the sketch below.
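As one way to combine these ideas, here is a hedged scikit-learn sketch that chains standardization, distance-weighted voting, and a KD-Tree neighbor search in a single pipeline; the dataset and the value K=7 are placeholders, not recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Standardize features, weight votes by inverse distance, and use a KD-Tree
# instead of brute-force distance computation.
model = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=7, weights="distance", algorithm="kd_tree"),
)
print("CV accuracy:", cross_val_score(model, X, y, cv=5).mean())
```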
Applications of K-NN
Recommendation Systems:
Suggest movies or products based on user preferences.
Image Recognition:
Classify images into categories like "cat" or "dog."
Medical Diagnosis:
Predict diseases based on patient symptoms.
Anomaly Detection:
Identify outliers in data for fraud detection or system monitoring.
Why Does K-NN Work?
The key idea is similarity: points that are close in feature space often belong to the same class or have similar target values. This aligns with the assumption that local structure in the data provides useful information.
When Should You Use K-NN?
When your dataset is relatively small or has few features.
If interpretability and simplicity are priorities.
When you don’t need real-time predictions (because predictions can be slow for large datasets).
Limitations in Practice
Curse of Dimensionality:
In high-dimensional spaces, distances lose meaning, reducing effectiveness.
Mitigation: Feature selection or dimensionality reduction.
Noise Sensitivity:
Noisy data points can mislead predictions.
Mitigation: Increase K, or clean your data. A sketch of tuning K via cross-validation follows this list.
Outliers:
Distant outliers can significantly affect predictions.
Mitigation: Use robust distance metrics or outlier detection.
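As an illustration of the mitigations above, here is a hedged sketch that standardizes features, optionally applies PCA, and tunes K with scikit-learn's grid search; the candidate values of K, the number of PCA components, and the dataset are arbitrary choices for demonstration.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5)),   # optional: reduce dimensionality first
    ("knn", KNeighborsClassifier()),
])

# Search over K with 5-fold cross-validation
grid = GridSearchCV(pipe, {"knn__n_neighbors": [1, 3, 5, 7, 9, 11]}, cv=5)
grid.fit(X, y)
print("Best K:", grid.best_params_["knn__n_neighbors"])
print("CV accuracy:", grid.best_score_)
```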
K-NN in Code
The code at the end of this article walks through a complete example using Python's scikit-learn (sklearn) library. It shows:
Loading data.
Feature scaling.
Training and evaluating a K-NN classifier.
Visualizing results.
## How K-NN Works
1. **Choose the number of neighbors, K**: Decide how many neighbors to consider (e.g., K=5).
2. **Find the K nearest neighbors**: Calculate the Euclidean distance between the new data point and all other data points.
3. **Count the categories**: Determine how many neighbors belong to each category.
4. **Assign the category**: Assign the new data point to the category with the majority of neighbors.
### Formula for Euclidean Distance:
For two points \( P(x_1, y_1) \) and \( Q(x_2, y_2) \):
\[
\text{Distance} = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}
\]
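As an illustrative translation of this formula into Python (the function name is just for this sketch):

```python
import math

def euclidean_distance(p, q):
    """Distance between two 2-D points p = (x1, y1) and q = (x2, y2)."""
    return math.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

print(euclidean_distance((200, 6), (160, 7)))  # Apple vs. the new fruit, ~40.0
```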
---
### Python Implementation
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Step 1: Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')  # replace with any dataset you have
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
# Step 2: Splitting the dataset into Training and Test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
# Step 3: Feature scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
# Step 4: Training the K-NN model
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)  # Minkowski with p=2 is Euclidean distance
classifier.fit(x_train, y_train)
# Step 5: Predicting a single result
# Example: Predict the category for Age=30, Estimated Salary=87000
print(classifier.predict(scaler.transform([[30, 87000]])))
# Step 6: Predicting the Test set results
y_pred = classifier.predict(x_test)
# Display predictions alongside actual values
print(np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), axis=1))
# Step 7: Evaluating the model
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)
print("Accuracy:", accuracy_score(y_test, y_pred))
# Step 8: Visualizing the Training set results
from matplotlib.colors import ListedColormap
# Work in the original (unscaled) feature space so the axis labels are readable
X_set, y_set = scaler.inverse_transform(x_train), y_train
# Build a grid over the feature space; the salary step is kept coarse (500)
# so the grid stays small enough to predict over quickly
X1, X2 = np.meshgrid(
    np.arange(start=X_set[:, 0].min() - 10, stop=X_set[:, 0].max() + 10, step=0.5),
    np.arange(start=X_set[:, 1].min() - 1000, stop=X_set[:, 1].max() + 1000, step=500)
)
# Colour each grid point by the class the classifier predicts for it
plt.contourf(X1, X2, classifier.predict(scaler.transform(np.array([X1.ravel(), X2.ravel()]).T)).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(['#FA8072', '#1E90FF']))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
# Overlay the actual training points, coloured by their true class
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(['#FA8072', '#1E90FF'])(i), label=j)
plt.title('K-NN (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
# Step 9: Visualizing the Test set results
X_set, y_set = scaler.inverse_transform(x_test), y_test
# Same grid construction as for the training plot
X1, X2 = np.meshgrid(
    np.arange(start=X_set[:, 0].min() - 10, stop=X_set[:, 0].max() + 10, step=0.5),
    np.arange(start=X_set[:, 1].min() - 1000, stop=X_set[:, 1].max() + 1000, step=500)
)
plt.contourf(X1, X2, classifier.predict(scaler.transform(np.array([X1.ravel(), X2.ravel()]).T)).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(['#FA8072', '#1E90FF']))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
# Overlay the actual test points, coloured by their true class
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(['#FA8072', '#1E90FF'])(i), label=j)
plt.title('K-NN (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()