Spam detection project

Importing Required Libraries

python

Copy

Edit
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import string

Purpose: These libraries are essential for:
Data manipulation (pandas).
Numerical operations (numpy).
Visualization (matplotlib).
Natural Language Processing (nltk).
NLTK Components:
stopwords: Predefined common words in English to be filtered out (e.g., "the", "and").
PorterStemmer: For stemming, which reduces words to their root forms (e.g., "running" → "run").

2. Loading and Cleaning the Dataset

python

Copy

Edit
df = pd.read_csv('spam.csv', encoding='latin1')
df.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], inplace=True)
df.rename(columns={'v1': 'target', 'v2': 'text'}, inplace=True)

The dataset is loaded with unnecessary columns (Unnamed: 2, Unnamed: 3, Unnamed: 4), which are dropped.
The columns are renamed for clarity:
v1 → target: Represents whether the message is spam or ham.
v2 → text: Represents the actual message.

3. Exploratory Data Analysis (EDA)

python

Copy

Edit
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
df['target'] = encoder.fit_transform(df['target'])
df = df.drop_duplicates(keep='first')

Target Encoding: Converts target into binary values:
spam → 1.
ham → 0.
Duplicate Removal: Ensures there are no duplicate rows in the dataset.

Adding New Features

python

Copy

Edit
df['num_characters'] = df['text'].apply(len)
df['num_words'] = df['text'].apply(lambda x: len(nltk.word_tokenize(x)))
df['num_sentence'] = df['text'].apply(lambda x: len(nltk.sent_tokenize(x)))

Adds three new columns to analyze text properties:
num_characters: Total number of characters in the message.
num_words: Total number of words in the message.
num_sentence: Total number of sentences in the message.

4. Data Preprocessing

python

Copy

Edit
def transform_text(text):
    ps = PorterStemmer()
    text = text.lower()
    text = nltk.word_tokenize(text)
    y = []
    for i in text:
        if i.isalnum():
            y.append(i)
    text = y[:]
    y.clear()
    for i in text:
        if i not in stopwords.words('english') and i not in string.punctuation:
            y.append(i)
    text = y[:]
    y.clear()
    for i in text:
        y.append(ps.stem(i))
    return " ".join(y)

df['transformed_text'] = df['text'].apply(transform_text)

Steps in Text Transformation:

Lowercasing: Converts all text to lowercase for uniformity.
Tokenization: Splits text into individual words.
Remove Non-Alphanumeric Characters: Filters out special characters.
Stopwords Removal: Removes common words that don’t add meaning (e.g., "is", "at").
Stemming: Reduces words to their root forms.

5. Corpus Analysis

Spam Corpus

python

Copy

Edit
spam_corpus = []
for msg in df[df['target'] == 1]['transformed_text'].tolist():
    for word in msg.split():
        spam_corpus.append(word)

Collects all words from spam messages into a list (spam_corpus).

Ham Corpus

python

Copy

Edit
ham_corpus = []
for msg in df[df['target'] == 0]['transformed_text'].tolist():
    for word in msg.split():
        ham_corpus.append(word)

Collects all words from ham messages into a list (ham_corpus).

6. Text Vectorization

python

Copy

Edit
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
cv = CountVectorizer()
tf = TfidfVectorizer(max_features=3000)
x = cv.fit_transform(df['transformed_text']).toarray()

Bag of Words (BOW): Converts text into numerical vectors based on word frequency.
TF-IDF: Considers word frequency and importance within the dataset.

7. Train-Test Split

python

Copy

Edit
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

Splits the dataset into training (80%) and testing (20%) subsets.

8. Model Building and Evaluation

python

Copy

Edit
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score
BN = BernoulliNB()
BN.fit(x_train, y_train)
y_pred = BN.predict(x_test)
cm = confusion_matrix(y_test, y_pred)

Naive Bayes Classifier:

BernoulliNB: Used for binary classification, especially with textual data.

Metrics:

Confusion Matrix: Shows true positives, true negatives, false positives, and false negatives.
Accuracy Score: Proportion of correctly classified messages.

Outputs

Prints the predicted labels (y_pred), confusion matrix (cm), and accuracy score.