Adds three new columns to analyze text properties:
num_characters: Total number of characters in the message.
num_words: Total number of words in the message.
num_sentence: Total number of sentences in the message.
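The three counts above are straightforward to compute; a minimal sketch using simple regex splits as a stand-in for the NLTK tokenizers the notebook uses (`nltk.word_tokenize` / `nltk.sent_tokenize`):

```python
import re

def text_stats(text):
    # Character count, rough word count, and rough sentence count.
    # The notebook derives the last two with nltk.word_tokenize and
    # nltk.sent_tokenize; regexes approximate that here.
    num_characters = len(text)
    num_words = len(re.findall(r"\w+", text))
    num_sentence = len(re.findall(r"[.!?]+", text)) or 1
    return num_characters, num_words, num_sentence

print(text_stats("Hello there. How are you?"))  # (25, 5, 2)
```

In the notebook these become columns via `df['text'].apply(...)`, one apply per count.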
4. Data Preprocessing
```python
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

def transform_text(text):
    ps = PorterStemmer()
    text = text.lower()
    text = nltk.word_tokenize(text)

    # Keep only alphanumeric tokens
    y = []
    for i in text:
        if i.isalnum():
            y.append(i)

    # Remove stopwords and punctuation
    text = y[:]
    y.clear()
    for i in text:
        if i not in stopwords.words('english') and i not in string.punctuation:
            y.append(i)

    # Stem each remaining token
    text = y[:]
    y.clear()
    for i in text:
        y.append(ps.stem(i))

    return " ".join(y)

df['transformed_text'] = df['text'].apply(transform_text)
```
Steps in Text Transformation:
Lowercasing: Converts all text to lowercase for uniformity.
Tokenization: Splits text into individual words.
Remove Non-Alphanumeric Characters: Filters out special characters.
Stopwords Removal: Removes common words that don’t add meaning (e.g., "is", "at").
Stemming: Reduces words to their root forms.
5. Corpus Analysis
Spam Corpus
```python
spam_corpus = []
for msg in df[df['target'] == 1]['transformed_text'].tolist():
    for word in msg.split():
        spam_corpus.append(word)
```
Collects all words from spam messages into a list (spam_corpus).
Ham Corpus
```python
ham_corpus = []
for msg in df[df['target'] == 0]['transformed_text'].tolist():
    for word in msg.split():
        ham_corpus.append(word)
```
Collects all words from ham messages into a list (ham_corpus).
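With both corpora built, the most frequent tokens in each can be tallied with the standard library's `collections.Counter` (this is what the notebook's word-frequency plots visualize); a sketch with toy lists standing in for the real corpora:

```python
from collections import Counter

# Toy corpora standing in for the real spam_corpus / ham_corpus lists
spam_corpus = ["free", "win", "call", "free", "prize", "free"]
ham_corpus = ["ok", "see", "you", "ok"]

print(Counter(spam_corpus).most_common(1))  # [('free', 3)]
print(Counter(ham_corpus).most_common(1))   # [('ok', 2)]
```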
6. Text Vectorization
```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

cv = CountVectorizer()
tf = TfidfVectorizer(max_features=3000)

x = cv.fit_transform(df['transformed_text']).toarray()
y = df['target'].values  # numeric labels (0 = ham, 1 = spam)
```
Bag of Words (BOW): Converts text into numerical vectors based on word frequency.
TF-IDF: Weights each word's frequency by its rarity across the dataset, so ubiquitous words count for less than distinctive ones.
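To make the TF-IDF weighting concrete, here is the unsmoothed formula computed by hand on a tiny corpus (scikit-learn's `TfidfVectorizer` additionally applies IDF smoothing and L2 normalization, so its numbers differ slightly):

```python
import math

docs = [["free", "win", "free"], ["hello", "world"]]

def tf_idf(docs):
    n = len(docs)
    vocab = sorted({w for d in docs for w in d})
    # Document frequency: how many documents contain each word
    doc_freq = {w: sum(w in d for d in docs) for w in vocab}
    out = []
    for d in docs:
        weights = {}
        for w in vocab:
            tf = d.count(w) / len(d)          # term frequency in this doc
            idf = math.log(n / doc_freq[w])   # unsmoothed inverse doc frequency
            weights[w] = tf * idf
        out.append(weights)
    return out

w = tf_idf(docs)
```

In the first document, "free" (twice, and absent from the other document) gets the highest weight, while "hello" scores zero since it never appears there.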
7. Train-Test Split
```python
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
```
Splits the dataset into training (80%) and testing (20%) subsets.
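Conceptually, the split shuffles the rows with a fixed seed (`random_state=0` makes it reproducible) and cuts at the 80% mark; a plain-Python illustration of that mechanic, not scikit-learn's actual implementation:

```python
import random

def split_80_20(items, seed=0):
    # Shuffle a copy with a fixed seed, then cut at the 80% mark.
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * 0.8)
    return items[:cut], items[cut:]

train, test = split_80_20(range(100))
print(len(train), len(test))  # 80 20
```

Because spam datasets are imbalanced (far more ham than spam), passing `stratify=y` to `train_test_split` is a common refinement that preserves the class ratio in both subsets.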
8. Model Building and Evaluation
```python
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

BN = BernoulliNB()
BN.fit(x_train, y_train)
y_pred = BN.predict(x_test)
cm = confusion_matrix(y_test, y_pred)
```
Naive Bayes Classifier:
BernoulliNB: A Naive Bayes variant that models each feature as binary (word present or absent), which suits short-text spam detection.
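The imported `accuracy_score` and `precision_score` complete the evaluation; on imbalanced spam data, precision on the spam class matters more than raw accuracy, since a false positive hides a legitimate message. A pure-Python sketch of what those two metrics compute:

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true label
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision(y_true, y_pred):
    # Of all messages predicted spam (1), the fraction that truly are spam
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(accuracy(y_true, y_pred))   # 4/6 ≈ 0.667
print(precision(y_true, y_pred))  # 2/3 ≈ 0.667
```

In the notebook, `accuracy_score(y_test, y_pred)` and `precision_score(y_test, y_pred)` give the same quantities directly.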