Adds three new columns to analyze text properties:
num_characters: Total number of characters in the message.
num_words: Total number of words in the message.
num_sentence: Total number of sentences in the message.
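A minimal sketch of how these columns can be derived with NLTK tokenizers (an assumption about the feature-engineering step, not shown in the original listing; requires df['text'] to hold the raw messages and the punkt tokenizer data to be downloaded):

```python
import nltk

# Character, word, and sentence counts per message
df['num_characters'] = df['text'].apply(len)
df['num_words'] = df['text'].apply(lambda msg: len(nltk.word_tokenize(msg)))
df['num_sentence'] = df['text'].apply(lambda msg: len(nltk.sent_tokenize(msg)))
```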
4. Data Preprocessing
```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

def transform_text(text):
    ps = PorterStemmer()

    # Lowercase and tokenize
    text = text.lower()
    text = nltk.word_tokenize(text)

    # Keep only alphanumeric tokens
    y = []
    for i in text:
        if i.isalnum():
            y.append(i)

    # Remove stopwords and punctuation
    text = y[:]
    y.clear()
    for i in text:
        if i not in stopwords.words('english') and i not in string.punctuation:
            y.append(i)

    # Stem each remaining token
    text = y[:]
    y.clear()
    for i in text:
        y.append(ps.stem(i))

    return " ".join(y)

df['transformed_text'] = df['text'].apply(transform_text)
```
Steps in Text Transformation:
Lowercasing: Converts all text to lowercase for uniformity.
Tokenization: Splits text into individual words.
Non-Alphanumeric Removal: Filters out punctuation and other special characters.
Stopwords Removal: Removes common words that don’t add meaning (e.g., "is", "at").
Stemming: Reduces words to their root forms.
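For example, running transform_text on a sample message (an illustrative input; the exact output depends on NLTK's stopword list and stemmer):

```python
sample = "Congratulations!! You have WON a free ticket. Call now!!"
print(transform_text(sample))
# Roughly: "congratul won free ticket call"
```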
5. Corpus Analysis
Spam Corpus
```python
spam_corpus = []
for msg in df[df['target'] == 1]['transformed_text'].tolist():
    for word in msg.split():
        spam_corpus.append(word)
```
Collects all words from spam messages into a list (spam_corpus).
Ham Corpus
```python
ham_corpus = []
for msg in df[df['target'] == 0]['transformed_text'].tolist():
    for word in msg.split():
        ham_corpus.append(word)
```
Collects all words from ham messages into a list (ham_corpus).
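With both corpora built, a quick frequency check (an illustrative inspection step, not part of the original listing) shows the most common tokens in each class:

```python
from collections import Counter

# Ten most frequent tokens in each corpus
print(Counter(spam_corpus).most_common(10))
print(Counter(ham_corpus).most_common(10))
```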
6. Text Vectorization
```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Two vectorizers: raw counts (BoW) and TF-IDF limited to the 3000 most frequent terms
cv = CountVectorizer()
tf = TfidfVectorizer(max_features=3000)

# Feature matrix built from the Bag of Words representation
x = cv.fit_transform(df['transformed_text']).toarray()
```
Bag of Words (BoW): Converts each message into a vector of raw word counts.
TF-IDF: Weights each word by its frequency within a message and its rarity across the whole dataset, so common but uninformative words count for less.
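Since tf is defined but not used in the snippet above, a short sketch (an assumption about how the TF-IDF variant would be swapped in) looks like this:

```python
# Alternative feature matrix using TF-IDF weighting (capped at 3000 terms)
x_tfidf = tf.fit_transform(df['transformed_text']).toarray()
print(x.shape, x_tfidf.shape)
```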
7. Train-Test Split
```python
from sklearn.model_selection import train_test_split

# y holds the encoded labels taken from the target column (1 = spam, 0 = ham)
y = df['target'].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
```
Splits the dataset into training (80%) and testing (20%) subsets; random_state=0 keeps the split reproducible.
8. Model Building and Evaluation
```python
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

# Fit a Bernoulli Naive Bayes model and evaluate it on the held-out test set
BN = BernoulliNB()
BN.fit(x_train, y_train)
y_pred = BN.predict(x_test)

cm = confusion_matrix(y_test, y_pred)
print(cm)
print(accuracy_score(y_test, y_pred))
print(precision_score(y_test, y_pred))
```
Naive Bayes Classifier:
BernoulliNB: A Naive Bayes variant designed for binary features, which suits text represented by word presence/absence and works well for spam/ham classification.
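Since GaussianNB and MultinomialNB are imported as well, a short sketch (assumed, not from the original listing) compares all three variants on the same split:

```python
# Compare the three Naive Bayes variants on identical train/test data
for name, clf in [("GaussianNB", GaussianNB()),
                  ("MultinomialNB", MultinomialNB()),
                  ("BernoulliNB", BernoulliNB())]:
    clf.fit(x_train, y_train)
    pred = clf.predict(x_test)
    print(name, accuracy_score(y_test, pred), precision_score(y_test, pred))
```

MultinomialNB generally pairs well with count or TF-IDF features, which is why it is a common final choice for spam filters.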