Natural Language Processing (NLP) Intuition

What is NLP?

Natural Language Processing (NLP) is a branch of Artificial Intelligence (AI) that focuses on the interaction between computers and human language. It enables machines to understand, interpret, generate, and respond to human language in a meaningful and context-aware manner.

Types of NLP Approaches

Seq2Seq Models
One of the most powerful NLP architectures.
Used for applications like machine translation, text summarization, and chatbot responses.

Classical vs. Deep Learning Models
Classical NLP Models: Rule-based approaches (e.g., if-else conditions in chatbots) and statistical methods (e.g., Bag of Words for classification).
Deep Learning-based NLP (DNLP): CNNs for text recognition and classification, and sequence-to-sequence (Seq2Seq) models for tasks like machine translation and speech recognition.

Examples of NLP Applications

If-Else Chatbots: Basic rule-based chatbots that respond based on predefined conditions.
Audio Frequency Components Analysis: Used in speech recognition.
Bag of Words Model: Converts text into numerical vectors for classification tasks.
CNN for Text Recognition: Uses convolutional neural networks to classify text and images.
Seq2Seq Models: Handle many-to-many tasks such as language translation.
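
To make the first item concrete, here is a minimal sketch of an if-else chatbot. The keywords and replies are invented for illustration; a real rule-based bot would need far more rules.

```python
def rule_based_reply(message):
    """Return a canned reply chosen by simple if-else keyword checks."""
    text = message.lower()
    if "hello" in text or "hi" in text.split():
        return "Hello! How can I help you?"
    elif "price" in text:
        return "Our pricing page lists all current plans."
    elif "bye" in text:
        return "Goodbye!"
    else:
        # No rule matched: the bot has no way to generalize.
        return "Sorry, I did not understand that."

print(rule_based_reply("Hello there"))
print(rule_based_reply("What is the price?"))
```

The fallback branch shows the main limitation of this approach: anything outside the predefined conditions goes unanswered.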

Understanding the Bag of Words Model

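The Bag of Words (BoW) model is a fundamental NLP technique used for text classification. It represents text data as numerical vectors based on word frequency: each position in the vector corresponds to one vocabulary word, and the order of words in the text is discarded.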

Vector Representation

Each document is represented as a vector.
Special tokens like Start of Sentence (SOS) and End of Sentence (EOS) are included.
If a word outside the vocabulary appears, it is counted in a reserved position at the end of the vector.
The model’s goal is to classify the text (e.g., Yes/No classification for spam detection).
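
The vector layout described above can be sketched as follows. The five-word vocabulary and the `<SOS>`/`<EOS>` spellings are assumptions made for illustration, and the final slot for unknown words is one reading of the rule about unseen words.

```python
# Hypothetical five-word vocabulary; a real one would hold thousands of words.
vocabulary = ["<SOS>", "<EOS>", "free", "money", "meeting"]
index = {word: i for i, word in enumerate(vocabulary)}

def to_vector(text):
    """Count vocabulary words; unknown words go into one extra final slot."""
    vector = [0] * (len(vocabulary) + 1)  # last slot = out-of-vocabulary
    vector[index["<SOS>"]] = 1            # every text has one start token
    vector[index["<EOS>"]] = 1            # and one end token
    for word in text.lower().split():
        vector[index.get(word, len(vocabulary))] += 1
    return vector

print(to_vector("free money free prize"))
```

Here "prize" is not in the vocabulary, so it increments the last position instead of getting a slot of its own.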

Process of Bag of Words Model

Convert Emails into Vectors: Each email is transformed into a vector of length 20,000, with one entry per vocabulary word recording how often that word occurs.
Apply a Model: A simple approach is to use Logistic Regression to classify the text (e.g., spam detection).
An alternative approach is a Neural Network, where each vector feeds an input layer of 20,000 neurons.
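
As a rough illustration of the Logistic Regression option, the sketch below fits a tiny model with hand-written batch gradient descent. The two-word vocabulary, the toy vectors, and the learning rate are all assumptions; in practice a library implementation would be used on the full 20,000-dimensional vectors.

```python
import math

# Toy BoW vectors over a hypothetical two-word vocabulary: ["free", "meeting"].
X = [[2, 0], [1, 0], [0, 2], [0, 1]]
y = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, lr=0.5, epochs=200):
    """Fit weights and bias with plain batch gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def predict(x, w, b):
    """Classify as spam (1) when the predicted probability is at least 0.5."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0

w, b = train(X, y)
print([predict(x, w, b) for x in X])
```

The learned weight for "free" ends up positive and the one for "meeting" negative, which is how the model separates the two classes.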

Structure of NLP Code

1. Importing the Dataset

Typically, we separate independent variables (features X) from dependent variables (labels Y).
However, in NLP, data cleaning is done before separating variables.

2. Data Cleaning & Creating the Bag of Words Model

Tokenization: Splitting text into words/tokens.
Removing special characters and stop words.
Converting text to lowercase for uniformity.
Constructing a word frequency matrix.
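
The four steps above can be sketched in plain Python. The stop-word list here is a small hypothetical stand-in, and the two reviews are invented examples.

```python
import re

# Hypothetical mini stop-word list; the NLTK list is far longer.
STOP_WORDS = {"the", "is", "a", "was", "and"}

def clean(text):
    """Drop non-letters, lowercase, tokenize on whitespace, remove stop words."""
    tokens = re.sub(r"[^a-zA-Z]", " ", text).lower().split()
    return [t for t in tokens if t not in STOP_WORDS]

reviews = ["The food was GREAT!", "The service is slow, and the food is cold."]
cleaned = [clean(r) for r in reviews]

# Vocabulary in sorted order, then one row of counts per review.
vocab = sorted({t for tokens in cleaned for t in tokens})
matrix = [[tokens.count(word) for word in vocab] for tokens in cleaned]

print(vocab)
print(matrix)
```

Each row of `matrix` is one review and each column is one vocabulary word, which is the shape a classifier expects as input.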

3. Splitting the Dataset

Divide data into Training Set (used for learning) and Test Set (used for evaluation).
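
A minimal sketch of what such a split does, using only the standard library. The function name `split_dataset`, the 20% test fraction, and the seed are assumptions for illustration.

```python
import random

def split_dataset(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and cut it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

train, test = split_dataset(range(10))
print(len(train), len(test))
```

Shuffling before cutting matters: if the file is ordered by label, an unshuffled split could leave one class out of the test set.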

4. Training the Naïve Bayes Model

The Naïve Bayes classifier is trained on the dataset (can be replaced with other models as an exercise).
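
To show what training a Naïve Bayes classifier involves, here is a small from-scratch sketch. It uses the multinomial variant with add-one smoothing, which suits word counts; this is a swapped-in variant chosen for brevity, not the Gaussian one used in the code later in this document. The vocabulary and counts are invented.

```python
import math

def train_nb(X, y):
    """Estimate log priors and smoothed log word likelihoods per class."""
    model = {}
    for label in set(y):
        rows = [x for x, yi in zip(X, y) if yi == label]
        totals = [sum(col) for col in zip(*rows)]
        denom = sum(totals) + len(totals)  # add-one smoothing per word
        model[label] = (
            math.log(len(rows) / len(X)),
            [math.log((t + 1) / denom) for t in totals],
        )
    return model

def predict_nb(model, x):
    """Pick the class with the highest log posterior score."""
    def score(label):
        prior, likelihoods = model[label]
        return prior + sum(c * l for c, l in zip(x, likelihoods))
    return max(model, key=score)

# Toy counts over a hypothetical vocabulary ["good", "bad", "food"].
X = [[2, 0, 1], [1, 0, 1], [0, 2, 1], [0, 1, 0]]
y = [1, 1, 0, 0]  # 1 = positive review, 0 = negative review
model = train_nb(X, y)
print(predict_nb(model, [1, 0, 0]), predict_nb(model, [0, 1, 1]))
```

The "naïve" part is the assumption that words are independent given the class, which is why the score is a simple sum over words.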

5. Predicting Test Set Results

The trained model makes predictions on the test set.

6. Creating a Confusion Matrix

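The confusion matrix helps evaluate the model's performance.
Each cell counts the test reviews that have a given actual label and a given predicted label, so the diagonal holds the correct predictions. This is distinct from the Bag of Words matrix of step 2, where each cell is 0 (word not present) or a positive count (word present in the review).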
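
A minimal sketch of how a two-class confusion matrix is tallied, with rows for actual labels and columns for predicted labels. The label lists are invented for illustration.

```python
def confusion_matrix_2x2(y_true, y_pred):
    """Rows are actual labels (0, 1); columns are predicted labels (0, 1)."""
    cm = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        cm[actual][predicted] += 1
    return cm

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
cm = confusion_matrix_2x2(y_true, y_pred)

# Accuracy is the share of predictions that land on the diagonal.
accuracy = (cm[0][0] + cm[1][1]) / len(y_true)
print(cm, accuracy)
```

Off-diagonal cells separate the two kinds of mistakes: `cm[0][1]` counts negatives predicted as positive, and `cm[1][0]` counts positives predicted as negative.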

Tokenization

Tokenization is the process of breaking down text into smaller components (tokens), typically words or subwords, to make text analysis easier.
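
A simple word-level tokenizer can be sketched with a regular expression. The pattern below is one possible choice, not a standard: it lowercases the text and keeps an apostrophe inside a word.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens, keeping apostrophes inside words."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

print(tokenize("Don't worry, it's a great place!"))
```

Punctuation and whitespace are dropped because they never match the pattern, so only the words themselves remain.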

By following these structured steps, we can effectively implement NLP models for various tasks such as text classification, sentiment analysis, and machine translation.

Initial Imports and Data Loading

import numpy as np
import pandas as pd

# Tab-separated file; quoting=3 keeps double quotes in reviews as plain text.
dataset = pd.read_csv('/Users/nishanthapa/Desktop/AI_winter/NLP/hi.tsv', delimiter='\t', quoting=3)

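This loads the required libraries and reads a TSV (tab-separated values) file. The quoting=3 parameter (the value of csv.QUOTE_NONE) tells pandas to treat double quotes as ordinary characters rather than as field delimiters, so reviews containing quotes are read intact.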
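
The meaning of quoting=3 can be checked with the standard library's csv module, whose constants pandas reuses for this parameter. The one-row TSV below is a hypothetical example, and the sketch uses csv.reader rather than pandas to stay self-contained.

```python
import csv
import io

# quoting=3 is the numeric value of csv.QUOTE_NONE.
print(csv.QUOTE_NONE)

# Hypothetical one-row TSV with double quotes inside the review text.
raw = 'Review\tLiked\nThe "best" pizza\t1\n'
rows = list(csv.reader(io.StringIO(raw), delimiter="\t", quoting=csv.QUOTE_NONE))
print(rows[1])
```

With quoting disabled, the quotes around "best" survive as part of the review text instead of being interpreted by the parser.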

Text Cleaning Process

import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

These imports are for text preprocessing. The code will:

Use regular expressions (re) for text cleaning
Remove stopwords (common words like "the", "is", "at" that don't add much meaning)
Apply stemming (reducing words to their root form, e.g., "running" → "run")

Creating Clean Corpus

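corpus = []

# Build the stemmer and stop-word set once, outside the loop.
ps = PorterStemmer()
all_stopwords = stopwords.words('english')
all_stopwords.remove('not')  # keep "not": it flips the sentiment
all_stopwords = set(all_stopwords)

for i in range(len(dataset)):
    # Keep letters only, lowercase, and split into words.
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
    review = review.lower().split()
    # Drop stop words and stem what remains.
    review = [ps.stem(word) for word in review if word not in all_stopwords]
    corpus.append(' '.join(review))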

For each review, this code:

Removes all characters except letters
Converts text to lowercase
Splits into individual words
Removes stopwords (except "not" since it's important for sentiment)
Applies stemming
Joins words back together
Adds the cleaned review to the corpus
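
To see why "not" is kept, the sketch below cleans one invented review twice: once with "not" treated as a stop word and once without. The stop-word list is a small hypothetical subset, and stemming is omitted so the example runs without NLTK.

```python
import re

# Hypothetical stop-word subset; NLTK's English list also contains "not".
stop_words = ["the", "was", "not", "is", "this"]

def clean(review, stop_list):
    """Keep letters only, lowercase, split, and drop stop words."""
    stops = set(stop_list)
    words = re.sub("[^a-zA-Z]", " ", review).lower().split()
    return " ".join(w for w in words if w not in stops)

review = "The crust was not good."
keep_not = [w for w in stop_words if w != "not"]

print(clean(review, stop_words))  # "not" removed along with other stop words
print(clean(review, keep_not))    # "not" preserved
```

Without the exception, a negative review collapses into text that looks positive, which would mislead a sentiment classifier.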

Text Vectorization

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=1500)
x = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:,-1].values

This converts the text data into numerical form using the Bag of Words model, keeping the 1500 most frequent words. Each review becomes a vector where each position represents a word's frequency.
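
The effect of limiting the vocabulary to the most frequent words can be sketched with a Counter. This is a simplified stand-in, not scikit-learn's implementation: the helper `top_words` and the three-document corpus are hypothetical, and tie-breaking between equally frequent words may differ from CountVectorizer.

```python
from collections import Counter

corpus = ["good food good place", "bad food", "good servic bad price"]

def top_words(corpus, max_features):
    """Return the max_features most frequent words, sorted alphabetically."""
    counts = Counter(word for doc in corpus for word in doc.split())
    return sorted(word for word, _ in counts.most_common(max_features))

vocab = top_words(corpus, max_features=3)

# One row per document, one column per kept word.
vectors = [[doc.split().count(word) for word in vocab] for doc in corpus]
print(vocab)
print(vectors)
```

Rare words such as "place" and "price" are dropped, which keeps the matrix narrow and removes features that appear too seldom to help the classifier.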

Train-Test Split

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

Splits the data into training (80%) and test (20%) sets, so the model learns from most of the reviews and is evaluated on the held-out remainder.

Model Training and Prediction

from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)

Trains a Gaussian Naive Bayes classifier on the training set and makes predictions on the test set.

Evaluation

from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(accuracy_score(y_test, y_pred))