Natural Language Processing (NLP) is a branch of Artificial Intelligence (AI) that focuses on the interaction between computers and human language. It enables machines to understand, interpret, generate, and respond to human language in a meaningful and context-aware manner.
Types of NLP Approaches
Seq2Seq Models
One of the most powerful NLP architectures.
Used for applications like machine translation, text summarization, and chatbot responses.
Classical vs. Deep Learning Models
Classical NLP Models:
Rule-based approaches (e.g., if-else conditions in chatbots)
Statistical methods (e.g., Bag of Words for classification)
Deep Learning-based NLP (DNLP):
CNNs for text recognition and classification.
Sequence-to-sequence (Seq2Seq) models for tasks like machine translation and speech recognition.
Examples of NLP Applications
If-Else Chatbots: Basic rule-based chatbots that respond based on predefined conditions.
Audio Frequency Components Analysis: Used in speech recognition.
Bag of Words Model: Converts text into numerical vectors for classification tasks.
CNN for Text Recognition: Uses convolutional neural networks to classify text and images.
Seq2Seq Models: Handle many-to-many tasks such as language translation.
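To make the first item in the list above concrete, here is a minimal sketch of an if-else chatbot. The function name `rule_based_reply`, the keywords, and the replies are all hypothetical, chosen only for illustration.

```python
def rule_based_reply(message: str) -> str:
    """Return a canned reply chosen by simple keyword rules."""
    text = message.lower()
    # Check "hi" as a whole word so that words like "this" do not match.
    if "hello" in text or "hi" in text.split():
        return "Hello! How can I help you?"
    elif "hours" in text:
        return "We are open from 9am to 5pm."
    elif "bye" in text:
        return "Goodbye!"
    else:
        # No rule matched: a rule-based bot cannot generalize beyond its rules.
        return "Sorry, I did not understand that."


print(rule_based_reply("Hello there"))
print(rule_based_reply("What are your opening hours?"))
```

Every behavior has to be written out as an explicit rule, which is why such bots are easy to build but break on any phrasing the rules did not anticipate.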
Understanding the Bag of Words Model
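The Bag of Words (BoW) model is a fundamental NLP technique used for text classification. It represents each text as a numerical vector of word counts: every position in the vector corresponds to one vocabulary word, and the order of the words in the original text is discarded.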
Vector Representation
Each document is represented as a vector.
Special tokens like Start of Sentence (SOS) and End of Sentence (EOS) are included.
Words that are not in the vocabulary are counted in a single extra position at the end of the vector.
The model’s goal is to classify the text (e.g., Yes/No classification for spam detection).
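The vector layout described above can be sketched in plain Python. The six-entry vocabulary and the function name `to_bow_vector` are assumptions made for this example; a real vocabulary would hold thousands of words.

```python
from collections import Counter

# Hypothetical tiny vocabulary, including the SOS and EOS special tokens.
vocabulary = ["<SOS>", "<EOS>", "free", "money", "meeting", "now"]


def to_bow_vector(text: str) -> list[int]:
    """Count vocabulary words; the last slot counts out-of-vocabulary words."""
    tokens = ["<SOS>"] + text.lower().split() + ["<EOS>"]
    counts = Counter(tokens)
    vector = [counts[word] for word in vocabulary]
    # Everything not matched by the vocabulary goes into one extra slot.
    vector.append(len(tokens) - sum(vector))
    return vector


print(to_bow_vector("Free money now now"))
# [1, 1, 1, 1, 0, 2, 0]
print(to_bow_vector("free lunch"))
# [1, 1, 1, 0, 0, 0, 1]
```

Both outputs have the same length regardless of how long the input text is, which is what lets a classifier consume them as fixed-size feature vectors.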
Process of Bag of Words Model
Convert Emails into Vectors
Each email is transformed into a vector of length 20,000, with one position per vocabulary word holding the number of times that word occurs.
Apply a Model
A simple approach is to use Logistic Regression to classify the text (e.g., spam detection).
An alternative approach is a Neural Network whose input layer has 20,000 neurons, one per vector position.
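The two steps above might look like the following with scikit-learn. This is a sketch on four made-up emails, so the vectors have 15 positions rather than 20,000; the example texts and labels are assumptions, not data from the original dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy emails; 1 = spam, 0 = not spam.
emails = [
    "win free money now",
    "free prize claim now",
    "meeting agenda for monday",
    "lunch with the project team",
]
labels = [1, 1, 0, 0]

# Step 1: convert each email into a vector of word counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)
print(X.shape)  # (number of emails, vocabulary size)

# Step 2: fit a logistic regression classifier on those vectors.
clf = LogisticRegression()
clf.fit(X, labels)

# A new email must go through the same vectorizer before prediction.
new_vec = vectorizer.transform(["claim your free money"])
prediction = clf.predict(new_vec)[0]
print(prediction)
```

Words the vectorizer never saw during fitting (here "your") are simply ignored by `transform`, so the new vector always has the same length as the training vectors.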
Structure of NLP Code
1. Importing the Dataset
Typically, we separate independent variables (features X) from dependent variables (labels Y).
However, in NLP, data cleaning is done before separating variables.
2. Data Cleaning & Creating the Bag of Words Model
Tokenization: Splitting text into words/tokens.
Removing special characters and stop words.
Converting text to lowercase for uniformity.
Constructing a word frequency matrix.
3. Splitting the Dataset
Divide data into Training Set (used for learning) and Test Set (used for evaluation).
4. Training the Naïve Bayes Model
The Naïve Bayes classifier is trained on the dataset (can be replaced with other models as an exercise).
5. Predicting Test Set Results
The trained model makes predictions on the test set.
6. Creating a Confusion Matrix
The confusion matrix helps evaluate the model's performance.
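Note on the Bag of Words matrix from step 2: in its binary form, each cell contains either 0 (word not present in the review) or 1 (word present in the review); with raw counts, a cell holds the number of occurrences instead.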
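To show what the confusion matrix in step 6 contains, here is a small hand-rolled version for binary labels. The helper name `confusion_matrix_2x2` and the label lists are hypothetical; the layout (rows are actual classes, columns are predicted classes) follows the convention scikit-learn uses.

```python
def confusion_matrix_2x2(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


# Hypothetical true labels and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix_2x2(y_true, y_pred)
# Correct predictions sit on the diagonal of the matrix.
accuracy = (cm[0][0] + cm[1][1]) / len(y_true)

print(cm)        # [[3, 1], [1, 3]]
print(accuracy)  # 0.75
```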
Tokenization
Tokenization is the process of breaking down text into smaller components (tokens), typically words or subwords, to make text analysis easier.
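As a minimal illustration of word-level tokenization, the sketch below lowercases a sentence and extracts runs of letters with a regular expression. The function name `tokenize` is hypothetical, and real tokenizers handle contractions, numbers, and subwords more carefully.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and return each run of letters as one token."""
    return re.findall(r"[a-z]+", text.lower())


print(tokenize("Wow... Loved this place!"))
# ['wow', 'loved', 'this', 'place']
```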
By following these structured steps, we can effectively implement NLP models for various tasks such as text classification, sentiment analysis, and machine translation.
Initial Imports and Data Loading
import nltk.stem.porter
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
dataset = pd.read_csv('/Users/nishanthapa/Desktop/AI_winter/NLP/hi.tsv', delimiter='\t', quoting=3)
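This loads the required libraries and reads a TSV (tab-separated values) file. The quoting=3 parameter (equivalent to csv.QUOTE_NONE) tells pandas to treat double quotes as ordinary characters rather than as field delimiters, so quotes inside a review do not break parsing.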
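The value 3 comes from Python's standard `csv` module, whose quoting constants pandas reuses. The short sketch below checks that and parses one made-up row with the standard-library reader; the `Liked` column name and the sample review are assumptions for illustration.

```python
import csv
import io

# quoting=3 is the numeric value of csv.QUOTE_NONE.
print(csv.QUOTE_NONE)  # 3

# A hypothetical two-line TSV with double quotes inside the review text.
sample = 'Review\tLiked\nThe "best" pizza ever\t1\n'
rows = list(csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE))

# The quotes survive as ordinary characters in the parsed field.
print(rows[1])
# ['The "best" pizza ever', '1']
```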
Text Cleaning Process
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
These imports are for text preprocessing. The code will:
Use regular expressions (re) for text cleaning
Remove stopwords (common words like "the", "is", "at" that don't add much meaning)
Apply stemming (reducing words to their root form, e.g., "running" → "run")
Creating Clean Corpus
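corpus = []
ps = PorterStemmer()
# Build the stopword set once; keep 'not' because it carries sentiment
all_stopwords = set(stopwords.words('english'))
all_stopwords.discard('not')
for i in range(0, 1000):
    # Replace every non-letter character with a space
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
    # Lowercase and split into individual words
    review = review.lower().split()
    # Drop stopwords and stem the remaining words
    review = [ps.stem(word) for word in review if word not in all_stopwords]
    # Rejoin the stemmed words and store the cleaned review
    corpus.append(' '.join(review))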
For each review, this code:
Removes all characters except letters
Converts text to lowercase
Splits into individual words
Removes stopwords (except "not" since it's important for sentiment)
Applies stemming
Joins words back together
Adds the cleaned review to the corpus
Text Vectorization
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=1500)
x = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:,-1].values
This converts the text data into numerical form using the Bag of Words model, keeping the 1500 most frequent words. Each review becomes a vector where each position represents a word's frequency.
Train-Test Split
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.8, random_state=0)
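Splits the data into training (20%) and testing (80%) sets. Note that test_size is the fraction held out for testing, so 0.8 leaves only 20% of the reviews for training; a more common choice is test_size=0.2, which trains on 80% and tests on 20%.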
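The effect of `test_size` on the split can be checked directly. The sketch below uses 1000 dummy samples (an assumption mirroring the 1000 reviews processed above) and prints the set sizes for both 0.8 and 0.2.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy features and labels: 1000 samples, matching the number of reviews.
demo_x = np.arange(1000).reshape(-1, 1)
demo_y = np.arange(1000) % 2

sizes = {}
for size in (0.8, 0.2):
    x_tr, x_te, y_tr, y_te = train_test_split(
        demo_x, demo_y, test_size=size, random_state=0
    )
    sizes[size] = (len(x_tr), len(x_te))
    print(size, len(x_tr), len(x_te))
# 0.8 200 800
# 0.2 800 200
```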
The final step trains a Naive Bayes classifier on the training set and makes predictions on the test set.
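The code for this step is not shown in the text, so the following is only a sketch of how it might look with scikit-learn. It assumes the GaussianNB variant (MultinomialNB is another common choice for word counts) and uses a small made-up corpus with hypothetical labels so that it runs on its own.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Hypothetical, already-cleaned reviews; 1 = positive, 0 = negative.
toy_corpus = [
    "love place", "great food", "love food", "great place",
    "not good", "bad food", "not love place", "bad servic",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Bag of Words features, then a stratified train/test split.
x = CountVectorizer().fit_transform(toy_corpus).toarray()
x_train, x_test, y_train, y_test = train_test_split(
    x, toy_labels, test_size=0.25, random_state=0, stratify=toy_labels
)

# Train the classifier and predict the held-out reviews.
classifier = GaussianNB()
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(cm)
print(accuracy_score(y_test, y_pred))
```

With the real dataset, `x` and `y` from the vectorization step would replace the toy data, and the confusion matrix would summarize how many reviews were classified correctly and incorrectly in each class.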