Most classic ML algorithms can't take raw text as input, so we introduce feature "extraction": converting the raw text into numerical features that we can pass to the ML algorithm.
For example, we can count the occurrence of each word to map text to numbers.
Let's discuss count vectorization along with term frequency-inverse document frequency (TF-IDF).
messages = ["hey, let's go to the game today!", "call your sister", "want to go to walk your dogs"]
Count vectorizer: counts the occurrences of unique words. It treats each unique word as a feature, then counts each unique word across the documents, i.e. each string in the array.
For a large set of documents, also known as a corpus, we end up with a sparse matrix: a matrix with lots of zeros.
This sort of matrix is known as a document-term matrix, or DTM.
We're just counting the number of times each unique word from the vocabulary appears in each document.
An alternative to the count vectorizer is the TF-IDF vectorizer, i.e. the term frequency-inverse document frequency vectorizer.
TF-IDF vectorizer:
It also creates a document term matrix from our messages
However, instead of just filling the document-term matrix (DTM) with token counts, it calculates a term frequency-inverse document frequency value for each word.
Let's talk about what TF-IDF means.
Term frequency tf(t,d):
A function of a term t in a particular document d (here, each sentence):
the number of times that term t occurs in document d.
However, term frequency alone is not enough for a thorough feature analysis of text.
For example, consider stop words like "a" or "the".
Because the term "the" is so common, term frequency will tend to incorrectly emphasize documents that happen to use the word "the" more frequently, without giving enough weight to more meaningful terms like "red" or "dogs".
An inverse document frequency factor is incorporated, which:
diminishes the weight of terms that occur very frequently across the documents, and
increases the weight of terms that occur rarely.
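The standard textbook definition combines the two factors multiplicatively (scikit-learn's implementation adds smoothing on top of this, so its exact numbers differ slightly):

```latex
\mathrm{tf}(t,d) = \text{number of times term } t \text{ occurs in document } d

\mathrm{idf}(t,D) = \log \frac{N}{\lvert \{ d \in D : t \in d \} \rvert}

\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t,D)
```

where N is the total number of documents in the corpus D. A term that appears in every document gets idf = log(1) = 0, which is exactly how stop words like "the" get down-weighted.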
How to implement it in code:
from sklearn.feature_extraction.text import TfidfVectorizer
Feature Extraction
Series on NLP #4