Implementing supervised learning to build a model that can identify whether a movie review is positive or negative.

Our Procedure Overview:

Using Pandas, we read the text file that contains our data set.
We then remove any rows whose review is NaN (not a number) or only whitespace.
Using the scikit-learn (sklearn) library:
we split the data into training and testing sets (for example 70% and 30% respectively).
Next, we vectorize the text using feature extraction:
text => numerical values that the computer can understand
Once the vectorization is done, we fit the model, i.e. we train it.
The model is now ready to test; we test it on the held-out test set.
Evaluate the predicted results against the actual results:
Calculate the confusion matrix
Calculate the classification report
Find the accuracy_score as well
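
The reading and cleaning steps above might look roughly like the sketch below. The real data file and its column names are not given here, so the toy DataFrame and the column names "label" and "review" are assumptions made only for illustration; with a real file you would build df with a pandas reader such as pd.read_csv instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real review file.
# The column names "label" and "review" are assumptions.
df = pd.DataFrame({
    "label": ["pos", "neg", "pos", "neg", "pos"],
    "review": ["A wonderful film", np.nan, "   ", "Dull and far too long", ""],
})

# Drop rows where the review is missing (NaN).
df = df.dropna(subset=["review"])

# Drop rows where the review is empty or contains only whitespace.
df = df[df["review"].str.strip() != ""]

# Features (raw text) and labels used in the later steps.
X = df["review"]
y = df["label"]
```

Removing the NaN rows first matters here, because the whitespace check relies on every remaining review being a string.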

We need to vectorize our training set before we can train our model.

The same applies when testing: the computer cannot understand plain text, so we need to vectorize the testing set as well. Since this process is repeated multiple times, sklearn provides an alternative through a pipeline. For example, the manual steps look like this:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()

count_vect.fit(X_train)

X_train_counts = count_vect.transform(X_train)

Here we had to fit and then transform.

We can also do both in one step using fit_transform:

X_train_counts = count_vect.fit_transform(X_train)
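
To check that the two approaches really give the same result, here is a minimal sketch on a made-up three-document corpus (the docs list is hypothetical and is not part of the movie review data):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for illustration.
docs = ["good movie", "bad movie", "good good plot"]

# Two-step version: learn the vocabulary, then count the words.
two_step = CountVectorizer()
two_step.fit(docs)
counts_a = two_step.transform(docs)

# One-step version: learn the vocabulary and count in a single call.
one_step = CountVectorizer()
counts_b = one_step.fit_transform(docs)

# Both give the same vocabulary and the same document-term matrix.
same_vocab = two_step.vocabulary_ == one_step.vocabulary_
same_counts = bool((counts_a.toarray() == counts_b.toarray()).all())
print(same_vocab, same_counts)
```

The result is a sparse matrix with one row per document and one column per word in the vocabulary.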

To give more weight to the more important words, we next apply term frequency-inverse document frequency (TF-IDF):

from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()

X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
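
As a small illustration of the weighting, the sketch below uses a made-up corpus (the docs list is hypothetical) to show that a word appearing in fewer documents gets a higher weight than a more common word in the same document. It also shows that sklearn's TfidfVectorizer, with default settings, combines the counting and TF-IDF steps into one.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus, used only for illustration.
docs = ["good movie", "bad movie", "good good plot"]

# Two steps: raw counts first, then TF-IDF weighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One step: TfidfVectorizer does both with its default settings.
vectorizer = TfidfVectorizer()
one_step = vectorizer.fit_transform(docs)
equivalent = np.allclose(two_step.toarray(), one_step.toarray())

# In the document "bad movie", "bad" occurs in one document overall
# while "movie" occurs in two, so "bad" receives the larger weight.
row = one_step.toarray()[1]
bad_weight = row[vectorizer.vocabulary_["bad"]]
movie_weight = row[vectorizer.vocabulary_["movie"]]
print(equivalent, bad_weight > movie_weight)
```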

Finally, we import a model/classifier and train it:

from sklearn.svm import LinearSVC

clf = LinearSVC()

clf.fit(X_train_tfidf, y_train)

Up to this point, only our training set has been vectorized into a full vocabulary. To analyze our test set, we would have to push it through the same steps again (calling transform rather than fit, so that the training vocabulary is reused), which can be tiresome, especially if you have a long process.
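
For reference, the manual version of testing could look roughly like this. The tiny training and test lists are hypothetical stand-ins for the real split, and the object names follow the ones used above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the real train/test split.
X_train = [
    "great film loved it",
    "wonderful acting great plot",
    "terrible film hated it",
    "awful acting boring plot",
]
y_train = ["pos", "pos", "neg", "neg"]
X_test = ["great wonderful film", "boring terrible plot"]

count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer()
clf = LinearSVC()

# Training: fit each step on the training data only.
X_train_counts = count_vect.fit_transform(X_train)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf.fit(X_train_tfidf, y_train)

# Testing: only transform, so the test set is mapped onto the
# vocabulary and IDF weights learned from the training set.
X_test_counts = count_vect.transform(X_test)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
predictions = clf.predict(X_test_tfidf)
print(predictions)
```

Calling fit again on the test set would build a different vocabulary, and the columns would no longer line up with what the classifier was trained on.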

So sklearn provides us with a Pipeline. All the steps above can be written concisely like this:

from sklearn.pipeline import Pipeline

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.svm import LinearSVC

text_clf = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])

text_clf.fit(X_train, y_train)

predictions = text_clf.predict(X_test)
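
The evaluation steps from the overview (confusion matrix, classification report and accuracy_score) can then be run on the predictions. Below is a minimal, self-contained sketch; the toy reviews and labels are hypothetical and only stand in for the real train/test split, so the printed numbers say nothing about the real data set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the real train/test split.
X_train = [
    "great film loved it",
    "wonderful acting great plot",
    "a brilliant and moving story",
    "terrible film hated it",
    "awful acting boring plot",
    "a dull and lifeless story",
]
y_train = ["pos", "pos", "pos", "neg", "neg", "neg"]
X_test = ["loved the wonderful story", "boring and awful film"]
y_test = ["pos", "neg"]

# Same pipeline as above: vectorize with TF-IDF, then classify.
text_clf = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])
text_clf.fit(X_train, y_train)
predictions = text_clf.predict(X_test)

# Rows are the actual classes, columns are the predicted classes,
# in the order given by labels.
cm = confusion_matrix(y_test, predictions, labels=["neg", "pos"])

# Precision, recall and F1 score per class.
report = classification_report(y_test, predictions, zero_division=0)

# Fraction of test reviews that were classified correctly.
acc = accuracy_score(y_test, predictions)

print(cm)
print(report)
print(acc)
```

Because the pipeline handles the vectorization internally, predict can be called directly on the raw test reviews.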