Implementing supervised learning to build a model that can identify whether a movie review is positive or negative
Our Procedure Overview:
Using pandas, we read the text files that contain our data set.
We then remove any review that is NaN (not a number) or consists only of whitespace.
Using the scikit-learn (sklearn) library, we split the data into training and testing sets, i.e. 70% and 30% respectively.
Next, we vectorise the text using feature extraction:
text => numerical values that the computer can understand
Once the vectorisation is done, we fit the model, i.e. we train it.
The model is now ready to test; we test it on the held-out 30% of our data set.
We evaluate the predicted results against the actual results:
Calculate the confusion matrix
Calculate the classification report
Find the accuracy_score as well
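The first few steps (reading, cleaning, splitting) can be sketched roughly as follows. This is only a minimal sketch: the column names `label` and `review`, the tiny in-memory data set, and the file name mentioned in the comment are all assumptions made for illustration, not the actual data set used here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real data set. In practice it might be loaded
# with something like pd.read_csv("reviews.tsv", sep="\t") (file name assumed).
df = pd.DataFrame({
    "label": ["pos", "neg", "pos", "neg", "pos", "neg",
              "pos", "neg", "pos", "neg", "pos", "neg"],
    "review": [
        "a wonderful, moving film", "dull and far too long",
        "great acting and a clever plot", "a complete waste of time",
        np.nan, "   ",
        "I loved every minute", "terrible script",
        "funny and heartfelt", "boring from start to finish",
        "a real delight", "awful pacing",
    ],
})

# Drop rows whose review is NaN, then rows whose review is only whitespace.
df = df.dropna(subset=["review"])
df = df[df["review"].str.strip() != ""]

X = df["review"]
y = df["label"]

# 70% training / 30% testing; random_state is fixed only so the split repeats.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(len(X_train), len(X_test))
```

Fixing `random_state` is a choice made here for repeatability; leaving it out gives a different split on each run.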
Here, we need to vectorize our training set before we can train our model.
When testing, the computer again cannot understand plain text, so we need to vectorize the testing set as well. Since this process is repeated multiple times, sklearn provides an alternative through its Pipeline.
Finally, you import a model/classifier, then train it and test it:
from sklearn.svm import LinearSVC

# X_train_tfidf: vectorized training reviews; y_train: their labels
clf = LinearSVC()
clf.fit(X_train_tfidf, y_train)
Up to this point, only our training set has been vectorized into a full vocabulary. To perform an analysis on our test set, we would actually have to push it through the same vectorization steps again (reusing the vocabulary learned from the training set), which can be tiresome, especially if the process is long.
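To make the repeated work concrete, here is a rough sketch of the manual route: vectorize the training set, train, vectorize the test set with the same fitted vectorizer, then evaluate. The tiny data set and the variable names `vectorizer`, `X_test_tfidf` and `predictions` are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.svm import LinearSVC

# Tiny made-up data set, for illustration only.
X_train = ["a wonderful film", "great acting", "loved it",
           "dull and boring", "terrible script", "awful film"]
y_train = ["pos", "pos", "pos", "neg", "neg", "neg"]
X_test = ["great film", "boring script"]
y_test = ["pos", "neg"]

vectorizer = TfidfVectorizer()
# Learn the vocabulary and weights from the training text and vectorize it.
X_train_tfidf = vectorizer.fit_transform(X_train)
# Reuse the fitted vectorizer on the test text: transform only, no refitting.
X_test_tfidf = vectorizer.transform(X_test)

clf = LinearSVC()
clf.fit(X_train_tfidf, y_train)
predictions = clf.predict(X_test_tfidf)

# Compare predicted labels with the actual labels.
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
print(accuracy_score(y_test, predictions))
```

Note that the test set goes through `transform`, not `fit_transform`, so both matrices share the same columns.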
So sklearn provides us with a Pipeline here. All the steps above can be written concisely like this:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
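Putting those imports to use, a pipeline might look something like the sketch below. The step names `"tfidf"` and `"clf"`, the variable name `text_clf` and the tiny data set are assumptions chosen for illustration.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Tiny made-up data set, for illustration only.
X_train = ["a wonderful film", "great acting", "loved it",
           "dull and boring", "terrible script", "awful film"]
y_train = ["pos", "pos", "pos", "neg", "neg", "neg"]
X_test = ["great film", "boring script"]

# Each step is a (name, estimator) pair; the last step is the classifier.
text_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC()),
])

# fit() vectorizes the raw training text and trains the classifier in one call.
text_clf.fit(X_train, y_train)

# predict() vectorizes the raw test text with the fitted vectorizer, then classifies.
predictions = text_clf.predict(X_test)
print(predictions)
```

The pipeline takes raw text directly, so there is no separate vectorization step to remember for the test set.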
When vectorizing, we had to fit the vectorizer and then transform the text as two separate steps.
We can also do both in one step using fit_transform.
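As a small illustration of the two routes, the sketch below uses `CountVectorizer`, which is an assumption here since the notes do not name the vectorizer used at this step. Both routes should give the same count matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up reviews, for illustration only.
reviews = ["a great great film", "a dull film"]

# Route 1: fit (learn the vocabulary), then transform (count the words).
two_step = CountVectorizer()
two_step.fit(reviews)
counts_a = two_step.transform(reviews)

# Route 2: fit_transform does both in a single call.
one_step = CountVectorizer()
counts_b = one_step.fit_transform(reviews)

# Both routes produce the same document-term matrix.
same = np.array_equal(counts_a.toarray(), counts_b.toarray())
print(same)
```

With the default settings, single-character tokens such as "a" are ignored, so the vocabulary here contains only "dull", "film" and "great".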
To give more weight to the more important words,
we next need to use term frequency-inverse document frequency (TF-IDF).
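A quick way to see the weighting in action is the sketch below, which assumes `TfidfVectorizer` with its default settings and two made-up reviews. The word "film" appears in every review, so within the first review it should end up with a lower weight than "great", which appears in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two made-up reviews, for illustration only.
reviews = ["great film", "dull film"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(reviews)

# Look up each word's column via the learned vocabulary.
vocab = vectorizer.vocabulary_
first_review = tfidf.toarray()[0]
weight_great = first_review[vocab["great"]]
weight_film = first_review[vocab["film"]]

# "great" is rarer across the reviews than "film", so it is weighted higher.
print(weight_great, weight_film)
```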