How can we evaluate our model after training it?
#ModelEvaluation #Series #1
Typically, in any classification task your model can only achieve one of two results:
Correct in its prediction
Incorrect in its prediction
Fortunately, correct vs. incorrect also extends to situations where you have multiple classes. It does not matter how many classes or categories you are trying to predict; for any single prediction your model fundamentally only has two outputs.
Correct ✅ Output
Incorrect ❌ Output
For example, take binary classification: spam vs. ham (i.e., a legitimate message).
Since this is classification, it is supervised learning, so we will train the model on around 70% of the dataset and then test it on our held-out testing data.
Note: raw text is converted into numerical information first, i.e. vectorization.
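As a minimal sketch of that setup, assuming scikit-learn (the toy messages, labels, and variable names below are illustrative, not from the original post):

```python
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy data -- purely illustrative
messages = [
    "win a free prize now",
    "are we still meeting for lunch?",
    "claim your reward, click here",
    "see you at the office tomorrow",
]
labels = ["spam", "ham", "spam", "ham"]

# Vectorization: convert raw text into numerical word counts
# (vectorizing before the split here only for brevity)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)

# Train on roughly 70% of the data, keep the rest for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, train_size=0.7, random_state=42
)

model = MultinomialNB()
model.fit(X_train, y_train)
```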
Once we train our model, let's see the output.
Model Prediction:
It predicts HAM; compared with the correct label HAM, we have a correct match ✅
It predicts HAM; compared with the correct label SPAM, we have an incorrect match ❌
At the end we have a count of correct matches and a count of incorrect matches.
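Continuing the illustrative sketch above, counting correct vs. incorrect matches might look like this:

```python
# Compare each prediction against its true label
predictions = model.predict(X_test)

correct = sum(pred == true for pred, true in zip(predictions, y_test))
incorrect = len(y_test) - correct

print(f"Correct: {correct}, Incorrect: {incorrect}")
```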
Important: this is the most fundamental part. In the real world, not all correct or incorrect matches hold equal value!
This is why we have various classification metrics. It is not enough to know how many predictions were correct versus incorrect; we need to take various ratios into account.
A single metric won't tell a complete story.
Key Classification Metrics:
Accuracy:
The number of correct predictions made by the model divided by the total number of predictions, i.e. total correct predictions / total predictions.
For example, if our model correctly predicted 80 out of 100 messages, we have 80/100, i.e. 0.8, or an 80% accurate model.
Accuracy is useful when the target classes are well balanced, for example if we had roughly the same number of spam messages as legitimate messages.
It is not a good choice with unbalanced classes. For example, if we had 99 legitimate ham messages and 1 spam message, a model that simply always predicted HAM would get 99% accuracy!
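A minimal sketch of that imbalance trap, assuming scikit-learn (the labels are made up for illustration):

```python
from sklearn.metrics import accuracy_score

# 99 ham messages and 1 spam message
y_true = ["ham"] * 99 + ["spam"]

# A "model" that always predicts ham still scores 99% accuracy
y_pred = ["ham"] * 100

print(accuracy_score(y_true, y_pred))  # 0.99
```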
In this context, we want to understand recall and precision.
Recall:
Ability of a model to find all the relevant cases within a dataset.
Number of true positives / (number of true positives + number of false negatives)
Precision:
Ability of a model to find only the relevant cases within a dataset.
Number of true positives / (number of true positives + number of false positives)
F1 Score:
In cases where we want an optimal blend of precision and recall, we can combine the two metrics using what is called the F1 score: the harmonic mean of precision and recall, F1 = 2 * (precision * recall) / (precision + recall).
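As a minimal sketch of all three metrics together, assuming scikit-learn (the labels are illustrative and "spam" is treated as the positive class):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = ["spam", "spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "ham", "ham", "spam", "spam"]

precision = precision_score(y_true, y_pred, pos_label="spam")  # TP / (TP + FP)
recall = recall_score(y_true, y_pred, pos_label="spam")        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred, pos_label="spam")                # harmonic mean

print(precision, recall, f1)  # 0.67, 0.67, 0.67 for this toy example
```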
Often you have a trade-off between Recall and Precision.
Recall expresses the ability to find all relevant instances in a dataset.
Precision expresses the proportion of the data points our model says are relevant that actually are relevant.
Precision and Recall typically make more sense in the context of a confusion matrix.
We can organize our predicted values against the real values in a confusion matrix, which we will discuss in a future part of this series.
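As a quick preview (again assuming scikit-learn and the same illustrative labels as above), building a confusion matrix might look like this:

```python
from sklearn.metrics import confusion_matrix

y_true = ["spam", "spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "ham", "ham", "spam", "spam"]

# Rows = actual classes, columns = predicted classes
print(confusion_matrix(y_true, y_pred, labels=["ham", "spam"]))
# [[2 1]
#  [1 2]]
```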