Intro to Deep Learning — Performance Metrics (Precision, Recall, F1, ROC, PR, PRG)

This article introduces commonly used performance metrics in deep learning and machine learning. Choosing an appropriate performance metric is important for comparing models and identifying the best one for a given problem. To understand the popular metrics — accuracy, precision, recall, F1, etc. — let’s first go over the confusion matrix.

Confusion matrix

A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data. There are four entries in a confusion matrix — true positive, false positive, false negative, and true negative. A true positive is an outcome where the model correctly predicts the positive class, and a true negative is an outcome where the model correctly predicts the negative class. A false positive is an outcome where the model predicts the positive class but the actual class is negative, and a false negative is an outcome where the model predicts the negative class but the actual class is positive.
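The four entries can be computed directly with scikit-learn. A minimal sketch with illustrative toy labels (the label values below are made up for demonstration):

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = positive class, 0 = negative class (illustrative values only)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, scikit-learn orders the matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")  # TP=3, FP=1, FN=1, TN=3
```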

[Figure: confusion matrix]

Accuracy, Precision

Accuracy measures the number of samples the model correctly predicted (true positive + true negative) over the total number of samples.

Precision measures the number of true positives over all samples the model predicted as positive (true positive + false positive). It can be thought of as the probability that a sample the model predicted as positive is actually positive.
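Both definitions reduce to simple arithmetic on the confusion-matrix entries. A short sketch using hypothetical counts:

```python
# Hypothetical confusion-matrix counts, for illustration only
tp, fp, fn, tn = 3, 1, 1, 3

# Accuracy: correct predictions over all predictions
accuracy = (tp + tn) / (tp + fp + fn + tn)

# Precision: of everything predicted positive, how much actually was
precision = tp / (tp + fp)

print(accuracy, precision)  # 0.75 0.75
```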

Recall

Recall measures the number of true positives over all actually positive samples (true positive + false negative). It can be thought of as the fraction of positive samples the model correctly identified.

Precision and recall don’t consider true negatives. To get high precision, the model needs to reduce false positives (i.e., cases where the model predicts positive but the actual class is negative). A good example application where precision is an appropriate metric is a spam email filter. To get high recall, the model needs to reduce false negatives (i.e., cases where the model predicts negative but the actual class is positive). Achieving high recall is important in applications where false negatives are costly, such as disease diagnosis.
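scikit-learn provides both metrics out of the box. A minimal sketch reusing the illustrative toy labels from above:

```python
from sklearn.metrics import precision_score, recall_score

# Same illustrative toy labels as before (TP=3, FP=1, FN=1, TN=3)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3/4
r = recall_score(y_true, y_pred)     # TP / (TP + FN) = 3/4
print(p, r)  # 0.75 0.75
```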

F1 Score

F1 score combines precision and recall and is defined as the harmonic mean of the two. Like precision and recall, the F1 score doesn’t consider true negatives, and it is commonly used in natural language processing, information retrieval systems, etc.
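The harmonic-mean definition can be verified against scikit-learn’s `f1_score`. A sketch with the same illustrative toy labels:

```python
from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

precision, recall = 0.75, 0.75  # computed from the labels above
f1_manual = 2 * precision * recall / (precision + recall)  # harmonic mean
f1_lib = f1_score(y_true, y_pred)

print(f1_manual, f1_lib)  # both 0.75 here, since precision == recall
```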

Now let’s look at two additional tools, the ROC and PR curves, which are common ways to compare models that output probabilities for two-class problems.

The ROC (Receiver Operating Characteristic) curve represents the relation between the true positive rate (y-axis) and the false positive rate (x-axis), where true positive rate = true positive / (true positive + false negative) and false positive rate = false positive / (false positive + true negative). A good ROC curve rises steeply near the origin, achieving a high true positive rate while the false positive rate is still small.
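Computing the curve and the area under it (AUROC) is straightforward with scikit-learn. A sketch using a tiny set of hypothetical predicted probabilities:

```python
from sklearn.metrics import roc_curve, auc

# Hypothetical predicted probabilities for the positive class
y_true  = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)  # area under the ROC curve
print(roc_auc)  # 0.75
```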

The PR (Precision-Recall) curve represents the relation between precision and recall, where precision is the y-axis and recall is the x-axis. Therefore, a good PR curve stays close to the top-right corner, maintaining high precision as recall increases.
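The PR curve and its area can be computed similarly; average precision is scikit-learn’s step-wise summary of the area under it. A sketch with the same hypothetical probabilities as above:

```python
from sklearn.metrics import precision_recall_curve, average_precision_score

# Same hypothetical predicted probabilities as in the ROC example
y_true  = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
ap = average_precision_score(y_true, y_score)  # step-wise area under the PR curve
print(ap)
```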

Let’s compare two binary classifiers, AdaBoost and logistic regression, on an OpenML dataset by plotting the ROC and PR curves for each.

[Figure: ROC curves of logistic regression and AdaBoost]
[Figure: PR curves of logistic regression and AdaBoost]

PRG (PR Gain)

Recently, a new metric, the PRG (Precision-Recall-Gain) curve, was introduced. Flach and Kull claimed that the PR Gain curve provides a better comparison between different models and enables the identification of inferior ones. To empirically check this claim, let’s compare AUROC (area under the ROC curve), AUPR (area under the PR curve), and AUPRG (area under the PRG curve) for the two classifiers.
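The PRG transform rescales precision and recall relative to the positive-class prevalence, so the always-positive baseline maps to gain 0 and a perfect score to gain 1. A minimal sketch of the gain transforms from Flach and Kull’s paper (the function names are my own):

```python
# Precision/recall "gain" transforms; pi is the positive-class prevalence.
# A precision equal to pi (the always-positive baseline) maps to gain 0,
# and a perfect precision of 1 maps to gain 1; likewise for recall.
def precision_gain(precision: float, pi: float) -> float:
    return (precision - pi) / ((1 - pi) * precision)

def recall_gain(recall: float, pi: float) -> float:
    return (recall - pi) / ((1 - pi) * recall)

pi = 0.5  # balanced classes, for illustration
print(precision_gain(1.0, pi))  # 1.0 — perfect precision
print(precision_gain(0.5, pi))  # 0.0 — precision equal to prevalence
```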

From the AUPR scores, it is observed that AUPR favours the model with the lower expected F1 score (AdaBoost) over the other model (logistic regression). Logistic regression and AdaBoost also show similar AUPR scores, so it would be hard to select the better model by comparing AUPR alone. AUPRG separates the two classifiers more clearly, which supports the claim that PRG curves are the better choice in practice.
