Intro to Deep Learning: performance metrics (Precision, Recall, F1, ROC, PR, PRG)
This article introduces commonly used performance metrics in deep learning and machine learning. Choosing an appropriate performance metric is important for comparing models and identifying the best one for a given problem. To understand the popular metrics (accuracy, precision, recall, F1, etc.), let’s first go over the confusion matrix.
A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data. For a binary classifier it has four entries: true positive, false positive, false negative, and true negative. A true positive is an outcome where the model correctly predicts the positive class. Similarly, a true negative is an outcome where the model correctly predicts the negative class. A false positive is an outcome where the model predicts the positive class for an example that is actually negative. A false negative is an outcome where the model predicts the negative class for an example that is actually positive.
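As a minimal sketch of how the four entries are counted, the snippet below tallies them from two small lists of labels. The function name `confusion_counts` and the toy labels are made up for illustration and are not part of the experiment later in this article.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for binary labels."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # predicted positive, actually positive
        elif pred == positive:
            fp += 1  # predicted positive, actually negative
        elif truth == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn


# toy labels, assumed for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print("TP:", tp, "FP:", fp, "FN:", fn, "TN:", tn)
```

The four counts always add up to the number of examples, which is a quick sanity check when building a confusion matrix by hand.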
Accuracy is the number of examples the model predicted correctly (true positive + true negative) divided by the total number of examples.
Precision is the number of true positives divided by the number of examples the model predicted as positive (true positive + false positive). It can be thought of as the probability that an example the model predicted as positive is actually positive.
Recall is the number of true positives divided by the number of examples that are actually positive (true positive + false negative). It can be thought of as the fraction of the actual positives that the model correctly identified.
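The three definitions above can be sketched directly from the confusion matrix counts. The counts used here (40, 10, 20, 30) are invented for illustration, and the helper names are hypothetical rather than taken from any library.

```python
def accuracy(tp, fp, fn, tn):
    # correct predictions over all predictions
    return (tp + tn) / (tp + fp + fn + tn)


def precision(tp, fp):
    # true positives over everything predicted as positive
    return tp / (tp + fp)


def recall(tp, fn):
    # true positives over everything that is actually positive
    return tp / (tp + fn)


# assumed counts, for illustration only
tp, fp, fn, tn = 40, 10, 20, 30

acc = accuracy(tp, fp, fn, tn)
prec = precision(tp, fp)
rec = recall(tp, fn)
print("accuracy :", acc)
print("precision:", prec)
print("recall   :", rec)
```

With these counts the model is right 70% of the time overall, 80% of its positive predictions are correct, and it finds two thirds of the actual positives.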
Precision and recall don’t consider true negatives. To get high precision, the model needs to reduce false positives (i.e. cases where the model predicts positive for an example that is actually negative). A good example of an application where precision is an appropriate metric is a spam email scanner, where a false positive means a legitimate email is marked as spam. To get high recall, the model needs to reduce false negatives (i.e. cases where the model predicts negative for an example that is actually positive). High recall is important in applications where false negatives are costly, such as disease diagnosis.
The F1 score combines precision and recall and is defined as their harmonic mean. Like precision and recall, the F1 score doesn’t consider true negatives, and it is commonly used in natural language processing, information retrieval systems, etc.
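To see what the harmonic mean does in practice, here is a small sketch comparing a balanced model with a lopsided one. The function name `harmonic_f1` and the precision/recall values are assumptions chosen for illustration.

```python
def harmonic_f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# assumed values, for illustration only
balanced = harmonic_f1(0.8, 0.8)
lopsided = harmonic_f1(1.0, 0.1)
arithmetic_mean = (1.0 + 0.1) / 2

print("F1 (precision 0.8, recall 0.8):", balanced)
print("F1 (precision 1.0, recall 0.1):", lopsided)
print("arithmetic mean of 1.0 and 0.1:", arithmetic_mean)
```

The lopsided model gets a much lower F1 than its arithmetic mean would suggest, because the harmonic mean is pulled toward the smaller of the two values. A high F1 therefore requires both precision and recall to be high.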
Now let’s look at additional metrics like ROC and PR, which are common ways to compare models that predict probabilities for two-class problems.
The ROC (Receiver Operating Characteristic) curve shows the relation between the true positive rate (y-axis) and the false positive rate (x-axis) as the decision threshold varies, where true positive rate = true positive / (true positive + false negative) and false positive rate = false positive / (false positive + true negative). A good ROC curve rises steeply at small x-axis values and stays close to the top-left corner, meaning the model reaches a high true positive rate while the false positive rate is still low.
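The following is a minimal sketch of how ROC points arise from sweeping a threshold over predicted scores. It is a hand-rolled illustration, not how scikit-learn implements `roc_curve`, and the function name `roc_points`, the labels, and the scores are all made up.

```python
def roc_points(y_true, scores):
    """Return a list of (fpr, tpr) points, from the strictest threshold to the loosest."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points


# toy labels (1 = positive) and scores, assumed for illustration only
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)

# area under the curve by the trapezoidal rule
roc_area = sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

for fpr, tpr in points:
    print("FPR = %.2f, TPR = %.2f" % (fpr, tpr))
print("area under ROC:", roc_area)
```

The curve always starts at (0, 0) and ends at (1, 1); the last point corresponds to a classifier that predicts everything as positive.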
The PR (Precision-Recall) curve shows the relation between precision and recall as the decision threshold varies, where precision is the y-axis and recall is the x-axis. A good PR curve keeps precision high as recall increases, so it stays close to the top-right corner, far from the origin.
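As a rough sketch, PR points can be computed with the same kind of threshold sweep. This is a simplified illustration and not scikit-learn's `precision_recall_curve`; the name `pr_points` and the toy data are assumptions.

```python
def pr_points(y_true, scores):
    """Return a list of (recall, precision) points, from the strictest threshold to the loosest."""
    positives = sum(y_true)
    points = []
    for threshold in sorted(set(scores), reverse=True):
        # true labels of the examples predicted positive at this threshold
        predicted_positive = [t for t, s in zip(y_true, scores) if s >= threshold]
        tp = sum(predicted_positive)
        points.append((tp / positives, tp / len(predicted_positive)))
    return points


# toy labels (1 = positive) and scores, assumed for illustration only
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = pr_points(y_true, scores)
for rec, prec in points:
    print("recall = %.2f, precision = %.2f" % (rec, prec))
```

At the loosest threshold every example is predicted positive, so recall is 1 and precision equals the fraction of positives in the data (0.5 here). That point is the "all positive classifier" baseline on a PR plot.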
Let’s compare two binary classifiers, AdaBoost and logistic regression, on the diabetes dataset from OpenML, plotting ROC and PR curves for each of them.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import roc_curve
from sklearn.metrics import precision_recall_curve

diabets = pd.read_csv("dataset_37_diabetes.csv")
diabets_x = diabets.drop("class", axis=1)
diabets_y = diabets["class"]

diabets_x_train, diabets_x_test, diabets_y_train, diabets_y_test = train_test_split(diabets_x, diabets_y, test_size=0.3, random_state=42)

#train logistic regression classifier
lrmodel = LogisticRegression(max_iter=1000)
lrmodel.fit(diabets_x_train, diabets_y_train)
lr_yhat = lrmodel.predict_proba(diabets_x_test)

#train adaboost classifier
abmodel = AdaBoostClassifier()
abmodel.fit(diabets_x_train, diabets_y_train)
ab_yhat = abmodel.predict_proba(diabets_x_test)

#plot roc curves
#keep only the predicted probability of the positive class
lr_pos_probs = lr_yhat[:, 1]
ab_pos_probs = ab_yhat[:, 1]

fpr_lr, tpr_lr, _ = roc_curve(diabets_y_test, lr_pos_probs, pos_label='tested_positive')
plt.plot(fpr_lr, tpr_lr, label='Logistic')

fpr_ab, tpr_ab, _ = roc_curve(diabets_y_test, ab_pos_probs, pos_label='tested_positive')
plt.plot(fpr_ab, tpr_ab, label='Adaboost')

#the all positive classifier sits at (FPR, TPR) = (1, 1)
plt.scatter([1.0], [1.0], label='all positive classifier', color='red')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()
#plot PR curve
precision_lr, recall_lr, _ = precision_recall_curve(diabets_y_test, lr_pos_probs, pos_label='tested_positive')
plt.plot(recall_lr, precision_lr, label='Logistic')

precision_ab, recall_ab, _ = precision_recall_curve(diabets_y_test, ab_pos_probs, pos_label='tested_positive')
plt.plot(recall_ab, precision_ab, label='Adaboost')

#the all positive classifier has recall 1 and precision equal to the fraction of positives
pos_rate = (diabets_y_test == 'tested_positive').mean()
plt.scatter([1.0], [pos_rate], label='all positive classifier', color='red')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend()
plt.show()
Recently, a new metric, the PRG (Precision-Recall-Gain) curve, was introduced. Flach and Kull claimed that PR Gain curves provide a better comparison between different models and enable the identification of inferior ones. To check this claim empirically, let’s compare AUROC (area under the ROC curve), AUPR (area under the PR curve), and AUPRG (area under the PRG curve) for the two classifiers.
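To give a feel for what "gain" means here, the sketch below rescales precision and recall relative to the all-positive baseline, following my reading of the definitions in Flach and Kull: gain(x) = (x - pi) / ((1 - pi) * x), where pi is the proportion of positives. The helper name `gain` and the numbers are assumptions for illustration; the `prg` package used below is what actually computes the curves in this article.

```python
def gain(value, pi):
    """Rescale precision, recall, or F1 to the gain scale, given the positive rate pi."""
    return (value - pi) / ((1 - pi) * value)


# assumed values, for illustration only
pi = 0.5                        # proportion of positives in the data
precision, recall = 0.75, 1.0

prec_gain = gain(precision, pi)
rec_gain = gain(recall, pi)
baseline_gain = gain(pi, pi)    # precision of the all positive classifier equals pi

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
f1_gain = gain(f1, pi)

print("precision gain:", prec_gain)
print("recall gain   :", rec_gain)
print("baseline gain :", baseline_gain)
print("F1 gain       :", f1_gain)
print("mean of gains :", (prec_gain + rec_gain) / 2)
```

On the gain scale the all-positive baseline maps to 0 and a perfect score maps to 1. The F1 gain comes out as the plain arithmetic mean of precision gain and recall gain, which is part of why areas and linear interpolation are easier to interpret in PRG space than in PR space.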
from prg import prg
from sklearn.metrics import auc
from sklearn.metrics import roc_auc_score

#AUROC
roc_auc_lr = roc_auc_score(diabets_y_test, lr_pos_probs)
roc_auc_ab = roc_auc_score(diabets_y_test, ab_pos_probs)

#AUPR
pr_auc_lr = auc(recall_lr, precision_lr)
pr_auc_ab = auc(recall_ab, precision_ab)

#convert label dtype to int
binarylabel = []
for label in diabets_y_test:
    if label == "tested_positive":
        binarylabel.append(1)
    else:
        binarylabel.append(0)
binarylabel = np.array(binarylabel)

#AUPRG
prg_curve_lr = prg.create_prg_curve(binarylabel, lr_pos_probs)
prg_curve_ab = prg.create_prg_curve(binarylabel, ab_pos_probs)
auprg_lr = prg.calc_auprg(prg_curve_lr)
auprg_ab = prg.calc_auprg(prg_curve_ab)

print("AUROC for logistic regression : " + str(roc_auc_lr))
print("AUROC for adaboost : " + str(roc_auc_ab))
print("AUPR for logistic regression : " + str(pr_auc_lr))
print("AUPR for adaboost : " + str(pr_auc_ab))
print("AUPRG for logistic regression : " + str(auprg_lr))
print("AUPRG for adaboost : " + str(auprg_ab))
AUROC for logistic regression : 0.7963576158940396
AUROC for adaboost : 0.7551738410596026
AUPR for logistic regression : 0.6608411274115765
AUPR for adaboost : 0.633429143930323
AUPRG for logistic regression : 0.7054493093059626
AUPRG for adaboost : 0.6422446878807431
Flach and Kull argue that AUPR can favour a model with a lower expected F1 score over a better one. In this experiment, logistic regression and AdaBoost have similar AUPR scores (0.661 vs. 0.633), so it would be hard to select the better model by comparing only the AUPR scores of the two classifiers. AUPRG separates the two classifiers more clearly (0.705 vs. 0.642), which supports the recommendation to use PRG curves in practice.
- Peter A. Flach and Meelis Kull, Precision-Recall-Gain Curves: PR Analysis Done Right, NIPS 2015.