# Intro to Deep Learning: Performance Metrics (Precision, Recall, F1, ROC, PR, PRG)

This article introduces performance metrics commonly used in deep learning and machine learning. Choosing an appropriate metric is important for comparing models and identifying the best one for a given problem. To understand the popular metrics (accuracy, precision, recall, F1, and so on), let’s first go over the confusion matrix.

**Confusion matrix**

A **confusion matrix** is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data. For a binary classifier it has four entries: true positive, false positive, false negative, and true negative. A true positive is an outcome where the model correctly predicts the positive class. Similarly, a true negative is an outcome where the model correctly predicts the negative class. A false positive is an outcome where the model *incorrectly* predicts the *positive* class. A false negative is an outcome where the model *incorrectly* predicts the *negative* class.
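To make the four entries concrete, here is a minimal sketch that counts them by hand. The function name `confusion_counts` and the toy labels are hypothetical, invented for illustration; labels are assumed to be encoded as 1 (positive) and 0 (negative).

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn


# Hypothetical toy labels, not taken from any real dataset.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
```

Every sample lands in exactly one of the four cells, so the counts always sum to the size of the test set.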

**Accuracy, Precision**

Accuracy measures the number of samples the model predicted correctly (true positive + true negative) over the total number of samples.

Precision measures the number of true positives over the number of samples the model predicted as positive (true positive + false positive). It can be thought of as the probability that a sample the model predicted as positive is actually positive.

**Recall**

Recall measures the number of true positives over the number of samples that are actually positive (true positive + false negative). It can be thought of as the fraction of the actual positives that the model correctly identified.

Precision and recall don’t consider true negatives. To get high precision, the model needs to reduce false positives (i.e. cases where the model predicts positive but the sample is actually negative). A good example application where precision is an appropriate metric is a spam email scanner, where a false positive means a legitimate email is flagged as spam. To get high recall, the model needs to reduce false negatives (i.e. cases where the model predicts negative but the sample is actually positive). High recall is important in applications where false negatives are costly, such as disease diagnosis.
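The definitions above can be sketched directly from the four confusion-matrix counts. The counts below are hypothetical (an imagined spam filter run on 1000 emails, 100 of them spam), and returning 0.0 when a denominator is zero is a convention chosen here, not something the definitions dictate.

```python
def accuracy(tp, fp, fn, tn):
    """Correct predictions over all predictions."""
    return (tp + tn) / (tp + fp + fn + tn)


def precision(tp, fp):
    """True positives over everything predicted positive."""
    predicted_positive = tp + fp
    return tp / predicted_positive if predicted_positive else 0.0  # assumed convention


def recall(tp, fn):
    """True positives over everything that is actually positive."""
    actual_positive = tp + fn
    return tp / actual_positive if actual_positive else 0.0  # assumed convention


# Hypothetical counts for illustration only.
tp, fp, fn, tn = 80, 10, 20, 890

acc = accuracy(tp, fp, fn, tn)
prec = precision(tp, fp)
rec = recall(tp, fn)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
```

Note that `tn` appears only in `accuracy`: with many true negatives, accuracy can look high even when precision or recall is mediocre.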

**F1 Score**

The F1 score combines precision and recall and is defined as their harmonic mean: F1 = 2 × precision × recall / (precision + recall). Like precision and recall, the F1 score doesn’t consider true negatives. It is commonly used in natural language processing, information retrieval systems, etc.
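A short sketch shows why the harmonic mean matters: it punishes a large gap between precision and recall much more than an arithmetic mean would. The function names and the input values are hypothetical, chosen for illustration.

```python
def f1_from_rates(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f1_from_counts(tp, fp, fn):
    """Equivalent form in terms of counts; true negatives do not appear."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical values for illustration.
balanced = f1_from_rates(0.8, 0.8)   # precision and recall agree
lopsided = f1_from_rates(1.0, 0.1)   # perfect precision, poor recall
arithmetic_mean = (1.0 + 0.1) / 2    # for comparison only

print(f"balanced F1      = {balanced:.3f}")
print(f"lopsided F1      = {lopsided:.3f}")
print(f"arithmetic mean  = {arithmetic_mean:.3f}")
print(f"F1 from counts   = {f1_from_counts(80, 10, 20):.3f}")
```

The lopsided case scores far below its arithmetic mean, so a model cannot reach a high F1 by maximizing only one of the two quantities.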

Now let’s look at additional metrics like ROC and PR, which are common ways to compare models that predict probabilities for two-class problems.

The **ROC (Receiver Operating Characteristic)** curve represents the relation between true positive rate (y-axis) and false positive rate (x-axis) as the decision threshold varies, where true positive rate = true positive / (true positive + false negative) and false positive rate = false positive / (false positive + true negative). A good ROC curve rises steeply while the false positive rate is still small, so it passes close to the top-left corner.
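As a rough sketch of how the ROC curve is traced, the code below sweeps a decision threshold over the predicted scores and records one (false positive rate, true positive rate) point per threshold. The helper names and the toy scores are hypothetical; in practice a library routine such as scikit-learn's `roc_curve` does this more efficiently.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) points, from the strictest threshold to the loosest."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Threshold above every score: nothing is predicted positive.
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels (1 = positive) and predicted scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
auroc = trapezoid_area(points)
print(points)
print(f"AUROC = {auroc:.4f}")
```

The curve always starts at (0, 0) and ends at (1, 1); the last point corresponds to a classifier that labels everything as positive.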

The **PR (Precision-Recall)** curve represents the relation between precision and recall, where precision is the y-axis and recall is the x-axis. Therefore, a good PR curve keeps precision high as recall increases, bowing toward the top-right corner (precision = 1, recall = 1).

Let’s compare two binary classifiers, AdaBoost and logistic regression, on a diabetes dataset from OpenML by plotting ROC and PR curves for each of them.
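The same threshold sweep can be sketched for the PR curve. The helper name `pr_points` and the toy scores are hypothetical. One thing this small example illustrates is that the loosest threshold (everything predicted positive) gives recall 1 and a precision equal to the fraction of positives in the data, which serves as a baseline for the curve.

```python
def pr_points(y_true, scores):
    """Return (recall, precision) points, from the strictest threshold to the loosest."""
    positives = sum(y_true)
    points = []
    for threshold in sorted(set(scores), reverse=True):
        # Labels of the samples predicted positive at this threshold.
        predicted_positive = [y for y, s in zip(y_true, scores) if s >= threshold]
        tp = sum(predicted_positive)
        points.append((tp / positives, tp / len(predicted_positive)))
    return points


# Hypothetical labels (1 = positive) and predicted scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = pr_points(y_true, scores)
prevalence = sum(y_true) / len(y_true)

for recall, precision in points:
    print(f"recall={recall:.3f} precision={precision:.3f}")
print(f"positive prevalence = {prevalence:.3f}")
```

Unlike the true positive rate on a ROC curve, precision does not have to move in one direction as the threshold is lowered, which is why PR curves often look jagged.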

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
from sklearn.metrics import precision_recall_curve

# load the dataset; the "class" column holds the labels
diabets = pd.read_csv("dataset_37_diabetes.csv")
diabets_x = diabets.drop("class", axis=1)
diabets_y = diabets["class"]

diabets_x_train, diabets_x_test, diabets_y_train, diabets_y_test = train_test_split(
    diabets_x, diabets_y, test_size=0.3, random_state=42
)

# train logistic regression classifier
lrmodel = LogisticRegression(max_iter=1000)
lrmodel.fit(diabets_x_train, diabets_y_train)
lr_yhat = lrmodel.predict_proba(diabets_x_test)

# train adaboost classifier
abmodel = AdaBoostClassifier()
abmodel.fit(diabets_x_train, diabets_y_train)
ab_yhat = abmodel.predict_proba(diabets_x_test)

# keep the probability of the positive class
# (column 1, since classes are sorted alphabetically)
lr_pos_probs = lr_yhat[:, 1]
ab_pos_probs = ab_yhat[:, 1]

# plot roc curves
fpr_lr, tpr_lr, _ = roc_curve(diabets_y_test, lr_pos_probs, pos_label='tested_positive')
plt.plot(fpr_lr, tpr_lr, label='Logistic')

fpr_ab, tpr_ab, _ = roc_curve(diabets_y_test, ab_pos_probs, pos_label='tested_positive')
plt.plot(fpr_ab, tpr_ab, label='Adaboost')

plt.scatter([1], [1], label='all positive classifier', color='red')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()

# plot PR curve
precision_lr, recall_lr, _ = precision_recall_curve(diabets_y_test, lr_pos_probs, pos_label='tested_positive')
plt.plot(recall_lr, precision_lr, label='Logistic')

precision_ab, recall_ab, _ = precision_recall_curve(diabets_y_test, ab_pos_probs, pos_label='tested_positive')
plt.plot(recall_ab, precision_ab, label='Adaboost')

plt.scatter([1., 1.], [precision_lr[0], precision_ab[0]], label='all positive classifier')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend()
plt.show()
```

**PRG (PR Gain)**

More recently, a new curve, PRG (the “PR Gain curve”), was introduced. Flach and Kull (2015) claim that the PR Gain curve provides a better comparison between different models and enables the identification of inferior ones. To check this claim empirically, let’s compare AUROC (area under the ROC curve), AUPR (area under the PR curve), and AUPRG (area under the PRG curve) for the two classifiers.

```python
from prg import prg  # third-party PRG helper module
from sklearn.metrics import auc
from sklearn.metrics import roc_auc_score

# AUROC
roc_auc_lr = roc_auc_score(diabets_y_test, lr_pos_probs)
roc_auc_ab = roc_auc_score(diabets_y_test, ab_pos_probs)

# AUPR
pr_auc_lr = auc(recall_lr, precision_lr)
pr_auc_ab = auc(recall_ab, precision_ab)

# convert label dtype to int
binarylabel = []
for label in diabets_y_test:
    if label == "tested_positive":
        binarylabel.append(1)
    else:
        binarylabel.append(0)
binarylabel = np.array(binarylabel)

# AUPRG
prg_curve_lr = prg.create_prg_curve(binarylabel, lr_pos_probs)
prg_curve_ab = prg.create_prg_curve(binarylabel, ab_pos_probs)
auprg_lr = prg.calc_auprg(prg_curve_lr)
auprg_ab = prg.calc_auprg(prg_curve_ab)

print("AUROC for logistic regression : " + str(roc_auc_lr))
print("AUROC for adaboost : " + str(roc_auc_ab))
print("AUPR for logistic regression : " + str(pr_auc_lr))
print("AUPR for adaboost : " + str(pr_auc_ab))
print("AUPRG for logistic regression : " + str(auprg_lr))
print("AUPRG for adaboost : " + str(auprg_ab))
```

```
AUROC for logistic regression : 0.7963576158940396
AUROC for adaboost : 0.7551738410596026
AUPR for logistic regression : 0.6608411274115765
AUPR for adaboost : 0.633429143930323
AUPRG for logistic regression : 0.7054493093059626
AUPRG for adaboost : 0.6422446878807431
```

All three metrics rank logistic regression above AdaBoost on this dataset. Flach and Kull argue that AUPR can favour the model with the lower expected F1 score. Here the two AUPR scores are close (0.661 vs. 0.633), so it would be hard to select the better model by comparing AUPR alone. AUPRG separates the two classifiers more clearly (0.705 vs. 0.642), which is consistent with the recommendation to use PRG curves in practice, although a single dataset cannot settle the question.
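For readers curious about what the PRG curve actually plots, here is a minimal sketch of the gain transform described by Flach and Kull, where π is the proportion of positives in the data. The function names and example values are hypothetical; this is not the code of the `prg` package, only an illustration of the formulas precG = (prec − π) / ((1 − π) · prec) and recG = (rec − π) / ((1 − π) · rec).

```python
def precision_gain(precision, pi):
    """Rescale precision so the always-positive baseline (precision == pi) maps to 0."""
    return (precision - pi) / ((1 - pi) * precision)


def recall_gain(recall, pi):
    """Apply the same rescaling to recall."""
    return (recall - pi) / ((1 - pi) * recall)


# Hypothetical proportion of positives, for illustration only.
pi = 0.25

baseline = precision_gain(0.25, pi)  # precision equal to pi
halfway = precision_gain(0.50, pi)
perfect = precision_gain(1.00, pi)

print(f"gain at precision 0.25 = {baseline:.3f}")
print(f"gain at precision 0.50 = {halfway:.3f}")
print(f"gain at precision 1.00 = {perfect:.3f}")
print(f"recall gain at recall 0.50 = {recall_gain(0.50, pi):.3f}")
```

After the transform, a value of 0 means "no better than predicting everything positive" and 1 means perfect, so the area under the PRG curve is measured against that baseline rather than against zero.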

**References**

- Peter A. Flach and Meelis Kull, Precision-Recall-Gain Curves: PR Analysis Done Right, NIPS 2015.
- https://en.wikipedia.org/wiki/F-score