Skip to main content

Table 1 Brief overview of some of the most frequently used performance measures for machine learning models

From: Machine learning and clinical epigenetics: a review of challenges for diagnosis and classification

Performance metricInterpretation
AccuracyBrief definition:
Accuracy is a classifier that works best on balanced data sets. It is a measure that informs about the correct classifications out of all classifications. It can have values from 0 - 100 %
If we are dealing with a binary classification, e.g. cancer versus healthy, and we have 20 patients with cancer and 80 healthy controls, a model accuracy of 80% would mean that the model classified every subject into the majority class (healthy) and is completely unable to classify cancer patients, although the accuracy indicates a good performance.
SensitivityBrief definition:
The sensitivity is the true positive rate of a test. This means, how many subjects with a disease are actually identified as having the disease by the test. The values range from 0 to 100%.
Let us say we have a epigenetic test, that claims to identify the presence of a specific type of cancer. When evaluating the test, it was able to identify 30 out of 60 cancer patients correctly. The sensitivity of this test would then be 50% (30/60)
SpecificityBrief definition:
The specificity is the true negative rate of a test. In other words, it represents the proportion of people without the disease, that will have a negative result. Just like for sensitivity, the values range from 0- 100%
We assume we are dealing with the same diagnostic test for cancer as in the explanation of sensitivity. Out of 90 healthy subjects, 70 had a negative diagnosis. This means the specificity of the test is 78% (70/90)
PrecisionBrief definition:
Precision is a measure that tells us out of all predicted cases, how many are actual cases. Possible values range from 0 to 1.
In the cancer example, how many predicted cancer cases are actual cancer cases.
RecallBrief definition:
Recall is a measure that informs us how many cases we were able to identify as cases. The value range is 0 to 1.
Out of all the cancer patients, how many was the predictive model able to identify as cancer patients?
F1-ScoreBrief definition:
The F1-score is the harmonic mean between precision and recall. In this case, we aim for both high recall and high precision, meaning we want to be able to identify a large amount of cases and we also want to be sure that the majority of predicted cases are actual cases. The F1-score ranges from 0 to 1, where 0 is the worst performance.
If we have a near-perfect precision and recall, meaning we ate able to classify a large amount of the cancer patients as cancer patients (recall) and we are sure that our prediction is correct (precision), the harmonic mean between the two of them for a good model would be ~ 0.9.
Area under the receiver operator curve (ROC AUC)Brief definition:
The area under the receiver operator curve is a measure of how sensitive and specific a test performs. In a graphical representation, the x-axis depicts the negative predictions and the y-axis the positive predictions. If a test performs bad in terms of sensitivity and specificity, the area under the curve would be 0.5, which means it is not better than tossing a coin.