
Table 2 Overview of the literature on machine learning and clinical epigenetics, including data type, machine learning method used, sample size, and performance measures.

From: Machine learning and clinical epigenetics: a review of challenges for diagnosis and classification

| Disease | ML method | Sample size | Epigenetic data type | Performance | Validation method | Authors |
| --- | --- | --- | --- | --- | --- | --- |
| Metastatic brain tumours | Random forest | 1860; 165 patients | Infinium HumanMethylation 450K | AUC for type: GBM-A = 0.87; BM-C = 0.82; BM-C–GBM-A = 0.92. AUC for site of origin: LuCa, BrCa, Melan = 0.99 | Bootstrap | Orozco, 2018 [57] |
| Cerebral palsy | Non-metric multidimensional scaling; linear discriminant analysis; random forest | 22 CP patients; 21 controls | Methyl-sensitive restriction endonuclease (MSRE) | Accuracy = 73%; Sensitivity = 100%; Specificity = 40%; AUC = 0.691 | Bootstrap; 20-fold cross-validation | Crowgey, 2018 [38] |
| Prostate cancer | Least absolute shrinkage and selection operator | 234 PrCa; 76 controls | Infinium HumanMethylation 450K | Training set: 100% accuracy, sensitivity, specificity, AUC. Validation set: Sensitivity = 96%; Specificity = 98%; Accuracy = 97%; AUC = 98% | None reported | Aref-Eshghi, 2018 [54] |
| Central nervous system tumours | Random forest | 2801 (91 different classes) | Infinium HumanMethylation 450K; Infinium HumanMethylation EPIC; Whole Genome Bisulphite Sequencing | Cross-validation error rate (raw) = 4.89%; Cross-validation error rate (calibrated) = 4.28%; AUC = 0.99; 8 methylation class error rate = 1.14%. Multiclass approach: Sensitivity = 0.989; Specificity = 0.999. Classification concordant with pathology on validation set = 76% | 3-fold, nested cross-validation | Capper, 2018 [55] |
| Neurodevelopmental syndromes | Support vector machine | 285 cases across 14 syndromes; 650 controls | Infinium HumanMethylation 450K + EPIC | Accuracy = 99.6%; Sensitivity = 100%; Specificity = 100% | 10-fold cross-validation | Aref-Eshghi, 2018 [53] |
| Coronary heart disease | Random forest | 1545; 173 with coronary heart disease | Infinium HumanMethylation 450K | Accuracy = 78%; Sensitivity = 0.75; Specificity = 0.80 | 10-fold cross-validation | Dogan, 2017 [56] |
| BAFopathies | Support vector machine | Cases: n = 29 (CSS1 = 14; CSS3 = 5; CSS4 = 2; NCBRS = 7). Controls: 156 (CSS1 = 84; CSS3 = 30; CSS4 = 0; NCBRS = 42) | Infinium HumanMethylation 450K + EPIC | Testing set: Accuracy = 98.8% | 10-fold cross-validation | Aref-Eshghi, 2018 [72] |
| Lung cancer | Multi-class support vector machine | Training set: LADC = 126; SQCLC = 134; SCL = 28. Test set: LADC = 452; SQCLC = 359 | Infinium HumanMethylation 27k (training); Infinium HumanMethylation 450K (independent) | Training set: Accuracy = 86.54% ± 2.2; Precision = 66.79% ± 1.9; Recall = 84.37% ± 2.5; F-score = 74.55% ± 2.2. Independent sets: Accuracy = 84.6%; Precision = 85.94%; Recall = 85.52%; F-score = 85.04% | Leave-one-out cross-validation | Cai, 2015 [73] |
| Cancers | Support vector machine | Comparisons between: Male = 7, female = 14; T-ALL/B-ALL = 17; Healthy T/B cells = 13; AML = 8; BPH = 10; Prostate carcinoma = 10; Healthy kidney = 9; Kidney carcinoma = 9; Prostate = 20; Kidney = 18 | Bisulphite Sequencing (GenePix4000) | Accuracy: Male vs female = 91%; T/B cells vs ALL = 94%; ALL vs AML = 94%; Kidney vs kidney carcinoma = 92%; Prostate vs kidney = 92% | 50-fold cross-validation | Adorjan, 2002 [74] |
| Breast cancer | Random forest | 543 | TCGA, gene expression, and methylation; Infinium HumanMethylation 450K | Bootstrap error = 20%; Average AUC = 88% | .632 bootstrap error | List, 2014 [75] |
| Lung cancer | Random forest; support vector machine; linear regression; naïve Bayes | 50 | Infinium HumanMethylation 450K (+ ChIP-seq from ENCODE) | Training set: AUC = 86.4%. Test set: AUC = 83.6% | 10-fold cross-validation | Li, 2015 [76] |
| CLL subtypes | Support vector machine | Training set: 211. Validation set: 97 | Bisulphite pyrosequencing | Not reported; the authors state only that the prediction was accurate | .632 bootstrap error | Queiros, 2015 [77] |
| CLL subtypes | Support vector machine | 135 | Bisulphite pyrosequencing (PyroMark) | No testing of algorithm | NA | Bhoi, 2016 [78] |
| Various cancers | One-class logistic regression | 12000 (33 cancers) | Infinium HumanMethylation 450K | None reported | None | Malta, 2018 [79] |
| Prediction of methylation in leukemia and healthy cells | Deep learning via DeepMethyl using stacked denoising autoencoder | Two cell lines: GM12878 (B-lymphocyte cell line from a female); K562 (immortalised cell line from a female patient with chronic myelogenous leukemia) | Reduced representation bisulfite sequencing (RRBS) | Accuracy. GM12878: 84.82% for unknown neighbouring regions; 89.7% blinded. K562: 72.01% for unknown neighbouring regions; 88.6% blinded | Leave-one-out cross-validation | Wang, 2016 [70] |
| Prediction of methylation status of single cells | Convolutional neural network | 18 serum-cultured mouse embryonic stem cells; 25 human hepatocellular carcinoma cells; 6 human hepatoblastoma-derived cells; 6 mESCs | Single-cell bisulphite sequencing; single-cell reduced representation bisulphite sequencing | Based on additional file 2 of the publication: Mean/sd accuracy: 87.9%/0.09%; Mean/sd AUC: 0.87/0.08; Mean/sd F1: 0.67/0.21 | Holdout validation | Angermueller, 2017 [71] |
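
The performance measures in the table (accuracy, sensitivity, specificity, precision, recall, F-score) are all derived from the same confusion-matrix counts. The sketch below shows one way they relate, assuming the standard binary definitions; the function names and the example counts are hypothetical, and the individual studies may have computed or averaged their measures differently. As a small check, it recomputes the F-score from the training-set precision and recall listed for Cai, 2015 [73].

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Binary-classification measures from confusion-matrix counts.

    Assumes every denominator is non-zero (at least one true case,
    one true control, and one positive prediction).
    """
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # also called recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


# Training-set precision and recall listed for Cai, 2015, in percent.
# The result is about 74.56, close to the reported F-score of 74.55%.
print(round(f_score(66.79, 84.37), 2))

# Hypothetical counts, for illustration only (not taken from any study).
example = confusion_metrics(tp=45, fp=5, tn=40, fn=10)
print(example)
```

Because sensitivity and specificity are computed on cases and controls separately, they can diverge sharply from accuracy when the two groups differ in size or error rate, which is worth keeping in mind when comparing rows that report only one of these measures.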