Table 2 Overview of the literature on machine learning and clinical epigenetics, including data type, machine learning method used, sample size, and performance measures.

From: Machine learning and clinical epigenetics: a review of challenges for diagnosis and classification

| Disease | ML method | Sample size | Epigenetic data type | Performance | Validation method | Authors |
| --- | --- | --- | --- | --- | --- | --- |
| Metastatic brain tumours | Random forest | 1860; 165 patients | Infinium HumanMethylation 450K | AUC for type: GBM-A = 0.87, BM-C = 0.82, BM-C–GBM-A = 0.92; AUC for site of origin: LuCa, BrCa, Melan = 0.99 | Bootstrap | Orozco, 2018 [57] |
| Cerebral palsy | Non-metric multidimensional scaling; linear discriminant analysis; random forest | 22 CP patients; 21 controls | Methyl-sensitive restriction endonuclease (MSRE) | Accuracy = 73%; Sensitivity = 100%; Specificity = 40%; AUC = 0.691 | Bootstrap; 20-fold cross-validation | Crowgey, 2018 [38] |
| Prostate cancer | Least absolute shrinkage and selection operator | 234 PrCa; 76 controls | Infinium HumanMethylation 450K | Training set: 100% accuracy, sensitivity, specificity, AUC; Validation set: Sensitivity = 96%, Specificity = 98%, Accuracy = 97%, AUC = 98% | None reported | Aref-Eshghi, 2018 [54] |
| Central nervous system tumours | Random forest | 2801 (91 different classes) | Infinium HumanMethylation 450K; Infinium HumanMethylation EPIC; Whole Genome Bisulphite Sequencing | Cross-validation error rate (raw) = 4.89%; Cross-validation error rate (calibrated) = 4.28%; AUC = 0.99; 8 methylation class error rate = 1.14%; Multiclass approach: Sensitivity = 0.989, Specificity = 0.999; Classification concordant with pathology on validation set = 76% | 3-fold, nested cross-validation | Capper, 2018 [55] |
| Neurodevelopmental syndromes | Support vector machine | 285 cases across 14 syndromes; 650 controls | Infinium HumanMethylation 450K + EPIC | Accuracy = 99.6%; Sensitivity = 100%; Specificity = 100% | 10-fold cross-validation | Aref-Eshghi, 2018 [53] |
| Coronary heart disease | Random forest | 1545; 173 with coronary heart disease | Infinium HumanMethylation 450K | Accuracy = 78%; Sensitivity = 0.75; Specificity = 0.80 | 10-fold cross-validation | Dogan, 2017 [56] |
| BAFopathies | Support vector machine | Cases: n = 29 (CSS1 = 14; CSS3 = 5; CSS4 = 2; NCBRS = 7); Controls: 156 (CSS1 = 84; CSS3 = 30; CSS4 = 0; NCBRS = 42) | Infinium HumanMethylation 450K + EPIC | Testing set: Accuracy = 98.8% | 10-fold cross-validation | Aref-Eshghi, 2018 [72] |
| Lung cancer | Multi-class support vector machine | Training set: LADC = 126, SQCLC = 134, SCL = 28; Test set: LADC = 452, SQCLC = 359 | Infinium HumanMethylation 27k (training); Infinium HumanMethylation 450K (independent) | Training set: Accuracy = 86.54% ± 2.2, Precision = 66.79% ± 1.9, Recall = 84.37% ± 2.5, F-score = 74.55% ± 2.2; Independent sets: Accuracy = 84.6%, Precision = 85.94%, Recall = 85.52%, F-score = 85.04% | Leave-one-out cross-validation | Cai, 2015 [73] |
| Cancers | Support vector machine | Comparisons between: Male = 7, female = 14; T-ALL/B-ALL = 17; Healthy T/B cells = 13; AML = 8; BPH = 10; Prostate carcinoma = 10; Healthy kidney = 9; Kidney carcinoma = 9; Prostate = 20; Kidney = 18 | Bisulphite Sequencing (GenePix4000) | Accuracy: Male vs female = 91%, T/B cells vs ALL = 94%, ALL vs AML = 94%, Kidney vs kidney carcinoma = 92%, Prostate vs kidney = 92% | 50-fold cross-validation | Adorjan, 2002 [74] |
| Breast cancer | Random forest | 543 | TCGA, gene expression, and methylation; Infinium HumanMethylation 450K | Bootstrap error = 20%; Average AUC = 88% | .632 bootstrap error | List, 2014 [75] |
| Lung cancer | Random forest; support vector machine; linear regression; naïve Bayes | 50 | Infinium HumanMethylation 450K (+ ChIP-seq from ENCODE) | Training set: AUC = 86.4%; Test set: AUC = 83.6% | 10-fold cross-validation | Li, 2015 [76] |
| CLL subtypes | Support vector machine | Training set: 211; Validation set: 97 | Bisulphite pyrosequencing | Not reported; the authors state only that the prediction was accurate | .632 bootstrap error | Queiros, 2015 [77] |
| CLL subtypes | Support vector machine | 135 | Bisulphite pyrosequencing (PyroMark) | No testing of algorithm | NA | Bhoi, 2016 [78] |
| Various cancers | One-class logistic regression | 12000 (33 cancers) | Infinium HumanMethylation 450K | None reported | None | Malta, 2018 [79] |
| Prediction of methylation in leukemia and healthy cells | Deep learning via DeepMethyl using a stacked denoising autoencoder | Two cell lines: GM12878 (B-lymphocyte cell line from a female); K562 (immortalised cell line from a female patient with chronic myelogenous leukemia) | Reduced representation bisulfite sequencing (RRBS) | Accuracy: GM12878: 84.82% for unknown neighbouring regions, 89.7% blinded; K562: 72.01% for unknown neighbouring regions, 88.6% blinded | Leave-one-out cross-validation | Wang, 2016 [70] |
| Prediction of methylation status of single cells | Convolutional neural network | 18 serum-cultured mouse embryonic stem cells; 25 human hepatocellular carcinoma cells; 6 human hepatoblastoma-derived cells; 6 mESCs | Single-cell bisulphite sequencing; single-cell reduced representation bisulphite sequencing | Based on additional file 2 of the publication: Mean/sd accuracy: 87.9%/0.09%; Mean/sd AUC: 0.87/0.08; Mean/sd F1: 0.67/0.21 | Holdout validation | Angermueller, 2017 [71] |
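
The performance measures reported in Table 2 (accuracy, sensitivity, specificity, precision, recall, F-score) can all be derived from the counts of a binary confusion matrix. The following is a minimal sketch of those standard definitions, not code from any of the listed studies; the function name `classification_metrics` and the example counts are hypothetical and chosen only for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion-matrix counts.

    tp/fn: cases correctly/incorrectly classified; tn/fp: controls
    correctly/incorrectly classified. Undefined ratios are returned as NaN.
    """
    nan = float("nan")
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else nan
    sensitivity = tp / (tp + fn) if (tp + fn) else nan   # also called recall
    specificity = tn / (tn + fp) if (tn + fp) else nan
    precision = tp / (tp + fp) if (tp + fp) else nan
    denom = precision + sensitivity
    # F-score (F1): harmonic mean of precision and recall
    f_score = 2 * precision * sensitivity / denom if denom else nan
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_score": f_score,
    }


# Hypothetical counts (not taken from any study in the table):
# 10 cases (8 detected, 2 missed) and 10 controls (7 correct, 3 false alarms).
m = classification_metrics(tp=8, fp=3, tn=7, fn=2)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Because accuracy pools cases and controls, it depends on the case/control ratio, whereas sensitivity and specificity do not; this is one reason the studies above are hard to compare on accuracy alone.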