From: Machine learning and clinical epigenetics: a review of challenges for diagnosis and classification
Disease | ML method | Sample size | Epigenetic data type | Performance | Validation method | Authors |
---|---|---|---|---|---|---|
Metastatic brain tumours | Random forest | 1860 165 patients | Infinium HumanMethylation 450K | AUC for type: GBM-A = 0.87; BM-C = 0.82; BM-C–GBM-A = 0.92. AUC for site of origin (LuCa, BrCa, Melan) = 0.99 | Bootstrap | Orozco, 2018 [57] |
Cerebral palsy | Non-metric multidimensional scaling; Linear discriminant analysis; Random forest | 22 CP patients; 21 controls | Methyl-sensitive restriction endonuclease (MSRE) | Accuracy = 73%; Sensitivity = 100%; Specificity = 40%; AUC = 0.691 | Bootstrap 20-fold cross-validation | Crowgey, 2018 [38] |
Prostate cancer | Least absolute shrinkage and selection operator | 234 PrCa; 76 controls | Infinium HumanMethylation 450K | Training set: 100% accuracy, sensitivity, specificity, AUC. Validation set: Sensitivity = 96%; Specificity = 98%; Accuracy = 97%; AUC = 98% | None reported | Aref-Eshghi, 2018 [54] |
Central nervous system tumours | Random forest | 2801 (91 different classes) | Infinium HumanMethylation 450K; Infinium HumanMethylation EPIC; Whole Genome Bisulphite Sequencing | Cross-validation error rate (raw) = 4.89%; Cross-validation error rate (calibrated) = 4.28%; AUC = 0.99; 8 methylation class error rate = 1.14%. Multiclass approach: Sensitivity = 0.989; Specificity = 0.999. Classification concordant with pathology on validation set = 76% | 3-fold, nested cross-validation | Capper, 2018 [55] |
Neurodevelopmental syndromes | Support vector machine | 285 cases across 14 syndromes; 650 controls | Infinium HumanMethylation 450K + EPIC | Accuracy = 99.6%; Sensitivity = 100%; Specificity = 100% | 10-fold cross-validation | Aref-Eshghi, 2018 [53] |
Coronary heart disease | Random forest | 1545; 173 with coronary heart disease | Infinium HumanMethylation 450K | Accuracy = 78%; Sensitivity = 0.75; Specificity = 0.80 | 10-fold cross-validation | Dogan, 2017 [56] |
BAFopathies | Support vector machine | Cases n = 29 (CSS1 = 14; CSS3 = 5; CSS4 = 2; NCBRS = 7); Controls n = 156 (CSS1 = 84; CSS3 = 30; CSS4 = 0; NCBRS = 42) | Infinium HumanMethylation 450K + EPIC | Testing set: Accuracy = 98.8% | 10-fold cross-validation | Aref-Eshghi, 2018 [72] |
Lung cancer | Multi-class support vector machine | Training set: LADC = 126, SQCLC = 134, SCL = 28. Test set: LADC = 452, SQCLC = 359 | Infinium HumanMethylation 27k (training); Infinium HumanMethylation 450K (independent) | Training set: Accuracy = 86.54% ± 2.2; Precision = 66.79% ± 1.9; Recall = 84.37% ± 2.5; F-score = 74.55% ± 2.2. Independent sets: Accuracy = 84.6%; Precision = 85.94%; Recall = 85.52%; F-score = 85.04% | Leave-one-out cross-validation | Cai, 2015 [73] |
Cancers | Support vector machine | Comparisons between: Male = 7, female = 14; T-ALL/B-ALL = 17; Healthy T/B cells = 13; AML = 8; BPH = 10; Prostate carcinoma = 10; Healthy kidney = 9; Kidney carcinoma = 9; Prostate = 20; Kidney = 18 | Bisulphite Sequencing (GenePix4000) | Accuracy: Male vs female = 91%; T/B cells vs ALL = 94%; ALL vs AML = 94%; Kidney vs kidney carcinoma = 92%; Prostate vs kidney = 92% | 50-fold cross-validation | Adorjan, 2002 [74] |
Breast cancer | Random forest | 543 TCGA, gene expression, and methylation | Infinium HumanMethylation 450K | Bootstrap error = 20%; Average AUC = 88% | .632 bootstrap error | List, 2014 [75] |
Lung cancer | Random forest; support vector machine; linear regression; naïve Bayes | 50 | Infinium HumanMethylation 450K (+ ChIP-Seq from ENCODE) | Training set: AUC = 86.4%; Test set: AUC = 83.6% | 10-fold cross-validation | Li, 2015 [76] |
CLL subtypes | Support vector machine | Training set = 211; Validation set = 97 | Bisulphite pyrosequencing | Not reported; the authors state only that the prediction was accurate | .632 bootstrap error | Queiros, 2015 [77] |
CLL subtypes | Support vector machine | 135 | Bisulphite pyrosequencing (PyroMark) | No testing of algorithm | NA | Bhoi, 2016 [78] |
Various cancers | One class logistic regression | 12000 (33 cancers) | Infinium HumanMethylation 450K | None reported | None | Malta, 2018 [79] |
Prediction of methylation in leukemia and healthy cells | Deep learning via DeepMethyl using stacked denoising autoencoder | Two cell lines. GM12878: B-lymphocyte cell line from a female; K562: immortalised cell line from a female patient with chronic myelogenous leukemia | Reduced representation bisulfite sequencing (RRBS) | Accuracy. GM12878: 84.82% for unknown neighbouring regions, 89.7% blinded; K562: 72.01% for unknown neighbouring regions, 88.6% blinded | Leave-one-out cross-validation | Wang, 2016 [70] |
Prediction of methylation status of single cells | Convolutional neural network | 18 serum-cultured mouse embryonic stem cells; 25 human hepatocellular carcinoma cells; 6 human hepatoblastoma-derived cells; 6 mESCs | Single-cell bisulphite sequencing; single-cell reduced representation bisulphite sequencing | Based on additional file 2 of the publication: Mean/sd accuracy: 87.9%/0.09%; Mean/sd AUC: 0.87/0.08; Mean/sd F1: 0.67/0.21 | Holdout validation | Angermueller, 2017 [71] |
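
The Performance column mixes accuracy, sensitivity (recall), specificity, precision, and F-score. As a reading aid, the minimal sketch below shows how these quantities are conventionally derived from the four cells of a binary confusion matrix. The function name `binary_metrics` and the counts are hypothetical and chosen purely for illustration; they are not taken from any study in the table, and the individual papers may have computed their figures differently (for example, averaged over cross-validation folds or over multiple classes).

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts.

    tp/fn: cases predicted positive/negative; tn/fp: controls predicted
    negative/positive. Assumes every denominator below is non-zero.
    """
    sensitivity = tp / (tp + fn)            # recall: fraction of cases detected
    specificity = tn / (tn + fp)            # fraction of controls correctly rejected
    precision = tp / (tp + fp)              # fraction of positive calls that are cases
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_score": f_score,
    }


# Hypothetical counts (illustration only, not from any cited study):
# 50 cases, 100 controls.
metrics = binary_metrics(tp=40, fp=10, tn=90, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because accuracy depends on the case/control ratio while sensitivity and specificity do not, rows with strongly imbalanced sample sizes are easier to compare on the latter two.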