Machine learning and clinical epigenetics: a review of challenges for diagnosis and classification

Rauschert, S.; Raubenheimer, K.; Melton, P. E.; Huang, R. C.

doi:10.1186/s13148-020-00842-4

Clinical Epigenetics

Table 2 Overview of the literature on machine learning and clinical epigenetics, including data type, machine learning method used, sample size, and performance measures.

Disease	ML method	Sample size	Epigenetic data type	Performance	Validation method	Authors
Metastatic brain tumours	Random forest	1860; 165 patients	Infinium HumanMethylation 450K	AUC for type: GBM-A = 0.87; BM-C = 0.82; BM-C–GBM-A = 0.92. AUC for site of origin (LuCa, BrCa, Melan) = 0.99	Bootstrap	Orozco, 2018 [57]
Cerebral palsy	Non-metric multidimensional scaling; linear discriminant analysis; random forest	22 CP patients; 21 controls	Methyl-sensitive restriction endonuclease (MSRE)	Accuracy = 73%; Sensitivity = 100%; Specificity = 40%; AUC = 0.691	Bootstrap; 20-fold cross-validation	Crowgey, 2018 [38]
Prostate cancer	Least absolute shrinkage and selection operator	234 PrCa; 76 controls	Infinium HumanMethylation 450K	Training set: 100% accuracy, sensitivity, specificity, AUC. Validation set: Sensitivity = 96%; Specificity = 98%; Accuracy = 97%; AUC = 98%	None reported	Aref-Eshghi, 2018 [54]
Central nervous system tumours	Random forest	2801 (91 different classes)	Infinium HumanMethylation 450K; Infinium HumanMethylation EPIC; Whole Genome Bisulphite Sequencing	Cross-validation error rate (raw) = 4.89%; Cross-validation error rate (calibrated) = 4.28%; AUC = 0.99; 8 methylation class error rate = 1.14%. Multiclass approach: Sensitivity = 0.989; Specificity = 0.999. Classification concordant with pathology on validation set = 76%	3-fold, nested cross-validation	Capper, 2018 [55]
Neurodevelopmental syndromes	Support vector machine	285 cases across 14 syndromes; 650 controls	Infinium HumanMethylation 450K + EPIC	Accuracy = 99.6%; Sensitivity = 100%; Specificity = 100%	10-fold cross-validation	Aref-Eshghi, 2018 [53]
Coronary heart disease	Random forest	1545; 173 with coronary heart disease	Infinium HumanMethylation 450K	Accuracy = 78%; Sensitivity = 0.75; Specificity = 0.80	10-fold cross-validation	Dogan, 2017 [56]
BAFopathies	Support vector machine	Cases: n = 29 (CSS1 = 14; CSS3 = 5; CSS4 = 2; NCBRS = 7). Controls: n = 156 (CSS1 = 84; CSS3 = 30; CSS4 = 0; NCBRS = 42)	Infinium HumanMethylation 450K + EPIC	Testing set: Accuracy = 98.8%	10-fold cross-validation	Aref-Eshghi, 2018 [72]
Lung cancer	Multi-class support vector machine	Training set: LADC = 126; SQCLC = 134; SCL = 28. Test set: LADC = 452; SQCLC = 359	Infinium HumanMethylation 27k (training); Infinium HumanMethylation 450K (independent)	Training set: Accuracy = 86.54% ± 2.2; Precision = 66.79% ± 1.9; Recall = 84.37% ± 2.5; F-score = 74.55% ± 2.2. Independent sets: Accuracy = 84.6%; Precision = 85.94%; Recall = 85.52%; F-score = 85.04%	Leave-one-out cross-validation	Cai, 2015 [73]
Cancers	Support vector machine	Comparisons between: Male = 7, female = 14; T-ALL/B-ALL = 17; Healthy T/B cells = 13; AML = 8; BPH = 10; Prostate carcinoma = 10; Healthy kidney = 9; Kidney carcinoma = 9; Prostate = 20; Kidney = 18	Bisulphite Sequencing (GenePix4000)	Accuracy: Male vs female = 91%; T/B cells vs ALL = 94%; ALL vs AML = 94%; Kidney vs kidney carcinoma = 92%; Prostate vs kidney = 92%	50-fold cross-validation	Adorjan, 2002 [74]
Breast cancer	Random forest	543 TCGA, gene expression, and methylation	Infinium HumanMethylation 450K	Bootstrap error = 20%; Average AUC = 88%	.632 bootstrap error	List, 2014 [75]
Lung cancer	Random forest; support vector machine; linear regression; naïve Bayes	50	Infinium HumanMethylation 450K (+ ChIP-seq from ENCODE)	Training set AUC = 86.4%; Test set AUC = 83.6%	10-fold cross-validation	Li, 2015 [76]
CLL subtypes	Support vector machine	Training set = 211; Validation set = 97	Bisulphite pyrosequencing	Not reported; the authors state only that the prediction was accurate.	.632 bootstrap error	Queiros, 2015 [77]
CLL subtypes	Support vector machine	135	Bisulphite pyrosequencing (PyroMark)	No testing of algorithm	NA	Bhoi, 2016 [78]
Various cancers	One class logistic regression	12000 (33 cancers)	Infinium HumanMethylation 450K	None reported	None	Malta, 2018 [79]
Prediction of methylation in leukemia and healthy cells	Deep learning via DeepMethyl using a stacked denoising autoencoder	Two cell lines: GM12878 (B-lymphocyte cell line from a female); K562 (immortalised cell line from a female patient with chronic myelogenous leukemia)	Reduced representation bisulfite sequencing (RRBS)	Accuracy: GM12878: 84.82% for unknown neighbouring regions, 89.7% blinded; K562: 72.01% for unknown neighbouring regions, 88.6% blinded	Leave-one-out cross-validation	Wang, 2016 [70]
Prediction of methylation status of single cells	Convolutional neural network	18 serum-cultured mouse embryonic stem cells; 25 human hepatocellular carcinoma cells; 6 human hepatoblastoma-derived cells; 6 mESCs	Single-cell bisulphite sequencing; single-cell reduced representation bisulphite sequencing	Based on additional file 2 of the publication: Mean/sd accuracy: 87.9%/0.09%; Mean/sd AUC: 0.87/0.08; Mean/sd F1: 0.67/0.21	Holdout validation	Angermueller, 2017 [71]
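
The performance column above mixes accuracy, sensitivity (recall), specificity, precision, and F-score, which can make studies hard to compare at a glance. As a reading aid, the following minimal Python sketch shows how these measures relate to the four cells of a binary confusion matrix. The function name `classification_metrics` and the counts are hypothetical and chosen only for illustration; they are not taken from any study in the table.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive common performance measures from binary confusion-matrix counts.

    tp/fn: cases correctly/incorrectly classified; tn/fp: controls
    correctly/incorrectly classified. Assumes every denominator is non-zero.
    """
    sensitivity = tp / (tp + fn)          # recall: share of cases detected
    specificity = tn / (tn + fp)          # share of controls correctly excluded
    precision = tp / (tp + fp)            # share of positive calls that are cases
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts (illustration only, not from any listed study):
# 50 cases (45 detected) and 50 controls (40 correctly excluded).
metrics = classification_metrics(tp=45, fp=10, tn=40, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy alone can hide an imbalance between sensitivity and specificity, as in the cerebral palsy row (sensitivity 100%, specificity 40%), so the separate measures are worth comparing wherever a study reports them.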

ISSN: 1868-7083
