Machine learning and clinical epigenetics: a review of challenges for diagnosis and classification
Clinical Epigenetics volume 12, Article number: 51 (2020)
Machine learning is a sub-field of artificial intelligence, which utilises large data sets to make predictions for future events. Although most algorithms used in machine learning were developed as far back as the 1950s, the advent of big data in combination with dramatically increased computing power has spurred renewed interest in this technology over the last two decades.
Within the medical field, machine learning is promising in the development of assistive clinical tools for detection of e.g. cancers and prediction of disease. Recent advances in deep learning technologies, a sub-discipline of machine learning that requires less user input but more data and processing power, has provided even greater promise in assisting physicians to achieve accurate diagnoses.
Within the fields of genetics and its sub-field epigenetics, both prime examples of complex data, machine learning methods are on the rise, as the field of personalised medicine is aiming for treatment of the individual based on their genetic and epigenetic profiles.
We now have an ever-growing number of reported epigenetic alterations in disease, and this offers a chance to increase sensitivity and specificity of future diagnostics and therapies. Currently, there are limited studies using machine learning applied to epigenetics. They pertain to a wide variety of disease states and have used mostly supervised machine learning methods.
Clinical epigenetics is a promising field of research. There is evidence that DNA methylation changes at cytosine-phosphate-guanine (CpG) sites are associated with disease development [1,2,3]. Beyond genetic background, DNA methylation may additionally reflect environmental exposures and could improve diagnostic accuracy and prognostic prediction of certain diseases and be targetable by personalised therapy in the future [4, 5].
The current medical environment is characterised by collection of vast amounts of patient, hospital, and administrative data [6, 7], which makes traditional approaches to investigating these data individually less ideal. Machine learning (ML), however, is able to integrate large and complex data sets . These data sources have the potential to enhance patient care and outcomes. A personalised medicine approach is tightly connected to increases in omics-data. For example, DNA sequence databases double in size twice a year . Indeed, the increases in computer processing coupled with the rapid reduction in the cost of genomic sequencing have outpaced the rate of computing hardware advances . Whilst far from a panacea, ML may be a tool to assist physicians in interpreting information-rich clinical data, including those collected in epigenetic studies [11, 12].
This review was guided by the question, “What are the machine learning models that utilize DNA methylation to classify or diagnose disease states?” This review focused on three key aspects within the search strategy, namely, the data science technique, the biomedical technique, and the outcome of interest. The search strategy involved two databases, namely, PubMed and Google Scholar. The search string for the PubMed database was as follows: (‘machine learning’ OR ‘artificial intelligence’) AND (“epigenetic*” OR “DNA methylation”) AND (“classification” OR “diagnosis”). For Google Scholar, the terms machine learning, artificial intelligence, epigenetic, DNA methylation, classification, and diagnosis were utilized. Following the identification of key articles, references in the identified articles were checked to further identify relevant literature (n = 1). Once selected, all literature was evaluated for the type of ML utilized, the type of DNA methylation technique used, ML performance measures, validation technique, and the number of samples and number of controls in testing sets and validation sets.
This review is written in the context of the concurrent burgeoning interest for the medical practitioner in potential clinical applications of epigenetics and ML. The first aim of this review is to provide a brief overview of epigenetics, followed by its clinical application potentials. The second aim is to provide a brief summary of the current state of ML and its application to the field of epigenetics and personalised medicine. Finally, section three delves into future directions that may be of value to scientists and physicians looking to harness the power of ML in epigenetics. As the field of ML is likely to find widespread application in clinical practice via diagnostic tools, this review aims to be a brief guide to the current state of ML in epigenetics.
Epigenetics and its clinical potential
Epigenetics, sometimes described as the study of heritable changes in gene expression that occur without a change in DNA sequence , is postulated to be the product of a complex interaction between an individual’s genotype, age, and lifestyle factors such as diet, alcohol consumption, and smoking [14,15,16,17]. In 1942, the term “epigenetics” was first coined by Conrad H Waddington . The word is derived from the Greek word “epigenesis”, and initially described the influences of genetic processes on development .
Several diseases have been shown to be associated with differential DNA methylation including various cancers, obesity, and cardiovascular disease [19,20,21,22,23]. Broadly, four major categories of epigenetic changes exist: DNA methylation, RNA-centred mechanisms (including non-coding RNAs and microRNAs), histone modifications, and chromatin conformation . Of these, DNA methylation is the most commonly studied epigenetic modification in mammals, particularly methylation of a cytosine molecule adjacent to a guanine molecule . The cytosine-guanine dinucleotide is referred to as a CpG site and these sites often occur in clusters termed CpG islands .
One of the most popular methods of measuring genome-wide DNA methylation profiles is through microarrays, chiefly the Illumina HumanMethylation Infinium BeadArray . Each generation of the Illumina technology has been associated with diminishing cost and a larger portion of the genome measured, with the number of CpG sites measured from ~ 27,000  to ~ 450 000  and most recently to ~ 850,000 with the EPIC array . Other techniques, such as pyrosequencing and methyl-sensitive endonuclease restriction, are potentially more accurate than the Illumina HumanMethylation microarray technique, but only suitable for low-throughput studies, as they are also very time-consuming . Therefore, whilst the Illumina microarray has limitations, it is still one of the most widely used DNA methylation techniques in the epigenetic field [27, 31].
A recent review in Nature Review Genetics gives a comprehensive overview of the clinical potential of epigenetics . Epigenetics is closely linked to environmental influences and hence potentially better suited to disease diagnosis and treatment than genetics alone . As epigenetics has been shown to play a role in the mediation between early life adverse environments and later life disease onset, it has a potential role for early diagnosis . It has been shown that adverse early life, such as famine  or exposure to maternal smoking during pregnancy [15, 35], can program the development of the child mediated on an epigenetic level .
However, the biggest successes to date in using epigenetic information as a biomarker have been achieved in oncology, where biomarkers have been approved by the US Food and Drug Administration . One such example is the mSEPT9 biomarker for colorectal cancer, which has been discovered in 2003 and is now a commercialized kit that can diagnose colorectal cancer in blood plasma based on epigenetic markers .
To date, ML has yielded limited biomarkers that have made it into current clinical practice. However, it is likely that in the upcoming decades the application of ML to the epigenome  will yield many more potential biomarkers and drug targets, particularly because ML is optimized to find meaning in large and complex data sets. In genomics and transcriptomics, ML methods are already used for example in gene set enrichment analysis, to find highly overrepresented pathways .
Overview of machine learning and systematic literature review for machine learning in epigenetics
AI, as part of computer science, uses algorithms to allow computers to perform traditionally ‘human’ executive functions such as problem-solving and decision-making . AI includes fields such as natural language processing, expert system, robotics, and ML . The various biomedical applications of AI fields other than ML is beyond the scope of the current review, and substantial reviews are available elsewhere [40, 42,43,44]. As previously mentioned, one subdiscipline of AI that shows strong potential in the field of data-driven medical fields is that of ML [11, 45].
ML enables computers to learn and make predictions by finding patterns within the data . With increased amounts of data available, ML approaches become more adept at pattern prediction, a factor that makes ML particularly suited to data-rich medical fields like genomics and its sub-field epigenetics. ML algorithms are generally categorised into supervised, unsupervised, and deep learning. A simplified visual representation of the relationship between these fields is presented in Fig. 1.
Within the field, there are some essential concepts that clinicians ought to be familiar with when considering ML. A simplified approach to steps for developing and applying an ML algorithm is outlined in Fig. 2. A suggested processing pipeline is to split the available data into three sub data sets: a training data set, where the selected algorithm is optimised and the parameters are evaluated, a test data set, where the performance of the trained algorithm is evaluated, and a validation data set, which ideally comes from a different source than the training and test data set. This last step, the validation, is not always possible due to unavailability of data but allows for a more robust estimation of the algorithm performance beyond the training data set. A good alternative for this is k-fold cross-validation. This means, during the training process, the data is randomly split into k training and test sets, which allows for a good approximation of the external validity of the model . Common performance measures employed in classification tasks that use balanced data sets for training are accuracy, sensitivity, specificity, and precision [47, 48]. For imbalanced data sets (low number of cases versus controls), more robust performance evaluators that take into account class distribution are more appropriate, for example, F1-score, area under the curve (AUC), and Cohen’s Kappa [47,48,49].
Supervised learning is a subset of ML where labels to a dataset are known, for example, cancer patients versus healthy controls, which is subsequently used to train an algorithm that can make predictions about the health outcome on unseen data, without knowing the disease status [11, 40]. This form of ML is reliant on user input to categorise the different instances in the learning process. Supervised learning algorithms have been effectively utilised in classification and prediction tasks . Commonly used algorithms within this category of ML include linear or logistic regression, support vector machine, random forest algorithms, and least absolute shrinkage and selection operator regression (LASSO) . Briefly, support vector machine is based on the idea that by transforming the data, eventually it will be possible to separate classes by a hyperplane, which in the two-dimensional space is a simple line . The points nearest to this hyperplane are called support vectors and are essential for the classification . A Random Forest algorithm is a decision tree-based model, that builds up a multitude of decision trees of differing depth . Further, for every tree, a random subset of the data set is utilised and at every split in the decision tree, a random subset of the features is used. This makes every decision tree in the forest highly uncorrelated to the next and the final predictor, which is an average of the whole ensemble of trees, will be highly unbiased . Finally, LASSO is a logistic regression based model that also performs feature selection, meaning the most important variables for prediction are selected from the data set via a so-called penalization model that weighs the features depending on their effect . For further information and details on the algorithms, please refer to the original publications referenced here [40, 51, 52].
Examples of supervised learning using epigenetic data include classification of metastatic brain tumours, prostate cancer, coronary heart disease, neurodevelopmental syndromes, and central nervous system tumours [53,54,55,56,57]. This review focuses on supervised learning, as this is mostly used when trying to develop a diagnostic test to assist clinicians in the diagnostic process (examples: Tabl 1).
Whilst supervised learning provides a robust method by which to classify diseases versus healthy individuals, there are inherent limitations. Firstly, supervised learning usually requires user input in order to define training classes (or classify the disease and healthy patients) to develop a model . Secondly, since ML algorithms are sensitive to the quality of the data, it is essential that they be correctly labelled . If the training data has examples that are incorrectly labelled, the supervised learning classifier will make incorrect predictions . Finally, supervised learning is susceptible to ‘over-fitting’—the tendency to work very well on the training data but having limited performance on other external data sets . Despite these limitations, supervised learning is one of the most widely used ML techniques in classification and prediction in epigenetics (Table 1).
Another class type of algorithm that can be used in supervised ML is deep learning. Deep learning algorithms are capable of processing high volume, high-dimensionality data—data with a high number of variable input sources—and identifying complex patterns . For epigenetics, deep learning provides an enticing avenue to explore. Common deep learning techniques include artificial neural networks and convolutional neural networks [59, 60]. Historically, deep learning is considered one of the more computationally expensive types of AI, requiring large amounts of computing power in order to be effective . The advances of computing power and high-speed internet in the last half a decade has led to efficient and effective use of deep learning, particularly through web-based (super-)computing services such as Amazon Web Services, Google’s Cloud service, and Microsoft Azure.
Perhaps the most problematic issue with deep learning is the inability to identify precisely how the algorithm has determined the outcome, known colloquially as ‘black-boxing’ . Black-boxing is an especially significant limitation in the medical context due to the implications on patient safety and ability to prove clinical reasoning [61, 62].
In contrast to supervised learning, unsupervised learning does not require labels in order to work [40, 63]. However, whilst unsupervised algorithms provide strength of correlation between individual variables within a data set, they are unable to assign the potential biological relevance and/or plausibility of these patterns of correlation [40, 63]. Therefore, human input is required to assess the biological plausibility and the salience of any associated clusters identified by the algorithm [40, 63]. Common problems that unsupervised learning has been used for include clustering and association tasks . Clustering, as the name suggests, clusters data points according to inherent groupings in the data. Common methods used in unsupervised learning include k-means clustering and hierarchical clustering, principle component analysis, and partial least squares discriminant analysis [64, 65]. The latter two methods are often utilised in dimensionality reduction, or the removal of random input variables to increase the performance of a model .
Within an epigenetic context, unsupervised learning can be used to detect DNA methylation patterns between diseased and non-diseased individuals, for example, between breast cancer brain metastases subtypes [38, 57]. Unsupervised learning algorithms are especially useful to detect patterns in data sets that have large amounts of data points, such as those in microarray and omics data sets [66, 67].
The main limitation of unsupervised learning is that the algorithms do not provide insight into the importance or relevance of clustering and/or associations . The concept of ‘correlation does not mean causation’ is especially relevant to unsupervised ML. Due to the inability of unsupervised ML algorithms to prescribe meaning to associations, caution should be exercised when interpreting any associations identified by an unsupervised ML algorithm, as they may be data artefacts as opposed to true biological effects. Furthermore, unsupervised learning is sensitive to noise within the data . If there is a large amount of irrelevant data within a data set, an unsupervised learning algorithm may cluster points erroneously. Therefore, data used for unsupervised learning must be carefully pre-processed to ensure it is of high quality prior to analysis. Deep learning approaches can also be used for unsupervised tasks. An example of a clinical application is a deep learning model that was trained on unlabelled mammography images to identify breast density scores which showed a very strong positive relationship with manual scores, predictive of breast cancer .
Epigenetics and machine learning: existing literature
Overall, 16 studies were identified that utilised ML to diagnose or classify diseases [39, 54–58, 71–80).
There was extensive heterogeneity in the disease outcomes, types of algorithms, performance measures, validation methods, and sample sizes between studies. Table 1 summarises the studies that have investigated the use of ML for diagnosis or classification in various cancers (n = 10), cerebral palsy (n = 1), neurodevelopmental syndromes (n = 1), coronary artery disease (n = 1), and BAFopathies (n = 1; disruption of the BRG1/BRM-associated factor (BAF) complex has been linked to several neurodevelopmental syndromes, commonly referred to as BAFopathies). A special case where the two identified deep learning approaches, DeepCpG and DeepMethyl, as they both predicted methylation status in the genome rather than a disease status [70, 71] (Table 2).
The types of algorithms used have all been supervised learning, including support vector machines (n = 7), random forest (n = 7), LASSO regression (n = 1), non-metric multidimensional scaling (n = 1), logistic regression (n = 1), convolutional neural network (n = 1), and stacked denoising autoencoder (n = 1). Of note, some research used multiple models.
The types of epigenetic data include microarray techniques (n = 11), bisulphite sequencing (n = 3), and methyl-sensitive restricted endonuclease (n =1). Of these collection methods, most studies used one type of DNA methylation technique only (n = 9), whilst others combined measurement techniques, meaning Infinium HumanMethylation 450K and EPIC or CHIP-Seq from The Encyclopedia of DNA Elements (ENCODE) (n = 5).
From the selected publication, it appears that the two most popular methods were support vector machine and random forest. Based on the approaches identified, it seems the most successful combination is 10-fold cross-validation with either a random forest or support vector machine for array-based methods and deep learning-based models for prediction of the methylation status of the DNA.
Epigenetic data have traits that make it amenable to ML. Firstly, DNA methylation is usually both chemically and biologically stable over time . Consequently, the measurement of DNA methylation allows for a reliable measure of the chemical composition of the epigenome at any given point in time. Secondly, large-scale, data-rich repositories such as The Cancer Genome Atlas (TCGA), ENCODE, and the BLUEPRINT consortium provide large amounts of samples to employ comprehensive, high-throughput statistical analyses of differentially methylated regions with biological relevance [80,81,82]. These repositories may provide for the training data for a ML algorithm, or an independent test set in order to determine the ML algorithm’s external validity and subsequent clinical utility [81, 83]. Since ML algorithms require large amounts of data to make accurate predictions, the establishment of these databanks is a significant milestone in the utility of AI in epigenetics. Finally, most datasets consist of DNA methylation profiles derived from peripheral blood, meaning that patients will only be required to provide a small blood sample. It should be noted that DNA methylation profiles are tissue-specific, and that the use of peripheral blood as a measure of DNA methylation may be less useful in diseases such as certain cancers , with more clinical utility in diseases like obesity [85, 86].
Challenges and future perspectives
Whilst there are advantages to combining epigenetics with ML to assist clinicians in the diagnostic process, there are significant challenges that must be addressed. First, very large datasets, requiring cross-jurisdiction collaboration are needed, especially if the diseases that need prediction are rare. This problem occurs 2-fold in epigenetic data: initially with the patient to healthy control ratio (with many datasets containing many more controls as compared to disease cases) and secondly within the individual methylomes, where there is a higher proportion of sections in the DNA that are densely methylated, referred to as differentially methylated regions (DMR), compared to the number of non-DMR sites [12, 87]. Second, most epigenetic data sets have more variables than samples, making it difficult for many ML algorithms to function effectively . A potential solution is to collect more data, something that collaborative data repositories are providing. Concurrent, careful consideration of the type of algorithm and suitable performance measures of the prediction should be made to prevent erroneous data interpretations.
Third, not all associations in a DNA methylation dataset are linear. Several CpGs may be linked to the same gene which may influence other portions of the methylome and transcriptome, which has particularly been identified as an issue in gene set enrichment analysis [89, 90]. Additionally, the Illumina HumanMethylation450 array only covers 2% of all CpG sites in the methylome . These challenges must be recognised before the full clinical potential of epigenetics is realised.
Fourth, for proper development, improvement and testing of novel machine learning approaches, it will be crucial to increase efforts to make large epigenetic datasets publicly available. This should include the raw data of different platforms, so research can be conducted into the effect of different normalisation methods on ML model performance and assessing which models work best for array-based and bisulphite sequencing-based data formats. One of the largest efforts in providing access to sequencing data is provided by The National Center for Biotechnology Information (NCBI). This includes databases such as the sequencing read archive (SRA) that are invaluable for research into new computational methods . The SRA is operated by the International Nucleotide Sequence Database Collaboration (INSDC) and was initially started to publicly deposit sequencing reads . Currently, more and more funding bodies and scientific journals request a deposition of experiment data in the SRA, which is not only beneficial for reproducibility of research, but also for efforts into the development of new analytical tools. Resources such as SRA made it possible to develop sequencing analysis tools such as Magic-BLAST (Basic Local Alignment Search Tool), which allows to align sequencing reads to a reference genome based on a sequencing database .
In an epigenetic context, deep learning has been used to classify genetic mutations in gliomas and prediction of single-cell DNA methylation status [71, 93]. Whilst still in its infancy, applications of deep learning to classification tasks using DNA methylation data may have benefits over traditional ML.
Another challenge for the field of ML is prediction bias. Several cases in facial recognition, especially relevant to deep learning because of their black box character, have shown that the predictive models are biased towards populations of European ancestry . Therefore, the challenge of getting representative datasets that do not exacerbate existing health differences for disadvantaged populations is one of the biggest challenges that the ML community needs to address .
As an in-depth introduction to epigenetics and ML was out of the scope of this review, we aimed to give an overview of epigenetics and the potential of ML in clinical applications. The interested reader may refer to the cited literature on the different topics of epigenetics and machine learning.
ML is starting to find patterns in ever-growing genetic and epigenetic data sets that relate to the development of diseases. Although very accurate, deep learning methods will need to undergo further research to define what is going on within the “black box”, before clinicians can confidently make informed decisions whilst utilising such tools. In the meantime, interpretable ML algorithms are likely to be on the horizon with the potential to assist in more confident diagnoses. Whilst ML is sometimes depicted in the media and literature as a threat to the clinician’s profession and autonomy, clinicians should perhaps view its application as an assistive tool. ML can be used, just like evolving technologies across the ages (from the stethoscope, to X-Rays, to MRIs) as providing adjunctive information; it is a matter of being properly informed about limitations of the method of algorithm development and understanding where and to whom it is appropriate to apply.
Availability of data and materials
Area under the curve
Cytosine phosphate guanine
Differentially methylated region
the Encyclopedia of DNA Elements
Least absolute shrinkage and selection operator
The Cancer Genome Atlas
Heyn H, Esteller M. DNA methylation profiling in the clinic: applications and challenges. Nat Rev Genet. 2012;13(10):679–92.
Aslibekyan S, Claas SA, Arnett DK. Clinical applications of epigenetics in cardiovascular disease: the long road ahead. Translational research : the journal of laboratory and clinical medicine. 2015;165(1):143–53.
Mill J, Heijmans BT. From promises to practical strategies in epigenetic epidemiology. Nat Rev Genet. 2013;14(8):585–94.
Jones PA, Issa J-PJ, Baylin S. Targeting the cancer epigenome for therapy. Nat Rev Genet. 2016;17:630.
How Kit A, Nielsen HM, Tost J. DNA methylation based biomarkers: practical considerations and applications. Biochimie. 2012;94(11):2314–37.
Raghupathi W, Raghupathi V. Big data analytics in healthcare: promise and potential. Health Information Science and Systems. 2014;2(1):3.
Wang F, Casalino LP, Khullar D. Deep learning in medicine—promise, progress, and challenges Deep Learning in Medicine—Promise, Progress, and ChallengesDeep Learning in Medicine—Promise, Progress, and Challenges. JAMA Intern Med. 2019;179(3):293–4.
Holzinger A, Jurisica I. Knowledge discovery and data mining in biomedical informatics: the future is in integrative, interactive machine learning solutions. Interactive knowledge discovery and data mining in biomedical informatics: Springer; 2014. p. 1-18.
Pfeiffer G, Baumgart S, Schröder J, Schimmler M, editors. A massively parallel architecture for bioinformatics. Computational Science – ICCS 2009; 2009 2009//; Berlin, Heidelberg: Springer Berlin Heidelberg.
Sarda S, Hannenhalli S. Next-generation sequencing and epigenomics research: a hammer in search of nails. Genomics & informatics. 2014;12(1):2–11.
Rajkomar A, Dean J, Kohane I. Machine Learning in Medicine. N Engl J Med. 2019;380(14):1347–58.
Holder LB, Haque MM, Skinner MK. Machine learning for epigenetics and future medical applications. Epigenetics. 2017;12(7):505–14.
Rodenhiser D, Mann M. Epigenetics and human disease: translating basic biology into clinical applications. Can Med Assoc J. 2006;174(3):341–8.
Joubert BR, Håberg SE, Nilsen RM, Wang X, Vollset SE, Murphy SK, et al. 450K epigenome-wide scan identifies differential DNA methylation in newborns related to maternal smoking during pregnancy. Environ Health Perspect. 2012;120(10):1425–31.
Joubert BR, Felix JF, Yousefi P, Bakulski KM, Just AC, Breton C, et al. DNA methylation in newborns and maternal smoking in pregnancy: genome-wide consortium meta-analysis. Am J Hum Genet. 2016;98(4):680–96.
Anderson OS, Sant KE, Dolinoy DC. Nutrition and epigenetics: an interplay of dietary methyl donors, one-carbon metabolism and DNA methylation. J Nutr Biochem. 2012;23(8):853–9.
Alegría-Torres JA, Baccarelli A, Bollati V. Epigenetics and lifestyle. Epigenomics. 2011;3(3):267–77.
Felsenfeld G. A brief history of epigenetics. Cold Spring Harb Perspect Biol. 2014;6(1):a018200.
Robertson KD. DNA methylation and human disease. Nat Rev Genet. 2005;6(8):597.
Cui H, Cruz-Correa M, Giardiello FM, Hutcheon DF, Kafonek DR, Brandenburg S, et al. Loss of IGF2 imprinting: a potential marker of colorectal cancer risk. Science. 2003;299(5613):1753–5.
Bhusari S, Yang B, Kueck J, Huang W, Jarrard DF. Insulin-like growth factor-2 (IGF2) loss of imprinting marks a field defect within human prostates containing cancer. Prostate. 2011;71(15):1621–30.
Soubry A, Schildkraut JM, Murtha A, Wang F, Huang Z, Bernal A, et al. Paternal obesity is associated with IGF2 hypomethylation in newborns: results from a Newborn Epigenetics Study (NEST) cohort. BMC Med. 2013;11(1):29.
Gluckman PD, Hanson MA, Buklijas T, Low FM, Beedle AS. Epigenetic mechanisms that underpin metabolic and cardiovascular diseases. Nat Rev Endocrinol. 2009;5(7):401.
Liang M. Epigenetic mechanisms and hypertension. Hypertension. 2018;72(6):1244–54.
Bird A. DNA methylation patterns and epigenetic memory. Genes Dev. 2002;16(1):6–21.
Bernstein BE, Meissner A, Lander ES. The mammalian epigenome. Cell. 2007;128(4):669–81.
Kurdyukov S, Bullock M. DNA methylation analysis: choosing the right method. Biology (Basel). 2016;5(1):3.
Bibikova M, Le J, Barnes B, Saedinia-Melnyk S, Zhou L, Shen R, et al. Genome-wide DNA methylation profiling using Infinium® assay. Epigenomics. 2009;1(1):177–200.
Sandoval J, Heyn H, Moran S, Serra-Musach J, Pujana MA, Bibikova M, et al. Validation of a DNA methylation microarray for 450,000 CpG sites in the human genome. Epigenetics. 2011;6(6):692–702.
Moran S, Arribas C, Esteller M. Validation of a DNA methylation microarray for 850,000 CpG sites of the human genome enriched in enhancer sequences. Epigenomics. 2016;8(3):389–99.
Dedeurwaerder S, Defrance M, Bizet M, Calonne E, Bontempi G, Fuks F. A comprehensive overview of Infinium HumanMethylation450 data processing. Brief Bioinform. 2013;15(6):929–41.
Berdasco M, Esteller M. Clinical epigenetics: seizing opportunities for translation. Nat Rev Genet. 2018;1.
Ong M-L, Lin X, Holbrook J. Measuring epigenetics as the mediator of gene/environment interactions in DOHaD. J Dev Orig Health Dis. 2015;6(1):10–6.
Jang H, Serra C. Nutrition, epigenetics, and diseases. Clinical nutrition research. 2014;3(1):1–8.
Rauschert S, Melton P, Burdge G, Craig J, Godfrey K, Holbrook J, et al. Maternal smoking during pregnancy induces persistent epigenetic changes into adolescence, independent of postnatal smoke exposure and is associated with cardiometabolic risk. Front Genet. 2019;10:770.
Bianco-Miotto T, Craig JM, Gasser YP, van Dijk SJ, Ozanne SE. Epigenetics and DOHaD: from basics to birth and beyond. J Dev Orig Health Dis. 2017;8(5):513–9.
Payne SR. From discovery to the clinic: the novel DNA methylation biomarker m SEPT9 for the detection of colorectal cancer in blood. Epigenomics. 2010;2(4):575–85.
Crowgey EL, Marsh AG, Robinson KG, Yeager SK, Akins RE. Epigenetic machine learning: utilizing DNA methylation patterns to predict spastic cerebral palsy. BMC bioinformatics. 2018;19(1):225.
Bari MG, Ung CY, Zhang C, Zhu S, Li H. Machine learning-assisted network inference approach to identify a new class of genes that coordinate the functionality of cancer networks. Sci Rep. 2017;7(1):6993.
Krittanawong C, Zhang H, Wang Z, Aydar M, Kitai T. Artificial intelligence in precision cardiovascular medicine. J Am Coll Cardiol. 2017;69(21):2657–64.
Rech J, Althoff K-D. Artificial intelligence and software engineering: Status and future trends. KI. 2004;18(3):5–11.
Hashimoto DA, Rosman G, Rus D, Meireles OR. Artificial intelligence in surgery: promises and perils. Ann Surg. 2018;268(1):70–6.
Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med. 2019;25(1):44.
Hamet P, Tremblay J. Artificial intelligence in medicine. Metabolism. 2017;69:S36–40.
Saria S, Butte A, Sheikh A. Better medicine through machine learning: what’s real, and what’s artificial? PLoS Med. 2019;15(12):e1002721.
Wong T-T. Performance evaluation of classification algorithms by k-fold and leave-one-out cross validation. Pattern Recogn. 2015;48(9):2839–46.
Ben-David A. Comparison of classification accuracy using Cohen’s Weighted Kappa. Expert Syst Appl. 2008;34(2):825–32.
Sokolova M, Lapalme G. A systematic analysis of performance measures for classification tasks. Inf Process Manag. 2009;45(4):427–37.
Haixiang G, Yijing L, Shang J, Mingyun G, Yuanyue H, Bing G. Learning from class-imbalanced data: Review of methods and applications. Expert Syst Appl. 2017;73:220–39.
Kotsiantis SB, Zaharakis ID, Pintelas PE. Machine learning: a review of classification and combining techniques. Artif Intell Rev. 2006;26(3):159–90.
Cristianini N, Ricci E. Support Vector Machines. In: Kao M-Y, editor. Encyclopedia of Algorithms. Boston, MA: Springer US; 2008. p. 928–32.
Breiman L. Random Forests. machine learning. 2001;45(1):5-32.
Aref-Eshghi E, Rodenhiser DI, Schenkel LC, Lin H, Skinner C, Ainsworth P, et al. Genomic DNA methylation signatures enable concurrent diagnosis and clinical genetic variant classification in neurodevelopmental syndromes. Am J Hum Genet. 2018;102(1):156–74.
Aref-Eshghi E, Schenkel LC, Ainsworth P, Lin H, Rodenhiser DI, Cutz J-C, et al. Genomic DNA methylation-derived algorithm enables accurate detection of malignant prostate tissues. Front Oncol. 2018;8.
Capper D, Jones DT, Sill M, Hovestadt V, Schrimpf D, Sturm D, et al. DNA methylation-based classification of central nervous system tumours. Nature. 2018;555(7697):469.
Dogan MV, Grumbach IM, Michaelson JJ, Philibert RA. Integrated genetic and epigenetic prediction of coronary heart disease in the Framingham Heart Study. PLoS One. 2018;13(1):e0190549.
Orozco JI, Knijnenburg TA, Manughian-Peter AO, Salomon MP, Barkhoudarian G, Jalas JR, et al. Epigenetic Profiling for the Molecular Classification of Metastatic Brain Tumors. bioRxiv. 2018:268193.
Japkowicz N, Stephen S. The class imbalance problem: a systematic study. Intelligent data analysis. 2002;6(5):429–49.
LeCun Y, Bengio Y, Hinton G. Deep learning. nature. 2015;521(7553):436.
Jain AK, Mao J, Mohiuddin KM. Artificial neural networks: a tutorial. Computer. 1996;29(3):31–44.
Rudin C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence. 2019;1(5):206–15.
Zahid FM, Heumann C. Multiple imputation with sequential penalized regression. Statistical methods in medical research. 2018:962280218755574.
Alanazi HO, Abdullah AH, Qureshi KN. A critical review for developing accurate and dynamic predictive models using machine learning methods in medicine and health care. J Med Syst. 2017;41(4):69.
Tarca AL, Carey VJ, Chen X-W, Romero R, Drăghici S. Machine learning and its applications to biology. PLoS Comput Biol. 2007;3(6):e116.
Boulesteix A-L, Strimmer K. Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Brief Bioinform. 2006;8(1):32–44.
Meng C, Zeleznik OA, Thallinger GG, Kuster B, Gholami AM, Culhane AC. Dimension reduction techniques for the integrative analysis of multi-omics data. Brief Bioinform. 2016;17(4):628–41.
Nguyen DV, Rocke DM. Tumor classification by partial least squares using microarray gene expression data. Bioinformatics. 2002;18(1):39–50.
Deo RC. Machine Learning in Medicine. Circulation. 2015;132(20):1920–30.
Kallenberg M, Petersen K, Nielsen M, Ng AY, Diao P, Igel C, et al. Unsupervised deep learning applied to breast density segmentation and mammographic risk scoring. IEEE Trans Med Imaging. 2016;35(5):1322–31.
Wang Y, Liu T, Xu D, Shi H, Zhang C, Mo Y-Y, et al. Predicting DNA methylation state of CpG dinucleotide using genome topological features and deep networks. Sci Rep. 2016;6:19598.
Angermueller C, Lee HJ, Reik W, Stegle O. DeepCpG: accurate prediction of single-cell DNA methylation states using deep learning. Genome Biol. 2017;18(1):67.
Aref-Eshghi E, Bend EG, Hood RL, Schenkel LC, Carere DA, Chakrabarti R, et al. BAFopathies’ DNA methylation epi-signatures demonstrate diagnostic utility and functional continuum of Coffin–Siris and Nicolaides–Baraitser syndromes. Nat Commun. 2018;9(1):4885.
Cai Z, Xu D, Zhang Q, Zhang J, Ngai S-M, Shao J. Classification of lung cancer using ensemble-based feature selection and machine learning methods. Mol BioSyst. 2015;11(3):791–800.
Adorján P, Distler J, Lipscher E, Model F, Müller J, Pelet C, et al. Tumour class prediction and discovery by microarray-based DNA methylation analysis. Nucleic Acids Res. 2002;30(5):e21-e.
List M, Hauschild A-C, Tan Q, Kruse TA, Baumbach J, Batra R. Classification of breast cancer subtypes by combining gene expression and DNA methylation data. Journal of integrative bioinformatics. 2014;11(2):1–14.
Li J, Ching T, Huang S, Garmire LX, editors. Using epigenomics data to predict gene expression in lung cancer. BMC bioinformatics; 2015: BioMed Central.
Queiros AC, Villamor N, Clot G, Martinez-Trillos A, Kulis M, Navarro A, et al. A B-cell epigenetic signature defines three biologic subgroups of chronic lymphocytic leukemia with clinical impact. Leukemia. 2015;29(3):598–605.
Bhoi S, Ljungström V, Baliakas P, Mattsson M, Smedby KE, Juliusson G, et al. Prognostic impact of epigenetic classification in chronic lymphocytic leukemia: the case of subset# 2. Epigenetics. 2016;11(6):449–55.
Malta TM, Sokolov A, Gentles AJ, Burzykowski T, Poisson L, Weinstein JN, et al. Machine learning identifies stemness features associated with oncogenic dedifferentiation. Cell. 2018;173(2):338–54. e15.
Aryee MJ, Jaffe AE, Corrada-Bravo H, Ladd-Acosta C, Feinberg AP, Hansen KD, et al. Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays. Bioinformatics. 2014;30(10):1363–9.
Jaffe AE, Murakami P, Lee H, Leek JT, Fallin MD, Feinberg AP, et al. Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies. Int J Epidemiol. 2012;41(1):200–9.
Silva TC, Colaprico A, Olsen C, D'Angelo F, Bontempi G, Ceccarelli M, et al. TCGA Workflow: analyze cancer genomics and epigenomics data using Bioconductor packages. F1000Res. 2016;5:1542.
Leung MK, Delong A, Alipanahi B, Frey BJ. Machine learning in genomic medicine: a review of computational problems and data sets. Proc IEEE. 2015;104(1):176–97.
Sina AAI, Carrascosa LG, Liang Z, Grewal YS, Wardiana A, Shiddiky MJA, et al. Epigenetically reprogrammed methylation landscape drives the DNA self-assembly and serves as a universal cancer biomarker. Nat Commun. 2018;9(1):4915.
Huang Y-T, Chu S, Loucks EB, Lin C-L, Eaton CB, Buka SL, et al. Epigenome-wide profiling of DNA methylation in paired samples of adipose tissue and blood. Epigenetics. 2016;11(3):227–36.
Hewitt AW, Januar V, Sexton-Oates A, Joo JE, Franchina M, Wang JJ, et al. DNA methylation landscape of ocular tissue relative to matched peripheral blood. Sci Rep. 2017;7:46330.
Haque MM, Skinner MK, Holder LB. Imbalanced class learning in epigenetics. J Comput Biol. 2014;21(7):492–507.
Kirpich A, Ainsworth EA, Wedow JM, Newman JR, Michailidis G, McIntyre LM. Variable selection in omics data: A practical evaluation of small sample sizes. PLoS One. 2018;13(6):e0197910.
Li S, He T, Pawlikowska I, Lin T. Correcting length-bias in gene set analysis for DNA methylation data. Statistics and Its Interface. 2017;10(2):279–89.
Deutsch CK, McIlvane WJ. Non-Mendelian etiologic factors in neuropsychiatric illness: pleiotropy, epigenetics, and convergence. Behav Brain Sci. 2012;35(5):363–4.
Leinonen R, Sugawara H, Shumway M. International nucleotide sequence database C. The sequence read archive. Nucleic Acids Res. 2011;39(Database issue):D19–21.
Boratyn GM, Thierry-Mieg J, Thierry-Mieg D, Busby B, Madden TL. Magic-BLAST, an accurate RNA-seq aligner for long and short reads. BMC Bioinformatics. 2019;20(1):405.
Chang P, Grinband J, Weinberg B, Bardis M, Khy M, Cadena G, et al. Deep-learning convolutional eural Networks Accurately Classify Genetic Mutations in Gliomas. American Journal of Neuroradiology. 2018.
Phillips PJ, Jiang F, Narvekar A, Ayyad J, O'Toole AJ. An other-race effect for face recognition algorithms. ACM Trans Appl Percept. 2011;8(2):1–11.
Char DS, Shah NH, Magnus D. Implementing machine learning in health care—addressing ethical challenges. N Engl J Med. 2018;378(11):981–3.
We would like to acknowledge Professor Lawrence Beilin for reviewing the final manuscript.
SR received support from the European LifeCycle project through the fellowship call of June 2018, grant agreement no. 733206. RCH is supported by NHMRC Fellowship (grant number 1053384). RCH, PM, and SR received further support through the NHMRC EU-collaborative grant with the number APP1142858—early life stressors and lifecycle health.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Rauschert, S., Raubenheimer, K., Melton, P.E. et al. Machine learning and clinical epigenetics: a review of challenges for diagnosis and classification. Clin Epigenet 12, 51 (2020). https://doi.org/10.1186/s13148-020-00842-4