Skip to main content

Machine learning and clinical epigenetics: a review of challenges for diagnosis and classification



Machine learning is a sub-field of artificial intelligence, which utilises large data sets to make predictions for future events. Although most algorithms used in machine learning were developed as far back as the 1950s, the advent of big data in combination with dramatically increased computing power has spurred renewed interest in this technology over the last two decades.

Main body

Within the medical field, machine learning is promising in the development of assistive clinical tools for detection of e.g. cancers and prediction of disease. Recent advances in deep learning technologies, a sub-discipline of machine learning that requires less user input but more data and processing power, has provided even greater promise in assisting physicians to achieve accurate diagnoses.

Within the fields of genetics and its sub-field epigenetics, both prime examples of complex data, machine learning methods are on the rise, as the field of personalised medicine is aiming for treatment of the individual based on their genetic and epigenetic profiles.


We now have an ever-growing number of reported epigenetic alterations in disease, and this offers a chance to increase sensitivity and specificity of future diagnostics and therapies. Currently, there are limited studies using machine learning applied to epigenetics. They pertain to a wide variety of disease states and have used mostly supervised machine learning methods.


Clinical epigenetics is a promising field of research. There is evidence that DNA methylation changes at cytosine-phosphate-guanine (CpG) sites are associated with disease development [1,2,3]. Beyond genetic background, DNA methylation may additionally reflect environmental exposures and could improve diagnostic accuracy and prognostic prediction of certain diseases and be targetable by personalised therapy in the future [4, 5].

The current medical environment is characterised by collection of vast amounts of patient, hospital, and administrative data [6, 7], which makes traditional approaches to investigating these data individually less ideal. Machine learning (ML), however, is able to integrate large and complex data sets [8]. These data sources have the potential to enhance patient care and outcomes. A personalised medicine approach is tightly connected to increases in omics-data. For example, DNA sequence databases double in size twice a year [9]. Indeed, the increases in computer processing coupled with the rapid reduction in the cost of genomic sequencing have outpaced the rate of computing hardware advances [10]. Whilst far from a panacea, ML may be a tool to assist physicians in interpreting information-rich clinical data, including those collected in epigenetic studies [11, 12].

This review was guided by the question, “What are the machine learning models that utilize DNA methylation to classify or diagnose disease states?” This review focused on three key aspects within the search strategy, namely, the data science technique, the biomedical technique, and the outcome of interest. The search strategy involved two databases, namely, PubMed and Google Scholar. The search string for the PubMed database was as follows: (‘machine learning’ OR ‘artificial intelligence’) AND (“epigenetic*” OR “DNA methylation”) AND (“classification” OR “diagnosis”). For Google Scholar, the terms machine learning, artificial intelligence, epigenetic, DNA methylation, classification, and diagnosis were utilized. Following the identification of key articles, references in the identified articles were checked to further identify relevant literature (n = 1). Once selected, all literature was evaluated for the type of ML utilized, the type of DNA methylation technique used, ML performance measures, validation technique, and the number of samples and number of controls in testing sets and validation sets.

This review is written in the context of the concurrent burgeoning interest for the medical practitioner in potential clinical applications of epigenetics and ML. The first aim of this review is to provide a brief overview of epigenetics, followed by its clinical application potentials. The second aim is to provide a brief summary of the current state of ML and its application to the field of epigenetics and personalised medicine. Finally, section three delves into future directions that may be of value to scientists and physicians looking to harness the power of ML in epigenetics. As the field of ML is likely to find widespread application in clinical practice via diagnostic tools, this review aims to be a brief guide to the current state of ML in epigenetics.

Epigenetics and its clinical potential

Epigenetics, sometimes described as the study of heritable changes in gene expression that occur without a change in DNA sequence [13], is postulated to be the product of a complex interaction between an individual’s genotype, age, and lifestyle factors such as diet, alcohol consumption, and smoking [14,15,16,17]. In 1942, the term “epigenetics” was first coined by Conrad H Waddington [18]. The word is derived from the Greek word “epigenesis”, and initially described the influences of genetic processes on development [18].

Several diseases have been shown to be associated with differential DNA methylation including various cancers, obesity, and cardiovascular disease [19,20,21,22,23]. Broadly, four major categories of epigenetic changes exist: DNA methylation, RNA-centred mechanisms (including non-coding RNAs and microRNAs), histone modifications, and chromatin conformation [24]. Of these, DNA methylation is the most commonly studied epigenetic modification in mammals, particularly methylation of a cytosine molecule adjacent to a guanine molecule [25]. The cytosine-guanine dinucleotide is referred to as a CpG site and these sites often occur in clusters termed CpG islands [26].

One of the most popular methods of measuring genome-wide DNA methylation profiles is through microarrays, chiefly the Illumina HumanMethylation Infinium BeadArray [27]. Each generation of the Illumina technology has been associated with diminishing cost and a larger portion of the genome measured, with the number of CpG sites measured from ~ 27,000 [28] to ~ 450 000 [29] and most recently to ~ 850,000 with the EPIC array [30]. Other techniques, such as pyrosequencing and methyl-sensitive endonuclease restriction, are potentially more accurate than the Illumina HumanMethylation microarray technique, but only suitable for low-throughput studies, as they are also very time-consuming [27]. Therefore, whilst the Illumina microarray has limitations, it is still one of the most widely used DNA methylation techniques in the epigenetic field [27, 31].

A recent review in Nature Review Genetics gives a comprehensive overview of the clinical potential of epigenetics [32]. Epigenetics is closely linked to environmental influences and hence potentially better suited to disease diagnosis and treatment than genetics alone [32]. As epigenetics has been shown to play a role in the mediation between early life adverse environments and later life disease onset, it has a potential role for early diagnosis [33]. It has been shown that adverse early life, such as famine [34] or exposure to maternal smoking during pregnancy [15, 35], can program the development of the child mediated on an epigenetic level [36].

However, the biggest successes to date in using epigenetic information as a biomarker have been achieved in oncology, where biomarkers have been approved by the US Food and Drug Administration [37]. One such example is the mSEPT9 biomarker for colorectal cancer, which has been discovered in 2003 and is now a commercialized kit that can diagnose colorectal cancer in blood plasma based on epigenetic markers [37].

To date, ML has yielded limited biomarkers that have made it into current clinical practice. However, it is likely that in the upcoming decades the application of ML to the epigenome [38] will yield many more potential biomarkers and drug targets, particularly because ML is optimized to find meaning in large and complex data sets. In genomics and transcriptomics, ML methods are already used for example in gene set enrichment analysis, to find highly overrepresented pathways [39].

Overview of machine learning and systematic literature review for machine learning in epigenetics

AI, as part of computer science, uses algorithms to allow computers to perform traditionally ‘human’ executive functions such as problem-solving and decision-making [40]. AI includes fields such as natural language processing, expert system, robotics, and ML [41]. The various biomedical applications of AI fields other than ML is beyond the scope of the current review, and substantial reviews are available elsewhere [40, 42,43,44]. As previously mentioned, one subdiscipline of AI that shows strong potential in the field of data-driven medical fields is that of ML [11, 45].

ML enables computers to learn and make predictions by finding patterns within the data [40]. With increased amounts of data available, ML approaches become more adept at pattern prediction, a factor that makes ML particularly suited to data-rich medical fields like genomics and its sub-field epigenetics. ML algorithms are generally categorised into supervised, unsupervised, and deep learning. A simplified visual representation of the relationship between these fields is presented in Fig. 1.

Fig. 1

Overview of the field of artificial intelligence and its sub-field machine learning

Within the field, there are some essential concepts that clinicians ought to be familiar with when considering ML. A simplified approach to steps for developing and applying an ML algorithm is outlined in Fig. 2. A suggested processing pipeline is to split the available data into three sub data sets: a training data set, where the selected algorithm is optimised and the parameters are evaluated, a test data set, where the performance of the trained algorithm is evaluated, and a validation data set, which ideally comes from a different source than the training and test data set. This last step, the validation, is not always possible due to unavailability of data but allows for a more robust estimation of the algorithm performance beyond the training data set. A good alternative for this is k-fold cross-validation. This means, during the training process, the data is randomly split into k training and test sets, which allows for a good approximation of the external validity of the model [46]. Common performance measures employed in classification tasks that use balanced data sets for training are accuracy, sensitivity, specificity, and precision [47, 48]. For imbalanced data sets (low number of cases versus controls), more robust performance evaluators that take into account class distribution are more appropriate, for example, F1-score, area under the curve (AUC), and Cohen’s Kappa [47,48,49].

Fig. 2

Workflow for applying a machine learning algorithm

Supervised learning

Supervised learning is a subset of ML where labels to a dataset are known, for example, cancer patients versus healthy controls, which is subsequently used to train an algorithm that can make predictions about the health outcome on unseen data, without knowing the disease status [11, 40]. This form of ML is reliant on user input to categorise the different instances in the learning process. Supervised learning algorithms have been effectively utilised in classification and prediction tasks [50]. Commonly used algorithms within this category of ML include linear or logistic regression, support vector machine, random forest algorithms, and least absolute shrinkage and selection operator regression (LASSO) [40]. Briefly, support vector machine is based on the idea that by transforming the data, eventually it will be possible to separate classes by a hyperplane, which in the two-dimensional space is a simple line [51]. The points nearest to this hyperplane are called support vectors and are essential for the classification [51]. A Random Forest algorithm is a decision tree-based model, that builds up a multitude of decision trees of differing depth [52]. Further, for every tree, a random subset of the data set is utilised and at every split in the decision tree, a random subset of the features is used. This makes every decision tree in the forest highly uncorrelated to the next and the final predictor, which is an average of the whole ensemble of trees, will be highly unbiased [52]. Finally, LASSO is a logistic regression based model that also performs feature selection, meaning the most important variables for prediction are selected from the data set via a so-called penalization model that weighs the features depending on their effect [40]. For further information and details on the algorithms, please refer to the original publications referenced here [40, 51, 52].

Examples of supervised learning using epigenetic data include classification of metastatic brain tumours, prostate cancer, coronary heart disease, neurodevelopmental syndromes, and central nervous system tumours [53,54,55,56,57]. This review focuses on supervised learning, as this is mostly used when trying to develop a diagnostic test to assist clinicians in the diagnostic process (examples: Tabl 1).

Whilst supervised learning provides a robust method by which to classify diseases versus healthy individuals, there are inherent limitations. Firstly, supervised learning usually requires user input in order to define training classes (or classify the disease and healthy patients) to develop a model [40]. Secondly, since ML algorithms are sensitive to the quality of the data, it is essential that they be correctly labelled [40]. If the training data has examples that are incorrectly labelled, the supervised learning classifier will make incorrect predictions [40]. Finally, supervised learning is susceptible to ‘over-fitting’—the tendency to work very well on the training data but having limited performance on other external data sets [58]. Despite these limitations, supervised learning is one of the most widely used ML techniques in classification and prediction in epigenetics (Table 1).

Table 1 Brief overview of some of the most frequently used performance measures for machine learning models

Another class type of algorithm that can be used in supervised ML is deep learning. Deep learning algorithms are capable of processing high volume, high-dimensionality data—data with a high number of variable input sources—and identifying complex patterns [59]. For epigenetics, deep learning provides an enticing avenue to explore. Common deep learning techniques include artificial neural networks and convolutional neural networks [59, 60]. Historically, deep learning is considered one of the more computationally expensive types of AI, requiring large amounts of computing power in order to be effective [59]. The advances of computing power and high-speed internet in the last half a decade has led to efficient and effective use of deep learning, particularly through web-based (super-)computing services such as Amazon Web Services, Google’s Cloud service, and Microsoft Azure.

Perhaps the most problematic issue with deep learning is the inability to identify precisely how the algorithm has determined the outcome, known colloquially as ‘black-boxing’ [61]. Black-boxing is an especially significant limitation in the medical context due to the implications on patient safety and ability to prove clinical reasoning [61, 62].

Unsupervised learning

In contrast to supervised learning, unsupervised learning does not require labels in order to work [40, 63]. However, whilst unsupervised algorithms provide strength of correlation between individual variables within a data set, they are unable to assign the potential biological relevance and/or plausibility of these patterns of correlation [40, 63]. Therefore, human input is required to assess the biological plausibility and the salience of any associated clusters identified by the algorithm [40, 63]. Common problems that unsupervised learning has been used for include clustering and association tasks [40]. Clustering, as the name suggests, clusters data points according to inherent groupings in the data. Common methods used in unsupervised learning include k-means clustering and hierarchical clustering, principle component analysis, and partial least squares discriminant analysis [64, 65]. The latter two methods are often utilised in dimensionality reduction, or the removal of random input variables to increase the performance of a model [66].

Within an epigenetic context, unsupervised learning can be used to detect DNA methylation patterns between diseased and non-diseased individuals, for example, between breast cancer brain metastases subtypes [38, 57]. Unsupervised learning algorithms are especially useful to detect patterns in data sets that have large amounts of data points, such as those in microarray and omics data sets [66, 67].

The main limitation of unsupervised learning is that the algorithms do not provide insight into the importance or relevance of clustering and/or associations [68]. The concept of ‘correlation does not mean causation’ is especially relevant to unsupervised ML. Due to the inability of unsupervised ML algorithms to prescribe meaning to associations, caution should be exercised when interpreting any associations identified by an unsupervised ML algorithm, as they may be data artefacts as opposed to true biological effects. Furthermore, unsupervised learning is sensitive to noise within the data [40]. If there is a large amount of irrelevant data within a data set, an unsupervised learning algorithm may cluster points erroneously. Therefore, data used for unsupervised learning must be carefully pre-processed to ensure it is of high quality prior to analysis. Deep learning approaches can also be used for unsupervised tasks. An example of a clinical application is a deep learning model that was trained on unlabelled mammography images to identify breast density scores which showed a very strong positive relationship with manual scores, predictive of breast cancer [69].

Epigenetics and machine learning: existing literature

Overall, 16 studies were identified that utilised ML to diagnose or classify diseases [39, 54–58, 71–80).

There was extensive heterogeneity in the disease outcomes, types of algorithms, performance measures, validation methods, and sample sizes between studies. Table 1 summarises the studies that have investigated the use of ML for diagnosis or classification in various cancers (n = 10), cerebral palsy (n = 1), neurodevelopmental syndromes (n = 1), coronary artery disease (n = 1), and BAFopathies (n = 1; disruption of the BRG1/BRM-associated factor (BAF) complex has been linked to several neurodevelopmental syndromes, commonly referred to as BAFopathies). A special case where the two identified deep learning approaches, DeepCpG and DeepMethyl, as they both predicted methylation status in the genome rather than a disease status [70, 71] (Table 2).

Table 2 Overview of the literature on machine learning and clinical epigenetics, including data type, machine learning method used, sample size, and performance measures.

The types of algorithms used have all been supervised learning, including support vector machines (n = 7), random forest (n = 7), LASSO regression (n = 1), non-metric multidimensional scaling (n = 1), logistic regression (n = 1), convolutional neural network (n = 1), and stacked denoising autoencoder (n = 1). Of note, some research used multiple models.

The types of epigenetic data include microarray techniques (n = 11), bisulphite sequencing (n = 3), and methyl-sensitive restricted endonuclease (n =1). Of these collection methods, most studies used one type of DNA methylation technique only (n = 9), whilst others combined measurement techniques, meaning Infinium HumanMethylation 450K and EPIC or CHIP-Seq from The Encyclopedia of DNA Elements (ENCODE) (n = 5).

From the selected publication, it appears that the two most popular methods were support vector machine and random forest. Based on the approaches identified, it seems the most successful combination is 10-fold cross-validation with either a random forest or support vector machine for array-based methods and deep learning-based models for prediction of the methylation status of the DNA.

Epigenetic data have traits that make it amenable to ML. Firstly, DNA methylation is usually both chemically and biologically stable over time [5]. Consequently, the measurement of DNA methylation allows for a reliable measure of the chemical composition of the epigenome at any given point in time. Secondly, large-scale, data-rich repositories such as The Cancer Genome Atlas (TCGA), ENCODE, and the BLUEPRINT consortium provide large amounts of samples to employ comprehensive, high-throughput statistical analyses of differentially methylated regions with biological relevance [80,81,82]. These repositories may provide for the training data for a ML algorithm, or an independent test set in order to determine the ML algorithm’s external validity and subsequent clinical utility [81, 83]. Since ML algorithms require large amounts of data to make accurate predictions, the establishment of these databanks is a significant milestone in the utility of AI in epigenetics. Finally, most datasets consist of DNA methylation profiles derived from peripheral blood, meaning that patients will only be required to provide a small blood sample. It should be noted that DNA methylation profiles are tissue-specific, and that the use of peripheral blood as a measure of DNA methylation may be less useful in diseases such as certain cancers [84], with more clinical utility in diseases like obesity [85, 86].

Challenges and future perspectives

Whilst there are advantages to combining epigenetics with ML to assist clinicians in the diagnostic process, there are significant challenges that must be addressed. First, very large datasets, requiring cross-jurisdiction collaboration are needed, especially if the diseases that need prediction are rare. This problem occurs 2-fold in epigenetic data: initially with the patient to healthy control ratio (with many datasets containing many more controls as compared to disease cases) and secondly within the individual methylomes, where there is a higher proportion of sections in the DNA that are densely methylated, referred to as differentially methylated regions (DMR), compared to the number of non-DMR sites [12, 87]. Second, most epigenetic data sets have more variables than samples, making it difficult for many ML algorithms to function effectively [88]. A potential solution is to collect more data, something that collaborative data repositories are providing. Concurrent, careful consideration of the type of algorithm and suitable performance measures of the prediction should be made to prevent erroneous data interpretations.

Third, not all associations in a DNA methylation dataset are linear. Several CpGs may be linked to the same gene which may influence other portions of the methylome and transcriptome, which has particularly been identified as an issue in gene set enrichment analysis [89, 90]. Additionally, the Illumina HumanMethylation450 array only covers 2% of all CpG sites in the methylome [27]. These challenges must be recognised before the full clinical potential of epigenetics is realised.

Fourth, for proper development, improvement and testing of novel machine learning approaches, it will be crucial to increase efforts to make large epigenetic datasets publicly available. This should include the raw data of different platforms, so research can be conducted into the effect of different normalisation methods on ML model performance and assessing which models work best for array-based and bisulphite sequencing-based data formats. One of the largest efforts in providing access to sequencing data is provided by The National Center for Biotechnology Information (NCBI). This includes databases such as the sequencing read archive (SRA) that are invaluable for research into new computational methods [91]. The SRA is operated by the International Nucleotide Sequence Database Collaboration (INSDC) and was initially started to publicly deposit sequencing reads [91]. Currently, more and more funding bodies and scientific journals request a deposition of experiment data in the SRA, which is not only beneficial for reproducibility of research, but also for efforts into the development of new analytical tools. Resources such as SRA made it possible to develop sequencing analysis tools such as Magic-BLAST (Basic Local Alignment Search Tool), which allows to align sequencing reads to a reference genome based on a sequencing database [92].

In an epigenetic context, deep learning has been used to classify genetic mutations in gliomas and prediction of single-cell DNA methylation status [71, 93]. Whilst still in its infancy, applications of deep learning to classification tasks using DNA methylation data may have benefits over traditional ML.

Another challenge for the field of ML is prediction bias. Several cases in facial recognition, especially relevant to deep learning because of their black box character, have shown that the predictive models are biased towards populations of European ancestry [94]. Therefore, the challenge of getting representative datasets that do not exacerbate existing health differences for disadvantaged populations is one of the biggest challenges that the ML community needs to address [95].


As an in-depth introduction to epigenetics and ML was out of the scope of this review, we aimed to give an overview of epigenetics and the potential of ML in clinical applications. The interested reader may refer to the cited literature on the different topics of epigenetics and machine learning.

ML is starting to find patterns in ever-growing genetic and epigenetic data sets that relate to the development of diseases. Although very accurate, deep learning methods will need to undergo further research to define what is going on within the “black box”, before clinicians can confidently make informed decisions whilst utilising such tools. In the meantime, interpretable ML algorithms are likely to be on the horizon with the potential to assist in more confident diagnoses. Whilst ML is sometimes depicted in the media and literature as a threat to the clinician’s profession and autonomy, clinicians should perhaps view its application as an assistive tool. ML can be used, just like evolving technologies across the ages (from the stethoscope, to X-Rays, to MRIs) as providing adjunctive information; it is a matter of being properly informed about limitations of the method of algorithm development and understanding where and to whom it is appropriate to apply.

Availability of data and materials

Not applicable



Artificial intelligence


Area under the curve


Cytosine phosphate guanine


Differentially methylated region


the Encyclopedia of DNA Elements


Least absolute shrinkage and selection operator


Machine learning


The Cancer Genome Atlas


  1. 1.

    Heyn H, Esteller M. DNA methylation profiling in the clinic: applications and challenges. Nat Rev Genet. 2012;13(10):679–92.

    CAS  PubMed  Article  Google Scholar 

  2. 2.

    Aslibekyan S, Claas SA, Arnett DK. Clinical applications of epigenetics in cardiovascular disease: the long road ahead. Translational research : the journal of laboratory and clinical medicine. 2015;165(1):143–53.

    CAS  Article  Google Scholar 

  3. 3.

    Mill J, Heijmans BT. From promises to practical strategies in epigenetic epidemiology. Nat Rev Genet. 2013;14(8):585–94.

    CAS  PubMed  Article  Google Scholar 

  4. 4.

    Jones PA, Issa J-PJ, Baylin S. Targeting the cancer epigenome for therapy. Nat Rev Genet. 2016;17:630.

    CAS  PubMed  Article  Google Scholar 

  5. 5.

    How Kit A, Nielsen HM, Tost J. DNA methylation based biomarkers: practical considerations and applications. Biochimie. 2012;94(11):2314–37.

    CAS  PubMed  Article  Google Scholar 

  6. 6.

    Raghupathi W, Raghupathi V. Big data analytics in healthcare: promise and potential. Health Information Science and Systems. 2014;2(1):3.

    PubMed  PubMed Central  Article  Google Scholar 

  7. 7.

    Wang F, Casalino LP, Khullar D. Deep learning in medicine—promise, progress, and challenges Deep Learning in Medicine—Promise, Progress, and ChallengesDeep Learning in Medicine—Promise, Progress, and Challenges. JAMA Intern Med. 2019;179(3):293–4.

    PubMed  Article  Google Scholar 

  8. 8.

    Holzinger A, Jurisica I. Knowledge discovery and data mining in biomedical informatics: the future is in integrative, interactive machine learning solutions. Interactive knowledge discovery and data mining in biomedical informatics: Springer; 2014. p. 1-18.

  9. 9.

    Pfeiffer G, Baumgart S, Schröder J, Schimmler M, editors. A massively parallel architecture for bioinformatics. Computational Science – ICCS 2009; 2009 2009//; Berlin, Heidelberg: Springer Berlin Heidelberg.

  10. 10.

    Sarda S, Hannenhalli S. Next-generation sequencing and epigenomics research: a hammer in search of nails. Genomics & informatics. 2014;12(1):2–11.

    Article  Google Scholar 

  11. 11.

    Rajkomar A, Dean J, Kohane I. Machine Learning in Medicine. N Engl J Med. 2019;380(14):1347–58.

    PubMed  Article  Google Scholar 

  12. 12.

    Holder LB, Haque MM, Skinner MK. Machine learning for epigenetics and future medical applications. Epigenetics. 2017;12(7):505–14.

    PubMed  PubMed Central  Article  Google Scholar 

  13. 13.

    Rodenhiser D, Mann M. Epigenetics and human disease: translating basic biology into clinical applications. Can Med Assoc J. 2006;174(3):341–8.

    Article  Google Scholar 

  14. 14.

    Joubert BR, Håberg SE, Nilsen RM, Wang X, Vollset SE, Murphy SK, et al. 450K epigenome-wide scan identifies differential DNA methylation in newborns related to maternal smoking during pregnancy. Environ Health Perspect. 2012;120(10):1425–31.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  15. 15.

    Joubert BR, Felix JF, Yousefi P, Bakulski KM, Just AC, Breton C, et al. DNA methylation in newborns and maternal smoking in pregnancy: genome-wide consortium meta-analysis. Am J Hum Genet. 2016;98(4):680–96.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  16. 16.

    Anderson OS, Sant KE, Dolinoy DC. Nutrition and epigenetics: an interplay of dietary methyl donors, one-carbon metabolism and DNA methylation. J Nutr Biochem. 2012;23(8):853–9.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  17. 17.

    Alegría-Torres JA, Baccarelli A, Bollati V. Epigenetics and lifestyle. Epigenomics. 2011;3(3):267–77.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  18. 18.

    Felsenfeld G. A brief history of epigenetics. Cold Spring Harb Perspect Biol. 2014;6(1):a018200.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  19. 19.

    Robertson KD. DNA methylation and human disease. Nat Rev Genet. 2005;6(8):597.

    CAS  Article  Google Scholar 

  20. 20.

    Cui H, Cruz-Correa M, Giardiello FM, Hutcheon DF, Kafonek DR, Brandenburg S, et al. Loss of IGF2 imprinting: a potential marker of colorectal cancer risk. Science. 2003;299(5613):1753–5.

    CAS  PubMed  Article  Google Scholar 

  21. 21.

    Bhusari S, Yang B, Kueck J, Huang W, Jarrard DF. Insulin-like growth factor-2 (IGF2) loss of imprinting marks a field defect within human prostates containing cancer. Prostate. 2011;71(15):1621–30.

    CAS  PubMed  Article  Google Scholar 

  22. 22.

    Soubry A, Schildkraut JM, Murtha A, Wang F, Huang Z, Bernal A, et al. Paternal obesity is associated with IGF2 hypomethylation in newborns: results from a Newborn Epigenetics Study (NEST) cohort. BMC Med. 2013;11(1):29.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  23. 23.

    Gluckman PD, Hanson MA, Buklijas T, Low FM, Beedle AS. Epigenetic mechanisms that underpin metabolic and cardiovascular diseases. Nat Rev Endocrinol. 2009;5(7):401.

    CAS  PubMed  Article  Google Scholar 

  24. 24.

    Liang M. Epigenetic mechanisms and hypertension. Hypertension. 2018;72(6):1244–54.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  25. 25.

    Bird A. DNA methylation patterns and epigenetic memory. Genes Dev. 2002;16(1):6–21.

    CAS  PubMed  Article  Google Scholar 

  26. 26.

    Bernstein BE, Meissner A, Lander ES. The mammalian epigenome. Cell. 2007;128(4):669–81.

    CAS  PubMed  Article  Google Scholar 

  27. 27.

    Kurdyukov S, Bullock M. DNA methylation analysis: choosing the right method. Biology (Basel). 2016;5(1):3.

    Google Scholar 

  28. 28.

    Bibikova M, Le J, Barnes B, Saedinia-Melnyk S, Zhou L, Shen R, et al. Genome-wide DNA methylation profiling using Infinium® assay. Epigenomics. 2009;1(1):177–200.

    CAS  PubMed  Article  Google Scholar 

  29. 29.

    Sandoval J, Heyn H, Moran S, Serra-Musach J, Pujana MA, Bibikova M, et al. Validation of a DNA methylation microarray for 450,000 CpG sites in the human genome. Epigenetics. 2011;6(6):692–702.

    CAS  PubMed  Article  Google Scholar 

  30. 30.

    Moran S, Arribas C, Esteller M. Validation of a DNA methylation microarray for 850,000 CpG sites of the human genome enriched in enhancer sequences. Epigenomics. 2016;8(3):389–99.

    CAS  PubMed  Article  Google Scholar 

  31. 31.

    Dedeurwaerder S, Defrance M, Bizet M, Calonne E, Bontempi G, Fuks F. A comprehensive overview of Infinium HumanMethylation450 data processing. Brief Bioinform. 2013;15(6):929–41.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  32. 32.

    Berdasco M, Esteller M. Clinical epigenetics: seizing opportunities for translation. Nat Rev Genet. 2018;1.

  33. 33.

    Ong M-L, Lin X, Holbrook J. Measuring epigenetics as the mediator of gene/environment interactions in DOHaD. J Dev Orig Health Dis. 2015;6(1):10–6.

    CAS  PubMed  Article  Google Scholar 

  34. 34.

    Jang H, Serra C. Nutrition, epigenetics, and diseases. Clinical nutrition research. 2014;3(1):1–8.

    PubMed  PubMed Central  Article  Google Scholar 

  35. 35.

    Rauschert S, Melton P, Burdge G, Craig J, Godfrey K, Holbrook J, et al. Maternal smoking during pregnancy induces persistent epigenetic changes into adolescence, independent of postnatal smoke exposure and is associated with cardiometabolic risk. Front Genet. 2019;10:770.

    PubMed  PubMed Central  Article  Google Scholar 

  36. 36.

    Bianco-Miotto T, Craig JM, Gasser YP, van Dijk SJ, Ozanne SE. Epigenetics and DOHaD: from basics to birth and beyond. J Dev Orig Health Dis. 2017;8(5):513–9.

    CAS  PubMed  Article  Google Scholar 

  37. 37.

    Payne SR. From discovery to the clinic: the novel DNA methylation biomarker m SEPT9 for the detection of colorectal cancer in blood. Epigenomics. 2010;2(4):575–85.

    CAS  PubMed  Article  Google Scholar 

  38. 38.

    Crowgey EL, Marsh AG, Robinson KG, Yeager SK, Akins RE. Epigenetic machine learning: utilizing DNA methylation patterns to predict spastic cerebral palsy. BMC bioinformatics. 2018;19(1):225.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  39. 39.

    Bari MG, Ung CY, Zhang C, Zhu S, Li H. Machine learning-assisted network inference approach to identify a new class of genes that coordinate the functionality of cancer networks. Sci Rep. 2017;7(1):6993.

    Article  CAS  Google Scholar 

  40. 40.

    Krittanawong C, Zhang H, Wang Z, Aydar M, Kitai T. Artificial intelligence in precision cardiovascular medicine. J Am Coll Cardiol. 2017;69(21):2657–64.

    PubMed  Article  Google Scholar 

  41. 41.

    Rech J, Althoff K-D. Artificial intelligence and software engineering: Status and future trends. KI. 2004;18(3):5–11.

    Google Scholar 

  42. 42.

    Hashimoto DA, Rosman G, Rus D, Meireles OR. Artificial intelligence in surgery: promises and perils. Ann Surg. 2018;268(1):70–6.

    PubMed  PubMed Central  Article  Google Scholar 

  43. 43.

    Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med. 2019;25(1):44.

    CAS  PubMed  Article  Google Scholar 

  44. 44.

    Hamet P, Tremblay J. Artificial intelligence in medicine. Metabolism. 2017;69:S36–40.

    CAS  Article  Google Scholar 

  45. 45.

    Saria S, Butte A, Sheikh A. Better medicine through machine learning: what’s real, and what’s artificial? PLoS Med. 2019;15(12):e1002721.

    Article  Google Scholar 

  46. 46.

    Wong T-T. Performance evaluation of classification algorithms by k-fold and leave-one-out cross validation. Pattern Recogn. 2015;48(9):2839–46.

    Article  Google Scholar 

  47. 47.

    Ben-David A. Comparison of classification accuracy using Cohen’s Weighted Kappa. Expert Syst Appl. 2008;34(2):825–32.

    Article  Google Scholar 

  48. 48.

    Sokolova M, Lapalme G. A systematic analysis of performance measures for classification tasks. Inf Process Manag. 2009;45(4):427–37.

    Article  Google Scholar 

  49. 49.

    Haixiang G, Yijing L, Shang J, Mingyun G, Yuanyue H, Bing G. Learning from class-imbalanced data: Review of methods and applications. Expert Syst Appl. 2017;73:220–39.

    Article  Google Scholar 

  50. 50.

    Kotsiantis SB, Zaharakis ID, Pintelas PE. Machine learning: a review of classification and combining techniques. Artif Intell Rev. 2006;26(3):159–90.

    Article  Google Scholar 

  51. 51.

    Cristianini N, Ricci E. Support Vector Machines. In: Kao M-Y, editor. Encyclopedia of Algorithms. Boston, MA: Springer US; 2008. p. 928–32.

    Google Scholar 

  52. 52.

    Breiman L. Random Forests. machine learning. 2001;45(1):5-32.

  53. 53.

    Aref-Eshghi E, Rodenhiser DI, Schenkel LC, Lin H, Skinner C, Ainsworth P, et al. Genomic DNA methylation signatures enable concurrent diagnosis and clinical genetic variant classification in neurodevelopmental syndromes. Am J Hum Genet. 2018;102(1):156–74.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  54. 54.

    Aref-Eshghi E, Schenkel LC, Ainsworth P, Lin H, Rodenhiser DI, Cutz J-C, et al. Genomic DNA methylation-derived algorithm enables accurate detection of malignant prostate tissues. Front Oncol. 2018;8.

  55. 55.

    Capper D, Jones DT, Sill M, Hovestadt V, Schrimpf D, Sturm D, et al. DNA methylation-based classification of central nervous system tumours. Nature. 2018;555(7697):469.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  56. 56.

    Dogan MV, Grumbach IM, Michaelson JJ, Philibert RA. Integrated genetic and epigenetic prediction of coronary heart disease in the Framingham Heart Study. PLoS One. 2018;13(1):e0190549.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  57. 57.

    Orozco JI, Knijnenburg TA, Manughian-Peter AO, Salomon MP, Barkhoudarian G, Jalas JR, et al. Epigenetic Profiling for the Molecular Classification of Metastatic Brain Tumors. bioRxiv. 2018:268193.

  58. 58.

    Japkowicz N, Stephen S. The class imbalance problem: a systematic study. Intelligent data analysis. 2002;6(5):429–49.

    Article  Google Scholar 

  59. 59.

    LeCun Y, Bengio Y, Hinton G. Deep learning. nature. 2015;521(7553):436.

  60. 60.

    Jain AK, Mao J, Mohiuddin KM. Artificial neural networks: a tutorial. Computer. 1996;29(3):31–44.

    Article  Google Scholar 

  61. 61.

    Rudin C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence. 2019;1(5):206–15.

    Article  Google Scholar 

  62. 62.

    Zahid FM, Heumann C. Multiple imputation with sequential penalized regression. Statistical methods in medical research. 2018:962280218755574.

  63. 63.

    Alanazi HO, Abdullah AH, Qureshi KN. A critical review for developing accurate and dynamic predictive models using machine learning methods in medicine and health care. J Med Syst. 2017;41(4):69.

    PubMed  Article  Google Scholar 

  64. 64.

    Tarca AL, Carey VJ, Chen X-W, Romero R, Drăghici S. Machine learning and its applications to biology. PLoS Comput Biol. 2007;3(6):e116.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  65. 65.

    Boulesteix A-L, Strimmer K. Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Brief Bioinform. 2006;8(1):32–44.

    PubMed  Article  CAS  Google Scholar 

  66. 66.

    Meng C, Zeleznik OA, Thallinger GG, Kuster B, Gholami AM, Culhane AC. Dimension reduction techniques for the integrative analysis of multi-omics data. Brief Bioinform. 2016;17(4):628–41.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  67. 67.

    Nguyen DV, Rocke DM. Tumor classification by partial least squares using microarray gene expression data. Bioinformatics. 2002;18(1):39–50.

    CAS  PubMed  Article  Google Scholar 

  68. 68.

    Deo RC. Machine Learning in Medicine. Circulation. 2015;132(20):1920–30.

    PubMed  PubMed Central  Article  Google Scholar 

  69. 69.

    Kallenberg M, Petersen K, Nielsen M, Ng AY, Diao P, Igel C, et al. Unsupervised deep learning applied to breast density segmentation and mammographic risk scoring. IEEE Trans Med Imaging. 2016;35(5):1322–31.

    PubMed  Article  Google Scholar 

  70. 70.

    Wang Y, Liu T, Xu D, Shi H, Zhang C, Mo Y-Y, et al. Predicting DNA methylation state of CpG dinucleotide using genome topological features and deep networks. Sci Rep. 2016;6:19598.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  71. 71.

    Angermueller C, Lee HJ, Reik W, Stegle O. DeepCpG: accurate prediction of single-cell DNA methylation states using deep learning. Genome Biol. 2017;18(1):67.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  72. 72.

    Aref-Eshghi E, Bend EG, Hood RL, Schenkel LC, Carere DA, Chakrabarti R, et al. BAFopathies’ DNA methylation epi-signatures demonstrate diagnostic utility and functional continuum of Coffin–Siris and Nicolaides–Baraitser syndromes. Nat Commun. 2018;9(1):4885.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  73. 73.

    Cai Z, Xu D, Zhang Q, Zhang J, Ngai S-M, Shao J. Classification of lung cancer using ensemble-based feature selection and machine learning methods. Mol BioSyst. 2015;11(3):791–800.

    CAS  PubMed  Article  Google Scholar 

  74. 74.

    Adorján P, Distler J, Lipscher E, Model F, Müller J, Pelet C, et al. Tumour class prediction and discovery by microarray-based DNA methylation analysis. Nucleic Acids Res. 2002;30(5):e21-e.

  75. 75.

    List M, Hauschild A-C, Tan Q, Kruse TA, Baumbach J, Batra R. Classification of breast cancer subtypes by combining gene expression and DNA methylation data. Journal of integrative bioinformatics. 2014;11(2):1–14.

    Article  Google Scholar 

  76. 76.

    Li J, Ching T, Huang S, Garmire LX, editors. Using epigenomics data to predict gene expression in lung cancer. BMC bioinformatics; 2015: BioMed Central.

  77. 77.

    Queiros AC, Villamor N, Clot G, Martinez-Trillos A, Kulis M, Navarro A, et al. A B-cell epigenetic signature defines three biologic subgroups of chronic lymphocytic leukemia with clinical impact. Leukemia. 2015;29(3):598–605.

    CAS  PubMed  Article  Google Scholar 

  78. 78.

    Bhoi S, Ljungström V, Baliakas P, Mattsson M, Smedby KE, Juliusson G, et al. Prognostic impact of epigenetic classification in chronic lymphocytic leukemia: the case of subset# 2. Epigenetics. 2016;11(6):449–55.

    PubMed  PubMed Central  Article  Google Scholar 

  79. 79.

    Malta TM, Sokolov A, Gentles AJ, Burzykowski T, Poisson L, Weinstein JN, et al. Machine learning identifies stemness features associated with oncogenic dedifferentiation. Cell. 2018;173(2):338–54. e15.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  80. 80.

    Aryee MJ, Jaffe AE, Corrada-Bravo H, Ladd-Acosta C, Feinberg AP, Hansen KD, et al. Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays. Bioinformatics. 2014;30(10):1363–9.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  81. 81.

    Jaffe AE, Murakami P, Lee H, Leek JT, Fallin MD, Feinberg AP, et al. Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies. Int J Epidemiol. 2012;41(1):200–9.

    PubMed  PubMed Central  Article  Google Scholar 

  82. 82.

    Silva TC, Colaprico A, Olsen C, D'Angelo F, Bontempi G, Ceccarelli M, et al. TCGA Workflow: analyze cancer genomics and epigenomics data using Bioconductor packages. F1000Res. 2016;5:1542.

    PubMed  PubMed Central  Article  Google Scholar 

  83. 83.

    Leung MK, Delong A, Alipanahi B, Frey BJ. Machine learning in genomic medicine: a review of computational problems and data sets. Proc IEEE. 2015;104(1):176–97.

    Article  Google Scholar 

  84. 84.

    Sina AAI, Carrascosa LG, Liang Z, Grewal YS, Wardiana A, Shiddiky MJA, et al. Epigenetically reprogrammed methylation landscape drives the DNA self-assembly and serves as a universal cancer biomarker. Nat Commun. 2018;9(1):4915.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  85. 85.

    Huang Y-T, Chu S, Loucks EB, Lin C-L, Eaton CB, Buka SL, et al. Epigenome-wide profiling of DNA methylation in paired samples of adipose tissue and blood. Epigenetics. 2016;11(3):227–36.

    PubMed  PubMed Central  Article  Google Scholar 

  86. 86.

    Hewitt AW, Januar V, Sexton-Oates A, Joo JE, Franchina M, Wang JJ, et al. DNA methylation landscape of ocular tissue relative to matched peripheral blood. Sci Rep. 2017;7:46330.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  87. 87.

    Haque MM, Skinner MK, Holder LB. Imbalanced class learning in epigenetics. J Comput Biol. 2014;21(7):492–507.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  88. 88.

    Kirpich A, Ainsworth EA, Wedow JM, Newman JR, Michailidis G, McIntyre LM. Variable selection in omics data: A practical evaluation of small sample sizes. PLoS One. 2018;13(6):e0197910.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  89. 89.

    Li S, He T, Pawlikowska I, Lin T. Correcting length-bias in gene set analysis for DNA methylation data. Statistics and Its Interface. 2017;10(2):279–89.

    Article  Google Scholar 

  90. 90.

    Deutsch CK, McIlvane WJ. Non-Mendelian etiologic factors in neuropsychiatric illness: pleiotropy, epigenetics, and convergence. Behav Brain Sci. 2012;35(5):363–4.

    PubMed  PubMed Central  Article  Google Scholar 

  91. 91.

    Leinonen R, Sugawara H, Shumway M. International nucleotide sequence database C. The sequence read archive. Nucleic Acids Res. 2011;39(Database issue):D19–21.

    CAS  PubMed  Article  Google Scholar 

  92. 92.

    Boratyn GM, Thierry-Mieg J, Thierry-Mieg D, Busby B, Madden TL. Magic-BLAST, an accurate RNA-seq aligner for long and short reads. BMC Bioinformatics. 2019;20(1):405.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  93. 93.

    Chang P, Grinband J, Weinberg B, Bardis M, Khy M, Cadena G, et al. Deep-learning convolutional eural Networks Accurately Classify Genetic Mutations in Gliomas. American Journal of Neuroradiology. 2018.

  94. 94.

    Phillips PJ, Jiang F, Narvekar A, Ayyad J, O'Toole AJ. An other-race effect for face recognition algorithms. ACM Trans Appl Percept. 2011;8(2):1–11.

    Article  Google Scholar 

  95. 95.

    Char DS, Shah NH, Magnus D. Implementing machine learning in health care—addressing ethical challenges. N Engl J Med. 2018;378(11):981–3.

    PubMed  PubMed Central  Article  Google Scholar 

Download references


We would like to acknowledge Professor Lawrence Beilin for reviewing the final manuscript.


SR received support from the European LifeCycle project through the fellowship call of June 2018, grant agreement no. 733206. RCH is supported by NHMRC Fellowship (grant number 1053384). RCH, PM, and SR received further support through the NHMRC EU-collaborative grant with the number APP1142858—early life stressors and lifecycle health.

Author information




SR and KR wrote the manuscript. KR performed the literature review. RCH and PM contributed to the conception of the study and revised the manuscript. All authors contributed to manuscript revision, read, and approved the submitted version.

Corresponding author

Correspondence to S. Rauschert.

Ethics declarations

Ethics approval and consent to participate

Not applicable

Consent for publication

No applicable

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Rauschert, S., Raubenheimer, K., Melton, P.E. et al. Machine learning and clinical epigenetics: a review of challenges for diagnosis and classification. Clin Epigenet 12, 51 (2020).

Download citation