Skip to main content

Identifying CpG sites associated with eczema via random forest screening of epigenome-scale DNA methylation



The prevalence of eczema is increasing in industrialized nations. Limited evidence has shown the association of DNA methylation (DNA-M) with eczema. We explored this association at the epigenome-scale to better understand the role of DNA-M.

Data from the first generation (F1) of the Isle of Wight (IoW) birth cohort participants and the second generation (F2) were examined in our study. Epigenome-scale DNA methylation of F1 at age 18 years and F2 in cord blood was measured using the Illumina Infinium HumanMethylation450 Beadchip. A total of 307,357 cytosine-phosphate-guanine sites (CpGs) in the F1 generation were screened via recursive random forest (RF) for their potential association with eczema at age 18. Functional enrichment and pathway analysis of resulting genes were carried out using DAVID gene functional classification tool. Log-linear models were performed in F1 to corroborate the identified CpGs. Findings in F1 were further replicated in F2.


The recursive RF yielded 140 CpGs, 88 of which showed statistically significant associations with eczema at age 18, corroborated by log-linear models after controlling for false discovery rate (FDR) of 0.05. These CpGs were enriched among many biological pathways, including pathways related to creating transcriptional variety and pathways mechanistically linked to eczema such as cadherins, cell adhesion, gap junctions, tight junctions, melanogenesis, and apoptosis. In the F2 generation, about half of the 83 CpGs identified in F1 showed the same direction of association with eczema risk as in F1, of which two CpGs were significantly associated with eczema risk, cg04850479 of the PROZ gene (risk ratio (RR) = 15.1 in F1, 95 % confidence interval (CI) 1.71, 79.5; RR = 6.82 in F2, 95 % CI 1.52, 30.62) and cg01427769 of the NEU1 gene (RR = 0.13 in F1, 95 % CI 0.03, 0.46; RR = 0.09 in F2, 95 % CI 0.03, 0.36).


Via epigenome-scaled analyses using recursive RF followed by log-linear models, we identified 88 CpGs associated with eczema in F1, of which 41 were replicated in F2. Several identified CpGs are located within genes in biological pathways relating to skin barrier integrity, which is central to the pathogenesis of eczema. Novel genes associated with eczema risk were identified (e.g., the PROZ and NEU1 genes).


The increasing prevalence of allergic conditions including eczema is a major public health concern in industrialized nations [1]. The prevalence of eczema is reported to be 10–30 % in children and 1–3 % in adults of the developed world [2]. In addition to the physical discomfort to the affected individual and the social burden on their families, eczema has a huge economic impact on nations’ health care budgets [3].

Eczema is a chronic condition involving a complex interplay of genetic, epigenetic, and environmental factors [46]. So far, DNA methylation (DNA-M) remains the most studied mechanism with potential answers to epigenetic regulation of gene function [7, 8]. The Illumina Infinium HumanMethylation450 Beadchip has the ability to measure DNA methylation at more than 450 K cytosine-phosphate-guanine sites (CpGs), which provides rich information for various epigenetic studies. Epigenome-scale studies offer an immense opportunity to understand disease pathophysiology, but there are also concerns about the challenges associated with this type of studies. A recent review published in 2014 by Paul et al. highlighted the potential challenges in the field of epigenomics [9] such as study design, methodologies of obtaining biologic samples, high dimensionality, and highly correlated data [9, 10].

Random forest (RF) is a machine learning algorithm used for classification and has the ability to efficiently handle high dimensionality and highly correlated data [11]. The R package was used in this study to screen CpG sites potentially associated with eczema. RF is composed of classification trees with each tree constructed using randomly selected bootstrap samples. Misclassification rates calculated based on testing samples can be used to estimate the accuracy of the forests.

In this study, we utilized a method built upon RF to screen specific CpGs potentially associated with eczema using data in the first generation (F1) at age 18 years and functionally annotated the genes of the identified CpGs using DAVID [12] to understand the biological pathways. For the identified CpGs via the RF-based method, we further examined their statistical significance on their linear association with eczema risk at age 18 years using log-linear models and replicated the findings from the F1 in the second generation (F2).


Eczema frequencies in F1 (18 years) and in F2 (3, 6, and 12 months) generations stratified by sex indicated that females had higher eczema prevalence than males at 18 years of age in the F1 generation, and the prevalence switched in the newborns of the F2 generation (Table 1). This is consistent with the gender-reversal pattern of eczema reported in our earlier work [13].

Table 1 Eczema status in male and female cohort participants in the F1 and F2 generations (chi-square tests)

In the screening process using recursive RF [14, 15], the parameters (sampsize, mtry, and ntree—details are in the “Statistical analysis” section) in the randomForest() R package were selected to achieve stabilized error rates. In total, pre-processed DNA methylation of 307,357 CpGs in the F1 generation was included in the screening. The results of the recursive RF (Table 2, Fig. 1; details in the “Statistical Analysis” section) Indicated that a total of 140 CpGs (after excluding 8 CpGs located on the X chromosome) passed the screening showing potential association with eczema. The exclusion of the 8 CpGs were due to the potential bias measurement of DNA-M for different genders. Nevertheless, in the following analyses (log-linear models below), we assessed whether gender played a role in the association of DNA-M with eczema.

Table 2 The performance of recursive RF at each iteration
Fig. 1

Misclassification error rates at each iteration of the recursive RF. OOB out of bag error rate (overall error), YES eczema, No non-eczema

Further examination of these 140 CpGs from F1 using log-linear models indicated that 88 out of 140 CpGs had a statistically significant linear association with eczema at age 18 (FDR-adjusted P value <0.05) (Additional file 1: Table S1). We also tested the statistical significance of the interaction between DNA-M and gender; none of the FDR-adjusted P values were <0.05.

We assessed the biological pathways enriched within the genes annotated to those 140 CpGs using DAVID (Table 3). The most significantly enriched pathways related to the creation of transcriptional variety through genetic (e.g., polymorphism) and regulatory (e.g., alternative splicing) mechanisms. The remainder of the significantly enriched pathways included several pathways mechanistically linked to epithelial barrier integrity and cell adhesion, which are of key importance in eczema: examples include cadherins (protocadherin gamma, P = 1.8 × 10−16; cadherin 6 domain, P = 1.5 × 10−10; and cadherin N-terminal domain, P = 3.6 × 10−10), gap junctions (P = 2.6 × 10−6), cell adhesion (cell-cell adhesion, P = 2.7 × 10−10; cell adhesion, P = 2.2 × 10−5), tight junctions (P = 1.6 × 10−5), melanogenesis (P = 7.1 × 10−5), and apoptosis (P = 7.3 × 10−3).

Table 3 Terms significantly enriched in functional annotation and pathway analysis and genes present in the pathways potentially associated with eczema (FDR-adjusted P value; FDR = 0.05)

Replication results

We then replicated the findings from the F1 generation in the F2 generation. In total, 83 out of the 88 CpGs identified in the F1 were also present in the F2 dataset (the 5 CpG sites in the F2 were excluded after quality control). DNA methylation at 41 CpGs (out of 83) showed the same direction of changes with eczema in both the F1 and F2 generations (Table 4, Fig. 2). Of these 41 CpGs, two were statistically significantly associated with eczema risk in both generations (Table 4); cg04850479 in the PROZ gene showed adjusted risk ratio (RR) of 15.19 (95 % confidence interval (CI) 1.71 to 79.50) in the F1 and 6.82 (95 % CI 1.52 to 30.62) in the F2 and cg01427769 in the NEU1 gene showed adjusted RR of 0.13 (95 % CI 0.03 to 0.46) in the F1 and 0.09 (95 % CI 0.03 to 0.36) in the F2. We further assessed the association of DNA methylation of these 2 CpGs with corresponding gene expressions in the F2 generation. No statistically significant associations were identified. Among the remaining CpGs not replicated in the F2 generation, about 60 % CpGs (n = 25 CpGs) showed a statistically significant difference in DNA methylation between the two generations (based on two sample two sided t tests) after adjusting for multiple testing. Since some of the F2 generation are offsprings of subjects in the F1 generation, the findings tend to be conservative. The above analyses were adjusted for estimated cell type proportions [16].

Table 4 The 41 CpGs that had the same direction of effect with eczema in both F1 and F2 generations based on log-linear models
Fig. 2

The risk ratios of 83 eczema-associated CpGs sorted by chromosome from 1 to 21. The numbers in the textbox are chromosome indices, which are represented by different colors in the bar graphs. The horizontal red line represents the risk ratio of one


This is the first study to explore epigenome-scale DNA methylation patterns associated with eczema. Using data from two generations, our study based on data of the F1 generation identified CpGs potentially associated with eczema status using the RF technique, which was further corroborated via log-linear models. In total, 140 CpGs were identified via RF, which were further assessed using log-linear models with 88 CpGs being statistically significantly associated with eczema risk after adjusting for cell type proportions and controlling for multiple testing. The remaining 52 CpGs were not corroborated in log-linear models. This is likely due to two reasons. Firstly, the 140 CpGs were identified based on their importance values in terms of minimizing misclassification errors other than statistical testing [11]. It is possible that the identified CpG sites did not have a statistically significant main effect on eczema risk. Secondly, among the 140 CpGs, complex non-linear interactions are likely to exist between multiple CpGs which may be difficult to parametrically identify using log-linear models. Using F2 generation data, around 50 % (41 CpGs) of these 88 CpGs identified in the F1 generation were further replicated. In particular, two CpGs showed statistically significant results in both F1 and F2: cg04850479 in the PROZ gene and cg01427769 in the NEU1 gene. Although some studies have linked NEU1 gene with asthma [17] via Th2-mediated airway inflammation [18, 17], and it is known that the Th2 pathway is also important for eczema [19, 20], based on our knowledge, no study has so far spotted its role in eczema. The insignificant findings on the association of DNA methylation of cg04850479 (in the PROZ gene) and cg01427769 (in the NEU1 gene) is likely due to tissue-specific gene expression. That is, an early exposure has left a change in methylation in all tissues including blood but the gene is not expressed in blood but skin for eczema. It is also possible that the DNA methylation of these two CpGs is related to the production of dysfunctional transcripts.

Enrichment analysis of the CpG sites identified in the F1 generation highlighted pathways related to the creation of transcriptional variation and several biological pathways related to the epidermal barrier and involved in eczema (Table 3).

The skin barrier is crucial in maintaining skin integrity, and disruption of the epidermal barrier is one of the important mechanisms in the pathogenesis of eczema [21, 22]. Several studies reported that skin barrier dysfunction is a result of the impairment of tight junction function in eczema patients [2326]. Cadherins and protocadherins are transmembrane proteins important for cell-to-cell adhesion and epithelial integrity and have been associated with eczema and asthma in genetic studies [27]. Chronic eczema and several other dermatoses are also related to hyperpigmentation of the skin [28]. Our study detected differentially methylated CpGs within genes in pathways relating to epidermal barrier integrity and eczema pathogenesis, including cadherins, gap junction, cell adhesion, tight junction, melanogenesis, and apoptosis (Table 3). Their biological functions suggest these eczema-associated CpGs are of special interest, and they are potential epigenetic biomarkers for eczema. The detection of eczema-associated differential methylation within pathways already known to be associated with eczema is reasonable and suggests that epigenetic and genetic variation may work together to regulate eczema-associated gene expression in the genes identified here, as has already been observed in other eczema-associated genes [22].

Several limitations were identified in the process of our study. Although the 140 CpGs were chosen based on the least misclassification error rate, it is possible that some CpGs were incorrectly removed and vice versa. Also, cord blood contains a small amount of maternal cells [29], which may bias the measure of DNA methylation, but our cell type correction performed in this study was expected to reduce the bias. Findings from the F1 generation were partially replicated in the F2 generation. This could be due to age playing a role in the CpGs predicting eczema; adolescence transition has the potential to revise DNA methylation. This is supported by our comparison of DNA methylation between the F1 and F2 generations among the CpGs not replicated. Not all CpGs selected by random forests were involved in known eczema-associated biological pathways, which may be due to complex interactions between the CpGs hence requires further investigation. It is possible that some of the identified CpGs may be associated with the severity of eczema. Hence, there is a need to further examine potential associations of DNA methylation of those CpG sites with eczema severity. For multiple CpG sites, DNA methylation was associated with eczema in the F1 generation at age 18. These CpG sites could be risks or consequences of eczema. However, CpGs replicated in the F2 generation were measured in cord blood before the onset of eczema and thus have the potential to predict eczema.


This is the first epigenome-scale association study of eczema employing a classification technique (recursive RF), and we identified eczema-associated CpG sites. The findings added to the existing knowledge that recursive RF can be successfully employed in drawing actionable results from complex datasets. Genes annotated to eczema-associated CpGs were significantly enriched in pathways related to the creation of transcriptional variation and pathways relating to epidermal barrier function and eczema. Furthermore, the study identified for the first time that the PROZ and NEU1 genes are potential predictors of eczema.


Isle of Wight birth cohort

The Isle of Wight (IoW) birth cohort was established to study the natural history of allergic diseases among children who were born between January 1, 1989 and February 28, 1990 on the Isle of Wight, UK. The study was approved by the local research ethics committee and written informed consent was obtained from the parents. After exclusion of adoptions, perinatal deaths, and refusal, 1456 children (95 %) were enrolled. Children were followed-up at ages 1 (n = 1167), 2 (n = 1174), 4 (n = 1218), 10 (n = 1373), and 18 years (n = 1313); detailed questionnaires were administered at each follow-up. Details of the birth cohort have been described elsewhere [4, 30, 31]. A total of 244 women and 122 men at age 18 years were randomly selected from the cohort for epigenome-scale DNA methylation studies. Ethics approvals were obtained from the Isle of Wight Local Research Ethics Committee (now named the National Research Ethics Service, NRES Committee South Central – Southampton B) at recruitment and for the subsequent follow-ups (06/Q1701/34).

Outcome: eczema phenotype data collection

Eczema was defined as chronic or chronically relapsing itchy dermatitis lasting more than 6 weeks with characteristic morphology and distribution [32], following Hanifin and Rajka criteria [5].

DNA methylation

DNA was extracted from whole blood and umbilical cord blood using a standard salting out procedure [33]. DNA concentration was determined by Qubit quantitation. One microgram of DNA was bisulfite-treated using the EZ 96-DNA methylation kit (Zymo Research, Irvine, CA, USA) following the manufacturer’s standard protocol.

Epigenome-scale DNA methylation was assessed using the Illumina Infinium HumanMethylation450 Beadchip (Illumina, Inc., San Diego, CA, USA), which interrogates >484,000 CpGs associated with approximately 24,000 genes. Arrays were processed using a standard protocol as described elsewhere [7], with multiple identical control samples assigned to each batch to assess assay variability, and samples were randomly distributed on microarrays to control against batch effects. The methylation level (β value) for each CpG was determined using the Methylation module of GenomeStudio software (Illumina, Version 2011.1).

Methylation levels for each CpG site are recorded as beta (β) values, which represent the proportion of methylated (M) over methylated (M) plus unmethylated (U) probes (β = M/[c + M + U], with constant c introduced for the situation of too small M + U) and can be interpreted as percentage methylation. These values were utilized in the RF screening process described below; however, β values close to 0 or 1 tend to suffer from severe heteroscedasticity; therefore, logit-transformed β values (M values, approximated by log2(β / (1-β)) [34] were used in other analyses including log-linear models.

Pre-processing DNA methylation data

In our study, the detection P value reported by GenomeStudio was used as a QC measure of probe performance. Probes whose detection P values were >0.01 in >10 % of the samples were removed [35]. Methylation data were then pre-processed using the Bioconductor IMA (Illumina methylation analyzer) package and ComBat was used to perform peak correction and adjust for inter-array variation [36, 37]. To ensure that our findings were not biased by SNPs affecting measurement of methylation levels, we excluded all probes with a potential SNP in the probe sequence. After pre-processing, a total of 307,357 CpGs were retained in the DNA methylation dataset.

Statistical analysis

Pearson’s χ 2 tests were used to determine if prevalence of eczema differed between the sexes. P values were considered significant at a level of 0.05. To make sure that our findings are not a result of confounding due to cell types, we ran the analyses by adjusting for estimated proportions of CD8+ T cells, CD4+ T cells, natural killer cells, B cells, monocytes, and granulocytes. Cell type proportions were estimated as described previously [16].

The random forest package, randomForest(), in R was utilized to conduct the recursive RF analyses [38, 15, 14]. The parameter sampsize refers to the size of the sample of training data sets that is to be obtained for classification. The number of variables that are randomly sampled as predictors at each split is called mtry, whereas, ntree is a parameter referring to the total number of trees that are to be grown in the forest. In order to improve the prediction accuracy of the RF algorithm, these three parameters were repeatedly altered until the lowest misclassification rate was obtained. We decided whether to use a balanced sampsize of equal eczema and non-eczema cases such as 20 eczema and 20 non-eczema cases or 30/30 or 40/40. We also studied imbalanced RFs with sampsize such as 46/320 or 20/40 for the training sets by using the default values for mtry and ntree. We then tested the prediction accuracy of the RFs at different combinations of mtry (√p, 2*√p, 0.1p, 0.15p, 0.2p, and 0.25p) where p is number of variables and ntree (200, 500, 1000, and 1500). Once the optimal parameter values were selected, the recursive RF algorithm was implemented. Mean Decrease Gini (MDG) served as a variable importance measure (VIM) for our study as it was shown to be more robust in previous research [39].

DNA methylation at 307,357 CpGs along with sex and eczema status in the F1 generation served as input in randomForest(), and the CpGs were subjected to data reduction, repeatedly dropping 50 % of variables with the lowest VIMs until the misclassification rate showed a significant increase.

After testing for sampsize (both equal and unequal) with different combinations (both with and without eczema), we set sampsize = (31, 31), mtry = 0.2p (where p is the available number of variables) and ntree = 500. We applied RF to pre-processed DNA methylation data containing 307,357 CpGs in the F1 generation, ran a total of 17 iterations, and at each iteration, recorded the misclassification rate (Table 2, Fig. 1). The lowest overall misclassification error rate (of eczema and eczema-free) was 5.2 %, with a corresponding least misclassification rate of 17.4 % for eczema at the 12th iteration. The overall misclassification rate dropped from 18.6 % in the first iteration to 5.2 % in the 12th iteration, and the eczema misclassification error rate dropped to 17.4 % at the end of 12th iteration from 95.7 % in the first iteration.

The CpGs identified from the recursive RF [40] were assessed for enrichment of biological pathways using DAVID [12] bioinformatics tool and examined for their association with eczema at age of 18 years by use of log-linear models. Multiple testing was adjusted by controlling false discovery rate of 0.05 in the pathway analysis and log-linear models. Since differential cell types in the peripheral blood are known to have confounding effect on the final result [16], we adjusted for cell type correction. For genes of particular interest (e.g., showing statistical significance in both generations in log-linear models), robust regressions are applied to assess the association of DNA methylation and corresponding gene expressions in the F2 generation. For this last test, multiple testing is adjusted within genes based on the number of CpG sites available of that gene.

Replication cohort

The IoW F2 generation cohort includes the offspring of the IoW 1989 birth cohort. In the F2 generation, repeated measures of eczema at ages 3, 6, and 12 months were recorded in a sample of n = 116 children. DNA methylation was measured in umbilical cord blood. To replicate the findings from the F1 generation, log-linear models with repeated measures of eczema were used in F2 generation analyses. Figure 3 represents the summary of statistical analysis and sample size for each analysis conducted in this study.

Fig. 3

Flow chart of statistical analyses and the number of CpG sites after each analysis in the F1 and F2 generations. RRF: recursive random forest





DNA methylation


Isle of Wight


random forest


risk ratio


  1. 1.

    Bieber T. Atopic dermatitis. Ann Dermatol. 2010;22(2):125–37.

    CAS  PubMed Central  PubMed  Article  Google Scholar 

  2. 2.

    Leung DYM, Boguniewicz M, Howell MD, Nomura I, Hamid QA. New insights into atopic dermatitis. J Clin Invest. 2004;113(5):651–7. doi:10.1172/JCI21060.

    CAS  PubMed Central  PubMed  Article  Google Scholar 

  3. 3.

    Gupta R, Sheikh A, Strachan D, Anderson H. Burden of allergic disease in the UK: secondary analyses of national databases. Clin Exp Allergy. 2004;34(4):520–6.

    CAS  PubMed  Article  Google Scholar 

  4. 4.

    Arshad SH, Hide DW. Effect of environmental factors on the development of allergic disorders in infancy. J Allergy Clin Immunol. 1992;90(2):235–41.

    CAS  PubMed  Article  Google Scholar 

  5. 5.

    Hanifin J, Rajka G. Diagnostic features of atopic eczema. Acta Dermatol Venereol (Stockh). 1980;92:44–7.

    Google Scholar 

  6. 6.

    Orteu C, Rustin M, O’toole E, Sabin C, Salmon M, Poulter L, et al. The inhibition of cutaneous T cell apoptosis may prevent resolution of inflammation in atopic eczema. Clin Exp Immunol. 2000;122(2):150–6.

    CAS  PubMed Central  PubMed  Article  Google Scholar 

  7. 7.

    Bibikova M, Fan J-B. GoldenGate® Assay for DNA Methylation Profiling. DNA Methylation. Springer; 2009. p. 149–63.

  8. 8.

    Sandoval J, Heyn H, Moran S, Serra-Musach J, Pujana MA, Bibikova M, et al. Validation of a DNA methylation microarray for 450,000 CpG sites in the human genome. Epigenetics. 2011;6(6):692–702.

    CAS  PubMed  Article  Google Scholar 

  9. 9.

    Paul DS, Beck S. Advances in epigenome-wide association studies for common diseases. Trends Mol Med. 2014;20(10):541–3.

    CAS  PubMed Central  PubMed  Article  Google Scholar 

  10. 10.

    Xiong M, Ma L. An efficient sufficient dimension reduction method for identifying genetic variants of clinical significance. arXiv preprint arXiv:13013528. 2013.

  11. 11.

    Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.

    Article  Google Scholar 

  12. 12.

    Da Wei Huang BTS, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2008;4(1):44–57.

    Article  Google Scholar 

  13. 13.

    Ziyab A, Raza A, Karmaus W, Tongue N, Zhang H, Matthews S, et al. Trends in eczema in the first 18 years of life: results from the Isle of Wight 1989 birth cohort study. Clin Exp Allergy. 2010;40(12):1776–84.

    CAS  PubMed  Article  Google Scholar 

  14. 14.

    Hapfelmeier A, Ulm K. A new variable selection approach using random forests. Comput Stat Data Anal. 2013;60:50–69.

    Article  Google Scholar 

  15. 15.

    Goldstein BA, Hubbard AE, Cutler A, Barcellos LF. An application of random forests to a genome-wide association dataset: methodological considerations & new findings. BMC Genet. 2010;11(1):49.

    PubMed Central  PubMed  Article  Google Scholar 

  16. 16.

    Houseman EA, Accomando WP, Koestler DC, Christensen BC, Marsit CJ, Nelson HH, et al. DNA methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinformatics. 2012;13(1):86.

    PubMed Central  PubMed  Article  Google Scholar 

  17. 17.

    Lu LQ, Liao W. Screening and functional pathway analysis of genes associated with pediatric allergic asthma using a DNA microarray. Mol Med Rep. 2015. doi:10.3892/mmr.2015.3277.

  18. 18.

    Katoh S, Maeda S, Fukuoka H, Wada T, Moriya S, Mori A, et al. A crucial role of sialidase Neu1 in hyaluronan receptor function of CD44 in T helper type 2-mediated airway inflammation of murine acute asthmatic model. Clin Exp Immunol. 2010;161(2):233–41. doi:10.1111/j.1365-2249.2010.04165.x.

    CAS  PubMed Central  PubMed  Google Scholar 

  19. 19.

    Abrahamsson TR, Sandberg Abelius M, Forsberg A, Björkstén B, Jenmalm MC. A Th1/Th2-associated chemokine imbalance during infancy in children developing eczema, wheeze and sensitization. Clin Exp Allergy. 2011;41(12):1729–39. doi:10.1111/j.1365-2222.2011.03827.x.

    CAS  PubMed  Article  Google Scholar 

  20. 20.

    Eyerich K, Novak N. Immunology of atopic eczema: overcoming the Th1/Th2 paradigm. Allergy. 2013;68(8):974–82. doi:10.1111/all.12184.

    CAS  PubMed  Article  Google Scholar 

  21. 21.

    Cork MJ, Danby SG, Vasilopoulos Y, Hadgraft J, Lane ME, Moustafa M, et al. Epidermal barrier dysfunction in atopic dermatitis. J Investig Dermatol. 2009;129(8):1892–908.

    CAS  PubMed  Article  Google Scholar 

  22. 22.

    Ziyab AH, Karmaus W, Holloway JW, Zhang H, Ewart S, Arshad SH. DNA methylation of the filaggrin gene adds to the risk of eczema associated with loss-of-function variants. J Eur Acad Dermatol Venereol. 2013;27(3):e420–3. doi:10.1111/jdv.12000.

    CAS  PubMed Central  PubMed  Article  Google Scholar 

  23. 23.

    Wapenaar MC, Monsuur A, van Bodegraven A, Weersma RK, Bevova M, Linskens R, et al. Associations with tight junction genes PARD3 and MAGI2 in Dutch patients point to a common barrier defect for celiac disease and ulcerative colitis. Gut. 2007;57(4):463–7.

    PubMed  Article  Google Scholar 

  24. 24.

    De Benedetto A, Rafaels NM, McGirt LY, Ivanov AI, Georas SN, Cheadle C, et al. Tight junction defects in patients with atopic dermatitis. J Allergy Clin Immunol. 2011;127(3):773–86. e7.

    PubMed Central  PubMed  Article  Google Scholar 

  25. 25.

    Furuse M, Hata M, Furuse K, Yoshida Y, Haratake A, Sugitani Y, et al. Claudin-based tight junctions are crucial for the mammalian epidermal barrier a lesson from claudin-1–deficient mice. J Cell Biol. 2002;156(6):1099–111.

    CAS  PubMed Central  PubMed  Article  Google Scholar 

  26. 26.

    Niessen CM. Tight junctions/adherens junctions: basic structure and function. J Investig Dermatol. 2007;127(11):2525–32.

    CAS  PubMed  Article  Google Scholar 

  27. 27.

    Faura Tellez G, Nawijn MC, Koppelman GH. Protocadherin-1: epithelial barrier dysfunction in asthma and eczema. Eur Respir J. 2014;43(3):671–4. doi:10.1183/09031936.00179713.

    PubMed  Article  Google Scholar 

  28. 28.

    Harmon LE. Melanogenesis and pigmentary disturbances. J Natl Med Assoc. 1964;56(6):501.

    CAS  PubMed Central  PubMed  Google Scholar 

  29. 29.

    Hall JM, Lingenfelter P, Adams S, Lasser D, Hansen J, Bean M. Detection of maternal cells in human umbilical cord blood using fluorescence in situ hybridization. Blood. 1995;86(7):2829–32.

    CAS  PubMed  Google Scholar 

  30. 30.

    Kurukulaaratchy R, Fenn M, Waterhouse L, Matthews S, Holgate S, Arshad S. Characterization of wheezing phenotypes in the first 10 years of life. Clin Exp Allergy. 2003;33(5):573–8.

    CAS  PubMed  Article  Google Scholar 

  31. 31.

    Soto-Ramírez N, Ziyab AH, Karmaus W, Zhang H, Kurukulaaratchy RJ, Ewart S, et al. Epidemiologic methods of assessing asthma and wheezing episodes in longitudinal studies: measures of change and stability. J Epidemiol. 2013;23(6):399.

    PubMed Central  PubMed  Article  Google Scholar 

  32. 32.

    Arshad S, Karmaus W, Kurukulaaratchy R, Sadeghnejad A, Huebner M, Ewart S. Polymorphisms in the interleukin 13 and GATA binding protein 3 genes and the development of eczema during childhood. Br J Dermatol. 2008;158(6):1315–22.

    CAS  PubMed Central  PubMed  Article  Google Scholar 

  33. 33.

    Miller S, Dykes D, Polesky H. A simple salting out procedure for extracting DNA from human nucleated cells. Nucleic Acids Res. 1988;16(3):1215.

    CAS  PubMed Central  PubMed  Article  Google Scholar 

  34. 34.

    Du P, Zhang X, Huang C-C, Jafari N, Kibbe WA, Hou L, et al. Comparison of beta-value and M-value methods for quantifying methylation levels by microarray analysis. BMC Bioinformatics. 2010;11(1):587.

    CAS  PubMed Central  PubMed  Article  Google Scholar 

  35. 35.

    Hernandez-Vargas H, Lambert M-P, Le Calvez-Kelm F, Gouysse G, McKay-Chopin S, Tavtigian SV, et al. Hepatocellular carcinoma displays distinct DNA methylation signatures with potential as clinical predictors. PLoS One. 2010;5(3):e9749.

    PubMed Central  PubMed  Article  Google Scholar 

  36. 36.

    Wang D, Yan L, Hu Q, Sucheston LE, Higgins MJ, Ambrosone CB, et al. IMA: an R package for high-throughput analysis of Illumina’s 450K Infinium methylation data. Bioinformatics. 2012;28(5):729–30.

    CAS  PubMed Central  PubMed  Article  Google Scholar 

  37. 37.

    Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007;8(1):118–27.

    PubMed  Article  Google Scholar 

  38. 38.

    Liaw A, Wiener M. Classification and Regression by randomForest. R News. 2002;2(3):18–22.

    Google Scholar 

  39. 39.

    Calle ML, Urrea V. Letter to the editor: stability of random forest importance measures. Brief Bioinform. 2011;12(1):86–9.

    PubMed  Article  Google Scholar 

  40. 40.

    Bureau A, Dupuis J, Falls K, Lunetta KL, Hayward B, Keith TP, et al. Identifying SNPs predictive of phenotype using random forests. Genet Epidemiol. 2005;28(2):171–82.

    PubMed  Article  Google Scholar 

Download references


The study conveyed in this publication was supported by the National Institute of Allergy and Infectious Diseases under award number R01 AI091905-01 (PI: Wilfried Karmaus). The 10-year follow-up of this study was funded by the National Asthma Campaign, UK (Grant No 364) and the 18-year follow-up by NIH/NHLBI R01 HL082925-01 (PI: S. Hasan Arshad). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The authors gratefully acknowledge the cooperation of the children and parents who participated in this study and appreciate the hard work of Mrs. Sharon Matthews and the Isle of Wight research team in collecting data and Nikki Graham for technical support. We thank the High-Throughput Genomics Group at the Wellcome Trust Centre for Human Genetics (funded by Wellcome Trust grant reference 090532/Z/09/Z and MRC Hub grant G0900747 91070) for the generation of the methylation data.

Author information



Corresponding author

Correspondence to H. Zhang.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

GAL, HZ, and BMQ cleaned the methylation data. BMQ conducted the statistical analyses, interpreted the data, and drafted the manuscript. HZ directed the analysis and guided interpretation of the results. FIR pre-processed the DNA methylation data. TME supervised the assessment of DNA methylation. MR, GAL, SRT, AK, and JWH helped in conducting analyses. SHA was responsible for cohort assessments, eczema phenotype data, and sample collection and provided critical revision. WK designed the study, reviewed the data quality, and helped with statistical analyses. All authors contributed to manuscript drafting and final proofreading. All authors read and approved the final manuscript.

Additional file

Additional file 1: Table S1.

Beta coefficients and P values after log-linear model of the 140 CpGs present in the 12th iteration of the RF algorithm output and their genetic details. Software necessary to view: Adobe Reader.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Quraishi, B.M., Zhang, H., Everson, T.M. et al. Identifying CpG sites associated with eczema via random forest screening of epigenome-scale DNA methylation. Clin Epigenet 7, 68 (2015).

Download citation


  • Eczema
  • Allergic disease
  • DNA methylation
  • Epigenome-scale
  • Epigenetics
  • Random forest
  • F1 and F2 generations
  • CpG