Identifying CpG sites associated with eczema via random forest screening of epigenome-scale DNA methylation

Background The prevalence of eczema is increasing in industrialized nations. Limited evidence has shown the association of DNA methylation (DNA-M) with eczema. We explored this association at the epigenome-scale to better understand the role of DNA-M. Data from the first generation (F1) of the Isle of Wight (IoW) birth cohort participants and the second generation (F2) were examined in our study. Epigenome-scale DNA methylation of F1 at age 18 years and F2 in cord blood was measured using the Illumina Infinium HumanMethylation450 Beadchip. A total of 307,357 cytosine-phosphate-guanine sites (CpGs) in the F1 generation were screened via recursive random forest (RF) for their potential association with eczema at age 18. Functional enrichment and pathway analysis of resulting genes were carried out using DAVID gene functional classification tool. Log-linear models were performed in F1 to corroborate the identified CpGs. Findings in F1 were further replicated in F2. Results The recursive RF yielded 140 CpGs, 88 of which showed statistically significant associations with eczema at age 18, corroborated by log-linear models after controlling for false discovery rate (FDR) of 0.05. These CpGs were enriched among many biological pathways, including pathways related to creating transcriptional variety and pathways mechanistically linked to eczema such as cadherins, cell adhesion, gap junctions, tight junctions, melanogenesis, and apoptosis. In the F2 generation, about half of the 83 CpGs identified in F1 showed the same direction of association with eczema risk as in F1, of which two CpGs were significantly associated with eczema risk, cg04850479 of the PROZ gene (risk ratio (RR) = 15.1 in F1, 95 % confidence interval (CI) 1.71, 79.5; RR = 6.82 in F2, 95 % CI 1.52, 30.62) and cg01427769 of the NEU1 gene (RR = 0.13 in F1, 95 % CI 0.03, 0.46; RR = 0.09 in F2, 95 % CI 0.03, 0.36). Conclusions Via epigenome-scaled analyses using recursive RF followed by log-linear models, we identified 88 CpGs associated with eczema in F1, of which 41 were replicated in F2. Several identified CpGs are located within genes in biological pathways relating to skin barrier integrity, which is central to the pathogenesis of eczema. Novel genes associated with eczema risk were identified (e.g., the PROZ and NEU1 genes). Electronic supplementary material The online version of this article (doi:10.1186/s13148-015-0108-y) contains supplementary material, which is available to authorized users.


Background
The increasing prevalence of allergic conditions including eczema is a major public health concern in industrialized nations [1]. The prevalence of eczema is reported to be 10-30 % in children and 1-3 % in adults of the developed world [2]. In addition to the physical discomfort to the affected individual and the social burden on their families, eczema has a huge economic impact on nations' health care budgets [3].
Eczema is a chronic condition involving a complex interplay of genetic, epigenetic, and environmental factors [4][5][6]. So far, DNA methylation (DNA-M) remains the most studied mechanism with potential answers to epigenetic regulation of gene function [7,8]. The Illumina Infinium HumanMethylation450 Beadchip has the ability to measure DNA methylation at more than 450 K cytosine-phosphate-guanine sites (CpGs), which provides rich information for various epigenetic studies. Epigenome-scale studies offer an immense opportunity to understand disease pathophysiology, but there are also concerns about the challenges associated with this type of studies. A recent review published in 2014 by Paul et al. highlighted the potential challenges in the field of epigenomics [9] such as study design, methodologies of obtaining biologic samples, high dimensionality, and highly correlated data [9,10].
Random forest (RF) is a machine learning algorithm used for classification and has the ability to efficiently handle high dimensionality and highly correlated data [11]. The R package was used in this study to screen CpG sites potentially associated with eczema. RF is composed of classification trees with each tree constructed using randomly selected bootstrap samples. Misclassification rates calculated based on testing samples can be used to estimate the accuracy of the forests.
In this study, we utilized a method built upon RF to screen specific CpGs potentially associated with eczema using data in the first generation (F 1 ) at age 18 years and functionally annotated the genes of the identified CpGs using DAVID [12] to understand the biological pathways. For the identified CpGs via the RF-based method, we further examined their statistical significance on their linear association with eczema risk at age 18 years using log-linear models and replicated the findings from the F 1 in the second generation (F 2 ).

Results
Eczema frequencies in F 1 (18 years) and in F 2 (3, 6, and 12 months) generations stratified by sex indicated that females had higher eczema prevalence than males at 18 years of age in the F 1 generation, and the prevalence switched in the newborns of the F 2 generation (Table 1). This is consistent with the gender-reversal pattern of eczema reported in our earlier work [13].
In the screening process using recursive RF [14,15], the parameters (sampsize, mtry, and ntree-details are in the "Statistical analysis" section) in the randomForest() R package were selected to achieve stabilized error rates. In total, pre-processed DNA methylation of 307,357 CpGs in the F 1 generation was included in the screening. The results of the recursive RF (Table 2, Fig. 1; details in the "Statistical Analysis" section) Indicated that a total of 140 CpGs (after excluding 8 CpGs located on the X chromosome) passed the screening showing potential association with eczema. The exclusion of the 8 CpGs were due to the potential bias measurement of DNA-M for different genders. Nevertheless, in the following analyses (log-linear models below), we assessed whether gender played a role in the association of DNA-M with eczema.
Further examination of these 140 CpGs from F 1 using log-linear models indicated that 88 out of 140 CpGs had a statistically significant linear association with eczema at age 18 (FDR-adjusted P value <0.05) (Additional file 1: Table S1). We also tested the statistical significance of the interaction between DNA-M and gender; none of the FDR-adjusted P values were <0.05.
We assessed the biological pathways enriched within the genes annotated to those 140 CpGs using DAVID ( Table 3). The most significantly enriched pathways related to the creation of transcriptional variety through genetic (e.g., polymorphism) and regulatory (e.g., alternative splicing) mechanisms. The remainder of the significantly enriched pathways included several pathways mechanistically linked to epithelial barrier integrity and

Replication results
We then replicated the findings from the F 1 generation in the F 2 generation. In total, 83 out of the 88 CpGs identified in the F 1 were also present in the F 2 dataset (the 5 CpG sites in the F2 were excluded after quality control). DNA methylation at 41 CpGs (out of 83) showed the same direction of changes with eczema in both the F 1 and F 2 generations (Table 4, Fig. 2). Of these 41 CpGs, two were statistically significantly associated with eczema risk in both generations ( generation, about 60 % CpGs (n = 25 CpGs) showed a statistically significant difference in DNA methylation between the two generations (based on two sample two sided t tests) after adjusting for multiple testing. Since some of the F2 generation are offsprings of subjects in the F1 generation, the findings tend to be conservative. The above analyses were adjusted for estimated cell type proportions [16].

Discussion
This is the first study to explore epigenome-scale DNA methylation patterns associated with eczema. Using data from two generations, our study based on data of the F 1 generation identified CpGs potentially associated with eczema status using the RF technique, which was further corroborated via log-linear models. In total, 140 CpGs were identified via RF, which were further assessed using log-linear models with 88 CpGs being statistically significantly associated with eczema risk after adjusting for cell type proportions and controlling for multiple testing.
The remaining 52 CpGs were not corroborated in loglinear models. This is likely due to two reasons. Firstly, the 140 CpGs were identified based on their importance values in terms of minimizing misclassification errors other than statistical testing [11]. It is possible that the identified CpG sites did not have a statistically significant main effect on eczema risk. Secondly, among the 140 CpGs, complex non-linear interactions are likely to exist between multiple CpGs which may be difficult to parametrically identify using log-linear models. Using F 2 generation data, around 50 % (41 CpGs) of these 88 CpGs identified in the F 1 generation were further replicated. In particular, two CpGs showed statistically significant results in both F 1 and F 2 : cg04850479 in the PROZ gene and cg01427769 in the NEU1 gene. Although some studies have linked NEU1 gene with asthma [17] via Th2-mediated airway inflammation [18,17], and it is known that the Th2 pathway is also important for eczema [19,20], based on our knowledge, no study has so far spotted its role in eczema. The insignificant findings on the association of DNA methylation of cg04850479 (in the PROZ gene) and cg01427769 (in the NEU1 gene) is likely due to tissue-specific gene expression. That is, an early exposure has left a change in methylation in all tissues including blood but the gene is not expressed in blood but skin for eczema. It is also possible that the DNA methylation of these two CpGs is related to the production of dysfunctional transcripts. Enrichment analysis of the CpG sites identified in the F 1 generation highlighted pathways related to the creation of transcriptional variation and several biological pathways related to the epidermal barrier and involved in eczema (Table 3). The skin barrier is crucial in maintaining skin integrity, and disruption of the epidermal barrier is one of the important mechanisms in the pathogenesis of eczema [21,22]. Several studies reported that skin barrier dysfunction is a result of the impairment of tight junction function in eczema patients [23][24][25][26]. Cadherins and protocadherins are transmembrane proteins important for cell-to-cell adhesion and epithelial integrity and have been associated with eczema and asthma in genetic studies [27]. Chronic eczema and several other dermatoses are also related to hyperpigmentation of the skin [28]. Our study detected differentially methylated CpGs within genes in pathways relating to epidermal barrier integrity and eczema pathogenesis, including cadherins, gap junction, cell adhesion, tight junction, melanogenesis, and apoptosis (Table 3). Their biological functions suggest these eczema-associated CpGs are of special interest, and they are potential epigenetic biomarkers for eczema. The detection of eczema-associated differential methylation within pathways already known to be associated with eczema is reasonable and suggests that epigenetic and genetic variation may work together to regulate eczema-associated gene expression in the genes identified here, as has already been observed in other eczema-associated genes [22].
Several limitations were identified in the process of our study. Although the 140 CpGs were chosen based on the least misclassification error rate, it is possible that some CpGs were incorrectly removed and vice versa. Also, cord blood contains a small amount of maternal cells [29], which may bias the measure of DNA methylation, but our cell type correction performed in this study was expected to reduce the bias. Findings from the F 1 generation were partially replicated in the F 2 generation. This could be due to age playing a role in the CpGs predicting eczema; adolescence transition has the potential to revise DNA methylation. This is supported by our comparison of DNA methylation between the F1 and F2 generations among the CpGs not replicated. Not all CpGs selected by random forests were involved in known eczema-associated biological pathways, which may be due to complex interactions between the CpGs hence requires further investigation. It is possible that some of the identified CpGs may be associated with the severity of eczema. Hence, there is a need to further examine potential associations of DNA methylation of those CpG sites with  eczema severity. For multiple CpG sites, DNA methylation was associated with eczema in the F 1 generation at age 18. These CpG sites could be risks or consequences of eczema. However, CpGs replicated in the F 2 generation were measured in cord blood before the onset of eczema and thus have the potential to predict eczema.

Conclusions
This is the first epigenome-scale association study of eczema employing a classification technique (recursive RF), and we identified eczema-associated CpG sites. The findings added to the existing knowledge that recursive RF can be successfully employed in drawing actionable results from complex datasets. Genes annotated to eczema-associated CpGs were significantly enriched in pathways related to the creation of transcriptional variation and pathways relating to epidermal barrier function and eczema. Furthermore, the study identified for the first time that the PROZ and NEU1 genes are potential predictors of eczema.

Isle of Wight birth cohort
The Isle of Wight (IoW) birth cohort was established to study the natural history of allergic diseases among children who were born between January 1, 1989 and February 28, 1990 on the Isle of Wight, UK. The study was approved by the local research ethics committee and written informed consent was obtained from the parents. After exclusion of adoptions, perinatal deaths, and refusal, 1456 children (95 %) were enrolled. Children were followed-up at ages 1 (n = 1167), 2 (n = 1174), 4 (n = 1218), 10 (n = 1373), and 18 years (n = 1313); detailed questionnaires were administered at each follow-up. Details of the birth cohort have been described elsewhere [4,30,31]. A total of 244 women and 122 men at age 18 years were randomly selected from the cohort for epigenome-scale DNA methylation studies. Ethics approvals were obtained from the Isle of Wight Local Research Ethics Committee (now named the National Research Ethics Service, NRES Committee South Central -Southampton B) at recruitment and for the subsequent follow-ups (06/Q1701/34).

Outcome: eczema phenotype data collection
Eczema was defined as chronic or chronically relapsing itchy dermatitis lasting more than 6 weeks with characteristic morphology and distribution [32], following Hanifin and Rajka criteria [5].

DNA methylation
DNA was extracted from whole blood and umbilical cord blood using a standard salting out procedure [33]. DNA concentration was determined by Qubit quantitation. One microgram of DNA was bisulfite-treated using the EZ 96-DNA methylation kit (Zymo Research, Irvine, CA, USA) following the manufacturer's standard protocol.   Epigenome-scale DNA methylation was assessed using the Illumina Infinium HumanMethylation450 Beadchip (Illumina, Inc., San Diego, CA, USA), which interrogates >484,000 CpGs associated with approximately 24,000 genes. Arrays were processed using a standard protocol as described elsewhere [7], with multiple identical control samples assigned to each batch to assess assay variability, and samples were randomly distributed on microarrays to control against batch effects. The methylation level (β value) for each CpG was determined using the Methylation module of GenomeStudio software (Illumina, Version 2011.1).
Methylation levels for each CpG site are recorded as beta (β) values, which represent the proportion of methylated (M) over methylated (M) plus unmethylated (U) probes (β = M/[c + M + U], with constant c introduced for the situation of too small M + U) and can be interpreted as percentage methylation. These values were utilized in the RF screening process described below; however, β values close to 0 or 1 tend to suffer from severe heteroscedasticity; therefore, logit-transformed β values (M values, approximated by log 2 (β / (1-β)) [34] were used in other analyses including log-linear models.

Pre-processing DNA methylation data
In our study, the detection P value reported by Geno-meStudio was used as a QC measure of probe performance. Probes whose detection P values were >0.01 in >10 % of the samples were removed [35]. Methylation data were then pre-processed using the Bioconductor IMA (Illumina methylation analyzer) package and Com-Bat was used to perform peak correction and adjust for inter-array variation [36,37]. To ensure that our findings were not biased by SNPs affecting measurement of methylation levels, we excluded all probes with a potential SNP in the probe sequence. After pre-processing, a total of 307,357 CpGs were retained in the DNA methylation dataset.

Statistical analysis
Pearson's χ 2 tests were used to determine if prevalence of eczema differed between the sexes. P values were considered significant at a level of 0.05. To make sure that our findings are not a result of confounding due to cell types, we ran the analyses by adjusting for estimated proportions of CD8 + T cells, CD4 + T cells, natural killer cells, B cells, monocytes, and granulocytes. Cell type proportions were estimated as described previously [16].
The random forest package, randomForest(), in R was utilized to conduct the recursive RF analyses [38,15,14]. The parameter sampsize refers to the size of the sample of training data sets that is to be obtained for classification. The number of variables that are randomly sampled as predictors at each split is called mtry, whereas, ntree is a parameter referring to the total number of trees that are to be grown in the forest. In order to improve the prediction accuracy of the RF algorithm, these three parameters were repeatedly altered until the lowest misclassification rate was obtained. We decided whether to use a balanced sampsize of equal eczema and non-eczema cases such as 20 eczema and 20 non-eczema cases or 30/30 or 40/40. We also studied imbalanced RFs with sampsize such as 46/320 or 20/40 for the training sets by using the default values for mtry and ntree. We then tested the prediction accuracy of the RFs at different combinations of mtry (√p, 2*√p, 0.1p, 0.15p, 0.2p, and 0.25p) where p is number of variables and ntree (200, 500, 1000, and 1500). Once the optimal parameter values were selected, the recursive RF algorithm was implemented. Mean Decrease Gini (MDG) served as a variable importance measure (VIM) for our study as it was shown to be more robust in previous research [39].
DNA methylation at 307,357 CpGs along with sex and eczema status in the F 1 generation served as input in randomForest(), and the CpGs were subjected to data reduction, repeatedly dropping 50 % of variables with the lowest VIMs until the misclassification rate showed a significant increase. After testing for sampsize (both equal and unequal) with different combinations (both with and without eczema), we set sampsize = (31, 31), mtry = 0.2p (where p is the available number of variables) and ntree = 500. We applied RF to pre-processed DNA methylation data containing 307,357 CpGs in the F 1 generation, ran a total of 17 iterations, and at each iteration, recorded the misclassification rate (Table 2, Fig. 1). The lowest overall misclassification error rate (of eczema and eczema-free) was 5.2 %, with a corresponding least misclassification rate of 17.4 % for eczema at the 12th iteration. The overall misclassification rate dropped from 18.6 % in the first iteration to 5.2 % in the 12th iteration, and the eczema misclassification error rate dropped to 17.4 % at the end of 12th iteration from 95.7 % in the first iteration.
The CpGs identified from the recursive RF [40] were assessed for enrichment of biological pathways using DA-VID [12] bioinformatics tool and examined for their association with eczema at age of 18 years by use of log-linear models. Multiple testing was adjusted by controlling false discovery rate of 0.05 in the pathway analysis and loglinear models. Since differential cell types in the peripheral blood are known to have confounding effect on the final result [16], we adjusted for cell type correction. For genes of particular interest (e.g., showing statistical significance in both generations in log-linear models), robust regressions are applied to assess the association of DNA methylation and corresponding gene expressions in the F 2 generation. For this last test, multiple testing is adjusted within genes based on the number of CpG sites available of that gene.

Replication cohort
The IoW F 2 generation cohort includes the offspring of the IoW 1989 birth cohort. In the F 2 generation, repeated measures of eczema at ages 3, 6, and 12 months were recorded in a sample of n = 116 children. DNA methylation was measured in umbilical cord blood. To replicate the findings from the F 1 generation, log-linear models with repeated measures of eczema were used in F 2 generation analyses. Figure 3 represents the summary of statistical analysis and sample size for each analysis conducted in this study.