Using data from the HM450 methylation arrays (see “Methods”), we aimed to better characterize the full extent of epigenetic drift occurring in BE and the temporal dynamics of epigenetic drift and to explore the impact of advanced drift on gene expression in EAC for which we have both gene expression and methylation data. Methylation levels are measured either in terms of a β value (methylation fraction) or M value (logit2(β)) as indicated in the text.
Genomic scope of drift
Out of 146,029 hypomethylated CpG probes in normal squamous (NS) tissue, we identified 18,013 (12%) probes that have significant positive correlation and 560 (0.4%) that have a significant negative correlation (q < 0.01) with the mean differential drift (relative to NS levels) of 67 previously validated drift CpGs in 64 BE samples from patients without a diagnosis of dysplasia or cancer (see “Methods”). In contrast, out of 133,857 CpG probes that were hypermethylated in NS tissue, only 795 (0.6%) probes correlated positively and 3402 (2.5%) probes correlated negatively with the mean methylation drift levels of our 67 probe reference clock (Fig. 1). Thus, significant differential drift in BE involves thousands of CpGs, occurring predominantly in hypomethylated regions that are associated with CpG islands, affecting 4024 (24%) of the 16,984 hypomethylated CpG islands in NS tissue. In contrast, we found only 7% of the identified drift CpG probes to be “open sea,” i.e., isolated in the genome [17], compared to about 10% in the hypomethylated normal background.
The majority (63%) of islands that include drift CpGs are associated with gene promoter regions, i.e., they involve a transcription start site (TSS200 or TSS1500), while only 11% of islands that undergo drift overlap with the gene body, compared to 73% (TSS-associated) and 10% (body-associated) CpG islands on the HM450 array, respectively. In contrast, the relative abundance of intergenic CpG islands is significantly higher among CpG islands that undergo drift compared to the fraction of intergenic CpG islands found on the array (24 vs. 16%, see Fig. 1).
Out of 16,832 island-based (BE-specific) drift CpGs we identified, 1317 unique CpG islands with 5 or more drift CpGs per island (comprising a total of 11,425 drift CpGs). Figure 2 presents a karyograph of 4 autosomes indicating the genomic locations and mean β values for these islands across the 64 BE samples (Additional file 1: Figure S4 for all 22 autosomes). Figure 3 shows a heatmap of the island-level mean β values of the 1317 drift-associated CpG islands for the first 10 NS tissue samples, and all 64 BE samples used to identify CpG probes undergoing drift. For this map, both CpG islands and tissue samples were ordered by their respective mean values. As expected, all 10 normal control tissue samples show no island-level drift. In contrast, we see significant heterogeneity in mean methylation levels of these CpG islands ranging from < 20 to > 80% methylation across the 64 BE samples. With the notable exception of a group of samples that have undergone minimal drift, most BE samples show bimodal patterns of drift where some islands appear to linger at low levels and others show advanced drift. We later use the following categorization for the observed drift patterns in BE and EAC: unimodal low drift (group L), bimodal high drift with a major mode β > 50% (group H) and the remaining bimodal intermediate drift (group I).
Pairwise correlations between island-associated CpGs
CpG islands are considered functional genomic units that may exert transcriptional control by their collective state of methylation rather than through individual CpG sites. To demonstrate this collective behavior in an island-level DNA methylation, we evaluated the pairwise correlations between all island CpGs that are hypomethylated in NS tissue. In general, for static islands (that do not show significant drift), pairwise correlations are moderate (< 0.5) across the span of an island and exhibit anti-correlations near and beyond the island boundaries (Fig. 4). In contrast, island CpGs that drift have stronger pairwise correlations reflecting a collective response of these CpGs to drift consistent with processive DNA methylation maintenance [18, 19].
Figure 4 also shows that the pairwise correlations decay with genomic distance and, for drift CpGs, extend further into the island shelves than static CpGs. Islands that show tissue age-related drift are also significantly larger (in terms of genomic length) than static islands. The mean sizes of the static vs. drift CpG islands were 0.9 vs 1.1 kb, respectively (p = 2 × 10−10, two-sided t test).
Bimodal nature of epigenetic drift in BE and EAC
To see whether methylomic drift is uniformly distributed within our BE and EAC samples, we examined β value distributions for a subset of island-associated drift CpGs with a minimum of 5 detected drift CpGs per island (11,425 drift CpGs in total) for 64 BE and 24 EAC samples from the BETRNet (see “Methods”). Consistent with the patterns seen in Fig. 3, these individual-level distributions show signatures that fall into the arbitrary three types: with unimodal distributions showing low or no drift (group L), distinctly bimodal distributions with intermediate drift (both modes β < 0.5, group I), or bimodal with a major mode near or above β = 0.5 and a minor mode at lower levels (group H). (See Fig. 5a, for an aggregated view of the samples in these groups). While the distributions are similar for BE and EAC, EAC show more advanced drift in the third group (bimodal high) which may be attributed to EAC patients being on average older than the BE patients (68 vs 62 years, respectively), or to the fact that EAC undergoes more frequent stem cell divisions thereby increasing replication-coupled de novo methylation, or to the possibility that BE arises earlier in patients with EAC compared to patients who have not progressed to dysplasia or EAC. We found similar unimodal/bimodal drift signatures in 87 EAC from TCGA and in a combined set of 19 BE and 47 EAC tissue samples provided by Krause et al. [16] (GEO accession number: GSE72874).
Advanced drift is associated with low tumor stage
Using tumor stage information from the TCGA, we found a statistically significant association (p value = 0.024; Fisher’s exact test) of low tumor stage (AJCC stage I) vs advanced stage (AJCC stage III and higher) with the type of drift pattern (group H vs group L + I). Specifically, low-stage tumors are more prevalent in group H compared with group L + I among the 74 TCGA EAC for which tumor stage information was available (odds ratio 6.0 (1.1–63.3)).
Differential gene expression by drift group
To see whether gene expression patterns (at the mRNA level) differed between EAC samples that showed minor (unimodal low) drift and samples that showed advanced methylomic drift on a gene by gene basis, we matched 1240 drift CpG islands (out of 1317 CpG islands with 5 or more drift CpGs per island) with one or more (overlapping) genes to evaluate the relationship between gene expression and island-level methylation for the TCGA and the Krause et al. data sets. Specifically, we identified differentially expressed genes for which expression differed significantly between low- and high-drift samples by setting a threshold of β = 0.2 to delineate the two groups and using a two-sided Mann-Whitney-Wilcoxon test (q < 0.01) on normalized gene expression data.
In total, we identified 200 genes that were significantly underexpressed in the advanced drift group while only 10 genes were significantly overexpressed (Additional file 2: Table S1). Independently, we found 35 genes that were significantly repressed and none that were significantly overexpressed among 51 (47 EAC + 4 BE) samples provided by Krause et al. [16]. Importantly, several genes (20/35) that were found repressed in the smaller study by Krause et al. were also found repressed in TCGA (Additional file 2: Table S1). In particular, the gene most significantly repressed in TCGA (q = 5 × 10−9) was also ranked most significantly repressed in the data provided by Krause et al., CHFR (checkpoint with forkhead and ring finger domains), a mitotic stress checkpoint gene with tumor suppressive function that has been identified in a wide range of cancers [20, 21], and most recently as a significantly silenced gene in a large clustering analysis of esophageal adenocarcinoma [22]. This striking asymmetry between gene expression changes and methylomic drift is consistent with parallel findings that CpG promoter hypermethylation in cancers often is correlated with gene-silencing [6]. A Gene Ontology (GO)-based over-representation analysis using the Database for Annotation, Visualization and Integrated Discovery (DAVID) shows a highly significant greater than threefold enrichment of sequence-specific DNA binding transcription factor activity (p = 2 × 10−9, Additional file 2: Table S2). The most prominent group identified by this analysis is a family of repressive Krueppel-associated box (KRAB) domain zinc finger (ZNF) transcription factors (greater than sixfold, p = 1.3 × 10−15, Additional file 2: Table S2). KRAB-mediated transcriptional repression involves the binding of the KRAB domain to co-repressors potentially resulting in heterochromatin formation and silencing of endogenous retroviruses [23, 24].
Evidence for a threshold effect leading to bimodal drift
The drift patterns shown in Fig. 5a for BE and EAC samples suggest nonlinear drift dynamics in BE tissues. Specifically, the presence of a persistent mode in the drift distribution at low levels (β < 0.2) is indicative of a threshold below which drift is suppressed but advances rapidly once the mean level is surmounted. To validate that epigenetic drift occurs in our longitudinal samples (20 patients with 2 biopsies each separated by at least 3–4 years), we determined for each individual at two time points the number of drift CpGs that remained below (n11), respectively, the number that remained above (n22), and the number of drift CpGs that had crossed the threshold from low to high at β = 0.2 (n12) and, vice versa, from high to low (n21). The results, including the % fraction of drift CpGs advancing, n12/(n12 + n11), and the % fraction retarding between the two time points, n21/(n21 + n22), are listed in Additional file 2: Table S3 and shown as annual rates in Fig. 6. For 19/20 patients, we detect greater methylation flow from sub-threshold levels to higher levels at the second (later) biopsy compared to flow in the opposite direction.
We note that these findings are surprisingly consistent with the unimodal-to-bimodal epigenetic drift predictions made by Sontag et al. [8] who proposed a mathematical model that included a nonlinear relationship between de novo methylation and the ambient level of methylation present in a region of CpGs. To demonstrate that such a model results in unimodal-to-bimodal drift transitions over time, we explicitly simulated sporadic de novo methylation on an island of 50–100 CpGs, independently in 1000 cells, mimicking crudely the cell population in the tissue samples. Each CpG was assumed to be in a binary state (0/1 of being (un)methylated), and the states of the CpGs initially (at time t = 0) were sampled from a binomial distribution with probability 0.06 which equals the mean methylation level in our NS tissue samples. The CpG states were then propagated stochastically with a rate (probability per time step) of becoming methylated that increases 100-fold from a background of 10−4 to 10−2 when the mean level of methylation on the island crosses a threshold of β = 0.2.
Without a mathematical exploration of this Markov model, but straightforward in silico experimentation with the baseline distribution of methylation rates (specifically, a gamma distribution with mean 10−4 and variance 4 × 10−8) and threshold value, our simulations show that this simple model generates methylation density trajectories that typically bifurcate and strikingly resemble the observed drift signatures in our samples. Figure 5b shows a typical density trajectory for a region of 50 CpGs, an arrayed population of 1000 cells, every 100 time steps, for a total duration of 3000 time steps. Although our model differs in functional form from the model described in Sontag et al., it shares important features, including a suppression of de novo methylation at low levels and a nonlinear acceleration as the ambient (regional) level of methylation increases. In contrast, models that do not include this ambient methylation feedback on the local (site-specific) rate of methylation do not, in general, lead to bifurcations in the main (initial) mode of the evolving drift pattern, but still exhibit a weak bimodality as shown in Additional file 1: Figure S2. Additional file 1: Figure S3 further illustrates the stochastic nonlinear behavior of our model via simulated time course trajectories of the mean methylation levels for 10 CpG islands that share an identical drift rate distribution across their CpGs.