Normalization and preprocessing of tissue EAC datasets
TCGA EAC DNAm, mRNA expression and CNV dataset
We used the DNAm, mRNA expression and CNV datasets from the TCGA [19]. These datasets were normalized as described by us previously [36, 37]. Briefly, in the case of DNAm data, missing values in the level-3 data were imputed using impute.knn [38] with k = 5 and an imputation threshold of 30% (i.e. only probes with less than 30% missing values imputed, rest of probes were removed), followed by BMIQ normalization [39]. There were 12 normal-adjacent + 50 EAC samples measured with Illumina 450 k beadarrays, and 8 normal-adjacent + 79 EAC samples measured with RNA-Seq.
Validation bulk tissue DNAm datasets
-
Krause et al. [40] is an Illumina 450 k DNAm dataset comprising genome-wide DNA methylation profiles for 250 samples including 125 EAC and 64 normal adjacent squamous samples. Data is available from GEO (http://www.ncbi.nlm.nih.gov/geo/ under accession number GSE72872). Raw idat files were processed with minfi [41]. Probes with SNPs or with more than 25% missing values were removed. Rest of probes were imputed with impute.knn [38] with k = 5. Type-2 probe bias was corrected using BMIQ [39]. One sample was removed due to low quality as assessed using BMIQ, resulting in 63 normal adjacent and 125 EAC samples.
-
Kaz et al. [42] is an Illumina 450 k DNAm dataset comprising genome-wide DNA methylation profile of 127 esophageal samples, including 24 EAC and 11 normal adjacent squamous. Data is available from GEO (http://www.ncbi.nlm.nih.gov/geo/ under accession number GSE89181). Raw idat files were processed as described for Krause et al.
Validation bulk tissue gene expression datasets
-
Krause et al. [40] performed Illumina gene expression (HumanHT-12 V4.0 bEAChip) profiling of 65 esophageal samples (48 EAC, 17 normal adjacent squamous). We downloaded the provided log-normalized dataset from GEO (http://www.ncbi.nlm.nih.gov/geo/ under accession number GSE72874) and performed inter-sample normalization using quantile normalization with limma R package [43].
-
Lu et al. [44] used Affymetrix Human Gene 1.0 ST Arrays to profile 10 normal adjacent squamous and 12 EAC. Raw data was downloaded from GEO (http://www.ncbi.nlm.nih.gov/geo/under accession number GSE92396). We successively applied intra-sample and inter-sample normalization in the dataset using affy and limma R packages, respectively.
scRNA-Seq dataset of Barrett’s Esophagus (BE)
We analysed a scRNA-Seq Smart-Seq2 dataset from Owen et al. [25] which profiled BE specimens as well as normal-adjacent tissue for a number of different patients. We downloaded the gene-count matrix from the website provided with the publication, and processed it with the Seurat pipeline [45]. We selected cells with at least 200 expressed genes, and selected genes expressed in at least 3 cells. Counts were then log-normalized using a scale factor of 104. In total we obtained 2009 cells from 4 different patients and from 4 distinct tissues: normal esophagus (n = 450), duodenum (n = 349), gastric (n = 622) and BE (n = 588). To assess the sources of variability in this dataset we ran the standard Seurat pipeline, including variable feature selection (with variance stabilization), PCA, graph-based clustering and tSNE visualization. Graph-based clustering was done over 8 components (inferred with ElbowPlot function) and at a resolution of 0.1, which resulted in 6 clusters.
Inference of FEM-modules from TCGA EAC dataset
To identify epigenetically deregulated gene-modules in EAC, we applied our Functional Epigenetic Module (FEM) algorithm [14]. Briefly, this approach integrates DNA methylation with gene expression data in the context of a protein–protein-interaction (PPI) network to identify hotspots (gene-modules) where there is substantial coordinated DNAm and gene expression changes. The algorithm consists of three steps: (1) quantify t-statistics of differential DNAm (DM) and gene expression (DE) between adjacent normal and EAC for each gene in the PPI network, (2) the edges of the network are weighted according to an integrated statistic for each gene defining the edge, where the integrated statistic per gene is obtained from the corresponding differential DNAm and differential expression statistics, and (3) inference of hotspots as subnetworks of particularly high modularity, i.e. subgraphs where the average weighted edge density is high compared to the rest of the network. The PPI network consists of 11,751 genes annotated to NCBI Entrez identifiers and is derived from the Pathway Commons resource [46, 47]. In step-(1), in order to arrive at a single statistic for DM, we summarize DNAm for each gene as the average over TSS200 probes (i.e. probes within 200 bp of the TSS). If such probes are not available, we use 1st Exon probes instead, and if also not available then we use the average over TS1500 probes. This strategy is motivated by the fact that for these regions, there is generally an inverse relation between DNAm and gene expression, and the algorithm seeks to identify gene modules where this pattern is observed frequently. To derive the statistics for DM and DE we then use the limma R-package [43, 48, 49]. In step-(2), the integrated statistic per gene is then constructed as
$$t_{g}^{\left( I \right)} = \left\{ {H\left( {t_{g}^{{\left( {DM} \right)}} } \right)H\left( { - t_{g}^{{\left( {DE} \right)}} } \right) + H\left( { - t_{g}^{{\left( {DM} \right)}} } \right)H\left( {t_{g}^{{\left( {DE} \right)}} } \right)} \right\}\left| {t_{g}^{{\left( {DM} \right)}} - t_{g}^{{\left( {DE} \right)}} } \right|$$
where H(x) is a Heaviside function, H(x) = 1 if x > 0 and H(x) = 0 if x < 0. If genes g and h are connected in the PPI network, we then assign the edge weight by taking the average of the corresponding integrated statistics of the genes, i.e.
$$w_{gh} = 0.5*\left( {t_{g}^{\left( I \right)} + t_{h}^{\left( I \right)} } \right)$$
To infer the modules (“FEM-modules”) in step-3, we then use a local greedy adaptation of a powerful spin-glass algorithm [23, 50]. The spin-glass algorithm tries to minimize the following Hamiltonian energy function
$$H\left( {\left\{ s \right\}} \right) = - \mathop \sum \limits_{g \ne h} \left\{ {w_{gh} - \gamma p_{gh} } \right\}\delta \left( {s_{g} ,s_{h} } \right)$$
where \(s_{g}\) is the spin-state (i.e., module) the gene g belongs to, \(\delta \left( {x,y} \right) = 1\) iff x = y and 0 otherwise, and \(p_{gh}{ \sim}w_{g.} w_{h.}\) is the null probability (once normalized) with \(w_{g.}\) denoting the weighted degree of gene g. In the local greedy version we try to grow modules around a number of “seed-genes” defined as the highest ranked genes by the integrated statistic \(t_{g}^{\left( I \right)}\). We choose on the order of 100 seed genes, to ensure that most of the network is explored. In previous studies we have found that this number of seed genes works well and already leads to redundant modules. The spin-glass parameter \(\gamma\) was chosen to be 0.5, as this parameter choice typically leads to modules in the size range of 10–100, which is the optimal size-range as shown by us previously [23]. The statistical significance of the modules is determined in two complementary ways: a module is grown from a given seed-gene in a deterministic fashion by adding the neighboring gene that minimizes the Hamiltonian energy and this process is continued until no further additions can decrease the energy function. This assesses the significance of the edge-weight density of the module in relation to the rest of the network and is strongly influenced by the topology of the network. As a second topology-independent test, we assess statistical significance of the inferred modules using a Monte Carlo (MC) randomization procedure (1000 Monte-Carlo runs), where we randomize the statistics over the network (thus keeping the topology fixed), recomputing modularity scores for each module and subsequently comparing the observed modularity to that of this empirically generated null distribution. The final FEM-modules are those with a P < 0.05.
Computation of FEM-score in validation tissue EAC datasets
In the case of the DNAm validation sets, we first summarized the Illumina 850 k/450 k DNAm values to the promoters of the genes involved in a given module. This is done following our previously validated procedure [14]. This procedure averages the DNAm values of CpG probes mapping to within 200 bp of the TSS of each gene. If no probes map to within 200 bp, we use probes mapping to the 1st Exon, and if not available, we resort to probes mapping 1.5 kb upstream of the TSS. Subsequently, we select the genes in the module that were significantly differentially methylated in the original EAC TCGA cohort. For each of these genes, we then z-score normalize their DNAm profile over all samples within the given validation cohort, using the mean and standard deviation as estimated over the samples from normal/healthy individuals. That is, if \(x_{gs}\) denotes the DNAm value for gene g in sample s, we compute
$$z_{gs} = \frac{{x_{gs} - \mu_{gN} }}{{\sigma_{gN} }}$$
where \(\mu_{gN} , \sigma_{gN}\) denote the mean and standard deviation DNAm of gene g across the normal samples. The FEM-score for module m in sample s is then obtained as
$$FEMscore_{ms} = \frac{1}{\left| m \right|}\mathop \sum \limits_{g \in m} Sign\left( {t_{g} } \right) z_{gs}$$
where \(\left| m \right|\) is the number of significantly differentially methylated genes in the module (as determined by the discovery TCGA EAC dataset) and \(Sign\left( {t_{g} } \right)\) is the sign (+ 1/− 1) of the corresponding t-statistic of differential DNAm from the TCGA EAC cohort. Thus, the FEMscore assesses whether the deviation in DNAm relative to the normal state is consistent with the pattern observed in the TCGA EAC cohort, with a higher FEMscore in cancer patients indicating a coordinated epigenetic deregulation consistent with that seen in the discovery TCGA set.
In the case of the gene-expression datasets, the FEMscore of the module was computed by z-score normalizing the expression values of the module genes, using the mean and standard deviation of the gene over the normal samples, in direct analogy to the DNAm case. Here, only genes that were significantly differentially expressed in the discovery TCGA EAC cohort are used. Given the z-scores, the FEMscore is then obtained as the weighted average over all significant module genes, with the weight being + 1 if the gene is overexpressed in cancer relative to normal according to the TCGA dataset, and − 1 if underexpressed.
Computation of FEM-score in scRNA-Seq data of BE
Here we selected cells from normal esophagus (n = 450) and BE (n = 588), and ran the above analysis to compute FEM-scores in each single cell, using the normal cells to define the z-score transformation. FEM-scores could only be reliably computed for the CTNND2-module: the number of significantly differentially expressed genes (from the TCGA EAC cohort) in each module that were also variable in the scRNA-Seq dataset was less than 5 for all modules except CTNND2 for which the number of variable genes was 11. FEM-scores were then compared between the normal and BE cells using Wilcoxon rank sum tests, stratified by patient.
Computation of FEM-score in TCGA ESCA data
The procedure for computing the FEM-score in the TCGA data was slightly different from the previously described one, because for the TCGA we have both DNAm and RNA-Seq data. Briefly, the FEM-score for module m in sample s was computed as
$$FEMscore_{ms} = \frac{1}{\left| m \right|}\mathop \sum \limits_{g \in m} \left| {z_{gs}^{\left( M \right)} - z_{gs}^{\left( R \right)} } \right|$$
where \(\left| m \right|\) is the number of significantly and consistently differentially methylated and differentially expressed genes in the module (as determined in the TCGA dataset itself), and where \(z_{gs}^{\left( M \right)}\) and \(z_{gs}^{\left( R \right)}\) are the corresponding z-scores for the DNAm and RNA-Seq datasets. By consistently differentially methylated and differentially expressed we mean that the pattern is anti-correlative (i.e., promoter hypermethylation and underexpression, or promoter hypomethylation and overexpression). The z-scores were computed as described previously, with the exception of the scores for the DNAm data where we added a constant offset term to the standard deviation, since the number of normal-adjacent samples with both DNAm and mRNA data is small (n = 6) and this can lead to spurious low variances and inflated scores in the DNAm-case. Specifically, for the DNAm-data we defined
$$z_{gs}^{\left( M \right)} = \frac{{x_{gs} - \mu_{gN} }}{{\sigma_{gN} + \gamma }}$$
where \(\gamma\) was determined by requiring that the standard deviation of the z-scores for DNAm, as evaluated genome-wide over all genes in the experiment, equals the standard deviation of the z-scores for RNA-Seq data.
Saliva sample collection
Saliva was collected from two groups of patients: the primary “training” cohort of 192 patients was collected through the SPIT Study (Saliva to predict risk of disease using transcriptomics and epigenetics, ISRCTN11921553) at 15 hospitals in the United Kingdom. This study was ethically approved by the West Midlands-Coventry and Warwickshire Research Ethics Committee (REC Reference No: 17/WM/0079). The validation cohort of 49 patients were collected through the BOOST Study (Barrett’s oesophagus surveillance with optical biopsy using spectroscopy and enhanced endoscopic imaging to target high-risk lesions, ISRCTN58235785) at a single hospital site (UCLH). This study was ethically approved by the London-Dulwich Research Ethics Committee (REC Reference No: 08/H0808/8). All participants gave written consent.
Patients were instructed to fast for a minimum of one hour and then to spit repeatedly into a saliva DNA collection device to a total of 2 mL. The primary cohort used Oragene DNA OG-500 (DNAGenotek, Ottawa, Canada) and the validation cohort used SimplOFy™ (Oasis Diagnostics® Corporation, Vancouver, Canada). Saliva was sent to the central laboratory at UCL and stored at − 80 °C until extraction.
DNA Extraction was carried out using the Zymo Quick-DNA™ Miniprep Plus Kit (Zymo Research, Irvine CA, USA). DNA quality was assessed using the Agilent Bioanalyser (Agilent Inc, Santa Clara, CA, USA). Only samples with a 260/280 nm ratio of > 1.5 were used. Bisulphite conversion was undertaken using the Zymo EZ-96 DNA methylation kit (Zymo Research, Irvine CA, USA) and samples containing 500 ng DNA were then DNAm profiled using the Illumina Infinium HumanMethylationEPIC BeadChip (Illumina San Diego, CA, USA).
Analysis of saliva DNAm datasets
Raw signal intensities were processed from idat files through a standard pipeline that we have extensively validated [18]. Briefly, we processed idat files with the minfi R-package with no background correction and using the Illumina definition of beta-values. Probes with P values of detection < 0.05 were declared reliable measurements, the rest being set to NA. We removed cross-reactive probes, probes with more than 4 SNPs, probes with a SNP at the interrogated CpG, and probes with less than 90% coverage. Remaining NAs were imputed with the impute R-package using impute.knn function with k = 5 [38]. Type-2 probe bias was corrected using our BMIQ algorithm [39]. The final beta-valued data matrix for the discovery cohort contained 689,033 probes and 163 samples, of which 65 were from healthy controls, 33 from patients diagnosed with non-dysplastic Barrett’s Esophagus (NDBE), 14 with high-grade dysplasia (HGD) and 51 with EAC. In the case of the validation cohort, a total of 49 samples passed QC, of which 5 were normal, 15 NDBE, 15 HGD and 14 EAC.
Computation of FEM-score in saliva EAC DNAm datasets
We applied the same procedure as for the tissue DNAm validation sets described above. Here, the normal samples used for performing the z-score transformation are the saliva samples from the age-matched healthy controls.
Estimation of cell-type fractions in tissue and saliva DNAm datasets
In the case of the TCGA EAC cohort we estimated cell-type fractions from the Illumina 450 k DNAm data with our HEpiDISH algorithm [13, 18]. Briefly, this uses a DNAm reference matrix defined over 3 broad cell-types (total epithelial, total fibroblast and total immune-cell), in conjunction with Robust Partial Correlations (RPC) to yield estimates for the total epithelial, total fibroblast and total immune-cell fractions in each of the TCGA samples. In the case of the saliva DNAm datasets, the same procedure was applied to obtain total epithelial and total immune-cells fractions (the fibroblast fraction in saliva is negligible, which was also confirmed by HEpiDISH).
To assess the source of epithelial cells in saliva, we built a separate DNAm reference matrix defined over 3 main cell-types: squamous epithelial cells, epithelial cells representative of EAC cell-lines, and a generic immune-cell. In detail, the Illumina 450 k DNAm data from 48 purified immune-cells (6 × 8 cell-types) was derived from Reinius et al. [51]. From Iorio et al. [52] we obtained Illumina 450 k DNAm profiles for a total of 7 EAC cell-lines (ESO26, ESO51, FLO-1, KYAE-1, OACM5-1, OACP4C an SK-GT-4). For the squamous epithelial cells we first identified buccal swabs with over 95% purity, as determined with our HEpiDISH algorithm with the DNAm reference matrix defined above, and as applied to our large buccal swab 450 k DNAm dataset of 790 buccal swabs [32]. We only considered never-smokers to avoid DNAm changes induced by smoking. This resulted in 10 buccal swab samples of high epithelial purity, which we took as representative for putative squamous epithelial cells in saliva. To select the features (CpGs) defining our DNAm reference matrix, we performed differential DNAm analysis between the 48 IC, 7 EAC cell-line and 10 squamous epithelial cell samples using the limma empirical Bayes framework [49, 53]. Specifically, we performed 3 pairwise comparisons (IC vs. EAC + SqEpi), (SqEpi vs IC + EAC) and (EAC vs. IC + SqEpi). In all cases, we selected features with an FDR < 0.05 and a difference in mean DNAm greater than 0.9, except for the second comparison where we relaxed the threshold a little to 0.875. This ensured similar numbers of differentially methylated CpGs (DMCs): 136, 123 and 153 for the 3 comparisons, respectively. This resulted in a DNAm reference matrix defined over 412 unique CpGs and 3 cell-types (total IC, squamous epithelial, EAC). To estimate cell-type fractions in saliva, we then applied our HEpiDISH/RPC algorithm with this DNAm reference matrix.