Human samples
Matched prefrontal cortex (PFC) and buccal samples were obtained in two batches from the neuropathology unit at the Massachusetts Alzheimer’s Disease Research Center (MADRC), Boston, MA, USA. Samples were shipped to our laboratory in two batches: “MADRC-1” and “MADRC-2”, encompassing 48 and 80 matched brain–buccal pairs, respectively. Buccal swabs were obtained from patients with neurodegenerative disease conditions and controls at the time of autopsy following an IRB-approved informed consent with specific inclusion of genetic studies. Consent forms were completed by next-of-kin or other legal representatives as specified by Massachusetts state law. Buccal-Prep Plus DNA Isolation Kit (Isohelix, UK) swabs were utilized to obtain buccal swabs; these were held at − 80 °C without dehydration until DNA extraction (see below). One hemisphere of each harvested brain was coronally sectioned, flash-frozen on dry ice, cryopreserved at − 80 °C, and used for subsequent PFC isolation and DNA methylation (DNAm) profiling (see below). The remaining hemisphere was fixed in 10% weight/volume formalin and subjected to detailed neuropathologic evaluation. Detailed descriptions of all MADRC buccal–brain samples used in this study can be found in Additional file 1: Table 1.
DNA extraction and processing
DNA was extracted and processed in two laboratory batches according to their shipment charge (i.e. MADRC-1 and MADRC-2; Additional file 1: Table 1). Importantly, paired brain and buccal samples from the same shipment were processed simultaneously (incl. DNAm profiling, see below). For brain samples, genomic DNA was extracted from approximately 50 mg of frozen tissue using the DNeasy Blood & Tissue Kit (Qiagen, Hilden, Germany), while DNA from the buccal swabs was extracted using Buccal-Prep Plus DNA Isolation Kit (Isohelix, UK). All steps in the extraction procedure were conducted according to manufacturer’s instructions. The quantity and the quality of obtained DNA were assessed using a NanoDrop ONE spectrophotometer (Thermo Fisher Scientific, USA).
EPIC array profiling
DNAm profiling was performed using the “Infinium MethylationEPIC” array (Illumina, Inc., USA), as described previously [8]. In brief, experiments were performed on aliquots of DNA extracts diluted to ~ 50 ng/µl concentration. Bisulphite conversion of DNA samples was performed using the EZ DNA Methylation kit (Zymo Research, USA), following the alternative incubation conditions for the Illumina Infinium MethylationEPIC Array as recommended by the supplier. After hybridization to the EPIC array, scanning was performed on an iScan instrument (Illumina, Inc.) according to the manufacturer’s instructions (Document#1000000077299v0). DNA samples from both shipment charges (MADRC-1 and MADRC-2) were processed in consecutive laboratory experiments to minimize batch effects. Raw DNAm intensities were determined using the iScan control software (v2.3.0.0; Illumina, Inc.) and exported in .idat format for downstream processing and analysis.
DNA methylation data processing and quality control
DNAm data from each batch (MADRC-1 and MADRC-2) and tissue (PFC and buccal) were loaded into R and pre-processed separately. DNAm data pre-processing and quality control (QC) was performed using the same procedures as described previously [8] unless noted otherwise. In brief, this entailed using the R (v. 3.6.1) package bigmelon with default settings [20]. Samples were excluded when bisulphite conversion efficiency was below 80%. Outliers were removed using the outlyx function in bigmelon applying a threshold of 0.15. CpG sites on the X or Y chromosome, or those aligning to SNPs [21] or multiple locations in the genome [22] were removed from the analysis. The final analysis included a total of 120 matched PFC and buccal samples, with 44 samples from MADRC-1, and 76 samples from MADRC-2. Overall, a total of 730,157 QC’ed CpG sites were available in all four datasets and were used for the analyses.
To compare our results with those from the Braun et al. study evaluating the correlation of DNAm between buccal and brain samples [17], we downloaded the publicly available .idat files of that study (GEO accession number GSE111165), and loaded and pre-processed them with the R-package ChAMP [23] using default settings unless otherwise noted. Brain and buccal samples were loaded and pre-processed separately. Briefly, DNAm values with a detection p-value above 0.01 were set to N/A and CpG sites were completely removed if there were less than 3 beads in more than 5% of the samples, if they were on an X or Y chromosome, or if they aligned to SNPs [21] or multiple locations in the genome [22]. Normalization was performed with the BMIQ method. The analysis of this dataset comprised 21 pairs of matched buccal and brain samples, with 740,507 CpG sites. We note that 1,513 (65%) of the 2,367 significantly correlated CpGs according to the Braun et al. study [17] were not included in our reanalysis of the dataset as they were removed during QC (Additional file 1: Table 2). Almost 50% of excluded CpGs (n = 931) were removed from our re-analysis of the data due to aligning to or being influenced by SNPs according to Zhou et al. [21], while ~ 350 CpG sites (23%) were removed due to their location on the X- or Y-chromosome. Despite these differences, we note that the vast majority (713; 83%) of the 854 remaining correlated CpG sites from Braun et al. [17] were also significantly correlated in our re-analysis of the Braun et al. data after multiple testing adjustment using a false discovery rate (FDR) q-value threshold of 0.05 (Additional file 1: Table 2).
Determination of and correction for DNAm covariates
First, cell-type composition estimates were obtained with the R package EpiDISH [14] for buccal samples and the estimateCellCounts function in the R package Minfi [24] for brain samples. Next, to assess the effects of potential confounders on the DNAm data, we used an adaptation of the singular value decomposition (SVD) approach described previously [25]. In short, SVD attempts to identify and correct for relevant variables that have a significant impact on genome-wide DNAm patterns and could act as confounders in subsequent analyses. Accordingly, we tested whether variation in cell type composition, bisulphite conversion efficiency, EPIC array ID, diagnosis, extraction date, and position on the EPIC array significantly associated with the variance in DNAm data in our data (as determined by principal component analysis [PCA], see below). These analyses were performed separately for buccal and brain datasets. To this end, we performed a PCA on the DNAm beta values after QC using the R base function prcomp. For this PCA, we first generated a subset of uncorrelated CpG sites by dividing the genome into 100 kb bins and using one random CpG site from each bin, resulting in 25,746 CpG sites included in each PCA. For the determination of relevant covariates for subsequent analyses, PCs explaining a substantial amount of variance in the DNAm data, as determined by scree plots (MADRC: Additional file 2: Fig. 1; Braun et al. data: Additional file 2: Fig. 2) were used. For numerical variables (bisulphite conversion efficiency and cell type composition estimates), a Pearson correlation test between the centred variables and the centred DNAm PCs was calculated with the R base function cor.test. For categorical variables (extraction date, EPIC array ID, position on the EPIC array, and diagnosis), a one-way ANOVA between the covariates and the DNAm PCs was performed with the R base function aov. Effects of variables explaining variance of at least one included DNAm PC with a p-value < 0.01 were removed from the DNAm beta values using the removeBatchEffect function in the R package limma [26]. The results of these analyses, as well as the number of DNAm PC eigenvalues (PCs) included for each dataset, can be found in Additional file 1: Table 3. The covariate-adjusted DNAm beta values of the two batches of PFC samples and buccal samples were combined in a “brain” and “buccal” data matrix, respectively. Lastly, the batch-defining variable (i.e. indicating either dataset MADRC-1 or MADRC-2) was removed from these combined matrices with the removeBatchEffect function in the R package limma [26]. All subsequent analyses were performed on these combined DNAm values adjusted for both covariates and batch.
To ensure that our results were not impacted by confounders not included in our adaptation of the SVD method described above, we repeated our analyses by correcting the DNAm-values directly for the first three DNAm PCs. We used the DNAm PCs as described above and corrected the DNAm-values using the removeBatchEffect function in the R package limma [26].
Identification of CpG sites with correlated DNAm values between paired brain and buccal samples
Spearman rank correlations were calculated for each CpG site in a pair-wise manner across buccal and PFC samples using the R base function cor.test. The resulting p-values were adjusted for multiple testing using the FDR approach with the R base function p.adjust. FDR q-values < 0.05 were considered genome-wide significant in the context of this study. These analyses were performed for the matrices with the two batches (MADRC-1 and MADRC-2) combined, and for each batch separately. Our choice of using nonparametric Spearman rank (as opposed to parametric Pearson’s correlation analyses) was motivated by the fact that neither the buccal nor the brain derived DNAm data were normally distributed, and computation of Pearson’s r assumes linearity of the correlation, an assumption that is not necessarily fulfilled here in all instances.
Identification of mQTLs in buccal and blood samples from an independent dataset
To check for buccal-specific mQTLs, we used an independent in-house dataset with 837 buccal samples and 1,058 blood samples ascertained from the Berlin Aging Study II (BASE-II) [27, 28]. These data are referenced here as “unpublished data”, since a manuscript from our group with more details on this analysis is in preparation. In brief, for the buccal dataset both genome-wide QC’ed DNAm profiles (761,034 CpG-sites) and SNP genotyping data (7,663,257 SNPs) were available for mQTL analysis. DNAm data were derived from buccal-swab samples and generated and processed using the same procedures described above. For the blood dataset, genome-wide QC’ed DNAm profiles (763,828 CpG-sites) and SNP genotyping data (7,663,257 SNPs) were available for mQTL analysis. DNAm data was derived from blood samples which were extracted using commercial kits (Plus XL manual kit, LGC, UK). Genome-wide SNP genotyping data were generated from the same samples using the “Global Screening Array” (GSA) with shared custom content (Illumina, Inc.) using procedures outlined in Hong et al. [29]. To compute cis mQTLs (defined as within ± 1 Mb of the CpG site) in this dataset we used the matrix eQTL software [30], which performed an additive linear model with sex, genetic PC 1 to 5, DNAm PCs 1 to 10, and genotyping batch as covariates. Before association analysis, genome-wide DNAm profiles were adjusted for cell type composition estimates. Only the DNAm and SNP effects that were below a p-value of 5.00E-02 were reported. Cis mQTLs with FDR q < 0.05 were defined as genome-wide significant for this arm of our analyses. Enrichment analyses for mQTLs within CpG sites correlated between PFC and buccal-swab samples were performed with the R base function chisq.test, using a subset of uncorrelated CpG sites according to 100 kb bins, as described above for the PCA.
Annotation of genomic regions to CpG sites
To assess whether there was a significant enrichment or depletion of CpG sites located in specific genomic regions, we used the annotation from the R package IlluminaHumanMethylationEPICmanifest to assign CpG sites to one of the following genomic regions: 1st exon, 3’ untranslated region (UTR), 5’-UTR, gene body, exon boundary, intergenic region (IGR), the region from transcription start site (TSS) to 200 nucleotides upstream (TSS200), and the region from 200 nucleotides upstream of the TSS to 1500 nucleotides upstream (TSS1500). Enrichment analyses were performed with the R base function chisq.test.
Gene ontology (GO) analysis
To further characterize the correlated CpG sites, a Gene Ontology (GO) enrichment analysis was performed with the gometh function in the R package missMethyl [31] using the significantly correlated (FDR q < 0.05) CpGs between PFC and buccal samples. We hypothesized that correlated CpG sites between buccal and brain might show an enrichment for “housekeeping” functions, which would explain the correlated DNAm-values. Nominally significant GO terms were subsequently submitted to the REViGO tool (http://revigo.irb.hr/) [32] to identify and remove redundancy using Resnik’s measure while allowing a terms similarity of 0.7.