Overview of scanned mQTLs in different datasets
See Methods section and Additional file 4: Table S1 for a description of datasets and generation of array data used in this study. Briefly, four datasets each containing non-Latino Whites (NLW), and Latinos (LAT) were included. Each dataset has both DNA methylation data (Illumina 450 K or EPIC arrays) and a genome scan for SNPs (Illumina Omni Express or Affymetrix Precision Medicine Diversity Array). See Methods section for scanning and detection of mQTLs in each dataset by ethnicity. For datasets assessed on the 450 K methylation array (Set 1 NLW, Set 1 LAT, Set 2 NLW, and Set 2 LAT), an average of 15.35% CpGs have a matched mQTL at adjusted significance level (P < 0.05), while for datasets assessed on the EPIC methylation array (Set 3 NLW, Set 3 LAT, Set 4 NLW, and Set 4 LAT), an average of 22.95% CpGs are matched to mQTLs with adjusted P values < 0.05 as its mQTL (Fig. 1A). The larger proportions of mQTLs identified in EPIC arrays are likely due to the larger sample size (n = 971) than in sets run on 450 K arrays (n = 667). Interestingly, proportions of mQTL-matched CpGs differ by ethnicity. In all four datasets, LAT consistently have a higher proportion of mQTL-matched CpGs comparing to NLW (Fig. 1A), likely due to differences in statistical power. More specifically, sample sizes of LAT subjects were larger than NLW in all four datasets, and CpG-SNP pairs with similar effect sizes could have a higher statistical significance in the LAT cohort because of their larger sample sizes. To investigate if this was the case, we meta-analyzed mQTLs of each ethnicity (NLW: Set 1 NLW, Set 2 NLW, Set 3 NLW, Set 4 NLW; LAT: Set 1 LAT, Set 2 LAT, Set 3 LAT, Set 4 LAT), and for shared CpG-SNP pairs between these ethnicities, we compared effect sizes and P values of mQTL effects (Additional file 3: Figure S1). As a result, effect sizes were similar, but P values were stronger in Latino subjects, suggesting that the discrepancy was mostly due to sample size differences.
We also examined whether a CpG tends to be paired with the same SNP, or SNPs in linkage disequilibrium (LD) (R2 > 0.5), across different datasets. We made 8 pair-wise comparisons for datasets assessed on the same DNA methylation platform (Additional file 4: Table S2). For 450 K arrays, SNPs matched to the same CpG across dataset have a higher probability to be in LD with each other than identical. However, for EPIC array, SNPs tend to be identical rather than in LD (Fig. 1B). All overlaps of shared CpG-SNP pairs on either 450 K array or EPIC array datasets are shown in Additional file 3: Figure S2.
Generation of CpG-mQTL pair databases by DNA methylation arrays
We then combined results from multiple datasets to create an mQTL database for 450 K and EPIC arrays separately, which include CpG-SNP pairs with the strongest associations, which can be used as mQTL covariates in subsequent EWAS analyses. These mQTL databases are used subsequently to account for genetic effects in epigenome-wide association analysis. More specifically, we combined mQTL scanning from Set 1 NLW and LAT, Set 2 NLW and LAT datasets to create a database for the 450 K array, while Set 3 NLW and LAT, Set 4 NLW and LAT datasets were combined to create a database for the EPIC array.
The following combining scheme for each platform was adopted: For each CpG, all SNPs identified as mQTLs in each dataset were recorded, sorted by P value, which was computed by meta-analyzing effect sizes from different datasets (between both ethnic groups, and between both datasets) weighted by inverse of standard errors using the “SCHEME STDERR” mode of METAL [10]. As one CpG could be matched to different mQTLs in different datasets, there could be several CpG-SNP pairs for a particular CpG. As a result, on the 450 K array, 243,450 such CpG-SNP pairs were identified for a total of 150,333 CpGs. On the EPIC array, 630,971 CpG-SNP pairs were in the database for a total of 358,325 CpGs.
Genome-wide distribution of mQTL
We next investigated the location of CpGs with the significant CpG-SNP correlation from each array, and whether they have a higher chance of localizing within regions that play a key biological role.
On the 450 K array, CpGs with matched mQTLs had a higher probability to be located in northern Shore (N_shore), southern shore (S_shore), and OpenSea regions, comparing to genome-wide CpG distributions. On the EPIC array, mQTL-matched CpGs were less likely to be in the OpenSea region, but enriched in all other regions (Fig. 2A, 2B).
Interestingly, on both arrays, CpGs with matched mQTLs are significantly enriched in CpG island shore regions (either the N_shore or S_shore) comparing to whole-genome distributions (chi-squared test P values < 2.2 × 10–16 on both arrays) (Additional file 3: Figure S3).
We also conducted enrichment analysis to investigate if CpGs matched to mQTLs are more or less likely to be in regulatory regions. For transcription factors, we gathered all available TFs in the Encyclopedia of DNA Elements (ENCODE) [11] ChiP-seq database for (N = 161) in 91 cell types combined; interestingly, 149 were significantly less enriched on the 450 K array while no TF site is more enriched. On the EPIC array, 148 were significantly less enriched, while 8 were significantly more enriched. Among the available CpG sites on the arrays we can conclude that TF binding sites are less likely to harbor mQTLs. For histone modification sites, H3K4me3 sites were significantly less enriched for CpGs matched to mQTLs both on 450 K array (fold enrichment = 0.934, P value = 1.27 × 10–80) and EPIC array (fold change = 0.654, P value < 1.00 × 10–160). On the contrary, H3K27me3 sites were significantly more enriched for CpGs matched to mQTLs both on 450 K array (fold enrichment = 1.108, P value = 1.414 × 10–85) and EPIC array (fold change = 1.0395, P value = 6.6476 × 10–22). Lastly, in terms of enhancer regions for three HSC cell lines, CpG sites in enhancer regions were less enriched for both 450 K and EPIC arrays (450 K array, fold change = 0.895, P value = 1.3563 × 10–122; EPIC array, fold change = 0.8022, P value < 1.00 × 10–160).
Comparison with published cis and trans mQTL effects
Min et al. [12] reported a database of in cis mQTLs using large cohort of subjects of NLW ethnicity. We compared both the effect sizes and P values of mQTLs from our databases with Min et al.’s (Additional file 3: Figure S4). In all datasets and in both NLW and LAT populations, there is very low consistency in terms of both effect sizes and P values.
It was reported that in cis mQTLs have much bigger effect sizes than in trans mQTLs [12]. While trans mQTLs should be relevant for our purposes, we were underpowered to capture trans mQTL effects. Therefore, we limited our mQTL scanning to in cis SNP-CpG pairs. Nevertheless, we still investigated if we could leverage published trans mQTL databases to account for in trans genetic effects. Recently, studies were published discussing trans-mQTL effects in large cohorts [12, 13]; however, most of these studies focus on subjects of non-Latino white populations, and it is the applicability of trans-mQTL databases across ethnicities is not validated.
We aimed to test if trans-mQTLs published by Min et al. could be replicated on our Latino and non-Latino White subjects. To achieve this, we extracted 46,148 CpG along with the most significant SNP matched to each CpG from their analysis within our dataset and tested whether the same mQTL effects were present in our dataset. We were able to test 43,464 such CpG-SNP pairs from Set 1 (NLW and LAT), 43,386 from Set 2 (NLW and LAT), 40,247 from Set 3 (NLW and LAT), and 40,360 pairs from Set 4 (NLW and LAT). The replicability of trans-mQTL effects was, however, poor in all NLW and LAT datasets, both in terms of regression coefficients and significance. Additional file 3: Figure S5 shows Set 1 NLW, Set 1 LAT, Set 2 NLW, and Set 2 LAT as examples. We therefore do not consider trans mQTLs effects in the current manuscript.
EWAS reveals significant CpGs associated with birthweight
Birthweight has been reported to have significant associations with neonatal DNA methylation [7, 14, 15]. However, previous reports did not take into consideration possible effects from genetics. To address this, using the same dataset where mQTL scanning was performed, we conducted an EWAS analysis to investigate the correlation between neonatal DNA methylation and birthweight, while accounting for platform-specific mQTLs. This analysis was performed in all four datasets separately, and meta-analysis was conducted for each array (450 K and EPIC separately). We also conducted the same regression models without accounting for SNPs for comparison.
On the 450 K array
In the 450 K array dataset (n = 667), using EWAS models accounting for mQTL effects (mQTL-model), we discovered a total of 33 CpGs significantly associated with birthweight after Bonferroni correction, of which 19 CpGs were previously associated with birthweight [7], for example, in the ARID family genes ARID3A and ARID5B (Additional file 4: Table S3), and 14 CpGs were novel associations. Of the 33 significant CpGs, 15 (45.45%) were corrected for significant mQTLs in the model, and three of these CpGs were located in the transcription start site region of the genes.
We also conducted the same EWAS without accounting for mQTL effects in Sets 1 and 2 (non-mQTL-models), followed by the same meta-analysis like that of models including mQTLs. As a result, none of these hits would have been identified if we did not account for mQTL effects, as only 3 CpGs were significant when mQTLs were not included in the model. This suggested the importance of taking genetic effects into consideration in epigenome-wide association analysis.
We investigated how controlling for mQTL’s genetic effects changed results of our EWAS models. We were able to compare a total of 123,012 CpGs which were matched to mQTLs. Their matched mQTLs’ effects were accounted for in mQTL-models but not in non-mQTL-models. While regression coefficients from EWAS models with or without controlling for mQTLs were in general similar (Fig. 3A), however, inconsistencies were also seen for some CpGs, as for these CpGs, controlling for mQTLs affected EWAS effect sizes to a large extent. However, the regression coefficients for the 15 significant mQTL-matched CpGs seemed to be similar with or without controlling for mQTLs (Fig. 3A).
In terms of P values, controlling for genetic effects also affected significance for many CpGs, to a much greater extent than regression coefficients (Fig. 3B). This included the 15 significant mQTL-matched CpGs from the model, and the majority of these significant CpGs became more significant after adjusting for mQTL effects. However, opposite from that of the significant hits, the majority of 450 K CpGs became less significant after adjusting for SNP effects. As shown in Fig. 3B, the majority (80,384 out of 122,997, 65.35%) of non-significant hits (shown in turquoise) were below the y = x line.
The 450 K results suggest that, in general, controlling for genetic effects can help to identify additional CpG loci that are associated with variation in birthweight.
We also conducted an enrichment test on the 26 genes overlapping significant CpGs identified in the association analysis using Gene Set Enrichment Analysis (GSEA) [16, 17] (Additional file 4: Table S5). “Early-TGFB1 signature” gene set was identified to be strongly enriched (FDR q value = 1.92 × 10–2), among other sets.
On the EPIC array
There has not been a report of large-scale multi-cohort birthweight EWAS on the EPIC array as was performed previously for the 450 K array [7]. In the larger EPIC array sample set (n = 971), we identified 3,294 significant CpGs after Bonferroni correction associated with birthweight (Additional file 4: Table S4) from mQTL-model, many of which were not available in the 450 K platform (2112 out of 3294, 64.12%). For example, one of the top hit CpGs cg09797037 in EXOSC10 (P = 8.42 × 10–31, direction: + +) is significantly associated with birthweight in our EPIC array data; however, Küpers et al. did not identify this gene in their multi-cohort meta-analysis.
In our EWAS results, 2004 (60.84%) of the significant hits were corrected for mQTL’s genetic effects, at a CpG-SNP mQTL cutoff adjusted P value of 0.05. 425 (12.90%) of these CpGs are annotated as transcription starting region (TSS) in the Illumina’s EPIC methylation arrays annotation [18].
Similar to 450 K, we repeated the same EWAS model in both Sets 3 and 4 without controlling for mQTL’s effects (non-mQTL-models). Running the same pipeline (meta-analysis, and multiple correction using Bonferroni), 2216 of significant hits from models including mQTLs as covariates would not have been identified if mQTL effects were not adjusted, accounting for 67.27% of all the significant hits. Similar to that of the 450 K array, this illustrates that adjusting for mQTLs in this EWAS model significantly affected the fundamental landscape of results.
Adding the mQTLs as additional variables also did not alter regression effect sizes appreciably for the 3294 significant CpGs, similarly to the result with the 450 K (Fig. 4A). P values were also significantly different after controlling for mQTL, increasing the significance for most significant hits, further suggesting the importance of controlling for genetic effects when conducting EWAS analysis (Fig. 4B). For the majority of all EPIC CpGs, after adjusting for SNP effects, the majority (154,489 out of 294,446, 52.47%) also became more significant (Fig. 4B), opposite from the trend observed on the 450 K array.
We also tested for model fit to test for the effect of mQTL using partial F-tests. For CpGs matched to mQTLs, full model refers to mQTL-model, and reduced model refers to non-mQTL-model (see Methods section for equations of these models). In Set 4, for example, 199,625 out of 296,157 (67.41%) models have a significant partial F-test P value. This illustrated for a majority of CpGs, after controlling for matched mQTLs, model fit significantly improved.
GSEA was also performed using top 50 genes from the EWAS results similar to that of 450 K (Additional file 4: Table S6). Enriched gene sets include tissue maturation (FDR q value = 3.87 × 10–04) and abnormal inflammatory response (FDR q value = 3.87 × 10–04).
Shared CpGs
For shared, or in-common CpGs (CpGs that are on both 450 K and EPIC array), we controlled for mQTLs for in-common CpGs described in the method section for each dataset and meta-analyzed all four datasets. Results were similar: controlling for mQTL effects seemed to increase significance of birthweight effects (Additional file 3: Figure S6).
Investigating CpGs whose regression coefficients were significantly affected by mQTLs
On the 450 K array
We found that after correcting for mQTLs, regression coefficients of some CpGs changed more significantly than others, suggesting that the association between these CpGs and birthweight was more heavily distorted by mQTLs. Taking genetic effects into consideration is of vital importance for this group of CpGs.
Among all the significant CpGs in the association analysis after FDR correction, there are in total 93 CpGs whose effect sizes changed more than 20% in either direction after correcting for mQTL effects. Pathway analyses suggest these CpGs are enriched in cell proliferation and energy related pathways. For example, KEGG of these CpGs showed top pathways include glycerophospholipid metabolism (KEGG ID 00,564, adjusted P value = 1.08 × 10–80) and GnRH signaling pathway (KEGG ID 004,912, adjusted P value = 1.08 × 10–80) among top pathways (Additional file 4: Table S7).
On the EPIC array
The EPIC array was able to identify 1711 CpGs with significantly changed effect sizes after correcting for mQTL, using the same filtering as that of 450 K. Pathway analyses also revealed that these CpGs are enriched in energy and metabolism functions. In GO pathway analysis, ATPase regulator activity (GO ID 0060590, adjusted P value = 3.20 × 10–08) and chemokine activity (GO ID 0045600, adjusted P value = 0.044) are among significantly enriched pathways (Additional file 4: Table S8).
These results suggested that, for CpGs overlapping with genes in pathways associated with birthweight, including biosynthesis, metabolism, and leukemia (high birthweight was a risk factor for leukemia [19]), their association analyses with birthweight were distorted heavily by genetic effects.
Correcting for maternal weight gain as an additional covariate did not significantly change EWAS results
It has been reported before that excessive maternal gain was associated with birthweight [20,21,22]. Therefore, maternal weight gain during pregnancy could also influence birthweight EWAS models. As a sensitivity analysis, we controlled for maternal weight gain as an additional covariate. As a result, neither P values nor regression coefficients were significantly altered. (Set 4 results were shown as an example in Additional file 3: Figure S7.)