Epigenome-wide association study for lifetime estrogen exposure identifies an epigenetic signature associated with breast cancer risk

Background: It is well established that estrogens and other hormonal factors influence breast cancer susceptibility. We hypothesized that a woman’s total lifetime estrogen exposure leads to accumulated changes in DNA methylation, detectable in the blood, which could be used in risk assessment for breast cancer.
Methods: An estimated lifetime estrogen exposure (ELEE) model was defined using epidemiological data from EPIC-Italy (n = 31,864). An epigenome-wide association study (EWAS) of ELEE was performed using existing Illumina HumanMethylation450K Beadchip (HM450K) methylation data obtained from EPIC-Italy blood DNA samples (n = 216). A methylation index (MI) of ELEE based on 31 CpG sites was developed using HM450K data from EPIC-Italy and the Generations Study and evaluated for association with breast cancer risk in an independent dataset from the Generations Study (n = 440 incident breast cancer cases matched to 440 healthy controls) using targeted bisulfite sequencing. Lastly, a meta-analysis was conducted including three additional cohorts, consisting of 1187 case-control pairs in total.
Results: We observed an estimated 5% increase in breast cancer risk per 1-year longer ELEE (OR = 1.05, 95% CI 1.04–1.07, P = 3 × 10⁻¹²) in EPIC-Italy. The EWAS identified 694 CpG sites associated with ELEE (FDR Q < 0.05). We report a DNA methylation index (MI) associated with breast cancer risk that is validated in the Generations Study targeted bisulfite sequencing data (ORQ4_vs_Q1 = 1.77, 95% CI 1.07–2.93, P = 0.027) and in the meta-analysis (ORQ4_vs_Q1 = 1.45, 95% CI 1.05–2.00, P = 0.024); however, the correlation between the MI and ELEE was not validated across study cohorts.
Conclusion: We have identified a blood DNA methylation signature associated with breast cancer risk in this study. Further investigation is required to confirm the interaction between estrogen exposure and DNA methylation in the blood.
Electronic supplementary material The online version of this article (10.1186/s13148-019-0664-7) contains supplementary material, which is available to authorized users.


Background
Breast cancer is the most common women's cancer, with an estimated 1.67 million cases diagnosed globally in 2012 [1]. Its crude incidence is rising due to an aging population and population-level changes in reproductive and lifestyle factors that affect breast cancer risk [2]. Up to 23% of breast cancer cases are considered preventable by lifestyle changes, such as maintaining a healthy weight and reducing alcohol consumption [3]. There is a need for improved risk assessment methods to target prevention and early detection to women at increased risk.
It is well established that estrogens play a role in breast cancer etiology, and women with higher circulating estrogen concentrations have an increased risk of breast cancer [4][5][6]. Several hormonal breast cancer risk factors contribute to a woman's lifetime estrogen exposure. These include a younger age at menarche and an older age at menopause, which together define the reproductive span during which a woman is exposed to high levels of endogenous estrogens produced by the ovaries [7,8]. A greater number of pregnancies is associated with a long-term decrease in both estrogen exposure and breast cancer risk [9,10]. A small decrease in risk is seen for women who breastfeed for longer, and breastfeeding decreases the total lifetime estrogen exposure [11,12]. Exogenous hormones provided by oral contraceptives (OC) and hormone replacement therapy (HRT) increase risk during use, but risk returns to that for unexposed women 5-10 years after cessation [13][14][15][16][17]. Additionally, lifestyle risk factors for breast cancer such as higher postmenopausal body mass index (BMI), alcohol consumption, physical inactivity, and smoking can each affect circulating estrogen concentrations [4][18][19][20][21][22].
Epigenetic mechanisms such as DNA methylation control gene expression and may be influenced by environmental and lifestyle exposures. Epigenome-wide association studies (EWAS) of blood DNA methylation for breast cancer risk have identified associations with global hypomethylation and several candidate genes, but these have generally not been replicated across studies [23][24][25][26]. Numerous large EWAS have identified epigenetic signatures for smoking [27], alcohol consumption [28], BMI [29], and aging [30,31], and hypomethylation signatures associated with smoking can improve the prediction of lung cancer [32,33]. We propose that an EWAS of breast cancer risk factors may identify CpG sites that could be used in risk prediction models [23]. We hypothesize that estrogen exposures over the lifetime give rise to accumulated changes in DNA methylation, detectable in the blood, which might add useful information to breast cancer risk prediction.
The aims of this study were to identify a DNA methylation signature reflecting a woman's lifetime estrogen exposure and to assess the signature's association with breast cancer risk. There is no standard model to estimate a woman's total lifetime estrogen exposure, and numerous approaches have been used [34][35][36][37][38][39]. For example, the Pike model, which reflects "breast tissue aging" in relation to breast cancer risk, includes a woman's age at menarche, age at first full-term pregnancy, and age at menopause, modeling changes over time [38]. In this study, we have used an estimated lifetime estrogen exposure (ELEE) model that reflects the reproductive span, comprising a woman's time between age at menarche and age at menopause minus 1 year for each pregnancy and minus the duration of breastfeeding, calculated at the time of recruitment. We performed an EWAS and identified 694 CpG sites associated with ELEE. We then developed a methylation signature of ELEE that showed association with breast cancer risk and was further validated in a large independent study cohort using targeted bisulfite sequencing, and in a meta-analysis including three additional independent study cohorts.
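As an illustration of the ELEE definition above, the calculation can be sketched in a few lines of Python. This is a minimal sketch, not the study's code: the function name, argument names, and the assumption that breastfeeding duration is recorded in years are all hypothetical, and for premenopausal women the span is taken to end at recruitment, as described in the methods.

```python
from typing import Optional


def elee_years(age_at_menarche: float,
               age_at_recruitment: float,
               age_at_menopause: Optional[float],
               n_pregnancies: int,
               breastfeeding_years: float) -> float:
    """Estimated lifetime estrogen exposure in years (illustrative only).

    Assumption: breastfeeding duration is expressed in years.
    Pass age_at_menopause=None for a premenopausal woman.
    """
    # The reproductive span ends at menopause, or at recruitment if premenopausal.
    end_age = age_at_menopause if age_at_menopause is not None else age_at_recruitment
    reproductive_span = end_age - age_at_menarche
    # Subtract 1 year per pregnancy and the total breastfeeding duration.
    return reproductive_span - 1.0 * n_pregnancies - breastfeeding_years


# Postmenopausal example: menarche at 13, menopause at 50, two pregnancies,
# six months of breastfeeding in total.
print(elee_years(13, 60, 50, 2, 0.5))
```

For a premenopausal woman the same function is called with `age_at_menopause=None`, so the span is truncated at the age at recruitment.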

Study cohorts
Data from two independent prospective cohort studies were used for the primary analysis: the Italian cohort from the European Prospective Investigation into Cancer and Nutrition study (EPIC-Italy) and the UK-based Generations Study. Questionnaire data and blood samples were collected at the time of study enrollment. EPIC-Italy included epidemiological questionnaire data from 31,864 women (dataset 1) and peripheral blood DNA methylation data measured using the HM450K array for 162 matched pairs of incident breast cancer cases and controls (dataset 2) [24]. The Generations Study cohort included HM450K peripheral blood DNA array data for a subset of 92 healthy women (dataset 3) [40] and 440 matched pairs of incident breast cancer cases and controls (independent from the HM450K dataset), who provided blood DNA samples used for targeted bisulfite sequencing (dataset 4). Inclusion criteria for incident breast cancer cases in the Generations Study, with blood samples taken prior to diagnosis, were the following: invasive ER-positive breast cancer with no previous history of (non-breast) cancer, white ethnicity, and completeness of epidemiological data. Controls were individually matched to cases on age at blood draw ± 5 years. The replication cohorts used in the meta-analysis comprised an additional 118 case-control pairs from EPIC-Italy (dataset 5), 420 case-control pairs from EPIC-IARC (dataset 6) [41], and 310 case-control pairs from the MCCS (dataset 7) [42]. Further information for these study cohorts is provided in Additional file 1: Supplementary material and methods.

EWAS of ELEE
Different ELEE models including a woman's reproductive span (age at menopause minus age at menarche for postmenopausal women and age at recruitment minus age at menarche for premenopausal women), number of pregnancies, and breastfeeding duration were considered. The models were assessed for association with breast cancer risk in EPIC-Italy (n = 1193 cases and 30,671 controls, Additional file 2: Table S1) using age-adjusted Cox regression. The ELEE model selected for the EWAS included as many of these risk factors as possible without reducing the significance of the association with breast cancer for pre- and postmenopausal women. An EWAS with ELEE as the exposure and DNA methylation as the outcome was conducted for EPIC-Italy using a beta regression model on HM450K beta values. Subjects with missing information for ELEE were excluded (n = 87), as were cases with age at diagnosis < 50 (n = 21), to enrich for ER-positive disease, leaving 216 women for the EWAS (Additional file 2: Table S2). Potential confounders, known to influence methylation or estrogen levels, were adjusted for; these included age, BMI, alcohol consumption, and smoking duration, all reported at recruitment, as well as technical confounders including batch, position on chip, and white blood cell (WBC) composition [43]. Multiple testing was accounted for using false discovery rate (FDR) Q values computed with the R function "p.adjust." Beta regression coefficients are not interpretable as methylation percentage changes. Therefore, to obtain interpretable estimates for the significantly associated CpG sites, i.e., the percentage change in DNA methylation per unit longer ELEE, a linear mixed-effects regression model adjusted for the same variables, including random effects for batch and position on chip, was applied to beta values multiplied by 100.
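To make the multiple-testing step concrete, the following Python sketch computes Benjamini-Hochberg adjusted values, which is what R's "p.adjust" returns for method "BH" (or its alias "fdr"). The text does not state which "p.adjust" method was used, so the choice of Benjamini-Hochberg here is an assumption, and the function name is hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg FDR-adjusted values (assumed to mirror p.adjust 'BH')."""
    m = len(pvalues)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy example with four P values; a site would be reported if its
# adjusted value were below 0.05.
q_values = bh_adjust([0.01, 0.04, 0.03, 0.005])
print([round(q, 4) for q in q_values])
```

In an EWAS the input would be one P value per CpG probe, and probes with an adjusted value below 0.05 would be carried forward.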

Laboratory analysis
HM450K array data generation has been described previously [24][40][41][42]. For validation, targeted bisulfite sequencing was conducted in the Generations Study (n = 880) using the Fluidigm 48.48 Access Array. Forty-two CpG sites for validation were selected from the EWAS based on the magnitude of change in DNA methylation (> 0.1%) per 1-year longer ELEE (mixed-effects linear regression model coefficient) and statistical significance of association with ELEE (beta regression model P value < 7 × 10⁻⁵). To estimate WBC composition, five HM450K CpG probes that showed independent correlation with five different WBC types were included in the target panel (Additional file 3: Figure S1). The 880 Generations Study samples were sequenced on the Illumina MiSeq in 20 batches in four sequencing pools (6, 5, 5, and 4 batches). After quality control, two batches were rerun (batches 12 and 19) due to poor sequencing data, likely a result of decreased performance for the barcodes used for these batches. In the results, only the sequencing data from the new batches 12 and 19 are used, referred to as sequencing pool 5. DNA methylation levels were extracted using Bismark [44] and analyzed in R version 3.3.2.

Statistical analysis
Quality control (QC) of the targeted sequencing data was conducted to exclude, from the 42 target CpG sites, those with low coverage or with a large difference in DNA methylation levels between duplicated pairs (n = 31 CpG sites passing quality control). A methylation index (MI) of ELEE was developed using the same 31 CpG sites in the HM450K data with complete information for ELEE from both EPIC-Italy (dataset 2, n = 237) and the Generations Study (dataset 3, n = 65, Additional file 2: Table S2). To develop the MI, ridge regression was conducted in a 10-fold cross-validation repeated 100 times, using the R package "glmnet" through the "train" function in the R package "caret", to estimate the following parameters: the penalty coefficient (lambda) for ridge regression and the regression coefficients for the model. The final MI model for predicted ELEE was calculated as a linear function, i.e., the intercept plus the sum of the DNA methylation levels at the CpG sites included in the model, each weighted by its coefficient. The correlation between the MI and ELEE was evaluated in the Generations Study targeted sequencing data (dataset 4, pairs with coverage > 30 sequence reads for at least 90% of the remaining CpG sites) using Pearson's correlation coefficient. The association between the MI and risk of breast cancer was assessed using conditional logistic regression (R function "clogit" in the package "survival") for matched case-control pairs in EPIC-Italy HM450K development data (dataset 2, n = 162 pairs), the Generations Study targeted sequencing validation data (dataset 4), and in each of the additional cohorts for the meta-analysis including EPIC-Italy (dataset 5, n = 118 pairs), EPIC-IARC (dataset 6, n = 420 pairs), and MCCS (dataset 7, n = 310 pairs). The meta-analysis of log odds ratios and standard errors was conducted using a weighted random-effects model, applying the restricted maximum likelihood method (function "rma.uni" in the R package "metafor").
Cochran's Q and I² statistics were used to estimate heterogeneity between the studies; a Q test P value < 0.05 (reported as Q) and I² > 50% were defined as heterogeneous estimations. If the heterogeneity I² was equal to 0, the meta-analysis behaved as a fixed-effects model. The ORs for all models were adjusted for baseline age, BMI, alcohol consumption, smoking duration, and estimated WBC composition in a multivariable model. Additionally, for each of the target CpG sites passing QC, the association with breast cancer risk was investigated using conditional logistic regression. Further details on methods and workflow are described in Additional file 1: Supplementary material and methods and Additional file 3: Figure S2.
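The pooling and heterogeneity statistics can be illustrated with a short Python sketch. Note that this is a simplified stand-in: it uses fixed-effect inverse-variance weighting rather than the restricted maximum likelihood random-effects model fitted with "rma.uni", which coincides with the fixed-effects result only when the estimated between-study variance is zero. The function names are hypothetical, and the helper that recovers a standard error from a 95% CI assumes the interval is symmetric on the log scale.

```python
import math


def se_from_ci(lower, upper, z=1.959964):
    """Standard error of a log odds ratio recovered from its 95% CI."""
    return (math.log(upper) - math.log(lower)) / (2 * z)


def pool_fixed(log_ors, ses):
    """Inverse-variance pooled log OR with Cochran's Q statistic and I² (%)."""
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, log_ors)) / total
    pooled_se = math.sqrt(1.0 / total)
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, log_ors))
    df = len(log_ors) - 1
    # I²: share of total variation attributed to heterogeneity, floored at 0.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, pooled_se, q, i2


# Two made-up studies, used only to show the arithmetic.
pooled, pooled_se, q, i2 = pool_fixed([0.0, 1.0], [0.5, 0.5])
print(math.exp(pooled), q, i2)
```

The Q returned here is the test statistic itself; the heterogeneity values quoted as "Q" in the results appear to be the corresponding P values.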

EWAS identifies CpG sites associated with ELEE in EPIC-Italy
The EWAS of ELEE conducted in EPIC-Italy (dataset 2, n = 216, Additional file 2: Table S2) gave P values that were moderately inflated (Additional file 3: Figure S3), with a genomic inflation factor lambda of 1.13. After correction for multiple testing, the EWAS identified 694 CpG probes associated with ELEE (FDR Q < 0.05, Fig. 1a) with a mix of hypo- and hypermethylated CpG probes (Fig. 1b). Two sensitivity analyses were conducted: firstly, including all cases and controls with complete information for ELEE (n = 237), and secondly, for controls only (n = 119). Estimates from each analysis were highly correlated with those from the main analysis (r = 0.997 and r = 0.963, respectively, Additional file 3: Figure S4). All 694 CpG probes were associated with ELEE (Q < 0.05) in the first analysis and 563 in the second analysis with controls only. Out of the 694 CpG probes associated with ELEE, CpG sites were selected for the targeted bisulfite sequencing. The selection was based on the largest magnitude of change in DNA methylation (> 0.1%) and statistical significance (P < 7 × 10⁻⁵). Furthermore, poorly performing assays were excluded before the sequencing in the Generations Study. A total of 42 CpG sites were included in the final target panel for targeted bisulfite sequencing in the Generations Study using the Fluidigm 48.48 Access Array (Table 1).
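For readers unfamiliar with the genomic inflation factor, the sketch below shows one common way to compute lambda from a vector of EWAS P values: convert each P value to a 1-degree-of-freedom chi-square statistic and divide the median by its expected value under the null (about 0.455). How lambda was computed in this study is not stated, so this is an assumed, conventional definition, and the function name is hypothetical.

```python
from statistics import NormalDist, median


def genomic_inflation(pvalues):
    """Genomic inflation factor lambda from two-sided P values in (0, 1]."""
    nd = NormalDist()
    # A two-sided P value p corresponds to a chi-square(1) statistic z**2,
    # where z is the standard normal quantile at p / 2.
    chi2 = [nd.inv_cdf(p / 2.0) ** 2 for p in pvalues]
    # Median of a chi-square distribution with 1 degree of freedom.
    expected_median = nd.inv_cdf(0.25) ** 2
    return median(chi2) / expected_median


# Made-up P values; a well-calibrated test gives lambda close to 1,
# and values above 1 indicate inflated test statistics.
print(genomic_inflation([0.1, 0.5, 0.9]))
print(genomic_inflation([0.01, 0.05, 0.2]))
```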

Quality control of targeted sequencing data prior to model development
Targeted bisulfite sequencing of 42 target regions using the Fluidigm 48.48 Access Array was conducted for 880 samples (440 matched case-control pairs) from the Generations Study (dataset 4). The targeted sequencing data was of high quality: average sequencing depth per CpG site per sample was 1740, > 97% of the reads were assigned to a sample (i.e., had a barcode sequence), and > 99% of the paired reads were aligned to target regions (Additional file 2: Table S4). Additionally, a high correlation in DNA methylation values was observed between batches, but with some variability in some of the assays (mean r = 0.88). Eleven CpG sites were excluded from the analysis due either to low coverage across batches or to high variation in DNA methylation levels between duplicated pairs. Out of the 880 samples, subjects were excluded due to incorrect case-control status (1 pair) or low coverage in > 10% of the remaining target CpG sites (n = 100 matched pairs where at least one sample of the pair had low coverage), leaving a total of 678 samples (339 matched case-control pairs) for the analyses. In the 328 matched case-control pairs in the Generations Study with coverage > 30 for all five WBC CpG sites, no difference (P > 0.05) in DNA methylation levels between cases and controls was observed (Additional file 3: Figure S1).

The MI is associated with breast cancer risk
A methylation index (MI) to predict ELEE was developed in HM450K data from EPIC-Italy (dataset 2, n = 237) and the Generations Study (dataset 3, n = 65, Additional file 2: Table S2) using ridge regression on the 31 target CpG sites passing QC in the targeted bisulfite sequencing data. As expected, the MI correlated with ELEE in the development data: the correlation was high in EPIC-Italy (r = 0.60, P = 6 × 10⁻²⁵) and moderate in the Generations Study HM450K data (r = 0.27, P = 0.027, Fig. 2a). The correlation between the MI and ELEE, however, was not replicated in the Generations Study targeted sequencing data (n = 678, r = −0.04, P = 0.340, Fig. 2b). We also observed no association between the ELEE and breast cancer risk in the Generations Study (n = 339 matched case-control pairs, age-adjusted odds ratio (OR) = 1.01, 95% CI 0.98-1.04, P = 0.562), in contrast to EPIC-Italy subjects from the HM450K dataset (n = 162 matched case-control pairs, age-adjusted OR = 1.10, 95% CI 1.03-1.17, P = 0.007). The correlations between the MI and ELEE were similar for pre- and postmenopausal women in the Generations Study, but stronger for postmenopausal women (r = 0.72, P = 2 × 10⁻²²) than for premenopausal women (r = 0.53, P = 7 × 10⁻⁷) in EPIC-Italy (test for heterogeneity between pre- and postmenopausal women in EPIC-Italy: Q = 0.79, I² = 0 (no heterogeneity); Generations Study: Q = 0.42, I² = 0 (no heterogeneity)). The association between the MI and breast cancer risk was tested using matched case-control pairs from EPIC-Italy HM450K data and the Generations Study targeted sequencing data (n = 162 and 339 pairs, respectively, Table 2). In a multivariable model, each unit increase in the MI, which ranged from 27.6 to 39.1, was associated with a 51% increase in breast cancer risk in EPIC-Italy (OR = 1.51, 95% CI 1.26-1.82, P = 1 × 10⁻⁵, Fig. 2c).
The association was validated in the Generations Study targeted sequencing data: a comparable analysis of the Generations Study data gave an estimated 4% increase in risk per unit increase in MI, which ranged from 20.3 to 53.1 (OR = 1.04, 95% CI 1.01-1.08, P = 0.022, Fig. 2d). The OR estimate was greater for postmenopausal women in both EPIC-Italy and the Generations Study (OR = 1.91, 95% CI 1.29-2.82, P = 0.001 and OR = 1.07, 95% CI 1.02-1.12, P = 0.006, respectively) than for premenopausal women (OR = 1.61, 95% CI 1.17-2.22, P = 0.004 and OR = 1.01, 95% CI 0.94-1.09, P = 0.713, respectively). However, the test for heterogeneity between pre- and postmenopausal women in EPIC-Italy (Q = 0.19 and I² = 43%) and in the Generations Study (Q = 0.24 and I² = 27%) suggested no significant heterogeneity.
Sensitivity analysis on the model development was also conducted using controls only in the combined EPIC-Italy and the Generations Study HM450K data (datasets 2 and 3, n = 184). The control-only MI model showed a very similar association with breast cancer risk as previously when tested on the case-control pairs from the Generations Study (n = 339 pairs, OR = 1.04 per unit increase in MI, 95% CI 1.01-1.08, P = 0.012).
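The MI itself is a simple linear score: an intercept plus the coefficient-weighted sum of methylation levels at the selected CpG sites. The Python sketch below illustrates that calculation only. The CpG identifiers, intercept, and coefficients are made-up placeholders, not the fitted ridge regression values, which are not given in this text.

```python
def methylation_index(intercept, coefficients, methylation):
    """Linear score: intercept + sum(coefficient * methylation level).

    `coefficients` and `methylation` are dicts keyed by CpG identifier.
    A KeyError is raised if a required CpG site is missing for a sample.
    """
    return intercept + sum(
        coefficients[cpg] * methylation[cpg] for cpg in coefficients
    )


# Hypothetical two-site model, for illustration only.
example_coefficients = {"cpg_a": 0.5, "cpg_b": -0.25}
example_sample = {"cpg_a": 80.0, "cpg_b": 40.0}  # methylation in percent
print(methylation_index(30.0, example_coefficients, example_sample))
```

Because the score is linear, a sample with missing coverage at any model CpG site cannot be scored without imputation, which is consistent with the coverage-based exclusions described in the quality control section.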

Meta-analysis of the association between MI and breast cancer risk
The association between the MI and breast cancer risk was examined in a meta-analysis, excluding the discovery data from EPIC-Italy. The analysis included 2374 women (1187 matched case-control pairs) from four prospective study cohorts, the Generations Study targeted sequencing data, additional subjects from EPIC-Italy, EPIC-IARC, and MCCS, with mean time to diagnosis of 4.0, 8.5, 7.5, and 7.9 years, respectively (Additional file 2: Table S7). All estimates were adjusted for baseline age, BMI, smoking duration, alcohol consumption, and WBC composition.

Fig. 2 The MI is associated with breast cancer risk. The MI was developed in combined HM450K data from EPIC-Italy (dataset 2, n = 237) and the Generations Study (dataset 3, n = 65) using ridge regression. The correlation between the MI and ELEE and the association between the MI and breast cancer risk were evaluated. a The correlations between the MI and ELEE in the development of HM450K data were as follows: r = 0.60 and P = 6 × 10⁻²⁵ for EPIC-Italy and r = 0.27 and P = 0.027 for the Generations Study. b The MI and ELEE were not correlated in the Generations Study targeted sequencing data (r = −0.04, P = 0.340). c Density plot of the MI values in controls and cases in EPIC-Italy HM450K data. The MI was significantly associated with breast cancer risk in EPIC-Italy (n = 162 pairs, OR = 1.51, 95% CI 1.26-1.82, P = 1 × 10⁻⁵). d Density plot of the MI values in controls and cases in the Generations Study targeted sequencing data. The MI was significantly associated with breast cancer risk in the Generations Study (n = 339 pairs, OR = 1.04, 95% CI 1.01-1.08, P = 0.022). ORs were adjusted for age, BMI, alcohol consumption, and smoking duration (all variables reported at recruitment) and WBC composition.
The combined meta-analysis for MI as a continuous variable showed low heterogeneity across study cohorts (Q = 0.45, I² = 0%) and an association with breast cancer risk, with a 4% increase in risk per one-unit increase in the MI (OR = 1.04, 95% CI 1.00-1.07, P = 0.024, Fig. 3a). Women in the highest quartile of MI had a higher breast cancer risk than women in the lowest quartile (combined OR = 1.45, 95% CI 1.05-2.00, P = 0.024, Fig. 3b), with low heterogeneity (Q = 0.44, I² = 16%). There was no significant association between the MI and breast cancer risk in pairs with a shorter time to diagnosis (less than the median) in the combined meta-analysis (OR = 1.03, 95% CI 0.98-1.08, P = 0.241), but there was a significant association in pairs with time to diagnosis above the median (OR = 1.05, 95% CI 1.01-1.10, P = 0.021) (Additional file 2: Table S8). Lastly, the MI did not correlate with the ELEE in any of these four study cohorts (Additional file 2: Table S9).
To explore non-linearity in the association between the MI and breast cancer risk, the MI was stratified into quartiles defined by the distribution in controls in each cohort. A higher breast cancer risk was observed for women in the highest quartile compared with those in the lowest quartile in EPIC-Italy (ORQ4_vs_Q1 = 5.45, 95% CI 2.17-13.67, P = 3 × 10⁻⁴) and in the Generations Study (ORQ4_vs_Q1 = 1.77, 95% CI 1.07-2.93, P = 0.027, Additional file 2: Table S5), but not in the three additional cohorts. In the meta-analysis, excluding the EPIC-Italy development data, increased breast cancer risk was observed for women in the highest quartile compared with the lowest quartile (OR = 1.45, 95% CI 1.05-2.00, P = 0.024), with modest heterogeneity (Q = 0.44; I² = 16%, Additional file 2: Table S5).
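The quartile stratification can be sketched as follows: cut points are taken from the MI distribution in controls only, and every subject (case or control) is then assigned to a quartile using those cut points. This is an illustrative Python sketch with hypothetical function names. The quantile interpolation rule used in the study is not stated, so the default "exclusive" method of Python's `statistics.quantiles` is an assumption, as is the convention that a value equal to a cut point falls in the upper quartile.

```python
from bisect import bisect_right
from statistics import quantiles


def quartile_cutpoints(control_values):
    """Three cut points (25th, 50th, 75th percentiles) from controls only."""
    return quantiles(control_values, n=4)


def assign_quartile(value, cutpoints):
    """Quartile 1-4 for a subject's MI value, given control-based cut points."""
    # Values equal to a cut point are placed in the upper quartile.
    return bisect_right(cutpoints, value) + 1


# Made-up MI values for seven controls.
cuts = quartile_cutpoints([1, 2, 3, 4, 5, 6, 7])
print(cuts)
print([assign_quartile(v, cuts) for v in (1, 3, 5, 7)])
```

Defining the cut points in controls keeps the reference distribution independent of case status, so the Q4 versus Q1 odds ratio compares cases against the population distribution of the MI.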
Reverse causation would be indicated if the association between MI and breast cancer risk was higher in cases with a short time to diagnosis. To explore this, we investigated the association between the MI and breast cancer risk stratified by median time to diagnosis in cases. In the EPIC-Italy development data, and in the meta-analysis, the association between the MI and breast cancer risk appeared to be stronger with longer time to diagnosis. In EPIC-Italy, the association between the MI and breast cancer risk was significant for both groups but with a higher OR for pairs with a longer time to diagnosis (n = 81 pairs in both groups, OR = 1.47, 95% CI 1.12-1.93, P = 0.005 vs OR = 1.84, 95% CI 1.30-2.61, P = 0.001; Fig. 4a, b). In the meta-analysis, for pairs with shorter time to diagnosis, there was no significant association with breast cancer risk (n = 721 pairs, OR = 1.03, 95% CI 0.98-1.08, P = 0.241, Fig. 4c); however, the MI was associated with breast cancer risk for pairs with a time to diagnosis above the median (n = 804 pairs, OR = 1.05, 95% CI 1.01-1.10, P = 0.021, Fig. 4d). Therefore, the data do not support reverse causation as a mechanism for the association with DNA methylation.

Fig. 3 Meta-analysis of the association between MI and breast cancer risk. The association between MI and risk for breast cancer, as a continuous variable (a) or as a categorical variable (b), was estimated in the four studies included in the meta-analysis using conditional logistic regression adjusted for age, BMI, alcohol consumption, and smoking duration (all variables reported at recruitment) and WBC composition. The log odds ratios were combined in a meta-analysis using restricted-maximum likelihood model. The square boxes represent the odds ratios (ORs) and the lines the 95% confidence intervals (CIs). a EPIC-Italy corresponds to the new EPIC-Italy samples, not included in the development of the MI.
Individual associations between each of the 31 target CpG sites and breast cancer risk were further examined in the Generations Study (dataset 4). Four CpG sites located nearest the genes CTNNA2, GRB10, RPH3AL, and TINCR showed individual associations with breast cancer risk in a multivariable model (P < 0.05, Additional file 2: Table S6) and were also associated with breast cancer risk in matched case-control pairs in EPIC-Italy (dataset 2).

Discussion
In this study, we have performed an EWAS of lifetime estrogen exposure using the HM450K array and identified 694 CpG sites (FDR Q < 0.05) associated with ELEE in the EPIC-Italy study cohort. In addition, we have conducted a validation step in a case-control study nested within a large independent cohort, the Generations Study, using targeted bisulfite sequencing. We have developed a methylation index (MI) to predict ELEE using DNA methylation levels at 31 CpG sites and tested the MI for association with breast cancer risk. Although the MI did not correlate with ELEE in the validation cohorts, it was associated with breast cancer risk. Women in the highest quartile of the MI in the Generations Study had a 77% higher risk of breast cancer compared with women in the lowest quartile (Additional file 2: Table S5). In the meta-analysis, including three other independent datasets, the highest quartile had a 45% higher risk compared with the lowest quartile (Fig. 3).
There are several potential explanations for why the correlation between the MI and ELEE was only seen in the discovery EPIC-Italy dataset and not in the additional validation cohorts. Firstly, it is possible that the observed association between MI and ELEE in EPIC-Italy is a false positive and that the MI was overfitted in this study cohort. It is also possible that the measured methylation index captures the biological effects of unknown confounders that are not included in the calculated ELEE model. Alternatively, population-specific differences between the cohorts (or between sub-cohorts of EPIC-Italy) cannot be excluded. For example, there are more smokers in the EPIC studies compared with the Generations Study and MCCS, and smoking considerably affects DNA methylation. Also, there are different patterns in breastfeeding, number of pregnancies, and OC and HRT use across the studies, which may all affect the total lifetime estrogen exposure and DNA methylation. With the current evidence, we conclude that the methylation index developed does not directly predict ELEE.
We observed no evidence for reverse causation: the association between the MI and breast cancer risk was stronger for cases with a longer time to diagnosis (Fig. 4). In the meta-analysis, the association with breast cancer risk was only significant for the cases with a time to diagnosis greater than the median in the continuous MI model (OR = 1.05, 95% CI 1.01-1.10, P = 0.021, Additional file 2: Table S8). More work is needed using longitudinal studies to understand the dynamics of this MI over time.
In this study, we used an ELEE model based on a woman's age at recruitment (premenopausal women) or age at menopause (postmenopausal women), age at menarche, number of pregnancies, and breastfeeding duration. The model does not include all the variables that can affect estrogen exposure, for example, menstrual cycle regularity, pregnancies ending in miscarriage or abortion, and current use of HRT or OC. Hormonal risk factors are difficult to assess comprehensively via questionnaires; for example, the short-term outcome of being pregnant is increased estrogen levels and breast cancer risk, but in the long term, the estrogen levels and risk are reduced compared with nulliparous women [9,10,45]. Subtracting 1 year for each pregnancy instead of 9 months for each pregnancy did not materially change the results and might reflect the lifetime estrogen exposure better because there is a delay of 3 months on average before ovulatory cycling resumes. OC and HRT use is associated with elevated risk in current users, but it is not known how it contributes to the lifetime estrogen exposure or DNA methylation. For this reason, we have not included OC and HRT use in our ELEE model. Other hormonal exposures accompanied by estrogens, such as progesterones, were not considered in this study. We acknowledge that the ELEE model is not a perfect model for cumulative estrogen exposure; however, it was hypothesized that if a DNA methylation signature could be identified as an intermediate biological phenotype for the exposure, this might be a more accurate measurement than questionnaire-based estimates.
We acknowledge the small sample size of the discovery EWAS in EPIC-Italy (n = 216) and potential false positive hits due to artefactual inflation of test statistics. We attempted to reduce the likelihood of false positive associations by correcting for multiple testing and restricting the MI signature to CpGs showing the largest changes in DNA methylation. Replication and validation are important steps to identify valid DNA methylation biomarkers. For validation of the MI and breast cancer risk, we used an independent method, targeted bisulfite sequencing, on a case-control study nested within a large independent study cohort (the Generations Study, n = 880) and a meta-analysis across four independent study cohorts (n = 2374). Another limitation that we identified was that not all 31 CpG sites are present on the updated Illumina HumanMethylation EPIC (850K) array, which precludes the possibility of including 850K studies in this analysis without changing the model. Breast cancer risk assessment needs further improvement to be able to identify women at low or high risk of developing breast cancer that would warrant a preventive intervention. It remains to be explored if epigenetic signatures, in combination with other existing risk models, polygenic risk scores and breast density measurements, will improve breast cancer risk prediction and stratification. Furthermore, blood sampling is an accessible and less invasive method that is relatively easy to include into population screening. Targeted prevention approaches, including chemoprevention or lifestyle changes, for high-risk women might reduce the breast cancer incidence rate.

Conclusion
In this study, a DNA methylation signature in blood associated with breast cancer risk was identified. However, the methylation signature, although developed from ELEE associations in the EPIC-Italy cohort, was not associated with lifetime estrogen exposure in the subsequent cohorts analyzed. Further investigation is required to confirm the interaction between estrogen exposure and DNA methylation in blood, and how epigenetic signatures might improve risk assessment models.

Fig. 4 Time to diagnosis and the association between the MI and breast cancer risk. Matched case-control pairs were stratified on median time to diagnosis in EPIC-Italy HM450K data (dataset 2) and in the four study cohorts included in the meta-analysis. The association between the MI and breast cancer risk was analyzed in the two groups. a The MI was significantly associated with breast cancer risk in EPIC-Italy pairs with a shorter time to diagnosis (n = 81 pairs, OR = 1.47, 95% CI 1.12-1.93, P = 0.005). b The MI was significantly associated with breast cancer risk in EPIC-Italy pairs with a longer time to diagnosis (n = 81 pairs, OR = 1.84, 95% CI 1.30-2.61, P = 0.001). c The combined meta-analysis including pairs with shorter time to diagnosis showed no significant association between the MI and breast cancer risk (OR = 1.03, 95% CI 0.98-1.08, P = 0.241). d The combined meta-analysis including pairs with longer time to diagnosis showed a significant association between the MI and breast cancer risk (OR = 1.05, 95% CI 1.01-1.10, P = 0.021). The log odds ratios were combined in the meta-analyses using restricted-maximum likelihood model. ORs were adjusted for age, BMI, alcohol consumption, and smoking duration (all variables reported at recruitment) and WBC composition.