Subject recruitment, demographic/lifestyle variables acquisition, and DNA methylation measurements
EPIC Italy
Study participants were drawn from the Italian component of the European Prospective Investigation into Cancer and Nutrition (EPIC) cohort [53], a large general population cohort consisting of ~ 520,000 individuals, with standardised lifestyle and personal history questionnaires, measured anthropometric data and blood samples collected for DNA extraction. Smoking habits data were collected at study enrolment using a questionnaire, and participants were categorised as ‘never’, ‘former’ and ‘current’ smokers. Height and weight were measured at enrolment with a standardised protocol, and body mass index (BMI) was calculated as the ratio between weight in kg and squared height in metres, treated as a continuous variable. Measurements methods for blood pressure, cholesterol levels, triglycerides, and PAI1, D-dimer, and CRP are reported elsewhere [54].
This study sample includes individuals from five nested case–control studies on breast, colon, and lung cancer, lymphomas, and myocardial infarction [55, 56]. Participants were sampled from the 47,749 participants of the EPIC Italy cohort and included 354 incident breast cancer cases, 169 incident colon cancer cases, 192 incident lung cancer cases, 72 incident lymphoma cases, 295 incident myocardial infarction cases and their 1079 matched controls. Controls were individually matched on age (± 5 years), sex, the season of blood collection, centre, and length of follow-up. Since the disease diagnoses were made years after the blood draw, all the subjects were treated as healthy at recruitment. In the time-to-CVD event analyses, the follow-up time was (right) censored at the time of diagnosis for incident cancer cases. Overall, after DNA methylation data quality controls and sample filtering 1,803 EPIC Italian subjects were used in this analysis.
EXPOsOMICS CVD
Study participants pertain to the EPIC Italy cohort. 160 incident CVD cases and one-to-one matched controls (not overlapping with the EPIC Italy data set described hereafter) were extracted using the incident density sampling method [30]. After DNAm data quality control and sample filtering, 315 individuals were included in this study.
For the microarray (in EPIC Italy and EXPOsOMICS CVD), DNA samples were extracted from buffy coats using the QIAsymphony DNA Midi Kit (Qiagen, Crawley, UK). Bisulphite conversion of 500 ng of each sample was performed using the EZ-96 DNA Methylation-Gold™ Kit according to the manufacturer’s protocol (Zymo Research, Orange, CA). Then, bisulphite-converted DNA was used for hybridisation on the Infinium HumanMethylation450 BeadChip, following the Illumina Infinium HD Methylation protocol. Briefly, a whole-genome amplification step was followed by enzymatic end-point fragmentation and hybridisation to HumanMethylation 450 BeadChips at 48 °C for 17 h, followed by single nucleotide extension. The incorporated nucleotides were labelled with biotin (ddCTP and ddGTP) and 2,4-dinitrophenol (DNP) (ddATP and ddTTP). After the extension step and staining, the BeadChip was washed and scanned using the Illumina HiScan SQ scanner. The intensities of the images were extracted using the GenomeStudio (v.2011.1) Methylation module (1.9.0) software, which normalises within-sample data using different internal controls that are present on the HumanMethylation 450 BeadChip and internal background probes. The methylation score for each CpG was represented as a β value according to the fluorescent intensity ratio representing any value between 0 (unmethylated) and 1 (completely methylated).
The Irish longitudinal study on ageing (TILDA)
Is a large prospective cohort study examining the social, economic and health circumstances of 8,175 community-dwelling older adults aged 50 years and over resident in the Republic of Ireland. The sample was generated using a 3-stage selection process and the Irish Geodirectory as the sampling frame. The Irish Geodirectory is a comprehensive listing of all addresses in the Republic of Ireland, which is compiled by the national post service and ordnance survey Ireland. Subdivisions of district electoral divisions pre-stratified by socio-economic status, age, and geographical location, served as the primary sampling units. The second stage involved the selection of a random sample of 40 addresses from within each PSU resulting in an initial sample of 25,600 addresses. The third stage involved the recruitment of all members of the household aged 50 years and over. Consequently, the response rate was defined as the proportion of households including an eligible participant from whom an interview was successfully obtained. A response rate of 62% was achieved at the household level. There were three components to the survey. Respondents completed a computer-assisted personal interview and a separate self-completion paper and pencil module which collected information that was considered sensitive. All participants were invited to undergo an independent health assessment at one of two national centres using trained nursing staff. Blood samples were taken during the clinical assessment with the consent of participants. A more detailed exposition of study design, sample selection and protocol are available elsewhere [19]. The present study sample included 500 healthy individuals: 125 for each of the four socio-economic classes: stable professional, any downward mobility, any upward mobility, and stable unskilled. Buffy coat or peripheral blood mononuclear cells (PBMC) samples were available for all the individuals. Overall, after DNA methylation data quality controls and sample filtering, 490 subjects were analysed in this study.
For the microarray, DNA samples were extracted from buffy coats using the QIAGEN GENTRA AUTOPURE LS (Qiagen, Crawley, UK). Bisulphite conversion of 500 ng of each sample was performed using the EZ DNA Methylation-Lightning™ Kit according to the manufacturer’s protocol (Zymo Research, Orange, CA). Then, bisulphite-converted DNA was used for hybridisation on the Infinium HumanMethylation 850 k BeadChip, following the Illumina Infinium HD Methylation protocol. Briefly, a whole-genome amplification step was followed by enzymatic end-point fragmentation and hybridisation to HumanMethylation EPIC Chip at 48 °C for 17 h, followed by single nucleotide extension. The incorporated nucleotides were labelled with biotin (ddCTP and ddGTP) and 2,4-dinitrophenol (DNP) (ddATP and ddTTP). After the extension step and staining, the BeadChip was washed and scanned using the Illumina HiScan SQ scanner. The intensities of the images were extracted using the GenomeStudio (v.2011.1) Methylation module (1.9.0) software, which normalises within-sample data using different internal controls that are present on the HumanMethylation 850 k BeadChip and internal background probes. The methylation score for each CpG was represented as a β value according to the fluorescent intensity ratio representing any value between 0 (unmethylated) and 1 (completely methylated).
Understanding society
The study sample consisted of participants from the United Kingdom Household Panel Study (UKHLS), also known as Understanding Society [57], an ongoing longitudinal, nationally representative study of the UK, designed as a two-stage stratified random sample of the general population. While Understanding Society is a panel survey, the data used here consist of two pooled cross-sectional waves where a nurse collected blood samples from the respondents, among other physiological measures. The eligibility criteria for collecting blood samples were: (a) participation in the previous main interviews in England (had participated in all annual interviews between 1999 (BHPS wave 9) and 2011–2013 (Understanding Society wave 2 and 3); (b) age 16 and over; (c) living in England, Wales, or Scotland. From the potential pool of 6337 survey respondents, eligibility requirements for epigenetic analyses meant that the samples for DNA methylation measurement were restricted to participants of white ethnicity, resulting in 1175 subjects; more details can be found elsewhere [31]. Details about laboratory analyses for DNAm and how to access raw data can be found at the Understanding Society web site.
(https://www.understandingsociety.ac.uk/documentation/mainstage/dataset-documentation/variable/epigenetics).
For the GSE174818 (Covid-19 case–control) study, details of sample characteristics and laboratory methods for DNAm and biomarker analyses are described in the original publication [32].
The Health and Retirement Study (HRS) is a nationally representative longitudinal survey of more than 37,000 individuals U.S.A. [35]. The survey has been fielded every 2 years since 1992 and was established to provide a national resource for data on the changing health and economic circumstances associated with ageing at both individual and population levels. The cohort study is focussed on four broad topics: income and wealth; health, cognition, and use of healthcare services; work and retirement; and family connections. HRS data are also linked at the individual level to administrative records from Social Security and Medicare, Veteran’s Administration, the National Death Index, and employer-provided pension plan information. Since 2006, data collection has expanded to include genetic and epigenetic biomarkers. DNA methylation assays were done on a non-random sub-sample (N = 4018) of people who participated in the Health and Retirement 2016 Venous Blood Study. In this study, we used the sub-sample in which health assessment at the follow-up was available (N = 2146). The sample is 60% female and has a median age of 67 years and ranges in age from 50 to 100. It has racial diversity: non-Hispanic White and others (81.1%), non-Hispanic Black (10.0%), and Hispanic (8.9%). The sample is weighted to be representative of the U.S. population. DNA methylation data are based on assays using the Infinium Methylation EPIC BeadChip completed at the Advanced Research and Diagnostics Laboratory at the University of Minnesota. Samples were randomised across plates by key demographic variables (age, cohort, sex, education, race/ethnicity) with 39 pairs of blinded duplicates. Analysis of duplicate samples showed a correlation > 0.97 for all CpG sites. The minfi package in R software was used for data pre-processing, and quality control; 3.4% of the methylation probes (n = 29,431 out of 866,091) were removed from the final data set due to suboptimal performance (using a detection p value threshold of 0.01). Analysis for detection p value failed samples was done after removal of detection p value failed probes. Using a 5% cut-off we removed 58 samples. We also removed sex-mismatched samples and any controls (cell lines, blinded duplicates).
The Northern Ireland Cohort for the Longitudinal Study of Ageing (NICOLA) is a longitudinal cohort representative of the non-institutionalised population of Northern Ireland aged 50 years and older (N = 8504) [34]. The study, which was established in 2013, has three main components: a computer aided personal interview (CAPI), a self-completion questionnaire and health assessment. Dietary intake was also assessed by a food frequency questionnaire. The CAPI was extensive in scope and included assessment of demographic, social and health-related factors. Measures of cardiovascular, physical, cognitive, and visual function were determined, and a biobank of biological samples collected. DNA samples were extracted from buffy coats by Eurofins Scientific and normalised using PicoGreen quantitation. Bisulphite conversion of 500 ng of each sample was performed using the EX Zymo Methylation Kit (Zymo Research, Orange, CA) using the alternative overnight incubation conditions provided in the published protocol for use with the Illumina Infinium MethylationEPIC kit (Illumina USA). Then, bisulphite-converted DNA was used for hybridisation on the Infinium MethylationEPIC BeadChip array (Illumina, USA) following the manufacturer’s instructions, with arrays run on an Illumina HiScan. The intensities of the images were background adjusted and extracted as beta values using the GenomeStudio (v.2011.1) Methylation module (1.9.0) software.
DNA methylation data pre-processing and quality controls
For all the studies but HRS (pre-processing procedure described previously), raw DNAm data were pre-processed and normalised using in-house software written for the R statistical computing environment, including background and colour bias correction, quantile normalisation, and BMIQ procedure to remove type I/type II probes bias, as described elsewhere [58]. DNAm levels were expressed as the ratio of the intensities of methylated cytosines over the total intensities (β values). Samples were excluded if the bisulphite conversion control fluorescence intensity was less than 10,000 for both type I and type II probes. Methylation measures were set to missing if the detection P value was greater than 0.01. Additionally, the set of cross-reactive and/or polymorphic (with minor allele frequency greater than 0.01 in Europeans) CpGs (N = 39,238) described by Chen et al. [59] was excluded due to the low reliability of methylation measure.
The Fernández-Sanlés methylation risk score (MRS) was computed as a standardised weighted sum of 34 CpG sites, with weights defined by the estimates described by the authors in the Supplementary material of their original publication [1]. DNAmGrimAge and other epigenetic clocks were computed using Steve Horvath online DNAmAge calculator.
(https://horvath.genetics.ucla.edu/html/dnamage/).
Outcome definition
In EPIC Italy and EXPOsOMICS, incident CVD cases were identified from hospital discharge databases when the clinical record reported the International Classification of Diseases (ICD), Ninth Revision, Clinical Modification code 410, or ICD 410 plus the procedure codes for coronary revascularisation (e.g. percutaneous trans-luminal coronary angioplasty and coronary artery bypass surgery), including. Suspect CHD events were confirmed when myocardial infarction (MI), acute coronary syndrome, ischaemic cardiomyopathy, coronary or carotid revascularisation, and ischaemic or haemorrhagic stroke were reported in the records, supported by information on onset symptoms, levels of cardiac enzymes and troponins, and electrocardiographic data coded according to the Minnesota Code. Cases were cross-checked with mortality files to identify fatal and nonfatal cases (the latter defined as alive 28 days after diagnosis). Study participants with CHD at cohort entry were identified from the baseline questionnaire, from linkage with hospital discharge records, or by direct examination of clinical records, and were excluded from this study. In NICOLA and HRS, we defined CVD events accordingly with the definition used in the training set (EPIC Italy).
Statistical analyses
Development and validation of DNAm surrogates
We developed DNAm surrogates for BMI, systolic and diastolic blood pressure, and ten blood measured biomarkers. We used the EPIC Italy data set randomly split into training (N = 1352; 75% of the sample) and test set (N = 451; 25% of the sample). For each risk factor/biomarker, we created a DNAm surrogate through a three-step procedure:
-
(1)
We identified risk factors/biomarkers showing significant differences across EPIC Italy centres (Turin, Varese, Naples, Ragusa) via ANOVA analyses. We employed a linear model with a random intercept component, accounting for differences across centres for this subset of biomarkers, consisting of all but PAI-1, CRP, D-dimer, and triglycerides. We used a fixed-effect linear model for the other biomarkers.
-
(2)
Log-transformed risk factors/biomarkers were regressed on DNAm through a linear model adjusted for age, gender (fixed effect), and centre of recruitment (random effect, where necessary) to identify the top 1% ranked CpGs based on the P value.
-
(3)
DNAm surrogates of risk factors/biomarkers were constructed, regressing the response variables on the top 1% CpG sites, adjusting for sex and age. Finally, we applied L1 penalised estimation for enforcing sparsity in the regression coefficients employing the LASSO procedure [60] or the corresponding penalised mixed model [61] (for the biomarkers showing difference by centre) depending on the biomarker. For the latter method, ad hoc R routines were devised: the source code is freely available in the form of an R package at https://github.com/AndreaCappozzo/mixedelnet.
We validated the DNAm surrogates investigating their association (Pearson correlation coefficients) with the corresponding measured risk factor/biomarker in the EPIC Italy testing set (N = 451, 25% of the sample), and four additional independent studies: Understanding Society (N = 1174), TILDA (N = 490), EXPOsOMICS CVD (N = 315), and GSE174818 (N = 128). We used fixed-effect meta-analysis (inverse variance weights) to combine the results across the four validation data sets into a single estimate. As a result, we defined as ‘validated’ DNAm surrogates with significant associations (P < 0.05) in both EPIC Italy and the combined validation sets. As further validation, we investigated the correlation of our newly developed DNAm surrogates with those previously developed for BMI, HDL cholesterol [23], and PAI-1 [11].
Derivation of DNAmCVDscore
We developed a blood DNAm-based biomarker (that integrates several DNAm surrogates) for predicting the risk of future CVD events named DNAmCVDscore. We used a Cox regression model with elastic net regularisation to regress the time from recruitment to CVD event, and for selecting the most critical features from 60 (standardised: mean = 0, sd = 1) previously described blood DNAm surrogates.
The best λ parameter was derived from tenfold cross-validation to minimise the Harrell’s concordance C-index. The overall procedure includes 1,000 permutations using 80% of the whole EPIC Italy data set each time (n = 1443). The DNAm surrogates comprising the DNAmCVDscore were selected among those with nonzero coefficients in at least half of the permutations. Finally, DNAmCVDscore was computed as a linear combination of the selected DNAm surrogates where weights correspond to the average (nonzero) coefficient among the 1,000 permutations.
Validation of DNAmCVDscore and comparison with MRS, SCORE2 and DNAmGrimAge
We validated the DNAmCVDscore in three independent data sets: EXPOsOMICS CVD, NICOLA, and HRS. Since the EXPOsOMICS CVD set is designed as a case–control study nested in a cohort, we ran logistic regression analyses, and we evaluated the predictive performance of DNAmCVDscore through ROC curve analysis. Contrarily, in NICOLA and HRS (cohort studies) we run Cox proportional hazard regression models and we evaluated the prediction performance through the Harrell’s C-index.
We compared the performance of five models:
-
(1)
Based on DNAmCVDscore (adjusted for matching parameters in EXPOsOMICS CVD).
-
(2)
Based on MRS (adjusted for matching parameters in EXPOsOMICS CVD).
-
(3)
The SCORE2 prediction model based on chronological age, sex, diabetes, smoking, systolic blood pressure, total and HDL cholesterol, adjusting for matching parameters.
-
(4)
An enriched version of SCORE2, denoted with SCORE2 + DNAmCVDscore, in which DNAmCVDscore is included in the set of covariates.
-
(5)
Based on DNAmGrimAge (adjusted for matching parameters in EXPOsOMICS CVD).
In EXPOsOMICS CVD, to investigate the predictive performance of the five composite biomarkers at different time points, we computed the area under the ROC curve (AUC), sensitivity, and specificity as a function of the time from recruitment to diagnosis, right-censoring follow-up at constant intervals of one year from 18 to 2 years. Confidence intervals for AUC were computed according to De Long et al. [62].
DNAm surrogates and DNAmCVDscore versus COVID-19 case–control status and severity
As an additional sensitivity analysis, despite being out of the main scope of this work, we investigated the usefulness of using DNAm surrogate biomarkers in epidemiological studies on COVID-19 using the GSE174818 data set (101 patients with COVID-19 infection and 26 controls hospitalised with respiratory problems). Specifically, we investigated the association of BMI and blood measured CRP with COVID-19 case–control status and severity (using the GRAM score as a proxy), and we compared the results with those obtained using their DNAm surrogates (DNAmBMI and DNAmCRP). Finally, since CVDs and COVID-19 share several risk factors [50] we investigated the association of the DNAmCVDscore with COVID-19 case–control status and severity. We used logistic and linear regression models adjusted for age and gender to investigate the association with case–control status and GRAM score, respectively.