Methylome and transcriptome data in FHS
The dbGaP database authorized us to obtain FHS data (accession number: phs000007) and the Medical Ethics Committee of Zhongnan Hospital of Wuhan University approved all analyses described in this study (approval number: 2018017). The methylome and transcriptome data were collected from individuals of the Offspring cohort, who attended the eighth examination cycle of FHS. The CHD designation was evaluated and determined by a panel of three investigators from the Framingham Endpoint Review Committee. The basic criterion for CHD diagnosis was that at least one coronary artery had more than 50% stenosis. The CHD status of each individual was assessed in the eighth and ninth examination cycles from 2005 to 2008 and 2011 to 2014, respectively.
The DNA methylome data were obtained from PBLs of 2724 individuals and were detected by Illumina Infinium HumanMethylation450 BeadChip. The PBLs transcriptome data were available for 2441 individuals and were profiled by Affymetrix Human Exon 1.0 ST Array. There were a total of 2321 individuals provided both PBLs methylome and transcriptome data. In order to prevent the interference of age and gender between control group and CHD group, we adopted stratified sampling among the 2321 individuals to select age-gender-matched individuals. Through sampling, we selected 2117 age-gender-matched individuals (p = 0.0700 for age; p = 0.0552 for gender) to conduct subsequent quality control process.
Quality control of methylome and transcriptome data
In the quality control process for methylome data, probes with detection p > 0.01 calculated from ChAMP package (version 2.21.1) were considered as failed probes and were excluded. Non-CpG probes, probes located on sexual chromosomes, and probes with bead count < 3 in at least 5% of the samples were also removed. Then, 6 samples were recognized as outliers by multi-dimensional scaling (MDS) plot and these samples were eliminated. The methylome data of the rest samples were normalized through PBC method to adjust the methylated and unmethylated intensities. We explored for batch effect using MDS plots, and the batch effect was controlled by the ComBat function of the ChAMP package. The RefBase function of the ChAMP package was used to obtain methylation-based estimates of the blood cell-type counts. And the cell heterogeneities among samples were further adjusted by the RefBase function.
As for the analysis of transcriptome data, probes without relevant gene names were removed, and the maximum values were taken in the circumstance that plural probes corresponded to the same gene name. We removed 26 samples as they were identified as outliers by MDS plot. The batch effect in the rest samples was adjusted by regressing out the batch variable with the ComBat function of the SVA package (version 3.42.0). We utilized the digital sorting algorithm package (DSA, version 1.0) to estimate the blood cell-type counts and adjust the cell heterogeneities of the expression data [30, 31].
A total of 2085 samples passed the quality control process for methylation and expression data. These samples were randomly divided into training set and validation set. The training set contained 1563 individuals, of which 1364 individuals were controls and 199 individuals were CHD patients. The rest 522 individuals were classified into the validation set, of which 455 individuals were controls and 67 individuals were CHD patients. The clinical characteristics of the training set and the validation set were listed in Table 1. The subsequent bioinformatics analyses were based on the methylome and transcriptome data of the training set.
Differential analysis of methylome and transcriptome data
DMPs between CHD patients and controls in the training set were identified with threshold of adj. p < 0.05 and delta beta > 0.015. The adj.p was calculated by Benjamini and Hochberg (BH) method. Considering that the regions of 1500 and 200 bases upstream of the transcriptional start site (TSS1500; TSS200) are the paramount regulation regions [32,33,34], we took genes that had DMPs in the TSS1500 and TSS200 as DMGs. Further, the average beta values of DMPs in TSS1500 and TSS200 were regarded as the synthetic methylation statuses of the corresponding DMGs.
Empirical Bayes algorithm from Limma package (version 3.50.3) was utilized to recognize DEGs in the training set. Since using adj. p < 0.05 as threshold yielded very few DEGs, we chose p < 0.05 as threshold for DEGs screening.
Further, among the intersection of DMGs and DEGs, genes with significant correlation between methylation and expression status were considered as DNA methylation-regulated genes. Spearman correlation test was applied in the evaluation of the correlation between methylation and expression status, and the threshold was p < 0.05.
Enrichment analysis and WGCNA
GO and KEGG enrichment analyses were performed using the ClusterProfiler package (version 4.2.2). While, Reactome enrichment analysis was carried out by ReactomePA package (version 1.38.0). The filter criteria for GO, KEGG and Reactome terms was p < 0.05. GSEA was carried out by the GSEA software (version 4.2.1). We took c5.all.v7.2, c2.cp.kegg.v7.2 and c2.cp.reactome.v7.4 as the reference gene sets for GO, KEGG and Reactome in GSEA, respectively. Terms with normalized p < 0.05 were deemed to be statistically significant in GSEA.
Transcriptome data of the genes that in the top 25% of median absolute deviation (MAD) in the training set were used to conduct WGCNA by WGCNA package (version 1.71). During the sample clustering analysis, 43 outlying samples were eliminated when the clustering threshold was set to −2.5. The remaining 1520 samples were enrolled into the subsequent WGCNA procedure. With the soft threshold set to 5, the adjacency matrix was converted into TOM. Genes that were intrinsically homogeneous in expression patterns were divided into identical module. TOM-based connectivity network and PPI network were constructed in the module that was most intensely associated with CHD phenotype. The PPI network was established through Search Tool for the Retrieval of Interacting Genes (STRING) database with the interaction score set to 0.15. Core networks in the TOM network and PPI network were recognized based upon 12 algorithms provided by CytoHubba, a plug-in from Cytoscape (version 3.9.0).
Dimensionality reduction process
LASSO analysis has been widely applied in dimensionality reduction to select vital features for high-dimensional data [35]. The methylation and expression data of potential DNA methylation-regulated genes in the training set were imported into LASSO analysis as independent features by glmnet package (version 4.1.4). A gene was considered to be eligible only if both its methylation and expression features passed the screening of LASSO analysis.
Analogously, SelectFpr implements feature screening by performing false positive rate (FPR) estimation through scikit-learn (version 0.23.2). A gene was deemed to pass the SelectFpr test when its methylation and expression features were both with p < 0.05. The intersection genes between LASSO analysis and SelectFpr test were incorporated into subsequent machine learning modeling.
DCA, CICA and machine learning modeling
In order to preliminarily assess the clinical net benefits of the candidate DNA methylation-regulated genes in CHD prediction, we carried out DCA based on the methylation and expression data of the genes in the training set [36]. According to the epidemiological statistics from 2006 to 2010 in USA, the prevalence of CHD was set to 0.07 in the DCA [37]. CICA was then performed on the basis of DCA to evaluate the practical value of the DNA methylation-regulated genes in CHD prediction. The DCA and CICA were both conducted by rmda package (version 1.6).
Machine learning modeling with appropriate algorithms can lead to observably potentiated predictive efficiency of biomarkers. In this study, LightGBM, XGBoost and Random Forest were utilized to construct machine learning models based on the hub DNA methylation-regulated genes, which were identified by the dimensionality reduction analyses. Each algorithm established 3 models using methylation data, expression data, and the combination of methylation and expression data, respectively.
The performance of models was assessed by a few parameters as following. The discernibility was evaluated by total accuracy (ACC), the classification accuracy was represented using Confusion matrix and Kappa value, while risk distinction and sorting ability was reflected by Kolmogorov–Smirnov (KS) and lift chart. Comprehensive performance of the models was estimated through receiver operating characteristic curve (ROC), area under ROC (AUC), precision-recall curve (PRC) and average precision score (AP). Finally, F1 score was considered as the paramount parameter that indicated the equilibrium of the models. The weight of features in the models was imputed by 2 methodologies, one was based on Gini importance, the other was in the light of SHapley Additive exPlanations (SHAP) values. The machine learning modeling was conducted on scikit-learn (version 0.23.2).
The optimal methylation model and expression model generated from FHS data were further applied to two additional datasets from Gene Expression Omnibus (GEO) database. GSE107143 contained peripheral blood methylome data from eight atherosclerosis patients and eight controls. GSE42148 provided peripheral blood transcriptome data from 13 CHD patients and 11 controls. The detection platforms of GSE107143 and GSE42148 were Illumina Infinium HumanMethylation450 BeadChip and Agilent SurePrint G3 Microarray, respectively. The data of these two datasets were processed and normalized with reference to the data processing procedures for FHS data.
Wet experiment validation using transcriptome sequencing and methylation microarray
To further verify the identified DNA methylation-regulated genes, we performed experimental validation in Chinese populations. A total of 12 male CHD patients were recruited in Zhongnan Hospital of Wuhan University (Wuhan, China) between December 2019 and July 2021. The patients were diagnosed with CHD as coronary angiography confirmed that there were more that 50% stenotic lesions in at least one coronary artery. Another 12 males without history of cardiovascular events were recruited as controls. The clinical baseline information of the participants was listed in Additional file 1: Table S2.
We obtained 8 mL peripheral blood from each participant to isolate monocytes using Ficoll-Hypaque solution (Sigma-Aldrich, Germany) and CD14 microbeads (Miltenyi, Germany) according to the manufacturer instructions. Half of the monocytes were used for DNA extraction and the other half were used for RNA extraction (Omega, USA). An amount of 1 μg DNA was treated by bisulfate (Zymo Research, USA), followed by whole genome amplification, enzymatic end-point fragmentation, precipitation, and resuspension according to the Illumina Infinium HD Methylation Protocol. Then, the resuspended samples were hybridized on Illumina Infinium HumanMethylation850 BeadChip. Subsequent methylation analyses were conducted on ChAMP package (version 2.21.1). At the same time, over 1 μg total RNA was used to obtain mRNA by poly-T oligo-attached magnetic beads. The mRNA-seq libraries were established by the NEBNext Multiplex mRNA Library Prep Set for Illumina kit (New England Biolabs, USA) and finally sequenced on the Illumina NovaSeq 6000 platform. DESeq2 package (version 1.34.0) was applied to the sequencing data analyses.
Statistical analysis
The Framingham 10-year risk scale is one of the most extensively used rating scales to evaluate the risk of developing cardiovascular diseases within ten years [38]. This scale assigns scores for a variety of traditional risk factors, including gender, age, TC, smoking, HDL-C and SBP [39]. The accumulated score of the 6 factors is referred to as FRS, which corresponds to the probability of suffering cardiovascular diseases in ten years. Individuals with estimated probabilities ≥ 20% are considered as high-risk populations [40]. Among the 1563 individuals in the training set, parameters required for FRS calculation were applicable in 1479 individuals. Besides, 487 of the 522 individuals in the validation set provided parameters required for FRS calculation. We calculated the FRS using the Framingham 10-year risk scale and compared the performance between the FRS model and the DNA methylation-regulated genes models.
In the present study, normally distributed data were shown as mean ± standard deviation (SD), while data with non-normal distribution were described as median and inter-quartile range. The differences between continuous data were analyzed by Student’s t test or Mann–Whitney U test. Analyses involving binary data were processed by Chi-square test or Fisher’s exact test. Pearson and Spearman correlation tests were applied to assess the correlations between continuous data. All statistical analyses were performed on GraphPad Prism (version 9.0), R (version 4.1.3) and Python (version 2.7.11). The threshold for the statistical tests was two-tail p < 0.05.