Machine learning selected smoking-associated DNA methylation signatures that predict HIV prognosis and mortality
Clinical Epigenetics volume 10, Article number: 155 (2018)
The effects of tobacco smoking on epigenome-wide methylation signatures in white blood cells (WBCs) collected from persons living with HIV may have important implications for their immune-related outcomes, including frailty and mortality. The application of a machine learning approach to the analysis of CpG methylation in the epigenome enables the selection of phenotypically relevant features from high-dimensional data. Using this approach, we now report that a set of smoking-associated DNA-methylated CpGs predicts HIV prognosis and mortality in an HIV-positive veteran population.
We first identified 137 epigenome-wide significant CpGs for smoking in WBCs from 1137 HIV-positive individuals (p < 1.70E−07). To examine whether smoking-associated CpGs were predictive of HIV frailty and mortality, we applied ensemble-based machine learning to build a model in a training sample employing 408,583 CpGs. A set of 698 CpGs was selected and was predictive of high HIV frailty in a testing sample [area under curve (AUC) = 0.73, 95%CI 0.63~0.83], and this prediction was replicated in an independent sample (AUC = 0.78, 95%CI 0.73~0.83). We further found that a DNA methylation index constructed from the 698 CpGs was associated with the 5-year survival rate (HR = 1.46; 95%CI 1.06~2.02, p = 0.02). Interestingly, the 698 CpGs, located on 445 genes, were enriched in the integrin signaling pathway (p = 9.55E−05, false discovery rate = 0.036), which is responsible for the regulation of the cell cycle, differentiation, and adhesion.
We demonstrated that smoking-associated DNA methylation features in white blood cells predict HIV infection-related clinical outcomes in a population living with HIV.
Smoking is a common and underappreciated contributor to poor outcomes in HIV-infected individuals. The prevalence of smoking among HIV-infected people exceeds 60% , and it is an independent risk factor for mortality in treated HIV-infected individuals . Smoking increases the mortality risk among HIV-infected individuals with an odds ratio between 2 and 3 [2, 3]. However, we have little insight into the mechanisms through which smoking contributes to poorer HIV outcomes.
Smoking-associated effects on DNA methylation in white blood cells (WBCs) have been demonstrated through epigenome-wide association studies (EWAS). DNA methylation is an epigenetic mechanism regulating gene expression independent of variation in the DNA sequence. To date, hundreds of CpG sites (i.e., cytosine-guanine dinucleotides), where cytosines can be methylated to form 5-methylcytosine, in WBCs have been associated with smoking status, quantity, smoking cessation, and smoking-related traits or diseases (e.g., oxidative stress level, lung cancer, chronic inflammatory disease) in the HIV-uninfected population. Indices of DNA methylation constructed from smoking-associated CpG sites have predicted smoking-related lung cancer incidence and oral cancer incidence. In a recent study, a smoking DNA methylation index derived from six CpG sites was associated with frailty in aging populations. Finally, smoking-associated CpGs in the blood were reported to predict all-cause mortality [13, 14] and cardiovascular-related mortality. However, smoking-related DNA methylation associations have not been described in HIV-infected populations to date.
The host epigenome is also impacted by HIV infection. We and others recently showed that DNA methylation is associated with HIV infection and HIV-related aging [16,17,18,19]. We reported that CpG sites in the promoter of NLRC5, a transcriptional activator of major histocompatibility complex class I, were less methylated in samples from HIV-infected persons as compared to samples from HIV-uninfected persons . Epigenetic marks were also associated with cognitive impairment in the HIV-infected population, and the epigenetic clock relates to biological aging in HIV-infected individuals . Taken together, it is reasonable to hypothesize that both smoking and HIV infection have effects on the epigenome that contribute to poor HIV outcomes and an increased risk of mortality.
Selecting informative features from high-dimensional epigenetic data to predict clinical outcomes is challenging. For this purpose, machine learning has emerged as a powerful tool that enables the discovery of unknown features in the epigenome to predict phenotypes of interest. Machine learning has been successfully applied to select DNA methylation features to identify biomarkers for complex diseases and to predict treatment outcomes [16, 21, 22]. Recently, a kernel machine learning method improved the prediction of cancer prognosis by integrating molecular profiles and clinical predictors. A panel of DNA methylation markers was able to diagnose common cancers with 95% accuracy and identified 29 out of 30 colorectal cancer metastases. In another study, immune response features selected by DNA methylation-based learning improved the prediction of chemotherapy treatment outcomes and survival for breast cancer patients. Such an approach can be useful to identify biological signatures of HIV-related outcomes influenced by smoking.
In this study, using an ensemble-based machine learning approach, our goal was to select smoking-associated DNA methylation CpGs in the HIV-infected host epigenome and link the selected CpGs to the HIV disease outcomes. The motivation to use ensemble-based learning is that an ensemble approach has advantages to reduce the bias from individual machine learning methods and to improve the stability of prediction performance in an imbalanced sample [26, 27]. We were also interested in understanding the biological significance of the selected features. This study demonstrates that the application of advanced machine learning on methylation features provides evidence of a link between the mechanisms of smoking and smoking-associated adverse HIV outcomes.
The study design and the framework are presented in Fig. 1. Briefly, all DNA samples were extracted from WBCs collected from people who live with HIV from the Veteran Aging Cohort Study (VACS) (N = 1137). All samples were randomly divided into a discovery (cohort 1) sample and a replication (cohort 2) sample. Demographic and clinical variables are presented in Table 1. We first conducted a meta-analysis of the EWAS for smoking in two separate HIV-infected samples. We then selected smoking-associated CpGs that predicted HIV outcomes by using an ensemble-based learning approach.
DNA methylation in WBCs associated with tobacco smoking
We profiled CpGs using the Illumina Infinium HumanMethylation 450 Beadchip (450K) (San Diego, CA, USA) in HIV-infected samples (cohort 1, N = 608; current smokers = 361; non-smokers = 247) from the VACS. After adjustment for potential confounders (i.e., age, immune cell types, adherence to antiretroviral therapy, the top principal components to limit global confounding effects), we identified 41 CpGs differentially methylated (i.e., 33 hypomethylated CpGs, 8 hypermethylated CpGs) between smokers and non-smokers (Fig. 2a, pnominal < 1.0E−7) (Additional file 1: Table S1). Of note, 40 out of 41 CpG sites were previously reported to be associated with smoking [4, 9, 10, 28,29,30,31,32,33,34,35]. The most significant sites included the established smoking biomarkers on AHRR (cg05575921, cg23576855, cg26703534, cg21161138) and on F2RL3 (cg03636183). One CpG site, cg15212292 located in the body of PRKCA, was previously reported to be significantly associated with smoking in a large meta-analysis of combined European-American (EA) and African-American (AA) populations but showed no association with smoking in AA. We found this CpG site to be highly significant in our predominantly AA sample (t = −8.911; p = 5.074E−19). Consistent with previous reports, the majority of smoking-associated CpGs were hypomethylated in smokers as compared to non-smokers.
We conducted a second EWAS for smoking in a sample that was independent of the discovery sample (VACS cohort 2, N = 529; current smokers = 309; non-smokers = 220). DNA methylation in the replication sample was profiled using the Illumina Methylation EPIC platform (San Diego, CA, USA) that included 870 K CpGs, with 408,583 CpGs shared between the Illumina 450K and EPIC arrays. To ensure consistency in comparisons across the samples, only CpGs shared across both arrays were assessed. The methylation state probes common to both platforms were highly correlated (r ~ 0.91 to 0.99).
Applying the same analytical protocol, we adjusted for the same confounders in the discovery and replication samples. A total of 49 CpG sites reached epigenome-wide significance in the replication sample including the 41 CpGs identified in the discovery EWAS and 8 significant CpGs that were only seen in the replication sample (Fig. 2b) (Additional file 1: Table S2). The 8 additional CpGs were all hypomethylated in smokers compared to non-smokers. The high concordance in findings between the two samples suggests that smoking-associated CpG sites are highly reproducible.
Combining the discovery and replication samples, a meta-EWAS revealed a total of 137 CpGs that were significantly associated with smoking (p < 1.0E−7) (Table 2, Additional file 2: Figure S1). A test for heterogeneity across the two samples for these 137 CpG sites was not significant after Bonferroni correction (padjusted > 0.05) for any of the sites, suggesting that their association with smoking is not confounded by sample heterogeneity. Of the 137 CpG sites, 122 sites were hypomethylated, and only 15 CpG sites were hypermethylated in smokers compared to non-smokers. As expected, the most significant CpG site was cg05575921 at AHRR. An additional 15 CpG sites on AHRR were also significantly associated with smoking status. Consistent with the findings from more than 30 previous studies in HIV-uninfected samples, these results demonstrate that alteration of DNA methylation is associated with smoking exposure regardless of HIV status.
Ensemble-based feature selection of DNA methylation for HIV frailty
The VACS index was used as an indicator of HIV outcome . High HIV frailty and poor prognosis was defined as a VACS index of greater than 50. Ensemble learning was applied to classify the samples with a VACS index score of greater than 50 as having a poor prognosis, and samples with a VACS index of less than 50 as having a good prognosis. All samples were divided into a training set (80% of the samples in cohort 1), a testing set (20% of the samples in cohort 1), and a validating set (cohort 2).
We first filtered CpGs based on p values (false discovery rate, FDR < 0.5) from the EWAS analysis. A total of 997 candidate CpGs from the discovery EWAS were used for feature selection. The goal of the feature selection was to eliminate redundant and irrelevant CpGs without losing informative loci that were associated with high frailty and poor prognosis. In our sample, the numbers of high and low VACS index samples were unequal (high VACS index = 237, low VACS index = 900). Individual machine learning approaches tend to favor the classification of samples into the larger class (e.g., low VACS index samples). To reduce this potential bias without decreasing the sample employed in the training set, we applied a greedy ensemble-based feature selection to build a classifier less likely to be biased towards the larger class from four machine learning methods (i.e., lasso and elastic-net regularized generalized linear model (GLMNET), support vector machine (SVM), random forest (RF), and XGBoost).
In the training sample from cohort 1, we applied a bootstrap aggregating (Bagging) approach, in which GLMNET was used with 100 bootstraps using 70% of the training sample, to weigh the importance of each CpG (Fig. 3a). The CpGs were subsequently clustered into 21 CpG groups from 2 to 997 CpGs based on the importance rank, in increments of 50 CpGs. Four machine learning methods, GLMNET, SVM, RF, and XGBoost, were applied to build prediction models using each CpG group separately. Then, a set of classifiers was determined and used to classify new data points by taking a weighted average of the prediction from each of the four machine learning methods. The performance of tenfold cross-validation for each CpG group showed high sensitivity (> 0.9) but relatively low specificity (< 0.5) for each of the four machine learning methods. The models from ensemble learning and from the four individual machine learning methods were evaluated in the test sample separately.
In the testing set, the ensemble method selected a set of 698 CpGs that discriminated poor and good prognosis with the best performance (Fig. 3b). The prediction efficiency was estimated using receiver operating characteristic curves; the 698 CpG set displayed an area under curve (AUC) of 0.73 (95%CI 0.63~0.83) for high HIV frailty. The AUCs from RF and XGBoost at the 698 CpGs were also high (0.76). Although RF and XGBoost had high AUCs across all CpG sets, their balanced accuracy was not as good as that of the ensemble method (Fig. 3c). Therefore, the set of 698 CpGs was selected to test the prediction efficiency. Importantly, the majority of EWAS-significant CpGs (121 out of 137 EWAS-significant CpGs) were included in the 698 CpGs (Fig. 3d), suggesting that ensemble learning enables the selection of biologically informative CpGs to predict HIV frailty.
Validation of prediction for HIV frailty using the selected 698 CpGs
To further validate the prediction results of the 698 CpGs from the discovery sample, we tested the prediction efficiency in the replication sample (cohort 2). Using the same VACS index score cut point, we found that the AUC was 0.78 (95%CI 0.73~0.83) (Fig. 4). The balanced accuracy of prediction was improved to 0.76. The results suggest that the model built in the training set had minimal overfitting features and can be applied to differentiate good and poor HIV prognosis in independent samples.
Of note, to test whether an individual machine learning method alone, a penalized regression model, can select a smaller number of CpG sites than ensemble learning from genome-wide CpG sites to predict HIV frailty, we conducted a feature selection from 408,583 CpG sites using GLMNET to predict the same high and low VACS index. We found that GLMNET selected 1852 CpG sites that predicted the VACS index with AUC of 0.76 (Additional file 3: Figure S2). Although the performance of GLMNET was comparable to the ensemble-based approach, the latter was able to select a smaller number of features and linked smoking-DNA methylation to HIV outcomes.
We tested whether ensemble learning can predict resilient persons that are HIV-positive. Using a VACS index cutoff of < 16 to define an excellent prognosis, we found that ensemble learning showed poor prediction performance (AUC < 0.7 and balanced accuracy < 0.5). The poor prediction is likely due to an insufficient number of samples with excellent prognosis (i.e., the sample was underpowered).
We were also interested in understanding whether the prediction of the high and low VACS index using the 698 CpG sites performed better than smoking status alone. We found the AUC of smoking status predicting VACS index was 0.55 (Additional file 4: Figure S3), suggesting that smoking-associated DNA methylation is a better predictor for HIV frailty compared to smoking status alone.
Prediction of the selected 698 CpGs for all-cause mortality in HIV infection
To support the value of the 698-CpG set in predicting HIV outcomes, the ability of the set to predict mortality in HIV-infected individuals was evaluated. Using the same ensemble model, we first tested the prediction performance of the 698 CpGs with mortality in cohort 2, in which 84 subjects died within 5 years after the blood draw used to profile the DNA methylome. The AUC was 0.66 (95%CI 0.60~0.73) (Additional file 5: Figure S4), which was not as good as the prediction of HIV frailty.
We then constructed a DNA methylation index score based on the coefficient of each CpG site from the 698 CpGs in cohort 1. After adjusting for confounding factors such as age, CD4 count, viral load, and antiretroviral therapy, we found a significant association between the methylation index and the 5-year survival rate in cohort 2 (HR = 1.46; 95%CI 1.06~2.02, p = 0.02) (Fig. 5). As expected, the significant association was driven by hypomethylated CpG sites for smoking (HR = 1.39, p = 0.02) but not by hypermethylated CpGs for smoking (HR = 1.21, p = 0.21). The results provide further evidence that DNA methylation-based prediction of mortality can be applied in the HIV-infected population.
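As a consistency check on the reported interval, the p value can be approximately recovered from the hazard ratio and its 95% confidence interval. The sketch below assumes a Wald interval that is symmetric on the log scale; the function name wald_p_from_ci is illustrative and not part of the original analysis.

```python
import math


def wald_p_from_ci(hr, lower, upper, z_crit=1.96):
    """Approximate the Wald z statistic and two-sided p value from an HR and its CI.

    Assumes the CI was built as exp(log(HR) +/- z_crit * SE), which is an
    assumption about how the interval was computed.
    """
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(hr) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p


z, p = wald_p_from_ci(1.46, 1.06, 2.02)
print(f"z = {z:.2f}, p = {p:.3f}")
```

Under this assumption the recovered p value is close to the reported p = 0.02.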
Biological significance of the selected 698 CpG sites
The selected 698 CpGs were located among 445 genes (Additional file 1: Table S3). Pathway analysis showed a significant enrichment on the canonical integrin signaling pathway (p = 9.55E−05, FDR = 0.036). Fourteen out of 445 genes were in this pathway: MAP2K4, ITGA2B, ARHGAP26, PIK3R5, ITGAL, PTK2, NCK2, CAPN8, RHOG, GAB1, LIMS1, ITGA11, CTTN, and ACTN1. Integrin signaling determines cellular responses such as migration, survival, differentiation, and motility and provides a context for responding to other inputs. The function of integrin signaling is critical for cell adhesion, tissue maintenance and repair, host defense, and hemostasis. Among non-canonical pathways, cancer, organismal injury, and abnormalities were the most significant (FDR = 1.87E−17). Other top disease-related pathways were in the categories of gastrointestinal disease, liver hyperproliferation, and dermatological diseases. These results suggest that ensemble learning selected biologically relevant features underlying pathological changes in smoking-related diseases.
Applying a DNA methylation-based machine learning approach, we report a set of smoking-associated DNA methylation sites predicting HIV prognosis and mortality in people living with HIV. The prediction of HIV frailty by the selected features showed an ability to accurately differentiate good and poor HIV-related clinical outcomes in an independent sample. The DNA methylation index constructed from the selected CpGs was also associated with mortality in the HIV-infected population. Interestingly, the selected smoking-associated methylation features were enriched in the integrin signaling pathway and related to multiple cancers and organismal injuries, which supports the hypothesis that the contributions of smoking to poor disease outcome are in part due to the changes in DNA methylation in the HIV-infected host epigenome. The study has demonstrated that the application of methylation-based machine learning can be useful for linking molecular information to clinical outcomes.
One of the major challenges in building a successful model that uses high-dimensional data to predict disease outcomes is how to select informative features among redundant or irrelevant data, background noise, and biased features. We applied several approaches to guide the machine learning process. First, epigenome CpGs were filtered based on association analysis of DNA methylation sites with smoking, which considerably reduced the number of features for model building. Our rationale for using smoking-associated features was that smoking alters DNA methylation and that smokers living with HIV have higher mortality rates. Second, we applied ensemble learning based on the results of multiple machine learning methods to optimize the selected features and to limit the bias of each method. This approach typically improves the accuracy of the model when employing an unbalanced sample. Our results showed that the performance of the ensemble-based model is highly reproducible and better than that of an individual machine learning method such as GLMNET. The greedy ensemble machine learning approach can also reduce overfitting and improve model stability. Overfitting is another major challenge in building a predictive model. To address this concern, we split the sample into two cohorts: cohort 1, which was sub-divided into a training set and a testing set, and cohort 2, which was used to replicate the predictive model performance. Thus, the features selected from cohort 1 could be independently tested in cohort 2. These steps were intended to ensure that we selected features that predict HIV outcomes with high accuracy.
Our results showed that the selected features were predictive for HIV frailty with moderate to high sensitivity and specificity. Methylation marks for smoking were previously applied to predict frailty in an elderly population. Gao et al. reported that 9 smoking-associated CpG sites were significantly associated with higher frailty. We found that our selected 698 features showed better performance (AUC 0.78 versus AUC 0.55), which may be due to the inclusion of significantly more CpG sites and different populations in our sample compared to the Gao et al. study. The prediction of HIV frailty using the selected 698 features also outperformed the use of tobacco smoking alone.
We found that the prediction of 698 sites for mortality was not as good as the prediction for the VACS index. This result is not unexpected as the model was built for the VACS index, not for mortality. Second, the number of deaths by cohort was unbalanced. In cohort 2, only 87 individuals had died at the time of this analysis, which may have reduced the power for accurate prediction. However, the methylation index with 698 CpGs was significantly predictive for 5-year survival rate. Individuals with a greater methylation index were more likely to have shorter life expectancy than individuals with lower methylation index.
Importantly, the selected DNA methylation features were not only computationally effective for classifying good and poor outcomes and for predicting mortality but were also biologically relevant to HIV frailty and mortality. The selected 698 CpGs included loci in genes involved in immune activation and inflammatory processes, which are highly associated with HIV frailty and mortality. For example, the most significant smoking-associated gene, AHRR, is not only involved in the metabolism of endogenous toxins from smoking that result in pathological processes but also represses other signaling pathways, including NF-κB, and is capable of regulating inflammatory responses. TNFRSF4 has been shown to activate NF-κB and plays a role in apoptosis. In addition, a number of CpGs in the 698 CpG sites were previously reported to be involved in acceleration of aging, frailty, cancer pathogenesis, and all-cause mortality. Although the majority of DNA methylation differences at a single CpG site between smokers and non-smokers are modest, the 698 features were enriched in pathways highly relevant to disease prognosis, frailty, and mortality.
While a model of 698 CpG sites may seem to have a large number of features for the prediction of frailty, emerging evidence has demonstrated that the effect of DNA methylation at an individual CpG site on a complex trait is small (less than 10%). In our EWAS analysis, the effect size of single CpG sites on smoking was in a range of 1 to 13%. Predicting a complex outcome such as frailty with a small number of CpG sites is therefore highly unlikely. A recently published paper showed panels of 200 to 1100 CpG sites predicting multiple complex traits including alcohol, smoking, HDL cholesterol, education, and death. Thus, a panel of hundreds of CpG sites predicting complex traits is expected. However, methods to select more informative features and to potentially reduce the number of features in future studies are warranted.
We acknowledge several limitations of this study. A recent study suggests that mRNA and miRNA profiles showed the best prediction for cancer prognosis . Integrating DNA methylation with other omic and clinical data may improve the predictive value and clinical utility of the predicting model. Due to methodological limitations, we were unable to build a model to predict the VACS index as a continuous variable, which may have better clinical utility. The study was conducted in a retrospective cohort and smoking was defined from self-report, which may introduce bias. Applying our predictive model using the 698 selected features in a prospective cohort is warranted to confirm the results. The mechanisms that underlie the selected CpG features on HIV progression remain to be defined. Future studies of smoking’s effects on DNA methylation in HIV-infected specific cell types are warranted to better understand how the selected features involve smoking-related HIV prognosis.
Our results demonstrate a machine learning approach to establish methylation signatures for disease outcomes. The identified methylation sites may be a biological surrogate for the VACS index to measure clinical outcomes and to predict mortality. This first methylation-based machine learning study sheds light on the impact of smoking on risk for complicated clinical outcomes, estimated using a molecular profile, in the setting of HIV infection.
Applying DNA methylation-based ensemble learning, we identified a set of 698 smoking-associated DNA methylation CpG sites that predict HIV frailty and mortality. Building on more than 30 previous studies in HIV-uninfected persons, our findings suggest that smoking exposure changes DNA methylation in the HIV-infected host genome that is linked to HIV disease prognosis. Our results demonstrate that DNA methylation-based machine learning is a robust approach for the prediction of HIV prognosis.
Study population and phenotype assessment
The VACS, a nation-wide multicenter collaborative project designed to understand the role of co-morbid medical and psychiatric diseases in determining clinical outcomes in HIV infection, was the source of specimen and data (https://medicine.yale.edu/intmed/vacs/). The VACS biobank cohort is comprised of 2470 participants who were recruited for genetic studies from 2006 to 2007. Participants of the VACS biobank cohort provided written informed consent for the genetic study and provided blood samples. Clinical and demographic data were collected within 90 days of the blood sample collection. A total of 1137 samples were selected and randomly divided into two subsets (labeled cohort 1 and cohort 2), and DNA methylation was processed separately using different methylation arrays.
Self-report was used to collect information on smoking status. Current smokers were defined as smoking cigarettes daily during the past week; non-smokers reported never smoking cigarettes. The VACS created an index score to estimate overall frailty of HIV-infected individuals by summing pre-assigned points for age, routinely monitored indicators of HIV disease (CD4 count and HIV-1 RNA), and general indicators of organ system injury including hemoglobin, platelets, aspartate and alanine transaminase (AST and ALT), creatinine, and viral hepatitis C infection (HCV) (https://medicine.yale.edu/intmed/vacs/welcome/vacsindexinfo.aspx). The VACS index has been associated with important changes in health condition and behavior . The VACS index has been shown to predict all-cause mortality among those undergoing treatment for HIV infection . A higher VACS index score indicated greater frailty. Mortality rate 5 years after blood draw was 16%.
Profiling DNA methylation using Illumina DNA methylation Beadchips
Genomic DNA was extracted from whole blood samples. DNA methylation profiling was conducted at the Yale Center for Genomic Analysis using the Illumina (San Diego, CA, USA) Infinium HumanMethylation450 BeadChip (HM450K) for cohort 1 and Illumina Infinium MethylationEPIC (EPIC) for cohort 2. Two sample sets were processed at different times but were processed by the same scientist at the Yale Center for Genomic Analysis who was blinded to the phenotypic information collected. All samples were randomly placed on each array and batch-corrected using the removeBatchEffect function in limma. Probe normalization and batch correction were performed as previously described by Lehne et al. .
Data quality control and normalization
In cohort 1, we removed 11,648 probes on sex chromosomes and 36,142 probes within 10 base pairs of single nucleotide polymorphisms. A total of 437,722 probes remained for analysis. As described by Lehne et al. , 24,416 probes on Y chromosomes were applied to evaluate the detection p value. A p < 1E−12 was set as a detection p value threshold to improve the quantification of methylation intensities. The intensity values with detection p > 1E−12 were labeled missing, and samples with a call rate < 98% were excluded. We also compared the predicted sex with self-reported sex. All samples matched as male. In cohort 2, we applied the same criteria for quality control. We removed 11 samples due to mismatched sex or low call rate. Only the 408,583 probes that were identical with HM450 array were extracted for replication analysis. Quantile normalization of intensity values was performed following the recommendations of Lehne et al. Six cell types (CD4+ T cells, CD8+ T cells, NK T cells, B cells, monocytes, and granulocytes) in the blood were estimated in each sample using the method described by Houseman et al. .
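The detection p value and call rate filters described above can be sketched as follows. This is a minimal illustration on a samples-by-probes table of detection p values; the function names and the toy data are hypothetical and do not reproduce the original pipeline.

```python
def sample_call_rate(detection_p_row, threshold=1e-12):
    """Fraction of probes in one sample whose detection p value passes the threshold."""
    passed = sum(1 for p in detection_p_row if p <= threshold)
    return passed / len(detection_p_row)


def filter_samples(detection_p, threshold=1e-12, min_call_rate=0.98):
    """Return the IDs of samples whose call rate is at least min_call_rate.

    detection_p maps a sample ID to its list of per-probe detection p values.
    Samples with a call rate below min_call_rate are excluded.
    """
    return [
        sample_id
        for sample_id, row in detection_p.items()
        if sample_call_rate(row, threshold) >= min_call_rate
    ]


# Toy data: 100 probes per sample (hypothetical values).
good, bad = 1e-16, 0.01
toy = {
    "sample_a": [good] * 100,             # call rate 1.00, kept
    "sample_b": [good] * 97 + [bad] * 3,  # call rate 0.97, excluded
    "sample_c": [good] * 98 + [bad] * 2,  # call rate 0.98, kept
}
print(filter_samples(toy))
```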
The study design and analytical approaches are summarized in Fig. 1.
Epigenome-wide association analysis
Analyses of discovery and replication stages were performed using the same pipeline . To adjust for significant global confounding factors, we conducted two serial regression analyses to determine the associations between methylome-wide CpGs and smoking. The following steps were performed to correct for global co-variations that may confound specific DNA methylation in smoking.
The first principal component analysis (PCA) was performed to evaluate the intensity values of positive control probes designed in HM450. Then, the first GLM was performed as follows:
The residuals for each probe and the top 30 PCs of the first PCA were used to adjust for technical biases, particularly batch effects.
The second PCA was performed on the resulting regression residuals from the first model. The top 5 PCs of the second PCA were used to control for global biological confounders that cannot be directly captured in the model.
Final GLM model
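The two regression models are described in the text only in words. One form consistent with that description, and with the covariates listed in the Results (age, immune cell types, antiretroviral therapy adherence, and principal components), is sketched below. All symbols are assumptions introduced for illustration: M_{ij} is the methylation value of probe j in sample i, PC^{ctrl} denotes the top 30 PCs of the control-probe PCA, and PC^{res} denotes the top 5 PCs of the PCA on the first-model residuals.

```latex
\begin{align}
\text{First GLM:}\quad
M_{ij} &= \beta_{0j} + \beta_{1j}\,\mathrm{Smoking}_i + \beta_{2j}\,\mathrm{Age}_i
        + \beta_{3j}\,\mathrm{ART}_i
        + \sum_{c=1}^{6} \gamma_{cj}\,\mathrm{Cell}_{ic}
        + \sum_{k=1}^{30} \delta_{kj}\,\mathrm{PC}^{\mathrm{ctrl}}_{ik}
        + \varepsilon_{ij} \\
\text{Final GLM:}\quad
M_{ij} &= \beta_{0j} + \beta_{1j}\,\mathrm{Smoking}_i + \beta_{2j}\,\mathrm{Age}_i
        + \beta_{3j}\,\mathrm{ART}_i
        + \sum_{c=1}^{6} \gamma_{cj}\,\mathrm{Cell}_{ic}
        + \sum_{k=1}^{30} \delta_{kj}\,\mathrm{PC}^{\mathrm{ctrl}}_{ik}
        + \sum_{l=1}^{5} \theta_{lj}\,\mathrm{PC}^{\mathrm{res}}_{il}
        + \varepsilon'_{ij}
\end{align}
```

In this sketch, the coefficient \beta_{1j} of the final model is the smoking effect tested for each probe.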
The significance threshold was set at p < 1.0E−07, which approximates a Bonferroni correction for the number of probes tested.
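The Bonferroni threshold implied by the number of probes can be checked directly. The sketch below assumes a family-wise alpha of 0.05, which is not stated in the text, and uses the probe counts reported for cohort 1 and for the probes shared between arrays.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


ALPHA = 0.05  # assumed family-wise error rate
for label, n_probes in (("cohort 1 probes", 437_722), ("shared probes", 408_583)):
    print(f"{label}: {bonferroni_threshold(ALPHA, n_probes):.3e}")
```

Both values are slightly above 1.0E−07, so the threshold used is marginally more conservative than a strict Bonferroni correction at alpha = 0.05.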
We conducted an EWAS meta-analysis by combining the data from the discovery (cohort 1) and replication (cohort 2) samples. Effect size and p values for each probe were obtained from analyses in cohort 1 and cohort 2 samples, respectively. We performed fixed-effects, inverse-variance meta-analysis, with scheme parameters of sample size and standard error by implementing the METAL (ver: 2010-02-08) program, combining summary statistics in two sample sets. We investigated heterogeneity in two sample sets using the I2 statistic.
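For a single CpG site, a fixed-effects, inverse-variance meta-analysis and the I2 heterogeneity statistic can be sketched as below. This is a minimal illustration of the standard formulas, not the METAL implementation, and the effect sizes in the example are hypothetical.

```python
import math


def fixed_effect_meta(betas, ses):
    """Inverse-variance fixed-effects meta-analysis of per-cohort effects.

    Returns the pooled effect, its standard error, a two-sided p value,
    and the I2 heterogeneity statistic (in percent).
    """
    weights = [1.0 / se ** 2 for se in ses]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled / pooled_se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    # Cochran's Q and I2 = max(0, (Q - df) / Q) * 100
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, pooled_se, p_value, i2


# Hypothetical effect sizes for one CpG in cohort 1 and cohort 2.
pooled, pooled_se, p_value, i2 = fixed_effect_meta([-0.10, -0.12], [0.01, 0.02])
print(pooled, pooled_se, p_value, i2)
```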
Machine learning prediction of HIV prognosis
Considering the samples were processed at different times and on different platforms, batch effects were removed using the removeBatchEffect function in the R package limma (ver. 3.32.10) before performing the machine learning prediction. To reduce redundant DNA methylation signals and noise for improving the prediction accuracy of HIV frailty, CpG sites with FDR < 0.5 from EWAS in cohort 1 were selected for machine learning. The samples in cohort 1 were randomly divided into a training set and a test set with a ratio of 8:2. We first built a model using the training set, in which each sample was labeled poor (VACS index > 50) or good prognosis (VACS index ≤ 50). We then tested the model by performing 10-fold cross-validation in the testing set, and the best-performing model was tested in an independent replication set.
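The labeling rule and the 8:2 split can be sketched as follows. The function names and the random seed are assumptions introduced for illustration; the original split was performed in R.

```python
import random


def label_prognosis(vacs_index, cutoff=50):
    """Label a sample 'poor' if its VACS index exceeds the cutoff, else 'good'."""
    return "poor" if vacs_index > cutoff else "good"


def split_train_test(sample_ids, train_fraction=0.8, seed=0):
    """Randomly split sample IDs into a training set and a test set."""
    shuffled = list(sample_ids)
    random.Random(seed).shuffle(shuffled)  # seed is an arbitrary choice
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Cohort 1 had 608 samples; the IDs here are placeholders.
train_ids, test_ids = split_train_test(range(608))
print(len(train_ids), len(test_ids))
```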
Prediction model and validation
The GLMNET machine learning method was used to build a prediction model. A total of 997 CpGs from the EWAS (FDR < 0.1) were ranked by the importance value that GLMNET assigned to each CpG. The CpG sites were then grouped into 21 sets ranging from 2 to 997 sites in increments of 50 CpGs.
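One reading of this grouping, assumed here rather than stated in the text, is that the ranked CpGs were evaluated as nested top-k sets whose sizes start at 2 and grow by 50 until all 997 sites are included. The Python sketch below enumerates those sizes; the function names and the toy importance values are hypothetical.

```python
def feature_set_sizes(n_total=997, start=2, step=50):
    """List candidate set sizes: start, start + step, ..., then n_total."""
    sizes = list(range(start, n_total, step))
    if sizes[-1] != n_total:
        sizes.append(n_total)  # always include the full ranked list
    return sizes

def top_k_sets(importance, sizes):
    """Return the k most important features for each requested size k."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    return {k: ranked[:k] for k in sizes if k <= len(ranked)}

sizes = feature_set_sizes()
print(len(sizes), sizes[:3], sizes[-2:])

# Toy importance values for three hypothetical CpG identifiers.
toy = top_k_sets({"cg_a": 0.9, "cg_b": 0.1, "cg_c": 0.5}, [2])
print(toy)
```

Under this assumption, the sizes 2, 52, ..., 952 plus the full set of 997 give 21 sets, which matches the number of groups reported.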
Tenfold cross-validation was performed in the training set to identify the best-performing model. Additional machine learning methods were also evaluated to obtain the best prediction: GLMNET, SVM, RF, and XGBoost were run separately. The parameters of each algorithm were fine-tuned using the R package caret (ver: 6.0-78) (https://libraries.io/cran/caret/6.0-78). To avoid the bias of any single method, we used an ensemble method implemented in the R package caretEnsemble (ver: 2.0.0) (https://cran.r-project.org/web/packages/caretEnsemble/index.html), which constructed a new model by weighting the vote of each CpG from the four machine learning methods.
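The general idea of combining several learners by weighted voting can be sketched as below. This is a simplified Python illustration and not the caretEnsemble algorithm: the model weights, the predicted probabilities, and the 0.5 decision threshold are all hypothetical.

```python
def ensemble_probability(model_probs, weights):
    """Weighted average of the poor-prognosis probabilities from each model."""
    total = sum(weights[m] for m in model_probs)
    return sum(weights[m] * p for m, p in model_probs.items()) / total

# Hypothetical predicted probabilities of poor prognosis for one sample.
probs = {"glmnet": 0.70, "svm": 0.60, "rf": 0.80, "xgboost": 0.50}
# Hypothetical weights; in practice they would be learned from held-out data.
weights = {"glmnet": 0.4, "svm": 0.2, "rf": 0.3, "xgboost": 0.1}

combined = ensemble_probability(probs, weights)
label = "poor" if combined >= 0.5 else "good"
print(round(combined, 2), label)
```

Averaging probabilities rather than hard class votes keeps information about how confident each learner is, which is one reason weighted ensembles are often less sensitive to the bias of a single method.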
The testing set was used to evaluate the model by ROC analysis. The best-performing features were then used to validate the model in the independent replication set (cohort 2) using the ensemble-based method. Sensitivity, specificity, and AUC were used to assess model performance.
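For reference, the three performance measures can be computed as in the following minimal Python sketch. The labels (1 = poor prognosis, 0 = good prognosis), the predicted scores, and the 0.5 threshold are hypothetical, and the AUC is computed through its rank interpretation rather than by tracing the ROC curve.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sensitivity_specificity(labels, scores, threshold=0.5):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical true labels and predicted probabilities for eight samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_score = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]

auc = roc_auc(y_true, y_score)
sens, spec = sensitivity_specificity(y_true, y_score)
print(auc, sens, spec)
```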
Association of DNA methylation index with mortality
To examine whether methylation at the selected CpG sites was associated with mortality, we constructed a methylation index from the 698 CpG sites following a previously described formula. Separate indices were constructed for hypomethylated and hypermethylated CpG sites.
The association of the DNA methylation risk index with all-cause mortality was examined using Kaplan-Meier plots and log-rank tests in all samples. A Cox regression model was then used to adjust for age, antiretroviral therapy adherence, HIV-1 viral load, and CD4 count. In the Cox regression model, the DNA methylation index score was treated either as a categorical variable (using the highest quartile as the reference category) or as a continuous variable (calculating the HR for a decrease in DNA methylation of one standard deviation). Index_hypo and index_hyper were evaluated separately for the prediction of mortality.
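The survival curves underlying a Kaplan-Meier plot can be computed with the product-limit estimator, sketched below in Python. This is a minimal illustration only: the follow-up times and event indicators (1 = death, 0 = censored) are hypothetical, and the log-rank test and Cox regression used in the analysis are not reproduced here.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct death time."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Process all subjects who share the same follow-up time together.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        # Deaths and censored subjects both leave the risk set after time t.
        n_at_risk -= removed
    return curve

# Hypothetical follow-up times (years) and event indicators for six subjects.
curve = kaplan_meier(times=[1, 2, 2, 3, 4, 5], events=[1, 1, 0, 1, 0, 1])
for t, s in curve:
    print(t, round(s, 4))
```

In practice one such curve would be estimated per index category (for example, per quartile) and the curves compared with a log-rank test.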
Gene enrichment analysis
Pathway and network analyses were conducted for the selected 698 CpG sites on 455 genes using Ingenuity Pathway Analysis (IPA). For genes with multiple CpG sites, the lowest p value among the CpG sites within the gene was used to represent gene-level significance. Significant pathways were defined at an FDR < 0.05.
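The gene-level summarization rule can be sketched in a few lines of Python. The gene symbols are used only as labels and the p values are hypothetical; the pathway enrichment itself was performed in IPA and is not reproduced here.

```python
def gene_level_p(cpg_records):
    """Collapse (gene, p value) pairs to the lowest p value per gene."""
    gene_p = {}
    for gene, p in cpg_records:
        if gene not in gene_p or p < gene_p[gene]:
            gene_p[gene] = p
    return gene_p

# Hypothetical CpG-level p values annotated to genes.
records = [
    ("AHRR", 3.0e-12),
    ("AHRR", 4.5e-09),
    ("F2RL3", 2.1e-08),
    ("AHRR", 1.2e-15),
]
summary = gene_level_p(records)
print(summary)
```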
AUC: Area under curve
AUDIT-C: Alcohol Use Disorder Identification Test-C
EWAS: Epigenome-wide association study
GLMNET: Lasso and elastic-net regularized generalized linear models
SVM: Support vector method
VACS: Veteran Aging Cohort Study
WBCs: White blood cells
Ruggles KV, Fang Y, Tate J, Mentor SM, Bryant KJ, Fiellin DA, Justice AC, Braithwaite RS. What are the patterns between depression, smoking, unhealthy alcohol use, and other substance use among individuals receiving medical care? A longitudinal study of 5479 participants. AIDS Behav. 2017;21:2014–22.
Helleberg M, May MT, Ingle SM, Dabis F, Reiss P, Fatkenheuer G, Costagliola D, d’Arminio A, Cavassini M, Smith C, et al. Smoking and life expectancy among HIV-infected individuals on antiretroviral therapy in Europe and North America. AIDS. 2015;29:221–9.
Reddy KP, Parker RA, Losina E, Baggett TP, Paltiel AD, Rigotti NA, Weinstein MC, Freedberg KA, Walensky RP. Impact of cigarette smoking and smoking cessation on life expectancy among people with HIV: a US-based modeling study. J Infect Dis. 2016;214:1672–81.
Gao X, Jia M, Zhang Y, Breitling LP, Brenner H. DNA methylation changes of whole blood cells in response to active smoking exposure in adults: a systematic review of DNA methylation studies. Clin Epigenetics. 2015;7:113.
Zhang Y, Florath I, Saum KU, Brenner H. Self-reported smoking, serum cotinine, and blood DNA methylation. Environ Res. 2016;146:395–403.
Philibert R, Hollenbeck N, Andersen E, McElroy S, Wilson S, Vercande K, Beach SR, Osborn T, Gerrard M, Gibbons FX, Wang K. Reversion of AHRR demethylation is a quantitative biomarker of smoking cessation. Front Psychiatry. 2016;7:55.
Gao X, Gao X, Zhang Y, Breitling LP, Schottker B, Brenner H. Associations of self-reported smoking, cotinine levels and epigenetic smoking indicators with oxidative stress among older adults: a population-based study. Eur J Epidemiol. 2017;32:443–56.
Fasanelli F, Baglietto L, Ponzi E, Guida F, Campanella G, Johansson M, Grankvist K, Johansson M, Assumma MB, Naccarati A, et al. Hypomethylation of smoking-related genes is associated with future lung cancer in four prospective cohorts. Nat Commun. 2015;6:10192.
Marabita F, Almgren M, Sjoholm LK, Kular L, Liu Y, James T, Kiss NB, Feinberg AP, Olsson T, Kockum I, et al. Smoking induces DNA methylation changes in multiple sclerosis patients with exposure-response relationship. Sci Rep. 2017;7:14589.
Zhang Y, Elgizouli M, Schottker B, Holleczek B, Nieters A, Brenner H. Smoking-associated DNA methylation markers predict lung cancer incidence. Clin Epigenetics. 2016;8:127.
Teschendorff AE, Yang Z, Wong A, Pipinikas CP, Jiao Y, Jones A, Anjum S, Hardy R, Salvesen HB, Thirlwell C, et al. Correlation of smoking-associated DNA methylation changes in buccal cells with DNA methylation changes in epithelial cancer. JAMA Oncol. 2015;1:476–85.
Gao X, Zhang Y, Saum KU, Schottker B, Breitling LP, Brenner H. Tobacco smoking and smoking-related DNA methylation are associated with the development of frailty among older adults. Epigenetics. 2017;12:149–56.
Zhang Y, Wilson R, Heiss J, Breitling LP, Saum KU, Schottker B, Holleczek B, Waldenberger M, Peters A, Brenner H. DNA methylation signatures in peripheral blood strongly predict all-cause mortality. Nat Commun. 2017;8:14617.
Zhang Y, Schottker B, Ordonez-Mena J, Holleczek B, Yang R, Burwinkel B, Butterbach K, Brenner H. F2RL3 methylation, lung cancer incidence and mortality. Int J Cancer. 2015;137:1739–48.
Zhang Y, Schottker B, Florath I, Stock C, Butterbach K, Holleczek B, Mons U, Brenner H. Smoking-associated DNA methylation biomarkers and their predictive value for all-cause and cardiovascular mortality. Environ Health Perspect. 2016;124:67–74.
Nelson KN, Hui Q, Rimland D, Xu K, Freiberg MS, Justice AC, Marconi VC, Sun YV. Identification of HIV infection-related DNA methylation sites and advanced epigenetic aging in HIV-positive, treatment-naive U.S. veterans. AIDS. 2017;31:571–5.
Horvath S, Levine AJ. HIV-1 infection accelerates age according to the epigenetic clock. J Infect Dis. 2015;212:1563–73.
Zhang X, Hu Y, Justice AC, Li B, Wang Z, Zhao H, Krystal JH, Xu K. DNA methylation signatures of illicit drug injection and hepatitis C are associated with HIV frailty. Nat Commun. 2017;8:2243.
Zhang X, Justice AC, Hu Y, Wang Z, Zhao H, Wang G, Johnson EO, Emu B, Sutton RE, Krystal JH, Xu K. Epigenome-wide differential DNA methylation between HIV-infected and uninfected individuals. Epigenetics. 2016;11:750–60.
Corley MJ, Dye C, D’Antoni ML, Byron MM, Yo KL, Lum-Jones A, Nakamoto B, Valcour V, SahBandar I, Shikuma CM, et al. Comparative DNA methylation profiling reveals an immunoepigenetic signature of HIV-related cognitive impairment. Sci Rep. 2016;6:33310.
Holder LB, Haque MM, Skinner MK. Machine learning for epigenetics and future medical applications. Epigenetics. 2017;12:505–14.
Adorjan P, Distler J, Lipscher E, Model F, Muller J, Pelet C, Braun A, Florl AR, Gutig D, Grabs G, et al. Tumour class prediction and discovery by microarray-based DNA methylation analysis. Nucleic Acids Res. 2002;30:e21.
Zhu B, Song N, Shen R, Arora A, Machiela MJ, Song L, Landi MT, Ghosh D, Chatterjee N, Baladandayuthapani V, Zhao H. Integrating clinical and multiple omics data for prognostic assessment across human cancers. Sci Rep. 2017;7:16954.
Hao X, Luo H, Krawczyk M, Wei W, Wang W, Wang J, Flagg K, Hou J, Zhang H, Yi S, et al. DNA methylation markers for diagnosis and prognosis of common cancers. Proc Natl Acad Sci U S A. 2017;114:7414–9.
Jeschke J, Bizet M, Desmedt C, Calonne E, Dedeurwaerder S, Garaud S, Koch A, Larsimont D, Salgado R, Van den Eynden G, et al. DNA methylation-based immune response signature improves patient diagnosis in multiple cancers. J Clin Invest. 2017;127:3090–102.
Castellanos-Garzon JA, Ramos J, Lopez-Sanchez D, de Paz JF, Corchado JM. An ensemble framework coping with instability in the gene selection process. Interdiscip Sci. 2018;10:12–23.
Chen L, Jin P, Qin ZS. DIVAN: accurate identification of non-coding disease-specific risk variants using multi-omics profiles. Genome Biol. 2016;17:252.
Su D, Wang X, Campbell MR, Porter DK, Pittman GS, Bennett BD, Wan M, Englert NA, Crowl CL, Gimple RN, et al. Distinct epigenetic effects of tobacco smoking in whole blood and among leukocyte subtypes. PLoS One. 2016;11:e0166486.
Baselmans BM, van Dongen J, Nivard MG, Lin BD, Consortium B, Zilhao NR, Boomsma DI, Bartels M. Epigenome-wide association study of wellbeing. Twin Res Hum Genet. 2015;18:710–9.
Wilson R, Wahl S, Pfeiffer L, Ward-Caviness CK, Kunze S, Kretschmer A, Reischl E, Peters A, Gieger C, Waldenberger M. The dynamics of smoking-related disturbed methylation: a two time-point study of methylation change in smokers, non-smokers and former smokers. BMC Genomics. 2017;18:805.
Jhun MA, Smith JA, Ware EB, Kardia SLR, Mosley TH Jr, Turner ST, Peyser PA, Park SK. Modeling the causal role of DNA methylation in the association between cigarette smoking and inflammation in African Americans: a 2-step epigenetic Mendelian randomization study. Am J Epidemiol. 2017;186:1149–58.
Reynolds LM, Lohman K, Pittman GS, Barr RG, Chi GC, Kaufman J, Wan M, Bell DA, Blaha MJ, Rodriguez CJ, Liu Y. Tobacco exposure-related alterations in DNA methylation and gene expression in human monocytes: the multi-ethnic study of atherosclerosis (MESA). Epigenetics. 2017;12:1092–100.
Fa S, Larsen TV, Bilde K, Daugaard TF, Ernst EH, Olesen RH, Mamsen LS, Ernst E, Larsen A, Nielsen AL. Assessment of global DNA methylation in the first trimester fetal tissues exposed to maternal cigarette smoking. Clin Epigenetics. 2016;8:128.
Zhang H, Zheng Y, Zhang Z, Gao T, Joyce B, Yoon G, Zhang W, Schwartz J, Just A, Colicino E, et al. Estimating and testing high-dimensional mediation effects in epigenetic studies. Bioinformatics. 2016;32:3150–4.
Joehanes R, Just AC, Marioni RE, Pilling LC, Reynolds LM, Mandaviya PR, Guan W, Xu T, Elks CE, Aslibekyan S, et al. Epigenetic signatures of cigarette smoking. Circ Cardiovasc Genet. 2016;9:436–47.
Justice AC, Freiberg MS, Tracy R, Kuller L, Tate JP, Goetz MB, Fiellin DA, Vanasse GJ, Butt AA, Rodriguez-Barradas MC, et al. Does an index composed of clinical data reflect effects of inflammation, coagulation, and monocyte activation on mortality among those aging with HIV? Clin Infect Dis. 2012;54:984–94.
Patil P, Parmigiani G. Training replicable predictors in multiple studies. Proc Natl Acad Sci U S A. 2018;115:2578–83.
Vogel CFA, Haarmann-Stemmann T. The aryl hydrocarbon receptor repressor - more than a simple feedback inhibitor of AhR signaling: clues for its role in inflammation and cancer. Curr Opin Toxicol. 2017;2:109–19.
Leenen FA, Muller CP, Turner JD. DNA methylation: conducting the orchestra from exposure to phenotype? Clin Epigenetics. 2016;8:92.
McCartney DL, Hillary RF, Stevenson AJ, Ritchie SJ, Walker RM, Zhang Q, Morris SW, Bermingham ML, Campbell A, Murray AD, et al. Epigenetic prediction of complex traits and death. Genome Biol. 2018;19:136.
Justice AC, Gordon K, Skanderson M, Edelman EJ, Akgun KM, Gibert CL, Lo Re V 3rd, Rimland D, Womack JA, Wyatt CM, Tate JP. Non antiretroviral polypharmacy and adverse health outcomes among HIV-infected and uninfected individuals. AIDS. 2018;32:739–49.
Justice AC, McGinnis KA, Tate JP, Braithwaite RS, Bryant KJ, Cook RL, Edelman EJ, Fiellin LE, Freiberg MS, Gordon AJ, et al. Risk of mortality and physiologic injury evident with lower alcohol exposure among HIV infected compared with uninfected men. Drug Alcohol Depend. 2016;161:95–103.
Lehne B, Drong AW, Loh M, Zhang W, Scott WR, Tan ST, Afzal U, Scott J, Jarvelin MR, Elliott P, et al. A coherent approach for analysis of the Illumina HumanMethylation450 BeadChip improves data quality and performance in epigenome-wide association studies. Genome Biol. 2015;16:37.
Houseman EA, Kelsey KT, Wiencke JK, Marsit CJ. Cell-composition effects in the analysis of DNA methylation array data: a mathematical perspective. BMC Bioinformatics. 2015;16:95.
Gao X, Zhang Y, Breitling LP, Brenner H. Relationship of tobacco smoking and smoking-related DNA methylation with epigenetic age acceleration. Oncotarget. 2016;7:46878–89.
The authors appreciate the support of the Veteran Aging Cohort Study Biomarker Core and Yale Center of Genomic Analysis.
The project was supported by the National Institute on Drug Abuse [R03 DA039745 (Xu), R01 DA038632 (Xu), R01 DA047063 (Xu and Aouizerat), R01 DA047820 (Xu and Aouizerat)] and the National Center for Post-Traumatic Stress Disorder, USA.
Availability of data and materials
Demographic variables, clinical variables, and methylation data for the VACS samples were submitted to GEO (accession GSE117861) and are publicly available. All analysis code is available upon request to the corresponding author.
Ethics approval and consent to participate
The study was approved by the committee of the Human Research Subject Protection at Yale University and the Institutional Research Board Committee of the Connecticut Veteran Healthcare System. All subjects provided written consent.
The following are the competing interests of Dr. John H Krystal: (1) Consultant: note: The Individual Consultant Agreements listed below are less than $10,000 per year: AstraZeneca Pharmaceuticals; Biogen, Idec, MA; Biomedisyn Corporation; Bionomics, Limited (Australia); Concert Pharmaceuticals, Inc.; Heptares Therapeutics, Limited (UK); Janssen Research & Development; L.E.K. Consulting; Otsuka America Pharmaceutical, Inc.; Spring Care, Inc.; Sunovion Pharmaceuticals, Inc.; Takeda Industries; Taisho Pharmaceutical Co., Ltd.; Scientific Advisory Board; Bioasis Technologies, Inc.; Biohaven Pharmaceuticals; Blackthorn Therapeutics, Inc.; Broad Institute of MIT and Harvard; Cadent Therapeutics; Lohocla Research Corporation; Pfizer Pharmaceuticals; and Stanley Center for Psychiatric Research at the Broad Institute; (2) Stock: ArRETT Neuroscience, Inc.; Blackthorn Therapeutics, Inc.; Biohaven Pharmaceuticals Medical Sciences; and Spring Care, Inc. Stock options: Biohaven Pharmaceuticals Medical Sciences; (3) income greater than $10,000: Editorial Board. Editor - Biological Psychiatry; Patents and Inventions: Seibyl JP, Krystal JH, Charney DS. Dopamine and noradrenergic reuptake inhibitors in treatment of schizophrenia. US Patent #:5,447,948. September 5, 1995; Vladimir, Coric, Krystal, John H, Sanacora, Gerard – Glutamate Modulating Agents in the Treatment of Mental Disorders US Patent No. 8,778,979 B2 Patent Issue Date: July 15, 2014. US Patent Application No. 15/695,164: Filing date: September 5, 2017; Charney D, Krystal JH, Manji H, Matthew S, Zarate C., − Intranasal Administration of Ketamine to Treat Depression United States Application No. 14/197,767 filed on March 5, 2014; United States application or Patent Cooperation Treaty (PCT) International application No. 14/306,382 filed on June 17, 2014; Zarate, C, Charney, DS, Manji, HK, Mathew, Sanjay J, Krystal, JH, Department of Veterans Affairs “Methods for Treating Suicidal Ideation”, Patent Application No. 14/197.767 filed on March 5, 2014 by Yale University Office of Cooperative Research; Arias A, Petrakis I, Krystal JH. – Composition and methods to treat addiction; Provisional Use Patent Application no.61/973/961. April 2, 2014. Filed by Yale University Office of Cooperative Research; Chekroud, A., Gueorguieva, R., & Krystal, JH. “Treatment Selection for Major Depressive Disorder” [filing date June 3, 2016, USPTO docket number Y0087.70116US00]. Provisional patent submission by Yale University; Yoon G, Petrakis I, Krystal JH – Compounds, Compositions, and Methods for Treating or Preventing Depression and Other Diseases. US Provisional Patent Application No. 62/444,552, filed on January 10, 2017, by Yale University Office of Cooperative Research OCR 7088 US01; Abdallah, C, Krystal, JH, Duman, R, Sanacora, G. Combination Therapy for Treating or Preventing Depression or Other Mood Diseases. U.S. Provisional Patent Application No. 047162-7177P1 (00754) filed on August 20, 2018, by Yale University Office of Cooperative Research OCR 7451 US01.
All other authors declare that they have no competing interests.
Table S1. Epigenome-wide significant CpG sites associated with tobacco smoking in a discovery sample. Table S2. Epigenome-wide significant CpG sites associated with tobacco smoking in a replication sample. Table S3. Machine learning selected 698 CpGs for the prediction of HIV frailty. (XLSX 77 kb)
Figure S1. Meta-analysis of epigenome-wide association of smoking in HIV-infected samples. A. Manhattan plot of the meta-analysis in two sample sets. The red line indicates Bonferroni-corrected epigenome-wide significance; B. Hypomethylated and hypermethylated CpG sites associated with tobacco smoking. (PDF 1562 kb)
Figure S2. Prediction of HIV frailty from 408,583 CpG sites using the GLMNET model. HIV frailty is represented by the Veteran Aging Cohort Study index (VACS index). AUC: area under curve from receiver operating characteristic analysis. (PDF 54 kb)
Figure S3. Prediction of HIV frailty, indicated by the Veteran Aging Cohort Study (VACS) index, from smoking status. AUC: area under curve from receiver operating characteristic analysis. (PDF 8 kb)
Figure S4. Prediction of mortality from the 698 smoking-associated CpG sites in an HIV-positive population. AUC: area under curve. (PDF 708 kb)
Cite this article
Zhang, X., Hu, Y., Aouizerat, B.E. et al. Machine learning selected smoking-associated DNA methylation signatures that predict HIV prognosis and mortality. Clin Epigenet 10, 155 (2018). https://doi.org/10.1186/s13148-018-0591-z
- DNA methylation
- Ensemble machine learning
- HIV frailty
- Tobacco smoking