Causal effect of smoking on DNA methylation in peripheral blood: a twin and family study

Background Smoking has been reported to be associated with peripheral blood DNA methylation, but the causal aspects of the association have rarely been investigated. We aimed to investigate the association and underlying causation between smoking and blood methylation. Methods The methylation profile of DNA from the peripheral blood, collected as dried blood spots stored on Guthrie cards, was measured for 479 Australian women including 66 monozygotic twin pairs, 66 dizygotic twin pairs, and 215 sisters of twins from 130 twin families using the Infinium HumanMethylation450K BeadChip array. Linear regression was used to estimate associations between methylation at ~ 410,000 cytosine-guanine dinucleotides (CpGs) and smoking status. A regression-based methodology for twins, Inference about Causation through Examination of Familial Confounding (ICE FALCON), was used to assess putative causation. Results At a 5% false discovery rate, 39 CpGs located at 27 loci, including previously reported AHRR, F2RL3, 2q37.1 and 6p21.33, were found to be differentially methylated across never, former and current smokers. For all 39 CpG sites, current smokers had the lowest methylation level. Our study provides the first replication for two previously reported CpG sites, cg06226150 (SLC2A4RG) and cg21733098 (12q24.32). From the ICE FALCON analysis with smoking status as the predictor and methylation score as the outcome, a woman’s methylation score was associated with her co-twin’s smoking status, and the association attenuated towards the null conditioning on her own smoking status, consistent with smoking status causing changes in methylation. To the contrary, using methylation score as the predictor and smoking status as the outcome, a woman’s smoking status was not associated with her co-twin’s methylation score, consistent with changes in methylation not causing smoking status. 
Conclusions For middle-aged women, peripheral blood DNA methylation at several genomic locations is associated with smoking. Our study suggests that smoking has a causal effect on peripheral blood DNA methylation, but not vice versa. Electronic supplementary material The online version of this article (10.1186/s13148-018-0452-9) contains supplementary material, which is available to authorized users.


Background
Epigenetics refers to mechanisms that modify gene expression without changing the underlying DNA sequence. DNA methylation, in which a methyl group (-CH3) is typically added to the cytosine of a cytosine-guanine dinucleotide (CpG), converting it to 5-methylcytosine, has been proposed to play a role in the aetiology of complex traits and diseases [1,2].
At least 21 epigenome-wide association studies (EWASs) have reported that methylation in the blood of adults at a great many CpGs is associated with smoking status. A recent meta-analysis, the largest so far, reported that 18,760 CpGs annotated to 7201 genes, which account for approximately one third of the known human genes, were differentially methylated between 2433 current smokers and 6956 never smokers [11]. Associations for several loci, such as AHRR, F2RL3, GPR15, GFI1, 2q37.1 and 6p21.33, have been consistently reported, and a systematic review published in 2015 found that associations for 62 CpGs had been reported at least three times [24]. Apart from smoking status, other smoking exposures such as cumulative smoking [3, 4, 8-12, 16-18, 20, 22] and years since quitting [4, 9-12, 15, 16, 19, 20, 22] have also been found to be associated with blood DNA methylation.
Most of the reported associations are from cross-sectional designs; thus, the causal nature of the association, i.e. whether DNA methylation has a causal effect on smoking or vice versa, is unknown. There is also a possibility that cross-sectional epigenetic associations are due to familial confounding [25]. Studies have suggested that smoking-related blood DNA methylation mediates the effects of smoking on lung cancer [26,27], death [28], leukocyte telomere length [29], and subclinical atherosclerosis [30]. These studies assume that smoking has a causal effect on methylation without evidence of causality. To the best of our knowledge, the only causal evidence comes from a study using a two-step Mendelian randomisation (MR) approach to investigate the mediating role of methylation between smoking and inflammation [31]. This study found that smoking had a causal effect on methylation at CpGs located at the F2RL3 and GPR15 genes.
In this study, we aimed to investigate the association between smoking and blood DNA methylation, to replicate previously reported associations and to investigate the putative causal nature of the association using regression methods for related individuals.

Study sample
The sample comprised women from the Australian Mammographic Density Twins and Sisters Study [32]. A total of 479 women including 66 monozygotic twin pairs, 66 dizygotic twin pairs and 215 sisters from 130 families were selected [33].

Smoking data collection
A telephone-administered questionnaire was used to collect participants' self-reported information on smoking. Participants were asked the question 'Have you ever smoked at least one cigarette per day for 3 months or longer?' Participants who answered 'No' were classified as never smokers, and the rest ever smokers. Ever smokers were further questioned for age at starting smoking, the average number of cigarettes smoked per day, and age at stopping smoking, if any. Ever smokers who had stopped smoking before the interview were classified as former smokers, and the rest current smokers.
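The classification rule described above can be sketched as a small function. This is only an illustration of the stated logic; the function name and argument names are hypothetical and do not come from the study's data dictionary, and a missing stopping age is assumed to mean the participant still smoked at interview.

```python
from typing import Optional


def classify_smoking(ever_smoked: bool, age_stopped: Optional[float]) -> str:
    """Return 'never', 'former' or 'current' from two questionnaire answers.

    ever_smoked: answer to 'Have you ever smoked at least one cigarette
                 per day for 3 months or longer?'
    age_stopped: age at stopping smoking, or None if not stopped (assumed coding).
    """
    if not ever_smoked:
        return "never"
    # An ever smoker with a recorded stopping age quit before the interview.
    return "former" if age_stopped is not None else "current"
```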

DNA methylation data
DNA was extracted from dried blood spots stored on Guthrie cards using a method previously described [34]. Methylation was measured using the Infinium HumanMethylation450K BeadChip array. Raw intensity data were processed using the Bioconductor minfi package [35], which included normalisation of data using Illumina's reference factor-based normalisation methods (preprocessIllumina) and subset-quantile within array normalisation (preprocessSWAN) [36] for type I and II probe bias correction. An empirical Bayes batch-effects removal method, ComBat [37], was applied to minimise technical variation across batches. Probes with missing values (detection P value > 0.01) in one or more samples, with documented SNPs at the target CpG, with beadcount < 3 in more than 5% of samples, binding to multiple locations [38] or binding to the X chromosome, and the 65 control probes were excluded, leaving 411,219 probes included in the analysis; see Li et al. [33] for more details.

Epigenome-wide association analysis
We investigated the association using a linear mixed-effects model with the methylation M value, a logit transformation of the percentage of methylation, as the outcome and smoking status (never, former and current smokers) as the predictor. The model was adjusted for age and estimated cell-type proportions [39] as fixed effects and for family and zygosity as random effects, and was fitted using the lmer() function from the R package lme4 [40]. Inference was based on the likelihood ratio test: a nested model without smoking status was fitted, and a P value was calculated on the basis that twice the difference in the log likelihoods between the full and nested models approximately follows a chi-squared distribution with two degrees of freedom. To account for multiple testing, associations with a false discovery rate (FDR) [41] < 0.05 were considered statistically significant and the corresponding CpGs were referred to as 'identified CpGs'.
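The three numerical ingredients of this analysis can be illustrated with a minimal Python sketch; it does not fit the mixed model itself. It assumes the conventional base-2 definition of the M value and the Benjamini-Hochberg procedure for the FDR; neither choice is spelled out in the text above, and all function names are hypothetical.

```python
import math


def m_value(beta: float) -> float:
    """Base-2 logit of the methylation proportion beta (0 < beta < 1)."""
    return math.log2(beta / (1.0 - beta))


def lrt_p_value_2df(loglik_full: float, loglik_nested: float) -> float:
    """P value for a likelihood ratio statistic with 2 degrees of freedom.

    For 2 df the chi-squared survival function is exactly exp(-x / 2).
    """
    stat = 2.0 * (loglik_full - loglik_nested)
    return math.exp(-max(stat, 0.0) / 2.0)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A CpG would then count as 'identified' when its adjusted P value is below 0.05.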
For identified CpGs, we investigated their associations with cumulative smoke exposure, indicated by pack-years, for ever smokers and with years since quitting for former smokers. Pack-years were calculated as the average number of cigarettes smoked per day divided by 20 and multiplied by the number of years smoked, and were log-transformed to be approximately normally distributed. Years since quitting were calculated as age at interview minus age at stopping smoking. The covariates adjusted for and the statistical inference were the same as those for smoking status, except that the model for pack-years was additionally adjusted for smoking status (former and current smokers) to investigate associations independent of smoking status.
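The two exposure definitions amount to simple arithmetic, sketched below. The text does not state how the number of years smoked was derived; here it is assumed to be the age at stopping (or at interview, for current smokers) minus the age at starting. The function names are hypothetical.

```python
import math


def pack_years(cigs_per_day: float, age_started: float, age_end: float) -> float:
    """Packs per day (20 cigarettes per pack) times years smoked.

    age_end is the age at stopping for former smokers or the age at
    interview for current smokers (an assumption of this sketch).
    """
    years_smoked = age_end - age_started
    return cigs_per_day / 20.0 * years_smoked


def log_pack_years(value: float) -> float:
    """Natural-log transform used to reduce skewness; requires value > 0."""
    return math.log(value)


def years_since_quitting(age_at_interview: float, age_stopped: float) -> float:
    """Age at interview minus age at stopping smoking."""
    return age_at_interview - age_stopped
```

For example, 10 cigarettes per day from age 20 to age 40 gives 10 pack-years.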

Replication of previously reported associations
After quality control, 18,671 CpGs reported from the largest meta-analysis performed by Joehanes et al. [11] were included in our study. For these CpGs, we investigated their associations with smoking status in our study. Given the sample size of our study, and so as not to miss any potential replication, associations with a nominal P < 0.05 and the same direction as that reported by Joehanes et al. were considered to be replicated, and the corresponding CpGs were referred to as 'replicated CpGs'.
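The replication criterion can be written as a one-line rule. The sketch below is illustrative only; the function and argument names are hypothetical, and 'same direction' is taken to mean that the two effect estimates share the same sign.

```python
def is_replicated(p_value: float, estimate: float,
                  reported_estimate: float, alpha: float = 0.05) -> bool:
    """True if the association is nominally significant and its sign
    matches the sign of the previously reported estimate."""
    same_direction = estimate * reported_estimate > 0
    return p_value < alpha and same_direction
```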

Familial confounding analysis
For the identified CpGs and replicated CpGs, we performed between- and within-sibship analyses [25,42] to investigate whether familial factors confound the associations. Given that never and former smokers had similar methylation levels for most of the CpGs, we combined them into one group. The new smoking status was thus analysed with current smokers as '1' and the rest as '0'.
In the analysis, the methylation M values, smoking exposures and covariates were orthogonally transformed within sibships to obtain sibship means and within-sibship differences for these variables; see Stone et al. [42] for more details about the transformation. The between-sibship analyses investigated associations between sibship means for methylation levels and those for smoking exposures, and the within-sibship analyses investigated associations between within-sibship differences for methylation levels and those for smoking exposures. Associations estimated from the within-sibship analyses are independent of familial confounding, as the confounding effects of familial factors shared by siblings, both known and unknown, are cancelled out when using within-sibship differences. Evidence for familial confounding can be obtained by comparing the between-sibship coefficient (β_B) and the within-sibship coefficient (β_W). When β_B ≠ β_W and β_W ≈ 0, i.e. the association disappears when familial factors are adjusted for, the observation is consistent with the association being due to familial confounding. When β_B ≈ β_W ≠ 0, i.e. the association is similar regardless of whether familial factors are adjusted for, the observation is consistent with absence of evidence for familial confounding; see Carlin et al. [43] for more details about the implications of comparing β_B and β_W.
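The idea behind the between- and within-sibship comparison can be illustrated with a toy example. The sketch below uses sibship means and deviations from the sibship mean as a simplified stand-in for the orthogonal transformation of Stone et al. [42], and it ignores covariates; the function names and the data are hypothetical.

```python
from statistics import mean


def split_between_within(sibships):
    """sibships: one list of values per sibship, for a single variable.

    Returns the sibship means and the flattened deviations of each
    sibling from her sibship mean.
    """
    means = [mean(s) for s in sibships]
    deviations = [v - m for s, m in zip(sibships, means) for v in s]
    return means, deviations


def slope(x, y):
    """Least-squares slope of y on x with an intercept."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx


# Toy data: y = 2 * x + a family effect that rises with the family's mean x.
exposure = [[0, 1], [1, 2], [2, 3]]
outcome = [[0, 2], [5, 7], [10, 12]]

x_between, x_within = split_between_within(exposure)
y_between, y_within = split_between_within(outcome)

beta_between = slope(x_between, y_between)  # inflated by the shared family effect
beta_within = slope(x_within, y_within)     # free of factors shared within a sibship
```

In this toy example the within-sibship slope recovers the individual-level effect of 2, whereas the between-sibship slope of 5 also absorbs the shared family effect.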

Causal inference analysis
We performed causal inference between smoking status and methylation using Inference about Causation through Examination of FAmiliaL CONfounding (ICE FALCON), a regression-based methodology for analysing twin data [44][45][46][47][48]. By 'causal' we mean that, if it were possible to vary a predictor measure experimentally, the expected value of the outcome measure would change.
As shown in Fig. 1, suppose there are two variables, X and Y, measured for pairs of twins; for example, let X refer to smoking status and Y refer to methylation. Assume that X and Y are positively associated within an individual. Let S denote the unmeasured familial factors that affect both twins: S_X represents those factors that influence X values only, S_Y those that influence Y values only, and S_XY those that influence both X and Y values. For the purpose of explanation, let 'self' refer to an individual and 'co-twin' refer to the individual's twin, but recognise that these labels can be exchanged and both twins within a pair are used in the analysis.
If there is a correlation between Y_self and X_co-twin, it might be due to a familial confounder, S_XY (Fig. 1a). It could also be due to X having a causal effect on Y within an individual, provided X_self and X_co-twin are correlated (Fig. 1b), or to Y having a causal effect on X, provided Y_self and Y_co-twin are correlated (Fig. 1c). Note that the confounders specific to an individual, C_self and C_co-twin, do not of themselves result in a correlation between Y_self and X_co-twin.
Using Generalised Estimating Equations (GEE), fitted using the geeglm() function from the R package geepack [49], to take into account any correlation in Y between twins within the same pair, three models are fitted:
Model 1: E(Y_self) = α + β_self X_self
Model 2: E(Y_self) = α + β_co-twin X_co-twin
Model 3: E(Y_self) = α + β′_self X_self + β′_co-twin X_co-twin
If the correlation between Y_self and X_co-twin is solely due to familial confounders (Fig. 1a), the marginal association between Y_self and X_self (β_self in model 1) and the marginal association between Y_self and X_co-twin (β_co-twin in model 2) must both be non-zero. Adjusting for X_self, however, the conditional association between Y_self and X_co-twin (β′_co-twin in model 3) is expected to attenuate from β_co-twin in model 2 towards the null. Similarly, adjusting for X_co-twin (model 3), the conditional association between Y_self and X_self (β′_self in model 3) is expected to attenuate from β_self in model 1 towards the null.
If the correlation between Y_self and X_co-twin is solely due to a causal effect from X to Y (Fig. 1b), Y_self and X_co-twin in model 2 will be associated through two pathways: the confounder S_X, and conditioning on the collider Y_co-twin (the GEE analysis in effect conditions on Y_co-twin). Conditioning on Y_co-twin induces a negative correlation between X_co-twin and Y_self (note that we assume X and Y are positively associated within an individual), so that β_co-twin in model 2 depends on the within-pair correlations in X (ρ_X) and in Y (ρ_Y): if ρ_X > ρ_Y, β_co-twin is expected to be positive; otherwise, β_co-twin is expected to be negative. Conditioning on X_self (model 3), both pathways are blocked and the conditional association (β′_co-twin in model 3) is expected to attenuate towards the null.
If the correlation between Y_self and X_co-twin is solely due to a causal effect from Y to X (Fig. 1c), in model 2 the pathway through S_X is blocked because X_self is a collider, and the pathway through S_Y is blocked because the GEE analysis in effect conditions on Y_co-twin, so there is no marginal association between Y_self and X_co-twin, and β_co-twin in model 2 is expected to be zero.
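The attenuation pattern expected when X causes Y can be illustrated by simulation. The sketch below is not the authors' implementation: it fits models 1 to 3 by ordinary least squares rather than GEE, so it ignores the within-pair correlation in Y and does not reproduce the collider effect described above. The data-generating values (a unit causal effect, standard normal noise) and all function names are assumptions made for illustration.

```python
import random


def _centered(values):
    m = sum(values) / len(values)
    return [v - m for v in values]


def fit_one(y, x):
    """Least-squares slope of y on x (intercept included)."""
    xc, yc = _centered(x), _centered(y)
    return sum(a * b for a, b in zip(xc, yc)) / sum(a * a for a in xc)


def fit_two(y, x1, x2):
    """Least-squares slopes of y on x1 and x2 jointly (intercept included)."""
    a, b, c = _centered(x1), _centered(x2), _centered(y)
    s11 = sum(u * u for u in a)
    s22 = sum(u * u for u in b)
    s12 = sum(u * v for u, v in zip(a, b))
    s1y = sum(u * v for u, v in zip(a, c))
    s2y = sum(u * v for u, v in zip(b, c))
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


def simulate_x_causes_y(n_pairs=2000, effect=1.0, seed=1):
    """Simulate twin pairs in which X causes Y and X is correlated within pairs."""
    rng = random.Random(seed)
    x_self, x_co, y_self = [], [], []
    for _ in range(n_pairs):
        shared = rng.gauss(0, 1)  # familial factor that influences X only
        x1, x2 = shared + rng.gauss(0, 1), shared + rng.gauss(0, 1)
        y1 = effect * x1 + rng.gauss(0, 1)
        y2 = effect * x2 + rng.gauss(0, 1)
        # Each twin appears once as 'self' and once as 'co-twin'.
        x_self += [x1, x2]
        x_co += [x2, x1]
        y_self += [y1, y2]
    return x_self, x_co, y_self


def ice_falcon_coefficients(x_self, x_co, y_self):
    """Coefficients of models 1 to 3 (least-squares stand-in for GEE)."""
    beta_self = fit_one(y_self, x_self)                         # model 1
    beta_co = fit_one(y_self, x_co)                             # model 2
    beta_self_adj, beta_co_adj = fit_two(y_self, x_self, x_co)  # model 3
    return {
        "beta_self": beta_self,
        "beta_co_twin": beta_co,
        "beta_self_adj": beta_self_adj,
        "beta_co_twin_adj": beta_co_adj,
    }
```

Under this data-generating process, the co-twin coefficient is clearly non-zero in model 2 and shrinks towards zero in model 3, while the coefficient for the individual's own X is essentially unchanged.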
We studied methylation at the identified CpGs and replicated CpGs, respectively. For each group of CpGs, methylation was analysed as a weighted methylation score, calculated as the sum of the products of methylation level and weight of each CpG. For a locus containing multiple CpGs, only the CpG with the smallest P value was included in the methylation score. For the identified CpGs, the methylation level was the standardised M value and the weight was the log odds ratio for smoking status. For the replicated CpGs, the methylation level was the Beta value, the scale used in the meta-analysis, and the weight was the Z statistic reported by Joehanes et al. [11]. Smoking status was analysed as a binary variable with current smokers as '1' and the rest as '0'. We first took smoking status as X and methylation score as Y and regressed methylation score on smoking status. We then exchanged X and Y to regress smoking status on methylation score and undertook the same analyses. The data for 132 twin pairs were used. We made statistical inference about the change in regression coefficient using a one-sided t test with a standard error computed using a nonparametric bootstrap method. That is, twin pairs were randomly sampled with replacement to generate 1000 new datasets with the same sample size as the original dataset. ICE FALCON was then applied to each dataset to calculate the change in regression coefficient for that dataset, and the standard error was estimated as the standard deviation of these changes.
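The weighted score and the pair-level bootstrap can be sketched as follows. This is a minimal illustration: the statistic passed to the bootstrap is left generic (in the analysis above it would be the change in regression coefficient between two ICE FALCON models), and the function names, the default seed and the use of the sample standard deviation are assumptions.

```python
import random
from statistics import stdev


def methylation_score(levels, weights):
    """Weighted score: sum over CpGs of methylation level times weight."""
    return sum(level * weight for level, weight in zip(levels, weights))


def bootstrap_se(pairs, statistic, n_boot=1000, seed=0):
    """Bootstrap standard error of statistic(pairs).

    Whole twin pairs are resampled with replacement so that the
    within-pair dependence is preserved in every bootstrap dataset.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        resample = [rng.choice(pairs) for _ in pairs]
        estimates.append(statistic(resample))
    return stdev(estimates)
```

Resampling pairs rather than individuals is the key design choice: it keeps each woman together with her co-twin, which the co-twin regressions require.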

Characteristics of the sample
The mean (standard deviation [SD]) age for the 479 women was 56.4 (7.9) years. The women included 291 (60.8%) never smokers, 147 (30.7%) former smokers and 41 (8.5%) current smokers. Ever smokers had a median (interquartile range) of 7.0 (13.8) pack-years. Former smokers had an average (SD) of 21.5 (11.4) years since quitting.
[Fig. 1 caption, partially recovered: b The cross-twin cross-trait correlation is due to the causal effect of X on Y. c The cross-twin cross-trait correlation is due to the causal effect of Y on X.]

Epigenome-wide analysis results
Methylation at 39 CpGs located at 27 loci was found to be associated with smoking status (Table 1; Q-Q plot and Manhattan plot in Fig. 2). Associations for 37 of the 39 CpGs have been reported by at least two studies, and associations for two CpGs, cg06226150 (SLC2A4RG) and cg21733098 (12q24.32), have only been reported from the meta-analysis performed by Joehanes et al. [11]. Of the 39 CpGs, at a 5% FDR, methylation at 18 CpGs was negatively associated with pack-years and methylation at 20 CpGs was positively associated with years since quitting. Methylation at 15 CpGs was associated with both pack-years and years since quitting (Table 2).

Replication for previously reported associations
Of the associations for 18,671 CpGs reported by Joehanes et al. [11], 1882 were replicated with a nominal P < 0.05 and in the same direction, and the 133 most significant associations also had an FDR < 0.05.
Of the 1882 replications, 1154 were for the novel CpGs reported by Joehanes et al. (Additional file 1: Table S1).

Between- and within-sibship analysis results
For the 39 identified CpGs, no evidence for a difference between β_B and β_W was found for any CpG (Table 3; all P values > 0.05 from the β_B and β_W comparison). The same results were found from the analyses of pack-years and years since quitting (Table 3).
For the 1882 replicated CpGs, no evidence for a difference between β_B and β_W was found for any CpG (Additional file 2: Table S2; the smallest P value = 1.3 × 10^−3 and the smallest FDR = 0.99 from the β_B and β_W comparison).
The ICE FALCON results for methylation at the replicated CpGs are shown in Table 4. From the analysis in which smoking status was the predictor and methylation score the outcome, a woman's methylation score was associated with her own smoking status (model 1; β_self = 74.6, 95% CI 55.3, 93.9), and negatively associated with her co-twin's smoking status (model 2; β_co-twin = −30.8, 95% CI −57.7, −4.0). Conditioning on her co-twin's smoking status (model 3), β′_self remained unchanged (P = 0.41) compared with β_self in model 1, while conditioning on her own smoking status (model 3), β_co-twin in model 2 attenuated by 123.3% (95% CI 49.6%, 185.2%; P = 0.002) to a β′_co-twin of 2.5 (95% CI −16.3, 21.3). From the analysis in which methylation score was the predictor and smoking status the outcome, a woman's smoking status was associated with her own methylation score (model 1; β_self = 4.1, 95% CI 2.7, 5.4), but not with her co-twin's methylation score (model 2; β_co-twin = 0.4, 95% CI −1.0, 1.8). In model 3, β′_self and β′_co-twin remained unchanged (both P > 0.1) compared with β_self in model 1 and β_co-twin in model 2, respectively. These results were consistent with smoking having a causal effect on the overall methylation level at these CpGs, but not with an effect in the opposite direction. Similar results were found and a similar causality was inferred for smoking status and the overall methylation level at the identified CpGs (Table 4).

Discussion
We performed an EWAS of smoking for a sample of middle-aged women and found 39 CpGs at which methylation was associated with smoking status. Our study confirmed the associations for several previously consistently reported loci including AHRR, F2RL3, 2q37.1, and 6p21.33, and for two novel CpGs, cg06226150 (SLC2A4RG) and cg21733098 (12q24.32), reported by the largest meta-analysis [11] so far. In addition, we replicated the associations for 1882 CpGs reported by the meta-analysis. The investigation of causation suggests that smoking has a causal effect on DNA methylation, rather than the reverse, and that the association is not due to familial confounding.
[Table 3 footnote: Regression coefficients and standard errors from the analyses for pack-years and years since quitting are reported multiplied by 100. *P value from comparing the between-sibship coefficient with the within-sibship coefficient.]
To the best of our knowledge, our study is the first to confirm the associations for cg06226150 and cg21733098. cg06226150 is located at the promoter of, and potentially regulates the expression of, SLC2A4RG (solute carrier family 2 member 4 regulator gene). SLC2A4RG is involved in the Gene Ontology pathway for regulation of transcription (GO:0006355). The protein encoded by SLC2A4RG regulates the activation of SLC2A4 (solute carrier family 2 member 4). SLC2A4 is involved in insulin-stimulated glucose transport across cell membranes. Genetic variants at SLC2A4RG have been found to be associated with inflammatory bowel disease [50] and prostate cancer [51]. cg21733098 is located in an intergenic region on 12q24.32. The region contains several long non-coding RNA genes. Little is known about the regulatory function of cg21733098. The biological relevance of smoking to blood methylation at these two CpGs is largely unknown, and more research is warranted.
We found evidence that 18 and 20 of the identified CpGs were also associated with pack-years and years since quitting, respectively. Given that smokers have lower methylation levels at the identified CpGs, the negative associations with pack-years imply that there appear to be dose-response relationships between smoking and methylation at the 18 CpGs, and the positive associations with years since quitting imply that methylation changes at the 20 CpGs tend to reverse after cessation. The dose-response relationship and the reversion have also been reported by several studies [4, 9-12, 15, 16, 19, 20, 22].
Our study is one of the first to provide insights into the causality underlying the cross-sectional association between smoking and blood DNA methylation. Our results are inconsistent with the proposition that the cross-sectional association is due to familial confounding, e.g. shared genes and/or environment. A role for shared genes and/or environment is also in part unsupported by the observation that certain smoking-related loci, such as AHRR and F2RL3, are observed across Europeans [3, 5, 8-11, 16, 19, 20, 22], South Asians [8], Arabian Asians [21], East Asians [12,23], and African Americans [7,11,13,18], who have different germline genetic backgrounds and environments. Our results support that smoking has a causal effect on the overall methylation at the identified CpGs and at the replicated CpGs, but not vice versa. Results from the two-step MR analysis performed by Jhun et al. [31] also suggest that differential methylation at cg03636183 (F2RL3) and cg19859270 (GPR15) between current and never smokers is a consequence of smoking, under the assumptions of MR.
That smoking causes changes in methylation is also supported to some extent by other evidence. The 'reversion' phenomenon is in line with the 'experimental evidence' criterion proposed by Bradford Hill, i.e. 'reducing or eliminating a putatively harmful exposure and seeing if the frequency of disease subsequently declines' [52]. The associations between cord blood methylation for newborns at some active-smoking-related loci, such as AHRR and GFI1, and maternal smoking in pregnancy [53] also imply that smoking is likely to cause methylation changes at these loci. Additionally, some smoking-related loci are involved in the metabolism of chemicals released by smoking. The AHRR gene encodes a repressor of the aryl hydrocarbon receptor (AHR) gene, the protein encoded by which is involved in the regulation of the biological response to planar aromatic hydrocarbons. Polycyclic aromatic hydrocarbons, one of the main smoking-related toxic and carcinogenic substances, trigger the AHR signalling cascade [16,22]. The protein coded by the AHR gene activates the expression of the AHRR gene, which in turn represses the function of AHR through a negative feedback mechanism [54]. It is therefore biologically plausible that hypomethylation at the AHRR gene is caused by smoking. That smoking causes changes in blood methylation has great clinical and etiological implications: methylation might mediate the effects of smoking on smoking-related health outcomes. As introduced above, there have been a few studies [26][27][28][29] investigating the mediating role of methylation. A better understanding of the mechanisms by which smoking affects health is expected with more investigations of methylation.
[Table 4 footnote: Regression coefficients and standard errors from the analyses in which the methylation score was the predictor are reported multiplied by 100.]
Our study shows the value of ICE FALCON in causality assessment for observational associations. Associations from observational studies can be due to confounding and, although analyses of measured potential confounders can eliminate some confounding, there is always the possibility of unmeasured confounding, even with prospective studies. With recent discoveries of genetic markers that predict variation in risk factors, the MR concept has been explored by epidemiologists. MR uses measured genetic variants as the instrumental variable, and the results of MR might be biased by several factors such as the strength of the instrumental variable, directional pleiotropy, and unmeasured confounding [55]. ICE FALCON is a novel approach to making inference about causation. It in effect uses the familial causes of exposure and of outcome as instrumental variables. The familial causes are not measured but are represented by the co-twin's measured exposure and outcome as surrogates. Thus, ICE FALCON resembles a bidirectional MR approach [56]. The instrumental variables capture all familial causes of exposure and of outcome, and are thus potentially less biased by instrument strength than a finite number of genetic markers. More importantly, even should directional pleiotropy exist, the attenuation in the coefficient for the co-twin's exposure after adjusting for an individual's own exposure also supports a causal effect.

Conclusions
We found evidence that in the peripheral blood from middle-aged women, DNA methylation at several loci is associated with smoking. By investigating causation underlying the association, our study found evidence consistent with smoking having a causal effect on methylation, but not vice versa.