Data source
Data were retrieved from the Taiwan Biobank database (2008–2015). The Taiwan Biobank was established in 2005 with the aim of integrating the genetic and medical information of about 200,000 ethnic Taiwanese with no history of cancer [29, 30]. The enrollment of individuals in the Taiwan Biobank Project conforms to relevant regulations and guidelines. A total of 29 recruitment centers are distributed across the country, with at least one center in each county or city [29]. Each participant signed an informed consent form before data collection. Data were collected by well-trained researchers through questionnaires, physical examinations, and biochemical analyses of blood and urine samples.
Data collection and study participants
Participants’ data on SOX2 promoter methylation, residence, age, smoking, exposure to SHS, exercise, drinking, body fat, BMI, WHR, asthma, and emphysema were extracted from the Taiwan Biobank dataset (2008–2015). Residence, age, smoking, exposure to SHS, exercise, drinking, asthma, and emphysema were self-reported while DNA methylation, body fat, BMI, and WHR were measured.
Venous blood (9 ml) was collected into sodium citrate tubes and transported to the lab at 4 °C. DNA was extracted from blood using the Chemagic™ Prime™ instrument, an automated chemical extraction machine that uses magnetized rods to separate nucleic acids from solutions. The DNA length was determined with the Fragment Analyzer (Agilent), while the purity was assessed using the optical density (OD) ratio at 260/280 nm. Samples with an OD 260/280 ratio of 1.6–2.0 were considered to be pure. Pure samples intended for long-term use were stored at −80 °C. DNA samples were subjected to sodium bisulfite treatment using the EZ DNA Methylation Kit (Zymo Research, CA, USA). DNA methylation at each CpG site was determined with the Infinium® MethylationEPIC BeadChip array (Illumina Inc.), which covers 850,000 methylation sites. Details on this platform are described elsewhere [31, 32, 33]. The methylation levels were expressed as beta-values (β), which range between 0 and 1. β-values were determined using the formula β = M/(M + U), where M is the methylated intensity and U is the unmethylated intensity. Quality control was done according to the Illumina® GenomeStudio® Methylation Module v1.8 [34]. Samples with detection P value > 0.05 and bead counts < 3 were eliminated. Dye-bias across batches was adjusted by normalization, and background correction was performed. Outliers were removed using the median absolute deviation method.
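The β-value formula and the median absolute deviation (MAD) rule can be illustrated with a minimal Python sketch. The function names, the cutoff of 3 scaled MADs, and the 1.4826 scaling constant are assumptions made for illustration; the text does not state the cutoff that was applied, and the example intensities and β-values are invented.

```python
from statistics import median


def beta_value(methylated: float, unmethylated: float) -> float:
    """Return beta = M / (M + U), the formula given in the text."""
    total = methylated + unmethylated
    if total <= 0:
        raise ValueError("M + U must be positive")
    return methylated / total


def mad_outlier_flags(values, threshold=3.0):
    """Flag values lying more than `threshold` scaled MADs from the median.

    The threshold of 3 and the 1.4826 scaling constant are assumptions;
    the source only names the MAD method.
    """
    values = list(values)
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # No spread around the median: nothing can be flagged by this rule.
        return [False] * len(values)
    scaled_mad = 1.4826 * mad
    return [abs(v - med) / scaled_mad > threshold for v in values]


# Invented example values, for illustration only.
example_beta = beta_value(300.0, 100.0)
example_flags = mad_outlier_flags([0.10, 0.11, 0.12, 0.11, 0.10, 0.90])
```

In this sketch only the last invented value (0.90) is flagged, because it lies far from the median of the other values.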
Participants were considered to be from an area if they had lived there for at least 3 months. The place of residence was taken as the place where participants were likely to be exposed to air pollution. The residences were grouped into northern and central/southern regions because PM2.5 pollution in the northern areas is lower than in the central and southern areas [16, 18, 19, 20]. The northern areas included Taipei and New Taipei Cities, while the central/southern areas included Taichung and Nantou Cities, Changhua and Yunlin Counties, Chiayi County, and Tainan and Chiayi Cities. PM2.5 statistics in these areas were obtained from the Air Quality Monitoring Database (AQMD) set up by the Environmental Protection Administration, Taiwan. There were 41 air quality monitoring stations in the study areas. Of these stations, 18 were located in the northern region while 23 were located in the central/southern regions. Annual PM2.5 readings (2006–2011) from the various stations were used to determine the annual average concentration for each region.
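The regional averaging step can be sketched as follows. This is only an illustration under stated assumptions: the record layout (region, year, annual station reading) and the function name are hypothetical, the readings are invented, and an unweighted mean across stations is assumed because the text does not specify any weighting.

```python
from collections import defaultdict
from statistics import mean


def regional_annual_means(readings):
    """Average annual station readings within each (region, year) pair.

    `readings` is an iterable of (region, year, value) tuples, one per
    monitoring station and year. An unweighted mean is an assumption.
    """
    grouped = defaultdict(list)
    for region, year, value in readings:
        grouped[(region, year)].append(value)
    return {key: mean(values) for key, values in grouped.items()}


# Invented readings (in µg/m³) for illustration; not AQMD data.
example_readings = [
    ("northern", 2006, 20.0),
    ("northern", 2006, 30.0),
    ("central/southern", 2006, 40.0),
]
example_means = regional_annual_means(example_readings)
```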
Participants were classified as follows: (1) non-smokers: never smoked or did not continuously smoke for 6 months or more; (2) former smokers: continuously smoked for at least 6 months but were currently not smoking; (3) current smokers: continuously smoked for 6 months or more and were currently smoking; (4) non-drinkers: no history of alcohol drinking or weekly drinking of less than 150 cc of alcohol continuously for 6 months; (5) former drinkers: abstained from drinking for over 6 months; (6) current drinkers: weekly drinking of at least 150 cc of alcohol continuously for 6 months; (7) physically active: exercised for > 150 min per week; and (8) exposed to SHS: exposure for at least 5 min per hour.
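The smoking and exercise definitions above amount to simple decision rules, sketched below. The function and argument names are hypothetical and describe questionnaire answers that are assumed to be already coded as booleans or minutes; the drinking and SHS categories are omitted because their coding would require further assumptions.

```python
def smoking_status(smoked_6_months_or_more: bool, currently_smoking: bool) -> str:
    """Classify smoking status following definitions (1)-(3) in the text."""
    if not smoked_6_months_or_more:
        # Never smoked, or never smoked continuously for 6 months or more.
        return "non-smoker"
    return "current smoker" if currently_smoking else "former smoker"


def physically_active(minutes_per_week: float) -> bool:
    """Definition (7): exercised for more than 150 min per week."""
    return minutes_per_week > 150
```

Under these rules, a participant who smoked continuously for a year and then stopped would be coded as a former smoker and therefore excluded from the analysis of non-smokers.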
After excluding former and current smokers as well as those with incomplete information, a total of 461 non-smokers comprising 176 men and 285 women were included in our study. A total of 24 CpG sites located in the SOX2 promoter region were available in the Taiwan Biobank dataset. These sites were cg24513480, cg05664581, cg19258425, cg18148179, cg07747133, cg24782772, cg00666105, cg04948892, cg15106134, cg02573703, cg20106776, cg11129008, cg08062338, cg12930100, cg01023203, cg22530053, cg27331851, cg14783675, cg01340005, cg17051733, cg11142406, cg09530873, cg25933341, and cg08464053. The β-values at the 24 SOX2 CpG sites were averaged to obtain a mean promoter methylation value for each participant. To assess whether SOX2 is a reliable biomarker of cancer risk in areas with air pollution, the mean β-value of CpG sites in the promoter of KRAS, another lung cancer risk gene, was also determined. Ethical approval for this study was obtained from the Chung Shan Medical University Institutional Review Board (CS2-17070).
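The per-participant summary described above (the mean β-value over the 24 SOX2 promoter CpG sites) can be sketched in Python. The site identifiers are those listed in the text; the function name and the dictionary input format are assumptions made for illustration, and the example β-values are invented.

```python
from statistics import mean

# CpG site identifiers as listed in the text.
SOX2_CPG_SITES = [
    "cg24513480", "cg05664581", "cg19258425", "cg18148179",
    "cg07747133", "cg24782772", "cg00666105", "cg04948892",
    "cg15106134", "cg02573703", "cg20106776", "cg11129008",
    "cg08062338", "cg12930100", "cg01023203", "cg22530053",
    "cg27331851", "cg14783675", "cg01340005", "cg17051733",
    "cg11142406", "cg09530873", "cg25933341", "cg08464053",
]


def mean_promoter_beta(site_betas, sites=SOX2_CPG_SITES):
    """Return the mean beta-value over the given CpG sites.

    `site_betas` maps a CpG identifier to one participant's beta-value.
    Raising on missing sites is a design choice of this sketch.
    """
    missing = [site for site in sites if site not in site_betas]
    if missing:
        raise KeyError(f"missing beta-values for: {missing}")
    return mean([site_betas[site] for site in sites])


# Invented beta-values for one hypothetical participant.
example_participant = {site: 0.5 for site in SOX2_CPG_SITES}
example_mean = mean_promoter_beta(example_participant)
```

The same function could be applied to the KRAS promoter sites by passing a different `sites` list; those identifiers are not given in the text and are therefore not included here.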
Statistical analysis
Data management and analyses were performed with the SAS 9.4 software (SAS Institute, Cary, NC). Continuous variables were compared using the t test and expressed as mean ± standard deviation (SD), while categorical variables were compared using the chi-square test and reported as percentages (%). Methylation data were normalized using the Illumina® GenomeStudio V2011.1 software [34]. The correction of cell-type heterogeneity was done with the R software using the Reference-Free Adjustment for Cell-Type composition (ReFACTor) method [35].
The association between SOX2 promoter methylation and residential area was determined using multiple linear regression. Multivariate adjustments were performed for age, exposure to SHS, exercise, drinking, body fat, BMI, WHR, asthma, emphysema, and cell-type composition. P values less than 0.05 were considered to be statistically significant.
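The form of the regression model can be illustrated with a small ordinary least squares sketch. The analyses in this study were run in SAS 9.4; the pure-Python solver below is not that implementation, only an assumption-laden illustration of fitting mean β-value on a region indicator plus covariates. The function name, the reduced covariate set (region and age only), and all data values are invented, and no P values are computed.

```python
def ols_fit(covariate_rows, outcomes):
    """Fit y = b0 + b1*x1 + ... by ordinary least squares.

    Solves the normal equations (X'X) b = X'y with Gaussian elimination.
    Returns [b0, b1, ...], where b0 is the intercept.
    """
    rows = [[1.0] + [float(x) for x in row] for row in covariate_rows]
    p = len(rows[0])
    # Build the augmented matrix [X'X | X'y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * y for r, y in zip(rows, outcomes))]
        for i in range(p)
    ]
    # Forward elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda idx: abs(aug[idx][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular or collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, p):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, p + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution.
    coefs = [0.0] * p
    for i in reversed(range(p)):
        tail = sum(aug[i][k] * coefs[k] for k in range(i + 1, p))
        coefs[i] = (aug[i][p] - tail) / aug[i][i]
    return coefs


# Invented data: region (0 = northern, 1 = central/southern) and age.
example_x = [[0, 30], [1, 40], [0, 50], [1, 60], [0, 45]]
example_y = [0.10 + 0.02 * region + 0.001 * age for region, age in example_x]
example_coefs = ols_fit(example_x, example_y)
```

In such a model, the coefficient on the region indicator estimates the adjusted difference in mean β-value between the central/southern and northern regions, holding the other covariates fixed.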