Open Access

Identification and validation of the methylation biomarkers of non-small cell lung cancer (NSCLC)

  • Shicheng Guo1, 9,
  • Fengyang Yan1,
  • Jibin Xu2,
  • Yang Bao3,
  • Ji Zhu4,
  • Xiaotian Wang1,
  • Junjie Wu1, 5,
  • Yi Li1,
  • Weilin Pu1,
  • Yan Liu6,
  • Zhengwen Jiang6,
  • Yanyun Ma1,
  • Xiaofeng Chen7,
  • Momiao Xiong8,
  • Li Jin1, 9 (corresponding author) and
  • Jiucun Wang1, 9 (corresponding author)
Clinical Epigenetics (The official journal of the Clinical Epigenetics Society) 2015, 7:3

DOI: 10.1186/s13148-014-0035-3

Received: 14 October 2014

Accepted: 10 December 2014

Published: 22 January 2015



DNA methylation has been suggested as a promising biomarker for lung cancer diagnosis. However, finding the optimal combination of methylation biomarkers to maximize diagnostic performance remains a great challenge.


In this study, we developed a panel of DNA methylation biomarkers and validated their diagnostic efficiency for non-small cell lung cancer (NSCLC) in a large retrospective cohort of Chinese Han NSCLC patients. Three high-throughput DNA methylation microarray datasets (458 samples) were collected in the discovery stage. After normalization, batch effect elimination and integration, significantly differentially methylated genes and the best combination of biomarkers were determined by a support vector machine (SVM) feature selection procedure with leave-one-out cross-validation. Candidate promoters were then examined by the methylation status determined single nucleotide primer extension technique (MSD-SNuPET) in an independent set of 150 paired NSCLC and normal tissues. Four statistical models with fivefold cross-validation were used to evaluate the performance of the discriminatory algorithms. The sensitivity, specificity and accuracy were 86.3%, 95.7% and 91%, respectively, in the Bayes tree model. A logistic regression model incorporating five gene methylation signatures (AGTR1, GALR1, SLC5A8, ZMYND10 and NTSR1), adjusted for age, sex and smoking, showed robust performance, with sensitivity, specificity, accuracy and area under the curve (AUC) of 78%, 97%, 87% and 0.91, respectively.


In summary, integrating high-throughput DNA methylation microarray datasets after batch effect elimination can be a good strategy for discovering optimal DNA methylation diagnostic panels. Methylation profiles of AGTR1, GALR1, SLC5A8, ZMYND10 and NTSR1 could form an effective methylation-based assay for NSCLC diagnosis.


Keywords: Non-small cell lung cancer; DNA methylation; Biomarker; Batch effect elimination; Diagnosis


Lung cancer, a complex disease involving both genetic and epigenetic changes, is the leading cause of cancer deaths worldwide [1]. About 80% of primary lung cancers are non-small cell lung carcinoma (NSCLC), which is characterized by a long asymptomatic latency and poor prognosis. While the overall 5-year survival rates for patients with late-stage NSCLC (stages III and IV) are just 5% to 14% and 1%, respectively, the rate can rise to 50% for patients with early-stage NSCLC, who are typically treated with surgery [2]. Many imaging and cytology-based strategies have been employed in NSCLC diagnosis; however, none has yet proven completely effective in reducing mortality. Advances in molecular profiling of NSCLC over the past decade have brought a paradigm shift in its diagnosis and treatment.

Among all genetic variations, single nucleotide polymorphisms (SNPs) have been considered the most stable biomarkers for heritable disease, since SNP status can be detected with almost 100% accuracy and remains unchanged throughout life. SNPs are specific and powerful predictors for single-gene diseases. However, for complex diseases such as cancers, the predictive power of SNPs is limited. Studies have shown that prediction models based on significant SNPs achieve AUCs of only 0.54 to 0.55 for non-small cell lung cancer [3] and 0.54 to 0.60 for thyroid cancer [4], which is considered one of the cancers with the highest familial risk. Molecular biomarkers such as mRNA, microRNA and protein have also been developed and investigated for NSCLC diagnosis in the past decades. However, their diagnostic accuracy remains far from the level required for clinical implementation, in which >90% sensitivity and specificity should be guaranteed.

DNA methylation, which is one of the most important mechanisms involved in gene and microRNA expression regulation [5] and in alternative gene splicing [6], plays important roles in the early stage of cancer. Because it is stable and easily detected qualitatively or quantitatively, DNA methylation was taken as the most promising diagnostic marker for the early detection of cancer [7,8] when compared with SNP/mutation [4], copy number variations (CNVs) [9] and gene/microRNA expression [10]. Hundreds of aberrant DNA methylation changes in the early stage of NSCLC have been identified in the past decades [11,12]. However, despite several diagnostic panels having been developed [13], these studies on DNA methylation in NSCLC were still limited by their small sample size, low number of selected genes and qualitative rather than quantitative DNA methylation. These limitations would cause low reproducibility of the assay and explain why the majority of these studies could not be replicated.

In our previous study, we found that prediction ability is limited when the prediction model includes the methylation status of only a single gene, even a classic tumor suppressor gene [14]. A diagnostic panel with several genes would be a promising approach to achieve better prediction performance for clinical utility. Methylation microarrays measure the methylation levels of thousands of genes in a single assay. These arrays are a revolutionary tool for identifying genes whose methylation changes in response to a specific situation, such as different developmental stages, physiological status or pathological status, and they provide the fundamental data for feature selection to construct the best combination of predictive variables. In addition, a large number of public methylation microarray datasets have been shared in databases such as the Gene Expression Omnibus (GEO). The stability and reproducibility of a prediction model would be significantly increased when multiple datasets with the same study design are pooled together. However, methylation array results can be greatly affected by a variety of nonbiological variables, such as methods for DNA isolation, bisulfite conversion, probe processing and scanning, reagents from different companies, different technicians or even different atmospheric ozone levels. Usually, the term ‘batch’ refers to microarrays processed at one site over a short period of time using the same platform. The cumulative error introduced by these time-, place- and situation-dependent variations is referred to as the batch effect. Because the methylation microarray data from different studies were generated at different times, in different places and by different technicians, the main variation among datasets appears as a batch effect.
In our previous study, we found that the ComBat algorithm could efficiently remove such noise (the batch signal of each individual study) from the dataset when the adjustment accounted for multiple batch-related effects [15,16], which provides the prerequisite for combining methylation array datasets to increase the sample size of the statistical analysis.
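To make the idea of a batch effect concrete, the following minimal sketch removes a per-batch shift from a single probe by centering each batch on the pooled mean. It is a deliberately simplified stand-in, not the ComBat algorithm itself, which also models scale and uses an empirical Bayes step; the function name `center_by_batch` and the toy values are assumptions made for illustration.

```python
from statistics import mean

def center_by_batch(values, batches):
    """Location-only batch adjustment for one probe (simplified, not ComBat).

    values  : methylation values for one probe, one per sample
    batches : batch label of each sample (same length as values)
    """
    grand_mean = mean(values)
    # Mean of the probe within each batch.
    batch_means = {
        b: mean([v for v, label in zip(values, batches) if label == b])
        for b in set(batches)
    }
    # Subtract the batch mean, then restore the pooled mean.
    return [v - batch_means[b] + grand_mean for v, b in zip(values, batches)]

# Hypothetical values: batch "B" is shifted upward by 0.3 relative to "A".
values = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60]
batches = ["A", "A", "A", "B", "B", "B"]
adjusted = center_by_batch(values, batches)
```

After the adjustment both batches share the same mean. In a real design, plain centering would also remove true biological differences whenever batch and tumor/normal status are confounded, which is one reason a dedicated method is preferable.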

In the present study, we first systematically integrated three independent high-throughput DNA methylation datasets from the GEO [17] and TCGA projects (Additional file 1: Table S1). After preliminary normalization and batch effect elimination among the datasets with the ComBat algorithm, an optimized DNA methylation combination was established through a feature selection procedure to maximize NSCLC prediction performance. The methylation statuses of five genes (AGTR1, GALR1, SLC5A8, ZMYND10 and NTSR1) were identified as the most powerful combination for NSCLC prediction. To further evaluate their diagnostic performance, we designed a novel assay, the methylation status determined single nucleotide primer extension technique (MSD-SNuPET), for the simultaneous quantification of methylation at these five loci. The five significantly differentially methylated genes were then validated with MSD-SNuPET in 150 pairs of NSCLC and normal tissues from a Chinese Han population.


Public dataset collection, batch effect elimination and candidate gene selection

NSCLC-related public DNA methylation microarrays were searched for in the Gene Expression Omnibus (GEO), ArrayExpress and TCGA projects. In total, three independent NSCLC datasets were assembled, comprising 458 microarrays from 352 NSCLC and 106 normal tissues (Figure 1 and Additional file 1: Table S1). A significant batch effect existed among the datasets and was visible in the first and second principal components: the samples clustered mainly by study rather than by tumor or normal status (Figure 2A). ComBat, an empirical Bayes method, was used to eliminate the batch effects after quantile normalization of the three datasets. As a result, the batch effect was largely removed (Figure 2B). In addition, hierarchical cluster analysis showed that biological information was highly preserved after batch effect elimination (Additional file 1: Figure S2). The SVM was used to conduct feature selection and to assess prediction ability with leave-one-out cross-validation. The accuracy of the SVM for classifying NSCLC was 98.98% in the test set. Among the 112 shared probes, five CpG sites (in NTSR1, SLC5A8, GALR1, AGTR1 and ZMYND10) were selected in the feature selection stage. These five genes were significantly differentially methylated between the tumor and normal tissue samples. In detail, meta-analysis of the DNA methylation microarrays showed that NTSR1 (P = 5.4 × 10^-15), SLC5A8 (P = 5.9 × 10^-9), GALR1 (P = 9.9 × 10^-10) and AGTR1 (P = 6.7 × 10^-5) were significantly hypermethylated in NSCLC, whereas ZMYND10 (P = 6.2 × 10^-20) was significantly hypomethylated in NSCLC (Additional file 1: Figure S3). These results suggested that the five selected predictors are potential biomarkers for NSCLC diagnosis.
To further evaluate their diagnostic performance, we developed a panel of these five DNA methylation biomarkers and validated their diagnostic efficiency in 150 paired NSCLC and normal tissue samples from China.
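Leave-one-out cross-validation, as used in the discovery stage, can be sketched in a few lines. To stay dependency-free, this sketch swaps in a nearest-centroid classifier in place of the SVM; the function names and the toy beta values are hypothetical and are not taken from the study data.

```python
def nearest_centroid_predict(train_x, train_y, sample):
    """Return the label whose class centroid is closest to `sample`."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(sample, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def leave_one_out_accuracy(x, y):
    """Hold out each sample once, train on the rest, score the held-out one."""
    correct = 0
    for i in range(len(x)):
        train_x = x[:i] + x[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if nearest_centroid_predict(train_x, train_y, x[i]) == y[i]:
            correct += 1
    return correct / len(x)

# Hypothetical beta values for two probes in six samples.
x = [[0.80, 0.10], [0.75, 0.15], [0.90, 0.20],
     [0.20, 0.70], [0.25, 0.65], [0.10, 0.80]]
y = ["tumor", "tumor", "tumor", "normal", "normal", "normal"]
loo_acc = leave_one_out_accuracy(x, y)
```

Each sample is predicted by a model that never saw it, so the resulting accuracy is less optimistic than accuracy measured on the training data.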
Figure 1

Sketch of the study design and pipeline. Candidate biomarkers were selected by meta-analysis of multiple high-throughput DNA methylation microarray datasets. The significant or best feature combination was then tested in an independent validation study of non-small cell lung cancer (NSCLC) with the methylation status determined single nucleotide primer extension technique (MSD-SNuPET).

Figure 2

ComBat treatment and methylation status determined single nucleotide primer extension technique (MSD-SNuPET). Principal component analysis was applied to show the efficiency of batch effect elimination by ComBat. A, B: a total of 120 probe sets with DNA methylation values after background correction and quantile normalization in a set of 352 non-small cell lung cancer (NSCLC) and 106 normal samples. The X and Y axes represent the first and second principal components (PC1 and PC2), respectively. C-I: validation of the methylation status of the five candidate markers in independent samples. The Y-axis represents the absolute DNA methylation percentage from MSD-SNuPET. LINE-1 and Reference were taken as the positive and negative controls for MSD-SNuPET.

Methylation status validation with methylation status determined single nucleotide primer extension technique

To validate the results from the meta-analysis, the methylation status of the above five genes was measured with MSD-SNuPET in 150 pairs of NSCLC and adjacent normal tissues. The characteristics of the patients are shown in Table 1. Consistent with the microarray data, the absolute DNA methylation percentages of these five genes differed significantly between NSCLC and normal tissues (Table 2, Figure 2C-I). Logistic regression analysis showed that hypermethylated NTSR1, SLC5A8, GALR1 and AGTR1 and hypomethylated ZMYND10 were significantly associated with NSCLC after adjustment for age, sex and smoking status, with P values of 5.9 × 10^-7, 7.8 × 10^-9, 2.3 × 10^-6, 1.3 × 10^-6 and 5.2 × 10^-8, respectively (Table 2). The MSD-SNuPET results showed that the methylation of LINE-1 was significantly lower in NSCLC than in normal tissue (t-test, P = 2.39 × 10^-12). Additionally, DNA methylation of LINE-1 was significantly associated with sex (R^2 = 0.18, P = 0.0087), which is highly consistent with previous reports on the methylation status of this element [18,19] and supports the credibility of MSD-SNuPET. The prediction ability of each gene separately was also evaluated by logistic regression. Moderate prediction ability was found, with sensitivity ranging from 44.3% to 73.15%, specificity from 79.59% to 94.56% and AUC from 0.67 to 0.80 (Table 2). Correlation analysis showed no co-methylation among the five genes. In addition, none of the five genes was significantly associated with age, smoking, TNM stage, lung cancer differentiation or lung cancer subtype (adenocarcinoma or squamous cell carcinoma) in either the univariate or the multivariate association models. However, significant associations between sex and SLC5A8 (P = 0.0001) and between sex and ZMYND10 (P = 0.045) were identified, which might indicate a specific biological mechanism of SLC5A8 and ZMYND10 in the tumorigenesis of NSCLC.
Protein-protein interaction networks from String 9.0 showed extensive networks for both NTSR1 and GALR1. The majority of the genes in these gene-gene interaction networks, such as S100A9, NGF, TAC1, CCK, FPR2, ADRA1B and CCL21, are cancer-related genes that have been reported to play important roles in cancer initiation, progression or therapy (Additional file 1: Figure S4).
Table 1

Characteristics of patients


NSCLC = 150
40 (IQR = 15 to 65)
Smoke status (a)
Non-smokers (never)
Smokers (ever)
Squamous cell carcinoma
42 (10,32)
48 (16,32)
46 (41,5)









NSCLC, non-small cell lung cancer. (a) Smokers include former and current smokers. (b) Others include adenosquamous carcinoma (ADSQ), bronchioloalveolar carcinoma, mucoepidermoid lung tumor and sarcomatoid carcinoma. (c) TNM stages were assessed by the seventh edition of the TNM classification criteria. (d) Qualitative assessment of tumor differentiation was based on the sum of the architecture score and the cytologic atypia score (2 = well differentiated, 3 = moderately differentiated, 4 = poorly differentiated).

Table 2

Differential methylation in non-small cell lung cancers (NSCLCs)



Gene            P value (b)      log10(OR) (95% CI)       P value (c)
AGTR1           1.06 × 10^-7     3.49 (2.08, 4.91)        1.30 × 10^-6
GALR1           6.58 × 10^-9     2.56 (1.5, 3.63)         2.30 × 10^-6
NTSR1           1.09 × 10^-9     9.02 (5.48, 12.55)       5.90 × 10^-7
SLC5A8          4.77 × 10^-12    3.80 (2.51, 5.09)        7.80 × 10^-9
ZMYND10         1.08 × 10^-7     -4.61 (-6.27, -2.95)     5.20 × 10^-8
LINE-1          2.39 × 10^-12    -10.3 (-13.5, -7.2)      1.80 × 10^-10
Reference (e)   2.85 × 10^-1     -19.37 (-45.35, 6.62)





(a) Differential methylation analysis was conducted between 150 NSCLC and adjacent normal tissues. AMP represents average methylation percentage. (b) Bonferroni-adjusted P value based on a paired t-test comparing the intensity of the methylation signals between case and control. (c) The log10(OR) and P value represent the log-transformed odds ratio and the P value from logistic regression adjusted for sex, age and smoking status. (d) Sensitivity, specificity and area under the curve (AUC) were calculated with a logistic regression prediction model without adjustment for sex, age and smoking status. (e) The reference site was a C site outside any CpG site; therefore, no or only a low methylation signal would be detected, and a nonsignificant association between cancer and normal tissues is expected.

Sensitivity, specificity and accuracy of the diagnosis panel

Several classification methods, including logistic regression, random forest, support vector machine (SVM) and Bayes tree, were used to construct diagnosis models for cancer prediction based on the MSD-SNuPET results. No significant imbalances were found between the training and test datasets, which suggested that the prediction models were credible and stable. Fivefold cross-validation was used to evaluate the performance of the classifiers. The Bayes tree was the most powerful model for the diagnosis of NSCLC, with sensitivity (Sen), specificity (Spe) and classification accuracy (Acc) of 86%, 96% and 91%, respectively (Table 3). The other classification methods had similar performance, and the weakest classifier was logistic regression. Even so, the logistic regression model incorporating the same five genes reached a sensitivity, specificity, classification accuracy and area under the curve (AUC) of 78%, 97%, 87% and 0.906 (95% CI: 0.89 to 0.91), respectively, after adjustment for age, sex and smoking, which still shows the potential diagnostic significance of the five methylated genes. In addition, prediction abilities in smoking versus non-smoking, adenocarcinoma versus squamous cell carcinoma, early-stage (I and II) versus late-stage (III and IV), and well or moderately versus poorly differentiated populations were assessed under the Bayes tree model.
We found no significant difference in performance between smokers (Acc = 0.921, 95% CI: 0.906 to 0.936) and non-smokers (Acc = 0.939, 95% CI: 0.935 to 0.943), between adenocarcinoma (Acc = 0.82, 95% CI: 0.72 to 0.92) and squamous cell carcinoma (Acc = 0.94, 95% CI: 0.87 to 0.95), or between early stage (Acc = 0.87, 95% CI: 0.75 to 0.87) and late stage (Acc = 0.92, 95% CI: 0.82 to 0.92). A significant difference (permutation test, P < 10^-10) was found between well or moderately differentiated (Acc = 0.9, 95% CI: 0.83 to 0.91) and poorly differentiated populations (Acc = 0.73, 95% CI: 0.5 to 0.74), which suggests that further research is warranted.
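The three performance measures reported above follow directly from the confusion matrix. The sketch below shows the arithmetic; the counts are hypothetical, chosen only so that they reproduce the rounded Bayes tree figures (86%, 96%, 91%) for 150 tumors and 150 normal tissues, and they are not taken from the study.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity and accuracy from confusion-matrix counts.

    tp / fn : tumors called tumor / tumors called normal
    tn / fp : normals called normal / normals called tumor
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy

# Hypothetical counts for 150 tumor and 150 normal samples.
sen, spe, acc = diagnostic_metrics(tp=129, fn=21, tn=144, fp=6)
```

With equal numbers of tumor and normal samples, accuracy is simply the mean of sensitivity and specificity, which is consistent with the figures in the text: (86.3% + 95.7%) / 2 = 91.0%.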
Table 3

Diagnosis accuracy, sensitivity and specificity based on several classification methods with fivefold cross-validation











Logistic regression
Random forest
Bayes tree







(a) SVM represents support vector machines and kernel methods. Sensitivity, specificity and classification accuracy are mean values over fivefold cross-validation with 1,000 replications. In the main body of the manuscript, sensitivity, specificity and accuracy were derived from the training result of the classification.


Early diagnosis of NSCLC and corresponding surgical intervention are regarded as the most effective means of increasing survival time and decreasing mortality from NSCLC. Since global changes in DNA methylation occur at the beginning of carcinogenesis, DNA methylation has been considered the most powerful biomarker for early detection and even screening [20]. In the present study, a two-stage biomarker discovery pipeline was applied to optimize the combination of DNA methylation biomarkers for NSCLC diagnosis. The optimal biomarker combination was identified from 107 genes in a large discovery dataset, yielding a novel DNA methylation diagnosis panel of five genes (NTSR1, SLC5A8, GALR1, AGTR1 and ZMYND10). The panel was then validated in an independent NSCLC study. A multi-locus DNA methylation detection method (MSD-SNuPET) was used to determine the absolute quantitative methylation level of the five genes in 150 pairs of NSCLC and adjacent normal tissues from a Chinese Han population. In the validation stage, the Bayes tree model showed the highest sensitivity, specificity and accuracy for NSCLC diagnosis based on the five genes, which indicates potential for clinical application.

Importantly, all five candidate biomarkers have been investigated widely in cancer research. Neurotensin receptor 1 (NTSR1) is a G-protein coupled receptor (GPCR). It has been widely reported to be associated with carcinogenesis, cancer progression [21] and prognosis [22,23]. Previous evidence showed the potential use of NTSR1 as a biomarker for cancer progression and as a component of personalized medicine in selected cancers [24], which is consistent with our present result. GALR1, galanin receptor subtype 1, suppresses cell proliferation in several cancers such as head and neck [25,26] and oral squamous cell carcinoma [27]. Inactivation of GALR1 expression can be caused by promoter hypermethylation [25]. GALR1 has also been reported as a subtype-determining gene in breast cancer, which suggests a potentially powerful role in cancer diagnosis. SLC5A8 (solute carrier family 5, member 8) is a tumor suppressor gene that is usually suppressed in colon and gastric cancers [28-30]. ZMYND10 (zinc finger, MYND-type containing 10) has recently been identified as a candidate tumor suppressor gene because of missense mutations and loss of its expression in lung cancer.

Tissue composed of multiple cell types is a great challenge in epigenetic studies. On the one hand, cancer tissues include cancer cells (epithelial cells), mesenchymal cells and other cell types; however, the proportion of tumor cells in cancer tissue (at least 70% in general) is much higher than that of the other cells. On the other hand, normal tissues also include epithelial cells, mesenchymal cells and others. In the present study, the null hypothesis is that the methylation level in cancer tissue (mixed cells) is the same as in normal tissue (mixed cells); the alternative hypothesis is that the two differ. We used the paired t-test to test the difference in mean methylation between cancer tissue and normal tissue. The background noise from adjacent non-cancer cells could be adjusted for once the methylation profiles of the corresponding cell types are established.
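The paired t-test described above works on the within-patient differences between tumor and adjacent normal tissue. A minimal sketch of the test statistic follows; the function name and the four pairs of methylation values are hypothetical, and converting the statistic to a P value (t distribution with n - 1 degrees of freedom) is left out.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(tumor, normal):
    """t statistic for paired samples, with n - 1 degrees of freedom.

    tumor, normal : methylation levels for the same patients, in the same order
    """
    diffs = [t - n for t, n in zip(tumor, normal)]
    n = len(diffs)
    # Mean difference divided by its standard error.
    return mean(diffs) / (stdev(diffs) / sqrt(n))

# Hypothetical methylation proportions for four tumor/normal pairs.
tumor = [0.6, 0.7, 0.5, 0.8]
normal = [0.3, 0.2, 0.3, 0.4]
t_stat = paired_t_statistic(tumor, normal)
```

Under the null hypothesis of equal mean methylation the differences center on zero, so a large absolute value of the statistic is evidence for differential methylation.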

All the results in the present study were based on quantitative DNA methylation signals. We also conducted analyses based on discrete DNA methylation signals, in which beta values <0.2 were defined as unmethylated CpGs, beta values >0.8 as fully methylated CpGs, and beta values between 0.2 and 0.8 as semi-methylated CpGs. Under this definition, the five genes were still significantly differentially methylated between NSCLC and normal tissues, and no significant changes were found in classification sensitivity, specificity and accuracy. The sensitivity, specificity and AUC of diagnosis with one gene added to the model at a time are summarized in Additional file 1: Figure S5; sensitivity and AUC increased gradually with each added gene.
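The discretization rule above is easy to state in code. In this minimal sketch the thresholds 0.2 and 0.8 come from the text, while the function name and the choice to count values exactly equal to a threshold as semi-methylated are assumptions.

```python
def methylation_state(beta, low=0.2, high=0.8):
    """Map a beta value in [0, 1] to a discrete methylation state."""
    if beta < low:
        return "unmethylated"
    if beta > high:
        return "fully methylated"
    # Values from `low` to `high` inclusive (boundary handling is an assumption).
    return "semi-methylated"

# Hypothetical beta values for three CpG sites.
states = [methylation_state(b) for b in (0.05, 0.45, 0.93)]
```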

Lung cancer diagnosis is a challenging problem. To discover a potential panel of DNA methylation-based biomarkers for the diagnosis of NSCLC, one should perform a genome-wide search for an optimal combination of tens or hundreds of loci from the genome-wide DNA methylation profile. Integrated analysis of interplatform, genome-wide DNA methylation datasets with appropriate data normalization and batch effect elimination could provide optimal biomarker combinations in a large sample population to obtain maximum diagnostic efficiency. With this approach, we identified a five-gene signature comprising AGTR1, GALR1, SLC5A8, ZMYND10 and NTSR1, which could provide high diagnostic sensitivity and specificity.


Integrated analysis of multiple-platform high-throughput DNA methylation microarray datasets followed by batch effect elimination is a good approach to discover diagnostic biomarker panels for NSCLC. Methylation profiles of AGTR1, GALR1, SLC5A8, ZMYND10 and NTSR1 would be an effective methylation-based assay for the NSCLC diagnosis.


Study design and pipeline description

Public high-throughput microarray databases, including GEO and ArrayExpress, were searched to collect NSCLC-related DNA methylation microarray data. 'Non-small cell lung cancer' and/or 'methylation' were used as the key words in the retrieval procedure. Although a large number of studies have been conducted in NSCLC biomarker research, only two GSE records were retrieved: GSE16559 and GSE28094. GSE16559, which included 57 NSCLC and 52 normal tissue samples, was used to discover aberrant DNA methylation in lung adenocarcinoma and mesothelioma. GSE28094, with 33 NSCLC and 3 normal tissue samples, was designed to build a DNA methylation fingerprint from 1,628 human samples of different tissues and statuses. Both datasets were based on the Illumina GoldenGate platform, which includes 371 genes with 1,536 loci. Additionally, the TCGA project is another comprehensive study that included 262 NSCLC and 51 normal tissue samples, profiled with the Infinium methylation 27 K array covering 14,495 genes and 27,578 loci. The two methylation microarray platforms shared 107 genes (112 probes). In total, DNA methylation profiling data of 458 NSCLC-associated samples (352 NSCLC and 106 normal tissue) were obtained from the above three public datasets. These data were taken as the primary data in the biomarker discovery stage (Additional file 1: Table S1).

When the microarray data were provided as fluorescent signals, the gene methylation level was calculated from the fluorescent signals of the methylated and unmethylated alleles by the traditional function
$$ \mathrm{beta}=\frac{\max\left(\mathrm{M},0\right)}{\max\left(\mathrm{M},0\right)+\max\left(\mathrm{U},0\right)}, $$
where M and U represent the signal intensities for about 30 methylated (M) and un-methylated (U) probes on the array. Background correction was conducted according to the recommended methods for each platform. K-nearest neighbor imputation (KNN imputation) was performed to deal with missing values. A total of 112 probes were shared between the two microarray platforms. DNA methylation signals of these probes were combined for all the samples. Quantile normalization was applied to combine all the data from the different studies. To further reduce biases, we used the batch effect elimination tool ComBat to eliminate the batch effects that exist among independent datasets [15]. We used principal component analysis (PCA) to visualize the extent of batch effect elimination by observing the distribution of batch information in the two-dimensional plot of principal component 1 (PC1) and principal component 2 (PC2). The data adjusted by ComBat were then used for the feature selection procedure in classification and for differential methylation analysis. Feature selection was conducted by random forest and SVM with leave-one-out cross-validation. Differential methylation analysis was conducted by the Wilcoxon signed-rank test, which does not assume normality. The most powerful panel was identified and the differential methylation status was estimated. In the validation stage, the methylation status of the genes in this panel was measured in 150 NSCLC and normal tissues from the Chinese Han population by MSD-SNuPET. Logistic regression, random forest, support vector machine (SVM) and Bayes tree were used to classify NSCLC in the validation data with fivefold cross-validation.
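Two of the preprocessing steps above, the beta value and quantile normalization, can be sketched as follows. The beta formula is the one given in the text; returning `None` when both signals are non-positive, the tie handling in the normalization, and all names and toy values are assumptions made for this illustration.

```python
def beta_value(m, u):
    """beta = max(M, 0) / (max(M, 0) + max(U, 0))."""
    m_pos, u_pos = max(m, 0.0), max(u, 0.0)
    total = m_pos + u_pos
    if total == 0:
        return None  # assumption: treat as missing when both signals are <= 0
    return m_pos / total

def quantile_normalize(samples):
    """Force every sample to share one distribution (mean of sorted values).

    samples : list of equal-length lists, one list of probe values per sample.
    Ties are broken by probe order, which is a simplification.
    """
    n_probes = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Reference distribution: average of the r-th smallest value across samples.
    rank_means = [sum(s[r] for s in sorted_samples) / len(samples)
                  for r in range(n_probes)]
    normalized = []
    for s in samples:
        order = sorted(range(n_probes), key=lambda i: s[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = rank_means[rank]
        normalized.append(out)
    return normalized

# Hypothetical intensities and beta values.
b = beta_value(300.0, 100.0)
norm = quantile_normalize([[0.1, 0.5, 0.3], [0.4, 0.8, 0.6]])
```

After normalization each sample keeps its own probe ranking but takes its values from the shared reference distribution, which makes arrays from different studies comparable before batch adjustment.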

Patients, samples and DNA

NSCLC samples and corresponding normal lung tissues for the validation study in the Chinese population were obtained from 150 patients who underwent pulmonary resection for primary NSCLC at Changhai Hospital, Shanghai, China. The study was approved by Fudan University and Changhai Hospital, and informed consent was obtained from the patients. Exclusion criteria included a family history of lung cancer, previous radiotherapy, and chemotherapy or adjuvant therapy before surgery. All tissues were immediately frozen at -80°C after surgical resection. Histological examination and tumor-node-metastasis classification were conducted according to the World Health Organization classification criteria [31] and the AJCC Cancer Staging Manual, 7th Edition [32], respectively. Age, sex, smoking status, histology type, TNM stage and differentiation status were collected for use as covariates when testing the association between DNA methylation and disease status. Smoking status was coded as a binary variable: never or ever smoking. TNM stage was grouped into early stage (I and II) or late stage (III and IV) when necessary, so that the sample size was large enough to provide sufficient statistical power.

Methylation status-dependent single nucleotide primer extension assay

DNA extraction and bisulfite conversion were performed as previously described [33,34]. The methylation status determined single nucleotide primer extension technique (MSD-SNuPET) was designed for the simultaneous quantification of methylation at multiple loci. MSD-SNuPET was developed by applying SNaPshot technology to bisulfite-converted CpG sites. An unmethylated cytosine is converted to uracil when treated with bisulfite, whereas a methylated cytosine remains cytosine. Therefore, methylation status can be detected by specific primers and PCR amplification. Primer 3.0 was used to design primer sets (called the amplifying primers), which were applied to amplify the genome regions containing the target CpG sites. Allele-specific elongation primers were used to quantify the copy number of the C and T alleles. Primer pairs are shown in Additional file 1: Table S2. PCR was performed in a final volume of 10 μl containing 1× HotStarTaq buffer, 3.0 mM Mg2+, 0.3 mM dNTP, 1 U HotStarTaq polymerase (Qiagen Inc. USA), 1 μl DNA template and 1 μl multiple primer set. Amplifications were conducted in a GeneAmp PCR System 9700 thermal cycler (Applied Biosystems, Foster City, CA) with the following thermal cycling profile: denaturation for 2 min at 95°C, followed by 11 cycles, each consisting of 20 sec at 94°C, 40 sec at 60°C and 90 sec at 72°C, and a final extension step for 2 min at 72°C. Negative and positive controls were included in each PCR run as described above. The products of the sequencing reactions were purified, and SNaPshot analysis of single nucleotide extension for multiple loci was performed as in our previous work [35]. DNA sequencing was conducted with the 3730 DNA analyzer. GeneMapper 4.1 (Applied Biosystems, Co., Ltd., USA) was used to analyze the fluorescence signals that represent different alleles.
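The chemistry that the assay relies on can be illustrated in silico: after bisulfite treatment and PCR, an unmethylated cytosine is read as T, while a methylated cytosine is still read as C. The function name, the toy sequence and the 0-based position convention below are assumptions for illustration only.

```python
def bisulfite_convert(sequence, methylated_positions):
    """Return the sequence as read after bisulfite treatment and PCR.

    sequence             : DNA string of A, C, G, T (one strand)
    methylated_positions : set of 0-based indices of methylated cytosines
    """
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(sequence)
    )

# Hypothetical fragment with two CpG sites; only the first C is methylated.
converted = bisulfite_convert("ACGTCG", {1})
```

The C/T difference at the target CpG is what the allele-specific extension primers then quantify.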
In the MSD-SNuPET technique, the DNA methylation level is positively correlated with the magnitude of the C allele ($H_C$) and negatively correlated with the magnitude of the T allele ($H_T$) (Additional file 1: Figure S1). To estimate the methylation level quantitatively for each CpG site, a standard calibration curve was established, in which synthetic DNA fragments of the C and T alleles were mixed at C allele proportions of 10%, 20%, 30%, 35%, 40%, 50%, 60%, 70%, 75%, 80% and 90%. A standard calibration curve was then fitted as a quadratic regression model, $y = \beta_0 x^2 + \beta_1 x$, in which $\beta_0$ and $\beta_1$ are the fitted parameters and $x$ is the ratio of the C and T alleles ($H_C/H_T$). In the present study, one technical control and one biological control were set. The reference site was a C site outside any CpG site; therefore, a low methylation signal should be detected and no significant association should be found between cancer and normal samples. The methylation status of LINE-1 was taken as the biological control, since it is known to be hypomethylated in cancer tissues.
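A quadratic calibration curve through the origin, y = β0·x² + β1·x, can be fitted by ordinary least squares using the two normal equations. The sketch below is one way to do this and is not necessarily how the authors fitted their curve; the function name is hypothetical, and the points are synthetic, generated from a known curve so that the fit can be checked.

```python
def fit_calibration(x, y):
    """Least-squares fit of y = b0 * x**2 + b1 * x (no intercept).

    Solves the 2x2 normal equations
        [sum x^4  sum x^3] [b0]   [sum x^2 y]
        [sum x^3  sum x^2] [b1] = [sum x y  ]
    """
    sx2 = sum(v ** 2 for v in x)
    sx3 = sum(v ** 3 for v in x)
    sx4 = sum(v ** 4 for v in x)
    sxy = sum(a * b for a, b in zip(x, y))
    sx2y = sum(a ** 2 * b for a, b in zip(x, y))
    det = sx4 * sx2 - sx3 ** 2
    b0 = (sx2y * sx2 - sxy * sx3) / det
    b1 = (sx4 * sxy - sx3 * sx2y) / det
    return b0, b1

# Synthetic calibration points from a known curve (b0 = -0.05, b1 = 0.4).
ratios = [0.2, 0.5, 1.0, 2.0, 3.0]
levels = [-0.05 * r ** 2 + 0.4 * r for r in ratios]
b0, b1 = fit_calibration(ratios, levels)
```

Once the two coefficients are fitted on the calibration mixtures, the measured allele ratio of a sample can be plugged into the curve to read off its estimated methylation level.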

Statistical analysis and machine learning

We selected methylated genes for classification by ranking genes by their P values for differential methylation between tumor and normal tissue samples. Three tests were used: Student's t-test for normally distributed methylation levels, the Wilcoxon rank sum test for nonpaired tumor and normal tissue samples and the Wilcoxon signed rank test for paired tumor and normal tissue samples. False discovery rate (FDR) correction for multiple testing was performed with the R function p.adjust with the method set to fdr. Euclidean distance and partitioning around medoids were used to conduct hierarchical cluster analysis. Logistic regression (package stats), support vector machine (SVM, package e1071), random forest (package randomForest) and Bayes tree (package BayesTree) were used to classify the NSCLC tumor and normal tissues. For each model, the parameters giving the best prediction accuracy in the training dataset were selected, and the sensitivity, specificity and accuracy of the logistic regression, SVM, random forest and Bayes tree models were then obtained in the test dataset with these parameters. All statistical analyses were conducted in R [36]. Protein-protein interaction networks were constructed with STRING 9.0 to show the functional network of the genes in our study [37].
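The fdr option of p.adjust corresponds to the Benjamini-Hochberg procedure. The analyses in this study were run in R; the Python sketch below only illustrates what that correction computes. The function name `bh_adjust`, the gene labels and the P values are hypothetical and are not results from the study.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    n = len(pvalues)
    # Visit P values from largest to smallest so that a running minimum
    # enforces monotonicity of the adjusted values.
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for steps_from_top, idx in enumerate(order):
        rank = n - steps_from_top  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up P values for three placeholder genes.
raw = {"gene_a": 0.001, "gene_b": 0.5, "gene_c": 0.04}
adjusted = dict(zip(raw, bh_adjust(list(raw.values()))))

# Rank genes by adjusted P value, most significant first.
ranking = sorted(adjusted, key=adjusted.get)
```

Genes would then be carried forward to classification in the order given by `ranking`, subject to whatever FDR threshold is chosen.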



Abbreviations

AGTR1: angiotensin II receptor, type 1
AUC: area under the curve
GALR1: galanin receptor 1
LINE-1: long interspersed element-1
MSD-SNuPET: methylation status determined single nucleotide primer extension technology
NSCLC: non-small cell lung cancer
NTSR1: neurotensin receptor 1
SLC5A8: solute carrier family 5, member 8
TCGA: The Cancer Genome Atlas Project
ZMYND10: zinc finger, MYND-type containing 10



Acknowledgements

We thank all participating subjects for their kind cooperation in this study. We thank Dr. Hongyan Xu (Department of Biostatistics and Epidemiology, Georgia Regents University) and Dr. Yan Sun (Department of Biomedical Informatics, School of Medicine, Emory University) for their critical review and comments. This research was supported by the Science and Technology Committee of Shanghai Municipality (11DJ1400102), National High-Tech Research and Development Program (2012AA021802), National Science Foundation of China (NSFC, 81172228, 30890034), Ministry of Science and Technology (2011BAI09B00), and 111 Project (B13016). The computations involved in this study were supported by Fudan University High-End Computing Center.

Authors’ Affiliations

(1) State Key Laboratory of Genetic Engineering and Ministry of Education Key Laboratory of Contemporary Anthropology, Collaborative Innovation Center for Genetics and Development, School of Life Sciences and Institutes of Biomedical Sciences, Fudan University Jiangwan Campus
(2) Department of Cardiothoracic Surgery, Changzheng Hospital of Shanghai
(3) Yangzhou No.1 People’s Hospital
(4) Department of Cardiothoracic Surgery, Changhai Hospital of Shanghai
(5) Department of Pneumology, Changhai Hospital of Shanghai
(6) Center for Genetic & Genomic Analysis, Genesky Biotechnologies Inc.
(7) Department of Cardiothoracic Surgery, Huashan Hospital, Fudan University
(8) Human Genetics Center, The University of Texas School of Public Health
(9) Fudan-Taizhou Institute of Health Sciences


References

  1. Siegel R, Naishadham D, Jemal A. Cancer statistics, 2012. CA Cancer J Clin. 2012;62:10–29.
  2. Hankey BF, Ries LA, Edwards BK. The surveillance, epidemiology, and end results program: a national resource. Cancer Epidemiol Biomarkers Prev. 1999;8:1117–21.
  3. Li H, Yang L, Zhao X, Wang J, Qian J, Chen H, et al. Prediction of lung cancer risk in a Chinese population using a multifactorial genetic model. BMC Med Genet. 2012;13:118.
  4. Guo S, Wang YL, Li Y, Jin L, Xiong M, Ji QH, et al. Significant SNPs have limited prediction ability for thyroid cancer. Cancer Med. 2014;3:731–5.
  5. He Y, Cui Y, Wang W, Gu J, Guo S, Ma K, et al. Hypomethylation of the hsa-miR-191 locus causes high expression of hsa-mir-191 and promotes the epithelial-to-mesenchymal transition in hepatocellular carcinoma. Neoplasia. 2011;13:841–53.
  6. Flores K, Wolschin F, Corneveaux JJ, Allen AN, Huentelman MJ, Amdam GV. Genome-wide association between DNA methylation and alternative splicing in an invertebrate. BMC Genomics. 2012;13:480.
  7. Zhao Y, Sun J, Zhang H, Guo S, Gu J, Wang W, et al. High-frequency aberrantly methylated targets in pancreatic adenocarcinoma identified via global DNA methylation analysis using methylCap-seq. Clin Epigenetics. 2014;6:18.
  8. Laird PW. The power and the promise of DNA methylation markers. Nat Rev Cancer. 2003;3:253–66.
  9. Jiang F, Todd NW, Li R, Zhang H, Fang H, Stass SA. A panel of sputum-based genomic marker for early detection of lung cancer. Cancer Prev Res (Phila). 2010;3:1571–8.
  10. Zhu J, Yao X. Use of DNA methylation for cancer detection: promises and challenges. Int J Biochem Cell Biol. 2009;41:147–54.
  11. Zhao Y, Zhou H, Ma K, Sun J, Feng X, Geng J, et al. Abnormal methylation of seven genes and their associations with clinical characteristics in early stage non-small cell lung cancer. Oncol Lett. 2013;5:1211–8.
  12. Anglim PP, Alonzo TA, Laird-Offringa IA. DNA methylation-based biomarkers for early detection of non-small cell lung cancer: an update. Mol Cancer. 2008;7:81.
  13. Nikolaidis G, Raji OY, Markopoulou S, Gosney JR, Bryan J, Warburton C, et al. DNA methylation biomarkers offer improved diagnostic efficiency in lung cancer. Cancer Res. 2012;72:5692–701.
  14. Guo S, Tan L, Pu W, Wu J, Xu K, Li Q, et al. Quantitative assessment of the diagnostic role of APC promoter methylation in non-small cell lung cancer. Clin Epigenetics. 2014;6:5.
  15. Chen C, Grennan K, Badner J, Zhang D, Gershon E, Jin L, et al. Removing batch effects in analysis of expression microarray data: an evaluation of six batch adjustment methods. PLoS One. 2011;6:e17238.
  16. Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007;8:118–27.
  17. Edgar R, Domrachev M, Lash AE. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002;30:207–10.
  18. El-Maarri O, Walier M, Behne F, van Üüm J, Singer H, Diaz-Lacava A, et al. Methylation at global LINE-1 repeats in human blood are affected by gender but not by age or natural hormone cycles. PLoS One. 2011;6:e16252.
  19. El-Maarri O, Becker T, Junen J, Manzoor SS, Diaz-Lacava A, Schwaab R, et al. Gender specific differences in levels of DNA methylation at selected loci from human total blood: a tendency toward higher methylation levels in males. Hum Genet. 2007;122:505–14.
  20. Tsou JA, Hagen JA, Carpenter CL, Laird-Offringa IA. DNA methylation analysis: a powerful new tool for lung cancer diagnosis. Oncogene. 2002;21:5450–61.
  21. Heakal Y, Woll MP, Fox T, Seaton K, Levenson R, Kester M. Neurotensin receptor-1 inducible palmitoylation is required for efficient receptor-mediated mitogenic-signaling within structured membrane microdomains. Cancer Biol Ther. 2011;12:427–35.
  22. Valerie NC, Casarez EV, Dasilva JO, Dunlap-Brown ME, Parsons SJ, Amorino GP, et al. Inhibition of neurotensin receptor 1 selectively sensitizes prostate cancer to ionizing radiation. Cancer Res. 2011;71:6817–26.
  23. Alifano M, Souaze F, Dupouy S, Camilleri-Broet S, Younes M, Ahmed-Zaid SM, et al. Neurotensin receptor 1 determines the outcome of non-small cell lung cancer. Clin Cancer Res. 2010;16:4401–10.
  24. Dupouy S, Mourra N, Doan VK, Gompel A, Alifano M, Forgez P. The potential use of the neurotensin high affinity receptor 1 as a biomarker for cancer progression and as a component of personalized medicine in selective cancers. Biochimie. 2011;93:1369–78.
  25. Misawa K, Ueda Y, Kanazawa T, Misawa Y, Jang I, Brenner JC, et al. Epigenetic inactivation of galanin receptor 1 in head and neck cancer. Clin Cancer Res. 2008;14:7604–13.
  26. Kanazawa T, Kommareddi PK, Iwashita T, Kumar B, Misawa K, Misawa Y, et al. Galanin receptor subtype 2 suppresses cell proliferation and induces apoptosis in p53 mutant head and neck cancer cells. Clin Cancer Res. 2009;15:2222–30.
  27. Henson BS, Neubig RR, Jang I, Ogawa T, Zhang Z, Carey TE, et al. Galanin receptor 1 has anti-proliferative effects in oral squamous cell carcinoma. J Biol Chem. 2005;280:22564–71.
  28. Park JY, Helm JF, Zheng W, Ly QP, Hodul PJ, Centeno BA, et al. Silencing of the candidate tumor suppressor gene solute carrier family 5 member 8 (SLC5A8) in human pancreatic cancer. Pancreas. 2008;36:e32–9.
  29. Ueno M, Toyota M, Akino K, Suzuki H, Kusano M, Satoh A, et al. Aberrant methylation and histone deacetylation associated with silencing of SLC5A8 in gastric cancer. Tumour Biol. 2004;25:134–40.
  30. Miyauchi S, Gopal E, Fei YJ, Ganapathy V. Functional identification of SLC5A8, a tumor suppressor down-regulated in colon cancer, as a Na(+)-coupled transporter for short-chain fatty acids. J Biol Chem. 2004;279:13293–6.
  31. Gibbs AR, Thunnissen FB. Histological typing of lung and pleural tumours: third edition. J Clin Pathol. 2001;54:498–9.
  32. Edge SB, Compton CC. The American Joint Committee on Cancer: the 7th edition of the AJCC cancer staging manual and the future of TNM. Ann Surg Oncol. 2010;17:1471–4.
  33. Zhao Y, Guo S, Sun J, Huang Z, Zhu T, Zhang H, et al. Methylcap-seq reveals novel DNA methylation markers for the diagnosis and recurrence prediction of bladder cancer in a Chinese population. PLoS One. 2012;7:e35175.
  34. Wang X, Wang L, Guo S, Bao Y, Ma Y, Yan F, et al. Hypermethylation reduces expression of tumor-suppressor PLZF and regulates proliferation and apoptosis in non-small-cell lung cancers. FASEB J. 2013;27:4194–203.
  35. Wang YL, Feng SH, Guo SC, Wei WJ, Li DS, Wang Y, et al. Confirmation of papillary thyroid cancer susceptibility loci identified by genome-wide association studies of chromosomes 14q13, 9q22, 2q35 and 8p12 in a Chinese population. J Med Genet. 2013;50:689–95.
  36. Dessau RB, Pipper CB. “R”–project for statistical computing. Ugeskr Laeger. 2008;170:328–30.
  37. Szklarczyk D, Franceschini A, Kuhn M, Simonovic M, Roth A, Minguez P, et al. The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Res. 2011;39:D561–8.


© Guo et al.; licensee BioMed Central. 2015

This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated.