Skip to main content


Targeted bisulfite sequencing identified a panel of DNA methylation-based biomarkers for esophageal squamous cell carcinoma (ESCC)



DNA methylation has been implicated as a promising biomarker for precise cancer diagnosis. However, limited DNA methylation-based biomarkers have been described in esophageal squamous cell carcinoma (ESCC).


A high-throughput DNA methylation dataset (100 samples) of ESCC from The Cancer Genome Atlas (TCGA) project was analyzed and validated along with another independent dataset (12 samples) from the Gene Expression Omnibus (GEO) database. The methylation status of peripheral blood mononuclear cells and peripheral blood leukocytes from healthy controls was also utilized for biomarker selection. The candidate CpG sites as well as their adjacent regions were further validated in 94 pairs of ESCC tumor and adjacent normal tissues from the Chinese Han population using the targeted bisulfite sequencing method. Logistic regression and several machine learning methods were applied for evaluation of the diagnostic ability of our panel.


In the discovery stage, five hyper-methylated CpG sites were selected as candidate biomarkers for further analysis as shown below: cg15830431, P = 2.20 × 10−4; cg19396867, P = 3.60 × 10−4; cg20655070, P = 3.60 × 10−4; cg26671652, P = 5.77 × 10−4; and cg27062795, P = 3.60 × 10−4. In the validation stage, the methylation status of both the five CpG sites and their adjacent genomic regions were tested. The diagnostic model based on the combination of these five genomic regions yielded a robust performance (sensitivity = 0.75, specificity = 0.88, AUC = 0.85). Eight statistical models along with five-fold cross-validation were further applied, in which the SVM model reached the best accuracy in both training and test dataset (accuracy = 0.82 and 0.80, respectively). In addition, subgroup analyses revealed a significant difference in diagnostic performance between the alcohol use and non-alcohol use subgroups.


Methylation profiles of the five genomic regions covering cg15830431 (STK3), cg19396867, cg20655070, cg26671652 (ZNF418), and cg27062795 (ZNF542) can be used for effective methylation-based testing for ESCC diagnosis.


Esophageal cancer is one of the most aggressive cancers and one of the leading causes of cancer death worldwide [1,2,3]. Esophageal cancer can be classified as esophageal adenocarcinoma (EAC) or esophageal squamous cell carcinoma (ESCC) by histology [4, 5]. The incidence of EAC is higher in Western countries, while the ESCC subtype is predominant in Asians, especially in China (88.84%), suggesting that the studies of ESCC in the Chinese population is of great importance [6,7,8,9,10]. Currently, most of ESCCs are diagnosed at advanced stages, and studies have revealed that the 5-year survival rate is much higher in the early stage than in the advanced stages of ESCC, indicating the urgent need for effective early diagnosis methods [11,12,13].

DNA methylation is a key epigenetic modification in the mammalian genomes with many essential functions, including the repression of gene expression and genomic imprinting [14,15,16,17]. Numerous studies have suggested that the altered DNA methylation patterns in tumor tissues may silence the tumor suppressor genes and activate the oncogenes through hyper/hypo methylation [18, 19]. In addition, DNA methylation alterations have been found to occur early in the carcinogenesis and therefore could be applied as a promising biomarker for cancer early detection [20,21,22]. Till now, numerous DNA methylation-based biomarkers have been identified in several types of cancers, including lung cancer, colorectal cancer, prostate cancer, gastric cancer, etc. [23,24,25,26]. What is more, SHOX2 methylation-based screening biomarker has been commercialized in lung cancer [27]. However, despite of several diagnostic panels for ESCC detection, these studies were limited by the relatively small sample size, inaccurate methylation detection methods, and lack of validation datasets. Biomarkers with these limitations may pose a burden for the further prospective studies with large sample sizes.

Therefore, due to the limitations of the current biomarkers, we want to extract more cost-efficient biomarkers with high sensitivity and specificity for ESCC early diagnosis. In addition, with the fast development of liquid biopsy of cancer diagnosis, the diagnostic biomarkers are urgently needed and applied for the large-scale prospective studies. Here, we integrated the ESCC methylation datasets from the public database for biomarker screening and validated a biomarker panel consisting of five candidate CpG sites in 94 pairs of ESCC and normal tissues from the Chinese Han population. Due to the relatively high specificity in ESCC diagnosis, the biomarker panel might be further applied in the liquid biopsy of ESCC along with the other biomarkers with high sensitivity.


Integration of TCGA datasets and GEO datasets for biomarker discovery

Public DNA methylation microarray datasets of ESCC were carefully searched. The esophageal carcinoma methylation dataset from TCGA was first identified, with 84 ESCC tumors and 3 ESCC adjacent normal tissue samples, as well as 78 EAC tumors and 13 EAC adjacent normal tissues. In order to achieve better statistical power, we combined the ESCC and EAC adjacent normal tissues as the control samples due to their similarity, which could be validated using PCA analysis (Additional file 1: Figure S1). As a result, 84 ESCC tumor tissues as well as 16 adjacent normal tissues were employed for the discovery stage analysis. In addition, the GSE52826 dataset from the Gene Expression Omnibus (GEO) database, with a relatively small sample size (4 ESCC tumors and 8 control tissues), was also utilized as the validation dataset [28]. Based on our feature selection procedure and the primer design filtering for constructing the multiplex PCR reaction system, which was described in the “Methods” section (Fig. 1), cg15830431 (P = 2.20 × 10−4), cg19396867 (P = 3.60 × 10−4), cg20655070 (P = 1.71 × 10−3), cg26671652 (P = 5.77 × 10−4), and cg27062795 (P = 3.60 × 10−4) were selected for further validation. Among them, cg19396867 and cg20655070 were not in the regulatory regions of specific genes, while cg15830431 (STK3, CpG Island), cg26671652 (ZNF418, CpG Shore), and cg27062795 (ZNF542, CpG Island) were either in CpG islands or the CpG shores of a gene. We showed that these 5 selected CpG sites were significantly hyper-methylated in the ESCC tumor tissues, compared to the adjacent normal tissues. Moreover, the methylation status of these 5 CpG sites was also validated in GSE52826 dataset and showed similar results. In addition, all 5 CpG sites showed hypo-methylated states in the PBMC (peripheral blood mononuclear cells) and PBL (peripheral blood leucocytes) from healthy samples (Additional file 2: Table S1). Based on the above analysis, we believed that these 5 CpG sites would be the candidate non-invasive biomarkers for ESCC. As a result, we built a prediction model based on the logistic regression using all 5 predictors without adjustment for age, gender, and other covariates, which provided a way to discriminate between ESCCs and normal tissues (sensitivity = 0.89, specificity = 0.81, AUC = 0.87). To further evaluate and validate the diagnostic ability of these 5 CpG sites, we then conducted the validation study in 94 paired ESCC and adjacent normal tissue samples obtained from patients from the Chinese Han population.

Fig. 1

Flow chart of the study design. Candidate biomarkers were selected from the high-throughput DNA methylation microarrays from the TCGA project and further validated with the ESCC methylation data from the GEO dataset, as well as PBMC and PBL from healthy controls. In addition, the PBL and PBMC methylation datasets from healthy samples were also utilized for biomarker filtering. Based on our preliminary screening, the candidate methylation biomarkers for ESCC were then further validated with targeted bisulfite sequencing in independent Chinese Han ESCC patients

Methylation status validation of the five CpG sites with targeted bisulfite sequencing

The characteristics of the ESCC patients are shown in Table 1. Quality control procedures were first applied to the targeted bisulfite sequencing data. We found that the bisulfite conversion rate of each sample was higher than 98%, and no significant difference was found between the tumors and adjacent normal tissues, indicating the bisulfite conversion was efficient and reliable (Fig. 3a). In addition, the samples and the CpG sites with high missing rate (> 30%) and low coverage (< 20×) were also filtered out as described in the “Methods” section. After the quality control procedures, 174 of the 188 samples (94 pairs of ESCC tumor/adjacent normal tissues) still remained for further study. The principal component analysis (PCA) was conducted for all samples and showed a clear discrimination between ESCC tumors and adjacent normal tissues (Additional file 3: Figure S2). Differential methylation analyses were conducted for the five CpG sites as well as nearby CpG sites, suggesting a major difference between the ESCCs and adjacent normal tissues (Fig. 2). A logistic regression model was then applied and showed significant hyper-methylation of the five selected CpG sites in the ESCCs (Table 2, cg15830431, P = 1.25 × 10−6; cg19396867, P = 2.71 × 10−11; cg20655070, P = 8.04 × 10−10; cg26671652, P = 4.82 × 10−11; cg27062795, P = 1.23 × 10−12). As a result, we then averaged the methylation status of all the nearby CpG sites in a genomic region as representatives of the candidate regions for further analysis (Fig. 3b–f). Based on the mean methylation status of the five genomic regions, the prediction ability of each region separately was evaluated through logistic regression without adjustment for age, gender, and other covariates. The sensitivity of each region ranges from 0.64 to 0.74, while the specificity ranges from 0.82 to 0.90 and the AUC ranges from 0.76 to 0.84 (Table 3). Moreover, in the logistic model taking all of the five regions as predictors, we obtained the sensitivity of 0.75 and specificity of 0.88, as well as the AUC of 0.85 (Fig. 3g).

Table 1 Characteristics of the ESCC patients included in this study
Fig. 2

The methylation status of the CpG sites in the five genomic regions. ae represent the methylation status of the CpG sites in regions covering STK3, cg19396867, cg20655070, ZNF418, and ZNF542, respectively. The x-axis represents the genomic positions of the CpG sites in the targeted regions. The y-axis represents the mean methylation percentage in the ESCC tumor tissues as well as the normal tissues for each of the CpG sites. The error bar represents the confidence interval of the methylation percentage in the ESCC tumor tissues as well as the normal tissues for each of the CpG sites

Table 2 The methylation status of the five CpG sites in the TCGA dataset and the validation dataset
Fig. 3

The mean methylation status of each genomic region and bisulfite conversion efficiency between ESCC tumors and normal tissues as well as the overall ROC (Receiver Operating characteristics) curve. a represents the bisulfite conversion efficiency between ESCC and adjacent normal tissues. Bisulfite conversion efficiency was calculated by using the number of transformed C to T divided by the number of C in each sample. bf represent the mean methylation status of the genomic regions covering STK3, cg19396867, cg20655070, ZNF418, and ZNF542, respectively. Each point represents mean methylation percentage in a genomic region of a sample. The boxplot showed overall methylation percentage of different groups in each genomic region. g represents the overall ROC curve, which was calculated through a logistic regression model, incorporating the mean methylation percentage of the five genomic regions as the variables and without the adjustment for gender, age, and smoking status and alcohol status

Table 3 The mean methylation status of the five genomic regions in the validation datasets

The diagnostic ability of the five genomic regions based on machine learning methods

In order to get a better estimation of the diagnostic ability of the selected biomarkers, several machine learning methods, including logistic regression, random forest (RF), supporting vector machine (SVM), neural network (NN), Naïve Bayes (NB), linear discriminant analysis (LDA), mixture discriminant analysis (MDA), and flexible discriminant analysis (FDA), were utilized to build the diagnostic models for ESCC classification. The mean methylation percentages of the CpG sites in each genomic region were utilized for analysis. The five-fold cross-validation method was also conducted to give a robust estimation of the performance of the models. As shown in Table 4, in the training stage, the sensitivity of all the models ranged from 0.63 to 0.76 and the specificity ranged from 0.77 to 0.89. The logistic regression model and the SVM model both performed well with regard to accuracy. In the testing stage, the sensitivity of the models ranged from 0.63 to 0.73 and the specificity ranged from 0.78 to 0.88. The SVM model again achieved the highest accuracy, indicating the robustness and effectiveness of the model. In addition, we found that the diagnostic performance was similar between the training and testing stage in all the models, suggesting the reliability of our results.

Table 4 Diagnosis accuracy, sensitivity, and specificity of different classification models with five-fold cross-validation

Evaluation of diagnostic models in the ESCC subgroup analysis

Previous studies have found several risk factors for the incidence of ESCC, including age, gender, smoking status, and alcohol status [29,30,31,32]. As a result, we conducted subgroup analyses according to these risk factors. The mean methylation percentage of each targeted region was utilized for subgroup analysis. In the young/old subgroups, the median age of each patient was taken as the criteria for dividing the samples. We found that there was no significant difference between the sensitivity, specificity, and the AUC between the two subgroups (Additional file 2: Table S2). The overall AUC using all the variables in the two subgroups was 0.86 for both (Additional file 4: Figure S3A-B). In the male/female subgroups, we found that the diagnostic model performed better for the female subgroup than the male subgroup (Additional file 2: Table S3), and the overall AUC of the female subgroup was much higher than that of the male subgroup (AUC 0.89 vs. 0.84, Additional file 4: Figure S3C-D). In addition, in the smoker/non-smoker subgroup analysis, there was no significant difference between the diagnostic performances (Additional file 2: Table S4 and Additional file 4: Figure S3E-F). However, a significant difference was identified in the analysis of the alcohol/non-alcohol subgroups. Compared with the alcohol use subgroup, the AUCs in four of the five genomic regions were elevated in the non-alcohol subgroup, especially the two genomic regions covering ZNF418 and ZNF542 (Additional file 2: Table S5). The overall AUC obtained with all the genomic regions of the non-alcohol subgroup was substantially higher than that of the alcohol subgroup (0.89 vs. 0.79, respectively; Additional file 4: Figure S3G-H). In addition, we found that our female samples were all included in the non-alcohol subgroup. In order to eliminate the confounding factor of gender in the alcohol/non-alcohol subgroup analysis, we then selected the male samples only for subgroup analysis. Concordantly, we found that the diagnostic ability was still substantially better in the non-alcohol subgroup than in the alcohol subgroup, indicating that the observed difference was not introduced by the gender bias (0.90 vs. 0.79, respectively; Additional file 2: Table S6). The vast difference in the diagnostic performance in the alcohol/non-alcohol subgroup indicates that alcohol use may contribute to the epigenetic changes in ESCC as well as to the pathogenesis of ESCC [30].


DNA methylation plays a key role in the gene expression regulation and therefore has great potential as a non-invasive biomarker for cancer diagnosis and prognosis. ESCC patients who receive an early diagnosis will have longer survival times and lower mortality. Previous studies have found several candidate methylation biomarkers for ESCC detection and prognosis as well as treatment response. In our study, we integrated the methylation dataset from TCGA project and the GEO dataset for biomarker discovery and removed the candidate biomarkers with hyper-methylation status in PBMC and PBL cells of healthy controls to ensure its validity for future non-invasive diagnosis. Finally, a novel DNA methylation biomarker panel consisting of five CpG sites was then identified. Moreover, we validated these five CpG sites in 94 pairs of ESCC tumors and their adjacent normal tissues from Chinese patients with a targeted bisulfite sequencing method, enabling us to not only detect the methylation status of five CpG sites but their genomic regions as well. As a result, we then obtained the mean methylation percentage of each targeted region, which is a more robust estimation of the methylation status than the single CpG site itself. The methylation testing of these five genomic regions has a fairly high accuracy, sensitivity, and specificity in different models, suggesting that the methylation testing of these five genomic regions may be promising biomarkers for the detection of ESCC. In addition, the subgroup analyses identified that the diagnostic performance of the methylation testing is much better in the non-alcohol-consuming patients than in the ESCC patients who consume alcohol, suggesting the importance of taking the epidemiological data into considerations when performing ESCC diagnosis. Further studies may be required to explore the association between the methylation status of these five genomic regions and the use of alcohol.

Of the five genomic regions, two genomic regions covering cg19396867 and cg20655070 were not in the regulatory regions of specific genes. However, the H3k4me3, H3k4me1, and H3k27ac status of these two regions from the ENCODE project showed that these regions might be associated with the enhancers, indicating that the regions might also have important regulatory functions (data not shown). In contrast, cg15830431 (STK3, CpG Island), cg26671652 (ZNF418, CpG Shore), and cg27062795 (ZNF542, CpG Island) were either in the CpG islands or the CpG shores of a gene. The serine/threonine kinase 3 (STK3) gene encodes a serine/threonine protein kinase and functions as a growth suppressor, which is one of the key components of the Hippo signaling pathway involving apoptosis. A previous study has found that the deletion of STK3 in mouse liver results in tissue overgrowth and tumor development, demonstrating its importance in suppressing carcinogenesis [33]. Also, hyper-methylation of STK3 has been found in soft tissue sarcoma as well as head and neck squamous cell carcinoma, which is in accordance with the present study [34, 35]. However, in our expression analysis with the RNA-seq dataset from TCGA, we found that the expression of STK3 was upregulated in the ESCC tumor tissues, which is inconsistent with our assumptions and needs further analysis (Additional file 5: Figure S4). ZNF418 (zinc finger protein 418) is a member of the zinc finger-containing transcription factor family, which has been implicated as critical regulators for development and diseases. ZNF418 has been shown to be a transcriptional repressor, which may act as a negative regulator in the MAPK signaling pathway, and we also found the downregulation of ZNF418 in the ESCC tumor tissues in TCGA dataset, indicating the possible activation of MAPK pathway by decreased expression of ZNF418 in the ESCC pathogenesis (Additional file 5: Figure S4) [36]. ZNF542 (zinc finger protein 542) is a pseudogene, which also may be involved in transcriptional regulation. Studies have found hyper-methylation of ZNF542 in oropharyngeal squamous cell carcinoma and sporadic colorectal cancer [37, 38]. Moreover, a pan-cancer study analysis based on the TCGA methylation datasets identified the hyper-methylation status of ZNF542 in 12 cancer types [39]. Moreover, the expression profiles of ZNF542 were in accordance with the methylation status, which was significantly downregulated in the ESCC tumor tissues (Additional file 5: Figure S4).

Several studies have recently conducted a search for the miRNAs and metabolomics, as well as DNA methylation-based biomarkers for ESCC diagnosis. Zhou X et al. have found a panel consisting of six microRNAs in serum which could serve as the biomarker for ESCC diagnosis [40]. Moreover, miR-1246, miR-18a, miR-25,s and miR-21 were all validated as the promising diagnostic biomarkers for ESCC previously [41,42,43,44]. Several mRNA-based biomarkers were also confirmed as candidate biomarkers for ESCC [45,46,47]. Jing X et al. found that the urine metabolomics were the promising diagnostic biomarkers for ESCC [48]. As for the DNA methylation-based biomarkers, Hiroaki N et al. have found that HOXB2 and SEPT9 were two candidate diagnostic biomarkers for the prediction of lymph node metastasis of ESCC [49]. In addition, the methylation status of PAX1, ZNF582, HIN1, TFPI-2, DACH1, and SOX17 were all reported as the candidate diagnostic biomarkers for ESCC [50, 51].

Compared with the other kinds of biomarkers, DNA methylation alterations may occur in advance of the alterations of mRNA and protein levels in the carcinogenesis thus might have a better early diagnosis potential. Here, in our study, we integrated the public high-throughput microarray datasets and applied the targeted bisulfite sequencing method to explore the methylation status of our candidate CpG sites as well as their adjacent genomic regions. With the fast development of next generation sequencing (NGS), the targeted bisulfite sequencing method is becoming the recommended method for methylation detection because of high accuracy and high throughput and cost-effectiveness. Previous studies have revealed that the adjacent CpG sites on the same DNA molecules would share similar methylation patterns due to the locally coordinated activities of the DNA methyltransferases (DNMTs) or ten-eleven translocation (TET) proteins, which are methylation haplotypes, epi-alleles, or epi-haplotypes [52,53,54]. Because of the increased CpG sites in the region, the methylation haplotypes may be less susceptible to the complex and random environment stimulus and would be a more stable representative for methylation quantification [55].

Till now, the majority of the ESCC patients were diagnosed at later stage, and conventional endoscopy is expensive and depends on the availability of specialist clinical expertise and the diagnostic accuracy is relatively low [56]. As a result, better diagnostic methods for ESCC are urgently needed. Recently, a novel method which involves swallowing a sponge on a string has been proposed for ESCC diagnosis [57, 58]. The sponge is then gently pulled back out, taking a sample of cells from the person’s esophagus. In this case, our panel could be served as the diagnostic biomarkers. In addition, with the need for non-invasive diagnosis soaring up, our diagnostic panel could also be utilized for the liquid biopsy for ESCC in coordination with the other kinds of biomarkers.

The early diagnosis of esophageal squamous cell carcinoma is challenging due to its high heterogeneity. A single biomarker by itself may not be adequate for accurate diagnosis, which suggests that a panel consisting of multi-biomarkers is essential. Though our DNA methylation-based biomarkers have reached a fair accuracy in distinguishing the ESCC tumors from normal tissues, some of the ESCC tumor tissues still remained misclassified. Integration analysis of multi-omics datasets, ranging from genomics and epigenomics, as well as proteomics, may reveal more heterogeneity in ESCC and identify more biomarkers for accurate non-invasive diagnosis.


Integration analysis of ESCC high-throughput DNA methylation datasets from TCGA project and GEO database identified five hyper-methylated CpG sites as candidate biomarkers for ESCC diagnosis, which were further validated in an independent analysis of 94 pairs of ESCC tumors and normal tissues using the targeted bisulfite sequencing method. Methylation profiles of the five genomic regions covering cg15830431 (STK3), cg19396867, cg20655070, cg26671652 (ZNF418), and cg27062795 (ZNF542) may be effective DNA methylation-based testing for ESCC diagnosis.


Biomarker discovery based on the public datasets

Public high-throughput DNA methylation microarray datasets were searched, and the comprehensive methylation dataset of esophageal cancer from the TCGA project was the first obtained. There were 84 ESCC and 3 normal tissues in this level 3 dataset. In addition, we also found that there are 78 EAC and 13 adjacent normal tissues in the TCGA dataset (level 3). To increase the sample size for a more robust biomarker discovery, the adjacent normal tissues of the EAC and ESCC were combined for analysis. Finally, 84 ESCC as well as 16 normal tissues were obtained from TCGA for discovery analysis. In addition, a GSE record named GSE52826 was found, with 4 ESCC and 8 normal tissue samples, which were utilized for preliminary validation.

To strengthen the robustness of the candidate biomarkers, we conducted the differential methylation region (DMR) analysis (Additional file 6: Figure S5). We first took the adjacent six CpG sites as a methylation block, and the range of the block should be shorter than 1000 bp due to the low methylation linkage equilibrium. We then applied the sliding window methods according to the genomic position of each CpG site and slides one CpG site each time. Therefore, some CpG sites were overlapped in the adjacent DMRs. Finally, we extracted 105,673 methylation regions which fulfilled our criteria. After that, we calculated the methylation status between the ESCC and the control tissues for each DMR. In summary, we obtained 411 DMRs, covering 1355 unique candidate CpG sites based on our standards (McaM > 0.40, McoM < 0.20, FDR < 0.01, fold change > 2). Subsequently, these 1355 candidate CpG sites were further filtered based on their methylation status and 713 candidate CpG sites were still retained (McaM > 0.25, McoM < 0.20, Diff > 0.15, fold change > 2, FDR < 0.01). Simultaneously, the methylation differences of the candidate CpG sites in the GEO dataset were also obtained for further validation (McaM > 0.15, McoM < 0.15, Diff > 0.10, fold change > 2, P value < 0.05). After that, only 275 candidate CpG sites were included. Due to the fact that it is inevitable to contain DNAs from some of the peripheral blood cells when performing liquid biopsy, to reduce the noises brought by the methylation status of the peripheral blood cells, it is of great importance that the methylation rate of the candidate biomarker should be very low in the adjacent normal tissues as well as in the peripheral blood so that it can be used for non-invasive cancer diagnosis in the future. As a result, we then filtered the candidate CpG sites with high methylation percentage in the peripheral blood mononuclear cells (PBMC, N = 111) and peripheral blood leucocytes (PBL, N = 527) of the healthy normal samples from the GEO database (the PBMC dataset came from the GSE53045 dataset, and the PBL dataset was the combination of GSE36054 and GSE42861). In addition, the CpG sites located far from the CpG islands were also filtered out. Moreover, we further removed the CpG sites with SNPs in their primers and the CpG sites whose corresponding genes have been studied in ESCC carcinogenesis. In total, 175 candidate CpG sites were finally selected. Based on the CG percent, PolyT, and the number of SNPs in the primers of our targeted regions, we obtained the overall score representing the difficulty levels for all the candidate regions. In total, we selected the top five candidate regions with the best chances to be amplified and conducted in the multiple PCR experiment and removed the other candidate regions for further validation. Finally, five of our candidate biomarkers were selected for further validation: cg15830431, cg19396867, cg20655070, cg26671652, and cg27062795.

Patients, samples, and DNA

ESCC samples and their paired adjacent normal tissues for validation study were obtained from the First Affiliated Hospital of Soochow University and Fourth Military Medical University between the years of 2011 and 2015. The patients who did not undergo any neo-adjuvant therapy before the surgery were recruited only. All tumor tissues were evaluated by pathologists and fulfilled the criteria of tumor percent > 50%. All procedures performed in this study were in accordance with the ethical standards of the institutional research committee and with the 1964 Helsinki declaration and its later amendments. The studies were approved by the institutional review boards of Soochow University at Jiangsu Province and Fudan University, Shanghai, China. Written informed consent was obtained from each study subject. In addition, all of the subjects were re-examined and confirmed by professional pathologists for histopathological diagnosis. All tissues were immediately frozen at − 80 °C after surgical resection. Face-to-face interviews were conducted by professional investigators with a comprehensive questionnaire, including clinical information on tobacco smoking, alcohol consumption, and family history. The smokers were defined as ever using the tobacco products at least once a day for 6 months, and the alcohol drinkers were defined as ever using the alcohol products at least once a week for 6 months.

Targeted bisulfite sequencing assay

DNA extraction and bisulfite conversion were performed as previously described [59, 60]. Based on the genomic coordinates of the five candidate CpG sites, we carefully designed the primers in order to detect them in a panel (Additional file 2: Table S7). The net-PCR was performed firstly to amplify the targeted DNA sequence. Then, the designed DNA fragments were sequenced by Illumina Hiseq 2000. BSseeker2 is one of the most commonly used tools for analyzing the bisulfite sequencing results and was applied in our study for mapping bisulfite-treated reads as well as for methylation calling [61]. After calling methylation, we obtained the bisulfite conversion rate for each sample, and the samples with bisulfite conversion rate < 98% were firstly filtered out. After the preliminary analysis, we then calculated the average coverage as well as the missing rate for each CpG site. The CpG sites with average coverage less than 20× and/or with missing rate > 0.20 were further filtered out. In addition, the samples with missing rate > 0.30 were filtered out finally.

Statistical analysis and machine learning

In the discovery stage, we applied the Wilcoxon rank-sum test for testing the differential methylation status between cancer and normal tissues of each CpG site. Further, differential methylation status in tumor and normal tissues of the candidate CpG sites were tested with a logistic regression method. False discovery rate (FDR) correction was used for multiple test correction. In addition, the logistic regression (Package stats), support vector machine (SVM, Package e1071), random forest (Package randomForest), Naïve Bayes (Package e1071), neural network (Package nnet), linear discriminant analysis (LDA, Package mda), mixture discriminant analysis (MDA, Package mda), and the flexible discriminant analysis (FDA, Package mda) were used for classifying the ESCC and normal tissues. To obtain a robust evaluation of the prediction ability with these biomarkers and methods, five-fold cross-validation was also applied. In addition, sensitivity, specificity, and accuracy were obtained from the logistic regression model. All statistical analyses were all conducted using R 3.2.1 [62].



Area under the curve


Esophageal adenocarcinoma


Esophageal squamous cell carcinoma


Flexible discriminant analysis


Linear discriminant analysis


Mixture discriminant analysis


Naïve Bayes


Next generation sequencing


Neural network


Peripheral blood cell


Peripheral blood mononuclear cell


Random forest


The serine/threonine kinase 3


Support vector machine


The Cancer Genome Atlas


Zinc finger protein 418


Zinc finger protein 542


  1. 1.

    De Angelis R, Sant M, Coleman MP, Francisci S, Baili P, Pierannunzio D, Trama A, Visser O, Brenner H, Ardanaz E, et al. Cancer survival in Europe 1999-2007 by country and age: results of EUROCARE--5—a population-based study. Lancet Oncol. 2014;15(1):23–34.

  2. 2.

    Siegel RL, Miller KD, Jemal A. Cancer statistics, 2016. CA Cancer J Clin. 2016;66(1):7–30.

  3. 3.

    Torre LA, Bray F, Siegel RL, Ferlay J, Lortet-Tieulent J, Jemal A. Global cancer statistics, 2012. CA Cancer J Clin. 2015;65(2):87–108.

  4. 4.

    Siewert JR, Ott K. Are squamous and adenocarcinomas of the esophagus the same disease? Semin Radiat Oncol. 2007;17(1):38–44.

  5. 5.

    Enzinger PC, Mayer RJ. Esophageal cancer. N Engl J Med. 2003;349(23):2241–52.

  6. 6.

    Brown LM, Devesa SS, Chow WH. Incidence of adenocarcinoma of the esophagus among white Americans by sex, stage, and age. J Natl Cancer Inst. 2008;100(16):1184–7.

  7. 7.

    Shibata A, Matsuda T, Ajiki W, Sobue T. Trend in incidence of adenocarcinoma of the esophagus in Japan, 1993–2001. Jpn J Clin Oncol. 2008;38(7):464–8.

  8. 8.

    Kamangar F, Malekzadeh R, Dawsey SM, Saidi F. Esophageal cancer in northeastern Iran: a review. Arch Iran Med. 2007;10(1):70–82.

  9. 9.

    Lin Y, Totsuka Y, He Y, Kikuchi S, Qiao Y, Ueda J, Wei W, Inoue M, Tanaka H. Epidemiology of esophageal cancer in Japan and China. J Epidemiol. 2013;23(4):233–42.

  10. 10.

    Zeng H, Zheng R, Zhang S, Zuo T, Xia C, Zou X, Chen W. Esophageal cancer statistics in China, 2011: estimates based on 177 cancer registries. Thorac Cancer. 2016;7(2):232–7.

  11. 11.

    Zeng H, Zheng R, Guo Y, Zhang S, Zou X, Wang N, Zhang L, Tang J, Chen J, Wei K, et al. Cancer survival in China, 2003–2005: a population-based study. Int J Cancer. 2015;136(8):1921–30.

  12. 12.

    Besharat S, Jabbari A, Semnani S, Keshtkar A, Marjani J. Inoperable esophageal cancer and outcome of palliative care. World J Gastroenterol. 2008;14(23):3725–8.

  13. 13.

    Coleman MP, Gatta G, Verdecchia A, Esteve J, Sant M, Storm H, Allemani C, Ciccolallo L, Santaquilani M, Berrino F, et al. EUROCARE-3 summary: cancer survival in Europe at the end of the 20th century. Ann Oncol. 2003;14(Suppl 5):v128–49.

  14. 14.

    Bernstein BE, Meissner A, Lander ES. The mammalian epigenome. Cell. 2007;128(4):669–81.

  15. 15.

    Guo H, Zhu P, Yan L, Li R, Hu B, Lian Y, Yan J, Ren X, Lin S, Li J, et al. The DNA methylation landscape of human early embryos. Nature. 2014;511(7511):606–10.

  16. 16.

    Robertson KD. DNA methylation and human disease. Nat Rev Genet. 2005;6(8):597–610.

  17. 17.

    Elhamamsy AR. Role of DNA methylation in imprinting disorders: an updated review. J Assist Reprod Genet. 2017;34(5):549–62.

  18. 18.

    Jones PA. Cancer. Death and methylation. Nature. 2001;409(6817):141. 143-144

  19. 19.

    Aran D, Hellman A. DNA methylation of transcriptional enhancers and cancer predisposition. Cell. 2013;154(1):11–3.

  20. 20.

    Witte T, Plass C, Gerhauser C. Pan-cancer patterns of DNA methylation. Genome Med. 2014;6(8):66.

  21. 21.

    Yang X, Gao L, Zhang S. Comparative pan-cancer DNA methylation analysis reveals cancer common and specific patterns. Brief Bioinform. 2016;18(5):761–73.

  22. 22.

    Guo S, Diep D, Plongthongkum N, Fung H-L, Zhang K, Zhang K: Identification of methylation haplotype blocks aids in deconvolution of heterogeneous tissue samples and tumor tissue-of-origin mapping from plasma DNA. Nat Genet. 2017;49(4):635–42.

  23. 23.

    Walter K, Holcomb T, Januario T, Yauch RL, Du P, Bourgon R, Seshagiri S, Amler LC, Hampton GM, SS D. Discovery and development of DNA methylation-based biomarkers for lung cancer. Epigenomics. 2014;6(1):59–72.

  24. 24.

    Coppede F. Epigenetic biomarkers of colorectal cancer: focus on DNA methylation. Cancer Lett. 2014;342(2):238–47.

  25. 25.

    Ashour N, Angulo JC, Andres G, Alelu R, Gonzalez-Corpas A, Toledo MV, Rodriguez-Barbero JM, Lopez JI, Sanchez-Chapado M, Ropero S. A DNA hypermethylation profile reveals new potential biomarkers for prostate cancer diagnosis and prognosis. Prostate. 2014;74(12):1171–82.

  26. 26.

    Tahara T, Arisawa T. DNA methylation as a molecular biomarker in gastric cancer. Epigenomics. 2015;7(3):475–86.

  27. 27.

    Dietrich D, Kneip C, Raji O, Liloglou T, Seegebarth A, Schlegel T, Flemming N, Rausch S, Distler J, Fleischhacker M, et al. Performance evaluation of the DNA methylation biomarker SHOX2 for the aid in diagnosis of lung cancer based on the analysis of bronchial aspirates. Int J Oncol. 2012;40(3):825–32.

  28. 28.

    Li X, Zhou F, Jiang C, Wang Y, Lu Y, Yang F, Wang N, Yang H, Zheng Y, Zhang J. Identification of a DNA methylome profile of esophageal squamous cell carcinoma and potential plasma epigenetic biomarkers for early diagnosis. PLoS One. 2014;9(7):e103162.

  29. 29.

    Morita M, Kumashiro R, Kubo N, Nakashima Y, Yoshida R, Yoshinaga K, Saeki H, Emi Y, Kakeji Y, Sakaguchi Y, et al. Alcohol drinking, cigarette smoking, and the development of squamous cell carcinoma of the esophagus: epidemiology, clinical findings, and prevention. Int J Clin Oncol. 2010;15(2):126–34.

  30. 30.

    Pandeya N, Williams G, Green AC, Webb PM, Whiteman DC, Australian Cancer S. Alcohol consumption and the risks of adenocarcinoma and squamous cell carcinoma of the esophagus. Gastroenterology. 2009;136(4):1215–24. e1211-1212

  31. 31.

    Wang JM, Xu B, Rao JY, Shen HB, Xue HC, Jiang QW. Diet habits, alcohol drinking, tobacco smoking, green tea drinking, and the risk of esophageal squamous cell carcinoma in the Chinese population. Eur J Gastroenterol Hepatol. 2007;19(2):171–6.

  32. 32.

    Nasrollahzadeh D, Kamangar F, Aghcheli K, Sotoudeh M, Islami F, Abnet CC, Shakeri R, Pourshams A, Marjani HA, Nouraie M, et al. Opium, tobacco, and alcohol use in relation to oesophageal squamous cell carcinoma in a high-risk area of Iran. Br J Cancer. 2008;98(11):1857–63.

  33. 33.

    Zhou D, Conrad C, Xia F, Park JS, Payer B, Yin Y, Lauwers GY, Thasler W, Lee JT, Avruch J, et al. Mst1 and Mst2 maintain hepatocyte quiescence and suppress hepatocellular carcinoma development through inactivation of the Yap1 oncogene. Cancer Cell. 2009;16(5):425–38.

  34. 34.

    Seidel C, Schagdarsurengin U, Blumke K, Wurl P, Pfeifer GP, Hauptmann S, Taubert H, Dammann R. Frequent hypermethylation of MST1 and MST2 in soft tissue sarcoma. Mol Carcinog. 2007;46(10):865–71.

  35. 35.

    Steinmann K, Sandner A, Schagdarsurengin U, Dammann RH. Frequent promoter hypermethylation of tumor-related genes in head and neck squamous cell carcinoma. Oncol Rep. 2009;22(6):1519–26.

  36. 36.

    Li Y, Yang D, Bai Y, Mo X, Huang W, Yuan W, Yin Z, Deng Y, Murashko O, Wang Y, et al. ZNF418, a novel human KRAB/C2H2 zinc finger protein, suppresses MAPK signaling pathway. Mol Cell Biochem. 2008;310(1-2):141–51.

  37. 37.

    Lleras RA, Adrien LR, Smith RV, Brown B, Jivraj N, Keller C, Sarta C, Schlecht NF, Harris TM, Childs G, et al. Hypermethylation of a cluster of Kruppel-type zinc finger protein genes on chromosome 19q13 in oropharyngeal squamous cell carcinoma. Am J Pathol. 2011;178(5):1965–74.

  38. 38.

    Chen CH, Sheng Jiang S, Hsieh LL, Tang R, Hsiung CA, Tsai HJ, Chang IS. DNA methylation identifies loci distinguishing hereditary nonpolyposis colorectal cancer without germ-line MLH1/MSH2 mutation from sporadic colorectal cancer. Clin Transl Gastroenterol. 2016;7(12):e208.

  39. 39.

    Gevaert O, Tibshirani R, Plevritis SK. Pancancer analysis of DNA methylation-driven genes using MethylMix. Genome Biol. 2015;16:17.

  40. 40.

    Zhou X, Wen W, Zhu J, Huang Z, Zhang L, Zhang H, Qi LW, Shan X, Wang T, Cheng W, et al. A six-microRNA signature in plasma was identified as a potential biomarker in diagnosis of esophageal squamous cell carcinoma. Oncotarget. 2017;8(21):34468–80.

  41. 41.

    Hirajima S, Komatsu S, Ichikawa D, Takeshita H, Konishi H, Shiozaki A, Morimura R, Tsujiura M, Nagata H, Kawaguchi T, et al. Clinical impact of circulating miR-18a in plasma of patients with oesophageal squamous cell carcinoma. Br J Cancer. 2013;108(9):1822–9.

  42. 42.

    Komatsu S, Ichikawa D, Hirajima S, Kawaguchi T, Miyamae M, Okajima W, Ohashi T, Arita T, Konishi H, Shiozaki A, et al. Plasma microRNA profiles: identification of miR-25 as a novel diagnostic and monitoring biomarker in oesophageal squamous cell carcinoma. Br J Cancer. 2014;111(8):1614–24.

  43. 43.

    Takeshita N, Hoshino I, Mori M, Akutsu Y, Hanari N, Yoneyama Y, Ikeda N, Isozaki Y, Maruyama T, Akanuma N, et al. Serum microRNA expression profile: miR-1246 as a novel diagnostic and prognostic biomarker for oesophageal squamous cell carcinoma. Br J Cancer. 2013;108(3):644–52.

  44. 44.

    Tanaka Y, Kamohara H, Kinoshita K, Kurashige J, Ishimoto T, Iwatsuki M, Watanabe M, Baba H. Clinical impact of serum exosomal microRNA-21 as a clinical biomarker in human esophageal squamous cell carcinoma. Cancer. 2013;119(6):1159–67.

  45. 45.

    Xing S, Zheng X, Wei LQ, Song SJ, Liu D, Xue N, Liu XM, MT W, Zhong Q, Huang CM, et al. Development and validation of a serum biomarker panel for the detection of esophageal squamous cell carcinoma through RNA transcriptome sequencing. J Cancer. 2017;8(12):2346–55.

  46. 46.

    Wang W, He X, Zheng Z, Ma X, Hu X, Wu D, Wang M, Serum HOTAIR. As a novel diagnostic biomarker for esophageal squamous cell carcinoma. Mol Cancer. 2017;16(1):75.

  47. 47.

    Couch G, Redman JE, Wernisch L, Newton R, Malhotra S, Dawsey SM, Lao-Sirieix P, Fitzgerald RC. The discovery and validation of biomarkers for the diagnosis of esophageal squamous dysplasia and squamous cell carcinoma. Cancer Prev Res (Phila). 2016;9(7):558–66.

  48. 48.

    Xu J, Chen Y, Zhang R, He J, Song Y, Wang J, Wang H, Wang L, Zhan Q, Abliz Z. Global metabolomics reveals potential urinary biomarkers of esophageal squamous cell carcinoma for diagnosis and staging. Sci Rep. 2016;6:35010.

  49. 49.

    Nagata H, Kozaki KI, Muramatsu T, Hiramoto H, Tanimoto K, Fujiwara N, Imoto S, Ichikawa D, Otsuji E, Miyano S, et al. Genome-wide screening of DNA methylation associated with lymph node metastasis in esophageal squamous cell carcinoma. Oncotarget. 2017;8(23):37740–50.

  50. 50.

    Ma K, Cao B, Guo M. The detective, prognostic, and predictive value of DNA methylation in human esophageal squamous cell carcinoma. Clin Epigenetics. 2016;8:43.

  51. 51.

    Huang J, Wang G, Tang J, Zhuang W, Wang LP, Liou YL, Liu YZ, Zhou HH, Zhu YS. DNA methylation status of PAX1 and ZNF582 in esophageal squamous cell carcinoma. Int J Environ Res Public Health. 2017;14(2):216–28.

  52. 52.

    Shoemaker R, Deng J, Wang W, Zhang K. Allele-specific methylation is prevalent and is contributed by CpG-SNPs in the human genome. Genome Res. 2010;20(7):883–9.

  53. 53.

    Jones B. DNA methylation: switching phenotypes with epialleles. Nat Rev Genet. 2014;15(9):572.

  54. 54.

    Schwartzman O, Tanay A. Single-cell epigenomics: techniques and emerging applications. Nat Rev Genet. 2015;16(12):716–26.

  55. 55.

    Guo S, Diep D, Plongthongkum N, Fung HL, Zhang K, Zhang K. Identification of methylation haplotype blocks aids in deconvolution of heterogeneous tissue samples and tumor tissue-of-origin mapping from plasma DNA. Nat Genet. 2017;49(4):635–42.

  56. 56.

    Bhardwaj A, Hollenbeak CS, Pooran N, Mathew A. A meta-analysis of the diagnostic accuracy of esophageal capsule endoscopy for Barrett’s esophagus in patients with gastroesophageal reflux disease. Am J Gastroenterol. 2009;104(6):1533–9.

  57. 57.

    Benaglia T, Sharples LD, Fitzgerald RC, Lyratzopoulos G. Health benefits and cost effectiveness of endoscopic and nonendoscopic cytosponge screening for Barrett’s esophagus. Gastroenterology. 2013;144(1):62–73. e66

  58. 58.

    Kadri S, Lao-Sirieix P, Fitzgerald RC. Developing a nonendoscopic screening test for Barrett’s esophagus. Biomark Med. 2011;5(3):397–404.

  59. 59.

    Zhao Y, Guo S, Sun J, Huang Z, Zhu T, Zhang H, Gu J, He Y, Wang W, Ma K, et al. Methylcap-seq reveals novel DNA methylation markers for the diagnosis and recurrence prediction of bladder cancer in a Chinese population. PLoS One. 2012;7(4):e35175.

  60. 60.

    Guo S, Yan F, Xu J, Bao Y, Zhu J, Wang X, Wu J, Li Y, Pu W, Liu Y, et al. Identification and validation of the methylation biomarkers of non-small cell lung cancer (NSCLC). Clin Epigenetics. 2015;7:3.

  61. 61.

    Guo W, Fiziev P, Yan W, Cokus S, Sun X, Zhang MQ, Chen PY, Pellegrini M. BS-Seeker2: a versatile aligning pipeline for bisulfite sequencing data. BMC Genomics. 2013;14:774.

  62. 62.

    Dessau RB, Pipper CB: “R”--project for statistical computing. Ugeskr Laeger 2008, 170(5):328-330.

Download references


The study was supported by research grants from the National Natural Science Foundation of China (31521003, 81071957), the Key Project in the National Science & Technology Pillar Program (2011BAI09B00), Suzhou City Science and Technology Program (SYS201419), and 111 Project (B13016). Computational support was provided by the High-End Computing Center located at Fudan University.

Availability of data and materials

The datasets used and analyzed in this study are available from the corresponding author on request.

Author information

MW, JW, YZ, and SG contributed to the conception and design of the study. CW, DZ, and YM contributed to the sample collection and DNA extraction; YW, CL, and ZH conducted the targeted bisulfite sequencing experiments for the validation stage; and WP, SC, and CW contributed to TCGA and GEO as well as the targeted bisulfite sequencing data analysis. WP, MW, JW, and SG wrote the manuscript. All authors read and approved the final manuscript.

Correspondence to Shicheng Guo or Jiucun Wang or Minghua Wang.

Ethics declarations

Ethics approval and consent to participate

The studies were approved by the institutional review boards of Soochow University at Jiangsu Province and Fudan University, Shanghai, China. Written informed consent was obtained from each study subject.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional files

Additional file 1: Figure S1.

PCA analysis of the ESCC and EAC adjacent normal tissues. (PDF 4 kb)

Additional file 2: Table S1.

The methylation status of the five CpG sites in the GEO dataset and normal CD4+ and CD8+ T cells. Table S2 The methylation status of the five genomic regions in the young/old subgroups. Table S3 The methylation status of the five genomic regions in the male/female subgroups. Table S4 The methylation status of the five genomic regions in the smokers/non-smokers subgroups. Table S5 The methylation status of the five genomic regions in the alcohol/non-alcohol subgroups. Table S6 The methylation status of the five genomic regions in the alcohol/non-alcohol subgroups of male samples. Table S7 The designed primers of the five genomic regions for targeted bisulfite sequencing. (DOCX 51 kb)

Additional file 3: Figure S2.

PCA analysis for the ESCC and adjacent normal tissues in the validation dataset. (PDF 6 kb)

Additional file 4: Figure S3.

The ROC (Receiver Operating characteristics) curve for the subgroup analyzes. A-H represent the ROC curve for the young, old, male, female, smoked, non-smoked, alcohol, and non-alcohol subgroups, respectively. A-H each represent the overall ROC curve for the subgroup, which was calculated through a logistic regression model, incorporating the mean methylation percentage of the five genomic regions as the variables and without the adjustment for gender, age, and smoking status and alcohol status. (PDF 446 kb)

Additional file 5: Figure S4.

The expression profiles for the three genes using RNA-seq data from TCGA. (TIFF 3006 kb)

Additional file 6: Figure S5.

The detailed description of biomarker selection pipeline. (PDF 97 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark


  • Esophageal squamous cell carcinoma
  • DNA methylation
  • Biomarker
  • Diagnosis
  • Targeted bisulfite sequencing