Epigenetic imprinting alterations as effective diagnostic biomarkers for early-stage lung cancer and small pulmonary nodules

Early lung cancer detection remains a clinical challenge for standard diagnostic biopsies due to insufficient tumor morphological evidence. As epigenetic alterations precede morphological changes, expression alterations of certain imprinted genes could serve as actionable diagnostic biomarkers for malignant lung lesions. Using the previously established quantitative chromogenic imprinted gene in situ hybridization (QCIGISH) method, elevated aberrant allelic expression of imprinted genes GNAS, GRB10, SNRPN and HM13 was observed in lung cancers over benign lesions and normal controls, which were pathologically confirmed among histologically stained normal, paracancerous and malignant tissue sections. Based on the differential imprinting signatures, a diagnostic grading model was built on 246 formalin-fixed and paraffin-embedded (FFPE) surgically resected lung tissue specimens, tested against 30 lung cytology and small biopsy specimens, and blindly validated in an independent cohort of 155 patients. The QCIGISH diagnostic model demonstrated 99.1% sensitivity (95% CI 97.5–100.0%) and 92.1% specificity (95% CI 83.5–100.0%) in the blinded validation set. Of particular importance, QCIGISH achieved 97.1% sensitivity (95% CI 91.6–100.0%) for carcinoma in situ to stage IB cancers with 100% sensitivity and 91.7% specificity (95% CI 76.0–100.0%) noted for pulmonary nodules with diameters ≤ 2 cm. Our findings demonstrated the diagnostic value of epigenetic imprinting alterations as highly accurate translational biomarkers for a more definitive diagnosis of suspicious lung lesions.

of insufficient tumor morphological evidence to make a definitive pathological diagnosis [4]. Several genetic [5][6][7] and epigenetic biomarkers [8][9][10] have been developed for early cancer detection. However, the reliability and efficiency of these biomarkers have yet to be optimized for clinical applications [11].
As an important epigenetic regulatory mechanism in mammalian embryo development, genomic imprinting plays important roles in cancers [12,13]. In normal post-natal somatic cells, imprinted genes are "silenced", that is, mono-allelically expressed from either the maternal or the paternal allele, while in cancers, the silenced copies of some imprinted genes can be reactivated, leading to expression from both alleles. This loss of monoallelic gene regulation is named loss of imprinting (LOI), and has been previously found in various human cancers [13][14][15][16][17][18]. In addition to LOI, amplifications of the activated copies of imprinted genes without affecting the methylation of the silenced copy have also been observed in multiple cancer cell lines [19]. In both cases, the imprinted genes could be expressed at two or more transcription sites instead of one. Therefore, an increased number of detected transcription sites of imprinted genes in the cell nuclei could be used as a potential cancer biomarker. The nascent RNA or pre-mRNA in situ hybridization (ISH) method targeting the short-lived introns can be used to visualize and label these transcription sites [20][21][22][23], and has been widely applied to study the transcriptional regulation of both imprinted genes [24][25][26][27] and non-imprinted genes [28,29]. In our previous study, we adopted this intron-targeted labeling approach and developed an objective quantification of epigenetic imprinting alterations through the biallelic (BAE), multiallelic (MAE) and total (TE) expression measures, which we termed quantitative chromogenic imprinted gene in situ hybridization (QCIGISH) [30]. Based on the elevated BAE, MAE and TE signatures observed for various cancers over benign lesions, we formulated a statistical malignancy prediction model and identified GNAS, GRB10 and SNRPN as effective diagnostic biomarkers in ten different cancer types, including lung cancer [30].
Despite the preliminary model achieving 92% sensitivity and 88% specificity for lung cancer diagnosis, opportunities to further advance the diagnostic performance of QCIGISH in clinical applications need to be explored.
In this study, aiming to develop a lung cancer-specific diagnostic model with improved accuracy, we expanded the imprinted gene panel with a fourth imprinted gene, minor histocompatibility antigen H13 (HM13). As amplification of the HM13 locus has been previously reported in several lung cancer cell lines [19], it is likely to demonstrate multiallelic expression using our QCIGISH method. We conducted a differential analysis and statistical evaluation of the QCIGISH epigenetic imprinting alteration measurements obtained for the normal, benign and malignant lung tissue specimens. To pathologically confirm the relationship between epigenetic imprinting and carcinogenesis, we performed a comparative examination between the imprinting signatures obtained from QCIGISH and morphological characteristics determined through histologic staining. From the alteration patterns, we developed a diagnostic grading model for lung tissue specimens, tested and validated the model using cytology and small biopsy specimens obtained via bronchoscopy or transthoracic CNB, and evaluated the results in comparison with standard diagnostic biopsies. We particularly investigated the diagnostic value of epigenetic imprinting biomarkers in providing clearer malignancy differentiation, especially for early-stage lung cancers, with the objective of improving the accuracy of standard diagnostic biopsies for lung lesions.

Patient characteristics
Clinicopathological characteristics of the patient groups in the imprinted gene screening (30 lung cancers and 30 benign lesions), model building and marker pre-selection (174 lung cancers, 51 benign lesions and 21 normal controls), model testing (21 lung cancers and 9 benign lesions) and model validation (117 lung cancers and 38 benign lesions) cohorts are described in Fig. 1 and Additional file 1: Figs. S1-2, and are comparatively analyzed, statistically evaluated and summarized in Table 1 and Additional file 2: Table S1.

Evaluation of candidate imprinted gene biomarkers
To evaluate the diagnostic performance of the candidate imprinted gene HM13 against the GNAS, GRB10 and SNRPN panel, we performed a stratified random sampling of 30 tissue specimens each from the benign and malignant subgroups of the model building set (Fig. 1 and Additional file 2: Table S1). QCIGISH was applied on the 60 samples to determine the BAE, MAE and TE measurements for all four genes (Fig. 2A). Using the expression status of the imprinted genes GNAS, GRB10 and SNRPN, malignancy predictions for the samples were obtained using the QCIGISH binary classification model developed in our previous study [30]. The receiver operating characteristic (ROC) areas under the curve (AUC) of the BAE, MAE and TE measurements for the imprinted gene HM13 were individually compared to the ROC AUC of the binary classification model. Relative to the previous binary classification model, which combined the GNAS, GRB10 and SNRPN genes, significantly higher AUC values were observed for HM13 MAE (p = 0.008) and BAE (p = 0.044), but not for TE (p = 0.511) (Fig. 2B and Additional file 2: Table S2). These findings substantiated the expansion of the GNAS, GRB10 and SNRPN multi-marker panel to four imprinted genes including HM13.

Differential epigenetic imprinting alteration signatures in lung tissue specimens
In the subsequent evaluation to identify the most efficient imprinting markers among the four-gene panel, heatmap analysis showed elevated allelic expression patterns for the 174 lung cancers as compared to the 51 benign lesions and 21 normal controls (Additional file 1: Fig. S3A). Statistical evaluation of the BAE, MAE and TE status between these groups demonstrated a substantial increase in imprinting alterations (all p < 0.05) for the malignant cases as compared to both benign and normal samples (Additional file 1: Fig. S3B, Additional file 2: Table S3-S6). Significantly higher BAE and TE (all p < 0.05) were also observed for benign lesions as compared to normal controls (Additional file 1: Fig. S3B, Additional file 2: Table S3-S6).
Elevated imprinting alterations were pathologically confirmed to be associated with tissue morphology. As illustrated in Fig. 3A, increasing aberrant imprinting signatures were observed between the normal, paracancerous and malignant regions on the same tissue section. Further analysis in different tissue sections showed clear differences in the allelic expression status of imprinted genes between benign and malignant cases (Fig. 3B). Imprinting alterations were visually detected to a greater extent for lung cancers, with elevated expressions observed as early as adenocarcinoma in situ, effectively distinguishing lung malignancy from benign lesions from a pathological perspective.

QCIGISH lung cancer diagnostic grading model building and testing
From the comparative analysis of the malignancy discrimination between imprinting alteration markers, MAE consistently demonstrated higher ROC AUC (0.87 to 0.94) and was the best marker across all genes as compared to BAE (0.84-0.93) and TE (0.78-0.86) (Additional file 1: Fig. S4). In addition, when applying optimal thresholds to dichotomize BAE, MAE and TE into positive and negative categories (Additional file 2: Table S7), MAE demonstrated good specificity and sensitivity for all benign lung lesion subtypes and lung cancer subtypes included in this study (Additional file 1: Fig. S5). Therefore, we identified MAE as the most effective imprinting biomarker over BAE and TE. As each gene demonstrated distinct diagnostic efficacies across the different benign lesion and cancer subtypes, MAE from all four genes were used during diagnostic model building.
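As an illustration of the ROC AUC comparison described above, the following minimal Python sketch computes the AUC of a single marker as the probability that a randomly chosen malignant sample has a higher value than a randomly chosen benign sample (ties counted as one half), and then dichotomizes the marker at a threshold. The function names, the MAE values and the threshold are hypothetical and chosen only for illustration; they are not data or thresholds from this study, and this is not the analysis code used by the authors.

```python
def roc_auc(malignant, benign):
    """AUC as the fraction of (malignant, benign) pairs ranked correctly.

    Ties contribute 0.5, which matches the Mann-Whitney U formulation.
    """
    if not malignant or not benign:
        raise ValueError("both groups must be non-empty")
    score = 0.0
    for m in malignant:
        for b in benign:
            if m > b:
                score += 1.0
            elif m == b:
                score += 0.5
    return score / (len(malignant) * len(benign))


def sensitivity_specificity(malignant, benign, threshold):
    """Call a sample positive when its marker value is >= threshold."""
    true_pos = sum(1 for m in malignant if m >= threshold)
    true_neg = sum(1 for b in benign if b < threshold)
    return true_pos / len(malignant), true_neg / len(benign)


# Hypothetical MAE percentages for one gene (illustrative only).
mae_malignant = [12.0, 8.5, 15.2, 6.0]
mae_benign = [3.1, 6.0, 2.4]

auc = roc_auc(mae_malignant, mae_benign)
sens, spec = sensitivity_specificity(mae_malignant, mae_benign, threshold=7.0)
```

With these toy values, one malignant sample ties with one benign sample, so the AUC falls slightly below 1, and the chosen threshold trades one missed malignant case for perfect specificity.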
We subsequently developed the classification model for distinguishing lung malignancy on the basis of the MAE imprinting alteration signatures from the prior analysis. We adopted the decision tree ensemble model structure from our previous study, which combined individual gene classifiers to create more robust diagnostic predictions [30], but upgraded the malignancy classification system from two to five levels and only used the MAE status for each gene (Additional file 1: Fig. S6 and Fig. S7). Through a simulation study of different threshold combinations, twenty candidate models with equally optimal sensitivity and specificity using the top one grade or top two grades were determined (Additional file 1: Fig. S8 A-D, Additional file 2: Table S8 and S9). The twenty candidate models were further tested in an independent set of cytology and small biopsy samples obtained via bronchoscopy or transthoracic CNB to determine the optimal thresholds for the final model. With thresholds 1 to 4 set at 81% specificity, 98% specificity, 46% sensitivity and 40% sensitivity targets, respectively, the model using the top two highest grades demonstrated the best diagnostic performance and was determined as the final model, achieving 95.2% sensitivity (95% CI 86.1-100.0%) and 100.0% specificity in the test set (Additional file 1: Fig. S9). The final QCIGISH diagnostic grading model was locked on January 4, 2020, with the process flow and threshold values for the individual genes summarized in Additional file 1: Figure S10 and Additional file 2: Table S11, respectively.

Diagnostic performance comparison between the QCIGISH method and cytology and small biopsy pathology
Compared with standard cytology and small biopsy pathology using the same set of specimens, the QCIGISH diagnostic grading model demonstrated higher AUC values for both best-case (BCC, indeterminate results considered as positive) and worst-case (WCC, indeterminate results considered as negative) conditions (0.99 vs 0.94 and 0.92 with p = 0.033 and p < 0.001, respectively, Additional file 1: Fig. S12 and Additional file 2: Table S14). QCIGISH demonstrated better accuracy than cytology and small biopsy pathology particularly for very early cancer stages (carcinoma in situ to Stage IB) (p = 0.041 for BCC and p = 0.004 for WCC, Fig. 5A and Additional file 2: Table S15). For SCLC, QCIGISH also showed 100.0% sensitivity for both limited stage and extensive stage (Fig. 5B and Additional file 2: Table S15).
To illustrate, QCIGISH was able to accurately classify two preoperatively diagnosed benign cases from small biopsy pathology (normal and benign lung tissues from bronchial biopsy) into malignant cases which were surgically verified as adenocarcinoma in situ and invasive adenocarcinoma (Fig. 5E).

Discussion
The accurate diagnostic evaluation of pulmonary nodules and early-stage lung cancers currently remains a huge clinical challenge for standard diagnostic biopsies due to the insufficient tumor morphological evidence required to make a definitive cancer diagnosis. Epigenetic pathways remain a research hotspot in early lung cancer detection because of clearer evidence that their alterations in lung carcinogenesis most often predate malignant morphological changes [8]. Although epigenetic alterations have been recognized as a potentially powerful tool for earlier diagnosis of lung cancer, epigenetic biomarkers have not been widely used in clinical practice. Altered genomic imprinting triggered by epigenetic changes is proposed to occur before tumor formation and promote tumor progression [8,13]. While many researchers are exploring the changes of allele-specific DNA methylation in cancers, we focused on the transcriptional activity of imprinted gene loci on alleles, which could be better clinically visualized and quantified. In our previous study, we developed a novel QCIGISH method to evaluate the allelic expression status of imprinted genes and demonstrated the diagnostic significance of the elevated allelic expressions for imprinted genes as effective translational biomarkers for multiple cancers [31,32].
In this study, we have developed a diagnostic grading model from highly sensitive and specific epigenetic imprinting-based biomarkers using the GNAS, GRB10, SNRPN and HM13 gene panel that can be used for a more accurate and definitive diagnostic biopsy evaluation of lung lesions. Our QCIGISH diagnostic grading model developed from the multiallelic imprinting alterations of this gene panel achieved excellent overall accuracy (99.1% sensitivity and 92.1% specificity) for diagnosing lung lesions from lung cytology and small biopsy specimens. In comparison with standard diagnostic biopsies, QCIGISH was more sensitive in detecting malignancies at their early curative stages (96.0% vs 68.0% sensitivity for CIS-Stage IA, 100.0% vs 68.6% sensitivity for Stage IB), and more accurate in distinguishing benign from malignant pulmonary nodules (100.0% vs 66.7% for < 2 cm, 100.0% vs 87.5% for 2-3 cm) with comparable specificities. All these findings demonstrated the capability of epigenetic imprinting biomarkers to provide clearer and earlier evidence of cancer than morphology. The excellent diagnostic performance and predictive ability of QCIGISH make this molecular test a robust and useful clinical decision-enabling technology which could improve the accuracy of standard diagnostic biopsies, particularly for early-stage lung cancers and small pulmonary nodules.
LOI of the GNAS gene has been reported to be associated with increased risks of multiple cancers including thyroid cancer, skin cancer, osteosarcoma and neurofibromatosis [33]. Similarly, studies have shown a relationship between the aberrant methylation of the GRB10 gene and invasive breast cancer [34]. Hypomethylation of the SNRPN gene has also been linked to breast cancer and seminoma [35,36]. Moreover, research has shown that LOI and upregulation of the HM13 gene are both involved in breast cancer, in addition to its functional relationship with glioblastoma progression [14,37]. However, to our knowledge, the potential relationship between these four imprinted genes and lung cancer development has yet to be explored. Using this four-gene panel, our QCIGISH method has detected significantly elevated BAE, MAE and TE in lung cancers as compared to benign lesions. Increased allelic expression can result from either LOI with the normally silenced copy of the gene reactivated [13] or copy number variation (CNV) with the active copy of the gene amplified but the inactive copy still silenced [19]. Our QCIGISH method only detects the transcriptionally active copies, which potentially limits the capability to determine the specific mechanisms driving the increased allelic expression. Further studies are needed to explore the prospective roles of LOI and CNV in the increased allelic expressions of imprinted genes during lung cancer development. Our additional analyses across the different disease subtypes identified MAE as the more effective malignancy biomarker over BAE and TE. This observation might be particularly related to the precocious occurrence of imprinting alterations in tumors. Higher TE was observed for both lung inflammatory lesions and lung cancer and was therefore determined as not optimally effective in differentiating malignancy.
Higher BAE, representing early epigenetic or genetic alterations of imprinted genes which might precede morphological changes in cells and tissues indicative of malignancy, demonstrated unsatisfactory diagnostic specificity for lung diseases. Higher MAE, which subsequently develops after BAE, effectively demonstrated good malignancy discrimination consistent with current pathological evidence. Further exploratory studies are, however, needed to investigate the biological implications of elevated TE, BAE and MAE levels in other cancer types with varying pathophysiology.
From the simultaneous comparative pathological evaluation performed using QCIGISH and H&E staining on the same block resected near the cancer-bearing tissue region, elevated allelic expressions effectively conformed with malignant morphological features. These results highlighted the diagnostic significance of epigenetic imprinting alterations as clear and reliable distinguishing markers for lung malignancy. Therefore, epigenetic imprinting biomarkers could effectively provide a definitive diagnosis of lung cancers especially when clear tumor morphological evidence is insufficient.
Clinical studies have shown that nodule morphological characteristics such as diameter size, among others, have been associated with an increased risk of malignancy [38]. However, current preoperative biopsies for these small nodules may be inadequate to make a definitive diagnosis. While diagnostic guidelines differ between countries, nodules with diameters smaller than 2 cm are generally recommended for a 24-month CT follow-up instead of immediate surgical intervention. Therefore, progressive malignant tumors are not promptly identified to permit timely clinical management [39]. In recent years, as more sub-centimeter nodules are detected with the expanding population receiving LDCT screening [3], more accurate and definitive diagnostic methods for pulmonary nodules have become increasingly essential.
As more than 50% of CT-detected lung cancers are reported as Stage I [3], QCIGISH addresses this unmet clinical need of accurately detecting potentially malignant cases among small pulmonary nodules which are usually at their early stages. Our results showed that QCIGISH could positively detect truly malignant cases from biopsies potentially diagnosed as benign or indeterminate due to unclear morphological evidence, helping tackle a significant clinical diagnostic challenge [40]. The application of QCIGISH now enables the discovery of these early lung cancers which could lead to better clinical outcomes by permitting timely treatment and reducing the uncertainty of delayed monitoring of malignant cases, ultimately increasing patients' survival.
It is interesting that the diagnostic grading model developed from NSCLC samples can also be applied for SCLC, as we discovered that both shared similar epigenetic alterations using the imprinted gene panel despite their different cell origins and distinct genetic alterations [41]. As SCLC patients have very poor prognosis because of late-stage diagnosis [42], QCIGISH could be clinically useful by also effectively supporting the early prediction and accurate diagnosis of SCLC using cytology and small biopsy specimens.
This study had several limitations. First, our validation cohort was drawn from only six Chinese hospitals; a more conclusive validation could be achieved using a prospective large-scale evaluation involving more medical centers and higher patient case numbers with more diverse clinical characteristics and disease subtypes. Second, we monitored the clinically diagnosed benign cases for only two years; a longer follow-up period could provide a more accurate clinical validation, especially for slowly progressive lung cancer cases. Third, there are opportunities to further optimize the gene probes that we used; further exploration could improve the diagnostic model's accuracy and better characterize more cancer subtypes while maintaining a minimally efficient number of probes. Lastly, due to a substantial number of cases with unclear LDCT features, particularly among benign lesions, radiological features such as solid, subsolid and ground glass were not considered in the analysis, although their inclusion could have provided vital perspectives toward malignancy differentiation, especially for early-stage lung cancers.

Conclusions
This study demonstrated how epigenetic imprinting biomarkers effectively provided clearer and earlier differentiation of lung cancer than morphology. The high sensitivity and specificity make this test particularly effective in ruling out and ruling in malignancy in lung lesions. Capitalizing on the strength of highly sensitive and specific epigenetic translational biomarkers and a clinically viable technique, QCIGISH represents a reliable epigenetics-based approach and a decision-enabling technology for a more accurate and definitive cytology and small biopsy specimen diagnosis of small pulmonary nodules and early-stage lung cancers. Thus, as an adjunctive procedure to standard biopsies for lung lesions, this novel imprinting biomarker-based diagnostic test has a high potential to improve current clinical treatment decisions, and ultimately health outcomes.

Study design and sample collection
A total of 431 subjects recruited from eight Chinese medical centers were found eligible for the study and were divided into three sets based on specimen type and sample collection date as shown in Fig. 1 and Additional file 1: Figure S1. For the imprinted gene screening, biomarker pre-selection and diagnostic model building set, 283 formalin-fixed and paraffin-embedded (FFPE) surgically resected and histologically diagnosed lung tissue specimens were retrospectively collected. For the model testing set, 35 bronchoscopy and transthoracic CNB sampled lung small biopsy specimens were retrospectively collected. For the blinded model validation set, 240 patients with lung lesions detected on chest CT scans (see Additional file 2: Materials and Methods) were recruited and were clinically examined using bronchoscopy or transthoracic CNB (see Additional file 2: Materials and Methods). The sources and collection time periods of the samples are shown in Fig. 1 and Additional file 1: Figure S2. The demographic and clinical characteristics of the study subjects are provided in Table 1 and Additional file 2: Table S1. The corresponding surgical histopathology was reviewed by three pathologists, namely RS, HY and WH. CB maintained the blinded data and oversaw the evaluation process. This study has been registered at ClinicalTrials.gov (clinical trial ID: NCT03882684).

Sample preparation and QCIGISH detection
The lung tissue specimens and the lung cytology and small biopsy specimens were prepared using a previously described procedure [30]. Briefly, FFPE tissue samples were cut into 10-μm sections and mounted on positively charged slides. Cytology and small biopsy samples were fixed immediately after sampling in 10% neutral buffered formalin (NBF) for 48 h at room temperature. The dissociated cells were directly mounted onto positively charged slides. With probes targeting the non-coding intronic regions of nascent RNAs for the GNAS, GRB10, SNRPN and HM13 imprinted genes, ISH was applied following a previously described procedure using the RNAscope 2.5 HD Assay kit (Advanced Cell Diagnostics, Newark, CA, USA) [30]. The detected gene-expressing sites were visualized as distinct red or brown dots under a standard bright-field microscope after signal amplification (Fig. 2A). The numbers of nuclei containing no signal (N0), one signal (N1), two signals (N2), and more than two signals (N2+) were collected from the microscopic images using the procedure as previously described [30] and were used to calculate the respective biallelic expression (BAE), multiallelic expression (MAE) and total expression (TE) according to the equations shown in Fig. 2A. The minimum nuclei counts applied for processing tissue and cell samples using QCIGISH were determined as 1500 and 1000, respectively, with details described under Additional file 2: Materials and Methods (Additional file 1: Fig. S13). The technicians who performed QCIGISH detection had no pathology background and were blinded to the simultaneous H&E staining results.
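The equations in Fig. 2A are not reproduced in this text. As a minimal sketch only, the Python snippet below computes the three measures from the four nuclei counts under the assumption that BAE and MAE are the percentages of signal-bearing nuclei with exactly two and with more than two signals, respectively, and that TE is the percentage of all counted nuclei bearing at least one signal. These definitions, the class and function names, and the example counts are assumptions made for illustration and should be checked against Fig. 2A and the earlier QCIGISH report [30].

```python
from dataclasses import dataclass


@dataclass
class NucleiCounts:
    n0: int       # nuclei with no signal
    n1: int       # nuclei with one signal
    n2: int       # nuclei with two signals
    n2_plus: int  # nuclei with more than two signals


def imprinting_measures(counts, min_nuclei=0):
    """Return BAE, MAE and TE as percentages (assumed definitions).

    min_nuclei is an optional quality gate; the text quotes minimum
    counts of 1500 (tissue) and 1000 (cell samples).
    """
    expressing = counts.n1 + counts.n2 + counts.n2_plus
    total = counts.n0 + expressing
    if total < min_nuclei:
        raise ValueError("too few nuclei counted for a reliable estimate")
    if expressing == 0:
        raise ValueError("no signal-bearing nuclei; BAE and MAE are undefined")
    return {
        "BAE": 100.0 * counts.n2 / expressing,
        "MAE": 100.0 * counts.n2_plus / expressing,
        "TE": 100.0 * expressing / total,
    }


# Hypothetical counts for one gene on one tissue section.
example = NucleiCounts(n0=500, n1=700, n2=200, n2_plus=100)
measures = imprinting_measures(example, min_nuclei=1500)
```

Under these assumed definitions, the example section yields BAE of 20%, MAE of 10% and TE of about 66.7%.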

Imprinted gene screening and biomarker pre-selection
In our previous study, we identified three imprinted genes, GNAS, GRB10 and SNRPN, for the diagnosis of ten cancer types including lung cancer [30]. Aiming to improve the diagnostic performance of the QCIGISH binary classification model with 92% sensitivity and 88% specificity [30], we evaluated a new imprinted gene, HM13, which was reported to be involved in breast cancer and glioblastoma [14,37].
To subsequently evaluate the imprinting alteration signatures and pre-select candidate biomarkers for diagnostic model building, the discrimination performance for the BAE, MAE and TE measurements for each imprinted gene was individually assessed in a pooled analysis for each disease subtype (see Additional file 2: Materials and Methods, Additional file 1: Fig. S5).

Diagnostic grading model building and testing
Using the imprinting patterns from the most effective markers determined in the prior analysis, the previously developed malignancy classification model structure using an ensemble of individual gene classifiers [30] was updated by extending the diagnostic output from binary response to a five-level grading system (Additional file 1: Fig. S6 and Fig. S7). The development of the diagnostic algorithm and the corresponding evaluation and optimization of the model thresholds in the model building set are detailed under Additional file 2: Materials and Methods. Based on the evaluation results, a number of models with varying threshold combinations which demonstrated an optimal range of diagnostic accuracies were further tested in an independent set of 30 cytology and small biopsy specimens. The candidate model which showed the best diagnostic performance after testing was determined as the final model, with all threshold specifications locked prior to validation in an independent cohort of 155 patients.

Statistics
Continuous variables were reported as medians with interquartile ranges (IQR), while frequencies and proportions were reported for categorical variables. Continuous clinical variables were compared between groups using the Mann-Whitney U and Kruskal-Wallis tests, as applicable, driven by the non-normal distributions determined using the Shapiro-Wilk test [43]. Dunn's test was performed as a post hoc test for the pairwise comparisons between each independent group with Bonferroni correction applied during p-value determination [44]. Categorical clinical variables were compared using Chi-square or Fisher exact tests, as applicable.
Diagnostic discrimination performance was assessed and compared using the receiver operating characteristics area under the curve (ROC AUC) metric with 95% confidence intervals determined using the DeLong method [45]. Sensitivity, specificity and their respective normal-based 95% confidence intervals were computed using standard methods. Diagnostic sensitivities and specificities obtained using QCIGISH were evaluated against cytology and small biopsy pathology using McNemar's test for paired data [46].
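To make the normal-based interval concrete, the following Python sketch computes a proportion with its Wald 95% confidence interval, truncated to the range 0 to 1. The group sizes (117 lung cancers and 38 benign lesions in the validation cohort) are taken from the text; the correct-call counts of 116 and 35 are back-calculated here from the reported 99.1% sensitivity and 92.1% specificity and are therefore assumptions, as is the truncation of the upper bound at 100%. The function name is hypothetical.

```python
import math


def wald_interval(successes, total, z=1.96):
    """Proportion with a normal-approximation (Wald) confidence interval.

    Bounds are clipped to [0, 1] because the Wald interval can overshoot
    when the proportion is close to 0 or 1.
    """
    if total <= 0 or not 0 <= successes <= total:
        raise ValueError("need 0 <= successes <= total and total > 0")
    p = successes / total
    half_width = z * math.sqrt(p * (1.0 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Counts inferred from the reported percentages (assumption, see lead-in).
sens, sens_lo, sens_hi = wald_interval(116, 117)
spec, spec_lo, spec_hi = wald_interval(35, 38)
```

With these inferred counts, the sketch gives a sensitivity of 99.1% with bounds 97.5% and 100.0%, and a specificity of 92.1% with bounds 83.5% and 100.0%, which agrees with the intervals reported for the blinded validation set.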
All hypothesis tests were done in a two-sided manner, with computed p < 0.05 considered to be statistically significant. All statistical analyses and visualizations were performed using R software (version 3.5.0) [47]. Sample size justification is described under Additional file 2.
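For the paired comparison of QCIGISH against cytology and small biopsy pathology, McNemar's test uses only the discordant pairs, that is, the specimens on which the two methods disagree. The text does not state which variant of the test was applied, so the Python sketch below uses the exact two-sided binomial form as one possible choice; the function name and the discordant counts are hypothetical and are not results from this study.

```python
import math


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: pairs where method A is correct and method B is wrong
    c: pairs where method A is wrong and method B is correct
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    if b < 0 or c < 0:
        raise ValueError("counts must be non-negative")
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical example: 8 cancers detected only by method A, none only by B.
p_value = mcnemar_exact(8, 0)
```

In this toy case the two-sided p-value is 2/256, below the 0.05 significance level stated above, whereas balanced discordant counts such as 3 and 3 give a p-value of 1.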