Skip to main content

Clusterization in head and neck squamous carcinomas based on lncRNA expression: molecular and clinical correlates



Long non-coding RNAs (lncRNAs) have emerged as key players in a remarkably variety of biological processes and pathologic conditions, including cancer. Next-generation sequencing technologies and bioinformatics procedures predict the existence of tens of thousands of lncRNAs, from which we know the functions of only a handful of them, and very little is known in cancer types such as head and neck squamous cell carcinomas (HNSCCs).


Here, we use RNAseq expression data from The Cancer Genome Atlas (TCGA) and various statistic and software tools in order to get insight about the lncRNome in HNSCC. Based on lncRNA expression across 426 samples, we discover five distinct tumor clusters that we compare with reported clusters based on various genomic/genetic features. Results demonstrate significant associations between lncRNA-based clustering and DNA methylation, TP53 mutation, and human papillomavirus infection. Using “guilt-by-association” procedures, we infer the possible biological functions of representative lncRNAs of each cluster. Furthermore, we found that lncRNA clustering is correlated with some important clinical and pathologic features, including patient survival after treatment, tumor grade, or sub-anatomical location.


We present a landscape of lncRNAs in HNSCC and provide associations with important genotypic and phenotypic features that may help to understand the disease.


Head and neck cancer is the sixth leading cancer worldwide, with an estimated 600,000 new cases annually and a 50% 5-year mortality rate (Globocan 2012) [1]. As more than 90% of this cancer cases are of squamous origin, they are generally referred to as head and neck squamous cell carcinoma (HNSCC). HNSCC arises in the upper aerodigestive tract, comprising the nasal cavity and paranasal sinuses, oral cavity, pharynx, larynx, and trachea. The main risk factors associated with its development are tobacco and alcohol consumption, which have a synergistic effect when combined, and also human papillomavirus (HPV) infection [2, 3]. HPV is known to drive tumorigenesis through the actions of its major oncoproteins E6 and E7, which can inactivate p53 and retinoblastoma (Rb) tumor supressors, respectively, altering cell cycle regulation in infected cells. HPV-positive differ from HPV-negative HNSCCs in tumor biology and clinical characteristics, including clinical outcomes, since HPV-positive tumors have been associated with a more favorable prognosis [4]. HNSCC patients are frequently treated with surgery, together with radiotherapy and/or cisplatin-based chemotherapy. Patients with aggressive disease are treated with cetuximab, an anti-EGFR antibody. Few patients respond to this therapy, and there is no molecular stratification of the patients nor biomarkers of responsiveness [2]. Previous characterization of molecular features in HNSCC, particularly with the aid of large-scale cancer genomics initiatives such as The Cancer Genome Atlas (TCGA), has generated important insights for stratifying patients and delineating tumor subtypes [57]. These multiomic analyses do not take into account the vast long non-coding transcriptome that may substantially contribute to HNSCC pathogenesis and progression.

LncRNAs are transcripts longer than 200 nucleotides that have no apparent protein coding potential [8]. They are highly diverse and actively present in many aspects of cell biology, including cellular differentiation, proliferation, DNA damage response, dosage compensation, and chromosomal imprinting. LncRNAs are categorized as exonic, intronic, intergenic, antisense, or overlapping based on their genomic location relative to a protein coding gene. The most recent estimate of the Encyclopedia of DNA Elements (ENCODE) Project Consortium (GENCODE version 25) is that the human genome contains more than 15,000 lncRNA genes that encode almost 28,000 transcripts [9] although the total number is estimated to be much higher. Like the protein coding genes (PCGs), lncRNA genes are regulated transcriptionally and by histone modification, and lncRNA transcripts are processed by the canonical splicing machinery [10]. In addition, lncRNAs have fewer exons than PCGs, are usually located in the nucleus, are subject to less selective pressure during evolution, and show higher tissue or cell type specificity. Given the large number of lncRNAs that are predicted to exist, it is expected that the functions in which they are involved may include many of the known (and possibly new) biological and physiological processes. Some of the molecular functions described so far include chromatin interactions, transcriptional regulation, RNA processing, mRNA stability or translation, or signaling cascade modulation [11]. In the context of cancer phenotypes, lncRNA functions have been found in proliferation, growth suppression, motility, immortality, angiogenesis, and viability [11, 12]. Unfortunately, the possible functions or phenotypic effect of the vast majority of lncRNAs remains elusive in normal homeostasis or in cancer [12, 13], and is difficult to analyze [14].

Comprehensive genomic analysis across human cancers demonstrated that a large number of lncRNAs show differential expression among known tumor subtypes [1517]. In addition, lncRNA alterations are highly tumor and lineage specific [1517]. The expression of known, tumor-associated lncRNAs, such as HOTAIR, NEAT1, UCA1, MALAT1, and MEG3, has been tested in HNSCC and correlated with clinicopathologic parameters [18]. Functional studies have tested the effects in proliferation, apoptosis, invasion, and migration in HNSCC cell lines after RNA interference (RNAi) of specific lncRNAs [19]. In addition, lncRNA profiling has been done in HNSCC, assessing deregulation between normal and tumor samples, associations with clinical parameters, HPV infection or mutation in the TP53 tumor suppressor gene [1921].

To our knowledge, no comprehensive reports studying the clusterization of human primary HNSCC samples based on genome-wide lncRNA expression have been published. Therefore, the main objective of this report is to analyze the long non-coding transcriptome in HNSCC in order to discover new tumor clusters. Furthermore, we investigate whether lncRNA-based clusterization is useful to predict patient’s clinical outcome, and is associated with important clinicopathologic parameters. Finally, we interrogate the dataset to infer the possible biological functions of lncRNAs in HNSCC based on correlation patterns of the PCGs.


Data resources

Expression values of 12,727 lncRNA genes from 426 HNSCC primary tumor samples of the TCGA RNAseq cohort were downloaded from the TANRIC web page [22]. Methods used to extract these expression values have been described [16]. Briefly, the genomic coordinates of the human lncRNAs from the GENCODE Resource (version 19) were obtained. Thereafter, the lncRNA exons that overlapped with any known coding genes based on the gene annotations of GENCODE and RefGene were filtered out. As a result, the analysis focused on the remaining 12,727 lncRNAs. Based on the BAM files, the expression levels were quantified as RPKM, and the lncRNAs with detectable expression were defined as those with an average RPKM ≥0.3 across all samples in each cancer type, as defined in the literature. The Ensembl identifiers of the 500 lncRNA genes displaying the highest variability in HNSCC were kindly provided by Dr. H. Liang at the Research Group from the Department of Bioinformatics and Computational Biology of The University of Texas and the MD Anderson Cancer Center. We downloaded DNA methylation, CNV, miRNA, and reverse phase protein array (RPPA) clustering data, as well as HPV infection and PCG mutations for the TCGA HNSCC dataset using the cBioPortal [23], the UCSC Xena [24] and the TCGA [25] repositories.

Unsupervised clustering analysis

Consensus Cluster Plus (CCP) tool [26], which is implemented as an R language package from Bioconductor [27], extends the CC algorithm and is briefly described here. The algorithm begins by subsampling a proportion of items and a proportion of features from a data matrix. Each subsample is then partitioned into up to k groups by a user-specified clustering algorithm. This process is repeated for a specified number of times. Pairwise consensus values, defined as “the proportion of clustering runs in which two items are grouped together,” are calculated and stored in a consensus matrix (CM) for each k. Clustering settings used were as follows: maxK = 6; number of bootstraps = 1000; item subsampling proportion = 0.8; feature subsampling proportion = 1; cluster algorithm = pam; inner linkage type = complete; final linkage type = complete; correlation method = Euclidean. Consensus CDF plot and proportion of ambiguous clustering (PAC) per each k was obtained.

Analysis of tumor clusters revealed by lncRNA expression

The association between lncRNA-based sample clusters and sample clusters based on other molecular features/aberrations was done using Fisher’s exact test for fourfold (2 × 2) tables or chi-square test for more than fourfold comparisons. Both tests are used to determine whether there are significant differences between the expected frequencies and the observed frequencies in one or more categories. Kaplan-Meier curves were obtained using the follow-up time of the HNSCC patients of two end-points: recurrence and death. Statistical analyses were done with SPSS 14.0.

Guilt-by-association (GBA) analysis

Pearson correlation analysis was used to select lncRNA genes with significant correlation between artificial expression vectors or templates (t1 to t5) (Fig. 2a), or between mean expression values of correlated lncRNAs and PCGs. We select both direct and inverse correlation patterns by setting thresholds either at the Pearson (r) value or at the associated p val. Pearson’s r values range between −1 and +1, such that two perfectly correlated genes display r = 1, and two perfectly anticorrelated genes r = −1. lncRNA surrogate selection was done setting r > 0.3 with respect to the corresponding template (associated p val <1 × 10−6). Additional filtering criteria include average RPKM ≥0.1 within specific clusters, and overexpression between normal tissue and specific clusters (T test, corrected p val <0.05). PCGs directly or inversely correlated with lncRNAs were selected using r > 0.3, or r < −0.3 with respect to mean expression values of selected lncRNAs per cluster, respectively (associated p val <1 × 10−9). Correlations as well as heatmap drawings were performed using MultiExperiment Viewer v4.9 (MeV) [28].

Gene ontology analysis

Selected PCGs were analyzed with the Gene Functional Annotation Tool available at the DAVID v6.7 website [29, 30] using their official gene symbols. Gene ontology option GOTERM_BP_FAT was selected and a functional annotation chart generated. A maximum p value of 0.05 was chosen to select only significant categories.


HNSCC sample clusters based on lncRNA expression

In order to discover HNSCC tumor subgroups, we selected the 500 lncRNAs (Additional file 1: Table S1) with the most variable expression pattern in HNSCC [16] and the Consensus Cluster Plus (CCP) software tool (see “Methods”). CCP analysis revealed the presence of five HNSCC clusters, which we have named clusters 1 to 5 (Fig. 1 and Additional file 1: Table S2). The results suggest that lncRNAs can distinguish five HNSCC subtypes, having significant differences in lncRNA expression.

Fig. 1
figure 1

Unsupervised clustering of HNSCC using lncRNA expression data. a Consensus Cluster Plus analysis identifies five major groups (samples, n = 426). The blue and white heatmap displays sample consensus. Number of samples per cluster is shown. Consensus CDF plot (b) and PAC values (c) for k = 2 to 6 are represented. Smaller PAC values are obtained with k = 5 and k = 6, with minor differences between them. Therefore, k = 5 was selected. Specifications and parameters used in the analysis are described in the “Methods” section

LncRNA clustering resembles DNA methylation and mRNA clustering

In order to assess the correlation between lncRNAs and additional molecular features, we analyze whether lncRNA clustering resembles previously described sample clustering based on mRNA, miRNA, protein (RPPA) expression, DNA methylation, and CNV. For this, we downloaded clustering data from the TCGA HNSCC dataset [5] and compared them with lncRNA subgroups using contingency analyses (Chi-square test, see “Methods”). The results showed highly significant similarities between our clusterization and DNA methylation clusters (p val = 4.6 × 10−34) or mRNA-based subtypes (p val = 4.2 × 10−30) (Fig. 2a). LncRNA clustering displays lower overlapping with CNV, miRNA, and RPPA subtypes, with p values ranging from 2.8 × 10−15 to 3.5 × 10−5 (Fig. 2a). The high overlapping between lncRNA and mRNA subgroups is suggestive of similar molecular mechanisms of expression control. Also, the contingency results suggest that lncRNA expression might be strongly influenced by DNA methylation (or vice versa), an important mechanism of epigenetic transcriptional regulation.

Fig. 2
figure 2

lncRNA clusters and other molecular aberrations. a LncRNA-based clustering of HNSCC samples is significantly associated with clustering based on diverse molecular features, mainly DNA methylation and expression of PCGs (mRNA). Significance values are plot upon chi-square test computation. b, c Association with HPV infection and HNSCC mutations with lncRNA clusters. Chi-square or odds ratio values are plot upon chi-square test (b) or Fisher’s exact test (c) computation, respectively. Dashed red line: threshold of significance (p val <0.05). d Distribution of lncRNA clusters and HPV-infected samples or samples with mutations in KMT2D or NSD1. Note the enrichment of HPV infection in c5, the NSD1 mutations in c1, and the KMT2D mutations in c2 (red lines) and the depletion of KMT2D mutations in c4 (green line). p values are calculated with Fisher’s exact test. Vertical black lines in d showed HPV+ samples and mutated samples for the selected genes KMT2D and NSD1, respectively

HPV-infected tumors display a specific lncRNAome

A significant proportion of tumor samples from the reported TCGA dataset [5] having 279 samples, is infected with HPV (almost 13%). Interestingly, a deep, multiplatform analysis of the molecular features of HNSCC primary tumors from the TCGA, demonstrated strong differences between HPV-positive and HPV-negative samples [5]. Infected cancers display different mutational landscape, mainly characterized by the absence of TP53 gene mutations. Chi-square test demonstrated that lncRNA clustering is highly significantly associated with HPV (p val = 1.5 × 10−29) (Fig. 2b), indicating infected tumors display a specific lncRNA landscape. Interestingly, a deeper analysis shows that almost all samples from cluster 5 (c5) are HPV positive (24 out of 29) (p val = 4.0 × 10−21) (Fig. 2d). Similar results were reported in the TANRIC study [16].

HNSCC mutations and lncRNA clustering

In order to discover whether particular lncRNA clusters are characterized by the presence or absence of HNSCC mutations, we interrogate lncRNA cluster and mutation associations using Chi-square test (Fig. 2b–d). We selected a total number of 19 genes, found to be significantly mutated in the TCGA HNSCC cohort: CDKN2A, FAT1, TP53, CASP8, AJUBA, PIK3CA, NOTCH1, KMT2D, NSD1, HLA-A, TGFBR2, HRAS, FBXW7, RB1, PIK3R1, TRAF3, NFE2L2, CUL3, and PTEN. Significant associations between these mutations and lncRNA-based clustering are shown (Fig. 2c). Cluster 1 (c1) is enriched in TP53 and NSD1 mutations, cluster 2 (c2) display frequent TP53 and KMT2D mutations, and cluster 4 (c4) frequent CASP8, NOTCH1, and CDKN2A mutations. In addition, cluster c5 is depleted of TP53 and CDKN2A mutations, as most samples are HPV-infected. Finally, cluster 4 (c4) is depleted of KMT2D mutations. Both KMT2D and NSD1 are methyltransferases involved in methylation of H3K4 and H3K36, respectively, which are important epigenetic regulators of gene expression. Most mutations found for both proteins are truncating, therefore indicating that c1 and c2 are characterized by loss-of-function of histone modifiers.

Guilt-by-association (GBA) analysis: selection of surrogate lncRNAs per cluster

In order to select candidate lncRNAs for further analysis, we decided to perform differential expression analysis using the clustering information. Our aim was to select lncRNAs surrogate of each of the five clusters, so they are overexpressed only in one cluster. We designed five expression vectors or templates, (t1 to t5, see Fig. 3a) from which we performed Pearson correlation analysis (see “Methods”) onto the lncRNA genes from which expression values are available for HNSCC (n = 12,727). The output are lncRNAs whose expression patterns are (i) expressed in the specific cluster, (ii) overexpressed in the specific cluster versus normal tissue, and (iii) significantly correlated with the corresponding template (r > 0.3 and p val <10−6). The number of selected lncRNAs per cluster and the heatmap showing expression patterns is shown (Fig. 3b and Additional file 1: Table S3): c1 = 91, c2 = 275, c3 = 9, c4 = 13, and c5 = 245. Note that clusters c2 and c5 display the highest numbers of lncRNA genes, and c3 and c4 the lowest. Interestingly, 16 out of 245 lncRNAs in cluster c5, depleted in TP53 mutations (Fig. 2c, d), have been shown to be overexpressed in TP53 wild type HNSCC samples [20]. In addition, 12 out of 245 c5 lncRNA genes were found upregulated in HPV-infected samples [20], in line with high frequency of HPV-positive samples in this cluster (Fig. 2d and Additional file 1: Table S5).

Fig. 3
figure 3

Guilt-by-association (GBA) analysis. a Schema graph showing approach used to search for lncRNA genes as surrogates of each cluster. Briefly, Pearson correlation was computed between all lncRNAs and artificial expression vectors or templates (templates t1 to t5) whereby maximum expression in individual clusters with respect to the others was interrogated. Threshold used: r > 0.3 and p val <1 × 10−6. Additional filtering criteria include average RPKM ≥0.1 within specific clusters and overexpression between normal tissue and specific clusters (T test, corrected p val <0.05). b Heatmap exhibiting selected lncRNAs upon Pearson correlation and the corresponding gene numbers per cluster. Heatmaps of PCGs directly (overexpressed PCGs) (c) or inversely (underexpressed PCGs) (d) correlated with the lncRNAs selected per cluster in b. Threshold used: r > 0.3 and p val <1 × 10−9 for overexpressed PCGs; r < −0.3 and p val <1 × 10−9 for underexpressed PCGs

Associated PCGs to surrogate lncRNAs

Predicting the biological functions of lncRNAs is challenging. GBA analysis has been proposed on the basis that the function of a poorly characterized lncRNA gene can be inferred from known/predicted functions of protein coding genes which are coexpressed [31]. We leveraged our RNA-sequencing data using GBA analysis to generate hypotheses on functional significance by comparing the expression of lncRNAs to protein coding genes of known function. Therefore, we performed correlation analysis to interrogate what coding genes are directly or inversely correlated with the surrogate lncRNA genes of each cluster. Accordingly, we use five different expression vectors or templates, one per cluster, calculated from the mean expression value of lncRNAs selected per cluster above (Fig. 3b). The number of correlated genes per cluster and the heatmap showing their expression patterns are shown (Fig. 3c, d) for directly (overexpressed PCGs) or inversely (underexpressed PCGs) correlated PCGs. Thresholds used are r > 0.3 or r < −0.3 and p val <1 × 10−9 (Additional file 1: Table S4).

The results shown that each lncRNA cluster has associated a significant number of coding genes, whose functions might be related with the biological roles of the correlated lncRNAs.

Gene ontology analysis of lncRNA clustering-associated PCGs

In order to predict possible functions of the surrogate lncRNAs in each cluster, we perform enrichment analysis of GO Biological Processes (GOBP) on the associated PCGs, which would find enriched pathways and larger processes made up of the activities of the PGC gene products. Therefore, we analyze individually the lists of PCGs directly or inversely correlated with each lncRNA cluster (Fig. 4). Cluster c1 is characterized by the underexpression of PCGs involved in T cell, natural killer cell, and myeloid leukocyte activation; these processes are overexpressed in c5, suggesting important differences in immune response between c1 and c5. Importantly, PD-L1 (CD274), an immune inhibitory receptor ligand whose inactivation using antibodies is being currently used for cancer treatment successfully, is underexpressed in c1. Cluster c2 is characterized by the expression of positive regulators of transcription, tissue morphogenesis, and neuronal markers. In addition, c2 exhibits depletion of PCGs involved in ubiquitin-dependent protein cleavage, including some protein component of the proteasome complex (PSM proteins). Cluster c3 express coding genes involved in cell migration, and underexpress PCGs involved in glucose metabolism. Cluster c4 contain overexpressed PCGs involved in epidermal development, including late cornified envelope (LCE) cluster genes and small proline-rich protein (SPRR) genes. Both LCE and SPRR proteins are expressed in terminally differentiated stratified epithelia, such as skin or head and neck mucosa. Underexpressed coding genes in c4 are involved in protein translation, similar to those in c5. Cluster c5 has many overexpressed PCGs involved in DNA replication and cell cycle processes, RNA splicing or transcription. Overall, the results highlight important differences between lncRNA clusters in terms of predicted functions, somehow validating the unsupervised clustering and Pearson correlation analysis performed. Whether the surrogate lncRNAs of each cluster are also involved in these biological functions remains to be demonstrated.

Fig. 4
figure 4

GOBP terms selected by cluster. Gene ontology analysis was done using DAVID v6.7 web tool (see the “Methods” section). Overexpressed or underexpressed PCGs are analyzed per cluster, and representative GO Biological Processes (GOBP) terms shown. Numbers at the end of each bar represent genes belonging to each GOBP term. Dashed red line denotes threshold for significant enrichment (p val <0.05)

Clinical and pathological features in lncRNA groups

We wanted to investigate whether our lncRNA clusters could display different clinical and pathological characteristics, possibly providing a better understanding of the patient’s tumor behavior or response to current therapies. We analyze whether patient sample clusters display follow-up differences in recurrence or in death, after treatment. Therefore, we performed Kaplan-Meier survival plots for both end-points (recurrence or overall survival) and the five clusters. Results show significant differences in overall survival at 5 years (p val = 0.0065, log-rank test) as previously described [16], with c5 patients having the highest probabilities of survival (Fig. 5a). In addition, patients within c1 and c2 clusters exhibit the lowest survival. No significant statistical differences were found, however, in recurrence after treatment (Additional file 2: Figure S1).

Fig. 5
figure 5

LncRNA-based clustering and clinical parameters. a Kaplan-Meier plot of TCGA HNSCC patients stratified by lncRNA clusters using overall survival as endpoint. p val was calculated with the log-rank test. n: number of patients with available follow-up information. b Significant associations between lncRNA clusters and various clinical parameters, including tobacco, histologic grade, and sub-anatomical location. Odds ratios and significance values were calculated using Fisher’s exact test. c Box plots of patient age for significant differences between c1 or c2 patients. p values were calculated using T test

Also, we examined tobacco smoking history and found more smokers in c1 and c2 than expected by chance (p val = 0.02 and 0.0002, respectively, upon Fisher’s test) (Fig. 5b). No differences were found in alcohol consumption between patient clusters. Interestingly, we show that tumor histology grade (G1, G2, G3, or G4) is associated with lncRNA clustering (p val = 6.58 × 10−10, Chi-square test). More specifically, we found that c5 tumors have frequently poorly differentiated and undifferentiated histology (G3 and G4, respectively) compared with the other clusters, while c4 have more differentiated or moderately differentiated tumors (G1 or G2, respectively). Furthermore, sub-anatomic location of the tumors and cluster subdivision were significantly associated (p val = 5.1 × 10−41, Chi-square test), so that c1 and c2 tumors are frequently found in the larynx, c4 in the tongue, and c5 in the tonsil (Fig. 5b). Interestingly, 4 lncRNA genes from c1 and 10 from c2 have been shown to be overexpressed in the larynx (Additional file 1: Table S6) [20]. Finally, we found that c1 patients are younger than the remaining patients (58.2 versus 61.6 years, p val = 0.018 after T test), and c2 patients older (65 versus 60.2 years, p val = 0.001 after T test) (Fig. 5c).

In summary, lncRNA clusterization is associated with important molecular aberrations and clinicopathologic features, providing important genomic and phenotypic relationships. Table 1 summarizes the main associated features of each lncRNA cluster.

Table 1 Summary of significant associations with lncRNA simple clusters


The recent implication of lncRNAs in many biological functions has established a new scenario to better understand complex processes like cancer [12, 13]. Traditional genomic characterizations have produced minimal improvements in patient clinical outcome, with a mortality rate of 50%. Therefore, lncRNAs may represent a promising field to discover novel diagnostic and therapeutic strategies. Here we analyzed the lncRNA expression patterns of HNSCC from 426 TCGA primary tumor samples to generate insights into the landscape of lncRNAs in HNSCC.

Interestingly, we discover five clusters of samples based on lncRNA expression with significant differences in molecular aberrations and clinicopathologic features. lncRNA clustering resembles clustering based on DNA methylation and mRNA features. DNA methylation is a chemical modification of genomic DNA by the addition of a methyl group (–CH3) to the cytosine or adenine DNA nucleotides. Typical DNA methylation occurs in a CpG dinucleotide context, where predominantly CpG sites are methylated in the genome. Most of the CpG clusters, known as CpG islands, occur near transcriptional start sites (TSSs) where they are predominantly unmethylated. The establishment and maintenance of methylation patterns resulting in modulation of gene expression is one of the key steps in epigenetic regulation during normal developmental programs. The significant overlapping between lncRNA and DNA methylation clustering might be due to frequent methylation of CpG clusters closed to TSSs of lncRNAs altering their landscape. However, we cannot discard cross-regulation between DNA methylation and specific lncRNAs, such as has been described for H19 [32] or ecCEBPA [33] recently. Interestingly, c1 contains almost all “hypomethylated” samples from the DNA methylation clusters (Fig. 1a), concomitant with enrichment in NSD1 inactivating mutations. Similar mutations have also been found in squamous cell carcinoma of the skin [34]. This correlation between NSD1 mutations and the “hypomethylated” samples has been previously reported in the TCGA HNSCC cohort [5, 35]. NSD1 is a histone 3 Lys 36 (H3K36) methyltransferase similar to SETD2, which is frequently mutated in the clear cell variant of renal cell carcinoma, and associated with DNA hypomethylation [36]. Moreover, a DNA hypomethylation signature has been reported in Sotos syndrome, a monogenic disorder defined by germline NSD1 mutations [37]. Several reports showed that H3K36 methylation is linked to the binding of de novo DNA methyltransferases (DNMT3A and DNMT3B) [38, 39]. Therefore, the recruitment of DNMT3A and DNMT3B could be impaired in NSD1 mutant tumors, leading to the global DNA hypomethylation observed in the c1 cluster. Whether this DNA hypomethylation is affecting c1 lncRNA expression regulation warrants further investigation.

TP53 mutations have been associated with poor survival in many cancer types [40]. We believe that the aggressiveness of the disease in c1 and c2 patients might be, at least partially, explained by the increased TP53 mutation frequency. The reverse would apply for c5, almost depleted on mutations, and having the best overall survival of all clusters.

In relation with other clinical parameters analyzed, the anatomical location was the most significantly associated with lncRNA clustering, with c1 and c2 in larynx, c4 in tongue, or c5 in tonsil. This finding might be explained by the already known high tissue specificity of lncRNA expression [41] and highlights an important issue: HNSCC as a group includes different tumor locations and different etiologies. Therefore, we consider that some of the surrogate lncRNAs in the clusters might be specific for sub-anatomical locations in the head and neck regions, and may help to analyze location-specific associations between molecular and phenotype features. For example, c5 overexpressed coding genes are enriched in proteins involved in “lymphocyte activation,” a process that occurs in the tonsils, which is an immune defense organ in the aerodigestive tract constituting the first line of defense against ingested or inhaled foreign pathogens. Naturally, both B and T cell activation occurs in the tonsils after the uptake of antigens produced by pathogens by specialized antigen capture cells called M cells. Whether the enrichment in “lymphocyte activation” found in c5 is due to tonsil-dependent functions or to a specific immune response to HPV-positive tumors remains to be ascertained.

Between the PCGs underexpressed in c1, we found programmed death-ligand 1 (PD-L1; also called CD274). PD-L1, which is expressed on many cancer and immune cells, plays an important part in blocking the “cancer immunity cycle” by binding programmed death-1 (PD-1) and B7.1 (CD80), both of which are negative regulators of T-lymphocyte activation [42]. Binding of PD-L1 to its receptors suppresses T cell migration, proliferation, and secretion of cytotoxic mediators, and restricts tumor cell killing. The PD-L1-PD-1 axis protects the host from overactive T-effector cells not only in cancer but also during microbial infections. Blocking PD-L1 should therefore enhance anticancer immunity, and successful treatment of many patients with advanced cancer using antibodies against PD-L1 has been demonstrated, including HNSCC [43]. Thus, the Food and Drug Administration (FDA) recently approved pembrolizumab, an antibody inhibiting PD-L1, for the treatment of some patients with recurrent or metastatic HNSCC that has continued to progress despite standard-of-care treatment with chemotherapy. However, little is known about predictive factors of efficacy of such therapies, although some recent reports described that across multiple cancer types, responses are observed in patients with tumors expressing high levels of PD-L1, especially when PD-L1 was expressed by tumor-infiltrating immune cells. Therefore, it is tempting to speculate that c1 primary tumors would be less sensitive to anti-PD-L1 therapies than the tumors from the other clusters. In addition, some of the c1 lncRNAs might be involved in this PD-L1-PD-1 “cancer immunity cycle” and could be subject of future investigations.

Coding genes involved in synaptic transmission and neuron differentiation are coexpressed with surrogate lncRNAs within cluster c2. Perineural invasion occurs in an important proportion of HNSCC samples. We found no significant higher frequency of perineural invasion in c2 samples compared with the remaining samples (Fisher’s exact test, p val = 0.5). Therefore, it is tempting to speculate that c2 samples might display some cellular plasticity from epithelial towards neuroendocrine lineage, as HNSCC tumors with neuroendocrine phenotype has been described [44]. Recently, neuroendocrine lineage plasticity enabled by the loss of TP53 and RB1 function was shown in prostate cancer, mediated by increased expression of the reprogramming transcription factor SOX2 [45]. Interestingly, c2 display higher frequency of TP53 mutations (Fig. 2c and Table 1). Possibly, lncRNAs within c2 might be involved in mechanisms regulating epithelial to neuroendocrine reprogramming.

Many overexpressed PCGs in c5 are related to “cell cycle” processes, suggesting that c5 carcinomas are more proliferative. This finding is in line with the oncogenic activities of HPV E7 oncoprotein, which binds and induces protein degradation of cell cycle regulators such as the retinoblastoma protein family, including pRb (RB1), p107 (RBL1) or p130 (RBL2) [46]. Degradation of these proteins, mainly pRb, allow E2F transcription factors to induce expression of genes involved in cell cycle progression [47], such as cyclins and cyclin-dependent kinases, or proteins involved in DNA replication, and mitotic division. In addition, E2F1 gene amplification was found in HPV-positive HNSCC tumors from the TCGA cohort [5], which correlates with a molecular profile of cell cycle deregulation. Therefore, some c5 lncRNAs might also be involved in these processes, and possibly in the E7-pRb-E2F axis of cell cycle deregulation of HPV-infected tumors. Tumors from c4 overexpresses PCGs involved in epidermal differentiation, such as LCE cluster SPRR genes, normally involved in terminal epithelial differentiation. This result might be in line with the high frequency of differentiated carcinomas in c4 (histologic grades G1 and G2) (Fig. 5 and Table 1), whereby G3 tumors are less frequent. Therefore, some of the c4 surrogate lncRNAs could be involved in epithelial differentiation.


We present the first comprehensive clustering analysis of HNSCC based on lncRNA expression performed to date. The results allow selection of surrogate lncRNA genes of five distinct tumor groups and propose possible functions associated with them, as well as phenotypic and clinicopathology features that may be consequence, in part, of their activities.


  1. Ferlay J, Soerjomataram I, Dikshit R, Eser S, Mathers C, Rebelo M, Parkin DM, Forman D, Bray F. Cancer incidence and mortality worldwide: sources, methods and major patterns in GLOBOCAN 2012. Int J Cancer. 2015;136(5):86.

    Article  Google Scholar 

  2. Leemans CR, Braakhuis BJ, Brakenhoff RH. The molecular biology of head and neck cancer. Nat Rev Cancer. 2011;11(1):9–22.

    Article  CAS  PubMed  Google Scholar 

  3. Mao L, Hong WK, Papadimitrakopoulou VA. Focus on head and neck cancer. Cancer Cell. 2004;5(4):311–6.

    Article  CAS  PubMed  Google Scholar 

  4. Gillison ML, Koch WM, Capone RB, Spafford M, Westra WH, Wu L, Zahurak ML, Daniel RW, Viglione M, Symer DE, et al. Evidence for a causal association between human papillomavirus and a subset of head and neck cancers. J Natl Cancer Inst. 2000;92(9):709–20.

    Article  CAS  PubMed  Google Scholar 

  5. Cancer Genome Atlas N. Comprehensive genomic characterization of head and neck squamous cell carcinomas. Nature. 2015;517(7536):576–82.

    Article  Google Scholar 

  6. Castilho RM, Squarize CH, Gutkind JS. Exploiting PI3K/mTOR signaling to accelerate epithelial wound healing. Oral Dis. 2013;19(6):551–8.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  7. Stransky N, Egloff AM, Tward AD, Kostic AD, Cibulskis K, Sivachenko A, Kryukov GV, Lawrence MS, Sougnez C, McKenna A, et al. The mutational landscape of head and neck squamous cell carcinoma. Science (New York, NY). 2011;333(6046):1157–60.

    Article  CAS  Google Scholar 

  8. Djebali S, Davis CA, Merkel A, Dobin A, Lassmann T, Mortazavi A, Tanzer A, Lagarde J, Lin W, Schlesinger F, et al. Landscape of transcription in human cells. Nature. 2012;489(7414):101–8.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  9. Flicek P, Amode MR, Barrell D, Beal K, Billis K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fitzgerald S, et al. Ensembl 2014. Nucleic Acids Res. 2014;42(Database issue):55.

    Google Scholar 

  10. Ravasi T, Suzuki H, Pang KC, Katayama S, Furuno M, Okunishi R, Fukuda S, Ru K, Frith MC, Gongora MM, et al. Experimental validation of the regulated expression of large numbers of non-coding RNAs from the mouse genome. Genome Res. 2006;16(1):11–9.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  11. Batista PJ, Chang HY. Long noncoding RNAs: cellular address codes in development and disease. Cell. 2013;152(6):1298–307.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  12. Huarte M. The emerging role of lncRNAs in cancer. Nat Med. 2015;21(11):1253–61.

    Article  CAS  PubMed  Google Scholar 

  13. Schmitt AM, Chang HY. Long noncoding RNAs in cancer pathways. Cancer Cell. 2016;29(4):452–63.

    Article  CAS  PubMed  Google Scholar 

  14. Liu SJ, Horlbeck MA, Cho SW, Birk HS, Malatesta M, He D, Attenello FJ, Villalta JE, Cho MY, Chen Y, et al. CRISPRi-based genome-scale identification of functional long noncoding RNA loci in human cells. Science (New York, NY). 2017;355(6320).

  15. Iyer MK, Niknafs YS, Malik R, Singhal U, Sahu A, Hosono Y, Barrette TR, Prensner JR, Evans JR, Zhao S, et al. The landscape of long noncoding RNAs in the human transcriptome. Nat Genet. 2015;47(3):199–208.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  16. Li J, Han L, Roebuck P, Diao L, Liu L, Yuan Y, Weinstein JN, Liang H. TANRIC: an interactive open platform to explore the function of lncRNAs in cancer. Cancer Res. 2015;75(18):3728–37.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  17. Yan X, Hu Z, Feng Y, Hu X, Yuan J, Zhao SD, Zhang Y, Yang L, Shan W, He Q, et al. Comprehensive genomic characterization of long non-coding RNAs across human cancers. Cancer Cell. 2015;28(4):529–40.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  18. Gomes CC, de Sousa SF, Calin GA, Gomez RS. The emerging role of long noncoding RNAs in oral cancer. Oral Surg Oral Med Oral Pathol Oral Radiol. 2017;123(2):235–41.

    Article  PubMed  Google Scholar 

  19. Zou AE, Zheng H, Saad MA, Rahimy M, Ku J, Kuo SZ, Honda TK, Wang-Rodriguez J, Xuan Y, Korrapati A, et al. The non-coding landscape of head and neck squamous cell carcinoma. Oncotarget. 2016;7(32):51211–22.

    PubMed  PubMed Central  Google Scholar 

  20. Nohata N, Abba MC, Gutkind JS. Unraveling the oral cancer lncRNAome: identification of novel lncRNAs associated with malignant progression and HPV infection. Oral Oncol. 2016;59:58–66.

    Article  CAS  PubMed  Google Scholar 

  21. Salyakina D, Tsinoremas NF. Non-coding RNAs profiling in head and neck cancers. Npj Genomic Medicine. 2016;1:15004.

    Article  CAS  Google Scholar 

  22. The Atlas of ncRNAs in Cancer: TANRIC []. Accessed 15 Feb 2016

  23. The cBioPortal for Cancer Genomics [] Accessed 15 Feb 2016

  24. UCSC Xena []. Accessed 15 Feb 2016

  25. Comprehensive genomic characterization of head and neck squamous cell carcinomas, 2015 []. Accessed 15 Feb 2016

  26. Wilkerson MD, Hayes DN. ConsensusClusterPlus: a class discovery tool with confidence assessments and item tracking. Bioinformatics. 2010;26(12):1572–3.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  27. Bioconductor: open source software for bioinformatics []. Accessed 15 Feb 2016

  28. Saeed AI, Sharov V, White J, Li J, Liang W, Bhagabati N, Braisted J, Klapa M, Currier T, Thiagarajan M, et al. TM4: a free, open-source system for microarray data management and analysis. Biotechniques. 2003;34(2):374–8.

    CAS  PubMed  Google Scholar 

  29. The Database for Annotation, Visualization and Integrated Discovery (DAVID) v6.7 []. Accessed 15 Feb 2016

  30. da Huang W, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009;4(1):44–57.

    Article  CAS  Google Scholar 

  31. Guttman M, Amit I, Garber M, French C, Lin MF, Feldser D, Huarte M, Zuk O, Carey BW, Cassady JP, et al. Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature. 2009;458(7235):223–7.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  32. Zhou J, Yang L, Zhong T, Mueller M, Men Y, Zhang N, Xie J, Giang K, Chung H, Sun X, et al. H19 lncRNA alters DNA methylation genome wide by regulating S-adenosylhomocysteine hydrolase. Nat Commun. 2015;6:10221.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  33. Di Ruscio A, Ebralidze AK, Benoukraf T, Amabile G, Goff LA, Terragni J, Figueroa ME, De Figueiredo Pontes LL, Alberich-Jorda M, Zhang P, et al. DNMT1-interacting RNAs block gene-specific DNA methylation. Nature. 2013;503(7476):371–6.

    Article  PubMed  PubMed Central  Google Scholar 

  34. Quintana RM, Dupuy AJ, Bravo A, Casanova ML, Alameda JP, Page A, Sanchez-Viera M, Ramirez A, Navarro M. A transposon-based analysis of gene mutations related to skin cancer development. J Invest Dermatol. 2013;133(1):239–48.

    Article  CAS  PubMed  Google Scholar 

  35. Papillon-Cavanagh S, Lu C, Gayden T, Mikael LG, Bechet D, Karamboulas C, Ailles L, Karamchandani J, Marchione DM, Garcia BA, et al. Impaired H3K36 methylation defines a subset of head and neck squamous cell carcinomas. Nat Genet. 2017;49(2):180–5.

    Article  CAS  PubMed  Google Scholar 

  36. Cancer Genome Atlas Research N. Comprehensive molecular characterization of clear cell renal cell carcinoma. Nature. 2013;499(7456):43–9.

    Article  Google Scholar 

  37. Choufani S, Cytrynbaum C, Chung BH, Turinsky AL, Grafodatskaya D, Chen YA, Cohen AS, Dupuis L, Butcher DT, Siu MT, et al. NSD1 mutations generate a genome-wide DNA methylation signature. Nat Commun. 2015;6:10207.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  38. Baubec T, Colombo DF, Wirbelauer C, Schmidt J, Burger L, Krebs AR, Akalin A, Schubeler D. Genomic profiling of DNA methyltransferases reveals a role for DNMT3B in genic methylation. Nature. 2015;520(7546):243–7.

    Article  CAS  PubMed  Google Scholar 

  39. Dhayalan A, Rajavelu A, Rathert P, Tamas R, Jurkowska RZ, Ragozin S, Jeltsch A. The Dnmt3a PWWP domain reads histone 3 lysine 36 trimethylation and guides DNA methylation. J Biol Chem. 2010;285(34):26114–20.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  40. Leroy B, Anderson M, Soussi T. TP53 mutations in human cancer: database reassessment and prospects for the next decade. Hum Mutat. 2014;35(6):672–88.

    Article  CAS  PubMed  Google Scholar 

  41. Cabili MN, Trapnell C, Goff L, Koziol M, Tazon-Vega B, Regev A, Rinn JL. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes Dev. 2011;25(18):1915–27.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  42. Topalian SL, Drake CG, Pardoll DM. Immune checkpoint blockade: a common denominator approach to cancer therapy. Cancer Cell. 2015;27(4):450–61.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  43. Seiwert TY, Burtness B, Mehra R, Weiss J, Berger R, Eder JP, Heath K, McClanahan T, Lunceford J, Gause C, et al. Safety and clinical activity of pembrolizumab for treatment of recurrent or metastatic squamous cell carcinoma of the head and neck (KEYNOTE-012): an open-label, multicentre, phase 1b trial. Lancet Oncol. 2016;17(7):956–65.

    Article  CAS  PubMed  Google Scholar 

  44. Mahomed F. Neuroendocrine cells and associated malignancies of the oral mucosa: a review. J Oral Pathol Med. 2010;39(2):121–7.

    Article  PubMed  Google Scholar 

  45. Mu P, Zhang Z, Benelli M, Karthaus WR, Hoover E, Chen CC, Wongvipat J, Ku SY, Gao D, Cao Z, et al. SOX2 promotes lineage plasticity and antiandrogen resistance in TP53- and RB1-deficient prostate cancer. Science. 2017;355(6320):84–8.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  46. Jones DL, Thompson DA, Munger K. Destabilization of the RB tumor suppressor protein and stabilization of p53 contribute to HPV type 16 E7-induced apoptosis. Virology. 1997;239(1):97–107.

    Article  CAS  PubMed  Google Scholar 

  47. DeFilippis RA, Goodwin EC, Wu L, DiMaio D. Endogenous human papillomavirus E6 and E7 proteins differentially regulate proliferation, senescence, and apoptosis in HeLa cervical carcinoma cells. J Virol. 2003;77(2):1551–63.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

Download references


Not applicable


This research was supported by FEDER cofounded MINECO grants SAF2012-34378 and SAF2015-66015-R (JMP), Comunidad Autónoma de Madrid grant S2010/BMD-2470 (Oncocycle Program) (JMP), ISCIII grants ISCIII-RETIC RD12/0036/0009 (JMP), PIE15/00076 (JMP), CB/16/00228 (JMP), Swiss National Science Foundation Grants 144637 and 153099 (RGE), and Swiss Cancer League Grant KLS 3243-08-2013 (RGE).

Availability of data and materials

The datasets generated and/or analyzed during the current study are available in the TANRIC (, TCGA ( repositories, and webtools such as cBioPortal ( and UCSC Xena (

All data generated or analyzed during this study are included in this published article [and its supplementary information files].

Authors’ contributions

PGdeL, APG, and RGE performed the genomic and statistical analyses. PGdeL and RGE wrote the manuscript. PGdeL, JMP, and RGE interpreted the results. All authors read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Consent for publication

Not applicable

Ethics approval and consent to participate

Not applicable

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author information

Authors and Affiliations


Corresponding author

Correspondence to Ramón García-Escudero.

Additional files

Additional file 1: Table S1.

500 lncRNA genes with the most variable expression in 426 HNSCC samples from TCGA. Table S2. lncRNA-based HNSCC TCGA sample clusters. Table S3. Surrogate lncRNA genes per sample cluster. Table S4. Common lncRNAs between surrogate c5 and overexpressed in TP53 WT-infected samples from Nohata et al. (ref [20]). Table S5. Common lncRNAs between surrogate c5 and overexpressed in HPV-infected samples from Nohata et al. (ref [20]). Table S6. Common lncRNAs between surrogate c1 to c5 samples and suborgan-specific from Nohata et al. (ref [20]). Table S7. Protein-coding genes correlated with surrogate lncRNAs per sample cluster. (XLSX 125 kb)

Additional file 2: Figure S1.

Kaplan-Meier plot of TCGA HNSCC patients stratified by lncRNA clusters using recurrence as endpoint. P val was calculated with the log-rank test. n: number of patients with available follow-up information. (PDF 472 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

de Lena, P.G., Paz-Gallardo, A., Paramio, J.M. et al. Clusterization in head and neck squamous carcinomas based on lncRNA expression: molecular and clinical correlates. Clin Epigenet 9, 36 (2017).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: