Long GGC repeat expansion in four asymptomatic fathers
Four affected individuals were previously reported to have possible de novo GGC repeat expansion of NOTCH2NLC (Fig. 1a). II-3 in Family F1 (ID3661 in a previous report) was diagnosed as sporadic NIID [3]. II-1 in Family F2 (Patient 11 in a previous report) had leukoencephalopathy [10]. II-2 in Family F3 was diagnosed as NIID with mitochondrial encephalomyopathy, lactic acidosis, and stroke-like episodes (MELAS) in her clinical course of polyneuropathy [9]. II-2 in Family F4 is an affected twin with oculopharyngodistal myopathy (OPDM) (Patient 7 in a previous report) [8]. The clinical features of these subjects are summarized in Additional file 2: Table S1, and have been previously described in detail [3, 8, 9, 10]. All fathers and mothers of the four affected individuals were enrolled and examined genetically on a per-trio basis; such complete trios are an invaluable resource for late-onset adult diseases.
To investigate the genetic basis underlying these possible de novo mutations, trio samples (patient plus parents) of all four families were sequenced by targeting the 4-kb genomic region (chr1:149389497-149393469 in hg38) by Cas9-mediated PCR-free enrichment with Nanopore long-read sequencing [3]. We analyzed Cas9-mediated enrichment sequencing data by tandem-genotypes to evaluate the number of GGC repeat units in NOTCH2NLC [15]. Tandem-genotypes was able to discriminate non-expanded and expanded alleles with different repeat copy-numbers. As expected, all four patients—II-3 (F1-patient), II-1 (F2-patient), II-2 (F3-patient) and II-2 (F4-patient)—had disease-causing GGC expansions with median repeat copy-numbers of 93, 117, 162 and 140, respectively [80, 104, 149 and 127, respectively, relative to the reference (13 copies) in tandem-genotypes output], whereas there were no expanded alleles in their healthy mothers (Fig. 1b and Additional file 2: Table S2). Despite negative results of repeat-primed polymerase chain reaction (RP-PCR), we noticed expanded alleles in all paternal sequencing samples (Fig. 1b and Additional file 1: Fig. S1). Unexpectedly, the repeats in the paternal samples were much longer than those in their affected offspring and were extremely variable in repeat length (Fig. 1b and Additional file 2: Table S2). Such high variability had never been seen before, but the substantial numbers of reads (39, 1070, 88 and 505 reads for F1-father, F2-father, F3-father and F4-father, respectively) supported the presence of expansion alleles (Additional file 2: Table S2). To confirm this result, we performed Southern blot analysis using a probe near the NOTCH2NLC GGC repeat, and confirmed the long repeat expansion in paternal samples in three families (Southern blot analysis could not be performed in F1 family because of insufficient amount of DNA) (Additional file 1: Fig. S2). 
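The relationship between the copy-number change reported relative to the reference and the absolute repeat copy-numbers quoted above can be written out as simple arithmetic. The following is a minimal sketch, assuming the 13-copy reference stated in the text; the function and variable names are hypothetical and are not part of tandem-genotypes.

```python
REFERENCE_COPIES = 13  # GGC copies in the reference, as stated in the text


def absolute_copy_number(relative_change, reference_copies=REFERENCE_COPIES):
    """Convert a copy-number change relative to the reference into an absolute count."""
    return reference_copies + relative_change


# Median changes relative to the reference for the four patients (from the text)
relative_medians = {
    "F1-patient": 80,
    "F2-patient": 104,
    "F3-patient": 149,
    "F4-patient": 127,
}

# Absolute median repeat copy-numbers per patient
absolute_medians = {
    sample: absolute_copy_number(change)
    for sample, change in relative_medians.items()
}
```

With the values from the text, this reproduces the median copy-numbers of 93, 117, 162 and 140.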
The four fathers were clinically asymptomatic based on careful interviews by professional neurologists, suggesting that the fathers are asymptomatic carrier males with extremely long NOTCH2NLC repeat expansion.
Complete GGC repeat expansion sequence
We next completely sequenced the GGC repeat expansion. Given the high deletion and insertion error rates in Nanopore sequencing, constructing the consensus sequence from multiple Nanopore reads (i.e., inter-molecular consensus) is preferable. However, repeat length was too variable to generate an inter-molecular consensus sequence (Fig. 1b). To overcome this difficulty, we used the PacBio no-amplification (No-Amp) targeted sequencing method, which can generate high-fidelity consensus reads (HiFi reads) from multiple passes of subreads taken from a single template molecule (i.e., intra-molecular consensus) [16]. Indeed, the No-Amp method uncovered the whole expansion sequence in the paternal samples, which had not been well characterized before (Fig. 2a and Additional file 2: Table S3). The repeat size in each HiFi read varied, again indicating repeat instability and/or possible somatic mosaicism of long expanded alleles, as suggested by Nanopore analysis (Fig. 1b and Additional file 2: Table S2). The F1-father and F4-father had pure (GGC)n configuration, whereas the F2-father and F3-father had the (GGC)n followed by [(GGA)n(GGC)n]n repeats (Fig. 2a and Additional file 2: Table S3). Importantly, patients had the same repeat configurations as their fathers, although repeat size was contracted, indicating the paternal origin of the pathogenic allele (Fig. 2a). Consistently, all non-expanded alleles were transmitted from their mothers, as indicated by amplicon-length PCR (AL-PCR) analysis (Additional file 1: Fig. S3).
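A repeat configuration such as (GGC)n followed by [(GGA)n(GGC)n]n can be summarized from a consensus sequence by run-length encoding consecutive 3-bp units. The sketch below is illustrative only and is not the method used in this study: it assumes the sequence starts in frame with the repeat unit, and the function names and toy sequence are hypothetical.

```python
from itertools import groupby


def repeat_configuration(sequence, unit_length=3):
    """Run-length encode a sequence read as consecutive fixed-length units.

    Assumes the sequence starts in frame; a trailing partial unit is ignored.
    """
    units = [
        sequence[i:i + unit_length]
        for i in range(0, len(sequence) - unit_length + 1, unit_length)
    ]
    return [(unit, sum(1 for _ in run)) for unit, run in groupby(units)]


def format_configuration(runs):
    """Render runs in the (UNIT)count notation used in the text."""
    return "".join(f"({unit}){count}" for unit, count in runs)


# Toy sequence (not real data): five GGC, two GGA, three GGC
toy = "GGC" * 5 + "GGA" * 2 + "GGC" * 3
runs = repeat_configuration(toy)
summary = format_configuration(runs)
```

Applied to HiFi reads, a summary of this kind makes it straightforward to compare the configuration of a patient's allele with that of the father even when the total repeat length differs.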
Detection of DNA modification by PacBio SMRT sequencing
As described above, the disease-causing allele was inherited from their asymptomatic carrier fathers. The lack of clinical symptoms in the carrier fathers indicates differences in the functional consequences of GGC expansion between patients and carrier fathers. We previously reported that there is no DNA methylation difference between expanded and non-expanded alleles in NIID patients [3]. Therefore, we were curious about the methylation status of the asymptomatic carrier fathers. We examined the base modification, which can be inferred from measuring the kinetics of replication during the PacBio single-molecule real-time (SMRT) sequencing run (polymerase kinetics), using the No-Amp data. Intriguingly, base incorporation in the GGC repeat region (i.e., DNA polymerase speed) was much slower in the four asymptomatic carrier fathers (0.9 bases/second in the repeat expansion allele vs. 2.0 bases/second in the non-expanded allele), suggesting DNA hypermethylation in the repeat expansion allele (Fig. 2b and Additional file 1: Fig. S4). To confirm this observation, we performed Southern blot analysis using the methylation-sensitive restriction enzyme HpaII and the methylation-insensitive restriction enzyme MspI. Genomic DNA was initially digested with NheI, and then the DNA methylation status was compared by examining the HpaII and MspI digestion efficiencies of the GGC repeat-containing NheI fragment (Fig. 3a). The NheI DNA fragments with the GGC repeats were completely resistant to HpaII digestion in the fathers’ samples, but not in patients or mothers, confirming the presence of fully hypermethylated CpG in all three fathers tested (Southern blot analysis could not be performed in family F1 because of insufficient DNA) (Fig. 3b).
DNA methylation landscape of the NOTCH2NLC region assessed by Nanopore sequencing
To investigate the hypermethylated region at a higher resolution, we developed a custom program to detect 5-mC from Nanopore sequencing. Our analysis was based on Guppy, a basecalling program from Oxford Nanopore Technologies, which directly identifies 5-mC by calculating the likelihood of base modification [17]. The modified base information from Guppy was assigned to the genomic position by our custom program methylstat, and used for methylation calling at that genomic position using methylcall. This modified base information was also used for generating BAM files containing the in silico bisulfite-like base-converted reads for IGV visualization (ont2bisul) (see the Methods section for further detail).
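The idea behind a bisulfite-like base conversion can be illustrated as follows: cytosines called as unmodified are written as T, whereas cytosines called as 5-mC are kept as C, mimicking what bisulfite treatment would produce. This is a minimal sketch and not the ont2bisul implementation; it handles forward-strand reads only, and the input layout, function name and 0.5 likelihood cutoff are assumptions made for illustration.

```python
def bisulfite_like_convert(read, mc_likelihood, threshold=0.5):
    """Return the read with unmethylated C converted to T.

    read          -- read sequence (forward strand only in this sketch)
    mc_likelihood -- dict mapping 0-based read position of a C to its
                     5-mC likelihood in the range 0..1 (hypothetical layout)
    threshold     -- hypothetical cutoff at or above which a C is treated as 5-mC
    """
    converted = []
    for pos, base in enumerate(read):
        if base == "C" and mc_likelihood.get(pos, 0.0) < threshold:
            converted.append("T")  # unmethylated C reads as T after conversion
        else:
            converted.append(base)  # 5-mC and all non-C bases are kept
    return "".join(converted)


# Toy read (not real data): C at positions 1, 4 and 7
toy_read = "ACGTCGGC"
toy_likelihood = {1: 0.9, 4: 0.1, 7: 0.2}
converted = bisulfite_like_convert(toy_read, toy_likelihood)
```

In this toy example only the cytosine at position 1 is retained as C, so a genome browser in bisulfite mode would display it as methylated.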
Our analysis revealed heterozygous gain of 5-mC in the father of patient F2 (F2-father) which can be properly detected by our methylation calling method methylcall (Additional file 1: Fig. S5). A comparison of the methylation calls between the patient (F2-patient) and his father (F2-father) revealed that the 2.2 kb genomic region was hypermethylated en bloc, encompassing regions 700-bp upstream and 1,000-bp downstream of the GGC expanded repeat (Additional file 1: Fig. S5).
Taking advantage of long-read sequencing, we performed haplotype-phasing based on GGC repeat copy-number from the tandem-genotypes results. This analysis clearly showed that only the non-pathogenic long expansion allele in asymptomatic fathers, but not the disease-causing expansion allele in patients, was hypermethylated (Fig. 4a). The read-level plot also enabled us to observe the methylation status of individual DNA molecules, confirming that some reads with long expansion were not hypermethylated, indicating epigenetic mosaicism (Fig. 4a, inset).
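Phasing by repeat copy-number amounts to grouping reads by allele size and then summarizing methylation within each group. The sketch below shows the principle under stated assumptions: the per-read record layout, the cutoff of 60 copies and the toy values are hypothetical and are not taken from this study.

```python
from statistics import mean


def phase_by_copy_number(reads, expanded_cutoff):
    """Split reads into non-expanded and expanded groups by repeat copy-number."""
    phased = {"non_expanded": [], "expanded": []}
    for read in reads:
        allele = "expanded" if read["copies"] >= expanded_cutoff else "non_expanded"
        phased[allele].append(read)
    return phased


def mean_methylation(reads):
    """Average the per-read fraction of methylated CpGs within one allele group."""
    return mean(read["methylated_fraction"] for read in reads)


# Toy reads (not real data): two short-allele reads and two long-allele reads
toy_reads = [
    {"id": "r1", "copies": 13, "methylated_fraction": 0.0},
    {"id": "r2", "copies": 15, "methylated_fraction": 0.25},
    {"id": "r3", "copies": 400, "methylated_fraction": 0.75},
    {"id": "r4", "copies": 520, "methylated_fraction": 1.0},
]

phased = phase_by_copy_number(toy_reads, expanded_cutoff=60)  # hypothetical cutoff
summary = {allele: mean_methylation(group) for allele, group in phased.items()}
```

Keeping the per-read values rather than only the group means also allows individual unmethylated reads within the expanded group to be inspected, which is how mosaicism becomes visible in a read-level plot.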
DNA methylation status within the GGC repeat sequence
We next investigated the 5-mC status of the NOTCH2NLC GGC repeat sequence using a reference-free approach. This requires the entire repeat expansion DNA sequence and the corresponding methylation information for the subject. Taking advantage of the two technologies, PacBio and Nanopore, we generated high-quality consensus GGC repeat sequences from PacBio HiFi reads (Additional file 1: Fig. S6), and then mapped the 5-mC-annotated Nanopore reads to their respective GGC repeat expansion sequence (using the HiFi consensus as the expansion reference). A proportion, but not all, of the CpG sites in the long GGC expansion in the asymptomatic father (F2-father) had the 5-mC modification, whereas the disease-causing expansion allele in the patient (F2-patient) was completely unmethylated (Fig. 4b).
DNA methylation boundary
A region spanning approximately 700 bases upstream and 1,000 bases downstream of the GGC repeat was completely unmethylated (chr1: 149390115-149391841) in the non-expanded allele (Fig. 4a and Additional file 1: Fig. S7). The region outside of this unmethylated region was hypermethylated (Fig. 4a and Additional file 1: Fig. S7). We identified two distinct transitional zones between the completely unmethylated and hypermethylated regions. The methylation status of these transitional zones consisted of methylated and unmethylated CpGs, indicative of DNA methylation mosaicism, similar to the DNA methylation boundary in the FMRP translational regulator 1 (FMR1) promoter [18] (Fig. 4c). Four (chr1: 149389972-149389973, 149390041-149390041, 149390068-149390069 and 149390075-149390076) and eight (chr1: 149392058-149392059, 149392099-149392100, 149392169-149392170, 149392326-149392327, 149392338-149392339, 149392376-149392377, 149392434-149392435 and 149392440-149392441) CpG dinucleotides were characterized by a mosaic pattern of DNA methylation, upstream and downstream of the GGC repeat (Fig. 4c). The position of the transition zones was conserved in all 12 individuals tested in this study (Additional file 1: Fig. S7). These transitional zones were lost in the repeat-expanded allele in the four asymptomatic fathers, and the CpG dinucleotides were fully methylated (Fig. 4c).
In summary, the combination of the two long-read sequencing technologies allowed us to characterize the genomic and epigenomic landscape of the pathogenic (disease-causing) and non-pathogenic NOTCH2NLC repeat regions.
Pathological cell-context consequences of pathogenic NOTCH2NLC repeat expansion
As described above, we observed differential DNA methylation between patients and their asymptomatic fathers. Gain of DNA methylation in asymptomatic carriers likely suppresses the toxic effect of the long GGC expansion allele. To dissect the pathological cell-context consequences of disease-causing and non-pathogenic long expansions, we investigated the formation of nuclear inclusion bodies using immunocytochemistry combined with FISH for detecting GGC-repeat expansion RNA in lymphoblastoid cell lines (LCLs) derived from F3 and F4 families. Approximately 3% of LCLs showed ubiquitin- and p62-double positive intranuclear inclusions in both the F3-patient and F4-patient, with the disease-causing allele, but not in their healthy mothers or asymptomatic fathers (Fig. 5a, Additional file 2: Table S4). Importantly, intranuclear inclusions were negative in fathers with extremely long expanded GGCs, indicating the distinct pathological consequences of two different classes of GGC expansion among fathers and their affected offspring. These intranuclear inclusions were co-localized with FISH-labeled CGG-RNA (only in affected offspring), suggesting that NOTCH2NLC mRNA with disease-causing GGC expansions was not only distributed diffusely in the nucleus, but also co-aggregated with the inclusions, as reported previously on skin biopsy samples [12]. These observations confirmed that fathers are indeed clinicopathologically normal.
Fragile X-associated tremor/ataxia syndrome (FXTAS), which is caused by a CGG repeat expansion in the 5ʹ-UTR region of FMR1, and NIID display striking similarities in clinical features and histological findings of intranuclear inclusions [19]. The CGG repeat expansion in FMR1 mRNA results in the formation of G4 structures, which co-aggregate in intranuclear inclusions in brain tissues in FXTAS model mice [20]. Notably, we found that immunoreactivities of G4 foci co-localized with p62-positive intranuclear inclusions, suggesting the involvement of RNA G4 in NIID development in humans (Fig. 5b).
Next, we investigated the expression of NOTCH2NLC mRNA by qPCR in LCLs derived from F3 and F4 families. Consistent with DNA methylation analysis, expression of NOTCH2NLC mRNA in LCLs from both asymptomatic fathers was significantly decreased compared with the corresponding healthy mothers and patients (F3 father: 0.66 ± 0.038 (relative to F3 mother) vs. F3 mother (p < 0.05) and F3 patient (p < 0.05); F4 father: 0.70 ± 0.070 (relative to F4 mother) vs. F4 mother (p < 0.01) and F4 patient (p < 0.01) in Fig. 5c). Transcriptional repression of NOTCH2NLC in fathers was also supported by the relatively lower nuclear signal intensity of GGC RNA FISH (Fig. 5a). Taken together, this evidence indicates that extremely long expanded GGCs in fathers promote epigenetic changes that silence NOTCH2NLC. Because expression was examined in only two families, further investigation with additional families is needed to confirm the relationship between DNA methylation and NOTCH2NLC mRNA expression.
Simple and fast identification of biologically important differentially methylated regions
Our methylation calling method (methylcall) can correctly detect hypermethylated bases, and is potentially useful for medical research for identifying differentially methylated regions among samples. However, methylcall cannot quantitatively measure the methylation status (Additional file 1: Fig. S5). Given the zygosity and epigenetic mosaicism, we decided to establish a quantitative method for Nanopore methylation data analysis. We calculated the methylation level by counting the ratio of 5-mC/(5-mC + C) at each base using a custom script (mtcall2mtkit), and analyzed it using Methylkit, which was originally developed for short-read methylation sequencing [21].
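The per-base methylation level described above, 5-mC/(5-mC + C), can be computed by counting modified and unmodified calls at each position. The following is a minimal sketch and not the mtcall2mtkit script itself; the input layout (one position and call per read and cytosine) and the function name are assumptions, and the result is expressed as a percentage.

```python
from collections import defaultdict


def percent_methylation(calls):
    """Compute 100 * 5-mC / (5-mC + C) at each position.

    calls -- iterable of (position, is_methylated) pairs, one per read and
             cytosine (hypothetical input layout)
    """
    counts = defaultdict(lambda: [0, 0])  # position -> [5-mC count, unmodified C count]
    for position, is_methylated in calls:
        counts[position][0 if is_methylated else 1] += 1
    return {
        position: 100.0 * mc / (mc + c)
        for position, (mc, c) in sorted(counts.items())
    }


# Toy calls (not real data): position 100 is methylated on half of the reads,
# position 105 on none
toy_calls = [
    (100, True), (100, False), (100, True), (100, False),
    (105, False), (105, False),
]
levels = percent_methylation(toy_calls)
```

A position that is methylated on one allele only, as in the toy position 100, yields a value near 50%, which is the kind of intermediate level that a binary methylation call cannot represent.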
The quantitative measurement enabled us to compare methylation levels across samples. As described above, single-sample analysis revealed the detailed methylation profile of the NOTCH2NLC region in the F2 family (Fig. 4). The other three families (F1, F3 and F4) had similar DNA methylation signatures (Additional file 1: Fig. S7). This similarity was confirmed by principal component analysis and hierarchical clustering analysis, revealing two clusters, one from asymptomatic carrier fathers and the other from patients and their mothers (Fig. 6a, b).
Methylkit can also extract and visualize differentially methylated bases between samples. As expected, the percent methylation values between the F2-mother and F2-patient had high similarity (Pearson’s correlation score of 0.998) (Fig. 6c). By contrast, the F2-father and F2-patient showed relatively less similarity (Pearson’s correlation score of 0.955) because a proportion of the differentially methylated cytosines were unmethylated in the F2-patient and methylated in the F2-father with the percent methylation values of mostly < 0.5, indicating heterozygous gain of 5-mC in the father (Fig. 6c and Additional file 2: Table S5).
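The pairwise similarity reported above is a Pearson correlation of percent methylation values at matched cytosines. The sketch below implements the standard formula directly rather than calling Methylkit; the toy vectors are invented for illustration and do not reproduce the reported scores.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Toy percent methylation at four matched cytosines (not real data).
# The father gains partial methylation at the first two sites.
mother = [0, 0, 100, 100]
patient = [0, 5, 95, 100]
father = [50, 45, 100, 100]

r_mother_patient = pearson(mother, patient)
r_father_patient = pearson(father, patient)
```

In this toy example the mother and patient profiles correlate more strongly than the father and patient profiles, mirroring the direction of the difference described in the text.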
Such differential DNA methylation is more useful when it is analyzed in its genomic context, such as repeats, CpG islands and promoters. We summarized percent methylation scores based on the RepeatMasker annotation from UCSC (http://genome.ucsc.edu/). In general, these regions are difficult to analyze by conventional short-read next-generation sequencing because their repetitive nature hampers correct mapping of reads. Long-read sequencing can distinguish these repetitive elements from other copies throughout the human genome because the reads span entire repeat sequences together with adjacent unique regions, and carry 5-mC information at the same time. Moreover, these differentially methylated regions were evaluated in the context of gene annotation, such as the distance to the transcription start site (TSS) and the nearest gene name, using Methylkit (Additional file 2: Table S6) [21]. Indeed, our analysis enabled the simple and fast identification of differentially methylated GGC repeats at NOTCH2NLC (Fig. 6d, e and Additional file 2: Table S6).