MMR assessment
Tumour MMR expression data was previously generated by immunohistochemistry (IHC) and assessed as described (ANECS [27], RENDOCAS [28], MCCS [29, 30]). Briefly, cases with nuclear staining of all MMR proteins in tumour cells were considered MMR-proficient and classified as MSS. Cases were reported as MMR-deficient when tumour cells showed total or partial nuclear loss of expression in one or more of the MMR proteins and were classified as MSI.
Candidate SNP meta-analysis
GWAS data for meta-analysis was collated from four endometrial cancer genome-wide association studies [20, 21, 31]: the Australian National Endometrial Cancer Studies (ANECS-Illumina genotyped and ANECS-iCOGS genotyped), the Registry of Endometrial Cancer in Sweden (RENDOCAS) and the Melbourne Collaborative Cohort Study (MCCS). IMPUTE2 was used to impute genotypes to the positive strand of the 1000 Genomes Project, v3, phase 1 dataset. Cases were of European ancestry with a confirmed EC diagnosis. Genotyping in each study was performed as previously described [20, 21]: ANECS-Illumina (MSI n = 66, MSS n = 254, controls n = 3,083) with the Illumina Infinium 610K array; ANECS-iCOGS (MSI n = 67, MSS n = 156, controls n = 1,956) and RENDOCAS (MSI n = 52, MSS n = 88, controls n = 7,563) with an Illumina custom array designed by the Collaborative Oncological Gene-environment Study initiative (iCOGS) [20]; and MCCS (MSI n = 40, MSS n = 65, controls n = 980) with the Illumina OncoArray 534K genotyping chip [21]. Controls were country-matched to cases and genotyped using the same platforms.
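As a quick consistency check, the per-study counts listed above can be tallied programmatically. The snippet below is only an illustrative sketch: the dictionary layout and the function name `tally` are assumptions made for this example, and only the counts themselves come from the text.

```python
# Per-study sample counts as reported in the text:
# (MSI cases, MSS cases, controls).
STUDY_COUNTS = {
    "ANECS-Illumina": (66, 254, 3083),
    "ANECS-iCOGS": (67, 156, 1956),
    "RENDOCAS": (52, 88, 7563),
    "MCCS": (40, 65, 980),
}


def tally(counts):
    """Sum MSI cases, MSS cases and controls across studies."""
    msi = sum(c[0] for c in counts.values())
    mss = sum(c[1] for c in counts.values())
    controls = sum(c[2] for c in counts.values())
    return {"MSI": msi, "MSS": mss, "controls": controls}


totals = tally(STUDY_COUNTS)
# Proportion of MSI cases in the combined MSI-plus-controls sample.
msi_case_rate = totals["MSI"] / (totals["MSI"] + totals["controls"])
print(totals, round(msi_case_rate, 3))
```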
Total numbers used in the meta-analysis were as follows: MSI n = 225, MSS n = 563 and controls n = 13,582. Quality control consisted of exclusion of SNPs with call rates < 95% or MAFs < 1%, and removal of duplicated results and related individuals. Comprehensive sequencing for germline mutations has not been completed for all ANECS and RENDOCAS studies, so it is possible that a small number (< 3%) of undiagnosed Lynch syndrome patients are present in the data. SNPs for this candidate study were limited to those on chromosome 3 within 1 Mb upstream and downstream of the MLH1 transcriptional start site (chr3:36,000,000–38,000,000 hg38; chr3:36024996–38024996 hg19). rs1800734 was directly genotyped in all datasets. To determine whether our dataset (MSI cases and controls) was of sufficient size, we carried out power calculations based on our CRC association study (OR = 1.95, MAF = 0.2, n = 13,807, case rate = 0.016), which indicated a power of 99% to detect an association similar to that seen in CRC. Using a more conservative OR of 1.4 in the same calculation indicated a power of 85%. Association statistics from the individual GWASs were entered into PLINK 1.9 for a fixed-effects meta-analysis. The P-value threshold for candidate significance was 0.05. Standard Bonferroni methods were used to correct the P-value threshold for multiple testing. Confidence intervals were set at 95%.
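For readers unfamiliar with fixed-effects meta-analysis, the pooling step can be sketched as an inverse-variance-weighted average of the per-study log odds ratios. The code below is a minimal illustration under that assumption. It is not PLINK's implementation, the function names are hypothetical, and the input values in the usage lines are invented purely to show the arithmetic.

```python
import math


def fixed_effects_meta(betas, ses):
    """Inverse-variance-weighted pooling of per-study log odds ratios.

    betas: per-study log(OR) estimates; ses: their standard errors.
    """
    if not betas or len(betas) != len(ses):
        raise ValueError("betas and ses must be non-empty and of equal length")
    weights = [1.0 / se ** 2 for se in ses]
    total_w = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_w
    se = math.sqrt(1.0 / total_w)
    z = beta / se
    # Two-sided P value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    # 95% confidence interval, reported on the odds-ratio scale.
    ci = (math.exp(beta - 1.96 * se), math.exp(beta + 1.96 * se))
    return {"beta": beta, "se": se, "or": math.exp(beta), "ci95": ci, "p": p}


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


# Illustrative (made-up) inputs: four studies with identical estimates.
pooled = fixed_effects_meta([0.3, 0.3, 0.3, 0.3], [0.2, 0.2, 0.2, 0.2])
threshold = bonferroni_threshold(0.05, 10)
print(pooled, threshold)
```

With equal standard errors, the pooled standard error shrinks by the square root of the number of studies, which is why combining several modest datasets can yield a usable candidate-SNP test.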
TCGA-UCEC analysis
TCGA-UCEC methylation, gene expression data and MSI status were downloaded from the GDC portal (https://portal.gdc.cancer.gov/) using the GDC toolkit. The rs1800734 genotype was extracted from TCGA-UCEC whole genome sequencing sliced BAM files using Platypus variant calling software [24]. Data was downloaded, collated and pre-analysed using a custom script available on GitHub (https://github.com/kzkedzierska/mlh1_endo). For MLH1 promoter methylation, the median methylation beta value for CpG residues proximal (± 2000 bp) to rs1800734 was calculated. MLH1 transcript abundance in fragments per kilobase per million mapped reads, upper quartile normalized (FPKM-UQ), was used as a measure of expression. Samples with any missing values were excluded before data visualization and statistical analysis in R (MSI n = 206; MSS n = 349).
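The promoter methylation summary described above (median beta value of CpG sites within ± 2000 bp of the SNP) can be sketched as follows. This is not the authors' script, which is available at the GitHub link above; the function name, the `(position, beta)` input layout and the coordinates in the usage lines are hypothetical, and dropping missing beta values at the probe level is an assumption made for this example.

```python
from statistics import median


def median_beta_near_snp(probes, snp_pos, window=2000):
    """Median beta value of CpG probes within +/- window bp of a SNP.

    probes: iterable of (position, beta) pairs on the SNP's chromosome;
    a beta of None marks a missing value and is skipped (an assumption).
    Returns None when no usable probe falls inside the window.
    """
    nearby = [
        beta
        for pos, beta in probes
        if beta is not None and abs(pos - snp_pos) <= window
    ]
    if not nearby:
        return None
    return median(nearby)


# Made-up coordinates and beta values for one sample.
example_probes = [(8500, 0.1), (9000, 0.2), (10500, 0.6), (11000, None), (12001, 0.9)]
example_median = median_beta_near_snp(example_probes, snp_pos=10000)
print(example_median)
```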
Cell lines
HEC1A and NOU1 cells were maintained in Dulbecco’s Modified Eagle Medium (Gibco™) supplemented with 10% FBS and 0.1% penicillin-streptomycin. rs1800734 was genotyped using KASPar™ technology (LGC) according to the manufacturer’s instructions using specific primers (Supplementary Table 5).
Analysis of methylation
DNA was extracted from fresh cells using the DNeasy kit (Qiagen). Bisulphite conversion of DNA was carried out using the EZ DNA methylation kit (Zymo Research) according to the manufacturer’s instructions. Converted DNA was amplified with the PyroMark PCR kit (Qiagen) using CpG-free primers (Supplementary Table 5) with Illumina-specific sequence tags to ensure unbiased amplification of methylated and unmethylated template. Amplicons from each sample were barcoded together using a custom set of index tags and primers [32]. Sequencing was carried out using a 250-bp paired-end kit on a MiSeq (Illumina) according to the manufacturer’s instructions. MiSeq output was demultiplexed and FASTQ files were generated (BaseSpace, Illumina). The sequences were quality assessed and trimmed (FastQC and TrimGalore, Babraham Bioinformatics), then aligned, and methylation was called by rs1800734 allele (Bismark, Babraham Bioinformatics).
Analysis of mRNA
RNA was extracted from fresh cells using the RNeasy kit (Qiagen) and cDNA was generated (High Capacity cDNA Reverse Transcription Kit, Applied Biosystems) according to the manufacturer’s instructions. Gene expression was quantified and normalized using TaqMan gene expression ready-mixed assays (Applied Biosystems, Thermo Fisher). Allele-specific MLH1 expression was assessed by amplification of cDNA using Illumina-tagged primers (Supplementary Table 5) followed by next-generation sequencing on a MiSeq (Illumina) as above. Trimmed FASTQ sequences were aligned using bwa-mem and the rs1800734 variant was called by Platypus [33].
Chromatin Immunoprecipitation
Approximately 10^8 cells were crosslinked for 10 min with 1% formaldehyde, neutralized with 125 mM glycine, washed with ice-cold PBS and scraped. After 2 further PBS washes, cells were resuspended in lysis buffer (1% SDS, 10 mM EDTA, 50 mM Tris-HCl, protease inhibitors), sonicated using a Bioruptor (Diagenode) for 7–15 × 15 s cycles, centrifuged at maximum speed for 10 min at 4 °C and diluted 1:10 in IP dilution buffer (1% Triton X-100, 2 mM EDTA, 150 mM NaCl, 20 mM Tris). Immunoprecipitation (IP) with 5 μg of antibody (anti-TFAP4, Santa Cruz Biotechnology, sc-18593X) was carried out overnight at 4 °C, followed by incubation for 4 h with 50 μl of protein G Dynabeads (Invitrogen). For each chromatin sample, a mock IP with no antibody was carried out in parallel with the TFAP4 IP, and through all subsequent steps of the assay, as a negative control. Bead/antibody and mock complexes were washed with TSEI (0.1% SDS, 1% Triton X-100, 2 mM EDTA, 20 mM Tris, 150 mM NaCl), TSEII (0.1% SDS, 1% Triton X-100, 2 mM EDTA, 20 mM Tris, 500 mM NaCl), LiCl buffer (0.25 M LiCl, 1% NP-40, 1% deoxycholate, 1 mM EDTA, 10 mM Tris-HCl) and TE according to standard protocols and eluted with 1% SDS, 0.1 M NaHCO3. One microliter of DNA was analysed in duplicate or triplicate by SYBR green qPCR using PowerUp SYBR™ Green Master Mix (Thermo Fisher) and primers covering the MLH1 promoter region (Supplementary Table 6). The results were calculated with the ∆∆Ct method: Ct values were first normalized to those of the input chromatin (∆Ct) and then expressed relative to a primer set outside the TFAP4 binding site (∆∆Ct), and the relative fold change was calculated using the equation 2^(−∆∆Ct). No amplification was observed from DNA extracted from the mock IPs.
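The ∆∆Ct arithmetic described above can be written out explicitly. The following is a minimal sketch, assuming one mean Ct value per reaction and no adjustment for input dilution (neither detail is specified in the text); the function name and the Ct values in the usage lines are hypothetical.

```python
def chip_fold_change(ct_ip_target, ct_input_target, ct_ip_ref, ct_input_ref):
    """Relative ChIP enrichment by the delta-delta-Ct method.

    target: primer set covering the region of interest;
    ref: primer set outside the binding site.
    """
    # Normalize each IP Ct value to the matching input chromatin (delta Ct).
    delta_target = ct_ip_target - ct_input_target
    delta_ref = ct_ip_ref - ct_input_ref
    # Express the target relative to the reference region (delta-delta Ct).
    delta_delta = delta_target - delta_ref
    # Convert the Ct difference into a fold change.
    return 2.0 ** (-delta_delta)


# Made-up Ct values: the target amplifies 3 cycles earlier than the
# reference region after input normalization.
fold = chip_fold_change(25.0, 22.0, 28.0, 22.0)
print(fold)
```

Because each qPCR cycle corresponds to an approximate doubling of product, a ∆∆Ct of −3 translates to an eight-fold enrichment at the target relative to the reference region.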
5-Aza-2′-deoxycytidine treatment
Adherent, semiconfluent MSI NOU1 cells in exponential growth were treated with 5 μM 5-Aza-2′-deoxycytidine (AzaC, Sigma A3656) in standard medium for 48 h, with replenishment of AzaC after 24 h. AzaC was then removed and the cells were washed with PBS and cultured in standard medium for 0, 4, 7 and 11 days. RNA and DNA were extracted simultaneously using the AllPrep kit (Qiagen), and MLH1 mRNA expression and promoter methylation were assessed as described above. ChIP was carried out after AzaC treatment as described above.
Plots and statistics
R software and associated packages (tidyverse, gridExtra, ggplot2, ggsci, dplyr and ggforce) were used to generate all graphs and carry out statistical tests including ANOVA, Tukey, Kruskal-Wallis, paired Wilcoxon, t test and Pearson’s correlation. Power calculations were carried out using the genpwr package.