5-Hydroxymethylcytosine profiles of cfDNA are highly predictive of R-CHOP treatment response in diffuse large B cell lymphoma patients

Background Although R-CHOP (rituximab, cyclophosphamide, doxorubicin, vincristine, and prednisone) remains the standard chemotherapy regimen for diffuse large B cell lymphoma (DLBCL) patients, not all patients are responsive to the scheme, and there is no effective method to predict treatment response. Methods We utilized 5hmC-Seal to generate genome-wide 5hmC profiles in plasma cell-free DNA (cfDNA) from 86 DLBCL patients before they received R-CHOP chemotherapy. To investigate the correlation between 5hmC modifications and curative effectiveness, we separated patients into training (n = 56) and validation (n = 30) cohorts and developed a 5hmC-based logistic regression model from the training cohort to predict the treatment response in the validation cohort. Results In this study, we identified thirteen 5hmC markers associated with treatment response. The prediction performance of the logistic regression model, achieving 0.82 sensitivity and 0.75 specificity (AUC = 0.78), was superior to existing clinical indicators, such as LDH and stage. Conclusions Our findings suggest that the 5hmC modifications in cfDNA at the time before R-CHOP treatment are associated with treatment response and that 5hmC-Seal may potentially serve as a clinical-applicable, minimally invasive approach to predict R-CHOP treatment response for DLBCL patients.


Introduction
Diffuse large B cell lymphoma (DLBCL) is the primary type of invasive lymphoid tissue tumor, accounting for about 30% of non-Hodgkin's lymphoma [1]. Although the majority of the DLBCL patients are elderly patients, this disease is found in all ages [2]. Since rituximab (R) joined cyclophosphamide, adriamycin, vincristine, and prednisone (CHOP) chemotherapy regimen ten years ago, the overall survival rate of DLBCL patients has improved significantly [3].
However, 30-50% of patients are not sensitive to this standard treatment [4], and exiting methods fail to predict the treatment response before R-CHOP treatment accurately or efficiently [5,6]. Currently, positron emission tomography (PET)-CT is the gold standard to evaluate the efficacy of different treatment regimens for DLBCL. However, it is generally used after the treatment and thus cannot predict the treatment response [7]. The International Prognostic Index (IPI) is the primary prognostic risk assessment method for DLBCL, especially in high-risk patients, and is used for R-CHOP chemotherapies [8][9][10]. However, IPI cannot accurately predict the therapeutic effect of R-CHOP in DLBCL patients [11]. Furthermore, recent studies have demonstrated that the detection of the apoptosis inhibitor, survivin [12], activation induced cytidine deaminase (AID) [13], plasma miRNA [14], exosome miRNA [15], and genes polymorphism [16,17], as well as the presence of CD3 and FoxP3 in the immune microenvironment [18], were all potential indicators of treatment efficacy in DLBCL patients. However, these predictors showed contradictory results that have not been well solved. Therefore, an accurate and effective method to predict the response of R-CHOP regimen is highly necessary.
In recent years, cell-free DNA (cfDNA) in the circulating blood, which carries genetic and epigenetic information from cells of origin, has emerged as a promising noninvasive approach for the diagnosis and prognosis in cancer [19]. 5-Methylcytosines (5mCs) of DNA is an important epigenetic feature that plays an important role in gene expression and cancer development [20]. Kristensen et al. [21] found that the methylation of DAPK1 in cfDNA from patients with DLBCL can be used to assess the effect of R-CHOP treatment. In the human genome, 5-methylcytosines (5mCs) in cfDNA are dynamic and reversible [22,23] and can be oxidized into 5-hydroxymethylcytosines (5hmCs) through the ten-eleven translocation (TET) enzymes in an active DNA-demethylation process [24,25]. Therefore, 5hmC, as an oxidation product of DNA demethylation(5mC), may also be used to assess the effect of R-CHOP treatment. Recently, a study has also shown that 5hmC is associated with the prognosis of DLBCL [26]. However, its role in the prediction of treatment response of R-CHOP scheme for DLBCL patients is not established.
In this study, we used 5hmC-Seal technique to obtain genome-wide 5hmC profiles in plasma cfDNA from 86 DLBCL patients, before they received R-CHOP chemotherapy. Our results demonstrated that responders and non-responders of R-CHOP treatment had distinct 5hmC profiles and that 5hmC markers selected by bioinformatics tools and machine learning algorithms could be used to predict treatment response of R-CHOP treatment in DLBCL patients.

Study participants
From 2017 to 2019, 86 diffuse large B cell lymphoma (DLBCL) patients from multicenter studies including Peking University Third Hospital, Fifth Medical Center of PLA General Hospital, and Cancer Hospital Chinese Academy of Medical Sciences were included in this study. All patients had signed the patient consent form. In all cases, the diagnosis of DLBCL was made using appropriate diagnostic criteria from the 2016 WHO classification of lymphoid tumors with combinations of histologic, immunohistochemical, and cell of origin (coo) defined according to the Hans algorithm [27]. Medical records were reviewed for demographic and clinical data. Laboratory tests, white blood cell count (WBC), renal and hepatic function examinations, lactate dehydrogenase (LDH), and β2 microglobulin (β2MG) and cfDNA from peripheral blood samples were collected before any treatment. Then, all patients received standard R-CHOP chemotherapy. Other baseline assessments including bone marrow biopsy and PET/CT were conducted in all patients in the follow-up care. The disease stage was defined by the Ann Arbor staging system. Treatment efficacy was evaluated after four cycles of treatment according to Lugano 2014 criteria [28], and patients were divided into PD (progressive disease), SD (stable disease), PR (partial response), and CR (complete response) based on the treatment outcome. This study was conducted in accordance with the Declaration of Helsinki.

Study design
This study aimed to discover 5hmC markers to predict the curative effectiveness of R-CHOP scheme through high-efficiency hmC-Seal technology. Among the 86 patients recruited, PR and CR patients were grouped as responders (n = 57), and PD and SD patients were grouped as non-responders (n = 29) to R-CHOP treatment. We split 86 patients into a training and validation cohort. The objective of the first part of the study was to screen candidate genes with differential 5hmC modifications in these two groups from the training cohort. The objective of the second part of the study was to predict treatment outcome, using the model developed in the first part, in the validation cohort (Fig. 1).

Clinical samples collection and cfDNA preparation
Eight milliliters of peripheral blood from DLBCL patients was collected into Cell-Free DNA Collection Tubes (Roche). Within 24 h, plasma was prepared by centrifuging twice at 1350×g for 12 min at 4 °C and 13,500×g for 12 min at 4 °C. Then, the plasma samples were immediately stored at -80 °C. The plasma cfDNA was extracted using the Quick-cfDNA Serum & Plasma Kit (ZYMO) and then stored at − 80 °C. The fragment size of all the cfDNA samples was verified by nucleic acid electrophoresis before library preparation. 5hmC library construction and high-throughput sequencing 5hmC libraries for all samples were constructed with high-efficiency hmC-Seal technology [29]. Due to the highly sensitive nature of the chemical labeling method, the input cfDNA can be as low as 1-10 ng. According to the requirements of next-generation sequencing, the cfDNA extracted from plasma was end-repaired, 3′-adenylated using the KAPA Hyper Prep Kit (KAPA Fig. 1 Overview of study design. A total of 86 cfDNA samples were collected at the time of diagnosis from patients with DLBCL before R-CHOP or R-CHOP-like treatment. Based on treatment outcome in the follow-up care, patients were divided into the responder group (PR & CR) and non-responder group (PD & SD). A logistic regression model was trained by the training cohort that was used to predict treatment response in the validation cohort Biosystems), and then ligated with the Illumina compatible adapters. The ligated cfDNA was added in a glycosylation reaction in 25 μL solution containing 50 mM HEPES buffer (pH 8.0), 25 mM MgCl 2 , 100 μM UDP-6-N3-Glc, and 1 μM β-glucosyltransferase (NEB) for 2 h at 37 °C. Next, the cfDNA was purified using DNA Clean & Concentrator Kit (ZYMO). The purified DNA was incubated with 1 μL of DBCO-PEG4-biotin (Click Chemistry Tools, 4.5 mM stock in DMSO) for 2 h at 37 °C. Similarly, the DNA was purified using the DNA Clean & Concentrator Kit (ZYMO). Meantime, 2.5 μL streptavidin beads (Life Technologies) in 1 × buffer (5 mM Tris pH 7.5, 0.5 mM EDTA, 1 M NaCl, and 0.2% Tween 20) was added directly to the reaction for 30 min at room temperature. Finally, the beads were subsequently washed eight times for five minutes with buffer 1-4. All binding and washing steps were performed at room temperature with gentle rotation. Then, the beads were resuspended in RNase-free water and amplified with 14-16 cycles of PCR amplification. The PCR products were purified using AMPure XP beads (Beckman), according to the manufacturer's instructions. The concentration of libraries was measured with a Qubit 3.0 fluorometer (Life Technologies). Paired-end 39-bp high-throughput sequencing was performed on the NextSeq 500 platform.

Mapping and identifying 5hmC-enriched regions
FastQC (version 0.11.5) was used to assess the sequence quality. Raw reads were aligned to the human genome (version hg19) with bowtie2 (version 2.2.9) [30] and further filtered with SAMtools (version 1.3.1) [31], (parameters used: SAMtools view -f 2 -F 1548 -q 30 and SAMtools rmdup) to retain unique non-duplicate matches to the genome. Pair-end reads were extended and converted into BedGraph format normalized to the total number of aligned reads using bedtools (version 2.19.1) [32], and then converted to bigwig format, using bedGraphToBig-Wig from the UCSC Genome Browser for visualization in the Integrated Genomics Viewer. Potential 5hmCenriched regions (hMRs) were identified using MACS (version 1.4.2), and the parameters used were macs 14 -p 1e-3 -f BAM -g hs [33]. Peak calls were merged using bedtools merge, and only those peak regions that appeared in more than 10 samples and that were less than 1000 bp were retained. Blacklisted genomic regions that tend to show artifact signals, according to ENCODE, were also filtered. The hMRs for each patient were generated by intersecting the individual peak call file with the merged peak file. The hMRs within chromosome X and Y were excluded and used as an input for the downstream analyses.

Feature selection, model training, and validation
A two-step procedure was used to select optimal hMRs for distinguishing the non-responder group from the responder group prior to R-CHOP treatment. In step 1, DLBCL patients were randomly divided into training and validation cohorts in a stratified manner, using train_test_split in Scikit-Learn (version 0.22.1) [34] package in Python (version 3.6.10. In the training cohort, we identified differentially modified 5hmC regions (DhMRs) using EdgeR package (version 3.24.3) [35] in R (version 3.5.0), with the filtering threshold (p value < 0.01 & log-2FoldChange > 0.5). In step 2, the dhMRs were further filtered using the recursive feature elimination algorithm (RFECV) in Scikit-Learn (parameters used: estimator = LogisticRegressionCV (class_weight = 'balanced' , cv = 2, max_iter = 1000), scoring = 'accuracy').
Then, we trained the logistic regression CV model (LR) with the features selected from step 2 (parameter used: maxiter = 100, method = "lbfgs"). The trained LR model was used to predict the treatment outcome for patients in the validation cohort. Receiver operating characteristics (ROC) analysis was used to evaluate model performance. Area under the curve (AUC), best cutoff point, sensitivity, and specificity were computed with sklearn.metrics module.

Exploring functional relevance of the 5hmC markers
We annotated the dhMRs from step 1 using the ChIPseeker package (version 1.20.0) [36], and genes that were closest to the marker regions were used for the following functional analyses. The GO enrichment analysis (Biological Process) was done by the ClueGO (version 2.5.5) and CluePedia (version 1.5.5) plug-in from Cytoscape software (version 3.7.2) (parameters used: medium network specificity, Bonferroni step-down pV correction and two-sided hypergeometric test). We used the Search Tool for the Retrieval of Interacting Genes (STRING) database (version 10.0, https ://strin g-db.org) to find protein-protein interactions for 5hmC markers. Then, the Cytoscape software was used to construct the PPI network.

Survival analysis and gene expression correlation analysis in TCGA-DLBC
For survival analysis, we downloaded the mRNA HTseq-FPKM data of 48 DLBLC patients from the TCGA-DLBC dataset [37] in the GDC Data Portal using gdc-client (version 1.5.0) and downloaded manually curated clinical data, including overall survival (OS), disease-specific survival (DSS), disease-free interval (DFI), and progression-free interval (PFI) from UCSC Xena [38]. Survminer package (version 0.4.6) and Surviva packages (version 2.44-1.1) in R were used for survival analysis. Forty-eight patients were divided into the high-expression group and low-expression group according to the cutoff points determined by the maximally selected rank statistics algorithm (maxstat) [39]. Survival analysis of each gene was assessed by Kaplan-Meier curves [40] and the logrank test [41]. For the survival analysis, p value < 0.05 was considered statistically significant. For gene expression correlation analysis, we used a web tool called TIMER2.0 [42], which incorporated all TCGA expression data, to explore the mRNA expression relationship between 5hmC markers and other genes of interests in the TCGA-DLBC dataset. The correlation analysis was done using Spearman rank correlation.

Statistical analysis
For clinical data, continuous variables are presented as mean (SD) and categorical variables are presented as count (percentages). To understand the relationship between categorical/continuous variables and treatment outcome, Kruskal-Wallis test by ranks [43] and χ 2 test [44] were used, respectively. A two-sided p value of < 0.05 was considered to indicate statistical significance. The predictive power of clinical data was estimated by glm function in R-base and pROC package (version 1.15.3) in R.

Clinical characteristics of Diffuse large B Cell lymphoma (DLBCL) patients
The clinical summary, including baseline characteristics and laboratory data, of all 86 patients is shown in Table 1. Of the 86 patients with DLBCL patients, 46 were male and 40 were female. The median age of all the patients was 54.6 years, and 63.9% of patients had advanced disease (including stage III and stage IV). Importantly, all patients were newly diagnosed with DLBCL and received standard R-CHOP chemotherapy. Treatment efficacy was evaluated in all patients after 4 cycles of treatment.

5hmC profiles differ between responders and non-responders to R-CHOP treatment in the training cohort
Eighty-six DLBCL patients were randomly divided into the training cohort (n = 56) and validation cohort (n = 30) (Fig. 1). We used hmC-Seal to generate genome-wide 5hmC profiles for patients in the training set, including 35 responders and 21 non-responders to R-CHOP treatment. The overall 5hmC enrichment (all hMRs) was most common in intronic, intergenic, and promoter regions for both responders and non-responders, even though no statistically significant difference was found between these two groups for any genomic feature types (Fig. 2a). Meanwhile, we conducted differential analysis (EdgeR; p < 0.01, fold change > 0.5) and observed 205 DhMRs, including upregulate (n = 124) and downregulate (n = 81) regions in responders compared to nonresponders (Fig. 2b). For instance, FBXL4 (Fig. 2c) was highly enriched in hydroxymethylation for responders (p = 0.00087), and CTDP1 (Fig. 2d) was highly enriched in hydroxymethylation for non-responders (p = 0.00033). In addition, for the top 205 DhMRs, the most significant enrichment was found in intronic, intergenic, and promoter regions, consistent with previous studies [45,46] (Fig. 2e). Finally, heatmap results, using default clustering methods, demonstrated that these 205 DhMRs could effectively separate responders from non-responders (Fig. 2f ).

Pathway analysis and function exploration
Pathway analysis of 205 5hmC markers (Additional file 1: Table 1) in DLBCL patients suggested functional enrichment in certain canonical pathways. The top enriched GO biological pathways included signaling like alphabeta T cell differentiation, protein-lysine N-methyltransferase activity, and histone H3-K9 modification (Fig. 3a). Among these pathways, signaling by alpha-beta T cell differentiation was known to be relevant to tumor growth and apoptosis, which suggested that the DhMRs might be involved in the immunity system [47][48][49]. Meanwhile, the hubs of the GO functional interaction networks (Fig. 3b) showed that these genes, including BCL2 apoptosis regulator (BCL2), PR/SET domain 1 (PRDM1), prostaglandin E receptor 4 (PTGER4), SMAD family member 7 (SMAD7), H2.0 like homeobox (HLX), dedicator of cytokinesis 2 (DOCK2) and SH3 domain containing ring finger 1 (SH3RF1), participated in the regulating T cell activation and differentiation pathway.

5hmC markers showed prediction performance superior to clinical indicators for R-CHOP treatment response
Similarly, we generated genome-wide 5hmC profiles for patients in the validation set, including 22 responders and 8 non-responders to R-CHOP treatment. By using the recursive feature elimination algorithm based on the logistic regression CV estimator, we further reduced the number of 5hmC markers from 205 to 13, which achieved the best cross-validation score (Additional file 2: Figure S1). Further, we found that the 13 5hmC markers (Table 2), selected by the LR model, could distinguish responders from non-responders in both the training and validation cohorts (Fig. 4a, b). Meantime, these 13 5hmC markers could effectively predict responders and non-responders to R-CHOP treatment in the training (AUC = 1.00) and the validation cohorts (AUC = 0.78) (Fig. 4c), achieving 0.82 sensitivity and 0.75 specificity in the validation cohort (Fig. 4d). Finally, we also calculated the individual AUC for each of the 13 5hmC markers in the training and validation cohorts (Additional file 2: Figure S2A, B). Among these, ARHGEF12 and ZNF280D showed the best predictive performance, yielding an AUC of 0.76 in the validation cohort. We also investigated the association between available clinical indicators, including stage, pathology, IPI, LDH, β2MG and WBC, and R-CHOP treatment response. Among all those clinical indicators, only LDH (continuous variable, p = 0.03474) and stage (categorical variable, p = 0.004453) showed a significant association with treatment response (Additional file 2: Table 3). Thus, we used these two indicators to build logistic regression models to predict treatment response. As expected, LDH level, stage, and LDH combined with stage (LDH + stage) could also predict treatment response to a certain level. However, the AUC of LDH level (AUC = 0.646), stage (AUC = 0.658), and LDH combined with stage (AUC = 0.669) were lower than that of 5hmC markers (AUC = 0.78) (Fig. 4e).

Potential associations between 5hmC markers and R-CHOP treatment response in DLBCL patients
To further understand the potential associations between those 13 5hmC-modified marker genes and R-CHOP treatment response, we investigated their mRNA expression profiles and compared them to that of B-lymphocyte antigen CD20 (MS4A1), a rituximab target gene, in 48 DLBCL patients from the TCGA-DLBC dataset. Among those 13 marker genes, we found that the mRNA expression of MS4A1 was positively correlated with the mRNA expression of ARHGEF12 (rho = 0.385), FBXL4 (rho = 0.376), GOLGB1 (rho = 0.434), LMBR1 (rho = 0.45) (Additional file 2 Figure S3A-D). We decided to further investigate the potential mechanism of Rho Guanine Nucleotide Exchange Factor 12 (ARHGEF12),  for its mRNA expression was positively associated with MS4A1, and it achieved the highest AUC in the validation cohort among 13 5hmC-modified marker genes. According to recent studies, 5hmC enrichment in promoter regions was positively associated with gene expression levels [25,50]. In our study, ARHGEF12 was highly enriched in hydroxymethylation in the non-responders (p = 0.022) (Fig. 5a), and the hydroxymethylation site was in the promoter region (Additional file 3: Table 2). Therefore, we speculated that the change in 5hmC enrichment in the promoter region of ARHGEF12 might lead to the change in the mRNA expression of this gene. In addition, from the PPI network constructed from the STRING database, we identified several genes linked to ARHGEF12, including Ras Homolog Family Member A (RHOA), Ras Homolog Family Member B (RHOB), Ras Homolog Family Member C (RHOC), Cell Division Cycle 42 (CDC42), Rho Associated Coiled-Coil Containing Protein Kinase 1 (ROCK1), G Protein Subunit Alpha 12 (GNA12) and G Protein Subunit Alpha 13 (GNA13) (Fig. 5b). Interestingly, we found that all of these gene expressions (RHOA (rho = 0.667), RHOB (rho = 0.604), CDC42 (rho = 0.676), ROCK1 (rho = 0.832), GNA12 (rho = 0.721), GNA13 (rho = 0.784)) were highly positively associated with that of ARHGEF12 (Fig. 5c-h). Moreover, from survival analysis results in the TCGA-DLBC dataset, we found that the overall survival time (OS, days) of patients with high expression of ARHGEF12 and CDC42 was significantly lower than that of patients with low expression in these 2 genes (Fig. 5i, j). Also, we found that the mRNA expression of ARHGEF12 was positively associated with several immune-related genes, such as CD44, CD47, CD53, CD59, and CD274 (Additional file 2: Figure S4A-E). Finally, we conducted a GO enrichment analysis (Fig. 5k) for all the genes associated with ARHGEF12 (Fig. 5b) and found that the main GO enrichment was in the Rho signaling pathway which was consistent with the PPI network constructed from the STRING database.

Discussion
Even though previous studies have reported that 5hmC modifications could serve as potential diagnostic and prognostic markers in DLBCL patients [26], its role in the prediction of treatment response of R-CHOP scheme was not fully studied. Therefore, an accurate, noninvasive prediction test for treatment response of R-CHOP is highly desirable, and to this end, the emergence of liquid biopsy technology has shown to be a promising approach. In this study, we aimed to develop a model to predict R-CHOP scheme treatment response for DLBCL patients based on the 5hmC profiles derived from plasma cfDNA before R-CHOP treatment using hmC-Seal sequencing method.
In our cohort, we found that responders and nonresponders to R-CHOP scheme had distinctive differences in 5hmC enrichment, containing 205 DhMRs detected by differential analysis method. Additionally, pathway analysis of the 205 marker genes with differentially modified 5hmC between responders and nonresponders suggested enrichment in alpha-beta T cell activation and differentiation signaling pathway. As we all known, tumor progression and drug resistance are highly associated with the physiological state of the tumor microenvironment (TME), and thus, the tumor microenvironment (TME) represents an attractive therapeutic target and closely related to the curative effect of tumor therapy [47]. The composition of tumor microenvironment is complex, which mainly include tumor cells, stromal elements, extracellular matrixes, inflammation, and immune cells [48], which are closely related to tumor development, metastasis, and tumor therapy [49]. Importantly, cfDNA is not only derived from tumor cells, but also from the tumor microenvironment [51]. Therefore, these 5hmC marker genes could be related to the effect of R-CHOP treatment.
Furthermore, we found that 13 5hmC markers filtered by machine learning algorithms could well distinguish non-responders from responders in both the training and validation cohorts. Meantime, the prediction performance of the logistic regression model, established by 13 5hmC markers, achieving 0.82 sensitivity and 0.75 specificity (AUC = 0.78), was superior to existing clinical indicators, such as LDH (AUC = 0.646) and stage (AUC = 0.658). Furthermore, when combining the LDH and stage, the AUC was also lower than 13 5hmC markers. Taken together, these findings indicated that 5hmC markers derived from cfDNA may serve as effective biomarkers for (See figure on next page.) Fig. 5 ARHGEF12 and its potential relevance in DLBCL patients and treatment response. a Boxplot of ARHGEF12 grouped by treatment response (PDSD vs PRCR). Log2 transformed of TMM normalized 5hmC enrichment values were plotted, and Wilcoxon t test was used. b Functional protein-protein interaction networks (PPI) from the STRING database. c-h Correlation plots of the mRNA expression of ARHGEF12 with the mRNA expressions of genes in the RHO pathway, including RHOA, RHOB, CDC42, ROCK1, GNA12 and GNA13 in DLBCL in the TCGA-DLBC dataset. i, j Overall survival curves of DLBCL patients with low or high gene expressions in ARHGEF12 or CDC42 in the TCGA-DLBC dataset. The x-axis represents the OS time (days), and the y-axis represents the survival probability. (K) GO enrichment bar plot for genes associated with ARHGEF12 as shown in the PPI network (*p = 0.005-0.05, **p = 0.0005-0.005)