Detection of maternal contamination
Our first indication of potential maternal contamination of cord blood came from unusual patterns in the DNAm data during quality control. MDS plots of un-normalized data showed that the DNAm profiles of 17 of 86 male participants clustered with those of female children or fell between the male and female clusters, which was confirmed by plotting principal components 1 and 2 (Fig. 1a). Examining the X and Y chromosome probes in more detail, prior to probe filtering and normalization, we observed that these male children showed a DNAm pattern on the X chromosome that was intermediate between the normal male and normal female patterns (Fig. 1b). Together, these observations suggested that female blood had mixed with the cord blood of the newborn males, which could have occurred across the placenta during labor or after delivery.
Investigation of the cord blood collection procedure revealed that maternal contamination of the cord blood after delivery was the most likely explanation for these unexpected DNAm patterns. With this insight, we divided samples into three groups based on principal component 2 (PC2) of the full data and DNAm at cg05533223 on the X chromosome. As initially observed, PC2 clearly separated male from female samples, but was not associated with the major variables in the sub-study, ethnicity (ANOVA p > 0.8) or trauma exposure (t test p > 0.3). The CpG used, cg05533223, in the X-inactivation specific transcript (XIST), should be highly methylated in males and ~50% methylated in females [18]. Based on these two criteria, 17 males were contaminated (C), 64 were not contaminated (NC) and 5 were unclear (U) (Additional file 1: Figure S1). As we relied on X chromosome methylation levels, which would not differ between XX mothers and their XX daughters, this method was only applicable to XY male children. Since it called approximately 20% of male samples contaminated, we hypothesized that a similar proportion (approximately 13/64) of female children would also be contaminated. There was no reason to expect that the amount of maternal contamination due to sample collection would differ by sex, as all collection occurred in the same hospital using the same standard procedures.
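The expected number of contaminated females follows from scaling the male contamination rate to the number of female samples. The short sketch below reproduces that arithmetic using only the counts reported above; the function name is illustrative and is not part of the original analysis.

```python
def expected_contaminated(observed_contaminated, observed_total, target_total):
    """Scale an observed contamination rate to a second group of a given size."""
    rate = observed_contaminated / observed_total
    return rate, rate * target_total

# Counts from the text: 17 of 86 males called contaminated; 64 female samples.
rate, expected = expected_contaminated(17, 86, 64)
print(f"male contamination rate: {rate:.1%}")
print(f"expected contaminated females: {expected:.1f} of 64")
```

Rounding the expected value to the nearest whole sample gives the 13 of 64 quoted in the text.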
Using epigenetic age and genotyping no-calls to identify contaminated samples
We thus sought a way of discriminating contaminated females using other data. First, we tested epigenetic age by comparing the C and NC male samples using published methods [19]. As the epigenetic age of cord blood samples has been demonstrated to be below 1 year, we hypothesized that mixing with maternal blood would increase the epigenetic age of the whole sample. Though the mean DNAm ages were significantly different between C and NC (two-sided Student’s t test p = 0.025), the wide confidence interval (−14.71 to −1.08) meant that this was not a sufficiently accurate test, despite the identification of at least 4 females who were likely contaminated (Additional file 1: Figure S2A). Using a similar method that estimates gestational age from DNAm data [20], we found similarly poor predictive value (Additional file 1: Figure S2B).
Next, we used genotyping data to see whether a higher number of “no calls” from the Illumina PsychChip was associated with contamination. Our rationale was that mixing two blood samples together, even if genetically related, would result in a higher number of un-callable genotypes with signals falling between the three normal genotype groups. While this measure performed better than epigenetic age, the wide confidence interval (34,281.73–10,811.97, p value <0.001), the difference in basal number of no calls between males and females, and the potential lack of genotyping data in other studies meant that, in our opinion, this was not a suitable discriminatory screen either (Additional file 1: Figure S2C).
Identification of CpGs indicative of contamination
We next reasoned that since DNAm has been shown to differ greatly between neonates and adults, it might serve to discriminate contaminated samples. Using linear modeling followed by a random forests approach, we determined that 10 CpGs could discriminate between contaminated and non-contaminated male samples at 99% confidence (Additional file 1: Figure S2A, Additional file 1: Table S2). Importantly, the calculated thresholds for identifying contaminated samples were sensitive to normalization method, and so we present thresholds for two common normalization methods: SWAN and BMIQ [21, 22].
To identify the contaminated female samples, we applied the thresholds of these 10 CpGs to all of our samples (Fig. 2b). This method identified 13 females as contaminated, including the 4 previously identified by epigenetic age, in line with the approximately 20% expected based on the proportion of contaminated males. All 5 unclear males were categorized as non-contaminated (Fig. 2b). This showed that these 10 CpGs were sufficient for screening previously generated DNAm data to identify maternal blood contamination in male and female children. However, we wished to refine this panel so that samples could be screened prior to being run on an array in cases where contamination might be expected.
Verification of screening CpGs using pyrosequencing
To ensure that this pre-screening method was quick and cost-effective, we focused on pyrosequencing and reduced the 10 identified CpGs to 3. These three CpGs had the best discrimination between contaminated and non-contaminated male samples and were sites for which a robust pyrosequencing assay could be designed (Additional file 1: Table S2). After selecting cg25556035, cg15931839, and cg02812891, we performed pyrosequencing of these 3 sites on our original 150 samples (Fig. 2c). Interestingly, the assay that measured cg02812891 also measured cg13138089, as these CpGs are in close proximity. As these two CpGs were strongly correlated (r = 0.977) within the assay, we deemed cg13138089 to be redundant for the purpose of designing a minimal screen, though other groups may consider including it in the screening process. A strict cut-off requiring all 3 CpGs to surpass the contamination threshold identified 14 male samples as contaminated, all consistent with the array and X chromosome data. A less stringent cut-off of 2 CpGs identified 17 male samples, with 1 false positive and 1 false negative. In females, the less stringent 2 CpG cut-off predicted 11 of the 13 samples called contaminated using the 450K array data, and the strict cut-off predicted 6; neither had false positives. While this screen is not as accurate as the 10 CpG method from the 450K array data, it is sufficient to identify and eliminate the most heavily contaminated samples. All prediction methods and results are summarized in Fig. 3.
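The strict and less stringent cut-offs can be read as a "k of 3" rule: a sample is called contaminated when at least k of the three CpGs pass their thresholds. The following is a minimal sketch of that rule, not the analysis code used here. The cut-off values and directions below are hypothetical placeholders (the actual, normalization-specific thresholds are in Additional file 1: Table S2), and the function names and the reading of the 2 CpG cut-off as "at least 2" are assumptions.

```python
from typing import Dict, Tuple

# HYPOTHETICAL thresholds for illustration only; real values are in
# Additional file 1: Table S2. Each entry is (cut-off, direction):
# "above" flags methylation > cut-off, "below" flags methylation < cut-off.
THRESHOLDS: Dict[str, Tuple[float, str]] = {
    "cg25556035": (0.30, "above"),
    "cg15931839": (0.40, "above"),
    "cg02812891": (0.35, "below"),
}

def n_sites_flagged(meth: Dict[str, float],
                    thresholds: Dict[str, Tuple[float, str]] = THRESHOLDS) -> int:
    """Count how many CpGs pass their contamination threshold."""
    count = 0
    for cpg, (cutoff, direction) in thresholds.items():
        value = meth[cpg]
        passed = value > cutoff if direction == "above" else value < cutoff
        count += int(passed)
    return count

def is_contaminated(meth: Dict[str, float], min_sites: int) -> bool:
    """Call a sample contaminated when at least `min_sites` CpGs are flagged."""
    return n_sites_flagged(meth) >= min_sites

# Toy sample (made-up methylation fractions): two of three sites are flagged.
sample = {"cg25556035": 0.45, "cg15931839": 0.50, "cg02812891": 0.60}
strict_call = is_contaminated(sample, min_sites=3)    # all 3 CpGs required
lenient_call = is_contaminated(sample, min_sites=2)   # any 2 CpGs suffice
print(strict_call, lenient_call)
```

A sample such as this toy one is missed by the strict rule but caught by the 2 CpG rule, which illustrates why the less stringent cut-off identifies more samples at the cost of occasional false positives.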
Validation on second data set
To validate this screening method, 189 additional samples from the same cohort study were screened using the pyrosequencing assays. Eighteen males and 15 females were identified as contaminated using the 2 CpG cut-off, again approximating the 20% contamination rate we initially observed (Fig. 4a). We ran all 156 uncontaminated samples and 2 contaminated male samples on the EPIC array. We chose male samples for validation, as we could use sex-specific differences in DNA methylation at XIST on the X chromosome as independent confirmation of our screening method. Initial principal components plots showed that only the two known contaminated male samples demonstrated the intermediate DNAm pattern indicative of contamination (Fig. 4b). We then examined the 10 CpGs identified in our discovery data set and, as expected, only the 2 known male samples were identified as contaminated (Fig. 4c). This supports the conclusion that 3 CpGs are sufficient to correctly eliminate contaminated samples prior to running on an array.
Validation on publicly available data
To address the frequency with which maternal blood contamination occurs in DNAm studies, we used nine published cord blood DNAm data sets (GSE30870, GSE54399, GSE62924, GSE66459, GSE74738, GSE79056, GSE80310, GSE83334, and PREDO). We applied our post hoc maternal contamination assay with 10 CpGs across these studies and identified 2 data sets with contaminated samples (Fig. 5). GSE54399 had 2/24 (~8%, 1 male and 1 female) samples indicating contamination, and PREDO had 8/834 (~1%, 4 males and 4 females). Across all studies, maternal blood contamination was present at a frequency of approximately 1% (10/1014), but the study-specific pattern suggests that contamination may be related to specific collection methods.
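The per-study and overall contamination frequencies can be recomputed directly from the reported counts. The sketch below uses only the numbers given in the text; the dictionary and function names are illustrative.

```python
# (contaminated, total) counts exactly as reported in the text.
reported = {
    "GSE54399": (2, 24),
    "PREDO": (8, 834),
    "all nine data sets": (10, 1014),
}

def percent(contaminated: int, total: int) -> float:
    """Contamination frequency as a percentage."""
    return 100.0 * contaminated / total

rates = {name: percent(c, n) for name, (c, n) in reported.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1f}%")
```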
Finally, we examined our discovery samples, validation samples, and the publicly available data together to determine whether our 10 CpG method was affected by batch or technology. We compared the residuals of each sample’s methylation to thresholds of each of our 10 CpGs (Additional file 1: Figure S3). We observed similar distributions for each CpG in all studies except for the validation cohort, the only one to use the EPIC array. These data were normalized with methods consistent with the GEO data, so the effect is due to technology and not normalization method. This suggests that, despite successfully identifying the known contaminated samples in our EPIC cohort, the 10 CpG method is influenced by array technology and thus using all 10 CpGs is highly recommended when working with EPIC data.
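One way to picture this comparison is sketched below, under the assumption that "residual" means the signed difference between a sample's methylation value and the threshold for that CpG, grouped by study so that distributions can be compared across batches and array types. The CpG names, study labels, thresholds, and methylation values are toy placeholders and not data from any study.

```python
from typing import Dict, List, Tuple

def residuals_by_study(
    samples: List[Tuple[str, Dict[str, float]]],
    thresholds: Dict[str, float],
) -> Dict[str, Dict[str, List[float]]]:
    """Group signed (methylation - threshold) differences by study and CpG."""
    grouped: Dict[str, Dict[str, List[float]]] = {}
    for study, betas in samples:
        per_cpg = grouped.setdefault(study, {})
        for cpg, cutoff in thresholds.items():
            per_cpg.setdefault(cpg, []).append(betas[cpg] - cutoff)
    return grouped

# Toy values for illustration only (hypothetical CpGs and studies).
toy_thresholds = {"cpg_a": 0.50, "cpg_b": 0.20}
toy_samples = [
    ("450K_study", {"cpg_a": 0.40, "cpg_b": 0.10}),
    ("450K_study", {"cpg_a": 0.70, "cpg_b": 0.30}),
    ("EPIC_study", {"cpg_a": 0.55, "cpg_b": 0.25}),
]
toy_residuals = residuals_by_study(toy_samples, toy_thresholds)
for study, per_cpg in toy_residuals.items():
    for cpg, values in per_cpg.items():
        print(study, cpg, [round(v, 2) for v in values])
```

Plotting these per-study lists for each CpG side by side would show whether one platform's distribution is shifted relative to the others.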