DNA methylation levels are highly correlated between pooled samples and averaged values when analysed using the Infinium HumanMethylation450 BeadChip array
© Gallego Fabrega et al. 2015
Received: 31 March 2015
Accepted: 22 June 2015
Published: 31 July 2015
DNA methylation is a heritable and stable epigenetic mark implicated in complex human traits. Epigenome-wide association studies (EWAS) using array-based technology are becoming widely used to identify differentially methylated sites associated with complex diseases. EWAS require large sample sizes to detect small effects, which increases project costs. In the present study we propose pooling DNA samples in methylation array studies as an affordable and accurate alternative to analysing individual samples, either to reduce costs or when only small amounts of DNA are available. For this study, 20 individual DNA samples and 4 pooled DNA samples were analysed using the Illumina Infinium HumanMethylation450 BeadChip array to evaluate the efficiency of the pooling approach in EWAS. Statistical power calculations were also performed to determine the minimum sample size needed for the pooling strategy in EWAS.
A total of 485,577 CpG sites across the whole genome were assessed. Comparison of the methylation levels of all CpG sites between individual samples and their related pooled samples revealed highly significant correlations (rho > 0.99, p-value < 10^−16). These results remained similar when assessing the 101 most differentially methylated CpG sites (rho > 0.98, p-value < 10^−16). In addition, n = 43 was calculated to be the minimum sample size required to achieve 95 % statistical power at a 10^−6 significance level in EWAS when using a DNA pooling strategy.
DNA pooling strategies seem to provide accurate estimates of the averaged DNA methylation state in array-based EWAS. This type of approach can be applied to the assessment of disease phenotypes, reducing the amount of DNA required and the cost of large-scale epigenetic analyses.
Epigenetics refers to the stable, heritable and reversible modifications in gene expression associated with transcriptional regulation without alterations in the nucleotide sequence. Epigenetic processes such as DNA methylation (DNAm), histone acetylation/deacetylation, non-coding RNA expression and chromatin conformational changes are essential for normal cellular development and differentiation. They have also been linked to some monogenic and complex human diseases [3, 4]. DNA methylation is currently one of the most studied epigenetic modifications [5, 6], and alterations in methylation have been linked with disease processes such as different types of cancer [4, 7, 8], as well as with aging and exposure to tobacco smoke [9–12].
Some of the most important technologies used to detect DNA methylation are deep sequencing, high-throughput deep sequencing and array-based genome-wide approaches such as epigenome-wide association studies (EWAS).
In the “omics” era, genome-wide association studies (GWAS) have been widely used to discover the genetic polymorphisms associated with human diseases. These studies have been more successful in finding genes associated with complex diseases than classical candidate-gene studies. However, GWAS require larger sample sizes and specific arrays, which increase project costs. Several papers have observed that pooling strategies decrease the cost of GWAS while providing results similar to those of individual sample analysis.
Pearson and colleagues reported that pooling-based GWAS was theoretically effective in identifying genetic associations in different types of disease. Applying these methods to experimental case–control data, they also demonstrated the successful identification of previously published susceptibility loci for a rare monogenic disease, a rare complex disease and a common complex disease. In addition, Gaj et al. confirmed previously reported loci for colorectal cancer and prostate cancer in a Polish population with a pooling-based GWAS strategy.
Epigenome-wide association studies (EWAS) use the same strategy as GWAS, but for epigenetics. EWAS use array-based technology to detect the methylation levels at CpG sites across the genome, and EWAS of human diseases are becoming increasingly common [4, 7, 17, 18]. Like GWAS, EWAS are hypothesis-free approaches, but they search for differentially methylated sites instead of differences in allele frequencies. They also share the need for large sample sizes, so pooled DNA strategies might be an affordable alternative that reduces study costs in array-based EWAS.
No previous studies have analysed the accuracy of DNA pooling strategies in array-based EWAS. Our aim is thus to analyse pooling strategies in EWAS in order to determine how effective these approaches are for studying DNA methylation patterns in human samples.
In the present study, data from 20 individual DNA samples and 4 pooled DNA samples, analysed with the Illumina Infinium HumanMethylation450 BeadChip, were used to estimate the feasibility of the pooling approach, comparing the results of the individual samples to the results of the DNA pools of the same samples.
Results and discussion
Using values from the most significant differentially methylated CpG site (DMC) of the pooled samples in the EWAS, the optimum sample size to reach 95 % statistical power at a 10^−6 significance level ranges from 43 to 100 pooled samples per condition, for Cohen’s d effect sizes of 1.5 and 0.95, respectively.
The accuracy and reproducibility of DNA pools for methylation arrays, using the Illumina Infinium HumanMethylation450 BeadChip array, were investigated by comparing data obtained from individual samples with data from the same samples after they had been pooled.
Our data indicate that the DNA methylation profile (β-values of CpG sites) obtained from the pooled DNA samples using array technology is highly consistent with that obtained from the individual samples, even when evaluating the most significant DMCs separately (Group A: rho = 0.9808, p-value < 10^−16; Group B: rho = 0.9872, p-value < 10^−16).
A previous study analysing pooling strategies in methylation studies demonstrated that pools could be an alternative technique when small amounts of DNA are available or when a reduction in cost is necessary to undertake the experiments. In that study, Docherty et al. showed a correlation between 89 individual samples and 4 pool samples in 205 CpG sites spanning 9 genomic regions using Sequenom EpiTYPER. The overall correlation value in the study was 0.95 with a p-value < 2.2 × 10^−16, similar to the results that we observed. In our study, however, we found that pooling strategies can also be applied to whole-genome assessment in array-based EWAS experiments, analysing more than 450,000 CpG sites. This finding expands the possibilities of genome-wide studies in epigenetics. In a pooling-based GWAS study, Pearson et al. demonstrated successful identification of published genetic susceptibility loci for some human diseases: APOE-ε4 in Alzheimer disease, MAPT in progressive supranuclear palsy and TSPYL in sudden infant death with dysgenesis of the testes syndrome (SIDDT). In EWAS we have yet to confirm whether previously reported genes can be found using pooling strategies. However, the high correlation of the methylation levels between pools and individual samples indicates that pooling in EWAS is an accurate and attractive strategy to reduce the time, cost and amount of DNA needed in such experiments.
Even though a DNA pooling strategy has important advantages, there are several drawbacks that have to be considered in the study design. Pool construction has to be precise: DNA must be quantified accurately to ensure that each sample contributes an equal quantity of DNA to the pool, in order to minimize technical errors that may alter the estimated methylation levels [14, 20]. Only mean methylation levels, and not individual methylation data, can be obtained from pooled samples. In addition, adjusting for covariates is almost impossible unless the pooled samples are very homogeneous. Population stratification needs to be excluded. Furthermore, the error rate tends to be higher in pooled samples than in individual ones. It is also important, when designing an EWAS with pooled samples, to take into account the sample size needed to compute DMCs with confidence. According to the results obtained in our study, we suggest analysing at least n = 43 pooled samples per condition in order to achieve a 10^−6 significance level and 95 % statistical power, considering a Cohen’s d effect size of 1.5. However, this number may vary depending on a study’s characteristics.
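As an illustration of why equal DNA input matters, the sketch below treats the methylation level measured in a pool as the DNA-mass-weighted mean of the individual β-values. This weighting model, the function name `pooled_beta` and all numbers are assumptions made for illustration; they are not measurements or procedures from this study.

```python
def pooled_beta(betas, dna_ng):
    """Expected pool beta under an idealized model (assumption): each sample
    contributes signal in proportion to the DNA mass it adds to the pool."""
    if len(betas) != len(dna_ng):
        raise ValueError("one DNA quantity is needed per sample")
    total = sum(dna_ng)
    return sum(b * m for b, m in zip(betas, dna_ng)) / total


# Hypothetical beta-values of one CpG site in five individuals.
betas = [0.10, 0.20, 0.30, 0.40, 0.90]

# Equal input: the pool estimates the plain arithmetic mean.
equal_pool = pooled_beta(betas, [100, 100, 100, 100, 100])

# The last sample is over-represented by 50 %: the estimate shifts towards it.
skewed_pool = pooled_beta(betas, [100, 100, 100, 100, 150])

print(round(equal_pool, 4), round(skewed_pool, 4))
```

Under this model, a pipetting or quantification error in a single sample moves the pooled estimate away from the group mean, which is the technical error that careful pool construction is meant to avoid.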
In summary, this is the first study to analyse a pooling strategy in EWAS approaches. It found that this strategy is an acceptable alternative to regular individual EWAS analysis, mainly in specific situations such as when only small quantities of DNA are available, or in studies with a limited budget.
The analysis of the data generated by 450,000 CpG sites across the whole genome in 20 individual samples demonstrates that DNA pooling strategies can be used to provide estimations of averaged DNA methylation state using the Illumina Infinium HumanMethylation450 BeadChip array. This approach may be useful to highlight genome regions to be studied in further epigenetic analysis, reducing the costs and the amount of DNA required.
Sample selection and pool construction
10 (50 %)
72.25 ± 8.4 | 72.5 ± 8.4 | 72 ± 8.7
16 (80 %) | 8 (40 %) | 8 (40 %)
4 (20 %) | 2 (10 %) | 2 (10 %)
8 (40 %) | 4 (20 %) | 4 (20 %)
4 (20 %) | 2 (10 %) | 2 (10 %)
2 (10 %) | 1 (5 %) | 1 (5 %)
6 (30 %) | 3 (15 %) | 3 (15 %)
DNA purification and sample pooling
Total genomic DNA was extracted from whole blood samples using the Gentra Puregene Blood Kit (Qiagen, Hilden, Germany) following the manufacturer’s instructions. The samples were maintained at −20 °C until the EWAS analysis.
Epigenome wide association analysis
Genome-wide DNA methylation was assessed using the Infinium HumanMethylation450 BeadChip (Illumina Inc., San Diego, CA). This array quantitatively measures more than 450,000 CpG sites at single-nucleotide resolution, with 99 % coverage of RefSeq genes.
Quality control (QC) of all samples was performed as a first step, checking DNA integrity on Invitrogen E-Gel 1 % agarose gels. None of the DNA samples showed fragmentation or poor quality.
Genomic DNA from the 20 samples and the 4 pools was bisulphite converted using the Zymo EZ DNA Methylation™ Kit (Zymo Research, Orange, CA) following the manufacturer’s instructions, but with the alternative incubation conditions suggested for the Illumina Infinium Methylation Assay. All samples were processed in a single working batch using the Illumina Infinium MSA4 protocol, which includes amplification, fragmentation, hybridization and BeadChip scanning.
For QC, the fluorescence data generated for each CpG locus were analysed with the Illumina GenomeStudio software package. Samples and CpG sites with fluorescence detection p-values > 0.05 were removed. The detection p-value represents the confidence that a given methylation level at a CpG site can be considered to have been detected.
Quality control and normalization
R packages and instructions. Specific instructions used from each R package:
- Load Illumina methylation data into a MethyLumiSet object.
- Density plots of methylation β-values.
- Filter data sets based on bead count and detection p-values.
- Multidimensional scaling (MDS) plots showing a two-dimensional projection of distances between samples.
- Calculate normalized β-values from Illumina 450K methylation arrays.
- Estimate the methylation β-value matrix from an eSet-class object (which includes methylated and unmethylated probe intensities).
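For orientation, the β-value estimated from methylated and unmethylated probe intensities can be sketched as follows. The formula with an offset of 100 is the convention commonly used for Illumina methylation arrays; it is given here as an assumption, since the exact settings passed to the R functions are not stated in this article, and the function name `beta_value` is illustrative.

```python
def beta_value(methylated, unmethylated, offset=100.0):
    """Beta = M / (M + U + offset), bounded between 0 and 1.

    Negative (background-subtracted) intensities are clipped to zero.
    The offset stabilizes the ratio when both intensities are low.
    """
    m = max(methylated, 0.0)
    u = max(unmethylated, 0.0)
    return m / (m + u + offset)


# Hypothetical probe intensities for three CpG sites.
for m, u in [(900.0, 0.0), (4500.0, 4500.0), (50.0, 9850.0)]:
    print(m, u, round(beta_value(m, u), 3))
```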
Prior to the identification of differentially methylated CpG sites, data were pre-processed using a non-specific filter step. This step consisted of removing CpG sites with a detection p-value ≥ 0.05 in more than 1 % of the samples, samples with a detection p-value ≥ 0.05 in more than 1 % of the CpG sites, and CpG sites with beadCount < 3 in 5 % of samples. CpG sites containing documented single nucleotide polymorphisms (SNPs) were also removed. Multidimensional scaling (MDS) plots were used to evaluate gender outliers based on chromosome X data, where males and females separated into two distinct clusters. An MDS plot was also used to check for unknown population structure within the sample. Then, CpG sites on the X and Y chromosomes were removed. Finally, a subset quantile normalization was performed using a background adjustment, between-array normalization and a dye bias correction, following previous recommendations [27, 28].
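The detection p-value part of this filter can be sketched as below. This is a minimal illustration and not the R code used in the study: the function name `detection_filter`, the choice to evaluate both criteria on the unfiltered matrix, and the toy p-values are all assumptions.

```python
def detection_filter(pvals, p_cut=0.05, max_frac=0.01):
    """pvals[i][j] is the detection p-value of CpG site i in sample j.

    Returns the indices of the sites and samples to keep. A site (or sample)
    is dropped when more than max_frac of its values have p >= p_cut.
    Both fractions are computed on the unfiltered matrix (assumption).
    """
    n_sites, n_samples = len(pvals), len(pvals[0])
    site_fail = [sum(p >= p_cut for p in row) / n_samples for row in pvals]
    sample_fail = [
        sum(pvals[i][j] >= p_cut for i in range(n_sites)) / n_sites
        for j in range(n_samples)
    ]
    keep_sites = [i for i, f in enumerate(site_fail) if f <= max_frac]
    keep_samples = [j for j, f in enumerate(sample_fail) if f <= max_frac]
    return keep_sites, keep_samples


# Toy matrix: 4 CpG sites x 3 samples, with one failed measurement.
toy = [
    [0.001, 0.002, 0.001],
    [0.200, 0.001, 0.003],  # site 1 fails in sample 0
    [0.001, 0.001, 0.002],
    [0.001, 0.004, 0.001],
]
keep_sites, keep_samples = detection_filter(toy)
print(keep_sites, keep_samples)
```

With only a handful of rows and columns, a single failed measurement already exceeds the 1 % limit, so the toy example removes both the affected site and the affected sample; on a full array the same thresholds are far more tolerant.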
All statistical analyses were performed using R (version 3.0.1). The accuracy of the DNA methylation levels estimated from pooled DNA was assessed with Spearman’s correlation, a non-parametric method, between the β-values of each pool and the averaged β-values of the individual samples included in that pool.
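The comparison described above can be sketched in a few lines. The code below is an illustrative re-implementation in Python rather than the R code used in the study: it averages the individual β-values per CpG site and computes Spearman’s rho as the Pearson correlation of average ranks. The helper names and the toy β-values are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy data: rows are individual samples, columns are CpG sites.
individual = [
    [0.10, 0.80, 0.45, 0.95],
    [0.14, 0.76, 0.55, 0.91],
    [0.12, 0.84, 0.50, 0.93],
]
averaged = [mean(site) for site in zip(*individual)]
pool = [0.13, 0.79, 0.52, 0.92]  # hypothetical beta-values of the pool

rho = spearman_rho(pool, averaged)
print(rho)
```

Because Spearman’s rho depends only on rank order, it measures whether the pool and the averaged individuals order the CpG sites in the same way; it does not by itself show that the β-values agree in absolute terms.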
We also calculated Spearman’s correlation between the β-values of the 101 most differentially methylated CpGs (DMCs) found in individual samples (Group A vs. Group B) and the β-values of the same CpG sites in the pools. Differentially methylated CpG sites were determined with the Mann–Whitney U-test for non-parametric samples using the β-values, with a threshold of p-value < 10^−6 adapted from Rakyan et al. The DMC analysis compared group A samples (n = 10) against group B samples (n = 10).
The minimum sample size needed for pool analysis in EWAS was calculated using the pwr package, which implements power analysis as outlined by J. Cohen (1988).
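To give a feel for how the sample size scales with effect size, the sketch below uses the standard normal approximation for a two-sided, two-group comparison, n = 2((z_{1−α/2} + z_{power}) / d)^2 per group. It is not the calculation performed with the pwr package: the two-sample design is an assumption, and the function name `n_per_group` is illustrative.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, alpha=1e-6, power=0.95):
    """Approximate samples per group for a two-sided two-group comparison.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_power) / d) ** 2,
    where d is Cohen's d. A t-distribution based calculation gives
    somewhat larger values, especially at very small alpha.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


for d in (1.5, 0.95):
    print(d, n_per_group(d))
```

For d = 1.5 and d = 0.95 this approximation gives roughly 38 and 95 samples per group, a few samples fewer than the 43 and 100 reported here, which is the direction expected when the normal distribution is used in place of the t-distribution.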
Ethical approval has been obtained from the ethical committee of the Vall d’Hebron Hospital (PR(AG) 03/2007). All patients were provided with oral and written information about the project, and each participant signed the informed consent for the study.
The Laboratory of Stroke Pharmacogenomics and Genetics is part of the International Stroke Genetics Consortium (ISGC, www.strokegenetics.com) and coordinates the Spanish Stroke Genetics Consortium (Genestroke, www.genestroke.com). I. F-C. is supported by the Miguel Servet programme (CP12/03298), Instituto de Salud Carlos III. This study was funded by the Miguel Servet grant (Pharmastroke project: CP12/03298) and by the Fundació Docència I Recerca MutuaTerrassa, Hospital Universitari Mutua de Terrassa grant (EXCLOP project).
The Neurovascular Research Laboratory receives grants from the Spanish stroke research network (INVICTUS) and the European Stroke Network (EUSTROKE 7FP Health F2-08-202213).
1. Langevin SM, Kelsey KT. The fate is not always written in the genes: epigenomics in epidemiologic studies. Environ Mol Mutagen. 2013;54:533–41.
2. Gopalakrishnan S, Van Emburgh BO, Robertson KD. DNA methylation in development and human disease. Mutat Res. 2008;647:30–8.
3. Jiang Y-H, Bressler J, Beaudet AL. Epigenetics and human disease. Annu Rev Genomics Hum Genet. 2004;5:479–510.
4. Rakyan VK, Down TA, Balding DJ, Beck S. Epigenome-wide association studies for common human diseases. Nat Rev Genet. 2011;12:529–41.
5. Bock C. Epigenetic biomarker development. Epigenomics. 2009;1:99–110.
6. Feinberg AP. Genome-scale approaches to the epigenetics of common human disease. Virchows Arch. 2010;456:13–21.
7. Verma M. Epigenome-Wide Association Studies (EWAS) in Cancer. Curr Genomics. 2012;13:308–13.
8. Shen J, Wang S, Zhang Y-J, Wu H-C, Kibriya MG, Jasmine F, et al. Exploring genome-wide DNA methylation profiles altered in hepatocellular carcinoma using Infinium HumanMethylation 450 BeadChips. Epigenetics. 2013;8:34–43.
9. Horvath S, Zhang Y, Langfelder P, Kahn RS, Boks MP, van Eijk K, et al. Aging effects on DNA methylation modules in human brain and blood tissue. Genome Biol. 2012;13:R97.
10. Johnson AA, Akman K, Calimport SRG, Wuttke D, Stolzing A, de Magalhães JP. The role of DNA methylation in aging, rejuvenation, and age-related disease. Rejuvenation Res. 2012;15:483–94.
11. Shenker NS, Ueland PM, Polidoro S, van Veldhoven K, Ricceri F, Brown R, et al. DNA methylation as a long-term biomarker of exposure to tobacco smoke. Epidemiology. 2013;24:712–6.
12. Flom JD, Ferris JS, Liao Y, Tehranifar P, Richards CB, Cho YH, et al. Prenatal smoke exposure and genomic DNA methylation in a multiethnic birth cohort. Cancer Epidemiol Biomarkers Prev. 2011;20:2518–23.
13. Gupta R, Nagarajan A, Wajapeyee N. Advances in genome-wide DNA methylation analysis. Biotechniques. 2010;49:iii–xi.
14. Sham P, Bader JS, Craig I, O’Donovan M, Owen M. DNA Pooling: a tool for large-scale association studies. Nat Rev Genet. 2002;3:862–71.
15. Pearson JV, Huentelman MJ, Halperin RF, Tembe WD, Melquist S, Homer N, et al. Identification of the genetic basis for complex disorders by use of pooling-based genomewide single-nucleotide-polymorphism association studies. Am J Hum Genet. 2007;80:126–39.
16. Gaj P, Maryan N, Hennig EE, Ledwon JK, Paziewska A, Majewska A, et al. Pooled sample-based GWAS: a cost-effective alternative for identifying colorectal and prostate cancer risk variants in the Polish population. PLoS One. 2012;7:e35307.
17. Xu X, Su S, Barnes VA, De Miguel C, Pollock J, Ownby D, et al. A genome-wide methylation study on obesity: differential variability and differential methylation. Epigenetics. 2013;8:522–33.
18. Häsler R, Feng Z, Bäckdahl L, Spehlmann ME, Franke A, Teschendorff A, et al. A functional methylome map of ulcerative colitis. Genome Res. 2012;22:2130–7.
19. Docherty SJ, Davis OSP, Haworth CMA, Plomin R, Mill J. Bisulfite-based epityping on pooled genomic DNA provides an accurate estimate of average group DNA methylation. Epigenetics Chromatin. 2009;2:3.
20. Norton N, Williams NM, O’Donovan MC, Owen MJ. DNA pooling as a tool for large-scale association studies in complex traits. Ann Med. 2004;36:146–52.
21. Teumer A, Ernst FD, Wiechert A, Uhr K, Nauck M, Petersmann A, et al. Comparison of genotyping using pooled DNA samples (allelotyping) and individual genotyping using the affymetrix genome-wide human SNP array 6.0. BMC Genomics. 2013;14:506.
22. Aryee MJ, Jaffe AE, Corrada-Bravo H, Ladd-Acosta C, Feinberg AP, Hansen KD, et al. Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays. Bioinformatics. 2014;30:1363–9.
23. Davis S, Du P, Bilke S, Triche T J and BM. methylumi: Handle Illumina methylation data. R package version 2100. 2014.
24. Du P, Kibbe WA, Lin SM. lumi: a pipeline for processing Illumina microarray. Bioinformatics. 2008;24:1547–8.
25. Schalkwyk LC, Pidsley R, Wong CC, Touleimat wfcbN, Defrance M TA and MJ. wateRmelon: Illumina 450 methylation array normalization and metrics. R package version 140. 2013.
26. Price ME, Cotton AM, Lam LL, Farré P, Emberly E, Brown CJ, et al. Additional annotation enhances potential for biologically-relevant analysis of the Illumina Infinium HumanMethylation450 BeadChip array. Epigenetics Chromatin. 2013;6:4.
27. Touleimat N, Tost J. Complete pipeline for Infinium(®) Human Methylation 450K BeadChip data processing using subset quantile normalization for accurate DNA methylation estimation. Epigenomics. 2012;4:325–41.
28. Pidsley R, Wong CC Y, Volta M, Lunnon K, Mill J, Schalkwyk LC. A data-driven approach to preprocessing Illumina 450K methylation array data. BMC Genomics. 2013;14:293.
29. Basic Functions for Power Analysis. [http://cran.r-project.org/web/packages/pwr/pwr.pdf].
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.