DNAm was measured in cord blood at birth. Bisulfite conversion was performed with the EZ-96 DNAm kit (shallow) (Zymo Research Corporation, Irvine, USA). DNAm levels were then measured with the Illumina Infinium HumanMethylation450 BeadChip array (Illumina Inc., San Diego, USA). Preprocessing in ALSPAC was performed with the meffil package . Quality control check included on a sample levels mismatched genotypes, mismatched sex, incorrect relatedness and on a CpG level low concordance with other time points, extreme dye bias, and poor probe detection. In Generation R, pre-processing was performed with the CPACOR workflow . Quality control exclusion criteria included on a CpG level failed bisulfite conversion, hybridization or extension, and on a sample level sex mismatches and call rate < = 95%. Both cohorts were normalized using a combined dataset, using meffil functional normalization with ten control probe principal components and slide included as a random effect, see Mulder et al.  for further details. To lessen the influence of methylation outliers while retaining a consistent sample size, extreme values were winsorized. Per CpG site, DNAm levels exceeding three times the interquartile range above the third or below the first quartile (3*IQR criterion) were replaced by the maximum or minimum value, respectively, of the sample below the exclusion criterion. Only autosomal probes were considered in this study for consistent interpretation of effects between sexes. This resulted in 470,870 and 473,864 CpG probes in ALSPAC and Generation R, respectively, which were used for the computation of the methylation similarity matrix.
Outcomes and covariates
In ALSPAC, birthweight was recorded by healthcare professionals at the time of birth and extracted from birth records . Gestational age at delivery was also extracted from birth records. Obstetric practice and antenatal care at the time means that for most participants gestational age will have been estimated based on the last menstrual period, supplemented by ultrasound scans and paediatric/obstetric assessment of the newborn at birth.
In GenR, midwife and hospital registries were used to obtain information on birth weight. Gestational age was based on ultrasound examinations for mothers who enrolled in early or mid pregnancy, but based on last menstrual period for late pregnancy .
In ALSPAC, measurements of height and weight, with the participant in light clothing and without shoes, were obtained at clinic visits when the children were seven years of age to calculate BMI. Non-verbal IQ at age 8 years was measured by the Wechsler Intelligence Scale for Children WISC-III UK . ADHD symptomatology was assessed via maternal ratings at age 7, with the Development and Well-Being Assessment interview (DAWBA) .
In Generation R, when children were 6.0 (SD = 0.15) years old, children’s height and weight were measured at the research center without shoes or heavy clothing and used for the calculation of BMI (kg/m2). Non-verbal IQ was assessed at the same age using the Snijder-Oomen nonverbal intelligence test . ADHD symptoms were rated by a primary caregiver (90% mothers) using the Conners’ Parent Rating Scale-Revised (CPRS-R) questionnaire at age 8.1 (SD = 0.15) .
In ALSPAC, mothers were asked about their smoking during pregnancy, and these data were used to generate a binary variable of any smoking during pregnancy. Maternal education was collapsed into whether they had achieved a university degree or not. White cell proportions were estimated with the Houseman method using a combination of four reference panels specific for cord blood . In Generation R, maternal age was obtained at enrollment. Maternal smoking was defined as either “Never smoked”, “Quit smoking in early pregnancy”, “Continued smoking during pregnancy”. Maternal education during pregnancy was categorized as “no education”, “primary education”, “secondary education first phase”, “secondary education second phase”, higher education first phase”, higher education second phase”. See Table 1 for descriptive statistics of all variables.
We adapted the GREML approach to estimate R2Methylation. The GREML procedure consists of two steps: (1) Compute a genetic relatedness matrix (i.e. how genetically similar two individuals are based on SNP data), (2) Regress the outcome similarity between participants on the genetic relatedness (i.e. to establish whether greater genetic similarity between individuals relates to greater phenotypic similarity).
We refer to a methylation similarity matrix (M) as opposed to a genetic relatedness matrix (G). However, both M and G can be calculated with the same algorithm. First the methylation in beta values were z-score standardized. The resulting matrix (X) of methylation z-scores (columns: CpG sites, rows: participants) was then multiplied with the transpose resulting in XX’. XX’ was then standardized by dividing the matrix with the mean of the diagonal, resulting in an average value of 1 for the diagonal of M. We used the R package BGData 2.1.0  to compute the similarity matrix.
The next step is to regress the outcomes on M and covariates using a mixed effects model fitted with REML. Fixed effects covariates included several variables known to be associated with DNAm levels: sex, maternal age, maternal smoking, maternal education, cell type proportions, gestational age, birth weight (unless a variable was the outcome). M and batch were defined as random effects.
Multiple imputation was used to avoid potential bias due to missing data and to make analyses more comparable between outcomes by including the same set of participants. We used the covariate and outcome variables to predict missing variables in 10 imputations with 30 iterations using MICE  in R.
According to power analyses with genetic data, to accurately estimate the variance explained using GREML methods, large sample sizes are necessary. Especially with less heritable traits sample sizes above 5000 participants are recommended . Currently, studies that have measured DNAm and child outcomes in more than 1000 participants are rare. While the power requirements for DNAm data are unclear, there is nevertheless a high risk of sampling variance, with results randomly changing heavily depending on a particular sample composition. We attempted to reduce these risks by estimating R2Methylation in two independent cohorts, as well as by performing cross-validation within cohorts.
Cross-validation (CV) was applied in the following way: (1) M was estimated across all participants and stayed constant independent of later training and test assignment. (2) Ninety percent of the sample was randomly chosen as training sample and the GREML model was fitted in this training sample. (3) Based on the results of the training sample the best linear unbiased predictions (BLUP) were extracted for the test sample. The BLUP estimates are continuous values, which reflect the extent to which participants are predicted to have above or below average outcome values, based on their similarity in genome-wide methylation to other participants. (4) The outcome is predicted based on M and the covariates in the test set. (5) The predictions are correlated with the actual observed outcome in the test set and squared to obtain the variance explained by the model. (6) The variance explained by a covariate only model is subtracted to obtain the variance explained by DNAm beyond the other tested variables (ΔR2Methylation). (7) Step 1–6 are repeated to have results for 100 random training–testing splits (Monte-Carlo cross-validation). (8) The mean estimate of ΔR2Methylation, with standard deviation across the cross-validation splits are extracted. (9) Steps 1–8 are repeated for each imputed dataset (n = 10) and then pooled using Rubin’s rule. (10) Results of both cohorts are averaged, weighted by the inverse of the cross-validation variance.
We reran all analyses with an M matrix based on independent methylation features only. To construct such a matrix, we used EnMCB 1.7.3 to identify MCBs . We aggregated neighboring CpG sites with r = > 0.3 correlation and maximum 1000 bp gap into MCBs. We then averaged the methylation betas of all CpGs within one MCB resulting in a single value, representing the methylation status of a MCB. The MCB methylation levels were then combined with independent CpG sites (r < 0.3) and used to compute an M matrix. ALSPAC used 60,722 MCBs and 167,440 independent CpGs to produce an independent M matrix and Generation R used 71,175 MCBs alongside 206,062 independent CpGs sites.
These analyses were run with the qgg package in R, which has implemented GREML models with cross-validation . We wrote additional functions, which can be found in the omicsR2 package: https://github.com/aneumann-science/omicsR2. The provided functions simplify the process of comparing the predictive performance of DNA methylation compared to a covariates-only baseline model and add support for multiply imputed data.