Quality Control and Selection of Damaging Rare Variants
The quality control filters for case exome data were chosen to closely match the criteria published by gnomAD. They can be categorized as being focused on variant-specific parameters and sample-specific parameters.
The Genome Analysis Toolkit (GATK) provides many variant-specific parameters. Variants retained for analysis include those with a GATK variant quality score recalibration value of “PASS”, root mean squared of mapping quality (MQ) ≥ 20, and inbreeding coefficient < -0.3. Of the single nucleotide variants (SNVs) filtered from the WES data, we evaluated only those SNVs that lay within any coding transcript or splice sites of the 523 genes of interest. In order to ensure that the allele counts compared at each locus (single nucleotide position) are representative of each population, we only compared SNVs that have ≥ 89.5% coverage of DP ≥ 10 in both the myelomeningocele cases and the gnomAD population used as references.
The remaining quality control filters target the quality of a variant within an individual sample. We kept samples that met the following at each locus: a genotype with alternate allele depth ≥ 25%, a read depth ≥ 10, and genotype quality score ≥ 20. A variant site is considered “covered” if its position had a DP ≥ 10 in 89.5% of samples.
In addition to evaluating only high quality variants and samples in the case population, the analysis was further focused to include only variants that were rare (defined as having an allele frequency in the gnomAD control population less than 1%) and predicted to be deleterious (coding for a stop gain, stop loss, splice site missense, or having a combined annotation dependent depletion phred score greater than 20). This applies to both myelomeningocele cases and gnomAD controls. We refer to rare, deleterious variants that met the above quality control standards as “qualifying variants.”
Mutational Burden Analysis
To find any genes with higher mutational burden in a case population compared to its gnomAD control, a two-by-two table for each gene was constructed for Fisher’s exact test. The Fisher’s exact test compares the number of affected individuals and unaffected individuals from case and the control populations, where an “affected individual” is someone containing at least one RDV in the gene being evaluated and an “unaffected individual” is someone containing zero RDVs in that gene.
Individual subject data was not readily available from gnomAD. Therefore, affected and unaffected control population numbers for each gene were estimated using the Hardy-Weinberg equations which utilize the number of qualifying RDVs and the total allele number within a given gene across the gnomAD population. We tested this estimation approach by first applying it to our case population, generating Q-Q plots that all included a λ value of 2. Any λ value greater than 1 indicates overestimation of affected individuals. For the actual burden analysis, we only applied this estimation to our control data and not our case data. So, this test indicates that our estimation approach errs on the side of overestimating affected individuals in the reference population. This one-sided overestimation of affected individuals makes our analysis less likely to suggest falsely high mutational burden for genes in the case population.
A Bonferroni correction for multiple comparisons was applied in order to find any genes with significant mutational burden in the case populations. For the Bonferroni correction, the conventional alpha value 0.05 was divided by the number of genes that were compared in a population’s mutational burden analysis. The resulting Bonferroni value is a strict P value cutoff for statistical “significance”. It is important to note that genes were only compared if they harbored qualifying variants, so not all 523 original genes were ultimately compared. Regardless of the Bonferroni correction, the term “nominally significant” refers to genes with Fisher’s exact test P values ≤ 0.05.