Quality Control and Selection of Damaging Rare Variants
The quality control filters for case exome data were chosen to closely
match the criteria published by gnomAD. They can be categorized as being
focused on variant-specific parameters and sample-specific parameters.
The Genome Analysis Toolkit (GATK) provides many variant-specific
parameters. Variants retained for analysis include those with a GATK
variant quality score recalibration value of “PASS”, root mean squared
of mapping quality (MQ) ≥ 20, and inbreeding coefficient <
-0.3. Of the single nucleotide variants (SNVs) filtered from the WES
data, we evaluated only those SNVs that lay within any coding transcript
or splice sites of the 523 genes of interest. In order to ensure that
the allele counts compared at each locus (single nucleotide position)
are representative of each population, we only compared SNVs that have ≥
89.5% coverage of DP ≥ 10 in both the myelomeningocele cases and the
gnomAD population used as references.
The remaining quality control filters target the quality of a variant
within an individual sample. We kept samples that met the following at
each locus: a genotype with alternate allele depth ≥ 25%, a read depth
≥ 10, and genotype quality score ≥ 20. A variant site is considered
“covered” if its position had a DP ≥ 10 in 89.5% of samples.
In addition to evaluating only high quality variants and samples in the
case population, the analysis was further focused to include only
variants that were rare (defined as having an allele frequency in the
gnomAD control population less than 1%) and predicted to be deleterious
(coding for a stop gain, stop loss, splice site missense, or having a
combined annotation dependent depletion phred score greater than 20).
This applies to both myelomeningocele cases and gnomAD controls. We
refer to rare, deleterious variants that met the above quality control
standards as “qualifying variants.”
Mutational Burden
Analysis
To find any genes with higher mutational burden in a case population
compared to its gnomAD control, a two-by-two table for each gene was
constructed for Fisher’s exact test. The Fisher’s exact test compares
the number of affected individuals and unaffected individuals from case
and the control populations, where an “affected individual” is someone
containing at least one RDV in the gene being evaluated and an
“unaffected individual” is someone containing zero RDVs in that gene.
Individual subject data was not readily available from gnomAD.
Therefore, affected and unaffected control population numbers for each
gene were estimated using the Hardy-Weinberg equations which utilize the
number of qualifying RDVs and the total allele number within a given
gene across the gnomAD population. We tested this estimation approach by
first applying it to our case population, generating Q-Q plots that all
included a λ value of 2. Any λ value greater than 1 indicates
overestimation of affected individuals. For the actual burden analysis,
we only applied this estimation to our control data and not our case
data. So, this test indicates that our estimation approach errs on the
side of overestimating affected individuals in the reference population.
This one-sided overestimation of affected individuals makes our analysis
less likely to suggest falsely high mutational burden for genes in the
case population.
A Bonferroni correction for multiple comparisons was applied in order to
find any genes with significant mutational burden in the case
populations. For the Bonferroni correction, the conventional alpha value
0.05 was divided by the number of genes that were compared in a
population’s mutational burden analysis. The resulting Bonferroni value
is a strict P value cutoff for statistical “significance”. It
is important to note that genes were only compared if they harbored
qualifying variants, so not all 523 original genes were ultimately
compared. Regardless of the Bonferroni correction, the term “nominally
significant” refers to genes with Fisher’s exact test P values ≤
0.05.