Introduction:
Natural history repositories represent invaluable collections of
specimens for scientific use across diverse fields (Blagoderov,
Kitching, Livermore, Simonsen, & Smith, 2012; Lane, 1996; Lister &
Group, 2011; S. L. Williams, 1999). Many of these specimens represent
populations of plants and animals that no longer exist due to land use
change and human alterations of landscapes over the past century (Smith
et al., 2013). Additionally, museum specimens often represent the few or
only representatives of endangered or rare species, and provide
important vouchers for comparison with modern samples, as well as
genetic resources for species which may be logistically difficult or
impossible to sample in wild habitats (W. Miller et al., 2009; White,
Mitchell, & Austin, 2018). As such, usage of museum specimens for
modern research incorporating DNA analysis is increasing. In turn,
destructive sampling requests are increasing, many of which propose
molecular sequencing from specimens as a justification for consumption
of source material.
The degraded DNA associated with museum specimens is known to require
extra stringency to guard against exogenous DNA sequences (Paabo et al.,
2004; Rizzi, Lari, Gigli, De Bellis, & Caramelli, 2012), and the use of
PCR-based methods has revealed that nuclear copies of mitochondrial DNA
can confound mitochondrial sequence results from degraded or ancient
DNA (den Tex, Maldonado, Thorington, & Leonard, 2010). The extracted
DNA in each sample is often
contaminated by exogenous sources (humans, bacteria, pests) and the
endogenous DNA can be highly fragmented (Campana et al., 2012; Hawkins,
Hofman, et al., 2016; McDonough, Parker, Rotzel McInerney, Campana, &
Maldonado, 2018). Studies which reliably sequence DNA from museum
specimens follow stringent protocols to cope with the low quantity and
highly fragmented nature of museum specimen extracts. Such studies must
take additional precautions to prevent cross contamination of samples,
and specimens should be processed in lab spaces appropriate to the
material. Downstream of wet lab procedures, additional bioinformatic
steps should be taken to ensure that the resulting genetic sequence
data represent the target taxa.
Truly ancient samples (derived from archaeological samples, permafrost
specimens, coprolites, sediments, mummies, and others) have been shown
to exhibit patterns of degradation associated with nucleotide
misincorporation, namely cytosine to uracil deamination, and these
characteristic patterns can be tested for to authenticate the recovered
sequences (Hofreiter, Serre, Poinar, Kuch, & Paabo, 2001; Jónsson,
Ginolhac, Schubert, Johnson, & Orlando, 2013). Patterns of degradation
are only starting to be understood, and vary depending on the type of
samples being processed (Shapiro, 2012; Weiß et al., 2016), with museum
specimens lacking the characteristic cytosine to uracil deamination
(McDonough et al., 2018).
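To illustrate how such a deamination signal can be tested for, the
following minimal Python sketch tabulates the fraction of reference
cytosines that are read as thymine (the base recorded when a deaminated
cytosine is sequenced) at each position from the 5' end of a read. It is
an illustration under simplifying assumptions and not the procedure used
in the studies cited above: the function name, the toy sequences, and the
requirement that each read is already aligned to its reference without
gaps are all hypothetical.

```python
from collections import Counter


def c_to_t_by_position(pairs, max_pos=10):
    """Fraction of reference C bases read as T, by distance from the 5' end.

    `pairs` is an iterable of (reference, read) strings of equal length,
    assumed to be aligned without gaps (a simplification for illustration).
    Positions with no reference C are omitted from the result.
    """
    c_total = Counter()  # reference C bases observed at each position
    c_to_t = Counter()   # of those, how many were read as T
    for ref, read in pairs:
        for pos, (r, q) in enumerate(zip(ref.upper(), read.upper())):
            if pos >= max_pos:
                break
            if r == "C":
                c_total[pos] += 1
                if q == "T":
                    c_to_t[pos] += 1
    return {pos: c_to_t[pos] / c_total[pos] for pos in sorted(c_total)}


# Toy (reference, read) pairs, invented for illustration only.
toy_pairs = [
    ("CACGT", "TACGT"),
    ("CCGTA", "TCGTA"),
    ("CGCAT", "CGCAT"),
    ("ACCGT", "ACTGT"),
]
profile = c_to_t_by_position(toy_pairs)
for pos, freq in profile.items():
    print(f"position {pos}: C->T frequency {freq:.2f}")
```

In damaged ancient DNA the frequency would be expected to be elevated
near the read ends and to decline toward the interior, whereas a flat
profile would be consistent with the absence of this signal.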
A study of mitochondrial genome enrichment from museum specimens
(Hawkins, Hofman, et al., 2016) found that sample type was more
predictive of amplification success than age. Another study also
concluded that success rates and endogenous DNA content varied widely
depending on the type of sample consumed (McDonough et al., 2018), and
Campana et al. (2012) found that recovery of longer mitochondrial
(D-loop) PCR products did not correlate with the success of nuclear DNA
amplification. When granting destructive sampling permissions,
institutions often set individual policies on what types of samples are
provided to approved research projects. As such, the most desired
sample types may not be approved for consumption in DNA extraction.
Short Tandem Repeats (STRs), also commonly referred to as microsatellite
loci, have been useful markers for numerous applications, such as
forensics and cancer diagnosis, and are widely used in conservation
genetics to evaluate genetic diversity and population structure in
organisms ranging from bacteria to plants and animals (e.g. Bilska &
Szczecińska, 2016; Thatte, Joshi, Vaidyanathan, Landguth, &
Ramakrishnan, 2018). Historically, microsatellites were isolated from a
specific species of interest for use in population-level analyses, a
process which required time and funding before any analysis of the taxa
of interest could begin (Fisher, Gardner, & Richardson, 1996; Glenn &
Schable, 2005; Lian, Wadud, Geng, Shimatani, & Hogetsu, 2006).
Cross species amplification has been shown to work in some taxa, but
comparisons across different species must be done cautiously due to
issues with homoplasy and ascertainment bias (Bailey et al., 2015;
Crawford et al., 1998; Estoup, Jarne, & Cornuet, 2002; Grimaldi &
Crouau-Roy, 1997; Li & Kimmel, 2013).
Next generation sequencing technology has allowed much more rapid
identification of microsatellite loci in non-model organisms (Duan, Li,
Sun, Wang, & Zhu, 2014; Griffiths et al., 2016; M. P. Miller, Knaus,
Mullins, & Haig, 2013; Silva, Martins, Gouvea, Pessoa-Filho, &
Ferreira, 2013): tandem repeat regions can be identified at a genomic
scale, and thousands of putative microsatellite loci can be sequenced
simultaneously, in contrast to traditional cloning based methods (Glenn
& Schable, 2005). In addition to the cost reduction of
microsatellite isolation, some of the issues known to occur when
genotyping microsatellites via capillary electrophoresis (CE hereafter)
can be alleviated using next generation sequencing technologies (Vartia
et al., 2016). For example, fragment size analysis via CE has been known
to provide (albeit sometimes predictably) shifted sizes when samples are
run on different machines (Morin, Manaster, Mesnick, & Holland, 2009).
Access to the raw sequences from next generation sequencing would allow
precise sizing of alleles (Darby, Erickson, Hervey, & Ellis-Felege,
2016).
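As a hedged illustration of sequence-based allele sizing, the Python
sketch below counts the repeat units in the longest uninterrupted run of
a motif within a read. The function name, the motif, and both toy reads
are hypothetical and serve only to show the idea; real loci would also
require primer and flank trimming, handling of compound or interrupted
repeats, and quality filtering.

```python
import re


def repeat_units(read, motif):
    """Number of motif copies in the longest uninterrupted run in `read`.

    Returns 0 when the motif does not occur. This is a simplification:
    compound or interrupted repeats are not modelled.
    """
    pattern = re.compile(f"(?:{re.escape(motif.upper())})+")
    runs = [m.group(0) for m in pattern.finditer(read.upper())]
    if not runs:
        return 0
    return max(len(run) for run in runs) // len(motif)


# Two invented reads of identical length (36 bp) that would be
# indistinguishable by fragment size alone.
read_a = "GGATCC" + "CA" * 12 + "TTGACC"    # 12 repeat units
read_b = "GGATCCTT" + "CA" * 11 + "TTGACC"  # 11 units plus a 2 bp flank insertion

for name, read in (("read_a", read_a), ("read_b", read_b)):
    print(name, len(read), "bp,", repeat_units(read, "CA"), "CA repeats")
```

Because both toy reads have the same total length, a fragment size
analysis would score them as one allele, while the sequences show that
they differ in repeat number.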
A number of studies have evaluated how to transform these sequence based
microsatellite reads into genotypes comparable to those recovered from
CE (Barbian et al., 2018; Darby et al., 2016; De Barba et al., 2017;
Jónsson et al., 2013; Pimentel et al., 2018; Šarhanová, Pfanzelt,
Brandt, Himmelbach, & Blattner, 2018; Vartia et al., 2016; Zhan et al.,
2017). Each of these genotyping by sequencing (GBS hereafter) studies
has evaluated some aspects of the biases introduced when comparing
genotypes from high-throughput sequencing platforms with those from
fragment size analysis. For instance, GBS studies have recovered
additional alleles because alleles are reconstructed from DNA sequences
rather than from fragment sizes measured by CE. Other commonly
addressed issues include stutter, PCR artifacts, and size homoplasy
(Barbian et al., 2018; De Barba et al., 2017).
Although direct comparison of high-throughput sequencing based
microsatellite genotypes with those obtained by fragment size analysis
on capillary sequencers is challenging, the ability to generate
comparable datasets is paramount in order to build on previous research
and to inform larger, potentially landscape-based conservation plans.
Despite the wide range of studies already published on genotyping using
high throughput sequencing, no study has specifically evaluated the
degree of variation in genotypes obtained from museum specimen sourced
DNA. GBS studies have estimated the amount of allelic dropout in fecal
samples from chimpanzees (Barbian et al., 2018) and bears (De Barba et
al., 2017), as well as in tissue (Vartia et al., 2016). Here we use a
high throughput sequencing method to evaluate the amount of variation
across PCR replicates of previously characterized microsatellites
amplified from museum specimen DNA extracts. We analyzed
three types of datasets: a dataset containing individual PCR replicates,
a pooled dataset where the individual replicates were mixed together
prior to library preparation, and a bioinformatically pooled dataset
where the replicates were combined via bash scripting. The rates of
allelic dropout generated here will serve as the first such estimates
for high throughput sequencing of museum specimens and will inform best
practices for subsequent studies on museum derived specimens.
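The relationship between individual replicates, bioinformatic pooling,
and allelic dropout can be sketched as follows. This Python example is
illustrative only and does not reproduce the bash-based pipeline
mentioned above: the read-share threshold of 0.2, the use of the pooled
call as the reference genotype, the function names, and the toy read
counts are all assumptions made for the sake of the example.

```python
from collections import Counter


def call_genotype(read_counts, min_fraction=0.2):
    """Call every allele whose share of reads is >= min_fraction.

    `read_counts` maps allele (here, length in bp) to read count.
    The 0.2 threshold is a hypothetical choice, not a recommended value.
    """
    total = sum(read_counts.values())
    if total == 0:
        return frozenset()
    return frozenset(a for a, n in read_counts.items() if n / total >= min_fraction)


def pool_replicates(replicates):
    """Bioinformatic pooling: sum read counts per allele across replicates."""
    pooled = Counter()
    for counts in replicates:
        pooled.update(counts)
    return pooled


def dropout_rate(replicates, reference, min_fraction=0.2):
    """Share of scoreable replicates that show only one of the two
    alleles of a heterozygous reference genotype."""
    if len(reference) != 2:
        raise ValueError("dropout is only defined here for heterozygotes")
    calls = [call_genotype(r, min_fraction) for r in replicates]
    scored = [c for c in calls if c]  # ignore replicates with no reads
    if not scored:
        return None
    dropped = sum(1 for c in scored if len(c & reference) == 1)
    return dropped / len(scored)


# Invented read counts for four PCR replicates at a single locus.
replicates = [
    {150: 40, 154: 35},
    {150: 60, 154: 2},   # allele 154 falls below the threshold
    {150: 30, 154: 28},
    {154: 50},           # allele 150 is absent
]
pooled = pool_replicates(replicates)
reference = call_genotype(pooled)  # assumption: pooled call as reference
rate = dropout_rate(replicates, reference)
print("pooled counts:", dict(pooled))
print("reference genotype:", sorted(reference))
print("allelic dropout rate:", rate)
```

In this toy case the pooled counts recover both alleles even though two
of the four replicates each lose one, which is the kind of contrast
between individual and pooled datasets that the comparison is designed
to measure.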