Introduction:
Natural history repositories represent invaluable collections of specimens for scientific use across diverse fields (Blagoderov, Kitching, Livermore, Simonsen, & Smith, 2012; Lane, 1996; Lister & Group, 2011; S. L. Williams, 1999). Many of these specimens represent populations of plants and animals that no longer exist due to land use change and human alterations of landscapes over the past century (Smith et al., 2013). Additionally, museum specimens often represent the few or only representatives of endangered or rare species, and provide important vouchers for comparison with modern samples, as well as genetic resources for species which may be logistically difficult or impossible to sample in wild habitats (W. Miller et al., 2009; White, Mitchell, & Austin, 2018). As such, usage of museum specimens for modern research incorporating DNA analysis is increasing. In turn, destructive sampling requests are increasing, many of which propose molecular sequencing from specimens as a justification for consumption of source material.
The degraded DNA associated with museum specimens is known to require extra stringency to combat contamination by exogenous DNA sequences (Paabo et al., 2004; Rizzi, Lari, Gigli, De Bellis, & Caramelli, 2012), and the use of PCR-based methods has revealed issues with nuclear copies of mitochondrial DNA that confound degraded or ancient DNA mitochondrial sequence results (den Tex, Maldonado, Thorington, & Leonard, 2010). The extracted DNA in each sample is often contaminated by exogenous sources (humans, bacteria, pests), and the endogenous DNA can be highly fragmented (Campana et al., 2012; Hawkins, Hofman, et al., 2016; McDonough, Parker, Rotzel McInerney, Campana, & Maldonado, 2018). Studies that reliably sequence DNA from museum specimens follow stringent protocols to compensate for the low quantity and highly fragmented nature of museum specimen extracts. Specimens must be handled with additional precautions to prevent cross-contamination of samples, and should be processed in lab spaces appropriate to the material. Downstream of wet lab procedures, additional bioinformatic steps should be taken to ensure that the resulting genetic sequence data represent the target taxa. Truly ancient samples (derived from archaeological samples, permafrost specimens, coprolites, sediments, mummies, and others) have been shown to exhibit patterns of degradation associated with nucleotide misincorporation, namely cytosine to uracil deamination, and these characteristic patterns can be tested for to authenticate the recovered sequences (Hofreiter, Serre, Poinar, Kuch, & Paabo, 2001; Jónsson, Ginolhac, Schubert, Johnson, & Orlando, 2013). Patterns of degradation are only starting to be understood, and vary depending on the type of sample being processed (Shapiro, 2012; Weiß et al., 2016), with museum specimens lacking the characteristic cytosine to uracil deamination (McDonough et al., 2018).
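The deamination signal described above can be illustrated with a minimal sketch. Because uracil is read as thymine during amplification and sequencing, cytosine deamination appears in the data as an excess of C-to-T mismatches, typically concentrated near the 5' ends of reads. The function below is illustrative only and is not the method of any of the cited studies: it assumes reads have already been aligned to a reference without gaps and are supplied as hypothetical (reference, read) string pairs in 5'-to-3' orientation. The name `c_to_t_by_position` and the `max_pos` parameter are our own.

```python
from collections import defaultdict


def c_to_t_by_position(pairs, max_pos=10):
    """Return the C-to-T mismatch frequency at each 5' read position.

    pairs: iterable of (reference, read) strings of equal length,
           gap-free and in 5'->3' read orientation (an assumption of
           this sketch; real alignments contain indels and soft clips).
    max_pos: number of positions from the 5' end to inspect.
    """
    ref_c = defaultdict(int)   # reference cytosines seen per position
    c_to_t = defaultdict(int)  # of those, how many were read as thymine
    for ref, read in pairs:
        for pos, (r, q) in enumerate(zip(ref.upper(), read.upper())):
            if pos >= max_pos:
                break
            if r == "C":
                ref_c[pos] += 1
                if q == "T":
                    c_to_t[pos] += 1
    # Only positions with at least one reference cytosine are reported.
    return {pos: c_to_t[pos] / ref_c[pos] for pos in sorted(ref_c)}


example_pairs = [("CACG", "TACG"), ("CACG", "CACG"), ("ACCG", "ACTG")]
profile = c_to_t_by_position(example_pairs)
```

A frequency that is elevated at the first positions and decays toward the read interior would be consistent with the ancient DNA damage pattern, whereas a flat profile would be consistent with the absence of that pattern reported for museum specimens.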
A study of mitochondrial genome enrichment from museum specimens (Hawkins, Hofman, et al., 2016) found that sample type was more predictive of amplification success than age. Another study also concluded that success rates as well as endogenous DNA content varied widely depending on the type of sample consumed (McDonough et al., 2018), and Campana et al. (2012) found that recovery of longer mitochondrial (D-loop) PCR products did not correlate with the success of nuclear DNA amplification. When granting destructive sampling permissions, institutions often set individual policies on which types of samples are provided to approved research projects. As such, the most desired sample types may not be approved for consumption in DNA extraction.
Short Tandem Repeats (STRs), also commonly referred to as microsatellite loci, have been useful markers for numerous applications, such as forensics and cancer diagnosis, and have been widely implemented in conservation genetics to evaluate genetic diversity and population structure in organisms ranging from bacteria to plants and animals (e.g. Bilska & Szczecińska, 2016; Thatte, Joshi, Vaidyanathan, Landguth, & Ramakrishnan, 2018). Historically, microsatellites were isolated from a specific species of interest for use in population-level analyses, a process that required time and funding before any analysis of the taxa of interest could begin (Fisher, Gardner, & Richardson, 1996; Glenn & Schable, 2005; Lian, Wadud, Geng, Shimatani, & Hogetsu, 2006). Cross-species amplification has been shown to work in some taxa, but comparisons across different species must be made cautiously due to issues with homoplasy and ascertainment bias (Bailey et al., 2015; Crawford et al., 1998; Estoup, Jarne, & Cornuet, 2002; Grimaldi & Crouau-Roy, 1997; Li & Kimmel, 2013).
Next generation sequencing technology has allowed much more rapid identification of microsatellite loci in non-model organisms (Duan, Li, Sun, Wang, & Zhu, 2014; Griffiths et al., 2016; M. P. Miller, Knaus, Mullins, & Haig, 2013; Silva, Martins, Gouvea, Pessoa-Filho, & Ferreira, 2013) by enabling tandem repeat regions to be identified at a genomic scale and thousands of putative microsatellite loci to be sequenced simultaneously, in contrast to traditional cloning-based methods (Glenn & Schable, 2005). In addition to reducing the cost of microsatellite isolation, next generation sequencing technologies can alleviate some of the issues known to occur when genotyping microsatellites via capillary electrophoresis (CE hereafter) (Vartia et al., 2016). For example, fragment size analysis via CE is known to yield shifted (albeit sometimes predictably shifted) allele sizes when samples are run on different machines (Morin, Manaster, Mesnick, & Holland, 2009). Access to the raw sequences from next generation sequencing would allow precise sizing of alleles (Darby, Erickson, Hervey, & Ellis-Felege, 2016).
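To illustrate why access to raw sequences permits precise allele sizing, the following minimal sketch counts repeat units directly in each read rather than inferring length from fragment mobility. It is an illustration under simplifying assumptions, not the procedure of the cited studies: the locus is assumed to contain a single uninterrupted repeat of a known motif, reads are assumed to be already assigned to the locus, and the names `longest_repeat_run` and `repeat_count_profile` are hypothetical.

```python
import re
from collections import Counter


def longest_repeat_run(read, motif):
    """Return the number of motif copies in the longest tandem run in a read."""
    pattern = re.compile(f"(?:{re.escape(motif.upper())})+")
    runs = [m.group(0) for m in pattern.finditer(read.upper())]
    if not runs:
        return 0
    # Each run is a whole number of motif copies, so integer division is exact.
    return max(len(run) for run in runs) // len(motif)


def repeat_count_profile(reads, motif):
    """Tally how many reads support each repeat count at one locus."""
    return Counter(longest_repeat_run(read, motif) for read in reads)


# Hypothetical reads from one heterozygous (CA)n locus with flanking sequence.
reads = ["ACGT" + "CA" * 5 + "GGT"] * 3 + ["ACGT" + "CA" * 7 + "GGT"] * 2
profile = repeat_count_profile(reads, "CA")
```

Because the allele is expressed as an exact repeat count, the result does not depend on the instrument used, in contrast to the machine-dependent size shifts described for CE.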
A number of studies have evaluated how to transform these sequence-based microsatellite reads into genotypes comparable to those recovered from capillary electrophoresis (Barbian et al., 2018; Darby et al., 2016; De Barba et al., 2017; Jónsson et al., 2013; Pimentel et al., 2018; Šarhanová, Pfanzelt, Brandt, Himmelbach, & Blattner, 2018; Vartia et al., 2016; Zhan et al., 2017). Each of these genotyping by synthesis (GBS hereafter) studies has evaluated some aspects of the biases introduced when genotypes are derived from high-throughput sequencing reads rather than from fragment size analysis. For instance, GBS studies have recovered additional alleles because alleles are reconstructed from DNA sequences rather than from CE fragment sizes. Other commonly addressed issues include stutter, PCR artifacts, and size homoplasy (Barbian et al., 2018; De Barba et al., 2017). Although challenges exist for direct comparison of high-throughput sequencing-based microsatellite genotypes with those obtained from capillary sequencers via fragment size analysis, the ability to generate comparable datasets is paramount in order to build on previous research and to inform larger, potentially landscape-based conservation plans.
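The recovery of additional alleles and the detection of size homoplasy can be sketched as follows: two alleles of identical length are indistinguishable by fragment size, but are separated once reads are grouped by sequence. This is a minimal illustration rather than the approach of any cited study. The `min_reads` threshold is a hypothetical and deliberately crude guard against sequencing errors and PCR artifacts, and the function names are our own.

```python
from collections import Counter, defaultdict


def alleles_by_length(reads, min_reads=2):
    """Group distinct read sequences by length, keeping well-supported ones.

    min_reads is an assumed cutoff; sequences seen fewer times are
    treated as likely errors and ignored.
    """
    seq_counts = Counter(read.upper() for read in reads)
    by_length = defaultdict(list)
    for seq, n in seq_counts.items():
        if n >= min_reads:
            by_length[len(seq)].append(seq)
    return {length: sorted(seqs) for length, seqs in by_length.items()}


def homoplasious_lengths(reads, min_reads=2):
    """Return lengths at which more than one distinct sequence is observed."""
    grouped = alleles_by_length(reads, min_reads)
    return {length: seqs for length, seqs in grouped.items() if len(seqs) > 1}


# Hypothetical reads trimmed to the repeat region of a single locus.
reads = (["CACACACA"] * 3 + ["CACATACA"] * 2
         + ["CACACACACA"] * 4 + ["CACACATT"])
hidden = homoplasious_lengths(reads)
```

In this toy example, fragment size analysis would report two alleles (8 bp and 10 bp), whereas the sequence-based grouping distinguishes three.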
Despite the wide range of studies already published on genotyping using high throughput sequencing, no studies have specifically evaluated the degree of variation arising from DNA sourced from museum specimens. GBS studies have estimated the amount of allelic dropout in fecal samples from chimpanzees (Barbian et al., 2018) and bears (De Barba et al., 2017), as well as in tissue (Vartia et al., 2016). Here we explore a high throughput sequencing method to evaluate the amount of variation found within DNA extracts from museum specimens for previously characterized microsatellites across various PCR replicates. We analyzed three types of datasets: a dataset containing individual PCR replicates, a pooled dataset where the individual replicates were mixed together prior to library preparation, and a bioinformatically pooled dataset where the replicates were combined via bash scripting. The rates of allelic dropout reported here are the first for high throughput sequencing of museum specimens and will inform best practices for subsequent studies on museum-derived specimens.
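As a rough illustration of how allelic dropout might be quantified across PCR replicates, the sketch below pools per-replicate allele read counts in silico, calls a consensus genotype from the pool, and reports the fraction of replicates in which a heterozygous consensus appears homozygous. This is a sketch under stated assumptions and not the pipeline used in this study: the pooling here is done in Python rather than bash, and the genotype-calling rule, the `min_reads` and `min_ratio` thresholds, and all function names are hypothetical.

```python
from collections import Counter


def call_genotype(allele_counts, min_reads=10, min_ratio=0.3):
    """Call a diploid genotype from allele read counts at one locus.

    Assumed rule: a second allele is called only if its read count is
    at least min_ratio of the most abundant allele. Returns None when
    total coverage is below min_reads.
    """
    counts = Counter(allele_counts)
    if sum(counts.values()) < min_reads:
        return None
    ranked = counts.most_common(2)
    top_allele, top_n = ranked[0]
    if len(ranked) == 2 and ranked[1][1] >= min_ratio * top_n:
        return tuple(sorted((top_allele, ranked[1][0])))
    return (top_allele, top_allele)


def pool_replicates(replicates):
    """Sum allele read counts across replicates (in silico pooling)."""
    pooled = Counter()
    for counts in replicates:
        pooled.update(counts)
    return pooled


def allelic_dropout_rate(replicates, **thresholds):
    """Fraction of callable replicates that miss one consensus allele.

    Returns None if the pooled consensus is uncallable or homozygous,
    or if no individual replicate has enough reads to be called.
    """
    consensus = call_genotype(pool_replicates(replicates), **thresholds)
    if consensus is None or consensus[0] == consensus[1]:
        return None
    called = [g for g in (call_genotype(c, **thresholds) for c in replicates)
              if g is not None]
    if not called:
        return None
    dropouts = sum(1 for g in called if g[0] == g[1] and g[0] in consensus)
    return dropouts / len(called)


# Hypothetical read counts per allele size (bp) for four PCR replicates.
replicates = [Counter({150: 40, 154: 35}), Counter({150: 50, 154: 2}),
              Counter({150: 30, 154: 28}), Counter({150: 3})]
rate = allelic_dropout_rate(replicates)
```

In this toy example the fourth replicate is excluded for low coverage, and one of the three remaining replicates shows only the 150 bp allele, so the estimated dropout rate is one in three. Using the pooled reads as the reference genotype is a design choice of the sketch; an independent reference genotype, where available, would avoid circularity.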