Discussion

Ancient DNA from lake sediments constitutes a valuable resource to investigate the response of populations to past environmental changes. Previous studies using metabarcoding or shotgun sequencing have not yet explored the full potential of this resource. Here, we applied shotgun sequencing and hybridization capture using PCR-generated baits of theLarix chloroplast genome, to retrieve nearly complete chloroplast genomes and study past changes in the population history of larches in northern Siberia.

Taxonomic classification - majority of shotgun data could not be assigned

In the shotgun dataset, only 0.3% of quality filtered reads could be classified against the nt database. Considering that fewer than 15,000 organisms of the 10 to 15 million of eukaryotic and perhaps trillions of bacterial and archaeal species have currently been completely or partially sequenced (Lewin et al., 2018), this is a comparatively large fraction. However, other studies have reported higher rates of assigned sequences. Slon et al. (2017), assigned more than 10% of their shotgun reads that derived from cave sediments and Ahmed et al. (2018) reported 16% of assigned sequences from ancient lake sediments and 70% of hits over multiple taxa. On the other hand, Pedersen et al. (2016) assigned 0.3% of the quality filtered reads obtained from ancient lake sediments, and Parducci et al. (2019) reported a read assignment of 3.4% of sed aDNA shotgun sequencing. Different sample material, wet lab processing and sequence databases used for assignment make it difficult to properly compare the assignment rate. In addition to these differences, the applied bioinformatic approach has an impact on the results (Harbert, 2018). Here we used kraken2 (Wood, Lu, & Langmead, 2019), a new version of kraken, which is a particularly conservative tool compared to others, reporting less false-positive but also less true-positive hits than others tools, even with default values (Harbert, 2018). We used it with a high confidence threshold as we found it gives the best results in terms of vegetation composition based on our knowledge of the vegetation history (Epp et al., 2018), but with the consequence of very low overall assignment rates. Indeed, when we use the default values of kraken2, we could assign 10–16% of the reads. However, more lenient classification causes a reinforcement of the database bias: few deeply sequenced taxa are more likely to be assigned than the majority of shallowly or fragmentarily sequenced taxa (Parducci et al., 2017).

Capture success - Viridiplantae reads increased by orders of magnitude

Hybridization capture resulted in an increase of taxonomically classified reads by orders of magnitude – especially with respect to the ratio of unclassified reads but also in total. These results show for the first time that DNA capturing of whole chloroplast genomes is effective even for DNA libraries that constitute a high diversity and low on-target rates such as DNA from ancient lake sediments. Along withLarix many other plant taxa were enriched, as is shown by the high number of Viridiplantae classified reads (Figure 1). This can be explained by: (1) chloroplast genomes of land plants and green algae sharing essentially the same set of protein-coding genes and ribosomal RNAs and differing mainly in the presence/absence of introns and repeats (B. R. Green, 2011), and (2) DNA libraries built from lake sediments contain a very high sequence divergence, higher than those from pooled individuals, which in the past were used to measure the capability of capturing highly divergent reads (Paijmans, Fickel, Courtiol, Hofreiter, & Förster, 2016; Peñalba et al., 2014). We hypothesize that the captured reads classified as other plants share, to some extent, sequence similarities with conserved regions like genes or ribosomal RNA.
The comparison of captured Larix- classified reads versus the full capture samples gives a first hint of this: if we align the complete dataset and the subset with the same relaxed parameters to theLarix chloroplast genome, the coverages are unequally distributed between the different functional regions. The conserved protein coding genes, especially genes coding for the photosystem complex, transfer RNA and ribosomal RNA, have coverages significantly higher in the complete dataset than in the Larix- classified dataset. In the case of ribosomal RNA the difference is even orders of magnitude higher (Figure 2). This shows that sequences from species other than Larix can align to these especially conserved regions. Several studies already demonstrate enrichment of diverged sequences using hybridization capture. Peñalba et al. (2014) show the applicability of hybridization capture to analyse taxa from 13 families with up to 27% sequence divergence, by targeting conserved gene regions. On ancient material, Paijmans et al. (2016) captured mitochondrial sequences with up to 20 million years divergence time with up to 25% sequence dissimilarity. With a modified relaxed double capture method, Li et al. (2013) demonstrate a successful enrichment of nuclear genes across distantly related species with up to 300 million years divergence time (with up to 40% sequence dissimilarity). All these studies demonstrate that hybridization capture can be used to study a wide range of phylogenetically divergent taxa, although they all note a decrease in coverage with increasing evolutionary distance.

Near-complete retrieval of Larixchloroplast genomes

The number of reads classified as the genus Larix, using the nt database increased 800 to 1600-fold from shotgun to hybridization capture dataset. However, in all samples there was a high level of PCR duplicates due to the high number of amplifications after capturing and possibly also due to oversequencing of reads. Future projects could reduce the number of PCR cycles and increase the number of samples pooled together to reduce duplicates.
The deduplication of reads prior to mapping against the reference genome reduced the enrichment to 6–16-fold. This is in the range of results from enrichment studies performed on bone material (Avila-Arcos 2015, Carpenter 2013), although higher enrichment ranges have also been reported (Mohandesan 2017). Here, most likely the sample material plays a role, as the lake sediments contained a low content of the target larch chloroplast DNA together with a high sequence diversity. Nevertheless, the target enrichment rate could potentially be increased, for example, by capturing at higher temperatures or using a touch-down approach or by using RNA instead of DNA for baits (Carpenter et al., 2013; Paijmans et al., 2016; Peñalba et al., 2014).
In the capture dataset, near-complete chloroplast sequences could be retrieved from the three oldest samples. But even the sample with the highest coverage, i.e. the oldest from 6700 cal-BP, contained gaps in the alignment (91% covered at 1-fold coverage). When comparing the coverage of different functional groups of the genome (Figure 2), we see the lowest coverage for ribosomal RNA, followed by protein coding genes. This is due to the bioinformatics approach of extracting reads as classified to the lowest common ancestor in kraken2. As these coding regions are highly conserved across taxa (B. R. Green, 2011), the short reads can also be attributed to other organisms and are consequently classified to a higher taxonomic rank than Larix.Accordingly, when assigning the full, not taxonomically binned samples in the same manner to functional groups, this trend reverses.
Analysis of DNA damage patterns revealed C to T substitution rates typical for ancient DNA (Supplementary Figures S4, S5). Typical for the preparation of single-stranded libraries, these substitutions could be observed both at the 3’ and the 5’ end of the molecules (Gansauge & Meyer, 2013). C to T substitution rates increased with sample age, in line with previous observation (Pedersen et al., 2016; Sawyer, Krause, Guschanski, Savolainen, & Pääbo, 2012). Read lengths ranged from 40 bp to 340 bp, showing the short fragment length typical for ancient DNA (Green et al., 2008).

L. sibirica variants present over time

When comparing the ancient reads to chloroplast reference genomes fromL. gmelinii and L. sibirica , the great majority of reads carry the L. gmelinii variants with a low frequency of L. sibirica variants in all four samples. In contrast, the analysis of one mitochondrial marker derived from the same core by Epp et al . (2018), showed a mixture of both species, with relatively high rates of L. sibirica – except for the most recent sample, which showed clear dominance of the L. gmelinii mitotype – pointing to a co-occurrence of both species across most of the sediment core. In the genus Larix, chloroplasts are predominantly inherited paternally (Szmidt, Aldén, & Hällgren, 1987) whereas mitochondrial DNA is inherited maternally (DeVerno, Charest, & Bonen, 1993), a phenomenon which has been reported for almost all members of the conifers (Neale & Wheeler, 2019). This bi-parental inheritance results in different rates of gene flow and subsequently asymmetric introgression patterns (Du, Petit, & Liu, 2009; Petit et al., 2005). Simulations (Currat, Ruedi, Petit, & Excoffier, 2008) and molecular studies on a range of Pinaceae (Du et al., 2011, 2009; Godbout, Yeh, & Bousquet, 2012) showed that the seed-transmitted mitochondria which experience little gene flow introgress more rapidly than the pollen dispersed chloroplasts which experience high gene flow. A second finding of these studies is that introgression occurs asymmetrically from the resident species into the invading species. An expected result of introgression is therefore a population carrying mitotypes of the former local species and chlorotypes of the invaded species.
In the case of the population history of L. sibirica and L. gmelinii in their contact zone, Semerikov et al. (2013) found evidence for the asymmetric introgression of L. sibiricamitotypes in a population carrying only L. gmeliniichlorotypes, confirming the natural invasion of L. gmelinii into the range of L. sibirica. Here we corroborate these findings with a distinct discrepancy between relatively high rates of L. sibirica mitotypes as reported before (Epp et al., 2018) and low rates of L. sibirica in the chloroplast reads found in this study.
This points to an invasion of L. gmelinii in a former population of L. sibirica prior to the date of our oldest sample (6700 cal-BP). Further evidence in support of this scenario is found in the results from a lake sediment core 250 kilometres south-west of the study site (Epp et al., 2018), where samples reaching back to 9300 cal-BP show exclusively L. sibirica mitotypes, before they were gradually replaced by the L. gmelinii mitotype.
Our study shows that by capturing for the complete chloroplast genome, we achieve a high resolution and can detect species specific variants even at low frequencies. Further studies should also include mitochondrial sequences in the target enrichment to collect data from several markers or potentially the complete mitochondrial genome. By combining the two organelle genomes in a hybridization capture experiment it would be possible to study hybridization and introgression events in detail, which would help to deepen our understanding of population dynamics over long time scales.

Larch forest decline over the last 7000 years

When looking at the overall retrieval of Larix reads among the samples, most reads could be recovered from the oldest sample, of around 6700 cal-BP and the least number of reads in the most recent sample. This is in line with the findings of Klemm et al . (2016) and Epp et al . (2018), who describe a general vegetation turnover from larch forest to an open tundra with only sparse Larix stands during the last 7000 years at this site. This gradual change of vegetation in response to late-Holocene climate deterioration has also been observed by various studies (Andreev et al., 2004; MacDonald, Gervais, Snyder, Tarasov, & Borisova, 2000; MacDonald, Kremenetski, & Beilman, 2008; Pisaric, MacDonald, Velichko, & Cwynar, 2001) and is in correspondence with the reconstructed global cooling trend of the middle to late Holocene (Marcott, Shakun, Clark, & Mix, 2013).