Population genomics
Diversity and differentiation. We estimated summary
statistics in 500 kbp sliding, non-overlapping windows along the genome.
We chose this window size to ensure we obtained a large number of
variable sites per window, given the small sample sizes per taxon.
Differentiation and divergence statistics (FST and
DXY) were calculated separately for each species, and
only sites with no missing genotypes within a species were used.
Heterozygosity was measured for all genotyped sites passing filters per
individual. We used custom R scripts to measure observed heterozygosity
for each individual and two measures of genetic differentiation
(FST and DXY) between populations on
either side of the GRV. We used the Reich and colleagues (2009)
estimator of FST because this has been shown to be an
unbiased FST estimator when using low sample sizes
(e.g., two or more chromosomes per population) and high numbers of
genetic markers (Willing, Dreyer, & Van Oosterhout, 2012). We
calculated both FST and DXY to measure
how they covary across the genome, as well as have measures of genetic
differentiation that both do (FST) and do not
(DXY) rely on within population genetic diversity. If
the genomes of differentiating populations are largely under the
influence of linked selection, we would expect a genome-wide negative
relationship between FST and DXY. In
contrast, if differentiating populations exhibit gene flow across most
of their genomes, with restricted gene flow in only small genomic
regions (e.g., genomic islands of differentiation), we would expect a
genome-wide positive relationship between FST and
DXY.
Population genetic structure. We estimated population
genetic structure with an a priori assumption of two populations using
the program STRUCTURE (Pritchard, Stephens, & Donnelly, 2000). Here,
our goal was to assess whether we could identify distinct population
clusters with individuals segregating on either side of the GRV despite
largely different levels of genetic differentiation between populations
in different species (see RESULTS; Table 1). We limited SNPs for each
species to those that were separated by a minimum of 20 kbp to reduce
effects of linkage as well as reduce computation time. This resulted in
42,542 to 49,443 SNPs per species. We initially ran STRUCTURE to infer
the lambda parameter while estimating the likelihood of one population
(k = 1) (Pritchard et al., 2000). We then used the inferred value of
lambda for subsequent analyses, where we performed five replicates of
likelihood estimation for two genetic clusters. We assumed correlated
allele frequencies, an admixture model, and performed analyses for a
burn-in of 100,000 steps and a subsequent 500,000 MCMC iterations.
Phylogenomics. We estimated phylogenies for all the
individuals combined to (1) identify phylogenetic splits between
populations separated by the GRV and (2) analyze trait data (e.g., HWI,
genetic diversity) in the context of phylogenetic independent contrasts
(PICs). Here, we estimated gene trees from non-overlapping 100 kbp
alignments using RAxML v8.2.12 (Stamatakis, 2014) with the GTRCAT model
of sequence evolution. For sites to be included, we required 80% of all
individuals to be genotyped at that site and included only invariant and
biallelic sites. For an alignment’s inclusion, it needed at least 10 kbp
retained sites. We created a consensus tree from the gene trees using
the sumtrees.py script, part of the DendroPy Python package (Sukumaran
& Holder, 2010), for use in calculating PICs. PICs are used to estimate
correlations among traits while accounting for evolutionary history of
the samples (i.e., because sampled lineages are not independent from one
another). The PIC method assumes Brownian motion of trait evolution, and
transforms the sampled data (i.e., at the tips of the phylogeny) into
statistically independent contrasts that may be used in regressions
(Felsenstein, 1985). We estimated PICs in the R package ape (Paradis,
Claude, & Strimmer, 2004). We estimated a species tree using ASTRAL III
v 5.6.3 (Zhang et al., 2018), using the quartet frequencies as a measure
of local support (Sayyari & Mirarab, 2016).
Demography. We used the program MSMC2 v1.1.0 (Schiffels
& Durbin, 2014) to estimate demographic history for each individual.
MSMC uses information about the spatial arrangement of variant sites
within an individual’s two chromosomes using a modified version of the
Pairwise Sequentially Markovian Coalescent (PSMC). We masked regions of
the genome that were not genotyped, as MSMC would otherwise mistake
these regions as runs of homozygosity. Of note, the MSMC method is
accurate in panmictic populations, but the presence of population
structure or changes in connectivity between populations through time
may mimic changes in population sizes (Chikhi et al., 2018; Mazet et
al., 2016). Therefore, some caution should be used when interpreting the
demographic histories. We ran MSMC for each individual allowing up to 20
iterations (default setting), and up to 23 inferred distinct time
segments. To determine the number of time segments to use, we ran
preliminary MSMC runs with the default number of time segments (n = 28),
and decreased that number in subsequent runs until spurious results and
model overfitting were diminished. For example, we merged some recent
time segments to prevent overfitting in the most recent (e.g., past 50
kya) and very old times, as these would sometimes have spurious jumps in
population sizes to unrealistically high values of NE(e.g., many millions). To assess how signal may vary using different
parts of the genome, we performed ten bootstrap replicates for each
individual, bootstrapping 1 Mbp segments of the genome. Here, we chose
to run MSMC for each individual rather than aggregating data from
individuals for several reasons: (1) we have low to moderate sequencing
coverage for each of the samples, and masking low coverage regions
across multiple individuals drastically decreases the amount of the
genome sampled, (2) to reduce the impacts of uneven sample sizes per
population and/or species, and (3) with demographic results for each
individual we can assess consistency across individuals within species
and populations.
The output of MSMC is presented in terms of a species’ generation times
and mutation rate. As such, the observed demographic patterns are
reliant on accurate estimates of both these parameters. In the species
studied here, there are no published estimates of generation times. As
such we used a proxy used previously in the literature, where we use a
conservative generation time of double the age of sexual maturity in
closely-related species (Nadachowska-Brzyska et al., 2015). We estimated
the age of sexual maturity for each species by finding closely-related
species (same genus or closely-related genus) using the Animal Aging and
Longevity Database (Tacutu et al., 2017) (available here:
genomics.senescence.info/species). We discuss caveats associated with
assumed generation times in the RESULTS section.
Because no estimates for rates of molecular evolution exist for any of
the species in this study, we estimated substitution rates for each
species included here. To do this, we extracted coding regions of the
Zebra Finch genome that were genotyped in at least one individual from
each species included in this study using the BEDtools v2.27.1 (Quinlan
& Hall, 2010) intersect command, with the Zebra Finch CDS
regions and the GATK vcf outputs as input for BEDtools. We filtered any
CDS regions that included the same start codon to remove overlapping
alignments. Any null alleles or missing data in alignments were replaced
with Ns and only alignments that included a minimum of 95% of the
coding region were included for further analysis. Any codons including
Ns were removed from the alignments. Lastly, we extracted all four-fold
degenerate sites and concatenated them into a single alignment using the
R packages Biostrings, seqinr, and rphast (Charif & Lobry, 2007;
Hubisz, Pollard, & Siepel, 2010; Pagès et al., 2017). The final
alignment included a total of 496,080 sites. From this alignment, we
estimated a model of sequence evolution using jModelTest (Darriba et
al., 2012) and chose a best-fitting model with the Akaike Information
Criterion (best model = GTR + I + G). Next, we estimated a phylogeny
using maximum likelihood with the program PhyML (Guindon et al., 2010).
We used the branch lengths of the resulting phylogeny to estimate
substitution rates for each taxon. Here, we used split times for each of
the respective genera from a fossil-calibrated phylogenetic tree of
Passeriformes (Oliveros et al., 2019), where we could put the amount of
evolution on each branch in our four-fold degenerate sites phylogeny in
the context of time. All of the rates calculated here ranged between
2.06 × 10-9 and 2.30 × 10-9substitutions / site / year (Table 1) and are similar to other
Passeriformes species such as the Zebra Finch (2.21 ×
10-9 substitutions / site / year) (Nam et al., 2010).
These mutation rate estimates assume that birds in the same genera do
not have greatly differing terminal branch lengths, as this would
erroneously either increase or decrease mutation rates. Regardless,
these rates are similar to those estimated in other Passeriformes
species, and even with some error, we expect these estimates to be
broadly appropriate for general interpretations of demographic
histories.