Nuclear SNPs
Reads from 26 louse genomes were individually mapped with Bowtie2
(Langmead & Salzberg, 2012) to the reference genome described above.
Before mapping, the reference sequence was indexed using Samtools (Li et
al., 2009) and a dictionary file was made with CreateSequenceDictionary
in Picard v.2.0.1 (https://broadinstitute.github.io/picard/). After
mapping with Bowtie2, resulting SAM files were sorted to the BAM files
and indexed using Samtools. Duplicated sequences were removed from
sorted BAM files with Picard v.2.0.1 and quality of mapping was verified
with QualiMap (http://qualimap.bioinfo.cipf.es/). SNPs for
population-level analysis were called using the GATK Genome Analysis
Toolkit following the “Best Practices” guide from the Broad Institute
(Van der Auwera et al., 2013). SNPs were jointly called for all 26P. serrata samples and filtered with QD (quality by depth)
<2.0, FS (Fisher strand test) >60.0, MQ (mapping
quality) <40.0, and MQRankSum (mapping quality rank sum test)
<–12.5. The SNPs from GATK were filtered with minor allele
frequency (MAF) threshold 0.05 in PLINK 1.9
(https://www.cog-genomics.org/plink/1.9/). SNPs in linkage
disequilibrium (LD) were pruned with the squared coefficient of
correlation (r2) 0.2 and missing data threshold 0.2.
Resulting 47,595 variants passed filters and quality control in 26P. serrata samples and used in population structure estimation.
PCA was performed in R package SNPRelate (DOI:
10.18129/B9.bioc.SNPRelate)
and phylogenetic tree based on Maximum likelihood analysis was
reconstructed using SNPhylo pipeline (Lee, Guo, Wang, Kim, & Paterson,
2014) with 1000 bootstraps. Ten independent runs of admixture analysis
were performed to assess the proportion of individual ancestry with
different numbers of hypothetical populations, using the ADMIXTURE
software v.1.3.0 (Alexander et al., 2009). An unsupervised model
approach based on maximum likelihood was applied to the genotype matrix
for 1-10 populations (K=1-10). Most probable estimate of K was
calculated with cross-validation procedure. Summarization and graphical
output of independent runs for each K was obtained with CLUMPAK
(Kopelman et al., 2015). Based on the obtained results samples were
divided into three groups (SW, SE, and hybrids), and their diversity
measures (heterozygosity and pairwise FST) were
estimated in VCFtools (Danecek et al., 2011) and SNPRelate,
respectively..
Legionella polyplacis genomes reconstruction and comparison
The contigs corresponding to Legionella polyplacis were visually
identified using ORF prediction done in the Geneious package (the
prokaryotic gene arrangement could be readily recognized by the density
of predicted ORFs). In most samples, the genome of L. polyplaciswas assembled into a single complete contig. In three assemblies
(DPH41,19JA1_SW, 46MAN_SW) the symbiont genome was highly fragmented
or could not be found, despite the good assembly quality of the louse
genome (these samples were removed from the L. polyplacisanalysis). Complete L. polyplacis genomes were annotated in RAST
(Aziz et al., 2008) and aligned using Mafft algorithm implemented in
Geneious. In three samples (Ne125b_Kot_SW, 99b_Pro_SE, 98c_Pro_SE)
the rRNA region did not assemble correctly and was only represented by
short fragments. To extend these fragments into a full length, we used
the program aTRAM 2.0 (Allen, LaFrance, Folk, Johnson, & Guralnick,
2018). Phylogenetic analysis of the resulting 530,063 bp long matrix was
performed in Phyml (Guindon et al., 2010). The evolutionary model
GTR+I+G was determined for the whole concatenated matrix (considering
the strong phylogenetic signal in the matrix, the model selection is a
purely formalistic step in the presented analysis) by a corrected Akaike
information criterion in jModelTest2 (Darriba et al., 2012) based on the
AIC. A comparison of gene content was done manually using the annotated
alignment. We were taking into consideration the following criteria:
presence of the annotations across all genomes; presence of the
corresponding sequence (in case of missing annotation); distribution of
differences (i.e. does the difference reflect the SE/SW split).