Haplotype inference to illustrate genomic divergence
The portion of the genome unaffected by gene flow increases as speciation proceeds (Feder, Egan, & Nosil, 2012; Feder, Flaxman, Egan, Comeault, & Nosil, 2013; Nadeau et al., 2013; Wu, 2001; Wu & Ting, 2004). Because subspecies lie somewhere along the speciation continuum, we asked how differentiation is distributed across the genome. This pattern can be visualized by inferring haplotypes at individual loci and comparing the resulting haplotype networks. We inferred haplotypes using the method developed by He et al. (2019), which uses the SNP linkage information contained in each short-read pair to infer the haplotypes and the frequency of each haplotype in the population with an expectation-maximization algorithm (Bilmes, 1998; Dempster, Laird, & Rubin, 1977). Where two adjacent SNPs were not covered by any read pair, we broke the gene into segments, defining the midpoint between the two SNPs as the breakpoint between consecutive segments. The accuracy of this method in inferring haplotypes has been validated by Sanger sequencing of individuals (He et al., 2019). We selected eight populations representing different subspecies and regions for haplotype inference: two eucalyptifolia (CA and DW), two australasica (AK and BS), and four marina (BB, LS, TN, and SY) populations. Genes were split into 454 linked segments, and haplotypes were inferred for each segment (Table S2). Before constructing haplotype networks, we filtered out segments shorter than 100 bp or with missing data. For each of the 231 retained segments, we computed a haplotype network using the NETWORK software (Polzin & Daneshmand, 2003). For some segments, the sequences were searched with BLAST against the National Center for Biotechnology Information (NCBI) database for functional annotation.
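The segmentation and length-filtering steps described above can be illustrated with a minimal sketch. This is not the implementation of He et al. (2019). The function names (`split_into_segments`, `filter_by_length`), the half-open coordinate convention, the integer rounding of the midpoint, and the example coordinates are all assumptions made for illustration only.

```python
from typing import List, Set, Tuple

Segment = Tuple[int, int]  # assumed half-open interval [start, end)


def split_into_segments(
    gene_start: int,
    gene_end: int,
    snp_positions: List[int],
    linked: Set[Tuple[int, int]],
) -> List[Segment]:
    """Break a gene wherever two adjacent SNPs share no read pair.

    `linked` holds (left, right) position pairs of adjacent SNPs that
    are covered together by at least one read pair (hypothetical input).
    The breakpoint is the midpoint between the two unlinked SNPs,
    rounded down to an integer (an assumption of this sketch).
    """
    snps = sorted(snp_positions)
    breakpoints = []
    for left, right in zip(snps, snps[1:]):
        if (left, right) not in linked:
            breakpoints.append((left + right) // 2)
    edges = [gene_start] + breakpoints + [gene_end]
    return list(zip(edges[:-1], edges[1:]))


def filter_by_length(segments: List[Segment], min_length: int = 100) -> List[Segment]:
    """Keep segments of at least `min_length` bp (the text uses 100 bp)."""
    return [(start, end) for start, end in segments if end - start >= min_length]


# Made-up example: five SNPs, of which only two adjacent pairs are linked.
segments = split_into_segments(
    gene_start=0,
    gene_end=520,
    snp_positions=[100, 150, 400, 450, 480],
    linked={(100, 150), (400, 450)},
)
retained = filter_by_length(segments)
print(segments)
print(retained)
```

In this made-up example, breakpoints fall between the unlinked SNP pairs (150, 400) and (450, 480), and the final segment is dropped because it is shorter than 100 bp. The filter for missing data mentioned in the text is not modeled here.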