Jie Zhang

and 7 more

Reduced-representation sequencing (RRS) techniques have revolutionized ecological and evolutionary genomics. Accurately establishing orthologs is a critical challenge for RRS, especially when a reference genome is absent. The proportion of heterozygous sites shared across samples offers an alternative criterion for filtering paralogs, because divergent lineages should be less likely to share heterozygosity. In PYRAD/IPYRAD, the prevailing pipeline for variant calling of RRS data, maxSharedH is an often overlooked parameter that governs the detection and filtering of paralogs according to shared heterozygosity. Using empirical GBS data from two primroses (Primula alpicola Stapf and Primula florindae Ward) and their putative hybrids, we explore the impact of maxSharedH on paralog filtering and on downstream analyses. Our study sheds light on both the validity and the risk of filtering paralogs with maxSharedH, and on its significant effects on outlier detection, population assignment, and demographic modelling, emphasizing the importance of attention to detail in bioinformatic processing. The population assignment and demographic modelling results corroborate each other and suggest that maxSharedH = 0.10 has a potentially excessive and asymmetrical effect, removing truly shared heterozygous sites as paralogs. These results indicate that the hybrid-origin hypotheses for the putative hybrids supported by the results with maxSharedH = 0.25 and 0.50 are more credible. In conclusion, we first reveal the critical hazard of filtering paralogs by shared heterozygosity, and we therefore propose using specific protocols, rather than maxSharedH, to filter potential paralogs in closely related lineages.
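
The shared-heterozygosity criterion described in this abstract can be illustrated with a minimal sketch. It is not the PYRAD/IPYRAD implementation: the function names, the use of IUPAC ambiguity codes to mark heterozygous calls, and the choice of the number of samples at the locus as the denominator are all assumptions made for illustration.

```python
from typing import List

# Assumption: heterozygous calls are encoded as two-base IUPAC ambiguity codes.
HET_CODES = set("RYSWKM")


def shared_het_fraction(locus: List[str]) -> float:
    """Return the largest per-site fraction of samples that are heterozygous.

    `locus` is a list of aligned, equal-length sequences, one per sample.
    """
    if not locus:
        return 0.0
    n_samples = len(locus)
    best = 0.0
    for i in range(len(locus[0])):
        # Count samples carrying a heterozygous call at alignment column i.
        hets = sum(seq[i].upper() in HET_CODES for seq in locus)
        best = max(best, hets / n_samples)
    return best


def passes_filter(locus: List[str], max_shared_h: float) -> bool:
    """Keep a locus only if no site exceeds the shared-heterozygosity cutoff."""
    return shared_het_fraction(locus) <= max_shared_h


# Hypothetical locus: two of four samples are heterozygous (R = A/G) at site 2.
example_locus = ["ACGT", "ARGT", "ARGT", "ACGT"]
for cutoff in (0.10, 0.25, 0.50):
    print(cutoff, passes_filter(example_locus, cutoff))
```

In this toy case the locus is retained only at the most permissive cutoff. That mirrors the concern raised above: when closely related lineages truly share heterozygous sites, a strict cutoff may discard genuine orthologs along with paralogs.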

Yi Zou

and 8 more

Jie Zhang

and 6 more

Reduced-representation sequencing (RRS) techniques have revolutionized ecological and evolutionary genomics, particularly benefiting species without a reference genome. However, precisely establishing homologous loci from RRS data remains a great challenge, and it strongly affects the accuracy of downstream analyses and the reliability of biological inferences. maxSH is an overlooked paralog-detection parameter in PYRAD/IPYRAD, a prevailing pipeline for genotyping RADseq and GBS data. Using GBS data from two primroses (Primula alpicola Stapf and P. florindae Ward) and their putative hybrids as an empirical case, we explore the efficiency of maxSH in filtering paralogs and its impact on downstream analyses. We also assess whether the putative hybrids truly originated through hybrid speciation. Our study sheds light on the efficiency of maxSH in filtering paralogs and on the significant effects of maxSH, together with clustering threshold and missing data, on outlier detection, population assignment, and demographic modelling, emphasizing the importance of careful bioinformatic processing. Although the putative hybrids exhibit a genetic mixture of P. alpicola and P. florindae in most STRUCTURE and PCA results, we cannot draw a clear conclusion about their origin, because the nine chosen datasets yield conflicting demographic scenarios, mainly as a result of altering the maxSH value. Nevertheless, the gene flow patterns of most optimal models across multiple maxSH values collectively indicate incomplete reproductive isolation between the putative hybrids and the two primroses, as well as indirect introgression between P. alpicola and P. florindae.