The core genome is significantly correlated with many genome features (Figure 2). Because the wild species have greater genetic diversity than the cultivars, we defined two core genomes: one that includes all of the genomes we examined (denoted Core 1) and one that considers only the cultivars (denoted Core 2). The Spearman rank correlation coefficient (ρ) between Core 1 and Core 2 is 0.95, indicating that the conserved regions are largely consistent between the wild species and the cultivars. The core genome is also positively correlated with gene density and negatively correlated with the density of TEs, including a significant negative correlation with the most abundant TE family, Gypsy.
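As an illustration of how such a rank correlation can be computed, the following is a minimal sketch using only the Python standard library. The per-window values and the names `core1`, `core2`, and `te_density` are hypothetical and are not data from this study; the function simply applies the textbook definition of Spearman's ρ (the Pearson correlation of average ranks).

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i..j, converted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-window values (illustrative only, not results from the paper).
core1 = [0.82, 0.40, 0.65, 0.10, 0.91, 0.55]   # fraction of window in Core 1
core2 = [0.85, 0.48, 0.60, 0.15, 0.80, 0.50]   # fraction of window in Core 2
te_density = [0.20, 0.55, 0.35, 0.80, 0.15, 0.40]

print("Core 1 vs Core 2:", spearman_rho(core1, core2))
print("Core 1 vs TE density:", spearman_rho(core1, te_density))
```

In practice the same calculation would be run over genome-wide windows, and a library routine that also reports a p-value may be preferable.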
Marker design and statistics
To find markers that work universally across diverse germplasm, we designed markers on the core genome, that is, the regions present in all of the genomes. Several other features must be considered when designing rhAmpSeq markers. First, sequence variants should be avoided in the probes to reduce off-target amplification. Second, the polymorphism level of the markers should be considered. Markers with little polymorphism have little power to distinguish different germplasm, whereas markers with very high polymorphism increase the error rate in locating these markers. This information is limited if we consider only the nine genomes we assembled and collected. From public data sources, we downloaded resequencing data for 47 accessions in the genus Vitis with sequencing depth greater than 3×; 27 of them are from wild species and 23 of them are cultivars. Principal component analysis reveals the diverse genetic background of the wild species and the limited diversity of the cultivated lines (Figure 3A). To balance wild species and cultivars in the sample, we randomly selected 40 lines, half from each group, for the downstream analysis. The median polymorphism level of the core genome across the 40 accessions is 0.32, and it does not differ significantly from that of the core genes (Wilcoxon-Mann-Whitney test; Figure 3B). To focus on regions with a moderate polymorphism level, we filtered out the regions that fall outside the 25th to 75th percentile range. We also compared the genotype missing rate of SNPs in the core genome regions and the dispensable genome regions for these 40 accessions. As expected, the core genome regions have a significantly lower missing rate than the dispensable genome regions (Figure 3C). The last consideration in marker design is the physical distribution across the genome. We first randomly chose markers from the core genome regions to ensure one marker per 200 kb, and then added markers according to gene density: more markers were designed in gene-rich regions to improve the efficiency of gene mapping. We sent the candidate regions to IDT for primer design. rhPCR markers with product sizes ranging from 200 to 270 bp could be designed and pooled for 98% of the regions we provided. A total of 2000 rhPCR markers were designed and synthesized by IDT.
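The two selection steps described above (keeping regions within the 25th to 75th percentile of polymorphism, then drawing one marker per 200 kb) can be sketched as follows. This is a simplified illustration under stated assumptions rather than the pipeline actually used: the region records, the field names `chrom`, `start`, and `poly`, and the function names are hypothetical, and the gene-density-based addition of markers is not included.

```python
import random
from statistics import quantiles


def filter_moderate(regions):
    """Keep regions whose polymorphism lies within the 25th-75th percentile range."""
    q1, _, q3 = quantiles([r["poly"] for r in regions], n=4, method="inclusive")
    return [r for r in regions if q1 <= r["poly"] <= q3]


def one_per_window(regions, window=200_000, seed=0):
    """Randomly pick one candidate region per (chromosome, window) bin."""
    rng = random.Random(seed)
    bins = {}
    for r in regions:
        bins.setdefault((r["chrom"], r["start"] // window), []).append(r)
    return [rng.choice(bins[key]) for key in sorted(bins)]


# Hypothetical candidate core-genome regions on a 2 Mb chromosome.
rng = random.Random(42)
regions = [
    {"chrom": "chr1", "start": rng.randrange(0, 2_000_000), "poly": rng.random()}
    for _ in range(200)
]

moderate = filter_moderate(regions)
markers = one_per_window(moderate)
print(len(regions), "candidates,", len(moderate), "moderate,", len(markers), "selected")
```

Binning by `start // window` gives at most one marker per window; windows with no moderate-polymorphism candidate are simply left empty in this sketch.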
Marker validation in four F1 or F2 families
Discussion
Genus-wide transferable marker design based on the core genome
A set of 'universal' genetic markers that work across closely related species is desirable in many areas of study. In molecular-assisted breeding, universal markers can be used in distant hybridization and alien gene introgression \cite{Chagné2004,Brondani2006,Diaz2011}. In molecular ecology and evolutionary studies, universal markers allow genetic characters to be compared among related species\cite{Singh_2012,Bernardes_2018}. In clades or genera that are rich in economically important species, universal markers that are transferable across species greatly reduce the time and effort needed to develop unique markers for each species\cite{Kuleung2004,Pan_2018}. The transferability of microsatellite (SSR) markers ranges from 27% to 77% in different taxonomic groups of plants and animals \cite{Barbará2007}. However, the transferability of high-throughput genetic markers can be as low as 2% when markers are transferred across species \cite{Vezzulli2008}\cite{Chagné2012}. In this study, we developed a pipeline for developing universal markers that work across the whole Vitis genus, which diverged 20 Mya. Using the rhAmpSeq platform, 93% of markers returned data for all four populations we tested, and around 82% of markers were polymorphic in all of the populations. The genetic maps built for these four populations are consistent, indicating consistent marker order and segregation patterns. Although 10% to 20% of the markers in each population show unexpected Mendelian segregation ratios, these markers are population specific. We compared the distances between distorted markers with the distances between a random sample of the same number of markers from the entire set; the distances between distorted markers are significantly smaller than the random expectation (Mann-Whitney test, p<1e-13). This reduced distance indicates that the distorted markers are clustered on the chromosomes, which suggests that they are linked.
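The clustering comparison can be sketched as below. This is an illustration under assumptions, not the analysis code used here: "distance" is taken to mean the gap between neighbouring markers on one chromosome, the marker positions are simulated, and the Mann-Whitney test uses the normal approximation without tie or continuity correction.

```python
import math
import random


def adjacent_gaps(positions):
    """Distances between neighbouring markers on one chromosome."""
    ordered = sorted(positions)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Returns (U statistic for x, p-value). No tie or continuity correction.
    """
    # U counts the (x, y) pairs in which the x value is larger; ties count 0.5.
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    n1, n2 = len(x), len(y)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    return u, math.erfc(abs(z) / math.sqrt(2))


# Simulated example: 200 markers on a 20 Mb chromosome, with the "distorted"
# markers (hypothetically) confined to one 2 Mb interval.
rng = random.Random(1)
all_markers = sorted(rng.sample(range(0, 20_000_000), 200))
distorted = [p for p in all_markers if 5_000_000 <= p < 7_000_000]
random_set = rng.sample(all_markers, len(distorted))

u, p = mann_whitney_u(adjacent_gaps(distorted), adjacent_gaps(random_set))
print("U =", u, "p =", p)
```

A small U for the distorted set, with a small p-value, corresponds to gaps that are shorter than those of a random marker set of the same size.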
In addition, when we combined individuals from these four populations to form a meta-population, only XX% of markers were distorted. This further suggests that the majority of the markers are informative for constructing a genetic map, and that a small portion of them might fail because of differences in genetic background. We also found that, in the two populations in which the sex locus was located, the markers that explain the largest phenotypic variation are the same, which indicates that not only random markers but also functional markers are transferable. In short, we validated that markers designed from the genus-wide core genome and genome polymorphism are transferable at several levels: amplification, polymorphism, segregation, and marker-trait association.
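Segregation distortion of the kind discussed above is commonly assessed with a chi-square goodness-of-fit test against the expected Mendelian ratio. The test actually applied here is not stated, so the following is only a minimal sketch of that common approach; the function name and the genotype counts are hypothetical. It handles two genotype classes (for example 1:1) or three (for example 1:2:1), for which the chi-square p-value has a closed form.

```python
import math


def segregation_test(observed, expected_ratio):
    """Chi-square goodness-of-fit test for two or three genotype classes.

    Returns (chi-square statistic, p-value).
    """
    if len(observed) != len(expected_ratio) or len(observed) not in (2, 3):
        raise ValueError("expects two or three genotype classes")
    total = sum(observed)
    ratio_sum = sum(expected_ratio)
    chi2 = 0.0
    for obs, ratio in zip(observed, expected_ratio):
        expected = total * ratio / ratio_sum
        chi2 += (obs - expected) ** 2 / expected
    if len(observed) == 2:
        # One degree of freedom: survival function is erfc(sqrt(x / 2)).
        p_value = math.erfc(math.sqrt(chi2 / 2))
    else:
        # Two degrees of freedom: survival function is exp(-x / 2).
        p_value = math.exp(-chi2 / 2)
    return chi2, p_value


# Hypothetical genotype counts for two markers.
print(segregation_test([52, 48], [1, 1]))         # close to 1:1
print(segregation_test([40, 45, 15], [1, 2, 1]))  # departs from 1:2:1
```

When many markers are tested, the significance threshold would normally be adjusted for multiple testing before a marker is called distorted.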
The key to developing transferable markers is the construction of a genus-wide core genome that takes collinearity into account. Previously, markers designed by resequencing a large number of samples had limited transferability. DNA resequencing gives access to rich information on small genetic variants, but large and complex structural variation is often missed. The long collinear blocks conserved within the genus are suggestive of strong selection against structural variation in these regions, which increases the probability of identifying markers that have a consistent occurrence in the genome and a consistent segregation pattern. Our results indicate that, to identify markers that are transferable between species, a core genome with collinearity should be considered. [core transcriptome 70% design rate]. Many marker design approaches have considered the polymorphism of the markers. In our study, we used genus-wide polymorphism to detect regions with moderate diversity and also to design highly conserved probes, which contain no or very little variation across the whole genus, to ensure binding specificity between the probe and the target sites.
The advantage of the rhAmpSeq genotyping platform for highly diverse and heterozygous species
In our previous study, we found that the AmpSeq genotyping platform outperforms GBS and other NGS-based genotyping platforms for highly diverse and heterozygous species, owing to limited missing data, increased coverage and accuracy at heterozygous sites, and elevated transferability among distinct species. Unlike SNP arrays or KASP, which target one specific polymorphic site, AmpSeq also allows the identification of novel alleles and short haploblocks, because the entire amplified region (90 bp to 250 bp long) is sequenced by NGS. Therefore, for a pair of individuals with a genetic diversity greater than 1 SNP per 250 bp, the amplified region is expected to contain at least one SNP that distinguishes the two individuals, which increases the information content of the markers and makes them suitable not only for distant but also for closely related bi-parental populations. This high-coverage and unbiased sequencing of amplicons makes the platform applicable to population genetics and ecology studies. Compared with AmpSeq, the rhAmpSeq technology adds an RNA base and a blocker to the primer, and these can be cleaved by the RNase H2 enzyme only when the primer matches the template perfectly \cite{Dobosy2011}. This step increases the specificity of genotyping and also increases the throughput to 5000 probes in one tube (rhAmpSeq, IDT).
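As a rough check on the amplicon-length argument, and only under the added assumption (not made in the text) that SNPs between two individuals occur independently at a uniform density $d$ per bp, the probability that an amplicon of length $L$ carries at least one distinguishing SNP is

\[
P(\geq 1\ \text{SNP}) = 1 - e^{-dL}.
\]

For $d = 1/250$ per bp, this gives $1 - e^{-1} \approx 0.63$ for $L = 250$ bp and $1 - e^{-0.36} \approx 0.30$ for $L = 90$ bp. Under this simplified model, longer amplicons and higher pairwise diversity both raise the chance that a given marker is informative for a particular pair of parents.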