Syntenic core-genome construction
Repetitive regions in both the reference genome and the assemblies were masked using a k-mer frequency-based approach with BBDuk, a tool in the BBTools package (version 35.50)
\cite{bushnell2014bbmap}. Sequences with a k-mer (k = 31) frequency greater than 2 were replaced with N. Each masked assembly was aligned to the reference genome, PN40024 (version 12X.2)
\cite{Jaillon2007}, using Minimap2 with the parameter preset tuned for cross-species alignment, denoted ``asm5'' in the manual
\cite{Li_2018}. The alignments were converted to a BED-like format for chaining and for identifying one-to-one matches using quota-alignment (
https://github.com/tanghaibao/quota-alignment)
\cite{Tang2011}. The coverage of the core genome was defined as the number of times a genomic region was covered by syntenic one-to-one genome alignments. Two core-genome sets were defined: one considers all nine genomes examined, and the other considers only the six cultivars, including the three hybrid grape accessions. Core-genome coverage, gene density, TE density, and other genome features were calculated in 1-Mb windows using BEDTools (v2.27.1). Correlations between genome features were calculated using Spearman's rank correlation coefficient in R.
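As an illustration of the coverage definition above, the following minimal Python sketch computes the mean number of genomes covering each 1-Mb window of a single chromosome. It is not the BEDTools workflow used here; the function names, the input layout (half-open reference intervals per genome), and the toy coordinates are assumptions made for the example.

```python
WINDOW = 1_000_000  # 1-Mb window size, as in the text


def merge_intervals(intervals):
    """Merge overlapping half-open (start, end) intervals of one genome."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def window_coverage(alignments, chrom_len, window=WINDOW):
    """Mean core-genome coverage per window for one chromosome.

    alignments: dict mapping a genome name (hypothetical labels) to a list
    of (start, end) reference intervals from its one-to-one alignment.
    Returns one value per window: covered genome-bases / window length.
    """
    n_windows = (chrom_len + window - 1) // window
    covered = [0] * n_windows
    for intervals in alignments.values():
        # Merging ensures each genome counts at most once per base.
        for start, end in merge_intervals(intervals):
            end = min(end, chrom_len)
            pos = start
            while pos < end:
                w = pos // window
                stop = min(end, (w + 1) * window)
                covered[w] += stop - pos
                pos = stop
    result = []
    for w in range(n_windows):
        size = min(window, chrom_len - w * window)  # last window may be short
        result.append(covered[w] / size)
    return result


# Toy example: two genomes aligned to a 2.5-Mb chromosome.
toy = {
    "genomeA": [(0, 1_000_000), (1_500_000, 2_500_000)],
    "genomeB": [(500_000, 1_500_000)],
}
print(window_coverage(toy, 2_500_000))
```

With nine genomes the per-window value ranges from 0 (no genome aligned) to 9 (all genomes aligned across the whole window).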
Genus-wide variant calling
Paired-end Illumina sequencing data for a total of 47 accessions, with 2--93× coverage, were downloaded from the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). For accessions with more than 20× coverage, we randomly sampled reads down to 20× for variant calling. We also sequenced 8 accessions at 5--10× coverage with paired-end reads on the Illumina HiSeq 2500 platform. The insert size was 400 bp(?) and the read length was 2 × 150 bp. The raw sequencing data have been
deposited in the SRA at NCBI under BioProject ID: PRJNAXXXXXX. All SRA information, including project number, total base pairs, and accession name, is listed in supplemental table X. Variants were called using the Sentieon DNA Software Package (version, Golden Helix) \cite{Kendig396325} with default settings. The Sentieon package is an accelerated reimplementation of the Genome Analysis Toolkit (GATK) HaplotypeCaller and returns the same results as GATK 3.3. Principal component analysis (PCA) was conducted using the R/Bioconductor package SNPRelate. To avoid the strong influence of SNP clusters on the PCA, SNPs were filtered using the LD-based pruning algorithm implemented in SNPRelate with an LD threshold of 0.2. Based on the eigenvalues from the PCA, we randomly removed samples with very close relationships and kept 20 V. vinifera samples and 20 non-vinifera samples.
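The downsampling step for high-coverage accessions can be sketched as follows. This is a minimal illustration in Python, not the tool actually used; the function names, the fixed random seed, and the idea of sampling read-pair identifiers (so that mates stay together) are assumptions for the example.

```python
import random

TARGET_DEPTH = 20.0  # target coverage from the text


def keep_fraction(depth, target=TARGET_DEPTH):
    """Fraction of read pairs to keep so the expected depth is at most target."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    return min(1.0, target / depth)


def subsample_pairs(pair_ids, depth, target=TARGET_DEPTH, seed=0):
    """Randomly keep read pairs, treating the two mates as one unit.

    pair_ids: identifiers of read pairs (hypothetical input layout).
    depth: estimated coverage of the full data set.
    """
    pair_ids = list(pair_ids)
    frac = keep_fraction(depth, target)
    if frac >= 1.0:
        return pair_ids  # already at or below the target depth
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    n_keep = round(len(pair_ids) * frac)
    return sorted(rng.sample(pair_ids, n_keep))


# An accession sequenced at 93x keeps about 21.5 percent of its read pairs;
# one sequenced at 5x is left untouched.
print(keep_fraction(93.0), keep_fraction(5.0))
```

Sampling whole pairs rather than individual reads preserves the pairing information that the aligner and variant caller rely on.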
Marker design pipeline
The VCF file generated from the genus-wide variant calling was loaded into R. For each non-gap alignable region in the core genome, we checked its length, diversity, and missing rate. Regions shorter than 200 bp, with diversity greater than 7\% or less than 2\%, or with an average missing rate greater than 50\% were discarded. These steps were conducted in R using the Bioconductor (version 3.8) package VariantAnnotation \cite{Obenchain2014}. Candidate regions were then selected to ensure one marker per 200 kb. If no qualified candidate region could be found in a 1-Mb window, we included the regions with the highest coverage in the core-genome construction. For each 1-Mb sliding window, we randomly included additional candidate regions where gene density was high. A total of 2500 candidate regions were sent to IDT for primer design and pooling compatibility testing. Primers could be designed for 99.6\% of the regions, and 98.1\% of them were pooling-compatible in one-tube PCR. A total of 2000 rhPCR markers were synthesized by IDT.
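The region filters and the one-marker-per-200-kb selection can be illustrated with a short Python sketch. The analysis itself was done in R with VariantAnnotation; the record layout, the function names, and the random choice among qualified regions within a bin are assumptions for this example, and the fallback for windows without candidates and the enrichment in gene-dense windows are omitted.

```python
import random

MIN_LEN = 200                  # minimum region length in bp
MIN_DIV, MAX_DIV = 0.02, 0.07  # diversity bounds, expressed as fractions
MAX_MISSING = 0.50             # maximum average missing rate
BIN_SIZE = 200_000             # one marker per 200 kb


def passes_filters(region):
    """Apply the length, diversity, and missing-rate thresholds from the text."""
    length = region["end"] - region["start"]
    return (
        length >= MIN_LEN
        and MIN_DIV <= region["diversity"] <= MAX_DIV
        and region["missing"] <= MAX_MISSING
    )


def pick_one_per_bin(regions, bin_size=BIN_SIZE, seed=0):
    """Keep one qualified region per (chromosome, 200-kb bin).

    The random choice within a bin is an assumption of this sketch.
    """
    rng = random.Random(seed)
    bins = {}
    for region in regions:
        if passes_filters(region):
            key = (region["chrom"], region["start"] // bin_size)
            bins.setdefault(key, []).append(region)
    return [rng.choice(bins[key]) for key in sorted(bins)]


# Toy candidate regions (hypothetical values).
candidates = [
    {"chrom": "chr1", "start": 1_000, "end": 1_300, "diversity": 0.05, "missing": 0.1},
    {"chrom": "chr1", "start": 5_000, "end": 5_100, "diversity": 0.05, "missing": 0.1},
    {"chrom": "chr1", "start": 250_000, "end": 250_400, "diversity": 0.10, "missing": 0.1},
    {"chrom": "chr1", "start": 650_000, "end": 650_400, "diversity": 0.03, "missing": 0.2},
]
for picked in pick_one_per_bin(candidates):
    print(picked["chrom"], picked["start"], picked["end"])
```

In the toy input, the second region fails the length filter and the third fails the diversity filter, so only the first and last regions remain, one in each of two separate 200-kb bins.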