Geographically distinct sites display a site-specific set of nodule Rhizobium alleles
Rhizobium leguminosarum is a species complex consisting of multiple genospecies that have been shown to co-exist in a field setting. Rlt core genes show little sign of introgression between genospecies, and phylogenies of individual core genes therefore most often follow the overall genospecies phylogenetic tree. A phylogenetic analysis of amplicons from the chromosomal core genes rpoB and recA showed that the bacteria sampled from nodules are distributed throughout the five main genospecies clades previously identified from isolates originating from these exact fields (Figure 2A-D). For the core genes rpoB and recA, the majority of the alleles identified by MAUI-seq were also recovered in the isolates, while some additional alleles were found only in a small number of isolates, particularly for recA (Table 1 and Figure 2A-D). Most of these isolate-only sequences were actually present in the MAUI-seq dataset, but fell below the cumulative abundance threshold that we used. For the other three genes, MAUI-seq recovered more alleles than the isolates.
The accessory genes nodA and nodD belong to a group of co-located genes, known as the sym gene cluster, that are essential for initiating and maintaining an effective symbiotic relationship with legumes. The phylogeny of the accessory gene pool has previously been shown to be often incongruent with that of the core genes. This cluster is usually located on a conjugative plasmid in the Rl species complex. Occasionally, regions of the cluster are duplicated in the rhizobial genome and, due to the promiscuous nature of conjugative plasmids, they can cross genospecies boundaries. Using the set of 196 characterised Rlt isolates from the same sampling sites, we evaluated the level of duplication of nod genes in order to remove potential paralogs. In addition to the full nod gene region (nodXNMLEFDABCIJ), a partial set of nod genes (nodDABCIJT) is present in some of the Rlt isolates. nodAseq7 and nodDseq9 occurred only as secondary sequences in this partial nod region and were designated nodAa and nodDa, respectively (Figure 2C-D). A third type of nodD (nodD2) was observed in some genomes, flanked by transposases and not accompanied by other nod genes. Three nodD amplicons belong to this group. These five paralogous sequences were removed from all downstream analyses to avoid inflating the estimates of overall diversity. All 12 nodD alleles seen in the genomes were recovered by MAUI-seq, plus an additional 5 alleles. MAUI-seq detected 12 of the 14 nodA alleles seen in the genomes, and found an additional 9 alleles (Table 1 and Figure 1). All of the abundant sequences with frequency > 0.15 have an exact match in the 196 Rlt genomes, and the allele frequencies are highly correlated between the two datasets (Figure 2E-H). The sequences identified only by MAUI-seq are of low abundance, but appear to be genuine sequences (Figure 2A-D and Figure 3).
Likewise, the sequences in the 196 genomes not found by MAUI-seq are only present in a small number of isolates and at low frequencies; 8 out of the 13 sequences are only found in a single isolate (Figure 2).
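The comparison of allele frequencies between the amplicon data and the isolate genomes can be illustrated with a short sketch. This is not the analysis code behind Figure 2E-H: the allele names and counts below are invented, the function names are hypothetical, and a plain Pearson correlation is assumed as the measure of agreement. Counts from each dataset are converted to relative frequencies over the union of alleles, so that an allele missing from one dataset contributes a frequency of zero.

```python
from math import sqrt


def relative_frequencies(counts, alleles):
    """Convert raw counts to relative frequencies over a fixed allele list."""
    total = sum(counts.get(a, 0) for a in alleles)
    if total == 0:
        raise ValueError("no counts for the requested alleles")
    return [counts.get(a, 0) / total for a in alleles]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant input")
    return sxy / sqrt(sxx * syy)


def compare_datasets(amplicon_counts, isolate_counts):
    """Correlate allele frequencies and list alleles unique to each dataset."""
    alleles = sorted(set(amplicon_counts) | set(isolate_counts))
    freq_amplicon = relative_frequencies(amplicon_counts, alleles)
    freq_isolate = relative_frequencies(isolate_counts, alleles)
    return {
        "r": pearson(freq_amplicon, freq_isolate),
        "only_amplicon": [a for a in alleles
                          if amplicon_counts.get(a, 0) > 0
                          and isolate_counts.get(a, 0) == 0],
        "only_isolates": [a for a in alleles
                          if isolate_counts.get(a, 0) > 0
                          and amplicon_counts.get(a, 0) == 0],
    }


# Invented example: molecule counts per allele (amplicons) and
# number of isolates carrying each allele (genomes).
amplicon_counts = {"seq1": 500, "seq2": 300, "seq3": 150, "seq4": 50}
isolate_counts = {"seq1": 100, "seq2": 60, "seq3": 30, "seq5": 6}

result = compare_datasets(amplicon_counts, isolate_counts)
print(result["only_amplicon"], result["only_isolates"], round(result["r"], 3))
```

In this toy input the alleles unique to one dataset are the rare ones, which mirrors the pattern described above without reproducing any of the real values.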
Principal component analysis of the amplicons from individual genes (Figure S2) revealed that different loci provide different levels of resolution. recA separated the French samples well from all other locations, whereas the UK samples were clearly separated from the other two field trial sites for all four loci. The high level of diversity among and within DKO samples made it difficult to distinguish them from the F and DK samples for most amplicons.
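As a rough illustration of how a principal component analysis separates samples by allele composition, the sketch below computes the first principal component of a small sample-by-allele matrix. It is a minimal example under stated assumptions, not the procedure used for Figure S2: the relative abundances are invented, the function names are hypothetical, only the first component is computed, and power iteration in plain Python stands in for a library routine.

```python
from math import sqrt


def center_columns(matrix):
    """Subtract the column mean from every entry (samples are rows)."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[v - m for v, m in zip(row, means)] for row in matrix]


def first_principal_component(matrix, iterations=200):
    """Return (loadings, scores) of PC1 via power iteration on X^T X."""
    x = center_columns(matrix)
    p = len(x[0])
    # Non-uniform start vector: for relative abundances the centred rows
    # sum to zero, so a constant start vector would be mapped to zero.
    v = [float(i + 1) for i in range(p)]
    for _ in range(iterations):
        scores = [sum(a * b for a, b in zip(row, v)) for row in x]
        w = [sum(row[j] * s for row, s in zip(x, scores)) for j in range(p)]
        norm = sqrt(sum(c * c for c in w))
        if norm == 0:
            raise ValueError("no variance along the current direction")
        v = [c / norm for c in w]
    scores = [sum(a * b for a, b in zip(row, v)) for row in x]
    return v, scores


# Invented relative abundances of three alleles in six samples.
samples = [
    [0.80, 0.15, 0.05],  # hypothetical site A
    [0.75, 0.20, 0.05],
    [0.82, 0.13, 0.05],
    [0.10, 0.70, 0.20],  # hypothetical site B
    [0.15, 0.65, 0.20],
    [0.12, 0.70, 0.18],
]

loadings, scores = first_principal_component(samples)
for label, score in zip("AAABBB", scores):
    print(label, round(score, 3))
```

The sign of a principal component is arbitrary, so only the relative positions of the samples along the axis are meaningful. A locus at which sites differ strongly in allele composition places them at opposite ends of the first component, whereas a locus with similar composition across sites gives overlapping scores.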
Each breeding trial site (DK, F, and UK) showed a distinct set of amplicons, despite the nodules from each site being sampled from the same F2 clover families from the same seedstock and being under identical management (Table S1, Figure 3). The samples from the trial sites were relatively uniform within each site, and each sample had a low number of total observed amplicons, whereas the DKO samples appeared less homogeneous within each sample.