We collected two to three leaves for DNA extraction and dried these in silica gel (Fischer Scientific) and one mature healthy flower for phenotyping. Following the visual scoring system developed by (Whibley, 2006) we assigned plants to one of six flower phenotypes based on their inferred genotypes at the loci Rosea and Sulfurea(Figure 2 ). ROS/ROS plants have intense magenta pigmentation, whilst ros/ros plants have none, and heterozygotes are intermediate. The SULF allele is fully dominant over sulf, and SULF/SULF and SULF/sulf plants cannot be easily distinguished by human eyes, and will henceforth be referred to collectively as SULF/+. sulf/sulf plant are yellow throughout the face of the flower, whilst SULF/+ plants express aurones only in a small patch at the centre of the flower.
In August 2012 we collected a single, mature, wild-pollinated fruit from each of 60 mothers (figure [fig:map-2012]). In order to minimise disturbance to the population we only sampled from plants which had set a minimum of five mature fruits. These mothers were chosen to represent an even sample of pigmentation genotypes, spread as evenly as possible across the core of the hybrid zone where hybrids are most dense.

Genotyping

We grew seeds in 5cm plug trays filled with potting compost (Gramaflor) in a greenhouse under Sylvania GroLux lights on a 16-hour cycle. We sowed three seeds per plug for 50-70 plugs per maternal family and thinned seedlings to a single seedling per plug after cotelydons had appeared. We transferred approximately 1cm^2^ of fresh tissue from 1419 seedlingsto 96-well DNA-extraction plates (LGC Genomics, Berlin) and allowed tissue to dry using the sample bag and silica gel provided. For parental tissue from the hybrid zone we transferred approximately 1cm^{2} tissue dried in the field to the same plates. DNA extractions of the plated tissue samples were carried out by LGC Genomics [Is this in another publication yet I can cite?].
We genotyped tissue samples at 71 SNPs by KASPR sequencing (LGC Genomics). These SNPs are a subsample of a panel used for a wider survey of the hybrid zone (David Field, unpublished data). The total SNP panel is a mixture of [how many?] diagnostic (showing a gradient in allele frequency) and (how many?) parentage (with as even a gradient in allele frequency as possible) SNPs. For parentage loci we chose only biallelic loci with a minor allele frequency greater than 0.3 in each of inner four pools closest to the centre of the cline, selected to maximise mapping distance between loci. Diagnostic SNPs were either linked to pigmentation loci, or else showed sharp clines across the hybrid zone. To select a subsample of SNPs for this study, we selected markers which were at least 2cM apart to maximise mapping distance between individuals. We removed 474 offspring and four adults that had missing data at more than 7.5% of the SNPs. We also pruned 7 SNPs that showed more than 10% missing data, or less than 15% heterozygosity. This left us with a set of 984 offspring from 60 maternal families, with between two and 29 offspring per family (mean=16.4).

Pollen dispersal kernel

A useful function for describing plant dispersal kernels is the generalised normal distribution (GND). This is a generalisation of the exponential family of probability distributions and includes the exponential and standard normal distributions as special cases, but allows for fat and thin tails (Austerlitz et al., 2004; Clark, 1998; Nadarajah, 2005). The GND describes the probability of observing dispersal distance d through scale parameter a and shape parameter b:\[GND(d_{mj},a,b) = K \exp \left(\frac{-d}{a} \right)^b \]
where and is a normalising constant and \(\Gamma\) is Euler's gamma function. The function takes the form of the standard exponential and normal distributions when b =1 and b =2 respectively. Values of b<1 indicate leptokurtic distributions with tails that decay more slowly than would be expected under the exponential distribution. The mean dispersal distance is given by the standard deviation of the GND:
\[\sqrt\frac{a^2 \Gamma(^3/_b)}{\Gamma(^1/_b)}\]
We model the pollen dispersal kernel based on the distances between known mothers of full-sibling families and the plants inferred to be the fathers of those families. Leptokurtosis in this distribution may be caused by a long tail of genuine mating events with distant fathers. However, when unrelated individuals are incorrectly identifed as being fathers of a full sibship, we expect these individuals to be drawn at random from the population since the marker panel is designed to show no spatial structure, whereas true fathers will (on average) be close to maternal plants. This will inflate the apparent kurtosis in the data and underestimate b . To accommodate this we model the probability that candidate pollen donor individual j is the sire of a full sibship with mother m given distance \(d_{mj}\) between them as a mixture between a GND and a uniform distribution:
\[ \Pr(d_{mj}|a,b,\lambda) = \lambda GND(d_{mj},a,b) + \frac{1-\lambda}{N} \]
where N is the number of candidate fathers and parameter determines the proportion of the probability mass due to ’real’ dispersal. Conveniently, also provides an estimate of the false positive rate if we had run paternity analysis using genetic data alone.

Joint estimates of paternity, sibships and dispersal

Allowing for covariates in paternity inference

We have previously described a Python package FAPS which performs joint analysis of paternity and sibship relationships for sibling arrays based on SNP data, and allows for integrating out uncertainty in relationships (Ellis et al., 2018). Here we extended the software to allow for additional non-genetic information to be included (Hadfield et al., 2006).
FAPS is based on a matrix G , with a row of each offspring and a column for each candidate father, where element \(g_{ij}\) is the probability that candidate j is the father of offspring i . Rows in G sum to one, and describe a multinomial distribution of probabilities of paternity over all candidate fathers. The final column of G is the probability that the father of each offspring was missing from the sample of candidates, based on the probability of observing offspring alleles from population allele frequencies, and an estimate of the proportion of possible pollen donors in the population that had been sampled. FAPS then uses hierarchical clustering to partition offspring into a sets of plausible full sibships. The probability that j is the father of putative full sibship k is then \(\prod_i g_{ij}\).
Similarly, we can describe the probability that j is the father of sibship k based on non-genetic information through a suitable function of those data as \(z_j\prod_i g_{ij}\). Here, we model \(z_j\)as \(Pr(d_{mj}|a,b,\lambda)\), but this formulation is general to any suitable function relating non-genetic data to probabilities of paternity. The likelihood of the whole array is then the product of probabilities for each full sibship within a partition, summed over all possible partition structures. When there are multiple maternal families, the likelihood of the whole dataset is the product of per-family likelihoods over each maternal family.

Inference via MCMC

We can then infer paremeters of interest by finding combinations of values that maximise the likelihood of the data. Here, there are four parameters of interest, while using the algorithm in FAPS to integrate out uncertainty in paternity and sibship relationships: (1) the proportion of missing fathers, (2) the shape and (3) scale parameters of the generalised normal distribution describing dispersal, and (4) the mixture parameter \(\lambda\) describing the strength of signal from the GND distribution of dispersal.
We used the Metropolis-Hastings Markov-chain Monte Carlo algorithm to infer the posterior distribution of these parameters. We ran four independent chains beginning from distant areas of the parameter space. At each iteration, each parameter was peturbed by a a factor drawn from a normal distritution with a fixed standard deviation for each parameter (\(\sigma\)= 0.025 for proportion of missing fathers and \(\lambda\), \(\sigma\)= 0.05 for shape and \(\sigma\)= 2 for scale). We ran each chain for 60,000 iterations to ensure mixing, and subsequently removed the first 1000 iterations of each as burn-in. After checking chain convergence (supplementary figures 2-4) we thinned subsequent iterations to retain 250 posterior draws from each chain for further analyses.
We used beta prior distributions for proportion of missing fathers and \(\lambda\) (Beta(\(\alpha\)=3, \(\beta\)=15) and Beta(\(\alpha\)=1.1, \(\beta\)=1.1) respectively), and Gamma priors for dispersal shape and mean dispersal distance (\(\Gamma\)(shape=200, scale=200) and \(\Gamma\)\(\Gamma\)\(\Gamma\)(shape=2, scale=200) respectively). We used a prior for mean dispersal distance instead of the scale parameter of the GND because this has a more intuitive interpretation.

Flower colour and male fitness

We next investigated whether certain flower-colour genotypes sire more offspring than would be expected given local genotype frequencies, and whether this varies across the hybrid zone. We did this by comparing ’observed’ mating events, based on genetic and dispersal data, to ’expected’ mating events simulated based on dispersal only for mothers in each of five bins of approximately 300m (n=3, 6, 21, 27, 3 mothers per bin; supplementary figure 1). We calculated observed and expect mating probabilities for each of 1000 sets of dispersal parameters and proportions of missing fathers from the trimmed MCMC results.
To calculate observed mating probabilities we used the likelihoods fromFAPS that an individual candidate father had sired at least one offspring with a maternal plant based on genotype data, his distance from the mother, and dispersal parameters, integrating over uncertainty in sibling relationships (Ellis et al., 2018). For expected mating probabilities under random mating we calculated the probability of mating between each pair of mothers and candidate fathers based on the distance between them and dispersal parameters. We then summed these values over all candidates of each Rosea and Sulfureagenotype, and over mothers within each spatial bin. We normalised likelihoods for each maternal genotype to sum to one to give relative probabilities than mothers in each bin received pollen from males of each flower colour genotype.

Power analyses

One explanation for apparent leptokurtosis in dispersal is incorrect assignment of paternity to candidates far from maternal plants, either because the true father was not sampled or due to insufficient statistical power of the marker set. This would inflate the tail of the dispersal kernel even if true dispersal were not leptokurtic (i.e. cause the shape parameter of the GND to be less than one, even when it is really greater or equal to one).
To test this, we first examined how often we should expect incorrect paternity assignment using this marker set, and the extent to which incorrect assignment and missing fathers would inflate apparent dispersal. We used FAPS to simulate offspring based on observed maternal and candidate male genotypes (Ellis et al., 2018). For each of the 60 mothers we sampled mates in proportion to their distance, assuming that the pollen dispersal distances are exponentially distributed with a mean of 10, 50, 100 and 200m. Since clustering offspring into sibships is the most computationally demanding part of the analysis, we instead consider paternity of individuals rather than sibships; this is valid because he are primarily interested in mating events, not of sibling relationships themselves. We simulated one offspring for each of N fathers for each mother, where N is the most likely number of full sibship families in observed families rounded to the nearest integer (supplementary figure 2), to give a total of 433 offspring. We then calculated (1) the posterior probability of paternity for each true sire in the absence of information about dispersal, (2) mean distance between mothers and known true sires and (3) weighted mean distance between mothers and possible all sires, weighted by probability of paternity for all candidates with a probability >0.001 of having sired at least offspring. To test dependency on sampling effort we then removed 10%, 20%, 30% or 40% of true sires at random from the dataset and recalculated weighted-mean distances between mothers and sires. We repeated this procedure on 100 replicate simulated datasets.
We then used a second set of simulations to investigate the extent to which a joint analysis of paternity and dispersal could mitigate the inflation of leptokurtosis. We generated offspring as described above assuming exponential dispersal of 50m only, and performed a simplified joint-analysis of paternity and dispersal by grid interpolation. We multiplied the matrix of paternity probabilities from genetic data by a matrix of dispersal probabilities from \( \Pr(d_{mj}|a,b,\lambda)\) using combinations of scale (\(5 \leq a \leq 150\)) and shape (\(0.1 \leq b \leq 2\)) parameter values, with fixed \(\lambda\) at the posterior mean of 0.82. We calculated the likelihood of each combination of parameters by summing joint probabilities of paternity over each candidate father, and multiplying those likelihoods over all 433 offspring. We then recorded the combination of dispersal parameters with the highest (log) likelihood. We also repeated this procedure for datasets excluding true sires as described above. We repeated these analyses on 100 replicate simulated datasets.

Results

Dispersal is leptokurtic

FAPS grouped the 937 offspring into between 434 and 437 full-sibling families, and identified 316 mating events between the 60 mothers and candidate fathers with posterior probabilities greater than 0.99 (supplementary figure 2). Mixture parameter was strongly weighted towards signal coming from generalised normal dispersal rather than random draws from the population (mean=0.82, 95% credible intervals = 0.77, 0.93), indicating that most of the mating events reflect real mating events (supplementary figure 4). The data were compatible with broad range of plausible values for the proportion of missing fathers (mean=0.21, 95% credible intervals = 0.06, 0.42; supplementary figure 4), but the posterior distribution was very similar to the prior distribution (supplementary figure 3), suggesting that the data are not informative about missing fathers.
Pollen dispersal distances are characterised by many dispersal events between nearby plants, and a long tail of more distant mating events. 50% of the high-probability fathers were within 40m of the mother; distances to remaining fathers decayed slowly up to a maximum of 2398m. Consistent with this, the full pollen dispersal kernel inferred jointly with sibships and paternity consistently showed shape parameters less than one (posterior mean=0.40, 95% credible intervals: 0.329-0.483). These results indicate strong leptokurtosis in the pollen dispersal kernel.