We collected two to three leaves for DNA extraction and dried these in silica gel (Fischer Scientific) and one mature healthy flower for phenotyping. Following the visual scoring system developed by \citet{Whibley2006} we assigned plants to one of six flower phenotypes based on their inferred genotypes at the loci Rosea and Sulfurea (figure \ref{821839}). ROS/ROS plants have intense magenta pigmentation, whilst ros/ros plants have none, and heterozygotes are intermediate. The SULF allele is fully dominant over sulf, and SULF/SULF and SULF/sulf plants cannot be easily distinguished by human eyes, and will henceforth be referred to collectively as SULF/+. sulf/sulf plant are yellow throughout the face of the flower, whilst SULF/+ plants express aurones only in a small patch at the centre of the flower.
In August 2012 we collected a single, mature, wild-pollinated fruit from each of 60 mothers (figure \ref{666285}, figure \ref{120180}). In order to minimise disturbance to the population we only sampled from plants which had set a minimum of five mature fruits. These mothers were chosen to represent an even sample of pigmentation genotypes, spread as evenly as possible across the core of the hybrid zone where hybrids are most dense.

Genotyping

We grew seeds in 5cm plug trays filled with potting compost (Gramaflor) in a greenhouse under Sylvania GroLux lights on a 16-hour cycle. We sowed three seeds per plug for 50-70 plugs per maternal family and thinned seedlings to a single seedling per plug after cotelydons had appeared. We transferred approximately 1cm2 of fresh tissue from 1419 seedlings to 96-well DNA-extraction plates (LGC Genomics, Berlin) and allowed tissue to dry using the sample bag and silica gel provided. For parental tissue from the hybrid zone we transferred approximately 1cm^{2} tissue dried in the field to the same plates. DNA extractions of the plated tissue samples were carried out by LGC Genomics [Is this in another publication yet I can cite?].
We genotyped tissue samples at 71 SNPs by KASPR sequencing (LGC Genomics). These SNPs are a subsample of a panel used for a wider survey of the hybrid zone (David Field, unpublished data). The total SNP panel is a mixture of [how many?] diagnostic (showing a gradient in allele frequency) and (how many?) parentage (with as even a gradient in allele frequency as possible) SNPs. For parentage loci we chose only biallelic loci with a minor allele frequency greater than 0.3 in each of inner four pools closest to the centre of the cline, selected to maximise mapping distance between loci. Diagnostic SNPs were either linked to pigmentation loci, or else showed sharp clines across the hybrid zone. To select a subsample of SNPs for this study, we selected markers which were at least 2cM apart to maximise mapping distance between individuals. We removed 474 offspring and four adults that had missing data at more than 7.5% of the SNPs. We also pruned 7 SNPs that showed more than 10% missing data, or less than 15% heterozygosity. This left us with a set of 984 offspring from 60 maternal families, with between two and 29 offspring per family (mean=16.4).

Pollen dispersal kernel

A useful function for describing plant dispersal kernels is the generalised normal distribution (GND). This is a generalisation of the exponential family of probability distributions and includes the exponential and standard normal distributions as special cases, but allows for fat and thin tails\cite{Clark1998,nadarajah2005generalized}. The GND describes the probability of observing dispersal distance d through scale parameter a and shape parameter b:\[GND(d_{mj},a,b) = K \exp \left(\frac{-d}{a} \right)^b \]
where K  is a normalising constant and \(\Gamma\) is Euler's gamma function. The function takes the form of the standard exponential and normal distributions when b =1 and b =2 respectively. Values of b<1 indicate leptokurtic distributions with tails that decay more slowly than would be expected under the exponential distribution. The mean dispersal distance is given by the standard deviation of the GND:
\[\sqrt\frac{a^2 \Gamma(^3/_b)}{\Gamma(^1/_b)}\]
We model the pollen dispersal kernel based on the distances between known mothers of full-sibling families and the plants inferred to be the fathers of those families. Leptokurtosis in this distribution may be caused by a long tail of genuine mating events with distant fathers. However, when unrelated individuals are incorrectly identified as being fathers of a full sibship, we expect these individuals to be drawn at random from the population since the marker panel is designed to show no spatial structure, whereas true fathers will (on average) be close to maternal plants. This will inflate the apparent kurtosis in the data and underestimate b . To accommodate this we model the probability that candidate pollen donor individual j is the sire of a full sibship with mother m given distance \(d_{mj}\) between them as a mixture between a GND and a uniform distribution:
\[ \Pr(d_{mj}|a,b,\lambda) = \lambda GND(d_{mj},a,b) + \frac{1-\lambda}{N} \]
where N is the number of candidate fathers and parameter determines the proportion of the probability mass due to ’real’ dispersal. Conveniently, also provides an estimate of the false positive rate if we had run paternity analysis using genetic data alone.

Joint estimates of paternity, sibships and dispersal

Allowing for covariates in paternity inference

We have previously described a Python package FAPS which performs joint analysis of paternity and sibship relationships for sibling arrays based on SNP data, and allows for integrating out uncertainty in relationships \cite{Ellis2018}. Here we extended the software to allow for additional non-genetic information to be included.
FAPS is based on a matrix G, with a row of each offspring and a column for each candidate father, where element \(g_{ij}\) is the probability that candidate j is the father of offspring i . Rows in G sum to one, and describe a multinomial distribution of probabilities of paternity over all candidate fathers. The final column of G is the probability that the father of each offspring was missing from the sample of candidates, based on the probability of observing offspring alleles from population allele frequencies, and an estimate of the proportion of possible pollen donors in the population that had been sampled. FAPS then uses hierarchical clustering to partition offspring into a sets of plausible full sibships. The probability that j is the father of putative full sibship k is then \(\prod_i g_{ij}\).
Similarly, we can describe the probability that j is the father of sibship k based on non-genetic information through a suitable function of those data as \(z_j\prod_i g_{ij}\). Here, we model \(z_j\)as \(Pr(d_{mj}|a,b,\lambda)\), but this formulation is general to any suitable function relating non-genetic data to probabilities of paternity. The likelihood of the whole array is then the product of probabilities for each full sibship within a partition, summed over all possible partition structures. When there are multiple maternal families, the likelihood of the whole dataset is the product of per-family likelihoods over each maternal family.

Inference via MCMC

We can then infer paremeters of interest by finding combinations of values that maximise the likelihood of the data. Here, there are four parameters of interest, while using the algorithm in FAPS to integrate out uncertainty in paternity and sibship relationships: (1) the proportion of missing fathers, (2) the shape and (3) scale parameters of the generalised normal distribution describing dispersal, and (4) the mixture parameter \(\lambda\) describing the strength of signal from the GND distribution of dispersal.
We used the Metropolis-Hastings Markov-chain Monte Carlo algorithm to infer the posterior distribution of these parameters. We ran four independent chains beginning from distant areas of the parameter space. At each iteration, each parameter was peturbed by a a factor drawn from a normal distribution with a fixed standard deviation for each parameter (\(\sigma\)= 0.025 for proportion of missing fathers and \(\lambda\)\(\sigma\)= 0.05 for shape and \(\sigma\)= 2 for scale). We ran each chain for 60,000 iterations to ensure mixing, and subsequently removed the first 1000 iterations of each as burn-in. After checking chain convergence (figures \ref{384337}-\ref{281170}) we thinned subsequent iterations to retain 250 posterior draws from each chain for further analyses.
We used beta prior distributions for proportion of missing fathers and \(\lambda\) (Beta(\(\alpha\)=3, \(\beta\)=15) and Beta(\(\alpha\)=1.1, \(\beta\)=1.1) respectively), and Gamma priors for dispersal shape and mean dispersal distance (\(\Gamma\)(shape=200, scale=200) and \(\Gamma\)(shape=2, scale=200) respectively; figure \ref{804996}). We used a prior for mean dispersal distance instead of the scale parameter of the GND because this has a more intuitive interpretation.