We collected two to three leaves for DNA extraction and dried these in
silica gel (Fischer Scientific) and one mature healthy flower for
phenotyping. Following the visual scoring system developed by (Whibley,
2006) we assigned plants to one of six flower phenotypes based on their
inferred genotypes at the loci Rosea and Sulfurea(Figure 2 ). ROS/ROS plants have intense magenta pigmentation,
whilst ros/ros plants have none, and heterozygotes are intermediate. The
SULF allele is fully dominant over sulf, and SULF/SULF and SULF/sulf
plants cannot be easily distinguished by human eyes, and will henceforth
be referred to collectively as SULF/+. sulf/sulf plant are yellow
throughout the face of the flower, whilst SULF/+ plants express aurones
only in a small patch at the centre of the flower.
In August 2012 we collected a single, mature, wild-pollinated fruit from
each of 60 mothers (figure [fig:map-2012]). In order to minimise
disturbance to the population we only sampled from plants which had set
a minimum of five mature fruits. These mothers were chosen to represent
an even sample of pigmentation genotypes, spread as evenly as possible
across the core of the hybrid zone where hybrids are most dense.
Genotyping
We grew seeds in 5cm plug trays filled with potting compost (Gramaflor)
in a greenhouse under Sylvania GroLux lights on a 16-hour cycle. We
sowed three seeds per plug for 50-70 plugs per maternal family and
thinned seedlings to a single seedling per plug after cotelydons had
appeared. We transferred approximately 1cm^2^ of fresh tissue from
1419 seedlingsto 96-well DNA-extraction plates (LGC Genomics, Berlin)
and allowed tissue to dry using the sample bag and silica gel provided.
For parental tissue from the hybrid zone we transferred approximately
1cm^{2} tissue dried in the field to the same plates. DNA
extractions of the plated tissue samples were carried out by LGC
Genomics [Is this in another publication yet I can cite?].
We genotyped tissue samples at 71 SNPs by KASPR sequencing (LGC
Genomics). These SNPs are a subsample of a panel used for a wider survey
of the hybrid zone (David Field, unpublished data). The total SNP panel
is a mixture of [how many?] diagnostic (showing a gradient in allele
frequency) and (how many?) parentage (with as even a gradient in allele
frequency as possible) SNPs. For parentage loci we chose only biallelic
loci with a minor allele frequency greater than 0.3 in each of inner
four pools closest to the centre of the cline, selected to maximise
mapping distance between loci. Diagnostic SNPs were either linked to
pigmentation loci, or else showed sharp clines across the hybrid zone.
To select a subsample of SNPs for this study, we selected markers which
were at least 2cM apart to maximise mapping distance between
individuals. We removed 474 offspring and four adults that had missing
data at more than 7.5% of the SNPs. We also pruned 7 SNPs that showed
more than 10% missing data, or less than 15% heterozygosity. This left
us with a set of 984 offspring from 60 maternal families, with between
two and 29 offspring per family (mean=16.4).
Pollen dispersal
kernel
A useful function for describing plant dispersal kernels is the
generalised normal distribution (GND). This is a generalisation of the
exponential family of probability distributions and includes the
exponential and standard normal distributions as special cases, but
allows for fat and thin tails (Austerlitz et al., 2004; Clark, 1998;
Nadarajah, 2005). The GND describes the probability of observing
dispersal distance d through scale parameter a and shape
parameter b:\[GND(d_{mj},a,b) = K \exp \left(\frac{-d}{a} \right)^b
\]
where and is a normalising constant and \(\Gamma\) is Euler's gamma function. The
function takes the form of the standard exponential and normal
distributions when b =1 and b =2 respectively. Values of b<1 indicate leptokurtic distributions with tails that
decay more slowly than would be expected under the exponential
distribution. The mean dispersal distance is given by the standard
deviation of the GND:
\[\sqrt\frac{a^2 \Gamma(^3/_b)}{\Gamma(^1/_b)}\]
We model the pollen dispersal kernel based on the distances between
known mothers of full-sibling families and the plants inferred to be the
fathers of those families. Leptokurtosis in this distribution may be
caused by a long tail of genuine mating events with distant fathers.
However, when unrelated individuals are incorrectly identifed as being
fathers of a full sibship, we expect these individuals to be drawn at
random from the population since the marker panel is designed to show no
spatial structure, whereas true fathers will (on average) be close to
maternal plants. This will inflate the apparent kurtosis in the data and
underestimate b . To accommodate this we model the probability
that candidate pollen donor individual j is the sire of a full
sibship with mother m given distance \(d_{mj}\) between them as a mixture
between a GND and a uniform distribution:
\[ \Pr(d_{mj}|a,b,\lambda) = \lambda GND(d_{mj},a,b) + \frac{1-\lambda}{N} \]
where N is the number of candidate fathers and parameter
determines the proportion of the probability mass due to ’real’
dispersal. Conveniently, also provides an estimate of the false positive
rate if we had run paternity analysis using genetic data alone.
Joint estimates of paternity, sibships and
dispersal
Allowing for covariates in paternity
inference
We have previously described a Python package FAPS which performs
joint analysis of paternity and sibship relationships for sibling arrays
based on SNP data, and allows for integrating out uncertainty in
relationships (Ellis et al., 2018). Here we extended the software to
allow for additional non-genetic information to be included (Hadfield et
al., 2006).
FAPS is based on a matrix G , with a row of each
offspring and a column for each candidate father, where element \(g_{ij}\) is the
probability that candidate j is the father of offspring i .
Rows in G sum to one, and describe a multinomial distribution
of probabilities of paternity over all candidate fathers. The final
column of G is the probability that the father of each
offspring was missing from the sample of candidates, based on the
probability of observing offspring alleles from population allele
frequencies, and an estimate of the proportion of possible pollen donors
in the population that had been sampled. FAPS then uses
hierarchical clustering to partition offspring into a sets of plausible
full sibships. The probability that j is the father of putative
full sibship k is then \(\prod_i g_{ij}\).
Similarly, we can describe the probability that j is the father
of sibship k based on non-genetic information through a suitable
function of those data as \(z_j\prod_i g_{ij}\). Here, we model \(z_j\)as \(Pr(d_{mj}|a,b,\lambda)\), but this formulation is
general to any suitable function relating non-genetic data to
probabilities of paternity. The likelihood of the whole array is then
the product of probabilities for each full sibship within a partition,
summed over all possible partition structures. When there are multiple
maternal families, the likelihood of the whole dataset is the product of
per-family likelihoods over each maternal family.
Inference via
MCMC
We can then infer paremeters of interest by finding combinations of
values that maximise the likelihood of the data. Here, there are four
parameters of interest, while using the algorithm in FAPS to
integrate out uncertainty in paternity and sibship relationships: (1)
the proportion of missing fathers, (2) the shape and (3) scale
parameters of the generalised normal distribution describing dispersal,
and (4) the mixture parameter \(\lambda\) describing the strength of signal from the
GND distribution of dispersal.
We used the Metropolis-Hastings Markov-chain Monte Carlo algorithm to
infer the posterior distribution of these parameters. We ran four
independent chains beginning from distant areas of the parameter space.
At each iteration, each parameter was peturbed by a a factor drawn from
a normal distritution with a fixed standard deviation for each parameter
(\(\sigma\)= 0.025 for proportion of missing fathers and \(\lambda\), \(\sigma\)= 0.05 for shape and \(\sigma\)= 2
for scale). We ran each chain for 60,000 iterations to ensure mixing,
and subsequently removed the first 1000 iterations of each as burn-in.
After checking chain convergence (supplementary figures 2-4) we thinned
subsequent iterations to retain 250 posterior draws from each chain for
further analyses.
We used beta prior distributions for proportion of missing fathers and \(\lambda\)
(Beta(\(\alpha\)=3, \(\beta\)=15) and Beta(\(\alpha\)=1.1, \(\beta\)=1.1) respectively), and Gamma priors for dispersal shape
and mean dispersal distance (\(\Gamma\)(shape=200, scale=200) and \(\Gamma\)\(\Gamma\)\(\Gamma\)(shape=2,
scale=200) respectively). We used a prior for mean dispersal distance
instead of the scale parameter of the GND because this has a more
intuitive interpretation.
Flower colour and male
fitness
We next investigated whether certain flower-colour genotypes sire more
offspring than would be expected given local genotype frequencies, and
whether this varies across the hybrid zone. We did this by comparing
’observed’ mating events, based on genetic and dispersal data, to
’expected’ mating events simulated based on dispersal only for mothers
in each of five bins of approximately 300m (n=3, 6, 21, 27, 3 mothers
per bin; supplementary figure 1). We calculated observed and expect
mating probabilities for each of 1000 sets of dispersal parameters and
proportions of missing fathers from the trimmed MCMC results.
To calculate observed mating probabilities we used the likelihoods fromFAPS that an individual candidate father had sired at least one
offspring with a maternal plant based on genotype data, his distance
from the mother, and dispersal parameters, integrating over uncertainty
in sibling relationships (Ellis et al., 2018). For expected mating
probabilities under random mating we calculated the probability of
mating between each pair of mothers and candidate fathers based on the
distance between them and dispersal parameters. We then summed these
values over all candidates of each Rosea and Sulfureagenotype, and over mothers within each spatial bin. We normalised
likelihoods for each maternal genotype to sum to one to give relative
probabilities than mothers in each bin received pollen from males of
each flower colour genotype.
Power
analyses
One explanation for apparent leptokurtosis in dispersal is incorrect
assignment of paternity to candidates far from maternal plants, either
because the true father was not sampled or due to insufficient
statistical power of the marker set. This would inflate the tail of the
dispersal kernel even if true dispersal were not leptokurtic (i.e. cause
the shape parameter of the GND to be less than one, even when it is
really greater or equal to one).
To test this, we first examined how often we should expect incorrect
paternity assignment using this marker set, and the extent to which
incorrect assignment and missing fathers would inflate apparent
dispersal. We used FAPS to simulate offspring based on observed maternal
and candidate male genotypes (Ellis et al., 2018). For each of the 60
mothers we sampled mates in proportion to their distance, assuming that
the pollen dispersal distances are exponentially distributed with a mean
of 10, 50, 100 and 200m. Since clustering offspring into sibships is the
most computationally demanding part of the analysis, we instead consider
paternity of individuals rather than sibships; this is valid because he
are primarily interested in mating events, not of sibling relationships
themselves. We simulated one offspring for each of N fathers for
each mother, where N is the most likely number of full sibship
families in observed families rounded to the nearest integer
(supplementary figure 2), to give a total of 433 offspring. We then
calculated (1) the posterior probability of paternity for each true sire
in the absence of information about dispersal, (2) mean distance between
mothers and known true sires and (3) weighted mean distance between
mothers and possible all sires, weighted by probability of paternity for
all candidates with a probability >0.001 of having sired at
least offspring. To test dependency on sampling effort we then removed
10%, 20%, 30% or 40% of true sires at random from the dataset and
recalculated weighted-mean distances between mothers and sires. We
repeated this procedure on 100 replicate simulated datasets.
We then used a second set of simulations to investigate the extent to
which a joint analysis of paternity and dispersal could mitigate the
inflation of leptokurtosis. We generated offspring as described above
assuming exponential dispersal of 50m only, and performed a simplified
joint-analysis of paternity and dispersal by grid interpolation. We
multiplied the matrix of paternity probabilities from genetic data by a
matrix of dispersal probabilities from \( \Pr(d_{mj}|a,b,\lambda)\) using combinations of scale (\(5 \leq a \leq 150\))
and shape (\(0.1 \leq b \leq 2\)) parameter values, with fixed \(\lambda\) at the posterior mean of 0.82.
We calculated the likelihood of each combination of parameters by
summing joint probabilities of paternity over each candidate father, and
multiplying those likelihoods over all 433 offspring. We then recorded
the combination of dispersal parameters with the highest (log)
likelihood. We also repeated this procedure for datasets excluding true
sires as described above. We repeated these analyses on 100 replicate
simulated datasets.
Results
Dispersal is
leptokurtic
FAPS grouped the 937 offspring into between 434 and 437 full-sibling
families, and identified 316 mating events between the 60 mothers and
candidate fathers with posterior probabilities greater than 0.99
(supplementary figure 2). Mixture parameter was strongly weighted
towards signal coming from generalised normal dispersal rather than
random draws from the population (mean=0.82, 95% credible intervals =
0.77, 0.93), indicating that most of the mating events reflect real
mating events (supplementary figure 4). The data were compatible with
broad range of plausible values for the proportion of missing fathers
(mean=0.21, 95% credible intervals = 0.06, 0.42; supplementary figure
4), but the posterior distribution was very similar to the prior
distribution (supplementary figure 3), suggesting that the data are not
informative about missing fathers.
Pollen dispersal distances are characterised by many dispersal events
between nearby plants, and a long tail of more distant mating events.
50% of the high-probability fathers were within 40m of the mother;
distances to remaining fathers decayed slowly up to a maximum of 2398m.
Consistent with this, the full pollen dispersal kernel inferred jointly
with sibships and paternity consistently showed shape parameters less
than one (posterior mean=0.40, 95% credible intervals: 0.329-0.483).
These results indicate strong leptokurtosis in the pollen dispersal
kernel.