We collected two to three leaves for DNA extraction and dried these in
silica gel (Fischer Scientific) and one mature healthy flower for
phenotyping. Following the visual scoring system developed by \citet{Whibley2006} we assigned plants to one of six flower phenotypes based on their
inferred genotypes at the loci Rosea and Sulfurea (figure \ref{821839}). ROS/ROS plants have intense magenta pigmentation,
whilst ros/ros plants have none, and heterozygotes are intermediate. The
SULF allele is fully dominant over sulf, and SULF/SULF and SULF/sulf
plants cannot be easily distinguished by human eyes, and will henceforth
be referred to collectively as SULF/+. sulf/sulf plant are yellow
throughout the face of the flower, whilst SULF/+ plants express aurones
only in a small patch at the centre of the flower.
In August 2012 we collected a single, mature, wild-pollinated fruit from
each of 60 mothers (figure \ref{666285}, figure \ref{120180}). In order to minimise
disturbance to the population we only sampled from plants which had set
a minimum of five mature fruits. These mothers were chosen to represent
an even sample of pigmentation genotypes, spread as evenly as possible
across the core of the hybrid zone where hybrids are most dense.
Genotyping
We grew seeds in 5cm plug trays filled with potting compost (Gramaflor)
in a greenhouse under Sylvania GroLux lights on a 16-hour cycle. We
sowed three seeds per plug for 50-70 plugs per maternal family and
thinned seedlings to a single seedling per plug after cotelydons had
appeared. We transferred approximately 1cm2 of fresh tissue from
1419 seedlings to 96-well DNA-extraction plates (LGC Genomics, Berlin)
and allowed tissue to dry using the sample bag and silica gel provided.
For parental tissue from the hybrid zone we transferred approximately
1cm^{2} tissue dried in the field to the same plates. DNA
extractions of the plated tissue samples were carried out by LGC
Genomics [Is this in another publication yet I can cite?].
We genotyped tissue samples at 71 SNPs by KASPR sequencing (LGC
Genomics). These SNPs are a subsample of a panel used for a wider survey
of the hybrid zone (David Field, unpublished data). The total SNP panel
is a mixture of [how many?] diagnostic (showing a gradient in allele
frequency) and (how many?) parentage (with as even a gradient in allele
frequency as possible) SNPs. For parentage loci we chose only biallelic
loci with a minor allele frequency greater than 0.3 in each of inner
four pools closest to the centre of the cline, selected to maximise
mapping distance between loci. Diagnostic SNPs were either linked to
pigmentation loci, or else showed sharp clines across the hybrid zone.
To select a subsample of SNPs for this study, we selected markers which
were at least 2cM apart to maximise mapping distance between
individuals. We removed 474 offspring and four adults that had missing
data at more than 7.5% of the SNPs. We also pruned 7 SNPs that showed
more than 10% missing data, or less than 15% heterozygosity. This left
us with a set of 984 offspring from 60 maternal families, with between
two and 29 offspring per family (mean=16.4).
Pollen dispersal
kernel
A useful function for describing plant dispersal kernels is the
generalised normal distribution (GND). This is a generalisation of the
exponential family of probability distributions and includes the
exponential and standard normal distributions as special cases, but
allows for fat and thin tails\cite{Clark1998,nadarajah2005generalized}. The GND describes the probability of observing
dispersal distance d through scale parameter a and shape
parameter b:\[GND(d_{mj},a,b) = K \exp \left(\frac{-d}{a} \right)^b
\]
where K is a normalising constant and \(\Gamma\) is Euler's gamma function. The
function takes the form of the standard exponential and normal
distributions when b =1 and b =2 respectively. Values of b<1 indicate leptokurtic distributions with tails that
decay more slowly than would be expected under the exponential
distribution. The mean dispersal distance is given by the standard
deviation of the GND:
\[\sqrt\frac{a^2 \Gamma(^3/_b)}{\Gamma(^1/_b)}\]
We model the pollen dispersal kernel based on the distances between
known mothers of full-sibling families and the plants inferred to be the
fathers of those families. Leptokurtosis in this distribution may be
caused by a long tail of genuine mating events with distant fathers.
However, when unrelated individuals are incorrectly identified as being
fathers of a full sibship, we expect these individuals to be drawn at
random from the population since the marker panel is designed to show no
spatial structure, whereas true fathers will (on average) be close to
maternal plants. This will inflate the apparent kurtosis in the data and
underestimate b . To accommodate this we model the probability
that candidate pollen donor individual j is the sire of a full
sibship with mother m given distance \(d_{mj}\) between them as a mixture
between a GND and a uniform distribution:
\[ \Pr(d_{mj}|a,b,\lambda) = \lambda GND(d_{mj},a,b) + \frac{1-\lambda}{N} \]
where N is the number of candidate fathers and parameter
determines the proportion of the probability mass due to ’real’
dispersal. Conveniently, also provides an estimate of the false positive
rate if we had run paternity analysis using genetic data alone.
Joint estimates of paternity, sibships and
dispersal
Allowing for covariates in paternity
inference
We have previously described a Python package FAPS which performs
joint analysis of paternity and sibship relationships for sibling arrays
based on SNP data, and allows for integrating out uncertainty in
relationships \cite{Ellis2018}. Here we extended the software to
allow for additional non-genetic information to be included.
FAPS is based on a matrix G, with a row of each
offspring and a column for each candidate father, where element \(g_{ij}\) is the
probability that candidate j is the father of offspring i .
Rows in G sum to one, and describe a multinomial distribution
of probabilities of paternity over all candidate fathers. The final
column of G is the probability that the father of each
offspring was missing from the sample of candidates, based on the
probability of observing offspring alleles from population allele
frequencies, and an estimate of the proportion of possible pollen donors
in the population that had been sampled. FAPS then uses
hierarchical clustering to partition offspring into a sets of plausible
full sibships. The probability that j is the father of putative
full sibship k is then \(\prod_i g_{ij}\).
Similarly, we can describe the probability that j is the father
of sibship k based on non-genetic information through a suitable
function of those data as \(z_j\prod_i g_{ij}\). Here, we model \(z_j\)as \(Pr(d_{mj}|a,b,\lambda)\), but this formulation is
general to any suitable function relating non-genetic data to
probabilities of paternity. The likelihood of the whole array is then
the product of probabilities for each full sibship within a partition,
summed over all possible partition structures. When there are multiple
maternal families, the likelihood of the whole dataset is the product of
per-family likelihoods over each maternal family.
Inference via
MCMC
We can then infer paremeters of interest by finding combinations of
values that maximise the likelihood of the data. Here, there are four
parameters of interest, while using the algorithm in FAPS to
integrate out uncertainty in paternity and sibship relationships: (1)
the proportion of missing fathers, (2) the shape and (3) scale
parameters of the generalised normal distribution describing dispersal,
and (4) the mixture parameter \(\lambda\) describing the strength of signal from the
GND distribution of dispersal.
We used the Metropolis-Hastings Markov-chain Monte Carlo algorithm to
infer the posterior distribution of these parameters. We ran four
independent chains beginning from distant areas of the parameter space.
At each iteration, each parameter was peturbed by a a factor drawn from
a normal distribution with a fixed standard deviation for each parameter
(\(\sigma\)= 0.025 for proportion of missing fathers and \(\lambda\), \(\sigma\)= 0.05 for shape and \(\sigma\)= 2
for scale). We ran each chain for 60,000 iterations to ensure mixing,
and subsequently removed the first 1000 iterations of each as burn-in.
After checking chain convergence (figures \ref{384337}-\ref{281170}) we thinned
subsequent iterations to retain 250 posterior draws from each chain for
further analyses.
We used beta prior distributions for proportion of missing fathers and \(\lambda\)
(Beta(\(\alpha\)=3, \(\beta\)=15) and Beta(\(\alpha\)=1.1, \(\beta\)=1.1) respectively), and Gamma priors for dispersal shape
and mean dispersal distance (\(\Gamma\)(shape=200, scale=200) and \(\Gamma\)(shape=2,
scale=200) respectively; figure \ref{804996}). We used a prior for mean dispersal distance
instead of the scale parameter of the GND because this has a more
intuitive interpretation.