Quantifying inclusive biodiversity from phylogenetically-conserved
candidate genes
We hereafter describe the main steps to reveal PCCGs from focal
communities (Figure 2). They mainly consist of (i) sampling specimens of
a focal community and extracting the DNA, (ii) identifying candidate genes from the
literature (and databases) and sequencing them, and (iii)
quantifying PCCGs diversity and performing analyses.
Defining and sampling the focal community. A key step is to
define the term “focal community”. First, the PCCGs approach can be
applied to all living entities (prokaryotes and eukaryotes), if (i)
candidate genes have been identified in the target taxonomic group, and
(ii) they are conserved phylogenetically among species within this
group. Nonetheless, phylogenetic conservatism has limits, so the
PCCGs approach cannot be used to estimate the diversity of communities
that contain highly divergent species (i.e., >20%
molecular divergence, see hereafter). We further propose that the focal
community from which PCCGs diversity is measured must follow an
“ecological logic”. Here, we therefore use Hubbell’s (2001)
definition: a focal community “is a group of trophically similar, sympatric
species that actually or potentially compete in a local area for the
same or similar resources”. This definition (i) roots our approach in
clearly defined theoretical and conceptual grounds, and (ii)
intrinsically satisfies our phylogenetic premise, as sympatric species
sharing a similar resource are likely to be phylogenetically close. Of
course, exceptions to this second premise exist; in these
cases, the focal community would be split into “phylogenetic
clusters”. Examples of focal communities satisfying this definition are
numerous: insectivorous fish, insect pollinators, desert plants,
tropical trees, detritivorous insects, etc.
A second important step is to sample this focal community. The goal here
is to sample all (or most) species of the focal community and the
diversity within each species to estimate the entire diversity of the
focal community. A first, a priori approach would consist of
sampling all known species from the focal community and, for each of
them, sampling several individuals (5-30 individuals per species,
depending on their rarity) to reveal intraspecific diversity. This
approach is appropriate when the focal community is already well
described taxonomically. An alternative “blinded” approach would
consist of sampling as many specimens as possible in the focal community
to provide a holistic and representative view of its diversity.
This approach does not require a priori knowledge of the focal community, and it best represents the actual
diversity (rare species may be less represented in the final pool, but
they are also inherently less represented in the actual community). This
approach is technically feasible because, as explained later, the DNA of
specimens can be pooled across species to investigate PCCGs
diversity. Both approaches are valuable since both intra- and
interspecific diversity are captured; the choice of one or the other
will depend on the local context and objectives.
Identifying and selecting relevant PCCGs. The second crucial step
concerns the selection of appropriate PCCGs (Figure 2b). We first draw
attention to a trade-off between intraspecific polymorphism and the
conservatism of PCCGs. Second, we describe how to identify the most
relevant traits associated with the targeted ecological process. Third,
we describe how to use the available literature to identify putative PCCGs
coding for these traits. Finally, we describe some bioinformatic tools
useful to recover in silico the sequences that best fit the
species from the focal community (see Figure 3).
An important prerequisite is that PCCGs must be polymorphic both among
and within species from the focal community. This condition is
nonetheless complicated to meet for all PCCGs from a panel (assuming
panels of 200-1000 genes or sequences per focal community), since genes
that are highly polymorphic intraspecifically are generally not
conserved among many species, and vice versa. For instance,
developmental genes are generally extremely conserved among species, but
are unlikely to be intraspecifically variable in most species from the
focal community
(Cardoso-Moreira et al. 2019). A compromise must therefore be reached to optimise
the final choice of PCCGs, and a potential solution is to mix genes with
various levels of conservatism in the PCCGs panel. This compromise
implies that some PCCGs from the panel will not necessarily be sequenced
in all species from the focal community (i.e., genes that are expected
to be intraspecifically variable), and/or that some PCCGs from the
panel will not display intraspecific polymorphism in most species from
the focal community (i.e., genes that are expected to be conserved in
all species).
The choice of relevant traits will mostly depend upon the targeted
ecological process(es). For instance, for pollination, traits targeted
in the plant community could be accessibility of floral reward, floral
shape or colour and floral scent production
(Klahre et al. 2011; Naghiloo et al. 2020). For leaf litter decomposition in
freshwaters, potential traits of a decomposer crustacean community
associated with this function could be locomotion activity, body size or
food assimilation
(Rota et al. 2018) (Figure 3a). As the PCCGs approach assumes that hundreds of genes
with small effect sizes will be sequenced, it is mandatory to be
inclusive rather than reductionist in trait selection. This list of
traits will be the basis for searching associated candidate genes in the
literature. Notably, pleiotropic genes (i.e., genes that affect
multiple traits) are excellent putative PCCGs as they are particularly
relevant for linking traits to ecological processes and functions
(Ducrest et al. 2008; Watanabe et al. 2019). In the same vein, neutral genes (or
sequences) randomly taken from the genome (or known to be neutral) can
be added to the panel of genes to test, for instance, the role of
selection vs. drift.
The existing literature relevant to identifying PCCGs is extensive, and
mainly relies on functional genomics (links between genes and traits)
and functional ecology (links between traits and ecosystem processes)
studies (Figure 3b). Candidate genes are directly identified from the
profuse literature establishing a link between a gene and its phenotypic
function at the individual level. Most of these studies focus on
plant or animal models (e.g., Arabidopsis thaliana, Zea mays, Mus musculus, Drosophila melanogaster, Danio rerio…) and “semi-model” species
(Macrobrachium rosenbergii, Populus nigra, Cyprinus
carpio…). Although natural communities often lack these
species, the species under study generally have a close phylogenetic
relative among these models, which makes the models relevant for identifying
putative PCCGs. Specific reviews focusing on candidate genes underlying
a particular trait (e.g., 47 genes associated with crustacean
growth, Jung et al. 2014; 98 genes associated with plant disease resistance,
Sekhwal et al. 2015) and case studies that have identified a specific gene polymorphism
responsible for variation in an individual trait are also valuable. For
instance, for floral scent production (associated with pollination),
existing studies have identified allelic variation at three loci encoding the
MYB transcription factor ODORANT1
(Klahre et al. 2011), the LIMONENE-MYRCENE SYNTHASE (LMS) and the OCIMENE SYNTHASE (OS)
(Byers et al. 2014). For food assimilation in crustaceans, the GLUCOSE TRANSPORTER
PROTEIN (Wang et al. 2016) and CATHEPSIN L SYNTHESIS
(Jung et al. 2013) genes are two potential PCCGs. To summarise, the basic information is
already available: one just needs to dig into the literature linking genes
to important traits to create a panel of hundreds of putative PCCGs for a
given trait or function (Figure 3b).
Usually, initial sequences of putative PCCGs can be retrieved directly
from papers, or from databases such as NCBI using appropriate keywords
(Figure 3c). To continue with the example of floral scent production, gene
sequences of LMS and OS are available both in the original paper (Byers
et al. 2014) and on NCBI (the query “ocimene synthase arabidopsis” returned
9 hits in September 2022). The next step is to obtain the homologous
sequences of these PCCGs in a species that is phylogenetically as close
as possible to those of the focal community or, even better, that
belongs to the focal community. This step consists of blasting the
sequences found in model species (Figure 3d) with appropriate search
engines (or against in-house reference genome(s) of the species of interest)
to find their homologues in the reference genome(s) that
is (are) closest to the focal community. These final PCCG sequences
will best match the phylogenetic composition of the focal community
(see Faircloth 2017 for further details).
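As an illustration of the homology search described above, the following minimal sketch filters BLAST hits saved in the standard tabular format (-outfmt 6) and keeps, for each putative PCCG, the best-scoring hit in the reference genome. The function name, the sequence identifiers, and the identity and length cut-offs are hypothetical choices made for the example, not values prescribed by the approach.

```python
import csv
import io

# Column order of BLAST tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(tabular_text, min_identity=80.0, min_length=100):
    """Return the highest-bitscore hit per query that passes both filters.

    min_identity and min_length are illustrative assumptions.
    """
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        if float(hit["pident"]) < min_identity:
            continue
        if int(hit["length"]) < min_length:
            continue
        current = best.get(hit["qseqid"])
        if current is None or float(hit["bitscore"]) > float(current["bitscore"]):
            best[hit["qseqid"]] = hit
    return best


# Made-up hits for two model-species genes against a reference genome.
example = "\n".join([
    "OS_model\tscaffold_12\t87.5\t640\t80\t0\t1\t640\t1500\t2139\t1e-150\t520",
    "OS_model\tscaffold_40\t91.0\t90\t8\t0\t1\t90\t10\t99\t1e-20\t110",
    "LMS_model\tscaffold_3\t72.0\t500\t140\t0\t1\t500\t300\t799\t1e-40\t210",
])

hits = best_hits(example)
for query, hit in hits.items():
    print(query, hit["sseqid"], hit["pident"])
```

In this toy input, only the first hit is retained: the second is too short and the third falls below the identity cut-off, so the corresponding gene would need another reference or be dropped from the panel.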
Sequencing hundreds of PCCGs across species. PCCGs sequencing
benefits from the recent development of target enrichment methods
(capture of specific regions of the genome, Mertes et al. 2011; Jones & Good 2016;
Jiménez‐Mena et al. 2022). Here, we focus on the
hybridization-based capture sequencing (HBCS) method, which is
classically used in phylogenomic studies and is efficient at retrieving
sequences from species that display up to 20% molecular divergence
(Hawkins et al. 2016). The general principle of HBCS is to design oligonucleotides
(called “probes” or “baits”) that are complementary to the target
(PCCG) sequences. These oligos enrich complementary sequences from a
next-generation sequencing (NGS) library. The classical NGS library
preparation workflow is supplemented by the capture of targeted sequences
before the sequencing step, which reduces the size of the library and
hence the sequencing cost. This method was described in 2007 and
has been used in many taxa
(Albert et al. 2007; Mamanova et al. 2010); some studies thoroughly
describe its use and potential for evolutionary research
(Faircloth 2017; Jiménez‐Mena et al. 2022). A main advantage, compared to
the traditional approach based on PCR enrichment, is that HBCS tolerates
large mismatches between probes and target sequences, making it possible to
sequence species that diverge by 15-20%; this threshold is the one that
should (ideally) be used to define the appropriate focal species. As
said above, if the focal community contains species with a higher level
of divergence, it is possible to develop several probe sets according to
“phylogenetic clusters” (groups of species from the focal community that are below
the 20% divergence threshold).
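The splitting of a focal community into “phylogenetic clusters” can be sketched as follows. This is only one possible heuristic, assuming that an alignment of a reference sequence is available for each species: divergence is measured as the uncorrected proportion of differing sites (p-distance), and each species is greedily assigned to the first cluster in which it diverges by at most 20% from every member. The function names and the toy sequences are hypothetical.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing sites among aligned positions without gaps."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites between the two sequences")
    return sum(a != b for a, b in pairs) / len(pairs)


def phylogenetic_clusters(alignment, threshold=0.20):
    """Greedy grouping: a species joins the first cluster in which its
    divergence from every member is at most `threshold`."""
    clusters = []
    for species, seq in alignment.items():
        for cluster in clusters:
            if all(p_distance(seq, alignment[m]) <= threshold for m in cluster):
                cluster.append(species)
                break
        else:
            # No existing cluster fits, so the species starts a new one.
            clusters.append([species])
    return clusters


# Toy alignment of 20 sites (made-up sequences).
alignment = {
    "species_A": "ACGTACGTACGTACGTACGT",
    "species_B": "ACGTACGTACGTACGTACGA",  # 1/20 = 5% divergence from A
    "species_C": "ACGTTCGTACGAACGTACGT",  # 2/20 = 10% divergence from A
    "species_D": "TGCATGCAACGTACGTACGT",  # 8/20 = 40% divergence from A
}

clusters = phylogenetic_clusters(alignment)
print(clusters)
```

Each resulting cluster would then receive its own probe set. The greedy assignment depends on the order in which species are processed, so a formal clustering method may be preferable for large communities.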
HBCS can be performed (i) at the individual level, in which case all
individuals from all species are sequenced independently, or (ii) at the
focal community level, in which case the DNA of all individuals from all
species of the community is pooled
(from 50-100 individuals per pool, Schlötterer et al. 2014; Abrams et
al. 2021) and this DNA pool is then sequenced. Individual-based
sequencing is more costly but provides more precise information that can
be used to relate specific gene polymorphism to individual traits or to
ecological processes for instance. In contrast, pool-seq approaches are
extremely affordable given the current power of sequencers. For
instance, for 48 focal communities, each composed of 10 species (from
which we sampled 5 individuals per species), the cost for DNA
extraction, library preparation, capture and sequencing would be
~240000 euros if performed at the individual level,
whereas it would be ~10000 euros if performed using a
pool-seq approach. Pool-seq approaches do
not provide individual data, but they are sufficient to estimate allele
frequencies for each marker
(Sham et al. 2002; Gautier et al. 2022), and hence to estimate inclusive
biodiversity from PCCGs (see hereafter). Moreover, pool-seq approaches
are increasingly being used with remarkable success, and many tools
have been developed for improving evolutionary inferences from these
data (Schlötterer et al. 2014; Gautier et al. 2022). Pool-seq approaches
are hence, in our opinion, the best option for developing the PCCGs
approach in a wide range of contexts.
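The cost comparison above can be unpacked with simple arithmetic. The short sketch below only back-calculates the per-library costs implied by the two totals quoted in the text; the per-unit figures are therefore derived values under the assumption of one library per individual (or per pool), not quoted prices.

```python
# Sampling design given in the text.
n_communities = 48
species_per_community = 10
individuals_per_species = 5

# Approximate totals quoted in the text (euros).
individual_total = 240_000
pooled_total = 10_000

# One library per individual versus one library per community pool.
n_individuals = n_communities * species_per_community * individuals_per_species
n_pools = n_communities

cost_per_individual = individual_total / n_individuals
cost_per_pool = pooled_total / n_pools
ratio = individual_total / pooled_total

print(f"{n_individuals} individual libraries, ~{cost_per_individual:.0f} euros each")
print(f"{n_pools} pooled libraries, ~{cost_per_pool:.0f} euros each")
print(f"individual-level sequencing is ~{ratio:.0f} times more expensive")
```

The difference is driven almost entirely by the number of libraries to prepare, capture and sequence (2400 versus 48).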
Defining metrics for estimating PCCGs diversity of focal
communities. Given that raw data obtained from HBCS are DNA sequences,
all metrics used by population geneticists and community
phylogeneticists can be used to describe biodiversity patterns. Overall,
biodiversity metrics must follow the classical diversity partitioning
proposed by ecologists in the 1960s
(Whittaker 1960),
which includes the α and γ components as the local and regional diversity
components, respectively, and the β component, which quantifies the differentiation in diversity
among local sites. This framework was initially applied to communities
and variation in species diversity within and between local sites, and
was later extended to trait and phylogenetic measures of (meta-)community
diversity (Pavoine & Bonsall 2011; Mouquet et al. 2012; Pavoine & Izsák 2014b; Tucker et al. 2017; Carmona et al. 2019b). Population
geneticists (and ecologists) recognized that the metrics traditionally
used to describe genetic diversity patterns in (meta-)populations (such
as allelic richness or Fst) actually conform to Whittaker’s
framework, that tight (statistical) connections exist between the
“population” and “community” approaches, and that developing a
unified framework to analyse diversity patterns across populations and
communities would be beneficial
(Vellend 2005; Jost 2008; Gaggiotti et al. 2018). Many papers have discussed the specific
metrics that should be used to unify disciplines (e.g., Gaggiotti et al. 2018), but we do not intend to steer readers towards a
specific type of metric, as they all have their advantages and
disadvantages, and the choice of a metric should be dictated by the
scientific goals
(Mouquet et al. 2012; Tucker et al. 2017). For instance, Fst provides
information on drift
(Holsinger & Weir 2009), whereas some dissimilarity metrics can provide precise cues
about the relative roles of nestedness and turnover in explaining
regional patterns of β-diversity
(Baselga 2010).
Nonetheless, we underline that the choice of inclusive biodiversity
metrics derived from PCCGs must follow the principle that intra- and
interspecific diversity are actually shaped by similar processes (drift,
selection, mutation/speciation, dispersal) acting over a continuum from
ecological to evolutionary scales
(Hubbell 2001; Vellend & Geber 2005). Describing biodiversity using PCCGs inherently
helps to follow this principle.
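To make the α, β and γ partitioning concrete, here is a minimal sketch based on expected heterozygosity at bi-allelic SNPs. It is one possible choice among many, used purely for illustration: α is the mean within-site heterozygosity, γ is the heterozygosity computed from allele frequencies averaged across sites, and β is the resulting Fst-like ratio (Nei's G_ST). Equal weighting of sites and the toy frequencies are assumptions of the example.

```python
def expected_heterozygosity(p):
    """Gene diversity at a bi-allelic SNP with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def partition_diversity(freqs_by_site):
    """freqs_by_site maps each local site to a list of allele frequencies,
    one per SNP, in the same SNP order. Sites are weighted equally."""
    sites = list(freqs_by_site.values())
    n_sites = len(sites)
    n_snps = len(sites[0])

    # alpha: mean within-site heterozygosity across SNPs and sites.
    alpha = sum(expected_heterozygosity(p) for site in sites for p in site)
    alpha /= n_sites * n_snps

    # gamma: heterozygosity of the regional pool (mean frequency per SNP).
    mean_p = [sum(site[j] for site in sites) / n_sites for j in range(n_snps)]
    gamma = sum(expected_heterozygosity(p) for p in mean_p) / n_snps

    # beta: proportion of regional diversity found among sites.
    beta = (gamma - alpha) / gamma if gamma > 0 else 0.0
    return alpha, gamma, beta


# Two local communities and two SNPs (made-up frequencies): the first SNP
# is shared at equal frequency, the second is fixed for different alleles.
freqs = {
    "site_1": [0.5, 1.0],
    "site_2": [0.5, 0.0],
}

alpha, gamma, beta = partition_diversity(freqs)
print(alpha, gamma, beta)
```

Because allele frequencies are computed across all individuals regardless of species, the same calculation applies whether the variation is intra- or interspecific.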
Concretely, one needs to consider the type of data that can be gathered
either from individual or pooled sequencing approaches. In the first
case, the data consist of a series of aligned DNA sequences, each
attributed to a single specimen and to a given gene. SNP loci (including
both intra- and interspecific SNPs) and haplotypes that group all loci
from a given sequence (or gene) can be derived from these data. SNPs are
classical bi-allelic loci from which many types of metrics can be
derived; the number of polymorphic SNPs estimated from PCCGs can be
compared among communities (a community composed of a few species will
likely have a lower number of polymorphic SNPs than a community composed
of many species, even if the former is rich intraspecifically), the
evenness can be derived from allele frequencies, as well as the
differentiation (dissimilarity) among local communities
(e.g., Gaggiotti et al. 2018), etc. Haplotypes can be used to build phylogenetic
trees (including both intraspecific and interspecific tips) from which
all types of phylogenetic community metrics can be derived
(Tucker et al. 2017). Possibilities are more restricted for the pool-seq approach. In
that case, a series of SNPs is retrieved, together with their
relative frequencies within the community; alleles cannot be attributed
to a particular species or a particular individual within a species,
which impedes the reconstruction of haplotypes. For pool-seq approaches,
the metrics derived from SNP data (including information on allele
frequencies) are therefore favoured
(Schlötterer et al. 2014).
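As a rough illustration of the SNP-based metrics available from pool-seq data, the sketch below estimates allele frequencies as simple read proportions, counts polymorphic SNPs, and computes a Shannon-based evenness per SNP. This naive estimator ignores the sampling noise linked to pool size and sequencing errors, which dedicated pool-seq tools account for. The depth and minor-allele-frequency thresholds, the function names and the read counts are all hypothetical.

```python
import math


def allele_frequencies(read_counts, min_depth=20):
    """Estimate the alternative-allele frequency of each SNP as a read
    proportion, skipping SNPs covered by fewer than min_depth reads."""
    freqs = {}
    for snp, (ref_reads, alt_reads) in read_counts.items():
        depth = ref_reads + alt_reads
        if depth >= min_depth:
            freqs[snp] = alt_reads / depth
    return freqs


def n_polymorphic(freqs, min_maf=0.05):
    """Number of SNPs whose minor-allele frequency is at least min_maf."""
    return sum(1 for p in freqs.values() if min(p, 1.0 - p) >= min_maf)


def shannon_evenness(p):
    """Evenness of the two alleles at one SNP: 0 when one allele is fixed,
    1 when both alleles are equally frequent."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    entropy = -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))
    return entropy / math.log(2)


# Made-up (reference, alternative) read counts for one pooled community.
counts = {
    "snp1": (50, 50),
    "snp2": (98, 2),
    "snp3": (5, 5),     # too shallow, discarded
    "snp4": (0, 100),   # fixed for the alternative allele
}

freqs = allele_frequencies(counts)
print(freqs)
print(n_polymorphic(freqs))
print({snp: round(shannon_evenness(p), 3) for snp, p in freqs.items()})
```

These per-community frequencies are the only input needed for frequency-based α, β and γ metrics, which is why the pooled design remains sufficient for estimating inclusive biodiversity.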