Quantifying inclusive biodiversity from phylogenetically-conserved candidate genes
We hereafter describe the main steps to reveal PCCGs from focal communities (Figure 2). They mainly consist of (i) sampling specimens of a focal community and extracting their DNA, (ii) identifying the genes from the literature (and databases) and sequencing them, and (iii) quantifying PCCGs diversity and performing analyses.
Defining and sampling the focal community. A key step is to define the term “focal community”. First, the PCCGs approach can be applied to all living entities (prokaryotes and eukaryotes), provided that (i) candidate genes have been identified in the target taxonomic group, and (ii) they are phylogenetically conserved among species within this group. Nonetheless, phylogenetic conservatism is limited, so the PCCGs approach cannot be used to estimate the diversity of communities that contain highly divergent species (i.e., >20% molecular divergence, see hereafter). We further propose that the focal community from which PCCGs diversity is measured must follow an “ecological logic”. Here, we therefore use Hubbell’s (2001) definition: a focal community “is a group of trophically similar, sympatric species that actually or potentially compete in a local area for the same or similar resources”. This definition (i) roots our approach in clearly defined theoretical and conceptual grounds, and (ii) intrinsically satisfies our phylogenetic premise, as sympatric species sharing a similar resource are likely to be phylogenetically close. Of course, exceptions to this second premise exist, in which case the focal community would be split into “phylogenetic clusters”. Examples of focal communities satisfying this definition are numerous: insectivorous fish, insect pollinators, desert plants, tropical trees, detritivorous insects, etc.
A second important step is to sample this focal community. The goal here is to sample all (or most) species of the focal community, and the diversity within each species, to estimate the entire diversity of the focal community. A first, a priori approach would consist of sampling all known species from the focal community and, for each of them, sampling several individuals (5-30 individuals per species depending on their rarity) to reveal intraspecific diversity. This approach is appropriate when the focal community is already well described taxonomically. An alternative “blinded” approach would consist of sampling as many specimens as possible in the focal community to provide a holistic and representative view of its diversity. This approach does not require a priori knowledge of the focal community, and it best represents the actual diversity (rare species may be less represented in the final pool, but they are also inherently less represented in the actual community). This approach is technically feasible because, as explained later, the DNA of specimens can be pooled across species to investigate PCCGs diversity. Both approaches are valuable since both intra- and interspecific diversity are captured; the choice of one or the other will depend on the local context and objectives.
Identifying and selecting relevant PCCGs. The second crucial step concerns the selection of appropriate PCCGs (Figure 2b). We first draw attention to a trade-off between intraspecific polymorphism and the conservatism of PCCGs. Second, we describe how to identify the most relevant traits associated with the targeted ecological process. Third, we describe how to use the available literature to identify putative PCCGs coding for these traits. Finally, we describe some bioinformatic tools useful to recover in silico the sequences that best fit the species from the focal community (see Figure 3).
An important prerequisite is that PCCGs must be polymorphic both among and within species from the focal community. This condition is nonetheless difficult to meet for all PCCGs from a panel (assuming panels of 200-1000 genes or sequences per focal community), since genes that are highly polymorphic intraspecifically are generally not conserved among many species, and vice versa. For instance, developmental genes are generally extremely conserved among species, but are unlikely to be intraspecifically variable in most species from the focal community (Cardoso-Moreira et al. 2019). A compromise must therefore be reached to optimise the final choice of PCCGs, and a potential solution is to mix genes with various levels of conservatism in the PCCGs panel. This compromise implies that some PCCGs from the panel will not necessarily be sequenced in all species from the focal community (i.e., genes that are expected to be intraspecifically variable), and/or that some PCCGs from the panel will not display intraspecific polymorphism in most species from the focal community (i.e., genes that are expected to be conserved in all species).
The choice of relevant traits will mostly depend upon the targeted ecological process(es). For instance, for pollination, traits targeted in the plant community could be accessibility of floral reward, floral shape or colour, and floral scent production (Klahre et al. 2011; Naghiloo et al. 2020). For leaf litter decomposition in freshwaters, potential traits of a decomposer crustacean community associated with this function could be locomotion activity, body size or food assimilation (Rota et al. 2018) (Figure 3a). As the PCCGs approach assumes that hundreds of genes with small effect sizes will be sequenced, it is essential to be inclusive rather than reductionist in trait selection. This list of traits will be the basis for searching for associated candidate genes in the literature. Notably, pleiotropic genes (i.e., genes that affect multiple traits) are excellent putative PCCGs as they are particularly relevant for linking traits to ecological processes and functions (Ducrest et al. 2008; Watanabe et al. 2019). In addition, neutral genes (or sequences) randomly taken from the genome (or known to be neutral) can be added to the panel of genes to test, for instance, the role of selection vs. drift.
The existing literature relevant to identifying PCCGs is extensive, and mainly relies on functional genomics (links between genes and traits) and functional ecology (links between traits and ecosystem processes) studies (Figure 3b). Candidate genes are directly identified from the abundant literature establishing a link between a gene and its phenotypic function at the individual level. Most of these studies focus on plant or animal models (e.g., Arabidopsis thaliana, Zea mays, Mus musculus, Drosophila melanogaster, Danio rerio…) and “semi-model” species (Macrobrachium rosenbergii, Populus nigra, Cyprinus carpio…). Although natural communities often lack these species, most study organisms have a close phylogenetic relative among these models, which makes the models relevant for identifying putative PCCGs. Specific reviews focusing on candidate genes underlying a particular trait (e.g., 47 genes associated with crustacean growth, Jung et al. 2014; 98 genes associated with plant disease resistance, Sekhwal et al. 2015) and case studies that have identified a specific gene polymorphism responsible for variation in an individual trait are also valuable. For instance, for floral scent production (associated with pollination), existing studies have identified allelic variation at three loci encoding the MYB transcription factor ODORANT1 (Klahre et al. 2011), the LIMONENE-MYRCENE SYNTHASE (LMS) and the OCIMENE SYNTHASE (OS) (Byers et al. 2014). For food assimilation in crustaceans, the GLUCOSE TRANSPORTER PROTEIN (Wang et al. 2016) and CATHEPSIN L SYNTHESIS (Jung et al. 2013) genes are two potential PCCGs. To summarise: the basic information is already available; one just needs to dig into the literature linking genes to important traits to create a panel of hundreds of putative PCCGs for a given trait or function (Figure 3b).
Usually, initial sequences of putative PCCGs can be retrieved directly from papers, or from databases such as NCBI using appropriate keywords (Figure 3c). To continue with the example of floral scent production, gene sequences of LMS and OS are available both in the initial paper (Byers et al. 2014) and on NCBI (a search for “ocimene synthase arabidopsis” returned 9 hits in September 2022). The next step is to obtain the homologous sequences of these PCCGs in a species that is phylogenetically as close as possible to those of the focal community or, even better, that belongs to the focal community. This step consists of running BLAST searches with the sequences found in model species (Figure 3d), using appropriate search engines (or the in-house reference genome(s) of the study species), to find their homologues in the reference genome(s) closest to the focal community. These final PCCG sequences will best match the phylogenetic composition of the focal community (see Faircloth 2017 for further details).
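As an illustration of the homology search step, the sketch below filters BLAST results to keep one best hit per putative PCCG. It assumes that the search was run with the default tabular output of BLAST+ (-outfmt 6), and it uses only the Python standard library. The sequence identifiers, the example hits and the thresholds (min_identity, min_length) are hypothetical; the 80% identity cut-off merely echoes the ~20% divergence limit discussed in this paper and is not a recommended setting.

```python
# Column order of the default BLAST+ tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(tabular_text, min_identity=80.0, min_length=100):
    """Return {query_id: hit} keeping the highest-bitscore hit per query.

    Hits below min_identity (% identity) or min_length (alignment length)
    are discarded. Both thresholds are illustrative assumptions.
    """
    best = {}
    for line in tabular_text.strip().splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(BLAST_COLUMNS, line.split("\t")))
        hit["pident"] = float(hit["pident"])
        hit["length"] = int(hit["length"])
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        if hit["pident"] < min_identity or hit["length"] < min_length:
            continue
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best


# Invented hits for two model-species queries against a reference genome.
EXAMPLE = "\n".join([
    "OS_model\tscaffold_12\t87.5\t640\t80\t0\t1\t640\t1500\t2139\t1e-150\t520",
    "OS_model\tscaffold_40\t82.0\t300\t54\t0\t1\t300\t10\t309\t1e-60\t210",
    "LMS_model\tscaffold_03\t74.0\t500\t130\t0\t1\t500\t900\t1399\t1e-40\t150",
])

hits = best_hits(EXAMPLE)
for query, hit in hits.items():
    print(query, hit["sseqid"], hit["pident"])
```

In this toy input, the LMS query has no hit above the identity threshold and would therefore need a closer reference genome or a more permissive threshold.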
Sequencing hundreds of PCCGs across species. PCCGs sequencing benefits from the recent development of target enrichment methods (capture of specific regions of the genome, Mertes et al. 2011; Jones & Good 2016; Jiménez‐Mena et al. 2022). Here, we focus on the hybridization-based capture sequencing (HBCS) method, which is classically used in phylogenomic studies and is efficient at retrieving sequences from species that display up to 20% molecular divergence (Hawkins et al. 2016). The general principle of HBCS is to design oligonucleotides (called “probes” or “baits”) that are complementary to the target (PCCG) sequences. These oligos enrich complementary sequences from a Next-Generation-Sequencing (NGS) library. The classical NGS library preparation workflow is completed by the capture of targeted sequences before the sequencing step, which reduces the size of the library and hence the sequencing cost. This method was first described in 2007 and has been used in many taxa (Albert et al. 2007; Mamanova et al. 2010); some studies thoroughly describe its use and potential for evolutionary studies (Faircloth 2017; Jiménez‐Mena et al. 2022). A main advantage, compared to traditional approaches based on PCR enrichment, is that HBCS tolerates large mismatches between probes and target sequences, which allows sequencing species that diverge by 15-20%; this threshold is the one that should (ideally) be used to define the appropriate focal species. As said above, if the focal community contains species with a higher level of divergence, it is possible to develop several probe sets according to “phylogenetic clusters” (species from the focal community that are below the 20% divergence threshold).
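The 20% divergence threshold and the notion of “phylogenetic clusters” can be illustrated with a minimal sketch. It assumes that molecular divergence is approximated by the uncorrected p-distance (proportion of differing sites) between pre-aligned sequences, and it groups species with a simple greedy rule in which a species joins a cluster only if it is within the threshold of every member. Both choices are simplifying assumptions made for illustration (the greedy grouping depends on input order), and the toy sequences are invented.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Sites with a gap ('-') or an ambiguous base ('N') in either sequence
    are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "-N" or b in "-N":
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def phylogenetic_clusters(aligned, threshold=0.20):
    """Greedily group species so that all pairs within a cluster
    diverge by at most `threshold` (default: 20%)."""
    clusters = []
    for name, seq in aligned.items():
        for cluster in clusters:
            if all(p_distance(seq, aligned[m]) <= threshold for m in cluster):
                cluster.append(name)
                break
        else:
            clusters.append([name])  # no compatible cluster: start a new one
    return clusters


# Invented 20-bp alignment for three hypothetical species.
toy_alignment = {
    "sp_A": "ACGTACGTACGTACGTACGT",
    "sp_B": "ACGTACGTACGTACGTACGA",  # 1 difference from sp_A (5%)
    "sp_C": "TTTTTTTTACGTACGTACGT",  # 6 differences from sp_A (30%)
}

clusters = phylogenetic_clusters(toy_alignment)
print(clusters)
```

Under these assumptions, each resulting cluster would correspond to one probe set.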
HBCS can be performed (i) at the individual level, in which case all individuals from all species are sequenced independently, or (ii) at the focal community level, in which case the DNA of all individuals from all species of the community is pooled (50-100 individuals per pool, Schlötterer et al. 2014; Abrams et al. 2021) and this DNA pool is then sequenced. Individual-based sequencing is more costly but provides more precise information that can be used, for instance, to relate specific gene polymorphisms to individual traits or to ecological processes. In contrast, pool-seq approaches are extremely affordable given the current power of sequencers. For instance, for 48 focal communities, each composed of 10 species (with 5 individuals sampled per species), the cost for DNA extraction, library preparation, capture and sequencing would be ~240,000 euros if performed at the individual level, whereas it would be ~10,000 euros if performed using a pool-seq approach. Pool-seq approaches do not provide individual data, but they are sufficient to obtain allele frequencies for each marker (Sham et al. 2002; Gautier et al. 2022), and hence to estimate inclusive biodiversity from PCCGs (see hereafter). Moreover, pool-seq approaches are increasingly being used with considerable success, and many tools have been developed for improving evolutionary inferences from these data (Schlötterer et al. 2014; Gautier et al. 2022). Pool-seq approaches are hence, in our opinion, the best option for developing the PCCGs approach in a wide range of contexts.
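The arithmetic behind this cost comparison can be made explicit with a few lines of code. The totals are those given in the text; the per-individual and per-pool figures are merely implied by those totals (assuming one pool per focal community) and are not quoted prices.

```python
# Sampling design from the worked example in the text.
n_communities = 48
species_per_community = 10
individuals_per_species = 5

# Approximate totals given in the text (euros).
individual_total_eur = 240_000
pooled_total_eur = 10_000

# Individual-level design: every specimen is a separate library.
n_individuals = n_communities * species_per_community * individuals_per_species

# Pool-seq design: one DNA pool per focal community (assumption).
n_pools = n_communities
individuals_per_pool = species_per_community * individuals_per_species

# Unit costs implied by the totals (derived, not quoted).
cost_per_individual = individual_total_eur / n_individuals
cost_per_pool = pooled_total_eur / n_pools
fold_difference = individual_total_eur / pooled_total_eur

print(f"{n_individuals} individual libraries at ~{cost_per_individual:.0f} euros each")
print(f"{n_pools} pools of {individuals_per_pool} individuals at ~{cost_per_pool:.0f} euros each")
print(f"individual-level sequencing is ~{fold_difference:.0f}x more expensive")
```

Each pool contains 50 individuals in this design, which sits at the lower end of the 50-100 individuals per pool mentioned above.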
Defining metrics for estimating PCCGs diversity of focal communities. Given that raw data obtained from HBCS are DNA sequences, all metrics used by population geneticists and community phylogeneticists can be used to describe biodiversity patterns. Overall, biodiversity metrics must follow the classical diversity partitioning proposed by ecologists in the 1960s (Whittaker 1960), which includes the α and γ components as the local and regional diversity components, and the β component, which quantifies the differentiation in diversity among local sites. This framework was initially applied to communities and to variation in species diversity within and between local sites, and was extended to trait and phylogenetic measures of (meta-)community diversity (Pavoine & Bonsall 2011; Mouquet et al. 2012; Pavoine & Izsák 2014b; Tucker et al. 2017; Carmona et al. 2019b). Population geneticists (and ecologists) recognized that the metrics traditionally used to describe genetic diversity patterns in (meta-)populations (such as allelic richness or Fst) actually conform to Whittaker’s framework, that tight (statistical) connections exist between the “population” and “community” approaches, and that developing a unified framework to analyse diversity patterns across populations and communities would be beneficial (Vellend 2005; Jost 2008; Gaggiotti et al. 2018). Many papers have discussed the specific metrics that should be used to unify disciplines (e.g., Gaggiotti et al. 2018), but we do not intend to orient readers towards a specific type of metric, as they all have their advantages and disadvantages, and the choice of a metric should be dictated by the scientific goals (Mouquet et al. 2012; Tucker et al. 2017). For instance, Fst provides estimates and information on drift (Holsinger & Weir 2009), whereas some dissimilarity metrics can provide precise cues about the relative role of nestedness and turnover in explaining regional patterns of β-diversity (Baselga 2010).
Nonetheless, we underline that the choice of inclusive biodiversity metrics derived from PCCGs must follow the principle that intra- and interspecific diversity are actually shaped by similar processes (drift, selection, mutation/speciation, dispersal) acting over a continuum from ecological to evolutionary scales (Hubbell 2001; Vellend & Geber 2005). Describing biodiversity using PCCGs inherently helps to follow this principle.
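The α, β and γ partitioning described above can be sketched for bi-allelic SNP frequencies. This example is only one possible choice among the many metrics mentioned in the text: it assumes expected heterozygosity, 2p(1 − p), as the diversity measure, equal weights for all local sites, and an Fst-like β component computed as (γ − α)/γ. The function names and the toy frequencies are hypothetical.

```python
def expected_heterozygosity(p):
    """Expected heterozygosity of a bi-allelic locus with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def partition_diversity(freqs_by_site):
    """Partition diversity into alpha, gamma and an Fst-like beta.

    freqs_by_site maps each local site to a list of allele frequencies,
    one per SNP, in the same SNP order for every site.
    """
    sites = list(freqs_by_site)
    n_snps = len(freqs_by_site[sites[0]])
    if any(len(freqs_by_site[s]) != n_snps for s in sites):
        raise ValueError("all sites must report the same SNPs")

    alpha_sum = 0.0
    gamma_sum = 0.0
    for i in range(n_snps):
        ps = [freqs_by_site[s][i] for s in sites]
        # alpha: mean within-site diversity
        alpha_sum += sum(expected_heterozygosity(p) for p in ps) / len(ps)
        # gamma: diversity of the pooled (regional) allele frequency
        gamma_sum += expected_heterozygosity(sum(ps) / len(ps))

    alpha = alpha_sum / n_snps
    gamma = gamma_sum / n_snps
    beta = (gamma - alpha) / gamma if gamma > 0 else 0.0
    return {"alpha": alpha, "gamma": gamma, "beta": beta}


# Invented frequencies: SNP 1 is shared by both sites, SNP 2 is fixed
# for alternative alleles in the two sites.
toy = {"site_1": [0.5, 0.0], "site_2": [0.5, 1.0]}
result = partition_diversity(toy)
print(result)
```

In this toy case, the first SNP contributes only to within-site diversity, whereas the second contributes only to differentiation among sites.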
Concretely, one needs to consider the type of data that can be gathered from either individual or pooled sequencing approaches. In the first case, the data consist of a series of aligned DNA sequences, each attributed to a single specimen and to a given gene. SNP loci (including both intra- and interspecific SNPs) and haplotypes that group all loci from a given sequence (or gene) can be derived from these data. SNPs are classical bi-allelic loci from which many types of metrics can be derived: the number of polymorphic SNPs estimated from PCCGs can be compared among communities (a community composed of a few species will likely have a lower number of polymorphic SNPs than a community composed of many species, even if the former is rich intraspecifically), the evenness can be derived from allele frequencies, as can the differentiation (dissimilarity) among local communities (e.g., Gaggiotti et al. 2018), etc. Haplotypes can be used to build phylogenetic trees (including both intraspecific and interspecific tips) from which all types of phylogenetic metrics of community diversity can be derived (Tucker et al. 2017). Possibilities are more restricted for the pool-seq approach. In that case, a series of SNPs is retrieved, together with their relative frequencies within the community; alleles cannot be attributed to a particular species or to a particular individual within a species, which impedes the reconstruction of haplotypes. For pool-seq approaches, the metrics derived from SNP data (including information on allele frequencies) are therefore favoured (Schlötterer et al. 2014).
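For pool-seq data, the SNP-based summaries mentioned above (number of polymorphic SNPs and evenness of allele frequencies) can be sketched as follows. The example assumes that allele frequencies are approximated by the proportion of reads supporting the alternative allele, that a minimum minor-allele frequency (min_maf, an arbitrary value here) is used to discard likely sequencing errors, and that evenness is measured as the Shannon entropy of the two allele frequencies divided by ln 2. These are illustrative choices rather than prescribed metrics, and dedicated pool-seq tools account for sampling noise that this sketch ignores.

```python
import math


def allele_frequency(ref_reads, alt_reads):
    """Frequency of the alternative allele estimated from pooled read counts."""
    depth = ref_reads + alt_reads
    if depth == 0:
        raise ValueError("SNP has no read coverage")
    return alt_reads / depth


def snp_evenness(p):
    """Evenness of a bi-allelic SNP: Shannon entropy divided by ln(2).

    Returns 0 for a fixed allele and 1 when both alleles are equally frequent.
    """
    if p <= 0.0 or p >= 1.0:
        return 0.0
    entropy = -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))
    return entropy / math.log(2)


def summarise_pool(read_counts, min_maf=0.05):
    """Summarise one community pool from (ref_reads, alt_reads) pairs."""
    freqs = [allele_frequency(ref, alt) for ref, alt in read_counts]
    polymorphic = [p for p in freqs if min_maf <= p <= 1.0 - min_maf]
    mean_evenness = (
        sum(snp_evenness(p) for p in polymorphic) / len(polymorphic)
        if polymorphic else 0.0
    )
    return {
        "n_snps": len(freqs),
        "n_polymorphic": len(polymorphic),
        "mean_evenness": mean_evenness,
    }


# Invented read counts (reference, alternative) for four SNPs in one pool.
toy_counts = [(50, 50), (100, 0), (98, 2), (75, 25)]
summary = summarise_pool(toy_counts)
print(summary)
```

The same summaries computed for several pools can then be compared among communities, or combined into differentiation metrics.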