Figure 1. Decision tree of how to accurately use data to analyse
patterns of richness. Questions are given in the light grey boxes and
solutions are provided in blue.
Understanding species ranges and how well-protected they are is crucial
for filling gaps in protection coverage and effectively protecting
species into the future, but poor analysis can mislead and misdirect
attention and waste limited conservation resources. Here, we demonstrate
the potential pitfalls of uncritical species mapping, highlighting the
importance of thoroughly vetting data sources and engaging them with the
same ecological principles that are expected for the best-known species.
Ultimately, best-available-data arguments that overreach beyond what
the data can support (i.e. globally; Wyborn & Evans, 2022) are bad-data
arguments that can also have bad outcomes for conservation. Furthermore,
“effective protection” needs may differ across species (Butchart et al.,
2015), such that generalists and specialists respond differently. This
must be considered, rather than relying on naïve aggregations that treat
all species as generalists, or that include only generalist species and
then claim representation for entire groups.
Various analyses have been conducted which use these approaches (Figure
1) to sensibly and sensitively map species ranges and biodiversity
patterns. For example, in terms of limiting analysis to areas where data
are representative, there are many regional analyses which use such
approaches (Figure 1; solution 1), as well as studies which increase
their resolution by incorporating other types of data not available
at wider scales (Fukaya et al., 2020; Cosentino et al., 2023). For
modelling richness patterns in the absence of sufficient data for
species-level analysis, examples exist for data-poor invertebrates,
ferns, and lycophytes (Orr et al., 2020; Weigand et al., 2020; Liu et
al., 2022; Potapov et al., 2023) (Figure 1; solution 2), and such work
may also include collating more data to ensure that they are spatially
and taxonomically representative (Figure 1; solution 4) (Qiao et al.,
2023). A good recent example is for ants, where a large database was
collated to ensure that it was representative before conducting
individual species models to map richness (Figure 1; solutions 5 and 7)
(Janicki et al., 2016; Guénard et al., 2017; AntWeb; https://antweb.org;
Kass et al., 2022). Alternatively, indices can also be applied to fill
gaps, either for limited regions such as urban areas (Hughes et al.,
2022) (Figure 1; solution 3), or indices combined and interpolated more
widely (Qiao et al., 2023). Only where representative data exist across
taxa can species-specific analyses be conducted. If limited
species-specific data are available, this may include simple approaches,
such as selecting suitable habitat from within MCPs (Figure 1; solution
6) (Chesshire et al., 2022), or following approaches similar to those
advocated for refining IUCN maps to better represent species' true
ranges (landcover and elevation filters within similar ecoregions or
biomes; Brooks et al., 2019). In all cases, care and calibration at each
step are needed to ensure that data are used within reasonable limits,
without over-extending their utility or overreaching on interpretation.
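The simplest species-specific option above, selecting suitable habitat from within an MCP (Figure 1; solution 6), can be illustrated with a minimal sketch. It is not the workflow used in this study. The grid, land-cover labels, record coordinates and function names are all hypothetical, and a real analysis would use projected spatial data and a GIS library rather than integer grid cells.

```python
from typing import Dict, List, Set, Tuple

Point = Tuple[float, float]
Cell = Tuple[int, int]


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of (a - o) x (b - o); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Minimum convex polygon (Andrew's monotone chain), counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def inside_hull(p: Point, hull: List[Point]) -> bool:
    """True if p lies inside or on the edge of a counter-clockwise hull."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))


def habitat_within_mcp(records: List[Point],
                       land_cover: Dict[Cell, str],
                       suitable: Set[str]) -> Set[Cell]:
    """Grid cells whose centre falls inside the MCP AND whose cover is suitable."""
    hull = convex_hull(records)
    if len(hull) < 3:
        raise ValueError("an MCP needs at least three non-collinear records")
    return {
        (x, y)
        for (x, y), cover in land_cover.items()
        if cover in suitable and inside_hull((x + 0.5, y + 0.5), hull)
    }


# Assumption: an invented 10 x 10 grid, forest in the west, cropland in the east.
toy_cover = {(x, y): ("forest" if x < 5 else "cropland")
             for x in range(10) for y in range(10)}
toy_records = [(1, 1), (8, 1), (8, 8), (1, 8), (4, 5)]

unfiltered = habitat_within_mcp(toy_records, toy_cover, {"forest", "cropland"})
filtered = habitat_within_mcp(toy_records, toy_cover, {"forest"})
print(len(unfiltered), len(filtered))
```

In this toy case the filtered range is a strict subset of the unfiltered MCP, which is the behaviour the habitat filter is meant to provide. Cells outside the hull are never returned, so range edges beyond known records remain excluded.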
Figure 2. Projecting richness with different approaches. A. Maxent
species models (Figure 1; solution 5). B. Interpolating richness from
inventories (solution 2). C. Filtering habitat within convex polygons
(solution 6). D. IUCN richness. E. IUCN richness filtered by suitable
habitat. F. MCP unfiltered richness.
Each approach has its own inherent assumptions and level of detail (see
Figure 2), so some caveats are necessary. Patterns from each should be
comparable (provided that there are sufficient data for analysis), but
they will not be identical, and richness values will vary between
approaches (thus, relative patterns rather than absolute values should
be explored). Selection of input variables, particularly land-cover
variables, can strongly influence the projected distribution of species,
and if some regions are under-sampled or under-represented then they may
not be accurately reflected in analysis; this problem increases with
scale. Categorical variables can artificially delimit ranges. We used
the IUCN ecosystem typology to increase the sensitivity of analysis (as
more accurate national maps are not available for continental regions
such as Africa), but this may still cause more extreme transitions in
apparent diversity (especially in small, inaccessible and under-sampled
regions). Such filtering is nonetheless necessary to refine
overgeneralised maps (Figure 2E vs D for IUCN and F vs C for MCPs). More
sensitive models using interpolation or modelling based on continuous
variables (such as vegetation height and density) can be better where
they can be applied (Figure 2A, B). Furthermore, selection of models
requires care. Joint-species distribution models are becoming
increasingly popular (Zurell et al., 2018; 2019), but they should be
applied only where a functional relationship between species
(competition, mutualism, etc.) exists, as correlations based on biased
sampling could emerge from applying such models without verified
ecological relationships (Poggiato et al., 2021). Similarly tied to
scale, interpolations over very large and
heterogeneous regions cannot incorporate biogeographic differences if
relying solely on interpolation of richness rather than species-specific
approaches, and may better represent drivers of patterns in better
sampled regions. Furthermore, interpolations between mainland areas and
islands cannot encompass island biogeography, as dispersal would be
assumed equal through the study area. If using MCPs, the edges of
species distributions will be excluded (as the approach can only assess
range within the recorded maximum distribution of the species based on
known locations; see Figure 2C, F). Analyses using unfiltered MCPs
(Figure 2F) are vulnerable to “mid-domain” type effects, where central
areas will appear richer regardless of land-cover type due to the
probability of overlap between MCPs in central areas of maps (Figure 2).
The importance of reflecting habitat in maps is clear by comparing
patterns when habitat variables are incorporated versus richness maps
based on simple clipped maps of maximum range extent (Figure 2C vs 2E).
Inventory-based approaches also rely on data being representative
throughout the study area. Assessing the performance of inventory-based
approaches can be challenging without having complete inventory data for
cross-referencing (Figure 2B). Hence, expert knowledge may be the only
way to assess if patterns “make sense”. Furthermore, as
interpolation-based approaches may work better over large regions where
the full environmental gradient is adequately sampled, they might be
more appropriate over very large scales (i.e. global), given that some
regions (such as Europe and the US) have very good sampling. However, at
a continental scale this can be challenging without additional fieldwork
or data-collation to add inventory data (which was not possible in the
case of this study; Figure 2B). In addition, less species-rich
areas may be more poorly sampled, so inventory-based interpolations need
at least some site-based inventories in addition to calculations of
richness by larger unit areas. Assessment and ground-truthing at each
step are fundamental necessities of analysis, and whilst all models are
wrong, some models are useful. That usefulness ultimately depends on
whether a model provides sufficient accuracy to guide interventions and
further work effectively. Here we show that many methods can create
broadly similar patterns (Figure 2A-C). Species-specific models will
provide the best approach if sufficient species-level data are
available, but on extended (i.e. global) scales this is rarely done. In
most of these examples (Figure 2) the relative patterns are similar, but
unfiltered approaches overgeneralise, inflating richness of regions
predicted to host lower diversity in Central Africa and missing areas to
the Northeast and Southeast that other approaches do highlight as
potentially hosting higher diversity.
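The “mid-domain” type effect described above for unfiltered MCPs can be reproduced with a toy simulation. This is a one-dimensional sketch, not the analysis behind Figure 2. Range endpoints are placed uniformly at random with no habitat information at all, and the function name, number of species and domain size are arbitrary assumptions.

```python
import random
from typing import List


def simulate_unfiltered_richness(n_species: int, n_cells: int,
                                 seed: int = 0) -> List[int]:
    """Stack randomly placed ranges on a 1-D domain and count overlaps per cell.

    Assumption: each species' range is the span between two cells drawn
    uniformly at random, so no cell is ecologically better than any other.
    """
    rng = random.Random(seed)
    richness = [0] * n_cells
    for _ in range(n_species):
        a = rng.randrange(n_cells)
        b = rng.randrange(n_cells)
        low, high = min(a, b), max(a, b)
        for cell in range(low, high + 1):
            richness[cell] += 1
    return richness


richness = simulate_unfiltered_richness(n_species=500, n_cells=50)
centre = richness[len(richness) // 2]
print("edge:", richness[0], "centre:", centre, "edge:", richness[-1])
```

Because a central cell is covered by any range whose endpoints fall on either side of it, stacked richness peaks mid-domain even though the simulated landscape is uniform. An apparent central hotspot in unfiltered maps is therefore expected from geometry alone and should not be read as an ecological signal without habitat filtering.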
Moving forward, a set of best practices is needed to ensure that, as
accessible data grow, they are used in a way that builds on
understanding rather than bias (e.g., see Box 1). For example, it is
necessary to understand whether data are representative across regions
and taxa before species-level analysis (as noted in Figure 1). As global
insect data do not meet these criteria, global analysis is not yet
possible. This is because most data are concentrated in North America
and Europe and taxonomic representation is too incomplete. For insects,
such analysis
may be possible within extremely well-studied taxa, such as ants,
because work to collate representative data for species level analysis
has been conducted (Kass et al., 2022). However, for most insects (and
many other organisms) species-specific analysis of protected area
coverage or stacked diversity patterns cannot be conducted meaningfully
on a global basis, and only coarser metrics of diversity are possible.
We also need to prepare for new types of data (such as
social-media-generated data), and constantly update guidelines to ensure
their effective use. These different types of data can amplify different
forms of bias or introduce new biases, and thus knowing how they can be
used will likely need continued revisiting, not to mention safeguards to
ensure their use is also ethical. As we have seen, the greatest growth
in data availability has been in developed areas of high-income
economies; if we uncritically apply methods fit only for data-rich areas
to data-lacking areas, we risk misleading rather than facilitating
effective management.
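A first-pass check of whether records are spread across the regions an analysis claims to cover could look like the sketch below. The region labels, record counts and the 10% minimum share are invented for illustration and are not thresholds proposed in this paper. Taxonomic representativeness would need an analogous check against a species checklist.

```python
from collections import Counter
from typing import Dict, Iterable, List


def regional_shares(record_regions: Iterable[str]) -> Dict[str, float]:
    """Fraction of all occurrence records that falls in each region."""
    counts = Counter(record_regions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {region: n / total for region, n in counts.items()}


def is_spatially_representative(record_regions: List[str],
                                expected_regions: List[str],
                                min_share: float = 0.10) -> bool:
    """True only if every expected region holds at least min_share of records.

    Assumption: min_share is an arbitrary illustrative cut-off; a real
    check would scale expectations by region area or expected richness.
    """
    shares = regional_shares(record_regions)
    return all(shares.get(region, 0.0) >= min_share
               for region in expected_regions)


# Hypothetical record counts, skewed towards two well-sampled regions.
toy_records = (["North America"] * 45 + ["Europe"] * 40 + ["Africa"] * 5
               + ["Asia"] * 5 + ["South America"] * 5)
regions = ["North America", "Europe", "Africa", "Asia", "South America"]

print(regional_shares(toy_records))
print(is_spatially_representative(toy_records, regions))
```

A failed check of this kind argues for restricting the study extent or falling back to coarser richness metrics, rather than stacking species-level maps across poorly sampled regions.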
Whilst it is true that we will continue to generate data, and the issues
highlighted here are not new, developing best practice guidelines and
facilitating sensible and sensitive data-use remains crucial if we are
to lead rather than mislead management and conservation prioritisation.
Inappropriate data-use can misrepresent hotspots and incorrectly gauge
the levels of protection that species and hotspots have. Therefore,
understanding how to use data effectively, however limited or biased
those data may be, is key to generating effective solutions as we
continue to grow biodiversity databases.
Box 1. Steps for cleaning data.
Table S1. The possible occurrence data filtering steps and the
functions, packages, descriptions, and citations required to undertake
each sub-step. Sub-steps include data flagging (adding a column with the
test results), data carpentry (changing the data itself), and data
filtering (removing occurrences based on data flags). The packages bdc
and BeeBDC also have a selection of functions that are useful for data
visualisation and critical for checking the “common-sense” of results.
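The three kinds of sub-step named in the Table S1 caption (flagging, carpentry and filtering) can be told apart with a small sketch. It does not reproduce any function from bdc or BeeBDC. The function names, column names and example records are hypothetical, and the single coordinate test stands in for the many checks those packages provide.

```python
from typing import Dict, List

Record = Dict[str, object]


def flag_valid_coordinates(records: List[Record]) -> List[Record]:
    """Data flagging: add a test-result column; no record is changed or removed."""
    flagged = []
    for rec in records:
        lat, lon = rec.get("lat"), rec.get("lon")
        ok = (isinstance(lat, (int, float)) and isinstance(lon, (int, float))
              and -90 <= lat <= 90 and -180 <= lon <= 180)
        flagged.append({**rec, "flag_valid_coords": ok})
    return flagged


def tidy_species_names(records: List[Record]) -> List[Record]:
    """Data carpentry: change the data itself (here, collapse stray whitespace)."""
    return [{**rec, "species": " ".join(str(rec.get("species", "")).split())}
            for rec in records]


def filter_by_flags(records: List[Record], flag_columns: List[str]) -> List[Record]:
    """Data filtering: drop occurrences that failed any of the listed flags."""
    return [rec for rec in records
            if all(rec.get(col, False) for col in flag_columns)]


# Hypothetical raw occurrences; coordinates are invented.
raw = [
    {"species": "  Apis   mellifera ", "lat": 10.0, "lon": 20.0},
    {"species": "Bombus terrestris", "lat": None, "lon": 5.0},
    {"species": "Osmia bicornis", "lat": 95.0, "lon": 0.0},
]

flagged = tidy_species_names(flag_valid_coordinates(raw))
clean = filter_by_flags(flagged, ["flag_valid_coords"])
print(len(raw), len(flagged), len(clean))
```

Keeping flagging separate from filtering means that every removal can be traced to a named test, and the flagged table can be inspected before any records are discarded.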