Figure 1. Decision tree of how to accurately use data to analyse patterns of richness. Questions are given in the light grey boxes and solutions are provided in blue.
Understanding species ranges and how well-protected they are is crucial for filling gaps in protection coverage and for effectively protecting species into the future, but poor analysis can mislead, misdirect attention, and waste limited conservation resources. Here, we demonstrate the potential pitfalls of uncritical species mapping, highlighting the importance of thoroughly vetting data sources and of applying to them the same ecological principles that are expected for the best-known species. Ultimately, best-available-data arguments that overreach beyond what the data can support (i.e. globally, Wyborn & Evans 2022) are bad-data arguments that can also have bad outcomes for conservation. Furthermore, “effective protection” needs may differ across species (Butchart et al., 2015), such that generalists and specialists would respond differently. This must also be considered, rather than relying on naïve aggregations that treat all species as generalists, or that include only generalist species and then claim representation for entire groups.
Various analyses have used these approaches (Figure 1) to sensibly and sensitively map species ranges and biodiversity patterns. For example, in terms of limiting analysis to areas where data are representative, there are many regional analyses which use such approaches (Figure 1; solution 1), as well as studies which increase their resolution by incorporating other types of data not available at wider scales (Fukaya et al., 2020; Cosentino et al., 2023). For modelling richness patterns in the absence of sufficient data for species-level analysis, examples exist for data-poor invertebrates, ferns, and lycophytes (Orr et al., 2020; Weigand et al., 2020; Liu et al., 2022; Potapov et al., 2023) (Figure 1; solution 2), and such work may also include collating more data to ensure that it is spatially and taxonomically representative (Figure 1; solution 4) (Qiao et al., 2023). A good recent example is for ants, where a large database was collated to ensure that it was representative before conducting individual species models to map richness (Figure 1; solutions 5 and 7) (Janicki et al., 2016; Guénard et al., 2017; AntWeb; https://antweb.org; Kass et al., 2022). Alternatively, indices can also be applied to fill gaps, either for limited regions such as urban areas (Hughes et al., 2022) (Figure 1; solution 3), or combined and interpolated more widely (Qiao et al., 2023). Only where representative data exist across taxa can species-specific analyses be conducted. If limited species-specific data are available, this may include simple approaches, such as selecting suitable habitat from within MCPs (Figure 1; solution 6) (Chesshire et al., 2022), or following approaches similar to those advocated for refining IUCN maps to better represent species' true ranges (landcover and elevation filters within similar ecoregions or biomes, Brooks et al., 2019).
In all cases, care and calibration are needed at each step to ensure that data are used within reasonable limits, without over-extending their utility or overreaching on interpretation.
Figure 2. Projecting richness with different approaches. A. Maxent species models (Figure 1; solution 5). B. Interpolating richness from inventories (solution 2). C. Filtering habitat within convex polygons (solution 6). D. IUCN richness. E. IUCN richness filtered by suitable habitat. F. MCP unfiltered richness.
Each approach has its own inherent assumptions and level of detail (see Figure 2), so some caveats are necessary. Patterns from each should be comparable (provided that there are sufficient data for analysis), but they will not be identical, and richness values will vary between approaches (thus, relative patterns rather than absolute values should be explored). Selection of input variables, particularly land-cover variables, can strongly influence the projected distribution of species, and if some regions are under-sampled or under-represented then they may not be accurately reflected in analysis; this problem increases with scale. Categorical variables can artificially delimit ranges. We used the IUCN ecosystem typology to increase the sensitivity of analysis (as more accurate national maps are not available for continental regions such as Africa), and although this may cause more extreme transitions in apparent diversity (especially in small, inaccessible and under-sampled regions), it is still necessary to refine overgeneralised maps (Figure 2E vs D for IUCN and F vs C for MCPs). More sensitive models using interpolation or modelling based on continuous variables (such as vegetation height and density) can be better where they can be applied (Figure 2A, B). Furthermore, selection of models requires care. Joint-species distribution models are becoming increasingly popular (Zurell et al., 2018; 2019), but they should be applied only where a functional relationship between species (competition, mutualism, etc.) exists, as correlations based on biased sampling could emerge from applying such models without verified ecological relationships (Poggiato et al., 2021).
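One way to compare relative rather than absolute richness patterns between two approaches is a rank correlation across shared grid cells. The following is a minimal sketch in plain Python, not the procedure used in this study; the function names and the per-cell richness values are hypothetical and chosen only for illustration.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of cells that share the same value.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical richness per grid cell from two approaches (illustrative only).
modelled = [12, 30, 45, 8, 22, 60]
interpolated = [5, 11, 20, 3, 9, 31]
rho = spearman(modelled, interpolated)
```

In this toy case the absolute values differ by roughly a factor of two, yet the cells are ordered identically, so the rank correlation is at its maximum. This is the sense in which two maps can agree on relative patterns while disagreeing on richness values.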
Similarly tied to scale, interpolations over very large and heterogeneous regions cannot incorporate biogeographic differences if relying solely on interpolation of richness rather than species-specific approaches, and may better represent drivers of patterns in better sampled regions. Furthermore, interpolations between mainland areas and islands cannot encompass island biogeography, as dispersal would be assumed equal through the study area. If using MCPs, the edges of species distributions will be excluded (as the approach can only assess range within the recorded maximum distribution of the species based on known locations; see Figure 2C, F). Analyses using unfiltered MCPs (Figure 2F) are vulnerable to “mid-domain” type effects, where central areas will appear richer regardless of land-cover type due to the probability of overlap between MCPs in central areas of maps (Figure 2).
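The "mid-domain" type effect of unfiltered MCPs can be illustrated with a small simulation. The sketch below assumes purely random occurrence points in a unit-square domain with no habitat information at all; the species, point counts, grid size, and function names are hypothetical and are not taken from this study.

```python
import random


def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def inside(hull, p):
    """True if p lies in or on a counter-clockwise convex hull."""
    if len(hull) < 3:
        return False
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))


def stacked_richness(species_points, n_cells=10):
    """Count, for each grid-cell centre, how many species' MCPs cover it."""
    hulls = [convex_hull(pts) for pts in species_points]
    grid = [[0] * n_cells for _ in range(n_cells)]
    for row in range(n_cells):
        for col in range(n_cells):
            centre = ((col + 0.5) / n_cells, (row + 0.5) / n_cells)
            grid[row][col] = sum(inside(h, centre) for h in hulls)
    return grid


# 200 hypothetical species, each with 5 random occurrence points.
rng = random.Random(1)
species = [[(rng.random(), rng.random()) for _ in range(5)] for _ in range(200)]
richness = stacked_richness(species)
```

Because the points carry no ecological signal, any gradient in `richness` arises from polygon overlap alone: cells near the middle of the grid (for example `richness[4][4]`) are covered by far more polygons than corner cells (for example `richness[0][0]`).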
The importance of reflecting habitat in maps is clear when comparing patterns where habitat variables are incorporated against richness maps based on simple clipped maps of maximum range extent (Figure 2C vs 2E). Inventory-based approaches also rely on data being representative throughout the study area. Assessing the performance of inventory-based approaches can be challenging without complete inventory data for cross-referencing (Figure 2B). Hence, expert knowledge may be the only way to assess if patterns “make sense”. Furthermore, as interpolation-based approaches may work better over large regions where the full environmental gradient is adequately sampled, they might be more appropriate over very large scales (i.e. global), given that some regions (such as Europe and the US) have very good sampling. However, at a continental scale this can be challenging without additional fieldwork or data collation to supply further inventory data (which was not possible in the case of this study; Figure 2B). In addition, less species-rich areas may be more poorly sampled, so inventory-based interpolations need at least some site-based inventories in addition to calculations of richness by larger unit areas. Assessment and ground-truthing at each step are a fundamental necessity of analysis, and whilst all models are wrong, some models are useful. That usefulness ultimately depends on whether a model provides sufficient accuracy to guide interventions and further work effectively. Here we show that many methods can create broadly similar patterns (Figure 2A-C). Species-specific models will provide the best approach if sufficient species-level data are available, but on extended (i.e. global) scales this is rarely done.
In most of these examples (Figure 2) the relative patterns are similar, but unfiltered approaches overgeneralise, inflating the richness of regions in Central Africa that are predicted to host lower diversity and missing areas to the Northeast and Southeast. In contrast, the other approaches do highlight these same areas as potentially hosting higher diversity.
Moving forward, a set of best practices is needed to ensure that, as accessible data grow, they are used in a way that builds on understanding rather than bias (e.g., see Box 1). For example, species-level analysis requires data that are representative across regions and taxa (as noted in Figure 1). Global insect data do not meet these criteria, as most data are concentrated in North America and Europe and taxonomic representation is too incomplete, so global analysis is not yet possible. For insects, such analysis may be possible within extremely well-studied taxa, such as ants, because work to collate representative data for species-level analysis has been conducted (Kass et al., 2022). However, for most insects (and many other organisms), species-specific analysis of protected area coverage or stacked diversity patterns cannot be conducted meaningfully on a global basis, and only coarser metrics of diversity are possible. We also need to prepare for new types of data (such as social-media-generated data), and constantly update guidelines to ensure their effective use. These different types of data can amplify existing forms of bias or introduce new ones, so how they can be used will likely need continued revisiting, along with safeguards to ensure that their use is ethical. As we have seen, the greatest growth in data availability has been in developed areas of high-income economies; if we uncritically apply methods fit only for data-rich areas to data-poor areas, we risk misleading rather than facilitating effective management.
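A first, coarse check of spatial representativeness is to compare each region's share of occurrence records with its share of the study area. The sketch below is only one possible heuristic and is not a criterion defined in this study; the region names, record counts, areas, and the 0.5 threshold are all made up for illustration.

```python
def representation_ratio(records, area):
    """Share of records divided by share of area, per region.

    Values well below 1 suggest a region is under-sampled relative to its size;
    values well above 1 suggest records are concentrated there.
    """
    total_records = sum(records.values())
    total_area = sum(area.values())
    return {
        region: (records[region] / total_records) / (area[region] / total_area)
        for region in area
    }


# Made-up numbers purely for illustration; not real record counts or areas.
records = {"region_a": 8000, "region_b": 1000, "region_c": 1000}
area = {"region_a": 20, "region_b": 30, "region_c": 50}

ratios = representation_ratio(records, area)

# Hypothetical threshold: flag regions with under half their expected share.
under_sampled = sorted(r for r, v in ratios.items() if v < 0.5)
```

Here `region_a` holds most records despite covering the smallest area, while `region_b` and `region_c` are flagged. The same comparison could be made across taxa (records per described species in a group), although a simple ratio of this kind says nothing about environmental coverage within a region.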
Whilst it is true that we will continue to generate data, and the issues highlighted here are not new, developing best-practice guidelines and facilitating sensible and sensitive data use remains crucial if we are to lead rather than mislead management and conservation prioritisation. Inappropriate data use can misrepresent hotspots and incorrectly gauge the level of protection that species and hotspots receive. Therefore, understanding how to use data effectively, however limited or biased those data may be, is key to generating solutions as we continue to grow biodiversity databases.
Box 1. Steps for cleaning data.
Table S1. The possible occurrence data filtering steps and the functions, packages, descriptions, and citations required to undertake each sub-step. Sub-steps include data flagging (adding a column with the test results), data carpentry (changing the data itself), and data filtering (removing occurrences based on data flags). The packages bdc and BeeBDC also have a selection of functions that are useful for data visualisation and critical for checking the “common-sense” of results.
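The distinction between the three kinds of sub-step can be sketched in a few lines. The example below is plain Python and does not use or reproduce the bdc or BeeBDC functions (both are R packages); the record fields, the `coords_ok` column name, and the occurrence records themselves are hypothetical.

```python
def carpentry(record):
    """Data carpentry: change the data itself (tidy names, parse coordinates)."""
    cleaned = dict(record)
    # Collapse repeated and surrounding whitespace in the species name.
    cleaned["species"] = " ".join(record["species"].split())
    for key in ("lat", "lon"):
        try:
            cleaned[key] = float(record[key])
        except (TypeError, ValueError):
            cleaned[key] = None  # unparseable or missing coordinate
    return cleaned


def flag_coordinates(record):
    """Data flagging: add a column with the test result; nothing is removed."""
    lat, lon = record["lat"], record["lon"]
    ok = (
        lat is not None
        and lon is not None
        and -90 <= lat <= 90
        and -180 <= lon <= 180
        and not (lat == 0 and lon == 0)  # treat 0,0 as a likely placeholder
    )
    return {**record, "coords_ok": ok}


def filter_flagged(records, flag="coords_ok"):
    """Data filtering: remove occurrences that failed the named flag."""
    return [r for r in records if r[flag]]


# Invented occurrence records for illustration only.
raw = [
    {"species": " Apis  mellifera ", "lat": "48.1", "lon": "11.6"},
    {"species": "Apis mellifera", "lat": "0", "lon": "0"},
    {"species": "Bombus terrestris", "lat": "95.0", "lon": "10.0"},
    {"species": "Bombus terrestris", "lat": None, "lon": "10.0"},
]

flagged = [flag_coordinates(carpentry(r)) for r in raw]
kept = filter_flagged(flagged)
```

Keeping flagging separate from filtering means that every record and its test result remain available for inspection (all four records are still present in `flagged`), and the decision to remove records is a distinct, reversible step.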