Occurrence data and environmental variables
Occurrence records were compiled for the Ganges River Dolphin (GRD) across its entire range, the GBM and KS river systems spanning Nepal, India and Bangladesh, and for the Indus Dolphin (ID) in the Indus river system spanning India and Pakistan. Occurrence locations were based on presence records from studies conducted by several researchers (see supporting information SI1) and from OBIS-SEAMAP (http://seamap.env.duke.edu/). A total of 724 occurrence records were compiled for GRD, of which 410 coordinates were used, and 404 for ID, of which 304 coordinates were used. For GRD, two coordinates were discarded because they indicated presence in the Bay of Bengal. This phenomenon has been reported during monsoons (Moreno, 2003); however, this study was limited to the riverine environment. Absence points are considered valuable for SDM algorithms and model assessment techniques (Miller, 2010). Since I did not have true absence points, I generated 10,000 pseudo-absence points for modeling using the 'random' strategy. In this strategy, all cells of the initial background are pseudo-absence candidates and the choice is made at random.
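The random pseudo-absence strategy can be illustrated with a minimal sketch. It is not the code used in this study: the function name `sample_pseudo_absences`, the toy 200 x 100 background grid and the seed are all assumptions made for illustration, and the sketch assumes that cells are drawn without replacement.

```python
import random


def sample_pseudo_absences(background_cells, n_points, seed=None):
    """Draw pseudo-absence cells uniformly at random, without replacement.

    Every cell of the background is treated as a candidate (hypothetical
    helper; the study's own implementation is not shown in the text).
    """
    candidates = sorted(set(background_cells))
    rng = random.Random(seed)
    if n_points >= len(candidates):
        # Fewer candidates than requested: return all of them.
        return candidates
    return rng.sample(candidates, n_points)


# Toy background: (row, col) indices of 20,000 cells (assumed, not real data).
background = [(r, c) for r in range(200) for c in range(100)]
pseudo_absences = sample_pseudo_absences(background, 10_000, seed=42)
```

Fixing the seed makes the draw reproducible, which is useful when the same pseudo-absence set has to be reused across model runs.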
The basin boundary and river networks were obtained from HydroSHEDS (https://hydrosheds.org). The GBM basin provided by HydroSHEDS omits areas near the Bay of Bengal and some areas within the basin boundary; these were merged to form the final GBM basin (see supporting information SI2). Since the species is aquatic, the input layers were created from environmental variables clipped to the river networks. This produced NA predictor values at some coordinates, possibly because the coordinates were reported from shore-based censuses or because of errors in the river network. A 1 km coordinate pull was therefore used to move such coordinates to the nearest raster cell with data, using the nearestLand function from the package SEEG-Oxford/seegSDM. Any points that still did not fall on a raster cell after this 1 km pull were discarded. The selected coordinates were then grid-sampled to match the raster resolution so that there was one occurrence point per pixel.
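The coordinate pull and the grid sampling can be sketched on a toy planar grid. This is a simplified analogue, not the seegSDM implementation: the 1000 m cell size, the helper names and the example points are hypothetical, and distances are measured to cell centres in projected metres rather than on geographic coordinates.

```python
import math

CELL = 1000.0  # cell size in metres on a toy projected grid (assumption)


def cell_of(x, y):
    """Return the (row, col) index of the cell containing point (x, y)."""
    return (math.floor(y / CELL), math.floor(x / CELL))


def centre_of(cell):
    """Return the (x, y) centre of a (row, col) cell."""
    row, col = cell
    return ((col + 0.5) * CELL, (row + 0.5) * CELL)


def snap_to_valid_cell(x, y, valid_cells, max_dist):
    """Return the point's own cell if it has data, else the nearest cell
    with data whose centre lies within max_dist, else None."""
    cell = cell_of(x, y)
    if cell in valid_cells:
        return cell
    best, best_d = None, math.inf
    for cand in sorted(valid_cells):
        cx, cy = centre_of(cand)
        d = math.hypot(cx - x, cy - y)
        if d <= max_dist and d < best_d:
            best, best_d = cand, d
    return best


def one_point_per_cell(points, valid_cells, max_dist=1000.0):
    """Snap each point, drop those that cannot be snapped, and keep only
    the first point that lands in each cell."""
    kept = {}
    for x, y in points:
        cell = snap_to_valid_cell(x, y, valid_cells, max_dist)
        if cell is not None and cell not in kept:
            kept[cell] = (x, y)
    return kept


# Toy river: three cells with data along row 0 (assumed, not real data).
river_cells = {(0, 0), (0, 1), (0, 2)}
records = [(500.0, 500.0), (700.0, 300.0), (1500.0, 1200.0), (2500.0, 4000.0)]
thinned = one_point_per_cell(records, river_cells)
```

In this toy case the second record falls in an already occupied cell and is thinned out, the third is pulled 700 m onto the river, and the fourth lies more than 1 km from any cell with data and is discarded.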
Nineteen bioclimatic raster layers were obtained from WorldClim version 2.1 climate data for 1970-2000 (https://worldclim.org/), along with two hydrological variables, the hydrologically conditioned Digital Elevation Model (DEM) and the flow accumulation model from HydroSHEDS, at 30 arc-second spatial resolution, to model potential distribution. Using all the variables could cause over-fitting because of the high degree of collinearity among predictors. To minimize this, a Pearson correlation matrix was created and variables with a correlation above 0.7 were discarded. The variables finally used were BIO2, BIO3, BIO15, BIO16, flow accumulation and the hydrologically conditioned DEM. BIO2 (Mean Diurnal Range) is the mean of the monthly differences between maximum and minimum temperature. BIO3 (Isothermality) is the ratio of the mean diurnal range to the annual temperature range (BIO2/BIO7 × 100). BIO15 (Precipitation Seasonality) is the coefficient of variation of monthly precipitation. BIO16 (Precipitation of Wettest Quarter) is the total precipitation of the wettest quarter, calculated per pixel. Flow accumulation is the amount of upstream area (in number of cells) draining into each cell, and the hydrologically conditioned DEM represents the expected flow of water over the terrain.
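The collinearity screening can be sketched as a greedy filter on pairwise Pearson correlations. The text does not state which variable of a correlated pair was retained, so the rule used here (keep a variable only if its absolute correlation with every variable already kept is at most 0.7, in the order given) is an assumption, as are the function names and the toy values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def drop_collinear(variables, threshold=0.7):
    """Greedy filter (assumed rule): keep a variable only if |r| with
    every previously kept variable does not exceed the threshold."""
    kept = []
    for name, values in variables.items():
        if all(abs(pearson(values, variables[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Toy predictor values sampled at five cells (hypothetical, not study data).
toy = {
    "var_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "var_b": [3.0, 5.0, 7.0, 9.0, 11.0],  # linear in var_a, so r = 1
    "var_c": [2.0, 1.0, 4.0, 1.0, 3.0],
}
selected = drop_collinear(toy)
```

Because the filter is order-dependent, listing the ecologically most relevant predictors first determines which member of each correlated pair is retained.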