Table 1: Bitrate, percentage file size reduction and maximum encodable frequency for the experimental compression levels.
Quantification of Soundscapes Using Indices
Analytical Indices
We used the seewave (Sueur, Aubin and Simonis, 2008) and soundecology (Villanueva-Rivera and Pijanowski, 2016) packages in R (ver. 3.6.1; R Core Team, 2020) to extract seven Analytical Indices (Fig. 4d): Acoustic Complexity Index (ACI), Acoustic Diversity Index (ADI), Acoustic Evenness (AEve), Bioacoustic Index (Bio), Acoustic Entropy (H), Median of the Amplitude Envelope (M), and Normalised Difference Soundscape Index (NDSI) (Supplementary 3). These have been shown to capture diel phases, seasonality, and habitat type (Bradfer-Lawrence et al., 2019). These indices could not be calculated for every recording due to file reading errors; however, this fault affected only 0.3% of all recordings (Supplementary 2b).
AudioSet Fingerprint
The audio was downsampled to 16 kHz, converted to a log-scaled Mel-frequency spectrogram, and then passed through the “VGG-ish” Convolutional Neural Network (CNN) trained on the AudioSet database (Gemmeke et al., 2017; Hershey et al., 2017) (Fig. 1d). This generates a 128-dimensional embedding whose values describe the soundscape of a given recording in an abstracted form, or fingerprint. As with the Analytical Indices, some recordings could not be analysed by the AudioSet CNN; this affected only 0.2% of recordings (Supplementary 2b).
Data Analysis
Impact of Index Selection: Auto-Correlation
Analytical Indices often summarise similar features of a soundscape (e.g. dominant frequency and frequency bin occupancy): this overlap may reduce the descriptive scope of the ensemble. We compare the degree of pairwise correlation between the individual Analytical Indices and between the individual features of the AudioSet Fingerprint. We also compare how well each index/feature correlates with the maximum recordable frequency (Fig. 1e).
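The pairwise comparison amounts to rank-transforming each index series and taking the Pearson correlation of the ranks. The analysis itself was run in R; as an illustration only, a self-contained Python sketch of Spearman's rho (function names are ours, not from the paper):

```python
def _ranks(xs):
    # Assign 1-based ranks; tied values share the mean of their ranks,
    # as Spearman's rho requires.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    # Spearman's rho = Pearson correlation of the rank-transformed values.
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den
```

Applying this to every pair of index columns (and to each index against maximum recordable frequency) yields the correlation matrices summarised in Fig. 2.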
Impact of Compression: Like-for-Like Differences
We use an adaptation of Bland-Altman plots (Vesna, 2009; Araya-Salas, Smith-Vidaurre and Webster, 2019) to visualise the scaled difference (\(D\)) between raw (\(I_{\text{raw}}\)) and compressed (\(I_{\text{com}}\)) index values, as a percentage of the range of raw values (\(R_{\text{raw}}\)) (Fig. 1f):
\begin{equation} D=\frac{I_{\text{com}}-I_{\text{raw}}}{R_{\text{raw}}}\times 100\nonumber \end{equation}
\(D\) was not normally distributed (Supplementary 5a), so medians and inter-quartile ranges are reported. We consider an index to have been altered by compression when: i) the interquartile range of \(D\) does not include zero, or ii) the median \(D\) exceeds ±5% of \(R_{\text{raw}}\). We use Spearman rank correlation to test for a consistent trend in \(D\) with increasing compression. To reflect their common use cases, \(D\) for the Analytical Indices is calculated from the univariate values, while for the AudioSet Fingerprint – which is intended as a multidimensional metric – \(D\) is calculated separately for each dimension and then averaged.
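The definition of \(D\) and the two alteration criteria can be stated compactly in code. A minimal Python sketch (the original analysis was in R; the 5% threshold applies directly to \(D\) because \(D\) is already expressed as a percentage of \(R_{\text{raw}}\)):

```python
from statistics import median, quantiles

def scaled_difference(i_raw, i_com):
    # D = (I_com - I_raw) / R_raw * 100, one value per recording,
    # where R_raw is the range of the raw index values.
    r_raw = max(i_raw) - min(i_raw)
    return [(c - r) / r_raw * 100.0 for r, c in zip(i_raw, i_com)]

def altered_by_compression(d):
    # Criterion i: the IQR of D excludes zero.
    # Criterion ii: |median D| exceeds 5% of R_raw.
    q1, _, q3 = quantiles(d, n=4)
    return not (q1 <= 0.0 <= q3) or abs(median(d)) > 5.0
```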
Impact of Recording Schedule: Recording Length
Longer recordings may show reduced variance because transient acoustic events (such as bird calls) are smoothed out. We tested this by comparing the variance of the recording groups at each recording length. Because the index values are non-normally distributed, we used Levene’s test for homogeneity of variance (Fig. 1g).
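The test statistic compares the spread of absolute deviations from each group's centre. A Python sketch of the median-centred (Brown-Forsythe) variant, which is the standard choice for non-normal data; the paper's own implementation in R may differ in centring choice:

```python
from statistics import median

def levene_statistic(groups):
    # W = ((N - k) / (k - 1)) * sum_i n_i (Zbar_i - Zbar)^2
    #                         / sum_i sum_j (Z_ij - Zbar_i)^2
    # where Z_ij = |x_ij - median(group_i)|.
    # Under H0 (equal variances), W ~ F(k - 1, N - k).
    k = len(groups)
    z = [[abs(x - median(g)) for x in g] for g in groups]
    n_i = [len(g) for g in groups]
    N = sum(n_i)
    zbar_i = [sum(zi) / len(zi) for zi in z]
    zbar = sum(sum(zi) for zi in z) / N
    num = sum(n * (zb - zbar) ** 2 for n, zb in zip(n_i, zbar_i))
    den = sum((zij - zb) ** 2
              for zi, zb in zip(z, zbar_i) for zij in zi)
    return (N - k) / (k - 1) * num / den
```

Converting W to a p-value requires the F-distribution CDF, which the analysis obtains from the statistical software rather than from this sketch.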
Impact of Parameter Alteration on Classification Task
We use random forest classification models to assess how well the soundscapes are represented by each index type under each experimental parameter, using the randomForest package (Liaw and Wiener, 2002) in R (Fig. 1h). Models were trained on a middle 24 h period of data from each site and tested on the remaining 46+ h of audio. We used 2,000 decision trees to ensure that accuracy had stabilised. The model was trained and tested separately for every combination of index type (Analytical Indices vs. AudioSet Fingerprint), compression level and recording length, and we determined the accuracy, precision and recall of each combination.
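The three performance metrics follow directly from the confusion matrix of test-set predictions. An illustrative Python sketch (the paper does not state its averaging convention; macro-averaging across the site classes is assumed here):

```python
def classification_metrics(true, pred):
    """Overall accuracy plus macro-averaged precision and recall."""
    labels = sorted(set(true))
    accuracy = sum(t == p for t, p in zip(true, pred)) / len(true)
    precisions, recalls = [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(true, pred))
        fp = sum(t != c and p == c for t, p in zip(true, pred))
        fn = sum(t == c and p != c for t, p in zip(true, pred))
        # Guard against classes never predicted / never present.
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return (accuracy,
            sum(precisions) / len(precisions),
            sum(recalls) / len(recalls))
```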
Impact of Temporal Subsetting
Soundscapes typically show considerable diel variation in both abiotic and biotic components. To assess the impact of this variation on model performance, we split our recordings into four 6-hour sections centred on Dawn (06:00), Noon (12:00), Dusk (18:00) and Midnight (00:00), and further subdivided these into 3-hour (8 sections) and 2-hour (12 sections) blocks (Fig. 1i). We then trained and tested the random forest model on each temporal section, building models from each section individually, and determined accuracy, precision and recall as before.
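Assigning a recording to its 6-hour section reduces to integer arithmetic on the hour of day. A Python sketch; the exact boundary convention (e.g. Dawn spanning 03:00-08:59) is our assumption, implied but not stated by the centring on 06:00:

```python
def six_hour_section(hour):
    # Sections are centred on 00:00, 06:00, 12:00 and 18:00, so the
    # Dawn section is assumed to cover 03:00-08:59, Noon 09:00-14:59, etc.
    labels = ["Midnight", "Dawn", "Noon", "Dusk"]
    return labels[((hour + 3) // 6) % 4]
```

The 3-hour and 2-hour blocks subdivide these sections in the same way, with the divisor and label list adjusted accordingly.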
Modelling the Impact of all Parameters on Accuracy Metrics
As the accuracy metrics are bounded between 0 and 100%, we used beta regression to model the relationship between each of the experimental parameters and the performance metrics (Douma and Weedon, 2019). The model was built using the betareg package in R (Cribari-Neto and Zeileis, 2010). To avoid fitting issues when performance measures are exactly 1, we rescaled all performance measures using m’ = (m(n − 1) + 0.5)/n, where n is the sample size (Smithson and Verkuilen, 2006). The model includes pairwise interactions between file size, temporal subsetting and recording length, and then all interactions of the main effects and those pairwise terms with index selection. We observed that variance in the performance measures varied as an interaction of index choice and temporal subsetting (Supplementary 8a), so tested the inclusion of these terms in the precision component of the model. We first treated file size and temporal subsetting as factors, but also tested a model treating them as continuous variables. The Akaike Information Criterion (AIC) was markedly lower for the beta regression model using factors and including the precision component (Supplementary 8b).
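The Smithson and Verkuilen transform pulls observations off the 0/1 boundary into the open interval (0, 1) that beta regression requires. A one-line Python illustration:

```python
def shrink_proportion(m, n):
    # Smithson & Verkuilen (2006): m' = (m(n - 1) + 0.5) / n.
    # Maps m = 1 to just below 1 and m = 0 to just above 0,
    # so the beta likelihood is defined at the boundaries.
    return (m * (n - 1) + 0.5) / n
```

For example, with n = 100, a perfect score of 1.0 becomes 0.995 and a score of 0.0 becomes 0.005.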
Results
Although Spearman pairwise correlations of the Analytical Indices and maximum recordable frequency were low on average (mean = 0.32, IQR = 0.22), we found some strongly correlated sets of indices (Fig. 2). ADI, Bio and NDSI all show strong similarities and are closely correlated with maximum recordable frequency; AEve and H are also strongly correlated (Fig. 2). Some features of the AudioSet Fingerprint correlate with each other and with maximum frequency, but in general these features are more weakly correlated (mean = 0.14, IQR = 0.18; Supplementary 4b).
Impact of Compression
Impact of Compression: Like-for-Like Differences
All indices showed both observable differences under compression and clear trends with increasing compression (confirmed with Spearman’s rank correlation, all p < 0.001, Supplementary 5b). Responses followed three broad qualitative patterns, illustrated here using results from the 5-minute audio samples (other recording lengths in Supplementary 5a). (1) Indices that were only affected above a threshold level of compression (AudioSet Fingerprint: CBR16; M: CBR32; and NDSI: CBR8); these typically showed low absolute differences (median D typically < 15%). (2) AEve and H, which showed the biggest differences at an intermediate compression level (CBR64) and relatively low absolute differences (median D typically < 30%). (3) The remaining indices, which showed a variety of responses: ADI responded monotonically above a threshold, ACI changed up to CBR64 and then stabilised, and Bio increased in a stepped pattern. All three, however, showed large and increasing changes in absolute D (median D often > 75%) with increasing compression.
Impact of Recording Schedule: Recording Length
Three of the seven Analytical Indices (43%; ADI, AEve and H), and a smaller proportion of the AudioSet Fingerprint features (46 of 128; 36%), showed non-homogeneous variance between groups of different recording length (p < 0.05, Levene’s test for homogeneity of variance, Supplementary 6b).
Impact of Index Selection
Classifiers derived from 5-minute recordings of raw audio showed higher accuracy for the AudioSet Fingerprint (93.8%) than for the Analytical Indices (80.9%; Table 2). This advantage held across all recording lengths and performance metrics, with gains of around 12-13% in accuracy, precision and recall (Supplementary 7b).
Compression decreased accuracy for both the AudioSet Fingerprint (CBR8: 90.8%) and the Analytical Indices (CBR8: 75.1%, Table 2). Classifiers trained on compressed AudioSet Fingerprints, however, still outperformed those trained on uncompressed Analytical Indices. For both index types, this reflected a decreased ability to differentiate logged and primary forest. Interestingly, both index types showed better discrimination between cleared land and logged forest under strong compression. These patterns were repeated across recording lengths (Supplementary 5a).