Introduction
Animal vocalisations combine with natural and human-made sounds to
form soundscapes, which can be used to monitor species populations or
to infer community-level metrics such as biodiversity (Roca and Proulx,
2016; Eldridge et al., 2018; Gómez, Isaza and Daza, 2018). Such
monitoring is crucial for responding effectively to threats (Rapport, 1989;
Rapport, Costanza and McMichael, 1998). Historically, species presence
and abundance were commonly monitored by in situ expert listeners
(Huff et al., 2000), but this approach is costly and time-consuming, can
damage habitats, and is prone to narrow focus and observer bias
(Fitzpatrick et al., 2009; Costello et al., 2016).
Advances in portable computing now permit remote recording of
soundscapes, but produce volumes of data that preclude manual review,
driving the development of automated and semi-automated methods of
analysis (Towsey, Truskinger and Roe, 2016; Sethi et al., 2020).
Soundscape composition is primarily assessed using acoustic indices –
summary statistics that describe the distribution of acoustic energy
within a recording (Towsey et al., 2014) – and over 60 such
Analytical Indices capturing aspects of biodiversity have been
developed (Sueur et al., 2014; Buxton et al., 2018). These
are commonly used in combination to compare the occupancy of acoustic
niches, temporal variation, and the general level of acoustic activity
(Bradfer-Lawrence et al., 2019) across ecological gradients or in
classification tasks (Gómez, Isaza and Daza, 2018). These approaches
have provided novel insight into ecosystems across the world (Fuller
et al., 2015; Buxton et al., 2016; Eldridge et al., 2018; Sueur, Krause
and Farina, 2019) but are not foolproof and often have poor
transferability (Mammides et al., 2017; Bohnenstiehl et al., 2018). This
may result from a lack of standardisation: differing index selection,
data storage methods and recording protocols all lead to unassessed
variation in experimental outputs (Araya-Salas, Smith-Vidaurre and
Webster, 2019; Bradfer-Lawrence et al., 2019; Sugai et al., 2019).
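To make concrete what an acoustic index summarises, the sketch below computes one widely used example, the Acoustic Complexity Index, directly from a spectrogram. It is a minimal illustration only: the file name and spectrogram settings are assumptions, and published packages (e.g. scikit-maad or seewave) provide reference implementations.

```python
# Minimal sketch of an acoustic index: the Acoustic Complexity Index (ACI).
# File name and spectrogram settings are illustrative assumptions.
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

rate, audio = wavfile.read("soundscape.wav")              # hypothetical 16-bit mono recording
audio = audio.astype(np.float32) / np.iinfo(np.int16).max

# Short-time spectrogram: rows are frequency bins, columns are time frames
freqs, times, sxx = spectrogram(audio, fs=rate, nperseg=1024)

# ACI: per frequency bin, the summed absolute change in intensity between
# adjacent frames, normalised by that bin's total intensity, summed over bins
diffs = np.abs(np.diff(sxx, axis=1)).sum(axis=1)
totals = sxx.sum(axis=1) + 1e-12                          # guard against silent bins
aci = float((diffs / totals).sum())
print(f"ACI = {aci:.2f}")
```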
The AudioSet convolutional neural network (CNN; Gemmeke et al., 2017;
Hershey et al., 2017) is an attractive replacement for Analytical
Indices. This pre-trained, general-purpose audio classifier generates a
multi-dimensional acoustic fingerprint of a soundscape that is a more
effective ecological descriptor (Sethi et al., 2020). The CNN is
trained on AudioSet, a collection of two million human-labelled
anthropogenic and environmental audio samples, potentially giving it
greater transferability and discrimination than classifiers trained on
typical ecoacoustic datasets.
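As an illustration of this kind of acoustic fingerprint, the sketch below extracts frame-level embeddings with the pre-trained VGGish model published on TensorFlow Hub. The hub URL, the resampling step and the clip-level averaging are assumptions made for illustration, not the feature-extraction pipeline used in this study.

```python
# Minimal sketch: extract an AudioSet-style acoustic fingerprint with the
# pre-trained VGGish model from TensorFlow Hub (illustrative only).
import tensorflow_hub as hub
import librosa

vggish = hub.load("https://tfhub.dev/google/vggish/1")

# VGGish expects a mono waveform in [-1, 1] sampled at 16 kHz
waveform, _ = librosa.load("soundscape.wav", sr=16000, mono=True)

# One 128-dimensional embedding is returned per ~0.96 s frame of audio
embeddings = vggish(waveform).numpy()                     # shape: (n_frames, 128)
fingerprint = embeddings.mean(axis=0)                     # simple clip-level summary
print(fingerprint.shape)                                  # (128,)
```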
In ecoacoustics, continuous uncompressed or lossless recording is
generally recommended (Villanueva-Rivera et al., 2011; Browning et al.,
2017), but generates very large files. We consider two commonly used
approaches to reducing storage requirements (Towsey, 2018). The first is
MP3 compression, which is widely used in ecoacoustic studies (e.g. Saito
et al., 2015; Zhang et al., 2016; Sethi et al., 2018): this lossy
encoding removes acoustic information inaudible to human listeners but
is suspected of also removing ecologically important data (e.g. Towsey,
Truskinger and Roe, 2016; Sugai et al., 2019). Araya-Salas,
Smith-Vidaurre and Webster (2019) recently showed that ecological
information is lost under high compression of recordings of isolated
animal calls; however, it is not known whether this extends to
recordings of noisier whole soundscapes.
The second is the recording schedule, which also varies among
ecoacoustic studies (Sugai et al., 2019). Bradfer-Lawrence et al. (2019)
showed that longer, more continuous schedules give more stable
Analytical Index values. However, soundscape composition varies with
time of day (Fuller et al., 2015; Bradfer-Lawrence et al., 2019; Sethi
et al., 2020), so separating recording windows may reduce temporal
variation and improve classification (Sugai et al., 2019) even with
less data. Similarly, calculating indices on longer recordings may
average away anomalous calls and short-term patterns.
While clear standards are crucial for collaborative research in
ecoacoustics, there is uncertainty in the literature on the impacts of
the selection of index type, compression level and recording schedule.
Here, we:
contrast the classification accuracy of different index selection choices; and
describe the effects of compression, recording length and temporal
subsetting on the values, variance and classification performance of
indices.
By describing how well ecological information is retained in acoustic
data under different recording decisions, we identify stronger standards
that both improve performance and provide a basis for more extensive
meta-analysis.
Methods and Materials
Study Area
Acoustic samples were collected in Sabah, Malaysia, at the Stability of
Altered Forest Ecosystems (SAFE) project: a large-scale ecological
experiment on the effects of habitat loss and fragmentation on tropical
forests (Ewers et al., 2011), with sites in the Kalabakan Forest Reserve (KFR).
Historically, logging within KFR has been heterogeneous, reflecting
habitat modifications in the wider area (Struebig et al., 2013),
with higher than typical timber extraction rates. Habitat ranges from
areas of grass and low shrub, through logged forest to almost
undisturbed primary forest.
Soundscape Recording
Data were collected from three KFR sites representing a gradient in
above-ground biomass (AGB; Pfeifer et al., 2016) (Fig. 4a):
primary forest (AGB = 66.16 t ha⁻¹), logged forest
(AGB = 30.74 t ha⁻¹) and cleared forest (AGB = 17.37
t ha⁻¹) (Supplementary 1). We recorded for an average
of 72 hours at each site (range: 70 to 75) during February and March
2019 (Supplementary 2a). No rain fell during the recording period, so no
recordings were excluded due to confounding geophony (Zhang et al.,
2016). At all sites, omnidirectional recorders (AudioMoth; Hill et al.,
2018) were attached to trees (~50 cm diameter), 1-2 m above the ground,
and recorded continuously as 20-minute uncompressed samples (‘raw’, .wav
format) at 44.1 kHz and 16-bit depth.
Compressing and Re-Sizing the Raw Audio
Continuous 20-minute recordings were first split into recordings of 2.5,
5.0 and 10.0 minutes using the Python package pydub (Webbie et al.,
2018) (Fig. 1b). The audio was then converted to lossy MP3 format using
the fre:ac LAME encoder under the two standard LAME MP3 encoding
techniques: constant bit rate (CBR) and variable bit rate (VBR)
compression (Fig. 1c). CBR reduces the file size to a specified number
of kilobits per second; VBR varies the bitrate from second to second
depending on analysis of the acoustic content and a quality setting
(0 = highest quality, larger bitrate; 9 = lowest quality, smaller
bitrate). Since bitrates are not directly comparable between VBR and CBR
– and because storage savings are often the principal driver of
compression choices – we use compressed file size as our measure of
compression level. We used VBR0 and CBR320, CBR256, CBR128, CBR64,
CBR32, CBR16 and CBR8, resulting in file sizes ranging from 41.6%
(CBR320) to 1.04% (CBR8) of the original raw file size, with some
reductions in maximum coded frequency (Table 1). We do not consider
lossless compression, as its storage requirements remain much higher
than those of lossy formats and the decompressed files are, by
definition, identical to the originals. Previous studies have also found
that losslessly compressed audio is largely identical to raw audio
(Linke and Deretic, 2020).
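For illustration, the sketch below reproduces the splitting and compression steps with pydub and its ffmpeg backend. The study itself used the fre:ac LAME encoder, so the export calls, file names and bitrates shown here are assumptions intended only to show the general workflow.

```python
# Minimal sketch of splitting a 20-minute recording and exporting MP3 versions.
# The study used the fre:ac LAME encoder; this pydub/ffmpeg version is illustrative.
from pydub import AudioSegment

raw = AudioSegment.from_wav("site_recording_20min.wav")   # hypothetical raw file

# Split the continuous recording into fixed-length clips (here 2.5 minutes)
clip_ms = int(2.5 * 60 * 1000)
clips = [raw[start:start + clip_ms] for start in range(0, len(raw), clip_ms)]

# Export each clip at a constant bitrate (e.g. CBR128) and as VBR quality 0
for i, clip in enumerate(clips):
    clip.export(f"clip_{i:02d}_cbr128.mp3", format="mp3", bitrate="128k")
    clip.export(f"clip_{i:02d}_vbr0.mp3", format="mp3",
                parameters=["-q:a", "0"])                  # ffmpeg/LAME VBR quality flag
```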