Methods
Camera Trap Study
The subset of images used to train the model was pulled from a camera
trap study consisting of 170 cameras, which were deployed for up to
three years across two regions of South Carolina (see Supplementary
material Appendix 1 for camera trap study details). We acquired images
for the training and testing datasets from 50 camera locations in each
region within two separate one-month time frames. The complete dataset consisted
of 5,277 images of 17 classes, including images from both winter and
summer months to account for seasonal background variation (Table 1).
True negative images were not included because they would not assist in
teaching the model about any of the species classes. A commonly used
90/10 split (e.g. Fink et al. 2019) was utilized to create the training
and testing datasets from the selected images; 90% of images were used
for training and 10% were used for testing.
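For illustration, such a 90/10 split can be reproduced in Python as in the sketch below; the file names and class labels are hypothetical placeholders, and whether the original split was stratified by class is an assumption.

    # Minimal sketch of a 90/10 train/test split, stratified by class so
    # that every class appears in both subsets (an assumption; the file
    # names and labels below are placeholders, not the study's data).
    from sklearn.model_selection import train_test_split

    image_paths = [f"images/deer_{i:03d}.jpg" for i in range(200)] + \
                  [f"images/raccoon_{i:03d}.jpg" for i in range(200)]
    labels = ["deer"] * 200 + ["raccoon"] * 200

    train_paths, test_paths, train_labels, test_labels = train_test_split(
        image_paths, labels, test_size=0.10, stratify=labels, random_state=42)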
The basic process (Fig. 1) included selecting and labeling a subset of
images from our camera trap image repository (see Supplementary material
Appendix 1 for details) for transfer training, in order to adapt a
pre-made neural network to our image set. To begin, a subset of images
was created by
selecting 500 images of each species in a variety of positions within
the field of view (Fig. 1, Step 1). Once a class (a species being
classified) reached 500 images, only images that contributed a unique
perspective of the animal were added to the training dataset, in order
to give the model a better generalization of the animal. The number of
images per class was limited to ensure the model did not favor any one
class simply because it was represented by more images. Although more
than 500 images were added to some classes, class weights were not
influenced and remained comparable.
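A minimal sketch of this class-capping step is given below; in the study, images beyond the 500-image target were chosen manually for unique perspectives, whereas this placeholder simply samples at random.

    # Sketch: cap each class near a 500-image target to limit class
    # imbalance. Random sampling stands in for the manual selection of
    # unique perspectives described above; inputs are hypothetical.
    import random
    from collections import defaultdict

    def cap_per_class(labeled_images, cap=500, seed=42):
        # labeled_images: list of (image_path, class_label) pairs.
        by_class = defaultdict(list)
        for path, label in labeled_images:
            by_class[label].append(path)
        rng = random.Random(seed)
        capped = []
        for label, paths in by_class.items():
            rng.shuffle(paths)
            capped.extend((p, label) for p in paths[:cap])
        return capped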
Feature Extraction
Use of a supervised training process with human-generated bounding boxes
(Supplementary material Appendix 2) increases the accuracy of detection
and classification. LabelImg (Tzutalin 2015), a graphical image
annotation tool, was used to establish ground truths (locations of all
objects in an image) and create the records needed for our supervised
training process. This software allows a user to define a box containing
the object and automatically generates a CSV file with the coordinates
of the bounding box as well as the class defined by the user.
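LabelImg typically writes one annotation file per image in Pascal VOC XML format; a sketch of collating those annotations into a single CSV of ground truths is shown below (the paths and column names are illustrative assumptions, not the exact format used in the study).

    # Sketch: convert LabelImg Pascal VOC XML annotations into one CSV of
    # ground truths (file paths and column names are assumptions).
    import csv, glob
    import xml.etree.ElementTree as ET

    with open("ground_truths.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["filename", "class", "xmin", "ymin", "xmax", "ymax"])
        for xml_path in glob.glob("annotations/*.xml"):
            root = ET.parse(xml_path).getroot()
            filename = root.findtext("filename")
            for obj in root.findall("object"):
                box = obj.find("bndbox")
                writer.writerow([
                    filename,
                    obj.findtext("name"),  # class label assigned by the user
                    int(box.findtext("xmin")), int(box.findtext("ymin")),
                    int(box.findtext("xmax")), int(box.findtext("ymax")),
                ])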
Classification Training
A transfer training process to adapt a premade neural network (Fig. 1,
Step 3) was employed to create an identification and classification
model. We transformed the CSV file generated by the feature extraction
process into a compatible tensor dataset for training, following the
methodologies laid out in the Tensorflow (Abadi et al. 2015) package
documentation. Tensorflow is an open-source Python library from Google
for building identification and classification models. The Tensorflow
transfer training process required a clone of the Tensorflow repository,
in combination with a customized model configuration file defining
training parameters (Table 2).
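A sketch of the conversion step is shown below: each image's boxes and labels from the CSV are packed into a tf.train.Example record, the format consumed by the Tensorflow object detection training pipeline (the feature keys follow that pipeline's convention; the function name and inputs are assumptions).

    # Sketch: encode one image's bounding boxes and labels as a
    # tf.train.Example (hypothetical helper; inputs come from the CSV).
    import tensorflow as tf

    def make_example(image_bytes, filename, width, height,
                     classes_text, class_ids, xmins, xmaxs, ymins, ymaxs):
        # Box coordinates are normalized to [0, 1], as the pipeline expects.
        feature = {
            "image/encoded": tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[image_bytes])),
            "image/filename": tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[filename.encode()])),
            "image/width": tf.train.Feature(
                int64_list=tf.train.Int64List(value=[width])),
            "image/height": tf.train.Feature(
                int64_list=tf.train.Int64List(value=[height])),
            "image/object/bbox/xmin": tf.train.Feature(
                float_list=tf.train.FloatList(value=[x / width for x in xmins])),
            "image/object/bbox/xmax": tf.train.Feature(
                float_list=tf.train.FloatList(value=[x / width for x in xmaxs])),
            "image/object/bbox/ymin": tf.train.Feature(
                float_list=tf.train.FloatList(value=[y / height for y in ymins])),
            "image/object/bbox/ymax": tf.train.Feature(
                float_list=tf.train.FloatList(value=[y / height for y in ymaxs])),
            "image/object/class/text": tf.train.Feature(
                bytes_list=tf.train.BytesList(
                    value=[c.encode() for c in classes_text])),
            "image/object/class/label": tf.train.Feature(
                int64_list=tf.train.Int64List(value=class_ids)),
        }
        return tf.train.Example(features=tf.train.Features(feature=feature))

The serialized examples would then be written to a record file (e.g. with tf.io.TFRecordWriter) before training.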
Training Evaluation
The degree of learning that was completed after each step was analyzed
using intersection over union (IOU) as training occurred (Krasin et al.
2017). A greater IOU equates to a higher overlap of generated
predictions versus human labeled regions, thus indicating a better model
(See Supplementary material Appendix 3). Observing an asymptote in IOU
allowed for the determination of a minimum number of steps needed to
train the model for each class and to assess which factors influenced
the training process (e.g. feature qualities, amount of training
images). Because the minimum step number was not associated with image
quantity, we relied on quality assessments, such as animal size and
animal behavior, to determine step requirements.
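For reference, IOU for a single pair of boxes can be computed as in the sketch below, assuming each box is an (xmin, ymin, xmax, ymax) tuple in pixel coordinates.

    # Sketch: intersection over union (IOU) between a predicted box and a
    # human-labeled box, each given as (xmin, ymin, xmax, ymax).
    def iou(box_a, box_b):
        ax1, ay1, ax2, ay2 = box_a
        bx1, by1, bx2, by2 = box_b
        # Overlap rectangle; zero width or height means no overlap.
        inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
        inter_h = max(0, min(ay2, by2) - max(ay1, by1))
        inter = inter_w * inter_h
        union = ((ax2 - ax1) * (ay2 - ay1)
                 + (bx2 - bx1) * (by2 - by1)
                 - inter)
        return inter / union if union else 0.0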
Following training, final discrepancies between the model output and the
labeled ground truths were summarized into confusion matrices (generated
by scikit-learn, Table 3), including false positives (FP), false
negatives (FN), true positives (TP), true negatives (TN), and
misidentifications. Several metrics
were calculated to evaluate aspects of model performance (Fig. 2).
Relying on accuracy alone may result in an exaggerated confidence in the
model’s performance, so to avoid this bias, the model’s precision,
recall, and F-1 score were also calculated. Precision accounts for FPs,
recall accounts for FNs, and F-1 is the harmonic mean of the two
(Fig. 2). Due to the large proportion of TNs
associated with camera trap studies, F-1 score does not include TNs in
order to focus on measuring the detection of TPs.
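The metrics in Fig. 2 reduce to simple functions of the confusion-matrix counts, sketched below; only accuracy uses TNs.

    # Sketch of the evaluation metrics computed from confusion-matrix counts.
    def precision(tp, fp):
        return tp / (tp + fp) if tp + fp else 0.0

    def recall(tp, fn):
        return tp / (tp + fn) if tp + fn else 0.0

    def f1(tp, fp, fn):
        # Harmonic mean of precision and recall; true negatives are ignored.
        p, r = precision(tp, fp), recall(tp, fn)
        return 2 * p * r / (p + r) if p + r else 0.0

    def accuracy(tp, tn, fp, fn):
        return (tp + tn) / (tp + tn + fp + fn)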
In addition, the metrics were further separated into evaluations for
identification and classification purposes. Identification (ID) models
would focus only on finding objects and therefore deem
misidentifications as correct because the object was found.
Classification (CL) models would not deem misidentifications as correct.
Finally, accuracy, precision, recall, and F-1 were calculated at a
variety of confidence thresholds (CT), a parameter constraining the
lower limit of confidence necessary for a classification proposal, to
determine the threshold that resulted in the highest value of the metric
we wished to optimize.
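A sketch of such a threshold sweep is given below, assuming a hypothetical list of detections scored as (confidence, correct) pairs and a count of labeled objects, with F-1 as the metric being optimized.

    # Sketch: sweep confidence thresholds (CT) and keep the value that
    # maximizes F-1 (inputs are hypothetical placeholders).
    def best_threshold(detections, n_ground_truths, thresholds):
        # detections: list of (confidence, is_true_positive) pairs.
        best_ct, best_f1 = None, -1.0
        for ct in thresholds:
            kept = [d for d in detections if d[0] >= ct]
            tp = sum(1 for _, correct in kept if correct)
            fp = len(kept) - tp
            fn = n_ground_truths - tp
            p = tp / (tp + fp) if tp + fp else 0.0
            r = tp / (tp + fn) if tp + fn else 0.0
            f1 = 2 * p * r / (p + r) if p + r else 0.0
            if f1 > best_f1:
                best_ct, best_f1 = ct, f1
        return best_ct, best_f1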
Validation
To confirm results acquired from testing the model, it was essential to
evaluate a validation set of images. This validation set was formed by
randomly selecting five cameras from a 12-week period separate from the
training dataset, but within the same larger dataset. The validation
subset consisted of 10,983 images, including true negatives. The set was
run using the optimal CT for F-1 score determined from the test data.
These images were also labeled using LabelImg to automate the calculation of
evaluation metrics. The validation set scores and test scores should be
compared to determine if the model is overfitted, meaning the test set
is not representative of the validation set. Possible reasons for such a
mismatch may be that the background environment has changed dramatically
or species not included in the test set have appeared.
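As a simple illustration, this comparison can be expressed as a relative drop in F-1 from test to validation; the 10% tolerance below is an arbitrary illustrative choice, not a value from the study.

    # Sketch: flag potential overfitting by comparing test and validation
    # F-1 at the chosen confidence threshold (tolerance is illustrative).
    def looks_overfitted(test_f1, validation_f1, tolerance=0.10):
        return (test_f1 - validation_f1) > tolerance * test_f1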