Appendix 2: Bounding Boxes
We used bounding boxes to establish ground truth in our study. Bounding boxes increase the information content of each image, allowing us to train our model on far fewer images. Each box gives the model the location of an object, delineating the object from background noise (SI Fig. 3, human labeled). Without bounding boxes, the model has more difficulty recognizing common patterns across similar objects, and identification is further complicated when repeated, uncorrelated, confounding objects or background noise are present.
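As an illustration of what one human-labeled ground truth looks like, the sketch below stores a single bounding box as a class label plus normalized coordinates. The record layout and the (center x, center y, width, height) convention are assumptions chosen for illustration, not the exact annotation format used in our study.

```python
from dataclasses import dataclass

@dataclass
class BoxAnnotation:
    """One human-labeled ground-truth box (hypothetical layout).

    Coordinates are normalized to [0, 1] relative to image width
    and height, in (center x, center y, width, height) order.
    """
    label: str       # species/class name, e.g. "turkey"
    x_center: float
    y_center: float
    width: float
    height: float

# Example: a single labeled animal occupying part of the frame.
truth = BoxAnnotation(label="turkey", x_center=0.62,
                      y_center=0.55, width=0.30, height=0.40)
```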
Once trained, the model identifies and classifies all objects by placing bounding boxes: a box, the corresponding label for that object, and a feature score. The feature score is the percent likelihood that the detected object reflects its label. Our model correctly identified the objects in images 1-3.
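A detection returned by the trained model can be thought of as the same box plus a feature score. The record below is a minimal sketch of that structure; the field names are assumptions for illustration, not the output schema of any particular detection framework.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """One model detection: box, predicted label, and feature score.

    The feature score is the percent likelihood (0-100) that the
    detected object reflects the predicted label.
    """
    label: str
    x_center: float   # normalized box coordinates, as above
    y_center: float
    width: float
    height: float
    feature_score: float  # e.g. 87.5 means 87.5% likelihood

detection = Detection("turkey", 0.61, 0.54, 0.31, 0.42,
                      feature_score=87.5)
```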
The model can be more precise than human labelers in locating objects; for example, in image 6 the model correctly labeled a turkey's tail feathers that human labelers had not labeled correctly. The model may also detect objects incorrectly (image 5), typically with low confidence. The confidence threshold (CT) was set at 50%, so any object detected with more than 50% confidence was displayed. The CT can be raised to suppress low-confidence objects, but leaving it lower during training gives insight into errors that may affect validation accuracy and F-1 score. For the misidentification in image 5, for example, additional images with the same background, or additional images of grey squirrels, can be added to the training set to help the model distinguish the misidentified object.
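Applying the CT amounts to a one-line filter over the model's detections. The sketch below uses simplified (label, feature score) pairs, which are a stand-in for the full detection records; raising the threshold suppresses low-confidence detections like the one in image 5, at the cost of hiding the errors that are informative during training.

```python
# Each detection is simplified to a (label, feature_score_percent)
# pair; box coordinates are omitted for brevity.
all_detections = [("turkey", 92.3), ("grey squirrel", 51.8),
                  ("turkey", 34.6)]

def apply_confidence_threshold(detections, ct=50.0):
    """Keep only detections whose feature score exceeds the CT (%)."""
    return [d for d in detections if d[1] > ct]

displayed = apply_confidence_threshold(all_detections, ct=50.0)
print(displayed)  # the 34.6% detection is suppressed
```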
Image 4 shows an example of object splitting, in which one object is identified by two bounding boxes. Object splitting causes problems when counting the number of individuals in an image. Again, adding similar images of events where object splitting occurs increases the chance of correct bounding boxes.
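One common way to catch object splitting after the fact is to check whether two boxes overlap heavily, measured by intersection over union (IoU), and merge them before counting. The sketch below is one such heuristic, not the procedure used in our study; boxes are given as (x_min, y_min, x_max, y_max) pixel coordinates, and the 0.5 merge threshold is an assumed value.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def count_individuals(boxes, merge_iou=0.5):
    """Count boxes, merging same-object splits that overlap heavily.

    A greedy pass: each box is folded into an existing group when its
    IoU with that group's box exceeds merge_iou, so a split object is
    counted once instead of twice.
    """
    groups = []
    for box in boxes:
        for i, g in enumerate(groups):
            if iou(box, g) > merge_iou:
                # Merge by taking the union of the two boxes.
                groups[i] = (min(box[0], g[0]), min(box[1], g[1]),
                             max(box[2], g[2]), max(box[3], g[3]))
                break
        else:
            groups.append(box)
    return len(groups)

# Two heavily overlapping boxes on one animal count as one individual.
split_boxes = [(10, 10, 60, 80), (15, 12, 65, 85)]
print(count_individuals(split_boxes))  # -> 1
```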
These types of discrepancies suggest that a thorough analysis of camera trap imagery requires a combination of AI prescreening and human labelers.