LIBRARIES AND TOOLS
Python is a user-friendly, general-purpose programming language; its ease of use and readability have made it the basis for many data science libraries and, more recently, deep learning frameworks. Fast.ai is a high-level AI library built on top of PyTorch, the open-source deep learning library released by Facebook in 2017 \cite{Paszke2017AutomaticDI,howard2018fastai}. The library specializes in rapidly implementing state-of-the-art techniques from newly published research papers. Jupyter notebooks were used as the platform for the Python code because of their ease of use and reproducibility; they have become the go-to tool for data scientists, as notebooks can be shared and run far more easily than scripts, which are often cryptic to non-experts \cite{Kluyver:2016aa,randles2017using}.
Google Colab is our training platform because of the free GPUs that Google offers. GPUs have been shown to be substantially faster than CPUs for the matrix multiplications at the heart of deep learning training \cite{shi2016benchmarking,unknown}. These processing units, which started out powering video games, have become the standard hardware for training deep learning models.
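As a quick sanity check before training, one can verify that a GPU is actually visible to PyTorch. The following is a minimal sketch, assuming PyTorch is installed (it is preinstalled on Colab); the helper name \texttt{detect\_device} is our own, not part of any library:

```python
def detect_device():
    """Return a short description of the best available compute device."""
    try:
        import torch  # preinstalled on Google Colab
        if torch.cuda.is_available():
            # Report the GPU model, e.g. "cuda (Tesla T4)"
            return "cuda (%s)" % torch.cuda.get_device_name(0)
        return "cpu"
    except ImportError:
        return "cpu (PyTorch not installed)"

print(detect_device())
```

If this prints \texttt{cpu} on Colab, a GPU runtime can be enabled via Runtime > Change runtime type.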
In this paper, we use a Jupyter notebook running on Google Colab so that readers can implement the classifier themselves as they read. We default to Google Colab because most of the configuration is already set up on this platform, which lets us focus on the problem of building a species classifier rather than on environment configuration.
DATA COLLECTION
Data collection is the first, and an important, phase in the AI model development pipeline, as the data collected largely determines model accuracy or the lack thereof. Many approaches can be employed, from data discovery and data augmentation to data creation \cite{roh2018survey}. In our ecological context, depending on the species of interest, data can be gathered manually or acquired from other sources. For image classification tasks such as our case study, online repositories such as iNaturalist and GBIF host large volumes of image data that can be accessed through APIs or other data mining approaches \cite{gbif,inaturalist}. Here we assume that you already have a relatively clean, curated dataset with balanced classes for each species and no noise in the ground-truth labels, and that the problem is a supervised learning challenge, i.e., one where the labels of the dataset are already known. Other interesting methods, beyond the scope of this study, are unsupervised learning, where the algorithm discovers patterns in the data itself, and reinforcement learning.
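Since we assume a curated dataset with balanced classes, it is worth verifying that balance before training. A minimal standard-library sketch, assuming images are stored in one sub-folder per species (the helper name \texttt{class\_counts} and the folder layout are our own conventions):

```python
from collections import Counter
from pathlib import Path

def class_counts(data_dir):
    """Count images per species, assuming one sub-folder per class,
    e.g. data_dir/pisaster_ochraceus/img001.jpg."""
    counts = Counter()
    for img in Path(data_dir).glob("*/*"):
        # Only count common image formats; skip notes, metadata, etc.
        if img.suffix.lower() in {".jpg", ".jpeg", ".png"}:
            counts[img.parent.name] += 1
    return counts
```

Large disparities between the counts returned here would call for re-balancing, for example by collecting more images of the rarer species, before training.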
Species Image Data used for Classification
In this case study we will use three sea star species. Sea stars are important to our understanding of marine invertebrate communities: the intertidal relationship between the sea star Pisaster ochraceus and the mussel Mytilus californianus was used to coin the term keystone species \cite{Paine_1966}. Following that classical study, sea stars are an interesting choice of model species for prototyping the classifier AI system. Further, sea stars have complicated morphology that can challenge even expert humans; for these reasons we use them to prototype our AI system. Figure \ref{488749} illustrates the general workflow of a deep learning pipeline meant to achieve a minimum viable product:
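The core of such a workflow can be sketched with the fast.ai library. This is a hedged sketch, assuming the current fastai (v2) API and a folder-per-species dataset layout, not the exact code from our notebook; the helper name \texttt{build\_learner} is our own:

```python
# Sketch of the pipeline core: load labelled images, attach a pretrained CNN.
# Guarded with try/except so the snippet degrades gracefully when fastai is
# not installed (on Colab it can be installed with `pip install fastai`).
try:
    from fastai.vision.all import (ImageDataLoaders, vision_learner,
                                   accuracy, resnet34)

    def build_learner(data_path):
        """Build a transfer-learning model from a folder-per-class dataset."""
        # Hold out 20% of images for validation, with a fixed seed
        # so the split is reproducible across runs.
        dls = ImageDataLoaders.from_folder(data_path, valid_pct=0.2, seed=42)
        return vision_learner(dls, resnet34, metrics=accuracy)
    # Training would then be: learn = build_learner(path); learn.fine_tune(3)
except ImportError:
    build_learner = None  # fastai not available in this environment
```

The pretrained \texttt{resnet34} backbone means the model starts from general image features and only needs to be fine-tuned on our sea star images, which keeps training fast on a free Colab GPU.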