3.2 Satellite Data

Using Google Earth Engine, we developed a script to preprocess and download Landsat-7 satellite imagery. Customizing existing script libraries, we developed code to: (i) collect one-year batches of satellite imagery; (ii) mask cloudy pixels using the Landsat data's cloud-cover quality band; (iii) select the non-cloudy pixels from the year's images; (iv) build a composite image from the median values of those pixels; and (v) export the image to Google Drive.
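The masking and compositing logic of steps (ii)-(iv) can be sketched in NumPy. This is a minimal illustration, not the Earth Engine script itself: the array shapes, the 12-scene stack, and the boolean cloud mask are all assumptions standing in for the real data.

```python
import numpy as np

# Hypothetical stand-ins for one year's stack of Landsat scenes:
# shape (n_images, height, width, n_bands), plus a parallel boolean
# cloud mask derived from the quality band (True = cloudy pixel).
rng = np.random.default_rng(0)
stack = rng.random((12, 4, 4, 11))        # 12 scenes, 11 bands, tiny extent
cloudy = rng.random((12, 4, 4)) > 0.7     # roughly 30% of pixels flagged

# (ii)-(iii): mask cloudy pixels by setting them to NaN
masked = np.where(cloudy[..., None], np.nan, stack)

# (iv): per-pixel, per-band median over the remaining observations
composite = np.nanmedian(masked, axis=0)  # shape (height, width, n_bands)
```

The median over non-cloudy observations yields a single cloud-free composite per year, which is the image then exported in step (v).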
Images for 1999, 2003, 2007, 2011 and 2015 were created in this way. The imagery was downloaded at maximum resolution, with each pixel representing a 30 m × 30 m area on the ground. Each image was converted into a NumPy array with dimensionality 2100 x 2528 x 11. Required image processing steps included clipping the reference image to the training image; removing NaN values; and resampling the reference image so that it covered exactly the same number of pixels and geographic area.
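These preprocessing steps can be sketched as follows. The raster sizes, the NaN positions, and the nearest-neighbour `resample_nearest` helper are illustrative assumptions, not the project's actual files or code:

```python
import numpy as np

def resample_nearest(arr, out_shape):
    """Nearest-neighbour resample of a 2-D raster to out_shape."""
    rows = np.arange(out_shape[0]) * arr.shape[0] // out_shape[0]
    cols = np.arange(out_shape[1]) * arr.shape[1] // out_shape[1]
    return arr[np.ix_(rows, cols)]

# Hypothetical reference (label) raster, slightly larger than the
# Landsat image and containing NaNs.
reference = np.ones((2200, 2650))
reference[0, :5] = np.nan
target_shape = (2100, 2528)  # the training image's pixel grid

# (1) clip the reference roughly to the training image's extent
clipped = reference[:2150, :2600]

# (2) remove NaN values (here, filled with a 'no data' label of 0)
clipped = np.nan_to_num(clipped, nan=0.0)

# (3) resample so both rasters share exactly the same pixel grid
aligned = resample_nearest(clipped, target_shape)
```

After these steps, each reference-image pixel aligns one-to-one with a Landsat pixel, which is what the supervised classification below requires.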

3.3 Random Forest classifier

Two classification methods were evaluated: Random Forest and Support Vector Machines. We proceeded with Random Forest after finding a small advantage in classification accuracy and a substantial advantage in computation time; SVM training time was prohibitive given the size of our data files.
A Random Forest was trained on the image data for 2015. Structuring the problem as a supervised classification exercise, we trained the Random Forest using the reference image as the target value (the 'label' for each pixel) and the Landsat band values as the feature space.
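In this setup, every pixel is one training sample and its eleven band values are the features. A minimal sketch using scikit-learn, with small random arrays standing in for the real 2015 composite and reference image (shapes and class count are assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Illustrative stand-ins for the real rasters
rng = np.random.default_rng(0)
image = rng.random((100, 120, 11))            # (rows, cols, bands)
labels = rng.integers(0, 3, size=(100, 120))  # e.g. 3 land-cover classes

# Flatten: each pixel becomes one sample, its band values the features
X = image.reshape(-1, 11)
y = labels.ravel()

clf = RandomForestClassifier(n_estimators=12, random_state=0)
clf.fit(X, y)

# Predicting and reshaping recovers a classified raster
predicted = clf.predict(X).reshape(labels.shape)
```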
Random Forest is an ensemble method that constructs decision trees and (in the case of classification) takes the modal value of their output. Given these mechanics, we conducted a grid search to tune the key hyperparameters of the classifier, specifically (i) number of trees; (ii) maximum depth; and (iii) minimum sample leaf size. Accuracy was found to improve with increasing numbers of trees up to 12 but tail off thereafter. Given a limited computational budget, the team proceeded with a 12-tree Random Forest classifier.
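A grid search over these three hyperparameters can be sketched with scikit-learn's `GridSearchCV`. The candidate values and the synthetic dataset below are illustrative assumptions, not the grid actually searched:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 11-feature data standing in for the flattened pixel samples
X, y = make_classification(n_samples=300, n_features=11,
                           n_informative=5, random_state=0)

# (i) number of trees, (ii) maximum depth, (iii) minimum leaf size
param_grid = {
    "n_estimators": [4, 8, 12, 16],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3)
search.fit(X, y)
best = search.best_params_  # best combination found on this data
```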

4. Model evaluation

As a proof of concept, we first trained the model on West Houston for 2015 and tested it on East Houston. We subsequently refined this approach by means of k-fold cross-validation. The method is well suited to our data, since we are able to train and test on the same 2015 dataset (the only year for which labeled data could be constructed). The method divides the raster into six equal sections, trains the model on five of them, and tests it on the remaining one, repeating until each section has served as the test set. We used 6-fold cross-validation because it is easily interpretable to audiences: it can be visualized as testing the model on a one-sixth grid section of the city having learned it from the remaining five-sixths of the image, repeated until every pixel has been tested.
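The procedure can be sketched with scikit-learn's `KFold` and `cross_val_score`; the synthetic samples below stand in for the flattened 2015 pixels, and the unshuffled folds approximate the contiguous grid sections described above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic pixel samples standing in for the 2015 raster
X, y = make_classification(n_samples=600, n_features=11,
                           n_informative=5, random_state=0)

# Six contiguous folds: train on five sections, test on the sixth, rotate
cv = KFold(n_splits=6, shuffle=False)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=12, random_state=0), X, y, cv=cv)

overall_accuracy = scores.mean()  # mean accuracy across the six folds
```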
Overall accuracy of the classifier, measured by 6-fold cross-validation, was 76%.