We are provided with a large data set of feature values. We represent these data points as a matrix X, where each row holds the feature values at one data point. The output can similarly be expressed as a vector y. Linear regression is the task of finding a set of weights β such that ŷ = Xβ provides a good estimate for y. We evaluated three regression models: Ordinary Least-Squares regression (OLS), Spatial Error Regression (SER), and Random Forest Regression.
OLS regression chooses the β that minimizes the sum of squared residuals, ‖y − Xβ‖². This minimizer has a closed form: when XᵀX is invertible, β = (XᵀX)⁻¹Xᵀy. We use the Python Spatial Analysis Library (pysal) package to compute the regression.
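As a minimal sketch of the algebraic solution, the example below fits β on synthetic data using NumPy rather than pysal. The function name `fit_ols`, the intercept column, and the data itself are assumptions made for illustration and are not part of our pipeline.

```python
import numpy as np

def fit_ols(X, y):
    """Return the weights beta that minimize ||y - X beta||^2."""
    # lstsq solves the least-squares problem via an SVD, which is more
    # stable than forming (X^T X)^{-1} X^T y explicitly.
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Hypothetical data: 200 points, 3 features plus an intercept column.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 3))
X = np.column_stack([np.ones(200), features])
true_beta = np.array([1.0, 2.0, -0.5, 0.3])
y = X @ true_beta + 0.01 * rng.normal(size=200)

beta_hat = fit_ols(X, y)   # estimated weights
y_hat = X @ beta_hat       # fitted values
residuals = y - y_hat      # u = y - X beta
```

At the minimizer the residuals are orthogonal to every column of X, which gives a quick sanity check on any OLS fit.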
SER accounts for spatial autocorrelation in the dataset and is well suited to remote sensing applications. Instead of just minimizing the residual vector u = y − Xβ, we try to filter out the autocorrelation by letting u = ρWu + ε, where ρ is a scalar, W is a weight matrix representing the spatial autocorrelation in the dataset, and ε is the remaining error term. SER aims to leave the final residuals ε with as little spatial autocorrelation as possible. The leftover residuals should therefore appear random, and the model corrects for regional biases such as localized weather events. The weight matrix associates each pixel with its nearby pixels. In our case it can be thought of as an adjacency matrix over the pixels of the image: the row for a given pixel is nonzero only at its 8 neighboring pixels. This matrix is enormous but sparse, and we use pysal to compute the spatial error regression with it.
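The 8-neighbor structure of W can be sketched in plain Python, independently of pysal. The helper names `queen_neighbors` and `row_standardized_weights`, the row-major pixel indexing, and the choice to row-standardize the weights are all assumptions made for this illustration.

```python
def queen_neighbors(n_rows, n_cols):
    """Map each pixel index to the indices of its (up to 8) adjacent pixels."""
    neighbors = {}
    for r in range(n_rows):
        for c in range(n_cols):
            adjacent = []
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue  # a pixel is not its own neighbor
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < n_rows and 0 <= cc < n_cols:
                        adjacent.append(rr * n_cols + cc)
            # Pixels are indexed in row-major order (assumption).
            neighbors[r * n_cols + c] = adjacent
    return neighbors

def row_standardized_weights(neighbors):
    """Give each neighbor equal weight so that every row of W sums to 1."""
    return {i: {j: 1.0 / len(adj) for j in adj} for i, adj in neighbors.items()}

# Small hypothetical image: 4 rows by 5 columns.
neighbors = queen_neighbors(4, 5)
W = row_standardized_weights(neighbors)

n_pixels = 4 * 5
nonzeros = sum(len(adj) for adj in neighbors.values())
density = nonzeros / n_pixels**2  # fraction of W that is nonzero
```

Each row has at most 8 nonzero entries regardless of image size, so the number of stored entries grows linearly with the pixel count while a dense matrix would grow quadratically.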
Random Forest Regression provides a nonlinear alternative that does not estimate a coefficient vector β. Instead, we fit a large group of shallow decision trees, each grown to minimize the mean squared error on the training data, and average their predictions.
Our random forests are implemented in Python using scikit-learn, which can train the individual trees of a forest in parallel. The number of parallel jobs is controlled by the n_jobs parameter: by default a single job is used, and setting n_jobs=-1 uses all available cores. In either case the number of jobs is capped at the number of trees.
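A minimal sketch of such a forest in scikit-learn is shown below. The synthetic data and the hyperparameter values (100 trees, a maximum depth of 5) are assumptions chosen for illustration and are not the settings used in our experiments.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical nonlinear data: 500 points with 4 features.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(500, 4))
y = np.sin(3.0 * X[:, 0]) + X[:, 1] ** 2 + 0.05 * rng.normal(size=500)

forest = RandomForestRegressor(
    n_estimators=100,  # number of trees in the forest
    max_depth=5,       # keep each tree shallow
    n_jobs=-1,         # train trees in parallel on all available cores
    random_state=0,    # make the fit reproducible
)
forest.fit(X, y)

y_hat = forest.predict(X)  # average of the individual tree predictions
train_mse = float(np.mean((y - y_hat) ** 2))
```

Passing a positive integer as n_jobs instead of -1 fixes the number of parallel jobs by hand.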