Figure \ref{201735} above shows strong scaling results for training a 12-tree random forest using scikit-learn's built-in concurrency support to vary the number of trees trained simultaneously. Scaling behavior is consistent across model scales: speedup increases nearly linearly up to N=4 concurrent jobs, after which it begins to diverge from the ideal. Because every tree in a random forest must finish training before the model can be exported or otherwise used, choosing a job count that does not evenly divide the total number of trees yields no additional benefit. For example, training 7 to 11 trees at a time gives the same speedup as 6 concurrent jobs: in the former case a second round is still needed to train the remaining 1 to 5 trees while the other threads sit idle, whereas with 6 jobs all threads stay busy through both rounds of training.
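For reference, this tree-level concurrency is controlled through scikit-learn's n_jobs parameter. The following is a minimal sketch of how such a scaling experiment can be set up; the synthetic feature matrix, job counts, and timing loop are illustrative placeholders rather than the exact configuration used in this study.

\begin{verbatim}
import time
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Placeholder data standing in for the terrain/vegetation features (X)
# and the gridded snow depth observations (y).
rng = np.random.default_rng(0)
X = rng.random((50000, 10))
y = rng.random(50000)

# Train the same 12-tree forest while varying how many trees are built
# concurrently; n_jobs is passed through to joblib worker processes/threads.
for n_jobs in (1, 2, 3, 4, 6, 12):
    start = time.perf_counter()
    model = RandomForestRegressor(n_estimators=12, n_jobs=n_jobs, random_state=0)
    model.fit(X, y)
    print(n_jobs, time.perf_counter() - start)
\end{verbatim}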
Discussion
The structure and parameters of models trained at numerous points in time and across scales provide insight into how the predictive power of individual features changes over time. Several noteworthy patterns are observable in these results and are detailed below. It is important to reiterate that this study was not an attempt to build the optimal prediction model. Terrain and vegetation can explain only a moderate portion of the observed variability in snow depth and SWE; numerous other factors (including spatially heterogeneous weather patterns) are also responsible. Instead, this study seeks to quantify how much the influence of individual features changes over time and to assess whether, as part of a larger estimation procedure, terrain and vegetation could be used to (a) interpolate between point observations of SWE or (b) distribute snow measured at coarser resolutions down to finer, more operationally relevant scales. With this in mind, we note the following trends in model structure across the various dimensions we studied.
First, there is a clear structure to the changing influence of specific terrain and elevation features over intra-annual timescales (Figure \ref{697839}). In particular, elevation is most predictive of snow depth during the winter and early spring, the "accumulation phase", while it is much less important in the models during the "ablation phase" of late spring and summer. Relative irradiance (as calculated for a clear-sky April 1 day) is most predictive in late spring and less predictive both earlier and later. One hypothesis for this behavior is that the spatial variation in cumulative insolation is greatest at this point: earlier in the season the entire snowpack has received little insolation, while later in the year the entire watershed receives abundant sun. The physical process driving this statistical pattern is a potential starting point for a follow-up, physically based analysis. Additionally, the relative predictive skill of the model, as measured by standardized RMSE or by $r^2$, decreases later into the season, indicating that vegetation and terrain are better predictors of snow distribution during the accumulation phase than during the ablation phase.
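For concreteness, these two skill metrics can be computed per model and date along the following lines. This is only a sketch with hypothetical arrays, and normalizing RMSE by the standard deviation of the observations is our assumption here, since the exact standardization is not restated in this section.

\begin{verbatim}
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical observed and predicted snow depths for one acquisition date.
rng = np.random.default_rng(0)
y_obs = rng.random(1000)
y_pred = y_obs + 0.1 * rng.standard_normal(1000)

rmse = np.sqrt(mean_squared_error(y_obs, y_pred))
standardized_rmse = rmse / np.std(y_obs)  # assumed normalization
r2 = r2_score(y_obs, y_pred)
print(standardized_rmse, r2)
\end{verbatim}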
If we compare models trained on the peak snow days of each year (Figure \ref{686962}), we notice parallels with the intra-annual timescales. In 2015, a year with very little snow, the distribution of feature weights is more similar to summer in a normal-snow year than to another spring (see June-July 2016). This makes some sense: in a year with low snowfall, the snowpack should change little between summer and the following spring. Further work is needed to determine precisely what distinguishes a low-snow spring snowpack from a normal-snow summer snowpack.
As resolution increases, we find that the importance of the DEM decreases while other features become more important (Figure \ref{335780}). This is likely because many of the other features, such as wind direction and slope, are highly sensitive to local topography. At coarse resolutions these effects are washed out, or become poorer predictors when averaged over all of the terrain within a pixel, leaving the DEM itself as the only consistently useful predictor.
We see subtle differences in the models for SWE versus snow depth (Figure \ref{457812}). Most notably, snow depth appears to be more sensitive to solar irradiance than SWE is. One hypothesis to explain this is that snow exposed to sunlight can melt, but not all of the meltwater flows away; water retained within the pixel still counts toward SWE. However, because snow is significantly less dense than water, the depth of the pack decreases substantially as it melts. Hence, sunlight can decrease snow depth without changing SWE.
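This argument can be made concrete with the standard definition of SWE in terms of snow depth $h_s$, bulk snow density $\rho_s$, and water density $\rho_w$:
\[
\mathrm{SWE} = h_s \, \frac{\rho_s}{\rho_w}.
\]
If melt is retained within the pixel, the mass of water per unit area (and hence SWE) is unchanged while the pack densifies, so $\rho_s$ rises and $h_s$ falls.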
Finally, it is important to note that Figure \ref{229688} is not an attempt to find the most accurate model across years; rather, we aim to observe how our models change over time. It is also worth noting that the regression model trained on 2013 data appears to be a better predictor for 2014 than the regression model trained on 2014 data itself. Further work is needed to determine why this is the case.
Conclusion
In this project, we applied linear regression and random forests to geospatial data to model the distribution of snow in the Tuolumne River Basin. By exploiting explicit and implicit parallelism, we resampled and transformed 120 raster images of up to 1 GB each and then performed over 1500 regressions, each using around 10 terrain and vegetation features and up to 1 million observations, within a few hours. We compared the importance of features in these models, as well as their predictive skill, (1) across dates within a season, (2) across seasons, (3) across model scale, and (4) between regressions of SWE and of snow depth, to evaluate the stability of these models along these spatiotemporal dimensions. The preliminary results show promising agreement with existing knowledge about the dominant importance of elevation and insolation, and they also reveal several novel patterns of change in the influence of individual features, including the shift in these dominant features from the accumulation to the ablation phase. Along the way, we encountered and solved several challenges associated with adapting high-level serial and/or shared-memory code to run properly on HPC systems.
Work remains on exploiting further sources of parallelism, for instance to allow for more flexible (and more computationally expensive) algorithmic models. Simultaneously, further scientific investigation will attempt to apply these models in an operationally relevant setting. For instance, we can subsample the gridded snow depth products to simulate point observations and then attempt to predict snow depths via co-kriging, using the knowledge gained in this study to allow for temporal variability in terrain/vegetation-SWE relationships. Even before these further investigations, however, these results demonstrate both the advantages and the limitations of using static statistical models to estimate the distribution of snow within a watershed. Other sources of information are clearly needed to form operationally useful predictions that can improve SWE estimates and water resource forecasts, yet understanding the temporal variability of these models makes the inclusion of terrain- and vegetation-based statistical models feasible within a larger prediction framework.
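As a rough illustration of the proposed subsampling step, the sketch below draws random pixels from a gridded snow depth product to stand in for point observations. The array name, grid size, and sample count are hypothetical, and the co-kriging step itself is not shown.

\begin{verbatim}
import numpy as np

# Hypothetical gridded snow depth product for one acquisition date.
rng = np.random.default_rng(0)
depth_grid = rng.random((500, 500))

# Draw random pixel locations to simulate sparse point observations
# (e.g. snow courses or sensor sites).
rows = rng.integers(0, depth_grid.shape[0], size=50)
cols = rng.integers(0, depth_grid.shape[1], size=50)
point_obs = depth_grid[rows, cols]

# These simulated points, together with the terrain/vegetation regressions
# as a time-varying trend, would then feed the co-kriging prediction step.
\end{verbatim}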
Code
Author Contributions
I.B. conceived of and designed the analysis and the parallel analysis pipeline, obtained data and computing resources, developed data processing code and preliminary spatial error regression code, produced scaling results for the data processing steps, and drafted Abstract, Introduction, Data Structure, Data Preprocessing, and Parallelism Structure, Advantages, and Challenges sections; V.R. adapted spatial error regression code and drafted the Computing Resources section; W.H. developed random forest regression code, produced scientific results and generated figures, and drafted the Scientific Results section; I.B. and V.R. drafted and edited the project poster and drafted the Discussion section; V.R. and W.H. drafted the Regression Section; I.B. and W.H. drafted the Performance Results section and edited the manuscript; All contributed to drafting the Conclusion section.
Acknowledgements
We would like to thank Aydin and Kathy for supporting this project, which sought to apply the parallel concepts learned in class to higher-level languages, using some techniques different from those emphasized in the course (e.g., batch job parallelism). We would also like to thank Jenny and Marquita for consistently insightful feedback and support throughout the semester's assignments. This research was conducted with Government support under and awarded by DoD, Air Force Office of Scientific Research, National Defense Science and Engineering Graduate (NDSEG) Fellowship, 32 CFR 168a.