General Parallelism Structure, Advantages, and Challenges
Structure
In the data processing and analysis methods described above, we exploit two levels of parallelism, both in SPMD style:
- Because we perform independent regressions on ~80 snow depth/SWE snapshots, at 8 model resolutions, for 3 regression strategies, there is explicit SPMD-style parallelism across raster images that we exploit in both the data processing and analysis steps.
- Within each regression, and within some of the data preprocessing steps, there are both implicit and explicit opportunities for lower-level parallelism.
For the first level, we exploit this "across-raster" SPMD parallelism through the batch submission structure, submitting jobs that call a given program on multiple datasets. At this level, the only dependency is between steps in the pipeline (e.g., regression cannot begin until data preprocessing has finished), so we use the scheduler's job-dependency feature to ensure that later steps wait until the earlier steps have completed.
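A minimal sketch of this dependency-driven submission is given below, assuming a PBS-style scheduler (qsub with -W depend=afterok, as on NCAR's systems). The job scripts preprocess.sh and regress.sh, the snapshot/resolution labels, and the strategy names are hypothetical placeholders rather than our actual file layout; the point is that each independent raster task is its own job, with regressions held until preprocessing succeeds.

```python
"""Sketch: across-raster SPMD submission with scheduler-level dependencies.
Assumes a PBS-style qsub; script names and dataset labels are placeholders."""
import subprocess
from itertools import product

SNAPSHOTS = [f"snapshot_{i:02d}" for i in range(80)]   # ~80 depth/SWE rasters (placeholder names)
RESOLUTIONS = [f"res{i}" for i in range(8)]            # 8 model resolutions (placeholder labels)
STRATEGIES = ["strategy_a", "strategy_b", "strategy_c"]  # 3 regression strategies (Random Forest is one)

def qsub(script, env, depends_on=None):
    """Submit one batch job; return its job ID so later jobs can depend on it."""
    cmd = ["qsub", "-v", ",".join(f"{k}={v}" for k, v in env.items())]
    if depends_on:
        cmd += ["-W", f"depend=afterok:{depends_on}"]  # hold until the earlier job succeeds
    cmd.append(script)
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout.strip()

for snap, res in product(SNAPSHOTS, RESOLUTIONS):
    # Step 1: preprocessing for this raster/resolution, independent of all other tasks
    prep_id = qsub("preprocess.sh", {"SNAPSHOT": snap, "RESOLUTION": res})
    # Step 2: the regressions wait, via the scheduler dependency, until step 1 finishes
    for strategy in STRATEGIES:
        qsub("regress.sh", {"SNAPSHOT": snap, "RESOLUTION": res, "STRATEGY": strategy},
             depends_on=prep_id)
```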
For the "within-raster" parallelism, there are two forms of parallelism that we exploit. The first is implicit, which is used in the case of the Random Forest Regression by relying on the parallel structure encoded in Python's scikit-learn package. The second is explicit, as is the case with generating irradiance rasters through transformations of the DEM. In the latter case, we again rely on the batch submission structure to handle our "message passing." Jobs are submitted with dependencies such that they wait until intermediate jobs have finished and written temporary files to disk. These files are then read by the later job and processed into output.
Advantages
We chose the batch submission structure for our main parallelization because it offered the greatest flexibility and interoperability with multiple high-level languages. Because most of our parallelism derived from the independence of performing the same task across multiple datasets, we needed little message-passing or communication functionality; moreover, even had we opted to incorporate OpenMP or MPI, it was unclear whether we could smoothly integrate them with the geospatial and statistical packages we used to process and analyze our data (GRASS GIS, scikit-learn, and PySAL). By submitting each batch job separately, we were able to instantiate each of these packages independently for each task and call the necessary algorithms within an environment matching the standard single-core or SMP architecture these packages were primarily designed for. In this way, we took advantage of the single-core optimization that the developers of these packages have spent years refining (and avoided reinventing that wheel ourselves), while simultaneously leveraging the large number of cores available to us in NCAR's HPC environment.
Challenges
The most obvious disadvantage of parallelizing through the batch submission system, as described above, is the lack of message-passing functionality. Since there is no communication between processors beyond signaling when a job has finished, all message passing must occur via writes and reads to disk. In software where communication between tasks represents a significant fraction of the overall wall-clock time, this could dramatically reduce overall program efficiency; while we have focused in class on the latency and bandwidth issues associated with code that performs a high degree of memory I/O, the problem is compounded when frequent communication must instead go through disk. For this reason, this method of parallelization is not suited to most dynamic simulations or other problems that require frequent communication. In our use case, however, the processing time was orders of magnitude greater than the time required for disk I/O, so this main disadvantage did not greatly affect our overall performance.
Another challenge we faced, not necessarily unique to batch parallelism but common to the use of high-level data analysis packages like GRASS GIS, scikit-learn, or PySAL, is that these packages are often not designed for HPC and are not thread-safe. This required software-specific workarounds, one example of which is the creation of temporary GRASS Locations when processing the input DEM raster to generate the irradiation map outputs described in Section \ref{148321}. For every project undertaken with GRASS, the user must create a unique environment, called a Location, within which GRASS stores geospatial data for a given area in a GRASS-specific binary format. Each Location also carries a specific geographic coordinate system (GCS) and projection, allowing for easy reprojection and comparison of data with different underlying spatial reference systems. Typically, all data for a given study area are stored within a single Location; however, each instance of GRASS can only operate in one Location at a time, and it places a lock on that Location while the program is open. The lock exists because abstracted underlying functions read and write intermediate data files, which could create race conditions if multiple instances of GRASS accessed data in the same Location simultaneously.
To work around this, we created a new Location for each intermediate task and then, in the master task, copied the files from each of these temporary Locations into a central Location, performed the final step of the algorithm, and deleted the intermediate files. While this facilitated the relatively simple batch scheduler-based parallelism strategy we employed, it introduced I/O delays from creating the temporary Location environments and from writing and reading intermediate files to and from disk. Again, however, since each calculation was orders of magnitude more expensive than the reading/writing time, this had a negligible impact on performance.
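The sketch below illustrates this temporary-Location pattern, assuming the GRASS GIS 7 command-line interface (grass -c ... -e to create a Location and --exec to run a module inside it). The paths, EPSG code, module parameters, and the choice of GeoTIFF as the intermediate exchange format are assumptions for illustration, not our exact pipeline: each worker job gets its own private Location, writes its result to disk, and a dependent master job gathers the pieces into the central Location.

```python
"""Sketch: per-task temporary GRASS Locations to avoid Location locking.
Assumes GRASS GIS 7 CLI; paths, EPSG code, and parameters are placeholders."""
import subprocess

GISDBASE = "/glade/scratch/user/grassdata"   # hypothetical GRASS database path
EPSG = "EPSG:32613"                          # hypothetical projection for the study area

def run(cmd):
    subprocess.run(cmd, check=True)

def worker(task_id, dem_tif, day):
    """One intermediate job: private Location -> import DEM -> r.sun -> export GeoTIFF."""
    loc = f"{GISDBASE}/tmp_loc_{task_id}"
    run(["grass", "-c", EPSG, loc, "-e"])                     # create the Location, then exit
    mapset = f"{loc}/PERMANENT"
    run(["grass", mapset, "--exec", "r.in.gdal", f"input={dem_tif}", "output=dem"])
    run(["grass", mapset, "--exec", "g.region", "raster=dem"])
    run(["grass", mapset, "--exec", "r.slope.aspect",
         "elevation=dem", "slope=slope", "aspect=aspect"])
    run(["grass", mapset, "--exec", "r.sun", "elevation=dem",
         "slope=slope", "aspect=aspect", f"day={day}", "glob_rad=irrad"])
    out_tif = f"{GISDBASE}/irrad_{task_id}.tif"
    run(["grass", mapset, "--exec", "r.out.gdal", "input=irrad", f"output={out_tif}"])
    return out_tif

def master(task_outputs):
    """Master job (dependent on all workers): gather results into the central Location."""
    central = f"{GISDBASE}/central_loc/PERMANENT"   # assumed to exist with the same projection
    for i, tif in enumerate(task_outputs):
        run(["grass", central, "--exec", "r.in.gdal", f"input={tif}", f"output=irrad_{i}"])
    # ...final aggregation step runs here, then the temporary Locations are deleted
```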
Finally, there was the difficulty of writing multiple SPMD scripts for the various stages of the data processing and analysis pipeline. Inevitably, with different people working on different stages, it was difficult to maintain common filename, directory tree, and argument conventions, so a fair amount of time was spent "blending" two stages of the pipeline together once they had been written. A related challenge arose when writing a single script to process and adjust all of the raw data during the preprocessing stage: because of small differences in how files were stored across years, rasters from different years having different spatial extents, and other minor inconsistencies, SPMD-style programming was challenging at this early stage. Once this preprocessing was complete, however, the remaining steps in the pipeline were made easier by consistent file formats, filenames, and directory structure conventions. These challenges are not unique to our project but are common to collaborative projects, particularly data analysis projects with multiple stages of SPMD-style scripts executed within a larger pipeline.