2.2 Workflow overview
The synthesis process followed a workflow of data downloading, data
quality control and cleaning, data aggregation, gap-filling of the
daily time series, and finally writing to NetCDF format. To extract the
desired data, we carefully inspected the source websites for information
about how the original data were measured, processed, and recorded. Our
data cleaning and quality control procedures included scanning for
unrealistic values and cross-checking data flag reports. After
unrealistic values were removed, any time series that were recorded at
sub-daily intervals were aggregated to daily time steps. Subsequently,
three levels of gap-filling methods (interpolation, regression, and
climate catalog; see Section 2.4) were applied to the daily data.
The resulting data were stored in NetCDF format with a consistent
structure and layout, together with metadata providing additional
information such as variable units, station names, locations, and
record lengths.
2.3 Data downloading and cleaning
For each site, we acquired (if available) time series data of
streamflow, precipitation, air temperature, solar radiation, relative
humidity, wind direction, wind speed, SWE, snow depth, vapor pressure,
soil moisture, soil temperature, and isotope values. To facilitate
cross-watershed research and the intercomparison of datasets, variable
names and units were standardized following the format suggested by
Addor et al. (2020) for large-sample hydrology datasets. As detailed in
the data pipeline Jupyter Notebooks attached to the CHOSEN database, we
aggregated any hourly time series in one of two ways: cumulative
variables were summed, and rate variables were averaged.
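For illustration, this daily aggregation can be sketched with pandas as follows; the column names ('precip' as a cumulative variable, 'airtemp' as a rate variable) are hypothetical placeholders, and the authoritative implementation is in the Jupyter Notebooks referenced above.

```python
import pandas as pd

def aggregate_to_daily(hourly: pd.DataFrame) -> pd.DataFrame:
    """Aggregate hourly records to daily time steps (illustrative sketch).

    Assumes a DatetimeIndex and hypothetical column names:
    'precip' is treated as cumulative (summed), 'airtemp' as a rate (averaged).
    """
    return pd.DataFrame({
        # Cumulative variables: daily total; min_count=1 keeps all-missing days as NaN
        "precip": hourly["precip"].resample("D").sum(min_count=1),
        # Rate variables: daily mean
        "airtemp": hourly["airtemp"].resample("D").mean(),
    })
```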
2.4 Gap-filling methods
Gaps in the cleaned and aggregated daily data were filled using one of
three techniques, depending on the length of the gap and availability of
complementary data. The first technique involved linear interpolation
between the two nearest non-missing values. Linear interpolation was
applied to gaps shorter than seven days, over which seasonal effects can
be considered negligible. Longer gaps were filled by regression for those
catchments with multiple monitoring stations (Pappas et al., 2014). To
implement spatial regression, we first evaluated the correlation
coefficients between the station with missing values and all the other
stations within the watershed. We then used the data from the station
with the highest correlation coefficient to estimate the regression
parameters. If the highest correlation coefficient was less than 0.7, or
if no contemporaneous data were available from the other stations, the
missing values were reconstructed using the climate catalog technique
(described below).
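A simplified sketch of the first two gap-filling levels is given below, assuming the daily series of the station with missing values is a pandas Series and the other stations' series form a DataFrame with a matching date index; it illustrates the approach described above and is not the CHOSEN code itself.

```python
import numpy as np
import pandas as pd

def interpolate_and_regress(target: pd.Series, neighbors: pd.DataFrame,
                            min_corr: float = 0.7) -> pd.Series:
    """Sketch of gap-filling levels 1 (interpolation) and 2 (regression)."""
    filled = target.copy()

    # Level 1: linear interpolation, restricted to gaps shorter than seven days.
    gap_id = target.notna().cumsum()                      # label each run of NaNs
    gap_len = target.isna().groupby(gap_id).transform("sum")
    short_gap = target.isna() & (gap_len < 7)
    interp = target.interpolate(method="linear", limit_area="inside")
    filled[short_gap] = interp[short_gap]

    # Level 2: regression against the best-correlated station in the watershed.
    corr = neighbors.corrwith(filled)
    if corr.isna().all() or corr.max() < min_corr:
        return filled  # left for the climate catalog technique
    best = neighbors[corr.idxmax()]

    # Estimate slope and intercept from the overlapping non-missing period.
    overlap = pd.concat([filled, best], axis=1, keys=["y", "x"]).dropna()
    slope, intercept = np.polyfit(overlap["x"], overlap["y"], deg=1)

    still_missing = filled.isna() & best.notna()
    filled[still_missing] = intercept + slope * best[still_missing]
    return filled
```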
The climate catalog method filled gaps using data from the same site in
a different year: specifically, the year that contained at least nine
months of data and had the highest correlation coefficient (greater than
0.7) with the year in which values were missing. For example, suppose a catchment's
only streamflow gauge was missing all of April’s measurements in 2002.
In this case, we would first group the available data by year, and
calculate the correlation coefficients between daily streamflows in 2002
and the other years. If the 2002 data correlated most strongly with data
from 2006, then 2006’s April 1st data point replaced the missing value
from April 1st 2002, with the addition of a Gaussian random number
scaled by the standard deviation of all April 1st values from all the
years of record.
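A simplified sketch of the climate catalog idea for a single daily series follows; the function name is hypothetical, the nine-month requirement is approximated as 275 days, and the published scripts contain the authoritative implementation.

```python
import numpy as np
import pandas as pd

def climate_catalog_fill(daily: pd.Series, min_corr: float = 0.7,
                         rng=None) -> pd.Series:
    """Sketch of level 3: fill remaining gaps from the best-correlated year."""
    rng = np.random.default_rng() if rng is None else rng
    filled = daily.copy()

    # Arrange the record as a day-of-year x year table.
    table = (daily.groupby([daily.index.dayofyear, daily.index.year])
                  .mean().unstack())          # rows: day of year, columns: year
    doy_std = table.std(axis=1)               # per-calendar-day std over all years

    for year in table.columns:
        this_year = table[year]
        if not this_year.isna().any():
            continue
        # Donor candidates: other years with at least ~nine months of data.
        donors = table.drop(columns=year)
        donors = donors.loc[:, donors.notna().sum() >= 275]
        corr = donors.corrwith(this_year)
        if corr.isna().all() or corr.max() < min_corr:
            continue                           # no acceptable donor year
        donor = donors[corr.idxmax()]

        for doy in this_year.index[this_year.isna() & donor.notna()]:
            date = pd.Timestamp(int(year), 1, 1) + pd.Timedelta(days=int(doy) - 1)
            # Skip day 366 spilling into the next year, or dates already filled.
            if date.year != int(year) or date not in filled.index or pd.notna(filled.loc[date]):
                continue
            noise = rng.normal(scale=doy_std.loc[doy]) if pd.notna(doy_std.loc[doy]) else 0.0
            filled.loc[date] = donor.loc[doy] + noise
    return filled
```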
Figure 2. Data pipeline and visualizations of the gap-filling methods:
a) interpolation, b) regression and c) climate catalog
To ensure the quality of the reconstructed data (interpolated,
regressed, or based on the climate catalog), we deleted any
reconstructed values that fell outside the thresholds originally used to
detect unrealistic data. After all of the gap-filling methods were
applied, a flag table was generated indicating the technique used to
create each filled data point. All Python scripts for the processing
methods are available on GitLab
(https://gitlab.com/esdl/chosen)
and will be published openly on Zenodo (DOI: 10.5281/zenodo.4060384),
together with a Jupyter Notebook tutorial.
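The final plausibility check and flag table could be sketched as follows; the threshold values and flag codes are illustrative placeholders rather than those used for CHOSEN.

```python
import numpy as np
import pandas as pd

# Illustrative plausibility thresholds (lower, upper) per variable; the actual
# thresholds are those used during the original unrealistic-value screening.
THRESHOLDS = {"streamflow": (0.0, 1.0e5), "airtemp": (-60.0, 60.0)}

def finalize(filled: pd.DataFrame, flags: pd.DataFrame):
    """Delete reconstructed values outside the thresholds and update the flag
    table, which records how each value was produced (hypothetical codes:
    'observed', 'interpolated', 'regressed', 'catalog', 'removed')."""
    for var, (lower, upper) in THRESHOLDS.items():
        if var not in filled.columns:
            continue
        out_of_range = (filled[var] < lower) | (filled[var] > upper)
        reconstructed = flags[var] != "observed"
        drop = out_of_range & reconstructed
        filled.loc[drop, var] = np.nan
        flags.loc[drop, var] = "removed"
    return filled, flags
```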
2.5 NetCDF data product
We stored and published the processed data in NetCDF format. NetCDF is
emerging as the data standard for large-sample hydrology, as well as for
other large-sample products across the geosciences, particularly climate
science and remote sensing (Liu et al., 2016; Romañach et al., 2015;
Signell et al., 2008). The NetCDF library is designed to read and write
multi-dimensional scientific data in a well-structured manner. The
library supports multiple coordinate dimensions, which allows data from
many measurement stations to be stored together. Its programmatic
interface makes the data highly accessible and easily portable across
computing platforms. Data (variables) and metadata (corresponding attributes) are
intrinsically linked and stored in the same file, making the data set
self-documenting.
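As an illustration of this structure, a small self-documenting file with a time dimension and a station dimension can be written with the xarray library; all names and values below are placeholders, not CHOSEN content.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Build a tiny example dataset with a time dimension and a station dimension.
time = pd.date_range("2002-01-01", periods=3, freq="D")
stations = ["gauge_A", "gauge_B"]

ds = xr.Dataset(
    data_vars={
        "streamflow": (("time", "station"), np.random.rand(3, 2),
                       {"units": "mm/day", "long_name": "daily streamflow"}),
    },
    coords={"time": time, "station": stations},
    attrs={"title": "example watershed", "source": "illustrative sketch"},
)
ds.to_netcdf("example_watershed.nc")   # variables and metadata in one file
```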
We generated one NetCDF file for each watershed to store its data and
metadata. In these NetCDF files, there are four kinds of variables.
Hydrometeorological variables are stored in two-dimensional arrays
(i.e., by time and location), along with flag variables of the same
number and dimensions. The timestamp variable is a one-dimensional
array of measurement dates and times. Lastly, a grid variable contains
information about gauges and monitoring stations, including their names,
latitudes, and longitudes. The attributes include website links, units,
full names, and record starting and ending dates (Figure 3).
Figure 3. Variables, corresponding dimensions and attributes in
NetCDF files
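A watershed file can then be opened and inspected, for example with xarray; the file and variable names below are placeholders, and the actual names are documented in each file's metadata.

```python
import xarray as xr

# Open one watershed's file and inspect its contents; the file and variable
# names are placeholders for the actual CHOSEN naming conventions.
ds = xr.open_dataset("example_watershed.nc")

print(ds.dims)                  # time and station dimensions
print(ds.data_vars)             # hydrometeorological and flag variables
print(ds.attrs)                 # global metadata

# A two-dimensional hydrometeorological variable, indexed by time and station:
streamflow = ds["streamflow"]   # hypothetical variable name
print(streamflow.attrs)         # per-variable attributes (units, full name, ...)
print(streamflow.sel(time="2002-04"))   # April 2002 values at all stations
```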