3.4 Calibrations and predictability of models
The calibration was carried out on 196 randomly selected samples from
the dataset and validated on 84 samples using four algorithms, namely
partial least squares regression (PLSR), random forest (RF),
multivariate adaptive regression splines (MARS) and support vector
regression (SVR). In the calibration Vis-NIR dataset, the R² and RMSE
values were 0.93 and 0.12 for the PLSR model, 0.84 and 0.15 for the RF
model, 0.80 and 0.21 for the SVR model, and 0.86 and 0.12 for the MARS
model. In the calibration MIR dataset, the R² and RMSE values for the
PLSR, RF, MARS and SVR models were 0.94, 0.26; 0.84, 0.25; 0.80, 0.25
and 0.91, 0.25, respectively.
R² is a statistical measure that represents the proportion of the
variance in a dependent variable that is explained by the independent
variable or variables; in short, it indicates how well the data fit the
regression model, whereas a lower RMSE indicates a better fit. From the
calibration datasets it is clear that the PLSR model outperformed the
other models, with the highest R² in both spectral ranges and RMSE
values equal or close to the lowest (Tables 2 and 3).
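The two statistics can be illustrated with a minimal sketch. It assumes that R² is computed as 1 - SS_res/SS_tot (the text does not state which definition was used; some studies report the squared correlation instead), and the function names and EC values are hypothetical, chosen only for illustration.

```python
import math

def rmse(measured, predicted):
    """Root mean square error between measured and predicted values."""
    n = len(measured)
    return math.sqrt(sum((m - p) ** 2 for m, p in zip(measured, predicted)) / n)

def r_squared(measured, predicted):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    mean_m = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean_m) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot

# Hypothetical EC values for illustration only (not data from this study).
measured = [0.5, 1.0, 1.5, 2.0, 2.5]
predicted = [0.6, 0.9, 1.6, 1.9, 2.6]

print("R2   =", round(r_squared(measured, predicted), 3))
print("RMSE =", round(rmse(measured, predicted), 3))
```

A model that tracks the measured values closely gives an R² near 1 and an RMSE near 0, which is the sense in which the two statistics are compared across models here.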
The predictive performance of the PLSR, RF, SVR and MARS models for EC
in the Vis-NIR range was evaluated; the values for PLSR were
(R² = 0.84, RMSE = 0.21, RPD = 2.44); for RF were
(R² = 0.81, RMSE = 0.20, RPD = 1.95); for MARS were
(R² = 0.73, RMSE = 0.27, RPD = 1.81); and for SVR were
(R² = 0.78, RMSE = 0.22, RPD = 2.09). In the MIR dataset, the
corresponding values for PLSR were (R² = 0.55, RMSE = 0.35,
RPD = 1.40); for RF were (R² = 0.52, RMSE = 0.20, RPD = 1.43); for
MARS were (R² = 0.44, RMSE = 0.37, RPD = 1.29); and for SVR were
(R² = 0.53, RMSE = 0.35, RPD = 1.39). The threshold RPD values used to test
model performance were those developed by Chang et al. (2001), where
excellent models have RPD > 2, fair models have RPD between 1.4 and 2,
and non-reliable models have RPD < 1.4. Accordingly, PLSR was
considered an excellent model in the Vis-NIR range (RPD = 2.44) and RF
a fairly good model in the MIR range (RPD = 1.43) (Tables 2 and 3).
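The RPD classification can be sketched as follows. RPD is taken here as the standard deviation of the measured values divided by the RMSE of prediction. The use of the sample standard deviation (n - 1 denominator) and the treatment of values exactly at 1.4 or 2 as "fair" are assumptions, since the text does not specify them, and the function names are hypothetical.

```python
import math

def rpd(measured, predicted):
    """Ratio of performance to deviation: SD of measured values / RMSE."""
    n = len(measured)
    mean_m = sum(measured) / n
    # Assumption: sample standard deviation (n - 1 in the denominator).
    sd = math.sqrt(sum((m - mean_m) ** 2 for m in measured) / (n - 1))
    err = math.sqrt(sum((m - p) ** 2 for m, p in zip(measured, predicted)) / n)
    return sd / err

def classify_rpd(value):
    """Classes after the thresholds attributed to Chang et al. (2001)."""
    if value > 2.0:
        return "excellent"
    # Assumption: boundary values (exactly 1.4 or 2.0) are counted as fair.
    if value >= 1.4:
        return "fair"
    return "non-reliable"

# Validation RPD values reported in the text for the Vis-NIR range.
vis_nir_rpd = {"PLSR": 2.44, "RF": 1.95, "MARS": 1.81, "SVR": 2.09}
for model, value in vis_nir_rpd.items():
    print(model, value, classify_rpd(value))
```

Values that sit on or very near a threshold, such as the MIR results close to 1.4, are sensitive to the boundary convention, so their class should be read with some caution.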
The PLSR model was used successfully in this study and has also been
used to estimate soil salinity and other soil properties elsewhere in
the world, e.g., in New South Wales, Australia (Janik et al., 2009), the
island of Texel in the northwest of The Netherlands (Farifteh et al.,
2007a), the Yellow River delta region in China (Weng et al., 2008) and
the Hetao Irrigation District of Inner Mongolia in China (Qu et
al., 2009). Unlike methods that first decompose the spectra into a set
of eigenvectors and scores and then perform a regression with soil
attributes in a separate step, PLSR uses the soil information during
the decomposition process itself. The advantages of PLSR are its
linearity and its use of the correlation that exists between the
spectra and the soil properties; thus, the resulting spectral vectors
are directly related to the soil attribute (Geladi and Kowalski, 1986).
It is robust to data noise and missing values, balances the two
objectives of explaining response and predictor variation, and performs
the decomposition and regression in a single step. Sidike et al. (2014)
showed that soil salinity can be predicted accurately with the PLSR
method (R² = 0.992, RMSE = 0.195), and Farifteh et al. (2007) suggested
that PLSR analyses offered accurate to good prediction of EC.
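The way PLSR uses the soil attribute while decomposing the spectra can be illustrated with a one-component PLS1 sketch. This is a simplified illustration and not the implementation used in this study: a practical model would extract several components with deflation and choose their number by cross-validation. The function name and the small data matrix are hypothetical.

```python
import math

def fit_pls1_one_component(X, y):
    """Fit a one-component PLS1 model and return a predict(x_row) function."""
    n, p = len(X), len(X[0])
    # Mean-centre the spectra (columns of X) and the response.
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Weight vector: the direction in spectral space with maximal covariance
    # with y. This is where the response enters the decomposition.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    # Scores: projection of each centred spectrum onto the weight vector.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    # Regression of the centred response on the scores.
    q = sum(t[i] * yc[i] for i in range(n)) / sum(v * v for v in t)

    def predict(x_row):
        score = sum((x_row[j] - x_mean[j]) * w[j] for j in range(p))
        return y_mean + q * score

    return predict

# Hypothetical two-band "spectra" and EC values, for illustration only.
X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
y = [2.0, 4.0, 6.0, 8.0]
predict = fit_pls1_one_component(X, y)
print(predict([5.0, 10.0]))
```

Because the weight vector is computed from the covariance between the spectra and the response, the extracted score is tied directly to the soil attribute, in contrast to a decomposition of the spectra alone.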
RF is a group of algorithms developed as an extension of Classification
and Regression Tree analysis to enhance prediction performance, and it
has been used mainly for classification problems (Olson
et al., 2017). RF is a fast, simple, data-driven statistical approach
that has been used in digital soil mapping with good accuracy. It is
reported to be resistant to over-fitting and usually performs well in
problems with a low sample-to-feature ratio (Wei et al., 2012), but it
could not outperform PLSR in calibration for either spectral range or
in validation in the Vis-NIR range in the present study. SVR is a
machine learning algorithm based on statistical learning theory that
seeks to maximize the ability to generalize using the structural risk
minimization principle (Filgueiras et al., 2014). MARS splits the data
into sub-regions (splines) bounded by knots, which are the points where
the regression coefficients (slopes) change, and fits the data in each
sub-region using a set of adaptive piecewise linear regressions
(Friedman, 1991). Neither SVR nor MARS performed better than PLSR and
RF in this study.
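The role of knots in MARS can be sketched with hinge basis functions, the piecewise linear building blocks described by Friedman (1991). The sketch only evaluates a model with a single, fixed knot; it does not reproduce the adaptive search by which MARS selects knots and terms. The function names, knot position and coefficients are hypothetical.

```python
def hinge(x, knot, direction=1):
    """Hinge basis: max(0, x - knot) for direction +1, max(0, knot - x) for -1."""
    return max(0.0, direction * (x - knot))

def piecewise_model(x, intercept, knot, slope_right, slope_left):
    """Piecewise linear model whose slope changes at the knot."""
    return (intercept
            + slope_right * hinge(x, knot, 1)
            + slope_left * hinge(x, knot, -1))

# Hypothetical coefficients, for illustration only.
params = dict(intercept=1.0, knot=2.0, slope_right=0.5, slope_left=-0.25)
for x in (0.0, 2.0, 4.0):
    print(x, piecewise_model(x, **params))
```

To the right of the knot only the first hinge is non-zero and to the left only the second, so each sub-region has its own linear fit and the two meet at the knot.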
The scatter plots of measured and predicted values of soil electrical
conductivity in the calibration Vis-NIR and MIR datasets (Fig 5 and Fig
7) showed a good relationship between the two variables, with high
R² values in both datasets. The scatter plots of measured and
predicted EC in the validation Vis-NIR and MIR datasets (Fig 6 and
Fig 8) also suggested good model validation, with high
R² values. A comparison of the RPD values of the Vis-NIR and MIR
validation datasets showed higher values in the Vis-NIR region; hence
this region may be better suited to the prediction of EC than the MIR
region. Soriano et al. (2014) reported that Vis-NIR spectroscopy gave
better results (R² = 0.60) in the prediction of EC than MIR
(R² = 0.27), as observed in our study. Kodaira et al. (2013) reported
that EC was generally poorly predicted by both MIR (R² = 0.26) and NIR
spectroscopy (R² = 0.57), whereas Minasny et al. (2009) predicted EC
with good accuracy in the MIR region using a dataset with a large
variation in values.