3.4 Calibrations and predictability of models
The models were calibrated on 196 randomly selected samples from the dataset and validated on 84 samples using four algorithms: partial least squares regression (PLSR), random forest (RF), multivariate adaptive regression splines (MARS) and support vector regression (SVR). In the Vis-NIR calibration dataset, the R2 and RMSE values were 0.93 and 0.12 for PLSR, 0.84 and 0.15 for RF, 0.80 and 0.21 for SVR, and 0.86 and 0.12 for MARS. In the MIR calibration dataset, the R2 and RMSE values for PLSR, RF, MARS and SVR were 0.94 and 0.26, 0.84 and 0.25, 0.80 and 0.25, and 0.91 and 0.25, respectively. R2 represents the proportion of the variance in a dependent variable that is explained by the independent variable or variables, and thus indicates how well the data fit the regression model, whereas a lower RMSE indicates a better fit. In calibration, PLSR gave the highest R2 in both spectral ranges, with RMSE values equal or close to the lowest obtained (Table 2 and 3).
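The random partition and the two fit statistics described above can be illustrated with a minimal sketch. It assumes that R2 is computed as one minus the ratio of the residual to the total sum of squares (some studies report the squared correlation coefficient instead) and that the 196 and 84 samples together make up a dataset of 280 samples. The function names, the seed and the EC values are hypothetical and serve only as illustration.

```python
import math
import random


def split_indices(n_samples, n_calibration, seed=0):
    """Randomly partition sample indices into calibration and validation sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return sorted(indices[:n_calibration]), sorted(indices[n_calibration:])


def r_squared(measured, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total (assumed definition)."""
    mean_measured = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean_measured) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot


def rmse(measured, predicted):
    """Root mean square error, in the units of the measured property."""
    squared_errors = [(m - p) ** 2 for m, p in zip(measured, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# 196 calibration and 84 validation samples, as in the text (280 assumed in total).
calibration_idx, validation_idx = split_indices(280, 196, seed=42)
print(len(calibration_idx), len(validation_idx))

# Hypothetical measured and predicted EC values, for illustration only.
measured_ec = [0.5, 1.0, 1.5, 2.0, 2.5]
predicted_ec = [0.6, 0.9, 1.6, 1.9, 2.6]
print(round(r_squared(measured_ec, predicted_ec), 3))
print(round(rmse(measured_ec, predicted_ec), 3))
```

Fixing the seed makes the partition reproducible; the seed actually used in the study is not stated.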
The predictive performance of the PLSR, RF, SVR and MARS models for EC was then evaluated. In the Vis-NIR range, the values were R2 = 0.84, RMSE = 0.21 and RPD = 2.44 for PLSR; R2 = 0.81, RMSE = 0.20 and RPD = 1.95 for RF; R2 = 0.73, RMSE = 0.27 and RPD = 1.81 for MARS; and R2 = 0.78, RMSE = 0.22 and RPD = 2.09 for SVR. In the MIR dataset, the corresponding values were R2 = 0.55, RMSE = 0.35 and RPD = 1.40 for PLSR; R2 = 0.52, RMSE = 0.20 and RPD = 1.43 for RF; R2 = 0.44, RMSE = 0.37 and RPD = 1.29 for MARS; and R2 = 0.53, RMSE = 0.35 and RPD = 1.39 for SVR. Model performance was judged against the RPD thresholds of Chang et al. (2001), in which excellent models have RPD > 2, fair models have RPD between 1.4 and 2, and non-reliable models have RPD < 1.4. Accordingly, PLSR was considered an excellent model in the Vis-NIR range (RPD = 2.44) and RF a fair model in the MIR range (RPD = 1.43) (Table 2 and 3).
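The RPD classes quoted above can be expressed as a short sketch. It assumes the common definition of RPD as the standard deviation of the measured values divided by the RMSE of prediction, which is not spelled out in this section, and it assumes that a value of exactly 1.4 or 2 falls in the fair class. The function names are hypothetical.

```python
import math


def rpd(measured, rmse_value):
    """RPD = sample standard deviation of measured values / RMSE (assumed definition)."""
    n = len(measured)
    mean_measured = sum(measured) / n
    variance = sum((m - mean_measured) ** 2 for m in measured) / (n - 1)
    return math.sqrt(variance) / rmse_value


def classify_rpd(value):
    """Classes quoted in the text: > 2 excellent, 1.4 to 2 fair, < 1.4 non-reliable."""
    if value > 2.0:
        return "excellent"
    if value >= 1.4:  # boundary handling is an assumption
        return "fair"
    return "non-reliable"


# RPD values reported in the text for each spectral range and model.
reported_rpd = {
    "Vis-NIR": {"PLSR": 2.44, "RF": 1.95, "MARS": 1.81, "SVR": 2.09},
    "MIR": {"PLSR": 1.40, "RF": 1.43, "MARS": 1.29, "SVR": 1.39},
}
for spectral_range, models in reported_rpd.items():
    for model, value in models.items():
        print(spectral_range, model, value, classify_rpd(value))
```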
PLSR was used successfully in this study and has been used to estimate soil salinity and other soil properties elsewhere in the world, e.g., in New South Wales, Australia (Janik et al., 2009), on the island of Texel in the northwest of The Netherlands (Farifteh et al., 2007a), in the Yellow River delta region in China (Weng et al., 2008) and in the Hetao Irrigation District of Inner Mongolia in China (Qu et al., 2009). Rather than first decomposing the spectra into a set of eigenvectors and scores and then regressing the soil attributes on them in a separate step, PLSR uses the soil information during the decomposition process itself. Its advantages are its linearity and its use of the correlation that exists between the spectra and the soil properties; thus, the resulting spectral vectors are directly related to the soil attribute (Geladi and Kowalski, 1986). It is robust to data noise and missing values, balances the two objectives of explaining response variation and predictor variation, and performs the decomposition and regression in a single step. Sidike et al. (2014) showed that soil salinity can be predicted accurately with the PLSR method (R2 = 0.992, RMSE = 0.195), and Farifteh et al. (2007) suggested that PLSR analyses offered accurate to good prediction of EC.
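How PLSR uses the soil attribute while decomposing the spectra can be sketched with the NIPALS algorithm for a single response variable. This is a minimal illustration under stated assumptions, not the implementation used in this study: the function names and the two-band "spectra" are hypothetical, and no preprocessing or selection of the number of components is shown.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_pls1(spectra, target, n_components):
    """Fit a single-response PLS model with the NIPALS algorithm."""
    n = len(spectra)
    x_mean = [sum(col) / n for col in zip(*spectra)]
    y_mean = sum(target) / n
    X = [[v - m for v, m in zip(row, x_mean)] for row in spectra]
    y = [v - y_mean for v in target]
    components = []
    for _ in range(n_components):
        # Weights: the spectral direction with maximal covariance with y,
        # so the soil attribute steers the decomposition itself.
        w = [_dot(col, y) for col in zip(*X)]
        norm = math.sqrt(_dot(w, w))
        if norm < 1e-12:
            break  # nothing left in X that covaries with y
        w = [v / norm for v in w]
        t = [_dot(row, w) for row in X]  # scores
        tt = _dot(t, t)
        p = [_dot(col, t) / tt for col in zip(*X)]  # spectral loadings
        q = _dot(y, t) / tt  # regression of y on the scores
        # Deflate X and y so the next component explains what remains.
        X = [[v - ti * pj for v, pj in zip(row, p)] for row, ti in zip(X, t)]
        y = [v - ti * q for v, ti in zip(y, t)]
        components.append((w, p, q))
    return {"x_mean": x_mean, "y_mean": y_mean, "components": components}


def predict_pls1(model, spectrum):
    """Predict the response for one spectrum by applying each component in turn."""
    x = [v - m for v, m in zip(spectrum, model["x_mean"])]
    y_hat = model["y_mean"]
    for w, p, q in model["components"]:
        t = _dot(x, w)
        y_hat += q * t
        x = [v - t * pj for v, pj in zip(x, p)]
    return y_hat


# Hypothetical two-band spectra and a response that is exactly linear in them.
spectra = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 7.0]]
ec = [2.0 * a - b + 3.0 for a, b in spectra]
model = fit_pls1(spectra, ec, n_components=2)
print(round(predict_pls1(model, [6.0, 2.0]), 6))
```

The weights, scores and regression coefficient of each component are computed in the same pass, which is the sense in which decomposition and regression happen in a single step.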
RF is a group of algorithms developed as an extension of Classification and Regression Tree analysis to enhance prediction performance, and it has mainly been used for classification problems (Olson et al., 2017). It is a fast, simple, data-driven statistical approach that has been used in digital soil mapping with good accuracy; it is reported to be resistant to over-fitting and usually performs well in problems with a low sample-to-feature ratio (Wei et al., 2012). In the present study, however, RF could not outperform PLSR in calibration for either spectral range or in validation in the Vis-NIR range. SVR is a machine learning algorithm based on statistical learning theory that seeks to maximize the ability to generalize using the structural risk minimization principle (Filgueiras et al., 2014). MARS splits the data into subregions (splines) delimited by knots, the points at which the regression coefficients (slopes) change, and fits the data in each subregion using a set of adaptive piecewise linear regressions (Friedman, 1991). Neither SVR nor MARS performed better than PLSR and RF in this study.
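The role of knots in MARS can be illustrated with the hinge basis functions from which its piecewise linear fits are built. The sketch below only evaluates a model of that form; it does not perform the adaptive selection of knots and terms, and the knot position, coefficients and function names are hypothetical.

```python
def hinge(x, knot, direction=1):
    """Hinge basis function: max(0, x - knot) for direction +1, max(0, knot - x) for -1."""
    return max(0.0, direction * (x - knot))


def piecewise_linear(x, intercept, terms):
    """Evaluate intercept + sum of coefficient * hinge(x, knot, direction)."""
    return intercept + sum(c * hinge(x, k, d) for c, k, d in terms)


# Hypothetical model with one knot at x = 2:
# the slope is -1 below the knot and +0.5 above it.
terms = [(1.0, 2.0, -1), (0.5, 2.0, 1)]
for x in (0.0, 1.0, 2.0, 3.0, 4.0):
    print(x, piecewise_linear(x, 1.0, terms))
```

Each term is zero on one side of its knot, so the fitted slope changes only at the knots.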
The scatter plots of measured versus predicted soil electrical conductivity for the Vis-NIR and MIR calibration datasets (Fig 5 and Fig 7) showed a good relationship between the two variables, with high R2 values in both datasets. The scatter plots of measured versus predicted EC for the Vis-NIR and MIR validation datasets (Fig 6 and Fig 8) also suggest good model validation with high R2 values. Comparing the validation datasets, higher RPD values were obtained in the Vis-NIR region, and hence this region may be better suited than the MIR region for predicting EC. Consistent with our results, Soriano et al. (2014) reported that Vis-NIR spectroscopy predicted EC better (R2 = 0.60) than MIR spectroscopy (R2 = 0.27). Kodaira et al. (2013) reported that EC was generally poorly predicted by both MIR (R2 = 0.26) and NIR spectroscopy (R2 = 0.57), whereas Minasny et al. (2009) predicted EC with good accuracy in the MIR region using a dataset with a large variation in values.