Table 1. Definition of variables with units and temporal resolutions
2.2. Data analysis and model-building
Microsoft Excel was used to organize and process the data and to create graphs. We then used SPSS 21 (with SPSS 24 to verify and validate the results) and the R programming language for statistical computing and graphics. We conducted regression analyses in which population and precipitation are the independent variables (predictors) and sorghum, maize, and rice production are the dependent (response) variables. We split the dataset into a training set (70%) and a test set (30%). In addition, the MATLAB Curve Fitting Toolbox was used to analyze the data and detect relationships between the response variables and the predictors. We used the methods developed by the National Center for Research Methods (NCRM 2017) to interpret our results.
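As an illustration of the 70/30 partition described above, the following minimal Python sketch shuffles a set of records reproducibly and splits it into training and test sets. The function name `train_test_split`, the seed, and the records themselves are hypothetical; the text does not state whether the split was random or chronological, so random shuffling is an assumption here.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle rows reproducibly and split them into training and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical yearly records (illustration only, not the study's data):
# (population, precipitation, crop production)
records = [(1000 + 50 * i, 600 + (i * 37) % 90, 200 + 12 * i) for i in range(20)]

train, test = train_test_split(records, train_fraction=0.7)
print(len(train), "training rows;", len(test), "test rows")
```

For yearly series, a chronological split (earliest 70% of years for training) may be preferable to shuffling; either choice fits the same function signature.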
First, correlation analysis was performed to determine the relationships between the response variables and the predictors. The results helped identify the factor with the greatest impact on variations in crop production. Next, simple linear regression models were built to predict crop production using either population or precipitation alone as the predictor. We then built bilinear (two-predictor) models by adding the second independent variable, which helped establish the impact of each variable on cereal production. We performed an analysis of variance (ANOVA) to determine which model was best. The regression analysis enabled us to identify the best model for predicting the production of each crop.
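The sequence above (correlation, a one-predictor model, a two-predictor model, and an ANOVA comparison) can be sketched in plain Python as follows. This is a simplified illustration, not the SPSS or R procedure used in the study: the data are synthetic, all function names are hypothetical, and the ANOVA step is represented by a partial F statistic for nested models, which is one common way to make such a comparison.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def fit_ols(predictors, y):
    """Return [intercept, b1, ...] by solving the normal equations (X'X) b = X'y."""
    rows = [[1.0] + [col[i] for col in predictors] for i in range(len(y))]
    p = len(rows[0])
    # Augmented matrix [X'X | X'y]
    aug = [[sum(r[j] * r[k] for r in rows) for k in range(p)]
           + [sum(r[j] * yi for r, yi in zip(rows, y))] for j in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for r in range(p):
            if r != c:
                factor = aug[r][c] / aug[c][c]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[c])]
    return [aug[j][p] / aug[j][j] for j in range(p)]

def rss(coefs, predictors, y):
    """Residual sum of squares of a fitted model."""
    fitted = [coefs[0] + sum(b * col[i] for b, col in zip(coefs[1:], predictors))
              for i in range(len(y))]
    return sum((obs - fit) ** 2 for obs, fit in zip(y, fitted))

def partial_f(rss_small, rss_large, p_small, p_large, n):
    """F statistic for adding (p_large - p_small) terms to a nested model."""
    return ((rss_small - rss_large) / (p_large - p_small)) / (rss_large / (n - p_large))

# Synthetic data (assumption, illustration only): population in millions,
# precipitation in units of 100 mm, production in arbitrary units.
n = 20
population = [10 + 0.5 * i for i in range(n)]
precipitation = [6 + ((i * 37) % 90) / 100 for i in range(n)]
noise = [0.3 * ((i * 7) % 5 - 2) for i in range(n)]
maize = [2 + 0.5 * a + 1.5 * b + e
         for a, b, e in zip(population, precipitation, noise)]

# Step 1: correlation of each predictor with the response
r_pop = pearson(population, maize)
r_prec = pearson(precipitation, maize)

# Step 2: one-predictor model; Step 3: two-predictor model
simple = fit_ols([population], maize)
two_pred = fit_ols([population, precipitation], maize)

# Step 4: compare the nested models (parameter counts include the intercept)
rss_simple = rss(simple, [population], maize)
rss_two = rss(two_pred, [population, precipitation], maize)
f_stat = partial_f(rss_simple, rss_two, p_small=2, p_large=3, n=n)
print(r_pop, r_prec, f_stat)
```

A large F statistic indicates that the second predictor explains variance beyond what the one-predictor model captures; converting it to a p-value requires the F distribution with (1, n - 3) degrees of freedom, which statistical packages provide.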
The statistical criteria used to evaluate the regression models were the root mean square error (RMSE) and k-fold cross-validation. RMSE measures the discrepancy between the observed and predicted (or calculated) values; the lower the RMSE, the more accurate the prediction. A perfect fit between the observed and predicted values has an RMSE = 0 and R2 = 1; however, this is unlikely to occur in practice. Finally, we carried out 5-fold cross-validation to estimate how well each model performed and to establish which model was best. Figure 2 shows a flowchart describing the main steps for integrating the data for the analyses.
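To make the two evaluation criteria concrete, the sketch below computes RMSE as the square root of the mean squared difference between observed and predicted values, and estimates out-of-sample RMSE for a one-predictor linear model with 5-fold cross-validation. It is a minimal illustration under stated assumptions: the data are synthetic, the function names are hypothetical, and folds are formed by interleaving indices rather than by random assignment.

```python
from math import sqrt

def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))

def fit_line(x, y):
    """Least-squares intercept and slope for a one-predictor linear model."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

def k_fold_rmse(x, y, k=5):
    """Mean held-out RMSE over k folds (fold f holds out indices f, f+k, ...)."""
    scores = []
    for fold in range(k):
        held_out = set(range(fold, len(x), k))
        x_train = [v for i, v in enumerate(x) if i not in held_out]
        y_train = [v for i, v in enumerate(y) if i not in held_out]
        intercept, slope = fit_line(x_train, y_train)
        observed = [y[i] for i in sorted(held_out)]
        predicted = [intercept + slope * x[i] for i in sorted(held_out)]
        scores.append(rmse(observed, predicted))
    return sum(scores) / k

# Synthetic data (assumption, illustration only)
x = [10 + 0.5 * i for i in range(20)]
y = [3 + 2 * v + 0.3 * ((i * 3) % 7 - 3) for i, v in enumerate(x)]

cv_rmse = k_fold_rmse(x, y, k=5)
print("5-fold cross-validated RMSE:", cv_rmse)
```

Comparing candidate models by their cross-validated RMSE, rather than by RMSE on the data used for fitting, guards against favoring a model merely because it has more parameters.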