Fig. 3: Histogram of log-EUI after outlier elimination. The distribution appears approximately normal.
Methodology: The scope of this study was limited to regression methods, namely ordinary least squares, Ridge, Lasso, and Support Vector Regression, because of the simplicity of these models. Decision trees were excluded from the analysis, although they have the potential to provide better prediction performance because they are insensitive to outliers and can fit complex nonlinear relationships (Elith et al. 2008). In the regression analysis step, prediction performance was evaluated with cross-validated mean squared error (MSE). Five-fold cross-validation was used because it helps guard against overfitting (Hsu et al. 2010). Different model parameters were evaluated to find the best ones (the best alpha for Ridge and Lasso regression, and the best C, gamma, and kernel for Support Vector Regression). Afterwards, the MSE values for all regression methods were visualized, as seen in Figure 4. It was concluded that Ridge regression yields the lowest MSE and therefore might be the better model.
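The model comparison described above could be sketched roughly as follows. This is a minimal illustration, not the code used in the study: it assumes scikit-learn, uses synthetic stand-in data in place of the building features and log-EUI target, and the parameter grids, the shuffled fold assignment, and the names `candidates` and `evaluate` are hypothetical choices made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data (assumption): the study's features and log-EUI
# target are not reproduced here.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_coef = np.array([1.5, -2.0, 0.0, 0.5, 0.0])
y = X @ true_coef + rng.normal(scale=0.5, size=200)

# Five-fold cross-validation, as in the text; shuffling and the seed are
# assumptions made for reproducibility of this sketch.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Hypothetical parameter grids; the values searched in the study are not stated.
candidates = {
    # Least squares has no penalty to tune, so its grid has a single point.
    "least_squares": (LinearRegression(), {"fit_intercept": [True]}),
    "ridge": (Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}),
    "lasso": (Lasso(max_iter=10000), {"alpha": [0.001, 0.01, 0.1, 1.0]}),
    "svr": (
        SVR(),
        {"C": [0.1, 1.0, 10.0], "gamma": ["scale", 0.1], "kernel": ["linear", "rbf"]},
    ),
}


def evaluate(candidates, X, y, cv):
    """Return {model name: (best cross-validated MSE, best parameters)}."""
    results = {}
    for name, (model, grid) in candidates.items():
        # scikit-learn maximizes scores, so MSE is exposed as its negative.
        search = GridSearchCV(model, grid, scoring="neg_mean_squared_error", cv=cv)
        search.fit(X, y)
        results[name] = (-search.best_score_, search.best_params_)
    return results


results = evaluate(candidates, X, y, cv)
for name, (mse, params) in sorted(results.items(), key=lambda item: item[1][0]):
    print(f"{name}: CV MSE = {mse:.4f}, best params = {params}")
```

Which model ranks first in this sketch depends on the synthetic data and the grids, so the output says nothing about the ranking reported in Figure 4.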