Cross-validation techniques
Purpose of cross-validation:
- Addresses over-fitting of models based on limited sample sizes
- Can be used to identify and rank the most robust predictor groupings for model building
- Works by repeatedly training a model on one subset of the original data and generating test predictions on the held-out remainder, across different partitions of the data; the average number of misclassified observations is then used to rank each predictor grouping
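
The ranking idea above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the nearest-centroid classifier, the synthetic two-column data set, the grouping names and the helper names `nearest_centroid_predict` and `cv_misclassified` are all assumptions introduced for the example.

```python
import random
from statistics import mean


def nearest_centroid_predict(train_X, train_y, test_X):
    """Classify each test row by its closest class centroid (squared Euclidean distance)."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_X, train_y) if y == label]
        centroids[label] = [mean(col) for col in zip(*rows)]

    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    return [min(centroids, key=lambda c: dist2(x, centroids[c])) for x in test_X]


def cv_misclassified(X, y, columns, k=5, seed=0):
    """Mean number of misclassified test observations over k folds,
    using only the predictor columns listed in `columns`."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint test sets
    errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        train_X = [[X[i][c] for c in columns] for i in train_idx]
        test_X = [[X[i][c] for c in columns] for i in test_idx]
        preds = nearest_centroid_predict(train_X, [y[i] for i in train_idx], test_X)
        errors.append(sum(p != y[i] for p, i in zip(preds, test_idx)))
    return mean(errors)


# Synthetic data (made up for illustration): column 0 is informative, column 1 is noise.
rng = random.Random(42)
y = [i % 2 for i in range(100)]
X = [[label * 3.0 + rng.gauss(0, 1), rng.gauss(0, 1)] for label in y]

# Hypothetical predictor groupings, ranked from fewest to most misclassifications.
groupings = {"informative": (0,), "noise": (1,), "both": (0, 1)}
scores = {name: cv_misclassified(X, y, cols) for name, cols in groupings.items()}
ranking = sorted(scores, key=scores.get)
```

Here `ranking` lists the groupings from best to worst; in practice the classifier and the error measure would be whatever the modelling task calls for.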
Exhaustive cross-validation approaches:
- Leave-p-out cross-validation
- Leave-one-out cross-validation (the p = 1 case, and the least computationally expensive form of leave-p-out cross-validation)
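
To make the cost difference concrete, the sketch below enumerates every train/test split for a small sample. The generator name `leave_p_out` and the sample size are assumptions chosen for illustration; the point is only that leave-p-out needs one model fit per size-p subset, which grows combinatorially, while leave-one-out needs exactly n.

```python
from itertools import combinations
from math import comb


def leave_p_out(n, p):
    """Yield (train_indices, test_indices) for every possible size-p test set."""
    all_idx = range(n)
    for test in combinations(all_idx, p):
        held_out = set(test)
        train = [i for i in all_idx if i not in held_out]
        yield train, list(test)


n = 6
loo_splits = list(leave_p_out(n, 1))  # leave-one-out: n splits
l2o_splits = list(leave_p_out(n, 2))  # leave-2-out: C(n, 2) splits

print(len(loo_splits), len(l2o_splits), comb(n, 2))
```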
Non-exhaustive cross-validation approaches:
- k-fold cross-validation
- Holdout method
- Monte Carlo (repeated random sub-sampling)
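
The three non-exhaustive schemes differ only in how the index set is partitioned. The sketch below shows one way to generate the splits; the function names, the default test fraction of 0.25 and the seeds are assumptions made for the example rather than recommended settings.

```python
import random


def holdout_split(n, test_fraction=0.25, seed=0):
    """Holdout: a single random train/test partition."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = max(1, round(n * test_fraction))
    return idx[n_test:], idx[:n_test]


def k_fold_splits(n, k=5, seed=0):
    """k-fold: each observation is used for testing exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def monte_carlo_splits(n, n_repeats=10, test_fraction=0.25, seed=0):
    """Monte Carlo: repeated independent random partitions; an observation
    may land in several test sets or in none."""
    rng = random.Random(seed)
    n_test = max(1, round(n * test_fraction))
    for _ in range(n_repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        yield idx[n_test:], idx[:n_test]


n = 20
holdout_train, holdout_test = holdout_split(n)
kfold = list(k_fold_splits(n, k=5))
monte_carlo = list(monte_carlo_splits(n, n_repeats=10))
```

Unlike k-fold, the Monte Carlo scheme lets the number of repeats be chosen independently of the test-set size, at the cost of uneven coverage of the observations.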
Limitations:
- Cross-validation only yields meaningful results if the validation set and the training set are drawn from the same population and human biases are controlled
- Drawing the validation and training sets from dissimilar time periods can cause problems (aligning the features across periods is advised)
- If the model is trained on a specific population group (e.g. young people), its cross-validated performance may not generalise to the wider population, and predictions could differ greatly from actual results
Functional data analysis
Purpose of Functional Data Analysis (FDA):
- Most statistical analysis assumes that data points are independent; this is not true of time series, where successive points are often connected by an underlying function and its derivatives
- Functional Data Analysis is an approach developed to conduct statistical analysis and build models based on whole functions rather than independent points, and as such is well suited to time series data \cite{Ramsay_2009}
- Functional data analysis is suitable where phase variation is present in the data (such as in growth curves that start at different stages). Methods such as nonlinear mixed models, repeated-measures ANOVA, and principal components analysis do not consider these differences in timing [https://stats.stackexchange.com/questions/26048/when-where-to-use-functional-data-analysis]
Basic principles:
- FDA uses ‘basis functions’ to represent data series as a ‘functional data object’ \cite{Ramsay_2009}
- A function is represented as a linear combination of basis functions, \(f\left(t\right)=\sum_{i=1}^{k}\beta_i b_i\left(t\right)\), where the basis functions \(b_i\left(t\right)\) are known and the coefficients \(\beta_i\) are estimated from the data. This is often also written as \(f\left(t\right)=a_1\theta_1\left(t\right)+a_2\theta_2\left(t\right)+\dots+a_k\theta_k\left(t\right)\)
- Functional data objects can subsequently be used in functional linear regression analysis, in an analogous way to conventional linear regression: