Identification of smoothing parameter values for regression coefficients
With the functional data objects for each model component now ready, a cell array is generated for use in the functional linear regression, containing each model component along with a constant predictor term (i.e. a term equal to 1 for all technologies). Before the final regression analysis can be run, a smoothing parameter has to be selected for the regression coefficient basis system. This is separate from the smoothing parameter selected earlier for smoothing the technology profiles; this second smoothing parameter only addresses the roughness of the regression coefficients. It is again needed to guard against over-fitting, and to give the model converged on by the subsequent functional linear regression analysis the best chance of performing well out-of-sample when extended to future datasets. In this instance, the smoothing parameter is selected by calculating leave-one-out cross-validation scores (i.e. error sum of squares values) for functional responses over a range of smoothing parameter values, as per sections 9.4.3 and 10.6.2 of \cite{Ramsay_2009}. The functional parameter object used in the regression coefficient basis system is then redefined using the selected smoothing parameter value.
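The selection procedure described above can be illustrated with a minimal sketch. It is not the implementation used in this study: it assumes a single functional predictor sampled on an evenly spaced grid, a scalar response, a coefficient function represented directly by its grid values rather than by B-splines, and a second-difference roughness penalty. All function and variable names (`design_and_penalty`, `loocv_score`, `select_lambda`) are hypothetical.

```python
import numpy as np


def design_and_penalty(X, t):
    """Build the design matrix and roughness penalty (illustrative assumption).

    X : (n, T) array of predictor curves sampled on the even grid t.
    The coefficient function beta(t) is represented by its T grid values.
    """
    n, T = X.shape
    dt = t[1] - t[0]
    # First column is the constant predictor; the remaining columns give a
    # Riemann-sum approximation of the inner product  integral x_i(t) beta(t) dt.
    Z = np.hstack([np.ones((n, 1)), X * dt])
    # Second-difference operator approximating the second derivative of beta.
    D2 = np.diff(np.eye(T), n=2, axis=0) / dt**2
    P = np.zeros((T + 1, T + 1))
    P[1:, 1:] = D2.T @ D2 * dt  # the intercept is left unpenalised
    return Z, P


def fit_penalised(Z, y, P, lam):
    """Solve the penalised least-squares normal equations."""
    return np.linalg.solve(Z.T @ Z + lam * P, Z.T @ y)


def loocv_score(Z, y, P, lam):
    """Leave-one-out error sum of squares for one smoothing parameter."""
    n = len(y)
    sse = 0.0
    for i in range(n):
        keep = np.arange(n) != i            # drop observation i
        coef = fit_penalised(Z[keep], y[keep], P, lam)
        sse += (y[i] - Z[i] @ coef) ** 2    # error on the held-out case
    return sse


def select_lambda(X, y, t, lambdas):
    """Return the candidate with the lowest leave-one-out score, plus all scores."""
    Z, P = design_and_penalty(X, t)
    scores = np.array([loocv_score(Z, y, P, lam) for lam in lambdas])
    return lambdas[int(np.argmin(scores))], scores
```

Refitting the model once per held-out technology is affordable here because only 20 technologies are considered; with larger datasets a closed-form leave-one-out shortcut would normally be preferred.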
Results and Discussion
The functional linear regression analysis is now run with the identified smoothing parameters and scalar response variables to identify the \(\beta_i\) coefficients and the corresponding variance, which is used to define the 95% confidence bounds (see sections 9.4.3 and 9.4.4 of \cite{Ramsay_2009} respectively). Fig. \ref{820059} to Fig. \ref{942889} show the resulting \(\beta_i\) coefficients and confidence bounds for the number of non-corporates and the number of cited references by priority year during the emergence phase of development when using a high-dimensional regression fit (i.e. when the beta basis system for each regression coefficient is made up of a large number of B-splines). This regression fit correctly identifies the mode of substitution from patent data available in the emergence stage for 19 of the 20 technologies considered. On preliminary inspection, this classification model therefore appears to provide a good degree of accuracy, but further investigation is required to ensure that the model is not over-fitted, and that the result is not simply a naturally occurring phenomenon.
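For reference, a scalar-response functional linear model of the general kind described in chapter 9 of \cite{Ramsay_2009} can be written as shown below. The notation is an assumption made here for illustration and is not taken from the original analysis: \(y_k\) denotes the response (mode of substitution) for technology \(k\), \(x_{ki}(t)\) the smoothed profile of patent indicator \(i\) for that technology over the emergence phase \([0, T]\), \(\beta_0\) the coefficient of the constant predictor term, and \(\varepsilon_k\) the residual.

\[
y_k = \beta_0 + \sum_{i \geq 1} \int_0^T x_{ki}(t)\,\beta_i(t)\,dt + \varepsilon_k
\]

Under the same assumed notation, approximate pointwise 95% confidence bounds are commonly taken as roughly two standard errors either side of each estimated coefficient function:

\[
\hat{\beta}_i(t) \pm 2\sqrt{\widehat{\mathrm{Var}}\left[\hat{\beta}_i(t)\right]}
\]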
From the confidence bounds on these plots it can be seen that, for both the number of non-corporates and the number of cited references by priority year, the variance across technology profiles is highest at the start of the emergence phase. This is not entirely surprising, as it represents the point of greatest uncertainty: it is often when the least data is available for comparing each technology, and also when development activity is most haphazard and sporadic. Fig. \ref{822351} and Fig. \ref{942889} also illustrate how the relative importance of the chosen technology (Fig. \ref{822351}) and science (Fig. \ref{942889}) patent indicators in determining the predicted mode of substitution varies with time during the emergence phase (based on the datasets used), although these functions do not directly provide a causal explanation for this relative weighting. More specifically, deviations away from zero in these coefficient functions equate to an increased positive or negative weighting for the associated patent indicator count at that moment in time in the determination of the predicted mode of substitution. As such, it can be seen from Fig. \ref{822351} that any patent indicator counts at \(t = 0\) for the number of non-corporates by priority year (assuming these are present) will have a more significant influence on the predicted classification than at any other point in the emergence phase. Equally, Fig. \ref{822351} suggests that the impact of non-corporate activity next peaks around 40% of the way through the emergence phase (potentially corresponding to the hype effect suggested by Fig. \ref{413726}), and again at the end of the emergence phase. For the number of cited references by priority year, this regression model suggests that the times of greatest impact on the mode of substitution are at the very beginning and at the very end of the emergence stage.
Whilst these coefficient plots give some indication of the relative weighting applied to patent indicator counts as time progresses, the cumulative nature of the inner products used in functional linear regression means it is not possible to infer visually from these plots alone which mode the technology under evaluation is currently converging towards. For this it is also necessary to include the corresponding patent indicator count values that these coefficient terms are multiplied by for the specific technology being assessed.
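The point about cumulative inner products can be made concrete with a short sketch. It assumes that a coefficient function and one technology's indicator profile have both been evaluated on a common time grid, and it accumulates their product with the trapezoidal rule. The function name `cumulative_contribution` and its arguments are hypothetical and are not part of the original analysis.

```python
import numpy as np


def cumulative_contribution(t, x, beta):
    """Running value of the inner product  integral_0^t x(s) beta(s) ds.

    t    : increasing grid of time points over the emergence phase
    x    : indicator counts for one technology, evaluated on t
    beta : regression coefficient function, evaluated on t
    """
    integrand = x * beta
    # Trapezoidal area of each grid interval, then a running total.
    steps = 0.5 * (integrand[1:] + integrand[:-1]) * np.diff(t)
    return np.concatenate(([0.0], np.cumsum(steps)))
```

The final element of the returned array is that indicator's total contribution to the predicted response, while earlier elements show how the contribution builds up through the emergence phase for the specific technology being assessed.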