Statistical comparisons of time series

This study considers 23 technologies for which literature evidence has been identified to classify the particular mode of technology substitution observed. Using bibliometric analysis methods it is possible to extract a variety of historical trends for any technology of interest, effectively generating a collection of time series data points associated with that technology (these multidimensional time series datasets are referred to here as 'technology profiles'). This raises the question of how best to compare dissimilar bibliometric technology profiles in an unbiased manner, in order to investigate whether literature-based technology substitution groupings can be determined using a classification system built on the assumptions given in section \ref{585124}. In particular, comparisons of technology time series can be subject to one or more sources of dissimilarity: time series may be based on different numbers of observations (e.g. covering different time spans), may be out of phase with each other, may be subject to long-term and shorter-term cyclic trends, may be at different stages of the Technology Life Cycle (or fluctuating between stages) \cite{little1981strategic}, or may be representative of dissimilar industries. A substantial body of work therefore exists on the statistical comparison of time series, and in particular on time series classification methods \cite{lin2012pattern}. Most modern pattern recognition and classification techniques emerging from the machine learning and data science domains fall broadly within the categories of supervised, semi-supervised, or unsupervised learning. An overview of current preprocessing, statistical significance testing, classification, feature alignment, clustering, cross-validation, and functional data analysis techniques for time series is provided in Appendix A, covering the considerations addressed in this study's methodology beyond those discussed directly in section \ref{330519}.

Method selection

Based on the technology classification problem considered, the bibliometric data available, and the methods discussed in Appendix A, the following methods have been selected for use in this analysis:

Technology Life Cycle stage matching process

For those technologies where evidence for determining the transitions between Technology Life Cycle stages has either not been found or is incomplete, a nearest neighbour pattern recognition approach based on the work of Gao \cite{Gao_2013} has been employed to locate the points where shifts between cycle stages occur. However, for the technologies considered in this paper, literature evidence has been identified for the transitions between stages, and so the nearest neighbour methodology is not discussed further here.
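For illustration only, a minimal sketch of the nearest neighbour stage matching idea is given below, assuming that z-normalised windows of an indicator series are compared against reference templates for each life cycle stage; the function names, stage labels, and windowing implied here are illustrative assumptions rather than the exact procedure of \cite{Gao_2013}.

\begin{verbatim}
import numpy as np

def nearest_stage(window, stage_templates):
    """Assign a windowed segment of an indicator series to the nearest
    Technology Life Cycle stage template (1-nearest-neighbour matching).

    stage_templates maps hypothetical stage labels (e.g. 'emerging',
    'growth', 'maturity', 'saturation') to reference arrays of the same
    length as window; both are z-normalised before comparison.
    """
    def znorm(x):
        s = x.std()
        return (x - x.mean()) / s if s > 0 else x - x.mean()

    w = znorm(np.asarray(window, dtype=float))
    distances = {
        stage: np.linalg.norm(w - znorm(np.asarray(t, dtype=float)))
        for stage, t in stage_templates.items()
    }
    return min(distances, key=distances.get)
\end{verbatim}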

Identification of significant patent indicator groups

In order to identify the bibliometric indicator groupings that could form the basis of a data-driven technology classification model, a combination of Dynamic Time Warping and the 'PAM' variant of K-Medoids clustering has been applied in this study. For the initial feature alignment and distance measurement stages of this process, Dynamic Time Warping remains widely recognised as the classification benchmark to beat (see Appendix A), and so this study does not seek to advance the feature alignment process beyond it.

Unlike the Technology Life Cycle stage matching process, which is based on a well-established technology maturity model, this study does not assume that a classification system based on the modes of substitution outlined in section \ref{585124} is intrinsically valid. For this reason an unsupervised learning approach has been adopted, so that human biases are minimised when determining whether a classification system based on presumptive technological substitution is valid, before a classification rule system is subsequently defined. This additionally means that labelling of predicted clusters can be carried out even if labels are only available for a small number of observed samples representative of the desired classes, or potentially even if none of the observed samples are definitively labelled. This is of particular use if the technique is to be expanded to a wider population of technologies, as obtaining evidence of the applicable mode of substitution that gave rise to a current technology can be a time-consuming process, and in some cases the necessary evidence may not be publicly available (e.g. where commercially sensitive performance data are involved). As such, clustering can provide an indication of the likely substitution mode of a given technology without prior training on technologies belonging to any given class. Under such circumstances this approach could be applied without collecting performance data, provided that the groupings produced by the analysis are broadly identifiable from inspection as being associated with the suspected modes of substitution (this is of course made easier if a handful of examples are known, but it is no longer a hard requirement).

The 'PAM' variant of K-Medoids is selected here over hierarchical clustering since the expected number of clusters is known from the literature, and keeping the number of clusters fixed allows for easier testing of how frequently predicted clusters align with expected groupings. Additionally, only a small sample of technologies is evaluated in this study, so the computational expense of the 'PAM' variant of K-Medoids relative to hierarchical clustering approaches is unlikely to be significant. It is also worth noting that by evaluating the predictive performance of each subset of patent indicator groupings independently it is possible to identify and rank commonly recurring patterns of subsets, which is not possible when using approaches such as Linear Discriminant Analysis, which can assess the impact of individual predictors but cannot rank the most suitable combinations of indicators.
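As a minimal sketch of this clustering step, and assuming only that each technology profile is reduced to a one-dimensional indicator series on a common scale, the following fragment computes pairwise Dynamic Time Warping distances with the standard dynamic programming recursion and passes the resulting distance matrix to the 'PAM' variant of K-Medoids (here via the scikit-learn-extra package); the function names and parameter choices are illustrative rather than those used in this study.

\begin{verbatim}
import numpy as np
from sklearn_extra.cluster import KMedoids  # assumes scikit-learn-extra is installed

def dtw_distance(a, b):
    """Classic dynamic programming DTW distance between two 1-D series."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],
                                 cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return cost[n, m]

def cluster_profiles(profiles, n_clusters):
    """Cluster technology profiles (a list of 1-D indicator series) with
    PAM K-Medoids on a precomputed DTW distance matrix."""
    k = len(profiles)
    dist = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            dist[i, j] = dist[j, i] = dtw_distance(profiles[i], profiles[j])
    model = KMedoids(n_clusters=n_clusters, metric="precomputed",
                     method="pam", random_state=0)
    return model.fit_predict(dist)
\end{verbatim}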

Ranking of significant patent indicator groups

As the number of technologies considered in this study is relatively small, exhaustive cross-validation approaches provide a feasible means of ranking the out-of-sample predictive capabilities of those bibliometric indicator subsets that have been identified as producing significant correlations with expected in-sample technology groupings. Leave-p-out cross-validation is therefore applied for this purpose, which also reduces the risk of over-fitting in the subsequent model building phase.
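A hedged sketch of this ranking step is shown below: each candidate subset of summary indicators is scored by its mean leave-p-out accuracy, with a simple 1-nearest-neighbour classifier standing in for the cluster-to-label matching used in this study; the indicator names, subset sizes, and classifier are illustrative assumptions rather than the study's exact procedure.

\begin{verbatim}
from itertools import combinations

import numpy as np
from sklearn.model_selection import LeavePOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def rank_indicator_subsets(X, y, indicator_names, p=2, max_subset_size=3):
    """Rank patent indicator subsets by mean leave-p-out accuracy.

    X: (n_technologies, n_indicators) matrix of summary indicators,
    y: expected substitution-mode labels. Exhaustive enumeration is
    feasible only because the number of technologies is small.
    """
    scores = {}
    lpo = LeavePOut(p=p)
    for size in range(1, max_subset_size + 1):
        for cols in combinations(range(X.shape[1]), size):
            clf = KNeighborsClassifier(n_neighbors=1)
            acc = cross_val_score(clf, X[:, list(cols)], y, cv=lpo).mean()
            scores[tuple(indicator_names[c] for c in cols)] = acc
    # Highest mean out-of-sample accuracy first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
\end{verbatim}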

Model building

Due to the importance of phase variance when comparing historical trends for different technologies, and the coupling that exists between adjacent points in growth and adoption curves, functional linear regression is selected here to build the technology classification model developed in this study (see notes on Functional Data Analysis in Appendix A for further details).
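As an illustration of the scalar-on-function form of functional linear regression assumed here, the following sketch expands the coefficient function in a Legendre basis and approximates the required integrals with the trapezoidal rule; the basis choice, the numeric coding of the response, and the grid handling are simplifying assumptions rather than the exact model specification used in this study.

\begin{verbatim}
import numpy as np
from numpy.polynomial.legendre import legvander

def fit_scalar_on_function(curves, t, y, n_basis=4):
    """Least-squares fit of a scalar-on-function linear model
    y_i ~ b0 + integral of x_i(t) * beta(t) dt, with beta(t) expanded
    in a Legendre basis and integrals approximated by the trapezoidal rule.

    curves: (n_samples, n_timepoints) resampled indicator curves,
    t: common time grid, y: numeric response (e.g. coded substitution mode).
    """
    curves = np.asarray(curves, dtype=float)
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    t_scaled = 2 * (t - t.min()) / (t.max() - t.min()) - 1   # map to [-1, 1]
    basis = legvander(t_scaled, n_basis - 1)                 # (n_timepoints, n_basis)
    # Z[i, k] = integral of x_i(t) * phi_k(t) dt
    Z = np.trapz(curves[:, :, None] * basis[None, :, :], t, axis=1)
    design = np.hstack([np.ones((len(curves), 1)), Z])       # add intercept column
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    beta_t = basis @ coefs[1:]                               # beta(t) on the grid
    return coefs[0], beta_t
\end{verbatim}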

Method limitations

Although precautions have been taken where possible to ensure that the methods selected for this study address the problem of building a generalised technology classification model based on bibliometric data in as rigorous a fashion as possible, there are some known limitations to the methods used in this work that must be recognised. Many of these limitations stem from the fact that technologies have been selected for this analysis based on where evidence is obtainable to indicate the mode of adoption followed. As such, the technologies considered here do not come from a truly representative cross-section of all industries, so it is possible that the models generated will better represent the industries considered rather than provide a more generalisable result. This evidence-based approach also means that it remains a time-consuming process to locate the literature needed to support classifying technology examples as arising from one mode of substitution or another, and to then compile the relevant cleaned patent datasets for analysis. As a result only a relatively limited number of technologies have been considered in this study, and this set should be expanded to increase confidence in the findings produced from this work. It also raises the risk that clustering techniques may struggle to produce consistent results based on the small number of technologies considered.

Furthermore, any statistical or quantitative methods used for modelling are unlikely to provide real depth of knowledge beyond the detection of correlations behind patent trends when used in isolation. Ultimately some degree of causal exploration, whether through case study descriptions, system dynamics modelling, or expert elicitation, will be required to shed more light on the underlying influences shaping technology substitution behaviours.

Other data-specific issues relate to the use of patent searches in this analysis and the need to resample data based on variable-length time series. The former reflects the fact that patent search results and records can vary considerably depending on the database and exact search terms used; however, overall trends, once normalised, should remain consistent with other studies of this nature. The latter refers to the fact that functional linear regression requires all technology case studies to be based on the same number of time samples, and as such, as discussed in Appendix A, linear interpolation is used as required to ensure consistency in the number of observations, possibly introducing small errors that are not felt to be significant.
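For completeness, a minimal sketch of the linear interpolation resampling referred to above is given below, assuming each series is simply mapped onto a common equally spaced grid; the fixed grid length is an illustrative choice rather than the value used in this study.

\begin{verbatim}
import numpy as np

def resample_series(values, n_points):
    """Linearly interpolate a variable-length time series onto a fixed
    number of equally spaced points, so that all technology case studies
    share a common grid before functional regression."""
    values = np.asarray(values, dtype=float)
    old_grid = np.linspace(0.0, 1.0, len(values))
    new_grid = np.linspace(0.0, 1.0, n_points)
    return np.interp(new_grid, old_grid, values)
\end{verbatim}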