Cluster robustness

The level of agreement varies across the layers of the skills taxonomy. At the top layer, across 500 bootstrapped samples, the weighted average NMI is 0.85, which indicates a high level of agreement (a value of 1 corresponds to complete agreement). At the deeper layers, the level of agreement is lower for certain groups (Table 4). For instance, for Science and research, the average NMI is 0.60. The measure is also below 0.70 for Engineering, construction and transport and for Health and social care. There are two possible explanations for the lower agreement on cluster membership between iterations of the algorithm. First, it is likely that in some skill domains there is a substantial degree of complementarity between skills. For example, in the Health and social care domain, skills related to nursing can be applied in primary care as well as in critical care. This means that during the bootstrapping and community detection stages, we may identify alternative combinations of skills that are often required together. Second, certain skills are more transversal than others. While we remove the most prominent transversal skills early on, some remaining skills still have higher than average centrality. These skills may keep moving between clusters, lowering the agreement between clustering iterations. One example is biology, which refers to general knowledge applicable in molecular biology, infectious disease research and other areas of life science research. In future work, we aim to identify these domain-central skills and measure their impact on cluster robustness. More importantly, these skills might represent foundation skills, which can widen the feasible set of job transitions for individuals.
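As an illustration of the agreement measure discussed above, the following minimal sketch computes NMI between two partitions of the same skills and averages it over bootstrapped partitions. It is a sketch under stated assumptions, not the implementation used to produce Table 4: the arithmetic-mean normalisation, the function names and the `weights` argument are all assumptions, since the weighting scheme is not specified in this section.

```python
from collections import Counter
from math import log


def nmi(labels_a, labels_b):
    """Normalised mutual information between two partitions of the same items.

    Assumption: normalisation by the arithmetic mean of the two entropies.
    """
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    # Mutual information I(A; B) from the joint and marginal cluster counts
    mi = sum(
        (c / n) * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    # Entropy of each partition
    h_a = -sum((c / n) * log(c / n) for c in count_a.values())
    h_b = -sum((c / n) * log(c / n) for c in count_b.values())
    if h_a == 0 and h_b == 0:
        return 1.0  # both partitions place everything in a single cluster
    return 2 * mi / (h_a + h_b)


def weighted_average_nmi(reference, samples, weights):
    """Weighted mean NMI between a reference partition and sampled partitions.

    `weights` is a hypothetical placeholder for whatever weighting is used.
    """
    total = sum(weights)
    return sum(w * nmi(reference, s) for s, w in zip(samples, weights)) / total
```

A value of 1 is returned when two partitions are identical up to a relabelling of clusters, and 0 when cluster membership in one partition carries no information about membership in the other.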

Discussion

While the skills taxonomy is generated entirely automatically without expert input, it appears to perform reasonably well in identifying distinct groups of skills, competences and knowledge. As shown in Tables 5 and 6 in Appendix 2, the cluster profiles, especially at the first and second layers of the taxonomy, reflect established occupational domains, such as Education, Health, Information Technology and Business Administration. The metadata on skill clusters, such as salary and job titles, also appear to be generally aligned with the data from official statistics. For example, the clusters with the highest average salary are located in Securities trading and Data engineering, while the lowest paid skill clusters are in Medical and Office administration (all of these clusters reside in the third layer of hierarchy).
Initial results demonstrate that the data-driven approach to grouping skills, competences and knowledge areas has its merits. At the same time, in its current state, the methodology for deriving the hierarchical taxonomy has several limitations. The first limitation is that we do not allow skills to exist in multiple parts of the taxonomy. The current hierarchical structure places a skill in the cluster in which it is most strongly connected to other members. However, it is likely that certain skills, competences and knowledge such as cooking and data science will have lateral links to other clusters. For example, cooking resides in Social work and caregiving, but can also be connected to food service in retail. Similarly, data science, which is currently in Marketing research, could also sit in Data engineering. To address the limitations of hard partitioning, we propose complementing the provided hierarchical structure with a simplified graph of skill clusters. In this graph, all the vertices will be contracted to their 3rd layer clusters. The links between the 143 clusters can then be aggregated and used to explore the lateral relationships between skill clusters.
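The contraction step described above can be sketched as follows. This is a minimal illustration rather than the planned implementation: the function name, the edge representation, the summing of weights and the toy skills and weights in the usage note are all assumptions.

```python
from collections import defaultdict


def contract_to_clusters(edges, cluster_of):
    """Aggregate skill-level edge weights into cluster-level edge weights.

    edges:      iterable of (skill_u, skill_v, weight) for an undirected graph
    cluster_of: dict mapping each skill to its third-layer cluster label
    Assumption: links are aggregated by summing their weights.
    """
    cluster_edges = defaultdict(float)
    for skill_u, skill_v, weight in edges:
        cu, cv = cluster_of[skill_u], cluster_of[skill_v]
        if cu == cv:
            continue  # within-cluster link, not a lateral relationship
        key = tuple(sorted((cu, cv)))  # undirected: order of endpoints is irrelevant
        cluster_edges[key] += weight
    return dict(cluster_edges)
```

For instance, with invented weights, two links from data science (in Marketing research) to two hypothetical skills in Data engineering would be merged into a single weighted link between those two clusters, while the link between the two Data engineering skills would be dropped.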
The second limitation relates to the current lack of clarity on how to incorporate incoming information on job adverts. In future work, we would like to explore the advantages and disadvantages of running the analysis on the whole dataset, updated with new information, as opposed to generating the word embeddings and the taxonomy on temporal slices of the data. For further validation and to assess the extent to which the clusters are distinct, we will also collect text from Wikipedia articles on individual skills in each cluster. We will then analyse the extent to which article terms are associated with certain skill clusters using the mutual information method. Given the nested nature of the taxonomy, we expect the clusters at deeper hierarchy layers to be more similar and refer to the same subject domains. This is why the proposed analysis is likely to be more appropriate, as a validation method, if applied to first and second layers only.
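One way the proposed term-cluster association could be measured is sketched below. It assumes each article is reduced to a set of terms and labelled with the cluster of the skill it describes, and it treats term presence and cluster membership as binary variables; these modelling choices and the function name are assumptions, not a description of the planned analysis.

```python
from math import log2


def term_cluster_mi(docs, term, cluster):
    """Mutual information (in bits) between a term and a cluster.

    docs: list of (cluster_label, set_of_terms) pairs, one per article.
    Both variables are binary: term present or absent, article in the
    cluster or not.
    """
    n = len(docs)
    mi = 0.0
    for term_present in (True, False):
        for in_cluster in (True, False):
            # Joint count for this combination of the two binary variables
            n_joint = sum(
                1 for c, terms in docs
                if (term in terms) == term_present and (c == cluster) == in_cluster
            )
            if n_joint == 0:
                continue  # an empty cell contributes nothing to the sum
            n_term = sum(1 for _, terms in docs if (term in terms) == term_present)
            n_clus = sum(1 for c, _ in docs if (c == cluster) == in_cluster)
            mi += (n_joint / n) * log2(n * n_joint / (n_term * n_clus))
    return mi
```

Terms with high mutual information for a cluster occur mostly in, or mostly outside, articles on that cluster's skills. If the top-ranked terms differ clearly between clusters, this would suggest that the clusters are distinct.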
Finally, the skill cluster labelling needs to be automated to enable regular re-running of the methodology. However, generating labels automatically is a challenging task. Our initial strategy involved using the three most demanded skills in the cluster as a label. This method did not produce satisfactory results, as these three skills often failed to communicate the breadth of the skills cluster and appeared to describe a relatively narrow segment of skills. Including more than three skills makes the label too long and difficult to read in tables and data visualisations. In future research, we plan to explore other strategies to automate labelling. One strategy is to search Wikipedia for the skills in the clusters and identify the most common terms and categories used to describe them. For example, a common term occurring in descriptions for cardiology and oncology could be specialised medicine.
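The initial labelling strategy can be sketched in a few lines. The sketch assumes that demand for a skill is measured by the number of adverts mentioning it; the function name and input format are hypothetical.

```python
from collections import Counter


def top_skill_label(adverts, cluster_skills, k=3):
    """Label a cluster with its k most frequently demanded skills.

    adverts:        iterable of collections of skills, one per job advert
    cluster_skills: set of skills that belong to the cluster
    Assumption: a skill counts at most once per advert.
    """
    counts = Counter(
        skill
        for advert in adverts
        for skill in set(advert)
        if skill in cluster_skills
    )
    return ", ".join(skill for skill, _ in counts.most_common(k))
```

The limitation noted above is visible in the construction: the label depends only on the few most frequent skills, so it says nothing about the remaining members of the cluster.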
While the data-driven approach used to generate the skills taxonomy offers significant advantages, the end users of the taxonomy might still perceive an expert-curated taxonomy to be of higher quality. This may be because expert-curated taxonomies incorporate data from different sources and reflect input from relevant industry bodies. To increase the validity of the proposed skills taxonomy, we intend to refine it using feedback from ONS occupational experts, career advice services, educators and professional associations. This will enable us to increase the utility of the taxonomy for users.

Conclusion

In this paper, we demonstrate how a taxonomy of employer skills, competence and knowledge requirements can be derived in a data-driven way. Using the initial results of the proposed method, we show that the automatically generated skills taxonomy performs reasonably well. The taxonomy contains three hierarchical layers, which are identified by applying a modularity optimisation community detection algorithm with bootstrapping and consensus clustering. The quality of the clustering is enhanced by using a word embeddings approach to capture the strength of relationships between the skills as opposed to relying only on a frequency-based measure.
In addition to generating the taxonomy, we also extract useful metadata on each skill cluster, mapping relationships between skill clusters and salary, occupations, and job titles. We also trial a method for determining the level of a skill specialisation by applying the Gaussian mixture model technique to the skill eigenvector centrality.
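To illustrate the idea of separating skills by centrality, the sketch below fits a two-component one-dimensional Gaussian mixture to precomputed eigenvector centrality values using expectation-maximisation. It is only a sketch: the number of components, the deterministic initialisation, the variance floor and all function names are assumptions, and in practice an established library implementation would be preferable.

```python
from math import exp, pi, sqrt


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return exp(-((x - mean) ** 2) / (2 * var)) / sqrt(2 * pi * var)


def fit_two_component_gmm(values, n_iter=100, min_var=1e-6):
    """Fit a two-component 1-D Gaussian mixture with EM.

    Assumptions: two components, means initialised at the minimum and
    maximum value, and a variance floor to avoid degenerate components.
    """
    n = len(values)
    means = [min(values), max(values)]
    overall_mean = sum(values) / n
    var0 = max(sum((v - overall_mean) ** 2 for v in values) / n, min_var)
    variances = [var0, var0]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each value
        resp = []
        for v in values:
            p = [weights[k] * normal_pdf(v, means[k], variances[k]) for k in range(2)]
            total = sum(p) or 1e-300  # guard against numerical underflow
            resp.append([pk / total for pk in p])
        # M-step: update mixing weights, means and variances
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            variances[k] = max(
                sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, values)) / nk,
                min_var,
            )
    return weights, means, variances


def assign_component(value, weights, means, variances):
    """Index of the component with the highest weighted density for a value."""
    scores = [weights[k] * normal_pdf(value, means[k], variances[k]) for k in range(2)]
    return scores.index(max(scores))
```

With this initialisation, component 0 covers the lower-centrality skills and component 1 the higher-centrality ones, so the assigned component can serve as a coarse indicator of how specialised or general a skill is.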
We make a number of contributions to the existing literature. The proposed skills taxonomy represents the first transparent non-expert-driven taxonomy that is independent from established frameworks such as ESCO or O*NET. The taxonomy is developed automatically and identifies meaningful patterns in the employer requirements without any preconditions for how requirements should be grouped. Because of this, the taxonomy minimises the risk that interrelationships between skills are overlooked because they do not fit a traditional view of how skills should be organised. For example, machine learning and pattern recognition are usually grouped with computing skills, while in our taxonomy, these reside in the Physics and math cluster. Even though these skills are often applied together with programming, they are grounded in knowledge of mathematics.
One of the important contributions of the proposed skills taxonomy is that it offers a possibility of describing occupations from the perspective of skills. This is why, in the upcoming applied analysis paper, we intend to map this skills taxonomy to ONS SOC codes. As an exploratory exercise, we have studied the composition of the 200 most popular job titles by the third layer skill clusters. The results listing each job title and the most prominent skill clusters are shown in Table 7 in Appendix 3. In future work, we will extend this approach from job titles to SOC codes. The resulting crosswalk between skill clusters and SOC codes will create a foundation for combining official labour market statistics with the skills taxonomy to produce novel measures of skills demand, supply and mismatch. 
In future research, we also plan to extend the current hierarchical representation of the taxonomy into an ontology that captures not just the direct but also the lateral relationships between clusters. The resulting ontology can then be implemented as a graph database accessible to the public. We would also like to study the evolution of employer requirements over time using the methodology described by \citet*{Rosvall2010}.
The resulting skills taxonomy as well as the algorithm for developing it and the interactive data visualisation will all be released to the public. We believe that these resources would benefit a wide audience and allow policymakers, educators and individuals to better understand the skill sets needed by employers and the associated salaries and job titles. The taxonomy also provides a foundation for measuring the similarity of jobs/occupations based on skills, competences and knowledge. These insights could be directly applied to inform policy on reskilling and identifying job transition opportunities for occupations at risk of decline.
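As a simple illustration of skill-based job similarity, the sketch below compares two jobs by the cosine similarity of their skill cluster profiles. The choice of cosine similarity, the profile format (cluster label mapped to the share or count of skill mentions) and the function name are assumptions; the paper does not commit to a particular similarity measure.

```python
from math import sqrt


def cosine_similarity(profile_a, profile_b):
    """Cosine similarity between two skill cluster profiles.

    Each profile is a dict mapping a cluster label to a non-negative weight,
    for example the share of a job's skill mentions falling in that cluster.
    """
    shared = set(profile_a) & set(profile_b)
    dot = sum(profile_a[c] * profile_b[c] for c in shared)
    norm_a = sqrt(sum(v * v for v in profile_a.values()))
    norm_b = sqrt(sum(v * v for v in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Jobs that draw on the same clusters in the same proportions score 1, and jobs with no clusters in common score 0, which gives a rough ranking of candidate job transitions.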

Acknowledgements

The authors are grateful for the thoughts of colleagues at Nesta, the Economic Statistics Centre of Excellence and the Office for National Statistics on this work. Particular thanks are due to Hasan Bakhshi for his comments on early drafts.