Our initial strategy involved using the three most popular skills in a cluster as its label. This method did not produce satisfactory results: the three most popular skills often failed to communicate the breadth of a cluster and instead described a relatively narrow segment of its skills. Including more than three skills, however, makes the label too long to read comfortably in tables and data visualisations. In upcoming research, we will explore other strategies for automating labelling. One potential strategy is to search Wikipedia for the skills in a cluster and identify the most common terms and categories used to describe them. For example, a term common to the descriptions of both cardiology and oncology could be 'specialised medicine'.
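To illustrate this strategy, a minimal sketch follows: it retrieves the Wikipedia categories attached to each skill in a cluster and counts the most common ones as candidate labels. The third-party wikipedia package and the example skill names are assumptions for illustration, not part of the pipeline described in this paper.

\begin{verbatim}
# Minimal sketch of the proposed Wikipedia-based labelling strategy.
# Assumes the third-party `wikipedia` package; skill names are illustrative.
from collections import Counter
import wikipedia

def candidate_labels(cluster_skills, top_n=5):
    """Count the categories Wikipedia uses to describe the cluster's skills."""
    counts = Counter()
    for skill in cluster_skills:
        try:
            page = wikipedia.page(skill, auto_suggest=False)
        except wikipedia.exceptions.WikipediaException:
            continue  # skip skills without an unambiguous article
        counts.update(page.categories)
    return counts.most_common(top_n)

print(candidate_labels(["Cardiology", "Oncology"]))
\end{verbatim}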
While the data-driven approach used to generate the skills taxonomy offers significant advantages, end users might still perceive an expert-curated taxonomy to be of higher quality. This may be because expert-curated taxonomies incorporate data from different sources and reflect input from relevant industry bodies. To increase the validity of the proposed skills taxonomy, we intend to refine it using feedback from ONS occupational experts, career advice services, educators and professional associations. This will increase the utility of the taxonomy for its users.

Conclusion

In this paper, we demonstrate how a taxonomy of employer skills, competence and knowledge requirements can be derived in a data-driven way. Using the initial results of the proposed method, we show that the automatically generated skills taxonomy performs reasonably well. The taxonomy contains three hierarchical layers, which are identified by applying a modularity optimisation community detection algorithm with bootstrapping and consensus clustering. The quality of the clustering is enhanced by using a word-embedding approach to capture the strength of relationships between skills, as opposed to relying only on a frequency-based measure.
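To make this pipeline concrete, the following is a minimal sketch of bootstrapped Louvain clustering with consensus aggregation over a weighted skill graph, using python-igraph. The perturbation scheme, the number of repetitions and the 0.5 stability threshold are illustrative assumptions rather than the exact settings used in our analysis.

\begin{verbatim}
# Minimal sketch: bootstrapped Louvain + consensus clustering.
# The noise model, 0.5 threshold and 100 repeats are illustrative assumptions.
import numpy as np
import igraph as ig

def consensus_membership(weights, n_boot=100, threshold=0.5, seed=0):
    """weights: symmetric skill-by-skill matrix of relationship strengths."""
    rng = np.random.default_rng(seed)
    n = weights.shape[0]
    consensus = np.zeros((n, n))
    for _ in range(n_boot):
        # Perturb edge weights to emulate resampling the underlying adverts.
        upper = np.triu(weights * rng.uniform(0.5, 1.5, size=weights.shape), 1)
        g = ig.Graph.Weighted_Adjacency((upper + upper.T).tolist(),
                                        mode="undirected")
        labels = np.asarray(g.community_multilevel(weights="weight").membership)
        consensus += labels[:, None] == labels[None, :]
    consensus /= n_boot
    # Keep only stable co-assignments, then cluster the consensus matrix.
    consensus[consensus < threshold] = 0.0
    np.fill_diagonal(consensus, 0.0)
    g_cons = ig.Graph.Weighted_Adjacency(consensus.tolist(), mode="undirected")
    return g_cons.community_multilevel(weights="weight").membership
\end{verbatim}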
In addition to generating the taxonomy, we also extract useful metadata on each skill cluster, mapping relationships between skill clusters and salaries, occupations and job titles. We also trial a method for determining the level of skill specialisation by applying a Gaussian mixture model to the eigenvector centrality of each skill.
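A minimal sketch of this specialisation step is shown below. The two-component mixture and the stand-in network are illustrative assumptions; in our analysis, the input is the weighted skills graph.

\begin{verbatim}
# Minimal sketch: Gaussian mixture over eigenvector centrality.
import numpy as np
import igraph as ig
from sklearn.mixture import GaussianMixture

# Stand-in network; in practice `g` is the weighted skills graph.
g = ig.Graph.Famous("Zachary")
centrality = np.asarray(g.eigenvector_centrality()).reshape(-1, 1)

# Two components are an illustrative assumption: a low-centrality
# (specialised) mode and a high-centrality (transversal) mode.
gmm = GaussianMixture(n_components=2, random_state=0)
component = gmm.fit_predict(centrality)
specialised = component == np.argmin(gmm.means_.ravel())
\end{verbatim}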
We make a number of contributions to the existing literature. The proposed skills taxonomy represents the first transparent, non-expert-driven taxonomy that is independent of established frameworks such as ESCO or O*NET. The taxonomy is developed automatically and identifies meaningful patterns in employer requirements without any preconditions on how requirements should be grouped. This minimises the risk that interrelationships between skills are overlooked simply because they do not fit a traditional view of how skills should be organised. For example, machine learning and pattern recognition are usually grouped with computing skills, while in our taxonomy they reside in the Physics and math cluster: even though these skills are often applied together with programming, they are grounded in knowledge of mathematics.
One of the important contributions of the proposed skills taxonomy is that it offers the possibility of describing occupations from the perspective of skills. For this reason, in the upcoming applied analysis paper, we intend to map the developed skills taxonomy to ONS SOC codes. As an exploratory exercise, we have studied the composition of the 200 most popular job titles in terms of third-layer skill clusters. The results, listing each job title and its three most prominent skill clusters, are shown in Table 7 in Appendix 3. In future work, we will extend this approach from job titles to SOC codes. The resulting crosswalk between skill clusters and SOC codes will create a foundation for combining official labour market statistics with the proposed skills taxonomy to produce novel measures of skills demand, supply and mismatch.
In future research, we also plan to extend the current hierarchical representation of the taxonomy into an ontology, in which not just the direct but also the lateral relationships between clusters are captured. The resulting ontology can then be implemented as a graph database accessible to the public.
We would also like to study the evolution of employer requirements over time using the methodology described in \citet*{Rosvall2010}.
The resulting skills taxonomy, the algorithm for developing it and the interactive data visualisation will all be released to the public. We believe that these resources will benefit a wide audience, allowing policymakers, educators and individuals to better understand the skill sets needed by employers and the associated salaries and job titles. The taxonomy also provides a foundation for measuring the similarity of jobs and occupations based on skills, competences and knowledge. These insights could be directly applied to inform policy on reskilling and to identify job transition opportunities for occupations at risk of decline.

Acknowledgements

The authors are grateful for the thoughts of colleagues at Nesta, the Economic Statistics Centre of Excellence and the Office for National Statistics on this work. Particular thanks are due to Hasan Bakhshi for his comments on early drafts.

Appendices

Appendix 1: Overview of community detection algorithms
Stochastic Block Modelling (SBM) involves fitting a generative model of a graph to data \citep*{peixoto_bayesian_2017}. Under SBM, nonparametric statistical inference is applied to partition the graph in such a way as to maximise the explanatory power of a fitted model given the observed edges. From the candidates, the minimum description length model (i.e. the simplest model) is selected to prevent overfitting. SBMs have been found to produce some of the best results on real-life networks and are capable of identifying several types of network structures in addition to communities. SBMs can also detect hierarchical structures in networks and can be extended to overlapping communities. For the purposes of our analysis, we use a degree-corrected SBM that employs a Markov chain Monte Carlo (MCMC) algorithm \citep*{peixoto_efficient_2014} as implemented in the python igraph library.
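For illustration, the sketch below fits a degree-corrected SBM by description-length minimisation and refines it with MCMC sweeps. It uses the graph-tool library, in which the inference routines of \citet*{peixoto_efficient_2014} are implemented; the example network and the number of sweeps are assumptions for illustration rather than our exact pipeline.

\begin{verbatim}
# Minimal sketch: degree-corrected SBM fit via description-length
# minimisation, refined with merge-split MCMC sweeps (graph-tool).
import numpy as np
import graph_tool.all as gt

g = gt.collection.data["karate"]   # stand-in for the skills graph

state = gt.minimize_blockmodel_dl(g, state_args=dict(deg_corr=True))
for _ in range(100):               # illustrative number of sweeps
    state.multiflip_mcmc_sweep(niter=10, beta=np.inf)

blocks = state.get_blocks()        # block (community) label per vertex
print("description length:", state.entropy())
\end{verbatim}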
The Louvain multilevel community detection algorithm identifies communities in a network that maximise the quality of the partitioning \citep*{blondel2008fast}. The established metric for measuring the quality of communities is modularity. Modularity takes values in $[-1, 1]$ and captures the concentration of edges within communities relative to the distribution of edges that would be observed in a random graph with the same vertex degree distribution. The Louvain algorithm is hierarchical: it starts with each vertex in its own community and then iteratively groups vertices so as to increase the overall modularity score. The algorithm is intuitive and is one of the most commonly used methods for identifying network communities; Louvain was found to be the second best-performing method in the comparative analysis of algorithms conducted by Lancichinetti and Fortunato \citep*{fortunato2016community}. Criticism of modularity optimisation algorithms focuses on their limitations in identifying an appropriate level of resolution: they may split large communities or merge smaller ones, and they may underperform compared with other methods when the true number of clusters is not known.
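A minimal sketch of Louvain clustering with python-igraph follows; the example graph is a stand-in for the weighted skills network.

\begin{verbatim}
# Minimal sketch: Louvain (multilevel) modularity optimisation in igraph.
import igraph as ig

g = ig.Graph.Famous("Zachary")        # stand-in for the skills graph
partition = g.community_multilevel()  # pass weights="weight" if weighted
print(len(partition), "communities, modularity =",
      round(partition.modularity, 3))
\end{verbatim}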
Infomap is a dynamics-based community detection algorithm, which identifies communities by measuring the flow of information through the network using random walks \citep*{rosvall2008maps}. The rationale behind the method is that, due to the higher density of edges within communities, random walkers become trapped and spend longer inside communities. Infomap improved on early implementations of dynamics-based algorithms by using information theory to define the most parsimonious description of a graph's community structure. Infomap is especially effective when applied to directed networks, where it can identify communities that would not be detected by modularity optimisation algorithms.
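The corresponding python-igraph call is sketched below; the trials parameter controls how many independent runs are performed before the best partition is kept.

\begin{verbatim}
# Minimal sketch: Infomap community detection in igraph.
import igraph as ig

g = ig.Graph.Famous("Zachary")              # stand-in for the skills graph
partition = g.community_infomap(trials=10)  # best of 10 independent runs
print(partition.summary())
\end{verbatim}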