While cluster membership is stable between 2012-2016 and 2017 at the top layer of the hierarchy, stability decreases at the second and third layers: some clusters that are separate in 2012-2016 are merged in 2017, and vice versa. Future work will therefore need to make the assignment of skills to clusters more robust at the lower layers of the hierarchy. Several potential methods for achieving greater cluster stability are outlined in the Discussion section.

Discussion

Although the skills taxonomy is generated entirely automatically, without expert input, it appears to perform reasonably well in identifying distinct groups of skills, competences and knowledge. As shown in Table 4 in the Appendix, the cluster profiles, especially at the first and second layers of the taxonomy, reflect established occupational domains such as education, health, information technology and business administration. The metadata on skill clusters, such as salary and job titles, also appear to be broadly aligned with official statistics. For example, the clusters with the highest minimum salary are located in finance, tax and compliance, and software engineering, while the lowest-paid skill clusters are in caregiving and retail (all of these clusters reside in the third layer of the hierarchy).
Initial results demonstrate that the data-driven approach to grouping skills, competences and knowledge areas has its merits. At the same time, in its current state the methodology for deriving the hierarchical taxonomy has several limitations. The first limitation is the declining confidence in cluster membership at the tips of the tree. The fact that splitting a cluster improves the modularity score does not mean that the resulting lower-level clusters are well separated. It is likely that, with increasing depth, the clusters become fragmented and are driven by stochastic artefacts rather than meaningful differences. The challenge is to identify an objective criterion for determining confidence in cluster partitioning. One potential solution is to apply an approach commonly used in phylogenetics, where a consensus tree is built from many hierarchies, each generated from a bootstrapped sample of the original data. Bootstrapping would potentially allow us to reduce the influence of spurious variation in the underlying data and to identify whether the detected patterns of cluster partitioning are stable. For example, with this method we could test, at each depth of the tree, the extent to which skills are consistently grouped together. Splitting should stop when the resulting sub-clusters do not demonstrate high confidence (e.g. if their skills are grouped together on only 50% of occasions).
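The stopping rule described above can be illustrated with a minimal sketch. It is not a full consensus-tree procedure: it assumes that the community detection step has already been run on each bootstrapped sample and that the resulting partitions are available as dictionaries. The function names, the toy replicates and the 0.8 threshold are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def co_clustering_frequency(partitions):
    """Fraction of replicates in which each pair of skills shares a cluster.

    `partitions` is a list of dicts mapping skill -> cluster label, one dict
    per bootstrap replicate. A pair is only counted in replicates in which
    both skills are present.
    """
    together = defaultdict(int)
    seen = defaultdict(int)
    for partition in partitions:
        for a, b in combinations(sorted(partition), 2):
            seen[(a, b)] += 1
            if partition[a] == partition[b]:
                together[(a, b)] += 1
    return {pair: together[pair] / n for pair, n in seen.items()}


def cluster_support(cluster, freq):
    """Mean pairwise co-clustering frequency among the members of a cluster."""
    pairs = list(combinations(sorted(cluster), 2))
    if not pairs:
        return 1.0  # a singleton is trivially consistent
    return sum(freq.get(pair, 0.0) for pair in pairs) / len(pairs)


def accept_split(sub_clusters, freq, threshold=0.8):
    """Accept a split only if every sub-cluster reaches the support threshold.

    The threshold is an assumed value, not one taken from the paper.
    """
    return all(cluster_support(c, freq) >= threshold for c in sub_clusters)


# Toy labels from three hypothetical bootstrap runs.
replicates = [
    {"python": 0, "sql": 0, "nursing": 1, "first aid": 1},
    {"python": 0, "sql": 0, "nursing": 1, "first aid": 1},
    {"python": 0, "sql": 1, "nursing": 1, "first aid": 1},
]
freq = co_clustering_frequency(replicates)
proposed = [{"python", "sql"}, {"nursing", "first aid"}]
split_ok = accept_split(proposed, freq, threshold=0.8)
```

In this toy example "python" and "sql" share a cluster in only two of the three replicates, so the proposed split would be rejected at a threshold of 0.8 and accepted at a more lenient one such as 0.6.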
The second limitation is that we do not allow skills to exist in multiple parts of the taxonomy. The current hierarchical structure places a skill in the cluster with whose members it is most strongly connected. However, it is likely that certain skills, competences and knowledge, such as cooking and biology, will have lateral links to other clusters. For example, cooking resides in caregiving, but can also be connected to food service in retail. Similarly, biology, which is currently in pathology, could also sit in education. To address the limitation of hard partitioning, we propose to complement the hierarchical structure with a simplified graph of skill clusters. In this graph, all vertices will be contracted to their third-layer clusters. The links between the 147 clusters can then be aggregated and used to explore the lateral relationships between skill clusters.
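The contraction step can be sketched as follows. This is a minimal illustration that assumes the skill graph is available as a list of weighted edges and that each skill has a single third-layer cluster label. Summing edge weights is one possible aggregation rule among several, and the skills, labels and weights in the example are invented.

```python
from collections import defaultdict


def contract_to_clusters(edges, membership):
    """Aggregate skill-level edge weights into cluster-level edge weights.

    edges: iterable of (skill_a, skill_b, weight) tuples.
    membership: dict mapping skill -> third-layer cluster label.
    Returns a dict {(cluster_x, cluster_y): total weight} with the two labels
    in sorted order. Edges inside a single cluster are dropped, since only
    lateral links between clusters are of interest here.
    """
    lateral = defaultdict(float)
    for a, b, weight in edges:
        ca, cb = membership[a], membership[b]
        if ca == cb:
            continue
        key = (ca, cb) if ca <= cb else (cb, ca)
        lateral[key] += weight
    return dict(lateral)


# Hypothetical skills, cluster labels and edge weights.
membership = {
    "cooking": "caregiving",
    "personal care": "caregiving",
    "food service": "retail",
    "customer service": "retail",
}
edges = [
    ("cooking", "personal care", 1.0),            # within caregiving, dropped
    ("cooking", "food service", 0.5),             # lateral link
    ("food service", "customer service", 1.0),    # within retail, dropped
    ("customer service", "personal care", 0.25),  # lateral link
]
cluster_graph = contract_to_clusters(edges, membership)
```

Here the two lateral links are merged into a single caregiving-retail edge with weight 0.75, which is the kind of aggregated relationship the simplified graph is meant to expose.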
Finally, it is not currently clear how to incorporate incoming information on job adverts. In future work, we would like to explore the advantages and disadvantages of running the analysis on the whole dataset, updated with new information, as opposed to generating the word embeddings and the taxonomy on temporal slices of the data. For further validation, to assess the extent to which the clusters are distinct, we will also collect text from Wikipedia articles on individual skills in each cluster. We will then use the articles to analyse the extent to which key terms are associated with certain skill clusters, using the mutual information method. Given the nested nature of the taxonomy, we expect clusters at deeper layers of the hierarchy to be more similar to one another and to refer to the same subject domains. For this reason, the proposed analysis is likely to be more appropriate if applied to the first and second layers only.
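As an illustration of the proposed validation, the mutual information between the presence of a term and membership of a cluster can be computed from simple document counts. The sketch below assumes that each article has already been reduced to a set of terms and tagged with the cluster of the skill it describes. The function name and the toy documents are hypothetical, and collecting and tokenising the Wikipedia text is not shown.

```python
import math
from collections import Counter


def mutual_information(docs, term, cluster):
    """Mutual information, in bits, between two binary events:
    'the document contains `term`' and 'the document belongs to `cluster`'.

    docs: list of (cluster_label, set_of_terms) pairs.
    """
    n = len(docs)
    # Joint counts over the (term present, in cluster) combinations observed.
    joint = Counter((term in terms, label == cluster) for label, terms in docs)
    mi = 0.0
    for (t, c), n_tc in joint.items():
        n_t = sum(v for (t2, _), v in joint.items() if t2 == t)  # term marginal
        n_c = sum(v for (_, c2), v in joint.items() if c2 == c)  # cluster marginal
        mi += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
    return mi


# Toy documents: (cluster label, terms found in the article).
docs = [
    ("health", {"patient", "ward", "data"}),
    ("health", {"patient", "clinic"}),
    ("it", {"server", "code", "data"}),
    ("it", {"code", "network"}),
]
mi_patient = mutual_information(docs, "patient", "health")
mi_data = mutual_information(docs, "data", "health")
```

A term that appears only in the articles of one cluster, such as "patient" in this toy corpus, carries the maximum of one bit of information about membership of that cluster, while a term spread evenly across clusters, such as "data", carries none.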

Conclusion

In this paper, we demonstrate how a taxonomy of employer skills, competence and knowledge requirements can be derived in a data-driven way. Using initial results of the proposed method, we show that the automatically generated skills taxonomy performs reasonably well. The taxonomy contains four hierarchical layers, which are identified by iteratively applying a modularity optimisation community detection algorithm. The quality of the clustering is enhanced by using a word embeddings approach to capture the strength of relationships between the skills as opposed to relying only on a frequency-based measure.
In addition to generating the taxonomy, we also extract useful metadata on each skill cluster, mapping relationships between skill clusters and salaries, occupations and job titles. We also trial a method for determining the level of specialisation of a skill by applying a Gaussian mixture model to the eigenvector centrality of skills.
We make a number of contributions to the existing literature. The proposed skills taxonomy represents the first transparent, non-expert-driven taxonomy that is independent of established frameworks such as ESCO or O*NET. The taxonomy is developed automatically and identifies meaningful patterns in employer requirements without any preconditions on how requirements should be grouped. Because of this, the taxonomy minimises the risk that interrelationships between skills are overlooked because they do not fit a traditional view of how skills should be organised. For example, agricultural skills are usually grouped in their own separate category, whereas in our taxonomy they reside in grounds maintenance, because many of the associated skills, such as using fertilisers, are similar to requirements in landscaping and gardening occupations. The taxonomy therefore provides a unique opportunity for validating expert-derived taxonomies.
The resulting skills taxonomy, the algorithm for developing it and the interactive data visualisation will all be released to the public. We believe that these resources will benefit a wide audience and allow policymakers, educators and individuals to better understand the skill sets needed by employers and the associated salaries and job titles. The taxonomy also provides a foundation for measuring the similarity of jobs and occupations based on skills, competences and knowledge. These insights could be applied directly to inform policy on reskilling and to identify job transition opportunities for occupations at risk of decline.
In future research, we plan to increase the robustness of the proposed methodology by adding a bootstrapping stage to ensure the stability of the resulting groups. We will also extend the current hierarchical representation of the taxonomy into an ontology, in which not just the direct but also the lateral relationships between clusters are captured. The resulting ontology can then be implemented as a publicly accessible graph database. We would also like to study the evolution of employer requirements over time using the methodology described in \citet*{Rosvall2010}.

Appendix