Introduction
In this work, we propose a data-driven taxonomy of skills mentioned by employers in online job adverts. We use the term “skills” to refer to all employer requirements including those relating to competences and knowledge. This is the first publicly available taxonomy we are aware of that does not rely on existing models and ontologies such as the Occupational Information Network (O*NET) and European Skills, Competences, Qualifications and Occupations (ESCO). It is also derived in an algorithmic way without expert elicitation, which means that it can be quickly updated to reflect changes in labour demand and provide timely insights to support labour market decision-making.
To generate the taxonomy, we employ machine learning methods, such as word embeddings and network community detection algorithms. We model skills as a graph with individual skills as vertices and their co-occurrences in job adverts as edges. The strength of the relationship between two skills is measured using both the frequency of their actual co-occurrence in the same advert and their shared context, as captured by a trained word embeddings model. Once skills are represented as a network, we hierarchically group them into clusters. To ensure the stability of the resulting clusters, we introduce bootstrapping and consensus clustering stages into the methodology. While we share initial results and describe the skill clusters, the main purpose of this paper is to outline the methodology for building the taxonomy.
The remainder of the paper is organised as follows. We start by describing the motivation for developing a new skills taxonomy and relevant research. In the Methods section, we provide a detailed description of the methodology used to generate the skills taxonomy, followed by an overview of the initial results. The limitations of the approach are reviewed in the Discussion section. We conclude with a summary of the contributions of the paper and suggestions for future research.
Motivation
A growing body of research predicts that the labour force will undergo substantial changes in the near future. Globalisation and technological developments, together with environmental and demographic trends, will reshape labour market structures. A recent study by Nesta and Pearson predicts with confidence that around 20% of occupations will shrink and 10% will grow, but for the remaining occupations the outlook is highly uncertain \citep*{bakhshi_future_2017}. The nature of work and the requirements for effective job performance are also likely to change, with new skills, competences and knowledge areas emerging while other requirements become redundant \citep*{forum2018}. In this context, policymakers, educators, businesses and individuals need timely information on both how the labour market is changing and what the potential pathways are for upgrading workers’ skills and transitioning workers out of occupations at risk of decline. To generate such actionable insights, however, we first need a framework for measuring the similarity of skill requirements and grouping them into meaningful taxonomic groups. In this paper, we propose a methodology for discovering such a taxonomy in a data-driven way using non-traditional, naturally occurring big data on the UK labour market.
Existing sources of information on occupational requirements have several limitations. Current publicly available models and taxonomies, such as O*NET \cite{national2010database} and ESCO \citep*{directorate-general_for_employment_social_affairs_and_inclusion_european_commission_esco_2017}, are expert-derived, which makes them expensive to update on an ongoing basis. As a result, there is a risk that information on skills might become outdated. Another limitation is that, in their current state, the taxonomies do not fully capture the relationships between skills, competences and knowledge requirements. In ESCO, occupations are explicitly linked to skills, but information on how the skills are connected to each other is only provided for transversal skills, that is, skills that are not specific to particular occupations but are relevant to a broad range of them.
Alongside taxonomies like O*NET and ESCO, researchers in the private sector have also developed skills taxonomies using vast amounts of data from online job adverts and job seeker resumes. However, these are not open to the public. In addition, none of the existing taxonomies have been developed using UK data, and they may therefore be less suitable for the analysis and measurement of skill requirements in UK occupations. To fill this gap in the existing skills taxonomies and frameworks, we propose an empirically driven taxonomy that is derived automatically from online job adverts. The proposed taxonomy offers a number of advantages over existing ones. First, it leverages naturally occurring data on millions of vacancies, which can be efficiently collected at scale and in real time. Second, using online job adverts allows us to capture skills required by employers directly; in the adverts, employers are free to describe what they are looking for in candidates and are not constrained to select the requirements from a limited number of skill groups. Third, we can enrich the taxonomy with other information available in job adverts, such as offered salary and job title. Last, but not least, we are committed to making our taxonomy and methodology open to the public, which we think is important if the data are used to inform public policy.
We believe that our data-driven skills taxonomy can directly contribute to more responsive and evidence-based policy making. Timely information on the demand for, and salaries associated with, particular skills, competences and knowledge areas can help policymakers prioritise investment in skill development. The proposed taxonomy can be combined with the occupational classification we developed in a previous paper \citep*{cathforthcoming} to develop a recommender engine for identifying occupations that require similar skills. These insights could then inform policies for reskilling and supporting job transitions from occupations at risk of decline.
Related work
The systematic analysis of occupational requirements has been a prominent area of labour market research for the past two decades. One of the most widely used models of occupational characteristics and worker attributes is O*NET, which was developed in the late 1990s with support from the US Department of Labor and the Employment and Training Administration \citep*{markowitsch_descriptors_2009}. For each occupation, O*NET provides detailed descriptions of worker characteristics and requirements, necessary levels of training, education and experience, job characteristics and occupational outlooks. O*NET is periodically updated using information from occupational experts and job holders as well as from job postings. The European Commission's ESCO represents another major public effort to systematise occupational information. ESCO is an ontology that maps relationships between skills, qualifications and occupations that are aligned with the International Standard Classification of Occupations (ISCO). Following several years of expert collaboration and public consultations, the first full version of ESCO was released in October 2017. Both O*NET and ESCO are open to the public.
With regard to data-driven skills taxonomies, recent research has been concentrated in the private sector. In one such study, \citet*{Zhao2015} used data from 100 million resumes on CareerBuilder to generate a taxonomy of skills. In processing the resumes, the authors disambiguated and normalised 46 million unique skill phrases. This resulted in a taxonomy of 50,000 skills. However, the content and structure of the taxonomy were not made public. To date, as far as we are aware, there are no purely data-driven skills taxonomies in the public domain.
Most researchers in this field use data-driven approaches to extend ESCO instead of developing new taxonomies. For example, \citet{sibarani_ontology-guided_2017} propose SARO, an ontology that connects information from job postings to ESCO skills. The authors tested an automated implementation of SARO to extract data from online vacancies for data scientists and performed a trend analysis for selected skills \citep*{dadzie_structuring_2017}. \citet{bosellilabour} also used relationships between ESCO occupations and skills to represent them in a bipartite knowledge graph, which enriches skills identified by experts for a given occupation with data from actual job adverts. It is likely that researchers chose ESCO as a foundation because of its rich multilingual vocabulary of 13,485 skills and the availability of links between skills and occupations.
A methodology for the data-driven analysis of online job adverts was offered by \citet{coppleforthcoming}. The authors pursued a similar approach to the one we propose and implemented a bottom-up classification of jobs using vacancy descriptions in UK online job adverts. However, in their work Turrell et al. focused on identifying naturally existing occupational clusters and grouped individual jobs rather than employer requirements.
Within the context of the existing literature, our work contributes to the field in several ways. First, we offer a non-expert-driven taxonomy of skills required by employers that is independent of ESCO and O*NET. Since the taxonomy is created automatically, it is also easier to reproduce and keep up to date. Unlike taxonomies developed by the private sector, our taxonomy and methodology will be released to the public. Our proposed taxonomy also captures links between skills, aggregated job titles, and the salaries mentioned in the millions of UK job adverts used in this analysis.
Data
The online job advert dataset used in this paper was provided by Burning Glass Technologies, a labour market analytics company. Burning Glass collects data on active job postings from thousands of web pages on a daily basis \cite{burning_glass_technologies_markets_2017}. For each job posting, in addition to extracting job title, salary, education and experience requirements, Burning Glass identifies keywords from free-text job descriptions. The full job descriptions are not available. We refer to the keywords as skills, which include skills, personal competences and knowledge required by employers. To develop the initial skills taxonomy we use data on over 41 million adverts collected over a six-year period from January 2012 to December 2017. It is important to note that many adverts in our dataset have missing information: only 61% of adverts contain data on offered salary, and substantially fewer mention education (19% of adverts) and experience requirements (13% of adverts).
Methods
We use two approaches to measure the relationships between skills mentioned in Burning Glass job adverts (Figure \ref{992146}). The first approach is based on the pairwise frequency of two skills appearing in the same job advert. The second approach is based on the distributed representation of skills. We generate the vector representations of skills by training a word2vec model, which learns the extent to which skills occur in the same context (i.e. together with other skills).
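The first measure can be illustrated with a minimal sketch, assuming each advert is available as a list of skill keywords. The function name `cooccurrence_counts` and the toy adverts are hypothetical and serve only as an illustration; they are not the implementation used in this work. The second measure would additionally require a trained embedding model (for example, word2vec trained with each advert's skill list treated as a sentence) and is not shown here.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(adverts):
    """Count the number of adverts in which each unordered skill pair appears."""
    counts = Counter()
    for skills in adverts:
        # De-duplicate and sort so that every pair has one canonical ordering.
        unique_skills = sorted(set(skills))
        counts.update(combinations(unique_skills, 2))
    return counts


# Toy adverts (hypothetical), each represented by its list of skill keywords.
adverts = [
    ["python", "sql", "statistics"],
    ["python", "sql"],
    ["communication", "sql"],
]

pair_counts = cooccurrence_counts(adverts)
for pair, n in sorted(pair_counts.items()):
    print(pair, n)
```

Sorting the skills within each advert means that a pair such as (python, sql) is always stored under the same key, regardless of the order in which the keywords were extracted.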
As a next step, we model the skills as a graph, where vertices represent individual skills. Two vertices are joined by an edge if the corresponding skills are mentioned in the same advert. Each edge has attributes that describe the strength of the relationship, such as frequency (total number of pairwise skill mentions) and cosine similarity (similarity of the context in which the two skills occur across all adverts).
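The graph construction can be sketched as follows, under stated assumptions: adverts are lists of skill keywords, and a vector is already available for every skill. The two-dimensional vectors below are made up for illustration and stand in for embeddings from a trained model; the names `build_skill_graph` and `cosine_similarity` are likewise hypothetical. The graph is stored as a plain dictionary keyed by skill pair rather than in a dedicated graph library.

```python
import math
from collections import Counter
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_skill_graph(adverts, vectors):
    """Return {(skill_a, skill_b): {"frequency": int, "cosine": float}}.

    An edge exists only for skill pairs that co-occur in at least one advert.
    """
    frequency = Counter()
    for skills in adverts:
        frequency.update(combinations(sorted(set(skills)), 2))

    graph = {}
    for (a, b), n in frequency.items():
        graph[(a, b)] = {
            "frequency": n,
            "cosine": cosine_similarity(vectors[a], vectors[b]),
        }
    return graph


# Toy inputs (hypothetical): two adverts and hand-made 2-d skill vectors.
adverts = [["python", "sql"], ["python", "sql", "excel"]]
vectors = {"python": [1.0, 0.0], "sql": [1.0, 1.0], "excel": [0.0, 1.0]}

graph = build_skill_graph(adverts, vectors)
for edge, attributes in sorted(graph.items()):
    print(edge, attributes)
```

Keeping both attributes on each edge leaves open how they are later combined into a single weight for clustering; this sketch makes no assumption about that choice.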