Figure \ref{768538} in Appendix 2 summarises the steps taken to build the skills graph and prepare it for the community detection stage.
Detecting Communities
Prior to applying community detection algorithms, we reviewed the filtered graph and found that 38% of the edges have a cosine similarity of 0 or lower. This may be because some pairwise links reflect spurious co-occurrences: even if two skills appear together in a few adverts, they are not mentioned frequently enough in the same context for their embeddings to be similar. With this in mind, we remove all edges with a cosine similarity of 0 or lower.
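As an illustration, this pruning step could be implemented as follows, assuming a networkx graph whose edges carry a cosine_similarity attribute (the attribute and function names are our assumptions, not part of the original pipeline):

\begin{verbatim}
import networkx as nx

def prune_weak_edges(G: nx.Graph, attr: str = "cosine_similarity") -> nx.Graph:
    # Drop edges whose cosine similarity is zero or negative.
    weak = [(u, v) for u, v, sim in G.edges(data=attr) if sim <= 0]
    H = G.copy()
    H.remove_edges_from(weak)
    return H
\end{verbatim}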
We considered three different approaches to detecting communities in a graph: a method based on statistical inference (the Stochastic Block Model, SBM), a method based on optimising a clustering quality measure, modularity (Louvain), and a method based on exploring dynamical processes on the graph (Infomap).
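For orientation, a minimal sketch of running two of these algorithms on the same weighted graph, assuming the python-louvain and infomap Python packages (the SBM would typically be fitted separately, e.g. with graph-tool):

\begin{verbatim}
import networkx as nx
import community as community_louvain  # python-louvain package
from infomap import Infomap

def louvain_partition(G: nx.Graph) -> dict:
    # Louvain greedily maximises modularity, using edge weights.
    return community_louvain.best_partition(G, weight="weight")

def infomap_partition(G: nx.Graph) -> dict:
    # Infomap minimises the map equation for random walks on the graph.
    # Assumes integer node labels, as required by the infomap API.
    im = Infomap("--two-level --silent")
    for u, v, w in G.edges(data="weight", default=1.0):
        im.add_link(u, v, w)
    im.run()
    return im.get_modules()  # {node: module id}
\end{verbatim}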
A brief overview of the algorithms is provided in Appendix 1. An initial comparative analysis of the three methods shows that they are all valid ways to partition the network. The non-hierarchical SBM identifies 283 communities, which appear to capture meaningful relationships between skills. However, the hierarchical implementation of the SBM turns out not to be suitable: it groups clusters into higher levels based on the similarity of their connection patterns (i.e. a similar number of edges) while ignoring the subject domains. For example, computing- and medicine-related skills are placed in the same higher-level group. Because only the non-hierarchical SBM clusters can be used for our purposes, we need a separate way to aggregate the smaller clusters into broader groups.
We explore hierarchical clustering using representative word embedding vectors for each cluster. However, determining an appropriate way to generate the representative vectors and selecting cutoffs for splitting the hierarchy into layers add considerable subjectivity to the methodology. For this reason, we do not use the SBM in this analysis.
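For concreteness, the kind of aggregation we explored can be sketched with SciPy's agglomerative clustering on hypothetical cluster centroid vectors; the averaging scheme and the distance cutoff are exactly the subjective choices noted above:

\begin{verbatim}
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def aggregate_clusters(centroids: np.ndarray, cutoff: float) -> np.ndarray:
    # centroids: (n_clusters, embedding_dim), e.g. mean skill vectors.
    # cutoff: cophenetic distance threshold -- a subjective choice.
    Z = linkage(centroids, method="average", metric="cosine")
    return fcluster(Z, t=cutoff, criterion="distance")
\end{verbatim}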
The broad groups identified by the Louvain algorithm are similar to those found by Infomap. However, Infomap detects a larger number of small, specialised employer requirement clusters; the algorithm appears to peel off more peripheral clusters at each iteration. Both algorithms use the cosine similarity property of the edges as weights. The Normalised Mutual Information (NMI) score between the top level of Infomap and the second level of Louvain (chosen because they contain a similar number of groups) is relatively high at 0.67, indicating similar clusterings (NMI reaches its maximum of 1 when two partitions are identical). Since Infomap identifies a few large clusters and many very small clusters at each layer of the taxonomy, we use Louvain, as it produces more evenly sized clusters.
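The NMI comparison can be reproduced with scikit-learn, assuming two membership dictionaries keyed by the same skill nodes (the names here are illustrative):

\begin{verbatim}
from sklearn.metrics import normalized_mutual_info_score

def partition_nmi(partition_a: dict, partition_b: dict) -> float:
    # NMI between two partitions given as {node: community id} dicts.
    nodes = sorted(partition_a)  # common node ordering
    labels_a = [partition_a[n] for n in nodes]
    labels_b = [partition_b[n] for n in nodes]
    return normalized_mutual_info_score(labels_a, labels_b)
\end{verbatim}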
Increasing Robustness of Clustering
We initially used the Louvain algorithm to detect the communities with the highest corresponding modularity and iteratively split these communities whenever the new partition had a positive modularity value. However, some of the identified clusters, especially at the lower layers of the hierarchy, appeared to be driven by stochastic artefacts rather than meaningful differences. The fact that splitting a cluster improves the graph's modularity score does not mean that the resulting lower-level clusters are well separated. The challenge is to identify an objective criterion for determining confidence in a cluster partition. To ensure the stability of the resulting groups, we introduce bootstrapping and consensus clustering stages into the methodology.

Bootstrapping refers to random sampling with replacement. The technique is used in machine learning to increase the robustness of models, since it allows for a better exploration of biases and variation in the underlying dataset. Typically, bootstrapping involves sampling individual observations, which in our case would be skills. However, to preserve as many actual skills in the graph as possible, we follow the approach outlined by \citet*{Rosvall2010} and sample graph edges instead of nodes. We also assign to each edge a probability of being selected that is proportional to the edge's cosine similarity. As a result, we are more likely to select edges between strongly related nodes and less likely to sample spurious edges.

As shown in Figure \ref{712470}, once a bootstrapped sample is formed, we build a new graph from the sampled edges and apply the Louvain algorithm to detect the communities with the highest modularity. We repeat these steps 500 times, record node membership in the resulting clusters and reconcile the runs using a cluster ensemble technique. The cluster ensemble method produces a single consensus partition and is performed using the Hyper-Graph Partitioning Algorithm (HGPA) and the Meta-CLustering Algorithm (MCLA) \citep*{strehl2002cluster}, as implemented in the Cluster Ensembles Python package \citep*{giecold2016robust}. The quality of the clustering is determined by the level of agreement, measured as the weighted average NMI between all the runs and the resulting consensus cluster membership; we select the solution with the highest level of agreement. Once the top-layer membership is finalised, the approach is repeated for each cluster at layers 2 and 3.
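A condensed sketch of the bootstrap-and-cluster loop, assuming a networkx graph with a cosine_similarity edge attribute and the python-louvain package; the consensus step itself is delegated to the Cluster Ensembles package, so only the bootstrap runs and the agreement score are shown here:

\begin{verbatim}
import numpy as np
import networkx as nx
import community as community_louvain  # python-louvain package
from sklearn.metrics import normalized_mutual_info_score

def bootstrap_partitions(G, n_runs=500, attr="cosine_similarity", seed=0):
    # Edge bootstrap in the spirit of Rosvall & Bergstrom (2010):
    # sample edges with replacement, weighted by cosine similarity.
    rng = np.random.default_rng(seed)
    edges = list(G.edges(data=attr))
    sims = np.array([sim for _, _, sim in edges])
    probs = sims / sims.sum()  # selection probability ~ similarity
    nodes = sorted(G.nodes())
    runs = []
    for _ in range(n_runs):
        idx = rng.choice(len(edges), size=len(edges), replace=True, p=probs)
        H = nx.Graph()
        H.add_nodes_from(nodes)  # keep all skills in every bootstrap graph
        # Duplicate draws collapse into a single edge in a simple Graph.
        H.add_weighted_edges_from(edges[i] for i in idx)
        partition = community_louvain.best_partition(H, weight="weight")
        runs.append([partition[n] for n in nodes])
    return np.array(runs)  # shape: (n_runs, n_nodes)

def agreement(runs: np.ndarray, consensus: np.ndarray) -> float:
    # Average NMI between each bootstrap run and the consensus partition
    # (unweighted here for simplicity; the weighted average is an extension).
    return float(np.mean(
        [normalized_mutual_info_score(r, consensus) for r in runs]))
\end{verbatim}

The consensus partition itself would be produced from the runs array by the Cluster Ensembles package's HGPA and MCLA routines; agreement then scores each candidate consensus so that the solution with the highest score can be kept.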