The Louvain multilevel community detection algorithm identifies communities in the network that maximise the quality of the partitioning \citep*{blondel2008fast}. The established metric used to measure the quality of communities is modularity. Modularity varies between [-1,1] and refers to the concentration of edges within communities as opposed to the distribution of edges that would be observed in a random graph with the same vertex degree distribution. The Louvain algorithm is hierarchical, it starts with individual vertices belonging to their own communities and then iteratively groups the vertices vertices in such a way as to increase the overall modularity score. This algorithm is intuitive and one of the most commonly used for identifying network communities. Louvain was found to be the second best-performing method in the comparative analysis of algorithms conducted by Lancichinetti and Fortunato \citep*{fortunato2016community}. The criticism of the modularity optimisation algorithms focuses on limitations of these methods in identifying an appropriate level of resolution. The algorithms may split large communities or merge smaller ones. They may also underperform as compared to other methods if the true number of clusters is not known.
Infomap is a dynamics based community detection algorithm, which identifies communities in the network by measuring the flow of information through the network using random walks \citep*{rosvall2008maps}. The rationale behind the method is that due to the higher density of edges within communities, the random walkers will be trapped and spend a longer time inside communities. Infomap further improved the early implementations of the dynamics based algorithms by using information theory to define the most parsimonious way to describe graph community structure. Infomap is especially effective when applied to directed networks, where it can identify communities that would not be detected by modularity optimisation algorithms.
An initial comparative analysis of the three methods shows that they are all valid ways to partition the network. The non-hierarchical SBM identifies 283 communities, which appear to capture meaningful relationships between skills. However, the hierarchical implementation of SBM turns out not to be suitable, because the clusters are grouped into higher levels based on the similarity of their connections (i.e. similar number of edges) but ignore the subject domains. For example, computing- and medicine-related skills are placed in the same higher level group. The fact that only the non-hierarchical SBM clusters can be used for our purposes means that we need to identify a way to aggregate smaller clusters into broader groups. We explore hierarchical clustering using the clusters’ representative word embedding vectors. However, determining the appropriate way to generate representative vectors and selecting cutoffs for splitting the hierarchy into layers adds considerable subjectivity to the methodology. This is why we do not use the SBM in this analysis. The broad groups identified by Louvain algorithm are similar to those found by Infomap. However, Infomap detects a larger number of small, specialised employer requirement clusters; the algorithm appears to be peeling off more peripheral clusters at each iteration. For both algorithms, the cosine similarity property of edges is used. The Normalised Mutual Information (NMI) score for the top level of Infomap and the 2nd level of Louvain (chosen because they have a similar number of groups) is relatively high at 0.67, which indicates similar clustering (the maximum value of NMI is 1 for perfect correlation). Since at each layer of the taxonomy, Infomap identifies few large clusters and many very small clusters, we decide to use Louvain as it produces more evenly distributed clusters.
Increasing robustness of clustering
In the initial analysis, we used Louvain algorithm to detect communities with the highest corresponding modularity and continue to iteratively split these communities if the new partition had a positive modularity value. However, we found that this approach alone had a limitation of declining confidence in cluster membership at the tips of the tree. The fact that splitting a cluster improves modularity score doesn’t mean that the resulting lower level clusters are well separated. It is likely that with increasing depth, the clusters will be fragmented and driven by stochastic artefacts rather than meaningful differences. The challenge is to identify an objective criterion for determining confidence in cluster partitioning. We addressed this limitation by using the cluster ensemble technique, where a consensus partition is identified after multiple clustering iterations. To further reduce spurious variation in the underlying data we followed an approach outlined by \citet*{Rosvall2010} where the authors bootstrapped edges from a graph. When generating bootstrapped samples of graph edges we use probability, which is proportional to the strength of the edge. As a result we are more likely to see edges between nodes that are strongly related and less likely to sample an edge with only weak strength.
As summarised in Figure \ref{712470}, we sample the edges and then run Louvain algorithm 500 times and record node cluster membership. With the help of Cluster Ensemble package we aggregate the results using Hyper-graph partitioning (HGPA) and Meta-clustering algorithms (MCLA) \citep*{strehl2002cluster} and selecting the solution with the highest level of agreement between iterations. The level of agreement is measured as a weighted average NMI between all the runs and the resulting consensus cluster membership. Once the top layer membership is finalised, the approach is repeated for each cluster at layer 2 and 3.