Clustering of single cells
The CD34+ population contains hematopoietic stem and progenitor cells, which are expected to be transcriptionally heterogeneous (Andrews et al., 1986). We therefore used clustering analysis to reveal the different transcriptomic states within the cord blood CD34+ pool. We used the biologically relevant ICs from dimensionality reduction as input and clustered cells with a modularity-based method on shared nearest neighbor (SNN) graphs (Blondel et al., 2008; Xu and Su, 2015). We defined the similarity of two cells as the overlap of their neighborhoods (the proportion of shared neighbors), where neighborhoods were built on Euclidean distances computed from the 19 input ICs/meta-genes. An SNN graph was then constructed using Jaccard similarity. In this SNN graph, groups of cells with largely overlapping neighborhoods represent interconnected ‘communities’ in a network, and therefore exhibit similar transcriptional patterns (Levine et al., 2015; Xu and Su, 2015). To partition the graph into a set of clusters, we used modularity optimization to find the best assignment for each cell over multiple iterations, where modularity (Q, shown below) evaluates both inter-cluster and intra-cluster connectivity on a graph (Blondel et al., 2008).
\begin{equation} Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_{i}k_{j}}{2m} \right] \delta(c_{i}, c_{j}) \nonumber \end{equation}
Specifically, \(A_{ij}\) is the edge weight between nodes \(i\) and \(j\); \(k_{i}=\sum_{j}A_{ij}\) is the sum of the weights of all edges attached to node \(i\); \(m=\frac{1}{2}\sum_{i,j}A_{ij}\); \(c_{i}\) is the cluster to which node \(i\) is assigned; and \(\delta(c_{i},c_{j})=1\) if \(c_{i}=c_{j}\) and 0 otherwise. By setting k (the number of nearest neighbors used to define a neighborhood) to 20, the resolution to 1.0, and using 200 random starts, we obtained 21 single-cell clusters with the function FindClusters() in the Seurat package, which is implemented from previously published modularity optimization software (Waltman and van Eck, 2013).
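To make the two quantities above concrete, the following is a minimal Python sketch of a Jaccard-weighted SNN graph and of the modularity Q of a given partition. It is an illustration only and does not reproduce the Seurat implementation: the function names `jaccard_snn` and `modularity`, the toy data, and the choice to count each cell as a member of its own neighborhood are assumptions made for this example, and the sketch only evaluates Q for a fixed assignment rather than optimizing it.

```python
import numpy as np


def jaccard_snn(X, k):
    """Return a symmetric SNN adjacency matrix for the rows of X.

    Edge weights are the Jaccard similarity of the two cells' k-nearest-
    neighbour sets (hypothetical helper; each cell is included in its own set).
    """
    n = X.shape[0]
    # Pairwise Euclidean distances between all cells.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Indices of the k closest cells for every cell.
    knn = np.argsort(dist, axis=1)[:, :k]
    # Boolean membership matrix: member[i, j] is True if j is in N(i).
    member = np.zeros((n, n), dtype=bool)
    member[np.arange(n)[:, None], knn] = True
    m = member.astype(int)
    shared = m @ m.T            # |N(i) ∩ N(j)|
    union = 2 * k - shared      # |N(i) ∪ N(j)|, since every set has size k
    A = shared / union
    np.fill_diagonal(A, 0.0)    # no self-edges
    return A


def modularity(A, labels):
    """Evaluate Q = (1/2m) * sum_ij [A_ij - k_i k_j / 2m] * delta(c_i, c_j)."""
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    degree = A.sum(axis=1)                        # k_i
    two_m = A.sum()                               # 2m
    same = labels[:, None] == labels[None, :]     # delta(c_i, c_j)
    return float(((A - np.outer(degree, degree) / two_m) * same).sum() / two_m)


# Toy data (assumption): two well-separated groups of 10 "cells" in 3 dimensions.
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0.0, 1.0, (10, 3)), rng.normal(10.0, 1.0, (10, 3))])
A_toy = jaccard_snn(X_toy, k=5)
true_labels = np.array([0] * 10 + [1] * 10)
q_true = modularity(A_toy, true_labels)
q_single = modularity(A_toy, np.zeros(20, dtype=int))
print(f"Q (two groups) = {q_true:.3f}, Q (one cluster) = {q_single:.3f}")
```

In this sketch, placing all cells in a single cluster gives Q = 0 by construction, whereas a partition that follows the two groups gives a clearly positive Q. This is the sense in which modularity optimization favors assignments with dense intra-cluster and sparse inter-cluster connectivity.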
We note that this clustering imposes a discrete framework on the data. While a set of clusters can be useful for interpreting single-cell data, our use of clustering algorithms does not preclude the possibility that the underlying data fall along a continuous manifold. Indeed, in downstream analyses, we further subdivide the clusters to better represent a more continuous landscape of cellular differentiation. However, we find this clustering framework valuable for interpreting and evaluating our data, specifically for comparing to previously generated microarray datasets and for comparing cellular densities across different cord blood units (Figure 1 C-D). Additionally, this clustering enables us to remove rare contaminant populations of differentiated cells that passed through the CD34 column. For example, cells in cluster 15 highly expressed T cell genes such as CD6, CD3D, CD247 and CD2. Cluster 19 was enriched in genes unique to B cells (MS4A1, CD83, CD22 and CD79A) while lacking MME (CD10) expression, indicating that cells in cluster 19 were committed to B cell differentiation (Figure S1B). Overall, we kept ten clusters that represented early progenitors of megakaryocyte, erythrocyte, lymphoid, and myeloid cells (20,072 of 22,537 cells) for downstream analysis.