Clustering of single cells
The CD34+ population contains hematopoietic stem and
progenitor cells, which are expected to be transcriptionally
heterogeneous (Andrews et al., 1986), and therefore we used clustering
analysis to reveal the different transcriptomic states within the cord
blood CD34+ pool. We utilized biologically relevant
ICs from dimensionality reduction as input for clustering, which we
achieved by leveraging a modularity-based method on shared nearest
neighbor (SNN) graphs (Blondel et al., 2008; Xu and Su, 2015). We
defined the similarity of cells based on the overlap of neighborhoods
(proportion of shared neighbors), which were built on Euclidean
distances from the 19 input ICs/meta-genes. An SNN graph was then
constructed using Jaccard similarity. In this SNN graph, groups of cells
with largely overlapping neighborhoods represent interconnected
‘communities’ in a network, and therefore exhibit similar
transcriptional patterns (Levine et al., 2015; Xu and Su, 2015). To
partition the graph into a set of clusters, we utilized modularity
optimization to find the best assignment for each cell through multiple
iterations, where modularity (Q, shown below) evaluates both
inter-cluster and intra-cluster connectivity on a graph (Blondel et
al., 2008).
\begin{equation}
Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_{i}k_{j}}{2m} \right] \delta(c_{i}, c_{j}) \nonumber
\end{equation}
Specifically, \(A_{ij}\) refers to the edge weight between nodes
\(i\) and \(j\), \(k_{i}\) is the sum of the weights of all edges attached to node
\(i\): \(k_{i}=\sum_{j}A_{ij}\),
\(m=\frac{1}{2}\sum_{i,j}A_{ij}\), \(c_{i}\) is the cluster to which node \(i\) is assigned, and
\(\delta(c_{i}, c_{j})=1\) if \(c_{i}=c_{j}\) and 0
otherwise. By setting k (the number of nearest neighbors used to define a
neighborhood) = 20 and resolution = 1.0, and using 200 random starts, we obtained
21 single cell clusters using the function FindClusters() in the Seurat
package, which builds on previously published modularity optimization
software (Waltman and van Eck, 2013).
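As an illustration of the two steps described above, the following Python sketch builds a Jaccard-weighted SNN graph from Euclidean nearest neighbors and evaluates the modularity Q of a given partition. It is a minimal example under stated assumptions, not the Seurat implementation: the function names (knn_indices, snn_jaccard, modularity) are hypothetical, each cell is assumed to belong to its own neighborhood, dense matrices are used (suitable only for small inputs), the input is synthetic, and the code only scores a partition rather than performing the iterative modularity optimization or applying a resolution parameter.

```python
import numpy as np


def knn_indices(X, k):
    """Indices of the k nearest rows of X (Euclidean distance) for each row.

    Assumption of this sketch: each cell is a member of its own
    neighborhood, so column 0 is always the cell itself.
    """
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)  # squared distances
    np.fill_diagonal(d2, -np.inf)  # force each cell to rank first in its own list
    return np.argsort(d2, axis=1)[:, :k]


def snn_jaccard(X, k):
    """Dense SNN adjacency matrix weighted by Jaccard similarity of neighborhoods."""
    n = X.shape[0]
    nn = knn_indices(X, k)
    member = np.zeros((n, n))
    member[np.arange(n)[:, None], nn] = 1.0  # member[i, j] = 1 if j is in N(i)
    shared = member @ member.T               # |N(i) intersect N(j)|
    A = shared / (2 * k - shared)            # |N(i) union N(j)| = 2k - shared
    np.fill_diagonal(A, 0.0)                 # no self-loops
    return A


def modularity(A, labels):
    """Q = 1/(2m) * sum_ij [A_ij - k_i k_j / (2m)] * delta(c_i, c_j)."""
    labels = np.asarray(labels)
    two_m = A.sum()                                # 2m = sum_ij A_ij
    degree = A.sum(axis=1)                         # k_i = sum_j A_ij
    same = labels[:, None] == labels[None, :]      # delta(c_i, c_j)
    return float(np.sum((A - np.outer(degree, degree) / two_m) * same) / two_m)


# Synthetic stand-in for a cells-by-components matrix (not real data):
# two well-separated groups of 30 points in 19 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (30, 19)), rng.normal(50.0, 1.0, (30, 19))])
A = snn_jaccard(X, k=10)
labels = np.repeat([0, 1], 30)
print("Q, two groups:", modularity(A, labels))
print("Q, one group: ", modularity(A, np.zeros(60, dtype=int)))
```

In this toy setting, a partition that matches the two groups scores a clearly positive Q, whereas assigning every cell to a single cluster gives Q of zero, which is why modularity optimization favors partitions that separate weakly connected communities.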
We note this clustering imposes a discrete framework on the data. While
a set of clusters can be useful for interpretation of single cell data,
our use of clustering algorithms does not preclude the potential for
the underlying data to fall along a continuous manifold. Indeed, in
downstream analyses, we further subdivide the clusters to better
represent a more continuous landscape of cellular differentiation.
However, we find this clustering framework to be valuable for
interpreting and evaluating our data, specifically for comparison with
previously generated microarray datasets and for comparing cellular
densities across different cord blood units (Figure 1C-D).
Additionally, this clustering enables us to remove rare contaminant
populations of differentiated cells that passed through the CD34 column.
For example, cells in cluster 15 highly expressed T cell genes
such as CD6, CD3D, CD247 and CD2. Cluster 19
was enriched in genes specific to B cells (MS4A1, CD83,
CD22 and CD79A) while lacking MME (CD10)
expression, indicating that cells in cluster 19 were committed to B cell
differentiation (Figure S1B). Overall, we kept ten clusters that
represented early progenitors of megakaryocyte, erythrocyte, lymphoid,
and myeloid cells (20,072 of 22,537 cells) for downstream analysis.