Micro-clustering
Our Drop-seq dataset should sample both cells in metastable progenitor states and cells that are transiently progressing through a differentiation hierarchy. This suggests that we can reconstruct developmental histories from snapshots of many single cells, which has been the underlying logic for many trajectory-building algorithms, such as Monocle, Wanderlust and Wishbone (Bendall et al., 2014; Setty et al., 2016; Trapnell et al., 2014). Importantly, this strategy requires that we sample the process deeply enough to capture both abundant and rare transition states, and that our sampling procedure does not exclude particular states through prior enrichment. The scale of our Drop-seq datasets, combined with the relatively unbiased strategy for sample preparation, strongly supports these assumptions for our analyses.
While our clustering analyses are valuable for interpreting the major transcriptional states in a complex system, they impose a discrete framework on a transitioning cellular population. Moreover, the precise number of clusters returned by any algorithm depends on the granularity parameters used. We therefore reasoned that even within the clusters we defined in Figure 1, we should observe developmental heterogeneity, with each cluster consisting of both ‘early’ and ‘late’ cells.
To address this, we developed a strategy to ‘micro-cluster’ our data, further subdividing our clusters into small groups of 20 cells that not only mapped to the same cluster identity but were also in a similar developmental state. Within each cluster, we ran a diffusion map procedure (Coifman and Lafon, 2006) on single cells, using the Euclidean distance defined by the expression of all mRNA markers. For each cluster, we found that the eigenvalues dropped off quickly after the first two diffusion map components (DMCs), and that cells exhibited a unidirectional path in these components, consistent with developmental heterogeneity. We fit a principal curve on DMCs 1 and 2 using the principal.curve() function in the R princurve package with default parameters (Banfield and Raftery, 1992; Hastie and Stuetzle, 1989). The progression of each cell was defined by projecting it onto the principal curve, and we separated each cluster into small groups of 20 cells ordered by principal curve projection using the cut2() function in R. In this way, we partitioned our original dataset into 997 ‘micro-clusters’. For each micro-cluster, we took the mean of the normalized expression for all detected genes, forming a new expression matrix of 23,661 genes and 997 micro-clusters and dramatically reducing the sampling noise associated with single-cell data.
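The grouping and averaging step can be illustrated with a minimal Python sketch. It is not the code used for the analysis: it assumes that each cell's position along the principal curve has already been computed, and it does not reproduce the diffusion map or the principal curve fit. The function name micro_cluster_means, its arguments, and the toy data are hypothetical, and np.array_split is used only as a stand-in for cut2(), so group boundaries may differ from those produced in R.

```python
import numpy as np

def micro_cluster_means(expr, progression, group_size=20):
    """Group cells by progression and average expression within each group.

    expr        : (cells x genes) array of normalized expression (assumed input)
    progression : per-cell position along the principal curve (assumed input)
    group_size  : target number of cells per micro-cluster
    """
    expr = np.asarray(expr, dtype=float)
    progression = np.asarray(progression, dtype=float)

    # Order cells from 'early' to 'late' along the curve.
    order = np.argsort(progression, kind="stable")

    # Split the ordered cells into consecutive groups of roughly group_size cells.
    n_groups = max(1, int(round(len(order) / group_size)))
    groups = np.array_split(order, n_groups)

    # One averaged expression profile per micro-cluster.
    means = np.vstack([expr[idx].mean(axis=0) for idx in groups])
    return groups, means

# Toy data: 100 cells, 5 genes, random progression values.
rng = np.random.default_rng(0)
toy_expr = rng.poisson(1.0, size=(100, 5)).astype(float)
toy_progression = rng.random(100)
groups, means = micro_cluster_means(toy_expr, toy_progression)
```

In this sketch, each row of means corresponds to one micro-cluster, and rows are ordered from the earliest to the latest cells within the parent cluster.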
In principle, averaging across single cells can blend together signals from heterogeneous sub-populations. While we attempted to avoid this by only averaging cells in very similar transcriptional states, we wanted to ensure that our micro-clusters truly represented ‘homogeneous’ populations. To do this, we tested whether drop-out rates for genes within a micro-cluster were consistent with pure sampling noise. For each gene in each micro-cluster, we calculated the expected Poisson drop-out rate (percentage of cells with zero detected molecules) based on its mean expression, and compared this to the observed drop-out rate (Figure S2A). Overall, we observed very high correlations (0.98 – 0.99) between expected and observed drop-out rates, and this held across all micro-clusters (Figure S2B). This indicates that heterogeneity within a micro-cluster is driven primarily by sparse sampling rather than by extensive biological heterogeneity, enabling us to pool information across cells in the same micro-cluster.
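The drop-out comparison can be sketched in Python as follows. This is an illustration under stated assumptions rather than the original analysis code: it assumes that the mean expression refers to the mean molecule count per cell, so that the expected Poisson probability of observing zero molecules is exp(-mean). The function name dropout_comparison and the simulated counts are hypothetical, and rates are reported as fractions rather than percentages.

```python
import numpy as np

def dropout_comparison(counts):
    """Expected vs. observed drop-out rate per gene within one micro-cluster.

    counts : (cells x genes) array of molecule counts (assumed input)
    """
    counts = np.asarray(counts)

    # Under a Poisson model with mean m, P(count == 0) = exp(-m).
    mean_per_gene = counts.mean(axis=0)
    expected = np.exp(-mean_per_gene)

    # Observed fraction of cells with zero detected molecules.
    observed = (counts == 0).mean(axis=0)
    return expected, observed

# Toy micro-cluster: 20 cells, 200 genes, counts drawn from a pure Poisson model.
rng = np.random.default_rng(1)
gene_means = np.linspace(0.05, 3.0, 200)
toy_counts = rng.poisson(gene_means, size=(20, 200))

expected, observed = dropout_comparison(toy_counts)
# Pearson correlation between expected and observed drop-out rates across genes.
r = np.corrcoef(expected, observed)[0, 1]
```

Because the toy counts are drawn from a Poisson model, the expected and observed rates should agree closely here; a micro-cluster that mixed distinct sub-populations would be expected to show more zeros than the Poisson model predicts for some genes.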