Micro-clustering
Our Drop-seq dataset should sample both cells in metastable progenitor
states and cells that are transiently progressing through a
differentiation hierarchy. Indeed, this logic suggests that we can
reconstruct developmental histories from cellular snapshots of many
single cells. This has been the underlying logic for many trajectory
building algorithms, such as Monocle, Wanderlust and Wishbone (Bendall
et al., 2014; Setty et al., 2016; Trapnell et al., 2014). Importantly,
the assumptions underlying this strategy require that we sufficiently
sample the process to capture both abundant and rare transition states,
and that our sampling procedure does not exclude particular states based
on prior enrichment. The scale of our Drop-seq datasets, combined with
the relatively unbiased strategy for sample preparation, strongly
supports these assumptions for our analyses.
While our clustering analyses are valuable for interpreting the major
transcriptional states in a complex system, they impose a discrete
framework on a transitioning cellular population. Moreover, the precise
number of clusters for any algorithm is dependent on the granularity
parameters used. We therefore reasoned that even within the clusters we
defined in Figure 1, we should observe developmental heterogeneity, with
each cluster consisting of both ‘early’ and ‘late’ cells.
To address this, we developed a strategy to ‘micro-cluster’ our data,
further subdividing our clusters into small groups of 20 cells that not
only mapped to the same cluster identity, but also were in a similar
developmental state. Therefore, within each cluster, we ran a diffusion
map procedure (Coifman and Lafon, 2006) on single cells, using the
Euclidean distance computed from the expression of all mRNA markers. For
each cluster, we found that the eigenvalues dropped off quickly after the
first two diffusion map components (DMCs), and that the cells followed a
unidirectional path in these components, consistent with developmental
heterogeneity. We fit a principal curve on DMCs 1 and 2 using the
principal.curve() function in the R princurve package with default
parameters (Banfield and Raftery, 1992; Hastie and Stuetzle, 1989). The
progression of each cell was defined by projecting cells onto the
principal curve, and we separated each cluster into small groups of 20 cells
ordered by principal curve projection using the cut2() function in R. In
this way, we partitioned our original dataset into 997 ‘micro-clusters’.
For each micro-cluster, we took the mean of the normalized expression for
all detected genes, forming a new expression matrix of 23,661 genes and
997 micro-clusters, dramatically reducing the sampling noise associated
with single-cell data.
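The grouping and averaging steps can be illustrated with a minimal Python sketch. This is not the R code used for the analysis: the function names are hypothetical, the progression scores are assumed to have been computed already (for example, as principal curve projections), and the simple chunking below only approximates the behavior of cut2(), leaving any remainder in a smaller final group.

```python
def micro_cluster_groups(progression, size=20):
    """Order cells by progression score and split them into consecutive groups.

    progression: one score per cell (assumed to be precomputed).
    Returns a list of groups, each a list of cell indices. Every group holds
    `size` cells except possibly the last, which holds the remainder.
    """
    order = sorted(range(len(progression)), key=lambda i: progression[i])
    return [order[i:i + size] for i in range(0, len(order), size)]


def micro_cluster_means(expression, groups):
    """Average normalized expression over the cells of each group.

    expression: cells x genes nested list of normalized expression values.
    Returns a micro-clusters x genes nested list of mean values.
    """
    n_genes = len(expression[0])
    means = []
    for cells in groups:
        means.append([
            sum(expression[c][g] for c in cells) / len(cells)
            for g in range(n_genes)
        ])
    return means
```

Applied within each cluster in turn, the per-group mean profiles would be concatenated to form the micro-cluster expression matrix.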
In principle, averaging signals across single cells can blend together
signals from heterogeneous sub-populations. While we
attempted to avoid this by only averaging cells in very similar
transcriptional states, we wanted to ensure that our micro-clusters
truly represented ‘homogeneous’ populations. To do this, we tested
whether drop-out rates for genes within a micro-cluster were consistent
with pure sampling noise. For each gene in each micro-cluster, we
calculated the expected Poisson drop-out rate (percentage of cells with
zero detected molecules) based on its mean expression, and compared this
to the observed drop-out rate (Figure S2A). Overall, we observed very
high correlations (0.98–0.99) between expected and observed
drop-outs, and this held across all micro-clusters (Figure S2B). This
indicates that heterogeneity within a micro-cluster is driven primarily
by sparse sampling as opposed to extensive biological heterogeneity,
enabling us to pool information across cells in the same micro-cluster.
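The drop-out comparison can be sketched as follows. Under a Poisson model with rate equal to a gene's mean count, the probability of detecting zero molecules in a cell is exp(-mean). The sketch assumes that the mean is taken over raw molecule counts within one micro-cluster and that agreement is summarized with a Pearson correlation; both are assumptions, as the text does not specify them, and the function names are hypothetical.

```python
import math


def dropout_rates(counts):
    """Expected and observed drop-out fractions per gene in one micro-cluster.

    counts: cells x genes nested list of molecule counts (assumed raw counts).
    Returns (expected, observed), each a list with one fraction per gene.
    """
    n_cells = len(counts)
    n_genes = len(counts[0])
    expected, observed = [], []
    for g in range(n_genes):
        column = [counts[c][g] for c in range(n_cells)]
        mean = sum(column) / n_cells
        # Poisson probability of zero molecules when the rate equals the mean.
        expected.append(math.exp(-mean))
        # Fraction of cells in which the gene was not detected.
        observed.append(sum(1 for x in column if x == 0) / n_cells)
    return expected, observed


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)
```

A correlation close to 1 between the two lists would be consistent with zeros arising from sampling alone, whereas genes whose observed drop-out rate sits well above the Poisson expectation would point to a mixture of expressing and non-expressing cells.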