Evaluation of cluster identities
We next sought to compare the gene expression patterns of our single
cell clusters with previously characterized progenitor populations in
human cord blood. We used a previously published microarray reference
dataset (Laurenti et al., 2013), which contains expression profiles of
common myeloid progenitor (CMP), megakaryocyte-erythroid progenitor
(MEP), hematopoietic stem cell (HSC), granulocyte-monocyte progenitor
(GMP), and multilymphoid progenitor (MLP) populations. We hypothesized that if
any of our single cell clusters matched one of these reference populations, the
two groups should share common markers of gene expression. We reasoned
that the most informative markers would be genes that were not only
up-regulated in a given cell group, but were in fact most highly
expressed in that group compared with all other groups.
We therefore leveraged the published list of gene expression signatures
for the dataset, extracting the top 250 genes that were most
significantly up-regulated in each population (as originally computed
with limma (Ritchie et al., 2015)). To define markers for each reference
subpopulation, we required that a gene not only be in this
up-regulated list, but also be expressed at its highest level across the
dataset in that subpopulation.
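The marker definition above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the analysis: the function name, the dictionary layout (gene to per-population mean expression), and the toy values are all assumptions made for the example.

```python
def define_reference_markers(mean_expr, upregulated, top_n=250):
    """Keep up-regulated genes whose highest expression is in the same population.

    mean_expr   -- dict: gene -> {population: mean expression} (assumed layout)
    upregulated -- dict: population -> genes ranked by significance of up-regulation
    top_n       -- number of top-ranked genes considered per population
    """
    markers = {}
    for population, ranked_genes in upregulated.items():
        kept = []
        for gene in ranked_genes[:top_n]:
            profile = mean_expr.get(gene)
            if not profile:
                continue  # gene has no expression profile in the reference
            # Population in which this gene is most highly expressed
            # (ties resolve to the first population listed).
            if max(profile, key=profile.get) == population:
                kept.append(gene)
        markers[population] = kept
    return markers


# Toy example with made-up expression values (not data from the study).
mean_expr = {
    "geneA": {"HSC": 5.0, "GMP": 1.0},
    "geneB": {"HSC": 1.0, "GMP": 6.0},
    "geneC": {"HSC": 2.0, "GMP": 3.0},
}
upregulated = {"HSC": ["geneA", "geneC"], "GMP": ["geneB", "geneC"]}
markers = define_reference_markers(mean_expr, upregulated)
```

In this toy input, geneC appears in the HSC up-regulated list but is dropped as an HSC marker because its highest expression is in GMP.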
After defining these markers of reference populations, we examined the
expression of these genes in our single cell clusters, identifying which
single cell cluster had the highest expression for most of these
markers. For example, of the 143 ‘reference markers’ for GMP from the
microarray dataset, 74 of these were most highly expressed in cluster 9
cells (p < 10^-45; one-sided test of equal
proportions). Figure 1D shows the results of this analysis for all pairs
of single cell and reference clusters. With this method, we could
recover well-characterized progenitor states from our Drop-seq clusters
using the reference dataset.
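The comparison step can be sketched as follows. The text does not specify the null proportion or the test implementation, so this example makes two assumptions: the null is that a marker is equally likely to peak in any cluster (1 / number of clusters), and a normal-approximation z-test stands in for the one-sided test of equal proportions. The function names, data layout, toy values, and the cluster count are hypothetical.

```python
import math
from collections import Counter


def top_cluster_counts(markers, cluster_expr):
    """Count, per single cell cluster, how many markers peak in that cluster.

    markers      -- iterable of reference marker genes
    cluster_expr -- dict: gene -> {cluster: mean expression} (assumed layout)
    """
    counts = Counter()
    for gene in markers:
        profile = cluster_expr.get(gene)
        if profile:  # skip markers not detected in the single cell data
            counts[max(profile, key=profile.get)] += 1
    return counts


def one_sided_proportion_pvalue(k, n, p0):
    """Upper-tail p-value for H1: true proportion > p0 (normal approximation)."""
    standard_error = math.sqrt(p0 * (1.0 - p0) / n)
    z = (k / n - p0) / standard_error
    # Survival function of the standard normal, written with erfc.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Toy example with made-up expression values (not data from the study).
cluster_expr = {
    "geneB": {"cluster1": 0.2, "cluster9": 3.1},
    "geneC": {"cluster1": 0.4, "cluster9": 1.0},
    "geneD": {"cluster1": 2.0, "cluster9": 0.5},
}
counts = top_cluster_counts(["geneB", "geneC", "geneD", "geneX"], cluster_expr)

# Counts from the text (74 of 143 GMP markers peak in cluster 9), tested
# against a hypothetical null; n_clusters is an assumed value.
n_clusters = 10
p_value = one_sided_proportion_pvalue(74, 143, 1.0 / n_clusters)
```

Because both the null proportion and the approximation are assumptions, the resulting p-value should not be expected to reproduce the reported value exactly.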