Evaluation of cluster identities
We next sought to compare the gene expression patterns of our single cell clusters with previously characterized progenitor populations in human cord blood. We used a previously published microarray reference dataset (Laurenti et al., 2013), which contains expression profiles of common myeloid progenitor (CMP), megakaryocyte-erythroid progenitor (MEP), hematopoietic stem cell (HSC), granulocyte-monocyte progenitor (GMP), and multilymphoid progenitor (MLP) populations. We hypothesized that if any of our single cell clusters matched one of these reference populations, the two groups should share common marker genes. We reasoned that the most informative markers would be genes that were not only up-regulated in a given cell group, but were in fact most highly expressed in that group compared to all other groups.
We therefore leveraged the published list of gene expression signatures for the dataset, extracting the top 250 genes that were most significantly up-regulated in each population (as originally computed with limma (Ritchie et al., 2015)). To define markers for each reference population, we required that a gene not only be in this up-regulated list, but also be expressed at its highest level in that population relative to all other populations in the dataset.
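The marker definition above can be illustrated with a minimal sketch. It assumes the reference data have been reduced to a mean expression value per gene and population, and that the up-regulated genes are supplied as lists ranked by significance. The function name, the data layout, and the toy gene names are hypothetical and are not taken from the original analysis.

```python
def define_reference_markers(mean_expr, upregulated, top_n=250):
    """Keep up-regulated genes that peak in the population they are listed for.

    mean_expr:   {gene: {population: mean expression}}  (assumed layout)
    upregulated: {population: [genes ranked by significance]}  (assumed layout)
    """
    markers = {}
    for population, ranked_genes in upregulated.items():
        kept = []
        for gene in ranked_genes[:top_n]:
            levels = mean_expr.get(gene)
            if not levels:
                continue  # gene absent from the expression table
            # Population in which this gene has its highest mean expression
            top_population = max(levels, key=levels.get)
            if top_population == population:
                kept.append(gene)
        markers[population] = kept
    return markers


# Toy input with made-up gene names and values, for illustration only
mean_expr = {
    "GENE_A": {"HSC": 9.1, "GMP": 5.0, "MEP": 4.2},
    "GENE_B": {"HSC": 6.0, "GMP": 8.3, "MEP": 5.5},
    "GENE_C": {"HSC": 7.5, "GMP": 7.0, "MEP": 3.0},
}
upregulated = {"HSC": ["GENE_A"], "GMP": ["GENE_B", "GENE_C"], "MEP": []}

markers = define_reference_markers(mean_expr, upregulated)
print(markers)
```

In this toy input, GENE_C is listed as up-regulated in GMP but peaks in HSC, so it is dropped from the GMP markers. This is the sense in which the second criterion is stricter than up-regulation alone.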
After defining these markers of reference populations, we examined the expression of these genes in our single cell clusters, identifying for each reference population the single cell cluster in which the largest share of its markers was most highly expressed. For example, of the 143 ‘reference markers’ for GMP from the microarray dataset, 74 were most highly expressed in cluster 9 cells (p < 10⁻⁴⁵; one-sided test of equal proportions). Figure 1D shows the results of this analysis for all pairs of single cell clusters and reference populations. With this method, we could recover well-characterized progenitor states from our Drop-seq clusters using the reference dataset.
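The matching step can be sketched as follows. Each marker is assigned to the cluster in which it is most highly expressed, and the count for the best cluster is compared with the count expected if markers were spread evenly across clusters. This section does not state the null proportion or whether a continuity correction was applied, so the sketch assumes a null of 1/K for K clusters and substitutes a plain one-sided normal-approximation z-test. Function names, the data layout, and the toy values are hypothetical.

```python
import math
from collections import Counter


def count_top_clusters(cluster_expr, markers):
    """Count, per cluster, how many markers are most highly expressed there.

    cluster_expr: {gene: {cluster: mean expression}}  (assumed layout)
    """
    counts = Counter()
    for gene in markers:
        levels = cluster_expr.get(gene)
        if levels:
            counts[max(levels, key=levels.get)] += 1
    return counts


def one_sided_proportion_test(hits, n_markers, n_clusters):
    """Upper-tail z-test of H0: p = 1/n_clusters against H1: p > 1/n_clusters.

    Assumption: normal approximation without continuity correction.
    """
    p0 = 1.0 / n_clusters
    p_hat = hits / n_markers
    se = math.sqrt(p0 * (1.0 - p0) / n_markers)
    z = (p_hat - p0) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))  # P(Z >= z)
    return z, p_value


# Toy input with made-up values, for illustration only
cluster_expr = {
    "G1": {"c1": 0.2, "c2": 1.5, "c3": 0.1},
    "G2": {"c1": 0.0, "c2": 2.1, "c3": 0.4},
    "G3": {"c1": 0.9, "c2": 0.3, "c3": 0.2},
    "G4": {"c1": 0.1, "c2": 1.1, "c3": 0.6},
}
counts = count_top_clusters(cluster_expr, ["G1", "G2", "G3", "G4"])
best_cluster, hits = counts.most_common(1)[0]
z, p_value = one_sided_proportion_test(hits, n_markers=4, n_clusters=3)
print(best_cluster, hits, z, p_value)
```

With only a handful of markers the normal approximation is crude, so the toy example shows the mechanics rather than a realistic p-value. Repeating the count for every reference population gives a table of reference populations against single cell clusters.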