Determining branch-dependent genes and their dynamic patterns
Once we had a reconstructed developmental hierarchy of early
hematopoiesis, we next asked how gene expression patterns varied across
the lineages. For example, Figure 2D exhibits expression patterns for
canonical markers, revealing that many key hematopoietic regulators
significantly diverge in their expression at transcriptomic branch
points.
We therefore designed a test to identify, in an unbiased way,
branch-dependent genes whose expression levels were dynamic across any
of the bifurcations in our model. For each branch point, we performed
the following test. We considered all nodes downstream of a branch
point, partitioning them into two groups based on the initial
bifurcation. We then linearly scaled (normalized) the ‘developmental
progression’ for nodes along the ‘left’ branch to fall between -1 and 0,
and nodes along the right branch to fall between 0 and 1. A normalized
value of -1 or 1 indicates a node which is farthest from the root, on
either the left branch or right branch. Effectively, this normalization
step allows the divergence on both branches to receive equal weight in
the downstream test. Therefore, for each branch point, we obtained a
vector v , which contained the normalized developmental
progression for all downstream micro-clusters, ranging between -1 and 1.
Branch-dependent genes whose expression bifurcates at the branch point,
should therefore have expression levels that are strongly correlated (or
anti-correlated) with vector v. Therefore, for each gene
g, we computed a branch score , where represents the z-score for g in
each micro-cluster.
We observed that branch scores across all genes roughly obeyed a normal
distribution, as the majority of genes across the transcriptome were not
branch-dependent. We therefore selected positive and negative outlier
genes whose branch score was greater than 2.5 times standard deviations
from the mean of the distribution. Across all three branch points, we
detected a total of 632 branch-dependent genes. Lastly, we grouped genes
into modules with similar developmental dynamics using constrained
k-means clustering (Cuesta-Albertos et al., 1997) with 100 random
starts, using the tkmeans() function in package tclust. After clustering
we identified a group of 127 genes that were primarily very lowly
expressed, exhibited poor within-cluster similarities, and no
hematopoietic ontology enrichments, and therefore removed these genes
from further analysis. Clustering results, including all removed genes,
are shown in Table S3.
Aligning datasets from human bone marrow and cord blood
The raw single-cell RNA-seq read count from human bone marrow CD34+ cells were downloaded from NCBI GEO (GSE****), and normalized identically as Drop-seq data using log normalization. Given the current limitation for aligning multiple datasets, we utilized only one of the bone marrow data (donor 1 using QUARTZ-seq). As described in Butler and Satija (citation), a canonical correlation analysis was run between QUARTZ-seq and Drop-seq datasets to extract common biological variations, and alignment was run for CCs 1 to 7 to create a new Seurat object with scRNA-seq data from the two systems. Co-clustering was performed on the merged data using modularity optimization on aligned CCs 1-7, and visualized in two dimensions with tSNE. Bone marrow cells were annotated as the identities from cord blood cells sharing the same cluster.