Determining branch-dependent genes and their dynamic patterns
Once we had a reconstructed developmental hierarchy of early hematopoiesis, we next asked how gene expression patterns varied across the lineages. For example, Figure 2D exhibits expression patterns for canonical markers, revealing that many key hematopoietic regulators significantly diverge in their expression at transcriptomic branch points.
We therefore designed a test to identify, in an unbiased way, branch-dependent genes whose expression levels were dynamic across any of the bifurcations in our model. For each branch point, we performed the following test. We considered all nodes downstream of a branch point, partitioning them into two groups based on the initial bifurcation. We then linearly scaled (normalized) the ‘developmental progression’ for nodes along the ‘left’ branch to fall between -1 and 0, and nodes along the right branch to fall between 0 and 1. A normalized value of -1 or 1 indicates a node which is farthest from the root, on either the left branch or right branch. Effectively, this normalization step allows the divergence on both branches to receive equal weight in the downstream test. Therefore, for each branch point, we obtained a vector v , which contained the normalized developmental progression for all downstream micro-clusters, ranging between -1 and 1.
Branch-dependent genes whose expression bifurcates at the branch point, should therefore have expression levels that are strongly correlated (or anti-correlated) with vector v. Therefore, for each gene g, we computed a branch score , where represents the z-score for g in each micro-cluster.
We observed that branch scores across all genes roughly obeyed a normal distribution, as the majority of genes across the transcriptome were not branch-dependent. We therefore selected positive and negative outlier genes whose branch score was greater than 2.5 times standard deviations from the mean of the distribution. Across all three branch points, we detected a total of 632 branch-dependent genes. Lastly, we grouped genes into modules with similar developmental dynamics using constrained k-means clustering (Cuesta-Albertos et al., 1997) with 100 random starts, using the tkmeans() function in package tclust. After clustering we identified a group of 127 genes that were primarily very lowly expressed, exhibited poor within-cluster similarities, and no hematopoietic ontology enrichments, and therefore removed these genes from further analysis. Clustering results, including all removed genes, are shown in Table S3.
Aligning datasets from human bone marrow and cord blood
The raw single-cell RNA-seq read count from human bone marrow CD34+ cells were downloaded from NCBI GEO (GSE****), and normalized identically as Drop-seq data using log normalization. Given the current limitation for aligning multiple datasets, we utilized only one of the bone marrow data (donor 1 using QUARTZ-seq). As described in Butler and Satija (citation), a canonical correlation analysis was run between QUARTZ-seq and Drop-seq datasets to extract common biological variations, and alignment was run for CCs 1 to 7 to create a new Seurat object with scRNA-seq data from the two systems. Co-clustering was performed on the merged data using modularity optimization on aligned CCs 1-7, and visualized in two dimensions with tSNE. Bone marrow cells were annotated as the identities from cord blood cells sharing the same cluster.