Dimensionality reduction
From the normalized expression matrix, we first identified a set of variable genes with high dispersion across cells. Briefly, we calculated the mean and variance of each gene in non-log space, and computed dispersion as the variance divided by the mean. We selected the 5,000 genes with the highest dispersion as variable genes for dimensionality reduction, a common step in single-cell data analysis for reducing noise and capturing biological signal. For the reduction itself we used independent component analysis (ICA), which was initially developed to separate a mixture of signals into additive, mutually independent sources, and has more recently been applied to dimensionality reduction of single-cell data (Trapnell et al., 2014). We implemented ICA using the ica package in R (Helwig and Hong, 2013; Hyvarinen, 1999).
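The variable-gene selection can be illustrated with a minimal sketch. It is not the code used in this study: it takes dispersion to be the variance-to-mean ratio (an assumption about the intended definition), uses the population variance, and assumes the input values are already in non-log space. The function name top_dispersion_genes and the toy matrix are hypothetical.

```python
from statistics import fmean, pvariance


def top_dispersion_genes(expr, n_top):
    """Return row indices of the n_top genes with the highest dispersion.

    expr is a list of rows, one per gene, each holding non-log expression
    values across cells. Dispersion is taken as variance / mean (assumed).
    """
    dispersions = []
    for gene_idx, values in enumerate(expr):
        mean = fmean(values)
        # Unexpressed genes (zero mean) are assigned zero dispersion.
        disp = pvariance(values) / mean if mean > 0 else 0.0
        dispersions.append((disp, gene_idx))
    # Rank by dispersion, highest first, and keep the top n_top indices.
    dispersions.sort(key=lambda pair: pair[0], reverse=True)
    return [gene_idx for _, gene_idx in dispersions[:n_top]]


# Toy matrix: 4 genes x 5 cells (hypothetical values).
toy = [
    [1.0, 1.0, 1.0, 1.0, 1.0],   # constant across cells
    [0.0, 0.0, 0.0, 0.0, 10.0],  # expressed in a single cell
    [2.0, 3.0, 2.0, 3.0, 2.0],   # mild variation
    [0.0, 0.0, 0.0, 0.0, 0.0],   # unexpressed
]
selected = top_dispersion_genes(toy, n_top=2)
```

In an analysis of the scale described here, n_top would be 5,000 and the same ranking would normally be done with vectorized array operations rather than Python loops.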
The returned ICs pool information across multiple correlated genes and thus represent ‘meta-genes’ (Setty et al., 2016) that are robust to drop-out events in single-cell RNA-seq data. We noticed that the variance accounted for by each component fell after IC25, whereas GO term enrichment analysis using Enrichr (Chen et al., 2013; Kuleshov et al., 2016) showed no significant enrichment after IC20. Furthermore, the genes with strong IC14 loadings were dominated by mitochondrial genes. We therefore used ICs 1 to 20, excluding IC14, for downstream analysis.
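The component filtering can be sketched in the same spirit. The example below is illustrative only: the function select_components, the "mt-" gene-name prefix, the 0.5 threshold, and the example gene lists are all assumptions introduced here, not details reported in this study. It keeps the first n_keep components and drops any whose top-loaded genes are mostly mitochondrial.

```python
def select_components(n_keep, top_genes_per_ic, mito_prefix="mt-", max_mito_frac=0.5):
    """Return the 1-based indices of components retained for downstream use.

    top_genes_per_ic maps a component index to the names of its most
    strongly loaded genes. A component is dropped when more than
    max_mito_frac of those genes carry the mitochondrial prefix.
    """
    kept = []
    for ic in range(1, n_keep + 1):
        genes = top_genes_per_ic.get(ic, [])
        # Case-insensitive match so both "mt-" and "MT-" names are caught.
        n_mito = sum(g.lower().startswith(mito_prefix) for g in genes)
        if genes and n_mito / len(genes) > max_mito_frac:
            continue  # dominated by mitochondrial genes: exclude
        kept.append(ic)
    return kept


# Hypothetical top-loaded genes for two components.
top_genes = {
    3: ["Actb", "Gapdh", "mt-Co1", "Tubb5"],
    14: ["mt-Co1", "mt-Nd1", "mt-Cytb", "Actb"],
}
kept_ics = select_components(20, top_genes)
```

With these hypothetical inputs, IC14 is the only component removed, leaving 19 of the first 20 components.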