Differential expression analysis

Gene detection was estimated as the proportion of cells and nuclei expressing each gene (CPM > 0). In order to estimate the expected variability of gene detection as a result of population sampling, subsets of [100] cells were randomly sampled 100 times and a 95% confidence interval of gene detection was constructed. Data were summarized with a hexagonal binned scatter plot and a log-transformed color scale using the R package ggplot2 \cite{Wickham:2009:GEG:1795559}.
Differential expression between nuclei and cells was calculated with the R package limma \cite{ritchie2015limma} using default settings and log2(CPM + 1) defined based on two sets of reads: introns plus exons and only exons. Significantly differentially expressed were defined as having >1.5-fold change and a Benjamini-Hochberg corrected P-value < 0.05.
Two sets of nucleus- and cell-enriched genes (introns plus exons and exons only) were tested for gene ontology (GO) enrichment using the ToppGene Suite \cite{Chen_2009}. Significantly enriched (Benjamini-Hochberg FDR < 0.05) GO terms were summarized as tree maps with box sizes proportional to -log10(P-values) using REVIGO \cite{Supek_2011} (Figure \ref{898251}).

Clustering analysis

Nuclei were grouped into transcriptomic cell types based on an iterative clustering procedure. For each gene, log2(CPM + 1) expression was centered and scaled across nuclei. Gene expression dropout was more likely to occur in nuclei with lower quality cDNA libraries and for genes with lower average expression in nuclei isolated from the same cell type. Expression noise models were estimated for each nucleus based on the 8 most similar nuclei using the “knn.error.models” function of the “scde” R package as described in Fan et al., 2016) These noise models were used to select significantly variable genes (adjusted variance > 1.25) and to estimate a zero-weight matrix that represented the likelihood of dropouts based on average gene expression levels. Dimensionality reduction was performed with principal components analysis (PCA) on variable genes, and the covariance matrix was adjusted by the zero-weight matrix to account for gene dropouts. Principal components (PCs) were retained for which more variance was explained than the broken stick null distribution or PCs based on permuted data. If more than 2 PCs were retained, then dimensionality was further reduced to 2-dimensions using t-distributed stochastic neighbor embedding (tSNE)(Van Der Maaten and Hinton, 2008) with a perplexity parameter of 80.
After dimensionality reduction, nuclei were clustered using a conservative procedure that attempted to split them into the fewest number of clusters possible. Nearest-neighbor distances between all nuclei were calculated and sorted, and segmented linear regression (R package “segmented”) was applied to estimate the distribution breakpoint to help define the distance scale for density clustering. Next, density clustering (R package “dbscan” [Ester et al., 1996]) was applied to nuclei, and the number of clusters calculated for a range of 10 nearest-neighbor distances (parameter epsilon), starting from the maximum distance between nuclei to the distance breakpoint identified in the last step. If only one cluster was found using all values of epsilon, then the above procedure was repeated using a perplexity parameter of 50, 30, and 20 for tSNE, and stopping when more than one cluster was detected. Finally, if no cluster splitting was possible using tSNE, then a final density clustering was applied to the first two significant PCs. If more than one cluster was identified, then the statistical significance of each cluster pair was evaluated with the R package “sigclust” (Liu et al., 2008), which compares the distribution of nuclei to the null hypothesis that nuclei are drawn from a single multivariate Gaussian. Iterative clustering was used to split nuclei into sub-clusters until the occurrence of one of four stop criteria: 1) <6 nuclei in a cluster (because it cannot be split due a minimum cluster size of 3), 2) no significantly variable genes, 3) no significantly variable PCs, 4) no significant sub-clusters.
To assess the robustness of clusters, the iterative clustering procedure described above was repeated 100 times for random subsamples of 80% of nuclei. A co-clustering matrix was generated that represented the proportion of clustering iterations that each pair of nuclei were assigned to the same cluster. Average-linkage hierarchical clustering was applied to this matrix followed by dynamic branch cutting (R package “WGCNA”) with cut height ranging from 0.01 to 0.99 in steps of 0.01. A cut height resulting in 25 clusters was selected to balance cohesion (average within cluster co-clustering) and discreteness (average between cluster co-clustering) across clusters. Finally, gene markers were identified for all cluster pairs, and clusters were merged if they lacked binary markers (gene expressed in >50% nuclei in first cluster and <10% in second cluster) with average CPM > 1 (see also Marker gene selection below).

Cluster visualization

The relationships between cell type clusters were represented as a constellation diagram where the area of each disc is proportional to the number of nuclei in each cluster and the similarity between clusters is proportional to the width of the lines connecting clusters. Cluster similarity was calculated as the average co-clustering between all nuclei for each pair of clusters. For example, a similarity of 0.1 indicates that 1 out of 10 clustering iterations nuclei from one cluster were assigned to the other cluster. Similarities less than 0.05 were not plotted.
Next, clusters were arranged by transcriptomic similarity based on hierarchical clustering. First, the average expression level of each gene was calculated for each cluster. Genes were then sorted based on variance and the top 2000 genes were used to calculate a correlation-based distance matrix, Dxy=1-(cor(x,y))/2, between each cluster average. A cluster tree was generated by performing hierarchical clustering on this distance matrix (using “hclust” with default parameters), and then reordered to show inhibitory clusters first, followed by excitatory clusters and glia, with larger clusters first, while respecting the tree structure. Note that this measure of cluster similarity is complementary to the co-clustering similarity described above. For example, two clusters with high transcriptomic similarity but a few distinct marker genes may have low co-clustering similarity.

Marker gene selection

Initial sets of marker genes for each pair of clusters were selected by assessing significance of differential expression using the “limma” (Ritchie et al., 2015) R package, and then filtering these sets of significant genes to include only those expressed in more than 50% of nuclei in the “on” cluster and fewer than 20% of nuclei in the “off” cluster. Potential marker genes for individual clusters were chosen by ranking the significance of pairwise marker genes, summing the ranks across all possible pairs for a given cluster, and sorting the resulting gene list ascending by summed rank. The final set of marker genes was selected by comparing the gene expression distribution for the top ranked marker genes for each cluster using the visualization described below.

Cluster matching

Gene expression visualization

Gene expression (CPM) was visualized using heat maps and violin plots, which both show genes as rows and nuclei as columns, sorted by cluster. Heat maps display each nucleus as a short vertical bar, color-coded by expression level (blue=low; red=high), and clusters ordered as described above. The distribution of marker gene expression across nuclei in each cluster were represented as violin plots, which are density plots turned 90 degrees and reflected on the Y-axis. Black dots indicate the median gene expression in nuclei of a given cluster; dots above Y=0 indicate that a gene is expressed in more than half of the nuclei in that cluster.

Estimating nuclear proportions