Cluster dendrograms

Clusters were arranged by transcriptomic similarity based on hierarchical clustering. First, the average expression level of the top 1200 marker genes (i.e., those with the highest beta scores) was calculated for each cluster. A correlation-based distance matrix (\(D_{xy}=\frac{1-\rho\left(x,y\right)}{2}\)) was calculated, and complete-linkage hierarchical clustering was performed using the "hclust" R function with default parameters. The resulting dendrogram branches were reordered to show inhibitory clusters followed by excitatory clusters, with larger clusters first, while retaining the tree structure. Note that this measure of cluster similarity is complementary to the co-clustering separation described above. For example, two clusters with similar gene expression patterns but with a few binary marker genes may be close on the tree yet highly distinct based on co-clustering.
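The distance and linkage steps can be illustrated with a minimal Python sketch. It is not the R code used for the analysis: the function names, the naive agglomeration loop, and the toy expression values are all hypothetical, and only the distance formula and the complete-linkage rule are taken from the description above.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_distance(profiles):
    """D_xy = (1 - rho(x, y)) / 2 for every ordered pair of cluster profiles."""
    names = list(profiles)
    return {
        (a, b): (1.0 - pearson(profiles[a], profiles[b])) / 2.0
        for a in names for b in names
    }

def complete_linkage(profiles):
    """Naive agglomeration; returns merges as (group, group, height) tuples."""
    dist = correlation_distance(profiles)
    groups = [(name,) for name in profiles]
    merges = []
    while len(groups) > 1:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                # Complete linkage: group distance is the largest member distance.
                d = max(dist[(a, b)] for a in groups[i] for b in groups[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((groups[i], groups[j], d))
        merged = groups[i] + groups[j]
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
    return merges

# Hypothetical average marker-gene expression per cluster (toy values).
toy_profiles = {
    "inh_1": [5.0, 4.0, 0.5, 0.2],
    "inh_2": [4.5, 4.2, 0.4, 0.6],
    "exc_1": [0.3, 0.5, 6.0, 5.0],
}
toy_merges = complete_linkage(toy_profiles)
```

In this toy input the two inhibitory-like profiles merge first because their correlation is highest, and the excitatory-like profile joins at a larger height.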

Matching clusters based on marker gene expression

Nuclei and cell clusters were independently compared to published mouse V1 cell types \cite{26727548}. The proportion of nuclei or cells expressing each gene with CPM > 1 was calculated for all clusters. Approximately 400 genes were markers in both data sets (beta > 0.3 or tau > 0.9) and were expressed in the majority of samples of between one and five clusters. Markers expressed in more than five clusters were excluded to increase the specificity of cluster matching. Correlations were calculated between all pairs of clusters across these genes, weighted by beta scores to increase the influence of more informative genes. Heatmaps were generated to visualize all cluster correlations. All nuclei and cell clusters had reciprocal best matching clusters from Tasic et al. and were labeled based on these reported cluster names.
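A weighted Pearson correlation of the kind described above can be sketched as follows. The function name and the toy values are hypothetical, and the text does not specify which data set's beta scores supplied the weights, so the weight vector here is simply an assumed input.

```python
import math

def weighted_correlation(x, y, weights):
    """Weighted Pearson correlation of x and y with non-negative weights."""
    total = sum(weights)
    mean_x = sum(w * a for w, a in zip(weights, x)) / total
    mean_y = sum(w * b for w, b in zip(weights, y)) / total
    cov = sum(w * (a - mean_x) * (b - mean_y) for w, a, b in zip(weights, x, y))
    var_x = sum(w * (a - mean_x) ** 2 for w, a in zip(weights, x))
    var_y = sum(w * (b - mean_y) ** 2 for w, b in zip(weights, y))
    return cov / math.sqrt(var_x * var_y)

# Hypothetical proportions of samples with CPM > 1 for four shared marker genes.
query_cluster = [0.9, 0.8, 0.1, 0.2]
reference_cluster = [0.8, 0.9, 0.2, 0.1]
beta_scores = [0.9, 0.6, 0.4, 0.3]  # hypothetical per-gene weights

r_weighted = weighted_correlation(query_cluster, reference_cluster, beta_scores)
```

Genes with larger weights contribute more to the means and covariance, so highly informative markers dominate the similarity between two clusters.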
Next, nuclei and cell clusters were directly compared using the above analysis. All 11 clusters had reciprocal best matches that were consistent with cluster labels assigned based on similarity to published types. The most highly conserved marker genes of matching clusters were identified by selecting genes expressed in a single cluster (>50% of samples with CPM > 1) and with the highest minimum beta score between nuclei and cell clusters. Two additional marker genes were identified that discriminated two closely related clusters. Violin plots of marker gene expression were constructed with each gene on an independent, linear scale.
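Reciprocal best matching can be illustrated with a short sketch, assuming the pairwise cluster correlations are already available as a nested dictionary. The function name, cluster names, and correlation values are hypothetical.

```python
def reciprocal_best_matches(corr):
    """Return (row, col) pairs that are each other's highest-correlation match.

    corr[row][col] holds the correlation between a row cluster and a column cluster.
    """
    rows = list(corr)
    cols = list(next(iter(corr.values())))
    best_for_row = {r: max(cols, key=lambda c: corr[r][c]) for r in rows}
    best_for_col = {c: max(rows, key=lambda r: corr[r][c]) for c in cols}
    return [(r, c) for r, c in best_for_row.items() if best_for_col[c] == r]

# Hypothetical correlations between three nuclei clusters and three cell clusters.
toy_corr = {
    "nuclei_A": {"cells_A": 0.9, "cells_B": 0.3, "cells_C": 0.1},
    "nuclei_B": {"cells_A": 0.2, "cells_B": 0.8, "cells_C": 0.4},
    "nuclei_C": {"cells_A": 0.1, "cells_B": 0.5, "cells_C": 0.7},
}
toy_matches = reciprocal_best_matches(toy_corr)
```

A pair is reported only when the best match holds in both directions, so a cluster whose best partner prefers a different cluster is left unmatched.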
Nuclei and cell clusters were also compared by calculating average cluster expression based only on intronic or exonic reads and calculating a correlation-based distance using the top 1200 marker genes as described above. Hierarchical clustering was applied to all clusters quantified using the two sets of reads. In addition, the average log2(CPM + 1) expression across all nuclei and cells was calculated using intronic or exonic reads.
Cluster separation was calculated for individual nuclei and cells as the average within-cluster co-clustering of each sample minus the maximum average between-cluster co-clustering. Separations for matched pairs of clusters were visualized with box plots and compared using Student's t-test, with significance assessed after Bonferroni correction for multiple testing. Finally, a linear model with the intercept set to zero was fit to beta marker scores for genes that were expressed in at least one but not all cell and nuclear clusters.
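The separation score can be sketched as below. The function name and toy matrix are hypothetical. The sketch assumes that a sample's co-clustering with itself is excluded from the within-cluster average, and that every cluster has at least two samples and at least two clusters exist; none of these details is stated in the text.

```python
def cluster_separation(coclustering, labels):
    """Per-sample separation: mean within-cluster co-clustering minus the
    largest mean co-clustering with any other cluster.

    coclustering[i][j] is the co-clustering frequency of samples i and j.
    labels[i] is the cluster assignment of sample i.
    """
    separations = []
    for i, own in enumerate(labels):
        sums, counts = {}, {}
        for j, other in enumerate(labels):
            if j == i:
                continue  # assumption: skip the sample's self co-clustering
            sums[other] = sums.get(other, 0.0) + coclustering[i][j]
            counts[other] = counts.get(other, 0) + 1
        means = {c: sums[c] / counts[c] for c in sums}
        within = means[own]
        between = max(v for c, v in means.items() if c != own)
        separations.append(within - between)
    return separations

# Hypothetical co-clustering matrix for four samples in two clusters.
toy_labels = ["A", "A", "B", "B"]
toy_coclustering = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.0, 0.1],
    [0.1, 0.0, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
toy_separations = cluster_separation(toy_coclustering, toy_labels)
```

Values near 1 indicate samples that co-cluster almost exclusively with their own cluster, while values near 0 indicate samples that are ambiguous between two clusters.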

Estimating proportions of nuclear transcripts

The nuclear proportion of transcripts was estimated in two ways. First, all intronic reads were assumed to be from transcripts localized to the nucleus, so that the proportion of intronic reads measured in cells should decrease linearly with the nuclear proportion of the cell as nuclear reads are diluted with cytoplasmic reads. For each cell type, the nuclear proportion was estimated as the proportion of intronic reads in cells divided by the proportion of intronic reads in matched nuclei. Second, the nuclear proportion was estimated as the average ratio of cell to nuclear expression (CPM) using only exonic reads of three highly expressed nuclear genes (Snhg11, Malat1, and Meg3).
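Both estimates reduce to simple ratios, sketched here for a single cell type. The function names and all numeric inputs are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def nuclear_proportion_from_introns(intron_fraction_cells, intron_fraction_nuclei):
    """Estimate 1: intronic read fraction in cells over that in matched nuclei."""
    return intron_fraction_cells / intron_fraction_nuclei

def nuclear_proportion_from_nuclear_genes(cell_cpm, nuclear_cpm):
    """Estimate 2: mean cell-to-nuclear exonic CPM ratio over nuclear-retained genes."""
    ratios = [cell_cpm[gene] / nuclear_cpm[gene] for gene in cell_cpm]
    return sum(ratios) / len(ratios)

# Hypothetical intronic read fractions for one matched cell type.
estimate_introns = nuclear_proportion_from_introns(0.2, 0.5)

# Hypothetical exonic CPM values for the three nuclear genes named in the text.
toy_cell_cpm = {"Snhg11": 100.0, "Malat1": 3000.0, "Meg3": 70.0}
toy_nuclear_cpm = {"Snhg11": 400.0, "Malat1": 10000.0, "Meg3": 200.0}
estimate_genes = nuclear_proportion_from_nuclear_genes(toy_cell_cpm, toy_nuclear_cpm)
```

With these toy inputs the intron-based estimate is 0.2 / 0.5 = 0.4 and the gene-based estimate is the mean of 0.25, 0.30, and 0.35, which is 0.30.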
The standard deviations of nuclear proportion estimates were calculated based on standard error propagation of variation in intronic read proportions and expression levels. Nuclear proportion estimates were compared with linear regression, and the estimate based on relative expression levels was used for further analysis.
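For orientation, the usual first-order propagation formula for a ratio \(f=a/b\) of two independent quantities is given below. The text does not state the exact form that was applied, so the independence assumption and the symbols \(a\), \(b\), \(\sigma_{a}\), and \(\sigma_{b}\) are illustrative only.

\[
\sigma_{f} \approx \left|f\right| \sqrt{\left(\frac{\sigma_{a}}{a}\right)^{2}+\left(\frac{\sigma_{b}}{b}\right)^{2}}, \qquad f=\frac{a}{b}
\]

Under this reading, \(a\) and \(b\) would correspond to the cell and nuclear quantities (intronic read proportions or CPM values) entering each ratio.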
The nuclear proportion of transcripts for all genes was estimated for each cell type as the ratio of average expression (CPM) in nuclei versus matched cells multiplied by the nuclear proportion of all transcripts. Estimated proportions greater than 1 were set equal to 1 for each cell type, and a weighted average proportion was calculated for each gene with weights equal to the average log2(CPM + 1) expression in each cell type.
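The per-gene calculation can be sketched as follows for one gene measured in several cell types. The function name and toy values are hypothetical. The sketch assumes that the weights are log2(CPM + 1) of expression in cells and that cell CPM is positive in every cell type considered; the text does not specify either point.

```python
import math

def gene_nuclear_proportion(nuclear_cpm, cell_cpm, overall_nuclear_proportion):
    """Weighted average nuclear proportion of one gene across cell types.

    Each argument maps cell type -> value. Per cell type the estimate is
    (nuclear CPM / cell CPM) * overall nuclear proportion, capped at 1.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for cell_type in cell_cpm:
        ratio = nuclear_cpm[cell_type] / cell_cpm[cell_type]
        proportion = min(1.0, ratio * overall_nuclear_proportion[cell_type])
        # Assumption: weight by log2(CPM + 1) of expression in cells.
        weight = math.log2(cell_cpm[cell_type] + 1.0)
        weighted_sum += weight * proportion
        weight_total += weight
    return weighted_sum / weight_total

# Hypothetical values for one gene in two cell types.
toy_gene_proportion = gene_nuclear_proportion(
    nuclear_cpm={"type_1": 6.0, "type_2": 70.0},
    cell_cpm={"type_1": 3.0, "type_2": 7.0},
    overall_nuclear_proportion={"type_1": 0.3, "type_2": 0.25},
)
```

In the toy input the per-type estimates are 0.6 and 2.5 (capped to 1), the weights are log2(4) = 2 and log2(8) = 3, and the weighted average is (2 × 0.6 + 3 × 1) / 5 = 0.84.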
11,932 genes were expressed in at least one nuclear or cell cluster (>50% samples expressed with CPM > 1) and were annotated as one of three gene types -- protein-coding, protein non-coding, or pseudogene -- using gene metadata from NCBI (ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/Mammalia/Mus_musculus.gene_info.gz; downloaded 10/12/2017). For each type, histograms of gene counts with different nuclear proportions were generated. Next, genes were grouped into 10 bins of estimated nuclear fraction, from high cytoplasmic to high nuclear enrichment, and beta marker score distributions were visualized as box plots. ANOVA followed by Tukey's Honest Significant Difference test was used to assess the significance of beta score differences among gene types and among nuclear proportions within gene types.
Nuclear transcript proportions were compared to nuclear proportions estimated for mouse liver and pancreatic beta cells based on data from \citet{26711333}. Ratios of normalized nuclear and cytoplasmic transcript counts were calculated in four tissue replicates. Average ratios were calculated for genes with at least one count in either fraction in at least one tissue. Nuclear proportion estimates for all genes with data from both data sets (n = 4373) were compared with Pearson correlation, a linear model with intercept set equal to zero, and histograms with a bin width of 0.02.
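A linear model with the intercept fixed at zero has a closed-form least-squares slope, which can be sketched together with the Pearson correlation. The function names and toy values are hypothetical and do not reproduce the reported comparison.

```python
import math

def zero_intercept_slope(x, y):
    """Least-squares slope of y = slope * x with the intercept fixed at zero."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical nuclear proportion estimates for three genes in two data sets.
toy_this_study = [0.1, 0.2, 0.4]
toy_other_study = [0.2, 0.4, 0.8]

toy_slope = zero_intercept_slope(toy_this_study, toy_other_study)
toy_r = pearson(toy_this_study, toy_other_study)
```

Fixing the intercept at zero means the slope summarizes the overall scaling between the two sets of estimates, while the Pearson correlation summarizes how consistently genes are ranked.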