Introduction
Cell types in mammalian brain can been defined based on various properties including their morphology, electrophysiology, and gene expression \cite{Poulin_2016,Zeng_2017,Bernard_2009}. scRNA-seq has emerged as a high-throughput method for quantification of the majority of transcripts in thousands of cells. Similarities and differences in gene expression at the single cell level characterized by scRNA-seq have revealed diverse cell types in many mouse brain regions, including neocortex \cite{Tasic2016a,Tasic2017,Zeisel2015}, hypothalamus \cite{Campbell2017}, and retina \cite{Shekhar2016,Macosko2015}.
However, scRNA-seq profiling does not provide an unbiased survey of neural cell types. Some cell types are more vulnerable to the tissue dissociation process and are underrepresented in the final data set. For example, in mouse neocortex, fast-spiking parvalbumin-positive interneurons and subcortically projecting glutamatergic neurons in layer 5 are observed in lower proportions than expected and need to be selectively enriched using Cre-driver lines to achieve sufficient sampling \cite{Tasic2017}. In adult human neocortex, non-neuronal cells survive dissociation better than neurons and are over-represented in single cell suspensions \cite{Darmanis2015}. In contrast to whole cells, nuclei are more resistant to mechanical assaults and can be isolated from frozen tissue \cite{Krishnaswami_2016,Lacar2016}. Single nuclei have been shown to provide sufficient gene expression information to define relatively broad cell classes in adult human brain \cite{Lake2016,29227469} and mouse hippocampus \cite{Habib2016}.
However, previous studies have not investigated if individual nuclei contain sufficient diversity and number of transcripts to enable discrimination of closely related cell types at a resolution comparable to whole cells. A recent study compared clustering results for single nuclei and whole cells isolated from the mouse somatosensory cortex \cite{Lake2017}, but it only showed similar ability to distinguish two highly different cell classes: superficial- and deep-layer excitatory neurons.
In this study, we investigated differences in mRNA composition and information content between nuclei and whole cells, and the ability to detect cell types with similar resolution. For this purpose, we focused on well-matched sets of cells and nuclei: 463 nuclei and 463 whole cells from layer 5 of adult mouse primary visual cortex (VISp). We selected VISp because it contains a known variety of distinguishable yet highly similar cell types \cite{Tasic2016a} that would reveal the cell type detection limit of RNA-seq data obtained from single cells or nuclei. The cells and nuclei were processed by the same experimental and computational methods. We find that although the nuclear content and proportion of mRNA vary among cell types, nuclei contain enough informative transcripts to provide similar cell type resolution to whole cells.
Results
RNA-seq profiling of single nuclei and single cells
We isolated 487 NeuN-positive single nuclei from layer 5 of mouse VISp using fluorescence activated cell sorting (FACS). Anti-NeuN staining was performed to enrich for neurons. In parallel, we isolated 12,866 tdT-positive single cells by FACS from all layers of mouse VISp and a variety of Cre-driver lines, as part of a larger study on cortical cell type diversity \cite{Tasic2017}. For both single nuclei and cells, poly(A)-transcripts were reverse transcribed and amplified with SMART-Seq v4, cDNA was tagmented by Nextera XT, and resulting libraries were sequenced to an average depth of 2.5 million reads (Figure \ref{856440}A). RNA-seq reads were mapped to the mouse genome using the STAR aligner \cite{Dobin2013}. Gene expression was quantified as the sum of intronic and exonic reads per gene and was normalized as counts per million (CPM) and log2-transformed. For each nucleus and cell, the probabilities of gene detection dropouts were estimated as a function of average expression level based on empirical noise models \cite{Kharchenko2014}.
463 out of 487 single nuclei (95%) passed quality control metrics, and each nucleus was matched to the most similar nucleus and cell based on the maximum correlated expression of all genes, weighted for gene dropouts. Nuclei had similarly high pairwise correlations to cells as to other nuclei suggesting that cells and nuclei were well matched (Figure \ref{856440}B). As expected, matched cells were derived almost exclusively from layer 5 and adjacent layers 4 and 6 (Figure \ref{259770}B), and from Cre-driver lines that labeled cells in layer 5 (Figure \ref{856440}C and Figure \ref{259770}A,C). The small minority of matched cells isolated from superficial layers were GABAergic interneurons that have been detected in many layers \cite{Tasic2017}.
Comparison of nuclear and whole cell transcriptomes
scRNA-seq profiles nuclear and cytoplasmic transcripts, whereas snRNA-seq profiles mostly nuclear transcripts (it is possible that some fraction of transcripts is attached to rough ER, which may be partially retained in nuclear preps). Therefore, we expect that RNA-seq reads will differ between nuclei and cells. In nuclei, more than 50% of reads that aligned to the mouse genome did not map to known spliced transcripts but mapped within gene boundaries but outside exons and 5'- and 3'-untranslated regions. They were therefore annotated as intronic reads (Figure \ref{835862}A). In contrast, the majority of cells had less than 30% intronic reads with a minority of cells having closer to 50% intronic reads, similar to nuclei. Median gene detection based on exonic reads was lower for nuclei (~5,000 genes) than for cells (~9,500). Including both intronic and exonic reads increased gene detection for nuclei (~7,000) and cells (~11,000), demonstrating that intronic reads provided additional information not captured by exons. Whole brain control RNA displayed a read mapping distribution similar to cells, which is consistent with dissociated single cells capturing the majority of transcripts in the whole cell.
Gene "dropouts" result from missed detection of transcripts and are more pronounced in nuclei than in cells. When transcript dropouts were adjusted based on empirical noise models, correlations between pairs of nuclei and pairs of cells increased, although cell-cell similarities remained significantly higher (Figure \ref{835862}B). A majority of expressed genes (21,279; 63%) showed similar detection (<10% difference) in nuclei and cells, whereas 7,217 genes (21%) were detected in at least 25% more cells than nuclei (Figure \ref{835862}C and Table S1). 8,614 genes have significantly higher expression in cells than nuclei (>1.5 fold expression; FDR < 0.05) and many are involved in house-keeping functions such as mRNA processing and translation (Figure \ref{835862}D). Genetic markers of neuronal activity, such as immediate early genes Fos, Egr1, and Arc also displayed up to 10-fold increased expression in cells, potentially a byproduct of tissue dissociation \cite{Lacar2016}. 159 genes displayed significantly higher expression in nuclei (Figure \ref{835862}D and Table S2), and they appear relevant to neuronal identity as they include connectivity and signaling genes (Figure \ref{898251}A and Table S4). Based on the sum of intronic and exonic reads, these 159 nucleus-enriched genes are on average more than 10-fold longer than cell-enriched genes (Figure \ref{898251}B). Similar observation was recently reported for mRNAs enriched in single nuclei in mouse somatosensory cortex \cite{Lake2017}. When only exonic reads were used to quantify expression in nuclei and cells, a different set of 146 genes was significantly enriched in nuclei (Table S3). These genes were only slightly longer than cell-enriched genes, they were not associated with neuron-specific functions, and were significantly enriched for genes that participate in pre-mRNA splicing.
Intronic reads are required for high-resolution cell type identification from snRNA-seq
Next, we applied an iterative clustering procedure (see Methods and Figure \ref{859252}) to identify clusters of single nuclei and cells that share gene expression profiles. To assess cluster robustness, we repeated clustering 100 times using random subsets of 80% nuclei and cells and calculated the proportion of clustering runs in which each pair of samples clustered together. Co-clustering matrices were reordered by Ward's hierarchical clustering and represented as heatmaps with coherent clusters ordered as squares along the diagonal (Figure \ref{984877}A,B).
Clustering includes two steps -- selection of differentially expressed (DE) genes and distance measurement -- that are particularly sensitive to expression quantification. We repeated clustering using intronic and exonic reads or only exonic reads for these steps, and ordered co-clustering matrices to match the results using all reads for both steps. When using introns and exons, we found 11 distinct clusters of nuclei and cells, and clusters had similar cohesion (average within cluster co-clustering) and separation (average co-clustering difference with the closest cluster) (Figure \ref{984877}C). Including intronic reads for either clustering step increased the number of clusters detected for nuclei but not cells. Therefore, accounting for intronic reads in snRNA-seq was critical to enable high-resolution cluster detection comparable to that observed with scRNA-seq.
Comparable cell types identified with nuclei and cells
We used hierarchical clustering of median gene expression values for each cluster to determine the relationships between clusters. This analysis revealed that cluster relationships represented as dendrograms are similar for nuclei and cells (Figure \ref{985661}A). We compared the 11 clusters identified from both the single nucleus and single cell datasets to previously reported cell types in mouse VISp \cite{Tasic2016a}. Based on highly correlated (r > 0.85) expression of hundreds of marker genes, each cluster corresponds to a reported cell type (Figure \ref{953199}A). Conserved marker gene expression (Figure \ref{985661}B, \ref{953199}B) confirmed that the same 11 cell types were identified with nuclei and cells (Figure \ref{985661}C). These cell types included nine excitatory neuron types from layers 4-6 and two inhibitory interneuron types. Matched cluster proportions were mostly consistent, except two closely related layer 5a subtypes were under- (L5a Batf3) or over-represented (L5a Hsd11b1) among cells (Figure \ref{953199}C). This result demonstrated that the initial matching of cells to nuclei was relatively unbiased.
Since most cytoplasmic transcripts are spliced, intronic reads should be derived from nuclear transcripts. We hypothesized that we could estimate nuclear gene expression in whole cells by using only intronic reads. Indeed, average expression levels of genes were highly correlated in cells and nuclei when using intronic but not exonic reads (Figure \ref{953199}D). Furthermore, a dendrogram based on the median expression (quantified using only intronic reads) of nuclei and cell clusters paired all matching cell types, except for two closely related layer 5b subtypes (Figure \ref{985661}D). Therefore, intronic reads facilitate comparisons between data sets derived from snRNA-seq and scRNA-seq, although some expression differences remain. A dendrogram based on exonic reads grouped clusters first by sample type (nuclei and cells) and then by broad cell class (inhibitory and excitatory neurons). Grouping of samples by type was likely due to differences in cytoplasmic transcripts that were profiled in cells but not in nuclei.
While we detected comparable cell types using nuclei and cells, we expected that gene expression captured with cells likely included additional information from cytoplasmic transcripts. We compared the separation of matched pairs of clusters based on co-clustering and found that most nuclei and cell clusters were similarly distinct, but using single cell data significantly increased the separation of two pairs of similar types: L4 Arf5 from L5a Hsd11b1 and L5b Cdh13 from L5b Tph2 (Figure \ref{985661}E). Next, we compared how well genes marked one or more cell types by calculating the localization of gene expression to clusters independent of the number of clusters with expression. Briefly, for each gene, the proportion of nuclei or cells with expression above background noise (CPM > 1) was calculated for each cluster, and a marker score was calculated as the sum of the squared differences in proportions divided by the sum of the differences in proportions. Scores range from zero (ubiquitous or no expression) to one (perfectly binary expression in a subset of clusters). On average, marker scores were 15% higher in cells than nuclei due to fewer expression dropouts in cells ( Figure \ref{985661}F). Better detection of marker genes contributes to the mildly improved cluster separation when single cell data was used.
Nuclear content varies among cell types and for different transcripts
We estimated the nuclear proportion of mRNA for each cell type in two ways. Transcripts in the cytoplasm are spliced so intronic reads should be restricted to the nucleus. First, we estimated the nuclear RNA proportion by calculating the ratio of the percentage of intronic reads in cells to the percentage of intronic reads in nuclei (Figure \ref{886252}A). Second, we estimated nuclear proportions by selecting three genes (Malat1, Meg3, and Snhg11) with the highest expression in nuclei (Figure \ref{953199}D) and calculating the ratio of the average expression in cells versus nuclei (Figure \ref{886252}B and Figure \ref{684950}A). Both methods predicted that L4 Arf5 and L5a Hsd11b1 had a significantly larger proportion of transcripts located in the nucleus compared to other cell types (Figure \ref{886252}C).
Based on the comparison of scRNA-seq and snRNA-seq data, we estimate that L4 types have high nuclear to cell volume (~50%), whereas L5 types have lower nuclear to cell volume. To evaluate this finding, we measured nucleus and soma sizes of different cell types in situ. These types were labeled by fluorescent proteins in transgenic mice containing different Cre-transgenes and a Cre-reporter. Nr5a1-Cre and Scnn1a-Tg3-Cre mice almost exclusively label two cell types (L4 Arf5 and L5a Hsd11b1), whereas Rbp4-Cre mice label all layer 5 cell types including L5a Hsd11b1 (Figure \ref{684950}B and Table S5) \cite{Tasic2016a}. We measured the nuclear and cell sizes in situ, and calculated the nuclear proportion of each cell as the ratio of nuclear to soma volume (Figure \ref{684950}C). We found that the average nuclear proportion was significantly lower for layer 5 cells compared to layer 4 cells, as predicted based on RNA-seq data (Figure \ref{886252}D).
In addition, nuclear proportion estimates based on in situ size measurements were systematically higher than predicted for layer 5 but not layer 4 neurons. This could be the result of under-estimating the soma volume based on cross-sectional area measurements of these large non-spherical (pyramidal) neurons. Alternatively, layer 5 neuronal nuclei may have a lower concentration of nuclear transcripts, or there may be cell type-specific biases in our RNA-seq based estimates. To test whether layer 4 or layer 5 neurons were exceptional compared to neurons in other layers, we performed an unbiased survey of nuclear proportions across the full depth of the cortex. We found that layer 5 neurons tend to be larger and have proportionally smaller nuclei than other cortical neurons (Figure \ref{684950}D). This feature of L5 neurons has been observed in rat primary visual cortex \cite{Sigl-Glöckner2017}.
Next, we determined the nuclear versus cytoplasmic distribution of transcripts for individual genes. The nuclear proportion of 11,932 transcripts was estimated by the ratio of nuclear to whole cell expression multiplied by the overall nuclear fraction of each cell type and averaged across cell types (Table S6). Different functional classes of genes had strikingly different nuclear proportions (Figure \ref{886252}E). Many non-coding transcripts were localized in the nucleus, but some were abundantly expressed in the cytoplasm, such as the long non-coding RNA (lncRNA) Tunar that is highly enriched in the brain, is conserved across vertebrates, and has been associated with striatal pathology in Huntington's disease \cite{Lin2014}. Most protein-coding transcripts were expressed in both the nucleus and cytoplasm with a small number restricted to the nucleus, including the Parkinson's risk gene Park2. We found that pseudogenes were almost exclusively cytoplasmic and were highly enriched for house-keeping functions.
We compared our estimates of nuclear enrichment in cortex to mouse liver and pancreas based on data from \citet{Halpern2015} and found moderately high correlation (r = 0.61) between 4,373 mostly house-keeping genes that were expressed in all three tissues. Moreover, the shape of the distributions of nuclear transcript proportions was highly similar between tissues with slightly higher proportions estimated in this study. These results suggest that the mechanisms regulating the spatial localization of these transcripts -- for example, rates of nuclear export and cytoplasmic degradation \cite{Halpern2015} -- are generally conserved across disparate cell types.
Surprisingly, non-coding genes and pseudogenes are better markers of cell types, on average, than protein-coding genes (Figure \ref{886252}F). lncRNAs are known to have specific expression among diverse human cell lines \cite{Djebali2012}, and we show that this is also true for neuronal types in the mouse cortex. Many pseudogene transcripts, most of which are enriched in the cytoplasm, were selectively depleted in the two cell types, L4 Arf5 and L5a Hsd11b1. This is consistent with our previous analysis that showed that neurons of these types have relatively less cytoplasm. However, on average, cytoplasmic-enriched transcripts were slightly worse cell type markers than nucleus-enriched transcripts (Figure \ref{886252}G). Overall, we find that nuclei contain transcripts of the vast majority of genes that are informative for cell type identity.
Finally, we compared our estimates of nuclear localization of transcripts for three genes -- Calb1, Grik1, and Pvalb -- to relative counts of transcripts in nuclei and cytoplasm using multiplex RNA fluorescence in situ hybridization (mFISH). We found that the relative nuclear proportions estimated by scRNA-seq and mFISH were consistent although mFISH estimates showed more balanced localization of transcripts between the nucleus and cytoplasm (Figure \ref{886252}H). Both methods confirmed that Pvalb transcripts were mostly excluded from the nucleus, and this explained why 2 out of 35 nuclei in the Pvalb-positive interneuron type (Pvalb Wt1) had no detectable Pvalb expression, whereas all cells of this cell type had robust Pvalb expression.
Discussion
Unlike scRNA-seq, snRNA-seq enables transcriptomic profiling of tissues that are refractory to whole cell dissociation and of archived frozen specimens. snRNA-seq is also less susceptible to perturbations in gene expression that occur during cell isolation, such as increased expression of immediate early genes that can obscure transcriptional signatures of neuronal activity \cite{Lacar2016}. However, these advantages come at the cost of profiling less mRNA, and until this study, it was unclear if the nucleus contained sufficient number and diversity of transcripts to distinguish highly related cell types.