Introduction
Understanding neural circuits requires characterization of their cellular components. Cell types in mammalian brain have been defined based on shared morphological, electrophysiological and, more recently, molecular properties (REF). Single cell RNA-sequencing (scRNA-seq) has emerged as a high-throughput method for quantification of the majority of transcripts in thousands of cells. scRNA-seq data have revealed diverse cell types in many mouse brain regions, including neocortex \cite{Tasic_Yao_Smith_Graybuck_Nguyen_2017,Tasic2016,Zeisel2015}, hypothalamus \cite{Campbell2017}, and retina \cite{Shekhar2016,Macosko2015}.
However, single cell RNA-seq profiling does not provide an unbiased survey of neural cell types. Some cell types are more vulnerable to the tissue dissociation process and are underrepresented in the final data set. For example, in mouse neocortex, fast-spiking parvalbumin-positive interneurons and deep-projecting glutamatergic neurons in layer 5b are observed in lower than their expected proportions and need to be selectively enriched using Cre-driver lines \cite{Tasic2017} for sufficient sampling. In adult human neocortex, neurons largely do not survive dissociation, whereas non-neuronal cells are highly over-represented \cite{Darmanis2015}. In contrast to whole cells, nuclei are more resistant to mechanical assaults and can be isolated from frozen tissue (@REF). Single nuclei have been shown to provide sufficient gene expression information to define relatively broad cell classes in adult human brain \cite{Lake2016,29227469} and mouse hippocampus \cite{Habib2016}.
It remains unknown if the nucleus contains sufficient diversity and number of transcripts to enable discrimination of closely related cell types at a resolution comparable to whole cells. A recent study compared clustering results for single nuclei and whole cells isolated from mouse somatosensory cortex \cite{Lake2017}, but it only showed similar ability to distinguish two very different cell classes: superficial- and deep-layer excitatory neurons.
In this study, we compared 463 matched nuclei and whole cells from layer 5 of mouse primary visual cortex (VISp) to investigate differences in single nucleus and single cell transcriptomes. We selected this brain region because it contains a known variety of distinct yet highly similar cell types that would reveal the cell type detection limit of RNA-seq data obtained from single cells and nuclei \cite{Tasic2017}. We used the same primary cell source and processed cells and nuclei with the same transcriptomic profiling method, to directly compare the resolution limit of cell type detection from well-matched sets of single cells and nuclei.
Results
RNA-Seq profiling of single nuclei and single cells
We isolated 487 NeuN-positive single nuclei from layer 5 of mouse VISp using fluorescence activated cell sorting (FACS). Anti-NeuN staining was performed to enrich for neurons. In parallel, we isolated 12,866 tdT-positive single cells by FACS from all layers of mouse VISp and a variety of Cre-driver lines. For both single nuclei and cells, poly(A)-transcripts were amplified with SMART-Seq v4, cDNA was tagmented by Nextera XT and resulting libraries were sequenced to an average depth of 2.5 million reads (Figure \ref{856440}A). RNA-seq reads were mapped to mouse genome by STAR aligner (REF). Gene expression was quantified as the sum of intronic and exonic reads per gene and was normalized as counts per million (CPM) and log2-transformed. For each nucleus and cell, the probability of gene detection dropouts were estimated as a function of average expression level based on empirical noise models (REF).
463 out of 487 single nuclei (95%) passed quality control metrics, and each nucleus was matched to the most similar nucleus and cell based on the maximum correlated expression of all genes, weighted for gene dropouts. Nuclei had similarly high pairwise correlations to cells as to other nuclei suggesting that cells and nuclei were well matched (Figure \ref{856440}B). As expected, matched cells were derived almost exclusively from layer 5 and adjacent layers 4 and 6 (Figure \ref{259770}B) and from Cre-driver lines that labeled cells in layer 5 (Figure \ref{856440}C and Figure \ref{259770}A,C). The small minority of matched cells isolated from superficial layers were GABAergic interneurons that have been detected in many layers (REF, Tasic 2017).
Comparison of nuclear and whole cell transcriptomes
Single cell RNA-seq profiles nuclear and cytoplasmic transcripts while single nucleus RNA-seq only profiles nuclear transcripts. Therefore, we expect that RNA-seq reads will differ between nuclei and cells. Indeed, we find that for most nuclei and cells, more than 90% of reads aligned to the mouse genome, but in nuclei, more than 50% of these reads did not map to known spliced transcripts but were located within genic bounds and were annotated as intronic reads (Figure \ref{835862}A). In contrast, the majority of cells had less than 30% intronic reads with a minority of cells having closer to 50% intronic reads, similar to nuclei. Cells with a higher proportion of intronic reads may have a higher proportion of nuclear transcripts since most unspliced transcripts are in the nucleus. Median gene detection based on exonic reads was lower for nuclei (~5000 genes) than cells (~9500). Including both intronic and exonic reads increased gene detection for nuclei (~7000) and cells (~11,000), demonstrating that intronic reads provided additional information not captured by exons. Whole brain control RNA displayed read mapping distribution similar to cells, consistent with dissociated single cells capturing the majority of transcripts in the whole cell. Only 10 pg of control RNA was used, and consequently gene detection was similar to single nuclei.
Transcript dropouts likely result from both technical and biological variability, and both effects are more pronounced in nuclei than cells. When transcript dropouts were accounted for, correlations between pairs of nuclei and pairs of cells increased, although cell-cell similarities remained significantly higher (Figure \ref{835862}B). The vast majority of genes show similar detection in nuclei and cells, while 7217 genes (17%) are detected in at least 25% more cells than nuclei (Figure \ref{835862}C and Table S1). 8614 genes have significantly higher expression in cells than nuclei (>1.5 fold expression; FDR < 0.05) and are involved in housekeeping functions such as mRNA processing and protein translation (Figure \ref{835862}D). Genetic markers of neuronal activity, such as immediate early genes Fos, Egr1, and Arc also had increased expression in cells, potentially due to neuron stress during tissue dissociation. 159 genes have higher expression in nuclei (Figure \ref{835862}D and Table S2), and they appear central to neuronal identity as they include connectivity and signaling genes (Figure \ref{898251}A and Table S4). On average, nucleus-enriched genes are more than 10-fold longer than cell-enriched genes based on genomic gene length due to the the inclusion of intronic reads (Figure \ref{898251}B), as reported by \citet{Lake2017}. When only exonic reads were used to quantify expression in nuclei and cells, a different set of 146 genes were significantly enriched in nuclei (Table S3), and many genes were involved in mRNA splicing.
Intronic reads are required for similar discrimination of nuclei and cell clusters
Next, we applied an iterative clustering procedure (see Methods and Figure \ref{859252}) to identify clusters of single nuclei and cells that share gene expression profiles. To assess cluster robustness, we repeated clustering 100 times using 80% random sub-samples of nuclei and cells and calculated the proportion of clustering runs that each pair of samples were clustered together. Co-clustering matrices were reordered using Ward's hierarchical clustering and represented as heatmaps with consistent clusters ordered as squares along the diagonal (Figure \ref{984877}A,B).
Clustering includes two steps -- variable gene selection and distance measurement -- that are particularly sensitive to expression quantification. We repeated clustering using intronic and exonic reads or only exonic reads for these steps and ordered co-clustering matrices to match the results using all reads for both steps. When using introns and exons, we found 11 distinct clusters of nuclei and cells, and clusters had similar cohesion (average within cluster co-clustering) and separation (average co-clustering difference with the closest cluster) (Figure \ref{984877}C). Excluding intronic reads for either clustering step reduced the number of clusters detected for nuclei but not cells. Therefore, intronic reads were critical to detect the same number of cell types using single nuclei and cells.
Equivalent cell types identified with nuclei and cells
Cluster relationships were represented as dendrograms that are remarkably similar for nuclei and cells (Figure \ref{985661}A). We compared the 11 clusters identified with single nuclei and cells to reported cell types in mouse VISp \cite{Tasic2016}. Each nuclei and cell cluster could be linked to a reported cell type (Figure \ref{953199}A) and to each other (Figure \ref{985661}B) based on correlated expression of cell type informative genes. Many genes contributed to high expression correlations (r > 0.85) for all cluster pairs (Figure \ref{953199}B). Conserved marker gene expression confirmed that the same 11 cell types were identified with nuclei and cells (Figure \ref{985661}C). These cell types included nine excitatory neuron types from layers 4-6 and two inhibitory interneuron types. Matched cluster proportions were mostly consistent, except two closely related layer 5a subtypes were under- (L5a Batf3) or over-represented (L5a Hsd11b1) among cells (Figure \ref{953199}C). This suggested that the initial matching of cells to nuclei was relatively unbiased.
We hypothesized that most intronic reads were mapped to nuclear transcripts, so quantifying gene expression in cells using only introns would approximate nuclear expression. This was supported by higher correlations of average expression across all nuclei and cells using intronic reads as compared to exonic reads (Figure \ref{953199}D). Thus, a dendrogram based on the median expression (quantified using only intronic reads) of nuclei and cell clusters paired all matching cell types, except for two closely related layer 5b subtypes (Figure \ref{985661}D). Therefore, intronic reads can help facilitate comparisons between data sets derived from single nucleus and single cell RNA-sequencing, although small expression differences remain. Strikingly, a dendrogram based on exonic reads grouped clusters first by sample type (nuclei and cells) and then by broad cell class (inhibitory and excitatory neurons). This likely reflected spliced transcripts enriched in the cytoplasm that were profiled in cells but not nuclei and would be quantified by exonic but not intronic reads.
While we detected the same cell types using nuclei and cells, we expected that gene expression captured with cells included additional information from cytoplasmic transcripts. We compared the separation of matched pairs of clusters based on co-clustering and found that all nuclei and cell clusters were similarly distinct, except one layer 4 type and one layer 5b type were significantly more distinct using single cells (Figure \ref{985661}E). Next, we compared how well genes marked cell types by calculating the degree of binary expression. Cell marker scores were, on average, 15% higher than nuclei scores due to fewer expression dropouts in cells (Figure \ref{985661}F), and this was consistent with mildly improved cluster separation.
Nuclear content varies among cell types and genes
We tested whether the bimodal distribution of intronic read mapping in cells (Figure \ref{835862}A) resulted from variable nuclear content among cell types. In fact, the percentage of intronic reads varied significantly across matched pairs of cell and nuclei clusters, with significantly more intronic reads in two cell clusters: L4 Arf5 and L5a Hsd11b1. We confirmed this cell type specific difference by comparing the expression of the three highly expressed (Figure \ref{953199}D), nuclear-enriched genes -- Malat1, Meg3, and Snhg11 -- in cells versus nuclei. All three genes had significantly higher expression in L4 Arf5 and L5a Hsd11b1 cells compared to other cell types (Figure \ref{886252}B and Figure \ref{684950}A). Nuclear proportions estimated based on the ratio of intronic reads were consistently higher than those based on relative expression levels (Figure \ref{886252}C), suggesting that introns are more highly restricted to the nucleus than transcripts of these three genes.
The relative number of transcripts in the nucleus versus the cytoplasm may vary due to the relative volume of these two compartments across cell types. We tested this hypothesis by measuring nucleus and soma sizes of different cell types that were labeled using three Cre-driver mouse lines. Nr5a1-Cre and Scnn1a-Tg3-Cre mice almost exclusively label two cell types (L4 Arf5 and L5a Hsd11b1) with a higher estimated nuclear proportion, while Rbp4-Cre mice label all layer 5 cell types including L5a Hsd11b1 (Figure \ref{684950}B and Table S5). We estimated the nuclear proportion of each cell as the ratio of nuclear to soma volume (Figure \ref{684950}C) and found that the average proportion was significantly less for layer 5 cells labeled by the Rbp4-Cre mouse, as predicted based on RNA-sequencing data (Figure \ref{886252}D). Nuclear proportion estimates based on size measurements were systematically higher than predicted for layer 5 but not layer 4 neurons. This could be the result of under-estimating the soma volume based on cross-sectional area measurements of these large pyramidal-shaped neurons. Alternatively, layer 5 neuronal nuclei may have a lower density of nuclear transcripts or there may be cell type specific biases in our RNA-sequencing based estimates. Next, we performed an unbiased survey of nuclear proportions across the full depth of cortex to test whether layer 4 or layer 5 neurons were exceptional compared to neurons in other layers. We found that layer 5 neurons tend to be larger and have proportionally smaller nuclei (Figure \ref{684950}D) than other cortical neurons, as was recently reported in rat primary visual cortex \cite{Sigl-Glöckner2017}.
The nuclear proportion of 11,932 genes was estimated by the ratio of nuclear to whole cell expression multiplied by the overall nuclear fraction of each cell type and averaged across cell types (Table S6). Functional classes of genes had strikingly different nuclear proportions (Figure \ref{886252}E). Many protein non-coding genes were localized in the nucleus, but some were abundantly expressed in the cytoplasm, such as the long noncoding RNA (lncRNA) Tunar that is highly enriched in the brain, is conserved across vertebrates, and has been associated with striatal pathology in Huntington's disease \cite{Lin2014}. Most protein coding gene transcripts were expressed in both the nucleus and cytoplasm with a small number restricted to the nucleus, including the Parkinson's risk gene Park2. Pseudogenes were mostly housekeeping genes and were almost exclusively cytoplasmic.
We compared our estimates of nuclear enrichment in cortex to mouse liver and pancreas based on data from \citet{Bahar2015} and found moderately high correlation (r = 0.61) between individual genes. Moreover, the shape of the distributions of nuclear proportions was highly similar between tissues with slightly higher proportions estimated in this study. These results suggest that the mechanisms regulating the spatial localization of transcripts -- for example, rates of nuclear export and cytoplasmic degradation \cite{Bahar2015a} -- are conserved across cell types.
Surprisingly, non-coding genes and pseudogenes are better markers of cell types, on average, than protein-coding genes (Figure \ref{886252}F). lncRNAs are known to have more specific expression among diverse human cell lines \cite{Djebali2012}, and we show that this is also true for neuronal types in the mouse cortex. Many pseudogenes, most of which are in the cytoplasm, are selectively depleted in the two cell types, L4 Arf5 and L5a Hsd11b1. This is consistent with our previous analysis that showed that neurons of these types have relatively less cytoplasm. We also find that nuclear enriched genes are slightly better cell type markers than cytoplasmic enriched genes, although this is highly variable across genes (Figure \ref{886252}G).
Finally, we compared our estimates of nuclear localization of three genes -- Calb1, Grik1, and Pvalb -- to relative counts of transcripts in nuclei and cytoplasm using multiplex fluorescent in situ hybridization (mFISH). We found that the relative nuclear proportions estimated by RNA-seq and mFISH were consistent although the absolute levels were quite variable (Figure \ref{886252}H). Both methods confirmed that Pvalb transcripts were mostly excluded from the nucleus, and this explained why 2 out of 35 nuclei in the Parvalbumin-positive interneuron type (Pvalb Wt1) had no detectable Pvalb expression while all cells of this cell type had robust Pvalb expression.
Discussion
- Furthermore, we estimated the nuclear proportion of transcripts for all genes that can help guide selection of nuclear-enriched marker genes for discriminating cell types in situ using spatial transcriptomics methods.
- Working hypothesis: inclusion of intronic reads increases S/N and expression of long genes that are brain-enriched [Gabel 2015 Nature - Mecp2 paper] and more informative for cell type identity than the average gene. Remains to be seen if there are other features of intronic reads that improve cell type discrimination.
- IEGs lower expression in nuclei - likely due to differences in dissociation and nuclei better for studies of activity induced changes \cite{Wu_Pan_Zuo_Li_Hong_2017}.
- More genes are captured by whole cells but many of them are house-keeping genes in the cytoplasm so do no improve cell type identification via clustering.
- Gene dropout higher in nuclei which lowers marker scores but not enough to decrease cell type detection, although cluster separation is somewhat better between highly similar cell types. We expect there may be differences in cell type detection at even finer resolutions as larger numbers of nuclei and cells are profiled.
- L4 types differ significantly in nuclear content. This is only partially explained by increased ratio of nuclear to soma volumes. We also know that cytoplasmic enriched genes are depleted in these cells including housekeeping genes which contributes to cell type identify (negative markers). We cannot rule out the possibility that these cells were selectively damaged during dissociation and sorting and lost some of their cytoplasm, but they are high quality cells based on all of our quality control metrics. [Biological significance??]
- Nuclear genes better markers on average - so not missing too much with single nuclei rna-seq
- Cyto pseudogenes are depleted in 2 L4/5a types - ribosomal, cytochrome oxidase, housekeeping genes likely to be widely and highly expressed in germline cells \cite{Zhang_Carriero_Gerstein_2004}
Materials and Methods
Tissue preparation
Tissue samples were obtained from adult (postnatal day (P) 53-59)) male and female transgenic mice carrying a Cre transgene and a Cre-reporter transgene. Mice were anesthetized with 5% isoflurane and intracardially perfused with either 25 or 50 ml of ice cold, oxygenated artificial cerebral spinal fluid (ACSF) at a flow rate of 9 ml per minute until the liver appeared clear, or the full volume of perfusate had been flushed through the vasculature. The ACSF solution consisted of 0.5mM CaCl2, 25mM D-Glucose, 98mM HCl, 20mM HEPES, 10mM MgSO4, 1.25mM NaH2PO4, 3mM Myo-inositol, 12mM N-acetylcysteine, 96mM N-methyl-D-glucamine, 2.5mM KCl, 25mM NaHCO3, 5mM sodium L-Ascorbate, 3mM sodium pyruvate, 0.01mM Taurine, and 2mM Thiourea. The brain was then rapidly dissected and mounted for coronal slice preparation on the chuck of a Compresstome VF-300 vibrating microtome (Precisionary Instruments). Using a custom designed photodocumentation configuration (Mako G125B PoE camera with custom integrated software), a blockface image was acquired before each section was sliced at 250 μm intervals. The slice was then hemisected along the midline, and both hemispheres were then transferred to chilled, oxygenated ACSF.
Each slice-hemisphere was transferred into a Sylgard-coated dissection dish containing 3 ml of chilled, oxygenated ACSF. Brightfield and fluorescent images between 4X and 20X were obtained of the intact tissue with a Nikon Digital Sight DS-Fi1 or a Sentech STC-SC500POE camera mounted to a Nikon SMZ1500 dissecting microscope. To guide anatomical targeting for dissection, boundaries were identified by trained anatomists, comparing the blockface image and the slice image to a matched plane of the Allen Reference Atlas. In general, three to five slices were sufficient to capture the targeted region of interest, allowing for expression analysis along the anterior/posterior axis. The region of interest was then dissected and both brightfield and fluorescent images of the dissections were acquired for secondary verification. The dissected regions were transferred in ACSF to a microcentrifuge tube, and stored on ice. This process was repeated for all slices containing the target region of interest, with each region of interest deposited into a new microcentrifuge tube.
For whole cell dissociation, after all regions of interest were dissected, the ACSF was removed and 1 ml of a 2 mg/ml pronase in ACSF solution was added. Tissue was digested at room temperature (approximately 22°C) for a duration that consisted of adding 15 minutes to the age of the mouse (in days; i.e., P53 specimen had a digestion time of 68 minutes). After digestion, the pronase solution was removed and replaced by 1 ml of ACSF supplemented with 1% Fetal Bovine Serum (FBS). The tissue was washed two more times with the same solution and the sample was then triturated using fire-polished glass pipettes of decreasing bore sizes (600, 300, and 150 μm). The cell suspension was incubated on ice in preparation for fluorescence-activated cell sorting (FACS). FACS preparation involved adding 4’-6-diamidino-2-phenylindole (DAPI) at a final concentration of 4 μg/ml to label dead (DAPI+) versus live (DAPI-) cells. The suspension was then filtered through a fine-mesh cell strainer to remove cell aggregates. Cells were sorted by excluding DAPI positive events and debris, and gating to include red fluorescent events (tdTomato-positive cells). Single cells were collected into strip tubes containing 11.5μl of collection buffer (SMART-Seq v4 lysis buffer 0.83x, Clontech #634894), RNase Inhibitor (0.17U/µl), and ERCCs (External RNA Controls Consortium, MIX1 at a final dilution of 1x10-8) \cite{16179916,25150836}. After sorting, strip tubes containing single cells were centrifuged briefly and then stored at -80°C.
For nuclei isolation, dissected regions of interest were transferred to microcentrifuge tubes, snap frozen in a slurry of dry ice and ethanol, and stored at -80°C until the time of use. To isolate nuclei, frozen tissues were placed into a homogenization buffer that consisted of 10mM Tris pH 8.0, 250mM sucrose, 25mM KCl, 5mM MgCl2, 0.1% Triton-X 100, 0.5% RNasin Plus RNase inhibitor (Promega), 1X protease inhibitor (Promega), and 0.1mM DTT. Tissues were placed into a 1ml dounce homogenizer (Wheaton) and homogenized using 10 strokes of the loose dounce pestle followed by 10 strokes of the tight pestle to liberate nuclei . Homogenate was strained through a 30µm cell strainer (Miltenyi Biotech) and centrifuged at 900xg for 10 minutes to pellet nuclei. Nuclei were then resuspended in staining buffer containing 1X PBS supplemented with 0.8% nuclease-free BSA and 0.5% RNasin Plus RNase inhibitor. Mouse anti-NeuN antibody (EMD Millipore, MAB377, Clone A60) was added to the nuclei at a final dilution of 1:1000 and nuclei suspensions were incubated at 4°C for 30 minutes. Nuclei suspensions were then centrifuged at 400xg for 5 minutes and resuspended in clean staining buffer (1X PBS, 0.8% BSA, 0.5% RNasin Plus). Secondary antibody (goat anti-mouse IgG (H+L), Alexa Fluor 594 conjugated, ThermoFisher Scientific) was applied to nuclei suspensions at a dilution of 1:5000 for 30 minutes at 4°C. After incubation in secondary antibody, nuclei suspensions were centrifuged at 400xg for 5 minutes and resuspended in clean staining buffer. Prior to FACS, DAPI was applied to nuclei suspensions at a final concentration of 0.1µg/ml and nuclei suspensions were filtered through a 35µm nylon mesh to remove aggregates. Single nuclei were captured by gating on DAPI-positive events, excluding debris and doublets, and then gating on Alexa Fluor 594 (NeuN) signal. Strip tubes containing FACS isolated single nuclei were then briefly centrifuged and frozen at -80°C.