Cluster dendrograms
Clusters were arranged by transcriptomic similarity based on hierarchical clustering. First, the average expression level of the top 1200 marker genes (i.e. highest beta scores) was calculated for each cluster. A correlation-based distance matrix (\(D_{xy}=\frac{1-\rho\left(x,y\right)}{2}\)) was calculated, and complete-linkage hierarchical clustering was performed using the "hclust" R function with default parameters. The resulting dendrogram branches were reordered to show inhibitory clusters followed by excitatory clusters, with larger clusters first, while retaining the tree structure. Note that this measure of cluster similarity is complementary to the co-clustering separation described above. For example, two clusters with similar gene expression patterns but a few binary marker genes may be close on the tree but highly distinct based on co-clustering.
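The distance calculation and complete-linkage merging described above can be sketched in a few lines of Python. This is an illustration only: the original analysis used the R function "hclust", tie-breaking and branch ordering here may differ from it, and the cluster names and expression values are hypothetical toy data.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_distance(x, y):
    """D_xy = (1 - rho(x, y)) / 2, which ranges from 0 to 1."""
    return (1.0 - pearson(x, y)) / 2.0

def distance_matrix(profiles):
    """Pairwise distances between clusters, as a dict of dicts."""
    names = list(profiles)
    return {a: {b: correlation_distance(profiles[a], profiles[b])
                for b in names} for a in names}

def complete_linkage(D):
    """Naive agglomerative clustering; returns merges as (left, right, height)."""
    clusters = [(name,) for name in D]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: the distance between two clusters is the
                # largest distance between any pair of their members.
                height = max(D[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or height < best[0]:
                    best = (height, i, j)
        height, i, j = best
        merges.append((clusters[i], clusters[j], height))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical average marker expression for three clusters (toy data).
profiles = {
    "inh_1": [5.0, 4.0, 0.5, 0.2],
    "inh_2": [4.5, 4.2, 0.4, 0.1],
    "exc_1": [0.3, 0.5, 6.0, 5.5],
}
D = distance_matrix(profiles)
merges = complete_linkage(D)
```

In this toy example the two inhibitory clusters merge first, and the excitatory cluster joins at a larger height.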
Matching clusters based on marker gene expression
Nuclei and cell clusters were independently compared to published mouse VISp cell types \cite{Tasic2016a}. The proportion of nuclei or cells expressing each gene with CPM > 1 was calculated for all clusters. Approximately 400 genes were markers in both data sets (beta score > 0.3) and were expressed in the majority of samples of between one and five clusters. Markers expressed in more than five clusters were excluded to increase the specificity of cluster matching. Correlations were calculated between all pairs of clusters across these genes, weighted by beta scores to increase the influence of more informative genes. Heatmaps were generated to visualize all cluster correlations. All nuclei and cell clusters had reciprocal best matching clusters from Tasic et al. and were labeled based on these reported cluster names.
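A beta-score-weighted correlation between two clusters can be sketched as follows. The text does not state how beta scores from the two data sets were combined into one weight per gene, so the single weight vector here is an assumption, and all values are hypothetical.

```python
import math

def weighted_correlation(x, y, w):
    """Weighted Pearson correlation of x and y with non-negative weights w."""
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y))
    return cov / math.sqrt(var_x * var_y)

# Hypothetical proportions of samples expressing four marker genes (CPM > 1)
# in one nuclear cluster and one published cluster, with one beta score per gene.
prop_nuclei = [0.90, 0.80, 0.10, 0.20]
prop_published = [0.85, 0.90, 0.20, 0.70]
beta = [0.9, 0.8, 0.7, 0.3]

r_weighted = weighted_correlation(prop_nuclei, prop_published, beta)
r_unweighted = weighted_correlation(prop_nuclei, prop_published, [1.0] * 4)
```

With equal weights the function reduces to the ordinary Pearson correlation, so the two values show how much the beta weighting shifts a given match.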
Next, nuclei and cell clusters were directly compared using the above analysis. All 11 clusters had reciprocal best matches that were consistent with cluster labels assigned based on similarity to published types. The most highly conserved marker genes of matching clusters were identified by selecting genes expressed in a single cluster (>50% of samples with CPM > 1) and with the highest minimum beta score between nuclei and cell clusters. Two additional marker genes were identified that discriminated two closely related clusters. Violin plots of marker gene expression were constructed with each gene on an independent, linear scale.
Nuclei and cell clusters were also compared by calculating average cluster expression based only on intronic or exonic reads and calculating a correlation-based distance using the top 1200 marker genes as described above. Hierarchical clustering was applied to all clusters quantified using the two sets of reads. In addition, the average log2(CPM + 1) expression across all nuclei and cells was calculated using intronic or exonic reads.
Cluster separation was calculated for individual nuclei and cells as the average within-cluster co-clustering of each sample minus the maximum average between-cluster co-clustering. Separations for matched pairs of clusters were visualized with box plots and compared using a Student's t-test, and significance was tested after Bonferroni correction for multiple testing. Finally, a linear model was fit to beta marker scores for genes that were expressed in at least one but not all cell and nuclear clusters, and the intercept was set to zero.
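The separation score defined above can be sketched as follows, assuming a symmetric co-clustering matrix whose entry (i, j) is the fraction of clustering iterations in which samples i and j were assigned to the same cluster. Excluding each sample's own diagonal entry and scoring singleton clusters with a within-cluster value of zero are assumptions of this sketch; the matrix is hypothetical toy data.

```python
def cluster_separation(cocluster, labels):
    """Per-sample separation: mean within-cluster co-clustering minus the
    largest mean co-clustering with any other cluster."""
    members = {}
    for idx, label in enumerate(labels):
        members.setdefault(label, []).append(idx)

    separations = []
    for i, own in enumerate(labels):
        means = {}
        for label, idxs in members.items():
            others = [j for j in idxs if j != i]  # skip the sample itself
            if others:
                means[label] = sum(cocluster[i][j] for j in others) / len(others)
        within = means.pop(own, 0.0)  # assumption: singleton clusters score 0
        between = max(means.values()) if means else 0.0
        separations.append(within - between)
    return separations

# Hypothetical co-clustering frequencies for four samples in two clusters.
cocluster = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.0, 0.1],
    [0.1, 0.0, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
labels = ["A", "A", "B", "B"]
separations = cluster_separation(cocluster, labels)
```

Values near 1 indicate samples that almost always co-cluster with their own cluster and rarely with any other.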
Estimating proportions of nuclear transcripts
The nuclear proportion of transcripts was estimated in two ways. First, all intronic reads were assumed to be from transcripts localized to the nucleus so that the proportion of intronic reads measured in cells should decrease linearly with the nuclear proportion of the cell as nuclear reads are diluted with cytoplasmic reads. For each cell type, the nuclear proportion was estimated as the proportion of intronic reads in cells divided by the proportion of intronic reads in matched nuclei. Second, the nuclear proportion was estimated as the average ratio of cell to nuclear expression (CPM) using only exonic reads of three highly expressed nuclear genes (Snhg11, Malat1, and Meg3). The standard deviation of nuclear proportion estimates was calculated based on standard error propagation of variation in intronic read proportions and expression levels. Nuclear proportion estimates were compared with linear regression, and the estimate based on relative expression levels was used for further analysis.
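The two estimators described above reduce to simple ratios. The sketch below uses hypothetical numbers for a single cell type and omits the error propagation step; the function names are illustrative and not taken from the original analysis code.

```python
def nuclear_proportion_from_introns(intronic_frac_cells, intronic_frac_nuclei):
    """Estimate 1: intronic read proportion in cells divided by the intronic
    read proportion in matched nuclei."""
    return intronic_frac_cells / intronic_frac_nuclei

def nuclear_proportion_from_genes(cell_cpm, nuclear_cpm):
    """Estimate 2: average cell-to-nuclear ratio of exonic CPM across
    reference genes assumed to be localized to the nucleus."""
    ratios = [cell_cpm[gene] / nuclear_cpm[gene] for gene in cell_cpm]
    return sum(ratios) / len(ratios)

# Hypothetical values for one cell type.
estimate_introns = nuclear_proportion_from_introns(0.20, 0.50)

cell_cpm = {"Snhg11": 100.0, "Malat1": 2000.0, "Meg3": 300.0}
nuclear_cpm = {"Snhg11": 250.0, "Malat1": 4000.0, "Meg3": 1000.0}
estimate_genes = nuclear_proportion_from_genes(cell_cpm, nuclear_cpm)
```

In this toy case both estimators give a nuclear proportion of about 0.4, meaning roughly 40% of the cell's transcripts would be attributed to the nucleus.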
The nuclear proportion of transcripts for all genes was estimated for each cell type as the ratio of average expression (CPM) in nuclei versus matched cells multiplied by the nuclear proportion of all transcripts. Estimated proportions greater than 1 were set equal to 1 for each cell type, and a weighted average proportion was calculated for each gene with weights equal to the average log2(CPM + 1) expression in each cell type. 11,932 genes were expressed in at least one nuclear or cell cluster (>50% samples expressed with CPM > 1) and were annotated as one of three gene types -- protein-coding, protein non-coding, or pseudogene -- using gene metadata from NCBI (ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/Mammalia/Mus_musculus.gene_info.gz; downloaded 10/12/2017). For each type, histograms of gene counts with different nuclear proportions were generated. Next, beta marker score distributions were visualized as violin plots, and differences across gene types were compared with a Kruskal-Wallis rank sum test followed by Wilcoxon signed rank unpaired tests. Finally, genes were grouped into 10 bins of estimated nuclear proportions, from high cytoplasmic enrichment to high nuclear enrichment, and beta marker score distributions were visualized as box plots. A linear regression was fit to marker scores versus nuclear proportion.
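The per-gene calculation at the start of this paragraph can be sketched as follows. The text does not say whether the log2(CPM + 1) weights were computed from nuclear or cell expression; this sketch assumes cell expression, skips cell types where the gene has zero CPM in cells, and uses hypothetical values.

```python
import math

def gene_nuclear_proportion(nuclear_cpm, cell_cpm, overall_nuclear_prop):
    """Weighted average nuclear proportion of one gene across cell types.

    Each argument is a list with one value per cell type.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for nuc, cell, overall in zip(nuclear_cpm, cell_cpm, overall_nuclear_prop):
        if cell <= 0:
            continue  # ratio undefined; skipping is an assumption of this sketch
        # Nuclear-to-cell expression ratio scaled by the overall nuclear
        # proportion of the cell type, capped at 1.
        proportion = min(1.0, (nuc / cell) * overall)
        weight = math.log2(cell + 1.0)  # assumption: weights from cell CPM
        weighted_sum += proportion * weight
        weight_total += weight
    return weighted_sum / weight_total

# Hypothetical gene measured in two cell types.
proportion = gene_nuclear_proportion(
    nuclear_cpm=[9.0, 7.5],
    cell_cpm=[3.0, 15.0],
    overall_nuclear_prop=[0.4, 0.5],
)
```

Here the first cell type gives a raw estimate of 1.2 that is capped at 1, the second gives 0.25, and the expression-weighted average is 0.5.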
Nuclear transcript proportions were compared to nuclear proportions estimated for mouse liver and pancreatic beta cells based on data from \citet{Halpern2015}. Ratios of normalized nuclear and cytoplasmic transcript counts were calculated in four tissue replicates. Average ratios were calculated for genes with at least one count in either fraction in at least one tissue. Nuclear proportion estimates for all genes with data from both data sets (n = 4373) were compared with Pearson correlation, a linear model with intercept set equal to zero, and histograms with a bin width of 0.02.
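A linear model with the intercept fixed at zero has a closed-form slope, sketched below with hypothetical paired estimates. This illustrates the model form only and is not the original analysis code.

```python
def fit_through_origin(x, y):
    """Least-squares slope of y = slope * x with the intercept fixed at zero."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

# Hypothetical nuclear proportion estimates for three genes in two data sets.
this_study = [0.1, 0.5, 0.9]
other_study = [0.2, 1.0, 1.8]
slope = fit_through_origin(this_study, other_study)
```

A slope near 1 would indicate that the two data sets agree on the absolute scale of nuclear proportions, not only on their rank order.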
Colorimetric in situ hybridization
In situ hybridization data for mouse cortex was from the Allen Mouse Brain Atlas \cite{17151600}. All data is publicly accessible through www.brain-map.org. Data was generated using a semiautomated technology platform as described in \citet{17151600}. Mouse ISH data shown is from primary visual cortex (VISp) in the Paxinos Atlas \cite{paxinos2013paxinos}.
Multiplex fluorescence RNA in situ hybridization and quantification of nuclear versus cytoplasmic transcripts
The RNAscope multiplex fluorescent kit was used according to the manufacturer’s instructions for fresh frozen tissue sections (Advanced Cell Diagnostics), with the exception that 16 µm tissue sections were fixed with 4% PFA at 4°C for 60 minutes and the protease treatment step was shortened to 15 minutes at room temperature. Probes used to identify nuclear and cytoplasmic enriched transcripts were designed antisense to the following mouse genes: Calb1, Grik1, and Pvalb. Following hybridization and amplification, stained sections were imaged using a 60X oil immersion lens on a Nikon TiE epifluorescence microscope.
To determine if spots fell within the nucleus or cytoplasm, a boundary was drawn around the nucleus to delineate its border using measurement tools within Nikon Elements software. To delineate the cytoplasmic boundary of each cell, a circle with a diameter of 15 µm was drawn and centered over the cell (Fig. 5). RNA spots in each channel were quantified manually using counting tools available in the Nikon Elements software. Spots that fell fully within the interior boundary of the nucleus were classified as nuclear transcripts. Spots that fell outside of the nucleus but within the circle that defined the cytoplasmic boundary were classified as cytoplasmic transcripts. Additionally, if spots intersected the exterior boundary of the nucleus, they were classified as cytoplasmic transcripts. To prevent double counting of spots and ambiguities in assigning spots to particular cells, labeled cells whose boundaries intersected at any point along the circumference of the circle delineating their cytoplasmic boundary were excluded from the analysis. A linear regression was fit to nuclear versus soma probe counts, and the slope was used to estimate the nuclear proportion.
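The final regression step can be sketched with ordinary least squares. The sketch assumes that the soma count for a cell is the sum of its nuclear and cytoplasmic spots and that the fit includes an intercept; neither detail is stated in the text, and the counts are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical spot counts for four cells labeled with one probe.
nuclear_spots = [3, 6, 9, 12]
cytoplasmic_spots = [7, 14, 21, 28]
soma_spots = [n + c for n, c in zip(nuclear_spots, cytoplasmic_spots)]

# The slope of nuclear versus soma counts estimates the nuclear proportion.
slope, intercept = fit_line(soma_spots, nuclear_spots)
```

For these toy counts the slope is 0.3, which would correspond to about 30% of the probe's transcripts lying within the nuclear boundary.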
In situ quantification of nucleus and soma size
Coronal brain slices from Nr5a1-Cre;Ai14, Scnn1a-Tg3-Cre;Ai14, and Rbp4-Cre_KL100;Ai14 mice were stained with anti-dsRed (Clontech #632496) to enhance tdTomato signal in the red channel and DAPI to label nuclei. Maximum intensity projections from six confocal stacks of 1-µm intervals were processed for analysis. Initial segmentation was performed by CellProfiler \cite{Lamprecht2007} to identify nuclei from the DAPI signal and soma from the tdTomato signal. Segmentation results were manually verified and any mis-segmented nuclei or somata were removed or re-segmented if appropriate. Area measurement of segmented nuclei and somata was performed in CellProfiler in Layer 4 from Nr5a1-Cre;Ai14 and Scnn1a-Tg3-Cre;Ai14 mice, and in Layer 5 from Rbp4-Cre_KL100;Ai14 mice. A linear regression was fit to nuclear versus soma area to highlight the differences between Cre lines.
For measurements of nucleus and soma size agnostic to Cre driver, we used 16-µm tissue sections from P56 mouse brain. To label nuclei, DAPI was applied to the tissue sections at a final concentration of 1 mg/ml. To label cell somata, tissue sections were stained with Neurotrace 500/525 fluorescent Nissl stain (ThermoFisher Scientific) at a dilution of 1:100 in 1X PBS for 5 minutes, followed by brief washing in 1X PBS. Sections were coverslipped with Fluoromount-G (Southern Biotech) and visualized on a Nikon TiE epifluorescence microscope using a 40x oil objective. Soma and nuclei area measurements were taken by tracing the boundaries of the Nissl-stained soma or DAPI-stained nucleus, respectively, using cell measurement tools available in the Nikon TiE microscope software. All cells with a complete nucleus clearly present within the section were measured, except that we excluded glial cells, which had very small nuclei and scant cytoplasm. Measurements were taken within a 40x field of view across an entire cortical column encompassing layers 1-6, and the laminar position of each cell (measured as depth from the pial surface) was tracked along with the nucleus and soma area measurements for each cell.
For each cell in the experiments above, the nuclear proportion was estimated as the ratio of nucleus and soma area raised to the 3/2 power. This transformation was required to convert area to volume measurements and assumed that the 3-dimensional geometries of soma and nuclei were reflected by their cross-sectional profiles. This is true for approximately symmetrical shapes such as most nuclei and some somata, but will lead to under- or over-estimates of nuclear proportions for asymmetrical cells. Therefore, the estimated nuclear proportion of any individual cell may be inaccurate, but the average nuclear proportion for many cells should be relatively unbiased.
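The area-to-volume conversion described above can be sketched as follows, with hypothetical areas in µm². As the text notes, per-cell values may be inaccurate for asymmetrical cells, so the sketch also reports the average across cells.

```python
def nuclear_proportion_from_areas(nucleus_area, soma_area):
    """Volume ratio estimated from cross-sectional areas: (A_nuc / A_soma) ** (3/2)."""
    return (nucleus_area / soma_area) ** 1.5

# Hypothetical (nucleus area, soma area) pairs for three cells.
cells = [(50.0, 200.0), (64.0, 100.0), (45.0, 125.0)]
per_cell = [nuclear_proportion_from_areas(nuc, soma) for nuc, soma in cells]
mean_proportion = sum(per_cell) / len(per_cell)
```

The exponent follows from volume scaling with the cube of a linear dimension and area with its square: a nucleus occupying one quarter of the soma cross-section is estimated to occupy one eighth of the soma volume.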
Code availability
Competing interests
The authors declare no competing interests.
Acknowledgements
The authors thank the Allen Institute for Brain Science founders, P. G. Allen and J. Allen, for their vision, encouragement, and support.
Author Contributions
Conceptualization - BT, TEB, AB, ESL, HZ
Data Curation - TEB, JM
Formal Analysis - TEB, JM, ZY, JG
Investigation - RDH, TNN, EB, DB, TC, ND, EG, LTG, MK, KL, SP, CR, SIS, MT, KAS
Methodology - RDH, BA, RSL, RHS, NJS
Project Administration - JWP, AB, KAS, BT, HZ
Supervision - BT, ESL, HZ, AB, JWP
Validation - RDH, TNN, EB, EG
Visualization - TEB, JM, RDH
Writing – Original Draft Preparation - TEB, JM, RDH, ESL, BT
Writing – Review & Editing - TEB, JM, RDH, ESL, BT