RNA-Seq data processing
Raw read (fastq) files were aligned to the GRCm38 mouse genome sequence (Genome Reference Consortium, 2011) with the RefSeq transcriptome version GRCm38.p3 (current as of 1/15/2016) and updated by removing duplicate Entrez gene entries from the gtf reference file for STAR processing. For alignment, Illumina sequencing adapters were clipped from the reads using the fastqMCF program \cite{e2011}. After clipping, the paired-end reads were mapped using Spliced Transcripts Alignment to a Reference (STAR) \cite{23104886} using default settings. STAR uses and builds it own suffix array index which considerably accelerates the alignment step while improving on sensitivity and specificity, due to its identification of alternative splice junctions. Reads that did not map to the genome were then aligned to synthetic constructs (i.e. ERCC) sequences and the E.coli genome (version ASM584v2). Quantification was performed using summerizeOverlaps from the R package GenomicAlignments \cite{23950696}. The final results files included quantification of the genome-mapped reads (exon and intron counts). Also, part of the final results files are the percentages of reads mapped to the RefSeq transcriptome, to ERCC spike-in controls, and to E.coli.
Expression levels were calculated as counts per million (CPM) of exonic plus intronic reads, and log2(CPM + 1) transformed values were used for a subset of analyses as described below. CPM values reflected absolute transcript number and gene length, i.e. short and abundant transcripts may have the same apparent expression level as long but rarer transcripts. Intron retention varied across genes so no reliable estimates of effective gene lengths were available for expression normalization. Instead, absolute expression levels were estimated as fragments per kilobase per million (FPKM) using only exonic reads so that annotated transcript lengths could be used.
Selection of single nuclei and matched cells
463 of 487 (95%) of single nuclei isolated from layer 5 of mouse V1 passed quality control criteria: >500,000 genome mapped reads, >75% reads aligned, and >50% unique reads. 12,866 single cells isolated from layers 2-6 of mouse V1 passed quality control criteria: >200,000 transcriptome mapped reads and >1000 genes detected at any expression level.
To account for gene dropouts due to stochastic transcription or technical artifact
\cite{24836921}, expression noise models were fit separately to single nuclei and cells using the "knn.error.models" function of the
scde R package (version 2.2.0) with default settings and 8 nearest neighbors. Expression dropout should be accounted for in calculating similarities between cells or nuclei. A dropout weight matrix was calculated based on noise models using mode-relative weighting, as described by Kharchenko and Fan et al. (
http://hms-dbmi.github.io/scde/diffexp.html). The probability of dropout,
\(P_{sg}\left(E_{\bar{g}}\right)\), was calculated for each sample (s) and gene (g) based on the average expected expression level of similar samples using the "scde.failure.probability" and "scde.posteriors" functions. A second probability of dropout,
\(P_{sg}\left(E_{sg}\right)\), was calculated for each sample given the observed expression levels for those samples. Finally, the dropout weighting was calculated as a combination of those:
\(W_{sg}=1\ -\ \sqrt{P_{sg}\left(E_{sg}\right)\cdot\sqrt{P_{sg}\left(E_{sg}\right)\cdot P_{sg}\left(E_g\right)}}\)