https://www.biorxiv.org/content/biorxiv/early/2018/05/17/065094.full.pdf
"Unsupervised methods for identifying relevant features specifically for scRNASeq data have
mainly focused on the identification of highly variable genes (Brennecke et al., 2013;
Kolodziejczyk et al., 2015; Satija et al., 2015). These methods differ mainly in their approach to
adjusting for the relationship between mean and variance inherent to count-data, such as fitting
a polynomial regression (Brennecke et al., 2013), binning genes by expression level (Satija et
al., 2015), or comparing to a moving median (Kolodziejczyk et al., 2015). Alternatively, specific
genes may be identified from their weights inferred during dimensionality reduction (Björklund et
al., 2016; Klein et al., 2015; Macosko et al., 2015; Pollen et al., 2014; Usoskin et al., 2015;
Wilson et al., 2015)."
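The binning idea mentioned in the passage above (attributed there to Satija et al., 2015) can be illustrated roughly as follows. This is a simplified sketch, not the published implementation: the dispersion measure (variance/mean), the equal-width bins, the bin count, and the function name `binned_dispersion_zscores` are all assumptions made for illustration, and the input numbers are made up.

```python
from statistics import mean, pstdev

def binned_dispersion_zscores(gene_means, gene_vars, n_bins=2):
    """Z-score each gene's dispersion against genes of similar mean expression."""
    # Dispersion as variance-to-mean ratio (an assumed choice of statistic).
    disp = [v / m for m, v in zip(gene_means, gene_vars)]
    lo, hi = min(gene_means), max(gene_means)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width if all means are equal
    # Assign each gene to an equal-width bin of mean expression.
    bins = [min(int((m - lo) / width), n_bins - 1) for m in gene_means]
    z = [0.0] * len(disp)
    for b in range(n_bins):
        members = [i for i, bi in enumerate(bins) if bi == b]
        if not members:
            continue
        mu = mean([disp[i] for i in members])
        sd = pstdev([disp[i] for i in members])
        for i in members:
            z[i] = (disp[i] - mu) / sd if sd > 0 else 0.0
    return z

# Made-up per-gene means and variances: two bins of three genes each.
gene_means = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
gene_vars = [1.0, 2.0, 9.0, 10.0, 11.0, 48.0]
z_scores = binned_dispersion_zscores(gene_means, gene_vars, n_bins=2)
```

Genes whose z-score is high relative to their own bin (here the third and sixth gene) would be the candidates for "highly variable", independent of their absolute expression level.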
\(X=U{\Sigma}V^T\)
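load hald;                      % Hald cement data set (provides the variable "ingredients")
X = ingredients;
n = size(X,1);                  % number of observations (rows); was undefined before
Xn = X - mean(X);               % center each column (implicit expansion, R2016b or later)
[U,S,V] = svd(Xn,0);            % economy-size SVD: Xn = U*S*V'
C = cov(Xn);                    % sample covariance, normalized by n-1
norm((V*S^2*V')./(n-1) - C)     % should be close to zero (floating-point error only)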
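The identity checked numerically in the snippet above follows directly from the SVD. A short derivation, writing \(X_n\) for the column-centered data matrix with \(n\) rows and \(X_n=U{\Sigma}V^T\) for its economy-size SVD (so \(U^TU=I\) and \(\Sigma\) is diagonal):

\[
C=\frac{X_n^T X_n}{n-1}
 =\frac{(U\Sigma V^T)^T(U\Sigma V^T)}{n-1}
 =\frac{V\Sigma U^T U\Sigma V^T}{n-1}
 =\frac{V\Sigma^2 V^T}{n-1}
\]

Hence the columns of \(V\) are the eigenvectors of the covariance matrix (the principal axes), and its eigenvalues are \(\sigma_i^2/(n-1)\), where \(\sigma_i\) are the singular values of \(X_n\).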
Brennecke et al., 2013
Accounting for technical noise in single-cell RNA-seq experiments
https://www.nature.com/articles/nmeth.2645
"To capture the dependence of the CV2 of the spike-ins on their average normalized count μ, we fit a curve to the observed data, using the parameterization CV2 = a1/μ + α0 (Online Methods)"
"All genes will display some biological variability in expression from cell to cell, but a high level of variance (exceeding the specified threshold) will indicate genes important in explaining heterogeneity within the cell population under study (Online Methods and Supplementary Note 6). "
"at a false discovery rate of 10%, 876 genes across the seven GL2 cells that showed statistically significant evidence against the null hypothesis that their biological coefficient of variation was less than our chosen minimum CV of 50% (i.e., CV2 < 0.25). We therefore considered these genes to be highly variable (Fig. 2d and Supplementary Table 2)."
"Across the highly variable genes, we found clear enrichments for Gene Ontology (GO) categories such as “Nucleosome Assembly” (P = 2.5 × 10−24), “Cell Proliferation” (P = 6.0 × 10−6), “Anaphase” (P = 5.4 × 10−7) and “Cell Wall” (P = 3.8 × 10−6), which are expected to vary across cells because they are indicative of distinct growth states for GL2 and QC cells (for a full list, see Supplementary Table 4). Additionally, individual GO categories tended to be upregulated in a coordinated fashion in individual cells, a result suggesting that these GO categories reflect different cellular states and possible instances of co-regulation10 (Supplementary Fig. 9). "
"As in any hypothesis test, our results did not imply that none of the remaining genes was highly variable. In fact, for all genes in the GL2 cells with normalized counts below ∼100 (weakest significant gene in Fig. 2d), even the strongest biological variation could not be detected because technical noise was maximal (Supplementary Note 7). This is not a limitation of our statistical approach; rather, it is a direct consequence of the limited sensitivity of current single-cell RNA-seq protocols. (Supplementary Note 4)."
"The relationship between technical variability and expression strength showed a robust fit (Fig. 3). We identified highly variable genes across the 91 cells and found 1,198 at a 10% false discovery rate. This set of genes was strongly enriched for several GO categories including “Cytokine Activity” (P = 6.9 × 10−8), as expected. This suggested that the set of genes identified are likely to be physiologically relevant."
"Supplementary Note 1 – Differences between groups versus variability within a group
The purpose of our method is to identify genes whose expression levels vary across single cells within a single population of cells. These cells are supposedly similar or, at least, they are not a priori known to come from two distinct cell populations. This scenario is significantly different from the more common experimental setup of finding genes that are differentially expressed between two or more discrete groups of cells. .... In our setting, however, we seek to find genes that are variable within a single population of cells. In other words, the biological variability, which was part of the nuisance parameter in the two-group comparison setting, now becomes the parameter of interest. Hence, distinguishing biological noise from technical noise is critical, and only in this situation does it become necessary to resort to spike-in data to characterize the strength of technical noise."
Kolodziejczyk et al., 2015
Single Cell RNA-Sequencing of Pluripotent States Unlocks Modular Transcriptional Variation
https://www.sciencedirect.com/science/article/pii/S193459091500418X?via%3Dihub
"An advantage of the single-cell approach is that we can study the distribution of expression levels across the population, thereby capturing cell-to-cell variability in gene expression (Figure 2A). To compare global levels of gene expression heterogeneity between the three different culture conditions, we used the coefficient of variation (CV) of normalized read counts (Figure S2). However, the CV of a gene depends strongly on its mean expression level and length, making it difficult to interpret differences between conditions. To account for the confounding factor of expression level, we therefore developed a measure of cell-to-cell variation by calculating the distance between the squared CV of each gene and a running median (Figures S2E and S2F). This is derived from the scatterplot of the mean normalized read counts versus the squared CV values, as in (Newman et al., 2006). We refer to this expression-level normalized measure of gene expression heterogeneity as distance to the median (DM) (refer to Supplemental Experimental Procedures for details)."
"We found that 712 GO terms (out of a total of 19,107 terms) exhibit a significant difference in levels of gene expression heterogeneity in at least one pairwise comparison "