Background and motivation

Bulk RNA-seq

Bulk RNA-seq (i.e. measuring gene expression in multi-cell samples, often including mixtures of cell types) is an extremely common experiment. Often bulk RNA-seq samples are collected under two conditions and their gene expression values are compared (i.e. differential gene expression analysis). For instance, the two samples could be a healthy mouse lung and a diseased mouse lung, or two diseased lungs where one of the two is exposed to a potential treatment. There are a number of reasons why a particular gene might appear over-expressed in one sample compared to another. One reason could be that the cell type composition differs between the two samples: for instance, the diseased tissue might contain more white blood cells than the control tissue. Another reason could be that some specific cell types (or all the cells) are over-expressing that gene. The focus of this MPhil project will be on these first two explanations, but it is also important to remember that the up-regulation may be due to an uncontrolled confounding factor (potentially things like the time of day the sample was collected, the age of the patient, the stress level of the patient) or to another observable physiological difference that distinguishes the two samples (like how quickly the cells divide). In particular, you will attempt to answer the question: how much of the change in gene expression can be explained by differences in cell type composition between the RNA-seq samples?
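
To make the question concrete, here is a minimal sketch of how the change in one gene's bulk expression could be split into a part explained by composition and a part explained by regulation within cell types. It assumes the bulk signal is a proportion-weighted average of the per-cell-type expression on a linear (not log) scale. The function names and all the numbers are hypothetical and are not taken from any of the datasets below.

```python
def decompose_change(p_control, e_control, p_disease, e_disease):
    """Split the change in bulk expression of one gene into two parts.

    p_*: cell type proportions in each sample (each list sums to 1).
    e_*: mean expression of the gene in each cell type, in each sample.
    """
    # Part explained by composition: proportions change,
    # expression is held at the control levels.
    composition = sum((pd - pc) * ec
                      for pc, pd, ec in zip(p_control, p_disease, e_control))
    # Part explained by regulation: expression within each cell type changes,
    # weighted by the disease proportions.
    regulation = sum(pd * (ed - ec)
                     for pd, ec, ed in zip(p_disease, e_control, e_disease))
    return composition, regulation


def bulk_expression(proportions, expression):
    """Bulk signal modelled as a proportion-weighted average over cell types."""
    return sum(p * e for p, e in zip(proportions, expression))


# Hypothetical values for one gene and two cell types (epithelial, immune).
p_control, e_control = [0.9, 0.1], [10.0, 50.0]
p_disease, e_disease = [0.7, 0.3], [10.0, 60.0]

composition, regulation = decompose_change(p_control, e_control, p_disease, e_disease)
total_change = (bulk_expression(p_disease, e_disease)
                - bulk_expression(p_control, e_control))

print(f"total change: {total_change:.1f}")
print(f"explained by composition: {composition:.1f}")
print(f"explained by regulation: {regulation:.1f}")
```

The two parts always add up to the total change under this model, but the split is not unique: weighting the regulation term by the control proportions instead would give slightly different numbers. Which convention to use is a choice we would need to make and state.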

Single cell resolution sequencing

It is now possible to perform RNA-seq on single cells. With bulk RNA-seq you obtain a single expression value for each gene (a single column in a table), but with single cell RNA-seq you obtain the expression of each gene within each cell that is sampled (a table with a column for each cell). This means that, for each cell type, you can graph the distribution of gene expression values across the population of cells. However, some of this variability arises from technical noise (from the experimental procedure) rather than biological noise (true variation between cells). The most notorious issue is zero-inflation: for each individual cell there will be many genes with a recorded gene expression of 'zero', but this may be technical rather than biological.
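
As a small illustration of the table structure described above, the sketch below builds a toy gene-by-cell table, computes the fraction of zeros for each gene, and averages each gene over the cells of each type. The gene names, cell type labels and counts are all made up for the example.

```python
from collections import defaultdict

# Hypothetical counts: one row per gene, one column per cell.
cell_types = ["AT2", "AT2", "T cell", "T cell"]
counts = {
    "GeneA": [0, 3, 0, 5],
    "GeneB": [2, 0, 0, 0],
}


def zero_fraction(values):
    """Fraction of cells in which a gene has a recorded count of zero."""
    return sum(1 for v in values if v == 0) / len(values)


def mean_by_cell_type(values, labels):
    """Average a gene over the cells of each type (a 'pseudo-bulk' profile)."""
    grouped = defaultdict(list)
    for value, label in zip(values, labels):
        grouped[label].append(value)
    return {label: sum(vs) / len(vs) for label, vs in grouped.items()}


zeros = {gene: zero_fraction(values) for gene, values in counts.items()}
profiles = {gene: mean_by_cell_type(values, cell_types)
            for gene, values in counts.items()}

for gene in counts:
    print(gene, "zero fraction:", zeros[gene], "mean per cell type:", profiles[gene])
```

Note that the zero fraction on its own cannot tell you whether a zero is technical or biological. It only tells you how much of the table is affected.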

Ways to utilise single cell RNA-seq to help interpret bulk RNA-seq

If you have an atlas of gene expression from single cells or single tissues, then it may be possible to determine the composition of cell types or tissues in a bulk RNA-seq experiment. I sent you the CIBERSORT article, which describes one way to do this in detail. We found a few issues with CIBERSORT when we tried to apply it to our (plant) data: (i) it gives you an estimated tissue ratio, but no sense of how confident it is in that estimate; (ii) it performs poorly when the tissue ratios are extreme (e.g. 90% coming from one tissue or cell type and only a few % from the others); and (iii) it doesn't take into account differences in the age of the sample. We developed a strategy for overcoming these issues, which we call TissueTimer. The master's student who worked on this project is currently writing up a paper about this work (I'll give it to you to read once it's in better shape). However, he used a 'tissue atlas' rather than a 'cell atlas' (samples taken from whole tissues rather than single cell data). An ideal outcome of the MPhil would be to adapt this method to utilise single cell RNA-seq data. The main challenge is dealing with zero-inflation, but other changes to the method might also be needed to account for the fact that we have data from 100s of cells per sample, instead of data from 3 RNA-seq replicates. Before we make these changes, we need to do a thorough analysis of the single cell RNA-seq atlas that we will be using, and this will inform how we modify TissueTimer.
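
To give a feel for what 'determining the composition' means, here is a deliberately simple sketch that recovers the proportion of one cell type in a two-type mixture by least squares. This is neither CIBERSORT's method (which is based on support vector regression) nor TissueTimer. It is just the simplest version of the same idea, and the signatures, the function name and the 90/10 mixture are hypothetical.

```python
def estimate_proportion(bulk, signature_a, signature_b):
    """Least-squares estimate of the fraction of cell type A in a two-type mixture.

    Model: bulk[g] = p * signature_a[g] + (1 - p) * signature_b[g] for each gene g.
    """
    diffs = [a - b for a, b in zip(signature_a, signature_b)]
    numerator = sum((x - b) * d for x, b, d in zip(bulk, signature_b, diffs))
    denominator = sum(d * d for d in diffs)
    if denominator == 0:
        raise ValueError("the two signatures are identical, so p is not identifiable")
    # Clip to [0, 1] so the result is a valid proportion.
    return min(1.0, max(0.0, numerator / denominator))


# Hypothetical signatures (mean expression of three genes in each cell type).
signature_a = [10.0, 0.0, 5.0]
signature_b = [0.0, 8.0, 5.0]

# Simulate an 'extreme' mixture: 90% type A, 10% type B, with no noise.
true_p = 0.9
bulk = [true_p * a + (1 - true_p) * b for a, b in zip(signature_a, signature_b)]

estimated_p = estimate_proportion(bulk, signature_a, signature_b)
print(f"estimated proportion of type A: {estimated_p:.2f}")
```

With noise-free simulated data the estimate matches the true proportion. Real data would add noise, more than two cell types and zero-inflated signatures, which is where the issues listed above start to matter.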

Why the mouse lung?

We might eventually get access to new mouse lung data from collaborators. In the meantime, there are lots of lung RNA-seq datasets available, both single cell and bulk, which we can use.
Single cell RNA-seq datasets:
https://figshare.com/articles/Single-cell_RNA-seq_data_from_Smart-seq2_sequencing_of_FACS_sorted_cells/5715040
https://figshare.com/articles/MCA_DGE_Data/5435866
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE124872
Bulk RNA-seq datasets (same strain as first article):
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE92891
also: GSE49114
Many more can be found through Google Scholar.

Research questions:

Overall aim: We will develop a tool to enable us to predict the cell type ratio of whole-lung RNA-seq, based on single cell gene expression data.  We will use this tool to explore how the cellular composition of lungs changes during disease progression.
Exploratory analysis: (Note: even if this is all that you get through, it will be enough for a thesis. You might also discover something really super cool in this phase of the analysis, and the project might veer in a different direction.)
Applying CIBERSORT:
Extending TissueTimer:

Initial steps to take:

Learn how to perform the following operations in R and know what they mean (I can help with this):
Meetings:
07-10-19: Overview of data set structure
14-10-19: