Background and motivation
Bulk RNA-seq
Bulk RNA-seq (i.e. measuring gene expression in multi-cell samples, often including mixtures of cell types) is an extremely common experiment. Often bulk RNA-seq data are collected for two samples and their gene expression values are compared (i.e. differential gene expression analysis). For instance, these two samples could be a healthy mouse lung and a diseased mouse lung, or they could be two diseased lungs where one of the two is exposed to a potential treatment. There are a number of reasons why a particular gene might be over-expressed in one sample compared to another. One reason could be that the ratio of cell types is different between the two samples: for instance, the diseased tissue might have more white blood cells than the control tissue. Another reason could be that some specific cell types (or all the cells) are over-expressing that gene. The focus of this MPhil project will be on these two explanations, but it is also important to remember that the up-regulation may be caused by an uncontrolled confounding factor (potentially things like the time of day the sample was collected, or the age or stress level of the patient) or by another observable physiological difference that distinguishes the two samples (like how quickly the cells divide). In particular, you will attempt to answer the question: how much of the change in gene expression can be explained by differences in the cell type composition between the RNA-seq samples?
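To make the two explanations concrete, here is a minimal Python sketch (not taken from any existing tool). It assumes that the bulk expression of a gene is simply the average of its expression in each cell type, weighted by the proportion of that cell type. The cell types, proportions and expression values are invented for illustration, and the function names are hypothetical. The split into a "composition" part and a "regulation" part is one of several possible ways to divide up the change (it uses the control sample as the baseline for expression).

```python
def bulk_expression(proportions, expression):
    """Bulk expression of one gene: per-cell-type expression weighted by proportion."""
    return sum(p * e for p, e in zip(proportions, expression))


def decompose_change(p_control, e_control, p_disease, e_disease):
    """Split the bulk change (disease minus control) into two additive parts.

    composition_part: change caused by different cell type proportions,
                      holding expression at its control level.
    regulation_part:  change caused by different per-cell-type expression,
                      evaluated at the disease proportions.
    """
    composition_part = sum(
        (pd - pc) * ec for pc, pd, ec in zip(p_control, p_disease, e_control)
    )
    regulation_part = sum(
        pd * (ed - ec) for pd, ec, ed in zip(p_disease, e_control, e_disease)
    )
    return composition_part, regulation_part


# Hypothetical example with two cell types: [epithelial cells, white blood cells].
# The gene is expressed ten times higher in white blood cells, and the diseased
# sample contains more of them; expression within each cell type is unchanged.
p_control, e_control = [0.9, 0.1], [2.0, 20.0]
p_disease, e_disease = [0.7, 0.3], [2.0, 20.0]

composition_part, regulation_part = decompose_change(
    p_control, e_control, p_disease, e_disease
)
total_change = bulk_expression(p_disease, e_disease) - bulk_expression(
    p_control, e_control
)
print(composition_part, regulation_part, total_change)
```

In this made-up case no cell type changes its expression of the gene, yet the bulk value nearly doubles: the whole change comes from the composition part. A standard differential expression analysis on the bulk data alone could not tell this apart from genuine up-regulation.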
Single cell resolution sequencing
It is now possible to perform RNA-seq on single cells. With bulk RNA-seq you obtain a single expression value for each gene (a single column in a table), but with single cell RNA-seq you obtain the expression of each gene within each cell that is sampled (a table with a column for each cell). This means that, for each cell type, you can graph the distribution of gene expression values across a population of cells. However, some of this variability arises from technical noise (from the experimental procedure) rather than biological noise (true variation between cells). The most notorious issue is zero-inflation: for each individual cell there will be many genes with a recorded gene expression of 'zero', but these zeros may be technical rather than biological.
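As a rough illustration of zero-inflation, the Python sketch below simulates one gene across many cells and then randomly replaces some of the counts with zero. It is a toy model under loud assumptions: the counts are invented, the dropout rate of 50% is arbitrary, and every count is dropped with the same probability (in real data, lowly expressed genes tend to drop out more often). The function names are hypothetical.

```python
import random


def add_dropout(true_counts, dropout_rate, rng):
    """Replace each count with 0 with probability dropout_rate (a technical zero)."""
    return [0 if rng.random() < dropout_rate else count for count in true_counts]


def zero_fraction(counts):
    """Fraction of cells in which the recorded count is zero."""
    return sum(1 for count in counts if count == 0) / len(counts)


def mean(counts):
    return sum(counts) / len(counts)


rng = random.Random(1)  # fixed seed so the simulation is repeatable

# Hypothetical true counts of one gene in 1000 cells: roughly 20% of the cells
# genuinely do not express the gene (biological zeros).
true_counts = [0 if rng.random() < 0.2 else rng.randint(1, 20) for _ in range(1000)]

# Observed counts after the (assumed) technical dropout.
observed_counts = add_dropout(true_counts, dropout_rate=0.5, rng=rng)

print("zeros (true, observed):", zero_fraction(true_counts), zero_fraction(observed_counts))
print("mean  (true, observed):", mean(true_counts), mean(observed_counts))
```

From the observed counts alone, a biological zero and a technical zero look identical, and the observed mean underestimates the true mean. Any method that summarises single cell data per cell type has to account for this in some way.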
Ways to utilise single cell RNA-seq to help interpret bulk RNA-seq
If you have an atlas of gene expression from single cells or single tissues, then it may be possible to determine the composition of cell types or tissues in a bulk RNA-seq experiment. I sent you the CIBERSORT article, which describes one way to do this in detail. There are a few issues with CIBERSORT that we found when we tried to apply it to our (plant) data: (i) it gives you an estimated tissue ratio, but it doesn't give you a sense of how confident it is in these results, (ii) it performs poorly when the tissue ratios are more extreme (like 90% coming from one tissue or cell type and only a few % coming from other tissues or cell types), and (iii) it doesn't take into account differences in the age of the sample. We developed a strategy for overcoming these issues (which we call TissueTimer). The master's student who worked on this project is currently writing up a paper about this work (I'll give it to you to read once it's in better shape). However, he used a 'tissue atlas' rather than a 'cell atlas' (samples taken from whole tissues rather than single cell data). An ideal outcome of the MPhil would be to adapt this method so that it can utilise single cell RNA-seq data. The main challenge is dealing with zero-inflation, but other changes to the method might also be needed to account for the fact that we have data from 100s of cells per sample, instead of data from 3 RNA-seq replicates. Before we make these changes, we need to do a thorough analysis of the single cell RNA-seq atlas that we will be using, and this will inform how we modify TissueTimer.
Why the mouse lung?
We might eventually get access to new mouse lung data from collaborators. In the meantime, there are lots of lung RNA-seq datasets available, both single cell and bulk, which we can use.
Single cell RNA-seq datasets:
Bulk RNA-seq datasets (same strain as first article):
also: GSE49114
many more can be found through Google Scholar
Research questions:
Overall aim: We will develop a tool to enable us to predict the cell type composition of whole-lung RNA-seq samples, based on single cell gene expression data. We will use this tool to explore how the cellular composition of lungs changes during disease progression.
Exploratory analysis: (Note: even if this is all that you get through, it will be enough for a thesis. You might also discover something really super cool in this phase of the analysis and the project might veer in a different direction)
- How consistent are the single cell lung RNA-seq datasets across the three papers?
- What genes are most resilient to batch effects? (i.e. what genes have expression levels that are most consistent across the experiments?) Are there certain cell types that seem more resilient to batch effects than others?
- How consistent are immune system cells across tissues? Can we predict the tissue that an immune system cell came from, based on its gene expression values? (If immune system cells from other tissues are similar to immune system cells in lung, then we can use those cells to identify good single cell markers in the next phase of the project)
- Can we distinguish between lung cell types based on the gene expression of single cells? What genes are most informative?
- Can we infer the cell cycle phase of the cells in the lung? How does the distribution of cell cycle phase vary across cell types? (https://rdrr.io/bioc/scran/man/cyclone.html)
- Do any pairs of genes tend to have highly correlated gene expression values within a cell type? Are these the same or different across different cell types?
Applying CIBERSORT:
- What are the gene markers that are selected by CIBERSORT: do they seem reasonable, given your previous analysis?
- If you only provide CIBERSORT with gene expression values for genes that you think would be decent markers from your previous analysis, does it perform better?
- How accurately does it perform? (make simulated bulk RNA-seq from your single cell RNA-seq data)
- Does it perform accurately on data from a different 'batch'?
- What cell type composition is predicted for RNA-seq datasets from public databases? Do you see a difference in predicted tissue ratio?
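The simulated bulk RNA-seq check in the list above can be sketched as follows. This is a toy Python example, not CIBERSORT (which uses nu-support vector regression) and not TissueTimer: it swaps in plain non-negative least squares, solved by projected gradient descent, simply to show the shape of the experiment. All expression values are invented, the cell types are unnamed, and the function names are hypothetical. Cells of known type are mixed at a known ratio to make a pseudo-bulk sample, and we then check whether the ratio can be recovered from the per-cell-type mean profiles.

```python
import random

N_TYPES = 3
# Hypothetical mean expression of four genes (rows) in three cell types (columns).
# The first three genes are markers for one cell type each; the fourth is not.
TRUE_MEANS = [
    [10.0, 1.0, 1.0],
    [1.0, 10.0, 1.0],
    [1.0, 1.0, 10.0],
    [5.0, 5.0, 5.0],
]


def simulate_cell(type_index, rng, noise_sd=0.5):
    """One cell's expression vector: the type mean plus Gaussian noise, floored at 0."""
    return [max(0.0, row[type_index] + rng.gauss(0.0, noise_sd)) for row in TRUE_MEANS]


def mean_profile(cells):
    """Average expression of each gene over a list of cells."""
    return [sum(values) / len(cells) for values in zip(*cells)]


def estimate_proportions(signature, bulk, n_iter=2000):
    """Non-negative least squares by projected gradient descent, rescaled to sum to 1.

    signature[k] is the mean expression profile of cell type k.
    """
    # 1 / (sum of squared entries) is a safe step size for this quadratic problem.
    step = 1.0 / sum(v * v for profile in signature for v in profile)
    weights = [1.0 / len(signature)] * len(signature)
    for _ in range(n_iter):
        fitted = [
            sum(w * profile[g] for w, profile in zip(weights, signature))
            for g in range(len(bulk))
        ]
        residual = [f - b for f, b in zip(fitted, bulk)]
        # Gradient step for each weight, then clip negative weights to zero.
        weights = [
            max(0.0, w - step * sum(x * r for x, r in zip(profile, residual)))
            for w, profile in zip(weights, signature)
        ]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)  # fixed seed so the simulation is repeatable

# "Atlas": 200 simulated single cells per cell type, summarised as mean profiles.
atlas = [[simulate_cell(k, rng) for _ in range(200)] for k in range(N_TYPES)]
signature = [mean_profile(cells) for cells in atlas]

# Pseudo-bulk: average 100 atlas cells mixed at a known 60:30:10 ratio.
true_counts = [60, 30, 10]
mixture = [cell for k, n in enumerate(true_counts) for cell in rng.sample(atlas[k], n)]
pseudo_bulk = mean_profile(mixture)

true_proportions = [n / sum(true_counts) for n in true_counts]
estimated = estimate_proportions(signature, pseudo_bulk)
print([round(p, 3) for p in estimated], true_proportions)
```

Because the pseudo-bulk cells are drawn from the same atlas that defines the signature, this is a best-case test. Building the pseudo-bulk from a different dataset than the one used for the signature would be the analogous check for the 'batch' question.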
Extending TissueTimer:
- Go through the math: what do we need to modify to take into account zero-inflation?
- Are there any assumptions we can remove now that we have data from 100s of cells instead of 3 replicates?
- Can we incorporate age, using the aging lung atlas? Can we think about how to incorporate cell cycle phase, as well?
- Repeat all the CIBERSORT analyses with TissueTimer: How different are the results from CIBERSORT and TissueTimer? What are the benefits of using one over the other?
Initial steps to take:
Learn how to perform the following operations in R and understand what they mean (I can help with this):
- PCA (visualising how similar cells are to one another)
- t-SNE (visualising how similar cells are to one another in two dimensions; groups of similar cells often show up as visible clusters, although t-SNE is not itself a clustering method)
- k-means clustering and hierarchical clustering (clustering, i.e. finding groups of cells that are similar to one another)
- Drawing results as a heatmap or scatterplot
- Supervised learning techniques: randomForest, SVM (if you have labelled data-- such as tissue type-- these are strategies to build models to help classify the cells by their labels. You can then look at the model and see which genes were most useful for predicting the label.)
- Single cell versions of differential gene expression analysis: https://hms-dbmi.github.io/scde/ or read https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2599-6
Meetings:
07-10-19: Overview of data set structure
14-10-19: