Cell preparation and single-cell RNA sequencing
For Drop-seq, enriched CD34+ cells after MACS
separation were diluted to 200 cells/ul in PBS-0.1% BSA solution in
single-cell suspensions, and loaded into the Drop-seq device (Macosko et
al., 2015). We set the input concentration of cells and beads to be 200
cells/ul and 250 beads/ul, and optimized the flow rates for cells (2,500
ul/hour), beads (2,500 ul/hour) and oil (6,800 ul/hour) to obtain a
stabilized aqueous flow. We also ensured that these flow rates returned
a low cell doublet rates (1-2%) using human/mouse species-mixing
experiments as suggested, using human HEK293 cells and mouse 3T3 cells.
Droplets were collected after each run, and we recovered single-cell
transcriptomes attached to microparticles (STAMPs) using 6X SSC and
Perfluorooctanol (PFO, Sigma #370533). Reverse transcription was
performed on the STAMPs in a pooled fashion using Maxima H Minus Reverse
Transcriptase (Thermo Fisher Scientific #EP0752), followed by
Exonuclease I cleavage to remove primers not binding to mRNAs. cDNAs
were then amplified through PCR using KAPA HiFi HotStart ReadyMix (Kapa
Biosystems #KK2602) by collecting 5,000 STAMPs per PCR reaction, and
later fragmented and prepared into paired-end sequencing libraries with
the Nextera XT DNA sample prep kit (Illumina) using custom Read 1
primers (GCCTGTCCGCGGAAGCAGTGGTATCAACGCAGAGTAC, IDT). Libraries were
quantified using Qubit and BioAnalyzer High Sensitivity Chip (Agilent),
and sequenced on the Illumina HiSeq 2500 machine.
For index sorting experiments with LMPP, CD34+ cells
were enriched from umbilical cord blood using MACS separation, and
stained with the following antibodies in single-cell suspensions: APC
mouse anti-human CD34 (BD #560940, clone 581), Alexa
Fluor 700 mouse anti-human CD38 (BD #560676, clone
HIT2), APC/Cy7 anti-human CD45RA antibody (BioLegend
#304127, clone HI100), CD52 monoclonal antibody FITC (Thermo Fisher
Scientific #MA1-82037, clone HI186), BV421 mouse anti-human CD10 (BD
#562902, clone HI10a), and PE mouse anti-human CD114 (CSF3R, BD
#554538, clone LMM741). The amount to use per antibody was determined
from titration experiments using cord blood MNCs or enriched
CD34+ cells. CD34+
CD38- CD45RA+ cells were gated on
the SONY SH800Z cell sorter with CD10, CD52 and CSF3R indices recorded,
and individual cells were sorted into single wells in 96-well plates.
For mast cell progenitors, MNCs in single cell suspensions were stained
with the following antibodies: APC mouse anti-human CD34 (BD #560940,
clone 581), APC/Cy7 anti-human CD45RA antibody
(BioLegend #304127, clone HI100), BV421 mouse
anti-human CD117 (BD #562435, clone YB5.B8), PE anti-human FcεRIα
antibody (BioLegend #334609, clone AER-37), and FITC
Mouse Anti-Human CD71 (BD #561939, clone M-A712). We gated
CD34+ CD117+
FcεRIα+ CD45RA- and
CD34+ CD117+
FcεRIα+ CD45RA+ cells on the SONY
SH800Z sorter, with CD71 index recorded, and single cells were isolated
into individual wells on 96-well plates.
Cells were immediately lysed and mRNAs were released when single cells
were sorted into wells with 5X Maxima reverse transcription buffer, dNTP
mixture, RNase inhibitors (SUPERase In RNase Inhibitor, Thermo Fisher
Scientific #AM2696) and water. We reverse-transcribed the mRNAs using
Supercript II Reverse Transcriptase (Thermo Fisher Scientific
#18064071), and amplified cDNAs for each cell (KAPA) in individual
wells using the SMART-Seq2 protocol (Picelli et al., 2013), with the
exception that a 12-base cell barcode was included in the 3’-end RT
primer. This allowed us to perform multiplexed pooling before library
preparation with the Nextera XT DNA sample prep kit (Illumina), and
returned 3’ biased data similar to the Drop-seq protocol, enabling
direct comparison. We quantified the cDNA libraries on Agilent
BioAnalyzer and sequenced them on HighSeq 2500 with paired-end
sequencing.
Raw data processing and quality control
Raw reads from single-cell RNA-seq were processed using Drop-seq tools
v1.12 (Macosko et al., 2015). Briefly, reads were mapped to the human
hg19 reference genome, and a digital expression matrix was returned with
counts of unique molecular identifiers (UMIs) for every detected gene
(row) per cell barcode (column). To determine the number of cells (cell
barcodes) represented in the expression matrix, we used the elbow plot
method recommended by the Drop-seq core computational protocol, which
utilize the cumulative distribution of reads and identify an inflection
point in the plot. Beyond the inflection point should only be empty
micro-particles exposed to ambient RNA.
We further filtered cells by removing those with less than 500 UMIs
detected, and those with transcriptomic alignment rates less than 50%.
We also calculated the percentage of reads aligned to mitochondrial
genes per cells, and removed cells with greater than 10% of UMIs
corresponding to mitochondrial genes (Ilicic et al., 2016). A higher
proportion of mitochondrial genes indicates the loss of cytoplasmic
mRNAs and/or high cell stress experienced during sample preparation
(Ilicic et al., 2016). The filtered matrix was then log-normalized to
correct for the difference in sequencing depth between single cells, by
applying the formula below to each cell barcode, where \(c_{i}\) indicates the raw counts for gene \(i\):
\begin{equation}
ormalized\ expression=\operatorname{ln\ }{[\ \frac{c_{i}}{\sum_{i}c_{i}}\ \times\ 10,000\ ]}\nonumber \\
\end{equation}
Variation in cell cycle stages can contribute to the heterogeneity in
single-cell data and will be confounded with developmental
heterogeneity. Furthermore, technical factors will also act as
confounding sources of noise when analyzing heterogeneous populations.
We therefore sought to remove cell cycle effects together with technical
covariates through latent variable regression (Buettner et al., 2015).
Briefly, we assigned a cell cycle score for every cell using the PC2
loadings from a principal component analysis (PCA) done using only a
published list of genes whose expression level is strongly correlated
with cell cycle phase (Macosko et al., 2015), from which we found that
PC2 represented the separation between S and G2/M phases. We then
modeled the expression for each gene \(i\) using the formula:
\begin{equation}
{}_{i}=\ \beta_{0}+\ \sum_{j}{\beta_{j}X_{j}}+\varepsilon_{i}\nonumber \\
\end{equation}
where \(G_{i}\) is a vector showing the log-normalized expression for
gene \(i\) in all the cells, \(X_{j}\) represents a user-defined
covariate to regress out and \(\varepsilon_{i}\) is the random noise
associated to gene \(i\). In addition to cell cycle scores, we have
chosen to include total UMI counts, alignment rates, percentage for
mitochondrial reads and donor IDs as latent variables for regression.
The residuals were then z-scored and used as corrected expression values
for dimensionality reduction, which is described below.