Cell preparation and single-cell RNA sequencing
For Drop-seq, enriched CD34+ cells after MACS separation were diluted to 200 cells/ul in PBS-0.1% BSA solution in single-cell suspensions, and loaded into the Drop-seq device (Macosko et al., 2015). We set the input concentration of cells and beads to be 200 cells/ul and 250 beads/ul, and optimized the flow rates for cells (2,500 ul/hour), beads (2,500 ul/hour) and oil (6,800 ul/hour) to obtain a stabilized aqueous flow. We also ensured that these flow rates returned a low cell doublet rates (1-2%) using human/mouse species-mixing experiments as suggested, using human HEK293 cells and mouse 3T3 cells. Droplets were collected after each run, and we recovered single-cell transcriptomes attached to microparticles (STAMPs) using 6X SSC and Perfluorooctanol (PFO, Sigma #370533). Reverse transcription was performed on the STAMPs in a pooled fashion using Maxima H Minus Reverse Transcriptase (Thermo Fisher Scientific #EP0752), followed by Exonuclease I cleavage to remove primers not binding to mRNAs. cDNAs were then amplified through PCR using KAPA HiFi HotStart ReadyMix (Kapa Biosystems #KK2602) by collecting 5,000 STAMPs per PCR reaction, and later fragmented and prepared into paired-end sequencing libraries with the Nextera XT DNA sample prep kit (Illumina) using custom Read 1 primers (GCCTGTCCGCGGAAGCAGTGGTATCAACGCAGAGTAC, IDT). Libraries were quantified using Qubit and BioAnalyzer High Sensitivity Chip (Agilent), and sequenced on the Illumina HiSeq 2500 machine.
For index sorting experiments with LMPP, CD34+ cells were enriched from umbilical cord blood using MACS separation, and stained with the following antibodies in single-cell suspensions: APC mouse anti-human CD34 (BD #560940, clone 581), Alexa Fluor 700 mouse anti-human CD38 (BD #560676, clone HIT2), APC/Cy7 anti-human CD45RA antibody (BioLegend #304127, clone HI100), CD52 monoclonal antibody FITC (Thermo Fisher Scientific #MA1-82037, clone HI186), BV421 mouse anti-human CD10 (BD #562902, clone HI10a), and PE mouse anti-human CD114 (CSF3R, BD #554538, clone LMM741). The amount to use per antibody was determined from titration experiments using cord blood MNCs or enriched CD34+ cells. CD34+ CD38- CD45RA+ cells were gated on the SONY SH800Z cell sorter with CD10, CD52 and CSF3R indices recorded, and individual cells were sorted into single wells in 96-well plates. For mast cell progenitors, MNCs in single cell suspensions were stained with the following antibodies: APC mouse anti-human CD34 (BD #560940, clone 581), APC/Cy7 anti-human CD45RA antibody (BioLegend #304127, clone HI100), BV421 mouse anti-human CD117 (BD #562435, clone YB5.B8), PE anti-human FcεRIα antibody (BioLegend #334609, clone AER-37), and FITC Mouse Anti-Human CD71 (BD #561939, clone M-A712). We gated CD34+ CD117+ FcεRIα+ CD45RA- and CD34+ CD117+ FcεRIα+ CD45RA+ cells on the SONY SH800Z sorter, with CD71 index recorded, and single cells were isolated into individual wells on 96-well plates.
Cells were immediately lysed and mRNAs were released when single cells were sorted into wells with 5X Maxima reverse transcription buffer, dNTP mixture, RNase inhibitors (SUPERase In RNase Inhibitor, Thermo Fisher Scientific #AM2696) and water. We reverse-transcribed the mRNAs using Supercript II Reverse Transcriptase (Thermo Fisher Scientific #18064071), and amplified cDNAs for each cell (KAPA) in individual wells using the SMART-Seq2 protocol (Picelli et al., 2013), with the exception that a 12-base cell barcode was included in the 3’-end RT primer. This allowed us to perform multiplexed pooling before library preparation with the Nextera XT DNA sample prep kit (Illumina), and returned 3’ biased data similar to the Drop-seq protocol, enabling direct comparison. We quantified the cDNA libraries on Agilent BioAnalyzer and sequenced them on HighSeq 2500 with paired-end sequencing.
Raw data processing and quality control
Raw reads from single-cell RNA-seq were processed using Drop-seq tools v1.12 (Macosko et al., 2015). Briefly, reads were mapped to the human hg19 reference genome, and a digital expression matrix was returned with counts of unique molecular identifiers (UMIs) for every detected gene (row) per cell barcode (column). To determine the number of cells (cell barcodes) represented in the expression matrix, we used the elbow plot method recommended by the Drop-seq core computational protocol, which utilize the cumulative distribution of reads and identify an inflection point in the plot. Beyond the inflection point should only be empty micro-particles exposed to ambient RNA.
We further filtered cells by removing those with less than 500 UMIs detected, and those with transcriptomic alignment rates less than 50%. We also calculated the percentage of reads aligned to mitochondrial genes per cells, and removed cells with greater than 10% of UMIs corresponding to mitochondrial genes (Ilicic et al., 2016). A higher proportion of mitochondrial genes indicates the loss of cytoplasmic mRNAs and/or high cell stress experienced during sample preparation (Ilicic et al., 2016). The filtered matrix was then log-normalized to correct for the difference in sequencing depth between single cells, by applying the formula below to each cell barcode, where \(c_{i}\) indicates the raw counts for gene \(i\):
\begin{equation} ormalized\ expression=\operatorname{ln\ }{[\ \frac{c_{i}}{\sum_{i}c_{i}}\ \times\ 10,000\ ]}\nonumber \\ \end{equation}
Variation in cell cycle stages can contribute to the heterogeneity in single-cell data and will be confounded with developmental heterogeneity. Furthermore, technical factors will also act as confounding sources of noise when analyzing heterogeneous populations. We therefore sought to remove cell cycle effects together with technical covariates through latent variable regression (Buettner et al., 2015). Briefly, we assigned a cell cycle score for every cell using the PC2 loadings from a principal component analysis (PCA) done using only a published list of genes whose expression level is strongly correlated with cell cycle phase (Macosko et al., 2015), from which we found that PC2 represented the separation between S and G2/M phases. We then modeled the expression for each gene \(i\) using the formula:
\begin{equation} {}_{i}=\ \beta_{0}+\ \sum_{j}{\beta_{j}X_{j}}+\varepsilon_{i}\nonumber \\ \end{equation}
where \(G_{i}\) is a vector showing the log-normalized expression for gene \(i\) in all the cells, \(X_{j}\) represents a user-defined covariate to regress out and \(\varepsilon_{i}\) is the random noise associated to gene \(i\). In addition to cell cycle scores, we have chosen to include total UMI counts, alignment rates, percentage for mitochondrial reads and donor IDs as latent variables for regression. The residuals were then z-scored and used as corrected expression values for dimensionality reduction, which is described below.