ATAC-seq analysis and motif enrichment
The count matrix for ATAC-seq profiles of hematopoietic and leukemic cell types (132 samples in total) was downloaded from NCBI Gene Expression Omnibus (GSE74912) (Corces et al., 2016). Peaks were quantile normalized using the normalize.quantiles() function in the R package preprocessCore. We also scaled each peak to between 0 and 1 using the rescale() function in the R scales package, clipping at the 5% and 95% quantiles of that peak across samples. Each peak was associated with a nearby transcription start site (TSS) using annotatePeaks.pl from HOMER (Heinz et al., 2010), with human hg19 as the reference. To filter out peaks with low accessibility, we calculated the maximum normalized signal of each peak across samples from the following cell types: CLP, GMP, CMP, HSC, LMPP, MEP and MPP. Peaks with a maximum value below 80 were removed from downstream analysis.
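The normalization, scaling and filtering steps can be illustrated with a minimal sketch. It is written in plain Python rather than R and is not the code used for the analysis. The function names, the tie handling in the quantile normalization, and the treatment of peaks with no spread are assumptions made for illustration and may differ from preprocessCore and scales.

```python
from statistics import mean


def quantile_normalize(matrix):
    """Quantile-normalize the columns (samples) of a peaks x samples matrix.

    Assumption: ties are broken by row order, which is a simplification.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value over all samples.
    reference = [mean(col[k] for col in sorted_cols) for k in range(n_rows)]
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for j, col in enumerate(columns):
        order = sorted(range(n_rows), key=lambda i: col[i])
        for rank, i in enumerate(order):
            out[i][j] = reference[rank]
    return out


def quantile(values, q):
    """Quantile with linear interpolation between order statistics."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])


def rescale_clipped(values, low_q=0.05, high_q=0.95):
    """Clip one peak's signals at two quantiles, then map them onto [0, 1]."""
    lo, hi = quantile(values, low_q), quantile(values, high_q)
    if hi == lo:
        # Assumption: a peak with no spread is mapped to 0 everywhere.
        return [0.0 for _ in values]
    return [(min(max(v, lo), hi) - lo) / (hi - lo) for v in values]


def keep_accessible_peaks(matrix, sample_idx, min_max_signal=80):
    """Return indices of peaks whose maximum over the chosen samples is >= cutoff."""
    return [
        i for i, row in enumerate(matrix)
        if max(row[j] for j in sample_idx) >= min_max_signal
    ]
```

Here sample_idx would hold the column indices of the CLP, GMP, CMP, HSC, LMPP, MEP and MPP samples.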
To define the variable loci, we calculated the mean and standard deviation for every peak and selected the 2,000 peaks with the highest coefficient of variation (CV, the standard deviation divided by the mean). We then performed PCA on these peaks to learn the primary structure in early hematopoietic regulation. To retrieve the “primary peak” per gene, we compared the range of normalized signals for peaks associated with the same gene across all cell types, and used the peak with the maximal accessibility as the primary peak.
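As a rough illustration of the peak selection, the sketch below ranks peaks by CV and picks one primary peak per gene. It is plain Python with hypothetical function names, not the code used for the analysis. Two assumptions are made: a peak with a mean of zero is given a CV of 0, and "maximal accessibility" is read as the highest maximum signal across samples. The PCA step is not shown.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    m = mean(values)
    # Assumption: a peak with zero mean gets a CV of 0 rather than an error.
    return stdev(values) / m if m != 0 else 0.0


def top_variable_peaks(matrix, n_top=2000):
    """Return row indices of the n_top peaks with the highest CV."""
    cvs = [coefficient_of_variation(row) for row in matrix]
    order = sorted(range(len(matrix)), key=lambda i: cvs[i], reverse=True)
    return order[:n_top]


def primary_peak_per_gene(matrix, genes):
    """Map each gene to the row index of its most accessible peak.

    genes[i] is the gene assigned to peak i. Assumption: accessibility is
    compared using each peak's maximum signal across samples.
    """
    best = {}
    for i, (row, gene) in enumerate(zip(matrix, genes)):
        peak_max = max(row)
        if gene not in best or peak_max > best[gene][1]:
            best[gene] = (i, peak_max)
    return {gene: idx for gene, (idx, _) in best.items()}
```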
To visualize modules of ATAC-seq peaks with similar dynamic patterns, we used constrained k-means clustering on peaks assigned to a dynamic gene module (for example, “de novo lymphoid” genes), setting k = 4 and alpha = 0.2. To systematically group loci into “consistent” and “inconsistent” types, we also leveraged the ranking of peaks associated with one gene module across different cell types. For each cell type, we averaged the normalized signals per peak, and ranked the averaged signals across MEP, CMP, MPP, HSC, LMPP, GMP and CLP. Peaks ranked highest in the cell type matching their gene module (for example, peaks assigned to “de novo lymphoid” genes with the highest rank in CLP) were defined as “consistent”, whereas all other peaks (for example, peaks assigned to “de novo lymphoid” genes that ranked highest in MEP) were defined as “inconsistent”.
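The consistent/inconsistent labelling can be sketched as follows. This is a plain Python illustration with hypothetical function names, not the code used for the analysis. It assumes that each sample carries a cell-type label and that the expected cell type for a module (for example, CLP for “de novo lymphoid” genes) is supplied by the caller. The constrained k-means step is not shown.

```python
from statistics import mean


def average_by_cell_type(signals, sample_cell_types):
    """Average one peak's normalized signals within each cell type."""
    grouped = {}
    for value, cell_type in zip(signals, sample_cell_types):
        grouped.setdefault(cell_type, []).append(value)
    return {cell_type: mean(vals) for cell_type, vals in grouped.items()}


def classify_peak(signals, sample_cell_types, expected_cell_type):
    """Label a peak by whether its top-ranked cell type matches its module."""
    averages = average_by_cell_type(signals, sample_cell_types)
    top_cell_type = max(averages, key=averages.get)
    return "consistent" if top_cell_type == expected_cell_type else "inconsistent"
```

For a peak assigned to a “de novo lymphoid” gene, classify_peak(signals, labels, "CLP") returns "consistent" only when the averaged signal is highest in CLP.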
For each module, the genomic positions of either consistent or inconsistent peaks were used for motif enrichment, using the findMotifsGenome.pl command in HOMER, with hg19 as the reference genome and the default settings for all other options. To visualize the shared motifs from different peak classifications, we combined the top 30 motifs of each group to form a list of potential transcriptional regulators. The negative log P values corresponding to these motifs were retrieved from the HOMER output, and we retained those with high enrichment (maximum –logP > 10 in at least one peak classification) for visualization in heatmaps using heatmap.2() in gplots.
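The motif selection for the heatmap can be sketched as below. This is a plain Python illustration, not the code used for the analysis. The input layout (a dictionary of negative log P values per peak classification) and the function name are hypothetical, and parsing of the HOMER output is not shown. Filling in 0 for a motif that is missing from one group is also an assumption.

```python
def select_motifs(neg_log_p, top_n=30, min_neg_log_p=10):
    """Build a motif x group table of negative log P values for a heatmap.

    neg_log_p maps each peak classification to {motif name: negative log P}.
    """
    # Union of the top_n most enriched motifs from every group.
    candidates = set()
    for scores in neg_log_p.values():
        ranked = sorted(scores, key=scores.get, reverse=True)
        candidates.update(ranked[:top_n])

    groups = list(neg_log_p)
    table = {}
    for motif in sorted(candidates):
        # Assumption: a motif missing from a group is treated as not enriched.
        row = [neg_log_p[g].get(motif, 0.0) for g in groups]
        # Keep motifs that exceed the cutoff in at least one group.
        if max(row) > min_neg_log_p:
            table[motif] = row
    return groups, table
```

The returned table has one row per retained motif and one column per peak classification, which is the shape a heatmap function such as heatmap.2() expects after conversion to a numeric matrix.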