ATAC-seq analysis and motif enrichment
The count matrix for ATAC-seq profiles of hematopoietic and leukemic
cell types (132 samples in total) was downloaded from the NCBI Gene
Expression Omnibus (GSE74912) (Corces et al., 2016). Peaks were quantile
normalized using the normalize.quantiles() function in the R package
preprocessCore. We also scaled the peaks between 0 and 1 using the
rescale() function in the R scales package, clipping at 5% and 95%
quantiles for every peak across samples. Each peak was associated with a
nearby transcription start site (TSS) using annotatePeaks.pl from HOMER
(Heinz et al., 2010), with human hg19 as a reference. To filter out
peaks with low accessibility, we calculated the maximum normalized
signals across samples (we selected samples from the following cell
types: CLP, GMP, CMP, HSC, LMPP, MEP, MPP), and removed peaks with a
maximum value less than 80 from downstream analysis.
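The preprocessing steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the pipeline used here: the function names are hypothetical, ties in quantile normalization are broken by row order (preprocessCore handles ties differently), quantiles use linear interpolation, and the filtering threshold is left as a parameter.

```python
from statistics import mean


def quantile_normalize(matrix):
    """Quantile-normalize the columns (samples) of a peaks x samples matrix."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    # For each sample, row indices ordered from lowest to highest signal.
    orders = [sorted(range(n_rows), key=lambda i, j=j: matrix[i][j])
              for j in range(n_cols)]
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = [mean(matrix[orders[j][k]][j] for j in range(n_cols))
                 for k in range(n_rows)]
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        for k, i in enumerate(orders[j]):
            out[i][j] = reference[k]
    return out


def quantile(values, q):
    """Quantile with linear interpolation between order statistics."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def rescale_clipped(row, lower=0.05, upper=0.95):
    """Clip one peak's signals at two quantiles, then map them to [0, 1]."""
    lo, hi = quantile(row, lower), quantile(row, upper)
    if hi == lo:  # constant peak: nothing to scale
        return [0.0 for _ in row]
    return [(min(max(x, lo), hi) - lo) / (hi - lo) for x in row]


def keep_accessible(matrix, sample_labels, cell_types, threshold):
    """Indices of peaks whose maximum over the chosen cell types reaches threshold."""
    cols = [j for j, lab in enumerate(sample_labels) if lab in cell_types]
    return [i for i, row in enumerate(matrix)
            if max(row[j] for j in cols) >= threshold]
```

Here rows are peaks and columns are samples, so rescale_clipped would be applied to each row in turn.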
To define the variable loci, we calculated the mean and standard
deviation for every peak and selected the top 2,000 peaks with the
highest coefficient of variation (CV, standard deviation divided by the
mean). We then performed PCA on these peaks to learn the primary
structure in early hematopoietic regulation. To retrieve the “primary peak” per gene, we
compared the range of normalized signals for peaks associated with the
same gene across all cell types, and used the one with the maximal
accessibility as the primary peak.
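The CV-based selection and the choice of a primary peak per gene can be illustrated as follows. This is a sketch under assumptions: the function and argument names are hypothetical, the sample standard deviation (n - 1 denominator) is used, peaks with a non-positive mean are skipped, and "maximal accessibility" is taken to mean the highest normalized signal observed for a peak. PCA is not shown.

```python
from statistics import mean, stdev


def top_cv_peaks(signals, n_top):
    """Return the n_top peak IDs with the highest coefficient of variation.

    signals: dict mapping peak ID -> list of normalized signals across samples.
    """
    cv = {}
    for peak, values in signals.items():
        m = mean(values)
        if m > 0:  # CV is undefined for a zero mean
            cv[peak] = stdev(values) / m
    return sorted(cv, key=cv.get, reverse=True)[:n_top]


def primary_peaks(signals, peak_to_gene):
    """For each gene, keep the associated peak with the highest signal."""
    best = {}
    for peak, values in signals.items():
        gene = peak_to_gene[peak]
        top = max(values)
        if gene not in best or top > best[gene][1]:
            best[gene] = (peak, top)
    return {gene: peak for gene, (peak, _) in best.items()}
```

With 2,000 as n_top, top_cv_peaks would return the variable loci that feed into the PCA.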
To visualize modules of ATAC-seq peaks with similar dynamic patterns, we
used constrained k-means clustering on peaks assigned to a dynamic gene
module (for example, “de novo lymphoid” genes), setting k = 4 and
alpha = 0.2. To systematically group loci into “consistent” and
“inconsistent” types, we also leveraged the ranking of peaks
associated with one gene module across different cell types. For each
cell type, we averaged the normalized signals per peak, and ranked the
averaged signals among MEP, CMP, MPP, HSC, LMPP, GMP and CLP. Peaks with
the highest rank in the consistent cell type (for example, peaks assigned
to “de novo lymphoid” genes with the highest rank in CLP) were defined
as being “consistent”, whereas other peaks (for example, peaks assigned
to “de novo lymphoid” genes with the highest rank in MEP) were defined
as being “inconsistent”.
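The rank-based labelling can be sketched as below. The names are hypothetical, and the sketch assumes that each sample carries a cell-type label and that the cell type expected for a module (for example, CLP for “de novo lymphoid” genes) is supplied by the caller. Constrained k-means clustering is not shown.

```python
from statistics import mean


def classify_peaks(signals, sample_labels, expected_cell_type):
    """Label each peak by whether its top-ranked cell type is the expected one.

    signals: dict mapping peak ID -> list of values aligned with sample_labels.
    """
    cell_types = sorted(set(sample_labels))
    labels = {}
    for peak, values in signals.items():
        # Average the normalized signal within each cell type.
        averages = {
            ct: mean(v for v, lab in zip(values, sample_labels) if lab == ct)
            for ct in cell_types
        }
        # The cell type with the highest average signal has the highest rank.
        top = max(averages, key=averages.get)
        labels[peak] = "consistent" if top == expected_cell_type else "inconsistent"
    return labels
```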
For each module, the genomic positions of either consistent or
inconsistent peaks were used for motif enrichment, using the
findMotifsGenome.pl command in HOMER, with hg19 as the reference genome
and the default settings for all other options. To visualize the shared
motifs from different peak classifications, we combined the top 30
motifs of each group to form a list of potential transcriptional
regulators. The negative log P values corresponding to these motifs were
retrieved from HOMER output, and we took those with high enrichment
(maximum –logP > 10 in at least one peak classification)
for visualization in heatmaps using heatmap.2() in gplots.
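The motif selection for the heatmap can be sketched as follows. This is an illustration under assumptions: the function name is hypothetical, the input is taken to be (motif name, log P value) pairs already parsed from the HOMER output for each peak classification, with log P values that are zero or negative, and a motif missing from a group is given a –logP of 0. Reading the HOMER files and drawing the heatmap are not shown.

```python
def select_motifs(results, top_n=30, min_neg_log_p=10.0):
    """Build a motif x group table of -logP values for plotting.

    results: dict mapping group name -> list of (motif_name, log_p) pairs.
    """
    # Union of the top_n most enriched motifs from every group, in first-seen order.
    candidates = []
    for rows in results.values():
        ranked = sorted(rows, key=lambda r: r[1])  # most negative log P first
        for name, _ in ranked[:top_n]:
            if name not in candidates:
                candidates.append(name)

    neg_log_p = {group: {name: -lp for name, lp in rows}
                 for group, rows in results.items()}

    # Keep motifs whose best -logP across groups exceeds the cutoff.
    table = {}
    for name in candidates:
        row = {group: neg_log_p[group].get(name, 0.0) for group in results}
        if max(row.values()) > min_neg_log_p:
            table[name] = row
    return table
```

The returned dictionary has one row per retained motif and one column per peak classification, which is the layout a heatmap function expects.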