A simple solution is to train a replication timing model with only H3K4me3 and H3K27me3. However, this would be too simplistic a model, even though replication time is related to gene expression, and H3K4me3 induces expression while H3K27me3 represses it.
Instead, using data available from the ENCODE project, a separate model was trained for each experiment, using as many tags as were available to improve training.
Training model
- For each experiment, ENCODE data were obtained for cell lines or tissues that had at least all the tags needed.
- For each tag, data values were extracted in 100 kb regions whose replication time is variable, and the mean, standard deviation, skewness, and kurtosis of each region were saved.
- Once enough features had been extracted (ca. 20,000), the standard deviation, skewness, and kurtosis were Box-Cox normalized (as they are not normally distributed).
- The features were then ready to be modelled by linear regression.
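The training steps above can be sketched in Python as follows. This is a minimal illustration using NumPy only, not the code used for the experiments. The function names, the synthetic signal, and the placeholder target are all hypothetical. Box-Cox requires strictly positive input, and the text does not say how negative skewness or kurtosis values were handled, so the shift applied here is an assumption. The lambda is picked by a simple grid search over the profile log-likelihood; a library routine such as `scipy.stats.boxcox` could be used instead.

```python
import numpy as np


def window_features(signal):
    """Return [mean, standard deviation, skewness, excess kurtosis] of one window."""
    x = np.asarray(signal, dtype=float)
    mean, std = x.mean(), x.std()
    if std == 0.0:  # flat signal: higher moments are undefined, report zeros
        return np.array([mean, 0.0, 0.0, 0.0])
    z = (x - mean) / std
    return np.array([mean, std, np.mean(z**3), np.mean(z**4) - 3.0])


def boxcox(values, lambdas=None):
    """Box-Cox transform; lambda is chosen by grid search on the log-likelihood.

    Assumption: values are shifted so that the minimum is at least 1e-6,
    because Box-Cox is only defined for strictly positive input.
    Returns the transformed values, the chosen lambda, and the shift.
    """
    x = np.asarray(values, dtype=float)
    shift = max(0.0, 1e-6 - x.min())
    x = x + shift
    log_x = np.log(x)
    if lambdas is None:
        lambdas = np.linspace(-2.0, 2.0, 81)

    def transform(lam):
        return log_x if abs(lam) < 1e-12 else (x**lam - 1.0) / lam

    def loglik(lam):
        y = transform(lam)
        return -0.5 * x.size * np.log(y.var()) + (lam - 1.0) * log_x.sum()

    best = float(max(lambdas, key=loglik))
    return transform(best), best, shift


def fit_linear(features, replication_time):
    """Ordinary least squares; the first returned coefficient is the intercept."""
    X = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(X, replication_time, rcond=None)
    return coef


# Synthetic stand-in for one tag: 200 windows with 100 signal values each.
rng = np.random.default_rng(0)
windows = rng.gamma(shape=2.0, scale=1.0, size=(200, 100))
features = np.array([window_features(w) for w in windows])

# Normalize standard deviation, skewness, and kurtosis (columns 1 to 3).
for col in (1, 2, 3):
    features[:, col], _, _ = boxcox(features[:, col])

replication_time = rng.normal(size=200)  # placeholder target, not real data
coef = fit_linear(features, replication_time)
```

With several tags, the four features of each tag would be concatenated column-wise before fitting. The lambda and shift returned by `boxcox` would need to be stored so that the same transform can be reused on new data.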
Applying model
- Features from the experiments are extracted in the same fashion as in \ref{645025}.
- Using the model trained in \ref{645025} on the features from \ref{972036}, we get a predicted replication time for the mouse germ line.
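Applying the trained model can be sketched as below. The sketch assumes that the Box-Cox parameters estimated during training are reused unchanged on the new features, which the text does not state explicitly. The names `apply_boxcox`, `predict_replication_time`, and `transforms`, as well as the example coefficients, are hypothetical.

```python
import numpy as np


def apply_boxcox(values, lam, shift=0.0):
    """Box-Cox transform with a lambda and shift fixed at training time."""
    x = np.asarray(values, dtype=float) + shift
    # Assumption: new values that fall below the training range are clipped
    # to a tiny positive number so that the transform stays defined.
    x = np.maximum(x, 1e-6)
    return np.log(x) if abs(lam) < 1e-12 else (x**lam - 1.0) / lam


def predict_replication_time(features, coef, transforms):
    """Predict replication time for each row of `features`.

    features:   array of shape (n_windows, n_features)
    coef:       intercept followed by one weight per feature
    transforms: {column index: (lambda, shift)} saved during training
    """
    X = np.array(features, dtype=float)  # copy, so the input is not modified
    for col, (lam, shift) in transforms.items():
        X[:, col] = apply_boxcox(X[:, col], lam, shift)
    return coef[0] + X @ coef[1:]


# Hypothetical trained parameters: an intercept and two feature weights,
# with column 1 log-transformed (lambda = 0) during training.
coef = np.array([0.5, 1.0, 2.0])
transforms = {1: (0.0, 0.0)}
features = [[1.0, np.e], [2.0, 1.0]]
pred = predict_replication_time(features, coef, transforms)
```

Re-estimating lambda on the new data would put the new features on a different scale from the one the regression weights were fitted on, which is why the stored parameters are reused here.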
Result
Linear regression: