\documentclass{report}
\usepackage{fullpage}
\usepackage{graphicx}
\usepackage[space]{grffile}
\usepackage{latexsym}
\usepackage{textcomp}
\usepackage{longtable}
\usepackage{tabulary}
\usepackage{booktabs,array,multirow}
\usepackage{amsfonts,amsmath,amssymb}
\providecommand\citet{\cite}
\providecommand\citep{\cite}
\providecommand\citealt{\cite}
\usepackage{url}
\usepackage{hyperref}
\hypersetup{colorlinks=false,pdfborder={0 0 0}}
\usepackage{etoolbox}
\makeatletter
\patchcmd\@combinedblfloats{\box\@outputbox}{\unvbox\@outputbox}{}{%
\errmessage{\noexpand\@combinedblfloats could not be patched}%
}%
\makeatother
% You can conditionalize code for latexml or normal latex using this.
\newif\iflatexml\latexmlfalse
\AtBeginDocument{\DeclareGraphicsExtensions{.pdf,.PDF,.eps,.EPS,.png,.PNG,.tif,.TIF,.jpg,.JPG,.jpeg,.JPEG}}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\author{Madjid BESSOUL\\ Affiliation not available \and Richard\\ Affiliation not available \and Aklil\\ Affiliation not available \and Awaiting Activation\\ Affiliation not available}
\title{BSHMM : A model for Markov-based DNA methylation profiling and case study in diatoms.}
\begin{document}
\maketitle
\section{Introduction}
\subsection{Host laboratory}
The internship took place in the Laboratory of Quantitative and Computational Biology in Paris. The lab is led by A. Carbone and is affiliated with both UPMC and CNRS. The research focuses on interdisciplinary computational biology, promoting a tight collaboration between theoretical and experimental approaches, both conducted in the same lab within seven different teams composed of biologists, computer scientists, statisticians and biophysicists. Under the supervision of Hugues Richard, I was part of the analytical genomics team whose area of research spans two main subjects : protein evolution and modelling and sequence evolution.
\subsection{Prior work and rationale}
The idea of studying methylation patterns based on a statistical method was initiated during the first year of the master's degree as a compulsory project. The goal was to construct and implement a model inspired by the Ph.D. thesis of Bogdan Mirauta with his active help and supervision. Guillaume Viejo, a fellow student at the time and myself had to repurpose Parseq \cite{Mirauta_2014}, a model aimed at RNA-Seq data analysis and modify it into a reliable DNA methylation profiling tool, starting from a library of sequencing data called BS-Seq.
A 6 months voluntary internship further extended this work. Even though the main motivation of the project has been kept the same, the statistical methods have been heavily simplified : from a sophisticated Monte Carlo combined with Gibbs particle sampling into a more practical and easier 3-layer hidden Markov process of order 1, more relevant to the aspirations of an internship research project. The tool has been almost entirely implemented during this period and dubbed BSHMM for BS-Seq Hidden Markov Model. It has been proven to be effective on simulated data but no validation had been conducted in real world conditions yet. In addition, during this year, I presented a poster presenting the tool at the CJC (Jeunes Chercheur des Cordeliers) meeting, which is mainly aimed at Ph.D. students.
\subsection{Objectives of the Internship}
This second internship was an immediate follow up to the development of BHSMM. We first sought to validate our results by comparing them to those of a different methylation experiment based on microarrays to draw the 5-methylcytosine (5mC) profile of \textit{Phaeodactylum tricornutum}. The second part consisted of using the tool that we have implemented in a larger pipeline of analysis. Recent publications have shown how methylation profiles exhibit spatial periodicity and play an important role in the chromosome arrangement inside the nucleus of some diatom species via nucleosome linkage. \cite{Huff_2014} Besides, the same type of periodicity has been observed in the expression level of small RNAs, although it is still unclear whether these two single events are related to the same biological process. The goal is to figure out whether this periodicity is also present in \textit{P. tricornutum}, and also if it is linked in any way to the placement patterns of small non coding RNA (snRNA) derived fragments.\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/meth-dna/meth-dna}
\caption{{Simplified representation of methylated CpGs in DNA.
\label{fig:meth_biology}%
}}
\end{center}
\end{figure}
\subsection{Biological background}
DNA methylation is a heritable epigenetic marker found on DNA sequences, targeting specific patterns : mainly stretches of C and G in mammals, called CpG, but also CpHpG or CpHpH in plants and other microorganisms (H refers to an A, a G, or a T). Covalent addition of a -CH 3 at the 5' of the cytosine ring results in what we call a 5-methylcytosine, or 5mC, sometimes via enzymes belonging to the family of DNA methyltransferases (DNMTs). The silencing effect of DNA methylation has been experimentally proven and extensively studied over the last decade. The main mechanism of repression involves altering chromatin conformation, inducing a more compact DNA structure and consequently impeding DNA polymerase recruitment. This phenomenon has critical implications on gene expression and cell differentiation \cite{Khavari_2010} as well as cancer \cite{Daniel_2010, Kulis_2010}.
In addition, recent studies have shown very specific periodical patterns of DNA methylation profiles in some of the diatom organisms. In these organisms --which possess unusually high proportion of methylated cytosine in CpG context- methylation does not impact gene expression and methylated cytosine occurred at a 182bp periodicity (in \textit{Emiliania huxleyi} for instance, as shown in figure~\ref{fig:period_huff} below) \cite{Huff_2014}. It is suspected to guide nucleosome linkage and drive spatial chromosome compaction inside the nucleus. In this article, the authors have shown that there is an interplay between methylation levels and nucleosome positioning. We will first try to reproduce these results using our own models and pipelines for methylation analysis and BS-Seq data, we will then investigate the existence of similar patterns in \textit{Phaeodactylum tricornutum}.
Furthermore, it is also suspected that snRNAs might be linked to DNA methylation. Striking results brought up surprisingly similar periodical patterns in snRNA expression levels as well as significant overlap between Highly Methylated Regions, transposable elements and snRNA producing loci in \textit{P. tricornutum} \cite{Rogato_2014}.\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.98\columnwidth]{figures/cell-period-nucleo/cell-period-nucleo}
\caption{{Interplay between nucleosome positioning and methylated cytosine in \textit{E. huxleyi} Left figure : In-vitro nucleosome centers (black line) are aligned to methylation clusters (blue line: proportion of methylated cytosines).
Right figure : Autocorrelation function on CG and CHG methylation levels across large scaffolds of \textit{Emiliania Huxleyi} genome, showing a 182bp periodical pattern. Excerpt from \protect\cite{Huff_2014}, fig. 3.
\label{fig:period_huff}%
}}
\end{center}
\end{figure}
\section{Materials and Methods}
DNA methylation profiling presents many challenges throughout the course of the analysis, from preparing the library to interpreting the results. However, the major steps are similar to any study involving next-generation sequencing data, only great care should be taken to integrate some specific aspects of this type of data. One major difference resides in using a chemically altered DNA sequence, the result of a process called bisulfite conversion and regarded as the gold standard for DNA methylation analysis.\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.98\columnwidth]{figures/workflow5/workflow5}
\caption{{Typical workflow for analyzing and interpreting DNA methylation data (upper arrows) and the corresponding steps in our own project.
\label{fig:workflow}%
}}
\end{center}
\end{figure}
\subsection{Mapping and processing BS-Seq data}
\subsubsection{The BS-Seq protocol}
Bisulfite treatment is a biochemical process intended to enable downstream DNA application to assess methylation at a genome-wide and at single base pair resolution. Without going into the chemical details, the conversion starts by deaminating every unmethylated cytosine residues turning them into uracil while keeping the 5-methylcytosine residues unaffected. A subsequent PCR finishes the transformation by turning the uracils into thymines. (See figure \ref{fig:bisulfite}). The modified DNA is amplified and sequenced using high throughput sequencing platform, producing a typical library of 10\textsuperscript{6} - 10\textsuperscript{9} read sequences.
Even though bisulfite treatment is an inherently harsh and complicated process that changes both the chemical and physical properties of DNA, it is still widely used and when done carefully yields very high conversation rates close to 99\%.\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/bisulfite/bisulfite}
\caption{{Bisulfite treatment of DNA keeps 5-methylcytosines intact, but converts every other C in the sequence into a T. Positions 2 and 4 are here converted, while the 5mC at position 5 (marked with a red dot) is kept as is. The image is overly simplified for the sake of clarity, the complete chemical conversion is achieved through many successive transformations.
\label{fig:bisulfite}%
}}
\end{center}
\end{figure}
\subsubsection{Alignment of BS-Seq data}
Due to the various sources of errors that can occur during Next-generation sequencing, the BS-Seq reads (e.g. short reads library of bisululfite-treated DNA) are systematically passed through a pipeline of quality control to minimize errors and bias, prior to the mapping phase. This step is essential in every analysis involving NGS data (The process is further detailed in the Materials section).
Mapping BS-Seq data rules out traditional alignment methods due to the heavily modified sequence. In fact, a result of bisulfite treatment, most cytosines are turned into thymines, inevitably leading to a large bias when trying to match a T in the reads to a C in the genome. Several methods and tools have been developed to deal with this specificity, among which many rely on reduced representation of the genome using only a 3-letter reference sequence where all Ts are transformed to Cs.
A more principled approach has been chosen for this work. LAST is a BLAST-like alignment tool that uses custom seeds and weight matrices (Table \ref{table:last-score}) while taking into account sequencing and bisulfite conversion error rate to significantly reduce possible mismatch biases~\cite{Frith_2012}. The benchmarks strongly suggested that LAST was the best among 10 alignment tool for bisulfite modified DNA in terms of error rate, sensitivity and CPU runtime. However, as mentioned in the paper, the same bias can also produce the reverse effect by making C-rich sequences easier to align. At the cost of negligible alignment accuracy, it is possible to reduce this bias by computationally converting \verb|C|s in the reads into lowercase \verb|t|s to keep track of the modification. This technique had little effect on the alignment result, and we have successfully put it to use.
We have also had the chance to meet Martin Frith in person at the lab, and we seized the opportunity to discuss the different aspects of this particular problem and seek advice and guidance.
Based on mapping results, we count the C to C (C on the reads that align to C on the reference genome), that a priori refer to a 5- methylated cytosine (Figure \ref{fig:hmm}b). The higher the coverage on a given position, the higher the confidence of the estimation. However, the alignment quality is extremely heterogeneous and sometimes incomplete, in part due to sequencing coverage variability, one more motivation to construct a robust statistical model that can account for the uncertainties.\selectlanguage{english}
\begin{table}
\begin{center}
\begin{tabular}{l*{6}{c}r}
\hline
& a & c & g & t \\
\hline
a & 6 & -18 & -18 & -18 \\
c & -18 & 6 & -18 & \textbf{3} \\
g & -18 & -18 & 6 & -18 \\
t & -18 & -18 & -18 & 6 \\
\hline
\end{tabular}
\end{center}
\caption{{Score matrix used for aligning bisulfite-converted DNA reads to a genome sequence. Columns refer to bases in the reads, rows refer to bases on the reference genome. Note that the cost for a C-to-T mismatch is in line match costs.}}
\label{table:last-score}
\end{table}
\subsection{The BSHMM model}
\subsubsection{Model parameters}\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=1.00\columnwidth]{figures/hmm-align/hmm-align}
\caption{{\label{fig:hmm}
\textbf{a)} Directed acrylic graph of the order 1 hidden Markov model with a 2-layer observation. The variables of the model are
$x$, the hidden methylation level, $y$, the observations (C-C matches), $z$, the coverage depth. Parameters are $A$, the transition matrix and $e_k$ the emission distribution.
\textbf{b)} Mapping BS-Seq reads onto the reference genome. In red, aligned cytosines that a priori indicates a methylated state.
The arrow shows the relevant data that is extracted from the alignment results and then used as input for the model, which also refers to the observed data in the HMM.%
}}
\end{center}
\end{figure}
DNA methylation in mammals is arranged in clusters, called CpG islands. This clustered pattern is also present in
plant organisms. A CpG island is a locus on the DNA sequence, sizing roughly between 200 and 3000 bp and typically
has a high CG content and a higher than the average number of methylated cytosines. This biological feature is the main
motivation towards the use of a Markov chain model. The fact that methylation is almost always aggregated in densely methylated regions can be interpreted as a spatial dependency between any 5mC and its immediate predecessor.\\
We construct a hidden Markov chain with a 2-layered set of observations , $y_i$, the number of aligned
cytosine at position $i$ and $z_i$ the total coverage depth at the same position (Ftig X). The hidden layer is noted $x_i$, which
corresponds to the proportion of methylated C at position $i$. $x_i$ are discrete values in a range that
can be refined depending on the desired outcome. Typically, other models in the literature seldom go over three values
for low, medium and highly methylated. Our model is set to 10 values by default.
Furthermore, a study has shown that second order HMMs yield a small performance gain, and anything higher declines marginally in term of accuracy \cite{Seifert_2012}. We thus decided to use order 1 markov model in the following.
\begin{equation}
x_t \in \{\frac{k}{n}, k = \{0, 1, ..., n\}\}
\end{equation}
As for every HMM, we need to define three main parameters, namely the initial probability distribution, the emission and the
transition probabilities.
The acyclic Markov chain goes from one cytosine to another, and is as long as the number of cytosines in the
methylated region, whose length is expected to be around 200bp to 3000bp. The short term memory property of the Markov process implies that the distribution converge exponentially fast toward the stationary distribution, hence the choice of the initial parameter has no implication on the outcome of the model. We thus use uniformly distributed values.
The transition parameter is how a methylation state changes from a position $i$ to $i+1$. With n levels of methylation,
the transition probabilities are modeled as a set of $(n+1)^2$ parameters arranged in an $n+1 \cdot n+1$ matrix we call $A$, defined
as :
\begin{equation}
A(l, k) = P(x_{i+1} = k \mid x_i = l)
\end{equation}
And since we only consider the positions of Cs on the genome, we need to account for the distance between one
C and another in our transition parameters by using powers of the matrix A. The farther they are from each other, the weaker the
dependency between their methylation states. If the cytosines are far enough from each other, the
transition matrix converges to a stationary distribution.\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/matrix-power1/matrix-power1}
\caption{{\label{fig:transition}
Spatial representation of the HMM over a DNA sequence. The powers of the transition matrix A are pre-
computed for speed and performance. An upper bound is calculated for m, at which $A^m$ is close enough to the
stationary distribution%
}}
\end{center}
\end{figure}
As for the emission parameter, we use a binomial distribution as $z$ random draws with probability $x$. This is
the simplest model we can come with and a first step towards a more sophisticated distribution that accounts for
coverage variability in a more reliable fashion
\begin{equation}
e_k(y_i) = P(y_i \mid x_i = k, z_i) = {z_i \choose y_i}x_{i}^{y_i}(1-x_i)^{z_i - y_i}
\end{equation}
Finding the methylation probabilities for every cytosine on the sequence \textit{per se} refers to the decoding problem. in
other words, based on the set of observed values $y$ and $z$ at each position $i$, we try to uncover the hidden value of $x$.
There are two main algorithms to compute the best hidden path, using either the Viterbi decoding, or a posterior
probability approach. While Viterbi computes the most likely sequence of hidden states, the posterior decoding approach
yields marginal probabilities for every state, which can be used to calculate confidence intervals and reliably assess the
accuracy of the model.
Calculating the posterior probability can be broken down into two different operations, by going forward and then
backward along the sequence after using the conditional independence property resulting from the Markov process on $x_i$
\begin{equation}
p(x_i = k \mid y) = \frac{f_k(i) \cdot b_k(i)}{P(y)}
\end{equation}
where $f_k(i)$ and $b_k(i)$ are the forward and backward variables respectively, defined as follows :
\begin{eqnarray}
f_k(i) &= &P(y_1, ..., y_i, x_i = k) \quad i=1,\ldots, L\\
b_k(i) &=& P(y_{i+1}, ..., y_L \mid x_i =k) \quad i=L-1,\ldots, 0\\
\end{eqnarray}
The second part of the model estimates the parameters of the Markov chain based on the data. To do so, we use
the Baum-Welch algorithm, which is a special case of a more general class of Expectation-Maximization algorithms.
%\begin{equation}
%\hat{x}_i = \underset{k}{\operatorname{argmax}}P(x_i = k \mid x)
%\end{equation}
The optimization of the transition matrix parameters computes the frequency of transitions from a methylation state $x_i$ to a state $x_{i+1}$ at positions $i$ and $i+1$ respectively given the set of observable data.
One major advantage of this method is that it uses the forward-backward arrays during the E-step to compute
there new values for the parameters. Since these arrays are also using the forward-backward algorithm and computing them is time-consuming, doing the operation once for both algorithms is a significant performance gain.
The M-step maximizes value for the estimated parameters and updates it at every iteration until the difference is small enough between the past and the current value. We evaluate this difference using by monitoring the likelihood and the relative entropy. The training stops when the likelihood gain get smaller than a threshold on its relative increase. (Fig 5)
Poorly-covered DNA regions are troublesome during this phase. The lower the coverage depth, the lower the
confidence into the transition probabilities, resulting in sub-optimal parameters. The workaround is to consider the training phase to take into account the high coverage regions only, with an automatically tuned threshold parameter that sets the lower bound for coverage depth, below which the data would be ignored.
\subsubsection{Implementation}
The HMM model has been coded as a Python class to make it flexible and easy to use as a part of a larger
pipeline analysis. We have made heavy use of NumPy to make array operations as fast as possible, in addition to many
other speed optimisation strategies, such a the pre-computation of the powers of the transition matrix between
occurrences of \verb|C|s along the sequence.
We also provide the scripts that have been used for quality control, alignment and data post-processing for the
sake of clarity and reproducibility.
We have used Git versioning system through the whole development process, and the source code is hosted and
made publicly accessible on GitHub under an open-source MIT licence.\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.98\columnwidth]{figures/bshmm/bshmm}
\caption{{Screenshot of the BSHMM GitHub page.
\label{fig:github-screen}%
}}
\end{center}
\end{figure}
\subsection{Material}
\subsubsection{Sequence Data}
Two different types of dataset have been used with BSHMM.
We used a BS-Seq dataset of 3 combined biological replicates of IMR90 human foetal lung fibroblasts. \cite{Lister_2009},
sequenced using Illumina technology and resulting in more than 120 million of 86 bp long reads after quality control and
alignment. The whole chromosome 6 has been used for simulation and model validation.
BSHMM has been validated on a real use-case to analyze the methylation profiles of \textit{Phaeodactylum}
\textit{tricornutum}, \textit{Thalassiosira pseudonana}, \textit{Aureococcus anophagefferens} and \textit{Emiliania Huxleyi} a all of which are available under the GEO Accession GSE46692 \cite{Huff_2014}\selectlanguage{english}
\begin{table}
\begin{center}
\begin{tabular}{l*{6}{c}r}
\hline
Organism & Strain & Accession & Genome length (Mb) & \%GC \\
\hline
E. huxleyi & CCMP1516 & GSM1134616 & 167.68 & 65.7 \\
A. aanophagefferens & CCMP1984 & GSM1134609 & 56.66 & 67.4 \\
T. pseudonana & CCMP1335 & GSM1134628 & 32.44 & 46.9\\
P. tricornutum & CCMP632 & GSM1134626 & 27.45 & 48.8 \\
Human IMR90 & N/A & GSE16256 & 2996.43 & 41.2\\
\hline
\end{tabular}
\end{center}
\caption{{Summary of the different organisms for which we have profiled DNA methylation.}}
\label{table:organisms}
\end{table}
\subsubsection{Quality control of short read data}
We have used FASTQC to analyse the quality of the raw sequencing data, and NGS Toolkit \cite{Patel_2012} and SeqTK \cite{seqtk} to run them through a pipeline of quality control. The goal is to obtain the best quality dataset by identifying and
removing low-quality sequences and optimize the subsequent analysis steps. After assessing the quality of the obtained
sequences, we have trimmed the first 3 and the last 25 bp from each short-read. We have also performed a sliding
window filtering, where the average Phred score is computed every step, and the read is trimmed if the score drops\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/data-files1/data-files1}
\caption{{\label{fig:align-workflow}
Data pre-processing steps. From typical BAM alignment, to pileup data to filtered data as direct model input.%
}}
\end{center}
\end{figure}
\subsubsection{Data post-processing}
The input data for the model has to be extracted from the alignment files. Typical alignment results are text files
indicating aligned sequences, mismatch flags, quality scores and so on. Using Samtools \cite{Li_2009} we aggregate the data
to turn it into per-nucleotide information, yielding the list of aligned nucleotides over every position on the reference
genome. It is then easy to count the number of aligned Cs ($y_i$ ) as well as the total number of aligned nucleotides to
account for the coverage depth ($z_i$ ).
\subsection{Detection of periodic signals in genomic region}
We assessed, on highly methylated regions, the existence of a periodic pattern of the methylation levels. The discrete nature of the data makes it very sparse, and even highly methylated regions can contain stretches of cytosines that will \textit{de facto} be unmethylated. Hence, the first step is to apply a window function to the methylation signal. We choose to smooth the data using a Hanning window, which we think is more relevant to the clustered characteristic of methylation than a simple averaging sliding window. The Hann function $w$ is defined as followed : \\
\begin{equation}
w(n) = 0.5 - 0.5cos \frac{2\pi n}{N}
\end{equation}
where $n = \{0, 1, ..., N-1\}$ and $N$ is the window size. We tend to keep the window size much smaller than the expected period (around 10 to 30 bp), otherwise it would flatten the whole profile and fail its purpose.
An autocorrelation function is the cross-correlation between a signal and itself. It's widely used in signal processing, the only difference here is that instead of comparing the same signal at different times, we shift the signal (the methylation levels or RNA expression levels in our case) in space over the genome, ie. the lag is expressed in bp. On the smoothed values of methylation levels $\tilde{x}_i$, we compute the autocorrelation function $acf(k)$ as:
\begin{equation}
acf(k) = \frac{1}{(L-k)\hat{\sigma}^2} \sum_{i=1}^{L-k} (\tilde{x}_{i} - \hat{\mu}) (\tilde{x}_{i+k}-\hat{\mu})
\end{equation}
where $\hat{\mu}$ and $\hat{\sigma}^2$ are the sample mean and variance of the $\tilde{x}_i$, respectively.
\section{Results}
\subsection{Simulated data}
Model-based simulated data have been used to assess the model's accuracy by using a generative form of our hidden Markov process. We have simulated an artificial methylation profile and a set of observed data over the human chromosome 6. Starting with real 5-mC positions $i$ and real sequencing coverage values (real $z_i$) to retain the heterogeneity of the data. The model then draws random values of methylation probabilities (simulated $x_i$) and count data (simulated $y_i$) at every position. To make it as close as possible to real data, we have used a transition matrix that has been trained over part of the genome. We end up with an artificial methylation profile that is known, in other words, we know the values of the hidden states of the Markov chain. We then feed the observed data (the simulated $y_i$ and the real coverage values $z_i$) to the HMM which will then infer the methylation level. And since we already have this information, we can measure how good BSHMM is at inferring $x_i$ values. We have simulated a portion of 30Kbp based on the read coverage of chromosome 6 in human (dataset IMR90). The results were largely satisfactory, BSHMM has reliably and consistently reconstructed the real methylation profile (see figure~\ref{fig:bshmm-simu}) by estimating the true methylation level for 99.8\% of the cytosines (using a 95\% confidence interval).\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=1.00\columnwidth]{figures/simu5/simu5}
\caption{{\label{fig:accuraty_var}
a) Accuracy of the prediction when the coverage depth varies. The accuracy reaches a plateau with coverage as low as 20 reads covering a given position. Lower values causes the accuracy to fluctuate but stays higher than 85\% in the case of fixed coverage depth. When the depth is sampled following a Poisson distribution, the models tends to stay reliable even at 5-10 depth. b) Same conditions with varying coverage depth applied to evaluate the parameter estimation step. The relative entropy decreases while the log-likelihood increases with higher coverage depths.%
}}
\end{center}
\end{figure}\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.98\columnwidth]{figures/sim-results1/sim-results1}
\caption{{Methylation profile reconstruction based on simulated data over human chr6:28,000,007-28,003,289 (hg18/NCBI Build 36.1). BSHMM has estimated the methylation profile (red solid line) with an accuracy of 99.8\%, over which we have superimposed the true profile (blue solid line). Light blue areas represent the 95\% confidence intervals.
\label{fig:bshmm-simu}%
}}
\end{center}
\end{figure}
We took the process one step further in order to account for the effects of variable coverage depth. In fact, the heterogeneous nature of coverage depth results from the combination of different effects of biases, starting from the sequence composition (repeated regions, composition bias) which heavily affects the alignment of reads, as well as sequencing errors and artifacts. Even though quality control of the reads library helps reducing these effects, we still observe them to an extent.
To observe how efficiently our model can recover the underlying methylation profile with varying read coverage depth, we have included a simulation module that can create artificial methylation patterns, much like the simulation described in the preceding paragraph ; although this time the $z_i$ values are simulated as well. We use two models for this purpose : the first one is a fixed value depth across the genome, which represents the best case scenario, we vary this values and watch how the accuracy changes. The second is closer to reality: we sample the coverage values using a Poisson distribution with different parameters. (Figure \ref{fig:accuraty_var}b)
We apply the same validation strategy to benchmark the training and parameter estimation. This time we want to observe the final values for both the log-likelihood and the relative entropy between the initial and updated parameter for each instance of coverage depth. (Figure \ref{fig:accuraty_var}b). Unsurprisingly, the estimation gets better the higher the coverage.
Based on the results of these simulation we can conclude that sequencing coverage depth is critical for the model to yield the most accurate inferred profiles. However the loss for lower values is not dramatic and we still obtain estimations above 95\% accuracy unless the coverage depth dips to extremely low values (1 to 5). Even though such occurrences sound extreme, we still encounter poorly covered genomic regions. Accounting for this heterogenity would be possible by integrating a sophisticated modeling of the distribution of coverage along the genome, in the spirit of what was done by B. Mirauta \cite{Mirauta_2014} but would require more sophisticated estimation and smoothing techniques.
\subsection{Real data validation}
Further validation of our model on real data has been conducted by comparing our results with a microarray-
based experiment on \textit{phaeodactylum tricornutum} \cite{Veluchamy_2013}.
A whole-genome analysis has been conducted on this organism, confirming that 2,8\% of the CG sites are highly
methylated (in our case, with a methylation probability higher than p=0.5), which is relevant with the findings in literature \cite{Huff_2014}. Furthermore, 84\% of methylated sites found by BSHMM are located inside HMR inferred from microarray
experiment on the same strain. We have also investigated how both methylation profiles correlate inside every HMR. An initial analysis yielded a correlation coefficient R = 0.17. This coefficient jumps to up to 0.53 once the data have been smoothed. The noisiness of microarray data and the sparseness of BSHMM sets plays a huge part in explaining this gap (see fig. \ref{fig:probes}.\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=1.00\columnwidth]{figures/bshmm-chip-corr1/bshmm-chip-corr1}
\caption{{\label{bshmm_real_data}
\textbf{a.} Distribution of correlations values between BSHMM methylation estimation and experimental microarray data on \textit{Phaeodactylum Tricornutum} taken from. Smoothing BSHMM data (in purple) drastically reduced the sparsity of the probability distributions, resulting in higher correlation. Furthermore, the noisy nature of microarray detection, in addition to the overlapping of aligned probes on the genome (the direct consequence of this effect makes every values in this datasets reflects anything but an average over the size of the probe) widens the gap between both datasets and complicating the process of side-by-side comparison. \textbf{b.} Representation of the global ratio of methylated cytosines in \textit{Phaeodactylum Tricornutum}. The numbers or methylated C (higher than 50\% methylation probability) adds up to 2.8\% of the whole genomes.%
}}
\end{center}
\end{figure}\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/probes2/probes2}
\caption{{\label{fig:probes}
Probe positions in microarray based experiments lead to imprecise estimates of DNA methylation level. Every measure of a probe is nothing but an average of methylation level over all of the cytosines it covers. The --intentionally simplistic- diagrams above pictures how different probes can overlap the same cytosine and still yield different results.%
}}
\end{center}
\end{figure}
\section{Methylation patterns in diatoms}\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/tree3/tree3}
\caption{{\label{fig:tree}
a) Phylogenetic tree of the profiled species (with the exception of \textit{F. cylindrus} which is not studied here) and b) their corresponding fraction of mehtylated cytosines. excerpt from \protect\cite{Huff_2014}, fig. 1.%
}}
\end{center}
\end{figure}
During the second phase, we use BSHMM for a more exploratory purpose. We want to investigate the methylation patterns of \textit{P. Tricornutum } in a way similar to the analyses in \cite{Huff_2014} of other diatoms. The available datasets for \textit{P. Tricornutum} are shown in figure. \ref{fig:igv}.
There are two distinct characteristics we want to explore. First we analyze every highly methylated region and look for any repetitive pattern similar to other organisms, independently for any other data. The second objective is to explore the correlation between the methylation profile and snRNA expression levels.
But before getting into the analysis of \textit{P. Tricornutum}, we want to make sure our methods are reliable by trying to replicate some of the results exposed in \cite{Huff_2014}.\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=1.00\columnwidth]{figures/igv3/igv3}
\caption{{\label{fig:igv}
Different datasets aligned over \textit{Pheodactylum tricornutum} genome using Integrated Genome Viewer (IGV).%
}}
\end{center}
\end{figure}
\subsection{Validation of methods}
Based on bisulfite sequencing data from GSE46692, we have reconstructed the methylation profiles of \textit{Emiliania Huxleyi} and \textit{Aureococcus anophagefferens}. Both organisms have been shown to exhibit very clear periodic methylation patterns \cite{Huff_2014} and we are looking to replicate these results using our own pipeline of analysis. \\
We started with GFF files containing count data in the same manner as pictured in Fig. \ref{fig:hmm}b. After basic pre-processing, we ran the data in BSHMM to reconstruct a smoothed methylation profile instead of relying on fractional estimates.
At this point, visualizing the results in IGV already gives a very strong impression of a repetitive pattern (Figure \ref{fig:acf1}, bottom)
We go on by running an autocorrelation function over the methylation profile based on a full convolution of the signal with itself and we finally detect the size of the most meaningful period using a peak detection algorithm.
What we have found is very consistent with the published findings, the periodic signal is strong and the periods are similar with a negligible difference of 5 to 10bp.\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=1.00\columnwidth]{figures/aano-ehux-acf2/aano-ehux-acf2}
\caption{{\label{fig:acf1}
Periodicity of Methylation levels in a) \textit{A. anophagefferens} and b) \textit{E. huxleyi}. Top: We report the values of the autocorrelation function acf(k) for values of $k$ below 1000 bp. Bottom: genomic view of the methylation level for CG (blue) and CHG (green) methylation. CHH methylation levels are extremely low in those organisms and thus have been omitted.%
}}
\end{center}
\end{figure}
\subsection{Methylation patterns in \textit{Phaeodactylum Tricornutum}}
At this stage, we have concluded that our analysis pipeline is sound and yields the expected results when tested on studied organisms.
We want to apply the same analysis on the methylation profile \textit{P. tricornutum}.
The first observation is that 5-methylcytosine positions in \textit{P. tricornutum} don't show any sign of a perdiodical pattern. We have run an autocorrelation analysis over different regions : highly methylated regions, genes and transposable elements.
Raphael Champeimont, a Ph.D. student in the lab, has helped by using his own FFT-based algorithm to provide us with a set of regions that contain repetitive snRNA levels. We ran the same analysis again, and still, the 5-mC signal inside those regions seems random without any recognizable pattern.
The same results have been observed for \textit{T. pseudonana}.\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=1.00\columnwidth]{figures/acf-phaeo2/acf-phaeo2}
\caption{{\label{fig:acf_phaeo}
snRNA expression levels exhibit a very clear 164bp periodic pattern, whereas methylation levels show no sign of any periodicity at all in \textit{P. tricornutum}. Moreover, no significant correlation has been found between both signals in any HMR.%
}}
\end{center}
\end{figure}
With these results and given the limited data that is available to us, we cannot confidently establish the reasons why \textit{P. tricornutum} lacks the methylation patterns that are observed in other very closely related diatoms. On the other hand, there are some interesting observations that we can make which can potentially lead to further investigations.
We can't help but notice that \textit{P. tricornutum} has a significantly lower median GC content (46.9\%) compared to \textit{E. huxleyi} for example. (65.7\%). Furthermore, the difference in the fraction of methylated genomic cytosines is even higher, in fact higher than what the GC content difference alone can explain.
The paper of Huff et al. \cite{Huff_2014} also explored the nucleus density of some diatoms, based on the fact that methylation patterns seem to drive nucleosome positions and hence the spatial arrangement of chromosomes. Unfortunately, they didn't provide data for \textit{P. tricornutum} and \textit{T. pseudonana}. We have looked for the information in the literature and found estimates for both organisms \cite{Connolly_2008}. Even though the \textit{P. tricornutum} strain for which we have the nucleus volume is different (CCMP1327 instead of CCMP632), we assume that it provides us with a relevant order of magnitude for the very least. We have added these data points to the plot from Huff. et al. (Figure \ref{fig:nuclei_density} in red and orange) and it is very clear that both organisms have much lower nuclear density than other related diatoms.\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=1.00\columnwidth]{figures/nuclei/nuclei}
\caption{{\label{fig:nuclei_density}
Figure from \protect\cite{Huff_2014}. The base-10 logarithms of genome sizes in base pairs versus nuclear volumes in mm\textsuperscript{3} are shown for individual species/cell types. We have added the data points for \textit{P. tricornutum} and \textit{T. pseudonana} which were missing from the original figure. Unsurprisingly, both turn out to have the least constrained nuclei of all the studied diatoms in the paper.%
}}
\end{center}
\end{figure}
\subsection{DNA Methylation and snRNA expression levels in \textit{P. tricornutum}}
It has been mentioned earlier in this report how \textit{P. tricornutum}'s snRNA expression levels are organized in a fashion very similar to the periodical organization of 5-mC in E. huxleyi or \textit{A. aanophageferrens}. We have been able to observe some of the findings that have been documented in \cite{Rogato_2014}, namely the significant overlap between HMRs, transposable elements and snRNA coding regions. These features strongly hint to a more complex and still unknown mechanism that binds them together, and at this point in the analysis, we can confidently assert that the internal DNA methylation profile patterns within HMRs (and by extent within transposable elements) do not exhibit the spatial periodicity feature that is observed in snRNA expression levels (see figure \ref{fig:acf_phaeo}, top).
\section{Conclusion}
The use of a statistical model relying on a Markov process, as simple as it is, proved to be a pretty efficient and accurate method for DNA profiling. We have successfully used it to replicate results from other papers and also to investigate some new aspects of DNA methylation. The model and its implementation still offer plenty of room for improvement by including more parameters and external sources of bias to account for the high heterogeneity of sequencing data. In fact, such effort is already put into practice by others researchers \cite{_ij__2016}.
Furthermore, we have been able to highlight clues that may potentially lead to an explanation for the lack of periodicity in \textit{P. tricornutum}'s methylated regions, although no causal relationship can be established just yet. When the results are seen comparatively, we notice major differences between organisms that exhibit methylated DNA patterns and those who don't, in terms of GC content, genome length, nuclear density and global methylation level.
\section{Aknowledgements}
This work would have never been possible without the relentless help and support of my teachers Hugues Richard, who is also my advisor, and Alessandra Carbone, head of the Analytical Genomics teams and director of the LCQB. I would also like to thank my family for their moral support and all the members of the LCQB, especially Bogdan Mirauta and Raphael Champeimont who have directly contributed to the internship.
\selectlanguage{english}
\FloatBarrier
\bibliographystyle{plain}
\bibliography{bibliography/converted_to_latex.bib%
}
\end{document}