\documentclass{article}
\usepackage{fullpage}
\usepackage{parskip}
\usepackage{setspace}
\usepackage{titlesec}
\usepackage{lineno}
\usepackage{xcolor}
\usepackage[colorlinks = true,
linkcolor = blue,
urlcolor = blue,
citecolor = blue,
anchorcolor = blue]{hyperref}
\usepackage{apacite}
\usepackage{eso-pic}
\AddToShipoutPictureBG{\AtPageLowerLeft{\includegraphics[scale=0.7]{powered-by-Authorea-watermark.png}}}
\renewenvironment{abstract}
{{\bfseries\noindent{\abstractname}\par\nobreak}\footnotesize}
{\bigskip}
\titlespacing{\section}{0pt}{*3}{*1}
\titlespacing{\subsection}{0pt}{*2}{*0.5}
\titlespacing{\subsubsection}{0pt}{*1.5}{0pt}
\usepackage{authblk}
\usepackage{graphicx}
\usepackage[space]{grffile}
\usepackage{latexsym}
\usepackage{textcomp}
\usepackage{longtable}
\usepackage{tabulary}
\usepackage{booktabs,array,multirow}
\usepackage{amsfonts,amsmath,amssymb}
\providecommand\citet{\cite}
\providecommand\citep{\cite}
\providecommand\citealt{\cite}
% You can conditionalize code for latexml or normal latex using this.
\newif\iflatexml\latexmlfalse
\AtBeginDocument{\DeclareGraphicsExtensions{.pdf,.PDF,.eps,.EPS,.png,.PNG,.tif,.TIF,.jpg,.JPG,.jpeg,.JPEG}}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\begin{document}
\title{Using gene genealogies to localize rare variants associated with complex traits in diploid populations}
\author[ ]{Charith Bhagya Karunarathna}
\author[ ]{Jinko Graham}
\affil[ ]{}
\vspace{-1em}
\date{}
\begingroup
\let\center\flushleft
\let\endcenter\endflushleft
\maketitle
\endgroup
\doublespacing
\linenumbers
\section{Introduction}
Most genetic association studies focus on common variants, but rare genetic variants can play major roles in influencing complex traits.\cite{Pritchard_2001,Schork_2009}. The rare susceptibility variants identified through sequencing have potential to explain some of the 'missing heritability' of complex traits \cite{Eichler_2010}. However, for rare variants, standard methods to test for association with single genetic variants are underpowered unless sample sizes are very large \cite{Lee_2014}. The lack of power of single-variant approaches holds in fine-mapping as well as genome-wide association studies. \par
In this report, we are concerned with fine-mapping a genomic region that has been sequenced in cases and controls to identify disease-risk loci. Our work extends an earlier comparison of methods for {\em detecting} disease association in cases and controls \cite{Burkett_2014} to a comparison of methods for {\em localizing} the association signal. In the previous investigation, cases and controls were sampled from a haploid or one-parent population. However, in the current investigation, cases and controls are sampled from a diploid or two-parent population to mimic studies in human populations.
A number of methods have been developed to evaluate the disease association for both a single variant and multiple variants in a genomic region. Besides single-variant methods, we consider three broad classes of methods for analysing sequence data: pooled-variant, joint-modelling and tree-based methods. Pooled-variant methods evaluate the cumulative effects of multiple genetic variants in a genomic region. The score statistics from marginal models of the trait association with individual variants are collapsed into a single test statistic by combining the information for multiple variants into a single genetic score \cite{Lee_2014}. Joint-modeling methods model the joint effect of multiple genetic variants on the trait simultaneously. These methods can assess whether a variant carries any further information about the trait beyond what is explained by the other variants. When trait-influencing variants are in low linkage disequilibrium, this approach may be more powerful than pooling test statistics for marginal associations across variants \cite{Cho_2010}. Tree-based methods assess whether trait values co-cluster with the local genealogical tree for the haplotypes (e.g., \citeNP{Bardel_2005}). A local genealogical tree represents the ancestry of the sample of haplotypes at each locus. Haplotypes carrying the same disease risk alleles are expected to be related and cluster on the genealogical tree at a disease-risk locus. \citeNP{Mailund_2006} has developed a method to reconstruct and score local genealogies according to the case-control clusters. \par
In practice true trees are unknown. However, clustering statistics based on true trees represent a best case for detecting association as tree uncertainty is eliminated. \citeNP{Burkett_2014} used known trees to assess the effectiveness of such a tree-based approach for detection of disease-risk variants in a haploid population. They found that clustering statistics computed on the known trees outperform popular methods for detecting causal rare variants in a candidate genomic region. Following Burkett et al., we use Mantel tests as the clustering statistics based on true trees. These tree-based statistics, which rely on known trees, serve as benchmarks against which to compare the popular association methods. However, unlike Burkett et al., who focus on detection of disease-risk variants, we here focus on localization of association signal in the candidate genomic region. Moreover, we use a diploid disease model instead of a haploid disease model.\par
In this report, we compare the performance of selected association methods for fine-mapping a disease locus in the middle of a larger, candidate, genomic region. In our simulation study, we use variant data generated under the coalescent model. To illustrate ideas, we start by working through a particular example dataset as a case study for insight into the association methods. We next perform a simulation study involving 200 sequencing datasets and score which association method localizes best, overall. Our results indicate the potential of ancestral tree-based approaches for localizing the association signal.
\section{Methods}
In this section, we describe our data simulation, the association methods we considered and the way we assessed localization and detection of the association signal. We next describe the popular association methods we evaluated for fine mapping. We finally explain the simulation study involving 200 sequencing datasets to address the signal localization and detection.
\subsection{Data simulation}
We first report how we simulated haplotype data from ancestral trees. We then describe how we assigned the disease status to individuals and sampled data for our case-control study.
We used fastsimcoal2 \cite{Excoffier_2013} to simulate ancestral trees and 3000 haplotypes of 4000 equispaced single nucleotide variants (SNVs) in a 2 million base-pair (Mbp) genomic region with the recombination rate of $1 \times 10^{-8}$ per base-pair per generation \cite{johnston2012population} in a haploid population of constant effective size, $N_{e} = 6200$ \cite{Tenesa_2007}. To mimic random mating in a diploid population, we then randomly paired the 3000 haplotypes into 1500 diploid individuals. Next, disease status was assigned to the 1500 individuals based on randomly-sampled risk SNVs from the middle genomic region of $950kbp-1050kbp$. For risk SNVs, the number of copies of the derived (i.e. mutant) allele increased disease risk according to a logistic regression model,
$$
{logit}\{P(D=1|G)\} = {logit}(0.2)+ \sum_{j=1}^{m} 2 \times G_j,\;\;\mbox{where,}
$$
\begin{itemize}
\item $D$ is disease status ($D = 1$, case; $D=0$, control).
\item $G=(G_1, G_2, \ldots , G_{m})$ is an individual's multi-locus genotype at $m$ risk SNVs, with $G_j$ being the number of copies of the derived allele at the $j^{th}$ risk SNV.
\item the value of the intercept term is chosen to ensure that the probability of sporadic disease (i.e. $P(D=1|G=\underset{^\sim}0)$) is approximately $20\%$.
\end{itemize}
To select risk SNVs in the model, we randomly sampled SNVs from the middle region one-at-a-time, until the disease prevalence was between $9.5-10.5\%$ in the $1500$ individuals. After assigning disease status to the 1500 individuals, we randomly sampled 50 cases (i.e. diseased) from the affected individuals and 50 controls (i.e. non-diseased) from the unaffected individuals. We then extracted the data for the variable SNVs in the resulting case-control sample to examine the patterns of disease association in subsequent analyses.
\subsection{Association Analysis}
In this section, we review the methods for association mapping we have considered. These methods fall under four categories: single-variant method, pooled-variant methods, joint-modeling methods and tree-based methods.
\subsubsection{Single-variant method}
For the single-variant method, we used the standard Fisher's exact test of disease association with each of the SNVs in the case-control sample. In Fisher's exact test, each of the variant site in the case-control sample was tested for an association with the disease outcome using a $2 \times 3$ table to compare genotype frequencies at each variant site. This single-variant association was assessed with the p-value of Fisher's exact test on the contingency table. Each row in a table represents disease status of individuals, and a column represents the three possible genotypes. The $-log_{10}$ p-value from the test was recorded as the association signal for each variant. However, single-variant tests are less powerful for rare variants than for common variants \cite{Asimit_2010}. We therefore considered three other ways to assess the association signal based on pooled-variant, joint-modelling and tree-based methods.
\subsubsection{Pooled-variant methods}
For the pooled-variant methods, we evaluated the Variable Threshold (VT) and the C-alpha test. The variable threshold approach of \citeNP{Price_2010}, is based on the regression of phenotypes onto the counts of variants in the genomic region of interest which have minor allele frequencies (MAFs) below some user-defined threshold (e.g. $1 \%$ or $5\%$). The idea is that variants with MAF below some threshold have a higher prior probability of being functional than the variants with higher MAF, based on population-genetic arguments. For each possible MAF threshold, VT computes a score measuring the strength of association between the pheonotype and the genomic region, and uses the maximum of score over all allele frequency thresholds. The statistical significance of the maximum score is then assessed by a permutation test. \citeNP{Price_2010} found that the VT approach had high power to detect the association between rare variants and disease traits when effects are in one direction. Unlike the VT test, the C-alpha test of \citeNP{Neale_2011} is a variance components approach that assumes the effects of variants are random with mean zero. The C-alpha procedure tests the variance of genetic effects under the assumption that variants observed in cases and controls are a mixture of risk, protective or neutral variants. \citeNP{Neale_2011} found that the C-alpha test showed greater power than the burden test when the effects are bi-directional. We applied the VTWOD function in the R package RVtests \cite{Xu_2012} for the VT-test and the SKAT function in the R package SKAT\cite{Wu_2011} for the C-alpha test. We used sliding windows of 20 SNVs overlapping by 5 SNVs across the simulated region.
\subsubsection{Joint-modeling methods}
For the joint-modeling methods, we evaluated the CAVIARBF \cite{Chen_2015} and Elastic-net \cite{Zou_2005} methods. CAVIARBF is a fine-mapping method that uses marginal test statistics for the SNVs and their pairwise association to approximate the Bayesian multiple regression of phenotypes onto variants that is implemented in BIMBAM \cite{Servin_2007}. However, CAVIARBF is much faster than BIMBAM because it computes Bayes factors using only the SNVs in each causal model. These Bayes factors can be used to calculate the posterior probability of SNVs being causal in the region (the posterior inclusion probability). To compute the probability of SNVs being causal, a set of models and their Bayes factors have to be considered. Let $p$ be the total number of SNVs in a candidate region, then the number of possible causal models is $2^p$. To reduce the number of causal models to evaluate and save computational time and effort, CAVIARBF imposes a limit, $L$, on the number of causal variants. This limitation reduces the number of models to evaluate in the model space from $2^p$ to $ \sum_{i=0}^{L} \dbinom{p}{i} $. Since there were 2747 SNVs in our example dataset, to keep the computational load down, we considered $L=2$ throughout this investigation.
Elastic-net is a hybrid regularization and variable selection method that linearly combines the L1 and L2 regularization penalties of the Lasso \cite{Tibshirani_2011} and Ridge \cite{Cessie_1992} methods in multiple regression. This combination of Lasso and Ridge penalties provides a more precise prediction than using multiple regression, when SNVs are in high linkage disequilibrium \cite{Cho_2009}. In addition, the elastic-net can accommodate situations in which the number of predictors exceeds the number of observations. We used the elastic-net to select risk SNVs by considering only the main effects. In elastic-net, we used the SNV's variable inclusion probability (VIP), a frequentist analog of the Bayesian posterior inclusion probability, as a measure of its importance for predicting disease risk\cite{Cho_2010}. To obtain the VIP for a SNV, we re-fitted the elastic-net model using $100$ bootstrap samples and calculated the proportion of samples in which the SNV was included in the fitted model. In our analysis, we applied the elastic-net using the R package glmnet \cite{Friedman_2010,Simon_2011,Tibshirani_2011a}.
\subsubsection{Tree-based methods}
We considered two tree based methods to assess clustering of disease status on the genealogy connecting haplotypes at a putative-risk variant: Blossoc (BLOck aSSOCiation; \cite{Mailund_2006}), which uses reconstructed trees, and a Mantel test which uses the true trees. Blossoc aims to localize the risk variants by reconstructing genealogical trees at each SNV. The reconstructed trees approximate perfect phylogenies \cite{uhler2008finding} for each SNV, assuming an infinite-sites model of mutation. These trees are scored according to the non-random clustering of affected individuals. The underlying idea is that genomic regions containing SNVs with the high clustering scores are likely to harbour risk variants. Blossoc can be used for both phased and unphased genotype data. However, the method is impractical to apply to unphased data with more than a few SNVs due to the computational burden associated with phasing. We therefore assumed the SNV data were phased, as might be done in advance with a fast-phasing algorithm such as fastPHASE \cite{Scheet_2006}, BEAGLE \cite{Browning_2011}, IMPUTE2 \cite{Howie_2009} or MACH \cite{Li_2010,Li_2009}. We evaluated Blossoc with the phased haplotypes, using the probability-score criterion which is the recommended scoring scheme for small datasets \cite{Mailund_2006}.
In practice, the true trees are unknown but as the data were simulated we had access to this information. Also, the cluster statistics based on true trees represent a best case insofar as tree uncertainty is eliminated \cite{Burkett_2014}. We therefore included two versions of the Mantel test as a benchmark for comparison. In the first version, the phenotype corresponding to a haplotype is scored according to whether or not the haplotype comes from a case. We refer to the first version as the {\em naive} Mantel test because all case haplotypes are treated the same, even those not carrying any risk variants. In the second version, the phenotype is scored according to whether or not the haplotype comes from a case and carries a risk variant. We refer to the second version as the {\em informed} Mantel test because it takes into account whether or not a case haplotype carries a risk variant. The informed Mantel test is a best-case scenario insofar as the uncertainty about the risk-variant-carrying status of the case haplotypes is eliminated. These Mantel tests correlate the pairwise distance in the known ancestry with those in the phenotypes. Following \citeNP{Burkett_2014}, we used pairwise distances calculated from the rank of the coalescent event rather than the actual times on the tree. To focus on the rare variants, the test statistic upweights the short branches at the tip of the tree by assigning a branch-length of one to all branches, even the relatively longer branches that are expected close to the time to the most recent common ancestor. Pairwise distances between haplotypes on this re-scaled tree are then correlated to pairwise phenotypic distances. We determined the distance measure matrix, $d_{ij}=1-s_{ij}$, where $s_{ij} = (y_i-\mu)(y_j-\mu)$ is the similarity score between haplotype $i$ and $j$, $y_i$ is the binary phenotype (coded as $0$ or $1$), and $\mu$ is the disease prevalence in the 1500 simulated individuals. We then used the Mantel statistic to compare the phenotype-based distance matrix, $d$, with the re-scaled tree-distance matrix. Note that we define a 'phenotype' for each haplotype within an individual. Therefore, an individual has two phenotypes rather than one.
\subsection{Scoring localization and detection}
To address the question of localization, we scored the distance of the peak association signal from the risk region based on the average distance of peak signals across the entire genomic region. We took the average distance when there were multiple peaks with the same maximum strength of association. Specifically, for each method, on each dataset, we computed the average distance (in bases) of the peak association signals from the risk region, and we then plotted the empirical cumulative distribution function (ECDF) of the average. Thus, the ECDF at point $x$ is the proportion of the 200 samples with average distance less than or equal to $x$. Therefore, a method with higher ECDF than another method at a fixed value of $x$, localizes the signal better.
To detect association, on a given simulated dataset and a given method, we used a maximum score across all the SNVs to obtain a global test of association across the entire genomic region. We determined the null distribution of the global test statistic by permuting the case-control labels. For the global test statistic, we used either a maximum statistic or the maximum of $- log_{10}$ of p-values across the entire genomic region.
The global test statistics for the different association methods are not comparable since they are not on the same scale. To make these global test statistics comparable across methods, we considered their permutation p-values. We defined these p-values as the proportion of scores under the permutation-null distribution that are greater than or equal to the observed value. We then compared the distribution of the resulting p-values for the different methods by plotting their ECDFs.
\section{Results}
In this section, we first present the summaries of our example dataset and the resulting plots from the selected association methods. We then present our results of the simulation study for localizing and detecting the association signal.
\subsection{Example dataset}
\subsubsection{Population and sample summaries}
In the population of 1500 individuals that was simulated for the example data set, we obtained 4000 SNVs, of which 16 were risk SNVs. Of the 4000 SNVs in the population, 2747 were polymorphic in the sample of 50 cases and 50 controls. Of the 16 risk SNVs in the population, 10 were polymorphic in the case-control sample. Table 1 summarizes the physical distances, number of recombinations and minor allele frequencies (MAFs) of the 16 risk SNVs in the population.
%\begin{table}[!]
% \begin{tabular}{|c|c|c|c|c|c|c|}
%\textbf{Risk SNV} & \textbf{position (kbp)} & \textbf{Polymorphic\\ in sample} & \textbf{Number of \\ recombinations $^a$} & \textbf{MAF in \\ population} & \textbf{MAF \\ in cases} & \textbf{MAF\\ in controls} \\ \midrule
% rSNV1 & 963.0 & no & NA & 0.0003 & 0.00 & 0.00 \\
% \textbf{rSNV2} & \textbf{975.0} & \textbf{yes} & \textbf{NA} & \textbf{0.0010} & \textbf{0.01} & \textbf{0.00}\\
% \textbf{rSNV3} & \textbf{990.0} & \textbf{yes} & \textbf{18} & \textbf{0.0087} & \textbf{0.02} & \textbf{0.00} \\
% \textbf{rSNV4} & \textbf{990.5} & \textbf{yes} & \textbf{0} & \textbf{0.0133} & \textbf{0.07} & \textbf{0.01} \\
% \textbf{rSNV5} & \textbf{991.5} & \textbf{yes} & \textbf{2} & \textbf{0.0390} & \textbf{0.07} & \textbf{0.03} \\
% \textbf{rSNV6} & \textbf{993.0} & \textbf{yes} & \textbf{0} & \textbf{0.0310} & \textbf{0.08} & \textbf{0.02} \\
% \textbf{rSNV7} & \textbf{997.5} & \textbf{yes} & \textbf{0} & \textbf{0.0063} & \textbf{0.02} & \textbf{0.02} \\
% rSNV8 & 1003.0 & no & NA & 0.0010 & 0.00 & 0.00 \\
% \textbf{rSNV9} & \textbf{1005.5} & \textbf{yes} & \textbf{10} & \textbf{0.1150} & \textbf{0.27} & \textbf{0.07} \\
% \textbf{rSNV10} & \textbf{1012.5} & \textbf{yes} & \textbf{5} & \textbf{0.0127} & \textbf{0.05} & \textbf{0.01} \\
% \textbf{rSNV11} & \textbf{1019.0} & \textbf{yes} & \textbf{4} & \textbf{0.0010} & \textbf{0.01} & \textbf{0.00} \\
% rSNV12 & 1036.0 & no & NA & 0.0003 & 0.00 & 0.00 \\
% rSNV13 & 1038.0 & no & NA & 0.0013 & 0.00 & 0.00 \\
% \textbf{rSNV14} & \textbf{1038.5} & \textbf{yes} & \textbf{14} & \textbf{0.0133} & \textbf{0.04} & \textbf{0.01} \\
% rSNV15 & 1039.5 & no & NA & 0.0030 & 0.00 & 0.00 \\
% rSNV16 & 1047.5 & no & NA & 0.0053 & 0.00 & 0.00 \\
% & & & & & & \\
% \end{tabular}
% \centering\caption{Population summaries for risk SNVs. SNVs that are polymorphic in the case-control sample are emboldened.}
% \smallskip
% $^{a}$The number of recombinations between current and previous risk SNV in sample.
%\end{table}
Figure 1 compares the distribution of risk haplotypes in cases and controls. We define the risk haplotypes to be a haplotype carries a risk SNV.\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/barplotofRiskHaplos-data1/barplotofRiskHaplos-data1}
\caption{{Figure 1. Number of haplotypes that carry risk SNVs (risk haplotypes) for both cases and controls.%
}}
\end{center}
\end{figure}
\subsubsection{Association results}
{\bf Single-variant method:} Figure 2 shows the Manhattan plot of results from Fisher's exact test. In our example dataset, Fisher's exact test does not localize the peak signal, which is distal to the disease risk region.\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/Fisher/Fisher}
\caption{{Figure 2. Manhattan plot for Fisher's exact test in the 50 cases and 50 controls. The base-10 logarithm of the association p-values are plotted against physical SNV positions. The horizontal dashed line represents the 5\% significant threshold based on permutation and adjusted for multiple testing across the entire genomic region. The red vertical dashed lines represent the disease risk region.%
}}
\end{center}
\end{figure}
{\bf Pooled-variant methods:} Figure 3 shows the Manhattan plot of results from the VT and C-alpha tests. The C-alpha test has stronger associations than the VT test, and localizes the highest association signal to the disease risk region.\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/VT-Calpha2/VT-Calpha2}
\caption{{Figure 3. Manhattan plot p-values, obtained by applying VT test and C-alpha test across the simulated region using sliding windows of 20 SNVs overlapping by 5 SNVs. The horizontal dashed lines represent the 5\% significant thresholds based on permutation and adjusted for multiple testing across the entire genomic region.%
}}
\end{center}
\end{figure}
{\bf Joint-modeling methods:} Figure 4 shows the estimated variable (VIP) or posterior (PIP) inclusion probability for the SNVs computed from elastic-net and CAVIARBF, respectively. We used 100 bootstrap samples to estimate VIPs via elastic-net. CAVIARBF provides estimates of to the PIPs at each SNV. In our example dataset, both elastic-net and CAVIARBF do not localize the peak signal to the risk region.\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/PIP-VIP2/PIP-VIP2}
\caption{{Figure 4. Variable-inclusion probabilities (VIPs) for SNVs computed from elastic-net (left axis), and posterior inclusion probabilities (right axis), computed from CAVIARBF, across the simulated region. The horizontal dashed lines represent the 5\% significant thresholds based on permutation and adjusted for multiple testing across the entire genomic region.%
}}
\end{center}
\end{figure}
{\bf Tree-based methods:} Figure 5 shows the association mapping results obtained from Blossoc. We performed Blossoc for our simulated phased haplotypes data, using the probability score criteria for each SNV across the region. Our example dataset shows relatively high association, but the peak signal is out side the risk region.\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/Blossoc1/Blossoc1}
\caption{{Figure 5. Clustering scores from Blossoc for each SNV across the region, using the probability scores criteria. The horizontal dashed line represents the 5\% significant threshold based on permutation and adjusted for multiple testing across the entire genomic region.%
}}
\end{center}
\end{figure}
{\bf Naive Mantel test:} Figure 6 shows the statistics computed from the naive Mantel test. Our example dataset shows relatively high association signal at the tree positions in the risk region but the peak signal is distal to the risk region.\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/naiveMantel/naiveMantel}
\caption{{Figure 6. Naive Mantel statistics for each tree position (SNV) across the simulated region. The horizontal dashed line represents the 5\% significant threshold based on permutation and adjusted for multiple testing across the entire genomic region.%
}}
\end{center}
\end{figure}
{\bf Informed Mantel test:} Figure 7 shows the results of association mapping obtained from the informed Mantel test using our example dataset. This informed Mantel test successfully localizes the peak signal to the risk region.\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/informedMantel/informedMantel}
\caption{{Figure 7. Informed Mantel statistics for each tree position (SNV) in the genomic region. The horizontal dashed line represents the 5\% significant threshold based on permutation and adjusted for multiple testing across the entire genomic region.%
}}
\end{center}
\end{figure}
\subsection{Simulation study}
We first present the simulation results for localizing the association signal, followed by the results for detecting the association signal.
\subsubsection{Localizing the association signal}%: ecdf of avg. distance from the peak
Figure 8 compares the ability of the different methods to localize the association signal. For each of the 200 simulated datasets, we considered the distance of the peak association signal from the risk region. If there were ties in the peak signal, we took the average distances. The figure shows the ECDFs of these distances for the 200 datasets comparing the eight methods. Informed Mantel test outperforms all the other methods; that is, informed Mantel test has the highest proportion of simulated dataset at lower average distances. Fisher's exact test, C-alpha test, CAVIARBF, and Blossoc perform comparably and relatively well for localizing signal. VT and naive Mantel test have the worst localization performance. As observed in the example dataset, VT has worst performance than C-alpha for localizing the signal.\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/AvgDistfromRR-ecdf1/AvgDistfromRR-ecdf1}
\caption{{Figure 8. ECDFs of average distances of the peak association signals from the risk region for the 200 datasets. Eight methods are compared: Fisher's exact test, VT test, C-alpha test, CAVIARBF, Elastic-net, Blossoc, naive Mantel test and informed Mantel test. To better compare methods, the x-axis is shown only for distances $\leq 250kbp$.%
}}
\end{center}
\end{figure}
\subsubsection{Detecting the association signal}
Figure 9 compares the ability of the methods to detect any association with the disease across the entire genomic region that is being fine-mapped. For each method, we here compare the ECDFs (over the 200 datasets) of the permutation p-values computed from the corresponding scores. As we expected, informed Mantel test performs better than all the other methods. Elastic-net has the worst performance for detecting the signal. VT appears to be less powerful for our simulated datasets.\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/SignalDetectionPvalues-ecdf1/SignalDetectionPvalues-ecdf1}
\caption{{Figure 9. ECDFs of permutation p-values from a global test of association in the genomic region. Eight methods are compared: Fisher's exact, VT, C-alpha, CAVIARBF, Elastic-net, Blossoc, naive Mantel test and informed Mantel test. The x-axis is shown only for p-values $\leq 0.20$ for a better resolution.%
}}
\end{center}
\end{figure}
\section{Discussion}
In this study, through coalescent simulation, we have investigated the ability of several popular association methods to fine-map trait-influencing genetic variants in a candidate genomic region. As the first step, we worked through a particular example dataset as a case study for insight into the methods and then performed a simulation study to score which method best localizes the risk region.
In our simulations, the informed Mantel test localized the association signal the best among all the methods considered. By contrast, the naive Mantel test performed poorly relative to the other methods. In fact, the naive Mantel test localized the risk region more poorly than Blossoc, CAVIARBF, C-alpha, and Fisher's exact test. Our results for {\em localizing} risk variants in {\em diploid} populations therefore stand in contrast to previous results for {\em detecting} risk variants in {\em haploid} populations \cite{Burkett_2014}, which found that the naive Mantel test and related methods performed very well. The poor performance of the naive Mantel test in our simulations can be explained by the misclassification of haplotypes. In {\em haploid} populations, case haplotypes without a risk variants are rarer than in {\em diploid} populations, and so fewer would be misclassified by the naive case-control phenotypes. However, in {\em diploid} populations, we do not know which of the two haplotypes in a case carries the disease-risk variant and score both as being ``affected". When only one of the two haplotypes in a case carries a risk variant, defining both haplotypes as ``affected" misclassifies one. Therefore, not surprisingly, the informed Mantel test which defines only the case haplotype that carries the risk variant as ``affected" outperformed all the other methods. The development of methods to identify which haplotypes carry risk variants would therefore be an avenue for further research in genealogy-based approaches to fine-mapping risk variants.
Our simulation study also provides a comparison of VT to the C-alpha test. Even though the effects are one directional, C-alpha showed higher localization signal in the risk region than VT. Our results for localization with the VT and C-alpha tests in a diploid population are consistent with the results of Burkett et al. for detection with these tests in a haploid population. We would expect better results from the VT test than the C-alpha test since VT is for rare variants having the same direction of effect which we simulated \cite{Price_2010}. However, variance-component tests such as C-alpha have higher power than burden tests such as VT when the proportion of risk variants in the set of tested variants is low \cite{thornton2015statistical}. For an example, in our example dataset, the highest proportion of risk variants within moving windows of 20 SNVs was 20\%.
In the example dataset, most risk variants were rare. Of the 10 risk SNVs that were polymorphic in the sample, 8 had population minor allele frequencies $< 2\%$. In addition, most cases carried a single risk haplotype and most risk haplotypes contained a single risk SNV. Based on these findings in the example dataset, we expect the results under a dominant model of genetic risk to be similar to our results under an additive model. From the results based on the case study, we found that C-alpha test and the informed Mantel test, based on known trees, were the only methods that successfully localized the association signal. However, the peak signals from all the other methods (Fisher's test, VT, CAVIARBF, Elastic-net, Blossoc and naive Mantel) were close to the disease-risk region.
We have here focused on a simple model of disease risk under additive effects with no covariates. However, simulations with risk models that include non-genetic covariates would also be an area for further research. In the approaches we have considered, the phenotypes also can be adjusted for non-genetic covariates. For tree-based methods, differentiating between case haplotypes which carry or don't carry risk SNVs improves localization of the risk region. Preliminary work (not shown) suggests that carrier and non-carrier haplotypes can be differentiated on the basis of their number of positively-associated alleles. In future work, we will pursue this idea with informed Mantel tests on reconstructed trees.
\selectlanguage{english}
\FloatBarrier
\bibliographystyle{apacite}
\bibliography{bibliography/converted_to_latex.bib%
}
\end{document}