Materials and Methods
Dataset
African population genetics dataset
We used the dataset of 3,283 individuals from 48 African and 12 Eurasian populations typed at 328,000 SNPs previously published by us
\citep{Busby2016AdmixtureAfrica}, and available at
https://www.malariagen.net/resource/18. This dataset comprises 14 Eurasian and 46 African groups, mostly from sub-Saharan Africa [Fig.
\ref{fig:supppopsmap}]. Half (23) of the African groups represent subsets of samples collected as part to the MalariaGEN consortium. Details of partners involved in the consortium and recruitment of samples in relation to studying malaria genetics are published elsewhere
\citep{MalariaGenomicEpidemiologyNetwork2015ASelection}. The remaining groups are from publicly available datasets and the 1000 Genomes Project, with Eurasian groups from the latter included to help understand the genetic contribution from outside of the continent. To study patterns of diversity, we created an integrated dataset at a common set of high-quality SNPs and used SHAPEITv2
\citep{Delaneau2012AGenomes} to haplotypically phase these genotype data, generating a final set of 6566 haplotypes.
Defining Ancestry Regions
We used a combination of geographic, ethnic, and genetic information to group our sample into populations and eight ancestry regions [Fig. \ref{fig:fig1}]. We analysed each population in turn, and inferred local ancestry using only those populations from outside of their ancestry region, which we term non-local paintings.
Chromosome painting procedure
Inferring population specific prior copying probabilities and painting parameters
We utilised ChromoPainter \citep{Lawson2012InferenceData} to infer local ancestry. In order to do so we first needed to generate population-specific prior copying probabilities. The algorithm initially assumes that each donor haplotype is equally likely to be copied from, which in reality is not the case. Moreover, when there is less contextual information available for painting, for example at the beginning and ends of chromosomes, in its default setting, the algorithm will assign ancestry based on the on the initial prior copying probabilities, which will lead to noisy ancestry inference. We therefore performed initial runs of ChromoPainter where for each recipient population we used the Expectation-Maximisation (EM) algorithm to infer the prior copying probabilities from each recipient group (using the \(-ip\) flag). We ran ten EM iterations across 4 chromosomes (1,4,10,15) using 5 recipient individuals per population, and averaging across individuals and chromosomes to infer population specific prior copying probabilities.
We next estimated the recombination scaling constant, using the \(-in\) flag, and the mutation (HMM emission) probabilities, using the \(-im\) flag, separately for the same 5 individuals and 4 chromosomes noted above for each recipient population, and again averaged across chromosomes and individuals. We then painted all individuals from a given recipient population using these population-specific prior copying probabilities and model parameters.
Inferring local ancestry with ChromoPainter
To estimate local ancestry uncertainty, we take multiple sample paths through the HMM to generate 10 ancestry paintings per recipient haplotype. We note that in an ideal scenario, we would use ChromoPainter to infer the true uncertainty in the ancestry at a SNP by computing the full set of posterior marginal copying probabilities for each recipient haplotype at every position of the genome, giving us the probability that a recipient haplotype copies from all donor haplotypes at a given position. However, given \(N\) recipient haplotypes and \(M\) donor individuals, this would give an \(N*M\) matrix at each of \(L\) loci for each recipient population, which quickly leads to computational storage issues. To overcome this, we instead store 10 painting samples, leading to \(N*10\) matrix being stored at each locus and use these painting samples to estimate uncertainty in the local ancestry inference.
Identifying local deviations in local ancestry
Binomial logistic model
The approach we take for identifying deviations in local ancestry is based on the idea that, in the absence of any other population genetic process, an individual’s ancestry proportions at a test SNP should be within the distribution of their genome-wide ancestry proportions. In theory, we can generate this null distribution of ancestry expectations based on the paintings at all genome-wide SNPs, or a subset thereof. In practice, we use a leave-one-out approach, and compute an individual’s genome-wide ancestry proportions from all SNPs that are not on the same chromosome as our test SNP that we want to test.
We model the proportion from each donor ancestry region separately via a binomial model with a mean estimated genome-wide and a linear increase on the logistic scale determined by \(\beta\). We used a likelihood ratio test to determine if this parameter was significantly different from zero at its maximum.
\begin{equation}
{}_{0}:y_{i}\sim\textrm{Binom}(\theta,S)\ ,\ \log\left[\frac{\theta}{1-\theta}\right]=\mu_{i}\nonumber \\
\end{equation}
\begin{equation}
{}_{1}:y_{i}\sim\textrm{Binom}(\theta,S)\ ,\ \log\left[\frac{\theta}{1-\theta}\right]=\mu_{i}+\beta\nonumber \\
\end{equation}
We calculate the MLE of beta by finding \(\hat{\beta}\) that solves:
\(\begin{equation}\[\frac{1}{S} \underset{i=1}{\overset{N}{\sum}} y_i = \underset{i=1}{\overset{N}{\sum}} \frac{e^{\mu_i + \hat{\beta}}}{1+e^{\mu_i + \hat{\beta}}}\]\end{equation}\)
Then we use that MLE to calculate our likelihood ratio test statistic \(\Lambda\) and compare it to a \(\chi^{2}_{1}\) distribution to get our pvalue.
\(\begin{equation}\[\Lambda = 2 \ \left( \hat{\beta} \underset{i=1}{\overset{N}{\sum}} y_i + S \ \underset{i=1}{\overset{N}{\sum}} \log\left[ \frac{1+e^{\mu_i}}{1+e^{\mu_i + \hat{\beta}}} \right] \right)\]\end{equation}\)
We then use chi-squared distribution to test if this statistic is in the tail.