Scientific Project Objectives

Describe as explicitly as possible the scientific objectives that you wish to reach in this project proposal. What is the essence of the intended scientific progress from which the various concrete goals, criteria, activities and the desired results can be understood? List explicitly and concretely the results (or partial results) that will have to be achieved, such as specific knowledge, the solution to particular problems and the academic breakthroughs. Per sub-aspect, mention the major quantitative (preferably) and qualitative target values, criteria, requirements and norms that at the end of the project could determine to what extent the anticipated results were reached.
Overall Research Objective: Development of robust and flexible data analysis methods for the analysis of microbiome studies, with a focus on longitudinal studies, visualisation, causal inference and biomarker discovery. A pilot microbiome study on the influence of the plant microbiome on the host microbiome, immune cell balance and disease is also included.
 Work package 1 (WP1): Development of flexible,  robust and efficient statistical methods for analysing microbiome data from clustered and longitudinal studies and for causal mediation analysis.
Supervisors: Prof. O. Thas (DSI), Prof. D. Valkenborg (DSI) and Prof. J. Vangronsveld (CMK)
RO1: We aim to develop a valid statistical modelling framework for analysing clustered and longitudinal microbiome data. The methods must be (1) flexible in the sense that they adapt to complex study designs (multiple factors, hierarchical dependence structure, correction for covariates/confounders, …)(RO1.1), (2) robust in the sense that they do not rely on unrealistic distributional assumptions and can cope with all microbiome-specific data characteristics (RO1.1) , and (3) efficient in the sense that the method provides precise estimates and powerful tests (RO1.2), despite its flexibility.
RO2: We aim to adapt counterfactual causal mediation methods to make them generally applicable to microbiome data (RO1. (RO2.1). 
RO3: Part of the objective of WP1 is it to use machine learning methods for (1) improving the efficiency of the estimators (RO3.1+RO1.2), and (2) estimating the causal effects (RO3.2+RO2)
Essence of intended scientific progress: As in many empirical research disciplines, longitudinal studies are very common in microbiome research, but none of the many existing methods for longitudinal data analysis works well for microbiome data. There is thus a need for appropriate methods that are sufficiently flexible to be applicable to a wide variety of longitudinal study designs. If they work well and a convenient software implementation is available, they will be applied at large scale. Moreover, our envisaged methods will not only be applicable to microbiome data, but also to other data types that cause problems with existing methods. 
Many longitudinal  microbiome studies with randomised treatment assignment aim at unravelling the causal effect of the treatment on an outcome, either a direct effect, an indirect effect via the microbiome, or a combination of both. Current mediation analysis methods, however, are inappropriate for the microbiome as high-dimensional mediator. We do not aim at contributing to the fundamental theory of causal inference, but rather to use state of the art mediation methods and adapt them to the very specific data characteristics of microbiome data by using the wide scope of data science methods (statistics, machine learning, visualisation) that is available at DSI.   
Our approach to increase the efficiency of estimators (or power of hypothesis tests) is very novel and it can be ported to many other high dimensional statistical inference problems. When no distributional assumptions are made, efficiency can be gained by using the data itself for adapting the estimator to make it more efficient (adaptive estimators). Such methods usually rely on asymptotic theories and work only well on large datasets (large in the sense of number of observations). Our method improves the efficiency of an estimator related to one taxon by using the data from the other taxa. We anticipate that our approach will work well even when the number of observations for that one taxon is we use small. Although we will develop the method in the context of longitudinal data analysis, the general approach is more widely applicable to other high dimensional statistical inference problems.

Research Methodology and Work Plan

Describe how the research project will be executed and justify why this approach was chosen and why any specific strategic choices were made. Clarify how the scientific objectives will be reached, taking into account the proposed (sub)objectives, criteria and the capabilities of the partners.From the perspective of this general approach, describe the experimental set-up, the cohesion of the work packages as well as the milestones.Describe in the work plan: WHAT (divided into work packages), WHY and HOW (research approach, work method), WHEN (planning) and WHO (division of tasks, synergy and complementarity). Also state how interim decision points and general project risks are taken into account.Provide an overview of the deployment of the person power across the different work packages and its positioning in time for each of the work packages. Indicate whether it concerns scientific or technical personnel and their needed seniority. Personnel allocation should be given as estimated person-months (pm).Indicate how the coordination and monitoring of the project will be organized and how the cooperation among the various partners will be structured.
Overal Work Plan
explain why we have 4 WPs 
Work package 1 (WP1): Development of flexible and robust statistical methods for analysing microbiome data from clustered and longitudinal studies and for causal statistical inference
Supervisors: Prof. O. Thas (DSI), Prof. D. Valkenborg (DSI) and Prof. J. Vangronsveld (CMK)
WP1.1 Semiparametric rank-based method for longitudinal analysis for a single taxon
Starting from the observation that sequencing count data very often cannot be properly described by a (zero-inflated) negative binomial distribution \cite{hawinkel2020sequence} and that almost all available methods for the analysis of longitudinal microbiome data rely on such distributional assumptions for the count data, we will develop a class of methods that does not rely on these assumptions. However, the methods must target informative parameters with unambiguous interpretation and they must allow for the flexible analysis of a wide variety of study designs, including one or more covariates or factors. For all these reasons, the method cannot be fully parametric and also not fully nonparametric. We therefore will develop a method based on a semiparamatric model. In particular, we will build upon the Probabilistic Index Models (PIM) \cite{thas2012probabilistic}, which form a class of methods that includes the classical Wilcoxon-Mann-Whitney test, among many other rank tests \cite{de2015regression}, but which also allows for inclusion of covariates and which provides parameters with clear and informative interpretations. PIMs have been used before for flexible and powerful analysis of data from high-throughput molecular technologies such as qPCR \cite{de2014unifiedwmwqpcr}\cite{de2013extension}and scRNASeq \cite{assefa2019probabilistic}, but these PIMs were not applicable to longitudinal or clustered data. In this project, we will use the PIM framework for modelling the count outcome of a single taxon (referred to as the target taxon). The method can of course be applied to all taxa separately as in conventional differential abundance testing. 
In the PhD thesis of \cite{zhang2012parametric} a first attempt to extend PIMs to clustered data structures was explored. However, this path was not further elaborated because the proposed estimating equations observations only used pairs of observations within clusters and hence ignoring some information in the data. The model was based on the inclusion of random effects in an outcome model for the outcome Y, and integrating out the random effects distribution from the implied PIM. In this project we propose to walk along a different path. We will start from formulating two PIMs. The first PIM is the within-subject model,
\[g(\prob{Y<Y^*} \mid S=S^*, X,X^*, T,T^*) = \beta_0 + \beta_X^t(X^*-X)+\beta_T^t(T^*-T)\]
where S, X and T are the subject, covariate and time of outcome Y, respectively (similarly for S*, X*, T* and Y*). The second PIM is the between-subject model,
\[g(\prob{Y<Y^*} \mid S\neq S^*, Z,Z^*, T,T^*) = \alpha_0 + \alpha_Z^t(Z^*-Z)+\alpha_T^t(T^*-T)\]
where Z and Z* are covariates corresponding to Y and Y*, respectively. These covariates may overlap with X and X*. For some applications, it may make sense to further restrict the second model to T=T*.If we denote the outcome for a single taxon by \(Y_{is}\) for subject i at time t, then parameter estimators will be defined as solutions to estimating equations of the form
\[\sum_i \sum_s \sum_j \sum_t A_{isjt} \left(\indicat{Y_{is}\leqs Y_{jt}} - \mu_{isjt}(\alpha,\beta)\right)=0 \label{Eq_EE}\]
where \(A_{isjt}\) is a matrix, which is referred to as the index matrix, that may depend on the covariates of observations (i,s) and (j,t), and \(\mu_{isjt}(\alpha,\beta)\) comes from PIM 1 or PIM 2, depending on whether i=j or not. The exact form of the matrix  \(A_{isjt}\) will be inspired by (1) the index matrix of the ordinary PIM \cite{thas2012probabilistic} and by (2) the working correlation matrices of GEE for longitudinal models. The variance-covariance matrix of the parameter estimator can be consistently estimated by a sandwich estimator. Asymptotic normality of the parameter estimators and consistency of the sandwich estimator will be demonstrated. The same theories as for ordinary PIMs can be used. In particular, as in \cite{de2013probabilistic}, the theory of two-step estimators  \cite{newey1994large} can be used (also referred to as V-estimators in the current context). 
The theoretical properties of the estimators will be empirically evaluated in a simulation study and compared to competitor methods. For this purpose the flexible microbiome simulation method NPSimSeq \cite{assefa2020spsimseq} will be extended to allow for simulation of realistic longitudinal microbiome data.  NPSimSeq already includes a method for simulating correlated taxa, by combining estimating sparse  correlation networks \cite{kurtz2015sparse}  and Gaussian copulas to backtransform to the correct marginal taxon-wise count distributions. This method can be extended to impose the serial correlations into all simulated taxon-wise longitudinal trajectories. However, we will make some working assumptions to make the method work when data with realistic sample sizes are available: (1) stationarity of the correlation network, and (2) taxon-wise serial dependence is modelled by an AR(1) process. This will allow for estimability of all parameters, and the Gaussian copula can again be used. 
All methods will be implemented in R packages. 
Application of method ...
...
WP1.2 Improving efficiency 
The estimation methods developed in WP1.1 are based on the naive estimation theory of \cite{thas2012probabilistic}, which does not result in (semiparametric) efficient estimators . For the simple i.i.d. case, \cite{vermeulen2017semiparametric} developed the semiparametric efficient estimator of the parameters in the PIM. However, their simulation study demonstrates that the efficiency gain is only marginal and does not outweigh the additional computation cost. In this project we take a different approach for improving estimation efficiency. We will take advantage of the high dimensionality of the microbiome data, which will allow us to construct data-adaptive estimators for which hardly a price has to be paid. 
We will make use of the property that the distribution theory of \cite{thas2012probabilistic} for the parameter estimator defined as the solution of equation \ref{Eq_EE}, will remain valid as long as the index matrix A does not depend on the outcome data. In other words, a data-driven construction of the index matrix would complicate the distribution theory. However, in typical microbiome datasets, data on hundreds to thousands of taxa are available. Hence, under the (working) assumption that the data from the other taxa are independent of the data of the target taxon, we may construct the index matrix from data from the other taxa, without jeopardising the asymptotic distribution theory. We will develop an algorithm that aims at improving the efficiency of the parameter estimators by computing the index matrix  as the minimiser of some loss function computed over the the observed distributions of the other non-target taxa. In particular, let \({\cal{P}}=\{P_j: \;\; j=1,\ldots, p\}\) denote the set of the distribution functions of the outcomes of the p other taxa (w.l.g. we ignore the conditioning on covariates) and suppose that we are interested in a parameter \(\beta_j\) that can be defined as a functional of Pj, i.e. \(\beta_j=b(P_j)\) for some b(.). For each other taxon j the outcome distribution can be estimated, say \(\hat{P}_j\), from which \(\tilde{\beta}_j=b(\hat{P}_j)\) serves as the true parameter. Repeatedly sampling from \(\hat{P}_j\), say N times, gives for each repeated sample an estimate of \(\beta_j\) by solving the estimating equation \ref{Eq_EE}. Based on these estimates and the true parameter \(\tilde{\beta}_j\), the MSE can be approximated, say \(\text{MSE}_j\). The performance of the index matrix A in estimating equation \ref{Eq_EE} can be quantified by means of a summary of the N MSEs, which is denoted by \(L(A)=L(A; \hat{P}_1,\ldots, \hat{P}_p)\). We will use machine learning methods for finding the optimal matrix A in the sense of  \(A_\text{opt}=\text{ArgMin}_A L(A)\). One particular example is \(L(A) = \max_j \text{MSE}_j\). For this choice, \(A_\text{opt}\)will give a minimax estimator but with the restriction that the data-generating distribution is a member of \({\cal{P}}\). This shows resemblance with the procedure of \cite{luedtke2020learning}, except that (1) we restrict the set of data-generating distributions to those observed in the other taxa, and (2) we consequently require other machine learning methods.  
Although the new method is motivated by the longitudinal or clustered setting, we will also evaluate the method for the simpler situation of cross sectional studies studies with only a treatment indicator in the model (this is the traditional two-sample problem for testing for differential abundance). All evaluations will happen in simulations studies, using the SPSimSeq \cite{assefa2020spsimseq} simulation method (and its extension to longitudinal data, as developed in WP1.1). 
All methods will be implemented in an R package. 
Application ...
WP1.3: Mediation analysis
Given that research on causal mediation analysis with high-dimensional mediators and for longitudinal studies has only started less than about five years ago, and is still a very active research topic, we anticipate that further improvements will have been published by the time this WP will commence. Developing new causal methodology, however, is no objective of this project; only the adaptation, evaluation and application of appropriate methods to realistic microbiome studies is part of this project. 
From the current status of the literature, we anticipate that we will need to focus on interventional (in)direct effects, as in \cite{loh2020non}, for which the taxa do no have to be modelled jointly and for which a complete decomposition of the total effect may be feasible. This theory, however, does not encompass compositional mediators for which the dependence in the joint distribution is not structural (causal), but imposed by the compositionality. Log-ratio transformations could resolve this issue, but they do not deal properly with the many zeroes. We will therefore rely on a normalisation method that expresses the counts relative to the average count of a reference frame of taxa \cite{morton2019establishing} that are stable between conditions (treatments) and over time. The selection of the taxa in the reference frame can proceed in a data-adaptive manner as in \cite{brill2019testing} without affecting the statistical inference. A further advantage of this approach is that it allows for inference on absolute abundances, under mild conditions. Our developments in this WP, will therefore also be applicable in the (near) future when new  technologies will allow for interrogating absolute abundances. 
In this project we aim at estimating the outcome model \(\E{Y \mid A,M}\) with machine learning methods. The flexibility of machine learning methods should reduce the risk of model bias. In high dimensional settings, however, regularisation is often required, which trades a reduction in variance for bias (estimation bias).  For this reason, we will adopt the double/debiased machine learning (DML) approach \cite{chernozhukov2018double}\cite{farbmacher2020causal}. The estimation of the (interventional) effects also involves models for the individual mediators (individual taxa). Instead of relying on resampling, as in \cite{loh2020non}, we will rely on the flexible semiparametric modelling framework of \cite{assefa2020spsimseq} for describing the mediator distributions so that parametric bootstrap samples can be generated from the fitted distributions. This will result in additional smoothing. Inference (e.g. hypothesis testing) will be based on the bootstrap procedure, as in many papers referred to in this proposal. 
The procedure will eventually result in estimates for the indirect effects of all taxa. Depending on the taxonomic level that is targeted, the  number of simultaneous effect estimates and corresponding hypothesis tests may be large. We will examine different strategies, ranging from exploratory to formal hypothesis testing. For the latter, we will  aim at FDR-control using the taxonomic hierarchy information available in the phylogenetic tree, as in \cite{bogomolov2017testing} or \cite{barber2017p}. As an exploratory tool, visualisations such as forest plots may be helpful. Forests plots are often used in meta-analysis for summarising effect size estimates from different studies \cite{lewis2001forest}, but they could also be used for the visualisation of taxon-specific indirect effect estimates. We will further improve this visualisation by incorporating the phylogenetic tree structure. 
The performance of the new methods will be assessed in a simulation study. As before, SPSimSeq will be used, but now the data simulation must proceed in two steps: (1) simulate all potential outcomes; (2) sample observed data from these simulated potential outcomes. See e.g. \cite{loh2019interventional,loh2020non}. If competitor methods are available, they will also be included in the simulation study. 
Back-up plan in case no further methodological progress in mediation analysis: mediation analysis at each time point separately, or longitudinal mediation analysis after dimension reduction.