Causal Mediation Analysis
General Concepts
Unless applied to well-designed and well-executed (randomised) studies, most statistical methods can only establish associations, not causal effects. To translate microbiome research into solutions for crop production and the treatment of diseases, causal relations need to be detected. Causal inference is well on its way to becoming an established theory, and it is already frequently applied in e.g. clinical research, epidemiology, econometrics and sociology. Several authors have called for causal inference methods for microbiome studies (e.g. \cite{gilbert2016microbiome} \cite{maruvada2017human}), and a few methods have already been proposed (see further). In microbiome research, it is often not known whether an intervention (e.g. a change of diet) directly affects an outcome (e.g. immune response), whether the outcome is caused by a change in the microbial communities, or both. Moreover, the microbiome can in turn also be affected by the intervention. This is illustrated in Figure \ref{435225}. Mediation analysis may reveal the causal pathway through which a treatment or an intervention (A) affects an outcome (Y). This effect may take the direct path from A to Y (direct effect) and/or the path via a mediator (M) (indirect effect). In our context, this mediator is the (relative) abundance of a single taxon. Traditional methods for mediation analysis, as introduced by \cite{judd1981process} and \cite{baron1986moderator}, were developed for a continuous outcome Y and rely on linear Structural Equation Models (SEM); see \cite{mackinnon2012introduction}. In particular, the direct and indirect effects are estimated from two models: (1) an outcome model for \(\E{Y \mid A, M}\), and (2) a mediator model for \(\E{M \mid A}\). The microbiome comprises more than a single taxon, but the general framework of \cite{mackinnon2012introduction} allows for mediation analysis with multiple mediators (see Figure \ref{558318}).
The concepts of direct and indirect effects are properly defined in the causal inference literature, independently of the SEM formulation \cite{robins1992identifiability}. We first introduce notation for counterfactual outcomes. Let \(Y_{am}=Y(a,m)\) denote an outcome in treatment group \(a\in\{0,1\}\) with mediator M equal to \(m\in\mathbb{R}\), and let \(M(a)\) denote a mediator in treatment group \(a\). These outcomes and mediators may be counterfactual to what has been observed in the data. Put differently, \(Y(1,M(1))\) and \(Y(0,M(0))\) may be seen as the two potential outcomes of a single subject; of course, only one of the two can be realised. The (total natural) direct effect of the treatment on the outcome is defined as \[\text{NDE}=\E{Y(1,M(0)) - Y(0,M(0))}\]
and the (total natural) indirect effect is defined as \[\text{NIDE}=\E{Y(0,M(1)) - Y(0,M(0))}.\]
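The definitions above can be made concrete with a small simulation in which both potential mediators and all potential outcomes are generated for every subject, something that is never possible with real data. The sketch below is only an illustration: the linear data-generating mechanism, the function name `simulate_natural_effects` and the parameter names `theta1`, `theta2` and `beta1` are assumptions introduced here, not part of any cited method.

```python
import numpy as np


def simulate_natural_effects(n=200_000, theta1=1.0, theta2=0.5, beta1=2.0, seed=0):
    """Approximate NDE and NIDE by averaging simulated potential outcomes.

    Hypothetical data-generating mechanism (assumed for illustration only):
        M(a)    = beta1 * a + e_M
        Y(a, m) = theta1 * a + theta2 * m + e_Y
    """
    rng = np.random.default_rng(seed)
    e_m = rng.normal(size=n)  # subject-specific mediator noise
    e_y = rng.normal(size=n)  # subject-specific outcome noise

    def M(a):
        # Potential mediator under treatment a
        return beta1 * a + e_m

    def Y(a, m):
        # Potential outcome under treatment a and mediator value m
        return theta1 * a + theta2 * m + e_y

    # Definitions as given in the text
    nde = np.mean(Y(1, M(0)) - Y(0, M(0)))
    nide = np.mean(Y(0, M(1)) - Y(0, M(0)))
    total = np.mean(Y(1, M(1)) - Y(0, M(0)))
    return nde, nide, total


nde, nide, total = simulate_natural_effects()
print(f"NDE={nde:.3f}  NIDE={nide:.3f}  total effect={total:.3f}")
```

Because this assumed mechanism is linear and has no treatment-mediator interaction, the direct and indirect effects add up to the total effect; with an interaction term in `Y` this would in general no longer be the case for the two definitions used here.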
In the context where the causal effects are the target of statistical inference, they are referred to as the estimands. The next step is then to find good estimates of these estimands, which typically requires a connection between expectations of potential outcomes and conditional expectations as in the outcome model. To allow for this connection, a set of identification assumptions must hold (see e.g. \cite{vanderweele2015explanation}). With the outcome and mediator models of a SEM, these expectations can be expressed in terms of the model parameters, and upon plugging in the maximum likelihood estimators of the SEM, the (in)direct effects can be estimated. Despite the flexibility of this approach, it suffers from a very important drawback: correct estimates of the (in)direct effects can only be obtained if the two models in the linear SEM are correctly specified \cite{vanderweele2009conceptual}. This becomes particularly difficult with high-dimensional mediators, such as the taxa in the microbiome. Moreover, when the outcome is not continuous (e.g. binary) and nonlinear models are specified in a SEM, there is no longer a theoretical justification for using these models to estimate the causal mediation effects; see e.g. \cite{pearl2012causal} and references therein. The causal mediation literature has also proposed other estimation methods that require less stringent model specifications, e.g. natural effect models and inverse probability weighting \cite{lange2012simple}\cite{vansteelandt2012imputation}, G-estimation and imputation. Natural effect models are models for \(\E{Y_{aM(a^*)}}\) and hence they directly relate to the natural effects; these models reduce to marginal structural models when \(a=a^*\).
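As a worked illustration of the plug-in approach, consider a minimal linear SEM without covariates and without treatment-mediator interaction. The parameter symbols \(\theta_0,\theta_1,\theta_2,\beta_0,\beta_1\) are introduced here for illustration only and are not taken from the cited papers. Suppose the outcome and mediator models are
\[\E{Y \mid A=a, M=m}=\theta_0+\theta_1 a+\theta_2 m, \qquad \E{M \mid A=a}=\beta_0+\beta_1 a.\]
If the identification assumptions hold, \(\E{Y(a,M(a^*))}\) is obtained by averaging the outcome model over the mediator distribution in treatment group \(a^*\), so that
\[\E{Y(a,M(a^*))}=\theta_0+\theta_1 a+\theta_2(\beta_0+\beta_1 a^*).\]
Substituting this expression into the definitions of the direct and indirect effects gives
\[\text{NDE}=\theta_1, \qquad \text{NIDE}=\theta_2\beta_1,\]
and the total effect \(\E{Y(1,M(1)) - Y(0,M(0))}=\theta_1+\theta_2\beta_1\) is the sum of the two. These simple expressions rely on both linear models being correctly specified; with a treatment-mediator interaction or a nonlinear outcome model they no longer hold in this form.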
Bias in the estimators of the (in)direct effects may come from three sources. First, identification bias comes from violations of the identification assumptions. Second, model bias comes from misspecified statistical models (e.g. outcome and mediator models). Finally, estimation bias comes from inappropriate estimation procedures. \cite{diaz2020machine} mentions only two types of bias, having merged the first two types.
Multiple mediators and longitudinal studies
Two extensions of classical mediation analysis will be important for this project: (1) multiple mediators and (2) longitudinal data. Both topics are the subject of intense ongoing research in the causal mediation analysis community. When multiple mediators are involved, as in Figure \ref{558318}, the problem cannot be simplified by studying one mediator at a time, except under the unrealistic assumption that the mediators do not affect one another \cite{vanderweele2014mediation}. Model-based methods have been developed \cite{vanderweele2014mediation}, but correct inference requires correct model specification. With multiple mediators this may be hard, all the more so because these models must also be compatible (i.e. not contradict one another). Probability weighting \cite{vanderweele2014mediation} is an estimation method that may partly resolve this issue. Several methods for multiple mediators are only applicable when the mediators can be causally ordered \cite{daniel2015causal}\cite{steen2017flexible}, but this does not seem feasible when the high-dimensional microbiome is the mediator. \cite{vansteelandt2017interventional} relaxed the structural dependence requirement by shifting the attention to interventional (in)direct effects, which were first introduced by \cite{vanderweele2014effect} in the context of a single mediator. The estimation methods of \cite{loh2019interventional}, based on interventional effects models, still need the specification of the joint mediator distribution, but \cite{loh2020non} further relaxed the conditions so that only the marginal mediator distributions need to be modelled, making the method more appropriate for the microbiome as mediator.
When the microbiome is considered as a mediator, longitudinal studies should be set up so as to follow the dynamic response of the microbial community. Several causal longitudinal mediation analysis methods have been proposed \cite{bind2016causal}\cite{vanderweele2017mediation}\cite{zheng2018mediation}. We mention only two papers. The recent work of \cite{mittinty2019longitudinal} extended the natural effects models to the longitudinal setting. They use inverse probability weighting for parameter estimation and, for correct standard error calculation (accounting for the serial dependence), they propose a procedure that combines GEE with the parametric bootstrap. A second paper \cite{vanderweele2017mediation} focuses on interventional effects and uses G-estimation for parameter estimation. To our knowledge, there are still no methods for longitudinal mediation analysis with high-dimensional mediators.
In this brief literature overview of mediation analysis, we have so far avoided several other technical but important issues: (1) identifiability assumptions, (2) confounding and covariates, and (3) decomposition. Regarding (2), most of the papers cited in the previous paragraphs also consider the presence of confounders and covariates. In so far as these are observed, they can often be accounted for. Most problematic are unmeasured confounders: hardly any method can guarantee a correct causal conclusion in their presence, unless very restrictive assumptions are satisfied. This brings us to (1): all methods require identifiability conditions that need to be satisfied in order to make the (in)direct effects estimable. Throughout our literature overview, the definitions of the (in)direct effects changed. Roughly summarised, they moved from the effects of the traditional SEM framework, via natural effects, to interventional effects. One motivation for changing definitions is that it allows for relaxed identifiability assumptions, in particular when moving from a single mediator to multiple mediators and to time-varying exposures and mediators. Another motivation is related to (3): the decomposition of the total effect of the intervention into direct and indirect effects, so that e.g. the indirect (mediator) effect can be expressed as a percentage of the total effect. This was the original crux of mediation analysis, but even without a decomposition, the estimates of the (in)direct effects are just as informative. When moving from a single mediator to multiple mediators, and from cross-sectional to longitudinal settings, the changes in the definitions of the effects were necessary to allow for such decompositions.
Mediation Analysis for Microbiome Studies
\cite{zhang2019testing} proposed a method in the SEM framework \cite{mackinnon2012introduction} for microbiome data with a continuous outcome variable. Although their method allows for multiple mediators (taxa), their inference is targeted towards a single taxon. Their technical focus is on the high dimensionality and the compositionality of the mediator. To tackle the latter, they apply the isometric log-ratio transform to the counts; the high dimensionality is resolved by first fitting the outcome model with the lasso and then de-biasing the lasso estimator that corresponds to the targeted taxon \cite{zhang2014confidence}. Their method focuses only on testing the mediator effect on the outcome; it does not directly address the distinction between the direct and indirect effects, nor do the authors consider the sparseness of typical microbiome data. \cite{sohn2019compositional} also focused on testing the mediation effect. They also followed the SEM path and also used a lasso-type penalised estimation method for the outcome model to deal with the high dimensionality; this method was developed specifically for high-dimensional compositional regressors \cite{shi2016regression} and also includes a de-biasing step. The compositionality of the microbiome mediator is also accounted for in the mediator model, which models the full vector of additive log-ratio transformed counts by assuming that this vector is approximately normally distributed. The bootstrap is used for hypothesis testing. Yet another recent SEM-based mediation analysis method for microbiome studies was proposed by \cite{wang2020estimating}. Their outcome model includes the high-dimensional taxa data as additive log-ratios, but they also include the interaction effects between these taxa and the treatment.
The model parameters are estimated using a penalised least squares method \cite{radchenko2010variable} that involves two penalty terms to address both the hierarchy of main and interaction effects in the model and the high dimensionality of the mediator. The model for the high-dimensional mediator makes use of the Dirichlet distribution; parameters are estimated by maximising the log-likelihood function with lasso penalty terms. The latter help in selecting the taxa that are most affected by the treatment. In contrast to the two previous methods, \cite{wang2020estimating} do not use the model parameter estimates directly for statistical inference. They continue by defining the average causal direct effect of the treatment and the average mediation effect of the microbiome on the outcome, as in the counterfactual causal framework for mediation analysis \cite{vanderweele2014mediation}. The parameter estimates of the two models are subsequently used for estimating these two effect sizes. For hypothesis testing of the mediation effect, a permutation scheme is used. The counterfactual paradigm has also been used by \cite{li2019mediation} in a similar fashion. However, their models consider only a single mediator (and hence the mediation analysis has to be repeated for each taxon separately) and account for the many zero counts by separating zeroes from non-zeroes. For the mediator model, they consider zero-inflated (ZI) distributions, such as ZI Poisson (ZIP), ZI Beta and ZI Lognormal. Maximum likelihood is used for parameter estimation, and these estimates are plugged into expressions for the direct and indirect effects as defined in the counterfactual framework. Resampling methods are used for p-value calculation.
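Several of the SEM-based methods above work with log-ratio transformed counts. As a minimal sketch of what such a transform does, the function below computes additive log-ratios relative to a reference taxon. The function name `alr`, the choice of the last taxon as reference and the pseudocount of 0.5 used to avoid taking the logarithm of zero are assumptions made here for illustration; they are not the choices of the cited authors.

```python
import numpy as np


def alr(counts, pseudocount=0.5, ref=-1):
    """Additive log-ratio transform of a samples-by-taxa count matrix.

    `pseudocount` (an illustrative assumption) is added to every count so
    that zero counts do not produce log(0). `ref` is the column index of
    the reference taxon; it is dropped from the output.
    """
    x = np.asarray(counts, dtype=float) + pseudocount
    comp = x / x.sum(axis=1, keepdims=True)    # relative abundances per sample
    log_comp = np.log(comp)
    ref_col = log_comp[:, [ref]]               # keep 2-D shape for broadcasting
    others = np.delete(log_comp, ref, axis=1)  # all taxa except the reference
    return others - ref_col                    # log(x_j / x_ref)


counts = np.array([[10, 0, 70],
                   [5, 25, 20]])
print(alr(counts))
```

The output has one column fewer than the input, which reflects the fact that compositional data with p taxa carry only p - 1 free dimensions. Note that the result depends on the pseudocount whenever zeroes are present, which is one reason why the sparseness of microbiome data deserves explicit attention.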
\cite{zhang2018distance} also start from a SEM, but instead of imposing a penalty in the estimation of the parameters in the high-dimensional models, they propose to apply a dimension reduction method to the microbiome data, to include the resulting low-dimensional constructs in the outcome model, and to use these constructs as outcome variables in the mediator model. The dimension reduction is accomplished by applying metric multidimensional scaling to an appropriate distance matrix. They consider distance metrics that are frequently used in ecology and microbiome research: Jaccard, Bray-Curtis, and unweighted, weighted and generalised UniFrac. The latter three metrics account for the taxonomic relations between the taxa. The mediation effect is tested by means of permutation testing.
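The dimension reduction step described above can be sketched as follows, using the Bray-Curtis dissimilarity and classical (Torgerson) multidimensional scaling. This is only a minimal illustration under stated assumptions: the function names `bray_curtis` and `classical_mds` are introduced here, classical scaling is used as one common form of metric multidimensional scaling, and the cited method may differ in its details. Every sample is assumed to have a non-zero total count.

```python
import numpy as np


def bray_curtis(counts):
    """Pairwise Bray-Curtis dissimilarities between the rows (samples)."""
    x = np.asarray(counts, dtype=float)
    num = np.abs(x[:, None, :] - x[None, :, :]).sum(axis=2)
    den = (x[:, None, :] + x[None, :, :]).sum(axis=2)
    return num / den  # assumes no pair of samples is entirely zero


def classical_mds(dist, k=2):
    """Classical multidimensional scaling of a distance matrix to k axes."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering  # double centering
    eigval, eigvec = np.linalg.eigh(gram)           # ascending eigenvalues
    order = np.argsort(eigval)[::-1][:k]            # indices of the k largest
    lam = np.clip(eigval[order], 0.0, None)         # guard against negatives
    return eigvec[:, order] * np.sqrt(lam)          # sample coordinates


counts = np.array([[10, 0, 70],
                   [5, 25, 20],
                   [0, 40, 10],
                   [12, 3, 60]])
coords = classical_mds(bray_curtis(counts), k=2)
print(coords.shape)
```

The columns of `coords` play the role of the low-dimensional constructs: they could enter an outcome model as regressors and a mediator model as responses. Because Bray-Curtis is not a Euclidean distance, some eigenvalues of the centred matrix can be negative; clipping them at zero is a simplification made in this sketch.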
All microbiome-specific methods from the previous paragraph may suffer from serious validity issues, as discussed in some more detail earlier in this project proposal. First, many methods rely on the SEM framework, which is, however, only valid if all models are correctly specified and only for continuous outcome variables \cite{vanderweele2009conceptual}. Correct model formulation becomes particularly hard when high-dimensional taxa data are involved. A taxon-by-taxon approach, as in \cite{li2019mediation}, has also been demonstrated to be invalid \cite{vanderweele2009conceptual}. In conclusion, most methods are model-based and do not rely on modern causal mediation analysis results.