Microbiome Data Analysis

Many statistical methods for microbiome data rely on distributional assumptions, e.g. the negative binomial (NB) or the zero-inflated NB (ZINB). However, we have recently demonstrated that these assumptions do not hold for the majority of the taxa \cite{hawinkel2020sequence}; this explains the poor performance of (ZI)NB-based methods with respect to their control of the false discovery rate (FDR) \cite{hawinkel2019broken}. Other methods rely on compositional data analysis (CoDa), a well-established class of methods with a solid mathematical foundation \cite{gloor2017microbiome,aitchison1982statistical}. However, most CoDa methods start from log-transforms of (ratios of) counts, which are problematic for zero counts, and zeros are highly abundant in typical microbiome datasets.
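To illustrate the zero problem, consider the centred log-ratio (CLR) transform, $\mathrm{clr}(x)_i = \log x_i - D^{-1}\sum_{j=1}^{D}\log x_j$, which underlies many CoDa methods. A minimal sketch in Python (the pseudocount of 0.5 is one common ad hoc fix, not a recommendation of any of the cited methods):
\begin{verbatim}
import numpy as np

def clr(counts, pseudocount=0.0):
    """Centred log-ratio transform of a single composition."""
    x = counts + pseudocount
    logx = np.log(x)            # log(0) = -inf: breaks down on zeros
    return logx - logx.mean()

counts = np.array([120, 35, 0, 7, 0])   # zero counts are typical for taxa
# clr(counts)                           # yields -inf/nan entries
print(clr(counts, pseudocount=0.5))     # finite, but pseudocount-dependent
\end{verbatim}
The transformed values, and hence any downstream analysis, depend on the arbitrary pseudocount, which is precisely why zero handling remains a weak spot of the log-ratio approach.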

Longitudinal Data Analysis

Most statistical methods for microbiome data have been developed for independently distributed observations. However, complex experimental designs and longitudinal studies may result in observations with a particular dependence structure, for which other statistical methods are needed for correct statistical inference. The importance of longitudinal studies is stressed by several authors (e.g. \cite{schmidt2018human,mai2016moving}), particularly because they are needed for studying the dynamics of microbial species, for discovering microbiome biomarkers and for establishing causal relationships. Several methods for longitudinal data analysis of microbiome studies have been proposed, but these again rely on unrealistic assumptions. For example, \cite{chen2016two} proposed a two-component model based on the zero-inflated beta distribution (its generic form is recalled below). This method was made more flexible by \cite{ho2019metamicrobiomer}, but it still relies on zero-inflated beta families. The methods of \cite{zhang2018negative} and \cite{zhang2020fast} rely on the NB and the ZINB distribution, respectively. All these authors have performed simulation studies from which they conclude that their methods work well, but they all commit the same fallacy: they simulated data from the same model their method is based on, which results in a self-fulfilling prophecy. For the evaluation of statistical tests for differential abundance in the simple two-sample setting, simulation methods for generating realistically distributed count data have been proposed \cite{hawinkel2019broken,assefa2020spsimseq}, but these are not yet applicable to longitudinal study designs. \cite{williams2019microbiomedasim} have developed a simulation framework for longitudinal microbiome data, but it is based on the multivariate normal distribution.
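To make the distributional assumption concrete: up to parametrisation details, the zero-inflated beta model underlying \cite{chen2016two} and \cite{ho2019metamicrobiomer} takes the generic form
\[
f(y) = \pi \, \mathrm{I}(y = 0) \;+\; (1-\pi)\, \frac{y^{a-1}(1-y)^{b-1}}{B(a,b)} \, \mathrm{I}(0 < y < 1),
\]
where $y$ denotes a relative abundance, $\pi$ the zero-inflation probability and $(a,b)$ the beta shape parameters; covariates enter through link functions on $\pi$ and the beta mean. This is the generic textbook form; the exact parametrisations in the cited papers differ in their details.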
Other methods for the analysis of longitudinal microbiome data have been developed within the CoDa paradigm. \cite{xu2020generalized} proposed a two-part model based on a zero-inflated multivariate normal distribution, which should approximately hold for the log-ratio transformed counts. The between-taxa correlation structure is deduced from the taxonomic relationships, and the serial correlation is assumed to be exchangeable. The parameters are estimated as in the semiparametric GEE (generalised estimating equations) method, in which the correlation structure is part of the working correlation matrix. Valid asymptotic inference is thus guaranteed, even if the working correlation structure is misspecified; on the other hand, the parameter estimators are only semiparametrically efficient if the correlation structure is correctly specified. Their simulation study makes use of the multivariate normal distribution, which is consistent with their estimating equations, and therefore they commit the same fallacy as mentioned earlier.
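For reference, the generic GEE estimating equations that \cite{xu2020generalized} build on are
\[
\sum_{i=1}^{n} D_i^\top V_i^{-1} \big( Y_i - \mu_i(\beta) \big) = 0,
\qquad
V_i = A_i^{1/2} R(\alpha) A_i^{1/2},
\]
with $D_i = \partial \mu_i / \partial \beta$, $A_i$ the diagonal matrix of marginal variances and $R(\alpha)$ the working correlation matrix (here exchangeable, i.e. with constant off-diagonal elements). The robustness and efficiency statements above are generic GEE properties: consistency of $\hat{\beta}$ does not require $R(\alpha)$ to be correctly specified, but efficiency does.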
A completely different approach has been taken by \cite{wang2020microbial}. Their method, Microbial Trend Analysis (MTA), combines splines for modelling the time trend with principal component analysis for dimension reduction, and can be represented as a penalised matrix decomposition problem. In contrast to the methods discussed earlier, MTA can be used for studying the time trend of a community rather than of individual taxa. The bootstrap is used for p-value calculation, and hence no distributional assumptions are imposed on the count data. The performance of the method is empirically evaluated in a simulation study in which the relative abundance data were generated from a Dirichlet distribution, which is not very realistic because it can only induce negative correlations between taxa (see the covariance formula below). The methods of \cite{shields2018splinectomer} and \cite{luo2017informative} also make use of splines for nonparametrically modelling the time trend. The former makes use of permutations for enumerating the null distribution and can be used for comparing time trends between groups; the latter can also identify time intervals in which the time trends differ between groups. The nonparametric nature of these methods implies that no continuous covariates can be included and that no clearly interpretable parameter estimates are provided.
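Returning to the Dirichlet simulations of \cite{wang2020microbial}: the negative-correlation property follows directly from the Dirichlet covariance. For $X \sim \mathrm{Dirichlet}(\alpha_1, \ldots, \alpha_D)$ with $\alpha_0 = \sum_k \alpha_k$,
\[
\operatorname{Cov}(X_i, X_j) = \frac{-\alpha_i \alpha_j}{\alpha_0^2 (\alpha_0 + 1)} < 0
\qquad (i \neq j),
\]
so positively co-occurring taxa, which are abundant in real microbial communities, cannot be generated.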

Nonparametric and Semiparametric Methods

For the i.i.d. setting in which two groups are compared, several simulation studies have indicated that rank tests, such as the Wilcoxon rank sum test or SAMSeq \cite{li2013finding}, applied to appropriately normalised data work well in the sense that they control the FDR at the nominal level. These nonparametric tests, however, also suffer from a few disadvantages. First, they are not applicable to studies with a design more complicated than the comparison of two or more groups; for example, the inclusion of a continuous covariate or confounder is not possible. Second, these procedures are specifically developed for hypothesis testing and do not come with a corresponding estimate of a relevant effect size. Parametric methods, on the other hand, can be used for testing hypotheses that are explicitly formulated in terms of e.g. (log) fold changes, and they also provide estimates of this effect size. Finally, rank tests are often much less sensitive than their parametric competitors; this is the price these methods pay for being nonparametric.
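As an illustration of the workflow these simulation studies evaluate, a minimal sketch (not any specific published pipeline) of per-taxon Wilcoxon rank sum tests, i.e. Mann-Whitney U tests, with Benjamini-Hochberg FDR control:
\begin{verbatim}
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
n1, n2, n_taxa = 20, 20, 200
# toy data; in practice these would be appropriately normalised counts
group1 = rng.lognormal(sigma=1.0, size=(n1, n_taxa))
group2 = rng.lognormal(sigma=1.0, size=(n2, n_taxa))

pvals = np.array([mannwhitneyu(group1[:, j], group2[:, j]).pvalue
                  for j in range(n_taxa)])
# Benjamini-Hochberg aims to keep the FDR at the nominal 5% level
reject, padj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(f"{reject.sum()} taxa flagged at FDR 0.05")
\end{verbatim}
Note that the test only returns p-values: no effect size with a clear interpretation accompanies a rejection, which is precisely the second disadvantage mentioned above.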
Probabilistic Index Models (PIMs) are a class of semiparametric models \cite{thas2012probabilistic} that (1) generate the classical rank tests \cite{de2015regression}, (2) provide estimates of corresponding and informative parameters (effect sizes), and (3) possess the flexibility of regression models and hence allow for the inclusion of covariates. Although the original theory is asymptotic in nature, hypothesis tests can often be implemented as permutation tests, or empirical likelihood methods can be used, with good results for sample sizes as small as 25 \cite{amorim2018small}. Thus, PIMs resolve the first two disadvantages listed for the classical nonparametric rank tests. Regarding the third disadvantage (low sensitivity), semiparametric methods may bring a solution. Using semiparametric efficiency theory \cite{tsiatis2007semiparametric}, the estimation method can be designed so as to provide semiparametrically efficient parameter estimators within a well-defined class of semiparametric models, restricted to asymptotically linear estimators. The latter class can be considered a compromise between the very wide class of nonparametric models and the very restrictive class of parametric models. For PIMs these semiparametrically efficient estimators have been constructed and studied by \cite{stijn}. Their simulation study, however, indicated that, compared to the original estimator of \cite{thas2012probabilistic}, only a small gain in precision could be obtained, and this came at a strongly increased computational cost.
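For concreteness, a PIM models the probabilistic index directly: in its most common form \cite{thas2012probabilistic},
\[
P\big( Y \preccurlyeq Y^* \mid X, X^* \big) = \operatorname{expit}\big\{ (X^* - X)^\top \beta \big\},
\]
where $(Y, X)$ and $(Y^*, X^*)$ are independent outcome-covariate pairs and $P(Y \preccurlyeq Y^*) = P(Y < Y^*) + \tfrac{1}{2} P(Y = Y^*)$. The parameter $\beta$ quantifies how the probability that one outcome exceeds another shifts with the covariates, which is exactly the interpretable effect size that the classical rank tests lack.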
Recent advances in semiparametric estimation theory and machine learning have resulted in algorithm-driven procedures for the construction of optimal statistical procedures. \cite{luedtke2020learning}, for example, developed a deep adversarial learning method for constructing optimal minimax estimators, without the need for data and without restriction to asymptotically linear estimators. These methods are very useful in settings where a parametric model is known in advance but the optimal estimator is very hard to construct analytically. When data are available, adaptive semiparametrically efficient estimators may be constructed as in \cite{bickel1993efficient} or by the method of targeted maximum likelihood learning \cite{van2006targeted,van2018targeted}. These methods adapt the estimator to the data to be analysed. A disadvantage these procedures have in common is that they typically require a large sample size. In high-dimensional or large-scale settings, as in microbiome studies, we are in a different situation: we have data on many similar features (taxa), but for each feature we only have a small number of observations. This largely prohibits applying adaptive procedures to each individual feature; on the other hand, information can be shared among the features, particularly because it is realistic to assume that the features have distributions that are not very dissimilar.