\section{Introduction}
Data analysis has become ubiquitous across scientific disciplines. At the same time, ensuring the reproducibility of data analyses has been identified as a major challenge
\cite{Mesirov2010,Baker2016,Munaf__2017}. In consequence, recent years have seen a wide adoption of scientific workflow management systems by the community. Countless such systems have been published (see
https://github.com/pditommaso/awesome-pipeline). Roughly speaking, these can be partitioned into five niches, whose major representatives we highlight below.
First, workflow management systems like Galaxy \cite{Afgan2018} offer a graphical user interface for the composition and execution of workflows. The obvious advantage is the shallow learning curve, which makes such systems accessible to everyone, without requiring programming skills.
Second, with systems like Anduril
\cite{Cervera2019}, Balsam
\cite{papka2018}, Hyperloom
\cite{cima2018hyperloom}, Jug
\cite{Coelho_2017}, Pwrake
\cite{Tanaka_2010}, Ruffus
\cite{Goodstadt2010}, SciPipe
\cite{Lampa2019}, SCOOP
\cite{SCOOP_XSEDE2014}, and COMPSs
\cite{Lordan_2013}, workflows are specified using a set of classes and functions for generic programming languages like Python, Scala, and others. Such systems have the advantage that they can be used without a graphical interface (e.g. in a server environment), and that workflows can be straightforwardly managed with version control systems like Git (
https://git-scm.com).
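To give a flavour of this second niche, the following is a minimal, hypothetical sketch (it is not the API of any of the systems cited above): tasks are registered as plain Python functions that declare their inputs and outputs, and a small scheduler executes them in dependency order.

```python
from typing import Callable


class Task:
    def __init__(self, name: str, inputs: list, outputs: list,
                 action: Callable[[], None]):
        self.name, self.inputs, self.outputs = name, inputs, outputs
        self.action = action


class Workflow:
    def __init__(self):
        self.tasks = []

    def task(self, inputs, outputs):
        """Decorator that registers a function as a workflow task."""
        def register(func):
            self.tasks.append(Task(func.__name__, inputs, outputs, func))
            return func
        return register

    def run(self):
        """Execute tasks in dependency order (inputs must be produced first)."""
        done, order, pending = set(), [], list(self.tasks)
        while pending:
            for t in pending:
                # a task is ready once all of its inputs have been produced
                if all(i in done for i in t.inputs):
                    t.action()
                    done.update(t.outputs)
                    order.append(t.name)
                    pending.remove(t)
                    break
            else:
                raise RuntimeError("unresolvable dependencies")
        return order


wf = Workflow()


@wf.task(inputs=[], outputs=["raw.txt"])
def download():
    pass  # e.g. fetch raw data


@wf.task(inputs=["raw.txt"], outputs=["clean.txt"])
def clean():
    pass  # e.g. filter the raw data


print(wf.run())  # prints ['download', 'clean']
```

Real systems in this niche add far more (file-based change detection, parallel and cluster execution, caching), but the core idea of expressing a workflow through the facilities of a generic programming language is the same.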
Third, with systems like Nextflow \cite{Di_Tommaso_2017}, Snakemake \cite{Köster2012}, BioQueue \cite{Yao2017}, Bpipe \cite{Sadedin2012}, ClusterFlow \cite{Ewels2016}, Cylc \cite{J_Oliver_2018}, and BigDataScript \cite{Cingolani_2014}, workflows are specified using a domain specific language (DSL). Here, the advantages of the second niche are shared, with the additional benefit of improved readability: the DSL provides statements and declarations that specifically model the central components of workflow management, obviating superfluous operators and boilerplate code. In the case of Nextflow and Snakemake, where the DSL is implemented as an extension to a generic programming language (Groovy and Python, respectively), access to the full power of the underlying programming language is maintained (e.g. for implementing conditional execution or handling configuration).
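To illustrate such a DSL, the following is a minimal Snakemake rule (the \texttt{plot-csv} command and the file paths are hypothetical). The rule directly declares a step's input, output, and shell command, while the surrounding workflow file remains ordinary Python:

```
rule plot:
    input:
        "data/{sample}.csv"
    output:
        "plots/{sample}.pdf"
    shell:
        "plot-csv {input} {output}"
```

The wildcard \texttt{\{sample\}} lets the same rule apply to any matching file, so the dependency structure of the workflow is inferred from the declared inputs and outputs rather than spelled out by hand.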
Fourth, with systems like Popper \cite{Jimenez_2017}, workflows are specified in a purely declarative way via configuration file formats like YAML. Here, most of the benefits of the third niche are shared, and workflow specifications can be particularly readable for non-developers. However, this comes with the downside of being more restrictive, since the facilities of imperative or functional programming are not available.
Fifth, there are system-independent workflow specification languages like CWL
\cite{cwl} and WDL
\cite{voss_full-stack_2017}. These define a (declarative) syntax for specifying workflows, which can be parsed and executed by arbitrary executors, e.g. Cromwell (
https://cromwell.readthedocs.io), Toil
\cite{Vivian_2017} and Tibanna
\cite{Lee_2019}. Here, a main advantage is that the same workflow definition can be executed on various specialized execution backends, thereby promising scalability to virtually any computing platform.
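As an illustration of such a system-independent specification, a minimal CWL tool description (following the canonical hello-world pattern of the CWL v1.0 user guide) wraps a single command in a purely declarative way:

```yaml
cwlVersion: v1.0
class: CommandLineTool
baseCommand: echo
inputs:
  message:
    type: string
    inputBinding:
      position: 1
outputs: []
```

Because the document itself carries no execution logic, any conforming executor (e.g. Cromwell, Toil, or Tibanna) can interpret it on its respective computing platform.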
Today, several of the above-mentioned systems (e.g. Galaxy, Nextflow, Snakemake, WDL, CWL) support full in silico reproducibility of data analyses by allowing the definition and automatic, scalable execution of each involved step, together with the ability to define and automatically deploy the software stack needed for each step (e.g. via the Conda package manager,
https://docs.conda.io, or Docker containers,
https://www.docker.com).
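For example, the software stack of a single step can be pinned in a Conda environment file; the package choice and versions below are illustrative:

```yaml
# environment.yaml: declares the software stack for one analysis step
name: mapping-step
channels:
  - conda-forge
  - bioconda
dependencies:
  - bwa =0.7.17
  - samtools =1.9
```

Since such a file fixes explicit package versions drawn from public channels, the same environment can be re-created automatically on another machine, which is what makes the analysis step reproducible in practice.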
Reproducibility is important to generate trust in scientific results. However, we postulate that a truly sustainable data analysis needs to consider a full hierarchy of interdependent aspects (see Fig. \ref{374653}).