Between workflow caching
While data analysis usually entails the handling of a set of datasets or samples that are specific to a particular project, it often additionally relies on a set of steps that retrieve and post-process common datasets. In the life sciences, for example, such datasets include reference genomes and their corresponding annotations. Since these datasets potentially recur across many analyses conducted in a lab or institute, re-executing the retrieval and post-processing steps for each individual data analysis would waste both disk space and computation time.
Historically, the solution in practice was to build up shared resources containing the post-processed datasets, which could then be referred to from the workflow definition. For example, in the life sciences, this has led to the Illumina iGenomes resource (
https://support.illumina.com/sequencing/sequencing_software/igenome.html) and the GATK resource bundle (
https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle). In addition, in order to provide a more flexible way to select and retrieve such shared resources, so-called reference management systems have been published, such as Go Get Data (
https://gogetdata.github.io) and RefGenie (
http://refgenie.databio.org). Here, the logic for retrieval and post-processing is curated in a set of recipes or scripts, and the resulting resources can be retrieved automatically via command-line utilities. The downside of all these approaches is that they hamper the transparency of the data analysis: the steps taken to obtain the used resources are hidden away and inaccessible to the reader without additional work.
Snakemake provides a new, generic approach to this problem that does not have this downside (see Fig. \ref{284352}). Leveraging the workflow-inherent information, Snakemake can calculate a hash value for each job that unambiguously captures exactly how an output file is generated, prior to actually generating the file. This hash can be used to store and look up output files in a central cache (e.g., a folder on the same machine or in a remote storage). Hence, for any output file in a workflow, if the corresponding rule is marked as eligible for caching, Snakemake can obtain the file from the cache if it has already been created by a different workflow, or by a different user on the same system, thereby saving computation time as well as disk space (the file can be linked instead of copied).