These notes are mostly a copy of the use-case and notes I posted to the project mailing list some time ago, reposted here so they are easier to reference.
I had a very interesting meeting with Alistair Miles and his team at the Wellcome Trust Centre for Human Genetics in Oxford. Mostly, we were looking at what would be needed to replicate part of their SNP-discovery workflow in a different environment. Out of this, some requirements for workflow preservation emerged, which I try to articulate in the next section.
Use case: SNP variation discovery workflow preservation
The SNP discovery processing pipeline comprises a number of stages, starting with raw paired reads from a number of Plasmodium falciparum samples produced by an Illumina sequencing system. The samples are (re)aligned against the reference genome, processed through a potential-SNP-detection phase, and then passed through a SNP-filtering pipeline that aims to identify a set of "high quality" SNP sites whose genotype readings can be correlated with drug-resistance characteristics of malarial infection.
The processing is currently performed on a cluster computing facility at the Sanger Institute, using a combination of standard bioinformatics community software tools and some specialized Perl scripts. Data in the pipeline is largely in the form of TSV files that are streamed (piped) between elements of the pipeline.
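To make the pipe-based architecture concrete, here is a minimal sketch of what one such streaming stage might look like: a filter that reads TSV rows on stdin and writes the rows that pass a threshold to stdout, so it can be composed with other stages via shell pipes. The column position, column meaning, and threshold are illustrative assumptions, not details of the actual Sanger pipeline (which uses Perl scripts and community tools).

```python
import sys

# Hypothetical filter stage for a piped TSV workflow.
# Assumption: column 5 (0-based) holds a numeric quality score;
# both the column index and the threshold are invented for illustration.
QUAL_COLUMN = 5
QUAL_THRESHOLD = 30.0

def filter_tsv(in_stream, out_stream):
    """Copy the header line, then pass through only rows meeting the threshold."""
    header = next(in_stream)
    out_stream.write(header)
    for line in in_stream:
        fields = line.rstrip("\n").split("\t")
        if float(fields[QUAL_COLUMN]) >= QUAL_THRESHOLD:
            out_stream.write(line)

if __name__ == "__main__":
    filter_tsv(sys.stdin, sys.stdout)
```

A stage like this would sit in a shell pipeline (e.g. `detect_snps | python filter.py | summarize`), which is what makes each step individually replaceable or tunable.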
The original data consists of over 1000 sequencing samples of a roughly 25 Mbp genome at 100x coverage. With additional information and data coding overheads, this amounts to about 2-3 GB/sample, or several TB of raw input data overall. Similar pipelines for organisms other than P. falciparum would be dealing with orders of magnitude more data (e.g. Anopheles ~250 Mbp (?), Human ~2.9 Gbp). The storage and processing needs of such datasets lead to processing pipelines that tend to be idiosyncratically tuned to a particular execution environment, in order to be runnable within that environment, giving rise to additional challenges for workflow preservation.
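As a back-of-envelope check on these figures (genome size, coverage, and per-sample size are from the text; the mid-estimate of 2.5 GB/sample is my own rounding):

```python
# Back-of-envelope check of the data volumes quoted above.
samples = 1000
genome_mbp = 25        # P. falciparum genome, ~25 Mbp
coverage = 100         # 100x read coverage

# Raw bases sequenced per sample: ~25 Mbp * 100 = 2.5 Gbp,
# which at roughly a byte per base plus quality/metadata overhead
# is consistent with the quoted 2-3 GB/sample.
bases_per_sample_gbp = genome_mbp * coverage / 1000

# Total input at a ~2.5 GB/sample mid-estimate.
total_tb = samples * 2.5 / 1000

print(bases_per_sample_gbp)  # 2.5
print(total_tb)              # 2.5 -> "several TB" overall
```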
Suggested preservation requirements
For an initial one-week scoping exercise, we decided to focus on the filtering pipeline, as it deals with relatively modest volumes of data and does not depend on external software.
In the process of discussing this, a number of possible workflow preservation requirements were raised that I do not believe have been noted in our own requirements capture:
- an important goal of workflow preservation and reproducibility is to be able to move workflows from one execution environment to another, adapting to different resource constraints.
- provenance traces used to compare workflow executions should be augmented with resource usage traces that can be used to assess whether a given workflow execution is runnable in a given target execution environment. (It was specifically noted that it could be hard to work with a priori resource limitations or usage estimates.)
- the RO encapsulating a workflow execution could usefully contain a "manifest" that parameterizes the execution, and can be used to tune the workflow to a particular execution environment or to allow variations to be tried easily.
- quality assessments should be stored 1:1 with corresponding "manifest" settings, so that possible tuning factors affecting output quality can be isolated.
- a possible requirement for the kind of partial reproducibility discussed in the iPRES paper submission. Specifically, having high-volume and/or processing-intensive steps performed once, with their results certified and re-used, could prove very useful when trying workflow variations to improve output quality.
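The "manifest" idea above might amount to no more than a small structured file of tuning parameters stored alongside the RO, paired with the quality assessment for that run. A purely illustrative sketch follows: every field name and value is invented for the example, not drawn from any actual RO model or the Sanger pipeline.

```python
import json

# Purely illustrative manifest sketch: field names and values are
# invented examples, not part of any actual RO specification.
manifest = {
    "workflow": "snp-filtering-pipeline",
    "environment": {
        "max_memory_gb": 16,    # resource constraint of the target environment
        "worker_processes": 4,
    },
    "parameters": {
        "qual_threshold": 30.0,  # tuning factor that may affect output quality
        "min_read_depth": 10,
    },
}

def load_manifest(text):
    """Parse a serialized manifest and return its tuning parameters."""
    data = json.loads(text)
    return data["parameters"]

serialized = json.dumps(manifest, indent=2)
```

Keeping the environment constraints and tuning parameters in one declarative file is what would let a quality assessment be stored 1:1 with the exact settings that produced it, as suggested above.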