Blog RO provenance querying by user Marco
Example question from the user view on provenance table:
As a researcher working on a Live RO, when I click on a workflow I would like to see:
Previous runs, when they were run, who ran them, summary of results and comments on the run.
Can I do this with SPARQL?
Part I - Querying the Allegrograph/WINGS Knowledge Base ("naive" validation)
In this part I use the WINGS example RDF provided via the endpoint http://wind.isi.edu:10035/catalogs/java-catalog/repositories/WINGSTemplatesAndResults, because this was the first reference to an endpoint that I found on the Showcase 22 page. Initially, I looked only at what I could find at this SPARQL endpoint and did not consult further documentation on the Showcase 22 page or elsewhere. In Part II, I did consult more information; for example, I found out that Showcase 22 uses a second endpoint that contains the wf4ever model artefacts. This makes questions 1, 2, 4, and 5 in my notes below obsolete.
Objective 1: Find previous runs of a workflow.
Assuming that the sample data will contain one workflow, I expect to find the URL for one run of one workflow.
My first step is to try to find the reference for the workflow run. I cannot find 'run' anywhere directly, so I started with a result (as that must be the result of a run). I found an instance of the class SMAPComparisonResults and used that as my starting point. It wasGeneratedBy the ProcessInstance COMPARELIGANDBINDINGSITESV211332778615941, which hasProcessTemplate ABSTRACTSUBWFLIGANDBINDINGSITESCOMPARISON_COMPARELIGANDBINDINGSITESV21. This is an instance of ProcessTemplate with label "Process template CompareLigandBindingSitesV21".
Now I wish to find out which workflow template the ProcessTemplate ABSTRACTSUBWFLIGANDBINDINGSITESCOMPARISON_COMPARELIGANDBINDINGSITESV21 is a component of, so that I can find the overall workflow.
I performed this query to find the subject that hasTemplateComponent ABSTRACTSUBWFLIGANDBINDINGSITESCOMPARISON_COMPONENT2
The name suggests that we are again dealing with a subworkflow, so I try the above again:
Apparently, ABSTRACTSUBWFLIGANDBINDINGSITESCOMPARISON_COMPARELIGANDBINDINGSITESV21 is not a TemplateComponent of anything. I conclude that this must have been the workflow that was run to produce the SMAPComparisonResults that I started with.
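The query text is not reproduced in these notes. As a minimal sketch of this kind of reverse lookup, the Python helper below builds a SPARQL query that lists every subject pointing at a given object through a given predicate. The two URIs are hypothetical placeholders (the real namespace of hasTemplateComponent and of the template should be copied from the endpoint); an empty result for such a query is what would support the conclusion above.

```python
def subjects_query(predicate_uri, object_uri):
    """Build a SPARQL query listing every subject linked to object_uri via predicate_uri."""
    return (
        "SELECT DISTINCT ?subject WHERE {\n"
        f"  ?subject <{predicate_uri}> <{object_uri}> .\n"
        "}"
    )

# Hypothetical placeholder URIs: substitute the full URIs shown by the endpoint.
HAS_TEMPLATE_COMPONENT = "http://example.org/wings#hasTemplateComponent"
TEMPLATE = (
    "http://example.org/wings/"
    "ABSTRACTSUBWFLIGANDBINDINGSITESCOMPARISON_COMPARELIGANDBINDINGSITESV21"
)

query = subjects_query(HAS_TEMPLATE_COMPONENT, TEMPLATE)
print(query)
```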
produces the same list of 38 items, showing that the range of hasProcessTemplate was limited to ProcessInstances in this repository.
Part II - Querying the Wf4ever Knowledge Base
''In search of how to query the wf4ever RO model, I now consulted the Showcase 22 documentation on http://www.wf4ever-project.org/wiki/display/docs/Showcase+22+Querying+workflow+execution+provenance''
I found a reference to another SPARQL endpoint on the Showcase 22 wiki page: http://test-wf4ever.isoco.com/test/
Indeed it seems that the example queries run here. Unfortunately, the endpoint does not seem to have a simple UI for its results. I saved and displayed each query result manually.
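Displaying saved results by hand can be partly automated. The sketch below assumes that the endpoint can return the standard SPARQL JSON results format (I did not verify this for this endpoint) and turns a saved result document into a small tab-separated table. The sample document is made up for illustration and is not output from the repository.

```python
import json

def bindings_to_rows(results_json):
    """Flatten a SPARQL JSON result document into a list of {variable: value} rows."""
    doc = json.loads(results_json)
    variables = doc["head"]["vars"]
    return [
        # Unbound variables are simply absent from a binding; show them as "".
        {var: binding.get(var, {}).get("value", "") for var in variables}
        for binding in doc["results"]["bindings"]
    ]

def format_table(rows):
    """Render rows as tab-separated text with a header line."""
    if not rows:
        return "(no results)"
    header = list(rows[0].keys())
    lines = ["\t".join(header)]
    lines += ["\t".join(row[col] for col in header) for row in rows]
    return "\n".join(lines)

# Made-up sample in the SPARQL JSON results layout (not real repository output).
SAMPLE = """
{"head": {"vars": ["s", "p"]},
 "results": {"bindings": [
   {"s": {"type": "uri", "value": "http://example.org/run1"},
    "p": {"type": "uri", "value": "http://example.org/prop"}},
   {"s": {"type": "uri", "value": "http://example.org/run2"}}
 ]}}
"""

rows = bindings_to_rows(SAMPLE)
print(format_table(rows))
```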
gives me these results:
Two queries to learn about the predicates associated with the wfruns (query4: RDF):
and (query5: RDF):
The results are saved in the files associated with this blog. At first glance, the repository indeed contains the results of the Allegrograph repository. I would like to check that more specifically: can I find Process Instance COMPARELIGANDBINDINGSITESV211332778615941 and its Process Template?
Query to find the uri of COMPARELIGANDBINDINGSITESV211332778615941 from the Allegrograph repository:
All its predicates and objects, for where it is the subject (query6: RDF)
and its subjects and predicates, for where it is the object (query7: RDF):
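The three queries are not reproduced in these notes. A minimal sketch of what they might look like, built as Python strings: one that finds a URI by its local name, one that lists the outgoing triples of a resource, and one that lists the incoming triples. The resource URI below is a hypothetical placeholder; the regex filter is only one possible way to look a name up and may be slow on a large repository.

```python
def find_uri_query(local_name):
    """Find subjects whose URI contains local_name (assumed to be plain alphanumerics)."""
    return (
        "SELECT DISTINCT ?s WHERE {\n"
        "  ?s ?p ?o .\n"
        f'  FILTER regex(str(?s), "{local_name}")\n'
        "}"
    )

def outgoing_query(resource_uri):
    """All predicates and objects for which the resource is the subject."""
    return f"SELECT ?predicate ?object WHERE {{ <{resource_uri}> ?predicate ?object . }}"

def incoming_query(resource_uri):
    """All subjects and predicates for which the resource is the object."""
    return f"SELECT ?subject ?predicate WHERE {{ ?subject ?predicate <{resource_uri}> . }}"

NAME = "COMPARELIGANDBINDINGSITESV211332778615941"
# Hypothetical placeholder: use the URI returned by the first query instead.
RESOURCE = "http://example.org/resource/" + NAME

lookup = find_uri_query(NAME)
outgoing = outgoing_query(RESOURCE)
incoming = incoming_query(RESOURCE)
print(lookup, outgoing, incoming, sep="\n\n")
```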
Query 6 gives quite an extensive report of what is associated with this ProcessInstance, including the statement that this ProcessInstance wfprov:wasPartOfWorkflowRun http://wings.isi.edu/opmexport/resource/Account/ACCOUNT1332778615941 (one of the wfRuns reported by query 3).
Query 7's results are sparser and only provide references to outputs. Together, queries 6 and 7 seem to give a comprehensive 'report' on COMPARELIGANDBINDINGSITESV211332778615941.
The type information from query 6 tells me that COMPARELIGANDBINDINGSITESV211332778615941 is a ro:Resource, wfprov:Process, and wfprov:Artifact. This seems a little high level.
My starting point should be a workflow. Let me see whether I can find it using http://wings.isi.edu/opmexport/resource/Account/ACCOUNT1332778615941 as my starting point (query8: RDF):
To my surprise, I find three workflows:
ABSTRACTSUBWFLIGANDBINDINGSITESCOMPARISON, ABSTRACTSUBWFDOCKING, ABSTRACTGLOBALWORKFLOW2
Looking at the names, I suspect that ABSTRACTGLOBALWORKFLOW2 has ABSTRACTSUBWFLIGANDBINDINGSITESCOMPARISON, ABSTRACTSUBWFDOCKING as components. Possibly all three were returned by query 8 by inference. ABSTRACTGLOBALWORKFLOW2 should be the overall workflow that I was looking for.
Back to the example user query:
''As a researcher working on a Live RO, when I click on a workflow I would like to see previous runs, when they were run, who ran them, summary of results and comments on the run.''
"Click on a workflow": Assume this is ABSTRACTGLOBALWORKFLOW2
"see previous runs" (query9: RDF):
Unfortunately, I did not find timestamps, or relations that point to timestamps. Similarly, I did not see the relations to people. NB to my surprise I saw that the type of the workflow run is wfprov:WorkflowRun and wfdesc:Workflow.
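Query 9 is not reproduced in these notes. As a sketch of how the 'previous runs' part could be asked, the helper below assumes the wfprov namespace http://purl.org/wf4ever/wfprov# and the property wfprov:describedByWorkflow as the link from a run to its workflow; both are my reading of the wf4ever vocabulary and should be checked against the predicates that the endpoint actually returns. The workflow URI is a hypothetical placeholder.

```python
WFPROV = "http://purl.org/wf4ever/wfprov#"  # assumed namespace URI

def runs_of_workflow_query(workflow_uri):
    """Build a query listing the workflow runs described by the given workflow."""
    return (
        f"PREFIX wfprov: <{WFPROV}>\n"
        "SELECT DISTINCT ?run WHERE {\n"
        "  ?run a wfprov:WorkflowRun ;\n"
        f"       wfprov:describedByWorkflow <{workflow_uri}> .\n"
        "}"
    )

# Hypothetical placeholder for the URI of ABSTRACTGLOBALWORKFLOW2.
WORKFLOW = "http://example.org/resource/ABSTRACTGLOBALWORKFLOW2"

runs_query = runs_of_workflow_query(WORKFLOW)
print(runs_query)
```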
I remember that I saw 'Daniel' in the Allegrograph triple store. In queries 12 and 13 I requested all triples for <http://wings.isi.edu/opmexport/resource/Agent/DANIEL>, but none were found (query12: RDF, query13: RDF).
Next I looked at the RO vocabulary specification v0.2 and found that the Dublin Core terms 'created' and 'creator' are suggested. So, I probed the repository for 'created' values (query14: RDF) and 'creator' values (query15: RDF).
Result: https://raw.github.com/wf4ever/ro-catalogue/master/v0.1/wf74/ created 2012-03-26T16:41:29
Result: https://raw.github.com/wf4ever/ro-catalogue/master/v0.1/wf74/ creator "Test User"
It seems that for the WINGS workflow these annotations were not applied.
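Queries 14 and 15 are not shown in these notes. A probe of this kind could be sketched as follows, assuming the usual Dublin Core terms namespace http://purl.org/dc/terms/ (the prefix name dcterms is my choice):

```python
DCTERMS = "http://purl.org/dc/terms/"

def dcterms_query(term):
    """Build a query listing every resource that carries the given Dublin Core term."""
    return (
        f"PREFIX dcterms: <{DCTERMS}>\n"
        "SELECT ?resource ?value WHERE {\n"
        f"  ?resource dcterms:{term} ?value .\n"
        "}"
    )

created_query = dcterms_query("created")
creator_query = dcterms_query("creator")
print(created_query)
print(creator_query)
```

Any resource that is missing from both result lists, such as the WINGS workflow runs here, has no creation time or creator recorded under these terms.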
Finally, out of curiosity I checked whether anything was aggregated for the WINGS workflow (query16: RDF):
Many references to wf74, but no reference with 'wings' in the uri.
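A check like this can also be pushed into the query itself. The sketch below assumes that aggregation is expressed with ore:aggregates from the OAI-ORE vocabulary (http://www.openarchives.org/ore/terms/) and filters the aggregated resources on a case-insensitive URI fragment; the fragment is inserted unescaped, so it should be a plain word such as 'wings'.

```python
ORE = "http://www.openarchives.org/ore/terms/"

def aggregated_matching_query(fragment):
    """List aggregated resources whose URI contains fragment (case-insensitive)."""
    return (
        f"PREFIX ore: <{ORE}>\n"
        "SELECT ?ro ?resource WHERE {\n"
        "  ?ro ore:aggregates ?resource .\n"
        # fragment is assumed to contain no quotes or regex metacharacters
        f'  FILTER regex(str(?resource), "{fragment}", "i")\n'
        "}"
    )

wings_query = aggregated_matching_query("wings")
print(wings_query)
```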
Idem for annotations (query17: RDF):
To my surprise it appears that no resources in the repository are annotated with the Annotation Ontology.
- In general I am happy with the results so far. The RDF of the workflows that I looked at seems pretty ok, both in the Allegrograph KB and the wf4ever KB. The mapping of the RDF produced from the WINGS workflow to the wf4ever models seems to have worked. Getting to workflow templates, their results and runs or 'process instances' via their interrelationships was not difficult using the lists of predicates and classes provided by the endpoints (as self-explanatory as one may expect from RDF triples).
- Some information appears to be missing or items may not have been mapped to the wf4ever models yet (see questions below).
- I could not answer the first user provenance question of the user view on provenance fully (timestamps, credit).
- I did not check how exhaustive the annotations are, e.g. with respect to the user view on provenance (wiki/display/docs/User+view+on+Provenance). This would require going through all examples there.
- For testing/validating the KB, it would be convenient if the wf4ever KB had a more feature-rich user interface (Sesame?, Allegrograph?)
- The knowledge base endpoints were not highly visible in the Showcase 22 report. Because I wanted to do the validation as naively as possible, going straight from the driving user question to querying the endpoint, I started on the wrong KB.
In order to acquire some hands-on experience with a tangible RDF RO artefact, I am trying to link references for the materials that Kristina added to the RO using the RO tools as instances in the ontologies suggested for the RO. The immediate motivation is the Semantic Web Applications and Tools for Life Science (SWAT4LS) workshop, where we would like to present our project on a poster. I would like to show a tiny prototype Semantic Web example of a RO.
Comments 28/11/2011 - 1/12/2011
* I cannot find the property that links a workflow file to a workflow instance: what is the recommended way of instantiating a particular workflow?
* I wonder what the added benefit of a WorkflowResearchObject is; it is properly defined as an RO with at least a Workflow, but a RO with only a workflow would not be a good RO imo (at least not a published RO)
* The explicit distinction between living and published ROs
* Datatype properties for Artifacts (documents, files etc.): how do I add a reference to a file on my hard disk? It is not common practice to use such references directly as the identifiers of these instances, is it?
* Warning: using (file) names for identifiers may lead to ambiguity problems later on. I suggest that if we have no better ID, we use uuids, and the names in labels.
* I can't find how to relate the Experimental Hypothesis to a scientific task or investigation
* Artifact is used in three different vocabularies (opmv, wfdesc, wfprov); are they all different? Are their relations defined?
* What do we use the Annotation ontology for exactly? This question is partly because the annotation classes are not very clearly annotated (ironically). I can imagine that some Semantic Links can be regarded annotations. Do we then use these to define that certain classes are annotations?
* I see that we previously took a different modelling decision regarding process and workflow. We did not define a workflow as a process directly. Instead we defined the following pattern: a (Text mining) process /has_implementation/ workflow, a process /is_run_by/ a process_run, a process_run /is_performed_by/ a workflow_run. Not sure about the consequences in practice, but I am not sure if we can define workflow as a description as well as a process. I guess process would have to be described as a process *description*.
* I am curious about 'WorkflowInstance'. If scufl2 can contain these extra things, then I guess a workflow description could indeed be a WorkflowInstance as defined. Otherwise I would probably create relations between the workflow and its parameters etc.. In that case it would make sense to define this class as a defined class: a workflow with parameter properties is a (directly runnable) WorkflowInstance. As an experiment (poc) I defined this in the wf4ever_experiment ontology.
* Could 'hasArtifact' also apply to files that represent e.g. the workflow? Currently, I use the reference from the manifest.rdf file (which reads like a file reference) as the identifier for the workflow. This seems to be a discrepancy, because I use a different pattern for the inputs and outputs: I/O files (Artifacts) are linked to workflow runs by usedInput and wasOutputFrom.
* Created several instances in EXPO: ComputationalExperiment, ScientificInvestigation, HypothesisRepresentation = instance of Artifact <- Document,
** If I had doubts about the ID to use, I created a uuid. All my instances have labels (or they should have)
* Linked EXPO and wf by
** defining that a Workflow is a type of RepresentationExperimentalExecutionProcedure
* Added one example output and two inputs as file instances. Linked them to the workflow run.
* Aggregated all my instances in the RO instance.
* To link the workflow to its hypothesis, I used the hasContent relation based on this relation's domain and range. The relation is not further defined, so I cannot know if this is its purpose (its name suggests something different).
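The suggestion in the comments above (use a uuid when there is no better ID, and keep the human-readable name in a label) can be sketched in a few lines of Python. The urn:uuid: form and the dictionary layout are my choices for illustration, not something prescribed by the RO model.

```python
import uuid

def mint_instance(label):
    """Return a new instance with a random UUID identifier and the name kept as a label."""
    return {"id": f"urn:uuid:{uuid.uuid4()}", "label": label}

def as_turtle(instance):
    """Serialise the label as one Turtle statement (rdfs prefix assumed to be declared)."""
    return f'<{instance["id"]}> rdfs:label "{instance["label"]}" .'

# Two files that happen to share a name still get distinct identifiers.
first = mint_instance("results.txt")
second = mint_instance("results.txt")
print(as_turtle(first))
print(as_turtle(second))
```

Because the identifier carries no meaning, renaming a file later only changes its label and leaves every link to the instance intact.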
==== Earlier comments ====
wf4ever models comments
* In general classes for scientific experiments; I think we need these for annotating
** Hypothesis (and document representing a hypothesis)
** The experiment ('executed by' (everything in) a Research Object)
** Interpretation (and artefact representing the interpretation)
** Possible ontologies to reuse:
*** I found 'EXPO' (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1885356/ and http://www.sourceforge.net/projects/expo)
*** Some domain specific ontologies, such as http://www.w3.org/wiki/images/e/ea/HCLSIG_BioRDF_Subgroup%24%24Tasks%24%24Experiment_Ontology%24LSCDDMetadata.owl
*** More options are welcome, because these do not seem to be well maintained; their URIs do not resolve on the web.
*** Related (but less generic than you might think): Experimental Conditions Ontology (http://bioportal.bioontology.org/ontologies/1585); Experimental Factor Ontology (http://bioportal.bioontology.org/ontologies/1136)
* I find two 'Documents' (wf4ever and foaf) with insufficient information to understand their relation.
* Classes from imported ontologies often have poor or no labels, and no comments. For classes/properties that are important to our purposes I suggest we provide labels and comments (in a separate owl file).
* Out of curiosity: How is a Person a 'spatial thing'?
* Inputs and outputs are defined as types of parameters. Some would consider these different things I think?
* Not many classes are defined by properties; should we try to increase that?
* Also for ontologies I find examples (in comments) highly useful. For instance, I would like more information on how the annotation classes/properties should be used.
Expectations from an RO model towards submission, archival, and dissemination for the Metabolic Syndrome case and the epigenetics-Huntington's Disease case.
- BioRO's folder in DropBox
- Draft RO model in DropBox (RO task force)
- e-mail by Jose
- Discussions about nano-publications (BioSemantics group and OpenPhacts)
- We consider that our efforts towards a good archive should be inspired by how we expect/would like that the archive will be used (to prevent creating a dead archive).
- In this sense, are we missing the perspective of personal use (we also archive for ourselves), or even usage as an explicit perspective? Do 'archival' or 'dissemination' cover these?
- It might be interesting to compare our archives with 'implicit' archives; for instance: the 'Leiden Open Variation Database' (LOVD) is developed as a dynamic resource for genetic information; the archiving aspect is more-or-less a side effect. I think myExperiment, BioCatalogue, or Galaxy 'toolsheds' were also originally built with usage as primary objective. I would like to think that archiving infrastructure could be used as backbone for such resources, adding the value of proper archiving (including provenance).
NB LOVD is used in examples for nano-publishing (see http://nanopub.org)
- Following our internal discussions about nano-publishing genetic variants, submission is the best moment to help users follow standards (for useful preservation; i.e. enhance usage later on)
- For comparison: when genetic variants are submitted to an archive and as part of a paper, we can suggest the proper encoding for genetic variants (this is non-trivial). This semantically links the papers mentioning the variant to one another and to any database using the same encoding for variants.
- For workflows I imagine that submission would be the moment of matching a workflow with the RO model. When mismatches occur, we can ask the user to disambiguate or add missing information.
- This may come with some 'pain', because it may require a mapping for existing resources.
- The process is facilitated by publishing the RO model [1]. The RO model would give developers and users a reference.
- For the examples below, I distinguish the living RO from the published RO:
- Example for living ROs
- NB: I would like to see these in the context of helping users keep a proper notebook. In my vision, the notebook creates the archive (or prepares for archival). Every submission of an artefact+notes comes with storing provenance information, that a user can later use to look up information before publishing. So for instance, when a workflow produces an output that we wish to keep, we don't have to worry about its timestamp and metadata remaining linked to it.
- Both Eleni and Kristina have added to DropBox documentation about their initial plans for an experiment: their hypothesis, the data they intend to use, expected outcome. They also submit the data they use in the experiment (sometimes by reference), and their intermediate results (note: these are not Taverna-style intermediate results per se, but results from experiments that we wish to keep for a while, because they may be needed later on). For every submission, I expect some provenance information to be added, possibly visible in a notebook-like viewer. [Should we make this example even more specific based on the data in the BioRO folder?]
- In summary: I expect that the process of submission for living ROs would look like an aid in keeping track of your experiment for users [2].
- Example for publication ROs
- I expect that the process of publishing is a matter of copying, pruning, and editing a living RO. This may be facilitated by matching against the differences that may exist between the RO model instances of the 'living' and 'published' ROs. It can guide in what to remove (e.g. intermediate datasets), and it may be more strict about which information should minimally be provided, similar to the strict requirements for publications [3]. (In practice this is the time when, as a user, you feel sorry that you didn't provide this information to the living object earlier.)
- Taking Eleni's case as an example: she is producing a workflow from the compendium of R scripts that she used to explore the effect of so-called CpG islands (specific sequences often found in front of genes; their locations taken from an online data source) on the change of gene expression in brain regions affected by Huntington's Disease. She has done several analyses for different kinds of statistical plots (we assume these are in the living RO [4]), and now wishes to fix the workflow (in Taverna) and select the results for publication.
[1] Note: within our department we are developing a tool for generating interfaces for command line tools; the underlying model is in RDF. For each tool (e.g. Galaxy tools) a mapping is made in a template language. This is a bottleneck, but the RDF part facilitates interoperability.
[2] It may be worthwhile to align with Study Capture frameworks here. I suggest that we try to use the same semantic models under the hood as much as possible.
[3] Note that biologists are used to quite strict rules when submitting papers and submitting some types of data to data repositories.
[4] In reality, Eleni is not using the BioRO folder to its fullest. One reason is that the data set is large; the other is that Eleni is not yet used to the idea. Apart from learning to use this, we can use it as a requirement: it may help if living ROs seem entirely personal, until a user is ready to share. Eleni is a typical PhD student/scientist, feeling uncomfortable sharing until she is entirely confident about the result. NB, following the wet-lab analogy this is not acceptable; notebooks are in principle property of the department and open to it (in practice this is not strictly adhered to though).
Should we do something special for working ROs that will never make it to publication, but have valuable components? For simplicity I stated that I would like lower ambitions for archiving working ROs, compared to published ROs. However, it would be a shame if we have nothing for discontinued unpublished ROs that are however valuable enough to archive for the long term.
I would still plead to first pay attention to the working and published ROs, but we should perhaps not forget about these.
I just had an interesting conversation with one of my colleagues Jeroen Laross. I imagine that this could have a translation into what we are doing.
Jeroen and colleagues are building a bioinformatics pipeline out of command line tools. When a tool is updated and they build that in, the behaviour of the pipeline could change. Therefore they keep track of the tools' version numbers (e.g. by running --version or, if there is no version information, by computing an MD5 checksum - do I spell that right?). If a tool's version has changed, they change the version number of the pipeline.
Would we be able to have something similar for workflows? I guess that including an update of a command line tool is always a deliberate step; for remote Web Services this is automatic and unseen. The workflow would not be changed, but its functionality might have. Perhaps a RO can have an additional version number reflecting underlying changes? I would argue we would need the facility first, regardless of whether it is easy to find out if underlying services were updated.
This may seem to make published ROs mutable. However, a published RO should have time stamps all over the place: it only makes guarantees for the time of publication (analogous to paper publications). It is a best practice to keep everything working after publication.
Still, is 'version dependency' somehow a requirement for working and published ROs?
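To make the idea concrete, here is a minimal Python sketch of the bookkeeping as I understood it from the conversation; it is not Jeroen's actual implementation. A tool's fingerprint is the first line of its --version output when that works, and otherwise the MD5 checksum of the executable; the fingerprints of all tools are then combined into one value that changes whenever any tool changes. The tool names and values in the demo are made up.

```python
import hashlib
import subprocess

def md5_of_file(path):
    """MD5 checksum of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def tool_fingerprint(executable_path):
    """Use '--version' output if the tool provides it, else fall back to an MD5 checksum."""
    try:
        result = subprocess.run(
            [executable_path, "--version"],
            capture_output=True, text=True, timeout=10,
        )
        if result.returncode == 0 and result.stdout.strip():
            return "version:" + result.stdout.strip().splitlines()[0]
    except (OSError, subprocess.TimeoutExpired):
        pass
    return "md5:" + md5_of_file(executable_path)

def pipeline_fingerprint(fingerprints):
    """Combine per-tool fingerprints (name -> fingerprint) into one order-independent value."""
    joined = "\n".join(f"{name}={value}" for name, value in sorted(fingerprints.items()))
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

# Made-up tools and fingerprints, for illustration only.
tools = {"aligner": "version:1.0.2", "caller": "md5:0123abcd"}
before = pipeline_fingerprint(tools)
tools["aligner"] = "version:1.0.3"
after = pipeline_fingerprint(tools)
print(before != after)
```

For a workflow that calls remote Web Services there is no executable to checksum, so an RO would need some other observable (for example a service-reported version) to feed into the same kind of combined fingerprint.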
PhD student Eleni Mina (LUMC) has produced a 'table of contents' for us that conveys what components she would expect to put in a Research Object for her upcoming experiment. She is at the planning stage of the experiment, which means that the references and plans are all real, but she hasn't performed the experiment yet. The file is in the experimental RObox DropBox folder 'BioRO'. Below is a short description.
I expect that Eleni's TOC would eventually map to the RO model and the RObox interface, and that the RO model would provide the structure and interlinking of the components, while the RO model is used to help structure the RObox interface (a consequence of defining the structure of a RO).
Eleni defined component types, and 'attributes' of components:
- RO component types: Datasets, Scripts, Web Services, Workflows, Documentation; on my request she added Hypothesis as a component too.
- RO component attributes: Name, Description, Type, Version; on my request Eleni added 'Role in experiment' which otherwise ended up in the Description.
- Type: Dataset
- Name: Huntington Disease dataset 1
- Description: Human brain dataset. 44 HD samples, 36 controls, age and sex matched. Brain areas: caudate nucleus, frontal cortex and cerebellum. Affymetrix platform.
- Datatype: GEO series datafile (NB this also conveys the origin of the datafile).
- Version: <unknown>
- Role in experiment: input (as her supervisor, I would ask for more explanation here; how does this data help address the hypothesis?)
- Type: Script
- Name: Analyse_data
- Input: data matrix, tab delimited.
- Output: list of differentially expressed genes
- R packages needed: Bioconductor, Affy, Limma, GEOquery
- Script type: R
- Version: R version 2.10.1
- Role in experiment: Script file for the process and analysis of the input affymetrix datasets
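As a sketch of how the two entries above could be captured in a structured form, the Python dataclass below uses the common attributes Eleni defined (Type, Name, Description, Version, Role in experiment) and keeps type-specific attributes in a free-form dictionary. The class and field names are hypothetical and not part of the RO model.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ROComponent:
    type: str
    name: str
    description: str = ""
    version: Optional[str] = None  # None stands for '<unknown>'
    role_in_experiment: str = ""
    extra: dict = field(default_factory=dict)  # type-specific attributes

dataset = ROComponent(
    type="Dataset",
    name="Huntington Disease dataset 1",
    description="Human brain dataset. 44 HD samples, 36 Controls age and sex matched.",
    role_in_experiment="input",
    extra={"Datatype": "GEO series datafile"},
)

script = ROComponent(
    type="Script",
    name="Analyse_data",
    version="R version 2.10.1",
    role_in_experiment="Script file for the process and analysis of the input affymetrix datasets",
    extra={
        "Input": "data matrix, tab delimited",
        "Output": "list of differentially expressed genes",
        "R packages needed": ["Bioconductor", "Affy", "Limma", "GEOquery"],
        "Script type": "R",
    },
)

for component in (dataset, script):
    print(component.type, component.name, component.version or "<unknown>")
```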
This dynamic RO will evolve as Eleni is designing and executing her experiment. I expect this to lead to more annotations/notes and more detail.
This follows on the plea I made to think about services on top of Semantic Web / Linked Data resources. Carole called them 'top-middleware' services I think.
Would this make sense?
-> exposes content via Generic Semantic Web interface (Sesame API?, - we have Web Service available for that.) [Note from Juande: Sesame is a web service for giving back information on astronomical object names, what's this Sesame?]
-> RO-services expose RO specific interface for (web) applications (SOAP service, Ruby gem, etcetera?)
-> Various web applications can perform RO-specific I/O through an application-specific user interface (typically not made by us)
- The RO model is independently exposed (hence, RO semantics available for any tool to use)
- Combination of exposed RO-model (RDF) and generic RDF interface would do the trick already; RO-services make it convenient and allow us to control access.
- RO-box would be one of the applications; Taverna, myExperiment, etcetera could be others, use for annotation?
- The RO-services hide (SPARQL) details and are RO-specific.
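To illustrate the last point, here is a minimal sketch of an RO-service layer in Python. The class name, method name, and the injected run_query function are all hypothetical; the sketch only shows the shape of the idea: the application asks an RO-specific question and never sees the SPARQL. It assumes aggregation is expressed with ore:aggregates, and the demo backend returns made-up data instead of contacting an endpoint.

```python
class ROService:
    """RO-specific facade over a generic SPARQL backend (hypothetical design)."""

    def __init__(self, run_query):
        # run_query: function taking a SPARQL string and returning a list of
        # {variable: value} rows; the real transport (HTTP, Sesame, ...) lives there.
        self._run_query = run_query

    def aggregated_resources(self, ro_uri):
        """Return the URIs of the resources aggregated by the given RO."""
        query = (
            "PREFIX ore: <http://www.openarchives.org/ore/terms/>\n"
            f"SELECT ?resource WHERE {{ <{ro_uri}> ore:aggregates ?resource . }}"
        )
        return [row["resource"] for row in self._run_query(query)]

# Stand-in backend with made-up data; it records the queries it receives.
seen_queries = []

def fake_backend(query):
    seen_queries.append(query)
    return [
        {"resource": "http://example.org/ro/workflow.t2flow"},
        {"resource": "http://example.org/ro/input.csv"},
    ]

service = ROService(fake_backend)
resources = service.aggregated_resources("http://example.org/ro/")
print(resources)
```

Access control, as mentioned above, would fit naturally in this layer, since every request passes through methods that we define.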