

Expectations of an RO model for submission, archival, and dissemination, for the Metabolic Syndrome case and the epigenetics-Huntington's Disease case.

Source information:

  • BioRO's folder in DropBox
  • Draft RO model in DropBox (RO task force)
  • e-mail from Jose
  • Discussions about nano-publications (BioSemantics group and OpenPhacts)

Notes upfront

  • We consider that our efforts towards a good archive should be guided by how we expect, or would like, the archive to be used (to prevent creating a dead archive).
    • In this sense, are we missing the perspective of personal use (we also archive for ourselves), or even usage as an explicit perspective? Do 'archival' or 'dissemination' cover these?
    • It might be interesting to compare our archives with 'implicit' archives. For instance, the 'Leiden Open Variation Database' (LOVD) is developed as a dynamic resource for genetic information; the archiving aspect is more or less a side effect. I think myExperiment, BioCatalogue, and Galaxy 'toolsheds' were also originally built with usage as the primary objective. I would like to think that archiving infrastructure could be used as a backbone for such resources, adding the value of proper archiving (including provenance).
      NB: LOVD is used in examples for nano-publishing (see http://nanopub.org).

Submission
Notes

  • Following our internal discussions about nano-publishing genetic variants, submission is the best moment to help users follow standards (for useful preservation, i.e. to enhance usage later on)
    • For comparison: when genetic variants are submitted to an archive and as part of a paper, we can suggest the proper encoding for genetic variants (this is non-trivial). This semantically links the papers mentioning the variant to one another and to any database using the same encoding for variants.
    • For workflows I imagine that submission would be the moment of matching a workflow against the RO model. When mismatches occur, we can ask the user to disambiguate or add missing information (a sketch of such a check follows after this list).
      • This may come with some 'pain', because it may require a mapping for existing resources.
    • The process is facilitated by publishing the RO model.1 The RO model would give developers and users a reference.
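To make the matching step concrete, here is a minimal sketch of what a submission check could look like, assuming (for illustration only) that the relevant part of the RO model can be reduced to sets of required and recommended properties. The property names (creator, workflow, hypothesis, and so on) and the file names are hypothetical placeholders, not terms from the actual RO model draft.

# Hypothetical submission check: which properties must the user still supply?
REQUIRED = {"creator", "created", "workflow", "inputs"}
RECOMMENDED = {"hypothesis", "expected_outcome", "description"}

def check_submission(annotations: dict) -> dict:
    """Report required and recommended properties that are missing or left empty."""
    present = {key for key, value in annotations.items() if value not in (None, "", [])}
    return {
        "missing_required": sorted(REQUIRED - present),
        "missing_recommended": sorted(RECOMMENDED - present),
    }

if __name__ == "__main__":
    submission = {
        "creator": "Eleni",
        "created": "2012-03-14",
        "workflow": "cpg_islands_expression.t2flow",
        "inputs": ["expression_matrix.csv"],
        "hypothesis": "",  # left empty; this is where we would prompt the user
    }
    for kind, props in check_submission(submission).items():
        print(kind + ":", ", ".join(props) or "none")

In practice such a check would be driven by the published RO model rather than by hard-coded sets, but the interaction with the user would be the same: point out what is missing and ask for disambiguation.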

Examples

  • For the examples below, I distinguish the living RO from the published RO:
  • Example for living ROs
    • NB: I would like to see these in the context of helping users keep a proper notebook. In my vision, the notebook creates the archive (or prepares for archival). Every submission of an artefact plus notes comes with stored provenance information, which a user can later use to look up details before publishing (a sketch of such a provenance record follows after these examples). For instance, when a workflow produces an output that we wish to keep, we do not have to worry about its timestamp and metadata remaining linked to it.
    • Both Eleni and Kristina have added documentation about their initial plans for an experiment to DropBox: their hypothesis, the data they intend to use, and the expected outcome. They also submit the data they use in the experiment (sometimes by reference) and their intermediate results (note: these are not Taverna-style intermediate results per se, but results from experiments that we wish to keep for a while, because they may be needed later on). For every submission, I expect some provenance information to be added, possibly visible in a notebook-like viewer. [Should we make this example even more specific based on the data in the BioRO folder?]
    • In summary: I expect that, for users, the process of submission for living ROs would feel like an aid in keeping track of their experiments.2
  • Example for publication ROs
    • I expect that the process of publishing is a matter of copying, pruning, and editing a living RO. This may be facilitated by matching against the differences between the RO model instances of the 'living' and 'published' ROs (see the pruning step in the sketch after these examples). That matching can guide what to remove (e.g. intermediate datasets), and it may be stricter about which information should minimally be provided, similar to the strict requirements for publications3. (In practice this is the moment where, as a user, you feel sorry that you did not provide this information to the living object earlier.)
    • Taking Eleni's case as an example: she is producing a workflow from the compendium of R scripts that she used to explore the effect of so-called CpG islands (specific sequences often found in front of genes; their locations are taken from an online data source) on the change of gene expression in brain regions affected by Huntington's Disease. She has done several analyses for different kinds of statistical plots (we assume these are in the living RO4), and now wishes to fix the workflow (in Taverna) and select the results for publication.
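To illustrate the two examples above, here is a minimal sketch, assuming that a living RO can be treated as a list of entries that each carry provenance captured at submission time, and that publishing starts from a pruned copy of that list. The function and field names (submit, prune_for_publication, role, notes) are hypothetical and only mimic the behaviour described above; they are not part of the RO model.

from datetime import datetime, timezone

def submit(living_ro: list, artefact: str, notes: str, creator: str, role: str = "intermediate") -> dict:
    """Add an artefact to the living RO, recording provenance at submission time."""
    entry = {
        "artefact": artefact,
        "notes": notes,
        "creator": creator,
        "submitted": datetime.now(timezone.utc).isoformat(),  # timestamp stays linked to the artefact
        "role": role,  # e.g. 'input', 'intermediate', 'result'
    }
    living_ro.append(entry)
    return entry

def prune_for_publication(living_ro: list, required=frozenset({"notes", "creator"})) -> tuple:
    """Copy non-intermediate entries and report metadata that is still missing for publication."""
    published, complaints = [], []
    for entry in living_ro:
        if entry["role"] == "intermediate":
            continue  # intermediate datasets are dropped from the publication RO
        missing = [field for field in required if not entry.get(field)]
        if missing:
            complaints.append((entry["artefact"], missing))
        published.append(entry)
    return published, complaints

if __name__ == "__main__":
    living = []
    submit(living, "cpg_island_locations.bed", "locations taken from an online source", "Eleni", role="input")
    submit(living, "exploratory_plot.pdf", "first look at the data", "Eleni", role="intermediate")
    submit(living, "expression_vs_cpg.pdf", "", "Eleni", role="result")  # notes left empty on purpose
    publication, todo = prune_for_publication(living)
    print(len(publication), "entries kept for publication; still missing:", todo)

The point of the sketch is the division of labour: provenance is recorded as a side effect of submission (the user never adds timestamps by hand), and strictness only appears at publication time, when the pruning step complains about metadata that is still missing.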

1Note: within our department we are developing a tool for generating interfaces for command-line tools; the underlying model is in RDF. For each tool (e.g. Galaxy tools) a mapping is made in a template language. Making this mapping per tool is a bottleneck, but the RDF part facilitates interoperability.
2It may be worthwhile to align with Study Capture frameworks here. I suggest trying to use the same semantic models under the hood as much as possible.
3Note that biologists are used to quite strict rules when submitting papers and submitting some types of data to data repositories.
4In reality, Eleni is not using the BioRO folder to its fullest. One reason is that the data set is large; the other is that Eleni is not yet used to the idea. Apart from helping her learn to use it, we can treat this as a requirement: it may help if living ROs seem entirely personal until a user is ready to share. Eleni is a typical PhD student/scientist, feeling uncomfortable about sharing until she is entirely confident about the result. NB, following the wet-lab analogy this is not acceptable; notebooks are in principle the property of the department and open to it (in practice this is not strictly enforced, though).