
Esteban text

Introduction

The main goal of the overall showcase 44 is to give scientists a way to search for workflows by their functionality, properties, or other conceptualization, so that the workflows are easy to find and access.

The showcase is split in two. The first part, showcase 44a, has as its main goal the identification or definition of macros, which are used later for matching workflows or parts of them (showcase 44b).

Problem

What is the problem that we are trying to solve? -- khalid

  • Given a workflow repository and a user query specified as a set of keywords, identify the workflows that meet the user's expectations. (This is what myExperiment provides now; I give it here just to illustrate the level of detail we need when defining the problem that we want to solve :-) --khalid
  • Given a workflow repository and a user query specified as a set of requirements (i.e., I want the workflows that produce this type of data and accept as input this type of data), identify the workflows that meet the user's expectations. --dani
  • Given a workflow repository and a workflow query, recommend similar workflows (at a higher level of abstraction) in the repository. --dani
    • A subcase of the previous problem is what Antoon solved: given a workflow repository of templates, return all workflows that share substructures with the query (subgraph isomorphism). Ralph Bergmann tackled this problem with semantic similarity as well.
  • Given a workflow repository and a query workflow, find the workflows that perform the same function as the query workflow. This is the problem we talked about in the first meeting. It is very interesting from the research point of view, although it is daunting: how do I know that 2 workflows that are structurally different perform a similar analysis? --dani
  • Given 2 different workflows from 2 different systems, what are their commonalities? (From the functionality point of view, not from a structure point of view.) --dani
    • This is interesting and daunting as well: maybe a reconstruction of provenance would be needed.
  • Given an incomplete workflow and a workflow repository, could I recommend the next component to the workflow designer? (Or the next 2 or 3 suggestions.) --dani
    • This is recommendation based on statistical data mining analysis. This is what I think Esteban is aiming to achieve here. --dani
  • Given two workflows (or services) that take inputs of the same domain and produce outputs of the same domain, identify whether the two workflows perform the same task. --khalid
    • I talked about this problem in the eScience paper on service substitutability [8]. I did not use semantic annotation to identify whether two services (workflows) perform the same task, but rather used provenance information to do so. --khalid

Use case scenario

This section aims to provide the main use case scenario where the indexing and classification of workflows is going to be applied.

  1. Taken from Dave's mail (full reference at [1]): I would very much like Wf4ever to make this process "assistive" so that when a workflow/pack/RO is uploaded the system provides recommendations for the provenance metadata.
    1. I am missing what the recommendation would be in this case. --khalid
  2. Taken from Pique comments at showcase 45: In multi-Wf ROs composed of several Wfs, when comparing several similar ROs, the workflows that are unusual might be very relevant to a user ---they are a distinguishing feature of the RO. On the contrary, Wfs that occur together in different ROs represent a pattern that can be of interest to the users. This use case may be expanded to comparison of several Wfs having unusual or common/patterns as scripts, web services or modules.
    1. Sorry to be picky :-) What does Pique mean by an unusual workflow? (If I remember well, it is related to the discovery of parts of a workflow that are characteristic of that workflow and therefore identify, in some sense, what it does.)
    2. ^^ I guess unusual means that there are not many of the same kind. What is important to find out is how it would be useful. As a recommendation? Over the templates? Over the runs?
  3. Also there is a description of the problem and some comments from Marco at [2], where he presents some queries that he would like to
  4. Pinar also highlighted during the sprint planning meeting the repair WF scenario and preservation of workflows as use cases. "yes I think it is more relevant to the focus of the workflow for ever project which is about preservation. Repair is an activity performed as part of conservation of workflows. Therefore I believe it is important to support such use cases. Related work in this area exists, mainly based on Case Based Reasoning [9]." --pinar
  5. If we are targeting workflow similarity, then rather than relying on structural or semantic based similarity, I would ask the users to provide examples of similar (and different) workflows, and ask them why they think that they are similar or different. This may give us better clues on when two workflows are similar or not. --khalid
    1. Warning: I think Antoon tried to create a gold standard for comparing workflows by asking different authors whether the workflows were similar and what similar workflows would be, and he failed because they would not agree. Part of this showcase could be to explore alternatives and then try to confirm them with the users. --dani

EG: IMO any of the scenarios presented above needs the following:

  • A semantic layer that provides some information about the processes.
  • A macro definition or identification that allows workflows, or parts of them, to be matched and compared.
  • A structure that indexes the macros introduced above and makes them easier to access (especially for large amounts of data). A match between macros and workflows needs to be established in order to categorize and index the workflows.

Syntactic and semantic similarities

This section aims to provide a first systematic study of the different approaches that can be used to provide a workflow abstraction perspective for indexing and classification purposes. So far we have talked about two main general approaches:

  1. Syntactic approach: use only the syntax of the processes (tags, identifiers, etc.) to provide an abstraction of the workflow by abstracting each one of its components (see [6], section 2, for a clarification of the different levels of abstraction). An example of this type of indexing, which makes use of the information at the process level, can be found at [5]. A survey can also be found at [7].
    1. I took a quick look at [6], the approach they are taking is quite similar to what Wings does. Regarding [7], I have the impression that the work does not aim to abstract workflows, but rather to discover actual workflows from execution logs. --khalid 
  2. Semantic approach: some work has been done on using reasoning and list structures for sequential pattern recognition [3]. The results can still be misleading, because there is no way of counting or keeping track of the depth within the sequence. A brief review can also be found at [4].

One thing to keep in mind is that a workflow is a DAG, while the provenance of its results is a sequence of executed processes. The patterns found in templates are therefore ordered sets of artifacts of variable size.
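To make the "ordered sets of variable size" concrete, here is a minimal sketch in Python. It assumes a workflow template is given as a plain adjacency mapping (process name to the processes that consume its output), which is an illustrative representation and not the format used by the showcase. The process names and the helper functions `source_to_sink_paths` and `sequence_patterns` are hypothetical.

```python
# Hypothetical template: each process maps to the processes that follow it.
EXAMPLE_DAG = {
    "fetch": ["parse"],
    "parse": ["align", "filter"],
    "align": ["report"],
    "filter": ["report"],
}


def source_to_sink_paths(dag):
    """Yield every path that goes from a source node to a sink node of the DAG."""
    targets = {node for successors in dag.values() for node in successors}
    nodes = set(dag) | targets
    sources = sorted(nodes - targets)  # nodes without incoming edges

    def walk(node, prefix):
        path = prefix + [node]
        successors = dag.get(node, [])
        if not successors:  # sink reached: the path is complete
            yield path
        for successor in successors:
            yield from walk(successor, path)

    for source in sources:
        yield from walk(source, [])


def sequence_patterns(dag, max_len=3):
    """Return the distinct contiguous process sequences of length 1..max_len."""
    patterns = set()
    for path in source_to_sink_paths(dag):
        for size in range(1, max_len + 1):
            for start in range(len(path) - size + 1):
                patterns.add(tuple(path[start:start + size]))
    return patterns
```

Each pattern is an ordered tuple of process names, so patterns of different lengths can be compared or indexed in the same structure. Enumerating all paths can grow quickly on wide DAGs, so a real implementation would probably bound the pattern length, as `max_len` does here.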

Using wfprov and wfdesc to retrieve semantic similarity

(Daniel has written this) A way to detect that 2 processors are of the same type is if they share the same script/ws. Therefore the wfdesc:Process should be able to link processors across templates. Once we have wfdesc in the PROV/wfprov export from Taverna, we should check:

  1. That 2 processors that are of the same type share the same wfdesc:Process (OR have a link to the same script/ws).
  2. What happens if 2 processes of the same type are in the same template?
  3. Could the processors be recognized by the role they play? (Or by the role of their inputs and outputs.)

In my opinion, these are valuable things to test in the export, because they provide linking between templates AND runs, which is crucial for detecting macro abstractions. If I discover new things, I'll post them in this section.
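The first two checks can be sketched in Python as follows. This is only an illustration: it assumes the export has already been flattened into (template, processor, implementation URI) records, and the record values and function names below are hypothetical rather than part of wfdesc, wfprov or Taverna.

```python
from collections import Counter, defaultdict

# Hypothetical records extracted from an export:
# (template, processor name, URI of the script or web service behind it).
RECORDS = [
    ("template_A", "search_genes", "http://example.org/ws/search"),
    ("template_B", "gene_lookup", "http://example.org/ws/search"),
    ("template_B", "gene_lookup_2", "http://example.org/ws/search"),
    ("template_B", "plot", "http://example.org/scripts/plot.R"),
]


def group_by_implementation(records):
    """Group processors that share the same script or web service."""
    groups = defaultdict(list)
    for template, processor, implementation in records:
        groups[implementation].append((template, processor))
    return dict(groups)


def shared_across_templates(groups):
    """Check 1: implementations that link processors from different templates."""
    return {
        implementation: members
        for implementation, members in groups.items()
        if len({template for template, _ in members}) > 1
    }


def repeated_within_template(groups):
    """Check 2: (implementation, template) pairs with several processors of one type."""
    repeated = []
    for implementation, members in groups.items():
        per_template = Counter(template for template, _ in members)
        repeated.extend(
            (implementation, template)
            for template, count in per_template.items()
            if count > 1
        )
    return sorted(repeated)
```

The third check (recognizing processors by role) is not covered, since it depends on how roles end up being described in the export.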

Macro identification and abstract templates

This section aims to collect the information regarding how macros and/or abstract templates which are associated to workflow sub-graphs are constructed.

A study of the most popular processors at myExperiment, and of the processes that follow them, has been carried out. The analysis is recorded in this document: https://docs.google.com/spreadsheet/ccc?key=0ArOWQMw0LDGDdFNtLW1GZ1o2QUw1N21rNUEzdFg1NUE#gid=0

For doing the study the following queries have been used:

Query1:

PREFIX comp: <http://rdf.myexperiment.org/ontologies/components/>

# Rank WSDL services by how often they are used in workflows.
SELECT ?service_name ?service_uri (COUNT(?workflow) AS ?number_of_workflows)
WHERE {
  ?processor comp:belongs-to-workflow ?workflow .
  ?processor a comp:WSDLProcessor .
  ?processor comp:processor-uri ?service_uri .
  ?processor comp:service-name ?service_name .
}
# Every non-aggregated variable in SELECT must also appear in GROUP BY.
GROUP BY ?service_name ?service_uri
ORDER BY DESC(?number_of_workflows)

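Query2:

PREFIX comp: <http://rdf.myexperiment.org/ontologies/components/>

# Workflows that use the run_eSearch service from the eutils WSDL.
SELECT ?service_name ?service_uri ?workflow
WHERE {
  ?processor comp:belongs-to-workflow ?workflow .
  ?processor a comp:WSDLProcessor .
  ?processor comp:processor-uri ?service_uri .
  ?processor comp:service-name ?service_name .
  FILTER regex(?service_name, 'run_eSearch')
  # Equality is assumed here. str() lets the test work whether the URI is stored as an IRI or a literal.
  FILTER (str(?service_uri) = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/soap/eutils.wsdl')
}
# Grouping by all selected variables returns each workflow once.
GROUP BY ?service_name ?service_uri ?workflow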
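For readers less familiar with SPARQL aggregation, the ranking that Query1 computes can be sketched in Python over rows of (service name, service URI, workflow). The rows below are hypothetical and only stand in for what the endpoint would return. The `distinct_workflows` flag is an addition for illustration, not something the original query does.

```python
from collections import Counter

# Hypothetical result rows: (service_name, service_uri, workflow).
ROWS = [
    ("run_eSearch", "http://example.org/a.wsdl", "wf/1"),
    ("run_eSearch", "http://example.org/a.wsdl", "wf/1"),  # used twice in wf/1
    ("run_eSearch", "http://example.org/a.wsdl", "wf/2"),
    ("blast", "http://example.org/b.wsdl", "wf/2"),
]


def rank_services(rows, distinct_workflows=False):
    """Count rows per (service name, service URI) and sort by descending count.

    With distinct_workflows=True a service used several times in the same
    workflow is counted once for that workflow.
    """
    if distinct_workflows:
        rows = set(rows)
    counts = Counter((name, uri) for name, uri, _workflow in rows)
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

The distinction matters when reading the spreadsheet: counting rows measures how often a service appears as a processor, while counting distinct workflows measures how many workflows depend on it.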

Process syntactic similarity

(Daniel has written this) During the last plenary, we spent some time trying to detect whether 2 different processors were of the same type.

For our analysis we used the SILK framework, which analyzes the similarity of 2 different processes. We did so because we thought that different processors could have different names in different templates.

We compared the different processors using their title property (dct:title in the configuration). This is the resulting configuration file:

<Interlinks>
    <Interlink id="aemet-geo">
      <LinkType>owl:sameAs</LinkType>
      <SourceDataset dataSource="myExp1" var="a">
        <RestrictTo> ?a rdf:type myExp:Processor . </RestrictTo>
      </SourceDataset>
      <TargetDataset dataSource="myExp2" var="b">
        <RestrictTo> ?b rdf:type myExp:Processor . </RestrictTo>
      </TargetDataset>
      <LinkageRule>
        <Compare weight="1" threshold="0.1" required="true" metric="levenshteinDistance" id="unnamed_5">
          <TransformInput function="lowerCase" id="unnamed_3">
            <Input path="?a/dct:title" id="unnamed_1"></Input>
          </TransformInput>
          <TransformInput function="lowerCase" id="unnamed_4">
            <Input path="?b/dct:title" id="unnamed_2"></Input>
          </TransformInput>
          <Param name="minChar" value="0"></Param>
          <Param name="maxChar" value="z"></Param>
        </Compare>
      </LinkageRule>
      <Filter></Filter>
      <Outputs>
          <Output type="file" minConfidence="0.1">
            <Param name="file" value="C:\DOld\SILK\silk_2.5.3\silk_2.5.3\MyExperimentProcessors.nt"/>
            <Param name="format" value="ntriples"/>
          </Output>
       </Outputs>
    </Interlink>
  </Interlinks>

The endpoint used is: http://rdf.myexperiment.org/sparql

As a result we ended up with a 5.6 MB file containing 24586 "sameAs" relationships. For example:

<http://www.myexperiment.org/workflows/1782/versions/1#dataflow/dataflow/components/1>  <http://www.w3.org/2002/07/owl#sameAs>  <http://www.myexperiment.org/workflows/1650/versions/1#dataflow/dataflow/components/1> .

These relationships could link processes of the same "type", but this needs further exploration.
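The comparison in the linkage rule (lowercase both titles, then compare them by Levenshtein distance) can be illustrated with a small Python sketch. It does not reproduce how Silk interprets its `threshold` or `minConfidence` settings; the `max_distance` parameter, the function names and the example URIs are assumptions made for illustration.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def candidate_links(source, target, max_distance=1):
    """Pair processors whose lowercased titles are within max_distance edits.

    `source` and `target` map processor URIs to titles.
    """
    links = []
    for source_uri, source_title in source.items():
        for target_uri, target_title in target.items():
            distance = levenshtein(source_title.lower(), target_title.lower())
            if distance <= max_distance:
                links.append((source_uri, target_uri, distance))
    return sorted(links)


# Hypothetical processors from two datasets.
SOURCE = {"urn:example:a1": "Run_eSearch", "urn:example:a2": "Parse XML"}
TARGET = {
    "urn:example:b1": "run_esearch",
    "urn:example:b2": "parse_xml",
    "urn:example:b3": "blast",
}
```

A purely title-based match like this links processors that happen to share a name, whatever they actually do, which is one reason the resulting links need further exploration before being read as "same type".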

Indexing and searching inside workflows

The trie structure created for indexing purposes has been encapsulated in order to provide the following two services:

  • Search: receives a sequence of processes and searches for workflows that contain that sequence.
    • Service call: /wfabstraction/rest/search?process=Processor regex_value
    • Input: sequence of process names (?process=text1&process=text2)
    • Output: XML or JSON structure providing the following info (process_id, freq, URIs).
  • Recommend: returns the most frequent next process given a sequence of previous ones.
    • Service call: /wfabstraction/rest/recommend?process=Processor regex_value
    • Input: sequence of process names (?process=text1&process=text2)
    • Output: XML or JSON structure providing the following info (id, prob, freq).
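No output of the actual services is reproduced on this page. As an illustration only, the following Python sketch shows how a trie over process sequences could back both operations. The class names, the `max_depth` bound and the example workflows are assumptions, and the real service may index and score sequences differently. The returned field names (`freq`, `uris`, `id`, `prob`) follow the outputs listed above.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # next process name -> TrieNode
        self.freq = 0       # how many times this sequence was seen
        self.uris = set()   # workflows that contain this sequence


class ProcessTrie:
    def __init__(self, max_depth=4):
        self.root = TrieNode()
        self.max_depth = max_depth  # longest sequence that is indexed

    def add_workflow(self, uri, sequence):
        """Index every contiguous subsequence (up to max_depth) of a workflow."""
        for start in range(len(sequence)):
            node = self.root
            for name in sequence[start:start + self.max_depth]:
                node = node.children.setdefault(name, TrieNode())
                node.freq += 1
                node.uris.add(uri)

    def _find(self, sequence):
        node = self.root
        for name in sequence:
            node = node.children.get(name)
            if node is None:
                return None
        return node

    def search(self, sequence):
        """Return the frequency of the sequence and the workflows containing it."""
        node = self._find(sequence)
        if node is None:
            return {"freq": 0, "uris": []}
        return {"freq": node.freq, "uris": sorted(node.uris)}

    def recommend(self, sequence):
        """Rank the processes seen right after the sequence, most frequent first."""
        node = self._find(sequence)
        if node is None or not node.children:
            return []
        total = sum(child.freq for child in node.children.values())
        ranked = sorted(node.children.items(), key=lambda kv: (-kv[1].freq, kv[0]))
        return [
            {"id": name, "freq": child.freq, "prob": child.freq / total}
            for name, child in ranked
        ]


# Hypothetical workflows given as sequences of process names.
trie = ProcessTrie()
trie.add_workflow("wf1", ["fetch", "parse", "align"])
trie.add_workflow("wf2", ["fetch", "parse", "filter"])
trie.add_workflow("wf3", ["fetch", "parse", "align", "report"])
```

Indexing every suffix of a sequence lets a search match a sequence anywhere inside a workflow, at the cost of a larger trie. Bounding the depth keeps that growth in check for long workflows.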

Meetings

Minutes of sprint planning meeting:

[05/09/2012 18:10:58] pinar.alper: ok
[05/09/2012 18:12:35] pinar.alper: we could use the processor content
[05/09/2012 18:12:37] pinar.alper: to identify
[05/09/2012 18:17:48] Esteban García Cuesta: domain ontology needed to get higher conceptual level
[05/09/2012 18:17:59] Esteban García Cuesta: Dani said:
[05/09/2012 18:18:38] Esteban García Cuesta: Dani: how to know that two sub-graphs do the same?
[05/09/2012 18:19:13] khalid.belhajjame: Process is associated to a service or a script
[05/09/2012 18:19:16] Daniel Garijo: not subgraphs, 2 processors.
[05/09/2012 18:23:15] khalid.belhajjame: Pinar: By Macro do you mean processors that are semantically identical with respect to the action they perform, or whether they are the same implementation?
[05/09/2012 18:23:27] khalid.belhajjame: Esteban: both
[05/09/2012 18:23:31] pinar.alper: ok
[05/09/2012 18:25:36] Esteban García Cuesta: Daniel: from workflow templates to find common things...
[05/09/2012 18:25:41] khalid.belhajjame: Dani: given a repository of templates, cluster templates into different groups according to the macros found in the templates
[05/09/2012 18:25:59] khalid.belhajjame: Esteban: How do you define template
[05/09/2012 18:26:19] khalid.belhajjame: Dani: In Wings there is a "semantic" template
[05/09/2012 18:28:38] Esteban García Cuesta: Dani: start by web services as types
[05/09/2012 18:29:09] Esteban García Cuesta: Dani: find what processors are the same?
[05/09/2012 18:29:19] Daniel Garijo: sorry, the same type.
[05/09/2012 18:29:59] Esteban García Cuesta: Esteban: how to define the similarity of two processes
[05/09/2012 18:30:26] pinar.alper: esteban it is also worth looking at this
[05/09/2012 18:30:28] pinar.alper: http://thedata.org/book/unf
[05/09/2012 18:30:43] pinar.alper: a fingerprint can be generated for processors excluding the processor name
[05/09/2012 18:31:05] Daniel Garijo: i see
[05/09/2012 18:33:52] Esteban García Cuesta: Daniel: do we have to constraint to Taverna?
[05/09/2012 18:34:12] Esteban García Cuesta: Khalid: focus on find similarity of two processes
[05/09/2012 18:35:47] Esteban García Cuesta: Khalid: not clear when a complete workflow are similar
[05/09/2012 18:36:02] pinar.alper: agree with khalid
[05/09/2012 18:37:35] Daniel Garijo: then, why not REuse Antoon's work?
[05/09/2012 18:42:27] pinar.alper: yes there's a lot of research work done in wf recommendation (using Case Based Reasoning, DL reasoning etc), but I do not think there are any usable tools out of that research
[05/09/2012 18:43:11] Daniel Garijo: that is why I changed a bit my research problem
[05/09/2012 18:43:21] Daniel Garijo: when finding macro structures
[05/09/2012 18:44:44] Daniel Garijo: (the motif we described for escience)
[05/09/2012 18:44:55] pinar.alper: ok I see
[05/09/2012 18:45:13] Daniel Garijo: but esteban is solving another problem.
[05/09/2012 18:45:59] pinar.alper: I think we can investigate what are dimensions of similarity, why is similarity needed
[05/09/2012 18:46:48] pinar.alper: from our project's perspective as well wf preservation...so I think scenarios like helping WF repair
[05/09/2012 18:46:51] pinar.alper: etc. could help
[05/09/2012 18:46:56] Esteban García Cuesta: @pinar, yes finding similarity dimensions sounds good
[05/09/2012 18:48:14] pinar.alper: I'm afraid I have to leave the mtg in two minutes, but I can say, I have done way too much reading on WF discovery recommendation at the very beginning of my project. Until Carole stopped me :)
[05/09/2012 18:48:28] pinar.alper: So I can provide some input on similarity dimensions etc
[05/09/2012 18:49:01] khalid.belhajjame: Ok Pinar have a good evening
[05/09/2012 18:49:07] Esteban García Cuesta: thanks pinar
[05/09/2012 18:49:09] pinar.alper: ok thx bye
[05/09/2012 18:49:15] Daniel Garijo: bye pinar!
[05/09/2012 18:50:16] Esteban García Cuesta: Khalid: have a similarity of workflows state of the arts
[05/09/2012 18:50:23] Esteban García Cuesta: discussion
[05/09/2012 18:51:23] Esteban García Cuesta: I think that similarity of processes is also a first step
[05/09/2012 18:53:29] Esteban García Cuesta: have wiki page with papers, problems, suggestions...
[05/09/2012 18:54:00] Daniel Garijo: not only the approach, the problem as well.
[05/09/2012 18:55:59] Esteban García Cuesta: bye!
[05/09/2012 18:56:02] Esteban García Cuesta: thanks

[1] https://lists.isoco.net/pipermail/wf4ever/2012-February/003191.html

[2] http://www.wf4ever-project.org/wiki/display/docs/Workflow+abstraction+for+Indexing%2C+comparasion+and+explanation

[3] http://ceur-ws.org/Vol-216/submission_12.pdf (ordered sequences by using owl-dl)

[4] http://etaxonomy.org/mw/Sequences_in_RDF

[5] http://www.cs.ucr.edu/~skulhari/SuffixTreeSearch.pdf

[6] http://www.nd.edu/~mog/Papers/XiangX-workflow.pdf

[7] http://140.115.80.66/data%20mining%20paper%20databases/Data%20and%20Knowledge%20Engineering/Workflow%20mining%20A%20survey.pdf

[8] Khalid Belhajjame, Carole A. Goble, Stian Soiland-Reyes, and David De Roure. Fostering Scientific Workflow Preservation Through Discovery of Substitute Services. In the proceedings of the IEEE eScience Conference (eScience 2011), IEEE CS, Stockholm, Sweden, 2011.

[9] Mirjam Minor, Ralph Bergmann, Sebastian Görg, and Kirstin Walter. Towards Case-Based Adaptation of Workflows.
