
Individual results

Integrity and Authenticity Evaluation Service

Description

In this service we will build algorithms to compute the quality value of Research Objects and design data models to capture the essential information required for this evaluation. In this way the service will enhance the trustworthiness of a Research Object when a scientist reuses, re-executes or shares it.

This service takes a Research Object as input and evaluates the level of integrity and authenticity of the RO (from now on also referred to as the quality of the RO). It is to be implemented in a RESTful style so that it can be easily used by other Wf4Ever components. The service is currently designed to provide a quality measurement of an RO; however, it can also be used to evaluate the information quality of any information that can be encoded in the RO model. Furthermore, it can be used to evaluate the quality of a workflow using its previously stored provenance information. Therefore, this service will not only enhance the quality awareness of ROs for Wf4Ever users but also provide a generic service that can be adapted to any information on the Web. This will fill a gap in current (Semantic) Web research, namely the lack of quality awareness of data on the Web.
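Although the service API is still being designed, the following Python sketch illustrates how a client might invoke such a RESTful evaluation service. The endpoint path, parameter names and response fields are hypothetical placeholders, not the final Wf4Ever interface.

    # Minimal sketch of calling the I&A Evaluation Service over REST.
    # Endpoint path, parameters and response fields are hypothetical placeholders.
    import requests

    SERVICE_URL = "http://example.org/ia-evaluation"  # placeholder service location

    def evaluate_ro(ro_uri, criteria=("completeness", "accuracy")):
        """Ask the service to score a Research Object against selected quality criteria."""
        response = requests.get(
            SERVICE_URL + "/evaluate",
            params={"ro": ro_uri, "criteria": ",".join(criteria)},
            headers={"Accept": "application/json"},
        )
        response.raise_for_status()
        return response.json()  # e.g. {"ro": ..., "score": 0.82, "per_criterion": {...}}

    if __name__ == "__main__":
        print(evaluate_ro("http://example.org/ROs/my-experiment/"))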

Who else is doing this

To our knowledge, although there is a rising awareness (http://answers.semanticweb.com/questions/1072/quality-indicators-for-linked-data-datasets) of the problem of the quality of data on the Web, there is little established work in this field. The Pedantic Web group (http://pedantic-web.org/) is proposing solutions for monitoring and measuring the freshness of data published on the Web. The Data Quality Constraint Library (http://semwebquality.org/documentation/primer/) is another ongoing development towards spotting potential data quality issues in the current Web.

Differentiating aspects with respect to others

Unlike the above-mentioned work, our service will combine domain-neutral with domain-specific information to compute the quality of an RO. Our design is grounded in concrete scientific needs from actual scientific communities. We are going to design and develop algorithms for a suite of information quality criteria, including completeness, accuracy, etc. Although we start with services catered to specific needs, we bear flexibility in mind, allowing users to configure and select the set of parameters that matter to them for the evaluation. So the specific algorithms can be extended to support more generic needs.

Individual SWOT analysis of an I&A Service

 

Positive

Negative

Internal Factors

Strengths
•    Enhanced trustworthiness of Research Objects
•    Includes expertise from real domain experts in designing evaluation algorithms
•    Background in provenance, information quality, and security-related research
•    Extension of vocabularies to describe provenance traces and allow the measurement of workflow quality
•    Implementation of provenance quality tracking

Weaknesses
•    Lack of a large amount of real research data and research objects for evaluation
•    Unclear user requirements
•    Need to build ad-hoc provenance information collection infrastructure (This is not entirely true; see http://glocal.isoco.net/eps. For workflow provenance, we also have Taverna's logging system)
•    Difficulty in defining quality measurement criteria due to their intrinsic ambiguity

External Factors

Opportunities
•    First actual implementation evaluating the quality of information shared on the Web
•    Refine understanding of the trustworthiness of data on the Web
•    Information quality is emerging as one of the most important topics in the fast-growing Semantic Web (Web of Data)
•    Contribute to upcoming standardization effort in the W3C Provenance Working Group
•    Evaluating trustworthiness of data is a highly desired functionality in the context of the Semantic Web.

Threats
•    Information available in the wild may not be expressible by the Research Object model
•    Technology may not be mature enough.
•    Difficulty in obtaining provenance data in domains outside the scope of the two scenarios covered by the Wf4Ever project.

 

 

 

RO Model

This result will provide a conceptual model for workflow-centric Research Objects, encapsulating process specifications along with metadata in order to provide a preservable, shareable, self-contained unit. To ensure focus, the project will concentrate on Research Objects that encapsulate workflows. The current approach draws on existing vocabularies in order to

  1. Reuse that existing work and avoid duplication/reinvention;
  2. Provide opportunities for interoperation with existing tools.

For example, the aggregation structure of an RO is described using OAI-ORE, while annotations on aggregated resources use Annotation Ontology terms (and this may provide opportunities for the use of tools such as Utopia or the tools generated by the Mindinformatics group at Harvard). The contribution of the RO Model will be to identify the additional vocabulary needed to describe Research Objects that support the preservation of workflows along with their lifecycle.
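As an informal illustration of this reuse of existing vocabularies, the following Python (rdflib) sketch builds a tiny RO-like aggregation with OAI-ORE terms and attaches one annotation. The resource URIs are invented, and the Annotation Ontology terms shown are indicative rather than normative; the actual RO Model vocabulary is still being defined.

    # Illustrative sketch only: an RO-like aggregation expressed with OAI-ORE terms.
    # Example URIs are invented; Annotation Ontology terms are indicative.
    from rdflib import Graph, Namespace, URIRef, Literal, RDF

    ORE = Namespace("http://www.openarchives.org/ore/terms/")
    AO = Namespace("http://purl.org/ao/")  # Annotation Ontology (indicative namespace)

    g = Graph()
    g.bind("ore", ORE)
    g.bind("ao", AO)

    ro = URIRef("http://example.org/ROs/my-experiment/")
    workflow = URIRef("http://example.org/ROs/my-experiment/workflow.t2flow")
    dataset = URIRef("http://example.org/ROs/my-experiment/input.csv")
    annotation = URIRef("http://example.org/ROs/my-experiment/annotations/1")

    # The RO aggregates a workflow and a dataset (OAI-ORE aggregation structure).
    g.add((ro, RDF.type, ORE.Aggregation))
    g.add((ro, ORE.aggregates, workflow))
    g.add((ro, ORE.aggregates, dataset))

    # A simple annotation on the aggregated workflow.
    g.add((annotation, RDF.type, AO.Annotation))
    g.add((annotation, AO.annotatesResource, workflow))
    g.add((annotation, AO.body, Literal("Workflow validated against the example input set")))

    print(g.serialize(format="turtle"))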

Competition/Differentiation

A general notion of a Reproducible Results System is described by Mesirov (Mesirov 2010), who introduces Accessible Reproducible Research, where scientific publications should provide descriptions of protocols clear enough to enable successful repetition and extension.

Other approaches that might be considered as Research Objects include the Scientific Knowledge Objects (Giunchiglia 2009) of the LiquidPub project. These are aggregation structures intended to describe scientific papers, books and journals. The approach explicitly considers the lifecycle of publications in terms of three "states": Gas, Liquid and Solid, which represent early, tentative and finalised work respectively.

Groth et al. (Groth, 2010) describe the notion of a "Nano-publication" – an explicit representation of a statement that is made in scientific literature. Such statements may be made in multiple locations, for example in different papers, and validation of a statement can only be done given its context. An example given is the statement that malaria is transmitted by mosquitoes, which appears in many places in the published literature, each occurrence potentially backed by differing evidence. Each nano-publication is associated with a set of annotations that refer to the statement and provide a minimum set of (community-)agreed annotations that identify authorship, provenance, and so on. The nano-publication approach is more "fine-grained" than the proposed RO model, focusing on single statements along with their provenance.

A number of systems described in the Executable Paper Grand Challenge 2011 also focus on reproducibility. Collage (Nowakowski 2011) provides infrastructure which allows for the embedding of executable code in papers. SHARE (Vangorp 2011) focuses on the issue of reproducibility, using virtual machines to provide execution. Finally, Gavish and Donoho (Gavish 2011) focus on verifiability, through a system consisting of a Repository holding Verifiable Computational Results (VCRs) that are identified using Verifiable Result Identifiers (VRIs). None of these particular systems focuses on an explicit notion of "Research Object", however, and in addition, provenance information is only explicitly considered in the third proposal.

Individual SWOT analysis of RO Model

 

Positive

Negative

Internal Factors

Strengths
• Team with lengthy experience in Semantic Web technology, provenance, support for workflows, ontology and vocabulary development.
• Domain experts/stakeholders within project
• Existing sister projects providing experience and results
• Background activities providing a starting point

Weaknesses
• Danger of over-elaboration
• Danger of over-generalisation/over-scoping
• Narrow focus on two domains (under-scoping)
• Emphasis on "interesting CS questions" to the detriment of low-hanging fruit/easy wins

External Factors

Opportunities
• Engagement /dissemination with other projects through external contacts
• RO Modelling as exemplars of emerging standards (e.g. AO or OAC)
• Models are needed for new publication paradigms, so potentially high impact
• Potential contributions to standardisation activity

Threats
• Other models developed and adopted
• Developed models not generally applicable
• Developed models bad fit with future standards
• Annotation vocabularies change

Recommender system

Description

The Recommender System is a Research Object and aggregated resources recommender system with the following characteristics:

  • Ontology-based inference techniques. The recommender system will use formal techniques for inferring new user ratings and new recommendations from pre-existing ones. This technique will rely on ontologies and the application of the Constrained Spreading Activation technique (Crestani, 1997), an algorithm that propagates item properties through the semantic network formed by the ontology and its possible instantiations (a simplified sketch of this technique follows this list).
  • Multiple granularities. The recommender system will not only be able to recommend Research Objects as a whole, but will also consider the resources that compose a Research Object. Therefore, when making recommendations to a given user it might suggest new Research Objects, or just resources that might be a useful addition/alternative to the ones already aggregated by the Research Objects that the user is currently using or creating.
  • Hybrid approach. The recommender system will use a set of different but complementary recommendation algorithms that will lessen the effects of well-known recommender system issues, such as the cold-start, sparsity and gray-sheep problems.
  • Configurable community policies. Unlike inherently vertical recommender systems, the recommender system will provide recommendations tailored to the concrete policies of different research communities. It will provide both means to define such policies and mechanisms to enact them.
  • User feedback aware. The recommender system will provide means to gather implicit and explicit user feedback and will take it into consideration in order to improve its future recommendations.
  • Content-based discovery. Besides offering new ways for the discovery of scientific material, the recommender system will also provide recommendations based on the way that search and retrieval of scientific content is already performed by researchers. It will thus allow keyword-based subscriptions to Research Objects and research resource content, or to fields such as authors, abstract, publication dates, etc.
  • REST+SPARQL endpoints. The recommender system's results will be accessible through both procedural and data-oriented standards-conforming interfaces.
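To make the ontology-based inference concrete, the following Python sketch shows a deliberately simplified constrained spreading activation pass over a toy semantic network. The nodes, weights and constraint values (decay, threshold, hop limit) are invented for illustration and are not the project's tuned parameters.

    # Simplified constrained spreading activation (after Crestani, 1997) over a toy network.
    # All nodes, weights and constraint values are illustrative only.
    from collections import defaultdict

    # Weighted semantic network: node -> list of (neighbour, relation weight)
    network = {
        "user:alice":          [("ro:gwas-analysis", 0.9), ("topic:genomics", 0.6)],
        "ro:gwas-analysis":    [("topic:genomics", 0.8), ("workflow:annotation", 0.7)],
        "topic:genomics":      [("ro:expression-study", 0.5)],
        "workflow:annotation": [("ro:expression-study", 0.4)],
        "ro:expression-study": [],
    }

    def spread_activation(seeds, decay=0.7, threshold=0.1, max_hops=3):
        """Propagate activation from seed nodes, constrained by a decay factor,
        a firing threshold and a maximum number of hops."""
        activation = defaultdict(float, seeds)
        frontier = dict(seeds)
        for _ in range(max_hops):
            next_frontier = {}
            for node, value in frontier.items():
                for neighbour, weight in network.get(node, []):
                    out = value * weight * decay
                    if out >= threshold:          # constraint: ignore weak signals
                        activation[neighbour] += out
                        next_frontier[neighbour] = out
            frontier = next_frontier
        return dict(activation)

    # Recommend the most activated Research Objects the user has not rated yet.
    scores = spread_activation({"user:alice": 1.0})
    candidates = {n: s for n, s in scores.items()
                  if n.startswith("ro:") and n != "ro:gwas-analysis"}
    print(sorted(candidates.items(), key=lambda kv: -kv[1]))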

Who else is doing this (competition)

There are many initiatives ongoing in the recommender systems field (see (Burke, 2002), (Adomavicius and Tuzhilin, 2005)). Those that are most closely related to our recommender system can be classified into two categories:

  • Recommender systems applied in the scientific domain. CiteULike (Bogers and van den Bosch, 2008) treats users' reference collections as their historical record and applies pure collaborative filtering for recommending new research papers. TechLens (McNee et al., 2002)(Torres et al., 2004) also applies rating-based collaborative filtering. Nonetheless, this approach tries to avoid the cold-start problems of this family of algorithms by adding content-based recommendation and, more importantly, by importing the information about citations contained in the ResearchIndex (Lawrence et al., 1999) citation web. This initiative was later extended by (Ekstrand et al., 2010) with the addition of several new content-based and collaborative filtering algorithms, and the inclusion of citation-graph-based algorithms to determine the relevance of research material. This last proposal was also included in the Synthese (Vellino and Zeber, 2007) recommender system, which in a similar fashion uses the PageRank (Page et al., 1999) algorithm to initialize the rating-based collaborative filtering algorithm on the basis of paper relevance. Finally, Scienstein (Gipp et al., 2009) also uses collaborative filtering algorithms, like the previously summarized initiatives, but complements them with several search-like, content-based information retrieval techniques to avoid cold-start problems.
  • Recommender systems that apply similar ontology-based techniques. Regarding the use of ontology-based techniques for enhancing recommender systems, we base our approach initially on (Middleton et al., 2004). On the topic of using the Constrained Spreading Activation technique and formal models as a mechanism for inferring new recommendations, we base our method on the work summarized in (Cantador et al., 2011).

Differentiating aspects with respect to others

The differentiating aspects of our approach with respect to the aforementioned initiatives that resemble ours are:

  • We shall recommend Research Objects, not just conventional research papers. They are defined as semantically rich aggregations of resources (Bechhofer et al., 2010). As such, in comparison with research papers they are multidimensional structures that also include information about the datasets used in the investigation, executable workflows that implement the proposed algorithms, provenance information about experiments that justify contributions, etc.
  • We combine conventional recommendation algorithms based on collaborative filtering and content matching with those that rely on formal models. Both approaches are valid under different circumstances and have different strengths and weaknesses, and we therefore consider both.
  • We propose the use of declarative recommendation policies that specify the criteria of item relevancy in each of the different user communities and fields.

Individual SWOT analysis of the Recommender Service

 

Positive

Negative

Internal factors

Strengths

  • Experience in the development of ontology- and knowledge-based applications.
  • Members of the project consortium with previous experience in e-Science social scenarios.
  • Members of the project consortium who can provide us with real new research material policies, and who can evaluate the provided results.
  • Developers' background in machine learning and artificial intelligence.

Weaknesses

  • Limited experience in the development of production-deployable recommender systems.
  • Small number of core developers available to integrate the diverse recommender algorithms that we plan to use.

External factors

Opportunities

  • Once we can provide a first running version of the recommender, it can be easily deployed in other e-Science scenarios outside Wf4Ever.
  • The recommender system's capability of being tailored to specific fields can be used to tune it to operate in other domains outside e-Science.
  • The recommender system can be provided as Software-as-a-Service.

Threats

  • As has happened before in myExperiment, the previous initiative in which members of the consortium have attempted to create social e-Science scenarios, the lack of incentives for researchers to share their content (or to participate actively in the social network) may lead to the absence of the minimum data necessary to provide meaningful recommendations, as pointed out in (Zhang et al. 2011).
  • Other groups may be already trying to apply similar techniques in this area.

Collaboration Spheres Construction

Description

Following the real-life metaphor, not every relation is equal; we share some things with workmates, others with family, and others with our closer friends. Most of the time, ROs and provenance workflows are treated equally from an access point of view, but that is not adequate for sharing knowledge between members of different institutions, enterprises, universities, etc., due to problems of privacy, espionage, copyright, etc. The access policy, jointly with a visual exploration metaphor based on collaboration spheres, will allow the management of ROs/workflows and their recommendation based on their characteristics, the different types of users, and their roles in the Wf4Ever project.

The collaboration spheres module will use the information related to the different ROs/workflows to show specific characteristics such as recommendation distances or collaboration strength between the different users/institutions.

Some characteristics of this module will be:

  • Representation of the different ROs/Workflows and their relations in a graph format based on metrics such as similarity or membership (see the sketch after this list).
  • Representation of collaboration spheres with preservation and reutilization as main goals.
  • Control of access and visualization properties based on the users' social networks (creator, collaborator, ..., untrusted) and access policies.
  • Development and adaptation of existing vocabularies for describing collaboration spheres and access policies for the specific scientific communities of the project.
  • User interaction to navigate through the different ROs/WFs, exploiting recommendations and exploring possible ROs/WFs of interest more quickly.
  • New proposals or suggestions of possible paths of exploration to help users find what they are looking for.
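The sketch below illustrates, under invented data and thresholds, how ROs/workflows could be placed into spheres around a user from a weighted relation graph; it is an illustration of the idea rather than the module's actual algorithm.

    # Illustrative sketch of the collaboration-spheres idea: items are placed into
    # spheres around the active user according to similarity accumulated along the
    # graph. Graph, weights and sphere thresholds are invented for this example.
    import networkx as nx

    G = nx.Graph()
    # edge weight = similarity (or collaboration strength) between entities
    G.add_edge("user:alice", "ro:gwas-analysis", weight=0.9)   # created by Alice
    G.add_edge("user:alice", "user:bob", weight=0.6)           # frequent collaborator
    G.add_edge("user:bob", "ro:expression-study", weight=0.8)
    G.add_edge("ro:gwas-analysis", "ro:astro-cube-model", weight=0.2)

    SPHERES = [(0.75, "inner (creator / close collaborators)"),
               (0.40, "middle (collaborators)"),
               (0.0,  "outer (rest / untrusted)")]

    def sphere_of(user, item):
        """Place an item in a sphere using similarity accumulated along the
        shortest path between the user and the item."""
        try:
            path = nx.shortest_path(G, user, item)
            score = 1.0
            for a, b in zip(path, path[1:]):
                score *= G[a][b]["weight"]   # multiply weights along the path
        except nx.NetworkXNoPath:
            score = 0.0
        for threshold, label in SPHERES:
            if score >= threshold:
                return label, round(score, 2)

    for item in ["ro:gwas-analysis", "ro:expression-study", "ro:astro-cube-model"]:
        print(item, "->", sphere_of("user:alice", item))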

Who else is doing this (competition)

Although there are initiatives in different areas that use visual metaphors to make recommendations or to develop workflows, as far as we know there is no complete system that uses this visual metaphor for RO/WF recommendation, finds new paths of exploration, and also takes an access policy into account.

For instance, there are several visual analytics approaches to represent different interpretations of data, and nowadays several communities are working in this area [1,2]. Similarly, many research groups and companies are working on recommendation systems (e.g. Amazon, Netflix, etc.), and there are some well-known techniques, such as collaborative filtering, which have been applied for a while. Despite this, the use of these two approaches together is relatively new, and how to apply it to workflow systems is still an open problem for researchers and practitioners [3]. Regarding user access policies, we want to highlight that, due to the intrinsic nature of privacy and ownership of the data that the Wf4Ever project manages, it is necessary to establish a general framework to control access. This is very well known in contexts such as operating systems, security, etc., and there is a large amount of literature about it.

Differentiating aspects with respect to others

The differentiating aspects of our approach with respect to the above-mentioned initiatives and available work are:

  • As far as we know, there is no system that uses a visual metaphor to visualize workflow recommendations based on their intrinsic characteristics and their users' social networks.
  • We propose the recommendation of ROs/Workflows based on MiM models and workflow information, and not just on users' preferences or users' similarities.
  • We include the use of the circles-of-friends metaphor to implement the access policy.
  • An interactive exploratory tool to find new paths towards the pursued RO/Workflow.

Individual SWOT analysis of the collaboration spheres

 

Positive

Negative

Internal origin

Strengths

  • Experience in the development of visual collaboration spheres metaphors.
  • Experience of the project consortium with recommendation systems.

Weaknesses

  • Difficulties to integrate the access policies into the whole system.
  • Real time recommendations analysis could be very time consuming.

External origin

Opportunities

  • To be one of the first to develop an innovative visual recommendation system for ROs/Workflows.
  • Contributing to new visual exploration metaphors in the visual analytics field.

Threats

  • Lack of user interest if the usability of the visual metaphor is low in specific domains.
  • Lack of enough data to make attractive and reasonable recommendations.

[1] http://www.infovis-wiki.net/index.php/Visual_Analytics

[2] http://www.vismaster.eu/

[3] Zhang, Ji and Liu, Qing and Xu, Kai (2009) FlowRecommender: a workflow recommendation technique for process provenance. In: the Eighth Australasian Data Mining Conference (AusDM 2009), 1-4 Dec 2009, Melbourne, Australia.

Web Services in the genomics domain that cover key elements of data integration, interpretation, and relation finding

Description

Web Services to mine biological text (e.g. abstracts in MedLine) for biological findings and to study the meaning of data points according to associated knowledge embedded in literature. The Web Services will employ the Concept Profile technology developed by the BioSemantics group at the Erasmus MC and the Leiden University Medical Centre. A concept represents a unique reference for a biological entity for which many synonyms may be used in literature. A Concept Profile represents a weighted set of relations between a reference concept and all concepts from a predefined set of concepts, typically derived from biomedical ontologies. The Concept Profile method has proven to be useful to help interpret the results of genomics experiments, and to discover implicit biological relations between biological concepts (DOI: 10.1002/pmic.201000398). The technology is implemented in the Java Web Start tool Anni (http://www.biosemantics.org/anni). Anni and the underlying methods are used in a variety of case studies. Anni is currently being refactored to conform to a Service Oriented Architecture. The resulting Web Services will accept and produce Uniform Resource Identifiers (URIs), referring to concepts from RDF-based ontologies and the ConceptWiki, a resource that offers a social community approach to managing biological concepts (http://www.conceptwiki.org). The Web Services will be published on BioCatalogue, a curated catalogue of Life Science Web Services, and used in the Workflows produced by WP6 (Bioinformatics) of Wf4Ever.

Who else is doing this (competition)

Main competitors are the University of Manchester's National Centre for Text Mining (http://www.nactem.ac.uk/), the National Centre of BioMedical Ontologies (http://www.bioontology.org/), BIOTEC at TU Dresden (http://www.biotec.tu-dresden.de), and the European Bioinformatics Institute, which are the largest providers of text mining web services relevant for biology listed in the BioCatalogue. Text mining in general is an active field within bioinformatics (for a recent review see PMID: 21106487).

Differentiating aspects with respect to others

Our methods use the Concept Profile technology. This allows statistical comparison of sets of concepts by comparing profiles weighted by co-occurrence metrics. Concepts are disambiguated references for sets of synonyms used in literature. Therefore, comparing weighted profiles of concepts provides a cleaner method than term-based methods. Moreover, it enables discovery of implicit relations, i.e. relations never mentioned in any paper before. The discriminating operations of our Web Services perform statistical algorithms on databases of Concept Profiles. As such they complement other text mining services that can be found in the BioCatalogue. In addition, few text mining services produce URIs, and none refer to ConceptWiki entries. This allows us to experiment more directly with linking to the RDF implementations of Research Object facets.
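The following Python sketch illustrates the kind of comparison described above: two concept profiles are treated as sparse weighted vectors and compared with a cosine similarity. The concept identifiers and weights are toy values invented for the example; the actual services operate on curated co-occurrence data.

    # Toy illustration of comparing two weighted concept profiles.
    # Concept identifiers and weights are invented for the example.
    import math

    def cosine(profile_a, profile_b):
        """Cosine similarity between two sparse weighted concept profiles."""
        shared = set(profile_a) & set(profile_b)
        dot = sum(profile_a[c] * profile_b[c] for c in shared)
        norm_a = math.sqrt(sum(w * w for w in profile_a.values()))
        norm_b = math.sqrt(sum(w * w for w in profile_b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    # Profiles keyed by concept identifiers (e.g. ConceptWiki-style URIs).
    gene_a = {"concept:malaria": 0.8, "concept:mosquito": 0.6, "concept:hemoglobin": 0.2}
    gene_b = {"concept:malaria": 0.5, "concept:anemia": 0.7, "concept:hemoglobin": 0.4}

    print(round(cosine(gene_a, gene_b), 3))
    # A high score between profiles that never co-occur directly in any abstract
    # hints at an implicit relation worth investigating.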

Individual SWOT analysis

 

Positive

Negative

Internal factors

Strengths

  • Experience in Web Service development.
  • Possibility to count on real domain experts to evaluate Web Service results.
  • Background in Semantic Web technologies
  • Access to high performance computing facilities to host the web services

Weaknesses

  • Development of several components is ongoing, e.g. the linking of concepts in the Concept Profiles with the concepts in the ConceptWiki is still under development.
  • Web services depend on Concept profile technology implemented in a monolithic application for which the program code is largely undocumented.

External factors

Opportunities

  • Contributing to upcoming standards, i.e. the Concept Web Alliance, for the sharing and preservation of knowledge on the web.
  • Contributing to upcoming standards, i.e. the BioCatalogue, for the sharing and preservation of Web Services.

Threats

  • Usability of the Web Services depends on timely development of the ConceptWiki as a standard resource for concepts.

Genome Wide Association Study (GWAS) interpretation workflows

Description

Workflows that perform data integration for interpretation of (pre-processed) experimental data from whole genome studies, while adopting the technology developed in Wf4Ever for workflow preservation and revisiting workflow results through provenance and semantic annotation.

Who else is doing this (competition)

There are no other workflows specifically designed for interpreting (pre-processed) experimental data from whole genome studies (to our knowledge, based on myExperiment workflow content). There are, however, workflows that perform data integration for interpretation of (pre-processed) experimental data from other types of genomics studies.

Differentiating aspects with respect to others

The differentiating aspects lie in the specific application to data originating from GWAS and the ability to preserve the workflows using Wf4Ever technology.

Individual SWOT analysis of GWAS interpretation workflows

 

Positive

Negative

Internal factors

Strengths

  • Experience in Workflow and Web Service development.
  • Possibility to count on real domain experts to evaluate Workflow results.
  • Background in Semantic Web technologies
  • Access to novel GWAS data
  • A complex case that can benefit from preserving experimental methods and results.

Weaknesses

  • Crucial web service components for the workflows to run still need to be developed
  • Internal novel GWAS data is not yet replicated and can only be used for publication at a later stage of the project.
  • Wf4Ever technology under development

External factors

Opportunities

  • First experiments interpreting novel GWAS results while preserving the process in a workflow supported by Wf4Ever technology may facilitate dissemination
  • Contributing to upcoming standards for the sharing and preservation of workflows, initially by publishing the workflows in the current version of myExperiment and later supported by Wf4Ever technologies
  • Possibility to evaluate wf4ever results from within a life science domain.

Threats

  • Wf4Ever technology may not mature quickly enough to sufficiently demonstrate its impact on life science practice. This is a lesser threat for the bioinformatics community.

 

Research objects, workflows and services in the Astronomy domain

Definition

Provide workflows and research objects in the astronomy domain that are compliant with Wf4Ever tools and models and developed following a set of best practices. These workflows are intended to be representative of the different techniques used in astronomy, including experiments dealing with anything from 1D catalogues of physical quantities to 3D formatted data, either locally stored and processed by users at their workstations, or distributed over a variety of external repositories and accessed and analysed through Virtual Observatory (VO) compliant web services or with the help of local software and scripts. In order to achieve this goal, and more particularly in the case of modelling 3D data of galaxies, new services for the analysis of 3D formatted data will be developed, as well as VO standards for accessing on-the-fly generated data produced by these services.

Who else is doing this (competition)

HELIO (http://www.helio-vo.eu) is an EC FP7 project with a completion date of June 2012. It is a domain-specific virtual observatory for solar physics that will provide access to services to mine and analyse the data, as well as workflows addressing specific needs of their community through the orchestration of their own services and data.

CyberSKA (http://www.cyberska.org) is a project aimed at exploring and implementing the cyber-infrastructure that will be required to address the evolving data-intensive science needs of future radio telescopes such as the Square Kilometre Array. They are developing a web-based workflow builder that supports image segmentation, image mosaicking, spatial re-projection, and plane extraction from data cubes through processes provided as web services.

The VO France Workflow Working Group (http://www.france-ov.org/twiki/bin/view/GROUPEStravail/Workflow) is an early user and enabler of astronomical workflows. One of its main goals is to provide VO-oriented use cases and to implement them as workflows.

Montage (http://montage.ipac.caltech.edu/) is a toolkit for assembling astronomical images into custom mosaics. This toolbox of components has been well studied in computer science workflow systems, and is used in a number of production astronomy systems.

Other project-specific communities (ESO, ESA, etc.) are also expected to provide workflows in the astronomy domain. Most of them are initial pre-processing pipelines whose main intent is to deliver exploitable data produced by a specific instrument.

Differentiating aspects with respect to others

Wf4Ever astronomy workflows and research objects are not tied to the needs of a specific community and intend to cover a wide range of present astronomical digital experiments. They are packed into research objects, embracing all components involved in an experiment in a structured way, compliant with models addressing preservation issues (reproducibility, re-purposing, versioning, decay, etc.) and improving the way experiments are shared, through the development of semantic ontologies to characterize the experiment as well as methods to evaluate its quality, taking into account the use of licences to protect the knowledge of the experiment. Contrary to existing pipelines, Wf4Ever astronomy workflows and research objects are focused on providing scientific insight from science-ready data, and make use of the interoperable framework of public data and services provided by the VO.

Individual SWOT analysis

 

Positive

Negative

Internal factors

Strengths

  • Group of astronomers making use of different techniques
  • Research mainly based on public data
  • Experience in development of VO web services for access of astronomical data
  • Involved in the development of standards for the VO

Weaknesses

  • Small group
  • First contact with workflows and executable documents
  • Wf4Ever technology in development
  • Wf4Ever may not address very specific needs of the astronomical domain related with workflow management functionalities

External factors

Opportunities

  • Increasing interest in workflows in the community due to upcoming big data science
  • Contribution to standards for workflow publishing and preservation in the astronomical community, and more particularly in the VO
  • Increasing interest in a new way of publishing scientific results as interlinked data
  • Promote the use of best practices in astronomy, providing the tools as well

Threats

  • Similar initiatives from data publishers may impose their working methodologies on the community
  • Preservation issues may rely too much on the good behavior of external data and service providers

Gathering of the Astronomy community

Definition

Create a community of users interested in the creation, preservation and sharing of scientific workflows and research objects in astronomy. This will require the presentation of Wf4Ever results at specialized events, demonstrations of the technologies being developed, and their application to the community.

Who else is doing this (competition)

The previously mentioned VO France Workflow Working Group, by means of national meetings and workshops with astronomers, and CyberSKA, through a web portal used as a collaborative working platform where astronomers need to be registered.

Differentiating aspects with respect to others

Wf4Ever aims to create a wider community of users, neither specialized in specific radio data techniques (CyberSKA) nor restricted to a national territory (VO France). The Wf4Ever working methodology provides the community with the means to conduct and also publish scientific research.

Individual SWOT analysis

 

Positive

Negative

Internal factors

Strengths

  • Experience in similar tasks in other projects of e-science
  • Wf4Ever partners experience in community gathering in Bio-genomics domain
  • IAA group members are active participants in the VO community

Weaknesses

  • Lack of a large collection of workflows to be provided as exemplars
  • Lack of Astronomy-oriented workflows management tools

External factors

Opportunities

  • Boost the creation and sharing of workflow knowledge in the astronomical community
  • Promote the use of best practices in the scientific methodology
  • Foster collaborative work and a social network peer-review in Astronomy.
  • Impact on the training of new researchers, teachers, and high school students

Threats

  • Low traction of scientific workflows within Astronomy
  • Development of upcoming big facilities (EELT, SKA, etc.) and service providers postponed, big data science in Astronomy slows down

Models, ontologies, and vocabularies in the Astronomy domain

Definition

Provide interoperable models, ontologies and vocabularies for the characterization of workflows, data and processes involved in astronomical research, taking into account the specificities of the domain and several aspects such as provenance, information quality, integrity and authenticity.

Who else is doing this (competition)

The International Virtual Observatory Alliance (IVOA, http://www.ivoa.net), by means of the Semantics Working Group and the Data Curation and Preservation Interest Group, provides vocabularies for the description of astrophysical objects, data types, concepts, events, or any other phenomena in astronomy. This covers the study of relationships between words, symbols and concepts, as well as the meaning of such representations.

The US Virtual Astronomical Observatory (VAO, http://www.usvao.org) Data Curation and Preservation Group has launched an initiative to create an infrastructure supporting curation, discovery and access to VAO resources. The main objectives of the project are to capture and describe, as completely as possible, the lifecycle of the research process through the linkage between astronomical objects, archival datasets from surveys and catalogues, observing proposals and publications.

Differentiating aspects with respect to others

Since most efforts until now have been directed towards data description and curation, the Wf4Ever added-value contribution will focus on the semantic characterization of methods, processes and workflows, key components of the research lifecycle that are not covered in the VAO initiative.

Individual SWOT analysis

 

Positive

Negative

Internal factors

Strengths

  • Good experience and knowledge of Wf4Ever partners in semantic technologies
  • Involved in the development of standards in the VO

Weaknesses

  • First contact  with semantics in Astronomy
  • Lack of large collection of workflows
  • Wf4Ever technologies in development

External factors

Opportunities

  • Stimulate initiatives for the preservation and linking of all digital components involved in the Astronomy research lifecycle
  • Contribute through semantic interlinking to preservation issues like repeatability of experiments and reproducibility of results
  • Engagement with VAO project in order to achieve digital linking of research object components by complementing their developments with semantics for workflows

Threats

  • Publishers as well as data and service providers may develop their own models
  • IVOA Semantics Working Group and Data Curation and Preservation Interest Group may prioritize data and postpone the adoption of workflows

Integration of existing Astronomy-specific tools with Wf4Ever

Definition

Integration of existing astronomy software implementing SAMP VO libraries with Wf4Ever tools for the management of workflows and research objects, e.g. Taverna, the RO Manager Command Line Tool, Web GUIs, etc. SAMP (Simple Application Messaging Protocol, http://www.ivoa.net/Documents/latest/SAMP.html) is a messaging protocol that enables astronomy software tools to interoperate and communicate. It enables the applications to share data and take advantage of each other's functionality, allowing software tools to exchange control and data information. SAMP supports communication between applications on the desktop and in web browsers, and is also intended to form a framework for more general messaging requirements. Implementing SAMP in Wf4Ever tools will automatically integrate them into an existing ecosystem of astronomy software.
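For illustration, the sketch below shows how a workflow result could be broadcast over SAMP from Python using the astropy.samp client (one possible library choice, not one mandated by Wf4Ever). It assumes a SAMP hub is already running, e.g. started by TOPCAT or Aladin, and that the VOTable file exists locally.

    # Illustrative SAMP broadcast of a workflow result; assumes a running SAMP hub.
    from astropy.samp import SAMPIntegratedClient

    client = SAMPIntegratedClient(name="wf4ever-demo")
    client.connect()  # registers with the locally running SAMP hub

    try:
        message = {
            "samp.mtype": "table.load.votable",            # standard SAMP message type
            "samp.params": {
                "url": "file:///tmp/workflow_output.votable",  # hypothetical result file
                "name": "Workflow output catalogue",
            },
        }
        client.notify_all(message)  # broadcast to all subscribed desktop applications
    finally:
        client.disconnect()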

Who else is doing this (competition)

The UK Virtual Observatory project (AstroGrid) developed in 2007 a version of Taverna consisting of a set of VO plugins and implementing an incipient version of SAMP known as PLASTIC. This AstroTaverna software was based on the now-superseded Taverna 1, and it has not been supported since the disappearance of the AstroGrid project in 2009.

Differentiating aspects with respect to others

Wf4Ever developments will make use of the present standard versions of SAMP, integrating a wider range of existing software. SAMP will be implemented in the latest version of Taverna (Taverna 3), taking advantage of up-to-date features and functionalities.

Individual SWOT analysis

 

Positive

Negative

Internal factors

Strengths

  • Good knowledge of SAMP and VO software
  • Good number of Taverna developers amongst Wf4Ever partners

Weaknesses

  • Workflow management tools not known in Astronomy
  • Integration only with tools implementing SAMP VO libraries

External factors

Opportunities

  • Promote the use of workflows, research objects, and Wf4Ever best practices among the Astronomy community through the seamless integration of Wf4Ever tools and widely known Astronomy software

Threats

  • SAMP may be demoted as VO standard
  • SAMP may not be widely used anymore

Wf4Ever Architecture

Definition

Wf4Ever architecture specifies the design and implementation of scientific workflow preservation systems. It combines elements from scientific workflow management, social networking, and digital libraries, and extends them with contributions from the component level research, including workflow lifecycle management, evolution, sharing and collaboration support, and integrity and authenticity maintenance. In particular, it addresses:

  • The relationship between scientific workflow management systems and repositories, and the external and internal resources (e.g., data services) that they are based on.
  • The relationship between digital libraries, social networks and workflow sharing and collaboration systems.
  • The relationships between scientific workflows, their related objects, and integrity and authenticity mechanisms based on their provenance.

In order to achieve these goals, the architecture defines how the different components work together, including their associated models and interfaces, and it adheres to the following principles:

  • Be lightweight and adaptable.
  • Be flexible enough to cope with new additions.
  • Be compliant with existing and well deployed reference models like OAIS.
  • Adhere to the Linked Data principles for hosting and making the data available within and outside the scientific communities of Wf4Ever.

The definition of the architecture leverages previous experiences of consortium partners in the mentioned areas and, based on them, it is influenced by one of the most widely deployed scientific workflow sharing infrastructures (myExperiment) and the digital library system dLibra.

Who else is doing this (competition)

The particular goal of preserving scientific workflows and related objects has not been addressed in the past. However, we can find some related works in the literature:

  • Reference models and architectures in some of the areas covered by Wf4Ever are:
    • In the context of digital libraries, the DELOS Network of Excellence (http://www.delos.info/), continued by DL.org project (http://www.dlorg.edu/), proposed a reference model [CCF+08] to characterize the Digital Library universe. This model drives the definition of any reference architecture for a specific class of digital library systems characterized by similar goals, motivations and requirements. This model also deals with the preservation of objects, including the specific concepts and relations necessary to model preservation, critical functions for preservation (e.g., transform, visualize and export), preservation policies and quality parameters for preservation (e.g., authenticity, trustworthiness and provenance).
    • In the context of workflow management, a reference architecture is proposed in [LLF+09], which led to the implementation of the VIEW system.
    • The OAIS provides a reference model for an archive, consisting of an organization of people and systems, that has accepted the responsibility to preserve information and make it available for a Designated Community. It provides a framework, including terminology and concepts, for describing and comparing architectures and operations of existing and future archives.
  • Other relevant architectures, which are closely guided by OAIS are:
    • Caspar project defines an architecture for the preservation, access and retrieval of cultural, artistic and scientific knowledge
    • Shaman project defines a reference architecture for digital preservation systems
    • PrestoPRIME project defines also a reference architecture for preserving digital contents of archives and libraries

Differentiating aspects with respect to others

The combination of digital library systems with workflow management capabilities and social networking, in order to achieve adequate integrated support for the whole lifecycle of scientific workflows and their related objects, is still an open challenge. The above-mentioned efforts address only parts of the envisioned Wf4Ever architecture, such as archiving, preservation, access and retrieval, and in general they define architectural frameworks dealing with generic objects without considering the complexities of scientific workflows and their related objects, which have both a static and a dynamic dimension. In contrast, the Wf4Ever architecture addresses scientific workflows as a core component of complex objects (called Research Objects) that include all workflow-related objects, and it can deal with these objects from the original conception of the workflow, through its development (either from scratch or by reusing existing workflows), testing and documentation, to its preservation, in a social networking context.

Individual SWOT analysis

 

Positive

Negative

Internal factors

Strengths

  • Team members with previous experiences in the definition of architectures of production systems, including digital libraries systems, social networking environments and workflow management systems
  • Possibility to build on previous architecture designs in the above areas during the definition of the Wf4Ever architecture
  • Involvement of real users from representative and heterogeneous domains for the extraction of requirements influencing the architecture design
  • Development of technical use cases that exercise the live specification

Weaknesses

  • The concept and model of research object, the complex object aggregating the workflows and related data which is being preserved, is still under development

External factors

Opportunities 

  • Become a "reference" architecture for the preservation of scientific workflows
  • Implementations of the architecture by external projects/institutions
  • Collaboration with other projects with similar goals
  • Contribution to standardization activities, e.g., CWA, for a preservation framework for scientific material

Threats

  • Architecture is not flexible enough to cope with changes in the user/technical requirements
  • Architecture does not fulfill users' requirements

Wf4Ever Toolkit

Definition

The Wf4Ever Toolkit is the set of services, tools and applications, corresponding to the reference implementation instantiating Wf4Ever architecture, which enable the preservation and efficient retrieval of scientific workflows and related objects across a range of domains, as well as their effective sharing, reuse and reproducibility. To achieve its goals, the toolkit integrates selected initial technologies (myExperiment and the open source libraries of dLibra), together with services and components that support the workflow lifecycle management, collaboration and sharing support, and integrity and authenticity maintenance.

The Wf4Ever toolkit implementation adheres to the following principles:

  • Use open standards and protocols (e.g., OAI-ORE and OAI-PMH) as well as RESTful APIs, supporting the integration with different systems and applications.
  • Extend digital library (DL) systems for the management, preservation, indexing and retrieval of workflows and related objects, creating a DL system specialized for workflows

At the moment the Wf4Ever Toolkit comprises the following components (the list will grow/change as the project evolves):

  • RO Digital Library (RODL) for storage, management and retrieval of ROs
    • dLibra for data storage, indexing and retrieval 
    • Semantic Metadata Service (based on Jena) for semantic metadata storage and retrieval 
    • WRDZ for preservation activities
    • UMS (User Management Service) for authentication and authorization of users based on openID
    • Web-based GUI
      • RODL portal 
      • User Management portal
      • myExperiment Import wizard portal 
  • RO Management Command Line Tool (CLT) for management of ROs at the command line level, both locally or remotely 
  • I&A Evaluation Service for evaluation of RO integrity and authenticity
    • CLT  extension
    • Web-based GUI
  • SAMP visualization service for astronomy data visualization
    • CLT extension
    • Web-based GUI
    • Taverna extension
  • Recommender Service for recommendation of ROs
    • Web-based GUI
  • ROBox for synchronization of dropbox resources and RO-DL
  • myExperiment for .....
    • new GUI

Who else is doing this (competition)

There is no concrete implementation of a system focused on the preservation of scientific workflows which combines technologies from scientific workflow management, social networking, and digital libraries. However, there are several relevant systems in each of those areas, as described below.

Differentiating aspects with respect to others

The Wf4Ever toolkit will be the first implementation that addresses the preservation of, and access to, scientific workflows by combining technologies from scientific workflow management, social networking, and digital libraries. In contrast to the above-mentioned systems, which focus on the storage, preservation and retrieval of digital objects, on the management of scientific workflows, or on sharing scientific data, the Wf4Ever toolkit will enable a form of preservation of scientific workflows that goes beyond the storage of simple digital objects (e.g., files in a workflow description language), focusing on aggregations of workflows and their related objects, packed into research objects, as well as on effective indexing and retrieval of these stored workflows for their effective reuse in a social networking context.

Individual SWOT analysis

 

Positive

Negative

Internal factors

Strengths  

  • Leverage existing systems of project members, including one of the most widely deployed scientific workflow sharing infrastructures (MyExperiment) and the digital library system (dLibra)
  • Team members with long experience in the development of software for digital libraries, social networks and workflow management.
  • Good collaboration among developers
  • Adoption of continuous development and integration process, allowing the interaction of users with the developed technologies as early as possible to provide timely feedback
  • Good number of scientific workflows (with related objects), e.g., from myExperiment, to test the developed technologies
  • Two disparate use cases involved in testing the developed technologies, one with a well-established community using scientific workflows and one with an emerging community in this respect
  • First implementation of a preservation system for scientific workflows, and their related objects, packed into research objects
  • Adoption of open standards and protocols

Weaknesses

  • Unstable and continuously evolving underlying models 
  • Different technologies and architectural approaches followed by selected initial systems
  • Unclear long-term vision of some components

External factors

Opportunities

  • Adoption of toolkit for preservation of scientific workflows in other domains
  • Exploitation of individual components outside Wf4Ever scope
  • Integration of toolkit with other systems/components outside Wf4Ever scope
  • Raise workflows awareness and interest in the scientific community
  • Promote the new paradigm of publishing scientific work that is verifiable, repeatable and reproducible

Threats 

  • Toolkit does not fulfill users' expectations
  • Difficult to integrate toolkit with other systems
  • Toolkit not applicable in other domains
  • Other systems/components developed outside Wf4Ever provide equivalent functionality earlier
  • The usage of scientific workflows does not grow as expected

Research Object Digital Library 

Definition

The Research Object Digital Library (RODL) is the software system which collects, manages and preserves for the long term aggregations of scientific workflows and their related objects, packed into research objects, and offers, besides basic digital library functionality (e.g., object registration, storage, search and browse), specialized functionality for research objects, such as management of semantic annotations and metadata based on the research object model.

RODL has been developed on top of the digital library system dLibra, extending it with components for the management of semantic models and for preservation activities. Nevertheless, because of the usage of standard protocols and interfaces, other implementations of the particular components (e.g., digital library systems) could eventually be used. RODL exposes its functionality via a REST interface (a usage sketch follows the component list below). Currently, RODL comprises the following components:

  • dLibra for data storage, indexing and retrieval 
  • Semantic Metadata Service (based on Jena) for semantic metadata storage and retrieval
  • WRDZ for preservation activities
  • UMS (User Management Service) for authentication and authorization of users based on openID
  • Web-based GUI
    • RODL portal 
    • User Management portal
    • myExperiment Import wizard portal 
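The sketch below illustrates how a client might talk to RODL through REST. The base URL, resource paths, headers and media types are placeholders for illustration and do not document the published RODL API.

    # Illustrative RODL REST interaction; all paths and headers are placeholders.
    import requests

    RODL = "http://example.org/rodl"            # placeholder RODL deployment
    AUTH = {"Authorization": "Bearer <token>"}  # UMS-issued credential (placeholder)

    # Create a new (empty) Research Object.
    resp = requests.post(RODL + "/ROs/", headers={**AUTH, "Slug": "my-experiment"})
    resp.raise_for_status()
    ro_uri = resp.headers.get("Location", RODL + "/ROs/my-experiment/")

    # Aggregate a local workflow file into the RO.
    with open("workflow.t2flow", "rb") as wf:
        requests.put(ro_uri + "workflow.t2flow",
                     headers={**AUTH, "Content-Type": "application/octet-stream"},
                     data=wf).raise_for_status()

    # Retrieve the RO manifest (semantic metadata) as RDF.
    manifest = requests.get(ro_uri + ".ro/manifest.rdf", headers=AUTH)
    print(manifest.status_code, manifest.headers.get("Content-Type"))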

Who else is doing this (competition)

There are a number of software packages for use in general digital libraries. Some of the most relevant digital library systems are dLibra (http://dlibra.psnc.pl/), Libronix DL System (http://www.logos.com/ldls), Greenstone (http://www.greenstone.org/), OpenDLib (http://www.opendlib.com/) and its successor the gCube DL management system (http://www.gcube-system.org/), Daffodil (http://www.daffodil.de/), OSIRIS/ISIS (http://dbis.cs.unibas.ch/delos_website/), JeromeDL (http://www.jeromedl.org/), etc. Additionally, there have been previous efforts in the construction of institutional repository software, which focuses primarily on ingest, preservation and access of locally produced documents, particularly locally produced academic outputs. Some of the well-known repositories include Fedora (http://www.fedora-commons.org/), DSpace (http://www.dspace.org/) and EPrints (http://www.eprints.org/). Moreover, some digital object preservation systems include KOPAL (http://kopal.langzeitarchivierung.de), PANIC [HC05] (http://www.metadata.net/panic/), PROTAGE (http://www.protage.eu/), SHAMAN (http://shaman-ip.eu/shaman/), Caspar (http://www.casparpreserves.eu/), Planets (http://www.planets-project.eu/) [FH07], PrestoPRIME (http://www.prestoprime.org/), etc.

Finally, some digital library systems more specific to scientific information include: SPIRE, which deals with imagery, focusing on supporting content-based search; DIRECT (http://direct.dei.unipd.it/), for managing the scientific data produced during an evaluation campaign, which manages the different types of information resources employed in a large-scale evaluation campaign, e.g., data (experimental collections and experiments), information (performance measurements), knowledge (descriptive statistics and hypothesis tests) and communications (papers, talks, seminars), supports the different stages of the campaign, and facilitates the sharing and dissemination of the results; and the Virtual Data Center (VDC), a digital library system for the management and dissemination of distributed collections of quantitative data, which was also the basis for Dataverse (http://thedata.org/), a web-based application for publishing, citing and discovering research data, allowing others to replicate and verify research work.

Differentiating aspects with respect to others

In contrast to the above-mentioned systems, RODL is a digital library system specifically designed for research objects, that is, aggregations of scientific workflows with related objects. So, besides offering generic digital library functionality, RODL enables the management of semantic annotations and metadata based on the research object model defined as part of the project. As such, RODL is able to store, publish and preserve research objects and the resources they aggregate, including the metadata and semantic annotations on those resources, supporting the implementation of specialized services, such as integrity and authenticity evaluation, recommendation, collaboration spheres, etc.

Individual SWOT analysis

 

Positive

Negative

Internal factors

Strengths

  • Partners with strong background in the design and implementation of digital library systems
  • Involvement of domain users during the design of RODL
  • Leverages existing production systems, most of them developed by consortium partners
  • Adoption of open standards and protocols
  • First digital library system specialized for scientific workflows and their related objects, aggregated in complex digital objects (Research Objects)
  • Delivered as open software

Weaknesses

  • Unstable and continuously evolving underlying models
  • Low level of maturity compared to some other conventional digital libraries
  • Some user valued functionality may depend on external services (a direct consequence of a distributed architecture of the Wf4Ever toolkit), and thus on the quality of those services

External factors

Opportunities 

  • Adoption of RODL for preservation of scientific workflows in many domains
  • Instantiation of RODL by 3rd party technology providers
  • Exploitation of RODL components to support other type of complex digital objects
  • Integration of RODL with other systems/components outside Wf4Ever scope
  • Provide an exemplar application for bringing together the digital libraries community and other scientific communities

Threats

  • RODL does not fulfill users' expectations
  • A successful RODL may turn out to be very domain-specific, which would hinder its adoption by domains outside of Wf4Ever
  • External institutions/projects offer products with equivalent functionality, before RODL

