
This question arises from a discussion about whether the services used by a workflow aggregated in an RO are themselves also considered to be aggregated in the RO.

This raises the question: what does it actually mean when we say a resource is aggregated in an RO? Or, more importantly, what purpose does it serve?

I propose that an RO describes the context of an experiment or investigation. It represents an accumulation of the researchers' decision-making processes, which have led to a particular collection of resources that they consider essential, or otherwise germane, to the results being sought or presented. As such, it is very much a reflection of the researchers' knowledge and beliefs.

We can look at aggregation in two ways:

  1. it is something that happens "under the hood", a kind of technical glue that connects the pieces that researchers pull together, both by conscious direct choice (e.g. to apply a workflow W to some data), and by unconscious indirect reference (e.g. workflow W makes use of services WS1, WS2, etc.)
  2. it represents the researchers' conscious choices, and as such the record of aggregation is an important aspect of the knowledge encapsulated in an RO, not just part of the technical machinery.

Both are reasonable, defensible viewpoints. Which is more useful?

I can't help feeling that the first approach duplicates mechanisms that already exist (e.g. the wfdesc description of a workflow referencing a service it uses), and loses an opportunity to capture (through "sheer curation") important information about the researchers' decisions while conducting an investigation. In the second view, the aggregation of a resource in an RO represents an explicit decision that the resource is important to the investigation at hand, either because it is relevant in its own right, or because without it other selected components would not adequately serve their purpose (1). Without this understanding, additional information must be collected to distinguish intentional additions from accidental additions.

(1) For example, a software component "Analyze" may be used explicitly to compute some value from research data, and as such has direct relevance to an investigation. But "Analyze" may call a library "Statistics" that is a separate component and must be obtained separately. A researcher using "Analyze" needs to know this, and must take explicit steps to ensure that "Statistics" is available when running "Analyze" - as such, the inclusion of "Statistics" is a piece of knowledge that a researcher needs in order to complete an investigation. Compare this with a situation where "Statistics" has been packaged together with "Analyze", in which case a researcher does not need to know that "Statistics" has to be available in order to run "Analyze" - that is simply taken care of by the packaging used.

A disadvantage of the second approach is that tools which create aggregations automatically from available data risk creating "noise", in the form of aggregations that a researcher (or reviewer) does not need to be aware of.
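
The distinction between the two views can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ResearchObject` class, its `aggregated` and `references` fields, and the `indirect_only` method are hypothetical names invented here, not part of the RO model or of wfdesc. The `references` mapping stands in for dependency information that is already recorded elsewhere (such as a workflow description naming the services it uses), while `aggregated` holds only what the researcher explicitly chose to include.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ResearchObject:
    """Hypothetical, simplified stand-in for an RO (not the RO model's API)."""

    # Resources the researcher explicitly decided to aggregate.
    aggregated: set[str] = field(default_factory=set)
    # Dependencies recorded elsewhere, e.g. in a workflow description:
    # maps a resource to the resources it refers to.
    references: dict[str, set[str]] = field(default_factory=dict)

    def indirect_only(self) -> set[str]:
        """Resources reachable via references but never explicitly aggregated."""
        seen: set[str] = set()
        stack = list(self.aggregated)
        while stack:
            resource = stack.pop()
            for target in self.references.get(resource, set()):
                if target not in seen:
                    seen.add(target)
                    stack.append(target)
        return seen - self.aggregated


# Workflow W uses services WS1 and WS2; the researcher chose to aggregate
# W, a data file, and WS1, but made no explicit decision about WS2.
ro = ResearchObject(
    aggregated={"W", "data.csv", "WS1"},
    references={"W": {"WS1", "WS2"}},
)
print(sorted(ro.indirect_only()))  # ['WS2']
```

Under the second view, `aggregated` alone records the researcher's decisions, and indirect dependencies such as `WS2` remain recoverable from the references when needed. A tool that automatically copied everything reachable into `aggregated` would erase that distinction, which is the "noise" problem described above.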
