ADMIRAL data packages
This note describes the role of data packages in the ADMIRAL system, with a view to including ADMIRAL experiences in the discussion about Research Objects 1 in the Wf4Ever workflow preservation project 2.
ADMIRAL 3 is a UK JISC-funded project to facilitate the capture of research data and its subsequent publication via an institutional repository. It is being conducted by the Image Bioinformatics Research Group in the Zoology Department of Oxford University, the Oxford Bodleian Library Service, and the British Library. The goal is to make it easy for researchers to collect, curate, submit, publish and review datasets in support of conventional paper publications (we refer to this as "sheer curation" 4).
The main day-to-day interface between researchers and the ADMIRAL system is a shared file system implemented using common open source software (Linux, Samba, etc.), which is easily accessed from most personal computer systems without requiring installation of additional software. Researchers are initially encouraged to use this for keeping copies of their work-in-progress datasets by provision of automatic daily backups to a University-managed backup facility. The shared file system is overlaid with web access; i.e. data in the shared file system can be accessed and updated using HTTP and WebDAV protocols, using the same access credentials. This allows additional services for processing and presentation of data in the file system; one such service is packaging and submission of selected datasets to a data repository run by the university library service. Other such services may facilitate augmentation of local data with information from global repositories, or running the data through externally provided analysis services, or through externally provided workflows.
At all stages of data collection and analysis, additional data and metadata may be collected and associated with the original raw data. At this stage, we do not distinguish between data and metadata: metadata is incorporated as just another file in the filestore. Any file format can be used for metadata (but we have a general preference for RDF, all other things being equal).
As well as a file sharing server, the ADMIRAL system also implements a web server for accessing both the data and locally hosted web applications for submitting datasets to the university data repository, and more.
The role of data packages
When a dataset is to be submitted to the data repository, selected files (data and metadata) are collected into a package along with a manifest listing the component files. This package is submitted as a single entity to the data repository. This package is what I refer to here as a "data package", which shares some of the characteristics described as pertaining to Research Objects: a transferable entity (or "object") that contains both data and metadata describing that data.
The notion of curation by addition 5 includes the principle: "Rather than try to get systems to learn the individual ways that researchers will store their stuff, we need to capture whatever they give us and, initially, present that to end-users. In other words, not to sweat it that the data we've put out there has a very narrow userbase, as the act of curation and preservation takes time." While there is a need to capture, store and utilize metadata, there are very few requirements on what metadata must be present. As such, an ADMIRAL data package might be seen as a sub-minimal example of what a Research Object might be.
For ADMIRAL, the shared file system represents the "sheer" interface between the researchers' day-to-day work and the data curation tool chain. The data package embodies an interface between the researchers' day-to-day work space and the more controlled data preservation and publication environment. This might be viewed as a specific case of a Research Object that encapsulates information transferred between different stages of research information processing.
Data package abstract structure
@@terminology used here is subject to review
A data package consists of a collection of resources, and is presumed to be identified by a URI (which is not necessarily part of the package, and hence may be undetermined).
Each resource is identified by a URI, which is an extension of the data package URI. This means that for each resource, there is a relative URI reference that can be resolved against the data package URI to yield the resource URI.
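The resolution step can be illustrated with a minimal Python sketch using the standard library. The package URI and the function names here are hypothetical, chosen only for illustration; the sketch also assumes the package URI ends with "/", so that resolved resource URIs extend it.

```python
from urllib.parse import urljoin

# Hypothetical package URI (an assumption for this sketch). The trailing
# slash matters: without it, urljoin would replace the last path segment.
PACKAGE_URI = "http://databank.example.org/datasets/my-dataset/"


def resource_uri(package_uri: str, relative_ref: str) -> str:
    """Resolve a relative URI reference against the data package URI."""
    return urljoin(package_uri, relative_ref)


def relative_reference(package_uri: str, uri: str) -> str:
    """Recover the relative reference of a resource URI that extends the package URI."""
    if not uri.startswith(package_uri):
        raise ValueError("resource URI does not extend the package URI")
    return uri[len(package_uri):]
```

For instance, resolving "images/plate1.tif" against the package URI above yields a URI under the package, and relative_reference recovers the original reference from it.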
Each resource has an associated data stream, which is a sequence of octets.
There is a distinguished manifest resource in the data package containing metadata about the other resources in the dataset. Minimally, it contains:
- a dataset local identifier (a local name that is used in constructing a URI for the dataset)
- username of creator
- a one-line title of the dataset
- a simple textual description of the dataset
The manifest resource may also contain references to other resources in the dataset that also contain metadata about the data.
Additionally, it may contain (this information being added when the data package is submitted):
- enumeration of the URIs of the resources in the data package
- a version number for the data package
- embargo status and date
- date the package was created (submitted)
- date the package was last modified
- reference to any resource from which the data package has been derived
Any additional descriptive information may be added to the manifest. Any such value is identified by an arbitrary URI, and may have a value that is one of the XML schema built-in "anySimpleType" datatypes 7.
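As a rough sketch, the abstract manifest content listed above could be modelled as a simple record. The class and field names below are invented for illustration and are not part of ADMIRAL; the value type is a loose stand-in for the XML Schema simple types.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional, Union

# Loose stand-in for XML Schema simple-typed values (an assumption).
SimpleValue = Union[str, int, float, bool, date]


@dataclass
class Manifest:
    # Minimal content
    identifier: str          # dataset local identifier
    creator: str             # username of creator
    title: str               # one-line title
    description: str         # simple textual description
    # References to other resources that carry further metadata
    see_also: List[str] = field(default_factory=list)
    # Information that may be added when the package is submitted
    resources: List[str] = field(default_factory=list)
    version: Optional[int] = None
    embargoed: Optional[bool] = None
    embargoed_until: Optional[date] = None
    created: Optional[date] = None
    modified: Optional[date] = None
    derived_from: Optional[str] = None
    # Arbitrary additional values, keyed by property URI
    extra: Dict[str, SimpleValue] = field(default_factory=dict)

    def missing_minimal_fields(self) -> List[str]:
        """Return the names of minimal fields that are empty."""
        required = ("identifier", "creator", "title", "description")
        return [name for name in required if not getattr(self, name).strip()]
```

A check such as missing_minimal_fields could be run before packaging, so that a dataset without a title or description is flagged early.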
Data package concrete implementation
The data package format used by ADMIRAL is inspired by BagIt 6: "BagIt is a hierarchical file packaging format for the exchange of generalized digital content. A 'bag' has just enough structure to safely enclose descriptive 'tags' and a 'payload' but does not require any knowledge of the payload's internal semantics. This BagIt format should be suitable for disk-based or network-based storage and transfer."
The data package is serialized as a ZIP file, whose internal directory and file names are interpreted as relative URI references that are resolved against the package URI.
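The mapping from ZIP member names to resource URIs might be sketched as follows. This is an illustration under assumptions, not ADMIRAL's code: the function names are hypothetical, and percent-encoding member names before resolution is a choice made here so that names containing spaces or colons yield valid relative references.

```python
import io
import zipfile
from typing import Dict
from urllib.parse import quote, urljoin


def package_resources(zip_bytes: bytes, package_uri: str) -> Dict[str, bytes]:
    """Map each resource URI to its data stream (a sequence of octets)."""
    resources = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # directories carry no data stream
            # Treat the member name as a relative URI reference.
            uri = urljoin(package_uri, quote(info.filename))
            resources[uri] = zf.read(info)
    return resources


def make_demo_package() -> bytes:
    """Build a tiny in-memory ZIP with made-up contents, for demonstration only."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("manifest.rdf", "<placeholder/>")
        zf.writestr("data/run 1.csv", "t,v\n0,1\n")
    return buf.getvalue()
```

Calling package_resources(make_demo_package(), package_uri) with a package URI ending in "/" gives one entry per file, keyed by the resolved resource URI.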
The distinguished metadata resource is identified by the relative URI "manifest.rdf". The data stream associated with this resource carries an RDF/XML representation of the manifest data. At present, the enumeration of package contents uses OAI/ORE 8 vocabulary terms.
For manifest references to additional metadata, we currently use rdfs:seeAlso, but this may be subject to review. The object of the property is a URI reference to a resource in the same data package that contains further RDF/XML, which is merged with the manifest by a standard RDF graph merge operation.
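The following sketch shows the general idea of following rdfs:seeAlso references and merging the results. It is deliberately limited and makes several assumptions: it uses only the Python standard library rather than a real RDF parser, it understands only top-level rdf:Description elements with simple property elements, it follows only references made directly from the manifest, and it treats the merge as a set union, which matches an RDF graph merge only when no blank nodes are involved. The function names and the read_resource callback are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"


def simple_triples(rdf_xml: str, base_uri: str) -> set:
    """Extract triples from a restricted subset of RDF/XML (see assumptions above)."""
    triples = set()
    root = ET.fromstring(rdf_xml)
    for desc in root.findall(f"{{{RDF}}}Description"):
        subject = urljoin(base_uri, desc.get(f"{{{RDF}}}about", ""))
        for prop in desc:
            # ElementTree tags look like "{namespace}local"; join the two parts.
            predicate = prop.tag[1:].replace("}", "", 1)
            ref = prop.get(f"{{{RDF}}}resource")
            if ref is not None:
                obj = ("uri", urljoin(base_uri, ref))
            else:
                obj = ("literal", prop.text or "")
            triples.add((subject, predicate, obj))
    return triples


def merged_manifest(read_resource, package_uri: str) -> set:
    """Merge manifest.rdf with the resources it references via rdfs:seeAlso.

    read_resource is a caller-supplied function mapping a resource URI to
    the RDF/XML text of that resource.
    """
    manifest_uri = urljoin(package_uri, "manifest.rdf")
    graph = simple_triples(read_resource(manifest_uri), manifest_uri)
    targets = sorted(
        o[1] for (_, p, o) in graph if p == RDFS + "seeAlso" and o[0] == "uri"
    )
    for uri in targets:
        graph |= simple_triples(read_resource(uri), uri)
    return graph
```

A production implementation would more likely rely on an RDF library, which handles the full RDF/XML syntax and renames blank nodes apart during the merge.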
Our implementation allows a small variation on the ZIP file format which simplifies our client implementation: if the ZIP file contains a single top-level directory then the manifest.rdf file may be located inside that directory rather than in the root of the ZIP file structure. e.g.
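The lookup rule just described can be sketched in Python as follows. The function names are hypothetical and the sketch is one possible reading of the rule, not the ADMIRAL implementation: the manifest is looked for in the ZIP root first, and inside the top-level directory only when every file lives under a single such directory.

```python
import io
import zipfile
from typing import List, Optional


def find_manifest(zf: zipfile.ZipFile) -> Optional[str]:
    """Return the member name of manifest.rdf, or None if it cannot be located."""
    names = [n for n in zf.namelist() if not n.endswith("/")]  # files only
    if "manifest.rdf" in names:
        return "manifest.rdf"
    top_level = {n.split("/", 1)[0] for n in names}
    # Variation: a single top-level directory may hold the manifest.
    if len(top_level) == 1 and all("/" in n for n in names):
        candidate = top_level.pop() + "/manifest.rdf"
        if candidate in names:
            return candidate
    return None


def make_zip(names: List[str]) -> zipfile.ZipFile:
    """Build an in-memory ZIP with empty members, for demonstration only."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, "")
    return zipfile.ZipFile(buf)
```

With this reading, a ZIP containing "pkg/manifest.rdf" and "pkg/data/a.csv" is accepted, while one whose files are spread over two top-level directories is not.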
Example of data package metadata submitted to the data repository
Metadata in data packages submitted to the data repository is completely optional, but in practice we supply minimal descriptive metadata as part of a submission, something like that shown below. Additional and arbitrary metadata MAY be included in the RDF manifest file if it is available and known to the submission process. The ADMIRAL submission program maintains a copy of the submitted manifest.rdf file in the source file system.
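A manifest carrying only the minimal fields could be generated along the following lines. This is an illustrative sketch only: the Dublin Core property URIs, the use of "." as a relative subject reference, and the function name are all assumptions made here, and are not necessarily the vocabulary or conventions that ADMIRAL uses.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Assumption: Dublin Core terms stand in for the actual manifest vocabulary.
DCTERMS = "http://purl.org/dc/terms/"


def minimal_manifest(identifier: str, creator: str, title: str, description: str) -> bytes:
    """Serialize the four minimal manifest fields as RDF/XML."""
    ET.register_namespace("rdf", RDF)
    ET.register_namespace("dcterms", DCTERMS)
    root = ET.Element(f"{{{RDF}}}RDF")
    # Assumption: the subject is a relative reference, since the package
    # URI may be undetermined when the manifest is written.
    desc = ET.SubElement(root, f"{{{RDF}}}Description", {f"{{{RDF}}}about": "."})
    fields = (
        ("identifier", identifier),
        ("creator", creator),
        ("title", title),
        ("description", description),
    )
    for term, value in fields:
        ET.SubElement(desc, f"{{{DCTERMS}}}{term}").text = value
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)
```

The returned bytes could be written as manifest.rdf into the ZIP file before submission.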
Example of data package metadata returned by the repository
The Databank repository to which a data package is submitted augments the supplied metadata with an ORE description of the package contents and other information about the package's status within the repository. This is returned when an application accesses the data package stored in the repository. The metadata returned corresponding to the above submission metadata might look like this:
References and notes
1 "Why Linked Data is Not Enough for Scientists", 2010, http://eprints.ecs.soton.ac.uk/21587/
4 http://en.wikipedia.org/wiki/Digital_curation#Sheer_curation (link retrieved 2011-01-18)
5 http://oxfordrepo.blogspot.com/2008/10/modelling-and-storing-phonetics.html (link retrieved 2011-01-18)
6 https://confluence.ucop.edu/display/Curation/BagIt (link retrieved 2011-01-18)
7 "XML Schema Part 2: Datatypes Second Edition", W3C, 2004, http://www.w3.org/TR/xmlschema-2/