Title: Enabling the reusability of scientific data: Experiences with designing an open access infrastructure for sharing datasets
1 Enabling the reusability of scientific data: Experiences with designing an open access infrastructure for sharing datasets
Simon J. Coles, EPSRC National Crystallography Service, School of Chemistry, University of Southampton
2 Data: the Publication Problem
2,000,000
25,000,000
450,000
3 A Different Approach to Data Publication?
Underlying data
Intellect and Interpretation
4 Requirements
- Capture of all digital data and information generated during the course of an experiment
- Data validation
- Adding value
- Archival system for data with attached bibliographic and chemical metadata
- Automatic report generation
- Schema and protocols for publication and dissemination of a dataset
5 Open Access Crystal Structure Archive
ecrystals.chem.soton.ac.uk
6 Access to the Underlying Data
7 Publicising Content
8 Harvesting, Linking and Aggregating
9 Usability: Quality and Uniformity of Data
- Different laboratories, practices and instruments present a heterogeneous body of data
- Publish according to IUCr ratified schema
- To support publication according to this schema, a toolbox add-on to the archive has been developed
- Toolbox requires only 2 mandatory files and is capable of performing file format conversions and generating value-added files
10 Usability: Ease of Deposition and Metadata Quality
- Minimal number of manual metadata entries; many can be hardwired into the system
- Deposition guidelines initially prepared by students to provide impartial feedback
- Full documentation and in-line help/examples
- Restricted lists, e.g. keywords
- Data deposited automatically by toolbox
- Automated generation of metadata for report and OAI interface
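As a rough illustration of how a restricted keyword list might be enforced at deposition time, the sketch below checks submitted keywords against a controlled vocabulary. The vocabulary contents and the function name `check_keywords` are assumptions made for illustration; the archive's actual list and implementation are not shown in these slides.

```python
# Hypothetical controlled vocabulary; the archive's real keyword list is not given here.
ALLOWED_KEYWORDS = {"organic", "inorganic", "organometallic", "polymorph"}


def check_keywords(submitted):
    """Split submitted keywords into (accepted, rejected) against the restricted list."""
    accepted, rejected = [], []
    for raw in submitted:
        keyword = raw.strip().lower()  # normalise case and whitespace before comparing
        if keyword in ALLOWED_KEYWORDS:
            accepted.append(keyword)
        else:
            rejected.append(keyword)
    return accepted, rejected


accepted, rejected = check_keywords(["Organic", " polymorph ", "shiny"])
print(accepted)  # ['organic', 'polymorph']
print(rejected)  # ['shiny']
```

Rejecting free-text entries at this point is one way to keep the metadata uniform enough for later linking and aggregation.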
11 Usability: Data Validation
- Peer review removed from self-deposit publication
- Simple checks for consistency made by the toolbox
- Checks for crystallographic integrity made through a web service (IUCr checkCIF)
- Introduction of a data editor for the archive: a deposition must be signed off by a recognised professional before going live
- Quality indicators automatically taken from dataset and presented in HTML jump-off page
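A minimal sketch of how quality indicators might be pulled from a deposited CIF for display on the jump-off page. The three data names come from the core CIF dictionary, but which indicators the archive actually shows is an assumption, the sample values are invented, and the line-based reader below is a simplification rather than a full CIF parser.

```python
# Data names from the core CIF dictionary; which ones the archive displays is an assumption.
QUALITY_TAGS = (
    "_refine_ls_R_factor_gt",
    "_refine_ls_wR_factor_ref",
    "_refine_ls_goodness_of_fit_ref",
)

# Invented sample values, for illustration only.
SAMPLE_CIF = """\
data_example
_refine_ls_R_factor_gt            0.0412
_refine_ls_wR_factor_ref          0.1035
_refine_ls_goodness_of_fit_ref    1.043
"""


def extract_quality_indicators(cif_text):
    """Return {data name: float} for simple 'name value' lines; not a full CIF parser."""
    indicators = {}
    for line in cif_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in QUALITY_TAGS:
            # Drop a trailing standard uncertainty such as "1.043(2)" before converting.
            indicators[parts[0]] = float(parts[1].split("(")[0])
    return indicators


indicators = extract_quality_indicators(SAMPLE_CIF)
for tag in QUALITY_TAGS:
    print(tag, indicators.get(tag))
```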
12 Usability: Identifiers
- URL of deposited dataset provides an identifier
- Persistent only if the institutional support model is accepted / adopted
- Signed up to an agency to register metadata relating to datasets with a DOI
- Pay registry to ensure that DOI always resolves to associated dataset (10 cents to register, 1 cent per annum to maintain)
- InChI chemical identifier: a unique text descriptor for a molecule
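The DOI fees quoted above can be turned into a quick cost estimate. This is a worked sketch only: it assumes the annual fee is charged for every year a DOI is maintained, and the function name and the example figures (1,000 datasets, 10 years) are illustrative rather than taken from the archive.

```python
REGISTRATION_CENTS = 10  # one-off fee per DOI, as quoted on the slide
ANNUAL_CENTS = 1         # maintenance fee per DOI per year, as quoted on the slide


def doi_cost_cents(n_datasets, years):
    """Total cost in cents, assuming the annual fee applies to every year maintained."""
    return n_datasets * (REGISTRATION_CENTS + ANNUAL_CENTS * years)


# Example: 1,000 datasets kept resolvable for 10 years.
total = doi_cost_cents(1000, 10)
print(total / 100)  # 200.0 (whole currency units)
```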
13 Usability: Dissemination and Aggregation
- OAI metadata schema ratified by IUCr and chemical community
- OAI covers bibliographic terms; chemical terms must be introduced
- Both library and subject-specific aggregators satisfied
- Chemical linking: InChI, chemical classifications and restricted keywords list
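To make the mix of bibliographic and chemical terms concrete, here is a hedged sketch of an `oai_dc` (unqualified Dublin Core) record built with Python's standard library. Carrying the InChI in a second `dc:identifier` and the keywords in `dc:subject` is an assumption for illustration; the schema actually ratified for the archive is not reproduced in these slides. The URL, title and creator are placeholders, and the InChI shown is that of methane.

```python
import xml.etree.ElementTree as ET

# Standard namespaces for unqualified Dublin Core in OAI-PMH.
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("oai_dc", OAI_DC_NS)
ET.register_namespace("dc", DC_NS)


def build_record(title, creator, url, inchi, keywords):
    """Build an oai_dc element holding bibliographic and chemical terms."""
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")

    def add(name, text):
        ET.SubElement(root, f"{{{DC_NS}}}{name}").text = text

    add("title", title)
    add("creator", creator)
    add("identifier", url)    # bibliographic identifier (dataset URL)
    add("identifier", inchi)  # chemical identifier (assumed placement)
    for keyword in keywords:
        add("subject", keyword)
    return root


record = build_record(
    "Example crystal structure",        # placeholder title
    "A. Depositor",                     # placeholder creator
    "https://example.org/dataset/1",    # placeholder URL
    "InChI=1S/CH4/h1H4",                # InChI for methane
    ["organic"],
)
print(ET.tostring(record, encoding="unicode"))
```

A library aggregator can read the title, creator and URL without knowing any chemistry, while a subject-specific aggregator can pick out the InChI and keywords for chemical linking.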
14 Usability: Endorsement
- Feedback during development from technical publishing arm of IUCr
- Designed for automatic incorporation into CSD (global database operated by CCDC)
- Accepted by Executive Committee of IUCr
- Reuse of data achieved in collaboration with Leverhulme Centre for Molecular Informatics
15 Usability: Community Uptake
- Southampton archive about to publish routinely via the archive
- Five crystallography laboratories in UK agreed to adopt philosophy, install and populate archives
- CCDC will harvest required data from all archives
- IUCr will harvest and curate all data
- Develop aggregator services in collaboration with IUCr
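Harvesting from several institutional archives could look roughly like the sketch below, which only constructs OAI-PMH `ListRecords` request URLs and makes no network calls. The base URLs are hypothetical placeholders (the real endpoints are not given in these slides), and the helper name `list_records_url` is an assumption.

```python
from urllib.parse import urlencode

# Hypothetical base URLs; real archive endpoints are not given in the slides.
ARCHIVES = [
    "https://archive-a.example.org/oai",
    "https://archive-b.example.org/oai",
]


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build an OAI-PMH ListRecords request URL for one archive."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        params["from"] = from_date  # selective harvesting since a given date
    return f"{base_url}?{urlencode(params)}"


for base in ARCHIVES:
    print(list_records_url(base, from_date="2006-01-01"))
```

Because every archive exposes the same protocol, an aggregator only needs the list of base URLs to grow as more laboratories install and populate archives.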
16 Usability: The Next Challenges
- Full acceptance by chemical community
- Validation worries
- Curation worries
- The requirement for as many peer reviewed publications as possible (despite quality)
- Full acceptance by wider chemistry publishing community
- Loss of control over underlying data
- Faith in Open Archives replacing experimental descriptions in articles
- Development of fully functional aggregator services