Title: eCrystals Federation
1. eCrystals Federation: Open Repositories for Data-driven Science
Dr Liz Lyon, UKOLN, University of Bath, UK
Dr Simon Coles, University of Southampton, UK
Chemical Informatics Workshop, Manchester, March 2008
- Context: institutional data repositories, crystallography exemplar
- Scale: repository federations
- Longevity: digital curation and preservation
- Integration: semantic challenges
3. eBank Project: building the eCrystals Data
- Started Sept 2003
- Scholarly knowledge cycle context
- UKOLN-led interdisciplinary team
- ePrints platform at Southampton
- Institutional Repository exemplar
- Embedded in workflow
- http://ecrystals.chem.soton.ac.uk
4. Scaling Up Report: Phase 3 findings
- Data policy should reflect lab practice and institutional model
- Diverse lab practice: LIMS, proprietary formats
- Data quality criteria/validation
- Prior publication problem
- We need automated assignment of terms for data discovery
- No discipline preservation model
6. eCrystals Repository: ePrints.org v3.0
7. Repository Foundations
Learned society subject repository support
- Using simple Dublin Core
  - Crystal structure
  - Title (Systematic IUPAC Name)
  - Authors
  - Affiliation
  - Creation Date
- Additional chemical information through Qualified Dublin Core
  - Empirical formula
  - International Chemical Identifier (InChI)
  - Compound Class Keywords
- Specifies which datasets are present in an entry
- Application Profile: http://www.ukoln.ac.uk/projects/ebank-uk/schemas/
- DOI links: http://dx.doi.org/10.1594/ecrystals.chem.soton.ac.uk/145
- Rights and Citation: http://ecrystals.chem.soton.ac.
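As an illustration of the metadata fields listed on this slide, here is a minimal sketch of how one entry might be serialised as Dublin Core XML. The `dc` namespace is the standard Dublin Core element set. The `ex` namespace, its element names, the `build_record` helper and the sample values are hypothetical and are not taken from the eBank-UK application profile.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # standard Dublin Core element set
EX_NS = "http://example.org/chem-terms/"    # hypothetical namespace for chemical qualifiers

ET.register_namespace("dc", DC_NS)
ET.register_namespace("ex", EX_NS)

# Illustrative entry: names, affiliation and date are placeholders.
SAMPLE_ENTRY = {
    "title": "benzene",
    "authors": ["A. Example", "B. Example"],
    "affiliation": "Example University",
    "created": "2008-03-01",
    "formula": "C6H6",
    "inchi": "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H",
    "keywords": ["aromatic hydrocarbon"],
}


def build_record(entry):
    """Return an Element with simple DC fields plus chemical qualifiers."""
    record = ET.Element("record")

    def add(namespace, name, text):
        # ElementTree expects qualified names in the form "{namespace}name".
        ET.SubElement(record, f"{{{namespace}}}{name}").text = text

    # Simple Dublin Core part
    add(DC_NS, "type", "Crystal structure")
    add(DC_NS, "title", entry["title"])
    for author in entry["authors"]:
        add(DC_NS, "creator", author)
    add(DC_NS, "date", entry["created"])

    # Additional chemical information (hypothetical element names)
    add(EX_NS, "affiliation", entry["affiliation"])
    add(EX_NS, "empiricalFormula", entry["formula"])
    add(EX_NS, "inchi", entry["inchi"])
    for keyword in entry["keywords"]:
        add(EX_NS, "compoundClass", keyword)
    return record


if __name__ == "__main__":
    print(ET.tostring(build_record(SAMPLE_ENTRY), encoding="unicode"))
```

Keeping the simple and qualified fields in separate namespaces means a generic harvester can read the `dc` elements and ignore the chemistry-specific ones.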
8. Federation: interoperability and linking services
- Roll-out in 2 phases led by University of Southampton
- Establish Federation policies, application profile, mappings
- Bi-directional links with derived articles in publisher repositories (IUCr, Royal Society of Chemistry (RSC), Chemistry Central): scholarly knowledge cycle
- StOReLink project
- Test linking options: StORe middleware and CLADDIER
- OAI-ORE Testbed: eChemistry project
9. Laboratory practice and workflow
- Community standard: CIF
- Mixed lab practice: central service facility versus single staff crystallographer in department
- Achieve end-to-end workflow
- Challenge of instrument manufacturers with proprietary formats
- Repository Lite for smaller lab operations?
X-ray diffractometers
10. eBank-UK Phase 3: Curation and Preservation Study
Sustainability issues
- http://www.ukoln.ac.uk/projects/ebank-uk/curation/
- Examined four main areas:
  - Audit and certification (TRAC, DRAMBORA, NESTOR, ISO International repository audit and certification BOF Group)
  - The Open Archival Information System (OAIS) and Representation Information (RI)
  - eBank-UK application profile and preservation metadata
  - ePrints.org repository platform
Recommendations
- Self-assessment using DRAMBORA
- Consider Representation Information in wider context
- Develop preservation strategy
- Capture preservation metadata: PREMIS
11. Semantic issues
- Crystallographic schema underpins CIF (Crystallographic Information Framework), but is limited to data parameters, e.g. cell_length_a
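In a CIF file each data parameter is written as a data name with a leading underscore followed by its value, so the parameter named on this slide appears as `_cell_length_a`. The following is a minimal sketch of reading such items. The sample values and helper names are made up for illustration, and the parser handles only single-line name-value pairs, not `loop_` blocks or multi-line text fields.

```python
import re

# Made-up fragment in CIF style; the numbers are not from a real structure.
CIF_TEXT = """\
data_example
_cell_length_a    10.123(4)
_cell_length_b    11.456(5)
_cell_length_c    12.789(6)
_cell_angle_alpha 90
"""


def parse_simple_items(text):
    """Collect single-line '_name value' pairs into a dict of strings."""
    items = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("_"):
            continue  # skip data_ headers, comments and blank lines
        parts = line.split(None, 1)
        if len(parts) == 2:
            items[parts[0]] = parts[1].strip()
    return items


def numeric_value(raw):
    """Drop a trailing standard uncertainty, e.g. '10.123(4)' -> 10.123."""
    return float(re.sub(r"\(\d+\)$", "", raw))


if __name__ == "__main__":
    items = parse_simple_items(CIF_TEXT)
    print(numeric_value(items["_cell_length_a"]))
```

The result is a set of well-defined numeric parameters, which illustrates the limitation noted on the slide: nothing in these items says what the compound is for or how it relates to wider chemistry.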
12.
- IUCr Acta Cryst 1992
- Limited set of keywords describing methods, properties and applications, compounds, attributes
- No established crystallography dictionary or controlled vocabulary to give chemistry context
13. What do we want to do?
- Support depositors in keyword/term assignment
- Facilitate and improve automated indexing
- Support advanced search / browse
- Allow metadata validation and enhancement
- Apply across a heterogeneous Federation
- Cross search, cross browse functionality
- Link data to all associated digital objects
- Develop domain semantics / vocabulary
- Use domain-specific authority files
- Mine to discover rather than find
- Achieve full inter-disciplinary integration
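One way to picture the term-assignment and validation goals listed on this slide is a lookup against a controlled vocabulary. The sketch below is only an illustration: the vocabulary, its synonyms and the `assign_terms` helper are hypothetical, since no established crystallography vocabulary is assumed to exist.

```python
# Hypothetical controlled vocabulary: preferred term -> known synonyms/acronyms
VOCABULARY = {
    "metal-organic framework": {"mof", "metal organic framework"},
    "polymorph": {"polymorphism", "polymorphic form"},
    "hydrogen bonding": {"h-bond", "hydrogen bond"},
}


def build_lookup(vocabulary):
    """Map every preferred term and synonym to its preferred term."""
    lookup = {}
    for preferred, synonyms in vocabulary.items():
        lookup[preferred] = preferred
        for synonym in synonyms:
            lookup[synonym] = preferred
    return lookup


def assign_terms(keywords, vocabulary=VOCABULARY):
    """Split depositor keywords into accepted preferred terms and unknowns."""
    lookup = build_lookup(vocabulary)
    accepted, unknown = [], []
    for keyword in keywords:
        # Normalise case and collapse runs of whitespace before matching.
        key = " ".join(keyword.lower().split())
        term = lookup.get(key)
        if term is None:
            unknown.append(keyword)   # candidate for curator review
        elif term not in accepted:
            accepted.append(term)     # de-duplicate on the preferred term
    return accepted, unknown


if __name__ == "__main__":
    print(assign_terms(["MOF", "Hydrogen  bond", "zeolite", "mof"]))
```

Mapping each repository's local keywords to the same preferred terms is what would let a cross-search treat "MOF" in one repository and "metal organic framework" in another as the same concept.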
14. Some (semantic) issues...
- How are terms assigned?
- Informal tags and/or structured KOS?
- How is a vocabulary curated and maintained?
- Can a vocabulary be transformed into an ontology (in the Semantic Web sense)?
- Disambiguation, acronyms, IUPAC names
- Persistent identification for data citation
- Granularity of data citation
- Data (and metadata) quality, provenance, validation
- Embedding within complex workflows
- Use collaborative social approaches?
- Community adoption: becomes part of the culture
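The questions about persistent identification and citation granularity can be made concrete with the DOI quoted on the repository slide. The sketch below splits a DOI into its registrant prefix and suffix. Reading the trailing path segment as an entry number is an assumption for illustration, and the helper names are hypothetical.

```python
RESOLVER = "http://dx.doi.org/"  # resolver form used in the slides


def parse_doi(doi):
    """Split a DOI into (prefix, suffix) at the first slash."""
    prefix, separator, suffix = doi.partition("/")
    if not (separator and prefix.startswith("10.") and suffix):
        raise ValueError(f"not a DOI: {doi!r}")
    return prefix, suffix


def entry_id(doi):
    """Return the last path segment of the suffix (assumed to be the entry number)."""
    _, suffix = parse_doi(doi)
    return suffix.rsplit("/", 1)[-1]


def resolver_url(doi):
    """Build a resolvable link after checking the DOI shape."""
    parse_doi(doi)
    return RESOLVER + doi


if __name__ == "__main__":
    doi = "10.1594/ecrystals.chem.soton.ac.uk/145"
    print(parse_doi(doi), entry_id(doi), resolver_url(doi))
```

Under this reading the identifier points at a whole repository entry; citing an individual dataset or file within the entry would need a finer-grained identifier scheme.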
15. Questions?
Slides will be available at http://www.ukoln.ac.uk/ukoln/staff/e.j.lyon/pres