Title: Metadata for digital long-term preservation
1. Metadata for digital long-term preservation
- Michael Day, Digital Curation Centre, UKOLN, University of Bath
- m.day_at_ukoln.ac.uk
- MPG eScience Seminar 2008: Aspects of long-term archiving, GWDG Göttingen, 19-20 June 2008
2. Presentation outline
- Some definitions
- An abstract approach: OAIS
- A framework for practical implementation: the PREMIS Data Dictionary
- Some open questions for e-research data
3. Definitions (1)
- Metadata
- A relatively new term that is used to describe a very old concept
- We primarily need to think about the different functions it enables, e.g. discovery and access management, the management of resources, long-term preservation, etc.
4. Definitions (2)
- Preservation metadata
- ... the information a repository uses to support the digital preservation process (PREMIS Data Dictionary)
- Potentially very wide scope
- Technical information on data structures or formats
- Information to help better understand the content
- Information on contexts and provenance
- Information on preservation processes
5. Definitions (3)
- Metadata for research data
- Metadata are fundamentally important to the continued understanding and exploitation of research data
- It is impossible to conduct a correct analysis of a data set without knowing how the data was cleaned, calibrated, what parameters were used in the process (Deelman et al., 2004)
- In some cases, extremely detailed documentation will be required
- Captured from various stages of the lifecycle
6. The OAIS Information Model (1)
- General OAIS background
- An ISO standard (ISO 14721:2003)
- Development led by the Consultative Committee for Space Data Systems
- Provides standard terminology and defines two interrelated models (functional model, information model)
7. The OAIS Information Model (2)
- Some general principles
- OAIS entities (Data Objects and Content Information) are conceptually bound together with information that provides additional meaning
- There are two main classes of this information
- Representation Information
- Preservation Description Information
8. The OAIS Information Model (3)
- Representation Information
- Is tightly bound with the Data Object
- Provides a bridge between the bit-level information being stored in an OAIS and something that can be understood
- Describing data structure concepts, or formats (Structure Information)
- Providing additional information on semantics (Semantic Information)
9. The OAIS Information Model (4)
- Preservation Description Information
- The additional information needed to make the Content Information meaningful for the indefinite long-term (p. 4-33)
- For example, the information needed to preserve the Content Information, to ensure that it is clearly identified, and to understand the environment in which the Content Information was created (p. 2-6)
- Reference, Context, Provenance, Fixity
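As a rough illustration of how the Data Object, Representation Information and Preservation Description Information might be bound together, here is a minimal Python sketch. The class names, field names and the choice of SHA-256 for Fixity are assumptions made for this example; OAIS itself is an abstract model and prescribes no particular encoding or algorithm.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class RepresentationInformation:
    structure: str  # Structure Information, e.g. the file format
    semantics: str  # Semantic Information, e.g. what a column means


@dataclass
class PreservationDescriptionInformation:
    reference: str   # identifier for the Content Information
    context: str     # why and where the content was created
    provenance: str  # history of the content
    fixity: str      # digest of the Data Object (SHA-256 is an assumption)


@dataclass
class InformationPackage:
    data_object: bytes
    representation: RepresentationInformation
    pdi: PreservationDescriptionInformation

    def fixity_ok(self) -> bool:
        """Recompute the digest and compare it with the stored Fixity value."""
        return hashlib.sha256(self.data_object).hexdigest() == self.pdi.fixity


def make_package(data, structure, semantics, reference, context, provenance):
    """Bundle a Data Object with its two classes of descriptive information."""
    return InformationPackage(
        data_object=data,
        representation=RepresentationInformation(structure, semantics),
        pdi=PreservationDescriptionInformation(
            reference=reference,
            context=context,
            provenance=provenance,
            fixity=hashlib.sha256(data).hexdigest(),
        ),
    )
```

The point of the sketch is the coupling: the bits are never stored without the information needed to interpret them and to check that they are unchanged.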
10. The OAIS Information Model (5)
- Lessons from OAIS (1)
- Data objects (and content) need to be closely coupled with additional layers of information (metadata) that will help provide meaning and context, etc.
- These layers broadly reflect the main characteristics of digital information (physical, logical, intellectual)
- Produces self-documenting objects
11. The OAIS Information Model (6)
- Lessons from OAIS (2)
- It highlights the importance of preserving context and provenance (but these are quite vaguely defined)
- OAIS works on an abstract level, but there is a need to think about what needs to be done in practical terms to develop preservation metadata schemata ...
12. PREMIS Data Dictionary (1)
- Background (1)
- PREMIS Working Group (2003-2005)
- An attempt to develop something that would be implementable
- Development informed by OAIS model
- Built upon several initiatives that had been developing preservation metadata schemas and frameworks prior to 2003
- Data Dictionary first published in May 2005; v. 2.0 in March 2008
13. PREMIS Data Dictionary (2)
- Background (2)
- PREMIS Maintenance Activity set up by Library of Congress
- PREMIS Implementers Group (open discussion list)
- Recent revision of PREMIS takes account of the experiences of implementers
14. PREMIS Data Dictionary (3)
- What PREMIS aims to do
- The Data Dictionary is specifically focused on defining the core metadata needed for long-term preservation
- ... the information a repository uses to support the digital preservation process
- Related to a series of verbs
- ... functions to maintain viability, renderability, understandability, authenticity, and identity in a preservation context
- Based on a data model
15. PREMIS Data Dictionary (4)
- PREMIS Data Model
- Recognises that digital preservation is as much about describing processes as objects
- Five entities
- Intellectual Entities
- Objects
- Events
- Agents
- Rights
16. PREMIS Data Dictionary (5)
[Figure: PREMIS 2.0 Data Model, showing the five entities: Intellectual Entities, Objects, Events, Agents, Rights]
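The five entities can be pictured as linked record types. The sketch below is only an illustration: the Python class and field names are invented for this example and are not the semantic unit names defined in the Data Dictionary. It shows Events pointing at both Objects and Agents, which is how a process gets recorded alongside the object it acted on.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IntellectualEntity:
    identifier: str
    title: str


@dataclass
class PreservationObject:
    identifier: str
    intellectual_entity_id: str  # link back to the Intellectual Entity
    format_name: str


@dataclass
class Agent:
    identifier: str
    name: str  # a person, organisation or piece of software


@dataclass
class Event:
    identifier: str
    event_type: str       # e.g. "ingestion", "fixity check", "migration"
    date_time: str        # ISO 8601 timestamp, stored as text here
    outcome: str
    object_ids: List[str]  # Objects the event acted on
    agent_ids: List[str]   # Agents that carried it out


@dataclass
class RightsStatement:
    identifier: str
    basis: str
    object_ids: List[str]
    agent_ids: List[str]


def events_for(object_id: str, events: List[Event]) -> List[Event]:
    """Return the events recorded against one Object, oldest first."""
    linked = [e for e in events if object_id in e.object_ids]
    return sorted(linked, key=lambda e: e.date_time)
```

Calling `events_for` on an Object identifier gives its preservation history, which is the kind of process description the model is meant to support.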
17. PREMIS Data Dictionary (6)
- PREMIS usage (1)
- Survey undertaken for PREMIS Maintenance Activity (2007)
- 16 repositories and projects surveyed (mostly dealing with documents rather than data)
- Survey noted much diversity in the way PREMIS had been implemented
- Tools were being used to capture technical metadata automatically
- Formats could be identified using tools like JHOVE and PRONOM DROID
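To give a feel for what automatic format identification involves, here is a deliberately tiny Python sketch that matches leading bytes against a few well-known file signatures. It is not the JHOVE or DROID API and does not reproduce how either tool works internally; the signature table and the function name `identify_format` are assumptions made for this example, and real tools cover far more formats and versions.

```python
# A handful of widely documented leading-byte signatures ("magic numbers").
# The table is illustrative and far from complete.
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"PK\x03\x04", "application/zip"),
]


def identify_format(data: bytes) -> str:
    """Guess a MIME type from the first bytes of a file's content."""
    for magic, mime_type in SIGNATURES:
        if data.startswith(magic):
            return mime_type
    # Unknown content falls back to the generic binary type.
    return "application/octet-stream"


def identify_file(path: str) -> str:
    """Read only the first few bytes of a file and identify its format."""
    with open(path, "rb") as handle:
        return identify_format(handle.read(16))
```

The result of such a check is the sort of technical metadata that can be captured without human effort and stored with the object.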
18. PREMIS Data Dictionary (7)
- PREMIS usage (2)
- No major eScience input into PREMIS
- PREMIS is occasionally used to help inform the preservation of research data
- The National Snow and Ice Data Center has used PREMIS as a way of evaluating its own OAIS-inspired metadata schema
- The Stanford Digital Repository has experimented with using PREMIS for geospatial resources
- Experiments with the Yale Social Science Data Archive
19. PREMIS Data Dictionary (8)
- Lessons from PREMIS
- The Data Model demonstrates the importance of recording the contexts of preservation (events, agents), not just metadata on the objects
- Currently little used in the e-research domain, but it has some potential where structured metadata already exists in some form (e.g., CSDGM, DDI)
20. Implications for e-research (1)
- The role of standards
- The development of standards (e.g. PREMIS) assumes that there is some level of commonality between domains
- However, generic solutions are not really feasible for e-research data because of the diversity and complexity of:
- Research data (content)
- Research contexts
- Stakeholders
21. Diversity and complexity (1)
- Diversity of content (1)
- Research data is ... any information that can be stored in digital form, including text, numbers, images, video or movies, audio, software, algorithms, equations, animations, models, simulations, etc. (National Science Board, Long-lived digital data collections, 2005)
22. Diversity and complexity (2)
- Diversity of content (2)
- Research data is extremely diverse, not really a single category of material
- tabular data, images, GIS, etc.
- raw machine output vs. derived data
- varying levels of structure (XML, legacy formats, etc.)
- many different standards
- Research data is not homogeneous
- No one-size-fits-all approach possible
23. Diversity and complexity (3)
- There is an even wider range of social contexts in which data is used (and shared)
- DCC SCARP project has been exploring disciplinary factors in curation practice
- Practice even within single disciplines is very fragmented
- Case studies ongoing
- Big-science archives, medical and social sciences, architecture and engineering, biological images
24. Diversity and complexity (4)
- Major disciplinary differences
- Attitudes towards data sharing
- Some are very open, some cannot see the point
- Existence of data centre infrastructures
- In the UK, some centrally funded data centres, but not universal
- Where do institutions fit?
- The existence of standards
- Already present in social sciences (DDI), the geospatial domain (FGDC), and many others
25. Diversity and complexity (5)
- Diversity of stakeholders
- The many different actors that have an interest in data curation means that metadata requirements may differ
- Dealing with data (2007): Scientist, Institution, Data centre, User, Funder, Publisher
- Long-lived data collections (2005): Data authors, Data managers, Data scientists, Data users, Funding agencies
26. Implications for e-research (2)
- Metadata for digital curation or for long-term preservation?
- The concept of digital curation focuses on reuse and adding value; long-term preservation is not always the aim
- PREMIS metadata is focused on particular things (viability, renderability, understandability, authenticity and identity)
- What metadata do we need for digital curation? Could this ever be generic?
27. Implications for e-research (3)
- Metadata can be difficult to identify
- Difficult sometimes to work out where data ends and metadata begins
- Depends on the point of view of the researcher
28. Implications for e-research (4)
- Lifecycle view
- Metadata has to be captured at multiple places in the scientific workflow
- Need to capture
- Processes (can be driven by instrumentation)
- Provenance
- Context
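One way to picture capture at multiple points in a workflow is to wrap each processing step so that it leaves behind a record of what was done, with which parameters, to which input. The Python sketch below is a minimal illustration under stated assumptions: the helper names (`run_step`, `digest`), the record fields and the two example steps are all hypothetical, and a real workflow system would record far more.

```python
import hashlib
import json
from datetime import datetime, timezone


def digest(values):
    """Digest of a JSON-serialisable value, used to tie outputs to inputs."""
    encoded = json.dumps(values, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def run_step(name, func, data, log, **params):
    """Run one processing step and append a provenance record to the log."""
    result = func(data, **params)
    log.append({
        "step": name,
        "parameters": params,
        "input_sha256": digest(data),
        "output_sha256": digest(result),
        "finished_at": datetime.now(timezone.utc).isoformat(),
    })
    return result


def clean(values, minimum):
    """Hypothetical cleaning step: drop readings below a threshold."""
    return [v for v in values if v >= minimum]


def calibrate(values, offset):
    """Hypothetical calibration step: apply a constant offset."""
    return [round(v + offset, 6) for v in values]


# Example usage: two steps leave two records in the log.
provenance_log = []
raw_readings = [1.0, -5.0, 2.5]
cleaned = run_step("clean", clean, raw_readings, provenance_log, minimum=0.0)
calibrated = run_step("calibrate", calibrate, cleaned, provenance_log, offset=0.5)
```

Because each record stores the digest of its input and output, the chain from raw readings to derived data can be followed step by step later on.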
29. Implications for e-research (5)
- Big science, little science
- Big science is by its nature data driven, and will often develop appropriate frameworks for its management and reuse (data centres, data grids)
- Other scientific domains (e.g., ecology, biodiversity, chemistry) are moving in the same direction, but data retain a high level of diversity and complexity
30. Summing-up
- The OAIS Information Model provides an abstract framework for thinking about preservation metadata
- PREMIS provides an implementation framework that is beginning to be adopted in some domains
- There are still many unresolved questions when it comes to defining metadata for research data
31. Acknowledgements