Title: What will you need to know
1What will you need to know?
- The role of metadata in keeping digital content
alive
Robin Wendler, Harvard University Library
November 2, 2005
r_wendler_at_harvard.edu
The Crystal Ball. John William Waterhouse.
Private collection
2Let me count the ways digital stuff goes bad
- Media become obsolete
- Media decay
- Formats are superseded
- Proprietary formats may be orphaned
- Hardware breaks
- Software is orphaned
- Encryption may hinder preservation
- User requirements change
3How will you know?
- Preservation Planning
- Monitor your data through metadata for
- Integrity
- Renderability
- Understandability
- Authenticity
- Identity
- Responsibility
- Monitor the community
- Format support
- Requirements
4What will you do?
- Identify materials at risk
- Analyze options
- Categorize objects
- Formal characteristics
- Purpose
- Antecedents
- Communicate with owners
- Perform preservation actions
- Create audit trail
All utilize and/or generate metadata
5Gradual understanding
- OAIS
- (1st workshop 1995; Blue Book 2002)
- http://www.ccsds.org/docu/dscgi/ds.py/Get/File-143/650x0b1.pdf
- NLA PANDORA (1996-)
- http://pandora.nla.gov.au/index.html
- CEDARS (1998-2002)
- http://www.leeds.ac.uk/cedars/index.htm
- NEDLIB (1998-2000)
- http://www.kb.nl/coop/nedlib/
- OCLC/RLG Preservation Framework Working Group (2000-2001)
- http://www.oclc.org/research/projects/pmwg/wg1.htm
- PREMIS (2003-2005)
- http://www.oclc.org/research/projects/pmwg/
6Preservation Metadata
- The information necessary to carry out, document and evaluate the processes that support the long-term retention and accessibility of digital content.
- Moving digital objects and their metadata across space and time requires standard mechanisms for encoding and exchange
- Brian Lavoie
- Viewed from a preservation lens, all metadata is preservation metadata
- Categories of metadata overlap: a single piece of metadata can serve many purposes
7OAIS Functional Model
Archival Information Systems are permeated by
metadata. Metadata is the difference between a
repository and just files on a disk.
8OAIS Information Model
9OAIS Content Information Framework, Expanded by
OCLC/RLG WG
OAIS Model
OCLC/RLG Extensions
Still a framework, not a set of usable, defined elements
10OAIS Preservation Description Information
Framework
- Reference: provides identifiers and describes mechanisms by which IDs are assigned
- Context: documents relationships of content to its environment (why created, other formats, editions)
- Provenance: documents the history, changes, custody of content
- Fixity: documents data integrity checks or validation and verification keys to ensure no unauthorized changes
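As a rough illustration of the Fixity idea above, the following minimal sketch records a message digest at ingest and later checks content against it. The function names and the choice of SHA-256 are assumptions for illustration, not something the OAIS framework prescribes.

```python
import hashlib


def compute_digest(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hex message digest of data using a hashlib algorithm name."""
    return hashlib.new(algorithm, data).hexdigest()


def check_fixity(data: bytes, recorded_digest: str, algorithm: str = "sha256") -> bool:
    """Return True if the content still matches the digest recorded earlier."""
    return compute_digest(data, algorithm) == recorded_digest.lower()


# Hypothetical content; the digest would be stored as Fixity information at ingest.
original = b"digital content"
recorded = compute_digest(original)

print(check_fixity(original, recorded))          # content unchanged
print(check_fixity(original + b"!", recorded))   # content altered
```

A digest only detects change; deciding whether a change was authorized still depends on the provenance record.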
11Metadata relevant to preservation
- Storage management and fixity
- Technical characteristics
- Structure
- Provenance
- Rights
- Digital signature trail, where applicable
- Intellectual access / description
12PREMIS: Preservation Metadata Implementation Strategies
- Surveyed implementation of digital repositories, assessed adoption of metadata standards (2003/2004)
- Defined a core set of implementable preservation metadata elements (2005)
- Implementation-independent
- Explicit or implicit
- Not reinventing the wheel
- Descriptive, rights, agents
- Privilege automatically-suppliable values
- Defined associated XML schemas
- Set up ongoing maintenance activity
- http://www.loc.gov/standards/premis/
13PREMIS Data Model
Intellectual Entities
Rights
Objects
Agents
Events
14Importance of object modeling
- Metadata must adhere to the right thing
- Representation
- The set of files, including structural metadata, needed for a complete and reasonable rendition of an Intellectual Entity.
- File
- A named and ordered sequence of bytes that is known by an operating system.
- Bitstream
- Contiguous or non-contiguous data within a file that has meaningful common properties for preservation purposes.
Any can express an Intellectual Entity
All are kinds of Objects in PREMIS
All can be affected by Events
Rights adhere to all
15Sample PREMIS semantic unit
16Core Object Metadata (Yes, this is to make you sweat)
- objectIdentifier
- preservationLevel
- objectCategory
- objectCharacteristics
- compositionLevel
- fixity
- messageDigestAlgorithm
- messageDigest
- messageDigestOriginator
- size
- format
- formatDesignation
- formatName
- formatVersion
- formatRegistry
- formatRegistryName
- formatRegistryKey
- formatRegistryRole
- significantProperties
- storage
- contentLocation
- storageMedium
- environment
- environmentCharacteristic
- environmentPurpose
- environmentNote
- dependency
- dependencyName
- dependencyIdentifier
- software
- swName
- swVersion
- swType
- swOtherInformation
- swDependency
- hardware
- hwName
- hwType
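The flat list above loses the nesting of the semantic units. The sketch below shows one plausible nesting for a small subset of them, filled in automatically from a file's bytes. The grouping (fixity, size, and format under objectCharacteristics), the identifier, and the values are assumptions for illustration; the PREMIS data dictionary is the authority on the actual structure.

```python
import hashlib
import json


def describe_object(identifier: str, data: bytes, format_name: str, format_version: str) -> dict:
    """Build a nested record for a subset of the object semantic units."""
    return {
        "objectIdentifier": identifier,
        "objectCategory": "file",
        "objectCharacteristics": {
            "compositionLevel": 0,
            "fixity": {
                "messageDigestAlgorithm": "SHA-256",
                "messageDigest": hashlib.sha256(data).hexdigest(),
                "messageDigestOriginator": "repository",  # hypothetical value
            },
            "size": len(data),  # size in bytes
            "format": {
                "formatDesignation": {
                    "formatName": format_name,
                    "formatVersion": format_version,
                },
            },
        },
    }


# Hypothetical identifier and content, for illustration only.
record = describe_object("example:0001", b"GIF89a...", "image/gif", "89a")
print(json.dumps(record, indent=2))
```

Everything in this record is derived by machine from the bytes or supplied once per batch, which is the kind of value the deck argues should be privileged.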
17Significant Properties
- Objective technical characteristics subjectively considered important, or subjectively determined characteristics.
- Requires identification in advance of what's crucial, what might be at risk, and how to codify it.
Mondrian. Composition with large red plane, yellow, black, gray and blue. 1921. Haags Gemeentemuseum, The Hague.
Monet. Waterloo Bridge, London, at Sunset, 1904. Collection of Mr. and Mrs. Paul Mellon. National Gallery of Art.
18Rights
- Different flavors
- Rights
- Permissions
- Licenses
- Submission Agreements
- Multiple rights languages
- XrML (eXtensible rights Markup Language)
- http://www.xrml.org/
- ODRL (Open Digital Rights Language)
- http://odrl.net/
- Designed to support DRM
- Complex
- Patent/licensing issues
- PREMIS Rights
- Lightweight
- Focused on right to preserve
- Statements, rather than DRM
19PREMIS Permission Statement
- permissionStatementIdentifier
- linkingObject
- grantingAgent
- grantingAgreement
- permissionGranted
- act
- restriction
- termOfGrant
- startDate
- endDate
- permissionNote
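To show how a lightweight permission statement might be consulted before a preservation action, here is a minimal sketch. The class layout, the assumption that act, restriction, and termOfGrant (startDate, endDate) belong to each granted permission, and all sample values are hypothetical; this is a statement lookup, not DRM enforcement.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class PermissionGranted:
    act: str
    restriction: str = ""
    start_date: Optional[date] = None  # termOfGrant startDate
    end_date: Optional[date] = None    # termOfGrant endDate; None means open-ended


@dataclass
class PermissionStatement:
    identifier: str
    linking_object: str
    granting_agent: str
    granting_agreement: str
    permissions: list = field(default_factory=list)
    note: str = ""

    def allows(self, act: str, on: date) -> bool:
        """Return True if some granted permission covers the act on the given date."""
        for p in self.permissions:
            if p.act != act:
                continue
            if p.start_date is not None and on < p.start_date:
                continue
            if p.end_date is not None and on > p.end_date:
                continue
            return True
        return False


# Hypothetical statement for illustration.
statement = PermissionStatement(
    identifier="perm-1",
    linking_object="example:0001",
    granting_agent="depositor",
    granting_agreement="submission agreement",
    permissions=[PermissionGranted(act="migrate", start_date=date(2005, 1, 1))],
)
print(statement.allows("migrate", date(2005, 11, 2)))
print(statement.allows("disseminate", date(2005, 11, 2)))
```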
20Event Metadata
- Events in the life of a digital object
- What was done
- Who did it
- When
- Who authorized it
- What was the outcome
- General
- PREMIS Events
- Specific, e.g.
- AES Process History
21PREMIS Events
- Must be related to one or more objects
- Can be related to one or more agents
- Consist of
- eventIdentifier
- eventType
- eventDateTime
- eventDetail
- eventOutcomeInformation
- linkingAgentIdentifier
- linkingObjectIdentifier
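The two rules above (an event must relate to at least one object and may relate to agents) can be sketched as a small record type. The Python field names mirror the semantic units listed on this slide, while the validation approach and sample values are assumptions for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Event:
    event_identifier: str
    event_type: str
    event_date_time: datetime
    linking_object_identifiers: list          # required: one or more objects
    event_detail: str = ""
    event_outcome_information: str = ""
    linking_agent_identifiers: list = field(default_factory=list)  # optional

    def __post_init__(self):
        # Enforce "must be related to one or more objects".
        if not self.linking_object_identifiers:
            raise ValueError("an event must be related to at least one object")


# Hypothetical audit-trail entry for illustration.
event = Event(
    event_identifier="event-1",
    event_type="fixity check",
    event_date_time=datetime(2005, 11, 2, 12, 0, tzinfo=timezone.utc),
    linking_object_identifiers=["example:0001"],
    event_outcome_information="digest matched",
    linking_agent_identifiers=["agent:repository-software"],
)
print(event)
```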
22Beyond PREMIS
- Format-specific technical metadata
- Detailed event metadata
- Structural metadata / content packaging
- Specific descriptive metadata
23Technical Metadata
- Formally characterizes
- a class of objects
- an individual object
- Some technical metadata applies to all formats; most is specific to a category of formats, e.g.
- NISO Z39.87 Technical Metadata for Still Images
- http://www.niso.org/standards/resources/Z39_87_trial_use.pdf
- MIX (XML schema for Z39.87) http://www.loc.gov/standards/mix/
- Audio Engineering Society Core Technical Metadata for Audio (in draft)
- TextMD
- http://dlib.nyu.edu/METS/textmd.xsd
24Structural Metadata
- Not only content, but also metadata and binding must be preserved
- Enables a complex object to be assembled from its constituent parts
- Content, Metadata, Relationships, Behaviors
25Structural and Packaging Metadata
- Many formats developed in different communities, e.g.,
- Digital library: METS
- http://www.loc.gov/standards/mets/
- Commercial media: MPEG-21 DIDL
- Available from ISO: www.iso.org
- Learning objects: IMS Content Packaging
- http://www.imsglobal.org/content/packaging/
- Space data: XFDU (still in draft)
- http://www.ccsds.org/docu/dscgi/ds.py/GetRepr/File-1912/html
- Audio-visual: Advanced Authoring Format (AAF)
- http://www.aafassociation.org/html/techinfo/index.html
- Television: Material Exchange Format (MXF)
- Available from SMPTE: www.smpte.org
- No consolidation of formats, but dialog and mapping
26METS Basics
- METS provides a framework for
- Content files
- Metadata
- Descriptive
- Structural
- Technical
- Provenance
- Source
- Relationships
- Behaviors
- Suitable for
- Open Archival Information Systems
- Archival information package (AIP)
- Submission information package (SIP)
- Dissemination information package (DIP)
- Display and navigation of digital objects
- Sharing of digital objects among libraries and
archives
27RLG's METS Viewer
Structural Metadata
Descriptive Metadata
Behaviors
Content
28Structure of a METS File
METS
metsHdr: Header describing METS file itself
fileSec: Inventory or manifest of component files
dmdSec: Descriptive metadata
amdSec: Administrative metadata (technical, source, rights, provenance)
structMap: Structure map, the heart of METS
structLink: Structural map linking, i.e., hyperlinks
behaviorSec: Executable behaviors
Less commonly used
29Structure Map
[Figure: METS structMap XML example. Nested div elements carry TYPE, ORDER, ORDERLABEL, LABEL, and FILEID attributes and describe a book: Title page; Preface (page i, page ii); Chapter 1 (page 1, page 2).]
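The book hierarchy on this slide (title page, preface with pages i and ii, Chapter 1 with pages 1 and 2) can be sketched as a METS structMap built with Python's standard library. The FILEID values, the ORDER numbering, and the TYPE values below are hypothetical and do not reproduce the slide's original markup.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def add_div(parent, label, order, file_id=None, div_type=None):
    """Append a mets:div, optionally with a mets:fptr pointing at a file."""
    attrs = {"LABEL": label, "ORDER": str(order)}
    if div_type:
        attrs["TYPE"] = div_type
    div = ET.SubElement(parent, f"{{{METS_NS}}}div", attrs)
    if file_id:
        ET.SubElement(div, f"{{{METS_NS}}}fptr", {"FILEID": file_id})
    return div


struct_map = ET.Element(f"{{{METS_NS}}}structMap", {"TYPE": "physical"})
book = add_div(struct_map, "Book", 1, div_type="book")
add_div(book, "Title page", 1, "FILE_A")       # hypothetical file IDs throughout
preface = add_div(book, "Preface", 2)
add_div(preface, "page i", 1, "FILE_B")
add_div(preface, "page ii", 2, "FILE_C")
chapter = add_div(book, "Chapter 1", 3)
add_div(chapter, "page 1", 1, "FILE_D")
add_div(chapter, "page 2", 2, "FILE_E")

print(ET.tostring(struct_map, encoding="unicode"))
```

Each FILEID would refer to a file entry in the fileSec, which is how the structure map binds the logical hierarchy to the content files.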
30Referring to Metadata
METS does not define descriptive or administrative metadata elements. dmdSec and amdSec are buckets or sockets where externally-defined metadata can be supplied or referenced
- Three techniques
- In-line XML
- Wrapped base-64 encoded data
- Pointers to external information
- (e.g., URNs, handles)
METS Board endorses a range of recommended extension schemas
[Diagram: METS sections metsHdr, fileSec, dmdSec, amdSec, structMap, structLink, behaviorSec, with dmdSec and amdSec highlighted]
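The three techniques for supplying externally defined metadata can be sketched with the METS mdWrap (xmlData or binData) and mdRef elements. The helper names, the sample MODS title, the placeholder bytes, and the example.org URL below are hypothetical; only the METS, XLink, and MODS namespace URIs are real.

```python
import base64
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
XLINK = "http://www.w3.org/1999/xlink"
MODS = "http://www.loc.gov/mods/v3"
for prefix, uri in (("mets", METS), ("xlink", XLINK), ("mods", MODS)):
    ET.register_namespace(prefix, uri)


def dmd_inline(dmd_id, xml_element, md_type="MODS"):
    """Technique 1: in-line XML inside mdWrap/xmlData."""
    sec = ET.Element(f"{{{METS}}}dmdSec", {"ID": dmd_id})
    wrap = ET.SubElement(sec, f"{{{METS}}}mdWrap", {"MDTYPE": md_type})
    ET.SubElement(wrap, f"{{{METS}}}xmlData").append(xml_element)
    return sec


def dmd_base64(dmd_id, raw, md_type):
    """Technique 2: arbitrary bytes wrapped as base-64 in mdWrap/binData."""
    sec = ET.Element(f"{{{METS}}}dmdSec", {"ID": dmd_id})
    wrap = ET.SubElement(sec, f"{{{METS}}}mdWrap", {"MDTYPE": md_type})
    ET.SubElement(wrap, f"{{{METS}}}binData").text = base64.b64encode(raw).decode("ascii")
    return sec


def dmd_pointer(dmd_id, href, md_type):
    """Technique 3: a pointer to external metadata via mdRef."""
    sec = ET.Element(f"{{{METS}}}dmdSec", {"ID": dmd_id})
    ET.SubElement(sec, f"{{{METS}}}mdRef",
                  {"LOCTYPE": "URL", "MDTYPE": md_type, f"{{{XLINK}}}href": href})
    return sec


# Hypothetical sample metadata for illustration.
mods = ET.Element(f"{{{MODS}}}mods")
title_info = ET.SubElement(mods, f"{{{MODS}}}titleInfo")
ET.SubElement(title_info, f"{{{MODS}}}title").text = "Sample title"

inline = dmd_inline("D1", mods)
wrapped = dmd_base64("D2", b"placeholder MARC bytes", "MARC")
pointer = dmd_pointer("D3", "http://example.org/catalog/record-1", "MARC")

for section in (inline, wrapped, pointer):
    print(ET.tostring(section, encoding="unicode"))
```

Divisions in the structure map would then cite these sections by their ID values through the DMDID attribute.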
31Use of MODS Extension Schema for Descriptive
Metadata
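[Figure: METS example using the MODS extension schema. Structure map divisions (Book; Chapter 1; page 1; page 2) reference descriptive metadata through DMDID attributes (D1, CH1). A MODS record (namespace http://www.loc.gov/mods/v3) gives the name "Radcliffe College" and the title "Reports of the president and treasurer for...". A reference with MDTYPE MARC and an xlink:href (http://... BNI3165/) points to the catalog record.]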
32Where does all this metadata come from?
- Look, Ma, no hands! (as much as possible, that is)
- Don't make people create it
- Machines are faster, cheaper, more accurate
- Don't make people read it
- Use controlled values
- Expect bulk preservation of like objects
- Artisanal preservation is not affordable
- Develop and share tools to automate creation, ingest, extraction, exchange
33JHOVE: JSTOR/Harvard Object Validation Environment
- Format Identification
- Format Validation
- Well-formedness (Syntactical)
- Validity (Semantic)
- Format Characterization
- http://hul.harvard.edu/jhove/
- Modules for
- AIFF
- ASCII
- BYTESTREAM
- GIF
- HTML
- JPEG
- JPEG2000
- PDF
- TIFF
- UTF8
- WAVE
- XML
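To make the format identification step concrete, the toy sketch below matches leading bytes against a few well-known signatures. It is not JHOVE and does not reflect how JHOVE is implemented; the function name and the fallback to a generic BYTESTREAM label are assumptions, and validation and characterization are far more involved than this.

```python
# Leading-byte signatures for a few of the formats listed above.
SIGNATURES = [
    ("GIF", b"GIF87a"),
    ("GIF", b"GIF89a"),
    ("PDF", b"%PDF-"),
    ("JPEG", b"\xff\xd8\xff"),
    ("TIFF", b"II*\x00"),   # little-endian TIFF
    ("TIFF", b"MM\x00*"),   # big-endian TIFF
]


def identify(data: bytes) -> str:
    """Guess a format name from the first bytes of the content."""
    for name, magic in SIGNATURES:
        if data.startswith(magic):
            return name
    # RIFF and IFF containers carry their form type at byte offset 8.
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "WAVE"
    if data[:4] == b"FORM" and data[8:12] == b"AIFF":
        return "AIFF"
    return "BYTESTREAM"  # nothing more specific recognized


print(identify(b"GIF89a" + b"\x00" * 10))
print(identify(b"%PDF-1.4\n"))
print(identify(b"plain text"))
```

Identification only answers what a file claims to be; checking well-formedness and validity requires parsing the file against the format specification.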
34Automatic Exposure
- RLG initiative advocates for capturing standard technical metadata about digital images automatically as part of image creation
- Engage manufacturers in dialog about what technical metadata their products currently capture vs. what is required for digital archiving
- Leverage existing industry efforts
- Identify and evaluate tools for harvesting technical metadata and explore how those tools can scale to serve the entire community
35Format Registries
- Detailed documentation of how typed content is represented
- Persistent, unambiguous association between public identifiers for digital formats and their documentation
- Lists of systems and services which use or produce the format
- Must be inclusive, detailed, rigorous, public, and sustainable
- Format Registry projects
- PRONOM
- http://www.nationalarchives.gov.uk/pronom/
- Global Digital Format Registry
- http://hul.harvard.edu/gdfr/
- TOM
- http://tom.library.upenn.edu/
- FRED demonstration system
- http://tom.library.upenn.edu/fred/
36Other Registries (Extant and Posited)
- Registry of Digital Masters
- "I will preserve this digital thing"
- http://www.oclc.org/digitalpreservation/why/digitalregistry/default.htm
- Profile registries
- "I restrict this broader standard in the following ways"
- Metadata Element/Schema registries
- "I use these elements to mean these things"
- http://www.xml.org/xml/registry.jsp
- http://www.ukoln.ac.uk/projects/iemsr/
- Etc.
- Environment registries
- Hardware/software configurations in which given software is known to work
37Digital Information Community benefits from
metadata cooperation
- Develop common understanding
- Crucial metadata
- Standards!
- Trusted repository certification
- Acceptable preservation strategies
- Needs and costs
- Automate capture/creation of metadata
- Work with equipment manufacturers
- Develop open source tools
- Share burden
- Monitor/document digital formats
- Avoid duplicate digitization