Title: DATA DICTIONARY FOR DIGITAL PRESERVATION: PREMIS TUTORIAL
1 DATA DICTIONARY FOR DIGITAL PRESERVATION: PREMIS TUTORIAL
- Priscilla Caplan, FCLA
- Rebecca Guenther, Library of Congress
- Wolfson Medical Library
- University of Glasgow
- July 17-19
- Sponsored by the Digital Curation Centre
2 GOALS
- Establish definition and scope of PREMIS Data Dictionary
- Show how semantic units relate to the Data Model
- Introduce semantic units pertaining to Objects, Events, Rights and Agents
- Discuss major implementation issues
- Show ways of representing PREMIS in XML
3 OUTLINE
- Introduction: background and context
- PREMIS Data Model
- Semantic units pertaining to Objects
- Objects hands-on exercise
- Files, bitstreams and the onion model
- Semantic units pertaining to Agents, Rights, Events
- Events hands-on exercise
- Implementation issues
- PREMIS in XML: PREMIS schemas and METS
- XML hands-on exercise
- Implementers panel
- Maintenance activity and future plans
- Conclusion: Q&A
4 INTRODUCTION: BACKGROUND AND CONTEXT
5 OAIS
- Reference Model for an Open Archival Information System (OAIS). Consultative Committee for Space Data Systems, 2002
- A high level framework for preservation repositories, establishing a common model and vocabulary
- Defines
  - the functions an OAIS must perform
  - the information model an OAIS must employ
- In the information model, a Content Data Object is described by Representation Information (what you need to use and interpret the object) and Preservation Description Information (what you need to preserve and access the object)
6 OAIS Information Model and PREMIS DD
- The OAIS Reference Model is (arguably) the most widely adopted standard in the digital preservation community
- Important to orient PREMIS in an OAIS context
  - Relate elements from other preservation metadata schemas based on OAIS to PREMIS (e.g., NLA)
  - Enhance interoperability/applicability of preservation metadata registries, e.g., the Digital Curation Centre's Representation Information Repository
  - Allow repositories to think about OAIS conformance in a PREMIS context, and vice versa
- Many repositories use OAIS concepts and vocabulary to express information (content metadata) within their archiving systems
- How does PREMIS relate to these concepts and vocabulary?
7 OAIS Information Model Structure
[Diagram: an information package (SIP, AIP, or DIP), held together by Packaging Information and described by Descriptive Information, contains Content Information (Content Data Object and Representation Information) and Preservation Description Information (Provenance, Reference, Fixity, Context)]
8 PREMIS-to-OAIS Mapping
[Diagram: maps the PREMIS entities (Intellectual Entities, Objects, Rights, Events, Agents) to the OAIS information types (Fixity, Context, Reference, Provenance, Packaging, Description and Representation Information)]
9 Comments
- PREMIS Data Dictionary does not provide semantic units for Intellectual Entities
  - But it provides semantic units to link to other metadata sources for Intellectual Entities (e.g., a MARC record); these are the only semantic units categorized as descriptive information (metadata to aid discovery)
- All entities have reference (identification) information.
- No packaging information that links content with metadata. But PREMIS can be used with something like METS, which does provide packaging information.
- In short, PREMIS deals mostly with representation, context, provenance, and fixity information, in keeping with the PREMIS definition of preservation metadata.
10 Early work in preservation metadata
- Open Archival Information System (OAIS)
  - defined a basic abstract information model
- NLA, CEDARS and NEDLIB
  - developed preservation metadata schemes for their projects
- OCLC/RLG Preservation Metadata Framework Working Group, Preservation Metadata and the OAIS Information Model: A Metadata Framework to Support the Preservation of Digital Objects, 2001
  - unified earlier work within the OAIS framework
- National Library of New Zealand, 2002
  - organized metadata elements around a data model
- Preservation Metadata Implementation Strategies (PREMIS)
  - focused on practical implementation needs
11 From theory to practice
[Diagram labels: Preservation Metadata Requirements; Digital Archiving Systems; Framework; OAIS; PREMIS Data Dictionary]
12 PREMIS Working Group
- Objective: Define implementable, core preservation metadata, with recommendations for management and use
- Membership
  - 30 experts from 5 countries: libraries, museums, archives, government agencies, private sector
  - Co-Chairs: Priscilla Caplan (FCLA), Rebecca Guenther (LC)
- Data Dictionary for Preservation Metadata: Final Report of the PREMIS Working Group
  - PREMIS Data Dictionary 1.0
  - Accompanying report (scope, context, data model, special topics, glossary, examples)
  - XML schemas to support implementation
13 Some guiding principles and assumptions
- Implementable, core, preservation metadata
  - Preservation metadata: maintain viability, renderability, understandability, authenticity, identity in a preservation context
  - Core: what most preservation repositories need to know to preserve digital materials over the long term
  - Implementable: rigorously defined; supported by usage guidelines/recommendations; emphasis on automated workflows
- Implementation neutral
  - No assumptions on specific implementation
  - Promote flexibility/interoperability
  - Focus on semantic units (what you need to know; implementation-neutral) vs. metadata elements (how you record it; implementation-specific)
  - Information that needs to be recoverable from the digital archiving system, independent of local implementation
14 Uses and scope
- PREMIS can provide
  - Common data model for organizing/thinking about preservation metadata
  - Guidance for local implementations
  - Standard for exchanging information packages between repositories
- PREMIS is not designed to provide
  - Out-of-the-box solution: need to instantiate as metadata elements in the repository system
  - All needed metadata: excludes business rules, format-specific technical metadata, descriptive metadata for access, non-core preservation metadata
  - Lifecycle management of objects outside the repository
  - Rights management: limited to permissions to perform actions within the repository
15 An OAIS Perspective
Assumes stuff arrives in SIPs and is stored in
AIPs, and PREMIS is what the repository needs to
know to ingest, store and preserve it for the
future.
16 Community interest
- As of July 2006
  - 25,000 hits on Data Dictionary
  - More than 100 subscribers to the PREMIS Implementers Group (PIG) discussion list
  - Awarded the U.K. Digital Preservation Award for 2005 and the SAA Preservation Publication Award for 2006
- The PREMIS Data Dictionary is a product of collaboration and consensus
  - Digital preservation is a shared problem which invites shared solutions
  - Multiplicity of perspectives on the working group helps promote applicability in many contexts
- The Data Dictionary should be useful to any institution committed to the long-term preservation of digital materials
17 DATA MODEL
18 PREMIS data model
[Diagram: the five entities of the data model: Intellectual Entities, Objects, Events, Agents and Rights]
19 Intellectual Entity
- A coherent set of content that is reasonably described as a unit, for example, a particular book, map, photograph, or database.
- May include other Intellectual Entities (e.g. as a website includes a web page).
- May have one or more digital representations.
- Can reference an Object or be referenced by an Object, but is not described in PREMIS.
- Examples
  - Rabbit Run by John Updike (a book)
  - Maggie at the beach (a photograph)
  - The Library of Congress Website (a website)
  - The Library of Congress American Memory Home page (a web page)
20 Object
- A discrete unit of information in digital form.
- Objects are what the repository preserves.
  - FILE: a named and ordered sequence of bytes that is known by an operating system.
  - REPRESENTATION: the set of files, including structural metadata, needed for a complete and reasonable rendition of an Intellectual Entity.
  - BITSTREAM: contiguous or non-contiguous data within a file that has meaningful common properties for preservation purposes.
- Examples
  - chapter1.pdf (a PDF file)
  - chapter1.pdf, chapter2.pdf, chapter3.pdf (the PDF version of a book in 3 chapters)
  - an audio stream in uncompressed PCM (a bitstream within an AVI file)
  - a video stream in MJPEG (a bitstream within an AVI file)
21 OBJECTS: A photo in two formats
22 OBJECTS: A book in two versions
23 An important aside about objects
- A repository does not have to control objects at all levels. For example, it may not recognize representations or bitstreams.
- All PREMIS says is
  - if you do control representation objects, these are the semantic units that pertain to representations
  - if you do control file objects, these are the semantic units that pertain to files
  - if you do control bitstream objects, these are the semantic units that pertain to bitstreams
  - and you need to record the relationships among them.
24 Event
- An action that involves at least one object or agent known to the preservation repository.
- Who, what, how, when, and to which object.
- Necessary to document digital provenance. Can track the history of an object through the events in the object's life.
- Examples
  - A validation event verifying that chapter1.pdf is a good PDF file
  - An ingest event completing the process of creating an AIP for a SIP
  - A migration event creating a new version of an object in a more contemporary format
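As a rough illustration of tracking an object's history through its events, the following Python sketch stores a few event records and lists those that touch one object in date order. The `Event` class, its field names, the dates and the outcome strings are all assumptions made for this sketch; they are not PREMIS semantic unit names.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Event:
    """Hypothetical event record: what happened, when, to what, by whom."""
    event_id: str
    event_type: str              # e.g. "validation", "ingest", "migration"
    date_time: datetime
    outcome: str
    object_ids: list = field(default_factory=list)
    agent_ids: list = field(default_factory=list)


def history(events, object_id):
    """Return the events that involve one object, oldest first."""
    related = [e for e in events if object_id in e.object_ids]
    return sorted(related, key=lambda e: e.date_time)


# Illustrative data only: the dates and outcomes are invented.
events = [
    Event("E2", "migration", datetime(2006, 7, 18, tzinfo=timezone.utc),
          "success", ["chapter1.pdf"], ["repository staff"]),
    Event("E1", "validation", datetime(2006, 7, 17, tzinfo=timezone.utc),
          "valid", ["chapter1.pdf"], ["JHOVE version 1.0"]),
]

for e in history(events, "chapter1.pdf"):
    print(e.date_time.date(), e.event_type, e.outcome)
```

Keeping the object and agent identifiers on the event, rather than copying event details onto each object, is one possible design; it mirrors the idea that an event links to the objects and agents it involves.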
25 Agent
- A person, organization, or software program associated with preservation events in the life of an object.
- Not defined in detail in PREMIS; not considered core preservation metadata beyond identification
- Examples
  - Evan Owens (a person)
  - Bank of Scotland (an organization)
  - Bank of Scotland, Computer Systems Department (an organization)
  - JHOVE version 1.0 (a software program)
26 Rights
- An agreement with a rightsholder that allows a repository to take action(s) related to objects in the repository.
- Not a full rights expression language.
- Assumption: the repository is the grantee.
- Basic statement is: Agent A grants Permission P for Object B.
- Example
  - The Bank of Scotland gives the repository permission to make an unlimited number of copies of chapter1.pdf under its Agreement with the repository signed December 11, 2006.
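The basic statement "Agent A grants Permission P for Object B" can be sketched as a small record plus a lookup. This is only an illustration: `PermissionStatement`, `is_permitted` and the action labels "copy" and "migrate" are hypothetical names, not PREMIS vocabulary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PermissionStatement:
    """Hypothetical record: an agent grants a permission for an object."""
    granting_agent: str   # Agent A
    permission: str       # Permission P (action the repository may take)
    object_id: str        # Object B


def is_permitted(statements, action, object_id):
    """True if any statement grants `action` on `object_id`."""
    return any(
        s.permission == action and s.object_id == object_id
        for s in statements
    )


# The repository is assumed to be the grantee, so it is not recorded.
statements = [
    PermissionStatement("Bank of Scotland", "copy", "chapter1.pdf"),
]

print(is_permitted(statements, "copy", "chapter1.pdf"))     # True
print(is_permitted(statements, "migrate", "chapter1.pdf"))  # False
```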
27 Example: A thesis in two parts, PDF and MOV
[Diagram: the Intellectual Entity "My Thesis" linked to a Representation Object made up of two File Objects, one PDF and one MOV]
- There are 3 objects: a REPRESENTATION and 2 FILEs. There could also be BITSTREAM objects embedded within the FILE.
- There may be several events associated with ingest, such as validation, fixity checking and writing to storage.
- Agents may be associated with each event, e.g. the JHOVE program and/or the repository staff performed the validation
- The archive has permission to copy the objects to storage
- The intellectual entity is the thesis. It has an author, title and other bibliographic information that might be given in a catalog record. Only the identifier is defined in PREMIS.
28 Identifiers
- Instances of objects, events, agents and rights statements are uniquely identified by Identifiers
- Syntax
  - Identifier
    - IdentifierType: domain in which the value is unique
    - IdentifierValue: the identifier string itself
- Example
  - ObjectIdentifier
    - ObjectIdentifierType: DRS
    - ObjectIdentifierValue: http://nrs.harvard.edu/urn-3FHCL.Loebsa1
- Example
  - EventIdentifier
    - EventIdentifierType: DRS
    - EventIdentifierValue: 716593
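The type/value pairing can be sketched in Python as follows. This is a minimal illustration rather than anything defined by PREMIS: the `Identifier` class, the `registry` dictionary and the "LocalID" type are hypothetical, and the values are those shown on the slide (with the "://" of the URL restored).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identifier:
    """An identifier is a type (the domain) plus a value."""
    type: str   # domain within which the value is unique, e.g. "DRS"
    value: str  # the identifier string itself


object_id = Identifier("DRS", "http://nrs.harvard.edu/urn-3FHCL.Loebsa1")
event_id = Identifier("DRS", "716593")

# Hypothetical in-memory registry keyed by the full (type, value) pair.
registry = {object_id: "object", event_id: "event"}

# The same value under a different type is a different identifier,
# because uniqueness is only promised inside one type's domain.
other = Identifier("LocalID", "716593")

print(event_id in registry)  # True
print(other in registry)     # False
```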
29 Identifiers (2)
- The identifier type should tell you how to build the value and who the naming authority is
- In this example the object identifier type is DRS, indicating Harvard's Digital Repository Service, not URL. The identifier is unique in both domains, but DRS tells you more.
- If all identifiers are local to the repository system, it is unlikely that the identifier type would be recorded for each identifier in the system itself, but it should be supplied when exchanging data with others
- Identifiers can be created inside or outside of the repository.
30 The PREMIS Data Dictionary
31 Data dictionary descriptions
32 Sample data dictionary entry: container
33 Sample data dictionary entry: container within container
34 Sample data dictionary entry: semantic unit
35 SEMANTIC UNITS PERTAINING TO OBJECTS
36 Object entity
- Aggregates characteristics relevant to preservation management that are properties of the object
- Semantic units may not all be applicable to each type of object (representation, file, bitstream)
- Main types of information
  - identifier
  - object characteristics
  - creation information
  - software and hardware environment
  - digital signatures
  - relationships to other objects
  - links to other types of entity
37 preservationLevel and objectCategory
- objectCategory (mandatory)
  - Values: representation, file, bitstream
- preservationLevel
  - What preservation treatment/strategy the repository plans for this object
  - Varying preservation options dependent on factors such as value, uniqueness, preservability of format
  - A business rule only relevant in a given repository
  - Examples: full, bit-level
  - Mandatory, but may not be explicitly recorded if repository offers only one level
38 Creation information
- creatingApplication
  - Information about the application which created the object
  - Useful for later problem solving
  - Container with 3 subunits: name, version, date
  - Applies to objects created externally or by the repository, e.g. by a migration event
  - Repeatable if more than one application processed it
  - Example: MS Word 2000, date created
- originalName
  - Name of object as submitted to or harvested by repository
  - Supplements repository-supplied names
  - Only applicable to files
  - Example: sip/book/N419.pdf
39 storage
- How and where the object is stored
- Container for contentLocation and storageMedium
- May be repeated if more than one identical copy in a different location
- contentLocation
  - Information needed to retrieve a file from a system or a bitstream from within a file
  - Subunits: type and value
  - Could be a fully qualified path or an identifier used by the storage system; for a bitstream, a byte offset
- storageMedium
  - Physical medium on which the object is stored
  - Useful for media management (e.g. media migration)
  - May be the name of a system that knows the medium
  - Examples: hard disk, TSM
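To make the file-versus-bitstream distinction in contentLocation concrete, here is a small Python sketch: a file is retrieved by path, and a bitstream is retrieved from within that file by byte offset. The type labels "path" and "byte offset", the extra "length" key and the stand-in file contents are assumptions made for this sketch; PREMIS itself only defines the type and value subunits.

```python
import os
import tempfile


def read_bitstream(path, offset, length):
    """Read `length` bytes starting at byte `offset` of the file at `path`."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Build a small stand-in file: an 8-byte header, an 11-byte embedded
# stream, and a trailer.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"HEADER--" + b"audio-bytes" + b"--TRAILER")
    file_path = tmp.name

# Hypothetical location records for the file and for one bitstream in it.
file_location = {"type": "path", "value": file_path}
bitstream_location = {"type": "byte offset", "value": 8, "length": 11}

data = read_bitstream(
    file_location["value"],
    bitstream_location["value"],
    bitstream_location["length"],
)
print(data)  # b'audio-bytes'

os.remove(file_path)  # clean up the stand-in file
```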
40 Environment information
- What is needed to render or use an object
  - Operating system
  - Application software
  - Computing resources
- Relevance to long-term preservation: the ability to render an object and interact with its content may depend on knowing these technical details
41 Environment container
- What is needed to render or use an object
  - Operating system
  - Application software
  - Computing resources
- Why is obligation optional?
  - Preservation strategies may differ in need for this information (e.g., may be unneeded for bit-level preservation)
  - We currently lack practical methods to collect and store this information
- Applies to all types of object (representation, file, bitstream)
42 Environment semantic units
- environmentCharacteristic
  - Multiple environments can support an object, but often not equally well
  - Suggested values: unspecified, known to work, minimum, recommended
  - Repository does not need to record all possible environments
- environmentPurpose
  - Use supported by the specified environment
  - Suggested values: render, edit
  - Example for x.pdf: Adobe Acrobat (edit), Adobe Reader (render)
43 Environment semantic units (cont.)
- software and hardware
  - identify by name, version, type (broad category)
  - Many may apply; at least one should be recorded
- dependency
  - non-software component or file needed
  - dependency vs. swDependency
  - e.g. fonts, schemas, stylesheets
  - name and identifier
- environmentNote
  - Any additional information
  - Should not be used as substitute for more rigorous description
44 Environment example: ETD (PDF file)
- environmentCharacteristic: known to work
- environmentPurpose: render
- software/swName: Adobe Acrobat Reader
- software/swVersion: 6.1
- software/swType: renderer
- software/swDependency: Windows NT
- software/swDependency: Mozilla Firefox 1.0
- hardware/hwName: Intel Pentium II
- hardware/hwType: processor
- dependency/dependencyName: Mathematica 5.2 True Type math fonts
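The ETD example can be held as a nested structure and queried by purpose. The sketch below is one possible in-memory shape, assuming plain dictionaries and lists; the `software_for` helper is hypothetical, and only the keys and values come from the slide.

```python
# One environment record, using the semantic unit names from the slide
# as dictionary keys. Repeatable units are held as lists.
environments = [
    {
        "environmentCharacteristic": "known to work",
        "environmentPurpose": ["render"],
        "software": [
            {
                "swName": "Adobe Acrobat Reader",
                "swVersion": "6.1",
                "swType": "renderer",
                "swDependency": ["Windows NT", "Mozilla Firefox 1.0"],
            }
        ],
        "hardware": [{"hwName": "Intel Pentium II", "hwType": "processor"}],
        "dependency": [
            {"dependencyName": "Mathematica 5.2 True Type math fonts"}
        ],
    }
]


def software_for(envs, purpose):
    """List the software names recorded for a given environmentPurpose."""
    return [
        sw["swName"]
        for env in envs
        if purpose in env["environmentPurpose"]
        for sw in env["software"]
    ]


print(software_for(environments, "render"))  # ['Adobe Acrobat Reader']
print(software_for(environments, "edit"))    # []
```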
45 Environment registries
- Information may be complex and increasingly granular
- Information often applies to whole class of objects
- PREMIS does not assume the existence of an environment registry, but defines the information that would be needed in one
- PRONOM has some elements of an environment registry
  - for any file extension, gives list of software that can
    - create
    - render
    - identify
    - validate
    - extract metadata from
46 Digital signatures
- In a transaction, a digital signature verifies the identity of the sender and that the file was unchanged in transmission.
- Some archives sign stored objects for verification in the future.
- PREMIS digital signature semantic units are based on W3C's XML Signature Syntax and Processing
  - de facto standard for encoding signature information
  - PREMIS adopts structure/semantics where possible
  - Some departures, e.g., PREMIS permits a given signature to be a property of only 1 object.
47 signatureInformation Container
- Who signed it?
  - signer (name or pointer to an Agent)
- How was it signed?
  - signatureInformationEncoding (e.g., Base64)
  - signatureMethod (e.g., DSA-SHA1)
- How can we validate it?
  - signatureValidationRules (could be a pointer to documentation for the validation procedure)
  - signatureProperties (additional information)
  - keyInformation: the signer's public key and other info
    - Type: e.g., DSA, RSA, PGP, etc.
    - Other info: e.g., certificate, revocation list, etc.
- And of course, the signature itself
48 relationships between different objects
- relationship container
  - what object is related to this one?
    - relatedObjectIdentification
    - relatedObjectSequence
  - how is the object related?
    - relationshipType
      - structural, derivative
    - relationshipSubType
      - is part of, is source of, ...
  - was this relationship the result of an event?
    - relatedEventIdentification
      - migration, copying
    - relatedEventSequence
49 Relationship between this file and that representation
- relationship: part of the description of this file
  - relationshipType: structural
  - relationshipSubType: is part of
  - relatedObjectIdentification
    - relatedObjectIdentifierType: repositoryID
    - relatedObjectIdentifierValue: 0385503954
    - relatedObjectSequence: 1
  - relatedEventIdentification: none
[Diagram: this file "is part of" that representation]
50 Relationship between this file and a more current, migrated version
- relationship: part of the description of the file
  - relationshipType: derivative
  - relationshipSubType: is source of
  - relatedObjectIdentification: the identifier of the related file
    - relatedObjectIdentifierType: MyIRFileID
    - relatedObjectIdentifierValue: F004400
    - relatedObjectSequence
  - relatedEventIdentification: the identifier of the migration event
    - relatedEventIdentifierType: MyIREventID
    - relatedEventIdentifierValue: E0192
    - relatedEventSequence
[Diagram: this file "is source of" F004400 through migration event E0192]
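The two relationship examples can be sketched as records attached to the description of one file. In this illustration the `Relationship` class and the `migration_targets` helper are hypothetical; the types, subtypes and identifier values are the ones shown on the slides.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Relationship:
    """Hypothetical record of one relationship held in a file's description."""
    relationship_type: str                     # "structural" or "derivative"
    relationship_subtype: str                  # "is part of", "is source of"
    related_object: Tuple[str, str]            # (identifier type, value)
    related_event: Optional[Tuple[str, str]] = None  # event that caused it


def migration_targets(relationships):
    """Identifiers of objects this file is the source of."""
    return [
        r.related_object
        for r in relationships
        if r.relationship_type == "derivative"
        and r.relationship_subtype == "is source of"
    ]


this_file = [
    # Structural: this file is part of a representation; no event involved.
    Relationship("structural", "is part of", ("repositoryID", "0385503954")),
    # Derivative: this file is the source of F004400 via migration E0192.
    Relationship("derivative", "is source of",
                 ("MyIRFileID", "F004400"), ("MyIREventID", "E0192")),
]

print(migration_targets(this_file))  # [('MyIRFileID', 'F004400')]
```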
51 Relationships between different entity types
- Identifiers are used to link related entities together
- This object can link to one or more intellectual entities, rights statements, and events via linking semantic units
  - linkingIntellectualEntityIdentifier
    - linkingIntellectualEntityIdentifierType
    - linkingIntellectualEntityIdentifierValue
  - linkingPermissionStatementIdentifier
    - linkingPermissionStatementIdentifierType
    - linkingPermissionStatementIdentifierValue
  - linkingEventIdentifier (can you guess the two subelements?)
52 objectCharacteristics
- Applicable only to file and bitstream
- Technical properties common to all/most file formats, not format specific
- Container for subunits
  - compositionLevel
  - fixity
  - size
  - format
  - significantProperties
  - inhibitors
53 fixity
- Information used to verify whether an object has been altered: compare message digests (checksums) calculated at different times
- Container for messageDigestAlgorithm, messageDigest, messageDigestOriginator
- Automatically calculated and recorded by repository
- Algorithm: controlled vocabulary, example SHA-1
- Message digest: output of message digest algorithm
- Originator: agent that created original message digest; could be a string or a pointer
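The fixity check described above, comparing message digests calculated at different times, can be sketched with Python's standard `hashlib` module. The helper names `make_fixity` and `is_unaltered`, the sample bytes and the originator string are assumptions for illustration; SHA-1 is used only because the slide names it as an example algorithm.

```python
import hashlib


def message_digest(data: bytes, algorithm: str = "sha1") -> str:
    """Return the hex digest of `data` under the named algorithm."""
    return hashlib.new(algorithm, data).hexdigest()


def make_fixity(data: bytes, algorithm: str = "sha1",
                originator: str = "repository") -> dict:
    """Record fixity information at one point in time (e.g. at ingest)."""
    return {
        "messageDigestAlgorithm": algorithm,
        "messageDigest": message_digest(data, algorithm),
        "messageDigestOriginator": originator,
    }


def is_unaltered(data: bytes, fixity: dict) -> bool:
    """Recompute the digest later and compare it with the stored one."""
    current = message_digest(data, fixity["messageDigestAlgorithm"])
    return current == fixity["messageDigest"]


original = b"chapter one"
fixity = make_fixity(original)

print(is_unaltered(original, fixity))        # True
print(is_unaltered(b"chapter 0ne", fixity))  # False
```

Storing the algorithm name next to the digest means a later check can recompute with the same algorithm even if the repository's default has changed in the meantime.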
54 format
- Container semantic unit
- Identifies the format of a file or bitstream
- Preservation activities depend on detailed and accurate knowledge about formats
- Should be ascertained by repository on ingest (for example, using JHOVE)
- May be a format name (formatDesignation) or a pointer into a registry (formatRegistry)
55 formatDesignation and formatRegistry
- formatDesignation
  - Identifies the format of an object by name and version
  - Format may be a matter of opinion: is it text, XML, or METS?
  - MIME type is the most widely used authority list
  - May need more granularity; may be multipart (TIFF 6.0/GeoTIFF)
- formatRegistry
  - Identifies format by reference to an entry in a format registry
  - Detailed specifications on formats may be contained in a future format registry
  - formatRegistryName, formatRegistryKey, formatRegistryRole
  - Role includes purpose or expected use
56 significantProperties
- Characteristics of an object considered by a repository to be important to maintain through preservation actions
- Applicable to representation, file, bitstream
- May apply to all objects of a certain class or may be unique to each individual object
- May be determined by business rules of the repository
- Listing significant properties implies that the repository plans to preserve those properties and would note any modifications to them in eventOutcome
- Example: for a PDF with embedded links that are not essential, use "Content only"
- May need to be further developed in the future
57 inhibitors
- Features of the object intended to inhibit access, use or migration
- It is necessary to record the kind of encryption and the access key to allow future use of the object
- Applicable to file and bitstream
- inhibitorType
  - Inhibitor method employed, e.g. DES, password protection
- inhibitorTarget
  - The content or function protected, e.g. function: print
- inhibitorKey
  - The decryption key or password