DATA DICTIONARY FOR DIGITAL PRESERVATION: PREMIS TUTORIAL

1
DATA DICTIONARY FOR DIGITAL PRESERVATION: PREMIS TUTORIAL
  • Priscilla Caplan, FCLA
  • Rebecca Guenther, Library of Congress
  • Wolfson Medical Library
  • University of Glasgow
  • July 17-19
  • Sponsored by the Digital Curation Centre

2
GOALS
  • Establish definition and scope of PREMIS Data
    Dictionary
  • Show how semantic units relate to the Data Model
  • Introduce semantic units pertaining to Objects,
    Events, Rights and Agents
  • Discuss major implementation issues
  • Show ways of representing PREMIS in XML

3
OUTLINE
  • Introduction: background and context
  • PREMIS Data Model
  • Semantic units pertaining to Objects
  • Objects hands-on exercise
  • Files, bitstreams and the onion model
  • Semantic units pertaining to Agents, Rights,
    Events
  • Events hands-on exercise
  • Implementation issues
  • PREMIS in XML: PREMIS schemas and METS
  • XML hands-on exercise
  • Implementers panel
  • Maintenance activity and future plans
  • Conclusion, Q&A

4
INTRODUCTION: BACKGROUND AND CONTEXT
5
OAIS
  • Reference Model for an Open Archival Information
    System (OAIS). Consultative Committee for Space
    Data Systems, 2002
  • A high level framework for preservation
    repositories, establishing a common model and
    vocabulary
  • Defines:
    • the functions an OAIS must perform
    • the information model an OAIS must employ
  • In the information model, a Content Data Object
    is described by Representation Information (what
    you need to use and interpret the object) and
    Preservation Description Information (what you
    need to preserve and access the object)

6
OAIS Information Model and PREMIS DD
  • The OAIS Reference Model is (arguably) the most
    widely adopted standard in the digital
    preservation community
  • Important to orient PREMIS in an OAIS context:
    • Relate elements from other preservation metadata
      schemas based on OAIS to PREMIS (e.g., NLA)
    • Enhance interoperability/applicability of
      preservation metadata registries (e.g., the
      Digital Curation Centre's Representation
      Information Repository)
    • Allow repositories to think about OAIS
      conformance in a PREMIS context, and vice versa
  • Many repositories use OAIS concepts and
    vocabulary to express information (content
    metadata) within their archiving systems
  • How does PREMIS relate to these concepts and
    vocabulary?

7
OAIS Information Model Structure
  • Packaging Information (wraps a SIP, AIP, or DIP)
    • Content Information
      • Content Data Object
      • Representation Information
    • Preservation Description Information
      • Provenance
      • Reference
      • Fixity
      • Context
  • Descriptive Information
8
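PREMIS-to-OAIS Mapping
  • PREMIS entities: Intellectual Entities, Objects,
    Rights, Events, Agents
  • OAIS information categories: Fixity Information,
    Context Information, Reference Information,
    Provenance Information, Packaging Information,
    Description Information, Representation
    Information
  • [Diagram mapping PREMIS entities to OAIS
    information categories]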
9
Comments
  • The PREMIS Data Dictionary does not provide
    semantic units for Intellectual Entities
  • But it provides semantic units to link to other
    metadata sources for Intellectual Entities (e.g.,
    a MARC record); these are the only semantic units
    categorized as descriptive information (metadata
    to aid discovery)
  • All entities have reference (identification)
    information.
  • No packaging information that links content
    with metadata. But PREMIS can be used with
    something like METS, which does provide packaging
    information.
  • In short, PREMIS deals mostly with
    representation, context, provenance, and fixity
    information, in keeping with the PREMIS
    definition of preservation metadata.

10
Early work in preservation metadata
  • Open Archival Information System (OAIS)
    • defined a basic abstract information model
  • NLA, CEDARS and NEDLIB
    • developed preservation metadata schemes for their
      projects
  • OCLC/RLG Preservation Metadata Framework Working
    Group, Preservation Metadata and the OAIS
    Information Model: A Metadata Framework to
    Support the Preservation of Digital Objects,
    2001
    • unified earlier work within the OAIS framework
  • National Library of New Zealand, 2002
    • organized metadata elements around a data model
  • Preservation Metadata: Implementation Strategies
    (PREMIS)
    • focused on practical implementation needs

11
From theory to practice
Preservation Metadata Requirements
Digital Archiving Systems
Framework
OAIS
PREMIS Data Dictionary
12
PREMIS Working Group
  • Objective: define implementable, core
    preservation metadata, with recommendations for
    management and use
  • Membership
    • 30 experts from 5 countries: libraries,
      museums, archives, government agencies, private
      sector
    • Co-Chairs: Priscilla Caplan (FCLA), Rebecca
      Guenther (LC)
  • Data Dictionary for Preservation Metadata: Final
    Report of the PREMIS Working Group
    • PREMIS Data Dictionary 1.0
    • Accompanying report (scope, context, data model,
      special topics, glossary, examples)
    • XML schemas to support implementation

13
Some guiding principles and assumptions
  • Implementable, core, preservation metadata
    • Preservation metadata: maintain viability,
      renderability, understandability, authenticity,
      identity in a preservation context
    • Core: what most preservation repositories need
      to know to preserve digital materials over the
      long term
    • Implementable: rigorously defined; supported by
      usage guidelines/recommendations; emphasis on
      automated workflows
  • Implementation neutral
    • No assumptions about specific implementation
    • Promote flexibility/interoperability
    • Focus on semantic units (what you need to know;
      implementation-neutral) vs. metadata elements
      (how you record it; implementation-specific)
    • Information that needs to be recoverable from
      the digital archiving system, independent of
      local implementation

14
Uses and scope
  • PREMIS can provide:
    • A common data model for organizing/thinking about
      preservation metadata
    • Guidance for local implementations
    • A standard for exchanging information packages
      between repositories
  • PREMIS is not designed to provide:
    • An out-of-the-box solution: semantic units need
      to be instantiated as metadata elements in the
      repository system
    • All needed metadata: it excludes business rules,
      format-specific technical metadata, descriptive
      metadata for access, and non-core preservation
      metadata
    • Lifecycle management of objects outside the
      repository
    • Rights management: limited to permissions to
      perform actions within the repository

15
An OAIS Perspective
Assumes stuff arrives in SIPs and is stored in
AIPs, and PREMIS is what the repository needs to
know to ingest, store and preserve it for the
future.
16
Community interest
  • As of July 2006:
    • 25,000 hits on the Data Dictionary
    • More than 100 subscribers to the PREMIS
      Implementers Group (PIG) discussion list
  • Awarded the U.K. Digital Preservation Award for
    2005 and the SAA Preservation Publication Award
    for 2006
  • The PREMIS Data Dictionary is a product of
    collaboration and consensus
  • Digital preservation is a shared problem which
    invites shared solutions
  • Multiplicity of perspectives on the working group
    helps promote applicability in many contexts
  • The Data Dictionary should be useful to any
    institution committed to the long-term
    preservation of digital materials

17
DATA MODEL
18
PREMIS data model
Intellectual Entities
Rights
Agents
Objects
Events
19
Intellectual Entity
  • A coherent set of content that is reasonably
    described as a unit, for example, a particular
    book, map, photograph, or database.
  • May include other Intellectual Entities (e.g. as
    a website includes a web page).
  • May have one or more digital representations.
  • Can reference an Object or be referenced by an
    Object, but is not described in PREMIS.

  • Examples:
    • Rabbit, Run by John Updike (a book)
    • [Maggie at the beach] (a photograph)
    • The Library of Congress Website (a website)
    • The Library of Congress American Memory Home
      page (a web page)

20
Object
  • A discrete unit of information in digital form.
  • Objects are what the repository preserves.
  • FILE: a named and ordered sequence of bytes that
    is known by an operating system.
  • REPRESENTATION: the set of files, including
    structural metadata, needed for a complete and
    reasonable rendition of an Intellectual Entity.
  • BITSTREAM: contiguous or non-contiguous data
    within a file that has meaningful common
    properties for preservation purposes.

  • Examples:
    • chapter1.pdf (a file: a PDF file)
    • chapter1.pdf + chapter2.pdf + chapter3.pdf (a
      representation: the PDF version of a book in 3
      chapters)
    • an audio stream in uncompressed PCM (a bitstream
      within an AVI file)
    • a video stream in MJPEG (a bitstream within an
      AVI file)

21
OBJECTS: A photo in two formats
22
OBJECTS: A book in two versions
23
An important aside about objects
  • A repository does not have to control objects at
    all levels. For example, it may not recognize
    representations or bitstreams.
  • All PREMIS says is:
    • if you do control representation objects, these
      are the semantic units that pertain to
      representations
    • if you do control file objects, these are the
      semantic units that pertain to files
    • if you do control bitstream objects, these are
      the semantic units that pertain to bitstreams
    • and you need to record the relationships among
      them.

24
Event
  • An action that involves at least one object or
    agent known to the preservation repository.
  • Who, what, how, when, and to which object.
  • Necessary to document digital provenance. The
    history of an object can be tracked through the
    events in the object's life.

  • Examples:
    • A validation event verifying that chapter1.pdf
      is a good PDF file
    • An ingest event completing the process of
      creating an AIP for a SIP
    • A migration event creating a new version of an
      object in a more contemporary format
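
The "who, what, when, and to which object" view of an event can be sketched as a small record type. This is only an illustration in Python, not the PREMIS schema: the field names, timestamps and outcome strings below are hypothetical, and only the object and agent names are borrowed from the slides.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """Hypothetical event record: what happened, when, to which object, by whom."""
    event_type: str      # what, e.g. "validation", "ingest", "migration"
    when: datetime       # when the action took place
    object_id: str       # which object the action involved
    agent: str           # who (or which program) performed the action
    outcome: str = ""    # free-text result of the action


def history(object_id, events):
    """Return the events for one object in chronological order."""
    matching = (e for e in events if e.object_id == object_id)
    return sorted(matching, key=lambda e: e.when)


# Illustrative data only; the dates and outcomes are made up.
events = [
    Event("migration", datetime(2006, 7, 18, 9, 0), "chapter1.pdf",
          "repository staff"),
    Event("validation", datetime(2006, 7, 17, 14, 30), "chapter1.pdf",
          "JHOVE version 1.0", "well-formed and valid"),
    Event("validation", datetime(2006, 7, 17, 14, 31), "chapter2.pdf",
          "JHOVE version 1.0", "well-formed and valid"),
]

for event in history("chapter1.pdf", events):
    print(event.when.isoformat(), event.event_type, event.agent)
```

Sorting by timestamp is what turns a pile of event records into the provenance trail the slide describes.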

25
Agent
  • A person, organization, or software program
    associated with preservation events in the life
    of an object.
  • Not defined in detail in PREMIS; not considered
    core preservation metadata beyond identification

  • Examples:
    • Evan Owens (a person)
    • Bank of Scotland (an organization)
    • Bank of Scotland, Computer Systems Department (an
      organization)
    • JHOVE version 1.0 (a software program)

26
Rights
  • An agreement with a rightsholder that allows a
    repository to take action(s) related to objects
    in the repository.
  • Not a full rights expression language.
  • Assumption: the repository is the grantee.
  • Basic statement is: Agent A grants Permission P
    for Object B.

  • Example:
    • The Bank of Scotland gives the repository
      permission to make an unlimited number of copies
      of chapter1.pdf under its Agreement with the
      repository signed December 11, 2006.
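
The basic statement "Agent A grants Permission P for Object B" maps naturally onto a three-field record. The sketch below is a hedged illustration in Python; the class name, field names and the permission strings ("copy", "migrate") are assumptions made for the example, not PREMIS vocabulary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PermissionStatement:
    """Hypothetical record of 'Agent A grants Permission P for Object B'."""
    granting_agent: str   # Agent A, the rightsholder
    permission: str       # Permission P, an action the repository may take
    object_id: str        # Object B, the object the permission applies to


def is_permitted(action, object_id, statements):
    """True if any statement grants this action for this object."""
    return any(
        s.permission == action and s.object_id == object_id
        for s in statements
    )


# The slide's example, with "copy" as an assumed permission label.
statements = [PermissionStatement("Bank of Scotland", "copy", "chapter1.pdf")]

print(is_permitted("copy", "chapter1.pdf", statements))
print(is_permitted("migrate", "chapter1.pdf", statements))
```

Because the repository is assumed to be the grantee, the record needs no grantee field.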

27
Example: A thesis in two parts, PDF and MOV
  • Intellectual Entity: My Thesis
  • There are 3 objects: a REPRESENTATION and 2
    FILEs (one PDF, one MOV). There could also be
    BITSTREAM objects embedded within the FILEs.
  • There may be several events associated with
    ingest, such as validation, fixity checking and
    writing to storage.
  • Agents may be associated with each event, e.g.
    the JHOVE program and/or the repository staff
    performed the validation
  • The archive has permission to copy the objects to
    storage
  • The intellectual entity is the thesis. It has an
    author, title and other bibliographic information
    that might be given in a catalog record. Only
    the identifier is defined in PREMIS.
  • [Diagram: one Representation Object made up of
    two File Objects, PDF and MOV]
28
Identifiers
  • Instances of objects, events, agents and rights
    statements are uniquely identified by identifiers
  • Syntax: Identifier
    • IdentifierType: domain in which the value is
      unique
    • IdentifierValue: the identifier string itself
  • Example: objectIdentifier
    • objectIdentifierType: DRS
    • objectIdentifierValue:
      http://nrs.harvard.edu/urn-3:FHCL.Loeb:sa1
  • Example: eventIdentifier
    • eventIdentifierType: DRS
    • eventIdentifierValue: 716593
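
Since an identifier value is only unique within the domain named by its type, the pair (type, value) is what should act as the key. A minimal Python sketch, assuming a hypothetical `Identifier` class; the second type name, MyIREventID, is used here only as a contrasting domain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identifier:
    """Hypothetical type/value pair; both parts together form the key."""
    identifier_type: str    # domain in which the value is unique
    identifier_value: str   # the identifier string itself


drs_event = Identifier("DRS", "716593")
other_event = Identifier("MyIREventID", "716593")  # same value, other domain

# frozen=True makes instances hashable, so they can be dictionary keys.
registry = {
    drs_event: "event recorded in DRS",
    other_event: "event recorded in another repository",
}

print(drs_event == other_event)
print(registry[Identifier("DRS", "716593")])
```

Keying on the pair keeps two repositories' records apart even when their local values collide, which is the situation that arises when data is exchanged.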
29
Identifiers (2)
  • The identifier type should tell you how to build
    the value and who the naming authority is
  • In this example the object identifier type is
    DRS, indicating Harvard's Digital Repository
    Service, not URL. The identifier is unique in
    both domains, but DRS tells you more.
  • If all identifiers are local to the repository
    system, it is unlikely that the identifier type
    would be recorded for each identifier in the
    system itself, but it should be supplied when
    exchanging data with others
  • Can be created inside or outside of the
    repository.

30
The PREMIS Data Dictionary
31
Data dictionary descriptions
32
Sample data dictionary entry: container
33
Sample data dictionary entry: container within
container
34
Sample data dictionary entry: semantic unit
35
SEMANTIC UNITS PERTAINING TO OBJECTS
36
Object entity
  • Aggregates characteristics relevant to
    preservation management that are properties of
    the object
  • Semantic units may not all be applicable to each
    type of object (representation, file, bitstream)
  • Main types of information:
    • identifier
    • object characteristics
    • creation information
    • software and hardware environment
    • digital signatures
    • relationships to other objects
    • links to other types of entity

37
preservationLevel and objectCategory
  • objectCategory (mandatory)
    • Values: representation, file, bitstream
  • preservationLevel
    • What preservation treatment/strategy the
      repository plans for this object
    • Varying preservation options dependent on factors
      such as value, uniqueness, preservability of
      format
    • A business rule, only relevant in a given
      repository
    • Examples: full, bit-level
    • Mandatory, but may not be explicitly recorded if
      the repository offers only one level

38
Creation information
  • creatingApplication
    • Information about the application which created
      the object
    • Useful for later problem solving
    • Container with 3 subunits: name, version, date
    • Applies to objects created externally or by the
      repository, e.g. by a migration event
    • Repeatable if more than one application processed
      it
    • Example: MS Word 2000, with date created
  • originalName
    • Name of the object as submitted to or harvested
      by the repository
    • Supplements repository-supplied names
    • Only applicable to files
    • Example: sip/book/N419.pdf

39
storage
  • How and where the object is stored
  • Container for contentLocation and storageMedium
  • May be repeated if there is more than one
    identical copy in a different location
  • contentLocation
    • Information needed to retrieve a file from a
      system or a bitstream from within a file
    • Subunits: type and value
    • Could be a fully qualified path or an identifier
      used by the storage system; for a bitstream, a
      byte offset
  • storageMedium
    • Physical medium on which the object is stored
    • Useful for media management (e.g. media
      migration)
    • May be the name of a system that knows the medium
    • Examples: hard disk, TSM
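
To make the byte-offset case concrete: retrieving a bitstream from within a file amounts to seeking to the offset and reading a known number of bytes. The sketch below assumes the repository also knows the bitstream's length (for example from its size), which the slide does not state; the function name and sample bytes are hypothetical.

```python
import io


def read_bitstream(stream, offset, length):
    """Read `length` bytes starting at byte `offset` of a seekable binary stream."""
    stream.seek(offset)
    data = stream.read(length)
    if len(data) != length:
        # Fewer bytes than expected means the recorded location is wrong
        # or the containing file has been truncated.
        raise ValueError("bitstream extends past the end of the file")
    return data


# A stand-in for a container file: 6 header bytes, then an 11-byte bitstream.
container = io.BytesIO(b"HEADERaudio-bytesTRAILER")

print(read_bitstream(container, offset=6, length=11))
```

With a real file the same function works on the object returned by `open(path, "rb")`.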

40
Environment information
  • What is needed to render or use an object:
    • Operating system
    • Application software
    • Computing resources
  • Relevance to long-term preservation: the ability
    to render an object and interact with its content
    may depend on knowing these technical details

41
Environment container
  • What is needed to render or use an object:
    • Operating system
    • Application software
    • Computing resources
  • Why is the obligation optional?
    • Preservation strategies may differ in their need
      for this information (e.g., it may be unneeded
      for bit-level preservation)
    • We currently lack practical methods to collect
      and store this information
  • Applies to all types of object (representation,
    file, bitstream)

42
Environment semantic units
  • environmentCharacteristic
    • Multiple environments can support an object, but
      often not equally well
    • Suggested values: unspecified, known to work,
      minimum, recommended
    • Repository does not need to record all possible
      environments
  • environmentPurpose
    • Use supported by the specified environment
    • Suggested values: render, edit
    • Example for x.pdf: Adobe Acrobat (edit), Adobe
      Reader (render)

43
Environment semantic units (cont.)
  • software and hardware
    • Identify by name, version, type (broad category)
    • Many may apply; at least one should be recorded
  • dependency
    • Non-software component or file needed
    • dependency vs. swDependency
    • e.g. fonts, schemas, stylesheets
    • name and identifier
  • environmentNote
    • Any additional information
    • Should not be used as a substitute for more
      rigorous description

44
Environment example: ETD (PDF file)
  • environmentCharacteristic: known to work
  • environmentPurpose: render
  • software/swName: Adobe Acrobat Reader
  • software/swVersion: 6.1
  • software/swType: renderer
  • software/swDependency: Windows NT
  • software/swDependency: Mozilla Firefox 1.0
  • hardware/hwName: Intel Pentium II
  • hardware/hwType: processor
  • dependency/dependencyName: Mathematica 5.2 True
    Type math fonts
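
One way to hold the ETD environment in memory is a nested structure whose keys mirror the semantic unit names. This is a sketch in Python under that assumption; the list-of-dicts layout and the `software_for` helper are illustrative choices, not a PREMIS serialization.

```python
# The ETD example as nested data. Repeatable units are modelled as lists.
etd_environments = [
    {
        "environmentCharacteristic": "known to work",
        "environmentPurpose": ["render"],
        "software": [
            {
                "swName": "Adobe Acrobat Reader",
                "swVersion": "6.1",
                "swType": "renderer",
                "swDependency": ["Windows NT", "Mozilla Firefox 1.0"],
            }
        ],
        "hardware": [{"hwName": "Intel Pentium II", "hwType": "processor"}],
        "dependency": [
            {"dependencyName": "Mathematica 5.2 True Type math fonts"}
        ],
    }
]


def software_for(environments, purpose):
    """List the software names recorded for environments serving `purpose`."""
    return [
        sw["swName"]
        for env in environments
        if purpose in env["environmentPurpose"]
        for sw in env["software"]
    ]


print(software_for(etd_environments, "render"))
print(software_for(etd_environments, "edit"))
```

Only a rendering environment is recorded for this object, so a query for "edit" comes back empty; the repository does not have to record every possible environment.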

45
Environment registries
  • Information may be complex and increasingly
    granular
  • Information often applies to whole class of
    objects
  • PREMIS does not assume the existence of an
    environment registry, but defines the information
    that would be needed in one
  • PRONOM has some elements of an environment
    registry
    • For any file extension, it gives a list of
      software that can:
      • create
      • render
      • identify
      • validate
      • extract metadata from

46
Digital signatures
  • In a transaction, verifies the identity of the
    sender and that the file was unchanged in
    transmission.
  • Some archives sign stored objects for
    verification in the future.
  • PREMIS digital signature semantic units are based
    on the W3C's XML-Signature Syntax and Processing
    • de facto standard for encoding signature
      information
    • PREMIS adopts its structure/semantics where
      possible
    • Some departures, e.g., PREMIS permits a given
      signature to be a property of only 1 object.

47
signatureInformation Container
  • Who signed it?
    • signer (name or pointer to an Agent)
  • How was it signed?
    • signatureInformationEncoding (e.g., Base64)
    • signatureMethod (e.g., DSA-SHA1)
  • How can we validate it?
    • signatureValidationRules (could be a pointer to
      documentation for the validation procedure)
    • signatureProperties (additional information)
    • keyInformation: the signer's public key and other
      info
      • Type: e.g., DSA, RSA, PGP, etc.
      • Other info: e.g., certificate, revocation list,
        etc.
  • And of course, the signature itself

48
Relationships between different objects
  • relationship container
    • What object is related to this one?
      • relatedObjectIdentification
      • relatedObjectSequence
    • How is the object related?
      • relationshipType: structural, derivative
      • relationshipSubType: is part of, is source of, ...
    • Was this relationship the result of an event?
      • relatedEventIdentification (e.g., migration,
        copying)
      • relatedEventSequence

49
Relationship between this file and that
representation
  • relationship: part of the description of this
    file
    • relationshipType: structural
    • relationshipSubType: is part of
    • relatedObjectIdentification
      • relatedObjectIdentifierType: repositoryID
      • relatedObjectIdentifierValue: 0385503954
      • relatedObjectSequence: 1
    • relatedEventIdentification: none
50
Relationship between this file and a more
current, migrated version
  • relationship: part of the description of the
    file
    • relationshipType: derivative
    • relationshipSubType: is source of
    • relatedObjectIdentification: the identifier of
      the related file
      • relatedObjectIdentifierType: MyIRFileID
      • relatedObjectIdentifierValue: F004400
      • relatedObjectSequence
    • relatedEventIdentification: the identifier of
      the migration event
      • relatedEventIdentifierType: MyIREventID
      • relatedEventIdentifierValue: E0192
      • relatedEventSequence
  • In short: this file is the source of F004400,
    through migration event E0192
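
Recording "is source of" on each file makes it possible to walk from any version to the most current one. A minimal Python sketch of that walk; the identifier F004399 for "this file" is hypothetical (the slide does not give it), and the class and function names are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Relationship:
    """Hypothetical in-memory form of one relationship container."""
    relationship_type: str                  # e.g. "structural", "derivative"
    relationship_subtype: str               # e.g. "is part of", "is source of"
    related_object_id: str                  # relatedObjectIdentifierValue
    related_event_id: Optional[str] = None  # relatedEventIdentifierValue


# Relationships keyed by the identifier of the file they describe.
# "F004399" is a made-up identifier for "this file".
relationships = {
    "F004399": [Relationship("derivative", "is source of", "F004400", "E0192")],
    "F004400": [],
}


def latest_version(file_id, relationships):
    """Follow derivative 'is source of' links until no newer version exists."""
    seen = {file_id}
    current = file_id
    while True:
        targets = [
            r.related_object_id
            for r in relationships.get(current, [])
            if r.relationship_type == "derivative"
            and r.relationship_subtype == "is source of"
        ]
        if not targets:
            return current
        current = targets[0]
        if current in seen:
            raise ValueError("relationship cycle detected")
        seen.add(current)


print(latest_version("F004399", relationships))
```

The sketch follows only the first derivative link from each file; a repository that migrates one source into several formats would need to decide which branch counts as "current".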
51
Relationships between different entity types
  • Identifiers are used to link related entities
    together
  • This object can link to one or more intellectual
    entities, rights statements, and events via
    linking semantic units

  • linkingIntellectualEntityIdentifier
    • linkingIntellectualEntityIdentifierType
    • linkingIntellectualEntityIdentifierValue
  • linkingPermissionStatementIdentifier
    • linkingPermissionStatementIdentifierType
    • linkingPermissionStatementIdentifierValue
  • linkingEventIdentifier (can you guess the two
    subelements?)

52
objectCharacteristics
  • Applicable only to file and bitstream
  • Technical properties common to all/most file
    formats, not format specific
  • Container for subunits:
    • compositionLevel
    • fixity
    • size
    • format
    • significantProperties
    • inhibitors

53
fixity
  • Information used to verify whether an object has
    been altered: compare message digests
    (checksums) calculated at different times
  • Container for messageDigestAlgorithm,
    messageDigest, messageDigestOriginator
  • Automatically calculated and recorded by the
    repository
  • Algorithm: controlled vocabulary, for example
    SHA-1
  • Message digest: output of the message digest
    algorithm
  • Originator: agent that created the original
    message digest; could be a string or a pointer
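
A fixity check in miniature: compute a message digest at one point in time, store it with its algorithm and originator, and recompute later to compare. The sketch uses Python's standard `hashlib`; the dictionary layout, the helper names and the default originator string are assumptions for illustration.

```python
import hashlib

# Map controlled-vocabulary algorithm names to hashlib constructors.
ALGORITHMS = {
    "MD5": hashlib.md5,
    "SHA-1": hashlib.sha1,
    "SHA-256": hashlib.sha256,
}


def make_fixity(data, algorithm="SHA-1", originator="repository"):
    """Compute a digest and return it with the three fixity subunits."""
    return {
        "messageDigestAlgorithm": algorithm,
        "messageDigest": ALGORITHMS[algorithm](data).hexdigest(),
        "messageDigestOriginator": originator,
    }


def is_unaltered(data, fixity):
    """Recompute the digest and compare it with the stored one."""
    recomputed = ALGORITHMS[fixity["messageDigestAlgorithm"]](data).hexdigest()
    return recomputed == fixity["messageDigest"]


content = b"abc"
fixity = make_fixity(content)

print(fixity["messageDigest"])
print(is_unaltered(content, fixity))
print(is_unaltered(b"abd", fixity))
```

Storing the algorithm name next to the digest is what lets a later check recompute with the same algorithm, even if the repository's default has changed in the meantime.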

54
format
  • Container semantic unit
  • Identifies the format of a file or bitstream
  • Preservation activities depend on detailed and
    accurate knowledge about formats
  • Should be ascertained by repository on ingest
    (for example, using JHOVE)
  • May be a format name (formatDesignation) or a
    pointer into a registry (formatRegistry)

55
formatDesignation and formatRegistry
  • formatDesignation
    • Identifies the format of an object by name and
      version
    • Format may be a matter of opinion: is it text,
      XML, or METS?
    • MIME type is the most widely used authority list
    • May need more granularity; may be multipart (TIFF
      6.0/GeoTIFF)
  • formatRegistry
    • Identifies format by reference to an entry in a
      format registry
    • Detailed specifications on formats may be
      contained in a future format registry
    • formatRegistryName, formatRegistryKey,
      formatRegistryRole
    • Role includes purpose or expected use

56
significantProperties
  • Characteristics of an object considered by a
    repository to be important to maintain through
    preservation actions
  • Applicable to representation, file, bitstream
  • May apply to all objects of a certain class or
    may be unique to each individual object
  • May be determined by business rules of the
    repository
  • Listing significant properties implies that the
    repository plans to preserve those properties and
    would note any modifications to them in
    eventOutcome
  • Example: for a PDF with embedded links that are
    not essential, use "Content only"
  • May need to be further developed in the future

57
inhibitors
  • Features of the object intended to inhibit
    access, use or migration
  • It is necessary to record the kind of encryption
    and the access key to allow future use of the
    object
  • Applicable to file and bitstream
  • inhibitorType
    • Inhibitor method employed, e.g. DES, password
      protection
  • inhibitorTarget
    • The content or function protected, e.g. the
      print function
  • inhibitorKey
    • The decryption key or password
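
Several slides note which object categories a semantic unit applies to (originalName: files only; objectCharacteristics and inhibitors: file and bitstream; environment: all three). Those notes can be gathered into a lookup table, sketched here in Python. The table covers only these four units and is an illustration, not a complete or authoritative applicability matrix.

```python
from enum import Enum


class ObjectCategory(Enum):
    """The three values of objectCategory."""
    REPRESENTATION = "representation"
    FILE = "file"
    BITSTREAM = "bitstream"


FILE_AND_BITSTREAM = frozenset({ObjectCategory.FILE, ObjectCategory.BITSTREAM})

# Applicability as stated on the slides, for a few units only.
APPLICABILITY = {
    "environment": frozenset(ObjectCategory),
    "originalName": frozenset({ObjectCategory.FILE}),
    "objectCharacteristics": FILE_AND_BITSTREAM,
    "inhibitors": FILE_AND_BITSTREAM,
}


def applicable_units(category):
    """Return the listed semantic units that apply to one object category."""
    return sorted(
        unit for unit, categories in APPLICABILITY.items()
        if category in categories
    )


for category in ObjectCategory:
    print(category.value, applicable_units(category))
```

A repository that controls only file objects would need just the FILE row of such a table.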