Global Digital Format Registry - PowerPoint PPT Presentation

Title: Global Digital Format Registry


1
Global Digital Format Registry
Archiving Web Resources: Issues for Cultural Heritage Institutions
Information Day, National Library of Australia, Canberra,
November 12, 2004
  • Stephen L. Abrams
  • Digital Library Program Manager
  • Harvard University Library

2
Introduction
  • Almost all aspects of repository operation are
    conditioned by the format of the objects in the
    repository
  • Without proper characterization of digital
    objects (format typing and technical metadata),
    effective long-term preservation is difficult, if
    not impossible
  • Repositories need to ensure that:
  • Digital object content streams are valid with
    respect to their format
  • Metadata encapsulated within object content
    streams are consistent with externally supplied
    metadata
  • Formatted content streams remain accessible over
    time

3
Use Cases
  • Identification
  • I have an object: what format is it?
  • Validation
  • I have an object purportedly of format F: is
    it?
  • Characterization
  • I have an object of format F: what are its
    salient properties?
  • Assessment
  • I have an object of format F: is it at risk of
    obsolescence?
  • Processing
  • I have an object of format F: how can I perform
    operation X on it?

4
Repository Format Dependencies
Based on Open Archival Information System (OAIS)
Reference Model, ISO 14721
5
Characteristics of a Format Registry
  • Predictable data
  • Arbitrary granularity
  • Inclusive
  • Trustworthy
  • Authoritative
  • Honest broker with regard to proprietary
    information
  • Machine actionable discovery
  • Interoperable
  • Informative, not evaluative

6
Global Digital Format Registry
  • DLF funded two invitational workshops in 2002 to
    investigate issues surrounding the establishment
    of a GDFR

- Bibliothèque nationale de France
- California Digital Library
- Digital Library Federation
- Harvard University
- Internet Engineering Task Force
- JISC
- JSTOR
- Library of Congress
- MIT
- National Archives, UK
- NARA
- National Archives of Canada
- New York University
- NIST
- Online Computer Library Center
- Research Libraries Group
- Stanford University
- University of Pennsylvania
7
GDFR Scope
  • The registry will maintain persistent,
    unambiguous bindings between public identifiers
    for digital formats and representation
    information for those formats

8
What is a Format, Anyway?
  • A reversible byte-serialized encoding of an
    information model
  • A set of syntactic and semantic rules that:
  • Map from abstract content to a sequence of bytes
  • Map back from a sequence of bytes to the abstract
    content represented by those bytes
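
A minimal sketch of this definition, using IEEE 754 floating point as the example format: one function maps abstract content (a number) to bytes, and the other maps the bytes back. The function names are illustrative assumptions and do not come from any GDFR specification.

```python
import struct

def encode(value: float) -> bytes:
    """Map abstract content (a number) to a sequence of bytes."""
    # ">d" selects big-endian IEEE 754 binary64 (8 bytes).
    return struct.pack(">d", value)

def decode(data: bytes) -> float:
    """Map a sequence of bytes back to the abstract content."""
    (value,) = struct.unpack(">d", data)
    return value

# The encoding is reversible: decoding the encoded bytes
# returns the original value.
stream = encode(0.5)
assert len(stream) == 8
assert decode(stream) == 0.5
```

The round trip is what makes the encoding "reversible" in the sense of the slide: the rules have to work in both directions.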

9
Almost Anything is a Format
  • ASCII-encoded text, Excel spreadsheet, PDF
  • IEEE 754 floating point number
  • XML schema (and XML Schema)
  • LZW compression
  • ARC file, Tar archive
  • Windows Portable Executable (.exe)
  • NTFS file system

10
When Is a Format Not a Format?
11
When Is a Format Not a Format?
  • How many words are inside the box?

CAT
CAT
12
When Is a Format Not a Format?
  • How many words are inside the box?
  • It depends

CAT
CAT
13
When Is a Format Not a Format?
  • How many words are inside the box?
  • It depends
  • There are two tokens of one type (or two species
    of one genus, or two instances of one class, etc.)

CAT
CAT
14
When Is a Format Not a Format?
  • How many formats are inside the box?

TIFF 4.0
TIFF/EP
TIFF/IT
15
When Is a Format Not a Format?
  • How many formats are inside the box?
  • Three subtypes of one format family
  • Inter-familial relationships

TIFF 4.0
TIFF/EP
TIFF/IT
16
Format Family Tree
17
Formal classification
  • Ontological CLASSES, abstract families, concrete
    formats, and relationships
  • BYTESTREAM
    • IMAGE
      • STILL
        • RASTER
          • GIF
            • GIF87a
            • GIF89a new-version-of GIF87a
          • JPEG
            • ISO 10918
            • JFIF subtype-of ISO 10918
          • TIFF
            • TIFF 4.0
            • TIFF 5.0 new-version-of TIFF 4.0
            • TIFF 6.0 new-version-of TIFF 5.0
            • TIFF/EP subtype-of TIFF 6.0
            • TIFF/IT subtype-of TIFF 6.0
            • TIFF/IT/CT subtype-of TIFF/IT
            • TIFF/IT/CT/P1 subtype-of TIFF/IT/CT
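
One way to make these relationships machine actionable is to store them as typed links and answer subtype queries by walking the links. The sketch below transcribes the relationships from the list above; the dictionary layout and the function name is_subtype_of are assumptions for illustration, not part of the GDFR data model.

```python
# Each entry maps a format to (relationship, related format).
# The relationships come from the slide; the layout is hypothetical.
RELATIONS = {
    "GIF89a": ("new-version-of", "GIF87a"),
    "JFIF": ("subtype-of", "ISO 10918"),
    "TIFF 5.0": ("new-version-of", "TIFF 4.0"),
    "TIFF 6.0": ("new-version-of", "TIFF 5.0"),
    "TIFF/EP": ("subtype-of", "TIFF 6.0"),
    "TIFF/IT": ("subtype-of", "TIFF 6.0"),
    "TIFF/IT/CT": ("subtype-of", "TIFF/IT"),
    "TIFF/IT/CT/P1": ("subtype-of", "TIFF/IT/CT"),
}

def is_subtype_of(fmt: str, ancestor: str) -> bool:
    """Follow subtype-of links upward from fmt, looking for ancestor."""
    current = fmt
    while current in RELATIONS:
        relation, parent = RELATIONS[current]
        if relation != "subtype-of":
            # A new version is not a subtype, so the walk stops here.
            return False
        if parent == ancestor:
            return True
        current = parent
    return False

# Subtyping is transitive but not symmetric.
assert is_subtype_of("TIFF/IT/CT/P1", "TIFF 6.0")
assert not is_subtype_of("TIFF 6.0", "TIFF/IT")
```

Keeping new-version-of distinct from subtype-of matters here: in this sketch TIFF/IT is a subtype of TIFF 6.0, but not of TIFF 5.0.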

18
Format Subtyping
  • Substitutability
  • Can the subtype be substituted for its parent in
    all contexts without detection or loss of
    function?
  • All TIFF/ITs are TIFF 6.0s, but not all TIFF
    6.0s are TIFF/ITs
  • Arbitrary granularity of subtype
  • TIFF
  • TIFF 6.0
  • Baseline bitonal
  • DLF Benchmark for Faithful Digital
    Reproductions of Monographs and Serials
  • Harvard archival master specifications
  • Harvard Open Collection Program
    specifications

19
Format Subtyping
  • MIME types represent formats at the coarsest
    possible granularity
  • May not be sufficient for characterizing digital
    objects for purposes of preservation workflows
  • Representation information for a subtype need
    only detail the properties that distinguish it
    from its parent; all others are inherited
  • Permits selection of tools appropriate to the
    task at hand
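
The inheritance rule can be sketched with a layered lookup: the subtype record holds only its distinguishing properties, and anything else falls through to the parent. The property names and values below are hypothetical placeholders, not registry data.

```python
from collections import ChainMap

# Hypothetical representation information for a parent format.
tiff_6 = {
    "name": "TIFF 6.0",
    "magic_number": 42,
    "byte_order_marks": ("II", "MM"),
}

# The subtype records only what distinguishes it from its parent.
tiff_it_only = {
    "name": "TIFF/IT",
    "intended_use": "prepress image exchange",
}

# Lookups consult the subtype first, then fall back to the parent.
tiff_it = ChainMap(tiff_it_only, tiff_6)

assert tiff_it["name"] == "TIFF/IT"              # overridden by the subtype
assert tiff_it["magic_number"] == 42             # inherited from the parent
assert "intended_use" not in tiff_6              # parent is left unchanged
```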

20
Format Relationships
  • Subtyping
  • US-ASCII is a subtype of UTF-8
  • Version
  • PDF 1.0 through 1.5
  • Encapsulation
  • WAVE can contain A-law and μ-law audio content
    streams
  • Tar archive can contain anything
  • Affinity
  • ISO 10918-1 (JPEG) vs. ISO 10918-3 (SPIFF) vs.
    ISO 14495 (JPEG-LS)

21
Format Representation Information
  • Information that maps formatted content to more
    meaningful concepts
  • Syntax
  • A TIFF header is composed of a two-byte string,
    "II" or "MM"; a two-byte string, 0x2A00 or
    0x002A; and an unsigned 32-bit integer
  • Semantics
  • "II" indicates little-endian byte order; "MM",
    big-endian
  • The two-byte string is the decimal value 42 in
    the indicated byte order
  • The integer is the byte offset of the first IFD
    structure
  • Assessment
  • Factors bearing on a format's amenability for
    long-term preservation
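
The TIFF header syntax and semantics can be sketched as a small parser. In the TIFF 6.0 specification, "II" marks little-endian byte order and "MM" marks big-endian. The function name and return shape are assumptions for illustration.

```python
import struct
from typing import Tuple

def parse_tiff_header(header: bytes) -> Tuple[str, int]:
    """Return (byte order, offset of the first IFD) from a TIFF header."""
    if len(header) < 8:
        raise ValueError("a TIFF header is 8 bytes long")

    # Syntax: the first two bytes are the byte-order mark.
    order_mark = header[:2]
    if order_mark == b"II":
        endian, order_name = "<", "little-endian"
    elif order_mark == b"MM":
        endian, order_name = ">", "big-endian"
    else:
        raise ValueError("byte-order mark must be II or MM")

    # Semantics: a 16-bit value 42, then a 32-bit IFD offset,
    # both read in the byte order just identified.
    magic, ifd_offset = struct.unpack(endian + "HI", header[2:8])
    if magic != 42:
        raise ValueError("magic number must be 42")
    return order_name, ifd_offset

# Two encodings of the same header content: first IFD at offset 8.
assert parse_tiff_header(b"II\x2a\x00\x08\x00\x00\x00") == ("little-endian", 8)
assert parse_tiff_header(b"MM\x00\x2a\x00\x00\x00\x08") == ("big-endian", 8)
```

The syntax rules say which bytes to read; the semantic rules say what they mean. Both are needed to recover the IFD offset.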

22
GDFR Architecture
  • Not a single monolithic registry
  • A distributed network of cooperating registries
  • Standard protocol
  • Standard abstract data model
  • Implementation of the participating registries is
    not prescribed
  • Conformance is at the level of the protocol

23
Distributed Network of Cooperating Registries
24
Data Model
  • General descriptive properties, including
    canonical and alias identifiers for formats
  • Characterization properties, detailing the
    syntactic and semantic properties for formats
  • Processing properties, describing systems and
    services for which registered formats are inputs
    or outputs
  • Administrative properties, capturing important
    events in a registration's provenance
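
A rough sketch of how the four property groups might sit together in one record, with lookup by canonical or alias identifier. The class, field names, and the identifier "example:fmt/tiff-6.0" are hypothetical; the presentation does not prescribe a concrete schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class FormatRecord:
    # Descriptive properties
    canonical_id: str
    aliases: List[str] = field(default_factory=list)
    # Characterization properties (syntax, semantics, signatures)
    characterization: Dict[str, str] = field(default_factory=dict)
    # Processing properties (systems and services that read or write it)
    processing: List[str] = field(default_factory=list)
    # Administrative properties (events in the registration's provenance)
    events: List[str] = field(default_factory=list)

def resolve(records: List[FormatRecord], identifier: str) -> Optional[FormatRecord]:
    """Find a record by its canonical identifier or any of its aliases."""
    for record in records:
        if identifier == record.canonical_id or identifier in record.aliases:
            return record
    return None

registry = [
    FormatRecord(
        canonical_id="example:fmt/tiff-6.0",   # hypothetical identifier
        aliases=["TIFF 6.0"],
        characterization={"internal_signature": "II or MM, then 42"},
        events=["registered"],
    )
]

assert resolve(registry, "TIFF 6.0") is registry[0]
assert resolve(registry, "example:fmt/unknown") is None
```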

25
Data Model
26
Data Model Sources
  • ISO 14721, Open archival information system --
    Reference model
  • OCLC/RLG Preservation Metadata Framework
  • Incorporates CEDARS, NEDLIB, NLA, OAIS, OCLC,
    etc.
  • JISC File Format Representation and Rendering
    Project
  • PRONOM
  • ISO/IEC 11179, Specification and standardization
    of data elements
  • OASIS/ebXML Registry Information Model

27
Descriptive Properties
  • Identifiers
  • Canonical
  • Alias
  • Author
  • Owner
  • Maintainer
  • Standard agent properties
  • Name, title, affiliation, type, contact
    information
  • Ontological classification
  • Relationships
  • Status

28
Characterization Properties
  • Family
  • Specification
  • Bibliographic description
  • Title, edition, author, publisher, date
  • Identifiers
  • Type
  • Reference manual, technical report, standards
    document
  • Access regime
  • Signature
  • External - Nominal file extension, Mac OS file
    type
  • Internal - Magic number
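
Internal signatures support the identification use case directly. The sketch below matches the leading bytes of a stream against a few well-known magic numbers; the table is a small illustrative sample, and the function name identify is an assumption.

```python
from typing import Optional

# (leading bytes, format name) pairs; a sample, not a complete registry.
SIGNATURES = [
    (b"GIF87a", "GIF87a"),
    (b"GIF89a", "GIF89a"),
    (b"%PDF-", "PDF"),
    (b"II\x2a\x00", "TIFF (little-endian)"),
    (b"MM\x00\x2a", "TIFF (big-endian)"),
]

def identify(data: bytes) -> Optional[str]:
    """Return the first format whose magic number starts the stream."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return None

assert identify(b"GIF89a\x01\x00\x01\x00") == "GIF89a"
assert identify(b"%PDF-1.4\n") == "PDF"
assert identify(b"plain text") is None
```

An external signature such as a file extension is only nominal, so a check like this is the stronger evidence when the two disagree.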

29
Characterization Properties
  • Assessment
  • Library of Congress
  • Sustainability
  • Disclosure
  • Adoption
  • Transparency
  • Self-documentation
  • External dependencies
  • DRM
  • Quality and functionality
  • Cornell VRC, OCLC INFORM

30
Processing Properties
  • Systems and services that use formats as inputs
    or outputs
  • Name and version
  • Vendor
  • Function
  • Hardware/software dependencies

31
Service Model
  • Interoperation services, for communication
    between registries conforming to the GDFR
    protocol
  • Local services, which can include:
  • Access services, providing discovery and delivery
    of format representation information
  • Management services, providing mechanisms for
    maintenance, technical review, and notification
  • Human and machine service interfaces

32
Service Model Sources
  • ANSI X3.285, Metamodel for Management of
    Shareable Data
  • OASIS/ebXML Registry Services Specification

33
Interoperation
  • Registration
  • Review
  • IETF RFC process
  • Synchronization
  • OAI
  • LOCKSS
  • Some information may not be replicated, either by
    matter of local policy or due to access
    restrictions

34
Access Services
  • Discovery
  • Local
  • Global
  • Delivery

35
Management Services
  • Maintenance
  • Create, update, delete
  • Notification
  • Tell me when an event of interest to me occurs
  • Introspection
  • Public exposure of local services, policies, and
    practices

36
What Happens Next?
  • Prototype system demonstrating provisional data
    model
  • Multi-year, two-track project

37
FRED: A Format Registry Demonstration
38
GDFR Technical Track
  • Deliverables
  • Data model
  • Network protocol
  • Reference implementation
  • Initial population
  • Schedule
  • Year one: Analysis, design, and prototype
  • Year two: Development and deployment
  • Year three: Production operation and integration
    with repository workflows

39
GDFR Administrative Track
  • Deliverables
  • Recommendations for sustainable governance
    structure and business model
  • Schedule
  • Year one: Analysis and consultation
  • Year two: White papers and consultation
  • Year three: Final recommendations

40
Why is This Important to You?
  • The GDFR is an enabling technology underlying
    digital repository operations and preservation
    activities
  • It permits typing of digital objects at an
    appropriate level of granularity
  • It enables the future recovery of the syntax and
    semantics associated with typed digital objects
  • It provides a mechanism to pool and redistribute
    the expertise of the digital preservation
    community

41
More Information
hul.harvard.edu/gdfr/
tom.library.upenn.edu/fred/
stephen_abrams_at_harvard.edu