Title: The Role of Format Registries in Digital Preservation
1 The Role of Format Registries in Digital Preservation
Archiving Web Resources: Issues for Cultural Heritage Institutions
National Library of Australia, Canberra, November 9-11, 2004
- Stephen L. Abrams
- Digital Library Program Manager
- Harvard University Library
2 Introduction
- Almost all aspects of repository operation are conditioned by the format of the objects in the repository
- Without proper characterization of digital objects (format typing and technical metadata), effective long-term preservation is difficult, if not impossible
- Repositories need to ensure that:
  - Digital object content streams are valid with respect to their format
  - Metadata encapsulated within object content streams are consistent with externally supplied metadata
  - Formatted content streams remain accessible over time
3 What is a Format, Anyway?
- A reversible byte-serialized encoding of an information model
- A set of syntactic and semantic rules that:
  - Map from abstract content to a sequence of bytes
  - Map back from a sequence of bytes to the abstract content represented by those bytes
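As a minimal sketch of this definition, assume a toy information model (a point with two integer coordinates and a text label) and a made-up byte layout; neither is taken from any real format. The `encode`/`decode` pair plays the role of the two mappings, and the round trip at the end illustrates what "reversible" means here:

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledPoint:
    """Hypothetical information model: two integers and a text label."""
    x: int
    y: int
    label: str


# Hypothetical layout: two big-endian int32, a uint16 label length, then the label.
_FIXED = ">iiH"


def encode(point: LabeledPoint) -> bytes:
    """Map abstract content to a sequence of bytes."""
    raw_label = point.label.encode("utf-8")
    return struct.pack(_FIXED, point.x, point.y, len(raw_label)) + raw_label


def decode(data: bytes) -> LabeledPoint:
    """Map a sequence of bytes back to the abstract content it represents."""
    x, y, length = struct.unpack_from(_FIXED, data, 0)
    start = struct.calcsize(_FIXED)
    label = data[start:start + length].decode("utf-8")
    return LabeledPoint(x, y, label)


original = LabeledPoint(3, -7, "origin")
assert decode(encode(original)) == original  # the encoding is reversible
```

Without knowledge of the layout rules, the output of `encode` is only a run of bytes; the rules are what make the content recoverable.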
4 Without Format Typing, All Content is Opaque
ffd8ffe000104a464946000102010083 00830000ffed0fb05
0686f746f73686f 7020332e30003842494d03e90a507269 6
e7420496e666f000000007800000000 004800480000000002
f40240ffeeffee 030602520347052803fc000200000048 00
480000000002d80228000100000064 0000000100030303000
00001270f0001 00010000000000000000000000006008 001
90190000000000000000000000000 00000000000000000000
000000000000 000000003842494d03ed0a5265736f6c 7574
696f6e0000000010008313a30002 0001008313a3000200013
842494d040d 18465820476c6f62616c204c69676874 696e6
720416e676c650000000004000
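The opening bytes of the dump above become interpretable once they are matched against a known signature. The following is a minimal sketch, assuming a small hand-written signature table (hypothetical, not drawn from any registry), of how the first 16 bytes can be typed:

```python
# First 16 bytes of the dump shown on the slide.
HEADER = bytes.fromhex("ffd8ffe000104a464946000102010083")

# Hypothetical, hand-written signature table: (offset, magic bytes, format name).
SIGNATURES = [
    (0, b"\xff\xd8\xff", "JPEG"),
    (0, b"GIF87a", "GIF"),
    (0, b"GIF89a", "GIF"),
    (0, b"%PDF-", "PDF"),
]


def identify(data: bytes) -> str:
    """Return the first format whose magic bytes match, else 'unknown'."""
    for offset, magic, name in SIGNATURES:
        if data[offset:offset + len(magic)] == magic:
            return name
    return "unknown"


print(identify(HEADER))               # JPEG
print(HEADER[6:10].decode("ascii"))   # JFIF
```

A signature match is only identification; it says nothing yet about whether the rest of the stream is valid for that format.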
6 Use Cases
- Identification
  - I have an object: what format is it?
- Validation
  - I have an object purportedly of format F: is it?
- Characterization
  - I have an object of format F: what are its salient properties?
- Assessment
  - I have an object of format F: is it at risk of obsolescence?
- Processing
  - I have an object of format F: how can I perform operation X on it?
7 Repository Format Dependencies
Based on the Open Archival Information System (OAIS) Reference Model, ISO 14721
8 Preservation Strategy Dependencies
- Migration
  - Transform object content from format F, supported by yesterday's platform, to format G, supported by tomorrow's platform
- Emulation
  - Ensure that yesterday's platform continues to work tomorrow
  - Recreate the behavior of yesterday's platform in the context of tomorrow's platform
  - Recreate the behavior of yesterday's platform in the context of a Universal Virtual Computer (UVC)
9 Institutional Archives
- Often under an obligation to accept material of unknown provenance
- Library of Congress Archive Ingest and Handling Test (AIHT)
  - Investigate issues surrounding the transfer of digital collections between institutions
  - Test corpus is the George Mason University 9/11 archive
    - 57,000 files (13 GB)
    - Collected via submission and web harvesting
    - 97% of all files are in 9 formats: AIFF, ASCII, GIF, HTML, JPEG, PDF, TIFF, WAVE, XML
    - The remaining 3% are in 100 formats (as indicated by file extension)
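A format profile of this kind can be approximated by tallying file extensions. The sketch below is an illustration only: the file names are hypothetical (not the AIHT corpus), and an extension is merely an indication of format, not an identification.

```python
from collections import Counter
from pathlib import PurePath


def extension_profile(paths):
    """Count files per lower-cased extension; files without one are grouped under ''."""
    return Counter(PurePath(p).suffix.lower().lstrip(".") for p in paths)


def coverage(profile, top_n):
    """Fraction of all files covered by the top_n most common extensions."""
    total = sum(profile.values())
    if total == 0:
        return 0.0
    return sum(count for _, count in profile.most_common(top_n)) / total


# Hypothetical file names for illustration.
sample = ["a.jpg", "b.JPG", "c.html", "d.pdf", "e.jpg", "notes"]
profile = extension_profile(sample)
print(profile["jpg"])        # 3
print(coverage(profile, 1))  # 0.5
```

Applied to a real corpus, `coverage(profile, 9)` would give the share of files accounted for by the nine most common extensions; the long tail of rare extensions is where identification by content becomes necessary.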
10 Format Representation Information
- Information that maps formatted content to more meaningful concepts
- Syntax
  - A TIFF header is composed of a two-byte string, "II" or "MM"; a two-byte string, 0x2A00 or 0x002A; and an unsigned 32-bit integer
- Semantics
  - "II" indicates little-endian byte order; "MM", big-endian
  - The two-byte string is the decimal value 42 in the indicated byte order
  - The integer is the byte offset of the first IFD structure
- Assessment
  - Factors bearing on a format's amenability for long-term preservation
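The TIFF syntax and semantics rules above are concrete enough to turn into code. The following is a minimal sketch using Python's standard `struct` module; it follows the TIFF 6.0 convention that "II" marks little-endian and "MM" big-endian byte order. The function name and error messages are illustrative choices, not part of any existing tool.

```python
import struct


def parse_tiff_header(data: bytes):
    """Return (byte_order, first_ifd_offset) from the 8-byte TIFF header.

    Raises ValueError if the bytes do not follow the header rules.
    """
    if len(data) < 8:
        raise ValueError("a TIFF header needs at least 8 bytes")
    order_mark = data[:2]
    if order_mark == b"II":
        endian, byte_order = "<", "little-endian"
    elif order_mark == b"MM":
        endian, byte_order = ">", "big-endian"
    else:
        raise ValueError("byte-order mark is neither II nor MM")
    # Unsigned 16-bit magic number followed by unsigned 32-bit IFD offset.
    magic, first_ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("magic number is not 42")
    return byte_order, first_ifd_offset


print(parse_tiff_header(b"II\x2a\x00\x08\x00\x00\x00"))  # ('little-endian', 8)
print(parse_tiff_header(b"MM\x00\x2a\x00\x00\x00\x08"))  # ('big-endian', 8)
```

The syntax rules decide how to slice the bytes; the semantics rules decide what the slices mean. Both have to be recorded somewhere if the header is to remain interpretable.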
11 Library of Congress Assessment Model
- Sustainability
  - Disclosure
  - Adoption
  - Transparency
  - Self-documentation
  - External dependencies
  - DRM
- Quality and functionality
12 A New Generation of Format-Aware Tools
- JHOVE: JSTOR/Harvard Object Validation Environment
  - Format-specific object identification, validation, and characterization
  - Modules for AIFF, ASCII, GIF, HTML, JPEG, JPEG 2000, PDF, TIFF, UTF-8, WAVE, XML
- NLNZ Preservation Metadata Extraction Tool
  - Adaptors for BMP, GIF, HTML, JPEG, MS Office, OpenOffice, PDF, TIFF, WAVE, WordPerfect
  - Short-listed for Pilgrim Trust Conservation Award
- A great deal of knowledge about formats is encapsulated in these tools: where does this representation information come from?
13 The Harvard Format Registry
15 Format Representation Information Sources
- There are lots of sources, but for the most part they are informal, inconsistent, and ephemeral
16 Diffuse Web Site c/o Internet Archive
17 Digital Formats for Library of Congress Collections
18 What's Wrong with MIME Types?
- Level of detail
- Level of disclosure
- Level of granularity
- Non-actionable
19 What's Wrong with MIME Types?
MIME TYPE NAME: application
MIME SUBTYPE NAME: msword
REQUIRED PARAMETERS: none
OPTIONAL PARAMETERS: An optional version parameter can be specified. Some of the more common versions are:
  4   Microsoft Word 4.0 for the Macintosh.
  5   Microsoft Word 5.0 and 5.1 for the Macintosh.
  2w  Microsoft Word for Windows 2.0
  6   Microsoft Word 6 for Windows and Macintosh platform independent format (coming soon)
ENCODING CONSIDERATIONS: Microsoft word files are in a binary format. Some encoding will be necessary for MIME mailers as in application/octet-stream. Microsoft Word files for the Macintosh are encoded in the data fork of a macintosh file. The type creator is MSWD, the file type is WDBN. Microsoft Word files that contain external data references such as publish subscribe services are explicitly not allowed.
SECURITY CONSIDERATIONS: None known.
PUBLISHED SPECIFICATION: Specification by example: From any microsoft word application select "Save As..." from the "File" menu. Enter a filename, make sure that "Normal" is specified for the file type, and click "Save".
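To make the granularity point concrete: the most a consumer can recover from a MIME type is the type/subtype pair and, if the sender happened to supply it, the optional version parameter. The sketch below uses Python's standard `email` package; the header values passed in are hypothetical examples, and the helper name is an illustrative choice.

```python
from email.message import Message


def media_type_and_version(content_type: str):
    """Split a Content-Type value into (type/subtype, version parameter or None)."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_type(), msg.get_param("version")


print(media_type_and_version("application/msword; version=6"))
# ('application/msword', '6')
print(media_type_and_version("application/msword"))
# ('application/msword', None)
```

Even in the best case the result is a short version token; nothing in it describes the syntax or semantics of the format, which is why the registration is not actionable on its own.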
21 Characteristics of a Format Registry
- Predictable data
- Arbitrary granularity
- Inclusive
- Trustworthy
- Authoritative
- Honest broker with regard to proprietary information
- Machine actionable
- Interoperable
- Informative, not evaluative
22 So, When Will Any of This Happen?
- And will it be in time to be useful for existing
at-risk digital assets?
23 National Archives (UK) PRONOM
24 Global Digital Format Registry
- DLF funded two invitational workshops in 2002 to investigate issues surrounding the establishment of a GDFR
  - Bibliothèque nationale de France, California Digital Library, Digital Library Federation, Harvard University, Internet Engineering Task Force, JISC, JSTOR, Library of Congress, MIT, National Archives (UK), NARA, National Archives of Canada, New York University, NIST, Online Computer Library Center, Research Libraries Group, Stanford University, University of Pennsylvania
25 GDFR Architecture
- Not a single monolithic registry
- A distributed network of cooperating registries
- Standard protocol
- Standard abstract data model
26 Distributed Network of Cooperating Registries
27 Data Model
- General descriptive properties, including canonical and alias identifiers for formats
- Characterization properties, detailing the syntactic and semantic properties for formats
- Processing properties, describing systems and services for which registered formats are inputs or outputs
- Administrative properties, capturing important events in a registration's provenance
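The four property groups can be pictured as fields of a single registry record. The sketch below is purely illustrative: the class names, field names, and the `example:tiff-6.0` identifier are assumptions made for this example, not the GDFR data model itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FormatRecord:
    """Hypothetical registry entry with the four property groups listed above."""
    # General descriptive properties
    canonical_id: str
    aliases: List[str] = field(default_factory=list)
    # Characterization properties (syntax and semantics), here just free text
    syntax_notes: str = ""
    semantics_notes: str = ""
    # Processing properties: systems or services that take the format as input or output
    readers: List[str] = field(default_factory=list)
    writers: List[str] = field(default_factory=list)
    # Administrative properties: events in the registration's provenance
    events: List[str] = field(default_factory=list)


class Registry:
    """In-memory lookup by canonical or alias identifier."""

    def __init__(self):
        self._by_id = {}

    def register(self, record: FormatRecord, note: str = "registered"):
        """Record a provenance event and index the record under all its identifiers."""
        record.events.append(note)
        for identifier in [record.canonical_id, *record.aliases]:
            self._by_id[identifier] = record

    def resolve(self, identifier: str):
        """Return the record for a canonical or alias identifier, or None."""
        return self._by_id.get(identifier)


registry = Registry()
registry.register(FormatRecord("example:tiff-6.0",          # made-up identifier
                               aliases=["image/tiff", "TIFF"],
                               readers=["JHOVE"]))
print(registry.resolve("image/tiff").canonical_id)  # example:tiff-6.0
```

Resolving an alias such as a MIME type to a canonical identifier is one way a registry can offer finer granularity than the alias alone provides.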
28 Service Model
- Management services, providing mechanisms for maintenance, technical review, and notification
- Access services, providing discovery and delivery of format representation information
- Interoperation/synchronization
29 FRED: A Format Registry Demonstration
30 GDFR Technical Track
- Deliverables
  - Data model
  - Network protocol
  - Editorial process
  - Reference implementation
  - Initial population
- Schedule
  - Year one: Analysis, design, and prototype
  - Year two: Development and deployment
  - Year three: Production operation and integration with repository workflows
31 GDFR Administrative Track
- Deliverables
  - Recommendations for sustainable governance structure and business model
- Schedule
  - Year one: Analysis and consultation
  - Year two: White papers and consultation
  - Year three: Final recommendations
32 Why is This Important to You?
- The GDFR is an enabling technology underlying digital repository operations and preservation activities
- It permits typing of digital objects at an appropriate level of granularity
- It enables the future recovery of the syntax and semantics associated with typed digital objects
- It provides a mechanism to pool and redistribute the expertise of the digital preservation community
33 More Information
- OAIS/ISO 14721 <www.ccsds.org/CCSDS/documents/650x0b1.pdf>
- UVC <www-5.ibm.com/nl/dias/preservation2.html>
- LC Assessment Model <www.digitalpreservation.gov/>
- JHOVE <hul.harvard.edu/jhove/>
- NLNZ <www.natlib.govt.nz/files/Project%20Description_v3-final.pdf>
- Diffuse <web.archive.org/web/20030128052128/http://www.diffuse.org/>
- IANA MIME registry <www.iana.org/assignments/media-types/>
- PRONOM <www.nationalarchives.gov.uk/PRONOM/>
- GDFR <hul.harvard.edu/gdfr/>
- FRED <tom.library.upenn.edu/fred/>
- <stephen_abrams_at_harvard.edu>