Title: John Kunze
1 Towards a Culture of Digital Preservation
22 June 2006
- John Kunze
- California Digital Library
- University of California
2 What's digital preservation?
- Storing digital objects while retaining a balance
of usability and faithfulness to their creators'
original intentions
4 Kinds of loss
- Hard loss - some or all data bits are missing
- Soft loss - bits are still somewhere, we think
- Syntactic loss - bits are there, but the format
cannot be rendered by software
- Semantic loss - data renderable without apparent
error, but not understandably
- Legal loss - data format or data itself is
legally encumbered
5 Preservation of an abstraction
- All our experience of digital data is
intermediated by software, which produces
different experiences depending on things like
- Algorithms, browser configurations, window size,
local fonts, network traffic, client network
location, randomized ads, time of day, etc.
- Which of those is relevant or important to save?
- Can we even formulate Best Practices?
- Collaboratively we could probably formulate Not
Bad Practices
6 Digital preservation is hard
- The goal is to safeguard information's
- Viability (intact bit streams)
- Renderability (by machines)
- Understandability (by humans)
- Some loss is unavoidable, so prioritize
- What content do we care about most?
- Where will our money go furthest?
- When to be perfect? When to do triage?
- How to make this easier?
7 Short is long
- Best indicator of the future is the recent past
- Your skilled computer system administrator is a
great adviser on practical long-term preservation
- Maintaining, monitoring, converting, upgrading,
and migrating user files and software on multiple
hardware/software platforms several times a year
develops keen insights into long-term maintenance
- Long-term is like short-term, only more so
- Reduce dependencies to simplify maintenance
8 Depend less
- Technical dependencies won't all go away, as we
can't experience the bits without them
- Why paper lives 1000 years
- Why plain e-text has lasted 35 years
- But which technical dependencies to drop first?
- Diffuse tools will be fixed before you notice
(e.g., browser, operating system, DNS, Internet,
Acrobat)
- Narrow, simple tools are fixable in the community
- Narrow, complex tools (e.g., Handle) are risky
9 A culture of collaboration
- Sharing the burden with other institutions
- Technical challenges: protocols, metadata
formats
- Socio-organizational challenges: procedures,
policies, redundancy
- Political challenges: awareness, alliances,
funding
- Think consortially, act locally
10 Policy collaboration
- Different flavors of preservation exist, not just
between organizations, but for different kinds of
objects within one collection
- Preservation is nuanced, not on or off
- What's your policy? Standards are needed
- Commitment statements
- Permanence ratings (e.g., US NLM)
- Rights declarations
- Trusted Digital Repositories: Attributes and
Responsibilities (RLG)
- Organizations are surprisingly cautious about
their commitments to preservation
11 Objects and surrogates
- Surrogates provide a time-honored way of avoiding
the inconvenience of directly handling objects
- Surrogates are usually much smaller, e.g., a
catalog card
- Surrogates may be necessary, e.g., when the object
is legally encumbered or in a language you don't
understand
- Surrogates can be much more uniform (for easier
processing) than objects
- Every system has surrogates, even if dynamically
generated
- A surrogate serves as a tool to help us find,
use, and manage information objects, for example,
- Find an image by the photographer's name
- Verify from record details that you want to
purchase
- Troubleshoot processing errors
12 Metadata and protocols
- A surrogate is essentially a metadata record for
an object
- The data in the record is metadata
- Metadata is structured data about an object
- When structured, data assists automation by
making it easy to recognize and record individual
data elements
- The more uniform, the more leverage for
interoperation
- Automation + Interoperation -> Protocol
- Protocols are key to technical collaboration,
from federated search to simple object exchange
between institutions
- Our collaboration is limited by our protocols
13 Simple protocols aren't so simple
[Diagram labels: functionality, simplicity]
- In the beginning: TCP/IP, Email headers, HTTP,
NNTP
- Expanding functionality: OSI, Z39.50, CORBA,
SOAP
- Contracting complexity: OpenSearch, RSS,
SRW/SRU, OAI
- How are we doing in June 2006, at least in
digital libraries?
- OAI (low barrier) failures attributed to errors
in XML coding and schemas, and to poor,
inconsistent, and expensive metadata, with
surrogates too non-uniform to be of much use [CL]
14 Simple metadata isn't so simple
- Dublin Core: 15 elements thought to apply to
almost any object
- Despite efforts to correct known problems, the
simplest protocol with the simplest metadata,
OAI, reports an overall 36% failure rate, 77%
due to metadata/encoding and protocol errors
15 Simplest Dublin Core metadata
    <?xml version="1.0"?>
    <!DOCTYPE rdf:RDF PUBLIC "-//DUBLIN CORE//DCMES DTD 2002/07/31//EN"
        "http://dublincore.org/documents/2002/07/31/dcmes-xml/dcmes-xml-dtd.dtd">
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:dc="http://purl.org/dc/elements/1.1/">
      <rdf:Description rdf:about="http://www.nap.edu/books/0309064996/html/">
        <dc:title>The Digital Dilemma</dc:title>
        <dc:creator>National Research Council</dc:creator>
        <dc:date>2000-06-22</dc:date>
      </rdf:Description>
    </rdf:RDF>
- Collaboration is hard enough
- What technical choices make our collaboration
harder?
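To make the cost of this encoding concrete, here is a minimal sketch of what a consumer has to do just to read three fields back out of the record above. It uses Python's standard xml.etree.ElementTree; the helper name dc_fields is hypothetical, and the DOCTYPE line is left out of the sample string on purpose so the sketch does not depend on how a given parser treats the external DTD.

```python
import xml.etree.ElementTree as ET

# Namespace URIs in ElementTree's "{uri}localname" notation.
DC = "{http://purl.org/dc/elements/1.1/}"
RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"

# The record from the slide, minus the DOCTYPE declaration (an assumption
# made to keep the sketch independent of DTD handling).
RECORD = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.nap.edu/books/0309064996/html/">
    <dc:title>The Digital Dilemma</dc:title>
    <dc:creator>National Research Council</dc:creator>
    <dc:date>2000-06-22</dc:date>
  </rdf:Description>
</rdf:RDF>
"""


def dc_fields(xml_text):
    """Return the Dublin Core elements of the first rdf:Description as a dict.

    Hypothetical helper for illustration only; it ignores repeated elements.
    """
    root = ET.fromstring(xml_text)
    description = root.find(RDF + "Description")
    fields = {"about": description.get(RDF + "about")}
    for child in description:
        if child.tag.startswith(DC):
            # Strip the namespace prefix to get the bare element name.
            fields[child.tag[len(DC):]] = child.text
    return fields


fields = dc_fields(RECORD)
```

Two namespace URIs must match character for character before any value can be read, which is one plausible source of the encoding errors the previous slide mentions.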
16 Same record with Dublin Kernel
- Here's the same information, still
machine-readable, as an Electronic Resource
Citation (ERC) with Kernel metadata
    erc:
    who: National Research Council
    what: The Digital Dilemma
    when: 2000
    where: http://books.nap.edu/html/digital%5Fdilemma
- The same information again, in its most compact
form
    erc: National Research Council
    | The Digital Dilemma | 2000
    | http://books.nap.edu/html/digital%5Fdilemma
- The ERC format is the basis of a simple exchange
protocol
- Designed for little more than orderly management,
it frees up technical resources for more
interesting work
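For comparison, a labeled ERC record can be read with a few lines of plain string handling. The sketch below assumes the usual "label: value" layout, one element per line after an "erc:" header line; parse_erc is a hypothetical helper that covers only this simple case, not the full ERC syntax (no continuation lines, no compact form).

```python
def parse_erc(text):
    """Parse a labeled ERC record into a dict of label -> value.

    Illustrative only: assumes one "label: value" pair per line.
    """
    record = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "erc:":
            continue  # skip blank lines and the record header
        # Split on the first colon only, so URLs in the value stay intact.
        label, sep, value = line.partition(":")
        if not sep:
            continue  # not a labeled line; ignore it in this sketch
        record[label.strip()] = value.strip()
    return record


SAMPLE = """erc:
who: National Research Council
what: The Digital Dilemma
when: 2000
where: http://books.nap.edu/html/digital%5Fdilemma
"""

record = parse_erc(SAMPLE)
```

No namespaces, schema, or XML parser are involved, which is the point the slide makes about freeing technical resources.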
17 Metadata and collaboration
- Complex metadata might not be needed, but simple
metadata will be needed
- Non-text-based content is still hard to index
automatically
- Deliberately handled content implies orderly
management
- Collaboration: seamlessly federated collections
require metadata agreements/standards
- Collaboration: easy exchanges (import/export)
require metadata agreements/standards
18 Preservation and collaboration
- Preservation basics
- Collect the bits before they disappear
- Keep redundant copies of the bits (replication)
- Best if held at sites of independent
collaborators
- Stand-alone solutions: how to distinguish them?
- Further collaborative approaches
- Vendor education
- Building trust and reputation
- Persistent identifiers (URLs)
19 Collaborative vendor education
- Consortial approach to creating best practices
for content creators and submitters
- Educating vendors with one voice, for example,
- Adobe PDF/A (A = Archival) format
- Apache web server error codes
- Book scanning
- MS Word metadata formats
20 Collaborative trust-building
- How is trust in a repository established?
- One way is with audits and checklists
- Expensive and requires you to trust me
- Another way is to form a consortium of mutually
monitoring institutions
- Membership requires you to monitor and to be
monitored
- Monitoring involves limited access to your
objects by other consortium members, who publish
their findings
- All monitoring (e.g., link checking, checksum
validation) and findings publication is automated
by consortial tools
- The concept is to provide a cheap way for you to
ask members whether to trust a repository
- This is a reputation system, not a policing system
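The checksum-validation part of such monitoring can be sketched in a few lines. This is an assumption-laden illustration, not a description of any consortial tool: the site names, the choice of SHA-256, and the functions sha256_of and audit are all hypothetical, and the copies are passed in as bytes rather than fetched over the network.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def audit(copies, expected):
    """Compare each site's copy against the expected digest.

    copies maps a (hypothetical) site name to the bytes that site returned;
    the result maps each site name to True if its copy is intact.
    """
    return {site: sha256_of(data) == expected for site, data in copies.items()}


original = b"The Digital Dilemma"
expected = sha256_of(original)

findings = audit(
    {
        "site-a": original,                # intact replica
        "site-b": original[:-1] + b"X",    # replica with one damaged byte
    },
    expected,
)
```

Publishing a findings table like this one, rather than enforcing anything, is what makes the scheme a reputation system.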
21 Collaborative persistent identifiers (URLs)
- URLs break because
- Either the URL's server host goes away
- Or the URL's path part ceases to work
- All these can be traced back to
- Provider ignorance, in which case, educate
- Provider disappearance with no successor
organization
- Consortial prevention and rescue strategy
- Join supportive consortium, for example, all the
cultural memory organizations represented in this
room
- Small, vulnerable organizations find a consortial
successor
- All organizations publish URLs under one
hostname, e.g., http://id.archive.org/12345/
- First part of path (12345) uniquely identifies
the organization
22 Collaborative global resolver
- Uses ordinary redirects, one per organization,
that send all incoming requests to the
institutional resolver
- A few hundred cultural heritage institutions
- URN/DOI/Handle problem solved without liabilities
of complex, proprietary, or special-purpose
infrastructure
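[Diagram: User, web browser, global resolver,
institutional resolver, final sub-server, final web
server. Steps: 1. initiate; 2. URLa to global
resolver; 3. URLb (redirect); 4. URLb to
institutional resolver; then one path or the other:
5. final sub-server returns the page, or 6. URLc to
final web server and 7. page; 8. display]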
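The global resolver's job in this flow is small enough to sketch: read the first path segment, look up that organization's resolver, and answer with a redirect. The registry contents, the example.org and example.edu hostnames, and the function redirect_target are hypothetical, and forwarding only the remainder of the path is an assumption; a real deployment might pass the full path instead.

```python
from typing import Optional

# Hypothetical registry: organization prefix -> institutional resolver base URL.
INSTITUTIONAL_RESOLVERS = {
    "12345": "http://resolver.example.org/",
    "13030": "http://resolver.example.edu/",
}


def redirect_target(path: str) -> Optional[str]:
    """Map a global-resolver path to the institutional URL to redirect to.

    Returns None when the first path segment names no known organization,
    in which case a server would typically answer 404.
    """
    org, _, rest = path.lstrip("/").partition("/")
    base = INSTITUTIONAL_RESOLVERS.get(org)
    if base is None:
        return None
    return base + rest


target = redirect_target("/12345/libraries/visitor.html")
unknown = redirect_target("/99999/anything")
```

Because the table holds one entry per organization rather than one per object, a few hundred institutions need only a few hundred rows.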
23 Generalize to a per-object resolver
- Deals with namespace splitting neglected by
URNs/Handles
- Per-object resolution -- fastest, simplest
architecture
- Can still do per-resolver redirects (DOIs,
Handles, ...)
- No browser mods, no infrastructure to carry
forward
- Puts pressure on scaling (resolution, harvesting)
[Diagram: User, web browser, global resolver, final
web server. Steps: 1. initiate; 2. URLa to global
resolver; 3. URLb; 4. URLb to final web server;
5. page; 8. display]
24 Mirrored resolver clusters
- Regional (e.g., Europe, Asia, North America)
clusters of mirrored resolver instances
- Round-robin failover for redundancy,
fault-tolerance, and load-sharing
[Diagram: User and web browser connected to four
mirrored resolver instances]
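Round-robin failover can be sketched as a rotating pointer that skips instances reported as down. Everything named here is hypothetical: the ResolverCluster class, the single-letter instance names, and the is_up callback, which stands in for whatever health check a real cluster would use.

```python
class ResolverCluster:
    """Pick mirrored resolver instances in rotation, skipping ones that are down."""

    def __init__(self, instances, is_up):
        self._instances = list(instances)
        self._is_up = is_up      # callback: instance name -> bool
        self._next = 0           # index where the next search starts

    def pick(self):
        """Return the next healthy instance, or None if all are down."""
        count = len(self._instances)
        for offset in range(count):
            index = (self._next + offset) % count
            candidate = self._instances[index]
            if self._is_up(candidate):
                # Start after this instance next time, to share the load.
                self._next = (index + 1) % count
                return candidate
        return None


down = {"b"}
cluster = ResolverCluster(["a", "b", "c"], lambda name: name not in down)
picks = [cluster.pick() for _ in range(4)]
```

Rotation gives the load-sharing, and skipping failed instances gives the fault-tolerance; redundancy comes from every instance holding the same mirrored table.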
25 Example global id resolver
- Sample identifiers at id.archive.org -- these
work
- http://id.archive.org/12345/libraries/visitor.html
- http://id.archive.org/13030/inside
- http://id.archive.org/urn:nbn:se:uu:diva-3324
- http://id.archive.org/ark:/13030/tf5p30086k
- It can also redirect URNs, DOIs, Handles, e.g.,
- http://id.archive.org/doi:10.1111/j.0307-6946.2004.00571.x
26 Conclusions
- Digital preservation is hard, but easy to mess up
- Basic strategies
- Simplicity
- Reduction of narrow, complex dependencies
- Short-term informs long-term
- Successful preservation will be the work of many
- Consortia of collaborating repositories
- Multiple copies of digital objects
- Transparent repository auditing practices
- Think consortially, act locally
27 John.Kunze_at_ucop.edu
- www.cdlib.org/programs/digital_preservation.html