John Kunze - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: John Kunze


1
Towards a Culture of Digital Preservation
22 June 2006
  • John Kunze
  • California Digital Library
  • University of California

2
What's digital preservation?
  • Storing digital objects while retaining a balance
    of usability and faithfulness to their creators'
    original intentions

4
Kinds of loss
  • Hard loss - some or all data bits are missing
  • Soft loss - bits are still somewhere, we think
  • Syntactic loss - bits are there, but format
    cannot be rendered by software
  • Semantic loss - data renderable without apparent
    error, but not understandably
  • Legal loss - data format or data itself is
    legally encumbered

5
Preservation of an abstraction
  • All our experience of digital data is
    intermediated by software, which produces
    different experiences depending on things like
    algorithms, browser configurations, window size,
    local fonts, network traffic, client network
    location, randomized ads, time of day, etc.
  • Which of those is relevant or important to save?
  • Can we even formulate Best Practices?
  • Collaboratively we could probably formulate Not
    Bad Practices

6
Digital preservation is hard
  • The goal is to safeguard the information's
    viability (intact bit streams), renderability
    (by machines), and understandability (by humans)
  • Some loss is unavoidable, so prioritize
  • What content do we care about most?
  • Where will our money go furthest?
  • When to be perfect? When to do triage?
  • How to make this easier?

7
Short is long
  • Best indicator of the future is the recent past
  • Your skilled computer system administrator is a
    great adviser on practical long-term preservation
  • Maintaining, monitoring, converting, upgrading,
    and migrating user files and software on multiple
    hardware/software platforms several times a year
    develops keen insights into long-term maintenance
  • Long-term is like short-term, only more so
  • Reduce dependencies to simplify maintenance

8
Depend less
  • Technical dependencies won't all go away, as we
    can't experience the bits without them
  • Why paper lives 1000 years
  • Why plain e-text has lasted 35 years
  • But which technical dependencies to drop first?
  • Diffuse tools will be fixed before you notice
    (eg, browser, operating sys, DNS, Internet,
    Acrobat)
  • Narrow, simple tools are fixable in the community
  • Narrow, complex tools (eg, Handle) are risky

9
A culture of collaboration
  • Sharing the burden with other institutions
  • Technical challenges: protocols, metadata
    formats
  • Socio-organizational challenges: procedures,
    policies, redundancy
  • Political challenges: awareness, alliances,
    funding
  • Think consortially, act locally

10
Policy collaboration
  • Different flavors of preservation exist, not just
    between organizations, but for different kinds of
    objects within one collection
  • Preservation is nuanced, not on or off
  • What's your policy? Standards are needed
  • Commitment statements
  • Permanence ratings (e.g., US NLM)
  • Rights declarations
  • Trusted Digital Repositories: Attributes and
    Responsibilities (RLG)
  • Organizations are surprisingly cautious about
    their commitments to preservation

11
Objects and surrogates
  • Surrogates provide a time-honored way of avoiding
    the inconvenience of directly handling objects.
  • Surrogates are usually much smaller, eg, a
    catalog card
  • Surrogates may be necessary, eg, when the object
    is legally encumbered or in a language you don't
    understand
  • Surrogates can be much more uniform (for easier
    processing) than objects
  • Every system has surrogates, even if dynamically
    generated
  • A surrogate serves as a tool to help us find,
    use, and manage information objects, for example,
  • Find an image by the photographer's name
  • Verify from record details that you want to
    purchase
  • Trouble-shoot processing errors

12
Metadata and protocols
  • A surrogate is essentially a metadata record for
    an object
  • The data in the record is metadata
  • Metadata is structured data about an object
  • When structured, data assists automation by
    making it easy to recognize and record individual
    data elements
  • The more uniform, the more leverage for
    interoperation
  • Automation + Interoperation → Protocol
  • Protocols are key to technical collaboration,
    from federated search to simple object exchange
    between institutions
  • Our collaboration is limited by our protocols

13
Simple protocols aren't so simple
[Figure axes: functionality, simplicity]
  • In the beginning: TCP/IP, Email headers, HTTP,
    NNTP
  • Expanding functionality: OSI, Z39.50, CORBA,
    SOAP
  • Contracting complexity: OpenSearch, RSS,
    SRW/SRU, OAI
  • How are we doing in June 2006, at least in
    digital libraries?
  • OAI (low barrier): failures attributed to errors
    in XML coding and schemas, and to poor,
    inconsistent, and expensive metadata, with
    surrogates too non-uniform to be of much use CL CL

14
Simple metadata isn't so simple
  • Dublin Core: 15 elements thought to apply to
    almost any object

Despite efforts to correct known problems, the
simplest protocol with the simplest metadata
(OAI) reports an overall 36% failure rate, 77%
due to metadata/encoding and protocol errors.

15
Simplest Dublin Core metadata
    <?xml version="1.0"?>
    <!DOCTYPE rdf:RDF PUBLIC "-//DUBLIN CORE//DCMES DTD 2002/07/31//EN"
      "http://dublincore.org/documents/2002/07/31/dcmes-xml/dcmes-xml-dtd.dtd">
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:dc="http://purl.org/dc/elements/1.1/">
      <rdf:Description rdf:about="http://www.nap.edu/books/0309064996/html/">
        <dc:title>The Digital Dilemma</dc:title>
        <dc:creator>National Research Council</dc:creator>
        <dc:date>2000-06-22</dc:date>
      </rdf:Description>
    </rdf:RDF>
  • Collaboration is hard enough
  • What technical choices make our collaboration
    harder?
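
To give a feel for what a consumer has to do to read even this simplest record, here is a minimal Python sketch using only the standard library. It is an illustration, not part of the talk: the function name read_dc is invented, the DOCTYPE line is left out for brevity, and the namespace URIs are the standard RDF and Dublin Core ones.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# The slide's record, without the DOCTYPE declaration.
RECORD = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.nap.edu/books/0309064996/html/">
    <dc:title>The Digital Dilemma</dc:title>
    <dc:creator>National Research Council</dc:creator>
    <dc:date>2000-06-22</dc:date>
  </rdf:Description>
</rdf:RDF>
"""


def read_dc(xml_text):
    """Return (about, {element: value}) for the first rdf:Description."""
    root = ET.fromstring(xml_text)
    desc = root.find(f"{{{RDF_NS}}}Description")
    about = desc.get(f"{{{RDF_NS}}}about")
    fields = {}
    prefix = f"{{{DC_NS}}}"
    for child in desc:
        # Keep only Dublin Core elements, stripping the namespace part.
        if child.tag.startswith(prefix):
            fields[child.tag[len(prefix):]] = (child.text or "").strip()
    return about, fields


ABOUT, FIELDS = read_dc(RECORD)
```

Even in this small case the reader has to know two namespace URIs and the qualified-name convention before any of the three values can be reached.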

16
Same record with Dublin Kernel
  • Here's the same information, still
    machine-readable, as an Electronic Resource
    Citation (ERC) with Kernel metadata
      erc:
      who:   National Research Council
      what:  The Digital Dilemma
      when:  2000
      where: http://books.nap.edu/html/digital%5Fdilemma
  • The same information again, in its most compact
    form
      erc: National Research Council | The Digital Dilemma | 2000
           | http://books.nap.edu/html/digital%5Fdilemma
  • The ERC format is the basis of a simple exchange
    protocol
  • Designed for little more than orderly management,
    it frees up technical resources for more
    interesting work
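
As a rough sketch of how little machinery such a record needs, the Python below reads a labeled ERC record into a dictionary and writes the compact form. It assumes each label is followed by a colon and that the compact form separates values with " | "; the function names parse_erc and to_compact are invented for illustration and are not from any ERC tool.

```python
def parse_erc(text):
    """Parse 'label: value' lines of an ERC-style record into a dict."""
    record = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "erc:":
            continue  # skip blank lines and the record header
        # Split at the first colon only, so URLs in the value stay intact.
        label, _, value = line.partition(":")
        record[label.strip()] = value.strip()
    return record


def to_compact(record):
    """Write the four kernel elements on one line (assumed ' | ' separator)."""
    return "erc: " + " | ".join(
        record[label] for label in ("who", "what", "when", "where")
    )


ERC_TEXT = """erc:
who: National Research Council
what: The Digital Dilemma
when: 2000
where: http://books.nap.edu/html/digital%5Fdilemma
"""

RECORD = parse_erc(ERC_TEXT)
COMPACT = to_compact(RECORD)
```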

17
Metadata and collaboration
  • Complex metadata might not be needed, but simple
    metadata will be needed
  • Non-text-based content is still hard to index
    automatically
  • Deliberately handled content implies orderly
    management
  • Collaboration: seamlessly federated collections
    require metadata agreements/standards
  • Collaboration: easy exchanges (import/export)
    require metadata agreements/standards

18
Preservation and collaboration
  • Preservation basics
  • Collect the bits before they disappear
  • Keep redundant copies of the bits (replication)
  • Best if held at sites of independent
    collaborators
  • Stand-alone solutions: how to distinguish them?
  • Further collaborative approaches
  • Vendor education
  • Building trust and reputation
  • Persistent identifiers (URLs)

19
Collaborative vendor education
  • Consortial approach to creating best practices
    for content creators and submitters
  • Educating vendors with one voice, for example,
  • Adobe PDF/A (A = Archival) format
  • Apache web server error codes
  • Book scanning
  • MS Word metadata formats

20
Collaborative trust-building
  • How is trust in a repository established?
  • One way is with audits and checklists
  • Expensive and requires you to trust me
  • Another way is to form a consortium of mutually
    monitoring institutions
  • Membership requires you to monitor and to be
    monitored
  • Monitoring involves limited access to your
    objects by other consortium members, who publish
    their findings
  • All monitoring (eg, link checking, checksum
    validation) and findings publication is automated
    by consortial tools
  • The concept is to provide a cheap way for you to
    ask members whether to trust a repository
  • This is a reputation system, not a policing system

21
Collaborative persistent identifiers (URLs)
  • URLs break because
  • Either the URL's server host goes away
  • Or the URL's path part ceases to work
  • All these can be traced back to
  • Provider ignorance, in which case, educate
  • Provider disappearance with no successor
    organization
  • Consortial prevention and rescue strategy
  • Join supportive consortium, for example, all the
    cultural memory organizations represented in this
    room
  • Small, vulnerable organizations find consortial
    successor
  • All organizations publish URLs under one
    hostname, eg, http://id.archive.org/12345/
  • First part of path (12345) uniquely identifies
    organization

22
Collaborative global resolver
  • Uses ordinary redirects, one per organization,
    that send all incoming requests to the
    institutional resolver
  • A few hundred cultural heritage institutions
  • URN/DOI/Handle problem solved without liabilities
    of complex, proprietary, or special-purpose
    infrastructure
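
A minimal sketch of the per-organization redirect, in Python: the first path segment selects the organization, and the rest of the path is handed to that organization's resolver. The table entries and the example.org and example.edu hostnames are placeholders, and redirect_target is an invented name; a real deployment would more likely express this as web server redirect rules.

```python
from urllib.parse import urlsplit

# Hypothetical table: organization number -> institutional resolver base URL.
ORG_RESOLVERS = {
    "12345": "http://resolver.example.org/",
    "13030": "http://resolver.example.edu/id/",
}


def redirect_target(url):
    """Return the URL to redirect to, or None for an unknown organization.

    Query strings and fragments are ignored in this sketch.
    """
    path = urlsplit(url).path.lstrip("/")
    org, _, rest = path.partition("/")
    base = ORG_RESOLVERS.get(org)
    if base is None:
        return None
    return base + rest


TARGET = redirect_target("http://id.archive.org/12345/libraries/visitor.html")
```

With a few hundred institutions, the whole table is a few hundred lines that any successor organization could maintain.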

[Diagram: resolution flow among User, web browser,
global resolver, institutional resolver, final
sub-server, and final web server. Steps: 1. initiate;
2. URLa; 3. URLb (redirect); 4. URLb; 5-6. URLc, or
7. page (one path or the other); 8. display.]

23
Generalize to a per-object resolver
  • Deals with namespace splitting neglected by
    URNs/Handles
  • Per-object resolution -- fastest, simplest
    architecture
  • Can still do per-resolver redirects (DOIs,
    Handles, ...)
  • No browser mods, no infrastructure to carry
    forward
  • Puts pressure on scaling (resolution, harvesting)
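
One way to picture per-object resolution with a per-resolver fallback is the Python sketch below. Both tables, the example.edu URLs, and the function name resolve are assumptions made for illustration; the identifier path is borrowed from the sample identifiers in this talk.

```python
# Hypothetical per-object table: identifier path -> final URL.
OBJECT_TABLE = {
    "13030/tf5p30086k": "http://content.example.edu/objects/tf5p30086k",
}

# Hypothetical per-organization fallback: organization -> resolver base URL.
ORG_TABLE = {
    "13030": "http://resolver.example.edu/",
}


def resolve(path):
    """Look up the object first; fall back to the organization's resolver."""
    key = path.strip("/")
    if key in OBJECT_TABLE:
        return OBJECT_TABLE[key]  # one redirect, straight to the page
    org, _, rest = key.partition("/")
    if org in ORG_TABLE:
        return ORG_TABLE[org] + rest  # per-resolver redirect
    return None
```

The per-object table is where the scaling pressure shows up: it grows with the number of objects rather than the number of organizations.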

[Diagram: resolution flow among User, web browser,
global resolver, and final web server. Steps:
1. initiate; 2. URLa; 3. URLb; 4. URLb; 5. page;
8. display.]

24
Mirrored resolver clusters
  • Regional (eg, Europe, Asia, North America)
    clusters of mirrored resolver instances
  • Round-robin failover for redundancy,
    fault-tolerance, and load-sharing
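
Round-robin failover can be sketched in a few lines of Python. This is an assumption-laden illustration: the mirror names are placeholders, is_up stands in for whatever health check a real client or load balancer would use, and make_picker is an invented name.

```python
from itertools import cycle


def make_picker(mirrors):
    """Return a function that tries mirrors in rotating order."""
    starts = cycle(range(len(mirrors)))

    def pick(is_up):
        # Each call starts one position later (load-sharing), then walks
        # the ring until a live mirror is found (failover).
        first = next(starts)
        for offset in range(len(mirrors)):
            candidate = mirrors[(first + offset) % len(mirrors)]
            if is_up(candidate):
                return candidate
        return None  # every mirror is down

    return pick


# Hypothetical regional mirrors.
MIRRORS = ["europe", "asia", "north-america"]
```

Usage: create one picker with make_picker(MIRRORS) and call it once per request, passing a function that reports whether a given mirror is reachable.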

[Diagram: User and web browser connected to four
mirrored resolver instances.]

25
Example global id resolver
  • Sample identifiers at id.archive.org -- these
    work
  • http://id.archive.org/12345/libraries/visitor.html
  • http://id.archive.org/13030/inside
  • http://id.archive.org/urn:nbn:se:uu:diva-3324
  • http://id.archive.org/ark:/13030/tf5p30086k
  • It can also redirect URNs, DOIs, Handles, eg,
  • http://id.archive.org/doi:10.1111/j.0307-6946.2004.00571.x
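
A resolver that accepts all of these forms first has to tell them apart. The Python sketch below shows one way to do that, assuming the identifier schemes appear in the path with their usual colon ("ark:", "urn:", "doi:") and that organization prefixes are purely numeric; the function name classify and the returned labels are invented.

```python
def classify(path):
    """Guess which kind of identifier a request path carries."""
    p = path.lstrip("/")
    lowered = p.lower()
    for scheme in ("ark", "urn", "doi"):
        if lowered.startswith(scheme + ":"):
            return scheme
    # Otherwise treat a numeric first segment as an organization prefix.
    first_segment = p.split("/", 1)[0]
    return "org" if first_segment.isdigit() else "unknown"
```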

26
Conclusions
  • Digital preservation is hard, but easy to mess up
  • Basic strategies
  • Simplicity
  • Reduction of narrow, complex dependencies
  • Short-term informs long-term
  • Successful preservation will be the work of many
  • Consortia of collaborating repositories
  • Multiple copies of digital objects
  • Transparent repository auditing practices
  • Think consortially, act locally

27
John.Kunze_at_ucop.edu
  • www.cdlib.org/programs/digital_preservation.html