University Library Experience CDL Case Study

1
University Library Experience CDL Case Study
  • 30 June 2005
  • John Kunze, California Digital Library

2
California Digital Library
  • A university library with no books, students, or
    faculty
  • Central services for 10 campus libraries
  • Content hosting: electronic texts, web-based
    material, datasets, finding aids
  • Linked California museums and archives
  • Plus a Digital Preservation Program

3
What's digital preservation?
  • Safeguarding electronic information
  • Viability (intact bit streams)
  • Renderability (by machines)
  • Understandability (by humans)
  • There's no preservation if we don't know what
    it's called
  • CDL core need for persistent identifiers
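
The "viability (intact bit streams)" criterion above is commonly checked by recording a digest of each object and recomputing it later. The following is a minimal sketch of that idea, not a description of CDL's actual tooling; the function names and the choice of SHA-256 as the default are assumptions made for illustration.

```python
import hashlib


def digest(data, algorithm="sha256"):
    """Return the hex digest of a byte string (algorithm is a hashlib name)."""
    return hashlib.new(algorithm, data).hexdigest()


def is_intact(data, recorded_digest, algorithm="sha256"):
    """True if the bytes still match the digest recorded at ingest time."""
    return digest(data, algorithm) == recorded_digest


# Hypothetical usage: record a digest at ingest, verify it at audit time.
original = b"finding aid, version 1"
recorded = digest(original)
print(is_intact(original, recorded))                    # unchanged bytes
print(is_intact(b"finding aid, version 2", recorded))   # altered bytes
```

A matching digest only speaks to viability; renderability and understandability need separate checks.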

4
What's a persistent identifier?
  • An identifier that is valid for long enough
  • "valid", "enough": these are service/user dependent
  • What's an identifier? It's an association
    between a string and a thing. It follows that:
  • An id is not a string of data (good)
  • An id is a matter of opinion, not fact: there
    will be at least one other provider, serial if
    not in parallel, or your objects die with you
    (inconvenient)
  • Same thing, two strings; or same string, two
    things
  • Often same string, different metadata
  • Often same string, parallel things diverging
    over time due to different preservation practices
    (e.g., migrations)

5
Accepting some disorder
  • Long-term preservation won't happen unless
    objects can change residence and diverge
  • Campus snapshot to CDL; subsequent snapshots
  • Publisher to dim CDL archive; later, CDL to SS?
  • Better if object lives in several places at once
  • Eventually, Producer loses control of copies
  • Multiple opinions and practices will flourish
  • Static, id-based persistence claims soon
    irrelevant
  • urn, hdl, etc. reflect hopes of people
    long gone
  • Not pretty, but the alternative (loss) is worse

6
Agreeing to disagree
  • What we say, but shouldn't (not loudly):
  • Don't re-assign a persistent id to something else
  • Or don't replace a persistent object with another
  • What we do:
  • Knowingly replace our persistent objects (typos,
    drafts, format conversions, home page redesign)
  • Honestly provide a real kind of persistence, but
    with very different replacement policies
  • Won't have one way within CDL, let alone without

7
Diverse persistence practice
  • How dissimilar must two objects be before they
    get different ids?
  • CDL's home-grown Digital Preservation Repository
    (open source) is self-service
  • Lets the Submitter decide
  • Makes preservation a joint responsibility
  • Requirement: need to be able to tell users what
    flavor of permanence is in effect

8
CDL Persistent Ids Must
  • Identify, whether or not the object is at hand
  • It may not be convenient, helpful, or permitted
    for you to inspect the object itself -- metadata
    needed
  • Convey different flavors of permanence
  • Lead to access (if authorized)
  • Not strictly an identification problem, but it
    is the "404 Not Found" that we need to fix
  • Be valid for some longish period
  • Be carried on, in, or with the object

9
How to choose an id scheme
  • All CDL requirements are purely about service
  • Candidate schemes: URL, PURL, URN, ARK, Handle,
    DOI, MD5, GUID, ISxx, ...
  • CDL gets no direct service help from any scheme:
    no scheme or syntax confers persistence of any
    kind
  • We then ask: which schemes are lowest cost and
    lowest risk?

10
Myths to fight against
  • Harmful Fallacy 1. A URL is a location, and is
    therefore inherently unstable. (ridiculous)
  • Harmful Fallacy 2. Explicit server/resolver names
    make URLs inherently unstable.
  • So loc.gov is less stable than handle.net and
    the implicit global resolvers that it depends on?
  • Harmful Fallacy 3. HTTP-based resolvers will not
    scale for persistent access. (Google)
  • Harmful Fallacy 4. URLs are the problem.
  • "Cool URLs don't break" -- Tim Berners-Lee

11
Impersistence - big factors
  • Bankruptcy - no successor found
  • Loss of funding - no successor found
  • Loss of political support
  • War, social upheaval, natural disaster
  • Scheme impact: zero

12
Impersistence - lesser factors
  • Deliberately or accidentally, objects are:
  • Removed
  • Replaced
  • Moved without setting up a redirect
  • Everyone has an indirection mechanism, though
    most don't use it
  • Scheme impact: zero

13
Impersistence - small factors
  • Your org likes persistent ids in principle, but:
  • It lacks knowledge that vanilla web servers
    trivially support 500,000 redirect directives
  • It lacks the expertise or staff to maintain a web
    server, a two-column database table, and a
    nightly server config file report writer
  • Scheme impact: zero
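
The "two-column database table" plus "nightly server config file report writer" mentioned on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the table is represented as CSV text, the ids and target URLs are made-up example values, and the output uses the Apache mod_alias `Redirect permanent` directive as one plausible target format. The function name `redirect_directives` is hypothetical.

```python
import csv
import io


def redirect_directives(rows):
    """Turn (old_path, new_url) pairs into Apache-style Redirect lines."""
    lines = []
    for old_path, new_url in rows:
        old_path, new_url = old_path.strip(), new_url.strip()
        if not old_path.startswith("/"):
            raise ValueError(f"old path must start with '/': {old_path!r}")
        lines.append(f'Redirect permanent "{old_path}" "{new_url}"')
    return lines


# Hypothetical two-column table (persistent path, current location).
# In practice this would be exported nightly from the id database.
TABLE = """\
/ark:/99999/fk4abc,https://repo.example.org/objects/2005/abc
/ark:/99999/fk4xyz,https://repo.example.org/objects/2004/xyz
"""

rows = list(csv.reader(io.StringIO(TABLE)))
config_text = "\n".join(redirect_directives(rows)) + "\n"
print(config_text, end="")
```

When an object moves, only the second column of its row changes; the next nightly run regenerates the config file and the published id keeps working.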

14
Scheme costs and risks
  • Every modern service needs to support
    indefinitely, and find or be given replacements
    for, at least:
  • Web server, web browser, and DNS
  • In addition, URN, Handle, and DOI resolution need
    a global proxy or a plugin for every access
  • ARK could use a plugin, but doesn't need it
  • Handle and DOI also require:
  • You to maintain an extra local server
  • The community to maintain a set of global servers
  • For the CDL:
  • Handle and DOI come with highest risk
  • ARK comes with lowest risk

15
Persistence - indirect factors
  • CDL's persistence requirements call for an id
    scheme (not service) connecting users to:
  • metadata
  • whether and what kind of persistence
  • sub-object and variant inferences
  • core ids on proxy failure (gracefully)
  • Scheme impact: ARK provides these
  • A scheme is not a service (DOI is not CrossRef)
  • When choosing a scheme, we wanted to remain
    independent of extra external service providers
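
Two of the items above, access to metadata and persistence statements, and recovering "core ids on proxy failure", can be illustrated with a short sketch. It assumes the ARK conventions that the host-independent identity is the `ark:/NAAN/Name` part of the URL, that a trailing `?` requests metadata, and that `??` requests the provider's persistence statement. The host names and the ARK value are made up, and the function names are hypothetical.

```python
def core_ark(url_or_ark):
    """Return the host-independent 'ark:/NAAN/Name...' part of an ARK.

    Accepts either a full URL (https://host/ark:/...) or a bare ARK.
    """
    text = url_or_ark.strip()
    start = text.find("ark:/")
    if start == -1:
        raise ValueError(f"no ARK found in {url_or_ark!r}")
    # Drop any trailing '?' or '??' inflection so only the identity remains.
    return text[start:].rstrip("?")


def rehost(url_or_ark, new_host):
    """Re-attach the core ARK to a different (hypothetical) resolver host."""
    return f"https://{new_host}/{core_ark(url_or_ark)}"


def inflections(url_or_ark, host):
    """Build the three request forms: object, metadata, persistence."""
    base = rehost(url_or_ark, host)
    return {"object": base, "metadata": base + "?", "persistence": base + "??"}


# Hypothetical usage: the original host is gone, so retry elsewhere.
broken = "https://old-resolver.example.org/ark:/12345/x6np1wh8k??"
print(core_ark(broken))
print(rehost(broken, "new-resolver.example.org"))
```

Because the core id survives inside the URL, a failed proxy costs the user a hostname substitution rather than the identifier itself.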

16
Our Stuff vs Their Stuff
  • Persistence can be split into:
  • the Our Stuff Problem
  • the Their Stuff Problem
  • It makes no sense for CDL to assign persistent
    ids to Their Stuff
  • Their Stuff can be hugely important to our users,
    but we don't control it and cannot vouch for it
  • Where we can afford it, we track them with PURLs
  • CDL does assign persistent ids to Our Stuff

17
Distribution of Id Assignment
  • Objects ingested in flows from other libraries
    per submission agreements
  • Each object has an ARK after ingest
  • Either it has it already
  • Or we give it one upon entry
  • Campuses can mint their own ARKs or rely on our
    minting service
  • Their own campus ARK namespace is theirs to
    divide up as they wish
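
The ingest rule above (an object keeps the ARK it arrives with, otherwise it is given one on entry) might look roughly like the following. This is a sketch, not CDL's minting service: the alphabet, the name length, the `shoulder` prefix that a campus might use to subdivide its namespace, and the NAAN values are all assumptions for illustration.

```python
import secrets

# Digits plus consonants only: no vowels (avoids accidental words) and no 'l'
# (easily confused with '1'). The exact alphabet is an assumption.
ALPHABET = "0123456789bcdfghjkmnpqrstvwxz"


def mint_ark(naan, shoulder="", length=8, taken=None):
    """Mint a new opaque ARK under `naan`, avoiding names already in `taken`."""
    taken = set() if taken is None else taken
    while True:
        name = shoulder + "".join(secrets.choice(ALPHABET) for _ in range(length))
        ark = f"ark:/{naan}/{name}"
        if ark not in taken:
            taken.add(ark)
            return ark


def ensure_ark(obj, naan, taken):
    """Keep an object's existing ARK, or give it one upon entry."""
    if not obj.get("ark"):
        obj["ark"] = mint_ark(naan, taken=taken)
    return obj


# Hypothetical usage: one object arrives with a campus-minted ARK, one without.
taken = set()
arrived_with = ensure_ark({"title": "A", "ark": "ark:/54321/c7abc"}, "12345", taken)
arrived_without = ensure_ark({"title": "B"}, "12345", taken)
print(arrived_with["ark"], arrived_without["ark"])
```

A real minter would persist the set of assigned names rather than hold it in memory.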

18
Opaque ids with semantic extensions
  • CDL dilemma:
  • Opaque ids are needed for names that age and
    travel well
  • Semantically laden ids are helpful in providing
    many id services
  • Hybrid:
  • Opaque ids are used to name abstract preservation
    objects
  • Semantic and sometimes transient extensions
    address components inside of objects (the set of
    components evolves over time anyway)
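
The hybrid approach can be illustrated by splitting an ARK into its opaque name and its semantic extensions. The sketch below assumes the ARK qualifier convention in which `/` introduces components inside the object and `.` introduces variants. It is deliberately simplified: everything after the first `.` is treated as variant suffixes, which would mishandle a variant attached to an intermediate component. The example ARK and the function name are illustrative.

```python
def split_ark(ark):
    """Split 'ark:/NAAN/Name[/component...][.variant...]' into its parts."""
    if not ark.startswith("ark:/"):
        raise ValueError(f"not an ARK: {ark!r}")
    naan, _, rest = ark[len("ark:/"):].partition("/")
    # Everything from the first '.' onward is treated as variant suffixes.
    path, _, variant = rest.partition(".")
    name, *components = path.split("/")
    return {
        "naan": naan,
        "name": name,              # opaque: names the abstract object
        "components": components,  # semantic: parts inside the object
        "variants": variant.split(".") if variant else [],
    }


# Hypothetical usage with an illustrative ARK.
parts = split_ark("ark:/12025/654xz321/s3/f8.05v.tiff")
print(parts)
print(f"ark:/{parts['naan']}/{parts['name']}")  # the opaque, long-lived part
```

Only the `naan` and `name` fields carry the persistence commitment; the component and variant extensions can change as the object's internal structure evolves.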