Title: MetaArchive Distributed Digital Preservation Workshop
1 MetaArchive Distributed Digital Preservation
Workshop
- Wednesday, May 30, 2007
- Robert W. Woodruff Library
- Emory University
- Atlanta, Georgia
2 Day One Overview
- 8:30 AM - 9:00 AM Light Breakfast and Welcome
- 9:00 AM - 10:30 AM Session 1. Overview of
Distributed Digital Preservation Networks, M.
Halbert
- 10:30 AM - 10:45 AM Break
- 10:45 AM - 12:15 PM Session 2. Content
Management, C. Jannik and G. MacMillan
- 12:15 PM - 1:15 PM Lunch
- 1:15 PM - 2:45 PM Session 3. Costs and
Operational Considerations, M. Halbert and K.
Skinner
- 2:45 PM - 3:00 PM Break
- 3:00 PM - 4:30 PM Session 4. Organizational
Agreements, D. Buttler and K. Skinner
- 4:30 PM - 4:45 PM Wrap Up
3 Purposes of this Workshop
- Foster discussion concerning distributed digital
preservation strategies
- Share information and perspectives acquired in
the course of the MetaArchive NDIIPP project
- Provide information and training for institutions
seeking to build or join distributed digital
preservation networks based on the LOCKSS
software.
4 Introductions: Who We All Are
- Please introduce yourself
- Say where you are from
- Mention any particular things that you hope to
get out of this workshop, and any other
expectations you may have
- Identify any particular topics you hope we will
spend time discussing
5 Learning Objectives for this Session
- Review day one workshop sessions
- Overview of some digital preservation basics
- Reasons to establish or join a network
- Models of network organization
- Defining partner/member responsibilities
- Overview of MetaArchive and LOCKSS
6 Overview of Some Digital Preservation Basics
7 The New Field of Digital Preservation
- Cultural heritage organizations are rapidly
expanding their digitization programs in an
effort to provide better access to collections.
As these digitization efforts go forward, and as
an increasing number of born-digital acquisitions
are made, there are concomitant needs for
preservation of these materials.
- The DigCCurr 2007 Conference was hosted in April
2007 by the School of Information and Library
Science at the University of North Carolina at
Chapel Hill in an explicit effort to define the
new field of Digital Curation.
- The Consultative Committee for Space Data Systems
has of necessity created many working standards
for preservation of digital information. One of
the most notable standards was the Reference
Model for an Open Archival Information System
(OAIS), which provided a broad vocabulary for
discussing digital archives systems and processes.
- The National Digital Information Infrastructure
and Preservation Program (NDIIPP) is the
congressionally chartered national program to
digitally preserve our national heritage.
- The Digital Preservation Management Workshop
hosted by Cornell University from 2003-2006 was
an effort to collate and share relevant best
practices and documentation from a large number
of emerging projects and efforts related to
digital preservation.
- In the UK, groups such as the Digital Curation
Centre and the Digital Preservation Coalition
have been formed to foster joint action to
address the urgent challenges of securing the
preservation of digital resources in the UK and
to work with others internationally to secure our
global digital memory and knowledge base.
8 The Data Loss Problem
9 The Data Loss Problem (cont.)
10 The Data Loss Problem (cont.)
11 The Data Loss Problem (cont.)
12 The Data Loss Problem (cont.)
From NDIIPP Website on the Importance of Digital
Preservation (http://www.digitalpreservation.gov/importance/)
13 National Digital Information Infrastructure and
Preservation Program (NDIIPP) Commentary
- Technology has so altered our world that most of
what we now create begins life in a digital
format.
- The artifacts that tell the stories of our lives
no longer reside in a trunk in the attic, but on
personal computers or Web sites, in e-mails or on
digital photo and film cards.
- The flip side to the ease with which we are able
to create digital content is the complexity of
preservation and long-term retrieval of this
content.
- We must contend with issues relating to hardware
and software compatibility; long-term storage;
organization of files for ease of search and
retrieval; media quality; disaster recovery; and
integrity of original data.
14 Making Our Digital Heritage a Top Priority
- When we consider the ways in which the American
story has been conveyed to the nation, we think
of items such as the Declaration of Independence,
Depression-era photographs, television
transmission of the lunar landing and audio of
Martin Luther King's "I Have a Dream" speech.
Each of these is physically preserved and
maintained according to the properties of the
physical media on which it was created. Yet,
how will we preserve these essential pieces of
our heritage?
- Web sites as they existed in the days following
Sept. 11, 2001, or Hurricane Katrina?
- What about Web sites developed during the
national elections?
- Executive correspondence generated via e-mail?
- Web sites dedicated to political, social and
economic analyses?
- Data generated via geographical information
systems, rather than physical maps?
- Digitally recorded music or video recordings?
- Web sites that feature personal information such
as videos or photographs?
- Social networking sites?
- Should these be at a greater risk of loss, simply
because they are not tangible?
- The content of digital archives at cultural
heritage institutions, created with scarce
resources in a time of great change?
15 The Gap in Digital Preservation Programs
- 66% of cultural heritage institutions (academic
libraries, archives, art museums, public
libraries, and other similar kinds of
institutions) report that no one is responsible
for digital preservation activities
- 30% of all archives have been backed up one time
or not at all
Source: 2005 NEDCC Survey by Bishoff and Clareson
16 Reasons to Establish or Join a DDP Network
17 Backups versus Digital Preservation
- What differentiates a schedule for data backups
from a digital preservation program?
- Backups are tactical measures. Backups are
typically stored in a single location (often
nearby or collocated with the servers backed up)
and are performed only periodically. Backups are
designed to address short-term data loss via
minimal investment of money and staff time
resources. Backups are better than nothing, but
not a comprehensive solution to the problem of
preserving information over time.
- Digital preservation is strategic. A digital
preservation program entails a geographically
dispersed set of secure caches of critical
information. A true digital preservation program
will require multi-institutional collaboration
and at least some ongoing investment to
realistically address the issues involved in
preserving information over time.
18 What is Digital Preservation?
- Digital Preservation refers to the management of
digital information over time.
- Unlike the preservation of paper or microfilm,
the preservation of digital information demands
ongoing attention. This constant input of effort,
time, and money to handle rapid technological and
organisational advance is considered the main
stumbling block for preserving digital
information beyond a couple of years.
- Digital preservation can therefore be seen as the
set of processes and activities that ensure the
continued access to information and all kinds of
records, scientific and cultural heritage
existing in digital formats.
http://en.wikipedia.org/wiki/Digital_preservation
19 Secure and Distributed Cache Networks
- Why are the characteristics of geographic
distribution and security so important? This
strategy maximizes survivability of content in
both individual and collective terms.
- Security reduces the likelihood that any single
cache will be compromised.
- Distribution reduces the likelihood that the loss
of any single cache will lead to a loss of the
preserved content.
- By creating a collaborative network for secure
and distributed preservation, a group can also
work together on more complex issues such as
format migration.
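As a rough illustration of the distribution argument above, the following Python sketch computes the chance that every copy is lost when each cache fails independently with the same probability. The function name, the independence assumption, and the 5% failure figure are illustrative assumptions, not figures from the workshop.

```python
def probability_all_copies_lost(per_cache_loss: float, cache_count: int) -> float:
    """Chance that every cache loses the content, assuming independent failures."""
    if not 0.0 <= per_cache_loss <= 1.0:
        raise ValueError("per_cache_loss must be between 0 and 1")
    if cache_count < 1:
        raise ValueError("cache_count must be at least 1")
    # Independent events: multiply the per-cache loss probability once per cache.
    return per_cache_loss ** cache_count


if __name__ == "__main__":
    # Hypothetical 5% chance of losing any one cache over some period.
    for caches in (1, 3, 6):
        risk = probability_all_copies_lost(0.05, caches)
        print(f"{caches} cache(s): {risk:.2e}")
```

Real failures are rarely fully independent, so the true risk is likely higher than this simple model suggests; that is one reason the slide pairs distribution with security rather than relying on copy count alone.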
20 Case Study from the Chirographic (Handwritten)
Era: The Nag Hammâdi Library
- Collection of early Coptic texts discovered near
the town of Nag Hammâdi in 1945
- Had been buried in the 4th Century CE when
censored
- Only extant copies of core early Gnostic
scholarship
- Survived 15 centuries because they were part of a
secure, distributed chirographic network
21 Shared Archiving Fails without a Pre-coordinated
Digital Preservation Network in Place
- The NDIIPP Archive Ingest and Handling Test
(AIHT)
- Designed to document methods for preserving
digital cultural materials and identify areas that
require further research
- Participants tested five different preservation
systems
- Encountered many unexpected incompatibilities
because of different systems
- Realization that much of the cost in preserving
digital material is in coordinating the
organizational and institutional imperatives of
preservation, and not the technological costs of
storage space
22 Both Technical Networking and Organizational
Networking are Required
- A single cultural heritage organization is
unlikely to have the capability to operate
several geographically dispersed and securely
maintained servers
- Collaboration between institutions on
technological solutions is essential
- Similarly, inter-institutional agreements must be
put in place or there will be no commitment to
act in concert over time
- "The increased number and diversity of those
concerned with digital preservation, coupled with
the current general scarcity of resources for
preservation infrastructure, suggests that new
collaborative relationships that cross
institutional and sector boundaries could provide
important and promising ways to deal with the
data preservation challenge. These
collaborations could potentially help spread the
burden of preservation, create economies of scale
needed to support it, and mitigate the risks of
data loss."
- The Need for Formalized Trust in Digital
Repository Collaborative Infrastructure,
NSF/JISC Repositories Workshop (April 16,
2007)
23 Defining Partner/Member Responsibilities
24 Institutional and Consortial Roles
- Preservation Sites are entities responsible for
the ongoing activity of preserving digital
content. At a minimum, every preservation site
must include responsible staff and a node server
of the relevant preservation network.
Preservation sites collectively comprise a
preservation network.
- Development Sites are responsible for technical
development of the computer systems that enable
the preservation network. Obviously, development
sites may also be preservation sites and/or
contributing sites.
- A Preservation Network is composed of all
preservation sites that work together to preserve
at-risk digital content.
- Contributing (Content) Sites are institutions
that need to preserve digital content, and
therefore decide to contribute digital content
into the preservation network. The preservation
network acts for the common good to preserve the
at-risk content submitted by the contributing
sites. Contributing sites may also be
preservation sites.
25 Individual Roles
- Selectors are staff who identify and prioritize
content to be preserved. They will most often be
knowledgeable concerning the content of an
institution's digital archives, and may have been
the same individuals who originally created or
acquired the archives.
- System Administrators are staff members who
maintain individual preservation node servers of
the relevant preservation network.
- Data Wranglers are programmers and other
technically adept workers who prepare local
digital archives for ingestion into a
preservation network.
- Program Managers are leaders who accept
responsibility for coordinating the activities of
a digital preservation network.
- NOTE: All of the above roles may overlap in
creative ways!
26 Models of Network Organization
- Different Ways of Creating or Joining Digital
Preservation Networks
27 Dedicated Network
- Create a Dedicated Preservation Network
- Provides the greatest organizational control
- You can set up the rules for the network
- Requires greatest up-front investment to
implement
28 Strategic Alliance
- Build onto an Existing Preservation Network
- Takes advantage of previous investments by others
- Requires understanding the rules of existing
network and abiding by them
- Still requires capital investment in
infrastructure
29 Piggyback Ride
- Arrange Contribution Strategy to an Existing
Preservation Network
- No capital investment in infrastructure required
- Maximum advantage from previous investments by
others
- Requires abiding by rules of existing network
- Requires convincing the existing network to
preserve your stuff; will likely entail fees
30 Network Security Factors
- What level of security and control over access to
your data do you need?
- Do you have sensitive assets that require access
controls? If so, you may need a dedicated
network in which you control access to the
preservation nodes, or at least be able to join a
network which provides such access assurances.
- Do you have some flexibility in adapting to other
infrastructures and security policies? If so, it
may be simplest to join and build your
preservation nodes onto an existing network. The
requirements may be readily acceptable.
- Do you have relaxed or no security/access
expectations? If so, you may simply want to
piggyback off an existing network and depend on
their good graces.
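The three questions above can be read as a simple decision rule. The sketch below encodes that reading in Python; the enum, function, and member names are hypothetical, and a real decision would also weigh cost and infrastructural capacity.

```python
from enum import Enum


class SecurityNeed(Enum):
    """Hypothetical labels for the three situations the slide describes."""
    STRICT = "sensitive assets that require access controls"
    FLEXIBLE = "can adapt to another network's policies"
    RELAXED = "relaxed or no security/access expectations"


class NetworkModel(Enum):
    DEDICATED = "dedicated network"
    STRATEGIC_ALLIANCE = "strategic alliance"
    PIGGYBACK = "piggyback ride"


def suggest_model(need: SecurityNeed) -> NetworkModel:
    """Map a security need onto the network model the slide suggests for it."""
    suggestions = {
        # The slide also allows joining a network that offers equivalent
        # access assurances; DEDICATED is the stricter of the two options.
        SecurityNeed.STRICT: NetworkModel.DEDICATED,
        SecurityNeed.FLEXIBLE: NetworkModel.STRATEGIC_ALLIANCE,
        SecurityNeed.RELAXED: NetworkModel.PIGGYBACK,
    }
    return suggestions[need]


if __name__ == "__main__":
    for need in SecurityNeed:
        print(f"{need.value} -> {suggest_model(need).value}")
```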
31 Decisions on Degrees of Security
- More security and access assurances drive up the
required costs of a preservation network
- Extra costs may very well be justified! The
entire point of a preservation network is
long-term security for your digital content.
- Strategic alliances can make a lot of sense.
They leverage your resources, but still give you
ownership of a portion of the infrastructure.
- If you have no infrastructural capacity, and
little or no funding, a piggyback ride is better
than nothing!
32 Overview of MetaArchive and LOCKSS
33 MetaArchive
- A dedicated preservation network for digital
archives established under the auspices of and
with funding from the National Digital
Information Infrastructure and Preservation
Program (NDIIPP)
- Based on LOCKSS technology, but a separate
network with high capacity nodes
- Highly distributed geographically across multiple
states
- Node servers are very secure, with a variety of
extra security hardening measures added to each
preservation node
- Memoranda of Understanding between participating
sites concerning commitment to maintain each
other's data security and network integrity
- Motivation to preserve partners' digital archives
is based on signed agreements and commitment to
the preservation network
- Available for others to join, both to build onto
or to piggyback on
- Active development community, committed to
ongoing exploration of distributed preservation
technologies, digital curation tools, and format
migration methods
- Fee structure to join as members or to piggyback
on
34 LOCKSS
- A dedicated preservation network for online
journals, established with funding from the
Mellon Foundation and new funding from the
NDIIPP
- The pioneering leader in distributed digital
preservation
- Very highly distributed geographically across the
world, with hundreds of sites
- Available for others to join, both to build onto
or to piggyback on
- Fee structure for membership
- No signed agreements between sites; individual
nodes may preserve content or withdraw at will
- Motivation to preserve content is based on
interest by members in long-term access to online
journal content to which they subscribe
- Active development community, with new
initiatives with publishers (CLOCKSS) and many
other technical advancement directions
35 Q&A Discussion