Title: Self-Preserving Digital Objects
1Self-Preserving Digital Objects
- Michael L. Nelson
- mln_at_cs.odu.edu
- http://www.cs.odu.edu/mln/
- Several Slides from
- Terry L. Harrison
- University of Southern California
- 6/15/04
2Outline
- History
- Preservation
- Archives vs. Objects
- Smart Objects Dumb Archives
- Self-Preserving Objects
3My DL History
- 1992 - work begun on the first-generation Langley Technical Report Server (LTRS)
- 1993 - WWW version of LTRS
- http://techreports.larc.nasa.gov/ltrs/
- work w/ ODU on WATERS
- 1994 - NASA Technical Report Server (NTRS)
- distributed searching of many LTRS-like servers (20 separate nodes, all NASA centers)
- http://techreports.larc.nasa.gov/cgi-bin/NTRS
- 1996 - NACA Technical Report Server (NACATRS)
- http://naca.larc.nasa.gov/
- 1996 - Joint research in DLs with ODU begins
- 1997 - NCSTRL+ (clustering, buckets)
- 1999 - OAI-PMH development begins
- 2001 - Arc, DP9, Archon, Kepler, etc.
- 2002 - OAI-PMH version of the NTRS
- http://ntrs.nasa.gov/
4History
- ca. 1994 - 1995: a LaRC researcher, upon seeing LTRS, remarked "all of these reports are nice, but what we really want is the data..."
- ca. 1995 - present: many reports in LTRS start to include data files, appendices, software and other information types
- NACATRS: the scanned nature of the reports implies that 1 report = N files
- N > (pages × 3) + 2
5NASA STI
- Formal publications cover a decreasing percentage of NASA's STI output
- most DLs focus only on formal publications
- Informal STI is maintained only by a network of collegial distribution
- aging and shrinking workforce weakens this network
- Customers want much more than formal publications
- rather than stretch the meaning of "report" or "document", define a new object for DL transactions
6STI Observations
- Media formats are instantiations of a more general class of information
- Most DLs are uni-format, following the obsolete media boundaries of their non-digital predecessors
- "Separate but equal" DLs considered harmful
- customer should not have to re-integrate what should never have been de-integrated...
- institutional knowledge is being lost because we don't have a publishing vector established
7Pyramid of Scientific and Technical Information
(STI)
Information is created in a variety of formats.
Formal publications, the focus of most DL
projects, are supported by a pyramid of informal
information.
8Information Lost Over Time
9Content is King
The information content is more important than
the systems used for its storage, management and
retrieval
Objects should not be locked in specific DLs
or archives
10Prelude to OAI
- I met Herbert Van de Sompel in April 1999...
- we spoke of a demonstration project he had in mind, for which he had received sponsorship from Paul Ginsparg and Rick Luce
- We wanted to demonstrate a multi-disciplinary DL that leveraged the large number of high quality, yet often isolated, tech report servers, e-print servers, etc.
- most digital libraries (DLs) had grown up along single disciplines or institutions
- little to no interoperability; isolated DL "gardens"
11Universal Preprint Service
- A cross-archive DL that provides services on a collection of metadata harvested from multiple archives
- Nelson: NCSTRL+, a modified version of Dienst
- support for clustering
- support for buckets
- Krichel: ReDIF metadata format
- Van de Sompel: SFX Linking
- Demonstrated at Santa Fe NM, October 21-22, 1999
- http://web.archive.org/web//http://ups.cs.odu.edu/
- D-Lib Magazine, 6(2) 2000 (2 articles)
- http://www.dlib.org/dlib/february00/02contents.html
- UPS was soon renamed the Open Archives Initiative (OAI) http://www.openarchives.org/
12UPS Participants
totals ca. July 1999
13Buckets: Information Surrogates in UPS
- Limitations on intellectual property, file size, transmission time, system load, etc. caused us to focus on metadata only
- Metadata was collected into buckets, with pointers back to the data files (still at the original sites)
14Value Added Services Attached to the Buckets
- SFX Reference Linking Service, developed at Univ of Ghent, Belgium
- provides a layer of indirection between reference services available at a local site and the object itself
- SFX buttons are attached to the buckets themselves
- communication occurs between SFX server and the bucket
- Adding other services to the buckets is easy...
15Data and Service Providers
- Data Providers
- publishing into an archive
- Self-describing archives
- Much of the learning about the constituent UPS archives occurred out of band
- providing methods for metadata harvesting
- provide non-technical context for sharing information also
- Service Providers
- harvest metadata from providers
- implement user interface to data
Even if these are done by the same DL, these are distinct roles
16Metadata Harvesting
- Move away from distributed searching
- Extract metadata from various sources
- Build services on local copies of metadata
- data remains at remote repositories
Figure: a user query (e.g., search for cfd applications) runs against a local copy of metadata that is harvested offline from independently maintained nodes; all searching, browsing, etc. is performed on the local metadata, and individual nodes can still support direct user interaction.
17Result OAI
- The OAI was the result of the demonstration and discussion during the Santa Fe meeting
- OAI: a bunch of people, a religion, a cult, etc.
- OAI Protocol For Metadata Harvesting (OAI-PMH): the protocol created and maintained by the OAI
- Initial focus was on federating collections of scholarly e-print materials
- however, interest grew and the scope and application of OAI-PMH expanded to become a generic bulk metadata transport protocol
- Note
- OAI-PMH is only about metadata -- not full text!
- but what is metadata vs. full-text?
- OAI is neutral with respect to the nature of the metadata or the resources the metadata describes
- read: commercial publishers have an interest in OAI-PMH too...
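As a rough illustration of what a harvester sends, the sketch below builds an OAI-PMH ListRecords request URL. The verb and the metadataPrefix, from and set arguments come from the OAI-PMH specification; the base URL http://example.org/oai and the helper name list_records_url are placeholders invented for this example, not part of any archive named in these slides.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    # verb and metadataPrefix are required for an initial ListRecords request.
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    # "from" and "set" are optional selective-harvesting arguments.
    if from_date:
        params["from"] = from_date
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Placeholder repository address, used only for illustration.
url = list_records_url("http://example.org/oai", from_date="2004-01-01")
print(url)
```

A service provider would fetch this URL offline, store the returned metadata records locally, and build search and browse services on that local copy; the data itself stays at the remote repository.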
18A Look Back at UPS
- Primary outcome of the meeting was the OAI and OAI-PMH
- Krichel: ReDIF metadata
- still in use and being developed
- Van de Sompel: SFX
- OpenURL (NISO Standard)
- SFX is a commercial OpenURL resolver marketed by Ex Libris
- Nelson
- NCSTRL+ begat Arc (arc.cs.odu.edu) and others
- Buckets?
19Componentized Digital Libraries
. . .
20Preservation
- RLG Report: Preserving Digital Information, Final Report and Recommendations
- http://www.rlg.org/ArchTF/
- refreshing - moving to new media
- considered (comparatively) easy
- migrating - transitioning to new systems, formats, idioms
- considered hard
21Really Long Term Preservation
- Migration is very hard, to be sure
- but given sufficient demand, this can be accomplished
- cf. early 1980s game emulation
- http://www.intellivisionlives.com/
- http://stella.atari.org/
- Refreshing may actually be harder
- or at least intrinsically bound to the migration problem
- http://web.archive.org/web/19980128071544/http://www.usc.edu/
- http://web.archive.org/web//http://library.usc.edu/
- http://web.archive.org/web/19971210220634/http://lib-www.lanl.gov/
22Preservation Metrics So Far
- Nelson & Allen
- 3% decay of objects in DLs
- http://www.dlib.org/dlib/january02/nelson/01nelson.html
- Lawrence, et al.
- 3% decay of URLs included in technical papers
- http://www.neci.nec.com/lawrence/papers/persistence-computer01/bib.html
- Koehler
- 33% of URLs unstable or partially unstable
- http://InformationR.net/ir/4-4/paper60.html
- Kahle
- average URL lasts 44 days
- http://www.hackvan.com/pub/stig/articles/trusted-systems/0397kahle.html
- Spinellis
- 28% loss of 5-8 year old URLs from CACM / IEEE Computer
- http://citeseer.ist.psu.edu/spinellis03decay.html
23Case Study ICASE
- Institute for Computer Applications in Science and Engineering
- independent research institute affiliated with NASA Langley Research Center
- www.icase.edu
- years of operation: 1972-2002
- combined with other LaRC institutes, rolled into the National Institute of Aerospace (NIA)
- ICASE Report Series
- pre-prints/e-prints of all ICASE affiliated authors
- also issued as NASA Contractor Reports
- Dienst was used for report management workflow
- Harrison, Zubair & Nelson, JCDL 03, Dienst <-> OAI-PMH gateway
24NIA Transition
- At first, all files at www.icase.edu were lost
- then, the site was brought back online
- but how well do DLs survive bulk-transfer?
25Whither the ICASE Digital Library?
it appears to be reinstated
but not completely
26How Long is Forever?
- Average human life span (from http://www.che.uc.edu/acs/archives/cintacs/vol39no5/vol39no5.html)
- female: 78
- male: 77
- Average Fortune 500 company lifespan (from http://www.businessweek.com/chapter/degeus.htm)
- 40 - 50 years
- Universities?
- U.S. Government agency or institution?
- what about individual labs?
- NASA Zero Base Review
- U.S. Military BRAC
27Self-Preservation
- Objects should be prepared to outlive the people and institutions that are charged with their well-being
- Many areas of risk
- company, agency, university, etc. ceases to exist
- funding cut
- person dies
- disaster (hurricane, earthquake, etc.)
- malicious attack
28P2P Model
- Applicable for scientific and technical information?
- Napster, Gnutella, etc. rely on the repetitive nature of popular culture media (songs, movies, etc.) to ensure the availability of items
- a bubble of recent and popular interest
- this assumption is probably not valid in STI DLs
- cf. popularity(HBO) >> popularity(AMC)
29Smart Objects, Dumb Archives
DA
Buckets
Guildford Protocol
Fedora? METS?
OAI-PMH
???
30Key Concepts in the Architecture of the Digital
Library
- next 9 slides taken from Bill Arms' seminal article in the inaugural issue of D-Lib Magazine
- http://www.dlib.org/dlib/July95/07arms.html
31The technical framework exists within a legal and
social framework
- DLs no longer represent systems specific to academics or information specialists
- content influences how the DL is used
- architecture must allow the implementation of various policies
32Understanding of digital library concepts is
hampered by terminology
- common English != professional English
- multiple professional jargons too
- What do these words mean to you?
- copy
- publish
- content
- document
- work
33The underlying architecture should be separate
from the content stored in the library
- general purpose functions and content-specific functions should be separated
- TL analogy
- the more specific the bookshelf is to holding actual books, the harder it is to repurpose the bookshelf in the future
34Names and identifiers are the basic building
block for the digital library
- names != addresses
- in any DL architecture diagram, (almost) anything that can be drawn can be named
- consider the impact that handles/DOIs have had on the publishing/DL community
35Digital library objects are more than collections
of bits
- objects = metadata + data
- but what is metadata?
- don't ask hard questions
figure 2 in http://www.dlib.org/dlib/July95/07arms.html
36The digital library object that is used is
different from the stored object
- what you store is not necessarily what you get
- storage and dissemination are separate events, and can represent separate formats
- also, potentially separate from the application-specific format
37Users want intellectual works, not digital objects
- The DL architect's needs should not inconvenience the users' needs
- recombination of objects
- what is an object in your world view?
figure 4 in http://www.dlib.org/dlib/July95/07arms.html
38Repositories must look after the information they
hold
- Repository Access Protocol
- Kahn & Wilensky Framework
- http://www.cnri.reston.va.us/home/cstr/arch/k-w.html
figure 3 in http://www.dlib.org/dlib/July95/07arms.html
39Objects vs. Archives
- This is the tenet that I question
- Most DL objects are still bound to the applications that generate or render the objects
40Design Goals
- Aggregation
- DLs should be shielded from the transient nature of file formats
- Prevent information hemorrhaging by archiving all data types
- Intelligence
- Aggregation (above) implies code, so why stop at passive objects? Make objects smart...
- Bucket-bucket and bucket-tool intelligence
41Design Goals
- Self-Sufficiency
- Maximum autonomy and survivability: fully self-sufficient buckets
- Option to internally store all needed materials
- Mobility
- Why should an information object be stuck in one place?
- Mobility for replication, workflow, data collection
42Design Goals
- Heterogeneity
- One size does not fit all...
- Different buckets for different applications, sites, disciplines, etc.
- Archive Independence
- Focus is on information, not yet another DL system
- does not require an archive to function
- Work with everything; break nothing
43Smart Objects
- aggregate
- metadata
- data
- methods to operate on the metadata/data
- http://www.cs.odu.edu/mln/teaching/cs595-f03/?method=getMetadata&type=all
- http://www.cs.odu.edu/mln/teaching/cs595-f03/?method=listMethods
- http://www.cs.odu.edu/mln/teaching/cs595-f03/?method=listPreference
- (cheat) http://www.cs.odu.edu/mln/teaching/cs595-f03/bucket/bucket.xml
- assumptions
- Perl
- http server
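The idea of an object that aggregates metadata, data, and the methods that operate on them can be sketched in a few lines. This is only a toy model, assuming the bucket is addressed with a query string of the form method=NAME plus optional arguments; the actual buckets are Perl scripts behind an http server, and the class name Bucket, the handler names, the default to display, and the sample contents are all invented for illustration.

```python
from urllib.parse import parse_qs


class Bucket:
    """Toy smart object: metadata, data, and methods in one aggregate."""

    def __init__(self, metadata, elements):
        self.metadata = metadata   # descriptive metadata (dict)
        self.elements = elements   # stored data files, keyed by name
        # The object carries its own method table; no archive is required.
        self.methods = {
            "getMetadata": self.get_metadata,
            "listMethods": self.list_methods,
            "display": self.display,
        }

    def get_metadata(self, args):
        return dict(self.metadata)

    def list_methods(self, args):
        return sorted(self.methods)

    def display(self, args):
        return sorted(self.elements)

    def handle(self, query_string):
        """Dispatch a query string such as 'method=listMethods'."""
        args = {key: values[0] for key, values in parse_qs(query_string).items()}
        # Assumption: a request that names no method falls back to display.
        method = args.pop("method", "display")
        if method not in self.methods:
            raise ValueError(f"unknown method: {method}")
        return self.methods[method](args)


# Hypothetical contents, for illustration only.
bucket = Bucket({"title": "example bucket"}, {"syllabus.txt": b"..."})
print(bucket.handle("method=listMethods"))
print(bucket.handle("method=getMetadata&type=all"))
```

Because the method table travels with the object, adding a new service means adding an entry to the bucket rather than changing the archive that happens to hold it.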
44Internal Structure
jaga.cs.odu.edu/home/mln/public_html/teaching/cs695-f03 ls
bucket/  CVS/  index.cgi
jaga.cs.odu.edu/home/mln/public_html/teaching/cs695-f03 ls bucket/
bucket.xml  content/  CVS/  lib/  logs/  methods/
jaga.cs.odu.edu/home/mln/public_html/teaching/cs695-f03 ls bucket/content/
syllabus.txt          week1readings.html
week5readings.html    week10readings.html
week1week-01.ppt      week6readings.html    week11readings.html
week2readings.html    week7readings.html    week12readings.html
week2week-02.ppt      week8readings.html    week13readings.html
week3assignment1.ppt  week9readings.html    week14readings.html
week3readings.html    week15readings.html
week3week-03.ppt
jaga.cs.odu.edu/home/mln/public_html/teaching/cs695-f03 ls bucket/lib
CVS/  EZXML.pm  mime.e  style.css
jaga.cs.odu.edu/home/mln/public_html/teaching/cs695-f03 ls bucket/logs/
access.log  CVS/  mylog.log
jaga.cs.odu.edu/home/mln/public_html/teaching/cs695-f03 ls bucket/methods/
addElement.pl     getElement.pl   listMethods.pl     setPreference.pl
CVS/              get_log.pl      listPreference.pl
deleteElement.pl  getlog.pl       log.pl
display.pl        getMetadata.pl  setMetadata.pl
jaga.cs.odu.edu/home/mln/public_html/teaching/cs695-f03
45Examples
- 1.6.X bucket
- http://ntrs.nasa.gov/
- http://www.cs.odu.edu/mln/phd/
- 2.0 buckets
- http://www.cs.odu.edu/mln/teaching/cs595-f03/
- http://www.cs.odu.edu/lutken/smalltest/1120/
- 3.0 buckets (under development)
- http://www.cs.odu.edu/jallen/buckets/
- uses MPEG-21 DIDLs
- cf. http://www.dlib.org/dlib/november03/bekaert/11bekaert.html
46Self-Preservation
- Objectives
- knowledge of the system state not required
- i.e. -- you don't need to keep track of where everything is
- the knowledge required for each object should be minimal
- actually, the required number of friends should be finite, even in very large systems
47Friends and Family
- Friends
- connections to other buckets
- Family
- connections to replications of you
48Scenario: 3 buckets / 2 pals each
C
Pals b,a
49We want to add new_guy (D)
C
Pals b,a
D
Pals(none)
50Tool calls C.insert(D,start)
C
Pals b,a
D
Pals
51D is added to C's pal list
C
Pals b,a,(d)
C pal_list is overstuffed
D
Pals
52Return handshake: D.insert(C,finish)
C
Pals b,a,(d)
C pal_list is overstuffed
D
Pals c
53C refits pal list
C
Pals b,a,(d)
C pal_list is overstuffed
D
Pals c
54Refit step 1: C.pop_1st_pal not known by (D)
C
Pals b,a,d
Now C pal_list is overstuffed
D
Pals c
55Refit step 2: B.pop_pal( C )
C
Pals a,d
D
Pals c
56Refit step 2: B.insert( D, start )
B
Pals a,d
C
Pals a,d
D
Pals c
57Refit step 3: D.insert( B, finish )
B
Pals a,d
C
Pals a,d
D
Pals c,b
58Refit step 3: D.insert( B, finish )
B
Pals a,d
C
Pals a,d
D
Pals c,b
59A
Pals b,c
B
Pals a,d
C
Pals a,d
D
Pals c,b
6010 Buckets, 4 Friends Step 2
6110 Buckets, 4 Friends Step 3
6210 Buckets, 4 Friends Step 4
6310 Buckets, 4 Friends Step 5
6410 Buckets, 4 Friends Step 6
6510 Buckets, 4 Friends Step 7
6610 Buckets, 4 Friends Step 8
6710 Buckets, 4 Friends Step 9
6810 Buckets, 4 Friends Step 10
6920 Buckets, 4 Friends
70100 Buckets, 10 Friends
71Building the Network
Bucket
    this_node_name
    max_friend_size
    list_of_pals

insert ( new_guy, string handshake )
    // Adds new_guy to this bucket's pal list
    // handshake = "start" or "finish"
    if ( I know(new_guy) )
        return
    else
        put new_guy at end of my pal list
        if ( handshake == "start" )
            new_guy.insert(this_node_name, "finish")
        if ( my pal list is now overstuffed )
            refit()
    return

list_of_pals refit ()
    // To keep pal_list from being overstuffed
    read in new_guy's pal list
    pop_1st_pal_list()           // I remove 1st pal "Y" from my list that's
                                 // not present in "new_guy's" pal list
    Y.pop_from_list(Me)          // Have "Y" pop "Me"
    Y.insert(new_guy, "start")   // Y adds new_guy to his list
                                 // this will call new_guy to add "Y" as well
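The insert/refit procedure can be exercised with a small Python sketch that replays the 3-bucket, 2-pal scenario from slides 48-59. It is a simplified reading of the pseudocode, not the bucket implementation: buckets hold direct object references rather than node names, refit receives new_guy as an explicit argument, new_guy itself is never chosen as the pal to pop, and refit does nothing when no eligible pal exists. Those last three choices are assumptions made here so the sketch is well defined.

```python
class Bucket:
    """Toy bucket that keeps a bounded list of pals (friends)."""

    def __init__(self, name, max_friends):
        self.name = name
        self.max_friends = max_friends
        self.pals = []

    def insert(self, new_guy, handshake):
        """Add new_guy to the pal list; handshake is 'start' or 'finish'."""
        if new_guy in self.pals:
            return
        self.pals.append(new_guy)
        if handshake == "start":
            # Return handshake: new_guy adds this bucket as well.
            new_guy.insert(self, "finish")
        if len(self.pals) > self.max_friends:
            self.refit(new_guy)

    def refit(self, new_guy):
        """Shed one pal so the list is no longer overstuffed."""
        for pal in self.pals:
            # Assumption: skip new_guy itself and anyone new_guy already knows.
            if pal is not new_guy and pal not in new_guy.pals:
                self.pals.remove(pal)         # pop the first pal unknown to new_guy
                pal.pop_from_list(self)       # that pal forgets this bucket
                pal.insert(new_guy, "start")  # and links up with new_guy instead
                return

    def pop_from_list(self, other):
        if other in self.pals:
            self.pals.remove(other)

    def pal_names(self):
        return [pal.name for pal in self.pals]


# Slide 48: three buckets, two pals each.
a, b, c = (Bucket(name, 2) for name in "abc")
a.pals, b.pals, c.pals = [b, c], [a, c], [c_pal for c_pal in (b, a)]

# Slides 49-50: the tool adds new bucket D by calling C.insert(D, start).
d = Bucket("d", 2)
c.insert(d, "start")

for bucket in (a, b, c, d):
    print(bucket.name, bucket.pal_names())
```

The final pal lists match slide 59: A keeps b,c; B and C both end with a,d; D ends with c,b. Larger networks can trigger chains of refits, and this sketch makes no claim about how the original code bounds them.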
72Communications Cost: Building the Network
- Total communications cost to build the network
- b2 - f - (b-f)2
- b = number of buckets
- f = number of friends
73Communications Cost: Building the Network
74Communications Cost: Traversing the Network
- Flood algorithm
- b(f-1) - f 2
- Spanning Tree
- b - 1
- Upper bound on the diameter of the network
- (b-f) /2 1
- (typically much less)
75Network Resiliency
- The network can survive at least f-1 node
(bucket) or edge (communications) failures and
still remain fully connected
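The resiliency claim can be checked by brute force on a small network. The sketch below uses the four-bucket, two-friend network from slide 59 and tests whether it stays connected after every possible set of f-1 node failures. The function names is_connected and survives_failures are invented for this example, and pal links are treated as usable in both directions, which holds for this network because every pal relationship in it is mutual.

```python
from collections import deque
from itertools import combinations


def is_connected(pals, removed=frozenset()):
    """Breadth-first search over surviving buckets, following pal links."""
    nodes = [name for name in pals if name not in removed]
    if not nodes:
        return True
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        current = queue.popleft()
        for pal in pals[current]:
            if pal not in removed and pal not in seen:
                seen.add(pal)
                queue.append(pal)
    return len(seen) == len(nodes)


def survives_failures(pals, k):
    """True if the network stays connected after any k node failures."""
    return all(
        is_connected(pals, frozenset(group))
        for group in combinations(pals, k)
    )


# Final pal lists from slide 59 (4 buckets, f = 2 friends each).
network = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["c", "b"]}
f = 2
print(survives_failures(network, f - 1))  # any single failure
print(survives_failures(network, f))      # some pairs of failures split it
```

For this network the first check succeeds and the second fails (losing a and d together isolates b from c), which is consistent with f-1 being the guaranteed tolerance rather than a typical one.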
76Cf. Other P2P Projects
- Gnutella
- also O(N^2) to build the network
- currently don't know the exact message cost
- Chord, Tapestry, etc.
- content addressable networking
- hash function to map keys to locations
- orthogonal to buckets
77Chatting
- the stored objects are inactive until invoked
- if no one communicates with the object, it never wakes up, can never perform self-tests, etc.
- solution
- circulate a number of tokens through the network to ensure that everyone is woken up
- buckets can perform a number of administrative tasks at these times
- Core to solving the migration issue
78Communications Tokens
79Flocking
- Craig Reynolds, "Flocks, Herds, and Schools: A Distributed Behavioral Model", SIGGRAPH 87
- Observations
- flocks, schools, herds, etc. exhibit many desirable properties
- scale-free
- neighbors matter, not total size of flock
- no upper bound
- flocks are never full
- flocks, etc. can be modeled with simple rules
- Collision Avoidance: avoid collisions with nearby flockmates
- Velocity Matching: attempt to match velocity with nearby flockmates
- Flock Centering: attempt to stay close to nearby flockmates
80Flocking for DLs
Rules | Flocking Boids | Flocking Buckets
Collision Avoidance | avoid collisions with nearby flockmates | not overwriting one's own copies nor the copies of other buckets (i.e., namespace collision avoidance)
Velocity Matching | attempt to match velocity with nearby flockmates | deleting copies of oneself to provide space for late arrivals in a storage location
Flock Centering | attempt to stay close to nearby flockmates | following others to available storage locations
81Flocking (9,4)
82Flocking (10,4)
83Future Work
- Friends
- optimizing the connections while sending the communication token
- convert to small world graph over time
- repair faults in the network
- Family
- types
- active
- passive
- provenance / authenticity
84Other Applications for Smart Objects
- communication pulses will share the location of new services
- format conversion (migration)
- new repository locations (refreshing)
- submit logs, alerts, other messages to people, services, etc.
- self-arranging displays
85Self-Arranging Displays For Buckets
- premise: to have the links in the object reflect the community's preferences
- real-time computation; no log file processing
- Bollen & Nelson, Adaptive Networks of Smart Objects
- http://www.cs.odu.edu/mln/pubs/bollenj_adaptive.pdf
86Hebbian Learning
http://b1?method=display&referer=b1&redirect=http://b2?method=display%26referer=http://b1
http://b2?method=display&referer=b2&redirect=http://b1?method=display%26redirect=http://b3?method=display%26referer=http://b2
87Initial Experiment
- Elango, Bollen & Nelson, "Dynamic Linking of Smart Digital Objects Based on User Navigation Patterns"
- http://www.arxiv.org/abs/cs.DL/0401029
- http://www.acm.org/technews/articles/20046/0607m.html#item8
- Take top 50 all-time pop music bands
- from Spin Magazine's top 50 bands of all time
- From each band, take 2 related bands
- according to allmusic.com
- Create network of 150 buckets with band info (metadata from allmusic.com)
- Randomize the network
- each band points to 3 other randomly selected bands
- Get people to traverse the network
88Sample Screenshot
89Sample Results
90From the Initial Node Public Enemy
91Reviews and Summaries of Related Work
- Fedora, Warwick Framework, Kahn-Wilensky Framework, VERS, Multivalent Documents, Cryptolopes, etc.
- NASA TM 211426
- http://techreports.larc.nasa.gov/ltrs/PDF/2001/tm/NASA-2001-tm211426.pdf
- Journal of Digital Libraries, forthcoming special issue on Complex Digital Objects
- CFP: http://www.dljournal.org/
92Risks
- Why have these projects met with limited success or are only used in niche applications?
- it is one thing to add a layer to your DL, but changing the structure of your first-class objects incurs a level of short-term risk
- however, even the most well-thought out componentized DL is subject to long-term risks
- cf. ICASE DL
93Conclusions
- Smart objects are an idea whose time has come
- natural progression of DL R&D
- Smart objects will play a fundamental role in digital preservation
- More info on preservation
- http://www.cs.odu.edu/mln/teaching/cs791-s04/