Self-Preserving Digital Objects - PowerPoint PPT Presentation

Slides: 94
Provided by: Michael50
Learn more at: https://www.cs.odu.edu

1
Self-Preserving Digital Objects
  • Michael L. Nelson
  • mln_at_cs.odu.edu
  • http://www.cs.odu.edu/mln/
  • Several Slides from
  • Terry L. Harrison
  • University of Southern California
  • 6/15/04

2
Outline
  • History
  • Preservation
  • Archives vs. Objects
  • Smart Objects Dumb Archives
  • Self-Preserving Objects

3
My DL History
  • 1992 - work begins on the first-generation
    Langley Technical Report Server (LTRS)
  • 1993 - WWW version of LTRS
  • http://techreports.larc.nasa.gov/ltrs/
  • work w/ ODU on WATERS
  • 1994 - NASA Technical Report Server (NTRS)
  • distributed searching of many LTRS-like servers
    (20 separate nodes, all NASA centers)
  • http://techreports.larc.nasa.gov/cgi-bin/NTRS
  • 1996 - NACA Technical Report Server (NACATRS)
  • http://naca.larc.nasa.gov/
  • 1996 - Joint research in DLs with ODU begins
  • 1997 - NCSTRL (clustering, buckets)
  • 1999 - OAI-PMH development begins
  • 2001 - Arc, DP9, Archon, Kepler, etc.
  • 2002 - OAI-PMH version of the NTRS
  • http://ntrs.nasa.gov/

4
History
  • ca. 1994 - 1995: a LaRC researcher, upon seeing
    LTRS, remarked
  • "all of these reports are nice, but what we
    really want is the data..."
  • ca. 1995 - present: many reports in LTRS start to
    include data files, appendices, software and
    other information types
  • NACATRS: the scanned nature of the reports implies
    that 1 report = N files
  • N > (pages * 3) + 2

5
NASA STI
  • Formal publications cover a decreasing percentage
    of NASA's STI output
  • most DLs focus only on formal publications
  • Informal STI is maintained only by a network
    of collegial distribution
  • aging and shrinking workforce weakens this
    network
  • Customers want much more than formal publication
  • rather than stretch the meaning of report or
    document, define a new object for DL
    transactions

6
STI Observations
  • Media formats are instantiations of a more
    general class of information
  • Most DLs are uni-format, following the obsolete
    media boundaries of their non-digital
    predecessors
  • Separate but equal DLs considered harmful
  • customer should not have to re-integrate what
    should never have been de-integrated...
  • institutional knowledge is being lost because we
    don't have a publishing vector established

7
Pyramid of Scientific and Technical Information
(STI)
Information is created in a variety of formats.
Formal publications, the focus of most DL
projects, are supported by a pyramid of informal
information.
8
Information Lost Over Time
9
Content is King
The information content is more important than
the systems used for its storage, management and
retrieval
Objects should not be locked in specific DLs
or archives
10
Prelude to OAI
  • I met Herbert Van de Sompel in April 1999...
  • we spoke of a demonstration project he had in
    mind, for which he had received sponsorship from
    Paul Ginsparg and Rick Luce
  • We wanted to demonstrate a multi-disciplinary DL
    that leveraged the large number of high quality,
    yet often isolated, tech report servers, e-print
    servers, etc.
  • most digital libraries (DLs) had grown up along
    single disciplines or institutions
  • little to no interoperability: isolated DL
    gardens

11
Universal Preprint Service
  • A cross-archive DL that provides services on
    a collection of metadata harvested from multiple
    archives
  • Nelson: NCSTRL, a modified version of Dienst
  • support for clustering
  • support for buckets
  • Krichel: ReDIF metadata format
  • Van de Sompel: SFX linking
  • Demonstrated at Santa Fe, NM, October 21-22, 1999
  • http://web.archive.org/web//http://ups.cs.odu.edu/
  • D-Lib Magazine, 6(2) 2000 (2 articles)
  • http://www.dlib.org/dlib/february00/02contents.html
  • UPS was soon renamed the Open Archives Initiative
    (OAI): http://www.openarchives.org/

12
UPS Participants
totals ca. July 1999
13
Buckets: Information Surrogates in UPS
  • Limitations on intellectual property, file size,
    transmission time, system load, etc. caused us to
    focus on metadata only
  • Metadata was collected into buckets, with pointers
    back to the data files (still at the original sites)

14
Value-Added Services Attached to the Buckets
  • SFX Reference Linking Service, developed at
    Univ. of Ghent, Belgium
  • provides a layer of indirection between reference
    services available at a local site and the object
    itself
  • SFX buttons are attached to the buckets themselves
  • communication occurs between the SFX server and
    the bucket
  • Adding other services to the buckets is easy...
15
Data and Service Providers
  • Data Providers
  • publishing into an archive
  • Self-describing archives
  • Much of the learning about the constituent UPS
    archives occurred out of band
  • providing methods for metadata harvesting
  • provide non-technical context for sharing
    information also
  • Service Providers
  • harvest metadata from providers
  • implement user interface to data

Even if these are done by the same DL, these are
distinct roles
16
Metadata Harvesting
  • Move away from distributed searching
  • Extract metadata from various sources
  • Build services on local copies of metadata
  • data remains at remote repositories

[Figure: a user searches for "cfd applications" against a local copy of
metadata that was harvested offline from several independently maintained
nodes. All searching, browsing, etc. is performed on the local metadata;
individual nodes can still support direct user interaction.]
17
Result: OAI
  • The OAI was the result of the demonstration and
    discussion during the Santa Fe meeting
  • OAI: a bunch of people, a religion, a cult, etc.
  • OAI Protocol for Metadata Harvesting (OAI-PMH):
    the protocol created and maintained by the OAI
  • Initial focus was on federating collections of
    scholarly e-print materials
  • however, interest grew and the scope and
    application of OAI-PMH expanded to become a
    generic bulk metadata transport protocol
  • Note
  • OAI-PMH is only about metadata -- not full text!
  • but what is metadata vs. full-text?
  • OAI is neutral with respect to the nature of the
    metadata or the resources the metadata describes
  • read: commercial publishers have an interest in
    OAI-PMH too...
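
The harvesting model described above can be sketched in a few lines of Python. This is only an illustration, not the code used by any of the projects named in these slides: the repository base URL, the record identifier, and the sample response are made up, and only the `ListRecords` verb, the `oai_dc` metadata prefix, and the two XML namespaces come from the OAI-PMH and Dublin Core specifications.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL for a repository base URL."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def extract_titles(response_xml):
    """Return (identifier, title) pairs found in a ListRecords response."""
    root = ET.fromstring(response_xml)
    pairs = []
    for record in root.iter(f"{{{OAI_NS}}}record"):
        identifier = record.findtext(f"{{{OAI_NS}}}header/{{{OAI_NS}}}identifier")
        title = record.findtext(f".//{{{DC_NS}}}title")
        pairs.append((identifier, title))
    return pairs


# Hypothetical response, trimmed to the elements the sketch reads.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A made-up technical report</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

request_url = list_records_url("http://example.org/oai")
local_copy = extract_titles(SAMPLE_RESPONSE)
```

A service provider would fetch `request_url` offline, store the parsed pairs, and run all searching and browsing against that local copy; only the metadata moves, while the data stays at the remote repository.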

18
A Look Back at UPS
  • Primary outcome of the meeting was the OAI and
    OAI-PMH
  • Krichel: ReDIF metadata
  • still in use and being developed
  • Van de Sompel: SFX
  • OpenURL (NISO Standard)
  • SFX is a commercial OpenURL resolver marketed by
    Ex Libris
  • Nelson
  • NCSTRL begat Arc (arc.cs.odu.edu) and others
  • Buckets?

19
Componentized Digital Libraries
. . .
20
Preservation
  • RLG Report: Preserving Digital Information: Final
    Report and Recommendations
  • http://www.rlg.org/ArchTF/
  • refreshing - moving to new media
  • considered (comparatively) easy
  • migrating - transitioning to new systems,
    formats, idioms
  • considered hard

21
Really Long Term Preservation
  • Migration is very hard, to be sure
  • but given sufficient demand, this can be
    accomplished
  • cf. early 1980s game emulation
  • http://www.intellivisionlives.com/
  • http://stella.atari.org/
  • Refreshing may actually be harder
  • or at least intrinsically bound to the migration
    problem
  • http://web.archive.org/web/19980128071544/http://www.usc.edu/
  • http://web.archive.org/web//http://library.usc.edu/
  • http://web.archive.org/web/19971210220634/http://lib-www.lanl.gov/

22
Preservation Metrics So Far
  • Nelson & Allen
  • 3% decay of objects in DLs
  • http://www.dlib.org/dlib/january02/nelson/01nelson.html
  • Lawrence, et al.
  • 3% decay of URLs included in technical papers
  • http://www.neci.nec.com/lawrence/papers/persistence-computer01/bib.html
  • Koehler
  • 33% of URLs unstable or partially unstable
  • http://InformationR.net/ir/4-4/paper60.html
  • Kahle
  • average URL lasts 44 days
  • http://www.hackvan.com/pub/stig/articles/trusted-systems/0397kahle.html
  • Spinellis
  • 28% loss of 5-8 year old URLs from CACM / IEEE
    Computer
  • http://citeseer.ist.psu.edu/spinellis03decay.html

23
Case Study: ICASE
  • Institute for Computer Applications in Science
    and Engineering
  • independent research institute affiliated with
    NASA Langley Research Center
  • www.icase.edu
  • years of operation 1972-2002
  • combined with other LaRC institutes, rolled into
    the National Institute for Aerospace (NIA)
  • ICASE Report Series
  • pre-prints/e-prints of all ICASE affiliated
    authors
  • also issued as NASA Contractor Reports
  • Dienst was used for report management workflow
  • Harrison, Zubair & Nelson, JCDL '03: Dienst <->
    OAI-PMH gateway

24
NIA Transition
  • At first, all files at www.icase.edu were lost
  • then, the site was brought back online
  • but how well do DLs survive bulk-transfer?

25
Whither the ICASE Digital Library?
it appears to be reinstated
but not completely
26
How Long is Forever?
  • Average human life span (from
    http://www.che.uc.edu/acs/archives/cintacs/vol39no5/vol39no5.html)
  • female 78
  • male 77
  • Average Fortune 500 company lifespan (from
    http://www.businessweek.com/chapter/degeus.htm)
  • 40 - 50 years
  • Universities?
  • U.S. Government agency or institution?
  • what about individual labs?
  • NASA Zero Base Review
  • U.S. Military BRAC

27
Self-Preservation
  • Objects should be prepared to outlive the people
    and institutions that are charged with their
    well-being
  • Many areas of risk
  • company, agency, university, etc. ceases to exist
  • funding cut
  • person dies
  • disaster (hurricane, earthquake, etc.)
  • malicious attack

28
P2P Model
  • Applicable for scientific and technical
    information?
  • Napster, Gnutella, etc. rely on the repetitive
    nature of popular culture media (songs, movies,
    etc.) to ensure the availability of items
  • a bubble of recent and popular interest
  • this assumption is probably not valid in STI DLs
  • cf. popularity(HBO) >> popularity(AMC)

29
Smart Objects, Dumb Archives
DA
Buckets
Guildford Protocol
Fedora? METS?
OAI-PMH
???
30
Key Concepts in the Architecture of the Digital
Library
  • next 9 slides taken from Bill Arms's seminal
    article in the inaugural issue of D-Lib Magazine
  • http://www.dlib.org/dlib/July95/07arms.html

31
The technical framework exists within a legal and
social framework
  • DLs no longer represent systems specific to
    academics or information specialists
  • content influences how the DL is used
  • architecture must allow the implementation of
    various policies

32
Understanding of digital library concepts is
hampered by terminology
  • common English != professional English
  • multiple professional jargons too
  • What do these words mean to you?
  • copy
  • publish
  • content
  • document
  • work

33
The underlying architecture should be separate
from the content stored in the library
  • general purpose functions and content-specific
    functions should be separated
  • TL analogy
  • the more specific the bookshelf is to holding
    actual books, the harder it is to repurpose the
    bookshelf in the future

34
Names and identifiers are the basic building
block for the digital library
  • names != addresses
  • in any DL architecture diagram, (almost)
    anything that can be drawn can be named
  • consider the impact that handles/DOIs have had on
    the publishing/DL community

35
Digital library objects are more than collections
of bits
  • objects = metadata + data
  • but what is metadata?
  • don't ask hard questions

figure 2 in http://www.dlib.org/dlib/July95/07arms.html
36
The digital library object that is used is
different from the stored object
  • what you store is not necessarily what you get
  • storage and dissemination are separate events,
    and can represent separate formats
  • also, potentially separate from the
    application-specific format

37
Users want intellectual works, not digital objects
  • The DL architect's needs should not inconvenience
    the user's needs
  • recombination of objects
  • what is an object in your world view?

figure 4 in http://www.dlib.org/dlib/July95/07arms.html
38
Repositories must look after the information they
hold
  • Repository Access Protocol
  • Kahn & Wilensky Framework
  • http://www.cnri.reston.va.us/home/cstr/arch/k-w.html

figure 3 in http://www.dlib.org/dlib/July95/07arms.html
39
Objects vs. Archives
  • This is the tenet that I question
  • Most DL objects still bound to the applications
    that generate or render the objects

40
Design Goals
  • Aggregation
  • DLs should be shielded from the transient nature
    of file formats
  • Prevent information hemorrhaging by archiving all
    data types
  • Intelligence
  • Aggregation (above) implies code, why stop at
    passive objects? Make objects smart...
  • Bucket-to-bucket and bucket-to-tool intelligence

41
Design Goals
  • Self-Sufficiency
  • Maximum autonomy and survivability: fully
    self-sufficient buckets
  • Option to internally store all needed materials
  • Mobility
  • Why should an information object be stuck in one
    place?
  • Mobility for replication, workflow, data
    collection

42
Design Goals
  • Heterogeneity
  • One size does not fit all...
  • Different buckets for different applications,
    sites, disciplines, etc.
  • Archive Independence
  • Focus is on information, not yet another DL
    system
  • does not require an archive to function
  • Work with everything; break nothing

43
Smart Objects
  • aggregate
  • metadata
  • data
  • methods to operate on the metadata/data
  • http://www.cs.odu.edu/mln/teaching/cs595-f03/?method=getMetadata&type=all
  • http://www.cs.odu.edu/mln/teaching/cs595-f03/?method=listMethods
  • http://www.cs.odu.edu/mln/teaching/cs595-f03/?method=listPreference
  • (cheat) http://www.cs.odu.edu/mln/teaching/cs595-f03/bucket/bucket.xml
  • assumptions
  • Perl
  • http server

44
Internal Structure
jaga.cs.odu.edu/home/mln/public_html/teaching/cs695-f03 ls
bucket/  CVS/  index.cgi
jaga.cs.odu.edu/home/mln/public_html/teaching/cs695-f03 ls bucket/
bucket.xml  content/  CVS/  lib/  logs/  methods/
jaga.cs.odu.edu/home/mln/public_html/teaching/cs695-f03 ls bucket/content/
syllabus.txt         week1readings.html    week5readings.html
week10readings.html  week1week-01.ppt      week6readings.html
week11readings.html  week2readings.html    week7readings.html
week12readings.html  week2week-02.ppt      week8readings.html
week13readings.html  week3assignment1.ppt  week9readings.html
week14readings.html  week3readings.html
week15readings.html  week3week-03.ppt
jaga.cs.odu.edu/home/mln/public_html/teaching/cs695-f03 ls bucket/lib
CVS/  EZXML.pm  mime.e  style.css
jaga.cs.odu.edu/home/mln/public_html/teaching/cs695-f03 ls bucket/logs/
access.log  CVS/  mylog.log
jaga.cs.odu.edu/home/mln/public_html/teaching/cs695-f03 ls bucket/methods/
addElement.pl     getElement.pl   listMethods.pl     setPreference.pl
CVS/              get_log.pl      listPreference.pl
deleteElement.pl  getlog.pl       log.pl
display.pl        getMetadata.pl  setMetadata.pl
jaga.cs.odu.edu/home/mln/public_html/teaching/cs695-f03
45
Examples
  • 1.6.X bucket
  • http://ntrs.nasa.gov/
  • http://www.cs.odu.edu/mln/phd/
  • 2.0 buckets
  • http://www.cs.odu.edu/mln/teaching/cs595-f03/
  • http://www.cs.odu.edu/lutken/smalltest/1120/
  • 3.0 buckets (under development)
  • http://www.cs.odu.edu/jallen/buckets/
  • uses MPEG-21 DIDLs
  • cf. http://www.dlib.org/dlib/november03/bekaert/11bekaert.html

46
Self-Preservation
  • Objectives
  • knowledge of the system state not required
  • i.e. -- you don't need to keep track of where
    everything is
  • the knowledge required for each object should be
    minimal
  • actually, the required number of friends should
    be finite, even in very large systems

47
Friends and Family
  • Friends
  • connections to other buckets
  • Family
  • connections to replications of you

48
Scenario: 3 buckets / 2 pals each
C
Pals b,a
49
We want to add new_guy (D)
C
Pals b,a
D
Pals(none)
50
Tool calls C.insert(D,start)
C
Pals b,a
D
Pals
51
D is added to C's pal list
C
Pals b,a,(d)
C pal_list is overstuffed
D
Pals
52
Return handshake: D.insert(C, finish)
C
Pals b,a,(d)
C pal_list is overstuffed
D
Pals c
53
C refits pal list

C
Pals b,a,(d)
C pal_list is overstuffed
D
Pals c
54
Refit step 1: C.pop_1st_pal not known by (D)

C
Pals b,a,d
Now C pal_list is overstuffed
D
Pals c
55
Refit step 2: B.pop_pal(C)
C
Pals a,d
D
Pals c
56
Refit step 2: B.insert(D, start)
B
Pals a,d
C
Pals a,d
D
Pals c
57
Refit step 3: D.insert(B, finish)
B
Pals a,d
C
Pals a,d
D
Pals c,b
58
Refit step 3: D.insert(B, finish)
B
Pals a,d
C
Pals a,d
D
Pals c,b
59

A
Pals b,c
B
Pals a,d
C
Pals a,d
D
Pals c,b
60
10 Buckets, 4 Friends Step 2
61
10 Buckets, 4 Friends Step 3
62
10 Buckets, 4 Friends Step 4
63
10 Buckets, 4 Friends Step 5
64
10 Buckets, 4 Friends Step 6
65
10 Buckets, 4 Friends Step 7
66
10 Buckets, 4 Friends Step 8
67
10 Buckets, 4 Friends Step 9
68
10 Buckets, 4 Friends Step 10
69
20 Buckets, 4 Friends
70
100 Buckets, 10 Friends
71
Building the Network
Bucket
    this_node_name
    max_friend_size
    list_of_pals

insert (new_guy, string handshake)
    // Adds new_guy to this bucket's pal list
    // handshake = "start" or "finish"
    if (I know new_guy)
        return
    else
        put new_guy at end of my pal list
        if (handshake == "start")
            new_guy.insert(this_node_name, "finish")
        if (my pal list is now overstuffed)
            refit()
    return list_of_pals

refit ()
    // To keep pal_list from being overstuffed
    read in new_guy's pal list
    pop_1st_pal_list()           // I remove 1st pal "Y" from my list that's
                                 // not present in new_guy's pal list
    Y.pop_from_list(Me)          // Have "Y" pop "Me"
    Y.insert(new_guy, "start")   // Y adds new_guy to his list;
                                 // this will call new_guy to add "Y" as well
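
The insert/refit pseudocode can be made concrete with a small in-process Python sketch. It is an interpretation, not the author's implementation: buckets here are plain objects calling each other directly rather than communicating over HTTP, `refit` takes `new_guy` as an explicit argument, and the "nothing to shed" branch is an assumption because the slides do not cover that case.

```python
class Bucket:
    def __init__(self, name, max_friends):
        self.name = name
        self.max_friends = max_friends
        self.pals = []  # ordered pal list of Bucket objects

    def pal_names(self):
        return [pal.name for pal in self.pals]

    def insert(self, new_guy, handshake):
        """Add new_guy to this bucket's pal list; handshake is 'start' or 'finish'."""
        if new_guy is self or new_guy in self.pals:
            return  # already known, nothing to do
        self.pals.append(new_guy)
        if handshake == "start":
            new_guy.insert(self, "finish")  # return handshake
        if len(self.pals) > self.max_friends:
            self.refit(new_guy)

    def refit(self, new_guy):
        """Shed the first pal that new_guy does not know and hand it to new_guy."""
        for pal in self.pals:
            if pal is not new_guy and pal not in new_guy.pals:
                break
        else:
            return  # assumption: no candidate to shed, leave the list as it is
        self.pals.remove(pal)
        if self in pal.pals:
            pal.pals.remove(self)  # Y.pop_from_list(Me)
        pal.insert(new_guy, "start")  # Y and new_guy become pals


# The 3 buckets / 2 pals scenario from the walkthrough slides.
a, b, c = Bucket("a", 2), Bucket("b", 2), Bucket("c", 2)
a.pals, b.pals, c.pals = [b, c], [a, c], [c_pal for c_pal in (b, a)]
d = Bucket("d", 2)
c.insert(d, "start")  # the tool call that adds the new bucket
```

After the call, the pal lists match the final slide of the walkthrough: A keeps b,c; B and C both hold a,d; D holds c,b.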
72
Communications Cost Building the Network
  • Total communications cost to build the network
  • b^2 - f - (b - f)^2
  • b = # of buckets
  • f = # of friends
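
As a quick arithmetic check, the build cost can be evaluated for the network sizes shown on the earlier slides. The sketch assumes the flattened expression on this slide reads b^2 - f - (b - f)^2, and the function name is invented here. Expanding the square gives the equivalent form f(2b - f - 1), which makes it easier to see that the cost grows linearly in b for a fixed number of friends.

```python
def build_cost(b, f):
    """Messages needed to build a network of b buckets with f friends each.

    Assumes the slide's formula is b^2 - f - (b - f)^2.
    """
    if not 0 < f < b:
        raise ValueError("expected 0 < f < b")
    return b * b - f - (b - f) ** 2


def build_cost_expanded(b, f):
    """Same quantity after expanding the square: f * (2b - f - 1)."""
    return f * (2 * b - f - 1)


# (buckets, friends) pairs taken from the example slides.
costs = {(b, f): build_cost(b, f) for b, f in [(10, 4), (20, 4), (100, 10)]}
```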

73
Communications Cost Building the Network
74
Communications Cost: Traversing the Network
  • Flood algorithm
  • b(f-1) - f + 2
  • Spanning Tree
  • b - 1
  • Upper bound on the diameter of the network
  • (b-f)/2 + 1
  • (typically much less)

75
Network Resiliency
  • The network can survive at least f-1 node
    (bucket) or edge (communications) failures and
    still remain fully connected
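
The resiliency claim can be checked by brute force on a small network. The sketch below is an illustration only: it uses the four-bucket, two-friend network that results from the insertion walkthrough, removes every possible set of k buckets, and tests whether the survivors are still connected. It checks node failures only, and the function names are invented here.

```python
from itertools import combinations


def is_connected(graph, removed=frozenset()):
    """True if all buckets not in `removed` can still reach each other."""
    nodes = [name for name in graph if name not in removed]
    if not nodes:
        return True
    seen = {nodes[0]}
    stack = [nodes[0]]
    while stack:
        name = stack.pop()
        for pal in graph[name]:
            if pal not in removed and pal not in seen:
                seen.add(pal)
                stack.append(pal)
    return len(seen) == len(nodes)


def survives_node_failures(graph, k):
    """True if the network stays connected after any k buckets fail."""
    return all(is_connected(graph, set(failed)) for failed in combinations(graph, k))


# Pal lists from the end of the insertion walkthrough (f = 2).
FOUR_BUCKETS = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["c", "b"]}
one_failure_ok = survives_node_failures(FOUR_BUCKETS, 1)   # f - 1 failures
two_failures_ok = survives_node_failures(FOUR_BUCKETS, 2)  # f failures
```

With f = 2 this small network tolerates any single bucket failure, but losing two buckets (for example a and d) can separate the remaining pair, which is consistent with the f-1 bound.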

76
Cf. Other P2P Projects
  • Gnutella
  • also O(N^2) to build the network
  • currently don't know the exact message cost
  • Chord, Tapestry, etc.
  • content addressable networking
  • hash function to map keys to locations
  • orthogonal to buckets

77
Chatting
  • the stored objects are inactive until invoked
  • if no one communicates with the object, it never
    wakes up, can never perform self-tests, etc.
  • solution
  • circulate a number of tokens through the network
    to ensure that everyone is woken up
  • buckets can perform a number of administrative
    tasks at these times
  • Core to solving the migration issue
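
One way to picture the wake-up token is a depth-first walk over the pal lists that invokes a callback at each bucket it reaches. This is a simplified sketch under stated assumptions: a single token instead of several, a global view of the pal lists instead of bucket-to-bucket messages, and a callback (`wake`) standing in for whatever self-tests or administrative tasks a bucket would run.

```python
def circulate_token(graph, start, wake):
    """Pass a token depth-first through the network, waking each bucket once."""
    visited = []
    seen = set()

    def visit(name):
        seen.add(name)
        visited.append(name)
        wake(name)  # the bucket performs its administrative tasks here
        for pal in graph[name]:
            if pal not in seen:
                visit(pal)

    visit(start)
    return visited


# Pal lists from the end of the insertion walkthrough.
FOUR_BUCKETS = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["c", "b"]}
woken = []
order = circulate_token(FOUR_BUCKETS, "a", woken.append)
```

Since each bucket only needs its own pal list to forward the token, the walk fits the earlier objective that knowledge of the whole system state is not required.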

78
Communications Tokens
79
Flocking
  • Craig Reynolds, "Flocks, Herds, and Schools: A
    Distributed Behavioral Model", SIGGRAPH '87
  • Observations
  • flocks, schools, herds, etc. exhibit many
    desirable properties
  • scale-free
  • neighbors matter, not total size of flock
  • no upper bound
  • flocks are never full
  • flocks, etc. can be modeled with simple rules
  • Collision Avoidance: avoid collisions with nearby
    flockmates
  • Velocity Matching: attempt to match velocity with
    nearby flockmates
  • Flock Centering: attempt to stay close to nearby
    flockmates

80
Flocking for DLs
Rules | Flocking Boids | Flocking Buckets
Collision Avoidance | avoid collisions with nearby flockmates | not overwriting one's own copies nor the copies of other buckets (i.e., namespace collision avoidance)
Velocity Matching | attempt to match velocity with nearby flockmates | deleting copies of oneself to provide space for late arrivals in a storage location
Flock Centering | attempt to stay close to nearby flockmates | following others to available storage locations
81
Flocking (9,4)
82
Flocking (10,4)
83
Future Work
  • Friends
  • optimizing the connections while sending the
    communication token
  • convert to small world graph over time
  • repair faults in the network
  • Family
  • types
  • active
  • passive
  • provenance / authenticity

84
Other Applications for Smart Objects
  • communication pulses will share the location of
    new services
  • format conversion (migration)
  • new repository locations (refreshing)
  • submit logs, alerts, other messages to people,
    services, etc.
  • self-arranging displays

85
Self-Arranging Displays For Buckets
  • premise: to have the links in the object reflect
    the community's preferences
  • real-time computation; no log file processing
  • Bollen & Nelson, Adaptive Networks of Smart
    Objects
  • http://www.cs.odu.edu/mln/pubs/bollenj_adaptive.pdf

86
Hebbian Learning
http://b1?methoddisplayrefererb1redirecthttp://b2?method\display\26refererhttp://b1
http://b2?methoddisplayrefererb2redirecthttp://b1?method\display\26redirecthttp://b3?methoddisplay\26refererhttp://b2
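
The URLs above appear to route each click through the referring bucket, so that the bucket itself can record which of its links was followed. A heavily simplified sketch of that reinforcement idea follows; the class, the `reward` value, and the ordering rule are assumptions for illustration and are not taken from the cited paper.

```python
class LinkingBucket:
    """Toy bucket that reorders its outgoing links by how often they are followed."""

    def __init__(self, name, links):
        self.name = name
        self.weights = {target: 0.0 for target in links}

    def redirect(self, target, reward=1.0):
        """Record a traversal from this bucket to `target`, then return the target."""
        self.weights[target] = self.weights.get(target, 0.0) + reward
        return target

    def display_links(self):
        """Links in descending weight; ties keep their original order."""
        return sorted(self.weights, key=lambda target: -self.weights[target])


# Hypothetical traversals: users leave b1 for b3 twice and for b4 once.
b1 = LinkingBucket("b1", ["b2", "b3", "b4"])
b1.redirect("b3")
b1.redirect("b3")
b1.redirect("b4")
ordering = b1.display_links()
```

The weights are updated at the moment of traversal, which matches the "real-time computation, no log file processing" premise of the self-arranging display.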
87
Initial Experiment
  • Elango, Bollen & Nelson, "Dynamic Linking of
    Smart Digital Objects Based on User Navigation
    Patterns"
  • http://www.arxiv.org/abs/cs.DL/0401029
  • http://www.acm.org/technews/articles/20046/0607m.html#item8
  • Take top 50 all-time pop music bands
  • from Spin Magazine's top 50 bands of all time
  • From each band, take 2 related bands
  • according to allmusic.com
  • Create network of 150 buckets with band info
    (metadata from allmusic.com)
  • Randomize the network
  • each band points to 3 other randomly selected
    bands
  • Get people to traverse the network

88
Sample Screenshot
89
Sample Results
90
From the Initial Node Public Enemy
91
Reviews and Summaries of Related Work
  • Fedora, Warwick Framework, Kahn-Wilensky
    Framework, VERS, Multivalent Documents,
    Cryptolopes, etc.
  • NASA TM 211426
  • http://techreports.larc.nasa.gov/ltrs/PDF/2001/tm/NASA-2001-tm211426.pdf
  • Journal of Digital Libraries, forthcoming special
    issue on Complex Digital Objects
  • CFP: http://www.dljournal.org/

92
Risks
  • Why have these projects met with limited success
    or are only used in niche applications?
  • it is one thing to add a layer to your DL, but
    changing the structure of your first-class
    objects incurs a level of short-term risk
  • however, even the most well-thought out
    componentized DL is subject to long-term risks
  • cf. ICASE DL

93
Conclusions
  • Smart objects are an idea whose time has come
  • natural progression of DL R&D
  • Smart objects will play a fundamental role in
    digital preservation
  • More info on preservation
  • http://www.cs.odu.edu/mln/teaching/cs791-s04/