Title: Smart Objects and Dumb but Open Archives
1 Smart Objects and Dumb (but Open!) Archives
- Michael L. Nelson
- NASA Langley Research Center
- University of North Carolina
- mln_at_ils.unc.edu
- http://www.ils.unc.edu/mln/
- Cornell University
- CS 502: Computing Methods for DLs
- Guest Lecture
- April 20, 2001
2 Outline
- History / problem statement / motivation
- Buckets: smart objects
- Bucket implementation
- Smart objects, dumb archives (SODA)
- Open Archives Initiative (OAI)
- Bucket Communication Space (BCS)
- Future work
- Conclusions
3 NASA Scientific and Technical Information
- Formal publications cover a decreasing percentage of NASA's STI output
- most DLs focus only on formal publications
- Informal STI is maintained only by a network of collegial distribution
- an aging and shrinking workforce weakens this network
- Customers want much more than formal publications
- rather than stretch the meaning of report or document, define a new object for DL transactions
4 NASA LaRC Publications 1991-1999
5 STI Observations
- Media formats are instantiations of a more general class of information
- Most DLs are uni-format, following the obsolete media boundaries of their non-digital predecessors
- Separate but equal DLs considered harmful
- customers should not have to re-integrate what should never have been de-integrated...
- institutional knowledge is being lost because we don't have a publishing vector established
6 Information Lost Over Time
7 Pyramid of Scientific and Technical Information (STI)
Information is created in a variety of formats.
Formal publications, the focus of most DL
projects, are supported by a pyramid of informal
information.
8 The Tyranny of the Archive (Content is King)
The information content is more important than
the systems used for its storage, management and
retrieval
Objects should not be locked in specific DLs
or archives
9 Buckets
- Aggregation + intelligence = buckets
- metadata + data + methods = buckets
- Object-oriented, intelligent agent archival entities
- A collection of all information about a project
- manuscripts
- software
- data
- images
- video
- etc.
- Customizable, heterogeneous
- buckets can learn, talk, and coordinate
- buckets control terms and conditions, display, etc. -- not the archive that holds them
10 Design Goals
- Aggregation
- DLs should be shielded from the transient nature of file formats
- Prevent information hemorrhaging by archiving all data types
- Intelligence
- Aggregation (above) implies code, so why stop at passive objects? Make objects smart...
- Bucket-bucket and bucket-tool intelligence
11 Design Goals
- Self-Sufficiency
- Maximum autonomy and survivability: fully self-sufficient buckets
- Option to internally store all needed materials
- Mobility
- Why should an information object be stuck in one place?
- Mobility for replication, workflow, data collection
12 Design Goals
- Heterogeneity
- One size does not fit all...
- Different buckets for different applications, sites, disciplines, etc.
- Archive Independence
- Focus is on information, not yet another DL system
- does not require an archive to function
- Work with everything; break nothing
13 Bucket Architecture
A Typical NASA DL Bucket -- Other Bucket Types Possible!
14 A Sample Bucket
4 packages
- report (4 elements)
- appendix (2 elements)
- contact information (2 elements)
- translation (1 element)
15 Another Sample Bucket
2 packages
- pre-print (2 elements)
- pointer to SFX reference linking service for published and pre-print versions (2 elements)
This bucket display is for the Universal Preprint Service: https://ups.cs.odu.edu/
16 Heterogeneous Buckets
- Buckets are envisioned to be locally modifiable and extensible
- There is a default set of public methods defined for buckets
- additional methods can be locally defined
- Buckets can learn new methods
- new default methods, or locally defined extensions
- override default methods
17 Bucket Messages
- Sample bucket messages
- http://home.larc.nasa.gov/mln/bucket/
- http://home.larc.nasa.gov/mln/bucket/?method=display - invokes the default display method
- http://home.larc.nasa.gov/mln/bucket/?method=metadata - returns the metadata for the bucket
- http://home.larc.nasa.gov/mln/bucket/?method=display&pkg_name=report&element_name=tr1253.pdf - displays a single element
- http://home.larc.nasa.gov/mln/bucket/?method=list_methods - lists all the methods that this bucket implements
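A minimal sketch of how such messages could be composed and decoded, assuming they are ordinary HTTP query strings with a `method` argument followed by method-specific arguments. The helper names `bucket_message` and `parse_bucket_message` are hypothetical, and treating a bare bucket URL as a request for `display` is an assumption, not something the slide states.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def bucket_message(bucket_url, method=None, **args):
    """Build a bucket message URL (hypothetical helper).

    With no method, the bare bucket URL is returned unchanged.
    """
    if method is None:
        return bucket_url
    query = urlencode({"method": method, **args})
    return f"{bucket_url}?{query}"


def parse_bucket_message(url):
    """Split a bucket message URL into (method, arguments)."""
    params = {k: v[0] for k, v in parse_qs(urlsplit(url).query).items()}
    # Assumption: a message without a method falls back to "display".
    method = params.pop("method", "display")
    return method, params


BUCKET = "http://home.larc.nasa.gov/mln/bucket/"
element_url = bucket_message(
    BUCKET, "display", pkg_name="report", element_name="tr1253.pdf"
)
```

Because every request is just a URL, any HTTP client can talk to a bucket without bucket-specific software.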
18 Bucket Methods
BUCKET DEMO
Most methods take various arguments; see Appendix B in the dissertation: http://home.larc.nasa.gov/mln/phd/
Supersedes Table 1 in NASA TM-1998-208419
19 Bucket Metadata
- Due to Dienst heritage, uses RFC-1807 format
- this is likely to change in the future
- Metadata defines the content and appearance of the bucket
- bibliographic and control information
- But can store any format of metadata
- bucket does not need to understand all formats
- special purpose, legacy or obscure formats
- COSATI, MARC
- http://foo.edu/bucket-27/?method=metadata&format=cosati
20 Current Implementation
- File system semantics
- 1 bucket = 1 directory
- 1 package = 1 directory in bucket
- 1 element = 1 file in package directory
- index.cgi is the bucket lid
- http dependency for access
- index.cgi written in Perl 5.0
- Methods should not change when the implementation changes
- still use http as transport protocol
- Oracle, Lotus Notes implementations being developed
- Java, PHP, Tcl, etc. implementations possible too
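The file system semantics above can be sketched in a few lines, assuming only what the slide states: each subdirectory of a bucket is a package, each file in a package directory is an element, and `index.cgi` sits at the top of the bucket. The function names `list_bucket` and `has_lid` are hypothetical, and the sketch is in Python rather than the Perl of the actual implementation.

```python
import os


def list_bucket(bucket_dir):
    """Return {package_name: [element file names]} for one bucket directory."""
    packages = {}
    for entry in sorted(os.listdir(bucket_dir)):
        pkg_path = os.path.join(bucket_dir, entry)
        if os.path.isdir(pkg_path):  # 1 package = 1 directory in the bucket
            packages[entry] = sorted(
                name
                for name in os.listdir(pkg_path)
                if os.path.isfile(os.path.join(pkg_path, name))  # 1 element = 1 file
            )
    return packages


def has_lid(bucket_dir):
    """True if the bucket directory contains its index.cgi lid."""
    return os.path.isfile(os.path.join(bucket_dir, "index.cgi"))
```

Since the layout is plain directories and files, a bucket can be copied or moved with ordinary file tools.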
21 Bucket Structure
Bucket
index.cgi
default bucket packages
- _method.pkg: source files for methods
- _http.pkg: http dependency files
- _tc.pkg: terms and conditions
- _log.pkg: logs
- _md.pkg: metadata
- _state.pkg: bucket state
sample bucket payload
- report.pkg
- appendix.pkg
- software.pkg
- testdata.pkg
22 Systems Tested
23 SODA: Smart Objects, Dumb Archives
- Objects are more important than the archive that holds them
- The object should be the authority on its contents, not an archive
- We envision a general shift of intelligence from archives to the objects themselves
- DL protocols should find, index, and search -- not know about file formats, policy, terms and conditions, etc.
24 Presentation Responsibility Shifts From Dienst to Buckets
25 SODA
- Current DLs have tight integration between the data object, the archive it is in, and the interface used to access it
- 1-1 model between DL and archive
- By decoupling these functions, we can separate their development and maintenance
- N-M model between DLs and archives
26 SODA
Students and Educators, Library Users, Researchers, Corporate Developers, ...
DLSs Building From Archives and Buckets: NASA DLS, Avionics DLS, NCSTRL, ...
Archives Managing Buckets: NASA Archive, CoRR, ACM Archive, ...
All Known Buckets (in archives and out)
27 Dumb Archive
- Archives should be little more than set managers
- Several possible archive candidates
- LDAP, Dienst, Guildford Protocol, others
- Our implementation: a modified bucket, DA
- it has all of the regular bucket methods, plus
- da_list - list all buckets in the archive
- da_put - put a bucket in an archive
- da_delete - delete a bucket from an archive
- da_info - archive-level metadata
- da_get - redirect to this bucket
All operations are modulo appropriate terms and conditions (TC)
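To illustrate how little a set manager needs to do, here is a minimal in-memory sketch of the five `da_` methods. The class name `DumbArchive`, the id-to-URL dictionary, and the return values are assumptions made for illustration; the actual DA is a modified bucket, and this sketch omits the terms and conditions checks entirely.

```python
class DumbArchive:
    """A set manager for buckets; all other intelligence stays in the buckets."""

    def __init__(self, name):
        self.name = name
        self._holdings = {}  # hypothetical mapping: bucket id -> bucket URL

    def da_put(self, bucket_id, bucket_url):
        """Put a bucket in the archive."""
        self._holdings[bucket_id] = bucket_url

    def da_delete(self, bucket_id):
        """Delete a bucket from the archive (no error if it is not held)."""
        self._holdings.pop(bucket_id, None)

    def da_list(self):
        """List all buckets in the archive."""
        return sorted(self._holdings)

    def da_info(self):
        """Return archive-level metadata."""
        return {"name": self.name, "bucket_count": len(self._holdings)}

    def da_get(self, bucket_id):
        """Return the URL to redirect the client to, or None if not held."""
        return self._holdings.get(bucket_id)


archive = DumbArchive("demo-archive")
archive.da_put("bucket-27", "http://foo.edu/bucket-27/")
```

Note that `da_get` only hands back a location: display, metadata, and access policy remain the bucket's responsibility.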
28 DA Structure
Bucket
index.cgi
default bucket packages
- _method.pkg: source files for methods
- _http.pkg: http dependency files
- _tc.pkg: terms and conditions
- _log.pkg: logs
- _md.pkg: metadata
- _state.pkg: bucket state
no bucket payload
DA data structures
- holdings.pkg: package for DA
- does not use packages/elements
- scalability concerns
- uses GDBM/NDBM files (hashes)
- 1 hash per argument to da_put
29 OAI as a Dumb Archive
- Originally used a separate protocol implementation for the dumb archive
- Now using the metadata harvesting protocol defined by the Open Archives Initiative (OAI)
- OAI evolved from the Universal Preprint Service (UPS)
- http://www.dlib.org/dlib/february00/vandesompel-ups/02vandesompel-ups.html
- http://ups.cs.odu.edu/
- http://www.openarchives.org/
- OAI does not require smart objects, but does create a dumb archive layer
30 OAI Bucket Structure
Bucket
index.cgi
default bucket packages
- _method.pkg: source files for methods
- _http.pkg: http dependency files
- _tc.pkg: terms and conditions
- _log.pkg: logs
- _md.pkg: metadata
- _state.pkg: bucket state
bucket payload is DL specific support library
- oai: the oai.pl element is a support library that defines access for the specific DL
In addition to the 30 bucket methods, each OAI verb is implemented as a separate method
31 Intelligence
- Shift of responsibility into the data objects opens up an entire new class of applications
- data objects as intelligent agents
- Premise: instead of having the data objects do nothing while they patiently wait to be accessed, have them do something useful while waiting ...
32 Bucket Communication Space
- Provides a well known, shared memory model for buckets to communicate
- communications model: Linda (Javaspace)
- Applications
- Bucket matching
- the same author (separated by publisher, time)
- different authors (finding similar works)
- Metadata scrubbing
- Format translation (metadata, images, documents)
- Bucket messaging
- including broadcast and multicast
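For readers unfamiliar with the Linda model, the sketch below shows the general idea of a shared tuple space: participants write tuples into a common space and others retrieve them by pattern. It is a generic, non-blocking illustration and not the BCS implementation; the class `TupleSpace`, the use of `None` as a wildcard, and the example tuple layout are all assumptions.

```python
class TupleSpace:
    """A toy Linda-style shared space (non-blocking variant)."""

    def __init__(self):
        self._tuples = []

    def out(self, tup):
        """Write a tuple into the space."""
        self._tuples.append(tuple(tup))

    @staticmethod
    def _matches(pattern, tup):
        # None in the pattern acts as a wildcard for that position.
        return len(pattern) == len(tup) and all(
            p is None or p == v for p, v in zip(pattern, tup)
        )

    def rd(self, pattern):
        """Return the first matching tuple without removing it, else None."""
        for tup in self._tuples:
            if self._matches(pattern, tup):
                return tup
        return None

    def take(self, pattern):
        """Remove and return the first matching tuple, else None.

        This plays the role of Linda's "in", which is a reserved word in Python.
        """
        for i, tup in enumerate(self._tuples):
            if self._matches(pattern, tup):
                return self._tuples.pop(i)
        return None


space = TupleSpace()
# Hypothetical tuple layout: (kind, bucket id, payload)
space.out(("register", "bucket-27", "http://foo.edu/bucket-27/"))
space.out(("message", "bucket-27", "please refresh metadata"))
```

Because senders and receivers share only the space, a bucket can leave a message for another bucket without either one knowing where the other is hosted.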
33 BCS Structure
Bucket
index.cgi
default bucket packages
- _method.pkg: source files for methods
- _http.pkg: http dependency files
- _tc.pkg: terms and conditions
- _log.pkg: logs
- _md.pkg: metadata
- _state.pkg: bucket state
no bucket payload
BCS data structures, conversion programs
- bcs.pkg: package for BCS
- uses GDBM/NDBM files (hashes)
- for registr
- included programs
- mdt (metadata conversion)
- Image Alchemy (image conversion)
34 BCS Methods
- bcs_list, bcs_register, bcs_unregister
- set management
- bcs_convert_image
- wrapper for Image Alchemy program
- no bucket hooks in 1.6
- bcs_convert_metadata
- wrapper for mdt program
- bucket hooks in 1.6
35 BCS Methods
BCS DEMO
- bcs_message
- search, search/replace, search/mesg functionality
- bcs_similarity
- all x all comparison
- n x all comparison (n = 1 .. all)
- adjustable threshold for similarity
36 Similarity Results from UPS
- NACA - 3036 documents
- UPS Math - 3831 documents
- for 6867 documents, ran for 42 hours (561k comparisons / hour)
- used default value of 0.85 for similarity
- NACA - 159 similar documents
- UPS Math - 35 similar documents
- No similarity between NACA and UPS Math
- Optimizations
- clustering of collection
- distributed computation of similarity matrix
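The run time figures can be checked with a little arithmetic, assuming the all x all comparison evaluates each unordered pair of documents once. That assumption is not stated on the slide, but it reproduces the quoted rate; the function name `pairwise_comparisons` is hypothetical.

```python
def pairwise_comparisons(n_docs):
    """Number of unordered document pairs in an all x all comparison."""
    return n_docs * (n_docs - 1) // 2


naca_docs, ups_math_docs = 3036, 3831
total_docs = naca_docs + ups_math_docs      # 6867 documents
pairs = pairwise_comparisons(total_docs)    # 23,574,411 pairs
rate_per_hour = pairs / 42                  # about 561k comparisons per hour

print(f"{total_docs} documents -> {pairs:,} pairs, {rate_per_hour:,.0f} per hour")
```

The quadratic growth in pairs is why the slide lists clustering and distributed computation as optimizations: doubling the collection roughly quadruples the work.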
37 Future Work
- Alternate implementations for buckets
- Java, Oracle, Python, Tcl
- Alternate API access
- CORBA, SOAP
- New functionality for buckets
- Standard packages / elements for revisions, citations, checksums
38 Future Work
- Security, authentication, TC
- investigate X.509, Kerberos, MD5
- formalize ACLs
- Specialized buckets
- discipline- or data-specific buckets
- computational buckets
- software reuse, RPC-like support
- Reduce the centralization of the BCS
- cf. Berkeley's xFS serverless file system
- http://now.cs.berkeley.edu/Xfs/xfs.html
- Passive -> Active objects
- e.g., LANL's Active Recommendation Project
- http://www.c3.lanl.gov/rocha/lww/
39 Impact
- SODA
- significant immediate interoperability benefits
- frees the object from the tyranny of the archive
- Bucket aggregation: evolutionary concept
- benefit begins immediately, continues indefinitely
- no more information hemorrhaging
- Bucket intelligence: revolutionary concept
- benefit is mid- to long-term
- full impact unknown: a flexible framework will allow others to innovate
- make archived objects active, not passive
40 If bucket software doesn't work out, we'll market products with Phil's likeness (thanks to Rod Waid for Phil)
http://dlib.cs.odu.edu/
41 Emergency backup slides...
42 Why Digital Libraries?
digital library: collection of information both digitized and organized -- M. Lesk, 1997
- Why not just use the WWW?
- WWW by itself has low archival management characteristics
- Why not use an RDBMS?
- In the same way that a card catalog is not a TL, an RDBMS is not a DL; it is a candidate technology for use in DLs
- A DL is the union of the content and the services defined on the content
43 Digital Libraries?
- Ultimately, the product of a research institution is information
- information objects (generally publications) are frequently the only tangible measure of research output
- (compressing an entire body of literature)
- Traditional libraries (TLs) are expensive, and less and less information is being archived by fewer and fewer TLs
44 TLs vs. DLs
- DLs are clearly better than TLs at
- dissemination, storing a variety of information
- However, TL objects are more survivable
- Who will archive the research information?
- the publishers?
- the institutions?
- the authors?
- Will the average DL object still be accessible in 10 years?
45 Cosine Correlation With Frequency Term Weighting
$$\mathrm{similarity}(d_j, d_k) = \frac{\sum_{i=1}^{n} (td_{ij} \times td_{ik})}{\sqrt{\sum_{i=1}^{n} td_{ij}^{2} \times \sum_{i=1}^{n} td_{ik}^{2}}}$$
where
$td_{ij}$ = the ith term in the vector for document j
$td_{ik}$ = the ith term in the vector for document k
$n$ = the number of unique terms in the data set
Adapted from Harman (1992), originally from Salton & Lesk (1968)
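A minimal sketch of cosine correlation over raw term-frequency vectors, to make the measure concrete. It is not the bcs_similarity code: tokenizing on whitespace, lowercasing, and treating a score at or above the threshold as "similar" are assumptions, and the function names are hypothetical. The default threshold of 0.85 is the value quoted in the similarity results.

```python
import math
from collections import Counter
from itertools import combinations


def term_frequencies(text):
    """Term-frequency vector for a document (assumes whitespace tokens)."""
    return Counter(text.lower().split())


def cosine_similarity(doc_j, doc_k):
    """Cosine correlation between the term-frequency vectors of two documents."""
    tf_j, tf_k = term_frequencies(doc_j), term_frequencies(doc_k)
    # Terms missing from either document contribute zero to the numerator.
    numerator = sum(tf_j[t] * tf_k[t] for t in tf_j.keys() & tf_k.keys())
    denominator = math.sqrt(
        sum(v * v for v in tf_j.values()) * sum(v * v for v in tf_k.values())
    )
    return numerator / denominator if denominator else 0.0


def similar_pairs(docs, threshold=0.85):
    """All x all comparison over {doc_id: text}; returns pairs at or above threshold."""
    results = []
    for (id_a, text_a), (id_b, text_b) in combinations(sorted(docs.items()), 2):
        score = cosine_similarity(text_a, text_b)
        if score >= threshold:
            results.append((id_a, id_b, score))
    return results
```

For example, "smart objects dumb archives" and "smart objects open archives" share three of four equally weighted terms, giving a score of 0.75, which falls below the 0.85 default.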
46 Similarity Matrix