Smart Objects and Dumb but Open Archives - PowerPoint PPT Presentation

1
Smart Objects and Dumb (but Open!) Archives
  • Michael L. Nelson
  • NASA Langley Research Center
  • University of North Carolina
  • mln@ils.unc.edu
  • http://www.ils.unc.edu/mln/
  • Cornell University
  • CS 502: Computing Methods for DLs
  • Guest Lecture
  • April 20, 2001

2
Outline
  • History / problem statement / motivation
  • Buckets: smart objects
  • Bucket implementation
  • Smart objects, dumb archives (SODA)
  • Open Archives Initiative (OAI)
  • Bucket Communication Space (BCS)
  • Future work
  • Conclusions

3
NASA Scientific and Technical Information
  • Formal publications cover a decreasing percentage
    of NASA's STI output
  • most DLs focus only on formal publications
  • Informal STI is maintained only by a network
    of collegial distribution
  • aging and shrinking workforce weakens this
    network
  • Customers want much more than formal publication
  • rather than stretch the meaning of report or
    document, define a new object for DL
    transactions

4
NASA LaRC Publications 1991-1999
5
STI Observations
  • Media formats are instantiations of a more
    general class of information
  • Most DLs are uni-format, following the obsolete
    media boundaries of their non-digital
    predecessors
  • Separate but equal DLs considered harmful
  • customer should not have to re-integrate what
    should never have been de-integrated...
  • institutional knowledge is being lost because we
    don't have a publishing vector established

6
Information Lost Over Time
7
Pyramid of Scientific and Technical Information
(STI)
Information is created in a variety of formats.
Formal publications, the focus of most DL
projects, are supported by a pyramid of informal
information.
8
The Tyranny of the Archive (Content is King)
The information content is more important than
the systems used for its storage, management and
retrieval
Objects should not be locked in specific DLs
or archives
9
Buckets
  • Aggregation + intelligence = buckets
  • metadata + data + methods = buckets
  • Object-oriented, intelligent agent archival
    entities
  • A collection of all information about a project
  • manuscripts, software, data, images, video, etc.
  • Customizable, heterogeneous
  • buckets can learn, talk, and coordinate
  • buckets control terms and conditions, display,
    etc. -- not the archive that holds them

10
Design Goals
  • Aggregation
  • DLs should be shielded from the transient nature
    of file formats
  • Prevent information hemorrhaging by archiving all
    data types
  • Intelligence
  • Aggregation (above) implies code, so why stop at
    passive objects? Make objects smart...
  • Bucket-bucket and bucket-tool intelligence

11
Design Goals
  • Self-Sufficiency
  • Maximum autonomy and survivability: fully
    self-sufficient buckets
  • Option to internally store all needed materials
  • Mobility
  • Why should an information object be stuck in one
    place?
  • Mobility for replication, workflow, data
    collection

12
Design Goals
  • Heterogeneity
  • One size does not fit all...
  • Different buckets for different applications,
    sites, disciplines, etc.
  • Archive Independence
  • Focus is on information, not yet another DL
    system
  • does not require an archive to function
  • Work with everything, break nothing

13
Bucket Architecture
A Typical NASA DL Bucket -- Other Bucket Types
Possible!
14
A Sample Bucket
4 packages
  • report (4 elements)
  • appendix (2 elements)
  • contact information (2 elements)
  • translation (1 element)
15
Another Sample Bucket
2 packages
  • pre-print (2 elements)
  • pointer to SFX reference linking service for
    published and pre-print versions (2 elements)
This bucket display is for the Universal Preprint
Service: https://ups.cs.odu.edu/
16
Heterogeneous Buckets
  • Buckets are envisioned to be locally modifiable
    and extensible
  • There is a default set of public methods defined
    for buckets
  • additional methods can be locally defined
  • Buckets can learn new methods
  • new default methods, or locally defined
    extensions
  • override default methods
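
One way to picture "learning" and overriding is a per-bucket method table. The following is a minimal sketch under that assumption, not the bucket software itself: the class `Bucket`, the helpers `learn` and `invoke`, and the method name `checksum` are all hypothetical.

```python
class Bucket:
    """Toy bucket whose public methods live in a per-bucket table."""

    def __init__(self):
        # Default methods; the names mirror methods shown elsewhere in the talk.
        self._methods = {
            "display": lambda **args: "default display",
            "list_methods": lambda **args: sorted(self._methods),
        }

    def learn(self, name, func):
        """Add a new method, or override a default one with a local version."""
        self._methods[name] = func

    def invoke(self, name, **args):
        """Dispatch a message to whichever method is currently registered."""
        return self._methods[name](**args)


bucket = Bucket()
bucket.learn("display", lambda **args: "local display")  # override a default
bucket.learn("checksum", lambda **args: "not computed")  # hypothetical extension
print(bucket.invoke("display"))
print(bucket.invoke("list_methods"))
```

Because dispatch goes through the table, a locally defined method is indistinguishable to the caller from a default one.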

17
Bucket Messages
  • Sample bucket messages
  • http://home.larc.nasa.gov/mln/bucket/
  • http://home.larc.nasa.gov/mln/bucket/?method=display
  • invokes the default display method
  • http://home.larc.nasa.gov/mln/bucket/?method=metadata
  • returns the metadata for the bucket
  • http://home.larc.nasa.gov/mln/bucket/?method=display&pkg_name=report&element_name=tr1253.pdf
  • displays a single element
  • http://home.larc.nasa.gov/mln/bucket/?method=list_methods
  • lists all the methods that this bucket
    implements
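
A minimal sketch of how a client might assemble such messages, assuming a bucket message is an ordinary HTTP query string whose `method` argument names the method and whose remaining arguments are passed through. The helper `bucket_message` is hypothetical and not part of the bucket software.

```python
from urllib.parse import urlencode

# Bucket address taken from the slide; the helper below is illustrative only.
BUCKET = "http://home.larc.nasa.gov/mln/bucket/"


def bucket_message(bucket_url, method=None, **args):
    """Build a bucket message URL (hypothetical helper)."""
    if method is None:
        return bucket_url  # no method given: return the bare bucket URL
    query = urlencode({"method": method, **args})
    return f"{bucket_url}?{query}"


print(bucket_message(BUCKET))
print(bucket_message(BUCKET, "metadata"))
print(bucket_message(BUCKET, "display",
                     pkg_name="report", element_name="tr1253.pdf"))
```

Since the message is just a URL, any HTTP client (or a browser) can talk to a bucket without bucket-specific software.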

18
Bucket Methods
BUCKET DEMO
Most methods take various arguments; see Appendix
B in the dissertation:
http://home.larc.nasa.gov/mln/phd/
Supersedes Table 1 in NASA/TM-1998-208419
19
Bucket Metadata
  • Due to Dienst heritage, uses RFC-1807 format
  • this is likely to change in the future
  • Metadata defines the content and appearance of
    the bucket
  • bibliographic and control information
  • But can store any format of metadata
  • bucket does not need to understand all formats
  • special purpose, legacy or obscure formats
  • COSATI, MARC
  • http://foo.edu/bucket-27/?method=metadata&format=cosati

20
Current Implementation
  • File system semantics
  • 1 bucket = 1 directory
  • 1 package = 1 directory in bucket
  • 1 element = 1 file in package directory
  • index.cgi is the bucket lid
  • http dependency for access
  • index.cgi written in Perl 5.0
  • Methods should not change when the implementation
    changes
  • still use http as transport protocol
  • Oracle, Lotus Notes implementations being
    developed
  • Java, PHP, Tcl, etc. implementations possible too
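
The file system semantics can be illustrated with a short sketch that walks a bucket directory and reports its packages and elements. This is an illustration under stated assumptions, not the Perl implementation: `describe_bucket` is a hypothetical helper, and the sample names `report.pkg` and `tr1253.pdf` are borrowed from other slides only to populate a throwaway directory.

```python
import tempfile
from pathlib import Path


def describe_bucket(bucket_dir):
    """Map each package directory to the sorted names of its element files."""
    bucket = Path(bucket_dir)
    return {
        pkg.name: sorted(f.name for f in pkg.iterdir() if f.is_file())
        for pkg in sorted(bucket.iterdir())
        if pkg.is_dir()  # packages are directories; index.cgi is skipped
    }


# Build a throwaway bucket so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    bucket = Path(tmp) / "bucket"                            # 1 bucket = 1 directory
    (bucket / "report.pkg").mkdir(parents=True)              # 1 package = 1 directory
    (bucket / "report.pkg" / "tr1253.pdf").write_bytes(b"")  # 1 element = 1 file
    (bucket / "index.cgi").write_text("#!/usr/bin/perl\n")   # the bucket "lid"
    layout = describe_bucket(bucket)

print(layout)
```

Nothing here depends on a database or an archive: copying the directory copies the whole bucket, which is what makes the mobility and self-sufficiency goals cheap to meet.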

21
Bucket Structure
Bucket
  • index.cgi
  • default bucket packages:
  • _method.pkg (source files for methods)
  • _http.pkg (http dependency files)
  • _tc.pkg (terms and conditions)
  • _log.pkg (logs)
  • _md.pkg (metadata)
  • _state.pkg (bucket state)
  • sample bucket payload:
  • report.pkg, appendix.pkg, software.pkg, testdata.pkg
22
Systems Tested
23
SODA: Smart Objects, Dumb Archives
  • Objects are more important than the archive that
    holds them
  • The object should be the authority on its
    contents, not an archive
  • We envision a general shift of intelligence from
    archives to the objects themselves
  • DL protocols should find, index, and search --
    not know about file formats, policy, terms and
    conditions, etc.

24
Presentation Responsibility Shifts From Dienst to
Buckets
25
SODA
  • Current DLs have tight integration between the
    data object, the archive it is in, and the
    interface used to access it
  • 1-1 model between DL and archive
  • By decoupling these functions, we can separate
    their development and maintenance
  • N-M model between DLs and archives

26
SODA
  • Students and Educators, Library Users,
    Researchers, Corporate Developers, ...
  • DLSs Building From Archives and Buckets: NASA
    DLS, Avionics DLS, NCSTRL, ...
  • Archives Managing Buckets: NASA Archive, CoRR,
    ACM Archive, ...
  • All Known Buckets (in archives and out)
27
Dumb Archive
  • Archives should be little more than set managers
  • Several possible archive candidates
  • LDAP, Dienst, Guildford Protocol, others
  • Our implementation: a modified bucket, DA
  • it has all of the regular bucket methods, plus
  • da_list - list all buckets in the archive
  • da_put - put a bucket in an archive
  • da_delete - delete a bucket from an archive
  • da_info - archive-level metadata
  • da_get - redirect to this bucket

All operations are subject to the appropriate terms and conditions (TC)
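
To make the "set manager" idea concrete, here is a toy in-memory sketch. It only mirrors the five method names listed above; the real DA is a modified bucket reached over http, and the arguments, return values, and the class name `DumbArchive` are assumptions made for illustration. Terms-and-conditions checks are omitted.

```python
class DumbArchive:
    """Toy stand-in for a dumb archive: it only manages a set of bucket URLs."""

    def __init__(self, info):
        self._info = dict(info)  # archive-level metadata
        self._buckets = set()    # holdings: bucket URLs, never bucket content

    def da_put(self, bucket_url):
        """Put a bucket in the archive."""
        self._buckets.add(bucket_url)

    def da_delete(self, bucket_url):
        """Delete a bucket from the archive."""
        self._buckets.discard(bucket_url)

    def da_list(self):
        """List all buckets in the archive."""
        return sorted(self._buckets)

    def da_info(self):
        """Return archive-level metadata."""
        return dict(self._info)

    def da_get(self, bucket_url):
        """Redirect to the bucket; the archive never serves content itself."""
        if bucket_url not in self._buckets:
            raise KeyError(bucket_url)
        return ("redirect", bucket_url)


archive = DumbArchive({"name": "example archive"})
archive.da_put("http://foo.edu/bucket-27/")
print(archive.da_list())
print(archive.da_get("http://foo.edu/bucket-27/"))
```

Everything about display, formats, and policy stays with the bucket at the other end of the redirect, which is the point of keeping the archive dumb.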
28
DA Structure
Bucket
  • index.cgi
  • default bucket packages:
  • _method.pkg (source files for methods)
  • _http.pkg (http dependency files)
  • _tc.pkg (terms and conditions)
  • _log.pkg (logs)
  • _md.pkg (metadata)
  • _state.pkg (bucket state)
  • no bucket payload
  • holdings.pkg: package for DA data structures
  • does not use packages/elements
  • scalability concerns
  • uses GDBM/NDBM files (hashes)
  • 1 hash per argument to da_put

29
OAI as a Dumb Archive
  • Originally used a separate protocol and
    implementation for the "dumb archive"
  • Now using the metadata harvesting protocol
    defined by the Open Archives Initiative (OAI)
  • OAI evolved from the Universal Preprint Service
    (UPS)
  • http://www.dlib.org/dlib/february00/vandesompel-ups/02vandesompel-ups.html
  • http://ups.cs.odu.edu/
  • http://www.openarchives.org/
  • OAI does not require smart objects, but does
    create a dumb archive layer

30
OAI Bucket Structure
Bucket
  • index.cgi
  • default bucket packages:
  • _method.pkg (source files for methods)
  • _http.pkg (http dependency files)
  • _tc.pkg (terms and conditions)
  • _log.pkg (logs)
  • _md.pkg (metadata)
  • _state.pkg (bucket state)
  • oai: the bucket payload is a DL-specific support
    library
  • the oai.pl element is a support library that
    defines access for the specific DL
  • in addition to the 30 bucket methods, each OAI
    verb is implemented as a separate method
31
Intelligence
  • Shift of responsibility into the data objects
    opens up an entire new class of applications
  • data objects as intelligent agents
  • Premise: instead of having the data objects do
    nothing while they patiently wait to be accessed,
    have them do something useful while waiting ...

32
Bucket Communication Space
  • Provides a well known, shared memory model for
    buckets to communicate
  • communications model: Linda (Javaspace)
  • Applications
  • Bucket matching
  • the same author (separated by publisher, time)
  • different authors (finding similar works)
  • Metadata scrubbing
  • Format translation (metadata, images, documents)
  • Bucket messaging
  • including broadcast and multicast
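
The Linda model referenced above can be sketched as a shared tuple space that buckets write to and match against. This is a toy, non-blocking illustration, not the BCS implementation: the class `TupleSpace`, the tuple layout `("author", name, bucket_url)`, and the bucket URLs are assumptions, and `rd` here returns every match rather than Linda's single tuple.

```python
class TupleSpace:
    """Toy Linda-style shared space: out() writes, rd() reads, in_() removes."""

    def __init__(self):
        self._tuples = []

    def out(self, *tup):
        """Write a tuple into the shared space."""
        self._tuples.append(tup)

    @staticmethod
    def _match(pattern, tup):
        # None in a pattern is a wildcard; other fields must match exactly.
        return len(pattern) == len(tup) and all(
            p is None or p == t for p, t in zip(pattern, tup)
        )

    def rd(self, *pattern):
        """Return all matching tuples without removing them."""
        return [t for t in self._tuples if self._match(pattern, t)]

    def in_(self, *pattern):
        """Remove and return the first matching tuple, or None if none match."""
        for i, t in enumerate(self._tuples):
            if self._match(pattern, t):
                return self._tuples.pop(i)
        return None


space = TupleSpace()
# Two buckets from different archives announce the same author.
space.out("author", "Nelson", "http://foo.edu/bucket-27/")
space.out("author", "Nelson", "http://bar.edu/bucket-3/")
# Bucket matching: find every bucket that registered this author.
print(space.rd("author", "Nelson", None))
```

The buckets never address each other directly; they only agree on the shape of the tuples, which is what lets objects held in unrelated archives find one another.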

33
BCS Structure
Bucket
  • index.cgi
  • default bucket packages:
  • _method.pkg (source files for methods)
  • _http.pkg (http dependency files)
  • _tc.pkg (terms and conditions)
  • _log.pkg (logs)
  • _md.pkg (metadata)
  • _state.pkg (bucket state)
  • no bucket payload
  • bcs.pkg: package for BCS data structures and
    conversion programs
  • uses GDBM/NDBM files (hashes)
  • for registr
  • included programs
  • mdt (metadata conversion)
  • Image Alchemy (image conversion)

34
BCS Methods
  • bcs_list, bcs_register, bcs_unregister
  • set management
  • bcs_convert_image
  • wrapper for Image Alchemy program
  • no bucket hooks in 1.6
  • bcs_convert_metadata
  • wrapper for mdt program
  • bucket hooks in 1.6

35
BCS Methods
BCS DEMO
  • bcs_message
  • search, search/replace, search/mesg
    functionality
  • bcs_similarity
  • all x all comparison
  • n x all comparison (n = 1 .. all)
  • adjustable threshold for similarity

36
Similarity Results from UPS
  • NACA - 3036 documents
  • UPS Math - 3831 documents
  • for 6867 documents, ran for 42 hours (561k
    comparisons / hour)
  • used default value of 0.85 for similarity
  • NACA - 159 similar documents
  • UPS Math - 35 similar documents
  • No similarity between NACA and UPS Math
  • Optimizations
  • clustering of collection
  • distributed computation of similarity matrix
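
As a quick sanity check on the figures above, the reported rate is consistent with an all x all run that counts each unordered pair of documents once. That counting convention is an assumption; the slide does not state it.

```python
# Collection sizes and run time as reported on the slide.
naca_docs = 3036
ups_math_docs = 3831
hours = 42

docs = naca_docs + ups_math_docs   # total documents compared
pairs = docs * (docs - 1) // 2     # unordered pairs, each compared once
rate = pairs // hours              # whole comparisons per hour

print(docs, pairs, rate)
```

The pair count grows quadratically with collection size, which is why the listed optimizations (clustering, distributing the similarity matrix) matter for anything larger.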

37
Future Work
  • Alternate implementations for buckets
  • Java, Oracle, Python, Tcl
  • Alternate API access
  • CORBA, SOAP
  • New functionality for buckets
  • Standard packages / elements for revisions,
    citations, checksums

38
Future Work
  • Security, authentication, terms and conditions (TC)
  • investigate X.509, Kerberos, MD5
  • formalize ACLs
  • Specialized buckets
  • discipline- or data-specific buckets
  • computational buckets
  • software reuse, RPC-like support
  • Reduce the centralization of the BCS
  • cf. Berkeley's xFS serverless file system
  • http://now.cs.berkeley.edu/Xfs/xfs.html
  • Passive -> Active objects
  • e.g., LANL's Active Recommendation Project
  • http://www.c3.lanl.gov/rocha/lww/

39
Impact
  • SODA
  • significant immediate interoperability benefits
  • frees the object from the tyranny of the archive
  • Bucket aggregation: evolutionary concept
  • benefit begins immediately, continues
    indefinitely
  • no more information hemorrhaging
  • Bucket intelligence: revolutionary concept
  • benefit is mid- to long-term
  • full impact unknown; a flexible framework will
    allow others to innovate
  • make archived objects active, not passive

40
If bucket software doesn't work out, we'll
market products with Phil's likeness. Thanks to
Rod Waid for Phil.
http://dlib.cs.odu.edu/
41
Emergency backup slides...
42
Why Digital Libraries?
digital library: collection of information both
digitized and organized -- M. Lesk, 1997
  • Why not just use the WWW ?
  • WWW by itself has low archival management
    characteristics
  • Why not use an RDBMS?
  • In the same way that a card catalog is not a
    traditional library (TL), an RDBMS is not a DL;
    it is a candidate technology for use in DLs
  • DL is the union of the content and services
    defined on the content

43
Digital Libraries?
  • Ultimately, the product of a research institution
    is information
  • information objects (generally publications) are
    frequently the only tangible measure of research
    output
  • (compressing an entire body of literature)
  • Traditional libraries (TLs) are expensive, and
    less and less information is being archived by
    fewer and fewer TLs

44
TLs vs. DLs
  • DLs are clearly better than TLs at
  • dissemination, storing a variety of information
  • However, TL objects are more survivable
  • Who will archive the research information?
  • the publishers?
  • the institutions?
  • the authors?
  • Will the average DL object still be accessible in
    10 years?

45
Cosine Correlation With Frequency Term Weighting
$$
\mathrm{similarity}(d_j, d_k) = \frac{\sum_{i=1}^{n} (td_{ij} \times td_{ik})}{\sqrt{\sum_{i=1}^{n} td_{ij}^{2} \times \sum_{i=1}^{n} td_{ik}^{2}}}
$$
where
  • $td_{ij}$ = the ith term in the vector for document j
  • $td_{ik}$ = the ith term in the vector for document k
  • $n$ = the number of unique terms in the data set
Adapted from Harman (1992), originally from
Salton and Lesk (1968)
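
A minimal sketch of the cosine measure named on this slide, using raw term frequencies as the weights. The whitespace tokenizer, the function names, and the sample strings are simplifications introduced here; the 0.85 threshold is the default value quoted in the similarity results.

```python
from collections import Counter
from math import sqrt


def term_vector(text):
    """Raw term-frequency vector; whitespace tokenization is a simplification."""
    return Counter(text.lower().split())


def cosine_similarity(dj, dk):
    """Cosine correlation between two term-frequency vectors."""
    terms = set(dj) | set(dk)
    dot = sum(dj[t] * dk[t] for t in terms)  # Counter yields 0 for absent terms
    norm = sqrt(sum(v * v for v in dj.values()) * sum(v * v for v in dk.values()))
    return dot / norm if norm else 0.0


THRESHOLD = 0.85  # default similarity threshold quoted in the talk

a = term_vector("wing flutter analysis")
b = term_vector("wing flutter test")
score = cosine_similarity(a, b)
print(score, score >= THRESHOLD)
```

Because the measure is symmetric, an all x all run only needs to fill half of the similarity matrix.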
46
Similarity Matrix