OAI: Past, Present and Future


1
OAI: Past, Present and Future
  • Michael L. Nelson mln_at_ils.unc.edu
  • several slides stolen from Herbert Van de Sompel
  • Open Archives Meeting
  • Institute of Mechanical Engineers
  • London
  • 07/11/01

2
Outline
  • Past
  • original goals, participants
  • Present
  • evolution of goals, terms, definitions, current
    status
  • Future
  • observations, use in the U.S., next steps

3
Background
  • I met Herbert Van de Sompel in April 1999...
  • we spoke of a demonstration project he had in
    mind and had received sponsorship from Paul
    Ginsparg and Rick Luce
  • We wanted to demonstrate a multi-disciplinary DL
    that leveraged the large number of high quality,
    yet often isolated, tech report servers, e-print
    servers, etc.
  • most DLs had grown up along single disciplines
  • little to no interoperability, gardens of DLs

4
The Rise and Fall of Distributed Searching
  • wholesale distributed searching, popular at the
    time, is attractive in theory but troublesome in
    practice
  • Davis & Lagoze, JASIS 51(3), pp. 273-80
  • Powell & French, Proc 5th ACM DL, pp. 264-265
  • distributed searching of N nodes still viable,
    but only for small values of N
  • NCSTRL: N > 100, bad
  • NTRS/NIX: N < 20, ok (but could be better)
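
A back-of-envelope sketch of why only small N stays viable. It assumes, purely for illustration, that every node is independently reachable with the same probability p; the 0.99 figure is made up and does not come from the papers cited above. Under that assumption a search hears from all N nodes with probability p^N.

```python
def all_nodes_respond(p: float, n: int) -> float:
    """Chance that all n nodes answer, assuming independent availability p."""
    return p ** n


# p = 0.99 is an illustrative assumption, not a measured availability.
for n in (20, 100):
    print(f"N={n}: {all_nodes_respond(0.99, n):.3f}")
```

With these assumed numbers, a 20-node search is complete roughly four times out of five, while a 100-node search is complete only about a third of the time.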

5
The Rise and Fall of Distributed Searching
  • Other problems of distributed searching (from
    STARTS)
  • source-metadata problem
  • how do you know which nodes to search?
  • query-language problem
  • syntax varies and drifts over time between the
    various nodes
  • rank-merging problem
  • how do you meaningfully merge multiple result
    sets?
  • Temptations
  • centralize all functions
  • everything will be done at X
  • standardize on a single product
  • everyone will use system Y

6
Universal Preprint Service
  • A cross-archive DL that provides services on
    a collection of metadata harvested from multiple
    archives
  • based on NCSTRL+, a modified version of Dienst
  • support for clustering
  • support for buckets
  • Demonstrated at Santa Fe NM, October 21-22, 1999
  • http://ups.cs.odu.edu/
  • D-Lib Magazine, 6(2) 2000 (2 articles)
  • http://www.dlib.org/dlib/february00/02contents.html
  • UPS was soon renamed the Open Archives Initiative
    (OAI): http://www.openarchives.org/

7
UPS Participants
totals ca. July 1999
8
Metadata Harvesting
  • Getting metadata out of archives
  • not all archives support metadata extraction
  • some archives have undocumented metadata
    extraction procedures
  • not all archives support rich criteria for
    extraction
  • single dump concept only
  • Intellectual property and use rights not always
    clear
  • many policies akin to "don't ask, don't tell"

9
Metadata Formatting and Quality
  • Quality problems with
  • record duplication
  • crucial missing fields
  • internal errors
  • ambiguous references to people and places,
    publications
  • Different formats!

Unproven intuition: n digital libraries result
in O(n) metadata formats
10
Buckets: Information Surrogates in UPS
  • Limitations on intellectual property, file size,
    transmission time, system load, etc. caused us to
    focus on metadata only
  • Metadata was collected into buckets, with pointers
    back to the data files (still at the original sites)

11
Value Added Services Attached to the Buckets
  • SFX Reference Linking Service, developed at
    Univ. of Ghent, Belgium
  • provides a layer of indirection between reference
    services available at a local site and the object
    itself
  • SFX buttons are attached to the buckets themselves
  • communication occurs between SFX server and the
    bucket
  • Adding other services to the buckets is easy...
12
Data and Service Providers
  • Data Providers
  • publishing into an archive
  • providing methods for metadata harvesting
  • provide non-technical context for sharing
    information also
  • Service Providers
  • harvest metadata from providers
  • implement user interface to data
  • Even if provided by the same DL, these are
    distinct functions

13
Data and Service Providers
  • Self-describing archives
  • Much of the learning about the constituent UPS
    archives occurred out of band
  • Given an unknown archive, we should be able to
    algorithmically determine the nature of the
    archive

[Figure: two providers compared. One has only an input
interface and a native end-user interface: no machine
based way to extract metadata. The other also has a
native harvesting interface: machine and user interfaces
for extracting metadata.]
14
Data and Service Providers
[Figure: a service provider (native end-user interface;
input and harvesting interfaces optional) sits above two
data providers, each with a native harvesting interface
and an input interface. The data providers' native
end-user interface is optional (e.g., RePEc).]
15
Result: OAI
  • The OAI was the result of the demonstration and
    discussion during the Santa Fe meeting
  • Initial focus was on federating collections of
    scholarly e-print materials
  • however, interest grew and the scope and
    application of OAI expanded to become a
    generic bulk metadata
    transport protocol
  • Note:
  • OAI is only about metadata -- not full text!
  • OAI is neutral with respect to the nature of the
    metadata or the resources the metadata describes
  • read: commercial publishers have an interest in
    OAI too...

16
OAI Timeline Highlights
  • October 21-22, 1999 - initial UPS meeting
  • February 15, 2000 - Santa Fe Convention published
    in D-Lib Magazine
  • precursor to the OAI metadata harvesting protocol
  • June 3, 2000 - workshop at ACM DL 2000 (Texas)
  • August 25, 2000 - OAI steering committee formed,
    DLF/CNI support
  • September 7-8, 2000 - technical meeting at
    Cornell University
  • defined the core of the current OAI metadata
    harvesting protocol
  • September 21, 2000 - workshop at ECDL 2000
    (Portugal)
  • November 1, 2000 - Alpha test group announced
    (15 organizations)
  • January 23, 2001 - OAI protocol 1.0 announced,
    OAI Open Day in the U.S. (Washington DC)
  • purpose: freeze protocol for 12-16 months,
    generate critical mass
  • February 26, 2001 - OAI Open Day in Europe
    (Berlin)
  • July 3, 2001 - OAI protocol 1.1 announced
  • to reflect changes in the W3C's latest XML Schema
    recommendation
  • September 8, 2001 - workshop at ECDL 2001
    (Darmstadt)

17
Open Archives Initiative
18
Open Archives Initiative
  • Open Archival Information System (OAIS): ensuring
    long-term preservation of archival materials
  • Open Archives Initiative (OAI): exposure of
    metadata for harvesting
  • possible: an OAIS w/ an OAI interface
  • http://www.dlib.org/dlib/april01/04editorial.html
  • http://www.dlib.org/dlib/may01/05letters.html
  • http://ssdoo.gsfc.nasa.gov/nost/isoas/us/overview.html
19
OAI Metadata Harvesting Protocol
  • Then
  • OAI harvesting protocol originally a subset of
    the Dienst (NCSTRL) protocol
  • and originally called the Santa Fe Convention
  • originally defined an OAI-specific metadata
    format
  • Now
  • OAI metadata format dropped in favor of
    unqualified Dublin Core
  • other formats possible, but DC is required as
    lowest common denominator
  • No longer dependent on Dienst
  • defined independently (though still easily
    mappable)

20
Overview of OAI Verbs
Verb                  Function
Archival metadata verbs
Identify              description of archive
ListMetadataFormats   metadata formats supported by archive
ListSets              sets defined by archive
Harvesting verbs
ListIdentifiers       OAI unique ids contained in archive
ListRecords           listing of N records
GetRecord             listing of a single record

Most verbs take arguments: dates, sets, ids,
metadata formats and resumption token (for flow
control)
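
OAI requests travel over HTTP, with the verb and its arguments passed as key=value pairs. A minimal sketch of assembling such a request URL follows; the base URL, the helper name `oai_request_url` and the date value are assumptions made for illustration, not part of any particular archive.

```python
from typing import Dict, Optional
from urllib.parse import urlencode


def oai_request_url(base_url: str, verb: str,
                    args: Optional[Dict[str, Optional[str]]] = None) -> str:
    """Return base_url?verb=...&key=value..., skipping arguments set to None."""
    params = {"verb": verb}
    for key, value in (args or {}).items():
        if value is not None:
            params[key] = value
    return base_url + "?" + urlencode(params)


BASE = "http://archive.example.org/oai"  # hypothetical data provider

print(oai_request_url(BASE, "Identify"))
print(oai_request_url(BASE, "ListRecords",
                      {"from": "2001-01-01", "set": "klm", "metadataPrefix": "dc"}))
```

The arguments are passed as a dictionary rather than keyword arguments because `from` is a reserved word in Python.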
21
Supporting protocol requests: Identify
Request (service provider / harvester to data provider / repository):
Identify
Response:
  • Identify / Time / Request
  • Repository identifier
  • Base-URL
  • Admin e-mail
  • OAI protocol version
  • Description

(slide by Herbert Van de Sompel)
22
Supporting protocol requests: ListMetadataFormats
Request (service provider / harvester to data provider / repository):
ListMetadataFormats identifier=oai:mlib:123a
Response:
  • ListMetadataFormats / Time / Request
  • REPEAT
  • Format prefix
  • Format XML schema
  • /REPEAT

23
Supporting protocol requests: ListSets
Request (service provider / harvester to data provider / repository):
ListSets resumptionToken
Response:
  • ListSets / Time / Request
  • REPEAT
  • SetSpec
  • SetName
  • /REPEAT

24
Harvesting requests: ListRecords
Request (service provider / harvester to data provider / repository):
ListRecords from=a until=b set=klm metadataPrefix=dc resumptionToken
Response:
  • ListRecords / Time / Request
  • REPEAT
  • Identifier
  • Datestamp
  • Metadata
  • /REPEAT

25
Harvesting requests: ListIdentifiers
Request (service provider / harvester to data provider / repository):
ListIdentifiers from=a until=b set=klm resumptionToken
Response:
  • ListIdentifiers / Time / Request
  • REPEAT
  • Identifier
  • Datestamp
  • /REPEAT

26
Harvesting requests: GetRecord
Request (service provider / harvester to data provider / repository):
GetRecord identifier=oai:mlib:123a metadataPrefix=dc
Response:
  • GetRecord / Time / Request
  • Identifier
  • Datestamp
  • Metadata

27
Flow Control
  • ListSets, ListIdentifiers, ListRecords are all
    allowed to return partial responses, via a
    combination of
  • resumptionToken: an opaque, archive-defined data
    string that, when passed back to the archive,
    allows the response to begin where it left off
  • each archive defines its own resumptionToken
    syntax; it may have visible semantics or not
  • HTTP status code 503 with Retry-After
  • up to the harvester to understand this code and
    respect it, and up to the archive to enforce it
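
On the harvester side, respecting the 503 might look like the sketch below. The function name, the `max_tries` and `default_wait` values, and the injectable `opener` and `sleep` parameters are assumptions of this sketch (the last two exist only so the logic can be exercised without a network). It handles only a Retry-After value given in seconds and falls back to `default_wait` otherwise.

```python
import time
import urllib.error
import urllib.request


def _wait_seconds(err: urllib.error.HTTPError, default_wait: int) -> int:
    """Read Retry-After as whole seconds; use default_wait if absent or not a number."""
    raw = err.headers.get("Retry-After") if err.headers is not None else None
    try:
        return int(raw)
    except (TypeError, ValueError):
        return default_wait


def fetch_with_retry(url, max_tries=3, default_wait=10,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """GET url; on a 503, wait as asked and try again, up to max_tries attempts."""
    for attempt in range(1, max_tries + 1):
        try:
            with opener(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            # Anything other than 503, or a 503 on the last attempt, is passed on.
            if err.code != 503 or attempt == max_tries:
                raise
            sleep(_wait_seconds(err, default_wait))
    raise ValueError("max_tries must be at least 1")
```

A harvester would call `fetch_with_retry(url)` wherever it would otherwise fetch an OAI request URL directly.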

28
resumptionToken
Scenario: harvesting 277 records in 3 separate
chunks of up to 100 records each
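
The 277-record scenario can be sketched with a toy in-memory archive. Everything here is illustrative: the identifiers are made up, and the token is simply the next offset written as a string, which is one possible archive-defined syntax that the harvester never needs to interpret.

```python
from typing import List, Optional, Tuple

RECORDS = [f"oai:demo:{i}" for i in range(277)]  # hypothetical identifiers
CHUNK = 100                                      # assumed chunk size


def list_identifiers(resumption_token: Optional[str] = None
                     ) -> Tuple[List[str], Optional[str]]:
    """Toy archive: return up to CHUNK ids and a token, or None when finished."""
    start = int(resumption_token) if resumption_token is not None else 0
    end = min(start + CHUNK, len(RECORDS))
    next_token = str(end) if end < len(RECORDS) else None
    return RECORDS[start:end], next_token


def harvest_all() -> Tuple[List[str], int]:
    """Toy harvester: pass each token back unchanged until none is returned."""
    harvested: List[str] = []
    requests = 0
    token: Optional[str] = None
    while True:
        chunk, token = list_identifiers(token)
        harvested.extend(chunk)
        requests += 1
        if token is None:
            return harvested, requests


ids, n_requests = harvest_all()
print(len(ids), n_requests)  # prints "277 3"
```

The three responses carry 100, 100 and 77 records; only the first two come with a token.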
29
OAI Demos
  • Data providers
  • not really meant for end-user interaction, but
    Suleman's Repository Explorer is an excellent
    tool
  • http://purl.org/net/oai_explorer
  • 30 registered data providers
  • http://oaisrv.nsdl.cornell.edu/Register/BrowseSites.pl
  • many being used for internal purposes are not
    registered
  • Service providers
  • Arc, the first known SP harvesting from OAI data
    providers
  • http://arc.cs.odu.edu/
  • 3 registered service providers
  • http://www.openarchives.org/service_provider/oai_sp.htm
  • several more known to be in testing or creation

30
Field of Dreams
  • It should be easy to be a data provider, even if
    it makes more work for the service provider.
  • if enough data providers exist, the service
    providers will come (DPs >> SPs)
  • Open-source / freely available tools
  • drop-in data providers
  • industrial strength: http://www.eprints.org/
  • personal size: http://kepler.cs.odu.edu/
  • tools to make your existing DL a data provider
  • http://www.openarchives.org/tools/tools.htm
  • also OAI-implementers mailing list / mail
    archive!
  • service providers
  • only bits and pieces currently publicly
    available...

31
OAI Observation: Front-End Only
  • No input/registry mechanism
  • OAI harvesting protocol is always a front-end for
    something else
  • filesystem, Dienst, RDBMS, LDAP, etc.
  • convenient for pre-existing DLs, but does not
    address new DLs
  • e.g., "we want to do OAI"
  • Bounds the scope of OAI
  • responsibilities and domain of OAI are still being
    discussed
  • tension between functionality and simplicity

32
OAI Observation: No T&C
  • No terms & conditions (T&C) provisions in protocol
  • assumes all metadata has uniform access rights
  • how to restrict metadata to certain hosts?
  • introducing T&C would increase the scope of
    application, but at the expense of simplicity
  • how expensive do we want to make a
    just-a-front-end protocol?
  • maybe T&C is a good application for sets?

33
OAI Observation: No T&C
  • Possible to use multiple OAI servers in a
    DMZ-like configuration

[Figure: a public OAI server takes OAI requests from
arbitrary hosts and a private OAI server takes OAI
requests from trusted hosts; both draw on the source
database (could even use a separate copy of the
database).]
34
OAI Observation: No T&C
  • Possible to use OAI harvesting protocol in
    closed, restricted systems

[Figure: four DLs (OAI 1, OAI 2, OAI 3, OAI 4); all OAI
requests originate from these 4 DLs.]
35
OAI Observation: Monolithic
  • An OAI server has no protocol-defined concept of
    other OAI servers
  • backups, mirrors, etc. have to be resolved
    outside of the scope of OAI
  • scope vs. complexity again
  • fully connected graph of DLs harvesting from each
    other is unnecessary
  • cf. web crawlers vs. gatherers in U. of Colorado's
    Harvest System
  • 3rd party harvesting interfaces raise more T&C
    and data coherency issues

36
302 Load Balancing
  • Interactive users on main DL machine should not
    be impacted by metadata harvesting
  • "don't take deliveries through the front door"
  • not part of the protocol; defined outside the
    protocol

[Figure: harvester and OAI server; naca.larc.nasa.gov/oai/]
37
OAI Observation: Data Coherency
  • In the interest of OAI implementer simplicity,
    several issues are left for the service provider
    to interpret
  • what is an update vs. addition?
  • in the NACA OAI interface, they are reported as
    the same and it's up to the harvesting system to
    figure it out
  • deletions?
  • it is currently optional for OAI systems to mark
    records as deleted or not
  • still left to the harvester to interpret
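
One way a harvester might interpret what it receives, sketched under assumptions: records are keyed by identifier, datestamps are ISO dates (so plain string comparison orders them), and a hypothetical `deleted` flag is present only when the archive chooses to report deletions. The `Harvested` type and `apply_record` helper are illustrative names, not part of the protocol.

```python
from typing import Dict, NamedTuple, Optional


class Harvested(NamedTuple):
    identifier: str
    datestamp: str            # ISO date such as "2001-07-11"
    metadata: Optional[dict]
    deleted: bool = False     # only meaningful if the archive marks deletions


def apply_record(store: Dict[str, Harvested], rec: Harvested) -> str:
    """Merge one harvested record into the local store and say what happened."""
    if rec.deleted:
        store.pop(rec.identifier, None)
        return "deleted"
    old = store.get(rec.identifier)
    if old is None:
        store[rec.identifier] = rec       # never seen before: an addition
        return "added"
    if rec.datestamp > old.datestamp:
        store[rec.identifier] = rec       # seen before, newer datestamp: an update
        return "updated"
    return "unchanged"
```

The archive sends additions and updates in the same form; only the harvester's local store tells them apart.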

38
OAI Observation: Harvest Model
  • Frequency of harvests
  • all-at-once harvests?
  • initial harvest
  • resolving data coherency
  • frequent incremental harvests?
  • far more efficient for both service and data
    providers
  • Webcrawling vs. digital library models
  • webcrawlers: little to no a priori information
    about target
  • DLs: frequent harvesting of a small number of
    known targets
  • Realization: we know very little about
    harvesting behavior
  • are we optimizing for all-at-once, when
    incremental will be more common?

39
Potentially Good Ideas (but we're not sure yet)
  • Sets
  • intuition: we'll be glad we included them
  • arXiv the first to implement sets
  • their DL is roughly built on sets, so it was an
    easy mapping for them
  • a few other repositories have since adopted sets
  • Flow control
  • harvesting denial of service attack?
  • is resumptionToken solution not enough? too
    much?
  • need data providers with large collections and
    enough service providers to generate a load

40
Potentially Good Ideas (but we're not sure yet)
  • Metadata
  • Q: Which format should I use?
  • A: any/all of them
  • lowest common denominator: unqualified Dublin
    Core
  • Again, little known about actual behavior
  • will DC actually be useful? or too lossy?
  • will communities create/adopt specific formats?
  • will native (presumably richer) formats be
    harvested?

41
XML Observations
  • Not too much of a problem for data providers
  • XML is easier to write than read
  • Service providers
  • XML can be pretty picky: a large ListRecords
    result can be invalidated with a single error
  • harvest in chunks? individual records?
  • author contributed metadata is particularly a
    problem (e.g. control characters from
    copy-n-paste)
  • one advantage of resumptionToken is that it
    compartmentalizes bad data
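
A small sketch of that compartmentalization, using Python's standard XML parser. The element names and sample chunks are simplified stand-ins invented for this example, not the actual OAI response schema. The second chunk contains a stray control character of the copy-n-paste kind.

```python
import xml.etree.ElementTree as ET


def parse_chunks(chunks):
    """Parse each chunk on its own; return (titles from good chunks, bad chunk indexes)."""
    titles, bad = [], []
    for index, xml_text in enumerate(chunks):
        try:
            root = ET.fromstring(xml_text)
        except ET.ParseError:
            bad.append(index)      # the whole chunk is lost, but only this chunk
            continue
        titles.extend(el.text for el in root.iter("title"))
    return titles, bad


chunks = [
    "<records><record><title>Report A</title></record></records>",
    "<records><record><title>Report \x01B</title></record>"
    "<record><title>Report C</title></record></records>",
    "<records><record><title>Report D</title></record></records>",
]

titles, bad = parse_chunks(chunks)
print(titles, bad)
```

Report C is lost along with the bad record that shares its chunk, while the chunks before and after it parse normally. Harvested as one large response, all four records would have been rejected together.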

42
Current NTRS / NIX Architecture
  • NASA-wide page that federates N center/project
    specific servers through distributed searching

  • http://techreports.larc.nasa.gov/cgi-bin/NTRS
  • http://nix.nasa.gov/

[Figure: a user's search for "cfd applications" goes to
NTRS/NIX, which sends the same search to each node; each
node is independently maintained.]
43
Current NTRS / NIX Architecture
  • Or users can interact directly with the nodes of
    NTRS/NIX

[Figure: the user sends the search for "cfd applications"
directly to individual nodes of NTRS/NIX.]
44
Proposed Strategy: Data Providers
  • Reduce the high interoperability expectations of
    distributed searching
  • Each current node of NTRS, NIX and other NASA DLs
    becomes an OAI data provider
  • LTRS & NACA already have test OAI interfaces
  • LTRS: http://techreports.larc.nasa.gov/ltrs/oai/
  • NACA: http://naca.larc.nasa.gov/oai/
  • each node is free to run their own software /
    architecture / system / etc., but the method of
    metadata exposure is standardized
  • very low interoperability requirements
  • each node can continue to have a user interface

45
Proposed Strategy: Service Providers
  • NTRS, NIX and other well known, destination DLs
    become OAI service providers
  • no longer relying on distributed searching
  • harvest metadata from their constituent data
    providers
  • provide their value added services on local
    copies of the metadata
  • data remains resident at the local data providers

46
NTRS OAI Architecture
[Figure: a user's search for "cfd applications" goes to
NTRS, which holds a local copy of the metadata; all
searching, browsing, etc. is performed on the metadata
there. Metadata is harvested offline, through the OAI
interface, from the nodes (LTRS, ATRS, GTRS, CASITRS,
...). Each node is independently maintained and can
still support direct user interaction; content (reports)
remains archived at the local sites.]
47
Additional Models
  • First step
  • OAI interfaces for data providers
  • DLs use OAI interfaces to move from distributed
    searching to metadata harvesting
  • Other possibilities
  • hierarchical harvesting
  • exposing metadata to other, possibly non-NASA DLs
  • harvesting from other, possibly non-NASA DLs
  • multi-genre DLs
  • re-apply the OAI protocol for harvesting /
    replicating content (not just metadata)
  • 3rd party service providers

48
NASA DLs in the Larger STI Realm
[Figure: NTRS linked to DLs at DOE, DOD, universities,
publishers, international sites, ...; this could be a
fully connected graph.]
  • NTRS could also be a data provider from the point
    of view of other DLs, allowing the harvesting of
    NASA report metadata
  • NTRS could also harvest metadata from other DLs,
    and provide access to non-NASA content
  • We hope to influence the direction of the
    science.gov effort to use OAI
49
New Kinds of DLs
  • Drawing from the same pool of DPs
  • different interfaces, capabilities and collection
    policies for
  • public affairs
  • K-12 education
  • science research
  • authors / librarians / managers
  • NTRS and NIX could harvest from the same sources
  • be the same DL, but with different interfaces?
  • be replaced with a new, all-encompassing DL?
  • DL creators can now focus on collection
    management
  • "a la carte"-ing their collections and
    sub-collections
  • instead of fussing over syntax synchronization of
    remote search services

50
A Generic Harvesting Protocol
  • The actual uses of OAI depend on your relative
    position and concerns
  • What is metadata vs. data?
  • Who is a SP vs. a DP?
  • Multiple OAI interfaces make many things
    possible
  • restricted / public interfaces
  • Arc-like description of harvested archives
  • updates of log files, authority lists, etc.
  • Additional services can be built on top of OAI
  • content replication
  • awareness services

51
OAI Impact
  • Lightweight interoperability protocol
  • an OAI layer is added to your existing DL
  • Separation of responsibilities
  • service providers
  • data providers
  • http://www.openarchives.org/