Title: OAI: Past, Present and Future
1 OAI: Past, Present and Future
- Michael L. Nelson mln_at_ils.unc.edu
- several slides stolen from Herbert Van de Sompel
- Open Archives Meeting
- Institute of Mechanical Engineers
- London
- 07/11/01
2 Outline
- Past
  - original goals, participants
- Present
  - evolution of goals, terms, definitions, current status
- Future
  - observations, use in the U.S., next steps
3 Background
- I met Herbert Van de Sompel in April 1999...
  - we spoke of a demonstration project he had in mind, for which he had received sponsorship from Paul Ginsparg and Rick Luce
- We wanted to demonstrate a multi-disciplinary DL that leveraged the large number of high-quality, yet often isolated, tech report servers, e-print servers, etc.
  - most DLs had grown up along single disciplines
  - little to no interoperability; "gardens" of DLs
4 The Rise and Fall of Distributed Searching
- wholesale distributed searching, popular at the time, is attractive in theory but troublesome in practice
  - Davis & Lagoze, JASIS 51(3), pp. 273-80
  - Powell & French, Proc. 5th ACM DL, pp. 264-265
- distributed searching of N nodes still viable, but only for small values of N
  - NCSTRL: N > 100, bad
  - NTRS/NIX: N < 20, ok (but could be better)
5 The Rise and Fall of Distributed Searching
- Other problems of distributed searching (from STARTS)
  - source-metadata problem
    - how do you know which nodes to search?
  - query-language problem
    - syntax varies and drifts over time between the various nodes
  - rank-merging problem
    - how do you meaningfully merge multiple result sets?
- Temptations
  - centralize all functions
    - "everything will be done at X"
  - standardize on a single product
    - "everyone will use system Y"
6 Universal Preprint Service
- A cross-archive DL that provides services on a collection of metadata harvested from multiple archives
  - based on NCSTRL+, a modified version of Dienst
  - support for clustering
  - support for buckets
- Demonstrated at Santa Fe, NM, October 21-22, 1999
  - http://ups.cs.odu.edu/
- D-Lib Magazine, 6(2) 2000 (2 articles)
  - http://www.dlib.org/dlib/february00/02contents.html
- UPS was soon renamed the Open Archives Initiative (OAI)
  - http://www.openarchives.org/
7 UPS Participants
totals ca. July 1999
8 Metadata Harvesting
- Getting metadata out of archives
  - not all archives support metadata extraction
  - some archives have undocumented metadata extraction procedures
  - not all archives support rich criteria for extraction
    - "single dump" concept only
- Intellectual property and use rights not always clear
  - many policies akin to "don't ask, don't tell"
9 Metadata Formatting and Quality
- Quality problems with
  - record duplication
  - crucial missing fields
  - internal errors
  - ambiguous references to people, places, and publications
- Different formats!
  - unproven intuition: n digital libraries results in O(n) metadata formats
10 Buckets: Information Surrogates in UPS
- Limitations on intellectual property, file size, transmission time, system load, etc. caused us to focus on metadata only
- Metadata was collected into buckets, with pointers back to the data files (still at the original sites)
11 Value Added Services Attached to the Buckets
- SFX Reference Linking Service, developed at Univ. of Ghent, Belgium
  - provides a layer of indirection between reference services available at a local site and the object itself
- SFX buttons are attached to the buckets themselves
  - communication occurs between the SFX server and the bucket
- Adding other services to the buckets is easy...
12 Data and Service Providers
- Data Providers
  - publishing into an archive
  - providing methods for metadata harvesting
  - provide non-technical context for sharing information also
- Service Providers
  - harvest metadata from providers
  - implement user interface to data
- Even if provided by the same DL, these are distinct functions
13 Data and Service Providers
- Self-describing archives
  - Much of the learning about the constituent UPS archives occurred out of band
  - Given an unknown archive, we should be able to algorithmically determine the nature of the archive
[Figure: two providers compared. One has only an input interface and a native end-user interface: no machine-based way to extract metadata. The other also has a native harvesting interface: machine and user interfaces for extracting metadata.]
14 Data and Service Providers
[Figure: a service provider (native end-user interface; input and harvesting interfaces optional) harvesting from two data providers, each with a native harvesting interface and an input interface; the native end-user interface of a data provider is optional (e.g., RePEc).]
15 Result: OAI
- The OAI was the result of the demonstration and discussion during the Santa Fe meeting
- Initial focus was on federating collections of scholarly e-print materials
  - however, interest grew and the scope and application of OAI expanded to become a generic bulk metadata transport protocol
- Note:
  - OAI is only about metadata -- not full text!
  - OAI is neutral with respect to the nature of the metadata or the resources the metadata describes
    - read: commercial publishers have an interest in OAI too...
16 OAI Timeline Highlights
- October 21-22, 1999 - initial UPS meeting
- February 15, 2000 - Santa Fe Convention published in D-Lib Magazine
  - precursor to the OAI metadata harvesting protocol
- June 3, 2000 - workshop at ACM DL 2000 (Texas)
- August 25, 2000 - OAI steering committee formed, DLF/CNI support
- September 7-8, 2000 - technical meeting at Cornell University
  - defined the core of the current OAI metadata harvesting protocol
- September 21, 2000 - workshop at ECDL 2000 (Portugal)
- November 1, 2000 - Alpha test group announced (15 organizations)
- January 23, 2001 - OAI protocol 1.0 announced, OAI Open Day in the U.S. (Washington DC)
  - purpose: freeze protocol for 12-16 months, generate critical mass
- February 26, 2001 - OAI Open Day in Europe (Berlin)
- July 3, 2001 - OAI protocol 1.1 announced
  - to reflect changes in the W3C's latest XML Schema recommendation
- September 8, 2001 - workshop at ECDL 2001 (Darmstadt)
17 Open Archives Initiative
18 Open Archives Initiative
- Open Archives Initiative (OAI): exposure of metadata for harvesting
- Open Archival Information System (OAIS): ensuring long-term preservation of archival materials
[Figure: an OAIS, and an OAIS w/ an OAI interface.]
- http://www.dlib.org/dlib/april01/04editorial.html
- http://www.dlib.org/dlib/may01/05letters.html
- http://ssdoo.gsfc.nasa.gov/nost/isoas/us/overview.html
19 OAI Metadata Harvesting Protocol
- Then
  - OAI harvesting protocol originally a subset of the Dienst (NCSTRL) protocol
    - and originally called the Santa Fe Convention
  - originally defined an OAI-specific metadata format
- Now
  - OAI metadata format dropped in favor of unqualified Dublin Core
    - other formats possible, but DC is required as lowest common denominator
  - No longer dependent on Dienst
    - defined independently (though still easily mappable)
20 Overview of OAI Verbs
- archival metadata verbs: Identify, ListMetadataFormats, ListSets
- harvesting verbs: ListRecords, ListIdentifiers, GetRecord
- most verbs take arguments: dates, sets, ids, metadata formats, and resumption token (for flow control)
21 supporting protocol requests
[Figure: the service provider (harvester) sends a request to the data provider (repository).]
Request: Identify
Response:
- Identify / Time / Request
- Repository identifier
- Base-URL
- Admin e-mail
- OAI protocol version
- Description
Slide credit: Herbert Van de Sompel
22 supporting protocol requests
[Figure: the service provider (harvester) sends a request to the data provider (repository).]
Request: ListMetadataFormats, identifier=oai:mlib:123a
Response:
- ListMetadataFormats / Time / Request
- REPEAT
  - Format prefix
  - Format XML schema
- /REPEAT
Slide credit: Herbert Van de Sompel
23 supporting protocol requests
[Figure: the service provider (harvester) sends a request to the data provider (repository).]
Request: ListSets, resumptionToken
Response:
- ListSets / Time / Request
- REPEAT
  - SetSpec
  - SetName
- /REPEAT
Slide credit: Herbert Van de Sompel
24 harvesting requests
[Figure: the service provider (harvester) sends a request to the data provider (repository).]
Request: ListRecords, from=a, until=b, set=klm, metadataPrefix=dc, resumptionToken
Response:
- ListRecords / Time / Request
- REPEAT
  - Identifier
  - Datestamp
  - Metadata
- /REPEAT
Slide credit: Herbert Van de Sompel
25 harvesting requests
[Figure: the service provider (harvester) sends a request to the data provider (repository).]
Request: ListIdentifiers, from=a, until=b, set=klm, resumptionToken
Response:
- ListIdentifiers / Time / Request
- REPEAT
  - Identifier
  - Datestamp
- /REPEAT
Slide credit: Herbert Van de Sompel
26 harvesting requests
[Figure: the service provider (harvester) sends a request to the data provider (repository).]
Request: GetRecord, identifier=oai:mlib:123a, metadataPrefix=dc
Response:
- GetRecord / Time / Request
- Identifier
- Datestamp
- Metadata
Slide credit: Herbert Van de Sompel
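The six requests on the preceding slides can be pictured as HTTP requests against a repository's base URL. The following is a minimal sketch only, assuming arguments are sent as a GET query string; `build_request`, `VERB_ARGS`, the base URL and the example identifier are illustrative names, not part of the protocol text.

```python
from urllib.parse import urlencode

# Arguments shown on the slides for each verb (a reading of the slides,
# not a normative list taken from the protocol document).
VERB_ARGS = {
    "Identify": set(),
    "ListMetadataFormats": {"identifier"},
    "ListSets": {"resumptionToken"},
    "ListRecords": {"from", "until", "set", "metadataPrefix", "resumptionToken"},
    "ListIdentifiers": {"from", "until", "set", "resumptionToken"},
    "GetRecord": {"identifier", "metadataPrefix"},
}


def build_request(base_url, verb, args=None):
    """Return the request URL for one OAI verb and its arguments."""
    args = args or {}
    if verb not in VERB_ARGS:
        raise ValueError(f"unknown verb: {verb}")
    unexpected = set(args) - VERB_ARGS[verb]
    if unexpected:
        raise ValueError(f"{verb} does not take {sorted(unexpected)}")
    # Keep 'verb' first and sort the rest so the URL is deterministic.
    return base_url + "?" + urlencode([("verb", verb)] + sorted(args.items()))


# Hypothetical repository address, used only for illustration.
BASE_URL = "http://repository.example.org/oai"
print(build_request(BASE_URL, "Identify"))
print(build_request(BASE_URL, "GetRecord",
                    {"identifier": "oai:mlib:123a", "metadataPrefix": "dc"}))
```

Arguments are passed as a dictionary rather than keyword arguments because `from` is a reserved word in Python.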
27 Flow Control
- ListSets, ListIdentifiers, ListRecords are all allowed to return partial responses, via a combination of
  - resumptionToken: an opaque, archive-defined data string that, when passed back to the archive, allows the response to begin where it left off
    - each archive defines its own resumptionToken syntax; it may have visible semantics or not
  - 503 HTTP status code with "Retry-After"
    - up to the harvester to understand this code and respect it, and up to the archive to enforce it
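A harvester-side loop that honors both mechanisms might look like the sketch below. It is not a reference implementation: `harvest` and its `fetch` callable (returning status, headers, body) are hypothetical, and it assumes that Retry-After carries a number of seconds and that a follow-up request sends only the verb and the resumptionToken.

```python
import time
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop an XML namespace prefix such as '{uri}' from a tag name."""
    return tag.rsplit("}", 1)[-1]


def harvest(fetch, verb, sleep=time.sleep, max_retries=3, **args):
    """Yield each partial response body until no resumptionToken is left.

    fetch(params) is a caller-supplied function (hypothetical here) that
    performs the HTTP request and returns (status, headers, body).
    """
    params = {"verb": verb, **args}
    retries = 0
    while True:
        status, headers, body = fetch(params)
        if status == 503:
            # The archive asks us to back off; respect Retry-After.
            if retries >= max_retries:
                raise RuntimeError("archive kept answering 503")
            retries += 1
            sleep(int(headers.get("Retry-After", "60")))
            continue
        retries = 0
        yield body
        token = None
        for element in ET.fromstring(body).iter():
            text = (element.text or "").strip()
            if local_name(element.tag) == "resumptionToken" and text:
                token = text
        if token is None:
            return  # the response was complete
        # The token is opaque: pass it back unchanged.
        params = {"verb": verb, "resumptionToken": token}
```

Injecting `fetch` and `sleep` keeps the loop independent of any particular HTTP library and makes it easy to exercise against canned responses.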
28 resumptionToken
scenario: harvesting 277 records in 3 separate 100-record chunks
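The scenario can be sketched from the archive's side. This is an illustration only: `list_records` is a hypothetical function, and using the next offset as the token is one arbitrary choice, since each archive is free to define its own token syntax.

```python
def list_records(records, resumption_token=None, chunk_size=100):
    """Return one chunk of records plus the token for the next chunk.

    The token here is simply the next offset written as a string; a real
    archive may encode anything it likes, since harvesters treat it as opaque.
    """
    start = int(resumption_token) if resumption_token else 0
    end = start + chunk_size
    next_token = str(end) if end < len(records) else None
    return records[start:end], next_token


# Harvest 277 placeholder records, following tokens until none is returned.
records = [f"record-{i}" for i in range(277)]
chunk_sizes = []
token = None
while True:
    chunk, token = list_records(records, token)
    chunk_sizes.append(len(chunk))
    if token is None:
        break
print(chunk_sizes)
```

The last response carries no token, which is how the harvester knows the list is complete.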
29 OAI Demos
- Data providers
  - not really meant for end-user interaction, but Suleman's Repository Explorer is an excellent tool
    - http://purl.org/net/oai_explorer
  - 30 registered data providers
    - http://oaisrv.nsdl.cornell.edu/Register/BrowseSites.pl
  - many being used for internal purposes, not registered
- Service providers
  - Arc, the first known SP harvesting from OAI data providers
    - http://arc.cs.odu.edu/
  - 3 registered service providers
    - http://www.openarchives.org/service_provider/oai_sp.htm
  - several more known to be in testing or creation
30 Field of Dreams
- It should be easy to be a data provider, even if it makes more work for the service provider.
  - if enough data providers exist, the service providers will come (DPs >> SPs)
- Open-source / freely available tools
  - drop-in data providers
    - industrial strength: http://www.eprints.org/
    - personal size: http://kepler.cs.odu.edu/
  - tools to make your existing DL a data provider
    - http://www.openarchives.org/tools/tools.htm
    - also: OAI-implementers mailing list / mail archive!
  - service providers
    - only bits and pieces currently publicly available...
31 OAI Observation: Front-End Only
- No input/registry mechanism
  - OAI harvesting protocol is always a front-end for something else
    - filesystem, Dienst, RDBMS, LDAP, etc.
  - convenient for pre-existing DLs, but does not address new DLs
    - e.g., "we want to do OAI"
- Bounds the scope of OAI
  - responsibilities and domain of OAI are still being discussed
  - tension between functionality and simplicity
32 OAI Observation: No T&C
- No terms & conditions provisions in protocol
  - assumes all metadata has uniform access rights
  - how to restrict metadata to certain hosts?
- introducing T&C would increase the scope of application, but at the expense of simplicity
  - how expensive do we want to make a "just-a-front-end" protocol?
- maybe T&C is a good application for sets?
33 OAI Observation: No T&C
- Possible to use multiple OAI servers in a DMZ-like configuration
[Figure: a public OAI server answers OAI requests from arbitrary hosts and a private OAI server answers OAI requests from trusted hosts; both sit in front of the source database (could even use a separate copy of the database).]
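One way to picture the split between the two servers is a small routing rule keyed on the requesting host. This is only a sketch under assumptions: the trusted network range and the function name `route_oai_request` are made up for illustration, and in practice the decision might live in a firewall or web server configuration rather than in code.

```python
import ipaddress

# Hypothetical list of networks whose harvesters may use the private server.
TRUSTED_NETWORKS = [ipaddress.ip_network("10.0.0.0/8")]


def route_oai_request(remote_addr, trusted_networks=TRUSTED_NETWORKS):
    """Return which OAI server should answer a request from remote_addr."""
    address = ipaddress.ip_address(remote_addr)
    if any(address in network for network in trusted_networks):
        return "private"
    return "public"


print(route_oai_request("10.1.2.3"))
print(route_oai_request("192.0.2.17"))
```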
34 OAI Observation: No T&C
- Possible to use OAI harvesting protocol in closed, restricted systems
[Figure: four DLs (OAI 1, OAI 2, OAI 3, OAI 4); all OAI requests originate from these 4 DLs.]
35 OAI Observation: Monolithic
- An OAI server has no protocol-defined concept of other OAI servers
  - backups, mirrors, etc. have to be resolved outside of the scope of OAI
  - scope vs. complexity again
- fully connected graph of DLs harvesting from each other is unnecessary
  - cf. web crawlers vs. gatherers in U. of Colorado's Harvest System
  - 3rd party harvesting interfaces raise more T&C and data coherency issues
36 302 Load Balancing
- Interactive users on main DL machine should not be impacted by metadata harvesting
  - "don't take deliveries through the front door"
- not part of the protocol; defined outside the protocol
[Figure: a harvester contacting naca.larc.nasa.gov/oai/ and an OAI server.]
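The idea of sending harvesters elsewhere with an HTTP 302 can be sketched as a tiny WSGI application. This is an assumption-laden illustration, not how the NACA server is actually configured: `HARVEST_HOST` is a hypothetical back-end address and the `/oai/` prefix check is a simplification.

```python
# Hypothetical machine that answers OAI requests away from the main DL host.
HARVEST_HOST = "http://oai.example.org"


def app(environ, start_response):
    """WSGI app: redirect OAI traffic, serve everything else locally."""
    path = environ.get("PATH_INFO", "")
    if path.startswith("/oai/"):
        query = environ.get("QUERY_STRING", "")
        location = HARVEST_HOST + path + ("?" + query if query else "")
        # 302 tells the harvester to repeat the request at the other host.
        start_response("302 Found", [("Location", location)])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"interactive DL pages are served here\n"]
```

Because the redirect happens at the HTTP level, the harvesting protocol itself needs no knowledge of the second machine.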
37 OAI Observation: Data Coherency
- In the interest of OAI implementer simplicity, several issues are left for the service provider to interpret
  - what is an update vs. addition?
    - in the NACA OAI interface, they are reported as the same and it's up to the harvesting system to figure it out
  - deletions?
    - it is currently optional for OAI systems to mark records as deleted or not
    - still left to the harvester to interpret
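A harvester can sort out additions, updates and deletions by comparing each incoming record with its local copy. The sketch below is one possible interpretation, not a prescribed algorithm; the record layout (a dictionary with `identifier`, `datestamp`, `metadata` and an optional `deleted` flag) is assumed for illustration.

```python
def apply_harvest(store, harvested):
    """Merge harvested records into a local store and classify each one.

    store maps identifier -> (datestamp, metadata). Whether a record is an
    addition or an update is decided locally, because the data provider
    reports both the same way.
    """
    summary = {"added": 0, "updated": 0, "deleted": 0, "unchanged": 0}
    for record in harvested:
        identifier = record["identifier"]
        if record.get("deleted"):
            # Only count deletions of records we actually held.
            if store.pop(identifier, None) is not None:
                summary["deleted"] += 1
            continue
        incoming = (record["datestamp"], record["metadata"])
        if identifier not in store:
            summary["added"] += 1
        elif store[identifier] != incoming:
            summary["updated"] += 1
        else:
            summary["unchanged"] += 1
        store[identifier] = incoming
    return summary
```

Archives that never mark deletions would need a different strategy, such as a periodic all-at-once harvest compared against the local store.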
38 OAI Observation: Harvest Model
- Frequency of harvests
  - all-at-once harvests?
    - initial harvest
    - resolving data coherency
  - frequent incremental harvests?
    - far more efficient for both service and data providers
- Webcrawling vs. digital library models
  - webcrawlers: little to no a priori information about target
  - DLs: frequent harvesting of a small number of known targets
- Realization: we know very little about harvesting behavior
  - are we optimizing for all-at-once, when incremental will be more common?
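The difference between the two harvest styles comes down to whether a `from` date is sent. The following is a minimal sketch, assuming day-granularity datestamps written as YYYY-MM-DD; `next_harvest_args` is a hypothetical helper name.

```python
from datetime import date


def next_harvest_args(last_harvest, today, metadata_prefix="dc"):
    """Build ListRecords arguments for an initial or incremental harvest.

    With no previous harvest, no 'from' is sent, which asks for everything.
    Otherwise the day of the last harvest is requested again, so records
    added later that same day are not missed; duplicates are resolved locally.
    """
    args = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "until": today.isoformat(),
    }
    if last_harvest is not None:
        args["from"] = last_harvest.isoformat()
    return args


print(next_harvest_args(None, date(2001, 7, 11)))
print(next_harvest_args(date(2001, 7, 4), date(2001, 7, 11)))
```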
39 Potentially Good Ideas (but we're not sure yet)
- Sets
  - intuition: we'll be glad we included them
  - arXiv the first to implement sets
    - their DL is roughly built on sets, so it was an easy mapping for them
  - a few other repositories have since adopted sets
- Flow control
  - harvesting: denial of service attack?
  - is the resumptionToken solution not enough? too much?
  - need data providers with large collections and enough service providers to generate a load
40 Potentially Good Ideas (but we're not sure yet)
- Metadata
  - Q: Which format should I use?
  - A: any/all of them
    - lowest common denominator: unqualified Dublin Core
- Again, little known about actual behavior
  - will DC actually be useful? or too lossy?
  - will communities create/adopt specific formats?
  - will native (presumably richer) formats be harvested?
41 XML Observations
- Not too much of a problem for data providers
  - XML is easier to write than read
- Service providers
  - XML can be pretty picky: a large ListRecords result can be invalidated with a single error
    - harvest in chunks? individual records?
  - author-contributed metadata particularly a problem (e.g., control characters from copy-n-paste)
  - one advantage of resumptionToken is that it compartmentalizes bad data
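The control-character problem can be shown in a few lines. This sketch assumes the harvester is willing to drop characters that XML 1.0 forbids before parsing; `strip_illegal_xml_chars` is a hypothetical helper, and silently discarding data may not suit every service provider.

```python
import re
import xml.etree.ElementTree as ET

# C0 control characters that XML 1.0 does not allow (tab, LF and CR are legal).
_ILLEGAL_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_illegal_xml_chars(text):
    """Remove control characters that would make the whole response unparsable."""
    return _ILLEGAL_XML_CHARS.sub("", text)


# A title with a form feed, as might arrive from copy-and-paste.
raw = "<title>CFD\x0c applications</title>"
try:
    ET.fromstring(raw)
except ET.ParseError as error:
    print("rejected:", error)

cleaned = ET.fromstring(strip_illegal_xml_chars(raw))
print(cleaned.text)
```

A single such character anywhere in a large response is enough to make the parser reject all of it, which is why smaller chunks limit the damage.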
42 Current NTRS / NIX Architecture
- NASA-wide page that federates N center/project specific servers through distributed searching
  - http://techreports.larc.nasa.gov/cgi-bin/NTRS
  - http://nix.nasa.gov/
[Figure: the user's search for "cfd applications" goes to NTRS/NIX, which sends the same search to each node; each node independently maintained.]
43 Current NTRS / NIX Architecture
- Or users can interact directly with the nodes of NTRS/NIX
[Figure: the user sends the search for "cfd applications" directly to individual nodes, bypassing NTRS/NIX.]
44 Proposed Strategy: Data Providers
- Reduce the high interoperability expectations of distributed searching
- Each current node of NTRS, NIX and other NASA DLs becomes an OAI data provider
  - LTRS & NACA already have test OAI interfaces
    - LTRS: http://techreports.larc.nasa.gov/ltrs/oai/
    - NACA: http://naca.larc.nasa.gov/oai/
  - each node is free to run its own software / architecture / system / etc., but the method of metadata exposure is standardized
    - very low interoperability requirements
  - each node can continue to have a user interface
45 Proposed Strategy: Service Providers
- NTRS, NIX and other well-known "destination" DLs become OAI service providers
  - no longer relying on distributed searching
  - harvest metadata from their constituent data providers
  - provide their value-added services on local copies of the metadata
  - data remains resident at the local data providers
46 NTRS OAI Architecture
[Figure: the user's search for "cfd applications" goes to NTRS, which holds a local copy of the metadata; all searching, browsing, etc. is performed on the metadata there. The metadata is harvested offline, through the OAI interface, from independently maintained nodes (LTRS, ATRS, GTRS, CASITRS, ...). Content (reports) remains archived at the local sites, and individual nodes can still support direct user interaction.]
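Searching a local copy of harvested metadata, rather than querying each node live, can be sketched with a small in-memory index. Everything here is illustrative: the record fields, the sample titles and identifiers, and the functions `build_index` and `search` are assumptions, not a description of the NTRS implementation.

```python
def build_index(records):
    """Map each lower-cased title word to the identifiers that contain it."""
    index = {}
    for record in records:
        for word in set(record["title"].lower().split()):
            index.setdefault(word, set()).add(record["identifier"])
    return index


def search(index, query):
    """Return identifiers whose titles contain every word of the query."""
    matches = [index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*matches) if matches else set()


# Made-up records standing in for metadata harvested from two nodes.
harvested = [
    {"identifier": "ltrs:1", "title": "CFD applications for wing design"},
    {"identifier": "gtrs:7", "title": "Turbine blade CFD study"},
    {"identifier": "ltrs:2", "title": "Wind tunnel applications"},
]
index = build_index(harvested)
print(search(index, "cfd applications"))
```

The query touches only the local index; the nodes are contacted during harvesting, not at search time.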
47 Additional Models
- First step
  - OAI interfaces for data providers
  - DLs use OAI interfaces to move from distributed searching to metadata harvesting
- Other possibilities
  - hierarchical harvesting
  - exposing metadata to other, possibly non-NASA DLs
  - harvesting from other, possibly non-NASA DLs
  - multi-genre DLs
  - re-apply the OAI protocol for harvesting / replicating content (not just metadata)
  - 3rd party service providers
48 NASA DLs in the Larger STI Realm
[Figure: NASA DLs alongside DOE, DOD, Universities, Publishers, International, ...; this could be a fully connected graph.]
- NTRS could also be a data provider from the point of view of other DLs, allowing the harvesting of NASA report metadata.
- NTRS could also harvest metadata from other DLs, and provide access to non-NASA content.
- We hope to influence the direction of the science.gov effort to use OAI.
49 New Kinds of DLs
- Drawing from the same pool of DPs
  - different interfaces, capabilities and collection policies for
    - public affairs
    - K-12 education
    - science research
    - authors / librarians / managers
- NTRS and NIX could harvest from the same sources
  - be the same DL, but with different interfaces?
  - be replaced with a new, all-encompassing DL?
- DL creators can now focus on collection management
  - "a la carting" their collections and sub-collections
  - instead of fussing over syntax synchronization of remote search services
50 A Generic Harvesting Protocol
- The actual uses of OAI depend on your relative position and concerns
  - What is metadata vs. data?
  - Who is a SP vs. a DP?
- Multiple OAI interfaces make many things possible
  - restricted / public interfaces
  - Arc-like description of harvested archives
  - updates of log files, authority lists, etc.
- Additional services can be built on top of OAI
  - content replication
  - awareness services
51 OAI Impact
- Lightweight interoperability protocol
  - an OAI layer is added to your existing DL
- Separation of responsibilities
  - service providers
  - data providers
- http://www.openarchives.org/