Title: New Developments in OAI
1New Developments in OAI
- Michael L. Nelson
- Old Dominion University
- http://www.cs.odu.edu/~mln/
- mln_at_cs.odu.edu
- OA-Forum
- May 13-14, 2002
- Pisa, Italy
Many slides borrowed from Herbert Van de Sompel and Carl Lagoze
2 N.B.
- OAI-PMH 2.0 is not scheduled for public beta release until May 19, 2002
- some of the details of this presentation are still subject to change!
- final public release of 2.0 scheduled for June 1
3 What's New in 2.0?!
- Good news: OAI-PMH is still
- Six Verbs & DC
- Incremental improvements
- single XML schema
- ambiguities removed
- more expressive options
- cleaner separation of roles & responsibilities
- Bad news: not backwards compatible with 1.1
4Open Archives Initiative
5 The Rise and Fall of Distributed Searching
- wholesale distributed searching, popular at the time, is attractive in theory but troublesome in practice
- Davis & Lagoze, JASIS 51(3), pp. 273-80
- Powell & French, Proc 5th ACM DL, pp. 264-265
- distributed searching of N nodes still viable, but only for small values of N
- NCSTRL: N > 100, bad
- NTRS/NIX: N < 20, ok (but could be better)
6 The Rise and Fall of Distributed Searching
- Other problems of distributed searching (from STARTS)
- source-metadata problem
- how do you know which nodes to search?
- query-language problem
- syntax varies and drifts over time between the various nodes
- rank-merging problem
- how do you meaningfully merge multiple result sets?
- Temptations
- centralize all functions
- everything will be done at X
- standardize on a single product
- everyone will use system Y
7Metadata Harvesting
- Move away from distributed searching
- Extract metadata from various sources
- Build services on local copies of metadata
- data remains at remote repositories
[Figure: a user query (e.g. "search for cfd applications") runs against a local copy of metadata that was harvested offline from several independently maintained nodes. All searching, browsing, etc. is performed on the local metadata; individual nodes can still support direct user interaction.]
8Santa Fe convention
OAI-PMH v.1.0/1.1
OAI-PMH v.2.0
9 Santa Fe Convention 02/2000
- goal: optimize discovery of e-prints
- input
- the UPS prototype
- RePEc / SODA: data provider / service provider model
- Dienst protocol
- deliberations at Santa Fe meeting 10/99
10 OAI-PMH v.1.0 01/2001
- goal: optimize discovery of document-like objects
- input
- SFC
- DLF meetings on metadata harvesting
- deliberations at Cornell meeting 09/00
- alpha test group of OAI-PMH v.1.0
11 OAI-PMH v.1.0 01/2001
- low-barrier interoperability specification
- metadata harvesting model: data provider / service provider
- focus on document-like objects
- autonomous protocol
- HTTP based
- XML responses
- unqualified Dublin Core
- experimental: 12-18 months
12 pre-2.0 OAI Timeline Highlights
- October 21-22, 1999 - initial UPS meeting
- February 15, 2000 - Santa Fe Convention published in D-Lib Magazine
- precursor to the OAI metadata harvesting protocol
- June 3, 2000 - workshop at ACM DL 2000 (Texas)
- August 25, 2000 - OAI steering committee formed, DLF/CNI support
- September 7-8, 2000 - technical meeting at Cornell University
- defined the core of the current OAI metadata harvesting protocol
- September 21, 2000 - workshop at ECDL 2000 (Portugal)
- November 1, 2000 - Alpha test group announced (15 organizations)
- January 23, 2001 - OAI protocol 1.0 announced, OAI Open Day in the U.S. (Washington DC)
- purpose: freeze protocol for 12-16 months, generate critical mass
- February 26, 2001 - OAI Open Day in Europe (Berlin)
- July 3, 2001 - OAI protocol 1.1 announced
- to reflect changes in the W3C's latest XML Schema recommendation
- September 8, 2001 - workshop at ECDL 2001 (Darmstadt)
13 OAI-PMH v.2.0 06/2002
- goal: recurrent exchange of metadata about resources between systems
- input
- OAI-PMH v.1.0
- feedback on OAI-implementers
- deliberations by OAI-tech 09/01 -
- alpha test group of OAI-PMH v.2.0 03/02 -
14 OAI-PMH v.2.0 06/2002
- low-barrier interoperability specification
- metadata harvesting model: data provider / service provider
- metadata about resources
- autonomous protocol
- HTTP based
- XML responses
- unqualified Dublin Core
- stable
15process leading to OAI-PMH v.2.0
16 creation of OAI-tech 06/01
- created for 1 year period
- charge
- review functionality and nature of OAI-PMH v.1.0
- investigate extensions
- release stable version of OAI-PMH by 05/02
- determine need for infrastructure to support broad adoption of the protocol
- communication: listserv, SourceForge, conference calls
17 OAI-tech
- US representatives: Thomas Krichel (Long Island U), Jeff Young (OCLC), Tim Cole (U of Illinois at Urbana-Champaign), Hussein Suleman (Virginia Tech), Simeon Warner (Cornell U), Michael Nelson (NASA), Caroline Arms (LoC), Mohammad Zubair (Old Dominion U), Steven Bird (U Penn.)
- European representatives: Andy Powell (Bath U. & UKOLN), Mogens Sandfaer (DTV), Thomas Baron (CERN), Les Carr (U of Southampton)
18 pre-alpha phase 09/01 - 02/02
- review process by OAI-tech
- identification of issues
- conference call to filter/combine issues
- white paper per issue
- on-line discussion per white paper
- proposal for resolution of issue by OAI-exec
- discussion of proposal & closure of issue
- conference call to resolve open issues
19 pre-alpha phase 02/02
- creation of revised protocol document
- in-person meeting: Lagoze - Van de Sompel - Nelson - Warner
- autonomous decisions
- internal vetting of protocol document
20 alpha phase 02/02 - 05/02
- alpha-1 release to OAI-tech March 1st 2002
- OAI-tech extended with alpha testers
- discussions/implementations by OAI-tech
- ongoing revision of protocol document
21OAI-PMH 2.0 alpha testers (1/2)
- The British Library
- Cornell U. -- NSDL project & e-print arXiv
- Ex Libris
- FS Consulting Inc -- harvester for my.OAI
- Humboldt-Universität zu Berlin
- InQuirion Pty Ltd, RMIT University
- Library of Congress
- NASA
- OCLC
22OAI-PMH 2.0 alpha testers (2/2)
- Old Dominion U. -- ARC, DP9
- U. of Illinois at Urbana-Champaign
- U. of Southampton -- OAIA, CiteBase, eprints.org
- UCLA, Johns Hopkins U., Indiana U., NYU -- sheet music collection
- UKOLN, U. of Bath -- RDN
- Virginia Tech -- repository explorer
23 beta phase 05/02
- beta release on May 1st 2002 to
- registered data providers and service providers
- interested parties
- fine tuning of protocol document
- preparation for the release of 2.0 conformant tools by alpha testers
24 What's new in OAI-PMH v.2.0?
- general changes to improve solidity of protocol
25Overview of OAI Verbs
- verb groups: archival metadata, harvesting verbs
- most verbs take arguments: dates, sets, ids, metadata formats and resumption token (for flow control)
26 Identify
1.1
- Arguments
- none
- Errors
- none
2.0
- Arguments
- none
- Errors
- badArgument
27 ListMetadataFormats
1.1
- Arguments
- identifier (OPTIONAL)
- Errors
- id does not exist
2.0
- Arguments
- identifier (OPTIONAL)
- Errors
- badArgument
- noMetadataFormats
- idDoesNotExist
28 ListSets
1.1
- Arguments
- resumptionToken (EXCLUSIVE)
- Errors
- no set hierarchy
2.0
- Arguments
- resumptionToken (EXCLUSIVE)
- Errors
- badArgument
- badResumptionToken
- noSetHierarchy
29 ListIdentifiers
1.1
- Arguments
- from (OPTIONAL)
- until (OPTIONAL)
- set (OPTIONAL)
- resumptionToken (EXCLUSIVE)
- Errors
- no records match
2.0
- Arguments
- from (OPTIONAL)
- until (OPTIONAL)
- set (OPTIONAL)
- resumptionToken (EXCLUSIVE)
- metadataPrefix (REQUIRED)
- Errors
- badArgument
- cannotDisseminateFormat
- badGranularity
- badResumptionToken
- noSetHierarchy
- noRecordsMatch
30 ListRecords
1.1
- Arguments
- from (OPTIONAL)
- until (OPTIONAL)
- set (OPTIONAL)
- resumptionToken (EXCLUSIVE)
- metadataPrefix (REQUIRED)
- Errors
- no records match
- metadata format cannot be disseminated
2.0
- Arguments
- from (OPTIONAL)
- until (OPTIONAL)
- set (OPTIONAL)
- resumptionToken (EXCLUSIVE)
- metadataPrefix (REQUIRED)
- Errors
- noRecordsMatch
- cannotDisseminateFormat
- badGranularity
- badResumptionToken
- noSetHierarchy
- badArgument
31 GetRecord
1.1
- Arguments
- identifier (REQUIRED)
- metadataPrefix (REQUIRED)
- Errors
- id does not exist
- metadata format cannot be disseminated
2.0
- Arguments
- identifier (REQUIRED)
- metadataPrefix (REQUIRED)
- Errors
- badArgument
- cannotDisseminateFormat
- idDoesNotExist
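The 2.0 argument rules listed on the verb slides above can be written down as a small lookup table. The following is a minimal sketch, not part of the protocol document: the table layout and the function name `check_request` are illustrative assumptions, and the rules reflect the draft described in this talk, which was still subject to change.

```python
# Argument rules for the six verbs, transcribed from the 2.0 lists above.
# The dictionary layout and function name are hypothetical.
_LIST_RULES = {
    "required": {"metadataPrefix"},
    "optional": {"from", "until", "set"},
    "exclusive": "resumptionToken",
}
VERB_ARGS = {
    "Identify": {"required": set(), "optional": set(), "exclusive": None},
    "ListMetadataFormats": {"required": set(), "optional": {"identifier"}, "exclusive": None},
    "ListSets": {"required": set(), "optional": set(), "exclusive": "resumptionToken"},
    "ListIdentifiers": _LIST_RULES,
    "ListRecords": _LIST_RULES,
    "GetRecord": {"required": {"identifier", "metadataPrefix"}, "optional": set(), "exclusive": None},
}


def check_request(params):
    """Return None if the arguments are acceptable, else an OAI-PMH error code."""
    args = dict(params)
    verb = args.pop("verb", None)
    if verb not in VERB_ARGS:
        return "badVerb"
    rules = VERB_ARGS[verb]
    exclusive = rules["exclusive"]
    if exclusive is not None and exclusive in args:
        # An EXCLUSIVE argument must be the only argument besides the verb.
        return None if len(args) == 1 else "badArgument"
    allowed = rules["required"] | rules["optional"]
    if not rules["required"] <= set(args) or not set(args) <= allowed:
        return "badArgument"
    return None
```

For instance, a ListIdentifiers request that carries only `from` is rejected here because `metadataPrefix` is now required, which is the main behavioural change from 1.1 shown above.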
32 general changes
- clear distinction between protocol and periphery
- fixed protocol document
- extensible implementation guidelines
- e.g. sample metadata formats, description containers, about containers
- allows for OAI guidelines and community guidelines
33 general changes
- clear separation of OAI-PMH and HTTP
- OAI-PMH error handling
- all OK at HTTP level? => 200 OK
- something wrong at OAI-PMH level? => OAI-PMH error (e.g. badVerb)
34 OAI Data Model: Resources / Items / Records
- item = identifier
- record = identifier + metadata format + datestamp
35 general changes
- better definitions of harvester, repository, item, unique identifier, record, set, selective harvesting
- oai_dc schema builds on DCMI XML Schema for unqualified Dublin Core
- usage of must, must not etc. as in RFC 2119
- wording on response compression
36 general changes
- all protocol responses can be validated with a single XML Schema
- easier for data providers
- no redundancy in type definitions
- SOAP-ready
- clean for error handling
37 response no errors
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request verb="GetRecord" ...>http://arXiv.org/oai2</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:arXiv:cs/0112017</identifier>
        <datestamp>2001-12-14</datestamp>
        <setSpec>cs</setSpec>
        <setSpec>math</setSpec>
      </header>
      <metadata>
        ..
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
38 response with error
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request>http://arXiv.org/oai2</request>
  <error code="badVerb">ShowMe is not a valid OAI-PMH verb</error>
</OAI-PMH>
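The two-level error handling above (HTTP status first, then an OAI-PMH error element inside a 200 response) could be checked by a harvester roughly as follows. This is a sketch under assumptions: the names `OAIError` and `check_response` are hypothetical, and the sample body is the error response from the slide with namespace declarations left out, as they are on the slide.

```python
import xml.etree.ElementTree as ET

# Error response from the slide above; namespaces omitted as on the slide.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request>http://arXiv.org/oai2</request>
  <error code="badVerb">ShowMe is not a valid OAI-PMH verb</error>
</OAI-PMH>"""


class OAIError(Exception):
    """Protocol-level error carried inside an HTTP 200 response (hypothetical class)."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code


def check_response(http_status, body):
    """Return the parsed root element, or raise on an HTTP or OAI-PMH level problem."""
    if http_status != 200:
        # Transport problem: nothing OAI-PMH specific to interpret.
        raise RuntimeError(f"HTTP-level failure: {http_status}")
    root = ET.fromstring(body)
    error = root.find("error")
    if error is not None:
        # HTTP was fine, but the repository reported a protocol error.
        raise OAIError(error.get("code"), (error.text or "").strip())
    return root
```

Calling `check_response(200, SAMPLE.encode("utf-8"))` raises `OAIError` with `code` set to `badVerb`, even though the HTTP exchange itself succeeded.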
39 corrections
- all dates/times are UTC, encoded in ISO8601, Z-notation
- 1957-03-20T20:30:00.00Z
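As an illustration of the UTC / Z-notation rule above, a repository might normalize its local timestamps like this. It is a sketch, not a prescribed method: the function name `to_oai_utc` is an assumption, and it emits whole seconds rather than the fractional seconds shown in the slide's example.

```python
from datetime import datetime, timedelta, timezone


def to_oai_utc(moment):
    """Format a timezone-aware datetime as ISO 8601 UTC with the Z suffix."""
    if moment.tzinfo is None:
        # A naive datetime has no known offset, so it cannot be converted safely.
        raise ValueError("naive datetime: time zone unknown")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# 21:30 local time at UTC+01:00 is 20:30 UTC.
local = datetime(1957, 3, 20, 21, 30, 0, tzinfo=timezone(timedelta(hours=1)))
print(to_oai_utc(local))  # 1957-03-20T20:30:00Z
```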
40 resumptionToken
- idempotency of resumptionToken: return same incomplete list when rT is reissued
- while no changes occur in the repo: strict
- while changes occur in the repo: all items with unchanged datestamp
- new attributes for the resumptionToken
- expirationDate
- completeListSize
- cursor
41 new functionality
- harvesting granularity
- mandatory support of YYYY-MM-DD
- optional support of YYYY-MM-DDThh:mm:ssZ
- granularity of from and until must be the same
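The granularity rules above can be checked before a request is processed. The sketch below is illustrative only: the function names and problem messages are assumptions, it checks the shape of the values rather than calendar validity, and it deliberately does not map problems to specific OAI-PMH error codes, since the slides do not spell out that mapping.

```python
import re

# The two timestamp shapes named on the slide; anything else is treated as malformed.
PATTERNS = {
    "YYYY-MM-DD": r"\d{4}-\d{2}-\d{2}",
    "YYYY-MM-DDThh:mm:ssZ": r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z",
}


def granularity_of(value):
    """Return the name of the matching granularity, or None if neither matches."""
    for name, pattern in PATTERNS.items():
        if re.fullmatch(pattern, value):
            return name
    return None


def range_problems(from_=None, until=None, supports_seconds=False):
    """Return a list of human-readable problems; an empty list means the range is fine."""
    problems = []
    found = [granularity_of(v) for v in (from_, until) if v is not None]
    if None in found:
        problems.append("unrecognized date format")
    if "YYYY-MM-DDThh:mm:ssZ" in found and not supports_seconds:
        problems.append("seconds granularity not supported by this repository")
    if len(set(found) - {None}) > 1:
        problems.append("from and until use different granularities")
    return problems
```

A day-level range such as `range_problems("2001-01-01", "2001-01-31")` passes for every repository, because YYYY-MM-DD support is mandatory.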
42 new functionality
<Identify>
  <repositoryName>Library of Congress 1</repositoryName>
  <baseURL>http://memory.loc.gov/cgi-bin/oai</baseURL>
  <protocolVersion>2.0</protocolVersion>
  <adminEmail>dwoo_at_loc.gov</adminEmail>
  <adminEmail>caar_at_loc.gov</adminEmail>
  <deletedRecord>transient</deletedRecord>
  <earliestDatestamp>1990-02-01T00:00:00Z</earliestDatestamp>
  <granularity>YYYY-MM-DDThh:mm:ssZ</granularity>
  <compression>deflate</compression>
43 new functionality
- header contains set membership of item
<record>
  <header>
    <identifier>oai:arXiv:cs/0112017</identifier>
    <datestamp>2001-12-14</datestamp>
    <setSpec>cs</setSpec>
    <setSpec>math</setSpec>
  </header>
  <metadata>
    ..
  </metadata>
</record>
44 new functionality
- ListIdentifiers returns headers
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request verb="..." ...>http://arXiv.org/oai2</request>
  <ListIdentifiers>
    <header>
      <identifier>oai:arXiv:hep-th/9801001</identifier>
      <datestamp>1999-02-23</datestamp>
      <setSpec>physic:hep</setSpec>
    </header>
    <header>
      <identifier>oai:arXiv:hep-th/9801002</identifier>
      <datestamp>1999-03-20</datestamp>
      <setSpec>physic:hep</setSpec>
      <setSpec>physic:exp</setSpec>
    </header>
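Because ListIdentifiers now returns full headers, a harvester can read datestamps and set membership without fetching the records themselves. A minimal sketch follows, assuming the two headers from the slide above; the closing tags are added here only so the sample parses, namespaces are omitted as on the slide, and `read_headers` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# The two headers from the slide, closed off so the sample is well-formed.
SAMPLE = """<OAI-PMH>
  <ListIdentifiers>
    <header>
      <identifier>oai:arXiv:hep-th/9801001</identifier>
      <datestamp>1999-02-23</datestamp>
      <setSpec>physic:hep</setSpec>
    </header>
    <header>
      <identifier>oai:arXiv:hep-th/9801002</identifier>
      <datestamp>1999-03-20</datestamp>
      <setSpec>physic:hep</setSpec>
      <setSpec>physic:exp</setSpec>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""


def read_headers(xml_text):
    """Return one dict per <header>: identifier, datestamp and list of setSpecs."""
    root = ET.fromstring(xml_text)
    return [
        {
            "identifier": header.findtext("identifier"),
            "datestamp": header.findtext("datestamp"),
            "sets": [spec.text for spec in header.findall("setSpec")],
        }
        for header in root.iter("header")
    ]


# Select items by set membership using header information alone.
in_exp = [h["identifier"] for h in read_headers(SAMPLE) if "physic:exp" in h["sets"]]
```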
45 new functionality
- ListIdentifiers mandates metadataPrefix as argument
http://www.perseus.tufts.edu/cgi-bin/pdataprov?verb=ListIdentifiers&metadataPrefix=olac&from=2001-01-01&until=2001-01-01&set=Perseus:collection:PersInfo
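A request like the one above can be assembled from a base URL and a set of arguments. This is a sketch using only the Python standard library; `build_request` is a hypothetical helper, and note that `urlencode` percent-encodes the colons in the set name, which is an equivalent encoding of the same URL.

```python
from urllib.parse import urlencode


def build_request(base_url, verb, **arguments):
    """Build an OAI-PMH GET URL; arguments whose value is None are skipped."""
    query = {"verb": verb}
    for name, value in arguments.items():
        if value is not None:
            # "from" is a Python keyword, so callers pass from_ and it is renamed here.
            query[name.rstrip("_")] = value
    return f"{base_url}?{urlencode(query)}"


url = build_request(
    "http://www.perseus.tufts.edu/cgi-bin/pdataprov",
    "ListIdentifiers",
    metadataPrefix="olac",
    from_="2001-01-01",
    until="2001-01-01",
    set="Perseus:collection:PersInfo",
)
```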
46 new functionality
- character set for metadataPrefix and setSpec extended to URL-safe characters
A-Z a-z 0-9 _ ! ( ) - .
- identifierType: anyURI
- repositoryName: string
47 in the periphery
- introduction of provenance container to facilitate tracing of harvesting history
<about>
  <provenance>
    <originDescription>
      <baseURL>http://an.oa.org</baseURL>
      <identifier>oai:r1:plog/9801001</identifier>
      <datestamp>2001-08-13T13:00:02Z</datestamp>
      <metadataPrefix>oai_dc</metadataPrefix>
      <harvestDate>2001-08-15T12:01:30Z</harvestDate>
    </originDescription>
    <originDescription>
    </originDescription>
  </provenance>
</about>
48 in the periphery
- introduction of friends container to facilitate discovery of repositories
<description>
  <Friends>
    <baseURL>http://cav2001.library.caltech.edu/perl/oai</baseURL>
    <baseURL>http://formations2.ulst.ac.uk/perl/oai</baseURL>
    <baseURL>http://cogprints.soton.ac.uk/perl/oai</baseURL>
    <baseURL>http://wave.ldc.upenn.edu/OLAC/dp/aps.php4</baseURL>
  </Friends>
</description>
49 in the periphery
- revision of oai-identifier
- guidelines for collection-level and set-level
metadata
50future
51 the OAI-PMH
- release of OAI-PMH v.2.0 06/2002
- no backwards compatibility with v.1.0/1.1
- stable
- migration process for registered repos
- ? formal standardization ?
- ? SOAP version: web services framework (SOAP, WSDL, UDDI) ?
52 communities
- proliferation of community-specific add-ons for
- collection & set level metadata
- expressive metadata formats (e.g. qualified DC XML Schema)
- shared set-structures
- machine readable rights (about the metadata)
53 adoption
- evolution
- from talking about OAI-PMH
- to talking about projects that use OAI-PMH
- to talking about projects and failing to mention they use OAI-PMH
- => OAI-PMH becomes part of the infrastructure
54indicators of adoption of OAI-PMH
55 data providers
- 49 registered repositories 11/2001
- 65 registered repositories 03/2002
- 77 registered repositories 05/2002
- 5 million records
- many unregistered repositories
56 service providers
- Arc: cross-searching of registered repositories (Old Dominion U)
- http://arc.cs.odu.edu
- OLAC: cross-searching of Language Archive Community repositories
- http://www.language-archives.org/index.html
57 service providers
- Scirus: scientific search engine (Elsevier)
- http://www.scirus.com
- my.OAI: user-tailorable cross-searching of registered repositories (FS Consulting, Inc.)
- http://www.myoai.com
- growing interest from web search engines
58 OAI-PMH tools
- Repository Explorer: interactive exploration of repositories (Virginia Tech)
- http://www.purl.org/NET/oai_explorer
- eprints.org: generic OAI-PMH compliant repository software (U of Southampton)
- http://www.eprints.org
- ALCME: repository and harvester software (OCLC)
- http://alcme.oclc.org/index.html
59 exploration
- Kepler (Old Dominion U)
- your personal OAI data provider: Kepler archivelet
- the Kepler service provider harvests from archivelets that register
- archivelet downloadable
- http://www.dlib.org/dlib/april01/maly/04maly.html
60 exploration
- DP9 (Old Dominion U)
- provides entry page to repositories for web-crawlers
- provides bookmarkable URL for OAI record
- provides resolution of OAI identifier into metadata
- software downloadable
61 http://www.openarchives.org openarchives_at_openarchives.org
62Emergency Backup Slides
63resumptionToken
scenario: harvesting 277 records in 3 separate 100 record chunks
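The scenario above can be simulated without a network connection. The sketch below makes several assumptions that are not in the slides: `fake_repository` stands in for an HTTP request, the identifiers `oai:example:N` are invented, the token is simply the next offset, and an empty resumptionToken element is taken to mark the final chunk.

```python
import xml.etree.ElementTree as ET

TOTAL, CHUNK = 277, 100  # figures from the scenario above


def fake_repository(token=None):
    """Stand-in for an HTTP request: return one ListIdentifiers-style chunk as XML."""
    cursor = int(token) if token else 0
    ids = range(cursor, min(cursor + CHUNK, TOTAL))
    headers = "".join(
        f"<header><identifier>oai:example:{i}</identifier></header>" for i in ids
    )
    next_cursor = cursor + CHUNK
    # Hypothetical token format: the next offset; empty on the last chunk.
    token_text = str(next_cursor) if next_cursor < TOTAL else ""
    rt = (
        f'<resumptionToken completeListSize="{TOTAL}" cursor="{cursor}">'
        f"{token_text}</resumptionToken>"
    )
    return f"<OAI-PMH><ListIdentifiers>{headers}{rt}</ListIdentifiers></OAI-PMH>"


def harvest(fetch):
    """Follow resumptionTokens until exhausted; return (identifiers, request count)."""
    identifiers, requests, token = [], 0, None
    while True:
        root = ET.fromstring(fetch(token))
        requests += 1
        identifiers += [e.text for e in root.iter("identifier")]
        rt = root.find("ListIdentifiers/resumptionToken")
        token = rt.text if rt is not None else None
        if not token:
            return identifiers, requests


ids, request_count = harvest(fake_repository)
```

With these figures the loop issues three requests and collects 100, 100 and 77 identifiers. The `cursor` and `completeListSize` attributes are emitted by the stand-in but not needed by the loop; a harvester could use them for progress reporting.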
64 Open Archives Initiative
- Open Archival Information System (OAIS): insuring long-term preservation of archival materials
- Open Archives Initiative (OAI): exposure of metadata for harvesting
- OAIS w/ an OAI interface
- http://www.dlib.org/dlib/april01/04editorial.html
- http://www.dlib.org/dlib/may01/05letters.html
- http://ssdoo.gsfc.nasa.gov/nost/isoas/us/overview.html
65Field of Dreams
- It should be easy to be a data provider, even if it makes more work for the service provider.
- if enough data providers exist, the service providers will come (DPs >> SPs)
- Open-source / freely available tools
- drop-in data providers
- industrial strength: http://www.eprints.org/
- personal size: http://kepler.cs.odu.edu/
- tools to make your existing DL a data provider
- http://www.openarchives.org/tools/tools.htm
- also: OAI-implementers mailing list / mail archive!
- service providers
- only bits and pieces currently publicly available...