Title: Introduction to the OAI-PMH
1Introduction to the OAI-PMH
- Michael L. Nelson
- mln_at_cs.odu.edu
- http://www.cs.odu.edu/~mln/
- Several slides from Herbert Van de Sompel, Simeon Warner and Terry L. Harrison
- University of Southern California
- 6/15/04
2Outline
- History of OAI-PMH
- UPS, Santa Fe Convention
- Overview of the OAI-PMH
- verbs
- data model
- OAI 1.0, 1.1, 2.0 and how 2.0 was created
- Example data providers and service providers
- More information
- http://www.openarchives.org/
3UPS and SFC
4The Rise and Fall of Distributed Searching
- wholesale distributed searching, popular at the time, is attractive in theory but troublesome in practice
- Davis & Lagoze, JASIS 51(3), pp. 273-80
- Powell & French, Proc. 5th ACM DL, pp. 264-265
- distributed searching of N nodes still viable, but only for small values of N
- NCSTRL: N > 100 (bad)
- NTRS/NIX: N < 20 (ok, but could be better)
5The Rise and Fall of Distributed Searching
- Other problems of distributed searching (from STARTS)
- source-metadata problem
- how do you know which nodes to search?
- query-language problem
- syntax varies and drifts over time between the various nodes
- rank-merging problem
- how do you meaningfully merge multiple result sets?
6Universal Preprint Service
- A cross-archive DL that provides services on a collection of metadata harvested from multiple archives
- based on NCSTRL+, a modified version of Dienst
- support for clustering
- support for buckets
- Demonstrated at Santa Fe NM, October 21-22, 1999
- http://ups.cs.odu.edu/
- D-Lib Magazine, 6(2) 2000 (2 articles)
- http://www.dlib.org/dlib/february00/02contents.html
- UPS was soon renamed the Open Archives Initiative (OAI)
- http://www.openarchives.org/
7UPS Participants
totals ca. July 1999
8Metadata Harvesting
- Getting metadata out of archives
- not all archives support metadata extraction
- some archives have undocumented metadata extraction procedures
- not all archives support rich criteria for extraction
- single dump concept only
- Intellectual property and use rights not always clear
- many policies akin to "don't ask, don't tell"
9Metadata Formatting and Quality
- Quality problems with
- record duplication
- crucial missing fields
- internal errors
- ambiguous references to people, places and publications
- Different formats!
Observation: n digital libraries results in O(n) metadata formats
10Buckets: Information Surrogates in UPS
- Limitations on intellectual property, file size, transmission time, system load, etc. caused us to focus on metadata only
- Metadata was collected into buckets, with pointers back to the data files (still at the original sites)
11Value Added Services Attached to the Buckets
- SFX Reference Linking Service, developed at Univ of Ghent, Belgium
- provides a layer of indirection between reference services available at a local site and the object itself
- SFX buttons are attached to the buckets themselves
- communication occurs between the SFX server and the bucket
- Adding other services to the buckets is easy...
12Data and Service Providers
- Data Providers
- publishing into an archive
- Self-describing archives
- Much of the learning about the constituent UPS archives occurred out of band
- providing methods for metadata harvesting
- provide non-technical context for sharing information also
- Service Providers
- harvest metadata from providers
- implement user interface to data
Even if these are done by the same DL, these are distinct roles
13Metadata Harvesting
- Move away from distributed searching
- Extract metadata from various sources
- Build services on local copies of metadata
- data remains at remote repositories
[Figure: a user query (e.g. "search for cfd applications") runs against a local copy of metadata that was harvested offline from several independently maintained nodes. All searching, browsing, etc. is performed on the local metadata; individual nodes can still support direct user interaction.]
14Result: OAI
- The OAI was the result of the demonstration and discussion during the Santa Fe meeting
- Initial focus was on federating collections of scholarly e-print materials
- however, interest grew and the scope and application of OAI expanded to become a generic bulk metadata transport protocol
- Note
- OAI is only about metadata -- not full text!
- what is metadata and what is full text?
- OAI is neutral with respect to the nature of the metadata or the resources the metadata describes
- read: commercial publishers have an interest in OAI too...
15Open Archives Initiative
16Open Archives Initiative
- Open Archival Information System (OAIS): ensuring long-term preservation of archival materials
- Open Archives Initiative (OAI): exposure of metadata for harvesting
- the two can be combined: an OAIS w/ an OAI interface
- http://www.dlib.org/dlib/april01/04editorial.html
- http://www.dlib.org/dlib/may01/05letters.html
- http://ssdoo.gsfc.nasa.gov/nost/isoas/us/overview.html
17OAI Protocol for Metadata Harvesting
- Then
- OAI-PMH originally a subset of the Dienst (NCSTRL) protocol
- and originally called the Santa Fe Convention
- originally defined an OAI-specific metadata format
- Now
- OAI metadata format dropped in favor of unqualified Dublin Core
- other formats possible, but DC is required as lowest common denominator
- No longer dependent on Dienst (Cornell CS TR 95-1514)
- defined independently (though still easily mappable)
18Dublin Core
- Dublin Core Metadata Initiative
- http://www.dublincore.org/
- from 1994-1995, recognizing the need for simple, interoperable metadata for resource discovery
- good overview of metadata & DC: http://www.dlib.org/dlib/january01/lagoze/01lagoze.html
- 15 elements (qualifiers/refinements possible)
19Open Archives Initiative Protocol for Metadata
Harvesting
20OAI-PMH Actors
- data providers / repositories
- A repository is a network accessible server that can process the 6 OAI-PMH requests in the manner described in the OAI-PMH document. A repository is managed by a data provider to expose metadata to harvesters.
- service providers / harvesters
- A harvester is a client application that issues OAI-PMH requests. A harvester is operated by a service provider as a means of collecting metadata from repositories.
21Data Providers / Service Providers
22Aggregators
- aggregators allow for
- scalability for OAI-PMH
- load balancing
- community building
- discovery
service providers (harvesters)
data providers (repositories)
aggregator
23Aggregators
- Frequently interchangeable terms
- aggregators: likely to be community / institutionally focused
- caches: store a copy, less likely to be community-oriented
- proxies: less likely to store a copy, may gateway between OAI-PMH and other protocols
- Dienst / OAI Gateway: Harrison, Nelson, Zubair, JCDL 03
- To learn more about aggregators, caches and proxies
- http://www.openarchives.org/OAI/2.0/guidelines-aggregator.htm
- http://www.cs.odu.edu/~mln/jcdl03/
24OAI-PMH Data Model
item = identifier
record = identifier + metadata format + datestamp
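The item/record relationship can be sketched with two small Python classes. This is only an illustration: the class and field names (Item, Record, metadata_prefix, and so on) are assumptions chosen for readability, not names taken from the protocol document.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Record:
    """One view of an item: identifier + metadata format + datestamp."""
    identifier: str
    metadata_prefix: str
    datestamp: str
    metadata: str = ""  # serialized XML payload (left empty in this sketch)


@dataclass
class Item:
    """An item is addressed by its identifier and yields one record per format."""
    identifier: str
    records: Dict[str, Record] = field(default_factory=dict)

    def add_record(self, metadata_prefix: str, datestamp: str, metadata: str = "") -> Record:
        record = Record(self.identifier, metadata_prefix, datestamp, metadata)
        self.records[metadata_prefix] = record
        return record


item = Item("oai:arXiv:cs/0112017")
item.add_record("oai_dc", "2001-12-14")
```

All records of one item share the item's identifier; only the metadata format (and possibly the datestamp) differs between them.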
25Overview of OAI-PMH Verbs
Verb                 Function                                  Group
Identify             description of repository                 metadata about the repository
ListMetadataFormats  metadata formats supported by repository  metadata about the repository
ListSets             sets defined by repository                metadata about the repository
ListIdentifiers      OAI unique ids contained in repository    harvesting verbs
ListRecords          listing of N records                      harvesting verbs
GetRecord            listing of a single record                harvesting verbs
most verbs take arguments: dates, sets, ids, metadata formats and resumption token (for flow control)
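Since every request is a verb plus a handful of arguments sent over HTTP, a request URL can be assembled with the standard library alone. The helper name build_request and the trailing-underscore convention for "from" are assumptions made for this sketch.

```python
from urllib.parse import urlencode

VERBS = ("Identify", "ListMetadataFormats", "ListSets",
         "ListIdentifiers", "ListRecords", "GetRecord")


def build_request(base_url, verb, **arguments):
    """Return the GET URL for one OAI-PMH request."""
    if verb not in VERBS:
        raise ValueError(f"not an OAI-PMH verb: {verb}")
    params = {"verb": verb}
    for name, value in arguments.items():
        if value is not None:
            # "from" is a Python keyword, so callers pass from_=... instead
            params[name.rstrip("_")] = value
    return base_url + "?" + urlencode(params)


url = build_request("http://arXiv.org/oai2", "GetRecord",
                    identifier="oai:arXiv:cs/0112017",
                    metadataPrefix="oai_dc")
```

Note that urlencode percent-encodes the ":" and "/" characters inside the identifier, which is what a repository should expect in a query string.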
26supporting protocol requests
service provider harvester
data provider repository
Identify
- Identify / Time / Request
- Repository identifier
- Base-URL
- Admin e-mail
- OAI protocol version
- Description
herbert van de sompel
27Identify
1.1
- Arguments
- none
- Errors
- none
2.0
- Arguments
- none
- Errors
- badArgument
28supporting protocol requests
service provider harvester
data provider repository
ListMetadataFormats
identifier=oai:mlib:123a
- ListMetadataFormats / Time / Request
- REPEAT
- Format prefix
- Format XML schema
- /REPEAT
29ListMetadataFormats
1.1
- Arguments
- identifier (OPTIONAL)
- Errors
- id does not exist
2.0
- Arguments
- identifier (OPTIONAL)
- Errors
- badArgument
- noMetadataFormats
- idDoesNotExist
30supporting protocol requests
service provider harvester
data provider repository
ListSets resumptionToken
- ListSets / Time / Request
- REPEAT
- SetSpec
- SetName
- /REPEAT
31ListSets
1.1
- Arguments
- resumptionToken (EXCLUSIVE)
- Errors
- no set hierarchy
2.0
- Arguments
- resumptionToken (EXCLUSIVE)
- Errors
- badArgument
- badResumptionToken
- noSetHierarchy
32harvesting requests
ListRecords
from=a
until=b
set=klm
metadataPrefix=dc
resumptionToken
service provider harvester
data provider repository
- ListRecords / Time / Request
- REPEAT
- Identifier
- Datestamp
- Metadata
- /REPEAT
33ListRecords
1.1
- Arguments
- from (OPTIONAL)
- until (OPTIONAL)
- set (OPTIONAL)
- resumptionToken (EXCLUSIVE)
- metadataPrefix (REQUIRED)
- Errors
- no records match
- metadata format cannot be disseminated
2.0
- Arguments
- from (OPTIONAL)
- until (OPTIONAL)
- set (OPTIONAL)
- resumptionToken (EXCLUSIVE)
- metadataPrefix (REQUIRED)
- Errors
- noRecordsMatch
- cannotDisseminateFormat
- badResumptionToken
- noSetHierarchy
- badArgument
34harvesting requests
service provider harvester
data provider repository
ListIdentifiers
from=a
until=b
set=klam
metadataPrefix
resumptionToken
- ListIdentifiers / Time / Request
- REPEAT
- Identifier
- Datestamp
- /REPEAT
35ListIdentifiers
1.1
- Arguments
- from (OPTIONAL)
- until (OPTIONAL)
- set (OPTIONAL)
- resumptionToken (EXCLUSIVE)
- Errors
- no records match
2.0
- Arguments
- from (OPTIONAL)
- until (OPTIONAL)
- set (OPTIONAL)
- resumptionToken (EXCLUSIVE)
- metadataPrefix (REQUIRED)
- Errors
- badArgument
- cannotDisseminateFormat
- badResumptionToken
- noSetHierarchy
- noRecordsMatch
36harvesting requests
service provider harvester
data provider repository
GetRecord
identifier=oai:mlib:123a
metadataPrefix=dc
- GetRecord / Time / Request
- Identifier
- Datestamp
- Metadata
37GetRecord
1.1
- Arguments
- identifier (REQUIRED)
- metadataPrefix (REQUIRED)
- Errors
- id does not exist
- metadata format cannot be disseminated
2.0
- Arguments
- identifier (REQUIRED)
- metadataPrefix (REQUIRED)
- Errors
- badArgument
- cannotDisseminateFormat
- idDoesNotExist
38Argument Summary
Verb                 metadataPrefix  from      until     set       resumptionToken  identifier
Identify             -               -         -         -         -                -
ListMetadataFormats  -               -         -         -         -                optional
ListSets             -               -         -         -         exclusive        -
ListIdentifiers      required        optional  optional  optional  exclusive        -
ListRecords          required        optional  optional  optional  exclusive        -
GetRecord            required        -         -         -         -                required
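The argument rules for OAI-PMH 2.0 can be written down as a small lookup table and checked mechanically. The following is a minimal sketch built from the per-verb argument lists on the preceding slides; the names ARGUMENT_RULES and check_arguments are assumptions, and a real repository would also validate the argument values.

```python
ARGUMENT_RULES = {
    # verb: (required, optional, accepts resumptionToken as exclusive argument)
    "Identify": (set(), set(), False),
    "ListMetadataFormats": (set(), {"identifier"}, False),
    "ListSets": (set(), set(), True),
    "ListIdentifiers": ({"metadataPrefix"}, {"from", "until", "set"}, True),
    "ListRecords": ({"metadataPrefix"}, {"from", "until", "set"}, True),
    "GetRecord": ({"identifier", "metadataPrefix"}, set(), False),
}


def check_arguments(verb, arguments):
    """Return None if the argument names are legal, else an OAI-PMH error code."""
    if verb not in ARGUMENT_RULES:
        return "badVerb"
    required, optional, token_allowed = ARGUMENT_RULES[verb]
    names = set(arguments)
    if "resumptionToken" in names:
        # exclusive: it must be the only argument sent along with the verb
        if token_allowed and names == {"resumptionToken"}:
            return None
        return "badArgument"
    if not required <= names:
        return "badArgument"  # a required argument is missing
    if names - required - optional:
        return "badArgument"  # an argument this verb does not accept
    return None
```

For example, check_arguments("ListRecords", {"metadataPrefix": "oai_dc", "from": "2001-01-01"}) passes, while combining resumptionToken with any other argument is rejected.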
39Error Summary
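Verb                 Errors
Identify             BA
ListMetadataFormats  BA NMF IDDNE
ListSets             BA BRT NSH
ListIdentifiers      BA BRT CDF NRM NSH
ListRecords          BA BRT CDF NRM NSH
GetRecord            BA CDF IDDNE
BA = badArgument, BRT = badResumptionToken, CDF = cannotDisseminateFormat, IDDNE = idDoesNotExist, NMF = noMetadataFormats, NRM = noRecordsMatch, NSH = noSetHierarchy
Generate badVerb on any input not matching the 6 defined verbs.
This is an inversion of the table in section 3.6 of the OAI-PMH specification.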
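The per-verb error table can be kept as data and inverted back into the error-to-verbs orientation that the slide attributes to the specification. This is a sketch; LEGAL_ERRORS, verbs_by_error and error_is_legal are hypothetical names.

```python
LEGAL_ERRORS = {
    "Identify": {"badArgument"},
    "ListMetadataFormats": {"badArgument", "noMetadataFormats", "idDoesNotExist"},
    "ListSets": {"badArgument", "badResumptionToken", "noSetHierarchy"},
    "ListIdentifiers": {"badArgument", "badResumptionToken", "cannotDisseminateFormat",
                        "noRecordsMatch", "noSetHierarchy"},
    "ListRecords": {"badArgument", "badResumptionToken", "cannotDisseminateFormat",
                    "noRecordsMatch", "noSetHierarchy"},
    "GetRecord": {"badArgument", "cannotDisseminateFormat", "idDoesNotExist"},
}


def verbs_by_error(legal_errors):
    """Invert verb -> errors into error -> sorted list of verbs."""
    inverted = {}
    for verb, errors in legal_errors.items():
        for code in errors:
            inverted.setdefault(code, []).append(verb)
    return {code: sorted(verbs) for code, verbs in inverted.items()}


def error_is_legal(verb, code):
    """badVerb is the only legal error for input that is not one of the 6 verbs."""
    if verb not in LEGAL_ERRORS:
        return code == "badVerb"
    return code in LEGAL_ERRORS[verb]
```

A harvester could use error_is_legal as a sanity check on repositories, flagging any error code that the verb it issued should never produce.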
40Flow Control
- ListSets, ListIdentifiers, ListRecords are all allowed to return partial responses, via a combination of
- resumptionToken: an opaque, archive-defined data string that when passed back to the archive allows the response to begin where it left off
- each archive defines its own resumptionToken syntax; it may have visible semantics or not
- 503 HTTP status code: retry after
- up to the harvester to understand this code and respect it, and up to the archive to enforce it
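A well-behaved harvester's side of the 503 handshake might look like the sketch below. The transport is injected as a fetch callable returning (status, headers, body) so the logic can be exercised without a network; that signature, the function name fetch_with_retry, and the 60-second fallback are assumptions of this example. It also assumes Retry-After carries a number of seconds, although HTTP permits a date there as well.

```python
import time


def fetch_with_retry(fetch, url, max_attempts=5, sleep=time.sleep):
    """Call fetch(url) and honor 503 + Retry-After before trying again."""
    for _ in range(max_attempts):
        status, headers, body = fetch(url)
        if status == 503:
            # wait as long as the archive asked (assumed to be given in seconds)
            sleep(int(headers.get("Retry-After", "60")))
            continue
        return status, body
    raise RuntimeError(f"gave up on {url} after {max_attempts} attempts")
```

Passing sleep as a parameter keeps the waiting policy replaceable, for instance by a scheduler that requeues the request instead of blocking.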
41resumptionToken
scenario: harvesting 277 records in 3 separate chunks of at most 100 records
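The scenario can be simulated with a toy in-memory repository. Everything here is made up for illustration: the identifiers, the chunk size constant, and in particular the token syntax (the next offset written as text), which a real archive is free to define differently and a harvester must treat as opaque.

```python
RECORDS = [f"oai:example:{n}" for n in range(277)]
CHUNK = 100


def list_identifiers(resumption_token=None):
    """Toy repository: return (chunk, next_token); next_token is None when done."""
    start = int(resumption_token) if resumption_token else 0
    chunk = RECORDS[start:start + CHUNK]
    next_start = start + CHUNK
    next_token = str(next_start) if next_start < len(RECORDS) else None
    return chunk, next_token


def harvest():
    """Harvester loop: keep passing the token back until none is returned."""
    chunks, token = [], None
    while True:
        chunk, token = list_identifiers(token)
        chunks.append(chunk)
        if token is None:
            return chunks


chunk_sizes = [len(chunk) for chunk in harvest()]  # [100, 100, 77]
```

The harvester never inspects the token; it only echoes it, which is what lets each archive choose its own syntax.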
42Let's Look at Some Repositories
- Repository Explorer
- http://www.purl.org/NET/oai_explorer
43OAI-PMH 1.0, 1.1, 2.0
44Santa Fe convention
OAI-PMH v.1.0/1.1
OAI-PMH v.2.0
45 Santa Fe Convention 02/2000
- goal: optimize discovery of e-prints
- input
- the UPS prototype
- RePEc / SODA data provider / service provider model
- Dienst protocol
- deliberations at Santa Fe meeting 10/99
46 OAI-PMH v.1.0 01/2001
- goal: optimize discovery of document-like objects
- input
- SFC
- DLF meetings on metadata harvesting
- deliberations at Cornell meeting 09/00
- alpha test group of OAI-PMH v.1.0
47 OAI-PMH v.1.0 01/2001
- low-barrier interoperability specification
- metadata harvesting model: data provider / service provider
- focus on document-like objects
- autonomous protocol
- HTTP based
- XML responses
- unqualified Dublin Core
- experimental: 12-18 months
48Selected Pre- 2.0 OAI Highlights
- October 21-22, 1999 - initial UPS meeting
- February 15, 2000 - Santa Fe Convention published in D-Lib Magazine
- precursor to the OAI metadata harvesting protocol
- June 3, 2000 - workshop at ACM DL 2000 (Texas)
- August 25, 2000 - OAI steering committee formed, DLF/CNI support
- September 7-8, 2000 - technical meeting at Cornell University
- defined the core of the current OAI metadata harvesting protocol
- September 21, 2000 - workshop at ECDL 2000 (Portugal)
- November 1, 2000 - Alpha test group announced (15 organizations)
- January 23, 2001 - OAI protocol 1.0 announced, OAI Open Day in the U.S. (Washington DC)
- purpose: freeze protocol for 12-16 months, generate critical mass
- February 26, 2001 - OAI Open Day in Europe (Berlin)
- July 3, 2001 - OAI protocol 1.1 announced
- to reflect changes in the W3C's latest XML Schema recommendation
- September 8, 2001 - workshop at ECDL 2001 (Darmstadt)
49 OAI-PMH v.2.0 06/2002
- goal: recurrent exchange of metadata about resources between systems
- input
- OAI-PMH v.1.0
- feedback on OAI-implementers
- deliberations by OAI-tech 09/01 - 06/02
- alpha test group of OAI-PMH v.2.0 03/02 - 06/02
- officially released June 14, 2002
50 OAI-PMH v.2.0 06/2002
- low-barrier interoperability specification
- metadata harvesting model: data provider / service provider
- metadata about resources
- autonomous protocol
- HTTP based
- XML responses
- unqualified Dublin Core
- stable
51releasing OAI-PMH v.2.0 (illustrating the OAI process)
See also: Lagoze, Carl and Van de Sompel, Herbert. The making of the Open Archives Initiative Protocol for Metadata Harvesting. 2003. Library Hi Tech. v21, N2. Draft
52 53 creation of OAI-tech 06/01
- created for 1 year period
- charge
- review functionality and nature of OAI-PMH v.1.0
- investigate extensions
- release stable version of OAI-PMH by 05/02
- determine need for infrastructure to support broad adoption of the protocol
- communication: listserv, SourceForge, conference calls
54OAI-tech
- US representatives: Thomas Krichel (Long Island U), Jeff Young (OCLC), Tim Cole (U of Illinois at Urbana-Champaign), Hussein Suleman (Virginia Tech), Simeon Warner (Cornell U), Michael Nelson (NASA), Caroline Arms (LoC), Mohammad Zubair (Old Dominion U), Steven Bird (U Penn.)
- European representatives: Andy Powell (Bath U. & UKOLN), Mogens Sandfaer (DTV), Thomas Baron (CERN), Les Carr (U of Southampton)
55 pre-alpha phase 09/01 - 02/02
- review process by OAI-tech
- identification of issues
- conference call to filter/combine issues
- white paper per issue
- on-line discussion per white paper
- proposal for resolution of issue by OAI-exec
- discussion of proposal & closure of issue
- conference call to resolve open issues
56 pre-alpha phase 02/02
- creation of revised protocol document
- in-person meeting: Lagoze - Van de Sompel - Nelson - Warner
- autonomous decisions
- internal vetting of protocol document
57alpha phase 02/02 - 05/02
- alpha-1 release to OAI-tech March 1st 2002
- OAI-tech extended with alpha testers
- discussions/implementations by OAI-tech
- ongoing revision of protocol document
58OAI-PMH 2.0 alpha testers (1/2)
- The British Library
- Cornell U. -- NSDL project & e-print arXiv
- Ex Libris
- FS Consulting Inc -- harvester for my.OAI
- Humboldt-Universität zu Berlin
- InQuirion Pty Ltd, RMIT University
- Library of Congress
- NASA
- OCLC
59OAI-PMH 2.0 alpha testers (2/2)
- Old Dominion U. -- ARC, DP9
- U. of Illinois at Urbana-Champaign
- U. of Southampton -- OAIA (now Celestial), CiteBase, eprints.org
- UCLA, Johns Hopkins U., Indiana U., NYU -- sheet music collection
- UKOLN, U. of Bath -- RDN
- Virginia Tech -- repository explorer
60 beta phase 05/02-06/02
- beta release on May 1st 2002 to
- registered data providers and service providers
- interested parties
- fine tuning of protocol document
- preparation for the release of 2.0 conformant
tools by alpha testers
61OAI-PMH v.2.0 highlights
62- important improvements in 2.0
63 important improvements
64 protocol vs periphery
- clear distinction between protocol and periphery
- fixed protocol document
- extensible implementation guidelines
- e.g. sample metadata formats, description containers, about containers
- allows for OAI guidelines and community guidelines
65 OAI-PMH vs HTTP
- clear separation of OAI-PMH and HTTP
- OAI-PMH error handling
- all OK at HTTP level? => 200 OK
- something wrong at OAI-PMH level? => OAI-PMH error (e.g. badVerb)
- HTTP codes 302, 503, etc. still available to implementers, but no longer represent OAI-PMH events
66 other improvements
- better definitions of harvester, repository, item, unique identifier, record, set, selective harvesting
- oai_dc schema builds on DCMI XML Schema for unqualified Dublin Core
- usage of must, must not etc. as in RFC 2119
- wording on response compression
67 other improvements
- all protocol responses can be validated with a single XML Schema
- easier for data providers
- no redundancy in type definitions
- SOAP-ready
- clean for error handling
68 response no errors
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request verb="GetRecord" ...>http://arXiv.org/oai2</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:arXiv:cs/0112017</identifier>
        <datestamp>2001-12-14</datestamp>
        <setSpec>cs</setSpec>
        <setSpec>math</setSpec>
      </header>
      <metadata>
        ..
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
69 response with error
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request>http://arXiv.org/oai2</request>
  <error code="badVerb">ShowMe is not a valid OAI-PMH verb</error>
</OAI-PMH>
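Because errors now travel inside a normal 200 OK response, a harvester has to look for an error element before touching the payload. The sketch below uses the standard library's ElementTree; the function name oai_error and the two sample documents are assumptions, and the samples carry the OAI-PMH 2.0 namespace, which the abbreviated slide examples omit.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def oai_error(xml_bytes):
    """Return (code, message) if the response carries an error, else None."""
    root = ET.fromstring(xml_bytes)
    error = root.find(OAI_NS + "error")
    if error is None:
        return None
    return error.get("code"), (error.text or "").strip()


ERROR_RESPONSE = b"""<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request>http://arXiv.org/oai2</request>
  <error code="badVerb">ShowMe is not a valid OAI-PMH verb</error>
</OAI-PMH>"""

OK_RESPONSE = b"""<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request verb="Identify">http://arXiv.org/oai2</request>
  <Identify/>
</OAI-PMH>"""
```

Only the first error element is reported here; a fuller client would collect all of them, since a response may in principle carry more than one.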
70 corrections
71 dates/times
- all dates/times are UTC, encoded in ISO8601, Z-notation
- 1957-03-20T20:30:00Z
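Producing such a datestamp from a local time takes one conversion to UTC and one format string. The function name to_oai_datestamp and the UTC-5 example offset are assumptions of this sketch.

```python
from datetime import datetime, timedelta, timezone


def to_oai_datestamp(moment):
    """Render a timezone-aware datetime as UTC in ISO8601 Z-notation."""
    if moment.tzinfo is None:
        raise ValueError("naive datetime: the time zone is unknown")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# 15:30 local time at UTC-5 is 20:30 UTC
local = datetime(1957, 3, 20, 15, 30, 0, tzinfo=timezone(timedelta(hours=-5)))
stamp = to_oai_datestamp(local)
```

Rejecting naive datetimes is a deliberate choice: silently assuming local time is a common source of off-by-hours harvesting gaps.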
72 resumptionToken
- idempotency of resumptionToken: return same incomplete list when rT is reissued
- while no changes occur in the repo: strict
- while changes occur in the repo: all items with unchanged datestamp
- new, optional attributes for the resumptionToken
- expirationDate
- completeListSize
- cursor
73 noRecordsMatch
- 1.x - if no records match, an empty list was
returned
74 noRecordsMatch
- 2.0 - if no records match, the error condition
noRecordsMatch is returned -- not an empty list
75 new functionality
76 harvesting granularity
- harvesting granularity
- mandatory support of YYYY-MM-DD
- optional support of YYYY-MM-DDThh:mm:ssZ
- other granularities considered, but ultimately rejected
- granularity of from and until must be the same
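A repository can enforce these granularity rules with two patterns. This is a minimal sketch that checks only the shape of the values (not, for example, whether the month is valid); the names granularity and check_range are assumptions.

```python
import re

DAY = re.compile(r"\d{4}-\d{2}-\d{2}")
SECONDS = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z")


def granularity(value):
    """Classify a datestamp argument, or return None if it matches neither form."""
    if DAY.fullmatch(value):
        return "YYYY-MM-DD"
    if SECONDS.fullmatch(value):
        return "YYYY-MM-DDThh:mm:ssZ"
    return None


def check_range(from_, until, repository_granularity="YYYY-MM-DD"):
    """Return None if from/until are acceptable, else 'badArgument'."""
    found = [granularity(value) for value in (from_, until) if value is not None]
    if None in found:
        return "badArgument"  # unrecognized date format
    if len(set(found)) > 1:
        return "badArgument"  # from and until use different granularities
    if found and found[0] != "YYYY-MM-DD" and repository_granularity == "YYYY-MM-DD":
        return "badArgument"  # finer than the repository supports
    return None
```

The repository_granularity parameter corresponds to the value a repository advertises in its Identify response.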
77 Identify
<Identify>
  <repositoryName>Library of Congress 1</repositoryName>
  <baseURL>http://memory.loc.gov/cgi-bin/oai</baseURL>
  <protocolVersion>2.0</protocolVersion>
  <adminEmail>r.e.gillian_at_larc.nasa.gov</adminEmail>
  <adminEmail>rgillian_at_visi.net</adminEmail>
  <earliestDatestamp>1990-02-01T00:00:00Z</earliestDatestamp>
  <deletedRecord>transient</deletedRecord>
  <granularity>YYYY-MM-DDThh:mm:ssZ</granularity>
  <compression>deflate</compression>
78 header
- header contains set membership of item
<record>
  <header>
    <identifier>oai:arXiv:cs/0112017</identifier>
    <datestamp>2001-12-14</datestamp>
    <setSpec>cs</setSpec>
    <setSpec>math</setSpec>
  </header>
  <metadata>
    ..
  </metadata>
</record>
79 ListIdentifiers
- ListIdentifiers returns headers
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request verb="..." ...>http://arXiv.org/oai2</request>
  <ListIdentifiers>
    <header>
      <identifier>oai:arXiv:hep-th/9801001</identifier>
      <datestamp>1999-02-23</datestamp>
      <setSpec>physic:hep</setSpec>
    </header>
    <header>
      <identifier>oai:arXiv:hep-th/9801002</identifier>
      <datestamp>1999-03-20</datestamp>
      <setSpec>physic:hep</setSpec>
      <setSpec>physic:exp</setSpec>
    </header>
80 ListIdentifiers
- ListIdentifiers mandates metadataPrefix as argument
http://www.perseus.tufts.edu/cgi-bin/pdataprov?verb=ListIdentifiers&metadataPrefix=olac&from=2001-01-01&until=2001-01-01&set=Perseus:collection:PersInfo
81 ListIdentifiers
- the changes to ListIdentifiers are subtle, and reflect a change in the OAI-PMH data model
- Could have been named ListHeaders or reduced to an option for ListRecords
- ListIdentifiers kept for lexigraphical consistency
82 metadataPrefix
- character set for metadataPrefix and setSpec
extended to URL-safe characters
A-Z a-z 0-9 _ ! ( ) - .
83 in the periphery
84 provenance
- introduction of provenance container to facilitate tracing of harvesting history
<about>
  <provenance>
    <originDescription>
      <baseURL>http://an.oa.org</baseURL>
      <identifier>oai:r1:plog/9801001</identifier>
      <datestamp>2001-08-13T13:00:02Z</datestamp>
      <metadataPrefix>oai_dc</metadataPrefix>
      <harvestDate>2001-08-15T12:01:30Z</harvestDate>
      <originDescription> ... </originDescription>
    </originDescription>
  </provenance>
</about>
85 friends
- introduction of friends container to facilitate dynamic discovery of repositories
<description>
  <friends>
    <baseURL>http://cav2001.library.caltech.edu/perl/oai</baseURL>
    <baseURL>http://formations2.ulst.ac.uk/perl/oai</baseURL>
    <baseURL>http://cogprints.soton.ac.uk/perl/oai</baseURL>
    <baseURL>http://wave.ldc.upenn.edu/OLAC/dp/aps.php4</baseURL>
  </friends>
</description>
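A harvester could use the friends container to discover further repositories to visit. The sketch below pulls the baseURL entries out of any friends element; it compares local element names so that it works with or without XML namespaces. The function names and the shortened sample document are assumptions.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def friend_base_urls(xml_text):
    """Return the baseURLs listed inside any <friends> container."""
    root = ET.fromstring(xml_text)
    urls = []
    for element in root.iter():
        if local_name(element.tag) != "friends":
            continue
        for child in element:
            if local_name(child.tag) == "baseURL" and child.text:
                urls.append(child.text.strip())
    return urls


SAMPLE = """<Identify>
  <baseURL>http://an.oa.org/oai</baseURL>
  <description>
    <friends>
      <baseURL>http://cogprints.soton.ac.uk/perl/oai</baseURL>
      <baseURL>http://wave.ldc.upenn.edu/OLAC/dp/aps.php4</baseURL>
    </friends>
  </description>
</Identify>"""

friends = friend_base_urls(SAMPLE)
```

Restricting the search to children of friends matters: the repository's own baseURL sits at the top level of the Identify response and must not be mistaken for a friend.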
86 branding
- introduction of branding container for DPs to suggest rendering & association hints
<branding xmlns="http://www.openarchives.org/OAI/2.0/branding/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/branding/
                        http://www.openarchives.org/OAI/2.0/branding.xsd">
  <collectionIcon>
    <url>http://my.site/icon.png</url>
    <link>http://my.site/homepage.html</link>
    <title>MySite(tm)</title>
    <width>88</width>
    <height>31</height>
  </collectionIcon>
  <metadataRendering
      metadataNamespace="http://www.openarchives.org/OAI/2.0/oai_dc/"
      mimeType="text/xsl">http://some.where/DCrender.xsl</metadataRendering>
  <metadataRendering
      metadataNamespace="http://another.place/MARC"
      mimeType="text/css">http://another.place/MARCrender.css</metadataRendering>
87 oai-identifier
- revision of oai-identifier
<description>
  <oai-identifier xmlns="http://www.openarchives.org/OAI/2.0/oai-identifier"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai-identifier
                          http://www.openarchives.org/OAI/2.0/oai-identifier.xsd">
    <scheme>oai</scheme>
    <repositoryIdentifier>oai-stuff.foo.org</repositoryIdentifier>
    <delimiter>:</delimiter>
    <sampleIdentifier>oai:oai-stuff.foo.org:5324</sampleIdentifier>
  </oai-identifier>
</description>
domain based repository names
88 oai_dc
- OAI 1.x: oai_dc Schema defined by OAI
- OAI 2.0: oai_dc Schema imports from DCMI Schema for unqualified DC elements
89 MARC21
- OAI 1.x: oai_marc
- OAI 2.0: LoC marcxml, oai_marc
- http://www.loc.gov/standards/marcxml/
90 did not make it into OAI-PMH v.2.0
91- SOAP implementation
- Result set filtering
- Multiple / best metadata
- GetRecord -> GetRecords
- Machine readable rights management
- XML format for mini-archives
92Example Data and Service Providers
93NTRS OAI Architecture
[Figure: NTRS OAI architecture. A user query (e.g. "search for cfd applications") runs against NTRS, which holds a local copy of metadata harvested offline, through the OAI interface, from independently maintained nodes (LTRS, ATRS, GTRS, CASITRS, ...). All searching, browsing, etc. is performed on the metadata at NTRS; individual nodes can still support direct user interaction; content (reports) remains archived at the local sites.]
94NASA Technical Report Server
- replacement for the previous distributed searching version of NTRS
- MySQL
- Va Tech harvester
- modified bucket
- details in Nelson, Rocker, Harrison, Library Hi Tech, 21(2) (March 2003)
- a service provider & aggregator
- same OAI baseURL as used for interactive searching
- http://ntrs.nasa.gov/
95NASA Technical Report Server
- advanced, fielded search
- explicit query routing
- 12 NASA repositories
- 4 non-NASA repositories
- turned off by default
- >600k abstracts, >300k full-text
96NASA DLs in the Larger STI Realm
DOE
. . .
DOD
Universities
Publishers
International
this could be a fully connected graph
NTRS could also be a data provider from the
point of view of other DLs allowing
the harvesting of NASA report metadata.
NTRS could also harvest metadata from other
DLs, and provide access to non-NASA content. We
hope to influence the direction of the
science.gov effort to use OAI-PMH
97New Kinds of DLs
- Drawing from the same pool of DPs
- different interfaces, capabilities and collection policies for
- public affairs
- K-12 education
- science research
- authors / librarians / managers
- NTRS and NIX could harvest from the same sources
- be the same DL, but with different interfaces?
- be replaced with a new, all-encompassing DL?
- DL creators can now focus on collection management
- "a la carting" their collections and sub collections
- instead of fussing over syntax & synchronization of remote search services
98Scientific Communication
- With only some exceptions, which interface is used for discovery is not as important as the fact that discovery occurred in the first place
- control of the discovered objects is not lost by data providers
- however, higher level mirroring services can be built on top of OAI (cf. NACA ARC mirroring between NASA LaRC and MAGiC)
99NACA Technical Report Server
- publicly available
- began in 1996
- details in NASA TM-1999-209127
- scanned reports from 1917-1958
- NACA: predecessor to NASA
- contents mirrored with the MAGiC project
- a UK-based grey-literature preservation project
- OAI-PMH used to mirror contents
- http://naca.larc.nasa.gov/
- http://naca.larc.nasa.gov/oai2.0/
100NACA Report 1345 as seen through its native DL
http://naca.larc.nasa.gov/
101NACA Report 1345 as seen through MAGiC
http://www.magic.ac.uk/
102NACA Report 1345 as seen through Scirus (Elsevier)
http://www.scirus.com/
103NACA Report 1345 as seen through my.OAI (FS Consulting)
http://www.myoai.com/
104What Does OAI-PMH Mean for Authors?
- On the surface, absolutely nothing!
- the ideal OAI deployment should be absolutely invisible to normal DL operations
- uninterested users should not even notice or care
- Indirectly, they should enjoy the benefits of the critical mass of current and developing DL tools & systems
- personal, institutional data providers
- proliferation of targeted, value-added service providers
105What Does OAI-PMH Mean For Publishers & Institutions?
- Absolutely everything
- The decoupling of SPs and DPs will have significant and profound implications on scientific and technical information exchange
- OAI-PMH is actually just one component in a larger engineering effort for scholarly communication (e.g. OpenURL)
- Service and resource integration will be the focus of journals, professional societies, universities, etc.
- OAI-PMH will be a basic, core technology for scientific publishing, like HTTP & XML
106Field of Dreams
- It should be easy to be a data provider, even if it makes more work for the service provider.
- if enough data providers exist, the service providers will come (DPs >> SPs)
- Open-source / freely available tools
- drop-in data providers
- industrial strength: http://www.eprints.org/
- personal size: http://kepler.cs.odu.edu/
- tools to make your existing DL a data provider
- http://www.openarchives.org/tools/tools.htm
- also: OAI-implementers mailing list / mail archive!
- service providers
- Arc: http://sourceforge.net/projects/oaiarc/
107OAI-PMH Meeting History
108Shift of Topics
- From the protocol itself, supporting debugging tools and how to retrofit (existing) DLs
- to building (new) services that use the OAI-PMH as a core technology and reporting on their impact to the institution/community
109Arc
- http://arc.cs.odu.edu/
- harvests all known archives
- first end-user service provider
- source available through SourceForge
- hierarchical harvesting
110NCSTRL
- http://www.ncstrl.org/
- metadata harvesting replacement for Dienst-based NCSTRL
- based on Arc
- computer science metadata
111Archon
- http://archon.cs.odu.edu/
- physics metadata
- based on Arc
- features
- citation indexing
- equation-based searching
112Torii
- http://torii.sissa.it/
- physics metadata
- features
- personalization
- recommendations
- WAP access
113iCite
- http://icite.sissa.it/
- physics metadata
- features
- citation based access to arXiv metadata
114my.OAI
- http://www.myoai.com/
- covers all registered metadata
- features
- result sets
- personalization
- many other advanced features
115Cyclades
- http://www.ercim.org/cyclades
- scientific metadata
- features
- personalization
- recommendations
- collaboration
- status?
116citebase
- http://citebase.eprints.org/
- arXiv metadata
- citation based indexing, reporting
117OAIster
- http://oaister.umdl.umich.edu/
- harvests all known archives
118Public Knowledge Project
- http://www.pkp.ubc.ca/harvester/
- domain-specific filtering of harvested metadata (?)
119Perseus
- http://www.perseus.tufts.edu/
- they claim to harvest all DPs, but only
humanities related DPs appear in the pull down
menu
120Others
- Commercial publishers
- American Physical Society (APS)
- Institute of Physics (IOP)
- Elsevier / Scirus (www.scirus.com)
- Department of Energy
- OSTI
- LANL
- Institutional servers
- DSpace (MIT www.dspace.org)
- Eprints (www.eprints.org)
- DARE (All Dutch universities)
121Service Providers
- It is clear that SPs are proliferating, despite (because of?) the inherent bias toward DPs in the protocol
- easy to be a DP -> many DPs -> SPs eventually emerge
- hard to be a DP -> SPs starve
- currently 5x more DPs than SPs
- SPs are beginning to offer increasingly sophisticated services
- competitive market originally envisioned for SPs is emerging
122OAI-PMH Observation: Front-End Only
- No input/registry mechanism
- OAI-PMH is always a front-end for something else
- filesystem, Dienst, RDBMS, LDAP, etc.
- convenient for pre-existing DLs, but does not address new DLs
- e.g., "we want to do OAI"
- Bounds the scope of OAI
- tension between functionality and simplicity
123OAI-PMH Observation: No T&C
- No terms & conditions provisions
- assumes all metadata has uniform access rights
- how to restrict metadata to certain hosts?
- (see upcoming OAI-rights discussion)
- introducing T&C would increase the scope of application, but at the expense of simplicity
- how expensive do we want to make a "just-a-front-end" protocol?
124OAI-PMH Observation: No T&C
- Possible to use multiple repositories in a
DMZ-like configuration
OAI requests from trusted hosts
OAI requests from arbitrary hosts
Public OAI Server
Private OAI Server
Source database
could even use a separate copy of the database
125OAI-PMH Observation: No T&C
- Possible to use OAI-PMH in closed, restricted systems
all OAI requests originate from these 4 DLs
OAI 1
OAI 2
OAI 3
OAI 4
see Technical Report Interchange Project --- http://www.cs.odu.edu/~mln/pubs/tri.pdf
126OAI-PMH Observation: Monolithic
- A repository has no protocol-defined concept of other OAI repositories
- <friends> was added in 2.0
- backups, mirrors, etc. have to be resolved outside of the scope of OAI
- scope vs. complexity again
- fully connected graph of DLs harvesting from each other is unnecessary
- cf. web crawlers vs. gatherers in U of Colorado's Harvest System
- 3rd party harvesting interfaces raise more T&C and data coherency issues
127OAI-PMH Observation: Data Coherency
- In the interest of implementer simplicity, several issues are left for the service provider to interpret
- what is an update vs. addition?
- in the NACA repository, they are reported as the same and it's up to the harvesting system to figure it out
- deletions?
- it is currently optional for repositories to mark records as deleted or not
- still left to the harvester to interpret
- Liu, et al., JCDL 2003: Repository Synchronization in the OAI Framework
- http://www.cs.odu.edu/~mln/pubs/freshness-jcdl.pdf
128OAI-PMH Observation: Harvest Model
- Frequency of harvests
- all-at-once harvests?
- initial harvest
- resolving data coherency
- frequent incremental harvests?
- far more efficient for both service and data providers
- Webcrawling vs. digital library models
- webcrawlers: little to no a priori information about target
- DLs: frequent harvesting of a small number of known targets
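An incremental harvest differs from the initial all-at-once harvest only in that it sends a from argument. One possible bookkeeping scheme, sketched below, remembers the responseDate of the previous harvest and truncates it to day granularity, which every 2.0 repository must support; re-requesting the whole day may return a few records twice but should not miss any. The state dictionary and both function names are assumptions of this sketch.

```python
def next_request_arguments(state, metadata_prefix="oai_dc"):
    """Initial harvest: no 'from'. Later harvests: resume from the previous harvest day."""
    arguments = {"metadataPrefix": metadata_prefix}
    last = state.get("last_response_date")
    if last:
        arguments["from"] = last[:10]  # keep only YYYY-MM-DD
    return arguments


def record_harvest(state, response_date):
    """Remember the repository's own responseDate from a completed harvest."""
    state["last_response_date"] = response_date
    return state


state = {}
first = next_request_arguments(state)          # full harvest
record_harvest(state, "2002-02-08T08:55:46Z")
second = next_request_arguments(state)         # incremental harvest
```

Using the repository's responseDate rather than the harvester's clock avoids problems when the two machines disagree about the time.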
129DC?!
- Metadata
- Q: Which format should I use?
- A: any/all of them
- lowest common denominator: unqualified Dublin Core
- Again, little known about actual behavior
- will DC actually be useful? or too lossy?
- will communities create/adopt specific formats?
- will native (presumably richer) formats be harvested?
130XML Observations
- Service providers
- XML can be pretty picky: a large ListRecords result can be invalidated with a single error
- harvest in chunks? individual records?
- author contributed metadata particularly a problem (e.g. control characters from copy-n-paste)
- one advantage of resumptionToken is that it compartmentalizes bad data
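The compartmentalizing effect can be illustrated by parsing each chunk on its own, so that one pasted control character spoils a single chunk instead of the whole harvest. The sketch assumes the chunks have already been fetched as text; parse_chunks and the two toy documents are made up for the example.

```python
import xml.etree.ElementTree as ET


def parse_chunks(chunks):
    """Parse each response chunk separately; return (parsed_roots, failures)."""
    parsed, failures = [], []
    for index, text in enumerate(chunks):
        try:
            parsed.append(ET.fromstring(text))
        except ET.ParseError as exc:
            # remember which chunk was bad and why, then keep going
            failures.append((index, str(exc)))
    return parsed, failures


GOOD = "<ListRecords><record><title>ok</title></record></ListRecords>"
# a vertical-tab control character, as might arrive via copy-n-paste
BAD = "<ListRecords><record><title>pasted \x0b text</title></record></ListRecords>"

parsed, failures = parse_chunks([GOOD, BAD, GOOD])
```

With a single unchunked response, the same control character would have made all three chunks' worth of records unusable.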
131Why The OAI-PMH is NOT Important
- Users don't care
- OAI-PMH is middleware
- if done right, the uninterested user should never have to know
- Using OAI-PMH does not ensure a good SP
- OAI-PMH is (or is becoming) HTTP for DLs
- few people get excited about http now
- http & OAI-PMH are core technologies whose presence is now assumed