Title: The OAI Protocol for Metadata Harvesting
1 The OAI Protocol for Metadata Harvesting
2 The joint impact of these and future
initiatives can be substantially higher when
interoperability between them [e-print archives]
can be established.
The Open Archives Initiative has been set up to
create a forum to discuss and solve matters of
interoperability between preprint solutions, as a
way to promote their global acceptance.
Paul Ginsparg, Rick Luce, Herbert Van de Sompel
3 Luce, Van de Sompel, Ginsparg
4 federated services
5 metadata harvesting via OAI-PMH
metadata
FTXT
6 federated services via OAI-PMH
metadata
7 Santa Fe convention
OAI-PMH v.1.0/1.1
OAI-PMH v.2.0
8 the OAI-PMH
service provider
data provider
9 Core concepts in OAI-PMH
- low-barrier interoperability
- data-provider / service-provider model
- metadata harvesting model
- OAI-PMH: HTTP based
- shared metadata format and parallel,
community-specific metadata formats
- Dublin Core
- Community specific, oai-rights
10 OAI-PMH data model
- item: entry point to all records pertaining to the
resource
- record: metadata pertaining to the resource
11 OAI-PMH harvesting tools
service provider
data provider
- Supporting protocol requests
- Identify
- ListMetadataFormats
- ListSets
12 supporting protocol requests
service provider
data provider
ListMetadataFormats
- ListMetadataFormats / Time / Request
- REPEAT
- Format prefix
- Format XML schema
- /REPEAT
13 Supporting: Identify
- Purpose
- Return general information about the archive and
its policies (e.g., datestamp granularity)
- Parameters
- None
- Sample URL
- http://memory.loc.gov/cgi-bin/oai2_0?verb=Identify
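A request such as the sample URL above is just a base URL plus a `verb` argument and any further arguments as query parameters. A minimal sketch of building such URLs follows; the helper name `build_request_url` is hypothetical, and only the Python standard library is used.

```python
from urllib.parse import urlencode

def build_request_url(base_url, verb, **arguments):
    """Return an OAI-PMH GET URL for the given verb and arguments (hypothetical helper)."""
    params = {"verb": verb}
    # Skip arguments that were not supplied so they do not show up as empty keys.
    params.update({k: v for k, v in arguments.items() if v is not None})
    # urlencode percent-escapes reserved characters such as ':' and '/'.
    return base_url + "?" + urlencode(params)

identify_url = build_request_url("http://memory.loc.gov/cgi-bin/oai2_0", "Identify")
print(identify_url)
```

The same helper would cover the other verbs, since they differ only in the arguments that follow `verb`.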
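14 Identify
Sample Identify response:
- repositoryName: Library of Congress 1
- baseURL: http://memory.loc.gov/cgi-bin/oai
- protocolVersion: 2.0
- adminEmail: r.e.gillian_at_larc.nasa.gov
- adminEmail: rgillian_at_visi.net
- earliestDatestamp: 1990-02-01T00:00:00Z
- deletedRecord: transient
- granularity: YYYY-MM-DDThh:mm:ssZ
- compression: deflate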
15 harvesting granularity
- harvesting granularity
- mandatory support of YYYY-MM-DD
- optional support of YYYY-MM-DDThh:mm:ssZ
- other granularities considered, but ultimately
rejected
- granularity of from and until must be the same
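The rule that from and until must share a granularity can be checked before a request is sent. The sketch below is one way to do that; the function names are hypothetical, and the patterns only check the shape of a datestamp, not whether it is a real calendar date.

```python
import re

# The two datestamp granularities named on the slide.
DAY = re.compile(r"\d{4}-\d{2}-\d{2}")
SECONDS = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z")

def granularity(datestamp):
    """Classify a datestamp by its granularity, or raise ValueError."""
    if DAY.fullmatch(datestamp):
        return "YYYY-MM-DD"
    if SECONDS.fullmatch(datestamp):
        return "YYYY-MM-DDThh:mm:ssZ"
    raise ValueError(f"unsupported datestamp: {datestamp!r}")

def same_granularity(from_date, until_date):
    """True when both harvesting bounds use the same granularity."""
    return granularity(from_date) == granularity(until_date)

print(same_granularity("2002-01-01", "2002-02-08"))
print(same_granularity("2002-01-01", "2002-02-08T08:55:46Z"))
```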
16 Supporting: ListMetadataFormats
- Purpose
- List metadata formats supported by the repository
as well as their schema locations and namespaces
- Parameters
- identifier: for a specific record (O, optional)
- Sample URL
- http://memory.loc.gov/cgi-bin/oai2_0?verb=ListMetadataFormats
17 Supporting: ListSets
- Purpose
- Provide a listing of sets in which records may be
organized (may be hierarchical, overlapping, or
flat)
- Parameters
- None
- Sample URL
- http://memory.loc.gov/cgi-bin/oai2_0?verb=ListSets
18 OAI-PMH harvesting tools
service provider
data provider
- Supporting protocol requests
- Identify
- ListMetadataFormats
- ListSets
- Harvesting protocol requests
- ListRecords
- ListIdentifiers
- GetRecord
19 OAI-PMH harvesting tools
service provider
data provider
Datestamp Identifier Set
Records
20 harvesting requests
service provider
data provider
ListRecords request: from=a, until=b, set=klm, metadataPrefix=dc
- ListRecords / Time / Request
- REPEAT
- Identifier
- Datestamp
- Metadata
- /REPEAT
21 Harvesting: GetRecord
- Purpose
- Returns the metadata (specific format) for a
single item in the form of a record
- Parameters
- identifier: unique id for item (R, required)
- metadataPrefix: identifier of metadata format
for the record (R)
- Sample URL
- http://memory.loc.gov/cgi-bin/oai2_0?verb=GetRecord&identifier=oai:lcoa1.loc.gov:loc.gdc/lhbcb.00835&metadataPrefix=oai_dc
22 Harvesting: ListRecords
- Purpose
- Retrieves the metadata (specific format) for
multiple items in the form of records
- Parameters
- from: start datestamp (O)
- until: end datestamp (O)
- set: set to harvest from (O)
- resumptionToken: flow control mechanism (X, exclusive)
- metadataPrefix: metadata format (R)
- Sample URL
- http://memory.loc.gov/cgi-bin/oai2_0?verb=ListRecords&metadataPrefix=oai_dc
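The (R), (O) and (X) markers translate into a small set of checks a harvester can run on its own ListRecords arguments before sending them. The sketch below is illustrative only; `check_list_records_args` is a hypothetical name, and the `verb` argument itself is assumed to be handled elsewhere.

```python
def check_list_records_args(args):
    """Return a list of problems with ListRecords arguments (empty if none)."""
    allowed = {"from", "until", "set", "resumptionToken", "metadataPrefix"}
    problems = [f"unknown argument: {name}" for name in sorted(set(args) - allowed)]
    if "resumptionToken" in args:
        # (X) exclusive: the token must be the only argument besides the verb.
        if len(args) > 1:
            problems.append("resumptionToken must be the only argument")
    elif "metadataPrefix" not in args:
        # (R) required whenever no resumptionToken is given.
        problems.append("metadataPrefix is required")
    return problems

print(check_list_records_args({"metadataPrefix": "oai_dc", "set": "klm"}))
print(check_list_records_args({"resumptionToken": "abc", "set": "klm"}))
```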
23 Harvesting: ListIdentifiers
- Purpose
- List headers for all items corresponding to the
specified parameters
- Parameters
- from: start datestamp (O)
- until: end datestamp (O)
- set: set to harvest from (O)
- metadataPrefix: metadata format to list
identifiers for (R)
- resumptionToken: flow control mechanism (X)
- Sample URL
- http://memory.loc.gov/cgi-bin/oai2_0?verb=ListIdentifiers&metadataPrefix=oai_dc
24 header
- header contains set membership of item
Sample header:
- identifier: oai:arXiv:cs/0112017
- datestamp: 2001-12-14
- setSpec: cs
- setSpec: math
..
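25 ListIdentifiers
- ListIdentifiers returns headers
Sample response:
- responseDate: 2002-02-08T08:55:46Z
- request: http://arXiv.org/oai2
- header: identifier oai:arXiv:hep-th/9801001, datestamp 1999-02-23, setSpec physic:hep
- header: identifier oai:arXiv:hep-th/9801002, datestamp 1999-03-20, setSpec physic:hep, setSpec physic:exp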
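A harvester typically turns each returned header into an identifier, a datestamp and a list of set memberships. The sketch below parses a small response with the standard library. The XML document is an assumption rebuilt from the values on the slide, not a captured response, and `parse_headers` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Assumed sample, rebuilt from the values shown on the slide.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request verb="ListIdentifiers" metadataPrefix="oai_dc">http://arXiv.org/oai2</request>
  <ListIdentifiers>
    <header>
      <identifier>oai:arXiv:hep-th/9801001</identifier>
      <datestamp>1999-02-23</datestamp>
      <setSpec>physic:hep</setSpec>
    </header>
    <header>
      <identifier>oai:arXiv:hep-th/9801002</identifier>
      <datestamp>1999-03-20</datestamp>
      <setSpec>physic:hep</setSpec>
      <setSpec>physic:exp</setSpec>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""

def parse_headers(xml_bytes):
    """Return one dict per header element found in the response."""
    root = ET.fromstring(xml_bytes)
    headers = []
    for header in root.iterfind(".//oai:header", OAI_NS):
        headers.append({
            "identifier": header.findtext("oai:identifier", namespaces=OAI_NS),
            "datestamp": header.findtext("oai:datestamp", namespaces=OAI_NS),
            # An item may belong to zero, one or several sets.
            "sets": [s.text for s in header.findall("oai:setSpec", OAI_NS)],
        })
    return headers

for entry in parse_headers(SAMPLE):
    print(entry["identifier"], entry["datestamp"], entry["sets"])
```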
26 OAI-PMH identifiers
- Not (necessarily) identifier of the resource
- Each item must have a globally unique identifier
- identifiers must follow rules for valid URIs
- Example
- oai
- oai:etd.vt.edu:etd-1234567890
- Each identifier must resolve to a single item and
always to the same item
- Can't reuse OAI item identifiers
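As a rough illustration of the identifier rules, the sketch below checks that a string at least has a URI scheme and splits an identifier of the `oai:` form used in the example. Both function names are hypothetical, the scheme test is far weaker than full URI validation, and the three-part layout is assumed from the example rather than required of every repository.

```python
from urllib.parse import urlparse

def looks_like_uri(identifier):
    """Weak check: a scheme is present and there is no whitespace."""
    has_scheme = bool(urlparse(identifier).scheme)
    return has_scheme and not any(ch.isspace() for ch in identifier)

def split_oai_identifier(identifier):
    """Split 'oai:<archive>:<local id>' into (archive, local id)."""
    # Unpacking raises ValueError when fewer than three parts are present.
    scheme, archive_id, local_id = identifier.split(":", 2)
    if scheme != "oai" or not archive_id or not local_id:
        raise ValueError(f"not an oai identifier: {identifier!r}")
    return archive_id, local_id

print(looks_like_uri("oai:etd.vt.edu:etd-1234567890"))
print(split_oai_identifier("oai:etd.vt.edu:etd-1234567890"))
```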
27 OAI-PMH datestamps
- Needed for every OAI record to support
incremental harvesting
- Must be updated when an addition, modification, or
deletion is made, in order to ensure changes are
correctly propagated to harvesters
- Also for dynamically generated metadata formats
- Different from dates within the metadata: the OAI
datestamp is used only for harvesting
- Can be either YYYY-MM-DD or YYYY-MM-DDThh:mm:ssZ
(must be GMT timezone)
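Because datestamps must be expressed in GMT, a repository that stores local times has to convert them before writing them out. A minimal sketch, assuming timezone-aware `datetime` values and a hypothetical helper name `to_oai_datestamp`:

```python
from datetime import datetime, timedelta, timezone

def to_oai_datestamp(moment, day_granularity=False):
    """Format an aware datetime as an OAI-PMH datestamp in UTC."""
    if moment.tzinfo is None:
        # A naive value cannot be converted safely; its offset is unknown.
        raise ValueError("naive datetime: timezone unknown")
    utc = moment.astimezone(timezone.utc)
    pattern = "%Y-%m-%d" if day_granularity else "%Y-%m-%dT%H:%M:%SZ"
    return utc.strftime(pattern)

# 09:55:46 at UTC+1 is 08:55:46 in GMT.
local = datetime(2002, 2, 8, 9, 55, 46, tzinfo=timezone(timedelta(hours=1)))
print(to_oai_datestamp(local))
print(to_oai_datestamp(local, day_granularity=True))
```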
28 OAI-PMH request format
- requests must be submitted using the GET or POST
methods of HTTP
- repositories must support both methods
29 OAI-PMH response format
- formatted as HTTP responses
- content type must be text/xml
- status codes (distinguished from OAI-PMH
errors), e.g. 302 (redirect), 503 (service not
available)
30 OAI-PMH response format
- response format: well-formed XML
- XML declaration (<?xml version="1.0" encoding="UTF-8" ?>)
- OAI-PMH root element
- three child elements
- responseDate (UTC datetime)
- request (request that generated this response)
- either a) error (in case of an error or exception
condition) or b) element with the name of the
OAI-PMH request
31 OAI-PMH response, no errors
Sample response:
- responseDate: 2002-02-08T08:55:46Z
- request: http://arXiv.org/oai2
- header: identifier oai:arXiv:cs/0112017, datestamp 2001-12-14, setSpec cs, setSpec math
..
32 OAI-PMH response, error
Sample response:
- responseDate: 2002-02-08T08:55:46Z
- request: http://arXiv.org/oai2
- error (code="badVerb"): ShowMe is not a valid OAI-PMH verb
33 OAI-PMH errors
- repositories must indicate OAI-PMH errors
- inclusion of one or more error elements
- defined errors
- badArgument
- badResumptionToken
- badVerb
- cannotDisseminateFormat
- idDoesNotExist
- noRecordsMatch
- noMetadataFormats
- noSetHierarchy
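Since OAI-PMH errors arrive inside an otherwise ordinary XML response, a harvester has to look for error elements itself rather than rely on the HTTP status. The sketch below collects them with the standard library. The sample document is an assumption rebuilt from the badVerb example, and `find_errors` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

# Assumed sample, rebuilt from the badVerb example on the slide.
ERROR_RESPONSE = b"""<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request>http://arXiv.org/oai2</request>
  <error code="badVerb">ShowMe is not a valid OAI-PMH verb</error>
</OAI-PMH>"""

def find_errors(xml_bytes):
    """Return (code, message) pairs for every error element in a response."""
    root = ET.fromstring(xml_bytes)
    # error elements are direct children of the OAI-PMH root element.
    return [(e.get("code"), (e.text or "").strip()) for e in root.findall(OAI + "error")]

print(find_errors(ERROR_RESPONSE))
```

An empty list means the response carries a verb element instead of errors and can be processed normally.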
34 Flow control
- flow control on two protocol levels
- HTTP (503, retry-after)
- OAI-PMH, resumptionToken
- HTTP retry-after mechanism can be used in order
to delay requests of clients
- resumptionTokens are used to return parts
(incomplete lists) of the result
- client receives a resumptionToken which can be
used to issue another request in order to
receive further parts of the result
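On the HTTP level, a client that receives a 503 is expected to wait for the period given in the Retry-After header before trying again. HTTP allows that header to carry either a number of seconds or an HTTP date, so a sketch of turning it into a delay might look as follows; `retry_delay_seconds` is a hypothetical name, and the `now` parameter exists only to make the example reproducible.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def retry_delay_seconds(retry_after, now=None):
    """Convert a Retry-After header value into a delay in seconds."""
    value = retry_after.strip()
    if value.isdigit():
        # Delta-seconds form, e.g. "120".
        return float(value)
    # HTTP-date form, e.g. "Fri, 08 Feb 2002 08:56:46 GMT".
    now = now or datetime.now(timezone.utc)
    delay = (parsedate_to_datetime(value) - now).total_seconds()
    return max(0.0, delay)

reference = datetime(2002, 2, 8, 8, 55, 46, tzinfo=timezone.utc)
print(retry_delay_seconds("120"))
print(retry_delay_seconds("Fri, 08 Feb 2002 08:56:46 GMT", now=reference))
```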
35 Flow control
- four of the request types return a list of
entries
- three of them may return large lists
- OAI-PMH supports partitioning of responses
- the decision on partitioning is the repository's
- response to a request includes
- incomplete list
- resumption token, with expiration date, size of
complete list, cursor (optional)
- new request with same request type
- resumption token as parameter
- all other parameters omitted!
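The resumption-token exchange described above amounts to a loop: issue the request, consume the incomplete list, and repeat with only the token until none comes back. The sketch below shows that loop for ListIdentifiers. The names `harvest_identifiers`, `page` and `fake_fetch` are hypothetical, and `fake_fetch` stands in for a real HTTP call so that the example runs without a network.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

def harvest_identifiers(fetch, arguments):
    """Yield every identifier, following resumption tokens until exhausted."""
    params = {"verb": "ListIdentifiers", **arguments}
    while True:
        root = ET.fromstring(fetch(params))
        for identifier in root.iter(OAI + "identifier"):
            yield identifier.text
        token = root.find(f"{OAI}ListIdentifiers/{OAI}resumptionToken")
        # A missing or empty token marks the end of the complete list.
        if token is None or not (token.text or "").strip():
            return
        # Follow-up request: same verb, the token, and nothing else.
        params = {"verb": "ListIdentifiers", "resumptionToken": token.text.strip()}

def page(identifiers, token):
    """Build a tiny ListIdentifiers response (test stand-in only)."""
    headers = "".join(f"<header><identifier>{i}</identifier></header>" for i in identifiers)
    return ('<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListIdentifiers>'
            f"{headers}<resumptionToken>{token}</resumptionToken>"
            "</ListIdentifiers></OAI-PMH>").encode("utf-8")

def fake_fetch(params):
    """Pretend repository that splits three identifiers over two responses."""
    if params.get("resumptionToken") == "page2":
        return page(["oai:x:3"], "")
    return page(["oai:x:1", "oai:x:2"], "page2")

print(list(harvest_identifiers(fake_fetch, {"metadataPrefix": "oai_dc"})))
```

Passing the fetch function in as a parameter keeps the loop independent of any particular HTTP library, which is a design choice of this sketch rather than something the protocol prescribes.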
36 Flow control
Service Provider
Data Provider
Harvester
Repository
37 record-level about container
- provenance container to facilitate tracing of
harvesting history
http://an.oa.org
oai:r1:plog/9801001
2001-08-13T13:00:02Z
oai_dc
2001-08-15T12:01:30Z
38 record-level about container
- rights container to express rights pertaining to
metadata
- W3C XML schema defines format for
package to be included in container
39 repository-level description container
- friends container to facilitate dynamic discovery
of repositories
http://cav2001.library.caltech.edu/perl/oai
http://formations2.ulst.ac.uk/perl/oai
http://cogprints.soton.ac.uk/perl/oai
http://wave.ldc.upenn.edu/OLAC/dp/aps.php4
40 more info
- Protocol document
- http://www.openarchives.org/OAI/openarchivesprotocol.html
- Validation tool
- http://re.cs.uct.ac.za/
- Repository and harvesting tools
- http://www.openarchives.org/tools/tools.html
- Registries of public OAI-PMH repositories
- http://re.cs.uct.ac.za/
- http://gita.grainger.uiuc.edu/registry/
- http://www.openarchives.org/Register/BrowseSites
41 The OAI-PMH protocol is a low-barrier
interoperability specification for the recurrent
exchange of metadata between systems
- Things become really cool when allowing
flexibility regarding the interpretation of metadata.
- Indeed, in OAI-PMH, metadata is XML-formatted
data pertaining to the resource