Title: Introduction to the Open Archives Initiative Protocol for Metadata Harvesting
1 Introduction to the Open Archives Initiative Protocol for Metadata Harvesting
- Timothy W. Cole (t-cole3_at_uiuc.edu), Mathematics Librarian
- William H. Mischo (w-mischo_at_uiuc.edu), Engineering Librarian
- Thomas G. Habing (thabing_at_uiuc.edu), Research Programmer
- Grainger Engineering Library Information Center
- University of Illinois at Urbana-Champaign
- Presented 27 May 2003
- in conjunction with JCDL 2003, Houston, TX
- http://dli.grainger.uiuc.edu/Publications/TWCole/JCDL-OAI
2 Today's Agenda (Part 1)
- Overview of OAI (Mischo)
- What it is, where it comes from, what it's used for
- Relation to HTTP, XML, Dublin Core, Z39.50
- Basic Concepts and Definitions (Cole)
- OAI verbs
- OAI transactions
- Protocol details and architecture options
- Illustrations
- Implementation Guidelines for Repositories (Cole)
- Tools and program layout options
- Metadata generation / mapping
- Optional protocol elements
- Error handling and deleted records
3 Today's Agenda (Part 2)
- Tools, testing, problems (Cole)
- XML and OAI validation tools
- Common problems
- Implementation Guidelines for Harvesters (Mischo)
- How to harvest
- Harvesting policies and strategies
- Harvester technologies
- Advanced topics (Cole)
- Communities
- OAI Static Repository
- OAI SOAP
- Where do you go from here?
4 OAI as a tool
- All about moving metadata around
- Designed to be a building block, usable by many different communities
- Can facilitate (in some cases enable) services and functions
- Assumes widely distributed content, but centralized indexing(!) services
- Build once, use for many applications
- Focus of OAI is interoperability
5 Metadata vs. Information Resources
- Resource refers to information objects or digital representations of information objects
- Metadata item is a collection of properties about a resource (e.g. title, author, etc.)
- Metadata record is a metadata item expressed in a specific syntax according to an XSD
- OAI focuses on metadata, with the implicit understanding that metadata contains useful links to the source information object(s)
6 OAI Antecedents
- Call to other E-Print archives (July 1999)
- Paul Ginsparg, Rick Luce, Herbert Van de Sompel
- "mobilize core group to work towards achieving a universal service for author self-archived scholarly literature."
- Santa Fe Mtgs. (Oct. 1999 and June 2000)
- OAI-PMH version history
- First Alpha Release, Sept. 2000
- 1.0 (Beta) Release, January 2001
- 1.1 (Beta 2) Release, July 2001
- 2.0 (Production) Release, June 2002
7 Original OAI Organization
- OAI Executive
- Carl Lagoze, Herbert Van de Sompel
- OAI Steering Committee
- Co-Chairs: Dan Greenstein, Cliff Lynch
- OAI Technical Committee
- Funded by NSF, DLF, and CNI
- Seeks to be user community driven
- Adopters (selective list)
- NSDL, NDLTD, Open Archives Forum (EU), JISC/DNER (UK)
- E-Prints.Org, DLXS, DSpace, ContentDM, ENCompass
8 OAI Protocol for Metadata Harvesting
- Harvesting approach to interoperability at metadata level
- Divides world into Metadata Providers and Service Providers
- Builds on HTTP, XML, Dublin Core
- http://www.openarchives.org/
9 Harvesting/Federation vs. Broadcast
- Competing approaches to interoperability
- Distributed/broadcast searching: search and discovery over remote services and data
- Harvesting: data/metadata is transferred from the remote source to the destination where the services are located (e.g. union catalogs)
- OAI designed to make it easy for providers
- Low barrier design
- OAI focuses on harvesting
10 Data and Service Providers
- Data Providers (Repositories) are entities that possess resources and metadata and are willing to share metadata with others via well-defined OAI protocols
- Service Providers (Harvesters) are entities that harvest metadata from Data Providers in order to supply higher-level services to users (e.g. search and discovery)
- OAI uses these terms for its client/server model (data = server, service = client)
11 Reliance on HTTP and XML
- OAI-PMH is a REpresentational State Transfer (REST) protocol (unlike RPC, SOAP)
- OAI requests and responses are sent via the HTTP protocol
- OAI requests are encoded as HTTP GET or POST operations
- OAI responses are valid XML documents
12 XML Namespaces and Schema
- Consistency and data quality are supported by using XML Schema Definitions (XSD) for all responses
- XML Namespaces are used where necessary to clearly define which parts of the responses are actual metadata and which support the Metadata Harvesting Protocol
13 OAI-PMH Use of Dublin Core
- DC is OAI's lowest common denominator
- OAI supports and encourages use of other, community-driven metadata schemas
- Typically, metadata provider stores metadata in the best schema as dictated by the material and resources
- Crosswalk (semantic mapping) to simpler schemas
- Semantic mapping at metadata delivery (rather than at time of search)
- As with Z39.50, can't search for what's not there
14 As Compared to Z39.50
                        Z39.50            OAI
Content (Objects)       Distributed       Distributed
World View              Bibliographic     Bibliographic
Object Presentation     Data provider     Data provider
Searching is            Distributed       Centralized
Search done by          Data provider     Service provider
Metadata searched is    Up to date        Stale
Semantic Mapping        When searching    Metadata delivery
15 What OAI Is Not
- Not search
- Not database
- Not metadata
- Not OAIS
16 What OAI is good for
- Where content is widely distributed, in different kinds of non-Z39.50 enabled locations
- Metadata provider more lightweight than Z39.50
- Metadata provider scales well; service provider scales according to search capability
- Metadata is sufficient for services desired
- Normalization, deduplication, augmentation desired
- Not mutually exclusive
- Portals can use both Z39.50 and OAI
17 The NSDL metadata repository
The metadata repository is a resource for service providers. It holds information about every collection and item known to the NSDL.
[Diagram labels: Users, Services, Metadata repository, Collections]
From "The NSDL Metadata Strategy", a presentation by William Y. Arms and Diane I. Hillman. Available: http://nsdl.comm.nsdlib.org/allprojects01/metastrategy.ppt
18 NSDL Metadata Strategy
- Support eight standard formats
- Collect all existing metadata in these formats
- Provide crosswalks to Dublin Core
- Expose records in the metadata repository for service providers to harvest
- Concentrate human effort on collection-level metadata
- Use automatic generation to augment item-level metadata
From "The NSDL Metadata Strategy", a presentation by William Y. Arms and Diane I. Hillman. Available: http://nsdl.comm.nsdlib.org/allprojects01/metastrategy.ppt
19 IMLS Digital Collections Content
- Build a registry of all National Leadership Grant collections with digital content.
- Assist and guide NLG projects in making item-level metadata sharable using OAI.
- Build a repository and search and discovery tools for integrated access to the content of NLG collections (unique metadata schema?).
- Research best practices for sharing metadata about diverse digital content and for supporting the interests of diverse user communities.
20 http://imlsdcc.grainger.uiuc.edu/
21 Open Language Archives Community
- Supports the OLAC Protocol for Metadata Harvesting, based on OAI
- Includes metadata extensions to DC
- Supports Qualified DC refinements and encodings and a unique OLAC attribute, code, to hold restricted element values
- Also supports OLAC Static Repository Gateway, based on OAI Static Repository (still alpha)
- Developing an OLAC Repository Editor for creating a metadata provider
22 Basic Concepts and Definitions
- OAI verbs
- OAI transactions
- Protocol Details
- Architecture Options
- Illustrations
23 How OAI Works
- OAI verbs
- Identify
- ListMetadataFormats
- ListSets
- ListIdentifiers
- ListRecords
- GetRecord
[Diagram: the Service Provider (harvester) sends an HTTP request (OAI verb) to the Metadata Provider (repository), which returns an HTTP response (valid XML)]
24 Identify
- Purpose
- Return general information about the archive and its policies (e.g., datestamp granularity)
- Parameters
- None
- Sample URL
- http://www.anarchive.org/cgi-bin/OAI?verb=Identify
25 ListSets
- Purpose
- Provide a listing of sets in which records may be organized (may be hierarchical, overlapping, or flat)
- Parameters
- None required; resumptionToken (X) when continuing an incomplete list
- Sample URL
- http://www.anarchive.org/cgi-bin/OAI?verb=ListSets
26 ListMetadataFormats
- Purpose
- List metadata formats supported by the archive as well as their schema locations and namespaces
- Parameters
- identifier: for a specific record (O)
- Sample URL
- http://www.anarchive.org/cgi-bin/OAI?verb=ListMetadataFormats
27 ListIdentifiers
- Purpose
- List headers for all items corresponding to the specified parameters
- Parameters
- from: start date (O)
- until: end date (O)
- set: set to harvest from (O)
- metadataPrefix: metadata format to list identifiers for (R)
- resumptionToken: flow control mechanism (X)
- Sample URL
- http://www.anarchive.org/cgi-bin/OAI?verb=ListIdentifiers&metadataPrefix=oai_dc
28 GetRecord
- Purpose
- Returns the metadata for a single item in the form of an OAI record
- Parameters
- identifier: unique id for item (R)
- metadataPrefix: metadata format for the record (R)
- Sample URL
- http://www.anarchive.org/cgi-bin/OAI?verb=GetRecord&identifier=oai:test:123&metadataPrefix=oai_dc
29 ListRecords
- Purpose
- Retrieves metadata records for multiple items
- Parameters
- from: start date (O)
- until: end date (O)
- set: set to harvest from (O)
- resumptionToken: flow control mechanism (X)
- metadataPrefix: metadata format (R)
- Sample URL
- http://www.anarchive.org/cgi-bin/OAI?verb=ListRecords&metadataPrefix=oai_dc&from=2001-01-01
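The six verbs and their parameters can be turned into request URLs mechanically. The following is a minimal sketch, not part of the original slides: the function name `build_request` and the `from_` keyword are illustrative assumptions, and the base URL is the hypothetical one used in the sample URLs. It relies only on `urllib.parse.urlencode` from the Python standard library, which also takes care of escaping characters such as the colons in an identifier.

```python
from urllib.parse import urlencode

# Hypothetical base URL, taken from the sample URLs on the verb slides.
BASE_URL = "http://www.anarchive.org/cgi-bin/OAI"

def build_request(base_url, verb, **params):
    """Return an HTTP GET URL for one OAI-PMH request."""
    args = {"verb": verb}
    for name, value in params.items():
        if value is not None:
            # "from" is a Python keyword, so callers pass "from_" instead.
            args[name.rstrip("_")] = value
    return base_url + "?" + urlencode(args)

print(build_request(BASE_URL, "Identify"))
print(build_request(BASE_URL, "ListRecords",
                    metadataPrefix="oai_dc", from_="2001-01-01"))
print(build_request(BASE_URL, "GetRecord",
                    identifier="oai:test:123", metadataPrefix="oai_dc"))
```

The last call shows the identifier being percent-encoded (`oai%3Atest%3A123`), which is the form a harvester should actually send.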
30 Protocol Details
- OAI transaction: an OAI request (HTTP) and the corresponding OAI response (XML)
- Optional: use resumptionToken and other flow control mechanisms to manage service load
- Item identifiers: persistence and uniqueness
- Item datestamps: date of last metadata change; supports selective harvesting
31 Examples of OAI Requests
- http://www.language-archives.org/cgi-bin/olaca3.pl?verb=Identify
- http://publications.uu.se/portal/OAI?verb=ListSets
- http://www.language-archives.org/cgi-bin/olaca3.pl?verb=ListMetadataFormats
- http://www.language-archives.org/cgi-bin/olaca3.pl?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2002-12-01
- http://www.language-archives.org/cgi-bin/olaca3.pl?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai%3Aacl.sr.language-archives.org%3AA00-1006
32 An OAI Response
<?xml version="1.0" encoding="UTF-8" ?>
<OAI-PMH xmlns="..." xmlns:xsi="..." xsi:schemaLocation="...">
  <responseDate>2002-05-01T19:20:30Z</responseDate>
  <request verb="GetRecord" identifier="oai:arXiv:hep-th/9901001" metadataPrefix="oai_dc">http://an.oa.org/OAI-script</request>
  <GetRecord>
    <record>
      ...
    </record>
  </GetRecord>
</OAI-PMH>
33 An OAI Record
<header>
  <identifier>oai:arXiv:cs/0112017</identifier>
  <datestamp>2002-02-28</datestamp>
  <setSpec>cs</setSpec>
</header>
<metadata>
  <oai_dc:dc xmlns...>
    <dc:title>Using Structural Metadata</dc:title>
  </oai_dc:dc>
</metadata>
<about>
  <provenance xmlns...>
    ...
  </provenance>
</about>
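A harvester has to pull the header fields and the Dublin Core elements back out of such a record. The sketch below is one way to do that with the standard library's `xml.etree.ElementTree`; the function name `parse_record` and the `SAMPLE` document are illustrative assumptions, while the three namespace URIs are the ones defined by OAI-PMH 2.0 and Dublin Core.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Illustrative record modelled on the slide; the <about> part is omitted.
SAMPLE = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:arXiv:cs/0112017</identifier>
    <datestamp>2002-02-28</datestamp>
    <setSpec>cs</setSpec>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Using Structural Metadata</dc:title>
    </oai_dc:dc>
  </metadata>
</record>"""

def parse_record(xml_text):
    """Extract header fields and DC titles from one OAI <record>."""
    record = ET.fromstring(xml_text)
    header = record.find("oai:header", NS)
    return {
        "identifier": header.findtext("oai:identifier", namespaces=NS),
        "datestamp": header.findtext("oai:datestamp", namespaces=NS),
        "sets": [s.text for s in header.findall("oai:setSpec", NS)],
        "titles": [t.text for t in record.findall(".//dc:title", NS)],
    }

print(parse_record(SAMPLE))
```

Matching on namespace URIs rather than on prefixes matters here, because a repository is free to choose different prefixes for the same namespaces.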
34 Unique Identifiers
- Each item must have a unique identifier
- Identifiers must follow rules for valid URIs
- Example
- oai:<archiveId>:<recordId>
- oai:etd.vt.edu:etd-1234567890
- Each identifier must resolve to a single item and always to the same item
- Can't reuse OAI item identifiers
35 Datestamps
- Needed for every OAI record to support incremental harvesting
- Must be updated whenever an addition, modification, or deletion is made, so that changes are correctly propagated to harvesters
- Different from dates within the metadata: the OAI datestamp is used only for harvesting
- Can be either YYYY-MM-DD or YYYY-MM-DDThh:mm:ssZ (must be in GMT/UTC)
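The two datestamp forms and the GMT/UTC requirement can be handled with the standard `datetime` module. This is a minimal sketch under the assumption that the repository keeps timezone-aware timestamps; the function names are illustrative.

```python
from datetime import datetime, timedelta, timezone

def format_datestamp(dt, granularity="day"):
    """Format a timezone-aware datetime as an OAI datestamp in UTC."""
    utc = dt.astimezone(timezone.utc)
    if granularity == "day":
        return utc.strftime("%Y-%m-%d")
    return utc.strftime("%Y-%m-%dT%H:%M:%SZ")

def parse_datestamp(text):
    """Parse either datestamp form into a UTC datetime."""
    fmt = "%Y-%m-%dT%H:%M:%SZ" if "T" in text else "%Y-%m-%d"
    return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)

# A change made late in the evening, six hours behind UTC,
# falls on the next day once converted.
local = datetime(2002, 5, 1, 23, 30, tzinfo=timezone(timedelta(hours=-6)))
print(format_datestamp(local, "second"))  # 2002-05-02T05:30:00Z
print(format_datestamp(local))            # 2002-05-02
```

Converting before formatting is the point of the example: stamping with local dates is what produces the up-to-a-day discrepancies that harvesters then have to compensate for.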
36 OAI Provider Architectures
[Diagram labels: Descriptive Metadata, OAI Administrative Metadata, OAI Harvesters]
37 Architecture Options
- Metadata items in database
- If individual metadata items are stored in a database, usually requires programmatic mapping to DC
- Metadata items as XML files
- If individual metadata items are already in XML, can do without the database component, or can use a database to cache and/or hold OAI administrative metadata
- May use XSLT stylesheets to extract / map metadata
- Metadata elements in HTML files
- As with XML file system options
- Static repository option (more later)
38 Technology Options
- WWW Server (e.g., Apache, MS IIS)
- Protocol may be implemented in many forms
- CGI Script (Perl, C, Java)
- Java Servlet
- PHP
- Metadata (e.g. database) access mechanism required
- See www.openarchives.org for list of publicly available software templates
- See www.SourceForge.Net for UIUC OAI tools
39 Illustrations
- Identify
- ListSets
- ListMetadataFormats
- ListIdentifiers
- GetRecord oai_dc
- GetRecord olac
- ListRecords
- Error
40 15 Minute Break
41 Implementation Guidelines for Repositories
- Tools Required
- Basic program layout (incl. object-oriented approaches)
- Optional container elements
- Metadata generation / mapping, data cleaning
- Sets
- resumptionToken, flow control, load-balancing
- Denial-of-service prevention
- Error handling
- Deleted metadata records
42 Typical Pre-Requisites
- Metadata and a Web server
- Code templates if available (available for many languages)
- Basic Web programming environment
- XML parsers (for non-trivial encoding)
- Database access libraries/drivers (e.g. ODBC, JDBC)
43 Basic program layout
- parse WWW request to extract parameters
- if (verb = Identify): validate arguments, ProcessIdentify
- else if (verb = ListMetadataFormats): validate arguments, ProcessListMetadataFormats
- else if (verb = ListSets): validate arguments, ProcessListSets
- else if (verb = GetRecord): validate arguments, ProcessGetRecord
- else if (verb = ListIdentifiers): validate arguments, ProcessListIdentifiers
- else if (verb = ListRecords): validate arguments, ProcessListRecords
- else ReportError (badVerb)
- Re-usable subroutines to extract / clean up / transform metadata, generate standard error messages, etc.
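The layout above can be written as a table-driven dispatcher instead of a chain of if/else branches. The sketch below is one possible arrangement, not the presenters' code: the `VERBS` table encodes the required (R), optional (O), and exclusive (X) arguments listed on the verb slides, and `handlers` is a hypothetical dictionary of per-verb processing functions supplied by the repository.

```python
# Arguments per verb: (required, optional). resumptionToken is exclusive:
# when present it must be the only argument besides verb.
VERBS = {
    "Identify": (set(), set()),
    "ListMetadataFormats": (set(), {"identifier"}),
    "ListSets": (set(), set()),
    "GetRecord": ({"identifier", "metadataPrefix"}, set()),
    "ListIdentifiers": ({"metadataPrefix"}, {"from", "until", "set"}),
    "ListRecords": ({"metadataPrefix"}, {"from", "until", "set"}),
}
LIST_VERBS = {"ListSets", "ListIdentifiers", "ListRecords"}

def validate(params):
    """Return an OAI error code, or None if the arguments look legal."""
    verb = params.get("verb")
    if verb not in VERBS:
        return "badVerb"
    args = set(params) - {"verb"}
    if "resumptionToken" in args:
        legal = verb in LIST_VERBS and args == {"resumptionToken"}
        return None if legal else "badArgument"
    required, optional = VERBS[verb]
    if not required <= args or not args <= required | optional:
        return "badArgument"
    return None

def handle(params, handlers):
    """Validate the request, then call the handler for its verb."""
    error = validate(params)
    if error:
        return ("error", error)
    return handlers[params["verb"]](params)

print(validate({"verb": "GetRecord", "identifier": "oai:test:123"}))
```

This only checks which arguments are present; checking their values (date formats, known metadata prefixes, existing identifiers) would still happen inside each handler.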
44 Object-Oriented Approaches
- Cleaner separation of protocol, database access and metadata generation
- Example approaches
- Each service request is handled by an object
- Simpler incremental development
- Protocol, Database and Metadata are objects
- Greater portability of code
- Inheritance from a basic OAI data provider
45 Provider Performance Issues
- Database design impacts performance
- Work required to map to DC
- Use of resumptionTokens is a way to improve performance
- Fetch only records needed to satisfy current request
- Queries only retrieve needed records
- resumptionTokens should retain state information for best performance and for idempotency
46 Optional Container Elements
- <Identify><description>
- Additional information about repository
- oai-identifier, eprints, friends, branding, other
- <ListSets><setDescription>
- Additional information describing a set
- <metadata>
- Other metadata besides Dublin Core
- rfc1807, marc21, oai_marc, mods, other
- <about>
- Meta-metadata, e.g. record-level rights
47 Metadata Generation / Mapping
- Approaches
- Map from source to each metadata format
- Use multiple crosswalks (may use XSLT) to transform to multiple metadata formats
source (e.g., DB)    dc         rfc1807
name                 title      title
author               creator    author
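The first approach, mapping from the source directly to each metadata format, amounts to one lookup table per format. The following sketch uses the name/title and author/creator/author correspondences from the slide; the source row, the `internal_id` column, and the function name `crosswalk` are made up for illustration.

```python
# One mapping per metadata format, keyed by source field name.
CROSSWALKS = {
    "oai_dc": {"name": "title", "author": "creator"},
    "rfc1807": {"name": "title", "author": "author"},
}

def crosswalk(source_row, metadata_prefix):
    """Rename source fields to the target format, dropping unmapped or empty ones."""
    mapping = CROSSWALKS[metadata_prefix]
    return {mapping[k]: v for k, v in source_row.items() if k in mapping and v}

# Hypothetical database row.
row = {"name": "Using Structural Metadata", "author": "A. Author", "internal_id": 7}
print(crosswalk(row, "oai_dc"))
print(crosswalk(row, "rfc1807"))
```

Fields without a mapping are simply left out, in line with the advice to omit what cannot be filled in correctly.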
48 Data Cleaning
- Escape special XML characters (<, >, &, ")
- Convert to UTF-8 version of Unicode
- Convert entity references (e.g., &copy;)
- Remove extraneous whitespace
- URLs
- Special characters such as / and ? must be encoded as escape sequences
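These cleaning steps map onto a few standard-library calls. The sketch below is illustrative (the function names are assumptions): it converts HTML-style entity references to Unicode characters, collapses whitespace, escapes the characters that are special in XML, and percent-encodes a value for use in a request URL. Encoding to UTF-8 would happen when the response is finally written out, for example with `text.encode("utf-8")`.

```python
import html
import re
from urllib.parse import quote
from xml.sax.saxutils import escape

def clean_text(value):
    """Prepare a source string for inclusion in an XML response."""
    text = html.unescape(value)               # entity references -> characters
    text = re.sub(r"\s+", " ", text).strip()  # remove extraneous whitespace
    return escape(text, {'"': "&quot;"})      # escape &, <, > and "

def encode_argument(value):
    """Percent-encode a value so it is safe inside a request URL."""
    return quote(value, safe="")

print(clean_text("  AT&amp;T &copy;  2003 <draft> "))
print(encode_argument("oai:arXiv:hep-th/9901001"))
```

Unescaping first and escaping last avoids double-escaping text that already contains entity references.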
49 Sets: another option for selective harvesting
- Optional; no well-defined semantics; depends completely on local data providers
- Must provide setSpec and setName, may provide setDescription, for each set in repository
- Sets may be hierarchical (use :) and may overlap
- Allows for harvesting of sub-collections
- May be pre-defined by arrangement between data providers and service providers
- E.g. subject areas, years, author names (but must be pre-defined for ListSets)
- Not a substitute for searching!
50 resumptionToken, flow control, load-balancing
- Incomplete response: a resumptionToken can be used to return partial results; the client is issued a token which may be presented to the server to receive more results
- resumptionToken embeds state information, allowing OAI to be stateless even for the incomplete response model
- HTTP 503 Retry-After mechanism can be used to support server-side delaying of a client's request
- HTTP 302 / 303 can be used for load balancing
- HTTP 4xx can be used to deny a harvester
51 Typical options for resumptionTokens
- resumptionTokens may have completeListSize, cursor, and expirationDate attributes
- Combine from/until/metadataPrefix/set and a record number indicator with delimiters into a sequential token. For example:
- from!until!metadataPrefix!set!recordnumber
- 2000-01-01!2001-01-01!!All!100
- Use a session manager with automatic expiry. For example:
- vtetd14june10amsession12
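The delimiter-based token can be built and decoded with a pair of small functions. This is a sketch of the first option only, assuming that none of the field values contains the `!` delimiter (a repository whose setSpec values may contain `!` would need a different delimiter or an escaping rule); the function names are illustrative.

```python
DELIM = "!"

def make_token(from_, until, metadata_prefix, set_spec, next_record):
    """Pack the request state into from!until!metadataPrefix!set!recordnumber."""
    parts = [from_ or "", until or "", metadata_prefix or "",
             set_spec or "", str(next_record)]
    return DELIM.join(parts)

def parse_token(token):
    """Unpack a token; raise ValueError for anything malformed."""
    parts = token.split(DELIM)
    if len(parts) != 5 or not parts[4].isdigit():
        raise ValueError("badResumptionToken")
    from_, until, prefix, set_spec, cursor = parts
    return {
        "from": from_ or None,
        "until": until or None,
        "metadataPrefix": prefix or None,
        "set": set_spec or None,
        "cursor": int(cursor),
    }

print(make_token("2000-01-01", "2001-01-01", None, "All", 100))
print(parse_token("2000-01-01!2001-01-01!!All!100"))
```

Because the whole request state travels inside the token, the server needs no session storage, and re-sending the same token yields the same slice of results.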
52 Denial-of-Service Prevention
- Return only partial results and issue a resumption token for more
- Use 503 Retry-After HTTP errors to have clients try again after a specified back-off time
- Use access control lists to limit who may access the archive
- Invoke an explicit delay before sending back results
53 Error Handling
- All protocol errors are in XML format
- badVerb: illegal verb requested
- badArgument: illegal parameter values or combinations
- badResumptionToken, cannotDisseminateFormat, idDoesNotExist: parameters are in right format but are not legal under current conditions
- noRecordsMatch, noMetadataFormats, noSetHierarchy: empty response exception
54 Handling Metadata Record Deletions
- deletedRecord: no, transient, or persistent
- Archives may keep track of deleted records, by identifier and datestamp
- All protocol result sets can indicate deleted records (possible to delete a record, but not an item)
- Best practice: if deletions are being tracked, this information should be stored indefinitely so as to correctly propagate to service providers with varying harvesting schedules
55 Tools, Testing, Common Problems
- Validation and Testing Tools
- Repository Explorer (Virginia Tech)
- OAI Registry
- XML Schema Validator (e.g., XSV)
- Reap command-line harvester
- Common Problems
- Incomplete / inconsistent metadata
- New metadata format
- No unique identifiers!
- No datestamps!
- XML responses not validating
- Character encoding
- Doesn't conform to XML Schema Definition
56 http://oai.dlib.vt.edu/cgi-bin/Explorer/oai2.0/testoai
57 Repository Explorer: Parameter Testing
58 Repository Explorer: Formatted View of Data
59 Repository Explorer: Raw XML Views of Data
60 Repository Explorer: Automatic Test Suite
61 Repository Explorer: Error in XML
62 OAI Registry
63 OAI Registry
64 XSV Schema Validator
65 XSV Example
- Correct XML
- XSV Result
- Bad Character
- XSV Result
- Invalid Tag
- XSV Result
66 Incomplete Metadata
- Synthesize metadata fields based on a priori knowledge of the data
- Example: publisher and language may be hard-coded for many archives
- Omit fields that cannot be filled in correctly: better to have less information than incorrect information!
67 New metadata format
- Find the description, namespace and formal name of the standard
- Find an XML Schema description of the data format
- If none exists, write one (consult other OAI people for assistance)
- Create the mapping and test that it passes XML schema validation
68 No unique identifiers
- Create an independent identifier mapping
- Use row numbers for a database
- Use filenames for data in files
- Use encoded URL for Web pages
- Use a hash from other fields
- E.g. author + year + first word in title
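The hash option can be sketched as follows. The choice of SHA-1, the 16-character truncation, and the function name are assumptions made for illustration; any stable hash would serve, provided the same inputs always produce the same identifier.

```python
import hashlib

def synthesize_identifier(archive_id, author, year, title):
    """Derive a stable oai:<archiveId>:<recordId> from author, year and title."""
    words = title.split()
    first_word = words[0] if words else ""
    key = "|".join([author, str(year), first_word]).lower()
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()[:16]
    return f"oai:{archive_id}:{digest}"

print(synthesize_identifier("etd.vt.edu", "A. Author", 2003, "Using Structural Metadata"))
```

Two works by the same author in the same year whose titles begin with the same word would collide, so a real deployment would need a tie-breaker, and the fields used must never be edited afterwards or the identifier changes.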
69 No datestamps
- Ignore the datestamp parameters and stamp all records with the current date
- Incremental harvests not possible
- Create a date table with the startup date for all entries, then update dates as entries are added / changed
- Most important: any harvesting algorithm that is interoperably stable for an archive with real dates should be stable for an archive with synthesized dates
70 XML not validating
- Check namespaces and schema
- Use Repository Explorer in non-validating mode to check structure of XML, without looking at namespaces or schema
- Validate schema by itself if it is non-standard
- Look at XML produced by other repositories
- Watch out for character encoding issues
71 Implementation Guidelines for Harvesters
- How to Harvest
- Selective Harvesting Granularity
- Sets
- Error Recovery
- Flow Control / Load Balancing / Redirection
- Incomplete Lists
- Policies
- Tools
72 How To Harvest
- Identify to get basic information
- ListIdentifiers, followed by ListMetadataFormats for each record and then GetRecord for each id/metadata combination
- No. of short HTTP requests: 1 + n + (n x m), where n = no. of identifiers, m = no. of metadata formats
- ListRecords for each metadata format required
- No. of long HTTP requests: m, where m = no. of metadata formats
- Response compression is indicated by <compression> in the Identify response
73 Selective Harvesting: Datestamps
- Day or seconds granularity, declared in the Identify response <granularity>
- All repositories must support from and until params at the day granularity
- This provides for incremental or differential harvesting strategies
- Because records may change or be added during a harvest, there should be a two-day overlap for incremental harvests
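The overlap rule reduces to a small date calculation. The sketch below assumes the harvester recorded the date on which its previous harvest started; the function name and the `overlap_days` parameter are illustrative.

```python
from datetime import date, timedelta

def next_from_date(last_harvest_started, overlap_days=2):
    """Return the 'from' argument for the next incremental harvest."""
    return (last_harvest_started - timedelta(days=overlap_days)).isoformat()

# A harvest that started on 27 May 2003 is followed by one from 25 May 2003.
print(next_from_date(date(2003, 5, 27)))  # 2003-05-25
```

The overlap means some records are fetched twice, so the harvester should replace any earlier copy of a record rather than append a duplicate.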
74 Sets
- Sets provide another means of selective harvesting
- ListSets to determine which or even if sets are supported
- May ignore sets
- Colons (:) in the setSpec values indicate hierarchy.
75 Error Recovery
- Because of idempotency, harvesters can reissue the previous resumptionToken, if it hasn't expired
- Especially useful when harvesting very large repositories in cases of network errors or disconnects
- Some harvesters can take multiple days to harvest extremely large repositories
76 Flow Control / Load Balancing / Redirection
- Repositories are free to utilize the various HTTP status codes, which harvesters must be prepared to handle
- 503 Service Unavailable: should include a Retry-After header
- 403 Forbidden
- 302 Found: should include a Location header to redirect to a new URL
- Future harvesting requests should continue to use the original URL if the baseURL in the <request> element remains the same
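On the harvester side, these status codes translate into a small decision function. This sketch is an assumption-laden illustration: `headers` is taken to be a plain dictionary with canonical header names, a Retry-After value is honoured only when it is given in seconds, and the 60-second fallback and one-hour cap are arbitrary choices.

```python
def next_action(status, headers, max_wait=3600):
    """Decide what a harvester should do with an HTTP response."""
    if status == 200:
        return ("parse", None)
    if status == 503:
        retry = headers.get("Retry-After", "")
        # Retry-After may also be an HTTP date; fall back to 60 s in that case.
        seconds = int(retry) if retry.isdigit() else 60
        return ("wait", min(seconds, max_wait))
    if status in (302, 303):
        location = headers.get("Location")
        if location:
            return ("redirect", location)
        return ("abort", "redirect without Location")
    if 400 <= status < 500:
        return ("abort", f"HTTP {status}")
    return ("abort", f"unexpected HTTP {status}")

print(next_action(503, {"Retry-After": "120"}))
print(next_action(302, {"Location": "http://mirror.example.org/OAI"}))
print(next_action(403, {}))
```

Many HTTP client libraries follow redirects automatically, in which case only the 503 and 4xx branches remain the harvester's own responsibility.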
77 Incomplete Lists
- Harvesters can receive incomplete list responses to ListIdentifiers, ListRecords, and ListSets requests
- Indicated by <resumptionToken> in response
- Next list request is made using content of <resumptionToken> as value of argument
- http://an.oai.org/script?verb=ListIdentifiers&resumptionToken=2001-01-02%3A2001-01-03%3A0
- resumptionToken value must be correctly encoded for HTTP GET and POST
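Following resumption tokens until the list is complete is a short loop. The sketch below is a simplified illustration rather than a full harvester: it ignores OAI error responses, HTTP status handling, and deleted records, and the `fetch` parameter is an assumption introduced so that the loop can be exercised without a network connection.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI = "{http://www.openarchives.org/OAI/2.0/}"

def fetch(url):
    """Default fetcher: plain HTTP GET returning the response body."""
    with urlopen(url) as response:
        return response.read()

def harvest_identifiers(base_url, metadata_prefix, fetch=fetch):
    """Yield every identifier, following resumptionTokens until exhausted."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(base_url + "?" + urlencode(params)))
        for header in root.iter(OAI + "header"):
            yield header.findtext(OAI + "identifier")
        token = root.find(f".//{OAI}resumptionToken")
        if token is None or not (token.text or "").strip():
            return  # no token, or an empty one: the list is complete
        # The token is exclusive: send it with the verb and nothing else.
        params = {"verb": "ListIdentifiers",
                  "resumptionToken": token.text.strip()}
```

`urlencode` takes care of the encoding requirement, turning a token such as `2001-01-02:2001-01-03:0` into `2001-01-02%3A2001-01-03%3A0` in the request URL.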
78 Policies
- Use a schedule to harvest regularly
- Store date when last harvested (before you start)
- Use a two-day overlap (or one day if your archive uses proper UTC datestamps)
- New items may be added for the current day
- Timezones create up to a day of lag if you ignore them
- If the source uses correct UTC datestamps and second granularity then only 1 second of overlap is needed!
- Each time a record is encountered, erase previous instances
- Harvesters should supply HTTP User-Agent and From headers (practices for robots)
79 Technologies for Harvesters
- To validate or not to validate
- No, well-formed, strictly valid (checked against Schema)
- Choice of parser: MSXML, Apache Project
- Storing harvested metadata
- As XML
- Import to DB on the fly, or in batch afterwards
- Indexing tools
- DLXS, Encompass, Ex Libris, MySQL, SQL Server
80 Advanced Topics
- Communities
- SOAP version
- Envisioned for near future
- OAI Static Repository
- http://www.openarchives.org/OAI/2.0/guidelines-static-repository.htm
81 OAI Communities
- Shared Metadata Formats
- Shared semantics
- Layering over OAI
- Closed OAI networks
- OAI within the DL
82 Shared Metadata Formats
- Use metadata formats accepted within a community to convey more specific information
- Examples
- E-Print format (under development)
- ETD-MS for theses and dissertations
- VRA Core for multimedia
- IMS Metadata for educational material
83 Shared Semantics
- Develop a shared understanding for the meanings of fields
- Examples
- Developing controlled vocabularies for fields
- Using specific fields for external links (OAI recommends using identifier in DC for this)
- Choosing from among existing standards (like language names)
84 SOAP and OAI
- SOAP: Simple Object Access Protocol
- XML envelope for remote procedure calls
- Promoted by Microsoft, IBM, W3C
- OAI community members are exploring provision of OAI services using SOAP rather than HTTP
- May be more attractive to commercial and library software vendors
85 OAI Static Repository
86 Where to go from here?
- DO I REALLY WANT TO DO THIS?
- Do I have an accessible metadata source?
- Do I have a server to host the OAI script/program?
- Can I satisfy the requirements to be a data provider?
- Can I write the code or modify a template or hire a programmer to do either?
87 Links
- Open Archives Initiative
- http://www.openarchives.org
- OAI Metadata Harvesting Protocol
- http://www.openarchives.org/OAI/openarchivesprotocol.htm
- Virginia Tech DLRL OAI Projects
- http://www.dlib.vt.edu/projects/OAI/
- Repository Explorer
- http://purl.org/net/oai_explorer
- NDLTD
- http://www.ndltd.org
88 More Links
- ARC Cross-Archive Search Service
- http://arc.cs.odu.edu/
- XML Schema Validator
- http://www.w3.org/2001/03/webdata/xsv
- Dublin Core Metadata Initiative
- http://www.dublincore.org
- E-Prints DL-in-a-box
- http://www.eprints.org
- XML Tools at W3C
- http://www.w3.org/XML/software