OAI Protocol for Metadata Harvesting

Slides: 17
Provided by: dliGrain

Transcript and Presenter's Notes


1
OAI Protocol for Metadata Harvesting: Its
Usefulness to STM Publishers
  • Timothy W. Cole (t-cole3_at_uiuc.edu), Mathematics
    Librarian & Professor of Library Administration
  • University of Illinois at Urbana-Champaign
  • 2005 Allen Press Emerging Trends Seminar, National
    Press Club, Washington, D.C.
  • 13 April 2005

2
OAI Protocol for Metadata Harvesting
  • Harvesting approach to interoperability at
    metadata level
  • Divides world into Metadata Providers & Service
    Providers
  • Builds on HTTP, XML, & community metadata
    standards
  • http://www.openarchives.org/

3
Metadata Harvesting Model
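[Diagram: Metadata & Content Repositories each hold Content and Metadata (e.g., XML or SQL) and expose the metadata through an OAI Provider. The OAI Harvester at the OAI Service Provider collects it into Aggregated Metadata. End-Users run Search against the OAI Service Provider and carry out Retrieval of content from the repositories.]
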
4
Metadata Harvesting Model (cont.)
  • OAI Service Provider (harvester) is middleman
    between content provider and end-user for
    selected metadata-based transactions, e.g.:
  • Resource discovery
  • Value-added link mediation
  • Transactions involving full content still
    conducted directly between end-users and content
    provider, e.g.:
  • Delivery of complete article in desired format
  • OAI-PMH is not synonymous with Open Access

5
How OAI-PMH Works
  • OAI VERBS
  • Identify
  • ListMetadataFormats
  • ListSets
  • ListIdentifiers
  • ListRecords
  • GetRecord
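
Each verb above is sent as the `verb` argument of an ordinary HTTP request. The following is a minimal sketch of how a harvester might assemble such request URLs. The function name `build_request_url` is hypothetical, and the base URL is the placeholder endpoint used elsewhere in these slides, not a live service.

```python
from urllib.parse import urlencode

# The six OAI-PMH verbs listed on the slide.
OAI_VERBS = (
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
)


def build_request_url(base_url: str, verb: str, **arguments: str) -> str:
    """Return an HTTP GET URL for one OAI-PMH request (hypothetical helper)."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    # urlencode percent-escapes reserved characters, e.g. ':' in identifiers.
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


if __name__ == "__main__":
    base = "http://an.oai.org/OAI-script"  # placeholder endpoint, an assumption
    print(build_request_url(base, "Identify"))
    print(build_request_url(base, "GetRecord",
                            identifier="oai:an.oai.org:123",
                            metadataPrefix="oai_dc"))
```

Because `from` is a reserved word in Python, a selective-harvesting argument of that name would have to be passed as `**{"from": "2005-01-01"}`.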

6
Protocol Details
  • OAI Transaction: OAI request (HTTP) &
    corresponding OAI response (XML)
  • Transactions initiated by harvester
  • Optional flow control mechanisms to manage
    provider load
  • OAI Item Identifiers: persistent & unique
  • Item (Metadata) Date Stamps support selective
    harvesting
  • OAI supports multiple metadata formats
  • Distinguishes between an ITEM (complete metadata)
    & a RECORD (disseminated item of metadata in
    given format)
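
The harvester-initiated transactions, date-stamp-based selective harvesting, and flow control described above can be sketched as a simple loop. This is an illustration under stated assumptions, not a complete harvester: `harvest_identifiers` and `make_http_fetch` are hypothetical names, the `fetch` callable is injected so the loop can be tried without a network, and error responses are not handled.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Iterator, Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Namespace of OAI-PMH 2.0 protocol elements, in ElementTree's {uri} notation.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def make_http_fetch(base_url: str) -> Callable[[Dict[str, str]], str]:
    """Build a fetch function that sends each request as an HTTP GET."""
    def fetch(params: Dict[str, str]) -> str:
        with urlopen(base_url + "?" + urlencode(params)) as response:
            return response.read().decode("utf-8")
    return fetch


def harvest_identifiers(
    fetch: Callable[[Dict[str, str]], str],
    metadata_prefix: str,
    from_date: Optional[str] = None,
    until_date: Optional[str] = None,
) -> Iterator[str]:
    """Yield item identifiers, following resumptionTokens until the list ends."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    # Selective harvesting: restrict the request by item date stamp.
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    while True:
        root = ET.fromstring(fetch(params))
        for header in root.iter(OAI_NS + "header"):
            yield header.findtext(OAI_NS + "identifier")
        token = root.find(".//" + OAI_NS + "resumptionToken")
        if token is None or not (token.text or "").strip():
            break  # no token, or an empty one: the list is complete
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListIdentifiers",
                  "resumptionToken": token.text.strip()}
```

A caller would pass `make_http_fetch(base_url)` for a real repository; a production harvester would also pause between requests when the provider asks it to.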

7
Reliance on HTTP & XML
  • OAI-PMH is a REpresentational State Transfer
    (REST) protocol (unlike RPC, SOAP)
  • OAI requests and responses are sent via the HTTP
    protocol
  • OAI requests encoded as HTTP GET or POST
    operations
  • OAI responses are valid XML documents
  • Consistency and data quality are supported by
    using XML Schema Definitions (XSD) for all
    responses
  • XML Namespaces used to identify which parts of
    response are metadata and which parts support the
    protocol
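
As a rough illustration of the namespace point, the sketch below sorts the elements of a response by namespace: anything in the OAI-PMH 2.0 namespace is treated as protocol, everything else as metadata. The sample document is made up for this example and `classify_elements` is a hypothetical helper; the namespace URIs are the standard OAI-PMH 2.0, oai_dc, and Dublin Core ones.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

OAI_PMH_NS = "http://www.openarchives.org/OAI/2.0/"

# Made-up response fragment used only for this illustration.
SAMPLE = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord>
    <record>
      <header><identifier>oai:an.oai.org:123</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Using Structural Metadata</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
"""


def split_tag(tag: str) -> Tuple[str, str]:
    """Split ElementTree's '{uri}local' tag into (uri, local)."""
    if tag.startswith("{"):
        uri, _, local = tag[1:].partition("}")
        return uri, local
    return "", tag


def classify_elements(xml_text: str) -> Dict[str, List[str]]:
    """Group element names into protocol vs. metadata by namespace."""
    groups: Dict[str, List[str]] = {"protocol": [], "metadata": []}
    for element in ET.fromstring(xml_text).iter():
        uri, local = split_tag(element.tag)
        kind = "protocol" if uri == OAI_PMH_NS else "metadata"
        groups[kind].append(local)
    return groups


if __name__ == "__main__":
    for kind, names in classify_elements(SAMPLE).items():
        print(kind, names)
```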

8
Illustration of an OAI Transaction
  • Request: http://an.oai.org/OAI-script?verb=GetRecord&identifier=oai:an.oai.org:123&metadataPrefix=oai_dc
  • Response:
    <?xml version="1.0" encoding="UTF-8" ?>
    <OAI-PMH xmlns="..." xmlns:xsi="..." xsi:schemaLocation="...">
      <responseDate>2004-05-01T19:20:30Z</responseDate>
      <request verb="GetRecord" identifier="oai:an.oai.org:123" metadataPrefix="oai_dc">http://an.oai.org/OAI-script</request>
      <GetRecord>
        <record>
          ...
        </record>
      </GetRecord>
    </OAI-PMH>

9
Illustration of an OAI Record
    <record>
      <header>
        <identifier>oai:an.oai.org:123</identifier>
        <datestamp>2002-02-28</datestamp>
        <setSpec>cs</setSpec>
      </header>
      <metadata>
        <oai_dc:dc xmlns="..." xmlns:xsi="..." xsi:schemaLocation="...">
          <dc:title>Using Structural Metadata</dc:title>
        </oai_dc:dc>
      </metadata>
      <about>
        <provenance xmlns="..." xmlns:xsi="..." xsi:schemaLocation="...">
          ...
        </provenance>
      </about>
    </record>
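
To show how a harvester might read such a record, here is a minimal parsing sketch. The sample XML repeats the slide's values but fills in the namespace declarations that the slide elides, using the standard OAI-PMH 2.0, oai_dc, and Dublin Core URIs; that filling-in is an assumption made so the example parses. `summarize_record` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict

# Prefix map for ElementTree lookups (prefixes are local to this sketch).
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Slide values, with namespace declarations supplied for illustration.
SAMPLE_RECORD = """\
<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:an.oai.org:123</identifier>
    <datestamp>2002-02-28</datestamp>
    <setSpec>cs</setSpec>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Using Structural Metadata</dc:title>
    </oai_dc:dc>
  </metadata>
</record>
"""


def summarize_record(record: ET.Element) -> Dict[str, Any]:
    """Pull the header fields and any Dublin Core titles out of one record."""
    header = record.find("oai:header", NS)
    return {
        "identifier": header.findtext("oai:identifier", namespaces=NS),
        "datestamp": header.findtext("oai:datestamp", namespaces=NS),
        "sets": [s.text for s in header.findall("oai:setSpec", NS)],
        # OAI-PMH 2.0 marks deleted records with status="deleted" on the header.
        "deleted": header.get("status") == "deleted",
        "titles": [t.text for t in record.findall(".//dc:title", NS)],
    }


if __name__ == "__main__":
    print(summarize_record(ET.fromstring(SAMPLE_RECORD)))
```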

10
What it takes to implement OAI
  • Dynamic Web server functionality (e.g., CGI)
  • Capacity to respond with XML
  • Descriptive metadata in a standard format
  • OAI persistent identifiers & date stamps may
    require changes to metadata creation workflow
  • Open source implementations available (starting
    points)
  • OAI-PMH included in turnkey publishing solutions
  • Public Knowledge Project (UBC)
  • Open Repository (BioMed Central), ...
  • Eprints.org, DSpace, Fedora, ARNO, CDSware, ...

11
Provider Performance Issues
  • Database design biggest impact on performance
  • e.g., load to dynamically map to DC, other
    formats
  • Webserver performance load can be kept quite low
  • Use resumptionTokens, other flow control
    mechanisms to improve performance
  • Fetch only records needed to satisfy current
    request
  • resumptionTokens should retain state information
    for best performance and for idempotency
  • Scale example: OCLC repository with 4 million
    records
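
One way to read the resumptionToken advice is to let the token itself carry the paging state, so the provider fetches only one page per request and a repeated token returns the same page. The sketch below assumes that reading; the token layout, `PAGE_SIZE`, and the function names are hypothetical, and a real repository would also have to cope with records changing between requests.

```python
import base64
import json
from typing import Any, Dict, List, Optional, Tuple

PAGE_SIZE = 2  # hypothetical and deliberately tiny, to keep the example short


def encode_token(state: Dict[str, Any]) -> str:
    """Pack paging state into an opaque, URL-safe token string."""
    raw = json.dumps(state, sort_keys=True).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_token(token: str) -> Dict[str, Any]:
    """Recover the paging state from a token made by encode_token."""
    return json.loads(base64.urlsafe_b64decode(token.encode("ascii")))


def list_page(
    identifiers: List[str],
    metadata_prefix: str,
    token: Optional[str] = None,
) -> Tuple[List[str], Optional[str]]:
    """Return one page of identifiers and the token for the next page, if any."""
    if token:
        state = decode_token(token)
    else:
        state = {"prefix": metadata_prefix, "offset": 0}
    start = state["offset"]
    # Fetch only the records needed to satisfy the current request.
    page = identifiers[start:start + PAGE_SIZE]
    next_offset = start + len(page)
    if next_offset >= len(identifiers):
        return page, None  # final page: no further token
    return page, encode_token({"prefix": state["prefix"],
                               "offset": next_offset})


if __name__ == "__main__":
    ids = ["oai:a:1", "oai:a:2", "oai:a:3", "oai:a:4", "oai:a:5"]
    page, tok = list_page(ids, "oai_dc")
    while tok:
        print(page)
        page, tok = list_page(ids, "oai_dc", tok)
    print(page)
```

Because the state lives in the token rather than in a server-side session, resubmitting a token after a dropped connection yields the same page.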

12
OAI Implementation Guidelines for Repositories
  • Tools Required
  • Basic program strategies (incl. object-oriented
    approaches)
  • Guidance for use of
  • optional container elements
  • Metadata generation / mapping, data cleaning
  • Use of OAI Sets
  • resumptionToken, flow control, load-balancing
  • Denial-of-service prevention
  • Error handling
  • Strategies for deleted metadata records
  • http://www.openarchives.org/OAI/2.0/guidelines-repository.htm
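
For the error-handling item, a repository typically checks each request's arguments before doing any work. The sketch below returns the OAI-PMH 2.0 error codes `badVerb` and `badArgument` for malformed requests. The table reflects the protocol's required and optional arguments per verb, while `check_request` itself is a hypothetical helper that leaves out the other error conditions (unknown identifiers, unsupported formats, invalid tokens, and so on).

```python
from typing import Dict, Optional, Set, Tuple

# (required, optional) arguments for each verb in OAI-PMH 2.0.
VERB_ARGUMENTS: Dict[str, Tuple[Set[str], Set[str]]] = {
    "Identify": (set(), set()),
    "ListMetadataFormats": (set(), {"identifier"}),
    "ListSets": (set(), set()),
    "ListIdentifiers": ({"metadataPrefix"}, {"from", "until", "set"}),
    "ListRecords": ({"metadataPrefix"}, {"from", "until", "set"}),
    "GetRecord": ({"identifier", "metadataPrefix"}, set()),
}

# Verbs whose list responses may be continued with a resumptionToken.
TOKEN_VERBS = {"ListSets", "ListIdentifiers", "ListRecords"}


def check_request(params: Dict[str, str]) -> Optional[str]:
    """Return an error code for a malformed request, or None if well formed."""
    verb = params.get("verb")
    if verb not in VERB_ARGUMENTS:
        return "badVerb"
    given = set(params) - {"verb"}
    if "resumptionToken" in given:
        # resumptionToken is exclusive: it may not be mixed with other arguments.
        if verb in TOKEN_VERBS and given == {"resumptionToken"}:
            return None
        return "badArgument"
    required, optional = VERB_ARGUMENTS[verb]
    missing = not (required <= given)
    unexpected = not (given <= (required | optional))
    if missing or unexpected:
        return "badArgument"
    return None


if __name__ == "__main__":
    print(check_request({"verb": "GetRecord", "identifier": "oai:an.oai.org:123"}))
```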

13
Why OAI?
  • OAI is not synonymous with open access -- content
    provider maintains access control over full
    content
  • Implement once, provide metadata to multiple
    services
  • Less performance impact than robotic Web
    harvesting
  • Simpler than Z39.50
  • Puts your metadata in additional portals
  • But, less control over:
  • How your metadata is presented to end-user
  • What your metadata is put next to by service
    providers
  • How valuable a commodity is your metadata?

14
Who's Using OAI to Expose Metadata
  • OAI Data Provider Registry (http://oai.grainger.uiuc.edu/registry)
  • As of 1 March 2005: 607 active OAI metadata
    provider repositories
  • Range in size from millions of items to fewer
    than 100 items
  • More than half are institutional repositories or
    eprint archives
  • Handful of publishers / publisher-aggregators,
    e.g.:
  • PubMed Central; BioOne; BioMed Central (partial);
    Project Euclid; Africa Journals Online; Institute
    of Physics (user id & password); American
    Physical Society (restricted access); ...
  • Individual journals, e.g.:
  • J. of STEM Education; Electronic J. of
    Probability; J. of Cognitive Affective Learning;
    Canadian J. of Communication; ...

15
Who's Harvesting Metadata Using OAI-PMH
  • Portals encouraging Open Access, e.g.:
  • OAIster; Public Knowledge Project; Citebase;
    Cyclades; ...
  • NSDL (STEM Education); NCSTRL (computer
    science); SAIL (physical science e-prints); ...
  • Local harvesting projects
  • As way to share data internally
  • As a collation service to their users, e.g.,
    Grainger Search Service
  • OAI harvesting supported by some Library
    meta-search utilities
  • Web search engines that use OAI as one input
    stream
  • Yahoo! ingests from OAIster; Google looking to
    harvest DSpace sites; Scirus includes OAI
    metadata; ...
  • mod_OAI (Apache Web servers) as an alternative to
    Web robotic harvesting?

16
Indirect Benefits from OAI-PMH
  • From Bibusages study (French National Library):
  • Digital Libraries are used in conjunction with
    Web search engines, generalist portals,
    commercial sites
  • Mix of intensive & casual users
  • DL users seeking answer for specific information
    need; most time spent discovering, viewing,
    downloading documents
  • "Digital Libraries are now attracting a new
    type of public, bringing about new, unique and
    original ways for reading and understanding
    texts." Houssem Assadi, et al., Users and Uses of
    Online Digital Libraries in France, ECDL 2003

17
Evolution of Scholarly Communication
  • Ubiquitous nature of electronic pre-prints &
    post-prints
  • Extensive linking to supporting content on the
    Web
  • Mixing of author-paid publication with
    traditional subscription-based business models
    (e.g., AIP, Springer trials)
  • Citation frequency up for articles also available
    in arXiv
  • Demographic and citation trends in Astrophysical
    Journal papers and preprints / Greg J. Schwarz
    and Robert C. Kennicutt, Jr. BAAS 36:1654-1663,
    2004; also http://arxiv.org/abs/astro-ph/0411275
  • Some publishers encouraging self-archiving of
    pre-prints
  • IMS; APS; AIP; ... see http://www.sherpa.ac.uk/romeo.php
  • OAI-PMH underpins these kinds of self-archiving
    services

18
A Librarian's Perspective
  • The information landscape can be seen as a
    contour map in which there are mountains,
    hillocks, valleys, plains and plateaus. A
    specialized collection of particular importance
    is like a sharp peak. Upon a plateau there might
    be undulations representing strengths and
    weaknesses. The landscape is, however,
    multidimensional. Where one scholar may see a
    peak another may see a trough. The task is to
    devise mapping conventions which enable scholars
    to read the map of the landscape fruitfully, at
    the appropriate level of generality or
    specificity.
  • Michael Heaney (2000), An Analytical Model of
    Collections and their Catalogues.