Provided by: lilliann

Title: Metadata Harvesting


1
Metadata Harvesting
  • Interoperable digital collections

2
Distributed libraries
  • The reality in most digital libraries is that no
    one location has all the materials that may be of
    interest.
  • It is often more efficient to allow a number of
    sites each to retain some of the materials.
  • How can we assure clients that they will see all
    relevant resources, regardless of which library
    they search?

3
Two basic approaches
  • One service provider with access to resources
    stored in multiple locations
      • Information about all the resources is held at
        the service provider.
      • Services (DL scenarios) use the information to
        provide connections to resources at multiple
        locations
  • Distributed services
      • Information is kept with the resources
      • Services, local to each collection, interact with
        other collection sites

4
Two protocols
  • Z39.50
      • Developed before the web
      • Protocol for communicating with collection
        holders in order to provide services.
  • Open Archives Initiative
      • Recent innovation
      • Central service provider gathers information from
        collection holders

5
Z39.50 - briefly
  • Information Retrieval Service Definition and
    Protocol Specifications for Library Applications
  • Initially developed over the OSI network
    standards
  • Protocol for information exchange
  • Free the information seeker from the need to know
    the details of the target database configuration
  • Each site provides services
  • Each service queries remote sites for needed
    information
  • Information requests mapped to database queries
    at the collection site.
  • Some inconsistency in the interpretation of
    queries.

6
Distributed Resources, Multiple Services
Approach 1: One service provider gathers
information about data and uses it to provide
services.
[Diagram: five data providers and one service provider (search, browse, compare, etc.)]
7
Distributed data and services
Approach 2: Each system is both a data repository
and a service provider. Services query other
data providers as needed.
[Diagram labels: "Search, browse"; "Search, browse, compare"]
8

Hybrid systems
Each server is likely to have its own clients.
The difference is whether the information exchange is
periodic or ad hoc.
[Diagram: five data providers and one service provider (search, browse, compare, etc.)]
9
Open Archives Initiative (OAI)
  • Web-based
  • Uses HTTP to communicate between sites
  • Centralized server
  • Services provided from a site that has already
    gathered the information it needs for those
    services from a distributed collection of sites.

10
Z39.50
  • Special purpose protocol (machine to machine, not
    web interface)
  • Gathers information when it is requested, not on
    a scheduled basis.

11
OAI Compared to Z39.50
                        Z39.50            OAI
  Content (objects)     Distributed       Distributed
  World view            Bibliographic     Bibliographic
  Object presentation   Data provider     Data provider
  Searching is          Distributed       Centralized
  Search done by        Data provider     Service provider
  Metadata searched is  Up to date        Stale
  Semantic mapping      When searching    Metadata delivery
Source: oai.grainger.uiuc.edu/FinalReport/JCDL_2003_OAI_Intro.ppt
12
Open Archives Initiative Protocol for Metadata
Harvesting -- OAI-PMH
OAI-PMH defines an interface between the
Harvester and any number of Repositories.
[Diagram: the Harvester (service provider) sends an HTTP request carrying an OAI verb to the Repository (metadata provider); the Repository answers with an HTTP response in XML. The OAI interface on each side is implemented as CGI, ASP, PHP, or other.]
Any system may serve as a harvester, repository,
or both.
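The request side of this interface can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name build_request_url is hypothetical, and the base URL reuses the archive.org/oai-script placeholder that appears in the later examples.

```python
from urllib.parse import urlencode

# The six OAI-PMH verbs covered in the slides that follow.
OAI_VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListIdentifiers", "ListRecords", "GetRecord",
}


def build_request_url(base_url, verb, **arguments):
    """Return the HTTP GET URL for one OAI-PMH request (hypothetical helper)."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    # The verb goes first; the other arguments follow in the order given.
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


url = build_request_url(
    "http://archive.org/oai-script",  # placeholder base URL, as in the slides
    "ListRecords",
    metadataPrefix="oai_dc",
    set="biology",
)
print(url)
# http://archive.org/oai-script?verb=ListRecords&metadataPrefix=oai_dc&set=biology
```

Because from is a reserved word in Python, that argument has to be passed as build_request_url(base, "ListRecords", **{"from": "2002-12-01"}).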
13
OAI components
Service Providers and Data Providers
Requests and Responses
http://www.oaforum.org/tutorial/english/page3.htm#section3
14
Records
  • Metadata of a resource.
  • Three parts
      • Header (required)
          • Identifier (required: 1 only)
          • Datestamp (required: 1 only)
          • setSpec elements (optional: 0, 1, or more)
          • Status attribute for deleted item
      • Metadata (required)
          • XML-encoded metadata with root tag, namespace
          • Repositories must support Dublin Core; other
            formats optional
      • About statement (optional)
          • Rights statements
          • Provenance statements
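
As an illustration of the three-part structure, the following Python sketch pulls the header fields out of one record using the standard library XML parser. The sample record is invented for this example (its title and set name are assumptions), and parse_record is a hypothetical helper, not part of any OAI toolkit.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# Invented sample record: header + Dublin Core metadata, no "about" part.
SAMPLE_RECORD = """
<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:etd.vt.edu:etd-1234567890</identifier>
    <datestamp>2002-12-14</datestamp>
    <setSpec>biology</setSpec>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>An invented title</dc:title>
    </oai_dc:dc>
  </metadata>
</record>
"""


def parse_record(xml_text):
    """Summarize the header, metadata, and about parts of one record."""
    record = ET.fromstring(xml_text)
    header = record.find(OAI_NS + "header")
    return {
        "identifier": header.findtext(OAI_NS + "identifier"),
        "datestamp": header.findtext(OAI_NS + "datestamp"),
        # setSpec may occur zero, one, or many times.
        "sets": [s.text for s in header.findall(OAI_NS + "setSpec")],
        # A deleted item is flagged by status="deleted" on the header.
        "deleted": header.get("status") == "deleted",
        "has_metadata": record.find(OAI_NS + "metadata") is not None,
        "has_about": record.find(OAI_NS + "about") is not None,
    }


info = parse_record(SAMPLE_RECORD)
print(info["identifier"], info["datestamp"], info["sets"])
```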

15
Identifiers
  • Globally unique identifier
  • Valid URI
  • Examples
      • oai:<archiveId>:<recordId>
      • oai:etd.vt.edu:etd-1234567890
  • Must resolve to one item
  • No duplicates
  • No reuse of previously used identifiers
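
A short Python sketch of splitting an identifier of the form shown in the examples. It assumes the oai:<archiveId>:<recordId> layout; the slides only require a valid URI, so a repository could in principle use another scheme. The names OaiIdentifier and parse_oai_identifier are hypothetical.

```python
from typing import NamedTuple


class OaiIdentifier(NamedTuple):
    archive_id: str
    record_id: str


def parse_oai_identifier(identifier):
    """Split 'oai:<archiveId>:<recordId>' into its two parts."""
    scheme, _, rest = identifier.partition(":")
    # Split on the first remaining colon only; the record part may contain more.
    archive_id, _, record_id = rest.partition(":")
    if scheme != "oai" or not archive_id or not record_id:
        raise ValueError(f"not an oai identifier: {identifier!r}")
    return OaiIdentifier(archive_id, record_id)


parsed = parse_oai_identifier("oai:etd.vt.edu:etd-1234567890")
print(parsed.archive_id, parsed.record_id)
```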

16
Datestamps
  • Date of last modification of a record
  • Used only for harvesting (meta metadata?)
  • Mandatory for each item in the repository
  • Two levels of granularity possible
      • YYYY-MM-DD
      • YYYY-MM-DDThh:mm:ssZ
      • T separates date and time; Z is the time zone
        designator -- times must be GMT (UTC)
  • Allows harvesting incrementally -- get only what
    is new since last visit
  • Accessed by the arguments from and until
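
The two granularities and the from argument can be sketched in Python as follows. This is an illustrative sketch only; the function names are hypothetical, and it assumes the harvester stores the time of its last visit as a timezone-aware datetime.

```python
from datetime import datetime, timezone

DAY_FORMAT = "%Y-%m-%d"               # YYYY-MM-DD
SECOND_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # YYYY-MM-DDThh:mm:ssZ


def parse_datestamp(text):
    """Parse a datestamp of either granularity into an aware UTC datetime."""
    fmt = SECOND_FORMAT if "T" in text else DAY_FORMAT
    return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)


def format_datestamp(moment, granularity="YYYY-MM-DD"):
    """Format an aware datetime in the repository's granularity, in UTC."""
    fmt = DAY_FORMAT if granularity == "YYYY-MM-DD" else SECOND_FORMAT
    return moment.astimezone(timezone.utc).strftime(fmt)


def incremental_arguments(last_harvest, granularity="YYYY-MM-DD"):
    """Arguments asking only for records changed since the last visit."""
    return {"from": format_datestamp(last_harvest, granularity)}


last_visit = datetime(2002, 12, 1, tzinfo=timezone.utc)
print(incremental_arguments(last_visit))
print(parse_datestamp("2006-10-17T01:37:44Z"))
```

A harvester would use the granularity advertised in the repository's Identify response when formatting from and until.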

17
The OAI-PMH verbs
  • Each requests a specific response from a data
    repository

18
Identify
  • Function: description of the archive
  • Example: http://www.language-archives.org/cgi-bin/olaca3.pl?verb=Identify
  • Parameters: none
  • Errors/exceptions
      • badArgument (the request should not carry any
        arguments)
  • Response format

    Element            Example                                Ordinality
    repositoryName     My Archive                             1
    baseURL            http://archive.org/oai                 1
    protocolVersion    2.0                                    1
    earliestDatestamp  1999-01-01                             1
    deletedRecord      no, transient, persistent              1
    granularity        YYYY-MM-DD, YYYY-MM-DDThh:mm:ssZ       1
    adminEmail         oai-admin@archive.org                  1 or more
    compression        deflate, compress                      0 or more
    description        oai-identifier, eprints, friends, ...  0 or more

  • Ordinality values: 1 or more (mandatory), 1 only
    (mandatory), 1 only (optional), 0 or more

19

Actual response from http://www.language-archives.org/cgi-bin/olaca3.pl?verb=Identify

<OAI-PMH xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2006-10-17T01:37:44Z</responseDate>
  <request verb="Identify">http://www.language-archives.org/cgi-bin/olaca3.pl</request>
  <Identify>
    <repositoryName>OLAC Aggregator</repositoryName>
    <baseURL>http://www.language-archives.org/cgi-bin/olaca3.pl</baseURL>
    <protocolVersion>2.0</protocolVersion>
    <adminEmail>mailto:haejoong@ldc.upenn.edu</adminEmail>
    <earliestDatestamp>2002-12-14</earliestDatestamp>
    <deletedRecord>no</deletedRecord>
    <granularity>YYYY-MM-DD</granularity>
    <!-- maybe later
    <compression>identity</compression>
    -->

Continued
20
    <description>
      <oai-identifier xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai-identifier http://www.openarchives.org/OAI/2.0/oai-identifier.xsd">
        <scheme>oai</scheme>
        <repositoryIdentifier>OLACA.language-archives.org</repositoryIdentifier>
        <delimiter>:</delimiter>
        <sampleIdentifier>oai:ethnologue.com:aaa</sampleIdentifier>
      </oai-identifier>
    </description>
Continued
21
    <description>
      <olac-archive type="institutional" xsi:schemaLocation="http://www.language-archives.org/OLAC/1.0/olac-archive http://www.language-archives.org/OLAC/1.0/olac-archive.xsd">
        <archiveURL>http://www.language-archives.org:8082/dp9/</archiveURL>
        <curator>Steven Bird Gary Simons</curator>
        <curatorTitle>Coordinators</curatorTitle>
        <curatorEmail>mailto:olac-admin@language-archives.org</curatorEmail>
        <institution>Open Language Archives Community</institution>
        <institutionURL>http://www.language-archives.org/</institutionURL>
        <shortLocation>Philadelphia, U.S.A.</shortLocation>
        <location/>
        <synopsis>This repository contains all records from OLAC-registered archives. It is intended to be used by services which do not want to harvest individual OLAC archives.</synopsis>
        <access>Metadata may be used only subject to the access permissions given by the individual archives.</access>
      </olac-archive>
    </description>
  </Identify>
</OAI-PMH>
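A harvester typically reads a few fields from this response before doing anything else. The Python sketch below parses a shortened copy of the response above with the standard library. Two assumptions are made: the default namespace declaration (xmlns) is added to the sample because the slide's browser view does not show it, and parse_identify is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Shortened copy of the Identify response; the xmlns attribute is assumed.
SAMPLE_IDENTIFY = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2006-10-17T01:37:44Z</responseDate>
  <request verb="Identify">http://www.language-archives.org/cgi-bin/olaca3.pl</request>
  <Identify>
    <repositoryName>OLAC Aggregator</repositoryName>
    <baseURL>http://www.language-archives.org/cgi-bin/olaca3.pl</baseURL>
    <protocolVersion>2.0</protocolVersion>
    <adminEmail>mailto:haejoong@ldc.upenn.edu</adminEmail>
    <earliestDatestamp>2002-12-14</earliestDatestamp>
    <deletedRecord>no</deletedRecord>
    <granularity>YYYY-MM-DD</granularity>
  </Identify>
</OAI-PMH>"""


def parse_identify(xml_text):
    """Collect the Identify elements into a dictionary."""
    identify = ET.fromstring(xml_text).find("oai:Identify", NS)
    if identify is None:
        raise ValueError("response contains no Identify element")

    def one(name):
        # Elements that occur exactly once.
        return identify.findtext(f"oai:{name}", namespaces=NS)

    def many(name):
        # Elements that may repeat (or be absent).
        return [e.text for e in identify.findall(f"oai:{name}", NS)]

    return {
        "repositoryName": one("repositoryName"),
        "baseURL": one("baseURL"),
        "protocolVersion": one("protocolVersion"),
        "earliestDatestamp": one("earliestDatestamp"),
        "deletedRecord": one("deletedRecord"),
        "granularity": one("granularity"),
        "adminEmail": many("adminEmail"),
        "compression": many("compression"),
    }


repository = parse_identify(SAMPLE_IDENTIFY)
print(repository["repositoryName"], repository["granularity"])
```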
22
ListMetadataFormats
  • Function: retrieve available metadata formats
    from archive
  • Example: archive.org/oai-script?verb=ListMetadataFormats&identifier=oai:HUBerlin.de:3000218
  • Parameters: identifier (optional)
  • Errors/exceptions
      • badArgument
      • idDoesNotExist
      • noMetadataFormats

23
Response to http://www.language-archives.org/cgi-bin/olaca3.pl?verb=ListMetadataFormats

<OAI-PMH xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2006-10-17T01:58:06Z</responseDate>
  <request verb="ListMetadataFormats">http://www.language-archives.org/cgi-bin/olaca3.pl</request>
  <ListMetadataFormats>
    <metadataFormat>
      <metadataPrefix>olac</metadataPrefix>
      <schema>http://www.language-archives.org/OLAC/1.0/olac.xsd</schema>
      <metadataNamespace>http://www.language-archives.org/OLAC/1.0/</metadataNamespace>
    </metadataFormat>
    <metadataFormat>
      <metadataPrefix>olac_display</metadataPrefix>
      <schema>http://www.language-archives.org/OLAC/1.0/olac.xsd</schema>
      <metadataNamespace>http://www.language-archives.org/OLAC/1.0/</metadataNamespace>
    </metadataFormat>
    <metadataFormat>
      <metadataPrefix>oai_dc</metadataPrefix>
      <schema>http://www.openarchives.org/OAI/2.0/oai_dc.xsd</schema>
      <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
    </metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>
24
ListSets
  • Function: retrieve set structure of a repository
  • Example: archive.org/oai-script?verb=ListSets
  • Parameters: resumptionToken (exclusive)
  • Errors/exceptions
      • badArgument
      • badResumptionToken
      • noSetHierarchy

Sets are optional and are used to divide a
repository into separate units that will be of
interest to different harvesters.
25
ListIdentifiers
  • Function: abbreviated form of ListRecords;
    retrieves only headers
  • Example: archive.org/oai-script?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2002-12-01
  • Parameters
      • from (optional)
      • until (optional)
      • metadataPrefix (required)
      • set (optional)
      • resumptionToken (exclusive)
  • Errors/exceptions
      • badArgument
      • badResumptionToken
      • cannotDisseminateFormat
      • noRecordsMatch
      • noSetHierarchy

26
ListRecords
  • Function: harvest records from a repository
  • Example: archive.org/oai-script?verb=ListRecords&metadataPrefix=oai_dc&set=biology
  • Parameters
      • from (optional)
      • until (optional)
      • metadataPrefix (required)
      • set (optional)
      • resumptionToken (exclusive)
  • Errors/exceptions
      • badArgument
      • badResumptionToken
      • cannotDisseminateFormat
      • noRecordsMatch
      • noSetHierarchy
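
The "exclusive" resumptionToken is what lets a harvester page through a long list: when a response carries a token, the next request sends only the verb and that token. The Python sketch below shows one way to write that loop. It is an illustration under stated assumptions: harvest_identifiers, page, and fake_fetch are hypothetical names, the canned responses and identifiers are invented, and error handling is left out.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def harvest_identifiers(base_url, fetch, **arguments):
    """Yield record identifiers from ListRecords, following resumptionTokens.

    fetch is any callable that takes a URL and returns the response as text.
    """
    query = {"verb": "ListRecords", **arguments}
    while True:
        root = ET.fromstring(fetch(f"{base_url}?{urlencode(query)}"))
        for header in root.iter(OAI + "header"):
            yield header.findtext(OAI + "identifier")
        token = root.findtext(f"{OAI}ListRecords/{OAI}resumptionToken")
        if not token:  # missing or empty token: the list is complete
            break
        # Exclusive argument: the token replaces all other arguments.
        query = {"verb": "ListRecords", "resumptionToken": token}


def page(identifiers, token):
    """Build a canned ListRecords response (invented test data)."""
    records = "".join(
        f"<record><header><identifier>{i}</identifier></header></record>"
        for i in identifiers
    )
    return (
        '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
        f"{records}<resumptionToken>{token}</resumptionToken>"
        "</ListRecords></OAI-PMH>"
    )


def fake_fetch(url):
    """Stand-in for an HTTP GET so the sketch runs without a network."""
    if "resumptionToken=page2" in url:
        return page(["oai:example.org:3"], "")
    return page(["oai:example.org:1", "oai:example.org:2"], "page2")


found = list(harvest_identifiers(
    "http://archive.org/oai-script", fake_fetch,
    metadataPrefix="oai_dc", set="biology",
))
print(found)
# ['oai:example.org:1', 'oai:example.org:2', 'oai:example.org:3']
```

Passing fetch in as a parameter keeps the paging logic separate from the transport; a real harvester could supply a function built on urllib.request.urlopen instead of fake_fetch.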

27
GetRecord
  • Function: retrieve an individual metadata record
    from a repository
  • Example: archive.org/oai-script?verb=GetRecord&identifier=oai:HUBerlin.de:3000218&metadataPrefix=oai_dc
  • Parameters
      • identifier (required)
      • metadataPrefix (required)
  • Errors/exceptions
      • badArgument
      • cannotDisseminateFormat
      • idDoesNotExist
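
The errors listed for each verb come back inside the XML response as error elements with a code attribute, so a harvester has to inspect the body rather than rely on the HTTP status alone. A minimal Python sketch of that check follows; OaiError and raise_for_oai_error are hypothetical names, and the sample response and its message text are invented.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

# Invented example of an error response to a GetRecord request.
SAMPLE_ERROR = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2006-10-17T01:37:44Z</responseDate>
  <request>http://archive.org/oai-script</request>
  <error code="idDoesNotExist">No matching identifier</error>
</OAI-PMH>"""


class OaiError(Exception):
    """Raised when a response carries an OAI-PMH error element."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code


def raise_for_oai_error(xml_text):
    """Return the parsed response, or raise OaiError if it reports an error."""
    root = ET.fromstring(xml_text)
    error = root.find(OAI + "error")
    if error is not None:
        raise OaiError(error.get("code"), (error.text or "").strip())
    return root


try:
    raise_for_oai_error(SAMPLE_ERROR)
    outcome = "ok"
except OaiError as exc:
    outcome = exc.code
print(outcome)
```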

30
Interoperability
  • The goal: communication, without human
    intervention, between information sources
  • Books that talk to each other
  • Live links for references
  • Knowledge of how to find relevant resources when
    needed
  • Ability to query other information locations

31
Protocols
  • Precise rules for interactions between
    independent processes
  • Format of the messages
  • Both structure and content
  • Specified behavior in response to specific
    messages
  • Many ways to accomplish the same result, but both
    sides must have the same understanding of the
    rules of engagement.

32
Protocol Types
  • RPC model
  • Point to point
  • Completely open to definition by developer
  • Verbs (methods)
  • Nouns (objects, resources)
  • Useful to closed community or group who know
    about the availability of the resource.

33
SOAP
  • Originally an acronym for Simple Object Access
    Protocol; the expansion has since been dropped.
  • Initially developed as part of the Microsoft .NET
    paradigm
  • Now in W3C committee
  • Stateless, one-way message exchange paradigm
  • XML encoded
  • Flexibility of RPC, but more constrained in the
    way communication is formatted.

34
REST
  • REpresentational State Transfer
  • An after-the-fact definition of the architecture
    of the World Wide Web
  • The model is
  • Client/server
  • Stateless
  • Cacheable
  • Layered
  • Resource interface constrained
  • Restricted verbs
  • Restricted content types

35
REST and RPC
  • RPC provides flexibility for any type of
    interaction between any type of resources
  • REST provides consistency to allow interaction
    among resources without prior discovery of
    accepted actions and responses.

36
SOAP and REST
  • Debate in the Web community about which is the
    better paradigm for application development
  • REST -- restricted, but simple extension of
    existing Web processes
  • SOAP -- added flexibility with cost in terms of
    bandwidth, security, complexity for development

37
References
  • Giving SOAP a REST: http://www.devx.com/DevX/Article/8155
  • SOAP Version 1.2 Part 0: Primer:
    http://www.w3.org/TR/2003/REC-soap12-part0-20030624/#L1153
  • OAI For Beginners - The Open Archives Forum
    online tutorial: http://www.oaforum.org/tutorial/index.php
  • Z39.50 Resource Page: http://www.niso.org/standards/resources/Z3950_Resources.html
  • Z39.50: An Overview of Development and the Future
    (1995): http://www.cqs.washington.edu/camel/z/z.html