Title: Digital Library Interoperability Architecture
1Digital Library Interoperability Architecture
- CS 502, 2003-03-05
- Carl Lagoze, Cornell University
2Interoperability is multidimensional
- Syntax
- XML
- Semantics
- RDF/RDFS/OWL
- Vocabularies/Ontologies
- Dublin Core/ABC/CIDOC-CRM
- Search and discovery
- Z39.50
- SDLIP
- ZING
- Document models
- METS
- FEDORA
3Contrast to Distributed Systems
- Distributed systems
- Collections of components at different sites that are carefully designed to work with each other
- Heterogeneous or federated systems
- Cooperating systems in which individual components are designed or operated autonomously
4Measuring success of interoperability solutions
- Degree of component autonomy
- Cost of infrastructure
- Ease of contributing components
- Ease of using components
- Breadth of task complexity supported by the solution
- Scalability in the number of components
5Families of interoperability solutions
6Interoperability Trade-offs
Metadata Harvesting
Dienst
7Dienst
- is a protocol and reference implementation of a distributed digital library service
- where a network of services provides
- World Wide Web browser access,
- uniform search over distributed indexes,
- and access to structured documents.
8Why a service based protocol?
- Expose the operational semantics of the services through an API,
- to permit flexible integration of the services,
- and use of the services by other clients/consumers/services.
9Defining the services
- Repository: deposit, storage, and access to structured documents.
- Index: process queries on documents and return handles
- Query Mediator: route queries to appropriate indexes
- Collection: define services and content in logical collections
- User Interface: human-oriented front-end for services.
- Name Server: resolves URNs (handles) to document location(s)
10Dienst Services
WWW browser
User Interface
11Defining the protocol
- Structured messages
- Service
- Version
- Verb
- Arguments
- Template
- /Dienst/<service>/<version>/<verb>?/<arguments>
- Example
- /Dienst/Repository/4.0/Formats/ncstrl.cornell/TR94-1418
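The template above can be read mechanically. The following is a minimal sketch, assuming that everything after the verb is a list of positional arguments and that any query string is handled separately; the names `DienstRequest` and `parse_dienst_path` are hypothetical and not part of Dienst itself.

```python
from typing import NamedTuple


class DienstRequest(NamedTuple):
    service: str
    version: str
    verb: str
    arguments: list


def parse_dienst_path(path: str) -> DienstRequest:
    """Split a Dienst request path into service, version, verb, and arguments."""
    # Drop any keyword-argument query string; this sketch only handles the path.
    path = path.split("?", 1)[0]
    parts = [p for p in path.split("/") if p]
    # A request needs at least: "Dienst", service, version, verb.
    if len(parts) < 4 or parts[0] != "Dienst":
        raise ValueError(f"not a Dienst request path: {path!r}")
    return DienstRequest(parts[1], parts[2], parts[3], parts[4:])


req = parse_dienst_path("/Dienst/Repository/4.0/Formats/ncstrl.cornell/TR94-1418")
print(req.service, req.version, req.verb, req.arguments)
```

Applied to the example request, the sketch separates the Repository service, version 4.0, the Formats verb, and the two path segments that identify the document.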
12Why a Document Model?
- Documents in current web are both
- Unstructured (GET)
- Chaotic (CGI)
- Different views and pieces of content are needed for
- Bandwidth reduction
- Rights management
- Usability
13Dienst Document Model
- Metadata: support for multiple descriptive formats
- Views: alternative expression or structural representation of the content encapsulated in the digital object
- Divs: hierarchically nested structure contained in a view
14Expressing the document model in the protocol
- Structure: expose the views and structure for the digital object
- Disseminate: select the structural component (and packaging of it) to disseminate
- List-Meta-Formats: list available descriptive formats
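One way to picture the document model and two of the verbs that expose it is the sketch below. It is an illustration only: the class and function names are hypothetical, the metadata format names and div names are made-up sample data, and a real Dienst repository returns structured responses rather than Python dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class Div:
    """A structural division; divs nest hierarchically inside a view."""
    name: str
    children: list = field(default_factory=list)


@dataclass
class View:
    """An alternative expression of the content, holding top-level divs."""
    name: str
    divs: list = field(default_factory=list)


@dataclass
class DigitalObject:
    handle: str
    metadata: dict = field(default_factory=dict)  # format name -> record text
    views: list = field(default_factory=list)


def list_meta_formats(obj):
    """Analogue of List-Meta-Formats: names of the descriptive formats."""
    return sorted(obj.metadata)


def div_outline(div):
    """Describe one div and, recursively, the divs nested inside it."""
    return {div.name: [div_outline(child) for child in div.children]}


def structure(obj):
    """Analogue of Structure: the views of an object and their nested divs."""
    return {view.name: [div_outline(d) for d in view.divs] for view in obj.views}


doc = DigitalObject(
    handle="cul.cs/TR90-1160",
    metadata={"format-a": "<record/>", "format-b": "<record/>"},
    views=[View("body", [Div("chapter1", [Div("section1.1")])])],
)
print(list_meta_formats(doc))
print(structure(doc))
```

A Disseminate request would then name one view or div from this outline, together with the packaging in which to return it.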
15Protocol Demonstration
- http://techreports.library.cornell.edu:8081/Dienst/Repository/4.0/List-Contents?file-after=2003-01-01
- http://techreports.library.cornell.edu:8081/Dienst/Repository/1.0/Disseminate/cul.cs/TR90-1160/23oams/xml
- http://techreports.library.cornell.edu:8081/Dienst/Repository/2.0/Structure/cul.cs/TR90-1160
- http://techreports.library.cornell.edu:8081/Dienst/Repository/4.0/Formats/cul.cs/TR90-1160?part=body
- http://techreports.library.cornell.edu:8081/Dienst/Repository/1.0/Disseminate/cul.cs/TR90-1160/body/inline?pageimage=3
16Collection Service
- Periodically polled by each user interface server for
- elements of the collection
- index servers for the collection
User Interface Servers
Index Servers
17Deploying Collection Globally
- Internet connectivity varies considerably
- Good connectivity between nodes often does not correspond to geographic proximity
- Connectivity Region - a group of nodes on the network that among them have good connectivity, relative to nodes outside of the region.
18Connectivity Regions
- When possible, route queries within the region
- In case of failure, use an alternate, either within the region or in a nearby region
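The routing rule on this slide can be sketched as a simple preference order. This is an assumption-laden illustration, not the Dienst implementation: the region names, server names, the `nearby` table, and the `is_up` health check are all hypothetical.

```python
# Hypothetical deployment: index servers grouped by connectivity region.
SERVERS = {
    "region-a": ["index-a1", "index-a2"],
    "region-b": ["index-b1"],
}
# Hypothetical ordering of nearby regions to try when a region has no live server.
NEARBY = {"region-a": ["region-b"], "region-b": ["region-a"]}


def pick_index_server(client_region, servers, nearby, is_up):
    """Return the first reachable server, preferring the client's own region."""
    # Try the client's region first, then nearby regions in the listed order.
    for region in [client_region] + list(nearby.get(client_region, [])):
        for server in servers.get(region, []):
            if is_up(server):
                return server
    return None  # no server reachable in this region or any nearby region


down = {"index-a1"}
print(pick_index_server("region-a", SERVERS, NEARBY, lambda s: s not in down))
```

With one server in the region marked as down, the sketch falls back to the alternate in the same region before it considers a nearby region.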
19Origins of the OAI
- Increasing interest in alternative scholarly publishing solutions, e.g., LANL arXiv
- Increasing impact through federation
- UPS Mtg., Santa Fe, October 1999
- Representatives of various ePrint, library, and publishing communities
- Goal: definition of an interoperability framework among ePrint providers
- Reality: rich interoperability protocols like Dienst are too complicated for widespread deployment
- Result: Santa Fe Convention, interoperability through metadata harvesting
20The World According to OAI
Service Providers
Discovery
Current Awareness
Preservation
Data Providers
21Yes, it's about resource discovery over distributed collections
metadata
Author Title Abstract Identifier
22Facilitating/Monitoring Longevity of Distributed
Content
Preservation Service
23Personalization of Content
24Cross-Repository Reference Linking
Linkage Service
25OAI Technical Infrastructure: Key technical features
- Deploy-now technology: 80/20 rule
- Two-party model: providers (data providers) and consumers (service providers)
- Simple HTTP encoding
- XML schema for some degree of protocol conformance
- Extensibility
- Multiple item-level metadata
- Collection level metadata
26Content and Metadata
Item (metadata)
repository
resource
record
010010
27http://www.openarchives.org/OAI/openarchivesprotocol.html
28record
<record>
  <header>
    <identifier>oaieg001</identifier>
    <datestamp>1999-01-01</datestamp>
  </header>
  <metadata>
    <dc xmlns="http://purl.org/dc">
      <title>My Example</title>
    </dc>
  </metadata>
  <about>
    <ea xmlns="http://www.arXiv.org/ea">
      <usage>No restrictions</usage>
    </ea>
  </about>
</record>
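To show how a harvester might read the three parts of such a record (header, metadata, about), here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The XML is this slide's record with its markup restored; the identifier is copied as it appears in the transcript, and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

RECORD_XML = """
<record>
  <header>
    <identifier>oaieg001</identifier>
    <datestamp>1999-01-01</datestamp>
  </header>
  <metadata>
    <dc xmlns="http://purl.org/dc">
      <title>My Example</title>
    </dc>
  </metadata>
  <about>
    <ea xmlns="http://www.arXiv.org/ea">
      <usage>No restrictions</usage>
    </ea>
  </about>
</record>
"""

# Prefixes used only inside this script to address the two namespaces.
NS = {"dc": "http://purl.org/dc", "ea": "http://www.arXiv.org/ea"}

record = ET.fromstring(RECORD_XML.strip())

# Header: identifies the item and dates it (used for selective harvesting).
identifier = record.findtext("header/identifier")
datestamp = record.findtext("header/datestamp")
# Metadata: the descriptive record in one metadata format.
title = record.findtext("metadata/dc:dc/dc:title", namespaces=NS)
# About: statements about the metadata record itself, such as usage terms.
usage = record.findtext("about/ea:ea/ea:usage", namespaces=NS)

print(identifier, datestamp, title, usage)
```

The header elements carry no namespace in this example, while the metadata and about containers each bring their own, which is why the lookups use different prefixes.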
29selective harvesting - datestamps
30selective harvesting - sets
S2
31set specifics
- repositories define hierarchical organization
- each item in a repository may be organized in one set, several sets, or no sets at all
- meaning of sets or of set hierarchy is not defined in protocol
- individual communities may formulate common set configurations
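The two selective-harvesting criteria, datestamps and sets, can be combined as in the sketch below. It is a toy in-memory filter, not a protocol implementation: the item data is invented, and writing a nested set as a colon-separated path (for example `S1:S2`) is an assumption made here to represent the hierarchy.

```python
from datetime import date

# Invented sample items: an item may be in one set, several sets, or none.
ITEMS = [
    {"identifier": "item-1", "datestamp": date(1999, 1, 1), "sets": ["S1"]},
    {"identifier": "item-2", "datestamp": date(2000, 6, 15), "sets": ["S1:S2", "S3"]},
    {"identifier": "item-3", "datestamp": date(2001, 3, 9), "sets": []},
]


def in_set(item_sets, wanted):
    """True if the item is in the wanted set or in one of its sub-sets."""
    return any(s == wanted or s.startswith(wanted + ":") for s in item_sets)


def select(items, from_date=None, until_date=None, set_spec=None):
    """Return identifiers of items matching the datestamp range and the set."""
    selected = []
    for item in items:
        if from_date is not None and item["datestamp"] < from_date:
            continue
        if until_date is not None and item["datestamp"] > until_date:
            continue
        if set_spec is not None and not in_set(item["sets"], set_spec):
            continue
        selected.append(item["identifier"])
    return selected


print(select(ITEMS, set_spec="S1"))
print(select(ITEMS, from_date=date(2000, 1, 1)))
```

Because the protocol leaves the meaning of sets undefined, the sub-set rule in `in_set` is a choice a community would have to agree on, not something a harvester can assume.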
32HTTP encoding - requests
BASE-URL -----------> an.oa.org/OAI-script
keyword arguments --> verb=ListIdentifiers&set=S1
GET
http://an.oa.org/OAI-script?verb=ListIdentifiers&set=S1
POST
POST http://an.oa.org/OAI-script HTTP/1.0
Content-Length: 78
Content-Type: application/x-www-form-urlencoded
verb=ListIdentifiers&set=S1
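Both encodings carry the same keyword arguments, once in the query string and once in the request body. A minimal sketch with Python's standard `urllib`, using the base URL from this slide; the helper names are hypothetical, and nothing is sent over the network.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://an.oa.org/OAI-script"  # base URL from the slide


def oai_get_url(base_url, verb, **arguments):
    """GET encoding: keyword arguments appended to the base URL as a query."""
    return base_url + "?" + urlencode({"verb": verb, **arguments})


def oai_post_request(base_url, verb, **arguments):
    """POST encoding: the same keyword arguments, form-encoded in the body."""
    body = urlencode({"verb": verb, **arguments}).encode("ascii")
    return Request(
        base_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


print(oai_get_url(BASE_URL, "ListIdentifiers", set="S1"))
post = oai_post_request(BASE_URL, "ListIdentifiers", set="S1")
print(post.get_method(), post.data)
```

`urlencode` also percent-encodes reserved characters, so an identifier containing colons is sent with each `:` written as `%3A`.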
33HTTP encoding - responses
<?xml version="1.0" encoding="UTF-8"?>
<GetRecord xmlns="http://oai.namespace.uri"
    xmlns:xsi="http://w3.namespace.uri"
    xsi:schemaLocation="http://oai.namespace.uri http://oai.schemaURL">
  <responseDate>2000-19-01T19:30:30-04:00</responseDate>
  <requestURL>http://an.oa.org/OAI-script?verb=GetRecord&amp;identifier=oai%3AarXiv%3A0001&amp;metadataPrefix=oai_dc</requestURL>
  <record> record contents </record>
  additional records
</GetRecord>
34metadata prefix and schema
- support for harvesting multiple metadata formats
- metadata schema: each format must have a validating XML schema at a publicly accessible URL (communities may define shared formats and schema).
- metadata prefix: each repository maps a prefix to the schema it supports, which is used in protocol requests.
- support for unqualified Dublin Core mandatory
- DC OAI record syntax that builds on base DCMI schema
- reserved prefix: oai_dc.
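The prefix-to-schema mapping and the mandatory `oai_dc` rule can be sketched as a small registry. This is illustrative only: the class name is hypothetical, and the schema URLs are placeholders on example.org, not the real schema locations.

```python
class MetadataFormats:
    """Maps each metadata prefix a repository supports to its schema URL."""

    REQUIRED_PREFIX = "oai_dc"  # reserved prefix for unqualified Dublin Core

    def __init__(self, schemas):
        # Every repository must support unqualified Dublin Core.
        if self.REQUIRED_PREFIX not in schemas:
            raise ValueError("unqualified Dublin Core (oai_dc) is mandatory")
        self._schemas = dict(schemas)

    def prefixes(self):
        """Prefixes a harvester may use in protocol requests."""
        return sorted(self._schemas)

    def schema_for(self, prefix):
        """Resolve the prefix named in a request to its validating schema."""
        try:
            return self._schemas[prefix]
        except KeyError:
            raise ValueError(f"unsupported metadataPrefix: {prefix!r}") from None


formats = MetadataFormats({
    "oai_dc": "http://example.org/schemas/oai_dc.xsd",      # placeholder URL
    "community_fmt": "http://example.org/schemas/community.xsd",  # placeholder
})
print(formats.prefixes())
print(formats.schema_for("oai_dc"))
```

A request that names a prefix the repository does not map is rejected, which is how a harvester learns that a format is not available from that data provider.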
35flow control
36flow control specifics
- applies to all protocol requests that return lists: ListRecords, ListIdentifiers, ListSets
- resumptionToken is opaque
- semantics of partitioning of responses within resumption requests is undefined
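From the harvester's side, flow control amounts to repeating the request with whatever token the previous response carried. The sketch below assumes a `fetch` callable that returns a page of items plus the next token (or `None`); `fake_fetch` is an invented in-memory stand-in for a data provider, and its page size and token format are arbitrary.

```python
def harvest(fetch, verb, **arguments):
    """Yield every item of a list request, following resumption tokens."""
    items, token = fetch(verb, **arguments)
    yield from items
    while token:
        # The token is opaque: hand it back unchanged, never interpret it.
        items, token = fetch(verb, resumptionToken=token)
        yield from items


# Invented data provider that returns two identifiers per response.
IDENTIFIERS = ["id-1", "id-2", "id-3", "id-4", "id-5"]


def fake_fetch(verb, resumptionToken=None, **arguments):
    # Only the provider knows that its token happens to encode an offset.
    start = int(resumptionToken) if resumptionToken else 0
    page = IDENTIFIERS[start:start + 2]
    next_start = start + 2
    token = str(next_start) if next_start < len(IDENTIFIERS) else None
    return page, token


print(list(harvest(fake_fetch, "ListIdentifiers")))
```

The harvester never relies on how the provider partitions the list, which matches the slide's point that the partitioning semantics are undefined.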
37Extensibility Feature Summary
- Multiple metadata formats
- Collection level metadata
- Identify about container
- Record data
- Terms and conditions
- Provenance
- Set structure
- Pre-configured queries
38OAI Protocol
service provider
data provider
- Supporting protocol requests
- Identify
- ListMetadataFormats
- ListSets
- Harvesting protocol requests
- ListRecords
- ListIdentifiers
- GetRecord
39Challenges and Questions
- Utility of lowest common denominator metadata such as DC
- Quality of metadata from non-professional contributors
- Machine processing to reduce and complement human effort
- Functionality of service structure