Title: Open Archives Initiative
1Open Archives Initiative
- Where we are,
- Where we are going
Carl Lagoze, 4th OAF Workshop, September 2003
2Where we are now
- De facto standard for Internet information exchange
- Deployed extensively and internationally
- (digital) libraries
- Museums
- Eprint repositories
- Research projects
3Protocol Stability
- OAI-PMH 2.0 has been stable since its release
- No functional changes, just typographic edits
- Validation of leadership/participation model
- No plans for a 3.0 release
- Core protocol will not be extended
- Minor 2.x release could occur (more later)
- Additional implementation guidelines (more later)
4NSDL and OAI-PMH
5The NSDL Context
- National STEM (Science, Technology, Engineering, Mathematics, Medicine) Digital Library
- Major National Science Foundation project targeted at the application of the web and Internet to (STEM) education
- $25M over six years to over 100 projects
- Collections
- Services
- Targeted Research
- Core Integration
6NSDL technical guidelines
- Aggregation rather than collection
- Core integration team will not manage any collections
- Spectrum of interoperability
- Accommodate diversity of participation models
- Open interfaces and standards permitting plug-in of an array of value-added services
- One library, many portals
- Accommodate multiple quality and selection metrics
- Tailor presentation of content and nature of services to audience needs
7Spectrum of interoperability
Level | Agreements | Example
Federation | Strict use of standards (syntax, semantic, and business) | AACR, MARC, Z 39.50
Harvesting | Digital libraries expose metadata; simple protocol and registry | Open Archives metadata harvesting
Gathering | Digital libraries do not cooperate; services must seek out information | Web crawlers and search engines
8Translating to initial goals
- This is a big task that no one has done before!
- Work on the priorities
- Focus on one point on the spectrum of interoperability
- Metadata harvesting
- Incorporate NSF-funded collections and selected other collections
- Leverage existing (or at least emerging) technologies and protocols
- OAI, uPortal, Shibboleth, SDLIP, InQuery
- Provide reliable base-level services
- Search and Discovery, Access Management, User Profiles, Exemplary Portals, Persistence
- Plant some seeds for the future
- Machine-assisted metadata generation
- Automated collection aggregation
- Web gathering strategies
9Metadata Repository
- Central storage of all metadata about all resources in the NSDL
- Defines the extent of the NSDL collection
- Metadata includes collections, items, annotations, etc.
- MR main functions
- Aggregation
- Normalization
- Redistribution
- Ingest of metadata by various means
- Harvesting, manual, automatic, cross-walking
- Open access to MR contents for service builders via OAI-PMH
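As an illustration of what "open access via OAI-PMH" means for a service builder, the sketch below pages through ListRecords responses using resumption tokens. It is a minimal sketch, not NSDL code: the base URL is a placeholder, and `fetch` is a hypothetical caller-supplied function that returns the response body for a URL, so the example runs without a network.

```python
import urllib.parse
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def list_records_url(base_url, metadata_prefix=None, resumption_token=None):
    """Build a ListRecords request URL."""
    if resumption_token:
        # resumptionToken is exclusive: it is sent with the verb only.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urllib.parse.urlencode(params)

def parse_list_records(xml_text):
    """Return (identifiers, resumption token or None) from one response."""
    root = ET.fromstring(xml_text)
    ids = [h.findtext(OAI_NS + "identifier") for h in root.iter(OAI_NS + "header")]
    token_el = root.find(f"{OAI_NS}ListRecords/{OAI_NS}resumptionToken")
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token

def harvest(base_url, metadata_prefix, fetch):
    """Yield every record identifier, following resumption tokens to the end."""
    token = None
    while True:
        url = list_records_url(base_url, metadata_prefix, token)
        ids, token = parse_list_records(fetch(url))
        yield from ids
        if token is None:
            break
```

In real use `fetch` could wrap an HTTP client; error responses, deleted records, and retry handling are left out of this sketch.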
10Metadata Strategy
- Collect and redistribute any native (XML) metadata format
- Provide crosswalks to Dublin Core from standard formats
- DC-GEM, LTSC (IMS), ADL (SCORM), MARC, FGDC, EAD
- Concentrate on collection-level metadata
- Use automatic generation to augment item-level metadata
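A crosswalk in this sense can be pictured as a lookup table from native field tags to Dublin Core elements. The sketch below is illustrative only: the four MARC tags are a small hand-picked subset, not the crosswalk used by the MR, and the function name `crosswalk` is hypothetical.

```python
# Illustrative subset of a MARC-to-Dublin-Core mapping (an assumption for
# this example, not the NSDL crosswalk).
MARC_TO_DC = {
    "245": "title",
    "100": "creator",
    "650": "subject",
    "520": "description",
}

def crosswalk(native_fields, mapping):
    """Map (tag, value) pairs to unqualified DC elements.

    Returns the DC record and the list of fields that had no mapping, so
    that information lost in translation stays visible.
    """
    dc = {}
    unmapped = []
    for tag, value in native_fields:
        element = mapping.get(tag)
        if element is None:
            unmapped.append((tag, value))
        else:
            dc.setdefault(element, []).append(value)
    return dc, unmapped
```

Keeping the unmapped fields alongside the DC output reflects the strategy of redistributing the native format as well as the normalized one.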
11Importing metadata into the MR
12Exporting metadata from the MR
13NSDL and OAI-PMH: Two years later
- Concepts are good, practice is hard
- Issues
- Metadata is hard
- http://www.well.com/~doctorow/metacrap.htm
- XML is hard
- Protocols are hard
- Static repositories (more later)
- IP is relevant (more later)
14Some Essential Metadata Questions
- Review original (DC) metadata assumptions
- Metadata is essential for good resource discovery
- Joe Sixpack could create metadata
- Account for current realities
- 2003 is not 1994
- Google, etc. keep getting better
15Metadata Space
16Metadata Triage
17Reconsidering the Dublin Core Requirement
- Questions about utility of unqualified DC
- The conundrum
- Specification too loose to serve intended interoperability goal
- But more complex metadata may be too hard
- Limited energy for interoperability
- Data providers implement required DC at expense of better metadata
- Use of protocol for purposes other than resource discovery
18Rethinking record-oriented model
Implications for record-oriented harvesting?
19Topology Evolution
Simple Data Provider, Service Provider Topology
20Topology Evolution (cont.)
Metadata Aggregator
21Topology Evolution (cont.)
OAI-PMH p2p network
22OAI-P2pMH Issues
- Document (metadata) location
- Exploit unique identifiers, use efficient key-based location mechanisms (distributed hash tables)
- Provenance-based queries
- Metadata records may go through refinement and/or translation phases as they move through value-added aggregators
- Exploit provenance guidelines
- Network harvesting
- Broadcast query (Gnutella) inefficient
- Exploit techniques for efficient routing of queries (P-trees)
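To make the key-based location idea concrete, the sketch below places peers on a hash ring and assigns each OAI identifier to the first peer clockwise from the identifier's hash. This is consistent hashing in a single process, offered only as an approximation of what a distributed hash table does; the class name `HashRing` and the peer names are hypothetical, and no network routing is modelled.

```python
import bisect
import hashlib

def ring_position(key):
    """Hash a string to an integer position on the ring."""
    return int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)

class HashRing:
    def __init__(self, peers):
        # Sort peers by their position on the ring.
        self._ring = sorted((ring_position(p), p) for p in peers)
        self._positions = [pos for pos, _ in self._ring]

    def locate(self, identifier):
        """Return the peer responsible for a metadata record identifier."""
        i = bisect.bisect_right(self._positions, ring_position(identifier))
        # Wrap around to the first peer when past the last position.
        return self._ring[i % len(self._ring)][1]
```

Because the lookup depends only on the identifier, no broadcast query is needed, and a peer that does not hold a given record can leave without changing where that record is found.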
23OAI-PMH and Intellectual Property
- Protocol exists in a context where information providers have concerns about use of intellectual property
- OAI-PMH is nominally about metadata, but
- Rich metadata is an intellectual product
- The protocol can be used to transmit anything (e.g. content) that can be encoded in XML
- Generally metadata leads to content, so ...
24OAI-rights effort
- Goal is to investigate and develop means of expressing rights about metadata and resources in the OAI framework
- The result will be an addition to the OAI implementation guidelines that specifies mechanisms for rights expressions within OAI-PMH
- No changes to core protocol
25OAI-rights Effort (cont.)
- Extensible, providing a general framework for expressing rights statements within OAI-PMH
- Not an effort to develop a new rights expression language
- Use Creative Commons licenses as a motivating and deployable example
- Release of specification by 2nd quarter 2004
- Invited OAI-rights group
- Standard OAI development model
26Dimensions of OAI-PMH and rights: Entity Association
- Metadata: concern in NSDL for (re)use of rich metadata
- Content: predominant application of the protocol to resource discovery and ultimate access makes this important
27Dimensions of OAI-PMH and rights: Aggregation Association
- OAI-PMH aggregations
- Repository
- Set
- Item
- Rights association with an aggregation may provide a shortcut (e.g., the rights for all resources in a repository/set)
- Cost of shortcut is pseudo-statefulness, possibly complex overriding rules
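One possible overriding rule is "the most specific statement wins". The sketch below assumes that rule; it is an illustration of the cost mentioned above, not a mechanism defined by OAI-rights, and the function and argument names are hypothetical.

```python
def effective_rights(item_id, set_specs, item_rights, set_rights, repository_rights):
    """Resolve the rights statement that applies to one item.

    Assumed precedence: item, then set (most specific first), then repository.
    """
    if item_id in item_rights:
        return item_rights[item_id]
    for spec in set_specs:
        parts = spec.split(":")
        # setSpec hierarchy uses ':'; walk from the given set up to its ancestors.
        while parts:
            candidate = ":".join(parts)
            if candidate in set_rights:
                return set_rights[candidate]
            parts.pop()
    return repository_rights
```

Even this small rule leaves open what to do when an item belongs to two sets with different statements (here the first listed set wins), which is the kind of complexity a harvester would have to track across requests.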
28Dimensions of OAI-PMH and rights: Binding
- Choices
- exploit mechanisms in metadata formats, e.g., DC-rights
- restrict the rights statements to some more specific protocol mechanism
- allow some mixture of these methods
- DC-rights problems
- Semantics is restricted to rights about the resource
- Can't embed XML in a dc value
- What if DC is not required?
- Burden on harvesters if rights embedding is not explicit but scattered across several locations
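The first choice, binding through the metadata format, can be sketched as an oai_dc record whose dc:rights element carries a license URI. The helper name `dc_record` is hypothetical; the Creative Commons URI is used only as a sample value.

```python
import xml.etree.ElementTree as ET

OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"

def dc_record(title, rights_uri):
    """Serialize a minimal oai_dc record with a dc:rights value."""
    ET.register_namespace("oai_dc", OAI_DC)
    ET.register_namespace("dc", DC)
    root = ET.Element(f"{{{OAI_DC}}}dc")
    ET.SubElement(root, f"{{{DC}}}title").text = title
    # dc:rights holds plain text only, so the license is referenced by URI
    # rather than embedded as a structured XML rights expression.
    ET.SubElement(root, f"{{{DC}}}rights").text = rights_uri
    return ET.tostring(root, encoding="unicode")
```

The sketch also shows the limits listed above: the value is a bare string, and nothing in the record says whether it covers the resource or the metadata itself.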
29OAI-PMH Static Repositories
- Provide a lightweight mechanism for data provider participation
- Intended for relatively small and static collections
- Two components
- Static Repository XML format
- Semantically equivalent to Identify and ListRecords
- Invisible to harvester
- Static Repository Gateway
- Virtual data provider for static repository data
- Unique baseURL for each contained static repository
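One way a gateway could give each static repository its own baseURL is to append the repository file's host and path to the gateway URL. The sketch below assumes that scheme for illustration; the function name and both URLs are hypothetical.

```python
from urllib.parse import urlsplit

def gateway_base_url(gateway_url, static_repo_url):
    """Derive a per-repository baseURL under a gateway (assumed scheme)."""
    parts = urlsplit(static_repo_url)
    # Host plus path identifies the static file, so distinct files map to
    # distinct baseURLs under the same gateway.
    return gateway_url.rstrip("/") + "/" + parts.netloc + parts.path
```

A harvester would then treat the derived URL like any other OAI-PMH baseURL, which is what keeps the static file invisible to it.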
30Static Repositories and Static Repository Gateway
31Static Repositories Open Issue
32Conclusions
- Interoperability and lowest common denominator
- Rapid advances in automated methods
- Moore's law
- Smart algorithms
- Benefits of issues of scale
- Combining human effort and automated methods
- Extracting order from chaos
- Learning from order
- Move beyond resource discovery
33Typical Values
- repository
- collection of publications
- resource
- scholarly publication
- item
- all metadata (DC and MARC)
- record
- a single metadata format
- datestamp
- last update / addition of a record
- metadata format
- bibliographic metadata format
- set
- originating institution or subject categories
34Repositories
- Stretching the idea of a repository a bit
- contextually sensitive repositories
- personalization for harvesters
- communication between strangers, or communication between friends?
- OAI-PMH for individual complex objects?
- OAI-PMH without MySQL?!
- Fedora, Multi-valent documents, buckets
- tar, jar, zip, etc. files
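As a small illustration of a repository with no database behind it, the sketch below treats a zip archive as the record store: each member name becomes an identifier and its modification date becomes the datestamp. The identifier prefix and the function name are assumptions made for the example.

```python
import io
import zipfile

def list_identifiers(zip_bytes, prefix="oai:example.org:"):
    """Return (identifier, datestamp) pairs for every member of a zip archive."""
    headers = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            year, month, day = info.date_time[:3]
            # Day-granularity datestamp taken from the archive entry itself.
            headers.append((prefix + info.filename, f"{year:04d}-{month:02d}-{day:02d}"))
    return headers
```

The same idea would carry over to tar or jar files, since each also records a name and a modification time per member.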
35Resource
- What if resource were
- computer system status
- uptime, who, w, df, ps, etc.
- or generalized system status
- e.g., sports league standings
- people
- personnel databases
- authority files for authors
36Item
- What if item were
- software
- union of versions and formats
- all forms of metadata
- administrative and structural
- citations, annotations, reviews, etc.
- data
- e.g., newsfeeds and other XML expressible content
- metadataPrefixes or sets could be defined to be different versions
37Record
- What if record were
- specific software instantiations / updates
- access / retrieval logs for DLs (or computer systems)
- push / pull model inversion
- put a harvester on the client behind a firewall, the client contacts a DP and receives instructions on how to submit the desired document (e.g., send email to a specified address)
38Datestamp
- semantics of datestamp are strongly influenced by the choice of resource / item / record / metadataPrefix, but it could be used to
- signify change of set membership (e.g., workflow item moves from "submitted" to "approved")
- change datestamp to reflect access to the DP
- e.g., in conjunction with metadataPrefixes of "accessed" or "mirrored"
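The workflow example can be sketched as follows: moving an item between sets updates its datestamp, so an incremental harvest restricted by set and date picks up the change. The `Item` class and both functions are hypothetical, a toy model rather than a data provider implementation.

```python
from datetime import date

class Item:
    def __init__(self, identifier, set_spec, datestamp):
        self.identifier = identifier
        self.set_spec = set_spec
        self.datestamp = datestamp

def move_to_set(item, new_set, today):
    """Change set membership and record the change in the datestamp."""
    item.set_spec = new_set
    item.datestamp = today

def select(items, set_spec=None, from_=None, until=None):
    """Selective harvesting: filter by set and by an inclusive date range."""
    return [
        i.identifier
        for i in items
        if (set_spec is None or i.set_spec == set_spec)
        and (from_ is None or i.datestamp >= from_)
        and (until is None or i.datestamp <= until)
    ]
```

A harvester that last ran in June would ask for the approved set from that date onward and receive only the items whose status changed since then.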
39metadataPrefix
- what if metadataPrefix were
- instructions for extracting / archiving / scraping the resource
- verb=ListRecords&metadataPrefix=extract_TIFFs
- code fragments to run locally
- (harvested from a trusted source!)
- XSLT for other metadataPrefixes
- branding container is at the repository-level, this could be record- or item-level
40Set
- sets are already used for tunneling OAI-PMH extensions (see Suleman & Fox, D-Lib 7(12))
- other uses
- in aggregators, automatically create 1 set per baseURL
- have hidden sets (or metadataPrefix) that have administrative or community-specific values (or triggers)
- set=accessed>1000&from=2001-01-01
- set=harvestMeWithTheseARGS&until=2002-05-05&metadataPrefix=oai_marc
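For the aggregator case, one set per baseURL needs a setSpec derived from each source URL. The sketch below is one possible derivation, assuming that characters outside the setSpec alphabet are replaced with underscores; the function name is hypothetical.

```python
import re
from urllib.parse import urlsplit

def set_spec_for(base_url):
    """Derive a flat setSpec from a source repository's baseURL."""
    parts = urlsplit(base_url)
    raw = parts.netloc + parts.path
    # setSpec allows letters, digits and -_.!~*'() ; ':' is reserved for
    # hierarchy, so it is replaced along with '/' and anything else.
    return re.sub(r"[^A-Za-z0-9\-_.!~*'()]", "_", raw).strip("_")
```

With such a mapping, a downstream harvester can pull one source's records from the aggregator by passing the derived value as the set argument.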