Building Reliable Distributed Information Spaces


1
Building Reliable Distributed Information Spaces
  • Carl Lagoze
  • CS 430
  • 10/22/2002

2
Characteristics of a library
  • Functions
    • Selection
    • Access
    • Organization
    • User support
    • Preservation
  • Characteristics
    • Standardized
    • Professionalized
    • Service-oriented
    • In it for the long haul
    • Conservative
    • Trustworthy
    • Expensive (human-centric)

3
Perspective on the Budget
4
Library in current environment
  • "I don't do libraries" (anonymous Cornell
    undergrad to Bob Constable)
  • How do you use the library?
  • Go to the library to study?
  • Go to the library to do research?
  • Talked to a reference librarian?
  • Use the library gateway or electronic resources?

5
Characteristics of the Web
  • Decentralized/Anarchic/Illegal
  • Agreements are technical (at best)
  • Roles are undefined and fluid
  • Immediate
  • Ephemeral
  • Integrity not established
  • Anonymous (or no one knows you are a dog)

7
What is a Digital Library?
Evolutionary perspective: digital libraries as
institutions that are the continuation of
libraries (library automation and digitization as
the link between libraries and digital
libraries).
Revolutionary perspective: digital libraries as
technical/organizational/economic/legal layers on
top of networked information (the Web) that
render existing libraries obsolete.
8
What is a Digital Library?
A digital library is a managed collection of
information, with associated services, where the
information is stored in digital formats and is
accessible over a network. (Arms, CS502 sp00)
9
Many facets of the problem/solution
10
Technical Trade-offs
11
National Science Digital Library (NSDL)
  • Goal: Reform science education in the US in the
    digital age
  • $25M in funding, 2002-2006
  • Over 80 institutional grants for collections,
    services, core infrastructure (technical,
    economic, organizational)
  • Cornell is primary technical development partner
  • Carl Lagoze, Director of Technology
  • http://www.nsdl.org

12
Building service and knowledge layers over a
variety of resources for a variety of users
13
How Big might the NSDL be?
  • All branches of science, all levels of
    education, very broadly defined
  • Five year targets
  • 1,000,000 different users
  • 10,000,000 digital objects
  • 10,000 to 100,000 independent sites

14
Core Integration Philosophy
  • It is possible to build a very large digital
    library with a small staff.
  • But ...
  • Every aspect of the library must be planned with
    scalability in mind.
  • Some compromises will be made.
  • Lots of standard library functions must be
    automated.

15
Resources for Core Integration
Core Integration

Budget:      $4-6 million
Staff:       25-30
Management:  Diffuse

How can a small team, without direct management
control, create a very large-scale digital
library?
16
Collections: the Basic Assumption
The Core Integration team will not manage any
collections.
17
The NSDL program funds only a fraction of the
relevant collections.
Collections
18
Every Collection is Different
19
The Core Integration Task ...
... to provide a coherent set of collections and
services across great diversity.
20
Interoperability
The Problem: Conventional approaches to
interoperability require partners to support
agreements (technical, content, and business).
But NSDL needs thousands of very different
partners, most of whom are not directly part
of the NSDL program.
The Approach: A spectrum of interoperability.
21
Levels of interoperability
Level        Agreements                            Example
Federation   Strict use of standards               AACR, MARC,
             (syntax, semantic, and business)      Z 39.50
Harvesting   Digital libraries expose metadata;    Open Archives
             simple protocol and registry          metadata harvesting
Gathering    Digital libraries do not cooperate;   Web crawlers and
             services must seek out information    search engines
22
Searching
What to Index?
When possible, full text indexing is excellent,
but full text indexing is not possible for all
materials (non-textual, no access for indexing).
Comprehensive metadata is an alternative, but it
is available for very few of the materials.
What Architecture to Use?
Few collections support an established search
protocol (e.g., Z39.50).
23
Function versus cost of acceptance
[Figure: plot with axes "Function" and "Cost of
acceptance"; plotted points: Z39.50, SDLIP,
Metadata Harvesting]
24
Z39.50 principles
  • Servers store a set of databases with searchable
    indexes
  • Interactions are based on a session
  • The client opens a connection with the server(s),
    carries out a sequence of interactions and then
    closes the connection.
  • During the course of the session, both the server
    and the client remember the state of their
    interaction.

25
State
  • Z39.50
    • The server carries out the search and builds a
      result set.
    • The server saves the result set.
    • Subsequent messages from the client can
      reference the result set.
    • Thus the client can narrow a large set with
      increasingly precise requests, or can request a
      presentation of any record in the set, without
      searching the entire database again.
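
The result-set idea can be sketched with a toy in-memory server. This is not the Z39.50 protocol or any real client library; the class `ToySearchServer`, its method names, and the sample records are all invented for illustration. The sketch only shows the server saving named result sets so that a later request can refine or present an earlier set without rescanning every record.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToySearchServer:
    """Hypothetical stand-in for a server that keeps per-session result sets."""
    records: List[str]
    result_sets: Dict[str, List[int]] = field(default_factory=dict)

    def search(self, set_name: str, term: str, within: Optional[str] = None) -> int:
        """Run a search, save the hits under set_name, and return the hit count.

        If `within` names an earlier result set, only that set is scanned,
        which is how a client narrows a large set step by step.
        """
        if within is None:
            candidates = list(range(len(self.records)))
        else:
            candidates = self.result_sets[within]
        hits = [i for i in candidates if term.lower() in self.records[i].lower()]
        self.result_sets[set_name] = hits
        return len(hits)

    def present(self, set_name: str, position: int) -> str:
        """Return one record from a saved result set (0-based position in this sketch)."""
        return self.records[self.result_sets[set_name][position]]


# Invented sample data, one session with two searches and one presentation.
server = ToySearchServer(records=[
    "Digital libraries and metadata harvesting",
    "Metadata standards for science education",
    "Web crawlers and search engines",
])
first_count = server.search("set1", "metadata")
second_count = server.search("set2", "science", within="set1")
narrowed_record = server.present("set2", 0)
```

The cost of this convenience is that the server must hold state for every open session, which is part of what makes the approach expensive to adopt.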

26
Broadcast Searching does not Scale
Collections
User interface server
User
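
One way to see the scaling problem is a rough availability estimate. The model below is an assumption, not something stated on the slide: each collection is taken to answer independently with the same probability, and a broadcast search is counted as complete only when every collection answers. The 99% figure and the site counts are illustrative.

```python
def all_respond_probability(per_site_availability: float, site_count: int) -> float:
    """Chance that every site answers, assuming sites fail independently."""
    return per_site_availability ** site_count


# Assumed 99% availability per site; site counts are illustrative.
for site_count in (10, 100, 1000):
    chance = all_respond_probability(0.99, site_count)
    print(f"{site_count:>5} sites: {chance:.3f} chance that all respond")
```

Under these assumptions the chance of a complete answer falls quickly as sites are added, before even considering that the slowest site sets the response time.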
27
Open Archives Initiative Protocol for Metadata
Harvesting
  • Low-barrier protocol for exposing structured
    information (metadata) from cooperating
    repositories
  • Provides opportunity for building comprehensive
    service network
  • http://www.openarchives.org

28
OAI-PMH: A simple two-party model for sharing
structured information
Service Providers
Discovery
Current Awareness
Preservation
Data Providers
29
Resource discovery over distributed collections
Metadata: Author, Title, Abstract, Identifier
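
As a minimal sketch of discovery over harvested metadata, the code below builds a small word index from title and abstract fields and answers queries from that index alone, without contacting the collections. The records, identifiers, and function names are invented; a real discovery service would use a proper search engine.

```python
from collections import defaultdict

# Hypothetical harvested records carrying the four fields named above.
harvested_records = [
    {"identifier": "oai:example.org:1", "author": "A. Author",
     "title": "Teaching plate tectonics",
     "abstract": "Lesson plans for earth science."},
    {"identifier": "oai:example.org:2", "author": "B. Writer",
     "title": "Earth science data sets",
     "abstract": "Seismic data for classroom use."},
]


def build_index(records):
    """Map each lowercase word in a title or abstract to the identifiers containing it."""
    index = defaultdict(set)
    for record in records:
        text = f"{record['title']} {record['abstract']}"
        for word in text.lower().split():
            index[word.strip(".,;:")].add(record["identifier"])
    return index


def search(index, query):
    """Return sorted identifiers whose metadata contains every word of the query."""
    word_sets = [index.get(word, set()) for word in query.lower().split()]
    return sorted(set.intersection(*word_sets)) if word_sets else []


metadata_index = build_index(harvested_records)
earth_science_hits = search(metadata_index, "earth science")
```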
30
OAI-PMH Key technical features
  • Deploy-now technology: 80/20 rule
  • Simple HTTP encoding
  • Foundation of established XML standards
  • Multiple metadata formats
  • Repository partitioning (sets)
  • Selective harvesting (sets and dates)
  • Clean partition between core and
    implementation-specific extensions
  • Multiple item-level metadata
  • Collection level metadata
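
To make the XML foundation concrete, here is a sketch that parses one record in the required Dublin Core format using Python's standard library. The sample response is invented and simplified (a real response also carries elements such as the response date and the request), and `parse_record` is a hypothetical helper. The namespace URIs are the ones used by OAI-PMH 2.0 and Dublin Core.

```python
import xml.etree.ElementTree as ET

NAMESPACES = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented, simplified response body for illustration only.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord>
    <record>
      <header>
        <identifier>oai:example.org:42</identifier>
        <datestamp>2002-10-22</datestamp>
        <setSpec>physics</setSpec>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An invented example record</dc:title>
          <dc:creator>A. Author</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
"""


def parse_record(xml_text):
    """Extract header fields and two Dublin Core elements from the first record."""
    root = ET.fromstring(xml_text.strip())
    record = root.find(".//oai:record", NAMESPACES)
    header = record.find("oai:header", NAMESPACES)
    dublin_core = record.find("oai:metadata/oai_dc:dc", NAMESPACES)
    return {
        "identifier": header.findtext("oai:identifier", namespaces=NAMESPACES),
        "datestamp": header.findtext("oai:datestamp", namespaces=NAMESPACES),
        "sets": [element.text for element in header.findall("oai:setSpec", NAMESPACES)],
        "title": dublin_core.findtext("dc:title", namespaces=NAMESPACES),
        "creator": dublin_core.findtext("dc:creator", namespaces=NAMESPACES),
    }


parsed_record = parse_record(SAMPLE_RESPONSE)
```

The header (identifier, datestamp, set membership) is what selective harvesting by sets and dates operates on; the metadata section can be repeated in other formats.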

31
OAI Verbs
  • Identify: repository characteristics
  • ListMetadataFormats: DC required
  • ListSets: repository partitioning
  • ListRecords: (selectively) harvest metadata
  • ListIdentifiers: (selectively) harvest metadata
    identifiers
  • GetRecord: known item retrieval
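
Because the protocol uses a simple HTTP encoding, each verb becomes an ordinary query string. The sketch below builds request URLs for the six verbs above. The base URL, set name, and identifier are hypothetical, and `build_request` is an invented helper; the argument names (`metadataPrefix`, `from`, `until`, `set`, `identifier`) are the ones defined by OAI-PMH.

```python
from urllib.parse import urlencode

BASE_URL = "http://repository.example.org/oai"  # hypothetical repository address

VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListRecords", "ListIdentifiers", "GetRecord",
}


def build_request(verb, arguments=None):
    """Return the GET URL for one OAI-PMH request, skipping arguments set to None."""
    if verb not in VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    query = {"verb": verb}
    for name, value in (arguments or {}).items():
        if value is not None:
            query[name] = value
    return f"{BASE_URL}?{urlencode(query)}"


identify_url = build_request("Identify")

# Selective harvesting: one set, records changed on or after a given date.
list_records_url = build_request("ListRecords", {
    "metadataPrefix": "oai_dc",
    "from": "2002-10-01",
    "set": "physics",
})

# Known-item retrieval by identifier.
get_record_url = build_request("GetRecord", {
    "identifier": "oai:example.org:42",
    "metadataPrefix": "oai_dc",
})
```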

32
The Metadata Repository
Services
The metadata repository is a resource for service
providers. It holds information about every
collection and item known to the NSDL.
Users
Metadata repository
Collections
33
Metadata Repository
  • Central storage of all metadata about all
    resources in the NSDL
  • Defines the extent of NSDL collection
  • Metadata includes collections, items,
    annotations, etc.
  • MR main functions
    • Aggregation
    • Normalization
    • Redistribution
  • Ingest of metadata by various means
    • Harvesting, manual, automatic, cross-walking
  • Open access to MR contents for service builders
    via OAI-PMH
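
Aggregation and normalization through cross-walking can be sketched as follows. Everything named here is an assumption for illustration: the two source formats, their field names, the crosswalk tables, and the rule that a later record with the same identifier replaces an earlier one. None of it describes how the NSDL repository is actually built.

```python
# Hypothetical crosswalks from two source formats to common field names.
CROSSWALKS = {
    "source_a": {"name": "title", "maker": "creator", "uri": "identifier"},
    "source_b": {"Title": "title", "Author": "creator", "URL": "identifier"},
}


def normalize(record, source_format):
    """Rename fields using the crosswalk and trim whitespace; drop unmapped fields."""
    crosswalk = CROSSWALKS[source_format]
    return {
        crosswalk[key]: value.strip()
        for key, value in record.items()
        if key in crosswalk
    }


def aggregate(batches):
    """Merge normalized records from all sources into one store keyed by identifier.

    Assumed policy: a later record with the same identifier replaces the earlier one.
    """
    repository = {}
    for source_format, records in batches:
        for record in records:
            normalized = normalize(record, source_format)
            repository[normalized["identifier"]] = normalized
    return repository


# Invented input batches from the two hypothetical sources.
batches = [
    ("source_a", [{"name": " Plate tectonics ", "maker": "A. Author",
                   "uri": "http://example.org/1"}]),
    ("source_b", [{"Title": "Seismic data", "Author": "B. Writer",
                   "URL": "http://example.org/2", "Notes": "not mapped"}]),
]
metadata_repository = aggregate(batches)
```

Redistribution would then expose the contents of `metadata_repository` to service builders, which the slides say is done through OAI-PMH.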

34
Importing metadata into the MR
35
Exporting metadata from the MR
36
Search Architecture
[Figure: architecture diagram. Labels: Portal
(three), http, Search and Discovery Services,
SDLIP, OAI, Metadata repository, Collections]
James Allan, Bruce Croft (University of
Massachusetts, Amherst)
37
The Metadata Repository as a Resource
Support for Service Providers
  • Records are exposed through Open Archives
    Initiative harvesting protocol.
  • Core Integration team will provide some services
    based on the metadata repository.
  • The architecture encourages others to build
    services.

38
Building on the basics
  • Gathering resources from the open web
  • Automated collection aggregation
  • Automated metadata generation
    • Content of resource
    • Context of resource
  • Automated quality assessment
  • Annotation, review, and aggregation environment

39
If you find this all interesting
  • CS502: Architecture of Web Information Systems