Distributed Information Discovery

Provided by: lag7

Learn more at: http://www.cs.cornell.edu

1
Distributed Information Discovery
Lecture 14

CS 430
Carl Lagoze 2001-03-08

2
Goals and Motivation

Lesson from the Web: relevant and valuable
information is everywhere
Rethinking the library in the digital age
Not as a collector of information
Rather as an access point to distributed information
Perfect scenario: uniform access to all
information with rich functionality

3
Problems with the Perfect Scenario

Heterogeneity: what is the structure of the
information we wish to discover?
Reliability: machines, networks, and
organizations are sometimes (often) flaky
Complexity: cost vs. functionality tradeoff

4
Function versus cost of acceptance
[Figure: Z39.50, SDLIP, and Metadata Harvesting placed on axes of function and cost of acceptance]
5
Z39.50

http://www.loc.gov/z3950/agency/

6
Aims of Z39.50

Permits one computer, the client, to search and
retrieve information on another, the database
server
Important both technically and for its wide use
in library systems
Most development has concentrated on
bibliographic data
Most implementations emphasize searches that use
a bibliographic set of attributes to search
databases of MARC records

7
Technical history

Z39.50
Developed for X.25 networks (connection-oriented);
conversion to run over TCP was fitted later
Original concept dates from the days (about 1980)
when repeating a search was an expensive computation
WAIS is a stateless derivative of an early
version of Z39.50

8
Z39.50 principles

Abstract view of database searching.
Server stores a set of databases with searchable
indexes
Interactions are based on a session
The client opens a connection with the server,
carries out a sequence of interactions and then
closes the connection.
During the course of the session, both the server
and the client remember the state of their
interaction.

9
State

Z39.50
The server carries out the search and builds a
results set
Server saves the results set.
Subsequent messages from the client can reference
the results set.
Thus the client can modify a large set by
increasingly precise requests, or can request a
presentation of any record in the set, without
searching the entire database.
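
The saved results set described above can be sketched with a toy in-memory session. This is only an illustration of the idea of server-side state, not the Z39.50 protocol itself; the class name `ToySession`, its methods, and the sample records are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToySession:
    """Hypothetical stand-in for a stateful search session (not real Z39.50)."""
    records: list
    result_sets: dict = field(default_factory=dict)

    def search(self, name, field_name, value, within=None):
        """Run a search, save the hits under `name`, and return the hit count."""
        # Searching `within` a saved set scans only that set, not the whole database.
        if within is None:
            candidates = range(len(self.records))
        else:
            candidates = self.result_sets[within]
        hits = [i for i in candidates
                if value.lower() in self.records[i].get(field_name, "").lower()]
        self.result_sets[name] = hits
        return len(hits)

    def present(self, name, start, count):
        """Return `count` records from a saved set, starting at 1-based `start`."""
        ids = self.result_sets[name][start - 1:start - 1 + count]
        return [self.records[i] for i in ids]


books = [
    {"title": "Evangeline", "author": "Longfellow"},
    {"title": "The Song of Hiawatha", "author": "Longfellow"},
    {"title": "Walden", "author": "Thoreau"},
]
session = ToySession(books)
n_author = session.search("r1", "author", "longfellow")
n_both = session.search("r2", "title", "evangeline", within="r1")
first = session.present("r2", 1, 1)
```

The second search narrows the saved set `r1` instead of scanning all records again, which is the saving the slide attributes to keeping state on the server.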

10
Z39.50 services
init -- client connects to the server and
exchanges initial information, e.g., preferred
message size
explain -- client inquires of the server what
databases are available for searching, the fields
that are available, the syntax and formats
supported, and other options
search -- client presents a query to a database
  choices of syntax for specifying searches
  only Boolean queries widely implemented
  one or more records may be returned to the client
11
Z39.50 services
manipulation of results sets -- e.g., sort or
delete
present -- requests the server to send specified
records from the results set to the client in a
specified format
  options for controlling content and formats
  options for managing large records or large results sets
12
Sample query
In the database named "Books" find all records
for which the access point title contains the
value "evangeline" and the access point author
contains the value "longfellow".
Z39.50 defines a rich variety of search access
points that can be extended by implementers
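
The sample query above is a Boolean combination of two access-point conditions. A minimal sketch of that structure follows. It does not reproduce the Z39.50 query syntax or attribute sets; the `Term`, `And`, and `Or` classes, the `search` helper, and the record layout are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    access_point: str   # e.g. "title" or "author"
    value: str          # substring the access point must contain


@dataclass(frozen=True)
class And:
    left: object
    right: object


@dataclass(frozen=True)
class Or:
    left: object
    right: object


def matches(query, record):
    """Evaluate a Boolean query tree against one record (a plain dict)."""
    if isinstance(query, Term):
        return query.value.lower() in record.get(query.access_point, "").lower()
    if isinstance(query, And):
        return matches(query.left, record) and matches(query.right, record)
    if isinstance(query, Or):
        return matches(query.left, record) or matches(query.right, record)
    raise TypeError(f"unsupported query node: {query!r}")


def search(databases, database_name, query):
    """Return every record in the named database that satisfies the query."""
    return [r for r in databases[database_name] if matches(query, r)]


databases = {
    "Books": [
        {"title": "Evangeline", "author": "Longfellow, Henry Wadsworth"},
        {"title": "Evangeline's Garden", "author": "Someone Else"},
        {"title": "Paul Revere's Ride", "author": "Longfellow, Henry Wadsworth"},
    ]
}
sample = And(Term("title", "evangeline"), Term("author", "longfellow"))
hits = search(databases, "Books", sample)
```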
13
Simple Digital Library Interoperability Protocol

http://www-diglib.stanford.edu/testbed/doc2/SDLIP/

14
SDLIP

Compromise between a full-scale, all-encompassing
search middleware design such as Z39.50 and the
anything-goes approach typical of ad hoc
search interface design on the web
Developed jointly by Stanford, Berkeley, and UC
Santa Barbara
Heavily influenced by DASL from the IETF

15
SDLIP search middleware
16
Managing complexity through separate interfaces
17
SDLIP Interfaces

Search Interface: defines a simple query language;
the protocol can then include other languages
Result Interface: parking meter metaphor
supports varying notions of results sets
Source Metadata Interface: provides an extension
mechanism through discovery of server capabilities
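
One way to read the parking meter metaphor is that the server keeps a results set only for a limited, paid-for period, and the client must add time to keep it alive. The sketch below assumes that reading. The class `MeteredResults` and its method names are hypothetical and are not taken from the SDLIP specification.

```python
class MeteredResults:
    """Hypothetical result store whose sets expire unless the client adds time."""

    def __init__(self, clock, max_lease=60.0):
        self._clock = clock          # callable returning the current time in seconds
        self._max_lease = max_lease  # the server may grant less time than requested
        self._sets = {}              # set id -> (expiry time, list of documents)

    def store(self, set_id, docs, requested_seconds):
        """Save a results set and return the number of seconds actually granted."""
        granted = min(requested_seconds, self._max_lease)
        self._sets[set_id] = (self._clock() + granted, list(docs))
        return granted

    def get(self, set_id):
        """Return the documents, or None once the lease has run out."""
        entry = self._sets.get(set_id)
        if entry is None:
            return None
        expiry, docs = entry
        if self._clock() >= expiry:
            del self._sets[set_id]
            return None
        return docs

    def feed_meter(self, set_id, requested_seconds):
        """Extend a live set; return the granted seconds, or 0.0 if it has expired."""
        docs = self.get(set_id)
        if docs is None:
            return 0.0
        granted = min(requested_seconds, self._max_lease)
        self._sets[set_id] = (self._clock() + granted, docs)
        return granted


now = [0.0]                                   # fake clock so the example is deterministic
store = MeteredResults(clock=lambda: now[0], max_lease=30.0)
granted = store.store("q1", ["doc1", "doc2"], requested_seconds=120.0)
now[0] = 20.0
alive = store.get("q1")
store.feed_meter("q1", 30.0)                  # lease now runs until t = 50
now[0] = 45.0
still_alive = store.get("q1")
now[0] = 60.0
expired = store.get("q1")
```

A server that keeps no state can grant zero seconds and a fully stateful one can grant long leases, which is one way a single interface could support varying notions of results sets.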

18
Open Archives Initiative Metadata Harvesting
Protocol

http://www.openarchives.org

19
OAI Metadata Harvesting Protocol

Low-barrier framework for repository
interoperability
Minimal burden for data providers
Plug-in concept to allow community and service
specialization

20
Metadata Harvesting
[Figure: diagram with labels "metadata" and "e-print"]
22
OAI core concepts

low-barrier interoperability
data-provider / service-provider model
metadata harvesting model
shared metadata format and parallel,
community-specific metadata formats

[Figure: OAI 1.0 protocol, HTTP based, carrying Dublin Core and community-specific metadata]
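
Because the protocol is HTTP based, a harvesting request can be pictured as a base URL plus a verb and a few arguments. The sketch below builds such URLs. The base URL is hypothetical, and the verb `ListRecords`, the arguments `metadataPrefix` and `from`, and the Dublin Core prefix `oai_dc` follow the published OAI protocol as best recalled; the exact names for version 1.0 should be checked against the specification.

```python
from urllib.parse import urlencode

BASE_URL = "http://archive.example.org/oai"   # hypothetical data provider


def harvest_url(base_url, verb, arguments=None):
    """Build an HTTP GET request URL from a base URL, a verb, and arguments."""
    params = {"verb": verb}
    params.update(arguments or {})
    return base_url + "?" + urlencode(params)


# Shared format: ask for Dublin Core records changed since a given date.
dc_url = harvest_url(BASE_URL, "ListRecords",
                     {"metadataPrefix": "oai_dc", "from": "2001-01-01"})

# Parallel, community-specific format: same request, different prefix.
# "community_format" is a made-up prefix used only for illustration.
community_url = harvest_url(BASE_URL, "ListRecords",
                            {"metadataPrefix": "community_format"})
```

A service provider would fetch such URLs periodically and index the returned metadata locally, which is what keeps the burden on data providers low.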
23
Some thoughts

There is not (and never will be) one right solution
(technical vs. cost vs. complexity vs. ??)
Distributed technical solutions have
organizational ramifications
Distributed resource discovery (as with any
distributed computer solution) entails various
tradeoffs

24
Distributed Searching Issues

Global Distribution

25
Broadcast Distributed Search
26
Backup Index server
backup index
27
Deploying Collection Globally

Internet connectivity varies considerably
Good connectivity between nodes often does not
correspond to geographic proximity
Connectivity Region - a group of nodes on the
network that among them have good connectivity,
relative to nodes outside of the region.

28
Connectivity Regions

When possible, route queries within the region
In case of failure, use an alternate, either
within the region or in a nearby region
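
The two rules above amount to an ordered failover search. A minimal sketch, assuming each region lists its servers and its nearby regions in order of preference; the region names, server names, and the `is_up` check are all hypothetical.

```python
def route(client_region, servers, nearby, is_up):
    """Pick a server: same region first, then nearby regions in listed order."""
    for region in [client_region] + nearby.get(client_region, []):
        for server in servers.get(region, []):
            if is_up(server):
                return server
    return None   # no reachable server in the region or its neighbours


servers = {"region-a": ["a1", "a2"], "region-b": ["b1"]}
nearby = {"region-a": ["region-b"], "region-b": ["region-a"]}

down = {"a1"}
in_region = route("region-a", servers, nearby, lambda s: s not in down)

down = {"a1", "a2"}
fallback = route("region-a", servers, nearby, lambda s: s not in down)
```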

29
Distributed Searching Issues

Query Routing

30
Routing Problem: Disjoint Indexes
Content Summary
Hopcroft: I1, I3
Hartmanis: I3
Tarjan: I1, I2
Wilensky: I2
[Figure: indexes I1, I2, I3, with labels "I1,I3", "doc1, doc2", and "doc8"]
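
With disjoint indexes, the content summary tells the router which indexes are worth querying for a given term. A minimal sketch using the summary from this slide; the function name `indexes_for` and its `mode` parameter are assumptions for illustration.

```python
# Content summary from the slide: term -> indexes that hold documents for it.
CONTENT_SUMMARY = {
    "Hopcroft": {"I1", "I3"},
    "Hartmanis": {"I3"},
    "Tarjan": {"I1", "I2"},
    "Wilensky": {"I2"},
}


def indexes_for(terms, summary=CONTENT_SUMMARY, mode="any"):
    """Return the indexes to query for the given terms.

    mode="any": indexes holding at least one term (OR query).
    mode="all": indexes holding every term (AND query).
    """
    sets = [summary.get(term, set()) for term in terms]
    if not sets:
        return set()
    if mode == "all":
        return set.intersection(*sets)
    return set.union(*sets)


hopcroft = indexes_for(["Hopcroft"])
both = indexes_for(["Hopcroft", "Tarjan"], mode="all")
either = indexes_for(["Hartmanis", "Wilensky"])
```

A query for Hopcroft goes only to I1 and I3, so I2 is never contacted.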
31
Routing Problem: Replicated Distributed Indexes
[Figure: two replicas of the same index, each holding Tarjan: doc6 and Wilensky: doc7]
32
Routing Issues

Choice of primary?, secondary?, etc.
Fault-tolerance
Routing Factors
Performance-based
Freshness-based
Cost-based
weighted mix based on user preference
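
The weighted mix can be sketched as a single score per replica. The sketch assumes each factor has already been normalized to a penalty between 0 and 1, where lower is better; the replica names, penalty values, and weights are invented for illustration.

```python
def pick_replica(replicas, weights):
    """Return the replica with the lowest weighted penalty.

    replicas: name -> {factor: penalty in [0, 1], lower is better}
    weights:  factor -> user preference weight
    """
    def score(name):
        return sum(weights[f] * replicas[name][f] for f in weights)

    # sorted() makes tie-breaking deterministic (alphabetical by name).
    return min(sorted(replicas), key=score)


replicas = {
    "primary": {"performance": 0.8, "freshness": 0.1, "cost": 0.5},
    "mirror": {"performance": 0.2, "freshness": 0.6, "cost": 0.5},
}

wants_speed = {"performance": 0.7, "freshness": 0.2, "cost": 0.1}
wants_fresh = {"performance": 0.1, "freshness": 0.8, "cost": 0.1}

fast_choice = pick_replica(replicas, wants_speed)
fresh_choice = pick_replica(replicas, wants_fresh)
```

The same two replicas rank differently depending on the user's weights, which is the point of mixing the factors instead of fixing one.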

33
Components of Replicated Routing Problem

Metadata Issue: metadata made available by
indexer to aid in routing
Metadata Distribution Issue: topology of metadata
repositories
Decision Issue: routing decision algorithms
Fault-tolerance: use of backup indexers

34
Distributed Metadata for Query Routing
central metadata store
35
Performance-based Routing
Timed low pass filter
[Figure: average response time and predicted response time plotted over time; axis labels include "present" and "8T"]
New = low pass filter(T, actual response time, old)
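
The slide gives only the signature of the filter, so the sketch below fills in one common choice: an exponential low-pass filter in which T is treated as a time constant. That interpretation, the extra `dt` parameter (time since the previous measurement), and the sample numbers are assumptions, not the lecture's definition.

```python
import math


def low_pass_filter(T, actual, old, dt=1.0):
    """Blend a new measurement into the old prediction.

    T:      time constant (assumed meaning); larger T means slower adaptation
    actual: latest measured response time
    old:    previous predicted response time
    dt:     time elapsed since the previous measurement (hypothetical parameter)
    """
    alpha = 1.0 - math.exp(-dt / T)      # weight given to the new measurement
    return old + alpha * (actual - old)


# One update moves the prediction part of the way toward the measurement.
one_step = low_pass_filter(T=8.0, actual=2.0, old=1.0)

# Repeated identical measurements pull the prediction to the measured value.
prediction = 1.0
for _ in range(200):
    prediction = low_pass_filter(T=8.0, actual=2.0, old=prediction)
```

A router could keep one such prediction per index server and prefer the server with the lowest predicted response time.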
