1
Distributed Information Discovery
Lecture 14
  • CS 430
  • Carl Lagoze 2001-03-08

2
Goals and Motivation
  • Lesson from the Web: relevant and valuable
    information is everywhere
  • Rethinking the library in the digital age
  • Not as collector of information
  • Rather as access point to distributed information
  • Perfect scenario: uniform access to all
    information with rich functionality

3
Problems with the Perfect Scenario
  • Heterogeneity: what is the structure of the
    information we wish to discover?
  • Reliability: machines, networks, and
    organizations are sometimes (often) flaky
  • Complexity: cost vs. functionality tradeoff

4
Function versus cost of acceptance
[Figure: plot with axes "Function" and "Cost of acceptance", positioning Z39.50, SDLIP, and Metadata Harvesting]

5
Z39.50
  • http://www.loc.gov/z3950/agency/

6
Aims of Z39.50
  • Permits one computer, the client, to search and
    retrieve information on another, the database
    server
  • Important both technically and for its wide use
    in library systems
  • Most development has concentrated on
    bibliographic data
  • Most implementations emphasize searches that use
    a bibliographic set of attributes to search
    databases of MARC records

7
Technical history
  • Z39.50
  • Developed for X.25 networks (connection
    oriented); support for running over TCP was
    fitted later
  • Original concept dates from about 1980, when
    repeating a search was computationally expensive
  • WAIS is a stateless derivative of an early
    version of Z39.50

8
Z39.50 principles
  • Abstract view of database searching.
  • Server stores a set of databases with searchable
    indexes
  • Interactions are based on a session
  • The client opens a connection with the server,
    carries out a sequence of interactions and then
    closes the connection.
  • During the course of the session, both the server
    and the client remember the state of their
    interaction.

9
State
  • Z39.50
  • The server carries out the search and builds a
    results set
  • Server saves the results set.
  • Subsequent messages from the client can reference
    the results set.
  • Thus the client can narrow a large set with
    increasingly precise requests, or can request a
    presentation of any record in the set, without
    searching the entire database again.
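The session state described above can be illustrated with a toy model. This is only a sketch of the idea of server-side results sets, not the Z39.50 wire protocol; the class name `ToySession`, its methods, and the sample records are all assumptions made for illustration.

```python
class ToySession:
    """Toy model of a stateful search session (not the Z39.50 wire protocol)."""

    def __init__(self, records):
        self.records = records      # the "database": a list of dicts
        self.result_sets = {}       # server-side state: set name -> record positions

    def search(self, name, field, term, within=None):
        """Build and save a named results set; optionally refine an earlier set."""
        if within is not None:
            candidates = self.result_sets[within]   # reuse saved state
        else:
            candidates = range(len(self.records))   # scan the whole database
        hits = [i for i in candidates
                if term.lower() in self.records[i].get(field, "").lower()]
        self.result_sets[name] = hits
        return len(hits)

    def present(self, name, start, count):
        """Return `count` records from a saved set, starting at 1-based `start`."""
        positions = self.result_sets[name][start - 1:start - 1 + count]
        return [self.records[i] for i in positions]


# Hypothetical sample data.
records = [
    {"title": "Evangeline", "author": "Longfellow"},
    {"title": "The Song of Hiawatha", "author": "Longfellow"},
    {"title": "Leaves of Grass", "author": "Whitman"},
]
session = ToySession(records)
n_all = session.search("A", "author", "longfellow")                  # full search
n_refined = session.search("B", "title", "evangeline", within="A")   # refine set A
first = session.present("B", 1, 1)                                   # fetch from saved set
```

The refinement step only looks at the positions saved in set "A", which is the point of keeping state on the server: the second request does not repeat the first search.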

10
Z39.50 services
  • init -- client connects to the server and
    exchanges initial information, e.g., preferred
    message size
  • explain -- client inquires of the server what
    databases are available for searching, the
    fields that are available, the syntax and
    formats supported, and other options
  • search -- client presents a query to a database
  • choices of syntax for specifying searches
  • only Boolean queries widely implemented
  • one or more records may be returned to the
    client

11
Z39.50 services
  • manipulation of results sets -- e.g., sort or
    delete
  • present -- requests the server to send
    specified records from the results set to the
    client in a specified format
  • options for controlling content and formats
  • options for managing large records or large
    results sets

12
Sample query
  • In the database named "Books", find all records
    for which the access point title contains the
    value "evangeline" and the access point author
    contains the value "longfellow".
  • Z39.50 defines a rich variety of search access
    points that can be extended by implementers
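The sample query is a Boolean AND of two access-point terms. As a hedged illustration, the sketch below represents it as a small tree and renders it in Prefix Query Format (PQF), a textual notation used by the YAZ toolkit, with Bib-1 Use attributes 4 (title) and 1003 (author). The choice of PQF and the helper names `Term`, `And`, and `to_pqf` are assumptions, not something the slides specify; the database name "Books" would be supplied separately in the search request.

```python
from dataclasses import dataclass
from typing import Union

# Bib-1 "Use" attribute numbers for two common access points.
BIB1_USE = {"title": 4, "author": 1003}


@dataclass
class Term:
    access_point: str
    value: str


@dataclass
class And:
    left: "Query"
    right: "Query"


Query = Union[Term, And]


def to_pqf(q: Query) -> str:
    """Render a query tree in prefix notation (operator first, then operands)."""
    if isinstance(q, Term):
        return f'@attr 1={BIB1_USE[q.access_point]} "{q.value}"'
    return f"@and {to_pqf(q.left)} {to_pqf(q.right)}"


query = And(Term("title", "evangeline"), Term("author", "longfellow"))
pqf = to_pqf(query)
print(pqf)  # @and @attr 1=4 "evangeline" @attr 1=1003 "longfellow"
```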
13
Simple Digital Library Interoperability Protocol
  • http://www-diglib.stanford.edu/testbed/doc2/SDLIP/

14
SDLIP
  • Compromise between a full-scale, all-encompassing
    search middleware design such as Z39.50 and the
    anything-goes approach typical of ad hoc
    search interface design on the web
  • Developed jointly by Stanford, Berkeley, and UC
    Santa Barbara
  • Heavily influenced by DASL from the IETF

15
SDLIP search middleware
16
Managing complexity through separate interfaces
17
SDLIP Interfaces
  • Search Interface: defines a simple query language;
    the protocol can then include other languages
  • Result Interface: "parking meter" metaphor
    supports varying notions of results sets
  • Source Metadata Interface: provides an extension
    mechanism through discovery of server capabilities

18
Open Archives Initiative Metadata Harvesting
Protocol
  • http://www.openarchives.org

19
OAI Metadata Harvesting Protocol
  • Low-barrier framework for repository
    interoperability
  • Minimal burden for data providers
  • Plug-in concept to allow community and service
    specialization

20
Metadata Harvesting
[Figure: diagram with the labels "metadata" and "e-print"]
22
OAI core concepts
  • low-barrier interoperability
  • data-provider / service-provider model
  • metadata harvesting model
  • shared metadata format and parallel,
    community-specific metadata formats

[Figure labels:]
  • OAI 1.0 protocol: HTTP based
  • Shared metadata format: Dublin Core
  • Parallel metadata formats: community specific
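Because the protocol is HTTP based, a harvesting request is just a URL with a verb and a few arguments. The sketch below builds a `ListRecords` request for Dublin Core metadata (`oai_dc`). The verb and argument names follow the published OAI protocol; the repository base URL and the helper name `list_records_url` are hypothetical, and no network request is made.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", **optional):
    """Build a harvesting URL; `optional` carries extra arguments such as `from`."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    params.update(optional)
    return f"{base_url}?{urlencode(params)}"


# Hypothetical data provider; "from" is a Python keyword, so pass it via a dict.
url = list_records_url("http://repository.example.org/oai", **{"from": "2001-01-01"})
print(url)
```

A service provider would issue such requests periodically and index the returned metadata locally, which keeps the burden on the data provider small.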
23
Some thoughts
  • There is not (and never will be) one right
    solution (technical vs. cost vs. complexity vs. ??)
  • Distributed technical solutions have
    organizational ramifications
  • Distributed resource discovery (as with any
    distributed computer solution) entails various
    tradeoffs

24
Distributed Searching Issues
  • Global Distribution

25
Broadcast Distributed Search
26
Backup Index server
backup index
27
Deploying Collection Globally
  • Internet connectivity varies considerably
  • Good connectivity between nodes often does not
    correspond to geographic proximity
  • Connectivity Region - a group of nodes on the
    network that among them have good connectivity,
    relative to nodes outside of the region.

28
Connectivity Regions
  • When possible, route queries within the region
  • In case of failure, use an alternate, either
    within the region or in a nearby region
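The two rules above amount to an ordered search: try servers in the client's own connectivity region first, then fall back to nearby regions. A minimal sketch, assuming a static map of regions to servers and a caller-supplied liveness check; the function name, region names, and server names are all hypothetical.

```python
def pick_server(client_region, servers, nearby, is_up):
    """Return the first live server, preferring the client's own region.

    servers: dict mapping region -> list of server names
    nearby:  dict mapping region -> ordered list of neighbouring regions
    is_up:   callable(server_name) -> bool
    """
    for region in [client_region] + nearby.get(client_region, []):
        for server in servers.get(region, []):
            if is_up(server):
                return server
    return None  # no reachable server in this or any nearby region


# Hypothetical deployment.
servers = {"us-east": ["a1", "a2"], "europe": ["e1"]}
nearby = {"us-east": ["europe"]}

down = {"a1"}
same_region = pick_server("us-east", servers, nearby, lambda s: s not in down)

down = {"a1", "a2"}
other_region = pick_server("us-east", servers, nearby, lambda s: s not in down)
```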

29
Distributed Searching Issues
  • Query Routing

30
Routing Problem: Disjoint Indexes
Content Summary
  • Hopcroft: I1, I3
  • Hartmanis: I3
  • Tarjan: I1, I2
  • Wilensky: I2
[Figure labels: indexes I1, I2, I3; routing result "I1,I3"; documents doc1, doc2, doc8]
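With disjoint indexes, the content summary tells the router which indexes can possibly answer a query, so only those need to be contacted. The sketch below uses the summary from the slide (Hopcroft: I1, I3; Hartmanis: I3; Tarjan: I1, I2; Wilensky: I2); the per-index document lists are illustrative assumptions, since the transcript does not preserve the full figure.

```python
# Content summary from the slide: term -> indexes that hold matching entries.
CONTENT_SUMMARY = {
    "Hopcroft": ["I1", "I3"],
    "Hartmanis": ["I3"],
    "Tarjan": ["I1", "I2"],
    "Wilensky": ["I2"],
}

# Assumed index contents, chosen only to be consistent with the summary.
INDEXES = {
    "I1": {"Hopcroft": ["doc1", "doc2"], "Tarjan": ["doc9"]},
    "I2": {"Tarjan": ["doc6"], "Wilensky": ["doc7"]},
    "I3": {"Hopcroft": ["doc8"], "Hartmanis": ["doc3", "doc4"]},
}


def route(term):
    """Return the indexes that must be queried for `term`."""
    return CONTENT_SUMMARY.get(term, [])


def search(term):
    """Query only the routed indexes and merge their results."""
    results = []
    for index_name in route(term):
        results.extend(INDEXES[index_name].get(term, []))
    return results


hopcroft_route = route("Hopcroft")
hopcroft_docs = search("Hopcroft")
```

Compared with broadcasting the query to every index, routing on the summary skips I2 entirely for a Hopcroft query.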
31
Routing Problem: Replicated Distributed Indexes
[Figure: two replicated indexes, each containing Tarjan: doc6 and Wilensky: doc7]
32
Routing Issues
  • Choice of primary?, secondary?, etc.
  • Fault-tolerance
  • Routing factors: performance-based,
    freshness-based, cost-based, or a weighted mix
    based on user preference
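One way to read "weighted mix based on user preference" is as a weighted sum of per-factor penalties, with the lowest total chosen as primary and the next as secondary. The sketch below is an assumption about how such a mix could be computed, not a method given in the slides; the factor names, the normalisation by each factor's maximum, and the sample numbers are all hypothetical.

```python
FACTORS = ("response_time", "staleness", "cost")


def rank_replicas(replicas, weights):
    """Order replica names from best (primary) to worst; lower penalty is better."""
    # Normalise each factor by its maximum so that units do not dominate the mix.
    maxima = {f: max(r[f] for r in replicas.values()) or 1.0 for f in FACTORS}

    def penalty(name):
        replica = replicas[name]
        return sum(weights[f] * replica[f] / maxima[f] for f in FACTORS)

    return sorted(replicas, key=penalty)


# Hypothetical measurements: seconds, hours since last index update, price.
replicas = {
    "A": {"response_time": 0.2, "staleness": 24.0, "cost": 0.0},
    "B": {"response_time": 1.0, "staleness": 1.0, "cost": 0.0},
}

prefer_speed = rank_replicas(replicas, {"response_time": 1.0, "staleness": 0.0, "cost": 0.0})
prefer_fresh = rank_replicas(replicas, {"response_time": 0.0, "staleness": 1.0, "cost": 0.0})
```

The same measurements yield a different primary depending on the user's weights, which is why the choice of primary and secondary is listed as an open routing issue.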

33
Components of Replicated Routing Problem
  • Metadata Issue: metadata made available by
    indexer to aid in routing
  • Metadata Distribution Issue: topology of metadata
    repositories
  • Decision Issue: routing decision algorithms
  • Fault-tolerance: use of backup indexers

34
Distributed Metadata for Query Routing
central metadata store
35
Performance-based Routing
  • Timed low-pass filter over the average response
    time (filter parameter T)
  • Predicted response time:
    new = low pass filter(T, actual response time, old)
[Figure: response-time plot; surviving axis label "present - 8 T"]
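The slide gives only the filter's signature, low pass filter(T, actual response time, old), not its formula. A minimal sketch, assuming a first-order exponential filter with time constant T; the extra `dt` argument (time elapsed since the previous measurement) and the smoothing formula are assumptions, not taken from the slides.

```python
import math


def low_pass_filter(T, actual, old, dt=1.0):
    """Blend a new measurement into the running prediction.

    T:      time constant; a larger T gives a smoother, slower-moving prediction
    actual: response time just observed
    old:    previous predicted response time
    dt:     time since the previous observation (assumed parameter)
    """
    alpha = 1.0 - math.exp(-dt / T)   # weight given to the new measurement
    return old + alpha * (actual - old)


# A server whose response time jumps from 1.0 s to 3.0 s.
predicted = 1.0
history = []
for _ in range(200):
    predicted = low_pass_filter(8.0, 3.0, predicted)
    history.append(predicted)
```

A single slow response moves the prediction only slightly, so one outlier does not reroute traffic, while a sustained change is tracked within a few multiples of T.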