Title: Distributed Information Discovery
1Distributed Information Discovery
Lecture 14
- CS 430
- Carl Lagoze 2001-03-08
2Goals and Motivation
- Lesson from the Web: relevant and valuable information is everywhere
- Rethinking the library in the digital age
- Not as collector of information
- Rather as access point to distributed information
- Perfect scenario: uniform access to all information with rich functionality
3Problems with the Perfect Scenario
- Heterogeneity: what is the structure of the information we wish to discover?
- Reliability: machines, networks, and organizations are sometimes (often) flaky
- Complexity: cost vs. functionality tradeoff
4Function versus cost of acceptance
[Figure: function plotted against cost of acceptance for Z39.50, SDLIP, and Metadata Harvesting]
5Z39.50
- http://www.loc.gov/z3950/agency/
6Aims of Z39.50
- Permits one computer, the client, to search and retrieve information on another, the database server
- Important both technically and for its wide use in library systems
- Most development has concentrated on bibliographic data
- Most implementations emphasize searches that use a bibliographic set of attributes to search databases of MARC records
7Technical history
- Z39.50
- Developed for X.25 networks (connection-oriented); conversion to run over TCP was fitted later
- Original concept dates from the days when repeating a search was an expensive computation (about 1980)
- WAIS is a stateless derivative of an early version of Z39.50
8Z39.50 principles
- Abstract view of database searching.
- Server stores a set of databases with searchable indexes
- Interactions are based on a session
- The client opens a connection with the server, carries out a sequence of interactions and then closes the connection.
- During the course of the session, both the server and the client remember the state of their interaction.
9State
- Z39.50
- The server carries out the search and builds a results set
- Server saves the results set.
- Subsequent messages from the client can reference the results set.
- Thus the client can modify a large set by increasingly precise requests, or can request a presentation of any record in the set, without searching the entire database.
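The stateful behaviour described on this slide can be illustrated with a small sketch. This is a toy in-memory model, not the Z39.50 wire protocol: the class `ToySession`, its method names, and the sample records are all hypothetical. It only shows the idea that a saved results set can be narrowed and presented without searching the whole database again.

```python
from dataclasses import dataclass, field


@dataclass
class ToySession:
    """Hypothetical session object; the server side remembers named results sets."""
    records: list
    result_sets: dict = field(default_factory=dict)

    def search(self, name, field_name, term, within=None):
        """Search the whole database, or only an earlier results set."""
        candidates = self.result_sets[within] if within else range(len(self.records))
        hits = [i for i in candidates
                if term.lower() in self.records[i][field_name].lower()]
        self.result_sets[name] = hits  # saved for later requests in this session
        return len(hits)

    def present(self, name, start, count):
        """Return `count` records, beginning at 1-based position `start`."""
        ids = self.result_sets[name][start - 1:start - 1 + count]
        return [self.records[i] for i in ids]


# Made-up sample data for illustration only.
books = [
    {"title": "Evangeline", "author": "Longfellow"},
    {"title": "The Song of Hiawatha", "author": "Longfellow"},
    {"title": "Leaves of Grass", "author": "Whitman"},
]
session = ToySession(books)
broad = session.search("set1", "author", "longfellow")
narrow = session.search("set2", "title", "evangeline", within="set1")
first = session.present("set2", 1, 1)
```

The second search only examines the records already held in `set1`, which is the saving that mattered when a full search was an expensive computation.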
10Z 39.50 services
- init -- client connects to the server and exchanges initial information, e.g., preferred message size
- explain -- client inquires of the server what databases are available for searching, the fields that are available, the syntax and formats supported, and other options
- search -- client presents a query to a database
- choices of syntax for specifying searches
- only Boolean queries widely implemented
- one or more records may be returned to the client
11Z 39.50 services
- manipulation of results sets -- e.g., sort or delete
- present -- requests the server to send specified records from the results set to the client in a specified format
- options for controlling content and formats, and for managing large records or large results sets
12Sample query
In the database named "Books" find all records for which the access point title contains the value "evangeline" and the access point author contains the value "longfellow".
Z39.50 defines a rich variety of search access points that can be extended by implementers.
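As a rough illustration, the sample query can be modelled as a small Boolean tree and evaluated against an in-memory database. The class names `Term` and `And`, the helper functions, and the sample records are hypothetical; a real Z39.50 server would evaluate the query against its own indexes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    access_point: str  # e.g. "title" or "author"
    value: str


@dataclass(frozen=True)
class And:
    left: object
    right: object


def matches(query, record):
    """Return True if the record satisfies the Boolean query."""
    if isinstance(query, Term):
        return query.value.lower() in record.get(query.access_point, "").lower()
    if isinstance(query, And):
        return matches(query.left, record) and matches(query.right, record)
    raise TypeError(f"unsupported query node: {query!r}")


def search(database, query):
    return [record for record in database if matches(query, record)]


# Made-up contents for the database named "Books".
books = [
    {"title": "Evangeline", "author": "Longfellow"},
    {"title": "The Song of Hiawatha", "author": "Longfellow"},
    {"title": "Leaves of Grass", "author": "Whitman"},
]
query = And(Term("title", "evangeline"), Term("author", "longfellow"))
hits = search(books, query)
```

In the Prefix Query Format used by some Z39.50 toolkits, the same query is commonly written as `@and @attr 1=4 evangeline @attr 1=1003 longfellow`, where 4 and 1003 are the Bib-1 use attributes for title and author.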
13Simple Digital Library Interoperability Protocol
- http://www-diglib.stanford.edu/testbed/doc2/SDLIP/
14SDLIP
- Compromise between a full-scale, all-encompassing search middleware design such as Z39.50 and the anything-goes approach typical of ad hoc search interface design on the web
- Developed jointly by Stanford, Berkeley, and UC Santa Barbara
- Heavily influenced by DASL from IETF
15SDLIP search middleware
16Managing complexity through separate interfaces
17SDLIP Interfaces
- Search Interface: defines a simple query language; protocol can then include other languages
- Result Interface: parking meter metaphor supports varying notions of results sets
- Source Metadata Interface: provides an extension mechanism through discovery of server capabilities
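One possible reading of the parking meter metaphor is that the server keeps a results set only for a paid-up period, and the client must "feed the meter" to keep it alive. The sketch below assumes that reading; the class `MeteredResultSet` and its methods are hypothetical and are not taken from the SDLIP specification. A clock function is injected so the example is deterministic.

```python
class MeteredResultSet:
    """Hypothetical results set that the server keeps only while its lease lasts."""

    def __init__(self, doc_ids, lease_seconds, clock):
        self._doc_ids = list(doc_ids)
        self._clock = clock
        self._expires_at = clock() + lease_seconds

    def expired(self):
        return self._clock() >= self._expires_at

    def extend(self, extra_seconds):
        """Feed the meter: ask the server to keep the set a while longer."""
        if self.expired():
            raise LookupError("results set lease has run out; search again")
        self._expires_at += extra_seconds

    def fetch(self, start, count):
        """Return `count` document ids beginning at 0-based offset `start`."""
        if self.expired():
            raise LookupError("results set lease has run out; search again")
        return self._doc_ids[start:start + count]


# A fake clock that the example advances by hand.
now = [0.0]
results = MeteredResultSet(["doc1", "doc2", "doc8"], lease_seconds=30, clock=lambda: now[0])

now[0] = 20.0
results.extend(30)           # lease now runs until t = 60
now[0] = 50.0
page = results.fetch(0, 2)   # still within the lease
now[0] = 61.0
lapsed = results.expired()   # the meter has run out
```

A server with a short lease behaves almost statelessly, while a long lease approaches the session model of Z39.50, which may be what "varying notions of results sets" refers to.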
18Open Archives Initiative Metadata Harvesting Protocol
- http://www.openarchives.org
19OAI Metadata Harvesting Protocol
- Low-barrier framework for repository interoperability
- Minimal burden for data providers
- Plug-in concept to allow community and service specialization
20Metadata Harvesting
[Figure: diagram with labels "metadata" and "e-print"]
21Metadata Harvesting
[Figure: diagram with labels "metadata" and "e-print"]
22OAI core concepts
- low-barrier interoperability
- data-provider service-provider model
- metadata harvesting model
- shared metadata format and parallel, community-specific metadata formats
[Figure labels: OAI 1.0 protocol (HTTP based); metadata formats: Dublin Core, community specific]
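A minimal sketch of what a service provider might do under the harvesting model: build an HTTP request for records, then pull Dublin Core fields out of the returned XML. The base URL is a placeholder, and the parameter names `verb`, `metadataPrefix`, and `from` are assumed from the OAI protocol as generally documented rather than confirmed by these slides. The XML is a simplified hand-written sample, not a real OAI response.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element namespace


def list_records_url(base_url, metadata_prefix, from_date=None):
    """Build a harvest request URL (parameter names are assumptions)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # harvest only records changed since this date
    return f"{base_url}?{urlencode(params)}"


def dc_titles(xml_text):
    """Collect every Dublin Core title found in the XML."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter(f"{{{DC_NS}}}title")]


# Simplified stand-in for a data provider's response.
SAMPLE = """<records xmlns:dc="http://purl.org/dc/elements/1.1/">
  <record><dc:title>Evangeline</dc:title><dc:creator>Longfellow</dc:creator></record>
  <record><dc:title>Leaves of Grass</dc:title></record>
</records>"""

url = list_records_url("http://archive.example.org/oai", "oai_dc", "2001-01-01")
titles = dc_titles(SAMPLE)
```

The data provider only has to answer simple HTTP requests like this one; the searching itself happens later, at the service provider, over the harvested metadata.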
23Some thoughts
- There is not (and never will be) one right solution (technical vs. cost vs. complexity vs. ??)
- Distributed technical solutions have organizational ramifications
- Distributed resource discovery (as with any distributed computer solution) entails various tradeoffs
24Distributed Searching Issues
25Broadcast Distributed Search
26Backup Index server
backup index
27Deploying Collection Globally
- Internet connectivity varies considerably
- Good connectivity between nodes often does not correspond to geographic proximity
- Connectivity Region: a group of nodes on the network that among them have good connectivity, relative to nodes outside of the region.
28Connectivity Regions
- When possible, route queries within the region
- In case of failure, use an alternate either within the region or in a nearby region
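The two routing rules above can be sketched as a simple preference order: servers in the client's own region first, then servers in nearby regions. The function name, the region and server names, and the `is_up` health check are all hypothetical.

```python
def route(query_region, servers, nearby, is_up):
    """Pick a server for a query.

    servers: region name -> list of server names, in preference order
    nearby:  region name -> list of neighbouring regions, nearest first
    is_up:   callable that reports whether a server is reachable
    """
    for region in [query_region, *nearby.get(query_region, [])]:
        for server in servers.get(region, []):
            if is_up(server):
                return server
    return None  # no reachable server in this or any nearby region


# Made-up topology for illustration.
servers = {"us-east": ["a1", "a2"], "europe": ["b1"]}
nearby = {"us-east": ["europe"]}

down = {"a1"}
same_region = route("us-east", servers, nearby, lambda s: s not in down)

down = {"a1", "a2"}
other_region = route("us-east", servers, nearby, lambda s: s not in down)
```

Because regions are defined by measured connectivity rather than geography, the `nearby` table would in practice come from network measurements, not from a map.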
29Distributed Searching Issues
30Routing Problem: Disjoint Indexes
Content Summary
- Hopcroft: I1, I3
- Hartmanis: I3
- Tarjan: I1, I2
- Wilensky: I2
[Figure labels: index servers I1, I2, I3; "I1,I3"; "doc1, doc2"; "doc8"]
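With disjoint indexes, the content summary on this slide tells the router which index servers can possibly answer a query, so the others need not be contacted. A minimal sketch using the slide's own author-to-index table follows; the function name `indexes_for` and the lower-casing of names are assumptions.

```python
# Author -> index servers, copied from the content summary on the slide.
CONTENT_SUMMARY = {
    "hopcroft": {"I1", "I3"},
    "hartmanis": {"I3"},
    "tarjan": {"I1", "I2"},
    "wilensky": {"I2"},
}


def indexes_for(authors, summary=CONTENT_SUMMARY):
    """Return the index servers that must be queried for these authors."""
    targets = set()
    for author in authors:
        targets |= summary.get(author.lower(), set())
    return sorted(targets)


hopcroft_targets = indexes_for(["Hopcroft"])
mixed_targets = indexes_for(["Hartmanis", "Wilensky"])
unknown_targets = indexes_for(["Knuth"])  # not in the summary: nothing to query
```

A query for Hopcroft is sent only to I1 and I3, which matches the "I1,I3" label in the figure.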
31Routing Problem: Replicated Distributed Indexes
[Figure: two replicated indexes, each holding the same entries: Tarjan doc6; Wilensky doc7]
32Routing Issues
- Choice of primary?, secondary?, etc.
- Fault-tolerance
- Routing Factors
- Performance-based
- Freshness-based
- Cost-based
- weighted mix based on user preference
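One way to combine the routing factors above is a weighted score per replica, with the ranking giving the primary, secondary, and so on. This is a sketch under assumptions: the metric names, the normalisation against the worst replica, and the sample numbers are illustrative and do not come from the lecture.

```python
def rank_replicas(replicas, weights):
    """Order replicas from best to worst; lower metric values are better.

    replicas: name -> {"response_ms": ..., "staleness_s": ..., "cost": ...}
    weights:  metric name -> weight reflecting the user's preference
    """
    def normalized(metric):
        worst = max(r[metric] for r in replicas.values()) or 1
        return {name: r[metric] / worst for name, r in replicas.items()}

    norm = {metric: normalized(metric) for metric in weights}

    def score(name):
        return sum(weights[m] * norm[m][name] for m in weights)

    return sorted(replicas, key=score)


# Made-up measurements: A is fast but stale, B is slow but fresh.
replicas = {
    "A": {"response_ms": 100, "staleness_s": 3600, "cost": 1.0},
    "B": {"response_ms": 400, "staleness_s": 60, "cost": 1.0},
}
speed_first = rank_replicas(replicas, {"response_ms": 0.8, "staleness_s": 0.1, "cost": 0.1})
fresh_first = rank_replicas(replicas, {"response_ms": 0.1, "staleness_s": 0.8, "cost": 0.1})
```

The first entry in each ranking would serve as the primary and the next as the fallback, which also covers the fault-tolerance item in the list.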
33Components of Replicated Routing Problem
- Metadata Issue: metadata made available by indexer to aid in routing
- Metadata Distribution Issue: topology of metadata repositories
- Decision Issue: routing decision algorithms
- Fault-tolerance: use of backup indexers
34Distributed Metadata for Query Routing
central metadata store
35Performance-based Routing
[Figure: time axis with labels "present" and "T"]
- Timed low pass filter
- Average response time
- Predicted response time
- New = low pass filter(T, actual response time, old)
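The slide does not give the filter's formula. A common form of a timed low-pass filter is exponential smoothing in which the weight of a new observation depends on the time elapsed since the last one. The sketch below assumes that form, and adds an elapsed-time argument `dt` that does not appear in the slide's signature.

```python
import math


def low_pass(T, actual, old, dt):
    """Blend a new response time into the running prediction.

    T:      time constant of the filter (same units as dt)
    actual: response time just observed
    old:    previous predicted response time
    dt:     time since the previous observation (an assumed extra parameter)
    """
    alpha = 1.0 - math.exp(-dt / T)  # a longer gap gives more weight to the new sample
    return old + alpha * (actual - old)


# Made-up observations: (seconds since last query, measured response time in ms).
predicted = 100.0
for dt, actual in [(5.0, 120.0), (5.0, 120.0), (60.0, 300.0)]:
    predicted = low_pass(10.0, actual, predicted, dt)
```

With this form, an old prediction fades after a long silence, so a server that has not been queried recently is judged mostly on its latest response.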