Title: Information Retrieval
1 Information Retrieval
Architecture of Information Retrieval Systems
2 Distributed Architecture 1: Standard Search Protocols
Strict adherence to standards allows any user interface to search any conforming search service.
[Figure: user interfaces send the query "Find x" to search services]
3 Distributed Architecture 1: Standard Search Protocols
Example: Z39.50 Family of Standards for Searching Library Catalogs
- Content: Anglo-American Cataloging Rules
- Structure of content: MARC
- Encoding rules: Base Encoding Rules (character sets, separators, etc.)
- Message passing protocol: Z39.50
- Query format: Bib-1 (Boolean), Type 102 (full text)
In addition, there are the underlying network standards, e.g. the Internet suite of protocols.
4 Z39.50 principles
- Servers store a set of databases with searchable indices.
- Interactions are based on a session.
- The client opens a connection with the server(s), carries out a sequence of interactions and then closes the connection.
- During the course of the session, both the server and the client remember the state of their interaction.
5 State
- Z39.50
  - The server carries out the search and builds a result set.
  - The server saves the result set.
  - Subsequent messages from the client can reference the result set.
  - Thus the client can narrow a large set by increasingly precise requests, or can request a presentation of any record in the set, without searching the entire database again.
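The session state described above can be sketched in Python. This is a toy in-memory model, not the Z39.50 wire protocol: the class `Session`, its methods `search`, `refine` and `present`, and the sample records are hypothetical names chosen only for illustration.

```python
class Session:
    """Toy server-side session that keeps named result sets between requests."""

    def __init__(self, records):
        self.records = records      # record id -> text of the record
        self.result_sets = {}       # result set name -> list of record ids

    def search(self, name, term):
        # Full search of the database; the result set is saved under `name`.
        hits = [rid for rid, text in self.records.items() if term in text.lower()]
        self.result_sets[name] = hits
        return len(hits)

    def refine(self, name, source, term):
        # Narrow a saved result set without searching the whole database again.
        hits = [rid for rid in self.result_sets[source]
                if term in self.records[rid].lower()]
        self.result_sets[name] = hits
        return len(hits)

    def present(self, name, start, count):
        # Return records from a saved result set by position (1-based).
        ids = self.result_sets[name][start - 1:start - 1 + count]
        return [(rid, self.records[rid]) for rid in ids]


session = Session({
    1: "Digital libraries and metadata",
    2: "Metadata harvesting protocols",
    3: "Library catalogs and MARC",
})
print(session.search("A", "metadata"))          # size of result set A
print(session.refine("B", "A", "harvesting"))   # size of the narrowed set B
print(session.present("B", 1, 1))               # first record of set B
```

The point of the sketch is that `refine` and `present` only touch the saved result set, which is why the server must remember state for the length of the session.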
6 Standard Search Protocols
- Example: Z39.50 Family of Standards for Searching Library Catalogs
- The Z39.50 family of standards has proved successful in a tightly knit community, where:
  - There is a strong tradition of standardization, with many professionally trained people.
  - The categories of material change gradually, allowing a slow-moving standardization process.
- The standardization approach has failed where these two criteria are not met.
- Historic note: WAIS was based on an early version of Z39.50.
7 Distributed Architecture 2: Broadcast Search (aka Federated Search)
An interface server broadcasts a query to each collection, combines the results and returns them to the user.
Examples: Dienst (digital library protocol), Web metasearch services
[Figure: the query "Find x" is broadcast to each collection]
8 Distributed Architecture 2: Broadcast Search
Interface Service: Can be a separate server (e.g., CGI), or run on the user's computer (e.g., applet).
Protocols: In the simple version, each collection must support the same standards and protocols (e.g., Z39.50, HTTP, etc.).
9 Distributed Architecture 2: Broadcast Search
Problems with Broadcast Search
- Performance: If any collection does not respond, the Interface Server waits for a timeout.
- Recall: If any collection does not respond, documents in that collection are not found.
- Ranking and duplicates: There are great difficulties in reconciling ranked lists from different collections.
Broadcast searching is as bad as its weakest link!
Conclusion: Broadcast search does not scale beyond about five or ten collections, even with strict standardization.
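The three problems above show up even in a very small sketch. The Python below is an illustration under stated assumptions, not how Dienst or any metasearch service is implemented: `broadcast_search`, the three fake collections, and the timeout values are all hypothetical, and each collection is modeled as a plain function returning `(doc_id, score)` pairs.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def broadcast_search(query, collections, timeout):
    """Send `query` to every collection in parallel and merge what returns in time.

    `collections` maps a collection name to a function query -> [(doc_id, score)].
    """
    pool = ThreadPoolExecutor(max_workers=len(collections))
    futures = {pool.submit(search, query): name
               for name, search in collections.items()}
    # Wait at most `timeout` seconds in total for all collections.
    done, not_done = wait(list(futures), timeout=timeout)
    pool.shutdown(wait=False)  # do not block on collections that are still running

    best = {}  # doc_id -> highest score seen, which also removes duplicates
    failed = [futures[f] for f in not_done]
    for future in done:
        if future.exception() is not None:
            failed.append(futures[future])
            continue
        for doc_id, score in future.result():
            best[doc_id] = max(score, best.get(doc_id, score))

    # Naive merge: raw scores from different collections are not really comparable.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return ranked, sorted(failed)


def fast_a(query):
    return [("doc1", 0.9), ("doc2", 0.4)]


def fast_b(query):
    return [("doc2", 0.7), ("doc3", 0.1)]


def slow(query):
    time.sleep(1.0)  # simulates a collection that does not answer in time
    return [("doc4", 1.0)]


results, missing = broadcast_search(
    "x", {"a": fast_a, "b": fast_b, "slow": slow}, timeout=0.2)
print(results)   # merged hits from the collections that answered
print(missing)   # collections whose documents were not found
```

Every query pays the full timeout whenever one collection is slow, `doc4` is silently lost, and sorting by raw score is only a placeholder for the hard problem of reconciling ranked lists.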
10 Standardization: Function Versus Cost of Acceptance
[Figure: function plotted against cost of acceptance, with regions labeled "Few adopters" and "Many adopters"]
11 Example: Textual Mark-up
[Figure: function plotted against cost of acceptance for SGML, XML, HTML and ASCII]
12 Distributed Architecture 3: Centralized Search Services
Batch indexing: Metadata about all items is accumulated in a central system.
Real-time searching: The user (a) searches the central system, and (b) retrieves items from collections.
Examples: Union catalogs, Web search services
[Figure: the user's query "Find x" is sent to the central system (search); items are fetched from the collections (retrieve)]
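The split between batch indexing and real-time searching can be sketched as follows. This is a minimal illustration, assuming each metadata record has just a `title` and a `url`; the functions `build_index` and `search`, the item identifiers, and the example URLs are hypothetical.

```python
from collections import defaultdict


def build_index(metadata):
    """Batch step: build an inverted index (term -> item ids) from item metadata."""
    index = defaultdict(set)
    for item_id, record in metadata.items():
        for term in record["title"].lower().split():
            index[term].add(item_id)
    return index


def search(index, query):
    """Real-time step (a): ids of items whose metadata contains every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)


# Hypothetical metadata accumulated from two collections, "a" and "b".
metadata = {
    "a:1": {"title": "Plate tectonics animation", "url": "http://a.example/1"},
    "a:2": {"title": "Cell biology tutorial", "url": "http://a.example/2"},
    "b:7": {"title": "Plate boundaries map", "url": "http://b.example/7"},
}
index = build_index(metadata)
hits = search(index, "plate")
# Real-time step (b): the item itself is retrieved from its home collection by URL.
print([(item_id, metadata[item_id]["url"]) for item_id in hits])
```

Only the metadata lives in the central system; a slow or unavailable collection affects retrieval of its own items but not the search itself.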
13 Distributed Architecture 3: Centralized Search Services
Gathering by Web Crawling
- Entirely automatic, low cost.
- Highly efficient at gathering very large amounts of material.
but ...
- Can only gather openly accessible materials.
- Cannot gather material in databases unless explicit URLs are known.
- Cannot easily make use of metadata provided by collections.
Examples: Web search services.
14 Distributed Architecture 3: Centralized Search Services
Harvesting
Each collection makes a copy of its metadata available from a server associated with the collection. A search service harvests metadata from all collections on a regular cycle and builds a central search system.
Advantages ...
- Can index material from databases without explicit URLs.
- Allows authentication and selection of material.
but ...
- Requires that collections have metadata and support a harvesting protocol (e.g., Open Archives Initiative Protocol for Metadata Harvesting).
15 Open Archives Initiative Protocol for Metadata Harvesting
- Low-barrier protocol for exposing structured information (metadata) from cooperating repositories
- Provides opportunity for building a comprehensive service network
- http://www.openarchives.org/
16 OAI-PMH: A simple two-party model for sharing structured information
[Figure: two-party model with Service Providers (Discovery, Current Awareness, Preservation) and Data Providers]
17 OAI-PMH: Key technical features
- Simple HTTP encoding
- Built on established XML standards
- Multiple metadata formats, but Dublin Core required
- Repository partitioning (sets)
- Selective harvesting (sets and dates)
- Clean partition between core and implementation-specific extensions
- Multiple item-level metadata
- Collection level metadata
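To make the XML and Dublin Core features concrete, the sketch below parses a small hand-written response in the shape OAI-PMH uses for harvested records. The three namespace URIs are the standard OAI-PMH 2.0, `oai_dc` and Dublin Core ones; the sample record, its identifier, and the helper `parse_records` are made up for illustration.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written sample response (hypothetical repository and record).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:1</identifier>
        <datestamp>2002-05-01</datestamp>
        <setSpec>physics</setSpec>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Introduction to Waves</dc:title>
          <dc:creator>A. Author</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_records(xml_text):
    """Extract identifier, datestamp, sets and Dublin Core title from each record."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.findall("oai:ListRecords/oai:record", NS):
        header = rec.find("oai:header", NS)
        dc = rec.find("oai:metadata/oai_dc:dc", NS)
        records.append({
            "identifier": header.findtext("oai:identifier", namespaces=NS),
            "datestamp": header.findtext("oai:datestamp", namespaces=NS),
            "sets": [s.text for s in header.findall("oai:setSpec", NS)],
            # A record without a metadata part yields no title.
            "title": dc.findtext("dc:title", namespaces=NS) if dc is not None else None,
        })
    return records


for record in parse_records(SAMPLE):
    print(record)
```

The header carries what selective harvesting needs (datestamp and set membership), while the metadata part carries the Dublin Core description.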
18 OAI Verbs
- Identify: repository characteristics
- ListMetadataFormats: DC required
- ListSets: repository partitioning
- ListRecords: (selectively) harvest metadata
- ListIdentifiers: (selectively) harvest metadata identifiers
- GetRecord: known item retrieval
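Because the protocol uses a simple HTTP encoding, each verb becomes a query parameter on the repository's base URL. The sketch below builds such request URLs without sending them. The argument names `metadataPrefix`, `set`, `from` and `identifier` are OAI-PMH request arguments; the base URL, the record identifier, and the helper `oai_request` are hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "http://repository.example.org/oai"  # hypothetical repository base URL


def oai_request(verb, **args):
    """Build an OAI-PMH request URL: the verb and its arguments are query parameters."""
    params = {"verb": verb}
    # `from` is a Python keyword, so callers pass it as `from_`.
    for key, value in args.items():
        params[key.rstrip("_")] = value
    return BASE_URL + "?" + urlencode(params)


# Repository characteristics.
print(oai_request("Identify"))
# Selective harvesting by set and date.
print(oai_request("ListRecords", metadataPrefix="oai_dc",
                  set="physics", from_="2002-01-01"))
# Known item retrieval.
print(oai_request("GetRecord", metadataPrefix="oai_dc",
                  identifier="oai:example.org:1"))
```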
19 The National Science Digital Library
The Integration Task is to provide a coherent set of collections and services across great diversity (all digital collections relevant to science education).
http://nsdl.org/
20 Interoperability in the NSDL
The Problem: Conventional approaches require partners to support agreements (technical, content, and business).
But NSDL needs thousands of very different partners ... most of whom are not directly part of the NSDL program.
The challenge is to create incentives for independent digital libraries to adopt agreements.
21 Architecture for Searching
Basic Assumptions
- The integration team will not manage most of the collections.
- The integration team will not create most of the metadata.
22 The NSDL Search Service
Full Text or Metadata?
- Full text indexing is excellent, but is not possible for all materials (non-textual, no access for indexing).
- Comprehensive metadata is available for very few of the materials.
What Architecture to Use?
- Few collections support an established search protocol (e.g., Z39.50).
23 NSDL: The Spectrum of Interoperability
Level | Agreements | Example
Federation | Strict use of standards (syntax, semantic, and business) | AACR, MARC, Z39.50
Harvesting | Digital libraries expose metadata; simple protocol and registry | Open Archives metadata harvesting
Gathering | Digital libraries do not cooperate; services must seek out information | Web crawlers and search engines
24 The NSDL Repository
The repository is a resource for service providers. It holds information about every collection and item known to the NSDL, including contextual information.
[Figure: the NSDL Repository shown with Services, Users and Collections]
25 NSDL Search Service: First Phase
[Figure: architecture diagram with components NSDL Repository, Search and Discovery Service (Inquery -> Lucene), three Portals, and Collections; connections labeled OAI harvest, SDLIP, and crawl]
26 NSDL Search Service: First Phase
- Approach
  - Collections map metadata to Dublin Core and provide it via the Open Archives protocol.
  - Search service augments Dublin Core metadata with indexing of full text where available.
  - User interface returns snippets derived from the metadata, with links to full content and to metadata.
27 NSDL Search Service: First Phase
- Weaknesses
  - (a) Ranking by similarity to query is not sufficient.
  - (b) Snippets do not indicate why an item was returned (e.g., terms in full text but not in metadata).
  - (c) Dublin Core records provide limited information.
  - (d) Browsing environment is limited.
  - (e) Most users begin their search with a Web search engine (e.g., Google).
28 NSDL Search Service: Second Phase Developments
- Metadata
  - Accept any metadata that is available in a range of formats
  - System for reviews and annotations, with reputation management
- Search system
  - Multimodal retrieval and ranking
  - Dynamic generation of snippets by search engine
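One way to read "dynamic generation of snippets" is that the snippet is cut from whichever text actually matched the query, rather than taken from a fixed metadata field. The sketch below illustrates that idea only; it is not the NSDL implementation, and the functions `make_snippet` and `snippet_for`, the field names `description` and `full_text`, and the sample item are all hypothetical.

```python
def make_snippet(text, query, width=40):
    """Return a window of `text` around the first query term found, or None."""
    lowered = text.lower()
    for term in query.lower().split():
        pos = lowered.find(term)
        if pos >= 0:
            start = max(0, pos - width)
            end = min(len(text), pos + len(term) + width)
            prefix = "..." if start > 0 else ""
            suffix = "..." if end < len(text) else ""
            return prefix + text[start:end] + suffix
    return None


def snippet_for(item, query):
    # Prefer the metadata description; fall back to the full text so the snippet
    # shows why the item matched even when the term is absent from the metadata.
    for field in ("description", "full_text"):
        snippet = make_snippet(item.get(field, ""), query)
        if snippet:
            return field, snippet
    return None, ""


item = {
    "description": "A classroom activity about volcanoes.",
    "full_text": "Students build a model and measure lava viscosity using household syrup.",
}
print(snippet_for(item, "volcanoes"))   # matched in the metadata
print(snippet_for(item, "viscosity"))   # matched only in the full text
```

Returning the field name alongside the snippet lets the user interface say where the match came from.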
29 NSDL Search Service: Second Phase Developments (cont.)
- Usability and human factors
  - Wider range of browsing tools (e.g., collection visualization)
  - Filters by education level and education quality, where known
- Web compatibility
  - Expose records for Web crawlers to index
  - Browser bookmarklet to add NSDL information to Web pages