Information Retrieval - PowerPoint PPT Presentation

About This Presentation
Title: Information Retrieval

Description: Information Retrieval: Architecture of Information
Retrieval Systems

Slides: 30
Provided by: wya49

Transcript and Presenter's Notes

1
Information Retrieval
Architecture of Information Retrieval Systems
2
Distributed Architecture 1: Standard Search Protocols
Strict adherence to standards allows any user
interface to search any conforming search service.
3
Distributed Architecture 1: Standard Search Protocols
Example: Z39.50 family of standards for searching
library catalogs
  • Content: Anglo-American Cataloging Rules
  • Structure of content: MARC
  • Encoding rules: Base Encoding Rules (character
    sets, separators, etc.)
  • Message-passing protocol: Z39.50
  • Query format: Bib-1 (Boolean), Type 102 (full
    text)
In addition, there are the underlying network
standards, e.g., the Internet suite of protocols.
4
Z39.50 principles
  • Servers store a set of databases with searchable
    indices
  • Interactions are based on a session
  • The client opens a connection with the server(s),
    carries out a sequence of interactions and then
    closes the connection.
  • During the course of the session, both the server
    and the client remember the state of their
    interaction.

5
State
  • Z39.50
  • The server carries out the search and builds a
    result set.
  • The server saves the result set.
  • Subsequent messages from the client can reference
    the result set.
  • Thus the client can refine a large set by
    increasingly precise requests, or can request a
    presentation of any record in the set, without
    searching the entire database again.
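
The session state described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the class `CatalogSession`, its methods, and the sample records are hypothetical, and it does not implement the Z39.50 wire protocol. It shows a server that saves named result sets so that a later request can refine or present from a set instead of scanning the whole database.

```python
class CatalogSession:
    """Toy stateful search session (hypothetical; not the Z39.50 protocol)."""

    def __init__(self, records):
        self._records = records      # the server's database
        self._result_sets = {}       # state kept for the life of the session
        self.records_scanned = 0     # counts how much work each request costs

    def search(self, name, term, within=None):
        """Build and save a result set; optionally search only a saved set."""
        if within is None:
            candidates = range(len(self._records))
        else:
            candidates = self._result_sets[within]
        hits = []
        for i in candidates:
            self.records_scanned += 1
            if term.lower() in self._records[i].lower():
                hits.append(i)
        self._result_sets[name] = hits
        return len(hits)

    def present(self, name, position):
        """Return one record (1-based position) from a saved result set."""
        return self._records[self._result_sets[name][position - 1]]

    def close(self):
        """End the session; the saved result sets are discarded."""
        self._result_sets.clear()


session = CatalogSession([
    "Information retrieval systems",
    "Digital libraries and retrieval",
    "Cataloging rules",
    "Retrieval of digital images",
])
session.search("A", "retrieval")              # scans all 4 records
session.search("B", "digital", within="A")    # scans only the 3 records in set A
first = session.present("B", 1)               # no search needed at all
session.close()
```

The second request touches only the records already in set A, which is the point of keeping state on the server.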

6
Standard Search Protocols
  • Example: Z39.50 family of standards for
    searching library catalogs
  • The Z39.50 family of standards has proved
    successful in a tightly knit community, where:
      • There is a strong tradition of
        standardization, with many professionally
        trained people.
      • The categories of material change gradually,
        allowing a slow-moving standardization
        process.
  • The standardization approach has failed where
    these two criteria are not met.
  • Historic note: WAIS was based on an early version
    of Z39.50.

7
Distributed Architecture 2: Broadcast Search (aka
Federated Search)
An interface server broadcasts a query to each
collection, combines the results and returns them
to the user.
Examples: Dienst (digital library protocol), Web
metasearch services
8
Distributed Architecture 2: Broadcast Search
  • Interface Service: Can be a separate server
    (e.g., CGI), or run on the user's computer (e.g.,
    applet).
  • Protocols: In the simple version, each
    collection must support the same standards and
    protocols (e.g., Z39.50, HTTP, etc.).
9
Distributed Architecture 2: Broadcast Search
Problems with Broadcast Search
  • Performance: If any collection does not respond,
    the Interface Server waits for a time-out.
  • Recall: If any collection does not respond,
    documents in that collection are not found.
  • Ranking and duplicates: There are great
    difficulties in reconciling ranked lists from
    different collections.
Broadcast searching is as bad as its weakest link!
Conclusion: Broadcast search does not scale beyond
about five or ten collections, even with strict
standardization.
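
A minimal sketch of a broadcast search may make these problems concrete. Everything here is an assumption for illustration: the collections are in-memory functions created by a hypothetical `make_collection`, and the round-robin merge is a deliberately naive way to combine ranked lists, not a recommended ranking method. The sketch shows the interface server waiting up to a time-out, losing the documents of a collection that does not respond, and removing duplicates while interleaving.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from itertools import zip_longest


def make_collection(titles, delay=0.0):
    """Hypothetical collection: returns matching titles after `delay` seconds."""
    def search(query):
        time.sleep(delay)
        return [t for t in titles if query.lower() in t.lower()]
    return search


def merge_round_robin(ranked_lists):
    """Naive merge: interleave the lists rank by rank and drop duplicates."""
    merged, seen = [], set()
    for tier in zip_longest(*ranked_lists):
        for title in tier:
            if title is not None and title not in seen:
                seen.add(title)
                merged.append(title)
    return merged


def broadcast_search(query, collections, timeout=0.3):
    """Send the query to every collection and wait at most `timeout` seconds."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(collections)))
    futures = {name: pool.submit(search, query)
               for name, search in collections.items()}
    deadline = time.monotonic() + timeout
    ranked_lists, not_responding = [], []
    for name, future in futures.items():
        remaining = max(0.0, deadline - time.monotonic())
        try:
            ranked_lists.append(future.result(timeout=remaining))
        except FutureTimeout:
            # Recall problem: this collection's documents are simply missing.
            not_responding.append(name)
    pool.shutdown(wait=False)
    return merge_round_robin(ranked_lists), not_responding
```

With one slow collection, every query takes the full time-out before the user sees anything, and the slow collection contributes nothing to the result.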
10
Standardization: Function Versus Cost of Acceptance
[Figure: axes "Cost of acceptance" and "Function";
regions labeled "Few adopters" and "Many adopters"]
11
Example: Textual Mark-up
[Figure: SGML, XML, HTML and ASCII plotted on axes
"Cost of acceptance" and "Function"]
12
Distributed Architecture 3: Centralized Search
Services
  • Batch indexing: Metadata about all items is
    accumulated in a central system.
  • Real-time searching: The user (a) searches the
    central system, and (b) retrieves items from
    collections.
Examples: Union catalogs, Web search services
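
The two steps above can be sketched as a small in-memory service. The class `CentralSearchService`, the collection names, and the records are hypothetical; a real service would use a full search engine rather than this simple inverted index. The sketch separates batch indexing of metadata from real-time searching, and shows that the central system returns only a pointer, so the item itself is retrieved from its collection.

```python
from collections import defaultdict


class CentralSearchService:
    """Toy central index over metadata gathered from many collections."""

    def __init__(self):
        self._index = defaultdict(set)   # term -> set of item ids
        self._locations = {}             # item id -> (collection, local id)

    def batch_index(self, collection_name, metadata_records):
        """Batch step: accumulate metadata from one collection."""
        for local_id, text in metadata_records.items():
            item_id = f"{collection_name}/{local_id}"
            self._locations[item_id] = (collection_name, local_id)
            for term in text.lower().split():
                self._index[term].add(item_id)

    def search(self, query):
        """Real-time step (a): items whose metadata contains every term."""
        terms = query.lower().split()
        if not terms:
            return []
        hits = set.intersection(*(self._index.get(t, set()) for t in terms))
        return sorted(hits)

    def locate(self, item_id):
        """Real-time step (b): where to retrieve the item from."""
        return self._locations[item_id]


service = CentralSearchService()
service.batch_index("physics", {"p1": "Quantum mechanics lecture notes"})
service.batch_index("biology", {"b1": "Cell biology lecture notes",
                                "b2": "Genetics lab manual"})
hits = service.search("lecture notes")
collection, local_id = service.locate(hits[0])
```

Unlike broadcast search, a query never waits on a remote collection; the cost is that the index is only as fresh as the last batch run.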
13
Distributed Architecture 3: Centralized Search
Services
Gathering by Web Crawling
  • Entirely automatic, low cost.
  • Highly efficient at gathering very large amounts
    of material.
but ...
  • Can only gather openly accessible materials.
  • Cannot gather material in databases unless
    explicit URLs are known.
  • Cannot easily make use of metadata provided by
    collections.
Examples: Web search services.
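
A toy crawler over an in-memory "web" illustrates both the strengths and the limits listed above. The URLs and the `fetch_links` convention (return a list of links, or `None` when a page is not openly accessible) are assumptions made for this sketch; no network access is involved.

```python
from collections import deque


def crawl(seed_urls, fetch_links):
    """Breadth-first crawl. `fetch_links(url)` returns the links on a page,
    or None if the page cannot be fetched (e.g., access is restricted)."""
    seen = set(seed_urls)
    queue = deque(seed_urls)
    gathered = []
    while queue:
        url = queue.popleft()
        links = fetch_links(url)
        if links is None:
            continue                 # not openly accessible: nothing gathered
        gathered.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return gathered


# Hypothetical site: one restricted page, and one database record
# that no page links to.
WEB = {
    "http://a.example/": ["http://a.example/about",
                          "http://a.example/private"],
    "http://a.example/about": ["http://a.example/"],
    "http://a.example/private": None,
    "http://a.example/db?id=42": [],
}

pages = crawl(["http://a.example/"], WEB.get)
```

The crawl gathers the home page and the "about" page automatically, but never sees the restricted page or the database record, because no explicit URL for the record appears on any page it can reach.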
14
Distributed Architecture 3: Centralized Search
Services
Harvesting
  • Each collection makes a copy of its metadata
    available from a server associated with the
    collection.
  • A search service harvests metadata from all
    collections on a regular cycle and builds a
    central search system.
Advantages ...
  • Can index material from databases without
    explicit URLs.
  • Allows authentication and selection of material.
but ...
  • Requires that collections have metadata and
    support a harvesting protocol (e.g., Open
    Archives Initiative Protocol for Metadata
    Harvesting).
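
The regular harvesting cycle can be sketched as follows. The classes `DataProvider` and `Harvester`, the record identifiers, and the dates are hypothetical; the sketch keeps everything in memory and only illustrates the idea of asking each collection for the metadata that has changed since the previous harvest.

```python
from datetime import date


class DataProvider:
    """Toy collection server that exposes a copy of its metadata."""

    def __init__(self):
        self._records = {}           # identifier -> (metadata, datestamp)

    def put(self, identifier, metadata, datestamp):
        self._records[identifier] = (metadata, datestamp)

    def list_records(self, since=None):
        """Records added or changed on or after `since` (all if None)."""
        return [(ident, meta, stamp)
                for ident, (meta, stamp) in sorted(self._records.items())
                if since is None or stamp >= since]


class Harvester:
    """Toy search service that builds a central store by harvesting."""

    def __init__(self):
        self.central = {}            # (collection, identifier) -> metadata
        self.last_harvest = {}       # collection -> date of last harvest

    def harvest(self, name, provider, today):
        since = self.last_harvest.get(name)
        records = provider.list_records(since)
        for identifier, metadata, _stamp in records:
            self.central[(name, identifier)] = metadata
        self.last_harvest[name] = today
        return len(records)


provider = DataProvider()
provider.put("r1", {"title": "Soil maps"}, date(2003, 1, 5))
provider.put("r2", {"title": "Rainfall data"}, date(2003, 1, 20))

harvester = Harvester()
first_cycle = harvester.harvest("geo", provider, date(2003, 2, 1))

provider.put("r3", {"title": "River gauges"}, date(2003, 2, 10))
second_cycle = harvester.harvest("geo", provider, date(2003, 3, 1))
```

The first cycle copies everything; the second transfers only the record added since the first, which keeps a regular cycle cheap for both sides.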
15
Open Archives Initiative Protocol for Metadata
Harvesting
  • Low-barrier protocol for exposing structured
    information (metadata) from cooperating
    repositories
  • Provides opportunity for building comprehensive
    service network
  • http://www.openarchives.org/

16
OAI-PMH: A simple two-party model for sharing
structured information
[Diagram labels: Service Providers; Discovery;
Current Awareness; Preservation; Data Providers]
17
OAI-PMH: Key technical features
  • Simple HTTP encoding
  • Built on established XML standards
  • Multiple metadata formats, but Dublin Core
    required
  • Repository partitioning (sets)
  • Selective harvesting (sets and dates)
  • Clean partition between core and
    implementation-specific extensions
  • Multiple item-level metadata
  • Collection level metadata

18
OAI Verbs
  • Identify: repository characteristics
  • ListMetadataFormats: DC required
  • ListSets: repository partitioning
  • ListRecords: (selectively) harvest metadata
  • ListIdentifiers: (selectively) harvest metadata
    identifiers
  • GetRecord: known item retrieval
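
Because the protocol uses a simple HTTP encoding, each verb becomes an ordinary request URL. The helper below is a sketch, not a client library: the function name `oai_request_url` and the repository address are hypothetical, while the verb names and the arguments `metadataPrefix`, `identifier`, `from`, `until` and `set` are those of OAI-PMH 2.0. Resumption tokens and response parsing are left out.

```python
from urllib.parse import urlencode

# Arguments each verb requires (resumption tokens are ignored in this sketch).
REQUIRED_ARGUMENTS = {
    "Identify": set(),
    "ListMetadataFormats": set(),
    "ListSets": set(),
    "ListRecords": {"metadataPrefix"},
    "ListIdentifiers": {"metadataPrefix"},
    "GetRecord": {"identifier", "metadataPrefix"},
}


def oai_request_url(base_url, verb, **arguments):
    """Build the request URL for one OAI-PMH verb.

    `from` is a Python keyword, so callers pass it as `from_`.
    """
    if verb not in REQUIRED_ARGUMENTS:
        raise ValueError(f"unknown verb: {verb}")
    names = {key.rstrip("_") for key in arguments}
    missing = REQUIRED_ARGUMENTS[verb] - names
    if missing:
        raise ValueError(f"{verb} requires: {sorted(missing)}")
    params = {"verb": verb}
    for key, value in arguments.items():
        params[key.rstrip("_")] = value
    return f"{base_url}?{urlencode(params)}"


BASE = "http://repository.example.org/oai"   # hypothetical repository

identify_url = oai_request_url(BASE, "Identify")
# Selective harvesting by set and date, with Dublin Core as the format.
harvest_url = oai_request_url(BASE, "ListRecords", metadataPrefix="oai_dc",
                              from_="2003-01-01", set="physics")
```

The `ListRecords` call shows selective harvesting: the `set` argument restricts the request to one partition of the repository and `from` restricts it by date.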

19
The National Science Digital Library
The Integration Task is to provide a coherent set
of collections and services across great
diversity (all digital collections relevant to
science education).
http://nsdl.org/
20
Interoperability in the NSDL
The Problem
  • Conventional approaches require partners to
    support agreements (technical, content, and
    business).
  • But NSDL needs thousands of very different
    partners ... most of whom are not directly part
    of the NSDL program.
The challenge is to create incentives for
independent digital libraries to adopt agreements.
21
Architecture for Searching
Basic Assumptions
  • The integration team will not manage most of the
    collections.
  • The integration team will not create most of the
    metadata.
22
The NSDL Search Service
Full Text or Metadata?
  • Full text indexing is excellent, but is not
    possible for all materials (non-textual, no
    access for indexing).
  • Comprehensive metadata is available for very few
    of the materials.
What Architecture to Use?
  • Few collections support an established search
    protocol (e.g., Z39.50).
23
NSDL: The Spectrum of Interoperability
  • Level: Federation
    Agreements: Strict use of standards (syntax,
    semantic, and business)
    Example: AACR, MARC, Z39.50
  • Level: Harvesting
    Agreements: Digital libraries expose metadata;
    simple protocol and registry
    Example: Open Archives metadata harvesting
  • Level: Gathering
    Agreements: Digital libraries do not cooperate;
    services must seek out information
    Example: Web crawlers and search engines
24
The NSDL Repository
The repository is a resource for service
providers. It holds information about every
collection and item known to the NSDL, including
contextual information.
[Diagram labels: Services; NSDL Repository; Users;
Collections]
25
NSDL Search Service: First Phase
[Diagram labels: NSDL Repository; OAI harvest;
Portal (three); SDLIP; Search and Discovery
Service; crawl; Inquery -> Lucene; Collections]
26
NSDL Search Service: First Phase
Approach
  • Collections map metadata to Dublin Core, provide
    via Open Archives protocol.
  • Search service augments Dublin Core metadata with
    indexing of full text where available.
  • User interface returns snippets derived from the
    metadata, links to full content and to metadata.
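
The approach above can be sketched with a small index. The field names `title`, `description` and `identifier` are Dublin Core elements; `full_text`, the function names, and the sample item are assumptions made for this illustration, not the NSDL implementation. The sketch indexes metadata together with full text where it exists, but builds every snippet from the metadata alone.

```python
def build_index(items):
    """Index Dublin Core fields plus full text where it is available."""
    index = {}
    for item in items:
        text = " ".join([item["title"], item["description"],
                         item.get("full_text", "")])
        for term in set(text.lower().split()):
            index.setdefault(term, []).append(item)
    return index


def search(index, term):
    """Return one result per matching item, with a metadata-derived snippet."""
    results = []
    for item in index.get(term.lower(), []):
        results.append({
            "snippet": f'{item["title"]}: {item["description"]}',
            "content_url": item["identifier"],
        })
    return results


items = [{
    "title": "Plate tectonics",
    "description": "Animated maps of continental drift",
    "identifier": "http://collection.example/plates",   # hypothetical URL
    "full_text": "subduction zones and volcanoes",
}]
index = build_index(items)
results = search(index, "subduction")
```

The query term "subduction" occurs only in the full text, so the item is found, yet the snippet shown to the user does not contain the term.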

27
NSDL Search Service: First Phase
Weaknesses
  • Ranking by similarity to query is not sufficient.
  • Snippets do not indicate why an item was returned
    (e.g., terms in full text but not in metadata).
  • Dublin Core records provide limited information.
  • Browsing environment is limited.
  • Most users begin their search with a Web
    search engine (e.g., Google).

28
NSDL Search Service: Second Phase Developments
  • Metadata
      • Accept any metadata that is available in a
        range of formats
      • System for reviews and annotations, with
        reputation management
  • Search system
      • Multimodal retrieval and ranking
      • Dynamic generation of snippets by search
        engine
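
One simple way to generate a snippet dynamically is to show the words around the first query term found in the indexed text. The function `dynamic_snippet` and the sample text are hypothetical, and this keyword-in-context window is only an illustration of the idea, not the method used by any particular search engine.

```python
def dynamic_snippet(text, query, window=3):
    """Return the words around the first query term found in `text`.

    Falls back to the opening words when no query term occurs.
    """
    words = text.split()
    terms = {t.lower() for t in query.split()}
    for i, word in enumerate(words):
        if word.lower().strip(".,;:") in terms:
            start = max(0, i - window)
            end = min(len(words), i + window + 1)
            prefix = "... " if start > 0 else ""
            suffix = " ..." if end < len(words) else ""
            return prefix + " ".join(words[start:end]) + suffix
    return " ".join(words[: 2 * window + 1])


text = ("Plates collide along subduction zones and form "
        "chains of volcanoes")
snippet = dynamic_snippet(text, "subduction", window=2)
# → "... collide along subduction zones and ..."
```

Because the snippet is cut from the text that actually matched, it shows the user why the item was returned.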

29
NSDL Search Service: Second Phase Developments
(cont.)
  • Usability and human factors
      • Wider range of browsing tools (e.g.,
        collection visualization)
      • Filters by education level and education
        quality, where known
  • Web compatibility
      • Expose records for Web crawlers to index
      • Browser bookmarklet to add NSDL information
        to Web pages