Parallel and Distributed IR - PowerPoint PPT Presentation

About This Presentation
Slides: 25
Provided by: carr2174

Transcript and Presenter's Notes

Title: Parallel and Distributed IR


1
Parallel and Distributed IR
  • Eric Brown

2
Parallel Computing
  • SISD: single instruction stream, single data
    stream.
  • SIMD: single instruction stream, multiple data
    stream.
  • MISD: multiple instruction stream, single data
    stream.
  • MIMD: multiple instruction stream, multiple data
    stream.

3
Performance Measures
Speedup:
$$S = \frac{\text{running time of best available sequential algorithm}}{\text{running time of parallel algorithm}}$$
Amdahl's law:
$$S \le \frac{1}{f + (1 - f)/N} \le \frac{1}{f}$$
Efficiency:
$$\varphi = \frac{S}{N}$$
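
The measures on this slide can be checked numerically. The following is a minimal sketch, assuming f denotes the fraction of the work that must run sequentially and N the number of processors (the slide itself does not define either symbol). The function names are illustrative only.

```python
def speedup(sequential_time, parallel_time):
    """Speedup S: sequential running time divided by parallel running time."""
    return sequential_time / parallel_time


def amdahl_bound(f, n):
    """Upper bound on speedup, 1 / (f + (1 - f) / N).

    f is assumed to be the sequential fraction of the work,
    n the number of processors.
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / (f + (1.0 - f) / n)


def efficiency(s, n):
    """Efficiency: speedup divided by the number of processors."""
    return s / n


if __name__ == "__main__":
    # With 10% sequential work the bound approaches 1 / f = 10
    # however many processors are added.
    for n in (1, 10, 100, 1000):
        bound = amdahl_bound(0.1, n)
        print(n, round(bound, 3), round(efficiency(bound, n), 3))
```

The loop illustrates why the second bound, 1/f, matters: efficiency drops steadily as processors are added to a job with a fixed sequential part.
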
4
Parallel IR
  • Introduction
  • Develop new retrieval strategies that directly
    lend themselves to parallel implementation.
  • Adapt existing, well studied information
    retrieval algorithms to parallel processing.

5
MIMD Architecture
6
MIMD Architecture
  • Inverted Files
  • Logical Document Partitioning
  • Essentially the same basic underlying inverted
    file index as in the original sequential
    algorithm.
  • Physical Document Partitioning
  • Each subcollection has its own inverted file and
    the search processes share nothing during query
    evaluation.
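
Physical document partitioning can be illustrated with a small sketch. It assumes whitespace tokenization, Boolean single-term lookup, and round-robin assignment of documents to subcollections; none of these choices come from the slides, and all function names are hypothetical.

```python
from collections import defaultdict


def build_inverted_file(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def partition_physically(docs, n_parts):
    """Split the collection and build one independent inverted file per part."""
    parts = [dict() for _ in range(n_parts)]
    for i, (doc_id, text) in enumerate(sorted(docs.items())):
        parts[i % n_parts][doc_id] = text  # round-robin is an assumption
    return [build_inverted_file(part) for part in parts]


def search(indexes, term):
    """Look the term up in every subcollection index and merge the hits.

    The lookups share no state, so each one could run on its own processor.
    """
    hits = set()
    for index in indexes:
        hits |= index.get(term.lower(), set())
    return sorted(hits)


if __name__ == "__main__":
    docs = {
        1: "parallel retrieval",
        2: "distributed retrieval",
        3: "parallel hardware",
    }
    indexes = partition_physically(docs, 2)
    print(search(indexes, "retrieval"))
```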

7
MIMD Architecture
  • Logical document partitioning requires less
    communication than physical document partitioning
    with similar parallelization, and so is likely to
    provide better overall performance.
  • Physical document partitioning, on the other
    hand, offers more flexibility, and converting an
    existing IR system into a parallel IR system is
    simpler with physical document partitioning.

8
MIMD Architectures
  • Term partitioning
  • When term partitioning is used, an inverted file
    is created for the document collection and the
    inverted lists are spread across the processors.
  • Assuming each processor has its own I/O channel
    and disks: when the term distributions in the
    documents and the queries are more skewed,
    document partitioning performs better; when
    terms are uniformly distributed in user queries,
    term partitioning performs better.
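
Term partitioning can be sketched in the same spirit. The example below builds one inverted file for the whole collection and assigns each inverted list to a processor by hashing the term. The hash-based assignment is an assumption made for illustration; the slides do not say how lists are spread across processors.

```python
import zlib
from collections import defaultdict


def build_inverted_file(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def owner_of(term, n_procs):
    """Pick the processor that stores a term's inverted list (assumed scheme)."""
    return zlib.crc32(term.encode("utf-8")) % n_procs


def partition_by_term(index, n_procs):
    """Spread the inverted lists of a single index across n_procs shards."""
    shards = [dict() for _ in range(n_procs)]
    for term, postings in index.items():
        shards[owner_of(term, n_procs)][term] = postings
    return shards


def lookup(shards, term):
    """Only the processor that owns the term has to be contacted."""
    term = term.lower()
    return sorted(shards[owner_of(term, len(shards))].get(term, set()))


if __name__ == "__main__":
    docs = {1: "parallel retrieval", 2: "distributed retrieval"}
    shards = partition_by_term(build_inverted_file(docs), 3)
    print(lookup(shards, "retrieval"))
```

A query containing several terms touches only the processors that own those terms, which is why the balance of work depends on how terms are distributed across queries.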

9
MIMD Architecture
10
SIMD Architecture
  • Signature Files

11
SIMD Architecture
  • Signature Files

12
SIMD Architecture
  • Signature Files

13
SIMD Architectures
  • Inverted Files

14
SIMD Architectures
15
SIMD Architectures
  • Inverted Files

16
SIMD Architectures
17
Distributed IR
  • Introduction
  • A distributed computing system can be viewed as a
    MIMD parallel processor with a relatively slow
    inter-processor communication channel and the
    freedom to employ a heterogeneous collection of
    processors in the system.

18
Distributed IR
  • Introduction
  • The distributed model is very similar to the MIMD
    parallel processing model.
  • The main difference is that subtasks run on
    different computers and communication between
    the subtasks is performed using a network
    protocol such as TCP/IP.

19
Collection Partitioning
  • The procedure used to assign documents to search
    servers in a distributed IR system depends on a
    number of factors.
  • Consider whether or not the system is centrally
    administered.

20
Collection Partitioning
  • When the distributed system is centrally
    administered, more options are available.
  • The first option is simple replication of the
    collection across all of the search servers.
  • The second option is random distribution of the
    documents.
  • The final option is explicit semantic
    partitioning of the documents.
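
The three options can be compared side by side in a short sketch. The `topic_of` callback stands in for whatever semantic classification an administrator would use; it and the other names are hypothetical.

```python
import random


def replicate(docs, n_servers):
    """Option 1: every search server holds a full copy of the collection."""
    return [dict(docs) for _ in range(n_servers)]


def distribute_randomly(docs, n_servers, seed=0):
    """Option 2: each document goes to one server chosen at random."""
    rng = random.Random(seed)
    servers = [dict() for _ in range(n_servers)]
    for doc_id, text in docs.items():
        servers[rng.randrange(n_servers)][doc_id] = text
    return servers


def partition_semantically(docs, topic_of):
    """Option 3: documents are grouped by an explicit topic label.

    topic_of is a caller-supplied function from document id to label.
    """
    servers = {}
    for doc_id, text in docs.items():
        servers.setdefault(topic_of(doc_id), {})[doc_id] = text
    return servers


if __name__ == "__main__":
    docs = {1: "football match", 2: "parallel retrieval", 3: "tennis match"}
    topics = {1: "sports", 2: "computing", 3: "sports"}
    print(partition_semantically(docs, topics.get))
```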

21
Source Selection
  • Source selection is the process of determining
    which of the distributed document collections are
    most likely to contain relevant documents for the
    current query, and therefore should receive the
    query for processing.
  • The basic technique is to treat each collection
    as if it were a single large document, index the
    collections, and evaluate the query against the
    collections to produce a ranked listing of
    collections.
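
The basic technique can be sketched as follows: each collection is collapsed into one term-frequency vector, as if it were a single large document, and the query is scored against those vectors. The weighting used here (term frequency times a logarithmic inverse collection frequency) is an assumption chosen for brevity, not a formula taken from the slides.

```python
import math
from collections import Counter


def collection_vectors(collections):
    """Treat each collection as one large document of term frequencies."""
    return {
        name: Counter(" ".join(docs).lower().split())
        for name, docs in collections.items()
    }


def rank_collections(collections, query):
    """Return collection names ordered from most to least promising."""
    vectors = collection_vectors(collections)
    n = len(vectors)
    scores = {}
    for name, tf in vectors.items():
        score = 0.0
        for term in query.lower().split():
            # Number of collections that contain the term at all.
            cf = sum(1 for vec in vectors.values() if term in vec)
            if tf[term] and cf:
                score += tf[term] * math.log(1 + n / cf)
        scores[name] = score
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(scores, key=lambda name: (-scores[name], name))


if __name__ == "__main__":
    collections = {
        "sports": ["football match report", "tennis match"],
        "tech": ["parallel retrieval systems", "distributed retrieval"],
    }
    print(rank_collections(collections, "distributed retrieval"))
```

In practice only the top few collections in the ranking would receive the query.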

22
Query Processing
  • Query processing in a distributed IR system
    proceeds as follows:
  • Select the collections to search.
  • Distribute the query to the selected collections.
  • Evaluate the query at the distributed collections
    in parallel.
  • Combine the results from the distributed
    collections into a final result.
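
The four steps can be sketched with threads standing in for remote search servers. This is a minimal illustration under several assumptions: collections are in-memory dictionaries, a document's score is the number of query terms it contains, and scores from different collections are directly comparable when merged. The names `evaluate`, `process_query`, and `select` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate(collection, query):
    """Score each document by how many distinct query terms it contains."""
    terms = set(query.lower().split())
    results = []
    for doc_id, text in collection.items():
        score = len(terms & set(text.lower().split()))
        if score:
            results.append((score, doc_id))
    return results


def process_query(collections, query, select, top_k=10):
    """Run the four steps: select, distribute, evaluate, combine."""
    # Step 1: select the collections to search.
    selected = list(select(collections, query))
    # Steps 2 and 3: send the query to each selected collection and
    # evaluate it there in parallel (threads stand in for servers).
    with ThreadPoolExecutor() as pool:
        partial = list(
            pool.map(lambda name: evaluate(collections[name], query), selected)
        )
    # Step 4: combine the partial results into one ranked list.
    merged = [hit for hits in partial for hit in hits]
    merged.sort(key=lambda hit: (-hit[0], hit[1]))
    return merged[:top_k]


if __name__ == "__main__":
    collections = {
        "a": {"a1": "parallel retrieval", "a2": "hardware"},
        "b": {"b1": "distributed parallel retrieval"},
    }
    print(process_query(collections, "parallel retrieval", lambda c, q: c))
```

Merging by raw score is the simplest possible combination step; real systems usually have to normalize scores first because each collection computes them from its own statistics.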

23
Web Issues
  • The parallel and distributed techniques described
    above can then be used directly as if the Web
    were any other large document collection. This is
    the approach currently taken by most of the
    popular Web search services.

24
Trends and Research Issues
  • The trend in parallel hardware is the development
    of general MIMD machines.
  • Many challenges remain in the area of parallel
    and distributed text retrieval.
  • The first challenge is measuring retrieval
    effectiveness on large text collections.
  • The second significant challenge is
    interoperability, or building distributed IR
    systems from heterogeneous components.