Modern Information Retrieval

1
Modern Information Retrieval
  • Chapter 9: Parallel and Distributed IR
  • Section 9.1: Introduction
  • Section 9.2.2: MIMD Architectures
  • Inverted Files
  • November 5, 1999

2
Summary
  • Introduction
  • Review of parallel computing and parallel program
    performance measures
  • Exploration of techniques for implementing
    inverted files on a MIMD parallel architecture
  • Conclusion

3
Introduction
  • The volume of electronic text available online
    today is staggering.
  • The WWW contains over 800 million pages of text,
    comprising nearly 6 terabytes of data (Nature,
    Vol. 400, 8 July 1999, www.nature.com).
  • As document collections grow larger, they become
    more expensive to manage with an information
    retrieval system.
  • To support the demanding requirements of modern
    search environments, we must turn to alternative
    architectures and algorithms.

4
Parallel Computing
  • Parallel computing is the simultaneous application
    of multiple processors to solve a single problem.
  • Flynn's taxonomy:
  • SISD: single instruction, single data
  • SIMD: single instruction, multiple data
  • MISD: multiple instruction, single data
  • MIMD: multiple instruction, multiple data

5
Parallel Program Performance Measures
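  • Speedup: S = (running time of the best available
    sequential algorithm) / (running time of the
    parallel algorithm)
  • Amdahl's law: $S \le \frac{1}{f + (1 - f)/N}$
  • where f is the fraction of the problem that
    must be computed sequentially
  • and N is the number of processors.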
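
Amdahl's law bounds the speedup by 1 / (f + (1 - f) / N). The following is a minimal sketch of that bound; the function name `amdahl_speedup` and the sample value f = 0.1 are illustrative choices, not taken from the slides.

```python
def amdahl_speedup(f: float, n: int) -> float:
    """Upper bound on speedup for sequential fraction f on n processors."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / (f + (1.0 - f) / n)


if __name__ == "__main__":
    # With f = 0.1 the bound flattens out below 1 / f = 10,
    # however many processors are added.
    for n in (1, 2, 8, 64, 1024):
        print(f"N={n:5d}  S<={amdahl_speedup(0.1, n):.2f}")
```

Even a small sequential fraction caps the achievable speedup: as N grows, the bound approaches 1 / f.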

6
Parallel Program Performance Measures
  • Efficiency: $\phi = S / N$
  • where S is the speedup
  • and N is the number of processors.
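
As a small worked example, efficiency can be computed from measured running times, taking speedup as sequential time divided by parallel time (the usual definition, assumed here). The timings and function names below are made up for illustration.

```python
def speedup(t_sequential: float, t_parallel: float) -> float:
    """Speedup: sequential running time over parallel running time."""
    return t_sequential / t_parallel


def efficiency(s: float, n: int) -> float:
    """Efficiency: speedup divided by the number of processors."""
    return s / n


if __name__ == "__main__":
    # Hypothetical measurements: 120 s sequentially, 20 s on 8 processors.
    s = speedup(120.0, 20.0)
    print(f"speedup={s:.2f}  efficiency={efficiency(s, 8):.2f}")
```

An efficiency of 1.0 would mean every processor is fully used; values below 1.0 reflect sequential work and coordination overhead.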

7
MIMD Architectures
  • MIMD architectures offer a great deal of
    flexibility in how parallelism is defined and
    exploited to solve a problem.
  • There are two ways in which a retrieval system
    can exploit a MIMD machine:
  • Parallel multitasking
  • Partitioned parallel processing.

8
MIMD Architectures
  • Parallel multitasking on a MIMD machine
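
A rough sketch of parallel multitasking, assuming it means that each worker runs an ordinary sequential search on a different query, so that throughput improves while the time for a single query does not. The toy index, the Boolean-AND search, and the names `evaluate` and `serve` are all illustrative; threads are used only to keep the example short, and a real system would more likely use separate processes or machines.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy inverted index: term -> set of document ids (illustrative data).
INDEX = {
    "parallel": {1, 2},
    "retrieval": {2, 3},
    "index": {1, 3},
}


def evaluate(query: str) -> list[int]:
    """Sequential Boolean-AND search; one call serves one whole query."""
    postings = [INDEX.get(term, set()) for term in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


def serve(queries: list[str], workers: int = 4) -> list[list[int]]:
    """Parallel multitasking: independent queries run on separate workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(evaluate, queries))


if __name__ == "__main__":
    print(serve(["parallel retrieval", "index", "missing"]))
```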

9
MIMD Architectures
  • Partitioned parallel processing on a MIMD machine

10
MIMD Architectures
  • Basic data elements processed by a search algorithm

11
MIMD Architectures
  • There are two possible methods for partitioning
    the data:
  • Document partitioning: the N documents are
    distributed across the P processors; each
    parallel process evaluates the query on the
    subcollection of N/P documents assigned to it
  • Term partitioning: the t indexing items are
    distributed across the P processors; the
    evaluation process for each document is spread
    over multiple processors.
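
The two schemes differ only in which axis is split: the documents or the indexing items. A minimal sketch, assuming contiguous and nearly equal-sized shares; the helper `split_evenly` and the sample document ids and terms are illustrative.

```python
def split_evenly(items: list, p: int) -> list[list]:
    """Split items into p contiguous parts whose sizes differ by at most one."""
    q, r = divmod(len(items), p)
    parts, start = [], 0
    for i in range(p):
        end = start + q + (1 if i < r else 0)
        parts.append(items[start:end])
        start = end
    return parts


docs = [1, 2, 3, 4, 5, 6]                      # N = 6 documents
terms = ["cold", "hot", "in", "not",
         "pease", "porridge", "pot", "the"]    # t = 8 indexing items

doc_parts = split_evenly(docs, 3)    # document partitioning, P = 3
term_parts = split_evenly(terms, 3)  # term partitioning, P = 3

if __name__ == "__main__":
    print(doc_parts)
    print(term_parts)
```

Under document partitioning a document lives entirely on one processor; under term partitioning the terms of a single document may be held by several processors.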

12
Inverted Files: Logical Document Partitioning
  • Data Partitioning
  • The data partitioning is done logically using
    essentially the same basic underlying inverted
    file index as in the original sequential
    algorithm
  • The inverted file is extended to give each
    parallel process direct access to that portion of
    the index related to the processor's
    subcollection of documents.

13
Inverted Files: Logical Document Partitioning
  • Extended dictionary entry for document partitioning

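One way to picture an extended dictionary entry is as a set of P offsets into a term's inverted list, one per processor. This sketch assumes that each processor owns a contiguous range of document ids and that postings are sorted by document id; the function names and the sample postings are illustrative.

```python
from bisect import bisect_left


def extended_entry(postings, boundaries):
    """Return one offset per processor into a sorted postings list.

    postings:   [(doc_id, tf), ...] sorted by doc_id
    boundaries: first doc_id of each processor's subcollection, ascending
    """
    doc_ids = [doc_id for doc_id, _ in postings]
    return [bisect_left(doc_ids, b) for b in boundaries]


def postings_for(postings, offsets, j):
    """Slice of the inverted list that belongs to processor j."""
    start = offsets[j]
    end = offsets[j + 1] if j + 1 < len(offsets) else len(postings)
    return postings[start:end]


hot = [(1, 1), (4, 1), (5, 1), (6, 1)]     # postings for one term
offsets = extended_entry(hot, [1, 3, 5])   # processors own docs 1-2, 3-4, 5-6

if __name__ == "__main__":
    for j in range(3):
        print(j, postings_for(hot, offsets, j))
```

Each process can then jump straight to its own block of the list without scanning entries that belong to other processors.
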
14
Inverted Files: Logical Document Partitioning
  • Query Evaluation
  • The broker initiates P parallel processes to
    evaluate the query
  • Each process executes the same document scoring
    algorithm on its document subcollection
  • The search processes record document scores in a
    single shared array of document score
    accumulators
  • The broker produces the final ranked list of
    documents.
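
The steps above can be sketched as follows. This is a simplified illustration, not the slides' implementation: the score is assumed to be a plain sum of term frequencies, each worker filters the shared inverted lists by its document-id range instead of following dictionary pointers, and threads stand in for parallel processes.

```python
from concurrent.futures import ThreadPoolExecutor

# Shared index: term -> [(doc_id, tf)] sorted by doc_id (illustrative data).
INDEX = {
    "hot": [(1, 1), (4, 1), (5, 1), (6, 1)],
    "cold": [(2, 1), (4, 1), (5, 1)],
}
RANGES = [(1, 2), (3, 4), (5, 6)]  # inclusive doc-id range per search process
NUM_DOCS = 6


def search_process(query_terms, lo, hi, accumulators):
    """Score only the documents in [lo, hi]; writes never overlap."""
    for term in query_terms:
        for doc_id, tf in INDEX.get(term, []):
            if lo <= doc_id <= hi:
                accumulators[doc_id] += tf


def broker(query_terms):
    """Start one search process per range, then rank the shared scores."""
    accumulators = [0] * (NUM_DOCS + 1)  # slot 0 is unused
    with ThreadPoolExecutor(max_workers=len(RANGES)) as pool:
        futures = [
            pool.submit(search_process, query_terms, lo, hi, accumulators)
            for lo, hi in RANGES
        ]
        for future in futures:
            future.result()  # propagate any worker error
    ranked = [(d, s) for d, s in enumerate(accumulators) if s > 0]
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked


if __name__ == "__main__":
    print(broker(["hot", "cold"]))
```

Because every process writes only to the accumulator slots of its own documents, the shared array needs no locking in this sketch.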

15
Inverted Files: Logical Document Partitioning
  • Inverted File Construction
  • The indexer partitions the documents among the
    processors
  • Each indexing process generates a batch of
    inverted lists, sorted by indexing item
  • A merge step is performed to create the final
    inverted file.
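
A minimal sketch of this construction scheme, assuming a simple lowercase word tokenizer and in-memory batches. The names `index_batch`, `merge_batches`, and `build` are illustrative, and threads stand in for the indexing processes.

```python
import re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def index_batch(docs):
    """Index one partition: returns [(term, [(doc_id, tf), ...])] sorted by term."""
    lists = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            lists[term][doc_id] = lists[term].get(doc_id, 0) + 1
    return sorted((term, sorted(p.items())) for term, p in lists.items())


def merge_batches(batches):
    """Merge step: concatenate the per-partition lists of each term."""
    merged = defaultdict(list)
    for batch in batches:
        for term, postings in batch:
            merged[term].extend(postings)
    return {term: sorted(p) for term, p in sorted(merged.items())}


def build(partitions):
    """Index all partitions in parallel, then merge into one inverted file."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        batches = list(pool.map(index_batch, partitions))
    return merge_batches(batches)


if __name__ == "__main__":
    partitions = [
        {1: "Pease porridge hot", 2: "Pease porridge cold"},
        {3: "Pease porridge in the pot"},
    ]
    for term, postings in build(partitions).items():
        print(term, postings)
```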

16
Inverted Files: Physical Document Partitioning
  • Data Partitioning
  • The documents are physically partitioned into
    separate subcollections, one for each parallel
    processor
  • Each subcollection has its own inverted file.

17
Inverted Files: Physical Document Partitioning
  • Query Evaluation
  • The broker distributes the query to all of the
    parallel search processes
  • Each parallel search process evaluates the query
    on its portion of the document collection,
    producing an intermediate hit-list
  • The broker collects the intermediate hit-lists
    from all of the parallel search processes and
    merges them into a final hit-list.
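
The broker's merge step can be sketched with a k-way merge, assuming each intermediate hit-list arrives already sorted by descending score. The `(score, doc_id)` layout, the sample scores, and the cutoff `k` are illustrative assumptions.

```python
import heapq
from itertools import islice


def merge_hit_lists(hit_lists, k):
    """Merge hit-lists sorted by descending score and keep the top k hits."""
    merged = heapq.merge(*hit_lists, key=lambda hit: -hit[0])
    return list(islice(merged, k))


if __name__ == "__main__":
    # One intermediate hit-list per search process: [(score, doc_id), ...]
    partial = [
        [(0.9, 3), (0.4, 1)],
        [(0.8, 7), (0.7, 5)],
        [(0.5, 9)],
    ]
    print(merge_hit_lists(partial, 3))
```

The merge is only meaningful if scores from different partitions are comparable, which is why the partitions need shared global statistics.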

18
Inverted Files: Physical Document Partitioning
  • Inverted File Construction
  • Each processor creates, in parallel, its own
    complete index corresponding to its document
    partition
  • A merge step is performed to accumulate the
    global statistics for all of the partitions and
    distribute them to each of the partition
    dictionaries.
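
As an illustration of the merge step, suppose the global statistics are document frequencies used to compute an inverse document frequency (idf) weight. That choice, the log(N / df) formula, and the function name are assumptions made for this sketch; the slides do not say which statistics are accumulated.

```python
import math
from collections import Counter


def merge_global_statistics(partition_dfs, partition_sizes):
    """Sum per-partition document frequencies and derive a global idf.

    partition_dfs:   one {term: document frequency} dict per partition
    partition_sizes: number of documents in each partition
    """
    global_df = Counter()
    for df in partition_dfs:
        global_df.update(df)  # adds counts term by term
    total_docs = sum(partition_sizes)
    return {term: math.log(total_docs / n) for term, n in global_df.items()}


if __name__ == "__main__":
    dfs = [
        {"hot": 1, "pease": 2},
        {"hot": 1, "pease": 2},
        {"hot": 2, "pease": 2},
    ]
    idf = merge_global_statistics(dfs, [2, 2, 2])
    # The same idf table would then be copied into every partition dictionary.
    print(idf)
```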

19
Inverted Files: Term Partitioning
  • Data Partitioning
  • Inverted lists are spread across the processors.

20
Inverted Files: Term Partitioning
  • Query Evaluation
  • The query is decomposed into indexing items, and
    each indexing item is sent to the processor that
    holds the corresponding inverted list
  • The processors create hit-lists with partial
    document scores and return them to the broker
  • The broker combines the hit-lists.
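
A sketch of this flow under simplifying assumptions: scores are plain sums of term frequencies, the processors are simulated by dictionaries and called one after another rather than in parallel, and the names `owner`, `partial_hit_list`, and `broker` are illustrative.

```python
from collections import defaultdict

# Each simulated processor holds the complete inverted lists of some terms.
PROCESSORS = [
    {"cold": [(2, 1), (4, 1), (5, 1)], "hot": [(1, 1), (4, 1), (5, 1), (6, 1)]},
    {"pot": [(3, 1), (6, 1)]},
]


def owner(term):
    """Index of the processor holding the term's list, or None."""
    for pid, lists in enumerate(PROCESSORS):
        if term in lists:
            return pid
    return None


def partial_hit_list(pid, terms):
    """Partial document scores computed by one processor for its terms."""
    scores = defaultdict(int)
    for term in terms:
        for doc_id, tf in PROCESSORS[pid][term]:
            scores[doc_id] += tf
    return dict(scores)


def broker(query_terms):
    """Route terms to their owners and add up the partial scores."""
    routed = defaultdict(list)
    for term in query_terms:
        pid = owner(term)
        if pid is not None:
            routed[pid].append(term)
    totals = defaultdict(int)
    for pid, terms in routed.items():
        for doc_id, partial in partial_hit_list(pid, terms).items():
            totals[doc_id] += partial
    return sorted(totals.items(), key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    print(broker(["hot", "pot"]))
```

Unlike document partitioning, the broker here must add scores for the same document coming from different processors rather than just interleave disjoint hit-lists.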

21
Inverted Files: Term Partitioning
  • Inverted File Construction
  • The inverted file is created using the parallel
    construction technique described for logical
    document partitioning.

22
Example
Document collection

  Document  Text
  1         Pease porridge hot
  2         Pease porridge cold
  3         Pease porridge in the pot
  4         Pease porridge hot, pease porridge not cold
  5         Pease porridge cold, pease porridge not hot
  6         Pease porridge hot in the pot

23
Example
Inverted File
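
For reference, an inverted file for the six-document collection can be built with a few lines of code. This sketch assumes lowercase word tokens, no stopword removal, and postings of the form (document, term frequency); the slide's own figure may use a different layout.

```python
import re
from collections import defaultdict

DOCS = {
    1: "Pease porridge hot",
    2: "Pease porridge cold",
    3: "Pease porridge in the pot",
    4: "Pease porridge hot, pease porridge not cold",
    5: "Pease porridge cold, pease porridge not hot",
    6: "Pease porridge hot in the pot",
}


def build_inverted_file(docs):
    """Map each term to its [(doc_id, tf), ...] list, sorted by doc_id."""
    counts = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            counts[term][doc_id] += 1
    return {term: sorted(counts[term].items()) for term in sorted(counts)}


if __name__ == "__main__":
    for term, postings in build_inverted_file(DOCS).items():
        print(f"{term:10s} {postings}")
```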

24
Example
  • Logical Document Partitioning

25
Example
  • Physical Document Partitioning

26
Example
  • Term Partitioning

27
Conclusion
  • The task of indexing and searching in very large
    text collections is costly
  • Faster indexing and searching algorithms are
    always desirable, and the use of parallel hardware
    is an obvious alternative
  • We discussed two possible organizations for the
    document collection index on a MIMD parallel
    architecture:
  • Document partitioning
  • Term partitioning.

28
Conclusion
  • Document partitioning affords simpler inverted
    index construction and maintenance than term
    partitioning
  • When term distributions in the documents and
    queries are more skewed, document partitioning
    performs better
  • When terms are uniformly distributed in user
    queries, term partitioning performs better.

29
Additional References
  • Lawrence, S., Giles, C.L. 1999. Accessibility of
    Information on the Web. Nature, Vol. 400,
    pp. 107-109.
  • Ribeiro-Neto, B.A., Barbosa, R.A. 1998. Query
    Performance for Tightly Coupled Distributed
    Digital Libraries. Digital Libraries 98,
    pp. 182-190.
  • Ribeiro-Neto, B.A., Moura, E.S., Neubert, M.S.,
    Ziviani, N. 1999. Efficient Distributed
    Algorithms to Build Inverted Files. SIGIR 99,
    pp. 105-112.