Modern Information Retrieval

Modern Information Retrieval. Chapter 9: Parallel and Distributed IR. Section 9.1: Introduction. Section 9.2.2: MIMD Architectures. Inverted Files. November 5, 1999.

Provided by: bert9

1
Modern Information Retrieval

Chapter 9: Parallel and Distributed IR
Section 9.1: Introduction
Section 9.2.2: MIMD Architectures
Inverted Files
November 5, 1999

2
Summary

Introduction
Review of parallel computing and parallel program
performance measures
Exploration of techniques for implementing
inverted files on a MIMD parallel architecture
Conclusion

3
Introduction

The volume of electronic text available online
today is staggering.
The WWW contains over 800 million pages of text,
comprising nearly 6 terabytes of data (Nature,
Vol. 400, 8 July 1999, www.nature.com).
As document collections grow larger, they become
more expensive to manage with an information
retrieval system.
To support the demanding requirements of modern
search environments, we must turn to alternative
architectures and algorithms.

4
Parallel Computing

Parallel computing is the simultaneous application
of multiple processors to solve a single problem.
Flynn's Taxonomy
SISD: single instruction, single data
SIMD: single instruction, multiple data
MISD: multiple instruction, single data
MIMD: multiple instruction, multiple data

5
Parallel Program Performance Measures

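Speedup
$S = \frac{\text{running time of the best sequential algorithm}}{\text{running time of the parallel algorithm}}$
Amdahl's Law
$S \le \frac{1}{f + (1 - f)/N}$
where f is the fraction of the problem that
must be computed sequentially and
N is the number of processors.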
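
As a rough illustration of Amdahl's Law, the bound 1 / (f + (1 - f)/N) can be evaluated for a few processor counts. This is a minimal sketch; the function name and the sample value f = 0.1 are assumptions chosen for illustration, not figures from the presentation.

```python
def amdahl_speedup_bound(f: float, n: int) -> float:
    """Upper bound on speedup when a fraction f of the work is sequential."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / (f + (1.0 - f) / n)


# Hypothetical case: 10% of the problem must run sequentially.
for n in (1, 4, 16, 64):
    print(n, round(amdahl_speedup_bound(0.1, n), 2))
```

However many processors are added, the bound never exceeds 1/f, which is why the sequential fraction dominates at large N.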

6
Parallel Program Performance Measures

Efficiency
$\text{Efficiency} = \frac{S}{N}$
where S is speedup and
N is the number of processors.
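
Efficiency is conventionally the speedup divided by the number of processors. The sketch below computes both measures from wall-clock timings; the timings (120 s sequential, 20 s on 8 processors) are invented for illustration.

```python
def speedup(t_sequential: float, t_parallel: float) -> float:
    """Ratio of sequential running time to parallel running time."""
    return t_sequential / t_parallel


def efficiency(s: float, n: int) -> float:
    """Speedup per processor; 1.0 would mean ideal linear scaling."""
    return s / n


# Hypothetical measurements, in seconds.
s = speedup(120.0, 20.0)
e = efficiency(s, 8)
print(f"speedup={s:.2f}, efficiency={e:.2f}")
```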

7
MIMD Architectures

MIMD architectures offer a great deal of
flexibility in how parallelism is defined and
exploited to solve a problem.
There are two ways in which a retrieval system
can exploit a MIMD machine:
Parallel multitasking
Partitioned parallel processing.

8
MIMD Architectures

[Figure: Parallel multitasking on a MIMD machine]

9
MIMD Architectures

[Figure: Partitioned parallel processing on a MIMD machine]

10
MIMD Architectures

[Figure: Basic data elements processed by a search algorithm]

11
MIMD Architectures

There are two possible methods for partitioning
the data:
Document partitioning: the N documents are
distributed across the P processors; each
parallel process evaluates the query on the
subcollection of N/P documents assigned to it
Term partitioning: the t indexing items are
distributed across the P processors; the
evaluation process for each document is spread
over multiple processors.
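
The two partitioning methods can be contrasted with a small sketch. The round-robin assignment used here is an assumption made for illustration; the slides do not say how documents or indexing items are assigned to processors.

```python
def partition_documents(doc_ids, p):
    """Document partitioning: spread N documents over p processors (about N/p each)."""
    parts = [[] for _ in range(p)]
    for i, doc_id in enumerate(doc_ids):
        parts[i % p].append(doc_id)
    return parts


def partition_terms(terms, p):
    """Term partitioning: spread the t indexing items over p processors."""
    parts = [[] for _ in range(p)]
    for i, term in enumerate(sorted(terms)):
        parts[i % p].append(term)
    return parts


doc_parts = partition_documents([1, 2, 3, 4, 5, 6], 2)
term_parts = partition_terms(
    {"pease", "porridge", "hot", "cold", "in", "the", "pot", "not"}, 2
)
print(doc_parts)
print(term_parts)
```

Under document partitioning every processor sees all terms but only some documents; under term partitioning every processor sees all documents but only some terms.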

12
Inverted Files: Logical Document Partitioning

Data Partitioning
The data partitioning is done logically using
essentially the same basic underlying inverted
file index as in the original sequential
algorithm
The inverted file is extended to give each
parallel process direct access to that portion of
the index related to the processor's
subcollection of documents.
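
One way to picture the extension is a dictionary entry that stores, for each term, one pair of offsets per processor into the term's inverted list. The sketch below assumes that each processor owns a contiguous range of document ids and that inverted lists are sorted by document id; both are assumptions, and `extend_entry` is a hypothetical helper name.

```python
from bisect import bisect_left


def extend_entry(postings, boundaries):
    """Compute per-processor (start, end) offsets into one inverted list.

    postings: sorted document ids for one term.
    boundaries: first document id of each processor's contiguous range.
    """
    starts = [bisect_left(postings, b) for b in boundaries]
    ends = starts[1:] + [len(postings)]
    return list(zip(starts, ends))


postings_hot = [1, 4, 5, 6]               # documents containing "hot"
entry = extend_entry(postings_hot, [1, 4])  # P = 2: docs 1-3 and docs 4-6

for pid, (start, end) in enumerate(entry):
    print(pid, postings_hot[start:end])
```

With these offsets, a search process can jump straight to its own slice of the list without scanning the postings of other subcollections.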

13
Inverted Files: Logical Document Partitioning

[Figure: Extended dictionary entry for document partitioning]

14
Inverted Files: Logical Document Partitioning

Query Evaluation
The broker initiates P parallel processes to
evaluate the query
Each process executes the same document scoring
algorithm on its document subcollection
The search processes record document scores in a
single shared array of document score
accumulators
The broker produces the final ranked list of
documents.
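
The evaluation steps above can be sketched with threads that write into one shared accumulator array. The scoring function (summed term frequency), the toy index, and the document ranges are assumptions for illustration; filtering the list by range stands in for the direct access an extended dictionary would give.

```python
from concurrent.futures import ThreadPoolExecutor

# term -> list of (doc_id, term_frequency), sorted by doc_id
INDEX = {
    "hot": [(1, 1), (4, 1), (5, 1), (6, 1)],
    "cold": [(2, 1), (4, 1), (5, 1)],
}
NUM_DOCS = 6
RANGES = [(1, 3), (4, 6)]  # inclusive doc-id range per search process (P = 2)


def search(query, doc_range, accumulators):
    """Score only the documents in doc_range; slots are disjoint across workers."""
    lo, hi = doc_range
    for term in query:
        for doc_id, tf in INDEX.get(term, []):
            if lo <= doc_id <= hi:
                accumulators[doc_id] += tf


def evaluate(query):
    """Broker: start P search processes, then rank the shared accumulators."""
    accumulators = [0] * (NUM_DOCS + 1)  # slot 0 is unused
    with ThreadPoolExecutor(max_workers=len(RANGES)) as pool:
        futures = [pool.submit(search, query, r, accumulators) for r in RANGES]
        for future in futures:
            future.result()  # propagate any worker exception
    hits = [(doc, score) for doc, score in enumerate(accumulators) if score > 0]
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))


print(evaluate(["hot", "cold"]))
```

Because each worker only touches accumulator slots for its own subcollection, no locking is needed in this sketch.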

15
Inverted Files: Logical Document Partitioning

Inverted File Construction
The indexer partitions the documents among the
processors
Each indexing process generates a batch of
inverted lists, sorted by indexing item
A merge step is performed to create the final
inverted file.
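
A minimal sketch of this construction, assuming each indexing process emits (term, document id) pairs sorted by term and that the merge is a plain multi-way merge. The tokenizer and function names are hypothetical, and the batches are built sequentially here rather than in parallel.

```python
import heapq
from itertools import groupby


def index_batch(docs):
    """One indexing process: return sorted, de-duplicated (term, doc_id) pairs."""
    pairs = {
        (term, doc_id)
        for doc_id, text in docs.items()
        for term in text.lower().replace(",", " ").split()
    }
    return sorted(pairs)


def merge_batches(batches):
    """Merge step: combine the sorted batches into one inverted file."""
    merged = heapq.merge(*batches)
    return {
        term: [doc_id for _, doc_id in group]
        for term, group in groupby(merged, key=lambda pair: pair[0])
    }


batch_a = index_batch({1: "Pease porridge hot", 2: "Pease porridge cold"})
batch_b = index_batch({3: "Pease porridge in the pot"})
inverted = merge_batches([batch_a, batch_b])
print(inverted)
```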

16
Inverted Files: Physical Document Partitioning

Data Partitioning
The documents are physically partitioned into
separate subcollections, one for each parallel
processor
Each subcollection has its own inverted file.

17
Inverted Files: Physical Document Partitioning

Query Evaluation
The broker distributes the query to all of the
parallel search processes
Each parallel search process evaluates the query
on its portion of the document collection,
producing an intermediate hit-list
The broker collects the intermediate hit-lists
from all of the parallel search processes and
merges them into a final hit-list.
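
The broker's role can be sketched as a merge of per-partition hit-lists that are already sorted by score. The scoring (summed term frequency), the two toy partition indexes, and the cutoff k are assumptions; the partitions are searched in a loop here as a stand-in for parallel processes.

```python
import heapq
from itertools import islice


def rank_key(hit):
    """Order hits by descending score, then ascending document id."""
    score, doc_id = hit
    return (-score, doc_id)


def search_partition(local_index, query):
    """One search process: return an intermediate hit-list of (score, doc_id)."""
    scores = {}
    for term in query:
        for doc_id, tf in local_index.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0) + tf
    return sorted(((s, d) for d, s in scores.items()), key=rank_key)


def broker(partitions, query, k):
    """Send the query to every partition and merge the sorted hit-lists."""
    hit_lists = [search_partition(p, query) for p in partitions]
    return list(islice(heapq.merge(*hit_lists, key=rank_key), k))


partitions = [
    {"hot": [(1, 1)], "cold": [(2, 1)]},                          # docs 1-3
    {"hot": [(4, 1), (5, 1), (6, 1)], "cold": [(4, 1), (5, 1)]},  # docs 4-6
]
print(broker(partitions, ["hot", "cold"], 3))
```

Since each intermediate list is already ranked, the broker only needs to read as many entries as the final hit-list requires.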

18
Inverted Files: Physical Document Partitioning

Inverted File Construction
Each processor creates, in parallel, its own
complete index corresponding to its document
partition
A merge step is performed to accumulate the
global statistics for all of the partitions and
distribute them to each of the partition
dictionaries.
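
As an illustration of the merge step, suppose the global statistics are document frequencies and the collection size, the kind of values an idf-style weight would need. Which statistics are actually merged is not stated on the slide, so this is an assumption, as are the names below.

```python
from collections import Counter


def merge_statistics(local_dfs, local_doc_counts):
    """Sum per-partition document frequencies and document counts."""
    global_df = Counter()
    for df in local_dfs:
        global_df.update(df)  # Counter.update adds counts
    return global_df, sum(local_doc_counts)


# Local document frequencies for two partitions of three documents each.
local_dfs = [
    Counter({"hot": 1, "cold": 1, "pot": 1}),
    Counter({"hot": 3, "cold": 2, "pot": 1}),
]
global_df, n_docs = merge_statistics(local_dfs, [3, 3])

# Distribute: every partition dictionary receives the same global values.
partition_dictionaries = [dict(global_df) for _ in local_dfs]
print(n_docs, partition_dictionaries[0])
```

With global values in every dictionary, scores computed by different partitions are comparable when the broker merges hit-lists.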

19
Inverted Files: Term Partitioning

Data Partitioning
Inverted lists are spread across the processors.

20
Inverted Files: Term Partitioning

Query Evaluation
The query is decomposed into indexing items and each
indexing item is sent to the processor that holds
the corresponding inverted list
The processors create hit-lists with partial
document scores and return them to the broker
The broker combines the hit-lists.
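
A minimal sketch of term-partitioned evaluation: the broker routes each query term to the processor holding its inverted list, and then adds up the partial scores it receives. The placement of terms, the scoring (summed term frequency), and all names are assumptions for illustration.

```python
from collections import defaultdict

# Each processor holds complete inverted lists for its own terms.
PROCESSORS = [
    {"cold": [(2, 1), (4, 1), (5, 1)]},
    {"hot": [(1, 1), (4, 1), (5, 1), (6, 1)]},
]
TERM_OWNER = {term: pid for pid, lists in enumerate(PROCESSORS) for term in lists}


def partial_hits(pid, terms):
    """One processor: partial document scores for the terms it holds."""
    scores = defaultdict(int)
    for term in terms:
        for doc_id, tf in PROCESSORS[pid][term]:
            scores[doc_id] += tf
    return dict(scores)


def broker(query):
    """Decompose the query, route terms, and combine partial hit-lists."""
    routed = defaultdict(list)
    for term in query:
        if term in TERM_OWNER:
            routed[TERM_OWNER[term]].append(term)
    total = defaultdict(int)
    for pid, terms in routed.items():
        for doc_id, score in partial_hits(pid, terms).items():
            total[doc_id] += score
    return sorted(total.items(), key=lambda pair: (-pair[1], pair[0]))


print(broker(["hot", "cold"]))
```

Unlike document partitioning, a document's final score is only known at the broker, because its terms may live on different processors.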

21
Inverted Files: Term Partitioning

Inverted File Construction
The inverted file is created using the parallel
construction technique described for logical
document partitioning.

22
Example

Document collection

Document  Text
1         Pease porridge hot
2         Pease porridge cold
3         Pease porridge in the pot
4         Pease porridge hot, pease porridge not cold
5         Pease porridge cold, pease porridge not hot
6         Pease porridge hot in the pot
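
For reference, a sequential inverted file for this collection can be built in a few lines. The layout used here (term mapped to a list of document id and term frequency pairs) is one conventional choice and an assumption; the exact layout of the slide's figure is not recoverable from the transcript.

```python
import re
from collections import defaultdict

DOCS = {
    1: "Pease porridge hot",
    2: "Pease porridge cold",
    3: "Pease porridge in the pot",
    4: "Pease porridge hot, pease porridge not cold",
    5: "Pease porridge cold, pease porridge not hot",
    6: "Pease porridge hot in the pot",
}


def build_inverted_file(docs):
    """Map each term to a sorted list of (doc_id, term_frequency) pairs."""
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            postings[term][doc_id] = postings[term].get(doc_id, 0) + 1
    return {term: sorted(postings[term].items()) for term in sorted(postings)}


inverted_file = build_inverted_file(DOCS)
for term, plist in inverted_file.items():
    print(term, plist)
```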

23
Example

[Figure: Inverted File]

24
Example

[Figure: Logical Document Partitioning]

25
Example

[Figure: Physical Document Partitioning]

26
Example

[Figure: Term Partitioning]

27
Conclusion

The task of indexing and searching in very large
text collections is costly
Faster indexing and searching algorithms are
always desirable, and the use of parallel hardware
is an obvious alternative
We discussed two possible organizations for the
document collection index on a MIMD parallel
architecture:
Document partitioning
Term partitioning.

28
Conclusion

Document partitioning affords simpler inverted
index construction and maintenance than term
partitioning
When term distributions in the documents and
queries are more skewed, document partitioning
performs better
When terms are uniformly distributed in user
queries, term partitioning performs better.

29
Additional References

Lawrence, S., Giles, C.L. 1999. Accessibility of
Information on the Web. Nature. Vol. 400.
pp. 107-109.
Ribeiro-Neto, B.A., Barbosa, R.A. 1998. Query
Performance for Tightly Coupled Distributed
Digital Libraries. Digital Libraries 98.
pp. 182-190.
Ribeiro-Neto, B.A., Moura, E.S., Neubert, M.S.,
Ziviani, N. 1999. Efficient Distributed Algorithms
to Build Inverted Files. SIGIR 99. pp. 105-112.
