Title: Modern Information Retrieval
1Modern Information Retrieval
- Chapter 9 Parallel and Distributed IR
- Section 9.1 Introduction
- Section 9.2.2. MIMD Architectures
- Inverted Files
- November 5, 1999
2Summary
- Introduction
- Review of parallel computing and parallel program
performance measures
- Exploration of techniques for implementing
inverted files on MIMD parallel architectures
- Conclusion
3Introduction
- The volume of electronic text available online
today is staggering.
- The WWW contains over 800 million pages of text,
comprising nearly 6 terabytes of data (Nature, Vol.
400, 8 July 1999, www.nature.com).
- As document collections grow larger, they become
more expensive to manage with an information
retrieval system.
- To support the demanding requirements of modern
search environments, we must turn to alternative
architectures and algorithms.
4Parallel Computing
- Parallel computing is the simultaneous application
of multiple processors to solve a single problem.
- Flynn's taxonomy:
- SISD: single instruction, single data
- SIMD: single instruction, multiple data
- MISD: multiple instruction, single data
- MIMD: multiple instruction, multiple data
5Parallel Program Performance Measures
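- Speedup: $S = T_{\text{sequential}} / T_{\text{parallel}}$,
the running time of the best available sequential
algorithm divided by the running time of the
parallel algorithm
- Amdahl's Law: $S \le \frac{1}{f + (1 - f)/N}$
- where f is the fraction of the problem that
must be computed sequentially
- N is the number of processors.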
6Parallel Program Performance Measures
- Efficiency: $S / N$
- where S is the speedup
- N is the number of processors.
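As a quick numeric illustration of the two measures, the sketch below evaluates the Amdahl bound and the resulting efficiency. The function names `amdahl_speedup` and `efficiency` are illustrative only, and the bound is the usual textbook form of Amdahl's Law, 1 / (f + (1 - f) / N).

```python
def amdahl_speedup(f: float, n: int) -> float:
    """Upper bound on speedup when a fraction f of the work is sequential."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / (f + (1.0 - f) / n)


def efficiency(speedup: float, n: int) -> float:
    """Efficiency: speedup divided by the number of processors."""
    return speedup / n


if __name__ == "__main__":
    # Assumed example: 10% of the work cannot be parallelized.
    for n in (1, 4, 16, 64):
        s = amdahl_speedup(0.1, n)
        print(f"N={n:3d}  speedup<={s:6.2f}  efficiency<={efficiency(s, n):5.2f}")
```

With f = 0.1 the bound can never exceed 1 / f = 10 however many processors are added, so efficiency falls steadily as N grows.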
7MIMD Architectures
- MIMD architectures offer a great deal of
flexibility in how parallelism is defined and
exploited to solve a problem.
- There are two ways in which a retrieval system
can exploit a MIMD machine:
- Parallel multitasking
- Partitioned parallel processing.
8MIMD Architectures
- Parallel multitasking on a MIMD machine
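A minimal sketch of parallel multitasking, assuming each query is an independent task that runs the unchanged sequential search code on its own worker. The toy collection and the names `search` and `run_queries` are hypothetical. Threads are used only to keep the example short; on a real MIMD machine each task would occupy its own processor (for instance via a process pool).

```python
from concurrent.futures import ThreadPoolExecutor

# Toy collection, made up for illustration.
DOCS = {
    1: "pease porridge hot",
    2: "pease porridge cold",
    3: "pease porridge in the pot",
}


def search(query: str) -> list[int]:
    """Sequential search: ids of documents containing every query term."""
    terms = query.lower().split()
    return sorted(
        doc_id
        for doc_id, text in DOCS.items()
        if all(t in text.split() for t in terms)
    )


def run_queries(queries: list[str], workers: int = 4) -> list[list[int]]:
    """Parallel multitasking: one independent search task per query."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input queries.
        return list(pool.map(search, queries))


if __name__ == "__main__":
    print(run_queries(["hot", "porridge pot", "cold"]))
```

This arrangement raises query throughput but does not make any single query faster, since each query is still evaluated sequentially.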
9MIMD Architectures
Partitioned parallel processing on a MIMD machine
10MIMD Architectures
Basic data elements processed by a search algorithm
11MIMD Architectures
- There are two possible methods for partitioning
the data:
- Document partitioning: the N documents are
distributed across the P processors; each
parallel process evaluates the query on the
subcollection of N/P documents assigned to it
- Term partitioning: the t indexing items are
distributed across the P processors; the
evaluation process for each document is spread
over multiple processors.
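The two partitioning methods can be sketched as follows. The assignment policies are assumptions chosen for simplicity (contiguous blocks of document ids, round-robin over the sorted terms), and the names `partition_documents` and `partition_terms` are hypothetical.

```python
def partition_documents(doc_ids: list[int], p: int) -> list[list[int]]:
    """Document partitioning: contiguous blocks of about N/P documents each."""
    size = -(-len(doc_ids) // p)  # ceiling of N / P
    return [doc_ids[i * size:(i + 1) * size] for i in range(p)]


def partition_terms(terms: list[str], p: int) -> list[list[str]]:
    """Term partitioning: deal the sorted indexing items out round-robin."""
    ordered = sorted(terms)
    return [ordered[i::p] for i in range(p)]


if __name__ == "__main__":
    print(partition_documents([1, 2, 3, 4, 5, 6], 3))
    print(partition_terms(["pot", "hot", "cold", "pease"], 2))
```

Under document partitioning a processor sees every term but only some documents; under term partitioning it sees every document but only some terms.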
12Inverted Files: Logical Document Partitioning
- Data Partitioning
- The data partitioning is done logically using
essentially the same basic underlying inverted
file index as in the original sequential
algorithm
- The inverted file is extended to give each
parallel process direct access to that portion of
the index related to the processor's
subcollection of documents.
13Inverted Files: Logical Document Partitioning
Extended dictionary entry for document partitioning
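One way to picture the extended dictionary entry, assuming each inverted list is sorted by document id and each processor owns a contiguous range of ids: the entry stores one offset per processor into the shared list. The layout and the names `DictEntry`, `extend_entry`, and `slice_for` are illustrative assumptions, not the structure shown on the slide.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass
class DictEntry:
    term: str
    postings: list[tuple[int, int]]  # (doc_id, term frequency), sorted by doc_id
    offsets: list[int]               # offsets[k] = start of processor k's slice


def extend_entry(term: str, postings: list[tuple[int, int]],
                 boundaries: list[int]) -> DictEntry:
    """boundaries[k] is the first doc_id owned by processor k (ascending)."""
    ids = [doc_id for doc_id, _ in postings]
    offsets = [bisect_left(ids, b) for b in boundaries] + [len(postings)]
    return DictEntry(term, postings, offsets)


def slice_for(entry: DictEntry, k: int) -> list[tuple[int, int]]:
    """Portion of the inverted list that belongs to processor k."""
    return entry.postings[entry.offsets[k]:entry.offsets[k + 1]]


if __name__ == "__main__":
    # Three processors owning documents 1-2, 3-4, and 5-6.
    entry = extend_entry("hot", [(1, 1), (4, 1), (5, 1), (6, 1)], [1, 3, 5])
    for k in range(3):
        print(k, slice_for(entry, k))
```

The index itself stays a single inverted file; only the dictionary grows, by one pointer per processor and term.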
14Inverted Files: Logical Document Partitioning
- Query Evaluation
- The broker initiates P parallel processes to
evaluate the query
- Each process executes the same document scoring
algorithm on its document subcollection
- The search processes record document scores in a
single shared array of document score
accumulators
- The broker produces the final ranked list of
documents.
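The query evaluation steps above can be sketched as follows, assuming a toy index, raw term frequency as the score, and contiguous document ranges per worker. The names `INDEX`, `score_range`, and `broker` are hypothetical, and threads stand in for the parallel search processes.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy index: term -> postings of (doc_id, term frequency). Values are illustrative.
INDEX = {
    "hot": [(1, 1), (4, 1), (5, 1), (6, 1)],
    "pot": [(3, 1), (6, 1)],
}
NUM_DOCS = 6


def score_range(terms: list[str], lo: int, hi: int, acc: list[int]) -> None:
    """Search process: score only the documents lo..hi (inclusive) it owns."""
    for term in terms:
        for doc_id, tf in INDEX.get(term, []):
            if lo <= doc_id <= hi:
                acc[doc_id] += tf  # each worker writes only its own slots


def broker(terms: list[str], ranges: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Start one search process per range, then rank the shared accumulators."""
    acc = [0] * (NUM_DOCS + 1)  # shared accumulator array; slot 0 is unused
    with ThreadPoolExecutor(max_workers=len(ranges)) as pool:
        futures = [pool.submit(score_range, terms, lo, hi, acc) for lo, hi in ranges]
        for future in futures:
            future.result()  # re-raise any worker error
    hits = [(d, acc[d]) for d in range(1, NUM_DOCS + 1) if acc[d] > 0]
    return sorted(hits, key=lambda h: (-h[1], h[0]))


if __name__ == "__main__":
    print(broker(["hot", "pot"], [(1, 2), (3, 4), (5, 6)]))
```

Because the document ranges are disjoint, no two workers ever update the same accumulator, so the shared array needs no locking in this sketch.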
15Inverted Files: Logical Document Partitioning
- Inverted File Construction
- The indexer partitions the documents among the
processors
- Each indexing process generates a batch of
inverted lists, sorted by indexing item
- A merge step is performed to create the final
inverted file.
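A minimal sketch of this construction scheme, assuming each batch is a list of (term, doc_id, tf) triples sorted by term and that the merge is a simple k-way merge of the sorted batches. The function names and the tokenization are illustrative assumptions.

```python
import heapq
from collections import defaultdict


def index_batch(docs: dict[int, str]) -> list[tuple[str, int, int]]:
    """One indexing process: (term, doc_id, tf) triples sorted by indexing item."""
    counts: dict[tuple[str, int], int] = defaultdict(int)
    for doc_id, text in docs.items():
        for token in text.lower().replace(",", " ").split():
            counts[(token, doc_id)] += 1
    return sorted((term, doc_id, tf) for (term, doc_id), tf in counts.items())


def merge_batches(batches: list[list[tuple[str, int, int]]]) -> dict[str, list[tuple[int, int]]]:
    """Merge step: k-way merge of the sorted batches into one inverted file."""
    inverted: dict[str, list[tuple[int, int]]] = {}
    for term, doc_id, tf in heapq.merge(*batches):
        inverted.setdefault(term, []).append((doc_id, tf))
    return inverted


if __name__ == "__main__":
    batch_a = index_batch({1: "Pease porridge hot", 2: "Pease porridge cold"})
    batch_b = index_batch({3: "Pease porridge in the pot"})
    for term, postings in merge_batches([batch_a, batch_b]).items():
        print(term, postings)
```

In this sketch the batches could be built concurrently, since each indexing process touches only its own documents; only the merge needs all of them.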
16Inverted Files: Physical Document Partitioning
- Data Partitioning
- The documents are physically partitioned into
separate subcollections, one for each parallel
processor
- Each subcollection has its own inverted file.
17Inverted Files: Physical Document Partitioning
- Query Evaluation
- The broker distributes the query to all of the
parallel search processes
- Each parallel search process evaluates the query
on its portion of the document collection,
producing an intermediate hit-list
- The broker collects the intermediate hit-lists
from all of the parallel search processes and
merges them into a final hit-list.
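The broker's merge can be sketched as below, assuming every partition returns its hit-list already sorted by descending score and that raw term frequency is the score. The names `search_partition` and `broker_merge` are hypothetical; the partitions are searched one after another here only to keep the example short.

```python
import heapq
from itertools import islice

Index = dict[str, list[tuple[int, int]]]  # term -> [(doc_id, tf)]


def search_partition(index: Index, terms: list[str]) -> list[tuple[int, int]]:
    """Search process: (score, doc_id) hit-list for one subcollection."""
    scores: dict[int, int] = {}
    for term in terms:
        for doc_id, tf in index.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0) + tf
    return sorted(((s, d) for d, s in scores.items()), key=lambda h: (-h[0], h[1]))


def broker_merge(hit_lists: list[list[tuple[int, int]]], k: int) -> list[tuple[int, int]]:
    """Broker: merge sorted intermediate hit-lists and keep the top k."""
    merged = heapq.merge(*hit_lists, key=lambda h: (-h[0], h[1]))
    return list(islice(merged, k))


if __name__ == "__main__":
    partitions: list[Index] = [
        {"hot": [(1, 1)], "cold": [(2, 1)]},
        {"pot": [(3, 1)], "hot": [(4, 1)], "cold": [(4, 1)]},
        {"hot": [(5, 1), (6, 1)], "pot": [(6, 1)], "cold": [(5, 1)]},
    ]
    lists = [search_partition(p, ["hot", "pot"]) for p in partitions]
    print(broker_merge(lists, 3))
```

Scores from different partitions are only comparable if they are computed with the same collection-wide statistics, which is why the construction step distributes global statistics to every partition.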
18Inverted Files: Physical Document Partitioning
- Inverted File Construction
- Each processor creates, in parallel, its own
complete index corresponding to its document
partition
- A merge step is performed to accumulate the
global statistics for all of the partitions and
distribute them to each of the partition
dictionaries.
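A small sketch of the statistics merge, assuming the global statistic of interest is document frequency (the number of documents containing a term, as needed for idf-style weights). The names `local_df`, `merge_global_stats`, and `distribute` are hypothetical.

```python
from collections import Counter

Index = dict[str, list[int]]  # term -> doc_ids in this partition


def local_df(index: Index) -> Counter:
    """Document frequency of each term within one partition."""
    return Counter({term: len(doc_ids) for term, doc_ids in index.items()})


def merge_global_stats(partitions: list[Index]) -> Counter:
    """Merge step: sum the local document frequencies over all partitions."""
    total: Counter = Counter()
    for index in partitions:
        total.update(local_df(index))
    return total


def distribute(partitions: list[Index], global_df: Counter) -> list[dict[str, int]]:
    """Give each partition dictionary the global value for the terms it holds."""
    return [{term: global_df[term] for term in index} for index in partitions]


if __name__ == "__main__":
    partitions = [{"hot": [1], "cold": [2]}, {"hot": [4], "pot": [3]}]
    global_df = merge_global_stats(partitions)
    print(distribute(partitions, global_df))
```

After this step each partition can weight terms as if it had seen the whole collection, so locally computed scores remain comparable at the broker.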
19Inverted Files: Term Partitioning
- Data Partitioning
- Inverted lists are spread across the processors.
20Inverted Files: Term Partitioning
- Query Evaluation
- The query is decomposed into indexing items, and
each indexing item is sent to the processor that
holds the corresponding inverted list
- The processors create hit-lists with partial
document scores and return them to the broker
- The broker combines the hit-lists.
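A minimal sketch of term-partitioned evaluation, assuming two shards that each hold complete inverted lists for a disjoint set of terms and raw term frequency as the partial score. `SHARDS`, `route`, `partial_scores`, and `evaluate` are hypothetical names, and the shards are visited sequentially for brevity.

```python
from collections import Counter

# Each shard holds whole inverted lists for its own terms: term -> [(doc_id, tf)].
SHARDS = [
    {"cold": [(2, 1), (4, 1), (5, 1)], "hot": [(1, 1), (4, 1), (5, 1), (6, 1)]},
    {"not": [(4, 1), (5, 1)], "pot": [(3, 1), (6, 1)]},
]


def route(terms: list[str]) -> dict[int, list[str]]:
    """Broker: send each indexing item to the shard that holds its list."""
    plan: dict[int, list[str]] = {}
    for term in terms:
        for k, shard in enumerate(SHARDS):
            if term in shard:
                plan.setdefault(k, []).append(term)
    return plan


def partial_scores(shard: dict, terms: list[str]) -> Counter:
    """Processor: partial document scores from the lists it owns."""
    scores: Counter = Counter()
    for term in terms:
        for doc_id, tf in shard[term]:
            scores[doc_id] += tf
    return scores


def evaluate(terms: list[str]) -> list[tuple[int, int]]:
    """Broker: combine the partial hit-lists by summing scores per document."""
    total: Counter = Counter()
    for k, shard_terms in route(terms).items():
        total.update(partial_scores(SHARDS[k], shard_terms))
    return sorted(total.items(), key=lambda hit: (-hit[1], hit[0]))


if __name__ == "__main__":
    print(evaluate(["hot", "pot"]))
```

Unlike document partitioning, a document's final score is not known to any single processor here; the broker must add up the partial scores before ranking.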
21Inverted Files: Term Partitioning
- Inverted File Construction
- Inverted file is created using the parallel
construction technique described for logical
document partitioning.
22Example
Document collection
Document  Text
1         Pease porridge hot
2         Pease porridge cold
3         Pease porridge in the pot
4         Pease porridge hot, pease porridge not cold
5         Pease porridge cold, pease porridge not hot
6         Pease porridge hot in the pot
23Example
Inverted File
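As a rough reconstruction of what an inverted file for this collection contains, the sketch below builds one from the six example documents. The posting layout of (document id, term frequency) pairs and the lowercase, letters-only tokenization are assumptions; the original figure may have used a different layout.

```python
import re
from collections import defaultdict

DOCS = {
    1: "Pease porridge hot",
    2: "Pease porridge cold",
    3: "Pease porridge in the pot",
    4: "Pease porridge hot, pease porridge not cold",
    5: "Pease porridge cold, pease porridge not hot",
    6: "Pease porridge hot in the pot",
}


def build_inverted_file(docs: dict[int, str]) -> dict[str, list[tuple[int, int]]]:
    """Map each term to its postings: (doc_id, term frequency), sorted by doc_id."""
    counts: dict[str, dict[int, int]] = defaultdict(dict)
    for doc_id, text in docs.items():
        for token in re.findall(r"[a-z]+", text.lower()):
            counts[token][doc_id] = counts[token].get(doc_id, 0) + 1
    return {term: sorted(p.items()) for term, p in sorted(counts.items())}


if __name__ == "__main__":
    for term, postings in build_inverted_file(DOCS).items():
        print(f"{term:10s} {postings}")
```

The same dictionary can then be split by document id ranges or by term to obtain the partitioned layouts discussed on the preceding slides.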
24Example
Logical Document Partitioning
25Example
Physical Document Partitioning
26Example
Term Partitioning
27Conclusion
- The task of indexing and searching in very large
text collections is costly
- Faster indexing and searching algorithms are
always desirable, and the use of parallel hardware
is an obvious alternative
- We discussed two possible organizations for the
document collection index on a MIMD parallel
architecture:
- Document partitioning
- Term partitioning.
28Conclusion
- Document partitioning affords simpler inverted
index construction and maintenance than term
partitioning
- When term distributions in the documents and
queries are more skewed, document partitioning
performs better
- When terms are uniformly distributed in user
queries, term partitioning performs better.
29Additional References
Lawrence, S., Giles, C.L. 1999. Accessibility of
Information on the Web. Nature. Vol. 400. pp. 107-109.
Ribeiro-Neto, B.A., Barbosa, R.A. 1998. Query
Performance for Tightly Coupled Distributed Digital
Libraries. Digital Libraries '98. pp. 182-190.
Ribeiro-Neto, B.A., Moura, E.S., Neubert, M.S.,
Ziviani, N. 1999. Efficient Distributed Algorithms
to Build Inverted Files. SIGIR '99. pp. 105-112.