Special Topics in Computer Science Advanced Topics in Information Retrieval Lecture 7 (book chapter 9): Parallel and Distributed IR

1
Special Topics in Computer Science: Advanced Topics in Information Retrieval
Lecture 7 (book chapter 9): Parallel and Distributed IR

Alexander Gelbukh
www.Gelbukh.com

2
Previous Chapter: Conclusions

How to accelerate search? Same results as sequential search
Ideas:
Quick-and-dirty rejection of bad objects, with 100% recall
Fast data structure for search (based on clustering)
Careful check of all found candidates
Solution: mapping into a lower-dimensional feature space
Condition: lower-bounding of the distance
Assumption: skewed spectrum distribution
Few coefficients concentrate the energy; the rest are less important
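
The filter-and-refine idea above can be sketched in a few lines of Python. This is a minimal illustration, not the method from the book: the "feature space" is assumed to be simply the first k coordinates of each vector, which lower-bounds the Euclidean distance because dropping coordinates can only shrink it. The names `dist`, `range_search`, and the toy data are hypothetical, and the filter here is a linear scan rather than a clustering-based structure.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def range_search(objects, query, radius, k=2):
    """Return indices of all objects within `radius` of `query`.

    Filter step: compare only the first k coordinates. This distance never
    exceeds the full distance, so no true answer is rejected (100% recall).
    Refine step: check every surviving candidate with the full distance.
    """
    candidates = [i for i, obj in enumerate(objects)
                  if dist(obj[:k], query[:k]) <= radius]
    return [i for i in candidates if dist(objects[i], query) <= radius]

# Toy data (assumed): object 1 passes the cheap filter but fails the careful check.
objects = [(0.0, 0.0, 0.0, 0.0), (1.0, 0.0, 0.0, 3.0),
           (0.5, 0.5, 0.1, 0.1), (5.0, 5.0, 0.0, 0.0)]
query = (0.0, 0.0, 0.0, 0.0)
print(range_search(objects, query, radius=1.0))  # [0, 2]
```

The cheap filter works best when the first few coordinates carry most of the distance, which is what the skewed-spectrum assumption provides.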

3
Previous Chapter: Research topics

Object detection (pattern and image recognition)
Automatic feature selection
Spatial indexing data structures (more than 1D)
New types of data
What features to select? How to determine them?
Mixed-type data (e.g., webpages, or images with sound and description)
What clustering/IR methods are better suited for what features? (What features for what methods?)
Similar methods in data mining, ...

4
The problem

Very large document collections
Google: 4,000,000,000 pages
Slow response?
Solution: parallel computing
Google: 10,000 computers

5
Parallel architectures

                              Data stream: Single   Data stream: Multiple
Instruction stream: Single    SISD (classical)      SIMD (simple)
Instruction stream: Multiple  MISD (rare)           MIMD (many SISD)

6
MIMD architecture

The most common
Can be tightly coupled or loosely coupled
Distributed computing: many computers interacting via a network (e.g., PC clusters)
Similar to MIMD computers, but with a greater cost of communication
Very loosely coupled
More coarse-grained programs

7
Performance improvement

Time: speedup S
Ideally, N times (N is the number of processors)
In practice, impossible:
The problem does not decompose into N equal parts
Communication and control overhead
S < 1 / f, where f is the largest separable fraction of the problem
Cost
Per processor: S / N
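
The quantities on this slide can be computed directly. The sketch below is illustrative only: the function names and the timings are hypothetical, and f is read here as the fraction of the running time that cannot be divided among processors, which is an assumption about the slide's wording.

```python
def speedup(t_sequential, t_parallel):
    """Speedup S: sequential running time divided by parallel running time."""
    return t_sequential / t_parallel

def efficiency(s, n):
    """Speedup per processor, S / N (1.0 would be the ideal case)."""
    return s / n

def speedup_bound(f):
    """Bound 1 / f on S, with f read as the fraction that cannot be split further."""
    return 1.0 / f

# Hypothetical timings: 100 s on one processor, 20 s on N = 8 processors.
s = speedup(100.0, 20.0)
print(s, efficiency(s, 8), speedup_bound(0.1))  # 5.0 0.625 10.0
```

With these numbers the eight processors deliver a speedup of 5 rather than the ideal 8, and no number of processors could push it past 10 if a tenth of the work cannot be split.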

8
Two approaches to parallelism

Build new algorithms
E.g., neural nets
Naturally parallel
Problem: how to define the retrieval task
Adapt the existing techniques to parallelism
Allows relying on well-studied approaches
We will consider this option

9
Ways to use parallelism

Multitasking
N search engines
Good for processing many queries
Problems:
A single query is not sped up
Bottleneck: disk access (index)
Possible solution: replicating (part of) the data; RAIDs
Parallel algorithms
IR data. Main question: how to partition the data
Document / index term matrix (terms can be LSI dimensions, signature bits, etc.)

10
Possible partitionings

Horizontal: document partitioning. Union of results
Vertical: term partitioning. Basically, intersection of results
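
The two partitionings can be compared on a toy collection. This is a sketch under stated assumptions: the documents, the round-robin assignment of documents to parts, and the function names are all hypothetical, the query is a Boolean AND, and the "processors" are simulated sequentially.

```python
# Toy document/term incidence data: doc id -> set of terms (hypothetical).
docs = {
    0: {"parallel", "retrieval"},
    1: {"distributed", "retrieval"},
    2: {"parallel", "distributed", "retrieval"},
    3: {"signature", "files"},
}

def search_docs(doc_subset, query_terms):
    """Documents in this subset that contain all query terms (AND query)."""
    return {d for d, terms in doc_subset.items() if query_terms <= terms}

def horizontal(query_terms, n_parts=2):
    """Document partitioning: each part holds whole documents; union the results."""
    parts = [{d: t for d, t in docs.items() if d % n_parts == p}
             for p in range(n_parts)]
    result = set()
    for part in parts:
        result |= search_docs(part, query_terms)
    return result

def vertical(query_terms):
    """Term partitioning: each part holds whole posting lists; intersect them."""
    postings = {}
    for d, terms in docs.items():
        for t in terms:
            postings.setdefault(t, set()).add(d)
    lists = [postings.get(t, set()) for t in query_terms]
    return set.intersection(*lists) if lists else set()

query = {"parallel", "retrieval"}
print(sorted(horizontal(query)), sorted(vertical(query)))  # [0, 2] [0, 2]
```

Both routes give the same answer; they differ in what each processor stores and in how the partial results are combined.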

11
Inverted files: Logical partitioning

Logical vs. physical document partitioning
Logical: for each term, use pointers into the inverted file data for each processor, to indicate its portion
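
One way to picture the per-processor pointers is as slice boundaries into each term's posting list. The sketch below assumes that each processor owns a contiguous interval of document ids and that posting lists are sorted by document id; `partition_pointers` and the `boundaries` layout are hypothetical names chosen for illustration.

```python
from bisect import bisect_left

def partition_pointers(postings, boundaries):
    """For one posting list (sorted doc ids), return a (start, end) slice per processor.

    boundaries[p] is the first doc id owned by processor p; the last entry is
    one past the largest doc id in the collection.
    """
    cuts = [bisect_left(postings, b) for b in boundaries]
    return list(zip(cuts[:-1], cuts[1:]))

postings = [1, 4, 5, 9, 12, 17]
pointers = partition_pointers(postings, boundaries=[0, 5, 10, 20])
print(pointers)  # [(0, 2), (2, 4), (4, 6)]
```

Processor p then scans only `postings[start:end]` for its own pair of pointers, while the inverted file itself stays in one piece.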

12
Inverted files: Logical partitioning. Construction and updating

Also parallel
Construction:
Assign docs to processors
Order docs such that each processor has an interval
Process in parallel
Merge. Each piece is ordered already
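
The construction steps above can be sketched as follows. Assumptions: documents are plain strings split on whitespace, the doc id is the position in the input list, and a thread pool stands in for the processors (in CPython a process pool would be needed for real CPU parallelism). All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def build_partial_index(interval):
    """Index one contiguous interval of (doc_id, text) pairs."""
    index = {}
    for doc_id, text in interval:
        for term in set(text.split()):
            index.setdefault(term, []).append(doc_id)
    return index

def build_index(docs, n_workers=2):
    """Build an inverted index: term -> list of doc ids in increasing order."""
    numbered = list(enumerate(docs))
    size = max(1, -(-len(numbered) // n_workers))  # ceiling division
    intervals = [numbered[i:i + size] for i in range(0, len(numbered), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(build_partial_index, intervals))
    merged = {}
    for partial in partials:  # intervals arrive in increasing doc-id order
        for term, postings in partial.items():
            merged.setdefault(term, []).extend(postings)  # plain concatenation
    return merged

docs = ["parallel retrieval", "distributed retrieval",
        "parallel distributed retrieval", "signature files"]
index = build_index(docs)
print(index["retrieval"], index["parallel"])  # [0, 1, 2] [0, 2]
```

Because each worker handles an interval of doc ids, the merge is a concatenation and no re-sorting of posting lists is needed.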

13
Inverted files: Physical document partitioning

Several separate collections, one per processor
Separate indices
Then the lists are merged (they are already ordered)
A priority queue is used
The result is not sorted. Insertion is quick
The maximal element can be found quickly
First k elements can be found rather quickly
Details in the book
Consistent scores are needed
Global statistics are needed. Can be computed at index time
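
A minimal sketch of the merge step, assuming each processor returns (score, doc_id) pairs already sorted by descending score. Python's `heapq.merge` keeps a heap of the list heads internally, so it plays the role of the priority queue. Using idf as the global statistic is an assumption made for illustration; the function names and the sample data are hypothetical.

```python
import heapq
import math
from itertools import islice

def global_idf(doc_freq_per_processor, n_docs_total):
    """Combine per-processor document frequencies into one idf value per term."""
    total = {}
    for df in doc_freq_per_processor:
        for term, count in df.items():
            total[term] = total.get(term, 0) + count
    return {t: math.log(n_docs_total / c) for t, c in total.items()}

def top_k(result_lists, k):
    """Merge per-processor result lists and return the k best hits overall."""
    merged = heapq.merge(*result_lists, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, k))  # stop after k elements, no full sort

lists = [[(0.9, "a1"), (0.4, "a2")], [(0.7, "b1"), (0.6, "b2")], [(0.8, "c1")]]
print(top_k(lists, 3))  # [(0.9, 'a1'), (0.8, 'c1'), (0.7, 'b1')]
```

The scores are comparable across processors only if every processor weights terms with the same collection-wide statistics, which is what `global_idf` is meant to suggest.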

14
Logical or physical partitioning?

Logical requires less communication
Faster
Physical is more flexible. Simpler implementation
Simpler conversion of existing systems

15
Inverted files: Term partitioning

Each processor processes a part of the inverted file
The results are intersected (for AND), or combined as appropriate for the other Boolean operations (OR and NOT)
When term distribution in user queries is skewed, document partitioning is better
When uniform, term partitioning is better: twice for long queries, 5 to 10 times for short (Web-like) ones
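
One plausible reason for the effect of skew, offered here as an illustration rather than as the book's analysis, is load imbalance: under term partitioning every lookup of a popular term lands on the same processor. The sketch counts posting-list lookups per processor; the rule that assigns terms to processors and the sample queries are hypothetical.

```python
from collections import Counter

def term_owner(term, n_processors):
    """Hypothetical assignment: processor chosen by the term's first letter."""
    return (ord(term[0]) - ord("a")) % n_processors

def load_per_processor(queries, n_processors):
    """Count the posting-list lookups each processor serves under term partitioning."""
    load = Counter()
    for query in queries:
        for term in query:
            load[term_owner(term, n_processors)] += 1
    return [load[p] for p in range(n_processors)]

skewed = [["apple"], ["apple", "avocado"], ["apple"], ["banana"]]
uniform = [["apple"], ["banana"], ["cherry"], ["date"]]
print(load_per_processor(skewed, 4))   # [4, 1, 0, 0]
print(load_per_processor(uniform, 4))  # [1, 1, 1, 1]
```

With document partitioning, by contrast, every processor takes part in every query, so the load stays even regardless of which terms are popular.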

16
Suffix arrays

Array construction can be parallelized
Merges are parallel
Document partitioning is applied straightforwardly
Each processor maintains its own suffix array
Term partitioning can be applied
Each processor owns a branch of the tree (lexicographic interval)
Bottleneck: all processors need access to the entire text

18
Signature files

Document partitioning: straightforward
Create the query signature, distribute it to each processor
Merge results (using Boolean operations if needed)
Term partitioning: shorter signatures
Merging and eliminating false drops is slow
This method is not recommended
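
The document-partitioned variant can be sketched as below. Assumptions: signatures are 64-bit superimposed codes built by hashing each term with SHA-256, partitions are plain dictionaries, the processors are simulated by a loop, and false drops are removed by checking the stored term sets. None of these choices come from the slides, and all names are hypothetical.

```python
import hashlib

BITS = 64  # signature width (assumed)

def term_bits(term, n_hashes=3):
    """Bit mask with up to n_hashes bits set, chosen by hashing the term."""
    mask = 0
    for i in range(n_hashes):
        digest = hashlib.sha256(f"{i}:{term}".encode()).digest()
        mask |= 1 << (int.from_bytes(digest[:4], "big") % BITS)
    return mask

def signature(terms):
    """Superimposed coding: OR together the masks of all terms."""
    sig = 0
    for term in terms:
        sig |= term_bits(term)
    return sig

def search_partition(partition, query_sig, query_terms):
    """One processor: bitwise signature test, then verification to drop false drops."""
    return [doc_id for doc_id, (sig, terms) in partition.items()
            if (sig & query_sig) == query_sig and set(query_terms) <= terms]

def search(partitions, query_terms):
    """Create the query signature once, send it to every partition, merge the results."""
    query_sig = signature(query_terms)
    return sorted(d for part in partitions
                  for d in search_partition(part, query_sig, query_terms))

docs = {0: {"parallel", "retrieval"}, 1: {"distributed", "retrieval"},
        2: {"signature", "files"}}
partitions = [{}, {}]
for doc_id, terms in docs.items():
    partitions[doc_id % 2][doc_id] = (signature(terms), terms)
print(search(partitions, ["retrieval"]))  # [0, 1]
```

The signature test never misses a matching document, because a document's signature contains every bit of each of its terms; it can only let through extra candidates, which the verification step removes.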

19
SIMD computers

Single Instruction, Multiple data
Uncommon
Good for simple operations
Bit operations in signature files
Details in the book
Ranking is supported in hardware in some computers
If the signature file does not fit into memory, it can be processed in batches
I/O overhead
Use multiple queries with the same batch
This improves throughput, but not response time
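
The batching idea can be sketched without any SIMD hardware. In this illustration a list slice stands in for reading one batch of the signature file from disk, and signatures are small integers; the function names and the data are hypothetical.

```python
def batches(signatures, batch_size):
    """Yield consecutive slices of the signature file (stand-in for disk reads)."""
    for start in range(0, len(signatures), batch_size):
        yield start, signatures[start:start + batch_size]

def search_batched(signatures, query_sigs, batch_size):
    """Evaluate all queries against each batch while it is in memory."""
    results = [[] for _ in query_sigs]
    loads = 0
    for start, batch in batches(signatures, batch_size):
        loads += 1  # one load serves every query in the group
        for qi, q in enumerate(query_sigs):
            for offset, sig in enumerate(batch):
                if (sig & q) == q:
                    results[qi].append(start + offset)
    return results, loads

sigs = [0b1010, 0b0110, 0b1111, 0b0001, 0b1000]
results, loads = search_batched(sigs, [0b1000, 0b0110], batch_size=2)
print(results, loads)  # [[0, 2, 4], [1, 2]] 3
```

Two queries share three batch loads instead of six, which is the throughput gain; each individual query still waits for the whole file to pass through, so its response time does not improve.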

20
SIMD computers

Inverted files are difficult to adapt to SIMD
The inverted file is restructured
Details in the book

21
Distributed IR

MIMD with
Slow communication
Not all nodes are used for a given query
Encryption issues
Document partitioning is usually used
Term partitioning imposes greater communication overhead
Document clustering can be useful (to distribute docs among processors)
Index clusters and then search only the best ones
Another approach: use training queries, then the similarity of the user query to these
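
Searching only the best clusters can be sketched as follows, assuming each node holds one cluster, documents and queries are sparse term-weight vectors, a cluster is represented by the average of its documents, and cosine similarity ranks the nodes. These choices and all names are illustrative assumptions, not the method from the book.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts term -> weight."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Average of the document vectors held by one node."""
    total = {}
    for vec in vectors:
        for t, w in vec.items():
            total[t] = total.get(t, 0.0) + w / len(vectors)
    return total

def select_nodes(node_docs, query, m):
    """Return the names of the m nodes whose centroids best match the query."""
    scored = [(cosine(query, centroid(docs)), name)
              for name, docs in node_docs.items()]
    return [name for _, name in sorted(scored, reverse=True)[:m]]

nodes = {
    "ir": [{"retrieval": 1.0, "index": 1.0}, {"retrieval": 1.0, "query": 1.0}],
    "hw": [{"processor": 1.0, "memory": 1.0}],
    "net": [{"network": 1.0, "query": 0.5}],
}
print(select_nodes(nodes, {"retrieval": 1.0, "query": 1.0}, m=2))  # ['ir', 'net']
```

Only the selected nodes receive the query, which saves communication at the risk of missing relevant documents stored elsewhere.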

22
Research topics