Title: Special Topics in Computer Science Advanced Topics in Information Retrieval Lecture 7 (book chapter 9): Parallel and Distributed IR
1 Special Topics in Computer Science: Advanced Topics in Information Retrieval. Lecture 7 (book chapter 9): Parallel and Distributed IR
- Alexander Gelbukh
- www.Gelbukh.com
2 Previous Chapter: Conclusions
- How to accelerate search? Same results as sequential
- Ideas
- Quick-and-dirty rejection of bad objects, 100% recall
- Fast data structure for search (based on clustering)
- Careful check of all found candidates
- Solution: mapping into fewer-D feature space
- Condition: lower-bounding of the distance
- Assumption: skewed spectrum distribution
- Few coefficients concentrate energy, rest are less important
3 Previous Chapter: Research topics
- Object detection (pattern and image recognition)
- Automatic feature selection
- Spatial indexing data structures (more than 1D)
- New types of data.
- What features to select? How to determine them?
- Mixed-type data (e.g., webpages, or images with sound and description)
- What clustering/IR methods are better suited for what features? (What features for what methods?)
- Similar methods in data mining, ...
4 The problem
- Very large document collections
- Google: 4,000,000,000 pages
- Slow response?
- Solution: parallel computing
- Google: 10,000 computers
5 Parallel architectures
Instruction stream   Data stream: Single   Data stream: Multiple
Single               SISD (classical)      SIMD (simple)
Multiple             MISD (rare)           MIMD (many SISD)
6 MIMD architecture
- The most common
- Can be
- tightly coupled
- loosely coupled
- Distributed
- Many computers interacting via network
- PC Clusters
- Similar to MIMD computers, but greater cost of communication
- very loosely coupled
- More coarse-grained programs
7 Performance improvement
- Time: speedup S
- Ideally, N times (number of processors)
- In practice, impossible
- The problem does not decompose into N equal parts
- Communication and control overhead
- S < 1 / f, where f is the largest separable fraction of the problem
- Cost
- Per processor: S / N
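The speedup and per-processor figures above can be sketched in a few lines. The timings are made up for illustration. The `amdahl_bound` function uses the standard form of Amdahl's law, which is an assumption here: the slide only gives the bound S < 1 / f, and reading f as the fraction of work that cannot be parallelized is one possible interpretation.

```python
def speedup(t_sequential: float, t_parallel: float) -> float:
    """Speedup S: sequential running time divided by parallel running time."""
    return t_sequential / t_parallel


def per_processor(s: float, n_processors: int) -> float:
    """The S / N figure from the slide; 1.0 would be the ideal case."""
    return s / n_processors


def amdahl_bound(sequential_fraction: float, n_processors: int) -> float:
    """Amdahl's law (assumed form, not spelled out on the slide).

    With a fraction f of strictly sequential work, the speedup on N
    processors is at most 1 / (f + (1 - f) / N), which stays below 1 / f.
    """
    f = sequential_fraction
    return 1.0 / (f + (1.0 - f) / n_processors)


# Hypothetical measurements: 100 s sequentially, 20 s on 8 processors.
s = speedup(100.0, 20.0)          # 5.0
cost = per_processor(s, 8)        # 0.625
limit = amdahl_bound(0.1, 8)      # bound for 10% sequential work
```

Even with an unlimited number of processors, `amdahl_bound(0.1, N)` stays below 10, which matches the S < 1 / f bound under the interpretation stated above.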
8 Two approaches to parallelism
- Build new algorithms
- E.g., neural nets
- Naturally parallel
- Problem: defining the retrieval task
- Adapt the existing techniques to parallelism
- Allows relying on well-studied approaches
- We will consider this option
9 Ways to use parallelism
- Multitasking
- N search engines
- Good for processing many queries
- Problems
- A single query is not sped up
- Bottleneck: disk access (index)
- Possible solution: replicating (part of) the data. RAIDs
- Parallel algorithms
- IR data. Main question: how to partition the data
- Document / index term matrix (terms can be LSI dimensions, signature bits, etc.)
10 Possible partitionings
- Horizontal: document partitioning. Union of results
- Vertical: term partitioning. Basically, intersect results
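A toy illustration of the two partitionings, assuming a small binary document/term matrix and contiguous blocks of rows or columns per processor. The matrix, the term names, and all function names are made up for this sketch; real systems partition index structures rather than a dense matrix.

```python
# Toy document/term incidence matrix: rows are documents, columns are terms.
TERMS = ["parallel", "index", "query", "signature"]
MATRIX = [
    [1, 1, 0, 0],  # doc 0
    [0, 1, 1, 0],  # doc 1
    [1, 0, 1, 1],  # doc 2
    [1, 1, 1, 0],  # doc 3
]


def horizontal_partition(matrix, n_parts):
    """Document partitioning: each part holds a contiguous block of rows."""
    size = -(-len(matrix) // n_parts)  # ceiling division
    return [
        {d: matrix[d] for d in range(start, min(start + size, len(matrix)))}
        for start in range(0, len(matrix), size)
    ]


def vertical_partition(matrix, n_parts):
    """Term partitioning: each part holds a contiguous block of columns."""
    n_terms = len(matrix[0])
    size = -(-n_terms // n_parts)
    return [
        {t: [row[t] for row in matrix] for t in range(start, min(start + size, n_terms))}
        for start in range(0, n_terms, size)
    ]


def and_query_horizontal(parts, term_ids):
    """Each part answers the whole query on its own docs; answers are unioned."""
    result = set()
    for part in parts:
        result |= {d for d, row in part.items() if all(row[t] for t in term_ids)}
    return result


def and_query_vertical(parts, term_ids):
    """Each part answers only for the terms it owns; answers are intersected."""
    result = None
    for part in parts:
        for t in term_ids:
            if t in part:
                docs = {d for d, bit in enumerate(part[t]) if bit}
                result = docs if result is None else result & docs
    return result if result is not None else set()


query = [TERMS.index("parallel"), TERMS.index("query")]
by_docs = and_query_horizontal(horizontal_partition(MATRIX, 2), query)   # {2, 3}
by_terms = and_query_vertical(vertical_partition(MATRIX, 2), query)      # {2, 3}
```

Both routes give the same answer; they differ in which processors do the work and in how the partial results are combined.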
11 Inverted files: Logical partitioning
- Logical vs. physical document partitioning
- Logical: for each term, use pointers into inverted file data for each processor, to indicate its portion
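A minimal sketch of the pointer idea, assuming each processor owns a contiguous interval of document IDs and every posting list is sorted by document ID. The names `logical_partition` and `boundaries` are hypothetical; the book's actual data layout may differ.

```python
from bisect import bisect_left


def logical_partition(postings, boundaries):
    """For every term, compute per-processor (start, end) offsets into its posting list.

    postings:   term -> sorted list of document IDs (one shared inverted file)
    boundaries: ascending doc-ID cut points; processor p owns the IDs in
                [boundaries[p], boundaries[p + 1])
    """
    pointers = {}
    for term, doc_ids in postings.items():
        cuts = [bisect_left(doc_ids, b) for b in boundaries]
        pointers[term] = list(zip(cuts[:-1], cuts[1:]))
    return pointers


postings = {"parallel": [0, 2, 3, 7, 9], "index": [1, 3, 4, 8]}
boundaries = [0, 4, 8, 12]  # three hypothetical processors
pointers = logical_partition(postings, boundaries)

# Processor 1 reads only its own slice of the shared list for "parallel".
start, end = pointers["parallel"][1]
portion = postings["parallel"][start:end]  # [7]
```

The inverted file itself stays in one piece; only the small table of offsets is specific to each processor.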
12 Inverted files: Logical partitioning. Construction and updating
- Also parallel
- Construction
- Assign docs to processors
- Order docs such that each processor has an interval
- Process in parallel
- Merge. Each piece is ordered already
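The construction steps above can be sketched as follows, assuming documents are plain strings numbered from 0 and the collection is non-empty. Threads stand in for processors only to keep the example self-contained; in CPython they do not give real CPU parallelism, so this shows the structure of the algorithm, not its performance.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def index_interval(docs, start, end):
    """Build a partial inverted index for the documents with IDs in [start, end)."""
    partial = defaultdict(list)
    for doc_id in range(start, end):
        for term in set(docs[doc_id].split()):
            partial[term].append(doc_id)  # IDs arrive in ascending order
    return partial


def build_index(docs, n_workers):
    """Assign ID intervals to workers, index them concurrently, then merge."""
    size = -(-len(docs) // n_workers)  # ceiling division
    intervals = [(s, min(s + size, len(docs))) for s in range(0, len(docs), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map returns results in the order of the intervals.
        partials = list(pool.map(lambda iv: index_interval(docs, *iv), intervals))
    merged = defaultdict(list)
    for partial in partials:
        for term, ids in partial.items():
            # Intervals are visited in ID order, so appending keeps lists sorted.
            merged[term].extend(ids)
    return dict(merged)


docs = ["parallel index", "index query", "parallel query", "query"]
index = build_index(docs, n_workers=2)
```

Because every worker holds an interval of IDs, the merge step is plain concatenation; no re-sorting of posting lists is needed.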
13 Inverted files: Physical document partitioning
- Several separate collections, one per processor
- Separate indices
- Then the lists are merged (they are already ordered)
- Priority queue is used
- The result is not sorted. Insertion is quick
- The maximal element can be found quickly
- First k elements can be found rather quickly
- Details in the book
- Consistent scores are needed
- Global statistics are needed. Can be computed at index time
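A minimal sketch of the merge step, assuming every processor returns its hits as `(score, doc_id)` pairs already sorted by descending score. It uses Python's `heapq.merge`, which keeps one candidate per input list in a heap; this is one way to realize the priority-queue idea, not necessarily the scheme described in the book. The scores and document names are made up.

```python
import heapq
from itertools import islice


def top_k(per_processor_results, k):
    """Merge ranked lists (each sorted by descending score) and keep the best k.

    The merge is lazy, so only as many elements are pulled from the
    priority queue as are needed for the first k results.
    """
    merged = heapq.merge(*per_processor_results, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, k))


results_p0 = [(0.92, "d3"), (0.40, "d1")]
results_p1 = [(0.87, "d7"), (0.55, "d5"), (0.10, "d6")]
results_p2 = [(0.60, "d9")]

best = top_k([results_p0, results_p1, results_p2], k=3)
# [(0.92, 'd3'), (0.87, 'd7'), (0.6, 'd9')]
```

The merged ranking is meaningful only if the scores are comparable across processors, which is why the slide asks for global statistics.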
14 Logical or physical partitioning?
- Logical requires less communication
- Faster
- Physical is more flexible. Simpler implementation
- Simpler conversion of existing systems
15 Inverted files: Term partitioning
- Each processor processes a part of the inverted file
- The results are intersected (for AND)
- (or as appropriate for Boolean operations, OR and NOT)
- When term distribution in user queries is skewed, then document partitioning is better
- When uniform, term partitioning is better.
- Twice as good for long queries, 5 to 10 times for short (Web-like) ones
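A small sketch of term partitioning, assuming a hypothetical routing rule (`owner`) that assigns each term to a processor by its first letter. Real systems would use a better balanced assignment; the rule here is only deterministic and easy to follow.

```python
def owner(term, n_processors):
    """Hypothetical routing rule: pick a processor from the term's first letter."""
    return ord(term[0]) % n_processors


def build_term_partitions(postings, n_processors):
    """Give each processor the posting lists of the terms it owns."""
    parts = [dict() for _ in range(n_processors)]
    for term, doc_ids in postings.items():
        parts[owner(term, n_processors)][term] = set(doc_ids)
    return parts


def boolean_and(parts, query_terms):
    """Fetch each term's list from its owner and intersect the lists."""
    result = None
    for term in query_terms:
        docs = parts[owner(term, len(parts))].get(term, set())
        result = docs if result is None else result & docs
    return set(result) if result is not None else set()


def boolean_or(parts, query_terms):
    """Same routing, but the lists are united."""
    result = set()
    for term in query_terms:
        result |= parts[owner(term, len(parts))].get(term, set())
    return result


def processors_used(query_terms, n_processors):
    """Which processors a query keeps busy."""
    return {owner(term, n_processors) for term in query_terms}


postings = {"parallel": [0, 2, 3], "index": [0, 1, 3], "query": [1, 2, 3]}
parts = build_term_partitions(postings, n_processors=2)

hits_and = boolean_and(parts, ["parallel", "query"])  # {2, 3}
hits_or = boolean_or(parts, ["parallel", "index"])    # {0, 1, 2, 3}
busy = processors_used(["index", "query"], 2)         # {1}
```

The last line shows the load problem behind the slide's remark on skew: when the query terms happen to live on the same processor, the others stay idle for that query.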
16 Suffix arrays
- Array construction can be parallelized
- merges are parallel
- Document partitioning is applied straightforwardly
- Each processor maintains its own suffix array
- Term partitioning can be applied
- Each processor owns a branch of the tree (lexicographic interval)
- Bottleneck: all processors need access to the entire text
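Document partitioning with suffix arrays can be sketched as below: each processor holds its own text and its own suffix array, and a query is simply sent to all of them. The construction here is the naive one (sorting all suffixes), chosen for brevity and not suitable for large texts; the sample texts are made up.

```python
def build_suffix_array(text):
    """Naive construction: sort all suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find(text, suffix_array, pattern):
    """Return all positions of pattern, using two binary searches."""
    m = len(pattern)

    def prefix(rank):
        pos = suffix_array[rank]
        return text[pos:pos + m]

    lo, hi = 0, len(suffix_array)
    while lo < hi:  # first suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(suffix_array)
    while lo < hi:  # first suffix whose prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


def search_all(processors, pattern):
    """processors: list of (text, suffix_array) pairs, one per processor."""
    return {p: find(text, sa, pattern) for p, (text, sa) in enumerate(processors)}


texts = ["parallel retrieval", "distributed retrieval in parallel"]
processors = [(t, build_suffix_array(t)) for t in texts]
hits = search_all(processors, "parallel")
```

Here every processor needs only its own text. With term partitioning, a processor owning a lexicographic interval of suffixes would still have to read arbitrary positions of the whole text, which is the bottleneck named on the slide.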
18 Signature files
- Document partitioning: straightforward
- Create query signature, distribute to each processor
- Merge results (using Boolean operations if needed)
- Term partitioning: shorter signatures
- Merging and eliminating false drops is slow
- This method is not recommended
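A minimal sketch of document-partitioned signature search, assuming superimposed coding with a 64-bit signature and three bits per term derived from an MD5 digest. The width, the number of bits, and the hash are illustrative choices, not taken from the lecture. Document signatures are recomputed on every query only to keep the code short; a real signature file would store them.

```python
import hashlib

SIGNATURE_BITS = 64  # assumed signature width
BITS_PER_TERM = 3    # assumed number of bits set per term


def term_mask(term):
    """Bit mask for one term, derived from an MD5 digest (illustrative choice)."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    mask = 0
    for byte in digest[:BITS_PER_TERM]:
        mask |= 1 << (byte % SIGNATURE_BITS)
    return mask


def signature(terms):
    """Superimposed coding: OR together the masks of all terms."""
    sig = 0
    for term in terms:
        sig |= term_mask(term)
    return sig


def search_partition(docs, query_terms):
    """One processor: signature test first, then verification of candidates."""
    query_sig = signature(query_terms)
    candidates = [d for d, terms in docs.items()
                  if (signature(terms) & query_sig) == query_sig]
    # The signature test may accept documents that lack a query term
    # (false drops), so each candidate is checked against its real terms.
    return {d for d in candidates if set(query_terms) <= set(docs[d])}


def search(partitions, query_terms):
    """Send the query to every partition and merge (union) the results."""
    hits = set()
    for docs in partitions:
        hits |= search_partition(docs, query_terms)
    return hits


partitions = [
    {"d0": ["parallel", "index"], "d1": ["query"]},
    {"d2": ["parallel", "query"]},
]
hits = search(partitions, ["parallel", "query"])  # {'d2'}
```

The signature test never misses a matching document, because a document containing all query terms has all of their bits set; it can only admit extra candidates, which the verification step removes.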
19 SIMD computers
- Single Instruction, Multiple Data
- Uncommon
- Good for simple operations
- Bit operations in signature files
- Details in the book
- Ranking is supported in hardware in some computers
- If signature file does not fit into memory, can be processed in batches
- I/O overhead
- Use multiple queries with the same batch
- This improves throughput, but not response time
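The batching idea can be sketched in ordinary sequential Python, assuming signatures are small integers and a counter stands in for the I/O cost of loading a batch. On a SIMD machine the inner loop over a batch would be a single data-parallel bit operation; the function names and sample values are made up.

```python
def batches(signatures, batch_size):
    """Yield consecutive slices of the signature file that fit in memory."""
    for start in range(0, len(signatures), batch_size):
        yield start, signatures[start:start + batch_size]


def run_queries(signatures, query_signatures, batch_size):
    """Load each batch once and test every pending query against it."""
    matches = {q: [] for q in range(len(query_signatures))}
    loads = 0
    for start, batch in batches(signatures, batch_size):
        loads += 1  # stands in for one I/O transfer of the batch
        for q, query_sig in enumerate(query_signatures):
            for offset, doc_sig in enumerate(batch):
                if (doc_sig & query_sig) == query_sig:
                    matches[q].append(start + offset)
    return matches, loads


signatures = [0b1010, 0b0110, 0b1111, 0b0001, 0b1000]
queries = [0b1000, 0b0110]
matches, loads = run_queries(signatures, queries, batch_size=2)
```

Two queries are answered with three batch loads instead of six. Each individual query still has to wait for all batches to pass, which is why throughput improves but response time does not.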
20 SIMD computers
- Inverted files are difficult to adapt to SIMD
- The inverted file is restructured
- Details in the book
21 Distributed IR
- MIMD with
- Slow communication
- Not all nodes are used for a given query
- Encryption issues
- Document partitioning is usually used
- Term partitioning imposes greater communication overhead
- Document clustering can be useful (to distribute docs by processors)
- Index clusters and then search only the best ones
- Another approach: use training queries, then similarity of the user query to these
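Searching only the best clusters can be sketched as follows, assuming each node is summarized by a centroid vector of term weights and nodes are ranked by cosine similarity to the query. The node names, weights, and the choice of cosine similarity are assumptions made for this illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as term -> weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_nodes(centroids, query, n_best):
    """Rank nodes by similarity of the query to their cluster centroid."""
    ranked = sorted(centroids,
                    key=lambda node: cosine(query, centroids[node]),
                    reverse=True)
    return ranked[:n_best]


# Hypothetical centroids of the document clusters held by three nodes.
centroids = {
    "node_a": {"parallel": 0.9, "cluster": 0.4},
    "node_b": {"signature": 0.8, "bit": 0.6},
    "node_c": {"parallel": 0.3, "query": 0.9},
}
query = {"parallel": 1.0, "query": 1.0}

chosen = select_nodes(centroids, query, n_best=2)  # ['node_c', 'node_a']
```

Only the chosen nodes receive the query, which saves communication at the price of possibly missing relevant documents stored elsewhere.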
22 Research topics
- How to evaluate the speedup
- New algorithms
- Adaptation of existing algorithms
- Merging the results is a bottleneck
- Meta search engines
- Creating large collections with judgements
- Is recall important?
23 Conclusions
- Parallel computing can improve
- response time for each query and/or
- throughput: number of queries processed with the same speed
- Document partitioning is simple
- good for distributed computing
- Term partitioning is good for some data structures
- Distributed computing is MIMD computing with slow communication
- SIMD machines are good for signature files
- Both are out of favor now
24 Thank you! Till May 17? 18?, 6 pm