Special Topics in Computer Science: Advanced Topics in Information Retrieval. Lecture 7 (book chapter 9): Parallel and Distributed IR

1
Special Topics in Computer Science: Advanced
Topics in Information Retrieval. Lecture 7 (book
chapter 9): Parallel and Distributed IR
  • Alexander Gelbukh
  • www.Gelbukh.com

2
Previous Chapter: Conclusions
  • How to accelerate search? Same results as
    sequential
  • Ideas:
  • Quick-and-dirty rejection of bad objects, 100%
    recall
  • Fast data structure for search (based on
    clustering)
  • Careful check of all found candidates
  • Solution: mapping into a lower-dimensional
    feature space
  • Condition: lower-bounding of the distance
  • Assumption: skewed spectrum distribution
  • Few coefficients concentrate energy, rest are
    less important

3
Previous Chapter: Research topics
  • Object detection (pattern and image recognition)
  • Automatic feature selection
  • Spatial indexing data structures (more than 1D)
  • New types of data.
  • What features to select? How to determine them?
  • Mixed-type data (e.g., webpages, or images
    with sound and description)
  • What clustering/IR methods are better suited
    for what features? (What features for what
    methods?)
  • Similar methods in data mining, ...

4
The problem
  • Very large document collections
  • Google: 4,000,000,000 pages
  • Slow response?
  • Solution: parallel computing
  • Google: 10,000 computers

5
Parallel architectures
                              Single data stream   Multiple data streams
Single instruction stream     SISD (classical)     SIMD (simple)
Multiple instruction streams  MISD (rare)          MIMD (many SISD)

6
MIMD architecture
  • The most common
  • Can be
  • tightly coupled
  • loosely coupled
  • Distributed
  • Many computers interacting via network
  • PC Clusters
  • Similar to MIMD computers, but greater cost of
    communication
  • very loosely coupled
  • More coarse-grained programs

7
Performance improvement
  • Time: speedup S
  • Ideally, N times (N is the number of processors)
  • In practice: impossible
  • The problem does not decompose into N equal parts
  • Communication and control overhead
  • S < 1 / f, where f is the largest separable
    fraction of the problem
  • Cost
  • Per processor: S / N
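
As a rough illustration of these bounds, the sketch below assumes the work is already split into indivisible parts of known cost, one part per processor, and ignores communication overhead. The function names and the sample costs are invented for this example. Under these assumptions the speedup is exactly 1 / f; with real overhead it stays strictly below that.

```python
def speedup(part_costs):
    """Speedup S when each part runs on its own processor.

    Sequential time is the sum of all parts; parallel time is the
    time of the largest part. Communication overhead is ignored.
    """
    return sum(part_costs) / max(part_costs)


def efficiency(part_costs):
    """Speedup per processor, S / N."""
    return speedup(part_costs) / len(part_costs)


# Hypothetical workload: four parts, the largest is 40% of the total.
parts = [40, 30, 20, 10]
f = max(parts) / sum(parts)      # largest separable fraction
S = speedup(parts)               # cannot exceed 1 / f
print(S, 1 / f, efficiency(parts))
```

Only a perfectly even split reaches the ideal S = N; any imbalance lowers both S and S / N.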

8
Two approaches to parallelism
  • Build new algorithms
  • E.g., neural nets
  • Naturally parallel
  • Problem: how to define the retrieval task
  • Adapt the existing techniques to parallelism
  • Allows relying on well-studied approaches
  • We will consider this option

9
Ways to use parallelism
  • Multitasking
  • N search engines
  • Good for processing many queries
  • Problems:
  • A single query is not sped up
  • Bottleneck: disk access (index)
  • Possible solution: replicating (part of) the
    data; RAIDs
  • Parallel algorithms
  • IR data. Main question: how to partition the
    data
  • Document / index term matrix (terms can be LSI
    dimensions, signature bits, etc.)

10
Possible partitionings
  • Horizontal: document partitioning. Union of
    results
  • Vertical: term partitioning. Basically, intersect
    results
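
A minimal sketch of the two partitionings on a toy document/term matrix follows. The matrix, the term names, and both function names are made up for illustration, and the "processors" are simulated by ordinary loops. It answers the same AND query both ways: by the union of per-row-block results, and by the intersection of per-term results.

```python
# Toy document/term matrix: rows are documents, columns are index terms.
terms = ["parallel", "index", "query", "signature"]
matrix = [
    [1, 0, 1, 0],  # doc 0
    [1, 1, 0, 0],  # doc 1
    [0, 1, 1, 1],  # doc 2
    [1, 0, 1, 1],  # doc 3
]


def search_horizontal(matrix, query_cols, n_procs):
    """Document partitioning: each processor scans its own block of rows."""
    size = -(-len(matrix) // n_procs)          # ceiling division
    results = set()
    for start in range(0, len(matrix), size):  # one iteration per processor
        for doc_id in range(start, min(start + size, len(matrix))):
            if all(matrix[doc_id][c] for c in query_cols):
                results.add(doc_id)            # union of local results
    return results


def search_vertical(matrix, query_cols):
    """Term partitioning: one processor per query term (column)."""
    partial = [{d for d, row in enumerate(matrix) if row[c]}
               for c in query_cols]
    return set.intersection(*partial)          # intersect for AND


query = [terms.index("parallel"), terms.index("query")]
print(search_horizontal(matrix, query, n_procs=2))
print(search_vertical(matrix, query))
```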

11
Inverted files: Logical partitioning
  • Logical vs. physical document partitioning
  • Logical: for each term, use pointers into
    inverted file data for each processor, to
    indicate its portion
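
One way to picture the pointers is the sketch below. It assumes posting lists sorted by document id and documents assigned to processors in contiguous id intervals; `build_pointers`, `portion`, and the sample data are hypothetical names and values chosen for this example.

```python
from bisect import bisect_left


def build_pointers(postings, doc_bounds):
    """For each term, offsets where each processor's portion starts.

    doc_bounds[p] is the first doc id owned by processor p. A final
    end offset is appended so portion p is pointers[p]:pointers[p + 1].
    """
    return {
        term: [bisect_left(plist, b) for b in doc_bounds] + [len(plist)]
        for term, plist in postings.items()
    }


def portion(postings, pointers, term, p):
    """Slice of the shared posting list that processor p works on."""
    start, end = pointers[term][p], pointers[term][p + 1]
    return postings[term][start:end]


postings = {"index": [1, 4, 5, 9], "query": [0, 2, 4, 7, 8]}
doc_bounds = [0, 4, 8]   # processor 0: docs 0-3, 1: docs 4-7, 2: docs 8 and up
pointers = build_pointers(postings, doc_bounds)
print(pointers["index"], portion(postings, pointers, "index", 1))
```

The inverted file itself stays in one piece; only the small pointer table differs per processor.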

12
Inverted files: Logical partitioning
Construction and updating
  • Also parallel
  • Construction
  • Assign docs to processors
  • Order docs such that each processor has an
    interval
  • Process in parallel
  • Merge. Each piece is ordered already

13
Inverted files: Physical document partitioning
  • Several separate collections, one per processor
  • Separate indices
  • Then the lists are merged (they are already
    ordered)
  • Priority queue is used
  • The result is not sorted, so insertion is quick
  • The maximal element can be found quickly
  • First k elements can be found rather quickly
  • Details in the book
  • Consistent scores are needed
  • Global statistics are needed. They can be
    computed at index time
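
A small sketch of the merge step, assuming each processor returns its hits as (score, doc id) pairs already sorted by descending score, and that the scores are comparable across processors (which is why global statistics are needed). It uses Python's heap-based `heapq.merge`; the function name `top_k` and the sample hits are invented for this example, and the book's own algorithm may differ in detail.

```python
import heapq
from itertools import islice


def top_k(ranked_lists, k):
    """First k hits overall from several lists sorted by descending score."""
    merged = heapq.merge(*ranked_lists, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, k))   # stop early: the rest is never merged


# Hypothetical per-processor results.
proc_a = [(0.9, "d3"), (0.4, "d1")]
proc_b = [(0.7, "d8"), (0.6, "d5"), (0.1, "d9")]
print(top_k([proc_a, proc_b], 3))
```

The merge is lazy, so asking for only the first k elements avoids sorting the full result.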

14
Logical or physical partitioning?
  • Logical requires less communication
  • Faster
  • Physical is more flexible. Simpler implementation
  • Simpler conversion of existing systems

15
Inverted files: Term partitioning
  • Each processor processes a part of the inverted
    file
  • The results are intersected (for AND)
  • (or as appropriate for Boolean operations, OR and
    NOT)
  • When term distribution in user queries is
    skewed, document partitioning is better
  • When uniform, term partitioning is better
  • By a factor of 2 for long queries, 5 to 10 for
    short (Web-like) queries
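
How the Boolean operations combine the per-processor results can be sketched as below. The assignment of terms to two shards, the posting sets, and the function names are all hypothetical, and the lookup is a plain loop rather than real message passing.

```python
# Hypothetical term-partitioned inverted file: each shard (processor)
# owns the complete posting sets of some of the terms.
shards = [
    {"parallel": {0, 1, 3}, "index": {1, 2}},
    {"query": {0, 2, 3}, "signature": {2, 3}},
]
ALL_DOCS = {0, 1, 2, 3}


def lookup(term):
    """Ask the shard that owns the term for its posting set."""
    for shard in shards:
        if term in shard:
            return shard[term]
    return set()


def query_and(terms):
    return set.intersection(*(lookup(t) for t in terms))


def query_or(terms):
    return set.union(*(lookup(t) for t in terms))


def query_not(term):
    return ALL_DOCS - lookup(term)


print(query_and(["parallel", "query"]))
print(query_or(["index", "signature"]))
print(query_not("index"))
```

A short query touches only the few shards that own its terms, which is consistent with term partitioning helping most when query terms are spread evenly over the processors.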

16
Suffix arrays
  • Array construction can be parallelized
  • merges are parallel
  • Document partitioning is applied
    straightforwardly
  • Each processor maintains its own suffix array
  • Term partitioning can be applied
  • Each processor owns a branch of the tree
    (lexicographic interval)
  • Bottleneck: all processors need access to the
    entire text

18
Signature files
  • Document partitioning: straightforward
  • Create query signature, distribute to each
    processor
  • Merge results (using Boolean operations if
    needed)
  • Term partitioning: shorter signatures
  • Merging and eliminating false drops is slow
  • This method is not recommended
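
The document-partitioned search can be sketched as follows. This assumes superimposed coding with a 64-bit signature and 3 bits per term, both arbitrary choices for this example, and uses MD5 only as a convenient deterministic hash. All function names are invented here.

```python
import hashlib

BITS = 64  # signature width (arbitrary for this sketch)


def term_signature(term, k=3):
    """Signature of one term: up to k bits chosen by hashing."""
    sig = 0
    for i in range(k):
        digest = hashlib.md5(f"{i}:{term}".encode()).digest()
        sig |= 1 << (int.from_bytes(digest[:4], "big") % BITS)
    return sig


def signature(text):
    """Superimpose (OR together) the signatures of all terms."""
    sig = 0
    for term in text.split():
        sig |= term_signature(term)
    return sig


def search(partitions, query):
    q = signature(query)                     # created once, sent to everyone
    hits = []
    for docs in partitions:                  # each partition = one processor
        for doc_id, text, sig in docs:
            if sig & q == q:                 # candidate, possibly a false drop
                if set(query.split()) <= set(text.split()):
                    hits.append(doc_id)      # verified against the text
    return sorted(hits)                      # merged result


docs = ["parallel index merge", "signature file bits", "parallel signature"]
indexed = [(i, text, signature(text)) for i, text in enumerate(docs)]
partitions = [indexed[:2], indexed[2:]]      # two hypothetical processors
print(search(partitions, "parallel signature"))
```

The bit test never misses a true match; it can only let through false drops, which the final check against the text removes.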

19
SIMD computers
  • Single Instruction, Multiple data
  • Uncommon
  • Good for simple operations
  • Bit operations in signature files
  • Details in the book
  • Ranking is supported in hardware in some
    computers
  • If the signature file does not fit into memory,
    it can be processed in batches
  • I/O overhead
  • Use multiple queries with the same batch
  • This improves throughput, but not response time

20
SIMD computers
  • Inverted files are difficult to adapt to SIMD
  • The inverted file is restructured
  • Details in the book

21
Distributed IR
  • MIMD with:
  • Slow communication
  • Not all nodes are used for a given query
  • Encryption issues
  • Document partitioning is usually used
  • Term partitioning imposes greater
    communication overhead
  • Document clustering can be useful (to distribute
    docs by processors)
  • Index clusters and then search only the best ones
  • Another approach: use training queries, then
    similarity of the user query to these
22
Research topics
  • How to evaluate the speedup
  • New algorithms
  • Adaptation of existing algorithms
  • Merging the results is a bottleneck
  • Meta search engines
  • Creating large collections with judgements
  • Is recall important?

23
Conclusions
  • Parallel computing can improve:
  • response time for each query and/or
  • throughput: number of queries processed with
    the same speed
  • Document partitioning is simple
  • good for distributed computing
  • Term partitioning is good for some data
    structures
  • Distributed computing is MIMD computing with slow
    communication
  • SIMD machines are good for signature files
  • Both are out of favor now

24
Thank you! Till May 17? 18?, 6 pm