Implementing Scalable Tree-based Algorithms using MRNet - PowerPoint PPT Presentation

About This Presentation
Title:

Implementing Scalable Tree-based Algorithms using MRNet

Description:

Ting Chen. Mark Cowlishaw. 2. Introduction. Background on MRNet. Our Applications. Reverse index ... Each word points to a list of document IDs containing the word. ... – PowerPoint PPT presentation

Number of Views:25
Avg rating:3.0/5.0
Slides: 11
Provided by: Syste98
Category:

less

Transcript and Presenter's Notes

Title: Implementing Scalable Tree-based Algorithms using MRNet


1
Implementing Scalable Tree-based Algorithms
using MRNet
  • Ting ChenMark Cowlishaw

2
Introduction
  • Background on MRNet
  • Our Applications
  • Reverse index
  • Online queries
  • Description of Experiments
  • Progress
  • Next steps

3
Background MRNet Roth, Arnold, Miller03
  • Tree-Based Overlay Network
  • Nodes of a distributed application are arranged
    in a tree-structure
  • Leaves producing data that is aggregated and
    filtered by higher levels of the tree
  • Separation of programs and their running tree
    topology
  • A TBON program can run on tree-network of
    different topologies

4
Reverse indexes for Keyword Queries
  • A keyword query is a list of words lt w1,w2,
    ...,wn gt .
  • A typical reverse index is a list of words
  • Each word points to a list of document IDs
    containing the word.
  • A document ID list can either be sorted by
    document names or by the number of times the word
    appears in the document.

Wisconsin
Badgers
DocId No. of Occurences
B.htm C.htm . 6 4
DocId No. of Occurences
A.htm B.htm C.htm . 2 7 5
5
On-line queries N-best frequency (N2)
DocC 7
DocB 3
DocD 0
DocF 6
DocA 5
DocD 0
DocC 7
DocF 6
DocB 3
DocA 5
DocE 2
6
Objective / Experiments
  • How does tree topology affect application
    performance
  • Macro-benchmarks throughput/time-to-completion
    for index building and response time for on-line
    queries
  • Micro-benchmarks the amount of total IO, I/O
    performed for each node, data transferred
  • Scale-Up and Speed-Up curves with the increase of
    cluster nodes
  • Is Multiple-level ( gt 2) helpful?

7
Progress - data
  • Experiment Data
  • All Wikipedia documents
  • 8GB of data
  • 4 Million documents
  • Probably enough for a tree with 32 leaves
  • Data in Wiki-text format

8
Progress Design / Implementation
  • Message Formats
  • Document/keyword delivery
  • Acknowledgment
  • keys / statistics
  • Messages for Microbenchmarks
  • Shared Classes
  • DocumentStatistics (back end) in test
  • StatisticsEntry (all) in test
  • StatisticsList (all)
  • Familiarity with MRNet Toolset
  • Debugging MRNet Programs

9
Next Steps
  • Deploy at medium scale (32 Nodes)
  • Experiments to determine fan-out
  • Maximize throughput (bytes/second)
  • Minimize time-to-completion
  • Microbenchmarks
  • Collected and filtered using MRNet
  • Idle time
  • Messages per second, total message traffic

10
Futures
  • Replace N-best with distance from median
  • Add support for new document types
  • XML
  • HTML
  • Generalize to other MapReduceDean04
    Applications
  • More realistic relevance ranking
  • Reverse hyperlink count
  • Collection term vectors
Write a Comment
User Comments (0)
About PowerShow.com