1
Benchmarking MapReduce-Style Parallel Computing
Randal E. Bryant, Carnegie Mellon University
http://www.cs.cmu.edu/~bryant
2
Programming with MapReduce
  • Background
  • Developed at Google for aggregating web data
  • Dean & Ghemawat, "MapReduce: Simplified Data
    Processing on Large Clusters," OSDI 2004
  • Strengths
  • Easy way to write scalable parallel programs
  • Powerful programming model
  • Beyond web search applications
  • Runtime system automatically handles many of the
    challenges of parallel programming
  • Scheduling, load balancing, fault tolerance

3
Overall Execution Model
  • General Form
  • Input
  • Large set of files
  • Compute
  • Aggregate information
  • Output
  • Files containing aggregations
  • Example: Word Count Index
  • Input
  • 10^10 cached web pages
  • Stored on cluster of 1000 machines, each with own
    local disk
  • Compute
  • Index of words with occurrence counts
  • Output
  • File containing count for each word

4
MapReduce Programming
  • Map
  • Function generating keyword/value pairs from
    input file
  • E.g., word/count for each word in document
  • Reduce
  • Function aggregating values for single keyword
  • E.g., sum word counts
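
A minimal sketch of the two user-supplied functions for the word-count example, written as plain Python for illustration; the names map_fn and reduce_fn are assumptions, not the API from the paper:

    def map_fn(filename, contents):
        # Emit a (word, 1) pair for every word in the input document.
        for word in contents.split():
            yield (word, 1)

    def reduce_fn(word, counts):
        # Sum all of the counts emitted for a single word.
        yield (word, sum(counts))

The runtime, not the programmer, is responsible for routing every pair with the same keyword to the same reduce invocation.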

5
MapReduce Implementation
  • (Somewhat naïve implementation)
  • Map
  • Spawn mapping task for each input file
  • Execute on processor local to file
  • Generate file for each keyword/value
  • Shuffle
  • Redistribute files by hashing keywords: K → Ph(K)
  • Reduce
  • Spawn reduce task for each keyword
  • On processor to which keyword hashes: Ph(K)
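
The shuffle step can be pictured as a hash partitioner that sends each keyword to one of P reduce processors; this is only a sketch of the idea with assumed names, not the actual implementation:

    import hashlib

    def partition(keyword, num_processors):
        # Route keyword K to processor h(K) mod P, so every value for
        # the same keyword ends up at the same reduce task.
        h = int(hashlib.md5(keyword.encode("utf-8")).hexdigest(), 16)
        return h % num_processors

    partition("hello", 1000)   # always maps to the same processor in [0, 999]

A stable hash such as MD5 (rather than a per-process salted hash) is used so that every machine agrees on which processor owns a given keyword.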

6
Appealing Features
  • Ease of Programming
  • Programmer provides only two functions
  • Express in terms of computation over data, not
    detailed execution on system
  • Robustness
  • Tolerant to failures of disks, processors,
    network
  • Source files stored redundantly
  • Runtime monitor detects and reexecutes failed
    tasks
  • Dynamic scheduling automatically adapts to
    resource limitations

7
Tolerating Failures
  • Dean & Ghemawat, OSDI 2004
  • Sorting 10 million 100-byte records with 1800
    processors
  • Proactively restart delayed computations to
    achieve better performance and fault tolerance
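
The proactive-restart idea (often called speculative execution) can be sketched as a scheduler loop that launches a backup copy of any task running far longer than the tasks that have already finished; the threshold and helper functions here are illustrative assumptions, not the scheduler from the paper:

    import time

    def run_with_backups(tasks, launch, is_done, slowdown=2.0):
        # launch(t) starts a copy of task t; is_done(t) reports completion.
        start = {t: time.time() for t in tasks}
        finish = {}
        backups = set()
        for t in tasks:
            launch(t)
        while len(finish) < len(tasks):
            now = time.time()
            for t in tasks:
                if t not in finish and is_done(t):
                    finish[t] = now - start[t]      # record task runtime
            if finish:
                median = sorted(finish.values())[len(finish) // 2]
                for t in tasks:
                    if (t not in finish and t not in backups
                            and now - start[t] > slowdown * median):
                        launch(t)                   # speculative backup copy
                        backups.add(t)
            time.sleep(1)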

8
Our Data-Driven World
  • Science
  • Databases from astronomy, genomics, natural
    languages, seismic modeling, ...
  • Humanities
  • Scanned books, historic documents, ...
  • Commerce
  • Corporate sales, stock market transactions,
    census, airline traffic, ...
  • Entertainment
  • Internet images, Hollywood movies, MP3 files, ...
  • Medicine
  • MRI & CT scans, patient records, ...

9
Big Data Computing Beyond Web Search
  • Application Domains
  • Rely on large, ever-changing data sets
  • Collecting & maintaining data is a major effort
  • Computational Requirements
  • Extract information from large volumes of raw
    data
  • Hypothesis
  • Can apply MapReduce-style computation to many
    other application domains
  • Give it a Try!
  • Hadoop: Open source implementation of parallel
    file system & MapReduce
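
Before standing up a cluster, the whole map-shuffle-reduce pipeline can be simulated on a single machine, which is a cheap way to test the two user functions from the earlier sketch; everything here is illustrative rather than part of Hadoop or the Google runtime:

    from collections import defaultdict

    def local_mapreduce(inputs, map_fn, reduce_fn):
        # Single-machine simulation of the execution model: run every map
        # task, group intermediate pairs by keyword (the shuffle), then
        # run one reduce task per keyword.
        groups = defaultdict(list)
        for name, contents in inputs.items():
            for key, value in map_fn(name, contents):
                groups[key].append(value)
        return dict(pair for key, values in groups.items()
                    for pair in reduce_fn(key, values))

    docs = {"a.txt": "the cat sat", "b.txt": "the dog sat"}
    # With map_fn/reduce_fn from the word-count sketch above:
    # local_mapreduce(docs, map_fn, reduce_fn) -> {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}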

10
Q1: Workload Characteristics
  • Hardware
  • 1000s of nodes
  • Each with processor(s), disk(s), network
    interface
  • High-speed, local network using commodity
    technology
  • E.g., gigabit ethernet with switches
  • Data Organization
  • Distributed file system providing uniform name
    space and redundant storage
  • Computation
  • Each task executed as separate process with file
    I/O
  • Rely on file system for data transfer

11
Q2: Hardware/Software Challenges
  • Performance Issues
  • Disk bandwidth limitations
  • ⇒ 3.6 hours to read data from 1 TB disk
  • Data transfer across network
  • Process file I/O overhead
  • Runtime Issues
  • Detecting and mitigating effects of failed
    components
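
The 3.6-hour figure follows from simple arithmetic, assuming a sustained sequential read rate of roughly 75-80 MB/s, typical for commodity disks of that era:

    terabyte = 1e12                       # bytes on the disk
    bandwidth = 77e6                      # assumed sustained read rate, bytes/s
    hours = terabyte / bandwidth / 3600   # ≈ 3.6 hours to scan the whole disk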

12
Q3: Benchmarking Challenges
  • Generalizing Results
  • Beyond specific data set & cluster configuration
  • Performance depends on many different factors
  • Can we predict how program will scale?
  • Identifying Bottlenecks
  • Many interacting parts to system
  • Evaluating Robustness
  • Creating realistic failure modes
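
One simple way to frame the scaling question is a back-of-the-envelope cost model with per-node map and reduce work plus an all-to-all shuffle; the model and constants below are assumptions for illustration, not a validated predictor:

    def predicted_runtime(data_bytes, nodes,
                          disk_bw=75e6, net_bw=100e6, startup=30.0):
        # Toy model: maps read the data from local disks in parallel, the
        # shuffle moves it across the network, reduces write the results;
        # startup is a fixed per-job scheduling overhead in seconds.
        map_time = data_bytes / (nodes * disk_bw)
        shuffle_time = data_bytes / (nodes * net_bw)
        reduce_time = data_bytes / (nodes * disk_bw)
        return startup + map_time + shuffle_time + reduce_time

    predicted_runtime(1e12, 1000)   # seconds for 1 TB on 1,000 nodes

Even a crude model like this makes explicit which factor (disk, network, or fixed startup overhead) dominates as the cluster grows, which is where benchmarking effort should concentrate.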

13
Q4: University Contributions
  • Currently, industry is ahead of universities
  • Dealing with massive data sets
  • Computing at very large scale
  • Developing new programming/runtime approaches
  • Google, Yahoo!, Microsoft
  • University Role
  • More open and systematic inquiry
  • Apply to noncommercial problems
  • Extend and improve programming model and
    notations
  • Expose students to emerging styles of computing

14
Background Information
  • Data-Intensive Supercomputing: The case for
    DISC
  • Tech Report CMU-CS-07-128
  • Available from http://www.cs.cmu.edu/~bryant