MapReduce - PowerPoint PPT Presentation

About This Presentation
Title:

MapReduce

Description:

Model for processing large data sets. Contains Map and Reduce functions. Runs on a large cluster of ... Distributed Grep - Distributed Sort. Programming model ... – PowerPoint PPT presentation

Number of Views:146
Avg rating:3.0/5.0
Slides: 24
Provided by: prasadrag
Learn more at: http://www.cs.umsl.edu
Category:
Tags: mapreduce | grep

less

Transcript and Presenter's Notes

Title: MapReduce


1
MapReduce
  • Simplified
    Data Processing on Large Clusters
  • Google, Inc.

  • Presented by

  • Prasad Raghavendra

2
Introduction
  • Model for processing large data sets.
  • Contains Map and Reduce functions.
  • Runs on a large cluster of machines.
  • A lot of MapReduce programs are executed on
    Googles cluster everyday.

3
Motivation
  • Very large data sets need to be processed.
  • - The whole Web, billions of Pages
  • Lots of machines
  • - Use them efficiently.

4
Processing of Large Data Sets
  • For example
  • - Counting access frequency to URLs
  • Input list(RequestURL)
  • Output list(RequestURL, total_number)
  • - Distributed Grep
  • - Distributed Sort

5
Programming modelInput Output each a set of
key/value pairs Programmer specifies two
functions map (in_key, in_value) -gt
list(out_key, intermediate_value) Name comes
from map function in LISPEx. (map 'list '(1
2 3) '(1 2 3)) gt (2 4 6)-Processes input
key/value pair -Produces set of intermediate
pairs map(document, content) for each word in
contentemit(word, 1)
6
reduce (out_key, list(intermediate_value)) -gt
list(out_value) Name comes from reduce
function in LISPEx. (reduce '(1 2 3 4 5)) gt
15- Combines all intermediate values for a
particular key - Produces a set of merged output
values (usually just one)reduce(word, values)
result 0for each value in valuesresult
valueemitString(w, result)
7
Example The problem of counting the number of
occurrences of each word in a large collection
ofdocuments.
  • Page 1 the weather is good
  • Page 2 today is good
  • Page 3 good weather is good

8
Map output
  • Worker 1
  • (the 1), (weather 1), (is 1), (good 1).
  • Worker 2
  • (today 1), (is 1), (good 1).
  • Worker 3
  • (good 1), (weather 1), (is 1), (good 1).

9
Reduce Input
  • Worker 1(the 1)
  • Worker 2 (is 1), (is 1), (is 1)
  • Worker 3(weather 1), (weather 1)
  • Worker 4(today 1)
  • Worker 5(good 1),(good 1), (good 1),

  • (good 1)

10
Reduce Output
  • Worker 1 (the 1)
  • Worker 2 (is 3)
  • Worker 3 (weather 2)
  • Worker 4 (today 1)
  • Worker 5 (good 4)

11
(No Transcript)
12
Example 2
13
Implementation
14
(No Transcript)
15
Flow of MapReduce Operation
  • The MapReduce library in the user program splits
    the input files into M pieces(16,64 MB).
  • One of the copies of the program is special . The
    master. The rest are workers .
  • A worker who is assigned a map task parses
    key/value pairs out of the input data.
  • Periodically, the buffered pairs are written to
    local disk.
  • When a reduce worker is notified by the master
    about these locations, it uses remote procedure
    calls to read the buffered data.
  • The output of the Reduce function is appended to
    a final output file.
  • When all map tasks and reduce tasks have been
    completed, the master wakes up the user program.

16
Problem Stragglers
  • Often some machines are late in their replies
  • - slow disk, overloaded, etc
  • Approach
  • - when only few tasks left to execute,
    start backup
  • tasks
  • - a task completes when either primary or
    backup
  • completes task
  • Performance
  • - without backup, sort (-gt) takes 44 longer

17
Partition Function
  • Defines which worker processes which keys
  • - default hash(key2) mod R
  • Other partition functions useful
  • - sort prefix of k bytes of line
  • - idea based on known/sampled distribution of
    key2
  • to evenly distribute processed keys

18
Combiner Function
  • Problem
  • intermediate results can be quite verbose
  • e.g., (the, 1) could occur many times in
    previous example
  • Approach
  • perform a local reduction before writing
    intermediate
  • results
  • typically, combiner same function as reduce
    func
  • This will reduce the run-time because less
    writing to
  • disk and across the network

19
Performance
  • Scan 1010 100-byte records to extract records
    matching a rare pattern (92K matching records)
    150 seconds.
  • Sort 1010 100-byte records (modeled after
    TeraSort benchmark) normal 839 seconds.

20
Fault Tolerance
  • Crash of worker
  • all - even finished - tasks are redone
  • Crash of leader
  • crash of leader process -gt restart process
    with

  • checkpoint
  • crash of leader machine-gt unlikely - restart

  • computation
  • redo computation

21
Conclusion
  • MapReduce has proven to be a useful abstraction
  • Easy to use
  • Very large variety of problems are easily
    expressible as MapReduce computations
  • Greatly simplifies large-scale computations at
    Google

22
Questions?
23
Thank You
Write a Comment
User Comments (0)
About PowerShow.com