MapReduce - PowerPoint PPT Presentation

About This Presentation

Title:

MapReduce

Slides: 24

Provided by: prasadrag

Learn more at: http://www.cs.umsl.edu

Transcript and Presenter's Notes

1
MapReduce

Simplified Data Processing on Large Clusters
Google, Inc.
Presented by
Prasad Raghavendra

2
Introduction

Model for processing large data sets.
Contains Map and Reduce functions.
Runs on a large cluster of machines.
Many MapReduce programs are executed on
Google's cluster every day.

3
Motivation

Very large data sets need to be processed.
- The whole Web, billions of pages
Lots of machines
- Use them efficiently.

4
Processing of Large Data Sets

For example
- Counting access frequency to URLs
Input: list(RequestURL)
Output: list(RequestURL, total_number)
- Distributed Grep
- Distributed Sort
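
The first two examples above can be sketched as map and reduce functions. This is a minimal single-process sketch, not Google's implementation: run_job, map_url, reduce_url, make_grep_map and reduce_identity are hypothetical names chosen for illustration, and the sample records are made up.

```python
from collections import defaultdict

def map_url(_key, request_url):
    """Map: emit (url, 1) for each request log entry."""
    yield request_url, 1

def reduce_url(url, counts):
    """Reduce: add up all counts emitted for one URL."""
    yield url, sum(counts)

def make_grep_map(pattern):
    """Build a map function that emits a line only if it contains pattern."""
    def map_grep(line_no, line):
        if pattern in line:
            yield line_no, line
    return map_grep

def reduce_identity(key, values):
    """Reduce for grep: copy the intermediate data to the output unchanged."""
    for value in values:
        yield key, value

def run_job(records, map_fn, reduce_fn):
    """Run map, group by intermediate key, then reduce, all in one process."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    output = []
    for out_key in sorted(groups):
        output.extend(reduce_fn(out_key, groups[out_key]))
    return output

# Made-up sample inputs for illustration only.
requests = [(0, "/index.html"), (1, "/about.html"), (2, "/index.html")]
url_counts = run_job(requests, map_url, reduce_url)

log_lines = [(1, "error: disk full"), (2, "ok"), (3, "error: timeout")]
grep_hits = run_job(log_lines, make_grep_map("error"), reduce_identity)

print(url_counts)
print(grep_hits)
```

The same run_job driver serves both jobs; only the two user-supplied functions change, which is the point of the model.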

5
Programming model

Input and output: each a set of key/value pairs
Programmer specifies two functions
map (in_key, in_value) -> list(out_key, intermediate_value)
Name comes from the map function in LISP
Ex. (map 'list #'+ '(1 2 3) '(1 2 3)) => (2 4 6)
- Processes an input key/value pair
- Produces a set of intermediate pairs
map(document, content)
  for each word in content
    emit(word, 1)

6
reduce (out_key, list(intermediate_value)) -> list(out_value)
Name comes from the reduce function in LISP
Ex. (reduce #'+ '(1 2 3 4 5)) => 15
- Combines all intermediate values for a particular key
- Produces a set of merged output values (usually just one)
reduce(word, values)
  result = 0
  for each value in values
    result += value
  emitString(word, result)
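
A rough Python rendering of the two functions, for readers who want to run them. It is a sketch under assumptions: map_fn and reduce_fn are illustrative names, emit is modeled with yield, and the first two lines use Python's built-in map and functools.reduce as stand-ins for the LISP examples.

```python
from functools import reduce
from operator import add

# Python analogues of the two LISP examples from the slides.
doubled = list(map(add, [1, 2, 3], [1, 2, 3]))  # element-wise sum of two lists
total = reduce(add, [1, 2, 3, 4, 5])            # fold the list with +

def map_fn(document, content):
    """Map: emit (word, 1) for every word in the document's content."""
    for word in content.split():
        yield word, 1

def reduce_fn(word, values):
    """Reduce: sum all intermediate values collected for one word."""
    result = 0
    for value in values:
        result += value
    yield word, result

pairs = list(map_fn("page1", "the weather is good"))
merged = list(reduce_fn("good", [1, 1, 1, 1]))
print(doubled, total)
print(pairs)
print(merged)
```
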
7
Example

The problem of counting the number of
occurrences of each word in a large collection
of documents.

Page 1: the weather is good
Page 2: today is good
Page 3: good weather is good

8
Map output

Worker 1
(the 1), (weather 1), (is 1), (good 1).
Worker 2
(today 1), (is 1), (good 1).
Worker 3
(good 1), (weather 1), (is 1), (good 1).

9
Reduce Input

Worker 1 (the 1)
Worker 2 (is 1), (is 1), (is 1)
Worker 3 (weather 1), (weather 1)
Worker 4 (today 1)
Worker 5 (good 1), (good 1), (good 1), (good 1)

10
Reduce Output

Worker 1 (the 1)
Worker 2 (is 3)
Worker 3 (weather 2)
Worker 4 (today 1)
Worker 5 (good 4)
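
The three-page example can be replayed end to end in a few lines. This sketch runs in one process and only mirrors the data flow of the slides (map output, grouping by word, reduce output); the variable names are illustrative, and the assignment of words to particular workers is not modeled.

```python
from collections import defaultdict

pages = {
    "Page 1": "the weather is good",
    "Page 2": "today is good",
    "Page 3": "good weather is good",
}

# Map phase: one (word, 1) pair per word occurrence, kept per page.
map_output = {
    name: [(word, 1) for word in text.split()]
    for name, text in pages.items()
}

# Grouping step: collect all values emitted for the same word.
reduce_input = defaultdict(list)
for pairs in map_output.values():
    for word, count in pairs:
        reduce_input[word].append(count)

# Reduce phase: sum the values of each word.
reduce_output = {word: sum(counts) for word, counts in reduce_input.items()}
print(reduce_output)
# prints {'the': 1, 'weather': 2, 'is': 3, 'good': 4, 'today': 1}
```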

12
Example 2
13
Implementation
15
Flow of MapReduce Operation

The MapReduce library in the user program splits
the input files into M pieces (typically 16 to 64 MB each).
One of the copies of the program is special: the
master. The rest are workers.
A worker that is assigned a map task parses
key/value pairs out of the input data and passes
each pair to the user-defined Map function.
Periodically, the buffered intermediate pairs are
written to local disk, and their locations are
passed back to the master.
When a reduce worker is notified by the master
about these locations, it uses remote procedure
calls to read the buffered data.
The output of the Reduce function is appended to
a final output file.
When all map tasks and reduce tasks have been
completed, the master wakes up the user program.
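
The flow above can be imitated in a single process to make the roles concrete. This is a toy sketch under stated assumptions: split_input, run_map_task, run_reduce_task and master are hypothetical names, in-memory lists stand in for local disk and remote procedure calls, word count is hard-wired as the job, and zlib.crc32 is used as the hash because Python's built-in hash of strings changes between runs.

```python
import zlib
from collections import defaultdict

def split_input(lines, m):
    """Split the input into at most M pieces (stand-in for 16 to 64 MB splits)."""
    size = max(1, -(-len(lines) // m))  # ceiling division, at least 1
    return [lines[i:i + size] for i in range(0, len(lines), size)]

def run_map_task(split, r):
    """Map worker: parse records, apply map, buffer pairs into R regions."""
    regions = [[] for _ in range(r)]
    for line in split:
        for word in line.split():
            region = zlib.crc32(word.encode("utf-8")) % r
            regions[region].append((word, 1))
    return regions  # stands in for R files on the map worker's local disk

def run_reduce_task(partition, all_regions):
    """Reduce worker: read its region from every map worker, group, reduce."""
    grouped = defaultdict(list)
    for regions in all_regions:  # stand-in for the remote reads
        for key, value in regions[partition]:
            grouped[key].append(value)
    # One sorted output list per reduce task, like one output file each.
    return [(key, sum(values)) for key, values in sorted(grouped.items())]

def master(lines, m, r):
    """Master: hand out map tasks, then reduce tasks, then return the outputs."""
    splits = split_input(lines, m)
    all_regions = [run_map_task(split, r) for split in splits]
    return [run_reduce_task(p, all_regions) for p in range(r)]

lines = ["the weather is good", "today is good", "good weather is good"]
outputs = master(lines, m=3, r=2)
merged = dict(pair for output in outputs for pair in output)
print(merged)
```

Each word lands in exactly one of the R outputs, so the job produces R output lists that together hold the full result.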

16
Problem: Stragglers

Often some machines are late in their replies
- slow disk, overloaded, etc.
Approach
- when only a few tasks are left to execute,
start backup tasks
- a task completes when either the primary or
the backup completes it
Performance
- without backup tasks, sort takes 44% longer
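
The "either primary or backup" rule can be sketched with two threads racing on the same task. This is only an illustration: run_task and run_with_backup are hypothetical names, sleep delays stand in for a slow and a healthy machine, and the scheduling decision of when to launch backups near the end of a job is not modeled.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def run_task(task_id, delay):
    """Pretend to execute a task; delay simulates machine speed."""
    time.sleep(delay)
    return f"result-of-{task_id}"

def run_with_backup(task_id, primary_delay, backup_delay):
    """Run a primary and a backup copy; the first to finish decides the result."""
    pool = ThreadPoolExecutor(max_workers=2)
    futures = {
        pool.submit(run_task, task_id, primary_delay): "primary",
        pool.submit(run_task, task_id, backup_delay): "backup",
    }
    done, _pending = wait(futures, return_when=FIRST_COMPLETED)
    winner = next(iter(done))
    pool.shutdown(wait=False)  # do not block on the slower copy
    return futures[winner], winner.result()

# The primary is the straggler here (0.5 s versus 0.05 s for the backup).
who, result = run_with_backup("map-7", primary_delay=0.5, backup_delay=0.05)
print(who, result)
```

Both copies compute the same result, so it does not matter which one wins; the slower copy's output is simply ignored.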

17
Partition Function

Defines which worker processes which keys
- default: hash(key2) mod R
Other partition functions are useful
- sort: prefix of k bytes of the line
- idea: based on the known/sampled distribution of
key2, evenly distribute the processed keys
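
Both kinds of partition function can be sketched briefly. The names hash_partition and prefix_partition are hypothetical, the boundary values are made up rather than sampled from real data, zlib.crc32 stands in for the unspecified hash, and the prefix is taken in characters rather than bytes for simplicity.

```python
import bisect
import zlib

def hash_partition(key, r):
    """Default style: hash(key) mod R, using a hash that is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % r

def prefix_partition(line, boundaries, k=2):
    """Sort style: choose the partition from the first k characters of the line.

    boundaries is a sorted list of R-1 split points, ideally derived from a
    known or sampled key distribution so the partitions come out even.
    """
    return bisect.bisect_right(boundaries, line[:k])

R = 3
words = ["the", "weather", "is", "good", "today"]
hash_parts = {word: hash_partition(word, R) for word in words}

boundaries = ["h", "p"]  # made-up split points giving 3 ranges
range_parts = [prefix_partition(w, boundaries) for w in ["apple", "mango", "zebra"]]
print(hash_parts)
print(range_parts)  # prints [0, 1, 2]
```

With the prefix scheme, every key in partition i sorts before every key in partition i+1, so concatenating the sorted reduce outputs gives a globally sorted result.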

18
Combiner Function

Problem
- intermediate results can be quite verbose
- e.g., (the, 1) could occur many times in the
previous example
Approach
- perform a local reduction before writing
intermediate results
- typically, the combiner is the same function as
the reduce function
This reduces the run time because less data is
written to disk and sent across the network.
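
A small sketch of the local reduction, using Page 3 of the earlier word-count example. map_words and combine are illustrative names; combining on the map side is safe here because addition is associative and commutative.

```python
from collections import Counter

def map_words(content):
    """Map: one (word, 1) pair per word occurrence."""
    return [(word, 1) for word in content.split()]

def combine(pairs):
    """Combiner: sum counts per word locally, before anything is written out."""
    combined = Counter()
    for word, count in pairs:
        combined[word] += count
    return sorted(combined.items())

raw = map_words("good weather is good")
local = combine(raw)
print(len(raw), raw)
print(len(local), local)
# 4 raw pairs shrink to 3 combined pairs: [('good', 2), ('is', 1), ('weather', 1)]
```

The saving grows with repetition: a word that occurs thousands of times in one map task's input is sent to the reducer as a single pair.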

19
Performance

Scan 10^10 100-byte records to extract records
matching a rare pattern (92K matching records):
150 seconds.
Sort 10^10 100-byte records (modeled after
TeraSort benchmark): normal execution 891 seconds.
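
A quick back-of-the-envelope check of what these figures imply, taking the scan as 10^10 records of 100 bytes finished in 150 seconds. The rate below is an aggregate over the whole cluster and includes any startup time, so it is only a rough average.

```python
records = 10**10          # number of input records
record_bytes = 100        # size of each record in bytes
total_bytes = records * record_bytes

scan_seconds = 150        # reported time for the scan job
scan_rate = total_bytes / scan_seconds  # average bytes per second, cluster-wide

print(total_bytes / 1e12, "TB of input")           # prints 1.0 TB of input
print(round(scan_rate / 1e9, 2), "GB/s on average")  # prints 6.67 GB/s on average
```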

20
Fault Tolerance