Title: MapReduce

1. MapReduce: Simplified Data Processing on Large Clusters
- Google, Inc.
- Presented by Prasad Raghavendra
2. Introduction
- A programming model for processing large data sets.
- The programmer supplies a Map and a Reduce function.
- Runs on a large cluster of machines.
- Many MapReduce programs are executed on Google's clusters every day.
3. Motivation
- Very large data sets need to be processed.
  - e.g., the whole Web: billions of pages.
- Lots of machines are available.
  - Use them efficiently.
4. Processing of Large Data Sets
- For example:
  - Counting access frequency to URLs
    - Input: list(RequestURL)
    - Output: list(RequestURL, total_number)
  - Distributed grep
  - Distributed sort
5. Programming Model
- Input and output: each a set of key/value pairs.
- The programmer specifies two functions.
- map(in_key, in_value) -> list(out_key, intermediate_value)
  - Processes an input key/value pair.
  - Produces a set of intermediate pairs.
  - The name comes from the map function in Lisp, e.g.
    (map 'list #'+ '(1 2 3) '(1 2 3)) => (2 4 6)
- Word-count example:
    map(document, content):
      for each word in content:
        emit(word, 1)
6. Programming Model (cont.)
- reduce(out_key, list(intermediate_value)) -> list(out_value)
  - Combines all intermediate values for a particular key.
  - Produces a set of merged output values (usually just one).
  - The name comes from the reduce function in Lisp, e.g.
    (reduce #'+ '(1 2 3 4 5)) => 15
- Word-count example:
    reduce(word, values):
      result = 0
      for each value in values:
        result += value
      emit(word, result)
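The two pseudocode functions above can be sketched in Python; the grouping step between map and reduce is folded into a small driver (the function names `map_fn`, `reduce_fn`, and `run_word_count` are mine, not from the paper):

```python
from collections import defaultdict

def map_fn(document, content):
    """Map: emit an intermediate (word, 1) pair for every word."""
    for word in content.split():
        yield (word, 1)

def reduce_fn(word, values):
    """Reduce: sum all intermediate counts for one word."""
    return (word, sum(values))

def run_word_count(docs):
    """Group intermediate pairs by key, then reduce each group."""
    groups = defaultdict(list)
    for name, content in docs.items():
        for word, count in map_fn(name, content):
            groups[word].append(count)
    return dict(reduce_fn(word, values) for word, values in groups.items())
```

On the three pages of the example on the next slides, `run_word_count` reproduces the reduce output shown there (e.g. `good -> 4`).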
7. Example
- Problem: count the number of occurrences of each word in a large collection of documents.
  - Page 1: "the weather is good"
  - Page 2: "today is good"
  - Page 3: "good weather is good"
8. Map Output
- Worker 1: (the, 1), (weather, 1), (is, 1), (good, 1)
- Worker 2: (today, 1), (is, 1), (good, 1)
- Worker 3: (good, 1), (weather, 1), (is, 1), (good, 1)
9. Reduce Input
- Worker 1: (the, 1)
- Worker 2: (is, 1), (is, 1), (is, 1)
- Worker 3: (weather, 1), (weather, 1)
- Worker 4: (today, 1)
- Worker 5: (good, 1), (good, 1), (good, 1), (good, 1)
10. Reduce Output
- Worker 1: (the, 1)
- Worker 2: (is, 3)
- Worker 3: (weather, 2)
- Worker 4: (today, 1)
- Worker 5: (good, 4)
11. (figure slide; no transcript)
12. Example 2
13. Implementation
14. (figure slide; no transcript)
15. Flow of a MapReduce Operation
- The MapReduce library in the user program splits the input files into M pieces (typically 16-64 MB each).
- One of the copies of the program is special: the master. The rest are workers.
- A worker assigned a map task parses key/value pairs out of its input split.
- Periodically, the buffered intermediate pairs are written to local disk.
- When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data.
- The output of the Reduce function is appended to a final output file.
- When all map tasks and reduce tasks have completed, the master wakes up the user program.
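The steps above can be sketched as a single-process simulation: the input is split into M pieces, each map task buffers its pairs into R partitions (standing in for the local-disk files), and each reduce task merges one partition. The master, workers, and RPCs are elided, and the function names are mine:

```python
from collections import defaultdict

def map_task(split):
    """One map task: parse its input split and emit (word, 1) pairs."""
    return [(word, 1) for word in split.split()]

def mapreduce(text, M=3, R=2):
    """Simulate the flow: split input into M pieces, run the map tasks,
    partition intermediate pairs into R buckets, run the reduce tasks."""
    words = text.split()
    step = max(1, len(words) // M)
    splits = [" ".join(words[i:i + step]) for i in range(0, len(words), step)]

    # Map phase: each task buffers its pairs into R partitions,
    # as the real system buffers them on local disk.
    partitions = [defaultdict(list) for _ in range(R)]
    for split in splits:
        for key, value in map_task(split):
            partitions[hash(key) % R][key].append(value)

    # Reduce phase: each reduce worker reads one partition, merges the
    # values per key, and appends the result to its output.
    output = {}
    for bucket in partitions:
        for key, values in bucket.items():
            output[key] = sum(values)
    return output
```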
16. Problem: Stragglers
- Often some machines are late in their replies.
  - Slow disk, overloaded machine, etc.
- Approach:
  - When only a few tasks are left to execute, start backup tasks.
  - A task completes when either the primary or the backup completes it.
- Performance:
  - Without backup tasks, sort takes 44% longer.
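The backup-task idea can be sketched with threads: run two copies of the same task and accept whichever finishes first. The delays below simulate a straggler; nothing here comes from the paper's actual scheduler:

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def sort_task(data, delay):
    """Simulated task; `delay` stands in for a slow disk or an
    overloaded machine."""
    time.sleep(delay)
    return sorted(data)

def run_with_backup(data):
    """Launch a primary and a backup copy of the same task; the task
    completes as soon as either copy finishes, so a single straggler
    cannot stall the whole job."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        primary = pool.submit(sort_task, data, 0.5)    # the straggler
        backup = pool.submit(sort_task, data, 0.01)    # the backup copy
        done, _ = wait([primary, backup], return_when=FIRST_COMPLETED)
        return done.pop().result()
```

Both copies are idempotent and produce the same result, so it does not matter which one wins the race.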
17. Partition Function
- Defines which reduce worker processes which keys.
  - Default: hash(key) mod R.
- Other partition functions can be useful:
  - Sort on a prefix of k bytes of the line.
  - Idea: base the partition on the known/sampled distribution of the intermediate keys, to distribute the processed keys evenly.
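A sketch of the two schemes above: the default hash partition, and a range partition on the first byte of the key so that each reduce worker's output is itself a sorted range (useful for distributed sort). R and the key ranges are illustrative values of mine; the range function assumes lowercase ASCII keys:

```python
def default_partition(key, R):
    """Default scheme: spread keys across R reduce workers by hash."""
    return hash(key) % R

def range_partition(key, R):
    """Range partition on the first byte of the key: keys that sort
    earlier go to lower-numbered reduce workers, so concatenating the
    R output files in order yields globally sorted output.
    Assumes lowercase ASCII keys; the ranges are illustrative."""
    bucket = (ord(key[0]) - ord("a")) * R // 26
    return min(max(bucket, 0), R - 1)
```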
18. Combiner Function
- Problem:
  - Intermediate results can be quite verbose.
  - E.g., (the, 1) could occur many times in the previous example.
- Approach:
  - Perform a local reduction before writing intermediate results.
  - Typically, the combiner is the same function as the reduce function.
- This reduces the run time, because less data is written to disk and sent across the network.
19. Performance
- Scan: 10^10 100-byte records, extracting the records that match a rare pattern (92K matching records): 150 seconds.
- Sort: 10^10 100-byte records (modeled after the TeraSort benchmark): normally 839 seconds.
20. Fault Tolerance
- Crash of a worker:
  - All of its tasks, even finished ones, are redone (completed map output lives only on the failed machine's local disk).
- Crash of the leader:
  - Crash of the leader process -> restart the process from a checkpoint.
  - Crash of the leader machine -> unlikely; restart and redo the computation.
21. Conclusion
- MapReduce has proven to be a useful abstraction.
- It is easy to use.
- A very large variety of problems are easily expressible as MapReduce computations.
- It greatly simplifies large-scale computations at Google.
22. Questions?
23. Thank You