Title: Outline
 1 Outline
- Parallel Processing Patterns 
- MapReduce Abstraction 
- MapReduce Pseudocode 
- MapReduce Examples 
- Relational Join 
- Matrix Multiplication 
- MapReduce Implementation Overview 
 5 Run thousands of simulations
You have sets of parameters for thousands of small simulations
Divide the parameter sets among k computers
f runs the simulation and produces some output; apply it to every item
[Figure: six boxes labeled f running in parallel]
Now we have a big distributed set of simulation results
 9 Find the most common word in each document
You have millions of documents
Distribute the documents among k computers
f finds the most common word in a single document
[Figure: six boxes labeled f running in parallel]
Now we have a big distributed list of (doc_id, word) pairs
 10 Consider a slightly more general program to compute the word frequency of every word in a single document
Biography of Pat Martino
When the anesthesia wore off, Pat Martino looked up hazily at his 
parents and his doctors, and tried to piece 
together any memory of his life. One of the 
greatest guitarists in jazz, Martino had suffered 
a severe brain aneurysm and underwent surgery 
after being told that his condition could be 
terminal. After his operations he could remember 
almost nothing. He barely recognized his parents, 
and had no memory of his guitar or his career. He 
remembers feeling as if he had been "dropped 
cold, empty, neutral, cleansed, ... naked." In 
the following months, Martino made a remarkable 
recovery. Through intensive study of his own 
historic recordings, and with the help of 
computer technology, Pat managed to reverse his 
memory loss and return to form on his instrument. 
His past recordings eventually became "an old 
friend, a spiritual experience which remained 
beautiful and honest." This recovery fits in 
perfectly with Pat's illustrious personal 
history. Since playing his first notes while 
still in his pre-teenage years, Martino has been 
recognized as one of the most exciting and 
virtuosic guitarists in jazz. With a distinctive, 
fat sound and gut-wrenching performances, he 
represents the best not just in jazz, but in 
music. He embodies thoughtful energy and 
soul. Born Pat Azzara in Philadelphia in 1944, 
Pat was first exposed to jazz through his father, 
Carmen "Mickey" Azzara, who sang in local clubs 
and briefly studied guitar with Eddie Lang. He 
took Pat to all the city's hot-spots to hear and 
meet Wes Montgomery and other musical giants. "I 
have always admired my father and have wanted to 
impress him. As a result, it forced me to get 
serious with my creative powers."
(memory, 3) (jazz, 4) (life, 1) (with, 6) (recovery, 2)
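A minimal sketch of such a single-document word-frequency function follows. The function name `word_frequencies`, the tokenization rule (lowercase runs of letters and apostrophes), and the sample sentence are assumptions for illustration, not something the slides specify.

```python
import re
from collections import Counter

def word_frequencies(document: str) -> Counter:
    """Count how often each word occurs in one document (case-insensitive)."""
    # Assumed tokenization: runs of letters and apostrophes, lowercased.
    words = re.findall(r"[a-z']+", document.lower())
    return Counter(words)

# Made-up sample text, not the biography from the slide.
sample = "Jazz guitarists love jazz. Memory shapes jazz."
freqs = word_frequencies(sample)
print(freqs["jazz"])    # 3
print(freqs["memory"])  # 1
```

A different tokenizer (for example one that keeps hyphenated words together) would give slightly different counts, so the exact numbers depend on that choice.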
 14 Compute the word frequency of 5M documents
You have millions of documents
Distribute the documents among k computers
For each document, f returns a set of (word, freq) pairs
[Figure: six boxes labeled f running in parallel]
Now we have a big distributed list of sets of word freqs.
 15 There is a pattern here
- A function that maps a set of parameters to a simulation result
- A function that maps a document to its most common word
- A function that maps a document to a histogram of word frequencies
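The shared pattern, applying one function f independently to every item, can be sketched as below. Threads stand in for the k computers here. `parallel_map`, `most_common_word`, and the toy documents are hypothetical names and data chosen for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def most_common_word(document: str) -> str:
    """f: map one document to its most common word."""
    return Counter(document.lower().split()).most_common(1)[0][0]

def parallel_map(f, items, k=4):
    """Apply f to every item using k workers (threads stand in for computers)."""
    with ThreadPoolExecutor(max_workers=k) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(f, items))

documents = {"doc1": "a b a", "doc2": "x y y y", "doc3": "q"}
pairs = list(zip(documents, parallel_map(most_common_word, documents.values())))
print(pairs)  # [('doc1', 'a'), ('doc2', 'y'), ('doc3', 'q')]
```

Swapping `most_common_word` for a simulation function or a histogram function changes f but not the driver, which is the point of the pattern.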
16 What if we want to compute the word frequency across all documents?
 20 Compute the word frequency across 5M documents
You have millions of documents
Distribute the documents among k computers
For each document, map returns a set of (word, freq) pairs
[Figure: six boxes labeled map running in parallel]
Now what?
How can we make sure that a single computer has access to every occurrence of a given word, regardless of which document it appeared in?
 21 Compute the word frequency across 5M documents
Distribute the documents among k computers
For each document, map returns a set of (word, freq) pairs
[Figure: six boxes labeled map running in parallel]
Now we have a big distributed list of sets of word freqs.
 25 Compute the word frequency across 5M documents
Distribute the documents among k computers
For each document, map returns a set of (word, freq) pairs
[Figure: six boxes labeled map running in parallel]
Now we have a big distributed list of sets of word freqs.
Now just count the occurrences of each word
[Figure: four boxes labeled reduce running in parallel]
We have our distributed histogram
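One way to picture how every occurrence of a word ends up on a single computer is to route each (word, freq) pair by a hash of the word. The sketch below is a single-process simulation under that assumption. `NUM_REDUCERS`, `reducer_for`, and the CRC32-based routing are illustrative choices, not details taken from the slides.

```python
import zlib
from collections import Counter, defaultdict

NUM_REDUCERS = 4  # assumed; the slides do not fix the number of reducers

def map_doc(text):
    """map: one document -> list of (word, freq) pairs."""
    return list(Counter(text.lower().split()).items())

def reducer_for(word):
    """Route every occurrence of a word to the same reducer (deterministic hash)."""
    return zlib.crc32(word.encode("utf-8")) % NUM_REDUCERS

def run(documents):
    # One inbox per reducer: word -> list of per-document frequencies.
    inboxes = [defaultdict(list) for _ in range(NUM_REDUCERS)]
    for text in documents:
        for word, freq in map_doc(text):
            inboxes[reducer_for(word)][word].append(freq)
    # reduce: each reducer sums the frequencies of the words it owns.
    return [{w: sum(fs) for w, fs in inbox.items()} for inbox in inboxes]

histogram_parts = run(["jazz guitar jazz", "guitar memory", "jazz"])
merged = {w: c for part in histogram_parts for w, c in part.items()}
print(merged["jazz"], merged["guitar"], merged["memory"])  # 3 2 1
```

Each element of `histogram_parts` plays the role of one reducer's slice of the distributed histogram; no word appears in more than one slice.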
27 MapReduce
[Figure: MAP -> Shuffle -> REDUCE. Input pairs (did1, v1), (did2, v2), (did3, v3), ... go to the map tasks; one map task emits (w1, 1), (w2, 1), (w3, 1), ..., another emits (w1, 1), (w2, 1), ...; after the shuffle, each reduce task receives a word with the list of its values, e.g. (w1, (1, 1, 1, ..., 1)).]
 28 MapReduce
- Google paper published 2004: "MapReduce: Simplified Data Processing on Large Clusters", Jeffrey Dean and Sanjay Ghemawat
- Free variant: Hadoop
- MapReduce = high-level programming model and implementation for large-scale parallel data processing
29 Hadoop History
- Created by Doug Cutting, the creator of Apache Lucene, the widely used text search library
- 2002: Nutch, before the GFS publication
- 2004: Nutch Distributed Filesystem
- 2006: Hadoop at Yahoo!
- 2008: Yahoo! announced its production search index was being generated by a 10,000-core Hadoop cluster
- 2008: Hadoop made its own top-level project at Apache
30 Hadoop History
- April 2008: Won the 1 terabyte sort benchmark in 209 seconds on 900 nodes
- April 2009: Won the minute sort by sorting 500 GB in 59 seconds (on 1,400 nodes) and the 100 terabyte sort in 173 minutes (on 3,400 nodes)
31 Apache Hadoop
- Open-source implementation of Map-Reduce 
- The storage is provided by HDFS and analysis by Map-Reduce
- Other parts like Pig, Hive, ...; but the above capabilities are its kernel
- Pig: A data flow language and execution environment for exploring very large datasets.
- Hive: A distributed data warehouse which manages data stored in HDFS and provides a query language based on SQL for querying the data
32 Data Model
- A file = a bag of (key, value) pairs
- A map-reduce program:
- - Input: a bag of (inputkey, value) pairs
- - Output: a bag of (outputkey, value) pairs
33 Step 1: The MAP Phase
- User provides the MAP function:
- - Input: (input key, value)
- - Output: bag of (intermediate key, value)
- System applies the map function in parallel to all (input key, value) pairs in the input file.
34 Step 2: The REDUCE Phase
- User provides the REDUCE function:
- - Input: (intermediate key, bag of values)
- - Output: bag of output values
- The system groups all pairs with the same intermediate key and passes the bag of values to the REDUCE function.
35 MapReduce Programming Model
- Input & Output: each a set of key/value pairs
- Programmer specifies two functions:
- map(in_key, in_value) -> list(out_key, intermediate_value)
- - Processes an input key/value pair
- - Produces a set of intermediate pairs
- reduce(out_key, list(intermediate_value)) -> list(out_value)
- - Combines all intermediate values for a particular key
- - Produces a set of merged output values (usually just one)
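The two signatures above can be exercised with a small single-process driver. This is only a sketch of the programming model, not of how Hadoop or Google's system executes a job. `map_reduce` and the file-extension example are hypothetical.

```python
from collections import defaultdict

def map_reduce(inputs, map_fn, reduce_fn):
    """Run map over every (in_key, in_value), group by out_key, then reduce.

    map_fn(in_key, in_value)         -> iterable of (out_key, intermediate_value)
    reduce_fn(out_key, values_list)  -> iterable of out_value
    """
    groups = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, value in map_fn(in_key, in_value):
            groups[out_key].append(value)   # the "group by key" step
    return {k: list(reduce_fn(k, vs)) for k, vs in groups.items()}

# Usage with made-up data: longest text length per file extension.
def map_fn(name, text):
    yield name.rsplit(".", 1)[-1], len(text)

def reduce_fn(ext, lengths):
    yield max(lengths)

result = map_reduce(
    [("a.txt", "hello"), ("b.txt", "hi"), ("c.md", "hey")], map_fn, reduce_fn
)
print(result)  # {'txt': [5], 'md': [3]}
```

The reduce output is a list per key to match `list(out_value)` in the signature, even though it usually holds a single value.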
37 Example: What does this do?
map(String input_key, String input_value):
  // input_key: document name
  // input_value: document content
  for each word w in input_value:
    EmitIntermediate(w, 1);
reduce(String intermediate_key, Iterator intermediate_values):
  // intermediate_key: word
  // intermediate_values: ???
  int result = 0;
  for each v in intermediate_values:
    result += v;
  EmitFinal(intermediate_key, result);
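To experiment with the pseudocode, it can be transliterated into Python roughly as follows. The grouping loop in the middle stands in for the system's shuffle. The generator-based `map_fn` and `reduce_fn` and the two toy documents are assumptions made for this sketch.

```python
from collections import defaultdict

def map_fn(input_key, input_value):
    # input_key: document name, input_value: document content
    for w in input_value.split():
        yield (w, 1)                       # EmitIntermediate(w, 1)

def reduce_fn(intermediate_key, intermediate_values):
    # intermediate_key: a word, intermediate_values: the values emitted for it
    result = 0
    for v in intermediate_values:
        result += v
    yield (intermediate_key, result)       # EmitFinal(intermediate_key, result)

docs = [("d1", "to be or not to be"), ("d2", "to do")]

# Stand-in for the system's grouping of intermediate pairs by key.
grouped = defaultdict(list)
for name, content in docs:
    for w, one in map_fn(name, content):
        grouped[w].append(one)

counts = dict(kv for w, ones in grouped.items() for kv in reduce_fn(w, ones))
print(counts)  # {'to': 3, 'be': 2, 'or': 1, 'not': 1, 'do': 1}
```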
39 Natural Join
- Join of R(A, B) with S(B, C) is the set of tuples (a, b, c) such that (a, b) is in R and (b, c) is in S.
- Mappers need to send R(a, b) and S(b, c) to the same reducer, so they can be joined there.
- Mapper output: key = B-value, value = relation and other component (A or C).
- Example:
- - R(1, 2) -> (2, (R, 1))
- - S(2, 3) -> (2, (S, 3))
40 Mapping Tuples
R(1,2) -> Mapper for R(1,2) -> (2, (R,1))
R(4,2) -> Mapper for R(4,2)
S(2,3) -> Mapper for S(2,3)
S(5,6) -> Mapper for S(5,6)
 41 Grouping Phase
- There is a reducer for each key. 
- Every key-value pair generated by any mapper is sent to the reducer for its key.
42 Mapping Tuples
Mapper for R(1,2) -> (2, (R,1)) -> Reducer for B = 2
Mapper for R(4,2) -> (2, (R,4)) -> Reducer for B = 2
Mapper for S(2,3) -> (2, (S,3)) -> Reducer for B = 2
Mapper for S(5,6) -> (5, (S,6)) -> Reducer for B = 5
 43 Constructing Value-Lists
- The input to each reducer is organized by the system into a pair:
- - The key. 
- - The list of values associated with that key.
44 The Value-List Format
Reducer for B = 2: (2, (R,1), (R,4), (S,3))
Reducer for B = 5: (5, (S,6))
 45 The Reduce Function for Join
- Given key b and a list of values that are either (R, a_i) or (S, c_j), output each triple (a_i, b, c_j).
- Thus, the number of outputs made by a reducer is the product of the number of R's on the list and the number of S's on the list.
46 Output of the Reducers
Reducer for B = 2: (2, (R,1), (R,4), (S,3)) -> (1,2,3), (4,2,3)
Reducer for B = 5: (5, (S,6)) -> no output, because no R tuple has B = 5
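The join example can be replayed in a few lines using the same tuples, R(1,2), R(4,2), S(2,3), S(5,6). The helper names `map_r`, `map_s`, and `reduce_join` are hypothetical, and the dictionary grouping stands in for the system's grouping phase.

```python
from collections import defaultdict

R = [(1, 2), (4, 2)]   # tuples (a, b)
S = [(2, 3), (5, 6)]   # tuples (b, c)

def map_r(a, b):
    return (b, ("R", a))    # key = B-value, value = (relation, other component)

def map_s(b, c):
    return (b, ("S", c))

def reduce_join(b, values):
    a_list = [x for tag, x in values if tag == "R"]
    c_list = [x for tag, x in values if tag == "S"]
    # Output size = (number of R entries) * (number of S entries).
    return [(a, b, c) for a in a_list for c in c_list]

# Grouping phase: collect all values that share a key.
groups = defaultdict(list)
for key, value in [map_r(*t) for t in R] + [map_s(*t) for t in S]:
    groups[key].append(value)

joined = [t for b in sorted(groups) for t in reduce_join(b, groups[b])]
print(joined)  # [(1, 2, 3), (4, 2, 3)]
```

The reducer for key 5 sees only an S entry, so its product is empty and it contributes nothing.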
48 Matrix Multiplication
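$$
\begin{pmatrix} 1 & 3 & 4 & -2 \\ 6 & 2 & -3 & 1 \end{pmatrix}
\times
\begin{pmatrix} 1 & -2 \\ 4 & 3 \\ -3 & -2 \\ 0 & 4 \end{pmatrix}
=
\begin{pmatrix} 1 & -9 \\ 23 & 4 \end{pmatrix}
$$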
 50 Matrix Multiply in MapReduce
[Figure: matrices A and B multiplied to give AB]
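The slides show the product only as a figure, so the following is one common single-pass formulation and not necessarily the one the lecture had in mind. Each element of A is sent to every output cell in its row, each element of B to every output cell in its column, and the reducer for cell (i, k) sums the matching products. The matrices are the 2 x 4 and 4 x 2 example from the Matrix Multiplication slide. All function names are hypothetical.

```python
from collections import defaultdict

A = [[1, 3, 4, -2],
     [6, 2, -3, 1]]          # 2 x 4
B = [[1, -2],
     [4, 3],
     [-3, -2],
     [0, 4]]                 # 4 x 2

n_rows, n_inner, n_cols = len(A), len(B), len(B[0])

def map_a(i, j, value):
    # A[i][j] contributes to every output cell in row i.
    return [((i, k), ("A", j, value)) for k in range(n_cols)]

def map_b(j, k, value):
    # B[j][k] contributes to every output cell in column k.
    return [((i, k), ("B", j, value)) for i in range(n_rows)]

def reduce_cell(values):
    # Pair up A and B entries that share the inner index j and sum the products.
    a = {j: v for tag, j, v in values if tag == "A"}
    b = {j: v for tag, j, v in values if tag == "B"}
    return sum(a[j] * b[j] for j in a if j in b)

groups = defaultdict(list)
for i in range(n_rows):
    for j in range(n_inner):
        for key, val in map_a(i, j, A[i][j]):
            groups[key].append(val)
for j in range(n_inner):
    for k in range(n_cols):
        for key, val in map_b(j, k, B[j][k]):
            groups[key].append(val)

AB = [[reduce_cell(groups[(i, k)]) for k in range(n_cols)] for i in range(n_rows)]
print(AB)  # [[1, -9], [23, 4]]
```

This version replicates every input element once per output row or column, which is simple but sends a lot of data through the shuffle. Two-pass formulations trade an extra job for less replication.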
52 Cluster Computing
- Large number of commodity servers, connected by a high-speed commodity network
- Rack holds a small number of servers 
- Data center holds many racks
53 Cluster Computing
- Massive parallelism
- - 100s, or 1000s, or 10000s of servers
- Many hours
- Failure
- - If the mean time between failures is 1 year
- - Then 10000 servers have about one failure / hour
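The failure estimate can be checked with a line of arithmetic, assuming each server fails independently about once per year and taking a year as 365 days:

```latex
\frac{10000\ \text{failures/year}}{365 \times 24\ \text{hours/year}}
  = \frac{10000}{8760}\ \text{failures/hour}
  \approx 1.14\ \text{failures/hour}
```

So a job that runs for many hours on such a cluster should expect to see several server failures while it runs.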
54 Distributed File System (DFS)
- For very large files: TBs, PBs
- Each file is partitioned into chunks, typically 64 MB
- Each chunk is replicated several times (3), on different racks, for fault tolerance
- Implementations:
- - Google's DFS: GFS, proprietary
- - Hadoop's DFS: HDFS, open source
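As a rough illustration of the chunking and replication numbers above, the sketch below counts the chunks of a file and assigns each chunk to three distinct racks. The round-robin placement is a made-up policy for illustration and is not the placement algorithm of GFS or HDFS. `num_chunks`, `place_replicas`, and the rack names are hypothetical.

```python
import math

CHUNK_SIZE = 64 * 1024**2   # 64 MB, the typical chunk size quoted above
REPLICAS = 3                # replication factor quoted above

def num_chunks(file_size_bytes):
    """Number of chunks needed to hold a file of the given size."""
    return math.ceil(file_size_bytes / CHUNK_SIZE)

def place_replicas(chunk_index, racks):
    """Pick REPLICAS distinct racks for one chunk (hypothetical round-robin policy).

    Requires len(racks) >= REPLICAS so that the chosen racks are distinct.
    """
    return [racks[(chunk_index + r) % len(racks)] for r in range(REPLICAS)]

one_tb = 1024**4
racks = ["rack0", "rack1", "rack2", "rack3"]
print(num_chunks(one_tb))         # 16384
print(place_replicas(0, racks))   # ['rack0', 'rack1', 'rack2']
print(place_replicas(3, racks))   # ['rack3', 'rack0', 'rack1']
```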
55 MapReduce Phases
[Figure: MapReduce phases. Map Task: file on HDFS -> Split -> Record Reader -> Map -> Combine -> Local Storage. Reduce Task: Copy -> Sort -> Reduce -> file on HDFS. Partitions labeled P 1 to P 5.]
 56 Combiner
Same word appears twice. Why not just send (w1, 2)?
[Figure: MAP -> Shuffle -> REDUCE. Input pairs (did1, v1), (did2, v2), (did3, v3), ... go to the map tasks; one map task emits (w1, 1), (w2, 1), (w1, 1), ..., another emits (w1, 1), (w2, 1), ...; after the shuffle, each reduce task receives a word with the list of its values, e.g. (w1, (1, 1, 1, ..., 1)).]
 57 Adding a Combiner
map(String input_key, String input_value):
  // input_key: document name
  // input_value: document content
  for each word w in input_value:
    EmitIntermediate(w, 1);
combine(String intermediate_key, Iterator intermediate_values):
  returns (intermediate_key, intermediate_value)
reduce(String intermediate_key, Iterator intermediate_values):
  // intermediate_key: word
  // intermediate_values: ???
  int result = 0;
  for each v in intermediate_values:
    result += v;
  Emit(result);
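A small simulation suggests what the combiner saves: it pre-sums the pairs produced by one map task, so fewer pairs cross the network during the shuffle. The sketch assumes a combiner that sums counts per word. The names `combine`, `sent`, and the toy documents are illustrative only.

```python
from collections import Counter, defaultdict

def map_fn(doc_name, content):
    return [(w, 1) for w in content.split()]

def combine(pairs):
    """Pre-sum the counts emitted by one map task before they are shuffled."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())

def reduce_fn(word, counts):
    return (word, sum(counts))

docs = [("d1", "w1 w2 w1"), ("d2", "w1 w2")]
shuffled = defaultdict(list)
sent = 0                                   # pairs that cross the shuffle
for name, content in docs:
    for word, count in combine(map_fn(name, content)):
        shuffled[word].append(count)
        sent += 1

totals = dict(reduce_fn(w, c) for w, c in shuffled.items())
print(sent)    # 4 (without the combiner, 5 pairs would be shuffled)
print(totals)  # {'w1': 3, 'w2': 2}
```

This works because addition is associative and commutative, so the reducer gets the same totals whether it sums raw 1s or partial sums.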
58 Apache Hadoop Architecture
[Figure: data on HDFS; INPUT PARTITION -> four MAP tasks -> SHUFFLING -> SORT IN PARALLEL -> two REDUCE tasks -> OUTPUT PARTITION]