Title: Distributed Computations: MapReduce/Dryad
1 Distributed Computations: MapReduce/Dryad
- M/R slides adapted from those of Jeff Dean
- Dryad slides adapted from those of Michael Isard
2 What we've learnt so far
- Basic distributed systems concepts
- Consistency (sequential, eventual)
- Concurrency
- Fault tolerance (recoverability, availability)
- What are distributed systems good for?
- Better fault tolerance
- Better security?
- Increased storage/serving capacity
- Storage systems, email clusters
- Parallel (distributed) computation (Today's topic)
3 Why distributed computations?
- How long to sort 1 TB on one computer?
- One computer can read about 60 MB/s from disk
- Sorting 1 TB takes about a day!
- Google indexes 100 billion web pages
- 100 × 10⁹ pages × 20 KB/page ≈ 2 PB
- Large Hadron Collider is expected to produce 15
PB every year!
4 Solution: use many nodes!
- Cluster computing
  - Hundreds or thousands of PCs connected by high-speed LANs
- Grid computing
  - Hundreds of supercomputers connected by a high-speed net
- 1,000 nodes potentially give 1000X speedup
5 Distributed computations are difficult to program
- Sending data to/from nodes
- Coordinating among nodes
- Recovering from node failure
- Optimizing for locality
- Debugging
6 MapReduce
- A programming model for large-scale computations
- Process large amounts of input, produce output
- No side-effects or persistent state (unlike a file system)
- MapReduce is implemented as a runtime library
- automatic parallelization
- load balancing
- locality optimization
- handling of machine failures
7 MapReduce design
- Input data is partitioned into M splits
- Map: extract information from each split
  - Each Map produces R partitions
- Shuffle and sort
  - Bring the matching partitions from all M maps to the same reducer
- Reduce: aggregate, summarize, filter, or transform
- Output is in R result files
8 More specifically
- Programmer specifies two methods
  - map(k, v) → <k', v'>
  - reduce(k', <v'>) → <k', v'>
  - All v' with the same k' are reduced together, in order
- Usually also specify
  - partition(k, total partitions) → partition for k
  - often a simple hash of the key
  - allows reduce operations for different k to be parallelized
9 Example: count word frequencies in web pages
- Input is files with one doc per record
- Map parses documents into words
  - key: document URL
  - value: document contents
- Output of map for <doc1, "to be or not to be">:
  <to, 1>, <be, 1>, <or, 1>, <not, 1>, <to, 1>, <be, 1>
10 Example: word frequencies
- Reduce computes the sum for a key
  - key: be, values: 1, 1 → <be, 2>
- Output of reduce is saved:
  <be, 2>, <not, 1>, <or, 1>, <to, 2>
11 Example: pseudo-code
- Map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
      EmitIntermediate(w, "1")
- Reduce(String key, Iterator intermediate_values):
    // key: a word, same for input and output
    // intermediate_values: a list of counts
    int result = 0
    for each v in intermediate_values:
      result += ParseInt(v)
    Emit(AsString(result))
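For readers who want to run the example, here is a minimal single-process Python sketch of the same word-count job. The map_fn/reduce_fn/run_job names and the in-memory "shuffle" dictionary are illustrative only; they are not the MapReduce library's API.

  from collections import defaultdict

  def map_fn(input_key, input_value):
      # input_key: document name, input_value: document contents
      for word in input_value.split():
          yield (word, "1")

  def reduce_fn(key, intermediate_values):
      # key: a word; intermediate_values: a list of string counts
      return str(sum(int(v) for v in intermediate_values))

  def run_job(documents):
      intermediate = defaultdict(list)
      for doc_name, contents in documents.items():      # map phase
          for k, v in map_fn(doc_name, contents):
              intermediate[k].append(v)                 # "shuffle": group by key
      return {k: reduce_fn(k, vs) for k, vs in sorted(intermediate.items())}  # reduce phase

  print(run_job({"doc1": "to be or not to be"}))
  # {'be': '2', 'not': '1', 'or': '1', 'to': '2'}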
12 MapReduce is widely applicable
- Distributed grep
- Document clustering
- Web link graph reversal
- Detecting approx. duplicate web pages
13 MapReduce implementation
- Input data is partitioned into M splits
- Map: extract information from each split
  - Each Map produces R partitions
- Shuffle and sort
  - Bring the matching partitions from all M maps to the same reducer
- Reduce: aggregate, summarize, filter, or transform
- Output is in R result files
14 MapReduce scheduling
- One master, many workers
- Input data split into M map tasks (e.g. 64 MB)
- R reduce tasks
- Tasks are assigned to workers dynamically
- Often M ≈ 200,000, R ≈ 4,000, workers ≈ 2,000
15 MapReduce scheduling
- Master assigns a map task to a free worker
  - Prefers close-by workers when assigning tasks (a toy version of this rule is sketched below)
  - Worker reads the task input (often from local disk!)
  - Worker produces R local files containing intermediate k/v pairs
- Master assigns a reduce task to a free worker
  - Worker reads intermediate k/v pairs from the map workers
  - Worker sorts them and applies the user's Reduce op to produce the output
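As a concrete illustration of the locality preference, here is a hedged Python sketch of the assignment rule: give a free worker a map task whose input split lives on that worker's host if one exists, otherwise any pending task. The MapTask/Worker classes and their fields are made up for the example; they are not the real master's data structures.

  from dataclasses import dataclass

  @dataclass
  class MapTask:
      task_id: int
      split_locations: set   # hosts holding replicas of this task's input split

  @dataclass
  class Worker:
      host: str

  def assign_map_task(worker, pending):
      # Prefer a task whose split is stored on the worker's own host (locality).
      for i, t in enumerate(pending):
          if worker.host in t.split_locations:
              return pending.pop(i)
      return pending.pop(0) if pending else None   # otherwise, any pending task

  pending = [MapTask(0, {"hostA"}), MapTask(1, {"hostB", "hostC"})]
  print(assign_map_task(Worker("hostB"), pending).task_id)   # 1: the local split wins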
16 Parallel MapReduce
(Figure: input data is split across many parallel Map workers coordinated by the Master; their results form the partitioned output)
17 WordCount Internals
- Input data is split into M map jobs
- Each map job generates R local partitions
(Figure: map of <doc1, "to be or not to be"> emits <to,1>, <be,1>, <or,1>, <not,1>, <to,1>, <be,1>)
18 WordCount Internals
- Shuffle brings the same partitions to the same reducer
(Figure: each mapper's R local partitions, e.g. <to,1,1>, <be,1>, <not,1>, <or,1>, <do,1>, are routed so that matching partitions from different mappers meet at one reducer)
19 WordCount Internals
- Reduce aggregates sorted key/value pairs
(Figure: reducer inputs after sorting: <be,1,1>, <do,1>, <not,1,1>, <or,1>, <to,1,1>)
20 The importance of the partition function
- partition(k, total partitions) → partition for k
  - e.g. hash(k) mod R
- What is the partition function for sort? (see the range-partition sketch below)
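A hedged sketch of the two choices, assuming string keys: hash-mod-R balances load but destroys key ordering, while a range partitioner (the answer for sort) sends each key range to one reducer so the R output files concatenate into globally sorted order. The split points below are hard-coded for illustration; in practice they come from sampling the keys.

  R = 4

  def hash_partition(key):
      # Python's built-in hash() stands in for a real hash function here.
      return hash(key) % R

  SPLIT_POINTS = ["g", "n", "t"]   # assumed boundaries, e.g. obtained by sampling keys

  def range_partition(key):
      # Smallest partition whose upper boundary exceeds the key.
      for i, boundary in enumerate(SPLIT_POINTS):
          if key < boundary:
              return i
      return R - 1

  print(range_partition("be"), range_partition("or"), range_partition("to"))
  # 0 2 3 -- concatenating partitions 0..3 yields globally sorted keys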
21 Load Balance and Pipelining
- Fine-granularity tasks: many more map tasks than machines
  - Minimizes time for fault recovery
  - Can pipeline shuffling with map execution
  - Better dynamic load balancing
- Often use 200,000 map / 5,000 reduce tasks with 2,000 machines
22 Fault tolerance via re-execution
- On worker failure (sketched below):
  - Re-execute completed and in-progress map tasks (their intermediate output lives on the failed worker's local disk)
  - Re-execute in-progress reduce tasks
  - Task completion is committed through the master
- On master failure:
  - State is checkpointed to GFS; a new master recovers and continues
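Here is a hedged Python sketch of that worker-failure rule. The dictionary "master state" is purely illustrative: completed and running map tasks on the dead worker go back to pending, running reduce tasks do too, but completed reduce tasks stay done because their output already sits in GFS.

  def handle_worker_failure(master_state, failed_worker):
      for info in master_state["map_tasks"].values():
          if info["worker"] == failed_worker and info["state"] in ("done", "running"):
              info.update(state="pending", worker=None)     # intermediate files lost
      for info in master_state["reduce_tasks"].values():
          if info["worker"] == failed_worker and info["state"] == "running":
              info.update(state="pending", worker=None)     # completed reduces stay done

  state = {
      "map_tasks":    {0: {"state": "done",    "worker": "w1"},
                       1: {"state": "running", "worker": "w2"}},
      "reduce_tasks": {0: {"state": "running", "worker": "w1"},
                       1: {"state": "done",    "worker": "w3"}},
  }
  handle_worker_failure(state, "w1")
  print(state["map_tasks"][0]["state"], state["reduce_tasks"][0]["state"])   # pending pending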
23 Avoid stragglers using backup tasks
- Slow workers significantly lengthen completion time
  - Other jobs consuming resources on the machine
  - Bad disks with soft errors transfer data very slowly
  - Weird things: processor caches disabled (!!)
  - An unusually large reduce partition?
- Solution: near the end of a phase, spawn backup copies of the remaining tasks (sketched below)
  - Whichever copy finishes first "wins"
- Effect: dramatically shortens job completion time
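A hedged, single-machine illustration of the backup-task race using two threads: submit the same (simulated) task twice and take whichever copy finishes first. The sleep-based task is a stand-in for a straggling worker, not real MapReduce code.

  import concurrent.futures, random, time

  def task(copy_name):
      time.sleep(random.uniform(0.1, 1.0))   # simulated straggling
      return copy_name

  with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
      futures = [pool.submit(task, "primary"), pool.submit(task, "backup")]
      done, _ = concurrent.futures.wait(
          futures, return_when=concurrent.futures.FIRST_COMPLETED)
      print("winner:", next(iter(done)).result())   # whichever copy finished first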
24 MapReduce Sort Performance
- 1 TB of data (100-byte records) to be sorted
- 1,700 machines
- M = 15,000, R = 4,000
25 MapReduce Sort Performance
When can shuffle start?
When can reduce start?
26 Dryad
- Slides adapted from those of Yuan Yu and Michael
Isard
27 Dryad
- Similar goals to MapReduce
  - Focus on throughput, not latency
  - Automatic management of scheduling, distribution, fault tolerance
- Computations expressed as a graph (a toy graph structure is sketched below)
  - Vertices are computations
  - Edges are communication channels
  - Each vertex has several input and output edges
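To make the graph model concrete, here is a hedged Python sketch of a job graph as plain data: vertices hold a computation, edges are the channels between them. The Vertex class and add_edge helper are illustrative, not Dryad's actual API.

  class Vertex:
      def __init__(self, name, fn):
          self.name, self.fn = name, fn        # fn: the computation this vertex runs
          self.inputs, self.outputs = [], []   # connected vertices (channel endpoints)

  def add_edge(src, dst):
      src.outputs.append(dst)
      dst.inputs.append(src)

  # A two-vertex fragment of the WordCount graph on the next slide:
  count = Vertex("Count", lambda records: None)       # count words in one input
  merge = Vertex("MergeSort", lambda streams: None)   # merge the sorted counts
  add_edge(count, merge)
  print([v.name for v in merge.inputs])   # ['Count']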
28 WordCount in Dryad
(Figure: WordCount expressed as a Dryad graph of Count, Distribute, and MergeSort vertices connected by channels)
29 Why use a dataflow graph?
- Many programs can be represented as a distributed dataflow graph
  - The programmer may not have to know this
  - SQL-like queries: LINQ
- Dryad will run them for you
30 Runtime
- Vertices (V) run arbitrary app code
- Vertices exchange data through
  - files, TCP pipes, etc.
- Vertices communicate with the JM to report status
- Daemon process (D)
  - executes vertices
- Job Manager (JM) consults the name server (NS)
  - to discover available machines
- JM maintains the job graph and schedules vertices
31 Job = Directed Acyclic Graph
(Figure: inputs flow through processing vertices connected by channels (file, pipe, shared memory) to the outputs)
32 Scheduling at JM
- General scheduling rules:
  - A vertex can run anywhere once all its inputs are ready
  - Prefer executing a vertex near its inputs
- Fault tolerance:
  - If A fails, run it again
  - If A's inputs are gone, run the upstream vertices again (recursively; sketched below)
  - If A is slow, run another copy elsewhere and use the output from whichever finishes first
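A hedged sketch of the recursive rule: before re-running a vertex, check that each input channel still exists; a missing input forces its producing vertex to be re-run first. The dict-based vertices and the channel_ok callback are illustrative, not Dryad's internals.

  def reexecute(vertex, channel_ok, run):
      # vertex: {"name": str, "inputs": [upstream vertex dicts]}
      for upstream in vertex["inputs"]:
          if not channel_ok(upstream, vertex):       # is this input's data gone?
              reexecute(upstream, channel_ok, run)   # rebuild it recursively first
      run(vertex)

  a = {"name": "A", "inputs": []}
  b = {"name": "B", "inputs": [a]}
  c = {"name": "C", "inputs": [b]}
  # Simulate losing B's output: C must re-run B (whose own input from A is intact) first.
  reexecute(c, channel_ok=lambda u, v: u["name"] != "B",
            run=lambda v: print("run", v["name"]))
  # run B
  # run C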
33 Advantages of DAG over MapReduce
- Big jobs are more efficient with Dryad
  - MapReduce: a big job runs as >1 MR stages
    - Reducers of each stage write to replicated storage
    - Output of reduce: 2 network copies, 3 disks
  - Dryad: each job is represented as a single DAG
    - Intermediate vertices write to local files
34 Advantages of DAG over MapReduce
- Dryad provides explicit join
  - MapReduce: a mapper (or reducer) needs to read from shared table(s) as a substitute for join
  - Dryad: an explicit join combines inputs of different types
- Dryad Split produces outputs of different types
  - Parse a document, output text and references
35 DAG optimizations: merge tree
36 DAG optimizations: merge tree
37 Dryad optimizations: data-dependent re-partitioning
(Figure: randomly partitioned inputs are sampled to estimate a histogram, then distributed to equal-sized ranges)
38 Dryad example 1: SkyServer query
- 3-way join to find the gravitational lens effect
  - Table U: (objID, color), 11.8 GB
  - Table N: (objID, neighborID), 41.8 GB
- Find neighboring stars with similar colors (a toy version of these joins is sketched below):
  - Join U and N to find T = N.neighborID, U.color where U.objID = N.objID
  - Join U and T to find U.objID where U.objID = T.neighborID and U.color ≈ T.color
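A hedged in-memory sketch of those two joins on toy data; the rows and the color-similarity threshold are invented purely to show the dataflow, and the real query runs over gigabytes with Dryad vertices rather than Python lists.

  U = [(1, 0.30), (2, 0.31), (3, 0.90)]   # (objID, color)
  N = [(1, 2), (1, 3), (2, 1)]            # (objID, neighborID)

  color_of = dict(U)

  # Join U and N: T = (neighborID, U.color) where U.objID = N.objID
  T = [(nbr, color_of[obj]) for obj, nbr in N if obj in color_of]

  # Join U and T: keep U.objID where U.objID = T.neighborID and U.color ≈ T.color
  THRESHOLD = 0.05   # invented notion of "similar color"
  similar = sorted({nbr for nbr, c in T
                    if nbr in color_of and abs(color_of[nbr] - c) < THRESHOLD})
  print(similar)   # [1, 2] -- objects that have a similar-colored neighbor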
39 SkyServer query
41 Dryad example 2: query histogram computation
- Input: log file (n partitions)
- Extract queries from the log partitions
- Re-partition by hash of the query (k buckets)
- Compute a histogram within each bucket (sketched below)
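A hedged single-process sketch of that pipeline: parse queries out of n log partitions, hash-distribute them into k buckets, and count within each bucket. The tab-separated log format and the helper names are assumptions for the example.

  from collections import Counter

  def extract_queries(log_partition):
      # assume each record looks like "timestamp<TAB>query"
      return [line.split("\t", 1)[1] for line in log_partition if "\t" in line]

  def histogram_job(log_partitions, k):
      buckets = [[] for _ in range(k)]
      for part in log_partitions:                 # parse lines (P)
          for q in extract_queries(part):
              buckets[hash(q) % k].append(q)      # hash distribute (D)
      return [Counter(b) for b in buckets]        # count occurrences per bucket (C)

  logs = [["t1\tcats", "t2\tdogs"], ["t3\tcats"]]
  print(histogram_job(logs, k=2))   # two Counters, e.g. one of them holding {'cats': 2}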
42 Naïve histogram topology
(Legend: P = parse lines, D = hash distribute, S = quicksort, C = count occurrences, MS = merge sort)
43 Efficient histogram topology
(Legend: P = parse lines, D = hash distribute, S = quicksort, C = count occurrences, MS = merge sort, M = non-deterministic merge)
(Figure: the job is built from n Q' subgraphs, k T subgraphs, and R vertices; the following slides show how each is composed)
44-49 (Figure build sequence for the efficient histogram topology: each R vertex is MS→C, each T subgraph is MS→C→D, and each Q subgraph is M→P→S→C. Legend: P = parse lines, D = hash distribute, S = quicksort, MS = merge sort, C = count occurrences, M = non-deterministic merge)
50 Final histogram refinement
- 1,800 computers
- 43,171 vertices
- 11,072 processes
- 11.5 minutes