1
Distributed ComputationsMapReduce/Dryad
  • M/R slides adapted from those of Jeff Dean
  • Dryad slides adapted from those of Michael Isard

2
What we've learnt so far
  • Basic distributed systems concepts
  • Consistency (sequential, eventual)
  • Concurrency
  • Fault tolerance (recoverability, availability)
  • What are distributed systems good for?
  • Better fault tolerance
  • Better security?
  • Increased storage/serving capacity
  • Storage systems, email clusters
  • Parallel (distributed) computation (today's topic)

3
Why distributed computations?
  • How long to sort 1 TB on one computer?
  • One computer can read about 60 MB/s from disk
  • Sorting 1 TB takes about a day!
  • Google indexes 100 billion web pages
  • 100 × 10^9 pages × 20 KB/page ≈ 2 PB
  • The Large Hadron Collider is expected to produce 15 PB every year!

4
Solution: use many nodes!
  • Cluster computing
  • Hundreds or thousands of PCs connected by high
    speed LANs
  • Grid computing
  • Hundreds of supercomputers connected by high
    speed net
  • 1000 nodes potentially give 1000X speedup

5
Distributed computations are difficult to program
  • Sending data to/from nodes
  • Coordinating among nodes
  • Recovering from node failure
  • Optimizing for locality
  • Debugging

6
MapReduce
  • A programming model for large-scale computations
  • Process large amounts of input, produce output
  • No side-effects or persistent state (unlike file
    system)
  • MapReduce is implemented as a runtime library
  • automatic parallelization
  • load balancing
  • locality optimization
  • handling of machine failures

7
MapReduce design
  • Input data is partitioned into M splits
  • Map: extract information from each split
  • Each Map produces R partitions
  • Shuffle and sort
  • Bring M partitions (one from each map task) to the same reducer
  • Reduce: aggregate, summarize, filter or transform
  • Output is in R result files

8
More specifically
  • Programmer specifies two methods
  • map(k, v) → <k', v'>
  • reduce(k', <v'>) → <k', v'>
  • All v' with the same k' are reduced together, in order.
  • Usually also specify:
  • partition(k, total partitions) → partition for k
  • often a simple hash of the key
  • allows reduce operations for different k to be parallelized

9
Example: count word frequencies in web pages
  • Input is files with one doc per record
  • Map parses documents into words
  • key = document URL
  • value = document contents
  • Output of map:

"doc1", "to be or not to be"  →  ("to",1), ("be",1), ("or",1), ("not",1), ("to",1), ("be",1)
10
Example: word frequencies (continued)
  • Reduce computes the sum for a key
  • Output of reduce is saved

key = "be", values = (1, 1)  →  ("be", 2)
Final output: ("be", 2), ("not", 1), ("or", 1), ("to", 2)
11
Example: pseudo-code
  • Map(String input_key, String input_value):
      // input_key: document name
      // input_value: document contents
      for each word w in input_value:
        EmitIntermediate(w, "1")
  • Reduce(String key, Iterator intermediate_values):
      // key: a word, same for input and output
      // intermediate_values: a list of counts
      int result = 0
      for each v in intermediate_values:
        result += ParseInt(v)
      Emit(AsString(result))
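The pseudo-code above translates directly into ordinary code. Below is a minimal, single-process Python sketch of the same word count (the names map_fn, reduce_fn and run_word_count are ours, not part of any MapReduce API); it illustrates only the map → shuffle → reduce data flow, not the distributed runtime.

  from collections import defaultdict

  def map_fn(doc_name, doc_contents):
      # Emit (word, 1) for every word in the document.
      for word in doc_contents.split():
          yield word, 1

  def reduce_fn(key, values):
      # Sum all the counts emitted for this word.
      return key, sum(values)

  def run_word_count(documents):
      # "Shuffle": group intermediate pairs by key.
      groups = defaultdict(list)
      for name, contents in documents.items():
          for k, v in map_fn(name, contents):
              groups[k].append(v)
      # Reduce each group of values.
      return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

  print(run_word_count({"doc1": "to be or not to be"}))
  # -> {'be': 2, 'not': 1, 'or': 1, 'to': 2}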

12
MapReduce is widely applicable
  • Distributed grep
  • Document clustering
  • Web link graph reversal
  • Detecting approx. duplicate web pages

13
MapReduce implementation
  • Input data is partitioned into M splits
  • Map: extract information from each split
  • Each Map produces R partitions
  • Shuffle and sort
  • Bring M partitions (one from each map task) to the same reducer
  • Reduce: aggregate, summarize, filter or transform
  • Output is in R result files

14
MapReduce scheduling
  • One master, many workers
  • Input data split into M map tasks (e.g. 64 MB)
  • R reduce tasks
  • Tasks are assigned to workers dynamically
  • Often M = 200,000, R = 4,000, workers = 2,000

15
MapReduce scheduling
  • Master assigns a map task to a free worker
  • Prefers close-by workers when assigning tasks
  • Worker reads task input (often from local disk!)
  • Worker produces R local files containing intermediate k/v pairs
  • Master assigns a reduce task to a free worker
  • Worker reads intermediate k/v pairs from the map workers
  • Worker sorts them and applies the user's Reduce op to produce the output (a toy sketch of this flow follows)
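As a concrete picture of the shuffle described above, here is a toy single-process Python sketch (all names are ours): each simulated map worker buckets its intermediate pairs into R local partitions with a hash partition function, and each simulated reduce task r pulls partition r from every map output, merges and sorts it, and applies the reduce operation.

  from collections import defaultdict

  R = 2  # number of reduce tasks

  def map_task(doc, R):
      # One map task: emit (word, 1) pairs and bucket them into R local partitions.
      local = [defaultdict(list) for _ in range(R)]
      for word in doc.split():
          local[hash(word) % R][word].append(1)
      return local  # stands in for the R local files of this worker

  map_outputs = [map_task(d, R) for d in ("to be or not to be", "to do or not to do")]

  for r in range(R):
      # Reduce task r: fetch partition r from every map worker, merge, sort, reduce.
      merged = defaultdict(list)
      for out in map_outputs:
          for k, vs in out[r].items():
              merged[k].extend(vs)
      for k in sorted(merged):
          print("reducer", r, ":", k, sum(merged[k]))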

16
Parallel MapReduce
(diagram: input data is split across parallel Map workers; the Master coordinates task assignment; the shuffled intermediate data is reduced into the partitioned output)
17
WordCount Internals
  • Input data is split into M map jobs
  • Each map job generates R local partitions

"doc1", "to be or not to be"  →  ("to",1), ("be",1), ("or",1), ("not",1), ("to",1), ("be",1)
18
WordCount Internals
  • Shuffle brings same partitions to same reducer

(diagram: each map task produces R local partitions, e.g. {("be",1), ("to",1,1)} and {("not",1), ("or",1)}; the shuffle routes the matching partition from every map task to the same reducer)
19
WordCount Internals
  • Reduce aggregates sorted key/value pairs

(diagram: the reducers receive sorted groups such as ("be",1,1), ("do",1), ("not",1,1), ("or",1), ("to",1,1) and aggregate each group)
20
The importance of the partition function
  • partition(k, total partitions) → partition for k
  • e.g. hash(k) mod R
  • What is the partition function for sort? (see the sketch below)
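For sort, a hash partition would scatter adjacent keys across reducers; the partition function instead has to assign contiguous key ranges so that reducer i holds only keys smaller than those of reducer i+1, and concatenating the R sorted outputs gives a globally sorted result. A minimal sketch, assuming the R-1 split points have already been chosen (in practice by sampling the input, much like the data-dependent re-partitioning shown later for Dryad):

  import bisect

  def make_range_partitioner(split_points):
      # split_points: sorted list of R-1 keys separating the R ranges.
      def partition(key, num_partitions):
          assert len(split_points) == num_partitions - 1
          return bisect.bisect_right(split_points, key)
      return partition

  part = make_range_partitioner(["g", "p"])  # 3 ranges: < "g", ["g","p"), >= "p"
  print(part("apple", 3), part("kiwi", 3), part("zebra", 3))  # 0 1 2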

21
Load Balance and Pipelining
  • Fine-granularity tasks: many more map tasks than machines
  • Minimizes time for fault recovery
  • Can pipeline shuffling with map execution
  • Better dynamic load balancing
  • Often use 200,000 map/5000 reduce tasks w/ 2000
    machines

22
Fault tolerance via re-execution
  • On worker failure:
  • Re-execute completed and in-progress map tasks (see the sketch below)
  • Re-execute in-progress reduce tasks
  • Task completion is committed through the master
  • On master failure:
  • State is checkpointed to GFS; a new master recovers and continues
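A sketch of the bookkeeping this policy implies (the data structure is our own, not from the MapReduce paper): when a worker dies, its in-progress tasks and its completed map tasks go back to the idle pool, because map output sits on the failed machine's local disk, while a completed reduce task's output already lives in the global file system and stays done.

  def handle_worker_failure(tasks, failed_worker):
      # tasks: dict task_id -> {"type": "map" or "reduce", "state": ..., "worker": ...}
      for t in tasks.values():
          if t["worker"] != failed_worker:
              continue
          if t["type"] == "map" or t["state"] == "in_progress":
              # Lost local map output or lost in-progress work: reschedule.
              t["state"], t["worker"] = "idle", None
          # A completed reduce task's output is in GFS, so it stays completed.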

23
Avoiding stragglers using backup tasks
  • Slow workers significantly lengthen completion time
  • Other jobs consuming resources on the machine
  • Bad disks with soft errors transfer data very slowly
  • Weird things: processor caches disabled (!!)
  • An unusually large reduce partition?
  • Solution: near the end of a phase, spawn backup copies of tasks (see the sketch below)
  • Whichever copy finishes first "wins"
  • Effect: dramatically shortens job completion time
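A minimal sketch of the backup-task idea (it says nothing about Google's actual scheduler): near the end of a phase, submit a second copy of each still-running task and take whichever copy finishes first.

  import concurrent.futures as cf
  import random, time

  def task(task_id, copy):
      # Simulate a task whose speed depends on the machine it landed on.
      time.sleep(random.uniform(0.1, 1.0))
      return f"task {task_id} finished by its {copy} copy"

  with cf.ThreadPoolExecutor(max_workers=4) as pool:
      primary = pool.submit(task, 7, "primary")
      backup = pool.submit(task, 7, "backup")   # speculative duplicate
      done, _ = cf.wait([primary, backup], return_when=cf.FIRST_COMPLETED)
      print(next(iter(done)).result())          # whichever copy "wins"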

24
MapReduce Sort Performance
  • 1 TB of data (100-byte records) to be sorted
  • 1,700 machines
  • M = 15,000, R = 4,000

25
MapReduce Sort Performance
When can shuffle start?
When can reduce start?
26
Dryad
  • Slides adapted from those of Yuan Yu and Michael
    Isard

27
Dryad
  • Similar goals to MapReduce
  • focus on throughput, not latency
  • Automatic management of scheduling, distribution, fault tolerance
  • Computations are expressed as a graph (see the sketch after this list)
  • Vertices are computations
  • Edges are communication channels
  • Each vertex can have several input and output edges
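A toy Python sketch of this graph model (all names are ours, not Dryad's API): vertices wrap arbitrary code, edges are channels, and a vertex may consume several inputs. A real runtime would execute the vertices on many machines and stream data over files or TCP pipes instead of calling functions in one process.

  class Vertex:
      def __init__(self, name, fn):
          self.name, self.fn = name, fn
          self.inputs = []  # upstream vertices (incoming channels)

  def run(outputs):
      # Execute the DAG by pulling each vertex's inputs before running it.
      cache = {}
      def value(v):
          if v not in cache:
              cache[v] = v.fn(*[value(u) for u in v.inputs])
          return cache[v]
      return [value(v) for v in outputs]

  # Tiny example: two "parse" vertices feeding one "count" vertex.
  p1 = Vertex("parse1", lambda: "to be or not".split())
  p2 = Vertex("parse2", lambda: "to be".split())
  count = Vertex("count", lambda a, b: len(a) + len(b))
  count.inputs = [p1, p2]
  print(run([count]))  # [6]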

28
WordCount in Dryad
(diagram: each input partition is processed by a Count Word(n) vertex, the counts are hash-distributed by word (Distribute Word(n)), merge-sorted (MergeSort Word(n)), and summed by a final Count Word(n) vertex)
29
Why use a dataflow graph?
  • Many programs can be represented as a distributed
    dataflow graph
  • The programmer may not have to know this
  • SQL-like queries: LINQ
  • Dryad will run them for you

30
Runtime
  • Vertices (V) run arbitrary app code
  • Vertices exchange data through files, TCP pipes, etc.
  • Vertices communicate with the JM to report status
  • Daemon process (D) executes vertices
  • Job Manager (JM) consults the name server (NS) to discover available machines
  • JM maintains the job graph and schedules vertices

31
Job = directed acyclic graph

(diagram: inputs at the bottom feed processing vertices connected by channels (file, pipe, shared memory), producing the outputs at the top)
32
Scheduling at JM
  • General scheduling rules:
  • A vertex can run anywhere once all its inputs are ready
  • Prefer executing a vertex near its inputs
  • Fault tolerance:
  • If A fails, run it again
  • If A's inputs are gone, run the upstream vertices again (recursively; see the sketch below)
  • If A is slow, run another copy elsewhere and use the output from whichever finishes first
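A sketch of that re-execution rule, assuming the graph is held as a dict mapping each vertex to its upstream vertices and we can test whether a vertex's output is still available:

  def rerun(vertex, upstream, output_available, execute):
      # If the vertex's inputs are gone, recursively re-run their producers first.
      for u in upstream[vertex]:
          if not output_available(u):
              rerun(u, upstream, output_available, execute)
      execute(vertex)  # all inputs are now ready

  # Hypothetical chain C -> B -> A where B's output was lost with a failed machine:
  upstream = {"A": ["B"], "B": ["C"], "C": []}
  lost = {"B"}
  rerun("A", upstream, lambda v: v not in lost, lambda v: print("running", v))
  # -> running B, then running A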

33
Advantages of DAG over MapReduce
  • Big jobs are more efficient with Dryad
  • MapReduce: a big job runs as >1 MR stages
  • the reducers of each stage write to replicated storage
  • output of a reduce: 2 network copies, 3 disks
  • Dryad: each job is represented as a single DAG
  • intermediate vertices write to local files

34
Advantages of DAG over MapReduce
  • Dryad provides explicit join
  • MapReduce: a mapper (or reducer) needs to read from shared table(s) as a substitute for a join
  • Dryad: an explicit join combines inputs of different types
  • Dryad: Split produces outputs of different types
  • e.g. parse a document, output text and references

35
DAG optimizations: merge tree
36
DAG optimizations: merge tree
Dryad optimizations: data-dependent re-partitioning

(diagram: randomly partitioned inputs are sampled to estimate a key histogram, and the data is then distributed into equal-sized ranges)
38
Dryad example 1: SkyServer query
  • 3-way join to find the gravitational lens effect
  • Table U (objID, color): 11.8 GB
  • Table N (objID, neighborID): 41.8 GB
  • Find neighboring stars with similar colors:
  • Join U and N to find T = (N.neighborID, U.color) where U.objID = N.objID
  • Join U and T to find U.objID where U.objID = T.neighborID and U.color ≈ T.color (see the toy example below)
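To make the query concrete, here is what the two joins compute, written as plain Python over toy in-memory tables (the color-similarity threshold d is our own placeholder; the real query streams tens of GB through a Dryad graph rather than fitting in memory):

  U = [(1, 0.30), (2, 0.31), (3, 0.90)]   # (objID, color)
  N = [(1, 2), (1, 3), (2, 1)]            # (objID, neighborID)
  d = 0.05                                # similarity threshold (assumed)

  color = dict(U)
  # Join U and N on objID: T carries each neighborID plus the object's color.
  T = [(nbr, color[obj]) for obj, nbr in N if obj in color]
  # Join U and T on U.objID = T.neighborID, keeping pairs with similar colors.
  result = [nbr for nbr, c in T if nbr in color and abs(color[nbr] - c) < d]
  print(result)  # [2, 1]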

39
SkyServer query
40
(figure-only slide)
41
Dryad example 2: query histogram computation
  • Input: a log file (n partitions)
  • Extract queries from the log partitions
  • Re-partition by hash of the query (k buckets)
  • Compute a histogram within each bucket (see the sketch below)
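A single-process Python sketch of those three steps (the log format, query-extraction logic, and bucket count are assumptions made for illustration):

  from collections import Counter

  log_partitions = [
      "GET /search?q=dryad\nGET /search?q=mapreduce",
      "GET /search?q=dryad\nGET /search?q=sort",
  ]                                        # n input partitions
  k = 2                                    # number of buckets

  # 1. Extract queries from each log partition.
  queries = [line.split("q=")[1] for part in log_partitions
             for line in part.splitlines() if "q=" in line]
  # 2. Re-partition by hash of the query into k buckets.
  buckets = [[] for _ in range(k)]
  for q in queries:
      buckets[hash(q) % k].append(q)
  # 3. Compute a histogram within each bucket.
  for b in buckets:
      print(Counter(b))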

42
Naïve histogram topology
(P = parse lines, D = hash distribute, S = quicksort, C = count occurrences, MS = merge sort)
43
Efficient histogram topology
(P = parse lines, D = hash distribute, S = quicksort, C = count occurrences, MS = merge sort, M = non-deterministic merge)

(diagram: each Q', T, and R vertex is itself a pipeline of the operators above, replicated n and k ways across the stages)
44
(diagram, animated across slides 44-49: the refined job graph in execution. Each R vertex runs MS then C; each T vertex runs MS, C, then D; each Q vertex runs M, P, S, then C.)

(P = parse lines, D = hash distribute, S = quicksort, MS = merge sort, C = count occurrences, M = non-deterministic merge)
50
Final histogram refinement
(1,800 computers; 43,171 vertices; 11,072 processes; 11.5 minutes)