Principles of High Performance Computing ICS 632

1
Principles of High Performance Computing (ICS 632)
  • Notions of Scheduling

2
Scheduling
  • Scheduling is the art of assigning work to
    resources throughout time in a way that optimizes
    some metric of performance
  • e.g., factory workers to machines so that the
    largest number of cars can be produced in a day
  • e.g., processors to computations and disks to
    data items so that a given application can finish
    before a deadline
  • e.g., packets to network links so that overall
    network throughput is maximized
  • It is a very broad field, used in many domains
  • Many scheduling problems are known to be very
    difficult (i.e., intractable)

3
Scheduling
  • In the area of high performance computing, most
    scheduling problems are about task graphs
  • Task graphs are graphs of tasks in which edges
    correspond to precedence constraints

4
Where do DAGs come from?
  • Consider a (lower) triangular linear system solve
  • What you would need to do after an LU
    factorization

Ax = b
  • Simple Algorithm
  • for (i = 0; i < n; i++)
  •   xi = bi / ai,i
  •   for (j = i+1; j < n; j++)
  •     bj = bj - aj,i * xi

5
Where do DAGs come from?
  • Consider a (lower) triangular linear system solve
  • What you would need to do after an LU
    factorization

Ax = b
  • Simple Algorithm
  • for (i = 0; i < n; i++)
  •   Ti,i:  xi = bi / ai,i
  •   for (j = i+1; j < n; j++)
  •     Ti,j:  bj = bj - aj,i * xi

6
Tasks, Dependencies, etc.
  • for (i = 0; i < n; i++)
  •   Ti,i:  xi = bi / ai,i
  •   for (j = i+1; j < n; j++)
  •     Ti,j:  bj = bj - aj,i * xi
  • All tasks Ti,* are executed at iteration i of the
    outer loop
  • There is a simple sequential order of the tasks
  • T0,0 < T0,1 < ... < T0,n-1 < T1,1 < T1,2 < ... <
    T1,n-1 < ...
  • Of course, when considering a parallel execution,
    one tries to find independent tasks
  • To see if tasks are independent one must examine
    their input (In) and their output (Out)

7
Tasks, Dependencies, etc.
  • for (i = 0; i < n; i++)
  •   Ti,i:  xi = bi / ai,i
  •   for (j = i+1; j < n; j++)
  •     Ti,j:  bj = bj - aj,i * xi
  • Input and Output
  • In(Ti,i) = { bi, ai,i }
  • Out(Ti,i) = { xi }
  • In(Ti,j) = { bj, aj,i, xi } for j > i
  • Out(Ti,j) = { bj } for j > i
  • Bernstein Conditions
  • T and T' are independent if all 3 conditions are
    met
  • In(T) ∩ Out(T') = ∅
  • Out(T) ∩ In(T') = ∅
  • Out(T) ∩ Out(T') = ∅
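  • As a quick check of these conditions (an added worked example, using
    the In/Out sets above): T0,1 and T0,2 are independent, since
    Out(T0,1) ∩ In(T0,2) = {b1} ∩ {b2, a2,0, x0} = ∅,
    In(T0,1) ∩ Out(T0,2) = ∅, and Out(T0,1) ∩ Out(T0,2) = {b1} ∩ {b2} = ∅;
    on the other hand, T0,1 depends on T0,0, since
    Out(T0,0) ∩ In(T0,1) = {x0} ≠ ∅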

8
Task Graph
  • for (i = 0; i < n; i++)
  •   Ti,i:  xi = bi / ai,i
  •   for (j = i+1; j < n; j++)
  •     Ti,j:  bj = bj - aj,i * xi
  • It is easy to see that
  • for all i, all Ti,j are independent of each other
    for j > i
  • for all i, all Ti,j depend on Ti,i, for j > i
  • for all i, all Ti,j depend on Ti-1,j for j > i
    and i > 0
  • Hence the task graph

9
Task Graph
(figure: the task graph for n = 6, with nodes T0,0 through T5,5; each
diagonal task Ti,i feeds the tasks Ti,j of its row, and each Ti,j feeds
Ti+1,j)
  • for (i = 0; i < n; i++)
  •   Ti,i:  xi = bi / ai,i
  •   for (j = i+1; j < n; j++)
  •     Ti,j:  bj = bj - aj,i * xi
10
More task graphs
  • The previous task graph comes from a low-level
    analysis of the code
  • It probably makes little sense to do a parallel
    implementation with MPI at such a fine task
    granularity
  • It can totally make sense with OpenMP
  • Such task graphs can also be used by compilers to
    do code optimization by exploiting multiple
    functional units, pipelined functional units,
    etc.
  • With blocking, these tasks could become MPI
    tasks
  • Other task graphs reflect how the application
    was built

11
Scientific Workflows
  • A popular way in which many scientific
    applications are constructed is as workflows
  • A scientist conceptually drags and drops
    computational kernels and connects their
    inputs and outputs
  • The result is a DAG (actually more general than a
    DAG) that does something useful
  • Example application: Montage
  • Produces a mosaic of the sky
  • Based on multiple data sources
  • Given angle, coordinates, size, etc.
  • Tens of thousands of tasks
  • Example: M101 galaxy images

12
Sample Montage DAG
13
Many levels of parallelism
  • A scientific workflow

14
Many levels of parallelism
  • A scientific workflow

15
Many levels of parallelism
  • A scientific workflow

OpenMP Threads
16
Back to Basics
  • Definition: A task system is a directed graph G =
    (V,E)
  • V: set of vertices
  • E: set of edges
  • (u,v), such that both u and v are in V
  • denotes precedence: task u must be executed
    before task v
  • An (integer) weight w may be assigned to each
    vertex
  • e.g., computation duration on some reference
    platform
  • A schedule is a mapping of each vertex to
    available resources so that precedence
    constraints are not violated
  • a resource can only run one task at a time
  • otherwise consider it to be multiple resources
  • There is a lot of obvious and intricate formalism
    we can use to describe all this rigorously, and
    I'll try to stay away from it

17
Gantt Chart with 3 processors
(figure: a Gantt chart, with time on the x-axis and the 3 processors on
the y-axis, onto which tasks of weights 10, 5, 2, 4, 4, 7, 8, and 2 are
placed)
18
Acyclic Graphs
  • Theorem: There exists a valid schedule if and
    only if there is no cycle in the graph
  • Proof: a little less obvious than one would think
  • Uses formalisms we haven't really introduced
  • But intuitively it's pretty clear that the
    theorem should hold true
  • Therefore we only consider DAGs
  • Directed Acyclic Graphs

19
Makespan
  • The makespan is defined as the overall execution
    time

(figure: a Gantt chart with the makespan, i.e., the completion time of
the last task, indicated)
20
Lower bound on the makespan
  • Let F be a path in the DAG
  • That is, a sequence of vertices such that the second
    vertex depends on the first vertex, the third
    vertex depends on the second vertex, etc.
  • Define the length of a path as the sum of the
    vertex weights along the path
  • Then, for each possible path, the makespan is at
    least as long as the length of that path
  • Therefore, the makespan is at least the length
    of the longest path
  • The longest path is called the critical path
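As an illustration, here is a minimal C sketch (not from the slides) of
computing this lower bound, i.e., the critical path length. It assumes
the DAG is stored as an adjacency matrix with vertices already numbered
in topological order; the weights and edges are made up for the example.

#include <stdio.h>

#define N 5
int dep[N][N];                 /* dep[u][v] = 1 if v depends on u */
int w[N] = {2, 4, 3, 1, 5};    /* hypothetical task weights */

/* Longest (critical) path length; vertices are assumed to be numbered in
 * topological order, so every predecessor of v has a smaller index. */
int critical_path_length(void) {
    int len[N];                /* weight of the heaviest path ending at each task */
    int best = 0;
    for (int v = 0; v < N; v++) {
        len[v] = w[v];
        for (int u = 0; u < v; u++)
            if (dep[u][v] && len[u] + w[v] > len[v])
                len[v] = len[u] + w[v];
        if (len[v] > best)
            best = len[v];
    }
    return best;               /* a lower bound on the makespan */
}

int main(void) {
    dep[0][1] = dep[0][2] = dep[1][3] = dep[2][3] = dep[3][4] = 1;
    printf("critical path length = %d\n", critical_path_length()); /* 2+4+1+5 = 12 */
    return 0;
}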

21
Two Scheduling Problems
  • Pb(p): Given a DAG G = (V,E), find the schedule
    that achieves the smallest makespan using p
    processors
  • MSOpt(p): that smallest makespan
  • Pb(inf): Given a DAG G = (V,E), find the schedule
    that achieves the smallest makespan using an
    unbounded number of processors
  • MSOpt(inf): that smallest makespan

22
Solving Pb(inf)
  • If one has an infinite number of processors,
    obtaining the optimal schedule is actually very
    simple
  • Assign each task to a different processor
  • Start a task whenever it is ready (i.e., when
    its parent tasks have completed)
  • Sketch of a proof
  • No (unnecessary) idle time occurs between tasks
    on any path
  • Consider the tasks on the critical path
  • The last task of the DAG is on the critical path
  • If not, add a dummy task
  • The makespan is equal to the length of the
    critical path
  • Therefore it's optimal
  • Another obvious proof that may look complex in
    its full formal version

23
Solving Pb(p)
  • Associated decision problem X: Given a DAG, p
    processors, and a time T, can we execute the
    whole DAG before time T?
  • Theorem: Problem X is NP-complete
  • Proof
  • It's in NP (a guessed solution can be checked in
    polynomial time)
  • Reduction from 2-PARTITION: Given n positive
    integers a1, ..., an, can we find I, a subset
    of {1,...,n}, such that the sum of the ai for i
    in I equals the sum of the ai for i not in I?

24
Reduction from 2-PARTITION
  • Consider an instance of 2-PARTITION
  • a1, ..., an
  • We construct an instance of Pb(p) as follows
  • n vertices v1, ..., vn
  • w(vi) = ai
  • no dependences
  • p = 2 processors
  • T = (1/2) (a1 + ... + an)
  • If the instance of 2-PARTITION has a solution,
    then the instance of Pb(p) has a solution
  • If the instance of Pb(p) has a solution, then the
    instance of 2-PARTITION has a solution
  • The reduction is in polynomial time
  • Therefore, Pb(p) is NP-complete
  • Even when there are only 2 processors and all
    tasks are independent!!!
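Putting the construction together in one formula (an added restatement
of the bullets above; the threshold T is half the total task weight):

    \exists\, I \subseteq \{1,\dots,n\} :\;
    \sum_{i \in I} a_i = \sum_{i \notin I} a_i
    \quad\Longleftrightarrow\quad
    \text{the independent tasks } v_1,\dots,v_n \text{ with } w(v_i) = a_i
    \text{ can be scheduled on 2 processors with makespan} \le
    T = \tfrac{1}{2}\sum_{i=1}^{n} a_i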

25
Scheduling Independent Tasks
(figure: independent tasks packed as boxes onto the processors of a
Gantt chart)
26
Other complexity results
  • There are many such complexity results
  • Many of them come down to this notion of packing
    boxes in a Gantt Chart
  • 2-PARTITION
  • KNAPSACK
  • BINPACKING
  • etc.
  • For instance
  • Scheduling a DAG on 2 processors is NP-complete
    even if all tasks have only weight 1 or 2
  • We'll see a taxonomy of scheduling problems when
    we talk about batch scheduling in the next set of
    slides

27
What about communications?
  • If the processors are on a network (as opposed to
    in a shared memory machine), then we need to
    account for the cost of communication of data
    among tasks
  • Each edge in the DAG now has a weight
  • e.g., data transfer time on a reference network
  • Common assumption: if two tasks are scheduled on
    the same processor the edge weight is ignored
  • Or at least it's made very small
  • There is now a notion of network topology as
    well, which may be regular or irregular
  • Accounting for communication costs makes things
    much more complicated
  • Pb(inf) becomes NP-complete!

28
So where are we?
  • Scheduling is an area rife with NP-complete
    problems
  • If you work on a scheduling problem, chances are
    high that it is intractable
  • What do we do when we face intractable problems?
  • Try to come up with good approximation algorithms
  • i.e., guaranteed to be a factor X from optimal,
    where X is not too large
  • Try to come up with heuristics that do a decent
    job in practice
  • many have no guarantees, or they all have the
    same loose guarantee

29
List Scheduling Heuristics
  • A list scheduling algorithm works as follows
  • At each instant, check whether at least one of the
    processors is idle
  • If so, pick one of the ready tasks, if any
  • Assign it to one of the idle processors
  • Repeat
  • This is a greedy algorithm that amounts to
    aggressively limiting idle time
  • Of course, the two questions are
  • How do we prioritize ready tasks when there are
    more than one to choose from?
  • Which host do we assign that task to if there are
    multiple idle hosts?
  • Based on answers to these questions, one can have
    a worse or a better heuristic

30
Guarantee
  • Here is a very powerful result regarding list
    scheduling heuristics in general
  • Theorem: Consider a DAG G to schedule onto p
    identical processors, with no communication. Let
    MSOpt(p) be the optimal makespan. Let MS(S,p) be
    the makespan achieved by ANY list scheduling
    heuristic S. We have
  • MS(S,p) ≤ (2 - 1/p) MSOpt(p)
  • In other terms, a list heuristic is at worst a
    factor of 2 away from the optimal
  • Note that there can still be good and bad
    heuristics
  • But this result says that "bad" isn't that awful
  • Such results are always intriguing because we
    don't know the optimal schedule (the problem is
    NP-complete), but we can say how far from it we
    are!
  • Let's look at the sketch of the proof

31
Sketch of the proof
  • Consider a task that ends last

32
Sketch of the proof
  • Consider the latest time before that task's
    beginning such that at least one processor is
    idle at that time

33
Sketch of the proof
  • Why isn't the red task running earlier?
  • It has to be because one of its parents is
    running
  • Otherwise, it wouldn't be list scheduling!

34
Sketch of the proof
  • Let's look at the task's parent

35
Sketch of the proof
  • And at the latest time before the parent's
    beginning at which there is an idle processor
  • We ask the same question as before

36
Sketch of the proof
  • Repeat the process

37
Sketch of the proof
  • And again

38
Sketch of the proof
  • In the end we have found a path in the DAG such
    that there cannot be any idle processor when the
    tasks on this path are not running

39
Sketch of the proof
  • In the end we have found a path in the DAG such
    that there cannot be any idle processor when the
    tasks on this path are not running
  • Let L be the length of that path
  • The most idle time that can occur is when ALL
    processors, except the one executing the current
    path task, are idle whenever a task on the path
    is running
  • Let Idle be the total amount of processor idle
    time
  • We have Idle ≤ (p-1) L

40
Counting the Boxes
  • If we add up all the boxes (white and gray)
    together, we get p × MS(S,p)
  • The area of the big rectangle
  • The white boxes correspond to Idle
  • The gray boxes correspond to the sequential
    execution time, Seq

41
Counting the Boxes
  • We have p × MS(S,p) = Idle + Seq
  • We had before that Idle ≤ (p-1) L
  • But L ≤ MSOpt(p)
  • And MSOpt(p) ≥ Seq / p
  • Therefore Seq ≤ p × MSOpt(p)
  • We obtain that
  • p × MS(S,p) ≤ (p-1) MSOpt(p) + p × MSOpt(p)
  • which means that MS(S,p) ≤ (2 - 1/p) MSOpt(p)
  • The theorem is proven!
  • In fact, it can be proven that this is the best
    bound!

42
List Scheduling Heuristics
  • A typical scheduling algorithm
  • while tasks left to schedule
  • Determine the set of ready tasks
  • Pick one of the ready tasks
  • Pick one of the available hosts
  • Assign the task to the host
  • end while
  • This algorithm works offline, placing every task
    into a slot of a Gantt chart
  • Once all tasks have been assigned to processors
    throughout time, then one can just follow the
    schedule
  • Each processor computes a subset of the tasks in
    some order

43
Independent Tasks
  • Assume that there are no edges in the DAG
  • Then any task can be scheduled at any time at
    which a processor is idle
  • The algorithm becomes
  • Compute a priority for each task
  • while tasks left to schedule
  • pick the task with the highest priority
  • schedule it on the first available host
  • end while
  • Remaining question: how do we define the priority?

44
Independent Tasks
  • One probably reasonable idea is to give higher
    priority to the longer tasks

45
Independent Tasks
  • One possibility is to give higher priority to
    the longer tasks

(figure: a Gantt chart of 10 tasks, numbered 0 through 9, scheduled
longest-first on 3 processors)
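To make the rule concrete, here is a minimal C sketch (not from the
slides) of list scheduling for independent tasks on identical
processors, with "longest task first" as the priority; the task
durations are made up for the example.

#include <stdio.h>

#define NTASKS 10
#define NPROCS 3

int main(void) {
    int weight[NTASKS] = {3, 2, 8, 1, 4, 7, 9, 2, 5, 6}; /* hypothetical durations */
    int order[NTASKS];
    double finish[NPROCS] = {0};   /* time at which each processor becomes idle */

    /* priority = task length: sort task indices by decreasing weight */
    for (int i = 0; i < NTASKS; i++) order[i] = i;
    for (int i = 0; i < NTASKS; i++)
        for (int j = i + 1; j < NTASKS; j++)
            if (weight[order[j]] > weight[order[i]]) {
                int tmp = order[i]; order[i] = order[j]; order[j] = tmp;
            }

    /* list scheduling: each task goes to the processor that is idle first */
    for (int i = 0; i < NTASKS; i++) {
        int t = order[i], best = 0;
        for (int q = 1; q < NPROCS; q++)
            if (finish[q] < finish[best]) best = q;
        printf("task %d (weight %d) -> processor %d at time %.0f\n",
               t, weight[t], best, finish[best]);
        finish[best] += weight[t];
    }

    double makespan = 0;
    for (int q = 0; q < NPROCS; q++)
        if (finish[q] > makespan) makespan = finish[q];
    printf("makespan = %.0f\n", makespan);
    return 0;
}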
46
Heterogeneous Processors
  • What if not all processors are identical?
  • Heterogeneous compute speeds
  • The algorithm can be modified as follows
  • while tasks left to schedule
  • for all unscheduled tasks Ti
  • for all hosts Hj
  • compute the completion time of Ti on Hj:
    CTi,j
  • end for
  • compute the priority Pi as a function of all
    the CTi,j
  • end for
  • pick task Tk with the best Pk
  • pick host Hl that minimizes CTk,l
  • schedule task Tk on host Hl
  • end while

47
Heterogeneous Processors
  • What if not all processors are identical?
  • Heterogeneous compute speeds
  • The algorithm can be modified as follows
  • while tasks left to schedule
  • for all unscheduled tasks Ti
  • for all hosts Hj
  • compute the completion time of Ti on Hj:
    CTi,j
  • end for
  • compute the priority Pi as a function of all
    the CTi,j
  • end for
  • pick task Tk with the best Pk
  • pick host Hl that minimizes CTk,l
  • schedule task Tk on host Hl
  • end while

Priority computation inside the loop (dynamic
priorities)
48
Two parameters
  • while tasks left to schedule
  • for all unscheduled tasks Ti
  • for all hosts Hj
  • compute the completion time of Ti on Hj:
    CTi,j
  • end for
  • compute the priority Pi as a function of all
    the CTi,j
  • end for
  • pick task Tk with the best Pk
  • pick host Hl that minimizes CTk,l
  • schedule task Tk on host Hl
  • end while

The two parameters: how the priority Pi is defined, and what "best" means
49
Two parameters
  • while tasks left to schedule
  • for all unscheduled tasks Ti
  • for all hosts Hj
  • compute the completion time of Ti on Hj:
    CTi,j
  • end for
  • compute the priority Pi as a function of all
    the CTi,j
  • end for
  • pick task Tk with the best Pk
  • pick host Hl that minimizes CTk,l
  • schedule task Tk on host Hl
  • end while

Candidate priorities: the min of the CTi,j, the max of the CTi,j, or the
difference between the two smallest (i.e., best) CTi,j; "best" is then
the min or the max of the Pi
50
List Scheduling
  • MinMin (aggressively pick the task that can be
    done soonest)
  • for each task T pick the host H that achieves the
    smallest CT for task T
  • pick the task with the smallest such CT
  • schedule T on H
  • MaxMin (pick the largest tasks first)
  • for each task T pick the host H that achieves the
    smallest CT for task T
  • pick the task with the largest such CT
  • schedule T on H
  • Sufferage (pick the task that would suffer the
    most if not picked)
  • for each task T pick the host H that achieves the
    smallest CT for task T
  • for each task T pick the host H' that achieves
    the second smallest CT' for task T
  • pick the task with the largest (CT' - CT) value
  • schedule T on H
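Here is a minimal C sketch of MinMin for independent tasks on
heterogeneous machines (the variable names and data layout are my own,
not from the slides); it uses the same 3x3 execution-time matrix as the
example on the next slides, so it reproduces the makespan of 27.

#include <stdio.h>

#define NT 3   /* tasks */
#define NH 3   /* machines */

int main(void) {
    double exec[NH][NT] = { {10, 24, 23},
                            {16,  8, 30},
                            {70, 12, 27} };   /* exec[h][t]: time of task t on machine h */
    double ready[NH] = {0};                   /* time at which each machine becomes free */
    int done[NT] = {0};
    double makespan = 0;

    for (int step = 0; step < NT; step++) {
        int bestT = -1, bestH = -1;
        double bestCT = 0;
        /* for each unscheduled task, find its minimum completion time,
         * then pick the task whose minimum completion time is smallest */
        for (int t = 0; t < NT; t++) {
            if (done[t]) continue;
            int h0 = 0;
            for (int h = 1; h < NH; h++)
                if (ready[h] + exec[h][t] < ready[h0] + exec[h0][t]) h0 = h;
            double ct = ready[h0] + exec[h0][t];
            if (bestT < 0 || ct < bestCT) { bestT = t; bestH = h0; bestCT = ct; }
        }
        printf("schedule T%d on H%d, completes at %.0f\n", bestT + 1, bestH + 1, bestCT);
        ready[bestH] = bestCT;
        done[bestT] = 1;
        if (bestCT > makespan) makespan = bestCT;
    }
    printf("makespan = %.0f\n", makespan);
    return 0;
}

Changing the outer selection to keep the task whose smallest completion
time is largest turns the same loop into MaxMin.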

51
Heterogeneity?
  • Uniform heterogeneity: if task A takes time TA
    and task B takes time TB on a processor p, then
    task A takes time α·TA and task B takes time α·TB
    on another processor p', for a factor α that
    depends only on the two processors, for all tasks
    and processors
  • Otherwise we have non-uniform heterogeneity

(figure: two example execution-time matrices, one uniform and one
non-uniform)
52
Example (MinMin)
  • 3 tasks, 3 machines
  • Execution times (rows: machines H1, H2, H3; columns: tasks T1, T2, T3):
        T1  T2  T3
    H1  10  24  23
    H2  16   8  30
    H3  70  12  27
  • MinMin algorithm
  • P1=10, P2=8, P3=23

53
Example (MinMin)
  • 3 tasks, 3 machines (same execution-time matrix as above)
  • MinMin algorithm
  • P1=10, P2=8, P3=23
  • Pick T2, schedule it on H2

54
Example (MinMin)
  • 3 tasks, 3 machines (same execution-time matrix as above)
  • MinMin algorithm
  • P1=10, P2=8, P3=23
  • Pick T2, schedule it on H2
  • Update matrix
  • Updated completion times (T2 removed; H2 is busy until time 8):
        T1  T3
    H1  10  23
    H2  24  38
    H3  70  27
55
Example (MinMin)
  • 3 tasks, 3 machines (same execution-time matrix as above)
  • MinMin algorithm
  • P1=10, P2=8, P3=23
  • Pick T2, schedule it on H2
  • Update matrix (updated completion-time matrix as above)
  • P1=10, P3=23
56
Example (MinMin)
  • 3 tasks, 3 machines (same execution-time matrix as above)
  • MinMin algorithm
  • P1=10, P2=8, P3=23
  • Pick T2, schedule it on H2
  • Update matrix (updated completion-time matrix as above)
  • P1=10, P3=23
  • Pick T1, schedule it on H1
57
Example (MinMin)
  • 3 tasks, 3 machines (same execution-time matrix as above)
  • MinMin algorithm
  • P1=10, P2=8, P3=23
  • Pick T2, schedule it on H2
  • Update matrix (updated completion-time matrix as above)
  • P1=10, P3=23
  • Pick T1, schedule it on H1
  • Update matrix
  • Remaining completion times for T3 (H1 busy until 10, H2 until 8):
    H1: 33, H2: 38, H3: 27
58
Example (MinMin)
  • 3 tasks, 3 machines (same execution-time matrix as above)
  • MinMin algorithm
  • P1=10, P2=8, P3=23
  • Pick T2, schedule it on H2
  • Update matrix (updated completion-time matrix as above)
  • P1=10, P3=23
  • Pick T1, schedule it on H1
  • Update matrix (remaining completion times for T3: H1: 33, H2: 38, H3: 27)
  • P3=27
  • Pick T3, schedule it on H3
  • makespan = 27 seconds
59
Example (MaxMin)
  • 3 tasks, 3 machines (same execution-time matrix as in the MinMin example)
  • MaxMin algorithm
  • P1=10, P2=8, P3=23
  • Pick T3, schedule it on H1
  • Update matrix; completion times (H1 busy until 23):
        T1  T2
    H1  33  47
    H2  16   8
    H3  70  12
  • P1=16, P2=8
  • Pick T1, schedule it on H2
  • Update matrix; remaining completion times for T2:
    H1: 47, H2: 24, H3: 12
  • P2=12
  • Pick T2, schedule it on H3
  • Makespan = 23 seconds
60
Resulting Schedules
MinMin:  machine 1: Task 1, machine 2: Task 2, machine 3: Task 3
MaxMin:  machine 1: Task 3, machine 2: Task 1, machine 3: Task 2
61
What if we add dependencies?
  • One simple way to modify the proposed heuristics
    is to only consider ready tasks
  • while tasks left to schedule
  • for all READY tasks Ti
  • for all hosts Hj
  • compute the completion time of Ti on Hj:
    CTi,j
  • end for
  • compute the priority Pi as a function of all
    the CTi,j
  • end for
  • pick task Tk with the best Pk
  • pick host Hl that minimizes CTk,l
  • schedule task Tk on host Hl
  • end while

62
What about communications?
  • Our MaxMin, MinMin, and Sufferage heuristics do
    not really work well for DAGs with dependencies
    and communications
  • They are typically used for independent tasks
  • Instead, many believe that the key to good DAG
    scheduling with communication is to consider the
    critical path
  • The goal should be to give priority to the tasks
    on the critical path to reduce the length of the
    critical path
  • Since the length of the critical path is a lower
    bound on the makespan, making that bound as low
    as possible is probably a good idea

63
Critical path?
  • How can we tell that a task is on the critical
    path?
  • The difficulty here is that as scheduling
    decisions are made, the critical path changes
  • If a task's successor is scheduled on the same
    host as that task: no communication
  • Otherwise: communication
  • So one task that could be on the critical path
    due to heavy communication may end up not being
    on the critical path once that communication has
    been nullified

64
Example
(figure: a small example DAG with task weights and edge communication
weights; the same DAG is reused on the next slide)
65
Example
  • By just looking at the DAG, the critical path is
    the red one
  • So I give priority to the red tasks
  • Clearly I allocate the second task on the path on
    the same host as the first task
  • Because I want to minimize the critical path
  • At that point the remainder of the red path will
    take at most 5+2+1 = 8 time units
  • And the other paths would take at least 5+2+1+1 =
    9 time units
  • Therefore the red path is no longer the critical
    path!

(figure: the example DAG with the red path highlighted and its task and
edge weights shown)
66
Bottom-Level
  • To deal with the previous situation we define the
    bottom-level of a task as
  • The sum of the weights along the longest (i.e.,
    heaviest) path from that task to the end of the
    DAG, including the task's own execution time
  • Assuming that ALL communications will take place
    (as opposed to being set to zero)
  • Schedule tasks in decreasing order of their
    bottom-level
  • At each step schedule the task with the largest
    bottom level on the host that can complete that
    task the soonest
  • Accounts for previous scheduling decisions
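A minimal C sketch of the bottom-level computation (not from the slides;
the representation, the assumption that vertices are numbered in
topological order, and the small example DAG are illustrative choices):

#include <stdio.h>

#define N 5
double w[N] = {1, 3, 2, 2, 1};   /* hypothetical task weights */
double comm[N][N];               /* comm[u][v] > 0 if there is an edge u -> v;
                                  * edges are assumed here to carry a strictly
                                  * positive communication weight */

/* Bottom level of every task: heaviest path from the task to an exit,
 * counting task weights and ALL edge (communication) weights.
 * Vertices are numbered in topological order, so children have larger
 * indices and can be processed in reverse order. */
void bottom_levels(double bl[N]) {
    for (int v = N - 1; v >= 0; v--) {
        bl[v] = w[v];
        for (int c = v + 1; c < N; c++)
            if (comm[v][c] > 0 && w[v] + comm[v][c] + bl[c] > bl[v])
                bl[v] = w[v] + comm[v][c] + bl[c];
    }
}

int main(void) {
    comm[0][1] = 1; comm[0][2] = 2; comm[1][3] = 1; comm[2][3] = 1; comm[3][4] = 1;
    double bl[N];
    bottom_levels(bl);
    for (int v = 0; v < N; v++)
        printf("BL(%d) = %.1f\n", v, bl[v]);
    return 0;
}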

67
Example
(figure: the example DAG annotated with bottom levels BL=19, 16, 8, 7, 7,
3, and 1, to be scheduled on 2 processors)
(slides 68-73: successive builds of the same figure, showing the tasks
being scheduled one at a time, in decreasing bottom-level order, on the
2 processors)
74
Critical Path Scheduling
  • The above algorithm is typically referred to as
    Modified Critical Path (MCP)
  • There are many proposed scheduling algorithms
    that use similar ideas
  • In practice they've been shown to be reasonable
  • Survey article by Kwok and Ahmad in JPDC
  • Next question: what if the resources are
    heterogeneous?
  • In this case, how can we compute the bottom level
    at all since the times will depend on which
    resources are picked?

75
MCP on heterogeneous resources
  • The typical approach when computing the bottom
    level of a task is to compute it using averages
    over all resources
  • For each task (computation or communication)
    compute its average execution time (over all
    processors or over all network links)
  • Other issue: irregular network topologies
  • What if there is no direct communication link
    between two processors and one must hop through a
    third one?
  • And many other variations that require
    modifications of the list scheduling heuristic
  • Many modifications have been proposed

76
Mixed Parallelism
  • So now that we have some sort of a handle on
    scheduling DAGs of sequential tasks, what about
    scheduling DAGs of parallel tasks?
  • Also called mixed parallelism

77
The problem
  • Assume you have a cluster
  • p homogeneous compute nodes
  • homogeneous network
  • e.g., a switch
  • I have a DAG of tasks
  • Each task can run on 1 to p processors
  • e.g., MPI tasks
  • Question
  • how many processors to allocate to each task?
  • how to schedule the tasks?

78
Trade-offs
  • If we give a small number of processors to each
    task we have
  • many narrow but long-running tasks
  • the ability to run many of them in parallel
  • If we give a large number of processors to each
    task we have
  • many wide but short-running tasks
  • we cannot run many of them in parallel
  • The question is: what's the best trade-off?
  • Known to be NP-complete
  • As usual

79
Example: 2 Schedules
  • The schedule should consider the speedup curve
    of the parallel tasks
  • Is it worth it to give a task more processors?

(figure: two Gantt charts over the same p processors, one with narrow
task allocations and one with wide allocations)
80
The CPA algorithm
  • A number of scheduling algorithms for mixed
    parallel applications have been proposed
  • Most of the algorithms proceed in two phases
  • phase 1: determine how many processors each task
    should receive
  • phase 2: schedule the tasks with some MCP-like
    algorithm
  • Phase 1 is the more interesting one, but let's
    discuss Phase 2 briefly

81
CPA Phase 2
  • The only difficult issue here is the
    communication among tasks because of data
    redistribution
  • Consider two tasks
  • T2 depends on T1
  • T1 is a matrix multiplication using 2D data
    distribution
  • T2 is an LU factorization using 2D data
    distribution
  • T1 was allocated 16 processors
  • T2 was allocated 4 processors

82
CPA Phase 2
  • Data redistribution can be much more complicated
    than in this example
  • The scheduling phase must consider the
    redistribution cost when computing bottom-levels
  • A lot of research in good redistribution
    algorithms
  • A lot of research in redistribution cost
    estimation

83
CPA Phase 1
  • The goal of phase one is to find the best
    allocation of processors to tasks
  • Assumption: for each task I can predict its
    execution time on any given number of processors
  • I have previous benchmarking results
  • I have a good performance model
  • The CPA heuristic relies on the fact that we have
    two lower bounds on makespan
  • Length of the critical path
  • Execution time assuming no idle time

84
CPA Phase 1
  • Consider an allocation
  • that is a number of processors for each task
  • Critical path length TCP
  • Computed ignoring data redistribution costs
  • Not accurate, but a lower bound on the overall
    makespan
  • Ideal makespan TA
  • For each task, compute its execution time times
    its number of processors
  • Take the sum over all tasks, and divide by the
    total number of processors, p
  • This is a lower bound on the overall makespan

85
CPA Phase 1
  • Consider an allocation
  • that is a number of processors for each task
  • Critical path length TCP
  • Computed ignoring data redistribution costs
  • Not accurate, but a lower bound on the overall
    makespan
  • Ideal makespan TA
  • For each task, compute its execution time times
    its number of processors
  • Take the sum over all tasks, and divide by the
    total number of processors, p
  • This is a lower bound on the overall makespan

(figure: a Gantt chart of total area p × makespan, illustrating the
ideal-makespan bound)
86
CPA Phase 1
  • Phase 1 starts by giving one processor to each
    task
  • TCP is large
  • TA is small
  • At each step, give one more processor to one task
    on the (current) critical path
  • Give a processor to the task that would benefit
    the most from it
  • i.e., the task that would achieve the highest
    speedup
  • Each time we do this, TCP decreases and TA
    increases
  • Stop when TCP ≤ TA
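A minimal C sketch of this Phase 1 loop, under simplifying assumptions
that are mine, not the slides': Amdahl-like task times t(n) = seq + par/n,
no redistribution costs, and a small made-up DAG stored in topological
order.

#include <stdio.h>

#define N 4          /* tasks */
#define P 8          /* processors in the cluster */

double seq[N] = {1, 2, 1, 1}, par[N] = {8, 12, 6, 10};  /* hypothetical profiles */
int pred[N][N] = { {0,0,0,0}, {1,0,0,0}, {1,0,0,0}, {0,1,1,0} };  /* pred[v][u]=1: edge u->v */
int alloc[N];        /* current number of processors given to each task */

double t(int i) { return seq[i] + par[i] / alloc[i]; }

/* critical path length with the current allocation; marks the tasks on it */
double tcp(int on_cp[N]) {
    double finish[N]; int from[N];
    int last = 0;
    for (int v = 0; v < N; v++) {
        finish[v] = t(v); from[v] = -1;
        for (int u = 0; u < v; u++)
            if (pred[v][u] && finish[u] + t(v) > finish[v]) {
                finish[v] = finish[u] + t(v); from[v] = u;
            }
        if (finish[v] > finish[last]) last = v;
    }
    for (int v = 0; v < N; v++) on_cp[v] = 0;
    for (int v = last; v >= 0; v = from[v]) on_cp[v] = 1;
    return finish[last];
}

/* ideal makespan: total work divided by the number of processors */
double ta(void) {
    double s = 0;
    for (int i = 0; i < N; i++) s += alloc[i] * t(i);
    return s / P;
}

int main(void) {
    int on_cp[N];
    for (int i = 0; i < N; i++) alloc[i] = 1;        /* start with 1 processor each */
    while (tcp(on_cp) > ta()) {
        /* give one more processor to the critical-path task that gains most */
        int best = -1; double gain = 0;
        for (int i = 0; i < N; i++) {
            if (!on_cp[i] || alloc[i] == P) continue;
            double g = t(i) - (seq[i] + par[i] / (alloc[i] + 1));
            if (g > gain) { gain = g; best = i; }
        }
        if (best < 0) break;                          /* no task can be widened */
        alloc[best]++;
    }
    for (int i = 0; i < N; i++)
        printf("task %d: %d processors, time %.2f\n", i, alloc[i], t(i));
    printf("TCP = %.2f, TA = %.2f\n", tcp(on_cp), ta());
    return 0;
}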

87
CPA rationale
  • By picking the allocations that make both lower
    bounds equal, one maximizes the chances that the
    makespan is as low as possible
  • Not a true justification
  • Just an intuitive notion of why the heuristic
    should work in practice

(figure: makespan bounds versus algorithm steps; TCP decreases while TA
increases, until the two curves cross)
88
CPA
  • Note that Phase 2 gets stuck with the
    allocations chosen in Phase 1 and has to
    schedule them
  • Still, as far as we know, a 2-phase approach is
    the best we've got so far
  • What about a heterogeneous cluster or a set of
    different homogeneous clusters?
  • An actively pursued research question
  • How about accounting for redistribution costs in
    phase 1?
  • Still an open research question

89
Conclusion
  • Scheduling is a difficult problem
  • One is left coming up with heuristics
  • Typically based on some type of justifiable
    intuition
  • Difficult to have theoretical comparison of
    different heuristics
  • Just try them and see what works for the type of
    DAGs that one needs to execute
  • A few empirical results exist, like "if your DAGs
    have these characteristics, heuristic 1 tends to
    be better than heuristic 2"