Principles of High Performance Computing ICS 632

1
Principles of High Performance Computing (ICS 632)
  • Notions of Scheduling

2
Scheduling
  • Scheduling is the art of assigning work to
    resources throughout time in a way that optimizes
    some metric of performance
  • e.g., factory workers to machines so that the
    largest number of cars can be produced in a day
  • e.g., processors to computations and disks to
    data items so that a given application can finish
    before a deadline
  • e.g., packets to network links so that overall
    network throughput is maximized
  • It is a very broad field, used in many domains
  • Many scheduling problems are known to be very
    difficult (i.e., intractable)

3
Scheduling
  • In the area of high performance computing, most
    scheduling problems are about task graphs
  • Task graphs are graphs of tasks in which edges
    correspond to precedence constraints

4
Where do DAGs come from?
  • Consider a (lower) triangular linear system solve
  • What you would need to do after an LU
    factorization

Ax = b
  • Simple Algorithm
  • for (i = 0; i < n; i++)
  •   xi = bi / ai,i
  •   for (j = i+1; j < n; j++)
  •     bj = bj - aj,i * xi

5
Where do DAGs come from?
  • Consider a (lower) triangular linear system solve
  • What you would need to do after an LU
    factorization

Ax = b
  • Simple Algorithm
  • for (i = 0; i < n; i++)
  •   Ti,i:  xi = bi / ai,i
  •   for (j = i+1; j < n; j++)
  •     Ti,j:  bj = bj - aj,i * xi

6
Tasks, Dependencies, etc.
  • for (i = 0; i < n; i++)
  •   Ti,i:  xi = bi / ai,i
  •   for (j = i+1; j < n; j++)
  •     Ti,j:  bj = bj - aj,i * xi
  • All tasks Ti,* are executed at iteration i of the
    outer loop
  • There is a simple sequential order of the tasks
  • T0,0 < T0,1 < ... < T0,n-1 < T1,1 < T1,2 < ... <
    T1,n-1 < ...
  • Of course, when considering a parallel execution,
    one tries to find independent tasks
  • To see if tasks are independent one must examine
    their input (In) and their output (Out)

7
Tasks, Dependencies, etc.
  • for (i = 0; i < n; i++)
  •   Ti,i:  xi = bi / ai,i
  •   for (j = i+1; j < n; j++)
  •     Ti,j:  bj = bj - aj,i * xi
  • Input and Output
  • In(Ti,i) = { bi, ai,i }
  • Out(Ti,i) = { xi }
  • In(Ti,j) = { bj, aj,i, xi } for j > i
  • Out(Ti,j) = { bj } for j > i
  • Bernstein Conditions
  • T and T' are independent if all 3 conditions are
    met
  • In(T) ∩ Out(T') = ∅
  • Out(T) ∩ In(T') = ∅
  • Out(T) ∩ Out(T') = ∅
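  • As a quick check of these conditions (an added worked example, using
    the In/Out sets above): T0,1 and T0,2 are independent, since
    Out(T0,1) ∩ In(T0,2) = {b1} ∩ {b2, a2,0, x0} = ∅,
    In(T0,1) ∩ Out(T0,2) = ∅, and Out(T0,1) ∩ Out(T0,2) = {b1} ∩ {b2} = ∅;
    on the other hand, T0,1 depends on T0,0, since
    Out(T0,0) ∩ In(T0,1) = {x0} ≠ ∅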

8
Task Graph
  • for (i = 0; i < n; i++)
  •   Ti,i:  xi = bi / ai,i
  •   for (j = i+1; j < n; j++)
  •     Ti,j:  bj = bj - aj,i * xi
  • It is easy to see that
  • for all i, all Ti,j are independent of each other
    for j > i
  • for all i, all Ti,j depend on Ti,i, for j > i
  • for all i, all Ti,j depend on Ti-1,j for j > i
    and i > 0
  • Hence the task graph

9
Task Graph
(figure: the task graph for n = 6, with nodes T0,0 through T5,5; each
diagonal task Ti,i feeds the tasks Ti,j of its row, and each Ti,j feeds
Ti+1,j)
  • for (i = 0; i < n; i++)
  •   Ti,i:  xi = bi / ai,i
  •   for (j = i+1; j < n; j++)
  •     Ti,j:  bj = bj - aj,i * xi
10
More task graphs
  • The previous task graph comes from a low-level
    analysis of the code
  • It probably makes little sense to do a parallel
    implementation with MPI at such a fine task
    granularity
  • It can totally make sense with OpenMP
  • Such task graphs can also be used by compilers to
    do code optimization by exploiting multiple
    functional units, pipelined functional units,
    etc.
  • With blocking, these tasks could become MPI
    tasks
  • Other task graphs reflect how the application
    was built

11
Scientific Workflows
  • A popular way in which many scientific
    applications are constructed is as workflows
  • A scientist conceptually drags and drops
    computational kernels and connects their
    inputs and outputs
  • The result is a DAG (actually more general than a
    DAG) that does something useful
  • Example application: Montage
  • Produces a mosaic of the sky
  • Based on multiple data sources
  • Given angle, coordinates, size, etc.
  • Tens of thousands of tasks
  • Example: M101 galaxy images

12
Sample Montage DAG
13
Many levels of parallelism
  • A scientific workflow

14
Many levels of parallelism
  • A scientific workflow

15
Many levels of parallelism
  • A scientific workflow

OpenMP Threads
16
Back to Basics
  • Definition: A task system is a directed graph G =
    (V,E)
  • V: set of vertices
  • E: set of edges
  • (u,v), such that both u and v are in V
  • denotes precedence: task u must be executed
    before task v
  • An (integer) weight w may be assigned to each
    vertex
  • e.g., computation duration on some reference
    platform
  • A schedule is a mapping of each vertex to
    available resources so that precedence
    constraints are not violated
  • a resource can only run one task at a time
  • otherwise consider it to be multiple resources
  • There is a lot of obvious and intricate formalism
    we can use to describe all this rigorously, and
    I'll try to stay away from it

17
Gantt Chart with 3 processors
(figure: a Gantt chart, with time on the x-axis and the 3 processors on
the y-axis, onto which tasks of weights 10, 5, 2, 4, 4, 7, 8, and 2 are
placed)
18
Acyclic Graphs
  • Theorem: There exists a valid schedule if and
    only if there is no cycle in the graph
  • Proof: a little less obvious than one would think
  • Uses formalisms we haven't really introduced
  • But intuitively it's pretty clear that the
    theorem should hold true
  • Therefore we only consider DAGs
  • Directed Acyclic Graphs

19
Makespan
  • The makespan is defined as the overall execution
    time

(figure: a Gantt chart with the makespan, i.e., the completion time of
the last task, indicated)
20
Lower bound on the makespan
  • Let F be a path in the DAG
  • That is, a sequence of vertices such that the second
    vertex depends on the first vertex, the third
    vertex depends on the second vertex, etc.
  • Define the length of a path as the sum of the
    vertex weights along the path
  • Then, for each possible path, the makespan is at
    least as long as the length of that path
  • Therefore, the makespan is at least the length
    of the longest path
  • The longest path is called the critical path
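As an illustration, here is a minimal C sketch (not from the slides) of
computing this lower bound, i.e., the critical path length. It assumes
the DAG is stored as an adjacency matrix with vertices already numbered
in topological order; the weights and edges are made up for the example.

#include <stdio.h>

#define N 5
int dep[N][N];                 /* dep[u][v] = 1 if v depends on u */
int w[N] = {2, 4, 3, 1, 5};    /* hypothetical task weights */

/* Longest (critical) path length; vertices are assumed to be numbered in
 * topological order, so every predecessor of v has a smaller index. */
int critical_path_length(void) {
    int len[N];                /* weight of the heaviest path ending at each task */
    int best = 0;
    for (int v = 0; v < N; v++) {
        len[v] = w[v];
        for (int u = 0; u < v; u++)
            if (dep[u][v] && len[u] + w[v] > len[v])
                len[v] = len[u] + w[v];
        if (len[v] > best)
            best = len[v];
    }
    return best;               /* a lower bound on the makespan */
}

int main(void) {
    dep[0][1] = dep[0][2] = dep[1][3] = dep[2][3] = dep[3][4] = 1;
    printf("critical path length = %d\n", critical_path_length()); /* 2+4+1+5 = 12 */
    return 0;
}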

21
Two Scheduling Problems
  • Pb(p): Given a DAG G = (V,E), find the schedule
    that achieves the smallest makespan using p
    processors
  • MSOpt(p): that smallest makespan
  • Pb(inf): Given a DAG G = (V,E), find the schedule
    that achieves the smallest makespan using an
    unbounded number of processors
  • MSOpt(inf): that smallest makespan

22
Solving Pb(inf)
  • If one has an infinite number of processors,
    obtaining the optimal schedule is actually very
    simple
  • Assign each task to a different processor
  • Start a task whenever it is ready (i.e., when
    its parent tasks have completed)
  • Sketch of a proof
  • No (unnecessary) idle time occurs between tasks
    on any path
  • Consider the tasks on the critical path
  • The last task of the DAG is on the critical path
  • If not, add a dummy task
  • The makespan is equal to the length of the
    critical path
  • Therefore it's optimal
  • Another obvious proof that may look complex in
    its full formal version

23
Solving Pb(p)
  • Associated decision problem X: Given a DAG, p
    processors, and a time T, can we execute the
    whole DAG before time T?
  • Theorem: Problem X is NP-complete
  • Proof
  • It's in NP (a guessed solution can be checked in
    polynomial time)
  • Reduction from 2-PARTITION: Given n positive
    integers a1, ..., an, can we find I, a subset
    of {1,...,n}, such that the sum of the ai for i
    in I equals the sum of the ai for i not in I?

24
Reduction from 2-PARTITION
  • Consider an instance of 2-PARTITION
  • a1, ..., an
  • We construct an instance of Pb(p) as follows
  • n vertices v1, ..., vn
  • w(vi) = ai
  • no dependences
  • p = 2 processors
  • T = (1/2) (a1 + ... + an)
  • If the instance of 2-PARTITION has a solution,
    then the instance of Pb(p) has a solution
  • If the instance of Pb(p) has a solution, then the
    instance of 2-PARTITION has a solution
  • The reduction is in polynomial time
  • Therefore, Pb(p) is NP-complete
  • Even when there are only 2 processors and all
    tasks are independent!!!
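Putting the construction together in one formula (an added restatement
of the bullets above; the threshold T is half the total task weight):

    \exists\, I \subseteq \{1,\dots,n\} :\;
    \sum_{i \in I} a_i = \sum_{i \notin I} a_i
    \quad\Longleftrightarrow\quad
    \text{the independent tasks } v_1,\dots,v_n \text{ with } w(v_i) = a_i
    \text{ can be scheduled on 2 processors with makespan} \le
    T = \tfrac{1}{2}\sum_{i=1}^{n} a_i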

25
Scheduling Independent Tasks
(figure: independent tasks packed as boxes onto the processors of a
Gantt chart)
26
Other complexity results
  • There are many such complexity results
  • Many of them come down to this notion of packing
    boxes in a Gantt Chart
  • 2-PARTITION
  • KNAPSACK
  • BINPACKING
  • etc.
  • For instance
  • Scheduling a DAG on 2 processors is NP-complete
    even if all tasks have only weight 1 or 2
  • We'll see a taxonomy of scheduling problems when
    we talk about batch scheduling in the next set of
    slides

27
What about communications?
  • If the processors are on a network (as opposed to
    in a shared memory machine), then we need to
    account for the cost of communication of data
    among tasks
  • Each edge in the DAG now has a weight
  • e.g., data transfer time on a reference network
  • Common assumption: if two tasks are scheduled on
    the same processor the edge weight is ignored
  • Or at least it's made very small
  • There is now a notion of network topology as
    well, which may be regular or irregular
  • Accounting for communication costs makes things
    much more complicated
  • Pb(inf) becomes NP-complete!

28
So where are we?
  • Scheduling is an area rife with NP-complete
    problems
  • If you work on a scheduling problem, chances are
    high that it is intractable
  • What do we do when we face intractable problems?
  • Try to come up with good approximation algorithms
  • i.e., guaranteed to be a factor X from optimal,
    where X is not too large
  • Try to come up with heuristics that do a decent
    job in practice
  • many have no guarantees, or they all have the
    same loose guarantee

29
List Scheduling Heuristics
  • A list scheduling algorithm works as follows
  • At each instant, check whether at least one of the
    processors is idle
  • If so, pick one of the ready tasks, if any
  • Assign it to one of the idle processors
  • Repeat
  • This is a greedy algorithm that amounts to
    aggressively limiting idle time
  • Of course, the two questions are
  • How do we prioritize ready tasks when there are
    more than one to choose from?
  • Which host do we assign that task to if there are
    multiple idle hosts?
  • Based on answers to these questions, one can have
    a worse or a better heuristic

30
Guarantee
  • Here is a very powerful result regarding list
    scheduling heuristics in general
  • Theorem: Consider a DAG G to schedule onto p
    identical processors, with no communication. Let
    MSOpt(p) be the optimal makespan. Let MS(S,p) be
    the makespan achieved by ANY list scheduling
    heuristic S. We have
  • MS(S,p) ≤ (2 - 1/p) MSOpt(p)
  • In other terms, a list heuristic is at worst a
    factor of 2 away from the optimal
  • Note that there can still be good and bad
    heuristics
  • But this result says that "bad" isn't that awful
  • Such results are always intriguing because we
    don't know the optimal schedule (the problem is
    NP-complete), but we can say how far from it we
    are!
  • Let's look at the sketch of the proof

31
Sketch of the proof
  • Consider a task that ends last

32
Sketch of the proof
  • Consider the latest time before that task's
    beginning such that at least one processor is
    idle at that time

33
Sketch of the proof
  • Why isn't the red task running earlier?
  • It has to be because one of its parents is
    running
  • Otherwise, it wouldn't be list scheduling!

34
Sketch of the proof
  • Let's look at the task's parent

35
Sketch of the proof
  • And at the latest time before the parent's
    beginning at which there is an idle processor
  • We ask the same question as before

36
Sketch of the proof
  • Repeat the process

37
Sketch of the proof
  • And again

38
Sketch of the proof
  • In the end we have found a path in the DAG such
    that there cannot be any idle processor when the
    tasks on this path are not running

39
Sketch of the proof
  • In the end we have found a path in the DAG such
    that there cannot be any idle processor when the
    tasks on this path are not running
  • Let L be the length of that path
  • The most idle time that can occur is when ALL
    processors, except the one executing the current
    path task, are idle whenever a task on the path
    is running
  • Let Idle be the total amount of processor idle
    time
  • We have Idle ≤ (p-1) L

40
Counting the Boxes
  • If we add up all the boxes (white and gray)
    together, we get p × MS(S,p)
  • The area of the big rectangle
  • The white boxes correspond to Idle
  • The gray boxes correspond to the sequential
    execution time, Seq

41
Counting the Boxes
  • We have p × MS(S,p) = Idle + Seq
  • We had before that Idle ≤ (p-1) L
  • But L ≤ MSOpt(p)
  • And MSOpt(p) ≥ Seq / p
  • Therefore Seq ≤ p × MSOpt(p)
  • We obtain that
  • p × MS(S,p) ≤ (p-1) MSOpt(p) + p × MSOpt(p)
  • which means that MS(S,p) ≤ (2 - 1/p) MSOpt(p)
  • The theorem is proven!
  • In fact, it can be proven that this is the best
    bound!

42
List Scheduling Heuristics
  • A typical scheduling algorithm
  • while tasks left to schedule
  • Determine the set of ready tasks
  • Pick one of the ready tasks
  • Pick one of the available hosts
  • Assign the task to the host
  • end while
  • This algorithm works offline, placing every task
    into a slot of a Gantt chart
  • Once all tasks have been assigned to processors
    throughout time, then one can just follow the
    schedule
  • Each processor computes a subset of the tasks in
    some order

43
Independent Tasks
  • Assume that there are no edges in the DAG
  • Then any task can be scheduled at any time at
    which a processor is idle
  • The algorithm becomes
  • Compute a priority for each task
  • while tasks left to schedule
  • pick the task with the highest priority
  • schedule it on the first available host
  • end while
  • Remaining question: how do we define the priority?

44
Independent Tasks
  • One probably reasonable idea is to give higher
    priority to the longer tasks

45
Independent Tasks
  • One possibility is to give higher priority to
    the longer tasks

(figure: a Gantt chart of 10 tasks, numbered 0 through 9, scheduled
longest-first on 3 processors)
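To make the rule concrete, here is a minimal C sketch (not from the
slides) of list scheduling for independent tasks on identical
processors, with "longest task first" as the priority; the task
durations are made up for the example.

#include <stdio.h>

#define NTASKS 10
#define NPROCS 3

int main(void) {
    int weight[NTASKS] = {3, 2, 8, 1, 4, 7, 9, 2, 5, 6}; /* hypothetical durations */
    int order[NTASKS];
    double finish[NPROCS] = {0};   /* time at which each processor becomes idle */

    /* priority = task length: sort task indices by decreasing weight */
    for (int i = 0; i < NTASKS; i++) order[i] = i;
    for (int i = 0; i < NTASKS; i++)
        for (int j = i + 1; j < NTASKS; j++)
            if (weight[order[j]] > weight[order[i]]) {
                int tmp = order[i]; order[i] = order[j]; order[j] = tmp;
            }

    /* list scheduling: each task goes to the processor that is idle first */
    for (int i = 0; i < NTASKS; i++) {
        int t = order[i], best = 0;
        for (int q = 1; q < NPROCS; q++)
            if (finish[q] < finish[best]) best = q;
        printf("task %d (weight %d) -> processor %d at time %.0f\n",
               t, weight[t], best, finish[best]);
        finish[best] += weight[t];
    }

    double makespan = 0;
    for (int q = 0; q < NPROCS; q++)
        if (finish[q] > makespan) makespan = finish[q];
    printf("makespan = %.0f\n", makespan);
    return 0;
}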
46
Heterogeneous Processors
  • What if not all processors are identical?
  • Heterogeneous compute speeds
  • The algorithm can be modified as follows
  • while tasks left to schedule
  • for all unscheduled tasks Ti
  • for all hosts Hj
  • compute the completion time of Ti on Hj:
    CTi,j
  • end for
  • compute the priority Pi as a function of all
    the CTi,j
  • end for
  • pick task Tk with the best Pk
  • pick host Hl that minimizes CTk,l
  • schedule task Tk on host Hl
  • end while

47
Heterogeneous Processors
  • What if not all processors are identical?
  • Heterogeneous compute speeds
  • The algorithm can be modified as follows
  • while tasks left to schedule
  • for all unscheduled tasks Ti
  • for all hosts Hj
  • compute the completion time of Ti on Hj:
    CTi,j
  • end for
  • compute the priority Pi as a function of all
    the CTi,j
  • end for
  • pick task Tk with the best Pk
  • pick host Hl that minimizes CTk,l
  • schedule task Tk on host Hl
  • end while

Priority computation inside the loop (dynamic
priorities)
48
Two parameters
  • while tasks left to schedule
  • for all unscheduled tasks Ti
  • for all hosts Hj
  • compute the completion time of Ti on Hj:
    CTi,j
  • end for
  • compute the priority Pi as a function of all
    the CTi,j
  • end for
  • pick task Tk with the best Pk
  • pick host Hl that minimizes CTk,l
  • schedule task Tk on host Hl
  • end while

The two parameters: how the priority Pi is defined, and what "best" means
49
Two parameters
  • while tasks left to schedule
  • for all unscheduled tasks Ti
  • for all hosts Hj
  • compute the completion time of Ti on Hj:
    CTi,j
  • end for
  • compute the priority Pi as a function of all
    the CTi,j
  • end for
  • pick task Tk with the best Pk
  • pick host Hl that minimizes CTk,l
  • schedule task Tk on host Hl
  • end while

Candidate priorities: the min of the CTi,j, the max of the CTi,j, or the
difference between the two smallest (i.e., best) CTi,j; "best" is then
the min or the max of the Pi
50
List Scheduling
  • MinMin (aggressively pick the task that can be
    done soonest)
  • for each task T pick the host H that achieves the
    smallest CT for task T
  • pick the task with the smallest such CT
  • schedule T on H
  • MaxMin (pick the largest tasks first)
  • for each task T pick the host H that achieves the
    smallest CT for task T
  • pick the task with the largest such CT
  • schedule T on H
  • Sufferage (pick the task that would suffer the
    most if not picked)
  • for each task T pick the host H that achieves the
    smallest CT for task T
  • for each task T pick the host H' that achieves
    the second smallest CT' for task T
  • pick the task with the largest (CT' - CT) value
  • schedule T on H
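Here is a minimal C sketch of MinMin for independent tasks on
heterogeneous machines (the variable names and data layout are my own,
not from the slides); it uses the same 3x3 execution-time matrix as the
example on the next slides, so it reproduces the makespan of 27.

#include <stdio.h>

#define NT 3   /* tasks */
#define NH 3   /* machines */

int main(void) {
    double exec[NH][NT] = { {10, 24, 23},
                            {16,  8, 30},
                            {70, 12, 27} };   /* exec[h][t]: time of task t on machine h */
    double ready[NH] = {0};                   /* time at which each machine becomes free */
    int done[NT] = {0};
    double makespan = 0;

    for (int step = 0; step < NT; step++) {
        int bestT = -1, bestH = -1;
        double bestCT = 0;
        /* for each unscheduled task, find its minimum completion time,
         * then pick the task whose minimum completion time is smallest */
        for (int t = 0; t < NT; t++) {
            if (done[t]) continue;
            int h0 = 0;
            for (int h = 1; h < NH; h++)
                if (ready[h] + exec[h][t] < ready[h0] + exec[h0][t]) h0 = h;
            double ct = ready[h0] + exec[h0][t];
            if (bestT < 0 || ct < bestCT) { bestT = t; bestH = h0; bestCT = ct; }
        }
        printf("schedule T%d on H%d, completes at %.0f\n", bestT + 1, bestH + 1, bestCT);
        ready[bestH] = bestCT;
        done[bestT] = 1;
        if (bestCT > makespan) makespan = bestCT;
    }
    printf("makespan = %.0f\n", makespan);
    return 0;
}

Changing the outer selection to keep the task whose smallest completion
time is largest turns the same loop into MaxMin.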

51
Heterogeneity?
  • Uniform heterogeneity: if task A takes time TA
    and task B takes time TB on a processor p, then
    task A takes time α·TA and task B takes time α·TB
    on another processor p', for a factor α that
    depends only on the two processors, for all tasks
    and processors
  • Otherwise we have non-uniform heterogeneity

(figure: two example execution-time matrices, one uniform and one
non-uniform)
52
Example (MinMin)
  • 3 tasks, 3 machines
  • Execution times (rows: machines H1, H2, H3; columns: tasks T1, T2, T3):
        T1  T2  T3
    H1  10  24  23
    H2  16   8  30
    H3  70  12  27
  • MinMin algorithm
  • P1=10, P2=8, P3=23

53
Example (MinMin)
  • 3 tasks, 3 machines (same execution-time matrix as above)
  • MinMin algorithm
  • P1=10, P2=8, P3=23
  • Pick T2, schedule it on H2

54
Example (MinMin)
  • 3 tasks, 3 machines (same execution-time matrix as above)
  • MinMin algorithm
  • P1=10, P2=8, P3=23
  • Pick T2, schedule it on H2
  • Update matrix
  • Updated completion times (T2 removed; H2 is busy until time 8):
        T1  T3
    H1  10  23
    H2  24  38
    H3  70  27
55
Example (MinMin)
  • 3 tasks, 3 machines (same execution-time matrix as above)
  • MinMin algorithm
  • P1=10, P2=8, P3=23
  • Pick T2, schedule it on H2
  • Update matrix (updated completion-time matrix as above)
  • P1=10, P3=23
56
Example (MinMin)
  • 3 tasks, 3 machines (same execution-time matrix as above)
  • MinMin algorithm
  • P1=10, P2=8, P3=23
  • Pick T2, schedule it on H2
  • Update matrix (updated completion-time matrix as above)
  • P1=10, P3=23
  • Pick T1, schedule it on H1
57
Example (MinMin)
  • 3 tasks, 3 machines (same execution-time matrix as above)
  • MinMin algorithm
  • P1=10, P2=8, P3=23
  • Pick T2, schedule it on H2
  • Update matrix (updated completion-time matrix as above)
  • P1=10, P3=23
  • Pick T1, schedule it on H1
  • Update matrix
  • Remaining completion times for T3 (H1 busy until 10, H2 until 8):
    H1: 33, H2: 38, H3: 27
58
Example (MinMin)
  • 3 tasks, 3 machines (same execution-time matrix as above)
  • MinMin algorithm
  • P1=10, P2=8, P3=23
  • Pick T2, schedule it on H2
  • Update matrix (updated completion-time matrix as above)
  • P1=10, P3=23
  • Pick T1, schedule it on H1
  • Update matrix (remaining completion times for T3: H1: 33, H2: 38, H3: 27)
  • P3=27
  • Pick T3, schedule it on H3
  • makespan = 27 seconds
59
Example (MaxMin)
  • 3 tasks, 3 machines (same execution-time matrix as in the MinMin example)
  • MaxMin algorithm
  • P1=10, P2=8, P3=23
  • Pick T3, schedule it on H1
  • Update matrix; completion times (H1 busy until 23):
        T1  T2
    H1  33  47
    H2  16   8
    H3  70  12
  • P1=16, P2=8
  • Pick T1, schedule it on H2
  • Update matrix; remaining completion times for T2:
    H1: 47, H2: 24, H3: 12
  • P2=12
  • Pick T2, schedule it on H3
  • Makespan = 23 seconds
60
Resulting Schedules
MinMin:  machine 1: Task 1, machine 2: Task 2, machine 3: Task 3
MaxMin:  machine 1: Task 3, machine 2: Task 1, machine 3: Task 2
61
What if we add dependencies?
  • One simple way to modify the proposed heuristics
    is to only consider ready tasks
  • while tasks left to schedule
  • for all READY tasks Ti
  • for all hosts Hj
  • compute the completion time of Ti on Hj:
    CTi,j
  • end for
  • compute the priority Pi as a function of all
    the CTi,j
  • end for
  • pick task Tk with the best Pk
  • pick host Hl that minimizes CTk,l
  • schedule task Tk on host Hl
  • end while

62
What about communications?
  • Our MaxMin, MinMin, and Sufferage heuristics do
    not really work well for DAGs with dependencies
    and communications
  • They are typically used for independent tasks
  • Instead, many believe that the key to good DAG
    scheduling with communication is to consider the
    critical path
  • The goal should be to give priority to the tasks
    on the critical path to reduce the length of the
    critical path
  • Since the length of the critical path is a lower
    bound on the makespan, making that bound as low
    as possible is probably a good idea

63
Critical path?
  • How can we tell that a task is on the critical
    path?
  • The difficulty here is that as scheduling
    decisions are made, the critical path changes
  • If a task's successor is scheduled on the same
    host as that task: no communication
  • Otherwise: communication
  • So one task that could be on the critical path
    due to heavy communication may end up not being
    on the critical path once that communication has
    been nullified

64
Example
(figure: a small example DAG with task weights and edge communication
weights; the same DAG is reused on the next slide)
65
Example
  • By just looking at the DAG, the critical path is
    the red one
  • So I give priority to the red tasks
  • Clearly I allocate the second task on the path on
    the same host as the first task
  • Because I want to minimize the critical path
  • At that point the remainder of the red path will
    take at most 5+2+1 = 8 time units
  • And the other paths would take at least 5+2+1+1 =
    9 time units
  • Therefore the red path is no longer the critical
    path!

(figure: the example DAG with the red path highlighted and its task and
edge weights shown)
66
Bottom-Level
  • To deal with the previous situation we define the
    bottom-level of a task as
  • The sum of the weights along the longest (i.e.,
    heaviest) path from that task to the end of the
    DAG, including the task's own execution time
  • Assuming that ALL communications will take place
    (as opposed to being set to zero)
  • Schedule tasks in decreasing order of their
    bottom-level
  • At each step schedule the task with the largest
    bottom level on the host that can complete that
    task the soonest
  • Accounts for previous scheduling decisions
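A minimal C sketch of the bottom-level computation (not from the slides;
the representation, the assumption that vertices are numbered in
topological order, and the small example DAG are illustrative choices):

#include <stdio.h>

#define N 5
double w[N] = {1, 3, 2, 2, 1};   /* hypothetical task weights */
double comm[N][N];               /* comm[u][v] > 0 if there is an edge u -> v;
                                  * edges are assumed here to carry a strictly
                                  * positive communication weight */

/* Bottom level of every task: heaviest path from the task to an exit,
 * counting task weights and ALL edge (communication) weights.
 * Vertices are numbered in topological order, so children have larger
 * indices and can be processed in reverse order. */
void bottom_levels(double bl[N]) {
    for (int v = N - 1; v >= 0; v--) {
        bl[v] = w[v];
        for (int c = v + 1; c < N; c++)
            if (comm[v][c] > 0 && w[v] + comm[v][c] + bl[c] > bl[v])
                bl[v] = w[v] + comm[v][c] + bl[c];
    }
}

int main(void) {
    comm[0][1] = 1; comm[0][2] = 2; comm[1][3] = 1; comm[2][3] = 1; comm[3][4] = 1;
    double bl[N];
    bottom_levels(bl);
    for (int v = 0; v < N; v++)
        printf("BL(%d) = %.1f\n", v, bl[v]);
    return 0;
}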

67
Example
(figure: the example DAG annotated with bottom levels BL=19, 16, 8, 7, 7,
3, and 1, to be scheduled on 2 processors)
(slides 68-73: successive builds of the same figure, showing the tasks
being scheduled one at a time, in decreasing bottom-level order, on the
2 processors)
74
Critical Path Scheduling
  • The above algorithm is typically referred to as
    Modified Critical Path (MCP)
  • There are many proposed scheduling algorithms
    that use similar ideas
  • In practice they've been shown to be reasonable
  • Survey article by Kwok and Ahmad in JPDC
  • Next question: what if the resources are
    heterogeneous?
  • In this case, how can we compute the bottom level
    at all since the times will depend on which
    resources are picked?

75
MCP on heterogeneous resources
  • The typical approach when computing the bottom
    level of a task is to compute it using averages
    over all resources
  • For each task (computation or communication)
    compute its average execution time (over all
    processors or over all network links)
  • Other issue: irregular network topologies
  • What if there is no direct communication link
    between two processors and one must hop through a
    third one?
  • And many other variations that require
    modifications of the list scheduling heuristic
  • Many modifications have been proposed

76
Mixed Parallelism
  • So now that we have some sort of a handle on
    scheduling DAGs of sequential tasks, what about
    scheduling DAGs of parallel tasks?
  • Also called mixed parallelism

77
The problem
  • Assume you have a cluster
  • p homogeneous compute nodes
  • homogeneous network
  • e.g., a switch
  • I have a DAG of tasks
  • Each task can run on 1 to p processors
  • e.g., MPI tasks
  • Question
  • how many processors to allocate to each task?
  • how to schedule the tasks?

78
Trade-offs
  • If we give a small number of processors to each
    task we have
  • many narrow but long-running tasks
  • the ability to run many of them in parallel
  • If we give a large number of processors to each
    task we have
  • many wide but short-running tasks
  • we cannot run many of them in parallel
  • The question is: what's the best trade-off?
  • Known to be NP-complete
  • As usual

79
Example: 2 Schedules
  • The schedule should consider the speedup curve
    of the parallel tasks
  • Is it worth it to give a task more processors?

(figure: two Gantt charts over the same p processors, one with narrow
task allocations and one with wide allocations)
80
The CPA algorithm
  • A number of scheduling algorithms for mixed
    parallel applications have been proposed
  • Most of the algorithms proceed in two phases
  • phase 1: determine how many processors each task
    should receive
  • phase 2: schedule the tasks with some MCP-like
    algorithm
  • Phase 1 is the more interesting one, but let's
    discuss Phase 2 briefly

81
CPA Phase 2
  • The only difficult issue here is the
    communication among tasks because of data
    redistribution
  • Consider two tasks
  • T2 depends on T1
  • T1 is a matrix multiplication using 2D data
    distribution
  • T2 is an LU factorization using 2D data
    distribution
  • T1 was allocated 16 processors
  • T2 was allocated 4 processors

82
CPA Phase 2
  • Data redistribution can be much more complicated
    than in this example
  • The scheduling phase must consider the
    redistribution cost when computing bottom-levels
  • A lot of research in good redistribution
    algorithms
  • A lot of research in redistribution cost
    estimation

83
CPA Phase 1
  • The goal of phase one is to find the best
    allocation of processors to tasks
  • Assumption: for each task I can predict its
    execution time on any given number of processors
  • I have previous benchmarking results
  • I have a good performance model
  • The CPA heuristic relies on the fact that we have
    two lower bounds on makespan
  • Length of the critical path
  • Execution time assuming no idle time

84
CPA Phase 1
  • Consider an allocation
  • that is a number of processors for each task
  • Critical path length TCP
  • Computed ignoring data redistribution costs
  • Not accurate, but a lower bound on the overall
    makespan
  • Ideal makespan TA
  • For each task, compute its execution time times
    its number of processors
  • Take the sum over all tasks, and divide by the
    total number of processors, p
  • This is a lower bound on the overall makespan

85
CPA Phase 1
  • Consider an allocation
  • that is a number of processors for each task
  • Critical path length TCP
  • Computed ignoring data redistribution costs
  • Not accurate, but a lower bound on the overall
    makespan
  • Ideal makespan TA
  • For each task, compute its execution time times
    its number of processors
  • Take the sum over all tasks, and divide by the
    total number of processors, p
  • This is a lower bound on the overall makespan

(figure: a Gantt chart of total area p × makespan, illustrating the
ideal-makespan bound)
86
CPA Phase 1
  • Phase 1 starts by giving one processor to each
    task
  • TCP is large
  • TA is small
  • At each step, give one more processor to one task
    on the (current) critical path
  • Give a processor to the task that would benefit
    the most from it
  • i.e., the task that would achieve the highest
    speedup
  • Each time we do this, TCP decreases and TA
    increases
  • Stop when TCP ≤ TA
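A minimal C sketch of this Phase 1 loop, under simplifying assumptions
that are mine, not the slides': Amdahl-like task times t(n) = seq + par/n,
no redistribution costs, and a small made-up DAG stored in topological
order.

#include <stdio.h>

#define N 4          /* tasks */
#define P 8          /* processors in the cluster */

double seq[N] = {1, 2, 1, 1}, par[N] = {8, 12, 6, 10};  /* hypothetical profiles */
int pred[N][N] = { {0,0,0,0}, {1,0,0,0}, {1,0,0,0}, {0,1,1,0} };  /* pred[v][u]=1: edge u->v */
int alloc[N];        /* current number of processors given to each task */

double t(int i) { return seq[i] + par[i] / alloc[i]; }

/* critical path length with the current allocation; marks the tasks on it */
double tcp(int on_cp[N]) {
    double finish[N]; int from[N];
    int last = 0;
    for (int v = 0; v < N; v++) {
        finish[v] = t(v); from[v] = -1;
        for (int u = 0; u < v; u++)
            if (pred[v][u] && finish[u] + t(v) > finish[v]) {
                finish[v] = finish[u] + t(v); from[v] = u;
            }
        if (finish[v] > finish[last]) last = v;
    }
    for (int v = 0; v < N; v++) on_cp[v] = 0;
    for (int v = last; v >= 0; v = from[v]) on_cp[v] = 1;
    return finish[last];
}

/* ideal makespan: total work divided by the number of processors */
double ta(void) {
    double s = 0;
    for (int i = 0; i < N; i++) s += alloc[i] * t(i);
    return s / P;
}

int main(void) {
    int on_cp[N];
    for (int i = 0; i < N; i++) alloc[i] = 1;        /* start with 1 processor each */
    while (tcp(on_cp) > ta()) {
        /* give one more processor to the critical-path task that gains most */
        int best = -1; double gain = 0;
        for (int i = 0; i < N; i++) {
            if (!on_cp[i] || alloc[i] == P) continue;
            double g = t(i) - (seq[i] + par[i] / (alloc[i] + 1));
            if (g > gain) { gain = g; best = i; }
        }
        if (best < 0) break;                          /* no task can be widened */
        alloc[best]++;
    }
    for (int i = 0; i < N; i++)
        printf("task %d: %d processors, time %.2f\n", i, alloc[i], t(i));
    printf("TCP = %.2f, TA = %.2f\n", tcp(on_cp), ta());
    return 0;
}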

87
CPA rationale
  • By picking the allocations that make both lower
    bounds equal, one maximizes the chances that the
    makespan is as low as possible
  • Not a true justification
  • Just an intuitive notion of why the heuristic
    should work in practice

(figure: makespan bounds versus algorithm steps; TCP decreases while TA
increases, until the two curves cross)
88
CPA
  • Note that Phase 2 gets stuck with the
    allocations chosen in Phase 1 and has to
    schedule them
  • Still, as far as we know, a 2-phase approach is
    the best we've got so far
  • What about a heterogeneous cluster or a set of
    different homogeneous clusters?
  • An actively pursued research question
  • How about accounting for redistribution costs in
    phase 1?
  • Still an open research question

89
Conclusion
  • Scheduling is a difficult problem
  • One is left coming up with heuristics
  • Typically based on some type of justifiable
    intuition
  • Difficult to have theoretical comparison of
    different heuristics
  • Just try them and see what works for the type of
    DAGs that one needs to execute
  • A few empirical results exist, like "if your DAGs
    have these characteristics, heuristic 1 tends to
    be better than heuristic 2"