Title: Principles of High Performance Computing ICS 632
1. Principles of High Performance Computing (ICS 632)
2. Scheduling
- Scheduling is the art of assigning work to resources over time in a way that optimizes some metric of performance
- e.g., factory workers to machines so that the largest number of cars can be produced in a day
- e.g., processors to computations and disks to data items so that a given application can finish before a deadline
- e.g., packets to network links so that overall network throughput is maximized
- It is a very broad field, used in many domains
- Many scheduling problems are known to be very difficult (i.e., intractable)
3. Scheduling
- In the area of high performance computing, most scheduling problems are for task graphs
- Task graphs are graphs of tasks in which edges correspond to precedence constraints
4. Where do DAGs come from?
- Consider a (lower) triangular linear system solve
- What you would need to do after an LU factorization
- Ax = b
- Simple Algorithm (a runnable C version follows below):
- for (i = 0; i < n; i++) {
-   xi = bi / ai,i
-   for (j = i+1; j < n; j++) {
-     bj = bj - aj,i * xi
-   }
- }
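To make this concrete, here is a minimal runnable C version of the algorithm above; the 3x3 system and its values are made up for illustration.

```c
#include <stdio.h>
#define N 3

int main(void) {
    /* Lower triangular matrix A and right-hand side b (illustrative values) */
    double a[N][N] = {{2.0, 0.0, 0.0},
                      {1.0, 3.0, 0.0},
                      {4.0, 5.0, 6.0}};
    double b[N] = {2.0, 7.0, 32.0};
    double x[N];

    /* Forward substitution: the algorithm from the slide */
    for (int i = 0; i < N; i++) {
        x[i] = b[i] / a[i][i];            /* task T(i,i) */
        for (int j = i + 1; j < N; j++)
            b[j] = b[j] - a[j][i] * x[i]; /* task T(i,j) */
    }

    for (int i = 0; i < N; i++)
        printf("x[%d] = %g\n", i, x[i]);  /* prints 1, 2, 3 */
    return 0;
}
```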
5. Where do DAGs come from?
- Consider a (lower) triangular linear system solve
- What you would need to do after an LU factorization
- Ax = b
- Simple Algorithm, with tasks labeled:
- for (i = 0; i < n; i++) {
-   Ti,i:  xi = bi / ai,i
-   for (j = i+1; j < n; j++) {
-     Ti,j:  bj = bj - aj,i * xi
-   }
- }
6. Tasks, Dependencies, etc.
- for (i = 0; i < n; i++) {
-   Ti,i:  xi = bi / ai,i
-   for (j = i+1; j < n; j++) {
-     Ti,j:  bj = bj - aj,i * xi
-   }
- }
- All tasks Ti,* are executed at iteration i of the outer loop
- There is a simple sequential order of the tasks:
- T0,0 < T0,1 < ... < T0,n-1 < T1,1 < T1,2 < ... < T1,n-1 < ...
- Of course, when considering a parallel execution, one tries to find independent tasks
- To see whether tasks are independent, one must examine their input (In) and output (Out) sets
7. Tasks, Dependencies, etc.
- for (i = 0; i < n; i++) {
-   Ti,i:  xi = bi / ai,i
-   for (j = i+1; j < n; j++) {
-     Ti,j:  bj = bj - aj,i * xi
-   }
- }
- Input and Output
- In(Ti,i) = {bi, ai,i}
- Out(Ti,i) = {xi}
- In(Ti,j) = {bj, aj,i, xi} for j > i
- Out(Ti,j) = {bj} for j > i
- Bernstein Conditions (checked on an example below)
- T and T' are independent if all 3 conditions are met:
- In(T) ∩ Out(T') = ∅
- Out(T) ∩ In(T') = ∅
- Out(T) ∩ Out(T') = ∅
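As a concrete check on the triangular-solve tasks: Out(T0,0) = {x0} intersects In(T0,1) = {b1, a1,0, x0}, so T0,0 and T0,1 are dependent. By contrast, for T0,1 and T0,2 we have In(T0,1) = {b1, a1,0, x0}, Out(T0,1) = {b1}, In(T0,2) = {b2, a2,0, x0}, Out(T0,2) = {b2}: all three intersections are empty, so these two tasks are independent.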
8. Task Graph
- for (i = 0; i < n; i++) {
-   Ti,i:  xi = bi / ai,i
-   for (j = i+1; j < n; j++) {
-     Ti,j:  bj = bj - aj,i * xi
-   }
- }
- It is easy to see that:
- for all i, the tasks Ti,j for j > i are independent of each other
- for all i, all Ti,j depend on Ti,i, for j > i
- for all i, all Ti,j depend on Ti-1,j, for j ≥ i and i > 0
- Hence the task graph
9. Task Graph
[Figure: the task graph for n = 6, with nodes T0,0 through T5,5. Each diagonal task Ti,i has edges to the tasks Ti,j for j > i, and each task Ti,j has an edge to Ti+1,j.]
10. More task graphs
- The previous task graph comes from a low-level analysis of the code
- It probably makes little sense to do a parallel implementation with MPI at such a low task granularity
- It can totally make sense with OpenMP
- Such task graphs can also be used by compilers to do code optimization, by exploiting multiple functional units, pipelined functional units, etc.
- With blocking, these tasks could become MPI tasks
- Other task graphs reflect how the application was actually built
11. Scientific Workflows
- A popular way in which many scientific applications are constructed is as workflows
- A scientist conceptually drags and drops computational kernels and connects their inputs and outputs
- The result is a DAG (actually, something more general than a DAG) that does something useful
- Example application: Montage
- Produces a mosaic of the sky
- Based on multiple data sources
- Given angle, coordinates, size, etc.
- 10s of thousands of tasks
- Example: M101 galaxy images
12. Sample Montage DAG
[Figure: the Montage workflow DAG.]
13-15. Many levels of parallelism
[Figures: the same workflow shown at several levels of parallelism, down to OpenMP threads within individual tasks.]
16. Back to Basics
- Definition: A task system is a directed graph G = (V,E)
- V: set of vertices
- E: set of edges
- an edge (u,v), with both u and v in V,
- denotes precedence: task u must be executed before task v
- An (integer) weight w may be assigned to each vertex
- e.g., computation duration on some reference platform
- A schedule is a mapping of each vertex to available resources such that precedence constraints are not violated
- a resource can only run one task at a time
- otherwise, consider it to be multiple resources
- There is a lot of obvious and intricate formalism we can use to describe all this rigorously, and I'll try to stay away from it
17. Gantt Chart with 3 processors
[Figure: a Gantt chart plotting processors against time, with task boxes of durations 10, 5, 2, 4, 4, 7, 8, and 2.]
18. Acyclic Graphs
- Theorem: There exists a valid schedule if and only if there is no cycle in the graph
- Proof: a little less obvious than one would think
- Uses formalisms we haven't really introduced
- But intuitively it's pretty clear that the theorem should hold
- Therefore we only consider DAGs
- Directed Acyclic Graphs
19. Makespan
- The makespan is defined as the overall execution time
[Figure: a Gantt chart with the makespan marked as the completion time of the last task.]
20. Lower bound on the makespan
- Let F be a path in the DAG
- That is, a sequence of vertices such that the second vertex depends on the first vertex, the third vertex depends on the second vertex, etc.
- Define the length of a path as the sum of the vertex weights along the path
- Then, for each possible path, the makespan is at least the length of that path
- Therefore, the makespan is at least the length of the longest path
- The longest path is called the critical path (a sketch of computing its length follows below)
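As an illustration, here is a minimal C sketch that computes the critical path length of a small DAG in a single pass; the 4-vertex DAG, its weights, and the assumption that vertices are already numbered in topological order are all hypothetical.

```c
#include <stdio.h>

#define NV 4  /* number of vertices */
#define NE 4  /* number of edges */

int main(void) {
    /* Hypothetical DAG: vertex weights and edges (u -> v), with
       vertices numbered in topological order. */
    int w[NV] = {3, 2, 5, 1};
    int edge[NE][2] = {{0,1}, {0,2}, {1,3}, {2,3}};

    /* len[v] = length of the longest path ending at v (inclusive) */
    int len[NV];
    for (int v = 0; v < NV; v++) len[v] = w[v];

    /* One pass over the edges, sorted by source vertex */
    for (int e = 0; e < NE; e++) {
        int u = edge[e][0], v = edge[e][1];
        if (len[u] + w[v] > len[v]) len[v] = len[u] + w[v];
    }

    int cp = 0;
    for (int v = 0; v < NV; v++) if (len[v] > cp) cp = len[v];
    printf("critical path length = %d\n", cp);  /* 3 + 5 + 1 = 9 here */
    return 0;
}
```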
21. Two Scheduling Problems
- Pb(p): Given a DAG G = (V,E), find the schedule that achieves the smallest makespan using p processors
- MSOpt(p): the smallest such makespan
- Pb(inf): Given a DAG G = (V,E), find the schedule that achieves the smallest makespan using an unbounded number of processors
- MSOpt(inf): the smallest such makespan
22. Solving Pb(inf)
- If one has an infinite number of processors, obtaining the optimal schedule is actually very simple
- Assign each task to a different processor
- Start a task as soon as it is ready, i.e., when all its parent tasks have completed
- Sketch of a proof:
- No (unnecessary) idle time occurs between tasks on any path
- Consider the tasks on the critical path
- The last task of the DAG is on the critical path
- If not, add a dummy task
- The makespan is equal to the length of the critical path
- Therefore it's optimal
- Another obvious proof that may look complex in its full formal version
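Put another way, in this infinite-processor setting each task v can start at time start(v) = max over parents u of (start(u) + w(u)), or 0 if v has no parent, and the makespan max over v of (start(v) + w(v)) is exactly the length of the critical path.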
23. Solving Pb(p)
- Associated decision problem X: Given a DAG, p processors, and a time T, can we execute the whole DAG before time T?
- Theorem: Problem X is NP-complete
- Proof:
- It's in NP (a guessed solution can be checked in polynomial time)
- Reduction from 2-PARTITION: Given n positive integers a1, ..., an, can we find I, a subset of {1,...,n}, such that Σ_{i∈I} ai = Σ_{i∉I} ai ?
24. Reduction from 2-PARTITION
- Consider an instance of 2-PARTITION: a1, ..., an
- We construct an instance of Pb(p) as follows (a small example follows below):
- n vertices v1, ..., vn
- w(vi) = ai
- no dependences
- p = 2 processors
- T = (a1 + ... + an) / 2
- If the instance of 2-PARTITION has a solution, then the instance of Pb(p) has a solution
- If the instance of Pb(p) has a solution, then the instance of 2-PARTITION has a solution
- The reduction is in polynomial time
- Therefore, Pb(p) is NP-complete
- Even when there are only 2 processors and all tasks are independent!
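As a made-up sanity check: with a = (3, 1, 1, 2, 2, 1), the sum is 10, so T = 5; the constructed instance is 6 independent tasks of weights 3, 1, 1, 2, 2, 1 on 2 processors, and a schedule finishing by time 5 exists exactly when the integers split into two halves of sum 5 each, e.g., {3, 2} and {1, 1, 2, 1}.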
25. Scheduling Independent Tasks
[Figure: a Gantt chart of independent tasks packed onto processors.]
26. Other complexity results
- There are many such complexity results
- Many of them come down to this notion of packing boxes in a Gantt chart
- 2-PARTITION
- KNAPSACK
- BIN-PACKING
- etc.
- For instance:
- Scheduling a DAG on 2 processors is NP-complete even if all tasks have only weight 1 or 2
- We'll see a taxonomy of scheduling problems when we talk about batch scheduling in the next set of slides
27. What about communications?
- If the processors are on a network (as opposed to in a shared-memory machine), then we need to account for the cost of communicating data among tasks
- Each edge in the DAG now has a weight
- e.g., data transfer time on a reference network
- Common assumption: if two tasks are scheduled on the same processor, the edge weight is ignored
- Or at least it's made very small
- There is now a notion of network topology as well, which may be regular or irregular
- Accounting for communication costs makes things much more complicated
- Pb(inf) becomes NP-complete!
28. So where are we?
- Scheduling is an area rife with NP-complete problems
- If you work on a scheduling problem, chances are high that it is intractable
- What do we do when we face intractable problems?
- Try to come up with good approximation algorithms
- i.e., guaranteed to be at most a factor X from optimal, where X is not too large
- Try to come up with heuristics that do a decent job in practice
- many have no guarantees, or they all have the same loose guarantee
29. List Scheduling Heuristics
- A list scheduling algorithm works as follows:
- At each instant, check whether at least one of the processors is idle
- If so, pick one of the ready tasks, if any
- Assign it to one of the idle processors
- Repeat
- This is a greedy algorithm that amounts to aggressively limiting idle time
- Of course, the two questions are:
- How do we prioritize ready tasks when there is more than one to choose from?
- Which host do we assign the task to if there are multiple idle hosts?
- Based on the answers to these questions, one gets a better or a worse heuristic
30. Guarantee
- Here is a very powerful result regarding list scheduling heuristics in general
- Theorem: Consider a DAG G to schedule onto p identical processors, with no communication. Let MSOpt(p) be the optimal makespan. Let MS(S,p) be the makespan achieved by ANY list scheduling heuristic S. We have:
- MS(S,p) ≤ (2 - 1/p) MSOpt(p)
- In other terms, a list heuristic is at worst a factor of 2 away from the optimal
- Note that there can still be good and bad heuristics
- But this result says that bad isn't that awful
- Such results are always intriguing because we don't know the optimal schedule (the problem is NP-complete), but we can say how far from it we are!
- Let's look at the sketch of the proof
31. Sketch of the proof
- Consider a task that ends last
32. Sketch of the proof
- Consider the latest time before that task's beginning at which at least one processor is idle
33. Sketch of the proof
- Why isn't the red task running earlier?
- It has to be because one of its parents is running
- Otherwise, it wouldn't be list scheduling!
34. Sketch of the proof
- Let's look at the task's parent
35. Sketch of the proof
- And at the latest time before the parent's beginning at which there is an idle processor
- We ask the same question as before
36-37. Sketch of the proof
[Figures: the same argument repeated up the chain of parents.]
38. Sketch of the proof
- In the end we have found a path in the DAG such that there cannot be any idle processor when the tasks on this path are not running
39. Sketch of the proof
- In the end we have found a path in the DAG such that there cannot be any idle processor when the tasks on this path are not running
- Let L be the length of that path
- The most idle time possible occurs when ALL processors except the one executing the current path task are idle while the tasks on the path are running
- Let Idle be the total amount of processor idle time
- We have: Idle ≤ (p-1) × L
40. Counting the Boxes
- If we add up all the boxes (white and gray) together, we get p × MS(S,p)
- The area of the big rectangle
- The white boxes correspond to Idle
- The gray boxes correspond to the sequential execution time, Seq
41. Counting the Boxes
- We have: p × MS(S,p) = Idle + Seq
- We had before that Idle ≤ (p-1) × L
- But L ≤ MSOpt(p)
- And MSOpt(p) ≥ Seq / p
- Therefore Seq ≤ p × MSOpt(p)
- We obtain:
- p × MS(S,p) ≤ (p-1) × MSOpt(p) + p × MSOpt(p)
- which means that MS(S,p) ≤ (2 - 1/p) × MSOpt(p)
- The theorem is proven!
- In fact, it can be proven that this is the best possible bound!
42. List Scheduling Heuristics
- A typical scheduling algorithm (a simulation sketch follows below):
- while tasks are left to schedule
-   determine the set of ready tasks
-   pick one of the ready tasks
-   pick one of the available hosts
-   assign the task to the host
- end while
- This algorithm works offline, by placing slots in a Gantt chart for all tasks
- Once all tasks have been assigned to processors throughout time, one can just follow the schedule
- Each processor computes a subset of the tasks in some order
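For concreteness, here is a simplified C simulation of this loop on a small DAG, using a plain lowest-index (FIFO-like) priority among ready tasks and the earliest-idle host; the 4-task DAG, its durations, and its dependences are all hypothetical.

```c
#include <stdio.h>

#define NT 4   /* tasks */
#define NP 2   /* processors */

int main(void) {
    /* Hypothetical DAG: dep[i][j] = 1 if task j must precede task i */
    int w[NT] = {3, 2, 2, 4};        /* task durations */
    int dep[NT][NT] = {{0,0,0,0},    /* task 0: no parents    */
                       {1,0,0,0},    /* task 1 depends on 0   */
                       {1,0,0,0},    /* task 2 depends on 0   */
                       {0,1,1,0}};   /* task 3 depends on 1,2 */
    int finish[NT], done[NT] = {0};
    int free_at[NP] = {0};           /* time each processor becomes idle */

    for (int scheduled = 0; scheduled < NT; scheduled++) {
        /* Pick the lowest-index ready task (all parents done) */
        int t = -1, ready_at = 0;
        for (int i = 0; i < NT && t < 0; i++) {
            if (done[i]) continue;
            int ok = 1, r = 0;
            for (int j = 0; j < NT; j++)
                if (dep[i][j]) {
                    if (!done[j]) ok = 0;
                    else if (finish[j] > r) r = finish[j];
                }
            if (ok) { t = i; ready_at = r; }
        }
        /* Assign it to the processor that becomes idle first */
        int p = 0;
        for (int q = 1; q < NP; q++)
            if (free_at[q] < free_at[p]) p = q;
        int start = free_at[p] > ready_at ? free_at[p] : ready_at;
        finish[t] = start + w[t];
        free_at[p] = finish[t];
        done[t] = 1;
        printf("task %d on proc %d: [%d, %d)\n", t, p, start, finish[t]);
    }
    return 0;
}
```

This is only a sketch: a real list scheduler would also be able to insert a ready task into an earlier idle gap rather than always appending at the end of a processor's schedule.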
43. Independent Tasks
- Assume that there are no edges in the DAG
- Then any task can be scheduled at any time at which a processor is idle
- The algorithm becomes:
- compute a priority for each task
- while tasks are left to schedule
-   pick the task with the highest priority
-   schedule it on the first available host
- end while
- Remaining question: how do we define the priority?
44Independent Tasks
- One probably reasonable idea is to give higher
priority to the longer tasks
45Independent Tasks
- One possibility is to give higher priority to
the longer tasks
3
2
8
3 processors
1
4
7
9
0
5
6
time
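A minimal C sketch of this "longest task first" idea for independent tasks; the task durations are made up.

```c
#include <stdio.h>
#include <stdlib.h>

#define NT 8
#define NP 3

/* qsort comparator: descending order */
static int cmp_desc(const void *a, const void *b) {
    return *(const int *)b - *(const int *)a;
}

int main(void) {
    /* Hypothetical independent task durations */
    int w[NT] = {3, 2, 8, 1, 4, 7, 9, 5};
    int load[NP] = {0};

    /* Longest task first */
    qsort(w, NT, sizeof(int), cmp_desc);

    for (int i = 0; i < NT; i++) {
        /* least-loaded (first-available) processor */
        int p = 0;
        for (int q = 1; q < NP; q++)
            if (load[q] < load[p]) p = q;
        printf("task of length %d -> proc %d at time %d\n", w[i], p, load[p]);
        load[p] += w[i];
    }

    int makespan = 0;
    for (int p = 0; p < NP; p++)
        if (load[p] > makespan) makespan = load[p];
    printf("makespan = %d\n", makespan);
    return 0;
}
```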
46-47. Heterogeneous Processors
- What if not all processors are identical?
- Heterogeneous compute speeds
- The algorithm can be modified as follows:
- while tasks are left to schedule
-   for all unscheduled tasks Ti
-     for all hosts Hj
-       compute the completion time CTi,j of Ti on Hj
-     end for
-     compute the priority Pi as a function of all the CTi,j
-   end for
-   pick the task Tk with the best Pk
-   pick the host Hl that minimizes CTk,l
-   schedule task Tk on host Hl
- end while
- Note that the priority computation is inside the loop (dynamic priorities)
48-49. Two parameters
- The loop above leaves two choices open:
- how to define the priority Pi as a function of the CTi,j: e.g., the min of the CTi,j, the max of the CTi,j, or the difference between the two smallest CTi,j
- how to define the "best" Pk: e.g., max or min
50. List Scheduling
- MinMin (aggressively pick the task that can be done soonest; a sketch follows below):
- for each task T, pick the host H that achieves the smallest CT for task T
- pick the task with the smallest such CT
- schedule T on H
- MaxMin (pick the largest tasks first):
- for each task T, pick the host H that achieves the smallest CT for task T
- pick the task with the largest such CT
- schedule T on H
- Sufferage (pick the task that would suffer the most if not picked):
- for each task T, pick the host H that achieves the smallest CT for task T
- for each task T, pick the host H' that achieves the second smallest CT' for task T
- pick the task with the largest (CT' - CT) value
- schedule T on H
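Here is a small C sketch of MinMin, run on my reconstruction of the 3-task, 3-host matrix used in the example a couple of slides below (the exec[t][h] layout is my assumption); it reproduces that schedule: T2 on H2, then T1 on H1, then T3 on H3, for a makespan of 27.

```c
#include <stdio.h>

#define NT 3
#define NH 3

int main(void) {
    /* Execution time of each task on each host: exec[t][h]
       (reconstructed from the slide's example matrix) */
    int exec[NT][NH] = {{10, 16, 70},
                        {24,  8, 12},
                        {23, 30, 27}};
    int ready[NH] = {0};   /* time each host becomes free */
    int done[NT] = {0};

    for (int step = 0; step < NT; step++) {
        int best_t = -1, best_h = -1, best_ct = 0;
        for (int t = 0; t < NT; t++) {
            if (done[t]) continue;
            /* host with the smallest completion time for task t */
            int h_min = 0;
            for (int h = 1; h < NH; h++)
                if (ready[h] + exec[t][h] < ready[h_min] + exec[t][h_min])
                    h_min = h;
            int ct = ready[h_min] + exec[t][h_min];
            /* MinMin: keep the task whose best CT is smallest */
            if (best_t < 0 || ct < best_ct) {
                best_t = t; best_h = h_min; best_ct = ct;
            }
        }
        ready[best_h] = best_ct;
        done[best_t] = 1;
        printf("T%d -> H%d, completes at %d\n",
               best_t + 1, best_h + 1, best_ct);
    }
    return 0;
}
```

Swapping the inner selection to keep the task whose best CT is largest turns this into MaxMin.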
51. Heterogeneity?
- Uniform heterogeneity: if task A takes time TA and task B takes time TB on a processor P, then task A takes time α·TA and task B takes time α·TB on another processor P', for all tasks and processors
- Otherwise we have non-uniform heterogeneity
[Figure: two example task-host time matrices, one uniform and one non-uniform.]
52-58. Example (MinMin)
- Completion-time matrix (rows: hosts H1-H3; columns: tasks T1-T3):

        T1   T2   T3
   H1   10   24   23
   H2   16    8   30
   H3   70   12   27

- MinMin algorithm:
- best completion times: P1 = 10 (on H1), P2 = 8 (on H2), P3 = 23 (on H1)
- Pick T2, schedule it on H2
- Update the matrix (H2 is now busy until time 8):

        T1   T3
   H1   10   23
   H2   24   38
   H3   70   27

- best completion times: P1 = 10, P3 = 23
- Pick T1, schedule it on H1
- Update the matrix (H1 is now busy until time 10):

        T3
   H1   33
   H2   38
   H3   27

- P3 = 27
- Pick T3, schedule it on H3
- makespan = 27 seconds
59. Example (MaxMin)
- Same initial matrix as before:

        T1   T2   T3
   H1   10   24   23
   H2   16    8   30
   H3   70   12   27

- MaxMin algorithm:
- best completion times: P1 = 10, P2 = 8, P3 = 23
- Pick T3 (largest), schedule it on H1
- Update the matrix (H1 is now busy until time 23):

        T1   T2
   H1   33   47
   H2   16    8
   H3   70   12

- P1 = 16, P2 = 8
- Pick T1, schedule it on H2
- Update the matrix (H2 is now busy until time 16):

        T2
   H1   47
   H2   24
   H3   12

- P2 = 12
- Pick T2, schedule it on H3
- makespan = 23 seconds
60. Resulting Schedules
- MinMin: machine 1 runs Task 1, machine 2 runs Task 2, machine 3 runs Task 3 (makespan 27)
- MaxMin: machine 1 runs Task 3, machine 2 runs Task 1, machine 3 runs Task 2 (makespan 23)
61. What if we add dependencies?
- One simple way to modify the proposed heuristics is to only consider ready tasks:
- while tasks are left to schedule
-   for all READY tasks Ti
-     for all hosts Hj
-       compute the completion time CTi,j of Ti on Hj
-     end for
-     compute the priority Pi as a function of all the CTi,j
-   end for
-   pick the task Tk with the best Pk
-   pick the host Hl that minimizes CTk,l
-   schedule task Tk on host Hl
- end while
62. What about communications?
- Our MaxMin, MinMin, and Sufferage heuristics do not really work well for DAGs with dependencies and communications
- They are typically used for independent tasks
- Instead, many believe that the key to good DAG scheduling with communication is to consider the critical path
- The goal should be to give priority to the tasks on the critical path, so as to reduce its length
- Since the length of the critical path is a lower bound on the makespan, making that bound as low as possible is probably a good idea
63. Critical path?
- How can we tell that a task is on the critical path?
- The difficulty here is that as scheduling decisions are made, the critical path changes
- If a task's successor is scheduled on the same host as that task: no communication
- Otherwise: communication
- So a task that could be on the critical path due to heavy communication may end up not being on the critical path once that communication has been nullified
64. Example
[Figure: a DAG with task weights and communication weights (values such as 1, 3, 4, 5, 2, 1.5); one path is highlighted in red as the apparent critical path.]
65Example
- By just looking at the DAG, the critical path is
the red one - So I give priority to the red tasks
- Clearly I allocate the second task on the path on
the same host as the first task - Because I want to minimize the critical path
- At that point the remainder of the red path will
take at most 521 8 time units - And the other paths would take at least 5211
9 time units - Therefore the red path is no longer the critical
path!
1
1
10
5
3
4
5
2
2
2
2
2
1
1
1
66. Bottom-Level
- To deal with the previous situation, we define the bottom-level of a task as:
- the sum of the weights along the longest (i.e., heaviest) path from that task to the end of the DAG, including the task's own execution time
- assuming that ALL communications will take place (as opposed to being set to zero)
- Schedule tasks in decreasing order of their bottom-level (see the sketch below)
- At each step, schedule the task with the largest bottom-level on the host that can complete that task the soonest
- This accounts for previous scheduling decisions
67-73. Example
[Figures: a step-by-step run of the bottom-level heuristic on 2 processors. The tasks of the example DAG have bottom-levels BL = 19, 16, 8, 7, 7, 3, and 1; each slide schedules the ready task with the largest bottom-level and removes it from the DAG.]
74. Critical Path Scheduling
- The above algorithm is typically referred to as Modified Critical Path (MCP)
- There are many proposed scheduling algorithms that use similar ideas
- In practice they've been shown to be reasonable
- Survey article by Kwok and Ahmad in JPDC
- Next question: what if the resources are heterogeneous?
- In this case, how can we compute the bottom-level at all, since the times will depend on which resources are picked?
75. MCP on heterogeneous resources
- The typical approach is to compute the bottom-level of a task using averages over all resources
- For each task (computation or communication), compute its average execution time (over all processors or over all network links)
- Other issue: irregular network topologies
- What if there is no direct communication link between two processors, and one must hop through a third one?
- There are many other variations that require modifications of the list scheduling heuristic
- Many modifications have been proposed
76. Mixed Parallelism
- Now that we have some sort of a handle on scheduling DAGs of sequential tasks, what about scheduling DAGs of parallel tasks?
- This is also called mixed parallelism
77. The problem
- Assume you have a cluster
- p homogeneous compute nodes
- homogeneous network
- e.g., a switch
- I have a DAG of tasks
- Each task can run on 1 to p processors
- e.g., MPI tasks
- Questions:
- how many processors to allocate to each task?
- how to schedule the tasks?
78Trade-offs
- If we give small numbers of processors to each
task we have - many small and long tasks
- ability to run many of them in parallel
- If we give large numbers of processors to each
task we have - many large and short tasks
- we cannot run many of them in parallel
- The question is whats the best trade-off?
- Known to be NP-complete
- As usual
79. Example: 2 Schedules
- The schedule should consider the speedup curve of the parallel tasks
- Is it worth it to give a task more processors?
[Figure: two Gantt charts on p processors, one with narrow task allocations and one with wide ones.]
80. The CPA algorithm
- A number of scheduling algorithms for mixed-parallel applications have been proposed
- Most of these algorithms proceed in two phases:
- phase 1: determine how many processors each task should receive
- phase 2: schedule the tasks with some MCP-like algorithm
- Phase 1 is the more interesting one, but let's discuss Phase 2 briefly
81. CPA Phase 2
- The only difficult issue here is the communication among tasks, because of data redistribution
- Consider two tasks:
- T2 depends on T1
- T1 is a matrix multiplication using a 2-D data distribution
- T2 is an LU factorization using a 2-D data distribution
- T1 was allocated 16 processors
- T2 was allocated 4 processors
82. CPA Phase 2
- Data redistribution can be much more complicated than in this example
- The scheduling phase must consider the redistribution cost when computing bottom-levels
- There is a lot of research on good redistribution algorithms
- And a lot of research on redistribution cost estimation
83. CPA Phase 1
- The goal of Phase 1 is to find the best allocation of processors to tasks
- Assumption: for each task, I can predict its execution time on any given number of processors
- I have previous benchmarking results
- I have a good performance model
- The CPA heuristic relies on the fact that we have two lower bounds on the makespan:
- the length of the critical path
- the execution time assuming no idle time
84-85. CPA Phase 1
- Consider an allocation
- that is, a number of processors for each task
- Critical path length: TCP
- computed ignoring data redistribution costs
- not accurate, but a lower bound on the overall makespan
- Ideal makespan: TA
- for each task, compute its execution time times its number of processors
- take the sum over all tasks, and divide by the total number of processors, p
- i.e., TA = (1/p) × Σ over tasks t of (nt × Tt(nt)), where nt is the number of processors allocated to t
- this is a lower bound on the overall makespan
[Figure: all of the work packed perfectly onto p processors, with TA as the resulting makespan.]
86. CPA Phase 1
- Phase 1 starts by giving one processor to each task (a toy sketch follows below)
- TCP is large
- TA is small
- At each step, give one more processor to one task on the (current) critical path
- give the processor to the task that would benefit the most from it
- i.e., the task that would achieve the highest speedup
- Each time we do this, TCP decreases and TA increases
- Stop when TCP ≤ TA
87. CPA rationale
- By picking the allocation that makes both lower bounds equal, one maximizes the chances that the makespan is as low as possible
- Not a true justification
- Just an intuitive notion of why the heuristic should work in practice
[Figure: the two bounds versus algorithm steps; TCP decreases and TA increases until the two curves meet.]
88. CPA
- Note that Phase 2 is stuck with the allocations chosen in Phase 1 and has to schedule them
- Still, as far as we know, a 2-phase approach is the best we've got so far
- What about a heterogeneous cluster, or a set of different homogeneous clusters?
- An actively pursued research question
- How about accounting for redistribution costs in Phase 1?
- Still an open research question
89. Conclusion
- Scheduling is a difficult problem
- One is left coming up with heuristics
- typically based on some type of justifiable intuition
- It is difficult to compare different heuristics theoretically
- Just try them and see what works for the type of DAGs that one needs to execute
- There are a few empirical results, like: "If your DAGs have these characteristics, heuristic 1 tends to be better than heuristic 2"