Title: Adapting Convergent Scheduling Using Machine Learning
1. Adapting Convergent Scheduling Using Machine Learning
- Diego Puppin, Mark Stephenson, Una-May O'Reilly, Martin Martin, and Saman Amarasinghe
Institute for Information Science and Technologies, Italy; Massachusetts Institute of Technology, USA
2. Outline
- This talk shows how one can apply machine learning techniques to find good phase orderings for an instruction scheduler
- First, I'll introduce the scheduler that we are interested in improving
- Then, I'll discuss genetic programming
- Then, I'll present experimental results
3. Clustered Architectures
- Memory and registers separated into clusters
- RAW
- Clustered VLIWs
- When scheduling, we try to co-locate data with computation
4. Convergent Scheduling
- Convergent scheduling passes are symmetric
- Each pass takes as input a preference map and outputs a preference map
- Passes are modular and can be applied in any order
5. Convergent Scheduling: Preference Maps
- Each entry is a weight
- The weights correspond to the confidence of a space-time assignment for a given instruction (a minimal sketch follows)
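To make the data structure concrete, here is a minimal Python sketch (an illustration, not the authors' implementation): one weight per (instruction, cluster, time slot) triple, and every pass is simply a function from map to map.

```python
import random

# Sketch of a preference map: one confidence weight per
# (instruction, cluster, time_slot) space-time assignment.
def make_preference_map(instructions, n_clusters, n_slots):
    w = 1.0 / (n_clusters * n_slots)          # start undecided: uniform weights
    return {(i, c, t): w
            for i in instructions
            for c in range(n_clusters)
            for t in range(n_slots)}

# Every pass has the same shape: preference map in, preference map out.
# This toy stands in for the "noise introduction" heuristic listed later:
# it perturbs weights slightly to break ties.
def noise_pass(pmap):
    return {k: v * random.uniform(0.9, 1.1) for k, v in pmap.items()}
```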
6. Example Dependence Graph
- Four clusters
- High confidence
- Low confidence
7. Placement Propagation
8. Critical Path Strengthening
9. Path Propagation
10. Parallelism Distribution
11. Path Propagation
12. Communication Reduction
13. Path Propagation
14. Final Schedule
15. Convergent Scheduling
- Classical scheduling passes make absolute decisions that can't be undone
- Convergent scheduling passes make soft decisions in the form of preferences
- Mistakes made early on can be undone
- Passes don't impose an order! (see the composition sketch below)
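Because every pass shares the map-in/map-out signature sketched above, any ordering composes. A minimal sketch, assuming passes like `noise_pass`:

```python
# Soft decisions compose: run any list of passes, in any order, over the map.
def run_sequence(passes, pmap):
    for p in passes:
        pmap = p(pmap)   # later passes can revise what earlier ones preferred
    return pmap
```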
16. Double-Edged Sword
- The good news: convergent scheduling does not constrain phase order
- A nice interface makes writing and integrating passes easy
- The bad news: convergent scheduling does not constrain phase order
- There is a limitless number of phase orders to consider, some of which are much better than others
17. Our Proposal
- Use genetic programming to automatically search for a phase ordering that's tailored to a given
- Architecture
- Compiler
- Our inspiration comes from Cooper's work [Cooper et al., LCTES 1999]
18. Genetic Programming
- A search algorithm analogous to Darwinian evolution
- Maintain a population of expressions
(sequence INITTIME (sequence PLACE (if imbalanced LOAD COMM)))
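A minimal sketch of how such an expression could be represented and run (an assumed encoding, not the paper's code): leaves name convergent passes, `sequence` runs two subexpressions in order, and `if` branches on a predicate over the current schedule state. `PASSES` and `TESTS` are assumed lookup tables; here all entries are placeholders reusing `noise_pass` from above.

```python
# Assumed tables tying leaf names to real passes and predicates.
PASSES = {"INITTIME": noise_pass, "PLACE": noise_pass,
          "LOAD": noise_pass, "COMM": noise_pass}      # placeholders
TESTS = {"imbalanced": lambda state: state.get("imbalanced", False)}

def run_expr(expr, pmap, state):
    if isinstance(expr, str):                          # leaf: run one pass
        return PASSES[expr](pmap)
    op = expr[0]
    if op == "sequence":                               # run left, then right
        return run_expr(expr[2], run_expr(expr[1], pmap, state), state)
    if op == "if":                                     # branch on schedule state
        branch = expr[2] if TESTS[expr[1]](state) else expr[3]
        return run_expr(branch, pmap, state)

# The expression from this slide:
expr = ("sequence", "INITTIME",
        ("sequence", "PLACE", ("if", "imbalanced", "LOAD", "COMM")))
```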
19. Genetic Programming
- A search algorithm analogous to Darwinian evolution
- Maintain a population of expressions
- Selection
- The fittest expressions in the population are more likely to reproduce
- Reproduction
- Crossing over subexpressions of two expressions
- Mutation
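A sketch of these three operators over the tuple/string expressions above (a simplified illustration: for brevity the generator emits only `sequence` nodes, while the real system also evolves `if` nodes):

```python
import random

PASS_NAMES = ["INITTIME", "PLACE", "LOAD", "COMM"]   # assumed leaf vocabulary

def random_expr(depth=3):
    if depth == 0 or random.random() < 0.4:
        return random.choice(PASS_NAMES)             # leaf: one pass
    return ("sequence", random_expr(depth - 1), random_expr(depth - 1))

def subtrees(e):
    yield e
    if isinstance(e, tuple):
        for child in e[1:]:
            yield from subtrees(child)

def splice(e, target, repl):
    """Copy of e with the node `target` (matched by identity) replaced."""
    if e is target:
        return repl
    if isinstance(e, tuple):
        return (e[0],) + tuple(splice(c, target, repl) for c in e[1:])
    return e

def tournament_select(pop, fitness, k=4):
    """Selection: fitter expressions are more likely to reproduce."""
    return max(random.sample(pop, k), key=fitness)

def crossover(a, b):
    """Reproduction: graft a random subexpression of b into a."""
    return splice(a, random.choice(list(subtrees(a))),
                  random.choice(list(subtrees(b))))

def mutate(e):
    """Mutation: replace a random subexpression with a fresh one."""
    return splice(e, random.choice(list(subtrees(e))), random_expr(2))
```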
20. General Flow
- Randomly generated initial population
[Flow diagram: Create initial population (initial solutions) → Evaluation → done? → Selection → Create Variants → back to Evaluation]
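The diagram's loop, sketched with the operators above and a fitness function like the one described on the next slides (population size, generation budget, elitism, and mutation rate are all assumptions; in reality each fitness evaluation is a full compile-and-run):

```python
def evolve(fitness, pop_size=100, generations=50):
    pop = [random_expr() for _ in range(pop_size)]    # initial solutions
    for _ in range(generations):                      # "done?": fixed budget
        nxt = [max(pop, key=fitness)]                 # elitism (an assumption)
        while len(nxt) < pop_size:                    # selection + variants
            child = crossover(tournament_select(pop, fitness),
                              tournament_select(pop, fitness))
            if random.random() < 0.1:                 # assumed mutation rate
                child = mutate(child)
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)
```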
21. General Flow
- The compiler is modified to use the given expression as the phase ordering
- Each expression is evaluated by compiling and running the benchmark(s)
- Fitness is the relative speedup over our original phase ordering on the benchmark(s)
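A sketch of this fitness measure; `compile_with` and `run_cycles` are assumed stand-ins for the modified compiler and the VLIW simulator:

```python
def speedup(expr, bench, baseline_cycles):
    """Relative speedup of this phase ordering over the original one."""
    binary = compile_with(expr, bench)    # phase ordering taken from `expr`
    return baseline_cycles[bench] / run_cycles(binary)
```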
22. General Flow
- Just as with natural selection, the fittest individuals are more likely to survive
23. General Flow
- Use crossover and mutation to generate new expressions
- And thus generate new and hopefully improved phase orderings
24. Experimental Setup
- We use an in-house VLIW compiler (SUIF, MachSUIF) and simulator
- Compiler and simulator are parameterized so we can easily change VLIW configurations
- Experiments presented here are for clustered architectures
- Details of the architectures are in the paper
25. Convergent Scheduling Heuristics
- Noise Introduction
- Initial Time Assignment
- Preplacement
- Critical Path Strengthening
- Communication Minimization
- Parallelism Distribution
- Load Balance
- Dependence Enforcement
- Assignment Strengthening
- Functional Unit Distribution
- Push to first cluster
- Critical Path Distance
- Cluster Creation
- Register Pressure Reduction in Time
- Register Pressure Reduction in Space
26. Hand-Tuned Results: 4-cluster VLIW, Rich Interconnect
27. Results: 4-cluster VLIW, Limited Interconnect
28. Training an Improved Sequence
- Goal: find a sequence that works well for all the benchmarks in the last graph (vmul, rbsorf, yuv, etc.)
- Train a sequence using these benchmarks:
- For each expression in the population, compile and run all the benchmarks, taking the average speedup as fitness (see the sketch below)
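A sketch of the training fitness, reusing `speedup` from earlier; the benchmark names are taken from this slide, the rest of the suite is elided:

```python
TRAINING = ["vmul", "rbsorf", "yuv"]   # etc., per the graph on slide 27

def suite_fitness(expr, baseline_cycles):
    """Average relative speedup over the whole training suite."""
    return (sum(speedup(expr, b, baseline_cycles) for b in TRAINING)
            / len(TRAINING))

# e.g. best = evolve(lambda e: suite_fitness(e, baseline_cycles))
```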
29. The Schedule
- The evolved sequence is much more conservative in communication:
- inittime → func → dep → func → load → func → dep → func → comm → dep → func → comm → place
- func reduces the weights of instructions on overloaded clusters
- dep increases the probability that a dependent instruction is scheduled nearby
- comm tries to keep neighboring instructions in the same cluster
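The evolved sequence written as a flat pass list (a sketch; the lower-case names would map to the heuristics of slide 25 via a table like `PASSES` above):

```python
EVOLVED = ["inittime", "func", "dep", "func", "load", "func",
           "dep", "func", "comm", "dep", "func", "comm", "place"]
# e.g. final_map = run_sequence([PASSES[n] for n in EVOLVED], initial_map)
```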
30. Results: 4-cluster VLIW, Limited Interconnect
31. Results: Leave-One-Out Cross Validation
32. Summary of Results
- When we changed the architecture, the hand-tuned sequence failed
- UAS and PCC outperform convergent scheduling
- Our GP system found a sequence that usually outperforms UAS and PCC
- Cross validation suggests that it is possible to find a general-purpose sequence
33. Running Time
- Using about 20 machines in a small cluster of workstations, it takes about 2 days to evolve a sequence
- This is a one-time process!
- Performed by the compiler vendor
34. Disappointing Result
- Unfortunately, sequences with conditionals are weeded out of the GP selection process
- Our system rewards parsimony
- Convergent scheduling passes make soft decisions, so running an extra pass may not be detrimental
- We'd like to get to the bottom of this unexpected result
35. Conclusions
- Using GP, we're able to find architecture-specific, application-independent sequences
- We can quickly retune the compiler when
- The architecture changes
- The compiler itself changes
37. Implemented Tests