1. Models to Predict Good Compiler Optimizations
- John Cavazos
- Dept. of Computer and Information Sciences
- University of Delaware
2. Whole-Program Autotuning
STEPS
- Characterize each function
- Prediction model ranks optimization sequences
- Apply top sequences
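The steps above can be sketched as a loop. This is a minimal sketch, not the authors' implementation: the `rank` and `measure` callables stand in for the trained model and for compiling-and-timing, which the slides don't spell out.

```python
# Hedged sketch of the autotuning loop: characterize each hot
# function, let a model rank candidate optimization sequences,
# then measure only the top-ranked few.

def autotune(functions, rank, measure, top_k=10):
    """rank(fn) -> candidate sequences ordered best-first by the model;
    measure(fn, seq) -> running time of fn compiled with seq."""
    best = {}
    for fn in functions:
        best_time = float("inf")
        for seq in rank(fn)[:top_k]:     # evaluate only the model's top-k picks
            t = measure(fn, seq)
            if t < best_time:            # keep the fastest variant found
                best_time, best[fn] = t, seq
    return best
```

Evaluating only the top-k predicted sequences is what makes this cheaper than random search over the full optimization space.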
3. Train and Test Model
- Train on kernels
- Generate training data (inputs, outputs)
- Automatically construct a model
- Can be expensive, but can be done offline
- Supervised learning problem
- Test on whole programs
- Extract features (code characteristics)
- Model predicts optimizations to apply
4. Constructing Models
- Trained on data from Random Search
- 200 evaluations for each benchmark
- Leave-one-out cross validation
- Regression and Support Vector Machines
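Leave-one-out cross-validation can be sketched as follows. The 1-nearest-neighbour "model" here is only a stand-in for the actual regression and SVM models, whose details the slides don't give.

```python
# Leave-one-out cross-validation: hold out one benchmark at a time,
# train on the rest, and record the prediction for the held-out one.

def leave_one_out(benchmarks, train, predict):
    """benchmarks: list of (features, label) pairs."""
    predictions = []
    for i, (feats, _) in enumerate(benchmarks):
        held_in = benchmarks[:i] + benchmarks[i + 1:]  # everything but i
        model = train(held_in)
        predictions.append(predict(model, feats))
    return predictions

def train_1nn(train_set):
    return train_set  # 1-NN just memorizes its training set

def predict_1nn(model, feats):
    # label of the closest training point (squared Euclidean distance)
    closest = min(model,
                  key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], feats)))
    return closest[1]
```

With 25 kernel functions, this trains 25 models, each tested on the one benchmark it never saw.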
5. Are models predictive?
6. Solution Overview
- Static program features: from source code or IR
- Dynamic program features: from running time
7. Solution Overview
- Static program features: from source code or IR
- Dynamic program features: from running time
- Sequence Predictor
- Speedup Predictor
- Tournament Predictor
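As a toy illustration of static characterization (the talk's actual feature set is not reproduced here), features can be the instruction mix of a function's IR, normalized per instruction, e.g. memory accesses per instruction as highlighted for 181.mcf.

```python
from collections import Counter

# Toy static characterization: build an instruction-mix histogram
# from a simplified IR and normalize it to per-instruction rates.

def static_features(ir):
    """ir: list of (opcode, operands...) tuples."""
    counts = Counter(op for op, *_ in ir)
    total = len(ir)
    return {op: count / total for op, count in counts.items()}
```

Dynamic features would instead come from hardware counters collected while the program runs (e.g. via PAPI, as in the experimental setup).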
8. Characterization of 181.mcf
9. Characterization of 181.mcf
Problem: greater number of memory accesses per instruction than average.
10. Sequence Predictor
11. Speedup Predictor
12. Tournament Predictor
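The three predictors differ in how the model's output is turned into a choice of sequence. A hedged sketch of the latter two (the `predict` and `beats` callables stand in for the trained regression or SVM models; these interfaces are illustrative, not the authors' API):

```python
# Speedup predictor: the model scores one (program, sequence) pair;
# ranking falls out of sorting candidates by predicted speedup.
def rank_with_speedup_predictor(features, candidates, predict):
    return sorted(candidates, key=lambda seq: predict(features, seq),
                  reverse=True)

# Tournament predictor: the model compares two sequences at a time;
# repeated pairwise comparisons yield an overall winner.
def best_by_tournament(features, candidates, beats):
    """beats(features, a, b) is True when a is predicted faster than b."""
    winner = candidates[0]
    for challenger in candidates[1:]:
        if beats(features, challenger, winner):
            winner = challenger
    return winner
```

The sequence predictor, by contrast, maps program features directly to a ranked list of sequences, so it needs no wrapper like these.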
13. Open64 Optimizations
- Control 63 Open64 optimizations
- Loop optimizations
- Unrolling
- Interchange
- Fusion / Fission
- Prefetching
- Traditional optimizations
- PRE, copy propagation, strength reduction, CSE, etc.
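Random search over such flags, used later to generate training data (200 sequences per benchmark), can be sketched as below. The flag names are illustrative placeholders, not the exact 63 Open64 options.

```python
import random

# Illustrative flag names only -- the study controls 63 real Open64
# options (unrolling, interchange, fusion/fission, prefetching, PRE,
# CSE, ...), whose exact spellings are not reproduced here.
FLAGS = ["-LNO:full_unroll=", "-LNO:interchange=", "-LNO:fusion=",
         "-LNO:prefetch=", "-WOPT:pre=", "-WOPT:cse="]

def random_sequence(rng):
    """One random optimization sequence: each flag is independently
    switched on (1) or off (0)."""
    return [flag + str(rng.randint(0, 1)) for flag in FLAGS]

def random_search(n, seed=0):
    """Generate n random sequences (the talk evaluates 200 per benchmark)."""
    rng = random.Random(seed)
    return [random_sequence(rng) for _ in range(n)]
```

Each evaluated sequence, together with the features of the benchmark and the measured speedup, becomes one training example.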
14. Experimental Setup
- HPCToolkit / PAPI 3.6
- Intel quad-core @ 2 GHz with 8 GB RAM
- Open64 compiler version 4.2
- Baseline: -Ofast
- 200 randomly generated sequences
- Benchmarks
- 25 hot functions in kernels from Linpack, NAS, and Polybench
- 16 programs from MiBench
15. Experimental Setup
- 3 Prediction Models
- Sequence Predictor
- Speedup Predictor
- Tournament Predictor
- Machine Learning Algorithms
- Regression
- Support Vector Machine (SVM)
16. Regression (10 evals)
Averages: Sequence 9, Tournament 21, Speedup 25
17. SVM (10 evals)
Averages: Sequence 9, Tournament 15, Speedup 23
18. Future Work
- Apply to whole applications
- Apply to different compilers and architectures
- Compare different characterization methods
- Source code vs. dynamic characterization
19. Across Different Machines
- Machine 1
- 4 × Intel Core2 Quad CPU Q9650 @ 3.00 GHz
- RAM: 8 GB
- Cache size: 6144 KB
- Machine 2
- 4 × Intel Core2 Quad CPU Q9300 @ 2.50 GHz
- RAM: 4 GB
- Cache size: 3072 KB
- Machine 3
- 4 × Intel Xeon CPU E5335 @ 2.00 GHz
- RAM: 2 GB
- Cache size: 4096 KB
20. MiBench Across Machines
21. Case Study: PoCC
- Experimental setup
- Intel Xeon E5620 @ 2.4 GHz
- 16 hardware threads
- Baseline: ICC -fast
- 768 randomly generated sequences
- PoCC (Polyhedral Compiler Collection)
- Unrolling, tiling, loop fusion, auto-parallelization
- Polybench (28 kernels)
- Speedup Predictor (Regression / SVM)
22. PoCC (10 evals)
Chart values: 18.26, 20.67, 14.44
Average: Random 3.1×, SVM 5.8×, LR 5.9×, Best 6.1×
23. Model Evaluation Summary
- Comparison of 3 prediction models with 2 machine learning algorithms on kernels
- Newly proposed and evaluated models (speedup predictor and tournament predictor) outperformed the state-of-the-art predictor
- Applying the speedup predictor trained with kernels to MiBench
- For seen sequences: 5.4 for regression, 4.6 for SVM
- For unseen sequences: 5.1 for regression, 2.1 for SVM

              Sequence Predictor   Speedup Predictor   Tournament Predictor
Regression    8.7                  25.0                20.7
SVM           8.8                  22.5                14.6
24. Regression/SVM - MiBench (Speedup Predictor / 10 evals)
- Regression: seen sequences 5.4, unseen sequences 5.1
- SVM: seen sequences 4.6, unseen sequences 2.2
25. Dynamic Program Features
26. Optimizations

Phase  List of Optimizations
OPT    align-padding, ptr-opt, swp, unroll-size
WOPT   aggcm, aggstr-reduction, const-pre, copy-propagate, bdce,
       dce-aggressive, dce-global, hoisting, iv-elimination, spre,
       value-numbering, dse-aggressive, unroll, canon-expr,
       aggcm-threshold, combine, intrinsic, mem-opnds, fold2const
LNO    optimize-cache, lego, prefetch-stores, prefetch-ahead,
       interchange, pure, fusion, hoistif, blocking-size, ecspct,
       fission, fission-inner-register-limit, full-unroll,
       full-unroll-size, fusion-peeling-limit, outer-unroll-max,
       sclrze, outer-unroll-further, max-depth, outer-unroll-prod-max,
       shackle, svr, cse, preferred-doacross-tile-size,
       prefetch-cache-factor, vintr, split-tile, olf-ub, unswitch,
       lego-local, call-info, prefetch, apply-illegal-xform-directives
CG     unroll-fully, gcm, loop-opt
IPA    aggr-cprop, cprop, dce, dve
27. PoCC (1 eval)
Average: SVM 188.9, LR 256.53
28. Whole-Program Autotuning
- Current solution
- Outline hot functions
- Tune each hot function in isolation
- Integrate the tuned hot function back into the application
- Disadvantage
- Does not account for code interactions
29. Proposed Solution
- Tune several hot functions at the same time
- Cannot afford random exploration
- A performance model ranks variants
- Apply only the predicted-best optimizations
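The proposed joint tuning could be sketched as below. This is illustrative only: it enumerates every one-sequence-per-function assignment, which is feasible only for tiny candidate sets; a real system would rank a sampled subset rather than the whole cross product.

```python
from itertools import product

# Sketch of joint tuning: rank whole assignments -- one optimization
# sequence per hot function -- by a performance model's predicted
# score, so only the predicted-best variants need to be compiled
# and run. Names and interfaces are illustrative.

def rank_joint_variants(functions, candidates, score):
    """candidates: {function: [sequence, ...]};
    score(variant) -> model's predicted whole-program quality
    (higher is better), where variant maps function -> sequence."""
    variants = product(*(candidates[f] for f in functions))
    return sorted((dict(zip(functions, v)) for v in variants),
                  key=score, reverse=True)
```

Because the model scores whole assignments rather than functions in isolation, interactions between the tuned functions can, in principle, be captured by the score.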