Title: Maximizing Intel
1Maximizing Intel Compiler Performance
usingIterative Feedback Directed Optimization
(IFDO)
- Ilya Cherny
- Software Manager
- Software and Services Group
- Thanks to Leonid Brusencov, Sergey Ermolaev,
Artem Chirtsov, Sergey Grebenkin! - October 23, 2008
2Agenda
- Why IFDO
- What is IFDO
- Alternative approaches
- Experimental Results
- Future plans
3A Performance Experiment
- Take 3 nested loops for matrix multiplication
- for(i 0 i lt N i)
- for(j 0 j lt N j)
- cij 0
- for(k 0 k lt N k)
- cij aik bkj
-
- Compile with maximum optimization
- icl O3 matmul.c
- Run and measure time 20 sec
- Compile with enforced vectorization
- icl -O3 mP2OPT_vec_alwaysT
- Run and measure time 11 sec
gticl help .. /O3 optimize for maximum speed
and enable high-level optimizations ..
Why dont switch on vectorization always?
4A Performance Experiment II
- Switch i and j loops, move initialization
loop outside - for(i 0 i lt N i)
- for(j 0 j lt N j)
- cij 0
- for(j 0 j lt N j)
- for(i 0 i lt N i)
- for(k 0 k lt N k)
- cij aik bkj
-
- Compiler with O3 optimization
- Measure run time 18 sec
- Compile with enforced vectorization
- Measure run time 21 sec
Thoughtless optimization loses 15
5Why Optimizations Do Not Always Win?
- Source code adding 4 integers
for(i0ilt4i) AiAiBi
Compile to scalar code
Compile to vector code
loop_start mov ebx, Bedi add Aedi,
ebx inc edi loop loop_start
movdqa xmm1, A paddd xmm1, B movdqa A, xmm1
- Vector code could be 2-4x times faster, but..
- ..what if array size is not a multiple of 4?
- ..what if A is not aligned by 16 bytes?
- ..what if A and B aligned differently?
- Final code has an overhead of these 3 if
6Heuristics for Parameterization of Compiler
Optimizations
- Compiler has hundreds of performance
optimizations - Each optimization has at least 1 Boolean
parameter - Compiler has heuristics to make a decision about
each optimization - But still code produced by the compiler is far
from optimal - Each application has unique behavior (loop
counts, memory accesses) - Each machine has unique characteristics
(latencies, caches) - It is impossible to target ALL compiler
heuristics for ALL machines for ALL applications
Compiler optimizations have considerableperforman
ce headroom if parameterized right
7Agenda
- Why IFDO
- What is IFDO
- Alternative approaches
- Experimental Results
- Future plans
8IFDO process
2.FeedbackRun executable with profiling
- 1. Optimization Compile executable from the
sources
3.DirectedAnalyze results, define options for
the next compilation
4.IterativeRepeat while time permits
9Search Algorithms
- Implemented 6 algorithms to do search in the
options space 20 - exhaustive search with priorities
- batch elimination
- iterative elimination
- combined elimination
- genetic algorithm
- statistical selection 23
Search algorithms find the maximum much faster
than after 2n iterations
10Compiler Optimization Options
- Selected 6 undocumented options which seem to
have the highest impact - loop vectorization
- loop fusion
- loop distribution
- loop unrolling
- data blocking
- memory prefetch
- Number of total independent option values is 13
- Search space for these 6 options has 7000
combinations - Extracted all compiler options from the sources
and just started the runs for more than 1000
options
11IFDO Tool
- User defines how to build and execute his
application - IFDO tool outputs the best binary and all
performance data - Based on IFDO results user can
- change compilation options or pragmas
- modify sources
- improve compiler
IFDO tool automates iteration process
12IFDO Sample Output - SPEC CPU2000 173.applu
- iter stat proc ticks impr noise
parameter val parameter val - --------------------------------------------------
----------------------------- - 1 ... Total 9571493442 0.00 0.21
- 1 ... _SSOR 7335205487 0.00 0.26
VectorizeP 1 BlockP 1 - 1 ... _RHS 2218915958 0.00 0.02
VectorizeP 1 BlockP 1 - 2 ... Total 6234529813 34.86 0.01
- 2 ... SSOR 4048766558 44.80 0.01
VectorizeP 1 BlockP 2 - 2 ... _RHS 2168605738 2.27 0.05
VectorizeP 1 BlockP 2 - 3 ... Total 12889959745 -34.67 0.06
- 3 ... SSOR 11309558116 -54.18 0.07
VectorizeP 2 BlockP 1 - 3 ... RHS 1563970212 29.52 0.01
VectorizeP 2 BlockP 1 - 4 ... Total 10806270474 -12.90 0.15
- 4 ... _SSOR 8657609979 -18.03 0.10
VectorizeP 3 BlockP 1 - 4 ... _RHS 2102104477 5.26 0.01
VectorizeP 3 BlockP 1 - 5 .. Total 5602632636 41.47 0.00
But Intel Compiler improved since 10.0 SPEC
173.applu has 8 only headroom now!
41 performance gain!
13Agenda
- Why IFDO
- What is IFDO
- Alternative approaches
- Experimental Results
- Future plans
14Comparison to other publications
- There are 23 references to the similar papers
- No publications have compiler with procedure
level granularity - All publications were limited by about 50
options, while we started exploring 1000 of
undocumented compiler options
Our work has two novel characteristics
15Comparison to other tools
Granularity Profiling Conclusion
Manual options search - whole application or source changes -/ any profiler, but manually user time consuming
PGOprof_gen/prof_use) basic block level - basic block counters only 2 iterations only
PathScale PathOpt2 - whole application - whole application 40 less results
IFDO function or loop instrumentation, VTune not available in the product
16Agenda
- Why IFDO
- What is IFDO
- Alternative approaches
- Experimental Results
- Future plans
17Search Algorithms Performance Growth dependency
on Iteration
- BE works for independent options only, but just
14 iterations - IE and CE get 99 in 30 iterations the most
effective
All algorithms gain 2.5-4 in CPU2000 total time
18Procedure vs. Application granularity
8 from 22 benchmarks gain from procedure level
19Options Values Contribution to Performance
Increase
Each option gives about 2 percent
201000 Options Impact on Performance
Only 600 from 3000 options have zero impact
21Agenda
- Why IFDO
- What is IFDO
- Alternative approaches
- Experimental Results
- Future plans
22Future Plans
- Make experiments with all undocumented options
- 3000 values if no impact on each other
- more than 23000 combinations!
- Implement storing of application properties
- number of FP expressions, number of loops, etc.
- Implement expert/machine learning system
- suggest options according to application
properties - may decrease number of iterations down to 1
- can substitute existing compiler heuristics?
Useful to both compiler developers and users
23Summary
- If performances critical, try at least icl O3!
24???????! Thanks!
25References 1-13
- 1 F. Bodin, T. Kisuki, P. Knijnenburg, M.
OBoyle, and E. Rohou, Iterative compilation in
a non-linear optimization space, In Proc. ACM
Workshop on Profile and Feedback Directed
Compilation, 1998, Organized in conjunction with
PACT98. - 2 K. Cooper, D. Subramanian, and L. Torczon,
Adaptive optimizing compilers for the 21st
century, J. of Supercomputing, 32(1), 2002. - 3 J. Bilmes, K. Asanovic, C. Chin, and J.
Demmel, Optimizing matrix multiply using PHiPAC
A portable, high- performance, ANSI C coding
methodology, In Proc. ICS, pages 340-347, 1997. - 4 M. Stephenson and S. Amarasinghe, Prediction
unroll factor using supervised classification,
In ERRR/ACM International Symposium on Code
Generation and Optimization (CGO 2005), ERRR
Computer Society, 2005. - 5 Yom-Toy, J. Thomson, O. Temam, A. Zaks, H.
Leather, C. Miranda, M. Namolaru, E. Bonilla,
Saclay, B. Mendelson, C. Williams, Haifa, M.
OBoyle, P. Barnard, E. Ashton, E. Courtois, F.
Bodin MILEPOST GCC machine learning based
research compiler, ARC, International, UK, CAPS
Enterprise, France, 2007. - 6 K. Hoste, L. Eeckhout, COLE Compiler
Optimization Level Exploration, ELIS Department,
Ghent University, Sing-Pietersnieuwstraat 41,
B-9000 Gent, Belgium, 2008. - 7 K. Deb, Multi-Objective Optimization using
Evolutionary Algorithms, Wiley, 2001. - 8 G. Fursin, J. Cavazos, M. OBoyle, and O.
Temam, MiDataSets Creating the Conditions for a
More Realistic Evaluation of Iterative
Optimization, ALCHEMY Group, INRIA Futurs and
LRI, Paris-Sud University, France, 2007. - 9 M. Byler, M. Wolfe, J.R.B. Davies, C. Huson,
and B. Leasure, Multiple version loops. In
ICPP, 1987, pages 312-318, 2005. - 10 K. D. Cooper, M. W. Hall, and K. Kennedy,
Procedure cloning, In Proceedings of the 1992
IEEE International Conference on Computer
Language, pages 99-105, 1992. - 11 P. Diniz and M. Rinard. Dynamic feedback
An effective technique for adaptive computing,
In Proc. PLDI, pages 71-84, 1997. - 12 G. Fursin, C. Miranda, S. Pop, A. Cohen, O.
Temam, Practical Run-time Adaptation with
Procedure Cloning to Enable Continuous Collective
Compilation, Alchemy group, INRIA Futurs and
LRI, Paris-Sud 11 University, Orsay, France,
2007. - 13 V. Bala, E. Duesterwald, and S. Banerjia,
Dynamo A transparent dynamic optimization
system, In ACM SIGPLAN Notices, 2000.
26References 14-23
- 14 R. H. Saavedra and D. Park, Improving the
effectiveness of software prefetching with
adaptive execution, In Conference on Parallel
Architectures and Compilation Techniques
(PACT96), 1996. - 15 M. Voss and R. Eigemann, High-level
adaptive program optimization with adapt, In
Proceedings of the Symposium on Principles and
practices of parallel programming, 2001. - 16 G. Fursin, A. Cohen, M. OBoyle, and O.
Temam, A Practical Method For Quickly Evaluating
Program Optimizations, Institute for Computing
Systems Architecture, University of Edinburgh,
UK, 2005. - 17 T. Sherwood, E. Perelman, G. Hamerly, and B.
Calder, Automatically characterizing large scale
program behavior, In 10th International
Conference on Architectural Support for
Programming Languages and Operating Systems,
2002. - 18 J. Lau, S. Schoenmackers, and B. Calder,
Transition phase classification and prediction,
In International Symposium on High Performance
Computer Architecture, 2005. - 19 Z. Pan, R. Eignmann, Fast and Effective
Orchestration of Compiler Optimizations for
Automatic Performance Tuning., Proceedings of
the International Symposium on Code Generation
and Optimization, 2006. - 20 S. Triantafyllis, M.J. Bridges, E. Raman, G.
Ottoni and D. August, A Framework for
Unrestricted Whole-Program Optimization.,
Proceedings of the 2006 ACM SIGPLAN Conference on
Programming Language Design and Implementation,
2006. - 21 Z. Pan, R. Eignmann, Fast, Automatic,
Procedure-Level Performance Tuning., Proceedings
of the 15th International Conference on Parallel
Architecture and Compilation Techniques, 2006. - 22 H. Feltl, Ein Genetischer Algorithmus fuer
das Generalized Assignment Problem,
Diplomarbeit, 2003. - 23 M. Haneda, P. Knijnenburg, H. Wijshoff
Automatic Selection of Compiler Options Using
Non-Parametristic Inferential Statistics.,
Proceedings of the 14th International Conference
on Parallel Architecture and Compilation
Techniques, 2005.
27Basic Foil with Take-Away Banner
- Use Verdana Bold for Main Body Subheadings
- Use Verdana regular for main body text.
- Use charcoal gray (RGB 51 51 51) color as the
default - Text size can vary. Use these minimum recommended
font sizes - Slide title 32 pt
- Main body subheadings 20 pt
- Bullet points 18 pt (with sub-bullets reducing
by 2 pt each 16 pt, 14 pt, etc) - Tables 14 pt
- Diagram and chart labels 12 pt
- Emphasize with italics, bold or color (blue)
- Use text boxes to highlight content
- Primary text should be on background color, not
photos, etc. - Use bullets same color as text
Standard Take-Away Banner. Add Text Here!