1
Maximizing Intel Compiler Performance
using Iterative Feedback Directed Optimization
(IFDO)

Ilya Cherny
Software Manager
Software and Services Group
Thanks to Leonid Brusencov, Sergey Ermolaev,
Artem Chirtsov, Sergey Grebenkin!
October 23, 2008

2
Agenda

Why IFDO
What is IFDO
Alternative approaches
Experimental Results
Future plans

3
A Performance Experiment

Take 3 nested loops for matrix multiplication
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
      c[i][j] = 0;
      for (k = 0; k < N; k++)
        c[i][j] += a[i][k] * b[k][j];
    }
Compile with maximum optimization
  icl -O3 matmul.c
Run and measure time: 20 sec
Compile with enforced vectorization
  icl -O3 mP2OPT_vec_alwaysT
Run and measure time: 11 sec
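A minimal, self-contained version of the experiment above might look like the sketch below. The flat row-major layout and the names matmul_ijk and time_matmul are assumptions made for illustration; the slides do not show how the matrices are declared or how the time was measured.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* c = a * b for n x n row-major matrices, in the slide's i-j-k loop order. */
void matmul_ijk(size_t n, const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            c[i * n + j] = 0.0;
            for (size_t k = 0; k < n; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
        }
}

/* CPU seconds spent in one multiplication. clock() is coarse, so a
   meaningful measurement needs a large n. */
double time_matmul(size_t n, const double *a, const double *b, double *c)
{
    clock_t start = clock();
    matmul_ijk(n, a, b, c);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Timing the same source built with different option sets is the basic measurement the rest of the talk relies on.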

> icl -help ... /O3 optimize for maximum speed
and enable high-level optimizations ...
Why not always switch on vectorization?
4
A Performance Experiment II

Switch i and j loops, move initialization
loop outside
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      c[i][j] = 0;
  for (j = 0; j < N; j++)
    for (i = 0; i < N; i++)
      for (k = 0; k < N; k++)
        c[i][j] += a[i][k] * b[k][j];
Compile with O3 optimization
Measure run time: 18 sec
Compile with enforced vectorization
Measure run time: 21 sec
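The interchanged variant can be written as a standalone function, sketched below. The row-major layout and the name matmul_jik are assumptions, not taken from the slides. The arithmetic is the same as before; only the order in which memory is touched changes, which is why the same compiler options can behave so differently on it.

```c
#include <assert.h>
#include <stddef.h>

/* Slide 4 variant: zero c in its own loop nest, then multiply with j as
   the outer loop and i as the middle loop. */
void matmul_jik(size_t n, const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            c[i * n + j] = 0.0;

    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            for (size_t k = 0; k < n; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}
```

With j outermost, consecutive values of i step through c one full row apart.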

Thoughtless optimization loses 15%
5
Why Optimizations Do Not Always Win?

Source code adding 4 integers

  for (i = 0; i < 4; i++) A[i] = A[i] + B[i];

Compiled to scalar code
  loop_start: mov ebx, B[edi]
              add A[edi], ebx
              inc edi
              loop loop_start
Compiled to vector code
  movdqa xmm1, A
  paddd  xmm1, B
  movdqa A, xmm1

Vector code could be 2-4x faster, but...
...what if the array size is not a multiple of 4?
...what if A is not aligned to 16 bytes?
...what if A and B are aligned differently?
Final code carries the overhead of these 3 ifs
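The three checks can be made explicit in source form. The sketch below is an illustration, not the code the Intel compiler generates: add_ints is a hypothetical name, and the SSE2 intrinsics from <emmintrin.h> stand in for the movdqa/paddd sequence on the slide. When SSE2 is not available at build time, the function falls back to the plain scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* a[i] += b[i] for n 32-bit integers, with the slide's run-time checks. */
void add_ints(int32_t *a, const int32_t *b, size_t n)
{
    assert(a != NULL && b != NULL);
    size_t i = 0;
#if defined(__SSE2__)
    /* Alignment of A: peel scalar iterations until a + i is 16-byte aligned. */
    while (i < n && ((uintptr_t)(a + i) % 16) != 0) {
        a[i] += b[i];
        i++;
    }
    /* A and B aligned differently: B may need unaligned loads. */
    int b_aligned = ((uintptr_t)(b + i) % 16) == 0;
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_load_si128((const __m128i *)(a + i));
        __m128i vb = b_aligned
            ? _mm_load_si128((const __m128i *)(b + i))
            : _mm_loadu_si128((const __m128i *)(b + i));
        _mm_store_si128((__m128i *)(a + i), _mm_add_epi32(va, vb));
    }
#endif
    /* Size not a multiple of 4: finish the remainder in scalar code
       (this is also the whole loop when SSE2 is unavailable). */
    for (; i < n; i++)
        a[i] += b[i];
}
```

For a 4-element array the peel loop, the alignment test, and the remainder loop can cost more than the single vector add they protect, which is the overhead the slide points at.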

6
Heuristics for Parameterization of Compiler
Optimizations

Compiler has hundreds of performance
optimizations
Each optimization has at least 1 Boolean
parameter
Compiler has heuristics to make a decision about
each optimization
But still code produced by the compiler is far
from optimal
Each application has unique behavior (loop
counts, memory accesses)
Each machine has unique characteristics
(latencies, caches)
It is impossible to target ALL compiler
heuristics for ALL machines for ALL applications

Compiler optimizations have considerable performance
headroom if parameterized right
7
Agenda

Why IFDO
What is IFDO
Alternative approaches
Experimental Results
Future plans

8
IFDO process
1. Optimization: Compile executable from the
sources
2. Feedback: Run executable with profiling
3. Directed: Analyze results, define options for
the next compilation
4. Iterative: Repeat while time permits
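The four steps can be sketched as a driver loop. Everything named here is hypothetical: the callback types, ifdo_loop, and the toy callbacks exist only for illustration and say nothing about how the actual IFDO tool is structured. Option sets are modeled as a bitmask and the profile as a single tick count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: build with these options, run with profiling, return ticks. */
typedef unsigned long long (*build_and_run_fn)(unsigned options, void *ctx);
/* Hypothetical: propose the next option set from the best one so far. */
typedef unsigned (*next_options_fn)(unsigned best, size_t iteration, void *ctx);

typedef struct {
    unsigned options;
    unsigned long long ticks;
} ifdo_result;

ifdo_result ifdo_loop(unsigned start, size_t max_iterations,
                      build_and_run_fn build_and_run,
                      next_options_fn next_options, void *ctx)
{
    assert(build_and_run != NULL && next_options != NULL);
    /* Steps 1-2: compile and run the starting configuration. */
    ifdo_result best = { start, build_and_run(start, ctx) };
    for (size_t it = 1; it < max_iterations; it++) {      /* step 4 */
        unsigned candidate = next_options(best.options, it, ctx); /* step 3 */
        unsigned long long ticks = build_and_run(candidate, ctx); /* steps 1-2 */
        if (ticks < best.ticks) {
            best.options = candidate;
            best.ticks = ticks;
        }
    }
    return best;
}

/* Toy cost model for illustration: bits 0 and 2 help, bit 1 hurts. */
unsigned long long toy_build_and_run(unsigned options, void *ctx)
{
    (void)ctx;
    unsigned long long ticks = 1000;
    if (options & 1u) ticks -= 300;
    if (options & 2u) ticks += 200;
    if (options & 4u) ticks -= 100;
    return ticks;
}

/* Toy strategy: flip one option per iteration, cycling through 3 options. */
unsigned toy_next_options(unsigned best, size_t iteration, void *ctx)
{
    (void)ctx;
    return best ^ (1u << ((iteration - 1) % 3));
}
```

In a real run the build-and-run callback would invoke the compiler and a profiler; here it is replaced by a fixed cost table so the loop can be exercised on its own.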
9
Search Algorithms

Implemented 6 algorithms to search the
options space [20]
exhaustive search with priorities
batch elimination
iterative elimination
combined elimination
genetic algorithm
statistical selection [23]

Search algorithms find the maximum much faster
than after 2^n iterations
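As one example from the list, batch elimination can be sketched as follows, based on the general description of the technique in the tuning literature rather than on the IFDO sources. The callback type and the toy cost model are assumptions for illustration. The idea is to measure a baseline with every option on, measure each option switched off alone, and then drop in one batch every option whose removal made the run faster.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical measurement callback: ticks for a bitmask of enabled options. */
typedef unsigned long long (*measure_fn)(unsigned options);

/* Batch elimination over n on/off options.
   Uses n + 1 measurements instead of the 2^n of exhaustive search. */
unsigned batch_elimination(size_t n, measure_fn measure)
{
    assert(measure != NULL && n > 0 && n < sizeof(unsigned) * CHAR_BIT);
    unsigned all_on = (1u << n) - 1u;
    unsigned long long base = measure(all_on);
    unsigned keep = all_on;
    for (size_t i = 0; i < n; i++) {
        unsigned without_i = all_on & ~(1u << i);
        if (measure(without_i) < base)
            keep &= ~(1u << i);   /* option i hurts: eliminate it */
    }
    return keep;
}

/* Toy cost model for illustration: options 0 and 2 help, option 1 hurts. */
unsigned long long toy_measure(unsigned options)
{
    unsigned long long ticks = 1000;
    if (options & 1u) ticks -= 300;
    if (options & 2u) ticks += 200;
    if (options & 4u) ticks -= 100;
    return ticks;
}
```

Because each option is judged only against the all-on baseline, the result is reliable only when options do not interact; the iterative and combined variants re-measure after each elimination to relax that assumption.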
10
Compiler Optimization Options

Selected 6 undocumented options which seem to
have the highest impact
loop vectorization
loop fusion
loop distribution
loop unrolling
data blocking
memory prefetch
Total number of independent option values is 13
Search space for these 6 options has 7000
combinations
Extracted all compiler options from the sources
and just started the runs for more than 1000
options

11
IFDO Tool

User defines how to build and execute his
application
IFDO tool outputs the best binary and all
performance data
Based on IFDO results user can
change compilation options or pragmas
modify sources
improve compiler

IFDO tool automates iteration process
12
IFDO Sample Output - SPEC CPU2000 173.applu

iter  stat  proc   ticks         impr    noise  parameter   val  parameter  val
-------------------------------------------------------------------------------
1     ...   Total   9571493442    0.00   0.21
1     ...   _SSOR   7335205487    0.00   0.26   VectorizeP  1    BlockP     1
1     ...   _RHS    2218915958    0.00   0.02   VectorizeP  1    BlockP     1
2     ...   Total   6234529813   34.86   0.01
2     ...   _SSOR   4048766558   44.80   0.01   VectorizeP  1    BlockP     2
2     ...   _RHS    2168605738    2.27   0.05   VectorizeP  1    BlockP     2
3     ...   Total  12889959745  -34.67   0.06
3     ...   _SSOR  11309558116  -54.18   0.07   VectorizeP  2    BlockP     1
3     ...   _RHS    1563970212   29.52   0.01   VectorizeP  2    BlockP     1
4     ...   Total  10806270474  -12.90   0.15
4     ...   _SSOR   8657609979  -18.03   0.10   VectorizeP  3    BlockP     1
4     ...   _RHS    2102104477    5.26   0.01   VectorizeP  3    BlockP     1
5     ...   Total   5602632636   41.47   0.00

41% performance gain!
But the Intel Compiler has improved since 10.0: SPEC
173.applu has only 8% headroom now!
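The impr column in the sample output is consistent with the percentage reduction in ticks relative to iteration 1 of the same row type. The helper below is a sketch of that arithmetic; the name improvement_percent is an assumption, not part of the tool.

```c
#include <assert.h>

/* Percentage of ticks saved relative to the baseline run.
   Negative values mean the run got slower. */
double improvement_percent(unsigned long long base_ticks,
                           unsigned long long ticks)
{
    assert(base_ticks > 0);
    return 100.0 * ((double)base_ticks - (double)ticks) / (double)base_ticks;
}
```

For example, improvement_percent(9571493442ULL, 6234529813ULL) comes to about 34.86, which matches the Total row of iteration 2.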
13
Agenda

Why IFDO
What is IFDO
Alternative approaches
Experimental Results
Future plans

14
Comparison to other publications

There are 23 references to similar papers
No publication has a compiler with procedure-level
granularity
All publications were limited to about 50
options, while we started exploring 1000
undocumented compiler options

Our work has two novel characteristics
15
Comparison to other tools

Tool                     Granularity                             Profiling                        Conclusion
Manual options search    - whole application or source changes   -/+ any profiler, but manually   user time consuming
PGO (prof_gen/prof_use)  basic block level                       - basic block counters only      2 iterations only
PathScale PathOpt2       - whole application                     - whole application              40% less results
IFDO                     function or loop                        instrumentation, VTune           not available in the product
16
Agenda

Why IFDO
What is IFDO
Alternative approaches
Experimental Results
Future plans

17
Search Algorithms Performance Growth Dependency
on Iteration

BE (batch elimination) works for independent options
only, but needs just 14 iterations
IE and CE (iterative and combined elimination) get 99%
in 30 iterations: the most effective

All algorithms gain 2.5-4% in CPU2000 total time
18
Procedure vs. Application granularity
8 of 22 benchmarks gain from procedure-level granularity
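Why procedure-level granularity can help is visible in the 173.applu sample output: the best option set for _SSOR and the best one for _RHS come from different iterations. The sketch below picks the fastest run per procedure from the first four iterations of that table. The array and function names are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Ticks from iterations 1-4 of the 173.applu sample output. */
static const unsigned long long total_ticks[4] = {
    9571493442ULL, 6234529813ULL, 12889959745ULL, 10806270474ULL
};
static const unsigned long long ssor_ticks[4] = {
    7335205487ULL, 4048766558ULL, 11309558116ULL, 8657609979ULL
};
static const unsigned long long rhs_ticks[4] = {
    2218915958ULL, 2168605738ULL, 1563970212ULL, 2102104477ULL
};

/* Index of the run with the fewest ticks. */
size_t argmin_ticks(const unsigned long long *ticks, size_t n)
{
    assert(ticks != NULL && n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (ticks[i] < ticks[best])
            best = i;
    return best;
}
```

Here _SSOR is fastest in iteration 2 and _RHS in iteration 3, and their ticks sum to 5612736770, below the best whole-application total of 6234529813. The iteration 5 total in the sample output, 5602632636, is close to that sum, which suggests (the slides do not state it) that iteration 5 combines the per-procedure winners.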
19
Options Values Contribution to Performance
Increase
Each option gives about 2 percent
20
1000 Options Impact on Performance
Only 600 of 3000 options have zero impact
21
Agenda

Why IFDO
What is IFDO
Alternative approaches
Experimental Results
Future plans

22
Future Plans

Make experiments with all undocumented options
3000 values if they have no impact on each other
more than 2^3000 combinations otherwise!
Implement storing of application properties
number of FP expressions, number of loops, etc.
Implement expert/machine learning system
suggest options according to application
properties
may decrease number of iterations down to 1
can substitute existing compiler heuristics?

Useful to both compiler developers and users
23
Summary

If performance is critical, try at least icl -O3!

24
Thanks!
25
References 1-13

[1] F. Bodin, T. Kisuki, P. Knijnenburg, M.
O'Boyle, and E. Rohou, Iterative compilation in
a non-linear optimization space, In Proc. ACM
Workshop on Profile and Feedback Directed
Compilation, 1998, Organized in conjunction with
PACT98.
[2] K. Cooper, D. Subramanian, and L. Torczon,
Adaptive optimizing compilers for the 21st
century, J. of Supercomputing, 32(1), 2002.
[3] J. Bilmes, K. Asanovic, C. Chin, and J.
Demmel, Optimizing matrix multiply using PHiPAC:
A portable, high-performance, ANSI C coding
methodology, In Proc. ICS, pages 340-347, 1997.
[4] M. Stephenson and S. Amarasinghe, Predicting
unroll factors using supervised classification,
In IEEE/ACM International Symposium on Code
Generation and Optimization (CGO 2005), IEEE
Computer Society, 2005.
[5] Yom-Tov, J. Thomson, O. Temam, A. Zaks, H.
Leather, C. Miranda, M. Namolaru, E. Bonilla,
Saclay, B. Mendelson, C. Williams, Haifa, M.
O'Boyle, P. Barnard, E. Ashton, E. Courtois, F.
Bodin, MILEPOST GCC: machine learning based
research compiler, ARC, International, UK, CAPS
Enterprise, France, 2007.
[6] K. Hoste, L. Eeckhout, COLE: Compiler
Optimization Level Exploration, ELIS Department,
Ghent University, Sint-Pietersnieuwstraat 41,
B-9000 Gent, Belgium, 2008.
[7] K. Deb, Multi-Objective Optimization using
Evolutionary Algorithms, Wiley, 2001.
[8] G. Fursin, J. Cavazos, M. O'Boyle, and O.
Temam, MiDataSets: Creating the Conditions for a
More Realistic Evaluation of Iterative
Optimization, ALCHEMY Group, INRIA Futurs and
LRI, Paris-Sud University, France, 2007.
[9] M. Byler, M. Wolfe, J.R.B. Davies, C. Huson,
and B. Leasure, Multiple version loops, In
ICPP, pages 312-318, 1987.
[10] K. D. Cooper, M. W. Hall, and K. Kennedy,
Procedure cloning, In Proceedings of the 1992
IEEE International Conference on Computer
Languages, pages 99-105, 1992.
[11] P. Diniz and M. Rinard, Dynamic feedback:
An effective technique for adaptive computing,
In Proc. PLDI, pages 71-84, 1997.
[12] G. Fursin, C. Miranda, S. Pop, A. Cohen, O.
Temam, Practical Run-time Adaptation with
Procedure Cloning to Enable Continuous Collective
Compilation, Alchemy group, INRIA Futurs and
LRI, Paris-Sud 11 University, Orsay, France,
2007.
[13] V. Bala, E. Duesterwald, and S. Banerjia,
Dynamo: A transparent dynamic optimization
system, In ACM SIGPLAN Notices, 2000.

26
References 14-23

[14] R. H. Saavedra and D. Park, Improving the
effectiveness of software prefetching with
adaptive execution, In Conference on Parallel
Architectures and Compilation Techniques
(PACT96), 1996.
[15] M. Voss and R. Eigenmann, High-level
adaptive program optimization with ADAPT, In
Proceedings of the Symposium on Principles and
Practices of Parallel Programming, 2001.
[16] G. Fursin, A. Cohen, M. O'Boyle, and O.
Temam, A Practical Method For Quickly Evaluating
Program Optimizations, Institute for Computing
Systems Architecture, University of Edinburgh,
UK, 2005.
[17] T. Sherwood, E. Perelman, G. Hamerly, and B.
Calder, Automatically characterizing large scale
program behavior, In 10th International
Conference on Architectural Support for
Programming Languages and Operating Systems,
2002.
[18] J. Lau, S. Schoenmackers, and B. Calder,
Transition phase classification and prediction,
In International Symposium on High Performance
Computer Architecture, 2005.
[19] Z. Pan, R. Eigenmann, Fast and Effective
Orchestration of Compiler Optimizations for
Automatic Performance Tuning, Proceedings of
the International Symposium on Code Generation
and Optimization, 2006.
[20] S. Triantafyllis, M.J. Bridges, E. Raman, G.
Ottoni and D. August, A Framework for
Unrestricted Whole-Program Optimization,
Proceedings of the 2006 ACM SIGPLAN Conference on
Programming Language Design and Implementation,
2006.
[21] Z. Pan, R. Eigenmann, Fast, Automatic,
Procedure-Level Performance Tuning, Proceedings
of the 15th International Conference on Parallel
Architecture and Compilation Techniques, 2006.
[22] H. Feltl, Ein Genetischer Algorithmus fuer
das Generalized Assignment Problem,
Diplomarbeit, 2003.
[23] M. Haneda, P. Knijnenburg, H. Wijshoff,
Automatic Selection of Compiler Options Using
Non-Parametric Inferential Statistics,
Proceedings of the 14th International Conference
on Parallel Architecture and Compilation
Techniques, 2005.
