Title: Iterative Compilation in Program Optimization
1Iterative Compilation in Program Optimization
- T. Kisuki, P.M.W. Knijnenburg, M.F.P. O'Boyle, H.A.G. Wijshoff
- Dept. of Computer Science, Leiden University, Niels Bohrweg 1, 2333 CA Leiden, the Netherlands
- Institute for Computing Systems Architecture, University of Edinburgh, Edinburgh EH9 3JZ, U.K.
2Compiler Optimization
Performance
Iterative Compilation
3Introduction
- Optimization in modern compilers
- Depends on STATIC program analysis
- Based on simplified machine models
- Focused on loop transformations
- Gives good results in many cases, BUT
- Machine models are inaccurate
- Transformations are not independent
- Based on averaging observed behavior
- THE NEW ERA...
4Iterative Compilation
- In this approach, successive transformations are applied to a program and their worth is determined by actual execution of the resulting code
- Drawback
- Compilation time increases dramatically
- (on average, 400 iterations take 16 minutes)
5Cont...
- Advantages
- For embedded applications, only one program is to be executed
- The cost of compilation can be amortized over the number of systems shipped and the lifetime of the application
- Does not suffer from undecidability issues
6Iterative Compilation
- The Experimental Setup: Optimizations
- Loop Tiling
- Loop Unrolling
- Array Padding
7Tiling (also called blocking)
- Divides an iteration space into tiles to improve cache reuse
- Each tile fits in the cache, thereby exploiting the available locality
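To make the idea concrete, here is a minimal sketch of tiling applied to matrix-matrix multiplication. The function names and the choice to tile only the j and k loops are assumptions made for illustration; the slides do not say which loops were tiled. Python will not show the cache benefit itself; the sketch only shows how the iteration space is split into tiles while the result stays the same.

```python
def matmul_naive(a, b, n):
    """Reference triple loop: c = a * b for n x n matrices."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, n, tile):
    """Same computation with the j and k loops split into tiles of size `tile`.

    Hypothetical illustration: within one (jj, kk) tile, only a
    tile x tile block of b is touched, so it can stay in cache.
    """
    c = [[0] * n for _ in range(n)]
    for jj in range(0, n, tile):          # tile origin along j
        for kk in range(0, n, tile):      # tile origin along k
            for i in range(n):
                # min(...) handles tiles that stick out past the matrix edge
                for j in range(jj, min(jj + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        c[i][j] += a[i][k] * b[k][j]
    return c
```

The tile size is exactly the kind of parameter the iterative search varies: it changes performance but never the computed result.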
8Loop Unrolling
- Unrolling replicates the body of the loop some number of times
- This number is called the unrolling factor (u)
- Iterate by step u instead of step 1
- Advantages
- Reducing loop overhead
- Increasing instruction parallelism
- Improving memory system performance
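As an illustration, the sketch below unrolls a simple summation loop by a factor u = 4. The function names and the fixed factor are assumptions chosen for the example; a compiler would generate the equivalent code for whatever factor the search selects. The clean-up loop handles lengths that are not a multiple of u.

```python
def sum_rolled(xs):
    """Original loop: one element per iteration, step 1."""
    total = 0
    for i in range(len(xs)):
        total += xs[i]
    return total


def sum_unrolled_by_4(xs):
    """Same loop unrolled with factor u = 4 (hypothetical example)."""
    n = len(xs)
    limit = n - n % 4                 # largest multiple of 4 not exceeding n
    total = 0
    for i in range(0, limit, 4):      # step u instead of step 1
        total += xs[i]
        total += xs[i + 1]
        total += xs[i + 2]
        total += xs[i + 3]
    for i in range(limit, n):         # clean-up loop for the leftover elements
        total += xs[i]
    return total
```

The unrolled loop executes one loop test and one index update per four additions, which is the overhead reduction named above.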
9Array Padding
- Padding is used to reduce the number of memory system conflicts
- by changing the size of the array
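The effect of padding can be sketched with a toy cache model. Everything in this model is an assumption made for illustration: a direct-mapped cache with 512 sets of 32-byte lines, 8-byte elements, and column-major storage as in Fortran. It is not a model of the machine used in the experiments. The function counts how many distinct cache sets are touched when walking along one row of the array.

```python
LINE_BYTES = 32    # hypothetical cache line size
NUM_SETS = 512     # hypothetical direct-mapped cache: 512 lines of 32 bytes
ELEM_BYTES = 8     # one double-precision element


def sets_touched(leading_dim, num_columns, row=0):
    """Count distinct cache sets hit when walking one row of a column-major array.

    Consecutive elements of a row are `leading_dim` elements apart in memory.
    """
    sets = set()
    for col in range(num_columns):
        address = (row + col * leading_dim) * ELEM_BYTES
        sets.add((address // LINE_BYTES) % NUM_SETS)
    return len(sets)
```

In this toy model, `sets_touched(256, 64)` is 8, so 64 row elements compete for 8 cache sets, while padding the leading dimension by one element, `sets_touched(257, 64)`, spreads them over 64 sets.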
10- 5 Benchmarks
- 3 general-purpose linear algebra routines
- Matrix-Matrix Multiplication (MM), data sizes 256, 300, 301.
- Matrix-Vector Multiplication (MV), data sizes 2048, 2300, 2301.
- Successive Over Relaxation (SOR), data sizes 128, 150, 151.
- 2 routines from multimedia applications
- Forward Discrete Cosine Transform from mpeg2 (FDCT), data sizes 256, 300, 301.
- Motion Compensation routine from H263 (RECO), data sizes 2048, 2300, 2301.
11- Target Platform
- Pentium II at 233 MHz
- Compiler
- Fortran Compiler g77
12The Compilation Process
- Driver: keeps track of the different transformations evaluated so far and decides which transformation to apply next
- SSL (strategy specification language): specifies the order in which to apply certain transformations
- TDL (transformation definition language): the transformations used by the driver are specified in the TDL file
- MT1: source-to-source compiler that starts the transformation process
- Figure (also labeled "List of transformations"): Driver -> SSL File -> MT1 Compiler (with TDL File) -> Transformed Program -> F77 -> Target Platform -> Execution time -> Driver
13The Transformation Space
- When N different optimizations need to be examined, the driver uses an N-dimensional array
- to represent the transformation space
- Each point in this array corresponds to a specific set of parameters for the transformations
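A small sketch of such a space, using the parameter ranges this deck reports for its experiments (tile sizes 1-100, unroll factors 1-20, pad sizes 1-10). The function name and the list-of-tuples representation are assumptions for illustration; the driver itself is described as using an N-dimensional array.

```python
from itertools import product


def transformation_space(*parameter_ranges):
    """Return every point of the N-dimensional space, one tuple per parameter setting."""
    return list(product(*parameter_ranges))


tile_sizes = range(1, 101)      # tile sizes 1-100
unroll_factors = range(1, 21)   # unroll factors 1-20
pad_sizes = range(1, 11)        # pad sizes 1-10

# Two transformations give a 2-dimensional space, three give a 3-dimensional one.
two_d = transformation_space(tile_sizes, unroll_factors)
three_d = transformation_space(tile_sizes, unroll_factors, pad_sizes)
```

Here `len(two_d)` is 2000 and `len(three_d)` is 20000, which is why adding a third transformation with ten settings enlarges the space tenfold.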
14The Algorithm
- The algorithm used by the driver to search the transformation space is based on a grid over this space
- The grid search algorithm
- 1. Define a coarse grid on the search space. Evaluate all points on this grid by generating the transformed programs and executing them
- 2. Find the point with minimum execution time and all points that are within an allowable distance from this minimum
15Cont...
- 3. Order these points in a priority queue ordered by execution time
- 4. For each point in the queue
- if the execution time associated with this point is within an allowable distance from the minimum found so far, refine the grid around this point by forming a new grid with half the spacing in each dimension
- if new points are found that are close to the minimum found so far, enqueue them in the priority queue
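The four steps can be sketched as follows. This is one possible reading of the algorithm, not the driver's actual code. Several details are assumptions: the "allowable distance" is modeled as a relative `tolerance` on execution time, the total number of evaluations is capped by `budget`, a refined point is put back in the queue so it can be refined again at a finer spacing, and `cost` stands in for "generate, compile, run, and time the transformed program".

```python
import heapq
from itertools import product


def grid_search(cost, bounds, spacing, tolerance=1.05, budget=400):
    """Grid-refinement search over integer parameters.

    bounds: list of inclusive (lo, hi) ranges, one per transformation.
    Returns (best_point, best_time, number_of_evaluations).
    """
    evaluated = {}

    def evaluate(point):
        if point not in evaluated:
            if len(evaluated) >= budget:
                return None               # evaluation budget exhausted
            evaluated[point] = cost(point)
        return evaluated[point]

    # Step 1: evaluate every point of the coarse grid.
    for point in product(*(range(lo, hi + 1, spacing) for lo, hi in bounds)):
        evaluate(point)

    # Steps 2 and 3: keep points close to the minimum, ordered by time.
    best = min(evaluated.values())
    queue = [(t, p, spacing) for p, t in evaluated.items() if t <= tolerance * best]
    heapq.heapify(queue)

    # Step 4: refine around promising points with half the spacing.
    while queue:
        t, point, step = heapq.heappop(queue)
        half = step // 2
        if half == 0 or t > tolerance * best:
            continue
        for offset in product((-half, 0, half), repeat=len(bounds)):
            q = tuple(x + d for x, d in zip(point, offset))
            inside = all(lo <= x <= hi for x, (lo, hi) in zip(q, bounds))
            if q in evaluated or not inside:
                continue
            tq = evaluate(q)
            if tq is None:
                continue
            best = min(best, tq)
            if tq <= tolerance * best:
                heapq.heappush(queue, (tq, q, half))
        # Assumption: revisit the same point later at the finer spacing.
        heapq.heappush(queue, (t, point, half))

    best_point = min(evaluated, key=evaluated.get)
    return best_point, evaluated[best_point], len(evaluated)
```

With a synthetic cost function in place of real timings, for example `grid_search(lambda p: 100 + (p[0] - 37) ** 2 + (p[1] - 6) ** 2, [(1, 100), (1, 20)], spacing=8)`, the search needs only a small part of the 2000-point space to reach the minimum.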
16One step of the global driver
- 1. Decide the next set of parameters for the transformations using its internal search space and a search algorithm
- 2. Construct an SSL file that corresponds to this new sequence
- 3. Invoke MT1, which starts the transformation process by reading in a source program, the SSL file, and the TDL file
17Cont ...
- 4. The transformed program is compiled for the target architecture and executed
- 5. The execution time is measured and reported back to the global driver
- 6. The global driver stores this execution time and starts the next step
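The six steps map onto a short loop body. The sketch below is an assumption-laden outline: `choose_params`, `transform`, `build`, and `run` are hypothetical callables standing in for the search algorithm, the SSL-file construction plus MT1 invocation, the Fortran compilation, and the execution on the target. The real command lines and file formats are not given in the slides, so none are shown.

```python
import time


def driver_step(choose_params, transform, build, run, history):
    """Perform one step of the global driver and record its timing.

    All four callables are hypothetical placeholders supplied by the caller.
    history maps parameter tuples to measured execution times.
    """
    params = choose_params(history)          # step 1: pick the next parameters
    transformed = transform(params)          # steps 2-3: SSL file + source-to-source pass
    executable = build(transformed)          # step 4: compile for the target
    start = time.perf_counter()
    run(executable)                          # step 4: execute
    elapsed = time.perf_counter() - start    # step 5: measure execution time
    history[params] = elapsed                # step 6: store it for the next step
    return params, elapsed
```

Keeping the tools behind callables makes the loop independent of any particular compiler or transformation system, which is a design choice of this sketch rather than something the slides state.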
18Iterative Compilation
- Experimental Setup - single data size
- 2 Transformations
- Loop Tiling with tile sizes 1-100
- Loop Unrolling with unroll factors 1-20
19- 5 Benchmarks
- 3 general-purpose linear algebra routines
- Matrix-Matrix Multiplication (MM), data sizes 256, 300, 301.
- Matrix-Vector Multiplication (MV), data sizes 2048, 2300, 2301.
- Successive Over Relaxation (SOR), data sizes 128, 150, 151.
- 2 routines from multimedia applications
- Forward Discrete Cosine Transform from mpeg2 (FDCT), data sizes 256, 300, 301.
- Motion Compensation routine from H263 (RECO), data sizes 2048, 2300, 2301.
20The Results
- Good results except in the case of SOR, with a speedup of up to 3.4 in the case of MV
- Finds good parameters quickly
- Within 50 evaluations, close to the maximum (except MM and SOR, which need more than 100)
- After 300 evaluations, no further improvement
- This corresponds to 15% of the entire search space (300 of the 100 x 20 = 2000 points)
21MM
Data size    Improvement
256          2.32
300          1.85
301          1.85
Number of iterations: 100
22MV
Data size    Improvement
2048         3.4
2300         1.7
2301         1.8
Number of iterations: 50
23SOR
Data size    Improvement
128          1.0202
150          1.0203
151          1.017
Number of iterations: 100
24FDCT
Data size    Improvement
256          1.17
300          1.22
301          1.221
Number of iterations: 50
25RECO
Data size    Improvement
2048         1.37
2300         1.53
2301         1.40
Number of iterations: 50
26Iterative Compilation
- Experimental Setup - single data size
- 3 Transformations
- Loop Tiling with tile sizes 1-100
- Loop Unrolling with unroll factors 1-20
- Array Padding with pad sizes 1-10
27The Results
- Enlarges the transformation space by a factor of 10
- But speedups are obtained within the same number of iterations
- In the case of MV, a significantly larger speedup is found
- In other cases, slightly smaller improvement
28Cont...
- 350 evaluations are required to obtain comparable or better results
- Only 1.75% of the entire search space
- No scaling up of the number of iterations
29MM
                       3 transformations    2 transformations
Number of iterations   350                  100
Best improvement       2.19                 2.32
30MV
                       3 transformations    2 transformations
Number of iterations   350                  50
Best improvement       3.8                  3.4
31SOR
                       3 transformations    2 transformations
Number of iterations   350                  100
Best improvement       1.0208               1.0203
32FDCT
                       3 transformations    2 transformations
Number of iterations   300                  50
Best improvement       1.221                1.221
33RECO
                       3 transformations    2 transformations
Number of iterations   200                  50
Best improvement       1.53                 1.53
34Best Parameter Values
Dependency on data size
Interference among transformations
35Iterative Compilation
- Experimental Setup - multiple data sizes
- In many cases, profiling will yield a distribution of input data sizes
- Hence, the goal is to find the optimization that minimizes the average execution time
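Choosing by average execution time can be sketched as below. The function name, the dictionary layout, and the optional `weights` argument (to reflect a profiled distribution of data sizes) are assumptions for illustration; the slides only state that the average execution time is minimized.

```python
def best_on_average(times, weights=None):
    """Pick the parameter setting with the lowest mean execution time.

    times:   maps params -> {data_size: seconds}
    weights: optional map data_size -> relative frequency from profiling;
             when omitted, all data sizes count equally.
    """
    def mean_time(per_size):
        if weights is None:
            return sum(per_size.values()) / len(per_size)
        total = sum(weights[size] for size in per_size)
        return sum(weights[size] * t for size, t in per_size.items()) / total

    return min(times, key=lambda params: mean_time(times[params]))
```

A setting that is fastest for one data size can lose on average, so the selected parameters may differ from those found in the single-data-size experiments.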
36Experimental Setup - multiple data sizes cont...
- Uses unrolling, tiling, and padding
- Uses 4 benchmarks (without SOR)
- 3 data sizes for MM, FDCT, MV, RECO
37Experimental Setup - multiple data sizes cont...
- 4 data sizes
- MM, FDCT: 250, 260, 290, 310
- MV, RECO: 2000, 2100, 2200, 2400
38The Results
- Different optimizations are found
- Still yields significant speedups
- Many values for the parameters yield good speedups
- The driver always finds a set of these good values
- Hence, the technique (of searching) produces stable results
- The optimization it finds is effective for a range of input data sizes
39Multiple Data Sizes vs. Single Data Size
40Compilation Time
- Is it feasible?
- Check the running time of the approach
- For 400 iterations, time ranges from 7.7 minutes (MV) to 25.4 minutes (FDCT)
- On average, 16 minutes are needed for 400 iterations
- Can be seen as an integral component of the total development time of the embedded system
- Thus we can afford several hours to heavily optimize the compute-intensive routines
41Compilation Time cont...
- In most cases, execution time is the longest component
- [Figure: time in minutes against number of iterations; labeled values are 7.7 minutes (MV) and 25.4 minutes (FDCT)]
42Conclusions
- We described a new approach to program optimization, namely
- Iterative Compilation
- Finds good optimizations by searching a relatively small fraction of the optimization space
- With loop unrolling, tiling, and array padding, 350 evaluations suffice for a satisfactory optimization
43Cont...
- Which corresponds to 1.75% of the entire search space
- On a Pentium II at 233 MHz it took 16 minutes on average to execute all 400 iterations
- Very tolerable for embedded systems
44What next ?
- Methods to reduce the number of iterations when the running time of the routine is much larger
- Improve the search algorithm with a mix of static analysis and run-time information