1
Iterative Compilation in Program Optimization

T. Kisuki, P.M.W. Knijnenburg, M.F.P. O'Boyle, H.A.G. Wijshoff
Dept. of Computer Science, Leiden University, Niels
Bohrweg 1, 2333 CA Leiden, the Netherlands
Institute for Computing Systems Architecture,
University of Edinburgh, Edinburgh EH9 3JZ, U.K.

What: Compiler Optimization
Why: Performance
How: Iterative Compilation
3
Introduction

Optimization in modern compilers
Depends on static program analysis
Is based on simplified machine models
Is focused on loop transformations
Gives good results in many cases, BUT
Machine models are inaccurate
Transformations are not independent
Based on averaging observed behavior
THE NEW ERA...

4
Iterative Compilation

In this approach, successive transformations are
applied to a program and their worth is determined
by actual execution of the resulting code
Drawback
Compilation time increases dramatically
(on average, 400 iterations take 16 minutes)

5
Cont...

Advantages
For embedded applications, only one program
is to be executed
The cost of compilation can be amortized over the
number of systems shipped and the lifetime of the
application
Does not suffer from undecidability issues

6
Iterative Compilation

The Experimental Setup: Optimizations
Loop Tiling
Loop Unrolling
Array Padding

7
Tiling (also called blocking)

Dividing an iteration space into tiles to
improve cache reuse
Each tile fits in the cache, thereby exploiting the
available locality
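
As an illustration only (not taken from the slides), a minimal sketch of tiling applied to a matrix-vector product. The function names and the default tile size are hypothetical. In an interpreted language the cache benefit will not be visible; the sketch only shows how the loop structure changes.

```python
def matvec(a, x):
    """Reference matrix-vector product with untiled loops."""
    n, m = len(a), len(x)
    y = [0.0] * n
    for i in range(n):
        for j in range(m):
            y[i] += a[i][j] * x[j]
    return y


def tiled_matvec(a, x, tile=4):
    """Same computation with the j loop split into tiles of width `tile`."""
    n, m = len(a), len(x)
    y = [0.0] * n
    for jj in range(0, m, tile):          # loop over tiles
        j_end = min(jj + tile, m)         # the last tile may be partial
        for i in range(n):
            for j in range(jj, j_end):    # loop inside one tile
                y[i] += a[i][j] * x[j]
    return y
```

Within one tile the same slice of x is reused for every row, which is the locality the tile size is meant to capture.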

8
Loop Unrolling

Unrolling replicates the body of the loop some
number of times,
called the unrolling factor (u)
The loop iterates by step u instead of step 1
Advantages
Reduces loop overhead
Increases instruction-level parallelism
Improves memory system performance
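
A small sketch of unrolling with factor u = 4, written for a dot product. This is an illustrative example, not code from the presentation; the function names are hypothetical.

```python
def dot(a, b):
    """Reference loop: one multiply-add per iteration."""
    total = 0.0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total


def dot_unrolled4(a, b):
    """Same loop with the body replicated 4 times (unrolling factor u = 4)."""
    n = len(a)
    main = n - n % 4                  # largest multiple of 4 not exceeding n
    total = 0.0
    for i in range(0, main, 4):       # iterate by step u instead of step 1
        total += a[i] * b[i]
        total += a[i + 1] * b[i + 1]
        total += a[i + 2] * b[i + 2]
        total += a[i + 3] * b[i + 3]
    for i in range(main, n):          # remainder loop for leftover iterations
        total += a[i] * b[i]
    return total
```

The remainder loop is needed whenever the trip count is not a multiple of u.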

9
Array Padding

Padding is used to reduce the number of memory
system conflicts
by changing the size of the array
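
To see why changing the array size can remove conflicts, the sketch below maps elements of a column-major array to cache sets. The cache geometry (32-byte lines, 128 sets) and the 8-byte element size are assumptions chosen for illustration; they are not the parameters of the machine used in the experiments.

```python
LINE_BYTES = 32    # assumed cache line size
NUM_SETS = 128     # assumed number of cache sets
ELEM_BYTES = 8     # assumed element size (double precision)


def cache_set(i, j, leading_dim):
    """Cache set holding element (i, j) of a column-major array."""
    address = (j * leading_dim + i) * ELEM_BYTES
    return (address // LINE_BYTES) % NUM_SETS


def distinct_sets_in_row(row, cols, leading_dim):
    """Number of different cache sets touched when walking along one row."""
    return len({cache_set(row, j, leading_dim) for j in range(cols)})
```

Under these assumptions, walking 16 elements of a row with leading dimension 256 touches only 2 sets, so the accesses evict each other. Padding the leading dimension to 260 spreads the same 16 elements over 16 sets.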

5 Benchmarks
3 general-purpose linear algebra routines
Matrix-Matrix Multiplication (MM), data sizes
256, 300, 301
Matrix-Vector Multiplication (MV), data sizes
2048, 2300, 2301
Successive Over-Relaxation (SOR), data sizes 128,
150, 151
2 routines from multimedia applications
Forward Discrete Cosine Transform from MPEG-2
(FDCT), data sizes 256, 300, 301
Motion compensation routine from H.263 (RECO),
data sizes 2048, 2300, 2301

Target Platform
Pentium II at 233 MHz
Compiler
Fortran Compiler g77

12
The Compilation Process
Driver: keeps track of the different
transformations evaluated so far and decides
which transformation to apply next
SSL (strategy specification language): specifies the
order in which to apply certain transformations
TDL (transformation definition language): the
transformations used by the driver are specified in the TDL file
MT1: source-to-source compiler that starts the
transformation process
[Diagram: Driver (list of transformations) -> SSL file -> MT1 compiler (reads TDL file) -> transformed program -> F77 -> target platform -> execution time -> Driver]
13
The Transformation Space

When N different optimizations need to be examined,
the driver uses an N-dimensional array to
represent the transformation space
Each point in this array corresponds to a
specific set of parameters for the
transformations
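
A minimal sketch of how an index into such an N-dimensional array could be mapped to concrete parameters. The ranges are the tile sizes and unroll factors quoted in the experimental setup; the dictionary layout and function name are assumptions made for illustration, not the driver's actual data structure.

```python
# One entry per dimension of the transformation space (here N = 2).
RANGES = {"tile": range(1, 101), "unroll": range(1, 21)}


def point_to_params(index, ranges=RANGES):
    """Map an N-dimensional array index to concrete parameter values."""
    return {
        name: values[i]
        for (name, values), i in zip(ranges.items(), index)
    }
```

For example, `point_to_params((0, 19))` selects tile size 1 and unroll factor 20. Adding a third entry to `RANGES` would add a third dimension to the space.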

14
The Algorithm

The algorithm used by the driver to search the
transformation space is based on a grid over this
space
The grid search algorithm
1. Define a coarse grid on the search space.
Evaluate all points on this grid by generating
the transformed programs and executing them
2. Find the point with minimum execution time
and all points that are within an allowable
distance from this minimum

15
Cont...

3. Order these points in a priority queue
ordered by execution time
4. For each point in the queue:
if the execution time associated with this point
is within the allowable distance from the minimum
found so far, refine the grid around this point by
forming a new grid with half the spacing in each
dimension;
if new points are found that are close to the
minimum found so far, enqueue them in the priority
queue
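
The four steps of the grid search can be sketched as follows. This is one possible reading of the slides, not the authors' implementation: the "allowable distance" is modelled here as a relative tolerance, the evaluation cap `max_evals` is an added safeguard, and `measure` stands in for generating, compiling and timing a transformed program.

```python
import heapq
from itertools import product


def grid_search(measure, bounds, spacing, tolerance=0.05, max_evals=400):
    """Search integer parameter points for the lowest measured time.

    measure   -- function mapping a point (tuple of ints) to a time
    bounds    -- list of (low, high) pairs, one per dimension
    spacing   -- coarse grid spacing, one int per dimension
    tolerance -- assumed form of the "allowable distance" (relative)
    """
    times = {}

    def evaluate(point):
        if point not in times and len(times) < max_evals:
            times[point] = measure(point)
        return times.get(point)

    def near_best(t):
        return t <= min(times.values()) * (1.0 + tolerance)

    # Step 1: evaluate every point of the coarse grid.
    axes = [range(lo, hi + 1, s) for (lo, hi), s in zip(bounds, spacing)]
    for point in product(*axes):
        evaluate(point)

    # Steps 2-3: queue the points close to the minimum, fastest first.
    queue = [(t, p, tuple(spacing)) for p, t in times.items() if near_best(t)]
    heapq.heapify(queue)

    # Step 4: refine the grid around each point that is still competitive.
    while queue and len(times) < max_evals:
        t, point, step = heapq.heappop(queue)
        half = tuple(s // 2 for s in step)
        if not near_best(t) or not any(half):
            continue  # no longer close to the minimum, or cannot refine
        offsets = [(-h, 0, h) if h else (0,) for h in half]
        for delta in product(*offsets):
            new = tuple(
                min(max(c + d, lo), hi)
                for c, d, (lo, hi) in zip(point, delta, bounds)
            )
            if new in times:
                continue
            new_t = evaluate(new)
            if new_t is not None and near_best(new_t):
                heapq.heappush(queue, (new_t, new, half))

    best = min(times, key=times.get)
    return best, times[best], len(times)
```

The function returns the best point found, its time, and the number of evaluations spent, so the last value can be compared against the size of the full space.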

16
One step of the global driver

1. Decide the next set of parameters for the
transformations using its internal search space
and a search algorithm
2. Construct an SSL file that corresponds to this
new sequence
3. Invoke MT1, which starts the transformation
process by reading in a source program, the SSL
file and the TDL file

17
Cont ...

4. The transformed program is compiled for the
target architecture and executed
5. The execution time is measured and reported
back to the global driver
6. The global driver stores this execution time
and starts the next step
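
The six steps can be outlined as a single function. The slides do not show the SSL syntax or how MT1 and the Fortran compiler are invoked, so every tool is passed in as a hypothetical callable (`choose_next`, `transform`, `build`) rather than a real command.

```python
import time


def driver_step(choose_next, transform, build, history):
    """One step of the driver loop; all tools are supplied by the caller.

    choose_next -- picks the next parameter tuple from the history (step 1)
    transform   -- stands in for writing the SSL file and running the
                   source-to-source compiler (steps 2-3)
    build       -- stands in for compiling; returns a callable program (step 4)
    history     -- dict mapping parameter tuples to measured times
    """
    params = choose_next(history)
    source = transform(params)
    program = build(source)

    start = time.perf_counter()
    program()                                  # step 4: execute
    elapsed = time.perf_counter() - start      # step 5: measure

    history[params] = elapsed                  # step 6: store
    return params, elapsed
```

Calling `driver_step` repeatedly with the same `history` gives the iterative loop; the search algorithm lives entirely inside `choose_next`.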

18
Iterative Compilation

Experimental Setup - single data size
2 Transformations
Loop Tiling with tile sizes 1-100
Loop Unrolling with unroll factors 1-20

20
The Results

Good results except in the case of SOR, with a
speedup of up to 3.4 in the case of MV
Finds good parameters quickly:
within 50 evaluations it is close to the maximum (except
MM and SOR: more than 100)
After 300 evaluations there is no further improvement
This corresponds to 15% of the entire search space
21
MM
Data size    Improvement
256          2.32
300          1.85
301          1.85
Number of iterations: 100
22
MV
Data size    Improvement
2048         3.4
2300         1.7
2301         1.8
Number of iterations: 50
23
SOR
Data size    Improvement
128          1.0202
150          1.0203
151          1.017
Number of iterations: 100
24
FDCT
Data size    Improvement
256          1.17
300          1.22
301          1.221
Number of iterations: 50
25
RECO
Data size    Improvement
2048         1.37
2300         1.53
2301         1.40
Number of iterations: 50
26
Iterative Compilation

Experimental Setup - single data size
3 Transformations
Loop Tiling with tile sizes 1-100
Loop Unrolling with unroll factors 1-20
Array Padding with pad sizes 1-10

27
The Results

Enlarges the transformation space by a factor of
10
But speedups are obtained within the same number
of iterations
In the case of MV a significantly larger speedup is
found
In the other cases the improvement is slightly smaller

28
Cont...

350 evaluations are required to obtain comparable
or better results
Only 1.75% of the entire search space
No scaling up of the number of iterations
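
The search-space fractions quoted in the results follow from the parameter ranges in the experimental setup (tile sizes 1-100, unroll factors 1-20, pad sizes 1-10). A short check, with illustrative variable names:

```python
TILE_SIZES = 100       # tile sizes 1-100
UNROLL_FACTORS = 20    # unroll factors 1-20
PAD_SIZES = 10         # pad sizes 1-10

# Number of points in the transformation space.
space_two = TILE_SIZES * UNROLL_FACTORS          # tiling + unrolling
space_three = space_two * PAD_SIZES              # ... + padding

# Fraction of the space visited by the search.
fraction_two = 300 / space_two                   # 300 evaluations
fraction_three = 350 / space_three               # 350 evaluations

print(space_two, space_three)
print(f"{fraction_two:.2%} {fraction_three:.2%}")
```

The space grows tenfold when padding is added, while the number of evaluations grows from 300 to 350, which is why the fraction drops from 15% to 1.75%.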

29
MM
Transformations    Iterations    Best improvement
3                  350           2.19
2                  100           2.32
30
MV
Transformations    Iterations    Best improvement
3                  350           3.8
2                  50            3.4
31
SOR
Transformations    Iterations    Best improvement
3                  350           1.0208
2                  100           1.0203
32
FDCT
Transformations    Iterations    Best improvement
3                  300           1.221
2                  50            1.221
33
RECO
Transformations    Iterations    Best improvement
3                  200           1.53
2                  50            1.53
34
Best Parameter Values
Dependency on data size
Interference among transformations
35
Iterative Compilation

Experimental Setup - multiple data sizes
In many cases profiling will yield a distribution of
input data sizes
Hence, the goal is to find the optimization that minimizes
the average execution time
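
A sketch of what minimizing the average execution time over a distribution of data sizes could look like. The weighting scheme and function names are assumptions; the slides do not say how the average is formed.

```python
def expected_time(times_by_size, weights):
    """Weighted average execution time over a data-size distribution."""
    total_weight = sum(weights.values())
    weighted = sum(times_by_size[size] * w for size, w in weights.items())
    return weighted / total_weight


def best_for_distribution(candidates, weights):
    """Return the candidate whose weighted-average time is lowest.

    candidates -- dict: candidate name -> {data size: measured time}
    weights    -- dict: data size -> relative frequency from profiling
    """
    return min(candidates, key=lambda c: expected_time(candidates[c], weights))
```

A parameter setting that is best for one data size can lose once other sizes are weighted in, which is why the search is rerun against the average rather than reusing the single-size result.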

36
Experimental Setup - multiple data sizes cont...

Use unrolling, tiling and padding
Use 4 of the benchmarks (all except SOR)
3 data sizes for each of MM, FDCT, MV and RECO

37
Experimental Setup - multiple data sizes cont...

4 data sizes
MM, FDCT: 250, 260, 290, 310
MV, RECO: 2000, 2100, 2200, 2400
38
The Results

Different optimizations are found
They still yield significant speedups
Many values for the parameters yield good
speedups
The driver always finds a set of these good
values
Hence, the search technique produces
stable results
The optimization it finds is effective for a
range of input data sizes

39
Multiple Data Sizes vs. Single Data Size
40
Compilation Time

Is it feasible?
Check the running time of the approach
For 400 iterations, the time ranges from 7.7
minutes (MV) to 25.4 minutes (FDCT)
On average we need 16 minutes for 400 iterations
This can be seen as an integral component of the total
development time of the embedded system
Thus we can afford several hours to heavily
optimize the compute-intensive routines
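
The reported totals translate into a per-iteration cost, which in turn gives a rough idea of how many iterations fit into a time budget. The sketch assumes the cost per iteration stays constant, which the slides do not state.

```python
ITERATIONS = 400   # iterations behind each reported total


def seconds_per_iteration(total_minutes, iterations=ITERATIONS):
    """Average wall-clock seconds spent on one iteration."""
    return total_minutes * 60.0 / iterations


def iterations_in_budget(budget_hours, total_minutes, iterations=ITERATIONS):
    """Iterations that fit in a budget, assuming constant cost per iteration."""
    return int(budget_hours * 60 * iterations // total_minutes)
```

With the reported average of 16 minutes, `seconds_per_iteration(16.0)` is 2.4 seconds, so a budget of a few hours covers several thousand iterations under this assumption.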

41
Compilation Time cont...
In most cases execution time is the longest
[Figure: time in minutes against number of iterations, ranging from 7.7 minutes (MV) to 25.4 minutes (FDCT)]
42
Conclusions

We described a new approach to program
optimization, namely
Iterative Compilation
It finds good optimizations by searching a relatively
small fraction of the optimization space
With loop unrolling, tiling and
array padding, 350 evaluations suffice for a satisfactory
optimization

43
Cont...

This corresponds to 1.75% of the entire search
space
On a Pentium II at 233 MHz it took 16 minutes on
average to execute all 400 iterations
Very tolerable for embedded systems

44
What next ?

Methods to reduce the number of iterations when
the running time of the routine is much larger
Improve the search algorithm with a mix of static
analysis and run-time information
