Title: Iterative Compilation in Program Optimization


1
Iterative Compilation in Program Optimization
  • T. Kisuki, P.M.W. Knijnenburg, M.F.P. O'Boyle,
    H.A.G. Wijshoff
  • Dept. of Computer Science, Leiden University, Niels
    Bohrweg 1, 2333 CA Leiden, the Netherlands
  • Institute for Computing Systems Architecture,
    Edinburgh University, Edinburgh EH9 3JZ, U.K.

2
  • What: Compiler Optimization
  • Why: Performance
  • How: Iterative Compilation

3
Introduction
  • Modern compiler optimization
  • Depends on STATIC program analysis
  • Based on simplified machine models
  • Focused on loop transformations
  • Good results in many cases, BUT
  • Machine models are inaccurate
  • Transformations are not independent
  • Based on averaging observed behavior
  • THE NEW ERA...

4
Iterative Compilation
  • In this approach successive transformations are
    applied to a program and their worth is determined
    by actual execution of the resulting code
  • Drawback
  • Compilation time increases dramatically
  • (on average, 400 iterations take 16 minutes)

5
Cont...
  • Advantages
  • In the case of embedded applications, only one
    program is to be executed
  • The cost of compilation can be amortized over the
    number of systems shipped and the lifetime of the
    application
  • Does not suffer from undecidability issues

6
Iterative Compilation
  • The experimental setup: optimizations
  • Loop Tiling
  • Loop Unrolling
  • Array Padding

7
Tiling (also called blocking)
  • Dividing an iteration space into tiles to
    improve cache reuse
  • Each tile fits in the cache, thereby exploiting the
    available locality
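
As an illustration of tiling, here is a minimal sketch in Python (the presentation itself targets Fortran, so the function names and the choice of which loops to tile are assumptions made for this example). Both functions compute the same matrix product; the tiled version walks the j and k loops in blocks of `tile` elements so that a block of `b` is reused while it is still in the cache.

```python
def matmul_naive(a, b, n):
    """Reference n x n matrix multiply on lists of lists."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, n, tile):
    """Same computation with the j and k loops split into tiles of size `tile`."""
    c = [[0] * n for _ in range(n)]
    for jj in range(0, n, tile):          # first element of the current j tile
        for kk in range(0, n, tile):      # first element of the current k tile
            for i in range(n):
                # min() handles a last tile that is smaller than `tile`
                for j in range(jj, min(jj + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        c[i][j] += a[i][k] * b[k][j]
    return c
```

In interpreted Python the tiled version will not run faster; the sketch only shows how the iteration space is reordered. The tile size is the parameter that iterative compilation would search over.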

8
Loop Unrolling
  • Unrolling replicates the body of the loop some
    number of times
  • This number is called the unrolling factor (u)
  • Iterate by step u instead of step 1
  • Advantage
  • Reducing loop overhead
  • Increasing instruction parallelism
  • Improving memory system performance
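
A minimal sketch of unrolling with factor u = 4, written in Python for illustration only (the function names are hypothetical, and the benefits listed above apply to compiled code rather than to an interpreter). The main loop advances by 4 and a clean-up loop handles the leftover iterations when the length is not a multiple of 4.

```python
def dot(x, y):
    """Plain dot product: one multiply-add per loop iteration."""
    total = 0
    for i in range(len(x)):
        total += x[i] * y[i]
    return total


def dot_unrolled4(x, y):
    """Dot product with the loop body replicated 4 times (unroll factor u = 4)."""
    n = len(x)
    limit = n - n % 4            # largest multiple of 4 that is <= n
    total = 0
    for i in range(0, limit, 4):  # step u instead of step 1
        total += x[i] * y[i]
        total += x[i + 1] * y[i + 1]
        total += x[i + 2] * y[i + 2]
        total += x[i + 3] * y[i + 3]
    for i in range(limit, n):     # clean-up loop for the remaining 0-3 elements
        total += x[i] * y[i]
    return total
```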

9
Array Padding
  • Padding is used to reduce the number of memory
    system conflicts
  • by changing the size of the array
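
To see why changing the array size can remove conflicts, the sketch below models a direct-mapped cache and counts how many distinct cache sets are touched when walking along one row of a column-major (Fortran-style) array. The cache geometry (16 KB, 32-byte lines, direct-mapped) and the 8-byte element size are assumptions chosen for the example, not figures from the presentation.

```python
LINE_BYTES = 32                         # assumed cache line size
CACHE_BYTES = 16 * 1024                 # assumed cache capacity
NUM_SETS = CACHE_BYTES // LINE_BYTES    # direct-mapped: one line per set
ELEM_BYTES = 8                          # assumed double-precision elements


def distinct_sets(leading_dim, accesses=64):
    """Count cache sets touched by elements (0, 0), (0, 1), ... of a
    column-major array whose columns are `leading_dim` elements long."""
    sets = set()
    for j in range(accesses):
        address = j * leading_dim * ELEM_BYTES      # byte offset of element (0, j)
        sets.add((address // LINE_BYTES) % NUM_SETS)
    return len(sets)
```

Under these assumptions, `distinct_sets(2048)` is 1 because every access lands in the same set, while padding the leading dimension to 2052 spreads the 64 accesses over 64 different sets.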

10
  • 5 benchmarks
  • 3 general-purpose linear algebra routines
  • Matrix-Matrix Multiplication (MM), data sizes
    256, 300, 301.
  • Matrix-Vector Multiplication (MV), data sizes
    2048, 2300, 2301.
  • Successive Over-Relaxation (SOR), data sizes 128,
    150, 151.
  • 2 routines from multimedia applications
  • Forward Discrete Cosine Transform from MPEG-2
    (FDCT), data sizes 256, 300, 301.
  • Motion compensation routine from H.263 (RECO),
    data sizes 2048, 2300, 2301.

11
  • Target Platform
  • Pentium II at 233 MHz
  • Compiler
  • Fortran Compiler g77

12
The Compilation Process
  • Driver: keeps track of the different
    transformations evaluated so far and decides
    which transformation to apply next
  • SSL (strategy specification language): specifies
    the order in which to apply certain transformations
  • TDL (transformation definition language): the
    transformations used by the driver are specified
    in the TDL file
  • MT1: source-to-source compiler that starts the
    transformation process
  • Diagram flow: Driver (list of transformations) ->
    SSL file -> MT1 compiler (with TDL file) ->
    transformed program -> F77 -> target platform ->
    execution time -> Driver
13
The Transformation Space
  • The driver uses an N-dimensional array when N
    different optimizations need to be examined
  • This array represents the transformation space
  • Each point in this array corresponds to a
    specific set of parameters for the
    transformations
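
One way to picture this space is sketched below. A Python dict keyed by parameter tuples stands in for the N-dimensional array; the helper name `make_space` and the sample measurement are hypothetical, while the two ranges (tile sizes 1-100, unroll factors 1-20) are the ones quoted in the experimental setup of this presentation.

```python
from itertools import product


def make_space(*ranges):
    """Map every parameter combination to None, meaning 'not measured yet'."""
    return {point: None for point in product(*ranges)}


# Two dimensions: (tile size, unroll factor).
space = make_space(range(1, 101), range(1, 21))

# Record a hypothetical measured execution time (seconds) for one point.
space[(32, 4)] = 0.125

# Points the driver has already evaluated.
evaluated = [point for point, t in space.items() if t is not None]
```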

14
The Algorithm
  • The algorithm used by the driver to search the
    transformation space is based on a grid over this
    space
  • The grid search algorithm
  • 1. Define a coarse grid on the search space.
    Evaluate all points on this grid by generating
    the transformed programs and executing them
  • 2. Find the point with minimum execution time
    and all points that are within an allowable
    distance from this minimum

15
Cont...
  • 3. Order these points in a priority queue
    ordered by execution time
  • 4. For each point in the queue
  • If the execution time associated with this point
    is within the allowable distance from the minimum
    found so far, refine the grid around this point by
    forming a new grid with half the spacing in each
    dimension
  • If new points are found that are close to the
    minimum found so far, enqueue them in the priority
    queue
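
The four steps above can be sketched as follows. This is an illustrative reading of the slides, not the authors' implementation. Several details are assumptions: parameters are integers, "allowable distance" is taken as a multiplicative `tolerance` on execution time, refinement looks at the 3^N neighbours at half spacing, the centre point is re-queued so it can be refined again at finer spacing, and `max_evals` caps the budget. `evaluate` is a hypothetical callable that returns the measured time for one point.

```python
import heapq
from itertools import product


def grid_search(evaluate, bounds, spacing, tolerance=1.1, max_evals=400):
    """bounds: list of (lo, hi) per dimension; spacing: coarse grid step."""
    times = {}

    def measure(point):
        if point not in times:
            times[point] = evaluate(point)
        return times[point]

    # Step 1: evaluate every point on the coarse grid.
    axes = [range(lo, hi + 1, spacing) for lo, hi in bounds]
    for point in product(*axes):
        measure(point)

    # Steps 2-3: queue the points close to the minimum, ordered by time.
    best = min(times.values())
    queue = [(t, p, spacing) for p, t in times.items() if t <= tolerance * best]
    heapq.heapify(queue)

    # Step 4: refine around promising points with half the spacing.
    while queue and len(times) < max_evals:
        t, point, step = heapq.heappop(queue)
        half = step // 2
        if half == 0 or t > tolerance * min(times.values()):
            continue  # cannot refine further, or no longer close to the minimum
        for offset in product((-half, 0, half), repeat=len(bounds)):
            cand = tuple(c + o for c, o in zip(point, offset))
            inside = all(lo <= c <= hi for c, (lo, hi) in zip(cand, bounds))
            if cand in times or not inside:
                continue
            t_new = measure(cand)
            if t_new <= tolerance * min(times.values()):
                heapq.heappush(queue, (t_new, cand, half))
        heapq.heappush(queue, (t, point, half))  # assumption: refine centre again

    best_point = min(times, key=times.get)
    return best_point, times[best_point], len(times)
```

In the real system `evaluate` would generate, compile and run a transformed program; for experimenting with the search itself, any function of the parameters can be passed in.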

16
One step of the global driver
  • 1. Decide the next set of parameters for the
    transformations using its internal search space
    and a search algorithm
  • 2. Construct an SSL file that corresponds to this
    new sequence
  • 3. Invoke MT1, which starts the transformation
    process by reading in a source program, the SSL
    file and the TDL file

17
Cont ...
  • 4. The transformed program is compiled for the
    target architecture and executed
  • 5. The execution time is measured and reported
    back to the global driver
  • 6. The global driver stores this execution time
    and starts the next step
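
A minimal sketch of one such driver step is given below. The real driver writes an SSL file and invokes MT1 and the Fortran compiler as external tools; here those stages are replaced by two hypothetical callables, `choose_next` (the search algorithm) and `build_program` (transformation plus compilation, returning something runnable), so only the control flow and the timing are illustrated.

```python
import time


def driver_step(choose_next, build_program, results):
    """Run one iteration and record its execution time in `results`."""
    params = choose_next(results)        # step 1: pick the next parameters
    program = build_program(params)      # steps 2-4: stand-in for SSL + MT1 + compile
    start = time.perf_counter()
    program()                            # step 4: execute the transformed program
    elapsed = time.perf_counter() - start    # step 5: measure execution time
    results[params] = elapsed            # step 6: store it for the next decision
    return params, elapsed
```

Calling `driver_step` repeatedly with the same `results` dict gives the search algorithm access to every measurement made so far.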

18
Iterative Compilation
  • Experimental Setup - single data size
  • 2 Transformations
  • Loop Tiling with tile sizes 1-100
  • Loop Unrolling with unroll factors 1-20

20
The Results
  • Good results except in the case of SOR, with
    speedups of up to 3.4 in the case of MV
  • Finds good parameters quickly
  • Within 50 evaluations, close to the maximum
    (except MM and SOR, which need more than 100)
  • After 300 evaluations, no further improvement
  • This corresponds to 15% of the entire search space

21
MM
Data size   Improvement
256         2.32
300         1.85
301         1.85
Number of iterations: 100
22
MV
Data size   Improvement
2048        3.4
2300        1.7
2301        1.8
Number of iterations: 50
23
SOR
Data size   Improvement
128         1.0202
150         1.0203
151         1.017
Number of iterations: 100
24
FDCT
Data size   Improvement
256         1.17
300         1.22
301         1.221
Number of iterations: 50
25
RECO
Data size   Improvement
2048        1.37
2300        1.53
2301        1.40
Number of iterations: 50
26
Iterative Compilation
  • Experimental Setup - single data size
  • 3 Transformations
  • Loop Tiling with tile sizes 1-100
  • Loop Unrolling with unroll factors 1-20
  • Array Padding with pad sizes 1-10

27
The Results
  • Enlarges the transformation space by a factor of
    10
  • But speedups are obtained within the same number
    of iterations
  • In the case of MV a significantly larger speedup
    is found
  • In other cases a slightly smaller improvement

28
Cont...
  • 350 evaluations are required to obtain comparable
    or better results
  • Only 1.75% of the entire search space
  • No scaling up of the number of iterations
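
The search-space figures quoted in the results can be checked with a few lines of arithmetic. The sketch assumes each point in the space is one tile size, one unroll factor and (in the three-transformation setup) one pad size, using the ranges from the experimental setup slides.

```python
# Parameter ranges quoted in the experimental setup slides.
TILE_SIZES = 100      # tile sizes 1-100
UNROLL_FACTORS = 20   # unroll factors 1-20
PAD_SIZES = 10        # pad sizes 1-10

space_2 = TILE_SIZES * UNROLL_FACTORS   # tiling + unrolling
space_3 = space_2 * PAD_SIZES           # tiling + unrolling + padding


def fraction_searched(evaluations, space_size):
    """Percentage of the transformation space covered by `evaluations` points."""
    return 100.0 * evaluations / space_size


print(space_3 // space_2)                # 10   (enlargement factor)
print(fraction_searched(300, space_2))   # 15.0 (two transformations)
print(fraction_searched(350, space_3))   # 1.75 (three transformations)
```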

29
MM
Transformations   Number of iterations   Best improvement
3                 350                    2.19
2                 100                    2.32
30
MV
Transformations   Number of iterations   Best improvement
3                 350                    3.8
2                 50                     3.4
31
SOR
Transformations   Number of iterations   Best improvement
3                 350                    1.0208
2                 100                    1.0203
32
FDCT
Transformations   Number of iterations   Best improvement
3                 300                    1.221
2                 50                     1.221
33
RECO
Transformations   Number of iterations   Best improvement
3                 200                    1.53
2                 50                     1.53
34
Best Parameter Values
Dependency on data size
Interference among transformations
35
Iterative Compilation
  • Experimental Setup - multiple data sizes
  • In many cases profiling will yield a distribution
    of input data sizes
  • Hence, find the optimization that minimizes
    the average execution time
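
A minimal sketch of that selection criterion follows. The function name, the data layout and the numbers in the usage note are hypothetical; the slides only state that the goal is to minimize the average execution time over the profiled distribution of data sizes.

```python
def best_on_average(times, weights):
    """Pick the parameter set with the lowest expected execution time.

    times[params][size] is the measured time for that data size;
    weights[size] is the fraction of runs that use that data size.
    """
    def expected(params):
        return sum(weights[size] * times[params][size] for size in weights)

    return min(times, key=expected)
```

For example, with made-up timings `{(32, 4): {256: 1.0, 300: 3.0}, (16, 2): {256: 1.5, 300: 1.7}}` and equal weights for both sizes, the second parameter set wins even though the first is faster at size 256.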

36
Experimental Setup - multiple data sizes cont...
  • Use unrolling, tiling and padding
  • Use 4 benchmarks (all except SOR)
  • 3 data sizes for each of MM, FDCT, MV, RECO

37
Experimental Setup - multiple data sizes cont...
  • 4 data sizes
  • MM, FDCT: 250, 260, 290, 310
  • MV, RECO: 2000, 2100, 2200, 2400

38
The Results
  • Different optimizations are found
  • Still, they yield significant speedups
  • Many values for the parameters yield good
    speedups
  • The driver always finds a set of these good
    values
  • Hence, the technique (of searching) produces
    stable results
  • The optimization it finds is effective for a
    range of input data sizes

39
Multiple Data Sizes vs. Single Data Size
40
Compilation Time
  • Is it feasible?
  • Check the running time of the approach
  • For 400 iterations, the time ranges from 7.7
    minutes (MV) to 25.4 minutes (FDCT)
  • On average we need 16 minutes for 400 iterations
  • Can be seen as an integral component of the total
    development time of the embedded system
  • Thus we can afford several hours to heavily
    optimize the compute-intensive routines

41
Compilation Time cont...
In most cases execution time is the longest component
[Chart: time in minutes against number of iterations,
ranging from 7.7 minutes (MV) to 25.4 minutes (FDCT)]
42
Conclusions
  • We described a new approach to program
    optimization, namely
  • Iterative Compilation
  • Finds good optimizations by searching a relatively
    small fraction of the optimization space
  • With loop unrolling, tiling and array padding,
    350 evaluations suffice for a satisfactory
    optimization

43
Cont...
  • Which corresponds to 1.75% of the entire search
    space
  • On a Pentium II at 233 MHz it took 16 minutes on
    average to execute all 400 iterations
  • Very tolerable for embedded systems

44
What next?
  • Methods for reducing the number of iterations when
    the running time of the routine is much larger
  • Improve the search algorithm with a mix of static
    analysis and run-time information