Analytical models and intelligent search for program generation and optimization

Transcript and Presenter's Notes
1
Analytical models and intelligent search for
program generation and optimization
  • David Padua
  • Department of Computer Science
  • University of Illinois at Urbana-Champaign

2
Program optimization today
  • The optimization phase of a compiler applies a
    series of transformations to achieve its
    objectives.
  • The compiler uses program analysis to determine
    which transformations are correctness-preserving.
  • Compiler transformation and analysis techniques
    are reasonably well-understood.
  • Since many of the compiler optimization problems
    have exponential complexity, heuristics are
    needed to drive the application of
    transformations.

3
Optimization drivers
  • Developing driving heuristics is laborious.
  • One reason for this is the lack of methodologies
    and tools to build optimization drivers.
  • As a result, although there is much in common
    among compilers, their optimization phases are
    usually re-implemented from scratch.

4
Optimization drivers (Cont.)
  • As a result, machines and languages that are not
    widely popular usually lack good compilers (as do
    some popular systems).
  • DSP, network processor, and embedded system
    programming is often done in assembly language.
  • Evaluation of new architectural features
    requiring compiler involvement is not always
    meaningful.
  • Languages such as APL, MATLAB, and LISP suffer
    from chronically low performance.
  • New languages are difficult to introduce (although
    compilers are only a part of the problem).

5
A methodology based on the notion of search space
  • Program transformations often have several
    possible target versions.
  • Loop unrolling: how many times?
  • Loop tiling: the size of the tile.
  • Loop interchanging: the order of the loop headers.
  • Register allocation: which registers are spilled
    to memory to make room for new values.
  • The process of optimization can be seen as a
    search in the space of possible program versions.

6
Empirical search / Iterative compilation
  • Perhaps the simplest application of the search
    space model is empirical search where several
    versions are generated and executed on the target
    machine. The fastest version is selected.

T. Kisuki, P. M. W. Knijnenburg, M. F. P. O'Boyle,
and H. A. G. Wijshoff. Iterative compilation in
program optimization. In Proc. CPC 2000, pages
35-44, 2000.
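To make the idea concrete, here is a minimal C sketch of empirical search, assuming the candidate versions (three hand-unrolled variants of a vector-scaling kernel) have already been generated; real iterative compilation generates and compiles the candidate versions automatically:

/* Empirical search sketch: time each pre-generated version and
 * keep the fastest one. The kernel and unroll factors are
 * illustrative placeholders. */
#include <stdio.h>
#include <time.h>

#define N 1000000
static double x[N];

static void scale_u1(double a) { for (int i = 0; i < N; i++) x[i] *= a; }
static void scale_u2(double a) { for (int i = 0; i < N; i += 2) { x[i] *= a; x[i+1] *= a; } }
static void scale_u4(double a) { for (int i = 0; i < N; i += 4) { x[i] *= a; x[i+1] *= a; x[i+2] *= a; x[i+3] *= a; } }

int main(void) {
    void (*version[])(double) = { scale_u1, scale_u2, scale_u4 };
    int best = 0;
    double best_time = 1e30;
    for (int v = 0; v < 3; v++) {
        clock_t t0 = clock();
        for (int rep = 0; rep < 10; rep++) version[v](1.0001);
        double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (t < best_time) { best_time = t; best = v; }
    }
    printf("fastest version: %d (%.3f s)\n", best, best_time);
    return 0;
}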
7
Empirical search and traditional compilers
  • Searching is not a new approach; compilers have
    applied it in the past, but using architectural
    prediction models instead of actual runs.
  • KAP searched for the best loop header order.
  • SGI's MIPSpro and IBM's PowerPC compilers select
    the best degree of unrolling.

8
Limitations of empirical search
  • Empirical search is conceptually simple and
    portable.
  • However,
  • the search space tends to be too large, especially
    when several transformations are combined.
  • It is not clear how to apply this method when
    program behavior is a function of the input data
    set.
  • Need heuristics/search strategies.
  • Availability of performance formulas could help
    evaluate transformations across input data sets
    and facilitate search.

9
Program/library generators
  • An on-going effort at Illinois focuses on
    program generators.
  • The objectives are:
  • To develop better program generators.
  • To improve our understanding of the optimization
    process without the need to worry about program
    analysis.

10
Compilers and Library Generators
[Diagram: both an Algorithm (via Program Generation) and a
Source Program feed an Internal representation, on which
Program Transformation then operates.]
11
Empirical search in program/library generators
  • Examples:
  • FFTW: M. Frigo, S. Johnson
  • Spiral (FFT/signal processing): J. Moura (CMU),
    M. Veloso (CMU), J. Johnson (Drexel)
  • ATLAS (linear algebra): R. Whaley, A. Petitet,
    J. Dongarra
  • PHiPAC: J. Demmel et al.
  • Sorting: X. Li, M. Garzaran (Illinois)

12
Techniques presented in the rest of the talk
  • Analytical models (ATLAS):
  • Pure
  • Combined with search
  • Pure search strategies:
  • Data-independent performance (Spiral)
  • Data-dependent performance (Sorting)

13
I. Analytical models and ATLAS
  • Joint work with G. DeJong (Illinois), M.
    Garzaran, and K. Pingali (Cornell)

14
ATLAS
  • A linear algebra library generator. ATLAS =
    Automatically Tuned Linear Algebra Software.
  • At installation time, it searches for the best
    parameters of a matrix-matrix multiplication (MMM)
    routine.
  • We studied ATLAS and modified the system to
    replace the search with an analytical model that
    identifies the best MMM parameters without the
    need for search.

15
The modified version of ATLAS
  • Original ATLAS Infrastructure
  • Model-Based ATLAS Infrastructure

[Diagram. Original ATLAS: Detect Hardware Parameters ->
ATLAS Search Engine (MMSearch) -> ATLAS MM Code Generator
(MMCase). Model-based ATLAS: Detect Hardware Parameters ->
Model -> ATLAS MM Code Generator (MMCase).]
16
Detecting Machine Parameters
  • Micro-benchmarks:
  • L1Size: L1 data cache size
  • Measured with a micro-benchmark similar to the one
    in the Hennessy-Patterson book
  • NR: number of registers
  • Measured by using several FP temporaries repeatedly
  • MulAdd: fused multiply-add (FMA)
  • Detected by timing c += a*b as opposed to
    t = a*b; c += t
  • Latency: latency of FP multiplication
  • Needed for scheduling multiplies and adds in the
    absence of FMA
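A minimal C sketch of such a micro-benchmark for L1Size, assuming a strided pointer chase whose time per access jumps once the array outgrows L1; ATLAS's actual benchmarks differ in detail:

/* Hypothetical L1-size micro-benchmark: walk arrays of increasing
 * size and watch for the jump in time per access when the working
 * set no longer fits in the L1 cache. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    for (size_t kb = 4; kb <= 1024; kb *= 2) {
        size_t n = kb * 1024 / sizeof(int);
        int *a = malloc(n * sizeof(int));
        /* link elements into a stride-16 cycle (one element per
         * 64-byte line) to defeat simple prefetching */
        for (size_t i = 0; i < n; i++) a[i] = (int)((i + 16) % n);
        int j = 0;
        clock_t t0 = clock();
        for (long it = 0; it < 10000000L; it++) j = a[j];
        double ns = (double)(clock() - t0) / CLOCKS_PER_SEC * 1e9 / 1e7;
        printf("%4zu KB: %.2f ns/access (j=%d)\n", kb, ns, j);
        free(a);
    }
    return 0;
}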

17
Compiler View
  • ATLAS Code Generation
  • Focus on MMM (as part of BLAS-3)
  • Very good reuse: O(N²) data, O(N³) computation
  • Many optimization opportunities
  • Few real dependencies
  • Will run poorly on modern machines
  • Poor use of cache and registers
  • Poor use of processor pipelines

ATLAS MMCode Generator (MMCase)
for (int j = 0; j < N; j++)
  for (int i = 0; i < N; i++)
    for (int k = 0; k < N; k++)
      C[i][j] += A[i][k] * B[k][j];
18
Characteristics of the code as generated by ATLAS
  • Cache-level blocking (tiling)
  • Atlas blocks only for L1 cache
  • Register-level blocking
  • Highest level of memory hierarchy
  • Important to hold array values in registers
  • Software pipelining
  • Unroll and schedule operations
  • Versioning
  • Dynamically decide which way to compute
  • Back-end compiler optimizations
  • Scalar Optimizations
  • Instruction Scheduling

19
Cache Tiling for MMM
  • Tiling in ATLAS
  • Only square NB×NB×NB tiles
  • The working set of a tile must fit in the L1 cache
  • Tiles are usually copied first into a contiguous
    buffer
  • Special clean-up code is generated for the boundaries

[Figure: NB×NB tiles of A, B, and C; the mini-MMM multiplies a
tile of A by a tile of B and accumulates into a tile of C.]

Mini-MMM:
for (int j = 0; j < NB; j++)
  for (int i = 0; i < NB; i++)
    for (int k = 0; k < NB; k++)
      C[i][j] += A[i][k] * B[k][j];

Optimization parameter: NB
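A minimal C sketch of this tiling, with min-bounds serving as the clean-up code for boundary tiles; the copying of tiles into a contiguous buffer is omitted, and NB = 40 is an arbitrary placeholder value:

/* NB-tiled MMM with clean-up for matrix edges. ATLAS's generated
 * code additionally copies each tile into a contiguous buffer. */
#define NB 40

void mmm_tiled(int N, double *A, double *B, double *C) {
    for (int jj = 0; jj < N; jj += NB)
        for (int ii = 0; ii < N; ii += NB)
            for (int kk = 0; kk < N; kk += NB) {
                /* mini-MMM over one tile; the min bounds handle
                 * tiles that hang over the matrix edge */
                int jmax = jj + NB < N ? jj + NB : N;
                int imax = ii + NB < N ? ii + NB : N;
                int kmax = kk + NB < N ? kk + NB : N;
                for (int j = jj; j < jmax; j++)
                    for (int i = ii; i < imax; i++)
                        for (int k = kk; k < kmax; k++)
                            C[i*N + j] += A[i*N + k] * B[k*N + j];
            }
}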
20
IJK version (large cache)
  • DO I = 1, N   // row-major storage
  •   DO J = 1, N
  •     DO K = 1, N
  •       C(I,J) = C(I,J) + A(I,K)*B(K,J)

[Figure: a row of A and a column of B produce one element of C.]

  • Large cache scenario:
  • Matrices are small enough to fit into cache
  • Only cold misses, no capacity misses
  • Miss ratio:
  • Data size = 3N²
  • Each miss brings in b floating-point numbers
  • Miss ratio = (3N²/b) / 4N³ = 0.75/(b·N) ≈ 0.019
    (b = 4, N = 10)

21
IJK version (small cache)
  • DO I = 1, N
  •   DO J = 1, N
  •     DO K = 1, N
  •       C(I,J) = C(I,J) + A(I,K)*B(K,J)

[Figure: the same access pattern as above, with matrices now
larger than the cache.]

  • Small cache scenario:
  • Matrices are large compared to the cache
  • Reuse distance is not O(1) => misses
  • Cold and capacity misses
  • Miss ratio:
  • C: N²/b misses (good temporal locality)
  • A: N³/b misses (good spatial locality)
  • B: N³ misses (poor temporal and spatial locality)
  • Miss ratio ≈ 0.25 (b+1)/b = 0.3125 (for b = 4)

22
Register tiling for Mini-MMM
Micro-MMM: an MU×1 sub-matrix of A times a 1×NU sub-matrix
of B, accumulated into an MU×NU sub-matrix of C, with
MU·NU + MU + NU ≤ NR.

Mini-MMM code after register tiling and unrolling:

for (int j = 0; j < NB; j += NU)
  for (int i = 0; i < NB; i += MU) {
    load C[i..i+MU-1, j..j+NU-1] into registers
    for (int k = 0; k < NB; k++) {
      load A[i..i+MU-1, k] into registers
      load B[k, j..j+NU-1] into registers
      multiply A's and B's and add to C's
    }
    store C[i..i+MU-1, j..j+NU-1]
  }

Unroll the k loop KU times.
Optimization parameters: MU, NU, KU
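A concrete instance of the register tiling, as a C sketch with MU = NU = 2 and the 2×2 tile of C held in scalar variables; it assumes nb is a multiple of 2 (ATLAS generates code of this shape for whatever MU and NU it selects):

/* 2x2 register-tiled mini-MMM (MU = NU = 2). */
void mini_mmm_2x2(int nb, double A[nb][nb], double B[nb][nb],
                  double C[nb][nb]) {
    for (int j = 0; j < nb; j += 2)
        for (int i = 0; i < nb; i += 2) {
            /* hold the 2x2 tile of C in registers */
            double c00 = C[i][j],   c01 = C[i][j+1];
            double c10 = C[i+1][j], c11 = C[i+1][j+1];
            for (int k = 0; k < nb; k++) {
                double a0 = A[i][k], a1 = A[i+1][k];   /* MUx1 of A */
                double b0 = B[k][j], b1 = B[k][j+1];   /* 1xNU of B */
                c00 += a0 * b0; c01 += a0 * b1;
                c10 += a1 * b0; c11 += a1 * b1;
            }
            C[i][j]   = c00; C[i][j+1]   = c01;
            C[i+1][j] = c10; C[i+1][j+1] = c11;
        }
}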
23
Scheduling
Micro-MMM body, with the k loop unrolled KU times:

for (int k = 0; k < NB; k += KU) {
  load A[i..i+MU-1, k] into registers
  load B[k, j..j+NU-1] into registers
  multiply A's and B's and add to C's
  ... (repeated KU times)
}

  • If the processor has a combined multiply-add, use it
  • Otherwise, schedule multiplies and adds separately:
  • interleave M1 M2 ... M(MU·NU) and A1 A2 ... A(MU·NU),
    after skewing the additions by Latency
  • Schedule the first IFetch loads of a micro-MMM at the
    end of the previous micro-MMM
  • Schedule the remaining loads of the micro-MMM in blocks
    of NFetch
  • (the memory pipeline can support only a small number
    of outstanding loads)
  • Optimization parameters: MulAdd, Latency, xFetch

24
High-level picture
  • A multi-dimensional optimization problem:
  • Independent parameters: NB, MU, NU, KU, ...
  • Dependent variable: MFlops
  • The function from parameters to the variable is given
    implicitly; it can be evaluated repeatedly
  • One optimization strategy: orthogonal range search
  • Optimize along one dimension at a time, using
    reference values for the parameters not yet optimized
  • Not guaranteed to find the optimal point, but might
    come close (see the sketch below)
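A minimal C sketch of orthogonal range search; benchmark() is a synthetic stand-in for "generate, compile, run, and time a version" (the real dependent variable is measured MFlops), and NPARAMS, lo, hi, and the peak location are illustrative:

/* Orthogonal range search: optimize one parameter at a time,
 * holding the others at reference values. */
#include <stdio.h>

#define NPARAMS 3

/* synthetic performance surface with its peak at (40, 4, 4) */
static double benchmark(const int p[NPARAMS]) {
    static const int peak[NPARAMS] = { 40, 4, 4 };
    double s = 0.0;
    for (int d = 0; d < NPARAMS; d++)
        s -= (double)(p[d] - peak[d]) * (p[d] - peak[d]);
    return s;
}

static void orthogonal_search(int p[NPARAMS], const int lo[NPARAMS],
                              const int hi[NPARAMS]) {
    for (int d = 0; d < NPARAMS; d++) {        /* one dimension at a time */
        int best = p[d];
        double best_perf = benchmark(p);
        for (int v = lo[d]; v <= hi[d]; v++) { /* range search on dim d */
            p[d] = v;
            double perf = benchmark(p);
            if (perf > best_perf) { best_perf = perf; best = v; }
        }
        p[d] = best;   /* freeze this dimension at its best value */
    }
}

int main(void) {
    int p[NPARAMS] = { 16, 1, 1 };             /* reference values */
    const int lo[NPARAMS] = { 16, 1, 1 }, hi[NPARAMS] = { 80, 8, 8 };
    orthogonal_search(p, lo, hi);
    printf("best point: %d %d %d\n", p[0], p[1], p[2]);
    return 0;
}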

25
Specification of OR Search
  • The order in which dimensions are optimized
  • Reference values for the un-optimized dimensions at
    any step
  • The interval in which the range search is done for
    each dimension

26
Search strategy
  • Find Best NB
  • Find Best MU NU
  • Find Best KU
  • Find Best xFetch
  • Find Best Latency (lat)
  • Find non-copy version tile size (NCNB)

27
Find Best NB
  • Search in the following range:
  • 16 < NB < 80 and NB² < L1Size
  • In this search, use simple estimates for the other
    parameters
  • (e.g.) KU: test each candidate NB with
  • full K unrolling (KU = NB)
  • no K unrolling (KU = 1)

28
Finding other parameters
  • Find the best MU, NU: try all MU, NU that satisfy
  •   1 ≤ MU, NU ≤ NB and MU·NU + MU + NU ≤ NR
  • In this step, use the best NB from the previous step
  • Find the best KU
  • Find the best Latency:
  • try values between 1 and 6
  • Find the best xFetch:
  • IFetch in [2, MU·NU], NFetch in [1, MU·NU - IFetch]
29
Model-based estimation of optimization parameter values

[Diagram, two pipelines. Original ATLAS: Detect Hardware
Parameters (L1Size, NR, MulAdd, Latency) -> ATLAS Search
Engine (MMSearch) -> ATLAS MM Code Generator (MMCase) ->
mini-MMM source; candidate versions are executed and their
MFLOPS measured to drive the search, with parameters NB, MU,
NU, KU, Latency, xFetch, MulAdd. Model-based ATLAS: Detect
Hardware Parameters (L1Size, L1 I-Cache, NR, MulAdd,
Latency) -> Model Parameter Estimator (MMModel) -> ATLAS MM
Code Generator (MMCase) -> mini-MMM source, with the same
parameters but no execute-and-measure feedback loop.]
30
High-level picture
  • NB: a hierarchy of models:
  • Find the largest NB for which there are no capacity
    or conflict misses
  • Find the largest NB for which there are no capacity
    misses, assuming optimal replacement
  • Find the largest NB for which there are no capacity
    misses, assuming LRU replacement
  • MU, NU: estimate from the number of registers, making
    them roughly equal, subject to MU·NU + MU + NU ≤ NR
    (see the sketch below)
  • KU: maximize subject to the I-cache size
  • Latency: from the hardware parameter
  • xFetch: set to 2
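A minimal C sketch of the model-based estimation for NB, MU, and NU, assuming only the simplest cache condition (3·NB² doubles fit in L1) and the register constraint above; the real MMModel refines these estimates through the hierarchy of cache models:

/* Model-based parameter estimation sketch. */
#include <math.h>
#include <stdio.h>

void estimate_params(long l1_bytes, int nr, int *NB, int *MU, int *NU) {
    /* largest square tile with all three operands resident in L1 */
    *NB = (int)sqrt((double)l1_bytes / sizeof(double) / 3.0);
    /* make MU and NU roughly equal: mu*nu + mu + nu <= nr,
     * i.e. (mu+1)*(nu+1) <= nr+1 */
    int mu = (int)(sqrt((double)nr + 1.0) - 1.0);
    if (mu < 1) mu = 1;
    int nu = (nr - mu) / (mu + 1);   /* largest nu for this mu */
    if (nu < 1) nu = 1;
    *MU = mu; *NU = nu;
}

int main(void) {
    int nb, mu, nu;
    estimate_params(32 * 1024, 32, &nb, &mu, &nu);  /* 32 KB L1, 32 regs */
    printf("NB=%d MU=%d NU=%d\n", nb, mu, nu);
    return 0;
}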
31
Largest NB for no capacity/conflict misses
  • Tiles are copied into contiguous memory
  • Condition for cold misses only:
  • 3·NB² ≤ L1Size

[Figure: the three NB×NB tiles of A, B, and C resident in the
L1 cache at once.]
32
Largest NB for no capacity misses
  • MMM:
  • for (int j = 0; j < N; j++)
  •   for (int i = 0; i < N; i++)
  •     for (int k = 0; k < N; k++)
  •       C[i][j] += A[i][k] * B[k][j];
  • Cache model:
  • Fully associative
  • Line size: 1 word
  • Optimal replacement
  • Bottom line: N² + N + 1 < C
  • One full matrix
  • One row / column
  • One element

33
Extending the Model
  • Line size > 1:
  • Spatial locality
  • Array layout in memory matters
  • Bottom line, depending on the loop order: either
  •   ⌈N²/b⌉ + ⌈N/b⌉ + 1 ≤ C/b
  • or
  •   ⌈N²/b⌉ + N + 1 ≤ C/b
  • (with line size b, the matrix and a reused row take one
    line per b elements; a reused column takes one line per
    element)

34
Extending the Model (cont.)
  • LRU (not optimal) replacement
  • MMM sample:
  • for (int j = 0; j < N; j++)
  •   for (int i = 0; i < N; i++)
  •     for (int k = 0; k < N; k++)
  •       C[i][j] += A[i][k] * B[k][j];
  • Bottom line: a tighter bound on N than under optimal
    replacement

35
Experiments
  • Architectures
  • SGI R12K, 270MHz
  • Sun UltraSparcIII, 900MHz
  • Intel PIII, 550MHz
  • Measurements:
  • Mini-MMM performance
  • Complete MMM performance
  • Sensitivity to variations of the parameters

36
Installation time of ATLAS and Model
37
Parameter values
[Table comparing the parameter values selected by ATLAS and
by the model.]
38
Mini-MMM Performance
39
SGI Performance
40
TLB effects are important when matrix size is
large.
41
Sun Performance
42
Pentium III Performance
43
Sensitivity to tile size (SGI)
[Chart: performance vs. tile size, marking the L2-cache
conflict-free tile and the L2-cache model tile.]
Higher levels of the memory hierarchy cannot be ignored.
44
Sensitivity to tile size (Sun)
45
But... results are not always perfect
  • We recently conducted several experiments on
    other machines.
  • We considered this a blind test to check the
    effectiveness of our approach.
  • In these experiments, the search strategy
    sometimes does better than the model.

46
Recent experiments: Itanium 2
47
Recent experiments: Pentium 4
48
Hybrid approaches
  • We are studying two strategies that combine the model
    with search.
  • First, use the model to find a first approximation to
    the parameter values, then use hill climbing to refine
    them.
  • Second, assume a general shape for the performance
    curve and use curve fitting to find the optimal point.

49
II. Intelligent Search and Sorting
  • Joint work with Xiaoming Li
    and M. Garzaran

50
Sorting
  • Generating sorting libraries is an interesting
    problem for several reasons.
  • It differs from the problems of ATLAS and Spiral
    in that performance depends on the
    characteristics of the input data.
  • It is not as clearly decomposable as the linear
    algebra problems.

51
Outline Sorting
  • Part I: Selecting one of several possible pure
    sorting algorithms at runtime
  • Motivation
  • Sorting Algorithms
  • Factors
  • Empirical Search and Runtime Adaptation
  • Experimental Results
  • Part II: Building a hybrid sorting algorithm
  • Primitives
  • Searching approaches

52
Motivation
  • Theoretical complexity does not suffice to
    evaluate sorting algorithms
  • Cache effects
  • Instruction count
  • The performance of sorting algorithms depends on
    the characteristics of the input
  • Number of records
  • Input distribution

53
What we accomplished in this work
  • Identified architectural and runtime factors that
    affect the performance of sorting algorithms.
  • Developed an empirical search strategy to identify
    the best shape and parameter values of each
    sorting algorithm.
  • Developed a strategy to choose at runtime the
    best sorting algorithm for a specific input data
    set.

54
Performance vs. Distribution
55
Performance vs. Distribution
56
Performance vs. Sdev
57
Performance vs. Sdev
58
Outline Sorting
  • Part I: Select the best algorithm
  • Motivation
  • Sorting Algorithms
  • Factors
  • Empirical Search and Runtime Adaptation
  • Experimental Results
  • Part II: Build the best algorithm
  • Primitives
  • Searching approaches

59
Quicksort
  • Set guardians at both ends of the input array.
  • Eliminate recursion.
  • Choose the median of three as the pivot.
  • Use insertion sort for small partitions.

60
Radix sort
  • Non-comparison algorithm

[Figure: a worked example of radix sort on the vector
31 1 12 23 33 4, showing the per-digit counters, the
accumulated bucket offsets, and the vector after one pass.]
61
Cache Conscious Radix Sort
Pseudocode for CC-radix (a C sketch follows):

CC-radix(bucket)
  if bucket fits in cache level L then
    radix sort(bucket)
  else
    sub-buckets = reverse sorting(bucket)
    for each sub-bucket in sub-buckets
      CC-radix(sub-bucket)
    endfor
  endif
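A C sketch of the same idea for 32-bit keys, partitioning by the most significant byte until a bucket fits in the cache; insertion sort stands in for the LSD radix sort that the real algorithm applies to cache-resident buckets, and CACHE_BYTES is a placeholder constant:

/* CC-radix-style MSD partitioning sketch. The caller supplies a
 * scratch buffer tmp of the same length as a. */
#include <stddef.h>
#include <string.h>

#define CACHE_BYTES (32 * 1024)

static void insertion_sort(unsigned *a, size_t n) {
    for (size_t i = 1; i < n; i++) {
        unsigned v = a[i]; size_t j = i;
        while (j > 0 && a[j-1] > v) { a[j] = a[j-1]; j--; }
        a[j] = v;
    }
}

static void cc_radix(unsigned *a, unsigned *tmp, size_t n, int shift) {
    if (n * sizeof(unsigned) <= CACHE_BYTES || shift < 0) {
        insertion_sort(a, n);            /* bucket fits in cache */
        return;
    }
    size_t count[256] = {0}, start[256], pos[256];
    for (size_t i = 0; i < n; i++) count[(a[i] >> shift) & 0xFF]++;
    start[0] = 0;
    for (int d = 1; d < 256; d++) start[d] = start[d-1] + count[d-1];
    memcpy(pos, start, sizeof pos);
    for (size_t i = 0; i < n; i++)       /* scatter into sub-buckets */
        tmp[pos[(a[i] >> shift) & 0xFF]++] = a[i];
    memcpy(a, tmp, n * sizeof(unsigned));
    for (int d = 0; d < 256; d++)        /* recurse on each sub-bucket */
        cc_radix(a + start[d], tmp, count[d], shift - 8);
}

void sort_keys(unsigned *a, unsigned *tmp, size_t n) {
    cc_radix(a, tmp, n, 24);             /* start at the top byte */
}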

62
Multiway Merge Sort
63
Sorting algorithms for small partitions
  • Insertion sort
  • Apply register blocking to the sorting algorithm ->
    a register sorting network (see the sketch below)
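A hypothetical sketch of a register sorting network for four keys: a fixed sequence of compare-exchange steps on scalar variables (registers), so the comparison pattern is independent of the data:

/* 4-element sorting network held in scalar variables. */
static inline void cmp_swap(int *x, int *y) {
    if (*x > *y) { int t = *x; *x = *y; *y = t; }
}

void sort4(int *a) {
    int r0 = a[0], r1 = a[1], r2 = a[2], r3 = a[3];
    cmp_swap(&r0, &r1); cmp_swap(&r2, &r3);  /* sort the two pairs */
    cmp_swap(&r0, &r2); cmp_swap(&r1, &r3);  /* settle min and max */
    cmp_swap(&r1, &r2);                      /* fix the middle pair */
    a[0] = r0; a[1] = r1; a[2] = r2; a[3] = r3;
}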

64
Outline
  • Part I: Select the best algorithm
  • Motivation
  • Sorting Algorithms
  • Factors
  • Empirical Search and Runtime Adaptation
  • Experimental Results
  • Part II: Build the best algorithm
  • Primitives
  • Searching approaches

65
Cache Size/TLB Size
  • Quicksort: use multiple pivots to tile
  • CC-radix:
  • Fit each partition into the cache
  • Keep the number of active partitions < TLB size
  • Multiway merge sort:
  • The heap should fit in the cache
  • Sorted runs should fit in the cache

66
Number of Registers
  • Register Blocking

67
Cache Line Size
  • To optimize the shift-down operation of the heap

68
Amount of Data to Sort
  • Quicksort:
  • Cache misses increase with the amount of data.
  • CC-radix:
  • As the amount of data increases, CC-radix needs more
    partitioning passes. After a certain threshold, the
    performance drops dramatically.
  • Multiway merge sort:
  • Only profitable for large amounts of data, when the
    reduction in the number of cache misses can
    compensate for the increased number of operations
    with respect to Quicksort.

69
Distribution of the Data
  • The goal is to distinguish the performance of the
    comparison-based algorithms versus the radix-based
    ones.
  • Distribution shape: Uniform, Normal, Exponential, ...
  • Not a good criterion.
  • Distribution width:
  • Standard deviation (sdev)
  • Only good for one-peak distributions
  • Expensive to calculate
  • Entropy:
  • Represents the distribution of each bit (see the
    sketch below)
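A minimal C sketch of computing an entropy vector from a sample of the input, assuming one Shannon-entropy value per byte-sized digit of a 32-bit key; the paper computes a per-digit entropy in the same spirit:

/* Entropy vector: one value per radix digit. A nearly constant
 * digit has entropy near 0; a uniformly distributed digit has
 * entropy near 8 bits. */
#include <math.h>

void entropy_vector(const unsigned *sample, int n, double e[4]) {
    for (int d = 0; d < 4; d++) {            /* one digit at a time */
        int count[256] = {0};
        for (int i = 0; i < n; i++)
            count[(sample[i] >> (8 * d)) & 0xFF]++;
        e[d] = 0.0;
        for (int v = 0; v < 256; v++)
            if (count[v] > 0) {
                double p = (double)count[v] / n;
                e[d] -= p * log2(p);         /* Shannon entropy, bits */
            }
    }
}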

70
Outline
  • Part I: Select the best algorithm
  • Motivation
  • Sorting Algorithms
  • Factors
  • Empirical Search and Runtime Adaptation
  • Experimental Results
  • Part II: Build the best algorithm
  • Primitives
  • Searching approaches

71
Library adaptation
  • Architectural factors, handled by empirical search:
  • Cache / TLB size
  • Number of registers
  • Cache line size
  • Runtime factors, handled by machine learning and
    runtime adaptation:
  • Amount of data to sort
  • Distribution
  • Distribution shape of the data: does not matter
72
The Library
  • Building the library -> installation time:
  • Empirical search
  • Learning procedure
  • Use of training data
  • Running the library -> runtime:
  • Runtime procedure
  • The learning and runtime procedures together
    implement the runtime adaptation.
73
Runtime Adaptation
  • Has two parts: at installation time and at runtime
  • Goal function: f(N, E) -> {Multiway Merge Sort
    (parameterized by heap size sh and fanout f),
    Quicksort, CC-radix}
  • N: amount of input data
  • E: the entropy vector
  • For a given (N, E), identify the best configuration
    for Multiway Merge Sort as a function of
    size_of_heap and fanout.

74
Runtime Adaptation
  • f(N, E) is a linearly separable problem.
  • A linearly separable problem f(x1, x2, ..., xn) is a
    decision problem for which there exists a weight
    vector w and a threshold θ such that the decision is
    given by the sign of w1·x1 + ... + wn·xn - θ.
  • The runtime adaptation code is generated at the end
    of installation; it implements the learned f(N, E)
    and selects the best configuration for Multiway
    Merge Sort. (A learning sketch follows.)
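A hypothetical sketch of the installation-time learning step using the classic perceptron rule; the actual training procedure may differ in detail, and the labels (+1 for "CC-radix was fastest", -1 otherwise) are illustrative:

/* Perceptron-style learning of the weight vector w and
 * threshold theta from labeled training samples. */
#define DIGITS 4

typedef struct { double e[DIGITS]; int label; } Sample;

void train(const Sample *s, int n, double w[DIGITS], double *theta) {
    for (int d = 0; d < DIGITS; d++) w[d] = 0.0;
    *theta = 0.0;
    for (int epoch = 0; epoch < 1000; epoch++)
        for (int i = 0; i < n; i++) {
            double sum = -*theta;
            for (int d = 0; d < DIGITS; d++) sum += w[d] * s[i].e[d];
            int pred = sum >= 0 ? 1 : -1;
            if (pred != s[i].label) {      /* misclassified: update */
                for (int d = 0; d < DIGITS; d++)
                    w[d] += s[i].label * s[i].e[d];
                *theta -= s[i].label;
            }
        }
}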

75
Runtime Adaptation Learning Procedure
  • Goal function:
  • f(N, E) -> {Multiway Merge Sort, Quicksort, CC-radix}
  • N: amount of input data
  • E: the entropy vector
  • Use the entropy to learn the best algorithm
    between CC-radix and one of the other two
  • Output: a weight vector (w) and a threshold (θ)
    for each value of N
  • Then, use N to choose between Multiway Merge Sort
    and Quicksort

76
Runtime AdaptationRuntime Procedure
  • Sample the input array
  • Compute the entropy vector
  • Compute S ?i wi entropyi
  • If S ?
  • choose CC-radix
  • else
  • choose others
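A minimal C sketch of this runtime procedure, assuming the weights, threshold, and the merge_threshold constant for choosing between the comparison sorts were all produced at installation time (e.g., by the training sketch above), and that the entropy vector was computed from a sample as sketched earlier:

/* Runtime algorithm selection from (N, E). */
#define DIGITS 4

typedef enum { CC_RADIX, MULTIWAY_MERGE, QUICKSORT } Algo;

Algo select_algorithm(const double e[DIGITS], long n,
                      const double w[DIGITS], double theta,
                      long merge_threshold) {
    double s = 0.0;
    for (int d = 0; d < DIGITS; d++) s += w[d] * e[d];
    if (s >= theta) return CC_RADIX;     /* radix wins on this input */
    /* otherwise use N to pick between the comparison sorts;
     * merge_threshold is a hypothetical installation-time constant */
    return n >= merge_threshold ? MULTIWAY_MERGE : QUICKSORT;
}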

77
Outline
  • Part I: Select the best algorithm
  • Motivation
  • Sorting Algorithms
  • Factors
  • Empirical Search and Runtime Adaptation
  • Experimental Results
  • Part II: Build the best algorithm
  • Primitives
  • Searching approaches

78
Setup
79
Performance Comparison
Pentium III Xeon, 16 M keys (float)
80
Sun UltraSparcIII
81
IBM Power3
82
Intel PIII Xeon
83
SGI R12000
84
Conclusion
  • Identified the architectural and runtime factors
  • Used empirical search to find the best parameter
    values
  • Our machine learning techniques proved to be
    quite effective:
  • The library always selects the best algorithm.
  • A wrong decision would introduce a 37% average
    performance degradation
  • Overhead: 5% on average, 7% in the worst case

85
Outline
  • Part I: Select the best algorithm
  • Motivation
  • Sorting Algorithms
  • Factors
  • Empirical Search and Runtime Adaptation
  • Experimental Results
  • Part II: Build the best algorithm
  • Primitives
  • Searching approaches

86
Primitives
  • Categorize sorting algorithms:
  • Partition by pivots: Quicksort, Bubble Sort, ...
  • Partition by size: Merge Sort, Selection Sort
  • Partition by radix: Radix Sort, CC-radix
  • Construct a sorting algorithm using these
    primitives.
  • Adapt the sorting algorithm for parallel
    environments and for specific applications.
  • Extend the machine learning approach to other
    algorithms.
  • Develop a search language to rapidly develop
    empirical optimization strategies for any
    algorithm.

[Figure: a composite sorting algorithm as a tree of
primitives (DP, DV, DP).]
87
Searching approaches
  • The composite sorting algorithms have the shape
    of trees.
  • Every primitive has parameters.
  • The search mechanism must be able to search both
    the shape and the parameter values.
  • A genetic algorithm is a good choice (though
    perhaps not the only one).

88
Results
89
Results
90
(No Transcript)
91
SPIRAL
  • The approach:
  • Mathematical formulation of signal processing
    algorithms
  • Automatically generate versions of an algorithm
  • A generalization of the well-known FFTW
  • Use compiler techniques to translate formulas into
    implementations
  • Adapt to the target platform by searching for the
    optimal version

92
(No Transcript)
93
Fast DSP Algorithms As Matrix Factorizations
  • Computing y = F4 x is carried out as:
  • t1 = A4 x ( permutation )
  • t2 = A3 t1 ( two F2's )
  • t3 = A2 t2 ( diagonal scaling )
  • y = A1 t3 ( two F2's )
  • The cost is reduced because A1, A2, A3, and A4 are
    structured sparse matrices.
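A worked instance of such a factorization in C for F4, assuming the standard radix-2 splitting F4 = (F2 ⊗ I2) · T · (I2 ⊗ F2) · L, with L the stride-by-2 permutation and T = diag(1, 1, 1, -i); each stage is one of the structured sparse factors above:

/* Four-point DFT via the staged sparse factorization. */
#include <complex.h>
#include <stdio.h>

void f4(const double complex x[4], double complex y[4]) {
    /* t1 = L x: stride permutation (A4) */
    double complex t1[4] = { x[0], x[2], x[1], x[3] };
    /* t2 = (I2 (x) F2) t1: two butterflies (A3) */
    double complex t2[4] = { t1[0] + t1[1], t1[0] - t1[1],
                             t1[2] + t1[3], t1[2] - t1[3] };
    /* t3 = T t2: diagonal scaling by the twiddle factors (A2) */
    double complex t3[4] = { t2[0], t2[1], t2[2], -I * t2[3] };
    /* y = (F2 (x) I2) t3: two interleaved butterflies (A1) */
    y[0] = t3[0] + t3[2]; y[1] = t3[1] + t3[3];
    y[2] = t3[0] - t3[2]; y[3] = t3[1] - t3[3];
}

int main(void) {
    double complex x[4] = { 1, 2, 3, 4 }, y[4];
    f4(x, y);   /* expect the 4-point DFT: 10, -2+2i, -2, -2-2i */
    for (int i = 0; i < 4; i++)
        printf("%g%+gi\n", creal(y[i]), cimag(y[i]));
    return 0;
}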

94
Tensor Product Formulation of Cooley-Tukey
  • Theorem:
  •   F(mn) = (F(m) ⊗ I(n)) · T(mn,n) · (I(m) ⊗ F(n)) · L(mn,m)
  • where T(mn,n) is a diagonal (twiddle-factor) matrix
    and L(mn,m) is a stride permutation
  • Example: F4 = (F2 ⊗ I2) · T(4,2) · (I2 ⊗ F2) · L(4,2)
95
Formulas for Matrix Factorizations
Rules R1 and R2 factor F(n) recursively,
where n = n1 · ... · nk, n(i-) = n1 · ... · n(i-1),
n(i+) = n(i+1) · ... · nk
96
Factorization Trees
Different computation orders give different data access
patterns, and therefore different performance.
97
Walsh-Hadamard Transform
98
Optimal Factorization Trees
  • Depend on the platform
  • Difficult to deduce analytically
  • Can be found by empirical search
  • The search space is very large
  • Different search algorithms:
  • Random, DP, GA, hill-climbing, exhaustive

99
(No Transcript)
100
(No Transcript)
101
Size of Search Space
n      # of formulas        n       # of formulas
2^1    1                    2^9     20793
2^2    1                    2^10    103049
2^3    3                    2^11    518859
2^4    11                   2^12    2646723
2^5    45                   2^13    13649969
2^6    197                  2^14    71039373
2^7    903                  2^15    372693519
2^8    4279                 2^16    1968801519
102
(No Transcript)
103
(No Transcript)
104
More Search Choices
  • Programming choices:
  • Loop unrolling
  • Memory allocation
  • In-lining
  • Platform choices:
  • Compiler optimization options

105
The SPIRAL System
[Diagram: DSP Transform -> Formula Generator -> SPL Program
-> SPL Compiler -> C/FORTRAN Programs -> Performance
Evaluation on the Target machine; the measurements feed the
Search Engine, which drives the Formula Generator; the
resulting routines form the DSP Library.]
106
Spiral
  • Spiral does the factorization at installation
    time and generates one library routine for each
    size.
  • FFTW only generates codelets (input size ≤ 64)
    and performs the factorization at run time.

107
A Simple SPL Program
; This is a simple SPL program        <- comment
(define A (matrix (1 2) (2 1)))       <- definition
(define B (diagonal (3 3)))           <- definition
#subname simple                       <- directive
(tensor (I 2) (compose A B))          <- formula
;; This is an invisible comment
108
Templates
(template
  (F n)                       ; pattern
  [n > 1]                     ; condition
  ( do i = 0, n-1             ; i-code
      y(i) = 0
      do j = 0, n-1
        y(i) = y(i) + W(n, i*j) * x(j)
      end
    end ))
109
SPL Compiler
[Diagram: the SPL formula and the template definitions are
parsed into an abstract syntax tree and a template table;
intermediate code generation produces I-code, which
intermediate code restructuring and optimization
successively rewrite; target code generation then emits
FORTRAN or C.]
110
Intermediate Code Restructuring
  • Loop unrolling
  • The degree of unrolling can be controlled globally
    or case by case
  • Scalar function evaluation
  • Replace scalar functions with a constant value or
    an array access
  • Type conversion
  • Type of input data: real or complex
  • Type of arithmetic: real or complex
  • Same SPL formula, different C/Fortran programs
    (see the sketch below)
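A hypothetical before/after illustration of this restructuring for the (F 2) template: the loop fully unrolled and the scalar function W replaced by its constant values:

/* generic template body for (F 2), before restructuring:
 *   do i = 0, 1
 *     y(i) = 0
 *     do j = 0, 1
 *       y(i) = y(i) + W(2, i*j) * x(j)
 */
void f2_unrolled(const double *x, double *y) {
    /* W(2,0) = 1 and W(2,1) = -1 folded into the code */
    y[0] = x[0] + x[1];
    y[1] = x[0] - x[1];
}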

111
(No Transcript)
112
Optimizations
  • Formula Generator: high-level scheduling, loop
    transformations
  • SPL Compiler: high-level optimizations (constant
    folding, copy propagation, common subexpression
    elimination, dead code elimination)
  • C/Fortran Compiler: low-level optimizations
    (instruction scheduling, register allocation)
113
Basic Optimizations (FFT, N = 2^5, SPARC, f77 -fast -O5)
114
Basic Optimizations (FFT, N = 2^5, MIPS, f77 -O3)
115
Basic Optimizations (FFT, N = 2^5, PII, g77 -O6 -malign-double)
116
Performance Evaluation
  • Evaluate the performance of the code generated
    by the SPL compiler
  • Platforms: SPARC, MIPS, PII
  • Search strategy: dynamic programming

117
Pseudo MFlops
  • An estimate of the number of FP operations
  • FFT (radix-2): 5·n·log2(n) - 10·n + 16
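A small C helper based on the operation count above (as reconstructed here); pseudo MFLOPS charges every implementation with the same nominal count, whatever it actually executes:

/* Pseudo MFLOPS from the nominal radix-2 operation count. */
#include <math.h>

double pseudo_mflops(double n, double seconds) {
    double flops = 5.0 * n * log2(n) - 10.0 * n + 16.0;
    return flops / (seconds * 1e6);
}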

118
FFT Performance (N = 2^1 to 2^6)
SPARC
MIPS
PII
119
FFT Performance (N = 2^7 to 2^20)
SPARC
MIPS
PII