Title: Analytical models and intelligent search for program generation and optimization
1Analytical models and intelligent search for program generation and optimization
- David Padua
- Department of Computer Science
- University of Illinois at Urbana-Champaign
2Program optimization today
- The optimization phase of a compiler applies a series of transformations to achieve its objectives.
- The compiler uses program analysis to determine which transformations are correctness-preserving.
- Compiler transformation and analysis techniques are reasonably well understood.
- Since many compiler optimization problems have exponential complexity, heuristics are needed to drive the application of transformations.
3Optimization drivers
- Developing driving heuristics is laborious.
- One reason for this is the lack of methodologies and tools to build optimization drivers.
- As a result, although there is much in common among compilers, their optimization phases are usually re-implemented from scratch.
4Optimization drivers (Cont.)
- As a result, machines and languages that are not widely popular usually lack good compilers (some popular systems too).
- DSP, network processor, and embedded system programming is often done in assembly language.
- Evaluation of new architectural features requiring compiler involvement is not always meaningful.
- Languages such as APL, MATLAB, and LISP suffer from chronically low performance.
- New languages are difficult to introduce (although compilers are only a part of the problem).
5A methodology based on the notion of search space
- Program transformations often have several possible target versions:
- Loop unrolling: how many times to unroll.
- Loop tiling: size of the tile.
- Loop interchange: order of the loop headers.
- Register allocation: which registers are stored in memory to make room for new values.
- The process of optimization can be seen as a search in the space of possible program versions.
6Empirical search / Iterative compilation
- Perhaps the simplest application of the search-space model is empirical search, where several versions are generated and executed on the target machine. The fastest version is selected.
T. Kisuki, P. M. W. Knijnenburg, M. F. P. O'Boyle, and H. A. G. Wijshoff. Iterative compilation in program optimization. In Proc. CPC2000, pages 35-44, 2000.
7Empirical search and traditional compilers
- Searching is not a new approach, and compilers have applied it in the past, but using architectural prediction models instead of actual runs:
- KAP searched for the best loop header order.
- SGI's MIPSpro and IBM PowerPC compilers select the best degree of unrolling.
8Limitations of empirical search
- Empirical search is conceptually simple and portable.
- However,
- the search space tends to be too large, especially when several transformations are combined.
- It is not clear how to apply this method when program behavior is a function of the input data set.
- Need heuristics/search strategies.
- Availability of performance formulas could help evaluate transformations across input data sets and facilitate the search.
9Program/library generators
- An on-going effort at Illinois focuses on program generators.
- The objectives are:
- To develop better program generators.
- To improve our understanding of the optimization process without the need to worry about program analysis.
10Compilers and Library Generators
(diagram: Algorithm -> Program Generation -> Internal representation -> Program Transformation -> Source Program)
11Empirical search in program/library generators
- Examples:
- FFTW: M. Frigo, S. Johnson
- Spiral (FFT/signal processing): J. Moura (CMU), M. Veloso (CMU), J. Johnson (Drexel)
- ATLAS (linear algebra): R. Whaley, A. Petitet, J. Dongarra
- PHiPAC: J. Demmel et al.
- Sorting: X. Li, M. Garzaran (Illinois)
12Techniques presented in the rest of the talk
- Analytical models (ATLAS)
- Pure
- Combined with search
- Pure search strategies
- Data independent performance (Spiral)
- Data dependent performance (Sorting)
13I. Analytical models and ATLAS
- Joint work with G. DeJong (Illinois), M. Garzaran, and K. Pingali (Cornell)
14ATLAS
- A linear algebra library generator. ATLAS = Automatically Tuned Linear Algebra Software.
- At installation time, it searches for the best parameters of a matrix-matrix multiplication (MMM) routine.
- We studied ATLAS and modified the system to replace the search with an analytical model that identifies the best MMM parameters without the need for search.
15The modified version of ATLAS
- Original ATLAS infrastructure: Detect Hardware Parameters -> ATLAS Search Engine (MMSearch) -> ATLAS MM Code Generator (MMCase)
- Model-based ATLAS infrastructure: Detect Hardware Parameters -> Model -> ATLAS MM Code Generator (MMCase)
16Detecting Machine Parameters
- Micro-benchmarks
- L1Size: L1 data cache size
- Similar to the Hennessy-Patterson book
- NR: number of registers
- Use several FP temporaries repeatedly
- MulAdd: fused multiply-add (FMA)
- c += a*b as opposed to t = a*b; c += t
- Latency: latency of FP multiplication
- Needed for scheduling multiplies and adds in the absence of FMA
17Compiler View
- ATLAS code generation
- Focus on MMM (as part of BLAS-3)
- Very good reuse: O(N^2) data, O(N^3) computation
- Many optimization opportunities
- Few real dependencies
- Will run poorly on modern machines
- Poor use of cache and registers
- Poor use of processor pipelines
ATLAS MM Code Generator (MMCase)
for (int j = 0; j < N; j++)
  for (int i = 0; i < N; i++)
    for (int k = 0; k < N; k++)
      C[i][j] += A[i][k] * B[k][j];
18Characteristics of the code as generated by ATLAS
- Cache-level blocking (tiling)
- Atlas blocks only for L1 cache
- Register-level blocking
- Highest level of memory hierarchy
- Important to hold array values in registers
- Software pipelining
- Unroll and schedule operations
- Versioning
- Dynamically decide which way to compute
- Back-end compiler optimizations
- Scalar Optimizations
- Instruction Scheduling
19Cache Tiling for MMM
- Tiling in ATLAS:
- Only square NBxNBxNB tiles
- The working set of a tile must fit in the L1 cache
- Tiles are usually copied first into a contiguous buffer
- Special clean-up code is generated for the boundaries
Mini-MMM:
for (int j = 0; j < NB; j++)
  for (int i = 0; i < NB; i++)
    for (int k = 0; k < NB; k++)
      C[i][j] += A[i][k] * B[k][j];
(figure: NBxNB tiles of A, B, and C)
Optimization parameter: NB
20IJK version (large cache)
DO I = 1, N   // row-major storage
  DO J = 1, N
    DO K = 1, N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)
(figure: access patterns of A, B, and C)
- Large cache scenario:
- Matrices are small enough to fit into the cache
- Only cold misses, no capacity misses
- Miss ratio:
- Data size: 3N^2
- Each miss brings in b floating-point numbers
- Miss ratio = (3N^2/b) / 4N^3 = 0.75/(bN) = 0.019 (b = 4, N = 10)
21IJK version (small cache)
DO I = 1, N
  DO J = 1, N
    DO K = 1, N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)
(figure: access patterns of A, B, and C)
- Small cache scenario:
- Matrices are large compared to the cache
- Reuse distance is not O(1) => miss
- Cold and capacity misses
- Miss ratio:
- C: N^2/b misses (good temporal locality)
- A: N^3/b misses (good spatial locality)
- B: N^3 misses (poor temporal and spatial locality)
- Miss ratio = 0.25(b+1)/b = 0.3125 (for b = 4)
22Register tiling for Mini-MMM
Micro-MMM: an MUx1 sub-matrix of A, a 1xNU sub-matrix of B, and an MUxNU sub-matrix of C, with MU*NU + MU + NU <= NR.
Mini-MMM code after register tiling and unrolling:
for (int j = 0; j < NB; j += NU)
  for (int i = 0; i < NB; i += MU) {
    load C[i..i+MU-1, j..j+NU-1] into registers
    for (int k = 0; k < NB; k++) {
      load A[i..i+MU-1, k] into registers
      load B[k, j..j+NU-1] into registers
      multiply A's and B's and add to C's
    }
    store C[i..i+MU-1, j..j+NU-1]
  }
Unroll the k loop KU times.
Optimization parameters: MU, NU, KU
23Scheduling
Micro-MMM:
for (int k = 0; k < NB; k += KU) {
  load A[i..i+MU-1, k] into registers
  load B[k, j..j+NU-1] into registers
  multiply A's and B's and add to C's
}
- If the processor has a combined multiply-add, use it.
- Otherwise, schedule multiplies and adds separately: interleave M1 M2 ... M(MU*NU) and A1 A2 ... A(MU*NU), skewing the additions by Latency.
- Schedule IFetch initial loads for one micro-MMM at the end of the previous micro-MMM.
- Schedule the remaining loads for the micro-MMM in blocks of NFetch.
- The memory pipeline can support only a small number of outstanding loads.
- Optimization parameters: MulAdd, Latency, xFetch
24High-level picture
- Multi-dimensional optimization problem:
- Independent parameters: NB, MU, NU, KU, ...
- Dependent variable: MFLOPS
- The function from parameters to the variable is given implicitly; it can be evaluated repeatedly.
- One optimization strategy: orthogonal range search.
- Optimize along one dimension at a time, using reference values for parameters not yet optimized.
- Not guaranteed to find the optimal point, but might come close.
25Specification of OR Search
- Order in which dimensions are optimized
- Reference values for un-optimized dimensions at any step
- Interval in which the range search is done for each dimension
26Search strategy
- Find Best NB
- Find Best MU NU
- Find Best KU
- Find Best xFetch
- Find Best Latency (lat)
- Find non-copy version tile size (NCNB)
27Find Best NB
- Search in the following range:
- 16 <= NB <= 80
- NB^2 <= L1Size
- In this search, use simple estimates for the other parameters (e.g., KU). Test each candidate with:
- Full K unrolling (KU = NB)
- No K unrolling (KU = 1)
28Finding other parameters
- Find best MU, NU: try all MU, NU that satisfy 1 <= MU, NU <= NB and MU*NU + MU + NU <= NR.
- In this step, use the best NB from the previous step.
- Find best KU.
- Find best Latency: values between 1 and 6.
- Find best xFetch: IFetch in [2, MU*NU], NFetch in [1, MU*NU - IFetch].
29Model-based estimation of optimization parameter values
- Original ATLAS (diagram): Detect Hardware Parameters (L1Size, NR, MulAdd, Latency) -> ATLAS Search Engine (MMSearch), which drives the ATLAS MM Code Generator (MMCase), executes the generated mini-MMM, and measures MFLOPS to pick NB, MU, NU, KU, xFetch, MulAdd, and Latency.
- Model-based ATLAS (diagram): Detect Hardware Parameters (L1Size, L1 I-Cache, NR, MulAdd, Latency) -> Model Parameter Estimator (MMModel) -> ATLAS MM Code Generator (MMCase), which emits the mini-MMM source directly, without executing candidate versions.
30High-level picture
- NB: hierarchy of models
- Find the largest NB for which there are no capacity or conflict misses
- Find the largest NB for which there are no capacity misses, assuming optimal replacement
- Find the largest NB for which there are no capacity misses, assuming LRU replacement
- MU, NU: estimate from the number of registers, making them roughly equal, subject to MU*NU + MU + NU <= NR
- KU: maximize subject to the I-cache size
- Latency: from the hardware parameter
- xFetch: set to 2
31Largest NB for no capacity/conflict misses
- Tiles are copied into contiguous memory
- Condition for cold misses only: 3*NB^2 <= L1Size
(figure: NBxNB tiles of A, B, and C)
32Largest NB for no capacity misses
- MMM:
for (int j = 0; j < N; j++)
  for (int i = 0; i < N; i++)
    for (int k = 0; k < N; k++)
      c[i][j] += a[i][k] * b[k][j];
- Cache model:
- Fully associative
- Line size: 1 word
- Optimal replacement
- Bottom line: N^2 + N + 1 <= C
- One full matrix
- One row / column
- One element
33Extending the Model
- Line size > 1
- Spatial locality
- Array layout in memory matters
- Bottom line: depends on the loop order
- either
- or
34Extending the Model (cont.)
- LRU (not optimal) replacement
- MMM sample:
for (int j = 0; j < N; j++)
  for (int i = 0; i < N; i++)
    for (int k = 0; k < N; k++)
      c[i][j] += a[i][k] * b[k][j];
- Bottom line:
35Experiments
- Architectures
- SGI R12K, 270MHz
- Sun UltraSparcIII, 900MHz
- Intel PIII, 550MHz
- Measure
- Mini-MMM performance
- Complete MMM performance
- Sensitivity to variations on parameters
36Installation time of ATLAS and Model
37Parameter values
ATLAS
Model
38Mini-MMM Performance
39SGI Performance
40TLB effects are important when the matrix size is large.
41Sun Performance
42Pentium III Performance
43Sensitivity to tile size (SGI)
L2 cache conflict-free tile
L2 cache Model tile
Higher levels of the memory hierarchy cannot be ignored.
44Sensitivity to tile size (Sun)
45But... results are not always perfect
- We recently conducted several experiments on other machines.
- We considered this a blind test of the effectiveness of our approach.
- In these experiments, the search strategy sometimes does better than the model.
46Recent experiments: Itanium 2
47Recent experiments: Pentium 4
48Hybrid approaches
- We are studying two strategies that combine the model with search.
- First, the model can be used to find a first approximation to the parameter values, which hill climbing then refines.
- Second, assume a general shape for the performance curve and use curve fitting to find the optimal point.
49II. Intelligent Search and Sorting
- Joint work with Xiaoming Li and M. Garzaran
50Sorting
- Generating sorting libraries is an interesting problem for several reasons:
- It differs from the problems of ATLAS and Spiral in that performance depends on the characteristics of the input data.
- It is not as clearly decomposable as the linear algebra problems.
51Outline: Sorting
- Part I: Selecting one of several pure sorting algorithms at runtime
- Motivation
- Sorting Algorithms
- Factors
- Empirical Search and Runtime Adaptation
- Experiment Results
- Part II: Building a hybrid sorting algorithm
- Primitives
- Searching approaches
52Motivation
- Theoretical complexity does not suffice to evaluate sorting algorithms:
- Cache effects
- Number of instructions
- The performance of sorting algorithms depends on the characteristics of the input:
- Number of records
- Input distribution
53What we accomplished in this work
- Identified architectural and runtime factors that affect the performance of sorting algorithms.
- Developed an empirical search strategy to identify the best shape and parameter values of each sorting algorithm.
- Developed a strategy to choose at runtime the best sorting algorithm for a specific input data set.
54Performance vs. Distribution
55Performance vs. Distribution
56Performance vs. Sdev
57Performance vs. Sdev
58Outline Sorting
- Part I: Select the best algorithm
- Motivation
- Sorting Algorithms
- Factors
- Empirical Search and Runtime Adaptation
- Experiment Results
- Part II: Build the best algorithm
- Primitives
- Searching approaches
59Quicksort
- Set guardians at both ends of the input array.
- Eliminate recursion.
- Choose the median of three as the pivot.
- Use insertion sort for small partitions.
60Radix sort
(figure: radix sort of the vector 31, 1, 12, 23, 33, 4, distributing the keys into buckets one digit at a time)
61Cache Conscious Radix Sort
- Pseudocode for CC-radix:
CC-radix(bucket)
  if bucket fits in the L1 cache then
    radix-sort(bucket)
  else
    sub-buckets = reverse-sorting(bucket)
    for each sub-bucket in sub-buckets
      CC-radix(sub-bucket)
    endfor
  endif
62Multiway Merge Sort
63Sorting algorithms for small partitions
- Insertion sort
- Apply register blocking to the sorting algorithm -> register sorting network
64Outline
- Part I: Select the best algorithm
- Motivation
- Sorting Algorithms
- Factors
- Empirical Search and Runtime Adaptation
- Experiment Results
- Part II: Build the best algorithm
- Primitives
- Searching approaches
65Cache Size / TLB Size
- Quicksort: use multiple pivots to tile
- CC-radix:
- Fit each partition into the cache
- Keep the number of active partitions < TLB size
- Multiway Merge Sort:
- The heap should fit in the cache
- Sorted runs should fit in the cache
66Number of Registers
67Cache Line Size
- To optimize the shift-down operation
68Amount of Data to Sort
- Quicksort: cache misses increase with the amount of data.
- CC-radix: as the amount of data increases, CC-radix needs more partitioning passes. After a certain threshold, performance drops dramatically.
- Multiway Merge Sort: only profitable for large amounts of data, when the reduction in the number of cache misses compensates for the increased number of operations with respect to Quicksort.
69Distribution of the Data
- The goal is to distinguish the performance of the comparison-based algorithms from that of the radix-based ones.
- Distribution shape (Uniform, Normal, Exponential, ...): not a good criterion.
- Distribution width:
- Standard deviation (sdev)
- Only good for one-peak distributions
- Expensive to calculate
- Entropy:
- Represents the distribution of each bit
70Outline
- Part I: Select the best algorithm
- Motivation
- Sorting Algorithms
- Factors
- Empirical Search and Runtime Adaptation
- Experiment Results
- Part II: Build the best algorithm
- Primitives
- Searching approaches
71Library adaptation
- Architectural factors (cache/TLB size, number of registers, cache line size) -> Empirical Search
- Runtime factors:
- Distribution shape of the data -> does not matter
- Amount of data to sort, distribution -> machine learning and runtime adaptation
72The Library
- Building the library (installation time):
- Empirical search
- Learning procedure
- Use of training data
- Running the library (runtime):
- Runtime adaptation procedure
73Runtime Adaptation
- Has two parts: one at installation time and one at runtime.
- Goal function: f(N, E) -> { Multiway Merge Sort(sh, f), Quicksort, CC-radix }
- N: amount of input data
- E: the entropy vector
- For a given (N, E), identify the best configuration for Multiway Merge Sort as a function of size_of_heap and fanout.
74Runtime Adaptation
- f(N, E) is a linearly separable problem.
- A linearly separable problem f(x1, x2, ..., xn) is a decision problem for which there exists a weight vector w and a threshold such that the decision depends only on whether the weighted sum of the xi exceeds the threshold.
- The runtime adaptation code is generated at the end of installation to implement the learned f(N, E) and select the best configuration for Multiway Merge Sort.
75Runtime Adaptation: Learning Procedure
- Goal function: f(N, E) -> { Multiway Merge Sort, Quicksort, CC-radix }
- N: amount of input data
- E: the entropy vector
- Use the entropy to choose between CC-radix and one of the other two.
- Output: a weight vector (w) and a threshold (theta) for each value of N.
- Then use N to choose between Multiway Merge Sort and Quicksort.
76Runtime Adaptation: Runtime Procedure
- Sample the input array.
- Compute the entropy vector.
- Compute S = sum_i wi * entropy_i.
- If S >= theta, choose CC-radix; else choose one of the others.
77Outline
- Part I: Select the best algorithm
- Motivation
- Sorting Algorithms
- Factors
- Empirical Search and Runtime Adaptation
- Experiment Results
- Part II: Build the best algorithm
- Primitives
- Searching approaches
78Setup
79Performance Comparison
Pentium III Xeon, 16 M keys (float)
80Sun UltraSparcIII
81IBM Power3
82Intel PIII Xeon
83SGI R12000
84Conclusion
- Identified the architectural and runtime factors that affect sorting performance.
- Used empirical search to find the best parameter values.
- Our machine learning techniques proved to be quite effective:
- They almost always select the best algorithm.
- A wrong decision introduces a 37% average performance degradation.
- Overhead: 5% on average, 7% in the worst case.
85Outline
- Part I: Select the best algorithm
- Motivation
- Sorting Algorithms
- Factors
- Empirical Search and Runtime Adaptation
- Experiment Results
- Part II: Build the best algorithm
- Primitives
- Searching approaches
86Primitives
- Categorize sorting algorithms:
- Partition by pivots: Quicksort, Bubble Sort, ...
- Partition by size: Merge Sort, Selection Sort
- Partition by radix: Radix Sort, CC-radix
- Construct a sorting algorithm using these primitives.
- Adapt the sorting algorithm for parallel environments and for specific applications.
- Extend the machine learning approach to other algorithms.
- Develop a search language to rapidly develop empirical optimization strategies for any algorithm.
87Searching approaches
- The composite sorting algorithms have the shape of trees.
- Every primitive has parameters.
- The search mechanism must be able to search both the shape and the parameter values.
- A genetic algorithm is a good choice (though perhaps not the only one).
88Results
89Results
91SPIRAL
- The approach:
- Mathematical formulation of signal processing algorithms
- Automatically generate algorithm versions
- A generalization of the well-known FFTW
- Use compiler techniques to translate formulas into implementations
- Adapt to the target platform by searching for the optimal version
93Fast DSP Algorithms As Matrix Factorizations
- Computing y = F4 x is carried out as:
- t1 = A4 x (permutation)
- t2 = A3 t1 (two F2's)
- t3 = A2 t2 (diagonal scaling)
- y = A1 t3 (two F2's)
- The cost is reduced because A1, A2, A3, and A4 are structured sparse matrices.
94Tensor Product Formulation of Cooley-Tukey
F(mn) = (F(m) tensor I(n)) T (I(m) tensor F(n)) L, where T is a diagonal (twiddle-factor) matrix and L is a stride permutation.
95Formulas for Matrix Factorizations
R1:
R2:
where n = n1...nk, n(i-) = n1...n(i-1), and n(i+) = n(i+1)...nk
96Factorization Trees
- Different computation order => different data access pattern => different performance.
97Walsh-Hadamard Transform
98Optimal Factorization Trees
- Depend on the platform
- Difficult to deduce
- Can be found by empirical search
- The search space is very large
- Different search algorithms: random, DP, GA, hill-climbing, exhaustive
101Size of Search Space
N      # of formulas      N      # of formulas
2^1    1                  2^9    20793
2^2    1                  2^10   103049
2^3    3                  2^11   518859
2^4    11                 2^12   2646723
2^5    45                 2^13   13649969
2^6    197                2^14   71039373
2^7    903                2^15   372693519
2^8    4279               2^16   1968801519
104More Search Choices
- Programming:
- Loop unrolling
- Memory allocation
- In-lining
- Platform choices:
- Compiler optimization options
105The SPIRAL System
(diagram) DSP Transform -> Formula Generator -> SPL Program -> SPL Compiler -> C/FORTRAN Programs -> Performance Evaluation -> feedback to the Search Engine, which steers the Formula Generator; the result is a DSP Library tuned to the target machine.
106Spiral
- Spiral does the factorization at installation time and generates one library routine for each size.
- FFTW only generates codelets (input size <= 64) and performs the factorization at run time.
107A Simple SPL Program
; This is a simple SPL program
(define A (matrix (1 2) (2 1)))        ; Definition
(define B (diagonal (3 3)))            ; Definition
#subname simple                        ; Directive
(tensor (I 2) (compose A B))           ; Formula
; This is an invisible comment
108Templates
(template
  (F n)                         ; Pattern
  [ n > 1 ]                     ; Condition
  ( do i = 0, n-1               ; I-code
      y(i) = 0
      do j = 0, n-1
        y(i) = y(i) + W(n, i*j) * x(j)
      end
    end ))
109SPL Compiler
(diagram) SPL Formula + Template Definition -> Parsing -> Abstract Syntax Tree + Template Table -> Intermediate Code Generation -> I-Code -> Intermediate Code Restructuring -> I-Code -> Optimization -> I-Code -> Target Code Generation -> FORTRAN, C
110Intermediate Code Restructuring
- Loop unrolling: the degree of unrolling can be controlled globally or case by case.
- Scalar function evaluation: replace scalar functions with a constant value or an array access.
- Type conversion:
- Type of input data: real or complex
- Type of arithmetic: real or complex
- The same SPL formula yields different C/Fortran programs.
112Optimizations
- Formula Generator: high-level scheduling, loop transformations
- SPL Compiler: high-level optimizations (constant folding, copy propagation, CSE, dead code elimination)
- C/Fortran Compiler: low-level optimizations (instruction scheduling, register allocation)
113Basic Optimizations (FFT, N = 2^5, SPARC, f77 -fast -O5)
114Basic Optimizations (FFT, N = 2^5, MIPS, f77 -O3)
115Basic Optimizations (FFT, N = 2^5, PII, g77 -O6 -malign-double)
116Performance Evaluation
- Evaluate the performance of the code generated by the SPL compiler.
- Platforms: SPARC, MIPS, PII
- Search strategy: dynamic programming
117Pseudo MFlops
- Estimate of the number of FP operations
- FFT (radix-2): 5 n log2(n)
118FFT Performance (N = 2^1 to 2^6)
SPARC
MIPS
PII
119FFT Performance (N = 2^7 to 2^20)
SPARC
MIPS
PII