Title: Analytical models and intelligent search for program generation and optimization
1Analytical models and intelligent search for program generation and optimization
- David Padua
- Department of Computer Science
- University of Illinois at Urbana-Champaign
2Program optimization today
- The optimization phase of a compiler applies a series of transformations to achieve its objectives.
- The compiler uses program analysis to determine which transformations are correctness-preserving.
- Compiler transformation and analysis techniques are reasonably well understood.
- Since many compiler optimization problems have exponential complexity, heuristics are needed to drive the application of transformations.
3Optimization drivers
- Developing driving heuristics is laborious.
- One reason for this is the lack of methodologies and tools to build optimization drivers.
- As a result, although there is much in common among compilers, their optimization phases are usually re-implemented from scratch.
4Optimization drivers (Cont.)
- As a result, machines and languages that are not widely popular usually lack good compilers (some popular systems too).
- DSP, network processor, and embedded system programming is often done in assembly language.
- Evaluation of new architectural features requiring compiler involvement is not always meaningful.
- Languages such as APL, MATLAB, and LISP suffer from chronically low performance.
- New languages are difficult to introduce (although compilers are only a part of the problem).
5A methodology based on the notion of search space
- Program transformations often have several possible target versions:
- Loop unrolling: how many times to unroll.
- Loop tiling: size of the tile.
- Loop interchange: order of the loop headers.
- Register allocation: which registers are stored in memory to make room for new values.
- The process of optimization can be seen as a search in the space of possible program versions.
6Empirical search / Iterative compilation
- Perhaps the simplest application of the search-space model is empirical search, where several versions are generated and executed on the target machine. The fastest version is selected.
T. Kisuki, P. M. W. Knijnenburg, M. F. P. O'Boyle, and H. A. G. Wijshoff. Iterative compilation in program optimization. In Proc. CPC2000, pages 35-44, 2000.
7Empirical search and traditional compilers
- Searching is not a new approach, and compilers have applied it in the past, but using architectural prediction models instead of actual runs:
- KAP searched for the best loop header order.
- SGI's MIPSpro and IBM PowerPC compilers select the best degree of unrolling.
8Limitations of empirical search
- Empirical search is conceptually simple and portable.
- However,
- the search space tends to be too large, especially when several transformations are combined.
- It is not clear how to apply this method when program behavior is a function of the input data set.
- Need heuristics/search strategies.
- Availability of performance formulas could help evaluate transformations across input data sets and facilitate the search.
9Program/library generators
- An on-going effort at Illinois focuses on program generators.
- The objectives are:
- To develop better program generators.
- To improve our understanding of the optimization process without the need to worry about program analysis.
10Compilers and Library Generators
(diagram: Algorithm -> Program Generation -> Internal representation -> Program Transformation -> Source Program)
11Empirical search in program/library generators
- Examples:
- FFTW: M. Frigo, S. Johnson
- Spiral (FFT/signal processing): J. Moura (CMU), M. Veloso (CMU), J. Johnson (Drexel)
- ATLAS (linear algebra): R. Whaley, A. Petitet, J. Dongarra
- PHiPAC: J. Demmel et al.
- Sorting: X. Li, M. Garzaran (Illinois)
12Techniques presented in the rest of the talk
- Analytical models (ATLAS)
- Pure
- Combined with search
- Pure search strategies
- Data independent performance (Spiral)
- Data dependent performance (Sorting)
13I. Analytical models and ATLAS
- Joint work with G. DeJong (Illinois), M. Garzaran, and K. Pingali (Cornell)
14ATLAS
- A linear algebra library generator. ATLAS = Automatically Tuned Linear Algebra Software.
- At installation time, it searches for the best parameters of a matrix-matrix multiplication (MMM) routine.
- We studied ATLAS and modified the system to replace the search with an analytical model that identifies the best MMM parameters without the need for search.
15The modified version of ATLAS
- Original ATLAS infrastructure: Detect Hardware Parameters -> ATLAS Search Engine (MMSearch) -> ATLAS MM Code Generator (MMCase)
- Model-based ATLAS infrastructure: Detect Hardware Parameters -> Model -> ATLAS MM Code Generator (MMCase)
16Detecting Machine Parameters
- Micro-benchmarks
- L1Size: L1 data cache size
- Similar to the Hennessy-Patterson book
- NR: number of registers
- Use several FP temporaries repeatedly
- MulAdd: fused multiply-add (FMA)
- c += a*b as opposed to t = a*b; c += t
- Latency: latency of FP multiplication
- Needed for scheduling multiplies and adds in the absence of FMA
17Compiler View
- ATLAS code generation
- Focus on MMM (as part of BLAS-3)
- Very good reuse: O(N^2) data, O(N^3) computation
- Many optimization opportunities
- Few real dependencies
- Will run poorly on modern machines
- Poor use of cache and registers
- Poor use of processor pipelines
ATLAS MM Code Generator (MMCase)
for (int j = 0; j < N; j++)
  for (int i = 0; i < N; i++)
    for (int k = 0; k < N; k++)
      C[i][j] += A[i][k] * B[k][j];
18Characteristics of the code as generated by ATLAS
- Cache-level blocking (tiling)
- Atlas blocks only for L1 cache
- Register-level blocking
- Highest level of memory hierarchy
- Important to hold array values in registers
- Software pipelining
- Unroll and schedule operations
- Versioning
- Dynamically decide which way to compute
- Back-end compiler optimizations
- Scalar Optimizations
- Instruction Scheduling
19Cache Tiling for MMM
- Tiling in ATLAS:
- Only square NBxNBxNB tiles
- The working set of a tile must fit in the L1 cache
- Tiles are usually copied first into a contiguous buffer
- Special clean-up code is generated for the boundaries
Mini-MMM:
for (int j = 0; j < NB; j++)
  for (int i = 0; i < NB; i++)
    for (int k = 0; k < NB; k++)
      C[i][j] += A[i][k] * B[k][j];
(figure: NBxNB tiles of A, B, and C)
Optimization parameter: NB
20IJK version (large cache)
DO I = 1, N   // row-major storage
  DO J = 1, N
    DO K = 1, N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)
(figure: access patterns of A, B, and C)
- Large cache scenario:
- Matrices are small enough to fit into the cache
- Only cold misses, no capacity misses
- Miss ratio:
- Data size: 3N^2
- Each miss brings in b floating-point numbers
- Miss ratio = (3N^2/b) / 4N^3 = 0.75/(bN) = 0.019 (b = 4, N = 10)
21IJK version (small cache)
DO I = 1, N
  DO J = 1, N
    DO K = 1, N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)
(figure: access patterns of A, B, and C)
- Small cache scenario:
- Matrices are large compared to the cache
- Reuse distance is not O(1) => miss
- Cold and capacity misses
- Miss ratio:
- C: N^2/b misses (good temporal locality)
- A: N^3/b misses (good spatial locality)
- B: N^3 misses (poor temporal and spatial locality)
- Miss ratio = 0.25(b+1)/b = 0.3125 (for b = 4)
22Register tiling for Mini-MMM
Micro-MMM: an MUx1 sub-matrix of A, a 1xNU sub-matrix of B, and an MUxNU sub-matrix of C, with MU*NU + MU + NU <= NR.
Mini-MMM code after register tiling and unrolling:
for (int j = 0; j < NB; j += NU)
  for (int i = 0; i < NB; i += MU) {
    load C[i..i+MU-1, j..j+NU-1] into registers
    for (int k = 0; k < NB; k++) {
      load A[i..i+MU-1, k] into registers
      load B[k, j..j+NU-1] into registers
      multiply A's and B's and add to C's
    }
    store C[i..i+MU-1, j..j+NU-1]
  }
Unroll the k loop KU times.
Optimization parameters: MU, NU, KU
23Scheduling
Micro-MMM:
for (int k = 0; k < NB; k += KU) {
  load A[i..i+MU-1, k] into registers
  load B[k, j..j+NU-1] into registers
  multiply A's and B's and add to C's
}
- If the processor has a combined multiply-add, use it.
- Otherwise, schedule multiplies and adds separately: interleave M1 M2 ... M(MU*NU) and A1 A2 ... A(MU*NU), skewing the additions by Latency.
- Schedule IFetch initial loads for one micro-MMM at the end of the previous micro-MMM.
- Schedule the remaining loads for the micro-MMM in blocks of NFetch.
- The memory pipeline can support only a small number of outstanding loads.
- Optimization parameters: MulAdd, Latency, xFetch
24High-level picture
- Multi-dimensional optimization problem:
- Independent parameters: NB, MU, NU, KU, ...
- Dependent variable: MFLOPS
- The function from parameters to the variable is given implicitly; it can be evaluated repeatedly.
- One optimization strategy: orthogonal range search.
- Optimize along one dimension at a time, using reference values for parameters not yet optimized.
- Not guaranteed to find the optimal point, but might come close.
25Specification of OR Search
- Order in which dimensions are optimized
- Reference values for un-optimized dimensions at any step
- Interval in which the range search is done for each dimension
26Search strategy
- Find Best NB
- Find Best MU NU
- Find Best KU
- Find Best xFetch
- Find Best Latency (lat)
- Find non-copy version tile size (NCNB)
27Find Best NB
- Search in the following range:
- 16 <= NB <= 80
- NB^2 <= L1Size
- In this search, use simple estimates for the other parameters (e.g., KU). Test each candidate with:
- Full K unrolling (KU = NB)
- No K unrolling (KU = 1)
28Finding other parameters
- Find best MU, NU: try all MU, NU that satisfy 1 <= MU, NU <= NB and MU*NU + MU + NU <= NR.
- In this step, use the best NB from the previous step.
- Find best KU.
- Find best Latency: values between 1 and 6.
- Find best xFetch: IFetch in [2, MU*NU], NFetch in [1, MU*NU - IFetch].
29Model-based estimation of optimization parameter values
- Original ATLAS (diagram): Detect Hardware Parameters (L1Size, NR, MulAdd, Latency) -> ATLAS Search Engine (MMSearch), which drives the ATLAS MM Code Generator (MMCase), executes the generated mini-MMM, and measures MFLOPS to pick NB, MU, NU, KU, xFetch, MulAdd, and Latency.
- Model-based ATLAS (diagram): Detect Hardware Parameters (L1Size, L1 I-Cache, NR, MulAdd, Latency) -> Model Parameter Estimator (MMModel) -> ATLAS MM Code Generator (MMCase), which emits the mini-MMM source directly, without executing candidate versions.
30High-level picture
- NB: hierarchy of models
- Find the largest NB for which there are no capacity or conflict misses
- Find the largest NB for which there are no capacity misses, assuming optimal replacement
- Find the largest NB for which there are no capacity misses, assuming LRU replacement
- MU, NU: estimate from the number of registers, making them roughly equal, subject to MU*NU + MU + NU <= NR
- KU: maximize subject to the I-cache size
- Latency: from the hardware parameter
- xFetch: set to 2
31Largest NB for no capacity/conflict misses
- Tiles are copied into contiguous memory
- Condition for cold misses only: 3*NB^2 <= L1Size
(figure: NBxNB tiles of A, B, and C)
32Largest NB for no capacity misses
- MMM:
for (int j = 0; j < N; j++)
  for (int i = 0; i < N; i++)
    for (int k = 0; k < N; k++)
      c[i][j] += a[i][k] * b[k][j];
- Cache model:
- Fully associative
- Line size: 1 word
- Optimal replacement
- Bottom line: N^2 + N + 1 <= C
- One full matrix
- One row / column
- One element
33Extending the Model
- Line size > 1
- Spatial locality
- Array layout in memory matters
- Bottom line: depends on the loop order
- either
- or
34Extending the Model (cont.)
- LRU (not optimal) replacement
- MMM sample:
for (int j = 0; j < N; j++)
  for (int i = 0; i < N; i++)
    for (int k = 0; k < N; k++)
      c[i][j] += a[i][k] * b[k][j];
- Bottom line:
35Experiments
- Architectures
- SGI R12K, 270MHz
- Sun UltraSparcIII, 900MHz
- Intel PIII, 550MHz
- Measure
- Mini-MMM performance
- Complete MMM performance
- Sensitivity to variations on parameters
36Installation time of ATLAS and Model
37Parameter values
ATLAS
Model
38Mini-MMM Performance
39SGI Performance
40TLB effects are important when the matrix size is large.
41Sun Performance
42Pentium III Performance
43Sensitivity to tile size (SGI)
L2 cache conflict-free tile
L2 cache Model tile
Higher levels of the memory hierarchy cannot be ignored.
44Sensitivity to tile size (Sun)
45But... results are not always perfect
- We recently conducted several experiments on other machines.
- We considered this a blind test of the effectiveness of our approach.
- In these experiments, the search strategy sometimes does better than the model.
46Recent experiments: Itanium 2
47Recent experiments: Pentium 4
48Hybrid approaches
- We are studying two strategies that combine the model with search.
- First, the model can be used to find a first approximation to the parameter values, which hill climbing then refines.
- Second, assume a general shape for the performance curve and use curve fitting to find the optimal point.
49II. Intelligent Search and Sorting
- Joint work with Xiaoming Li and M. Garzaran
50Sorting
- Generating sorting libraries is an interesting problem for several reasons:
- It differs from the problems of ATLAS and Spiral in that performance depends on the characteristics of the input data.
- It is not as clearly decomposable as the linear algebra problems.
51Outline: Sorting
- Part I: Selecting one of several pure sorting algorithms at runtime
- Motivation
- Sorting Algorithms
- Factors
- Empirical Search and Runtime Adaptation
- Experiment Results
- Part II: Building a hybrid sorting algorithm
- Primitives
- Searching approaches
52Motivation
- Theoretical complexity does not suffice to evaluate sorting algorithms:
- Cache effects
- Number of instructions
- The performance of sorting algorithms depends on the characteristics of the input:
- Number of records
- Input distribution
53What we accomplished in this work
- Identified architectural and runtime factors that affect the performance of sorting algorithms.
- Developed an empirical search strategy to identify the best shape and parameter values of each sorting algorithm.
- Developed a strategy to choose at runtime the best sorting algorithm for a specific input data set.
54Performance vs. Distribution
55Performance vs. Distribution
56Performance vs. Sdev
57Performance vs. Sdev
58Outline Sorting
- Part I: Select the best algorithm
- Motivation
- Sorting Algorithms
- Factors
- Empirical Search and Runtime Adaptation
- Experiment Results
- Part II: Build the best algorithm
- Primitives
- Searching approaches
59Quicksort
- Set guardians at both ends of the input array.
- Eliminate recursion.
- Choose the median of three as the pivot.
- Use insertion sort for small partitions.
60Radix sort
(figure: radix sort of the vector 31, 1, 12, 23, 33, 4, distributing the keys into buckets one digit at a time)
61Cache Conscious Radix Sort
- Pseudocode for CC-radix:
CC-radix(bucket)
  if bucket fits in the L1 cache then
    radix-sort(bucket)
  else
    sub-buckets = reverse-sorting(bucket)
    for each sub-bucket in sub-buckets
      CC-radix(sub-bucket)
    endfor
  endif
62Multiway Merge Sort
63Sorting algorithms for small partitions
- Insertion sort
- Apply register blocking to the sorting algorithm -> register sorting network
64Outline
- Part I: Select the best algorithm
- Motivation
- Sorting Algorithms
- Factors
- Empirical Search and Runtime Adaptation
- Experiment Results
- Part II: Build the best algorithm
- Primitives
- Searching approaches
65Cache Size / TLB Size
- Quicksort: use multiple pivots to tile
- CC-radix:
- Fit each partition into the cache
- Keep the number of active partitions < TLB size
- Multiway Merge Sort:
- The heap should fit in the cache
- Sorted runs should fit in the cache
66Number of Registers
67Cache Line Size
- To optimize the shift-down operation
68Amount of Data to Sort
- Quicksort: cache misses increase with the amount of data.
- CC-radix: as the amount of data increases, CC-radix needs more partitioning passes. After a certain threshold, performance drops dramatically.
- Multiway Merge Sort: only profitable for large amounts of data, when the reduction in the number of cache misses compensates for the increased number of operations with respect to Quicksort.
69Distribution of the Data
- The goal is to distinguish the performance of the comparison-based algorithms from that of the radix-based ones.
- Distribution shape (Uniform, Normal, Exponential, ...): not a good criterion.
- Distribution width:
- Standard deviation (sdev)
- Only good for one-peak distributions
- Expensive to calculate
- Entropy:
- Represents the distribution of each bit
70Outline
- Part I: Select the best algorithm
- Motivation
- Sorting Algorithms
- Factors
- Empirical Search and Runtime Adaptation
- Experiment Results
- Part II: Build the best algorithm
- Primitives
- Searching approaches
71Library adaptation
- Architectural factors (cache/TLB size, number of registers, cache line size) -> Empirical Search
- Runtime factors:
- Distribution shape of the data -> does not matter
- Amount of data to sort, distribution -> machine learning and runtime adaptation
72The Library
- Building the library (installation time):
- Empirical search
- Learning procedure
- Use of training data
- Running the library (runtime):
- Runtime adaptation procedure
73Runtime Adaptation
- Has two parts: one at installation time and one at runtime.
- Goal function: f(N, E) -> { Multiway Merge Sort(sh, f), Quicksort, CC-radix }
- N: amount of input data
- E: the entropy vector
- For a given (N, E), identify the best configuration for Multiway Merge Sort as a function of size_of_heap and fanout.
74Runtime Adaptation
- f(N, E) is a linearly separable problem.
- A linearly separable problem f(x1, x2, ..., xn) is a decision problem for which there exists a weight vector w and a threshold such that the decision depends only on whether the weighted sum of the xi exceeds the threshold.
- The runtime adaptation code is generated at the end of installation to implement the learned f(N, E) and select the best configuration for Multiway Merge Sort.
75Runtime Adaptation: Learning Procedure
- Goal function: f(N, E) -> { Multiway Merge Sort, Quicksort, CC-radix }
- N: amount of input data
- E: the entropy vector
- Use the entropy to choose between CC-radix and one of the other two.
- Output: a weight vector (w) and a threshold (theta) for each value of N.
- Then use N to choose between Multiway Merge Sort and Quicksort.
76Runtime Adaptation: Runtime Procedure
- Sample the input array.
- Compute the entropy vector.
- Compute S = sum_i wi * entropy_i.
- If S >= theta, choose CC-radix; else choose one of the others.
77Outline
- Part I: Select the best algorithm
- Motivation
- Sorting Algorithms
- Factors
- Empirical Search and Runtime Adaptation
- Experiment Results
- Part II: Build the best algorithm
- Primitives
- Searching approaches
78Setup
79Performance Comparison
Pentium III Xeon, 16 M keys (float)
80Sun UltraSparcIII
81IBM Power3
82Intel PIII Xeon
83SGI R12000
84Conclusion
- Identified the architectural and runtime factors that affect sorting performance.
- Used empirical search to find the best parameter values.
- Our machine learning techniques proved to be quite effective:
- They almost always select the best algorithm.
- A wrong decision introduces a 37% average performance degradation.
- Overhead: 5% on average, 7% in the worst case.
85Outline
- Part I: Select the best algorithm
- Motivation
- Sorting Algorithms
- Factors
- Empirical Search and Runtime Adaptation
- Experiment Results
- Part II: Build the best algorithm
- Primitives
- Searching approaches
86Primitives
- Categorize sorting algorithms:
- Partition by pivots: Quicksort, Bubble Sort, ...
- Partition by size: Merge Sort, Selection Sort
- Partition by radix: Radix Sort, CC-radix
- Construct a sorting algorithm using these primitives.
- Adapt the sorting algorithm for parallel environments and for specific applications.
- Extend the machine learning approach to other algorithms.
- Develop a search language to rapidly develop empirical optimization strategies for any algorithm.
87Searching approaches
- The composite sorting algorithms have the shape of trees.
- Every primitive has parameters.
- The search mechanism must be able to search both the shape and the parameter values.
- A genetic algorithm is a good choice (though perhaps not the only one).
88Results
89Results
91SPIRAL
- The approach:
- Mathematical formulation of signal processing algorithms
- Automatically generate algorithm versions
- A generalization of the well-known FFTW
- Use compiler techniques to translate formulas into implementations
- Adapt to the target platform by searching for the optimal version
93Fast DSP Algorithms As Matrix Factorizations
- Computing y = F4 x is carried out as:
- t1 = A4 x (permutation)
- t2 = A3 t1 (two F2's)
- t3 = A2 t2 (diagonal scaling)
- y = A1 t3 (two F2's)
- The cost is reduced because A1, A2, A3, and A4 are structured sparse matrices.
94Tensor Product Formulation of Cooley-Tukey
F(mn) = (F(m) tensor I(n)) T (I(m) tensor F(n)) L, where T is a diagonal (twiddle-factor) matrix and L is a stride permutation.
95Formulas for Matrix Factorizations
R1:
R2:
where n = n1...nk, n(i-) = n1...n(i-1), and n(i+) = n(i+1)...nk
96Factorization Trees
- Different computation order => different data access pattern => different performance.
97Walsh-Hadamard Transform
98Optimal Factorization Trees
- Depend on the platform
- Difficult to deduce
- Can be found by empirical search
- The search space is very large
- Different search algorithms: random, DP, GA, hill-climbing, exhaustive
101Size of Search Space
N      # of formulas      N      # of formulas
2^1    1                  2^9    20793
2^2    1                  2^10   103049
2^3    3                  2^11   518859
2^4    11                 2^12   2646723
2^5    45                 2^13   13649969
2^6    197                2^14   71039373
2^7    903                2^15   372693519
2^8    4279               2^16   1968801519
104More Search Choices
- Programming:
- Loop unrolling
- Memory allocation
- In-lining
- Platform choices:
- Compiler optimization options
105The SPIRAL System
(diagram) DSP Transform -> Formula Generator -> SPL Program -> SPL Compiler -> C/FORTRAN Programs -> Performance Evaluation -> feedback to the Search Engine, which steers the Formula Generator; the result is a DSP Library tuned to the target machine.
106Spiral
- Spiral does the factorization at installation time and generates one library routine for each size.
- FFTW only generates codelets (input size <= 64) and performs the factorization at run time.
107A Simple SPL Program
; This is a simple SPL program
(define A (matrix (1 2) (2 1)))        ; Definition
(define B (diagonal (3 3)))            ; Definition
#subname simple                        ; Directive
(tensor (I 2) (compose A B))           ; Formula
; This is an invisible comment
108Templates
(template
  (F n)                         ; Pattern
  [ n > 1 ]                     ; Condition
  ( do i = 0, n-1               ; I-code
      y(i) = 0
      do j = 0, n-1
        y(i) = y(i) + W(n, i*j) * x(j)
      end
    end ))
109SPL Compiler
(diagram) SPL Formula + Template Definition -> Parsing -> Abstract Syntax Tree + Template Table -> Intermediate Code Generation -> I-Code -> Intermediate Code Restructuring -> I-Code -> Optimization -> I-Code -> Target Code Generation -> FORTRAN, C
110Intermediate Code Restructuring
- Loop unrolling: the degree of unrolling can be controlled globally or case by case.
- Scalar function evaluation: replace scalar functions with a constant value or an array access.
- Type conversion:
- Type of input data: real or complex
- Type of arithmetic: real or complex
- The same SPL formula yields different C/Fortran programs.
112Optimizations
- Formula Generator: high-level scheduling, loop transformations
- SPL Compiler: high-level optimizations (constant folding, copy propagation, CSE, dead code elimination)
- C/Fortran Compiler: low-level optimizations (instruction scheduling, register allocation)
113Basic Optimizations (FFT, N = 2^5, SPARC, f77 -fast -O5)
114Basic Optimizations (FFT, N = 2^5, MIPS, f77 -O3)
115Basic Optimizations (FFT, N = 2^5, PII, g77 -O6 -malign-double)
116Performance Evaluation
- Evaluate the performance of the code generated by the SPL compiler.
- Platforms: SPARC, MIPS, PII
- Search strategy: dynamic programming
117Pseudo MFlops
- Estimate of the number of FP operations
- FFT (radix-2): 5 n log2(n)
118FFT Performance (N = 2^1 to 2^6)
SPARC
MIPS
PII
119FFT Performance (N = 2^7 to 2^20)
SPARC
MIPS
PII