Title: Statistical Modeling of Feedback Data in an Automatic Tuning System
1. Statistical Modeling of Feedback Data in an Automatic Tuning System
- Richard Vuduc, James Demmel (U.C. Berkeley, EECS)
- {richie,demmel}@cs.berkeley.edu
- Jeff Bilmes (Univ. of Washington, EE)
- bilmes@ee.washington.edu
- December 10, 2000
- Workshop on Feedback-Directed Dynamic Optimization
2. Context: High-Performance Libraries
- Libraries can isolate performance issues
- BLAS/LAPACK/ScaLAPACK (linear algebra)
- VSIPL (signal and image processing)
- MPI (distributed parallel communications)
- Can we implement libraries
- automatically and portably
- incorporating machine-dependent features
- matching the performance of hand-tuned implementations
- leveraging compiler technology
- using domain-specific knowledge
3. Generate and Search: An Automatic Tuning Methodology
- Given a library routine
- Write parameterized code generators
- parameters:
- machine (e.g., registers, cache, pipeline)
- input (e.g., problem size)
- problem-specific transformations
- output: high-level source (e.g., C code)
- Search parameter spaces
- generate an implementation
- compile using native compiler
- measure performance (feedback)
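The generate-search-measure cycle above fits in a few lines. Below is a minimal Python sketch of it; the parameter space, the measure stub, and all names are hypothetical, and the compile-and-time step is stubbed out so the sketch runs stand-alone.

```python
import itertools, random

# Hypothetical parameter space: register-tile and L1 cache-tile sizes.
param_space = {
    "reg_tile": [(2, 2), (4, 2), (4, 4), (8, 4)],
    "l1_tile": [32, 64, 128],
}

def measure(params):
    """Generate one variant, compile it with the native compiler, and
    time it. Stubbed here with a random score so the sketch runs
    stand-alone; a real system would emit C, invoke cc, and run it."""
    # emit_source(params); subprocess.run(["cc", "-O2", ...]); time it
    return random.random()  # stand-in for measured Mflop/s

# Exhaustive search over this (tiny) space; real spaces are huge, which
# is what motivates random sampling and early stopping later on.
best = max(
    (dict(zip(param_space, v)) for v in itertools.product(*param_space.values())),
    key=measure,
)
print("best variant:", best)
```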
4. Tuning System Examples
- Linear algebra
- PHiPAC (Bilmes, et al., 1997)
- ATLAS (Whaley and Dongarra, 1998)
- Sparsity (Im and Yelick, 1999)
- Signal Processing
- FFTW (Frigo and Johnson, 1998)
- SPIRAL (Moura, et al., 2000)
- Parallel Communications
- Automatically tuned MPI collective operations (Vadhiyar, et al., 2000)
- Related: iterative compilation (Bodin, et al., 1998)
5. Road Map
- Context
- The Search Problem
- Problem 1: Stopping searches early
- Problem 2: High-level run-time selection
- Summary
6. The Search Problem in PHiPAC
- PHiPAC (Bilmes, et al., 1997)
- produces dense matrix multiply (matmul) implementations
- generator parameters include:
- size and depth of the fully unrolled core matmul
- rectangular, multi-level cache tile sizes
- 6 flavors of software pipelining
- scaling constants, transpose options, precisions, etc.
- An experiment:
- fix the software pipelining method
- vary register tile sizes
- 500 to 2500 reasonable implementations on 6 platforms
7. A Needle in a Haystack, Part I
8. Road Map
- Context
- The Search Problem
- Problem 1: Stopping searches early
- Problem 2: High-level run-time selection
- Summary
9. Problem 1: Stopping Searches Early
- Assume:
- dedicated resources are limited
- a near-optimal implementation is acceptable
- Recall the search procedure:
- generate implementations at random
- measure performance
- Can we stop the search early?
- how early is early?
- what guarantees on quality?
10. An Early Stopping Criterion
- Performance scaled from 0 (worst) to 1 (best)
- Goal: stop after t implementations when Prob[ M_t <= 1 - ε ] < α
- M_t = maximum observed performance after t implementations
- ε = proximity to the best
- α = error tolerance
- example: find an implementation within the top 5%, with 10% error: ε = 0.05, α = 0.1
- Can show the probability depends only on F(x) = Prob[ performance <= x ]
- Idea: estimate F(x) using the observed samples
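A minimal sketch of one such stopping rule, assuming i.i.d. random sampling and performance already scaled to [0, 1] (in practice the scale itself must be estimated); the plug-in empirical CDF below stands in for whatever estimator of F(x) the real system uses.

```python
import random

def search_with_early_stopping(measure, eps=0.05, alpha=0.10,
                               min_trials=20, max_trials=2500):
    """Randomly sample implementations, stopping once the estimated
    Prob[M_t <= 1 - eps] drops below alpha. Assumes i.i.d. samples of
    performance already scaled to [0, 1]."""
    samples = []
    for t in range(1, max_trials + 1):
        samples.append(measure())
        # Empirical CDF at 1 - eps: fraction of samples at or below it.
        F_hat = sum(s <= 1.0 - eps for s in samples) / t
        # Under independence, Prob[M_t <= 1 - eps] = F(1 - eps) ** t.
        if t >= min_trials and F_hat ** t < alpha:
            break
    return max(samples), t

# Toy run: uniform "performance" stands in for measured, scaled Mflop/s.
best, trials = search_with_early_stopping(random.random)
print(f"stopped after {trials} trials; best observed = {best:.3f}")
```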
11. Stopping Time (300 MHz Pentium-II)
12. Stopping Time (Cray T3E Node)
13. Road Map
- Context
- The Search Problem
- Problem 1: Stopping searches early
- Problem 2: High-level run-time selection
- Summary
14. Problem 2: Run-Time Selection
- Assume:
- no single implementation is best for all inputs
- a few good implementations are known
- we can benchmark them
- How do we choose the best implementation at run time?
- Example: matrix multiply C = C + A*B, tuned for small (L1), medium (L2), and large workloads
15. Truth Map (Sun Ultra-I/170)
16. A Formal Framework
- Given:
- m implementations
- n sample inputs (the training set)
- execution times of each implementation on the samples
- Find:
- a decision function f(s)
- returns the best implementation on input s
- f(s) cheap to evaluate
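One way to phrase this framework in code (the types and names here are illustrative, not from the talk), with a deliberately trivial, input-independent baseline standing in for the learned models of the next slide:

```python
from typing import Callable, Sequence

Input = tuple[int, int, int]  # e.g., matmul operand sizes (M, K, N)

def learn_decision_function(
    samples: Sequence[Input],
    times: Sequence[Sequence[float]],  # times[i][a] = time of impl. a on sample i
) -> Callable[[Input], int]:
    """Return f(s): the index of the predicted-best implementation.
    This baseline ignores s and always picks the implementation with
    the best average time; Methods 1-3 replace it with models of s."""
    m = len(times[0])
    avg = [sum(row[a] for row in times) / len(times) for a in range(m)]
    best = min(range(m), key=lambda a: avg[a])
    return lambda s: best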
17. Solution Techniques (Overview)
- Method 1: Cost Minimization
- minimize overall execution time on the samples (boundary modeling)
- pro: intuitive, f(s) cheap
- con: ad hoc, geometric assumptions
- Method 2: Regression (Brewer, 1995)
- model the run time of each implementation, e.g., T_a(N) = β₃N³ + β₂N² + β₁N + β₀
- pro: simple, standard
- con: user must define the model
- Method 3: Support Vector Machines
- statistical classification (a sketch follows this list)
- pro: solid theory, many successful applications
- con: heavy training and prediction machinery
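A sketch of Method 3 on synthetic data, using scikit-learn's SVC purely for illustration (the original work predates that library); the labels, thresholds, and sizes below are all made up.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.integers(16, 800, size=(200, 3)).astype(float)  # sample inputs (M, K, N)
# Hypothetical labels: index of the fastest implementation per sample.
# Here we pretend the winner is determined by the total work M*K*N.
y = np.digitize(X.prod(axis=1), [1e6, 1e8])  # 0 = small, 1 = medium, 2 = large

clf = SVC(kernel="rbf", gamma="scale").fit(X, y)   # train the classifier
f = lambda s: int(clf.predict(np.asarray([s], dtype=float))[0])
print("predicted best variant for (512, 512, 512):", f((512, 512, 512)))
```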
18. Results 1: Cost Minimization
19. Results 2: Regression
20. Results 3: Classification
21. Quantitative Comparison
Note: the cost of a regression or cost-minimization prediction is O(3x3 matmul); the cost of an SVM prediction is O(32x32 matmul).
22. Road Map
- Context
- The Search Problem
- Problem 1: Stopping searches early
- Problem 2: High-level run-time selection
- Summary
23. Conclusions
- Search is beneficial
- Early stopping:
- simple (sample implementations at random)
- informative stopping criteria
- High-level run-time selection:
- a formal framework
- error metrics
- To do:
- other stopping models (e.g., cost-based)
- explore the large design space for run-time selection
24. Extra Slides
- More detail (time and/or questions permitting)
25. PHiPAC Performance (Pentium-II)
26. PHiPAC Performance (Ultra-I/170)
27. PHiPAC Performance (IBM RS/6000)
28. PHiPAC Performance (MIPS R10K)
29. Needle in a Haystack, Part II
30. Performance Distribution (IBM RS/6000)
31. Performance Distribution (Pentium II)
32. Performance Distribution (Cray T3E Node)
33. Performance Distribution (Sun Ultra-I)
34. Cost Minimization
- Minimize overall execution time on the samples
- Softmax weight (boundary) functions (sketch below)
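A sketch of this method under stated assumptions: one linear score per implementation, a softmax over the scores as the boundary weights, and plain gradient descent on the average expected time over the training samples. The data and all names are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 3                              # training samples, implementations
S = np.c_[rng.uniform(0, 1, (n, 2)), np.ones(n)]  # inputs plus a bias feature
T = rng.uniform(0.5, 2.0, (n, m))          # toy measured times T[i, a]

W = np.zeros((3, m))                       # one weight column per implementation
for _ in range(500):
    Z = S @ W
    P = np.exp(Z - Z.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)      # softmax weights p_a(s)
    E = (P * T).sum(axis=1, keepdims=True) # expected time per sample
    G = S.T @ (P * (T - E)) / n            # gradient of mean expected time
    W -= 0.5 * G                           # gradient-descent step

# The learned boundaries give the decision function: pick the
# implementation whose softmax weight (equivalently, score) is largest.
f = lambda s: int(np.argmax(np.append(s, 1.0) @ W))
print("chosen implementation for s = (0.3, 0.7):", f((0.3, 0.7)))
```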
35. Regression
- Model each implementation's running time, e.g., for a square matmul of dimension N: T_a(N) = β₃N³ + β₂N² + β₁N + β₀
- For a general matmul with operand sizes (M, K, N), generalize the model to include all product terms: MKN, MK, KN, MN, M, K, N (see the sketch below)
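One way to fit the generalized model, sketched as ordinary least squares on the feature vector [MKN, MK, KN, MN, M, K, N, 1]; the timing data below is synthetic, and a real system would repeat the fit once per implementation a.

```python
import numpy as np

def features(M, K, N):
    """All product terms of the generalized model, plus a constant."""
    return np.array([M*K*N, M*K, K*N, M*N, M, K, N, 1.0])

rng = np.random.default_rng(1)
sizes = rng.integers(16, 512, size=(100, 3))     # sample (M, K, N) triples
X = np.array([features(*s) for s in sizes])
# Synthetic "measured" times, dominated by the MKN flop term plus noise.
t = 1e-8 * X[:, 0] + 1e-6 * X[:, 4:7].sum(axis=1) + rng.normal(0, 1e-4, 100)

beta, *_ = np.linalg.lstsq(X, t, rcond=None)     # least-squares fit
predict = lambda M, K, N: features(M, K, N) @ beta
print(f"predicted time for a 256^3 matmul: {predict(256, 256, 256):.4f} s")
```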
36. Support Vector Machines
37. Proximity to Best (300 MHz Pentium-II)
38. Proximity to Best (Cray T3E Node)