Title: Statistical Modeling of Feedback Data in an Automatic Tuning System
1. Statistical Modeling of Feedback Data in an Automatic Tuning System
- Richard Vuduc, James Demmel (U.C. Berkeley, EECS)
- {richie,demmel}@cs.berkeley.edu
- Jeff Bilmes (Univ. of Washington, EE)
- bilmes@ee.washington.edu
- December 10, 2000
- Workshop on Feedback-Directed Dynamic Optimization
2. Context: High-Performance Libraries
- Libraries can isolate performance issues
- BLAS/LAPACK/ScaLAPACK (linear algebra)
- VSIPL (signal and image processing)
- MPI (distributed parallel communications)
- Can we implement libraries
- automatically and portably
- incorporating machine-dependent features
- matching the performance of hand-tuned implementations
- leveraging compiler technology
- using domain-specific knowledge
3. Generate and Search: An Automatic Tuning Methodology
- Given a library routine
- Write parameterized code generators
- parameters:
- machine (e.g., registers, cache, pipeline)
- input (e.g., problem size)
- problem-specific transformations
- output: high-level source (e.g., C code)
- Search parameter spaces
- generate an implementation
- compile using native compiler
- measure performance (feedback)
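The generate-search-measure cycle above fits in a few lines. Below is a minimal Python sketch of it; the parameter space, the measure stub, and all names are hypothetical, and the compile-and-time step is stubbed out so the sketch runs stand-alone.

```python
import itertools, random

# Hypothetical parameter space: register-tile and L1 cache-tile sizes.
param_space = {
    "reg_tile": [(2, 2), (4, 2), (4, 4), (8, 4)],
    "l1_tile": [32, 64, 128],
}

def measure(params):
    """Generate one variant, compile it with the native compiler, and
    time it. Stubbed here with a random score so the sketch runs
    stand-alone; a real system would emit C, invoke cc, and run it."""
    # emit_source(params); subprocess.run(["cc", "-O2", ...]); time it
    return random.random()  # stand-in for measured Mflop/s

# Exhaustive search over this (tiny) space; real spaces are huge, which
# is what motivates random sampling and early stopping later on.
best = max(
    (dict(zip(param_space, v)) for v in itertools.product(*param_space.values())),
    key=measure,
)
print("best variant:", best)
```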
4. Tuning System Examples
- Linear algebra
- PHiPAC (Bilmes, et al., 1997)
- ATLAS (Whaley and Dongarra, 1998)
- Sparsity (Im and Yelick, 1999)
- Signal Processing
- FFTW (Frigo and Johnson, 1998)
- SPIRAL (Moura, et al., 2000)
- Parallel Communications
- Automatically tuned MPI collective operations (Vadhiyar, et al., 2000)
- Related: iterative compilation (Bodin, et al., 1998)
5. Road Map
- Context
- The Search Problem
- Problem 1: Stopping searches early
- Problem 2: High-level run-time selection
- Summary
6. The Search Problem in PHiPAC
- PHiPAC (Bilmes, et al., 1997)
- produces dense matrix multiply (matmul) implementations
- generator parameters include:
- size and depth of the fully unrolled core matmul
- rectangular, multi-level cache tile sizes
- 6 flavors of software pipelining
- scaling constants, transpose options, precisions, etc.
- An experiment:
- fix the software pipelining method
- vary register tile sizes
- 500 to 2500 reasonable implementations on 6 platforms
7. A Needle in a Haystack, Part I
8. Road Map
- Context
- The Search Problem
- Problem 1: Stopping searches early
- Problem 2: High-level run-time selection
- Summary
9. Problem 1: Stopping Searches Early
- Assume:
- dedicated resources are limited
- a near-optimal implementation is acceptable
- Recall the search procedure:
- generate implementations at random
- measure performance
- Can we stop the search early?
- how early is early?
- what guarantees on quality?
10. An Early Stopping Criterion
- Performance scaled from 0 (worst) to 1 (best)
- Goal: stop after t implementations when Prob[ M_t <= 1 - ε ] < α
- M_t = maximum observed performance after t implementations
- ε = proximity to the best
- α = error tolerance
- example: find an implementation within the top 5%, with 10% error: ε = 0.05, α = 0.1
- Can show the probability depends only on F(x) = Prob[ performance <= x ]
- Idea: estimate F(x) using the observed samples
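A minimal sketch of one such stopping rule, assuming i.i.d. random sampling and performance already scaled to [0, 1] (in practice the scale itself must be estimated); the plug-in empirical CDF below stands in for whatever estimator of F(x) the real system uses.

```python
import random

def search_with_early_stopping(measure, eps=0.05, alpha=0.10,
                               min_trials=20, max_trials=2500):
    """Randomly sample implementations, stopping once the estimated
    Prob[M_t <= 1 - eps] drops below alpha. Assumes i.i.d. samples of
    performance already scaled to [0, 1]."""
    samples = []
    for t in range(1, max_trials + 1):
        samples.append(measure())
        # Empirical CDF at 1 - eps: fraction of samples at or below it.
        F_hat = sum(s <= 1.0 - eps for s in samples) / t
        # Under independence, Prob[M_t <= 1 - eps] = F(1 - eps) ** t.
        if t >= min_trials and F_hat ** t < alpha:
            break
    return max(samples), t

# Toy run: uniform "performance" stands in for measured, scaled Mflop/s.
best, trials = search_with_early_stopping(random.random)
print(f"stopped after {trials} trials; best observed = {best:.3f}")
```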
11. Stopping Time (300 MHz Pentium-II)
12. Stopping Time (Cray T3E Node)
13. Road Map
- Context
- The Search Problem
- Problem 1: Stopping searches early
- Problem 2: High-level run-time selection
- Summary
14. Problem 2: Run-Time Selection
- Assume:
- no single implementation is best for all inputs
- a few good implementations are known
- we can benchmark them
- How do we choose the best implementation at run time?
- Example: matrix multiply C = C + A*B, tuned for small (L1), medium (L2), and large workloads
15. Truth Map (Sun Ultra-I/170)
16. A Formal Framework
- Given:
- m implementations
- n sample inputs (the training set)
- execution times of each implementation on the samples
- Find:
- a decision function f(s)
- returns the best implementation on input s
- f(s) cheap to evaluate
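One way to phrase this framework in code (the types and names here are illustrative, not from the talk), with a deliberately trivial, input-independent baseline standing in for the learned models of the next slide:

```python
from typing import Callable, Sequence

Input = tuple[int, int, int]  # e.g., matmul operand sizes (M, K, N)

def learn_decision_function(
    samples: Sequence[Input],
    times: Sequence[Sequence[float]],  # times[i][a] = time of impl. a on sample i
) -> Callable[[Input], int]:
    """Return f(s): the index of the predicted-best implementation.
    This baseline ignores s and always picks the implementation with
    the best average time; Methods 1-3 replace it with models of s."""
    m = len(times[0])
    avg = [sum(row[a] for row in times) / len(times) for a in range(m)]
    best = min(range(m), key=lambda a: avg[a])
    return lambda s: best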
17. Solution Techniques (Overview)
- Method 1: Cost Minimization
- minimize overall execution time on the samples (boundary modeling)
- pro: intuitive, f(s) cheap
- con: ad hoc, geometric assumptions
- Method 2: Regression (Brewer, 1995)
- model the run time of each implementation, e.g., T_a(N) = β₃N³ + β₂N² + β₁N + β₀
- pro: simple, standard
- con: user must define the model
- Method 3: Support Vector Machines
- statistical classification (a sketch follows this list)
- pro: solid theory, many successful applications
- con: heavy training and prediction machinery
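A sketch of Method 3 on synthetic data, using scikit-learn's SVC purely for illustration (the original work predates that library); the labels, thresholds, and sizes below are all made up.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.integers(16, 800, size=(200, 3)).astype(float)  # sample inputs (M, K, N)
# Hypothetical labels: index of the fastest implementation per sample.
# Here we pretend the winner is determined by the total work M*K*N.
y = np.digitize(X.prod(axis=1), [1e6, 1e8])  # 0 = small, 1 = medium, 2 = large

clf = SVC(kernel="rbf", gamma="scale").fit(X, y)   # train the classifier
f = lambda s: int(clf.predict(np.asarray([s], dtype=float))[0])
print("predicted best variant for (512, 512, 512):", f((512, 512, 512)))
```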
18. Results 1: Cost Minimization
19. Results 2: Regression
20. Results 3: Classification
21. Quantitative Comparison
Note: the cost of a regression or cost-minimization prediction is O(3x3 matmul); the cost of an SVM prediction is O(32x32 matmul).
22. Road Map
- Context
- The Search Problem
- Problem 1: Stopping searches early
- Problem 2: High-level run-time selection
- Summary
23. Conclusions
- Search is beneficial
- Early stopping:
- simple (sample implementations at random)
- informative stopping criteria
- High-level run-time selection:
- a formal framework
- error metrics
- To do:
- other stopping models (e.g., cost-based)
- explore the large design space for run-time selection
24. Extra Slides
- More detail (time and/or questions permitting)
25. PHiPAC Performance (Pentium-II)
26. PHiPAC Performance (Ultra-I/170)
27. PHiPAC Performance (IBM RS/6000)
28. PHiPAC Performance (MIPS R10K)
29. Needle in a Haystack, Part II
30. Performance Distribution (IBM RS/6000)
31. Performance Distribution (Pentium II)
32. Performance Distribution (Cray T3E Node)
33. Performance Distribution (Sun Ultra-I)
34. Cost Minimization
- Minimize overall execution time on the samples
- Softmax weight (boundary) functions (sketch below)
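A sketch of this method under stated assumptions: one linear score per implementation, a softmax over the scores as the boundary weights, and plain gradient descent on the average expected time over the training samples. The data and all names are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 3                              # training samples, implementations
S = np.c_[rng.uniform(0, 1, (n, 2)), np.ones(n)]  # inputs plus a bias feature
T = rng.uniform(0.5, 2.0, (n, m))          # toy measured times T[i, a]

W = np.zeros((3, m))                       # one weight column per implementation
for _ in range(500):
    Z = S @ W
    P = np.exp(Z - Z.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)      # softmax weights p_a(s)
    E = (P * T).sum(axis=1, keepdims=True) # expected time per sample
    G = S.T @ (P * (T - E)) / n            # gradient of mean expected time
    W -= 0.5 * G                           # gradient-descent step

# The learned boundaries give the decision function: pick the
# implementation whose softmax weight (equivalently, score) is largest.
f = lambda s: int(np.argmax(np.append(s, 1.0) @ W))
print("chosen implementation for s = (0.3, 0.7):", f((0.3, 0.7)))
```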
35. Regression
- Model each implementation's running time, e.g., for a square matmul of dimension N: T_a(N) = β₃N³ + β₂N² + β₁N + β₀
- For a general matmul with operand sizes (M, K, N), generalize the model to include all product terms: MKN, MK, KN, MN, M, K, N (see the sketch below)
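One way to fit the generalized model, sketched as ordinary least squares on the feature vector [MKN, MK, KN, MN, M, K, N, 1]; the timing data below is synthetic, and a real system would repeat the fit once per implementation a.

```python
import numpy as np

def features(M, K, N):
    """All product terms of the generalized model, plus a constant."""
    return np.array([M*K*N, M*K, K*N, M*N, M, K, N, 1.0])

rng = np.random.default_rng(1)
sizes = rng.integers(16, 512, size=(100, 3))     # sample (M, K, N) triples
X = np.array([features(*s) for s in sizes])
# Synthetic "measured" times, dominated by the MKN flop term plus noise.
t = 1e-8 * X[:, 0] + 1e-6 * X[:, 4:7].sum(axis=1) + rng.normal(0, 1e-4, 100)

beta, *_ = np.linalg.lstsq(X, t, rcond=None)     # least-squares fit
predict = lambda M, K, N: features(M, K, N) @ beta
print(f"predicted time for a 256^3 matmul: {predict(256, 256, 256):.4f} s")
```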
36. Support Vector Machines
37. Proximity to Best (300 MHz Pentium-II)
38. Proximity to Best (Cray T3E Node)