Statistical Modeling of Feedback Data in an Automatic Tuning System

1
Statistical Modeling of Feedback Data in an
Automatic Tuning System
  • Richard Vuduc, James Demmel (U.C. Berkeley, EECS)
  • {richie,demmel}@cs.berkeley.edu
  • Jeff Bilmes (Univ. of Washington, EE)
  • bilmes@ee.washington.edu
  • December 10, 2000
  • Workshop on Feedback-Directed Dynamic Optimization

2
Context: High-Performance Libraries
  • Libraries can isolate performance issues
  • BLAS/LAPACK/ScaLAPACK (linear algebra)
  • VSIPL (signal and image processing)
  • MPI (distributed parallel communications)
  • Can we implement libraries
    • automatically and portably,
    • incorporating machine-dependent features,
    • matching the performance of hand-tuned
      implementations,
    • leveraging compiler technology, and
    • using domain-specific knowledge?

3
Generate and Search: An Automatic Tuning
Methodology
  • Given a library routine
  • Write parameterized code generators
    • parameters:
      • machine (e.g., registers, cache, pipeline)
      • input (e.g., problem size)
      • problem-specific transformations
    • output: high-level source (e.g., C code)
  • Search the parameter space (see the sketch below)
    • generate an implementation
    • compile using the native compiler
    • measure performance (feedback)
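
A minimal Python sketch of this generate-and-search loop. The helper callables (generate, compile_impl, measure) and the example parameter space are assumptions for illustration, not part of any particular tuning system:

```python
import itertools
import random

def search(param_space, generate, compile_impl, measure, trials=500):
    """Randomly sample generator parameters, build each candidate,
    and keep the fastest implementation seen so far.

    generate, compile_impl, and measure are caller-supplied:
    parameters -> source, source -> binary, binary -> performance.
    """
    points = list(itertools.product(*param_space.values()))
    best_perf, best_params = float("-inf"), None
    for values in random.sample(points, min(trials, len(points))):
        params = dict(zip(param_space, values))
        source = generate(params)      # emit high-level source (e.g., C)
        binary = compile_impl(source)  # invoke the native compiler
        perf = measure(binary)         # feedback: measured performance
        if perf > best_perf:
            best_perf, best_params = perf, params
    return best_params, best_perf

# Hypothetical space: register tile sizes, as in the experiment on slide 6.
space = {"m0": range(1, 9), "n0": range(1, 9), "k0": range(1, 9)}
```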

4
Tuning System Examples
  • Linear algebra
  • PHiPAC (Bilmes, et al., 1997)
  • ATLAS (Whaley and Dongarra, 1998)
  • Sparsity (Im and Yelick, 1999)
  • Signal Processing
  • FFTW (Frigo and Johnson, 1998)
  • SPIRAL (Moura, et al., 2000)
  • Parallel Communications
  • Automatically tuned MPI collective
    operations (Vadhiyar, et al., 2000)
  • Related: iterative compilation (Bodin, et al.,
    1998)

5
Road Map
  • Context
  • The Search Problem
  • Problem 1: Stopping searches early
  • Problem 2: High-level run-time selection
  • Summary

6
The Search Problem in PHiPAC
  • PHiPAC (Bilmes, et al., 1997)
    • produces dense matrix multiply (matmul)
      implementations
    • generator parameters include:
      • size and depth of the fully unrolled core matmul
      • rectangular, multi-level cache tile sizes
      • 6 flavors of software pipelining
      • scaling constants, transpose options, precisions,
        etc.
  • An experiment:
    • fix the software pipelining method
    • vary the register tile sizes
    • result: 500 to 2500 reasonable implementations on 6
      platforms

7
A Needle in a Haystack, Part I
8
Road Map
  • Context
  • The Search Problem
  • Problem 1: Stopping searches early
  • Problem 2: High-level run-time selection
  • Summary

9
Problem 1: Stopping Searches Early
  • Assume:
    • dedicated resources are limited
    • a near-optimal implementation is okay
  • Recall the search procedure:
    • generate implementations at random
    • measure performance
  • Can we stop the search early?
    • how early is early?
    • can we guarantee quality?

10
An Early Stopping Criterion
  • Performance scaled from 0 (worst) to 1 (best)
  • Goal Stop after t implementations when Prob
    Mt lt 1-e lt a
  • Mt max observed performance at t
  • e proximity to best
  • a error tolerance
  • example find within top 5 with error 10
  • e .05, a .1
  • Can show probability depends only on F(x)
    Prob performance lt x
  • Idea Estimate F(x) using observed samples
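
A short Python sketch of this criterion. It assumes i.i.d. random sampling and plugs the empirical CDF in for the unknown F; the talk's actual estimator may differ in detail:

```python
def should_stop(perfs, eps=0.05, alpha=0.1):
    """Early-stopping test after t = len(perfs) random samples.

    perfs holds performance values scaled to [0, 1].  Under i.i.d.
    sampling, Prob[M_t < 1 - eps] = F(1 - eps)**t, where F is the
    (unknown) performance CDF.  We substitute the empirical CDF
    F_hat(x) = (# samples < x) / t and stop once the bound falls
    below alpha.
    """
    t = len(perfs)
    if t == 0:
        return False
    f_hat = sum(p < 1.0 - eps for p in perfs) / t  # empirical F(1 - eps)
    return f_hat ** t < alpha
```

The search loop would call should_stop after each new measurement and halt as soon as it returns True.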

11
Stopping Time (300 MHz Pentium-II)
12
Stopping Time (Cray T3E Node)
13
Road Map
  • Context
  • The Search Problem
  • Problem 1: Stopping searches early
  • Problem 2: High-level run-time selection
  • Summary

14
Problem 2: Run-Time Selection
  • Assume:
    • no single implementation is best for all inputs
    • a few good implementations are known
    • we can benchmark them
  • How do we choose the best implementation at
    run-time? (see the sketch below)
  • Example: matrix multiply, tuned for small (L1),
    medium (L2), and large workloads

C ← C + AB
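
A strawman selector for this example, as a sketch only: the size thresholds are hypothetical stand-ins for values that benchmarking on the target machine would supply, and the formal framework on the next slides generalizes this kind of rule:

```python
# Hypothetical cutoffs, to be calibrated by benchmarking.
L1_MAX = 64    # up to ~64x64: "small" (L1-resident) variant
L2_MAX = 512   # up to ~512x512: "medium" (L2-blocked) variant

def choose_matmul(m, k, n):
    """Pick the matmul variant tuned for the cache level
    the workload fits in."""
    size = max(m, k, n)
    if size <= L1_MAX:
        return "matmul_small"
    if size <= L2_MAX:
        return "matmul_medium"
    return "matmul_large"
```
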
15
Truth Map (Sun Ultra-I/170)
16
A Formal Framework
  • Given:
    • m implementations
    • n sample inputs (the training set)
    • execution times
  • Find:
    • a decision function f(s)
    • f(s) returns the best implementation on input s
    • f(s) is cheap to evaluate

17
Solution Techniques (Overview)
  • Method 1: Cost Minimization
    • minimize overall execution time on the
      samples (boundary modeling)
    • pro: intuitive, f(s) cheap
    • con: ad hoc, geometric assumptions
  • Method 2: Regression (Brewer, 1995)
    • model the run-time of each implementation
      (see the sketch below), e.g.,
      T_a(N) = β3·N^3 + β2·N^2 + β1·N + β0
    • pro: simple, standard
    • con: the user must define the model
  • Method 3: Support Vector Machines
    • statistical classification
    • pro: solid theory, many successful applications
    • con: heavy training and prediction machinery
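
A sketch of Method 2 as a run-time selector, using NumPy least squares. The cubic model form follows the slide; the function names and the use of numpy are my assumptions:

```python
import numpy as np

def fit_runtime_model(sizes, times):
    """Fit T(N) = b3*N^3 + b2*N^2 + b1*N + b0 to one implementation's
    measured (size, time) training pairs; returns [b3, b2, b1, b0]."""
    X = np.vander(np.asarray(sizes, dtype=float), 4)  # columns N^3 .. N^0
    coeffs, *_ = np.linalg.lstsq(X, np.asarray(times, dtype=float),
                                 rcond=None)
    return coeffs

def select(models, n):
    """Decision function f(s): predict each implementation's run time
    on a size-n input and return the index of the cheapest one."""
    return int(np.argmin([np.polyval(c, n) for c in models]))
```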

18
Results 1: Cost Minimization
19
Results 2: Regression
20
Results 3: Classification
21
Quantitative Comparison
Note: the cost of a regression or cost-minimization prediction is
O(3×3 matmul); the cost of an SVM prediction is O(32×32 matmul).
22
Road Map
  • Context
  • The Search Problem
  • Problem 1: Stopping searches early
  • Problem 2: High-level run-time selection
  • Summary

23
Conclusions
  • Search is beneficial
  • Early stopping
    • simple (random search + a little extra)
    • informative criteria
  • High-level run-time selection
    • formal framework
    • error metrics
  • To do:
    • other stopping models (e.g., cost-based)
    • explore the large design space for run-time
      selection

24
Extra Slides
  • More detail (time and/or questions permitting)

25
PHiPAC Performance (Pentium-II)
26
PHiPAC Performance (Ultra-I/170)
27
PHiPAC Performance (IBM RS/6000)
28
PHiPAC Performance (MIPS R10K)
29
A Needle in a Haystack, Part II
30
Performance Distribution (IBM RS/6000)
31
Performance Distribution (Pentium II)
32
Performance Distribution (Cray T3E Node)
33
Performance Distribution (Sun Ultra-I)
34
Cost Minimization
  • Decision function (see the formulas below)
  • Minimize the overall execution time on the samples
  • Softmax weight (boundary) functions
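
A plausible reconstruction of the decision function, following a standard softmax formulation consistent with the bullets above (the talk's exact notation may differ):

\[
f(s) = \arg\max_{1 \le i \le m} w_i(s),
\qquad
w_i(s) = \frac{\exp(\theta_i^{T} s)}{\sum_{j=1}^{m} \exp(\theta_j^{T} s)},
\]

with the parameters \(\theta_i\) chosen to minimize the weighted execution time over the n training samples, e.g.

\[
C(\theta_1, \ldots, \theta_m) = \sum_{t=1}^{n} \sum_{i=1}^{m} w_i(s_t)\, T_i(s_t),
\]

where \(T_i(s_t)\) is the measured time of implementation i on sample input \(s_t\).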

35
Regression
  • Decision function
  • Model each implementation's running time (e.g., for a
    square matmul of dimension N)
  • For a general matmul with operand sizes (M, K, N),
    we generalize the above to include all product
    terms (written out below):
    • MKN, MK, KN, MN, M, K, N
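
Written out, with coefficient names of my choosing (the talk's notation may differ):

\[
T(M, K, N) = \beta_7\,MKN + \beta_6\,MK + \beta_5\,KN + \beta_4\,MN
           + \beta_3\,M + \beta_2\,K + \beta_1\,N + \beta_0 ,
\]

and the decision function picks \( f(s) = \arg\min_i T_i(s) \) among the fitted per-implementation models.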

36
Support Vector Machines
  • Decision function (see the form below)
  • Binary classifier
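
The decision function here is reconstructed from the standard binary SVM classifier; the talk's exact notation may differ:

\[
f(x) = \operatorname{sign}\!\left( \sum_{i=1}^{n} y_i\, \alpha_i\, K(x_i, x) + b \right),
\]

where the \((x_i, y_i)\) are training samples with labels \(y_i \in \{-1, +1\}\), the \(\alpha_i\) are learned multipliers, and \(K\) is a kernel function; selection among m implementations is built from such binary classifiers.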

37
Proximity to Best (300 MHz Pentium-II)
38
Proximity to Best (Cray T3E Node)