An Experimental Comparison of Empirical and Model-based Optimization

Transcript and Presenter's Notes



1
An Experimental Comparison of Empirical and
Model-based Optimization
  • Keshav Pingali
  • Cornell University
  • Joint work with
  • Kamen Yotov², Xiaoming Li¹, Gang Ren¹, Michael
    Cibulskis¹,
  • Gerald DeJong¹, Maria Garzaran¹, David Padua¹,
  • Paul Stodghill², Peng Wu³
  • ¹UIUC, ²Cornell University, ³IBM T.J. Watson

2
Context: High-performance libraries
  • Traditional approach
  • Hand-optimized code (e.g., BLAS)
  • Problem: tedious to develop
  • Alternatives
  • Restructuring compilers
  • General-purpose: generate code from high-level
    specifications
  • Use architectural models to determine
    optimization parameters
  • Performance of optimized code is not satisfactory
  • Library generators
  • Problem-specific (e.g., ATLAS for BLAS, FFTW for
    FFT)
  • Use empirical optimization to determine
    optimization parameters
  • Believed to produce optimal code
  • Why are library generators beating compilers?

3
How important is empirical search?
  • Model-based optimization and empirical
    optimization are not in conflict
  • Use models to prune search space
  • Search → intelligent search
  • Use empirical observations to refine models
  • Understand what essential aspects of reality are
    missing from the model, and refine it appropriately
  • Multiple models are fine
  • Learn from experience

4
Previous work
  • Compared performance of
  • code generated by a sophisticated compiler like
    SGI MIPSpro
  • code generated by ATLAS
  • found ATLAS code is better
  • Hard to answer why
  • Perhaps ATLAS is effectively doing
    transformations that compilers do not know about
  • Phase-ordering problem: perhaps compilers are
    doing transformations in the wrong order
  • Perhaps parameters to transformations are chosen
    sub-optimally by compiler using models

5
Our Approach
  • Original ATLAS Infrastructure
  • Model-Based ATLAS Infrastructure

(Diagrams: the original pipeline runs Detect Hardware Parameters → ATLAS Search Engine (MMSearch) → ATLAS MM Code Generator (MMCase); the model-based pipeline replaces MMSearch with a Model)
6
Detecting Machine Parameters
  • Micro-benchmarks
  • L1Size: L1 data cache size
  • Similar to Hennessy-Patterson book
  • NR: number of registers
  • MulAdd: fused multiply-add (FMA)
  • c += a·b, as opposed to t = a·b; c += t
  • Latency: latency of FP multiplication

7
Code Generation Compiler View
  • Original ATLAS Infrastructure
  • Model-Based ATLAS Infrastructure

8
BLAS
  • Let us focus on BLAS-3
  • Code for MMM
  • for (int i = 0; i < M; i++)
  •   for (int j = 0; j < N; j++)
  •     for (int k = 0; k < K; k++)
  •       C[i][j] += A[i][k] * B[k][j];
  • Properties
  • Very good reuse: O(N²) data, O(N³) computation
  • Many optimization opportunities
  • Few real dependencies
  • Will run poorly on modern machines
  • Poor use of cache and registers
  • Poor use of processor pipelines

9
Optimizations
  • Cache-level blocking (tiling)
  • ATLAS blocks only for the L1 cache
  • Register-level blocking
  • Highest level of memory hierarchy
  • Important to hold array values in registers
  • Software pipelining
  • Unroll and schedule operations
  • Versioning
  • Dynamically decide which way to compute
  • Back-end compiler optimizations
  • Scalar Optimizations
  • Instruction Scheduling

10
Cache-level blocking (tiling)
  • Tiling in ATLAS
  • Only square tiles (NB × NB × NB)
  • Working set of tile fits in L1
  • Tiles are usually copied to contiguous storage
  • Special clean-up code generated for boundaries
  • Mini-MMM
  • for (int j = 0; j < NB; j++)
  •   for (int i = 0; i < NB; i++)
  •     for (int k = 0; k < NB; k++)
  •       C[i][j] += A[i][k] * B[k][j];
  • NB: optimization parameter

11
Short excursion into tiling
12
MMM miss ratio
  • L1 cache miss ratio for Intel Pentium III
  • MMM with N = 1…1300
  • 16 KB, 32 B/line, 4-way; 8-byte elements

13
IJK version (large cache)
  • DO I = 1, N  // row-major storage
  • DO J = 1, N
  • DO K = 1, N
  • C(I,J) = C(I,J) + A(I,K)*B(K,J)

(Figure: matrices A, B, C; K indexes the columns of A and the rows of B)
  • Large-cache scenario:
  • Matrices are small enough to fit in cache
  • Only cold misses, no capacity misses
  • Miss ratio:
  • Data size: 3N²
  • Each miss brings in b floating-point numbers
  • Miss ratio = 3N²/(b · 4N³) = 0.75/(bN) ≈ 0.019 (b
    = 4, N = 10)

14
IJK version (small cache)
  • DO I = 1, N
  • DO J = 1, N
  • DO K = 1, N
  • C(I,J) = C(I,J) + A(I,K)*B(K,J)

(Figure: matrices A, B, C; K indexes the columns of A and the rows of B)
  • Small-cache scenario:
  • Matrices are large compared to cache
  • Reuse distance is not O(1) ⇒ miss
  • Cold and capacity misses
  • Miss ratio:
  • C: N²/b misses (good temporal locality)
  • A: N³/b misses (good spatial locality)
  • B: N³ misses (poor temporal and spatial
    locality)
  • Miss ratio ≈ 0.25(b+1)/b = 0.3125 (for b = 4)

15
MMM experiments
Tile size
  • L1 cache miss ratio for Intel Pentium III
  • MMM with N = 1…1300
  • 16 KB, 32 B/line, 4-way; 8-byte elements

16
Register-level blocking
  • Micro-MMM
  • A: MU × 1
  • B: 1 × NU
  • C: MU × NU
  • MU·NU + MU + NU registers
  • Unroll loops by MU, NU, and KU
  • Mini-MMM with Micro-MMM inside
  • for (int j = 0; j < NB; j += NU)
  •   for (int i = 0; i < NB; i += MU)
  •     load C[i..i+MU-1, j..j+NU-1] into registers
  •     for (int k = 0; k < NB; k++)   // unrolled KU times
  •       load A[i..i+MU-1, k] into registers
  •       load B[k, j..j+NU-1] into registers
  •       multiply A's and B's and add to C's
  •     store C[i..i+MU-1, j..j+NU-1]
  • MU, NU, KU: optimization parameters

17
Scheduling
(Figure: computation and memory operations interleaved in the instruction schedule)
  • FMA present?
  • Schedule computation
  • using Latency
  • Schedule memory operations
  • using IFetch, NFetch, FFetch
  • Latency, xFetch: optimization parameters

(Figure: loads L1, L2, L3, …, L(MU·NU) placed into the schedule)
18
Comments
  • Optimization parameters
  • NB constrained by size of L1 cache
  • MU, NU constrained by NR
  • KU constrained by size of I-cache
  • xFetch constrained by number of outstanding loads (OL)
  • MulAdd/Latency related to hardware parameters
  • Similar parameters would be used by compilers

(Graphs: MFlops vs. parameter value, for a sensitive and an insensitive parameter)
19
ATLAS Search
  • Original ATLAS Infrastructure
  • Model-Based ATLAS Infrastructure

20
High-level picture
  • Multi-dimensional optimization problem
  • Independent parameters: NB, MU, NU, KU, …
  • Dependent variable: MFlops
  • Function from parameters to variable is given
    implicitly; can be evaluated repeatedly
  • One optimization strategy: orthogonal range
    search
  • Optimize along one dimension at a time, using
    reference values for parameters not yet optimized
  • Not guaranteed to find optimal point, but might
    come close

21
Specification of OR Search
  • Order in which dimensions are optimized
  • Reference values for un-optimized dimensions at
    any step
  • Interval in which range search is done for each
    dimension

22
Search strategy
  • Find Best NB
  • Find Best MU NU
  • Find Best KU
  • Find Best xFetch
  • Find Best Latency (lat)
  • Find non-copy version tile size (NCNB)

23
Find Best NB
  • Search in the following range:
  • 16 ≤ NB ≤ 80
  • NB² ≤ L1Size
  • In this search, use simple estimates for other
    parameters
  • (e.g.) KU: test each candidate for
  • full K unrolling (KU = NB)
  • no K unrolling (KU = 1)

24
Finding other parameters
  • Find best MU, NU: try all MU, NU that satisfy
  • In this step, use best NB from previous step
  • Find best KU
  • Find best Latency
  • 16
  • Find best xFetch
  • IFetch ∈ [2, MU·NU],
    NFetch ∈ [1, MU·NU − IFetch]

1 ≤ MU, NU ≤ NB;  MU·NU + MU + NU ≤ NR
25
Our Models
  • Original ATLAS Infrastructure
  • Model-Based ATLAS Infrastructure

26
Modeling for Optimization Parameters
  • Optimization parameters
  • NB
  • Hierarchy of Models (later)
  • MU, NU
  • KU
  • maximize subject to L1 Instruction Cache
  • Latency, MulAdd
  • hardware parameters
  • xFetch
  • set to 2

27
Largest NB for no capacity/conflict misses
  • Tiles are copied into contiguous memory
  • Condition for cold misses only:
  • 3·NB² ≤ L1Size

(Figure: NB × NB tiles of A and B, indexed by i, j, k)
28
Largest NB for no capacity misses
  • MMM:
  • for (int j = 0; j < N; j++)
  •   for (int i = 0; i < N; i++)
  •     for (int k = 0; k < N; k++)
  •       c[i][j] += a[i][k] * b[k][j];
  • Cache model:
  • Fully associative
  • Line size: 1 word
  • Optimal replacement
  • Bottom line:
  • N² + N + 1 ≤ C
  • One full matrix
  • One row/column
  • One element

29
Extending the Model
  • Line size > 1
  • Spatial locality
  • Array layout in memory matters
  • Bottom line, depending on loop order:
  • either
  • or

30
Extending the Model (cont.)
  • LRU (not optimal) replacement
  • MMM sample:
  • for (int j = 0; j < N; j++)
  •   for (int i = 0; i < N; i++)
  •     for (int k = 0; k < N; k++)
  •       c[i][j] += a[i][k] * b[k][j];
  • Bottom line:

31
Summary: Modeling for Tile Size (NB)
  • Models of increasing complexity:
  • 3·NB² ≤ C
  • Whole working set fits in L1
  • NB² + NB + 1 ≤ C
  • Fully associative
  • Optimal replacement
  • Line size: 1 word
  • or
  • Line size > 1 word
  • or
  • LRU replacement

32
Comments
  • A lot of work in the compiler literature on
    automatic tile size selection
  • Not much is known about how well these algorithms
    do in practice
  • Few comparisons to BLAS
  • Not obvious how one generalizes our models to
    more complex codes
  • Insight needed: how sensitive is performance to
    tile size?

33
Experiments
  • Architectures
  • SGI R12000, 270MHz
  • Sun UltraSPARC III, 900MHz
  • Intel Pentium III, 550MHz
  • Measure
  • Mini-MMM performance
  • Complete MMM performance
  • Sensitivity of performance to parameter
    variations

34
Installation Time of ATLAS vs. Model
35
MiniMMM Performance
  • SGI
  • ATLAS: 457 MFLOPS
  • Model: 453 MFLOPS
  • Difference: 1%
  • Sun
  • ATLAS: 1287 MFLOPS
  • Model: 1052 MFLOPS
  • Difference: 20%
  • Intel
  • ATLAS: 394 MFLOPS
  • Model: 384 MFLOPS
  • Difference: 2%

36
MMM Performance
  • SGI
  • Sun
  • Intel

(Graphs: BLAS, MODEL, F77, ATLAS)
37
Optimization Parameter Values
           NB   MU/NU/KU   F/I/N-Fetch   Latency
  ATLAS
  SGI      64   4/4/64     0/5/1         3
  Sun      48   5/3/48     0/3/5         5
  Intel    40   2/1/40     0/3/1         4
  Model
  SGI      62   4/4/62     1/2/2         6
  Sun      88   4/4/78     1/2/2         4
  Intel    42   2/1/42     1/2/2         3

38
Sensitivity to NB and Latency Sun
  • Tile Size (NB)
  • Latency

(Graphs: performance vs. NB and vs. Latency, with the ATLAS, MODEL, and BEST values marked)
39
Sensitivity to NB SGI
(Graph: performance vs. NB, with the ATLAS, MODEL, and BEST values marked)
40
Sensitivity to NB Intel
41
Sensitivity to MU,NU SGI
42
Sensitivity to MU,NU Sun
43
Sensitivity to MU,NU Intel
44
Shape of register tile matters
45
Sensitivity to KU
46
Conclusions
  • Search is not as important as one might think
  • Compilers can achieve near-ATLAS performance if
    they
  • implement well-known transformations
  • use models to choose parameter values
  • There is room for improvement in both models and
    empirical search
  • Both are 20–25% slower than BLAS
  • Higher levels of the memory hierarchy cannot be
    neglected

47
Future Directions
  • Study hand-written BLAS codes to understand
    performance gap
  • Repeat study with FFTW/SPIRAL
  • Uses search to choose between algorithms
  • Combine models with search
  • Use models to speed up empirical search
  • Use empirical studies to enhance models
  • Feed insights back into compilers
  • How do we make it easier for compiler writers to
    implement transformations?
  • Use insights to simplify memory system

48
Information
  • URL: http://iss.cs.cornell.edu
  • Email: pingali@cs.cornell.edu

49
Sensitivity to Latency Intel