SPIRAL: Current Status - PowerPoint PPT Presentation

About This Presentation
Title:

SPIRAL: Current Status

Description:

Jos Moura (CMU) Jeremy Johnson (Drexel) Robert Johnson (MathStar) ... controls loop unrolling. SPL Compiler, 4-point FFT (compose (tensor (F 2) (I 2)) (T 4 2) ... – PowerPoint PPT presentation

Number of Views:68
Avg rating:3.0/5.0
Slides: 54
Provided by: jose270
Category:

less

Transcript and Presenter's Notes

Title: SPIRAL: Current Status


1
SPIRAL Current Status
Markus Püschel
Students
  • Gavin Haentjens (CMU)
  • Pinit Kumhom (Drexel)
  • Neungsoo Park (USC)
  • David Sepiashvili (CMU)
  • Bryan Singer (CMU)
  • Yevgen Voronenko (Drexel)
  • Edward Wertz (CMU)
  • Jianxin Xiong (UIUC)

Faculty
  • José Moura (CMU)
  • Jeremy Johnson (Drexel)
  • Robert Johnson (MathStar)
  • David Padua (UIUC)
  • Viktor Prasanna (USC)
  • Markus Püschel (CMU)
  • Manuela Veloso (CMU)

Collaborators
  • Christoph Überhuber (TU Vienna)
  • Franz Franchetti (TU Vienna)

http//www.ece.cmu.edu/spiral
2
Sponsor
Work supported by DARPA (DSO), Applied
Computational Mathematics Program, OPAL, through
grant managed by research grant DABT63-98-1-0004
administered by the Army Directorate of
Contracting.
3
Moores Law and High(est) Performance Scientific
Computing
4
SPIRAL
Automates
Implementation
Optimization
Platform-Adaptation
of DSP algorithms
5
SPIRAL system
6
DSP Transform
Algorithm
7
DSP Algorithms Example 4-point DFT
Cooley/Tukey FFT (size 4)
Fourier transform
Diagonal matrix (twiddles)
Permutation
Kronecker product
Identity
  • product of structured sparse matrices
  • mathematical notation

8
DSP Algorithms Terminology
Transform
parameterized matrix
Rule
  • a breakdown strategy
  • product of sparse matrices

Ruletree
  • recursive application of rules
  • uniquely defines an algorithm
  • efficient representation
  • easy manipulation

Formula
  • few constructs and primitives
  • uniquely defines an algorithm
  • can be translated into code

9
DSP Transforms
discrete Fourier transform
Walsh-Hadamard transform
discrete cosine and sine Transforms (16 types)
modified discrete cosine transform
two-dimensional transform
Others filters, discrete wavelet transforms,
Haar, Hartley,
10
Rules Breakdown Strategies
base case
recursive
translation
iterative
recursive
recursive
recursive
recursive
recursive
iterative/ recursive
translation
11
Formula for a DCT, size 16
12
DSP Transform
Algorithm (Formula)
Implementation
13
Formulas in SPL

( compose ( diagonal ( 2cos(1/16pi)
2cos(3/16pi) 2cos(5/16pi) 2cos(7/16pi) ) )
( permutation ( 1 3 4 2 ) ) ( tensor
( I 2 ) ( F 2 ) ) (
permutation ( 1 4 2 3 ) ) ( direct_sum
( compose ( F 2 ) (
diagonal ( 1 sqrt(1/2) ) ) ) (
compose ( matrix ( 1 1 0 )
( 0 (-1) 1 ) ) (
diagonal ( cos(13/8pi)-sin(13/8pi) sin(13/8pi)
cos(13/8pi)sin(13/8pi) ) ) ( matrix
( 1 0 ) ( 1 1 )
( 0 1 ) ) ( permutation ( 2
1 ) )

14
SPL Syntax (Subset)
  • matrix operations
  • (compose formula formula ...)
  • (tensor formula formula ...)
  • (direct_sum formula formula ...)
  • direct matrix description
  • (matrix (a11 a12 ...) (a21 a22 ...) ...)
  • (diagonal (d1 d2 ...))
  • (permutation (p1 p2 ...))
  • parameterized matrices
  • (I n)
  • (F n)
  • scalars
  • 1.5, 2/7, cos(..), w(3), pi, 1.2e-04
  • definition of new symbols
  • (define name formula)
  • (template formula (i-code-list)
  • directives for code generation
  • codetype real/complex
  • unroll on/off

allows extension of SPL
controls loop unrolling
15
SPL Compiler, 4-point FFT
fast algorithm as formula as SPL program
(compose (tensor (F 2) (I 2)) (T 4 2) (tensor
(I 2) (F 2)) (L 4 2))
codetype
complex
real
16
SPL Compiler Summary
SPL Program
SPL Formula
Template Definition
Symbol Definition
Parsing
Symbol Table
Abstract Syntax Tree
Template Table
Intermediate Code Generation
I-Code
Intermediate Code Restructuring
I-Code
Built-in optimizations
Optimization
  • single static assignment code
  • no reuse of temporary vars
  • only scalar temporary vars
  • constants precomputed
  • limited CSE

I-Code
Target Code Generation
C, FORTRAN function
Extensible through templates
17
DSP Transform
Algorithm (Formula)
Search
Implementation
18
Why Search?
DCT, type IV, size 16
31000 formulas
  • maaaany different formulas
  • large spread in runtimes, even for modest size
  • not due to arithmetic cost
  • best formula is platform-dependent

19
Search Methods available in SPIRAL
  • Exhaustive Search
  • Dynamic Programming (DP)
  • Random Search
  • Hill Climbing
  • STEER (similar to a genetic algorithm)

Possible Formulas
Sizes Timed Results
Exhaust Very small All Best
DP All 10s-100s (very) good
Random All User decided fair/good
Hill Climbing All 100s-1000s Good
STEER All 100s-1000s (very) good
  • Search over
  • algorithm space and
  • implementation options (degree of unrolling)

20
STEER
Population n
Mutation
expand differently

Cross-Breeding
Population n1
swap expansions

Survival of Fittest
21
Experimental Results (C code)
search methods (applicable to all transforms)
high performance code (compared with FFTW)
different transforms
generated high quality code
22
SPIRAL System
  • Available for download (v3.1)
    www.ece.cmu.edu/spiral
  • Easy installation (Unix configure/make
    Windows install shield)
  • Unix/Linux and Windows 98/ME/NT/2000/XP
  • Current transforms DFT, DHT, WHT, RHT, DCT/DST
    type I IV,
  • MDCT, Filters, Wavelets, Toeplitz, Circulants
  • Extensible
  • New version (4.0) in preparation

23
Recent Work
24
Learning to Generate Fast Algorithms
  • Learns from given dataset (formulasruntimes)
    how to design a fast algorithm (breakdown
    strategy)
  • Learns from a transform of one size, generates
    the best algorithm for many sizes
  • Tested for DFT and WHT

25
SIMD Short Vector Extensions
vector length 4
(4-way)
x
  • Extension to instruction set architecture
  • Available on most current architectures
  • (SSE on Pentium, AltiVec on Motorola G4)
  • Originally for multimedia (like MMX for
    integers)
  • Requires fine grain parallelism
  • Large potential speed-up

Problems
  • SIMD instructions are architecture specific
  • No common API (usually assembly hand coding)
  • Performance very sensitive to memory access
  • Automatic vectorization very limited

very difficult to use
26
Vector code generation from SPL formulas
27
Generated Vector Code DFTs Pentium 4
gflops
n
DFT 2n, Pentium 4, 2.53 GHz, using Intel C
compiler 6.0
  • speedups (to C code) up to factor of 3.3
  • beats hand-tuned vendor library

28
Generated Vector Code, Other Transforms
2-dim DCT
WHT
normalized runtime
normalized runtime
transform size
transform size
speedups up to factor of 2.5
29
Flexible Loop Interleaving (Runtime Gain WHT)
Athlon XP up to 55
Pentium 4 up to 45
UltraSparc III up to 60
Alpha 21264 up to 70
30
Parallel Code Generation Example WHT
PowerPC RS64 III
1 thread
8 threads
10 threads
WHT size log(N)
Parallelized constructs In ? A, A ? In
31
Code Scheduling
Runtime histograms
DFT, size 16 6500 formulas
DCT4, size 16 16000 formulas
unscheduled scheduled
32
Filters and Wavelets
New constructs row/column overlapped tensor
product
Examples for rules
33
Conclusions
  • Automatic code generation for the entire domain
    of (linear) DSP algorithms
  • Portable high performance across platforms and
    across time
  • Integration of math (high) level and
    implementation (low) level
  • Intelligence through search and learning in the
    space of alternatives

34
Future Plans
  • Transforms Radon, Gabor, Hankel, structured
    matrices,
  • Target platforms parallel platforms, DSP
    processors, SW/HW architectures, FPGAs, ASICs
  • Instructions Vector, FMAs, prefetching, OpenMP,
    MPI
  • Beyond transforms entire DSP applications
  • Other domains amenable to SPIRAL approach?

35
Questions?
http//www.ece.cmu.edu/spiral
36
Extra Slides
37
Generating Parallel Programs
  • Interpret constructs such as In ? A as parallel
    operations and transform formulas to obtain
    maximal parallelism.
  • Explore alternative data access patterns
    mathematically (e.g. different permutations in
    matrix factorizations)
  • Prototype implementation using WHT
  • Build on existing sequential package
  • SMP implementation using OpenMP (IPDPS02)
  • 90 efficiency obtained on 12 processor PowerPC
    RS64 III
  • Distributed memory implementation using MPI
    (POHLL02)

38
Comparison of Parallel DDL Schemes
10
PowerPC RS64 III
10 threads
8
Best sequential with
6
DDL
Speedup
Parallel without DDL
4
Coarse-grained DDL
2
Fine-grained DDL
with ID shift
0
1
6
11
16
21
26
WHT size log(N)
39
Overall Parallel Speedup
10
PowerPC RS64 III
8
1 thread
6
Speedup
8 threads
10 threads
4
2
0
1
6
11
16
21
26
WHT size log(N)
40
Performance of Digit Permutations on CRAY T3E
41
Architecture Framework
Parameters Dimension, Pd
M Memory AG Address Generator CU
Computation Unit
I/O Interface
Parameters Pi, 1 ? i ? n, no. of processor
AG
Parameters Dimension, and Ti, 1 ? i ? n, no. of
processor 2m
CU
42
Pease Algorithm Dataflow
43
Optimal Dataflow
44
Pease v.s. Optimal
45
Performance Speedup
Speedup S/C C no. of. clock cycles using
4-processor FFT engine S minimum no. of clock
cycles using single processor (42nn,
2n is the number of points)
46
SPIRAL Approach
given
DSP Transform (DFT, DCT, Wavelets etc.)
Possible Algorithms
Possible Implementations
SPIRAL Search Space
Intelligent Search
Performance Evaluation
given
Computing Platform
(Pentium III, Pentium 4, Athlon, SUN, PowerPC,
Alpha, )
47
Classical Code Generation System
given
DSP Transform (DFT, DCT, Wavelets etc.)
Math/Algorithm Expert
Expert Programmer
Performance Evaluation
given
Computing Platform
(Pentium III, Pentium 4, Athlon, SUN, PowerPC,
Alpha, )
48
Algorithms Ruletrees Formulas
49
Mathematical Framework Summary
  • fast algorithms represented as ruletrees (easy
    generation/manipulation)
  • and as formulas (can be translated into code)
  • formulas built from few constructs and
    primitives
  • many different algorithms/formulas generated
    from few rules
  • (combinatorial explosion)
  • these algorithms are (essentially) equal in
    arithmetic cost,
  • but differ in data flow

50
Formula Generation
data base (extensible!)
data type
Formula Generator
recursive application
runtime
rules
control
search engine
formula translation (spl compiler)
transforms
formulas
ruletrees
export
translation
  • written in GAP/AREP (computer algebra system)
  • all computation/manipulation is symbolic
  • exact arithmetic
  • easy extensible rule and transform data base
  • verification of rules and formulas

51
Number of Formulas/Algorithms
k 1 2 3 4 5 6 7 8 9
DFT, size 2k 1 6 40 296 27744 162570361280 1.
01 1027 2.31 1061 2.86 10133
DCT-IV, size 2k 1 10 126 31242 1924443362 7343
815121631354242 1.07 1038 2.30 1076 1.06
10153
  • differ in data flow not in arithmetic cost
  • exponential search space

52
Extensibility of SPIRAL
New transforms are readily included on the high
level
(easy, due to SPIRALs framework)
New constructs and primitives (potentially
required by radically different transforms) are
readily included in SPL
(moderate effort, due to template mechanism)
New instructions sets available (e.g., SSE) are
included by extending the SPL compiler
(doable one time effort)
53
Generated Vector Code DFTs Pentium III
gflops
n
DFT 2n, Pentium III, 1 GHz, using Intel C
compiler 6.0
speedups (to C code) up to factor of 2.5
Write a Comment
User Comments (0)
About PowerShow.com