Title: SPIRAL: Current Status
1SPIRAL Current Status
Markus Püschel
Students
- Gavin Haentjens (CMU)
- Pinit Kumhom (Drexel)
- Neungsoo Park (USC)
- David Sepiashvili (CMU)
- Bryan Singer (CMU)
- Yevgen Voronenko (Drexel)
- Edward Wertz (CMU)
- Jianxin Xiong (UIUC)
Faculty
- José Moura (CMU)
- Jeremy Johnson (Drexel)
- Robert Johnson (MathStar)
- David Padua (UIUC)
- Viktor Prasanna (USC)
- Markus Püschel (CMU)
- Manuela Veloso (CMU)
Collaborators
- Christoph Überhuber (TU Vienna)
- Franz Franchetti (TU Vienna)
http//www.ece.cmu.edu/spiral
2Sponsor
Work supported by DARPA (DSO), Applied
Computational Mathematics Program, OPAL, through
grant managed by research grant DABT63-98-1-0004
administered by the Army Directorate of
Contracting.
3Moores Law and High(est) Performance Scientific
Computing
4SPIRAL
Automates
Implementation
Optimization
Platform-Adaptation
of DSP algorithms
5SPIRAL system
6DSP Transform
Algorithm
7DSP Algorithms Example 4-point DFT
Cooley/Tukey FFT (size 4)
Fourier transform
Diagonal matrix (twiddles)
Permutation
Kronecker product
Identity
- product of structured sparse matrices
- mathematical notation
8DSP Algorithms Terminology
Transform
parameterized matrix
Rule
- a breakdown strategy
- product of sparse matrices
Ruletree
- recursive application of rules
- uniquely defines an algorithm
- efficient representation
- easy manipulation
Formula
- few constructs and primitives
- uniquely defines an algorithm
- can be translated into code
9DSP Transforms
discrete Fourier transform
Walsh-Hadamard transform
discrete cosine and sine Transforms (16 types)
modified discrete cosine transform
two-dimensional transform
Others filters, discrete wavelet transforms,
Haar, Hartley,
10Rules Breakdown Strategies
base case
recursive
translation
iterative
recursive
recursive
recursive
recursive
recursive
iterative/ recursive
translation
11Formula for a DCT, size 16
12DSP Transform
Algorithm (Formula)
Implementation
13Formulas in SPL
( compose ( diagonal ( 2cos(1/16pi)
2cos(3/16pi) 2cos(5/16pi) 2cos(7/16pi) ) )
( permutation ( 1 3 4 2 ) ) ( tensor
( I 2 ) ( F 2 ) ) (
permutation ( 1 4 2 3 ) ) ( direct_sum
( compose ( F 2 ) (
diagonal ( 1 sqrt(1/2) ) ) ) (
compose ( matrix ( 1 1 0 )
( 0 (-1) 1 ) ) (
diagonal ( cos(13/8pi)-sin(13/8pi) sin(13/8pi)
cos(13/8pi)sin(13/8pi) ) ) ( matrix
( 1 0 ) ( 1 1 )
( 0 1 ) ) ( permutation ( 2
1 ) )
14SPL Syntax (Subset)
- matrix operations
- (compose formula formula ...)
- (tensor formula formula ...)
- (direct_sum formula formula ...)
- direct matrix description
- (matrix (a11 a12 ...) (a21 a22 ...) ...)
- (diagonal (d1 d2 ...))
- (permutation (p1 p2 ...))
- parameterized matrices
- (I n)
- (F n)
- scalars
- 1.5, 2/7, cos(..), w(3), pi, 1.2e-04
- definition of new symbols
- (define name formula)
- (template formula (i-code-list)
- directives for code generation
- codetype real/complex
- unroll on/off
allows extension of SPL
controls loop unrolling
15SPL Compiler, 4-point FFT
fast algorithm as formula as SPL program
(compose (tensor (F 2) (I 2)) (T 4 2) (tensor
(I 2) (F 2)) (L 4 2))
codetype
complex
real
16SPL Compiler Summary
SPL Program
SPL Formula
Template Definition
Symbol Definition
Parsing
Symbol Table
Abstract Syntax Tree
Template Table
Intermediate Code Generation
I-Code
Intermediate Code Restructuring
I-Code
Built-in optimizations
Optimization
- single static assignment code
- no reuse of temporary vars
- only scalar temporary vars
- constants precomputed
- limited CSE
I-Code
Target Code Generation
C, FORTRAN function
Extensible through templates
17DSP Transform
Algorithm (Formula)
Search
Implementation
18Why Search?
DCT, type IV, size 16
31000 formulas
- maaaany different formulas
- large spread in runtimes, even for modest size
- not due to arithmetic cost
- best formula is platform-dependent
19Search Methods available in SPIRAL
- Exhaustive Search
- Dynamic Programming (DP)
- Random Search
- Hill Climbing
- STEER (similar to a genetic algorithm)
Possible Formulas
Sizes Timed Results
Exhaust Very small All Best
DP All 10s-100s (very) good
Random All User decided fair/good
Hill Climbing All 100s-1000s Good
STEER All 100s-1000s (very) good
- Search over
- algorithm space and
- implementation options (degree of unrolling)
20STEER
Population n
Mutation
expand differently
Cross-Breeding
Population n1
swap expansions
Survival of Fittest
21Experimental Results (C code)
search methods (applicable to all transforms)
high performance code (compared with FFTW)
different transforms
generated high quality code
22SPIRAL System
- Available for download (v3.1)
www.ece.cmu.edu/spiral - Easy installation (Unix configure/make
Windows install shield) - Unix/Linux and Windows 98/ME/NT/2000/XP
- Current transforms DFT, DHT, WHT, RHT, DCT/DST
type I IV, - MDCT, Filters, Wavelets, Toeplitz, Circulants
- Extensible
- New version (4.0) in preparation
23Recent Work
24Learning to Generate Fast Algorithms
- Learns from given dataset (formulasruntimes)
how to design a fast algorithm (breakdown
strategy) - Learns from a transform of one size, generates
the best algorithm for many sizes - Tested for DFT and WHT
25SIMD Short Vector Extensions
vector length 4
(4-way)
x
- Extension to instruction set architecture
- Available on most current architectures
- (SSE on Pentium, AltiVec on Motorola G4)
- Originally for multimedia (like MMX for
integers) - Requires fine grain parallelism
- Large potential speed-up
Problems
- SIMD instructions are architecture specific
- No common API (usually assembly hand coding)
- Performance very sensitive to memory access
- Automatic vectorization very limited
very difficult to use
26Vector code generation from SPL formulas
27Generated Vector Code DFTs Pentium 4
gflops
n
DFT 2n, Pentium 4, 2.53 GHz, using Intel C
compiler 6.0
- speedups (to C code) up to factor of 3.3
- beats hand-tuned vendor library
28Generated Vector Code, Other Transforms
2-dim DCT
WHT
normalized runtime
normalized runtime
transform size
transform size
speedups up to factor of 2.5
29Flexible Loop Interleaving (Runtime Gain WHT)
Athlon XP up to 55
Pentium 4 up to 45
UltraSparc III up to 60
Alpha 21264 up to 70
30Parallel Code Generation Example WHT
PowerPC RS64 III
1 thread
8 threads
10 threads
WHT size log(N)
Parallelized constructs In ? A, A ? In
31Code Scheduling
Runtime histograms
DFT, size 16 6500 formulas
DCT4, size 16 16000 formulas
unscheduled scheduled
32Filters and Wavelets
New constructs row/column overlapped tensor
product
Examples for rules
33Conclusions
- Automatic code generation for the entire domain
of (linear) DSP algorithms - Portable high performance across platforms and
across time - Integration of math (high) level and
implementation (low) level - Intelligence through search and learning in the
space of alternatives
34Future Plans
- Transforms Radon, Gabor, Hankel, structured
matrices, - Target platforms parallel platforms, DSP
processors, SW/HW architectures, FPGAs, ASICs - Instructions Vector, FMAs, prefetching, OpenMP,
MPI - Beyond transforms entire DSP applications
- Other domains amenable to SPIRAL approach?
35Questions?
http//www.ece.cmu.edu/spiral
36Extra Slides
37Generating Parallel Programs
- Interpret constructs such as In ? A as parallel
operations and transform formulas to obtain
maximal parallelism. - Explore alternative data access patterns
mathematically (e.g. different permutations in
matrix factorizations) - Prototype implementation using WHT
- Build on existing sequential package
- SMP implementation using OpenMP (IPDPS02)
- 90 efficiency obtained on 12 processor PowerPC
RS64 III - Distributed memory implementation using MPI
(POHLL02)
38Comparison of Parallel DDL Schemes
10
PowerPC RS64 III
10 threads
8
Best sequential with
6
DDL
Speedup
Parallel without DDL
4
Coarse-grained DDL
2
Fine-grained DDL
with ID shift
0
1
6
11
16
21
26
WHT size log(N)
39Overall Parallel Speedup
10
PowerPC RS64 III
8
1 thread
6
Speedup
8 threads
10 threads
4
2
0
1
6
11
16
21
26
WHT size log(N)
40Performance of Digit Permutations on CRAY T3E
41Architecture Framework
Parameters Dimension, Pd
M Memory AG Address Generator CU
Computation Unit
I/O Interface
Parameters Pi, 1 ? i ? n, no. of processor
AG
Parameters Dimension, and Ti, 1 ? i ? n, no. of
processor 2m
CU
42Pease Algorithm Dataflow
43Optimal Dataflow
44Pease v.s. Optimal
45Performance Speedup
Speedup S/C C no. of. clock cycles using
4-processor FFT engine S minimum no. of clock
cycles using single processor (42nn,
2n is the number of points)
46SPIRAL Approach
given
DSP Transform (DFT, DCT, Wavelets etc.)
Possible Algorithms
Possible Implementations
SPIRAL Search Space
Intelligent Search
Performance Evaluation
given
Computing Platform
(Pentium III, Pentium 4, Athlon, SUN, PowerPC,
Alpha, )
47Classical Code Generation System
given
DSP Transform (DFT, DCT, Wavelets etc.)
Math/Algorithm Expert
Expert Programmer
Performance Evaluation
given
Computing Platform
(Pentium III, Pentium 4, Athlon, SUN, PowerPC,
Alpha, )
48Algorithms Ruletrees Formulas
49Mathematical Framework Summary
- fast algorithms represented as ruletrees (easy
generation/manipulation) - and as formulas (can be translated into code)
- formulas built from few constructs and
primitives - many different algorithms/formulas generated
from few rules - (combinatorial explosion)
- these algorithms are (essentially) equal in
arithmetic cost, - but differ in data flow
50Formula Generation
data base (extensible!)
data type
Formula Generator
recursive application
runtime
rules
control
search engine
formula translation (spl compiler)
transforms
formulas
ruletrees
export
translation
- written in GAP/AREP (computer algebra system)
- all computation/manipulation is symbolic
- exact arithmetic
- easy extensible rule and transform data base
- verification of rules and formulas
51Number of Formulas/Algorithms
k 1 2 3 4 5 6 7 8 9
DFT, size 2k 1 6 40 296 27744 162570361280 1.
01 1027 2.31 1061 2.86 10133
DCT-IV, size 2k 1 10 126 31242 1924443362 7343
815121631354242 1.07 1038 2.30 1076 1.06
10153
- differ in data flow not in arithmetic cost
- exponential search space
52Extensibility of SPIRAL
New transforms are readily included on the high
level
(easy, due to SPIRALs framework)
New constructs and primitives (potentially
required by radically different transforms) are
readily included in SPL
(moderate effort, due to template mechanism)
New instructions sets available (e.g., SSE) are
included by extending the SPL compiler
(doable one time effort)
53Generated Vector Code DFTs Pentium III
gflops
n
DFT 2n, Pentium III, 1 GHz, using Intel C
compiler 6.0
speedups (to C code) up to factor of 2.5