SPIRAL: Current Status - PowerPoint PPT Presentation


Provided by: jose270

Learn more at: http://polaris.cs.uiuc.edu


Transcript and Presenter's Notes

Title: SPIRAL: Current Status

1
SPIRAL: Current Status
Markus Püschel
Students

Gavin Haentjens (CMU)
Pinit Kumhom (Drexel)
Neungsoo Park (USC)
David Sepiashvili (CMU)
Bryan Singer (CMU)
Yevgen Voronenko (Drexel)
Edward Wertz (CMU)
Jianxin Xiong (UIUC)

Faculty

José Moura (CMU)
Jeremy Johnson (Drexel)
Robert Johnson (MathStar)
David Padua (UIUC)
Viktor Prasanna (USC)
Markus Püschel (CMU)
Manuela Veloso (CMU)

Collaborators

Christoph Überhuber (TU Vienna)
Franz Franchetti (TU Vienna)

http://www.ece.cmu.edu/spiral
2
Sponsor
Work supported by DARPA (DSO), Applied
Computational Mathematics Program, OPAL, through
research grant DABT63-98-1-0004
administered by the Army Directorate of
Contracting.
3
Moore's Law and High(est) Performance Scientific
Computing
4
SPIRAL
Automates
Implementation
Optimization
Platform-Adaptation
of DSP algorithms
5
SPIRAL system
6
DSP Transform
Algorithm
7
DSP Algorithms: Example 4-point DFT
Cooley/Tukey FFT (size 4)
Fourier transform
Diagonal matrix (twiddles)
Permutation
Kronecker product
Identity

product of structured sparse matrices
mathematical notation
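
The factorization on this slide can be checked numerically. The sketch below is not part of the presentation: it assumes the DFT convention w = exp(-2*pi*i/n), a twiddle matrix T = diag(1, 1, 1, w) and a stride permutation that reorders (x0, x1, x2, x3) to (x0, x2, x1, x3), and it multiplies the four sparse factors to compare the result with the dense 4-point DFT matrix. The helper names are illustrative.

```python
import cmath

def matmul(a, b):
    """Dense matrix product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def kron(a, b):
    """Kronecker (tensor) product of a and b."""
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]

def dft(n):
    """DFT matrix with entries w^(j*k), assuming w = exp(-2*pi*i/n)."""
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (j * k) for k in range(n)] for j in range(n)]

I2 = [[1, 0], [0, 1]]
F2 = [[1, 1], [1, -1]]
w4 = cmath.exp(-2j * cmath.pi / 4)
# Diagonal matrix of twiddle factors (assumed form for size 4).
T42 = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, w4]]
# Stride permutation: output (x0, x2, x1, x3).
L42 = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]]

# (F2 tensor I2) * T * (I2 tensor F2) * L
product = matmul(matmul(matmul(kron(F2, I2), T42), kron(I2, F2)), L42)
max_error = max(abs(p - d)
                for row_p, row_d in zip(product, dft(4))
                for p, d in zip(row_p, row_d))
print("factorization matches DFT_4:", max_error < 1e-12)
```

Each factor has at most two nonzero entries per row, which is where the savings over the dense matrix-vector product come from.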

8
DSP Algorithms: Terminology
Transform
parameterized matrix
Rule

a breakdown strategy
product of sparse matrices

Ruletree

recursive application of rules
uniquely defines an algorithm
efficient representation
easy manipulation

Formula

few constructs and primitives
uniquely defines an algorithm
can be translated into code
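
To make the ruletree/formula distinction concrete, here is a minimal sketch (not SPIRAL's own data structures) in which a ruletree is a nested tuple of size splits and the only rule is the Cooley-Tukey breakdown. Expanding the tree recursively yields a formula string in the SPL notation used later in the deck. The names `size` and `to_spl` are hypothetical.

```python
def size(tree):
    """Transform size of a ruletree: a leaf is an int, a node is (left, right)."""
    if isinstance(tree, int):
        return tree
    return size(tree[0]) * size(tree[1])

def to_spl(tree):
    """Expand a ruletree into an SPL formula by applying Cooley-Tukey at each node."""
    if isinstance(tree, int):
        return f"(F {tree})"  # leaf: unexpanded DFT of that size
    left, right = tree
    r, s = size(left), size(right)
    n = r * s
    return (f"(compose (tensor {to_spl(left)} (I {s})) (T {n} {s}) "
            f"(tensor (I {r}) {to_spl(right)}) (L {n} {r}))")

# Two different ruletrees for size 8 give two different formulas.
print(to_spl((2, 2)))
print(to_spl((2, (2, 2))))
print(to_spl(((2, 2), 2)))
```

The tree is compact and easy to manipulate (swap or re-expand a subtree), while the expanded formula is what a compiler can translate into code.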

9
DSP Transforms
discrete Fourier transform
Walsh-Hadamard transform
discrete cosine and sine transforms (16 types)
modified discrete cosine transform
two-dimensional transform
Others: filters, discrete wavelet transforms,
Haar, Hartley, ...
10
Rules: Breakdown Strategies
[Table of breakdown rules; the equations are not preserved in the transcript. Each rule is labeled by type: base case, recursive, translation, iterative, or iterative/recursive.]
11
Formula for a DCT, size 16
12
DSP Transform
Algorithm (Formula)
Implementation
13
Formulas in SPL

(compose
  (diagonal (2*cos(1/16*pi) 2*cos(3/16*pi) 2*cos(5/16*pi) 2*cos(7/16*pi)))
  (permutation (1 3 4 2))
  (tensor (I 2) (F 2))
  (permutation (1 4 2 3))
  (direct_sum
    (compose (F 2) (diagonal (1 sqrt(1/2))))
    (compose
      (matrix (1 1 0) (0 (-1) 1))
      (diagonal (cos(13/8*pi)-sin(13/8*pi) sin(13/8*pi) cos(13/8*pi)+sin(13/8*pi)))
      (matrix (1 0) (1 1) (0 1))
      (permutation (2 1)))))
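
The inner compose of a 2x3 matrix, a 3-entry diagonal, and a 3x2 matrix looks like the standard three-multiplication form of a plane rotation. The sketch below checks that reading under the assumption that the diagonal entries are c - s, s, and c + s with c = cos(13/8*pi) and s = sin(13/8*pi); the operators between the trigonometric terms are partly lost in the transcript, so this is an interpretation, not a quotation.

```python
import math

def matmul(a, b):
    """Dense matrix product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

angle = 13 / 8 * math.pi
c, s = math.cos(angle), math.sin(angle)

left = [[1, 1, 0], [0, -1, 1]]                            # (matrix (1 1 0) (0 (-1) 1))
diag = [[c - s, 0, 0], [0, s, 0], [0, 0, c + s]]          # assumed diagonal entries
right = [[1, 0], [1, 1], [0, 1]]                          # (matrix (1 0) (1 1) (0 1))

product = matmul(matmul(left, diag), right)
rotation = [[c, s], [-s, c]]                              # plane rotation by the same angle

max_error = max(abs(p - r)
                for row_p, row_r in zip(product, rotation)
                for p, r in zip(row_p, row_r))
print("three-multiplication rotation verified:", max_error < 1e-12)
```

With these entries the rotation costs three multiplications instead of four, at the price of extra additions.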

14
SPL Syntax (Subset)

matrix operations
(compose formula formula ...)
(tensor formula formula ...)
(direct_sum formula formula ...)
direct matrix description
(matrix (a11 a12 ...) (a21 a22 ...) ...)
(diagonal (d1 d2 ...))
(permutation (p1 p2 ...))
parameterized matrices
(I n)
(F n)
scalars
1.5, 2/7, cos(..), w(3), pi, 1.2e-04
definition of new symbols (allows extension of SPL)
(define name formula)
(template formula (i-code-list))
directives for code generation
codetype real/complex
unroll on/off (controls loop unrolling)
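
As an illustration of how little machinery this syntax needs, here is a toy evaluator (not the SPL compiler) that parses a small subset of SPL and builds the dense matrix a formula denotes. It is a sketch under several assumptions: only plain numeric scalars are handled, `(F n)` uses w = exp(-2*pi*i/n), and `(permutation (p1 p2 ...))` is read as "output i takes input p_i" with 1-based indices. Symbols such as `T` and `L`, `define`, templates, and directives are not covered.

```python
import cmath

def tokenize(text):
    return text.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    """Turn a token list into nested Python lists (consumes the tokens)."""
    token = tokens.pop(0)
    if token != "(":
        return token
    items = []
    while tokens[0] != ")":
        items.append(parse(tokens))
    tokens.pop(0)  # drop the closing parenthesis
    return items

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def kron(a, b):
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]

def direct_sum(a, b):
    """Block-diagonal matrix with a in the top-left and b in the bottom-right."""
    return ([row + [0] * len(b[0]) for row in a] +
            [[0] * len(a[0]) + row for row in b])

def evaluate(expr):
    """Evaluate a parsed SPL expression to a dense matrix (list of rows)."""
    op, args = expr[0], expr[1:]
    if op == "I":
        n = int(args[0])
        return [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    if op == "F":
        n = int(args[0])
        w = cmath.exp(-2j * cmath.pi / n)
        return [[w ** (j * k) for k in range(n)] for j in range(n)]
    if op == "diagonal":
        d = [float(v) for v in args[0]]
        return [[d[i] if i == j else 0 for j in range(len(d))] for i in range(len(d))]
    if op == "permutation":
        p = [int(v) for v in args[0]]
        return [[1 if p[i] == j + 1 else 0 for j in range(len(p))] for i in range(len(p))]
    combine = {"compose": matmul, "tensor": kron, "direct_sum": direct_sum}[op]
    result = evaluate(args[0])
    for arg in args[1:]:
        result = combine(result, evaluate(arg))
    return result

# (F2 tensor I2)(I2 tensor F2) equals F2 tensor F2, the 4-point WHT.
wht4 = evaluate(parse(tokenize("(compose (tensor (F 2) (I 2)) (tensor (I 2) (F 2)))")))
print([[round(v.real) for v in row] for row in wht4])
```

A dense evaluator like this is only useful for checking formulas; the point of SPL is to generate code that never forms these matrices.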
15
SPL Compiler, 4-point FFT
fast algorithm as formula, and as SPL program
(compose (tensor (F 2) (I 2)) (T 4 2) (tensor (I 2) (F 2)) (L 4 2))
codetype: complex or real
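
For a sense of what fully unrolled code for this formula might look like, here is a hand-written sketch in Python for the complex codetype. It is not output of the SPL compiler (which targets C and FORTRAN); it only mirrors the four factors of the formula, reading right to left, with one scalar temporary per intermediate value. It assumes the twiddle for size 4 is -i, i.e. w = exp(-2*pi*i/4).

```python
import cmath

def fft4(x):
    """Straight-line 4-point DFT following (F2 tensor I2) T (I2 tensor F2) L."""
    # (L 4 2) followed by (tensor (I 2) (F 2)): butterflies on (x0, x2) and (x1, x3)
    t0 = x[0] + x[2]
    t1 = x[0] - x[2]
    t2 = x[1] + x[3]
    t3 = x[1] - x[3]
    # (T 4 2): only the last element has a nontrivial twiddle factor, -i
    t4 = -1j * t3
    # (tensor (F 2) (I 2)): butterflies at stride 2
    return [t0 + t2, t1 + t4, t0 - t2, t1 - t4]

def dft_direct(x):
    """Reference O(n^2) DFT used only to check the unrolled version."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

sample = [1, 2, 3, 4]
difference = max(abs(a - b) for a, b in zip(fft4(sample), dft_direct(sample)))
print("unrolled code agrees with direct DFT:", difference < 1e-12)
```

Every temporary is assigned exactly once, which matches the single-assignment style the compiler summary slide mentions.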
16
SPL Compiler Summary
SPL Program: SPL formula, template definitions, symbol definitions
Parsing: abstract syntax tree, symbol table, template table
Intermediate Code Generation: I-Code
Intermediate Code Restructuring: I-Code
Optimization: I-Code

Built-in optimizations
single static assignment code
no reuse of temporary vars
only scalar temporary vars
constants precomputed
limited CSE

Target Code Generation: C, FORTRAN function
Extensible through templates
17
DSP Transform
Algorithm (Formula)
Search
Implementation
18
Why Search?
DCT, type IV, size 16
31000 formulas

maaaany different formulas
large spread in runtimes, even for modest size
not due to arithmetic cost
best formula is platform-dependent

19
Search Methods available in SPIRAL

Exhaustive Search
Dynamic Programming (DP)
Random Search
Hill Climbing
STEER (similar to a genetic algorithm)

Method         Possible sizes   Formulas timed   Results
Exhaustive     Very small       All              Best
DP             All              10s-100s         (very) good
Random         All              User decided     fair/good
Hill Climbing  All              100s-1000s       Good
STEER          All              100s-1000s       (very) good

Search over
algorithm space and
implementation options (degree of unrolling)
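
A minimal sketch of the dynamic programming idea, under stated assumptions: ruletrees are binary Cooley-Tukey splits of size 2^k with F2 leaves, and `toy_runtime` is a made-up cost model standing in for actually timing generated code. DP builds the best tree for each size only from the best trees already found for smaller sizes, so it evaluates k - 1 candidates per size rather than the whole space.

```python
def exponent(tree):
    """log2 of the transform size: a leaf is the int 1 (an F2), a node is (left, right)."""
    return tree if isinstance(tree, int) else exponent(tree[0]) + exponent(tree[1])

def toy_runtime(tree):
    """Hypothetical cost model; a real search would time the generated code instead."""
    if isinstance(tree, int):
        return 2.0  # one F2 butterfly
    left, right = tree
    kl, kr = exponent(left), exponent(right)
    # Each child is applied repeatedly; the last term is an invented penalty
    # that grows with the size of the left (strided) factor.
    return ((2 ** kr) * toy_runtime(left) + (2 ** kl) * toy_runtime(right)
            + (2 ** (kl + kr)) * (1 + 0.1 * kl))

def dp_search(k_max, runtime=toy_runtime):
    """Return a dict mapping k to the best ruletree found for size 2^k."""
    best = {1: 1}
    for k in range(2, k_max + 1):
        # Only combine the best known subtrees: k - 1 candidates per size.
        candidates = [(best[i], best[k - i]) for i in range(1, k)]
        best[k] = min(candidates, key=runtime)
    return best

best = dp_search(5)
for k, tree in best.items():
    print(k, tree, toy_runtime(tree))
```

DP assumes the best tree for a size stays best in every context, which is not guaranteed on real machines; that is one reason the other search methods are offered as alternatives.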

20
STEER
Population n
Mutation: expand differently
Cross-Breeding: swap expansions
Survival of Fittest
Population n+1
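
The three steps on this slide can be sketched as a generic evolutionary loop over ruletrees. This is not STEER itself, whose details are not given in the deck; the tree encoding, the probabilities, and the `toy_runtime` cost model (a stand-in for timing real code) are all assumptions made for illustration.

```python
import random

def exponent(tree):
    """log2 of the transform size: a leaf is the int 1 (an F2), a node is (left, right)."""
    return tree if isinstance(tree, int) else exponent(tree[0]) + exponent(tree[1])

def toy_runtime(tree):
    """Hypothetical cost model standing in for a measured runtime."""
    if isinstance(tree, int):
        return 2.0
    left, right = tree
    kl, kr = exponent(left), exponent(right)
    return ((2 ** kr) * toy_runtime(left) + (2 ** kl) * toy_runtime(right)
            + (2 ** (kl + kr)) * (1 + 0.1 * kl))

def random_tree(k, rng):
    if k == 1:
        return 1
    i = rng.randint(1, k - 1)
    return (random_tree(i, rng), random_tree(k - i, rng))

def subtrees(tree):
    yield tree
    if not isinstance(tree, int):
        yield from subtrees(tree[0])
        yield from subtrees(tree[1])

def mutate(tree, rng):
    """Mutation: re-expand a randomly chosen subtree differently (same size)."""
    if isinstance(tree, int) or rng.random() < 0.3:
        return random_tree(exponent(tree), rng)
    left, right = tree
    if rng.random() < 0.5:
        return (mutate(left, rng), right)
    return (left, mutate(right, rng))

def crossbreed(a, b, rng):
    """Cross-breeding: replace a subtree of a by an equally sized subtree of b."""
    if isinstance(a, int) or rng.random() < 0.3:
        donors = [t for t in subtrees(b) if exponent(t) == exponent(a)]
        return rng.choice(donors) if donors else a
    left, right = a
    if rng.random() < 0.5:
        return (crossbreed(left, b, rng), right)
    return (left, crossbreed(right, b, rng))

def evolutionary_search(k, generations=20, population_size=12, seed=0):
    rng = random.Random(seed)
    population = [random_tree(k, rng) for _ in range(population_size)]
    for _ in range(generations):
        mutants = [mutate(t, rng) for t in population]
        crosses = [crossbreed(rng.choice(population), rng.choice(population), rng)
                   for _ in range(population_size)]
        # Survival of the fittest: keep the fastest of parents and offspring.
        population = sorted(population + mutants + crosses,
                            key=toy_runtime)[:population_size]
    return population[0]

best_tree = evolutionary_search(6)
print(best_tree, toy_runtime(best_tree))
```

Because parents compete with their offspring, the best runtime in the population never gets worse from one generation to the next.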
21
Experimental Results (C code)
search methods (applicable to all transforms)
high performance code (compared with FFTW)
different transforms
generated high quality code
22
SPIRAL System

Available for download (v3.1)
www.ece.cmu.edu/spiral
Easy installation (Unix: configure/make;
Windows: InstallShield)
Unix/Linux and Windows 98/ME/NT/2000/XP
Current transforms: DFT, DHT, WHT, RHT, DCT/DST
types I-IV,
MDCT, Filters, Wavelets, Toeplitz, Circulants
Extensible
New version (4.0) in preparation

23
Recent Work
24
Learning to Generate Fast Algorithms

Learns from given dataset (formulas + runtimes)
how to design a fast algorithm (breakdown
strategy)
Learns from a transform of one size, generates
the best algorithm for many sizes
Tested for DFT and WHT

25
SIMD Short Vector Extensions
[Figure: 4-way SIMD operation, vector length 4]

Extension to instruction set architecture
Available on most current architectures
(SSE on Pentium, AltiVec on Motorola G4)
Originally for multimedia (like MMX for
integers)
Requires fine grain parallelism
Large potential speed-up

Problems

SIMD instructions are architecture specific
No common API (usually assembly hand coding)
Performance very sensitive to memory access
Automatic vectorization very limited

very difficult to use
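
One reason a formula-level view helps here: a construct of the form A tensor I_4 touches data only in contiguous groups of four, so every scalar operation in A becomes one 4-way vector operation. The sketch below illustrates this for (F2 tensor I_4) using plain Python lists in place of SIMD registers; `vadd` and `vsub` are hypothetical stand-ins for vector instructions, not a real SIMD API.

```python
V = 4  # vector length (4-way), as on the slide

def vadd(a, b):
    """Stand-in for one 4-way vector add."""
    return [x + y for x, y in zip(a, b)]

def vsub(a, b):
    """Stand-in for one 4-way vector subtract."""
    return [x - y for x, y in zip(a, b)]

def f2_tensor_i4(x):
    """Compute (F2 tensor I_4) x using two whole-vector operations."""
    low, high = x[:V], x[V:]          # two aligned vectors of length 4
    return vadd(low, high) + vsub(low, high)

print(f2_tensor_i4([0, 1, 2, 3, 4, 5, 6, 7]))
```

No element-wise shuffling is needed for this construct; the hard cases are the permutations and constructs of the form I tensor A, which is where memory access patterns start to dominate.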
26
Vector code generation from SPL formulas
27
Generated Vector Code: DFTs, Pentium 4
[Plot: gflops versus n for DFT of size 2^n; Pentium 4, 2.53 GHz, using Intel C compiler 6.0]

speedups (to C code) up to factor of 3.3
beats hand-tuned vendor library

28
Generated Vector Code, Other Transforms
[Plots: normalized runtime versus transform size, one panel for 2-dim DCT and one for WHT]
speedups up to factor of 2.5
29
Flexible Loop Interleaving (Runtime Gain WHT)
Athlon XP: up to 55%
Pentium 4: up to 45%
UltraSparc III: up to 60%
Alpha 21264: up to 70%
30
Parallel Code Generation: Example WHT
[Plot: PowerPC RS64 III; curves for 1 thread, 8 threads, and 10 threads; x-axis: WHT size log(N)]
Parallelized constructs: I_n ⊗ A, A ⊗ I_n
31
Code Scheduling
Runtime histograms
DFT, size 16: 6500 formulas
DCT4, size 16: 16000 formulas
[Histograms compare unscheduled and scheduled code]
32
Filters and Wavelets
New constructs: row/column overlapped tensor
product
Examples for rules
33
Conclusions

Automatic code generation for the entire domain
of (linear) DSP algorithms
Portable high performance across platforms and
across time
Integration of math (high) level and
implementation (low) level
Intelligence through search and learning in the
space of alternatives

34
Future Plans

Transforms: Radon, Gabor, Hankel, structured
matrices, ...
Target platforms: parallel platforms, DSP
processors, SW/HW architectures, FPGAs, ASICs
Instructions: Vector, FMAs, prefetching, OpenMP,
MPI
Beyond transforms: entire DSP applications
Other domains amenable to SPIRAL approach?

35
Questions?
http://www.ece.cmu.edu/spiral
36
Extra Slides
37
Generating Parallel Programs

Interpret constructs such as I_n ⊗ A as parallel
operations and transform formulas to obtain
maximal parallelism.
Explore alternative data access patterns
mathematically (e.g. different permutations in
matrix factorizations)
Prototype implementation using WHT
Build on existing sequential package
SMP implementation using OpenMP (IPDPS02)
90% efficiency obtained on 12 processor PowerPC
RS64 III
Distributed memory implementation using MPI
(POHLL02)
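
The parallel reading of I_n ⊗ A can be sketched as follows: the input splits into n contiguous blocks and A is applied to each block with no data shared between them. The example uses a Python thread pool purely to show the independence of the blocks; it is not the OpenMP or MPI implementation referred to above, and in CPython it will not show real speedup for this kind of work. The function names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def apply_f2(block):
    """Apply the 2-point butterfly F2 to one block of two values."""
    a, b = block
    return [a + b, a - b]

def identity_tensor_apply(apply_block, block_size, x, workers=4):
    """Compute (I_n tensor A) x by applying A to n contiguous blocks independently."""
    blocks = [x[i:i + block_size] for i in range(0, len(x), block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(apply_block, blocks))  # map preserves block order
    return [value for block in results for value in block]

print(identity_tensor_apply(apply_f2, 2, [1, 2, 3, 4, 5, 6]))
```

A construct of the form A ⊗ I_n is parallel in the same sense, except that the independent pieces are interleaved at stride n rather than contiguous, which is why the choice of permutations in the formula affects data access.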

38
Comparison of Parallel DDL Schemes
[Plot: speedup (0 to 10) versus WHT size log(N) (1 to 26); PowerPC RS64 III, 10 threads. Curves: best sequential with DDL, parallel without DDL, coarse-grained DDL, fine-grained DDL with ID shift.]
39
Overall Parallel Speedup
[Plot: speedup (0 to 10) versus WHT size log(N) (1 to 26); PowerPC RS64 III. Curves: 1 thread, 8 threads, 10 threads.]
40
Performance of Digit Permutations on CRAY T3E
41
Architecture Framework
Parameters: Dimension, Pd
M: Memory; AG: Address Generator; CU: Computation Unit
I/O Interface
Parameters: Pi, 1 ≤ i ≤ n; no. of processors
AG
Parameters: Dimension, and Ti, 1 ≤ i ≤ n; no. of
processors 2m
CU
42
Pease Algorithm Dataflow
43
Optimal Dataflow
44
Pease vs. Optimal
45
Performance Speedup
Speedup = S/C
C: no. of clock cycles using 4-processor FFT engine
S: minimum no. of clock cycles using single processor
(42nn, where 2^n is the number of points)
46
SPIRAL Approach
given
DSP Transform (DFT, DCT, Wavelets etc.)
Possible Algorithms
Possible Implementations
SPIRAL Search Space
Intelligent Search
Performance Evaluation
given
Computing Platform
(Pentium III, Pentium 4, Athlon, SUN, PowerPC,
Alpha, ...)
47
Classical Code Generation System
given
DSP Transform (DFT, DCT, Wavelets etc.)
Math/Algorithm Expert
Expert Programmer
Performance Evaluation
given
Computing Platform
(Pentium III, Pentium 4, Athlon, SUN, PowerPC,
Alpha, ...)
48
Algorithms = Ruletrees = Formulas
49
Mathematical Framework Summary

fast algorithms represented as ruletrees (easy
generation/manipulation)
and as formulas (can be translated into code)
formulas built from few constructs and
primitives
many different algorithms/formulas generated
from few rules
(combinatorial explosion)
these algorithms are (essentially) equal in
arithmetic cost,
but differ in data flow

50
Formula Generation
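[Diagram: Formula Generator. Labels on the diagram: data base (extensible!) holding data types, rules, transforms, formulas, and ruletrees; recursive application of rules; control by the search engine, which receives the runtime; export and translation to the formula translation stage (SPL compiler).]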

written in GAP/AREP (computer algebra system)
all computation/manipulation is symbolic
exact arithmetic
easily extensible rule and transform data base
verification of rules and formulas

51
Number of Formulas/Algorithms
k   DFT, size 2^k    DCT-IV, size 2^k
1   1                1
2   6                10
3   40               126
4   296              31242
5   27744            1924443362
6   162570361280     7343815121631354242
7   1.01 × 10^27     1.07 × 10^38
8   2.31 × 10^61     2.30 × 10^76
9   2.86 × 10^133    1.06 × 10^153

differ in data flow not in arithmetic cost
exponential search space
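
The growth can be reproduced in miniature. The sketch below counts only binary Cooley-Tukey ruletrees for size 2^k with F2 leaves, which is a much smaller rule set than the one behind the table, so the counts are far lower than the slide's figures. It is meant only to show how recursive application of a single rule already grows exponentially; the function name is illustrative.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_binary_ruletrees(k):
    """Number of ways to split size 2^k recursively into two factors down to size 2."""
    if k == 1:
        return 1
    # Choose a split 2^k = 2^i * 2^(k - i), then expand both factors independently.
    return sum(count_binary_ruletrees(i) * count_binary_ruletrees(k - i)
               for i in range(1, k))

for k in range(1, 10):
    print(k, count_binary_ruletrees(k))
```

These are the Catalan numbers; with more rules per transform and more choices per node, the counts grow far faster.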

52
Extensibility of SPIRAL
New transforms are readily included on the high
level
(easy, due to SPIRAL's framework)
New constructs and primitives (potentially
required by radically different transforms) are
readily included in SPL
(moderate effort, due to template mechanism)
New instruction sets (e.g., SSE) are
included by extending the SPL compiler
(doable, one-time effort)
53
Generated Vector Code: DFTs, Pentium III
[Plot: gflops versus n for DFT of size 2^n; Pentium III, 1 GHz, using Intel C compiler 6.0]
speedups (to C code) up to factor of 2.5
