Title: Efficient Radar Processing Via Array and Index Algebras
1 Efficient Radar Processing Via Array and Index
Algebras
- Lenore R. Mullin, Daniel J. Rosenkrantz, and
Harry B. Hunt III, Xingmin Luo - University at Albany, SUNY
- NSF CCR 0105536
2Outline
- Overview
- Motivation
- Radar Software Processing
- to exceed 1 x 1012 ops/second
- The Mapping Problem
- Efficient Use of Memory Hierarchy, Portable,
Scalable, - Radar uses Linear and Multi-linear Operators
- Array Based Operations
- Array Operations Require Array Algebra and Index
Calculus - Array Algebra MoA and Index Calculus Psi
Calculus - Reshape to use Processor/Memory Hierarchy
Efficiently Lift Dimension - High-Level Monolithic Operations Remove
Temporaries - Time Domain Convolution
- Benefits of Using MoA and Psi Calculus
3Levels of Processor/Memory Hierarchy
- Can be Modeled by Increasing Dimensionality of
- Data Array.
- Additional dimension for each level of the
hierarchy. - Envision data as reshaped to reflect increased
dimensionality. - Calculus automatically transforms algorithm to
reflect reshaped data array. - Data, layout, data movement, and scalarization
automatically generated based on reshaped data
array.
4Levels of Processor/Memory Hierarchy continu
ed
- Math and indexing operations in same expression
- Framework for design space search
- Rigorous and provably correct
- Extensible to complex architectures
Mathematics of Arrays
y conv
(x)
Map
Approach
Example raising array dimensionality
lt 0 1 2 gt
x lt 0 1 2 35 gt
lt 3 4 5 gt
P0
Main Memory
lt 6 7 8 gt
lt 9 10 11 gt
L2 Cache
lt 12 13 14 gt
L1 Cache
lt 15 16 17 gt
P1
Memory Hierarchy
Map
lt 18 19 20 gt
lt 21 22 23 gt
lt 24 25 26 gt
lt 27 28 29 gt
P2
Parallelism
lt 30 31 32 gt
lt 33 34 35 gt
5Application DomainSignal Processing
- 3-d Radar Data Processing
- Composition of Monolithic Array Operations
Convolution
Matrix Multiply
Hardware Info - Memory - Processor
Change algorithmto better match
hardware/memory/communication. Lift
dimension algebraically
Algorithm is Input
Architectural Information is Input
Model processors(dimdim1) Model
time-variance(dimdim1) Model Level 1
cache(dimdim1) Model All Three dimdim3
6Current Abstraction Approaches
Some Modern Programming Languages with Monolithic
Arrays
C w/classesfunctions, templates
Fortran 95
ZPL
MATLAB
PETE
AST Preprocessor
Loop transformationsTheories Grammar Changes
CompilerAST OptimizationsGrammar Changes
StandardCompilerOptimizations
Interpreted
Compiled
PartialAlgebras
Requireshighly skilledprogrammers
Blas, Linpack, LAPACK, SCALAPACK ATLAS
PVL, Blitz,MTL
Libraries
Libraries
High Performance
Scalable/Portable
Fine Tune
Classical CompilerTechnology Optimization
Even when operations compose, they dont
compose, X(YZ) without temporary arrays
7 Outline
- Overview
- Array Algebra MoA and Index Calculus Psi
Calculus - Reshape to use Processor/Memory Hierarchy
Efficiently Lift Dimension - High-Level Monolithic Operations Remove
Temporaries - Time Domain Convolution
- Benefits of Using MoA and Psi Calculus
8PSI Calculus
- Basic Properties
- Index calculus Centers around psi function.
- Shape polymorphic functions and operators
- Operations are defined using shapes and psi.
- Fundamental type is the array modeled as
(shape_vector, components). - scalars are 0-dimensional arrays, that is
(empty_vector, scalar value). - Denotational Normal Form(DNF) reduced form in
Cartesian coordinates (independent of data
layout row major, column major, regular sparse,
) - Operational Normal Form(ONF) reduced form for
1-d memory layout(s).
9Psi Reduction
This becomes by psi
Reduction
Acat(rev(B), rev(C)) ? AiBB.size-1-i
if 0 i lt B.size
AiCC.sizeB.size-1-i if B.size
i lt B.sizeC.size)
ONF has minimum number of reads/writes
PSI Calculus rules applied mechanically to
produce ONF which is easily translated to
optimal loop implementation
10Some Psi Calculus Operations
11Convolution PSI Calculus Description
Definition of yconv(h,x)
yn where x has N elements, h has M elements,
0nltNM-1, and x is x padded by M-1 zeros on
either end
Psi Calculus
Algorithm step
Algorithm and PSI Calculus Description
Initial step
x lt 1 2 3 4 gt h lt 5 6 7 gt
x lt 1 2 3 4 gt h lt 5 6 7 gt
Form x
xcat(reshape(ltk-1gt, lt0gt), cat(x,
reshape(ltk-1gt,lt0gt)))
x
lt 0 0 1 . . . 4 0 0 gt
rotate x (NM-1) times
x rot
x rotbinaryOmega(rotate,0,iota(NM-1), 1 x)
take the size of h part of xrot
x final
x finalbinaryOmega(take,0,reshapeltNM-1gt,ltMgt,1,
x rot
Prod
multiply
ProdbinaryOmega (,1, h,1,x final)
sum
YunaryOmega (sum, 1, Prod)
lt 7 20 38 . . . gt
Y
PSI Calculus operators compose to form higher
level operations
12Experimental Platform and Method
- Hardware
- DY4 CHAMP-AV Board
- Contains 4 MPC7400s and 1 MPC 8420
- MPC7400 (G4)
- 450 MHz
- 32 KB L1 data cache
- 2 MB L2 cache
- 64 MB memory/processor
- Software
- VxWorks 5.2
- Real-time OS
- GCC 2.95.4 (non-official release)
- GCC 2.95.3 with patches for VxWorks
- Optimization flags
- -O3 -funroll-loops -fstrict-aliasing
- Method
- Run many iterations, report average, minimum,
maximum time - From 10,000,000 iterations for small data sizes,
to 1000 for large data sizes - All approaches run on same data
- Only average times shown here
- Only one G4 processor used
- Use of the VxWorks OS resulted in very low
variability in timing - High degree of confidence in results
13Experiment Conv(x,h)
- Cost of temporaries in regular C approach more
pronounced due to large number of operations - Cost of expression tree manipulation also more
pronounced
14Convolution and Dimension Lifting
- Model Processor and Level 1 cache.
- Start with 1-d inputs(input dimension).
- Envision 2nd dimension ranging over output
values. - Envision Processors
- Reshaped into a 3rd dimension.
- The 2nd dimension is partitioned.
- Envision Cache
- Reshaped into a 4th dimension.
- The 1st dimension is partitioned.
- psi Reduce to Normal Form
15- Envision 2nd dimension ranging over output
values. - Let tzNM-1
- Mth4
- Ntx
4
x
S
h3
h2
h1
h0
0
0
0
x0
tz
tz
16- - Envision Processors
- Reshaped into a 3rd dimension.
- The 2nd dimension is partitioned.
Let p 2
-4
-4 -
-4-
x
x
S
S
tz
tz
2
2
2
17- Reshaped into a 4th dimension
- The 1st dimension is partitioned.
x
x
S
S
2
2
S
S
2
2
x
x
2
2
tz/2
tz
2
Tz/2
Tz/2
2
18ONF for the Convolution Decomposition with
Processors Cache
Generic form- 4 dimensional after psi
Reduction
- For i0 0 to p-1 do
- For i11 0 to tz/p 1 do
- sum ? 0
- For icacherow 0 to M/cache -1 do
- For i3 0 to cache 1 do
- sum ? sum h (M-((icacherow cache)
i3))-1 x(((tz/p i0)i1) icacherow
cache) i3)
Let tzNM-1 Mth Ntx
P r o c e s s o r l o o p
C a c h e l o o p
T I m e l o o p
sum is calculated for each element of y.
Time Domain
19Outline
- Overview
- Array Algebra MoA and Index Calculus Psi
Calculus - Time Domain Convolution
- Other algorithms in Radar
- Modified Gram-Schmidt QR Decompositions
- MOA to ONF
- Experiments
- Composition of Matrix Multiplication in
Beamforming - MoA to DNF
- Experiments
- FFT
- Benefits of Using Moa and Psi Calculus
20Algorithms in Radar
Mechanize Using Expression Templates
Time DomainConvolution (x,y)
ONF for 1proc
Lift dimension - Processor - L1 cache reformulate
Use toreason about RAW
Manual description derivation for 1
processor DNF
DNF
ONF
Thoughtson an Abstract Machine
Modified Gram SchmidtQR (A)
Benchmark at NCSAw/LAPACK
CompilerOptimizationsDNF to ONF
ImplementDNF/ONFFortran 90
A x (BH x C) Beamforming
MoA y Calculus
21ONF for the QRDecomposition with Processors
Cache
Initialization
ProcessorLoop
ComputeNorm
MainLo o p
ProcessorLoop
Normalize
ProcessorCache Loop
DoTProduct
Processor CacheLoop
Ortothogonalize
Modified Gram Schmidt
22DNF for the Composition of A x (BH x C)
Generic form- 4 dimensional
- Z0
- For i0 to n-1 do
- For j0 to n-1 do
- For k0 to n-1 do
- zk?zkAkjxXjixBi
Given A, B, X, Z n by n arrays
Beamforming
23Fftpsirad2 Performance Comparisons
24Mechanizing MoA and Psi Reduction
Index Theory IntroducedAbrams 1972
MoA y calculus theory Mullin 88 Prototype
compiler output C, F90, HPF Mullin and
Thibault94 HPF compiler AST manipulations
Mullin, et al 96 SAC functional C Mullin and
Bodo96 C classes Helal, Sameh and
Mullin01 C expression templates Mullin,
Ruttledge, Bond02 PVL with the portable
expression template engine(PETE) Parallel and
distributed processing Abstract machine Automate
cost and determine optimizations minimize search
space
Fortran
C
Theory applied to embedded systems
C
Lifting Compiler Optimizations to Application
Programmer Interface
25On-going research
- we are implementing the psi calculus using
expression templates. - we are building on work done at MIT and we are
working with MTL library developers (lumsdaine)
at Indiana University and STL library developer,
musser, at rpi.
26Benefits of Using Moa and Psi Calculus
- Processor/Memory Hierarchy can be modeled by
reshaping data using an extra dimension for each
level. - Composition of monolithic operations can be
reexpressed as composition of operations on
smaller data granularities - Matches memory hierarchy levels
- Avoids materialization of intermediate arrays.
- Algorithm can be automatically(algebraically)
transformed to reflect array reshapings above. - Facilitates programming expressed at a high level
- Facilitates intentional program design and
analysis - Facilitates portability
- This approach is applicable to many other
problems in radar.
27Email and Question?
- Lenore R. Mullin, lenore_at_cs.albany.edu
-
- Daniel J. Rosenkrantz, djr_at_cs.albany.edu
- Harry B. Hunt III, hunt_at_cs.albany.edu
- Xingmin Luo, xluo_at_cs.albany.edu
- The
End
28Typical C Operator Overloading
Example ABC vector add
2 temporary vectors created
Main
1. Pass B and C references to operator
Additional Memory Use
B, C
- Static memory
- Dynamic memory (also affects execution time)
Operator
2. Create temporary result vector 3. Calculate
results, store in temporary 4. Return copy of
temporary
temp
BC
temp
Additional Execution Time
temp copy
5. Pass results reference to operator
- Cache misses/page faults
- Time to create anew vector
- Time to create a copy of a vector
- Time to destructboth temporaries
temp copy
Operator
temp copy
A
6. Perform assignment
29C Expression Templates and PETE
Expression Type
Parse Tree
Expression
BinaryNodeltOpAdd, ReferenceltVectorgt,
ReferenceltVector gt gt
ExpressionTemplates
ABC
C
B
Main
Parse trees, not vectors, created
Parse trees, not vectors, created
1. Pass B and Creferences to operator
Reduced Memory Use
B, C
Operator
- Parse tree only contains references
2. Create expressionparse tree
B
C
3. Return expressionparse tree
copy
Reduced Execution Time
4. Pass expression treereference to operator
- Better cache use
- Loop fusion style optimization
- Compile-time expression tree manipulation
copy
Operator
5. Calculate result andperform assignment
BC
A
- PETE, the Portable Expression Template Engine, is
available from theAdvanced Computing Laboratory
at Los Alamos National Laboratory - PETE provides
- Expression template capability
- Facilities to help navigate and evaluating parse
trees
PETE http//www.acl.lanl.gov/pete
30Implementing Psi Calculus with Expression
Templates
Example Atake(4,drop(3,rev(B))) Blt1 2 3 4 5
6 7 8 9 10gt Alt7 6 5 4gt
Recall Psi Reduction for 1-d arrays always
yields one or more expressions of the
form xiystridei offset l i lt u
1. Form expression tree
3. Apply Psi Reduction rules
2. Add size information
take
size4
drop
4
size7
Reduction
Size info
rev
3
size10
size10
B
4. Rewrite as sub-expressions with iterators at
the leaves, and loop bounds information at the
root
- Iterators used for efficiency, rather than
recalculating indices for each i - One for loop to evaluate each
sub-expression
iterator offset6stride-1
size4