Efficient Radar Processing Via Array and Index Algebras

1
Efficient Radar Processing Via Array and Index
Algebras

Lenore R. Mullin, Daniel J. Rosenkrantz,
Harry B. Hunt III, and Xingmin Luo
University at Albany, SUNY
NSF CCR 0105536

2
Outline

Overview
Motivation
Radar Software Processing to exceed 1 x 10^12 ops/second
The Mapping Problem
Efficient Use of Memory Hierarchy, Portable, Scalable, ...
Radar uses Linear and Multi-linear Operators
Array Based Operations
Array Operations Require Array Algebra and Index Calculus
Array Algebra: MoA, and Index Calculus: Psi Calculus
Reshape to use Processor/Memory Hierarchy Efficiently: Lift Dimension
High-Level Monolithic Operations: Remove Temporaries
Time Domain Convolution
Benefits of Using MoA and Psi Calculus

3
Levels of Processor/Memory Hierarchy

Can be Modeled by Increasing Dimensionality of
Data Array.
Additional dimension for each level of the
hierarchy.
Envision data as reshaped to reflect increased
dimensionality.
Calculus automatically transforms algorithm to
reflect reshaped data array.
Data, layout, data movement, and scalarization
automatically generated based on reshaped data
array.

4
Levels of Processor/Memory Hierarchy (continued)

Math and indexing operations in same expression
Framework for design space search
Rigorous and provably correct
Extensible to complex architectures

Approach: Mathematics of Arrays
y = conv(x)
Example: raising array dimensionality
x = < 0 1 2 ... 35 >
[Figure: x reshaped into 12 rows of 3, from < 0 1 2 > through < 33 34 35 >. A parallelism map assigns the rows to processors P0, P1, and P2; a memory hierarchy map assigns rows to main memory, L2 cache, and L1 cache.]
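
The reshaping in this example can be sketched in a few lines of Python. This is only an illustration under assumptions not stated on the slide: row-major ordering, a hypothetical helper named reshape, and an even split of the 12 rows across the 3 processors.

```python
def reshape(shape, flat):
    """Nest a flat, row-major list into nested lists matching `shape`."""
    if len(shape) == 1:
        return list(flat[:shape[0]])
    step = len(flat) // shape[0]          # elements per leading-axis slice
    return [reshape(shape[1:], flat[i * step:(i + 1) * step])
            for i in range(shape[0])]

x = list(range(36))               # x = < 0 1 2 ... 35 >
rows = reshape([12, 3], x)        # 12 rows of 3: < 0 1 2 > ... < 33 34 35 >
lifted = reshape([3, 4, 3], x)    # one extra leading dimension (assumed: 3 processors)

# lifted[p] holds the rows that would belong to processor p under the even split.
print(lifted[1][0])               # first row of the second block
```

The data never moves: only the shape changes, so the element at lifted[p][r][c] is x[12*p + 3*r + c].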
5
Application Domain: Signal Processing

3-d Radar Data Processing
Composition of Monolithic Array Operations

Convolution
Matrix Multiply
Algorithm is Input
Architectural Information is Input: Hardware Info - Memory - Processor
Change algorithm to better match hardware/memory/communication. Lift dimension algebraically.
Model processors (dim = dim + 1)
Model time-variance (dim = dim + 1)
Model Level 1 cache (dim = dim + 1)
Model All Three: dim = dim + 3
6
Current Abstraction Approaches
Some Modern Programming Languages with Monolithic Arrays
C++ w/ classes, functions, templates
Fortran 95
ZPL
MATLAB
PETE
AST Preprocessor
Loop transformations, Theories, Grammar Changes
Compiler AST Optimizations, Grammar Changes
Standard Compiler Optimizations
Interpreted
Compiled
Partial Algebras
Requires highly skilled programmers
Libraries: BLAS, LINPACK, LAPACK, ScaLAPACK, ATLAS
Libraries: PVL, Blitz++, MTL
High Performance
Scalable/Portable
Fine Tune
Classical Compiler Technology Optimization
Even when operations compose, they don't compose X(YZ) without temporary arrays
7

Outline

Overview
Array Algebra: MoA, and Index Calculus: Psi Calculus
Reshape to use Processor/Memory Hierarchy Efficiently: Lift Dimension
High-Level Monolithic Operations: Remove Temporaries
Time Domain Convolution
Benefits of Using MoA and Psi Calculus

8
PSI Calculus

Basic Properties
Index calculus: centers around the psi function.
Shape polymorphic functions and operators:
Operations are defined using shapes and psi.
Fundamental type is the array, modeled as (shape_vector, components).
Scalars are 0-dimensional arrays, that is, (empty_vector, scalar value).
Denotational Normal Form (DNF): reduced form in Cartesian coordinates (independent of data layout: row major, column major, regular sparse, ...)
Operational Normal Form (ONF): reduced form for 1-d memory layout(s).
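
A minimal Python sketch of the (shape_vector, components) model and a psi-style indexing function follows. The row-major layout, the tuple representation, and the function name psi as spelled here are assumptions made for illustration; the slides do not show an implementation.

```python
from math import prod

def psi(index, array):
    """Select the sub-array at a (possibly partial) index.

    `array` is a pair (shape, components) with components stored
    row-major (an assumption of this sketch).
    """
    shape, components = array
    assert len(index) <= len(shape)
    assert all(0 <= i < s for i, s in zip(index, shape))
    sub_shape = shape[len(index):]
    size = prod(sub_shape)            # prod(()) == 1, so a scalar has one component
    offset = 0
    for i, s in zip(index, shape):    # row-major offset of the selected sub-array
        offset = offset * s + i
    start = offset * size
    return (sub_shape, components[start:start + size])

A = ((2, 3), [0, 1, 2, 3, 4, 5])      # a 2 x 3 array
row = psi((1,), A)                    # partial index selects a whole row
scalar = psi((1, 2), A)               # full index yields a 0-dimensional array
```

A full index returns an array whose shape is the empty vector, which matches the statement that scalars are 0-dimensional arrays.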

9
Psi Reduction
A = cat(rev(B), rev(C))
This becomes, by psi reduction:
A[i] = B[B.size-1-i]           if 0 <= i < B.size
A[i] = C[C.size+B.size-1-i]    if B.size <= i < B.size+C.size
ONF has minimum number of reads/writes
PSI Calculus rules applied mechanically to
produce ONF which is easily translated to
optimal loop implementation
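
As an illustration of what the reduced form buys, the Python sketch below compares a naive evaluation of cat(rev(B), rev(C)), which builds two reversed temporaries, with loops written directly from the two index expressions. The function names are hypothetical, and the code is hand-written from the formulas rather than produced by any psi calculus tool.

```python
def cat_rev_naive(B, C):
    """Evaluate cat(rev(B), rev(C)) operator by operator."""
    rev_b = B[::-1]          # temporary array 1
    rev_c = C[::-1]          # temporary array 2
    return rev_b + rev_c     # concatenation copies both again

def cat_rev_reduced(B, C):
    """Loops written from the reduced index expressions: one read and one write per element."""
    nb, nc = len(B), len(C)
    A = [None] * (nb + nc)
    for i in range(nb):                 # 0 <= i < B.size
        A[i] = B[nb - 1 - i]
    for i in range(nb, nb + nc):        # B.size <= i < B.size + C.size
        A[i] = C[nc + nb - 1 - i]
    return A

print(cat_rev_reduced([1, 2, 3], [4, 5]))
```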
10
Some Psi Calculus Operations
11
Convolution: PSI Calculus Description
Definition of y = conv(h, x):
y[n] is defined for 0 <= n < N+M-1, where x has N elements, h has M elements, and x' is x padded by M-1 zeros on either end

Algorithm and PSI Calculus Description (algorithm step, then its Psi Calculus form)
Initial step: x = < 1 2 3 4 >, h = < 5 6 7 >
Form x': x' = cat(reshape(<k-1>, <0>), cat(x, reshape(<k-1>, <0>))), giving x' = < 0 0 1 . . . 4 0 0 >
Rotate x' (N+M-1) times: x'_rot = binaryOmega(rotate, 0, iota(N+M-1), 1, x')
Take the size-of-h part of x'_rot: x'_final = binaryOmega(take, 0, reshape(<N+M-1>, <M>), 1, x'_rot)
Multiply: Prod = binaryOmega(*, 1, h, 1, x'_final)
Sum: Y = unaryOmega(sum, 1, Prod), giving Y = < 7 20 38 . . . >
PSI Calculus operators compose to form higher level operations
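
The steps above can be followed literally in plain Python. This sketch materializes every intermediate array, which is exactly what psi reduction is meant to avoid; it is here only to make the steps concrete. It assumes k = M in the padding step and multiplies each window by h as given, which reproduces the leading values < 7 20 38 > shown on the slide. The function name is hypothetical.

```python
def conv_steps(x, h):
    """Follow the slide's steps one at a time, keeping every intermediate."""
    n, m = len(x), len(h)
    tz = n + m - 1                                   # number of output values
    pad = [0] * (m - 1)
    x_pad = pad + list(x) + pad                      # form x'
    x_rot = [x_pad[r:] + x_pad[:r] for r in range(tz)]   # rotate x' by 0 .. tz-1
    x_final = [row[:m] for row in x_rot]             # take M elements of each rotation
    prod = [[hk * xk for hk, xk in zip(h, row)]      # multiply each window by h
            for row in x_final]
    return [sum(row) for row in prod]                # sum each row

y = conv_steps([1, 2, 3, 4], [5, 6, 7])
print(y)
```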
12
Experimental Platform and Method

Hardware
DY4 CHAMP-AV Board
Contains 4 MPC7400s and 1 MPC 8420
MPC7400 (G4)
450 MHz
32 KB L1 data cache
2 MB L2 cache
64 MB memory/processor

Software
VxWorks 5.2
Real-time OS
GCC 2.95.4 (non-official release)
GCC 2.95.3 with patches for VxWorks
Optimization flags
-O3 -funroll-loops -fstrict-aliasing

Method
Run many iterations, report average, minimum,
maximum time
From 10,000,000 iterations for small data sizes,
to 1000 for large data sizes
All approaches run on same data
Only average times shown here
Only one G4 processor used

Use of the VxWorks OS resulted in very low
variability in timing
High degree of confidence in results

13
Experiment: Conv(x,h)

Cost of temporaries in the regular C++ approach is more pronounced due to the large number of operations
Cost of expression tree manipulation is also more pronounced

14
Convolution and Dimension Lifting

Model Processor and Level 1 cache.
Start with 1-d inputs(input dimension).
Envision 2nd dimension ranging over output
values.
Envision Processors
Reshaped into a 3rd dimension.
The 2nd dimension is partitioned.
Envision Cache
Reshaped into a 4th dimension.
The 1st dimension is partitioned.
psi Reduce to Normal Form

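Envision 2nd dimension ranging over output values.
Let tz = N+M-1
M = th = 4
N = tx

[Figure: the coefficients h3 h2 h1 h0 multiplied (x) against the padded-input window 0 0 0 x0 and summed (S); window length 4, repeated over the tz output values.]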
16

Envision Processors
Reshaped into a 3rd dimension.
The 2nd dimension is partitioned.

Let p = 2
[Figure: the multiply (x) and sum (S) array of the previous slide, with the tz output values partitioned across 2 processors.]
17

Envision Cache

Reshaped into a 4th dimension
The 1st dimension is partitioned.

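[Figure: each processor's multiply (x) and sum (S) block of tz/2 output values, with the 1st dimension further partitioned into cache-sized pieces of 2.]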
18
ONF for the Convolution Decomposition with Processors & Cache
Generic form: 4-dimensional after psi reduction

Let tz = N+M-1, M = th, N = tx

For i0 = 0 to p-1 do                        (processor loop)
  For i1 = 0 to tz/p - 1 do                 (time loop)
    sum <- 0
    For icacherow = 0 to M/cache - 1 do     (cache loop)
      For i3 = 0 to cache - 1 do
        sum <- sum + h[(M - ((icacherow * cache) + i3)) - 1] * x'[((tz/p * i0) + i1) + (icacherow * cache) + i3]

sum is calculated for each element of y.
Time Domain
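
The loop nest can be transcribed into Python and checked against a direct convolution. This is a sequential sketch under several assumptions: the array indexed in the inner loop is the zero-padded input x', p divides tz, cache divides M, and each sum is stored at output position (tz/p * i0) + i1. The "processor" loop runs serially here; the function names are hypothetical.

```python
def conv_onf(x, h, p, cache):
    """Sequential transcription of the 4-deep ONF loop nest."""
    n, m = len(x), len(h)
    tz = n + m - 1
    assert tz % p == 0 and m % cache == 0     # sketch handles exact partitions only
    x_pad = [0] * (m - 1) + list(x) + [0] * (m - 1)
    y = [0] * tz
    for i0 in range(p):                        # processor loop
        for i1 in range(tz // p):              # time loop
            total = 0
            for icacherow in range(m // cache):    # cache loop (rows)
                for i3 in range(cache):            # cache loop (within a row)
                    j = icacherow * cache + i3
                    total += h[m - j - 1] * x_pad[(tz // p) * i0 + i1 + j]
            y[(tz // p) * i0 + i1] = total
    return y

def conv_direct(x, h):
    """Reference: textbook full convolution."""
    y = [0] * (len(x) + len(h) - 1)
    for i, xv in enumerate(x):
        for k, hv in enumerate(h):
            y[i + k] += xv * hv
    return y

print(conv_onf([1, 2, 3, 4, 5], [1, 2, 3, 4], p=2, cache=2))
```

Because h is indexed from its last element downward, this loop nest computes the standard convolution sum, so it agrees with conv_direct for any valid p and cache.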
19
Outline

Overview
Array Algebra: MoA, and Index Calculus: Psi Calculus
Time Domain Convolution
Other algorithms in Radar
Modified Gram-Schmidt QR Decompositions
MoA to ONF
Experiments
Composition of Matrix Multiplication in Beamforming
MoA to DNF
Experiments
FFT
Benefits of Using MoA and Psi Calculus

20
Algorithms in Radar
Mechanize Using Expression Templates
Time Domain Convolution (x,y)
ONF for 1 proc
Lift dimension - Processor - L1 cache: reformulate
Use to reason about RAW
Manual description and derivation for 1 processor
DNF
DNF
ONF
Thoughts on an Abstract Machine
Modified Gram Schmidt QR (A)
Benchmark at NCSA w/ LAPACK
Compiler Optimizations: DNF to ONF
Implement DNF/ONF Fortran 90
A x (B^H x C) Beamforming
MoA and Psi Calculus
21
ONF for the QR Decomposition with Processors & Cache
Initialization
Processor Loop
Compute Norm
Main Loop
Processor Loop
Normalize
Processor Cache Loop
Dot Product
Processor Cache Loop
Orthogonalize
Modified Gram Schmidt
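
For orientation, here is a textbook Modified Gram-Schmidt QR in Python showing the four phases named on the slide (compute norm, normalize, dot product, orthogonalize). It is a plain sequential sketch with no processor or cache partitioning and is not the ONF derived by the authors; the function name and the list-of-rows matrix representation are assumptions.

```python
from math import sqrt

def mgs_qr(A):
    """Modified Gram-Schmidt: A (m x n, full column rank) -> Q (m x n), R (n x n)."""
    m, n = len(A), len(A[0])
    Q = [[float(v) for v in row] for row in A]   # columns of Q are overwritten in place
    R = [[0.0] * n for _ in range(n)]
    for k in range(n):                           # main loop over columns
        norm = sqrt(sum(Q[i][k] ** 2 for i in range(m)))    # compute norm
        R[k][k] = norm
        for i in range(m):                                   # normalize
            Q[i][k] /= norm
        for j in range(k + 1, n):
            dot = sum(Q[i][k] * Q[i][j] for i in range(m))   # dot product
            R[k][j] = dot
            for i in range(m):                               # orthogonalize
                Q[i][j] -= dot * Q[i][k]
    return Q, R

Q, R = mgs_qr([[1, 1], [0, 1], [1, 0]])
```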
22
DNF for the Composition of A x (B^H x C)
Generic form: 4-dimensional

Z = 0
For i = 0 to n-1 do
  For j = 0 to n-1 do
    For k = 0 to n-1 do
      Z[k] <- Z[k] + A[k,j] x X[j,i] x B[i]

Given A, B, X, Z n by n arrays
Beamforming
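
The loop nest can be checked in Python against a two-step evaluation that materializes the intermediate product. The sketch reads the singly subscripted Z and B terms as whole rows, which is an interpretation of the slide rather than something it states; the function names are hypothetical.

```python
def compose_dnf(A, X, B):
    """Fused loop nest: accumulates A x (X x B) without forming X x B."""
    n = len(A)
    Z = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                scale = A[k][j] * X[j][i]
                for c in range(n):          # row update: Z[k] <- Z[k] + scale * B[i]
                    Z[k][c] += scale * B[i][c]
    return Z

def matmul(P, Q):
    """Reference matrix product used to build the version with a temporary."""
    n = len(P)
    return [[sum(P[r][t] * Q[t][c] for t in range(n)) for c in range(n)]
            for r in range(n)]

A = [[1, 2], [3, 4]]
X = [[0, 1], [1, 0]]
B = [[5, 6], [7, 8]]
fused = compose_dnf(A, X, B)
with_temporary = matmul(A, matmul(X, B))    # materializes X x B first
```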
23
Fftpsirad2 Performance Comparisons
24
Mechanizing MoA and Psi Reduction
Index Theory Introduced: Abrams 1972
MoA and Psi calculus theory: Mullin 88
Prototype compiler, output C, F90, HPF: Mullin and Thibault 94
HPF compiler, AST manipulations: Mullin, et al 96
SAC, functional C: Mullin and Bodo 96
C++ classes: Helal, Sameh and Mullin 01
C++ expression templates: Mullin, Ruttledge, Bond 02
PVL with the portable expression template engine (PETE)
Parallel and distributed processing
Abstract machine
Automate cost and determine optimizations; minimize search space
Fortran
C
Theory applied to embedded systems
C++
Lifting Compiler Optimizations to Application Programmer Interface
25
On-going research

We are implementing the psi calculus using expression templates.
We are building on work done at MIT, and we are working with the MTL library developers (Lumsdaine) at Indiana University and an STL library developer, Musser, at RPI.

26
Benefits of Using MoA and Psi Calculus

Processor/Memory Hierarchy can be modeled by
reshaping data using an extra dimension for each
level.
Composition of monolithic operations can be
reexpressed as composition of operations on
smaller data granularities
Matches memory hierarchy levels
Avoids materialization of intermediate arrays.
Algorithm can be automatically (algebraically)
transformed to reflect array reshapings above.
Facilitates programming expressed at a high level
Facilitates intentional program design and
analysis
Facilitates portability
This approach is applicable to many other
problems in radar.

27
Email and Questions?

Lenore R. Mullin, lenore_at_cs.albany.edu
Daniel J. Rosenkrantz, djr_at_cs.albany.edu
Harry B. Hunt III, hunt_at_cs.albany.edu
Xingmin Luo, xluo_at_cs.albany.edu
The End

28
Typical C++ Operator Overloading
Example: A = B + C vector add
2 temporary vectors created

Main
1. Pass B and C references to operator +
Operator +
2. Create temporary result vector
3. Calculate results, store in temporary
4. Return copy of temporary
Main
5. Pass results reference to operator =
Operator =
6. Perform assignment
[Diagram: B, C -> temp (B + C) -> temp copy -> A]

Additional Memory Use
Static memory
Dynamic memory (also affects execution time)

Additional Execution Time
Cache misses/page faults
Time to create a new vector
Time to create a copy of a vector
Time to destruct both temporaries
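
The cost of eager operator overloading can be made visible with a small Python analogue. This is not C++ and does not reproduce its return-by-copy behavior, so the sketch uses a three-operand sum to expose an intermediate vector; the class Vec and its allocations counter are hypothetical names introduced for illustration.

```python
class Vec:
    allocations = 0                       # counts every vector constructed

    def __init__(self, data):
        Vec.allocations += 1
        self.data = list(data)

    def __add__(self, other):
        # Each '+' allocates and fills a brand-new result vector.
        return Vec(a + b for a, b in zip(self.data, other.data))

B, C, D = Vec([1, 2, 3]), Vec([10, 20, 30]), Vec([100, 200, 300])
before = Vec.allocations
A = B + C + D                             # (B + C) is built, used once, then discarded
created = Vec.allocations - before        # vectors allocated by this one expression
print(created)
```

Every additional operator in the expression adds another full-length intermediate vector, which is the overhead that the expression-template approach is designed to remove.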
29
C++ Expression Templates and PETE
Expression: A = B + C
Expression Type: BinaryNode<OpAdd, Reference<Vector>, Reference<Vector> >
Parse Tree: + with leaves B and C
Parse trees, not vectors, created

Main
1. Pass B and C references to operator +
Operator +
2. Create expression parse tree
3. Return expression parse tree (copy)
Main
4. Pass expression tree reference to operator =
Operator =
5. Calculate result and perform assignment

Reduced Memory Use
Parse tree only contains references

Reduced Execution Time
Better cache use
Loop fusion style optimization
Compile-time expression tree manipulation

PETE, the Portable Expression Template Engine, is available from the Advanced Computing Laboratory at Los Alamos National Laboratory
PETE provides
Expression template capability
Facilities to help navigate and evaluate parse trees

PETE: http://www.acl.lanl.gov/pete
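
The idea of building a parse tree instead of a vector can be imitated at run time in Python. This is only an analogue: real expression templates, including PETE, build the tree in the C++ type system at compile time, and none of the names below (Expr, Add, Vec, at, assign) come from PETE.

```python
class Expr:
    def __add__(self, other):
        return Add(self, other)           # build a tree node; no arithmetic yet

class Add(Expr):
    def __init__(self, left, right):
        self.left, self.right = left, right   # references only, no copied data

    def at(self, i):
        return self.left.at(i) + self.right.at(i)

class Vec(Expr):
    def __init__(self, data):
        self.data = list(data)

    def at(self, i):
        return self.data[i]

    def assign(self, expr):
        # One fused loop: each element is computed and stored directly.
        for i in range(len(self.data)):
            self.data[i] = expr.at(i)
        return self

B, C, D = Vec([1, 2, 3]), Vec([10, 20, 30]), Vec([100, 200, 300])
A = Vec([0, 0, 0])
tree = B + C + D                          # Add(Add(B, C), D): a tree, not a vector
A.assign(tree)                            # the only loop over the data
```

The assignment walks the tree once per element, so no intermediate vector is ever created regardless of how many operands the expression has.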
30
Implementing Psi Calculus with Expression Templates
Example: A = take(4, drop(3, rev(B)))
B = <1 2 3 4 5 6 7 8 9 10>
A = <7 6 5 4>
Recall: Psi Reduction for 1-d arrays always yields one or more expressions of the form
x[i] = y[stride*i + offset], l <= i < u

1. Form expression tree
2. Add size information
3. Apply Psi Reduction rules
4. Rewrite as sub-expressions with iterators at the leaves, and loop bounds information at the root

Expression tree with size info:
take 4 (size = 4)
drop 3 (size = 7)
rev (size = 10)
B (size = 10)
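
A small Python sketch shows how take, drop, and rev can each be reduced to an update of (stride, offset, size), so the whole expression collapses to a single indexed loop over B. The View class and helper names are hypothetical, and the sketch handles only non-negative take and drop counts on 1-d arrays.

```python
from dataclasses import dataclass

@dataclass
class View:
    base: list          # the leaf array, never copied
    stride: int
    offset: int
    size: int

def leaf(b):
    return View(b, 1, 0, len(b))

def rev(v):
    # element i of rev(v) is element size-1-i of v
    return View(v.base, -v.stride, v.offset + v.stride * (v.size - 1), v.size)

def drop(k, v):
    # element i of drop(k, v) is element i+k of v
    return View(v.base, v.stride, v.offset + v.stride * k, v.size - k)

def take(k, v):
    # same indexing as v, with the upper loop bound reduced to k
    return View(v.base, v.stride, v.offset, k)

def materialize(v):
    # x[i] = y[stride*i + offset], 0 <= i < size
    return [v.base[v.stride * i + v.offset] for i in range(v.size)]

B = list(range(1, 11))
reduced = take(4, drop(3, rev(leaf(B))))
A = materialize(reduced)
print(reduced.stride, reduced.offset, reduced.size, A)
```

For the slide's example the reduced form is stride -1, offset 6, with 0 <= i < 4, which reads B[6], B[5], B[4], B[3].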