Efficient Radar Processing Via Array and Index Algebras

1
Efficient Radar Processing Via Array and Index
Algebras
  • Lenore R. Mullin, Daniel J. Rosenkrantz,
    Harry B. Hunt III, and Xingmin Luo
  • University at Albany, SUNY
  • NSF CCR 0105536

2
Outline
  • Overview
  • Motivation
  • Radar Software Processing
  • to exceed 1 x 10^12 ops/second
  • The Mapping Problem
  • Efficient Use of Memory Hierarchy, Portable,
    Scalable, ...
  • Radar uses Linear and Multi-linear Operators
  • Array Based Operations
  • Array Operations Require Array Algebra and Index
    Calculus
  • Array Algebra: MoA, and Index Calculus: Psi
    Calculus
  • Reshape to use Processor/Memory Hierarchy
    Efficiently: Lift Dimension
  • High-Level Monolithic Operations: Remove
    Temporaries
  • Time Domain Convolution
  • Benefits of Using MoA and Psi Calculus

3
Levels of Processor/Memory Hierarchy
  • Can be modeled by increasing the dimensionality of
    the data array.
  • Additional dimension for each level of the
    hierarchy.
  • Envision data as reshaped to reflect increased
    dimensionality.
  • Calculus automatically transforms algorithm to
    reflect reshaped data array.
  • Data, layout, data movement, and scalarization
    automatically generated based on reshaped data
    array.
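
The reshaping idea above can be sketched in a few lines of Python. This is an illustration only: the shape <3 4 3> (3 processors, 4 rows per processor, 3 elements per row), the row-major layout, and the helper names lift and flat_offset are assumptions chosen for a 36-element vector, not something stated on the slides.

```python
def lift(flat, shape):
    """Reshape a flat list into nested lists of the given shape (row-major)."""
    if len(shape) == 1:
        assert len(flat) == shape[0]
        return list(flat)
    step = len(flat) // shape[0]
    return [lift(flat[i * step:(i + 1) * step], shape[1:])
            for i in range(shape[0])]


def flat_offset(index, shape):
    """Map a full Cartesian index back to its offset in the flat vector."""
    offset = 0
    for i, extent in zip(index, shape):
        offset = offset * extent + i
    return offset


x = list(range(36))          # x = < 0 1 2 ... 35 >
shape = (3, 4, 3)            # assumed: processor, row, column
lifted = lift(x, shape)

# Element at processor 1, row 2, column 0 is the same datum as x[18].
print(lifted[1][2][0], x[flat_offset((1, 2, 0), shape)])
```

Adding a level of the hierarchy only adds one more entry to shape; the data themselves are never copied in the flat view, only re-indexed.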

4
Levels of Processor/Memory Hierarchy (continued)
  • Math and indexing operations in same expression
  • Framework for design space search
  • Rigorous and provably correct
  • Extensible to complex architectures

[Figure: the Mathematics of Arrays approach applied to y = conv(x). Example of raising array dimensionality: the vector x = < 0 1 2 ... 35 > is reshaped into twelve rows of three elements, < 0 1 2 > through < 33 34 35 >. One map assigns the rows to processors P0, P1, and P2 (parallelism); a second map assigns them to main memory, L2 cache, and L1 cache (memory hierarchy).]
5
Application Domain: Signal Processing
  • 3-d Radar Data Processing
  • Composition of Monolithic Array Operations

[Figure: the algorithm (convolution, matrix multiply) and the architectural information (hardware info: memory, processor) are both inputs. The algorithm is changed to better match hardware/memory/communication by lifting dimension algebraically.]
  • Model processors (dim = dim + 1)
  • Model time-variance (dim = dim + 1)
  • Model Level 1 cache (dim = dim + 1)
  • Model all three: dim = dim + 3
6
Current Abstraction Approaches
Some Modern Programming Languages with Monolithic
Arrays
  • C++ w/ classes, functions, templates
  • Fortran 95
  • ZPL
  • MATLAB
PETE
AST Preprocessor
Loop transformations, Theories, Grammar Changes
Compiler AST Optimizations, Grammar Changes
Standard Compiler Optimizations
Interpreted
Compiled
Partial Algebras
Requires highly skilled programmers
Libraries: BLAS, LINPACK, LAPACK, ScaLAPACK, ATLAS
Libraries: PVL, Blitz++, MTL
High Performance
Scalable/Portable
Fine Tune
Classical Compiler Technology Optimization
Even when operations compose, they don't
compose, e.g. X(YZ), without temporary arrays
7

Outline
  • Overview
  • Array Algebra: MoA, and Index Calculus: Psi
    Calculus
  • Reshape to use Processor/Memory Hierarchy
    Efficiently: Lift Dimension
  • High-Level Monolithic Operations: Remove
    Temporaries
  • Time Domain Convolution
  • Benefits of Using MoA and Psi Calculus

8
PSI Calculus
  • Basic Properties
  • Index calculus: centers around the psi function.
  • Shape polymorphic functions and operators
  • Operations are defined using shapes and psi.
  • Fundamental type is the array, modeled as
    (shape_vector, components).
  • Scalars are 0-dimensional arrays, that is
    (empty_vector, scalar value).
  • Denotational Normal Form (DNF): reduced form in
    Cartesian coordinates (independent of data
    layout: row major, column major, regular sparse,
    ...)
  • Operational Normal Form (ONF): reduced form for
    1-d memory layout(s).
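
The (shape_vector, components) model and the role of psi can be sketched as follows. This is a minimal illustration, not the calculus itself: it assumes a row-major layout, handles only full indices (one entry per dimension), and the names Array and psi as Python identifiers are choices made for this sketch.

```python
from math import prod


class Array:
    """An array as a (shape_vector, components) pair."""

    def __init__(self, shape, components):
        self.shape = tuple(shape)
        self.components = list(components)
        # prod(()) == 1, so a scalar has exactly one component.
        assert len(self.components) == prod(self.shape)


def psi(index, a):
    """Select one component by a full Cartesian index (DNF view).

    The offset arithmetic below is the ONF view for an assumed
    row-major 1-d memory layout.
    """
    assert len(index) == len(a.shape)
    offset = 0
    for i, extent in zip(index, a.shape):
        assert 0 <= i < extent
        offset = offset * extent + i
    return a.components[offset]


m = Array((2, 3), range(6))   # a 2 x 3 array holding 0..5
s = Array((), [42])           # a scalar: (empty_vector, scalar value)
print(psi((1, 2), m), psi((), s))
```

A column-major layout would change only the offset computation inside psi; expressions written against the Cartesian index stay the same, which is the point of separating DNF from ONF.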

9
Psi Reduction
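A = cat(rev(B), rev(C))

This becomes, by psi reduction:

$$
A[i] =
\begin{cases}
B[B.\mathrm{size} - 1 - i] & \text{if } 0 \le i < B.\mathrm{size} \\
C[C.\mathrm{size} + B.\mathrm{size} - 1 - i] & \text{if } B.\mathrm{size} \le i < B.\mathrm{size} + C.\mathrm{size}
\end{cases}
$$

ONF has minimum number of reads/writes.
PSI Calculus rules are applied mechanically to
produce ONF, which is easily translated to an
optimal loop implementation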
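
As a hedged illustration of what the reduced form buys, the sketch below compares a direct evaluation of cat(rev(B), rev(C)), which materializes two reversed temporaries, with two loops that read each input element once and write each output element once. The function names are hypothetical.

```python
def cat_rev_naive(b, c):
    """Evaluate operator by operator; rev(B) and rev(C) are temporaries."""
    rev_b = b[::-1]
    rev_c = c[::-1]
    return rev_b + rev_c


def cat_rev_reduced(b, c):
    """Loops written from the two index cases of the reduced form."""
    a = [None] * (len(b) + len(c))
    for i in range(len(b)):
        a[i] = b[len(b) - 1 - i]
    for i in range(len(b), len(b) + len(c)):
        a[i] = c[len(c) + len(b) - 1 - i]
    return a


print(cat_rev_reduced([1, 2, 3], [4, 5]))  # [3, 2, 1, 5, 4]
```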
10
Some Psi Calculus Operations
11
Convolution: PSI Calculus Description
Definition of y = conv(h, x):
y[n] is defined for 0 ≤ n < N+M-1, where x has N elements, h has M elements,
and x' is x padded by M-1 zeros on either end
Algorithm and PSI Calculus Description (algorithm step: psi calculus)
  • Initial step: x = < 1 2 3 4 >, h = < 5 6 7 >
  • Form x': x' = cat(reshape(<k-1>, <0>), cat(x,
    reshape(<k-1>, <0>))), giving
    x' = < 0 0 1 . . . 4 0 0 >
  • Rotate x' (N+M-1) times:
    x'rot = binaryOmega(rotate, 0, iota(N+M-1), 1, x')
  • Take the size-of-h part of x'rot:
    x'final = binaryOmega(take, 0, reshape(<N+M-1>, <M>), 1, x'rot)
  • Multiply: Prod = binaryOmega(*, 1, h, 1, x'final)
  • Sum: Y = unaryOmega(sum, 1, Prod), giving
    Y = < 7 20 38 . . . >
PSI Calculus operators compose to form higher
level operations
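
The steps listed on this slide can be followed literally in Python, with every intermediate array materialized. This is a sketch under assumptions: Python lists stand in for MoA arrays, the Omega operators are written as list comprehensions, and h is multiplied against each window as listed (without reversing h), which reproduces the leading outputs 7, 20, 38 shown on the slide.

```python
def rotate(v, k):
    """Rotate a list left by k positions."""
    return v[k:] + v[:k]


def conv_by_steps(x, h):
    n, m = len(x), len(h)
    pad = [0] * (m - 1)
    x_padded = pad + list(x) + pad                             # form x'
    x_rot = [rotate(x_padded, i) for i in range(n + m - 1)]    # rotate
    x_final = [row[:m] for row in x_rot]                       # take M of each row
    prod = [[hk * xk for hk, xk in zip(h, row)] for row in x_final]  # multiply
    return [sum(row) for row in prod]                          # sum


print(conv_by_steps([1, 2, 3, 4], [5, 6, 7]))  # [7, 20, 38, 56, 39, 20]
```

Each named intermediate (x_rot, x_final, prod) is a full temporary array here; removing them is what the psi reduction of the composed expression is meant to achieve.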
12
Experimental Platform and Method
  • Hardware
  • DY4 CHAMP-AV Board
  • Contains 4 MPC7400s and 1 MPC 8420
  • MPC7400 (G4)
  • 450 MHz
  • 32 KB L1 data cache
  • 2 MB L2 cache
  • 64 MB memory/processor
  • Software
  • VxWorks 5.2
  • Real-time OS
  • GCC 2.95.4 (non-official release)
  • GCC 2.95.3 with patches for VxWorks
  • Optimization flags
  • -O3 -funroll-loops -fstrict-aliasing
  • Method
  • Run many iterations, report average, minimum,
    maximum time
  • From 10,000,000 iterations for small data sizes,
    to 1000 for large data sizes
  • All approaches run on same data
  • Only average times shown here
  • Only one G4 processor used
  • Use of the VxWorks OS resulted in very low
    variability in timing
  • High degree of confidence in results

13
Experiment: Conv(x,h)
  • Cost of temporaries in regular C approach more
    pronounced due to large number of operations
  • Cost of expression tree manipulation also more
    pronounced

14
Convolution and Dimension Lifting
  • Model Processor and Level 1 cache.
  • Start with 1-d inputs(input dimension).
  • Envision 2nd dimension ranging over output
    values.
  • Envision Processors
  • Reshaped into a 3rd dimension.
  • The 2nd dimension is partitioned.
  • Envision Cache
  • Reshaped into a 4th dimension.
  • The 1st dimension is partitioned.
  • psi Reduce to Normal Form

15
  • Envision 2nd dimension ranging over output
    values.
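  • Let tz = N+M-1
  • M = th = 4
  • N = tx

[Figure: a 2-d array with tz rows and 4 columns. Each row multiplies (x) h3 h2 h1 h0 by a window of the zero-padded input (0 0 0 x0 in the first row) and sums (Σ) the products.]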
16
  • Envision Processors
  • Reshaped into a 3rd dimension.
  • The 2nd dimension is partitioned.

Let p = 2

[Figure: the tz-row, 4-column array split into p = 2 blocks of tz/2 rows, each block with its own multiply (x) and sum (Σ).]
17
  • Envision Cache
  • Reshaped into a 4th dimension
  • The 1st dimension is partitioned.

[Figure: each of the two tz/2-row blocks further split into pieces of width 2 to model the cache, each piece with its own multiply (x) and sum (Σ).]
18
ONF for the Convolution Decomposition with
Processors & Cache
Generic form: 4-dimensional after psi
reduction
  • For i0 = 0 to p-1 do   (processor loop)
  • For i1 = 0 to tz/p - 1 do   (time loop)
  • sum ← 0
  • For icacherow = 0 to M/cache - 1 do   (cache loop)
  • For i3 = 0 to cache - 1 do
  • sum ← sum + h[(M - ((icacherow * cache) + i3)) - 1]
    * x'[(((tz/p * i0) + i1) + (icacherow * cache)) + i3]

Let tz = N+M-1, M = th, N = tx
sum is calculated for each element of y.
Time Domain
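
The four-deep loop nest can be written out as runnable Python to check the index arithmetic. This is a sketch, not the generated code from the talk: it assumes p divides tz and cache divides M, runs the processor loop sequentially, and stores each sum at y[(tz/p)*i0 + i1], a destination index that is an assumption since the slide only says one sum is produced per element of y. With h indexed in reverse, as written on the slide, the result is the ordinary full convolution.

```python
def conv_onf(x, h, p, cache):
    n, m = len(x), len(h)
    tz = n + m - 1
    assert tz % p == 0 and m % cache == 0   # assumed divisibility
    x_padded = [0] * (m - 1) + list(x) + [0] * (m - 1)
    y = [0] * tz
    for i0 in range(p):                      # processor loop
        for i1 in range(tz // p):            # time loop
            total = 0
            for icacherow in range(m // cache):   # cache loop
                for i3 in range(cache):
                    j = icacherow * cache + i3
                    total += h[(m - j) - 1] * x_padded[(tz // p) * i0 + i1 + j]
            y[(tz // p) * i0 + i1] = total
    return y


x = [1, 2, 3, 4, 5]
h = [1, 2, 3, 4]
print(conv_onf(x, h, p=2, cache=2))  # [1, 4, 10, 20, 30, 34, 31, 20]
```

Changing p or cache changes only the loop bounds, not the values computed, which is what makes the partitioning a searchable design parameter.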
19
Outline
  • Overview
  • Array Algebra: MoA, and Index Calculus: Psi
    Calculus
  • Time Domain Convolution
  • Other algorithms in Radar
  • Modified Gram-Schmidt QR Decompositions
  • MoA to ONF
  • Experiments
  • Composition of Matrix Multiplication in
    Beamforming
  • MoA to DNF
  • Experiments
  • FFT
  • Benefits of Using MoA and Psi Calculus

20
Algorithms in Radar
Mechanize Using Expression Templates
Time Domain Convolution (x,y)
ONF for 1 proc
Lift dimension - Processor - L1 cache; reformulate
Use to reason about RAW
Manual description and derivation for 1
processor: DNF
DNF
ONF
Thoughts on an Abstract Machine
Modified Gram-Schmidt QR (A)
Benchmark at NCSA w/ LAPACK
Compiler Optimizations: DNF to ONF
Implement DNF/ONF in Fortran 90
A x (B^H x C) Beamforming
MoA ψ Calculus
21
ONF for the QR Decomposition with Processors &
Cache
Initialization
Processor Loop
Compute Norm
Main Loop
Processor Loop
Normalize
Processor Cache Loop
Dot Product
Processor Cache Loop
Orthogonalize
Modified Gram Schmidt
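
For reference, the stages named on this slide (compute norm, normalize, dot product, orthogonalize) are the stages of the textbook Modified Gram-Schmidt QR factorization. The sketch below is that textbook single-processor form in plain Python, not the partitioned ONF from the talk; it assumes a real matrix with full column rank, stored as a list of rows.

```python
import math


def mgs_qr(a):
    """Return (q, r) with a = q r, q having orthonormal columns."""
    m, n = len(a), len(a[0])
    q = [[float(v) for v in row] for row in a]
    r = [[0.0] * n for _ in range(n)]
    for k in range(n):                                   # main loop
        norm = math.sqrt(sum(q[i][k] ** 2 for i in range(m)))   # compute norm
        r[k][k] = norm
        for i in range(m):                               # normalize
            q[i][k] /= norm
        for j in range(k + 1, n):
            dot = sum(q[i][k] * q[i][j] for i in range(m))      # dot product
            r[k][j] = dot
            for i in range(m):                           # orthogonalize
                q[i][j] -= dot * q[i][k]
    return q, r


a = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
q, r = mgs_qr(a)
```

The inner loops over i are the ones a processor or cache partition would split; the loop over k carries the dependence between columns.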
22
DNF for the Composition of A x (B^H x C)
Generic form: 4-dimensional
  • Z = 0
  • For i = 0 to n-1 do
  • For j = 0 to n-1 do
  • For k = 0 to n-1 do
  • z[k] ← z[k] + A[k,j] x X[j,i] x B[i]

Given A, B, X, Z n by n arrays
Beamforming
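
A hedged reading of this loop nest: z[k] and B[i] denote whole rows, so the innermost statement is a scaled row update and the fourth dimension is the implicit loop over row elements. Under that assumption the nest computes Z = A X B without ever forming the intermediate product. The sketch below checks that reading against a two-step product that does build the temporary; the function names are hypothetical.

```python
def matmul(p, q):
    """Plain n x n matrix product, used here only as a reference."""
    n = len(p)
    return [[sum(p[r][t] * q[t][c] for t in range(n)) for c in range(n)]
            for r in range(n)]


def compose_fused(a, x, b):
    """Z = A X B accumulated row by row, with no intermediate matrix."""
    n = len(a)
    z = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                scale = a[k][j] * x[j][i]
                for col in range(n):      # implicit 4th dimension: row of B
                    z[k][col] += scale * b[i][col]
    return z


a = [[1, 2], [3, 4]]
x = [[0, 1], [1, 0]]
b = [[5, 6], [7, 8]]
print(compose_fused(a, x, b))  # [[17, 20], [41, 48]]
```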
23
Fftpsirad2 Performance Comparisons
24
Mechanizing MoA and Psi Reduction
  • Index Theory Introduced: Abrams 1972
  • MoA ψ calculus theory: Mullin '88
  • Prototype compiler, output C, F90, HPF: Mullin and
    Thibault '94
  • HPF compiler AST manipulations: Mullin, et al. '96
  • SAC functional C: Mullin and Bodo '96
  • C++ classes: Helal, Sameh and Mullin '01
  • C++ expression templates: Mullin, Ruttledge,
    Bond '02
  • PVL with the portable expression template
    engine (PETE)
  • Parallel and distributed processing
  • Abstract machine
  • Automate cost and determine optimizations;
    minimize search space
Fortran
C
Theory applied to embedded systems
C
Lifting Compiler Optimizations to Application
Programmer Interface
25
On-going research
  • We are implementing the psi calculus using
    expression templates.
  • We are building on work done at MIT, and we are
    working with MTL library developers (Lumsdaine)
    at Indiana University and STL library developer
    Musser at RPI.

26
Benefits of Using MoA and Psi Calculus
  • Processor/Memory Hierarchy can be modeled by
    reshaping data using an extra dimension for each
    level.
  • Composition of monolithic operations can be
    reexpressed as composition of operations on
    smaller data granularities
  • Matches memory hierarchy levels
  • Avoids materialization of intermediate arrays.
  • Algorithm can be automatically (algebraically)
    transformed to reflect the array reshapings above.
  • Facilitates programming expressed at a high level
  • Facilitates intentional program design and
    analysis
  • Facilitates portability
  • This approach is applicable to many other
    problems in radar.

27
Email and Questions?
  • Lenore R. Mullin, lenore_at_cs.albany.edu
  • Daniel J. Rosenkrantz, djr_at_cs.albany.edu
  • Harry B. Hunt III, hunt_at_cs.albany.edu
  • Xingmin Luo, xluo_at_cs.albany.edu
  • The End

28
Typical C++ Operator Overloading
Example: A = B + C vector add
2 temporary vectors created
Main
  • 1. Pass B and C references to operator +
Operator +
  • 2. Create temporary result vector
  • 3. Calculate results, store in temporary
  • 4. Return copy of temporary
  • 5. Pass results reference to operator =
Operator =
  • 6. Perform assignment
Additional Memory Use
  • Static memory
  • Dynamic memory (also affects execution time)
Additional Execution Time
  • Cache misses/page faults
  • Time to create a new vector
  • Time to create a copy of a vector
  • Time to destruct both temporaries

[Figure: B and C flow into operator +, which fills temp with B+C and returns a temp copy; operator = assigns the temp copy to A.]
29
C++ Expression Templates and PETE
Expression: A = B + C
Expression Type: BinaryNode<OpAdd, Reference<Vector>,
Reference<Vector> >
Parse Tree: + with leaves B and C
Parse trees, not vectors, created
Main
  • 1. Pass B and C references to operator +
Operator +
  • 2. Create expression parse tree
  • 3. Return expression parse tree
  • 4. Pass expression tree reference to operator =
Operator =
  • 5. Calculate result and perform assignment
Reduced Memory Use
  • Parse tree only contains references
Reduced Execution Time
  • Better cache use
  • Loop fusion style optimization
  • Compile-time expression tree manipulation

[Figure: operator + returns a copy of the parse tree holding references to B and C; operator = evaluates B+C directly into A.]
  • PETE, the Portable Expression Template Engine, is
    available from the Advanced Computing Laboratory
    at Los Alamos National Laboratory
  • PETE provides
  • Expression template capability
  • Facilities to help navigate and evaluate parse
    trees

PETE: http://www.acl.lanl.gov/pete
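
The idea of building a parse tree instead of a temporary vector can be imitated in Python. This is only an analogy: C++ expression templates build and inline the tree at compile time, whereas the sketch below builds it at run time, and the class names Vec and Add and the assign method are hypothetical and unrelated to the PETE API.

```python
class Add:
    """Parse-tree node: holds references to its operands, no data."""

    def __init__(self, left, right):
        self.left, self.right = left, right

    def __getitem__(self, i):
        return self.left[i] + self.right[i]

    def __add__(self, other):
        return Add(self, other)


class Vec:
    allocations = 0   # counts how many vectors have been created

    def __init__(self, data):
        Vec.allocations += 1
        self.data = list(data)

    def __getitem__(self, i):
        return self.data[i]

    def __add__(self, other):
        return Add(self, other)   # build a node, not a result vector

    def assign(self, expr):
        # One fused loop evaluates the whole tree element by element.
        for i in range(len(self.data)):
            self.data[i] = expr[i]


a = Vec([0, 0, 0])
b, c, d = Vec([1, 2, 3]), Vec([10, 20, 30]), Vec([100, 200, 300])
before = Vec.allocations
a.assign(b + c + d)
print(a.data, Vec.allocations - before)  # [111, 222, 333] 0
```

The counter shows that no vector is allocated while evaluating b + c + d; an eager operator + would have created one per addition.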
30
Implementing Psi Calculus with Expression
Templates
Example: A = take(4, drop(3, rev(B))), B = < 1 2 3 4 5
6 7 8 9 10 >, A = < 7 6 5 4 >
Recall: Psi Reduction for 1-d arrays always
yields one or more expressions of the
form x[i] = y[stride*i + offset], l ≤ i < u
  • 1. Form expression tree: take(4, drop(3, rev(B)))
  • 2. Add size information: take has size=4, drop
    has size=7, rev has size=10, B has size=10
  • 3. Apply Psi Reduction rules
  • 4. Rewrite as sub-expressions with iterators at
    the leaves, and loop bounds information at the
    root
  • Iterators used for efficiency, rather than
    recalculating indices for each i
  • One for loop to evaluate each
    sub-expression

Result: iterator with offset=6, stride=-1, and size=4
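
The reduction of take(4, drop(3, rev(B))) to a single (offset, stride, size) triple can be sketched as below. This is an illustrative model, not the expression-template implementation: each operator is a function on triples, only non-negative arguments to take and drop are handled, and all function names are assumptions made for this sketch.

```python
def leaf(size):
    """A stored 1-d array: element i lives at index 1*i + 0."""
    return (0, 1, size)


def rev(e):
    offset, stride, size = e
    return (offset + stride * (size - 1), -stride, size)


def drop(k, e):
    offset, stride, size = e
    return (offset + stride * k, stride, size - k)


def take(k, e):
    offset, stride, _ = e
    return (offset, stride, k)


def evaluate(e, y):
    """One loop over the reduced form x[i] = y[stride*i + offset]."""
    offset, stride, size = e
    return [y[stride * i + offset] for i in range(size)]


b = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
reduced = take(4, drop(3, rev(leaf(len(b)))))
print(reduced, evaluate(reduced, b))  # (6, -1, 4) [7, 6, 5, 4]
```

No reversed or shortened copy of B is ever built; the three operators only rewrite the triple, and the single loop in evaluate touches exactly the four elements that end up in A.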