Title: A uniform algebraically-based approach to computational physics and efficient programming

1. A uniform algebraically-based approach to computational physics and efficient programming
- James E. Raynolds, College of Nanoscale Science and Engineering, University at Albany, State University of New York, Albany, NY 12309
- Lenore Mullin, Computer Science, University at Albany, State University of New York, Albany, NY 12309
2. Matrix Example
- In Fortran 90, an array expression is evaluated pairwise, operation by operation:
- First temporary computed
- Second temporary
- Last operation
3. Matrix Example (cont.)
- Intermediate temporaries consume memory and add to processing operations
- Solution: compose index operations
- Loop over i, j
- No temporaries
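The contrast between pairwise evaluation and composed index operations can be sketched as follows; the expression A = B + C + D and the array sizes are illustrative assumptions, not taken from the slides.

```python
def add_with_temporaries(B, C, D):
    # Pairwise evaluation, as a naive Fortran 90 translation would do:
    # each intermediate result is a full materialized array.
    rows, cols = len(B), len(B[0])
    T1 = [[C[i][j] + D[i][j] for j in range(cols)] for i in range(rows)]   # first temporary
    T2 = [[B[i][j] + T1[i][j] for j in range(cols)] for i in range(rows)]  # second temporary
    return T2                                                              # last operation

def add_composed(B, C, D):
    # Composing the index operations: one loop nest over i, j, no temporaries.
    rows, cols = len(B), len(B[0])
    A = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            A[i][j] = B[i][j] + C[i][j] + D[i][j]
    return A

B, C, D = [[1, 2], [3, 4]], [[5, 6], [7, 8]], [[9, 10], [11, 12]]
print(add_composed(B, C, D))   # [[15, 18], [21, 24]]
```

Both versions compute the same result; the composed form simply never allocates T1 or T2.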
4. Need for formalism
- Few problems are as simple as the matrix example
- The formalism is designed to handle extremely complicated situations systematically
- Goal: composition of algorithms
- For example, radar processing is composed of numerous algorithms, e.g. QR(FFT(X))
- Optimizations are classically done sequentially, even when parallel processors and nodes are used: FFT (or DFT?) then QR
- Optimizations can instead be carried out across algorithms, processors, and memories
5. MoA and Psi Calculus
- Basic properties:
- An index calculus: the psi function
- Shape-polymorphic functions and operators
- Operations are defined using shapes and psi
- MoA defines some useful operations and functions
- As long as shapes define functions and operations, any new function or operation may be defined and reduced
- The fundamental type is the array; scalars are 0-dimensional arrays
- Denotational Normal Form (DNF): the reduced form in Cartesian coordinates, independent of data layout (row major, column major, regular sparse, ...)
- Operational Normal Form (ONF): the reduced form for 1-d memory layout(s)
- The ONF defines how to build the code on processor/memory hierarchies; it reveals loops and control
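A minimal sketch of the DNF/ONF distinction: a Cartesian index (the DNF view) is mapped by the shape to a flat row-major offset (the ONF view). The names gamma and psi loosely follow the MoA literature; the Python below is my own illustration, not the paper's definitions.

```python
def gamma(index, shape):
    """Map a Cartesian index (DNF view) to a row-major flat offset (ONF view)."""
    offset = 0
    for i, s in zip(index, shape):
        assert 0 <= i < s           # index must lie within the shape
        offset = offset * s + i
    return offset

def psi(index, shape, flat):
    """Select the scalar at a full index from a flat row-major data vector."""
    return flat[gamma(index, shape)]

shape = [2, 3]
flat = [10, 11, 12, 13, 14, 15]     # row-major contents of a 2-by-3 array
print(psi([1, 2], shape, flat))     # 15
```

Because the offset arithmetic is explicit, a different layout (column major, blocked) is just a different gamma; the DNF expression is unchanged.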
6. Applications: Levels of the Processor/Memory Hierarchy
- Can be modeled by increasing the dimensionality of the data array
- An additional dimension for each level of the hierarchy
- Envision the data as reshaped/transposed to reflect the mapping to the increased dimensionality
- An index calculus automatically transforms the algorithm to reflect the restructured data array
- Data, layout, data movement, and scalarization are automatically generated based on the MoA descriptions and Psi Calculus definitions of array operations, functions, and their compositions
- Arrays may be of any dimension, even 0, i.e. scalars
7. Processor/Memory Hierarchy (continued)
- Math and indexing operations in the same expression
- Framework for design space search
- Rigorous and provably correct
- Extensible to complex architectures
[Figure: the Mathematics of Arrays approach maps an expression y = conv(x), with its intricate math and intricate memory accesses (indexing), onto the memory hierarchy (main memory, L2 cache, L1 cache) and onto parallelism (processors P0, P1, P2). Example of raising array dimensionality: the vector x = < 0 1 2 ... 35 > is reshaped into rows < 0 1 2 >, < 3 4 5 >, ..., < 33 34 35 >, with four consecutive rows mapped to each of P0, P1, and P2.]
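The figure's raising of dimensionality can be sketched by reshaping the vector < 0 1 ... 35 > into a higher-dimensional array, one axis per level of the mapping; the shape [3, 4, 3] (3 processors, 4 blocks per processor, 3 elements per block) is read off the figure, and the Python is my own illustration.

```python
def iota(n):
    """Index generation: the vector 0 .. n-1."""
    return list(range(n))

def reshape(shape, flat):
    """Build a nested list of the given shape from a flat vector (row-major)."""
    if len(shape) == 1:
        return flat[:shape[0]]
    size = 1
    for s in shape[1:]:
        size *= s                 # elements per slice along the leading axis
    return [reshape(shape[1:], flat[i * size:(i + 1) * size])
            for i in range(shape[0])]

x = iota(36)
hier = reshape([3, 4, 3], x)      # axis 0: processor, axis 1: block, axis 2: element
print(hier[1][0])                 # first block on processor P1: [12, 13, 14]
```

The data never needs to move at this stage: the extra dimensions only change how indices are interpreted.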
8. Manipulation of an array
- Given a 3 by 5 by 4 array
- Its shape vector is < 3 5 4 >
- An index vector is used to select a component
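A sketch of selection on the 3-by-5-by-4 array: a full index vector < i j k > selects a scalar, and (in MoA) a shorter index vector selects a subarray. The data values (0..59 in row-major order) are an illustrative assumption.

```python
shape = [3, 5, 4]
# A 3x5x4 array as nested lists, filled with its own row-major offsets.
A = [[[(i * 5 + j) * 4 + k for k in range(4)] for j in range(5)] for i in range(3)]

def psi(index, array):
    """Apply an index vector component by component (partial indexing allowed)."""
    for i in index:
        array = array[i]
    return array

print(psi([2, 4, 3], A))   # full index: the scalar 59
print(psi([1], A)[0])      # partial index: first row of the 5x4 subarray, [20, 21, 22, 23]
```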
9. More Definitions
- Reverse: given a vector A of shape < n >, the reversal is given through indexing: element i of the reversal is element n - 1 - i of A
- Examples
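Reverse defined purely through an index transform can be sketched as below; no data moves until the result is materialized (an illustrative reading of the slide, with made-up example data).

```python
def reverse_index(i, n):
    """Index transform: element i of the reversal is element n-1-i of the source."""
    return n - 1 - i

A = [3, 1, 4, 1, 5, 9]
n = len(A)
# Materialize the reversal by composing the index transform with plain indexing.
revA = [A[reverse_index(i, n)] for i in range(n)]
print(revA)   # [9, 5, 1, 4, 1, 3]
```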
10. Some Psi Calculus Operations Built Using psi

- take (Vector A, int N): forms a vector of the first N elements of A [index permutation]
- drop (Vector A, int N): forms a vector of the last (A.size - N) elements of A [index permutation]
- rotate (Vector A, int N): forms a vector of the last N elements of A concatenated to the other elements of A [index permutation]
- cat (Vector A, Vector B): forms a vector that is the concatenation of A and B [restructuring]
- unaryOmega (Operation Op, dimension D, Array A): applies unary operator Op to the D-dimensional components of A (like a for-all loop) [operators]
- binaryOmega (Operation Op, dimension Adim, Array A, dimension Bdim, Array B): applies binary operator Op to the Adim-dimensional components of A and the Bdim-dimensional components of B (like a for-all loop) [operators]
- reshape (Vector A, Vector B): reshapes B into an array having A.size dimensions, where the length in each dimension is given by the corresponding element of A [restructuring]
- iota (int N): forms a vector of size N, containing the values 0 .. N-1 [index generation]
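Most of the vector operations in the table have direct one-line renderings; the Python below is my own sketch of their stated definitions, not the Psi Calculus itself.

```python
def iota(n):
    return list(range(n))              # vector of size n containing 0 .. n-1

def take(a, n):
    return a[:n]                       # first n elements of a

def drop(a, n):
    return a[n:]                       # last (len(a) - n) elements of a

def rotate(a, n):
    n %= len(a)                        # last n elements moved to the front
    return a[-n:] + a[:-n] if n else a[:]

def cat(a, b):
    return a + b                       # concatenation of a and b

v = iota(6)                            # [0, 1, 2, 3, 4, 5]
print(take(v, 2))                      # [0, 1]
print(drop(v, 2))                      # [2, 3, 4, 5]
print(rotate(v, 2))                    # [4, 5, 0, 1, 2, 3]
print(cat(take(v, 2), drop(v, 2)))     # [0, 1, 2, 3, 4, 5]
```

Note the identity visible in the last line: cat(take(A, N), drop(A, N)) = A, which is the kind of algebraic law the calculus exploits when reducing compositions.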
11. New FFT algorithm: record speed
- Maximize in-cache operations through use of repeated transpose-reshape operations
- Similar to partitioning for parallel implementation
- Do as many operations in cache as possible
- Re-materialize the array to achieve locality
- Continue processing in cache and repeat the process
12. Example
- Assume cache size c = 4, input vector length n = 32, number of rows r = n/c = 8
- Generate the vector of indices
- Use the reshape operator to generate a matrix
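The setup on this slide can be sketched directly: build the index vector for n = 32 and reshape it into an 8-by-4 matrix (row-major), so each row fits the assumed cache size c = 4.

```python
c = 4                   # cache size (elements)
n = 32                  # input vector length
r = n // c              # number of rows: 8

indices = list(range(n))                                   # iota(32)
matrix = [indices[i * c:(i + 1) * c] for i in range(r)]    # reshape to r x c

print(matrix[0])    # [0, 1, 2, 3]
print(matrix[7])    # [28, 29, 30, 31]
```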
13. Starting Matrix
- Each row is of length equal to the cache size c
- The standard butterfly is applied to each row as...
14. Next transpose
- To continue further would induce cache misses, so transpose and reshape
- The transpose-reshape operation is composed over indices (only the result is materialized)
- The transpose is
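Composing the transpose and reshape over indices can be sketched as follows: element [p][q] of the result is located directly in the original flat vector, so the intermediate transpose is never materialized. Sizes c = 4 and r = 8 follow the running example; the reshape target shape (back to 8 by 4) is my assumption.

```python
c, r = 4, 8
flat = list(range(r * c))             # the 8x4 starting matrix A, row-major

def b(p, q):
    """Element [p][q] of B = reshape(< 8 4 >, transpose(A)), computed by
    composing the two index transforms (no intermediate array is built)."""
    k = p * c + q                     # row-major position in the reshaped result
    i, j = divmod(k, r)               # corresponding position in the c-by-r transpose
    return flat[j * c + i]            # the transpose reads A[j][i]

B = [[b(p, q) for q in range(c)] for p in range(r)]
print(B[0])   # [0, 4, 8, 12]
print(B[1])   # [16, 20, 24, 28]
```

Each row of B gathers a stride-c column of A, which is exactly the locality the algorithm needs for the next in-cache butterfly stage.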
15. Resulting Transpose-Reshape
- Materialize the transpose-reshaped array B
- Carry out the butterfly operation on each row
- The weights are re-ordered
- Access patterns are standard...
16. Transpose-Reshape again
- As before, to proceed further would induce cache misses, so
- Do the transpose-reshape again (composing indices)
- The transpose is
17. Last step (in this example)
- Materialize the composed transpose-reshaped array C
- Carry out the last step of the FFT
- This last step corresponds to cycles of length 2 involving elements 0 and 16, 1 and 17, etc.
18. Final Transpose
- The data has been permuted numerous times by the multiple reshape-transposes
- We could reverse the transformations, but there would be multiple steps and multiple writes
- Viewing the problem as an n-cube (hypercube for radix 2) allows us to use the number of reshape-transposes as an argument to a rotate (or shift) of a vector generated from the dimensions of the hypercube
- This rotated vector is used as an argument to a binary transpose, which permutes everything at once
- Expressed algebraically, Psi-reduce to the DNF and then the ONF for a generic design
- The ONF has only two loops, no matter what dimension hypercube (or n-cube for radix n) we start with
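One possible reading of this slide, as a sketch: with n = 32 = 2^5 viewed as a 5-dimensional 2-cube, each transpose-reshape cyclically shifts the axes, so the accumulated data movement can be undone by a single transpose whose axis permutation is a rotated axis vector. The rotation count and permutation direction below are illustrative assumptions, not the paper's derivation.

```python
def multi_index(k, shape):
    """Flat offset -> Cartesian index (row-major)."""
    idx = []
    for s in reversed(shape):
        idx.append(k % s)
        k //= s
    return list(reversed(idx))

def offset(idx, shape):
    """Cartesian index -> flat offset (row-major)."""
    k = 0
    for i, s in zip(idx, shape):
        k = k * s + i
    return k

def transpose_perm(flat, shape, perm):
    """General transpose: the element at index idx moves to idx permuted by perm."""
    out = [None] * len(flat)
    for k in range(len(flat)):
        idx = multi_index(k, shape)
        new_idx = [idx[p] for p in perm]
        out[offset(new_idx, shape)] = flat[k]
    return out

dims = 5                              # 2^5 = 32 elements: a 5-d hypercube
shape = [2] * dims
rotations = 2                         # number of transpose-reshapes to undo (assumed)
axes = list(range(dims))
perm = axes[rotations:] + axes[:rotations]   # the rotated axis vector
result = transpose_perm(list(range(32)), shape, perm)
print(result[:4])   # [0, 8, 16, 24]
```

A single pass over the data performs the whole permutation, in place of one pass per reshape-transpose being reversed.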
22. Summary
- All operations have been carried out in cache, at the price of re-arranging the data
- Data blocks can be of any size (powers of the radix); they need not equal the cache size
- Optimum performance: a tradeoff between the reduction of cache misses and the cost of transpose-reshape operations
- The number of transpose-reshape operations is determined by the data block size (cache size)
- Record performance: up to a factor of 4 better than libraries
23. Science Direct 25 Hottest Articles
24. Book under review at Springer
25. New paper at J. Comp. Phys.