Title: A uniform algebraically-based approach to computational physics and efficient programming

1. A uniform algebraically-based approach to computational physics and efficient programming
- James E. Raynolds, College of Nanoscale Science and Engineering, University at Albany, State University of New York, Albany, NY 12309
- Lenore Mullin, Computer Science, University at Albany, State University of New York, Albany, NY 12309
2. Matrix Example
- In Fortran 90, an array expression is evaluated pairwise, operation by operation:
- First temporary computed
- Second temporary
- Last operation
3. Matrix Example (cont.)
- Intermediate temporaries consume memory and add to processing operations
- Solution: compose index operations
- Loop over i, j
- No temporaries
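The contrast between pairwise evaluation and composed index operations can be sketched as follows; the expression A = B + C + D and the array sizes are illustrative assumptions, not taken from the slides.

```python
def add_with_temporaries(B, C, D):
    # Pairwise evaluation, as a naive Fortran 90 translation would do:
    # each intermediate result is a full materialized array.
    rows, cols = len(B), len(B[0])
    T1 = [[C[i][j] + D[i][j] for j in range(cols)] for i in range(rows)]   # first temporary
    T2 = [[B[i][j] + T1[i][j] for j in range(cols)] for i in range(rows)]  # second temporary
    return T2                                                              # last operation

def add_composed(B, C, D):
    # Composing the index operations: one loop nest over i, j, no temporaries.
    rows, cols = len(B), len(B[0])
    A = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            A[i][j] = B[i][j] + C[i][j] + D[i][j]
    return A

B, C, D = [[1, 2], [3, 4]], [[5, 6], [7, 8]], [[9, 10], [11, 12]]
print(add_composed(B, C, D))   # [[15, 18], [21, 24]]
```

Both versions compute the same result; the composed form simply never allocates T1 or T2.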
4. Need for formalism
- Few problems are as simple as the matrix example
- The formalism is designed to handle extremely complicated situations systematically
- Goal: composition of algorithms
- For example, radar processing is composed of numerous algorithms, e.g. QR(FFT(X))
- Optimizations are classically done sequentially, even when parallel processors and nodes are used: FFT (or DFT?) then QR
- Optimizations can instead be carried out across algorithms, processors, and memories
5. MoA and Psi Calculus
- Basic properties:
- An index calculus: the psi function
- Shape-polymorphic functions and operators
- Operations are defined using shapes and psi
- MoA defines some useful operations and functions
- As long as shapes define functions and operations, any new function or operation may be defined and reduced
- The fundamental type is the array; scalars are 0-dimensional arrays
- Denotational Normal Form (DNF): the reduced form in Cartesian coordinates, independent of data layout (row major, column major, regular sparse, ...)
- Operational Normal Form (ONF): the reduced form for 1-d memory layout(s)
- The ONF defines how to build the code on processor/memory hierarchies; it reveals loops and control
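A minimal sketch of the DNF/ONF distinction: a Cartesian index (the DNF view) is mapped by the shape to a flat row-major offset (the ONF view). The names gamma and psi loosely follow the MoA literature; the Python below is my own illustration, not the paper's definitions.

```python
def gamma(index, shape):
    """Map a Cartesian index (DNF view) to a row-major flat offset (ONF view)."""
    offset = 0
    for i, s in zip(index, shape):
        assert 0 <= i < s           # index must lie within the shape
        offset = offset * s + i
    return offset

def psi(index, shape, flat):
    """Select the scalar at a full index from a flat row-major data vector."""
    return flat[gamma(index, shape)]

shape = [2, 3]
flat = [10, 11, 12, 13, 14, 15]     # row-major contents of a 2-by-3 array
print(psi([1, 2], shape, flat))     # 15
```

Because the offset arithmetic is explicit, a different layout (column major, blocked) is just a different gamma; the DNF expression is unchanged.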
6. Applications: Levels of the Processor/Memory Hierarchy
- Can be modeled by increasing the dimensionality of the data array
- An additional dimension for each level of the hierarchy
- Envision the data as reshaped/transposed to reflect the mapping to the increased dimensionality
- An index calculus automatically transforms the algorithm to reflect the restructured data array
- Data, layout, data movement, and scalarization are automatically generated based on the MoA descriptions and Psi Calculus definitions of array operations, functions, and their compositions
- Arrays may be of any dimension, even 0, i.e. scalars
7. Processor/Memory Hierarchy (continued)
- Math and indexing operations in the same expression
- Framework for design space search
- Rigorous and provably correct
- Extensible to complex architectures
[Figure: the Mathematics of Arrays approach maps an expression y = conv(x), with its intricate math and intricate memory accesses (indexing), onto the memory hierarchy (main memory, L2 cache, L1 cache) and onto parallelism (processors P0, P1, P2). Example of raising array dimensionality: the vector x = < 0 1 2 ... 35 > is reshaped into rows < 0 1 2 >, < 3 4 5 >, ..., < 33 34 35 >, with four consecutive rows mapped to each of P0, P1, and P2.]
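The figure's raising of dimensionality can be sketched by reshaping the vector < 0 1 ... 35 > into a higher-dimensional array, one axis per level of the mapping; the shape [3, 4, 3] (3 processors, 4 blocks per processor, 3 elements per block) is read off the figure, and the Python is my own illustration.

```python
def iota(n):
    """Index generation: the vector 0 .. n-1."""
    return list(range(n))

def reshape(shape, flat):
    """Build a nested list of the given shape from a flat vector (row-major)."""
    if len(shape) == 1:
        return flat[:shape[0]]
    size = 1
    for s in shape[1:]:
        size *= s                 # elements per slice along the leading axis
    return [reshape(shape[1:], flat[i * size:(i + 1) * size])
            for i in range(shape[0])]

x = iota(36)
hier = reshape([3, 4, 3], x)      # axis 0: processor, axis 1: block, axis 2: element
print(hier[1][0])                 # first block on processor P1: [12, 13, 14]
```

The data never needs to move at this stage: the extra dimensions only change how indices are interpreted.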
8. Manipulation of an array
- Given a 3 by 5 by 4 array
- Its shape vector is < 3 5 4 >
- An index vector is used to select a component
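A sketch of selection on the 3-by-5-by-4 array: a full index vector < i j k > selects a scalar, and (in MoA) a shorter index vector selects a subarray. The data values (0..59 in row-major order) are an illustrative assumption.

```python
shape = [3, 5, 4]
# A 3x5x4 array as nested lists, filled with its own row-major offsets.
A = [[[(i * 5 + j) * 4 + k for k in range(4)] for j in range(5)] for i in range(3)]

def psi(index, array):
    """Apply an index vector component by component (partial indexing allowed)."""
    for i in index:
        array = array[i]
    return array

print(psi([2, 4, 3], A))   # full index: the scalar 59
print(psi([1], A)[0])      # partial index: first row of the 5x4 subarray, [20, 21, 22, 23]
```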
9. More Definitions
- Reverse: given a vector A of shape < n >, the reversal is given through indexing: element i of the reversal is element n - 1 - i of A
- Examples
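Reverse defined purely through an index transform can be sketched as below; no data moves until the result is materialized (an illustrative reading of the slide, with made-up example data).

```python
def reverse_index(i, n):
    """Index transform: element i of the reversal is element n-1-i of the source."""
    return n - 1 - i

A = [3, 1, 4, 1, 5, 9]
n = len(A)
# Materialize the reversal by composing the index transform with plain indexing.
revA = [A[reverse_index(i, n)] for i in range(n)]
print(revA)   # [9, 5, 1, 4, 1, 3]
```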
10. Some Psi Calculus Operations Built Using psi

- take (Vector A, int N): forms a vector of the first N elements of A [index permutation]
- drop (Vector A, int N): forms a vector of the last (A.size - N) elements of A [index permutation]
- rotate (Vector A, int N): forms a vector of the last N elements of A concatenated to the other elements of A [index permutation]
- cat (Vector A, Vector B): forms a vector that is the concatenation of A and B [restructuring]
- unaryOmega (Operation Op, dimension D, Array A): applies unary operator Op to the D-dimensional components of A (like a for-all loop) [operators]
- binaryOmega (Operation Op, dimension Adim, Array A, dimension Bdim, Array B): applies binary operator Op to the Adim-dimensional components of A and the Bdim-dimensional components of B (like a for-all loop) [operators]
- reshape (Vector A, Vector B): reshapes B into an array having A.size dimensions, where the length in each dimension is given by the corresponding element of A [restructuring]
- iota (int N): forms a vector of size N, containing the values 0 .. N-1 [index generation]
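Most of the vector operations in the table have direct one-line renderings; the Python below is my own sketch of their stated definitions, not the Psi Calculus itself.

```python
def iota(n):
    return list(range(n))              # vector of size n containing 0 .. n-1

def take(a, n):
    return a[:n]                       # first n elements of a

def drop(a, n):
    return a[n:]                       # last (len(a) - n) elements of a

def rotate(a, n):
    n %= len(a)                        # last n elements moved to the front
    return a[-n:] + a[:-n] if n else a[:]

def cat(a, b):
    return a + b                       # concatenation of a and b

v = iota(6)                            # [0, 1, 2, 3, 4, 5]
print(take(v, 2))                      # [0, 1]
print(drop(v, 2))                      # [2, 3, 4, 5]
print(rotate(v, 2))                    # [4, 5, 0, 1, 2, 3]
print(cat(take(v, 2), drop(v, 2)))     # [0, 1, 2, 3, 4, 5]
```

Note the identity visible in the last line: cat(take(A, N), drop(A, N)) = A, which is the kind of algebraic law the calculus exploits when reducing compositions.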
11. New FFT algorithm: record speed
- Maximize in-cache operations through use of repeated transpose-reshape operations
- Similar to partitioning for parallel implementation
- Do as many operations in cache as possible
- Re-materialize the array to achieve locality
- Continue processing in cache and repeat the process
12. Example
- Assume cache size c = 4, input vector length n = 32, number of rows r = n/c = 8
- Generate the vector of indices
- Use the reshape operator to generate a matrix
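The setup on this slide can be sketched directly: build the index vector for n = 32 and reshape it into an 8-by-4 matrix (row-major), so each row fits the assumed cache size c = 4.

```python
c = 4                   # cache size (elements)
n = 32                  # input vector length
r = n // c              # number of rows: 8

indices = list(range(n))                                   # iota(32)
matrix = [indices[i * c:(i + 1) * c] for i in range(r)]    # reshape to r x c

print(matrix[0])    # [0, 1, 2, 3]
print(matrix[7])    # [28, 29, 30, 31]
```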
13. Starting Matrix
- Each row is of length equal to the cache size c
- The standard butterfly is applied to each row as...
14. Next transpose
- To continue further would induce cache misses, so transpose and reshape
- The transpose-reshape operation is composed over indices (only the result is materialized)
- The transpose is
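Composing the transpose and reshape over indices can be sketched as follows: element [p][q] of the result is located directly in the original flat vector, so the intermediate transpose is never materialized. Sizes c = 4 and r = 8 follow the running example; the reshape target shape (back to 8 by 4) is my assumption.

```python
c, r = 4, 8
flat = list(range(r * c))             # the 8x4 starting matrix A, row-major

def b(p, q):
    """Element [p][q] of B = reshape(< 8 4 >, transpose(A)), computed by
    composing the two index transforms (no intermediate array is built)."""
    k = p * c + q                     # row-major position in the reshaped result
    i, j = divmod(k, r)               # corresponding position in the c-by-r transpose
    return flat[j * c + i]            # the transpose reads A[j][i]

B = [[b(p, q) for q in range(c)] for p in range(r)]
print(B[0])   # [0, 4, 8, 12]
print(B[1])   # [16, 20, 24, 28]
```

Each row of B gathers a stride-c column of A, which is exactly the locality the algorithm needs for the next in-cache butterfly stage.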
15. Resulting Transpose-Reshape
- Materialize the transpose-reshaped array B
- Carry out the butterfly operation on each row
- The weights are re-ordered
- Access patterns are standard...
16. Transpose-Reshape again
- As before, to proceed further would induce cache misses, so
- Do the transpose-reshape again (composing indices)
- The transpose is
17. Last step (in this example)
- Materialize the composed transpose-reshaped array C
- Carry out the last step of the FFT
- This last step corresponds to cycles of length 2 involving elements 0 and 16, 1 and 17, etc.
18. Final Transpose
- The data has been permuted numerous times by the multiple reshape-transposes
- We could reverse the transformations, but there would be multiple steps and multiple writes
- Viewing the problem as an n-cube (hypercube for radix 2) allows us to use the number of reshape-transposes as an argument to a rotate (or shift) of a vector generated from the dimensions of the hypercube
- This rotated vector is used as an argument to a binary transpose, which permutes everything at once
- Expressed algebraically, Psi-reduce to the DNF and then the ONF for a generic design
- The ONF has only two loops, no matter what dimension hypercube (or n-cube for radix n) we start with
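One possible reading of this slide, as a sketch: with n = 32 = 2^5 viewed as a 5-dimensional 2-cube, each transpose-reshape cyclically shifts the axes, so the accumulated data movement can be undone by a single transpose whose axis permutation is a rotated axis vector. The rotation count and permutation direction below are illustrative assumptions, not the paper's derivation.

```python
def multi_index(k, shape):
    """Flat offset -> Cartesian index (row-major)."""
    idx = []
    for s in reversed(shape):
        idx.append(k % s)
        k //= s
    return list(reversed(idx))

def offset(idx, shape):
    """Cartesian index -> flat offset (row-major)."""
    k = 0
    for i, s in zip(idx, shape):
        k = k * s + i
    return k

def transpose_perm(flat, shape, perm):
    """General transpose: the element at index idx moves to idx permuted by perm."""
    out = [None] * len(flat)
    for k in range(len(flat)):
        idx = multi_index(k, shape)
        new_idx = [idx[p] for p in perm]
        out[offset(new_idx, shape)] = flat[k]
    return out

dims = 5                              # 2^5 = 32 elements: a 5-d hypercube
shape = [2] * dims
rotations = 2                         # number of transpose-reshapes to undo (assumed)
axes = list(range(dims))
perm = axes[rotations:] + axes[:rotations]   # the rotated axis vector
result = transpose_perm(list(range(32)), shape, perm)
print(result[:4])   # [0, 8, 16, 24]
```

A single pass over the data performs the whole permutation, in place of one pass per reshape-transpose being reversed.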
22. Summary
- All operations have been carried out in cache, at the price of re-arranging the data
- Data blocks can be of any size (powers of the radix); they need not equal the cache size
- Optimum performance: a tradeoff between the reduction of cache misses and the cost of transpose-reshape operations
- The number of transpose-reshape operations is determined by the data block size (cache size)
- Record performance: up to a factor of 4 better than libraries
23. Science Direct 25 Hottest Articles
24. Book under review at Springer
25. New paper at J. Comp. Phys.