Title: Cell CF06 Presentation
1. The Potential of the Cell Processor for Scientific Computing
Sam Williams (samw@cs.berkeley.edu)
Computational Research Division / Future Technologies Group
July 12, 2006
2. Outline
- Cell Architecture
- Programming Cell
- Benchmarks & Performance
- Stencils on Structured Grids
- Sparse Matrix-Vector Multiplication
- Matrix-Matrix Multiplication
- FFTs
- Summary
3. Cell Architecture
4. Cell Chip Architecture
- 9 Core SMP
- One core is a conventional cache-based PPC
- The other 8 are local-memory-based SIMD processors (SPEs)
- Cores connected via 4 rings (4 x 128b @ 1.6GHz)
- 25.6GB/s memory bandwidth (128b @ 1.6GHz) to XDR
- I/O channels can be used to build multichip SMPs
5. Cell Chip Characteristics
- Core frequency up to 3.2GHz
- Limited access to 2.1GHz hardware
- Faster hardware (2.4GHz and 3.2GHz) became available after the paper
- 221mm² chip
- Much smaller than Itanium2 (1/2 to 1/3)
- About the size of a Power5
- Opteron/Pentium before shrink
- 500W blades (2 chips + DRAM + network)
6. SPE Architecture
- 128b SIMD
- Dual issue (FP/ALU + load/store/permute/etc.), but not DP
- Statically scheduled, in-order
- 4 FMA SP datapaths, 6 cycle latency
- 1 FMA DP datapath, 13 cycle latency (including 7 stall cycles)
- Software-managed BTB (branch hints)
- Local-memory-based architecture (two steps to access DRAM)
- DMA commands are queued in the MFC
- 128b @ 1.6GHz to local store (256KB)
- Aligned loads from local store to RF (128 x 128b)
- 3W @ 3.2GHz
7. Programming Cell
8. Estimation, Simulation and Exploration
- Perform analysis before writing any code
- Conceptualize the algorithm with double buffering and long DMAs on an in-order machine
- Use static timing analysis and memory traffic modeling
- Model Performance
- For regular data structures, spreadsheet modeling works
- SpMV requires more advanced modeling
- Full System Simulator
- Cell+ exploration
- How severely does a DP throughput of 1 SIMD instruction every 7 or 8 cycles impair performance?
- Cell+: 1 SIMD instruction every 2 cycles
- Allows dual issuing of DP instructions with loads/stores/permutes/branches
9. Cell Programming
- Modified SPMD (Single Program Multiple Data)
- Dual Program Multiple Data
- Hierarchical SPMD
- Kind of clunky approach compared to MPI or pthreads
- Power core is used to:
- Load/initialize data structures
- Spawn SPE threads
- Parallelize data structures
- Pass pointers to SPEs
- Synchronize SPE threads
- Communicate with other processors
- Perform other I/O operations
10. SPE Programming
- Software-controlled memory
- Use pointers from the PPC to construct global addresses
- Programmer handles transfers from global to local
- Compiler handles transfers from local to RF
- Most applicable when the address stream is long and independent
- DMAs
- Granularity is 1, 2, 4, 8, or 16·N bytes
- Issued with very low-level intrinsics (no error checking)
- GETL = stanza gather/scatter: distributed in global memory, packed in local store
- Double buffering
- SPU and MFC operate in parallel
- Operate on the current data set while transferring the next/last
- Time is the max of computation time and communication time
- Prologue and epilogue penalties
- Although more verbose, intrinsics are an effective way of delivering peak performance (see the sketch below).
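A minimal double-buffering sketch for one SPE follows, assuming the Cell SDK's spu_mfcio.h interface (mfc_get, mfc_write_tag_mask, mfc_read_tag_status_all); the chunk size, buffer layout, and compute() routine are illustrative, not the code used in the paper.

/* Double-buffered DMA loop: fetch chunk i+1 while computing on chunk i,
 * so total time approaches max(compute time, DMA time) per chunk. */
#include <stdint.h>
#include <spu_mfcio.h>

#define CHUNK 4096                        /* bytes per DMA; a multiple of 16 */

static volatile char buf[2][CHUNK] __attribute__((aligned(128)));

extern void compute(volatile char *data, int bytes);   /* illustrative kernel */

void process(uint64_t ea, int nchunks)    /* ea = global address passed in from the PPC */
{
    int cur = 0;
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);             /* prologue: first chunk */
    for (int i = 0; i < nchunks; i++) {
        int next = cur ^ 1;
        if (i + 1 < nchunks)                             /* start the next transfer early */
            mfc_get(buf[next], ea + (uint64_t)(i + 1) * CHUNK, CHUNK, next, 0, 0);
        mfc_write_tag_mask(1 << cur);                    /* wait only on the current buffer */
        mfc_read_tag_status_all();
        compute(buf[cur], CHUNK);                        /* overlaps the in-flight DMA */
        cur = next;
    }
}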
11. Benchmark Kernels
- Stencil Computations on Structured Grids
- Sparse Matrix-Vector Multiplication
- Matrix-Matrix Multiplication
- 1D FFTs
- 2D FFTs
12. Processors Used
Note: Cell performance does not include the Power core
13. Stencil Operations on Structured Grids
14. Stencil Operations
- Simple Example - The Heat Equation
- dT/dt = k∇²T
- Parabolic PDE on a 3D discretized scalar domain
- Jacobi style (read from the current grid, write to the next grid; see the sketch below)
- 7 FLOPs per point, typically double precision
- Next[x,y,z] = Alpha*Current[x,y,z]
  + Current[x-1,y,z] + Current[x+1,y,z]
  + Current[x,y-1,z] + Current[x,y+1,z]
  + Current[x,y,z-1] + Current[x,y,z+1]
- Doesn't exploit the FMA pipeline well
- Basically 6 streams presented to the memory subsystem
- Explicit ghost zones bound the grid
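As a reference point, a sketch of one such Jacobi sweep in plain C (no SIMD, DMA, or blocking; the grid size and names are illustrative):

/* One Jacobi sweep of the 7-point heat-equation stencil:
 * 1 multiply + 6 adds = 7 flops per point, reading Current and writing Next. */
#define NX 128
#define NY 128
#define NZ 128

void sweep(double Next[NZ][NY][NX], double Current[NZ][NY][NX], double Alpha)
{
    for (int z = 1; z < NZ - 1; z++)          /* ghost zones bound the grid */
        for (int y = 1; y < NY - 1; y++)
            for (int x = 1; x < NX - 1; x++)
                Next[z][y][x] = Alpha * Current[z][y][x]
                              + Current[z][y][x-1] + Current[z][y][x+1]
                              + Current[z][y-1][x] + Current[z][y+1][x]
                              + Current[z-1][y][x] + Current[z+1][y][x];
}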
15. Optimization - Planes
- Naïve approach (cacheless vector machine) is to load 5 streams and store one
- This is 7 flops per 48 bytes
- memory limits performance to 3.7 GFLOP/s
- A better approach is to make each DMA the size of a plane
- cache the 3 most recent planes (z-1, z, z+1)
- there are only two streams (one load, one store)
- memory now limits performance to 11.2 GFLOP/s
- Still must compute on each plane after it is loaded
- e.g. forall Current_local[x,y] update Next_local[x,y]
- Note: computation can severely limit performance
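Both bounds follow from the 25.6 GB/s XDR bandwidth on slide 4, assuming 8-byte double-precision values: the naïve approach moves 6 streams x 8 bytes = 48 bytes per point, so 25.6 GB/s / 48 B x 7 flops ≈ 3.7 GFLOP/s; with the three cached planes only one load and one store (16 bytes) touch DRAM per point, so 25.6 GB/s / 16 B x 7 flops = 11.2 GFLOP/s.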
16. Optimization - Double Buffering
- Add an input stream buffer and an output stream buffer (keep 6 planes in the local store)
- The two phases (transfer and compute) are now overlapped
- Thus it is possible to hide the faster of DMA transfer time and computation time
17. Optimization - Cache Blocking
- Domains can be quite large (1GB)
- A single plane, let alone 6, might not fit in the local store
- Partition the domain (and thus planes) into cache blocks so that 6 can fit in the local store
- Has the added benefit that cache blocks are independent and thus can be parallelized across multiple SPEs
- Memory efficiency can be diminished as an intra-grid ghost zone is implicitly created
18. Optimization - Register Blocking
- Instead of computing on pencils, compute on ribbons (4x2)
- Hides functional-unit and local-store latency
- Minimizes local store memory traffic
- Minimizes loop overhead
- May not be beneficial / noticeable for cache-based machines
19. Optimization - Time Skewing
- If the application allows it, perform multiple time steps within the local store
- Only appropriate for memory-bound implementations
- Improves computational intensity
- single precision, or
- improved double precision
- Simple approach
- Overlapping trapezoids in time-space plot
- Can be inefficient due to duplicated work
- If performing two steps, the local store must now hold 9 planes (thus smaller cache blocks)
- If performing n steps, the local store must hold 3(n+1) planes
20. Stencil Performance
- Notes
- The performance model, when accounting for limited dual issue, matches well with the FSS and hardware
- Double precision runs don't exploit time skewing
- In single precision, time skewing = 4
- Problem was the best of 128³ and 256³
21. Sparse Matrix-Vector Multiplication
22. Sparse Matrix-Vector Multiplication
- Most of the matrix entries are zero, thus the nonzeros are sparsely distributed
- Can be used for unstructured grid problems
- Issues
- Like DGEMM, can exploit an FMA well
- Very low computational intensity
- (1 FMA for every 12 bytes)
- Non FP instructions can dominate
- Can be very irregular
- Row lengths can be unique and very short
23. Compressed Sparse Row
- Compressed Sparse Row (CSR) is the standard format (see the kernel sketch below)
- Array of nonzero values
- Array of the corresponding column for each nonzero value
- Array of row starts containing the index (in the above arrays) of the first nonzero in each row
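For reference, a minimal CSR kernel (y = A*x) in plain C over exactly those three arrays; no blocking, SIMD, or DMA, and all names are illustrative:

/* y = A*x with A in CSR: one dot product of a sparse row against x per output.
 * Each nonzero costs one FMA per 12 bytes (8B value + 4B column index). */
void spmv_csr(int nrows,
              const int *row_start,      /* nrows+1 entries              */
              const int *col,            /* column index of each nonzero */
              const double *val,         /* value of each nonzero        */
              const double *x, double *y)
{
    for (int r = 0; r < nrows; r++) {
        double sum = 0.0;
        for (int k = row_start[r]; k < row_start[r + 1]; k++)
            sum += val[k] * x[col[k]];
        y[r] = sum;
    }
}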
24. Optimization - Double Buffer Nonzeros
- Computation and communication are approximately equally expensive
- While operating on the current set of nonzeros, load the next (1K-nonzero buffers)
- Need complex (thought) code to stop and restart a row between buffers
- Can nearly double performance
25. Optimization - SIMDization
- BCSR
- Nonzeros are grouped into r x c blocks
- O(nnz) explicit zeros are added
- Choose r x c so that it meshes well with 128b registers (see the 2x2 sketch below)
- Performance can be hurt, especially in DP, as computing on zeros is very wasteful
- Can hide latency and amortize loop overhead
- Only used in the initial performance model
- Row Padding
- Pad rows to the nearest multiple of 128b
- Might require O(N) explicit zeros
- Loop overhead still present
- Generally works better in double precision
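A minimal sketch of the BCSR idea with 2x2 blocks stored row-major (block shape, layout, and names are illustrative; the Cell kernel SIMDizes this with 128b registers):

/* y = A*x with A in 2x2 BCSR: each block row produces two outputs, and each
 * stored block is a small dense 2x2 multiply. Explicit zeros inside blocks
 * are computed on, which is the waste noted above for DP. */
void spmv_bcsr_2x2(int nblockrows,
                   const int *brow_start,    /* nblockrows+1 entries          */
                   const int *bcol,          /* block column of each block    */
                   const double *bval,       /* 4 values per block, row-major */
                   const double *x, double *y)
{
    for (int ib = 0; ib < nblockrows; ib++) {
        double y0 = 0.0, y1 = 0.0;
        for (int k = brow_start[ib]; k < brow_start[ib + 1]; k++) {
            const double *v  = &bval[4 * k];
            const double *xb = &x[2 * bcol[k]];
            y0 += v[0] * xb[0] + v[1] * xb[1];
            y1 += v[2] * xb[0] + v[3] * xb[1];
        }
        y[2 * ib]     = y0;
        y[2 * ib + 1] = y1;
    }
}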
26. Optimization - Cache Blocked Vectors
- Doubleword DMA gathers from DRAM can be expensive
- Cache block the source and destination vectors
- Finite LS, so what's the best aspect ratio?
- DMA large blocks into the local store
- Gather operations into the local store
- ISA vs. memory subsystem inefficiencies
- Exploits temporal and spatial locality within the SPE
- In effect, the sparse matrix is explicitly blocked into submatrices, and we can skip, or otherwise optimize, empty submatrices
- Indices are now relative to the cache block (half words), which reduces memory traffic by ~16%
27. Optimization - Load Balancing
- Potentially irregular problem
- Load imbalance can severely hurt performance
- Partitioning by rows is clearly insufficient
- Partitioning by nonzeros is inefficient when the average number of nonzeros per row per cache block is small
- Define a cost function of the number of row starts and the number of nonzeros (see the sketch below)
- Determine the parameters via static timing analysis or tuning
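A minimal sketch of that style of partitioning, assuming a per-row cost of nnz(row) + beta, where beta is an illustrative per-row-start weight that the static timing analysis or tuning would supply:

/* Split rows into nspe contiguous chunks of roughly equal cost so that no
 * SPE is left waiting; chunk_end[s] is one past the last row given to SPE s. */
void partition_rows(int nrows, const int *row_start,
                    int nspe, double beta, int *chunk_end)
{
    double total  = (row_start[nrows] - row_start[0]) + beta * nrows;
    double target = total / nspe, acc = 0.0;
    int spe = 0;
    for (int r = 0; r < nrows && spe < nspe - 1; r++) {
        acc += (row_start[r + 1] - row_start[r]) + beta;   /* cost of row r */
        if (acc >= target * (spe + 1))
            chunk_end[spe++] = r + 1;
    }
    while (spe < nspe)
        chunk_end[spe++] = nrows;     /* last SPE(s) take whatever remains */
}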
28. Benchmark Matrices
- 4 nonsymmetric SPARSITY matrices
- 6 symmetric SPARSITY matrices
- 7pt Heat equation matrix
29. Other Approaches
- BeBop / OSKI on the Itanium2 and Opteron
- uses BCSR
- auto-tunes for the optimal r x c blocking
- Cell implementation is similar
- Cray's routines on the X1E
- Report the best of CSRP, Segmented Scan, and Jagged Diagonal
30. Double Precision SpMV Performance
- 16 SPE version needs broadcast optimization
- Frequency helps (mildly computationally bound)
- Cell+ doesn't help much more
- (non-FP instruction BW)
31. Dense Matrix-Matrix Multiplication (performance model only)
32. Dense Matrix-Matrix Multiplication
- Blocking
- Explicit (BDL) or implicit blocking (gather stanzas)
- A hybrid method would be to convert and store to DRAM on the fly
- Choose a block size so that the kernel is computationally bound (see the sketch below)
- ≥64² in single precision
- much easier in double precision (14x computational time, 2x transfer time)
- Parallelization
- Partition A and C among the SPEs
- Future work: Cannon's algorithm
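A plain-C sketch of the blocked multiply (no BDL conversion, DMA, or SIMD; the 64x64 block size and names are illustrative of keeping the kernel compute bound):

#define BS 64   /* block edge; on Cell the three active blocks live in the local store */

/* C += A*B, processed one BS x BS block triple at a time; n is assumed to be
 * a multiple of BS, and the matrices are dense, square, row-major. */
void gemm_blocked(int n, const double *A, const double *B, double *C)
{
    for (int i0 = 0; i0 < n; i0 += BS)
        for (int j0 = 0; j0 < n; j0 += BS)
            for (int k0 = 0; k0 < n; k0 += BS)
                for (int i = i0; i < i0 + BS; i++)
                    for (int k = k0; k < k0 + BS; k++) {
                        double a = A[i * n + k];   /* reused across the j loop (FMA-friendly) */
                        for (int j = j0; j < j0 + BS; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}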
33. GEMM Performance
34. Fourier Transforms (performance model only)
35. 1D Fast Fourier Transforms
- Naïve Algorithm
- Load roots of unity
- Load data (cyclic)
- Local work, on-chip transpose, local work
- i.e. SPEs cooperate on a single FFT
- No overlap of communication or computation
36. 2D Fast Fourier Transforms
- Each SPE performs 2·(N/8) FFTs
- Double buffer rows
- overlap communication and computation
- 2 incoming, 2 outgoing
- Straightforward algorithm (N² 2D FFT), sketched below
- N simultaneous FFTs, transpose,
- N simultaneous FFTs, transpose.
- Long DMAs necessitate transposes
- transposes represent about 50% of total SP execution time
- SP simultaneous FFTs run at 75 GFLOP/s
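A minimal sketch of that row-column structure, assuming a hypothetical 1D FFT routine fft_1d() (on Cell the N simultaneous FFTs are spread across the SPEs, and the transposes are what make long DMAs possible):

#include <complex.h>

/* Hypothetical in-place 1D FFT over n contiguous complex values. */
extern void fft_1d(double complex *row, int n);

/* N x N 2D FFT: N row FFTs, transpose, N more row FFTs (the original
 * columns), transpose back. The transposes are the expensive part. */
void fft_2d(double complex *a, double complex *tmp, int n)
{
    for (int r = 0; r < n; r++)                /* N simultaneous FFTs */
        fft_1d(&a[r * n], n);
    for (int r = 0; r < n; r++)                /* transpose: columns become rows */
        for (int c = 0; c < n; c++)
            tmp[c * n + r] = a[r * n + c];
    for (int r = 0; r < n; r++)                /* N simultaneous FFTs */
        fft_1d(&tmp[r * n], n);
    for (int r = 0; r < n; r++)                /* transpose back */
        for (int c = 0; c < n; c++)
            a[c * n + r] = tmp[r * n + c];
}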
37. FFT Performance
38. Conclusions
39. Conclusions
- Far more predictable than conventional OOO machines
- Even in double precision, it obtains much better performance on a surprising variety of codes
- Cell can eliminate unneeded memory traffic and hide memory latency, and thus achieves a much higher percentage of memory bandwidth
- The instruction set can be very inefficient for poorly SIMDizable or misaligned codes
- Loop overheads can heavily dominate performance
- The programming model is clunky
40. Acknowledgments
- This work (paper in CF06, poster in EDGE06) is a collaboration with the following FTG members:
- John Shalf, Lenny Oliker, Shoaib Kamil, Parry Husbands, and Kathy Yelick
- Additional thanks to:
- Joe Gebis and David Patterson
- X1E FFT numbers provided by
- Bracy Elton, and Adrian Tate
- Cell access provided by
- Mike Perrone (IBM Watson)
- Otto Wohlmuth (IBM Germany)
- Nehal Desai (Los Alamos National Lab)
41. Questions?