Title: Scientific Applications on Multi-PIM Systems
1. Scientific Applications on Multi-PIM Systems
- WIMPS 2002
- Katherine Yelick
- U.C. Berkeley and NERSC/LBNL
Joint with Xiaoye Li, Lenny Oliker, Brian Gaeke,
Parry Husbands (LBNL) and the Berkeley IRAM group:
Dave Patterson, Joe Gebis, Dave Judd,
Christoforos Kozyrakis, Sam Williams, Steve Pope
2. Algorithm Space
- Search: two-sided dense linear algebra, FFTs, Gröbner basis (symbolic LU), sorting
- Reuse: sparse iterative solvers, asynchronous discrete event simulation, one-sided dense linear algebra, sparse direct solvers
- Regularity
3. Why build Multiprocessor PIM?
- Scaling to Petaflops
- Low power/footprint/etc.
- Performance
  - And performance predictability
- Programmability
  - Let's not forget this
  - Would like to increase the user base
- Start with the single-chip problem by looking at VIRAM
4. VIRAM Overview
- MIPS core (200 MHz)
  - Single-issue, 8 KB instruction and data caches
- Vector unit (200 MHz)
  - 32 64-bit elements per register
  - 256-bit datapaths (16b, 32b, 64b ops)
  - 4 address generation units
- Main memory system
  - 13 MB of on-chip DRAM in 8 banks
  - 12.8 GB/s peak bandwidth
  - Typical power consumption 2.0 W
- Peak vector performance
  - 1.6/3.2/6.4 GOPS (64b/32b/16b) without multiply-add
  - 1.6 GFLOPS (single precision)
- Fabrication by IBM
  - Tape-out in O(1 month)
5. Benchmarks for Scientific Problems
- Dense matrix-vector multiplication
  - Compare to hand-tuned codes on conventional machines
- Transitive closure (small and large data sets)
  - On a dense graph representation
- NSA Giga-Updates Per Second (GUPS, 16-bit and 64-bit)
  - Fetch-and-increment a stream of random addresses
- Sparse matrix-vector product
  - Order 10000, 177820 nonzeros
- Computing a histogram
  - Used for image processing of a 16-bit greyscale image (1536 x 1536)
  - 2 algorithms: 64-element sorting kernel and privatization
  - Also used in sorting
- 2D unstructured mesh adaptation
  - Initial grid 4802 triangles, final grid 24010 triangles
6. Power and Performance on BLAS-2
- 100x100 matrix-vector multiplication (column layout)
- VIRAM result compiled; others hand-coded or Atlas-optimized
- VIRAM performance improves with larger matrices
- VIRAM power includes on-chip main memory
- 8-lane version of VIRAM nearly doubles MFLOPS
7. Performance Comparison
- IRAM designed for media processing
  - Low power was a higher priority than high performance
- IRAM (at 200 MHz) is better for apps with sufficient parallelism
8. Power Efficiency
- Huge power/performance advantage in VIRAM from both
  - PIM technology
  - Data-parallel execution model (compiler-controlled)
9. Power Efficiency
- Same data on a log plot
- Includes both low-power (Mobile PIII) and high-performance processors
- The same picture holds for operations/cycle
10. Which Problems are Limited by Bandwidth?
- What is the bottleneck in each case?
  - Transitive and GUPS are limited by bandwidth (near the 6.4 GB/s peak)
  - SPMV and Mesh are limited by address generation and bank conflicts
  - For Histogram there is insufficient parallelism
11. Summary of 1-PIM Results
- Programmability advantage
  - All vectorized by the VIRAM compiler (Cray vectorizer)
  - With restructuring and hints from programmers
- Performance advantage
  - Large on applications limited only by bandwidth
  - More address generators/sub-banks would help irregular performance
- Performance/power advantage
  - Over both low-power and high-performance processors
  - Both PIM and data parallelism are key
12. Analysis of a Multi-PIM System
- Machine parameters
  - Floating-point performance
    - PIM-node dependent
    - Application dependent, not theoretical peak
  - Amount of memory per processor
    - Use 1/10th algorithm data
  - Communication overhead
    - Time the processor is busy sending a message
    - Cannot be overlapped
  - Communication latency
    - Time across the network (can be overlapped)
  - Communication bandwidth
    - Single node and bisection
- Back-of-the-envelope calculations!
13. Real Data from an Old Machine (T3E)
- UPC uses a global address space
- Non-blocking remote put/get model
- Does not cache remote data
14. Running Sparse MVM on a Pflop PIM
- 1 GHz, 8 pipes, 8 ALUs/pipe: 64 GFLOPS/node peak
- 8 address generators limit performance to 16 GFLOPS
- 500 ns latency, 1-cycle put/get overhead, 100-cycle MP overhead
- Programmability differences too: packing vs. global address space
15. Effect of Memory Size
- For small memory nodes or smaller problem sizes
  - Low overhead is more important
- For large memory nodes and large problems, packing is better
16. Conclusions
- Performance advantage for PIMs depends on the application
  - Need fine-grained parallelism to utilize on-chip bandwidth
- Data parallelism is one model, with the usual trade-offs
  - Hardware and programming simplicity
  - Limited expressibility
- Largest advantages for PIMs are power and packaging
  - Enables Peta-scale machines
- Multiprocessor PIMs should be easier to program
  - At least at the scale of current machines (Tflops)
- Can we get rid of the current programming-model hierarchy?
17. The End
18. Benchmarks
- Kernels
  - Designed to stress memory systems
  - Some taken from the Data Intensive Systems Stressmarks
- Unit and constant stride memory
  - Dense matrix-vector multiplication
  - Transitive closure
- Constant stride
  - FFT
- Indirect addressing
  - NSA Giga-Updates Per Second (GUPS)
  - Sparse matrix-vector multiplication
  - Histogram calculation (sorting)
- Frequent branching as well as irregular memory access
  - Unstructured mesh adaptation
19. Conclusions and VIRAM Future Directions
- VIRAM outperforms the Pentium III on scientific problems
  - With lower power and clock rate than the Mobile Pentium
- Vectorization techniques developed for the Cray PVPs are applicable
- PIM technology provides a low-power, low-cost memory system
  - A similar combination is used in the Sony PlayStation
- Small ISA changes can have a large impact
  - Limited in-register permutations sped up the 1K FFT by 5x
- Memory system can still be a bottleneck
  - Indexed/variable-stride access is costly, due to address generation
- Future work
  - Ongoing investigations into the impact of lanes, subbanks
  - Technical paper in preparation; expect completion 09/01
  - Run benchmarks on real VIRAM chips
  - Examine multiprocessor VIRAM configurations
20. Management Plan
- Roles of different groups and PIs
  - Senior researchers working on a particular class of benchmarks
    - Parry: sorting and histograms
    - Sherry: sparse matrices
    - Lenny: unstructured mesh adaptation
    - Brian: simulation
  - Jin and Hyun: specific benchmarks
  - Plan to hire an additional postdoc for next year (focus on Imagine)
  - Undergrad model used for targeted benchmark efforts
- Plan for using computational resources at NERSC
  - Few resources used, except for comparisons
21. Future Funding Prospects
- FY2003 and beyond
  - DARPA initiated the DIS program
    - Related projects are continuing under Polymorphic Computing
  - New BAA coming in High Productivity Systems
  - Interest from other DOE labs (LANL) in the general problem
- General model
  - Most architectural research projects need benchmarking
  - Work has higher quality if done by people who understand the apps
  - Expertise for hardware projects is different: system-level design, circuit design, etc.
  - Interest from both the IRAM and Imagine groups shows the level of interest
22. Long Term Impact
- Potential impact on Computer Science
  - Promote research on new architectures and micro-architectures
  - Understand future architectures
    - Preparation for procurements
  - Provide visibility for NERSC in core CS research areas
  - Correlate applications: DOE vs. large-market problems
  - Influence future machines through research collaborations
23. Benchmark Performance on IRAM Simulator
- IRAM (200 MHz, 2 W) versus Mobile Pentium III (500 MHz, 4 W)
24. Project Goals for FY02 and Beyond
- Use established data-intensive scientific benchmarks with other emerging architectures
- IMAGINE (Stanford Univ.)
  - Designed for graphics and image/signal processing
  - Peak 20 GFLOPS (32-bit FP)
  - Key features: vector processing, VLIW, a streaming memory system (not a PIM-based design)
  - Preliminary discussions with Bill Dally
- DIVA (DARPA-sponsored, USC/ISI)
  - Based on a PIM smart-memory design, but for multiprocessors
    - Move computation to data
  - Designed for irregular data structures and dynamic databases
  - Discussions with Mary Hall about benchmark comparisons
25. Media Benchmarks
- FFT uses in-register permutations and a generalized reduction
- All others written in C with the Cray vectorizing compiler
26. Integer Benchmarks
- Strided access important, e.g., RGB
  - Narrow types limited by address generation
- Outer-loop vectorization and unrolling used
  - Helps avoid short vectors
  - Spilling can be a problem
27. Status of Benchmarking Software Release
- Optimized
  - Vector histogram code
  - GUPS inner loop
- Unoptimized
  - GUPS docs
  - Pointer jumping w/ update
  - Vector histogram code generator
  - GUPS C codes
  - Conjugate Gradient (Matrix)
  - Neighborhood
  - Pointer jumping
  - Transitive
  - Field
  - Standard random number generator
  - Test cases (small and large working sets)
  - Build and test scripts (Makefiles, timing, analysis, ...)
- Future work
  - Write more documentation, add better test cases as we find them
  - Incorporate media benchmarks, AMR code, library of frequently-used compiler flags/pragmas
28. Status of Benchmarking Work
- Two performance models
  - Simulator (vsim-p) and trace analyzer (vsimII)
- Recent work on vsim-p
  - Refining the performance model for double-precision FP performance
- Recent work on vsimII
  - Making the backend modular
    - Goal: model different architectures with the same ISA
  - Fixing bugs in the memory model of the VIRAM-1 backend
  - Better comments in code for better maintainability
  - Completing a new backend for a new decoupled cluster architecture
29. Comparison with Mobile Pentium
- GUPS: VIRAM gets 6x more GUPS

  Data element width     16-bit   32-bit   64-bit
  Mobile Pentium GUPS    .045     .046     .036
  VIRAM GUPS             .295     .295     .244

- Transitive, Pointer, Update: VIRAM is 30-50x faster than the P-III
  - Execution time for VIRAM rises much more slowly with data size than for the P-III
30. Sparse CG
- Solve Ax = b; sparse matrix-vector multiplication dominates
- Traditional CRS format requires
  - Indexed load/store for the X/Y vectors
  - Variable vector length, usually short
- Other formats for better vectorization
  - CRS with narrow band (e.g., RCM ordering)
    - Smaller strides for the X vector
  - Segmented sum (modified from old code developed for the Cray PVP)
    - Long vector length, all of the same size
    - Unit stride
  - ELL format: make all rows the same length by padding with zeros
    - Long vector length, all of the same size
    - Extra flops
31. SMVM Performance
- DIS matrix: N = 10000, M = 177820 (~17 nonzeros per row)
- IRAM results (MFLOPS) by number of sub-banks; Mobile PIII (500 MHz) with CRS: 35 MFLOPS

  SubBanks                1          2          4          8
  CRS                     91         106        109        110
  CRS banded              110        110        110        110
  SEG-SUM                 135        154        163        165
  ELL (4.6x more flops)   511 (111)  570 (124)  612 (133)  632 (137)
32. 2D Unstructured Mesh Adaptation
- Powerful tool for efficiently solving computational problems with evolving physical features (shocks, vortices, shear layers, crack propagation)
- Complicated logic and data structures
  - Difficult to achieve high efficiency
  - Irregular data access patterns (pointer chasing)
  - Many conditionals / integer intensive
- Adaptation is a tool for making the numerical solution cost effective
- Three types of element subdivision
33. Vectorization Strategy and Performance Results
- Color elements based on vertices (not edges)
  - Guarantees no conflicts during vector operations
- Vectorize across each subdivision type (1:2, 1:3, 1:4), one color at a time
- Difficult: many conditionals, low flops, irregular data access, dependencies
- Initial grid 4802 triangles, final grid 24010 triangles
- Preliminary results demonstrate VIRAM 4.5x faster than a Mobile Pentium III 500
- Higher code complexity (requires graph coloring and reordering)

  Time (ms)   Pentium III 500   1 Lane   2 Lanes   4 Lanes
              61                18       14        13