Title: Suitability of Alternative Architectures for Scientific Computing in 5-10 Years
1. Suitability of Alternative Architectures for Scientific Computing in 5-10 Years
- LDRD 2002 Strategic-Computational Review
- July 31, 2001
- PIs: Xiaoye Li, Bob Lucas, Lenny Oliker, Katherine Yelick
- Others: Brian Gaeke, Parry Husbands, Hyun Jin Kim, Hyun Jin Moon
2. Outline
- Project goals
- FY01 progress report
  - Benchmark kernel definitions
  - Performance on IRAM; comparisons with conventional machines
- Management plan
- Future funding opportunities
3. Motivation and Goal
- NERSC-3 (now) and NERSC-4 (in 2-3 years) consist of large clusters of commodity SMPs. What about 5-10 years from now?
- Future architecture technologies
  - PIM (e.g., IRAM, DIVA, Blue Gene)
  - SIMD/Vector/Stream (e.g., IRAM, Imagine, Playstation)
  - Low power, narrow data types (e.g., MMX, IRAM, Imagine)
  - Feasibility of building large-scale systems
  - What will the commodity building blocks (nodes and networks) be?
- Driven by NERSC and DOE scientific application codes
  - Where do the needs diverge from big-market applications?
- Influence future architectures
4. Computational Kernels and Applications
- Kernels
  - Designed to stress memory systems
  - Some taken from the Data Intensive Systems (DIS) Stressmarks
  - Unit and constant-stride memory access
    - Transitive-closure
    - FFT
    - Dense and sparse linear algebra (BLAS 1 and 2)
  - Indirect addressing
    - Pointer-jumping, neighborhood (image), sparse CG
    - NSA Giga-Updates Per Second (GUPS); see the sketch at the end of this slide
  - Frequent branching as well as irregular memory access
    - Unstructured mesh adaptation
- Examples of NERSC/DOE applications that may benefit
  - Omega3P, accelerator design (SLAC; AMR and sparse linear algebra)
  - Paratec, materials science package (LBNL; FFT and dense linear algebra)
  - Camille, 3D atmospheric circulation model (preconditioned CG)
  - HyperClaw, gas dynamics simulation in an AMR framework (LBNL)
  - NWChem, quantum chemistry (PNNL; global arrays and linear algebra)
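To make the irregular memory-access pattern concrete, here is a minimal GUPS-style update loop in C. It is a sketch only: the table size, update count, and LCG index stream are illustrative placeholders, not the NSA benchmark's exact parameters.

```c
#include <stdint.h>

/* Minimal sketch of a GUPS-style kernel: random read-modify-write
 * updates to a large table, stressing indexed (gather/scatter) memory
 * access rather than floating-point work.  Table size, update count,
 * and the simple LCG index stream are placeholders. */
#define TABLE_BITS 20
#define TABLE_SIZE (1UL << TABLE_BITS)
#define N_UPDATES  (4 * TABLE_SIZE)

void gups(uint64_t *table)
{
    uint64_t r = 1;
    for (uint64_t i = 0; i < N_UPDATES; i++) {
        r = r * 6364136223846793005ULL + 1442695040888963407ULL; /* LCG step */
        uint64_t idx = r >> (64 - TABLE_BITS);  /* high bits as random index */
        table[idx] ^= r;                        /* read-modify-write update */
    }
}
```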
5. VIRAM Overview (UCB)
- MIPS core (200 MHz)
  - Single-issue, 8 KB instruction and data caches
- Vector unit (200 MHz)
  - 32 64-bit elements per register
  - 256-bit datapaths (16-, 32-, 64-bit ops)
  - 4 address generation units
- Main memory system
  - 12 MB of on-chip DRAM in 8 banks
  - 12.8 GB/s peak bandwidth
- Typical power consumption: 2.0 W
- Peak vector performance (a quick arithmetic check follows this slide)
  - 1.6/3.2/6.4 Gops (64/32/16-bit) without multiply-add
  - 1.6 GFLOPS (single precision)
- Same process technology as Blue Gene
  - But a single chip, aimed at multimedia
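A back-of-the-envelope check of the peak rates listed above, assuming the 256-bit datapaths feed two parallel vector arithmetic units at 200 MHz (the two-unit count is an assumption, not stated on this slide):

```c
#include <stdio.h>

/* Sanity check of the quoted peak integer rates: with two 256-bit
 * vector arithmetic units at 200 MHz (assumed unit count), narrower
 * element widths pack more operations per cycle. */
int main(void)
{
    const double mhz = 200e6, datapath_bits = 256, units = 2;
    const int widths[] = {64, 32, 16};
    for (int i = 0; i < 3; i++)
        printf("%2d-bit peak: %.1f Gops\n", widths[i],
               units * (datapath_bits / widths[i]) * mhz / 1e9);
    /* prints 1.6, 3.2, 6.4 Gops, matching the slide (w/o multiply-add) */
    return 0;
}
```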
6. Status of IRAM Benchmarking Infrastructure
- Improved the VIRAM simulator
  - Refining the performance model for double-precision FP performance
  - Making the backend modular to allow for other microarchitectures
- Packaging the benchmark codes
  - Build and test scripts plus input data (small and large data sets)
  - Added documentation
- Preparing for final chip benchmarking
  - Tape-out scheduled by UCB for 9/01
7. Media Benchmarks
- FFT uses in-register permutations and a generalized reduction
- All others written in C with the Cray vectorizing compiler
8. Power Advantage of PIM + Vectors
- 100x100 matrix-vector multiplication (column layout)
- Comparison results from the LAPACK manual (vendor-optimized assembly)
- VIRAM performance improves with larger matrices!
- VIRAM power includes on-chip main memory!
9. Benchmarks for Scientific Problems
- Transitive-closure (small and large data sets)
- Pointer-jumping (small and large working sets)
- Computing a histogram
  - Used for image processing of a 16-bit greyscale image (1536 x 1536)
  - 2 algorithms: a 64-element sorting kernel and privatization (privatization is sketched after this list)
  - Needed for sorting
- Neighborhood image processing (small and large images)
- NSA Giga-Updates Per Second (GUPS, 16-bit and 64-bit)
- Sparse matrix-vector product
  - Order 10000, 177820 nonzeros
- 2D unstructured mesh adaptation
  - Initial grid 4802 triangles, final grid 24010 triangles
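A minimal sketch of the privatization approach mentioned above: each virtual lane accumulates into its own copy of the bins so that concurrent updates never collide, at the cost of extra memory for the private copies (one reason a sort-based kernel is also of interest). The lane count and data layout here are illustrative.

```c
#include <stdint.h>
#include <string.h>

/* Histogram via privatization: NLANES private bin arrays receive
 * conflict-free updates, then are reduced into the final histogram.
 * NLANES is an illustrative lane count; NBINS matches a 16-bit
 * greyscale image. */
#define NLANES 64
#define NBINS  65536

void histogram_private(const uint16_t *image, long n, uint32_t *bins)
{
    static uint32_t priv[NLANES][NBINS];     /* private copies (16 MB here) */
    memset(priv, 0, sizeof priv);

    for (long i = 0; i < n; i++)             /* conflict-free updates */
        priv[i % NLANES][image[i]]++;

    memset(bins, 0, NBINS * sizeof *bins);
    for (int l = 0; l < NLANES; l++)         /* reduce the private copies */
        for (long b = 0; b < NBINS; b++)
            bins[b] += priv[l][b];
}
```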
10. Benchmark Performance on IRAM Simulator
- IRAM (200 MHz, 2 W) versus Mobile Pentium III (500 MHz, 4 W)
11. Conclusions and VIRAM Future Directions
- VIRAM outperforms the Pentium III on scientific problems
  - With lower power and clock rate than the Mobile Pentium
  - Vectorization techniques developed for the Cray PVPs are applicable
  - PIM technology provides a low-power, low-cost memory system
  - A similar combination is used in the Sony Playstation
- Small ISA changes can have a large impact
  - Limited in-register permutations sped up the 1K FFT by 5x
- The memory system can still be a bottleneck
  - Indexed and variable-stride access is costly, due to address generation
- Future work
  - Ongoing investigations into the impact of lanes and subbanks
  - Technical paper in preparation; expected completion 09/01
  - Run benchmarks on real VIRAM chips
  - Examine multiprocessor VIRAM configurations
12. Project Goals for FY02 and Beyond
- Use the established data-intensive scientific benchmarks with other emerging architectures
- IMAGINE (Stanford Univ.)
  - Designed for graphics and image/signal processing
  - Peak 20 GFLOPS (32-bit FP)
  - Key features: vector processing, VLIW, a streaming memory system (not a PIM-based design)
  - Preliminary discussions with Bill Dally
- DIVA (DARPA-sponsored, USC/ISI)
  - Based on a PIM "smart memory" design, but for multiprocessors
  - Moves computation to the data
  - Designed for irregular data structures and dynamic databases
  - Discussions with Mary Hall about benchmark comparisons
13. Management Plan
- Roles of different groups and PIs
  - Senior researchers each work on a particular class of benchmarks
    - Parry: sorting and histograms
    - Sherry: sparse matrices
    - Lenny: unstructured mesh adaptation
    - Brian: simulation
    - Jin and Hyun: specific benchmarks
  - Plan to hire an additional postdoc next year (focus on Imagine)
  - Undergrad model used for targeted benchmark efforts
- Plan for using computational resources at NERSC
  - Few resources used, except for comparisons
14. Future Funding Prospects
- FY2003 and beyond
  - DARPA-initiated DIS program
    - Related projects are continuing under Polymorphic Computing
    - New BAA coming in High Productivity Systems
  - Interest from other DOE labs (LANL) in the general problem
- General model
  - Most architectural research projects need benchmarking
  - The work has higher quality if done by people who understand the applications
  - Expertise for hardware projects is different: system-level design, circuit design, etc.
  - Interest from both the IRAM and Imagine groups shows the demand for this work
15. Long-Term Impact
- Potential impact on computer science
  - Promote research on new architectures and microarchitectures
- Understand future architectures
  - Preparation for procurements
- Provide visibility for NERSC in core CS research areas
- Correlate applications: DOE vs. large-market problems
- Influence future machines through research collaborations
16. The End
17. Integer Benchmarks
- Strided access is important, e.g., for RGB data (see the sketch below)
  - Narrow types are limited by address generation
- Outer-loop vectorization and unrolling used
  - Helps avoid short vectors
  - Spilling can be a problem
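A minimal sketch of the strided-access pattern and the long-vector strategy described above: vectorizing over the pixel index (a long trip count) rather than over the three channels keeps vector lengths long, and each channel becomes a constant-stride (stride-3) access. The scaling operation itself is an illustrative placeholder.

```c
#include <stdint.h>

/* Stride-3 access into an interleaved RGB image.  The loop over pixels
 * is the vectorized dimension; each channel access has constant stride 3. */
void scale_rgb(uint8_t *img, long npixels)
{
    for (long i = 0; i < npixels; i++) {
        img[3*i + 0] = (uint8_t)((img[3*i + 0] * 3) / 4);  /* R */
        img[3*i + 1] = (uint8_t)((img[3*i + 1] * 3) / 4);  /* G */
        img[3*i + 2] = (uint8_t)((img[3*i + 2] * 3) / 4);  /* B */
    }
}
```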
18. Status of benchmarking software release
- Release contents (a mix of optimized and unoptimized codes):
  - Vector histogram code (optimized) and vector histogram code generator
  - GUPS: optimized inner loop, C codes, and docs
  - Pointer Jumping and Pointer Jumping w/ Update
  - Conjugate Gradient (Matrix)
  - Neighborhood
  - Transitive
  - Field
  - Standard random number generator
  - Test cases (small and large working sets)
  - Build and test scripts (Makefiles, timing, analysis, ...)
- Future work
  - Write more documentation; add better test cases as we find them
  - Incorporate media benchmarks, the AMR code, and a library of frequently-used compiler flags/pragmas
19. Status of benchmarking work
- Two performance models
  - Simulator (vsim-p) and trace analyzer (vsimII)
- Recent work on vsim-p
  - Refining the performance model for double-precision FP performance
- Recent work on vsimII
  - Making the backend modular
    - Goal: model different architectures with the same ISA
  - Fixing bugs in the memory model of the VIRAM-1 backend
  - Better comments in the code for better maintainability
  - Completing a new backend for a new decoupled cluster architecture
20. Comparison with Mobile Pentium
- GUPS: VIRAM gets 6x more giga-updates per second
- Transitive, Pointer, Update: VIRAM 30-50% faster than the P-III
- Execution time for VIRAM rises much more slowly with data size than for the P-III
21. Sparse CG
- Solve Ax = b; sparse matrix-vector multiplication dominates
- Traditional CRS format requires
  - Indexed load/store for the x/y vectors
  - Variable vector length, usually short
- Other formats for better vectorization (see the sketch below)
  - CRS with a narrow band (e.g., RCM ordering)
    - Smaller strides for the x vector
  - Segmented sum (modified from the old code developed for the Cray PVP)
    - Long vectors of the same size
    - Unit stride
  - ELL format: make all rows the same length by padding with zeros
    - Long vectors of the same size
    - Extra flops
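A minimal sketch of the two storage formats contrasted above, computing y = A*x. Array names and the column-major ELL layout are illustrative, not the benchmark's exact code.

```c
/* CRS: the inner loop length equals each row's nonzero count (often
 * short, ~17 here) and x is gathered through col[].  ELL: every row is
 * padded to the same length L, so the inner dimension vectorizes with
 * long, equal-length, unit-stride vectors at the cost of extra (zero)
 * flops; padded slots carry val = 0 and any valid column index. */

void spmv_crs(int n, const int *ptr, const int *col,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = ptr[i]; k < ptr[i + 1]; k++)   /* short, variable length */
            sum += val[k] * x[col[k]];              /* indexed load of x */
        y[i] = sum;
    }
}

void spmv_ell(int n, int L, const int *col, const double *val,
              const double *x, double *y)
{
    /* val and col are n-by-L, stored column-major, so each pass of the
     * inner loop walks a length-n, unit-stride column. */
    for (int i = 0; i < n; i++)
        y[i] = 0.0;
    for (int j = 0; j < L; j++)
        for (int i = 0; i < n; i++)                 /* long, fixed length */
            y[i] += val[j * n + i] * x[col[j * n + i]];
}
```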
22. SMVM Performance
- DIS matrix: N = 10000, M = 177820 (~17 nonzeros per row)
- IRAM results (MFLOPS)
- Mobile PIII (500 MHz): CRS 35 MFLOPS
23. 2D Unstructured Mesh Adaptation
- Powerful tool for efficiently solving computational problems with evolving physical features (shocks, vortices, shear layers, crack propagation)
- Complicated logic and data structures
- Difficult to achieve high efficiency
  - Irregular data access patterns (pointer chasing)
  - Many conditionals / integer intensive
- Adaptation is a tool for making the numerical solution cost effective
- Three types of element subdivision
24. Vectorization Strategy and Performance Results
- Color elements based on vertices (not edges)
  - Guarantees no conflicts during vector operations (see the sketch at the end)
- Vectorize across each subdivision type (1:2, 1:3, 1:4), one color at a time
- Difficult: many conditionals, low flops, irregular data access, dependencies
- Initial grid 4802 triangles, final grid 24010 triangles
- Preliminary results demonstrate VIRAM is 4.5x faster than a Mobile Pentium III 500
- Higher code complexity (requires graph coloring and reordering)
(Chart: execution time in ms)
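A minimal sketch of the coloring strategy: elements are grouped by color so that no two elements of one color share a vertex, and the loop over each color can then be vectorized without write conflicts. The color_ptr/color_elem layout and the refine stub are hypothetical, not the project's actual data structures.

```c
#include <stdio.h>

/* Stand-in for one 1:2 subdivision of a triangle; a real implementation
 * would update the element and vertex data structures. */
static void refine_1to2(int elem)
{
    printf("refining element %d\n", elem);
}

/* Process elements one color at a time.  Elements of the same color
 * touch disjoint vertices, so the inner loop has no dependences and
 * can run as one long vector operation. */
void adapt_one_color_at_a_time(int ncolors,
                               const int *color_ptr,   /* CSR-style offsets */
                               const int *color_elem)  /* elements grouped by color */
{
    for (int c = 0; c < ncolors; c++)
        for (int k = color_ptr[c]; k < color_ptr[c + 1]; k++)
            refine_1to2(color_elem[k]);
}
```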