Title: Suitability of Alternative Architectures for Scientific Computing in 5-10 Years
1. Suitability of Alternative Architectures for Scientific Computing in 5-10 Years
- LDRD 2002 Strategic-Computational Review
- July 31, 2001
- PIs: Xiaoye Li, Bob Lucas, Lenny Oliker, Katherine Yelick
- Others: Brian Gaeke, Parry Husbands, Hyun Jin Kim, Hyun Jin Moon
2. Outline
- Project goals
- FY01 progress report
  - Benchmark kernel definitions
  - Performance on IRAM; comparisons with conventional machines
- Management plan
- Future funding opportunities
3. Motivation and Goal
- NERSC-3 (now) and NERSC-4 (in 2-3 years) consist of large clusters of commodity SMPs. What about 5-10 years from now?
- Future architecture technologies
  - PIM (e.g., IRAM, DIVA, Blue Gene)
  - SIMD/Vector/Stream (e.g., IRAM, Imagine, Playstation)
  - Low power, narrow data types (e.g., MMX, IRAM, Imagine)
  - Feasibility of building large-scale systems
  - What will the commodity building blocks (nodes and networks) be?
- Driven by NERSC and DOE scientific application codes
  - Where do the needs diverge from big-market applications?
- Influence future architectures
4. Computational Kernels and Applications
- Kernels
  - Designed to stress memory systems
  - Some taken from the Data Intensive Systems (DIS) Stressmarks
  - Unit and constant-stride memory access
    - Transitive-closure
    - FFT
    - Dense and sparse linear algebra (BLAS 1 and 2)
  - Indirect addressing
    - Pointer-jumping, neighborhood (image), sparse CG
    - NSA Giga-Updates Per Second (GUPS); see the sketch at the end of this slide
  - Frequent branching as well as irregular memory access
    - Unstructured mesh adaptation
- Examples of NERSC/DOE applications that may benefit
  - Omega3P, accelerator design (SLAC; AMR and sparse linear algebra)
  - Paratec, materials science package (LBNL; FFT and dense linear algebra)
  - Camille, 3D atmospheric circulation model (preconditioned CG)
  - HyperClaw, gas dynamics simulation in an AMR framework (LBNL)
  - NWChem, quantum chemistry (PNNL; global arrays and linear algebra)
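To make the irregular memory-access pattern concrete, here is a minimal GUPS-style update loop in C. It is a sketch only: the table size, update count, and LCG index stream are illustrative placeholders, not the NSA benchmark's exact parameters.

```c
#include <stdint.h>

/* Minimal sketch of a GUPS-style kernel: random read-modify-write
 * updates to a large table, stressing indexed (gather/scatter) memory
 * access rather than floating-point work.  Table size, update count,
 * and the simple LCG index stream are placeholders. */
#define TABLE_BITS 20
#define TABLE_SIZE (1UL << TABLE_BITS)
#define N_UPDATES  (4 * TABLE_SIZE)

void gups(uint64_t *table)
{
    uint64_t r = 1;
    for (uint64_t i = 0; i < N_UPDATES; i++) {
        r = r * 6364136223846793005ULL + 1442695040888963407ULL; /* LCG step */
        uint64_t idx = r >> (64 - TABLE_BITS);  /* high bits as random index */
        table[idx] ^= r;                        /* read-modify-write update */
    }
}
```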
5. VIRAM Overview (UCB)
- MIPS core (200 MHz)
  - Single-issue, 8 KB instruction and data caches
- Vector unit (200 MHz)
  - 32 64-bit elements per register
  - 256-bit datapaths (16-, 32-, 64-bit ops)
  - 4 address generation units
- Main memory system
  - 12 MB of on-chip DRAM in 8 banks
  - 12.8 GB/s peak bandwidth
- Typical power consumption: 2.0 W
- Peak vector performance (a quick arithmetic check follows this slide)
  - 1.6/3.2/6.4 Gops (64/32/16-bit) without multiply-add
  - 1.6 GFLOPS (single precision)
- Same process technology as Blue Gene
  - But a single chip, aimed at multimedia
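A back-of-the-envelope check of the peak rates listed above, assuming the 256-bit datapaths feed two parallel vector arithmetic units at 200 MHz (the two-unit count is an assumption, not stated on this slide):

```c
#include <stdio.h>

/* Sanity check of the quoted peak integer rates: with two 256-bit
 * vector arithmetic units at 200 MHz (assumed unit count), narrower
 * element widths pack more operations per cycle. */
int main(void)
{
    const double mhz = 200e6, datapath_bits = 256, units = 2;
    const int widths[] = {64, 32, 16};
    for (int i = 0; i < 3; i++)
        printf("%2d-bit peak: %.1f Gops\n", widths[i],
               units * (datapath_bits / widths[i]) * mhz / 1e9);
    /* prints 1.6, 3.2, 6.4 Gops, matching the slide (w/o multiply-add) */
    return 0;
}
```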
6. Status of IRAM Benchmarking Infrastructure
- Improved the VIRAM simulator
  - Refining the performance model for double-precision FP performance
  - Making the backend modular to allow for other microarchitectures
- Packaging the benchmark codes
  - Build and test scripts plus input data (small and large data sets)
  - Added documentation
- Preparing for final chip benchmarking
  - Tape-out scheduled by UCB for 9/01
7. Media Benchmarks
- FFT uses in-register permutations and a generalized reduction
- All others written in C with the Cray vectorizing compiler
8. Power Advantage of PIM + Vectors
- 100x100 matrix-vector multiplication (column layout)
- Comparison results from the LAPACK manual (vendor-optimized assembly)
- VIRAM performance improves with larger matrices!
- VIRAM power includes on-chip main memory!
9. Benchmarks for Scientific Problems
- Transitive-closure (small and large data sets)
- Pointer-jumping (small and large working sets)
- Computing a histogram
  - Used for image processing of a 16-bit greyscale image (1536 x 1536)
  - 2 algorithms: a 64-element sorting kernel and privatization (privatization is sketched after this list)
  - Needed for sorting
- Neighborhood image processing (small and large images)
- NSA Giga-Updates Per Second (GUPS, 16-bit and 64-bit)
- Sparse matrix-vector product
  - Order 10000, 177820 nonzeros
- 2D unstructured mesh adaptation
  - Initial grid 4802 triangles, final grid 24010 triangles
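A minimal sketch of the privatization approach mentioned above: each virtual lane accumulates into its own copy of the bins so that concurrent updates never collide, at the cost of extra memory for the private copies (one reason a sort-based kernel is also of interest). The lane count and data layout here are illustrative.

```c
#include <stdint.h>
#include <string.h>

/* Histogram via privatization: NLANES private bin arrays receive
 * conflict-free updates, then are reduced into the final histogram.
 * NLANES is an illustrative lane count; NBINS matches a 16-bit
 * greyscale image. */
#define NLANES 64
#define NBINS  65536

void histogram_private(const uint16_t *image, long n, uint32_t *bins)
{
    static uint32_t priv[NLANES][NBINS];     /* private copies (16 MB here) */
    memset(priv, 0, sizeof priv);

    for (long i = 0; i < n; i++)             /* conflict-free updates */
        priv[i % NLANES][image[i]]++;

    memset(bins, 0, NBINS * sizeof *bins);
    for (int l = 0; l < NLANES; l++)         /* reduce the private copies */
        for (long b = 0; b < NBINS; b++)
            bins[b] += priv[l][b];
}
```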
10. Benchmark Performance on IRAM Simulator
- IRAM (200 MHz, 2 W) versus Mobile Pentium III (500 MHz, 4 W)
11. Conclusions and VIRAM Future Directions
- VIRAM outperforms the Pentium III on scientific problems
  - With lower power and clock rate than the Mobile Pentium
  - Vectorization techniques developed for the Cray PVPs are applicable
  - PIM technology provides a low-power, low-cost memory system
  - A similar combination is used in the Sony Playstation
- Small ISA changes can have a large impact
  - Limited in-register permutations sped up the 1K FFT by 5x
- The memory system can still be a bottleneck
  - Indexed and variable-stride access is costly, due to address generation
- Future work
  - Ongoing investigations into the impact of lanes and subbanks
  - Technical paper in preparation; expected completion 09/01
  - Run benchmarks on real VIRAM chips
  - Examine multiprocessor VIRAM configurations
12. Project Goals for FY02 and Beyond
- Use the established data-intensive scientific benchmarks with other emerging architectures
- IMAGINE (Stanford Univ.)
  - Designed for graphics and image/signal processing
  - Peak 20 GFLOPS (32-bit FP)
  - Key features: vector processing, VLIW, a streaming memory system (not a PIM-based design)
  - Preliminary discussions with Bill Dally
- DIVA (DARPA-sponsored, USC/ISI)
  - Based on a PIM "smart memory" design, but for multiprocessors
  - Moves computation to the data
  - Designed for irregular data structures and dynamic databases
  - Discussions with Mary Hall about benchmark comparisons
13. Management Plan
- Roles of different groups and PIs
  - Senior researchers each work on a particular class of benchmarks
    - Parry: sorting and histograms
    - Sherry: sparse matrices
    - Lenny: unstructured mesh adaptation
    - Brian: simulation
    - Jin and Hyun: specific benchmarks
  - Plan to hire an additional postdoc next year (focus on Imagine)
  - Undergrad model used for targeted benchmark efforts
- Plan for using computational resources at NERSC
  - Few resources used, except for comparisons
14. Future Funding Prospects
- FY2003 and beyond
  - DARPA-initiated DIS program
    - Related projects are continuing under Polymorphic Computing
    - New BAA coming in High Productivity Systems
  - Interest from other DOE labs (LANL) in the general problem
- General model
  - Most architectural research projects need benchmarking
  - The work has higher quality if done by people who understand the applications
  - Expertise for hardware projects is different: system-level design, circuit design, etc.
  - Interest from both the IRAM and Imagine groups shows the demand for this work
15. Long-Term Impact
- Potential impact on computer science
  - Promote research on new architectures and microarchitectures
- Understand future architectures
  - Preparation for procurements
- Provide visibility for NERSC in core CS research areas
- Correlate applications: DOE vs. large-market problems
- Influence future machines through research collaborations
16. The End
17. Integer Benchmarks
- Strided access is important, e.g., for RGB data (see the sketch below)
  - Narrow types are limited by address generation
- Outer-loop vectorization and unrolling used
  - Helps avoid short vectors
  - Spilling can be a problem
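A minimal sketch of the strided-access pattern and the long-vector strategy described above: vectorizing over the pixel index (a long trip count) rather than over the three channels keeps vector lengths long, and each channel becomes a constant-stride (stride-3) access. The scaling operation itself is an illustrative placeholder.

```c
#include <stdint.h>

/* Stride-3 access into an interleaved RGB image.  The loop over pixels
 * is the vectorized dimension; each channel access has constant stride 3. */
void scale_rgb(uint8_t *img, long npixels)
{
    for (long i = 0; i < npixels; i++) {
        img[3*i + 0] = (uint8_t)((img[3*i + 0] * 3) / 4);  /* R */
        img[3*i + 1] = (uint8_t)((img[3*i + 1] * 3) / 4);  /* G */
        img[3*i + 2] = (uint8_t)((img[3*i + 2] * 3) / 4);  /* B */
    }
}
```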
18. Status of benchmarking software release
- Release contents (a mix of optimized and unoptimized codes):
  - Vector histogram code (optimized) and vector histogram code generator
  - GUPS: optimized inner loop, C codes, and docs
  - Pointer Jumping and Pointer Jumping w/ Update
  - Conjugate Gradient (Matrix)
  - Neighborhood
  - Transitive
  - Field
  - Standard random number generator
  - Test cases (small and large working sets)
  - Build and test scripts (Makefiles, timing, analysis, ...)
- Future work
  - Write more documentation; add better test cases as we find them
  - Incorporate media benchmarks, the AMR code, and a library of frequently-used compiler flags/pragmas
19. Status of benchmarking work
- Two performance models
  - Simulator (vsim-p) and trace analyzer (vsimII)
- Recent work on vsim-p
  - Refining the performance model for double-precision FP performance
- Recent work on vsimII
  - Making the backend modular
    - Goal: model different architectures with the same ISA
  - Fixing bugs in the memory model of the VIRAM-1 backend
  - Better comments in the code for better maintainability
  - Completing a new backend for a new decoupled cluster architecture
20. Comparison with Mobile Pentium
- GUPS: VIRAM gets 6x more giga-updates per second
- Transitive, Pointer, Update: VIRAM 30-50% faster than the P-III
- Execution time for VIRAM rises much more slowly with data size than for the P-III
21. Sparse CG
- Solve Ax = b; sparse matrix-vector multiplication dominates
- Traditional CRS format requires
  - Indexed load/store for the x/y vectors
  - Variable vector length, usually short
- Other formats for better vectorization (see the sketch below)
  - CRS with a narrow band (e.g., RCM ordering)
    - Smaller strides for the x vector
  - Segmented sum (modified from the old code developed for the Cray PVP)
    - Long vectors of the same size
    - Unit stride
  - ELL format: make all rows the same length by padding with zeros
    - Long vectors of the same size
    - Extra flops
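A minimal sketch of the two storage formats contrasted above, computing y = A*x. Array names and the column-major ELL layout are illustrative, not the benchmark's exact code.

```c
/* CRS: the inner loop length equals each row's nonzero count (often
 * short, ~17 here) and x is gathered through col[].  ELL: every row is
 * padded to the same length L, so the inner dimension vectorizes with
 * long, equal-length, unit-stride vectors at the cost of extra (zero)
 * flops; padded slots carry val = 0 and any valid column index. */

void spmv_crs(int n, const int *ptr, const int *col,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = ptr[i]; k < ptr[i + 1]; k++)   /* short, variable length */
            sum += val[k] * x[col[k]];              /* indexed load of x */
        y[i] = sum;
    }
}

void spmv_ell(int n, int L, const int *col, const double *val,
              const double *x, double *y)
{
    /* val and col are n-by-L, stored column-major, so each pass of the
     * inner loop walks a length-n, unit-stride column. */
    for (int i = 0; i < n; i++)
        y[i] = 0.0;
    for (int j = 0; j < L; j++)
        for (int i = 0; i < n; i++)                 /* long, fixed length */
            y[i] += val[j * n + i] * x[col[j * n + i]];
}
```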
22. SMVM Performance
- DIS matrix: N = 10000, M = 177820 (~17 nonzeros per row)
- IRAM results (MFLOPS)
- Mobile PIII (500 MHz): CRS 35 MFLOPS
23. 2D Unstructured Mesh Adaptation
- Powerful tool for efficiently solving computational problems with evolving physical features (shocks, vortices, shear layers, crack propagation)
- Complicated logic and data structures
- Difficult to achieve high efficiency
  - Irregular data access patterns (pointer chasing)
  - Many conditionals / integer intensive
- Adaptation is a tool for making the numerical solution cost effective
- Three types of element subdivision
24. Vectorization Strategy and Performance Results
- Color elements based on vertices (not edges)
  - Guarantees no conflicts during vector operations (see the sketch at the end)
- Vectorize across each subdivision type (1:2, 1:3, 1:4), one color at a time
- Difficult: many conditionals, low flops, irregular data access, dependencies
- Initial grid 4802 triangles, final grid 24010 triangles
- Preliminary results demonstrate VIRAM is 4.5x faster than a Mobile Pentium III 500
- Higher code complexity (requires graph coloring and reordering)
(Chart: execution time in ms)
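A minimal sketch of the coloring strategy: elements are grouped by color so that no two elements of one color share a vertex, and the loop over each color can then be vectorized without write conflicts. The color_ptr/color_elem layout and the refine stub are hypothetical, not the project's actual data structures.

```c
#include <stdio.h>

/* Stand-in for one 1:2 subdivision of a triangle; a real implementation
 * would update the element and vertex data structures. */
static void refine_1to2(int elem)
{
    printf("refining element %d\n", elem);
}

/* Process elements one color at a time.  Elements of the same color
 * touch disjoint vertices, so the inner loop has no dependences and
 * can run as one long vector operation. */
void adapt_one_color_at_a_time(int ncolors,
                               const int *color_ptr,   /* CSR-style offsets */
                               const int *color_elem)  /* elements grouped by color */
{
    for (int c = 0; c < ncolors; c++)
        for (int k = color_ptr[c]; k < color_ptr[c + 1]; k++)
            refine_1to2(color_elem[k]);
}
```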