Title: Scientific Applications on Multi-PIM Systems
1. Scientific Applications on Multi-PIM Systems
- WIMPS 2002
- Katherine Yelick
- U.C. Berkeley and NERSC/LBNL
Joint with Xiaoye Li, Lenny Oliker, Brian Gaeke,
Parry Husbands (LBNL) and the Berkeley IRAM group:
Dave Patterson, Joe Gebis, Dave Judd,
Christoforos Kozyrakis, Sam Williams, Steve Pope
2. Algorithm Space
- Search: two-sided dense linear algebra, FFTs, Gröbner basis (symbolic LU), sorting
- Reuse: sparse iterative solvers, asynchronous discrete event simulation, one-sided dense linear algebra, sparse direct solvers
- Regularity
3. Why build Multiprocessor PIM?
- Scaling to Petaflops
- Low power/footprint/etc.
- Performance
  - And performance predictability
- Programmability
  - Let's not forget this
  - Would like to increase the user base
- Start with the single-chip problem by looking at VIRAM
4. VIRAM Overview
- MIPS core (200 MHz)
  - Single-issue, 8 KB instruction and data caches
- Vector unit (200 MHz)
  - 32 64-bit elements per register
  - 256-bit datapaths (16b, 32b, 64b ops)
  - 4 address generation units
- Main memory system
  - 13 MB of on-chip DRAM in 8 banks
  - 12.8 GB/s peak bandwidth
  - Typical power consumption 2.0 W
- Peak vector performance
  - 1.6/3.2/6.4 GOPS (64b/32b/16b) without multiply-add
  - 1.6 GFLOPS (single precision)
- Fabrication by IBM
  - Tape-out in O(1 month)
5. Benchmarks for Scientific Problems
- Dense matrix-vector multiplication
  - Compare to hand-tuned codes on conventional machines
- Transitive closure (small and large data sets)
  - On a dense graph representation
- NSA Giga-Updates Per Second (GUPS, 16-bit and 64-bit)
  - Fetch-and-increment a stream of random addresses
- Sparse matrix-vector product
  - Order 10000, 177820 nonzeros
- Computing a histogram
  - Used for image processing of a 16-bit greyscale image (1536 x 1536)
  - 2 algorithms: 64-element sorting kernel and privatization
  - Also used in sorting
- 2D unstructured mesh adaptation
  - Initial grid 4802 triangles, final grid 24010 triangles
6. Power and Performance on BLAS-2
- 100x100 matrix-vector multiplication (column layout)
- VIRAM result compiled; others hand-coded or Atlas-optimized
- VIRAM performance improves with larger matrices
- VIRAM power includes on-chip main memory
- 8-lane version of VIRAM nearly doubles MFLOPS
7. Performance Comparison
- IRAM designed for media processing
  - Low power was a higher priority than high performance
- IRAM (at 200 MHz) is better for apps with sufficient parallelism
8. Power Efficiency
- Huge power/performance advantage in VIRAM from both
  - PIM technology
  - Data-parallel execution model (compiler-controlled)
9. Power Efficiency
- Same data on a log plot
- Includes both low-power (Mobile PIII) and high-performance processors
- The same picture holds for operations/cycle
10. Which Problems are Limited by Bandwidth?
- What is the bottleneck in each case?
  - Transitive and GUPS are limited by bandwidth (near the 6.4 GB/s peak)
  - SPMV and Mesh are limited by address generation and bank conflicts
  - For Histogram there is insufficient parallelism
11. Summary of 1-PIM Results
- Programmability advantage
  - All vectorized by the VIRAM compiler (Cray vectorizer)
  - With restructuring and hints from programmers
- Performance advantage
  - Large on applications limited only by bandwidth
  - More address generators/sub-banks would help irregular performance
- Performance/power advantage
  - Over both low-power and high-performance processors
  - Both PIM and data parallelism are key
12. Analysis of a Multi-PIM System
- Machine parameters
  - Floating-point performance
    - PIM-node dependent
    - Application dependent, not theoretical peak
  - Amount of memory per processor
    - Use 1/10th algorithm data
  - Communication overhead
    - Time the processor is busy sending a message
    - Cannot be overlapped
  - Communication latency
    - Time across the network (can be overlapped)
  - Communication bandwidth
    - Single node and bisection
- Back-of-the-envelope calculations!
13. Real Data from an Old Machine (T3E)
- UPC uses a global address space
- Non-blocking remote put/get model
- Does not cache remote data
14. Running Sparse MVM on a Pflop PIM
- 1 GHz, 8 pipes, 8 ALUs/pipe: 64 GFLOPS/node peak
- 8 address generators limit performance to 16 GFLOPS
- 500 ns latency, 1-cycle put/get overhead, 100-cycle MP overhead
- Programmability differences too: packing vs. global address space
15. Effect of Memory Size
- For small memory nodes or smaller problem sizes
  - Low overhead is more important
- For large memory nodes and large problems, packing is better
16. Conclusions
- Performance advantage for PIMs depends on the application
  - Need fine-grained parallelism to utilize on-chip bandwidth
- Data parallelism is one model, with the usual trade-offs
  - Hardware and programming simplicity
  - Limited expressibility
- Largest advantages for PIMs are power and packaging
  - Enables Peta-scale machines
- Multiprocessor PIMs should be easier to program
  - At least at the scale of current machines (Tflops)
- Can we get rid of the current programming-model hierarchy?
17. The End
18. Benchmarks
- Kernels
  - Designed to stress memory systems
  - Some taken from the Data Intensive Systems Stressmarks
- Unit and constant stride memory
  - Dense matrix-vector multiplication
  - Transitive closure
- Constant stride
  - FFT
- Indirect addressing
  - NSA Giga-Updates Per Second (GUPS)
  - Sparse matrix-vector multiplication
  - Histogram calculation (sorting)
- Frequent branching as well as irregular memory access
  - Unstructured mesh adaptation
19. Conclusions and VIRAM Future Directions
- VIRAM outperforms the Pentium III on scientific problems
  - With lower power and clock rate than the Mobile Pentium
- Vectorization techniques developed for the Cray PVPs are applicable
- PIM technology provides a low-power, low-cost memory system
  - A similar combination is used in the Sony PlayStation
- Small ISA changes can have a large impact
  - Limited in-register permutations sped up the 1K FFT by 5x
- Memory system can still be a bottleneck
  - Indexed/variable-stride access is costly, due to address generation
- Future work
  - Ongoing investigations into the impact of lanes, subbanks
  - Technical paper in preparation; expect completion 09/01
  - Run benchmarks on real VIRAM chips
  - Examine multiprocessor VIRAM configurations
20. Management Plan
- Roles of different groups and PIs
  - Senior researchers working on a particular class of benchmarks
    - Parry: sorting and histograms
    - Sherry: sparse matrices
    - Lenny: unstructured mesh adaptation
    - Brian: simulation
  - Jin and Hyun: specific benchmarks
  - Plan to hire an additional postdoc for next year (focus on Imagine)
  - Undergrad model used for targeted benchmark efforts
- Plan for using computational resources at NERSC
  - Few resources used, except for comparisons
21. Future Funding Prospects
- FY2003 and beyond
  - DARPA initiated the DIS program
    - Related projects are continuing under Polymorphic Computing
  - New BAA coming in High Productivity Systems
  - Interest from other DOE labs (LANL) in the general problem
- General model
  - Most architectural research projects need benchmarking
  - Work has higher quality if done by people who understand the apps
  - Expertise for hardware projects is different: system-level design, circuit design, etc.
  - Interest from both the IRAM and Imagine groups shows the level of interest
22. Long Term Impact
- Potential impact on Computer Science
  - Promote research on new architectures and micro-architectures
  - Understand future architectures
    - Preparation for procurements
  - Provide visibility for NERSC in core CS research areas
  - Correlate applications: DOE vs. large-market problems
  - Influence future machines through research collaborations
23. Benchmark Performance on IRAM Simulator
- IRAM (200 MHz, 2 W) versus Mobile Pentium III (500 MHz, 4 W)
24. Project Goals for FY02 and Beyond
- Use established data-intensive scientific benchmarks with other emerging architectures
- IMAGINE (Stanford Univ.)
  - Designed for graphics and image/signal processing
  - Peak 20 GFLOPS (32-bit FP)
  - Key features: vector processing, VLIW, a streaming memory system (not a PIM-based design)
  - Preliminary discussions with Bill Dally
- DIVA (DARPA-sponsored, USC/ISI)
  - Based on a PIM smart-memory design, but for multiprocessors
    - Move computation to data
  - Designed for irregular data structures and dynamic databases
  - Discussions with Mary Hall about benchmark comparisons
25. Media Benchmarks
- FFT uses in-register permutations and a generalized reduction
- All others written in C with the Cray vectorizing compiler
26. Integer Benchmarks
- Strided access important, e.g., RGB
  - Narrow types limited by address generation
- Outer-loop vectorization and unrolling used
  - Helps avoid short vectors
  - Spilling can be a problem
27. Status of Benchmarking Software Release
- Optimized
  - Vector histogram code
  - GUPS inner loop
- Unoptimized
  - GUPS docs
  - Pointer jumping w/ update
  - Vector histogram code generator
  - GUPS C codes
  - Conjugate Gradient (Matrix)
  - Neighborhood
  - Pointer jumping
  - Transitive
  - Field
  - Standard random number generator
  - Test cases (small and large working sets)
  - Build and test scripts (Makefiles, timing, analysis, ...)
- Future work
  - Write more documentation, add better test cases as we find them
  - Incorporate media benchmarks, AMR code, library of frequently-used compiler flags/pragmas
28. Status of Benchmarking Work
- Two performance models
  - Simulator (vsim-p) and trace analyzer (vsimII)
- Recent work on vsim-p
  - Refining the performance model for double-precision FP performance
- Recent work on vsimII
  - Making the backend modular
    - Goal: model different architectures with the same ISA
  - Fixing bugs in the memory model of the VIRAM-1 backend
  - Better comments in code for better maintainability
  - Completing a new backend for a new decoupled cluster architecture
29. Comparison with Mobile Pentium
- GUPS: VIRAM gets 6x more GUPS

  Data element width     16-bit   32-bit   64-bit
  Mobile Pentium GUPS    .045     .046     .036
  VIRAM GUPS             .295     .295     .244

- Transitive, Pointer, Update: VIRAM is 30-50x faster than the P-III
  - Execution time for VIRAM rises much more slowly with data size than for the P-III
30. Sparse CG
- Solve Ax = b; sparse matrix-vector multiplication dominates
- Traditional CRS format requires
  - Indexed load/store for the X/Y vectors
  - Variable vector length, usually short
- Other formats for better vectorization
  - CRS with narrow band (e.g., RCM ordering)
    - Smaller strides for the X vector
  - Segmented sum (modified from old code developed for the Cray PVP)
    - Long vector length, all of the same size
    - Unit stride
  - ELL format: make all rows the same length by padding with zeros
    - Long vector length, all of the same size
    - Extra flops
31. SMVM Performance
- DIS matrix: N = 10000, M = 177820 (~17 nonzeros per row)
- IRAM results (MFLOPS) by number of sub-banks; Mobile PIII (500 MHz) with CRS: 35 MFLOPS

  SubBanks                1          2          4          8
  CRS                     91         106        109        110
  CRS banded              110        110        110        110
  SEG-SUM                 135        154        163        165
  ELL (4.6x more flops)   511 (111)  570 (124)  612 (133)  632 (137)
32. 2D Unstructured Mesh Adaptation
- Powerful tool for efficiently solving computational problems with evolving physical features (shocks, vortices, shear layers, crack propagation)
- Complicated logic and data structures
  - Difficult to achieve high efficiency
  - Irregular data access patterns (pointer chasing)
  - Many conditionals / integer intensive
- Adaptation is a tool for making the numerical solution cost effective
- Three types of element subdivision
33. Vectorization Strategy and Performance Results
- Color elements based on vertices (not edges)
  - Guarantees no conflicts during vector operations
- Vectorize across each subdivision type (1:2, 1:3, 1:4), one color at a time
- Difficult: many conditionals, low flops, irregular data access, dependencies
- Initial grid 4802 triangles, final grid 24010 triangles
- Preliminary results demonstrate VIRAM 4.5x faster than a Mobile Pentium III 500
- Higher code complexity (requires graph coloring and reordering)

  Time (ms)   Pentium III 500   1 Lane   2 Lanes   4 Lanes
              61                18       14        13