Scientific Applications on Multi-PIM Systems - PowerPoint PPT Presentation

1
Scientific Applications on Multi-PIM Systems
  • WIMPS 2002
  • Katherine Yelick
  • U.C. Berkeley and NERSC/LBNL

Joint work with Xiaoye Li, Lenny Oliker, Brian
Gaeke, and Parry Husbands (LBNL), and the Berkeley
IRAM group: Dave Patterson, Joe Gebis, Dave Judd,
Christoforos Kozyrakis, Sam Williams, Steve Pope
2
Algorithm Space
(The original slide placed algorithms on axes of search vs. reuse and regularity.)
  • Search: two-sided dense linear algebra, FFTs, Grobner basis (symbolic LU), sorting
  • Reuse: sparse iterative solvers, asynchronous discrete event simulation, one-sided dense linear algebra, sparse direct solvers
3
Why build Multiprocessor PIM?
  • Scaling to Petaflops
  • Low power/footprint/etc.
  • Performance
  • And performance predictability
  • Programmability
  • Let's not forget this
  • Would like to increase user base
  • Start with single chip problem by looking at
    VIRAM

4
VIRAM Overview
  • MIPS core (200 MHz)
  • Single-issue, 8 KB instruction and data caches
  • Vector unit (200 MHz)
  • 32 64b elements per register
  • 256b datapaths, (16b, 32b, 64b ops)
  • 4 address generation units
  • Main memory system
  • 13 MB of on-chip DRAM in 8 banks
  • 12.8 GBytes/s peak bandwidth
  • Typical power consumption 2.0 W
  • Peak vector performance
  • 1.6/3.2/6.4 GOPS w/o multiply-add
  • 1.6 Gflops (single-precision)
  • Fabrication by IBM
  • Tape-out in O(1 month)

5
Benchmarks for Scientific Problems
  • Dense matrix-vector multiplication
  • Compare to hand-tuned codes on conventional machines
  • Transitive closure (small and large data sets)
  • On a dense graph representation
  • NSA Giga-Updates Per Second (GUPS; 16-bit and 64-bit)
  • Fetch-and-increment a stream of random addresses
  • Sparse matrix-vector product
  • Order 10000, with 177820 nonzeros
  • Computing a histogram
  • Used for image processing of a 16-bit greyscale image: 1536 x 1536
  • 2 algorithms: 64-element sorting kernel and privatization
  • Also used in sorting
  • 2D unstructured mesh adaptation
  • Initial grid: 4802 triangles; final grid: 24010 triangles
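The two histogram strategies named above can be sketched as follows. This is a minimal plain-Python illustration, not the VIRAM kernel: the direct version fetch-and-increments one shared table (updates from different vector lanes can conflict on the same bucket), while privatization gives each lane its own table and merges afterwards.

```python
# Sketch of two histogram strategies (illustrative, not the benchmark code).

def histogram_direct(pixels, nbuckets=1 << 16):
    # Fetch-and-increment into one shared table.
    hist = [0] * nbuckets
    for p in pixels:
        hist[p] += 1
    return hist

def histogram_privatized(pixels, nlanes=8, nbuckets=1 << 16):
    # One private table per lane, updated conflict-free,
    # then a reduction merges the private tables.
    private = [[0] * nbuckets for _ in range(nlanes)]
    for i, p in enumerate(pixels):
        private[i % nlanes][p] += 1
    return [sum(lane[b] for lane in private) for b in range(nbuckets)]
```

Privatization trades memory (nlanes copies of the table) for conflict-free parallel updates, which is why it pairs naturally with a vector unit.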

6
Power and Performance on BLAS-2
  • 100x100 matrix vector multiplication (column
    layout)
  • VIRAM result compiled, others hand-coded or Atlas
    optimized
  • VIRAM performance improves with larger matrices
  • VIRAM power includes on-chip main memory
  • 8-lane version of VIRAM nearly doubles MFLOPS
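A column-layout matrix-vector product like the one benchmarked above vectorizes naturally, since each column update is a unit-stride AXPY. A minimal sketch (plain Python, not the hand-coded or Atlas versions):

```python
# Sketch of column-major matrix-vector multiply: y = A @ x.
# Traversing by columns makes each inner update a unit-stride AXPY
# (y += x[j] * column_j), which maps well onto vector hardware.

def matvec_by_columns(A_cols, x):
    """A_cols: list of columns (each a list of floats); x has len(A_cols) entries."""
    n = len(A_cols[0])
    y = [0.0] * n
    for j, col in enumerate(A_cols):
        xj = x[j]
        for i in range(n):      # unit-stride, vectorizable loop
            y[i] += xj * col[i]
    return y
```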

7
Performance Comparison
  • IRAM designed for media processing
  • Low power was a higher priority than high
    performance
  • IRAM (at 200MHz) is better for apps with
    sufficient parallelism

8
Power Efficiency
  • Huge power/performance advantage in VIRAM from
    both
  • PIM technology
  • Data parallel execution model (compiler-controlled
    )

9
Power Efficiency
  • Same data on a log plot
  • Includes both low-power processors (Mobile PIII) and high-performance processors
  • The same picture for operations/cycle

10
Which Problems are Limited by Bandwidth?
  • What is the bottleneck in each case?
  • Transitive and GUPS are limited by bandwidth
    (near 6.4GB/s peak)
  • SPMV and Mesh limited by address generation and
    bank conflicts
  • For Histogram there is insufficient parallelism

11
Summary of 1-PIM Results
  • Programmability advantage
  • All vectorized by the VIRAM compiler (Cray
    vectorizer)
  • With restructuring and hints from programmers
  • Performance advantage
  • Large on applications limited only by bandwidth
  • More address generators/sub-banks would help
    irregular performance
  • Performance/Power advantage
  • Over both low power and high performance
    processors
  • Both PIM and data parallelism are key

12
Analysis of a Multi-PIM System
  • Machine Parameters
  • Floating point performance
  • PIM-node dependent
  • Application dependent, not theoretical peak
  • Amount of memory per processor
  • Use 1/10th Algorithm data
  • Communication Overhead
  • Time processor is busy sending a message
  • Cannot be overlapped
  • Communication Latency
  • Time across the network (can be overlapped)
  • Communication Bandwidth
  • Single node and bisection
  • Back-of-the-envelope calculations!
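The machine parameters above combine into a LogP-style per-message time model; a sketch under the slide's own assumptions (overhead is unoverlappable processor time, latency can be overlapped with other work, transfer time is size over bandwidth). The function name and example numbers are illustrative, not from the talk:

```python
# Back-of-the-envelope model of one message, in the spirit of the slide's
# parameter list (a LogP/LogGP-style sketch).

def message_time(overhead_s, latency_s, bytes_sent, bandwidth_Bps):
    # overhead: time the processor is busy sending (cannot be overlapped)
    # latency:  time across the network (can be overlapped)
    # transfer: message size / bandwidth
    return overhead_s + latency_s + bytes_sent / bandwidth_Bps

# Example: 1 ns put/get overhead (1 cycle at 1 GHz), 500 ns latency,
# an 8-byte put at an assumed 1 GB/s link -> 509 ns.
t = message_time(1e-9, 500e-9, 8, 1e9)
```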

13
Real Data from an Old Machine (T3E)
  • UPC uses a global address space
  • Non-blocking remote put/get model
  • Does not cache remote data

14
Running Sparse MVM on a Pflop PIM
  • 1 GHz 8 pipes 8 ALUs/Pipe 64 GFLOPS/node
    peak
  • 8 Address generators limit performance to 16
    Gflops
  • 500ns latency, 1 cycle put/get overhead, 100
    cycle MP overhead
  • Programmability differences too packing vs.
    global address space
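The peak and the address-generation limit quoted above follow from simple arithmetic; a sketch, assuming one flop per ALU per cycle, one address per generator per cycle, and two flops (multiply + add) per indexed element fetched:

```python
# Back-of-envelope peaks for the hypothetical Pflop PIM node in the slide.

clock_hz = 1e9
pipes, alus_per_pipe = 8, 8
peak_flops = clock_hz * pipes * alus_per_pipe   # 64 GFLOPS/node peak

# Sparse MVM needs an indexed load per nonzero; with 8 address generators
# producing one address per cycle, and 2 flops (multiply + add) per nonzero:
addr_gens = 8
smvm_flops = clock_hz * addr_gens * 2           # 16 GFLOPS limit
```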

15
Effect of Memory Size
  • For small memory nodes or smaller problem sizes
  • Low overhead is more important
  • For large memory nodes and large problems packing
    is better

16
Conclusions
  • Performance advantage for PIMS depends on
    application
  • Need fine-grained parallelism to utilize on-chip
    bandwidth
  • Data parallelism is one model with the usual
    trade-offs
  • Hardware and programming simplicity
  • Limited expressibility
  • Largest advantages for PIMS are power and
    packaging
  • Enables Peta-scale machine
  • Multiprocessor PIMs should be easier to program
  • At least at scale of current machines (Tflops)
  • Can we get rid of the current programming model hierarchy?

17
The End
18
Benchmarks
  • Kernels
  • Designed to stress memory systems
  • Some taken from the Data Intensive Systems
    Stressmarks
  • Unit and constant stride memory
  • Dense matrix-vector multiplication
  • Transitive-closure
  • Constant stride
  • FFT
  • Indirect addressing
  • NSA Giga-Updates Per Second (GUPS)
  • Sparse Matrix Vector multiplication
  • Histogram calculation (sorting)
  • Frequent branching as well as irregular memory access
  • Unstructured mesh adaptation
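The GUPS stressmark listed above is a minimal fetch-and-increment over random addresses; a sketch (plain Python with hypothetical names, not the released benchmark code):

```python
# Sketch of the GUPS (Giga-Updates Per Second) kernel: fetch-and-increment
# a stream of pseudo-random addresses into a large table. The reported
# figure is updates / elapsed seconds / 1e9.
import random

def run_updates(table_size, n_updates, seed=0):
    rng = random.Random(seed)             # stand-in for the benchmark's RNG
    table = [0] * table_size
    for _ in range(n_updates):
        addr = rng.randrange(table_size)  # random address stream
        table[addr] += 1                  # fetch-and-increment
    return table
```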

19
Conclusions and VIRAM Future Directions
  • VIRAM outperforms Pentium III on Scientific
    problems
  • With lower power and clock rate than the Mobile
    Pentium
  • Vectorization techniques developed for the Cray
    PVPs applicable.
  • PIM technology provides low power, low cost
    memory system.
  • Similar combination used in Sony Playstation.
  • Small ISA changes can have large impact
  • Limited in-register permutations sped up 1K FFT
    by 5x.
  • Memory system can still be a bottleneck
  • Indexed/variable stride costly, due to address
    generation.
  • Future work
  • Ongoing investigations into impact of lanes,
    subbanks
  • Technical paper in preparation; expected completion 09/01
  • Run benchmark on real VIRAM chips
  • Examine multiprocessor VIRAM configurations

20
Management Plan
  • Roles of different groups and PIs
  • Senior researchers working on particular class of
    benchmarks
  • Parry: sorting and histograms
  • Sherry: sparse matrices
  • Lenny: unstructured mesh adaptation
  • Brian: simulation
  • Jin and Hyun: specific benchmarks
  • Plan to hire additional postdoc for next year
    (focus on Imagine)
  • Undergrad model used for targeted benchmark
    efforts
  • Plan for using computational resources at NERSC
  • Few resources used, except for comparisons

21
Future Funding Prospects
  • FY2003 and beyond
  • DARPA initiated DIS program
  • Related projects are continuing under Polymorphic
    Computing
  • New BAA coming in High Productivity Systems
  • Interest from other DOE labs (LANL) in general
    problem
  • General model
  • Most architectural research projects need
    benchmarking
  • Work has higher quality if done by people who
    understand apps.
  • Expertise for hardware projects is different: system-level design, circuit design, etc.
  • Interest from both the IRAM and Imagine groups shows the level of demand

22
Long Term Impact
  • Potential impact on Computer Science
  • Promote research of new architectures and
    micro-architectures
  • Understand future architectures
  • Preparation for procurements
  • Provide visibility of NERSC in core CS research
    areas
  • Correlate applications: DOE vs. large-market problems
  • Influence future machines through research
    collaborations

23
Benchmark Performance on IRAM Simulator
  • IRAM (200 MHz, 2 W) versus Mobile Pentium III
    (500 MHz, 4 W)

24
Project Goals for FY02 and Beyond
  • Use established data-intensive scientific
    benchmarks with other emerging architectures
  • IMAGINE (Stanford Univ.)
  • Designed for graphics and image/signal processing
  • Peak: 20 GFLOPS (32-bit FP)
  • Key features: vector processing, VLIW, a streaming memory system (not a PIM-based design)
  • Preliminary discussions with Bill Dally.
  • DIVA (DARPA-sponsored USC/ISI)
  • Based on PIM smart memory design, but for
    multiprocessors
  • Move computation to data
  • Designed for irregular data structures and
    dynamic databases.
  • Discussions with Mary Hall about benchmark
    comparisons

25
Media Benchmarks
  • FFT uses in-register permutations, generalized
    reduction
  • All others written in C with Cray vectorizing
    compiler

26
Integer Benchmarks
  • Strided access important, e.g., RGB
  • narrow types limited by address generation
  • Outer loop vectorization and unrolling used
  • helps avoid short vectors
  • spilling can be a problem

27
Status of benchmarking software release
(The original slide was a table of components and their status; the recoverable entries:)
  • Optimized: vector histogram code, GUPS inner loop
  • Docs: GUPS
  • Unoptimized: pointer jumping w/ update, vector histogram code generator, GUPS C codes, conjugate gradient (matrix), neighborhood, pointer jumping, transitive, field
  • Standard random number generator
  • Test cases (small and large working sets)
  • Build and test scripts (Makefiles, timing, analysis, ...)
  • Future work
  • Write more documentation, add better test cases
    as we find them
  • Incorporate media benchmarks, AMR code, library
    of frequently-used compiler flags pragmas

28
Status of benchmarking work
  • Two performance models
  • simulator (vsim-p), and trace analyzer (vsimII)
  • Recent work on vsim-p
  • Refining the performance model for
    double-precision FP performance.
  • Recent work on vsimII
  • Making the backend modular
  • Goal: model different architectures with the same ISA
  • Fixing bugs in the memory model of the VIRAM-1
    backend.
  • Better comments in code for better
    maintainability.
  • Completing a new backend for a new decoupled
    cluster architecture.

29
Comparison with Mobile Pentium
  • GUPS: VIRAM gets 6x more GUPS

Data element width    16-bit   32-bit   64-bit
Mobile Pentium GUPS   .045     .046     .036
VIRAM GUPS            .295     .295     .244

  • Transitive, Pointer, Update: VIRAM 30-50x faster than P-III
  • Execution time for VIRAM rises much more slowly with data size than for P-III
30
Sparse CG
  • Solve Ax = b; sparse matrix-vector multiplication dominates
  • Traditional CRS format requires:
  • Indexed load/store for X/Y vectors
  • Variable vector length, usually short
  • Other formats for better vectorization:
  • CRS with narrow band (e.g., RCM ordering)
  • Smaller strides for X vector
  • Segmented sum (modified from the old code developed for Cray PVPs)
  • Long vectors, all of the same length
  • Unit stride
  • ELL format: make all rows the same length by padding with zeros
  • Long vectors, all of the same length
  • Extra flops
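The format trade-off above can be made concrete. A minimal plain-Python sketch (illustrative only): CRS stores only nonzeros but yields short, variable-length rows and a gather through an index vector, while ELL pads every row to the same length, giving long uniform vectors at the cost of extra flops on the padding zeros.

```python
# Sketch: sparse matrix-vector product in CRS vs. ELL form (illustrative).

def spmv_crs(vals, cols, rowptr, x):
    # CRS: each row is a variable-length (often short) vector,
    # and x is accessed through an index vector (gather).
    y = []
    for r in range(len(rowptr) - 1):
        s = 0.0
        for k in range(rowptr[r], rowptr[r + 1]):
            s += vals[k] * x[cols[k]]
        y.append(s)
    return y

def spmv_ell(ell_vals, ell_cols, x):
    # ELL: every row padded to a common length, so all vector
    # operations share one (long) length -- at the cost of
    # multiplying by the padding zeros.
    y = []
    for vrow, crow in zip(ell_vals, ell_cols):
        y.append(sum(v * x[c] for v, c in zip(vrow, crow)))
    return y
```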

31
SMVM Performance
  • DIS matrix: N = 10000, M = 177820 (about 17 nonzeros per row)
  • IRAM results (MFLOPS) below, by number of sub-banks
  • Mobile PIII (500 MHz): CRS 35 MFLOPS

SubBanks                 1          2          4          8
CRS                      91         106        109        110
CRS banded               110        110        110        110
SEG-SUM                  135        154        163        165
ELL (4.6x more flops)    511 (111)  570 (124)  612 (133)  632 (137)
32
2D Unstructured Mesh Adaptation
  • Powerful tool for efficiently solving
    computational problems with evolving physical
    features (shocks, vortices, shear layers, crack
    propagation)
  • Complicated logic and data structures
  • Difficult to achieve high efficiency
  • Irregular data access patterns (pointer chasing)
  • Many conditionals / integer intensive
  • Adaptation is a tool for making the numerical solution cost-effective
  • Three types of element subdivision

33
Vectorization Strategy and Performance Results
  • Color elements based on vertices (not edges)
  • Guarantees no conflicts during vector operations
  • Vectorize across each subdivision type (1:2, 1:3, 1:4), one color at a time
  • Difficult: many conditionals, low flops, irregular data access, dependencies
  • Initial grid: 4802 triangles; final grid: 24010 triangles
  • Preliminary results demonstrate VIRAM 4.5x faster than Mobile Pentium III 500
  • Higher code complexity (requires graph coloring and reordering)

Time (ms)   Pentium III 500   1 Lane   2 Lanes   4 Lanes
            61                18       14        13
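The vertex-based coloring described above can be sketched with a greedy algorithm. This is a minimal illustration (plain Python), not the code used in the study: two triangles sharing a vertex get different colors, so all triangles of one color can be updated as one conflict-free vector operation.

```python
# Sketch: greedy coloring of mesh elements so that no two elements sharing
# a vertex share a color; each color class can then be processed as one
# conflict-free vector operation. (Illustrative, not the study's code.)

def color_elements(elements):
    """elements: list of vertex tuples, e.g. triangles as (v0, v1, v2)."""
    color = {}
    for i, elem in enumerate(elements):
        # Colors already taken by earlier elements sharing any vertex.
        taken = {color[j] for j in range(i)
                 if set(elements[j]) & set(elem)}
        c = 0
        while c in taken:       # smallest color not used by a neighbor
            c += 1
        color[i] = c
    return color
```

A greedy pass like this is quadratic in the element count; the point is only that the color classes it produces are safe to vectorize.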