Latency vs. Bandwidth: Which Matters More?


1
Latency vs. Bandwidth: Which Matters More?
  • Katherine Yelick
  • U.C. Berkeley and LBNL

Joint work with Xiaoye Li, Lenny Oliker, Brian
Gaeke, Parry Husbands (LBNL); the Berkeley IRAM
group: Dave Patterson, Joe Gebis, Dave Judd,
Christoforos Kozyrakis, Sam Williams; the
Berkeley BeBOP group: Jim Demmel, Rich Vuduc, Ben
Lee, Rajesh Nishtala
2
Blame the Memory Bus
  • Many scientific applications run at less than 10%
    of hardware peak, even on a single processor
  • The trend is to blame the memory bus
  • Is this accurate?
  • Need to understand bottlenecks to
  • Design better machines
  • Design better algorithms
  • Two parts
  • Algorithm bottlenecks on microprocessors
  • Bottlenecks on a PIM system, VIRAM

Note this is latency, not bandwidth.
3
Memory Intensive Applications
  • Poor performance is especially problematic for
    memory-intensive applications
  • Low ratio of arithmetic operations to memory
  • Irregular memory access patterns
  • Example
  • Sparse matrix-vector multiply (dominant kernel of
    NAS CG)
  • Many scientific applications do this, from some
    perspective
  • Compute y = y + A*x
  • Matrix is stored as two main arrays
  • Column index array (int)
  • Value array (floating point)
  • For each element y[i], compute
  • Σ_j x[index[j]] * value[j]
  • So latency (to x) dominates, right?
  • Irregular
  • Not necessarily in cache
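
The kernel described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the benchmarked code: the `row_ptr` array (marking where each row starts) is an assumption added here, since the slide names only the column index and value arrays.

```python
def spmv_csr(row_ptr, col_index, value, x, y):
    """Compute y = y + A*x for a matrix A stored row by row.

    row_ptr is a hypothetical helper array: row i owns entries
    row_ptr[i] .. row_ptr[i+1]-1 of col_index and value.
    """
    for i in range(len(row_ptr) - 1):
        acc = y[i]
        for j in range(row_ptr[i], row_ptr[i + 1]):
            # value and col_index are read with unit stride;
            # x is read indirectly, so it may miss in cache.
            acc += value[j] * x[col_index[j]]
        y[i] = acc
    return y


# Small example: A = [[2, 0, 1], [0, 3, 0], [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_index = [0, 2, 1, 0, 2]
value = [2.0, 1.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0]
result = spmv_csr(row_ptr, col_index, value, x, [0.0, 0.0, 0.0])
```

Each nonzero costs one multiply and one add but three loads (value, index, and an element of x), which is the low arithmetic-to-memory ratio the slide refers to.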

4
Performance Model is Revealing
  • A simple analytical model for sparse matvec
    kernel
  • Time = (# loads from memory) * (cost of a memory load)
    + (# loads from cache) * (cost of a cache load)
  • Two versions
  • Only compulsory misses to source vector, x
  • All accesses to x produce a miss to memory
  • Conclusion
  • Cache misses to the source vector (memory latency)
    are not the dominant cost
  • PAPI measurements confirm
  • So bandwidth to the matrix dominates, right?
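
A rough sketch of such a model, under stated assumptions, is shown below. The function name, the per-load costs, and the choice to charge every matrix element (value and index) as one memory load are all hypothetical; the slide does not give the model's details.

```python
def matvec_time(n_rows, nnz, t_mem, t_cache, all_x_miss):
    """Toy cost model for sparse matvec on a square matrix.

    Assumptions (not from the slides): the value and index arrays
    always stream from memory, one load per element each; x either
    misses only once per entry (compulsory) or on every access.
    """
    matrix_loads = 2 * nnz            # value[j] and index[j]
    if all_x_miss:
        x_mem, x_cache = nnz, 0       # every access to x goes to memory
    else:
        x_mem = n_rows                # compulsory misses only
        x_cache = nnz - n_rows        # remaining accesses hit in cache
    return (matrix_loads + x_mem) * t_mem + x_cache * t_cache


# Matrix size taken from the benchmark slide; the costs are made up.
best = matvec_time(10000, 177820, t_mem=1.0, t_cache=0.0, all_x_miss=False)
worst = matvec_time(10000, 177820, t_mem=1.0, t_cache=0.0, all_x_miss=True)
```

In this toy version, even the all-miss case adds at most one memory load per nonzero on top of the two matrix loads, so traffic for the matrix itself remains the larger term.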

5
Memory Bandwidth Measurements
  • Yes, but be careful about how you measure
    bandwidth
  • Not a constant

6
An Architectural Probe
  • Sqmat is a tunable probe to measure architectures
  • Stream of small matrices
  • Square each matrix repeatedly (to some power): this
    sets the computational intensity
  • The stream may be direct (dense), or indirect
    (sparse)
  • If indirect, how frequently is there a non-unit
    stride jump?
  • Parameters
  • Matrix size within stream
  • Computational Intensity
  • Indirection (yes/no)
  • Number of unit strides before a jump
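
The probe's structure can be sketched as follows. This is an illustrative reconstruction from the parameter list above, not the actual Sqmat source; the function names and the way the index array is built (runs of `s` consecutive entries visited in reverse run order) are assumptions.

```python
def square(m):
    """Multiply a small square matrix (list of rows) by itself."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def make_index(n, s):
    """Index array with a non-unit-stride jump after every s entries."""
    runs = [list(range(start, min(start + s, n))) for start in range(0, n, s)]
    return [i for run in reversed(runs) for i in run]


def sqmat(stream, intensity, index=None):
    """Square each matrix in the stream `intensity` times.

    With index=None the stream is read directly (dense);
    otherwise it is read through the index array (sparse).
    """
    order = index if index is not None else range(len(stream))
    out = []
    for pos in order:
        m = stream[pos]
        for _ in range(intensity):
            m = square(m)
        out.append(m)
    return out


stream = [[[1, k], [0, 1]] for k in range(8)]
direct = sqmat(stream, intensity=2)
indirect = sqmat(stream, intensity=2, index=make_index(8, 2))
```

Raising `intensity` adds arithmetic per matrix loaded, while lowering `s` makes the access pattern more irregular, so the two knobs can be varied independently.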

7
Cost of Indirection
  • Adding a second load stream for indexes into
    stream has a big effect on some machines
  • This is truly a bandwidth issue

8
Cost of Irregularity
[Figure: slowdown plots for Opteron, Itanium2, Power3, and Power4]
  • Slowdown relative to the previous slide results
  • Even a tiny bit of irregularity (1/S) can have a
    big effect

9
What Does This Have to Do with PIMs?
  • Performance of Sqmat on PIMs and others for 3x3
    matrices, squared 10 times (high computational
    intensity!)
  • Imagine is much faster for long streams, slower for
    short ones

10
VIRAM Overview
  • Technology: IBM SA-27E
  • 0.18 µm CMOS, 6 metal layers
  • 290 mm² die area
  • 225 mm² for memory/logic
  • Transistor count: 130M
  • 13 MB of DRAM
  • Power supply
  • 1.2 V for logic, 1.8 V for DRAM
  • Typical power consumption: 2.0 W
  • 0.5 W (scalar) + 1.0 W (vector) + 0.2 W (DRAM) +
    0.3 W (misc)
  • MIPS scalar core + 4-lane vector unit
  • Peak vector performance
  • 1.6/3.2/6.4 Gops without multiply-add (64b/32b/16b
    operations)
  • 3.2/6.4/12.8 Gops with multiply-add
  • 1.6 Gflops (single-precision)

11
Vector IRAM ISA Summary
Scalar: MIPS64 scalar instruction set
Vector ALU: alu op, with data types s.int, u.int, s.fp, d.fp and operand forms .v, .vv, .vs, .sv
Vector memory: load and store, with data types s.int, u.int and unit stride, constant stride, or indexed addressing
ALU operations: integer, floating-point, fixed-point and DSP, convert, logical, vector processing, flag processing
12
VIRAM Compiler
[Diagram: frontends (C, C++, Fortran95) feed the optimizer (Cray's PDGCS), which feeds the code generators (T3D/T3E, C90/T90/X1, SV2/VIRAM)]
  • Based on Cray's production compiler
  • Challenges
  • narrow data types and scalar/vector memory
    consistency
  • Advantages relative to media-extensions
  • powerful addressing modes and ISA independent of
    datapath width

13
Compiler and OS Enhancements
  • Compiler based on Cray PDGCS
  • Outer-loop vectorization
  • Strided and indexed vector loads and stores
  • Vectorization of loops with if statements
  • Full predicated execution of vector instructions
    using flag registers
  • Vectorization of reductions and FFTs
  • Instructions for simple, intra-register
    permutations
  • Automatic for reductions, manual (or StreamIT)
    for FFTs
  • Vectorization of loops with break statements
  • Software speculation support for vector loads
  • OS development
  • MMU-based virtual memory
  • OS performance
  • Dirty and valid bits for registers to reduce
    context switch overhead

14
HW Resources Visible to Software
[Figure: hardware resources visible to software, Vector IRAM vs. Pentium III]
  • Software (applications/compiler/OS) can control
  • Main memory, registers, execution datapaths

15
VIRAM Chip Statistics
Technology: IBM SA-27E, 0.18 µm CMOS, 6 layers of copper; deep-trench DRAM cell, full-speed logic
Area: 270 mm² (65 mm² logic, 140 mm² DRAM)
Transistors: 130 million (7.5M logic, 122.5M DRAM)
Supply: 1.2 V logic, 1.8 V DRAM, 3.3 V I/O
Clock: 200 MHz
Power: 2 W (0.5 W MIPS core, 1 W vector unit, 0.5 W DRAM-I/O)
Package: 304-lead quad ceramic package (125 signal I/Os)
Crossbar BW: 12.8 Gbytes/s per direction (load or store, peak)
Peak performance:
  Integer without madd: 1.6/3.2/6.4 Gops (64b/32b/16b)
  Integer with madd: 3.2/6.4/12.8 Gops (64b/32b/16b)
  FP: 1.6 Gflops (32b, without madd)
16
VIRAM Design Statistics
RTL model: 170K lines of Verilog
Design methodology: synthesized (MIPS core, vector unit control, FP datapath); full-custom (vector reg. file, crossbar, integer datapaths); macros (DRAM, SRAM for caches)
IP sources: UC Berkeley (vector coprocessor, crossbar, I/O); MIPS Technologies (MIPS core); IBM (DRAM/SRAM macros); MIT (FP datapath)
Verification: 566K lines of directed tests (9.8M lines of assembly); 4 months of random testing on 20 Linux workstations
Design team: 5 graduate students
Status: place and route, chip assembly
Tape-out: October 2002
Design time: 2.5 years
17
VIRAM Chip
  • Taped out to IBM in October 2002
  • Received wafers in June 2003.
  • Chips were thinned, diced, and packaged.
  • Parts were sent to ISI, who produced test boards.

[Die photo labels: MIPS core, 4 64-bit vector lanes, I/O]
18
Demonstration System
  • Based on the MIPS Malta development board
  • PCI, Ethernet, AMR, IDE, USB, CompactFlash,
    parallel, serial
  • VIRAM daughter-card
  • Designed at ISI-East
  • VIRAM processor
  • Galileo GT64120 chipset
  • 1 DIMM slot for external DRAM
  • Software support and OS
  • Monitor utility for debugging
  • Modified version of MIPS Linux

19
Benchmarks for Scientific Problems
  • Dense and Sparse Matrix-vector multiplication
  • Compare to tuned codes on conventional machines
  • Transitive-closure (small and large data sets)
  • On a dense graph representation
  • NSA Giga-Updates Per Second (GUPS, 16-bit and
    64-bit)
  • Fetch-and-increment a stream of random
    addresses
  • Sparse matrix-vector product
  • Order 10000, nonzeros 177820
  • Computing a histogram
  • Used for image processing of a 16-bit greyscale
    image 1536 x 1536
  • 2 algorithms: 64-element sorting kernel and
    privatization
  • Also used in sorting
  • 2D unstructured mesh adaptation
  • initial grid 4802 triangles, final grid 24010
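
As a concrete picture of the GUPS kernel, the fetch-and-increment loop can be sketched as below. The table size, seed, and function names are illustrative assumptions and are not taken from the benchmark release.

```python
import random


def random_addresses(count, table_size, seed=0):
    """Hypothetical address stream: `count` random table indices."""
    rng = random.Random(seed)
    return [rng.randrange(table_size) for _ in range(count)]


def gups_updates(table, addresses):
    """Fetch-and-increment one table entry per address."""
    for a in addresses:
        table[a] += 1  # read, add one, write back
    return table


table = gups_updates([0] * 16, random_addresses(1000, 16))
```

Almost no arithmetic is done per update, so throughput is set by how quickly the memory system can serve scattered reads and writes. A vectorized version would also have to cope with the same address appearing twice in one vector, which this sequential sketch sidesteps.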

20
Sparse MVM Performance
  • Performance is matrix-dependent: lp matrix
  • compiled for VIRAM using the independent pragma
  • sparse column layout
  • Sparsity-optimized for other machines
  • sparse row (or blocked row) layout

[Chart: MFLOPS]
21
Power and Performance on BLAS-2
  • 100x100 matrix vector multiplication (column
    layout)
  • VIRAM result compiled, others hand-coded or Atlas
    optimized
  • VIRAM performance improves with larger matrices
  • VIRAM power includes on-chip main memory
  • 8-lane version of VIRAM nearly doubles MFLOPS

22
Performance Comparison
  • IRAM designed for media processing
  • Low power was a higher priority than high
    performance
  • IRAM (at 200MHz) is better for apps with
    sufficient parallelism

23
Power Efficiency
  • Same data on a log plot
  • Includes both low power processors (Mobile PIII)
  • The same picture for operations/cycle

24
Which Problems are Limited by Bandwidth?
  • What is the bottleneck in each case?
  • Transitive and GUPS are limited by bandwidth
    (near 6.4GB/s peak)
  • SPMV and Mesh limited by address generation and
    bank conflicts
  • For Histogram there is insufficient parallelism

25
Summary of 1-PIM Results
  • Programmability advantage
  • All vectorized by the VIRAM compiler (Cray
    vectorizer)
  • With restructuring and hints from programmers
  • Performance advantage
  • Large on applications limited only by bandwidth
  • More address generators/sub-banks would help
    irregular performance
  • Performance/Power advantage
  • Over both low power and high performance
    processors
  • Both PIM and data parallelism are key

26
Alternative VIRAM Designs
  • VIRAM-4Lane
  • 4 lanes, 8 Mbytes
  • 190 mm2
  • 3.2 Gops at 200MHz

  • VIRAM-2Lanes
  • 2 lanes, 4 Mbytes
  • 120 mm2
  • 1.6 Gops at 200MHz
  • VIRAM-Lite
  • 1 lane, 2 Mbytes
  • 60 mm2
  • 0.8 Gops at 200MHz
27
Compiled Multimedia Performance
[Charts: integer and floating-point benchmarks]
  • Single executable for multiple implementations
  • Linear scaling with number of lanes
  • Remember, this is a 200MHz, 2W processor

28
Third Party Comparison (I)
[Figure: third-party benchmark charts comparing VIRAM, Imagine, PPC-G4, and Pentium III]
29
Third Party Comparison (II)
[Figure: third-party benchmark charts comparing VIRAM, Imagine, PPC-G4, and Pentium III]
30
Vectors vs. SIMD or VLIW
  • SIMD
  • Short, fixed-length, vector extensions
  • Require wide issue or ISA change to scale
  • They don't support vector memory accesses
  • Difficult to compile for
  • Performance wasted for pack/unpack, shifts,
    rotates
  • VLIW
  • Architecture for instruction level parallelism
  • Orthogonal to vectors for data parallelism
  • Inefficient for data parallelism
  • Large code size (3X for IA-64?)
  • Extra work for software (scheduling more
    instructions)
  • Extra work for hardware (decode more
    instructions)

31
Vector Vs. Wide Word SIMD Example
  • Vector instruction sets have
  • Strided and scatter/gather load/store operations
  • SIMD extensions load contiguous memory
  • Implementation-independent vector length
  • SIMD extensions change the ISA with the datapath bit
    width in hardware
  • Simple example: conversion from RGB to YUV
  • Thanks to Christoforos Kozyrakis
  • Y = (9798*R + 19235*G + 3736*B) / 32768
  • U = (-4784*R - 9437*G + 4221*B) / 32768 + 128
  • V = (20218*R - 16941*G - 3277*B) / 32768 + 128
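
For reference, the conversion can be written as a short scalar routine. This is a sketch only: the transcript dropped the arithmetic operators, so the signs used here are an assumption based on the usual RGB-to-YUV transform, and the division by 32768 is done as an arithmetic right shift by 15 to mirror the shift instructions in the vector code.

```python
SHIFT = 15  # dividing by 32768 == arithmetic shift right by 15


def rgb_to_yuv(r, g, b):
    """Fixed-point RGB to YUV for one pixel (coefficients from the slide).

    The operator signs are assumed; the slide text lost them.
    """
    y = (9798 * r + 19235 * g + 3736 * b) >> SHIFT
    u = ((-4784 * r - 9437 * g + 4221 * b) >> SHIFT) + 128
    v = ((20218 * r - 16941 * g - 3277 * b) >> SHIFT) + 128
    return y, u, v


black = rgb_to_yuv(0, 0, 0)
white = rgb_to_yuv(255, 255, 255)
```

Every pixel needs nine multiplies with constants, two shifts by the same amount, and byte loads at stride 3, which is why strided vector loads and scalar-vector multiply-add map onto it so directly.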

32
VIRAM Code
RGBtoYUV:
    vlds.u.b    r_v, r_addr, stride3, addr_inc   # load R
    vlds.u.b    g_v, g_addr, stride3, addr_inc   # load G
    vlds.u.b    b_v, b_addr, stride3, addr_inc   # load B
    xlmul.u.sv  o1_v, t0_s, r_v                  # calculate Y
    xlmadd.u.sv o1_v, t1_s, g_v
    xlmadd.u.sv o1_v, t2_s, b_v
    vsra.vs     o1_v, o1_v, s_s
    xlmul.u.sv  o2_v, t3_s, r_v                  # calculate U
    xlmadd.u.sv o2_v, t4_s, g_v
    xlmadd.u.sv o2_v, t5_s, b_v
    vsra.vs     o2_v, o2_v, s_s
    vadd.sv     o2_v, a_s, o2_v
    xlmul.u.sv  o3_v, t6_s, r_v                  # calculate V
    xlmadd.u.sv o3_v, t7_s, g_v
    xlmadd.u.sv o3_v, t8_s, b_v
    vsra.vs     o3_v, o3_v, s_s
    vadd.sv     o3_v, a_s, o3_v
    vsts.b      o1_v, y_addr, stride3, addr_inc  # store Y

33
MMX Code (1)
  • RGBtoYUV
  • movq mm1, eax
  • pxor mm6, mm6
  • movq mm0, mm1
  • psrlq mm1, 16
  • punpcklbw mm0, ZEROS
  • movq mm7, mm1
  • punpcklbw mm1, ZEROS
  • movq mm2, mm0
  • pmaddwd mm0, YR0GR
  • movq mm3, mm1
  • pmaddwd mm1, YBG0B
  • movq mm4, mm2
  • pmaddwd mm2, UR0GR
  • movq mm5, mm3
  • pmaddwd mm3, UBG0B
  • punpckhbw mm7, mm6
  • pmaddwd mm4, VR0GR
  • paddd mm0, mm1
  • paddd mm4, mm5
  • movq mm5, mm1
  • psllq mm1, 32
  • paddd mm1, mm7
  • punpckhbw mm6, ZEROS
  • movq mm3, mm1
  • pmaddwd mm1, YR0GR
  • movq mm7, mm5
  • pmaddwd mm5, YBG0B
  • psrad mm0, 15
  • movq TEMP0, mm6
  • movq mm6, mm3
  • pmaddwd mm6, UR0GR
  • psrad mm2, 15
  • paddd mm1, mm5
  • movq mm5, mm7
  • pmaddwd mm7, UBG0B
  • psrad mm1, 15
  • pmaddwd mm3, VR0GR

34
MMX Code (2)
  • paddd mm6, mm7
  • movq mm7, mm1
  • psrad mm6, 15
  • paddd mm3, mm5
  • psllq mm7, 16
  • movq mm5, mm7
  • psrad mm3, 15
  • movq TEMPY, mm0
  • packssdw mm2, mm6
  • movq mm0, TEMP0
  • punpcklbw mm7, ZEROS
  • movq mm6, mm0
  • movq TEMPU, mm2
  • psrlq mm0, 32
  • paddw mm7, mm0
  • movq mm2, mm6
  • pmaddwd mm2, YR0GR
  • movq mm0, mm7
  • pmaddwd mm7, YBG0B
  • movq mm4, mm6
  • pmaddwd mm6, UR0GR
  • movq mm3, mm0
  • pmaddwd mm0, UBG0B
  • paddd mm2, mm7
  • pmaddwd mm4,
  • pxor mm7, mm7
  • pmaddwd mm3, VBG0B
  • punpckhbw mm1,
  • paddd mm0, mm6
  • movq mm6, mm1
  • pmaddwd mm6, YBG0B
  • punpckhbw mm5,
  • movq mm7, mm5
  • paddd mm3, mm4
  • pmaddwd mm5, YR0GR
  • movq mm4, mm1
  • pmaddwd mm4, UBG0B
  • psrad mm0, 15

35
MMX Code (3)
  • pmaddwd mm7, UR0GR
  • psrad mm3, 15
  • pmaddwd mm1, VBG0B
  • psrad mm6, 15
  • paddd mm4, OFFSETD
  • packssdw mm2, mm6
  • pmaddwd mm5, VR0GR
  • paddd mm7, mm4
  • psrad mm7, 15
  • movq mm6, TEMPY
  • packssdw mm0, mm7
  • movq mm4, TEMPU
  • packuswb mm6, mm2
  • movq mm7, OFFSETB
  • paddd mm1, mm5
  • paddw mm4, mm7
  • psrad mm1, 15
  • movq ebx, mm6
  • packuswb mm4,
  • movq ecx, mm4
  • packuswb mm5, mm3
  • add ebx, 8
  • add ecx, 8
  • movq edx, mm5
  • dec edi
  • jnz RGBtoYUV

36
Summary
  • Combination of Vectors and PIM
  • Simple execution model for hardware pushes
    complexity to compiler
  • Low power/footprint/etc.
  • PIM provides bandwidth needed by vectors
  • Vectors hide latency effectively
  • Programmability
  • Programmable from high level language
  • More compact instruction stream
  • Works well for
  • Applications with fine-grained data parallelism
  • Memory intensive problems
  • Both scientific and multimedia applications

37
The End
38
Algorithm Space
[Figure: algorithms placed along two axes, Reuse and Regularity]
Search
Two-sided dense linear algebra
FFTs
Gröbner Basis (Symbolic LU)
Sorting
Sparse iterative solvers
Asynchronous discrete event simulation
One-sided dense linear algebra
Sparse direct solvers
39
VIRAM Overview
  • MIPS core (200 MHz)
  • Single-issue, 8 Kbyte I and D caches
  • Vector unit (200 MHz)
  • 32 64b elements per register
  • 256b datapaths, (16b, 32b, 64b ops)
  • 4 address generation units
  • Main memory system
  • 13 MB of on-chip DRAM in 8 banks
  • 12.8 GBytes/s peak bandwidth
  • Typical power consumption 2.0 W
  • Peak vector performance
  • 1.6/3.2/6.4 Gops without multiply-add
  • 1.6 Gflops (single-precision)
  • Fabrication by IBM
  • Tape-out in O(1 month)

40
Benchmarks for Scientific Problems
  • Dense Matrix-vector multiplication
  • Compare to hand-tuned codes on conventional
    machines
  • Transitive-closure (small and large data sets)
  • On a dense graph representation
  • NSA Giga-Updates Per Second (GUPS, 16-bit and
    64-bit)
  • Fetch-and-increment a stream of random
    addresses
  • Sparse matrix-vector product
  • Order 10000, nonzeros 177820
  • Computing a histogram
  • Used for image processing of a 16-bit greyscale
    image 1536 x 1536
  • 2 algorithms: 64-element sorting kernel and
    privatization
  • Also used in sorting
  • 2D unstructured mesh adaptation
  • initial grid 4802 triangles, final grid 24010

43
Power Efficiency
  • Huge power/performance advantage in VIRAM from
    both
  • PIM technology
  • Data parallel execution model
    (compiler-controlled)

47
Analysis of a Multi-PIM System
  • Machine Parameters
  • Floating point performance
  • PIM-node dependent
  • Application dependent, not theoretical peak
  • Amount of memory per processor
  • Use 1/10th Algorithm data
  • Communication Overhead
  • Time processor is busy sending a message
  • Cannot be overlapped
  • Communication Latency
  • Time across the network (can be overlapped)
  • Communication Bandwidth
  • Single node and bisection
  • Back-of-the-envelope calculations!

48
Real Data from an Old Machine (T3E)
  • UPC uses a global address space
  • Non-blocking remote put/get model
  • Does not cache remote data

49
Running Sparse MVM on a Pflop PIM
  • 1 GHz * 8 pipes * 8 ALUs/pipe = 64 GFLOPS/node
    peak
  • 8 Address generators limit performance to 16
    Gflops
  • 500ns latency, 1 cycle put/get overhead, 100
    cycle MP overhead
  • Programmability differences too: packing vs.
    global address space
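
The node-level figures on this slide follow from simple arithmetic, sketched below. The assumptions that each address generator issues one indexed access per cycle and that each access feeds one multiply and one add (2 flops) are mine; the slide states only the resulting numbers.

```python
def node_peak_gflops(clock_ghz, pipes, alus_per_pipe):
    """Peak rate if every ALU completes one flop per cycle."""
    return clock_ghz * pipes * alus_per_pipe


def address_limited_gflops(clock_ghz, address_generators, flops_per_access=2):
    """Rate when each flop pair waits on one indexed access (assumed)."""
    return clock_ghz * address_generators * flops_per_access


peak = node_peak_gflops(1, 8, 8)
limited = address_limited_gflops(1, 8)
```

Under these assumptions sparse matrix-vector multiply can reach at most a quarter of the node's peak, no matter how much raw bandwidth the memory provides.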

50
Effect of Memory Size
  • For small memory nodes or smaller problem sizes
  • Low overhead is more important
  • For large memory nodes and large problems packing
    is better

51
Conclusions
  • Performance advantage for PIMs depends on
    application
  • Need fine-grained parallelism to utilize on-chip
    bandwidth
  • Data parallelism is one model with the usual
    trade-offs
  • Hardware and programming simplicity
  • Limited expressibility
  • Largest advantages for PIMs are power and
    packaging
  • Enables Peta-scale machine
  • Multiprocessor PIMs should be easier to program
  • At least at scale of current machines (Tflops)
  • Can we get rid of the current programming model
    hierarchy?

52
Benchmarks
  • Kernels
  • Designed to stress memory systems
  • Some taken from the Data Intensive Systems
    Stressmarks
  • Unit and constant stride memory
  • Dense matrix-vector multiplication
  • Transitive-closure
  • Constant stride
  • FFT
  • Indirect addressing
  • NSA Giga-Updates Per Second (GUPS)
  • Sparse Matrix Vector multiplication
  • Histogram calculation (sorting)
  • Frequent branching as well as irregular memory
    access
  • Unstructured mesh adaptation

53
Conclusions and VIRAM Future Directions
  • VIRAM outperforms Pentium III on Scientific
    problems
  • With lower power and clock rate than the Mobile
    Pentium
  • Vectorization techniques developed for the Cray
    PVPs applicable.
  • PIM technology provides low power, low cost
    memory system.
  • Similar combination used in Sony Playstation.
  • Small ISA changes can have large impact
  • Limited in-register permutations sped up 1K FFT
    by 5x.
  • Memory system can still be a bottleneck
  • Indexed/variable stride costly, due to address
    generation.
  • Future work
  • Ongoing investigations into impact of lanes,
    subbanks
  • Technical paper in preparation expect
    completion 09/01
  • Run benchmark on real VIRAM chips
  • Examine multiprocessor VIRAM configurations

54
Management Plan
  • Roles of different groups and PIs
  • Senior researchers working on particular class of
    benchmarks
  • Parry: sorting and histograms
  • Sherry: sparse matrices
  • Lenny: unstructured mesh adaptation
  • Brian: simulation
  • Jin and Hyun: specific benchmarks
  • Plan to hire additional postdoc for next year
    (focus on Imagine)
  • Undergrad model used for targeted benchmark
    efforts
  • Plan for using computational resources at NERSC
  • Few resources used, except for comparisons

55
Future Funding Prospects
  • FY2003 and beyond
  • DARPA initiated DIS program
  • Related projects are continuing under Polymorphic
    Computing
  • New BAA coming in High Productivity Systems
  • Interest from other DOE labs (LANL) in general
    problem
  • General model
  • Most architectural research projects need
    benchmarking
  • Work has higher quality if done by people who
    understand apps.
  • Expertise for hardware projects is different
    system level design, circuit design, etc.
  • Interest from both the IRAM and Imagine groups
    shows the level of interest

56
Long Term Impact
  • Potential impact on Computer Science
  • Promote research of new architectures and
    micro-architectures
  • Understand future architectures
  • Preparation for procurements
  • Provide visibility of NERSC in core CS research
    areas
  • Correlate applications DOE vs. large market
    problems
  • Influence future machines through research
    collaborations

57
Benchmark Performance on IRAM Simulator
  • IRAM (200 MHz, 2 W) versus Mobile Pentium III
    (500 MHz, 4 W)

58
Project Goals for FY02 and Beyond
  • Use established data-intensive scientific
    benchmarks with other emerging architectures
  • IMAGINE (Stanford Univ.)
  • Designed for graphics and image/signal processing
  • Peak 20 GFLOPS (32-bit FP)
  • Key features: vector processing, VLIW, a
    streaming memory system. (Not a PIM-based
    design.)
  • Preliminary discussions with Bill Dally.
  • DIVA (DARPA-sponsored USC/ISI)
  • Based on PIM smart memory design, but for
    multiprocessors
  • Move computation to data
  • Designed for irregular data structures and
    dynamic databases.
  • Discussions with Mary Hall about benchmark
    comparisons

59
Media Benchmarks
  • FFT uses in-register permutations, generalized
    reduction
  • All others written in C with Cray vectorizing
    compiler

60
Integer Benchmarks
  • Strided access important, e.g., RGB
  • narrow types limited by address generation
  • Outer loop vectorization and unrolling used
  • helps avoid short vectors
  • spilling can be a problem

61
Status of benchmarking software release
[Diagram: release components, arranged from Optimized to Unoptimized]
  • Optimized vector histogram code
  • Optimized GUPS inner loop
  • GUPS docs
  • Pointer Jumping w/Update
  • Vector histogram code generator
  • GUPS C codes
  • Conjugate Gradient (Matrix)
  • Neighborhood
  • Pointer Jumping
  • Transitive
  • Field
  • Standard random number generator
  • Test cases (small and large working sets)
  • Build and test scripts (Makefiles, timing,
    analysis, ...)
  • Future work
  • Write more documentation, add better test cases
    as we find them
  • Incorporate media benchmarks, AMR code, library
    of frequently-used compiler flags pragmas

62
Status of benchmarking work
  • Two performance models
  • simulator (vsim-p), and trace analyzer (vsimII)
  • Recent work on vsim-p
  • Refining the performance model for
    double-precision FP performance.
  • Recent work on vsimII
  • Making the backend modular
  • Goal: model different architectures w/ same ISA.
  • Fixing bugs in the memory model of the VIRAM-1
    backend.
  • Better comments in code for better
    maintainability.
  • Completing a new backend for a new decoupled
    cluster architecture.

63
Comparison with Mobile Pentium
  • GUPS: VIRAM gets 6x more GUPS

Data element width     16 bit   32 bit   64 bit
Mobile Pentium GUPS    .045     .046     .036
VIRAM GUPS             .295     .295     .244
Transitive, Pointer, Update: VIRAM 30-50% faster than P-III
Execution time for VIRAM rises much more slowly with data
size than for P-III
64
Sparse CG
  • Solve Ax = b; sparse matrix-vector
    multiplication dominates.
  • Traditional CRS format requires
  • Indexed load/store for X/Y vectors
  • Variable vector length, usually short
  • Other formats for better vectorization
  • CRS with narrow band (e.g., RCM ordering)
  • Smaller strides for X vector
  • Segmented-Sum (Modified the old code developed
    for Cray PVP)
  • Long vector length, of same size
  • Unit stride
  • ELL format: make all rows the same length by
    padding with zeros
  • Long vector length, of same size
  • Extra flops
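
The ELL trade-off can be made concrete with a small sketch. The conversion from a row-pointer (CRS) layout and the names `csr_to_ell` and `spmv_ell` are illustrative assumptions, not the code used for the measurements.

```python
def csr_to_ell(row_ptr, col_index, value):
    """Pad every row with zeros to the length of the longest row."""
    n = len(row_ptr) - 1
    width = max(row_ptr[i + 1] - row_ptr[i] for i in range(n))
    ell_cols = [[0] * width for _ in range(n)]
    ell_vals = [[0.0] * width for _ in range(n)]
    for i in range(n):
        for k, j in enumerate(range(row_ptr[i], row_ptr[i + 1])):
            ell_cols[i][k] = col_index[j]
            ell_vals[i][k] = value[j]
    return ell_cols, ell_vals, width


def spmv_ell(ell_cols, ell_vals, width, x):
    """y = A*x; the inner loop over rows is one long, fixed-length vector."""
    n = len(ell_vals)
    y = [0.0] * n
    for k in range(width):
        for i in range(n):  # vectorizable: same length for every k
            y[i] += ell_vals[i][k] * x[ell_cols[i][k]]
    return y


# A = [[2, 0, 1], [0, 3, 0], [4, 0, 5]]: 5 nonzeros, padded to 3 * 2 = 6
cols, vals, width = csr_to_ell([0, 2, 3, 5], [0, 2, 1, 0, 2],
                               [2.0, 1.0, 3.0, 4.0, 5.0])
y = spmv_ell(cols, vals, width, [1.0, 2.0, 3.0])
padded_entries = len(vals) * width
```

The padded zeros are multiplied like any other entry, so the raw flop rate overstates useful work whenever row lengths vary widely.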

65
SMVM Performance
  • DIS matrix: N = 10000, M = 177820 (~17 nonzeros
    per row)
  • Mobile PIII (500 MHz)
  • CRS: 35 MFLOPS
  • IRAM results (MFLOPS)

SubBanks                  1           2           4           8
CRS                       91          106         109         110
CRS banded                110         110         110         110
SEG-SUM                   135         154         163         165
ELL (4.6x more flops)     511 (111)   570 (124)   612 (133)   632 (137)
66
2D Unstructured Mesh Adaptation
  • Powerful tool for efficiently solving
    computational problems with evolving physical
    features (shocks, vortices, shear layers, crack
    propagation)
  • Complicated logic and data structures
  • Difficult to achieve high efficiency
  • Irregular data access patterns (pointer chasing)
  • Many conditionals / integer intensive
  • Adaptation is a tool for making numerical solution
    cost effective
  • Three types of element subdivision

67
Vectorization Strategy and Performance Results
  • Color elements based on vertices (not edges)
  • Guarantees no conflicts during vector operations
  • Vectorize across each subdivision (1:2, 1:3, 1:4)
    one color at a time
  • Difficult: many conditionals, low flops,
    irregular data access, dependencies
  • Initial grid 4802 triangles, Final grid 24010
    triangles
  • Preliminary results demonstrate VIRAM 4.5x faster
    than Mobile Pentium III 500
  • Higher code complexity (requires graph coloring and
    reordering)
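
The coloring step can be illustrated with a simple greedy pass. This is a sketch under assumptions: the slides do not say which coloring algorithm was used, and the function names and tiny mesh below are hypothetical.

```python
def color_elements(triangles):
    """Greedy coloring: triangles that share a vertex get different colors."""
    vertex_colors = {}  # vertex -> colors already used by triangles touching it
    colors = []
    for tri in triangles:
        used = set()
        for v in tri:
            used |= vertex_colors.get(v, set())
        c = 0
        while c in used:  # pick the smallest color not seen at any vertex
            c += 1
        colors.append(c)
        for v in tri:
            vertex_colors.setdefault(v, set()).add(c)
    return colors


def group_by_color(colors):
    """Collect triangle numbers per color; each group is conflict-free."""
    groups = {}
    for t, c in enumerate(colors):
        groups.setdefault(c, []).append(t)
    return groups


mesh = [(0, 1, 2), (1, 2, 3), (3, 4, 5), (6, 7, 8)]
colors = color_elements(mesh)
groups = group_by_color(colors)
```

Triangles within one group touch disjoint vertex sets, so a vector operation over the group cannot write the same vertex twice. The extra pass and the reordering it implies are part of the added code complexity the slide mentions.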

Time (ms)
Pentium III 500   1 Lane   2 Lanes   4 Lanes
61                18       14        13