Tools for High Performance Scientific Computing - PowerPoint PPT Presentation

1 / 78
About This Presentation
Title:

Tools for High Performance Scientific Computing

Description:

Titanium – PowerPoint PPT presentation

Number of Views:106
Avg rating:3.0/5.0
Slides: 79
Provided by: KathyY5
Category:

less

Transcript and Presenter's Notes

Title: Tools for High Performance Scientific Computing


1
Tools for High Performance Scientific Computing
Kathy YelickU.C. Berkeley
  • http//www.cs.berkeley.edu/yelick/

2
HPC Problems and Approaches
  • Parallel machines are too hard to program
  • Users left behind with each new major
    generation
  • Efficiency is too low
  • Even after a large programming effort
  • Single digit efficiency numbers are common
  • Approach
  • Titanium A modern (Java-based) language that
    provides performance transparency
  • Sparsity Self-tuning scientific kernels
  • IRAM Integrated processor-in-memory

3
Titanium A Global Address Space Language Based
on Java
  • Faculty
  • Susan Graham
  • Paul Hilfinger
  • Katherine Yelick
  • Alex Aiken
  • LBNL collaborators
  • Phillip Colella
  • Peter McQuorquodale
  • Mike Welcome
  • Students
  • Dan Bonachea
  • Szu-Huey Chang
  • Carrie Fei
  • Ben Liblit
  • Robert Lin
  • Geoff Pike
  • Jimmy Su
  • Ellen Tsai
  • Mike Welcome (LBNL)
  • Siu Man Yau

http//titanium.cs.berkeley.edu/
4
Global Address Space Programming
  • Intermediate point between message passing and
    shared memory
  • Program consists of a collection of processes.
  • Fixed at program startup time, like MPI
  • Local and shared data, as in shared memory model
  • But, shared data is partitioned over local
    processes
  • Remote data stays remote on distributed memory
    machines
  • Processes communicate by reads/writes to shared
    variables
  • Note These are not data-parallel languages
  • Examples are UPC, Titanium, CAF, Split-C
  • E.g., http//upc.nersc.gov

5
Titanium Overview
  • Object-oriented language based on Java with
  • Scalable parallelism
  • SPMD model with global address space
  • Multidimensional arrays
  • points and index sets as first-class values
  • Immutable classes
  • user-definable non-reference types for
    performance
  • Operator overloading
  • by demand from our user community
  • Semi-automated memory management
  • uses memory regions for high performance

6
SciMark Benchmark
  • Numerical benchmark for Java, C/C
  • Five kernels
  • FFT (complex, 1D)
  • Successive Over-Relaxation (SOR)
  • Monte Carlo integration (MC)
  • Sparse matrix multiply
  • dense LU factorization
  • Results are reported in Mflops
  • Download and run on your machine from
  • http//math.nist.gov/scimark2
  • C and Java sources also provided

Roldan Pozo, NIST, http//math.nist.gov/Rpozo
7
SciMark Java vs. C(Sun UltraSPARC 60)
Sun JDK 1.3 (HotSpot) , javac -0 Sun cc -0
SunOS 5.7
Roldan Pozo, NIST, http//math.nist.gov/Rpozo
8
Can we do better without the JVM?
  • Pure Java with a JVM (and JIT)
  • Within 2x of C and sometimes better
  • OK for many users, even those using high end
    machines
  • Depends on quality of both compilers
  • We can try to do better using a traditional
    compilation model
  • E.g., Titanium compiler at Berkeley
  • Compiles Java extension to C
  • Does not optimize Java arrays or for loops
    (prototype)

9
Java Compiled by Titanium Compiler
10
SciMark on Pentium III (550 MHz)
11
SciMark on Pentium III (550 MHz)
12
Language Support for Performance
  • Multidimensional arrays
  • Contiguous storage
  • Support for sub-array operations without copying
  • Support for small objects
  • E.g., complex numbers
  • Called immutables in Titanium
  • Sometimes called value classes
  • Unordered loop construct
  • Programmer specifies iteration independent
  • Eliminates need for dependence analysis short
    term solution? Used by vectorizing compilers.

13
Optimizing Parallel Code
  • Compiler writers would like to move code around
  • The hardware folks also want to build hardware
    that dynamically moves operations around
  • When is reordering correct?
  • Because the programs are parallel, there are more
    restrictions, not fewer
  • The reason is that we have to preserve semantics
    of what may be viewed by other processors

14
Sequential Consistency
  • Given a set of executions from n processors, each
    defines a total order Pi.
  • The program order is the partial order given by
    the union of these Pi s.
  • The overall execution is sequentially consistent
    if there exists a correct total order that is
    consistent with the program order.

write x 1 read y ? 0
When this is serialized, the read and write
semantics must be preserved
write y 3 read z? 2
read x? 1 read y ? 3
15
Use of Memory Fences
  • Memory fences can turn a weak memory model into
    sequential consistency under proper
    synchronization
  • Add a read-fence to acquire lock operation
  • Add a write fence to release lock operation
  • In general, a language can have a stronger model
    than the machine it runs if the compiler is
    clever
  • The language may also have a weaker model, if the
    compiler does any optimizations

16
Compiler Analysis Overview
  • When compiling sequential programs, compute
    dependencies
  • Valid if y not in expr1 and x not in expr2
    (roughly)
  • When compiling parallel code, we need to consider
    accesses by other processors.

x expr1 y expr2
y expr2 x expr1
Initially flag data 0 Proc A Proc
B data 1 while (flag 0) flag 1
... ...data...
17
Cycle Detection
  • Processors define a program order on accesses
    from the same thread
  • P is the union of these total orders
  • Memory system define an access order on
    accesses to the same variable
  • A is access order (read/write
    write/write pairs)
  • A violation of sequential consistency is cycle in
    P U A ShashSnir

18
Cycle Analysis Intuition
  • Definition is based on execution model, which
    allows you to answer the question Was this
    execution sequentially consistent?
  • Intuition
  • Time cannot flow backwards
  • Need to be able to construct total order
  • Examples (all variables initially 0)

write data 1 read data 1
write data 1 read flag 1
write flag 1 read data 0
write flag 1 read flag 0
19
Cycle Detection Generalization
  • Generalizes to arbitrary numbers of variables and
    processors
  • Cycles may be arbitrarily long, but it is
    sufficient to consider only minimal cycles with 1
    or 2 consecutive stops per processor
  • Can simplify the analysis by assuming all
    processors run a copy of the same code

write x write y read y
read y write
x
20
Static Analysis for Cycle Detection
  • Approximate P by the control flow graph
  • Approximate A by undirected conflict edges
  • Bi-directional edge between accesses to the same
    variable in which at least one is a write
  • It is still correct if the conflict edge set is a
    superset of the reality
  • Let the delay set D be all edges from P that
    are part of a minimal cycle
  • The execution order of D edge must be preserved
    other P edges may be reordered (modulo usual
    rules about serial code)

21
Cycle Detection in Practice
  • Cycle detection was implemented in a prototype
    version of the Split-C and Titanium compilers.
  • Split-C version used many simplifying
    assumptions.
  • Titanium version had too many conflict edges.
  • What is needed to make it practical?
  • Finding possibly-concurrent program blocks
  • Use SPMD model rather than threads to simplify
  • Or apply data race detection work for Java
    threads
  • Compute conflict edges
  • Need good alias analysis
  • Reduce size by separating shared/private
    variables
  • Synchronization analysis

22
Communication Optimizations
  • Data on an old machine, UCB NOW, using a simple
    subset of C

Time (normalized)
23
Global Address Space
  • To run shared memory programs on distributed
    memory hardware, we replace references (pointers)
    by global ones
  • May point to remote data
  • Useful in building large, complex data structures
  • Easy to port shared-memory programs
    (functionality is correct)
  • Uniform programming model across machines
  • Especially true for cluster of SMPs
  • Usual implementation
  • Each reference contains
  • Processor id (or process id on cluster of SMPs)
  • And a memory address on that processor

24
Use of Global / Local
  • Global pointers are more expensive than local
  • When data is remote, it turns into a remote read
    or write) which is a message call of some kind
  • When the data is not remote, there is still an
    overhead
  • space (processor number memory address)
  • dereference time (check to see if local)
  • Conclusion not all references should be global
    -- use normal references when possible.
  • Titanium adds local qualifier to language

25
Local Pointer Analysis
  • Compiler can infer locals using Local
    Qualification Inference
  • Data structures must be well partitioned

26
Region-Based Memory Management
Other processes
  • Processes allocate locally
  • References can be passed to other processes

Process 0
LOCAL HEAP
LOCAL HEAP
class C int val... C gv // global
pointer C local lv // local pointer if
(thisProc() 0) lv new C() gv
broadcast lv from 0 gv.val ... ...
gv.val
27
Parallel Applications
  • Genome Application
  • Heart simulation
  • AMR elliptic and hyperbolic solvers
  • Scalable Poisson for infinite domains
  • Genome application
  • Several smaller benchmarks EM3D, MatMul, LU,
    FFT, Join,

28
Heart Simulation
  • Problem compute blood flow in the heart
  • Modeled as an elastic structure in an
    incompressible fluid.
  • The immersed boundary method Peskin and
    McQueen.
  • 20 years of development in model
  • Many other applications blood clotting, inner
    ear, paper making, embryo growth, and more
  • Can be used for design
    of prosthetics
  • Artificial heart valves
  • Cochlear implants

29
AMR Gas Dynamics
  • Developed by McCorquodale and Colella
  • 2D Example (3D supported)
  • Mach-10 shock on solid surface
    at
    oblique angle
  • Future Self-gravitating gas dynamics package

30
Benchmarks for GAS Languages
  • EEL End to end latency or time spent sending a
    short message between two processes.
  • BW Large message network bandwidth
  • Parameters of the LogP Model
  • L Latencyor time spent on the network
  • During this time, processor can be doing other
    work
  • O Overhead or processor busy time on the
    sending or receiving side.
  • During this time, processor cannot be doing other
    work
  • We distinguish between send and recv overhead
  • G gap the rate at which messages can be
    pushed onto the network.
  • P the number of processors
  • This work was done with the UPC group at LBL

31
LogP Overhead Latency
  • Non-overlapping overhead
  • Send and recv overhead can overlap

P0
osend
orecv
P1
EEL osend L orecv
EEL f(osend, L, orecv)
32
Benchmarks
  • Designed to measure the network parameters
  • Also provide gap as function of queue depth
  • Measured for best case in general
  • Implemented once in MPI
  • For portability and comparison to target specific
    layer
  • Implemented again in target specific
    communication layer
  • LAPI
  • ELAN
  • GM
  • SHMEM
  • VIPL

33
Results EEL and Overhead
34
Results Gap and Overhead
35
Send Overhead Over Time
  • Overhead has not improved significantly T3D was
    best
  • Lack of integration lack of attention in software

36
Summary
  • Global address space languages offer alternative
    to MPI for large machines
  • Easier to use shared data structures
  • Recover users left behind on shared memory?
  • Performance tuning still possible
  • Implementation
  • Small compiler effort given lightweight
    communication
  • Portable communication layer GASNet
  • Difficulty with small message performance on IBM
    SP platform

37
Future Plans
  • Merge communication layer with UPC
  • Unified Parallel C has broad vendor support.
  • Uses some execution model as Titanium
  • Push vendors to expose low-overhead communication
  • Automated communication overlap
  • Analysis and refinement of cache optimizations
  • Additional support for unstructured grids
  • Conjugate gradient and particle methods are
    motivations
  • Better uniprocessor optimizations, possibly new
    arrays

38
Sparsity Self-Tuning Scientific Kernels
Faculty James Demmel Katherine Yelick Graduate
Students Rich Vuduc Eun-Jim Im
  • Undergraduates
  • Shoaib Kamil
  • Rajesh Nishtala
  • Benjamin Lee
  • Hyun-Jin Moon
  • Atilla Gyulassy
  • Tuyet-Linh Phan

http//www.cs.berkeley.edu/yelick/sparsity
39
Context High-Performance Libraries
  • Application performance dominated by a few
    computational kernels
  • Today Kernels hand-tuned by vendor or user
  • Performance tuning challenges
  • Performance is a complicated function of kernel,
    architecture, compiler, and workload
  • Tedious and time-consuming
  • Successful automated approaches
  • Dense linear algebra PHiPAC/ATLAS
  • Signal processing FFTW/SPIRAL/UHFFT

40
Tuning pays off ATLAS
Extends applicability of PHIPAC Incorporated in
Matlab (with rest of LAPACK)
41
Tuning Sparse Matrix Kernels
  • Performance tuning issues in sparse linear
    algebra
  • Indirect, irregular memory references
  • High bandwidth requirements, poor instruction mix
  • Performance depends on architecture, kernel, and
    matrix
  • How to select data structures, implementations?
    at run-time?
  • Typical performance lt 10 machine peak
  • Our approach to automatic tuning for each
    kernel,
  • Identify and generate a space of implementations
  • Search the space to find the fastest one (models,
    experiments)

42
Sparsity System Organization
  • Optimizations depend on machine and matrix
    structure
  • Choosing optimization is expensive

Representative Matrix
Data Structure Definition Code
Sparsity machine profiler
Machine Profile
Sparsity optimizer
Matrix Conversion routine
Maximum vectors
43
Sparse Kernels and Optimizations
  • Kernels
  • Sparse matrix-vector multiply (SpMV) yAx
  • Sparse triangular solve (SpTS) xT-1b
  • yATAx, yAATx
  • Powers (yAkx), sparse triple-product (RART),
  • Optimization (implementation) space
  • A has special structure (e.g., symmetric, banded,
    )
  • Register blocking
  • Cache blocking
  • Multiple dense vectors (x)
  • Hybrid data structures (e.g., splitting,
    switch-to-dense, )
  • Matrix reordering

44
Register Blocking Optimization
  • Identify a small dense blocks of nonzeros.
  • Fill in extra zeros to complete blocks
  • Use an optimized multiplication code for the
    particular block size.

2x2 register blocked matrix
2 1 4 3
0 2 1 2
1 2
1 0
0 3 3 1
2 5 1 4
3 0 3 2
0 4 1 2
2
1
2
3
0
2
4
1
2
5
0
1
0
0
1
3
0
2
1
3
0
5
7
3
0
4
1
1
  • Improves register reuse, lowers indexing
    overhead.
  • Filling in zeros increases storage and computation

45
Register Blocking Performance Model
  • Estimate performance of register blocking
  • Estimated raw performance Mflop/s of dense
    matrix in sparse rxc blocked format
  • Estimated overhead to fill in rxc blocks
  • Maximize over rxc
  • Estimated raw performance
  • Estimated overhead
  • Use sampling to further reduce time, row and
    column dimensions are computed separately

46
Machine Profiles Computed Offline
Register blocking performance for a dense matrix
in sparse format.
333 MHz Sun Ultra 2i
500 MHz Intel Pentium III
375 MHz IBM Power3
800 MHz Intel Itanium
47
Register Blocked SpMV Performance Ultra 2i
(See upcoming SC02 paper for a detailed
analysis.)
48
Register Blocked SpMV Performance P-III
49
Register Blocked SpMV Performance Power3
Additional low-level performance tuning is likely
to help on the Power3.
50
Register Blocked SpMV Performance Itanium
51
Multiple Vector Optimization
  • Better potential for reuse A is reused
  • Loop unrolled codes multiplying across vectors
    are generated by a code generator.

x
j1
y
a
i2
y
ij
i1
  • Allows reuse of matrix elements.
  • Choosing the of vectors affects both
    performance and higher level algorithm.

52
Multiple Vector Performance Itanium
53
Multiple Vector Performance Pentium 4
54
Exploiting Additional Matrix Structure
  • Symmetry (numerical or structural)
  • Reuse matrix entries
  • Can combine with register blocking, multiple
    vectors,
  • Large matrices with random structure
  • E.g., Latent Semantic Indexing (LSI) matrices
  • Technique cache blocking
  • Store matrix as 2i x 2j sparse submatrices
  • Useful when source vector is large
  • Currently, search to find fastest size

55
Symmetric SpMV Performance Pentium 4
56
Cache Blocking Optimization
  • Keeping part of source vector in cache

Source vector (x)

Destination Vector (y)
Sparse matrix(A)
  • Improves cache reuse of source vector.
  • Used for nearly random nonzero patterns
  • When source vector does not fit in cache

57
Cache Blocked SpMV on LSI Matrix Ultra 2i
58
Tuning Sparse Triangular Solve (SpTS)
  • Compute xL-1b where T sparse lower (upper)
    triangular, x b dense
  • L arising in sparse LU factorization have rich
    dense substructure
  • Dense trailing triangle can account for 2090 of
    matrix non-zeros
  • SpTS optimizations
  • Split into sparse trapezoid and dense trailing
    triangle
  • Use dense BLAS (DTRSV) on dense triangle
  • Use Sparsity register blocking on sparse part
  • Tuning parameters
  • Size of dense trailing triangle
  • Register block size

59
Example Sparse Triangular Factor
  • Raefsky4 (structural problem) SuperLU colmmd
  • N19779, nnz12.6 M

60
Sparse/Dense Partitioning for SpTS
  • Partition L into sparse (L1,L2) and dense LD
  • Perform SpTS in three steps
  • Sparsity optimizations for (1)(2) DTRSV for (3)

61
SpTS Performance Itanium
(See POHLL 02 workshop paper, at ICS 02.)
62
SpTS Performance Power3
63
Sustainable Memory Bandwidth
64
Summary and Future Work
  • Applying new optimizations
  • Other split data structures (variable block,
    diagonal, )
  • Matrix reordering to create block structure
  • Structural symmetry
  • New kernels (triple product RART, powers Ak, )
  • Tuning parameter selection
  • Building an automatically tuned sparse matrix
    library
  • Extending the Sparse BLAS
  • Leverage existing sparse compilers as code
    generation infrastructure
  • More thoughts on this topic tomorrow

65
IRAM Intelligent RAM
  • Faculty
  • Dave Patterson
  • Katherine Yelick
  • Graduate Students
  • Christoforos Kozyrakis
  • Joe Gebis
  • Sam Williams
  • Manikandan Narayanan
  • Iakovos Kosmidakis
  • Iaonnis Kosmidakis
  • LBNL Collaborators (Benchmarking)
  • Parry Husbands
  • Brain Gaeke
  • Xiaoye Li
  • Leonid Oliker
  • Staff
  • Dave Judd
  • Steve Pope

66
Motivation
  • Observation Current cache-based supercomputers
    perform at a small fraction of peak for memory
    intensive problems (particularly irregular ones)
  • E.g. Optimized Sparse Matrix-Vector
    Multiplication runs at 20 of floating point
    peak on 1.5GHz P4
  • Even worse when parallel efficiency considered
  • Overall lt10 efficiency is typical for many
    applications
  • Performance directly related to memory system
    design
  • But gap between processor performance and DRAM
    access times continues to grow (60/yr vs. 7/yr)
  • Is memory bandwidth the problem?

67
VIRAM Overview
  • MIPS core (200 MHz)
  • Main memory system
  • 13 MB of on-chip DRAM
  • Large on-chip bandwidth6.4 GBytes/s peak to
    vector unit
  • Vector unit
  • Efficient way to express fine-grained parallelism
    and exploit bandwidth
  • Typical power consumption 2.0 W
  • Peak vector performance
  • 1.6/3.2/6.4 Gops
  • 1.6 Gflops (single-precision)
  • Fabrication by IBM
  • Taping out now
  • Our results use simulator with Crays vcc compiler

68
Our Task
  • Evaluate use of processor-in-memory (PIM) chips
    as a building block for high performance machines
  • For now focus on serial performance
  • Benchmark VIRAM on Scientific Computing kernels
  • Originally for multimedia applications
  • Can we use on-chip DRAM for vector processing vs.
    the conventional SRAM? (DRAM denser)
  • Isolate performance limiting features of
    architectures
  • More than just memory bandwidth

69
Benchmarks Considered
  • Most taken for DARPAs DIS Benchmark Suite
  • Transitive-closure (small large data set)
  • NSA Giga-Updates Per Second (GUPS, 16-bit
    64-bit)
  • Fetch-and-increment a stream of random
    addresses
  • Sparse matrix-vector product
  • Order 10000, nonzeros 177820
  • Computing a histogram
  • Different algorithms 64-elements sorting kernel
    privatization retry
  • 2D unstructured mesh adaptation

Transitive GUPS SPMV Histogram Mesh
Ops/step 2 1 2 1 N/A
Mem/step 2 ld 1 st 2 ld 2 st 3 ld 2 ld 1 st N/A
70
Power and Performance on BLAS-2
  • 100x100 matrix vector multiplication (column
    layout)
  • VIRAM result compiled, others hand-coded or Atlas
    optimized
  • VIRAM performance improves with larger matrices
  • VIRAM power includes on-chip main memory
  • 8-lane version of VIRAM nearly doubles MFLOPS

71
Performance Comparison
  • IRAM designed for media processing
  • Low power was a higher priority than high
    performance
  • IRAM is better for apps with sufficient
    parallelism

72
Power Efficiency
  • Huge power/performance advantage in VIRAM from
    both
  • PIM technology
  • Data parallel execution model (compiler-controlled
    )

73
Power Efficiency
  • Same data on a log plot
  • Includes low power processors (Mobile PIII)
  • The same picture for operations/cycle

74
Is Memory Bandwidth the Limit?
  • What is the bottleneck in each case?
  • Transitive and GUPS are limited by bandwidth
    (near 6.4GB/s peak)
  • SPMV and Mesh limited by address generation and
    bank conflicts
  • For Histogram there is insufficient parallelism

75
Computation Memory Balance
Imagine SRF Imagine Memory IRAM SX-6 Itanium
Clock Rate (MHz) 500 500 200 500 800
Bandwidth (Gbytes/s) 32 2.7 6.4 32 2.1
Single precision flop rate (Gflop/s) 20 20 1.6 8 3.2
Ratio (flop/word) 2.5 30 1 1 6.1
  • Approximate

76
Vector Add Example
  • Vector add operation is memory-intensive
  • 2 loads, 1 store, 1 arithmetic op
  • Imagine runs at small fraction of peak due to
    high computation to memory ratio.
  • VIRAM - 370 MOPS (23.13 of peak)
  • Imagine - 170 MOPS (0.85 of peak)

77
Imagine Streams vs. IRAM Vectors
  • Imagine advantages
  • 8 SIMD VLIW clusters gt higher absolute peak at
    low control overhead
  • Extra level of memory (local registers) are good
    for short vectors
  • To high latency to off-chip memory, need
  • Long vectors (streams of length gtgt 64)
  • Good temporal locality
  • VIRAM advantages
  • High memory bandwidth helps many applications
  • 64 element vectors are sufficient to hide memory
    latency
  • Only one level of memory hierarchy to worry about
    (Two levels in Imagine)
  • Programmability
  • VIRAM has C compiler, which is easy to use and is
    proven technology
  • Imagines Stream C and Kernel C give users more
    control

78
Summary
  • Both IRAM and Imagine depend on parallelism
  • Programmability advantage to VIRAM
  • All benchmarks vectorized by the VIRAM compiler
    (Cray vectorizer)
  • With restructuring and hints from programmers
  • Performance advantage of VIRAM
  • Large on applications limited only by bandwidth
  • More address generators/sub-banks would help
    irregular performance
  • Performance/Power advantage of VIRAM
  • Over both low power and high performance
    processors
  • Both PIM and data parallelism are key
  • Imagine results preliminary
  • Great peak performance for programs with good
    temporal locality
Write a Comment
User Comments (0)
About PowerShow.com