Transcript and Presenter's Notes

Title: Communication Support for Global Address Space Languages

1
Communication Support for Global Address Space
Languages
  • Kathy Yelick, Christian Bell, Dan Bonachea,
  • Yannick Cote, Jason Duell, Paul Hargrove,
  • Parry Husbands, Costin Iancu, Mike Welcome
  • NERSC/LBNL, U.C. Berkeley, and Concordia U.

2
Outline
  • What is a Global Address Space Language?
  • Programming advantages
  • Potential performance advantage
  • Application example
  • Possible optimizations
  • LogP Model
  • Cost on current networks

3
Two Programming Models
  • Shared memory
  • Programming is easier
  • Can build large shared data structures
  • Machines don't scale
  • Typically, SMPs < 16 processors, DSM < 128
    processors
  • Performance is hard to predict and control
  • Message passing
  • Machines easier to build and scale from commodity
    parts
  • Programmer has control over performance
  • Programming is harder
  • Distributed data structures exist only in the
    programmer's mind
  • Tedious packing/unpacking of irregular data
    structures
  • Losing programmers with each machine generation

4
Global Address-Space Languages
  • Unified Parallel C (UPC)
  • Extension of C with distributed arrays
  • UPC efforts
  • IDA t3e implementation based on old gcc
  • NERSC: Open64 implementation + generic runtime
  • GMU (documentation) and UMD (benchmarking)
  • Compaq (Alpha cluster and C+MPI compiler (with
    MTU))
  • Cray, Sun, and HP (implementations)
  • Intrepid (SGI compiler and t3e compiler)
  • Titanium (Berkeley)
  • Extension of Java without the JVM
  • Compiler available from
    http://titanium.cs.berkeley.edu
  • Runs on most machines (shared, distributed, and
    hybrid)
  • Some experience calling libraries in other
    languages
  • CAF (Rice and U. Minnesota)

5
Global Address Space Programming
  • Intermediate point between message passing and
    shared memory
  • Program consists of a collection of processes.
  • Fixed at program startup time, like MPI
  • Local and shared data, as in shared memory model
  • But, shared data is partitioned over local
    processes
  • Remote data stays remote on distributed memory
    machines
  • Processes communicate by reads/writes to shared
    variables
  • Examples are UPC, Titanium, CAF, Split-C
  • Note: these are not data-parallel languages
  • Compiler does not have to map the n-way loop to p
    processors
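
A minimal UPC sketch of this model (the array name and printed values are
illustrative, not from the slides):

    #include <upc.h>
    #include <stdio.h>

    shared int data[THREADS];   /* shared data, one element per thread (affinity) */

    int main(void) {
        int mine = MYTHREAD * 10;              /* private (local) data            */
        data[MYTHREAD] = mine;                 /* write the element I own         */
        upc_barrier;                           /* wait for all writes             */
        int next = data[(MYTHREAD + 1) % THREADS];   /* possibly remote read      */
        printf("thread %d of %d read %d\n", MYTHREAD, THREADS, next);
        return 0;
    }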

6
UPC Pointers
  • Pointers may point to shared or private variables
  • Same syntax for use, just add qualifier
  • shared int *sp;
  • int *lp;
  • sp is a pointer to an integer residing in the
    shared memory space.
  • sp is called a shared pointer (somewhat sloppy).
  • Private pointers are faster -- aliasing common

[Figure: the global address space, divided into shared and private regions;
each thread's private pointer sp points to the shared value x = 3]
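
A minimal sketch of these declarations in use (x and the value 3 follow the
figure; the barrier and local variable are illustrative):

    #include <upc.h>

    shared int x;        /* lives in shared space, with affinity to thread 0      */
    shared int *sp;      /* private pointer to shared data: a "shared pointer"    */
    int *lp;             /* private pointer to private data: fastest              */

    int main(void) {
        int local = 0;
        lp = &local;                    /* ordinary C pointer                     */
        sp = &x;                        /* every thread can point at the same x   */
        if (MYTHREAD == 0) *sp = 3;     /* write through sp (remote on others)    */
        upc_barrier;
        *lp = *sp;                      /* read shared x, store into private data */
        return *lp;
    }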
7
Shared Arrays in UPC
  • Shared array elements are spread across the
    threads
  • shared int x[THREADS];    /* One element per thread */
  • shared int y[3][THREADS]; /* 3 elements per thread */
  • shared int z[3*THREADS];  /* 3 elements per thread, cyclic */
  • In the pictures below
  • Assume THREADS = 4
  • Elements with affinity to processor 0 are marked

[Figure: layouts of x, y, and z with thread 0's elements marked; y is really a
2D array laid out blocked per thread, z is cyclic]
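
A hedged sketch of looping over one of these arrays with the standard
upc_forall affinity clause (the loop body is illustrative):

    #include <upc.h>

    shared int z[3*THREADS];   /* cyclic: element i has affinity to thread i % THREADS */

    int main(void) {
        int i;
        /* The fourth clause is the affinity expression: iteration i executes  */
        /* on the thread that owns &z[i], so every access below is local.      */
        upc_forall (i = 0; i < 3*THREADS; i++; &z[i]) {
            z[i] = i;
        }
        upc_barrier;
        return 0;
    }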
8
Example Problem
  • Relaxation on a mesh (structured or not)
  • Also known as Sparse matrix-vector multiply

[Figure: mesh with vector values v on the vertices; color indicates the owner
processor]
  • Implementation strategies
  • Read values of v across edges, either local or
    remote (sketched below)
  • Prefetch remote
  • Remote processor writes values (into a ghost)
  • Remote processor packs values and ships them as a
    block
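
A rough UPC sketch of the first strategy, reading values of v across edges;
the CSR arrays, problem size, and this helper function are hypothetical and
assumed to be set up elsewhere:

    #include <upc.h>

    #define N 1024                       /* hypothetical problem size          */
    shared double v[N];                  /* source vector, spread over threads */
    shared double y[N];                  /* result vector                      */

    /* Private CSR data for the rows this thread owns (set up elsewhere). */
    extern int     nrows_local, first_row;
    extern int    *rowptr, *colidx;
    extern double *vals;

    void spmv_local_rows(void) {
        for (int r = 0; r < nrows_local; r++) {
            double sum = 0.0;
            for (int k = rowptr[r]; k < rowptr[r+1]; k++)
                /* v[colidx[k]] may be a remote read across an edge; the       */
                /* programmer or compiler can prefetch it or make it           */
                /* non-blocking.                                               */
                sum += vals[k] * v[colidx[k]];
            y[first_row + r] = sum;
        }
    }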

9
Communication Requirements
  • One-sided communication
  • origin can read or write the memory of a target
    node, with no explicit interaction by the target
  • Low latency for small messages
  • Hide latency with non-blocking accesses (UPC
    "relaxed"); low software overhead
  • Overlap communication with communication
  • Overlap communication with computation
  • Support for bulk, scatter/gather, and collective
    operations (as in MPI)
  • Portability to a number of architectures

10
Performance Advantage of Global Address Space
Languages
  • Sparse matrix-vector multiplication on a T3E
  • UPC model with remote reads is fastest
  • Small message (1 word)
  • Hand-coded prefetching
  • Thanks to Bob Lucas
  • Explanations
  • MPI on the T3E isn't very good
  • Remote read/write is fundamentally faster than
    two-sided message passing

11
Optimization Opportunities
  • Introducing non-blocking communication
  • Currently hand optimized in Titanium code gen
  • Small message versions of algorithms on IBM SP

12
How Hard is the Compiler Problem?
  • Split-C, UPC, and Titanium experience
  • Small effort
  • Relied on lightweight communication
  • Distinguish between
  • Single thread/process analysis
  • Global, cross-thread analysis
  • Two-sided communication, gets-to-puts, strong
    consistency semantics with non-blocking
    implementation
  • Support for application level optimization key
  • Bulk communication, scatter-gather, etc.

13
Portable Runtime Support
  • Developing a runtime layer that can be easily
    ported and tuned to multiple architectures.

Layered design (top to bottom):
  • UPCNet: global pointers (an opaque type with a rich
    set of pointer operations), memory management, job
    startup, etc. -- generic support for UPC, CAF, and
    Titanium
  • GASNet Extended API: supports put, get, locks,
    barrier, bulk, scatter/gather
  • GASNet Core API: small interface based on
    Active Messages
The Core alone is sufficient for a functional
implementation; parts of the full GASNet may also be
implemented directly on the network.
14
Portable Runtime Support
  • Full runtime designed to be used by multiple
    compilers
  • NERSC compiler based on Open64
  • Intrepid compiler based on gcc
  • Communication layer designed to run on multiple
    machines
  • Hardware shared memory (direct load/store)
  • IBM SP (LAPI)
  • Myrinet 2K (GM)
  • Quadrics (Elan3)
  • Dolphin
  • VIA and InfiniBand in anticipation of future
    networks
  • MPI for portability
  • Use communication micro-benchmarks to choose
    optimizations

15
Core API Active Messages
  • Super-Lightweight RPC
  • Unordered, reliable delivery with "user"-provided
    handlers
  • Request/reply messages
  • 3 sizes: small (<32 bytes), medium (<512 bytes),
    large (DMA)
  • Very general - provides extensibility
  • Available for implementing compiler-specific
    operations
  • scatter-gather or strided memory access, remote
    allocation,
  • Already implemented on a number of interconnects
  • MPI, LAPI, UDP/Ethernet, Via, Myrinet, and others
  • Allow a number of message servicing paradigms
  • Interrupts, main-thread polling, NIC-thread
    polling or some combination
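
To make the handler idea concrete, a minimal C sketch follows; token_t,
am_request_short, and am_reply_short are placeholder names for whatever the
Core API specifies, not real functions:

    #include <stdint.h>

    typedef struct token token_t;    /* opaque: identifies the requesting node  */

    /* Placeholder core calls (hypothetical signatures, for illustration only). */
    int am_request_short(int node, int handler_id, uintptr_t a0, uintptr_t a1);
    int am_reply_short(token_t *tok, int handler_id, uintptr_t a0);

    enum { STORE_HANDLER = 2, ACK_HANDLER = 3 };     /* illustrative handler ids */

    /* "User"-provided request handler: runs on the target when a request       */
    /* arrives, performs a small remote write, then replies to the requester.   */
    void store_word_handler(token_t *tok, uintptr_t addr, uintptr_t value) {
        *(uint32_t *)addr = (uint32_t)value;
        am_reply_short(tok, ACK_HANDLER, addr);
    }

    /* Requester side: ask node 3 to store 42 at a (hypothetical) remote address. */
    void send_store_request(uintptr_t remote_addr) {
        am_request_short(3, STORE_HANDLER, remote_addr, 42);
    }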

16
Extended API Remote memory operations
  • Want an orthogonal, expressive, high-performance
    interface
  • Scalars and Bulk contiguous data
  • Blocking and non-blocking (returns a handle)
  • Also have a non-blocking form where the handle is
    implicit
  • Non-blocking synchronization
  • Sync on a particular operation (using a handle)
  • Sync on a list of handles (some or all)
  • Sync on all pending reads, writes or both (for
    implicit handles)
  • Allow polling (trysync) or blocking (waitsync)
  • Misc. characteristics
  • gets specify a destination memory address (also
    have register-mem ops)
  • Remote addresses expressed as (node id, virtual
    address)
  • Loopback is supported
  • Handles need not be explicitly freed
  • Knows nothing about local UPC threads, but is
    thread-safe on platforms with POSIX threads

17
Extended API Remote Memory
  • API for remote gets/puts
  • void   get    (void *dest, int node, void *src, int numbytes)
  • handle get_nb (void *dest, int node, void *src, int numbytes)
  • void   get_nbi(void *dest, int node, void *src, int numbytes)
  • void   put    (int node, void *dest, void *src, int numbytes)
  • handle put_nb (int node, void *dest, void *src, int numbytes)
  • void   put_nbi(int node, void *dest, void *src, int numbytes)
  • "nb" non-blocking with explicit handle
  • "nbi" non-blocking with implicit handle
  • Also have "value" forms for register transfers
  • Recognize and optimize common sizes with macros
  • Extensibility of core API allows easily adding
    other more complicated access patterns
    (scatter/gather, strided, etc)
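
A sketch of using the non-blocking form to overlap a get with computation; the
handle type, helper function, and buffer below are placeholders declared only
so the fragment stands alone:

    typedef void *handle;                 /* placeholder for the runtime's handle */
    handle get_nb(void *dest, int node, void *src, int numbytes);
    void   wait_syncnb(handle h);
    void   unrelated_work(void);          /* hypothetical local computation       */

    double buf[1024];

    void fetch_with_overlap(int node, void *remote_src) {
        handle h = get_nb(buf, node, remote_src, sizeof(buf));  /* start transfer */
        unrelated_work();                 /* overlapped with the communication    */
        wait_syncnb(h);                   /* buf is valid only after the sync     */
    }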

18
Extended API Remote Memory
  • API for get/put synchronization
  • Non-blocking ops with explicit handles
  • int  try_syncnb      (handle)
  • void wait_syncnb     (handle)
  • int  try_syncnb_some (handle *, int numhandles)
  • void wait_syncnb_some(handle *, int numhandles)
  • int  try_syncnb_all  (handle *, int numhandles)
  • void wait_syncnb_all (handle *, int numhandles)
  • Non-blocking ops with implicit handles
  • int try_syncnbi_gets()
  • void wait_syncnbi_gets()
  • int try_syncnbi_puts()
  • void wait_syncnbi_puts()
  • int try_syncnbi_all() // gets + puts
  • void wait_syncnbi_all()
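
A sketch of the implicit-handle style: issue several small puts without
tracking handles, then sync them all at once (the declarations are
placeholders so the fragment stands alone):

    void put_nbi(int node, void *dest, void *src, int numbytes);  /* placeholders */
    void wait_syncnbi_puts(void);

    /* Push one chunk to each node, then wait for every outstanding put. */
    void scatter_chunks(int nodes, void *remote_dest[], double *chunks, int bytes) {
        for (int n = 0; n < nodes; n++)
            put_nbi(n, remote_dest[n], (char *)chunks + n * bytes, bytes);
        wait_syncnbi_puts();             /* all puts issued above have completed  */
    }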

19
Extended API Other operations
  • Basic job control
  • Init, exit
  • Job layout queries: get node rank, node count
  • Common user interface for job startup
  • Synchronization
  • Named split-phase barrier (wait/notify)
  • Locking support
  • Core API provides "handler-safe" locks for
    implementing upc_locks
  • May also provide atomic compare-and-swap or
    fetch-and-increment
  • Collective communication
  • Broadcast, exchange, reductions, scans?
  • Other
  • Performance monitoring (counters)
  • Debugging support?
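
For comparison, the split-phase barrier pattern at the UPC language level uses
the standard upc_notify/upc_wait statements, which a runtime barrier like the
one above would implement:

    #include <upc.h>

    int main(void) {
        upc_notify 1;   /* signal arrival; the optional value is checked across threads */
        /* ... independent local work can overlap with the barrier here ...             */
        upc_wait 1;     /* complete the barrier: returns after all threads have notified */
        return 0;
    }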

20
Software Overhead
  • Overhead cost cannot be hidden with overlap
  • Shown here for 8-byte messages (put or send)
  • Compare to 1.5 usec for CM5 using Active Messages

21
Small Message Bandwidth
  • If overhead fills all time, there is no potential
    for overlapping computation

22
Latency (Including Overhead)
23
Large Message Bandwidth
24
What to Take Away
  • Opportunity to influence vendors to expose
    lighter weight communication
  • Overhead is most important
  • Then gap (inverse bandwidth)
  • Then latency
  • Global address space languages
  • Easier first implementation
  • Incremental performance tuning
  • Proposal for a GASNet
  • Two layers: full interface + core

25
End of Slides
26
Performance Characteristics
  • LogP model is useful for understanding small
    message performance and overlap
  • L = latency across the network
  • o = overhead (sending and receiving busy time)
  • g = gap between messages (1/rate)
  • P = number of processors

[Diagram: LogP parameters on a message timeline -- send overhead o_s, receive
overhead o_r, network latency L, and gap g between messages]
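
With these definitions, the standard LogP estimates (general properties of the
model, not measurements from these slides) are:

    \[ T_{\text{one-way}} = o_{\text{send}} + L + o_{\text{recv}} \approx 2o + L \]
    \[ T_{\text{inject}}(n) = o + (n-1)\max(o, g)
       \quad \text{(time for one sender to issue } n \text{ back-to-back messages)} \]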
27
Questions
  • Why Active Messages at the bottom?
  • Changing the PC is the minimum work
  • What about machines with sophisticated NICs?
  • Handled by direct implementation of full API
  • Why not MPI-2 one-sided?
  • Designed for application level
  • Too much synchronization required for runtime
  • Why not ARMCI?
  • Similar goals, but not designed for small
    (non-blocking) messages

28
Implications for Communication
  • Fast small message read/write simplifies
    programming
  • Non-blocking read/write may be introduced by the
    programmer or compiler
  • UPC has "relaxed" to indicate that an access need
    not happen immediately
  • Bulk and scatter/gather support will be useful
    (as in MPI)
  • Non-blocking versions may also be useful
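
At the language level the relaxed/strict distinction looks like this in
standard UPC (variable names and values are illustrative):

    #include <upc_relaxed.h>   /* make relaxed the default for this file        */

    relaxed shared int counter;      /* accesses may be reordered or overlapped  */
    strict  shared int ready_flag;   /* accesses act as ordering points          */

    int main(void) {
        if (MYTHREAD == 0) {
            counter = 42;            /* relaxed write: runtime may delay/overlap */
            ready_flag = 1;          /* strict write: fences the write above     */
        }
        return 0;
    }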

29
Overview of NERSC Effort
  • Three components
  • Compilers
  • IBM SP platform and PC clusters are main targets
  • Portable compiler infrastructure (UPC-to-C)
  • Optimization of communication and global pointers
  • Runtime systems for multiple compilers
  • Allow use by other languages (Titanium and CAF)
  • And in other UPC compilers
  • Performance evaluation
  • Applications and benchmarks
  • Currently looking at the NAS Parallel Benchmarks
  • Evaluating language and compilers
  • Plan to do a larger application next year

30
NERSC UPC Compiler
  • Compiler being developed by Costin Iancu
  • Based on Open64 compiler for C
  • Originally developed at SGI
  • Has IA64 backend with some ongoing development
  • Software available on SourceForge
  • Can use as C to C translator
  • Can either generate before most optimizations
  • Or after, but this is known to be buggy right now
  • Status
  • Parses and type-checks UPC
  • Finishing code generation for the UPC-to-C translator
  • Code generation for SMPs underway

31
Compiler Optimizations
  • Based on lessons learned from
  • Titanium (roughly, UPC in Java)
  • Split-C (one of the UPC predecessors)
  • Optimizations
  • Pointer optimizations
  • Optimization of phase-less pointers
  • Turn global pointers into local ones
  • Overlap
  • Split-phase
  • Merge synchs at barrier
  • Aggregation

[Chart: Split-C data on the CM-5]
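
As an illustration of the split-phase idea, a blocking remote read can be
turned into an early get plus a deferred sync; this sketch uses the simplified
runtime names from the earlier slides, with placeholder declarations, and is
not the compiler's actual output:

    typedef void *handle;                          /* placeholder handle type    */
    handle get_nb(void *dest, int node, void *src, int numbytes);
    void   wait_syncnb(handle h);
    int    independent_work(void);                 /* hypothetical local work    */

    /* Before: tmp = *remote_p (blocking); work(); use tmp.                      */
    /* After split-phase: issue the get as early as possible and sync just       */
    /* before the value is needed, so the transfer overlaps the local work.      */
    int split_phase_read(int node, int *remote_addr) {
        int tmp;
        handle h = get_nb(&tmp, node, remote_addr, sizeof tmp);
        int other = independent_work();
        wait_syncnb(h);
        return tmp + other;
    }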
32
Possible Optimizations
  • Use of lightweight communication
  • Converting reads to writes (or reverse)
  • Overlapping communication with communication
  • Overlapping communication with computation
  • Aggregating small messages into larger ones

33
MPI vs. LAPI on the IBM SP
  • LAPI generally faster than MPI
  • Non-Blocking (relaxed) faster than blocking

34
Overlapping Computation IBM SP
  • Nearly all software overhead; no computation
    overlap
  • Recall 36 usec blocking, 12 usec nonblocking

35
Conclusions for IBM SP
  • LAPI is better than MPI
  • Reads/Writes roughly the same cost
  • Overlapping communication with communication
    (pipelining) is important
  • Overlapping communication with computation
  • Important if no communication overlap
  • Minimal value if > 2 messages overlapped
  • Large messages are still much more efficient
  • Generally noisy data; hard to control

36
Other Machines
  • Observations
  • Low latency reveals programming advantage
  • T3E is still much better than the other networks

37
Future Plans
  • This month
  • Draft of runtime spec
  • Draft of GASNet spec
  • This year
  • Initial runtime implementation on shared memory
  • Runtime implementation on distributed memory
    (M2K, SP)
  • NERSC compiler release 1.0b for IBM SP
  • Next year
  • Compiler release for PC cluster
  • Development of CLUMP compiler
  • Begin large application effort
  • More GASNet implementations
  • Advanced analysis and optimizations

38
Read/Write Behavior
  • Negligible difference between blocking read and
    write performance

39
Overlapping Communication
  • Effects of pipelined communication are
    significant
  • 8 overlapped messages are sufficient to saturate
    NI

[Chart: bandwidth vs. queue depth]
40
Overlapping Computation
  • Same experiment, but fix total amount of
    computation

41
SPMV on Compaq/Quadrics
  • Seeing 15 usec latency for small msgs
  • Data for 1 thread per node

42
Optimization Strategy
  • Optimization of communication is key to making
    UPC more usable
  • Two problems
  • Analysis of code to determine which optimizations
    are legal
  • Use of performance models to select
    transformations to improve performance
  • Focus on the second problem here

43
Runtime Status
  • Characterizing network performance
  • Low latency (low overhead) -> programmability
  • Specification of portable runtime
  • Communication layer (UPC, Titanium, Co-Array
    Fortran)
  • Built on a small core layer; interoperability is a
    major concern
  • Full runtime has memory management, job startup,
    etc.

44
What is UPC?
  • UPC is an explicitly parallel language
  • Global address space:
    can read/write remote memory
  • Programmer control over
    layout and scheduling
  • From Split-C, AC, PCP
  • Why a new language?
  • Easier to use than MPI, especially for programs
    with complicated data structures
  • Possibly faster on some machines, but current
    goal is comparable performance

[Figure: global address space shared by processes p0, p1, p2]
45
Background
  • UPC efforts elsewhere
  • IDA t3e implementation based on old gcc
  • GMU (documentation) and UMD (benchmarking)
  • Compaq (Alpha cluster and C+MPI compiler (with
    MTU))
  • Cray, Sun, and HP (implementations)
  • Intrepid (SGI compiler and t3e compiler)
  • UPC Book
  • T. El-Ghazawi, B. Carlson, T. Sterling, K. Yelick
  • Three components of NERSC effort
  • Compilers (SP and PC clusters) optimization
    (DOE/UPC)
  • Runtime systems for multiple compilers
    (DOE/Pmodels NSA)
  • Applications and benchmarks
    (DOE/UPC)

46
Overlapping Computation on Quadrics
[Chart: 8-byte non-blocking put on Compaq/Quadrics]