Communication Support for Global Address Space Languages - PowerPoint PPT Presentation

1 / 46

About This Presentation

Title:

Communication Support for Global Address Space Languages

Description:

Communication Support for Global Address Space Languages. Kathy Yelick, ... Yannick Cote, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Mike Welcome ... – PowerPoint PPT presentation

Number of Views:47

Avg rating:3.0/5.0

Slides: 47

Provided by: kathyy

Learn more at: https://people.eecs.berkeley.edu

Category:

more less

Transcript and Presenter's Notes

Title: Communication Support for Global Address Space Languages

1
Communication Support for Global Address Space
Languages

Kathy Yelick, Christian Bell, Dan Bonachea,
Yannick Cote, Jason Duell, Paul Hargrove,
Parry Husbands, Costin Iancu, Mike Welcome
NERSC/LBNL, U.C. Berkeley, and Concordia U.

2
Outline

What is a Global Address Space Language?
Programming advantages
Potential performance advantage
Application example
Possible optimizations
LogP Model
Cost on current networks

3
Two Programming Models

Shared memory
Programming is easier
Can build large shared data structures
Machines dont scale
Typically, SMPs lt 16 processors, DSM lt 128
processors
Performance is hard to predict and control
Message passing
Machines easier to build and scale from commodity
parts
Programmer has control over performance
Programming is harder
Distributed data structures only in the
programmers mind
Tedious packing/unpacking of irregular data
structures
Losing programmers with each machine generation

4
Global Address-Space Languages

Unified Parallel C (UPC)
Extension of C with distributed arrays
UPC efforts
IDA t3e implementation based on old gcc
NERSC Open64 implementation generic runtime
GMU (documentation) and UMD (benchmarking)
Compaq (Alpha cluster and CMPI compiler (with
MTU))
Cray, Sun, and HP (implementations)
Intrepid (SGI compiler and t3e compiler)
Titanium (Berkeley)
Extension of Java without the JVM
Compiler available from http//titanium.cs.berkele
y.edu
Runs on most machines (shared, distributed, and
hybrid)
Some experience calling libraries in other
languages
CAF (Rice and U. Minnesota)

5
Global Address Space Programming

Intermediate point between message passing and
shared memory
Program consists of a collection of processes.
Fixed at program startup time, like MPI
Local and shared data, as in shared memory model
But, shared data is partitioned over local
processes
Remote data stays remote on distributed memory
machines
Processes communicate by reads/writes to shared
variables
Examples are UPC, Titanium, CAF, Split-C
Note These are not data-parallel languages
Compiler does not have to map the n-way loop to p
processors

6
UPC Pointers

Pointers may point to shared or private variables
Same syntax for use, just add qualifier
shared int sp
int lp
sp is a pointer to an integer residing in the
shared memory space.
sp is called a shared pointer (somewhat sloppy).
Private pointers are faster -- aliasing common

x 3
Shared
sp
sp
sp
Global address space
Private
7
Shared Arrays in UPC

Shared array elements are spread across the
threads
shared int xTHREADS /One element per
thread /
shared int y3THREADS / 3 elements per
thread /
shared int z3THREADS / 3 elements per
thread, cyclic /
In the pictures below
Assume THREADS 4
Elements with affinity to processor 0 are marked

This is really a 2D array
x
y
blocked
z
cyclic
8
Example Problem

Relaxation on a mesh (structured or not)
Also known as Sparse matrix-vector multiply

v
Color indicates the owner processor

Implementation strategies
Read values of across edges, either local or
remote
Prefetch remote
Remote processor writes values (into a ghost)
Remote processor packs values, and ship as a
block

9
Communication Requirements

One-sided communication
origin can read or write the memory of a target
node, with no explicit interaction by the target
Low latency for small messages
Hide latency with non-blocking accesses (UPC
relaxed) low software overhead
Overlap communication with communication
Overlap communication with computation
Support for bulk, scatter/gather, and collective
operations (as in MPI)
Portability to a number of architectures

10
Performance Advantage of Global Address Space
Languages

Sparse matrix-vector multiplication on a T3E

UPC model with remote reads is fastest
Small message (1 word)
Hand-coded prefetching
Thanks to Bob Lucas
Explanations
MPI on the T3E isnt very good
Remote read/write is fundamentally faster than
two-sided message passing

11
Optimization Opportunities

Introducing non-blocking communication
Currently hand optimized in Titanium code gen
Small message versions of algorithms on IBM SP

12
How Hard is the Compiler Problem?

Split-C, UPC, and Titanium experience
Small effort
Relied on lightweight communication
Distinguish between
Single thread/process analysis
Global, cross-thread analysis
Two-sided communication, gets-to-puts, strong
consistency semantics with non-blocking
implementation
Support for application level optimization key
Bulk communication, scatter-gather, etc.

13
Portable Runtime Support

Developing a runtime layer that can be easily
ported and tuned to multiple architectures.

Direct implementations of parts of full GASNet
UPCNet Global pointers (opaque type with rich
set of pointer operations), memory management,
job startup, etc.
Generic support for UPC, CAF, Titanium
GASNet Extended API Supports put, get, locks,
barrier, bulk, scatter/gather
GASNet Core API Small interface based on
Active Messages
Core sufficient for functional implementation
14
Portable Runtime Support

Full runtime designed to be used by multiple
compilers
NERSC compiler based on Open64
Intrepid compiler based on gcc
Communication layer designed to run on multiple
machines
Hardware shared memory (direct load/store)
IBM SP (LAPI)
Myrinet 2K (GM)
Quadrics (Elan3)
Dolphin
VIA and Infiniband in anticipation of future
networks
MPI for portability
Use communication micro-benchmarks to choose
optimizations

15
Core API Active Messages

Super-Lightweight RPC
Unordered, reliable delivery with "user"-provided
handlers
Request/reply messages
3 sizes small (lt32 bytes),medium (lt512 bytes),
large (DMA)
Very general - provides extensibility
Available for implementing compiler-specific
operations
scatter-gather or strided memory access, remote
allocation,
Already implemented on a number of interconnects
MPI, LAPI, UDP/Ethernet, Via, Myrinet, and others
Allow a number of message servicing paradigms
Interrupts, main-thread polling, NIC-thread
polling or some combination

16
Extended API Remote memory operations

Want an orthogonal, expressive, high-performance
interface
Scalars and Bulk contiguous data
Blocking and non-blocking (returns a handle)
Also have a non-blocking form where the handle is
implicit
Non-blocking synchronization
Sync on a particular operation (using a handle)
Sync on a list of handles (some or all)
Sync on all pending reads, writes or both (for
implicit handles)
Allow polling (trysync) or blocking (waitsync)
Misc. characteristics
gets specify a destination memory address (also
have register-mem ops)
Remote addresses expressed as (node id, virtual
address)
Loopback is supported
Handles need not be explicitly freed
Knows nothing about local UPC threads, but is
thread-safe on platforms with POSIX threads

17
Extended API Remote Memory

API for remote gets/puts
void get (void dest, int node, void
src, int numbytes)
handle get_nb (void dest, int node, void
src, int numbytes)
void get_nbi(void dest, int node, void
src, int numbytes)
void put (int node, void src, void
src, int numbytes)
handle put_nb (int node, void src, void
src, int numbytes)
void put_nbi(int node, void src, void
src, int numbytes)
"nb" non-blocking with explicit handle
"nbi" non-blocking with implicit handle
Also have "value" forms for register transfers
Recognize and optimize common sizes with macros
Extensibility of core API allows easily adding
other more complicated access patterns
(scatter/gather, strided, etc)