UPC at NERSC/LBNL

Provided by: yel3

Learn more at: https://people.eecs.berkeley.edu

Transcript and Presenter's Notes

1
UPC at NERSC/LBNL

Kathy Yelick, Christian Bell, Dan Bonachea,
Yannick Cote, Jason Duell, Paul Hargrove,
Parry Husbands, Costin Iancu, Mike Welcome
NERSC, U.C. Berkeley, and Concordia U.

2
Overview of NERSC Effort

Three components
Compilers
IBM SP platform and PC clusters are main targets
Portable compiler infrastructure (UPC->C)
Optimization of communication and global pointers
Runtime systems for multiple compilers
Allow use by other languages (Titanium and CAF)
And in other UPC compilers
Performance evaluation
Applications and benchmarks
Currently looking at NAS PB
Evaluating language and compilers
Plan to do a larger application next year

3
NERSC UPC Compiler

Compiler being developed by Costin Iancu
Based on Open64 compiler for C
Originally developed at SGI
Has IA64 backend with some ongoing development
Software available on SourceForge
Can use as C-to-C translator
Can generate C either before most optimizations
Or after, but this is known to be buggy right now
Status
Parses and type-checks UPC
Finishing code generation for UPC->C translator
Code generation for SMPs underway

4
Compiler Optimizations

Based on lessons learned from
Titanium: UPC in Java
Split-C: one of the UPC predecessors
Optimizations
Pointer optimizations

Optimization of phase-less pointers
Turn global pointers into local ones
Overlap
Split-phase
Merge synchs at barrier
Aggregation

Split-C data on CM-5
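
The pointer optimizations listed above can be illustrated with a small C sketch. It assumes a hypothetical three-field representation of a UPC pointer-to-shared (thread, phase, offset); the struct and function names are illustrative and are not taken from the NERSC compiler. It shows why phase-less pointers (block size 1 or indefinite block size) admit cheaper arithmetic than the general case.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical representation of a UPC pointer-to-shared. */
typedef struct {
    size_t thread;  /* thread with affinity to the element */
    size_t phase;   /* position inside the current block, 0..block-1 */
    size_t offset;  /* element index in the owning thread's local storage */
} shared_ptr_t;

/* General case: advance by one element with block size `block`
   over `threads` threads (block-cyclic layout). */
static shared_ptr_t sptr_inc(shared_ptr_t p, size_t block, size_t threads)
{
    assert(block > 0 && threads > 0 && p.phase < block && p.thread < threads);
    p.phase++;
    p.offset++;
    if (p.phase == block) {          /* ran off the end of the block */
        p.phase = 0;
        p.thread++;
        p.offset -= block;           /* same block row on the next thread */
        if (p.thread == threads) {   /* wrapped past the last thread */
            p.thread = 0;
            p.offset += block;       /* next block row on thread 0 */
        }
    }
    return p;
}

/* Block size 1 (cyclic): phase is always 0, so it is never updated. */
static shared_ptr_t sptr_inc_cyclic(shared_ptr_t p, size_t threads)
{
    assert(threads > 0 && p.thread < threads);
    p.thread++;
    if (p.thread == threads) {
        p.thread = 0;
        p.offset++;
    }
    return p;
}

/* Indefinite block size: all elements live on one thread,
   so the increment is plain local pointer arithmetic. */
static shared_ptr_t sptr_inc_indef(shared_ptr_t p)
{
    p.offset++;
    return p;
}
```

Starting from element 0 and calling sptr_inc seven times with block size 2 on 3 threads lands on thread 0, phase 1, local offset 3, which matches the block-cyclic rule thread = (i / block) % threads.
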
5
Portable Runtime Support

Developing a runtime layer that can be easily
ported and tuned to multiple architectures.

Direct implementations of parts of full GASNet
UPCNet: Global pointers (opaque type with rich set of pointer operations), memory management, job startup, etc.
Generic support for UPC, CAF, Titanium
GASNet Extended API: Supports put, get, locks, barrier, bulk, scatter/gather
GASNet Core API: Small interface based on Active Messages
Core sufficient for functional implementation
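
The layering above, where a small Active Message core is enough to implement the extended operations, can be sketched in plain C. This is an in-process simulation under stated assumptions: every name here (sim_am_request, put_handler, sim_put, sim_segment) is hypothetical and is not the GASNet API, and "delivery" is just a function call rather than a network transfer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SIM_NODES 2
#define SIM_MEM   64

/* Simulated per-node memory segments; a real layer would sit on network hardware. */
static unsigned char sim_segment[SIM_NODES][SIM_MEM];

/* An active message names a handler that runs on the destination node. */
typedef void (*am_handler_t)(int node, size_t arg,
                             const void *payload, size_t nbytes);

/* Core layer (hypothetical): deliver a message by running its handler at the target. */
static void sim_am_request(int dest, am_handler_t handler, size_t arg,
                           const void *payload, size_t nbytes)
{
    assert(dest >= 0 && dest < SIM_NODES);
    handler(dest, arg, payload, nbytes);  /* in-process stand-in for delivery */
}

/* Handler used by put: copy the payload into the target segment at offset `arg`. */
static void put_handler(int node, size_t arg,
                        const void *payload, size_t nbytes)
{
    assert(arg <= SIM_MEM && nbytes <= SIM_MEM - arg);
    memcpy(&sim_segment[node][arg], payload, nbytes);
}

/* Extended-style operation built only from the core primitive. */
static void sim_put(int dest, size_t offset, const void *src, size_t nbytes)
{
    sim_am_request(dest, put_handler, offset, src, nbytes);
}
```

A tuned port could replace sim_put with a direct hardware put while leaving the core untouched, which is the sense in which the core alone gives a functional implementation.
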
6
Portable Runtime Support

Full runtime designed to be used by multiple
compilers
NERSC compiler based on Open64
Intrepid compiler based on gcc
Communication layer designed to run on multiple
machines
Hardware shared memory (direct load/store)
IBM SP (LAPI)
Myrinet 2K (GM)
Quadrics (Elan3)
Dolphin
VIA and Infiniband in anticipation of future
networks
MPI for portability
Use communication micro-benchmarks to choose
optimizations

7
Possible Optimizations

Use of lightweight communication
Converting reads to writes (or reverse)
Overlapping communication with communication
Overlapping communication with computation
Aggregating small messages into larger ones

8
MPI vs. LAPI on the IBM SP

LAPI generally faster than MPI
Non-Blocking (relaxed) faster than blocking

9
Overlapping Computation: IBM SP

Nearly all software overhead, so no computation overlap
Recall: 36 usec blocking, 12 usec nonblocking

10
Conclusions for IBM SP

LAPI is better than MPI
Reads/Writes roughly the same cost
Overlapping communication with communication
(pipelining) is important
Overlapping communication with computation
Important if no communication overlap
Minimal value if > 2 messages overlapped
Large messages are still much more efficient
Generally noisy data: hard to control

11
Other Machines

Observations
Low latency reveals programming advantage
T3E is still much better than the other networks

12
Applications Status

Short term goal
Evaluate language and compilers using small
applications
Longer term, identify large application

Conjugate Gradient
Show advantage of T3E network model and UPC
Performance on Compaq machine worse
Serial code
Communication performance
Simple n^2 particle simulation
Currently working on NAS MG
Need for shared array arithmetic optimizations

13
Future Plans

This month
Draft of runtime spec
Draft of GASNet spec
This year
Initial runtime implementation on shared memory
Runtime implementation on distributed memory
(M2K, SP)
NERSC compiler release 1.0b for IBM SP
Next year
Compiler release for PC cluster
Development of CLUMP compiler
Begin large application effort
More GASNet implementations
Advanced analysis and optimizations

14
Runtime Breakout

How many runtime systems?
Compaq MTU
LBNL/Intrepid
Language issues
Locks
Richard Stallmans ?
upc_phaseof for pointers with indef. block size
Misc
Runtime extensions
Strided and scatter/gather memcopy
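
A strided memcopy extension of the kind listed above might look like the following local C sketch. The name strided_memcpy and its parameter order are assumptions made for illustration; the slides do not give a signature, and a runtime version would move data between threads rather than within one address space.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical strided copy: move `count` chunks of `chunk` bytes, where
   consecutive chunks start `src_stride` / `dst_stride` bytes apart. */
static void strided_memcpy(void *dst, size_t dst_stride,
                           const void *src, size_t src_stride,
                           size_t chunk, size_t count)
{
    /* Chunks must not overlap each other on either side. */
    assert(chunk <= dst_stride && chunk <= src_stride);
    unsigned char *d = dst;
    const unsigned char *s = src;
    for (size_t i = 0; i < count; i++)
        memcpy(d + i * dst_stride, s + i * src_stride, chunk);
}
```

For example, extracting one column of a row-major 3x4 int matrix uses a source stride of 4 * sizeof(int) and a destination stride of sizeof(int), so a single call replaces three separate small copies.
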

15
Read/Write Behavior

Negligible difference between blocking read and
write performance

16
Overlapping Communication

Effects of pipelined communication are
significant
8 overlapped messages are sufficient to saturate
NI

(Plot axis: queue depth)
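
One simple way to reason about why a fixed number of overlapped messages saturates the network interface is a two-parameter model, sketched here in C. The model and both function names are assumptions, not the authors' analysis: each message takes latency_ns end to end, and the NI can inject a new message at most once every gap_ns.

```c
#include <assert.h>

/* Smallest queue depth at which the NI is kept busy: ceil(latency / gap). */
static unsigned saturation_depth(unsigned latency_ns, unsigned gap_ns)
{
    assert(gap_ns > 0);
    return (latency_ns + gap_ns - 1) / gap_ns;
}

/* Steady-state time per message with `depth` messages in flight:
   latency is amortized over the queue until the injection gap dominates. */
static double per_message_ns(unsigned latency_ns, unsigned gap_ns,
                             unsigned depth)
{
    assert(depth > 0);
    double pipelined = (double)latency_ns / depth;
    return pipelined > gap_ns ? pipelined : (double)gap_ns;
}
```

With made-up parameters of 40000 ns latency and a 5000 ns gap, the model saturates at depth 8 and deeper queues give no further gain. Those two numbers are chosen only to illustrate the shape of the curve and are not measurements from the slides.
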
17
Overlapping Computation

Same experiment, but fix total amount of
computation

18
SPMV on Compaq/Quadrics

Seeing 15 usec latency for small msgs
Data for 1 thread per node

19
Optimization Strategy

Optimization of communication is key to making
UPC more usable
Two problems
Analysis of code to determine which optimizations
are legal
Use of performance models to select
transformations to improve performance
Focus on the second problem here
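
The second problem, using a performance model to select a transformation, can be sketched as a small C cost model. Everything here is an assumption made for illustration: the linear model, the net_model_t fields, and the three candidate transformations are not the authors' model, and the parameters would come from communication micro-benchmarks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical machine parameters, to be filled in from micro-benchmarks. */
typedef struct {
    double overhead_us;      /* CPU time to issue one non-blocking message */
    double latency_us;       /* end-to-end time of one blocking small message */
    double us_per_byte;      /* inverse bandwidth for payload bytes */
    double pack_us_per_byte; /* cost of copying into an aggregation buffer */
} net_model_t;

typedef enum { XFORM_BLOCKING, XFORM_PIPELINED, XFORM_AGGREGATED } xform_t;

/* Predicted time to move `count` messages of `bytes_each` bytes. */
static double predict_us(const net_model_t *m, xform_t x,
                         size_t count, size_t bytes_each)
{
    assert(m != NULL && count > 0);
    double bytes = (double)count * (double)bytes_each;
    double payload = bytes * m->us_per_byte;
    switch (x) {
    case XFORM_BLOCKING:   /* pay full latency for every message */
        return (double)count * m->latency_us + payload;
    case XFORM_PIPELINED:  /* pay overhead per message, latency once */
        return (double)count * m->overhead_us
             + (m->latency_us - m->overhead_us) + payload;
    case XFORM_AGGREGATED: /* one message, plus the cost of packing */
        return m->latency_us + payload + bytes * m->pack_us_per_byte;
    }
    return 0.0;
}

/* Pick the transformation with the lowest predicted time. */
static xform_t choose_xform(const net_model_t *m,
                            size_t count, size_t bytes_each)
{
    xform_t best = XFORM_BLOCKING;
    double best_us = predict_us(m, best, count, bytes_each);
    for (int x = XFORM_PIPELINED; x <= XFORM_AGGREGATED; x++) {
        double t = predict_us(m, (xform_t)x, count, bytes_each);
        if (t < best_us) {
            best_us = t;
            best = (xform_t)x;
        }
    }
    return best;
}
```

Using the 36 usec blocking and 12 usec nonblocking figures quoted for the IBM SP as latency and overhead, and invented per-byte costs, the model prefers aggregation when packing is cheap and pipelining when packing is expensive. Legality of each transformation is the first problem and is not modeled here.
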

20
Runtime Status

Characterizing network performance
Low latency (low overhead) -> programmability
Specification of portable runtime
Communication layer (UPC, Titanium, Co-Array
Fortran)
Built on small core layer; interoperability a
major concern
Full runtime has memory management, job startup,
etc.

21
What is UPC?

UPC is an explicitly parallel language
Global address space
can read/write remote memory
Programmer control over
layout and scheduling
From Split-C, AC, PCP
Why a new language?
Easier to use than MPI, especially for programs
with complicated data structures
Possibly faster on some machines, but current
goal is comparable performance

(Figure labels: p0, p1, p2)
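
Programmer control over layout can be made concrete with a short C sketch of UPC's block-cyclic affinity rule, where element i of an array with block size B on T threads belongs to thread (i / B) % T. The helper names below are hypothetical; the second function counts how many neighbor reads a[i+1] would cross a thread boundary, as a rough stand-in for remote memory traffic.

```c
#include <assert.h>
#include <stddef.h>

/* Thread with affinity to element `i` under block size `block`
   on `threads` threads (UPC block-cyclic rule). */
static size_t owner_thread(size_t i, size_t block, size_t threads)
{
    assert(block > 0 && threads > 0);
    return (i / block) % threads;
}

/* Number of i in [0, n-1) for which element i+1 lives on a different
   thread than element i, i.e. reads that would be remote. */
static size_t remote_neighbor_reads(size_t n, size_t block, size_t threads)
{
    size_t remote = 0;
    for (size_t i = 0; i + 1 < n; i++)
        if (owner_thread(i, block, threads) != owner_thread(i + 1, block, threads))
            remote++;
    return remote;
}
```

For 12 elements on 3 threads, a cyclic layout (block size 1) makes all 11 neighbor reads remote, while block size 4 leaves only 2, which is the kind of difference a layout choice can make for a nearest-neighbor loop.
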
22
Background

UPC efforts elsewhere
IDA: T3E implementation based on old gcc
GMU (documentation) and UMC (benchmarking)
Compaq (Alpha cluster and C+MPI compiler (with
MTU))
Cray, Sun, and HP (implementations)
Intrepid (SGI compiler and T3E compiler)
UPC Book
T. El-Ghazawi, B. Carlson, T. Sterling, K. Yelick
Three components of NERSC effort
Compilers (SP and PC clusters) and optimization (DOE/UPC)
Runtime systems for multiple compilers (DOE/Pmodels and NSA)
Applications and benchmarks (DOE/UPC)

23
Overlapping Computation on Quadrics
8-Byte non-blocking put on Compaq/Quadrics
