Title: UPC at NERSC/LBNL

1
UPC at NERSC/LBNL
  • Kathy Yelick, Christian Bell, Dan Bonachea,
  • Yannick Cote, Jason Duell, Paul Hargrove,
  • Parry Husbands, Costin Iancu, Mike Welcome
  • NERSC, U.C. Berkeley, and Concordia U.

2
Overview of NERSC Effort
  • Three components
  • Compilers
  • IBM SP platform and PC clusters are main targets
  • Portable compiler infrastructure (UPC-to-C)
  • Optimization of communication and global pointers
  • Runtime systems for multiple compilers
  • Allow use by other languages (Titanium and CAF)
  • And in other UPC compilers
  • Performance evaluation
  • Applications and benchmarks
  • Currently looking at the NAS Parallel Benchmarks (NPB)
  • Evaluating language and compilers
  • Plan to do a larger application next year

3
NERSC UPC Compiler
  • Compiler being developed by Costin Iancu
  • Based on Open64 compiler for C
  • Originally developed at SGI
  • Has IA64 backend with some ongoing development
  • Software available on SourceForge
  • Can be used as a C-to-C translator
  • Can generate C either before most optimizations
  • Or after them, but that path is known to be buggy right now
  • Status
  • Parses and type-checks UPC
  • Finishing code generation for the UPC-to-C translator
  • Code generation for SMPs underway

4
Compiler Optimizations
  • Based on lessons learned from
  • Titanium ("UPC in Java")
  • Split-C (one of the UPC predecessors)
  • Optimizations
  • Pointer optimizations
  • Optimization of phase-less pointers
  • Turn global pointers into local ones (see the sketch after this list)
  • Overlap
  • Split-phase
  • Merge synchs at barrier
  • Aggregation
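
A minimal UPC sketch of the pointer-privatization idea (the array and loop here are hypothetical, not compiler output): when an element has affinity to the accessing thread, the pointer-to-shared may legally be cast to an ordinary C pointer, avoiding shared-pointer arithmetic on every access.

  #include <upc.h>

  shared double a[100*THREADS];   /* default cyclic layout: a[i] lives on thread i % THREADS */

  void scale_my_elements(double s)
  {
      int i;
      /* Visit only the elements with affinity to MYTHREAD, and access
         each one through a plain C pointer instead of a shared pointer. */
      for (i = MYTHREAD; i < 100*THREADS; i += THREADS) {
          double *p = (double *)&a[i];   /* safe: upc_threadof(&a[i]) == MYTHREAD */
          *p *= s;
      }
  }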

[Figure: Split-C performance data on the CM-5]
5
Portable Runtime Support
  • Developing a runtime layer that can be easily
    ported and tuned to multiple architectures.

[Figure: layered runtime stack, top to bottom]
  • UPCNet: global pointers (an opaque type with a rich set of pointer operations), memory management, job startup, etc.
  • Generic support for UPC, CAF, Titanium
  • GASNet Extended API: supports put, get, locks, barrier, bulk, scatter/gather; parts of the full GASNet can be implemented directly on the network
  • GASNet Core API: small interface based on Active Messages; the core alone is sufficient for a functional implementation
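
Since the GASNet spec is still in draft (see Future Plans), the fragment below is only a sketch of how the Extended API is intended to be used; the helper function, buffer names, and depth limit are hypothetical, while gasnet_put_nb and gasnet_wait_syncnb_all follow the draft's non-blocking put and sync naming.

  #include <stddef.h>
  #include <gasnet.h>

  #define MAX_DEPTH 16

  /* Hypothetical helper: keep up to 'depth' non-blocking puts in flight
     to one peer, then wait for all of them; this is the pipelining
     pattern measured in the queue-depth experiments later on. */
  static void pipelined_puts(gasnet_node_t peer, char *remote_base,
                             char *local_base, size_t chunk, int depth)
  {
      gasnet_handle_t h[MAX_DEPTH];
      int i, n = (depth < MAX_DEPTH) ? depth : MAX_DEPTH;

      for (i = 0; i < n; i++)          /* issue puts back-to-back */
          h[i] = gasnet_put_nb(peer, remote_base + i*chunk,
                               local_base + i*chunk, chunk);
      gasnet_wait_syncnb_all(h, n);    /* block until every put completes */
  }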
6
Portable Runtime Support
  • Full runtime designed to be used by multiple
    compilers
  • NERSC compiler based on Open64
  • Intrepid compiler based on gcc
  • Communication layer designed to run on multiple
    machines
  • Hardware shared memory (direct load/store)
  • IBM SP (LAPI)
  • Myrinet 2K (GM)
  • Quadrics (Elan3)
  • Dolphin
  • VIA and InfiniBand, in anticipation of future
    networks
  • MPI for portability
  • Use communication micro-benchmarks to choose
    optimizations

7
Possible Optimizations
  • Use of lightweight communication
  • Converting reads to writes (or reverse)
  • Overlapping communication with communication (pipelining)
  • Overlapping communication with computation
  • Aggregating small messages into larger ones
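
The aggregation item, for instance, can already be expressed at the UPC source level with the bulk copy routines; a sketch with hypothetical array names and sizes:

  #include <upc.h>

  #define N 512
  shared [] double src[N];      /* indefinite block size: entire array on thread 0 */
  double dst[N];                /* private per-thread buffer */

  void fine_grained(void)       /* N fine-grained shared reads (remote for every thread but 0) */
  {
      int i;
      for (i = 0; i < N; i++)
          dst[i] = src[i];
  }

  void aggregated(void)         /* one bulk get amortizes the per-message overhead */
  {
      upc_memget(dst, src, N * sizeof(double));
  }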

8
MPI vs. LAPI on the IBM SP
  • LAPI generally faster than MPI
  • Non-Blocking (relaxed) faster than blocking

9
Overlapping Computation IBM SP
  • Nearly all software overhead; no computation
    overlap
  • Recall: 36 usec blocking, 12 usec non-blocking

10
Conclusions for IBM SP
  • LAPI is better than MPI
  • Reads/Writes roughly the same cost
  • Overlapping communication with communication
    (pipelining) is important
  • Overlapping communication with computation
  • Important if no communication overlap
  • Minimal value if > 2 messages overlapped
  • Large messages are still much more efficient
  • Data is generally noisy and hard to control

11
Other Machines
  • Observations
  • Low latency reveals programming advantage
  • T3E is still much better than the other networks

12
Applications Status
  • Short term goal
  • Evaluate language and compilers using small
    applications
  • Longer term: identify a large application
  • Conjugate Gradient
  • Shows the advantage of the T3E network model and UPC
  • Performance on the Compaq machine is worse in
  • Serial code
  • Communication performance
  • Simple n² particle simulation (see the sketch after this list)
  • Currently working on NAS MG
  • Need for shared array arithmetic optimizations
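
For reference, a minimal sketch of what the n² loop looks like in UPC; the particle count, layout, and 1-D force law are all hypothetical, and the shared-array indexing in the inner loop is exactly where the arithmetic optimizations mentioned above matter.

  #include <upc.h>

  #define PPT 256                  /* particles per thread (hypothetical) */
  #define NP  (PPT*THREADS)
  shared double pos[PPT*THREADS], force[PPT*THREADS];  /* default cyclic layout */

  void compute_forces(void)
  {
      int i, j;
      /* Affinity clause: each thread runs the iterations whose force[i]
         it owns; the pos[j] reads in the inner loop may be remote. */
      upc_forall (i = 0; i < NP; i++; &force[i]) {
          double f = 0.0, xi = pos[i];
          for (j = 0; j < NP; j++)
              if (j != i) {
                  double d = pos[j] - xi;
                  f += (d >= 0 ? 1.0 : -1.0) / (d*d + 1e-9);
              }
          force[i] = f;
      }
  }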

13
Future Plans
  • This month
  • Draft of runtime spec
  • Draft of GASNet spec
  • This year
  • Initial runtime implementation on shared memory
  • Runtime implementation on distributed memory
    (M2K, SP)
  • NERSC compiler release 1.0b for IBM SP
  • Next year
  • Compiler release for PC cluster
  • Development of CLUMP compiler
  • Begin large application effort
  • More GASNet implementations
  • Advanced analysis and optimizations

14
Runtime Breakout
  • How many runtime systems?
  • Compaq MTU
  • LBNL/Intrepid
  • Language issues
  • Locks
  • Richard Stallman's?
  • upc_phaseof for pointers with indefinite block size (see the sketch after this list)
  • Misc
  • Runtime extensions
  • Strided and scatter/gather memory copy
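
To make the upc_phaseof question concrete, a small sketch with a hypothetical blocked array; the open issue is what the phase should be for a pointer with indefinite block size.

  #include <upc.h>
  #include <stdio.h>

  shared [4] int blocked[4*THREADS];   /* block size 4 */
  shared [] int *indef;                /* indefinite block size: one thread owns it all */

  void show_phase(void)
  {
      /* blocked[5] sits in the second block (thread 1), at offset 1 into it */
      printf("thread=%d phase=%d\n",
             (int)upc_threadof(&blocked[5]),
             (int)upc_phaseof(&blocked[5]));
      /* upc_phaseof(indef) is the breakout question: an indefinite block
         never wraps, so it is unclear what the phase should mean. */
  }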

15
Read/Write Behavior
  • Negligible difference between blocking read and
    write performance

16
Overlapping Communication
  • Effects of pipelined communication are
    significant
  • 8 overlapped messages are sufficient to saturate
    the network interface

[Figure: performance vs. queue depth]
17
Overlapping Computation
  • Same experiment, but fix total amount of
    computation

18
SPMV on Compaq/Quadrics
  • Seeing 15 usec latency for small messages
  • Data for 1 thread per node

19
Optimization Strategy
  • Optimization of communication is key to making
    UPC more usable
  • Two problems
  • Analysis of code to determine which optimizations
    are legal
  • Use of performance models to select
    transformations to improve performance
  • Focus on the second problem here

20
Runtime Status
  • Characterizing network performance
  • Low latency (low overhead) -> programmability
  • Specification of portable runtime
  • Communication layer (UPC, Titanium, Co-Array
    Fortran)
  • Built on a small core layer; interoperability is a
    major concern
  • Full runtime has memory management, job startup,
    etc.

21
What is UPC?
  • UPC is an explicitly parallel language
  • Global address space: can read/write remote
    memory (see the sketch below)
  • Programmer control over
    layout and scheduling
  • From Split-C, AC, PCP
  • Why a new language?
  • Easier to use than MPI, especially for programs
    with complicated data structures
  • Possibly faster on some machines, but current
    goal is comparable performance
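
A minimal (hypothetical) UPC program showing the global address space: any thread reads or writes any element of a shared array directly, with no message-passing calls.

  #include <upc.h>
  #include <stdio.h>

  shared int counts[THREADS];

  int main(void)
  {
      counts[MYTHREAD] = MYTHREAD;    /* write to my own element */
      upc_barrier;                    /* wait for all threads */
      if (MYTHREAD == 0) {            /* thread 0 reads every element */
          int i, sum = 0;
          for (i = 0; i < THREADS; i++)
              sum += counts[i];       /* reads may touch remote memory */
          printf("sum = %d\n", sum);
      }
      return 0;
  }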

[Figure: threads p0, p1, p2 sharing a global address space]
22
Background
  • UPC efforts elsewhere
  • IDA (T3E implementation based on an old gcc)
  • GMU (documentation) and UMC (benchmarking)
  • Compaq (Alpha cluster and C+MPI compiler (with
    MTU))
  • Cray, Sun, and HP (implementations)
  • Intrepid (SGI compiler and T3E compiler)
  • UPC Book
  • T. El-Ghazawi, B. Carlson, T. Sterling, K. Yelick
  • Three components of NERSC effort
  • Compilers (SP and PC clusters) and optimization
    (DOE/UPC)
  • Runtime systems for multiple compilers
    (DOE/Pmodels NSA)
  • Applications and benchmarks
    (DOE/UPC)

23
Overlapping Computation on Quadrics
8-Byte non-blocking put on Compaq/Quadrics