VHO. inline. IPA. PREOPT. LNO. lslsl. WOPT. RVI1. UP

Provided by: costin3

Learn more at: https://upc.lbl.gov

1
NERSC/LBNL UPC Compiler Status Report

Costin Iancu
and
the UCB/LBL UPC group

2
UPC Compiler Status Report

Current Status
UPC-to-C translator implemented in Open64.
Compliant with rev 1.0 of the UPC spec.
Translates the GWU test suite and test programs from Intrepid.

3
UPC Compiler Future Work
[Figure: Open64 compilation phases: VHO, inline, IPA, PREOPT, LNO, WOPT, RVI1]
4
UPC Compiler Future Work

Integrate with GASNet and the UPC runtime
Test runtime and translator (32/64 bit)
Investigate interaction between translator and optimization packages (legal C code)
UPC specific optimizations
Open64 code generator

5
UPC Optimizations - Problems

Shared pointer - logical tuple (addr, thread, phase)
{void *addr; int thread; int phase;}
Expensive pointer arithmetic and address generation
p+i -> p.phase = (p.phase + i) % B
       p.thread = (p.thread + (p.phase + i) / B) % T
Parallelism expressed by forall and affinity test
Overhead of fine-grained communication can become prohibitive
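
The cost of the p+i update above is easier to see when written out. The following is a minimal C sketch, not the translator's actual runtime code: the type name upc_sptr, the function sptr_add, and the use of a byte offset in place of void *addr are assumptions made for illustration. The slide gives only the phase and thread updates; the addr update shown here is the usual block-cyclic bookkeeping and is also an assumption. B is the block size and T the number of threads.

```c
#include <assert.h>

/* Hypothetical stand-in for the (addr, thread, phase) tuple on the slide.
   addr is kept as a byte offset into the owning thread's local segment. */
typedef struct {
    long addr;
    int  thread;
    int  phase;
} upc_sptr;

/* Advance p by i elements (i >= 0) for block size B and T threads. */
upc_sptr sptr_add(upc_sptr p, long i, int B, int T, long elem_size)
{
    assert(i >= 0 && B >= 1 && T >= 1);

    long pos       = p.phase + i;      /* position counted from the block start */
    long blocks    = pos / B;          /* whole blocks crossed                  */
    long new_phase = pos % B;          /* p.phase = (p.phase + i) % B           */
    long tpos      = p.thread + blocks;
    long rows      = tpos / T;         /* wrap-arounds past the last thread     */

    upc_sptr q;
    q.phase  = (int)new_phase;
    q.thread = (int)(tpos % T);        /* (p.thread + (p.phase + i) / B) % T    */
    /* Assumed addr rule: each wrap-around adds one block of local storage. */
    q.addr   = p.addr + (rows * B + new_phase - p.phase) * elem_size;
    return q;
}
```

Every increment costs two divisions and two modulo operations, which is the overhead the address generation optimizations on the later slides try to remove.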

6
Translated UPC Code

#include <upc.h>
shared float *a, *b;
int main() {
  int i, k;
  /* fourth clause is the affinity expression */
  upc_forall(k = 7; k < 234; k++; &a[k])
    upc_forall(i = 0; i < 1000; i++; 333)
      a[k] = b[k+1];
}

7
UPC Optimizations

Generic scalar and loop optimizations (unrolling, pipelining)
Address generation optimizations
Eliminate run-time tests
Table lookup / Basis vectors
Simplify pointer/address arithmetic
Address components reuse
Localization
Communication optimizations
Vectorization
Message combination
Message pipelining
Prefetching for irregular data accesses

8
Run-Time Test Elimination

Problem: find the sequence of local memory locations that processor P accesses during the computation
Well explored in the context of HPF
Several techniques proposed for block-cyclic distributions
table lookup (Chatterjee, Kennedy)
basis vectors (Ramanujam, Thirumalai)
UPC layouts (cyclic, pure block, indefinite block size) are particular cases of block-cyclic
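
The last point can be made concrete with the standard block-cyclic index mapping. This is a small C sketch under stated assumptions: the function names owner_thread, local_index and pure_block_size are hypothetical, B is the block size, T the thread count, and N the array length. The indefinite layout is modeled here simply as one block covering the whole array.

```c
#include <assert.h>

/* Thread that owns global element i in a block-cyclic layout. */
int owner_thread(long i, long B, int T)
{
    assert(i >= 0 && B >= 1 && T >= 1);
    return (int)((i / B) % T);
}

/* Position of global element i inside its owner's local storage. */
long local_index(long i, long B, int T)
{
    assert(i >= 0 && B >= 1 && T >= 1);
    return (i / (B * T)) * B + i % B;
}

/* Pure block layout: one block per thread, so B = ceil(N / T). */
long pure_block_size(long N, int T)
{
    return (N + T - 1) / T;
}
```

With these definitions, cyclic is the case B = 1, pure block is B = pure_block_size(N, T), and an indefinite block size behaves like B >= N, which places every element on the first thread.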

9
Table Array Address Lookup

upc_forall(i = l; i < u; i += s; &a[i])
  a[i] = EXP();

compute T, next, start
base = startmem; i = startoffset;
while (base < endmem) {
  *base = EXP();
  base += T[i];
  i = next[i];
}
Table-based lookup (Kennedy)
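
The loop above can be tried out in plain C. The sketch below is an illustration under stated assumptions, not Kennedy's algorithm: the tables are built by brute force rather than in linear time, the names visit_naive, visit_lookup, owner, local_index and local_count are hypothetical, and a running hash of the visited local indices stands in for the store *base = EXP(). The table delta plays the role of T on the slide.

```c
#include <assert.h>

enum { MAX_B = 64 };   /* largest block size this sketch supports */

static int owner(long i, int B, int T) { return (int)((i / B) % T); }

static long local_index(long i, int B, int T)
{
    return (i / ((long)B * T)) * B + i % B;
}

/* Number of elements with global index < u that live on thread p. */
static long local_count(long u, int p, int B, int T)
{
    long cycle = (long)B * T;
    long rem = u % cycle - (long)p * B;
    if (rem < 0) rem = 0;
    if (rem > B) rem = B;
    return (u / cycle) * B + rem;
}

/* Reference version: affinity test on every iteration. */
unsigned long visit_naive(long l, long u, long s, int p, int B, int T)
{
    unsigned long h = 0;
    for (long i = l; i < u; i += s)
        if (owner(i, B, T) == p)
            h = h * 31 + (unsigned long)local_index(i, B, T);
    return h;
}

/* Table-driven version: no affinity test inside the main loop. */
unsigned long visit_lookup(long l, long u, long s, int p, int B, int T)
{
    long delta[MAX_B];   /* local-memory gap to the next owned element  */
    int  next[MAX_B];    /* block offset of that next owned element     */
    assert(B >= 1 && B <= MAX_B && T >= 1 && s > 0 && l >= 0);

    long start = l;      /* first iteration that thread p owns */
    while (start < u && owner(start, B, T) != p) start += s;
    if (start >= u) return 0;

    for (int off = 0; off < B; off++) {   /* brute-force table build */
        long i = (long)p * B + off;
        long j = i + s;
        while (owner(j, B, T) != p) j += s;
        delta[off] = local_index(j, B, T) - local_index(i, B, T);
        next[off]  = (int)(j % B);
    }

    long base = local_index(start, B, T);
    long endmem = local_count(u, p, B, T);
    int off = (int)(start % B);
    unsigned long h = 0;
    while (base < endmem) {
        h = h * 31 + (unsigned long)base;   /* stands in for *base = EXP() */
        base += delta[off];
        off = next[off];
    }
    return h;
}
```

Both functions visit the same local locations in the same order, so their return values match for any valid arguments; only the lookup version avoids the per-iteration ownership test.
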
10
Array Address Lookup

Encouraging results: speedups between 50-200 versus run-time resolution
Lookup time vs. space tradeoff. Kennedy introduces a demand-driven technique
UPC arrays simpler than HPF arrays
UPC language restrictions: no aliasing between pointers with different block sizes
Existing HPF techniques also applicable to UPC pointer-based programs

11
Address Arithmetic Simplification

Address Components Reuse
Idea: view shared pointers as three separate components (A, T, P) = (addr, thread, phase)
Exploit the implicit reuse of the thread and phase fields
Pointer Localization
Determine which accesses can be performed using local pointers
Optimize for indefinite block size
Requires heap analysis/LQI and a dependency analysis similar to that of the lookup techniques
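
Pointer localization for the indefinite block size case can be sketched in plain C. Everything here is a simulation built on assumptions: the array mem models each thread's local memory, shared_read models a generic runtime access, and the names sptr, fill_demo, sum_generic and sum_localized are hypothetical. With an indefinite block size the thread field never changes under pointer arithmetic, so one ownership check before the loop is enough.

```c
#include <assert.h>
#include <stddef.h>

enum { NTHREADS = 4, LOCAL_WORDS = 16 };

static float mem[NTHREADS][LOCAL_WORDS];  /* simulated per-thread memory */
long runtime_calls;                       /* generic accesses performed  */

/* Shared pointer into an indefinite-block-size array: no phase needed. */
typedef struct { int thread; size_t index; } sptr;

void fill_demo(void)
{
    for (int t = 0; t < NTHREADS; t++)
        for (size_t i = 0; i < LOCAL_WORDS; i++)
            mem[t][i] = (float)i;
}

/* Generic access: always goes through the (simulated) runtime. */
static float shared_read(sptr p)
{
    runtime_calls++;
    return mem[p.thread][p.index];
}

float sum_generic(sptr p, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) {
        sptr q = { p.thread, p.index + i };
        s += shared_read(q);
    }
    return s;
}

/* Localized access: one check, then an ordinary C pointer. */
float sum_localized(sptr p, size_t n, int mythread)
{
    if (p.thread != mythread)
        return sum_generic(p, n);          /* data is remote: fall back */
    const float *lp = &mem[mythread][p.index];
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += lp[i];
    return s;
}
```

In a compiler the check would be discharged statically where the analysis can prove locality; the run-time test here only marks where that proof is needed.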

12
Communication Optimizations

Message Vectorization: hoist and prefetch an array slice.
Message Combination: combine messages with the same target processor into a larger message
Communication Pipelining: separate the initiation of a communication operation from its completion and overlap communication and computation
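
Message vectorization is the easiest of the three to show in isolation. The C sketch below simulates a remote array and counts messages; the transport functions get_elem and get_bulk, the counter messages, and the other names are hypothetical stand-ins, not GASNet or UPC runtime calls.

```c
#include <assert.h>
#include <string.h>

enum { SLICE_MAX = 256 };

static float remote_mem[SLICE_MAX];  /* simulated memory of another thread */
long messages;                       /* messages sent so far               */

void init_remote(void)
{
    for (size_t i = 0; i < SLICE_MAX; i++)
        remote_mem[i] = (float)i;
}

/* One message per element. */
static float get_elem(size_t i)
{
    messages++;
    return remote_mem[i];
}

/* One message for a contiguous slice. */
static void get_bulk(float *dst, size_t first, size_t n)
{
    messages++;
    memcpy(dst, &remote_mem[first], n * sizeof *dst);
}

/* Fine-grained version: communication inside the loop. */
float sum_fine_grained(size_t first, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += get_elem(first + i);
    return s;
}

/* Vectorized version: the slice is fetched once, before the loop. */
float sum_vectorized(size_t first, size_t n)
{
    float buf[SLICE_MAX];
    assert(first + n <= SLICE_MAX);
    get_bulk(buf, first, n);
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += buf[i];
    return s;
}
```

For a slice of n elements the message count drops from n to 1. Combination and pipelining build on the same hoisting step, which is one reason the optimizations share analyses.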

13
Communication Optimizations

Some optimizations are complementary
Choi, Snyder (Paragon/T3D - PVM/shmem), Krishnamurthy (CM5), Chakrabarti (SP2/NOW)
Speedups in the range 10-40
Optimizations more effective for high-latency transport layers (PVM/NOW): 25 speedup vs 10 speedup (shmem/SP2)

14
Prefetching of Irregular Data Accesses

For serial programs: hide cache latency
Simpler for parallel programs: hide communication latency
Irregular data accesses
Array-based programs: a[b[i]]
Irregular data structures (pointer based)

15
Prefetching of Irregular Data Accesses

Array based programs
Well explored topic (inspector-executor, Saltz)
Irregular data structures
Not very well explored in the context of SPMD programs.
Serial techniques: jump pointers, linearization (Mowry)
Is there a good case for it?
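
For the array-based case, the inspector-executor idea can be sketched in C for an access pattern of the form a[b[i]]. This is a simplified simulation, not Saltz's runtime library: a_remote models a remote array, messages counts simulated transfers, and get_elem, get_gather and the other names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { A_LEN = 64 };

static float a_remote[A_LEN];  /* simulated remote array a */
long messages;                 /* messages sent so far     */

void init_a(void)
{
    for (int i = 0; i < A_LEN; i++)
        a_remote[i] = (float)i;
}

static float get_elem(int idx)
{
    messages++;
    return a_remote[idx];
}

/* One message that gathers an arbitrary list of indices. */
static void get_gather(float *dst, const int *idx, size_t n)
{
    messages++;
    for (size_t k = 0; k < n; k++)
        dst[k] = a_remote[idx[k]];
}

/* Direct version: one message per a[b[i]] access. */
float sum_direct(const int *b, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += get_elem(b[i]);
    return s;
}

float sum_inspector_executor(const int *b, size_t n)
{
    int needed[A_LEN];   /* distinct indices of a that will be read   */
    int slot[A_LEN];     /* where each index lands in buf, -1 if none */
    float buf[A_LEN];
    size_t m = 0;

    for (int k = 0; k < A_LEN; k++)
        slot[k] = -1;

    /* Inspector: scan b once and record each distinct index. */
    for (size_t i = 0; i < n; i++) {
        int idx = b[i];
        assert(idx >= 0 && idx < A_LEN);
        if (slot[idx] < 0) {
            slot[idx] = (int)m;
            needed[m++] = idx;
        }
    }

    get_gather(buf, needed, m);   /* prefetch everything in one message */

    /* Executor: the original loop, now reading the local buffer. */
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += buf[slot[b[i]]];
    return s;
}
```

The inspector works because b is an array that can be scanned ahead of time. Pointer-based structures have no such index array to inspect, which is why the serial techniques listed above are the natural starting point there.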

16
Conclusions

We start with a clean slate
Infrastructure for pointer analysis, array dependency analysis already in Open64
Communication optimizations and address calculation optimizations share common analyses
Address calculation optimizations are likely to offer better performance improvements at this stage

17
The End
18
Address Arithmetic Simplification

Address Components Reuse
Idea: view shared pointers as three separate components (A, T, P) = (addr, thread, phase)
Exploit the implicit reuse of the thread and phase fields
shared [B] float a[N], b[N];
upc_forall(i = l; i < u; i += s; &a[i])
  a[i] = b[i+k];