SHMEM Programming Model - PowerPoint PPT Presentation

1
SHMEM Programming Model
  • Hung-Hsun Su
  • UPC Group, HCS lab
  • 1/23/2004

2
Outline
  • Background
  • Nuts and Bolts
  • GPSHMEM
  • Performance
  • Conclusion
  • Reference

3
Background - What is SHMEM?
  • SHared MEMory library
  • Based on the SPMD model
  • Available for C / Fortran
  • Hybrid Message Passing / Shared Memory programming model
  • Message Passing like
    • Explicit communication, replication and synchronization
    • Specification of the remote data location (processor id) is required
  • Shared Memory like
    • Provides a logically shared memory system view
    • Communication requires a processor on one side only
    • Allows any processing element (PE) to access memory on a remote PE
      without involving the microprocessor on the remote PE (put / get)
    • Non-blocking data transfer

4
Background - What is SHMEM?
  • Must know the address of a variable on the remote processor for a
    transfer
    • same on all PEs
  • Remotely accessible data objects (symmetric variables)
    • Global variables
    • Local static variables
    • Variables in common blocks
    • Fortran variables modified by a !DIR$ SYMMETRIC directive
    • C variables modified by a #pragma symmetric directive
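As an illustration, these rules can be sketched in plain C (a hypothetical sketch with no SHMEM calls; the comments only mark which objects a SHMEM implementation would treat as symmetric):

```c
#include <assert.h>

/* Sketch of which C objects are symmetric (remotely accessible) in SHMEM.
   Symmetric objects exist at the same address on every PE. */
int global_buf[8];         /* global variable: symmetric */
static int file_counter;   /* file-scope static: symmetric */

int fill_and_sum(void)
{
    int stack_tmp = 0;     /* stack variable: private to this PE, NOT symmetric */
    static int call_count; /* local static: symmetric */

    call_count++;
    file_counter = call_count;
    for (int i = 0; i < 8; i++)
        stack_tmp += global_buf[i];
    return stack_tmp;
}
```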

5
Background - Why program in SHMEM?
  • Easier to program in than MPI / PVM
  • Low-latency, high-bandwidth data transfer
    • Puts
    • Gets
  • Provides efficient collective communication
    • Gather / scatter
    • All-to-all
    • Broadcast
    • Reductions
  • Provides mechanisms to implement mutual exclusion
    • Atomic swap
    • Locking
  • Provides synchronization mechanisms

6
Background - Supported Platforms
  • SHMEM
    • Cray T3D, T3E, PVP
    • SGI Irix, Origin
    • Compaq SC
    • IBM SP
    • Quadrics Linux cluster
    • SCI (?)
  • GPSHMEM (Version 1.0)
    • IBM SP
    • SGI Origin
    • Cray J90, T3E
    • Unix/Linux
    • Windows NT
    • Myrinet (?)

7
Nuts & Bolts - Initialization
  • Include header shmem.h / shmem.fh to access the library
  • shmem_init()  Initializes SHMEM
  • my_pe()       Gets the PE ID of the local processor
  • num_pes()     Gets the total number of PEs in the system

    #include <stdio.h>
    #include <stdlib.h>
    #include "shmem.h"

    int main(int argc, char *argv[])
    {
        int me, npes;   /* renamed so they do not shadow the
                           my_pe() / num_pes() library calls */
        shmem_init();
        me = my_pe();
        npes = num_pes();
        printf("Hello World from process %d of %d\n", me, npes);
        exit(0);
    }

8
Nuts & Bolts - Data Transfer
  • Put
  • Specific variable
    • void shmem_TYPE_p(TYPE *addr, TYPE value, int pe)
    • TYPE = double, float, int, long, short
  • Contiguous object
    • void shmem_put(void *target, const void *source, size_t len, int pe)
    • void shmem_TYPE_put(TYPE *target, const TYPE *source, size_t len,
      int pe)
    • TYPE = double, float, int, long, longdouble, longlong, short
    • void shmem_putSS(void *target, const void *source, size_t len, int pe)
    • Storage size (SS) = 32, 64 (default), 128, mem (any size)
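For intuition, the data movement of a 64-bit put can be modeled locally (put64_model is a hypothetical stand-in, not a SHMEM routine; a real shmem_put64 writes into the symmetric target on PE pe without involving that PE's CPU):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Local model of shmem_put64's data movement: copy `len` 64-bit elements
   from source into target. The `pe` argument is kept only to mirror the
   real signature; this model has a single address space. */
static void put64_model(void *target, const void *source, size_t len, int pe)
{
    (void)pe;                         /* no remote PE in this local model */
    memcpy(target, source, len * 8);  /* 64 bits = 8 bytes per element */
}
```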

9
Nuts & Bolts - Data Transfer
  • Get
  • Specific variable
    • TYPE shmem_TYPE_g(TYPE *addr, int pe)
    • TYPE = double, float, int, long, short
  • Contiguous object
    • void shmem_get(void *target, const void *source, size_t len, int pe)
    • void shmem_TYPE_get(TYPE *target, const TYPE *source, size_t len,
      int pe)
    • TYPE = double, float, int, long, longdouble, longlong, short
    • void shmem_getSS(void *target, const void *source, size_t len, int pe)
    • Storage size (SS) = 32, 64 (default), 128, mem (any size)

10
Nuts & Bolts - Collective Communication
  • Broadcast
    • void shmem_broadcast(void *target, void *source, int nlong,
      int PE_root, int PE_start, int PE_group, int PE_size, long *pSync)
    • One-to-all communication
  • Collection
    • void shmem_collect(void *target, void *source, int nlong,
      int PE_start, int PE_group, int PE_size, long *pSync)
    • void shmem_fcollect(void *target, void *source, int nlong,
      int PE_start, int PE_group, int PE_size, long *pSync)
    • Concatenates data items from the source array into the target array
      over the defined set of PEs. The resulting target array consists of
      the contribution from the 1st PE, followed by the contribution from
      the 2nd PE, and so on.

pSync - symmetric work array. Every element of this array must be
initialized with the value _SHMEM_SYNC_VALUE before any of the PEs in the
active set enter the routine. Used to prevent overlapping collective
communications.
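The concatenation order of fcollect can be sketched with a local, single-process model (fcollect_model is a hypothetical helper; the per-PE source arrays are simulated as an array of pointers):

```c
#include <assert.h>
#include <stddef.h>

/* Local model of shmem_fcollect's result: every PE's target becomes
   PE 0's block, then PE 1's block, and so on. */
static void fcollect_model(long *target, const long *const per_pe_src[],
                           size_t nlong, int npes)
{
    for (int pe = 0; pe < npes; pe++)
        for (size_t i = 0; i < nlong; i++)
            target[(size_t)pe * nlong + i] = per_pe_src[pe][i];
}
```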
11
Nuts & Bolts - Synchronization
  • Barrier
    • void shmem_barrier_all(void)
    • Suspends all operations until every PE has called this function
    • void shmem_barrier(int PE_start, int PE_group, int PE_size,
      long *pSync)
    • Barrier operation on a subset of PEs
  • Wait
    • Suspends until a remote PE writes a value NOT equal to the one
      specified
    • void shmem_wait(long *var, long value)
    • void shmem_TYPE_wait(TYPE *var, TYPE value)
    • TYPE = int, long, longlong, short
  • Conditional wait
    • Same as wait, except the comparison can now be >, >=, ==, !=, <, <=
    • void shmem_wait_until(long *var, int cond, long value)
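The comparison semantics can be sketched as a predicate (the CMP_* names here are local stand-ins for the library's condition constants, and wait_released is a hypothetical helper):

```c
#include <assert.h>

/* Local model of the shmem_wait_until release test: returns nonzero once
   the watched variable satisfies the condition against `value`. */
enum { CMP_EQ, CMP_NE, CMP_GT, CMP_GE, CMP_LT, CMP_LE };

static int wait_released(long var, int cond, long value)
{
    switch (cond) {
    case CMP_EQ: return var == value;
    case CMP_NE: return var != value;
    case CMP_GT: return var >  value;
    case CMP_GE: return var >= value;
    case CMP_LT: return var <  value;
    case CMP_LE: return var <= value;
    default:     return 0;
    }
}
```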

12
Nuts & Bolts - Synchronization
  • Fence
    • void shmem_fence(void)
    • All put operations issued to a particular PE before the call are
      guaranteed to be delivered before any remote write to the same PE
      issued after the call
    • Ensures ordering of remote write (put) operations
  • Quiet
    • void shmem_quiet(void)
    • Waits for completion of all outstanding remote writes initiated from
      the calling PE

13
Nuts & Bolts - Atomic Operations
  • Atomic swap
    • Unconditional
    • long shmem_swap(long *target, long value, int pe)
    • Conditional
    • int shmem_int_cswap(int *target, int cond, int value, int pe)
  • Arithmetic (add, increment)
    • int shmem_int_fadd(int *target, int value, int pe)
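Their return-value semantics can be modeled locally (hypothetical helpers; real SHMEM performs the operation atomically on a symmetric variable on PE pe). Each call returns the old value of the target:

```c
#include <assert.h>

/* Local models of the atomic operations' semantics; each returns the
   value the target held before the operation. */
static long swap_model(long *target, long value)
{
    long old = *target;
    *target = value;
    return old;
}

static int cswap_model(int *target, int cond, int value)
{
    int old = *target;
    if (old == cond)        /* swap happens only when target equals cond */
        *target = value;
    return old;
}

static int fadd_model(int *target, int value)
{
    int old = *target;
    *target += value;
    return old;
}
```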

14
Nuts & Bolts - Collective Reduction
  • Collective logical operations
    • and, or, xor
    • void shmem_int_and_to_all(int *target, int *source, int nreduce,
      int PE_start, int PE_group, int PE_size, int *pWrk, long *pSync)
  • Collective comparison operations
    • max, min
    • void shmem_double_max_to_all(double *target, double *source,
      int nreduce, int PE_start, int PE_group, int PE_size, double *pWrk,
      long *pSync)
  • Collective arithmetic operations
    • product, sum
    • void shmem_double_prod_to_all(double *target, double *source,
      int nreduce, int PE_start, int PE_group, int PE_size, double *pWrk,
      long *pSync)
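The result of a to_all reduction can be sketched with a local model (max_to_all_model is a hypothetical helper simulating each PE's source array as a row; a real call also needs the pWrk and pSync symmetric work arrays):

```c
#include <assert.h>
#include <stddef.h>

/* Local model of shmem_int_max_to_all's result: target[i] becomes the
   maximum of source[i] taken across all PEs. */
static void max_to_all_model(int *target, const int *const per_pe_src[],
                             size_t nreduce, int npes)
{
    for (size_t i = 0; i < nreduce; i++) {
        int m = per_pe_src[0][i];
        for (int pe = 1; pe < npes; pe++)
            if (per_pe_src[pe][i] > m)
                m = per_pe_src[pe][i];
        target[i] = m;
    }
}
```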

15
Nuts & Bolts - Other
  • Address manipulation
    • shmem_ptr - Returns a pointer to a data object on a remote PE
  • Cache control
    • shmem_clear_cache_inv - Disables automatic cache coherency mode
    • shmem_set_cache_inv - Enables automatic cache coherency mode
    • shmem_set_cache_line_inv - Enables automatic cache coherency mode for
      a single cache line
    • shmem_udcflush - Makes the entire user data cache coherent
    • shmem_udcflush_line - Makes a single cache line coherent

16
Nuts & Bolts - Example (Array copy)

    #include <stdio.h>
    #include <mpp/shmem.h>
    #include <intrinsics.h>

    int me, npes, i;
    int source[8], dest[8];   /* global, therefore symmetric */

    int main(void)
    {
        /* Get PE information */
        me = _my_pe();
        npes = _num_pes();

        /* Initialize and send on PE 1 */
        if (me == 1) {
            for (i = 0; i < 8; i++)
                source[i] = i + 1;
            shmem_put64(dest, source, 8 * sizeof(dest[0]) / 8, 0);
        }

        /* Make sure the transfer is complete */
        shmem_barrier_all();

        /* Print from the receiving PE */
        if (me == 0) {
            _shmem_udcflush();
            printf(" DEST ON PE 0:");
            for (i = 0; i < 8; i++)
                printf(" %d%c", dest[i], (i < 7) ? ',' : '\n');
        }
        return 0;
    }
17
GPSHMEM
  • Ames Lab / Pacific Northwest National Lab collaborative project
  • Communication library like the SHMEM library, but tries to achieve full
    portability
  • Mostly the T3D components, with some extensions of functionality
  • Research quality at this point

ARMCI: A Portable Remote Memory Copy Library for Distributed Array
Libraries and Compiler Run-time Systems
18
Performance - Latency (Origin 2000)
19
Performance - Latency (T3E 600)
20
Performance - Bandwidth
Taken from http://infm.cineca.it/documenti/incontro_infm/comunicazio/sld015.htm
21
Performance - Bandwidth
22
Performance - Broadcast
23
Performance - All-to-all
24
Performance - Ocean
On SGI Origin 2000
25
Performance - Radix
On SGI Origin 2000
26
Conclusion
  • Hybrid Message Passing / Shared Memory programming model
  • Compared to message passing
    • Pros
      • Easier to use
      • Lower-latency, higher-bandwidth communication
      • More scalable (within limits)
      • Remote CPU not interrupted during transfer
    • Cons
      • Limited platform support (as of now)

27
Reference
  • Ricky A. Kendall et al., GPSHMEM and Other Parallel Programming Models,
    PowerPoint presentation
  • Hongzhang Shan and Jaswinder Pal Singh, A Comparison of MPI, SHMEM and
    Cache-coherent Shared Address Space Programming Models on the SGI
    Origin2000,
    http://citeseer.nj.nec.com/cache/papers/cs/14068/httpzSzzSzwww.cs.princeton.eduzSz7EshzzSzpaperszSzics99.pdf/a-comparison-of-mpi.pdf
  • Quadrics SHMEM Programming Manual,
    http://www.psc.edu/oneal/compaq/ShmemMan.pdf
  • Karl Feind, Shared Memory Access (SHMEM) Routines
  • Glenn Luecke et al., The Performance and Scalability of SHMEM and MPI-2
    One-Sided Routines on an SGI Origin 2000 and a Cray T3E-600,
    http://dsg.port.ac.uk/Journals/PEMCS/papers/paper19.pdf
  • Patrick H. Worley, CCSM Component Performance Benchmarking and Status
    of the CRAY X1 at ORNL,
    http://www.csm.ornl.gov/worley/talks/index.html