1
Optimizing Collective Communication for Multicore
  • By Rajesh Nishtala

2
What Are Collectives?
  • An operation called by all threads together to
    perform globally coordinated communication
  • May involve a modest amount of computation, e.g.
    to combine values as they are communicated
  • Can be extended to teams (or communicators) in
    which they operate on a predefined subset of the
    threads
  • Focus on collectives in Single Program Multiple
    Data (SPMD) programming models

3
Some Collectives
  • Barrier (MPI_Barrier())
  • A thread cannot exit a call to a barrier until
    all other threads have called the barrier
  • Broadcast (MPI_Bcast())
  • A root thread sends a copy of an array to all the
    other threads
  • Reduce-To-All (MPI_Allreduce())
  • Each thread contributes an operand to an
    arithmetic operation across all the threads
  • The result is then broadcast to all the threads
  • Exchange (MPI_Alltoall())
  • For all i, j < N, thread i copies the jth piece
    of its input array to the ith slot of the output
    array located on thread j (see the sketch below)
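
A minimal C/MPI sketch of these four calls (error
handling omitted; not taken from the slides):

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        int rank, n;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &n);

        /* Barrier: no thread exits until all have entered. */
        MPI_Barrier(MPI_COMM_WORLD);

        /* Broadcast: root (rank 0) sends its array to everyone. */
        double buf[8] = {0};
        MPI_Bcast(buf, 8, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* Reduce-To-All: one operand per rank, summed; result
           delivered to all ranks. */
        double in = (double)rank, sum;
        MPI_Allreduce(&in, &sum, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);

        /* Exchange: piece j of rank i's input becomes piece i
           of rank j's output. */
        double *src = malloc(n * sizeof(double));
        double *dst = malloc(n * sizeof(double));
        for (int j = 0; j < n; j++) src[j] = rank * n + j;
        MPI_Alltoall(src, 1, MPI_DOUBLE, dst, 1, MPI_DOUBLE,
                     MPI_COMM_WORLD);

        free(src); free(dst);
        MPI_Finalize();
        return 0;
    }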

4
Why Are They Important?
  • Basic communication building blocks
  • Found in many parallel programming languages and
    libraries
  • Abstraction
  • If an application is written with collectives, it
    passes the responsibility of tuning to the runtime

(Chart: percentage of runtime spent in collectives)
5
Experimental Setup
  • Platforms
  • Sun Niagara2
  • 1 socket of 8 multi-threaded cores
  • Each core supports 8 hardware thread contexts for
    64 total threads
  • Intel Clovertown
  • 2 traditional quad-core sockets
  • BlueGene/P
  • 1 quad-core socket
  • MPI for inter-process communication
  • Shared-memory MPICH2 1.0.7

6
Threads v. Processes (Niagara2)
  • Barrier Performance
  • Perform a barrier across all 64 threads
  • Threads arranged into processes in different ways
  • One extreme has one thread per process while the
    other has one process with 64 threads
  • MPI_Barrier() called between processes
  • Flat barrier amongst threads
  • 2 orders of magnitude difference in performance!

7
Threads v. Processes (Niagara2) cont.
  • Other collectives see similar issues with scaling
    using processes
  • MPI collectives are called between processes while
    shared memory is leveraged within a process

8
Intel Clovertown and BlueGene/P
  • Fewer threads per node
  • Differences are not as drastic but they are
    non-trivial

(Charts: Intel Clovertown and BlueGene/P)
9
Optimizing Barrier w/ Trees
  • Leveraging shared memory is a critical
    optimization
  • Flat trees don't scale
  • Use trees to aid parallelism
  • Requires two passes of a tree
  • First (UP) pass indicates that all threads have
    arrived.
  • Signal parent when all your children have arrived
  • Once root gets signal from all children then all
    threads have reported in
  • Second (DOWN) pass signals that the barrier is
    complete and all threads may proceed
  • Wait for parent to send me a clear signal
  • Propagate clear signal down to my children (see
    the sketch below)
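
A minimal sketch of the two passes, assuming a
precomputed tree (my_parent, my_children) and one flag
word per thread; memory-ordering and flag-reset details
are elided (see "Details of Signaling" in the backup
slides):

    typedef struct {
        volatile int arrived;  /* set on the UP pass           */
        volatile int clear;    /* set by the parent, DOWN pass */
    } flags_t;

    extern flags_t flags[];     /* one entry per thread, shared */
    extern int my_parent;       /* -1 at the root               */
    extern int my_children[];
    extern int num_children;

    void tree_barrier(int me) {
        /* UP pass: wait for all children, then report up. */
        for (int c = 0; c < num_children; c++)
            while (!flags[my_children[c]].arrived)
                ; /* spin on a cached copy of the child's flag */
        if (my_parent >= 0) {
            flags[me].arrived = 1;
            /* DOWN pass: wait for the parent's clear signal. */
            while (!flags[my_parent].clear)
                ; /* spin */
        }
        /* Propagate the clear signal down to my subtree. */
        flags[me].clear = 1;
        /* A reusable barrier alternates between two sets of
           flags so back-to-back barriers do not race. */
    }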

10
Example Tree Topologies
  • Radix 2 k-nomial tree (binomial)
  • Radix 4 k-nomial tree (quadnomial)
  • Radix 8 k-nomial tree (octnomial)
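One plausible construction of these trees: number the
threads 0..n-1, root at 0, and attach children at
increasing powers of the radix (the visit callback is a
placeholder):

    /* Enumerate the children of node v in a radix-k
       k-nomial tree over n nodes rooted at 0. */
    void knomial_children(int v, int n, int k,
                          void (*visit)(int parent, int child)) {
        int stride = 1;
        while (stride <= v) stride *= k;  /* first level above v */
        for (; stride < n; stride *= k)
            for (int j = 1; j < k; j++)
                if (v + j * stride < n)
                    visit(v, v + j * stride);
    }

With k = 2 and n = 8 this gives node 0 the children 1,
2, and 4 and node 1 the children 3 and 5, i.e. the
binomial tree.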
11
Barrier Performance Results
  • Time many back-to-back barriers
  • Flat tree is just one level with all threads
    reporting to thread 0
    Leverages shared memory but is non-scalable
  • Architecture-Independent Tree (radix 2)
  • Pick a good generic radix that is suitable for
    many platforms
  • Mismatched to architecture
  • Architecture-Dependent Tree
  • Search over all radices to pick the tree that best
    matches the architecture (sketched below)
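
A sketch of the architecture-dependent search: time the
tree barrier at each candidate radix and keep the
fastest (both helpers here are hypothetical names):

    int pick_best_radix(void) {
        int radices[] = {2, 4, 8, 16};
        int best = 2;
        double best_t = 1e30;
        for (int i = 0; i < 4; i++) {
            tree_barrier_set_radix(radices[i]);
            /* Average many back-to-back barriers. */
            double t = time_tree_barrier(1000);
            if (t < best_t) { best_t = t; best = radices[i]; }
        }
        return best;
    }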

12
Broadcast Performance Results
  • Time a latency-sensitive Broadcast (8 bytes)
  • Time Broadcast followed by Barrier and subtract
    time for Barrier
  • Yields an approximation of how long it takes the
    last thread to get the data (sketched below)
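
A sketch of that methodology in MPI terms (iteration
count and buffer are illustrative):

    #include <mpi.h>

    double time_bcast(void *buf, int bytes, int iters) {
        double t0, per_bb, per_b;
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            MPI_Bcast(buf, bytes, MPI_BYTE, 0, MPI_COMM_WORLD);
            MPI_Barrier(MPI_COMM_WORLD);
        }
        per_bb = (MPI_Wtime() - t0) / iters;
        t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++)
            MPI_Barrier(MPI_COMM_WORLD);
        per_b = (MPI_Wtime() - t0) / iters;
        /* Approximates the time for the last thread
           to get the data. */
        return per_bb - per_b;
    }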

13
Reduce To All Performance Results
  • 4 kByte (512 doubles) Reduce-To-All
  • In addition to data movement we also want to
    parallelize the computation
  • In the Flat approach, computation gets serialized
    at the root
  • Tree based approaches allow us to parallelize the
    computation amongst all the floating point units
  • 8 threads share one FPU, thus radix 2, 4, and 8
    serialize computation in about the same way

14
Optimization Summary
  • Relying on flat trees is not enough for most
    collectives
  • Architecture dependent tuning is a further and
    important optimization

15
Extending the Results to a Cluster
  • Use one rack of BlueGene/P (1024 nodes or 4096
    cores)
  • Reduce-To-All by having one representative thread
    per process call the inter-node all-reduce
  • Reduce the number of messages in the network
  • Vary the number of threads per process but use
    all cores
  • Relying purely on shared memory doesn't always
    yield the best performance
  • The number of active cores working on the
    computation drops
  • Can optimize so that computation is partitioned
    across cores (a hybrid sketch follows)
  • Not suitable for a direct call to MPI_Allreduce()
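
A plausible OpenMP + MPI sketch of that scheme (assumes
MPI initialized with MPI_THREAD_FUNNELED; the names and
the atomic-based local reduce are illustrative, not the
authors' code):

    #include <mpi.h>
    #include <omp.h>
    #include <stdlib.h>

    /* Call from inside an OpenMP parallel region. */
    void hybrid_allreduce(const double *in, double *out, int n) {
        static double *node_sum;  /* shared by the node's threads */
        #pragma omp single
        node_sum = calloc(n, sizeof(double));  /* implied barrier */

        /* Combine per-thread contributions in shared memory
           (the work could instead be partitioned across cores;
           atomics keep the sketch short). */
        for (int i = 0; i < n; i++) {
            #pragma omp atomic
            node_sum[i] += in[i];
        }
        #pragma omp barrier

        /* One representative thread per process makes the
           inter-node call, reducing messages in the network. */
        #pragma omp master
        MPI_Allreduce(MPI_IN_PLACE, node_sum, n, MPI_DOUBLE,
                      MPI_SUM, MPI_COMM_WORLD);
        #pragma omp barrier

        for (int i = 0; i < n; i++)
            out[i] = node_sum[i];  /* all threads read the result */
        #pragma omp barrier
        #pragma omp single
        free(node_sum);
    }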

16
Potential Synchronization Problem
  • 1. Broadcast variable x from root
  • 2. Have proc 1 set a new value for x on proc 4

    broadcast x = 1 from proc 0;
    if (myid == 1)
        put x = 5 to proc 4;
    else
        /* do nothing */;
Proc 1 thinks the collective is done
The put of x = 5 by proc 1 has been lost
Proc 1 observes a globally incomplete collective
17
Strict v. Loose Synchronization
  • A fix to the problem
  • Add a barrier before/after the collective (see
    the sketch below)
  • Enforces a global ordering of the operations
  • Is there a problem?
  • We want to decouple synchronization from data
    movement
  • Specify the synchronization requirements
  • Potential to aggregate synchronization
  • Done by the user or a smart compiler
  • How can we realize these gains in applications?
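
In the slide-16 example, the strict fix looks like this
(same pseudocode style; the barrier guarantees the
broadcast is globally complete before the dependent put
is issued):

    broadcast x = 1 from proc 0;
    barrier();               /* global ordering point */
    if (myid == 1)
        put x = 5 to proc 4; /* can no longer be overwritten
                                by the in-flight broadcast */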

18
Conclusions
  • Processes → Threads is a crucial optimization for
    single-node collective communication
  • Can use tree-based collectives to realize better
    performance, even for collectives on one node
  • Picking the correct tree that best matches
    architecture yields the best performance
  • Multicore adds to the (auto)tuning space for
    collective communication
  • Shared memory semantics allow us to create new
    loosely synchronized collectives

19
  • Questions?

20
  • Backup Slides

21
Threads and Processes
  • Threads
  • A sequence of instructions and an execution stack
  • Communication between threads goes through a
    common, shared address space
  • No OS/Network involvement needed
  • Reasoning about inter-thread communication can be
    tricky
  • Processes
  • A set of threads and an associated memory space
  • All threads within process share address space
  • Communication between processes must be managed
    through the OS
  • Inter-process communication is explicit but may
    be slow
  • More expensive to switch between processes

22
Experimental Platforms
Clovertown
Niagara2
BG/P
23
Specs
24
Details of Signaling
  • For optimum performance have many readers and one
    writer
  • Each thread sets a flag (a single word) that
    others will read
  • Every reader will get a copy of the cache-line
    and spin on that copy
  • When the writer comes in and changes the value of
    the variable, the cache-coherency system handles
    broadcasting/updating the changes
  • Avoid atomic primitives
  • On the way up the tree, a child sets a flag
    indicating that its subtree has arrived
  • Parent spins on that flag for each child
  • On the way down, each child spins on its parent's
    flag
  • When it's set, it indicates that the parent wants
    to broadcast the clear signal down
  • Flags must be on different cache lines to avoid
    false sharing
  • Need to switch back and forth between two sets of
    flags (see the sketch below)
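
A sketch of the flag discipline described above; the
64-byte line size and all names are assumptions:

    #define CACHE_LINE  64   /* assumed line size   */
    #define MAX_THREADS 64

    /* One flag per cache line so spinning readers do
       not suffer false sharing. */
    typedef struct {
        volatile int val;
        char pad[CACHE_LINE - sizeof(int)];
    } padded_flag_t;

    /* Two sets of flags: barrier i uses set i % 2, so
       the other set can be reset without racing with
       threads still spinning on the current set. */
    padded_flag_t arrive[2][MAX_THREADS];
    padded_flag_t clear_sig[2][MAX_THREADS];

    static void signal_flag(padded_flag_t *f) { f->val = 1; }

    static void wait_flag(padded_flag_t *f) {
        /* Each reader spins on its own cached copy of
           the line; when the single writer updates it,
           the coherence protocol propagates the change.
           No atomic read-modify-write is needed. */
        while (!f->val)
            ;
    }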