Title: Optimizing Collective Communication for Multicore
1. Optimizing Collective Communication for Multicore
2. What Are Collectives?
- An operation called by all threads together to perform globally coordinated communication
- May involve a modest amount of computation, e.g. to combine values as they are communicated
- Can be extended to teams (or communicators), in which case they operate on a predefined subset of the threads
- Focus on collectives in Single Program Multiple Data (SPMD) programming models
3. Some Collectives
- Barrier (MPI_Barrier())
  - A thread cannot exit a call to the barrier until all other threads have called the barrier
- Broadcast (MPI_Bcast())
  - A root thread sends a copy of an array to all the other threads
- Reduce-To-All (MPI_Allreduce())
  - Each thread contributes an operand to an arithmetic operation across all the threads
  - The result is then broadcast to all the threads
- Exchange (MPI_Alltoall())
  - For all i, j < N, thread i copies the jth piece of its input array to the ith slot of an output array located on thread j
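The following is a minimal C/MPI sketch of the four collectives above; the buffer sizes, the reduction operation (MPI_SUM), and the use of MPI_COMM_WORLD are illustrative choices rather than anything specific to the experiments in this talk.

    /* Minimal sketch of the four collectives using MPI's C interface. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Barrier: no rank may leave until every rank has entered. */
        MPI_Barrier(MPI_COMM_WORLD);

        /* Broadcast: rank 0 sends a copy of buf to all other ranks. */
        double buf[8] = {0};
        MPI_Bcast(buf, 8, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* Reduce-To-All: every rank contributes an operand; the sum is
         * returned to all ranks. */
        double local = (double)rank, sum = 0.0;
        MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        /* Exchange: rank i sends the jth block of sendbuf to rank j, which
         * stores it in the ith slot of its recvbuf. */
        double *sendbuf = malloc(nprocs * sizeof(double));
        double *recvbuf = malloc(nprocs * sizeof(double));
        for (int j = 0; j < nprocs; j++) sendbuf[j] = rank * 100.0 + j;
        MPI_Alltoall(sendbuf, 1, MPI_DOUBLE, recvbuf, 1, MPI_DOUBLE,
                     MPI_COMM_WORLD);

        free(sendbuf); free(recvbuf);
        MPI_Finalize();
        return 0;
    }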
4. Why Are They Important?
- Basic communication building blocks
- Found in many parallel programming languages and libraries
- Abstraction
  - If an application is written with collectives, it passes the responsibility for tuning to the runtime
(Chart: percentage of runtime spent in collectives)
5. Experimental Setup
- Platforms
  - Sun Niagara2
    - 1 socket of 8 multithreaded cores
    - Each core supports 8 hardware thread contexts, for 64 total threads
  - Intel Clovertown
    - 2 traditional quad-core sockets
  - BlueGene/P
    - 1 quad-core socket
- MPI for inter-process communication
  - Shared-memory MPICH2 1.0.7
6. Threads vs. Processes (Niagara2)
- Barrier performance
  - Perform a barrier across all 64 threads
  - Threads are arranged into processes in different ways
    - One extreme has one thread per process; the other has 1 process with 64 threads
  - MPI_Barrier() is called between processes
  - A flat barrier is used amongst the threads within a process (sketched below)
- 2 orders of magnitude difference in performance!
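As a rough illustration of this hybrid arrangement (not the benchmark code itself), the sketch below uses a flat pthread barrier among the threads of a process and lets a single "funneled" thread per process enter MPI_Barrier(); the choice of 8 threads per process is an assumed example.

    #include <mpi.h>
    #include <pthread.h>

    #define THREADS_PER_PROC 8   /* assumption: 8 threads per MPI process */

    static pthread_barrier_t intra_barrier;

    /* Flat shared-memory barrier among the process's threads; only thread 0
     * (the "funneled" thread) calls MPI_Barrier() to synchronize processes. */
    static void hybrid_barrier(long tid)
    {
        pthread_barrier_wait(&intra_barrier);   /* all local threads arrive   */
        if (tid == 0)
            MPI_Barrier(MPI_COMM_WORLD);        /* one representative per proc */
        pthread_barrier_wait(&intra_barrier);   /* release the local threads  */
    }

    static void *worker(void *arg)
    {
        hybrid_barrier((long)arg);
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int provided;   /* FUNNELED: only the main thread makes MPI calls */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        pthread_barrier_init(&intra_barrier, NULL, THREADS_PER_PROC);

        pthread_t t[THREADS_PER_PROC];
        for (long i = 1; i < THREADS_PER_PROC; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        hybrid_barrier(0);                      /* main thread acts as thread 0 */
        for (long i = 1; i < THREADS_PER_PROC; i++)
            pthread_join(t[i], NULL);

        pthread_barrier_destroy(&intra_barrier);
        MPI_Finalize();
        return 0;
    }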
7. Threads vs. Processes (Niagara2), cont.
- Other collectives see similar scaling issues when using processes
- MPI collectives are called between processes, while shared memory is leveraged within a process
8. Intel Clovertown and BlueGene/P
- Fewer threads per node
- Differences are not as drastic, but they are non-trivial
(Charts: Intel Clovertown and BlueGene/P results)
9. Optimizing Barrier with Trees
- Leveraging shared memory is a critical optimization
- Flat trees don't scale
- Use trees to aid parallelism
- Requires two passes of the tree (sketched below)
  - First (UP) pass indicates that all threads have arrived
    - Signal the parent when all of your children have arrived
    - Once the root gets the signal from all of its children, all threads have reported in
  - Second (DOWN) pass releases the threads once everyone has arrived
    - Wait for my parent to send me a clear signal
    - Propagate the clear signal down to my children
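Below is a sketch of the two-pass tree barrier, assuming parent[]/children[][] arrays describing the tree (built, for example, as in the k-nomial sketch on the next slide) and simple per-thread flags; the real implementation uses the padded, alternating flags described on the "Details of Signaling" slide. The array names and the use of C11 atomics are illustrative.

    #include <stdatomic.h>

    #define MAX_THREADS  64
    #define MAX_CHILDREN 63        /* enough even for a flat tree over 64 threads */

    int parent[MAX_THREADS];                  /* -1 for the root thread            */
    int children[MAX_THREADS][MAX_CHILDREN];  /* -1 terminates each child list     */

    static atomic_int arrived[MAX_THREADS];   /* UP pass: "my subtree has arrived" */
    static atomic_int go[MAX_THREADS];        /* DOWN pass: "clear to proceed"     */

    void tree_barrier(int me)
    {
        /* UP pass: wait for every child subtree, resetting its flag for reuse. */
        for (int c = 0; c < MAX_CHILDREN && children[me][c] >= 0; c++) {
            int child = children[me][c];
            while (atomic_load(&arrived[child]) == 0)
                ;                               /* spin */
            atomic_store(&arrived[child], 0);
        }

        if (parent[me] >= 0) {
            atomic_store(&arrived[me], 1);      /* tell my parent my subtree is in */
            while (atomic_load(&go[me]) == 0)   /* DOWN pass: wait for the clear   */
                ;                               /* spin */
            atomic_store(&go[me], 0);
        }
        /* The root reaches this point once everyone has arrived; each thread
         * then propagates the clear signal down to its children. */
        for (int c = 0; c < MAX_CHILDREN && children[me][c] >= 0; c++)
            atomic_store(&go[children[me][c]], 1);
    }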
10. Example Tree Topologies
- Radix 2 k-nomial tree (binomial)
- Radix 4 k-nomial tree (quadnomial)
- Radix 8 k-nomial tree (octnomial)
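One possible construction of these trees, filling in the parent[]/children[][] arrays used by the barrier sketch above, is shown below; the exact layout used in the work may differ, so treat this as one reasonable radix-k assignment rooted at thread 0.

    /* Build a radix-k k-nomial tree rooted at thread 0: each node's parent is
     * obtained by clearing its lowest non-zero base-k digit, and its children
     * sit one digit position below it.  Radix 2 gives the binomial tree,
     * radix 4 the quadnomial tree, and radix 8 the octnomial tree. */
    void build_knomial_tree(int nthreads, int radix)
    {
        for (int me = 0; me < nthreads; me++) {
            int nchildren = 0;

            /* Stride of my lowest non-zero base-`radix` digit; the root
             * scans every digit position. */
            int limit = 1;
            if (me == 0)
                while (limit < nthreads) limit *= radix;
            else
                while ((me / limit) % radix == 0) limit *= radix;

            /* Parent: clear my lowest non-zero digit (root has no parent). */
            parent[me] = (me == 0) ? -1 : me - ((me / limit) % radix) * limit;

            /* Children: every digit position strictly below that digit. */
            for (int stride = 1; stride < limit; stride *= radix)
                for (int d = 1; d < radix; d++) {
                    int child = me + d * stride;
                    if (child < nthreads && nchildren < MAX_CHILDREN)
                        children[me][nchildren++] = child;
                }
            if (nchildren < MAX_CHILDREN)
                children[me][nchildren] = -1;     /* terminate the child list */
        }
    }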
11. Barrier Performance Results
- Time many back-to-back barriers (timing harness sketched below)
- Flat tree is just one level, with all threads reporting to thread 0
  - Leverages shared memory but is non-scalable
- Architecture-independent tree (radix 2)
  - Pick a generic, good radix that is suitable for many platforms
  - Mismatched to the architecture
- Architecture-dependent tree
  - Search over all radices to pick the tree that best matches the architecture
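A sketch of the timing harness follows: each thread averages the cost of many back-to-back barriers, and the architecture-dependent search simply repeats this measurement with radix 2, 4, and 8 trees (rebuilding the tree between configurations) and keeps the fastest. MPI_Wtime() is used here only as a convenient timer, and the iteration count is an arbitrary example.

    #include <mpi.h>                      /* MPI_Wtime() as a portable timer */

    /* Average cost of one barrier over `iters` back-to-back barriers. */
    double time_barrier(int me, int iters)
    {
        double start = MPI_Wtime();
        for (int i = 0; i < iters; i++)
            tree_barrier(me);             /* from the tree-barrier sketch above */
        return (MPI_Wtime() - start) / iters;
    }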
12. Broadcast Performance Results
- Time a latency-sensitive broadcast (8 bytes)
- Time a broadcast followed by a barrier and subtract the time for the barrier (sketched below)
  - Yields an approximation of how long it takes for the last thread to get the data
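The measurement can be sketched as follows; bcast_8bytes() is a hypothetical stand-in for whichever broadcast implementation is under test, and barrier_cost is the separately measured barrier time from the previous slide.

    #include <mpi.h>

    void bcast_8bytes(int me);            /* hypothetical broadcast under test */

    /* Time (broadcast + barrier) and subtract the barrier cost, approximating
     * the time for the last thread to receive the data. */
    double time_bcast(int me, int iters, double barrier_cost)
    {
        double start = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            bcast_8bytes(me);             /* broadcast being measured          */
            tree_barrier(me);             /* ensure the broadcast has completed */
        }
        return (MPI_Wtime() - start) / iters - barrier_cost;
    }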
13. Reduce-To-All Performance Results
- 4 kB (512 doubles) Reduce-To-All
- In addition to the data movement, we also want to parallelize the computation (sketched below)
  - In the flat approach, the computation gets serialized at the root
  - Tree-based approaches allow us to parallelize the computation amongst all the floating-point units
  - 8 threads share one FPU, so radix 2, 4, and 8 trees serialize the computation in about the same way
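The sketch below shows how the computation can be parallelized up the tree: each parent folds its children's partial sums into its own vector as they arrive, so the floating-point work is spread across the cores instead of being serialized at the root. It reuses the parent[]/children[][] arrays and flag style from the earlier sketches; contrib[]/result[] and the single-use flags are illustrative simplifications (a reusable version would alternate flag sets as described on the signaling slide).

    #include <stdatomic.h>

    #define NELEMS 512                          /* 4 kB of doubles, as measured  */

    double contrib[MAX_THREADS][NELEMS];        /* each thread's input vector    */
    double result[NELEMS];                      /* final answer, written by root */
    static atomic_int partial_ready[MAX_THREADS];
    static atomic_int result_ready;

    void tree_allreduce(int me)
    {
        /* UP pass: fold each child's partial sum into mine as it arrives,
         * parallelizing the additions across the tree. */
        for (int c = 0; c < MAX_CHILDREN && children[me][c] >= 0; c++) {
            int child = children[me][c];
            while (atomic_load(&partial_ready[child]) == 0)
                ;                               /* spin */
            for (int i = 0; i < NELEMS; i++)
                contrib[me][i] += contrib[child][i];
        }

        if (parent[me] >= 0) {
            atomic_store(&partial_ready[me], 1);     /* publish my partial sum    */
            while (atomic_load(&result_ready) == 0)  /* wait for the final answer */
                ;                                    /* spin */
        } else {
            for (int i = 0; i < NELEMS; i++)         /* root now holds the total  */
                result[i] = contrib[me][i];
            atomic_store(&result_ready, 1);          /* release everyone          */
        }
        /* On return, every thread can read the reduced vector from result[]. */
    }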
14. Optimization Summary
- Relying on flat trees is not enough for most collectives
- Architecture-dependent tuning is a further and important optimization
15. Extending the Results to a Cluster
- Use one rack of BlueGene/P (1024 nodes, or 4096 cores)
- Reduce-To-All is done by having one representative thread per process make the call to the inter-node allreduce (sketched below)
  - Reduces the number of messages in the network
- Vary the number of threads per process, but use all the cores
- Relying purely on shared memory doesn't always yield the best performance
  - The number of active cores working on the computation drops
  - Can optimize so that the computation is partitioned across cores
    - Not suitable for a direct call to MPI_Allreduce()
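A sketch of the hierarchical arrangement, assuming the intra-node tree_allreduce() and tree_barrier() sketches above and an MPI library initialized with thread support: the threads of each process reduce in shared memory, a single representative thread per process calls MPI_Allreduce() on the node-level result, and a local barrier publishes the final answer to the other threads. Buffer names are illustrative.

    #include <mpi.h>

    double node_result[NELEMS];   /* cluster-wide result, shared by the node's threads */

    void cluster_allreduce(int tid)
    {
        /* 1. Intra-node reduction in shared memory (tree-based, as above);
         *    the node's total ends up in result[]. */
        tree_allreduce(tid);

        /* 2. One representative thread per process goes to the network,
         *    cutting the number of messages the interconnect must carry. */
        if (tid == 0)
            MPI_Allreduce(result, node_result, NELEMS, MPI_DOUBLE,
                          MPI_SUM, MPI_COMM_WORLD);

        /* 3. Intra-node barrier so every thread sees the completed node_result[]. */
        tree_barrier(tid);
    }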
16. Potential Synchronization Problem
- 1. Broadcast variable x from the root
- 2. Have proc 1 set a new value for x on proc 4

    broadcast x=1 from proc 0
    if (myid == 1)
        put x=5 to proc 4
    else
        /* do nothing */

- Proc 1 thinks the collective is done
- The put of x=5 by proc 1 has been lost
- Proc 1 observes a globally incomplete collective
17. Strict vs. Loose Synchronization
- A fix to the problem
  - Add a barrier before/after the collective
  - Enforces a global ordering of the operations
- Is there a problem?
  - We want to decouple synchronization from data movement
  - Specify the synchronization requirements
    - Potential to aggregate synchronization
    - Done by the user or a smart compiler
- How can we realize these gains in applications?
18. Conclusions
- Processes → Threads is a crucial optimization for single-node collective communication
- Tree-based collectives can be used to realize better performance, even for collectives on one node
- Picking the tree that best matches the architecture yields the best performance
- Multicore adds to the (auto)tuning space for collective communication
- Shared-memory semantics allow us to create new, loosely synchronized collectives
21. Threads and Processes
- Threads
  - A sequence of instructions and an execution stack
  - Communication between threads goes through a common, shared address space
    - No OS/network involvement needed
  - Reasoning about inter-thread communication can be tricky
- Processes
  - A set of threads and an associated memory space
  - All threads within a process share its address space
  - Communication between processes must be managed through the OS
    - Inter-process communication is explicit but may be slow
  - More expensive to switch between processes
22. Experimental Platforms
(Diagrams: Clovertown, Niagara2, BG/P)
23. Specs
24. Details of Signaling
- For optimum performance, have many readers and one writer
  - Each thread sets a flag (a single word) that others will read
  - Every reader gets a copy of the cache line and spins on that copy
  - When the writer comes in and changes the value of the variable, the cache-coherency system handles broadcasting/updating the change
  - Avoid atomic primitives
- On the way up the tree, a child sets a flag indicating that its subtree has arrived
  - The parent spins on that flag for each of its children
- On the way down, each child spins on its parent's flag
  - When it is set, it indicates that the parent wants to broadcast the clear signal down
- Flags must be on different cache lines to avoid false sharing
- Need to switch back and forth between two sets of flags (sketched below)
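A sketch of this flag layout is below, assuming a 128-byte padding size and C11 atomics (both illustrative choices): each flag lives on its own cache line, readers spin on a plain load of their cached copy, the single writer updates the flag with a plain store (no compare-and-swap or fetch-and-add), and two complete sets of flags exist so consecutive collectives can alternate between them rather than racing to reset a single set.

    #include <stdatomic.h>

    #define CACHE_LINE 128                    /* assumed padding; platform-dependent */

    typedef struct {
        atomic_int val;
        char pad[CACHE_LINE - sizeof(atomic_int)];  /* one flag per cache line */
    } padded_flag_t;

    /* Two sets of arrival/clear flags; collective number i uses set (i % 2),
     * so the previous collective's flags can be reset without extra
     * synchronization. */
    static padded_flag_t arrive_flags[2][MAX_THREADS];
    static padded_flag_t clear_flags[2][MAX_THREADS];

    /* One writer, many readers: each reader spins on its cached copy of the
     * line; when the writer stores a new value, cache coherence invalidates
     * or updates the readers' copies.  No atomic read-modify-write is needed. */
    static inline void signal_flag(padded_flag_t *f) { atomic_store(&f->val, 1); }
    static inline void reset_flag(padded_flag_t *f)  { atomic_store(&f->val, 0); }
    static inline void wait_flag(padded_flag_t *f)
    {
        while (atomic_load(&f->val) == 0)
            ;                                 /* spin locally until the writer signals */
    }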