1
Optimizing Collective Communication for Multicore
  • By Rajesh Nishtala

2
What Are Collectives?
  • An operation called by all threads together to
    perform globally coordinated communication
  • May involve a modest amount of computation, e.g.
    to combine values as they are communicated
  • Can be extended to teams (or communicators) in
    which they operate on a predefined subset of the
    threads
  • Focus on collectives in Single Program Multiple
    Data (SPMD) programming models

3
Some Collectives
  • Barrier (MPI_Barrier())
  • A thread cannot exit a call to a barrier until
    all other threads have called the barrier
  • Broadcast (MPI_Bcast())
  • A root thread sends a copy of an array to all the
    other threads
  • Reduce-To-All (MPI_Allreduce())
  • Each thread contributes an operand to an
    arithmetic operation across all the threads
  • The result is then broadcast to all the threads
  • Exchange (MPI_Alltoall())
  • For all i, j < N, thread i copies the jth piece
    of its input array to the ith slot of the output
    array located on thread j (see the sketch below)
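
A minimal C/MPI sketch of these four calls (error
handling omitted; not taken from the slides):

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        int rank, n;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &n);

        /* Barrier: no thread exits until all have entered. */
        MPI_Barrier(MPI_COMM_WORLD);

        /* Broadcast: root (rank 0) sends its array to everyone. */
        double buf[8] = {0};
        MPI_Bcast(buf, 8, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* Reduce-To-All: one operand per rank, summed; result
           delivered to all ranks. */
        double in = (double)rank, sum;
        MPI_Allreduce(&in, &sum, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);

        /* Exchange: piece j of rank i's input becomes piece i
           of rank j's output. */
        double *src = malloc(n * sizeof(double));
        double *dst = malloc(n * sizeof(double));
        for (int j = 0; j < n; j++) src[j] = rank * n + j;
        MPI_Alltoall(src, 1, MPI_DOUBLE, dst, 1, MPI_DOUBLE,
                     MPI_COMM_WORLD);

        free(src); free(dst);
        MPI_Finalize();
        return 0;
    }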

4
Why Are They Important?
  • Basic communication building blocks
  • Found in many parallel programming languages and
    libraries
  • Abstraction
  • If an application is written with collectives, it
    passes the responsibility of tuning to the runtime

(Chart: percentage of runtime spent in collectives)
5
Experimental Setup
  • Platforms
  • Sun Niagara2
  • 1 socket of 8 multi-threaded cores
  • Each core supports 8 hardware thread contexts for
    64 total threads
  • Intel Clovertown
  • 2 traditional quad-core sockets
  • BlueGene/P
  • 1 quad-core socket
  • MPI for inter-process communication
  • Shared-memory MPICH2 1.0.7

6
Threads v. Processes (Niagara2)
  • Barrier Performance
  • Perform a barrier across all 64 threads
  • Threads arranged into processes in different ways
  • One extreme has one thread per process while the
    other has one process with 64 threads
  • MPI_Barrier() called between processes
  • Flat barrier amongst threads
  • 2 orders of magnitude difference in performance!

7
Threads v. Processes (Niagara2) cont.
  • Other collectives see similar issues with scaling
    using processes
  • MPI collectives are called between processes while
    shared memory is leveraged within a process

8
Intel Clovertown and BlueGene/P
  • Fewer threads per node
  • Differences are not as drastic but they are
    non-trivial

(Charts: Intel Clovertown and BlueGene/P)
9
Optimizing Barrier w/ Trees
  • Leveraging shared memory is a critical
    optimization
  • Flat trees don't scale
  • Use trees to aid parallelism
  • Requires two passes of a tree
  • First (UP) pass indicates that all threads have
    arrived.
  • Signal parent when all your children have arrived
  • Once root gets signal from all children then all
    threads have reported in
  • Second (DOWN) pass signals that the barrier is
    complete and all threads may proceed
  • Wait for parent to send me a clear signal
  • Propagate clear signal down to my children (see
    the sketch below)
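
A minimal sketch of the two passes, assuming a
precomputed tree (my_parent, my_children) and one flag
word per thread; memory-ordering and flag-reset details
are elided (see "Details of Signaling" in the backup
slides):

    typedef struct {
        volatile int arrived;  /* set on the UP pass           */
        volatile int clear;    /* set by the parent, DOWN pass */
    } flags_t;

    extern flags_t flags[];     /* one entry per thread, shared */
    extern int my_parent;       /* -1 at the root               */
    extern int my_children[];
    extern int num_children;

    void tree_barrier(int me) {
        /* UP pass: wait for all children, then report up. */
        for (int c = 0; c < num_children; c++)
            while (!flags[my_children[c]].arrived)
                ; /* spin on a cached copy of the child's flag */
        if (my_parent >= 0) {
            flags[me].arrived = 1;
            /* DOWN pass: wait for the parent's clear signal. */
            while (!flags[my_parent].clear)
                ; /* spin */
        }
        /* Propagate the clear signal down to my subtree. */
        flags[me].clear = 1;
        /* A reusable barrier alternates between two sets of
           flags so back-to-back barriers do not race. */
    }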

10
Example Tree Topologies
  • Radix 2 k-nomial tree (binomial)
  • Radix 4 k-nomial tree (quadnomial)
  • Radix 8 k-nomial tree (octnomial)
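One plausible construction of these trees: number the
threads 0..n-1, root at 0, and attach children at
increasing powers of the radix (the visit callback is a
placeholder):

    /* Enumerate the children of node v in a radix-k
       k-nomial tree over n nodes rooted at 0. */
    void knomial_children(int v, int n, int k,
                          void (*visit)(int parent, int child)) {
        int stride = 1;
        while (stride <= v) stride *= k;  /* first level above v */
        for (; stride < n; stride *= k)
            for (int j = 1; j < k; j++)
                if (v + j * stride < n)
                    visit(v, v + j * stride);
    }

With k = 2 and n = 8 this gives node 0 the children 1,
2, and 4 and node 1 the children 3 and 5, i.e. the
binomial tree.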
11
Barrier Performance Results
  • Time many back-to-back barriers
  • Flat tree is just one level with all threads
    reporting to thread 0
    Leverages shared memory but is non-scalable
  • Architecture-Independent Tree (radix 2)
  • Pick a good generic radix that is suitable for
    many platforms
  • Mismatched to architecture
  • Architecture-Dependent Tree
  • Search over all radices to pick the tree that best
    matches the architecture (sketched below)
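
A sketch of the architecture-dependent search: time the
tree barrier at each candidate radix and keep the
fastest (both helpers here are hypothetical names):

    int pick_best_radix(void) {
        int radices[] = {2, 4, 8, 16};
        int best = 2;
        double best_t = 1e30;
        for (int i = 0; i < 4; i++) {
            tree_barrier_set_radix(radices[i]);
            /* Average many back-to-back barriers. */
            double t = time_tree_barrier(1000);
            if (t < best_t) { best_t = t; best = radices[i]; }
        }
        return best;
    }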

12
Broadcast Performance Results
  • Time a latency-sensitive Broadcast (8 bytes)
  • Time Broadcast followed by Barrier and subtract
    time for Barrier
  • Yields an approximation of how long it takes the
    last thread to get the data (sketched below)
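
A sketch of that methodology in MPI terms (iteration
count and buffer are illustrative):

    #include <mpi.h>

    double time_bcast(void *buf, int bytes, int iters) {
        double t0, per_bb, per_b;
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            MPI_Bcast(buf, bytes, MPI_BYTE, 0, MPI_COMM_WORLD);
            MPI_Barrier(MPI_COMM_WORLD);
        }
        per_bb = (MPI_Wtime() - t0) / iters;
        t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++)
            MPI_Barrier(MPI_COMM_WORLD);
        per_b = (MPI_Wtime() - t0) / iters;
        /* Approximates the time for the last thread
           to get the data. */
        return per_bb - per_b;
    }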

13
Reduce To All Performance Results
  • 4 kByte (512 doubles) Reduce-To-All
  • In addition to data movement we also want to
    parallelize the computation
  • In the Flat approach, computation gets serialized
    at the root
  • Tree based approaches allow us to parallelize the
    computation amongst all the floating point units
  • 8 threads share one FPU, thus radix 2, 4, and 8
    serialize computation in about the same way

14
Optimization Summary
  • Relying on flat trees is not enough for most
    collectives
  • Architecture dependent tuning is a further and
    important optimization

15
Extending the Results to a Cluster
  • Use one rack of BlueGene/P (1024 nodes or 4096
    cores)
  • Reduce-To-All by having one representative thread
    per process call the inter-node all-reduce
  • Reduce the number of messages in the network
  • Vary the number of threads per process but use
    all cores
  • Relying purely on shared memory doesn't always
    yield the best performance
  • The number of active cores working on the
    computation drops
  • Can optimize so that computation is partitioned
    across cores (a hybrid sketch follows)
  • Not suitable for a direct call to MPI_Allreduce()
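
A plausible OpenMP + MPI sketch of that scheme (assumes
MPI initialized with MPI_THREAD_FUNNELED; the names and
the atomic-based local reduce are illustrative, not the
authors' code):

    #include <mpi.h>
    #include <omp.h>
    #include <stdlib.h>

    /* Call from inside an OpenMP parallel region. */
    void hybrid_allreduce(const double *in, double *out, int n) {
        static double *node_sum;  /* shared by the node's threads */
        #pragma omp single
        node_sum = calloc(n, sizeof(double));  /* implied barrier */

        /* Combine per-thread contributions in shared memory
           (the work could instead be partitioned across cores;
           atomics keep the sketch short). */
        for (int i = 0; i < n; i++) {
            #pragma omp atomic
            node_sum[i] += in[i];
        }
        #pragma omp barrier

        /* One representative thread per process makes the
           inter-node call, reducing messages in the network. */
        #pragma omp master
        MPI_Allreduce(MPI_IN_PLACE, node_sum, n, MPI_DOUBLE,
                      MPI_SUM, MPI_COMM_WORLD);
        #pragma omp barrier

        for (int i = 0; i < n; i++)
            out[i] = node_sum[i];  /* all threads read the result */
        #pragma omp barrier
        #pragma omp single
        free(node_sum);
    }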

16
Potential Synchronization Problem
  • 1. Broadcast variable x from root
  • 2. Have proc 1 set a new value for x on proc 4

    broadcast x = 1 from proc 0;
    if (myid == 1)
        put x = 5 to proc 4;
    else
        /* do nothing */;
Proc 1 thinks the collective is done
The put of x = 5 by proc 1 has been lost
Proc 1 observes a globally incomplete collective
17
Strict v. Loose Synchronization
  • A fix to the problem
  • Add a barrier before/after the collective (see
    the sketch below)
  • Enforces a global ordering of the operations
  • Is there a problem?
  • We want to decouple synchronization from data
    movement
  • Specify the synchronization requirements
  • Potential to aggregate synchronization
  • Done by the user or a smart compiler
  • How can we realize these gains in applications?
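
In the slide-16 example, the strict fix looks like this
(same pseudocode style; the barrier guarantees the
broadcast is globally complete before the dependent put
is issued):

    broadcast x = 1 from proc 0;
    barrier();               /* global ordering point */
    if (myid == 1)
        put x = 5 to proc 4; /* can no longer be overwritten
                                by the in-flight broadcast */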

18
Conclusions
  • Processes → Threads is a crucial optimization for
    single-node collective communication
  • Can use tree-based collectives to realize better
    performance, even for collectives on one node
  • Picking the correct tree that best matches
    architecture yields the best performance
  • Multicore adds to the (auto)tuning space for
    collective communication
  • Shared memory semantics allow us to create new
    loosely synchronized collectives

19
  • Questions?

20
  • Backup Slides

21
Threads and Processes
  • Threads
  • A sequence of instructions and an execution stack
  • Communication between threads goes through a
    common, shared address space
  • No OS/Network involvement needed
  • Reasoning about inter-thread communication can be
    tricky
  • Processes
  • A set of threads and an associated memory space
  • All threads within process share address space
  • Communication between processes must be managed
    through the OS
  • Inter-process communication is explicit but may
    be slow
  • More expensive to switch between processes

22
Experimental Platforms
Clovertown
Niagara2
BG/P
23
Specs
24
Details of Signaling
  • For optimum performance have many readers and one
    writer
  • Each thread sets a flag (a single word) that
    others will read
  • Every reader will get a copy of the cache-line
    and spin on that copy
  • When the writer comes in and changes the value of
    the variable, the cache-coherency system handles
    broadcasting/updating the changes
  • Avoid atomic primitives
  • On the way up the tree, a child sets a flag
    indicating that its subtree has arrived
  • Parent spins on that flag for each child
  • On the way down, each child spins on its parent's
    flag
  • When it's set, it indicates that the parent wants
    to broadcast the clear signal down
  • Flags must be on different cache lines to avoid
    false sharing
  • Need to switch back and forth between two sets of
    flags (see the sketch below)
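
A sketch of the flag discipline described above; the
64-byte line size and all names are assumptions:

    #define CACHE_LINE  64   /* assumed line size   */
    #define MAX_THREADS 64

    /* One flag per cache line so spinning readers do
       not suffer false sharing. */
    typedef struct {
        volatile int val;
        char pad[CACHE_LINE - sizeof(int)];
    } padded_flag_t;

    /* Two sets of flags: barrier i uses set i % 2, so
       the other set can be reset without racing with
       threads still spinning on the current set. */
    padded_flag_t arrive[2][MAX_THREADS];
    padded_flag_t clear_sig[2][MAX_THREADS];

    static void signal_flag(padded_flag_t *f) { f->val = 1; }

    static void wait_flag(padded_flag_t *f) {
        /* Each reader spins on its own cached copy of
           the line; when the single writer updates it,
           the coherence protocol propagates the change.
           No atomic read-modify-write is needed. */
        while (!f->val)
            ;
    }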