Title: Synchronization and Costs for Shared Memory
1. Topic 5: Synchronization and Costs for Shared Memory
"... You will be assimilated. Resistance is futile." (Star Trek)
2. Synchronization
- The orchestration of two or more threads (or processes) to complete a task in a correct manner and avoid data races
- Data Race (or Race Condition)
  - An anomaly of concurrent accesses by two or more threads to a shared memory location, where at least one of the accesses is a write
- Atomicity and/or serializability
3. Atomicity
- Atomic? From the Greek "atomos", meaning indivisible
- An all-or-none scheme
  - An instruction (or a group of them) appears as if it was (they were) executed in a single step
  - All side effects of the instruction(s) in the block are seen in their totality, or not at all
- Side effects? Writes and (causal) reads to the variables inside the atomic block
4. Atomicity
- Word-aligned loads and stores are atomic in almost all architectures
- Unaligned and larger-than-word accesses are usually not atomic
- What happens when non-atomic operations go wrong?
  - The final result may be a garbled combination of values
  - Complete operations might be lost in the process
- Strong versus weak atomicity
5. Synchronization
- Applied to shared variables
- Synchronization may or may not enforce ordering
- High-level synchronization types:
  - Semaphores
  - Mutexes
  - Barriers
  - Critical Sections
  - Monitors
  - Condition Variables
6. Semaphores
- Intelligent counters of resources
  - Zero means not available
- An abstract data type with two operations:
  - P (Dutch "probeer te verlagen", try to decrease): waits (busy-waits or sleeps) if the resource is not available
  - V (Dutch "verhoog", increase): frees the resource
- Binary vs. blocking vs. counting semaphores
  - Binary: the initial value will allow threads to obtain it
  - Blocking: the initial value will block the threads
  - Counting: the initial value is not zero
- Note: P and V are atomic operations! (A sketch follows below.)
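The counting behavior of P and V can be sketched with C11 atomics. This is a minimal busy-waiting illustration under assumed names (csem_t, csem_P, csem_V are invented for the example); a real semaphore would sleep instead of spinning:

    #include <stdatomic.h>

    typedef struct { atomic_int count; } csem_t;      /* the "intelligent counter" */

    void csem_init(csem_t *s, int initial) {          /* 0 blocks; >= 1 admits threads */
        atomic_init(&s->count, initial);
    }

    /* P ("probeer te verlagen"): busy-wait until the counter can be decreased. */
    void csem_P(csem_t *s) {
        for (;;) {
            int c = atomic_load(&s->count);
            if (c > 0 && atomic_compare_exchange_weak(&s->count, &c, c - 1))
                return;                               /* resource obtained */
        }
    }

    /* V ("verhoog"): increase the counter, freeing one unit of the resource. */
    void csem_V(csem_t *s) {
        atomic_fetch_add(&s->count, 1);
    }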
7. Mutex
- Mutual Exclusion lock
- A binary semaphore that ensures that one thread (and only one) will access the resource
  - P → lock the mutex
  - V → unlock the mutex
- It doesn't enforce ordering
- Fine-grained vs. coarse-grained locking (a short usage sketch follows below)
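As a usage sketch, a POSIX mutex makes the P/V pairing concrete; the worker function and the shared counter are illustrative only:

    #include <pthread.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static long shared_counter = 0;        /* the protected resource */

    void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&m);        /* P: lock the mutex */
            shared_counter++;              /* critical section: one thread at a time */
            pthread_mutex_unlock(&m);      /* V: unlock the mutex */
        }
        return NULL;
    }

Note that nothing here says which waiting thread acquires the mutex next; that is the "no ordering" caveat above.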
8. Barriers
- A high-level programming construct
- Ensures that all participating threads wait at a program point for all other (participating) threads to arrive before they can continue
- Types of barriers:
  - Tree barriers (software assisted)
  - Centralized barriers
  - Tournament barriers
  - Fine-grained barriers
  - Butterfly-style barriers
  - Consistency barriers (e.g., #pragma omp flush)
9. Critical Sections
- A piece of code that is executed by one and only one thread at any point in time
- If thread T1 finds the critical section in use, it waits until the critical section is free for it to use
- Special case:
  - Conditional critical sections: threads wait on a given signal to resume execution
  - Better implemented with lock-free techniques (e.g., Transactional Memory)
10. Monitors and Condition Variables
- A monitor consists of:
  - A set of procedures that work on shared variables
  - A set of shared variables
  - An invariant
  - A lock to protect against access by other threads
- Condition variables
  - Express the invariant in a monitor (but they can be used in other schemes)
  - A signal placeholder for other threads' activities
11. Much More
- However, all of these are abstractions
- Major elements:
  - A synchronization element that ensures atomicity: locks!
  - A synchronization element that ensures ordering: barriers!
- Implementations and types
  - Common types of atomic primitives
  - Read-Modify-Write-Back cycles
- Synchronization overhead may break a system
  - Unnecessary consistency actions
  - Communication cost between threads
- Why do distributed-memory machines have implicit synchronization?
12. Topic 5a
13. Implementation
- Atomic primitives
- Fetch-and-Φ operations
  - Read-Modify-Write cycles
  - Test-and-Set
  - Fetch-and-Store
    - Exchange register and memory
  - Fetch-and-Add
  - Compare-and-Swap
    - Conditionally exchange the value of a memory location (sketched below)
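As an illustration of a read-modify-write cycle, fetch-and-add can be built out of compare-and-swap. A minimal C11 sketch (the function name fetch_and_add is ours):

    #include <stdatomic.h>

    /* One read-modify-write cycle: read the old value, compute the new one,
       and write it back only if nobody changed the location in between. */
    int fetch_and_add(atomic_int *addr, int delta) {
        int old = atomic_load(addr);
        while (!atomic_compare_exchange_weak(addr, &old, old + delta))
            ;   /* on failure, 'old' is reloaded with the current value; retry */
        return old;
    }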
14. Implementation
- Used by programmers to implement more complex synchronization constructs
- Waiting behavior:
  - Scheduler-based: the process/thread is de-scheduled and will be scheduled again at a future time
  - Busy-wait: the process/thread polls on the resource until it is available
- Dependent on the hardware / OS / scheduler behavior
15. Types of (Software) Locks: The Spin Lock Family
- The simple Test-and-Set lock
  - Polls a shared Boolean variable (a binary semaphore)
  - Uses Fetch-and-Φ operations to operate on the binary semaphore
  - Expensive!
    - Wastes bandwidth
    - Generates extra bus transactions
- The test-and-test-and-set approach
  - Just poll (with plain reads) while the lock is in use; see the sketch below
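A minimal test-and-test-and-set sketch in C11 atomics (all names are illustrative): spin with plain reads while the lock is held, and issue the expensive atomic exchange only when the lock looks free:

    #include <stdatomic.h>

    typedef atomic_int ttas_lock_t;          /* 0 = unlocked, 1 = locked */

    void ttas_acquire(ttas_lock_t *L) {
        for (;;) {
            while (atomic_load(L) != 0)      /* "test": read-only, cache-friendly */
                ;
            if (atomic_exchange(L, 1) == 0)  /* "test-and-set": one bus transaction */
                return;
        }
    }

    void ttas_release(ttas_lock_t *L) {
        atomic_store(L, 0);
    }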
16. Types of (Software) Locks: The Spin Lock Family
- Delay-based locks
  - Spin locks in which a delay has been introduced into the testing of the lock
  - Constant delay
  - Exponential back-off
    - Best results
  - The test-and-test-and-set scheme is not needed
17. Types of (Software) Locks: The Spin Lock Family
Pseudocode (test-and-set lock with exponential back-off):

    enum LOCK_ACTIONS { LOCKED, UNLOCKED };

    void acquire_lock(lock_t *L) {
        int delay = 1;
        while (!test_and_set(L, LOCKED)) {  /* failed to grab the lock */
            sleep(delay);
            delay *= 2;                     /* exponential back-off */
        }
    }

    void release_lock(lock_t *L) {
        *L = UNLOCKED;
    }
18. Types of (Software) Locks: The Ticket Lock
- Reduces the number of Fetch-and-Φ operations
  - Only one per lock acquisition
- A strongly fair lock
  - No starvation
  - FIFO service
- Implementation: two counters
  - A request counter and a release counter
19-25. Types of (Software) Locks: The Ticket Lock
Example trace with five threads (T1-T5); the counters are the values displayed on each slide:

    Request | Release | Event
    --------+---------+--------------------------------------------------
       0    |    0    | T1 acquires the lock
       1    |    0    | T2 requests the lock
       2    |    0    | T3 requests the lock
       3    |    1    | T1 releases the lock; T2 gets it; T4 requests it
       4    |    1    | T5 requests the lock
       5    |    1    | T1 requests the lock again
       5    |    2    | T2 releases the lock; T3 acquires it
26. Types of (Software) Locks: The Ticket Lock
- Reduces the number of Fetch-and-Φ operations
  - Only read operations on the release counter
- However, a lot of memory and network bandwidth is still wasted
- Back-off techniques are also used:
  - Exponential back-off
    - A bad idea here: the thread whose ticket is next may back off for too long and delay everyone behind it
  - Constant delay
    - The minimum time a lock is held
  - Proportional back-off
    - Dependent on how many threads are waiting for the lock
27. Types of (Software) Locks: The Ticket Lock
Pseudocode (with proportional back-off):

    unsigned int next_ticket = 0;
    unsigned int now_serving = 0;

    void acquire_lock() {
        unsigned int my_ticket = fetch_and_increment(&next_ticket);
        while (1) {
            sleep(my_ticket - now_serving);   /* proportional back-off */
            if (now_serving == my_ticket)
                return;
        }
    }

    void release_lock() {
        now_serving = now_serving + 1;
    }
28. Types of (Software) Locks: The Array-Based Queue Lock
- Motivation: contention on the ticket lock's release counter
  - Cache coherence and memory traffic: every release invalidates the counter variable, and all the waiters' requests then hit a single memory bank
- Two elements:
  - An array, and a tail pointer that indexes that array
  - The array is as big as the number of processors
- Fetch-and-store → the address of the array element
- Fetch-and-increment → the tail pointer
- FIFO ordering (a sketch follows below)
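A sketch of the array-based queue lock in C11 atomics, in the spirit of Anderson's lock; NPROCS, the cache-line padding, and all names are assumptions of this example:

    #include <stdatomic.h>
    #include <stdbool.h>

    #define NPROCS 8                       /* array is as big as the processor count */

    typedef struct {
        struct {
            atomic_bool must_wait;
            char pad[63];                  /* pad to an assumed 64-byte cache line */
        } slot[NPROCS];
        atomic_uint tail;                  /* index of the next free slot */
    } alock_t;

    void alock_init(alock_t *L) {
        for (int i = 0; i < NPROCS; i++)
            atomic_init(&L->slot[i].must_wait, i != 0);  /* slot 0 starts as "enter" */
        atomic_init(&L->tail, 0);
    }

    unsigned alock_acquire(alock_t *L) {
        unsigned me = atomic_fetch_add(&L->tail, 1) % NPROCS;  /* take a slot */
        while (atomic_load(&L->slot[me].must_wait))            /* spin on my own slot */
            ;
        return me;                         /* the slot index is passed to release */
    }

    void alock_release(alock_t *L, unsigned me) {
        atomic_store(&L->slot[me].must_wait, true);                  /* re-arm my slot */
        atomic_store(&L->slot[(me + 1) % NPROCS].must_wait, false); /* hand off */
    }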
29-36. Types of (Software) Locks: The Array-Based Queue Lock
Example trace with five threads (T1-T5). Each requester takes the slot indicated by the tail pointer and spins until its slot is marked "enter"; all other slots are marked "wait":
- Initially the tail pointer points to the beginning of the array; all array elements except the first are marked "wait"
- T1 gets the lock (its slot reads "enter")
- T2 requests the lock, takes the next slot, and spins on it
- T3 requests the lock and takes the following slot
- T1 releases: the next slot is set to "enter" and T2 gets the lock
- T4 requests the lock
- T1 requests the lock again
- T2 releases and T3 gets the lock
37. Types of (Software) Locks: The Queue Locks
- They use too much memory
  - Linear space (relative to the number of processors) per lock
- Array version
  - Easy to implement
- Linked-list version (QNODEs)
  - Cache management
38. Types of (Software) Locks: The MCS Lock
- Characteristics:
  - FIFO ordering
  - Spins on locally accessible flag variables
  - Small amount of space per lock
  - Works equally well on machines with and without coherent caches
- Similar to the QNODE implementation of queue locks
  - QNODEs are allocated in local memory
  - Threads spin on local memory
39. MCS: How Does It Work?
- Each processor enqueues its own private lock variable into a queue and spins on it
  - Key idea: spin locally
    - CC model: spin in the local cache
    - DSM model: spin in local private memory
  - No contention while spinning
- On lock release, the releaser unlocks the next lock in the queue
  - Bus/network contention occurs only on the actual unlock
- No starvation (the order of lock acquisitions is defined by the list)
40. MCS Lock
- Requires atomic instructions:
  - compare-and-swap
  - fetch-and-store
- If there is no compare-and-swap:
  - an alternative release algorithm is needed, with
  - extra complexity,
  - loss of strict FIFO ordering, and
  - a theoretical possibility of starvation
- Details: Mellor-Crummey and Scott's 1991 paper (a sketch follows below)
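A compact MCS sketch in C11 atomics, following the queue discipline described above; the field and function names are ours, not the paper's pseudocode:

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct mcs_node {
        _Atomic(struct mcs_node *) next;
        atomic_bool locked;                   /* true while this thread must wait */
    } mcs_node_t;

    typedef _Atomic(mcs_node_t *) mcs_lock_t; /* tail of the queue; NULL if free */

    void mcs_acquire(mcs_lock_t *lock, mcs_node_t *me) {
        atomic_store(&me->next, NULL);
        /* fetch-and-store: atomically append ourselves to the tail */
        mcs_node_t *pred = atomic_exchange(lock, me);
        if (pred != NULL) {                   /* someone holds the lock */
            atomic_store(&me->locked, true);
            atomic_store(&pred->next, me);
            while (atomic_load(&me->locked))  /* spin on our own local flag */
                ;
        }
    }

    void mcs_release(mcs_lock_t *lock, mcs_node_t *me) {
        mcs_node_t *succ = atomic_load(&me->next);
        if (succ == NULL) {
            /* no known successor: try to swing the tail back to NULL */
            mcs_node_t *expected = me;
            if (atomic_compare_exchange_strong(lock, &expected, NULL))
                return;                       /* lock is free again */
            /* a successor is enqueueing; wait for it to link itself */
            while ((succ = atomic_load(&me->next)) == NULL)
                ;
        }
        atomic_store(&succ->locked, false);   /* hand the lock to the successor */
    }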
41. MCS Example
[Figure: a queue of QNODEs for CPU 1 through CPU 4; "Init", "Proc 1 gets the lock", "Proc 2 tries".]
- CPU 1 holds the real lock
- CPU 2, CPU 3, and CPU 4 spin on their flags
- When CPU 1 releases, it releases the lock and changes the flag variable of the next CPU in the list
42. Implementation: Modern Alternatives
- Fetch-and-Φ operations are restrictive
  - Not all architectures support all of them
- Problem: a single, general atomic operation is hard to implement!
- Solution: provide two primitives from which atomic operations can be built
  - Load-Linked and Store-Conditional
  - Remember the PowerPC lwarx and stwcx instructions
43. An Example: Swap
Exchange the contents of register R4 with the memory location pointed to by R1:

    try: mov R3, R4
         ld  R2, 0(R1)
         st  R3, 0(R1)
         mov R4, R2

Not atomic!
44. An Example: Atomic Swap
Swap (fetch-and-store) using ll and sc:

    try: mov  R3, R4
         ll   R2, 0(R1)
         sc   R3, 0(R1)
         beqz R3, try
         mov  R4, R2

If another processor writes to the value pointed to by R1 before the sc can complete, the reservation (usually kept in a register) is lost. This means that the sc will fail and the code will loop back and try again.
45. Another Example: Fetch-and-Increment and Spin Lock
Fetch-and-increment using ll-sc:

    try: ll   R2, 0(R1)
         addi R2, R2, 1
         sc   R2, 0(R1)
         beqz R2, try

Spin lock using ll-sc. The exch instruction is equivalent to the atomic swap instruction block presented earlier. Assume that the lock is not cacheable. Note: 0 → unlocked, 1 → locked.

            li   R2, 1
    lockit: exch R2, 0(R1)
            bnez R2, lockit
46. Performance Penalty
- Example: Suppose there are 10 processors on a bus, each trying to lock a variable simultaneously. Assume that each bus transaction (read miss or write miss) is 100 clock cycles long. You can ignore the time of the actual read or write of a lock held in the cache, as well as the time the lock is held (they won't matter much!). Determine the performance penalty.
47. Answer
- It takes over 12,000 cycles in total for all processors to pass through the lock!
- Roughly: each time the lock is released, every processor still contending re-reads the lock and attempts the exchange, so the number of bus transactions grows with the square of the processor count.
- Note the contention for the lock and the serialization of the bus transactions.
See the example on p. 596 of Hennessy and Patterson, 3rd ed.
48. Performance Penalty
- Assume the same example as before (100 cycles per bus transaction, 10 processors), but consider the case of a queue lock, which only updates the lock on a miss (Hennessy and Patterson, p. 603).
49. Performance Penalty
- Answer:
  - First acquisition: n + 1 bus transactions
  - Subsequent handoffs: 2(n - 1) bus transactions
  - Total: 3n - 1 = 29 bus transactions, or 2,900 clock cycles
50. Implementing Locks Using Coherence
Test-and-test-and-set with an atomic exchange:

    lockit: ld   R2, 0(R1)
            bnez R2, lockit
            li   R2, 1
            exch R2, 0(R1)
            bnez R2, lockit

The same lock using ll-sc:

    lockit: ll   R2, 0(R1)
            bnez R2, lockit
            li   R2, 1
            sc   R2, 0(R1)
            beqz R2, lockit
51. Some Graphs
[Figure: increase in network latency on a sixty-processor Butterfly; performance of spin locks on a Butterfly. The x-axis represents processors, the y-axis time in microseconds.]
Extracted from "Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors", John M. Mellor-Crummey and Michael L. Scott, January 1991.
52. Topic 5b
53. The Barrier Construct
- The idea behind software barriers:
  - A program point at which all participating threads wait for each other to arrive before continuing
- Difficulty:
  - The overhead of synchronizing the threads
  - Network and memory bandwidth issues
- Implementation:
  - Centralized
    - Simple to implement with locks
  - Tree-based
    - Better with bandwidth
54. Centralized Barriers
- A plain barrier in which all threads/processors wait for each other at a single, serialized point
- Typical implementation:
  - Two spin targets: one on which threads wait for all the others to arrive, and one that keeps a tally of the arrived threads
- A thread arrives at the barrier and increments the counter by one (atomically)
- It then checks whether it is the last one:
  - If it isn't, it waits
  - If it is, it unblocks (wakes) the rest of the threads
55. Centralized Barrier
Pseudocode:

    int  count = 0;
    bool sense = true;

    void central_barrier() {
        lock(L);
        if (count == 0)
            sense = 0;          /* first arrival lowers the flag */
        count++;
        unlock(L);
        if (count == PROCESSORS) {
            sense = 1;          /* last arrival raises the flag... */
            count = 0;          /* ...and resets the counter */
        } else {
            spin(sense == 1);   /* wait until the flag is raised */
        }
    }

It may deadlock or malfunction!
56. Centralized Barrier
Why it malfunctions (Barrier 1 → Work → Barrier 2):
- T1 arrives at the first barrier, increments count, and spins
- T2 arrives at the barrier, increments count, and spins
- T3 arrives at the barrier, increments count, and changes sense, releasing the others
- T3 is then delayed while T1 does its work
- T1 reaches the next barrier and increments count; then it is delayed
- T3 resumes and resets count, still finishing the first barrier
- T2 and T3 arrive at the second barrier and spin forever
57. Centralized Barrier
Pseudocode: the sense-reversing barrier:

    int  count = 0;
    bool sense = true;

    void central_barrier() {
        static bool local_sense = true;   /* per-thread in a real implementation */
        local_sense = !local_sense;       /* flip the sense for this episode */
        lock(L);
        count++;
        if (count == PROCESSORS) {
            count = 0;                    /* last arrival resets the counter... */
            sense = local_sense;          /* ...and releases everyone */
        }
        unlock(L);
        spin(sense == local_sense);       /* wait for the global sense to flip */
    }

It waits correctly because the spin target distinguishes the previous barrier episode (the old local_sense) from the current one (the new local_sense).
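For comparison, a runnable C11 sketch of the sense-reversing barrier; it replaces the lock around the counter with an atomic fetch-and-add and uses a thread-local sense. Names such as barrier_wait and NTHREADS are assumptions of this example:

    #include <stdatomic.h>
    #include <stdbool.h>

    #define NTHREADS 4

    static atomic_int  count = 0;
    static atomic_bool sense = true;
    static _Thread_local bool local_sense = true;

    void barrier_wait(void) {
        local_sense = !local_sense;                 /* flip this thread's sense */
        if (atomic_fetch_add(&count, 1) + 1 == NTHREADS) {
            atomic_store(&count, 0);                /* last arrival resets... */
            atomic_store(&sense, local_sense);      /* ...and releases everyone */
        } else {
            while (atomic_load(&sense) != local_sense)
                ;                                   /* spin until sense flips */
        }
    }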
58. Centralized Barrier
Performance: Suppose there are 10 processors on a bus, each trying to execute a barrier simultaneously. Assume that each bus transaction is 100 clock cycles, as before. You can ignore the time of the actual read or write of a lock held in the cache, as well as the time to execute other non-synchronization operations in the barrier implementation. Determine the number of bus transactions required for all 10 processors to reach the barrier, be released from it, and exit it. Assume that the bus is totally fair, so that every pending request is serviced before a new request, and that the processors are equally fast. Don't worry about counting the processors out of the barrier. How long will the entire process take? (Hennessy and Patterson, p. 598)
59. Centralized Barrier
- Steps through the barrier, assuming an ll-sc lock is used. For the i-th processor:
  - LL the lock → i times
  - SC the lock → i times
  - Load count → 1 time
  - LL the lock again → i - 1 times
  - Store count → 1 time
  - Store lock → 1 time
  - Load sense → 2 times
- Total transactions for the i-th processor: 3i + 4
- Total: (3n^2 + 11n)/2 - 1
- For n = 10: 204 bus transactions → 20,400 clock cycles
60. Tree-Type Barriers
- The software combining tree barrier
  - A single shared variable becomes a tree of variables
  - Each parent node combines the results of its children
  - A group of processors per leaf
    - The last processor to arrive updates the leaf and then moves up
  - A two-pass scheme:
    - Bottom-up → update the counts
    - Top-down → update the sense and resume
- Objective:
  - Reduces memory contention
- Disadvantage:
  - Processors spin on memory locations whose positions cannot be statically determined
61. Tree-Type Barriers
- The Butterfly barrier
  - Based on the butterfly network scheme for broadcasting and reduction
  - Pairwise synchronizations
  - At step k, processor i signals processor i XOR 2^k
  - If the number of processors is not a power of two, existing processors stand in for the missing ones
  - Maximum number of synchronizations: 2 * floor(log2 P)
62. Tree-Type Barriers
- The Dissemination barrier (sketched below)
  - Similar to the Butterfly barrier, but with fewer synchronization operations → ceil(log2 P)
  - At step k, processor i signals processor (i + 2^k) mod P
  - Advantage:
    - The flags on which each processor spins are statically assigned (better locality)
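A sketch of the dissemination pattern in C11; NPROC, ROUNDS, and the counter-based flags are assumptions of this example (implementations often use sense-reversing flags instead):

    #include <stdatomic.h>

    #define NPROC  6
    #define ROUNDS 3                          /* ceil(log2(NPROC)) */

    static atomic_int flag[ROUNDS][NPROC];    /* flag[k][j]: round-k signals for j */

    void dissemination_barrier(int i) {       /* called by processor i */
        for (int k = 0; k < ROUNDS; k++) {
            int partner = (i + (1 << k)) % NPROC;    /* whom I signal in round k */
            atomic_fetch_add(&flag[k][partner], 1);  /* signal my partner */
            while (atomic_load(&flag[k][i]) == 0)    /* wait for my own signal */
                ;
            atomic_fetch_sub(&flag[k][i], 1);        /* consume it, so the barrier is reusable */
        }
    }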
63. Tree-Type Barriers
- Tournament barriers
  - A tree-style barrier
  - A round of the tournament is a level of the tree
  - Winners are statically decided, so no Fetch-and-Φ operations are needed
  - Processor i sets a flag that processor j is awaiting; processor i then drops from the tournament and j continues
  - The final processor wakes all the others
- Types:
  - CREW (concurrent read, exclusive write): a global variable is used to signal back
  - EREW (exclusive read, exclusive write): separate flags, on each of which a single processor spins
64Bibliography
- Paterson and Hennessy. Chapter 6
Multiprocessors and Thread Level Parallelism - Mellor-Crummey, John Scott, Michael. Algorithms
for Scalable Synchronization on Shared Memory
Multiprocessors. January 1991.