Title: Synchronization and Costs for Shared Memory
1. Topic 5: Synchronization and Costs for Shared Memory
"... You will be assimilated. Resistance is futile." (Star Trek)
2. Synchronization
- The orchestration of two or more threads (or processes) to complete a task in a correct manner and avoid data races
- Data Race (or Race Condition)
  - An anomaly of concurrent accesses by two or more threads to a shared memory location, where at least one of the accesses is a write
- Atomicity and/or serializability
3. Atomicity
- Atomic? From the Greek "atomos", meaning indivisible
- An all-or-none scheme
  - An instruction (or a group of them) appears as if it was (they were) executed in a single step
  - All side effects of the instruction(s) in the block are seen in their totality, or not at all
- Side effects? Writes and (causal) reads to the variables inside the atomic block
4. Atomicity
- Word-aligned loads and stores are atomic in almost all architectures
- Unaligned and larger-than-word accesses are usually not atomic
- What happens when non-atomic operations go wrong?
  - The final result may be a garbled combination of values
  - Complete operations might be lost in the process
- Strong versus weak atomicity
5. Synchronization
- Applied to shared variables
- Synchronization may or may not enforce ordering
- High-level synchronization types:
  - Semaphores
  - Mutexes
  - Barriers
  - Critical Sections
  - Monitors
  - Condition Variables
6. Semaphores
- Intelligent counters of resources
  - Zero means not available
- An abstract data type with two operations:
  - P (Dutch "probeer te verlagen", try to decrease): waits (busy-waits or sleeps) if the resource is not available
  - V (Dutch "verhoog", increase): frees the resource
- Binary vs. blocking vs. counting semaphores
  - Binary: the initial value will allow threads to obtain it
  - Blocking: the initial value will block the threads
  - Counting: the initial value is not zero
- Note: P and V are atomic operations! (A sketch follows below.)
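The counting behavior of P and V can be sketched with C11 atomics. This is a minimal busy-waiting illustration under assumed names (csem_t, csem_P, csem_V are invented for the example); a real semaphore would sleep instead of spinning:

    #include <stdatomic.h>

    typedef struct { atomic_int count; } csem_t;      /* the "intelligent counter" */

    void csem_init(csem_t *s, int initial) {          /* 0 blocks; >= 1 admits threads */
        atomic_init(&s->count, initial);
    }

    /* P ("probeer te verlagen"): busy-wait until the counter can be decreased. */
    void csem_P(csem_t *s) {
        for (;;) {
            int c = atomic_load(&s->count);
            if (c > 0 && atomic_compare_exchange_weak(&s->count, &c, c - 1))
                return;                               /* resource obtained */
        }
    }

    /* V ("verhoog"): increase the counter, freeing one unit of the resource. */
    void csem_V(csem_t *s) {
        atomic_fetch_add(&s->count, 1);
    }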
7. Mutex
- Mutual Exclusion lock
- A binary semaphore that ensures that one thread (and only one) will access the resource
  - P → lock the mutex
  - V → unlock the mutex
- It doesn't enforce ordering
- Fine-grained vs. coarse-grained locking (a short usage sketch follows below)
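As a usage sketch, a POSIX mutex makes the P/V pairing concrete; the worker function and the shared counter are illustrative only:

    #include <pthread.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static long shared_counter = 0;        /* the protected resource */

    void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&m);        /* P: lock the mutex */
            shared_counter++;              /* critical section: one thread at a time */
            pthread_mutex_unlock(&m);      /* V: unlock the mutex */
        }
        return NULL;
    }

Note that nothing here says which waiting thread acquires the mutex next; that is the "no ordering" caveat above.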
8. Barriers
- A high-level programming construct
- Ensures that all participating threads wait at a program point for all other (participating) threads to arrive before they can continue
- Types of barriers:
  - Tree barriers (software assisted)
  - Centralized barriers
  - Tournament barriers
  - Fine-grained barriers
  - Butterfly-style barriers
  - Consistency barriers (e.g., #pragma omp flush)
9. Critical Sections
- A piece of code that is executed by one and only one thread at any point in time
- If thread T1 finds the critical section in use, it waits until the critical section is free for it to use
- Special case:
  - Conditional critical sections: threads wait on a given signal to resume execution
  - Better implemented with lock-free techniques (e.g., Transactional Memory)
10. Monitors and Condition Variables
- A monitor consists of:
  - A set of procedures that work on shared variables
  - A set of shared variables
  - An invariant
  - A lock to protect against access by other threads
- Condition variables
  - Express the invariant in a monitor (but they can be used in other schemes)
  - A signal placeholder for other threads' activities
11. Much More
- However, all of these are abstractions
- Major elements:
  - A synchronization element that ensures atomicity: locks!
  - A synchronization element that ensures ordering: barriers!
- Implementations and types
  - Common types of atomic primitives
  - Read-Modify-Write-Back cycles
- Synchronization overhead may break a system
  - Unnecessary consistency actions
  - Communication cost between threads
- Why do distributed-memory machines have implicit synchronization?
12. Topic 5a
13. Implementation
- Atomic primitives
- Fetch-and-Φ operations
  - Read-Modify-Write cycles
  - Test-and-Set
  - Fetch-and-Store
    - Exchange register and memory
  - Fetch-and-Add
  - Compare-and-Swap
    - Conditionally exchange the value of a memory location (sketched below)
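As an illustration of a read-modify-write cycle, fetch-and-add can be built out of compare-and-swap. A minimal C11 sketch (the function name fetch_and_add is ours):

    #include <stdatomic.h>

    /* One read-modify-write cycle: read the old value, compute the new one,
       and write it back only if nobody changed the location in between. */
    int fetch_and_add(atomic_int *addr, int delta) {
        int old = atomic_load(addr);
        while (!atomic_compare_exchange_weak(addr, &old, old + delta))
            ;   /* on failure, 'old' is reloaded with the current value; retry */
        return old;
    }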
14. Implementation
- Used by programmers to implement more complex synchronization constructs
- Waiting behavior:
  - Scheduler-based: the process/thread is de-scheduled and will be scheduled again at a future time
  - Busy-wait: the process/thread polls on the resource until it is available
- Dependent on the hardware / OS / scheduler behavior
15. Types of (Software) Locks: The Spin Lock Family
- The simple Test-and-Set lock
  - Polls a shared Boolean variable (a binary semaphore)
  - Uses Fetch-and-Φ operations to operate on the binary semaphore
  - Expensive!
    - Wastes bandwidth
    - Generates extra bus transactions
- The test-and-test-and-set approach
  - Just poll (with plain reads) while the lock is in use; see the sketch below
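A minimal test-and-test-and-set sketch in C11 atomics (all names are illustrative): spin with plain reads while the lock is held, and issue the expensive atomic exchange only when the lock looks free:

    #include <stdatomic.h>

    typedef atomic_int ttas_lock_t;          /* 0 = unlocked, 1 = locked */

    void ttas_acquire(ttas_lock_t *L) {
        for (;;) {
            while (atomic_load(L) != 0)      /* "test": read-only, cache-friendly */
                ;
            if (atomic_exchange(L, 1) == 0)  /* "test-and-set": one bus transaction */
                return;
        }
    }

    void ttas_release(ttas_lock_t *L) {
        atomic_store(L, 0);
    }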
16. Types of (Software) Locks: The Spin Lock Family
- Delay-based locks
  - Spin locks in which a delay has been introduced into the testing of the lock
  - Constant delay
  - Exponential back-off
    - Best results
  - The test-and-test-and-set scheme is not needed
17. Types of (Software) Locks: The Spin Lock Family
Pseudocode (test-and-set lock with exponential back-off):

    enum LOCK_ACTIONS { LOCKED, UNLOCKED };

    void acquire_lock(lock_t *L) {
        int delay = 1;
        while (!test_and_set(L, LOCKED)) {  /* failed to grab the lock */
            sleep(delay);
            delay *= 2;                     /* exponential back-off */
        }
    }

    void release_lock(lock_t *L) {
        *L = UNLOCKED;
    }
18. Types of (Software) Locks: The Ticket Lock
- Reduces the number of Fetch-and-Φ operations
  - Only one per lock acquisition
- A strongly fair lock
  - No starvation
  - FIFO service
- Implementation: two counters
  - A request counter and a release counter
19-25. Types of (Software) Locks: The Ticket Lock
Example trace with five threads (T1-T5); the counters are the values displayed on each slide:

    Request | Release | Event
    --------+---------+--------------------------------------------------
       0    |    0    | T1 acquires the lock
       1    |    0    | T2 requests the lock
       2    |    0    | T3 requests the lock
       3    |    1    | T1 releases the lock; T2 gets it; T4 requests it
       4    |    1    | T5 requests the lock
       5    |    1    | T1 requests the lock again
       5    |    2    | T2 releases the lock; T3 acquires it
26. Types of (Software) Locks: The Ticket Lock
- Reduces the number of Fetch-and-Φ operations
  - Only read operations on the release counter
- However, a lot of memory and network bandwidth is still wasted
- Back-off techniques are also used:
  - Exponential back-off
    - A bad idea here: the thread whose ticket is next may back off for too long and delay everyone behind it
  - Constant delay
    - The minimum time a lock is held
  - Proportional back-off
    - Dependent on how many threads are waiting for the lock
27. Types of (Software) Locks: The Ticket Lock
Pseudocode (with proportional back-off):

    unsigned int next_ticket = 0;
    unsigned int now_serving = 0;

    void acquire_lock() {
        unsigned int my_ticket = fetch_and_increment(&next_ticket);
        while (1) {
            sleep(my_ticket - now_serving);   /* proportional back-off */
            if (now_serving == my_ticket)
                return;
        }
    }

    void release_lock() {
        now_serving = now_serving + 1;
    }
28. Types of (Software) Locks: The Array-Based Queue Lock
- Motivation: contention on the ticket lock's release counter
  - Cache coherence and memory traffic: every release invalidates the counter variable, and all the waiters' requests then hit a single memory bank
- Two elements:
  - An array, and a tail pointer that indexes that array
  - The array is as big as the number of processors
- Fetch-and-store → the address of the array element
- Fetch-and-increment → the tail pointer
- FIFO ordering (a sketch follows below)
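A sketch of the array-based queue lock in C11 atomics, in the spirit of Anderson's lock; NPROCS, the cache-line padding, and all names are assumptions of this example:

    #include <stdatomic.h>
    #include <stdbool.h>

    #define NPROCS 8                       /* array is as big as the processor count */

    typedef struct {
        struct {
            atomic_bool must_wait;
            char pad[63];                  /* pad to an assumed 64-byte cache line */
        } slot[NPROCS];
        atomic_uint tail;                  /* index of the next free slot */
    } alock_t;

    void alock_init(alock_t *L) {
        for (int i = 0; i < NPROCS; i++)
            atomic_init(&L->slot[i].must_wait, i != 0);  /* slot 0 starts as "enter" */
        atomic_init(&L->tail, 0);
    }

    unsigned alock_acquire(alock_t *L) {
        unsigned me = atomic_fetch_add(&L->tail, 1) % NPROCS;  /* take a slot */
        while (atomic_load(&L->slot[me].must_wait))            /* spin on my own slot */
            ;
        return me;                         /* the slot index is passed to release */
    }

    void alock_release(alock_t *L, unsigned me) {
        atomic_store(&L->slot[me].must_wait, true);                  /* re-arm my slot */
        atomic_store(&L->slot[(me + 1) % NPROCS].must_wait, false); /* hand off */
    }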
29-36. Types of (Software) Locks: The Array-Based Queue Lock
Example trace with five threads (T1-T5). Each requester takes the slot indicated by the tail pointer and spins until its slot is marked "enter"; all other slots are marked "wait":
- Initially the tail pointer points to the beginning of the array; all array elements except the first are marked "wait"
- T1 gets the lock (its slot reads "enter")
- T2 requests the lock, takes the next slot, and spins on it
- T3 requests the lock and takes the following slot
- T1 releases: the next slot is set to "enter" and T2 gets the lock
- T4 requests the lock
- T1 requests the lock again
- T2 releases and T3 gets the lock
37. Types of (Software) Locks: The Queue Locks
- They use too much memory
  - Linear space (relative to the number of processors) per lock
- Array version
  - Easy to implement
- Linked-list version (QNODEs)
  - Cache management
38. Types of (Software) Locks: The MCS Lock
- Characteristics:
  - FIFO ordering
  - Spins on locally accessible flag variables
  - Small amount of space per lock
  - Works equally well on machines with and without coherent caches
- Similar to the QNODE implementation of queue locks
  - QNODEs are allocated in local memory
  - Threads spin on local memory
39. MCS: How Does It Work?
- Each processor enqueues its own private lock variable into a queue and spins on it
  - Key idea: spin locally
    - CC model: spin in the local cache
    - DSM model: spin in local private memory
  - No contention while spinning
- On lock release, the releaser unlocks the next lock in the queue
  - Bus/network contention occurs only on the actual unlock
- No starvation (the order of lock acquisitions is defined by the list)
40. MCS Lock
- Requires atomic instructions:
  - compare-and-swap
  - fetch-and-store
- If there is no compare-and-swap:
  - an alternative release algorithm is needed, with
  - extra complexity,
  - loss of strict FIFO ordering, and
  - a theoretical possibility of starvation
- Details: Mellor-Crummey and Scott's 1991 paper (a sketch follows below)
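A compact MCS sketch in C11 atomics, following the queue discipline described above; the field and function names are ours, not the paper's pseudocode:

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct mcs_node {
        _Atomic(struct mcs_node *) next;
        atomic_bool locked;                   /* true while this thread must wait */
    } mcs_node_t;

    typedef _Atomic(mcs_node_t *) mcs_lock_t; /* tail of the queue; NULL if free */

    void mcs_acquire(mcs_lock_t *lock, mcs_node_t *me) {
        atomic_store(&me->next, NULL);
        /* fetch-and-store: atomically append ourselves to the tail */
        mcs_node_t *pred = atomic_exchange(lock, me);
        if (pred != NULL) {                   /* someone holds the lock */
            atomic_store(&me->locked, true);
            atomic_store(&pred->next, me);
            while (atomic_load(&me->locked))  /* spin on our own local flag */
                ;
        }
    }

    void mcs_release(mcs_lock_t *lock, mcs_node_t *me) {
        mcs_node_t *succ = atomic_load(&me->next);
        if (succ == NULL) {
            /* no known successor: try to swing the tail back to NULL */
            mcs_node_t *expected = me;
            if (atomic_compare_exchange_strong(lock, &expected, NULL))
                return;                       /* lock is free again */
            /* a successor is enqueueing; wait for it to link itself */
            while ((succ = atomic_load(&me->next)) == NULL)
                ;
        }
        atomic_store(&succ->locked, false);   /* hand the lock to the successor */
    }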
41. MCS Example
[Figure: a queue of QNODEs for CPU 1 through CPU 4; "Init", "Proc 1 gets the lock", "Proc 2 tries".]
- CPU 1 holds the real lock
- CPU 2, CPU 3, and CPU 4 spin on their flags
- When CPU 1 releases, it releases the lock and changes the flag variable of the next CPU in the list
42. Implementation: Modern Alternatives
- Fetch-and-Φ operations are restrictive
  - Not all architectures support all of them
- Problem: a single, general atomic operation is hard to implement!
- Solution: provide two primitives from which atomic operations can be built
  - Load-Linked and Store-Conditional
  - Remember the PowerPC lwarx and stwcx instructions
43. An Example: Swap
Exchange the contents of register R4 with the memory location pointed to by R1:

    try: mov R3, R4
         ld  R2, 0(R1)
         st  R3, 0(R1)
         mov R4, R2

Not atomic!
44. An Example: Atomic Swap
Swap (fetch-and-store) using ll and sc:

    try: mov  R3, R4
         ll   R2, 0(R1)
         sc   R3, 0(R1)
         beqz R3, try
         mov  R4, R2

If another processor writes to the value pointed to by R1 before the sc can complete, the reservation (usually kept in a register) is lost. This means that the sc will fail and the code will loop back and try again.
45. Another Example: Fetch-and-Increment and Spin Lock
Fetch-and-increment using ll-sc:

    try: ll   R2, 0(R1)
         addi R2, R2, 1
         sc   R2, 0(R1)
         beqz R2, try

Spin lock using ll-sc. The exch instruction is equivalent to the atomic swap instruction block presented earlier. Assume that the lock is not cacheable. Note: 0 → unlocked, 1 → locked.

            li   R2, 1
    lockit: exch R2, 0(R1)
            bnez R2, lockit
46. Performance Penalty
- Example: Suppose there are 10 processors on a bus, each trying to lock a variable simultaneously. Assume that each bus transaction (read miss or write miss) is 100 clock cycles long. You can ignore the time of the actual read or write of a lock held in the cache, as well as the time the lock is held (they won't matter much!). Determine the performance penalty.
47. Answer
- It takes over 12,000 cycles in total for all processors to pass through the lock!
- Roughly: each time the lock is released, every processor still contending re-reads the lock and attempts the exchange, so the number of bus transactions grows with the square of the processor count.
- Note the contention for the lock and the serialization of the bus transactions.
See the example on p. 596 of Hennessy and Patterson, 3rd ed.
48. Performance Penalty
- Assume the same example as before (100 cycles per bus transaction, 10 processors), but consider the case of a queue lock, which only updates the lock on a miss (Hennessy and Patterson, p. 603).
49. Performance Penalty
- Answer:
  - First acquisition: n + 1 bus transactions
  - Subsequent handoffs: 2(n - 1) bus transactions
  - Total: 3n - 1 = 29 bus transactions, or 2,900 clock cycles
50. Implementing Locks Using Coherence
Test-and-test-and-set with an atomic exchange:

    lockit: ld   R2, 0(R1)
            bnez R2, lockit
            li   R2, 1
            exch R2, 0(R1)
            bnez R2, lockit

The same lock using ll-sc:

    lockit: ll   R2, 0(R1)
            bnez R2, lockit
            li   R2, 1
            sc   R2, 0(R1)
            beqz R2, lockit
51. Some Graphs
[Figure: increase in network latency on a sixty-processor Butterfly; performance of spin locks on a Butterfly. The x-axis represents processors, the y-axis time in microseconds.]
Extracted from "Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors", John M. Mellor-Crummey and Michael L. Scott, January 1991.
52. Topic 5b
53. The Barrier Construct
- The idea behind software barriers:
  - A program point at which all participating threads wait for each other to arrive before continuing
- Difficulty:
  - The overhead of synchronizing the threads
  - Network and memory bandwidth issues
- Implementation:
  - Centralized
    - Simple to implement with locks
  - Tree-based
    - Better with bandwidth
54. Centralized Barriers
- A plain barrier in which all threads/processors wait for each other at a single, serialized point
- Typical implementation:
  - Two spin targets: one on which threads wait for all the others to arrive, and one that keeps a tally of the arrived threads
- A thread arrives at the barrier and increments the counter by one (atomically)
- It then checks whether it is the last one:
  - If it isn't, it waits
  - If it is, it unblocks (wakes) the rest of the threads
55. Centralized Barrier
Pseudocode:

    int  count = 0;
    bool sense = true;

    void central_barrier() {
        lock(L);
        if (count == 0)
            sense = 0;          /* first arrival lowers the flag */
        count++;
        unlock(L);
        if (count == PROCESSORS) {
            sense = 1;          /* last arrival raises the flag... */
            count = 0;          /* ...and resets the counter */
        } else {
            spin(sense == 1);   /* wait until the flag is raised */
        }
    }

It may deadlock or malfunction!
56. Centralized Barrier
Why it malfunctions (Barrier 1 → Work → Barrier 2):
- T1 arrives at the first barrier, increments count, and spins
- T2 arrives at the barrier, increments count, and spins
- T3 arrives at the barrier, increments count, and changes sense, releasing the others
- T3 is then delayed while T1 does its work
- T1 reaches the next barrier and increments count; then it is delayed
- T3 resumes and resets count, still finishing the first barrier
- T2 and T3 arrive at the second barrier and spin forever
57. Centralized Barrier
Pseudocode: the sense-reversing barrier:

    int  count = 0;
    bool sense = true;

    void central_barrier() {
        static bool local_sense = true;   /* per-thread in a real implementation */
        local_sense = !local_sense;       /* flip the sense for this episode */
        lock(L);
        count++;
        if (count == PROCESSORS) {
            count = 0;                    /* last arrival resets the counter... */
            sense = local_sense;          /* ...and releases everyone */
        }
        unlock(L);
        spin(sense == local_sense);       /* wait for the global sense to flip */
    }

It waits correctly because the spin target distinguishes the previous barrier episode (the old local_sense) from the current one (the new local_sense).
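For comparison, a runnable C11 sketch of the sense-reversing barrier; it replaces the lock around the counter with an atomic fetch-and-add and uses a thread-local sense. Names such as barrier_wait and NTHREADS are assumptions of this example:

    #include <stdatomic.h>
    #include <stdbool.h>

    #define NTHREADS 4

    static atomic_int  count = 0;
    static atomic_bool sense = true;
    static _Thread_local bool local_sense = true;

    void barrier_wait(void) {
        local_sense = !local_sense;                 /* flip this thread's sense */
        if (atomic_fetch_add(&count, 1) + 1 == NTHREADS) {
            atomic_store(&count, 0);                /* last arrival resets... */
            atomic_store(&sense, local_sense);      /* ...and releases everyone */
        } else {
            while (atomic_load(&sense) != local_sense)
                ;                                   /* spin until sense flips */
        }
    }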
58. Centralized Barrier
Performance: Suppose there are 10 processors on a bus, each trying to execute a barrier simultaneously. Assume that each bus transaction is 100 clock cycles, as before. You can ignore the time of the actual read or write of a lock held in the cache, as well as the time to execute other non-synchronization operations in the barrier implementation. Determine the number of bus transactions required for all 10 processors to reach the barrier, be released from it, and exit it. Assume that the bus is totally fair, so that every pending request is serviced before a new request, and that the processors are equally fast. Don't worry about counting the processors out of the barrier. How long will the entire process take? (Hennessy and Patterson, p. 598)
59. Centralized Barrier
- Steps through the barrier, assuming an ll-sc lock is used. For the i-th processor:
  - LL the lock → i times
  - SC the lock → i times
  - Load count → 1 time
  - LL the lock again → i - 1 times
  - Store count → 1 time
  - Store lock → 1 time
  - Load sense → 2 times
- Total transactions for the i-th processor: 3i + 4
- Total: (3n^2 + 11n)/2 - 1
- For n = 10: 204 bus transactions → 20,400 clock cycles
60. Tree-Type Barriers
- The software combining tree barrier
  - A single shared variable becomes a tree of variables
  - Each parent node combines the results of its children
  - A group of processors per leaf
    - The last processor to arrive updates the leaf and then moves up
  - A two-pass scheme:
    - Bottom-up → update the counts
    - Top-down → update the sense and resume
- Objective:
  - Reduces memory contention
- Disadvantage:
  - Processors spin on memory locations whose positions cannot be statically determined
61. Tree-Type Barriers
- The Butterfly barrier
  - Based on the butterfly network scheme for broadcasting and reduction
  - Pairwise synchronizations
  - At step k, processor i signals processor i XOR 2^k
  - If the number of processors is not a power of two, existing processors stand in for the missing ones
  - Maximum number of synchronizations: 2 * floor(log2 P)
62. Tree-Type Barriers
- The Dissemination barrier (sketched below)
  - Similar to the Butterfly barrier, but with fewer synchronization operations → ceil(log2 P)
  - At step k, processor i signals processor (i + 2^k) mod P
  - Advantage:
    - The flags on which each processor spins are statically assigned (better locality)
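A sketch of the dissemination pattern in C11; NPROC, ROUNDS, and the counter-based flags are assumptions of this example (implementations often use sense-reversing flags instead):

    #include <stdatomic.h>

    #define NPROC  6
    #define ROUNDS 3                          /* ceil(log2(NPROC)) */

    static atomic_int flag[ROUNDS][NPROC];    /* flag[k][j]: round-k signals for j */

    void dissemination_barrier(int i) {       /* called by processor i */
        for (int k = 0; k < ROUNDS; k++) {
            int partner = (i + (1 << k)) % NPROC;    /* whom I signal in round k */
            atomic_fetch_add(&flag[k][partner], 1);  /* signal my partner */
            while (atomic_load(&flag[k][i]) == 0)    /* wait for my own signal */
                ;
            atomic_fetch_sub(&flag[k][i], 1);        /* consume it, so the barrier is reusable */
        }
    }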
63. Tree-Type Barriers
- Tournament barriers
  - A tree-style barrier
  - A round of the tournament is a level of the tree
  - Winners are statically decided, so no Fetch-and-Φ operations are needed
  - Processor i sets a flag that processor j is awaiting; processor i then drops from the tournament and j continues
  - The final processor wakes all the others
- Types:
  - CREW (concurrent read, exclusive write): a global variable is used to signal back
  - EREW (exclusive read, exclusive write): separate flags, on each of which a single processor spins
64Bibliography
- Paterson and Hennessy. Chapter 6
Multiprocessors and Thread Level Parallelism - Mellor-Crummey, John Scott, Michael. Algorithms
for Scalable Synchronization on Shared Memory
Multiprocessors. January 1991.