1
Synchronization
  • Kenneth Chiu

2
Reporting Performance with Percentages
  • Suppose A takes 100 s and B takes 80 s. Which is
    correct?
  • B is 25% faster than A.
  • B is 20% faster than A.
  • Notes
  • 1/100 = .01
  • 1/80 = .0125
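  • Worked out: A's rate is 1/100 = .01 runs/s and
    B's is 1/80 = .0125 runs/s, so B is .0125/.01 =
    1.25 times as fast, i.e., 25% faster. The 20%
    figure measures time instead: (100 - 80)/100 =
    20%, i.e., B takes 20% less time.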

3
Introduction
  • Introduction
  • Architectures
  • Network
  • Bus
  • Simple Approaches
  • Spin on test-and-set
  • Spin on read
  • New Software Alternatives
  • Delays
  • Queuing
  • Hardware Solutions
  • Summary

4
Mutual Exclusion
  • Pure software solutions are inefficient
  • Dekker's algorithm
  • Peterson's algorithm
  • Most ISAs provide atomic instructions
  • Test-and-set
  • Atomic increment
  • Compare-and-swap
  • Load-linked/store-conditional
  • Atomic exchange
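
As a concrete illustration (a sketch, not from the original deck),
C11's <stdatomic.h> exposes most of these primitives portably;
load-linked/store-conditional has no direct C11 form, though compilers
often use it to implement the others on LL/SC machines:

    #include <stdatomic.h>
    #include <stdbool.h>

    atomic_flag flag = ATOMIC_FLAG_INIT;  /* one bit for test-and-set */
    atomic_int  counter;                  /* zero-initialized */
    atomic_int  word;

    void demo(void) {
        /* Test-and-set: set the flag, return its previous value. */
        bool was_set = atomic_flag_test_and_set(&flag);

        /* Atomic increment (fetch-and-add): returns the old value. */
        int old = atomic_fetch_add(&counter, 1);

        /* Compare-and-swap: store 1 only if word still holds 0. */
        int expected = 0;
        bool swapped = atomic_compare_exchange_strong(&word, &expected, 1);

        /* Atomic exchange: unconditionally swap in a new value. */
        int prev = atomic_exchange(&word, 2);

        (void)was_set; (void)old; (void)swapped; (void)prev;
    }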

5
Spinning (Busy-Waiting)
  • What is a spin-lock?
  • Test
  • Test-Lock
  • Release
  • Isn't busy-waiting bad?
  • Blocking is expensive
  • Adaptive
  • What should happen on a uniprocessor?
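
One sketch of the adaptive idea above (illustrative only; the spin
bound of 1000 iterations is an arbitrary assumption, not a tuned
value): spin briefly on the guess that the lock holder is running on
another CPU, then yield so a uniprocessor does not burn its whole
quantum spinning:

    #include <stdatomic.h>
    #include <sched.h>                      /* sched_yield(), POSIX */

    /* Initialize with: spinlock_t l = { ATOMIC_FLAG_INIT }; */
    typedef struct { atomic_flag held; } spinlock_t;

    void adaptive_lock(spinlock_t *l) {
        for (;;) {
            for (int i = 0; i < 1000; i++)  /* bounded spin phase */
                if (!atomic_flag_test_and_set_explicit(
                        &l->held, memory_order_acquire))
                    return;                 /* acquired */
            sched_yield();                  /* give up the CPU */
        }
    }

    void adaptive_unlock(spinlock_t *l) {
        atomic_flag_clear_explicit(&l->held, memory_order_release);
    }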

6
Architectures
  • Introduction
  • Architectures
  • Network
  • Bus
  • Simple Approaches
  • Spin on test-and-set
  • Spin on read
  • New Software Alternatives
  • Delays
  • Queuing
  • Hardware Solutions
  • Summary

7
Architectures
  • Multistage interconnection network
  • Without coherent private caches
  • With invalidation-based cache coherence using
    remote directories
  • Bus
  • Without coherent private caches
  • With snoopy write-through invalidation-based
    cache coherence
  • With snoopy write-back invalidation-based cache
    coherence
  • With snoopy distributed-write cache coherence

8
Hardware Support for Mutex
  • Atomic RMW. Requires
  • Read
  • Write
  • Arbitration
  • Locking
  • Usually collapsed into one or two transactions.

9
Multistage Networks
  • Requests forwarded through series of switches.
  • Cached copies (recorded in remote directory)
    invalidated.
  • Computation can be done remotely.

10
Bus
  • Bus used for arbitration
  • Processor acquires the bus and raises a line

11
Simple Approaches
  • Introduction
  • Architectures
  • Network
  • Bus
  • Simple Approaches
  • Spin on test-and-set
  • Spin on read
  • New Software Alternatives
  • Delays
  • Queuing
  • Hardware Solutions
  • Summary

12
Three Concerns
  • To minimize delay before reacquisition, should
    retry often.
  • Retrying often creates a lot of bus activity,
    which degrades performance.
  • Complex algorithms may address first two, but
    then will have high latency for uncontended locks.

13
Performance Under Contention Unimportant?
  • Could be argued
  • Highly parallel application by definition has no
    lock with contention.
  • If you have contention, you've designed it wrong.
  • Specious argument
  • Non-linear behavior. Contention may be worse with
    bad locks.
  • Can't always redesign the program.

14
Spin on Test-And-Set
  • Code
  • Initial condition
  • lock = CLEAR
  • Lock
  • while (TestAndSet(lock) == BUSY) ;
  • Unlock
  • lock = CLEAR
  • Performance
  • Contention on the lock datum.
  • Architectures don't allow the lock holder to have
    priority.
  • What about priority inheritance?
  • General subsystem contention.
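
A direct C11 rendering of this lock (a sketch, not the deck's own
code); note that every failed test-and-set is still a bus write that
invalidates the other processors' cached copies:

    #include <stdatomic.h>

    atomic_int lock;                    /* 0 = CLEAR, 1 = BUSY */

    void tas_lock(void) {
        /* Each exchange is a read-modify-write, so it generates
           bus traffic even when the acquisition fails. */
        while (atomic_exchange_explicit(&lock, 1,
                                        memory_order_acquire) == 1)
            ;                           /* spin */
    }

    void tas_unlock(void) {
        atomic_store_explicit(&lock, 0, memory_order_release);
    }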

15
Spin on Read
  • Pseudo-Code
  • Lock
  • while (lock == BUSY or TestAndSet(lock) == BUSY) ;
  • Assume short-circuiting.
  • Performance
  • When busy, reads out of cache.
  • Upon release, either each copy updated
    (distributed-write) or read miss.
  • Better?
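
The same lock with the read loop added (test-and-test-and-set), again
as an illustrative sketch: while the lock is BUSY, the spin is
satisfied from the local cache and puts nothing on the bus:

    #include <stdatomic.h>

    atomic_int lock;                    /* 0 = CLEAR, 1 = BUSY */

    void ttas_lock(void) {
        for (;;) {
            /* Spin on a plain read: cache hit while BUSY. */
            while (atomic_load_explicit(&lock,
                                        memory_order_relaxed) == 1)
                ;
            /* Looked free: now try the expensive test-and-set. */
            if (atomic_exchange_explicit(&lock, 1,
                                         memory_order_acquire) == 0)
                return;
        }
    }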

16
Better?
[Figure: bus timing for processors P1-P4, showing writes W(i), read
misses R(m), test-and-sets T(i), and cached reads R.]

17
Spin on Read
  • Upon release, each processor will incur a read
    miss. Each processor will then read the value.
    One processor will read the new value and do the
    first test-and-set.
  • Processors that miss the unlocked window will
    resume spinning. Processors that see the lock
    unlocked will try to test-and-set.
  • Each failing test-and-set will invalidate all
    other cache copies, causing them to miss again.

18
Quiescence
  • Memory requests issued before quiescence will be
    slowed down.
  • Memory requests issued after are unaffected.
  • This is a per-critical-section overhead, so short
    critical sections will suffer.
  • Fixed-priority bus arbitration
  • Lock holder has highest priority, so it will not
    be slowed.
  • There are race conditions that will upset the
    priority.
  • If the lock is released again before quiescence,
    there will still be processors contending for it.

19
Spin on Read Analysis
  • Several factors
  • Delay between detecting the lock has been
    released, and attempting to acquire it with
    test-and-set.
  • Invalidation occurs during the test-and-set.
  • Invalidation-based cache-coherence requires O(P)
    bus/network cycles to broadcast a value.
  • Can't they snoop?
  • Broadcasting updates exacerbates the problem.
    (Why?)
  • Since they all try the test-and-set.

20
Performance
  • Sequent shared-memory machine with 20 Intel 386
    processors.
  • Write-back, invalidation-based.
  • Acquiring and releasing normally takes 5.6
    microseconds.
  • Total elapsed time for all processors to execute
    a CS one million times.
  • Lock, execute critical section, release, delay
  • Mean delay is equal to 5× the time of the
    critical section.
  • What's the purpose of the delay?
  • Lock and data in separate cache lines.
  • What's false sharing?
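
A sketch of the cache-line separation mentioned above (the 64-byte
line size is an assumption; it varies by machine). False sharing is
when logically unrelated fields share a line, so traffic on one
invalidates cached copies of the other:

    #include <stdalign.h>
    #include <stdatomic.h>

    /* The lock and the data it protects each get their own line,
       so spinners hammering `lock` never invalidate `data`. */
    struct locked_counter {
        alignas(64) atomic_int lock;
        alignas(64) int        data;
    };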

21
Performance
[Graph. The ideal curve assumes spin-waiting is free.]
22
Quiescence Time
  • How to measure?
  • A critical section that delays, then uses bus
    heavily.
  • If delay is long enough, then time to execute the
    critical section should be the same on one
    processor as on all processors.
  • Perform a search.
  • What kind of search? (How would you pick the
    times?)

23
Quiescence Time
24
New Software Alternatives
  • Introduction
  • Architectures
  • Network
  • Bus
  • Simple Approaches
  • Spin on test-and-set
  • Spin on read
  • New Software Alternatives
  • Delays
  • Queuing
  • Hardware Solutions
  • Summary

25
Inserting Delays
  • Two dimensions
  • Insertion location
  • After release
  • After every access
  • Delay length
  • static: fixed for a particular processor
  • dynamic: changes over time

26
CSMA/CD
  • Ethernet algorithm
  • Listen for idle line
  • Immediately try to transmit when idle
  • If collision occurs (how do we know?), then wait
    and retry
  • Increase wait time exponentially
  • Bonus question: why does Ethernet have a minimum
    packet size?

27
Delay After Release Detection
  • Idea is to minimize unsuccessful test-and-set
    instructions.
  • Two kinds of delay
  • Static
  • Each processor is statically assigned a slot from
    0 to N - 1.
  • Number of slots can be adjusted.
  • Few processors
  • Many slots, high latency
  • Few slots, good performance
  • Many processors
  • Many slots, good performance
  • Few slots, high contention
  • Dynamic
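
Before turning to dynamic delay (next slide), here is a minimal sketch
of the static variant; the slot length and per-processor slot ids are
illustrative assumptions. On seeing the lock released, processor i
waits i slots before attempting its test-and-set, so at most one
contender tries per slot:

    #include <stdatomic.h>

    atomic_int lock;                    /* 0 = CLEAR, 1 = BUSY */

    static void delay_loop(long n) {
        for (volatile long i = 0; i < n; i++) ;  /* crude timed wait */
    }

    void slotted_lock(int my_slot) {    /* my_slot in [0, N) */
        enum { SLOT_LEN = 200 };        /* arbitrary slot length */
        for (;;) {
            while (atomic_load_explicit(&lock,
                                        memory_order_relaxed) == 1)
                ;                       /* spin on read until release */
            delay_loop((long)my_slot * SLOT_LEN);    /* wait my slot */
            if (atomic_exchange_explicit(&lock, 1,
                                         memory_order_acquire) == 0)
                return;
        }
    }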

28
Dynamic Delay
  • CSMA/CD collision has fundamentally different
    properties.
  • In a locking collision, the first locker
    succeeds.
  • In a CSMA/CD collision, no sender succeeds.
  • What happens with exponential backoff for 10
    lockers?
  • Solution is to bound the delay.
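
A sketch of test-and-set with bounded exponential backoff (the initial
and maximum delays are illustrative constants, not values from the
paper). The bound matters because, unlike CSMA/CD, each "collision"
still admits one winner, so the losers should not back off forever:

    #include <stdatomic.h>

    atomic_int lock;                    /* 0 = CLEAR, 1 = BUSY */

    static void delay_loop(long n) {
        for (volatile long i = 0; i < n; i++) ;
    }

    void backoff_lock(void) {
        long delay = 1;
        const long max_delay = 1024;    /* bound on the backoff */
        while (atomic_exchange_explicit(&lock, 1,
                                        memory_order_acquire) == 1) {
            delay_loop(delay);
            if (delay < max_delay) delay *= 2;
        }
    }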

29
Delay between References
  • Doesn't work well with dynamic delay.
  • Backoff continues while locking processor in
    critical section.
  • Delay should be tied to number of spinning
    processors, not the length of the critical
    section.
  • Any possible alternatives to estimating the
    number of spinning processors?

30
Performance
  • 1 microsecond to execute a test-and-set.
  • Queuing done with explicit lock.
  • Ideal time subtracted to show overhead only.
  • One processor time shows latency.
  • Static worse with few processors.
  • With many processors, backoff is slightly worse.

31
Spin-Waiting Overhead vs No. of Slots
  • Need lots of slots when lots of processors.

32
Queuing
  • Use shared counter to keep track of no. of
    spinning processors.
  • Two extra atomic instructions per critical
    section.
  • Each spinning processor must read counter.
  • Use explicit queue
  • Doesn't really get anywhere, since we need a lock
    for the queue.

33
Use Array of Flags
  • Each processor spins on its own memory location.
  • To unlock, signal next memory location.
  • Use atomic increment to assign memory locations.
  • Use modular arithmetic to avoid infinite arrays.

34
Queue
  • Code
  • Init
  • flags[0] = HAS_LOCK
  • flags[1..P-1] = MUST_WAIT
  • queueLast = 0
  • Lock
  • myPlace = ReadAndIncrement(queueLast)
  • while (flags[myPlace mod P] == MUST_WAIT) ;
  • flags[myPlace mod P] = MUST_WAIT
  • Unlock
  • flags[(myPlace + 1) mod P] = HAS_LOCK
  • What happens on overflow of queueLast? How to
    fix?
  • Memory barriers needed?
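
A C11 sketch of this array-based queue lock, under two labeled
assumptions: P is fixed at compile time and is a power of two, so
unsigned wraparound of queueLast stays consistent with mod P (one
answer to the overflow question); and each flag is padded to its own
cache line. The acquire/release orderings are one answer to the
memory-barrier question:

    #include <stdatomic.h>
    #include <stdalign.h>

    #define P 16   /* slots; power of two handles counter overflow */

    static struct { alignas(64) atomic_int f; } flags[P];
    static atomic_uint queue_last;

    void qlock_init(void) {
        atomic_store(&flags[0].f, 1);               /* HAS_LOCK  */
        for (int i = 1; i < P; i++)
            atomic_store(&flags[i].f, 0);           /* MUST_WAIT */
        atomic_store(&queue_last, 0);
    }

    unsigned qlock_acquire(void) {
        unsigned my_place = atomic_fetch_add(&queue_last, 1) % P;
        while (atomic_load_explicit(&flags[my_place].f,
                                    memory_order_acquire) == 0)
            ;                                  /* spin on own flag */
        atomic_store_explicit(&flags[my_place].f, 0,
                              memory_order_relaxed); /* reset slot */
        return my_place;
    }

    void qlock_release(unsigned my_place) {
        atomic_store_explicit(&flags[(my_place + 1) % P].f, 1,
                              memory_order_release); /* hand off */
    }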

35
Performance
  • Atomic increment is emulated.
  • Initial latency is high.

36
Overhead in Achieving Barrier
  • Barrier
  • A timestamp is taken on release.
  • Another timestamp is taken when the last
    processor acquires the lock.

37
Hardware Solutions
  • Introduction
  • Architectures
  • Network
  • Bus
  • Simple Approaches
  • Spin on test-and-set
  • Spin on read
  • New Software Alternatives
  • Delays
  • Queuing
  • Hardware Solutions
  • Summary

38
Hardware Solutions
  • Networks
  • Combining
  • Hardware queuing
  • Bus
  • Invalidate only if lock value changes.
  • Still has performance degradation as the number
    of processors goes up.
  • More snooping
  • Snoop read miss data
  • Snoop test-and-set requests
  • First read miss (snoop miss data)
  • If busy, abort
  • If free, then try locking bus
  • While waiting, monitor the bus.
  • Abort if someone else gets the lock

39
Summary
  • Memory operations are not free.
  • Memory operations are not independent on
    shared-memory machines.
  • Writes are expensive.
  • Atomic instructions are even more expensive.
  • Don't kill the bus.