1
The Performance of Spin Lock Alternatives for
Shared-Memory Multiprocessors
  • Paper: Thomas E. Anderson
  • Presentation: Emerson Murphy-Hill

2
Introduction
  • Shared Memory Multiprocessors
  • Mutual exclusion required
  • Almost always hardware primitives provided
  • Direct mutual exclusion
  • Mutual exclusion through locking
  • Interest here: short critical regions, spin locks
  • The problem: spinning processors cost
    communication bandwidth. How can we cut it?

3
Range of Architectures
  • Two dimensions
  • Interconnect type (multistage network or bus)
  • Cache type
  • So six architectures are considered
  • Multistage network without private caches
  • Multistage network, invalidation-based cache
    coherence using remote directories (RD)
  • Bus without coherent private cache
  • Bus w/snoopy write through invalidation-based
    cache coherence
  • Bus with snoopy write-back invalidation based
    cache coherence
  • Bus with snoopy distributed write cache coherence
  • The architectures generally provide atomic
    read-modify-write operations

4
Why Spinlocks are Slow
  • Tradeoff: frequent polling gets you the lock
    faster, but slows everyone else down
  • Latency is also an issue: a more complicated
    spin-lock algorithm adds overhead to every
    acquisition

5
A Spin-Waiting Algorithm
  • Spin on Test-and-Set
  • while (TestAndSet(lock) == BUSY) ;
  • <critical section>
  • lock = CLEAR
  • Slow, because
  • The lock holder must contend with non-lock holders
  • Spinning requests slow other requests
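A minimal runnable sketch of this loop using C11 atomics (an illustration, not the paper's code; atomic_flag stands in for the TestAndSet primitive, and the acquire/release names are assumptions):

    #include <stdatomic.h>

    atomic_flag lock = ATOMIC_FLAG_INIT;    /* clear means the lock is free */

    void acquire(void) {
        /* test-and-set returns the previous value:
           true means the lock was BUSY, so keep spinning */
        while (atomic_flag_test_and_set(&lock))
            ;   /* every iteration is a bus transaction: this is the cost */
    }

    void release(void) {
        atomic_flag_clear(&lock);           /* lock = CLEAR */
    }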

6
Another Spin-Waiting Algorithm
  • Spin on Read (Test-and-Test-and-Set)
  • while (lock == BUSY or TestAndSet(lock) == BUSY) ;
  • <critical section>
  • lock = CLEAR
  • For architectures with per-processor cache
  • Like previous, but no network/bus communication
    on read
  • For short critical sections, this is slow,
    because the time to quiesce (for all processors
    to resume quiet spinning) dominates
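A sketch of spin-on-read in the same style (assumed names; atomic_exchange plays the role of TestAndSet, and the relaxed load is the read that spins in the local cache):

    #include <stdatomic.h>
    #include <stdbool.h>

    atomic_bool lock = false;               /* false = CLEAR, true = BUSY */

    void acquire(void) {
        for (;;) {
            /* spin on a plain read: satisfied from the local cache,
               so no bus/network traffic while the lock is held */
            while (atomic_load_explicit(&lock, memory_order_relaxed))
                ;
            /* the lock looked free; now try the expensive test-and-set */
            if (!atomic_exchange(&lock, true))
                return;                     /* previous value was CLEAR */
        }
    }

    void release(void) {
        atomic_store(&lock, false);
    }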

7
Reasons Why Quiescence is Slow
  • Elapsed time between Read and Test-and-Set
  • All cached copies of a lock are invalidated on a
    Test-and-Set, even if the test fails
  • Invalidation-based cache coherence requires O(P)
    bus/network cycles, because a written value has
    to be propagated to every processor (even though
    each receives the same value!)

8
Validation
9
Validation (a bit more)
10
Now, Speed it Up
  • The author presents 5 alternative approaches
  • Interestingly, 4 of the 5 are based on the
    observation that communication during spin
    waiting is like CSMA (Ethernet) networking
    protocols

11
1/5 Static Delay on Lock Release
  • When a processor notices the lock has been
    released, it waits a fixed amount of time before
    trying a Test-And-Set
  • Each processor is assigned a static delay (slot)
  • Good performance when
  • There are few slots and few spinning processors, or
  • There are many slots and many spinning processors
    (a sketch of the scheme follows)
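A sketch of the slot scheme, building on the spin-on-read loop above (my_slot, slot_delay, and the delay calibration are assumptions for illustration):

    #include <stdatomic.h>
    #include <stdbool.h>

    extern atomic_bool lock;                /* as in the spin-on-read sketch */

    /* hypothetical busy-wait delay of roughly 'slots' slot times */
    static void slot_delay(int slots) {
        for (volatile long i = 0; i < (long)slots * 1000; i++)
            ;
    }

    /* my_slot is this processor's statically assigned slot number */
    void acquire_static_delay(int my_slot) {
        for (;;) {
            while (atomic_load_explicit(&lock, memory_order_relaxed))
                ;                           /* wait until the lock looks free */
            slot_delay(my_slot);            /* sit out my slot before trying */
            if (!atomic_exchange(&lock, true))
                return;                     /* test-and-set succeeded */
        }
    }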

12
2/5 Backoff on Lock Release
  • Like Ethernet backoff
  • Wait a small amount of time between Read and
    Test-and-Set
  • If a processor collides with another processor,
    it backs off for a greater random interval
  • Indirectly, processors base backoff interval on
    the number of spinning processors
  • But there are caveats (next slide)

13
More on Backoff
  • Processors should not change their mean delay if
    another processor acquires the lock
  • Maximum time to delay should be bounded
  • Initial delay on arrival should be a fraction of
    the last delay
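A sketch that follows these three rules (acquire_backoff, spin_for, and MAX_DELAY are illustrative assumptions; the loop again builds on the spin-on-read sketch):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdlib.h>

    extern atomic_bool lock;

    enum { MAX_DELAY = 1 << 16 };           /* bound the maximum delay */

    static void spin_for(int units) {
        for (volatile long i = 0; i < (long)units * 1000; i++)
            ;
    }

    /* returns the final delay so the caller can seed its next
       acquisition with a fraction of the last delay */
    int acquire_backoff(int last_delay) {
        int delay = last_delay / 2 > 0 ? last_delay / 2 : 1;
        for (;;) {
            /* the delay does not grow merely because someone else
               holds the lock, only after a failed test-and-set */
            while (atomic_load_explicit(&lock, memory_order_relaxed))
                ;
            if (!atomic_exchange(&lock, true))
                return delay;               /* acquired */
            /* collision with another processor: back off further */
            delay = delay * 2 > MAX_DELAY ? MAX_DELAY : delay * 2;
            spin_for(rand() % delay + 1);   /* random interval up to delay */
        }
    }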

14
3/5 Static Delay before Reference
  • while (lock == BUSY or TestAndSet(lock) == BUSY)
  •     delay()
  • <critical section>
  • Here you just check the lock less often
  • Good when
  • Checking frequently and there are few other spinners, or
  • Checking infrequently and there are many spinners

15
4/5 Backoff before Reference
  • while (lock == BUSY or TestAndSet(lock) == BUSY)
  •     delay()
  •     delay = randomBackoff()
  • <critical section>
  • Analogous to backoff on lock release
  • Both dynamic and static backoff are bad when the
    critical section is long: processors just keep
    backing off while the lock is being held

16
5/5 Queue
  • Can't estimate backoff from the number of waiting
    processes, and can't keep a software process queue
    (just as slow as the lock itself!)
  • The author's contribution (finally)
  • Init:   flags[0] = HAS_LOCK
  •         flags[1..P-1] = MUST_WAIT
  •         queueLast = 0
  • Lock:   myPlace = ReadAndIncrement(queueLast)
  •         while (flags[myPlace mod P] == MUST_WAIT) ;
  • <critical section>
  • Unlock: flags[myPlace mod P] = MUST_WAIT
  •         flags[(myPlace + 1) mod P] = HAS_LOCK
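A runnable rendering of the queue lock in C11 atomics (a sketch: P, the function names, and the lack of per-flag cache-line padding are simplifying assumptions):

    #include <stdatomic.h>

    #define P 64                            /* number of processors (assumed) */
    enum { MUST_WAIT = 0, HAS_LOCK = 1 };

    atomic_int flags[P];                    /* ideally one line/module per flag */
    atomic_int queueLast;

    void queue_init(void) {
        atomic_store(&flags[0], HAS_LOCK);
        for (int i = 1; i < P; i++)
            atomic_store(&flags[i], MUST_WAIT);
        atomic_store(&queueLast, 0);
    }

    int queue_lock(void) {
        /* ReadAndIncrement: fetch_add returns the old value */
        int myPlace = atomic_fetch_add(&queueLast, 1);
        while (atomic_load(&flags[myPlace % P]) == MUST_WAIT)
            ;                               /* each waiter spins on its own flag */
        return myPlace;                     /* pass this to queue_unlock */
    }

    void queue_unlock(int myPlace) {
        atomic_store(&flags[myPlace % P], MUST_WAIT);      /* reset my slot */
        atomic_store(&flags[(myPlace + 1) % P], HAS_LOCK); /* hand off */
    }

Because each waiter spins on a distinct location, a release disturbs only the next waiter's flag rather than invalidating every spinner's cached copy.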

17
More on Queuing
  • Works especially well for multistage networks:
    each flag can be on a separate memory module, so a
    single memory location isn't saturated with
    requests
  • Works less well if there's a bus without cache
    coherence, because we still have the problem that
    every process has to poll a single value in one
    place
  • Lock latency is increased (overhead), so
    performance is poor when there's no contention

18
Benchmark Spin-lock Alternatives
19
Overhead vs. Number of Slots
20
Spin-waiting Overhead for a Burst
21
Network Hardware Solutions
  • Combining Networks
  • Multiple paths to same memory location
  • Hardware Queuing
  • Eliminates polling across the network
  • Goodman's Queue Links
  • Stores the name of the next processor in the
    queue directly in each processor's cache
  • Eliminates need for memory access for queuing

22
Bus Hardware Solutions
  • Invalidate cache copies ONLY when Test-and-Set
    succeeds
  • Read broadcast
  • Whenever some other processor reads a value which
    I know is invalid, I get a copy of that value too
    (piggyback)
  • Eliminates the cascade of read-misses
  • Special handling of Test-and-Set
  • The cache and bus controllers don't touch the bus
    if the lock is busy
  • Essentially, the hardware doesn't issue a
    test-and-set so long as there is a possibility it
    might fail

23
Conclusions
  • Spin-locking performance doesn't scale
  • A variant of Ethernet backoff has good results
    when there is little lock contention
  • Queuing (parallelizing lock handoff) has good
    results when there are waiting processors
  • A little supportive hardware goes a long way
    towards a healthy multiprocessor relationship