1
The Performance of Spin Lock Alternatives for
Shared-Memory Multiprocessors
  • Paper: Thomas E. Anderson
  • Presentation: Emerson Murphy-Hill

2
Introduction
  • Shared Memory Multiprocessors
  • Mutual exclusion required
  • Almost always hardware primitives provided
  • Direct mutual exclusion
  • Mutual exclusion through locking
  • Interest here: short critical regions, spin locks
  • The problem: spinning processors cost
    communication bandwidth. How can we cut it?

3
Range of Architectures
  • Two dimensions
  • Interconnect type (multistage network or bus)
  • Cache type
  • So six architectures are considered
  • Multistage network without private caches
  • Multistage network, invalidation-based cache
    coherence using remote directories (RD)
  • Bus without coherent private cache
  • Bus w/snoopy write through invalidation-based
    cache coherence
  • Bus with snoopy write-back invalidation based
    cache coherence
  • Bus with snoopy distributed write cache coherence
  • The architectures generally provide atomic
    read-modify-write operations

4
Why Spinlocks are Slow
  • Tradeoff: frequent polling gets you the lock
    faster, but slows everyone else down
  • Latency is also an issue: a more complicated
    spin-lock algorithm adds overhead to every
    acquisition

5
A Spin-Waiting Algorithm
  • Spin on Test-and-Set
  • while (TestAndSet(lock) == BUSY) ;
  • <critical section>
  • lock = CLEAR
  • Slow, because
  • The lock holder must contend with non-lock holders
  • Spinning requests slow other requests
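A minimal runnable sketch of this loop using C11 atomics (an illustration, not the paper's code; atomic_flag stands in for the TestAndSet primitive, and the acquire/release names are assumptions):

    #include <stdatomic.h>

    atomic_flag lock = ATOMIC_FLAG_INIT;    /* clear means the lock is free */

    void acquire(void) {
        /* test-and-set returns the previous value:
           true means the lock was BUSY, so keep spinning */
        while (atomic_flag_test_and_set(&lock))
            ;   /* every iteration is a bus transaction: this is the cost */
    }

    void release(void) {
        atomic_flag_clear(&lock);           /* lock = CLEAR */
    }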

6
Another Spin-Waiting Algorithm
  • Spin on Read (Test-and-Test-and-Set)
  • while (lock == BUSY or TestAndSet(lock) == BUSY) ;
  • <critical section>
  • lock = CLEAR
  • For architectures with per-processor cache
  • Like previous, but no network/bus communication
    on read
  • For short critical sections, this is slow,
    because the time to quiesce (for all processors
    to resume quiet spinning) dominates
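A sketch of spin-on-read in the same style (assumed names; atomic_exchange plays the role of TestAndSet, and the relaxed load is the read that spins in the local cache):

    #include <stdatomic.h>
    #include <stdbool.h>

    atomic_bool lock = false;               /* false = CLEAR, true = BUSY */

    void acquire(void) {
        for (;;) {
            /* spin on a plain read: satisfied from the local cache,
               so no bus/network traffic while the lock is held */
            while (atomic_load_explicit(&lock, memory_order_relaxed))
                ;
            /* the lock looked free; now try the expensive test-and-set */
            if (!atomic_exchange(&lock, true))
                return;                     /* previous value was CLEAR */
        }
    }

    void release(void) {
        atomic_store(&lock, false);
    }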

7
Reasons Why Quiescence is Slow
  • Elapsed time between Read and Test-and-Set
  • All cached copies of a lock are invalidated on a
    Test-and-Set, even if the test fails
  • Invalidation-based cache coherence requires O(P)
    bus/network cycles, because a written value has
    to be propagated to every processor (even though
    each receives the same value!)

8
Validation
9
Validation (a bit more)
10
Now, Speed it Up
  • The author presents 5 alternative approaches
  • Interestingly, 4 of the 5 are based on the
    observation that communication during spin
    waiting is like CSMA (Ethernet) networking
    protocols

11
1/5 Static Delay on Lock Release
  • When a processor notices the lock has been
    released, it waits a fixed amount of time before
    trying a Test-And-Set
  • Each processor is assigned a static delay (slot)
  • Good performance when
  • There are few slots and few spinning processors, or
  • There are many slots and many spinning processors
    (a sketch of the scheme follows)
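A sketch of the slot scheme, building on the spin-on-read loop above (my_slot, slot_delay, and the delay calibration are assumptions for illustration):

    #include <stdatomic.h>
    #include <stdbool.h>

    extern atomic_bool lock;                /* as in the spin-on-read sketch */

    /* hypothetical busy-wait delay of roughly 'slots' slot times */
    static void slot_delay(int slots) {
        for (volatile long i = 0; i < (long)slots * 1000; i++)
            ;
    }

    /* my_slot is this processor's statically assigned slot number */
    void acquire_static_delay(int my_slot) {
        for (;;) {
            while (atomic_load_explicit(&lock, memory_order_relaxed))
                ;                           /* wait until the lock looks free */
            slot_delay(my_slot);            /* sit out my slot before trying */
            if (!atomic_exchange(&lock, true))
                return;                     /* test-and-set succeeded */
        }
    }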

12
2/5 Backoff on Lock Release
  • Like Ethernet backoff
  • Wait a small amount of time between Read and
    Test-and-Set
  • If a processor collides with another processor,
    it backs off for a greater random interval
  • Indirectly, processors base backoff interval on
    the number of spinning processors
  • But there are caveats (next slide)

13
More on Backoff
  • Processors should not change their mean delay if
    another processor acquires the lock
  • Maximum time to delay should be bounded
  • Initial delay on arrival should be a fraction of
    the last delay
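A sketch that follows these three rules (acquire_backoff, spin_for, and MAX_DELAY are illustrative assumptions; the loop again builds on the spin-on-read sketch):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdlib.h>

    extern atomic_bool lock;

    enum { MAX_DELAY = 1 << 16 };           /* bound the maximum delay */

    static void spin_for(int units) {
        for (volatile long i = 0; i < (long)units * 1000; i++)
            ;
    }

    /* returns the final delay so the caller can seed its next
       acquisition with a fraction of the last delay */
    int acquire_backoff(int last_delay) {
        int delay = last_delay / 2 > 0 ? last_delay / 2 : 1;
        for (;;) {
            /* the delay does not grow merely because someone else
               holds the lock, only after a failed test-and-set */
            while (atomic_load_explicit(&lock, memory_order_relaxed))
                ;
            if (!atomic_exchange(&lock, true))
                return delay;               /* acquired */
            /* collision with another processor: back off further */
            delay = delay * 2 > MAX_DELAY ? MAX_DELAY : delay * 2;
            spin_for(rand() % delay + 1);   /* random interval up to delay */
        }
    }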

14
3/5 Static Delay before Reference
  • while (lock == BUSY or TestAndSet(lock) == BUSY)
  •     delay()
  • <critical section>
  • Here you just check the lock less often
  • Good when
  • Checking frequently and there are few other spinners, or
  • Checking infrequently and there are many spinners

15
4/5 Backoff before Reference
  • while (lock == BUSY or TestAndSet(lock) == BUSY)
  •     delay()
  •     delay = randomBackoff()
  • <critical section>
  • Analogous to backoff on lock release
  • Both dynamic and static backoff are bad when the
    critical section is long: processors just keep
    backing off while the lock is being held

16
5/5 Queue
  • Can't estimate backoff from the number of waiting
    processes, and can't keep a software process queue
    (just as slow as the lock itself!)
  • The author's contribution (finally)
  • Init:   flags[0] = HAS_LOCK
  •         flags[1..P-1] = MUST_WAIT
  •         queueLast = 0
  • Lock:   myPlace = ReadAndIncrement(queueLast)
  •         while (flags[myPlace mod P] == MUST_WAIT) ;
  • <critical section>
  • Unlock: flags[myPlace mod P] = MUST_WAIT
  •         flags[(myPlace + 1) mod P] = HAS_LOCK
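A runnable rendering of the queue lock in C11 atomics (a sketch: P, the function names, and the lack of per-flag cache-line padding are simplifying assumptions):

    #include <stdatomic.h>

    #define P 64                            /* number of processors (assumed) */
    enum { MUST_WAIT = 0, HAS_LOCK = 1 };

    atomic_int flags[P];                    /* ideally one line/module per flag */
    atomic_int queueLast;

    void queue_init(void) {
        atomic_store(&flags[0], HAS_LOCK);
        for (int i = 1; i < P; i++)
            atomic_store(&flags[i], MUST_WAIT);
        atomic_store(&queueLast, 0);
    }

    int queue_lock(void) {
        /* ReadAndIncrement: fetch_add returns the old value */
        int myPlace = atomic_fetch_add(&queueLast, 1);
        while (atomic_load(&flags[myPlace % P]) == MUST_WAIT)
            ;                               /* each waiter spins on its own flag */
        return myPlace;                     /* pass this to queue_unlock */
    }

    void queue_unlock(int myPlace) {
        atomic_store(&flags[myPlace % P], MUST_WAIT);      /* reset my slot */
        atomic_store(&flags[(myPlace + 1) % P], HAS_LOCK); /* hand off */
    }

Because each waiter spins on a distinct location, a release disturbs only the next waiter's flag rather than invalidating every spinner's cached copy.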

17
More on Queuing
  • Works especially well for multistage networks:
    each flag can be on a separate memory module, so a
    single memory location isn't saturated with
    requests
  • Works less well if there's a bus without cache
    coherence, because we still have the problem that
    every process has to poll a single value in one
    place
  • Lock latency is increased (overhead), so
    performance is poor when there's no contention

18
Benchmark Spin-lock Alternatives
19
Overhead vs. Number of Slots
20
Spin-waiting Overhead for a Burst
21
Network Hardware Solutions
  • Combining Networks
  • Multiple paths to same memory location
  • Hardware Queuing
  • Eliminates polling across the network
  • Goodman's Queue Links
  • Stores the name of the next processor in the
    queue directly in each processor's cache
  • Eliminates need for memory access for queuing

22
Bus Hardware Solutions
  • Invalidate cache copies ONLY when Test-and-Set
    succeeds
  • Read broadcast
  • Whenever some other processor reads a value which
    I know is invalid, I get a copy of that value too
    (piggyback)
  • Eliminates the cascade of read-misses
  • Special handling of Test-and-Set
  • The cache and bus controllers don't touch the bus
    if the lock is busy
  • Essentially, the hardware doesn't issue a
    test-and-set so long as there is a possibility it
    might fail

23
Conclusions
  • Spin-locking performance doesn't scale
  • A variant of Ethernet backoff has good results
    when there is little lock contention
  • Queuing (parallelizing lock handoff) has good
    results when there are waiting processors
  • A little supportive hardware goes a long way
    towards a healthy multiprocessor relationship