1
Synchronization
  • Kenneth Chiu

2
Reporting Performance with Percentages
  • Suppose A takes 100 s and B takes 80 s. Which is
    correct?
  • B is 25% faster than A.
  • B is 20% faster than A.
  • Notes
  • 1/100 = .01
  • 1/80 = .0125
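  • Worked out: A's rate is 1/100 = .01 runs/s and
    B's is 1/80 = .0125 runs/s, so B is .0125/.01 =
    1.25 times as fast, i.e., 25% faster. The 20%
    figure measures time instead: (100 - 80)/100 =
    20%, i.e., B takes 20% less time.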

3
Introduction
  • Introduction
  • Architectures
  • Network
  • Bus
  • Simple Approaches
  • Spin on test-and-set
  • Spin on read
  • New Software Alternatives
  • Delays
  • Queuing
  • Hardware Solutions
  • Summary

4
Mutual Exclusion
  • Pure software solutions are inefficient
  • Dekker's algorithm
  • Peterson's algorithm
  • Most ISAs provide atomic instructions
  • Test-and-set
  • Atomic increment
  • Compare-and-swap
  • Load-linked/store-conditional
  • Atomic exchange
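
As a concrete illustration (a sketch, not from the original deck),
C11's <stdatomic.h> exposes most of these primitives portably;
load-linked/store-conditional has no direct C11 form, though compilers
often use it to implement the others on LL/SC machines:

    #include <stdatomic.h>
    #include <stdbool.h>

    atomic_flag flag = ATOMIC_FLAG_INIT;  /* one bit for test-and-set */
    atomic_int  counter;                  /* zero-initialized */
    atomic_int  word;

    void demo(void) {
        /* Test-and-set: set the flag, return its previous value. */
        bool was_set = atomic_flag_test_and_set(&flag);

        /* Atomic increment (fetch-and-add): returns the old value. */
        int old = atomic_fetch_add(&counter, 1);

        /* Compare-and-swap: store 1 only if word still holds 0. */
        int expected = 0;
        bool swapped = atomic_compare_exchange_strong(&word, &expected, 1);

        /* Atomic exchange: unconditionally swap in a new value. */
        int prev = atomic_exchange(&word, 2);

        (void)was_set; (void)old; (void)swapped; (void)prev;
    }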

5
Spinning (Busy-Waiting)
  • What is a spin-lock?
  • Test
  • Test-Lock
  • Release
  • Isn't busy-waiting bad?
  • Blocking is expensive
  • Adaptive
  • What should happen on a uniprocessor?
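
One sketch of the adaptive idea above (illustrative only; the spin
bound of 1000 iterations is an arbitrary assumption, not a tuned
value): spin briefly on the guess that the lock holder is running on
another CPU, then yield so a uniprocessor does not burn its whole
quantum spinning:

    #include <stdatomic.h>
    #include <sched.h>                      /* sched_yield(), POSIX */

    /* Initialize with: spinlock_t l = { ATOMIC_FLAG_INIT }; */
    typedef struct { atomic_flag held; } spinlock_t;

    void adaptive_lock(spinlock_t *l) {
        for (;;) {
            for (int i = 0; i < 1000; i++)  /* bounded spin phase */
                if (!atomic_flag_test_and_set_explicit(
                        &l->held, memory_order_acquire))
                    return;                 /* acquired */
            sched_yield();                  /* give up the CPU */
        }
    }

    void adaptive_unlock(spinlock_t *l) {
        atomic_flag_clear_explicit(&l->held, memory_order_release);
    }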

6
Architectures
  • Introduction
  • Architectures
  • Network
  • Bus
  • Simple Approaches
  • Spin on test-and-set
  • Spin on read
  • New Software Alternatives
  • Delays
  • Queuing
  • Hardware Solutions
  • Summary

7
Architectures
  • Multistage interconnection network
  • Without coherent private caches
  • With invalidation-based cache coherence using
    remote directories
  • Bus
  • Without coherent private caches
  • With snoopy write-through invalidation-based
    cache coherence
  • With snoopy write-back invalidation-based cache
    coherence
  • With snoopy distributed-write cache coherence

8
Hardware Support for Mutex
  • Atomic RMW. Requires
  • Read
  • Write
  • Arbitration
  • Locking
  • Usually collapsed into one or two transactions.

9
Multistage Networks
  • Requests forwarded through series of switches.
  • Cached copies (recorded in remote directory)
    invalidated.
  • Computation can be done remotely.

10
Bus
  • Bus used for arbitration
  • Processor acquires the bus and raises a line

11
Simple Approaches
  • Introduction
  • Architectures
  • Network
  • Bus
  • Simple Approaches
  • Spin on test-and-set
  • Spin on read
  • New Software Alternatives
  • Delays
  • Queuing
  • Hardware Solutions
  • Summary

12
Three Concerns
  • To minimize delay before reacquisition, should
    retry often.
  • Retrying often creates a lot of bus activity,
    which degrades performance.
  • Complex algorithms may address first two, but
    then will have high latency for uncontended locks.

13
Performance Under Contention Unimportant?
  • Could be argued
  • Highly parallel application by definition has no
    lock with contention.
  • If you have contention, you've designed it wrong.
  • Specious argument
  • Non-linear behavior. Contention may be worse with
    bad locks.
  • Can't always redesign the program.

14
Spin on Test-And-Set
  • Code
  • Initial condition
  • lock = CLEAR
  • Lock
  • while (TestAndSet(lock) == BUSY) ;
  • Unlock
  • lock = CLEAR
  • Performance
  • Contention on the lock datum.
  • Architectures don't allow the lock holder to have
    priority.
  • What about priority inheritance?
  • General subsystem contention.
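
A direct C11 rendering of this lock (a sketch, not the deck's own
code); note that every failed test-and-set is still a bus write that
invalidates the other processors' cached copies:

    #include <stdatomic.h>

    atomic_int lock;                    /* 0 = CLEAR, 1 = BUSY */

    void tas_lock(void) {
        /* Each exchange is a read-modify-write, so it generates
           bus traffic even when the acquisition fails. */
        while (atomic_exchange_explicit(&lock, 1,
                                        memory_order_acquire) == 1)
            ;                           /* spin */
    }

    void tas_unlock(void) {
        atomic_store_explicit(&lock, 0, memory_order_release);
    }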

15
Spin on Read
  • Pseudo-Code
  • Lock
  • while (lock == BUSY or TestAndSet(lock) == BUSY) ;
  • Assume short-circuiting.
  • Performance
  • When busy, reads out of cache.
  • Upon release, either each copy updated
    (distributed-write) or read miss.
  • Better?
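
The same lock with the read loop added (test-and-test-and-set), again
as an illustrative sketch: while the lock is BUSY, the spin is
satisfied from the local cache and puts nothing on the bus:

    #include <stdatomic.h>

    atomic_int lock;                    /* 0 = CLEAR, 1 = BUSY */

    void ttas_lock(void) {
        for (;;) {
            /* Spin on a plain read: cache hit while BUSY. */
            while (atomic_load_explicit(&lock,
                                        memory_order_relaxed) == 1)
                ;
            /* Looked free: now try the expensive test-and-set. */
            if (atomic_exchange_explicit(&lock, 1,
                                         memory_order_acquire) == 0)
                return;
        }
    }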

16
Better?
[Figure: bus timing for processors P1-P4, showing writes W(i), read
misses R(m), test-and-sets T(i), and cached reads R.]

17
Spin on Read
  • Upon release, each processor will incur a read
    miss. Each processor will then read the value.
    One processor will read the new value and do the
    first test-and-set.
  • Processors that miss the unlocked window will
    resume spinning. Processors that see the lock
    unlocked will try to test-and-set.
  • Each failing test-and-set will invalidate all
    other cache copies, causing them to miss again.

18
Quiescence
  • Memory requests issued before quiescence will be
    slowed down.
  • Memory requests issued after are unaffected.
  • This is a per-critical-section overhead, so short
    critical sections will suffer.
  • Fixed-priority bus arbitration
  • Lock holder has highest priority, so it will not
    be slowed.
  • There are race conditions that will upset the
    priority.
  • If the lock is released again before quiescence,
    there will still be processors contending for it.

19
Spin on Read Analysis
  • Several factors
  • Delay between detecting the lock has been
    released, and attempting to acquire it with
    test-and-set.
  • Invalidation occurs during the test-and-set.
  • Invalidation-based cache-coherence requires O(P)
    bus/network cycles to broadcast a value.
  • Can't they snoop?
  • Broadcasting updates exacerbates the problem.
    (Why?)
  • Since they all try the test-and-set.

20
Performance
  • Sequent shared-memory machine with 20 Intel 386
    processors.
  • Write-back, invalidation-based.
  • Acquiring and releasing normally takes 5.6
    microseconds.
  • Total elapsed time for all processors to execute
    a CS one million times.
  • Lock, execute critical section, release, delay
  • Mean delay is equal to 5× the time of the
    critical section.
  • What's the purpose of the delay?
  • Lock and data in separate cache lines.
  • What's false sharing?
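
A sketch of the cache-line separation mentioned above (the 64-byte
line size is an assumption; it varies by machine). False sharing is
when logically unrelated fields share a line, so traffic on one
invalidates cached copies of the other:

    #include <stdalign.h>
    #include <stdatomic.h>

    /* The lock and the data it protects each get their own line,
       so spinners hammering `lock` never invalidate `data`. */
    struct locked_counter {
        alignas(64) atomic_int lock;
        alignas(64) int        data;
    };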

21
Performance
[Graph. The ideal curve assumes spin-waiting is free.]
22
Quiescence Time
  • How to measure?
  • A critical section that delays, then uses bus
    heavily.
  • If delay is long enough, then time to execute the
    critical section should be the same on one
    processor as on all processors.
  • Perform a search.
  • What kind of search? (How would you pick the
    times?)

23
Quiescence Time
24
New Software Alternatives
  • Introduction
  • Architectures
  • Network
  • Bus
  • Simple Approaches
  • Spin on test-and-set
  • Spin on read
  • New Software Alternatives
  • Delays
  • Queuing
  • Hardware Solutions
  • Summary

25
Inserting Delays
  • Two dimensions
  • Insertion location
  • After release
  • After every access
  • Delay length
  • static: fixed for a particular processor
  • dynamic: changes over time

26
CSMA/CD
  • Ethernet algorithm
  • Listen for idle line
  • Immediately try to transmit when idle
  • If collision occurs (how do we know?), then wait
    and retry
  • Increase wait time exponentially
  • Bonus question: why does Ethernet have a minimum
    packet size?

27
Delay After Release Detection
  • Idea is to minimize unsuccessful test-and-set
    instructions.
  • Two kinds of delay
  • Static
  • Each processor is statically assigned a slot from
    0 to N - 1.
  • Number of slots can be adjusted.
  • Few processors
  • Many slots, high latency
  • Few slots, good performance
  • Many processors
  • Many slots, good performance
  • Few slots, high contention
  • Dynamic
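
Before turning to dynamic delay (next slide), here is a minimal sketch
of the static variant; the slot length and per-processor slot ids are
illustrative assumptions. On seeing the lock released, processor i
waits i slots before attempting its test-and-set, so at most one
contender tries per slot:

    #include <stdatomic.h>

    atomic_int lock;                    /* 0 = CLEAR, 1 = BUSY */

    static void delay_loop(long n) {
        for (volatile long i = 0; i < n; i++) ;  /* crude timed wait */
    }

    void slotted_lock(int my_slot) {    /* my_slot in [0, N) */
        enum { SLOT_LEN = 200 };        /* arbitrary slot length */
        for (;;) {
            while (atomic_load_explicit(&lock,
                                        memory_order_relaxed) == 1)
                ;                       /* spin on read until release */
            delay_loop((long)my_slot * SLOT_LEN);    /* wait my slot */
            if (atomic_exchange_explicit(&lock, 1,
                                         memory_order_acquire) == 0)
                return;
        }
    }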

28
Dynamic Delay
  • CSMA/CD collision has fundamentally different
    properties.
  • In a locking collision, the first locker
    succeeds.
  • In a CSMA/CD collision, no sender succeeds.
  • What happens with exponential backoff for 10
    lockers?
  • Solution is to bound the delay.
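
A sketch of test-and-set with bounded exponential backoff (the initial
and maximum delays are illustrative constants, not values from the
paper). The bound matters because, unlike CSMA/CD, each "collision"
still admits one winner, so the losers should not back off forever:

    #include <stdatomic.h>

    atomic_int lock;                    /* 0 = CLEAR, 1 = BUSY */

    static void delay_loop(long n) {
        for (volatile long i = 0; i < n; i++) ;
    }

    void backoff_lock(void) {
        long delay = 1;
        const long max_delay = 1024;    /* bound on the backoff */
        while (atomic_exchange_explicit(&lock, 1,
                                        memory_order_acquire) == 1) {
            delay_loop(delay);
            if (delay < max_delay) delay *= 2;
        }
    }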

29
Delay between References
  • Doesn't work well with dynamic delay.
  • Backoff continues while locking processor in
    critical section.
  • Delay should be tied to number of spinning
    processors, not the length of the critical
    section.
  • Any possible alternatives to estimating the
    number of spinning processors?

30
Performance
  • 1 microsecond to execute a test-and-set.
  • Queuing done with explicit lock.
  • Ideal time subtracted to show overhead only.
  • One processor time shows latency.
  • Static worse with few processors.
  • With many processors, backoff is slightly worse.

31
Spin-Waiting Overhead vs No. of Slots
  • Need lots of slots when lots of processors.

32
Queuing
  • Use shared counter to keep track of no. of
    spinning processors.
  • Two extra atomic instructions per critical
    section.
  • Each spinning processor must read counter.
  • Use explicit queue
  • Doesn't really get anywhere, since we need a lock
    for the queue.

33
Use Array of Flags
  • Each processor spins on its own memory location.
  • To unlock, signal next memory location.
  • Use atomic increment to assign memory locations.
  • Use modular arithmetic to avoid infinite arrays.

34
Queue
  • Code
  • Init
  • flags[0] = HAS_LOCK
  • flags[1..P-1] = MUST_WAIT
  • queueLast = 0
  • Lock
  • myPlace = ReadAndIncrement(queueLast)
  • while (flags[myPlace mod P] == MUST_WAIT) ;
  • flags[myPlace mod P] = MUST_WAIT
  • Unlock
  • flags[(myPlace + 1) mod P] = HAS_LOCK
  • What happens on overflow of queueLast? How to
    fix?
  • Memory barriers needed?
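
A C11 sketch of this array-based queue lock, under two labeled
assumptions: P is fixed at compile time and is a power of two, so
unsigned wraparound of queueLast stays consistent with mod P (one
answer to the overflow question); and each flag is padded to its own
cache line. The acquire/release orderings are one answer to the
memory-barrier question:

    #include <stdatomic.h>
    #include <stdalign.h>

    #define P 16   /* slots; power of two handles counter overflow */

    static struct { alignas(64) atomic_int f; } flags[P];
    static atomic_uint queue_last;

    void qlock_init(void) {
        atomic_store(&flags[0].f, 1);               /* HAS_LOCK  */
        for (int i = 1; i < P; i++)
            atomic_store(&flags[i].f, 0);           /* MUST_WAIT */
        atomic_store(&queue_last, 0);
    }

    unsigned qlock_acquire(void) {
        unsigned my_place = atomic_fetch_add(&queue_last, 1) % P;
        while (atomic_load_explicit(&flags[my_place].f,
                                    memory_order_acquire) == 0)
            ;                                  /* spin on own flag */
        atomic_store_explicit(&flags[my_place].f, 0,
                              memory_order_relaxed); /* reset slot */
        return my_place;
    }

    void qlock_release(unsigned my_place) {
        atomic_store_explicit(&flags[(my_place + 1) % P].f, 1,
                              memory_order_release); /* hand off */
    }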

35
Performance
  • Atomic increment is emulated.
  • Initial latency is high.

36
Overhead in Achieving Barrier
  • Barrier
  • A timestamp is taken on release.
  • Another timestamp is taken when the last
    processor acquires the lock.

37
Hardware Solutions
  • Introduction
  • Architectures
  • Network
  • Bus
  • Simple Approaches
  • Spin on test-and-set
  • Spin on read
  • New Software Alternatives
  • Delays
  • Queuing
  • Hardware Solutions
  • Summary

38
Hardware Solutions
  • Networks
  • Combining
  • Hardware queuing
  • Bus
  • Invalidate only if lock value changes.
  • Still has performance degradation as the number
    of processors goes up.
  • More snooping
  • Snoop read miss data
  • Snoop test-and-set requests
  • First read miss (snoop miss data)
  • If busy, abort
  • If free, then try locking bus
  • While waiting, monitor the bus.
  • Abort if someone else gets the lock

39
Summary
  • Memory operations are not free.
  • Memory operations are not independent on
    shared-memory machines.
  • Writes are expensive.
  • Atomic instructions are even more expensive.
  • Don't kill the bus.