1
Spin Lock in Shared Memory Multiprocessors
  • Ref: T. E. Anderson, "The Performance of Spin
    Lock Alternatives for Shared-Memory
    Multiprocessors," IEEE Transactions on Parallel
    and Distributed Systems, 1(1), 1990

2
Outline
  • Background
  • Multiprocessor architectures
  • Caches and cache coherence
  • Synchronization
  • Simple spin locks and their performance
  • Test-and-set Spin
  • Spin-on-read
  • More complex spin locks
  • Delays
  • Queueing

3
Multiprocessor Architectures
  • Basically, three flavors
  • Message passing machines (multicomputers)
  • Each processor has its own private memory (and
    address space)
  • Communications via message passing
  • Cluster of (uniprocessor) workstations
  • Shared Memory Multiprocessors
  • Processors can access common memory locations
  • Symmetric multiprocessors (SMP) - shared memory
    multiprocessor where average memory access time
    is the same for any processor, independent of
    which memory location you are accessing
  • By contrast, asymmetric machines distinguish
    between local (fast) and remote (slow) memory;
    harder to program, less common today
  • Current trend: SMPs becoming ubiquitous (aka
    multi-core architectures)
  • Combination of the two
  • E.g., collection of SMPs connected via a fast
    switch or LAN
  • Shared memory within SMP, message passing between
    SMPs (can build message passing library over
    shared memory to hide this)
  • Common organization for many supercomputers

4
SMP Hardware
  • Processor
  • Memory
  • Cache memory
  • Typically, two or three levels of cache
  • First level fast, small (e.g., 1 clock cycle
    (cc), 256 KB); second slower, larger (e.g., 10
    cc, 4 MB)
  • Main memory (e.g., hundreds of cc, 16 GB)
  • Store buffers so processor need not wait for
    memory write to complete
  • Switch
  • General switch (e.g., crossbar)
  • Single shared bus
  • Broadcast easy on shared bus - important!
  • Here, focus on bus-based SMPs
  • Many ideas apply to machines with general
    interconnects

5
Cache Coherence
  • Suppose a single variable is stored in the cache
    memory of several different processors, and one
    of the processors modifies the variable
  • Cache coherence problem
  • Solutions
  • Invalidate (delete) the stale, cached copies
  • Update the cached copies
  • Invalidation/update easier w/ shared bus
  • Snoopy cache coherence protocol: processors
    listen to memory accesses on the bus and
    update/invalidate as needed

6
Synchronization Operations
  • Atomic read-modify-write operation
  • E.g., SPARC's ldstub test-and-set instruction
  • Bus can be exploited here too
  • Processor raises a bus line while performing
    read-modify-write operation
  • Other processors cannot start read-modify-write
    operation while bus line held high
  • Other (non-synchronization) memory accesses can
    use the bus while the line is held high
  • Since a write is performed, TS requires
    invalidation/update operations
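  • As a concrete illustration (not from the slides),
    a minimal sketch of the atomic test-and-set
    primitive in portable C11; the names are ours, and
    the hardware instruction performs the equivalent
    read-modify-write in a single bus transaction:

    #include <stdatomic.h>
    #include <stdbool.h>

    atomic_flag lock = ATOMIC_FLAG_INIT;   /* starts CLEAR */

    bool test_and_set(atomic_flag *l)
    {
        /* Atomically sets the flag and returns its previous
           value: true means the lock was already BUSY. */
        return atomic_flag_test_and_set(l);
    }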

7
Spin Locks
  • Init: lock = CLEAR
  • Lock: while (TestAndSet(lock) == BUSY) ;
  • Unlock: lock = CLEAR
    (see the C sketch at the end of this slide)
  • Advantages
  • Efficient for short waits (short critical
    sections) since thread need not give up processor
    and incur scheduling and context switch overheads
  • Processor may not have anything else to do anyway
  • Disadvantages
  • Inefficient for longer waits
  • System overheads (e.g., bus contention, cache
    effects)?
  • Main focus of Anderson paper
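  • A minimal C11 sketch of this lock (function names
    are ours, not Anderson's):

    #include <stdatomic.h>

    typedef atomic_flag spinlock_t;        /* clear = CLEAR */

    void spin_lock(spinlock_t *l)
    {
        /* Spin until our test-and-set observes CLEAR. */
        while (atomic_flag_test_and_set(l))
            ;                              /* busy-wait */
    }

    void spin_unlock(spinlock_t *l)
    {
        atomic_flag_clear(l);              /* lock = CLEAR */
    }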

8
Performance Metrics
  • Small bandwidth consumption
  • bus shared with other processors
  • Small delay - time from when lock becomes free
    until a waiting processor can acquire it
  • Short latency - time to acquire lock when there
    is no contention (lock is available)
  • Spin-lock tradeoff: spin locks are polling
    mechanisms; if you poll too often, you consume
    too much bandwidth; if you poll rarely, you
    increase delay
  • Observation: one can poll a local cached copy
    without using bus bandwidth!
  • Ideally, polling is done on the local cached copy

9
TestAndSet Spin Lock
Lock: while (TestAndSet(lock) == BUSY) ;
Unlock: lock = CLEAR
  • Latency (time to gain lock if no contention)?
  • Good: low latency to acquire a free lock
  • Bandwidth (bus cycles used)?
  • Each TestAndSet (TS) operation requires a bus
    cycle; saturates the bus as the number of
    processors increases; this slows other processors
    down, even those not accessing the lock
  • Delay (lock release to acquire time)?
  • Processor releasing the lock must contend for bus
    cycles with processors trying to acquire it; SMPs
    usually do not give priority to the processor
    releasing a lock
  • Overall, performance is poor with many processors
    due to contention; does not exploit the cache for
    polling, because TS is used to poll

10
Test-and-Test-and-Set (Spin-on-Read)
Poll by reading the lock; only use TestAndSet if the
lock is not busy
Lock: while (lock == BUSY || TestAndSet(lock) == BUSY) ;
  • Latency?
  • Again good: little time to obtain the lock if no
    contention
  • Two memory operations (why?)
  • Bandwidth?
  • If lock Busy, spin by reading cache (dont
    consume bus cycles) because using read to poll,
    not TS
  • Delay (release-to-acquire time)?
  • A bit more complex
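  • Before turning to the delay analysis, a minimal
    C11 sketch of the spin-on-read loop above (we use
    an atomic_bool, since an atomic_flag cannot be
    read without setting it; names are ours):

    #include <stdatomic.h>
    #include <stdbool.h>

    void ttas_lock(atomic_bool *l)         /* false = CLEAR */
    {
        for (;;) {
            /* Poll with plain reads: no bus traffic while
               the line stays valid in our cache. */
            while (atomic_load(l))
                ;
            /* Lock looked CLEAR; now try the expensive TS. */
            if (!atomic_exchange(l, true))
                return;                    /* acquired */
        }
    }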

11
Spin-on-Read (cont.)
  • Lock
  • while (lock == BUSY || TestAndSet(lock) == BUSY) ;
  • Delay (release to acquire time)?
  • When lock released, cached copy is invalidated or
    updated, causing the TestAndSet operation to be
    executed
  • One processor acquires the lock; others continue
    to spin
  • If many processors, flurry of activity when the
    lock is released (many unneeded bus ops); consider
    an invalidation protocol:
  • Lock is released (set to CLEAR), invalidating
    copy in all other caches
  • Some processors read lock (cache miss), load lock
    (value = CLEAR) into their cache (pending TS
    request)
  • Remaining processors have not yet read lock
    (pending read request)
  • One processor executes TS, acquires lock
  • TS operation invalidates caches of processors w/
    pending TS requests!
  • Some processors with pending reads now load
    (busy) lock into cache
  • Other processors with pending TS now do TS, fail
    to get the lock, but invalidate all the caches,
    again! etc., etc.
  • Finally, the last processor w/ pending TS
    invalidates everyone's cache
  • Each processor again reloads lock into cache

12
Inefficiencies with Spin-on-Read
  • During the time from when a processor loads the
    lock into its cache until it does TS to acquire
    the lock, other processors may also load the CLEAR
    lock into their cache and then try TS; this
    degrades performance
  • Each subsequent TS uses a bus cycle and
    invalidates cached copies of the lock, even though
    the TS write does not change the memory location's
    value
  • For P processors, after an invalidation occurs, P
    bus cycles are needed to reload the value into
    the caches, even though the same value is being
    read by each processor
  • Empirical measurements
  • Original TS spin performs poorly for many
    processors
  • Spin-on-Read better, but still disappointing
    performance for many processors

13
Update Protocols
  • Update protocols typically use copy-back write
    policy
  • Typically distinguish between
  • Exclusive data (this is only cache with a copy)
  • Shared data (stored in multiple caches)
  • Bus operations needed when shared data is written
  • Lock variable is shared in scenarios described
    earlier, so a bus transaction is needed for each
    TS operation to update the other caches

14
Other Spin Locks Adding Delay
  • Delay after noticing lock was released
  • Lock:
  • while (lock == BUSY || TS(lock) == BUSY) {
  •     while (lock == BUSY) ;      // spin here
  •     Delay() }
  • Rationale
  • Recall: the problem was having many pending TS
    ops after the lock becomes available
  • Solution: after noticing the lock is available,
    Delay, then check if busy again before trying TS
  • Hope to reduce the number of unsuccessful TS ops,
    reducing the number of invalidations (see the C
    sketch below)
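  • A minimal C11 sketch of this variant;
    delay_cycles() is a hypothetical busy-wait helper,
    and the names are ours:

    #include <stdatomic.h>
    #include <stdbool.h>

    static void delay_cycles(unsigned long n)
    {
        while (n--)
            ;                          /* crude busy-wait */
    }

    void lock_with_delay(atomic_bool *l, unsigned long d)
    {
        while (atomic_load(l) || atomic_exchange(l, true)) {
            while (atomic_load(l))
                ;                      /* spin here */
            delay_cycles(d);           /* delay before trying TS */
        }
    }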

15
Delay Period
  • Length of delay could be set statically or
    dynamically
  • If too short, the delay doesn't help much
  • If too long, spend time waiting needlessly
  • Ideally, different processors delay for different
    amounts of time
  • Dynamic delay: each processor chooses a random
    delay (and adapts)
  • Like Ethernet (CSMA networks)
  • But, collisions in the spin-lock case cost more
    if there are more processors waiting (it takes
    longer to start spinning on the cache again)
  • Good heuristics for backoff (sketched below)
  • Maximum bound on the mean delay equal to the
    number of processors, to avoid long delays if
    only one processor is still waiting
  • Initial delay should be a function of the delay
    last time before the lock was acquired (costly to
    learn from mistakes!), e.g., half of the previous
    delay
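  • A minimal sketch of this bounded, randomized
    backoff in C; MAX_MEAN_DELAY (on the order of the
    number of processors) and the other names are our
    own illustration:

    #include <stdlib.h>

    #define MAX_MEAN_DELAY 16u      /* ~ number of processors */

    /* Returns the delay to use now (randomized around the
       current mean, Ethernet-style) and doubles the mean for
       next time, capped as the slide suggests. Assumes
       *mean >= 1 on entry. */
    static unsigned next_delay(unsigned *mean)
    {
        unsigned d = 1 + (unsigned)rand() % (2 * *mean);
        *mean *= 2;                 /* exponential backoff */
        if (*mean > MAX_MEAN_DELAY)
            *mean = MAX_MEAN_DELAY; /* bound the mean delay */
        return d;
    }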

16
Other Approaches
  • Delay between each reference
  • Lock:
  • while (lock == BUSY || TS(lock) == BUSY)
  •     Delay()
  • Reduces bus traffic for machines w/o cache
    coherence or SMPs with invalidation protocols
  • Queue waiting processors (sketched below)
  • See the paper for deeper discussion and
    performance analyses
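  • A minimal C11 sketch of an array-based queue lock
    in the spirit of the paper (constants and names
    are ours): each processor atomically takes a slot
    and spins on its own flag, so a release wakes
    exactly one waiter:

    #include <stdatomic.h>
    #include <stdbool.h>

    #define NPROCS 16               /* assumed max waiters */

    /* One flag per slot; ideally each on its own cache line. */
    static atomic_bool has_lock[NPROCS] = { true }; /* slot 0 free */
    static atomic_uint next_slot;

    unsigned queue_lock(void)
    {
        /* Take a ticket: one atomic bus op per arrival. */
        unsigned me = atomic_fetch_add(&next_slot, 1) % NPROCS;
        while (!atomic_load(&has_lock[me]))
            ;                           /* spin on own flag only */
        atomic_store(&has_lock[me], false); /* reset for reuse */
        return me;                      /* caller passes to unlock */
    }

    void queue_unlock(unsigned me)
    {
        /* Hand the lock directly to the next waiter in line. */
        atomic_store(&has_lock[(me + 1) % NPROCS], true);
    }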

17
Summary
  • Implementations of spin locks can have
    significant impact on performance, especially as
    the number of spinning processors increases
  • Lock design must be tailored to the coherence
    protocol for best performance
  • Optimized protocols attempt to minimize the
    number of unfruitful test-and-set operations, in
    order to minimize the number of bus cycles and
    avoid excessive invalidation/update operations