Title: Uppsala University Department of Information Technology Uppsala Architecture Research Team [UART]
1 Efficient Synchronization for Non-Uniform
Communication Architecture
Zoran Radovic and Erik Hagersten
zoran.radovic, erik.hagersten_at_it.uu.se
- Uppsala University, Department of Information Technology, Uppsala Architecture Research Team (UART)
2 Synchronization Basics
- Locks are used to protect the shared critical section data
3 Simple Spin Locks
- test_and_test&set (TATAS), '84
- TATAS with exponential backoff (TATAS_EXP), '90
- Many variations
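TATAS_LOCK(L) {
  while (tas(L))      // test&set failed: the lock is BUSY
    while (*L) ;      // spin on plain reads until the lock looks FREE
}
TATAS_UNLOCK(L) { *L = 0; } // FREE
[Figure: processors P1, P2, P3, ..., Pn contending for a single Lock word in Memory that alternates between FREE and BUSY.]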
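The TATAS_EXP variant named on this slide can be sketched with C11 atomics roughly as follows. This is a minimal illustration, not the authors' code: the type `tatas_lock`, the function names, and the `BACKOFF_*` constants are assumptions chosen for the example.

```c
#include <stdatomic.h>

/* Hypothetical tuning constants; the slides do not give values. */
enum { BACKOFF_BASE = 16, BACKOFF_FACTOR = 2, BACKOFF_CAP = 4096 };

typedef struct { atomic_int word; } tatas_lock;   /* 0 = FREE, 1 = BUSY */

/* tas(): atomically set the word to BUSY and return its previous value. */
static int tas(tatas_lock *l) {
    return atomic_exchange_explicit(&l->word, 1, memory_order_acquire);
}

static void tatas_exp_lock(tatas_lock *l) {
    int backoff = BACKOFF_BASE;
    while (tas(l)) {                                    /* lock was BUSY */
        for (volatile int i = 0; i < backoff; i++) { }  /* back off */
        if (backoff < BACKOFF_CAP)
            backoff *= BACKOFF_FACTOR;                  /* exponential growth */
        /* "test" phase: spin on plain reads until the lock looks FREE */
        while (atomic_load_explicit(&l->word, memory_order_relaxed)) { }
    }
}

static void tatas_unlock(tatas_lock *l) {
    atomic_store_explicit(&l->word, 0, memory_order_release);  /* FREE */
}
```

Spinning on reads keeps waiting processors in their own caches, and the growing delay spreads out the test&set attempts that follow a release.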
4 Performance Under Contention
[Figure: critical section (CS) cost plotted against amount of contention.]
IF (more contention) THEN less efficient CS
5 Making it Scalable: Queues
- First-come, first-served order
- Starvation avoidance
- Maximal fairness
- Reduced traffic
- Queue-based locks
- HW: QOLB '89
- SW: MCS '91
- SW: CLH '93
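As an illustration of how a software queue lock gives first-come, first-served order with reduced traffic, here is a minimal MCS-style sketch in C11. The type and function names are assumptions for this example, and default sequentially consistent atomics are used for simplicity rather than tuned memory orders.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;   /* successor in the queue */
    atomic_bool locked;                /* true while this waiter must spin */
} mcs_node;

typedef struct { _Atomic(mcs_node *) tail; } mcs_lock;  /* NULL = FREE */

static void mcs_acquire(mcs_lock *l, mcs_node *me) {
    atomic_store(&me->next, (mcs_node *)NULL);
    atomic_store(&me->locked, true);
    /* Enqueue at the tail; arrival order defines service order. */
    mcs_node *pred = atomic_exchange(&l->tail, me);
    if (pred == NULL)
        return;                        /* queue was empty: lock acquired */
    atomic_store(&pred->next, me);     /* link behind the predecessor */
    while (atomic_load(&me->locked)) { }   /* spin on our own flag only */
}

static void mcs_release(mcs_lock *l, mcs_node *me) {
    mcs_node *succ = atomic_load(&me->next);
    if (succ == NULL) {
        mcs_node *expected = me;
        /* No visible successor: try to mark the lock FREE. */
        if (atomic_compare_exchange_strong(&l->tail, &expected,
                                           (mcs_node *)NULL))
            return;
        /* A successor is enqueueing; wait until it links itself. */
        while ((succ = atomic_load(&me->next)) == NULL) { }
    }
    atomic_store(&succ->locked, false);    /* hand the lock over */
}
```

Each waiter spins on a flag in its own queue node, so a release touches only the successor's flag instead of a word that every waiter is polling.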
6 Queue Locks Under Contention
[Figure: CS cost plotted against amount of contention, with curves for spin locks and spin locks w/ backoff.]
IF (more contention) THEN constant CS cost
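7 Non-Uniform Memory Architecture (NUMA)
[Figure: two nodes, each with processors P1, P2, P3, ..., Pn and a local Memory, connected by a Switch. Relative access cost: 1 within a node, 2-10 across the switch.]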
- Many NUMA optimizations are proposed
- Page migration
- Page replication
8 Non-Uniform Communication Architecture (NUCA)
[Figure: two nodes, each with processors P1, P2, P3, ..., Pn and a local Memory, connected by a Switch. NUCA ratio: 1 within a node, 2-10 between nodes.]
- NUCA examples (NUCA ratios)
- 1992: Stanford DASH (4.5)
- 1996: Sequent NUMA-Q (10)
- 1999: Sun WildFire (6), our NUCA
- 2000: Compaq DS-320 (3.5)
- Future: CMP, SMT (10)
9 Our Goals
- Design a scalable spin lock that exploits the NUCAs
- Creating node affinity
- For lock handover
- For CS data
- Stable lock
- Reducing the traffic compared with the test&set locks
10 Outline
- Background & Motivation
- NUMA vs. NUCA
- The RH Lock
- Performance Results
- Application Study
- Conclusions
11 Key Ideas Behind RH Lock
- Minimizing global traffic at lock handover
- Only one thread per node will try to acquire a remotely owned lock
- Maximizing node locality of NUCAs
- Hand over the lock to a neighbor in the same node
- Creates locality for the critical section (CS) data as well
- Especially good for large CS and high contention
- RH lock in a nutshell
- Double TATAS_EXP: one node-local lock + one global
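The key ideas above can be illustrated with a simplified two-level lock. This is not the RH lock algorithm itself: it is a sketch of the general idea (one node-local spin lock in front of one global spin lock, with the global lock kept in the node while neighbors are waiting). All names, the `waiters` counter, and the `MAX_LOCAL_HANDOVERS` bound are assumptions introduced for the example.

```c
#include <stdatomic.h>

/* Hypothetical parameters, not taken from the slides. */
enum { NUM_NODES = 2, MAX_LOCAL_HANDOVERS = 64 };

typedef struct {
    atomic_int lock;      /* node-local lock word: 0 = FREE, 1 = BUSY */
    atomic_int waiters;   /* threads of this node currently waiting */
    int owns_global;      /* protected by `lock`: node holds the global lock */
    int handovers;        /* protected by `lock`: consecutive local handovers */
} node_lock;

typedef struct {
    atomic_int global;    /* global lock word: 0 = FREE, 1 = BUSY */
    node_lock node[NUM_NODES];
} two_level_lock;

/* TATAS with exponential backoff on a single lock word. */
static void spin_acquire(atomic_int *w) {
    int backoff = 16;
    while (atomic_exchange_explicit(w, 1, memory_order_acquire)) {
        for (volatile int i = 0; i < backoff; i++) { }
        if (backoff < 4096) backoff *= 2;
        while (atomic_load_explicit(w, memory_order_relaxed)) { }
    }
}

static void two_level_acquire(two_level_lock *l, int node_id) {
    node_lock *n = &l->node[node_id];
    atomic_fetch_add(&n->waiters, 1);
    spin_acquire(&n->lock);           /* compete only with node neighbors */
    atomic_fetch_sub(&n->waiters, 1);
    if (!n->owns_global) {            /* only this thread goes global */
        spin_acquire(&l->global);
        n->owns_global = 1;
        n->handovers = 0;
    }
}

static void two_level_release(two_level_lock *l, int node_id) {
    node_lock *n = &l->node[node_id];
    if (atomic_load(&n->waiters) > 0 && n->handovers < MAX_LOCAL_HANDOVERS) {
        n->handovers++;               /* keep the global lock in this node */
    } else {
        n->owns_global = 0;
        atomic_store_explicit(&l->global, 0, memory_order_release);
    }
    atomic_store_explicit(&n->lock, 0, memory_order_release);
}
```

In this sketch at most one thread per node spins on the global word, and a release with waiting neighbors passes the lock inside the node so the CS data stays local. The handover bound is an addition of the example to limit starvation of the other node.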
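12 The RH Lock Algorithm
[Figure: animated walk-through of the RH lock on two cabinets. Cabinet 1 (P1, P2, P3, ..., P16) and Cabinet 2 (P17, P18, P19, ..., P32) each keep a copy of the lock (Lock1, Lock2) in their memory. The lock words step through the values FREE, L_FREE, REMOTE, and processor numbers; a processor that finds FREE enters the CS.]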
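13 Our NUCA: Sun WildFire
[Figure: Sun WildFire (WF), two nodes, each with processors P1, P2, P3, ..., Pn and a local Memory, connected by a Switch. NUCA ratio: 1 within a node, 6 between nodes.]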
14 NUCA-performance
[Figure: performance plot; both nodes labeled 14.]
15 New Microbenchmark
- More realistic node handoffs for queue-based locks
- Constant number of processors
- Amount of Critical Section (CS) work can be increased
- We can control the amount of contention
for (i = 0; i < iterations; i++) {
  LOCK(L);
  delay(critical_work); // CS
  UNLOCK(L);
  static_delay(); random_delay();
}
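The per-thread loop of such a microbenchmark might look like the following in C. This is a sketch under assumptions: the lock is a plain test-and-test&set lock standing in for whichever lock is being measured, the `bench_args` fields and the small random number generator are hypothetical, and thread creation and timing are left out.

```c
#include <stdatomic.h>

typedef struct { atomic_int word; } spinlock;   /* 0 = FREE, 1 = BUSY */

static void LOCK(spinlock *l) {
    while (atomic_exchange_explicit(&l->word, 1, memory_order_acquire))
        while (atomic_load_explicit(&l->word, memory_order_relaxed)) { }
}

static void UNLOCK(spinlock *l) {
    atomic_store_explicit(&l->word, 0, memory_order_release);
}

/* Busy-wait for `units` loop iterations. */
static void delay(int units) {
    for (volatile int i = 0; i < units; i++) { }
}

typedef struct {
    spinlock *lock;
    long *shared_counter;   /* data touched inside the CS */
    int iterations;
    int critical_work;      /* CS length; larger means more contention */
    int static_work;        /* fixed non-critical work */
    int max_random_work;    /* upper bound for random non-critical work */
    unsigned seed;          /* per-thread random state */
} bench_args;

static void bench_worker(bench_args *a) {
    for (int i = 0; i < a->iterations; i++) {
        LOCK(a->lock);
        delay(a->critical_work);            /* CS */
        (*a->shared_counter)++;
        UNLOCK(a->lock);
        delay(a->static_work);              /* static_delay() */
        a->seed = a->seed * 1103515245u + 12345u;   /* simple LCG */
        delay((int)((a->seed >> 16) % (unsigned)(a->max_random_work + 1)));
    }
}
```

Raising `critical_work` relative to the non-critical delays increases the fraction of time each thread spends competing for the lock, which is how the amount of contention is controlled.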
16 Performance Results: New microbenchmark, 2-node Sun WildFire, 28 CPUs
[Figure: performance plot on WildFire (WF); both nodes labeled 14.]
17 Traffic Measurements: New microbenchmark
critical_work = 1500
18 Application Performance: Raytrace Speedup
[Figure: Raytrace speedup on WildFire (WF).]
19 Application Performance: Raytrace Speedup
[Figure: Raytrace speedup on WildFire (WF).]
20 RH Lock Under Contention
[Figure: CS cost plotted against amount of contention, with curves for spin locks, spin locks w/ backoff, and queue-based locks.]
21 Total Traffic: Raytrace
22 Application Performance: 28-processor runs
23 Conclusions
- First-come, first-served not desirable for NUCAs
- The RH lock exploits NUCAs by
- creating locality through CS affinity (stable lock)
- reducing traffic compared with the test&set locks
- The first lock that performs better under contention
- Global traffic is significantly reduced
- Applications with contended locks scale better with RH locks on NUCAs
24 Any Drawbacks?
- Proof-of-concept NUCA-aware lock for 2 nodes
- Hard to port to some architectures
- Memory needs to be allocated/placed in different nodes
- Lock storage is proportional to the number of NUCA nodes
- Sensitive to starvation
- Non-uniform nature of the algorithm
- No mechanism for lowering the risk of starvation
25 Can We Fix It?
- We propose a new set of NUCA-aware locks
- Hierarchical Backoff Locks (HBO)
- HPCA-9, Anaheim, California, February 2003
- Teaser
- Portable
- Scalable to many NUCA nodes
- Only cas atomic operations are used
- Only node_id is needed
- Lowers the risk of starvation
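One plausible reading of the teaser (cas only, node_id only) is a lock word that stores the node id of the current holder, so that a failed cas tells the waiter whether to back off briefly (same node) or for longer (other node). The sketch below follows that reading; it is not the published HBO code, and the names and backoff constants are assumptions.

```c
#include <stdatomic.h>

enum { HBO_FREE = -1 };   /* any value that is not a valid node id */

/* Hypothetical backoff parameters (loop iterations). */
enum { LOCAL_BASE = 16, LOCAL_CAP = 1024,
       REMOTE_BASE = 256, REMOTE_CAP = 16384 };

typedef struct { atomic_int word; } hbo_lock;  /* HBO_FREE or holder's node id */

/* cas(): store `desired` if the word equals `expected`;
 * always return the value that was observed in the word. */
static int cas(hbo_lock *l, int expected, int desired) {
    atomic_compare_exchange_strong(&l->word, &expected, desired);
    return expected;
}

static void hbo_acquire(hbo_lock *l, int my_node_id) {
    int local_b = LOCAL_BASE, remote_b = REMOTE_BASE;
    for (;;) {
        int holder = cas(l, HBO_FREE, my_node_id);
        if (holder == HBO_FREE)
            return;                              /* lock acquired */
        if (holder == my_node_id) {
            /* Held in our node: retry soon to keep the lock local. */
            for (volatile int i = 0; i < local_b; i++) { }
            if (local_b < LOCAL_CAP) local_b *= 2;
        } else {
            /* Held remotely: back off longer to cut global traffic. */
            for (volatile int i = 0; i < remote_b; i++) { }
            if (remote_b < REMOTE_CAP) remote_b *= 2;
        }
    }
}

static void hbo_release(hbo_lock *l) {
    atomic_store(&l->word, HBO_FREE);
}
```

Because the lock is a single word and needs no per-node storage, a scheme of this shape would be portable and would not grow with the number of NUCA nodes, which matches the bullets above.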
26 UART's Home Page
http://www.it.uu.se/research/group/uart