Description:

Non-Uniform Communication Architecture. Zoran Radovic and Erik Hagersten ... TATAS(my_TID, Lock) until FREE or L_FREE. if 'REMOTE': Spin remotely. CAS(FREE, REMOTE) ...

1
Efficient Synchronization for Non-Uniform
Communication Architecture
Zoran Radovic and Erik Hagersten
{zoran.radovic, erik.hagersten}@it.uu.se
  • Uppsala University, Department of Information
    Technology, Uppsala Architecture Research Team
    (UART)

2
Synchronization Basics
  • Locks are used to protect the shared critical
    section data

3
Simple Spin Locks
  • test_and_test&set (TATAS), '84
  • TATAS with exponential backoff (TATAS_EXP), '90
  • Many variations

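TATAS_LOCK(L) {
  while (tas(L)) {   // test&set returned BUSY: another processor holds the lock
    while (*L) ;     // spin on ordinary reads until the lock looks FREE
  }
}
TATAS_UNLOCK(L) {
  *L = 0;            // FREE
}
[Figure: processors P1, P2, P3, ... Pn share a memory holding the lock, whose value is either FREE or BUSY]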
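The TATAS and TATAS_EXP locks listed on this slide can be sketched with C11 atomics as follows. This is a minimal illustration, not the code used in the talk: the names `tatas_exp_lock`, `tatas_trylock`, `spin_delay` and the constants `BACKOFF_BASE` and `BACKOFF_CAP` are assumptions made for the example.

```c
#include <assert.h>
#include <stdatomic.h>

enum { FREE = 0, BUSY = 1 };

/* Hypothetical tuning constants, not taken from the slides. */
enum { BACKOFF_BASE = 16, BACKOFF_CAP = 4096 };

typedef atomic_int tatas_lock_t;

/* Busy-wait for roughly n loop iterations. */
static void spin_delay(int n) {
    for (volatile int i = 0; i < n; i++) { }
}

/* tas(): atomically set the lock to BUSY and return its previous value. */
static int tas(tatas_lock_t *l) {
    return atomic_exchange(l, BUSY);
}

/* Returns nonzero if the lock was acquired without waiting. */
int tatas_trylock(tatas_lock_t *l) {
    return tas(l) == FREE;
}

void tatas_exp_lock(tatas_lock_t *l) {
    int backoff = BACKOFF_BASE;
    while (tas(l) == BUSY) {
        /* Lost the race: wait, then double the delay up to the cap. */
        spin_delay(backoff);
        if (backoff < BACKOFF_CAP) backoff *= 2;
        /* "Test" phase: spin on plain loads until the lock looks FREE. */
        while (atomic_load(l) == BUSY) { }
    }
}

void tatas_unlock(tatas_lock_t *l) {
    assert(atomic_load(l) == BUSY);  /* unlocking a FREE lock is a caller bug */
    atomic_store(l, FREE);
}
```

Spinning on loads keeps waiting processors in their own caches, and the backoff spreads out the test&set attempts that follow each release.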
4
Performance Under Contention
IF (more contention) THEN less efficient CS
[Figure: CS cost plotted against amount of contention]
5
Making it Scalable: Queues
  • First-come, first-served order
  • Starvation avoidance
  • Maximal fairness
  • Reduced traffic
  • Queue-based locks
    • HW: QOLB, '89
    • SW: MCS, '91
    • SW: CLH, '93
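As an illustration of a software queue-based lock, here is a minimal sketch of an MCS-style lock in C11. Each waiter spins on a flag in its own queue node, which gives first-come, first-served handover and keeps spinning traffic local. The type and function names (`mcs_node_t`, `mcs_acquire`, `mcs_release`) are chosen for this example and are not from the slides.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;  /* successor in the queue, if any */
    atomic_bool locked;               /* true while this waiter must spin */
} mcs_node_t;

/* The lock itself is just a pointer to the tail of the waiter queue. */
typedef _Atomic(mcs_node_t *) mcs_lock_t;

void mcs_acquire(mcs_lock_t *lock, mcs_node_t *me) {
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    /* Append ourselves to the queue and learn who was last before us. */
    mcs_node_t *pred = atomic_exchange(lock, me);
    if (pred != NULL) {
        atomic_store(&pred->next, me);
        /* Spin on our own node only; the predecessor will clear the flag. */
        while (atomic_load(&me->locked)) { }
    }
}

void mcs_release(mcs_lock_t *lock, mcs_node_t *me) {
    mcs_node_t *succ = atomic_load(&me->next);
    if (succ == NULL) {
        /* No visible successor: try to swing the tail back to empty. */
        mcs_node_t *expected = me;
        if (atomic_compare_exchange_strong(lock, &expected, NULL)) return;
        /* A successor is enqueueing; wait until it links itself in. */
        while ((succ = atomic_load(&me->next)) == NULL) { }
    }
    assert(succ != NULL);
    atomic_store(&succ->locked, false);  /* hand the lock to the successor */
}
```

Each thread passes its own `mcs_node_t` (for example, one allocated on its stack) to both calls.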

6
Queue Locks Under Contention
IF (more contention) THEN constant CS cost
[Figure: CS cost plotted against amount of contention, with curves labeled "Spin locks" and "Spin locks w/ backoff"]
7
Non-Uniform Memory Architecture (NUMA)
[Figure: two nodes, each with processors P1, P2, P3, ... Pn and a local memory, connected by a switch; access cost labels: 1 and 2-10]
  • Many NUMA optimizations have been proposed
    • Page migration
    • Page replication

8
Non-Uniform Communication Architecture (NUCA)
[Figure: two nodes, each with processors P1, P2, P3, ... Pn and a local memory, connected by a switch; NUCA ratio labels: 1 and 2-10]
  • NUCA examples (NUCA ratios)
    • 1992: Stanford DASH (~4.5)
    • 1996: Sequent NUMA-Q (~10)
    • 1999: Sun WildFire (~6), our NUCA
    • 2000: Compaq DS-320 (~3.5)
    • Future: CMP, SMT (~10)

9
Our Goals
  • Design a scalable spin lock that exploits the
    NUCAs
  • Creating node affinity
  • For lock handover
  • For CS data
  • Stable lock
  • Reducing the traffic compared with the test&set
    locks

10
Outline
  • Background & Motivation
  • NUMA vs. NUCA
  • The RH Lock
  • Performance Results
  • Application Study
  • Conclusions

11
Key Ideas Behind RH Lock
  • Minimizing global traffic at lock-handover
  • Only one thread per node will try to acquire a
    remotely owned lock
  • Maximizing node locality of NUCAs
  • Hand over the lock to a neighbor in the same node
  • Creates locality for the critical section (CS)
    data as well
  • Especially good for large CS and high contention
  • RH lock in a nutshell
  • Double TATAS_EXP: one node-local lock and one
    global
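The "one node-local lock and one global" idea can be illustrated with a generic two-level spin lock. The sketch below is not the RH lock algorithm from the talk; it is a simplified stand-in in which a thread first takes its node's local lock and only then competes for the global lock, so at most one thread per node contends globally, and the global lock is passed to a waiting neighbor on the same node when possible. All names, and the `MAX_HANDOFFS` bound on consecutive in-node handovers, are assumptions added for this example.

```c
#include <assert.h>
#include <stdatomic.h>

enum { NUM_NODES = 2, MAX_HANDOFFS = 64 };  /* hypothetical values */

typedef struct {
    atomic_int local;    /* node-local lock word: 0 = FREE, 1 = BUSY */
    atomic_int waiters;  /* threads of this node currently trying to lock */
    int owns_global;     /* protected by `local` */
    int handoffs;        /* protected by `local` */
} node_lock_t;

typedef struct {
    atomic_int global;   /* global lock word: 0 = FREE, 1 = BUSY */
    node_lock_t node[NUM_NODES];
} two_level_lock_t;

/* TATAS with exponential backoff on a single lock word. */
static void spin_lock(atomic_int *w) {
    int backoff = 16;
    while (atomic_exchange(w, 1) != 0) {
        for (volatile int i = 0; i < backoff; i++) { }
        if (backoff < 4096) backoff *= 2;
        while (atomic_load(w) != 0) { }
    }
}

static void spin_unlock(atomic_int *w) { atomic_store(w, 0); }

void two_level_lock(two_level_lock_t *l, int node_id) {
    assert(node_id >= 0 && node_id < NUM_NODES);
    node_lock_t *n = &l->node[node_id];
    atomic_fetch_add(&n->waiters, 1);
    spin_lock(&n->local);             /* compete only with same-node threads */
    atomic_fetch_sub(&n->waiters, 1);
    if (!n->owns_global) {            /* node does not hold the global lock yet */
        spin_lock(&l->global);
        n->owns_global = 1;
        n->handoffs = 0;
    }
}

void two_level_unlock(two_level_lock_t *l, int node_id) {
    node_lock_t *n = &l->node[node_id];
    if (atomic_load(&n->waiters) > 0 && n->handoffs < MAX_HANDOFFS) {
        n->handoffs++;                /* keep the global lock in this node */
    } else {
        n->owns_global = 0;
        spin_unlock(&l->global);      /* let other nodes compete */
    }
    spin_unlock(&n->local);
}
```

Keeping the lock inside one node for several consecutive handovers is what creates locality for the critical-section data; the handover bound is one simple way to limit how long other nodes can be kept waiting.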

12
The RH Lock Algorithm
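[Figure: RH lock algorithm on two cabinets, each with its own memory and its own copy of the lock (Lock1, Lock2). Processors shown: P1, P2, P3, ... P16 and P17, P18, P19, ... P32. Lock values shown: FREE, L_FREE, REMOTE, and processor numbers. A processor that finds the lock FREE enters the CS.]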
13
Our NUCA: Sun WildFire
[Figure: Sun WildFire (WF): two nodes, each with processors P1, P2, P3, ... Pn and a local memory, connected by a switch; NUCA ratio labels: 1 and 6]
14
NUCA-performance
[Figure: performance chart; labels: 14, 14]
15
New Microbenchmark
  • More realistic node handoffs for queue-based
    locks
  • Constant number of processors
  • Amount of critical section (CS) work can be
    increased, so we can control the amount of
    contention

for (i = 0; i < iterations; i++) {
  LOCK(L);
  delay(critical_work);  // CS
  UNLOCK(L);
  static_delay();        // fixed non-critical work
  random_delay();        // randomized non-critical work
}
16
Performance Results: New microbenchmark, 2-node
Sun WildFire, 28 CPUs
[Figure: results chart for WildFire (WF); labels: 14, 14]
17
Traffic Measurements: New microbenchmark,
critical_work = 1500
18
Application Performance: Raytrace Speedup
[Figure: Raytrace speedup chart for WildFire (WF)]
19
Application Performance: Raytrace Speedup
[Figure: Raytrace speedup chart for WildFire (WF)]
20
RH Lock Under Contention
[Figure: CS cost plotted against amount of contention, with curves labeled "Spin locks", "Spin locks w/ backoff", and "Queue-based locks"]
21
Total Traffic: Raytrace
22
Application Performance: 28-processor runs
23
Conclusions
  • First-come, first-served is not desirable for NUCAs
  • The RH lock exploits NUCAs by
    • creating locality through CS affinity (stable
      lock)
    • reducing traffic compared with the test&set locks
  • The first lock that performs better under
    contention
  • Global traffic is significantly reduced
  • Applications with contended locks scale better
    with RH locks on NUCAs

24
Any Drawbacks?
  • Proof-of-concept NUCA-aware lock for 2 nodes
  • Hard to port to some architectures
  • Memory needs to be allocated/placed in different
    nodes
  • Lock storage is proportional to the number of NUCA nodes
  • Sensitive to starvation
  • Non-uniform nature of the algorithm
  • No mechanism for lowering the risk of starvation

25
Can We Fix It?
  • We propose a new set of NUCA-aware locks
  • Hierarchical Backoff Locks (HBO)
  • HPCA-9, Anaheim, California, February 2003
  • Teaser
  • Portable
  • Scalable to many NUCA nodes
  • Only cas atomic operations are used
  • Only node_id is needed
  • Lowers the risk of starvation
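The bullets above say only that the proposed locks use cas and need just a node_id. One way to read that, shown here as a hedged sketch and not as the published HBO algorithm, is a backoff lock whose lock word stores the holder's node id, so that a waiter can choose a short backoff when the holder is on its own node and a long one otherwise. The names `hbo_lock`, `hbo_unlock`, `HBO_FREE` and all backoff constants are assumptions for this example.

```c
#include <assert.h>
#include <stdatomic.h>

enum { HBO_FREE = -1 };  /* lock word value when nobody holds the lock */

/* Hypothetical backoff limits: short for same-node holders, long otherwise. */
enum { LOCAL_BASE = 8, LOCAL_CAP = 256, REMOTE_BASE = 64, REMOTE_CAP = 8192 };

typedef atomic_int hbo_lock_t;

static void spin_delay(int n) {
    for (volatile int i = 0; i < n; i++) { }
}

void hbo_lock(hbo_lock_t *l, int node_id) {
    assert(node_id >= 0);  /* node ids must not collide with HBO_FREE */
    int local_b = LOCAL_BASE;
    int remote_b = REMOTE_BASE;
    for (;;) {
        int seen = HBO_FREE;
        /* cas(FREE -> node_id); on failure `seen` is the holder's node id. */
        if (atomic_compare_exchange_strong(l, &seen, node_id)) return;
        if (seen == node_id) {
            /* Holder is a neighbor: retry soon to keep the lock in the node. */
            spin_delay(local_b);
            if (local_b < LOCAL_CAP) local_b *= 2;
        } else {
            /* Holder is on another node: back off longer to cut global traffic. */
            spin_delay(remote_b);
            if (remote_b < REMOTE_CAP) remote_b *= 2;
        }
    }
}

void hbo_unlock(hbo_lock_t *l) {
    atomic_store(l, HBO_FREE);
}
```

Because the only per-thread input is `node_id`, the lock word needs no per-node storage, which matches the portability and scalability goals listed on the slide.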

26
UART's Home Page
http://www.it.uu.se/research/group/uart