Title: Lock Behaviour Characterization of Commercial Workloads
1 Lock Behaviour Characterization of Commercial Workloads
Jichuan Chang <chang_at_cs.wisc.edu>
Xidong Wang <wxd_at_cs.wisc.edu>
2 Outline
- Motivation
- Methods
- Results
- Speculative Lock Elision Issues
- Conclusions
3 Motivation
- Understanding the synchronization behavior of commercial workloads (OLTP, Apache, SpecJBB)
- Identifying opportunities for Speculative Lock Elision (performance, ease of programming)
4 Questions to Answer
- Lock related statistics
- Can hardware identify critical sections?
- Critical section size
- Lock-free section size
- Amount of lock contention
- Hardware optimizations by speculation
- Context switching implications
- Resource requirements
- Other issues
- Realistic timing model
- Other synchronization (reader/writer, etc)
5 Methods
- Benchmarks
- OLTP, Apache, JBB, Barnes (for comparison)
- Full system simulation (tracing) using Simics
- Simple timing model: Simics tracer
- Ruby timing model: Simics Ruby
- Using instructions (not cycles) as the measurement unit
- Set cpu_switch_time to 1, disable STC
- Validating our approach
- Using micro-benchmarks to compare our stats with the results reported by kernel tools (lockstat)
- Tracing into disassembly code (kernel/user)
6 Lock Identification
- Basic idea from SLE
- Lock acquisition must use one atomic instruction.
- Silent store pair: taken as a pair, the stores in the lock acquisition and release operations are silent.
- SPARC v9 atomic instructions
- ldstub, swap, casa (compare-and-swap)
OLTP
ldstub [%o0 + %g0], %o4
brnz,pn %o4, <0x10034b98>
stbar
stb %g0, [%o0 + 12]
Values: 0x0 -> 0xff, 0xff -> 0x0
JBB
casa [%l2] 128, %g4, %g3
casa [%l2] 128, %l0, %g4
Values: 0x1 -> 0x8410f8bc, 0x8410f8bc -> 0x1
7 Lock Identification Algorithm
- Starts with an atomic instruction
- that writes back a different value to the lock
- otherwise the lock acquisition was unsuccessful
- Examine each following store made by the same CPU
- Until we meet a normal store
- that completes the silent store pair
- usually with the value of 0x0
- Other completion patterns
- Self-release (by the same CPU)
- using atomic instruction, pair-silently (JBB)
- using atomic instruction, not pair-silently
- Cross-release (by a different CPU)
- using atomic instruction
- Removed: locks whose release can't be observed (16K limited window).
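The steps on this slide can be sketched as a single pass over a write trace. This is a minimal illustration, not the tool used for the study: the record layout (`MemOp`), the result type, the function name, and the reading of the 16K window as an instruction count are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List

WINDOW = 16 * 1024  # observation window; the unit (instructions) is an assumption


@dataclass(frozen=True)
class MemOp:
    """One traced memory write (hypothetical record layout)."""
    seq: int      # position in the trace, in instructions
    cpu: int
    atomic: bool  # True for ldstub/swap/casa, False for a normal store
    addr: int
    old: int      # value in memory before the write
    new: int      # value written back


@dataclass(frozen=True)
class CriticalSection:
    addr: int
    acquire_seq: int
    release_seq: int
    kind: str          # "self-store", "self-atomic" or "cross-atomic"
    silent_pair: bool  # release restored the pre-acquisition value


def find_critical_sections(trace: Iterable[MemOp]) -> List[CriticalSection]:
    held: Dict[int, MemOp] = {}  # lock address -> acquiring operation
    found: List[CriticalSection] = []
    for op in trace:
        # Drop candidates whose release was not seen within the window.
        for addr in [a for a, acq in held.items() if op.seq - acq.seq > WINDOW]:
            del held[addr]

        acq = held.get(op.addr)
        if acq is None:
            # An atomic that writes back a different value starts a candidate.
            if op.atomic and op.new != op.old:
                held[op.addr] = op
            continue

        same_cpu = op.cpu == acq.cpu
        silent = op.new == acq.old
        if op.atomic:
            if op.new == op.old:
                continue  # failed acquisition attempt, lock value unchanged
            kind = "self-atomic" if same_cpu else "cross-atomic"
        elif same_cpu and silent:
            kind = "self-store"  # normal store completing the silent pair
        else:
            continue  # unrelated store, keep scanning
        found.append(CriticalSection(op.addr, acq.seq, op.seq, kind, silent))
        del held[op.addr]
    return found
```

Feeding it the OLTP-style pattern (0x0 -> 0xff by an atomic, then 0xff -> 0x0 by a normal store on the same CPU) yields one "self-store" section; the JBB-style casa/casa pattern yields a "self-atomic" section with the silent-pair flag set.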
8 Lock Frequency
9 Execution Phase Breakdown
10 Critical Section Size
11 Lock-free Section Size
12 Timing Models
- Adding Ruby doesn't change the size of critical sections and lock-free sections, but removes lock contention.
- Why?
- Shrinking caused by less frequent memory accesses within critical sections
- or simulation effect?
- Guess: more shrinking using Ruby and Opal
13 Lock Contention
- Waiting: from the first try to successful acquisition
- Spinning: ignore waits that last more than 4K instructions.
14 Distinguishing wait and spin
- Why bother?
- Very few long-waiting events make a big difference in the percentage of wasted instructions
- Easy if we can identify thread switching
- But the identification is not easy
- Treat as waiting if spinning for too many instructions
- Using 4096 instructions as the limit
- 90% of contentions are shorter than 4K instructions
- It makes sense for different timing models.
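A small sketch of the 4096-instruction cutoff, showing why a few long waits dominate the wasted-instruction percentage. The function names and the sample lengths are hypothetical and only illustrate the arithmetic.

```python
from typing import Iterable, Tuple

SPIN_LIMIT = 4096  # instructions; the limit quoted on the slide


def classify(length: int, limit: int = SPIN_LIMIT) -> str:
    """Label one contention event by its length, first try to acquisition."""
    return "spin" if length <= limit else "wait"


def summarize(lengths: Iterable[int], limit: int = SPIN_LIMIT) -> Tuple[float, float]:
    """Return (fraction of events that are spins,
    fraction of contended instructions spent in spins)."""
    lengths = list(lengths)
    if not lengths or sum(lengths) == 0:
        raise ValueError("need at least one event with a non-zero length")
    spins = [n for n in lengths if classify(n, limit) == "spin"]
    return len(spins) / len(lengths), sum(spins) / sum(lengths)
```

With nine made-up events of 100 instructions and one of 100,000, nine tenths of the events are spins, yet they account for under 1% of the contended instructions; the single long wait accounts for the rest.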
15 Lock Contention: Most Contended Locks
16 SLE on Commercial Workloads
- Context switching (later)
- Buffering requirement: not much
- Small critical sections dominate
- Except for Apache user locks (1-8K)
- Single shared buffer among threads on the same CPU
- Possible performance gain
- Not big if only counting the number of instructions (1 - 6)
- Critical section size already small
- Contention already infrequent
- Can be larger if lock spinning latency increases
- Can be smaller
- if less lock contention happens (as in the Ruby case)
- Must throttle speculation (to avoid unnecessary rollbacks)
17 Context Switch
- Why bother?
- Needed to precisely quantify the number of instructions spent on lock waiting (process and thread switching)
- Needed to correctly implement speculative lock elision (process switching only)
- Process Switching Identification
- Marker: Demap TLB on context switch
- Apache (100 transactions, CPU 3)
- Average 210K instructions (Max 360K, Min 160K)
- Process switches are infrequent, performance implication negligible
- Thread Switching Identification is hard
- No simple patterns to observe, no feedback to validate assumptions
- Not a good idea to provide a separate buffer for each thread on a single processor: conflicts are hard to detect, and thread switching needs many buffers.
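Given the TLB demap marker, the distance between process switches on one CPU reduces to gaps between marker positions. The sketch below assumes a list of instruction counts at which the marker was seen; the function names and the sample positions are hypothetical, not data from the study.

```python
from statistics import mean
from typing import List, Sequence, Tuple


def switch_intervals(demap_seqs: Sequence[int]) -> List[int]:
    """Instruction gaps between consecutive demap markers on one CPU."""
    ordered = sorted(demap_seqs)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def interval_stats(demap_seqs: Sequence[int]) -> Tuple[float, int, int]:
    """Return (average, max, min) gap; needs at least two markers."""
    gaps = switch_intervals(demap_seqs)
    if not gaps:
        raise ValueError("need at least two markers")
    return mean(gaps), max(gaps), min(gaps)
```

For made-up marker positions 0, 1000, 4000 and 6000, the gaps are 1000, 3000 and 2000 instructions.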
18 Other Synchronization Algorithms
- Hard to recognize complex synchronization
- Barriers, reader/writer locks, etc.
- Mutual exclusion implementations are composed of small critical sections
- pthread_mutex_lock(lock) acquires 3 locks
- Reader/writer locks use locks to maintain their data structures (reader/writer queues, number of current readers, etc.)
Serialized Execution (maintained by synch. algo.)
writer_enter()
writer_exit()
HW only sees two small critical sections
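The point of the diagram can be illustrated with a toy reader/writer lock whose bookkeeping is protected by one small mutex. Everything here (class names, the logging mutex, the retry loop) is a hypothetical sketch, not the thread library's implementation; the log stands in for what lock-detecting hardware would observe.

```python
import threading
import time
from typing import List


class LoggedMutex:
    """Plain mutex that records every acquire and release it performs."""

    def __init__(self, log: List[str]) -> None:
        self._lock = threading.Lock()
        self._log = log

    def __enter__(self) -> "LoggedMutex":
        self._lock.acquire()
        self._log.append("acquire")
        return self

    def __exit__(self, *exc) -> None:
        self._log.append("release")
        self._lock.release()


class RWLock:
    """Toy reader/writer lock; its state is guarded by one small mutex."""

    def __init__(self, log: List[str]) -> None:
        self._mutex = LoggedMutex(log)
        self._readers = 0
        self._writer = False

    def writer_enter(self) -> None:
        while True:
            with self._mutex:  # small critical section 1
                if not self._writer and self._readers == 0:
                    self._writer = True
                    return
            time.sleep(0)  # back off outside the mutex, then retry

    def writer_exit(self) -> None:
        with self._mutex:  # small critical section 2
            self._writer = False

    def reader_enter(self) -> None:
        while True:
            with self._mutex:
                if not self._writer:
                    self._readers += 1
                    return
            time.sleep(0)

    def reader_exit(self) -> None:
        with self._mutex:
            self._readers -= 1
```

When one uncontended writer calls `writer_enter()`, does its work, and calls `writer_exit()`, the log shows two short acquire/release pairs with the serialized work between them and outside both, which is all an acquire/release detector would see.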
19 Conclusion
- Commercial workloads lock characterization
- Small critical sections dominate
- Infrequent lock contention
- User/kernel code have different behavior
- Kernel locks can't be ignored
- (Kernel) contended PCs predictable
- Performance Improvements
- SLE won't help as much
20 Thank You! Questions?
21 Backup Slides
- Thread switching details
- Critical section size using Ruby timing model
- SPARC Atomic Instructions
- Misc Issues
- Acknowledgement
22 Thread Switch Identification
- User thread scheduling
- Disassemble the user thread library and observe execution of scheduling methods (_disp, _switch). Not always possible!
- Kernel thread scheduling
- Involves a set of interleaved method invocations (resume, disp, swtch, _resume_from_idle..). Hard to identify the starting and ending points of a thread switch
- Impossible to identify a kernel thread switch by only observing register window swaps, since they also happen in user thread switches
- No feedback from the OS to validate our assumptions
- Methodology and Preliminary Observations
- Disassemble kernel code to build a VA -> kernel method map. Observe the method control flow in the Simics trace.
- resume may indicate a kernel thread switch
- user_rtt may indicate a user-level thread switch.
- Conclusion: Thread Switch Identification is a hard, unresolved issue
23 Critical Section Size (Ruby)
24 SPARC Atomic Instructions
- ldstub
- Loads a byte, then writes all 1s (0xff) into it
- swap
- Swaps the value of the register and the memory location
- casa (compare-and-swap)
- Swaps if (value in the 1st reg == value in mem)
- membar/stbar
- Usually follows such atomic instructions
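The value-level effect of the three atomic instructions can be modelled in a few lines. This sketch treats memory as a Python dict and ignores atomicity, address space identifiers, and operand widths; the function signatures are illustrative, not SPARC syntax. Each function returns the old memory value, which is what the destination register receives.

```python
from typing import Dict


def ldstub(mem: Dict[int, int], addr: int) -> int:
    """Load the byte at addr, then store 0xff into it."""
    old = mem[addr]
    mem[addr] = 0xFF
    return old


def swap(mem: Dict[int, int], addr: int, reg: int) -> int:
    """Exchange a register value with the memory location."""
    old = mem[addr]
    mem[addr] = reg
    return old


def casa(mem: Dict[int, int], addr: int, compare: int, new: int) -> int:
    """Store new only if memory equals compare; always return the old value."""
    old = mem[addr]
    if old == compare:
        mem[addr] = new
    return old
```

A second `ldstub` on a held lock returns 0xff and rewrites 0xff, so the lock value does not change; this is how a failed acquisition shows up in a trace.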
25 Misc.
- Why is Apache strange?
- Locks more frequent, few user locks (1-2)
- Large percentage of critical section instructions
- Nested Locks
- Intertwined Locks
- Critical sections in Barnes are more clustered
- Buffer size 29 30 1/3 64 Blocks
- The same as SLE
26 Acknowledgement
- Project suggested by Prof. Mark Hill
- Guiding and supporting
- Lots of discussion with and help from
- Min Xu, our TA
- Carl Mauer, Multifacet simulator expert
- Ravi Rajwar, SLE paper author