Title: Architectural Features of Transactional Memory Designs for an Operating System
1 Architectural Features of Transactional Memory Designs for an Operating System
- Chris Rossbach, Hany Ramadan, Don Porter
- Advanced Computer Architecture
- Fall 2006, Prof. Burger
2 Motivation
- What would a realistic HTM system actually support? (primitives/design choices)
- Current Transactional Memory proposals make architectural design choices with inadequate information
- shared counter, linked list benchmarks
- focus on user mode avoids OS issues
3 HTM OS: are you nuts?
- Large concurrent program with complex data access patterns
- Complex code: a simpler programming model helps
- Many apps spend a lot of time in the kernel
- Diverse synchronization primitives
- spinlocks, semaphores, per-CPU variables, RCU, seqlocks, completions, mutexes
4 Our HTM System
- Basic primitives
- xbegin, xend
- OS-specific primitives
- xpush, xpop
- stack management: interrupts on x86 re-use the stack
- Configurable Hardware Parameters
- Conflict detection granularity
- Commit and abort penalties
- Overflow costs
- Configurable contention management
- Conflict resolution policies: which tx restarts?
- Backoff policies: how long to wait before restart
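The two contention-management knobs listed above can be illustrated with a small sketch. It is not the simulator's implementation: the `Tx` record, the "older transaction wins" resolution rule, and the base and cap cycle counts are all assumptions chosen for illustration. Only the linear and exponential backoff families are named in these slides.

```python
from dataclasses import dataclass


@dataclass
class Tx:
    tx_id: int
    start_time: int      # hypothetical: cycle at which the tx began
    restarts: int = 0    # how many times this tx has been restarted


def pick_loser_oldest_wins(a, b):
    """Conflict resolution (assumed policy): the younger tx restarts."""
    return a if a.start_time > b.start_time else b


def linear_backoff(restarts, base_cycles=100):
    """Wait grows linearly with the number of restarts."""
    return base_cycles * restarts


def exponential_backoff(restarts, base_cycles=100, cap_cycles=100_000):
    """Wait doubles with each restart, up to a cap."""
    return min(base_cycles * (2 ** restarts), cap_cycles)


def resolve(a, b, pick_loser, backoff):
    """Apply a resolution policy, then compute the loser's wait in cycles."""
    loser = pick_loser(a, b)
    loser.restarts += 1
    return loser, backoff(loser.restarts)
```

Swapping `linear_backoff` for `exponential_backoff` in `resolve` changes only how long the restarted transaction waits, which is the kind of variation the backoff-policy experiments explore.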
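5 An Issue Unique to an OS: Using transactions in interrupt handlers
- Example: system_call() runs XBEGIN, modify 0x10, XEND
- Example: intr_handler() runs XPUSH, XBEGIN, modify 0x30, XEND, XPOP
- Design options when an interrupt arrives during a transaction
- No tx in interrupts
- Interrupts abort active tx
- Nest the transactions (TX 1 covers 0x10, 0x30)
- Multiple active transactions (TX 1 covers 0x10, TX 2 covers 0x30)
[Figure: one diagram per option showing addresses 0x10 to 0x40, the interrupt, and the transactions TX 1 and TX 2 that touch them]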
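The "multiple active transactions" option can be sketched as a toy model. The slides do not spell out the exact semantics of xpush and xpop, so this sketch assumes that xpush suspends the running transaction and xpop resumes it. The `Cpu` class and its fields are hypothetical names, and conflict detection is left out entirely.

```python
class Cpu:
    """Toy model of one CPU's transactional state (assumed semantics)."""

    def __init__(self):
        self.active = None     # write set of the running tx, or None
        self.saved = []        # transactions suspended by xpush
        self.committed = []    # write sets, in commit order

    def xbegin(self):
        assert self.active is None, "this sketch does not model nesting"
        self.active = set()

    def modify(self, addr):
        # Only writes made inside a transaction are tracked here.
        if self.active is not None:
            self.active.add(addr)

    def xend(self):
        assert self.active is not None, "xend without xbegin"
        self.committed.append(self.active)
        self.active = None

    def xpush(self):
        # Suspend whatever was running (possibly nothing) for the handler.
        self.saved.append(self.active)
        self.active = None

    def xpop(self):
        # Resume the transaction that the interrupt suspended.
        self.active = self.saved.pop()


def run_example():
    cpu = Cpu()
    cpu.xbegin()
    cpu.modify(0x10)           # system_call() is mid-transaction
    cpu.xpush()                # interrupt arrives: intr_handler() starts
    cpu.xbegin()
    cpu.modify(0x30)
    cpu.xend()                 # handler's transaction commits on its own
    cpu.xpop()                 # back to system_call()'s transaction
    cpu.xend()
    return cpu
```

In this model the handler's transaction (0x30) commits independently of the interrupted one (0x10), which is what distinguishes this option from nesting.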
6 Converting Linux to TxLinux
- TxLinux based on kernel 2.6.16.1
- Converted core primitives to use transactions
- spin-locks, RCU primitives, r/w locks
- critical sections become transactions
- Converted high traffic subsystems
- memory allocators, FS directory cache, data structures mapping addresses to pages, memory mapping of files into address spaces, IP routing, and socket locking
- Modified interrupt-handling code to use primitives in our HTM model (xpush, xpop)
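As a rough software illustration of "critical sections become transactions", the sketch below runs a critical-section body speculatively and retries it on abort. This is not how TxLinux or the hardware does it: in HTM the buffering and conflict detection happen in the cache. The `atomic` helper, the `Abort` exception, and the `attempt` argument are hypothetical names used only to make the retry behaviour visible.

```python
class Abort(Exception):
    """Raised by a body to simulate a conflict detected mid-transaction."""


def atomic(memory, body, max_retries=10):
    """Run body as a transaction over the dict `memory`.

    Writes go to a private buffer (like xbegin), are discarded on abort,
    and are published together on success (like xend).
    Returns the number of restarts that were needed.
    """
    for attempt in range(max_retries):
        writes = {}                      # speculative write buffer
        try:
            body(memory, writes, attempt)
        except Abort:
            continue                     # drop the buffer and restart
        memory.update(writes)            # commit all writes together
        return attempt
    raise RuntimeError("too many restarts")


def increment_counter(memory, writes, attempt):
    """Former lock-protected critical section: counter += 1."""
    if attempt == 0:
        writes["counter"] = 99           # speculative write that is lost
        raise Abort()                    # simulated conflict on first try
    writes["counter"] = memory["counter"] + 1
```

The speculative write from the aborted attempt never reaches `memory`, so the caller observes the increment exactly once.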
7 HTM Implementation
- Implemented HTM model as x86 extensions
- Simulation environment
- Simics 3.0.17 machine simulator
- transactional L1 cache (variable, 4 KB to 32 KB)
- 4 MB L2, 1 GB RAM
- 1 cycle/instruction, 16 cycles/L1 miss, 200 cycles/L2 miss
- 4 and 8 processors
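The timing parameters above suggest a simple additive cost model, sketched below. The additive form is an assumption, as is the reading that an L1 miss costs 16 cycles and an L2 miss costs 200 cycles on top of the instruction count. The function name is hypothetical.

```python
CYCLES_PER_INSTRUCTION = 1
CYCLES_PER_L1_MISS = 16
CYCLES_PER_L2_MISS = 200


def simulated_cycles(instructions, l1_misses, l2_misses):
    """Estimate cycles under an assumed additive cost model."""
    return (instructions * CYCLES_PER_INSTRUCTION
            + l1_misses * CYCLES_PER_L1_MISS
            + l2_misses * CYCLES_PER_L2_MISS)
```

For example, 1000 instructions with 10 L1 misses and 2 L2 misses come to 1000 + 160 + 400 = 1560 cycles under this model.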
8 Experimental Setup
- Benchmarks
- micro: kernalloc, Counter, directory cache punisher
- macro: pmake, netcat, MAB, configure, find
- Measurements
- Execution time
- Transaction statistics: created/restarted/overflowed, working sets, footprint
- Cache statistics (e.g. miss rate)
- Variables
- Contention management (conflict/backoff policies)
- Transactional cache size
- Commit, abort, overflow penalties
- Conflict granularity (byte vs. word vs. cache line)
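The effect of conflict granularity can be shown with a short sketch. Two accesses conflict when they fall into the same tracking unit, so coarser units report conflicts that finer ones do not. The 4-byte word and 64-byte cache line sizes are assumptions, and the sketch takes it as given that at least one of the two accesses is a write.

```python
# Tracking-unit sizes in bytes; the word and cache-line sizes are assumptions.
GRANULE_BYTES = {"byte": 1, "word": 4, "cache_line": 64}


def granules(addr, size, granularity):
    """Return the set of tracking units touched by an access."""
    g = GRANULE_BYTES[granularity]
    first = addr // g
    last = (addr + size - 1) // g
    return set(range(first, last + 1))


def conflicts(access_a, access_b, granularity):
    """True if two (addr, size) accesses share a tracking unit."""
    addr_a, size_a = access_a
    addr_b, size_b = access_b
    shared = granules(addr_a, size_a, granularity) & granules(
        addr_b, size_b, granularity)
    return bool(shared)
```

Two 4-byte accesses at 0x1000 and 0x1008 touch different bytes and different words but the same 64-byte line, so only cache-line granularity flags them as a (false) conflict.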
9 TxLinux Results (4 processors)
Transactions Created 105,972 425,888 475,860 1,810,602 1,408,610 243,934
- Performance change minimal, lots of transactions
- Unique transaction restarts were < 0.07
- Data cache miss rates do not change appreciably
10 Contention Management Matters!
[Figure: linear backoff policy, 4 processors]
11 Conclusions
- TxLinux is cooler than Linux, and has comparable performance
- Cache line granularity is good enough
- A 16 KB transactional cache covers the vast majority of transactions
- Best contention management policy is workload dependent
- Exponential backoff is too conservative
12 Backup Slides
13 Contention Management Restart Rates
14 Conflict Granularity and Backoff Policy
15 Stack Management Issue
- Treating the Stack as a shared resource
- Checkpoint
- Partition
16 Txl Memory Allocator Investigation
- Examine Tx complexity/performance trade-off
- The slab is the default kernel memory allocator
- Highly tuned for performance
- Avoids contention/locks, uses per-CPU structures
- About 3,880 lines of code
- The slob is a drop-in replacement
- Designed for minimal bookkeeping memory overhead
- Uses two coarse-grained locks (386 lines)
- The slob-opt is slob with modifications
- Removed obvious transaction bottlenecks
- Only a couple of dozen lines of code changed
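17 Txl Memory Allocator Results (4 proc)
For each allocator, the first row is execution time (in seconds) and the second row is unique restarts.
Allocator       Metric           Kernalloc  Pmake  MAB    configure  Find
slab            Execution time   1.4        13.9   8.0    14.1       1.8
slab            Unique restarts  0          0.04   0.07   0.04       0
slob            Execution time   -          14.3   21.3   16.3       1.8
slob            Unique restarts  -          1.78   19.72  5.93       0.71
slob-optimized  Execution time   16.7       14.1   12.7   14.9       1.8
slob-optimized  Unique restarts  18.17      0.45   8.48   1.42       0.12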
18 Transactional Memory Issues
- Hardware vs. Software
- Different interfaces
- strong (HW) vs. weak (SW) atomicity
- Will transactions make programming easier?
- Transactions for blocking primitives?
- Using transactions for security?