Title: Programming Multi-Core Systems
1. Programming Multi-Core Systems
- Feng Zhou
- 2007-12-28
- http://zhoufeng.net
2. Last Week
- A trend in software: the movement to safe languages
- Improving dependability of system software without using safe languages
3. A Trend in Hardware
- The movement to multi-core processors
- Originates from the inability to keep increasing processor clock rates
- Profound impact on architecture, OS and applications
Topic today: How do we program multi-core systems effectively?
4. Why Multi-Core?
- Sources of CPU performance improvement over the last 30 years:
  1. Clock speed
  2. Execution optimization (cycles per instruction)
  3. Cache
- All three are concurrency-agnostic
- Now (1) has disappeared, (2) is slowing down, (3) is still good
  - (1) No 4 GHz Intel CPUs
  - (2) Improves with new micro-architectures (e.g. Core vs. NetBurst)
5. CPU Speed History
From Computer Architecture: A Quantitative Approach, 4th edition, 2007
6. Reality Today
- Near-term performance drivers:
  1. Multicore
  2. Hardware threads (a.k.a. Simultaneous Multi-Threading)
  3. Cache
- (1) and (2) are useful only when software is concurrent
"Concurrency is the next revolution in how we write software" --- The Free Lunch is Over
7. Costs/Problems of Concurrency
- Overhead of locks, message passing
- Not all programs are parallelizable
- Programming concurrently is HARD
  - Complex concepts: mutexes, read-write locks, queues
  - Correct synchronization: races, deadlocks
  - Getting speed-up
Our focus today
8. Potential Multi-Core Apps
- Server apps w/o shared state: Apache web server
- Server apps with shared state: MMORPG game server
- Stream-sort data processing: MapReduce, Yahoo Pig
- Scientific computing (many different models): BLAS, Monte Carlo, N-Body
- Machine learning: HMM, EM algorithm
- Graphics and games: NVIDIA Cg, GPU computing
9. Roadmap
- Overview of multi-core computing
- Overview of transactional programming
- Case: transactions for safe/managed languages
- Case: transactions for languages like C
10. Status Quo in Synchronization
- Current mechanism: manual locking
  - Organization: a lock for each shared structure
  - Usage: (block) -> acquire -> access -> release
- Correctness issues
  - Under-locking -> data races
  - Acquiring locks in different orders -> deadlock
- Performance issues
  - Difficult to find the right granularity
  - Overhead of acquiring vs. allowed concurrency
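As a concrete illustration of the acquire -> access -> release pattern and its pitfalls, here is a minimal sketch using POSIX threads; the account structure and function names are hypothetical, not from the slides:

#include <pthread.h>

/* Hypothetical shared structure, protected by its own lock. */
typedef struct {
    pthread_mutex_t lock;
    int balance;
} account_t;

/* Classic acquire -> access -> release. */
void deposit(account_t *a, int amount) {
    pthread_mutex_lock(&a->lock);   /* acquire */
    a->balance += amount;           /* access  */
    pthread_mutex_unlock(&a->lock); /* release */
}

/* Needs two locks. If another code path locks (to, from) instead of
 * (from, to), two threads can deadlock; forgetting a lock entirely
 * ("under-locking") silently introduces a data race instead. */
void transfer(account_t *from, account_t *to, int amount) {
    pthread_mutex_lock(&from->lock);
    pthread_mutex_lock(&to->lock);
    from->balance -= amount;
    to->balance   += amount;
    pthread_mutex_unlock(&to->lock);
    pthread_mutex_unlock(&from->lock);
}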
11. Transactions / Atomic Sections
- Databases have provided automatic concurrency control for 30 years: ACID transactions
- Vision
  - Atomicity
  - Isolation
  - Serialization only on conflicts
  - (optional) Rollback/abort support
Question: Is it possible to provide database transaction semantics for general programming?
12. Transactions vs. Manual Locks
- Manual locking issues
  - Under-locking
  - Acquiring locks in different orders
  - Blocking
  - Conservative serialization
- How transactions help
  - No explicit locks
  - No ordering
  - Transactions can be cancelled
  - Serialization only on conflicts
Transactions: potentially simpler and more efficient
13. Design Space
- Hardware transactional memory (HTM) vs. software TM (STM)
- Granularity: object, word, block
- Update method
  - Deferred: discard the private copy on abort
  - Direct: control access to data, undo updates on abort
- Concurrency control
  - Pessimistic: prevent conflicts by locking
  - Optimistic: assume no conflict and retry if there is one
14. Hard Issues
- Communication, or side effects
  - File I/O
  - Database accesses
- Interaction with other abstractions
  - Garbage collection
  - Virtual memory
- Working with existing languages, synchronization primitives, ...
15. Why Software TM?
- More flexible
- Easier to modify and evolve
- Integrates better with existing systems and languages
- Not limited to fixed-size hardware structures, e.g. caches
16. Proposed Language Support

// Insert into a doubly-linked list atomically
atomic {
  newNode->prev = node;
  newNode->next = node->next;
  node->next->prev = newNode;
  node->next = newNode;
}

// Conditional atomic block
atomic (queueSize > 0) {
  // remove item from queue and use it
}
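For intuition only, here is a minimal sketch of how a compiler or preprocessor might lower such an atomic block onto a word-based STM runtime with a retry loop. The stm_* functions are hypothetical placeholders, not the API of the systems discussed later:

struct node { struct node *prev, *next; };

/* Hypothetical STM runtime interface (illustrative only). */
void  stm_start(void);
void *stm_read(void **addr);            /* record addr in the read set  */
void  stm_write(void **addr, void *v);  /* record addr in the write set */
int   stm_commit(void);                 /* 1 on success, 0 on conflict  */

/* The atomic list-insert block above could be lowered to: */
void insert_after(struct node *node, struct node *newNode) {
    do {
        stm_start();
        stm_write((void **)&newNode->prev, node);
        struct node *next = stm_read((void **)&node->next);
        stm_write((void **)&newNode->next, next);
        stm_write((void **)&next->prev, newNode);
        stm_write((void **)&node->next, newNode);
    } while (!stm_commit());   /* on conflict: roll back and re-execute */
}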
17. Transactions for Managed/Safe Languages
18. Intro
- "Optimizing Memory Transactions", Harris et al., PLDI '06
- STM system from MSR Cambridge, for MSIL (.NET)
- Design features
  - Object-level granularity
  - Direct update
  - Version numbers to track reads
  - Two-phase locking to track writes
  - Compiler optimizations
19. First-Generation STM
- Very expensive, e.g. Harris & Fraser (OOPSLA '03)
- Every load and store instruction logs information into a thread-local log
  - A store instruction writes only the log
  - A load instruction consults the log first
- At the end of the block, validate the log and atomically commit it to shared memory
20. Direct-Update STM
- Augment objects with (i) a lock and (ii) a version number
- Transactional write
  - Lock the object before writing to it (abort if another thread holds that lock)
  - Log the overwritten data: we need it to restore the heap in case of a retry, a transaction abort, or a conflict with a concurrent thread
  - Make the update in place to the object
21. Direct-Update STM (cont.)
- Transactional read
  - Log the object's version number
  - Read from the object itself
- Commit
  - Check the version numbers of the objects we've read
  - Increment the version numbers of the objects we've written, unlocking them
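A compact sketch of these per-object operations in C, assuming illustrative metadata fields (ver, locked_by) and a fixed-size transaction log; this is not the paper's actual data layout, and it ignores the compare-and-swap operations a real concurrent implementation needs:

#include <stdbool.h>

typedef struct { int ver; int locked_by; int val; } obj_t;   /* assumed header */

#define MAX_ENTRIES 64
typedef struct {
    int id;
    obj_t *read_obj[MAX_ENTRIES]; int read_ver[MAX_ENTRIES]; int n_reads;
    obj_t *upd_obj[MAX_ENTRIES];  int old_val[MAX_ENTRIES];  int n_upds;
} tx_t;

/* Transactional read: log the version we saw, then read in place. */
static int tx_read(tx_t *tx, obj_t *o) {
    tx->read_obj[tx->n_reads] = o;
    tx->read_ver[tx->n_reads++] = o->ver;
    return o->val;
}

/* Transactional write: lock, log the overwritten value, update in place. */
static bool tx_write(tx_t *tx, obj_t *o, int v) {
    if (o->locked_by != 0 && o->locked_by != tx->id) return false;  /* abort */
    o->locked_by = tx->id;
    tx->upd_obj[tx->n_upds] = o;
    tx->old_val[tx->n_upds++] = o->val;
    o->val = v;                               /* direct (in-place) update */
    return true;
}

/* Commit: validate the read log, then bump versions and unlock.
 * On failure, restore the old values from the undo log. */
static bool tx_commit(tx_t *tx) {
    for (int i = 0; i < tx->n_reads; i++) {
        obj_t *o = tx->read_obj[i];
        bool locked_by_other = (o->locked_by != 0 && o->locked_by != tx->id);
        if (o->ver != tx->read_ver[i] || locked_by_other) {   /* conflict */
            for (int j = tx->n_upds - 1; j >= 0; j--) {       /* roll back */
                tx->upd_obj[j]->val = tx->old_val[j];
                tx->upd_obj[j]->locked_by = 0;
            }
            return false;
        }
    }
    for (int j = 0; j < tx->n_upds; j++) {
        tx->upd_obj[j]->ver++;                /* install new version */
        tx->upd_obj[j]->locked_by = 0;        /* unlock */
    }
    return true;
}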
22. Example: Contention Between Transactions
[Diagram: two objects, c1 (ver 100, val 10) and c2 (ver 200, val 40). Thread T1 runs "int t = 0; atomic { t += c1.val; t += c2.val; }". Thread T2 runs "atomic { t = c1.val; t++; c1.val = t; }". Both T1's log and T2's log start empty.]
23. Example: Contention Between Transactions
[Diagram as above; T1's log now contains "c1.ver = 100".]
T1 reads from c1 and logs that it saw version 100.
24. Example: Contention Between Transactions
[Diagram as above; T1's log: "c1.ver = 100"; T2's log: "c1.ver = 100".]
T2 also reads from c1 and logs that it saw version 100.
25. Example: Contention Between Transactions
[Diagram as above; T1's log: "c1.ver = 100, c2.ver = 200"; T2's log: "c1.ver = 100".]
Suppose T1 now reads from c2 and sees it at version 200.
26. Example: Contention Between Transactions
[Diagram: c1 is now locked by T2 (val still 10); T2's log: "c1.ver = 100; lock c1 @ 100".]
Before updating c1, thread T2 must lock it and record the old version number.
27. Example: Contention Between Transactions
[Diagram: c1 locked by T2, val now 11; T2's log: "c1.ver = 100; lock c1 @ 100; c1.val = 10".]
(1) Before updating c1.val, thread T2 must log the data it is going to overwrite.
(2) After logging the old value, T2 makes its update in place to c1.
28. Example: Contention Between Transactions
[Diagram: c1 now at ver 101, val 11, unlocked; T2's log: "c1.ver = 100; lock c1 @ 100; c1.val = 10".]
(1) Check that the version we locked matches the version we previously read.
(2) T2's transaction commits successfully: unlock the object, installing the new version number.
29. Example: Contention Between Transactions
[Diagram: c1 at ver 101, val 11; T1's log still records "c1.ver = 100".]
(1) T1 attempts to commit and checks that the versions it read are still up to date.
(2) Object c1 was updated from version 100 to 101, so T1's transaction is aborted and re-run.
30. Compiler Integration
- We expose decomposed log-writing operations in the compiler's internal intermediate code (no change to MSIL)
  - OpenForRead: before the first time we read from an object (e.g. c1 or c2 in the examples)
  - OpenForUpdate: before the first time we update an object
  - LogOldValue: before the first time we write to a given field

Source code:
  atomic { t = n.value; n = n.next; }
Basic intermediate code:
  OpenForRead(n); t = n.value; OpenForRead(n); n = n.next;
Optimized intermediate code:
  OpenForRead(n); t = n.value; n = n.next;
31. Runtime Integration: Garbage Collection
1. GC runs while some threads are inside atomic blocks.
[Diagram: a thread's transaction log holds entries such as "obj1.field = old value", "obj2.ver = 100", "obj3 locked @ 100".]
32. Runtime Integration: Garbage Collection
1. GC runs while some threads are inside atomic blocks.
2. GC visits the heap as normal, retaining objects that are needed if the blocks succeed.
33. Runtime Integration: Garbage Collection
1. GC runs while some threads are inside atomic blocks.
2. GC visits the heap as normal, retaining objects that are needed if the blocks succeed.
3. GC visits objects reachable from references overwritten in LogForUndo entries, retaining objects needed if any block rolls back.
34. Runtime Integration: Garbage Collection
1. GC runs while some threads are inside atomic blocks.
2. GC visits the heap as normal, retaining objects that are needed if the blocks succeed.
3. GC visits objects reachable from references overwritten in LogForUndo entries, retaining objects needed if any block rolls back.
4. Discard log entries for unreachable objects: they are dead whether or not the block succeeds.
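A minimal sketch of steps 3 and 4, assuming a mark-based collector with hypothetical is_marked/mark callbacks (this is not the actual .NET/Bartok GC interface, and a real collector would iterate until no new objects are marked):

typedef struct { void *obj; void *old_ref; } undo_entry_t;

void gc_scan_undo_log(undo_entry_t *log, int *n,
                      int (*is_marked)(void *), void (*mark)(void *)) {
    int kept = 0;
    for (int i = 0; i < *n; i++) {
        if (!is_marked(log[i].obj)) {
            /* Step 4: the object is unreachable, so it is dead whether or
             * not the block succeeds; drop its log entry. */
            continue;
        }
        /* Step 3: the overwritten reference is needed if the block rolls
         * back, so retain whatever it points to. */
        if (log[i].old_ref != NULL)
            mark(log[i].old_ref);
        log[kept++] = log[i];
    }
    *n = kept;
}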
35. Results Against Previous Work
[Bar chart: normalised execution time relative to a sequential baseline (1.00x). Coarse-grained locking: 1.13x; direct-update STM + compiler integration: 1.46x; direct-update STM: 2.04x; fine-grained locking: 2.57x; Harris & Fraser WSTM: 5.69x.]
Workload: operations on a red-black tree, 1 thread, 6:1:1 lookup:insert:delete mix with keys 0..65535.
Scalable to multicore.
36. Scalability (Micro-Benchmark)
[Graph: microseconds per operation vs. number of threads, comparing coarse-grained locking, fine-grained locking, WSTM (atomic blocks), DSTM (API), OSTM (API), and direct-update STM + compiler integration.]
37Results long running tests
10.8
73
162
Direct-update STM
Run-time filtering
Compile-time optimizations
Original application (no tx)
Normalised execution time
tree
go
skip
merge-sort
xlisp
38. Summary
- A pure software implementation can perform well and scale to large transactions
- Direct update
- Pessimistic (writes) + optimistic (reads) concurrency control
- Compiler support and optimizations
- Still need a better understanding of realistic workload distributions
39. Atomic Sections for C-like Languages
40. Intro
- "Autolocker: Synchronization Inference for Atomic Sections", Bill McCloskey, Feng Zhou, David Gay, Eric Brewer, POPL 2006
- Question: Can we have language support for atomic sections in languages like C?
  - No object metadata
  - No type safety
  - No garbage collection
- Answer: Let the programmer help a bit (with annotations)
  - And do something simpler: NO aborts!
41. Autolocker: C Atomic Sections
- Shared data is protected by annotated locks
- Threads access shared data in atomic sections
- Threads never deadlock (due to Autolocker)
- Threads never race for protected data
- How can we implement these semantics?

mutex m;
int shared_var protected_by(m);

atomic {
  ... x = shared_var ...
}

Code runs as if a single lock protected all atomic sections.
42. Autolocker Transformation
- Autolocker is a source-to-source transformation

C code:
  mutex m1, m2;
  int x protected_by(m1);
  int y protected_by(m2);

  atomic {
    x = 3;
    y = 2;
  }

Autolocker output:
  int m1, m2;
  int x;
  int y;

  begin_atomic();
  acquire(m1);
  x = 3;
  acquire(m2);
  y = 2;
  end_atomic();
43. Autolocker Transformation
- Autolocker is a source-to-source transformation (same C code and Autolocker output as above)
Atomic sections can be nested arbitrarily; the nesting level is tracked at runtime.
44. Autolocker Transformation
- Autolocker is a source-to-source transformation (same example as above)
Locks are acquired as needed; lock acquisitions are reentrant.
45. Autolocker Transformation
- Autolocker is a source-to-source transformation (same example as above)
Locks are released when the outermost section ends; strict two-phase locking guarantees atomicity.
46. Autolocker Transformation
- Autolocker is a source-to-source transformation (same example as above)
Locks are acquired in a global order; acquiring locks in order will never lead to deadlock.
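A minimal sketch of how the generated begin_atomic/acquire/end_atomic calls could behave at runtime, assuming pthread mutexes and per-thread state; the data structures and names below are illustrative assumptions, not Autolocker's actual runtime:

#include <pthread.h>

#define MAX_HELD 32

/* Hypothetical per-thread state for the generated runtime calls. */
static __thread int              nesting = 0;        /* atomic-section depth */
static __thread pthread_mutex_t *held[MAX_HELD];     /* locks taken so far   */
static __thread int              n_held = 0;

void begin_atomic(void) { nesting++; }

/* Reentrant acquire: skip locks this thread already holds. The compiler
 * emits acquire() calls in the global lock order, which avoids deadlock. */
void acquire(pthread_mutex_t *m) {
    for (int i = 0; i < n_held; i++)
        if (held[i] == m) return;
    pthread_mutex_lock(m);
    held[n_held++] = m;
}

/* Strict two-phase locking: release everything only when the
 * outermost atomic section ends. */
void end_atomic(void) {
    if (--nesting == 0) {
        for (int i = n_held - 1; i >= 0; i--)
            pthread_mutex_unlock(held[i]);
        n_held = 0;
    }
}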
47. Outline
- Introduction
- semantics
- usage model and benefits
- Autolocker algorithm
- match locks to data
- order lock acquisitions
- insert lock acquisitions
- Related work
- Experimental evaluation
- Conclusion
48. Autolocker Usage Model
- Typical process for writing threaded software
  - The Linux kernel evolved to SMP this way
- Autolocker helps you program in this style
[Diagram: threads accessing shared data through locks. Start here: one coarse-grained lock, lots of contention, little parallelism. Finish here: many fine-grained locks, low contention, high parallelism.]
49. Granularity and Autolocker
- In Autolocker, annotations control performance
- Simpler than fixing all uses of shared_var
- Changing annotations won't introduce bugs
  - no deadlocks
  - no new race conditions

int shared_var protected_by(kernel_lock);
int shared_var protected_by(fs->lock);
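To make the refinement step concrete, here is a small hypothetical example in the annotation style of the slides (the struct and lock names are illustrative, and the exact form of protected_by inside a struct is an assumption): moving from one global lock to per-filesystem locks changes only the annotations, while the code that touches the fields stays the same.

/* Coarse-grained: everything hangs off one global lock. */
mutex kernel_lock;
int open_files  protected_by(kernel_lock);
int dirty_pages protected_by(kernel_lock);

/* Finer-grained: each filesystem carries its own lock; only the
 * annotations change. */
struct fs {
    mutex lock;
    int open_files  protected_by(lock);
    int dirty_pages protected_by(lock);
};

void touch(struct fs *fs) {
    atomic {
        fs->open_files++;     /* Autolocker inserts acquire(fs->lock) here */
        fs->dirty_pages++;
    }
}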
50. Outline
- Introduction
- semantics
- usage model and benefits
- Autolocker algorithm
- match locks to data
- order lock acquisitions
- insert lock acquisitions
- Related work
- Experimental evaluation
- Conclusion
51. Algorithm Summary
[Flowchart: a C program with atomic sections goes through "match locks to data", producing lock requirements; "generate global lock order" then either fails and reports a potential deadlock (if the constraints are cyclic) or produces a lock order; "insert ordered lock acquisitions" and "remove redundant acquisitions" then yield a C program with acquire statements.]
52. Algorithm Summary
[Same flowchart as above.]
53. Matching Locks to Data
[Diagram: the C program with atomic sections contains functions foo, bar, qux, baz. For example:]

void foo() {
  atomic {
    /* use m2 */ y = 3;
    /* use m1 */ x = 2;
  }
}

"use L" means the lock L must already be held when this statement is reached.

Symbol table:
  mutex m1, m2;
  int x protected_by(m1);
  int y protected_by(m2);
54. Algorithm Summary
[Same flowchart as above.]
55. Acquisition Placement
- Assume there is an acyclic order < on locks
- We need to turn "use" annotations into acquires

void foo() {
  atomic {
    /* use m2 */ y = 3;
    /* use m1 */ x = 2;
  }
}

Global lock order: m1 < m2. We must acquire m2 ...
56. Acquisition Placement
- Assume there is an acyclic order < on locks
- We need to turn "use" annotations into acquires
(Same code as above; global lock order: m1 < m2.)
We must acquire m2 ... but since m1 is needed later and is ordered first ...
57. Acquisition Placement
- Assume there is an acyclic order < on locks
- We need to turn "use" annotations into acquires

void foo() {
  atomic {
    acquire m1;
    acquire m2;
    y = 3;
    /* use m1 */ x = 2;
  }
}

Global lock order: m1 < m2. We must acquire m2 ... but since m1 is needed later and is ordered first, we acquire m1 first.
58. Acquisition Placement
- Assume there is an acyclic order < on locks
- We need to turn "use" annotations into acquires
(Same transformed code as above; global lock order: m1 < m2.)
- Preemptive acquires: preemptively take locks that
  - are used later
  - but ordered before
59. Acquisition Placement
- Assume there is an acyclic order < on locks
- We need to turn "use" annotations into acquires

void foo() {
  atomic {
    acquire m1;
    acquire m2;
    y = 3;
    acquire m1;   /* redundant */
    x = 2;
  }
}

Global lock order: m1 < m2. Eventually we will optimize away this last acquisition.
60. Algorithm Summary
[Same flowchart as above.]
61. What Is a Good Lock Order?
- Some lock orders break the insertion algorithm!
- We say that p->m < m is not a feasible order
- So how do we find a feasible lock order?

Assume p->m < m. The insertion algorithm turns

  use m;  p = p->next;  use p->m;

into

  acquire p->m;  acquire m;  p = p->next;  acquire p->m;

Because of the assignment, the final p->m names a lock that has never been acquired before, and it is acquired after m. So if p->m < m, then m cannot be acquired before any lock called p->m.
62. Finding a Feasible Lock Order
- We want only feasible orders
- We search for any code matching this pattern:

  use e1
  maykill e2      // anything that may affect e2's value (like p = p->next when e2 = p->m)
  use e2

- Any feasible order must ensure e1 < e2
  - e1 cannot be acquired after it is used
  - e2 cannot be acquired before the update: its value is different above it
- Scan through all the code to find these constraints
63. Finding a Feasible Lock Order
[Diagram: each function (foo, bar, qux, baz) contributes constraints such as m1 < p->m, m1 < r->m, m3 < p->m, q->m < p->m. Search for infeasible patterns, then topologically sort the constraints to produce a global lock order, e.g. m1, r->m, q->m, p->m, ...]
64. Algorithm Summary
[Same flowchart as above.]
65. Outline
- Introduction
- semantics
- usage model and benefits
- Autolocker algorithm
- computing protections
- lock ordering
- acquisition placement
- Related work
- Experimental evaluation
- Conclusion
66. Comparison
- Transactional memory with optimistic concurrency control
  - Threads work locally and commit when done
  - If two threads conflict, one rolls back and restarts
  - Benefit: no complex static analysis
  - Drawbacks
    - software versions: lots of copying, can be slow
    - hardware versions: need new hardware
    - both: some operations cannot be rolled back (e.g., firing a missile)
- How does this compare with Autolocker?
67. Experimental Questions
- Question 1
- What is the performance cost of using Autolocker?
- How does it compare to other concurrency models?
68. Concurrent Hash Table
- Simple microbenchmark
- Goal: same performance as a hand-optimized version
  - low overhead
  - high scalability
- Compared Autolocker to
  - manual locking
  - Fraser's object-based transactional memory manager
  - Ennals' revised transactional memory manager
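For a sense of what such a benchmark looks like under Autolocker, here is a hypothetical per-bucket-locked hash table written in the annotation style of the earlier slides. The structure, names and annotation placement are illustrative assumptions; the actual benchmark code is not shown in the talk.

#define NBUCKETS 256

struct entry { int key, value; struct entry *next; };

struct bucket {
    mutex lock;
    struct entry *head protected_by(lock);   /* per-bucket lock annotation */
};

static struct bucket table[NBUCKETS];

int lookup(int key, int *value_out) {
    struct bucket *b = &table[key % NBUCKETS];
    int found = 0;
    atomic {                       /* Autolocker acquires b->lock as needed */
        struct entry *e;
        for (e = b->head; e != NULL; e = e->next)
            if (e->key == key) { *value_out = e->value; found = 1; break; }
    }
    return found;
}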
69. Hash Table Benchmark
Machine: four processors, 1.9 GHz, HyperThreading, 1 GB RAM. Each data point is the average of 4 runs after 1 warm-up run.
70. Experimental Questions
- Question 1
- What is the performance cost of using Autolocker?
- How does it compare to other concurrency models?
- Question 2
- Autolocker may reject code for potential
deadlocks - Does it accept enough programs to be useful?
- Will it only accept easy, coarse-grained policies?
71. AOLserver
- Threaded web server using manual locking
- Goal: preserve the original locking policy with Autolocker

Size: 52,589 lines; 82 modules
Changes: 143 atomic sections added; 126 types annotated with protections
Problems: 175 trusted casts (23 worrisome); 78/82 modules kept their original locking policies
Performance: negligible impact (3%)
72. Conclusion
- Contributions
- a new model for programming parallel systems
- a transformation tool to implement it
- Benefits
- performance close to well-written manual locking
- freedom from deadlocks
- freedom from races on protected data
- very good performance when ops/sync is high
- increase performance without increasing bugs!
73. References
- "Optimizing Memory Transactions", Tim Harris, Mark Plesko, Avraham Shinnar, David Tarditi, PLDI '06
- "Autolocker: Synchronization Inference for Atomic Sections", Bill McCloskey, Feng Zhou, David Gay, Eric Brewer, POPL '06
- "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software", Herb Sutter, Dr. Dobb's Journal, 2005
- "Unlocking Concurrency", Ali-Reza Adl-Tabatabai et al., ACM Queue, 2006
- "Transactional Memory", James Larus, Ravi Rajwar, Morgan Kaufmann, 2006 (book)
- Contact me: zf@rd.netease.com
- http://zhoufeng.net
74. Backup Slides
75. A Lot of Questions to Answer
From "The Landscape of Parallel Computing Research: A View from Berkeley"