Title: Programming Multi-Core Systems
1. Programming Multi-Core Systems
- Feng Zhou
- 2007-12-28
- http://zhoufeng.net
2. Last Week
- A trend in software: the movement to safe languages
- Improving dependability of system software without using safe languages
3. A Trend in Hardware
- The movement to multi-core processors
- Originates from the inability to keep increasing processor clock rates
- Profound impact on architecture, OS and applications
Topic today: How do we program multi-core systems effectively?
4. Why Multi-Core?
- Sources of CPU performance improvement over the last 30 years:
  1. Clock speed
  2. Execution optimization (cycles per instruction)
  3. Cache
- All three are concurrency-agnostic
- Now (1) has disappeared, (2) is slowing down, (3) is still good
  - (1) No 4 GHz Intel CPUs
  - (2) Improves with new micro-architectures (e.g. Core vs. NetBurst)
5. CPU Speed History
From Computer Architecture: A Quantitative Approach, 4th edition, 2007
6. Reality Today
- Near-term performance drivers:
  1. Multicore
  2. Hardware threads (a.k.a. Simultaneous Multi-Threading)
  3. Cache
- (1) and (2) are useful only when software is concurrent
"Concurrency is the next revolution in how we write software" --- The Free Lunch is Over
7. Costs/Problems of Concurrency
- Overhead of locks, message passing
- Not all programs are parallelizable
- Programming concurrently is HARD
  - Complex concepts: mutexes, read-write locks, queues
  - Correct synchronization: races, deadlocks
  - Getting speed-up
Our focus today
8. Potential Multi-Core Apps
- Server apps w/o shared state: Apache web server
- Server apps with shared state: MMORPG game server
- Stream-sort data processing: MapReduce, Yahoo Pig
- Scientific computing (many different models): BLAS, Monte Carlo, N-Body
- Machine learning: HMM, EM algorithm
- Graphics and games: NVIDIA Cg, GPU computing
9. Roadmap
- Overview of multi-core computing
- Overview of transactional programming
- Case: transactions for safe/managed languages
- Case: transactions for languages like C
10. Status Quo in Synchronization
- Current mechanism: manual locking
  - Organization: a lock for each shared structure
  - Usage: (block) -> acquire -> access -> release
- Correctness issues
  - Under-locking -> data races
  - Acquiring locks in different orders -> deadlock
- Performance issues
  - Difficult to find the right granularity
  - Overhead of acquiring vs. allowed concurrency
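As a concrete illustration of the acquire -> access -> release pattern and its pitfalls, here is a minimal sketch using POSIX threads; the account structure and function names are hypothetical, not from the slides:

#include <pthread.h>

/* Hypothetical shared structure, protected by its own lock. */
typedef struct {
    pthread_mutex_t lock;
    int balance;
} account_t;

/* Classic acquire -> access -> release. */
void deposit(account_t *a, int amount) {
    pthread_mutex_lock(&a->lock);   /* acquire */
    a->balance += amount;           /* access  */
    pthread_mutex_unlock(&a->lock); /* release */
}

/* Needs two locks. If another code path locks (to, from) instead of
 * (from, to), two threads can deadlock; forgetting a lock entirely
 * ("under-locking") silently introduces a data race instead. */
void transfer(account_t *from, account_t *to, int amount) {
    pthread_mutex_lock(&from->lock);
    pthread_mutex_lock(&to->lock);
    from->balance -= amount;
    to->balance   += amount;
    pthread_mutex_unlock(&to->lock);
    pthread_mutex_unlock(&from->lock);
}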
11. Transactions / Atomic Sections
- Databases have provided automatic concurrency control for 30 years: ACID transactions
- Vision
  - Atomicity
  - Isolation
  - Serialization only on conflicts
  - (optional) Rollback/abort support
Question: Is it possible to provide database transaction semantics for general programming?
12. Transactions vs. Manual Locks
- Manual locking issues
  - Under-locking
  - Acquiring locks in different orders
  - Blocking
  - Conservative serialization
- How transactions help
  - No explicit locks
  - No ordering
  - Transactions can be cancelled
  - Serialization only on conflicts
Transactions: potentially simpler and more efficient
13. Design Space
- Hardware transactional memory (HTM) vs. software TM (STM)
- Granularity: object, word, block
- Update method
  - Deferred: discard the private copy on abort
  - Direct: control access to data, undo updates on abort
- Concurrency control
  - Pessimistic: prevent conflicts by locking
  - Optimistic: assume no conflict and retry if there is one
14. Hard Issues
- Communication, or side effects
  - File I/O
  - Database accesses
- Interaction with other abstractions
  - Garbage collection
  - Virtual memory
- Working with existing languages, synchronization primitives, ...
15. Why Software TM?
- More flexible
- Easier to modify and evolve
- Integrates better with existing systems and languages
- Not limited to fixed-size hardware structures, e.g. caches
16. Proposed Language Support

// Insert into a doubly-linked list atomically
atomic {
  newNode->prev = node;
  newNode->next = node->next;
  node->next->prev = newNode;
  node->next = newNode;
}

// Conditional atomic block
atomic (queueSize > 0) {
  // remove item from queue and use it
}
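For intuition only, here is a minimal sketch of how a compiler or preprocessor might lower such an atomic block onto a word-based STM runtime with a retry loop. The stm_* functions are hypothetical placeholders, not the API of the systems discussed later:

struct node { struct node *prev, *next; };

/* Hypothetical STM runtime interface (illustrative only). */
void  stm_start(void);
void *stm_read(void **addr);            /* record addr in the read set  */
void  stm_write(void **addr, void *v);  /* record addr in the write set */
int   stm_commit(void);                 /* 1 on success, 0 on conflict  */

/* The atomic list-insert block above could be lowered to: */
void insert_after(struct node *node, struct node *newNode) {
    do {
        stm_start();
        stm_write((void **)&newNode->prev, node);
        struct node *next = stm_read((void **)&node->next);
        stm_write((void **)&newNode->next, next);
        stm_write((void **)&next->prev, newNode);
        stm_write((void **)&node->next, newNode);
    } while (!stm_commit());   /* on conflict: roll back and re-execute */
}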
17. Transactions for Managed/Safe Languages
18. Intro
- "Optimizing Memory Transactions", Harris et al., PLDI '06
- STM system from MSR Cambridge, for MSIL (.NET)
- Design features
  - Object-level granularity
  - Direct update
  - Version numbers to track reads
  - Two-phase locking to track writes
  - Compiler optimizations
19. First-Generation STM
- Very expensive, e.g. Harris & Fraser (OOPSLA '03)
- Every load and store instruction logs information into a thread-local log
  - A store instruction writes only the log
  - A load instruction consults the log first
- At the end of the block, validate the log and atomically commit it to shared memory
20. Direct-Update STM
- Augment objects with (i) a lock and (ii) a version number
- Transactional write
  - Lock the object before writing to it (abort if another thread holds that lock)
  - Log the overwritten data: we need it to restore the heap in case of a retry, a transaction abort, or a conflict with a concurrent thread
  - Make the update in place to the object
21. Direct-Update STM (cont.)
- Transactional read
  - Log the object's version number
  - Read from the object itself
- Commit
  - Check the version numbers of the objects we've read
  - Increment the version numbers of the objects we've written, unlocking them
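A compact sketch of these per-object operations in C, assuming illustrative metadata fields (ver, locked_by) and a fixed-size transaction log; this is not the paper's actual data layout, and it ignores the compare-and-swap operations a real concurrent implementation needs:

#include <stdbool.h>

typedef struct { int ver; int locked_by; int val; } obj_t;   /* assumed header */

#define MAX_ENTRIES 64
typedef struct {
    int id;
    obj_t *read_obj[MAX_ENTRIES]; int read_ver[MAX_ENTRIES]; int n_reads;
    obj_t *upd_obj[MAX_ENTRIES];  int old_val[MAX_ENTRIES];  int n_upds;
} tx_t;

/* Transactional read: log the version we saw, then read in place. */
static int tx_read(tx_t *tx, obj_t *o) {
    tx->read_obj[tx->n_reads] = o;
    tx->read_ver[tx->n_reads++] = o->ver;
    return o->val;
}

/* Transactional write: lock, log the overwritten value, update in place. */
static bool tx_write(tx_t *tx, obj_t *o, int v) {
    if (o->locked_by != 0 && o->locked_by != tx->id) return false;  /* abort */
    o->locked_by = tx->id;
    tx->upd_obj[tx->n_upds] = o;
    tx->old_val[tx->n_upds++] = o->val;
    o->val = v;                               /* direct (in-place) update */
    return true;
}

/* Commit: validate the read log, then bump versions and unlock.
 * On failure, restore the old values from the undo log. */
static bool tx_commit(tx_t *tx) {
    for (int i = 0; i < tx->n_reads; i++) {
        obj_t *o = tx->read_obj[i];
        bool locked_by_other = (o->locked_by != 0 && o->locked_by != tx->id);
        if (o->ver != tx->read_ver[i] || locked_by_other) {   /* conflict */
            for (int j = tx->n_upds - 1; j >= 0; j--) {       /* roll back */
                tx->upd_obj[j]->val = tx->old_val[j];
                tx->upd_obj[j]->locked_by = 0;
            }
            return false;
        }
    }
    for (int j = 0; j < tx->n_upds; j++) {
        tx->upd_obj[j]->ver++;                /* install new version */
        tx->upd_obj[j]->locked_by = 0;        /* unlock */
    }
    return true;
}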
22. Example: Contention Between Transactions
[Diagram: two objects, c1 (ver 100, val 10) and c2 (ver 200, val 40). Thread T1 runs "int t = 0; atomic { t += c1.val; t += c2.val; }". Thread T2 runs "atomic { t = c1.val; t++; c1.val = t; }". Both T1's log and T2's log start empty.]
23. Example: Contention Between Transactions
[Diagram as above; T1's log now contains "c1.ver = 100".]
T1 reads from c1 and logs that it saw version 100.
24. Example: Contention Between Transactions
[Diagram as above; T1's log: "c1.ver = 100"; T2's log: "c1.ver = 100".]
T2 also reads from c1 and logs that it saw version 100.
25. Example: Contention Between Transactions
[Diagram as above; T1's log: "c1.ver = 100, c2.ver = 200"; T2's log: "c1.ver = 100".]
Suppose T1 now reads from c2 and sees it at version 200.
26. Example: Contention Between Transactions
[Diagram: c1 is now locked by T2 (val still 10); T2's log: "c1.ver = 100; lock c1 @ 100".]
Before updating c1, thread T2 must lock it and record the old version number.
27. Example: Contention Between Transactions
[Diagram: c1 locked by T2, val now 11; T2's log: "c1.ver = 100; lock c1 @ 100; c1.val = 10".]
(1) Before updating c1.val, thread T2 must log the data it is going to overwrite.
(2) After logging the old value, T2 makes its update in place to c1.
28. Example: Contention Between Transactions
[Diagram: c1 now at ver 101, val 11, unlocked; T2's log: "c1.ver = 100; lock c1 @ 100; c1.val = 10".]
(1) Check that the version we locked matches the version we previously read.
(2) T2's transaction commits successfully: unlock the object, installing the new version number.
29. Example: Contention Between Transactions
[Diagram: c1 at ver 101, val 11; T1's log still records "c1.ver = 100".]
(1) T1 attempts to commit and checks that the versions it read are still up to date.
(2) Object c1 was updated from version 100 to 101, so T1's transaction is aborted and re-run.
30. Compiler Integration
- We expose decomposed log-writing operations in the compiler's internal intermediate code (no change to MSIL)
  - OpenForRead: before the first time we read from an object (e.g. c1 or c2 in the examples)
  - OpenForUpdate: before the first time we update an object
  - LogOldValue: before the first time we write to a given field

Source code:
  atomic { t = n.value; n = n.next; }
Basic intermediate code:
  OpenForRead(n); t = n.value; OpenForRead(n); n = n.next;
Optimized intermediate code:
  OpenForRead(n); t = n.value; n = n.next;
31. Runtime Integration: Garbage Collection
1. GC runs while some threads are inside atomic blocks.
[Diagram: a thread's transaction log holds entries such as "obj1.field = old value", "obj2.ver = 100", "obj3 locked @ 100".]
32. Runtime Integration: Garbage Collection
1. GC runs while some threads are inside atomic blocks.
2. GC visits the heap as normal, retaining objects that are needed if the blocks succeed.
33. Runtime Integration: Garbage Collection
1. GC runs while some threads are inside atomic blocks.
2. GC visits the heap as normal, retaining objects that are needed if the blocks succeed.
3. GC visits objects reachable from references overwritten in LogForUndo entries, retaining objects needed if any block rolls back.
34. Runtime Integration: Garbage Collection
1. GC runs while some threads are inside atomic blocks.
2. GC visits the heap as normal, retaining objects that are needed if the blocks succeed.
3. GC visits objects reachable from references overwritten in LogForUndo entries, retaining objects needed if any block rolls back.
4. Discard log entries for unreachable objects: they are dead whether or not the block succeeds.
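A minimal sketch of steps 3 and 4, assuming a mark-based collector with hypothetical is_marked/mark callbacks (this is not the actual .NET/Bartok GC interface, and a real collector would iterate until no new objects are marked):

typedef struct { void *obj; void *old_ref; } undo_entry_t;

void gc_scan_undo_log(undo_entry_t *log, int *n,
                      int (*is_marked)(void *), void (*mark)(void *)) {
    int kept = 0;
    for (int i = 0; i < *n; i++) {
        if (!is_marked(log[i].obj)) {
            /* Step 4: the object is unreachable, so it is dead whether or
             * not the block succeeds; drop its log entry. */
            continue;
        }
        /* Step 3: the overwritten reference is needed if the block rolls
         * back, so retain whatever it points to. */
        if (log[i].old_ref != NULL)
            mark(log[i].old_ref);
        log[kept++] = log[i];
    }
    *n = kept;
}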
35. Results Against Previous Work
[Bar chart: normalised execution time relative to a sequential baseline (1.00x). Coarse-grained locking: 1.13x; direct-update STM + compiler integration: 1.46x; direct-update STM: 2.04x; fine-grained locking: 2.57x; Harris & Fraser WSTM: 5.69x.]
Workload: operations on a red-black tree, 1 thread, 6:1:1 lookup:insert:delete mix with keys 0..65535.
Scalable to multicore.
36. Scalability (Micro-Benchmark)
[Graph: microseconds per operation vs. number of threads, comparing coarse-grained locking, fine-grained locking, WSTM (atomic blocks), DSTM (API), OSTM (API), and direct-update STM + compiler integration.]
37Results long running tests
10.8
73
162
Direct-update STM
Run-time filtering
Compile-time optimizations
Original application (no tx)
Normalised execution time
tree
go
skip
merge-sort
xlisp
38. Summary
- A pure software implementation can perform well and scale to large transactions
- Direct update
- Pessimistic (writes) + optimistic (reads) concurrency control
- Compiler support and optimizations
- Still need a better understanding of realistic workload distributions
39. Atomic Sections for C-like Languages
40. Intro
- "Autolocker: Synchronization Inference for Atomic Sections", Bill McCloskey, Feng Zhou, David Gay, Eric Brewer, POPL 2006
- Question: Can we have language support for atomic sections in languages like C?
  - No object metadata
  - No type safety
  - No garbage collection
- Answer: Let the programmer help a bit (with annotations)
  - And do something simpler: NO aborts!
41. Autolocker: C Atomic Sections
- Shared data is protected by annotated locks
- Threads access shared data in atomic sections
- Threads never deadlock (due to Autolocker)
- Threads never race for protected data
- How can we implement these semantics?

mutex m;
int shared_var protected_by(m);

atomic {
  ... x = shared_var ...
}

Code runs as if a single lock protected all atomic sections.
42. Autolocker Transformation
- Autolocker is a source-to-source transformation

C code:
  mutex m1, m2;
  int x protected_by(m1);
  int y protected_by(m2);

  atomic {
    x = 3;
    y = 2;
  }

Autolocker output:
  int m1, m2;
  int x;
  int y;

  begin_atomic();
  acquire(m1);
  x = 3;
  acquire(m2);
  y = 2;
  end_atomic();
43. Autolocker Transformation
- Autolocker is a source-to-source transformation (same C code and Autolocker output as above)
Atomic sections can be nested arbitrarily; the nesting level is tracked at runtime.
44. Autolocker Transformation
- Autolocker is a source-to-source transformation (same example as above)
Locks are acquired as needed; lock acquisitions are reentrant.
45. Autolocker Transformation
- Autolocker is a source-to-source transformation (same example as above)
Locks are released when the outermost section ends; strict two-phase locking guarantees atomicity.
46. Autolocker Transformation
- Autolocker is a source-to-source transformation (same example as above)
Locks are acquired in a global order; acquiring locks in order will never lead to deadlock.
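A minimal sketch of how the generated begin_atomic/acquire/end_atomic calls could behave at runtime, assuming pthread mutexes and per-thread state; the data structures and names below are illustrative assumptions, not Autolocker's actual runtime:

#include <pthread.h>

#define MAX_HELD 32

/* Hypothetical per-thread state for the generated runtime calls. */
static __thread int              nesting = 0;        /* atomic-section depth */
static __thread pthread_mutex_t *held[MAX_HELD];     /* locks taken so far   */
static __thread int              n_held = 0;

void begin_atomic(void) { nesting++; }

/* Reentrant acquire: skip locks this thread already holds. The compiler
 * emits acquire() calls in the global lock order, which avoids deadlock. */
void acquire(pthread_mutex_t *m) {
    for (int i = 0; i < n_held; i++)
        if (held[i] == m) return;
    pthread_mutex_lock(m);
    held[n_held++] = m;
}

/* Strict two-phase locking: release everything only when the
 * outermost atomic section ends. */
void end_atomic(void) {
    if (--nesting == 0) {
        for (int i = n_held - 1; i >= 0; i--)
            pthread_mutex_unlock(held[i]);
        n_held = 0;
    }
}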
47. Outline
- Introduction
- semantics
- usage model and benefits
- Autolocker algorithm
- match locks to data
- order lock acquisitions
- insert lock acquisitions
- Related work
- Experimental evaluation
- Conclusion
48. Autolocker Usage Model
- Typical process for writing threaded software
  - The Linux kernel evolved to SMP this way
- Autolocker helps you program in this style
[Diagram: threads accessing shared data through locks. Start here: one coarse-grained lock, lots of contention, little parallelism. Finish here: many fine-grained locks, low contention, high parallelism.]
49. Granularity and Autolocker
- In Autolocker, annotations control performance
- Simpler than fixing all uses of shared_var
- Changing annotations won't introduce bugs
  - no deadlocks
  - no new race conditions

int shared_var protected_by(kernel_lock);
int shared_var protected_by(fs->lock);
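To make the refinement step concrete, here is a small hypothetical example in the annotation style of the slides (the struct and lock names are illustrative, and the exact form of protected_by inside a struct is an assumption): moving from one global lock to per-filesystem locks changes only the annotations, while the code that touches the fields stays the same.

/* Coarse-grained: everything hangs off one global lock. */
mutex kernel_lock;
int open_files  protected_by(kernel_lock);
int dirty_pages protected_by(kernel_lock);

/* Finer-grained: each filesystem carries its own lock; only the
 * annotations change. */
struct fs {
    mutex lock;
    int open_files  protected_by(lock);
    int dirty_pages protected_by(lock);
};

void touch(struct fs *fs) {
    atomic {
        fs->open_files++;     /* Autolocker inserts acquire(fs->lock) here */
        fs->dirty_pages++;
    }
}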
50. Outline
- Introduction
- semantics
- usage model and benefits
- Autolocker algorithm
- match locks to data
- order lock acquisitions
- insert lock acquisitions
- Related work
- Experimental evaluation
- Conclusion
51. Algorithm Summary
[Flowchart: a C program with atomic sections goes through "match locks to data", producing lock requirements; "generate global lock order" then either fails and reports a potential deadlock (if the constraints are cyclic) or produces a lock order; "insert ordered lock acquisitions" and "remove redundant acquisitions" then yield a C program with acquire statements.]
52. Algorithm Summary
[Same flowchart as above.]
53. Matching Locks to Data
[Diagram: the C program with atomic sections contains functions foo, bar, qux, baz. For example:]

void foo() {
  atomic {
    /* use m2 */ y = 3;
    /* use m1 */ x = 2;
  }
}

"use L" means the lock L must already be held when this statement is reached.

Symbol table:
  mutex m1, m2;
  int x protected_by(m1);
  int y protected_by(m2);
54. Algorithm Summary
[Same flowchart as above.]
55. Acquisition Placement
- Assume there is an acyclic order < on locks
- We need to turn "use" annotations into acquires

void foo() {
  atomic {
    /* use m2 */ y = 3;
    /* use m1 */ x = 2;
  }
}

Global lock order: m1 < m2. We must acquire m2 ...
56. Acquisition Placement
- Assume there is an acyclic order < on locks
- We need to turn "use" annotations into acquires
(Same code as above; global lock order: m1 < m2.)
We must acquire m2 ... but since m1 is needed later and is ordered first ...
57. Acquisition Placement
- Assume there is an acyclic order < on locks
- We need to turn "use" annotations into acquires

void foo() {
  atomic {
    acquire m1;
    acquire m2;
    y = 3;
    /* use m1 */ x = 2;
  }
}

Global lock order: m1 < m2. We must acquire m2 ... but since m1 is needed later and is ordered first, we acquire m1 first.
58. Acquisition Placement
- Assume there is an acyclic order < on locks
- We need to turn "use" annotations into acquires
(Same transformed code as above; global lock order: m1 < m2.)
- Preemptive acquires: preemptively take locks that
  - are used later
  - but ordered before
59. Acquisition Placement
- Assume there is an acyclic order < on locks
- We need to turn "use" annotations into acquires

void foo() {
  atomic {
    acquire m1;
    acquire m2;
    y = 3;
    acquire m1;   /* redundant */
    x = 2;
  }
}

Global lock order: m1 < m2. Eventually we will optimize away this last acquisition.
60. Algorithm Summary
[Same flowchart as above.]
61. What Is a Good Lock Order?
- Some lock orders break the insertion algorithm!
- We say that p->m < m is not a feasible order
- So how do we find a feasible lock order?

Assume p->m < m. The insertion algorithm turns

  use m;  p = p->next;  use p->m;

into

  acquire p->m;  acquire m;  p = p->next;  acquire p->m;

Because of the assignment, the final p->m names a lock that has never been acquired before, and it is acquired after m. So if p->m < m, then m cannot be acquired before any lock called p->m.
62. Finding a Feasible Lock Order
- We want only feasible orders
- We search for any code matching this pattern:

  use e1
  maykill e2      // anything that may affect e2's value (like p = p->next when e2 = p->m)
  use e2

- Any feasible order must ensure e1 < e2
  - e1 cannot be acquired after it is used
  - e2 cannot be acquired before the update: its value is different above it
- Scan through all the code to find these constraints
63. Finding a Feasible Lock Order
[Diagram: each function (foo, bar, qux, baz) contributes constraints such as m1 < p->m, m1 < r->m, m3 < p->m, q->m < p->m. Search for infeasible patterns, then topologically sort the constraints to produce a global lock order, e.g. m1, r->m, q->m, p->m, ...]
64. Algorithm Summary
[Same flowchart as above.]
65. Outline
- Introduction
- semantics
- usage model and benefits
- Autolocker algorithm
- computing protections
- lock ordering
- acquisition placement
- Related work
- Experimental evaluation
- Conclusion
66. Comparison
- Transactional memory with optimistic concurrency control
  - Threads work locally and commit when done
  - If two threads conflict, one rolls back and restarts
  - Benefit: no complex static analysis
  - Drawbacks
    - software versions: lots of copying, can be slow
    - hardware versions: need new hardware
    - both: some operations cannot be rolled back (e.g., firing a missile)
- How does this compare with Autolocker?
67. Experimental Questions
- Question 1
- What is the performance cost of using Autolocker?
- How does it compare to other concurrency models?
68. Concurrent Hash Table
- Simple microbenchmark
- Goal: same performance as a hand-optimized version
  - low overhead
  - high scalability
- Compared Autolocker to
  - manual locking
  - Fraser's object-based transactional memory manager
  - Ennals' revised transactional memory manager
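For a sense of what such a benchmark looks like under Autolocker, here is a hypothetical per-bucket-locked hash table written in the annotation style of the earlier slides. The structure, names and annotation placement are illustrative assumptions; the actual benchmark code is not shown in the talk.

#define NBUCKETS 256

struct entry { int key, value; struct entry *next; };

struct bucket {
    mutex lock;
    struct entry *head protected_by(lock);   /* per-bucket lock annotation */
};

static struct bucket table[NBUCKETS];

int lookup(int key, int *value_out) {
    struct bucket *b = &table[key % NBUCKETS];
    int found = 0;
    atomic {                       /* Autolocker acquires b->lock as needed */
        struct entry *e;
        for (e = b->head; e != NULL; e = e->next)
            if (e->key == key) { *value_out = e->value; found = 1; break; }
    }
    return found;
}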
69. Hash Table Benchmark
Machine: four processors, 1.9 GHz, HyperThreading, 1 GB RAM. Each data point is the average of 4 runs after 1 warm-up run.
70. Experimental Questions
- Question 1
- What is the performance cost of using Autolocker?
- How does it compare to other concurrency models?
- Question 2
- Autolocker may reject code for potential
deadlocks - Does it accept enough programs to be useful?
- Will it only accept easy, coarse-grained policies?
71. AOLserver
- Threaded web server using manual locking
- Goal: preserve the original locking policy with Autolocker

Size: 52,589 lines; 82 modules
Changes: 143 atomic sections added; 126 types annotated with protections
Problems: 175 trusted casts (23 worrisome); 78/82 modules kept their original locking policies
Performance: negligible impact (3%)
72. Conclusion
- Contributions
- a new model for programming parallel systems
- a transformation tool to implement it
- Benefits
- performance close to well-written manual locking
- freedom from deadlocks
- freedom from races on protected data
- very good performance when ops/sync is high
- increase performance without increasing bugs!
73. References
- "Optimizing Memory Transactions", Tim Harris, Mark Plesko, Avraham Shinnar, David Tarditi, PLDI '06
- "Autolocker: Synchronization Inference for Atomic Sections", Bill McCloskey, Feng Zhou, David Gay, Eric Brewer, POPL '06
- "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software", Herb Sutter, Dr. Dobb's Journal, 2005
- "Unlocking Concurrency", Ali-Reza Adl-Tabatabai et al., ACM Queue, 2006
- "Transactional Memory", James Larus, Ravi Rajwar, Morgan Kaufmann, 2006 (book)
- Contact me: zf@rd.netease.com
- http://zhoufeng.net
74. Backup Slides
75. A Lot of Questions to Answer
From "The Landscape of Parallel Computing Research: A View from Berkeley"