1
Programming Multi-Core Systems
  • Feng Zhou
  • 2007-12-28
  • http://zhoufeng.net

2
Last Week
  • A trend in software: The Movement to Safe
    Languages
  • Improving dependability of system software
    without using safe languages

3
A Trend in Hardware
  • The Movement to Multi-core Processors
  • Originates from inability to increase processor
    clock rate
  • Profound impact on architecture, OS and
    applications

Topic today: How to program multi-core
systems effectively?
4
Why Multi-Core?
  • Three sources of CPU performance improvement over
    the last 30 years:
  • 1. Clock speed
  • 2. Execution optimization (Cycles-Per-Instruction)
  • 3. Cache
  • All 3 are concurrency-agnostic
  • Now: 1 has disappeared, 2 is slowing down, 3 still helps
  • 1: no 4 GHz Intel CPUs
  • 2: improves with new micro-architectures (e.g.
    Core vs. NetBurst)

5
CPU Speed History
From Computer Architecture: A Quantitative
Approach, 4th edition, 2007
6
Reality Today
  • Near-term performance drivers:
  • 1. Multicore
  • 2. Hardware threads (a.k.a. Simultaneous
    Multi-Threading)
  • 3. Cache
  • 1 and 2 are useful only when software is concurrent

"Concurrency is the next revolution in how we
write software" --- The
Free Lunch is Over
7
Costs/Problems of Concurrency
  • Overhead of locks, message passing
  • Not all programs are parallelizable
  • Programming concurrently is HARD
  • Complex concepts: mutex, read-write lock, queue
  • Correct synchronization: races, deadlocks
  • Getting speed-up

Our focus today
8
Potential Multi-Core Apps
Application category: examples
  • Server apps w/o shared state: Apache web server
  • Server apps with shared state: MMORPG game server
  • Stream-Sort data processing: MapReduce, Yahoo Pig
  • Scientific computing (many different models): BLAS, Monte Carlo, N-Body
  • Machine learning: HMM, EM algorithm
  • Graphics and games: NVIDIA Cg, GPU computing

9
Roadmap
  • Overview of multi-core computing
  • Overview of transactional programming
  • Case: Transactions for safe/managed languages
  • Case: Transactions for languages like C

10
Status Quo in Synchronization
  • Current mechanism: manual locking
  • Organization: a lock for each shared structure
  • Usage: (block) → acquire → access → release (see
    the sketch below)
  • Correctness issues
  • Under-locking → data races
  • Acquires in different orders → deadlock
  • Performance issues
  • Difficult to find the right granularity
  • Overhead of acquiring vs. allowed concurrency
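
To make the manual-locking pattern above concrete, here is a minimal pthreads sketch (the account type and the transfer function are illustrative, not from the slides): each shared structure carries its own lock, and the two locks are always taken in a fixed address-based order, which avoids the deadlock case listed above.

  #include <pthread.h>
  #include <stdint.h>

  /* Illustrative sketch, not from the slides: one lock per shared structure. */
  typedef struct {
      pthread_mutex_t lock;
      long balance;
  } account_t;

  /* Acquire -> access -> release, always taking locks in a fixed
   * (address-based) order so two concurrent transfers cannot deadlock. */
  void transfer(account_t *from, account_t *to, long amount) {
      account_t *first  = ((uintptr_t)from < (uintptr_t)to) ? from : to;
      account_t *second = (first == from) ? to : from;

      pthread_mutex_lock(&first->lock);
      pthread_mutex_lock(&second->lock);
      from->balance -= amount;
      to->balance   += amount;
      pthread_mutex_unlock(&second->lock);
      pthread_mutex_unlock(&first->lock);
  }

Even this tiny example shows the costs: the programmer must remember the lock per structure, the acquisition order, and the release points by hand.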

11
Transactions / Atomic Sections
  • Databases have provided automatic concurrency
    control for 30 years: ACID transactions
  • Vision
  • Atomicity
  • Isolation
  • Serialization only on conflicts
  • (optional) Rollback/abort support

Question: Is it possible to provide database
transaction semantics to general programming?
12
Transactions vs. Manual Locks
  • Manual locking issues
  • Under-locking
  • Acquires in different orders
  • Blocking
  • Conservative serialization
  • How transactions help
  • No explicit locks
  • No ordering
  • Can cancel transactions
  • Serialization only on conflicts

Transactions: potentially simpler and more
efficient
13
Design Space
  • Hardware Transactional Memory vs. software TM
  • Granularity: object, word, block
  • Update method (see the sketch below)
  • Deferred: discard the private copy on abort
  • Direct: control access to data, undo updates on
    abort
  • Concurrency control
  • Pessimistic: prevent conflicts by locking
  • Optimistic: assume no conflicts and retry if one
    occurs
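
A rough C sketch of the two update methods (the field names are illustrative, not from any particular TM system): a deferred-update transaction writes into a private shadow copy that is simply discarded on abort, while a direct-update transaction writes the shared object in place and keeps an undo record so the old value can be restored on abort.

  #include <stddef.h>

  /* Illustrative sketch only. */
  /* Deferred update: writes go to a private copy, published only on commit. */
  typedef struct {
      void *object;       /* the shared object */
      void *shadow_copy;  /* transaction-private copy; thrown away on abort */
  } deferred_entry_t;

  /* Direct update: writes go to the object itself; keep what was overwritten. */
  typedef struct {
      void  *object;      /* the shared object, already modified in place */
      size_t offset;      /* which field was overwritten */
      long   old_value;   /* written back if the transaction aborts */
  } undo_entry_t;

The trade-off: deferred update pays for copying on every transaction, direct update pays for undo logging only when a transaction actually aborts.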

14
Hard Issues
  • Communications, or side-effects
  • File I/O
  • Database accesses
  • Interaction with other abstractions
  • Garbage collection
  • Virtual memory
  • Working with existing languages, synchronization
    primitives, etc.

15
Why software TM?
  • More flexible
  • Easier to modify and evolve
  • Integrate better with existing systems and
    languages
  • Not limited to fixed-size hardware structures,
    e.g. caches

16
Proposed Language Support
// Insert into a doubly-linked list atomically
atomic {
  newNode->prev = node;
  newNode->next = node->next;
  node->next->prev = newNode;
  node->next = newNode;
}
  • Guard condition

atomic (queueSize > 0) {
  // remove item from queue and use it
}
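
The guard form blocks until the condition holds and then runs the body atomically. For comparison, a hand-coded pthreads equivalent of the queue example (queueSize and the queue operation are placeholders) needs a mutex plus a condition variable, which is roughly the boilerplate the atomic syntax is meant to hide:

  #include <pthread.h>

  /* Illustrative hand-coded equivalent, not generated by any TM system. */
  pthread_mutex_t q_lock     = PTHREAD_MUTEX_INITIALIZER;
  pthread_cond_t  q_nonempty = PTHREAD_COND_INITIALIZER;
  int queueSize;

  void consume_one(void) {
      pthread_mutex_lock(&q_lock);
      while (queueSize <= 0)                    /* guard: atomic (queueSize > 0) */
          pthread_cond_wait(&q_nonempty, &q_lock);
      /* remove item from queue and use it */
      queueSize--;
      pthread_mutex_unlock(&q_lock);
  }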
17
Transactions for Managed/Safe Languages
18
Intro
  • Optimizing Memory Transactions, Harris et al.,
    PLDI '06
  • STM system from MSR Cambridge, for MSIL (.NET)
  • Design features
  • Object-level granularity
  • Direct update
  • Version numbers to track reads
  • 2-phase locking to track writes
  • Compiler optimizations

19
First-gen STM
  • Very expensive (e.g. Harris & Fraser, OOPSLA '03)
  • Every load and store instruction logs information
    into a thread-local log
  • A store instruction writes the log only
  • A load instruction consults the log first
  • At the end of the block, validate the log and
    atomically commit it to shared memory (see the
    sketch below)
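
A toy C sketch of that first-generation scheme (the data layout and function names are assumptions, not the paper's API): stores go only into a thread-local log, loads search the log first so a transaction sees its own writes, and commit would validate the read entries before publishing the write entries.

  #include <stddef.h>

  /* Illustrative sketch only. */
  #define LOG_MAX 1024

  typedef struct { long *addr; long value; int is_write; } log_entry_t;

  static _Thread_local log_entry_t tx_log[LOG_MAX];  /* thread-local transaction log */
  static _Thread_local size_t      tx_len;

  /* A transactional store writes the log only, never shared memory. */
  void tx_write(long *addr, long value) {
      tx_log[tx_len++] = (log_entry_t){ addr, value, 1 };
  }

  /* A transactional load consults the log first (newest entry wins). */
  long tx_read(long *addr) {
      for (size_t i = tx_len; i > 0; i--)
          if (tx_log[i - 1].addr == addr && tx_log[i - 1].is_write)
              return tx_log[i - 1].value;
      long v = *addr;
      tx_log[tx_len++] = (log_entry_t){ addr, v, 0 };  /* remember the read for validation */
      return v;
  }

  /* At the end of the block: validate the read entries and, if still
   * consistent, atomically commit the write entries to shared memory. */

The per-access search and logging is exactly the overhead that makes this first-generation approach expensive.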

20
Direct Update STM
  • Augment objects with (i) a lock, (ii) a version
    number
  • Transactional write
  • Lock objects before they are written to (abort if
    another thread has that lock)
  • Log the overwritten data: we need it to restore
    the heap in case of retry, transaction abort, or a
    conflict with a concurrent thread
  • Make the update in place to the object

21
  • Transactional read
  • Log the object's version number
  • Read from the object itself
  • Commit (see the sketch below)
  • Check the version numbers of objects we've read
  • Increment the version numbers of objects we've
    written, unlocking them
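
Putting slides 20-21 together, here is a simplified C sketch of the direct-update design (all structure and function names are illustrative; the real system operates on .NET objects and takes the write lock with an atomic compare-and-swap, which this sketch glosses over):

  #include <stdbool.h>
  #include <stddef.h>

  /* Illustrative sketch only. */
  typedef struct object {
      int version;              /* bumped on every committed update */
      struct tx *owner;         /* NULL, or the transaction holding the write lock */
      /* ...data fields; overwritten values go in an undo log, not shown... */
  } object_t;

  typedef struct { object_t *obj; int seen_version; } read_entry_t;

  typedef struct tx {
      read_entry_t reads[64];  size_t nreads;
      object_t    *writes[64]; size_t nwrites;
  } tx_t;

  /* Transactional read: log the version, then read the object directly. */
  void open_for_read(tx_t *tx, object_t *o) {
      tx->reads[tx->nreads++] = (read_entry_t){ o, o->version };
  }

  /* Transactional write: lock the object first (abort on conflict),
   * log the overwritten data (not shown), then update in place. */
  bool open_for_update(tx_t *tx, object_t *o) {
      if (o->owner != NULL && o->owner != tx) return false;  /* conflict: abort */
      o->owner = tx;
      tx->writes[tx->nwrites++] = o;
      return true;
  }

  /* Commit: check read versions, then install new versions and unlock. */
  bool tx_commit(tx_t *tx) {
      for (size_t i = 0; i < tx->nreads; i++)
          if (tx->reads[i].obj->version != tx->reads[i].seen_version)
              return false;                     /* someone committed under us */
      for (size_t i = 0; i < tx->nwrites; i++) {
          tx->writes[i]->version++;
          tx->writes[i]->owner = NULL;
      }
      return true;
  }

The contention example on the following slides is exactly this protocol in action.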

22
Example: contention between transactions
Thread T1: int t = 0; atomic { t += c1.val; t += c2.val; }
Thread T2: atomic { t = c1.val; c1.val = t + 1; }
[Diagram: objects c1 (ver 100, val 10) and c2 (ver 200, val 40); T1's log
and T2's log are both empty]
23
Example: contention between transactions
[Diagram: c1 (ver 100, val 10), c2 (ver 200, val 40); T1's log: c1.ver=100;
T2's log: empty]
T1 reads from c1, logging that it saw version 100
24
Example: contention between transactions
[Diagram: T1's log: c1.ver=100; T2's log: c1.ver=100]
T2 also reads from c1, logging that it saw version 100
25
Example: contention between transactions
[Diagram: T1's log: c1.ver=100, c2.ver=200; T2's log: c1.ver=100]
Suppose T1 now reads from c2, seeing it at version 200
26
Example: contention between transactions
[Diagram: c1 is now locked by T2; T2's log: c1.ver=100, lock(c1, 100);
T1's log unchanged]
Before updating c1, thread T2 must lock it and
record the old version number
27
Example: contention between transactions
[Diagram: c1 is locked by T2 and its val is now 11; T2's log: c1.ver=100,
lock(c1, 100), c1.val=10; T1's log unchanged]
(1) Before updating c1.val, thread T2 must log
the data it is going to overwrite
(2) After logging the old value, T2 makes its
update in place to c1
28
Example: contention between transactions
[Diagram: c1 is unlocked again, now at ver 101, val 11; T2's log: c1.ver=100,
lock(c1, 100), c1.val=10; T1's log unchanged]
(1) Check that the version we locked matches the
version we previously read
(2) T2's transaction commits successfully:
unlock the object, installing the new version
number
29
Example: contention between transactions
[Diagram: c1 at ver 101, val 11; T1's log: c1.ver=100, c2.ver=200]
(1) T1 attempts to commit: check that the versions
it read are still up-to-date
(2) Object c1 was updated from version 100 to
101, so T1's transaction is aborted and re-run
30
Compiler integration
  • We expose decomposed log-writing operations in
    the compiler's internal intermediate code (no
    change to MSIL)
  • OpenForRead: before the first time we read from
    an object (e.g. c1 or c2 in the examples)
  • OpenForUpdate: before the first time we update
    an object
  • LogOldValue: before the first time we write to a
    given field

Source code:
  atomic { t = n.value; n = n.next; }
Basic intermediate code:
  OpenForRead(n); t = n.value; OpenForRead(n); n = n.next;
Optimized intermediate code:
  OpenForRead(n); t = n.value; n = n.next;
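
Slide 37 also mentions run-time filtering of these operations. One plausible way to do it (a sketch, not the paper's actual mechanism) is a small per-transaction table recording which objects have already been opened, so repeated OpenForRead calls on the same object become cheap no-ops:

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  /* Illustrative sketch only: per-transaction filter of already-opened objects. */
  #define FILTER_SLOTS 256

  typedef struct { const void *opened[FILTER_SLOTS]; } tx_filter_t;

  static bool already_opened(tx_filter_t *f, const void *obj) {
      size_t slot = (size_t)(((uintptr_t)obj >> 4) % FILTER_SLOTS);
      if (f->opened[slot] == obj)
          return true;              /* seen before: skip the logging work */
      f->opened[slot] = obj;        /* may evict another entry; that only
                                       causes harmless re-logging later */
      return false;
  }

The compile-time version shown above removes the duplicate call entirely, which is why it is cheaper still.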
31
Runtime integration garbage collection
1. GC runs while some threads are in atomic blocks
[Diagram: threads inside atomic blocks, with log entries such as
obj1.field=old, obj2.ver=100, obj3 locked @ 100]
32
Runtime integration garbage collection
1. GC runs while some threads are in atomic blocks
[Diagram as on slide 31]
2. GC visits the heap as normal, retaining
objects that are needed if the blocks succeed
33
Runtime integration garbage collection
1. GC runs while some threads are in atomic blocks
[Diagram as on slide 31]
2. GC visits the heap as normal, retaining
objects that are needed if the blocks succeed
3. GC visits objects reachable from refs
overwritten in LogForUndo entries, retaining
objects needed if any block rolls back
34
Runtime integration garbage collection
1. GC runs while some threads are in atomic blocks
[Diagram as on slide 31]
2. GC visits the heap as normal, retaining
objects that are needed if the blocks succeed
3. GC visits objects reachable from refs
overwritten in LogForUndo entries, retaining
objects needed if any block rolls back
4. Discard log entries for unreachable objects:
they're dead whether or not the block succeeds
35
Results: Against Previous Work
Normalised execution time (sequential baseline = 1.00x):
  • Coarse-grained locking: 1.13x
  • Direct-update STM + compiler integration: 1.46x
  • Direct-update STM: 2.04x
  • Fine-grained locking: 2.57x
  • Harris & Fraser WSTM: 5.69x
Workload: operations on a red-black tree, 1
thread, 6:1:1 lookup:insert:delete mix with keys
0..65535
Scalable to multicore
36
Scalability (µ-benchmark)
[Graph: microseconds per operation vs. #threads, comparing coarse-grained
locking, fine-grained locking, WSTM (atomic blocks), DSTM (API), OSTM (API),
and direct-update STM + compiler integration]
37
Results long running tests
[Chart: normalised execution time on tree, go, skip, merge-sort and xlisp,
comparing the original application (no tx), direct-update STM, run-time
filtering, and compile-time optimizations; some bars are labelled 10.8, 73
and 162]
38
Summary
  • A pure software implementation can perform well
    and scale to large transactions
  • Direct update
  • Pessimistic (writes) + optimistic (reads)
    concurrency control
  • Compiler support and optimizations
  • Still need a better understanding of realistic
    workload distributions

39
Atomic Sections for C-like Languages
40
Intro
  • Autolocker: Synchronization Inference for Atomic
    Sections, Bill McCloskey, Feng Zhou, David Gay,
    Eric Brewer, POPL 2006
  • Question: Can we have language support for atomic
    sections in languages like C?
  • No object meta-data
  • No type safety
  • No garbage collection
  • Answer: Let the programmer help a bit (with
    annotations)
  • And do something simpler: NO aborts!

41
Autolocker: C Atomic Sections
  • Shared data is protected by annotated locks
  • Threads access shared data in atomic sections
  • Threads never deadlock (due to Autolocker)
  • Threads never race for protected data
  • How can we implement these semantics?

mutex m; int shared_var protected_by(m);
atomic { ... x = shared_var; ... }
Code runs as if a single lock protects all atomic
sections
42
Autolocker Transformation
  • Autolocker is a source-to-source transformation

C code:
  mutex m1, m2;
  int x protected_by(m1); int y protected_by(m2);
  atomic { x = 3; y = 2; }
Autolocker code:
  int m1, m2; int x; int y;
  begin_atomic(); acquire(m1); x = 3;
  acquire(m2); y = 2; end_atomic();
43
Autolocker Transformation
  • Autolocker is a source-to-source transformation

(C code and Autolocker code as on slide 42)
Atomic sections can be nested arbitrarily; the
nesting level is tracked at runtime
44
Autolocker Transformation
  • Autolocker is a source-to-source transformation

(C code and Autolocker code as on slide 42)
Locks are acquired as needed; lock acquisitions
are reentrant
45
Autolocker Transformation
  • Autolocker is a source-to-source transformation

(C code and Autolocker code as on slide 42)
Locks are released when the outermost section
ends; strict two-phase locking guarantees
atomicity
46
Autolocker Transformation
  • Autolocker is a source-to-source transformation

(C code and Autolocker code as on slide 42)
Locks are acquired in a global order; acquiring
locks in order will never lead to deadlock (see
the runtime sketch below)
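
The runtime side of this transformation is not shown on the slides; the following is a hedged reconstruction in C of what begin_atomic / acquire / end_atomic could look like given the properties stated on slides 42-46 (reentrant acquisitions, strict two-phase locking, acquire calls emitted in the global lock order). It is not the actual Autolocker runtime.

  #include <pthread.h>

  /* Illustrative reconstruction, not the real Autolocker runtime. */
  #define MAX_HELD 64

  static _Thread_local int              nesting;          /* atomic-section nesting depth */
  static _Thread_local pthread_mutex_t *held[MAX_HELD];   /* locks held by this thread */
  static _Thread_local int              nheld;

  void begin_atomic(void) { nesting++; }

  /* Reentrant acquire: a lock this thread already holds is not taken again.
   * The compiler emits acquire() calls in the global lock order, so taking
   * them in program order cannot deadlock. */
  void acquire(pthread_mutex_t *m) {
      for (int i = 0; i < nheld; i++)
          if (held[i] == m) return;
      pthread_mutex_lock(m);
      held[nheld++] = m;
  }

  /* Strict two-phase locking: release everything only when the outermost
   * atomic section ends. */
  void end_atomic(void) {
      if (--nesting == 0)
          while (nheld > 0)
              pthread_mutex_unlock(held[--nheld]);
  }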
47
Outline
  • Introduction
  • semantics
  • usage model and benefits
  • Autolocker algorithm
  • match locks to data
  • order lock acquisitions
  • insert lock acquisitions
  • Related work
  • Experimental evaluation
  • Conclusion

48
Autolocker Usage Model
  • Typical process for writing threaded software
  • Linux kernel evolved to SMP this way
  • Autolocker helps you program in this style

[Diagram: threads accessing shared data through locks]
Start here: one coarse-grained lock, lots of
contention, little parallelism
Finish here: many fine-grained locks, low
contention, high parallelism
49
Granularity and Autolocker
  • In Autolocker, annotations control performance
  • Simpler than fixing all uses of shared_var
  • Changing annotations won't introduce bugs
  • no deadlocks
  • no new race conditions

int shared_var protected_by(kernel_lock);
int shared_var protected_by(fs->lock);
50
Outline
  • Introduction
  • semantics
  • usage model and benefits
  • Autolocker algorithm
  • match locks to data
  • order lock acquisitions
  • insert lock acquisitions
  • Related work
  • Experimental evaluation
  • Conclusion

51
Algorithm Summary
[Flowchart:
  C program with atomic sections
  → match locks to data → lock requirements
  → generate global lock order → cyclic?
       yes: fail and report potential deadlock
       no: lock order
  → insert ordered lock acquisitions
       (e.g. begin; acq L1, L2; ... end)
  → remove redundant acquisitions
       (e.g. begin; acq L2; ... end)
  → C program with acquire statements]
52
Algorithm Summary
(Flowchart repeated from slide 51)
53
Matching Locks to Data
[Diagram: the functions (foo, bar, baz, qux) of a C program with atomic
sections are matched against the symbol table]
void foo() { atomic {
    y = 3;   /* use m2 */
    x = 2;   /* use m1 */
} }
Symbol table:
  mutex m1, m2;
  int x protected_by(m1); int y protected_by(m2);
"use L" means the lock L must already be held when this
statement is reached
54
Algorithm Summary
(Flowchart repeated from slide 51)
55
Acquisition Placement
  • Assume there's an acyclic order < on locks
  • We need to turn uses into acquires

void foo() { atomic {
    y = 3;   /* use m2 */
    x = 2;   /* use m1 */
} }
We must acquire m2
Global lock order: m1 < m2
56
Acquisition Placement
  • Assume there's an acyclic order < on locks
  • We need to turn uses into acquires

void foo() { atomic {
    y = 3;   /* use m2 */
    x = 2;   /* use m1 */
} }
We must acquire m2 ... but since m1 is needed
later and it's ordered first
Global lock order: m1 < m2
57
Acquisition Placement
  • Assume there's an acyclic order < on locks
  • We need to turn uses into acquires

void foo() { atomic {
    acquire m1; acquire m2;
    y = 3;
    /* use m1 */ x = 2;
} }
We must acquire m2 ... but since m1 is needed
later and it's ordered first, we acquire m1
first
Global lock order: m1 < m2
58
Acquisition Placement
  • Assume there's an acyclic order < on locks
  • We need to turn uses into acquires

void foo() { atomic {
    acquire m1; acquire m2;
    y = 3;
    /* use m1 */ x = 2;
} }
  • Preemptive acquires
  • Preemptively take locks that
  • are used later
  • but ordered before

Global lock order: m1 < m2
59
Acquisition Placement
  • Assume there's an acyclic order < on locks
  • We need to turn uses into acquires

void foo() { atomic {
    acquire m1; acquire m2;
    y = 3;
    acquire m1; x = 2;
} }
Eventually we'll optimize away this redundant
acquisition (see the sketch below)
Global lock order: m1 < m2
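
The placement rule of slides 55-59 can be tried out with a tiny runnable sketch (the lock names and the statement encoding are illustrative, not the paper's algorithm): a lock is represented by its index in the global order, and at a statement that needs lock `wanted` we first emit preemptive acquires for any lock that is still needed later but ordered earlier.

  #include <stdio.h>

  /* Illustrative sketch only. Locks are identified by their index in the
   * global lock order: a smaller index means "ordered earlier". */
  #define NLOCKS 4
  static const char *lock_name[NLOCKS] = { "m1", "m2", "m3", "m4" };

  /* Emit the acquires needed at a statement that uses 'wanted', given the
   * set of locks still needed later in the same atomic section. */
  static void acquires_for(int wanted, const int needed_later[NLOCKS]) {
      for (int l = 0; l < wanted; l++)
          if (needed_later[l])
              printf("acquire %s   /* preemptive */\n", lock_name[l]);
      printf("acquire %s\n", lock_name[wanted]);
  }

  int main(void) {
      /* The example above: the section needs m2 (for y = 3) and then m1
       * (for x = 2), with global order m1 < m2. */
      int later[NLOCKS] = { 1, 0, 0, 0 };            /* m1 still needed later */
      acquires_for(1, later);                        /* statement using m2 */
      acquires_for(0, (int[NLOCKS]){ 0 });           /* statement using m1 */
      return 0;                                      /* the second "acquire m1" is
                                                        the redundant one removed by
                                                        the later optimization pass */
  }

Running it prints "acquire m1 /* preemptive */", "acquire m2", "acquire m1", matching the transformation shown on slides 57-59.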
60
Algorithm Summary
(Flowchart repeated from slide 51)
61
What Is a Good Lock Order?
  • Some lock orders break the insertion algorithm!
  • We say that p->m < m is not a feasible order
  • So how do we find a feasible lock order?

Assume p->m < m
Original code:           use m; p = p->next; use p->m;
Via the insertion algo:  acquire p->m; acquire m; p = p->next; acquire p->m;
Because of the assignment, the final p->m names a
lock that has never been acquired before, and it
is acquired after m!
If p->m < m, then m cannot be acquired before
any lock called p->m
62
Finding a Feasible Lock Order
  • We want only feasible orders
  • We search for any code matching this pattern
  • Any feasible order must ensure e1 < e2
  • Scan through all code to find constraints

Pattern:  use e1; ... maykill(e2); ... use e2
maykill(e2) signifies anything that may affect e2's
value (like p = p->next when e2 = p->m)
e1 cannot be acquired after it is used;
e2 cannot be acquired before the update, since its
value is different above it
63
Finding a Feasible Lock Order
[Diagram: the functions (foo, bar, baz, qux) are searched for infeasible
patterns, yielding constraints]
Constraints:
  m1 < p->m
  m1 < r->m
  m3 < p->m
  p->m < p->m
  q->m < p->m
Topological sort gives the Global Lock Order:
  m1, r->m, q->m, p->m, p->m
64
Algorithm Summary
(Flowchart repeated from slide 51)
65
Outline
  • Introduction
  • semantics
  • usage model and benefits
  • Autolocker algorithm
  • computing protections
  • lock ordering
  • acquisition placement
  • Related work
  • Experimental evaluation
  • Conclusion

66
Comparison
  • Transactional memory with optimistic CC
  • Threads work locally and commit when done
  • If two threads conflict, one rolls back and
    restarts
  • Benefit: no complex static analysis
  • Drawbacks
  • software versions: lots of copying, can be slow
  • hardware versions: need new hardware
  • both: some operations cannot be rolled back
    (e.g., firing a missile)
  • How does this compare with Autolocker?

67
Experimental Questions
  • Question 1
  • What is the performance cost of using Autolocker?
  • How does it compare to other concurrency models?

68
Concurrent Hash Table
  • Simple microbenchmark
  • Goal: Same performance as hand-optimized code
  • low overhead
  • high scalability
  • Compared Autolocker to
  • manual locking
  • Fraser's object-based transactional memory mgr.
  • Ennals' revised transactional memory mgr.

69
Hash Table Benchmark
Machine: four processors, 1.9 GHz,
HyperThreading, 1 GB RAM. Each datapoint is the
average of 4 runs after 1 warmup run
70
Experimental Questions
  • Question 1
  • What is the performance cost of using Autolocker?
  • How does it compare to other concurrency models?
  • Question 2
  • Autolocker may reject code for potential
    deadlocks
  • Does it accept enough programs to be useful?
  • Will it only accept easy, coarse-grained policies?

71
AOLServer
  • Threaded web server using manual locking
  • Goal: Preserve the original locking policy with AL

Size: 52,589 lines, 82 modules
Changes: 143 atomic sections added; 126 types annotated with protections
Problems: 175 trusted casts (23 worrisome); 78/82 modules kept their original locking policies
Performance: negligible impact (3%)
72
Conclusion
  • Contributions
  • a new model for programming parallel systems
  • a transformation tool to implement it
  • Benefits
  • performance close to well-written manual locking
  • freedom from deadlocks
  • freedom from races on protected data
  • very good performance when ops/sync is high
  • increase performance without increasing bugs!

73
References
  • Optimizing Memory Transactions, Tim Harris, Mark
    Plesko, Avraham Shinnar, David Tarditi, PLDI '06
  • Autolocker: Synchronization Inference for Atomic
    Sections, Bill McCloskey, Feng Zhou, David Gay,
    Eric Brewer, POPL '06
  • The Free Lunch Is Over: A Fundamental Turn Toward
    Concurrency in Software, Herb Sutter, Dr. Dobb's
    Journal, 2005
  • Unlocking Concurrency, Ali-Reza Adl-Tabatabai et
    al., ACM Queue, 2006
  • Transactional Memory, James Larus, Ravi Rajwar,
    Morgan Kaufmann, 2006 (book)
  • Contact me: zf@rd.netease.com
  • http://zhoufeng.net

74
Backup slides
75
A Lot of Questions to Answer
From "The Landscape of Parallel Computing
Research: A View from Berkeley"