Title: In memoriam: Moore's Law
1. In memoriam: Moore's Law
Obituaries
It is with a heavy heart that we say goodbye to you, my dear Moore's Law.
We will never forget the way you made our code faster, without us having to
lift a finger. Your shrinking silicon features attracted everyone in sight,
and whenever your clock rate went up, you brightened our day. I remember
fondly how you dazzled us with your funny multi-issue superscalar execution,
your speculative branches and amazingly deep pipelines. Even when a new
version of Windows hit the market, we could always count on you to make
things bearable. Though people had been predicting for years that your days
were numbered, you kept surprising us with more power and performance, right
up to the sad day that your leakage problems became so severe that ...
[Figure: clock rate (GHz) plotted against time]
2. Transactional Memory
A 2007 Hot Topic in Computer Science and Engineering
Presented by Haraldur D. Þorvaldsson
3. Moore's Law: down but not out
- Modest clock rate increases, but transistor counts will keep growing for now
- Chipmakers' response: many cores per die
  - x86: multiple symmetric, full-blown cores
  - Network processors, Cell etc.: a master core plus numerous small, specialized worker cores
- Higher performance per mm² / per watt
4. No more performance free lunch
- Thread-level parallelism instead of instruction-level parallelism
- No threads, no thrust (no pain, no gain?)
- So what do we do now?
- Split everything into threads, synchronize everything with monitors and semaphores?
- But low-level, manual synchronization is really hard!
  - Races, deadlocks, priority inversions
5. Example: spot the race!
    class Boo {
        private Auxiliary aux = null;
        public Auxiliary getAux() {
            if (aux == null) {
                synchronized (this) {
                    if (aux == null) aux = new Auxiliary();
                }
            }
            return aux;
        }
    }
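For readers who want to check their answer: the slide's code is the double-checked locking idiom, and one commonly cited repair is to declare the field volatile. The sketch below assumes the Java 5 or later memory model. `SafeBoo` and the body of `Auxiliary` are stand-ins invented for this illustration; the talk does not show them.

```java
// Stand-in for the slide's Auxiliary class (its real contents are not shown in the talk).
class Auxiliary {
    final int payload = 42;
}

// Hypothetical repaired variant of the slide's Boo class.
class SafeBoo {
    // volatile orders the write of the reference after the object's construction,
    // so no thread can observe a non-null reference to a half-built Auxiliary.
    private volatile Auxiliary aux = null;

    public Auxiliary getAux() {
        Auxiliary local = aux;            // one volatile read on the fast path
        if (local == null) {
            synchronized (this) {
                local = aux;              // re-check while holding the lock
                if (local == null) {
                    local = new Auxiliary();
                    aux = local;          // publish only after construction completes
                }
            }
        }
        return local;
    }
}
```

Without volatile, a second thread may see a non-null `aux` before the constructor's writes become visible to it, which is the race the slide asks about.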
6. Spot the race, Master Class
7. Discern the deadlock: is it safe?
    synchronized void Dispatcher::handleCallbacks() {
        for (int i = 0; i < ...; i++) cb[i].onCallback();
    }
    public synchronized void Dispatcher::unregister(Callback c) { ... }

    synchronized void Voo::onCallback() {
        boolean gone = do();
    }
    public synchronized boolean Voo::do() {
        if (myDispatcher.isFull()) {
            myDispatcher.unregister(this);
            return true;
        }
        return false;
    }
8. Limiting factor: the human brain
- We have a sequential thinking process
- We lack the intuition and working memory to reason about (combinatorial!) execution interleavings, potential deadlock cycles etc.
- There are methods and tools for careful analysis and concurrency safety proofs
  - But still very tough, for elite brains only
- Auto-parallelization: mixed success record
9. Alternative solution: transactions
- Provide a higher-level execution model for programmers, based on atomic actions
- An action's code executes blithely, as if there were no concurrent overlap with any other action
- The machine and the compile-time / run-time environment worry about how to make it appear so
- The database system approach, essentially
10. Programming with atomic actions
- Partition code into atomic blocks.
- A block executes as a transaction.
- Transactional Memory: make the set of memory reads and writes by each block appear atomic w.r.t. the sets of reads and writes from other atomic blocks.
- To each atomic block, it appears as if it runs strictly before or after all other blocks.
11. Hypothetical language example
    atomic Object Queue::dequeue() {
        if (head == null) return null;
        Object o = head;
        head = head.next;
        return o;
    }
- Block runs completely or not at all.
- Sequential inside the block, so it is easy to reason about using pre-/post-conditions, invariants etc.
12. Things to notice
- Concurrency control mechanism is hidden
  - Intended effect specified, implementations may vary
- Besides the atomic keyword, there is no explicit specification of synchronization
  - No locking calls, code looks like sequential code
  - Hence no naming of shared synchronization entities, which makes it possible to compose atomic blocks.
13. Why it's hot, and might take off
- Transactions are likely quite brain-friendly
  - A proven, well-loved paradigm for concurrent database systems programming
  - Atomic transitions are the basis of many formalisms (e.g. I/O Automata, Abstract State Machines)
- Other benefits, for example:
  - Improved fault tolerance (through robust aborts)
  - Facilitates robust dynamic evolution of code
- Endorsement: all 3 DARPA HPCS languages use TM
14. A bit of history
- The idea of applying transactions to general programming is almost as old as the concept of database transactions itself!
  - "Process structuring, synchronization, and recovery using atomic actions", D. B. Lomet, 1977.
- Idea revisited by Herlihy and Moss
  - "Transactional Memory: Architectural Support for Lock-Free Data Structures", ISCA '93
  - Proposed to extend CPU cache coherency protocols to handle transactional memory access
15. A bit of history, cont'd.
- First implementation of an STM system
  - "Software Transactional Memory", Shavit and Touitou, PODC '95
- Influential STM design
  - "Software Transactional Memory for Dynamic-Sized Data Structures", Herlihy, Luchangco, Moir, Scherer, PODC '03
- Now everybody and their grandmother!
16. The rest of this talk
- Semantics of atomic blocks
- More atomic block techniques and issues
- Basics of TM implementation
- Implementing Transactional Memory
- Software implementations (STM)
- Hardware implementations (HTM)
- Challenges: I/O, legacy issues etc.
- Conclusions, I ♥ transactions.
17. A great reference
- Terminology and some structure adapted from the following excellent reference:
  - Transactional Memory, James R. Larus and Ravi Rajwar, Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers, January 12th, 2007, ISBN-13 978-1598291247
19. Example: blocking atomic queue
    atomic Object Queue::dequeue() {
        if (head == null) retry;
        Object o = head;
        head = head.next;
        return o;
    }
- retry aborts; the transaction later re-executes
- The TM system can defer re-execution until some variable read by the transaction has changed
- No busy waiting; easy Conditional Critical Regions
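The blocking behaviour of retry can be approximated in plain Java. This is a rough sketch, not how a real TM works. It assumes a single program-wide monitor instead of real transactions, it assumes a block calls retry before it has written anything (so there is nothing to roll back), and it wakes waiters after every commit rather than only when a variable they read has changed. `RetrySignal`, `RetryingAtomic` and `BlockingQueueSketch` are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.function.Supplier;

// Hypothetical signal thrown by a block that wants to "retry".
final class RetrySignal extends RuntimeException { }

final class RetryingAtomic {
    private static final Object COMMIT_LOCK = new Object();

    // Runs the block atomically; if it retries, sleeps until another block commits.
    static <T> T atomic(Supplier<T> block) {
        synchronized (COMMIT_LOCK) {
            while (true) {
                try {
                    T result = block.get();
                    COMMIT_LOCK.notifyAll();   // our commit may unblock waiting blocks
                    return result;
                } catch (RetrySignal retry) {
                    try {
                        COMMIT_LOCK.wait();    // no busy waiting: releases the lock while asleep
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        throw new IllegalStateException("interrupted while waiting to retry", e);
                    }
                }
            }
        }
    }

    static void retry() { throw new RetrySignal(); }
}

// The slide's blocking queue, expressed against the sketch above.
final class BlockingQueueSketch {
    private final ArrayDeque<Object> items = new ArrayDeque<>();

    void enqueue(Object item) {
        RetryingAtomic.atomic(() -> { items.addLast(item); return null; });
    }

    Object dequeue() {
        return RetryingAtomic.atomic(() -> {
            if (items.isEmpty()) RetryingAtomic.retry();   // block until something is enqueued
            return items.removeFirst();
        });
    }
}
```

A consumer thread calling `dequeue()` on an empty queue sleeps inside `wait()` until some producer's `enqueue()` commits, which is the conditional critical region behaviour the slide describes.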
20. Safe, composable multi-blocking
    atomic Object Queue::dq_1() {
        { return q1.dequeue(); }
        orElse
        { return q2.dequeue(); }
    }
- If q1 retries, orElse starts the latter block; if that block retries too, orElse starts all over again
- Contrast with select()-style dispatching
21. Nested transactions
- Support varies between systems
- Simple: flattened transactions (children commit/abort with the top-level transaction)
- Full nesting: transactions execute concurrently with the parent; child updates become visible to the parent once the child commits.
  - Even better composability
  - Parental control over aborting behavior
22. Atomic blocks and exceptions
    int x = 1;
    try {
        atomic {
            x = 2;
            throw new Exception();
        }
    } catch (Exception e) {
        // value of x?
    }
- What should be the value of x at the end?
- Letting exceptions abort the transaction is more flexible
- Exceptions + transactions: dependable rollback of side effects upon exceptions. Yay!
23. Concurrency for robustness
- Harris and Peyton Jones add a check keyword to STM Haskell (MS Research tech report)
  - Specify a set of state invariants that are automatically checked after transactions update the relevant state
  - Immediate detection of faulty transactions
- Related idea: check operation preconditions and security/access rights in parallel with the operation itself
  - The operation only commits if all checks complete; it is aborted without side effects if a check fails
25. Basics of TM implementations
- Recall: atomic blocks declare intent alone, not how atomicity is ensured.
- A straw-man solution: use one lock for the whole program; blocks acquire it before executing and release it once done.
- Not practical (no concurrency!) but can serve as a working definition of block semantics.
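The straw-man solution is small enough to write down in full. The sketch below is only the reference semantics, not a usable TM. `GlobalLockAtomic` and `StrawManQueue` are names made up for this illustration, and a lambda stands in for the hypothetical atomic keyword.

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Straw-man "TM": every atomic block takes the same program-wide lock.
final class GlobalLockAtomic {
    // Reentrant, so an atomic block may call another atomic block (flattened nesting).
    private static final ReentrantLock PROGRAM_LOCK = new ReentrantLock();

    static <T> T atomic(Supplier<T> block) {
        PROGRAM_LOCK.lock();
        try {
            return block.get();        // runs strictly before or after every other block
        } finally {
            PROGRAM_LOCK.unlock();     // released even if the block throws
        }
    }
}

// The queue from the earlier language example, written against the straw man.
final class StrawManQueue {
    private static final class Node {
        final Object value;
        Node next;
        Node(Object value) { this.value = value; }
    }

    private Node head;
    private Node tail;

    void enqueue(Object value) {
        GlobalLockAtomic.atomic(() -> {
            Node n = new Node(value);
            if (tail == null) head = n; else tail.next = n;
            tail = n;
            return null;
        });
    }

    Object dequeue() {
        return GlobalLockAtomic.atomic(() -> {
            if (head == null) return null;
            Object o = head.value;
            head = head.next;
            if (head == null) tail = null;
            return o;
        });
    }
}
```

Note that the queue code names no lock of its own. Only the intent (atomic) is stated, so the single lock could later be swapped for a real TM without touching the queue.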
26. Concurrent block conflicts
- Similar to pipeline hazards and DB conflicts. For example, a transaction should not:
  - Read or write a variable written by a transaction that commits after it.
  - Read a variable written by a transaction that will abort.
- Precise semantics differ between systems.
27. Main approaches to updating data shared between blocks
- Deferred update: transactions update private copies, which are copied to their final location on commit.
  - Pros: easy aborts, synergy with caches
  - Cons: reference indirections, memory usage
- Direct update: update shared data in place; concurrency control prevents conflicting reads and writes.
  - Pros: less need for indirection, faster commits
  - Cons: expensive aborts, must log for rollback
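The two update strategies can be contrasted in a few lines. This is a single-threaded sketch that deliberately leaves out conflict detection and concurrency control; it only shows where writes go and what commit and abort cost. The class names and the map standing in for shared memory are assumptions made for the illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Deferred update: writes go to a private buffer and reach shared memory only on commit.
final class DeferredTx {
    private final Map<String, Integer> shared;
    private final Map<String, Integer> writeBuffer = new HashMap<>();

    DeferredTx(Map<String, Integer> shared) { this.shared = shared; }

    Integer read(String var) {
        // The extra indirection: a transaction must see its own tentative writes first.
        return writeBuffer.containsKey(var) ? writeBuffer.get(var) : shared.get(var);
    }
    void write(String var, int value) { writeBuffer.put(var, value); }
    void commit() { shared.putAll(writeBuffer); writeBuffer.clear(); }  // copy to final location
    void abort()  { writeBuffer.clear(); }                              // easy: shared untouched
}

// Direct update: writes go straight to shared memory; an undo log enables rollback.
final class DirectTx {
    private static final class Undo {
        final String var;
        final Integer oldValue;   // null means the variable did not exist before
        Undo(String var, Integer oldValue) { this.var = var; this.oldValue = oldValue; }
    }

    private final Map<String, Integer> shared;
    private final Deque<Undo> undoLog = new ArrayDeque<>();

    DirectTx(Map<String, Integer> shared) { this.shared = shared; }

    Integer read(String var) { return shared.get(var); }   // no indirection needed
    void write(String var, int value) {
        undoLog.push(new Undo(var, shared.get(var)));       // log the old value first
        shared.put(var, value);
    }
    void commit() { undoLog.clear(); }                      // fast: data is already in place
    void abort() {
        while (!undoLog.isEmpty()) {                        // expensive: undo newest-first
            Undo u = undoLog.pop();
            if (u.oldValue == null) shared.remove(u.var); else shared.put(u.var, u.oldValue);
        }
    }
}
```

The pros and cons on the slide show up directly: `DeferredTx.abort()` and `DirectTx.commit()` are trivial, while `DeferredTx.commit()` copies data and `DirectTx.abort()` replays the log.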
28. Handling concurrent conflicts
- Once detected, conflicts are handled by delaying and/or aborting transactions.
  - For example, a transaction that reads inconsistent values will be aborted and automatically restarted.
- In many TMs, conflict resolution is a configurable or installable policy.
  - Transaction priorities, lengths, start times etc.
- Different levels of granularity for conflicts: e.g. word(s), objects/fields, cache lines, pages.
29. When to detect conflicts
- Many TM systems are based on optimistic concurrency control methods.
  - Detection and resolution of conflicts happens after conflicts occur (no later than commit, though!)
  - Works well when conflicts are infrequent
  - Allows dynamic speculative concurrency
- Time of detection / resolution is a trade-off
  - A conflicting transaction may in fact eventually commit (for example, if the other transaction aborts)
  - But a doomed transaction is doing wasted work.
30. Some thorny issues
- How do extra-transactional accesses interact with transactional accesses?
  - Weak consistency: consistency guaranteed between transactions only.
  - Strong consistency: consistency between transactional and non-transactional code too.
- How to handle / tolerate inconsistent state?
  - A doomed transaction may go astray (bad dereferences, out-of-bounds array accesses etc.)
  - Still, real errors should be handled normally
32. Software Transactional Memory
- Arguments favoring STM over HTM:
  - Easier to create and evolve than hardware
  - Easier integration with new and existing languages, garbage collection etc.
  - Less stringent resource limitations (e.g. memory)
  - A bridge towards hardware-based systems
- Argument against: high overhead, slow.
- Eventual systems will likely be hybrids
  - The division between software and hardware is ongoing research
33. Case study: DSTM
- "Software Transactional Memory for Dynamic-Sized Data Structures", Herlihy et al., PODC '03.
- Dynamic: no pre-declaration of accesses
- Based on deferred updates (private copies)
- C library usable from C or Java
- www.sun.com/download/products.xml?id453fb28e
34. DSTM thread and object wrappers
    class TMObject {
        TMObject(Object obj);
        enum Mode { READ, WRITE };
        Object open(Mode mode);
    }
    class TMThread : Thread {
        void beginTransaction();
        bool commitTransaction();
        void abortTransaction();
    }
35. Example: atomic counter
    TMObject tmCounter = new TMObject(new Counter(0));
    TMThread thread = (TMThread) Thread.getCurrentThread();
    while (true) {
        thread.beginTransaction();   // start a fresh transaction on every attempt
        Counter c = (Counter) tmCounter.open(WRITE);
        c.increment();
        if (thread.commitTransaction()) break;   // commit failed: loop and try again
    }
36. DSTM implementation
- open() returns a copy of the wrapped object
- Also validates all previously opened objects to check for conflicts with other transactions
  - Prevents observation of inconsistent state
  - Expensive: O(n²) checks for n objects
- An object opened in read mode may later be re-opened in write mode
37. DSTM data structures
[Diagram: a TMObject points to a Locator with the fields transaction, new object and old object. The Transaction has a status (active, committed or aborted) and a readSet. The old object is a read-only copy (R); the new object is a writable, private copy (W).]
- Transaction.status = committed → new object is the current version
- Transaction.status = aborted → old object is the current version
- Transaction.status = active → conflict (new object is a tentative new version)
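38. Opening object for write
[Diagram: the TMObject's current Locator refers to a Transaction whose status is committed or aborted, and to its old (R) and new (W) objects. The opening transaction builds a New Locator that refers to the new, active transaction and a writable copy (W), then installs it in the TMObject with an atomic compare-and-swap (CAS).]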
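39. Opening object for write
[Diagram: same layout as the previous slide, but the Transaction referenced by the current Locator is still active, which is marked "Conflict!". Labels shown: status active / aborted, readSet, old object (R), new object (W), CAS, and a New Locator referring to the new, active transaction with its writable copy (W).]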
40. Opening object for read
[Diagram: the TMObject's Locator refers to a prior Transaction with status committed and to its old (R) and new (W) objects. The reading Transaction (status active) records a pair for the object in its readSet.]
- Committing a transaction
  - Validate the read set: check for each (o, v) in the read set that v is still the current version of o.
  - CAS status from active to committed
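The locator mechanism from the last few slides can be sketched with java.util.concurrent atomics. This is a simplified illustration in the spirit of DSTM, not the library's actual code. It covers only opening for write and commit, it omits read sets and validation, and it hard-codes the crudest contention policy (abort the other writer). The explicit `copier` argument and the `peekCommitted` helper are assumptions added for the example.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

enum Status { ACTIVE, COMMITTED, ABORTED }

final class Transaction {
    final AtomicReference<Status> status = new AtomicReference<>(Status.ACTIVE);

    // Commit and abort are each a single CAS on the status word.
    boolean commit() { return status.compareAndSet(Status.ACTIVE, Status.COMMITTED); }
    boolean abort()  { return status.compareAndSet(Status.ACTIVE, Status.ABORTED); }
}

final class Locator<T> {
    final Transaction writer;   // transaction that installed this locator
    final T oldObject;          // version before the writer's update
    final T newObject;          // the writer's tentative version

    Locator(Transaction writer, T oldObject, T newObject) {
        this.writer = writer;
        this.oldObject = oldObject;
        this.newObject = newObject;
    }
}

final class TMObject<T> {
    private final AtomicReference<Locator<T>> start;
    private final UnaryOperator<T> copier;   // assumption: caller supplies a copy function

    TMObject(T initial, UnaryOperator<T> copier) {
        Transaction init = new Transaction();
        init.commit();                        // the initial value counts as committed
        this.start = new AtomicReference<>(new Locator<>(init, null, initial));
        this.copier = copier;
    }

    // committed -> new object is current; aborted or active -> old object is current.
    private static <V> V currentVersion(Locator<V> loc) {
        return loc.writer.status.get() == Status.COMMITTED ? loc.newObject : loc.oldObject;
    }

    T openWrite(Transaction tx) {
        while (true) {
            Locator<T> old = start.get();
            if (old.writer == tx) return old.newObject;    // already opened by this transaction
            if (old.writer.status.get() == Status.ACTIVE) {
                old.writer.abort();                         // conflict: abort the other writer
            }
            T current = currentVersion(old);                // writer's status is now final
            Locator<T> fresh = new Locator<>(tx, current, copier.apply(current));
            if (start.compareAndSet(old, fresh)) return fresh.newObject;
            // CAS failed: someone else installed a locator first, so try again.
        }
    }

    // Hypothetical helper for inspection: the latest committed version.
    T peekCommitted() { return currentVersion(start.get()); }
}

// Example payload, standing in for the Counter used on the earlier slide.
final class Counter {
    int value;
    Counter(int value) { this.value = value; }
}
```

Because the current version is decided by the writer's status word, one successful CAS from active to committed atomically publishes every object that transaction opened for write.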
41. Some other STMs of note
- Harris and Fraser, OOPSLA '03: WSTM, adds an atomic keyword to Java, with efficiently (re)evaluated guard expressions
- Dice and Shavit, TRANSACT '06: TL, uses locks acquired at commit time; performs better than locking at first access
- Adl-Tabatabai et al., TRANSACT '06: McRT-STM, two-phase locking, timeouts break deadlocks
- Harris et al., PPoPP '05: STM Haskell, the retry and orElse keywords
- R. Ennals, "Efficient Software Transactional Memory", Intel tech report: rebels against non-blocking
43. Hardware Transactional Memory
- Arguments in favor of HTM vs. STM:
  - Low or negligible overhead, better performance
  - Less invasive, language/compiler agnostic
  - Feasible to support strong isolation
  - May be easier to support unrestricted legacy code
- Arguments against:
  - Hard to arrive at a standard, hard to evolve
  - Difficult rollout, e.g. when will x86 support it?
- Most HTMs are extensions to cache coherency protocols, speculative execution control etc.
44. Case study: TCC
- "Transactional Memory Coherence and Consistency", Hammond et al., ISCA '04
- All transactions, all the time! Code is partitioned into transactions by the programmer or by tools
  - Possibly at run time, for legacy code!
- All writes are buffered in caches; CPUs arbitrate system-wide for which one gets to commit
- Updates are broadcast to all CPUs. CPUs detect conflicts with their own transactions and abort
45. Additional state for cache lines
- A Read bit, for lines speculatively read by a transaction
  - Causes an abort if a conflicting write is snooped
  - Can have multiple bits, for finer intra-line granularity
- A Modified bit, for lines speculatively written
  - Causes an abort if a write to the line is snooped.
- CPU must checkpoint registers, for restarts
- If the write buffer overflows, the CPU acquires system-wide exclusivity and finishes non-speculatively
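The per-line bookkeeping can be modelled in software to make the abort rule concrete. This is a toy simulation under simplifying assumptions: one Read bit and one Modified bit per line, no write buffer, and no register checkpoint. `SpeculativeCache` and its method names are invented for the illustration and do not come from the TCC paper.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of one CPU's speculative cache state during a transaction.
final class SpeculativeCache {
    private static final class Line {
        boolean read;       // R bit: line was speculatively read
        boolean modified;   // M bit: line was speculatively written
    }

    private final Map<Long, Line> lines = new HashMap<>();
    private boolean aborted;

    private Line line(long lineAddr) {
        return lines.computeIfAbsent(lineAddr, a -> new Line());
    }

    void speculativeLoad(long lineAddr)  { line(lineAddr).read = true; }
    void speculativeStore(long lineAddr) { line(lineAddr).modified = true; }

    // Called when another CPU's committed write to lineAddr is snooped on the bus.
    void snoopCommittedWrite(long lineAddr) {
        Line l = lines.get(lineAddr);
        if (l != null && (l.read || l.modified)) {
            aborted = true;   // our transaction used this line, so it must restart
        }
    }

    boolean mustAbort() { return aborted; }

    // Restart: clear all speculative bits (restoring the register checkpoint is not modelled).
    void restart() {
        lines.clear();
        aborted = false;
    }
}
```

A snooped write to a line the transaction never touched leaves it running, which is why finer-grained Read bits reduce false conflicts.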
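46. TCC implementation
[Diagram: the CPU core issues loads and stores to the local cache hierarchy, whose lines carry fields labelled r, m, V, tag and data. Stores only also go to a write buffer. Commit control sends commits to, and performs snooping on, the broadcast bus or network.]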
47. Specifying transaction ordering
- May assign phase numbers to transactions
- Only transactions with phase t may commit, where t = the oldest remaining phase
[Figure: timeline of transactions labelled with phase numbers 0 to 3, plotted against time]
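The phase rule can be stated as a small arbiter. The sketch below is a single-threaded model of the commit ordering only; `PhaseArbiter` and its methods are hypothetical names, and real TCC hardware would perform this arbitration on the bus.

```java
import java.util.TreeMap;

// Tracks outstanding transactions per phase and enforces "oldest phase commits first".
final class PhaseArbiter {
    // phase number -> number of transactions in that phase that have not yet committed
    private final TreeMap<Integer, Integer> remaining = new TreeMap<>();

    void register(int phase) {
        remaining.merge(phase, 1, Integer::sum);
    }

    // A transaction may commit only if its phase is the oldest (smallest) remaining one.
    boolean mayCommit(int phase) {
        return !remaining.isEmpty() && remaining.firstKey() == phase;
    }

    boolean tryCommit(int phase) {
        if (!mayCommit(phase)) return false;
        // Decrement the count; drop the phase entirely once its last transaction commits.
        remaining.computeIfPresent(phase, (p, n) -> n == 1 ? null : n - 1);
        return true;
    }
}
```

Transactions that share a phase number may commit in any order relative to each other, while a later phase has to wait until every transaction of the earlier phases has committed.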
48. Some other HTMs of note
- Moore et al., HPCA '06: LogTM, transaction data may overflow into memory; direct updates with logging for aborts (favors commits).
- Ceze et al., ISCA '06: Bulk, not a cache protocol, but compact aggregation and broadcast of updated addresses for conflict detection.
- Unbounded HTMs, where transactions survive context switches: Ananian et al., HPCA '05, UTM; Rajwar et al., ISCA '05, VTM
- Hybrid STM-HTMs: Kumar et al., PPoPP '06, HTM falls back on STM on overflow; Shriraman et al., TRANSACT '06, special HTM instructions for STMs.
50. Challenges for practical TM
- How to deal with I/O?
  - Ban it in transactions, require transactional semantics of I/O devices, compensators
- Other tricky interactions
  - Kernel data structures, virtual memory
  - Legacy / library / language / tools support
- For HTMs in particular
  - Supporting large, long-running HTM transactions
  - Interactions with CPU scheduling, pipeline flow etc.
51. Conclusions and my two cents
- Could be great if it all pans out!
  - Greatly simplified concurrent programming
  - Higher abstraction, TM implementation freedom
  - Better performance for inherently parallel stuff
  - High, DB-like robustness, via aborts that roll back
- Code with atomic blocks: a better value
  - Efficient support for guarded commands
  - A great enabler for distributed execution
  - Load-balance with correctness, impunity
  - Hardware synthesis too, à la Bluespec?