Title: Joseph B. Manzano
1. Features that you (most probably) didn't know your Microprocessor had
- Joseph B. Manzano
- Spring 2009
2. Outline
- The Powerful and the Fallen
- The Mutualists
- The Just Passing
- The Olympic Sprinters
- The Threads Commune
- Breaking the Despotic Rule of the Lock
3. The Powerful and The Fallen
Multiple Issue Architectures: increase your IPC / take advantage of ILP
Register Renaming
Tomasulo Algorithm
Reorder Buffer
Scoreboarding
4. The Powerful and The Fallen
Scoreboarding
- Based on the CDC 6000 architecture
- Important feature: the scoreboard
- Issue: checks WAW; Decode: checks RAW; Execute; Write results: checks WAR
Tomasulo Algorithm
- Implemented in the IBM 360/91's floating point unit
- Important features: reservation stations and the common data bus (CDB)
- Issue: tag the operands if they are not available, copy them if they are
- Execute: stall on RAW while monitoring the CDB
- Write results: send the results to the CDB and dump the store buffer contents
- Exception handling: no instructions can be issued until a branch is resolved
Reorder Buffer
Register Renaming
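Register renaming can be illustrated with a small software model. The sketch below is only an illustration, not how any particular processor implements it: it assumes an unlimited pool of physical registers (no free list or reclamation), and the names rename, rN, and pN are made up for this example.

```python
from itertools import count

def rename(instructions, num_arch_regs=8):
    """Give every destination a fresh physical register.

    Each instruction is (dest, src1, src2) over architectural names like "r1".
    Returns the same program expressed over physical names like "p9".
    """
    # Initially architectural register rN lives in physical register pN.
    mapping = {f"r{i}": f"p{i}" for i in range(num_arch_regs)}
    fresh = count(num_arch_regs)  # next unused physical register number
    renamed = []
    for dest, src1, src2 in instructions:
        # Sources use the latest mapping, so true (RAW) dependences are kept.
        s1, s2 = mapping[src1], mapping[src2]
        # A new physical destination removes WAR and WAW conflicts
        # with older instructions that still use the old register.
        mapping[dest] = f"p{next(fresh)}"
        renamed.append((mapping[dest], s1, s2))
    return renamed

program = [
    ("r1", "r2", "r3"),  # r1 = r2 op r3
    ("r4", "r1", "r5"),  # RAW on r1
    ("r5", "r6", "r7"),  # WAR on r5 against the previous instruction
    ("r1", "r6", "r7"),  # WAW on r1 against the first instruction
]
renamed_program = rename(program)
```

After renaming, the third and fourth instructions write p10 and p11, so they no longer conflict with the earlier readers and writers of r5 and r1, while the second instruction still reads the p8 produced by the first.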
5. The Powerful and The Fallen
Power5
Dual-core, two-way SMT IBM PowerPC superscalar architecture.
Picture courtesy of IBM, from "Power5 System Microarchitecture"
6. The Powerful and The Fallen
Intel Xeon Out-of-Order Engine Pipeline
Picture courtesy of Intel, from "Hyper-Threading Technology Architecture and Microarchitecture"
8. The Mutualists
- Vector Processing
- Supercomputers of the past
- SIMD type of design
- Elements of the data stream are processed by a single type of instruction
- Simplifies hardware design
- Moving toward more general purpose vector processing
9. The Mutualists
The Cell Broadband Engine
Created by STI (Sony, Toshiba, IBM)
Composed of nine computing elements (one PPE and eight SPEs)
11. The Just Passing
- Cache: an invisible architecture component
- Not so much in recent years
- PowerPC and other architectures provide instructions to control it
- dcbfe, dcbste, dcbze, icbie, isync
- Instructions are available to touch, to zero out, to reserve, or to lock a line in place
- But for some interesting designs, look no further than ...
12. The Just Passing
XBOX 360 Xenon Architecture
Picture courtesy of IBM, from "XBOX 360 System Architecture"
14. The Olympic Sprinters
- The Hertz race is over; however,
- Some processors are still at it
- Power6 and Power7 running at 4 and 5 GHz
- Intel Polaris: 3.6 to 6 GHz
- Many hardware redesigns are in order
- Make pipelines shorter and simpler
- Get rid of extra hardware features
15. The Olympic Sprinters
Power6
Running at frequencies from 4 to 5 GHz
13 FO4 pipeline stages, versus 23 FO4 in Power5
Pictures courtesy of IBM, from "Power6 System Microarchitecture"
17. The Threads Commune
- Large shared memory systems are becoming scarce
- Scalability issues due to synchronization
- Contention
- Coherency and Consistency
- Novel solutions have emerged
- Explicit memory hierarchies with very weak memory models
- Massive multithreading on chip
- Synchronization in memory
18. The Threads Commune
- Cray XMT
- 128 hardware streams
- A stream is 31 64-bit registers, 8 target registers, and a control register
- Three functional units: M, A and C
- 500 MHz
- Full and empty bits per word (2 bits)
- An example of a very high SMT design
19. The Threads Commune
http://www.intel.com/technology/computing/dual-core/demo/popup/demo.htm
20. The Threads Commune
Cray MTA-2 picture from John Feo's "Can programmers and machines ever be friends?"
21. The Threads Commune
- Data Race or Race Condition
- An anomaly of concurrent accesses by two or more threads to a shared memory location, where at least one of the accesses is a write
- Synchronization
- The orchestration of two or more threads (or processes) to complete a task correctly and to avoid any data races
- Problems
- Separation of lock and guarded data
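A data race is easiest to see with a fixed interleaving. The sketch below is a deterministic replay, not real threads: it assumes each thread increments a shared word with separate load, add, and store steps, and the names run, serial, and racy are made up for this example.

```python
def run(schedule, shared=0):
    """Replay a fixed interleaving of load/add/store steps from several threads."""
    memory = {"x": shared}
    local = {}  # one private register per thread
    for thread, step in schedule:
        if step == "load":
            local[thread] = memory["x"]      # read the shared word
        elif step == "add":
            local[thread] += 1               # compute on the private copy
        elif step == "store":
            memory["x"] = local[thread]      # write the result back
    return memory["x"]

# Each thread completes its increment before the other starts.
serial = [("T0", "load"), ("T0", "add"), ("T0", "store"),
          ("T1", "load"), ("T1", "add"), ("T1", "store")]

# Both threads load the old value before either one stores.
racy = [("T0", "load"), ("T1", "load"),
        ("T0", "add"), ("T1", "add"),
        ("T0", "store"), ("T1", "store")]

serial_result = run(serial)
racy_result = run(racy)
```

With the serial schedule both increments take effect; with the racy schedule the second store overwrites the first, so one update is lost. Synchronization exists to rule out schedules of the second kind.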
22. The Threads Commune
- Coherency and Consistency
- Caching elements and making sure that everyone sees the latest copy
- If an element is written by processor A, how will processors B and C know that they have the latest copy?
- Very difficult problem!
- One of the scalability problems of shared memory
23. The Threads Commune
- How does the Cray XMT solve these problems?
- For synchronization: join the lock with each data word and put the synchronization requirement on the memory instead of the processor
- For coherence and consistency: DO NOT cache remote data (outside the local 8 GiB)
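Joining the lock with the data word can be modeled in software. The sketch below is an assumption-laden illustration, not the Cray XMT interface: it emulates one full/empty bit with a Python condition variable, and the names FullEmptyWord, write_ef, read_fe, and transfer are made up for this example.

```python
import threading

class FullEmptyWord:
    """One memory word tagged with a full/empty bit (software model)."""

    def __init__(self):
        self._value = None
        self._full = False                  # the word starts out empty
        self._cond = threading.Condition()

    def write_ef(self, value):
        """Wait until the word is empty, store the value, then mark it full."""
        with self._cond:
            self._cond.wait_for(lambda: not self._full)
            self._value = value
            self._full = True
            self._cond.notify_all()

    def read_fe(self):
        """Wait until the word is full, load the value, then mark it empty."""
        with self._cond:
            self._cond.wait_for(lambda: self._full)
            self._full = False
            self._cond.notify_all()
            return self._value

def transfer(items):
    """Pass items from a producer thread to the caller through a single word."""
    word = FullEmptyWord()

    def produce():
        for item in items:
            word.write_ef(item)             # blocks while the word is still full

    producer = threading.Thread(target=produce)
    producer.start()
    received = [word.read_fe() for _ in items]  # blocks while the word is empty
    producer.join()
    return received
```

No separate lock object guards the word: the producer and consumer synchronize purely through the state of the word itself, which is the property the slide attributes to the XMT memory system.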
25. Breaking the Despotic Rule of the Lock
- Synchronization
- Atomicity and Serializability
- Locks and Barriers
- Cost hundreds to tens of thousands of cycles, growing linearly (in the best cases) or polynomially (in the worst cases) with the number of processors
- The lock
- The most used synchronization primitive!
- Alternatives: lock-free data structures
26. Breaking the Despotic Rule of the Lock
- Lock-Free Data Structures
- Used to implement non-blocking and/or wait-free algorithms
- Prevent deadlocks, livelocks and priority inversions
- Potential problems: the ABA problem
- It tells us no one is working on this now, but not whether someone has done so before
- Transactional Memory
- Based on transactions (atomic bundles of operations)
- If two transactions conflict, then one is bound to fail
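The ABA problem can be replayed step by step. The sketch below is a single-threaded model under stated assumptions: a compare-and-swap on the top pointer of a linked stack, with the second thread's actions written inline. The names Cell, cas, and aba_demo are made up for this example.

```python
class Cell:
    """A memory word supporting compare-and-swap (single-threaded model)."""

    def __init__(self, value):
        self.value = value

    def cas(self, expected, new):
        """Store new only if the word still holds expected; report success."""
        if self.value == expected:
            self.value = new
            return True
        return False

def aba_demo():
    # Stack of linked nodes, name -> next name: top -> "A" -> "B" -> "C".
    next_of = {"A": "B", "B": "C", "C": None}
    top = Cell("A")

    # Thread 1 begins a pop: it reads top and its successor, then is preempted.
    seen_top = top.value            # "A"
    seen_next = next_of[seen_top]   # "B"

    # Thread 2 runs meanwhile: pops A, pops B, then pushes A back.
    top.cas("A", "B")               # pop A
    top.cas("B", "C")               # pop B (B is now off the stack)
    next_of["A"] = "C"
    top.cas("C", "A")               # push A again

    # Thread 1 resumes: top is "A" again, so its CAS succeeds,
    # but it installs "B", a node that is no longer on the stack.
    succeeded = top.cas(seen_top, seen_next)
    return succeeded, top.value
```

The compare-and-swap only checks that the word holds A now; it cannot tell that the word was changed and restored in between. Pairing the pointer with a version counter, or using a reservation that any intervening store invalidates, closes this gap.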
27. Side Note: A Review of LL and SC
- PowerPC and many other architectures provide these instructions
- Provide a way to optimistically execute a piece of code
- In case a violation has taken place, discard your results
- Many implementations
- PowerPC: lwarx and stwcx
28. Side Note: The LL and SC Behavior
- The lwarx instruction
- Loads a word-aligned location
- Side effects
- A reservation is created
- The storage coherence mechanism is notified that a reservation exists
- The stwcx instruction
- Conditionally stores a value to a given memory location
- Conditionally: depends on the reservation
- If it succeeds, all changes are committed to memory
- If not, the changes are discarded
29. Side Note: Reservations
- At most one per processor
- A reservation is lost when
- The processor holding the reservation executes
- A lwarx or ldarx
- A stwcx or stdcx (whether or not the reservation matches)
- Another processor executes
- A store or a dcbz to the granule
- Some other mechanism modifies a storage location in the same reservation granule
- Interrupts do not clear reservations
- But interrupt handlers might
- Granularity
- The length of the memory block kept under surveillance
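The reservation rules above can be captured in a small software model. The sketch below is an illustration, not the hardware mechanism: the class ReservationMemory, its methods, the helper atomic_add, and the 4-byte default granule are all assumptions made for this example. It models one reservation per processor, loss of the reservation on any store-conditional by its holder, and loss when another processor stores into the same granule.

```python
class ReservationMemory:
    """Software model of lwarx/stwcx-style reservations (not real hardware)."""

    def __init__(self, granule=4):
        self.granule = granule   # reservation granule size in bytes (assumed)
        self.data = {}           # address -> value
        self.reserved = {}       # processor id -> reserved granule index

    def _granule_of(self, addr):
        return addr // self.granule

    def load_reserve(self, cpu, addr):
        """Model of lwarx: load and create (or replace) this cpu's reservation."""
        self.reserved[cpu] = self._granule_of(addr)
        return self.data.get(addr, 0)

    def store(self, cpu, addr, value):
        """Plain store: clears other processors' reservations on the granule."""
        g = self._granule_of(addr)
        for other in [c for c, rg in self.reserved.items() if rg == g and c != cpu]:
            del self.reserved[other]
        self.data[addr] = value

    def store_conditional(self, cpu, addr, value):
        """Model of stwcx: store only if this cpu still holds a matching reservation."""
        held = self.reserved.pop(cpu, None)   # the reservation is lost either way
        if held != self._granule_of(addr):
            return False
        self.store(cpu, addr, value)
        return True

def atomic_add(mem, cpu, addr, amount, interfere=None):
    """Retry loop: load-reserve, compute, store-conditional, branch back on failure."""
    attempts = 0
    while True:
        attempts += 1
        value = mem.load_reserve(cpu, addr)
        if interfere and attempts == 1:
            interfere()                       # another processor runs in between
        if mem.store_conditional(cpu, addr, value + amount):
            return attempts
```

For instance, calling atomic_add(mem, 0, 0, 100, interfere=lambda: atomic_add(mem, 1, 0, 100)) makes processor 0 fail its first store-conditional, because processor 1 stored to the granule in between, and succeed on the retry. A plain store by another processor to a different address in the same granule has the same effect, which is why the granule size matters.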
30-39. Side Note: Examples
[Animation frames: first one processor, then two, run the sequence LL a, a 100, SC a, brnz against a shared location a that is tracked by the storage mechanism. Each processor's slot shows either a or X. Across the frames the value of a in memory goes from ? to 100, then 200, and finally 20000.]
40. Breaking the Despotic Rule of the Lock
- Sun Rock Processor
- Execute Ahead
- Scouting Threads
- Simultaneous Multithreading
- Transactional Memory
- Checkpoint
- Cache memory with extra bits for tracking speculative execution
- 32 logical threads and 16 physical cores
Pictures courtesy of "Rock: A SPARC CMT Processor"
41. Breaking the Despotic Rule of the Lock
- Take a RISC-y approach
- Small transactions: handled in HW
- Best effort
- Use the checkpoint mechanism!
- Transactions: a software construct
- Checkpoint in case of failure
- Commit on a successful transaction
- Executed speculatively by a strand
- Uses the cache and store buffers, and locks cache lines until commit (tracking lines with the s-bits)
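The checkpoint, speculate, and commit cycle can be sketched as a toy software model. This is not how Rock does it: the slide describes cache s-bits and store buffers, whereas the sketch below substitutes per-address version counters for conflict detection. The names TxMemory, Transaction, read, write, and commit are made up for this example.

```python
class TxMemory:
    """Versioned memory for a toy transactional-memory model (software only)."""

    def __init__(self):
        self.values = {}     # address -> committed value
        self.versions = {}   # address -> number of committed writes

    def begin(self):
        """Start a transaction; its empty buffers act as the checkpoint."""
        return Transaction(self)

class Transaction:
    def __init__(self, mem):
        self.mem = mem
        self.read_versions = {}   # address -> version seen at first read
        self.write_buffer = {}    # speculative stores, invisible until commit

    def read(self, addr):
        if addr in self.write_buffer:
            return self.write_buffer[addr]    # see our own speculative store
        self.read_versions.setdefault(addr, self.mem.versions.get(addr, 0))
        return self.mem.values.get(addr, 0)

    def write(self, addr, value):
        self.write_buffer[addr] = value       # buffered, not yet in memory

    def commit(self):
        """Publish buffered stores if nothing read has changed; else discard them."""
        for addr, seen in self.read_versions.items():
            if self.mem.versions.get(addr, 0) != seen:
                self.write_buffer.clear()     # fall back to the checkpoint
                return False
        for addr, value in self.write_buffer.items():
            self.mem.values[addr] = value
            self.mem.versions[addr] = self.mem.versions.get(addr, 0) + 1
        return True
```

If two transactions both read address 0 and then write it, whichever commits first succeeds and the other fails at commit with its stores discarded. Software is then expected to retry, or to fall back to a lock, which matches the best-effort approach on the slide.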
42. Multi-core Trends in this Decade
IBM
- Power4: 64-bit PowerPC, 2 cores
- Power5: 64-bit PowerPC, 2 cores with SMT
- Power6: 64-bit PowerPC, 2 cores with SMT
- Power7
- CBE: PowerPC, 9-core chip
- Xenon: 64-bit PowerPC, 3-core chip
Intel
- Pentium D: IA32 x86, 2-core chip
- Xeon Dual Core: IA32 x86, 2-core chip
- Intel Core Duo: IA32 x86, dual-core chip
- Intel Core 2 Duo: IA32 x86, 2-core chip
- Intel Core 2 (codenames Penryn, Wolfdale): IA32 x86, dual/quad-core chip
- Codename Nehalem: 1- to 8-core chip
- Codename Sandy Bridge
AMD
- AMD Opteron (codename Denmark): IA32 x86, 2-core chip
- AMD Turion64 X2: IA32 x86, dual-core chip
- AMD (codename Barcelona): IA32 x86, native 4-core chip
SUN
- UltraSparc T1 (codename Niagara): 8-core processor, 32 logical threads
- UltraSparc T2 (codename Niagara): 8-core processor, 64 logical threads
- Codename Rock: 16-core processor, 32 logical threads
43. Sources
- The Powerful and the Fallen
- Sinharoy, B. et al., Power5 System Microarchitecture, IBM Journal of Research and Development, Vol 49, July/September 2005
- Marr, D. et al., Hyper-Threading Technology Architecture and Microarchitecture, Intel Technology Journal, Vol 6, Issue 1, 2002
- The Mutualists
- The Just Passing
- Andrews, Jeff and Baker, Nick, XBOX 360 System Architecture, IEEE Micro, Volume 26, Issue 2, March 2006
- The Olympic Sprinters
- Le, H.Q. et al., Power6 System Microarchitecture, IBM Journal of Research and Development, Vol 51, November 2007
- The Threads Commune
- Konecny, P., Introducing the Cray XMT, May 5th, 2007
- Feo, J., Can programmers and machines ever be friends?
- Breaking the Despotic Rule of the Lock
- Chaudhry, S., Rock: A SPARC CMT Processor, August 26, 2008