Joseph B. Manzano - PowerPoint PPT Presentation

Title: Joseph B. Manzano


1
Features that you (most probably) didn't know
your Microprocessor had
  • Joseph B. Manzano
  • Spring 2009

2
Outline
  • The Powerful and the Fallen
  • The Mutualists
  • The Just Passing
  • The Olympic Sprinters
  • The Threads Commune
  • Breaking the Despotic Rule of the Lock

3
The Powerful and The Fallen
Multiple Issue Architectures: Increase your IPC /
Take advantage of ILP
Register Renaming
Tomasulo Algorithm
Reorder Buffer
Scoreboarding
4
The Powerful and The Fallen
  • Scoreboarding
  • Based on the CDC 6000 architecture
  • Important feature: the scoreboard
  • Issue: checks WAW. Decode: checks RAW. Execute and
    write results: checks WAR
  • Tomasulo Algorithm
  • Implemented in the IBM 360/91's floating point
    unit
  • Important features: reservation stations and the
    CDB (common data bus)
  • Issue: tag the operands if they are not available,
    copy them if they are
  • Execute: stall on RAW, monitoring the CDB
  • Write results: send results to the CDB and dump the
    store buffer contents
  • Exception handling: no instructions can be issued
    until a branch is resolved
  • Reorder Buffer
  • Register Renaming
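
The three hazard types named on this slide can be illustrated with a short Python sketch. The `Instr` record and the `hazards` function are hypothetical names introduced here for illustration. The code only classifies dependences between an earlier and a later instruction. It does not model a scoreboard or Tomasulo's reservation stations.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    """Hypothetical instruction record: one destination register, some sources."""
    dest: str
    srcs: Tuple[str, ...]


def hazards(earlier: Instr, later: Instr) -> Set[str]:
    """Return the data hazards that `later` has with respect to `earlier`."""
    found = set()
    if earlier.dest in later.srcs:
        found.add("RAW")  # later reads what earlier writes (true dependence)
    if later.dest in earlier.srcs:
        found.add("WAR")  # later overwrites a register earlier still has to read
    if later.dest == earlier.dest:
        found.add("WAW")  # both write the same register
    return found


mul = Instr(dest="f0", srcs=("f2", "f4"))
add = Instr(dest="f2", srcs=("f0", "f6"))
print(sorted(hazards(mul, add)))
```

Register renaming removes the WAR and WAW cases by giving the later write a fresh register, so only RAW remains as a true dependence.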
5
The Powerful and The Fallen
Power5
Dual-core, two-way SMT IBM PowerPC superscalar
architecture.
Picture courtesy of IBM, from Power5 System
Microarchitecture
6
The Powerful and The Fallen
Intel Xeon Out of Order Engine Pipeline
Picture Courtesy of Intel from Hyper-Threading
Technology Architecture and Microarchitecture
7
Outline
  • The Powerful and the Fallen
  • The Mutualists
  • The Just Passing
  • The Olympic Sprinters
  • The Threads Commune
  • Breaking the Despotic Rule of the Lock

8
The Mutualists
  • Vector Processing
  • Supercomputers of the past
  • SIMD type of design
  • Elements of the data stream are processed by a
    single type of instruction
  • Simplifies hardware design
  • Moving toward more general-purpose vector
    processing

9
The Mutualists
The Cell Broadband Engine
Created by STI (Sony, Toshiba, and IBM)
Composed of nine computing elements
10
Outline
  • The Powerful and the Fallen
  • The Mutualists
  • The Just Passing
  • The Olympic Sprinters
  • The Threads Commune
  • Breaking the Despotic Rule of the Lock

11
The Just Passing
  • Cache: an invisible architecture component
  • Not so much in recent years
  • PowerPC and other architectures provide
    instructions to control it
  • dcbfe, dcbste, dcbze, icbie, isync
  • Instructions are available to touch, to zero out, to
    reserve, or to lock a line in place.
  • But for some interesting designs, look no further
    than

12
The Just Passing
XBOX 360 Xenon Architectures
Picture courtesy of IBM, from XBOX 360 System
Architecture
13
Outline
  • The Powerful and the Fallen
  • The Mutualists
  • The Just Passing
  • The Olympic Sprinters
  • The Threads Commune
  • Breaking the Despotic Rule of the Lock

14
The Olympic Sprinters
  • The Hertz race is over. However,
  • Some processors are still at it
  • Power 6 and 7 running at 4 and 5 GHz
  • Intel Polaris 3.6 to 6 GHz
  • Many hardware re-designs are in order
  • Make pipelines shorter, simpler
  • Get rid of extra hardware features

15
The Olympic Sprinters
Pictures courtesy of IBM, from Power6 System
Microarchitecture
Power6
Running at frequencies from 4 to 5 GHz
13 FO4 versus 23 FO4 pipeline
16
Outline
  • The Powerful and the Fallen
  • The Mutualists
  • The Just Passing
  • The Olympic Sprinters
  • The Threads Commune
  • Breaking the Despotic Rule of the Lock

17
The Threads Commune
  • Large shared memory systems are becoming scarce
  • Scalability issues due to synchronization
  • Contention
  • Coherency and Consistency
  • Novel Solutions have emerged
  • Explicit memory hierarchies with very weak memory
    models
  • Massive Multithreading on chip
  • Synchronization in memory

18
The Threads Commune
  • Cray XMT
  • 128 Hardware streams
  • A stream is 31 64-bit registers, 8 target
    registers, and a control register
  • Three functional units M, A and C
  • 500 MHz
  • Full and Empty bits per word (2-bits)
  • An example of a very high SMT design

19
The Threads Commune
  • SMT / HT designs

http://www.intel.com/technology/computing/dual-core/demo/popup/demo.htm
20
The Threads Commune
Cray MTA2 picture from John Feo's "Can
programmers and machines ever be friends?"
21
The Threads Commune
  • Data Race or Race Condition
  • An anomaly caused by concurrent accesses by
    two or more threads to a shared memory location,
    where at least one of the accesses is a write
  • Synchronization
  • The orchestration of two or more threads (or
    processes) to complete a task in a correct manner
    and to avoid any data races
  • Problems
  • Separation of lock and guarded data
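
A minimal Python sketch of the two ideas on this slide. `racy_interleaving` replays one bad interleaving by hand, so the lost update is deterministic. `GuardedCounter` keeps the lock inside the same object as the data it guards, which is one way to avoid separating the two. All names are hypothetical and chosen for illustration only.

```python
import threading


def racy_interleaving() -> int:
    """Replay one bad interleaving: both threads read before either writes."""
    shared = 0
    t1_local = shared        # thread 1 reads 0
    t2_local = shared        # thread 2 reads 0
    shared = t1_local + 1    # thread 1 writes 1
    shared = t2_local + 1    # thread 2 writes 1, losing thread 1's update
    return shared


class GuardedCounter:
    """Counter whose lock lives next to the value it protects."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._value = 0

    def increment(self) -> None:
        with self._lock:     # read-modify-write happens as one unit
            self._value += 1

    @property
    def value(self) -> int:
        with self._lock:
            return self._value


def run_guarded(threads: int = 4, per_thread: int = 1000) -> int:
    """Increment one guarded counter from several threads and return the total."""
    counter = GuardedCounter()

    def work() -> None:
        for _ in range(per_thread):
            counter.increment()

    pool = [threading.Thread(target=work) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter.value
```

Two increments in `racy_interleaving` leave the value at 1, while `run_guarded()` always reaches the full count because every update holds the lock.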

22
The Threads Commune
  • Coherency and Consistency
  • Caching elements while making sure that everyone
    sees the latest copy
  • If an element is written by processor A, then how
    will processors B and C know that they have the
    latest copy?
  • Very difficult problem!
  • One of the scalability problems of shared memory

23
The Threads Commune
  • How does the Cray XMT solve these problems?
  • For synchronization: join the lock with each data
    word and put the synchronization requirement on
    the memory instead of the processor
  • For coherence and consistency: DO NOT cache
    remote data (outside the local 8 GiB)
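
The idea of joining the lock with the data word can be mimicked in software. The sketch below is only an analogy: it assumes a single full flag per word and uses a Python condition variable where the XMT uses state bits held in memory. `FullEmptyWord`, `write_ef`, and `read_fe` are illustrative names, not an actual Cray API.

```python
import threading


class FullEmptyWord:
    """One memory word with a full/empty flag, simulated in software."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._full = False
        self._value = None

    def write_ef(self, value) -> None:
        """Wait until the word is empty, write it, then mark it full."""
        with self._cond:
            self._cond.wait_for(lambda: not self._full)
            self._value = value
            self._full = True
            self._cond.notify_all()

    def read_fe(self):
        """Wait until the word is full, read it, then mark it empty."""
        with self._cond:
            self._cond.wait_for(lambda: self._full)
            value = self._value
            self._full = False
            self._cond.notify_all()
            return value


def demo(n: int = 5) -> list:
    """Pass n values from a producer thread to a consumer through one word."""
    word = FullEmptyWord()
    received = []

    def producer() -> None:
        for i in range(n):
            word.write_ef(i)

    def consumer() -> None:
        for _ in range(n):
            received.append(word.read_fe())

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received
```

No separate lock object exists here: the producer and the consumer synchronize only through the state of the word itself, which is the point the slide makes.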

24
Outline
  • The Powerful and the Fallen
  • The Mutualists
  • The Just Passing
  • The Olympic Sprinters
  • The Threads Commune
  • Breaking the Despotic Rule of the Lock

25
Breaking the Despotic Rule of the Lock
  • Synchronization
  • Atomicity and Serializability
  • Locks and Barriers
  • Cost from hundreds to tens of thousands of cycles,
    growing linearly (in the best cases) or
    polynomially (in the worst cases) with the number
    of processors
  • The lock
  • The most used synch primitive!
  • Alternatives: Lock-free data structures

26
Breaking the Despotic Rule of the Lock
  • Lock Free Data Structures
  • Used to implement non-blocking and/or wait-free
    algorithms
  • Prevents deadlocks, livelocks and priority
    inversions
  • Potential problems: the ABA problem
  • The check tells us that no one is working on this
    now, but not whether someone has done so before
  • Transactional Memory
  • Based on transactions (an atomic bundle of
    operations)
  • If two transactions conflict, then one is bound to
    fail
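
The ABA problem can be replayed deterministically with a simulated compare-and-swap. In this sketch the `Cell` class, its methods, and the version counter are hypothetical. The lock inside `Cell` only stands in for the atomicity that hardware would provide, and the version tag is one common mitigation, shown here as an assumption rather than as the approach of any particular processor.

```python
import threading


class Cell:
    """A shared word with a simulated compare-and-swap."""

    def __init__(self, value) -> None:
        self._lock = threading.Lock()  # stands in for hardware atomicity
        self.value = value
        self.version = 0               # bumped on every successful write

    def cas(self, expected, new) -> bool:
        """Swap in `new` only if the current value equals `expected`."""
        with self._lock:
            if self.value != expected:
                return False
            self.value = new
            self.version += 1
            return True

    def cas_versioned(self, expected, expected_version, new) -> bool:
        """Like cas, but also requires that no write happened in between."""
        with self._lock:
            if self.value != expected or self.version != expected_version:
                return False
            self.value = new
            self.version += 1
            return True


def interfere(cell: Cell) -> None:
    """Another thread changes A to B and back to A."""
    cell.cas("A", "B")
    cell.cas("B", "A")


def aba_demo():
    plain = Cell("A")
    seen = plain.value                       # first thread reads "A"
    interfere(plain)
    plain_ok = plain.cas(seen, "C")          # value matches, history is hidden

    tagged = Cell("A")
    seen, seen_version = tagged.value, tagged.version
    interfere(tagged)
    tagged_ok = tagged.cas_versioned(seen, seen_version, "C")
    return plain_ok, tagged_ok
```

The plain compare-and-swap succeeds even though the word was modified twice in between, while the versioned one fails and forces a retry.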

27
Side Note: A Review of LL and SC
  • Load-linked (LL) and store-conditional (SC):
    instructions in PowerPC and many other architectures
  • Provide a way to optimistically execute a piece
    of code
  • If a violation has taken place,
    discard your results
  • Many implementations
  • PowerPC: lwarx and stwcx.

28
Side Note: The LL and SC behavior
  • The lwarx instruction
  • Loads a word from a word-aligned location
  • Side Effects
  • A reservation is created
  • Storage coherence mechanism is notified that a
    reservation exists
  • The stwcx. instruction
  • Conditionally stores a word to a given memory
    location.
  • Conditionally: depends on the reservation
  • On success, the changes are committed to
    memory
  • If not, the changes are discarded.

29
Side Note: Reservations
  • At most one per processor
  • A reservation is lost when
  • The processor holding the reservation executes
  • A lwarx or ldarx
  • A stwcx. or stdcx. (no matter whether the
    reservation matches or not)
  • Another processor executes
  • A store or a dcbz to the granule
  • Some other mechanism modifies a storage location
    in the same reservation granule
  • Interrupts do not clear reservations
  • But interrupt handlers might
  • Granularity
  • The length of the memory block kept under
    surveillance
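
The reservation rules above can be sketched as a toy model in Python. This is not a PowerPC emulator. The class name, the 128-byte granule, the initial memory value of 0, and the choice to fail a store conditional whose address does not match the reservation are all assumptions made for illustration.

```python
class ReservationMemory:
    """Toy model of load-and-reserve / store-conditional reservations."""

    def __init__(self, granule: int = 128) -> None:
        self.granule = granule   # assumed reservation granule size in bytes
        self.mem = {}            # address -> value (missing means 0 here)
        self.reservation = {}    # cpu id -> reserved granule, at most one each

    def _granule_of(self, addr: int) -> int:
        return addr // self.granule

    def store(self, cpu: int, addr: int, value: int) -> None:
        """Plain store: other processors lose reservations on this granule."""
        self.mem[addr] = value
        g = self._granule_of(addr)
        losers = [c for c, rg in self.reservation.items()
                  if c != cpu and rg == g]
        for c in losers:
            del self.reservation[c]

    def lwarx(self, cpu: int, addr: int) -> int:
        """Load and reserve: replaces any reservation this processor held."""
        self.reservation[cpu] = self._granule_of(addr)
        return self.mem.get(addr, 0)

    def stwcx(self, cpu: int, addr: int, value: int) -> bool:
        """Store conditional: always drops the reservation, stores on a match."""
        held = self.reservation.pop(cpu, None)
        if held != self._granule_of(addr):
            return False
        self.store(cpu, addr, value)
        return True


def atomic_add(m: ReservationMemory, cpu: int, addr: int, amount: int) -> int:
    """Retry loop: reload and try again until the store conditional succeeds."""
    while True:
        old = m.lwarx(cpu, addr)
        if m.stwcx(cpu, addr, old + amount):
            return old + amount


def interleaved_demo():
    m = ReservationMemory()
    a = 0x1000
    v0 = m.lwarx(0, a)               # processor 0 reserves and reads 0
    v1 = m.lwarx(1, a)               # processor 1 reserves and reads 0
    ok0 = m.stwcx(0, a, v0 + 100)    # succeeds and kills processor 1's reservation
    ok1 = m.stwcx(1, a, v1 + 100)    # fails: its reservation is gone
    atomic_add(m, 1, a, 100)         # processor 1 retries from a fresh load
    return ok0, ok1, m.mem[a]
```

In `interleaved_demo` both processors read the same old value, but only one store conditional succeeds. The loser reloads and retries, so neither addition is lost.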

30 to 39
Side Note: Examples
[Animation, ten frames. One or two processors each step through the sequence LL a, "a 100", SC a, brnz against a shared location a, drawn next to boxes labeled Memory and Storage Mechanism. The value of a in Memory is shown as "?" in the first two frames, then 100, then 200, and 20000 in the last frame. Some frames mark a processor's entry for a with an X.]
40
Breaking the Despotic Rule of the Lock
  • Sun Rock Processor
  • Execute Ahead
  • Scouting Threads
  • Simultaneous Multithreading
  • Transactional Memory
  • Checkpoint
  • Cache memory with extra bits for tracking
    speculative execution
  • 32 logical threads and 16 physical cores

Pictures courtesy of "Rock: A SPARC CMT
Processor"
41
Breaking the Despotic Rule of the Lock
  • Take a RISC-y Approach
  • Small transactions go to HW
  • Best effort
  • Use the checkpoint mechanism!
  • Transactions: a software construct
  • Checkpoint in case of failure
  • Commit on successful transaction
  • Executed speculatively by a strand
  • Uses the cache and store buffers, and locks cache
    lines until commit (tracking lines with the s-bits)
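
The checkpoint, speculate, and commit cycle described above can be sketched in software. The following is a toy model under stated assumptions, not Rock's mechanism: writes are buffered in a dictionary instead of a store buffer, per-key version counters play the role that the slide assigns to the s-bits, and a lock stands in for an atomic hardware commit. `TxMemory` and `Transaction` are hypothetical names.

```python
import threading


class TxMemory:
    """Shared memory with per-key versions for conflict detection."""

    def __init__(self) -> None:
        self.commit_lock = threading.Lock()  # stands in for atomic commit
        self.data = {}
        self.version = {}                    # key -> number of committed writes

    def run(self, body, max_attempts: int = 8) -> bool:
        """Best effort: retry a few times, then tell the caller to fall back."""
        for _ in range(max_attempts):
            tx = Transaction(self)           # checkpoint: memory is untouched
            body(tx)                         # speculative execution
            if tx.commit():
                return True
            # On failure the buffered writes are simply dropped.
        return False


class Transaction:
    def __init__(self, mem: TxMemory) -> None:
        self.mem = mem
        self.read_versions = {}              # what this transaction observed
        self.write_buffer = {}               # speculative, invisible to others

    def read(self, key):
        if key in self.write_buffer:
            return self.write_buffer[key]
        self.read_versions.setdefault(key, self.mem.version.get(key, 0))
        return self.mem.data.get(key, 0)

    def write(self, key, value) -> None:
        self.write_buffer[key] = value

    def commit(self) -> bool:
        """Publish the writes only if nothing that was read has changed."""
        with self.mem.commit_lock:
            for key, seen in self.read_versions.items():
                if self.mem.version.get(key, 0) != seen:
                    return False             # conflict: this transaction fails
            for key, value in self.write_buffer.items():
                self.mem.data[key] = value
                self.mem.version[key] = self.mem.version.get(key, 0) + 1
            return True


def demo():
    mem = TxMemory()
    mem.data["x"] = 10
    t1 = Transaction(mem)
    v = t1.read("x")                         # t1 observes x = 10
    t2 = Transaction(mem)
    t2.write("x", t2.read("x") + 1)
    ok2 = t2.commit()                        # t2 commits first: x = 11
    t1.write("x", v + 100)
    ok1 = t1.commit()                        # t1 conflicts and fails
    retried = mem.run(lambda tx: tx.write("x", tx.read("x") + 100))
    return ok2, ok1, retried, mem.data["x"]
```

When `run` returns False the caller is expected to fall back to another mechanism such as a lock, which matches the best-effort wording on the slide.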

42
Multi-core Trends in this Decade
[Timeline figure. Legend: IBM, Intel, AMD, Sun. Chips shown:]
  • UltraSparc T1 (codename Niagara): 8-core processor,
    32 logical threads
  • UltraSparc T2 (codename Niagara): 8-core processor,
    64 logical threads
  • Codename Rock: 16-core processor, 32 logical
    threads
  • AMD Turion64 X2: IA32 x86 dual-core chip
  • AMD Opteron (code name Denmark): IA32 x86 2-core
    chip
  • AMD (code name Barcelona): IA32 x86 native 4-core
    chip
  • Pentium D: IA32 x86 2-core chip
  • Xeon Dual Core: IA32 x86 2-core chip
  • Intel Core Duo: IA32 x86 dual-core chip
  • Intel Core 2 Duo: IA32 x86 2-core chip
  • Intel Core 2 (codenames Penryn, Wolfdale): IA32 x86
    dual and quad-core chip
  • Codename Nehalem: 1 to 8-core chip
  • Codename Sandy Bridge
  • Power 4: 64-bit PowerPC, 2 cores
  • Power5: 64-bit PowerPC, 2 cores with SMT
  • Power 6: 64-bit PowerPC, 2 cores with SMT
  • Power7
  • CBE: PowerPC 9-core chip
  • Xenon: 64-bit PowerPC 3-core chip
43
Sources
  • The Powerful and the Fallen
  • Sinharoy, B. et al., Power5 System
    Microarchitecture, IBM Journal of Research and
    Development, Vol 49, July/September 2005
  • Marr, D. et al., Hyper-Threading Technology
    Architecture and Microarchitecture, Intel
    Technology Journal, Vol 6, Issue 1, 2002
  • The Mutualists
  • The Just Passing
  • Andrews, J. and Baker, N., XBOX 360 System
    Architecture, IEEE Micro, Vol 26, Issue 2,
    March 2006
  • The Olympic Sprinters
  • Le, H.Q. et al., Power6 System Microarchitecture,
    IBM Journal of Research and Development, Vol 51,
    November 2007
  • The Threads Commune
  • Konecny, P., Introducing the Cray XMT, May 5th,
    2007
  • Feo, J., Can programmers and machines ever be
    friends?
  • Breaking the Despotic Rule of the Lock
  • Chaudhry, S., Rock: A SPARC CMT Processor,
    August 26, 2008