Title: Simultaneous Multithreading
1. Simultaneous Multithreading
- CMPE 511
- BOĞAZİÇİ UNIVERSITY
2. AGENDA
- INTRODUCTION
- Motivation
- Types of Parallelism
- Vertical and Horizontal Wasted Slots
- Superscalar Processors
- Multithreading
- Simultaneous Multithreading
- The Idea
- SMT Model
- Issues: What to Fetch and What to Issue? Caching
- Performance Analysis
- Simulation Results
- Comparison
- Drawbacks
- Commercial Examples
- IBM POWER5
- Future Tendencies
3. INTRODUCTION: Motivation
- Microprocessor design optimization: some focus areas
- Memory latency
- Increased processor speeds make memory appear further away
- Longer stalls possible
- Branch processing
- Mispredicts become more costly as pipeline depth increases, resulting in stalls and wasted power
- Predication drives increased power and larger chip area
- Execution unit utilization
- 20-25% execution unit utilization is common
- SMT addresses these areas!
4. INTRODUCTION: Motivation
- Memory subsystem improvement or increased system integration alone is not sufficient for significant performance improvement.
- Solution: increase parallelism in all its available forms
- Combine the multiple-issue-per-cycle features of modern superscalar processors
- With the latency-hiding ability of multithreaded architectures
5. INTRODUCTION: Types of Parallelism
- Bit-level
- Wider processor datapaths (8, 16, 32, 64 bits)
- Word-level (SIMD)
- Vector processors
- Multimedia instruction sets (Intel's MMX and SSE, Sun's VIS, etc.)
- Instruction-level
- Pipelining
- Superscalar
- VLIW and EPIC
- Task and application levels
- Explicit parallel programming
- Multiple threads
- Multiple applications
6. INTRODUCTION: Vertical and Horizontal Waste
- Vertical waste is introduced when the processor issues no instructions in a cycle.
- Horizontal waste is introduced when not all issue slots can be filled in a cycle.
- 61% of the wasted cycles are vertical waste.
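The two kinds of waste can be sketched in Python (an illustration, not from the slides; the issue width and the example trace are assumed values):

```python
# Toy accounting of issue-slot waste on an 8-wide machine.
# `trace` lists, for each cycle, how many instructions actually issued.
ISSUE_WIDTH = 8

def waste(trace):
    """Return (vertical_waste, horizontal_waste) in slots."""
    # A cycle issuing nothing wastes the whole width vertically
    vertical = sum(ISSUE_WIDTH for issued in trace if issued == 0)
    # A partially filled cycle wastes the remaining slots horizontally
    horizontal = sum(ISSUE_WIDTH - issued
                     for issued in trace if 0 < issued < ISSUE_WIDTH)
    return vertical, horizontal

# One full cycle, one stall, two partially filled cycles
v, h = waste([8, 0, 3, 5])  # v == 8, h == 8
```

The stall cycle contributes 8 vertically wasted slots, while the two partial cycles together contribute 5 + 3 = 8 horizontally wasted slots.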
7. INTRODUCTION: Superscalar
- Issues multiple instructions in each cycle, typically 4.
- Several functional units of the same type, e.g. ALUs
- Dispatcher reads instructions and decides which can run in parallel
- Limited by instruction dependencies and long-latency operations
- Effects: horizontal and vertical waste
- Low utilization even with higher-issue machines (8-issue with 20% utilization)
8. INTRODUCTION: Superscalar
- Many slots in the execution core are unused.
9. MULTITHREADING
- The processor is extended with the concept of a thread, allowing the scheduler to choose instructions from one thread or another at each clock.
- Two types of thread scheduling: coarse-grain multithreading and fine-grain multithreading.
- SMT uses both types of multithreading.
10. MULTITHREADING
11. MULTITHREADING
- What does a processor need for multithreading?
- The processor must be aware of several independent states, one per thread:
- Program Counter
- Register File (and Flags)
- Memory
- Either multiple resources in the processor or a fast way to switch across states
12. MULTITHREADING: Coarse-Grain Multithreading
- Switch between threads only on costly stalls
- This form of multithreading hides only long-latency events.
- Easy to implement, but the grain is large
13. MULTITHREADING: Coarse-Grain
14. MULTITHREADING: Fine-Grain Multithreading
- Context-switches between threads on every clock cycle.
- Occupancy of the execution core is now much higher
- Hides both long- and short-latency events
- Vertical waste is eliminated, but horizontal waste is not: if a thread has few or no operations to execute, issue slots will be wasted.
15. MULTITHREADING: Fine-Grain
16. Simultaneous Multithreading: The Idea
- Combine superscalar and multithreading such that:
- Multiple instructions are issued per cycle (superscalar)
- Hardware state is kept for several programs/threads (multithreading)
- So: issue multiple instructions from multiple threads in each cycle
17. Simultaneous Multithreading: The Idea
18. Simultaneous Multithreading: Model
- Extend, replicate, and redesign some units of a superscalar to achieve multithreading
- Resources replicated:
- State for hardware contexts (registers, PCs)
- Per-thread mechanisms for pipeline flushing and subroutine returns
- Per-thread identifiers for the branch target buffer and translation lookaside buffer
19. Simultaneous Multithreading: Model
- Resources redesigned:
- Instruction fetch unit
- Processor pipeline
- Instruction scheduling
- No additional hardware required for:
- Register renaming (same as superscalar)
20. Simultaneous Multithreading Model: Superscalar Architecture
21. Simultaneous Multithreading Model: Block Diagram
22. Simultaneous Multithreading: Model
- Instruction Fetch Unit
- Takes advantage of inter-thread competition by:
- Partitioning bandwidth
- Fetching the threads that give maximum local benefit
- 2.8 fetching
- Fetch 1 instruction block per logical processor, for 2 threads
- Decode 1 thread until a branch or the end of the cache line, then jump to the other
- ICOUNT feedback
- Highest priority to the threads with the fewest instructions in the decode, renaming, and queue pipeline stages
- Small hardware addition to track queue lengths
23. Simultaneous Multithreading: Model
- Register File
- Each thread has 32 architectural registers
- Register file size = 32 x number of threads + rename registers
- So: a larger register file means a longer access time
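The slide's sizing rule can be written out directly. The 8-thread, 32-register figures come from the slides; the 100 rename registers is an assumed value in the spirit of the SMT literature, used only to make the arithmetic concrete:

```python
# Physical register file sizing for an SMT core:
# one architectural set per thread, plus a shared pool of rename registers.
THREADS = 8
ARCH_REGS = 32     # architectural registers per thread (from the slides)
RENAME_REGS = 100  # assumed rename-pool size, for illustration

physical_regs = THREADS * ARCH_REGS + RENAME_REGS  # 356 entries
```

A 356-entry file is far larger than a single-threaded superscalar's, which is why the pipeline must tolerate slower register access (next slide).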
24. Simultaneous Multithreading Model: Pipeline Format
25. Simultaneous Multithreading Model: Pipeline Format
- To avoid an increase in clock cycle time, the SMT pipeline is extended to allow 2-cycle register reads and writes
- 2-cycle reads/writes increase the branch misprediction penalty
26. Simultaneous Multithreading: Where to Fetch
- Static solutions: round-robin
- Each cycle, 8 instructions from 1 thread
- Each cycle, 4 instructions from 2 threads, 2 from 4, ...
- Each cycle, 8 instructions from 2 threads; forward as many as possible from thread 1, then, when a long-latency instruction blocks thread 1, pick the rest from thread 2
- Dynamic solutions: check the execution queues!
- Favour threads with the minimum number of in-flight branches
- Favour threads with the minimum number of outstanding misses
- Favour threads with the minimum number of in-flight instructions
- Favour threads with instructions far from the queue head
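The static round-robin schemes can be sketched as a simple rotation over the threads (an illustration with assumed parameter names, not the slides' hardware):

```python
# Static round-robin fetch partitioning: split the fetch bandwidth evenly
# over `threads_per_cycle` threads, rotating which threads are served.
def round_robin_fetch(num_threads, threads_per_cycle, cycle, bandwidth=8):
    """Return a {thread: fetch_slots} map for this cycle."""
    start = (cycle * threads_per_cycle) % num_threads
    chosen = [(start + i) % num_threads for i in range(threads_per_cycle)]
    share = bandwidth // threads_per_cycle
    return {t: share for t in chosen}

# "4 instructions from 2 threads": cycle 0 serves threads 0 and 1,
# cycle 1 serves threads 2 and 3, and so on.
cycle0 = round_robin_fetch(8, 2, 0)  # {0: 4, 1: 4}
```

Setting `threads_per_cycle` to 1 recovers the "8 instructions from 1 thread" scheme; the dynamic policies replace the fixed rotation with a priority check on the execution queues.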
27. Simultaneous Multithreading: What to Issue
- Not exactly the same as in superscalars
- In a superscalar, oldest is best (least speculation, more dependents waiting, etc.)
- In SMT it is not so clear: branch-speculation level and optimism (cache-hit speculation) vary across threads
- Based on this, the selection strategies:
- Oldest first
- Cache-hit speculated last
- Branch speculated last
- Branches first
- Important result: it doesn't matter too much!
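The four strategies can be modeled as sort keys over the ready instructions (a sketch; the instruction fields `seq`, `branch_spec`, `cache_spec`, and `is_branch` are illustrative names, not from the slides):

```python
# Issue-selection strategies as sort keys. `seq` is a sequence number
# (smaller = older); speculation flags push instructions to the back.
def issue_order(ready, strategy):
    keys = {
        "oldest_first":     lambda i: i["seq"],
        "branch_spec_last": lambda i: (i["branch_spec"], i["seq"]),
        "cache_spec_last":  lambda i: (i["cache_spec"], i["seq"]),
        "branches_first":   lambda i: (not i["is_branch"], i["seq"]),
    }
    return sorted(ready, key=keys[strategy])
```

The empirical result quoted above is that all of these orderings perform about the same, so the cheapest one (oldest first) suffices.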
28. Simultaneous Multithreading: Compiler Optimizations
- Should try to minimize cache interference
- Latency-hiding techniques like speculation should be enhanced
- Sharing optimization techniques from multiprocessors change: data sharing is now good
29. Simultaneous Multithreading: Caching
- The same cache is shared among all threads
- Performance degradation due to cache sharing
- Possibility of cache thrashing
30. PERFORMANCE ANALYSIS
- Four models are selected
- The base machine is 10 FUs, 8-issue
- Fine-Grain Multithreading
- SM:Full Simultaneous Issue: eight threads compete for each of the issue slots each cycle.
- SM:Single Issue, SM:Dual Issue, SM:Four Issue: limit the number of instructions each thread can issue; e.g. if each thread can issue at most 2 instructions per cycle, a minimum of 4 threads is required to fill the 8 issue slots in one cycle.
- SM:Limited Connection: each hardware context is directly connected to exactly one of each type of functional unit.
31. PERFORMANCE ANALYSIS
32. PERFORMANCE ANALYSIS: H/W COMPLEXITY
33. COMPARISON
- SMT vs. Multiprocessing
- Multiprocessing statically assigns functional units to threads
- SMT allows threads to expand, using the available resources
34. COMPARISON
35. DRAWBACKS
- Two main drawbacks:
- Single-thread performance decreases due to the architectural constraints
- Additional contexts increase power consumption
36. Commercial Examples
- Compaq Alpha 21464 (EV8)
- 4T SMT
- Project killed June 2001
- Intel Pentium IV (Xeon)
- 2T SMT
- Availability in 2002 (already present before, but not enabled)
- 10-30% gains expected
- Also called Hyper-Threading
- SUN Ultra IV
- 2-core CMP, 2T SMT
- IBM POWER5
- Dual processor core
- 8-way superscalar
- Simultaneous multithreaded (SMT) core: up to 2 virtual processors per real processor
- 24% area growth per core for SMT
37. Commercial Examples: IBM POWER5
38. Commercial Examples: IBM POWER5
- SMT added to the superscalar microarchitecture
- A second Program Counter (PC) added to share I-fetch bandwidth
- GPR/FPR rename mapper expanded to map the second set of registers (the high-order address bit indicates the thread)
- Completion logic replicated to track two threads
- A thread bit added to most address/tag buses
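The rename-mapper trick described above (thread id as the high-order address bit) can be sketched in a few lines (an illustration; the 5-bit width follows from 32 architectural GPRs per thread, and the function name is invented):

```python
# POWER5-style rename-mapper indexing sketch: both threads share one
# mapper, distinguished by the thread id prepended as the high-order bit.
ARCH_REG_BITS = 5  # 32 architectural GPRs per thread -> 5 address bits

def mapper_index(thread_id, arch_reg):
    """Unique mapper entry for (thread, architectural register)."""
    return (thread_id << ARCH_REG_BITS) | arch_reg

# Thread 0's r3 stays at entry 3; thread 1's r3 lands at entry 32 + 3 = 35
lo = mapper_index(0, 3)  # 3
hi = mapper_index(1, 3)  # 35
```

This doubles the mapper's address space without duplicating the mapper itself, which is why the area cost of SMT stayed modest.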
39. Commercial Examples: IBM POWER5
40. Commercial Examples: IBM POWER5
- Includes:
- Thread priority mechanism: power efficiency, 8 levels
- Dynamic thread switching:
- Used if no task is ready for the second thread to run
- Allocates all machine resources to one thread
- Initiated by SW
41. Commercial Examples: IBM POWER5
- A dormant thread wakes up on:
- An external interrupt
- A decrementer interrupt
- A special instruction from the active thread
42. Future Tendencies
- Simultaneous Redundantly Threaded Processors (SRT)
- Increase reliability with fault detection and correction.
- Run multiple copies of the same program simultaneously
- Software Pre-Execution in SMT
- In some cases the data address is extremely hard to predict
- Prefetching is then useless
- Use an idle SMT thread for pre-execution.
- A complete software solution
- Speculation
- More techniques on speculation
- E.g. Speculative Data-Driven Multithreading, Threaded Multiple-Path Execution, Simultaneous Subordinate Microthreading, and Thread-Level Speculation
43. REFERENCES
- "Simultaneous Multithreading: Maximizing On-Chip Parallelism" by Tullsen, Eggers, and Levy, in ISCA '95.
- "Simultaneous Multithreading: Present Developments and Future Directions" by Miquel Pericàs, June 2003.
- "Simultaneous Multi-threading Implementation in POWER5 -- IBM's Next Generation POWER Microprocessor" by IBM, Aug 2004.
- "Simultaneous Multithreading: A Platform for Next-Generation Processors" by Eggers, Emer, Levy, Lo, Stamm, and Tullsen, in IEEE Micro, October 1997.
44. Q&A