Title: Simultaneous Multithreading
1. Simultaneous Multithreading
- CMPE 511
- BOĞAZİÇİ UNIVERSITY
2. AGENDA
- INTRODUCTION
- Motivation
- Types of Parallelism
- Vertical and Horizontal Wasted Slots
- Superscalar Processors
- Multithreading
- Simultaneous Multithreading
- The Idea
- SMT Model
- Issues: What to Fetch and What to Issue? Caching
- Performance Analysis
- Simulation Results
- Comparison
- Drawbacks
- Commercial Examples
- IBM POWER5
- Future Tendencies
3. INTRODUCTION: Motivation
- Microprocessor design optimization: some focus areas
- Memory latency
- Increased processor speeds make memory appear further away
- Longer stalls possible
- Branch processing
- Mispredicts become more costly as pipeline depth increases, resulting in stalls and wasted power
- Predication drives increased power and larger chip area
- Execution unit utilization
- 20-25% execution unit utilization is common
- SMT addresses these areas!
4. INTRODUCTION: Motivation
- Memory subsystem improvement or increased system integration alone is not sufficient for significant performance improvement.
- Solution: increase parallelism in all its available forms
- Combine the multiple-issue-per-cycle features of modern superscalar processors
- With the latency-hiding ability of multithreaded architectures
5. INTRODUCTION: Types of Parallelism
- Bit-level
- Wider processor datapaths (8, 16, 32, 64 bits)
- Word-level (SIMD)
- Vector processors
- Multimedia instruction sets (Intel's MMX and SSE, Sun's VIS, etc.)
- Instruction-level
- Pipelining
- Superscalar
- VLIW and EPIC
- Task and application levels
- Explicit parallel programming
- Multiple threads
- Multiple applications
6. INTRODUCTION: Vertical and Horizontal Waste
- Vertical waste is introduced when the processor issues no instructions in a cycle.
- Horizontal waste is introduced when not all issue slots can be filled in a cycle.
- 61% of the wasted cycles are vertical waste.
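The two kinds of waste can be sketched in Python (an illustration, not from the slides; the issue width and the example trace are assumed values):

```python
# Toy accounting of issue-slot waste on an 8-wide machine.
# `trace` lists, for each cycle, how many instructions actually issued.
ISSUE_WIDTH = 8

def waste(trace):
    """Return (vertical_waste, horizontal_waste) in slots."""
    # A cycle issuing nothing wastes the whole width vertically
    vertical = sum(ISSUE_WIDTH for issued in trace if issued == 0)
    # A partially filled cycle wastes the remaining slots horizontally
    horizontal = sum(ISSUE_WIDTH - issued
                     for issued in trace if 0 < issued < ISSUE_WIDTH)
    return vertical, horizontal

# One full cycle, one stall, two partially filled cycles
v, h = waste([8, 0, 3, 5])  # v == 8, h == 8
```

The stall cycle contributes 8 vertically wasted slots, while the two partial cycles together contribute 5 + 3 = 8 horizontally wasted slots.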
7. INTRODUCTION: Superscalar
- Issues multiple instructions in each cycle, typically 4.
- Several functional units of the same type, e.g. ALUs
- Dispatcher reads instructions and decides which can run in parallel
- Limited by instruction dependencies and long-latency operations
- Effects: horizontal and vertical waste
- Low utilization even with higher-issue machines (8-issue with 20% utilization)
8. INTRODUCTION: Superscalar
- Many slots in the execution core are unused.
9. MULTITHREADING
- The processor is extended with the concept of a thread, allowing the scheduler to choose instructions from one thread or another at each clock.
- Two types of thread scheduling: coarse-grain multithreading and fine-grain multithreading.
- SMT uses both types of multithreading.
10. MULTITHREADING
11. MULTITHREADING
- What does a processor need for multithreading?
- The processor must be aware of several independent states, one per thread:
- Program Counter
- Register File (and Flags)
- Memory
- Either multiple resources in the processor or a fast way to switch across states
12. MULTITHREADING: Coarse-Grain Multithreading
- Switch between threads only on costly stalls
- This form of multithreading hides only long-latency events.
- Easy to implement, but the grain is large
13. MULTITHREADING: Coarse-Grain
14. MULTITHREADING: Fine-Grain Multithreading
- Context-switches between threads on every clock cycle.
- Occupancy of the execution core is now much higher
- Hides both long- and short-latency events
- Vertical waste is eliminated, but horizontal waste is not: if a thread has few or no operations to execute, issue slots will be wasted.
15. MULTITHREADING: Fine-Grain
16. Simultaneous Multithreading: The Idea
- Combine superscalar and multithreading such that:
- Multiple instructions are issued per cycle (superscalar)
- Hardware state is kept for several programs/threads (multithreading)
- So: issue multiple instructions from multiple threads in each cycle
17. Simultaneous Multithreading: The Idea
18. Simultaneous Multithreading: Model
- Extend, replicate, and redesign some units of a superscalar to achieve multithreading
- Resources replicated:
- State for hardware contexts (registers, PCs)
- Per-thread mechanisms for pipeline flushing and subroutine returns
- Per-thread identifiers for the branch target buffer and translation lookaside buffer
19. Simultaneous Multithreading: Model
- Resources redesigned:
- Instruction fetch unit
- Processor pipeline
- Instruction scheduling
- No additional hardware required for:
- Register renaming (same as superscalar)
20. Simultaneous Multithreading Model: Superscalar Architecture
21. Simultaneous Multithreading Model: Block Diagram
22. Simultaneous Multithreading: Model
- Instruction Fetch Unit
- Takes advantage of inter-thread competition by:
- Partitioning bandwidth
- Fetching the threads that give maximum local benefit
- 2.8 fetching
- Fetch 1 instruction block per logical processor, for 2 threads
- Decode 1 thread until a branch or the end of the cache line, then jump to the other
- ICOUNT feedback
- Highest priority to the threads with the fewest instructions in the decode, renaming, and queue pipeline stages
- Small hardware addition to track queue lengths
23. Simultaneous Multithreading: Model
- Register File
- Each thread has 32 architectural registers
- Register file size = 32 x number of threads + rename registers
- So: a larger register file means a longer access time
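The slide's sizing rule can be written out directly. The 8-thread, 32-register figures come from the slides; the 100 rename registers is an assumed value in the spirit of the SMT literature, used only to make the arithmetic concrete:

```python
# Physical register file sizing for an SMT core:
# one architectural set per thread, plus a shared pool of rename registers.
THREADS = 8
ARCH_REGS = 32     # architectural registers per thread (from the slides)
RENAME_REGS = 100  # assumed rename-pool size, for illustration

physical_regs = THREADS * ARCH_REGS + RENAME_REGS  # 356 entries
```

A 356-entry file is far larger than a single-threaded superscalar's, which is why the pipeline must tolerate slower register access (next slide).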
24. Simultaneous Multithreading Model: Pipeline Format
25. Simultaneous Multithreading Model: Pipeline Format
- To avoid an increase in clock cycle time, the SMT pipeline is extended to allow 2-cycle register reads and writes
- 2-cycle reads/writes increase the branch misprediction penalty
26. Simultaneous Multithreading: Where to Fetch
- Static solutions: round-robin
- Each cycle, 8 instructions from 1 thread
- Each cycle, 4 instructions from 2 threads, 2 from 4, ...
- Each cycle, 8 instructions from 2 threads; forward as many as possible from thread 1, then, when a long-latency instruction blocks thread 1, pick the rest from thread 2
- Dynamic solutions: check the execution queues!
- Favour threads with the minimum number of in-flight branches
- Favour threads with the minimum number of outstanding misses
- Favour threads with the minimum number of in-flight instructions
- Favour threads with instructions far from the queue head
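The static round-robin schemes can be sketched as a simple rotation over the threads (an illustration with assumed parameter names, not the slides' hardware):

```python
# Static round-robin fetch partitioning: split the fetch bandwidth evenly
# over `threads_per_cycle` threads, rotating which threads are served.
def round_robin_fetch(num_threads, threads_per_cycle, cycle, bandwidth=8):
    """Return a {thread: fetch_slots} map for this cycle."""
    start = (cycle * threads_per_cycle) % num_threads
    chosen = [(start + i) % num_threads for i in range(threads_per_cycle)]
    share = bandwidth // threads_per_cycle
    return {t: share for t in chosen}

# "4 instructions from 2 threads": cycle 0 serves threads 0 and 1,
# cycle 1 serves threads 2 and 3, and so on.
cycle0 = round_robin_fetch(8, 2, 0)  # {0: 4, 1: 4}
```

Setting `threads_per_cycle` to 1 recovers the "8 instructions from 1 thread" scheme; the dynamic policies replace the fixed rotation with a priority check on the execution queues.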
27. Simultaneous Multithreading: What to Issue
- Not exactly the same as in superscalars
- In a superscalar, oldest is best (least speculation, more dependents waiting, etc.)
- In SMT it is not so clear: branch-speculation level and optimism (cache-hit speculation) vary across threads
- Based on this, the selection strategies:
- Oldest first
- Cache-hit speculated last
- Branch speculated last
- Branches first
- Important result: it doesn't matter too much!
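The four strategies can be modeled as sort keys over the ready instructions (a sketch; the instruction fields `seq`, `branch_spec`, `cache_spec`, and `is_branch` are illustrative names, not from the slides):

```python
# Issue-selection strategies as sort keys. `seq` is a sequence number
# (smaller = older); speculation flags push instructions to the back.
def issue_order(ready, strategy):
    keys = {
        "oldest_first":     lambda i: i["seq"],
        "branch_spec_last": lambda i: (i["branch_spec"], i["seq"]),
        "cache_spec_last":  lambda i: (i["cache_spec"], i["seq"]),
        "branches_first":   lambda i: (not i["is_branch"], i["seq"]),
    }
    return sorted(ready, key=keys[strategy])
```

The empirical result quoted above is that all of these orderings perform about the same, so the cheapest one (oldest first) suffices.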
28. Simultaneous Multithreading: Compiler Optimizations
- Should try to minimize cache interference
- Latency-hiding techniques like speculation should be enhanced
- Sharing optimization techniques from multiprocessors change: data sharing is now good
29. Simultaneous Multithreading: Caching
- The same cache is shared among all threads
- Performance degradation due to cache sharing
- Possibility of cache thrashing
30. PERFORMANCE ANALYSIS
- Four models are selected
- The base machine is 10 FUs, 8-issue
- Fine-Grain Multithreading
- SM:Full Simultaneous Issue: eight threads compete for each of the issue slots each cycle.
- SM:Single Issue, SM:Dual Issue, SM:Four Issue: limit the number of instructions each thread can issue; e.g. if each thread can issue at most 2 instructions per cycle, a minimum of 4 threads is required to fill the 8 issue slots in one cycle.
- SM:Limited Connection: each hardware context is directly connected to exactly one of each type of functional unit.
31. PERFORMANCE ANALYSIS
32. PERFORMANCE ANALYSIS: H/W COMPLEXITY
33. COMPARISON
- SMT vs. Multiprocessing
- Multiprocessing statically assigns functional units to threads
- SMT allows threads to expand, using the available resources
34. COMPARISON
35. DRAWBACKS
- Two main drawbacks:
- Single-thread performance decreases due to the architectural constraints
- Additional contexts increase power consumption
36. Commercial Examples
- Compaq Alpha 21464 (EV8)
- 4T SMT
- Project killed June 2001
- Intel Pentium IV (Xeon)
- 2T SMT
- Availability in 2002 (already present before, but not enabled)
- 10-30% gains expected
- Also called Hyper-Threading
- SUN Ultra IV
- 2-core CMP, 2T SMT
- IBM POWER5
- Dual processor core
- 8-way superscalar
- Simultaneous multithreaded (SMT) core: up to 2 virtual processors per real processor
- 24% area growth per core for SMT
37. Commercial Examples: IBM POWER5
38. Commercial Examples: IBM POWER5
- SMT added to the superscalar microarchitecture
- A second Program Counter (PC) added to share I-fetch bandwidth
- GPR/FPR rename mapper expanded to map the second set of registers (the high-order address bit indicates the thread)
- Completion logic replicated to track two threads
- A thread bit added to most address/tag buses
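The rename-mapper trick described above (thread id as the high-order address bit) can be sketched in a few lines (an illustration; the 5-bit width follows from 32 architectural GPRs per thread, and the function name is invented):

```python
# POWER5-style rename-mapper indexing sketch: both threads share one
# mapper, distinguished by the thread id prepended as the high-order bit.
ARCH_REG_BITS = 5  # 32 architectural GPRs per thread -> 5 address bits

def mapper_index(thread_id, arch_reg):
    """Unique mapper entry for (thread, architectural register)."""
    return (thread_id << ARCH_REG_BITS) | arch_reg

# Thread 0's r3 stays at entry 3; thread 1's r3 lands at entry 32 + 3 = 35
lo = mapper_index(0, 3)  # 3
hi = mapper_index(1, 3)  # 35
```

This doubles the mapper's address space without duplicating the mapper itself, which is why the area cost of SMT stayed modest.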
39. Commercial Examples: IBM POWER5
40. Commercial Examples: IBM POWER5
- Includes:
- Thread priority mechanism: power efficiency, 8 levels
- Dynamic thread switching:
- Used if no task is ready for the second thread to run
- Allocates all machine resources to one thread
- Initiated by SW
41. Commercial Examples: IBM POWER5
- A dormant thread wakes up on:
- An external interrupt
- A decrementer interrupt
- A special instruction from the active thread
42. Future Tendencies
- Simultaneous Redundantly Threaded Processors (SRT)
- Increase reliability with fault detection and correction.
- Run multiple copies of the same program simultaneously
- Software Pre-Execution in SMT
- In some cases the data address is extremely hard to predict
- Prefetching is then useless
- Use an idle SMT thread for pre-execution.
- A complete software solution
- Speculation
- More techniques on speculation
- E.g. Speculative Data-Driven Multithreading, Threaded Multiple-Path Execution, Simultaneous Subordinate Microthreading, and Thread-Level Speculation
43. REFERENCES
- "Simultaneous Multithreading: Maximizing On-Chip Parallelism" by Tullsen, Eggers, and Levy, in ISCA '95.
- "Simultaneous Multithreading: Present Developments and Future Directions" by Miquel Pericàs, June 2003.
- "Simultaneous Multi-threading Implementation in POWER5 -- IBM's Next Generation POWER Microprocessor" by IBM, Aug 2004.
- "Simultaneous Multithreading: A Platform for Next-Generation Processors" by Eggers, Emer, Levy, Lo, Stamm, and Tullsen, in IEEE Micro, October 1997.
44. Q&A