Simultaneous Multithreading (PowerPoint presentation transcript)
1
Simultaneous Multithreading
  • CMPE 511
  • BOĞAZİÇİ UNIVERSITY

2
AGENDA
  • INTRODUCTION
  • Motivation
  • Types of Parallelism
  • Vertical and Horizontal Wasted Slots
  • Superscalar Processors
  • Multithreading
  • Simultaneous Multithreading
  • The Idea
  • SMT Model
  • Issues: What to Fetch? What to Issue? Caching
  • Performance Analysis
  • Simulation Results
  • Comparison
  • Drawbacks
  • Commercial Examples
  • IBM POWER5
  • Future Tendencies

3
INTRODUCTION: Motivation
  • Microprocessor design optimization: some focus areas
  • Memory latency
  • Increased processor speeds make memory appear further away
  • Longer stalls possible
  • Branch processing
  • Mispredicts become more costly as pipeline depth increases, resulting in stalls and wasted power
  • Predication drives increased power and larger chip area
  • Execution unit utilization
  • 20-25% execution unit utilization is common
  • SMT addresses these areas!

4
INTRODUCTION: Motivation
  • Memory subsystem improvement or increased system integration alone is not sufficient for significant performance improvement.
  • Solution: increase parallelism in all its available forms
  • Combine the multiple-issue-per-cycle capability of modern superscalar processors
  • with the latency-hiding ability of multithreaded architectures

5
INTRODUCTION: Types of Parallelism
  • Bit-level
  • Wider processor datapaths (8,16,32,64)
  • Word-level (SIMD)
  • Vector processors
  • Multimedia instruction sets (Intel's MMX and SSE, Sun's VIS, etc.)
  • Instruction-level
  • Pipelining
  • Superscalar
  • VLIW and EPIC
  • Task and Application-levels
  • Explicit parallel programming
  • Multiple threads
  • Multiple applications

6
INTRODUCTION: Vertical and Horizontal Wasted Slots
  • Vertical waste is introduced when the processor
    issues no instructions in a cycle
  • Horizontal waste is introduced when not all issue
    slots can be filled in a cycle.
  • 61% of the wasted cycles are vertical waste.
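The two waste categories above can be made concrete with a small sketch (Python; the function name is illustrative): empty issue cycles count as vertical waste, while partially filled cycles contribute horizontal waste.

```python
def classify_waste(issued_per_cycle, issue_width):
    """Split unused issue slots into vertical waste (whole cycles
    with nothing issued) and horizontal waste (unfilled slots in
    cycles that did issue something)."""
    vertical = horizontal = 0
    for issued in issued_per_cycle:
        if issued == 0:
            vertical += issue_width             # entire cycle lost
        else:
            horizontal += issue_width - issued  # leftover slots lost
    return vertical, horizontal

# A 4-wide core over 4 cycles issuing 0, 2, 4, 1 instructions:
v, h = classify_waste([0, 2, 4, 1], issue_width=4)  # v = 4, h = 5
```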

7
INTRODUCTION: Superscalar
  • Issues multiple instructions in each cycle, typically 4.
  • Several functional units of the same type, e.g. ALUs
  • Dispatcher reads instructions and decides which can run in parallel
  • Limited by instruction dependencies and long-latency operations
  • Effects: horizontal and vertical waste
  • Low utilization even with higher-issue machines: an 8-issue machine sees roughly 20% utilization

8
INTRODUCTION: Superscalar
  • Many slots in the execution core are unused.

9
MULTITHREADING
  • The processor is extended with the concept of a thread, allowing the scheduler to choose instructions from one thread or another at each clock.
  • Two types of thread scheduling: coarse-grain multithreading and fine-grain multithreading.
  • SMT uses both types of multithreading

10
MULTITHREADING
11
MULTITHREADING
  • What does a processor need for multithreading?
  • The processor must be aware of several independent states, one per thread
  • Program Counter
  • Register File (and Flags)
  • Memory
  • Either multiple resources in the processor or a
    fast way to switch across states
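The replicated per-thread state listed above can be sketched as a small structure (Python; field names and the 4-thread count are illustrative, not from any specific design):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ThreadContext:
    """Per-thread architectural state a multithreaded core must track."""
    pc: int = 0                                                # program counter
    regs: List[int] = field(default_factory=lambda: [0] * 32)  # register file
    flags: int = 0                                             # condition flags

# One independent hardware context per thread, e.g. a 4-thread core
# (start addresses are made up for illustration):
contexts = [ThreadContext(pc=0x1000 * t) for t in range(4)]
```

Hardware either replicates these resources outright or provides a fast way to switch between saved copies of them.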

12
MULTITHREADING: Coarse-Grain Multithreading
  • Switch between threads only on costly stalls
  • This form of multithreading only hides long-latency events.
  • Easy to implement, but has large grains

13
MULTITHREADING: Coarse-Grain
14
MULTITHREADING: Fine-Grain Multithreading
  • Context-switch between threads on every clock cycle.
  • Occupancy of the execution core is now much higher
  • Hides both long- and short-latency events
  • Vertical waste is eliminated, but horizontal waste is not: if a thread has few or no operations to execute, issue slots will be wasted.
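The cycle-by-cycle thread switching can be sketched as a simple round-robin schedule (Python; the function name is illustrative, and real designs would also skip stalled threads):

```python
def fine_grain_schedule(n_threads: int, n_cycles: int) -> list:
    """Fine-grain multithreading: a different thread owns the
    pipeline on every clock cycle, in round-robin order."""
    return [cycle % n_threads for cycle in range(n_cycles)]

# 4 threads over 8 cycles: [0, 1, 2, 3, 0, 1, 2, 3]
schedule = fine_grain_schedule(4, 8)
```

Because only one thread issues in any cycle, this removes vertical waste but leaves horizontal waste untouched, matching the bullet above.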

15
MULTITHREADING: Fine-Grain
16
Simultaneous Multithreading: The Idea
  • Combine superscalar and multithreading such that
  • multiple instructions issue per cycle (superscalar)
  • hardware state exists for several programs/threads (multithreading)
  • So: issue multiple instructions from multiple threads in each cycle

17
Simultaneous Multithreading: The Idea
18
Simultaneous Multithreading: Model
  • Extend, replicate, and redesign some units of a superscalar to achieve multithreading
  • Resources replicated:
  • State for hardware contexts (registers, PCs)
  • Per-thread mechanisms for pipeline flushing and subroutine returns
  • Per-thread identifiers for the branch target buffer and translation lookaside buffer

19
Simultaneous Multithreading: Model
  • Resources redesigned:
  • Instruction fetch unit
  • Processor pipeline
  • Instruction scheduling
  • Does not require additional hardware
  • Register renaming (same as superscalar)

20
Simultaneous Multithreading Model: Superscalar Architecture
21
Simultaneous Multithreading Model: Block Diagram
22
Simultaneous Multithreading Model
  • Instruction Fetch Unit
  • Takes advantage of inter-thread competition
  • Partitioning bandwidth
  • Fetching threads that give maximum local benefit
  • 2.8 fetching
  • Fetch 1 inst. per logical processor, for 2
    threads
  • Decode 1 thread till branch/end of cache line,
    then jump to the other
  • ICOUNT feedback
  • Highest priority to the threads with the fewest instructions in the decode, renaming, and queue pipeline stages
  • Small hardware addition to track queue lengths
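The ICOUNT heuristic described above can be sketched as follows (Python; `icount_pick` is a hypothetical name, and real hardware tracks the counts with dedicated counters rather than a list):

```python
def icount_pick(in_flight_counts, n=2):
    """ICOUNT fetch policy: fetch for the n threads that currently
    hold the fewest instructions in the decode, rename, and queue
    stages (fewer in-flight instructions -> higher fetch priority)."""
    ranked = sorted(range(len(in_flight_counts)),
                    key=lambda t: in_flight_counts[t])
    return ranked[:n]

# Threads 0..3 hold 12, 3, 7, 5 instructions in the front end;
# fetch this cycle from threads 1 and 3.
chosen = icount_pick([12, 3, 7, 5])  # -> [1, 3]
```

Favouring lightly loaded threads keeps the front-end queues balanced and avoids one stalled thread clogging the shared pipeline.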

23
Simultaneous Multithreading Model
  • Register File
  • Each thread has 32 architected registers
  • Register file size = 32 × (number of threads) + rename registers
  • So a larger register file means a longer access time
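A worked example of this sizing (the 8-thread and 100-rename-register figures are assumptions for illustration, chosen to match the configuration commonly studied in the SMT literature):

```python
ARCH_REGS_PER_THREAD = 32   # architected registers each thread sees
N_THREADS = 8               # hardware contexts (assumed)
RENAME_REGS = 100           # rename registers (assumed figure)

# Total physical registers the shared register file must hold:
physical_regs = ARCH_REGS_PER_THREAD * N_THREADS + RENAME_REGS
print(physical_regs)  # 356
```

A 356-entry file is far larger than a single-threaded superscalar's, which is why the access time concern arises.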

24
Simultaneous Multithreading Model: Pipeline Format
  • Superscalar
  • SMT

25
Simultaneous Multithreading Model: Pipeline Format
  • To avoid an increase in clock cycle time, the SMT pipeline is extended to allow 2-cycle register reads and writes
  • 2-cycle reads/writes increase the branch misprediction penalty

26
Simultaneous Multithreading: Where to Fetch
  • Static solutions (round-robin):
  • Each cycle, 8 instructions from 1 thread
  • Each cycle, 4 instructions from each of 2 threads, 2 from each of 4, ...
  • Each cycle, 8 instructions from 2 threads: forward as many as possible from the first thread, then, on a long-latency instruction in the first, pick the rest from the second
  • Dynamic solutions: check the execution queues!
  • Favour threads with the fewest in-flight branches
  • Favour threads with the fewest outstanding misses
  • Favour threads with the fewest in-flight instructions
  • Favour threads with instructions far from the queue head

27
Simultaneous Multithreading: What to Issue
  • Not exactly the same as in superscalars
  • In a superscalar, oldest is best (least speculation, more dependants waiting, etc.)
  • In SMT it is not so clear: branch-speculation level and optimism (cache-hit speculation) vary across threads
  • Selection strategies based on this:
  • Oldest first
  • Cache-hit speculated last
  • Branch speculated last
  • Branches first
  • Important result: the choice doesn't matter much!

28
Simultaneous Multithreading: Compiler Optimizations
  • Should try to minimize cache interference
  • Latency-hiding techniques like speculation should be enhanced
  • Sharing optimization techniques from multiprocessors change: data sharing is now good

29
Simultaneous Multithreading: Caching
  • Same cache shared among threads
  • Performance degradation due to cache sharing
  • Possibility of cache thrashing

30
PERFORMANCE ANALYSIS
  • Four models are selected
  • Base machine: 10 FUs, 8-issue
  • Fine-Grain Multithreading
  • SM: Full Simultaneous Issue. Eight threads compete for each of the issue slots each cycle.
  • SM: Single Issue, SM: Dual Issue, SM: Four Issue. These limit the number of instructions each thread can issue per cycle; e.g. if each thread can issue at most 2 instructions per cycle, a minimum of 4 threads is required to fill the 8 issue slots.
  • SM: Limited Connection. Each hardware context is directly connected to exactly one of each type of functional unit.
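The per-thread issue cap in the SM: Dual Issue model can be sketched like this (Python; the function name and the ready-list representation are illustrative):

```python
def issue_with_cap(ready, issue_width=8, per_thread_cap=2):
    """Fill issue slots from ready instructions in priority order,
    limiting each thread to per_thread_cap instructions per cycle
    (the SM: Dual Issue model when per_thread_cap == 2).

    ready: list of (thread_id, instruction) pairs, highest priority first.
    """
    issued, per_thread = [], {}
    for tid, inst in ready:
        if len(issued) == issue_width:
            break                      # all issue slots filled
        if per_thread.get(tid, 0) < per_thread_cap:
            issued.append((tid, inst)) # take this instruction
            per_thread[tid] = per_thread.get(tid, 0) + 1
    return issued

# 4 threads with 3 ready instructions each: the cap lets every
# thread contribute 2, exactly filling the 8 issue slots.
ready = [(t, i) for t in range(4) for i in range(3)]
slots = issue_with_cap(ready)
```

This shows why at least 4 threads are needed to fill an 8-wide machine under a 2-per-thread cap.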

31
PERFORMANCE ANALYSIS
32
PERFORMANCE ANALYSIS H/W COMPLEXITY
33
COMPARISON
  • SMT vs. Multiprocessing
  • Multiprocessing statically assigns functional units to threads
  • SMT allows threads to expand, using available resources

34
COMPARISON
35
DRAWBACKS
  • Two main drawbacks
  • Single-thread performance decreases due to architectural constraints
  • Additional contexts will increase power
    consumption

36
Commercial Examples
  • Compaq Alpha 21464 (EV8)
  • 4T SMT
  • Project killed June 2001
  • Intel Pentium IV (Xeon)
  • 2T SMT
  • Availability in 2002 (already there before, but
    not enabled)
  • 10-30% gains expected
  • Also called Hyper-Threading
  • SUN Ultra IV
  • 2-core CMP, 2T SMT
  • IBM POWER5
  • Dual processor core
  • 8-way superscalar
  • Simultaneous multithreaded (SMT) core: up to 2 virtual processors per real processor
  • 24% area growth per core for SMT

37
Commercial Examples: IBM POWER5
38
Commercial Examples: IBM POWER5
  • SMT added to Superscalar Micro-architecture
  • Second Program Counter (PC) added to share
    I-fetch bandwidth
  • GPR/FPR rename mapper expanded to map second set
    of registers (High order address bit indicates
    thread)
  • Completion logic replicated to track two threads
  • Thread bit added to most address/tag buses

39
Commercial Examples: IBM POWER5
40
Commercial Examples: IBM POWER5
  • Includes:
  • Thread priority mechanism: power efficiency, 8 levels
  • Dynamic thread switching
  • Used if no task is ready for the second thread to run
  • Allocates all machine resources to one thread
  • Initiated by SW

41
Commercial Examples: IBM POWER5
  • Dormant thread wakes up on
  • External interrupt
  • Decrementer interrupt
  • Special instruction from active thread

42
Future Tendencies
  • Simultaneous Redundantly Threaded Processors (SRT)
  • Increase reliability with fault detection and correction.
  • Run multiple copies of the same program simultaneously
  • Software Pre-Execution in SMT
  • In some cases the data address is extremely hard to predict.
  • Prefetching is useless
  • Use an idle thread of the SMT for pre-execution.
  • A complete software solution
  • Speculation
  • More techniques on speculation
  • e.g. Speculative Data-Driven Multithreading, Threaded Multiple Path Execution, Simultaneous Subordinate Microthreading, and Thread-Level Speculation

43
REFERENCES
  • "Simultaneous Multithreading: Maximizing On-Chip Parallelism" by Tullsen, Eggers, and Levy, ISCA 1995.
  • "Simultaneous Multithreading: Present Developments and Future Directions" by Miquel Peric, June 2003.
  • "Simultaneous Multi-threading Implementation in POWER5 -- IBM's Next Generation POWER Microprocessor" by IBM, Aug 2004.
  • "Simultaneous Multithreading: A Platform for Next-Generation Processors" by Eggers, Emer, Levy, Lo, Stamm, and Tullsen, IEEE Micro, October 1997.

44
Q&A
  • THANKS!