Title: 18-747 Lecture 21: Multithreading Processors
1. 18-747 Lecture 21: Multithreading Processors
- James C. Hoe
- Dept of ECE, CMU
- November 14, 2001
- Reading Assignments: paper below
- Announcements: Project 3 Short Proposal due Monday, November 19
- Handouts: "Simultaneous Multithreading: A Platform for Next-Generation Processors," Eggers, et al., IEEE Micro
2. Remaining Lectures
- 11/19 L22 Binary Translation and Optimization
- 11/26 L23 SMP Cache Coherence
- 11/28 L24 Exam Review and Course at a Glance
- Guest Lecture: Low Power Processor Design by Prof. D. Marculescu
- 12/3 Exam 2
- 12/5 Recitation by Aaron
- 12/10 Class Presentations
3. Instruction-Level Parallelism
- When executing a program, how many independent operations can be performed in parallel?
- How to take advantage of ILP
  - Pipelining (including superpipelining)
    - overlap different stages from different instructions
    - limited by the divisibility of an instruction and by ILP
  - Superscalar (including VLIW)
    - overlap processing of different instructions in all stages
    - limited by ILP
- How to increase ILP
  - dynamic/static register renaming → reduces WAW and WAR hazards
  - dynamic/static instruction scheduling → reduces RAW hazards
  - use predictions to optimistically break dependences
4. Thread-Level Parallelism
- The average processor actually executes several programs (a.k.a. processes, threads of control, etc.) at the same time by time multiplexing
- The instructions from these different threads have lots of parallelism
- Taking advantage of thread-level parallelism, i.e. by concurrent execution, can improve the overall throughput of the processor (but not the turn-around time of any one thread)
- Basic assumption: the processor has idle resources when running only one thread at a time
5. Multiprocessing
- Time-multiplexed multiprocessing on uniprocessors started back in 1962
- Even concurrent execution by time-multiplexing improves throughput. How?
  - a single thread would effectively idle the processor when spin-waiting for I/O to complete, e.g. disk, keyboard, mouse, etc.
  - it can spin for thousands to millions of cycles at a time
- A thread should just go to sleep when waiting on I/O and let other threads use the processor, a.k.a. context switch
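The contrast in a minimal C sketch (assumed code, not from the lecture; the device-completion flag io_done stands in for real device status, and a real OS would block the thread on a wait queue rather than busy-yield):

    #include <sched.h>      /* sched_yield() */
    #include <stdbool.h>

    volatile bool io_done = false;  /* hypothetical device-completion flag */

    /* Spin-waiting: this thread holds the processor hostage, burning
       thousands to millions of cycles until the I/O completes. */
    void wait_by_spinning(void) {
        while (!io_done)
            ;  /* no useful work gets done on this processor */
    }

    /* Sleeping: let another ready thread use the processor instead.
       A real OS would park this thread and wake it from the I/O
       completion interrupt; yielding here approximates that. */
    void wait_by_yielding(void) {
        while (!io_done)
            sched_yield();  /* context switch to another thread */
    }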
6. Context Switch
- A context is all of the processor (plus machine) state associated with a particular process
  - programmer-visible state: program counter, register file contents, memory contents
  - and some invisible state: control and status registers, page table base pointers, page tables
  - What about cache (virtual vs. physical), BTB and TLB entries?
- Classic context switching
  - a timer interrupt stops a program mid-execution (precisely)
  - the OS saves away the context of the stopped thread
  - the OS restores the context of a previously stopped thread (all except the PC)
  - the OS uses a return-from-exception to jump to the restarting PC
- The restored thread has no idea it was interrupted, removed, later restored and restarted
7. Saving and Restoring Context
- Saving
  - Context information that occupies unique resources must be copied and saved to a special memory region belonging exclusively to the OS, e.g. program counter, reg file contents, cntrl/status reg
  - Context information that occupies commodity resources just needs to be hidden from the other threads, e.g. active memory pages can be left in physical memory, but their page translations must be removed (yet remembered)
- Restoring is the opposite of saving
- The act of saving and restoring is performed by the OS in software
  → can take a few hundred cycles per switch, but the cost is amortized over the execution quantum
(If you want the full story, take a real OS course!)
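For concreteness, a rough C sketch of the unique-resource state and the software save/restore (an assumed layout for a generic 32-GPR machine, not the lecture's; read_gpr/write_gpr are hypothetical privileged accessors):

    #include <stdint.h>

    #define NUM_GPRS 32  /* assumed 32-GPR RISC machine */

    /* Unique-resource state that must be physically copied on a switch;
       commodity state (memory pages) stays put and only gets hidden. */
    struct saved_ctx {
        uint64_t pc;             /* program counter */
        uint64_t gpr[NUM_GPRS];  /* register file contents */
        uint64_t status;         /* control/status register */
        uint64_t pt_base;        /* page table base pointer */
    };

    /* Hypothetical privileged access to the live processor state. */
    extern uint64_t read_gpr(int i);
    extern void     write_gpr(int i, uint64_t v);

    void save_context(struct saved_ctx *c, uint64_t interrupted_pc) {
        c->pc = interrupted_pc;             /* captured by the trap hardware */
        for (int i = 0; i < NUM_GPRS; i++)  /* >= 64 cycles for GPRs alone */
            c->gpr[i] = read_gpr(i);
        /* status and pt_base are copied out the same way */
    }

    void restore_context(const struct saved_ctx *c) {
        for (int i = 0; i < NUM_GPRS; i++)
            write_gpr(i, c->gpr[i]);
        /* a return-from-exception then jumps to c->pc, so the thread
           never notices it was stopped */
    }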
8. Fast Context Switches
- A processor becomes idle when a thread runs into a cache miss
- Why not switch to another thread?
  - a cache miss lasts only tens of cycles, but it costs the OS at least 64 cycles just to save and restore the 32 GPRs
- Solution: fast context switch in hardware
  - replicate the hardware context registers (PC, GPRs, cntrl/status, PT base ptr) → eliminates copying
  - allow multiple contexts to share some resources, i.e. include the process ID in the cache, BTB and TLB match tags → eliminates cold starts
  - a hardware context switch takes only a few cycles:
    - set the PID register to the next process ID
    - select the corresponding set of hardware context registers to be active
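A toy C model of the mechanism (the context count, field names, and tlb_match are illustrative assumptions): "switching" is just repointing the active-context selector, and PID-tagged lookups hide the other threads' cache/TLB entries instead of flushing them:

    #include <stdint.h>

    #define NUM_CTX 4  /* assumed number of hardware contexts */

    struct hw_ctx {                     /* replicated context registers */
        uint64_t pc, status, pt_base;
        uint64_t gpr[32];
    };

    static struct hw_ctx ctx[NUM_CTX];
    static int pid;                     /* selects the active context */

    /* The entire "context switch": a few cycles, no copying. */
    void fast_switch(int next_pid) {
        pid = next_pid;
    }

    /* Register accesses go through the active set... */
    uint64_t gpr_read(int i) { return ctx[pid].gpr[i]; }

    /* ...and cache/TLB entries match on (PID, tag), so thread A's
       entries stay valid but invisible while thread B runs. */
    struct tlb_entry { int pid; uint64_t vpn, ppn; };
    int tlb_match(const struct tlb_entry *e, uint64_t vpn) {
        return e->pid == pid && e->vpn == vpn;
    }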
9. Example: MIT's Sparcle Processor
- Based on the SUN SPARC II processor
  - provided hardware contexts for 4 threads, one reserved for the interrupt handlers
  - hijacked SPARC II's register windowing mechanism to support fast switching between 4 sets of 32 GPRs
  - switches context in 4 cycles
- Used in a cache-coherent distributed shared memory machine
  - on a cache miss to remote memory (which takes hundreds of cycles to satisfy), the processor automatically switches to a different user thread
  - the network interface can interrupt the processor to wake up the message handler thread to handle communication
10. Really Fast Context Switches
- When a pipelined processor stalls due to a RAW dependence between instructions, the execution stage is idling
- Why not switch to another thread?
  - not only do you need hardware contexts, switching between contexts must be instantaneous to have any advantage!!
- If this can be done,
  - you don't need complicated forwarding logic to avoid stalls
  - RAW dependences and long-latency operations (multiply, cache misses) do not cause throughput performance loss
- Multithreading is a latency hiding technique
11. Fine-grain Multithreading
- Suppose instruction processing can be divided into several stages, but some stages have very long latency; then either
  - run the pipeline at the speed of the slowest stage, or
  - superpipeline the longer stages, but then back-to-back dependences cannot be forwarded
[Pipeline diagram: in the superpipelined machine, Inst0 and Inst1 flow through split stages Fa Fb Da Db Ea Eb Wa Wb over cycles t0-t4, so a dependent Inst1 immediately behind Inst0 cannot get its operands forwarded in time. In the 2-way multithreaded superpipelined machine, the same slots are filled by alternating threads (InstT1-0, InstT2-x, InstT1-1, InstT2-y, ...), so back-to-back slots never hold dependent instructions from the same thread.]
12. Examples: Instruction Latency Hiding
- Using the previous scheme, the MIT Monsoon pipeline cycles through 8 statically scheduled threads to hide its 8-cycle (pipelined) memory access latency
- HEP and Tera MTA [B. Smith]
  - on every cycle, dynamically select a ready thread (i.e. one whose last instruction has finished) from a pool of up to 128 threads
  - worst-case instruction latency is 128 cycles (may need 128 threads!!)
  - a thread can be woken early (i.e. before its last instruction finishes) using software hints to indicate no data dependence
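What the per-cycle selection amounts to, as a small C sketch (the pool size matches the MTA's 128 threads, but the round-robin policy and names are illustrative assumptions):

    #include <stdbool.h>

    #define NUM_THREADS 128  /* assumed pool size, as in the Tera MTA */

    /* ready[t] is set when thread t's last instruction has finished,
       or early via a software no-dependence hint. */
    static bool ready[NUM_THREADS];
    static int last;

    /* Called once per cycle: round-robin scan for a ready thread.
       Returns the thread to issue from, or -1 if every thread is
       still waiting on an outstanding instruction. */
    int select_thread(void) {
        for (int i = 1; i <= NUM_THREADS; i++) {
            int t = (last + i) % NUM_THREADS;
            if (ready[t]) {
                ready[t] = false;  /* busy until its instruction completes */
                last = t;
                return t;
            }
        }
        return -1;
    }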
13. Really, Really Fast Context Switches
- A superscalar processor datapath must be over-resourced
  - it has more functional units than the ILP can use because the units are not universal
  - current 4- to 8-way designs only achieve an IPC of 2 to 3
- Some units must be idling in each cycle
- Why not switch to another thread?
14. Simultaneous Multi-Threading [Eggers, et al.]
[Datapath diagram: replicated per-thread front ends (Context A through Context Z, each with its own Fetch Unit, OOO Dispatch and Reorder Buffer) all issue into one shared pool of functional units (ALU1, ALU2, an unpipelined 16-cycle Fdiv, ...).]
- Dynamic and flexible sharing of functional units between multiple threads
  → increases utilization → increases throughput
15. Compaq Alpha EV8
- Technology
  - 1.2 to 2.0 GHz
  - 250 million transistors (mostly in the caches)
  - 0.125 um CMOS with copper
  - 1.2 V Vdd
  - 1100 signal pins (flip chip)
  - probably about that many power and ground pins
- Architecture
  - 8-wide superscalar with support for 4-way SMT
  - supports both ILP and thread-level parallelism
  - on-chip router and directory support for building glueless 512-way ccNUMA SMP
[Joel Emer's Microprocessor Forum]
16. EV8: Superscalar to SMT
- In SMT mode, it is as if there are 4 processors on a chip that share their caches and TLB
- Replicated hardware contexts
  - program counter
  - architected registers (actually just the renaming table, since architected registers and rename registers come from the same physical pool)
- Shared resources
  - rename register pool (larger than needed by 1 thread)
  - instruction queue
  - caches
  - TLB
  - branch predictors
- The dynamic superscalar execution pipeline is more or less unchanged
17. SMT Issues
- Adding SMT to a superscalar
  - single-thread performance is slightly worse due to overhead (longer pipeline, longer combinational delays)
  - over-utilization of shared resources
    - contention for instruction and data memory bandwidth
    - interference in caches, TLB and BTBs
  - but remember, multithreading can hide some of the penalties; for a given design point, SMT should be more efficient than a plain superscalar if thread-level parallelism is available
- High-degree SMT faces similar scalability problems as superscalars
  - needs numerous I-cache and D-cache ports
  - needs numerous register file read and write ports
  - the dynamic renaming and reordering logic is no simpler
18. Speculative Multithreading
- SMT can justify a wider-than-ILP datapath
  - but the datapath is then only fully utilized by multiple threads
- How to make a single-threaded program run faster?
  - think about predication
- What to do with spare resources?
  - execute both sides of hard-to-predict branches
  - send another thread to scout ahead to warm up the caches and BTB
  - speculatively execute future work, e.g. start several loop iterations concurrently as different threads; if a data dependence is detected, redo the work (see the sketch below)
  - run a dynamic compiler/optimizer on the side
- Must have ways to contain the effects of incorrect speculations!!
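A hedged sketch of the loop-iteration case in plain C (sequential code simulating what the hardware would do concurrently; the loop body, idx, and the dependence test are invented for illustration). It assumes each iteration i reads only a[idx[i]] with idx[i] <= i, so the only hazard within a pair is iteration i+1 reading the element that iteration i writes:

    #define N 100  /* assumed even trip count */

    int a[N + 1];
    int idx[N];    /* hypothetical runtime-dependent read indices, idx[i] <= i */

    /* One loop iteration: reads a[idx[i]], writes a[i + 1]. */
    static void body(int i) { a[i + 1] = a[idx[i]] + 1; }

    /* Run iterations i and i+1 as two "threads". Real hardware would
       buffer the speculative thread's writes; here the pair's write sets
       (a[i+1] vs. a[i+2]) never collide, so writing in place is safe. */
    void spec_loop(void) {
        for (int i = 0; i + 1 < N; i += 2) {
            body(i + 1);  /* speculative thread runs ahead and may read... */
            body(i);      /* ...before the earlier iteration has written  */
            if (idx[i + 1] == i + 1)  /* it read a[i+1] too early:        */
                body(i + 1);          /* dependence detected, redo work   */
        }
    }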
19. Slipstream Processors
- Execute a single-threaded application redundantly on a modified 2-way SMT, with one thread slightly ahead
  - an advanced stream (A-stream) followed by a redundant stream (R-stream)
- The two redundant programs combined run faster than either can alone [Rotenberg]
- How is this possible?
  - the A-stream is highly speculative
    - it can use all kinds of branch and value predictions
    - it doesn't go back to check or correct mispredictions
    - it even selectively skips some instructions, e.g. some instructions compute branch decisions; why execute them if I am going to predict the branch anyway?
  - the A-stream should run faster, but its results can't be trusted
  - the R-stream is executed normally, but it still runs faster because the caches and TLB have been warmed by the A-stream!!
20. Microarchitecture for Reliability
- Old way of thinking: hardware is always perfect!
  - 42 million transistors in the P4, and we expect every single one (outside of the memory arrays) to work as we intended, all of the time
- Reality in the (near) future
  - tiny transistors
  - low Vdd → small noise margin
  - fast clocks
- To keep up with Moore's Law, future processors will be making soft errors from time to time
21. Redundancy for Fault Tolerance
- It is not good enough to just design correctly. You must engineer fault tolerance into the microarchitecture
- Fault tolerance → error detection + recovery → redundant computation
22. Redundant Computation on SMT [Reinhardt and Mukherjee]
- SRT: simultaneous and redundantly threaded processors
- Run a thread redundantly on an SMT and check the answers from the redundant executions to catch transient errors
- Advantages over static, processor-level redundancy
  - some mechanisms may be more cheaply protected by error correction coding (ECC) than by replication, e.g. register file, caches
  - SMT's flexible resource allocation gives better performance than if the same resources were statically allocated to the redundant threads
  - all of the SMT hardware can revert to delivering performance when redundant execution is not needed
23. SRT: Sphere of Replication
- Whatever isn't ECC protected must be replicated (statically or dynamically): no single point of failure
  - any information that enters the sphere of replication must be replicated and stored separately
  - all processing inside the sphere must be performed redundantly so that duplicate intermediate results exist
  - any information that exits the sphere must be checked against its redundant copy for discrepancies
- Should the Regfile be replicated or ECCed?
24. Diva: Dynamic Verification [Austin]
- SRT is still vulnerable because the redundant computations may reuse the same hardware; thus a sufficiently long-duration soft error may corrupt both versions
- The Diva architecture uses two different processors of two different designs to achieve redundancy
- The core processor is a full superscalar, except the retirement stream is passed to the checker processor for verification
- The checker processor 1. detects and corrects any mistakes made by the core processor, 2. returns corrected register updates, and 3. writes to memory on behalf of the core processor
- What happens if the checker processor makes a mistake?
[Diagram: the core processor's ROB commit stream feeds the checker processor, which returns verified register updates.]
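As an illustrative sketch only (the struct and the two-opcode ALU are assumptions, not Austin's actual design): the checker re-executes each committed instruction using operand values the core already produced, and overrides the core's result on a mismatch:

    #include <stdint.h>
    #include <stdbool.h>

    /* One entry of the ROB commit stream, as assumed here: the core
       forwards each instruction with its operand values and result,
       so the checker needs no renaming or out-of-order machinery. */
    struct commit {
        enum { OP_ADD, OP_SUB } op;
        uint64_t src1, src2;   /* operand values, already available */
        uint64_t core_result;  /* what the core thinks the answer is */
    };

    /* In-order check in simple (big-transistor, slower-clock) logic:
       recompute the result and correct the core if they disagree. */
    uint64_t check_and_commit(const struct commit *c, bool *core_erred) {
        uint64_t good = (c->op == OP_ADD) ? c->src1 + c->src2
                                          : c->src1 - c->src2;
        *core_erred = (good != c->core_result);
        return good;  /* the verified register update / store value */
    }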
25. Rationale behind Diva
- Checking an answer is sometimes much simpler and faster than computing the answer
  - when the checker processor is looking at an instruction, that instruction's operands are already available because all earlier instructions have already been executed by the core processor
  - the checker processor doesn't need branch prediction, register renaming or out-of-order execution
  - a very simple and small checker processor can keep up with the throughput of the fancy core processor
  - thus, we can use big transistors and a lower clock rate in the checker processor to prevent soft errors
- Other uses
  - use the checker processor to catch logic design errors in the complicated superscalar core processor
  - build a very aggressively speculated core processor and use the checker processor to correct the occasional errors