Transcript and Presenter's Notes

Title: Advanced Topic: High Performance Processors


1
Advanced Topic: High Performance Processors
  • CS M151B, Spring 2002

2
High Performance Processor Design Techniques
  • Main Idea: Exploit as much parallelism and hide
    as much overhead as possible
  • Instruction Level Parallelism
  • Scoreboarding
  • Reservation Stations (Tomasulo Algorithm)
  • Dynamic Branch Prediction
  • Speculation Architecture
  • Multiple Instruction Issue (Superscalar
    Processors)
  • Vector Processors
  • Digital Signal Processors

3
Hardware Approach to Instruction Parallelism
  • Why in hardware at run time?
  • Works when we can't know real dependences at compile
    time
  • Compiler is simpler
  • Code for one machine runs well on another
  • Key idea: Allow instructions behind a stall to
    proceed
  • DIVD F0,F2,F4
  • ADDD F10,F0,F8
  • SUBD F12,F8,F14
  • Enables out-of-order execution =>
    out-of-order completion
  • ID stage checked both for structural and data
    hazards

4
Three Generic Data Hazards
  • Many high performance processors have multiple
    execution units and multiple instructions
    executing at the same time. These processors have
    to handle three types of data hazards (a small
    classification sketch follows this list):
  • Read After Write (RAW): InstrI followed by
    InstrJ; InstrJ tries to read an operand before
    InstrI writes it
  • Write After Read (WAR): InstrI followed by
    InstrJ; InstrJ tries to write an operand before
    InstrI reads it
  • Gets wrong operand
  • Can't happen in the simple 5-stage pipeline
    because
  • All instructions take 5 stages, and
  • Reads are always in stage 2, and
  • Writes are always in stage 5
  • Write After Write (WAW): InstrI followed by
    InstrJ; InstrJ tries to write an operand before
    InstrI writes it
  • Leaves wrong result (InstrI, not InstrJ)
  • Can't happen in the DLX 5-stage pipeline because
  • All instructions take 5 stages, and
  • Writes are always in stage 5
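
A minimal Python sketch (not from the original slides; the instruction encoding as a destination plus a set of sources is a hypothetical simplification) that classifies the hazard between an earlier instruction I and a later instruction J:

    def hazard(i_dst, i_srcs, j_dst, j_srcs):
        # Classify the data hazard between earlier I and later J.
        if i_dst in j_srcs:
            return "RAW"   # J reads what I writes
        if j_dst in i_srcs:
            return "WAR"   # J writes what I reads
        if j_dst == i_dst:
            return "WAW"   # J writes what I writes
        return None

    # DIVD F0,F2,F4 followed by ADDD F10,F0,F8 -> RAW on F0
    print(hazard("F0", {"F2", "F4"}, "F10", {"F0", "F8"}))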

5
Scoreboarding
  • The scoreboard dates to the CDC 6600 in 1963
  • Out-of-order execution divides the ID stage:
  • 1. Issue: decode instructions, check for
    structural hazards
  • 2. Read operands: wait until no data hazards, then
    read operands
  • Scoreboards allow an instruction to execute whenever
    1 & 2 hold, not waiting for prior instructions
  • CDC 6600: in-order issue, out-of-order execution,
    out-of-order commit (also called completion)

6
Scoreboard Implications
  • Out-of-order completion => WAR, WAW hazards?
  • Solutions for WAR:
  • Queue both the operation and copies of its
    operands
  • Read registers only during the Read Operands stage
  • For WAW, must detect the hazard and stall issue until
    the other instruction completes
  • Need to have multiple instructions in the execution
    phase => multiple execution units or pipelined
    execution units
  • Scoreboard keeps track of dependencies and the state of
    operations
  • Scoreboard replaces ID, EX, WB with 4 stages

7
Four Stages of Scoreboard Control
  • 1. Issue: decode instructions and check for
    structural hazards (ID1)
  • If a functional unit for the instruction is
    free and no other active instruction has the same
    destination register (WAW), the scoreboard issues
    the instruction to the functional unit and
    updates its internal data structure. If a
    structural or WAW hazard exists, then the
    instruction issue stalls, and no further
    instructions will issue until these hazards are
    cleared.
  • 2. Read operands: wait until no data hazards, then
    read operands (ID2)
  • A source operand is available if no earlier
    issued active instruction is going to write it,
    or if the register containing the operand is
    being written by a currently active functional
    unit. When the source operands are available, the
    scoreboard tells the functional unit to proceed
    to read the operands from the registers and begin
    execution. The scoreboard resolves RAW hazards
    dynamically in this step, and instructions may be
    sent into execution out of order.

8
Four Stages of Scoreboard Control
  • 3. Execution: operate on operands (EX)
  • The functional unit begins execution upon
    receiving operands. When the result is ready, it
    notifies the scoreboard that it has completed
    execution.
  • 4. Write result: finish execution (WB)
  • Once the scoreboard is aware that the
    functional unit has completed execution, the
    scoreboard checks for WAR hazards. If none, it
    writes results. If there is a WAR hazard, it stalls the
    instruction. (A sketch of the issue-stage test
    follows this slide.)
  • Example:
  • DIVD F0,F2,F4
  • ADDD F10,F0,F8
  • SUBD F8,F8,F14
  • The CDC 6600 scoreboard would stall SUBD until ADDD
    reads its operands
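
A minimal sketch (the dictionary shapes are assumptions, not the CDC 6600's actual structures) of the stage-1 issue test: stall on a busy functional unit (structural hazard) or on a pending write to the same destination (WAW):

    def can_issue(instr, fu_status, reg_result):
        fu = fu_status[instr["fu"]]
        if fu["Busy"]:                   # structural hazard
            return False
        if instr["dst"] in reg_result:   # WAW: pending write to dst
            return False
        return True

    fu_status  = {"FPdiv": {"Busy": False}}
    reg_result = {}                      # register -> FU that will write it
    print(can_issue({"fu": "FPdiv", "dst": "F0"}, fu_status, reg_result))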

9
Three Parts of the Scoreboard
  • 1. Instruction status: which of the 4 steps the
    instruction is in
  • 2. Functional unit status: indicates the state of
    the functional unit (FU); 9 fields for each
    functional unit (a data-structure sketch follows
    this slide):
  • Busy: Indicates whether the unit is busy or not
  • Op: Operation to perform in the unit (e.g., + or -)
  • Fi: Destination register
  • Fj, Fk: Source-register numbers
  • Qj, Qk: Functional units producing source
    registers Fj, Fk
  • Rj, Rk: Flags indicating when Fj, Fk are ready
  • 3. Register result status: indicates which
    functional unit will write each register, if one
    exists. Blank when no pending instruction will
    write that register
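
A sketch of the three tables in Python (the shapes are assumptions; the field names follow the slide):

    instruction_status = {}   # instr -> "issue" | "read" | "exec" | "write"
    functional_unit_status = {
        "FPmult1": dict(Busy=False, Op=None, Fi=None, Fj=None, Fk=None,
                        Qj=None, Qk=None, Rj=False, Rk=False),
    }
    register_result_status = {}   # e.g. {"F0": "FPdiv"}; absent = no pending write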

10
Detailed Scoreboard Pipeline Control
11
Scoreboard Example
12
Scoreboard Example Cycle 1
13
Scoreboard Example Cycle 2
  • Issue 2nd LD?

14
Scoreboard Example Cycle 3
  • Issue MULT?

15
Scoreboard Example Cycle 4
16
Scoreboard Example Cycle 5
17
Scoreboard Example Cycle 6
18
Scoreboard Example Cycle 7
  • Read multiply operands?

19
Scoreboard Example Cycle 8a
20
Scoreboard Example Cycle 8b
21
Scoreboard Example Cycle 9
  • Read operands for MULT & SUBD? Issue ADDD?

22
Scoreboard Example Cycle 11
23
Scoreboard Example Cycle 13
24
Scoreboard Example Cycle 14
25
Scoreboard Example Cycle 15
26
Scoreboard Example Cycle 16
27
Scoreboard Example Cycle 17
  • Write result of ADDD?

28
Scoreboard Example Cycle 18
29
Scoreboard Example Cycle 20
30
Scoreboard Example Cycle 21
31
Scoreboard Example Cycle 22
32
Scoreboard Example Cycle 61
33
Scoreboard Example Cycle 62
34
CDC 6600 Scoreboard
  • Speedup of 1.7 from the compiler; 2.5 by hand; BUT slow
    memory (no cache) limits the benefit
  • Limitations of the 6600 scoreboard:
  • No forwarding hardware
  • Limited to instructions in a basic block (small
    window)
  • Small number of functional units (structural
    hazards), especially integer/load-store units
  • Does not issue on structural hazards
  • Waits for WAR hazards
  • Prevents WAW hazards

35
Another Dynamic Approach Tomasulo Algorithm
  • For the IBM 360/91: about 3 years after the CDC 6600
    (1966)
  • Goal: High performance without special compilers
  • Differences between the IBM 360 & CDC 6600 ISAs:
  • IBM has only 2 register specifiers/instr vs. 3 in
    the CDC 6600
  • IBM has 4 FP registers vs. 8 in the CDC 6600
  • Why study? Led to the Alpha 21264, HP 8000, MIPS
    10000, Pentium II, PowerPC 604, ...

36
Tomasulo Algorithm vs. Scoreboard
  • Control & buffers distributed with Functional Units
    (FUs) vs. centralized in the scoreboard
  • FU buffers called reservation stations hold
    pending operands
  • Registers in instructions replaced by values or
    pointers to reservation stations (RS); called
    register renaming
  • Avoids WAR, WAW hazards
  • More reservation stations than registers, so can
    do optimizations compilers can't
  • Results go to FUs from RSs, not through registers,
    over a Common Data Bus that broadcasts results to
    all FUs
  • Loads and stores treated as FUs with RSs as well
  • Integer instructions can go past branches,
    allowing FP ops beyond the basic block in the FP queue

37
Tomasulo Organization
(Figure: Tomasulo organization. The FP Op Queue and FP Registers feed the FP Add and FP Mul reservation stations; a Load Buffer and Store Buffer connect to memory; the Common Data Bus links all units.)
38
Reservation Station Components
  • Op: Operation to perform in the unit (e.g., + or -)
  • Vj, Vk: Values of the source operands
  • Store buffers have a V field: the result to be stored
  • Qj, Qk: Reservation stations producing the source
    registers (value to be written)
  • Note: No ready flags as in the scoreboard; Qj,Qk = 0 =>
    ready
  • Store buffers only have Qi, for the RS producing the
    result
  • Busy: Indicates the reservation station or FU is
    busy
  • Register result status: indicates which
    functional unit will write each register, if one
    exists. Blank when no pending instruction
    will write that register. (An entry sketch follows
    this slide.)
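
A sketch (assumed shapes) of one reservation-station entry; Qj/Qk = 0 on the slide corresponds to None here, meaning the value is already in Vj/Vk:

    rs_entry = dict(Busy=False, Op=None,
                    Vj=None, Vk=None,   # operand values
                    Qj=None, Qk=None)   # producing RS, or None if ready

    def ready(rs):
        return rs["Qj"] is None and rs["Qk"] is None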

39
Three Stages of Tomasulo Algorithm
  • 1. Issue: get instruction from the FP Op Queue
  • If a reservation station is free (no structural
    hazard), control issues the instr & sends operands
    (renames registers).
  • 2. Execution: operate on operands (EX)
  • When both operands are ready, execute; if not
    ready, watch the Common Data Bus for the result
  • 3. Write result: finish execution (WB)
  • Write on the Common Data Bus to all awaiting units;
    mark the reservation station available
  • Normal data bus: data + destination ("go
    to" bus)
  • Common data bus: data + source ("come from" bus)
  • 64 bits of data + 4 bits of functional-unit
    source address
  • Write if it matches the expected functional unit
    (produces the result)
  • Does the broadcast (sketched after this slide)
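
A minimal sketch of a CDB broadcast (the structures are assumptions): the (source RS, value) pair is matched by every waiting reservation station and by the register-status table:

    def broadcast(src_rs, value, stations, reg_status, regs):
        for rs in stations:                 # wake up waiting operands
            if rs["Qj"] == src_rs:
                rs["Vj"], rs["Qj"] = value, None
            if rs["Qk"] == src_rs:
                rs["Vk"], rs["Qk"] = value, None
        for r, q in list(reg_status.items()):
            if q == src_rs:                 # register waiting on this result
                regs[r] = value
                del reg_status[r]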

40
Tomasulo Example Cycle 0
41
Tomasulo Example Cycle 1
42
Tomasulo Example Cycle 2
Note: Unlike the 6600, can have multiple loads
outstanding
43
Tomasulo Example Cycle 3
  • Note: register names are removed (renamed) in the
    Reservation Stations; MULT is issued, vs. the scoreboard
  • Load1 is completing; what is waiting for Load1?

44
Tomasulo Example Cycle 4
  • Load2 is completing; what is waiting for it?

45
Tomasulo Example Cycle 5
46
Tomasulo Example Cycle 6
  • Issue ADDD here vs. scoreboard?

47
Tomasulo Example Cycle 7
  • Add1 is completing; what is waiting for it?

48
Tomasulo Example Cycle 8
49
Tomasulo Example Cycle 9
50
Tomasulo Example Cycle 10
  • Add2 is completing; what is waiting for it?

51
Tomasulo Example Cycle 11
  • Write result of ADDD here vs. scoreboard?

52
Tomasulo Example Cycle 12
  • Note: all the quick instructions have already completed

53
Tomasulo Example Cycle 13
54
Tomasulo Example Cycle 14
55
Tomasulo Example Cycle 15
  • Mult1 is completing; what is waiting for it?

56
Tomasulo Example Cycle 16
  • Note: just waiting for the divide

57
Tomasulo Example Cycle 55
58
Tomasulo Example Cycle 56
  • Mult2 is completing; what is waiting for it?

59
Tomasulo Example Cycle 57
  • Again: in-order issue, out-of-order execution and
    completion

60
Compare to Scoreboard Cycle 62
  • Why does it take longer on the scoreboard/6600?

61
Tomasulo v. Scoreboard(IBM 360/91 v. CDC 6600)
  • Pipelined functional units (6 load, 3 store,
    3 +, 2 ×/÷) vs. multiple functional units
    (1 load/store, 1 +, 2 ×, 1 ÷)
  • Window size: about 14 instructions vs. about 5
    instructions
  • No issue on structural hazard: same
  • WAR: renaming avoids vs. stall completion
  • WAW: renaming avoids vs. stall issue
  • Results: broadcast from FU vs. write/read registers
  • Control: reservation stations vs. central
    scoreboard

62
Tomasulo Drawbacks
  • Complexity
  • Delays of the 360/91, MIPS 10000, IBM 620?
  • Many associative stores (CDB) at high speed
  • Performance limited by the Common Data Bus
  • Multiple CDBs => more FU logic for parallel
    associative stores

63
Dynamic Branch Prediction
  • Performance = f(accuracy, cost of misprediction)
  • Branch History Table: lower bits of the PC
    address index a table of 1-bit values
  • Says whether or not the branch was taken last time
  • No address check
  • Problem: in a loop, a 1-bit BHT will cause two
    mispredictions (avg. is 9 iterations before exit):
  • End-of-loop case, when it exits instead of
    looping as before
  • First time through the loop on the next time through the
    code, when it predicts exit instead of looping

64
Dynamic Branch Prediction
  • Solution: 2-bit scheme where the prediction changes
    only on two successive mispredictions (Figure 4.13, p.
    264); a counter sketch follows this slide
  • Red: stop, not taken
  • Green: go, taken
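
A minimal sketch of one 2-bit saturating counter (states 0-3; states 2 and 3 predict taken), matching the scheme described above:

    class TwoBit:
        def __init__(self):
            self.state = 0                 # 0/1 predict not taken, 2/3 taken
        def predict(self):
            return self.state >= 2
        def update(self, taken):
            self.state = min(self.state + 1, 3) if taken else max(self.state - 1, 0)

    p, misses = TwoBit(), 0
    for outcome in ([True] * 9 + [False]) * 2:   # two passes of a 10-iteration loop
        misses += (p.predict() != outcome)
        p.update(outcome)
    print(misses)   # 4: two warm-up misses, then one per loop exit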

65
Branch History Table Accuracy
  • Mispredict because either:
  • Wrong guess for that branch
  • Got the branch history of the wrong branch when indexing
    the table
  • 4096-entry table: programs vary from 1%
    misprediction (nasa7, tomcatv) to 18% (eqntott),
    with spice at 9% and gcc at 12%
  • 4096 entries about as good as an infinite table (in the
    Alpha 21164)

66
Correlating Branches
  • Hypothesis: recent branches are correlated; that
    is, the behavior of recently executed branches
    affects the prediction of the current branch
  • Idea: record the m most recently executed branches as
    taken or not taken, and use that pattern to
    select the proper branch history table
  • In general, an (m,n) predictor means record the last m
    branches to select between 2^m history tables, each
    with n-bit counters (an indexing sketch follows
    this slide)
  • The old 2-bit BHT is then a (0,2) predictor
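
A minimal sketch of (m,n) = (2,2) predictor indexing (the table size and PC hashing are assumptions): m global history bits select among 2^m tables of n-bit counters:

    M, TABLE_BITS = 2, 12
    tables = [[0] * (1 << TABLE_BITS) for _ in range(1 << M)]
    history = 0                                  # last m branch outcomes

    def index(pc):
        return pc & ((1 << TABLE_BITS) - 1)

    def predict(pc):
        return tables[history][index(pc)] >= 2   # 2-bit counter per entry

    def update(pc, taken):
        global history
        c = tables[history][index(pc)]
        tables[history][index(pc)] = min(c + 1, 3) if taken else max(c - 1, 0)
        history = ((history << 1) | taken) & ((1 << M) - 1)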

67
Correlating Branches
  • (2,2) predictor
  • The behavior of the 2 most recent branches selects
    between, say, four predictions of the next branch,
    updating just that prediction
68
Selective History Predictor
(Figure: Selective History Predictor. An 8K x 2-bit selector table, indexed by the branch address, chooses between a non-correlating predictor (8096 x 2 bits) and a correlating predictor indexed by 2 bits of global history (2048 x 4 x 2 bits); counter values 11 and 10 predict taken, 01 and 00 predict not taken.)
69
Accuracy of Different Schemes (Figure 4.21, p. 272)
(Chart: frequency of mispredictions, from 0% to 18%, for a 4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1024-entry (2,2) BHT.)
70
Need Address at Same Time as Prediction
  • Branch Target Buffer (BTB): the address of the branch
    indexes the table to get the prediction AND the branch
    target address (if taken)
  • Note: must check for a branch match now, since we
    can't use a wrong branch address (Figure 4.22, p.
    273); a lookup sketch follows this slide
  • Return-instruction addresses predicted with a stack

(Figure: a BTB entry supplies both the branch prediction, taken or not taken, and the predicted PC.)
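
A minimal sketch of a direct-mapped BTB lookup (the sizes and the 2-bit counter field are assumptions); the tag check implements the "must check for a branch match" note above:

    BTB_BITS = 10
    btb = {}                               # index -> (tag, target, 2-bit counter)

    def btb_lookup(pc):
        idx, tag = pc & ((1 << BTB_BITS) - 1), pc >> BTB_BITS
        entry = btb.get(idx)
        if entry and entry[0] == tag:      # match: safe to use stored target
            return entry[1] if entry[2] >= 2 else pc + 4
        return pc + 4                      # no match: fall through

    btb[0x40] = (0x40, 0x2000, 3)          # hypothetical taken branch at 0x10040
    print(hex(btb_lookup(0x10040)))        # -> 0x2000
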
71
Dynamic Branch Prediction Summary
  • Branch History Table: 2 bits for loop accuracy
  • Correlation: recently executed branches are
    correlated with the next branch
  • Branch Target Buffer: include the branch target
    address & prediction
  • Predicated execution can reduce the number of
    branches and the number of mispredicted branches

72
Speculation
  • Speculation: allow an instruction to execute, without
    any consequences (including exceptions), if the branch
    is not actually taken (HW undo); called boosting
  • Combine branch prediction with dynamic scheduling
    to execute before branches are resolved
  • Separate speculative bypassing of results from
    real bypassing of results
  • When an instruction is no longer speculative, write the
    boosted results (instruction commit) or discard the
    boosted results
  • Execute out-of-order but commit in-order to
    prevent irrevocable action (state update or
    exception) until the instruction commits

73
Hardware Support for Speculation
  • Need a HW buffer for the results of uncommitted
    instructions: the reorder buffer
  • 3 fields: instr, destination, value
  • The reorder buffer can be an operand source => more
    registers, like RSs
  • Use the reorder buffer number instead of the
    reservation station number when execution completes
  • Supplies operands between execution complete &
    commit
  • Once an instruction commits, its result is put into
    the register
  • Instructions commit in order
  • As a result, it's easy to undo speculated
    instructions on mispredicted branches or on
    exceptions

(Figure: the Reorder Buffer sits between the FP Op Queue and the FP Regs; reservation stations feed two FP adders.)
74
Four Steps of Speculative Tomasulo Algorithm
  • 1. Issue: get instruction from the FP Op Queue
  • If a reservation station and a reorder buffer slot are
    free, issue the instr & send the operands & the
    reorder buffer no. for the destination (this stage is
    sometimes called dispatch)
  • 2. Execution: operate on operands (EX)
  • When both operands are ready, execute; if not
    ready, watch the CDB for the result; when both are in
    the reservation station, execute; checks RAW
    (sometimes called issue)
  • 3. Write result: finish execution (WB)
  • Write on the Common Data Bus to all awaiting FUs
    & the reorder buffer; mark the reservation station
    available.
  • 4. Commit: update the register with the reorder result
  • When the instr. is at the head of the reorder buffer &
    the result is present, update the register with the
    result (or store to memory) and remove the instr from
    the reorder buffer. A mispredicted branch flushes the
    reorder buffer (sometimes called graduation); a commit
    sketch follows this slide
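
A minimal sketch of in-order commit from a reorder buffer (the entry fields are assumptions): only the head may commit, and a mispredicted branch flushes everything behind it:

    from collections import deque

    def commit(rob, regs, memory):
        while rob and rob[0]["ready"]:
            e = rob.popleft()              # head of the reorder buffer
            if e.get("mispredicted"):      # branch: flush speculative work
                rob.clear()
                break
            if e["kind"] == "store":
                memory[e["addr"]] = e["value"]
            else:
                regs[e["dest"]] = e["value"]

    rob = deque([{"ready": True, "kind": "alu", "dest": "F0", "value": 1.5}])
    regs, memory = {}, {}
    commit(rob, regs, memory)
    print(regs)   # {'F0': 1.5}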

75
Renaming Registers
  • A common variation of the speculative design
  • The reorder buffer keeps instruction information but
    not the result
  • Extend the register file with extra renaming
    registers to hold speculative results
  • A rename register is allocated at issue; the result
    goes into the rename register on execution complete;
    the rename register is copied into the real register
    on commit
  • Operands are read either from the register file (real
    or speculative) or via the Common Data Bus
  • Advantage: operands are always from a single source
    (the extended register file)

76
Issuing Multiple Instructions/Cycle
  • Two variations:
  • Superscalar: varying no. of instructions/cycle (1 to
    8), scheduled by the compiler or by HW (Tomasulo)
  • IBM PowerPC, Sun UltraSparc, DEC Alpha, HP 8000
  • (Very) Long Instruction Words ((V)LIW): fixed
    number of instructions (4-16) scheduled by the
    compiler; put ops into wide templates
  • Joint HP/Intel agreement in 1999/2000?
  • Intel Architecture-64 (IA-64): 64-bit address
  • Style: Explicitly Parallel Instruction Computer
    (EPIC)
  • Anticipated success led to use of Instructions
    Per Clock cycle (IPC) vs. CPI

77
Issuing Multiple Instructions/Cycle
  • Superscalar DLX: 2 instructions, 1 FP & 1
    anything else
  • Fetch 64 bits/clock cycle; Int on left, FP on
    right
  • Can only issue the 2nd instruction if the 1st
    instruction issues
  • More ports for FP registers to do an FP load & an FP
    op in a pair

    Type              Pipe stages
    Int. instruction  IF ID EX MEM WB
    FP instruction    IF ID EX MEM WB
    Int. instruction     IF ID EX MEM WB
    FP instruction       IF ID EX MEM WB
    Int. instruction        IF ID EX MEM WB
    FP instruction          IF ID EX MEM WB

  • A 1-cycle load delay expands to 3 instructions in
    SS:
  • the instruction in the right half can't use it, nor can
    the instructions in the next slot

78
Loop Unrolling in Superscalar
  Integer instruction   FP instruction    Clock cycle
  Loop: LD F0,0(R1)                       1
        LD F6,-8(R1)                      2
        LD F10,-16(R1)  ADDD F4,F0,F2     3
        LD F14,-24(R1)  ADDD F8,F6,F2     4
        LD F18,-32(R1)  ADDD F12,F10,F2   5
        SD 0(R1),F4     ADDD F16,F14,F2   6
        SD -8(R1),F8    ADDD F20,F18,F2   7
        SD -16(R1),F12                    8
        SD -24(R1),F16                    9
        SUBI R1,R1,40                     10
        BNEZ R1,LOOP                      11
        SD -32(R1),F20                    12
  • Unrolled 5 times to avoid delays (+1 due to SS)
  • 12 clocks, or 2.4 clocks per iteration (1.5X)

79
Multiple Issue Challenges
  • While the integer/FP split is simple for the HW, we get
    a CPI of 0.5 only for programs with:
  • Exactly 50% FP operations
  • No hazards
  • If more instructions issue at the same time, there is
    greater difficulty in decode and issue
  • Even 2-scalar => examine 2 opcodes, 6 register
    specifiers, and decide if 1 or 2 instructions can
    issue
  • VLIW: trade off instruction space for simple
    decoding
  • The long instruction word has room for many
    operations
  • By definition, all the operations the compiler
    puts in the long instruction word are independent
    => execute in parallel
  • E.g., 2 integer operations, 2 FP ops, 2 memory
    refs, 1 branch
  • 16 to 24 bits per field => 7×16 = 112 bits to
    7×24 = 168 bits wide
  • Need a compiling technique that schedules across
    several branches

80
Loop Unrolling in VLIW
  Memory ref 1    Memory ref 2    FP op 1          FP op 2          Int. op/branch  Clock
  LD F0,0(R1)     LD F6,-8(R1)                                                      1
  LD F10,-16(R1)  LD F14,-24(R1)                                                    2
  LD F18,-32(R1)  LD F22,-40(R1)  ADDD F4,F0,F2    ADDD F8,F6,F2                    3
  LD F26,-48(R1)                  ADDD F12,F10,F2  ADDD F16,F14,F2                  4
                                  ADDD F20,F18,F2  ADDD F24,F22,F2                  5
  SD 0(R1),F4     SD -8(R1),F8    ADDD F28,F26,F2                                   6
  SD -16(R1),F12  SD -24(R1),F16                                                    7
  SD -32(R1),F20  SD -40(R1),F24                                   SUBI R1,R1,48    8
  SD -0(R1),F28                                                    BNEZ R1,LOOP     9
  • Unrolled 7 times to avoid delays
  • 7 results in 9 clocks, or 1.3 clocks per
    iteration (1.8X)
  • Average: 2.5 ops per clock, 50% efficiency
  • Note: need more registers in VLIW (15 vs. 6 in
    SS)

81
Trace Scheduling
  • Parallelism across IF branches vs. LOOP branches
  • Two steps:
  • Trace Selection
  • Find a likely sequence of basic blocks (trace) of
    (statically predicted or profile-predicted) long
    sequences of straight-line code
  • Trace Compaction
  • Squeeze the trace into a few VLIW instructions
  • Need bookkeeping code in case the prediction is wrong
  • The compiler undoes a bad guess (discards values in
    registers)
  • Subtle compiler bugs mean a wrong answer vs. poorer
    performance; no hardware interlocks

82
Advantages of HW (Tomasulo) vs. SW (VLIW)
Speculation
  • HW determines address conflicts
  • HW better branch prediction
  • HW maintains precise exception model
  • HW does not execute bookkeeping instructions
  • Works across multiple implementations
  • SW speculation is much easier for HW design

83
Superscalar v. VLIW
  • Superscalar
  • Smaller code size
  • Binary compatibility across generations of
    hardware
  • VLIW
  • Simplified Hardware for decoding, issuing
    instructions
  • No Interlock Hardware (compiler checks?)
  • More registers, but simplified Hardware for
    Register Ports (multiple independent register
    files?)

84
Dynamic Scheduling in Superscalar
  • Dependencies stop instruction issue
  • Code compiled for an old version will run poorly on the
    newest version
  • May want code to vary depending on how
    superscalar
  • How to issue two instructions and keep in-order
    instruction issue for Tomasulo?
  • Assume: 1 integer + 1 floating point
  • 1 Tomasulo control for integer, 1 for floating
    point
  • Issue at 2X the clock rate, so that issue remains in
    order
  • Only FP loads might cause a dependency between
    integer and FP issue:
  • Replace the load reservation station with a load
    queue; operands must be read in the order they
    are fetched
  • A load checks addresses in the Store Queue to avoid RAW
    violations
  • A store checks addresses in the Load Queue to avoid
    WAR, WAW
  • Called a decoupled architecture (address checks
    sketched below)
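
A minimal sketch of the queue address checks (the queue shapes are assumptions): a load must not pass an older store to the same address, and a store must not pass older loads or stores to that address:

    def load_may_issue(addr, store_queue):
        # RAW check: no older store to the same address
        return all(s["addr"] != addr for s in store_queue)

    def store_may_issue(addr, load_queue, store_queue):
        # WAR and WAW checks against older loads and stores
        return (all(l["addr"] != addr for l in load_queue) and
                all(s["addr"] != addr for s in store_queue))

    print(load_may_issue(0x100, [{"addr": 0x100}]))   # False: must wait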

85
Performance of Dynamic Superscalar Scheduling
  Iteration  Instruction     Issues  Executes  Writes result
                                (clock-cycle number)
  1          LD F0,0(R1)     1       2         4
  1          ADDD F4,F0,F2   1       5         8
  1          SD 0(R1),F4     2       9
  1          SUBI R1,R1,8    3       4         5
  1          BNEZ R1,LOOP    4       5
  2          LD F0,0(R1)     5       6         8
  2          ADDD F4,F0,F2   5       9         12
  2          SD 0(R1),F4     6       13
  2          SUBI R1,R1,8    7       8         9
  2          BNEZ R1,LOOP    8       9
  • 4 clocks per iteration; only 1 FP
    instr/iteration
  • Branches and decrements: issue still takes 1 clock
    cycle
  • How to get more performance?

86
Software Pipelining
  • Observation: if the iterations of a loop are
    independent, then we can get more ILP by taking
    instructions from different iterations
  • Software pipelining reorganizes loops so that
    each iteration is made from instructions chosen
    from different iterations of the original loop
    (≈ Tomasulo in SW)

87
Software Pipelining Example
  • Before: Unrolled 3 times
  • 1 LD F0,0(R1)
  • 2 ADDD F4,F0,F2
  • 3 SD 0(R1),F4
  • 4 LD F6,-8(R1)
  • 5 ADDD F8,F6,F2
  • 6 SD -8(R1),F8
  • 7 LD F10,-16(R1)
  • 8 ADDD F12,F10,F2
  • 9 SD -16(R1),F12
  • 10 SUBI R1,R1,24
  • 11 BNEZ R1,LOOP

After: Software Pipelined
  1 SD 0(R1),F4     ; Stores M[i]
  2 ADDD F4,F0,F2   ; Adds to M[i-1]
  3 LD F0,-16(R1)   ; Loads M[i-2]
  4 SUBI R1,R1,8
  5 BNEZ R1,LOOP
(Figure: over time, the software pipeline overlaps ops from different iterations, vs. the unrolled loop repeating its fill/drain pattern.)
  • Symbolic Loop Unrolling
  • Maximize result-use distance
  • Less code space than unrolling
  • Fill & drain pipe only once per loop vs.
    once per each unrolled iteration in loop unrolling
88
Limits to Multi-Issue Machines
  • Inherent limitations of ILP:
  • 1 branch in 5 instructions: how to keep a 5-way VLIW
    busy?
  • Latencies of units: many operations must be
    scheduled
  • Need about Pipeline Depth × No. of Functional Units
    of independent instructions
  • Difficulties in building HW:
  • Easy: More instruction bandwidth
  • Easy: Duplicate FUs to get parallel execution
  • Hard: Increase ports to the register file (bandwidth)
  • The VLIW example needs 7 read and 3 write ports for the
    Int. registers & 5 read and 3 write ports for the FP
    registers
  • Harder: Increase ports to memory (bandwidth)
  • Decoding superscalar and its impact on clock rate,
    pipeline depth?

89
Limits to Multi-Issue Machines
  • Limitations specific to either the superscalar or
    VLIW implementation:
  • Decode & issue in superscalar: how wide is practical?
  • VLIW code size: unrolled loops & wasted fields in the
    VLIW word
  • IA-64 compresses dependent instructions, but is
    still larger
  • VLIW lock step => 1 hazard & all instructions
    stall
  • IA-64 not lock step? Dynamic pipeline?
  • VLIW & binary compatibility: IA-64 promises binary
    compatibility

90
Limits to ILP
  • Conflicting studies of the amount:
  • Benchmarks (vectorized Fortran FP vs. integer C
    programs)
  • Hardware sophistication
  • Compiler sophistication
  • How much ILP is available using existing
    mechanisms with increasing HW budgets?
  • Do we need to invent new HW/SW mechanisms to keep
    on the processor performance curve?

91
Limits to ILP
  • Initial HW model here: MIPS compilers.
  • Assumptions for an ideal/perfect machine to start:
  • 1. Register renaming: infinite virtual registers,
    and all WAW & WAR hazards are avoided
  • 2. Branch prediction: perfect; no mispredictions
  • 3. Jump prediction: all jumps perfectly predicted
    => machine with perfect speculation & an
    unbounded buffer of instructions available
  • 4. Memory-address alias analysis: addresses are
    known; a store can be moved before a load
    provided the addresses are not equal
  • 1-cycle latency for all instructions; unlimited
    number of instructions issued per clock cycle

92
Intel/HP Explicitly Parallel Instruction
Computer (EPIC)
  • 3 instructions in 128-bit groups; a field
    determines if the instructions are dependent or
    independent
  • Smaller code size than old VLIW, larger than
    x86/RISC
  • Groups can be linked to show independence of > 3
    instr
  • 64 integer registers + 64 floating-point
    registers
  • Not separate files per functional unit as in old
    VLIW
  • Hardware checks dependencies (interlocks =>
    binary compatibility over time)
  • Predicated execution (select 1 out of 64 1-bit
    flags) => 40% fewer mispredictions?
  • IA-64: name of the instruction set architecture;
    EPIC is the type
  • Merced: name of the first implementation
    (1999/2000?)
  • LIW = EPIC?

93
Dynamic Scheduling in PowerPC 604 and Pentium Pro
  • Both: in-order issue, out-of-order execution,
    in-order commit
  • The Pentium Pro is more like a scoreboard, since
    control is central vs. distributed

94
Dynamic Scheduling in PowerPC 604 and Pentium Pro
  Parameter                            PPC 604      Pentium Pro
  Max. instructions issued/clock       4            3
  Max. instr. complete exec./clock     6            5
  Max. instr. committed/clock          6            3
  Window (instrs in reorder buffer)    16           40
  Number of reservation stations       12           20
  Number of rename registers           8 int/12 FP  40
  No. integer functional units (FUs)   2            2
  No. floating-point FUs               1            1
  No. branch FUs                       1            1
  No. complex integer FUs              1            0
  No. memory FUs                       1            1 load + 1 store

Q: How to pipeline 1- to 17-byte x86 instructions?
95
Dynamic Scheduling in Pentium Pro
  • The PPro doesn't pipeline 80x86 instructions
  • The PPro decode unit translates the Intel
    instructions into 72-bit micro-operations (≈ DLX)
  • Sends micro-operations to the reorder buffer &
    reservation stations
  • Takes 1 clock cycle to determine the length of the
    80x86 instructions + 2 more to create the
    micro-operations
  • 12-14 clocks in the total pipeline (3 state
    machines)
  • Many instructions translate to 1 to 4
    micro-operations
  • Complex 80x86 instructions are executed by a
    conventional microprogram (8K x 72 bits) that
    issues long sequences of micro-operations

96
Problems with Instruction Level Parallelism
  • Limits to conventional exploitation of ILP:
  • 1) Pipelined clock rate: at some point, each
    increase in clock rate has a corresponding CPI
    increase (branches, other hazards)
  • 2) Instruction fetch and decode: at some point,
    it's hard to fetch and decode more instructions
    per clock cycle
  • 3) Cache hit rate: some long-running
    (scientific) programs have very large data sets
    accessed with poor locality; others have
    continuous data streams (multimedia) and hence
    poor locality

97
Alternative Model Vector Processing
  • Vector processors have high-level operations that
    work on linear arrays of numbers: "vectors"

  SCALAR (1 operation):   add r3, r1, r2
  VECTOR (N operations):  add.vv v3, v1, v2
98
Properties of Vector Processors
  • Each result is independent of the previous result =>
    long pipeline; the compiler ensures no
    dependencies => high clock rate
  • Vector instructions access memory with a known
    pattern => highly interleaved memory => amortize
    memory latency over ≈64 elements => no
    (data) caches required! (Do use an instruction
    cache)
  • Reduces branches and branch problems in pipelines
  • A single vector instruction implies lots of work (≈ a
    loop) => fewer instruction fetches

99
Styles of Vector Architectures
  • Memory-memory vector processors: all vector
    operations are memory to memory
  • Vector-register processors: all vector operations
    between vector registers (except load and store)
  • Vector equivalent of load-store architectures
  • Includes all vector machines since the late 1980s:
    Cray, Convex, Fujitsu, Hitachi, NEC
  • We assume vector-register for the rest of the lectures

100
Components of Vector Processor
  • Vector register: a fixed-length bank holding a
    single vector
  • Has at least 2 read and 1 write ports
  • Typically 8-32 vector registers, each holding
    64-128 64-bit elements
  • Vector functional units (FUs): fully pipelined,
    start a new operation every clock
  • Typically 4 to 8 FUs: FP add, FP mult, FP
    reciprocal (1/X), integer add, logical, shift;
    may have multiples of the same unit
  • Vector load-store units (LSUs): fully pipelined
    unit to load or store a vector; may have multiple
    LSUs
  • Scalar registers: single elements for an FP scalar or
    an address
  • Cross-bar to connect FUs, LSUs, registers

101
Memory operations
  • Load/store operations move groups of data between
    registers and memory
  • Three types of addressing:
  • Unit stride
  • Fastest
  • Non-unit (constant) stride
  • Indexed (gather-scatter)
  • Vector equivalent of register indirect
  • Good for sparse arrays of data
  • Increases the number of programs that vectorize

102
Example of a Vector Instruction (Y = a × X + Y)
Assuming vectors X, Y are of length 64: Scalar vs. Vector

Vector (DLXV):
  LD    F0,a        ; load scalar a
  LV    V1,Rx       ; load vector X
  MULTS V2,F0,V1    ; vector-scalar mult.
  LV    V3,Ry       ; load vector Y
  ADDV  V4,V2,V3    ; add
  SV    Ry,V4       ; store the result

Scalar (DLX):
  LD    F0,a        ; load scalar a
  ADDI  R4,Rx,512   ; last address to load
  loop:
  LD    F2,0(Rx)    ; load X(i)
  MULTD F2,F0,F2    ; a * X(i)
  LD    F4,0(Ry)    ; load Y(i)
  ADDD  F4,F2,F4    ; a * X(i) + Y(i)
  SD    F4,0(Ry)    ; store into Y(i)
  ADDI  Rx,Rx,8     ; increment index to X
  ADDI  Ry,Ry,8     ; increment index to Y
  SUB   R20,R4,Rx   ; compute bound
  BNZ   R20,loop    ; check if done

578 (2 + 9×64) vs. 321 (1 + 5×64) ops (1.8X); 578
(2 + 9×64) vs. 6 instructions (96X); 64-operation
vectors + no loop overhead; also 64X fewer pipeline
hazards
103
Vector Surprise
  • Use vectors for inner-loop parallelism (no
    surprise)
  • One dimension of the array: A[0,0], A[0,1], A[0,2],
    ...
  • Think of the machine as, say, 32 vector regs, each
    with 64 elements
  • 1 instruction updates 64 elements of 1 vector
    register
  • And for outer-loop parallelism!
  • 1 element from each column: A[0,0], A[1,0],
    A[2,0], ...
  • Think of the machine as 64 virtual processors (VPs),
    each with 32 scalar registers! (≈ a multithreaded
    processor)
  • 1 instruction updates 1 scalar register in 64 VPs
  • The hardware is identical, just 2 compiler perspectives

104
Virtual Processor Vector Model
  • Vector operations are SIMD (single instruction,
    multiple data) operations
  • Each element is computed by a virtual processor
    (VP)
  • The number of VPs is given by the vector length
    control register

105
Vector Architectural State
106
Vector Implementation
  • Vector register file
  • Each register is an array of elements
  • The size of each register determines the maximum
    vector length
  • The vector length register determines the vector
    length for a particular operation
  • Multiple parallel execution units = lanes
    (sometimes called pipelines or pipes)

107
Vector Terminology: 4 lanes, 2 vector functional
units
(Figure: each vector functional unit is split across 4 parallel lanes.)
108
Vector Execution Time
  • Time = f(vector length, data dependencies, struct.
    hazards)
  • Initiation rate: the rate at which an FU consumes
    vector elements (= number of lanes; usually 1 or 2 on
    the Cray T-90)
  • Convoy: the set of vector instructions that can begin
    execution in the same clock (no struct. or data
    hazards)
  • Chime: approx. time for a vector operation
  • m convoys take m chimes; if each vector length is
    n, then they take approx. m × n clock cycles
    (ignores overhead; a good approximation for long
    vectors; see the sketch below)

4 convoys, 1 lane, VL = 64 => 4 × 64 = 256
clocks (or 4 clocks per result)
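
A one-line sketch of the chime approximation (overhead ignored, as the slide says):

    def chime_clocks(m_convoys, n_vector_length):
        # m convoys of vector length n take about m*n clocks
        return m_convoys * n_vector_length

    print(chime_clocks(4, 64))   # 256 clocks, i.e. 4 clocks per result
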
109
Example of Vector Instruction Start-up Time
  • Start-up time: pipeline latency (depth of the FU
    pipeline); another source of overhead
  • Operation          Start-up penalty (from the CRAY-1)
  • Vector load/store  12
  • Vector multiply    7
  • Vector add         6
  • Assume convoys don't overlap; vector length n

Convoy       Start   1st result   Last result
1. LV        0       12           11+n  (= 12+n-1)
2. MULV, LV  12+n    12+n+12      23+2n  (load start-up)
3. ADDV      24+2n   24+2n+6      29+3n  (wait for convoy 2)
4. SV        30+3n   30+3n+12     41+4n  (wait for convoy 3)
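
A sketch reproducing the table above (the CRAY-1 penalties come from the slide; the back-to-back scheduling is the slide's no-overlap assumption): each convoy pays its worst functional-unit start-up latency before delivering one result per clock for n elements:

    penalties = {"LV": 12, "SV": 12, "MULV": 7, "ADDV": 6}

    def convoy_times(convoys, n):
        start, rows = 0, []
        for ops in convoys:
            penalty = max(penalties[op] for op in ops)
            # (ops, start clock, first result, last result)
            rows.append((ops, start, start + penalty, start + penalty + n - 1))
            start += penalty + n          # next convoy waits for this one
        return rows

    for row in convoy_times([["LV"], ["MULV", "LV"], ["ADDV"], ["SV"]], 64):
        print(row)
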
110
Why startup time for each vector instruction?
  • Why not overlap the start-up time of back-to-back
    vector instructions?
  • Cray machines were built from many ECL chips operating
    at high clock rates; hard to do?
  • The Berkeley vector design (T0) didn't know it
    wasn't supposed to do overlap, so it has no start-up
    times for functional units (except load)

111
Vector Load/Store Units Memories
  • Start-up overheads are usually longer for LSUs
  • The memory system must sustain (no. of lanes × word)
    /clock cycle
  • Many vector procs. use banks (vs. simple
    interleaving):
  • 1) support multiple loads/stores per cycle =>
    multiple banks & address banks independently
  • 2) support non-sequential accesses (see soon)
  • Note: no. of memory banks > memory latency to avoid
    stalls (checked in the sketch below)
  • m banks => m words per memory latency l clocks
  • If m < l, then there is a gap in the memory pipeline:
  • clock:  0   l   l+1  l+2  ...  l+m-1  ...  l+m  ...  2l
  • word:   --  0   1    2    ...  m-1    ...  --   ...  m
  • May have 1024 banks in SRAM
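
A tiny sketch of the bank-count rule just stated: with m banks and memory latency l clocks, a unit-stride stream stalls whenever m < l:

    def memory_gap(m_banks, latency_l):
        # a bank is busy for l clocks, so m banks sustain
        # one word per clock only when m >= l
        return m_banks < latency_l

    print(memory_gap(8, 12))    # True: gap in the memory pipeline
    print(memory_gap(16, 12))   # False: one word per clock sustained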

112
Vector Length
  • What to do when the vector length is not exactly 64?
  • The vector-length register (VLR) controls the length
    of any vector operation, including a vector load
    or store (cannot be > the length of the vector
    registers)
  • do 10 i = 1, n
  • 10   Y(i) = a * X(i) + Y(i)
  • Don't know n until runtime! n > Max. Vector
    Length (MVL)?

113
Strip Mining
  • Suppose Vector Length > Max. Vector Length (MVL)?
  • Strip mining: generation of code such that each
    vector operation is done for a size ≤ the MVL
  • 1st loop does the short piece (n mod MVL); all rest:
    VL = MVL (a Python rendering follows this slide)

      low = 1
      VL = (n mod MVL)           /* find the odd-size piece */
      do 1 j = 0, (n / MVL)      /* outer loop */
        do 10 i = low, low+VL-1  /* runs for length VL */
          Y(i) = a*X(i) + Y(i)   /* main operation */
 10     continue
        low = low + VL           /* start of next vector */
        VL = MVL                 /* reset the length to max */
 1    continue
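
A minimal Python sketch of the same strip-mining pattern (the inner loop stands in for one vector operation of length VL):

    def daxpy_strip_mined(a, X, Y, MVL=64):
        n, low = len(X), 0
        VL = n % MVL or MVL                  # odd-size first piece
        while low < n:
            for i in range(low, low + VL):   # one "vector op" of length VL
                Y[i] = a * X[i] + Y[i]
            low += VL
            VL = MVL                         # reset to maximum length
        return Y

    print(daxpy_strip_mined(2.0, [1.0] * 130, [0.0] * 130)[:3])   # [2.0, 2.0, 2.0]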

114
Common Vector Metrics
  • R