Title: Design Tradeoffs in Instruction Window of Superscalar Processors
1 Design Tradeoffs in Instruction Window of Superscalar Processors
MS Project Proposal
Presented by Chunming Gao
Committee members: Dr. Soner Onder (Chair), Dr. Steven Carr, Dr. David Poplawski, Dr. Jianping Dong
2 Outline of the presentation
Part one: Introduction
Part two: Background
Part three: Instruction window organizations
Part four: Work plan and preliminary results
3 Part One: Introduction
4 Motivation
1. Exploiting more instruction-level parallelism calls for a large instruction window.
2. Pursuing a high clock speed limits the size of the instruction window.
5 What Will We Study
1. Central window design
2. Distributed window design
3. Dependence-based window design
4. Cluster-based window design
5. PEWs (parallel execution windows)
6. Direct wake-up based window design
6 How Do We Define Performance
1. IPC (instructions per cycle)
2. Clock cycle time
3. The ratio of each design's IPC to that of a baseline processor
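The third metric is a simple ratio. A minimal sketch, assuming hypothetical instruction and cycle counts that are chosen only to illustrate the arithmetic and are not measured results; the function names are illustrative:

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions per cycle for one simulation run."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


def relative_ipc(design_ipc: float, baseline_ipc: float) -> float:
    """Ratio of a design's IPC to the baseline processor's IPC."""
    return design_ipc / baseline_ipc


# Hypothetical counts (assumptions, not results from the proposal).
baseline = ipc(instructions=1_000_000, cycles=500_000)   # IPC of 2.0
candidate = ipc(instructions=1_000_000, cycles=625_000)  # IPC of 1.6
print(relative_ipc(candidate, baseline))
```

A ratio below 1 means the candidate window loses IPC against the baseline; whether it still wins overall depends on the clock cycle time it allows.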
7 Part Two: Background
8 Superscalar Processor Stages
[Figure: pipeline of six stages (Fetch, Decode, Dispatch, Execute, Complete, Retire) with buffers between them: instruction buffer, dispatch buffer, issuing buffer, completion buffer, store buffer.]
9 Bottlenecks of Superscalar Processors
1. Structural hazards: a conflict between multiple instructions that require the same resource at the same time.
2. Control hazards: instructions following a branch cannot be executed until the branch is resolved.
3. Data hazards: an instruction depends on the result of a previous instruction.
10 Data Dependencies
1. True data dependencies
RAW (Read after write)
  i: add r3, r2, r1
  j: add r6, r3, r4
2. False data dependencies
WAR (Write after read)
  k: add r6, r3, r4
  l: add r3, r7, r1
WAW (Write after write)
  m: add r3, r2, r1
  n: add r3, r7, r1
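The three cases can be checked mechanically by comparing the destination and source registers of two instructions. A minimal sketch, assuming the operand order `op dest, src1, src2` used in the examples; the `Instr` type and `classify` function are illustrative names, not part of the proposal:

```python
from typing import NamedTuple


class Instr(NamedTuple):
    op: str
    dest: str
    src1: str
    src2: str


def classify(first: Instr, second: Instr) -> set:
    """Return the dependencies of `second` on the earlier instruction `first`."""
    deps = set()
    if first.dest in (second.src1, second.src2):
        deps.add("RAW")   # second reads what first writes
    if second.dest in (first.src1, first.src2):
        deps.add("WAR")   # second overwrites what first reads
    if second.dest == first.dest:
        deps.add("WAW")   # both write the same register
    return deps


i = Instr("add", "r3", "r2", "r1")
j = Instr("add", "r6", "r3", "r4")
k = Instr("add", "r6", "r3", "r4")
l = Instr("add", "r3", "r7", "r1")
m = Instr("add", "r3", "r2", "r1")
n = Instr("add", "r3", "r7", "r1")

print(classify(i, j))  # {'RAW'}
print(classify(k, l))  # {'WAR'}
print(classify(m, n))  # {'WAW'}
```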
11 Tomasulo's Algorithm
- A hardware algorithm for dynamically issuing multiple instructions in a pipelined processor.
- It provides a general mechanism for register forwarding and data hazard detection.
- It successfully deals with RAW, WAW, and WAR data dependencies.
- Two kinds of techniques are used:
  - Register renaming
  - Shelving
12 Register Renaming
Example:
  add r3, r2, r1     r2 + r1 -> r3
  div r6, r3, r4     r3 / r4 -> r6    (RAW)
  sub r3, r7, r1     r7 - r1 -> r3    (WAR, WAW)
Register renaming: r3 -> rr1, r6 -> rr2, r3 -> rr3
New instruction sequence:
  add rr1, r2, r1    r2 + r1 -> rr1
  div rr2, rr1, r4   rr1 / r4 -> rr2  (RAW)
  sub rr3, r7, r1    r7 - r1 -> rr3
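The renaming in this example can be reproduced with a small mapping table. This is a minimal sketch, assuming an unbounded pool of rename registers named rr1, rr2, ... and a fresh register for every destination; real hardware has a finite pool and must free registers, which is not modeled here. The `rename` function is an illustrative name:

```python
def rename(program):
    """Rename every destination to a fresh register rr1, rr2, ...

    `program` is a list of (op, dest, src1, src2) tuples in program order.
    """
    latest = {}    # architectural register -> most recent rename register
    renamed = []
    for count, (op, dest, src1, src2) in enumerate(program, start=1):
        # Sources read the latest rename, or the architectural register
        # itself if no earlier instruction in the sequence wrote it.
        s1 = latest.get(src1, src1)
        s2 = latest.get(src2, src2)
        new_dest = f"rr{count}"
        latest[dest] = new_dest
        renamed.append((op, new_dest, s1, s2))
    return renamed


program = [
    ("add", "r3", "r2", "r1"),
    ("div", "r6", "r3", "r4"),
    ("sub", "r3", "r7", "r1"),
]
for instr in rename(program):
    print(instr)
# ('add', 'rr1', 'r2', 'r1')
# ('div', 'rr2', 'rr1', 'r4')
# ('sub', 'rr3', 'r7', 'r1')
```

Only the RAW dependence from add to div survives; the WAR and WAW dependencies on r3 disappear because sub now writes rr3.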
13 Shelving
Reservation station: a buffer that holds decoded instructions while they wait to be issued for execution. Independent instructions are detected and RAW (true) data dependencies are handled here.
Possible reservation station entry components:
Op | Qj/Vj | VBj | Qk/Vk | VBk | Dest | BusyBit
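One way to read the entry layout is sketched below. It assumes that Qj/Vj and Qk/Vk are shared fields holding either a producer tag (Q) or an operand value (V), and that VBj and VBk are valid bits saying the field holds a value; the slide does not spell this out, so treat the interpretation and the `RSEntry` class as assumptions:

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class RSEntry:
    op: str
    qj_vj: Union[str, int]   # producer tag while waiting, value once valid
    vbj: bool                # True when qj_vj holds a value
    qk_vk: Union[str, int]
    vbk: bool
    dest: str
    busy: bool = True

    def ready(self) -> bool:
        """An occupied entry can issue once both operands are valid."""
        return self.busy and self.vbj and self.vbk

    def capture(self, tag: str, value: int) -> None:
        """Take a forwarded result if either operand is waiting on `tag`."""
        if not self.vbj and self.qj_vj == tag:
            self.qj_vj, self.vbj = value, True
        if not self.vbk and self.qk_vk == tag:
            self.qk_vk, self.vbk = value, True


# div rr2, rr1, r4 with rr1 still being computed and r4 already read as 4.
entry = RSEntry("div", "rr1", False, 4, True, "rr2")
print(entry.ready())      # False
entry.capture("rr1", 12)  # the producer of rr1 forwards its result
print(entry.ready())      # True
```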
14 What's the Instruction Window About
Instruction Window
- Holding decoded instructions
- Fetching operands
- Wake up instructions
- Select and issue instructions
[Figure: Instruction Decode, the Instruction Window, and four functional units (FUs).]
15 Instruction Window Design Space (1)
1. Reservation stations may vary
- Individual RSs
- Group RSs
- Central RS
[Figure: reservation stations (RS) and execution units (EU) for the three organizations.]
16 Instruction Window Design Space (2)
2. Operand fetching scheme may vary
Scheme 1: Direct check of the scoreboard bits
Scheme 2: Check of the explicit status bits
[Figure: register file, reservation station, and EU arrangement for each scheme.]
17 Part Three: Instruction Window Organizations
18 Central Window Design: Structure
1. One centralized reservation station holds every kind of instruction after decoding.
2. It serves all the functional units.
[Figure: Decoded Instructions, Reservation Station, Ready Instructions, EUs.]
19 Central Window Design: Components
[Figure: decoded instructions (Rs1, Rs2, Rd) access the register file, which updates Rd and sets its V-bit; the register file carries identifier, destination register, value, and valid-bit fields; reservation station entries hold OC, Os1/Is1, Vs1, Os2/Is2, Vs2, Rd; OC, Os1, Os2, Rd are sent to the EUs; each result (Result, Rd/identifier) drives an associative update of Is1 and Is2 with their V-bits.]
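The associative update and the selection of ready instructions can be sketched as follows. This is a simplified model under stated assumptions: operand values are omitted, each entry only tracks the set of source tags it still waits for, and selection is oldest-first. The `CentralWindow` class and its parameters are illustrative, not the proposal's simulator:

```python
class CentralWindow:
    def __init__(self, size, issue_width):
        self.size = size
        self.issue_width = issue_width
        self.entries = []   # kept in program order, oldest first

    def insert(self, op, dest, sources, ready_regs):
        """Add a decoded instruction; return False if the window is full."""
        if len(self.entries) >= self.size:
            return False
        waiting = {s for s in sources if s not in ready_regs}
        self.entries.append({"op": op, "dest": dest, "waiting": waiting})
        return True

    def wakeup(self, tag):
        """Broadcast a result tag to every entry (associative update)."""
        for entry in self.entries:
            entry["waiting"].discard(tag)

    def select(self):
        """Issue up to issue_width ready instructions, oldest first."""
        issued, kept = [], []
        for entry in self.entries:
            if not entry["waiting"] and len(issued) < self.issue_width:
                issued.append(entry)
            else:
                kept.append(entry)
        self.entries = kept
        return issued


ready_regs = {"r1", "r2", "r4", "r7"}
window = CentralWindow(size=4, issue_width=2)
window.insert("add", "rr1", ["r2", "r1"], ready_regs)
window.insert("div", "rr2", ["rr1", "r4"], ready_regs)
window.insert("sub", "rr3", ["r7", "r1"], ready_regs)

first = [e["op"] for e in window.select()]    # add and sub are ready
window.wakeup("rr1")                          # add's result tag is broadcast
second = [e["op"] for e in window.select()]   # div is now ready
print(first, second)
```

Note that `wakeup` touches every entry on every result, which is the cost that grows with window size.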
20 Central Window Design: Merits and Drawbacks
Advantages
1. A large register file is used, so more registers can be renamed.
2. A large reservation station is used, so more independent instructions can be detected.
3. Associative search allows more parallelism to be exploited.
Disadvantages
1. More ports are required.
2. Long wires are required.
3. A long clock cycle may result.
21 Distributed Window Design: Structure
1. Two or more reservation stations hold decoded instructions.
2. They serve different functional units.
[Figure: Decoded Instructions, Reservation Station 1 and Reservation Station 2, Ready Instructions, EUs.]
22 Distributed Window Design: Structure
[Figure: decoded instructions (OC, Rs1, Rs2, Rd) enter Reservation Station 1 and Reservation Station 2; the register file, accessed with Rs1, Rs2, and Rd, updates Rd and sets its V-bit and carries identifier, destination register, value, and valid-bit fields; the EUs return Result and Rd/Identifier.]
23 Distributed Window Design: Merits and Drawbacks
Advantages
1. Reservation stations are less complicated.
2. A shorter clock cycle may be achieved.
Disadvantages
1. Instructions are steered randomly or in round-robin mode.
2. The load in the different reservation stations may be unbalanced.
3. More ports are still needed to check the availability of the operands.
24 Dependence-based Window Design: Structure
1. Reservation stations are distributed.
2. The decoded instructions are steered into different FIFO queues according to dependencies.
[Figure: Rename/Steering, dependence-based FIFOs, register file, and EUs; results update the register file.]
25 Dependence-based Window Design: Steering Algorithm
For a decoded instruction I:
1. If all the operands are ready, I is steered to a new FIFO.
2. If one operand is not ready and no instruction is behind the instruction that produces it in a FIFO, put I into that FIFO; otherwise put I into a new FIFO.
3. If more than one operand is not ready, apply rule 2 to the first operand. If that is not suitable, apply it to the second operand.
4. If all the FIFOs are full or no empty FIFO is available, stall.
After the last instruction in a FIFO is issued, the FIFO is set free.
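The four rules can be sketched in a few lines. This is a minimal sketch under stated assumptions: instructions are dictionaries with a `dest` register and a list of `srcs`, a FIFO has a fixed `depth`, and a full producer FIFO is treated like an unsuitable one (the instruction then needs an empty FIFO). The `Steerer` class is an illustrative name, not the proposal's simulator:

```python
from collections import deque


class Steerer:
    def __init__(self, num_fifos, depth):
        self.fifos = [deque() for _ in range(num_fifos)]
        self.depth = depth

    def _fifo_with_tail_producer(self, reg):
        """Index of a non-full FIFO whose last instruction writes `reg`."""
        for idx, fifo in enumerate(self.fifos):
            if fifo and fifo[-1]["dest"] == reg and len(fifo) < self.depth:
                return idx
        return None

    def steer(self, instr, ready_regs):
        """Place `instr` in a FIFO; return its index, or None to stall."""
        pending = [s for s in instr["srcs"] if s not in ready_regs]
        # Rules 2 and 3: try each outstanding operand in order.
        for reg in pending:
            idx = self._fifo_with_tail_producer(reg)
            if idx is not None:
                self.fifos[idx].append(instr)
                return idx
        # Rule 1, and the fallback of rules 2 and 3: take an empty FIFO.
        for idx, fifo in enumerate(self.fifos):
            if not fifo:
                fifo.append(instr)
                return idx
        return None   # Rule 4: stall

    def issue(self, ready_regs):
        """Check only the FIFO heads; issue those whose operands are ready."""
        issued = []
        for fifo in self.fifos:
            if fifo and all(s in ready_regs for s in fifo[0]["srcs"]):
                issued.append(fifo.popleft())
        return issued   # a FIFO that becomes empty is free again


ready = {"r1", "r2", "r4", "r7"}
steerer = Steerer(num_fifos=2, depth=4)
add = {"op": "add", "dest": "rr1", "srcs": ["r2", "r1"]}
div = {"op": "div", "dest": "rr2", "srcs": ["rr1", "r4"]}
sub = {"op": "sub", "dest": "rr3", "srcs": ["r7", "r1"]}
mul = {"op": "mul", "dest": "rr4", "srcs": ["r2", "r4"]}

placements = [steerer.steer(x, ready) for x in (add, div, sub, mul)]
print(placements)   # [0, 0, 1, None]
```

The last instruction is independent but stalls because both FIFOs are occupied, which is the drawback this organization trades for head-only wakeup.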
26 Dependence-based Window Design: Merits and Drawbacks
Advantages
1. Issuing windows are distributed.
2. Only the heads of the FIFOs are checked, so broadcast for wakeup is avoided.
Disadvantage
An independent instruction always requires an additional FIFO to be steered into; if no FIFO is available, it stalls, which reduces overall performance.
27 Cluster-based Window Design: Structure
1. It's based on the dependence-based window design.
2. The FIFOs are clustered, with each cluster using a copy of the register file.
[Figure: Rename/Steering with Cluster 1 and Cluster 2; each cluster has its own dependence-based FIFOs, register file (Register File 1, Register File 2), and EUs.]
28 Cluster-based Window Design: Merits and Drawbacks (1)
Advantages
1. Issuing windows are distributed.
2. Only the heads of the FIFOs are checked, so broadcast for wakeup is avoided.
3. The number of ports on each register file can be reduced, and the register files are updated in parallel.
4. Local bypasses are used much more frequently than inter-cluster bypasses.
29 Cluster-based Window Design: Merits and Drawbacks (2)
Disadvantages
1. An independent instruction always requires an additional FIFO to be steered into; if no FIFO is available, it stalls, which reduces overall performance.
2. Inter-cluster bypasses decrease overall performance.
30 Parallel Execution Windows (PEWs): Structure
The instruction window is split into separate execution windows (pews), each with its own reservation station and register file. The pews communicate with each other to get the register data they need.
[Figure: a Distributor and four execution windows, pew0 to pew3.]
31 PEWs: Merits and Drawbacks
Advantages
1. Issuing windows are distributed.
2. Local operand fetching and update are efficient.
Disadvantage
Passing results to remote pews costs additional clock cycles.
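32 Direct Wakeup Window Design: Structure
[Figure: after Rename/Steering, an instruction I is classified as Ready or Not ready (labels Cnt=0 and Cnt<>0) and placed in the ready_queues or the wait_queues; other labeled elements are the Reorder Buffer, Wait_rslt, Wait_lop, wait_rop, the Wakeup_input_queue, and the EUs, with wakeup paths for wait_lop/wait_rop and for wait_rslt.]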
33 Direct Wakeup Window Design: Merits and Drawbacks
Advantages
1. Broadcast is avoided; only the dependent instructions are woken up.
2. Stalls happen only after the resources are fully occupied, so resource utilization is high.
Disadvantage
An extra stage is introduced to accommodate the complicated wakeup process, which increases the misprediction rollback penalty.
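The idea of waking only dependent instructions can be illustrated with explicit consumer lists. This is a generic sketch and an assumption on my part, not the proposal's queue structure: each instruction keeps a count of missing operands, each producer keeps the list of instructions waiting on it, and completion decrements only those counts. The `DirectWakeup` class and its method names are hypothetical:

```python
from collections import deque


class DirectWakeup:
    def __init__(self):
        self.count = {}        # instruction id -> operands still missing
        self.consumers = {}    # producer id -> ids of waiting instructions
        self.ready = deque()   # instructions whose count has reached zero

    def add(self, instr_id, producers):
        """Register an instruction and the in-flight producers it waits for."""
        self.count[instr_id] = len(producers)
        self.consumers.setdefault(instr_id, [])
        for producer in producers:
            self.consumers.setdefault(producer, []).append(instr_id)
        if not producers:
            self.ready.append(instr_id)

    def complete(self, instr_id):
        """Wake only the recorded consumers of a finished instruction."""
        woken = []
        for consumer in self.consumers.pop(instr_id, []):
            self.count[consumer] -= 1
            if self.count[consumer] == 0:
                self.ready.append(consumer)
                woken.append(consumer)
        return woken


dw = DirectWakeup()
dw.add(1, [])     # add: both operands available
dw.add(2, [1])    # div: waits for instruction 1
dw.add(3, [])     # sub: both operands available
print(list(dw.ready))   # [1, 3]
print(dw.complete(1))   # [2]
```

Completing instruction 3 touches no other instruction, whereas a broadcast scheme would compare its tag against every waiting entry.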
34 Part Four: Work Plan and Preliminary Simulations
35 Implementation Plan
1. Study the implemented designs:
- Central window design
- Dependence-based design
- Direct wakeup based design
2. Finish and verify the following designs:
- Distributed window design
- Cluster-based window design
- PEWs-based window design
36 Test Plan
1. Test using integer benchmarks and floating-point benchmarks.
2. Test using different architecture set-ups:
- Vary the issue width
- Vary the window size
- Vary the register file size
- Vary the number of functional units
3. Write report.
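The architecture set-ups in step 2 amount to a parameter sweep. A minimal sketch, assuming a full cross product of the four parameters; every value below is hypothetical, since the proposal does not list the settings it will use:

```python
from itertools import product

# Hypothetical parameter values (assumptions, not taken from the proposal).
issue_widths = [4, 8]
window_sizes = [32, 64, 128]
regfile_sizes = [64, 128]
fu_counts = [4, 8]

configs = [
    {"issue_width": w, "window_size": s, "regfile_size": r, "num_fus": f}
    for w, s, r, f in product(issue_widths, window_sizes, regfile_sizes, fu_counts)
]
print(len(configs))   # 24
print(configs[0])
```

Varying one parameter at a time around a baseline configuration would need fewer runs; the cross product is shown only because it is the simplest to write down.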
37 Preliminary Results (1)
[Figure: results chart comparing the central window, distributed window, dependence-based, cluster-based, and direct wakeup designs on the benchmarks 134.perl, 126.gcc, 099.go, 129.comprss, and 130.li.]
38 Preliminary Results (2)
[Figure: results chart comparing the central window, distributed window, dependence-based, cluster-based, and direct wakeup designs on the benchmarks 107.mgrid, 103.su2cor, 104.hydro2d, 101.tomcat, and 102.swim.]