Title: Half-Price Architecture
1. Half-Price Architecture
- Ilhyun Kim
- Mikko H. Lipasti
- PHARM Team
- University of Wisconsin-Madison
http://www.ece.wisc.edu/pharm
2. Motivations
- Processors are designed to handle 0-, 1-, and 2-source instructions at equal cost
- Satisfy the worst-case requirements of instructions
- No resource arbitration / pipeline stalls in handling source operands
- Simple control over the instruction and data stream
- Handling source operands requires 2x machine bandwidth
- e.g. 2 read ports / 1 write port per instruction
- Heavily multi-ported structures in many pipeline stages
3. Making the common case faster
- The 2x HW configuration assumes 2 source operands are common
- 18-36% of instructions have 2 source operands
- But, structures for 2 source operands are not fully utilized
- Scheduler
- 4-16% of instructions need two wakeups
- Less than 3% of instructions handle 2 wakeups in the same clock cycle
- Register File
- 0.64 read ports per instruction
- Less than 4% of instructions need two register read ports
- Handling 2 source operands may NOT be the common case
- → Why not build a pipeline optimized for 1-source instructions?
4. Half-price Architecture
- Restrict the processor's capability to handle 2 source operands
- 0- or 1-source instructions are processed without any restriction
- 2-source instructions may execute more slowly
- But, they are not the common case
- → Reduce hardware complexity incurred by 2 source operands
- ½ technique in the scheduler: Sequential wakeup
- ½ technique in the RF: Sequential register access
[Figure: instruction formats, from 2-source (Opcode, Rdst, Rsrc1, Rsrc2) down to opcode-only. The HW design point that matches the worst-case requirements covers the full 2-source format and needs more hardware; the half-price architecture design point is sized closer to the 0/1-source formats (Opcode, Rdst/Rsrc).]
5. 2-source-format instructions
[Chart: fraction of 2-src-format insts]
- 18-36% of dynamic instructions have the 2-source format (excluding stores)
6. Target identification: 2-source instructions
[Chart: 2-src-format insts vs. 2-src insts]
- 6-23% of instructions are 2-source instructions
- 2 unique source operands with dependences
- Dynamic behaviors of 2-source instructions will expose greater opportunities (a small classification sketch follows below)
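To make the successive definitions concrete, here is a minimal Python sketch (not from the talk) of the filters implied by slides 5, 6, and 10: 2-source format, then two unique register dependences, then both operands still pending at insert time. The field names, the ready() predicate, and the zero-register convention are assumptions for illustration.

    ZERO_REG = 31   # architectural zero register; the numbering is an assumption

    def source_class(inst, ready):
        """inst.srcs holds register-number sources (immediates excluded);
        ready(r) says whether register r's value is available at insert time."""
        regs = {r for r in inst.srcs if r != ZERO_REG}   # unique, non-zero register sources
        if len(inst.srcs) < 2:
            return "0/1-source format"
        if len(regs) < 2:
            return "2-source format only"    # duplicate or zero-register source
        if not any(ready(r) for r in regs):
            return "2-pending-source"        # subset of 2-source: needs two wakeups before issue
        return "2-source"                    # two real dependences, at least one already ready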
7. Outline
- Motivations
- Half-price architecture
- Reducing scheduler complexity
- Sequential wakeup
- Reducing register file complexity
- Conclusions & Future work
8. Scheduler complexity
- Overdesign in wakeup logic
- Tag comparators for two source operands
- Tag broadcast is expensive
- Delay is a function of the number of tag comparators and the bus length
- Speeding up the scheduler
- Clustered scheduler (Palacharla et al.)
- Making a small window look bigger (Michaud et al.)
- Hierarchical scheduler (Lebeck et al., Hrishikesh et al.)
- Reducing wakeup bus load capacitance
- Tag elimination / last-tag speculation (Ernst & Austin)
- Half-price technique: sequential wakeup
9. Last-tag speculation (Ernst & Austin, ISCA '02)
- Only the last-arriving operand initiates instruction issue
- Remove tag comparison logic for the early-arriving operand
- Fewer tag comparators → reduced load on the bus + compact wakeup logic → scheduling logic cycle time improvement
- A scoreboard checks correctness of scheduling
- May hurt performance due to its speculative nature
- Implementation issue w/ broadcast-based selective recovery
- → Our technique exploits last-arriving operands non-speculatively, achieving similar benefits
10. 2-pending-source instructions
- Many operands are already ready at insert time
- 4-16% of instructions have 2-pending-source operands, requiring two wakeup signals before being issued
[Chart: 2-src insts vs. 2-pending-src insts]
11. Slack between two wakeups
- Many 2-pending-source instructions have wakeup slack
- Less than 3% of instructions have 0-slack wakeups
- → Exploit wakeup slack to prioritize operand wakeups
[Chart: 2-pending-src insts vs. simultaneous wakeup (0 slack)]
12. ½ technique: Sequential wakeup
- Sequentially wake up ½ of the operands during the wakeup slack (a behavioral sketch follows the diagram below)
- Decouples half of the tag comparators → reduced load on the bus
- Flexible routing in the slow wakeup bus → compact fast wakeup logic
- No recovery, lower misprediction penalty (1-cycle issue delay)
- Instructions are issued non-speculatively in terms of operand readiness
- Simultaneous (0-slack) wakeups always incur the penalty
- But, they are less than 3% of instructions
[Diagram: wakeup entry with two ready-bit / tag-comparator pairs (readyL/tagL and readyR/tagR) snooping tag buses tag 1..tag W. One pair is tied to the fast wakeup bus, the other to the slow wakeup bus (1 clk behind, driven through a latch); the tag predicted to be last-arriving is placed on the fast-bus side.]
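The following is a hedged behavioral sketch (Python, not the authors' RTL) of the fast/slow-bus timing: the slow bus repeats the fast-bus broadcasts one clock later, and the tag predicted to arrive last is the one watched on the fast bus, so only simultaneous (0-slack) wakeups pay the 1-cycle issue delay. Class and field names are illustrative.

    from collections import deque

    class SeqWakeupEntry:
        def __init__(self, fast_tag=None, slow_tag=None):
            self.fast_tag, self.slow_tag = fast_tag, slow_tag
            self.fast_ready = fast_tag is None      # a missing operand counts as ready
            self.slow_ready = slow_tag is None

        def snoop(self, fast_bus, slow_bus):
            self.fast_ready |= self.fast_tag in fast_bus
            self.slow_ready |= self.slow_tag in slow_bus
            return self.fast_ready and self.slow_ready   # eligible to request issue

    # Drive a 0-slack case: both producers (tags 3 and 7) broadcast in the same cycle.
    entry = SeqWakeupEntry(fast_tag=3, slow_tag=7)
    delayed = deque([set()], maxlen=1)               # 1-cycle latch feeding the slow bus
    for cycle, fast_bus in enumerate([{3, 7}, set(), set()]):
        ready = entry.snoop(fast_bus, delayed[0])
        delayed.append(fast_bus)
        print(cycle, ready)   # ready only from cycle 1 on: the 1-cycle issue delay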
13. Machine models
- SimpleScalar-Alpha-based, 12-stage, 4/8-wide OoO, speculative scheduling
- Alpha-style squashing scheduling recovery
- Invalidates all issued instructions (dependent / independent) behind the miss
- 4-wide: 64 RUUs, 32 LSQs, 2 memory ports
- 8-wide: 128 RUUs, 64 LSQs, 4 memory ports
- 64K IL1 (2), 64K DL1 (2), 512K unified L2 (8)
- Combined (bimodal + gShare) branch prediction, fetch until the first taken branch
- Sequential wakeup
- Last-arriving operand predictor: 1K-entry, PC-direct-mapped, 2-bit bimodal (see the sketch below)
- Last-tag speculation
- Same predictor
- Scoreboard located next to the scheduler
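A minimal sketch of the last-arriving operand predictor configured above (1K-entry, PC-direct-mapped, 2-bit bimodal). Which source the counter votes for and the exact PC indexing are assumptions for illustration.

    class LastArrivingPredictor:
        """Predicts which of a 2-source instruction's operands arrives last."""
        def __init__(self, entries=1024):
            self.ctr = [2] * entries                 # 2-bit saturating counters

        def _index(self, pc):
            return (pc >> 2) % len(self.ctr)         # direct-mapped on the instruction PC

        def left_is_last(self, pc):
            return self.ctr[self._index(pc)] >= 2    # predict the left source arrives last

        def train(self, pc, left_was_last):
            i = self._index(pc)
            self.ctr[i] = min(3, self.ctr[i] + 1) if left_was_last else max(0, self.ctr[i] - 1)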
14. Sequential wakeup performance
[Charts: 4-wide and 8-wide results]
- Sequential wakeup slowdown is slight: avg 0.4% / 0.6%, worst 2.1%
- Less than 4% of instructions incur the penalty
- Sequential wakeup is relatively insensitive to predictor accuracy
- → Sequential wakeup can reduce wakeup logic delay with a minimal performance impact
15. Outline
- Motivations
- Half-price architecture
- Reducing scheduler complexity
- Reducing register file complexity
- Sequential register file access
- Conclusions & Future work
16. Register file complexity
- Overdesign in the register file
- 2x read ports for two source operands
- Superscalar processors need the RF to be heavily multiported
- Area increases quadratically and latency increases linearly with the number of ports
- Two read ports are not fully utilized
- 0- / 1-source instructions do not require two read ports
- Many instructions frequently get values off the bypass path
- 0.64 read ports / instruction (Balasubramonian et al., ISCA '01)
- Speeding up the RF
- Reducing the number of register entries
- Hierarchical register file (Cruz, Borch, Balasubramonian, ...)
- Reducing the number of ports
- Fewer RF ports + crossbar (Balasubramonian et al., Park et al.)
- Half-price technique: Sequential RF access
17. Two RF read port accesses
- Less than 4% of instructions need 2 read-port accesses
- Many 2-source instructions read at least one value off the bypass path
[Charts: 2-src insts vs. insts requiring 2 read ports, 4-wide and 8-wide]
18. ½ technique: Sequential RF access
- Remove ½ of the register read ports
- Only a single read port per issue slot
- 0- or 1-source instructions are processed without any restriction
- Sequentially access the single port twice for 2 values if needed (the execution latency increases by 1 clock cycle; see the sketch after this slide)
- However, speculative scheduling does not allow variable-latency operations (Implementing optimizations at decode time, ISCA '02)
- Load latency misprediction → scheduling recovery
- Variable RF latency → scheduling recovery, too
- → Sequential RF access should be reflected in scheduling
- How to detect if source values will be read off the bypass path?
- How to schedule dependent instructions accordingly?
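A hedged sketch of the per-issue-slot read path described above: operands caught on the bypass cost no port, and only when two values must come from the RF is the single port used a second time, one cycle later. Function and field names are illustrative.

    def read_operands(inst, regfile, bypass):
        """One physical read port per issue slot; returns (values, extra_cycles)."""
        values, rf_reads = [], 0
        for src in inst.srcs:                 # register sources only; immediates excluded
            if src in bypass:
                values.append(bypass[src])    # read off the bypass path, no port needed
            else:
                values.append(regfile[src])   # consumes the slot's single read port
                rf_reads += 1
        extra_cycles = 1 if rf_reads == 2 else 0   # second access is serialized on the same port
        return values, extra_cycles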
19. Scheduling in sequential RF access
- Back-to-back issue = reading values off the bypass
- Back-to-back issue makes dependent instructions fall within the bypass window
- Non-back-to-back issue, or 2 ready sources at insert time, incurs sequential RF access (assuming a 1-clk-cycle bypass window)
- Scheduler considerations (sketched below)
- !(wakeup & selected) in the same cycle → sequential RF access
- Delay the tag broadcast by 1 clock cycle
- Block the issue slot (only the one w/ seq RF access) for 1 cycle for the non-pipelined RF access operation
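A hedged sketch of the scheduler-side rule listed above, applied to a selected 2-source instruction. The fields wakeup_cycle and two_ready_at_insert are assumptions standing in for whatever state a real scheduler entry keeps.

    def on_select(inst, cycle):
        """Decide whether the selected instruction needs a sequential RF read and
        adjust its tag broadcast and the issue slot accordingly (1-cycle bypass window)."""
        back_to_back = (inst.wakeup_cycle == cycle)        # woken up and selected together
        seq_rf = inst.two_ready_at_insert or not back_to_back
        # Dependents must wake up one cycle later, matching the extra RF read.
        inst.broadcast_delay = 1 if seq_rf else 0
        # The slot performs a non-pipelined second port access, so it is blocked next cycle.
        block_slot_next_cycle = seq_rf
        return seq_rf, block_slot_next_cycle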
20. Machine models
- SimpleScalar-Alpha-based, 12-stage, 4/8-wide OoO, speculative scheduling (same as before)
- Alpha-style squashing scheduling recovery
- Invalidates all issued instructions (dependent / independent) behind the miss
- 4-wide: 64 RUUs, 32 LSQs, 2 memory ports
- 8-wide: 128 RUUs, 64 LSQs, 4 memory ports
- 64K IL1 (2), 64K DL1 (2), 512K unified L2 (8)
- Combined (bimodal + gShare) branch prediction, fetch until the first taken branch
- Sequential RF access
- ½ read-ported RF (1 read port / issue slot)
- Comparison cases
- Pipelined RF (1 extra RF stage)
- ½ read-ported RF (same as sequential RF access) + crossbar
21. Sequential RF access performance
[Charts: 4-wide and 8-wide results]
- Seq RF access slowdown is slight: avg 1.1% / 0.7%, worst 2.2%
- 1 extra RF stage requires extra bypass paths
- ½ read ports + crossbar almost achieves base performance
- But: crossbar complexity, global RF port arbitration
- → Sequential RF access reduces the number of RF read ports with a minimal performance impact
22. Sequential wakeup + RF access
- Performance degradation: avg 2.2%, worst 4.8%
- Reduced wakeup bus load capacitance, fewer RF read ports
- → Half-price techniques reduce HW complexity while reaping most of the performance of a conventional pipeline
23. Conclusions & Future work
- Processors are overdesigned to process 0-, 1-, and 2-source instructions at equal cost
- Handling 2-source instructions may not be the common case
- Only a small fraction of instructions utilize the overdesigned hardware
- Reduce HW complexity by restricting the processor's capability of handling 2-source instructions
- Sequential wakeup, sequential RF access
- The performance impact is minimal
- The basic concept can be extended to all pipeline stages
- Register rename, ready-information check, bypass logic
- Changing the pipeline design from instruction- to operand-granularity
24. Questions?
25. Last-arriving operand predictor accuracy
[Charts: predictor accuracy for the 4-wide and 8-wide machines]
26. ½ technique: Sequential wakeup
- Sequential wakeup example (timeline figure omitted; a hedged reading follows):
    ADD r1, r2, r3
    SUB r3, r4, r5
    XOR r5, 1, r6
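A hedged reading of the example above (reading the third operand as the destination, Alpha-style; the original timeline is not reproduced, so the timing details are assumptions):

    # SUB depends on ADD through r3; its other source r4 is assumed to become ready
    # earlier. The last-arriving operand predictor would then place r3's tag on the
    # fast-bus comparator, while r4 is tracked on the slow bus (1 clk behind), which
    # the earlier slack hides. XOR has only one register source (r5 from SUB) plus an
    # immediate, so sequential wakeup does not restrict it.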
27. ½ technique: Sequential RF access
- Scheduler changes for sequential RF access
- Sequential RF access example (figure omitted; a hedged reading follows):
    ADD r1, r2, r3
    SUB r3, r4, r5
    XOR r5, 1, r6
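A hedged reading of the same chain under sequential RF access (destination-last operands, back-to-back issue, and a 1-cycle bypass window are assumptions):

    # ADD: if both r1 and r2 were already ready at insert time, both values must come
    #      from the RF, so the slot's single read port is used twice (sequential
    #      access, +1 cycle) and ADD's tag broadcast is delayed by one clock.
    # SUB: issued back-to-back with ADD, it catches r3 on the bypass path, so only r4
    #      needs the read port and no sequential access is required.
    # XOR: one register source (r5, also bypassed) plus an immediate; a single read
    #      port always suffices.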