Title: Half-Price Architecture
1. Half-Price Architecture
- Ilhyun Kim
- Mikko H. Lipasti
- PHARM Team
- University of Wisconsin-Madison
http://www.ece.wisc.edu/pharm
2. Motivations
- Processors are designed to handle 0-, 1-, and 2-source instructions at equal cost
- Satisfy the worst-case requirements of instructions
- No resource arbitration / pipeline stalls in handling source operands
- Simple control over the instruction and data stream
- Handling source operands requires 2x machine bandwidth
- e.g. 2 read ports / 1 write port per instruction
- Heavily multi-ported structures in many pipeline stages
3. Making the common case faster
- The 2x HW configuration assumes 2 source operands are common
- 18-36% of instructions have 2 source operands
- But, structures for 2 source operands are not fully utilized
- Scheduler
- 4-16% of instructions need two wakeups
- Less than 3% of instructions handle 2 wakeups in the same clock cycle
- Register File
- 0.64 read ports per instruction
- Less than 4% of instructions need two register read ports
- Handling 2 source operands may NOT be the common case
- → Why not build a pipeline optimized for 1-source instructions?
4. Half-price Architecture
- Restrict the processor's capability to handle 2 source operands
- 0- or 1-source instructions are processed without any restriction
- 2-source instructions may execute more slowly
- But, they are not the common case
- → Reduce hardware complexity incurred by 2 source operands
- ½ technique in the scheduler: Sequential wakeup
- ½ technique in the RF: Sequential register access
[Figure: instruction formats, from 2-source (Opcode, Rdst, Rsrc1, Rsrc2) down to opcode-only. The HW design point that matches the worst-case requirements covers the full 2-source format and needs more hardware; the half-price architecture design point is sized closer to the 0/1-source formats (Opcode, Rdst/Rsrc).]
5. 2-source-format instructions
[Chart: fraction of 2-src-format insts]
- 18-36% of dynamic instructions have the 2-source format (excluding stores)
6. Target identification: 2-source instructions
[Chart: 2-src-format insts vs. 2-src insts]
- 6-23% of instructions are 2-source instructions
- 2 unique source operands with dependences
- Dynamic behaviors of 2-source instructions will expose greater opportunities (a small classification sketch follows below)
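To make the successive definitions concrete, here is a minimal Python sketch (not from the talk) of the filters implied by slides 5, 6, and 10: 2-source format, then two unique register dependences, then both operands still pending at insert time. The field names, the ready() predicate, and the zero-register convention are assumptions for illustration.

    ZERO_REG = 31   # architectural zero register; the numbering is an assumption

    def source_class(inst, ready):
        """inst.srcs holds register-number sources (immediates excluded);
        ready(r) says whether register r's value is available at insert time."""
        regs = {r for r in inst.srcs if r != ZERO_REG}   # unique, non-zero register sources
        if len(inst.srcs) < 2:
            return "0/1-source format"
        if len(regs) < 2:
            return "2-source format only"    # duplicate or zero-register source
        if not any(ready(r) for r in regs):
            return "2-pending-source"        # subset of 2-source: needs two wakeups before issue
        return "2-source"                    # two real dependences, at least one already ready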
7. Outline
- Motivations
- Half-price architecture
- Reducing scheduler complexity
- Sequential wakeup
- Reducing register file complexity
- Conclusions & Future work
8. Scheduler complexity
- Overdesign in wakeup logic
- Tag comparators for two source operands
- Tag broadcast is expensive
- Delay is a function of the number of tag comparators and the bus length
- Speeding up the scheduler
- Clustered scheduler (Palacharla et al.)
- Making a small window look bigger (Michaud et al.)
- Hierarchical scheduler (Lebeck et al., Hrishikesh et al.)
- Reducing wakeup bus load capacitance
- Tag elimination / last-tag speculation (Ernst & Austin)
- Half-price technique: sequential wakeup
9. Last-tag speculation (Ernst & Austin, ISCA '02)
- Only the last-arriving operand initiates instruction issue
- Remove tag comparison logic for the early-arriving operand
- Fewer tag comparators → reduced load on the bus + compact wakeup logic → scheduling logic cycle time improvement
- A scoreboard checks correctness of scheduling
- May hurt performance due to its speculative nature
- Implementation issue w/ broadcast-based selective recovery
- → Our technique exploits last-arriving operands non-speculatively, achieving similar benefits
10. 2-pending-source instructions
- Many operands are already ready at insert time
- 4-16% of instructions have 2-pending-source operands, requiring two wakeup signals before being issued
[Chart: 2-src insts vs. 2-pending-src insts]
11. Slack between two wakeups
- Many 2-pending-source instructions have wakeup slack
- Less than 3% of instructions have 0-slack wakeups
- → Exploit wakeup slack to prioritize operand wakeups
[Chart: 2-pending-src insts vs. simultaneous wakeup (0 slack)]
12. ½ technique: Sequential wakeup
- Sequentially wake up ½ of the operands during the wakeup slack (a behavioral sketch follows the diagram below)
- Decouples half of the tag comparators → reduced load on the bus
- Flexible routing in the slow wakeup bus → compact fast wakeup logic
- No recovery, lower misprediction penalty (1-cycle issue delay)
- Instructions are issued non-speculatively in terms of operand readiness
- Simultaneous (0-slack) wakeups always incur the penalty
- But, they are less than 3% of instructions
[Diagram: wakeup entry with two ready-bit / tag-comparator pairs (readyL/tagL and readyR/tagR) snooping tag buses tag 1..tag W. One pair is tied to the fast wakeup bus, the other to the slow wakeup bus (1 clk behind, driven through a latch); the tag predicted to be last-arriving is placed on the fast-bus side.]
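The following is a hedged behavioral sketch (Python, not the authors' RTL) of the fast/slow-bus timing: the slow bus repeats the fast-bus broadcasts one clock later, and the tag predicted to arrive last is the one watched on the fast bus, so only simultaneous (0-slack) wakeups pay the 1-cycle issue delay. Class and field names are illustrative.

    from collections import deque

    class SeqWakeupEntry:
        def __init__(self, fast_tag=None, slow_tag=None):
            self.fast_tag, self.slow_tag = fast_tag, slow_tag
            self.fast_ready = fast_tag is None      # a missing operand counts as ready
            self.slow_ready = slow_tag is None

        def snoop(self, fast_bus, slow_bus):
            self.fast_ready |= self.fast_tag in fast_bus
            self.slow_ready |= self.slow_tag in slow_bus
            return self.fast_ready and self.slow_ready   # eligible to request issue

    # Drive a 0-slack case: both producers (tags 3 and 7) broadcast in the same cycle.
    entry = SeqWakeupEntry(fast_tag=3, slow_tag=7)
    delayed = deque([set()], maxlen=1)               # 1-cycle latch feeding the slow bus
    for cycle, fast_bus in enumerate([{3, 7}, set(), set()]):
        ready = entry.snoop(fast_bus, delayed[0])
        delayed.append(fast_bus)
        print(cycle, ready)   # ready only from cycle 1 on: the 1-cycle issue delay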
13. Machine models
- SimpleScalar-Alpha-based, 12-stage, 4/8-wide OoO, speculative scheduling
- Alpha-style squashing scheduling recovery
- Invalidates all issued instructions (dependent / independent) behind the miss
- 4-wide: 64 RUUs, 32 LSQs, 2 memory ports
- 8-wide: 128 RUUs, 64 LSQs, 4 memory ports
- 64K IL1 (2), 64K DL1 (2), 512K unified L2 (8)
- Combined (bimodal + gShare) branch prediction, fetch until the first taken branch
- Sequential wakeup
- Last-arriving operand predictor: 1K-entry, PC-direct-mapped, 2-bit bimodal (see the sketch below)
- Last-tag speculation
- Same predictor
- Scoreboard located next to the scheduler
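A minimal sketch of the last-arriving operand predictor configured above (1K-entry, PC-direct-mapped, 2-bit bimodal). Which source the counter votes for and the exact PC indexing are assumptions for illustration.

    class LastArrivingPredictor:
        """Predicts which of a 2-source instruction's operands arrives last."""
        def __init__(self, entries=1024):
            self.ctr = [2] * entries                 # 2-bit saturating counters

        def _index(self, pc):
            return (pc >> 2) % len(self.ctr)         # direct-mapped on the instruction PC

        def left_is_last(self, pc):
            return self.ctr[self._index(pc)] >= 2    # predict the left source arrives last

        def train(self, pc, left_was_last):
            i = self._index(pc)
            self.ctr[i] = min(3, self.ctr[i] + 1) if left_was_last else max(0, self.ctr[i] - 1)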
14. Sequential wakeup performance
[Charts: 4-wide and 8-wide results]
- Sequential wakeup slowdown is slight: avg 0.4% / 0.6%, worst 2.1%
- Less than 4% of instructions incur the penalty
- Sequential wakeup is relatively insensitive to predictor accuracy
- → Sequential wakeup can reduce wakeup logic delay with a minimal performance impact
15. Outline
- Motivations
- Half-price architecture
- Reducing scheduler complexity
- Reducing register file complexity
- Sequential register file access
- Conclusions & Future work
16. Register file complexity
- Overdesign in the register file
- 2x read ports for two source operands
- Superscalar processors need the RF to be heavily multiported
- Area increases quadratically and latency increases linearly with the number of ports
- Two read ports are not fully utilized
- 0- / 1-source instructions do not require two read ports
- Many instructions frequently get values off the bypass path
- 0.64 read ports / instruction (Balasubramonian et al., ISCA '01)
- Speeding up the RF
- Reducing the number of register entries
- Hierarchical register file (Cruz, Borch, Balasubramonian, ...)
- Reducing the number of ports
- Fewer RF ports + crossbar (Balasubramonian et al., Park et al.)
- Half-price technique: Sequential RF access
17. Two RF read port accesses
- Less than 4% of instructions need 2 read-port accesses
- Many 2-source instructions read at least one value off the bypass path
[Charts: 2-src insts vs. insts requiring 2 read ports, 4-wide and 8-wide]
18. ½ technique: Sequential RF access
- Remove ½ of the register read ports
- Only a single read port per issue slot
- 0- or 1-source instructions are processed without any restriction
- Sequentially access the single port twice for 2 values if needed (the execution latency increases by 1 clock cycle; see the sketch after this slide)
- However, speculative scheduling does not allow variable-latency operations (Implementing optimizations at decode time, ISCA '02)
- Load latency misprediction → scheduling recovery
- Variable RF latency → scheduling recovery, too
- → Sequential RF access should be reflected in scheduling
- How to detect if source values will be read off the bypass path?
- How to schedule dependent instructions accordingly?
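A hedged sketch of the per-issue-slot read path described above: operands caught on the bypass cost no port, and only when two values must come from the RF is the single port used a second time, one cycle later. Function and field names are illustrative.

    def read_operands(inst, regfile, bypass):
        """One physical read port per issue slot; returns (values, extra_cycles)."""
        values, rf_reads = [], 0
        for src in inst.srcs:                 # register sources only; immediates excluded
            if src in bypass:
                values.append(bypass[src])    # read off the bypass path, no port needed
            else:
                values.append(regfile[src])   # consumes the slot's single read port
                rf_reads += 1
        extra_cycles = 1 if rf_reads == 2 else 0   # second access is serialized on the same port
        return values, extra_cycles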
19. Scheduling in sequential RF access
- Back-to-back issue = reading values off the bypass
- Back-to-back issue makes dependent instructions fall within the bypass window
- Non-back-to-back issue, or 2 ready sources at insert time, incurs sequential RF access (assuming a 1-clk-cycle bypass window)
- Scheduler considerations (sketched below)
- !(wakeup & selected) in the same cycle → sequential RF access
- Delay the tag broadcast by 1 clock cycle
- Block the issue slot (only the one w/ seq RF access) for 1 cycle for the non-pipelined RF access operation
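A hedged sketch of the scheduler-side rule listed above, applied to a selected 2-source instruction. The fields wakeup_cycle and two_ready_at_insert are assumptions standing in for whatever state a real scheduler entry keeps.

    def on_select(inst, cycle):
        """Decide whether the selected instruction needs a sequential RF read and
        adjust its tag broadcast and the issue slot accordingly (1-cycle bypass window)."""
        back_to_back = (inst.wakeup_cycle == cycle)        # woken up and selected together
        seq_rf = inst.two_ready_at_insert or not back_to_back
        # Dependents must wake up one cycle later, matching the extra RF read.
        inst.broadcast_delay = 1 if seq_rf else 0
        # The slot performs a non-pipelined second port access, so it is blocked next cycle.
        block_slot_next_cycle = seq_rf
        return seq_rf, block_slot_next_cycle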
20. Machine models
- SimpleScalar-Alpha-based, 12-stage, 4/8-wide OoO, speculative scheduling (same as before)
- Alpha-style squashing scheduling recovery
- Invalidates all issued instructions (dependent / independent) behind the miss
- 4-wide: 64 RUUs, 32 LSQs, 2 memory ports
- 8-wide: 128 RUUs, 64 LSQs, 4 memory ports
- 64K IL1 (2), 64K DL1 (2), 512K unified L2 (8)
- Combined (bimodal + gShare) branch prediction, fetch until the first taken branch
- Sequential RF access
- ½ read-ported RF (1 read port / issue slot)
- Comparison cases
- Pipelined RF (1 extra RF stage)
- ½ read-ported RF (same as sequential RF access) + crossbar
21. Sequential RF access performance
[Charts: 4-wide and 8-wide results]
- Seq RF access slowdown is slight: avg 1.1% / 0.7%, worst 2.2%
- 1 extra RF stage requires extra bypass paths
- ½ read ports + crossbar almost achieves base performance
- But: crossbar complexity, global RF port arbitration
- → Sequential RF access reduces the number of RF read ports with a minimal performance impact
22. Sequential wakeup + RF access
- Performance degradation: avg 2.2%, worst 4.8%
- Reduced wakeup bus load capacitance, fewer RF read ports
- → Half-price techniques reduce HW complexity while reaping most of the performance of a conventional pipeline
23. Conclusions & Future work
- Processors are overdesigned to process 0-, 1-, and 2-source instructions at equal cost
- Handling 2-source instructions may not be the common case
- Only a small fraction of instructions utilize the overdesigned hardware
- Reduce HW complexity by restricting the processor's capability of handling 2-source instructions
- Sequential wakeup, sequential RF access
- The performance impact is minimal
- The basic concept can be extended to all pipeline stages
- Register rename, ready-information check, bypass logic
- Changing the pipeline design from instruction- to operand-granularity
24. Questions?
25. Last-arriving operand predictor accuracy
[Charts: predictor accuracy for the 4-wide and 8-wide machines]
26. ½ technique: Sequential wakeup
- Sequential wakeup example (timeline figure omitted; a hedged reading follows):
    ADD r1, r2, r3
    SUB r3, r4, r5
    XOR r5, 1, r6
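A hedged reading of the example above (reading the third operand as the destination, Alpha-style; the original timeline is not reproduced, so the timing details are assumptions):

    # SUB depends on ADD through r3; its other source r4 is assumed to become ready
    # earlier. The last-arriving operand predictor would then place r3's tag on the
    # fast-bus comparator, while r4 is tracked on the slow bus (1 clk behind), which
    # the earlier slack hides. XOR has only one register source (r5 from SUB) plus an
    # immediate, so sequential wakeup does not restrict it.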
27. ½ technique: Sequential RF access
- Scheduler changes for sequential RF access
- Sequential RF access example (figure omitted; a hedged reading follows):
    ADD r1, r2, r3
    SUB r3, r4, r5
    XOR r5, 1, r6
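A hedged reading of the same chain under sequential RF access (destination-last operands, back-to-back issue, and a 1-cycle bypass window are assumptions):

    # ADD: if both r1 and r2 were already ready at insert time, both values must come
    #      from the RF, so the slot's single read port is used twice (sequential
    #      access, +1 cycle) and ADD's tag broadcast is delayed by one clock.
    # SUB: issued back-to-back with ADD, it catches r3 on the bypass path, so only r4
    #      needs the read port and no sequential access is required.
    # XOR: one register source (r5, also bypassed) plus an immediate; a single read
    #      port always suffices.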