Half-Price Architecture - Transcript and Presenter's Notes
1
Half-Price Architecture
  • Ilhyun Kim
  • Mikko H. Lipasti
  • PHARM Team
  • University of Wisconsin-Madison

http://www.ece.wisc.edu/pharm
2
Motivations
  • Processors are designed to handle 0-, 1- and
    2-source instructions at equal cost
  • Satisfy the worst-case requirements of
    instructions
  • No resource arbitration / pipeline stalls in
    handling source operands
  • Simple control over the instruction and data
    streams
  • Handling source operands requires 2x machine
    bandwidth
  • e.g. 2 read ports / 1 write port per instruction
  • Heavily multi-ported structures in many pipeline
    stages

3
Making the common case faster
  • 2x HW configuration assumes 2 source operands are
    common
  • 18-36% of instructions have 2 source operands
  • But, structures for 2 source operands are not
    fully utilized
  • Scheduler
  • 4-16% of instructions need two wakeups
  • Less than 3% of instructions handle 2 wakeups in
    the same clock cycle
  • Register File
  • 0.64 read ports per instruction
  • Less than 4% of instructions need two register
    read ports
  • Handling 2 source operands may NOT be the common
    case
  • → Why not build a pipeline optimized for 1-source
    instructions?

4
Half-price Architecture
  • Restrict the processor's capability to handle 2
    source operands
  • 0- or 1-source instructions are processed without
    any restriction
  • 2-source instructions may execute more slowly
  • But, they are not the common case
  • → Reduce hardware complexity incurred by 2 source
    operands
  • ½ technique in scheduler: sequential wakeup
  • ½ technique in RF: sequential register access

[Figure: instruction-format design points. The conventional HW design point
matches the worst case (Opcode, Rdst, Rsrc1, Rsrc2); formats with more source
fields need more hardware. The half-price architecture design point targets
the smaller formats (Opcode, Rdst, Rsrc1; Opcode, Rdst/Rsrc; Opcode).]
5
2-source-format instructions
[Chart: fraction of 2-source-format instructions per benchmark]
  • 18-36% of dynamic instructions have a 2-source
    format (excluding stores)

6
Target identification: 2-source instructions
[Chart: 2-source-format insts vs. 2-source insts per benchmark]
  • 6-23% of instructions are 2-source instructions
  • 2 unique source operands with dependences (see the
    classification sketch below)
  • Dynamic behavior of 2-source instructions will
    expose greater opportunities
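As a concrete illustration of these categories, here is a minimal Python
sketch (mine, not the authors' tooling) that separates instructions that
merely have a 2-source format from true 2-source instructions, i.e. those
with two unique register sources that each carry a dependence. The
has_dependence predicate is a hypothetical stand-in for whatever the
surrounding simulator provides.

```python
# Sketch: distinguish 2-source-format instructions from "true" 2-source
# instructions (two unique register sources, each with a dependence).
# `has_dependence` is an illustrative assumption, not the authors' code.

def classify(srcs, has_dependence):
    """srcs: tuple of source register numbers; has_dependence(reg) -> bool."""
    unique = set(srcs)
    two_src_format = len(srcs) == 2
    two_src = two_src_format and len(unique) == 2 and all(
        has_dependence(r) for r in unique)
    return two_src_format, two_src

# Example: addq r1, r2, r3 has a 2-source format; it is a 2-source
# instruction only if r1 and r2 are distinct and both carry dependences.
print(classify((1, 2), has_dependence=lambda r: True))   # (True, True)
print(classify((7, 7), has_dependence=lambda r: True))   # (True, False)
```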

7
Outline
  • Motivations
  • Half-price architecture
  • Reducing scheduler complexity
  • Sequential wakeup
  • Reducing register file complexity
  • Conclusions & Future work

8
Scheduler complexity
  • Overdesign in wakeup logic
  • Tag comparators for two source operands (sketched
    below)
  • Tag broadcast is expensive
  • Delay is a function of tag comparators and bus
    length
  • Speeding up the scheduler
  • Clustered scheduler (Palacharla et al.)
  • Making a small window look bigger (Michaud et
    al.)
  • Hierarchical scheduler (Lebeck et al., Hrishikesh
    et al.)
  • Reducing wakeup bus load capacitance
  • Tag elimination / last-tag speculation (Ernst &
    Austin)
  • Half-price technique: sequential wakeup
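For reference, the following Python sketch models conventional broadcast
wakeup: every issue-queue entry keeps one tag comparator per source, so each
broadcast drives two comparators per entry. The entry layout and names are
illustrative assumptions, not a real design.

```python
# Sketch of conventional CAM-style wakeup: each queue entry compares every
# broadcast destination tag against both of its source tags, so the wakeup
# bus loads two comparators per entry (the load sequential wakeup halves).

class Entry:
    def __init__(self, src_tags):
        # src_tags: up to two physical-register tags; missing = no source
        self.src = list(src_tags) + [None] * (2 - len(src_tags))
        self.ready = [t is None for t in self.src]

    def wakeup(self, bcast_tags):
        """Compare broadcast tags against both source tags (2 CAM matches)."""
        for i, tag in enumerate(self.src):
            if tag is not None and tag in bcast_tags:
                self.ready[i] = True
        return all(self.ready)          # request issue once both are ready

queue = [Entry([10, 11]), Entry([11]), Entry([])]
issued = [e for e in queue if e.wakeup({11})]   # broadcast tag 11 this cycle
```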

9
Last-tag speculation (Ernst & Austin, ISCA '02)
  • Only the last-arriving operand initiates
    instruction issue
  • Remove tag comparison logic for the
    early-arriving operand
  • Fewer tag comparators → reduced load on the bus +
    compact wakeup logic → scheduling logic cycle
    time improvement
  • A scoreboard checks correctness of scheduling
  • May hurt performance due to its speculative
    nature
  • Implementation issue w/ broadcast-based selective
    recovery
  • → Our technique exploits last-arriving operands
    non-speculatively, achieving similar benefits
    (predictor sketch below)
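The machine-model slide later specifies the last-arriving operand predictor
as a 1k-entry, PC-direct-mapped, 2-bit bimodal table; below is a rough
Python sketch of such a predictor. The indexing and update policy are my
assumptions.

```python
# Rough sketch of a last-arriving-operand predictor: a 1K-entry,
# PC-direct-mapped table of 2-bit counters (sizes taken from the machine
# model slide; indexing/update details are assumptions).

class LastArrivingPredictor:
    def __init__(self, entries=1024):
        self.ctr = [1] * entries            # 2-bit counters, weakly "left"
        self.mask = entries - 1

    def predict(self, pc):
        """Return which source (0=left, 1=right) is predicted to arrive last."""
        return 1 if self.ctr[(pc >> 2) & self.mask] >= 2 else 0

    def update(self, pc, actual_last):
        i = (pc >> 2) & self.mask
        if actual_last == 1:
            self.ctr[i] = min(3, self.ctr[i] + 1)
        else:
            self.ctr[i] = max(0, self.ctr[i] - 1)

p = LastArrivingPredictor()
p.update(0x1200, actual_last=1)
print(p.predict(0x1200))    # 1 after training toward the right source
```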

10
2-pending-source instructions
  • Many operands are already ready at insert time
  • 4-16% of instructions have 2-pending-source
    operands, requiring two wakeup signals before
    being issued

[Chart: 2-source insts vs. 2-pending-source insts per benchmark]
11
Slack between two wakeups
  • Many 2-pending-source instructions have wakeup
    slack
  • Less than 3% of instructions have 0-slack wakeups
  • → Exploit wakeup slack to prioritize operand
    wakeups

[Chart: 2-pending-source insts vs. simultaneous (0-slack) wakeups per benchmark]
12
½ technique - Sequential wakeup
  • Sequentially wake up ½ of the operands during the
    wakeup slack (behavioral sketch after the figure
    below)
  • Decouples half of the tag comparators → reduced
    load on the bus
  • Flexible routing in the slow wakeup bus → compact,
    fast wakeup logic
  • No recovery, lower misprediction penalty (1-cycle
    issue delay)
  • Instructions are issued non-speculatively in
    terms of operand readiness
  • Simultaneous (0-slack) wakeups always incur the
    penalty
  • But, they are less than 3% of instructions

[Figure: sequential wakeup logic. Broadcast tags (tag 1 .. tag W) drive tag
comparators (tagL/readyL, tagR/readyR) whose outputs are ORed, with a latch
on the slow side; one side watches the fast wakeup bus and the other a slow
wakeup bus running 1 clk behind. The tag predicted to be last-arriving is
placed on the fast-bus side.]
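A behavioral sketch of the fast/slow-bus arrangement above, in Python: the
predicted-last tag watches the fast bus, the other tag watches a slow bus
that sees the same broadcasts one cycle late, and the only cost is a 1-cycle
issue delay when both operands wake up in the same cycle. This is my
simplification of the figure, not the authors' logic.

```python
# Behavioral sketch of sequential wakeup: one source tag on the fast bus,
# the other on a slow bus that repeats broadcasts one cycle later.
# No recovery is needed; a 0-slack wakeup simply issues one cycle late.

def issue_cycle(fast_tag, slow_tag, broadcast_cycle):
    """broadcast_cycle[tag] -> cycle its producer broadcasts; absent = ready."""
    fast_ready = broadcast_cycle.get(fast_tag, -1) if fast_tag is not None else -1
    # The slow bus carries each broadcast one cycle behind the fast bus.
    slow_ready = (broadcast_cycle.get(slow_tag, -1) + 1
                  if slow_tag is not None else -1)
    return max(fast_ready, slow_ready)      # cycle the entry can request issue

bcast = {"p7": 5, "p9": 5, "p4": 2}
print(issue_cycle("p7", "p4", bcast))   # slack >= 1: no penalty -> 5
print(issue_cycle("p7", "p9", bcast))   # 0-slack wakeup -> 6 (1-cycle delay)
```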
13
Machine models
  • Simplescalar-Alpha-based, 12-stage, 4/8-wide OoO,
    speculative scheduling
  • Alpha-style squashing scheduling recovery
  • invalidates all issued instructions (dependent /
    independent) behind the miss
  • 4-wide: 64 RUUs, 32 LSQs, 2 memory ports
  • 8-wide: 128 RUUs, 64 LSQs, 4 memory ports
  • 64K IL1 (2), 64K DL1 (2), 512K unified L2 (8)
  • Combined (bimodal + gShare) branch prediction,
    fetch until the first taken branch
  • Sequential wakeup
  • Last-arriving operand predictor: 1k-entry,
    PC-direct-mapped, 2-bit bimodal
  • Last-tag speculation
  • Same predictor
  • Scoreboard located next to the scheduler

14
Sequential wakeup performance
[Charts: sequential wakeup performance, 4-wide and 8-wide]
  • Sequential wakeup slowdown is slight: avg 0.4% /
    0.6%, worst 2.1%
  • Less than 4% of instructions incur the penalty
  • Sequential wakeup is relatively insensitive to
    predictor accuracy
  • → Sequential wakeup can reduce wakeup logic delay
    with a minimal performance impact

15
Outline
  • Motivations
  • Half-price architecture
  • Reducing scheduler complexity
  • Reducing register file complexity
  • Sequential register file access
  • Conclusions & Future work

16
Register file complexity
  • Overdesign in the register file
  • 2x read ports for two source operands
  • Superscalar processors need the RF to be heavily
    multiported
  • Area increases quadratically, latency increases
    linearly (rough scaling sketch below)
  • Two read ports are not fully utilized
  • 0- / 1-source instructions do not require two
    read ports
  • Many instructions frequently get values off the
    bypass path
  • 0.64 read ports / instruction (Balasubramonian
    et al., ISCA '01)
  • Speeding up the RF
  • Reducing the number of register entries
  • Hierarchical register file (Cruz, Borch,
    Balasubramonian, ..)
  • Reducing the number of ports
  • Fewer RF ports + crossbar (Balasubramonian et al.,
    Park et al.)
  • Half-price technique: Sequential RF Access
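A rough scaling sketch of why ports are costly (my back-of-the-envelope
reasoning, consistent with the quadratic-area / linear-latency claim above;
the one-wordline-plus-one-bitline-per-port assumption is the usual
first-order model, not something stated on the slide):

```latex
% First-order model: each port adds ~1 wordline and ~1 bitline per register
% cell, so the cell edge grows linearly with the port count p.
\text{cell edge} \propto p
  \;\Rightarrow\;
\text{RF area} \propto p^{2},
\qquad
\text{wire length} \propto p
  \;\Rightarrow\;
\text{RF latency} \approx O(p).
```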

17
Two RF read port accesses
  • Less than 4% of instructions need 2 read-port
    accesses
  • Many 2-source instructions read at least one
    value off the bypass path (counting sketch below)

[Charts: fraction of 2-source insts requiring 2 read ports, 4-wide and 8-wide]
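A minimal Python sketch of how read-port demand per issuing instruction
could be counted: a source produced within the bypass window needs no RF
read, so an instruction needs two read ports only when both sources miss
the bypass. The trace representation and the bypass-window parameter are
assumptions.

```python
# Sketch: count how many RF read ports an issuing instruction actually
# needs, assuming values produced within `bypass_window` cycles are caught
# on the bypass network instead of being read from the register file.

def read_ports_needed(src_write_cycles, issue_cycle, bypass_window=1):
    """src_write_cycles: cycle each source value was produced (None = long ago)."""
    ports = 0
    for wc in src_write_cycles:
        bypassed = wc is not None and (issue_cycle - wc) <= bypass_window
        if not bypassed:
            ports += 1          # value must come from the register file
    return ports

# A 2-source instruction issuing back-to-back with one producer: 1 port.
print(read_ports_needed([9, 2], issue_cycle=10))   # -> 1
# Both values are old: this is the <4% case that needs 2 ports.
print(read_ports_needed([2, 3], issue_cycle=10))   # -> 2
```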
18
½ technique - Sequential RF access
  • Remove ½ of the register read ports
  • Only a single read port per issue slot
  • 0- or 1-source instructions are processed without
    any restriction
  • Sequentially access the single port twice for 2
    values if needed (the execution latency
    increases by 1 clock cycle; see the sketch below)
  • However, speculative scheduling does not allow
    variable-latency operations (Implementing
    optimizations at decode time, ISCA '02)
  • Load latency misprediction → scheduling recovery
  • Variable RF latency → scheduling recovery, too
  • → Sequential RF access should be reflected in
    scheduling
  • How to detect if source values will be read off
    the bypass path?
  • How to schedule dependent instructions
    accordingly?
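A behavioral sketch of the half-read-ported datapath (mine, under the
1-cycle bypass-window assumption used in this talk): an instruction that
needs two register-file values reads its single port twice in consecutive
cycles, so its effective latency grows by one cycle; instructions needing
0 or 1 RF reads are unaffected.

```python
# Sketch: decide whether an instruction needs sequential RF access on a
# half-read-ported file (one read port per issue slot), and the resulting
# register-read latency. Latency bookkeeping only; no real datapath.

def sequential_access_needed(src_bypassed):
    """src_bypassed: list of booleans, one per source value."""
    rf_reads = sum(1 for b in src_bypassed if not b)
    return rf_reads == 2                    # both values must come from the RF

def regread_latency(src_bypassed, base=1):
    # The single port is used twice in consecutive cycles when both values
    # come from the RF, so execution latency grows by exactly one cycle.
    return base + (1 if sequential_access_needed(src_bypassed) else 0)

print(regread_latency([True, False]))   # one value bypassed -> 1
print(regread_latency([False, False]))  # sequential RF access -> 2
```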

19
Scheduling in sequential RF access
  • Back-to-back issue → reading values off the
    bypass
  • Back-to-back issue makes dependent instructions
    fall within the bypass window
  • Non-back-to-back issue, or 2 ready sources at
    insert time, incurs a sequential RF access
    (assuming a 1-clock-cycle bypass window)
  • Scheduler considerations (sketched below)
  • !(wakeup & selected) in the same cycle →
    sequential RF access
  • Delay the tag broadcast by 1 clock cycle
  • Block the issue slot (only the one w/ seq RF
    access) for 1 cycle for the non-pipelined RF
    access operation
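Putting the scheduler considerations above into a small Python sketch (my
interpretation of the bullet points; the 1-cycle bypass window, the field
names, and the broadcast-timing formula are assumptions): an instruction
that is not woken up and selected in the same cycle, or that already has
two ready sources at insert, reads the RF sequentially, so its
destination-tag broadcast is delayed by one cycle and its issue slot is
held for one extra cycle.

```python
# Sketch of the scheduler-side handling of sequential RF access.
# Assumptions: 1-cycle bypass window; names chosen for illustration.

def schedule_effects(woken_and_selected_same_cycle, two_ready_at_insert,
                     select_cycle, exec_latency):
    """Return (tag_broadcast_cycle, extra_slot_busy_cycles)."""
    sequential_rf = (not woken_and_selected_same_cycle) or two_ready_at_insert
    if sequential_rf:
        # The extra register-read cycle is made visible to dependents by
        # delaying the destination-tag broadcast one cycle, and the issue
        # slot is blocked for one cycle (the RF access is not pipelined).
        return select_cycle + exec_latency + 1, 1
    return select_cycle + exec_latency, 0

# Back-to-back issue: values come off the bypass, no penalty.
print(schedule_effects(True, False, select_cycle=8, exec_latency=1))   # (9, 0)
# Two sources already ready at insert: sequential RF access.
print(schedule_effects(True, True, select_cycle=8, exec_latency=1))    # (10, 1)
```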

20
Machine models
  • Simplescalar-Alpha-based, 12-stage, 4/8-wide OoO,
    speculative scheduling (same as before)
  • Alpha-style squashing scheduling recovery
  • invalidates all issued instructions (dependent /
    independent) behind the miss
  • 4-wide: 64 RUUs, 32 LSQs, 2 memory ports
  • 8-wide: 128 RUUs, 64 LSQs, 4 memory ports
  • 64K IL1 (2), 64K DL1 (2), 512K unified L2 (8)
  • Combined (bimodal + gShare) branch prediction,
    fetch until the first taken branch
  • Sequential RF access
  • ½ read-ported RF (1 read port / issue slot)
  • Comparison cases
  • Pipelined RF (1 extra RF stage)
  • ½ read-ported RF (same as sequential RF access)
    + crossbar

21
Sequential RF access performance
[Charts: sequential RF access performance, 4-wide and 8-wide]
  • Seq RF access slowdown is slight: avg 1.1% / 0.7%,
    worst 2.2%
  • The 1-extra-RF-stage case requires extra bypass
    paths
  • ½ read ports + crossbar almost achieves base
    performance
  • crossbar complexity, global RF port arbitration
  • → Sequential RF access reduces the number of RF
    read ports with a minimal performance impact

22
Sequential wakeup + RF access
  • Performance degradation: avg 2.2%, worst 4.8%
  • Reduced wakeup bus load capacitance, fewer RF read
    ports
  • → Half-price techniques reduce HW complexity while
    reaping most of the performance of a conventional
    pipeline

23
Conclusions & Future work
  • Processors are overdesigned to process 0-, 1- and
    2-source instructions at equal cost
  • Handling 2-source instructions may not be the
    common case
  • Only a small fraction of instructions utilize the
    overdesigned hardware
  • Reduce HW complexity by restricting the
    processor's capability of handling 2-source
    instructions
  • Sequential wakeup, sequential RF access
  • The performance impact is minimal
  • The basic concept can be extended to all pipeline
    stages
  • Register rename, ready-information check, bypass
    logic
  • Changing the pipeline design from instruction- to
    operand-granularity

24
Questions?
25
Last-arriving operand predictor accuracy
[Charts: predictor accuracy per benchmark, 4-wide and 8-wide]
26
½ technique - Sequential wakeup
  • Sequential wakeup example

[Animation: wakeup timeline for the dependent chain ADD r1, r2, r3 →
SUB r3, r4, r5 → XOR r5, 1, r6]
27
½ technique - Sequential RF access
  • Scheduler changes for sequential RF access
  • Sequential RF access example

[Animation: sequential RF access example for the chain ADD r1, r2, r3 →
SUB r3, r4, r5 → XOR r5, 1, r6]