Survey of Low-Complexity, Low Power Instruction Scheduling - PowerPoint PPT Presentation

Transcript and Presenter's Notes



1
Survey of Low-Complexity, Low Power Instruction Scheduling
  • Alex Li
  • Lewen Lo
  • Sara Sadeghi Baghsorkhi

2
Motivation
  • Scalability of instruction window size
  • Extract greater ILP
  • Power consumption
  • CAM logic is power hungry
  • Complexity
  • Wire delay of associative logic dominates gate
    delay in scheduler

3
Outline
  • Wakeup logic optimizations
  • Distributed instruction queues
  • Waiting Instruction Buffer
  • Preschedulers
  • Cyclone
  • Wakeup-Free

4
Wakeup Logic
[Diagram: a scheduler entry holds Opcode, FU Type, Dest Reg, Src Reg 1/2 with valid bits V1/V2, and a ready bit R; broadcast result register tags feed the wakeup logic, which drives the select logic]
5
Gated Tag Matching
[Diagram: circular issue queue with head and tail pointers]
  • Rationale: parts of the IQ waste energy
  • Energy-wasting sources: empty area, ready operands, issued instructions
  • Solution: gate the comparators!
Folegnani et al. ISCA2001
6
Gated Tag Matching
  • Furthermore, young instructions contribute little to performance
  • Solution: dynamic resizing
  • Use a limit pointer and performance counters
  • Reduce size as long as the IPC loss stays below a threshold; increase size if it exceeds the threshold for a set period
  • Cost: additional logic for gated comparators and a performance counter
  • Claims: 128-entry queue shrinks to an effective size of 43, 4% performance loss, 90.7% wakeup logic energy savings, 14.9% chip energy savings
  • Significant energy savings; based on a conventional design, no performance benefit
[Diagram: issue queue with limit, head, and tail pointers]
Folegnani et al. ISCA2001
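The resizing policy can be sketched as a small feedback loop. This is an illustrative model, not the paper's hardware: the class name, step size, baseline IPC, and sampling scheme are all assumptions.

```python
# Sketch of the dynamic-resizing heuristic: a limit pointer shrinks the
# effective queue while measured IPC stays within a threshold of a
# baseline, and grows it back after a period of sustained IPC loss.
# All constants below are illustrative, not from the paper.

BASELINE_IPC = 2.0      # assumed reference IPC for the full-size queue
THRESHOLD = 0.05        # tolerated fractional IPC loss
PERIOD = 3              # consecutive bad samples before growing back

class ResizableQueue:
    def __init__(self, max_size=128, step=16):
        self.max_size = max_size
        self.step = step
        self.limit = max_size   # effective size (limit pointer)
        self.bad_samples = 0

    def sample(self, measured_ipc):
        loss = (BASELINE_IPC - measured_ipc) / BASELINE_IPC
        if loss < THRESHOLD:
            # IPC barely affected: shrink and gate more comparators
            self.limit = max(self.step, self.limit - self.step)
            self.bad_samples = 0
        else:
            self.bad_samples += 1
            if self.bad_samples >= PERIOD:
                # sustained IPC loss: grow the window back
                self.limit = min(self.max_size, self.limit + self.step)
                self.bad_samples = 0
        return self.limit
```

Entries beyond the limit pointer have their comparators gated, which is where the wakeup-energy savings come from.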
7
Tag Elimination
  • Rationale: ≥1 ready operand in most instructions (80-96%); the last-arriving operand wakes up the instruction
  • Base approach: issue window with 2-, 1-, and 0-comparator entries; insert instructions based on operand readiness
  • Advanced approach: eliminate 2-comparator entries, predict the last-arriving operand, re-issue on misprediction
  • Results on (32 1-comp / 32 0-comp): slight IPC loss (1-3%); accounting for reduced delay, good speedup (25-45%); 65-75% lower energy-delay product
  • Drastically reduces associative logic (to 1/4); reduces energy with no performance impact (even speedup)
[Diagram: issue window partitioned into 2-comp, 1-comp, and 0-comp entries]
Ernst et al. ISCA2002
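The dispatch-time decision can be sketched as a tiny classifier. A minimal sketch, assuming the entry is chosen purely by the number of outstanding source operands; the function name and the `predict_last` flag are illustrative, not from the paper.

```python
# Sketch of tag-elimination entry assignment: an instruction only pays
# for as many comparators as it has outstanding (not-yet-ready) source
# operands. The advanced scheme tracks only the predicted last-arriving
# operand, so no entry ever needs two comparators.

def entry_type(src_ready, predict_last=None):
    """src_ready: list of booleans, one per source operand.
    predict_last: index of the predicted last-arriving operand, or None."""
    outstanding = [i for i, r in enumerate(src_ready) if not r]
    if len(outstanding) == 2 and predict_last is not None:
        # Advanced scheme: watch only the predicted last operand;
        # a misprediction forces the instruction to re-issue.
        return "1-comp"
    return f"{len(outstanding)}-comp"
```

An instruction with both operands already ready lands in a 0-comparator entry and does no tag matching at all.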
8
N-use Issue Logic
  • Rationale
  • 1 (or few) dependent instructions for most instructions (75-78%)
  • Approach
  • More SRAM (N-use table)
  • Less CAM (I-buffer)
  • Wake up dependents only
  • Claims
  • 2-use table + 2-entry I-buffer comparable to 64-entry CAM (4% slowdown)
  • 96 regs → 192 entries in 2-use table!
  • Justifications
  • DOES reduce CAM (64 to 2 cells)
  • Energy to support 2-use table vs. gated entries?
  • Less complex, but maybe more area
  • Cycle time may be reduced
  • Drastically different design

Canal et al. ICS2001
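The table-based wakeup can be sketched as an indexed map from register to dependents. This is an illustrative data-layout sketch, assuming overflow dependents spill to the small CAM buffer; class and method names are mine, not the paper's.

```python
# Sketch of N-use wakeup: an SRAM-like table maps each physical
# register to at most N waiting dependents, so a writeback wakes them
# with an indexed lookup instead of a CAM tag broadcast. Dependents
# beyond N would fall back to the small associative I-buffer
# (modeled here as a plain list).

from collections import defaultdict

N = 2  # dependents tracked per register (a 2-use table)

class NUseTable:
    def __init__(self):
        self.table = defaultdict(list)   # reg -> waiting instruction ids
        self.overflow = []               # would go to the small I-buffer

    def add_dependent(self, reg, instr):
        if len(self.table[reg]) < N:
            self.table[reg].append(instr)
        else:
            self.overflow.append((reg, instr))

    def writeback(self, reg):
        # Indexed lookup wakes the dependents directly: no broadcast.
        return self.table.pop(reg, [])
```

The cost the slide flags is visible here: with 96 physical registers and N=2, the table needs 192 entries even though most are empty.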
9
Distributed Instruction Queue (FIFO)
  • Instructions in a queue are a dependence chain.
  • Only instructions at the head of the queues can
    be ready.
  • Works well for INT codes, but poor for FP codes.
  • A large number of FIFOs increases complexity

Palacharla et al 97
10
Distributed Instruction Queue (Buffer)
  • Multiple dependence chains share a queue.
  • Queues are not FIFOs, but they do not require wakeup.
  • Dispatch order and issue order differ.
  • Latencies known at issue time decide which instruction is selected next.
  • Selection logic stays simple.
  • Same performance with less power consumption.

Abella and Gonzalez 04
11
Selection Logic
Abella and Gonzalez 04
12
Waiting Instruction Buffer
Waiting Instruction Buffer
Issue Queue
LD r1, 1024(r0)
ADD r3, r1, r2
ADD r4, r1, r4
SLL r3, 0x4, r3
SUB r4, r4, r2
ADD r5, r3, r4
Instruction Dispatch
Issued
Data Cache
LD r6, 256(r5)
Functional Unit
ADD r6, r6, r0
Lebeck et al 02
13
Waiting Instruction Buffer
Waiting Instruction Buffer
Issue Queue
ADD r3, r1, r2
ADD r4, r1, r4
SLL r3, 0x4, r3
SUB r4, r4, r2
Cache Miss
ADD r5, r3, r4
LD r6, 256(r5)
Instruction Dispatch
Data Cache
ADD r6, r6, r0
Functional Unit
Load miss on r1
Lebeck et al 02
14
Waiting Instruction Buffer
Waiting Instruction Buffer
Issue Queue
ADD r3, r1, r2
ADD r4, r1, r4
SLL r3, 0x4, r3
SUB r4, r4, r2
Cache Miss
ADD r5, r3, r4
LD r6, 256(r5)
Instruction Dispatch
Data Cache
ADD r6, r6, r0
Functional Unit
Lebeck et al 02
15
Waiting Instruction Buffer
r3, r4
Waiting Instruction Buffer
Issue Queue
SLL r3, 0x4, r3
SUB r4, r4, r2
ADD r5, r3, r4
LD r6, 256(r5)
Cache Miss
ADD r6, r6, r0
Instruction Dispatch
Data Cache
Functional Unit
Lebeck et al 02
16
Waiting Instruction Buffer
r3, r4
Waiting Instruction Buffer
Issue Queue
ADD r5, r3, r4
LD r6, 256(r5)
ADD r6, r6, r0
Cache Miss
Instruction Dispatch
Data Cache
Functional Unit
Lebeck et al 02
17
Waiting Instruction Buffer
Waiting Instruction Buffer
Issue Queue
LD r6, 256(r5)
ADD r6, r6, r0
.
Miss Resolved
.
.
Instruction Dispatch
Data Cache
Functional Unit
Lebeck et al 02
18
Waiting Instruction Buffer
Instructions reinserted
Waiting Instruction Buffer
Issue Queue
LD r6, 256(r5)
ADD r6, r6, r0
.
.
.
ADD r3, r1, r2
Instruction Dispatch
Data Cache
ADD r4, r1, r4
Functional Unit
SLL r3, 0x4, r3
Lebeck et al 02
19
Waiting Instruction Buffer
  • No support for back-to-back execution with a parent load that misses in the cache
  • Power consumption:
  • Instructions move repeatedly between the Issue Queue and the WIB
  • The WIB itself is large
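The slide sequence above can be sketched as a small model. This is an illustrative simulation of the WIB flow, assuming instructions are tuples of (name, source set, destination); the class and method names are mine, not Lebeck et al.'s.

```python
# Sketch of the WIB flow on a load miss: instructions that depend
# (transitively) on the missing load's result drain from the issue
# queue into the Waiting Instruction Buffer, freeing issue-queue
# slots; when the miss resolves, they are all reinserted. The
# reinsertion traffic is the power cost flagged above.

class WIBModel:
    def __init__(self):
        self.issue_queue = []    # entries: (instr, src_set, dest_reg)
        self.wib = []
        self.waiting_regs = set()

    def load_miss(self, dest_reg):
        self.waiting_regs.add(dest_reg)
        self._drain()

    def _drain(self):
        moved = True
        while moved:             # follow the dependence chain
            moved = False
            for entry in list(self.issue_queue):
                instr, srcs, dest = entry
                if srcs & self.waiting_regs:
                    self.issue_queue.remove(entry)
                    self.wib.append(entry)
                    self.waiting_regs.add(dest)   # dependents wait too
                    moved = True

    def miss_resolved(self):
        self.waiting_regs.clear()
        reinserted, self.wib = self.wib, []
        self.issue_queue.extend(reinserted)       # costly re-traffic
```

Running the slide's example: a miss on `r1` drags `ADD r3,r1,r2`, then `SLL r3` behind it, into the WIB, while independent instructions keep issuing.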

20
Motivation behind Preschedulers
  • Compiler-heavy scheduling
  • Dumber scheduler
  • More conservative (on branches, load/store
    addresses, other run-time things)
  • Hardware-intensive scheduling
  • Takes advantage of knowledge at run-time
  • Much more complex

21
Motivation behind Preschedulers
  • Some dead instructions sit in scheduler slots
  • Reduce dead slots by only sending fireable
    instructions
  • Increases effective instruction window
  • Eliminates associative logic, decreasing
  • Complexity
  • Delay (allowing for a possible clock speed
    increase)
  • Power consumption

22
Dataflow-based Prescheduler
  • Register Use Line Table (RULT), width W
  • Active line holds the ready instructions
  • line = max(a, b, c): the maximum of the current active line and the lines of both operands
  • Circular setup: each cycle, increment the active line

23
Dataflow Prescheduler Performance
  • Configurations: 8-entry issue buffer, 12 lines, 8 FIFOs; 16-entry issue buffer, 12 lines, 16 FIFOs
  • Avg. 54% performance increase for the 8-entry buffer
  • Avg. 33% performance increase for the 16-entry buffer

Michaud et al. HPCA2001
24
Cyclone
  • Re-vamp the scheduler (take advantage of higher
    perf.)
  • Instrs from prescheduler go into countdown
  • When the countdown reaches N/2 → main queue
  • Main queue entries promote to the right
  • Column 0 is issued each cycle

Ernst et al. ISCA2003
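The two-phase movement can be sketched as a trace function. A toy model, not Ernst et al.'s implementation: the structure is assumed from the bullets above (countdown half, then rightward promotion to the issue column).

```python
# Toy model of Cyclone's timed queue: an instruction predicted to be
# ready in n cycles enters at position n, spends the first half of its
# countdown in the countdown queue, switches to the main queue at n/2,
# then promotes one column to the right each cycle until it reaches
# column 0, which is issued every cycle.

def cyclone_trace(n):
    """Return the per-cycle (queue, position) path for countdown n."""
    half = n // 2
    path = []
    pos = n
    for _ in range(n):
        pos -= 1
        queue = "countdown" if pos >= half else "main"
        path.append((queue, pos))
    return path   # ends at ("main", 0): the issue column
```

Because position encodes time-to-ready, no wakeup broadcast or selection arbitration is needed; mistimed instructions are handled by the replay mechanism on the next slide.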
25
Cyclone (cont'd)
  • Replay mechanism
  • Register File Ready Bits for final operand check
  • Store set predictor
  • A conservative method avoiding load/store
    dependence messiness

26
Cyclone Performance
  • Decrease in latency
  • An 8-decode, 8-issue Cyclone takes 12% of the area of a 64-instruction, 8-issue CAM scheduler

Ernst et al. ISCA2003
27
Cyclone Analysis
  • Eliminates both wakeup and selection logic
  • Competition for issue ports
  • Congestion
  • Collisions during promotion (modifying promotion
    paths only shifts the pressure)
  • Replay-decode collisions

28
Wakeup-Free (WF) Schemes: WF-Replay
  • Latency counters + selection logic
  • Uses the entire scheduler
  • For a 32-entry queue, issue width 4: 9% performance hit (vs. 25.5% for Cyclone)
  • Issue width 6: 0.2% performance hit; issue width 8: 0%

Hu et al. HPCA2004
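The counter-based scheme can be sketched as a per-cycle tick. This is an illustrative model, assuming each entry carries a counter of cycles until its last operand arrives; the dict layout and positional selection are simplifications, not the paper's selection logic.

```python
# Sketch of wakeup-free scheduling with latency counters: every cycle
# each entry's counter is decremented, and the select logic picks up
# to issue_width instructions whose counter has reached zero. No tag
# broadcast occurs; mistimed instructions (e.g. behind a cache miss)
# would be replayed.

def tick(entries, issue_width):
    """entries: list of dicts with a 'wait' counter. Returns issued."""
    for e in entries:
        if e["wait"] > 0:
            e["wait"] -= 1
    ready = [e for e in entries if e["wait"] == 0]
    issued = ready[:issue_width]          # simple positional selection
    for e in issued:
        entries.remove(e)
    return issued
```

With a wide enough issue port (width 6-8 on the slide), almost every zero-counter instruction issues the cycle it becomes ready, which is why the performance hit shrinks toward zero.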
29
WF-Precheck
  • Do a precheck instead of replay
  • Check Reg Ready Bits before issuing
  • If not ready, recalculate timing
  • Increases complexity of selection logic

Hu et al. HPCA2004
30
Segmented Issue Queue
Hu et al. HPCA2004
31
Segmented Issue Queue Commentary
  • Rows represent different classes of latencies
  • Only select on lowest row (latency 0)
  • Sinking/Collapsing structure to prevent pileups

32
WF-Segment Performance
  • 5.8% perf. loss (3.5% vs. Precheck)

Hu et al. HPCA2004
33
Conclusions
  • Low-power optimizations tend to target control
    logic
  • Don't change the underlying structure
  • Low-complexity optimizations
  • More creative designs
  • Low power
  • No appreciable performance loss (possibly speedup?)

34
Backup Slides