Title: Survey of Low-Complexity, Low-Power Instruction Scheduling
1. Survey of Low-Complexity, Low-Power Instruction Scheduling
- Alex Li
- Lewen Lo
- Sara Sadeghi Baghsorkhi
2. Motivation
- Scalability of instruction window size
- Extract greater ILP
- Power consumption
- CAM logic is power hungry
- Complexity
- Wire delay of associative logic dominates gate delay in the scheduler
3. Outline
- Wakeup logic optimizations
- Distributed instruction queues
- Waiting Instruction Buffer
- Preschedulers
- Cyclone
- Wakeup-Free
4. Wakeup Logic
[Diagram: a wakeup-logic entry holds the Opcode, FU Type, Dest Reg, Src Reg 1 with valid bit V1, Src Reg 2 with valid bit V2, and a ready bit R. Broadcast result register tags are compared against the source register tags; when both operands are valid, the ready signal goes to the select logic.]
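To make the power problem concrete, here is a minimal behavioral sketch of conventional CAM-style wakeup (class and function names are our own, not from any paper): every cycle, completing instructions broadcast their destination tags, and *every* queue entry compares them against both of its source tags — this all-entries comparison is the associative, power-hungry part.

```python
# Behavioral sketch of CAM-style wakeup logic (illustrative names).
class Entry:
    def __init__(self, src1, src2, ready1=False, ready2=False):
        self.src = [src1, src2]        # source register tags
        self.valid = [ready1, ready2]  # V1/V2: operand already available?

    @property
    def ready(self):                   # R bit handed to the select logic
        return all(self.valid)

def wakeup(entries, broadcast_tags):
    """Compare every entry's source tags against all broadcast tags;
    return the indices of entries that are now ready."""
    ready = []
    for i, e in enumerate(entries):
        for k in (0, 1):
            if not e.valid[k] and e.src[k] in broadcast_tags:
                e.valid[k] = True      # tag match: operand becomes valid
        if e.ready:
            ready.append(i)
    return ready
```

Note that `wakeup` touches every entry on every broadcast, which is exactly why the techniques that follow try to gate, shrink, or eliminate this comparison.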
5. Gated Tag Matching
- Rationale: parts of the IQ waste energy
- Energy-wasting sources
  - Empty area
  - Ready operands
  - Issued instructions
- Solution: gate the comparators!
[Diagram: circular issue queue with head and tail pointers]
Folegnani et al. ISCA2001
6. Gated Tag Matching
- Furthermore: young instructions contribute little to performance
- Solution: dynamic resizing
  - Use a limit pointer and performance counters
  - Reduce the size as long as the IPC loss stays below a threshold
  - Increase the size if the IPC loss exceeds the threshold for a set period
- Cost: additional logic for gated comparators and the performance counter
- Claims: a 128-entry queue runs at an effective size of 43 with 4% performance loss, 90.7% wakeup-logic energy savings, and 14.9% chip energy savings
- Significant energy savings
- Based on the conventional design; no performance benefit
[Diagram: issue queue with limit, head, and tail pointers]
Folegnani et al. ISCA2001
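The resizing policy above can be sketched in a few lines. This is a hedged approximation of the idea, not Folegnani et al.'s exact mechanism: the function name, step size, and bounds are illustrative assumptions.

```python
# Illustrative sketch of limit-pointer dynamic resizing: shrink the
# effective queue while the measured IPC loss stays below a threshold,
# grow it back when the loss exceeds the threshold. Step/bounds invented.
def resize(limit, ipc_loss, threshold, step=8, min_size=8, max_size=128):
    """Return the new effective queue size (limit pointer position)."""
    if ipc_loss > threshold:
        return min(limit + step, max_size)  # performance suffering: grow
    return max(limit - step, min_size)      # performance fine: keep shrinking
```

Run once per sampling period; entries beyond the limit pointer have their comparators gated off.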
7. Tag Elimination
- Rationale
  - ≥1 ready operand in most instructions (80-96%)
  - The last-arriving operand wakes up the instruction
- Base approach
  - Issue window with 2-, 1-, and 0-comparator entries
  - Insert instructions based on operand readiness
- Advanced approach
  - Eliminate 2-comparator entries
  - Predict the last-arriving operand
  - Re-issue on misprediction
- Results (32 1-comp / 32 0-comp entries)
  - Slight IPC loss (1-3%); accounting for the reduced delay, good speedup (25-45%)
  - 65-75% lower energy-delay product
- Drastically reduces associative logic (to 1/4), reducing energy with no performance impact (even a speedup)
[Diagram: issue window split into 2-comp, 1-comp, and 0-comp entries]
Ernst et al. ISCA2002
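The base approach's insertion policy amounts to counting outstanding operands at dispatch and steering to the cheapest entry class that suffices. A minimal sketch, with invented names and a simple "fall up to a bigger entry" policy that is our assumption, not necessarily Ernst et al.'s:

```python
# Sketch of tag elimination's insertion policy (names/policy illustrative).
def comparators_needed(operands_ready):
    """operands_ready: one boolean per source operand."""
    return sum(1 for r in operands_ready if not r)

def steer(operands_ready, free_entries):
    """free_entries maps comparator count -> free slots.
    Returns the entry class used (an entry with more comparators
    than needed also works), or None to stall dispatch."""
    need = comparators_needed(operands_ready)
    for cls in range(need, 3):
        if free_entries.get(cls, 0) > 0:
            free_entries[cls] -= 1
            return cls
    return None
```

Since 80-96% of instructions need at most one comparator, most dispatches land in the cheap 0- or 1-comp entries.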
8. N-use Issue Logic
- Rationale
- 1 (or few) dependent instructions for most instructions (75-78%)
- Approach
- More SRAM (N-use table)
- Less CAM (I-buffer)
- Wakeup dependents only
- Claims
- 2-use table + 2-entry I-buffer comparable to a 64-entry CAM (4% slowdown)
- 96 regs → 192 entries in the 2-use table!
- Justifications
- DOES reduce CAM (64 to 2 cells)
- Energy to support the 2-use table ≈ gated entries
- Less complex, but maybe more area
- Cycle time may be reduced
- Drastically different design
Canal et al. ICS2001
9. Distributed Instruction Queue (FIFO)
- Instructions in a queue are a dependence chain.
- Only instructions at the heads of the queues can be ready.
- Works well for INT codes, but poorly for FP codes.
- A large number of FIFOs increases complexity.
Palacharla et al 97
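The key is the steering heuristic at dispatch: put an instruction directly behind its producer so each FIFO holds one dependence chain, and readiness only ever needs to be checked at the FIFO heads. A simplified sketch (the tie-breaking and stall policy here are our assumptions, not Palacharla et al.'s exact rules):

```python
# Sketch of dependence-chain FIFO steering (policy details illustrative).
def steer(fifos, instr, producer):
    """fifos: list of lists (oldest first). producer: id of the instruction
    producing instr's last operand, or None if all operands are ready.
    Returns True if dispatched, False to stall."""
    if producer is not None:
        for f in fifos:
            if f and f[-1] == producer:  # chain continues: same FIFO
                f.append(instr)
                return True
    for f in fifos:                      # otherwise start a new chain
        if not f:
            f.append(instr)
            return True
    return False                         # no suitable FIFO: stall dispatch
```

FP codes fare poorly because their wider dataflow graphs fragment across FIFOs, and the stall case fires more often.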
10. Distributed Instruction Queue (Buffer)
- Multiple dependence chains share a queue.
- Queues are not FIFOs, but they do not require wake-up.
- Dispatch order and issue order differ.
- Latencies known at issue time decide which instruction is selected next.
- Selection logic stays simple.
- Same performance with less power consumption.
Abella and Gonzalez 04
11. Selection Logic
Abella and Gonzalez 04
12. Waiting Instruction Buffer
[Diagram: LD r1, 1024(r0) has issued from the Issue Queue to the Data Cache. The queue holds its dependence chain — ADD r3, r1, r2; ADD r4, r1, r4; SLL r3, 0x4, r3; SUB r4, r4, r2; ADD r5, r3, r4 — while Instruction Dispatch delivers LD r6, 256(r5) and ADD r6, r6, r0. The Waiting Instruction Buffer is empty.]
Lebeck et al 02
13. Waiting Instruction Buffer
[Diagram: the load misses in the Data Cache ("Load miss on r1"). ADD r3, r1, r2; ADD r4, r1, r4; SLL r3, 0x4, r3; SUB r4, r4, r2 wait in the Issue Queue; ADD r5, r3, r4 and LD r6, 256(r5) are entering from dispatch, with ADD r6, r6, r0 behind them.]
Lebeck et al 02
14. Waiting Instruction Buffer
[Diagram: same state with the cache miss outstanding — the chain dependent on r1 occupies Issue Queue slots while the miss is serviced.]
Lebeck et al 02
15. Waiting Instruction Buffer
[Diagram: ADD r3, r1, r2 and ADD r4, r1, r4 have moved to the Waiting Instruction Buffer, marking r3 and r4 as waiting. The Issue Queue holds SLL r3, 0x4, r3; SUB r4, r4, r2; ADD r5, r3, r4; LD r6, 256(r5); ADD r6, r6, r0 enters from dispatch while the miss is still outstanding.]
Lebeck et al 02
16. Waiting Instruction Buffer
[Diagram: SLL r3, 0x4, r3 and SUB r4, r4, r2 have also drained into the Waiting Instruction Buffer; the Issue Queue now holds ADD r5, r3, r4; LD r6, 256(r5); ADD r6, r6, r0, with the miss still outstanding.]
Lebeck et al 02
17. Waiting Instruction Buffer
[Diagram: the cache miss resolves ("Miss Resolved"). The Issue Queue holds LD r6, 256(r5) and ADD r6, r6, r0; the chain in the Waiting Instruction Buffer becomes eligible for reinsertion.]
Lebeck et al 02
18. Waiting Instruction Buffer
[Diagram: instructions are reinserted from the Waiting Instruction Buffer into the Issue Queue — ADD r3, r1, r2; ADD r4, r1, r4; SLL r3, 0x4, r3 — behind LD r6, 256(r5) and ADD r6, r6, r0.]
Lebeck et al 02
19. Waiting Instruction Buffer
- No support for back-to-back execution with parent loads that miss in the cache
- Power consumption
  - Many instruction moves between the Issue Queue and the WIB
  - A large WIB
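The walkthrough above boils down to two operations: drain the miss-dependent chain out of the small issue queue, and reinsert it when the miss resolves. A minimal sketch with invented names — it assumes the queue is scanned in program order, so transitive dependents are poisoned in a single pass:

```python
# Sketch of the WIB drain/reinsert mechanism (names and encoding invented).
# Instructions are (dest, srcs) tuples; `poisoned` is the set of registers
# whose values are blocked behind an outstanding load miss.
def drain_to_wib(issue_queue, wib, poisoned):
    """Move instructions reading a poisoned register to the WIB, and
    poison their destinations so the whole chain follows the load."""
    remaining = []
    for dest, srcs in issue_queue:
        if poisoned & set(srcs):
            poisoned.add(dest)         # chain the poison transitively
            wib.append((dest, srcs))
        else:
            remaining.append((dest, srcs))
    issue_queue[:] = remaining

def miss_resolved(issue_queue, wib):
    """Reinsert the waiting chain once the load data returns."""
    issue_queue.extend(wib)
    wib.clear()
```

This shows both criticisms from the slide: every drained instruction moves twice (queue → WIB → queue), and the WIB must be large enough to hold every chain behind every outstanding miss.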
20. Motivation behind Preschedulers
- Compiler-heavy scheduling
  - Dumber scheduler
  - More conservative (on branches, load/store addresses, other run-time things)
- Hardware-intensive scheduling
  - Takes advantage of knowledge at run time
  - Much more complex
21. Motivation behind Preschedulers
- Some dead instructions sit in scheduler slots
- Reduce dead slots by only sending fireable instructions
  - Increases the effective instruction window
- Eliminates associative logic, decreasing
  - Complexity
  - Delay (allowing for a possible clock speed increase)
  - Power consumption
22. Dataflow-based Prescheduler
- Register Use Line Table (RULT), width W
- Active line holds ready instructions
- line = max(a, b, c) + x
  - Maximum of the current line and the lines of both operands
- Circular setup: each cycle, increment the active line
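The line computation can be sketched as follows. This is a hedged reconstruction — the function name is ours, and the exact rule for the latency offset x is an assumption; the essential idea is that the RULT maps each register to the line on which it becomes available:

```python
# Sketch of the RULT line computation (offset rule is an assumption).
def preschedule(rult, active_line, srcs, dest, x, num_lines=12):
    """Place an instruction on line = max(active line, operand lines) + x.
    Stores the absolute line for dest so later maxima stay monotonic;
    the physical slot is the line modulo the number of lines (circular)."""
    line = max([active_line] + [rult.get(s, active_line) for s in srcs]) + x
    rult[dest] = line              # dest becomes available on that line
    return line % num_lines        # circular line structure
```

Each cycle the active line advances, so instructions drain into the small issue buffer only when their operands are predicted ready.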
23. Dataflow Prescheduler Performance
- Configurations: 8-entry issue buffer, 12 lines, 8 FIFOs; 16-entry issue buffer, 12 lines, 16 FIFOs
- Avg. 54% performance increase for the 8-entry buffer
- Avg. 33% performance increase for the 16-entry buffer
Michaud et al. HPCA2001
24. Cyclone
- Revamps the scheduler (takes advantage of higher perf.)
- Instructions from the prescheduler go into a countdown queue
- When the countdown reaches N/2 → main queue
- Main queue entries promote to the right
- Column 0 is issued each cycle
Ernst et al. ISCA2003
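To visualize the countdown/switchback path, here is a heavily simplified trace of one instruction's journey (even delay assumed, and collisions, multiple rows, and ports are all ignored — this is an illustration of the timing idea, not Cyclone's full structure):

```python
# Simplified trace of Cyclone's timed queues: an instruction with predicted
# delay n counts down for n/2 cycles, switches back into the main queue,
# then promotes one column right per cycle until issuing from column 0.
def cyclone_schedule(n):
    """Return per-cycle (queue, position) pairs for delay n (n even)."""
    path = [("countdown", n - 2 * t) for t in range(n // 2)]  # counter -2/cycle
    path += [("main", c) for c in range(n // 2 - 1, -1, -1)]  # promote right
    return path
```

The instruction reaches column 0 exactly n cycles after entry, so no wakeup or selection logic is needed — mispredicted timings are handled by the replay mechanism on the next slide.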
25. Cyclone (cont'd)
- Replay mechanism
  - Register File Ready Bits for a final operand check
- Store set predictor
  - A conservative method avoiding load/store dependence messiness
26. Cyclone Performance
- Decrease in latency
- An 8-decode, 8-issue Cyclone takes 12% of the area of a 64-instruction, 8-issue CAM scheduler
Ernst et al. ISCA2003
27. Cyclone Analysis
- Eliminates both wakeup and selection logic
- Competition for issue ports
- Congestion
  - Collisions during promotion (modifying promotion paths only shifts the pressure)
  - Replay-decode collisions
28. Wakeup-Free (WF) Schemes: WF-Replay
- Latency counters + selection logic
- Uses the entire scheduler
- For a 32-entry queue at issue width 4: 9% performance hit (vs. 25.5% for Cyclone)
- Issue width 6: 0.2% performance hit; issue width 8: no performance hit
Hu et al. HPCA2004
29. WF-Precheck
- Do a precheck instead of replay
- Check Reg Ready Bits before issuing
- If not ready, recalculate timing
- Increases complexity of selection logic
Hu et al. HPCA2004
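The precheck step can be sketched per entry as follows. Names, the entry encoding, and the recalculation penalty are invented for illustration — the point is that a failed ready-bit check resets the timing estimate *before* issue, instead of replaying after a wrong issue:

```python
# Sketch of WF-Precheck (entry encoding and penalty are assumptions).
def try_issue(entry, ready_bits, penalty=2):
    """entry: {'srcs': [regs], 'count': cycles until predicted ready}.
    Returns True if the instruction issues this cycle."""
    if entry["count"] > 0:
        entry["count"] -= 1            # latency counter still running
        return False
    if all(ready_bits.get(s, False) for s in entry["srcs"]):
        return True                    # precheck passed: safe to issue
    entry["count"] = penalty           # timing mispredicted: recalculate
    return False
```

The cost noted on the slide is that this check sits on the selection path, making the selection logic more complex.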
30. Segmented Issue Queue
Hu et al. HPCA2004
31. Segmented Issue Queue Commentary
- Rows represent different classes of latencies
- Only select on lowest row (latency 0)
- Sinking/Collapsing structure to prevent pileups
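The sinking behavior can be sketched in a couple of lines (structure simplified; row layout and names are our assumptions): each row holds the instructions of one remaining-latency class, selection examines only row 0, and every cycle all rows collapse one step closer to issue.

```python
# Sketch of the segmented queue's sinking step (structure illustrative).
def cycle(rows):
    """rows[k] = instructions issuing in k cycles.
    Issue row 0 and sink every other row down by one latency class."""
    issued = rows[0]
    sunk = rows[1:] + [[]]   # everything moves one row closer to issue
    return issued, sunk
```

Because only row 0 feeds selection, the per-cycle associative work is bounded by one segment rather than the whole queue.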
32. WF-Segment Performance
- 5.8% performance loss (vs. 3.5% for Precheck)
Hu et al. HPCA2004
33. Conclusions
- Low-power optimizations tend to target control logic
  - They don't change the underlying structure
- Low-complexity optimizations
  - More creative designs
  - Low power
  - No appreciable performance loss (possibly even a speedup)
34. Backup Slides