Title: Survey of Low-Complexity, Low-Power Instruction Scheduling
1. Survey of Low-Complexity, Low-Power Instruction Scheduling
- Alex Li
- Lewen Lo
- Sara Sadeghi Baghsorkhi
2. Motivation
- Scalability of instruction window size
- Extract greater ILP
- Power consumption
- CAM logic is power hungry
- Complexity
- Wire delay of associative logic dominates gate delay in the scheduler
3. Outline
- Wakeup logic optimizations
- Distributed instruction queues
- Waiting Instruction Buffer
- Preschedulers
- Cyclone
- Wakeup-Free
4. Wakeup Logic
[Diagram: a wakeup-logic entry holds the Opcode, FU Type, Dest Reg, Src Reg 1 with valid bit V1, Src Reg 2 with valid bit V2, and a ready bit R. Broadcast result register tags are compared against the source register tags; when both operands are valid, the ready signal goes to the select logic.]
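To make the power problem concrete, here is a minimal behavioral sketch of conventional CAM-style wakeup (class and function names are our own, not from any paper): every cycle, completing instructions broadcast their destination tags, and *every* queue entry compares them against both of its source tags — this all-entries comparison is the associative, power-hungry part.

```python
# Behavioral sketch of CAM-style wakeup logic (illustrative names).
class Entry:
    def __init__(self, src1, src2, ready1=False, ready2=False):
        self.src = [src1, src2]        # source register tags
        self.valid = [ready1, ready2]  # V1/V2: operand already available?

    @property
    def ready(self):                   # R bit handed to the select logic
        return all(self.valid)

def wakeup(entries, broadcast_tags):
    """Compare every entry's source tags against all broadcast tags;
    return the indices of entries that are now ready."""
    ready = []
    for i, e in enumerate(entries):
        for k in (0, 1):
            if not e.valid[k] and e.src[k] in broadcast_tags:
                e.valid[k] = True      # tag match: operand becomes valid
        if e.ready:
            ready.append(i)
    return ready
```

Note that `wakeup` touches every entry on every broadcast, which is exactly why the techniques that follow try to gate, shrink, or eliminate this comparison.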
5. Gated Tag Matching
- Rationale: parts of the IQ waste energy
- Energy-wasting sources
  - Empty area
  - Ready operands
  - Issued instructions
- Solution: gate the comparators!
[Diagram: circular issue queue with head and tail pointers]
Folegnani et al. ISCA2001
6. Gated Tag Matching
- Furthermore: young instructions contribute little to performance
- Solution: dynamic resizing
  - Use a limit pointer and performance counters
  - Reduce the size as long as the IPC loss stays below a threshold
  - Increase the size if the IPC loss exceeds the threshold for a set period
- Cost: additional logic for gated comparators and the performance counter
- Claims: a 128-entry queue runs at an effective size of 43 with 4% performance loss, 90.7% wakeup-logic energy savings, and 14.9% chip energy savings
- Significant energy savings
- Based on the conventional design; no performance benefit
[Diagram: issue queue with limit, head, and tail pointers]
Folegnani et al. ISCA2001
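The resizing policy above can be sketched in a few lines. This is a hedged approximation of the idea, not Folegnani et al.'s exact mechanism: the function name, step size, and bounds are illustrative assumptions.

```python
# Illustrative sketch of limit-pointer dynamic resizing: shrink the
# effective queue while the measured IPC loss stays below a threshold,
# grow it back when the loss exceeds the threshold. Step/bounds invented.
def resize(limit, ipc_loss, threshold, step=8, min_size=8, max_size=128):
    """Return the new effective queue size (limit pointer position)."""
    if ipc_loss > threshold:
        return min(limit + step, max_size)  # performance suffering: grow
    return max(limit - step, min_size)      # performance fine: keep shrinking
```

Run once per sampling period; entries beyond the limit pointer have their comparators gated off.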
7. Tag Elimination
- Rationale
  - ≥1 ready operand in most instructions (80-96%)
  - The last-arriving operand wakes up the instruction
- Base approach
  - Issue window with 2-, 1-, and 0-comparator entries
  - Insert instructions based on operand readiness
- Advanced approach
  - Eliminate 2-comparator entries
  - Predict the last-arriving operand
  - Re-issue on misprediction
- Results (32 1-comp / 32 0-comp entries)
  - Slight IPC loss (1-3%); accounting for the reduced delay, good speedup (25-45%)
  - 65-75% lower energy-delay product
- Drastically reduces associative logic (to 1/4), reducing energy with no performance impact (even a speedup)
[Diagram: issue window split into 2-comp, 1-comp, and 0-comp entries]
Ernst et al. ISCA2002
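The base approach's insertion policy amounts to counting outstanding operands at dispatch and steering to the cheapest entry class that suffices. A minimal sketch, with invented names and a simple "fall up to a bigger entry" policy that is our assumption, not necessarily Ernst et al.'s:

```python
# Sketch of tag elimination's insertion policy (names/policy illustrative).
def comparators_needed(operands_ready):
    """operands_ready: one boolean per source operand."""
    return sum(1 for r in operands_ready if not r)

def steer(operands_ready, free_entries):
    """free_entries maps comparator count -> free slots.
    Returns the entry class used (an entry with more comparators
    than needed also works), or None to stall dispatch."""
    need = comparators_needed(operands_ready)
    for cls in range(need, 3):
        if free_entries.get(cls, 0) > 0:
            free_entries[cls] -= 1
            return cls
    return None
```

Since 80-96% of instructions need at most one comparator, most dispatches land in the cheap 0- or 1-comp entries.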
8. N-use Issue Logic
- Rationale
- 1 (or few) dependent instructions for most instructions (75-78%)
- Approach
- More SRAM (N-use table)
- Less CAM (I-buffer)
- Wakeup dependents only
- Claims
- 2-use table + 2-entry I-buffer comparable to a 64-entry CAM (4% slowdown)
- 96 regs → 192 entries in the 2-use table!
- Justifications
- DOES reduce CAM (64 to 2 cells)
- Energy to support the 2-use table ≈ gated entries
- Less complex, but maybe more area
- Cycle time may be reduced
- Drastically different design
Canal et al. ICS2001
9. Distributed Instruction Queue (FIFO)
- Instructions in a queue are a dependence chain.
- Only instructions at the heads of the queues can be ready.
- Works well for INT codes, but poorly for FP codes.
- A large number of FIFOs increases complexity.
Palacharla et al 97
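The key is the steering heuristic at dispatch: put an instruction directly behind its producer so each FIFO holds one dependence chain, and readiness only ever needs to be checked at the FIFO heads. A simplified sketch (the tie-breaking and stall policy here are our assumptions, not Palacharla et al.'s exact rules):

```python
# Sketch of dependence-chain FIFO steering (policy details illustrative).
def steer(fifos, instr, producer):
    """fifos: list of lists (oldest first). producer: id of the instruction
    producing instr's last operand, or None if all operands are ready.
    Returns True if dispatched, False to stall."""
    if producer is not None:
        for f in fifos:
            if f and f[-1] == producer:  # chain continues: same FIFO
                f.append(instr)
                return True
    for f in fifos:                      # otherwise start a new chain
        if not f:
            f.append(instr)
            return True
    return False                         # no suitable FIFO: stall dispatch
```

FP codes fare poorly because their wider dataflow graphs fragment across FIFOs, and the stall case fires more often.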
10. Distributed Instruction Queue (Buffer)
- Multiple dependence chains share a queue.
- Queues are not FIFOs, but they do not require wake-up.
- Dispatch order and issue order differ.
- Latencies known at issue time decide which instruction is selected next.
- Selection logic stays simple.
- Same performance with less power consumption.
Abella and Gonzalez 04
11. Selection Logic
Abella and Gonzalez 04
12. Waiting Instruction Buffer
[Diagram: LD r1, 1024(r0) has issued from the Issue Queue to the Data Cache. The queue holds its dependence chain — ADD r3, r1, r2; ADD r4, r1, r4; SLL r3, 0x4, r3; SUB r4, r4, r2; ADD r5, r3, r4 — while Instruction Dispatch delivers LD r6, 256(r5) and ADD r6, r6, r0. The Waiting Instruction Buffer is empty.]
Lebeck et al 02
13. Waiting Instruction Buffer
[Diagram: the load misses in the Data Cache ("Load miss on r1"). ADD r3, r1, r2; ADD r4, r1, r4; SLL r3, 0x4, r3; SUB r4, r4, r2 wait in the Issue Queue; ADD r5, r3, r4 and LD r6, 256(r5) are entering from dispatch, with ADD r6, r6, r0 behind them.]
Lebeck et al 02
14. Waiting Instruction Buffer
[Diagram: same state with the cache miss outstanding — the chain dependent on r1 occupies Issue Queue slots while the miss is serviced.]
Lebeck et al 02
15. Waiting Instruction Buffer
[Diagram: ADD r3, r1, r2 and ADD r4, r1, r4 have moved to the Waiting Instruction Buffer, marking r3 and r4 as waiting. The Issue Queue holds SLL r3, 0x4, r3; SUB r4, r4, r2; ADD r5, r3, r4; LD r6, 256(r5); ADD r6, r6, r0 enters from dispatch while the miss is still outstanding.]
Lebeck et al 02
16. Waiting Instruction Buffer
[Diagram: SLL r3, 0x4, r3 and SUB r4, r4, r2 have also drained into the Waiting Instruction Buffer; the Issue Queue now holds ADD r5, r3, r4; LD r6, 256(r5); ADD r6, r6, r0, with the miss still outstanding.]
Lebeck et al 02
17. Waiting Instruction Buffer
[Diagram: the cache miss resolves ("Miss Resolved"). The Issue Queue holds LD r6, 256(r5) and ADD r6, r6, r0; the chain in the Waiting Instruction Buffer becomes eligible for reinsertion.]
Lebeck et al 02
18. Waiting Instruction Buffer
[Diagram: instructions are reinserted from the Waiting Instruction Buffer into the Issue Queue — ADD r3, r1, r2; ADD r4, r1, r4; SLL r3, 0x4, r3 — behind LD r6, 256(r5) and ADD r6, r6, r0.]
Lebeck et al 02
19. Waiting Instruction Buffer
- No support for back-to-back execution with parent loads that miss in the cache
- Power consumption
  - Many instruction moves between the Issue Queue and the WIB
  - A large WIB
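The walkthrough above boils down to two operations: drain the miss-dependent chain out of the small issue queue, and reinsert it when the miss resolves. A minimal sketch with invented names — it assumes the queue is scanned in program order, so transitive dependents are poisoned in a single pass:

```python
# Sketch of the WIB drain/reinsert mechanism (names and encoding invented).
# Instructions are (dest, srcs) tuples; `poisoned` is the set of registers
# whose values are blocked behind an outstanding load miss.
def drain_to_wib(issue_queue, wib, poisoned):
    """Move instructions reading a poisoned register to the WIB, and
    poison their destinations so the whole chain follows the load."""
    remaining = []
    for dest, srcs in issue_queue:
        if poisoned & set(srcs):
            poisoned.add(dest)         # chain the poison transitively
            wib.append((dest, srcs))
        else:
            remaining.append((dest, srcs))
    issue_queue[:] = remaining

def miss_resolved(issue_queue, wib):
    """Reinsert the waiting chain once the load data returns."""
    issue_queue.extend(wib)
    wib.clear()
```

This shows both criticisms from the slide: every drained instruction moves twice (queue → WIB → queue), and the WIB must be large enough to hold every chain behind every outstanding miss.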
20. Motivation behind Preschedulers
- Compiler-heavy scheduling
  - Dumber scheduler
  - More conservative (on branches, load/store addresses, other run-time things)
- Hardware-intensive scheduling
  - Takes advantage of knowledge at run time
  - Much more complex
21. Motivation behind Preschedulers
- Some dead instructions sit in scheduler slots
- Reduce dead slots by only sending fireable instructions
  - Increases the effective instruction window
- Eliminates associative logic, decreasing
  - Complexity
  - Delay (allowing for a possible clock speed increase)
  - Power consumption
22. Dataflow-based Prescheduler
- Register Use Line Table (RULT), width W
- Active line holds ready instructions
- line = max(a, b, c) + x
  - Maximum of the current line and the lines of both operands
- Circular setup: each cycle, increment the active line
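The line computation can be sketched as follows. This is a hedged reconstruction — the function name is ours, and the exact rule for the latency offset x is an assumption; the essential idea is that the RULT maps each register to the line on which it becomes available:

```python
# Sketch of the RULT line computation (offset rule is an assumption).
def preschedule(rult, active_line, srcs, dest, x, num_lines=12):
    """Place an instruction on line = max(active line, operand lines) + x.
    Stores the absolute line for dest so later maxima stay monotonic;
    the physical slot is the line modulo the number of lines (circular)."""
    line = max([active_line] + [rult.get(s, active_line) for s in srcs]) + x
    rult[dest] = line              # dest becomes available on that line
    return line % num_lines        # circular line structure
```

Each cycle the active line advances, so instructions drain into the small issue buffer only when their operands are predicted ready.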
23. Dataflow Prescheduler Performance
- Configurations: 8-entry issue buffer, 12 lines, 8 FIFOs; 16-entry issue buffer, 12 lines, 16 FIFOs
- Avg. 54% performance increase for the 8-entry buffer
- Avg. 33% performance increase for the 16-entry buffer
Michaud et al. HPCA2001
24. Cyclone
- Revamps the scheduler (takes advantage of higher perf.)
- Instructions from the prescheduler go into a countdown queue
- When the countdown reaches N/2 → main queue
- Main queue entries promote to the right
- Column 0 is issued each cycle
Ernst et al. ISCA2003
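To visualize the countdown/switchback path, here is a heavily simplified trace of one instruction's journey (even delay assumed, and collisions, multiple rows, and ports are all ignored — this is an illustration of the timing idea, not Cyclone's full structure):

```python
# Simplified trace of Cyclone's timed queues: an instruction with predicted
# delay n counts down for n/2 cycles, switches back into the main queue,
# then promotes one column right per cycle until issuing from column 0.
def cyclone_schedule(n):
    """Return per-cycle (queue, position) pairs for delay n (n even)."""
    path = [("countdown", n - 2 * t) for t in range(n // 2)]  # counter -2/cycle
    path += [("main", c) for c in range(n // 2 - 1, -1, -1)]  # promote right
    return path
```

The instruction reaches column 0 exactly n cycles after entry, so no wakeup or selection logic is needed — mispredicted timings are handled by the replay mechanism on the next slide.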
25. Cyclone (cont'd)
- Replay mechanism
  - Register File Ready Bits for a final operand check
- Store set predictor
  - A conservative method avoiding load/store dependence messiness
26. Cyclone Performance
- Decrease in latency
- An 8-decode, 8-issue Cyclone takes 12% of the area of a 64-instruction, 8-issue CAM scheduler
Ernst et al. ISCA2003
27. Cyclone Analysis
- Eliminates both wakeup and selection logic
- Competition for issue ports
- Congestion
  - Collisions during promotion (modifying promotion paths only shifts the pressure)
  - Replay-decode collisions
28. Wakeup-Free (WF) Schemes: WF-Replay
- Latency counters + selection logic
- Uses the entire scheduler
- For a 32-entry queue at issue width 4: 9% performance hit (vs. 25.5% for Cyclone)
- Issue width 6: 0.2% performance hit; issue width 8: no performance hit
Hu et al. HPCA2004
29. WF-Precheck
- Do a precheck instead of replay
- Check Reg Ready Bits before issuing
- If not ready, recalculate timing
- Increases complexity of selection logic
Hu et al. HPCA2004
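The precheck step can be sketched per entry as follows. Names, the entry encoding, and the recalculation penalty are invented for illustration — the point is that a failed ready-bit check resets the timing estimate *before* issue, instead of replaying after a wrong issue:

```python
# Sketch of WF-Precheck (entry encoding and penalty are assumptions).
def try_issue(entry, ready_bits, penalty=2):
    """entry: {'srcs': [regs], 'count': cycles until predicted ready}.
    Returns True if the instruction issues this cycle."""
    if entry["count"] > 0:
        entry["count"] -= 1            # latency counter still running
        return False
    if all(ready_bits.get(s, False) for s in entry["srcs"]):
        return True                    # precheck passed: safe to issue
    entry["count"] = penalty           # timing mispredicted: recalculate
    return False
```

The cost noted on the slide is that this check sits on the selection path, making the selection logic more complex.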
30. Segmented Issue Queue
Hu et al. HPCA2004
31. Segmented Issue Queue Commentary
- Rows represent different classes of latencies
- Only select on lowest row (latency 0)
- Sinking/Collapsing structure to prevent pileups
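The sinking behavior can be sketched in a couple of lines (structure simplified; row layout and names are our assumptions): each row holds the instructions of one remaining-latency class, selection examines only row 0, and every cycle all rows collapse one step closer to issue.

```python
# Sketch of the segmented queue's sinking step (structure illustrative).
def cycle(rows):
    """rows[k] = instructions issuing in k cycles.
    Issue row 0 and sink every other row down by one latency class."""
    issued = rows[0]
    sunk = rows[1:] + [[]]   # everything moves one row closer to issue
    return issued, sunk
```

Because only row 0 feeds selection, the per-cycle associative work is bounded by one segment rather than the whole queue.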
32. WF-Segment Performance
- 5.8% performance loss (vs. 3.5% for Precheck)
Hu et al. HPCA2004
33. Conclusions
- Low-power optimizations tend to target control logic
  - They don't change the underlying structure
- Low-complexity optimizations
  - More creative designs
  - Low power
  - No appreciable performance loss (possibly even a speedup)
34. Backup Slides