Title: Half Price Architecture
1Half Price Architecture
- Authors: Ilhyun Kim and Mikko H. Lipasti
2Motivation
- Processors are overdesigned
- Handle 0, 1, 2 operand instructions equally
- Simplifies control
- Requires multi-port register files
- Requires large broadcast busses for wakeup logic
- Results in slower clock frequency
3Scarcity of 2-source instructions
- Characterize frequency of 2-source instructions
- SimpleScalar/Alpha 3.0 (sim-outorder)
- SPEC 2000 integer benchmark suite
4Dynamic 2-source instructions
- 18-36% use 2-source format
- But some use the zero register or the same operand twice
5Dynamic 2-source instructions
- 6-23% use 2-source format without zero register or duplicate operands
6Deeper Study
- 2-source instructions not very dominant
- Justifies further study into overdesign
- Scheduling logic: wakeup logic
- Register file access
7Processor Model
8Wakeup Logic
- Wakeup logic: the logic that notifies a queued instruction that its operands will be ready off the bypass bus
- Once both operands are ready, the instruction may be selected for issue
- Destinations are broadcast to all queued instructions
- Broadcast is slow (high fanout)
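The broadcast-and-compare behaviour described above can be sketched in a few lines of Python. This is only an illustrative model, not the authors' hardware: the `Entry` class, the `broadcast` function, and the comparison counter are hypothetical names, and the three instructions are taken from the Wakeup Logic System slides, read as `op src, src, dest`.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One issue-queue entry (hypothetical model, not the paper's design)."""
    op: str
    dest: str
    srcs: tuple                                # source tags, e.g. ("r1", "r2")
    ready: set = field(default_factory=set)    # sources already available

    def can_issue(self):
        return all(s in self.ready for s in self.srcs)


def broadcast(queue, dest_tags):
    """Compare every broadcast tag against every source of every entry."""
    comparisons = 0
    for entry in queue:
        for src in entry.srcs:
            for tag in dest_tags:
                comparisons += 1               # one comparator firing
                if src == tag:
                    entry.ready.add(src)
    return comparisons


# Assumption: the sources of the first two instructions are already ready.
queue = [
    Entry("ADD", "r2", ("r6", "r7"), ready={"r6", "r7"}),
    Entry("SUB", "r1", ("r4", "r5"), ready={"r4", "r5"}),
    Entry("ADD", "r3", ("r1", "r2")),
]
issued = [e for e in queue if e.can_issue()]
waiting = [e for e in queue if not e.can_issue()]
comparisons = broadcast(waiting, [e.dest for e in issued])
print(comparisons, waiting[0].can_issue())
```

Every waiting entry compares both of its sources against every broadcast tag, which is the fanout the last bullet refers to.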
9Wakeup Logic System
Diagram labels: Dispatch, Issue Queue, Instructions, Selector (4-way)
Instructions shown: Add r6, r7, r2; Sub r4, r5, r1; Add r1, r2, r3
10Wakeup Logic System
Diagram labels: Dispatch, Issue Queue, Instructions, Selector (4-way)
Instructions shown: Add r1, r2, r3; Add r6, r7, r2; Sub r4, r5, r1
11Wakeup Logic System
Diagram labels: Dispatch, Issue Queue, Instructions, Selector (4-way)
Instructions shown: Add r1, r2, r3
12Wakeup Logic - overdesign
- Destinations broadcast simultaneously to both operands
- Useful only when
- Both operands fetched from bypass
- Both operands ready in same cycle
13Case 1: Both operands requiring bypass
- Some operands are already ready (don't need wakeup)
- 4-16% have 2 pending operands in scheduler
14Case 2: Both operands ready in same cycle
- Operands become ready in different cycles
- <3% become ready in same cycle
15Previous Work: Tag Elimination
- Ernst and Austin
- Predict latest arriving operand
- Use only one comparator for it
- Incurs penalty for mispredictions
- Implementing with selective recovery is
impractical
16Sequential Wakeup
- Less bus loading, different timing
17Example
Cycle 1
Broadcast tags: r2
src   op    rdy   rdy   src   dest
r1    ADD   0     0     r2    r3
r3    SUB   0     0     r4    r5
r5    XOR   0     0     _     r6
18Example
Cycle 2
Broadcast tags: r1, r4; r2
src   op    rdy   rdy   src   dest
r1    ADD   1     1     r2    r3     issue
r3    SUB   0     0     r4    r5
r5    XOR   0     0     _     r6
19Example
Cycle 3
Broadcast tags: r3; r1, r4
src   op    rdy   rdy   src   dest
r1    ADD   1     1     r2    r3     issued
r3    SUB   1     1     r4    r5     issue
r5    XOR   0     0     _     r6
20Example
Cycle 4
Broadcast tags: r5; r3
src   op    rdy   rdy   src   dest
r1    ADD   1     1     r2    r3     issued
r3    SUB   1     1     r4    r5     issued
r5    XOR   1     0     _     r6     issue
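The four-cycle example can be replayed with a small simulation. This is a sketch under several assumptions that are not stated on the slides: the first tag group shown in each cycle is on a fast bus and the second is the same tags one cycle later on a slow bus; each entry's first source is compared only against the fast bus and its second source only against the slow bus; a missing second operand (`_`) does not block issue; and an issued instruction puts its destination on the fast bus in the following cycle. `Entry`, `simulate`, and `external` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    op: str
    fast_src: Optional[str]   # source watched on the fast bus
    slow_src: Optional[str]   # source watched on the slow bus; None = no operand
    dest: str
    fast_rdy: bool = False
    slow_rdy: bool = False


def simulate(entries, external, max_cycles=10):
    """Return {dest: issue cycle}. `external` maps a cycle to tags that
    producers outside the queue place on the fast bus in that cycle."""
    fast_bus, pending_dests, issue_cycle = [], [], {}
    for cycle in range(1, max_cycles + 1):
        slow_bus = fast_bus                          # last cycle's fast tags
        fast_bus = external.get(cycle, []) + pending_dests
        pending_dests = []
        for e in entries:
            if e.dest in issue_cycle:
                continue                             # already issued
            e.fast_rdy = e.fast_rdy or e.fast_src in fast_bus
            e.slow_rdy = e.slow_rdy or e.slow_src in slow_bus
            if e.fast_rdy and (e.slow_src is None or e.slow_rdy):
                issue_cycle[e.dest] = cycle
                pending_dests.append(e.dest)         # broadcast next cycle
    return issue_cycle


entries = [
    Entry("ADD", "r1", "r2", "r3"),
    Entry("SUB", "r3", "r4", "r5"),
    Entry("XOR", "r5", None, "r6"),
]
issue_cycle = simulate(entries, external={1: ["r2"], 2: ["r1", "r4"]})
print(issue_cycle)
```

Under these assumptions ADD, SUB, and XOR issue in cycles 2, 3, and 4, matching the issue markers in the example.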
21Last Operand Predictability
22Last Operand Predictor
- PC-based, direct-mapped, 2-bit saturating counters
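A minimal sketch of such a predictor, assuming a table of 2-bit counters indexed by the low bits of the PC. The table size, the initial counter value, the "left"/"right" encoding, and the class and method names are all hypothetical; the slide gives only the three properties listed above.

```python
class LastOperandPredictor:
    """Direct-mapped table of 2-bit saturating counters, indexed by PC.
    Counter values 0-1 predict the left operand arrives last, 2-3 the right."""

    def __init__(self, entries=1024):          # table size is an assumption
        self.entries = entries
        self.counters = [1] * entries          # start weakly at "left"

    def _index(self, pc):
        # Assumes 4-byte instructions (as on Alpha), so drop the low 2 bits.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        return "right" if self.counters[self._index(pc)] >= 2 else "left"

    def update(self, pc, last_was_right):
        i = self._index(pc)
        if last_was_right:
            self.counters[i] = min(3, self.counters[i] + 1)   # saturate at 3
        else:
            self.counters[i] = max(0, self.counters[i] - 1)   # saturate at 0


predictor = LastOperandPredictor(entries=16)
pc = 0x1000
history = [predictor.predict(pc)]
for outcome in (True, True, False, False):
    predictor.update(pc, outcome)
    history.append(predictor.predict(pc))
print(history)
```

Because the table is direct-mapped and untagged, two instructions whose PCs map to the same index share one counter.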
23Advantages/Disadvantages of Sequential Wakeup
- Advantages
- No recovery needed on mispredict
- Easily integrates with selective recovery
- Reduces bus load capacitance
- 26.4% delay speedup for 4-way 64-entry scheduler
- Disadvantages
- All mispredictions and simultaneous arrivals
issued one cycle later
24Register File Access
- 2 read ports and 1 write port per issue slot
- More ports cause
- Quadratic growth in area
- Linear growth in latency
- Having 2 read ports is an overdesign
- Often have 0, 1 sources or use bypass
- <4% of instructions need 2 read port accesses
25Need for Two Read Ports
26Sequential Register Access
- Have only one read port
- Structural hazard when 2 ports needed
- Perform both reads sequentially
- CACTI 3.0 model at 0.18 µm
- 160-entry register file going from 24 to 16 ports reduces latency by 20.5%
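The port counts follow from the per-slot figures given earlier: 24 ports corresponds to 8 issue slots with 2 read ports and 1 write port each, and 16 ports to the same 8 slots with 1 read port each. The sketch below checks that arithmetic and shows the extra cycle an instruction pays when both of its operands must come from a single read port. The function names are hypothetical, and the 8-slot width is inferred from the port counts rather than stated on this slide.

```python
def total_ports(issue_width, reads_per_slot, writes_per_slot=1):
    """Register file ports needed for a given issue width."""
    return issue_width * (reads_per_slot + writes_per_slot)


def extra_read_cycles(regfile_sources, read_ports_per_slot=1):
    """Extra cycles needed to read operands from the register file.
    Operands supplied by the bypass network are not counted."""
    reads_needed = -(-regfile_sources // read_ports_per_slot)  # ceiling division
    return max(0, reads_needed - 1)


print(total_ports(8, 2), total_ports(8, 1))
print(extra_read_cycles(2), extra_read_cycles(1))
```

Only an instruction that reads both operands from the register file pays the extra cycle; with 0 or 1 register-file sources the single port is enough.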
27Example
28Sequential Register Access and Scheduling Logic
- Speculative scheduling does not allow variable latencies
- Scheduling logic must detect sequential register access
- Authors use a conservative approach
- Only back-to-back issues don't require 2 cycles
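One way to read the conservative rule is sketched below. It assumes that "back-to-back" means the consumer issues in the cycle immediately after the producer of its last-arriving operand, so that operand comes from the bypass network and at most one register read is needed. The function and parameter names are hypothetical, and the authors' actual detection logic may differ.

```python
def assumed_read_cycles(num_sources, producer_issue_cycle, consumer_issue_cycle):
    """Register-read cycles the scheduler assumes for a consumer instruction.

    Conservative rule (an interpretation of the slide): a 2-source
    instruction is charged 2 sequential reads unless it issues
    back-to-back with its producer.
    """
    if num_sources < 2:
        return 1                      # a single read port always suffices
    back_to_back = consumer_issue_cycle == producer_issue_cycle + 1
    return 1 if back_to_back else 2


# Producer issues in cycle 2, consumer in cycle 3: back-to-back.
print(assumed_read_cycles(2, producer_issue_cycle=2, consumer_issue_cycle=3))
# Producer issues in cycle 1, consumer in cycle 3: assume 2 reads.
print(assumed_read_cycles(2, producer_issue_cycle=1, consumer_issue_cycle=3))
```

The rule is conservative because an instruction that is not issued back-to-back may still have had an operand available without a second read, yet it is charged two cycles anyway.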
29Scheduling Logic with Sequential Register Access
Wakeup Logic
Select Logic
30Performance of Sequential Wakeup
- IPC degradation: 0.4% (4-way), 0.6% (8-way)
- Outperforms tag elimination, even without prediction
31Performance of Sequential Register Access
- IPC degradation: 1.1% (4-way), 0.7% (8-way)
32Performance of Combined
- IPC degradation: 2.2% on average
- Worse than sum of both
- A mispredict leads to 2 sequential register reads
33Negative interference
Cycle 1
src   op    rdy   rdy   src   dest
r1    ADD   1     1     r2    r3     issue
r4    SUB   1     0     r3    r5
34Negative interference
Cycle 2
Broadcast tags: r3
src   op    rdy   rdy   src   dest
r1    ADD   1     1     r2    r3     issued
r4    SUB   1     0     r3    r5
- Mispredicted: r3 put on slow side
35Negative interference
Cycle 3
Broadcast tags: r5; r3
src   op    rdy   rdy   src   dest
r1    ADD   1     1     r2    r3     issued
r4    SUB   1     1     r3    r5     issue
- ADD and SUB weren't issued back-to-back
- Conservatively assume 2 register reads
36Conclusion
- Established overdesign of
- Wakeup Logic
- Register File multi-porting
- Wakeup logic sped up (26.4%)
- <1% IPC reduction
- Register file ports reduced, and latency decreased (20.5%)
- 1% IPC reduction
- Together 2.2% IPC reduction