Title: Lecture 16: Core Design
1. Lecture 16: Core Design
- Today: basics of implementing a correct out-of-order (OoO) core: register renaming, commit, LSQ, issue queue
2. The Alpha 21264 Out-of-Order Implementation
Figure: pipeline block diagram. Recoverable contents:
- Front end: Branch prediction and instr fetch, Instr Fetch Queue, Decode & Rename
- Reorder Buffer (ROB): Instr 1, Instr 2, Instr 3, Instr 4, Instr 5, Instr 6
- Committed Reg Map: R1→P1, R2→P2
- Speculative Reg Map: R1→P36, R2→P34
- Register File: P1-P64
- Issue Queue (IQ) and three ALUs
- Results written to regfile and tags broadcast to IQ
Original code         Renamed code
  R1 ← R1+R2            P33 ← P1+P2
  R2 ← R1+R3            P34 ← P33+P3
  BEQZ R2               BEQZ P34
  R3 ← R1+R2            P35 ← P33+P34
  R1 ← R3+R2            P36 ← P35+P34
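3. Rename
Before renaming:
  A: lr1 ← lr2 + lr3
  B: lr2 ← lr4 + lr5
  C: lr6 ← lr1 + lr3
  D: lr6 ← lr1 + lr2
  Hazards: RAR on lr3, RAW on lr1, WAR on lr2, WAW on lr6
  Execution order: A, then B and C, then D
After renaming:
  A: pr7 ← pr2 + pr3
  B: pr8 ← pr4 + pr5
  C: pr9 ← pr7 + pr3
  D: pr10 ← pr7 + pr8
  Hazards: RAR on pr3, RAW on pr7, no WAR, no WAW
  Execution order: A and B, then C and D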
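The hazard labels on this slide can be checked mechanically. The following is a minimal sketch, assuming each instruction is written as a (dest, src1, src2) tuple of register names; the `hazards` helper is hypothetical and not part of the lecture. It compares register names pairwise, so it reports every earlier/later pair that shares a name, which can be more than the one example per hazard type listed on the slide.

```python
from itertools import combinations

def hazards(program):
    """Classify register-name hazards between every earlier/later instruction pair.

    program: list of (dest, src1, src2) tuples in program order.
    Returns a set of (kind, register) pairs.
    """
    found = set()
    for early, late in combinations(program, 2):
        e_dest, e_srcs = early[0], set(early[1:])
        l_dest, l_srcs = late[0], set(late[1:])
        for reg in e_srcs & l_srcs:
            found.add(("RAR", reg))      # both read the same register name
        if e_dest in l_srcs:
            found.add(("RAW", e_dest))   # later reads what earlier writes
        if l_dest in e_srcs:
            found.add(("WAR", l_dest))   # later overwrites what earlier reads
        if e_dest == l_dest:
            found.add(("WAW", e_dest))   # both write the same register name
    return found

# Instructions A-D from the slide, before and after renaming.
before = [("lr1", "lr2", "lr3"), ("lr2", "lr4", "lr5"),
          ("lr6", "lr1", "lr3"), ("lr6", "lr1", "lr2")]
after = [("pr7", "pr2", "pr3"), ("pr8", "pr4", "pr5"),
         ("pr9", "pr7", "pr3"), ("pr10", "pr7", "pr8")]

kinds_before = {kind for kind, _ in hazards(before)}
kinds_after = {kind for kind, _ in hazards(after)}
```

Here `kinds_before` contains all four hazard types, while `kinds_after` contains only RAR and RAW: renaming removes the name dependences (WAR, WAW) and keeps the true dependences.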
4. Commit Example
Assume a processor with 6 logical regs and 10 physical regs

Instr  Original code       Renamed code          Dest  Old map  New map
A      lr1 ← lr2 + lr3     pr7  ← pr2  + pr3     lr1   pr1      pr7
B      lr2 ← lr4 + lr5     pr8  ← pr4  + pr5     lr2   pr2      pr8
C      lr6 ← lr1 + lr3     pr9  ← pr7  + pr3     lr6   pr6      pr9
D      lr6 ← lr1 + lr2     pr10 ← pr7  + pr8     lr6   pr9      pr10
E      lr3 ← lr6 + lr2     pr1  ← pr10 + pr8     lr3   pr3      pr1
F      lr4 ← lr3 + lr4     pr2  ← pr1  + pr4     lr4   pr4      pr2
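The mappings in this example can be reproduced with a small simulation. This is only a sketch under stated assumptions: lr1-lr6 initially map to pr1-pr6, pr7-pr10 start out free, and when the free list is empty the renamer waits for the oldest in-flight instruction to commit, which releases the old mapping of that instruction's destination. The function name and data layout are hypothetical.

```python
from collections import deque

def rename_with_commit(program, num_logical=6, num_physical=10):
    """Rename (dest, src1, src2) tuples; commit the oldest instr when out of free regs."""
    # Assumption: lr_i initially maps to pr_i; the remaining physical regs are free.
    map_table = {f"lr{i}": f"pr{i}" for i in range(1, num_logical + 1)}
    free_list = deque(f"pr{i}" for i in range(num_logical + 1, num_physical + 1))
    old_mappings = deque()  # one entry per in-flight instr, oldest first
    renamed = []
    for dest, src1, src2 in program:
        # Sources are read through the current (speculative) map.
        p1, p2 = map_table[src1], map_table[src2]
        if not free_list:
            # Stall until the oldest instr commits; its old dest mapping becomes free.
            free_list.append(old_mappings.popleft())
        new_reg = free_list.popleft()
        old_mappings.append(map_table[dest])  # remembered until this instr commits
        map_table[dest] = new_reg
        renamed.append((new_reg, p1, p2))
    return renamed

program = [("lr1", "lr2", "lr3"), ("lr2", "lr4", "lr5"), ("lr6", "lr1", "lr3"),
           ("lr6", "lr1", "lr2"), ("lr3", "lr6", "lr2"), ("lr4", "lr3", "lr4")]
renamed = rename_with_commit(program)
```

With these assumptions, E receives pr1 only after A commits, and F receives pr2 only after B commits, matching the renamed code in the example.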
5. Out-of-Order Loads/Stores
  Ld  R1 ← [R2]
  Ld  R3 ← [R4]
  St  R5 → [R6]
  Ld  R7 ← [R8]
  Ld  R9 ← [R10]
6. Memory Dependence Checking
- The issue queue checks for register dependences and executes instructions as soon as registers are ready
- Loads/stores access memory as well: must check for RAW, WAW, and WAR hazards for memory as well
- Hence, first check for register dependences to compute effective addresses, then check for memory dependences
Figure: queue of loads and stores, in the order shown on the slide:
  Ld  0xabcdef
  Ld  (no address shown)
  St  (no address shown)
  Ld  (no address shown)
  Ld  0xabcdef
  St  0xabcd00
  Ld  0xabc000
  Ld  0xabcd00
7. Memory Dependence Checking
- Load and store addresses are maintained in program order in the Load/Store Queue (LSQ)
- Loads can issue if they are guaranteed to not have true dependences with earlier stores
- Stores can issue only if we are ready to modify memory (cannot recover if an earlier instr raises an exception)
Figure: LSQ contents, in the order shown on the slide:
  Ld  0xabcdef
  Ld  (no address shown)
  St  (no address shown)
  Ld  (no address shown)
  Ld  0xabcdef
  St  0xabcd00
  Ld  0xabc000
  Ld  0xabcd00
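The load-issue rule above can be sketched as a check over the LSQ. This is a conservative illustration, not the Alpha 21264 mechanism: it assumes the top entry of the figure is the oldest, treats entries without an address as not yet computed (`None`), compares addresses for exact equality only (no access sizes or partial overlap), and simply holds a load back when an earlier store matches. The function name and the resolved store address 0x123400 are hypothetical.

```python
def load_can_issue(lsq, load_index):
    """Return True if the load at load_index provably has no true dependence
    on any earlier store.

    lsq: list of (op, addr) in program order; addr is None if not yet computed.
    """
    op, addr = lsq[load_index]
    if op != "Ld" or addr is None:
        return False
    for earlier_op, earlier_addr in lsq[:load_index]:
        if earlier_op != "St":
            continue
        # An unknown store address might alias; a matching one certainly does.
        if earlier_addr is None or earlier_addr == addr:
            return False
    return True

# LSQ contents from the slide (assumed oldest first).
lsq = [("Ld", 0xabcdef), ("Ld", None), ("St", None), ("Ld", None),
       ("Ld", 0xabcdef), ("St", 0xabcd00), ("Ld", 0xabc000), ("Ld", 0xabcd00)]

first_load_ok = load_can_issue(lsq, 0)    # no earlier stores
fifth_load_ok = load_can_issue(lsq, 4)    # blocked by the unresolved store

# Hypothetical: the unresolved store's address turns out to be 0x123400.
resolved = list(lsq)
resolved[2] = ("St", 0x123400)
```

Once the earlier store address is known, the loads to 0xabcdef and 0xabc000 are free to issue, while the final load to 0xabcd00 still has to respect the store to the same address.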
8. The Alpha 21264 Out-of-Order Implementation
Figure: pipeline block diagram, now including the memory path. Recoverable contents:
- Front end: Branch prediction and instr fetch, Instr Fetch Queue, Decode & Rename
- Reorder Buffer (ROB): Instr 1, Instr 2, Instr 3, Instr 4, Instr 5, Instr 6, Instr 7
- Committed Reg Map: R1→P1, R2→P2
- Speculative Reg Map: R1→P36, R2→P34
- Register File: P1-P64
- Issue Queue (IQ) and four ALUs
- Results written to regfile and tags broadcast to IQ
- LSQ and D-Cache; LSQ entries: P37 ← [P35 + 8], P37 → [P36 + 8]
Original code         Renamed code
  R1 ← R1+R2            P33 ← P1+P2
  R2 ← R1+R3            P34 ← P33+P3
  BEQZ R2               BEQZ P34
  R3 ← R1+R2            P35 ← P33+P34
  R1 ← R3+R2            P36 ← P35+P34
  LD R4 ← 8[R3]         P37 ← 8[P35]
  ST R4 → 8[R1]         P37 → 8[P36]
9. Speculative Issue
- Instr I1 leaves the issue queue at the start of cycle 6; the instr then reads operands from the regfile, wires are traversed, the instruction executes, and the result is available at the end of cycle 8
- If operand availability is broadcast to the issue queue in cycle 9, the dependent instruction leaves in cycle 10
- This causes a 4-cycle gap between successive instrs
- Hence, if we know that the instruction takes a cycle to execute, the operand is broadcast to the issue queue in cycle 6 and the dependent instr leaves the issue queue in cycle 7; the input operand is correctly bypassed at the FU
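The cycle counts above can be reproduced with a short calculation. This is a sketch of one possible timing model, not a description of any specific pipeline: it assumes two cycles (regfile read and wire traversal) separate issue from execution, and that a dependent instruction leaves the queue one cycle after the broadcast. The function and parameter names are hypothetical.

```python
def dependent_issue_cycle(issue_cycle, exec_latency=1, cycles_before_exec=2,
                          speculative=True):
    """Cycle in which a dependent instruction can leave the issue queue."""
    if speculative:
        # Broadcast early, based on the known execution latency, so the
        # dependent instr meets its operand on the bypass at the FU.
        broadcast = issue_cycle + exec_latency - 1
    else:
        # Wait until the result actually exists, then broadcast.
        result_ready = issue_cycle + cycles_before_exec + exec_latency - 1
        broadcast = result_ready + 1
    return broadcast + 1

# Numbers from the slide: I1 leaves the issue queue in cycle 6.
gap_without_speculation = dependent_issue_cycle(6, speculative=False) - 6
gap_with_speculation = dependent_issue_cycle(6, speculative=True) - 6
```

Under these assumptions the gap between dependent instructions shrinks from 4 cycles to 1 cycle.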
10. Load Hit Speculation
- The previous optimization assumes that we know the exact latency for every operation
- This is true for all ops except loads (cache hit or miss?)
- Assume a hit and schedule accordingly; on a cache miss, must squash all speculatively issued instructions; an instruction therefore sits in the queue until load hits are determined
11. Register Rename Logic
Figure: rename logic block diagram. Labeled components and signals:
- Blocks: Map Table, Free Pool, Dependence Check Logic, Mux
- Inputs: Logical Source Regs, Logical Dest Regs
- Outputs: Physical Source Regs, Physical Dest Regs
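One way to read the blocks in this diagram: when several instructions are renamed in the same cycle, a source that was written by an earlier instruction in the same group must take that instruction's newly allocated physical register rather than the (stale) map-table entry. The sketch below illustrates that idea only; it is not the circuit from the figure. The function name is hypothetical, and the initial mapping R3→P3 is inferred from the renamed code on the Alpha 21264 slide.

```python
def rename_group(group, map_table, free_pool):
    """Rename a group of (dest, src1, src2) instrs that arrive in the same cycle."""
    stale_map = dict(map_table)                      # map table as read this cycle
    new_dests = [free_pool.pop(0) for _ in group]    # one free reg per dest
    renamed = []
    for i, (dest, *srcs) in enumerate(group):
        phys_srcs = []
        for src in srcs:
            phys = stale_map[src]                    # map-table lookup
            # Dependence check: the latest earlier writer of src in this
            # group overrides the map-table value (the role of the mux).
            for j in range(i):
                if group[j][0] == src:
                    phys = new_dests[j]
            phys_srcs.append(phys)
        renamed.append((new_dests[i], *phys_srcs))
    # Update the map table; later writers in the group win.
    for (dest, *_), new_reg in zip(group, new_dests):
        map_table[dest] = new_reg
    return renamed

# The four register-writing instrs from the Alpha 21264 slide (branch omitted).
map_table = {"R1": "P1", "R2": "P2", "R3": "P3"}
free_pool = ["P33", "P34", "P35", "P36"]
group = [("R1", "R1", "R2"), ("R2", "R1", "R3"),
         ("R3", "R1", "R2"), ("R1", "R3", "R2")]
renamed = rename_group(group, map_table, free_pool)
```

The result matches the renamed code on that slide, and the updated `map_table` ends with R1→P36 and R2→P34, the speculative register map shown there.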
12. Map Table RAM
- Each entry holds a phys reg id (7 bits)
- Num entries = Num logical regs
- Shadow copies (shift register)
13. Map Table CAM
- Entry fields: logical reg id, valid
- Widths labeled in the figure: 5-bits, 1-bit, 1-bit
- Num entries = Num phys regs
- Shadow copies
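14. Wakeup Logic
Figure: wakeup logic for the issue window. Labeled signals:
- Broadcast result tags: tag1 ... tagIW
- Per-entry fields: tagL, tagR (operand tags) and rdyL, rdyR (operand ready bits)
- An OR gate per operand combines the tag comparisons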
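The signal names in this figure suggest the following behavior: every broadcast tag is compared against the left and right operand tags of every issue-window entry, and a match sets the corresponding ready bit. A minimal sketch of that behavior, assuming entries are dictionaries with the field names from the figure plus a hypothetical `dest` field; it does not remove entries once they issue.

```python
def wakeup(issue_window, broadcast_tags):
    """Set ready bits for operands whose tag is broadcast; return ready dests."""
    tags = set(broadcast_tags)
    for entry in issue_window:
        # OR of the comparisons against tag1 ... tagIW, per operand.
        if entry["tagL"] in tags:
            entry["rdyL"] = True
        if entry["tagR"] in tags:
            entry["rdyR"] = True
    return [e["dest"] for e in issue_window if e["rdyL"] and e["rdyR"]]

# Two waiting instrs from the Alpha 21264 slide: P34 <- P33+P3 and P35 <- P33+P34.
window = [
    {"dest": "P34", "tagL": "P33", "rdyL": False, "tagR": "P3", "rdyR": True},
    {"dest": "P35", "tagL": "P33", "rdyL": False, "tagR": "P34", "rdyR": False},
]
ready_after_p33 = wakeup(window, ["P33"])
ready_after_p34 = wakeup(window, ["P34"])
```

After P33 is broadcast only the first entry is ready; the second becomes ready once P34 is broadcast as well.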
15. Selection Logic
Figure: selection logic over the issue window. Labeled elements:
- Issue window, Arbiter cell
- Signals: req, grant, anyreq, enable
- For multiple FUs, will need sequential selectors
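The note about sequential selectors can be illustrated as follows. This is a sketch under an assumed policy (fixed priority, lowest window position wins), which the slide does not specify: each selector grants one requester, and that requester is masked out before the next selector runs. Function names are hypothetical.

```python
def select_one(requests):
    """Grant the lowest-index requester; return its index, or None if no req."""
    for index, req in enumerate(requests):
        if req:
            return index
    return None

def select_for_fus(requests, num_fus):
    """Apply one selector per FU in sequence, masking earlier grants."""
    remaining = list(requests)
    grants = []
    for _ in range(num_fus):
        winner = select_one(remaining)
        if winner is None:
            break                      # no requesters left for this FU
        grants.append(winner)
        remaining[winner] = False      # hide the granted entry from later selectors
    return grants

requests = [False, True, True, False, True]   # req bits from the issue window
grants_two_fus = select_for_fus(requests, 2)
grants_four_fus = select_for_fus(requests, 4)
```

Because each selector depends on the grants of the one before it, the selection delay grows with the number of FUs, which is one reason issue width affects cycle time.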
16. Structure Complexities
- Critical structures: register map tables, issue queue, LSQ, register file, register bypass
- Cycle time is heavily influenced by: window size (physical register size), issue width (FUs)
- Conflict between the desire to increase IPC and clock speed