Federation: Repurposing Scalar Cores for Out-of-Order Instruction Issue

1
Federation: Repurposing Scalar Cores for
Out-of-Order Instruction Issue

David Tarjan, Michael Boyer, and Kevin Skadron
University of Virginia
Department of Computer Science
Currently on internship/sabbatical at NVIDIA Research

2
Motivation
[Figure labels: Homogeneous, Heterogeneous, Adaptive (Federation)]
3
Basic Insights

A multithreaded in-order core has many registers, which can be reused for a reorder buffer or active list
If cores are small, single-cycle communication between neighbors is feasible
Prior work on making large OOO cores feasible can be applied at the low end to make low-cost OOO possible

4
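In-order & Out-of-order Pipelines
[Figure: in-order and out-of-order pipelines side by side. Stage labels: Bpred, Fetch, Decode, Allocate, Rename, Issue, Execute, Mem, Writeback, Commit. Execute, Mem, and Writeback are labeled in both pipelines.]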
5
Issue Queue Example

[Figure: issue queue example showing a table of 0/1 bits with entries labeled IQ2 and IQ3]
Huang et al., Energy-Efficient Hybrid Wakeup Logic, ISLPED 2002
Sassone et al., Matrix Scheduler Reloaded, ISCA 2007
6
Simplified Load-Store Queue

Memory Alias Table (MAT)
No store forwarding
No conservative waiting on stores
Only detect memory order violations after they have occurred, and flush the pipeline when the offending instruction commits

Amir Roth, Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load Optimization, ISCA 2005
7
MAT Example
st 0x13, r5
ld r1, 0x13
8
MAT Example
st 0x13, r5
ld r1, 0x13 [EXE]
ld executes and increments counter
9
MAT Example
st 0x13, r5 [COM]
ld r1, 0x13
st commits and sets flag
10
MAT Example
ld r1, 0x13 [COM, Flush]
ld commits, sees flag, and flushes pipeline
11
MAT Example
ld r1, 0x13
MAT is reset and execution resumes
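
The MAT example on slides 7 to 11 can be replayed with a small software model. This is only a sketch under stated assumptions, not the hardware described in the talk: the table size, the modulo index function, and the decrement of the counter when a load commits are all assumptions made for illustration, as are the names `MemoryAliasTable` and `replay_slide_example`.

```python
class MemoryAliasTable:
    """Toy model of a Memory Alias Table (sizes and hashing are assumed)."""

    def __init__(self, entries=16):
        self.entries = entries
        self.counters = [0] * entries   # loads executed but not yet committed
        self.flags = [False] * entries  # set when a store finds such a load

    def _index(self, addr):
        # Assumed index function: low bits of the address.
        return addr % self.entries

    def load_execute(self, addr):
        # "ld executes and increments counter"
        self.counters[self._index(addr)] += 1

    def store_commit(self, addr):
        # "st commits and sets flag": a nonzero counter means a younger load
        # to an aliasing address has already executed.
        i = self._index(addr)
        if self.counters[i] > 0:
            self.flags[i] = True

    def load_commit(self, addr):
        # "ld commits, sees flag, and flushes pipeline"
        i = self._index(addr)
        self.counters[i] -= 1  # assumption: the load releases its count here
        if self.flags[i]:
            self.reset()       # "MAT is reset and execution resumes"
            return True        # caller must flush the pipeline
        return False

    def reset(self):
        self.counters = [0] * self.entries
        self.flags = [False] * self.entries


def replay_slide_example():
    """ld to 0x13 executes before the older st to 0x13 commits."""
    mat = MemoryAliasTable()
    mat.load_execute(0x13)
    mat.store_commit(0x13)
    return mat.load_commit(0x13)


def replay_in_order_case():
    """The store commits before the load executes, so no violation is seen."""
    mat = MemoryAliasTable()
    mat.store_commit(0x13)
    mat.load_execute(0x13)
    return mat.load_commit(0x13)
```

In this model `replay_slide_example()` reports a flush, while `replay_in_order_case()` does not. Because the table is indexed by a hash of the address, unrelated addresses that share an entry would also trigger a flush; the model accepts that as the price of a structure much simpler than a conventional LSQ.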
12
Performance Impact
13
Performance
14
Energy Efficiency
15
Area Efficiency
16
Conclusions

Two in-order cores can be federated at run-time to form a 2-way OOO core
Almost doubling the IPC of a throughput core is possible with very little extra hardware
Don't want traditional OOO structures because their performance comes at too high a price
Best combined area and energy efficiency

17
Q&A
18
Backup
19
Core Fusion Data
Figure from Ipek et al., Core Fusion: Accommodating Software Diversity in Chip Multiprocessors, ISCA 2007
20
Overall Results

Scalar in-order core has 8KB I- and D-caches, 256KB L2
Base 2-way core has 16KB I- and D-caches, 256KB L2, 32-entry ROB, 16-entry issue queue, 16-entry LSQ, bimodal bpred
4-way core has 32KB I- and D-caches, 2MB L2, 128-entry ROB, 32-entry IQ and LSQ, tournament bpred

21
Branch Prediction

Use only a Next Line and Set (NLS) predictor, a bimodal predictor, and a Return Address Stack (RAS)
NLS is OK if the instruction working set is not larger than the I-cache
A small bimodal predictor is OK for a small-window processor
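
For readers unfamiliar with the term, a bimodal predictor is a table of 2-bit saturating counters indexed by bits of the branch PC. The following is a generic textbook sketch, not the configuration evaluated in the talk: the table size, the PC shift, and the initial counter value are assumptions.

```python
class BimodalPredictor:
    """Table of 2-bit saturating counters indexed by low PC bits."""

    def __init__(self, entries=256):
        self.entries = entries
        self.counters = [1] * entries  # assumed start state: weakly not-taken

    def _index(self, pc):
        # Assumes 4-byte instructions, so the two low PC bits carry no information.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        # Counter values 2 and 3 predict taken; 0 and 1 predict not-taken.
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def run_loop_branch(iterations=10):
    """Count correct predictions for a loop branch taken on all but the last iteration."""
    bp = BimodalPredictor()
    pc = 0x400
    correct = 0
    for i in range(iterations):
        taken = i < iterations - 1
        if bp.predict(pc) == taken:
            correct += 1
        bp.update(pc, taken)
    return correct
```

With ten iterations the model mispredicts only the first and the last execution of the loop branch, which is the behavior that makes such a small structure adequate for simple loops.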

22
Fetch

The two I-caches act as one I-cache of twice the size and associativity (with random replacement)
More logic and buffers to capture two instructions
Extra cycle to route instructions from the two I-caches to the two decoders
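
One way to picture two caches behaving as a single cache of twice the associativity is to probe both with the same set index and let a fill land in a randomly chosen way of either half. The sketch below models only tags; the class names, geometry, and seeded random generator are assumptions made for illustration and do not come from the talk.

```python
import random


class SmallCache:
    """Tag array of one core's cache (geometry is an assumed example)."""

    def __init__(self, num_sets=4, ways=2, line_bytes=32):
        self.num_sets = num_sets
        self.ways = ways
        self.line_bytes = line_bytes
        self.tags = [[None] * ways for _ in range(num_sets)]

    def split(self, addr):
        # Return (set index, tag) for a byte address.
        line = addr // self.line_bytes
        return line % self.num_sets, line // self.num_sets

    def probe(self, addr):
        set_index, tag = self.split(addr)
        return tag in self.tags[set_index]


class FederatedCache:
    """Two SmallCaches probed in parallel: same sets, twice the ways."""

    def __init__(self, first, second, seed=0):
        self.halves = (first, second)
        self.rng = random.Random(seed)  # seeded only to keep the sketch repeatable

    def lookup(self, addr):
        # A hit in either half is a hit in the merged cache.
        return any(half.probe(addr) for half in self.halves)

    def fill(self, addr):
        # Random replacement across all ways of both halves.
        half = self.rng.choice(self.halves)
        set_index, tag = half.split(addr)
        half.tags[set_index][self.rng.randrange(half.ways)] = tag

    def access(self, addr):
        if self.lookup(addr):
            return True
        self.fill(addr)
        return False
```

For example, with `fc = FederatedCache(SmallCache(), SmallCache())`, the first `fc.access(0x1000)` misses and fills one of the four candidate ways, and a second access to the same line hits regardless of which half received it.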

23
Decode

Cancel the second instruction if the first turns out to be a branch
Extra cycle to route decoded instructions to the new allocate stage

24
Allocate

New logic and free lists to allocate ROB and IQ entries

25
Rename

New table, since it has too many ports
One centralized rename table, not distributed
Has a separate table (or a field in each RAT entry) for each register's producer instruction's IQ-slot number (see our new issue queue)

26
Issue

Uses a simple lookup table as the wakeup structure, where instructions subscribe to their input instructions (explained in detail later)
Centralized: one IQ for the two cores
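
The subscription idea can be sketched as a table with one row per IQ slot, where each row lists the consumer slots waiting on that producer. This is an illustrative model rather than the talk's exact structure: the names are hypothetical, the queue size is arbitrary, and waking consumers at the moment the producer issues assumes single-cycle producers.

```python
class SubscriptionIssueQueue:
    """Wakeup by table lookup: consumers subscribe to their producers' IQ slots."""

    def __init__(self, slots=16):
        self.subscribers = [[] for _ in range(slots)]  # producer slot -> consumer slots
        self.pending = [0] * slots                     # outstanding inputs per slot
        self.valid = [False] * slots

    def insert(self, slot, producer_slots):
        # producer_slots: IQ slots of in-flight producers, as supplied by rename.
        self.valid[slot] = True
        self.pending[slot] = len(producer_slots)
        for producer in producer_slots:
            self.subscribers[producer].append(slot)

    def ready(self):
        return [s for s in range(len(self.valid))
                if self.valid[s] and self.pending[s] == 0]

    def issue(self, slot):
        if not self.valid[slot] or self.pending[slot] != 0:
            raise ValueError("slot is not ready to issue")
        self.valid[slot] = False
        # Wakeup is a single row lookup, not a broadcast to every entry.
        for consumer in self.subscribers[slot]:
            self.pending[consumer] -= 1
        self.subscribers[slot] = []


def issue_order_for_chain():
    """Slot 1 depends on slot 0; slot 2 depends on slots 0 and 1."""
    iq = SubscriptionIssueQueue()
    iq.insert(0, [])
    iq.insert(1, [0])
    iq.insert(2, [0, 1])
    order = []
    while iq.ready():
        slot = iq.ready()[0]
        iq.issue(slot)
        order.append(slot)
    return order
```

The design point this illustrates is that the producer's IQ-slot number, recorded at rename, is all a consumer needs in order to subscribe, so no tag comparison across the whole queue is required.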

27
Register File

Register file is mirrored in the two cores
No extra copy instructions or load-balancing questions

28
Execute

Add an extra cycle for copying the result to the other core's register file (like EV6)

29
Memory Access

The two D-caches are checked in parallel, each responsible for half of the merged D-cache's ways
No standard LSQ, only a Memory Alias Table (details later)
The MAT only detects ordering violations and sends a signal to the pipeline