Federation: Repurposing Scalar Cores for Out-of-Order Instruction Issue

1
Federation: Repurposing Scalar Cores for
Out-of-Order Instruction Issue
  • David Tarjan, Michael Boyer, and Kevin Skadron
  • University of Virginia
  • Department of Computer Science
  • Currently on internship/sabbatical at NVIDIA
    Research

2
Motivation
[Figure: multicore design options compared: homogeneous, heterogeneous, and adaptive (Federation)]
3
Basic Insights
  • A multithreaded in-order core has many registers
    which can be reused for a reorder buffer
    or active list
  • If cores are small, single cycle communication
    between neighbors is feasible
  • Prior work on making large OOO cores feasible can
    be applied at the low end to make low-cost OOO
    possible

4
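In-order and Out-of-order Pipelines
[Figure: the in-order and out-of-order pipelines side by side. Stages shown: Bpred, Fetch, Decode, Allocate, Rename, Issue, Execute, Mem, Writeback, Commit; Execute, Mem, and Writeback are listed for both pipelines.]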
5
Issue Queue Example

[Figure: issue queue wakeup example with entries IQ2 and IQ3]

Huang et al., Energy-Efficient Hybrid Wakeup Logic, ISLPED 2002
Sassone et al., Matrix Scheduler Reloaded, ISCA 2007
6
Simplified Load-Store Queue
  • Memory Alias Table (MAT)
  • No store forwarding
  • No conservative waiting on stores
  • Only detect memory order violations after they
    have occurred and flush the pipeline when the
    offending instruction commits

Amir Roth, Store Vulnerability Window (SVW):
Re-Execution Filtering for Enhanced Load
Optimization, ISCA 2005
7
MAT Example
st 0x13, r5
ld r1, 0x13
8
MAT Example
st 0x13, r5
ld r1, 0x13
EXE
ld executes and increments counter
9
MAT Example
st 0x13, r5
COM
ld r1, 0x13
st commits and sets flag
10
MAT Example
ld r1, 0x13
COM
Flush
ld commits, sees flag, and flushes pipeline
11
MAT Example
ld r1, 0x13
MAT is reset and execution resumes
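
The walkthrough on slides 7 to 11 can be sketched as a small Python model. This is only an illustration of the steps shown on the slides, not the hardware design from the talk: the table size, the modulo hash, the per-entry flag, the rule that a store sets the flag only when it finds a nonzero counter, and the decrement when a load commits cleanly are all assumptions made here.

```python
class MemoryAliasTable:
    """Toy model of the counter-based MAT from the slides (assumptions noted inline)."""

    def __init__(self, num_entries=16):
        # num_entries and the modulo hash below are illustrative assumptions.
        self.num_entries = num_entries
        self.counters = [0] * num_entries
        self.flags = [False] * num_entries

    def _index(self, addr):
        return addr % self.num_entries

    def load_execute(self, addr):
        # Slide 8: a load that executes increments the counter for its address.
        self.counters[self._index(addr)] += 1

    def store_commit(self, addr):
        # Slide 9: the committing store sets a flag. Assumption: only when the
        # counter is nonzero, i.e. a load that has not committed yet has already
        # executed against this entry.
        i = self._index(addr)
        if self.counters[i] > 0:
            self.flags[i] = True

    def load_commit(self, addr):
        """Return True when the pipeline must be flushed (slide 10)."""
        i = self._index(addr)
        if self.flags[i]:
            self.reset()  # slide 11: MAT is reset and execution resumes
            return True
        # Assumption: a load that commits cleanly removes its own count.
        self.counters[i] -= 1
        return False

    def reset(self):
        self.counters = [0] * self.num_entries
        self.flags = [False] * self.num_entries


mat = MemoryAliasTable()
mat.load_execute(0x13)            # younger ld runs ahead of the older st
mat.store_commit(0x13)            # st commits, finds a nonzero counter, sets flag
flushed = mat.load_commit(0x13)   # ld commits, sees flag, flushes: True

mat.load_execute(0x13)            # replay after the flush
replay_flushed = mat.load_commit(0x13)   # no flag this time: False
```

Because the table is indexed by a hash, two different addresses can share an entry, so a flush can also be triggered by an unrelated store. In this sketch, addresses 0x13 and 0x23 collide in a 16-entry table.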
12
Performance Impact
13
Performance
14
Energy Efficiency
15
Area Efficiency
16
Conclusions
  • Two in-order cores can be federated at run-time
    to form a 2-way OOO core
  • Almost doubling the IPC of a throughput core is
    possible with very little extra hardware
  • Don't want traditional OOO structures because
    their performance comes at too high a price
  • Best combined area- and energy-efficiency

17
Q & A
18
Backup
19
Core Fusion Data
Figure from Ipek et al., Core Fusion:
Accommodating Software Diversity in Chip
Multiprocessors, ISCA 2007
20
Overall Results
  • Scalar in-order core has 8KB I- and D-caches, 256KB L2
  • Base 2-way core has 16KB I- and D-caches, 256KB
    L2, 32-entry ROB, 16-entry issue queue, 16-entry
    LSQ, bimodal bpred
  • 4-way core has 32KB I- and D-caches, 2MB L2, 128-entry
    ROB, 32-entry IQ and LSQ, tournament bpred

21
Branch Prediction
  • Use only a Next Line and Set (NLS) predictor,
    Bimodal predictor and a Return Address Stack
    (RAS)
  • NLS is OK if your instruction working set is not
    larger than the I-cache size
  • Small bimodal predictor is OK for a small-window
    processor
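
For readers unfamiliar with the term, a bimodal predictor is a table of 2-bit saturating counters indexed by the branch address. The sketch below shows the textbook scheme only; the table size, the PC shift (which assumes 4-byte instructions), and the initial counter state are assumptions, since the slides give no predictor parameters.

```python
class BimodalPredictor:
    """Textbook bimodal predictor: one 2-bit saturating counter per table entry."""

    def __init__(self, num_entries=64):
        # num_entries is an illustrative assumption.
        self.num_entries = num_entries
        self.counters = [1] * num_entries  # 1 = weakly not-taken (assumed start state)

    def _index(self, pc):
        # Assumes 4-byte instructions, so the low two PC bits carry no information.
        return (pc >> 2) % self.num_entries

    def predict(self, pc):
        # Counter values 2 and 3 predict taken; 0 and 1 predict not-taken.
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


# A loop branch that is taken nine times and then falls through once.
bp = BimodalPredictor()
pc = 0x400
mispredictions = 0
for taken in [True] * 9 + [False]:
    if bp.predict(pc) != taken:
        mispredictions += 1
    bp.update(pc, taken)
# Only the first iteration and the loop exit are mispredicted.
```

The single fall-through only moves the counter from strongly taken to weakly taken, so the next time the loop is entered the branch is still predicted taken.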

22
Fetch
  • Two I-caches act as one I-cache of twice the size
    and associativity (and random replacement)
  • More logic and buffers to capture two
    instructions
  • Extra cycle to route instructions from the two
    I-caches to the two decoders
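
One way to picture two caches acting as a single cache of twice the associativity is the sketch below. It is a functional illustration under stated assumptions, not the design from the talk: the class names, the cache geometry, and the choice to pick the victim uniformly at random over all merged ways are hypothetical.

```python
import random


class SmallCache:
    """One core's set-associative cache, tracking tags only (no data)."""

    def __init__(self, num_sets=4, ways=2, line_bytes=32):
        # Geometry is an illustrative assumption.
        self.num_sets, self.ways, self.line_bytes = num_sets, ways, line_bytes
        self.sets = [[None] * ways for _ in range(num_sets)]

    def split(self, addr):
        line = addr // self.line_bytes
        return line % self.num_sets, line // self.num_sets  # (set index, tag)

    def probe(self, addr):
        s, tag = self.split(addr)
        return tag in self.sets[s]

    def fill(self, addr, way):
        s, tag = self.split(addr)
        self.sets[s][way] = tag


class FederatedCache:
    """Two SmallCaches used as one cache with the same sets and twice the ways."""

    def __init__(self, cache_a, cache_b, seed=0):
        self.halves = (cache_a, cache_b)
        self.rng = random.Random(seed)

    def access(self, addr):
        """Return True on a hit; on a miss, fill a randomly chosen way."""
        # Both halves are probed with the same set index; hardware would do
        # this in parallel.
        if any(half.probe(addr) for half in self.halves):
            return True
        # Assumption: random replacement over the merged ways, done here by
        # picking a half and then a way within it.
        half = self.rng.choice(self.halves)
        half.fill(addr, self.rng.randrange(half.ways))
        return False


fc = FederatedCache(SmallCache(), SmallCache())
first = fc.access(0x100)    # cold miss, line is placed in one of the halves
second = fc.access(0x100)   # hit, whichever half holds the line
```

Keeping the set index unchanged and only widening the set means neither cache's indexing logic has to change when the cores are federated, which is the appeal of merging by associativity rather than by sets.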

23
Decode
  • Cancel second instruction if first turns out to
    be branch
  • Extra cycle to route decoded instructions to new
    allocate stage

24
Allocate
  • New logic and free lists to allocate ROB, IQ
    entries
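
A minimal sketch of free-list allocation, assuming an instruction needs one ROB entry and one IQ entry before it can proceed and stalls otherwise. The `Allocator` name, the entry counts, and the points at which entries are released (IQ at issue, ROB at commit) are assumptions for illustration.

```python
from collections import deque


class Allocator:
    """Hands out ROB and IQ entries from two free lists."""

    def __init__(self, rob_entries=32, iq_entries=16):
        # Entry counts are illustrative assumptions.
        self.free_rob = deque(range(rob_entries))
        self.free_iq = deque(range(iq_entries))

    def allocate(self):
        """Return (rob_slot, iq_slot), or None if the stage must stall."""
        if not self.free_rob or not self.free_iq:
            return None
        return self.free_rob.popleft(), self.free_iq.popleft()

    def release_iq(self, iq_slot):
        # Assumption: the IQ entry is freed when the instruction issues.
        self.free_iq.append(iq_slot)

    def release_rob(self, rob_slot):
        # Assumption: the ROB entry is freed when the instruction commits.
        self.free_rob.append(rob_slot)


alloc = Allocator(rob_entries=2, iq_entries=1)
a = alloc.allocate()     # (0, 0)
b = alloc.allocate()     # None: the only IQ entry is in use
alloc.release_iq(0)      # first instruction issues
c = alloc.allocate()     # (1, 0)
```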

25
Rename
  • New table since it has too many ports
  • One, centralized rename table, not distributed
  • Has separate table (or field in each RAT entry)
    for each register's producer instruction's IQ-slot
    number (see our new issue queue)

26
Issue
  • Uses a simple lookup table as wakeup structure,
    where instructions subscribe to their input
    instructions (explained in detail later)
  • Centralized, one IQ for the two cores
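
The subscription idea can be illustrated with a small model: at rename, each source register is looked up to find the IQ slot of its in-flight producer, and the consumer adds itself to that producer's row of a lookup table. This is a loose sketch, not the wakeup logic from the talk or the cited papers. All names are hypothetical, and waking consumers at the moment the producer issues assumes single-cycle results.

```python
class SubscriptionIssueQueue:
    """Toy issue queue where consumers subscribe to their producers' IQ slots."""

    def __init__(self, num_slots=8):
        self.num_slots = num_slots
        # subscribers[p]: IQ slots waiting on the instruction in slot p.
        self.subscribers = [set() for _ in range(num_slots)]
        self.pending = [0] * num_slots       # producers still outstanding
        self.occupied = [False] * num_slots
        self.dest = [None] * num_slots
        # Rename-side field: register -> IQ slot of its in-flight producer.
        self.producer_slot = {}

    def insert(self, slot, dest, sources):
        # Sources without an in-flight producer are treated as already ready.
        producers = {self.producer_slot[r] for r in sources
                     if r in self.producer_slot}
        for p in producers:
            self.subscribers[p].add(slot)
        self.pending[slot] = len(producers)
        self.occupied[slot] = True
        self.dest[slot] = dest
        if dest is not None:
            self.producer_slot[dest] = slot

    def ready(self):
        return [s for s in range(self.num_slots)
                if self.occupied[s] and self.pending[s] == 0]

    def issue(self, slot):
        """Issue the instruction in `slot` and return the slots it wakes."""
        woken = sorted(self.subscribers[slot])
        for s in woken:
            self.pending[s] -= 1
        self.subscribers[slot].clear()
        self.occupied[slot] = False
        # Clear the rename field unless a younger producer has replaced it.
        if self.producer_slot.get(self.dest[slot]) == slot:
            del self.producer_slot[self.dest[slot]]
        return woken


iq = SubscriptionIssueQueue()
iq.insert(0, dest="r1", sources=[])            # r1 = ...
iq.insert(1, dest="r2", sources=["r1"])        # r2 = f(r1)
iq.insert(2, dest="r3", sources=["r1", "r2"])  # r3 = g(r1, r2)
woken_by_0 = iq.issue(0)   # [1, 2]: both consumers of r1 are notified
ready_now = iq.ready()     # [1]: slot 2 still waits on r2
```

A producer only touches its own row of the table when it issues, so no tag has to be compared against every entry in the queue.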

27
Register File
  • Register file is mirrored in the two cores
  • No extra copy instructions or load-balancing
    questions

28
Execute
  • Add extra cycle for copying the result to the other
    core's register file (like EV6)

29
Memory Access
  • The two D-caches are checked in parallel, each
    responsible for half of the merged D-cache's ways
  • No standard LSQ, only a Memory Alias Table
    (details later)
  • Only detects ordering violations and sends a signal
    to the pipeline

30
Commit
  • Centralized commit, no slippage
  • Recover from branch mispredictions at commit, since
    there are no checkpoints of the RAT on branches
  • Recover from memory order violations (or false
    positives) from MAT