Realizing High IPC Through a Scalable, Multipath Microarchitecture


1
Realizing High IPC Through a Scalable, Multipath Microarchitecture
David Kaeli
Northeastern University Computer Architecture Research Laboratory
Boston, MA USA
2
The Team
  • David Morano, Alireza Khalafi, Marcos de Alba
    Northeastern University, Boston, MA USA
  • Augustus Uht, Sean Langford
    University of Rhode Island, Kingston, RI USA
    (now at CMU)

3
The Road to High IPC
  • Many studies have concluded that typical programs
    (e.g., SPECint) contain a significant amount of
    Instruction Level Parallelism (ILP)
  • Lam and Wilson reported an IPC of 40 for
    SP-CD-MF (speculative execution, perfect control
    dependence information, multi-path execution)
  • Gonzalez and Gonzalez reported an IPC of 37 for
    an infinite instruction window, but no value
    prediction (IPC went down to just under 10 for a
    128 entry instruction window)
  • So why are we still living with low,
    single-digit, IPCs???
  • Nobody has been aggressive enough!!!

4
Machine Philosophy
  • Issue a column of instructions on every cycle
    (not always possible)
  • Spend the rest of the time executing, squashing,
    snarfing and re-executing as necessary to
    preserve true control flow and data flow
    dependencies
  • Retire instructions at a rate of a column at a
    time
  • Design a datapath that is scalable in terms of
    latency as the size of the machine grows
  • ISA independent

5
Outline for this Talk
  • Overview of the Levo microarchitecture
  • Discussion of scalability within the Levo
    datapath
  • Disjoint execution
  • Simulation methodology and results
  • Comments and summary

6
Levo Microarchitectural Features
  • In-order instruction load, in-order retirement,
    rampantly out-of-order execution
  • Active stations: a more intelligent version of
    Tomasulo's reservation stations
  • Instruction/operand/memory/predicate time tags
    used to enforce data and control dependencies in
    a distributed fashion
  • Hardware runtime predication used for all BBs
    with targets within the execution window
  • Distributed register file reduces contention
    for a shared register file
  • Aggressive speculation: execute instructions
    independent of any data flow or control flow
    dependencies
  • Disjoint execution to cover control hazards
  • Limit study with real hardware constraints

7
In-order Instruction Load
  • Instructions are fetched in static order from
    I-cache, except:
      • Unconditional jump paths are followed
      • Loops are dynamically unrolled
      • Conditional branches with far targets (the target
        is more than two-thirds of the execution window
        size away): if the branch is strongly predicted
        taken, static fetching begins from the target
  • A conventional 2-level gshare branch predictor is
    used
  • Dynamic run-time predicates are generated so that
    every branch domain in the Execution Window is
    control independent
  • Nullify operations are broadcast to cause
    dependent instructions to re-execute
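
The far-target rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Levo fetch logic: the names `Branch`, `Prediction` and `next_fetch_pc` are hypothetical, addresses are counted in instruction slots rather than bytes, and dynamic loop unrolling is left out.

```python
from dataclasses import dataclass
from enum import Enum


class Prediction(Enum):
    """State of a 2-bit predictor counter (assumed encoding)."""
    STRONGLY_NOT_TAKEN = 0
    WEAKLY_NOT_TAKEN = 1
    WEAKLY_TAKEN = 2
    STRONGLY_TAKEN = 3


@dataclass(frozen=True)
class Branch:
    pc: int            # position of the branch, in instruction slots
    target: int        # position of the branch target, in instruction slots
    conditional: bool
    prediction: Prediction = Prediction.STRONGLY_TAKEN


def next_fetch_pc(branch: Branch, window_size: int) -> int:
    """Return the next instruction slot to load after `branch`."""
    if not branch.conditional:
        return branch.target                    # unconditional jumps are followed
    distance = abs(branch.target - branch.pc)
    is_far = 3 * distance > 2 * window_size     # farther than 2/3 of the window
    if is_far and branch.prediction is Prediction.STRONGLY_TAKEN:
        return branch.target                    # resume static fetch at the target
    return branch.pc + 1                        # near target: keep static order, rely on predication
```

With a hypothetical 96-entry window, a strongly-taken branch 70 slots away redirects fetch, while the same branch 60 slots away (or only weakly taken) stays in static order and is covered by predication.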

Microarchitecture
8
[Figure: Levo block diagram showing the I-Cache, the n x m Time-ordered Execution Window and the Memory Window]
9
Active Stations
  • More intelligent version of Tomasulo's reservation
    stations
  • Each AS holds:
      • A single instruction
      • Instruction operands
      • A time tag denoting its logical position in the
        execution window
  • Each AS shares a processing element with a number
    of other ASs (as defined by the size of a
    sharing group)

10
Active Stations
  • Communicate with other active stations in order
    to:
      • Snoop for the latest operand values
      • Forward the results to other active stations
      • Request a value from other active stations
      • Re-execute its instruction with new operand
        values
  • Handles control flow changes through runtime
    predication

11
Time Tags
  • Enforce the nominal sequential order of the
    instructions executed
  • Accompany all in-flight register values, memory
    values and predicate values
  • Have two parts:
      • Column tag is decremented by 1 whenever the
        left-most column is loaded
      • Row tag does not change

[Figure: time tag format with Row and Column fields]
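
The two-part tag can be sketched as follows. This is a minimal illustration, not the hardware encoding: `TimeTag`, `order_key` and `load_new_column` are hypothetical names, and the sketch assumes that a smaller (column, row) pair means earlier in nominal sequential order.

```python
from dataclasses import dataclass


@dataclass
class TimeTag:
    column: int   # decremented every time a new column is loaded
    row: int      # fixed for the lifetime of the tag

    def order_key(self, rows_per_column: int) -> int:
        """Flatten (column, row) into one number; smaller means earlier."""
        return self.column * rows_per_column + self.row


def load_new_column(tags: list) -> None:
    """Shift every in-flight tag by one column, as the slide describes."""
    for tag in tags:
        tag.column -= 1
```

Because every in-flight tag shifts by the same amount, the relative order of any two tags is unchanged by a column load, which is what lets the tags keep enforcing sequential order as the window advances.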
12
Execution Window
[Figure: the execution window, n rows by m columns (Row 0 to Row n-1, Column 0 to Column m-1). A sharing group of 4 mainline ASs, AS(0,m-1) through AS(3,m-1), shares a single PE.]
13
Active Station Operand Snooping and Snarfing
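[Figure: AS snooping logic. Each transaction on the result operand forwarding bus carries a time tag, address, value and path. Comparators (labeled gt, lt and !) check these against the AS time tag and the time tag, address and path of the operand the AS currently holds; a match snarfs the value and triggers execute or re-execute.]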
14
Instruction  (a) Program  (b) With    (c) With    Result Time  Last Snarfed Time Tag
Number       Code         Renaming    Time Tags   Tag (ResTT)  (LSTT) in Active Station
I1           R4 <- 1      R4a <- 1    R4 <- 1     1
I5           R4 <- 2      R4b <- 2    R4 <- 2     5
I9           R3 <- R4     R3 <- R4b   R3 <- R4    9            1, then 5

(a) Program Code: sequential execution (at end, R3 holds 2)
(b) With Renaming: Out-of-Order (OOO) execution
  - I9 only snarfs I5 result (at end, R3 holds 2)
(c) With Time Tags: Out-of-Order (OOO) execution
  - I1 result and ResTT broadcast: R3 = 1, LSTT = 1
  - I5 result and ResTT broadcast: R3 = 2, LSTT = 5 (at end, R3 holds 2)
  (Same result if I5 broadcasts first: LSTT is set to and stays at 5;
  I1 result not snarfed by I9.)
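
The time-tag example can be replayed with a small Python sketch. It is a simplified model under assumptions, not the Levo snooping hardware: `ActiveStation`, `snoop` and `run` are hypothetical names, only one source register is tracked, and the acceptance test `lstt <= res_tt < time_tag` is inferred from the example (using `<=` so that a re-broadcast from the same producer would be accepted).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActiveStation:
    time_tag: int                 # position of this instruction (9 for I9)
    source_reg: str               # register this instruction reads
    value: Optional[int] = None   # last snarfed operand value
    lstt: int = -1                # last snarfed time tag; -1 means none yet

    def snoop(self, reg: str, value: int, res_tt: int) -> bool:
        """Snarf a broadcast if it comes from the closest earlier producer seen so far."""
        if reg != self.source_reg:
            return False
        if not (self.lstt <= res_tt < self.time_tag):
            return False          # later instruction, or older than what we hold
        self.value = value
        self.lstt = res_tt
        return True               # the station would now execute or re-execute


def run(broadcasts):
    """Feed (register, value, ResTT) broadcasts to I9 and return (value, LSTT)."""
    i9 = ActiveStation(time_tag=9, source_reg="R4")
    for reg, value, res_tt in broadcasts:
        i9.snoop(reg, value, res_tt)
    return i9.value, i9.lstt
```

Calling `run([("R4", 1, 1), ("R4", 2, 5)])` and `run([("R4", 2, 5), ("R4", 1, 1)])` both leave I9 holding the value 2 with LSTT 5, matching case (c): the broadcast order does not change the final result.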
15
Scalable Microarchitecture
  • Time tag size grows linearly with the total
    number of ASs
  • No reorder buffer (typically grows O(n^2))
  • No centralized architected register file
  • Register forwarding units hold the ISA-defined
    register state
  • Forwarding transactions maintain state
  • Segmented result buses: fixed length
  • Distributed L0 caching in the datapath

16
Observation About Register Lifetimes
  • The MultiScalar Project demonstrated that
    register lifetimes are short (spanning 1-2 basic
    blocks, within 32 instructions)
  • If we have instructions laid out in a
    time-ordered fashion, the probability we will
    have to forward in time very far is low
  • As a result, we can segment our interconnection
    fabric, assuming that communication will only span
    either the current, or at most the next, segment

17
Segmented Buses (Spanning Buses)
  • Use segmented buses to propagate execution
    results to later stations
  • Adjacent segments are interconnected with
    Forwarding Units (one forwarding unit, per bus,
    per column)
  • Register Forwarding/Filter Units (RFUs) hold a
    version of the ISA register state
  • Memory Forwarding/Filter Units (MFUs) and
    Predicate Forwarding Units (PFUs) are also
    provided
  • Backwarding buses are also provided
  • The number of I/Os to a FU is independent of the
    machine size and only depends on the column
    height
  • Segmented buses help to preserve scalability in
    our datapath
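
A back-of-the-envelope model shows why segmentation keeps latency tied to forwarding distance rather than machine size. This is an assumed cost model, not the simulator's: `forwarding_latency` is a hypothetical helper, and the 1-cycle defaults mirror the spanning-bus and FU delays listed with the modeling parameters later in the talk.

```python
def forwarding_latency(units_crossed: int, bus_delay: int = 1, fu_delay: int = 1) -> int:
    """Cycles for a result to reach a consumer `units_crossed` forwarding units away.

    Assumes one bus traversal per segment and one forwarding-unit delay per
    segment boundary, with no contention.
    """
    if units_crossed < 0:
        raise ValueError("units_crossed must be non-negative")
    segments = units_crossed + 1
    return segments * bus_delay + units_crossed * fu_delay
```

Under this model a consumer on the same segment sees 1 cycle and one on the next segment sees 3 cycles, regardless of how many columns the machine has; only rare long-distance forwards pay more.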

18
[Figure: segmented bus layout. Groups of active stations (AS) sit on bus segments; forwarding units (FU, with blocks labeled M D) connect each segment to its neighbors, taking inputs from the previous column and driving outputs to the next column.]
19
  • Register Forwarding/Filter Units
  • Capture the persistent register state
  • All buses are register transaction buses
  • Consolidate update transactions on input
  • Updates are forwarded to the output bus request
    logic immediately when possible
  • Requests are filtered based on time-tag value
  • Updates are managed in the file store in FIFO
    order

[Figure: RFU block diagram. An ISA register file per path, with time-tag logic, sits between the backwarding buses (backward in time: backwarding read buses, primary backwarding read bus, backwarding write bus) and the forwarding buses (forward in time: forwarding read buses, primary forwarding read bus, forwarding write bus).]
20
  • Memory Forwarding/Filter Units
  • Serve as an L0 cache
  • All buses are memory buses (the number of buses is
    set according to the interleave factor)
  • Consolidate update transactions on input
  • Updates are "forwarded" to the output bus request
    logic immediately when possible
  • Requests are filtered based on time-tag value
  • Current policy is to queue outgoing requests or
    responses in FIFOs until the buses are granted
    for use

[Figure: MFU block diagram. A memory cache with time-tag logic and FIFOs sits between the backwarding read and write buses (backward in time) and the forwarding read and write buses (forward in time).]
21
Disjoint Path Execution
  • Levo can only obtain high IPC if:
      • we can provide a large window of instructions to
        execute
      • a large percentage of the instructions on the
        eventual committed control-flow path are included
        in the window
  • To address the issues with hard-to-predict
    conditional control flow, we utilize disjoint
    path spawning in Levo

and DEE
22
Disjoint Path Execution
  • To enable path spawning we provide a disjoint
    path (D-path) set of ASs that share a processing
    element with a mainline set of ASs
  • D-paths are spawned in the case of hammock
    branches
  • The D-path is copied from the mainline path
  • The sign of the associated predicate is inverted
    for the D-path
  • The D-path receives lower priority for the PE
    than the mainline
  • When a hammock branch is mispredicted, we can
    treat the D-path as the new mainline path, and
    continue execution accordingly
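
The spawn-and-switch behavior described above can be sketched in Python. This is an illustration under assumptions, not the Levo implementation: `PathState`, `spawn_d_path` and `resolve` are hypothetical names, and a lower `priority` number is assumed to win the shared PE.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PathState:
    name: str
    instructions: tuple      # instructions held by this path's ASs
    predicate_taken: bool    # branch outcome this path assumes
    priority: int            # lower number wins the shared PE (assumption)


def spawn_d_path(mainline: PathState) -> PathState:
    """Copy the mainline path, invert its predicate and give it lower priority."""
    return replace(
        mainline,
        name="D-path",
        predicate_taken=not mainline.predicate_taken,
        priority=mainline.priority + 1,
    )


def resolve(mainline: PathState, d_path: PathState, branch_taken: bool) -> PathState:
    """Return the path that continues as mainline once the branch resolves."""
    if branch_taken == mainline.predicate_taken:
        return mainline      # prediction was correct; the D-path is discarded
    # Misprediction: the D-path already holds the right work, so promote it.
    return replace(d_path, name="mainline", priority=mainline.priority)
```

The point of the sketch is the misprediction case: instead of squashing and refetching, the already-executing D-path is relabeled as the mainline path.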

23
Label  Addr  Instruction      History
START  A100  LW R2,20(R4)
       A104  SUB R2,R2,1
       A108  BEQZ R2,TAR1     Weakly T
       A10C  ADD R2,R2,4
       A110  SW 30(R4),R2
TAR1   A114  LW R2,30(R4)
       A118  SUB R2,R2,8
       A11C  BEQZ R2,TAR2     Weakly NT
       A120  SW 20(R4),R2
TAR2   A124  ADD R2,R2,10
       A128  SUB R1,R1,1
       A12C  BNEQZ R1,START   Strongly T
       A130  SW 40(R4),R2
       . .

[Figure: the code above laid out in the execution window as a mainline path and a disjoint path; in the figure the load at A114 reads LW R2,28(R4)]
24
Modeling and Results
  • Present work utilizes:
      • MIPS-1/MIPS-2 machine
      • SGI compiler
      • SPECint 95 (compress, go and ijpeg) and 2000
        (bzip2, crafty, gcc, gzip, mcf, parser and
        vortex) benchmarks
  • 3 levels of modeling:
      • Trace-driven model (FastLevo): results in this
        presentation
      • Detailed cycle-accurate model (LevoSim): still
        under development
      • Synthesizable VHDL hardware model (HDLevo):
        validation
  • Design space exploration:
      • Impact of D-paths
      • Real vs. ideal memory
      • Range of bus latency issues

performance
25
Modeling parameters
  • L1 I/D geometry: 64KB, 2-way set associative, 32B
  • L2 unified I/D geometry: 2MB, direct mapped, 32B
  • Main memory geometry: infinite, 4-way interleaved
  • L0, L1, L2, memory hit latencies: 1, 1, 10, 100 cycles (does not include bus latency)
  • Branch predictor: 2-level; 1024-entry BHT; 4096-entry GPHT; 2-bit; 16-entry RAS; one per E-window row
  • Data value predictor: 4096 stride predictor, one per E-window row
  • PE latencies: same as MIPS R4000
26
Modeling parameters
  • L0 geometry: 32-32b, fully associative, 32b line
  • Spanning bus delay: 1 cycle
  • FU/BU delay (no contention): 1 cycle
  • Buses per RFU and per MFU: 2 buses
  • Buses per PFU: 1 bus
  • Columns per D-path: 1 column
  • ML-D switch time: 1 cycle plus time to broadcast new D-path values as ML
27
IPC obtained with Levo
28
Speedup obtained using D-paths versus single
path execution (harmonic means)
29
IPC of Levo compared to modeling 100% L1 I/D
hits (harmonic means)
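
For reference, the harmonic mean used to summarize per-benchmark IPC can be computed as below. This is a generic sketch: `harmonic_mean_ipc` is a hypothetical helper, and the sample values in the note are made up for illustration, not results from this talk.

```python
def harmonic_mean_ipc(ipcs):
    """Harmonic mean of per-benchmark IPC values: n / sum(1 / ipc_i)."""
    if not ipcs:
        raise ValueError("need at least one IPC value")
    if any(ipc <= 0 for ipc in ipcs):
        raise ValueError("IPC values must be positive")
    return len(ipcs) / sum(1.0 / ipc for ipc in ipcs)
```

For made-up IPCs of 2.0, 4.0 and 4.0 the harmonic mean is 3.0, below the arithmetic mean of about 3.33; the harmonic mean weights slow benchmarks more heavily, which is why it is the conventional way to average rates such as IPC.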
30
Summary of additional experiments
  • Varying the L1-D/L2 hit time (versus 1 cycle)
      • Increased L1-D HT to 2/4/8 cycles: 10%/22%/43%
        IPC loss
      • Increased L2 HT to 2/4/8/16 cycles:
        0.8%/2.3%/4.7%/8.9% IPC loss
  • Varying the number of buses per FU
      • Decreased to 1 bus/FU: 14% IPC loss
      • Increased to 4 buses/FU: 3% IPC gain
  • Removal of stride predictor: 0.8% IPC loss
  • Varying the number of columns per D-path
      • Increased to 2 cols/D-path: 8% IPC loss
  • Use of D-paths: 45% IPC gain
  • Varying the number of branch prediction tables
      • Decreased from 1 per row to a single table of the
        same total size: 0.4% IPC loss

31
Comments and Future Directions
  • I-fetch is the main barrier to further gains in
    IPC
  • The use of a detailed VHDL model of critical
    components in Levo has allowed us to design
    scalable resources
  • A number of novel microarchitectural features are
    present in a single design
  • Future challenges in Levo include:
      • Improved I-fetch (EV8, trace cache, dynamic
        D-paths)
      • Finish design of an ARB-like memory
      • Consider compiler support to aid in-order issue
        and D-path execution
      • Consider multithreaded extensions to support
        coarse-grained multithreading

32
To learn more about Levo, visit
  • http://www.ece.neu.edu/info/architecture/research/Levo.html
  • Also see our paper at Euro-Par 2002.