Title: Realizing High IPC Through a Scalable, Multipath Microarchitecture
1 Realizing High IPC Through a Scalable, Multipath Microarchitecture
David Kaeli, Northeastern University Computer Architecture Research Laboratory, Boston, MA USA
2 The Team
- David Morano, Alireza Khalafi, Marcos de Alba
- Northeastern University
- Boston, MA USA
- Augustus Uht
- Sean Langford
- University of Rhode Island
- Kingston, RI USA
- (now at CMU)
3 The Road to High IPC
- Many studies have concluded that typical programs (e.g., SPECint) contain a significant amount of Instruction Level Parallelism (ILP)
  - Lam and Wilson reported an IPC of 40 for SP-CD-MF (speculative execution, perfect control dependence information, multi-path execution)
  - Gonzalez and Gonzalez reported an IPC of 37 for an infinite instruction window, but no value prediction (IPC went down to just under 10 for a 128-entry instruction window)
- So why are we still living with low, single-digit IPCs?
  - Nobody has been aggressive enough!
4 Machine Philosophy
- Issue a column of instructions on every cycle (not always possible)
- Spend the rest of the time executing, squashing, snarfing and re-executing as necessary to preserve true control-flow and data-flow dependencies
- Retire instructions at a rate of a column at a time
- Design a datapath that is scalable in terms of latency as the size of the machine grows
- ISA independent
5 Outline for this Talk
- Overview of the Levo microarchitecture
- Discussion of scalability within the Levo datapath
- Disjoint execution
- Simulation methodology and results
- Comments and summary
6 Levo Microarchitectural Features
- In-order instruction load, in-order retirement, rampantly out-of-order execution
- Active stations: a more intelligent version of Tomasulo's reservation stations
- Instruction/operand/memory/predicate time tags used to enforce data and control dependencies in a distributed fashion
- Hardware runtime predication used for all basic blocks (BBs) with targets within the execution window
- Distributed register file: reduces contention for a shared register file
- Aggressive speculation: execute instructions independent of any data-flow or control-flow dependencies
- Disjoint execution to cover control hazards
- Limit study with real hardware constraints
7 In-order Instruction Load
- Instructions are fetched in static order from the I-cache, except:
  - Unconditional jump paths are followed
  - Loops are dynamically unrolled
  - For conditional branches with far targets (the target distance is greater than 2/3 the size of the execution window), if the branch is strongly predicted taken, static fetching begins from the target
- A conventional 2-level gshare branch predictor is used
- Dynamic run-time predicates are generated so that every branch domain in the Execution Window is control independent
- Nullify operations are broadcast to cause dependent instructions to re-execute
Microarchitecture
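The fetch exceptions listed on this slide can be sketched as a small decision function. This is only an illustration under stated assumptions: the function name, the `kind` strings, and measuring target distance in instructions are hypothetical, and dynamic loop unrolling is not modeled.

```python
def redirect_fetch_to_target(kind, target_distance, window_size, strongly_taken):
    """Return True if fetch should continue from the branch target.

    kind: "jump" (unconditional), "cond" (conditional branch) or "other"
          (hypothetical labels, not from the slides).
    target_distance: distance from the branch to its target, in instructions
          (assumed unit).
    window_size: number of instructions the execution window holds.
    strongly_taken: True if the predictor strongly predicts taken.
    """
    if kind == "jump":
        # Unconditional jump paths are always followed.
        return True
    if kind == "cond":
        # A "far" target lies more than 2/3 of the window size away.
        is_far = 3 * target_distance > 2 * window_size
        return is_far and strongly_taken
    # Everything else is fetched in static order.
    return False
```

With a hypothetical 96-instruction window, a strongly-taken branch 70 instructions from its target redirects fetch, while one 60 instructions away does not.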
8 [Figure: Levo block diagram showing the Memory Window, the I-Cache, and the n x m Time-ordered Execution Window]
9 Active Stations
- More intelligent version of Tomasulo reservation stations
- Each AS holds
  - A single instruction
  - Instruction operands
  - A time tag denoting its logical position in the execution window
- Each AS shares a processing element with a number of other ASs (as defined by the size of a sharing group)
10 Active Stations
- Communicate with other active stations in order to
  - Snoop for the latest operand values
  - Forward the results to other active stations
  - Request a value from other active stations
- Re-execute their instructions with new operand values
- Handle control flow changes through runtime predication
11 Time Tags
- Enforce the nominal sequential order of the instructions executed
- Accompany all in-flight register values, memory values and predicate values
- Have two parts
  - Column tag is decremented by 1 whenever the left-most column is loaded
  - Row tag does not change
[Figure: time tag with fields labeled Row and Column]
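A minimal sketch of the two-part time tag described on this slide. The class and method names are hypothetical, and it assumes that a smaller column tag means earlier in program order and that rows within a column are ordered top to bottom.

```python
from dataclasses import dataclass


@dataclass(order=True, frozen=True)
class TimeTag:
    # Field order matters: comparison is by column first, then row (assumed).
    column: int
    row: int

    def shifted(self):
        """Tag after the window shifts: column decremented by 1, row unchanged."""
        return TimeTag(self.column - 1, self.row)


# Two in-flight tags: the bottom of column 0 precedes the top of column 1.
earlier = TimeTag(column=0, row=3)
later = TimeTag(column=1, row=0)
```

Because every tag in flight is decremented together, a shift preserves the relative order of any two tags.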
12 Execution Window
[Figure: the execution window, n rows (Row 0 to n-1) by m columns (Column 0 to m-1) of active stations. A sharing group of 4 mainline ASs, AS(0,m-1) through AS(3,m-1), shares a single PE.]
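As a rough illustration of sharing groups, the sketch below maps an active station to the processing element it shares. It assumes, as in the figure, that a group is made of consecutive rows in one column and that the number of rows is a multiple of the group size; the function names and the PE numbering are hypothetical.

```python
def pe_index(row, col, n_rows, group_size=4):
    """Index of the PE shared by AS(row, col), numbering PEs column by column."""
    groups_per_column = n_rows // group_size
    return col * groups_per_column + row // group_size


def pe_count(n_rows, n_cols, group_size=4):
    """Total number of PEs for an n_rows x n_cols window."""
    return (n_rows // group_size) * n_cols
```

For example, in a hypothetical 8 x 4 window with groups of 4, AS(0,0) through AS(3,0) map to the same PE and the window needs 8 PEs in total.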
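13 Active Station Operand Snooping and Snarfing
[Figure: snoop/snarf logic in an active station. The result operand forwarding bus carries a time tag, address, value and path. These are compared (comparators labeled gt, lt and !) against the AS time tag and path and against the stored operand's time tag, address and value to decide whether to execute or re-execute.]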
14 [Table: operand snarfing example with (a) program code, (b) renaming, (c) time tags]
Instruction Number | (a) Program Code | (b) With Renaming | (c) With Time Tags | Result Time Tag (ResTT) | Last Snarfed Time Tag (LSTT) in Active Station
I1 | R4 := 1 | R4a := 1 | R4 := 1 | 1 |
I5 | R4 := 2 | R4b := 2 | R4 := 2 | 5 |
I9 | R3 := R4 | R3 := R4b | R3 := R4 | 9 | 1, then 5
- (a) Sequential execution: at end, R3 holds 2
- (b) Out-of-order (OOO) execution: I9 only snarfs the I5 result; at end, R3 holds 2
- (c) Out-of-order (OOO) execution:
  - I1 result and ResTT broadcast: R3 := 1, LSTT := 1
  - I5 result and ResTT broadcast: R3 := 2, LSTT := 5
  - At end, R3 holds 2
  - Same result if I5 broadcasts first: LSTT is set to and stays at 5, and the I1 result is not snarfed by I9
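The time-tag case in the example above (I1, I5 and I9) can be sketched as follows. This is an illustrative model, not the hardware logic: the class and field names are hypothetical, and the acceptance rule (snarf when the broadcast tag is at least the LSTT and earlier than the station's own tag) is an assumption chosen to reproduce the behavior the slide describes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActiveStation:
    time_tag: int                 # position of this instruction in program order
    src_reg: str                  # register this instruction reads
    value: Optional[int] = None   # most recently snarfed operand value
    lstt: int = -1                # last snarfed time tag; -1 means none yet

    def snoop(self, reg, value, res_tt):
        """Look at one broadcast; return True if the operand was snarfed."""
        if reg != self.src_reg:
            return False
        # Assumed rule: accept only producers that precede this station and
        # are not older than the producer already snarfed from.
        if not (self.lstt <= res_tt < self.time_tag):
            return False
        self.value = value
        self.lstt = res_tt
        return True


def run(broadcasts):
    """Feed (reg, value, res_tt) broadcasts to I9 (R3 := R4) and return it."""
    i9 = ActiveStation(time_tag=9, src_reg="R4")
    for reg, value, res_tt in broadcasts:
        i9.snoop(reg, value, res_tt)
    return i9


in_order = run([("R4", 1, 1), ("R4", 2, 5)])   # I1 broadcasts, then I5
reversed_order = run([("R4", 2, 5), ("R4", 1, 1)])   # I5 broadcasts first
```

In both orders I9 ends up holding the I5 value with LSTT 5, so R3 receives 2 either way.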
15 Scalable Microarchitecture
- Time tag size grows linearly with the total number of ASs
- No reorder buffer (typically grows O(n^2))
- No centralized architected register file
  - Register forwarding units hold the ISA-defined register state
  - Forwarding transactions maintain state
- Segmented result buses: fixed length
- Distributed L0 caching in the datapath
16 Observation About Register Lifetimes
- The MultiScalar Project demonstrated that register lifetimes are short (spanning 1-2 basic blocks, within 32 instructions)
- If we have instructions laid out in a time-ordered fashion, the probability we will have to forward in time very far is low
- As a result, we can segment our interconnection fabric, assuming that communications will only span either the current, or at most the next, segment
17 Segmented Buses (Spanning Buses)
- Use segmented buses to propagate execution results to later stations
- Adjacent segments are interconnected with Forwarding Units (one forwarding unit, per bus, per column)
- Register Forwarding/Filter Units (RFUs) hold a version of the ISA register state
- Memory Forwarding/Filter Units (MFUs) and Predicate Forwarding Units (PFUs) are also provided
- Backwarding buses are also provided
- The number of I/Os to a FU is independent of the machine size and only depends on the column height
- Segmented buses help to preserve scalability in our datapath
18 [Figure: three parallel segmented buses. Each bus arrives from the previous column and passes through a chain of forwarding units (FU, each drawn with an element labeled "M D"), with a group of four active stations (AS) attached to each segment, before a final FU passes it to the next column.]
19 Register Forwarding/Filter Units
- Capture the persistent register state
- All buses are register transaction buses
- Consolidate update transactions on input
- Updates are forwarded to the output bus request logic immediately when possible
- Requests are filtered based on time-tag value
- Updates are managed in the file store in FIFO order
[Figure: RFU block diagram. Labels: backward in time, forward in time, backwarding read buses, primary backwarding read bus, backwarding write bus, forwarding read buses, primary forwarding read bus, forwarding write bus, ISA register file per path, time-tag, and two logic blocks.]
20 Memory Forwarding/Filter Units
- Serve as an L0 cache
- All buses are memory buses (the number of which is set according to the interleave factor)
- Consolidate update transactions on input
- Updates are "forwarded" to the output bus request logic immediately when possible
- Requests are filtered based on time-tag value
- Current policy is to queue outgoing requests or responses in FIFOs until the buses are granted for use
[Figure: MFU block diagram. Labels: backward in time, forward in time, backwarding write buses, backwarding read buses, forwarding write buses, forwarding read buses, memory cache, time-tag, two FIFOs, and two logic blocks.]
21 Disjoint Path Execution
- Levo can only obtain high IPC if
  - we can provide a large window of instructions to execute
  - a large percentage of the instructions on the eventual committed control-flow path are included in the window
- To address the issues with hard-to-predict conditional control flow, we utilize disjoint path spawning in Levo
and DEE
22 Disjoint Path Execution
- To enable path spawning we provide a disjoint path (D-path) set of ASs that share a processing element with a mainline set of ASs
- D-paths are spawned in the case of hammock branches
  - The D-path is copied from the mainline path
  - The sign of the associated predicate is inverted for the D-path
  - The D-path receives lower priority for the PE than the mainline
- When a hammock branch is mispredicted, we can treat the D-path as the new mainline path, and continue execution accordingly
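The spawn and switch steps on this slide can be sketched as below. This is a simplified model under assumptions: the `Path` record, the boolean predicate, and the numeric priority (lower number meaning higher PE priority) are all hypothetical stand-ins for the hardware state.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Path:
    instructions: List[str]
    predicate: bool   # assumed: predicted outcome of the hammock branch
    priority: int     # assumed: lower number = higher priority for the PE


def spawn_dpath(mainline):
    """Copy the mainline, invert its predicate, and give it lower priority."""
    return Path(list(mainline.instructions),
                not mainline.predicate,
                mainline.priority + 1)


def resolve(mainline, dpath, actual_outcome):
    """Return the path that continues as mainline once the branch resolves."""
    if actual_outcome == mainline.predicate:
        return mainline   # prediction was right; the D-path is discarded
    # Misprediction: the D-path takes over as the new mainline.
    return Path(dpath.instructions, dpath.predicate, mainline.priority)


ml = Path(["ADD R2,R2,4", "SW 30(R4),R2"], predicate=True, priority=0)
dp = spawn_dpath(ml)
after_mispredict = resolve(ml, dp, actual_outcome=False)
```

After a misprediction the surviving path carries the inverted predicate but inherits mainline priority.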
23
Label  Addr  Instruction         History
START  A100  LW    R2,20(R4)
       A104  SUB   R2,R2,1
       A108  BEQZ  R2,TAR1       Weakly T
       A10C  ADD   R2,R2,4
       A110  SW    30(R4),R2
TAR1   A114  LW    R2,30(R4)
       A118  SUB   R2,R2,8
       A11C  BEQZ  R2,TAR2       Weakly NT
       A120  SW    20(R4),R2
TAR2   A124  ADD   R2,R2,10
       A128  SUB   R1,R1,1
       A12C  BNEQZ R1,START      Strongly T
       A130  SW    40(R4),R2
[Figure: the same code split into blocks A100-A108, A10C-A110, A114-A11C, A120, A124-A12C and A130, arranged into a mainline path and a disjoint path.]
24 Modeling and Results
- Present work utilizes
  - MIPS-1/MIPS-2 machine
  - SGI compiler
  - SPECint 95 (compress, go and ijpeg) and 2000 (bzip2, crafty, gcc, gzip, mcf, parser and vortex) benchmarks
- 3 levels of modeling
  - Trace-driven model (FastLevo): results in this presentation
  - Detailed cycle-accurate model (LevoSim): still under development
  - Synthesizable VHDL hardware model (HDLevo): validation
- Design space exploration
  - Impact of D-paths
  - Real vs. ideal memory
  - Range of bus latency issues
performance
25 Modeling parameters
Parameter | Value
L1 I/D geometry | 64KB, 2-way set associative, 32B line
L2 unified I/D geometry | 2MB, direct mapped, 32B line
Main memory geometry | infinite, 4-way interleaved
L0, L1, L2, memory hit latencies | 1, 1, 10, 100 cycles (does not include bus latency)
Branch predictor | 2-level, 1024-entry BHT, 4096-entry GPHT, 2-bit, 16-entry RAS, one per E-window row
Data value predictor | 4096-entry stride predictor, one per E-window row
PE latencies | same as MIPS R4000
26 Modeling parameters
Parameter | Value
L0 geometry | 32-32b, fully associative, 32b line
Spanning bus delay | 1 cycle
FU/BU delay (no contention) | 1 cycle
Buses per RFU and per MFU | 2 buses
Buses per PFU | 1 bus
Columns per D-path | 1 column
ML-D switch time | 1 cycle plus time to broadcast new D-path values as ML
27 IPC obtained with Levo
28 Speedup obtained using D-paths versus single-path execution (harmonic means)
29 IPC of Levo compared to modeling 100% L1 I/D hits (harmonic means)
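The results slides report harmonic means across benchmarks. As a reminder of how that summary is computed, here is a minimal sketch; the input values are hypothetical and are not results from this talk.

```python
def harmonic_mean(values):
    """Harmonic mean of positive rates such as per-benchmark IPC."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("need at least one value, all positive")
    return len(values) / sum(1.0 / v for v in values)


# Hypothetical per-benchmark IPC values, for illustration only.
example_ipcs = [2.0, 6.0]
example_mean = harmonic_mean(example_ipcs)
```

The harmonic mean is pulled toward the slowest benchmark, which is why it is the usual choice for averaging rates such as IPC.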
30 Summary of additional experiments
- Varying the L1-D/L2 hit time (versus 1 cycle)
  - Increased L1-D HT to 2/4/8 cycles: 10/22/43% IPC loss
  - Increased L2 HT to 2/4/8/16 cycles: 0.8/2.3/4.7/8.9% IPC loss
- Varying the number of buses per FU
  - Decreased to 1 bus/FU: 14% IPC loss
  - Increased to 4 buses/FU: 3% IPC gain
- Removal of stride predictor: 0.8% IPC loss
- Varying the number of columns per D-path
  - Increased to 2 cols/D-path: 8% IPC loss
- Use of D-paths: 45% IPC gain
- Varying the number of branch prediction tables
  - Decreased from 1 per row to a single table of the same total size: 0.4% IPC loss
31 Comments and Future Directions
- I-fetch is the main barrier to further gains in IPC
- The use of a detailed VHDL model of critical components in Levo has allowed us to design scalable resources
- A number of novel microarchitectural features are present in a single design
- Future challenges in Levo include
  - Improved I-fetch (EV8, trace cache, dynamic D-paths)
  - Finish design of an ARB-like memory
  - Consider compiler support to aid in-order issue and D-path execution
  - Consider multithreaded extensions to support coarse-grained multithreading
32 To learn more about Levo, visit
- http://www.ece.neu.edu/info/architecture/research/Levo.html
- Also see our paper at Euro-Par 2002.