Title: Exposing ILP in the Presence of Loops
1. Exposing ILP in the Presence of Loops
- Marcos Rubén de Alba Rosano
- David Kaeli
- Department of Electrical and Computer Engineering
- Northeastern University
2. Exposing ILP in the Presence of Loops
- To enable wide-issue microarchitectures to obtain high throughput rates, a large window of instructions must be available
- Programs spend 90% of their execution time in 10% of their code (in loops) [Hennessy & Patterson]
- Current compilers can unroll less than 50% of all loops in integer codes [de Alba 2000]
- Present microarchitectures are not designed to execute loops efficiently [Vajapeyam 1999; Rosner 2001]
- We may need to consider developing customized loop prediction hardware [de Alba 2001]
3. Exposing ILP in the Presence of Loops
- We need to understand whether entire loop executions can be predicted
  - This could expose large amounts of instruction-level parallelism
- If patterns exist in loop execution, we need to build a dynamic profiling system that can capture these patterns
- If we are able to detect patterns through profiling, we can guide aggressive instruction fetch/issue, effectively unrolling multiple iterations of the loop
4. Outline
- Introduction
- Related work
- Loop terminology
- Loop-based workload characterization
- Loop caching and unrolling hardware
- Experimental methodology
- Results
- Hybrid fetch engine
- Summary and future work
5. Introduction
- Loops possess high temporal locality and present a good opportunity for caching
  - Reduces pressure on other instruction caching structures
  - Provides for aggressive runtime unrolling
- Applications that possess large loops tend to be good targets for extracting ILP
  - Common in scientific codes (e.g., SPECfp2000)
  - Uncommon in integer and multimedia codes (e.g., SPECint2000 and MediaBench)
6. Introduction
- We propose a path-based, multi-level, hardware-based loop caching scheme that can:
  - identify the presence of a loop in the execution stream
  - profile loop execution, building loop execution histories
  - cache entire unrolled loop execution traces in a dedicated loop cache
  - utilize loop history to predict future loop visits at runtime
- We combine this loop prediction mechanism with other aggressive instruction fetch mechanisms to improve instruction delivery
7. Loop cache elements
- Loop cache stack: profiles loops that are presently live; uses a stack structure to accommodate nesting
- Loop table: a first-level table used to identify loops and index into the loop cache
- Loop cache: a second-level table used to hold unrolled loop bodies
A sketch of these three structures follows.
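To make the division of labor concrete, here is a minimal Python sketch of the three structures, assuming the sizes given later on the parameters slide; all class and field names are illustrative, not taken from the thesis.

```python
# Minimal sketch of the three loop-cache structures; class and field
# names are illustrative, not the thesis's actual hardware design.
from dataclasses import dataclass, field

@dataclass
class LiveLoop:
    head_pc: int                      # address of the loop head
    tail_pc: int                      # address of the backward branch
    iterations: int = 0               # iterations seen so far this visit
    paths: list = field(default_factory=list)  # path-in-iteration history

class LoopCacheStack:
    """Profiles presently live loops; nesting depth = stack depth."""
    def __init__(self, depth: int = 8):  # 8 entries, per the parameters slide
        self.depth = depth
        self.entries: list[LiveLoop] = []

    def push(self, head_pc: int, tail_pc: int) -> None:
        if len(self.entries) < self.depth:
            self.entries.append(LiveLoop(head_pc, tail_pc))

    def pop(self) -> LiveLoop | None:
        return self.entries.pop() if self.entries else None

# First-level table: maps a loop identifier (here simply the loop head
# address) to its recorded history and an index into the loop cache.
loop_table: dict[int, LiveLoop] = {}

# Second-level table: maps that index to an unrolled loop body
# (a list of instructions forming the dynamic trace).
loop_cache: dict[int, list[str]] = {}
```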
8. Outline
- Introduction
- Related work
- Loop terminology
- Loop-based workload characterization
- Loop caching and unrolling hardware
- Experimental methodology
- Results
- Hybrid fetch engine
- Summary and future work
9. Related work
- Software-based approaches:
  - loop unrolling [Ellis 1986]
  - software pipelining [Lam 1988]
  - loop quantization [Nicolau 1988]
  - static loop characteristics [Davidson 1995]
- Limitations: a compiler cannot unroll a loop if
  - the loop body is too large
  - the loop induction variable is not an integer
  - the loop induction variable is not incremented/decremented by 1
  - the increment/decrement value cannot be deduced at compile time
  - the loop exit condition is not based on the value of a constant
  - there is conditional control flow in the loop body
- More than 50% of the loops present in our workloads could not be unrolled by the compiler
10. Related work
- Hardware-based approaches:
  - loop buffers [Thornton 1964; Anderson 1967; Hintz 1972]
  - multiple-block-ahead loop prediction [Seznec 1996]
  - trace cache [Rotenberg 1996]
  - loop detection [Kobayashi 1984; Tubella 1998; Gonzalez 1998]
  - dynamic vectorization [Vajapeyam 1999]
  - loop termination prediction [Sherwood 2000]
  - loop caches [Uh 1999 (Texas Instruments); Vahid 2002 (Motorola)]
  - hybrid approaches [Holler 1997; Hinton 2001]
- Limitations:
  - these techniques can effectively cache only well-structured loops
  - conditional control flow present in loop bodies can limit the number of loops that can be cached
  - the conditional control flow found in loops generates complex, yet predictable, patterns
11. Outline
- Introduction
- Related work
- Loop terminology
- Loop-based workload characterization
- Loop caching and unrolling hardware
- Experimental methodology
- Results
- Hybrid fetch engine
- Summary and future work
12. Loop terminology
Static terms:
[Figure: the instruction stream i0 b1 i1 b2 i3 b3 i4 i5 i6 i7 b4 i8, annotated with its loop head, loop body, and loop tail]
- loop head
- loop body
- loop tail
13. Loop terminology
Dynamic terms, illustrated on the same instruction stream i0 b1 i1 b2 i3 b3 i4 i5 i6 i7 b4 i8:
- path-to-loop: the branch path followed to reach the loop
- path-in-iteration A (b3 not taken) and path-in-iteration B (b3 taken): the branch path followed within one iteration
- loop visit: entering a loop body
- iteration: returning to the loop head before exiting the loop
- path-in-loop: the complete set of path-in-iterations for an entire loop visit
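One concrete (hypothetical) way to represent these dynamic terms is as branch-outcome records, the encoding the sketches later in this deck assume:

```python
# Hypothetical encoding of the dynamic terms as branch-outcome records;
# 'T' = taken, 'N' = not taken. Names are illustrative.

# The two possible paths through one iteration of the example loop,
# distinguished by the outcome of branch b3:
path_in_iteration_A = "N"     # b3 not taken
path_in_iteration_B = "T"     # b3 taken

# One loop visit = entering the loop body; each return to the loop head
# before exiting is one iteration. A four-iteration visit thus yields:
path_in_loop = [path_in_iteration_A, path_in_iteration_A,
                path_in_iteration_B, path_in_iteration_A]

assert len(path_in_loop) == 4          # iterations in this loop visit
```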
14. Importance of path-to-loop
[Figure: control-flow graph over basic blocks b1-b12 with taken (T) / not-taken (NT) edges; the branch in b9 decides whether the loop is entered]
- Enter loop: path b1, b2, b6, b9 with outcomes NT, NT, T, NT
- Do not enter loop: path b1, b3, b5, b9 with outcomes T, NT, NT, T
15. Importance of path-to-loop
[Same control-flow graph as the previous slide]
- Enter loop: path b1, b2, b6, b9 with outcomes NT, NT, T, NT
- Do not enter loop: path b1, b3, b5, b9 with outcomes T, NT, NT, T
- For loop caching to be successful, we must be able to predict b9 very accurately.
16. Static view of a loop
[Figure: assembly listing of a loop (instructions i1-i19, conditional branches ba, bb, bc with targets A, B, and Top) next to the dynamic view of the loop, i.e., all possible paths during a single loop iteration]
17. Static view of a loop
[Same figure as the previous slide]
For loop caching to be successful, we must be able to predict the path followed on each iteration.
18. Outline
- Introduction
- Related work
- Loop terminology
- Loop-based workload characterization
- Loop caching and unrolling hardware
- Experimental methodology
- Results
- Hybrid fetch engine
- Summary and future work
19. Loop characterization
- It is important to characterize loop behavior in order to guide the design tradeoffs associated with the implementation of the loop cache
- Loops possess a range of characteristics that affect their predictability:
  - number of loops
  - number of loop visits
  - number of iterations per loop visit
  - dynamic loop body size
  - number of conditional branches found in an iteration
  - many more in the thesis
20. Application of Characterization Study
- The number of loops found in our applications ranged from 21 (g721) to 1266 (gcc), with most applications containing fewer than 100 loops
  - Guides the choice of the number of entries in the first-level loop table
- In 9 of the 12 benchmarks studied, more than 40% of the loops were visited only 2-64 times
  - Guides the design of the loop cache replacement algorithm
- For more than 80% of all loop visits, the number of iterations executed per visit was less than 17
  - Guides the design of the hardware unrolling logic
21. Application of Characterization Study
- The weighted average number of instructions executed per iteration ranged from 15.87 to 99.83
  - Guides the selection of the loop cache line length
- In 10 of the 12 benchmarks studied, the largest loop body size was less than 8192 instructions
  - Guides the selection of the loop cache size
- On average, 85% of loop iterations contained 3 or fewer conditional branches in their loop bodies
  - Guides the selection of the path-in-iteration pattern history register width
- The maximum level of nesting was 2-7 loops
  - Guides the selection of the loop cache stack depth
22. Outline
- Introduction
- Related work
- Loop terminology
- Loop-based workload characterization
- Loop caching and unrolling hardware
- Experimental methodology
- Results
- Hybrid fetch engine
- Summary and future work
23. Dynamic loop profiling
- To be able to accurately predict loop execution, we need to:
  - dynamically identify loops
  - predict loop entry
  - select loops to cache
  - select loops to unroll
- We utilize a loop cache stack mechanism that builds loop history before entering a loop into the loop cache (a sketch of this profiling step follows)
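A minimal Python sketch of that profiling step, assuming loops are recognized by taken backward branches; the structures, field names, and counting convention are illustrative, not the actual hardware.

```python
# Sketch of dynamic loop identification and history building: a taken
# backward branch marks a loop, and a stack accommodates nesting.

def profile(branches, stack_depth=8):
    """branches: iterable of (pc, target, taken) for executed branches.
    Returns a history table {loop head pc: [iterations per visit]}."""
    stack, table = [], {}                 # loop cache stack / loop table
    for pc, target, taken in branches:
        top = stack[-1] if stack else None
        if taken and target <= pc:        # backward taken branch = loop tail
            if top and top["head"] == target:
                top["iters"] += 1         # another iteration of the live loop
            elif len(stack) < stack_depth:
                stack.append({"head": target, "tail": pc, "iters": 1})
        elif top and pc == top["tail"]:   # tail branch fell through: exit
            done = stack.pop()
            table.setdefault(done["head"], []).append(done["iters"])
    return table

# A loop at 0x60 whose tail branch at 0x90 is taken twice, then falls
# through: one visit, whose history is recorded in the table.
trace = [(0x90, 0x60, True), (0x90, 0x60, True), (0x90, 0x60, False)]
print(profile(trace))                     # {96: [2]}
```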
24. Building Loop Cache Histories
25. Dynamic loop caching and unrolling
- Loop unrolling hardware:
  - captures loop visits in the loop prediction table
  - interrogates the loop predictor to obtain information for future loops
  - utilizes loop predictor information to dynamically replicate loop bodies in the loop cache
26. Loop predictor
[Block diagram: the loop predictor supplies loop identification, predicted iterations, and paths-in-iteration to the loop unrolling control; the unrolling control fills the loop cache from the I-cache, and the unrolled loop feeds a queue of speculated instructions that flows through dispatch/decode to execution]
27. Dynamic loop caching and unrolling
- Loop unrolling hardware:
  - uses path-to-loop information to predict a future loop visit
  - extracts the number of predicted iterations and uses it as the initial unroll factor (unless the loop cache size would be exceeded; see the sketch after this list)
  - unrolls the loop in the loop cache as long as the number of predicted iterations is larger than 1
  - uses the paths-in-iteration information to create the correct trace of instructions on every loop iteration
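A minimal sketch of the unroll-factor rule just described, assuming the 8KB loop cache from the parameters slide; the function name and byte-based sizing are illustrative.

```python
# Sketch of the unroll decision: start from the predicted iteration
# count and clamp it so the unrolled body fits in the loop cache.

def unroll_factor(predicted_iters, body_bytes, cache_bytes=8 * 1024):
    if predicted_iters <= 1:
        return 0                          # not worth unrolling
    copies_that_fit = cache_bytes // body_bytes
    return min(predicted_iters, copies_that_fit)

# A 256-byte loop body predicted to run 100 iterations is clamped to
# the 32 copies that fit in an 8KB loop cache:
print(unroll_factor(100, 256))            # -> 32
```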
28Loop prediction table
path-to-loop
hash function
bn-1
bn-2
bn-3
b0
...
Y
N
tag match last branch?
N
Y
preditns gt 1 ?
The information is used used to interrogate the
loop cache hardware.
There is no information for this loop, proceed
with normal fetching.
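A Python sketch of this lookup, assuming the 512-entry, 4-way table from the parameters slide (so 128 sets); the hash, tag choice, and entry fields are illustrative.

```python
# Sketch of the loop prediction table lookup: hash the path-to-loop
# history, check the tag against the last branch, and only use entries
# predicting more than one iteration.

def lookup(path_to_loop, pred_table, num_sets=512 // 4):
    """path_to_loop: tuple of recent branch PCs (b_{n-1}, ..., b_0)."""
    index = hash(path_to_loop) % num_sets      # hash function over the path
    entry = pred_table.get(index)
    if entry is None or entry["tag"] != path_to_loop[-1]:
        return None        # no information: proceed with normal fetching
    if entry["pred_iters"] <= 1:
        return None        # not worth engaging the loop cache
    return entry           # used to interrogate the loop cache hardware

# A loop reached via branches at 0x40, 0x48, 0x54 (last branch 0x54),
# predicted to run 4 iterations with known paths-in-iteration:
path = (0x40, 0x48, 0x54)
table = {hash(path) % 128: {"tag": 0x54, "pred_iters": 4,
                            "paths": ["N", "N", "T", "N"]}}
print(lookup(path, table))
```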
29. Loop cache lookup table
[Diagram: the loop start address is split into a tag and an index to look up the loop cache; on a hit, the captured loop is delivered in loop cache fetch mode, and on a miss, the hardware either builds a dynamic trace or falls back to basic fetch mode]
30. Loop cache
[Figure: an I-cache loop body at addresses 60-94 (instructions i1-i9, conditional branches b0-b3, and a backward branch b to 60); a loop table entry (60, 90, 4, 011, 011, 111, 1001) recording the loop start address, backward-branch address, iteration count, and per-iteration branch-outcome patterns; and the resulting per-iteration traces stored in the loop cache, e.g., i1 i2 b0 i3 i4 i5 b1 b2 i8 b versus i1 i2 b0 i4 i5 b1 i6 b2 i7 b3 i9]
- Loop cache control: if the loop is not in the loop cache, request instructions from the I-cache and build dynamic traces according to the information in the loop table. (A sketch of this trace building follows.)
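A Python sketch of that trace building, assuming a simplified body encoding in which each entry records whether it is a conditional branch and how many entries a taken branch skips; the encoding is illustrative.

```python
# Sketch of dynamic trace building: replicate the loop body once per
# predicted iteration, selecting instructions according to that
# iteration's branch-outcome pattern.

def build_trace(body, paths):
    """body: list of (insn, is_branch, taken_skip) entries, where
    taken_skip = body entries a taken branch jumps over.
    paths: one outcome string per iteration ('1' = taken)."""
    trace = []
    for outcomes in paths:                # one pass per unrolled iteration
        i, k = 0, 0
        while i < len(body):
            insn, is_branch, skip = body[i]
            trace.append(insn)
            if is_branch:
                taken = outcomes[k] == "1"
                k += 1
                i += 1 + (skip if taken else 0)   # skip the not-taken path
            else:
                i += 1
    return trace

# Fragment of the slide's loop: i3 is skipped whenever b0 is taken.
body = [("i1", False, 0), ("i2", False, 0), ("b0", True, 1),
        ("i3", False, 0), ("i4", False, 0)]
print(build_trace(body, ["0", "1"]))
# ['i1', 'i2', 'b0', 'i3', 'i4', 'i1', 'i2', 'b0', 'i4']
```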
31. Outline
- Introduction
- Related work
- Loop terminology
- Loop-based workload characterization
- Loop caching and unrolling hardware
- Experimental methodology
- Results
- Hybrid fetch engine
- Summary and future work
32. Experimental methodology
- Modified the SimpleScalar 3.0b Alpha EV6 pipeline model to include the following features:
  - loop head/tail detection
  - loop visit prediction
  - loop cache fill/replacement management
  - loop stack and first-level table operations
  - trace cache model
  - hybrid fetch engine operation
34. Baseline Architecture Parameters
- Decode width: 16
- Commit width: 16
- Instruction fetch queue: 16 entries
- Integer functional units: 8 (1-cycle latency)
- Integer multipliers: 2 (7-cycle latency)
- FP adders: 4 (4-cycle latency)
- FP multipliers: 2 (4-cycle latency)
- FP divide units: 2 (12-cycle latency)
- FP square-root units: 2 (23-cycle latency)
- Branch prediction: bimodal 4096-entry, 2-level adaptive; 8-entry RAS
- L1 D-cache: 16KB, 4-way set-associative
- L1 I-cache: 16KB, 4-way set-associative
- L1 latency: 2 cycles
- L2 unified cache: 256KB, 4-way set-associative
- L2 latency: 10 cycles
- Memory latency: 250 cycles
- TLB: 128-entry, 4-way set-associative, 4KB pages, 30-cycle miss latency
35. Loop Cache Architecture Parameters
- Loop table: 512 entries, 4-way, 1-cycle hit latency, 3-cycle miss penalty; path length of up to 16 branches; up to 16 iterations captured
- Loop stack: 8 entries, 1-cycle access
- Loop cache: 8KB, 1-cycle hit latency
36. Outline
- Introduction
- Related work
- Loop terminology
- Loop-based workload characterization
- Loop caching and unrolling hardware
- Experimental methodology
- Results
- Hybrid fetch engine
- Summary and future work
39. Outline
- Introduction
- Related work
- Loop terminology
- Loop-based workload characterization
- Loop caching and unrolling hardware
- Experimental methodology
- Results
- Hybrid fetch engine
- Summary and future work
40. Hybrid fetch engine
- Trace caches have been shown to greatly improve fetch efficiency [Rosner 2001]
- Trace caches have not been designed to handle loops effectively:
  - replication of loop bodies in the trace cache space
  - handling of changing conditional control flow in loop bodies
- We explore how to combine a loop cache with a trace cache to filter out problematic loops
41. Hybrid fetch engine strategy
- Capture all non-loop instructions with a trace cache
- Capture easy-to-predict loops with a trace cache
- Capture complex loops with a loop cache
- Provide new fetch logic to steer instruction fetching to the appropriate source
42. Hybrid fetch engine strategy
- Trace cache misses occur when:
  - the branch flags mismatch the multiple-branch predictor
  - branches in the trace are mispredicted
  - the trace is not found in the trace cache
- Trigger loop cache fetching when any of these happen (see the sketch below)
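A minimal sketch of the resulting steering decision; the condition and source names are illustrative.

```python
# Sketch of the hybrid fetch steering: use the trace cache by default,
# fall back to the loop cache when the trace cache misses on a captured
# loop, otherwise fetch from the L1 I-cache.

def select_fetch_source(trace_found, flags_match,
                        branches_mispredicted, loop_cache_hit):
    tc_miss = (not trace_found) or (not flags_match) or branches_mispredicted
    if not tc_miss:
        return "trace_cache"     # trace cache supplies the instructions
    if loop_cache_hit:
        return "loop_cache"      # problematic loop captured by loop cache
    return "l1_icache"           # basic fetch mode

# A trace whose branch flags disagree with the multiple-branch
# predictor triggers loop cache fetching if the loop is captured:
print(select_fetch_source(True, False, False, True))   # -> loop_cache
```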
43. Hybrid fetching scheme
[Block diagram: the trace cache, the loop cache, and the L1 cache each feed the fetch queue through a tri-state bus arbiter and a demux; the fetch mode (trace, loop cache, or basic) enables exactly one source, which supplies an α, β, or φ fraction of the IFQ width in instructions per cycle]
47. TC vs. TC+LC
[Results chart comparing the trace cache alone (TC) with the combined trace cache and loop cache (TC+LC)]
48. Publications on Loop Prediction
- M. R. de Alba, D. R. Kaeli, and J. Gonzalez, "Improving the Effectiveness of Trace Caching Using a Loop Cache," NUCAR Technical Report.
- M. R. de Alba and D. R. Kaeli, "Characterization and Evaluation of Hardware Loop Unrolling," 1st Boston Area Architecture Conference, January 2003.
- M. R. de Alba and D. R. Kaeli, "Path-based Hardware Loop Prediction," Proc. of the 4th International Conference on Control, Virtual Instrumentation and Digital Systems, August 2002.
- A. Uht, D. Morano, A. Khalafi, M. de Alba, and D. R. Kaeli, "Realizing High IPC Using Time-Tagged Resource-Flow Computing," Proc. of Euro-Par, August 2002.
- M. R. de Alba and D. R. Kaeli, "Runtime Predictability of Loops," Proc. of the 4th Annual IEEE International Workshop on Workload Characterization, December 2001.
- M. R. de Alba, D. R. Kaeli, and E. S. Kim, "Dynamic Analysis of Loops," Proc. of the 3rd International Conference on Control, Virtual Instrumentation and Digital Systems, August 2001.
49. Conclusions
- Branch correlation helps to detect loops in advance
- Loops have patterns of behavior (iterations, dynamic body size, in-loop paths)
- Across the studied benchmarks, on average more than 50% of loops contain branches
- In-loop branches can be predicted and used to guide unrolling
50. Conclusions
- Dynamic instruction traces are built using loop profiling and prediction
- Multiple loops can be simultaneously unrolled
- By combining a trace cache and a loop cache, more useful and less redundant instruction streams are built
- Performance benefits are gained with a hybrid fetch engine mechanism