1
Exposing ILP in the Presence of Loops
  • Marcos Rubén de Alba Rosano
  • David Kaeli
  • Department of Electrical and Computer Engineering
  • Northeastern University

2
Exposing ILP in the Presence of Loops
  • To enable wide-issue microarchitectures to obtain
    high throughput rates, a large window of
    instructions must be available
  • Programs spend 90% of their execution time in 10%
    of their code (largely in loops) [HP]
  • Current compilers can unroll fewer than 50% of all
    loops in integer codes [de Alba 2000]
  • Present microarchitectures are not designed to
    execute loops efficiently [Vajapeyam 1999;
    Rosner 2001]
  • We may need to consider developing customized
    loop prediction hardware [de Alba 2001]

3
Exposing ILP in the Presence of Loops
  • We need to understand whether entire loop
    executions can be predicted
  • This could expose large amounts of
    instruction-level parallelism
  • If patterns exist in loop execution, we need to
    build a dynamic profiling system that can capture
    these patterns
  • If we are able to detect patterns through
    profiling, we can guide aggressive instruction
    fetch/issue, effectively unrolling multiple
    iterations of the loop

4
Outline
  • Introduction
  • Related work
  • Loop terminology
  • Loop-based workload characterization
  • Loop caching and unrolling hardware
  • Experimental methodology
  • Results
  • Hybrid fetch engine
  • Summary and future work

5
Introduction
  • Loops possess high temporal locality and present
    a good opportunity for caching
  • Reduce pressure on other instruction caching
    structures
  • Provide for aggressive runtime unrolling
  • Applications that possess large loops tend to be
    good targets to extract ILP
  • Common in scientific codes (e.g., SPECfp2000)
  • Uncommon in integer and multimedia codes (e.g.,
    SPECint2000 and Mediabench)

6
Introduction
  • We propose a path-based, multi-level hardware
    loop caching scheme that can:
  • identify the presence of a loop in the execution
    stream
  • profile loop execution, building loop execution
    histories
  • cache entire unrolled loop execution traces in a
    dedicated loop cache
  • utilize loop history to predict future loop
    visits at runtime
  • Combine a loop prediction mechanism with other
    aggressive instruction fetch mechanisms to
    improve instruction delivery

7
Loop cache elements
  • Loop cache stack: profiles loops that are
    presently live; uses a stack structure to
    accommodate nesting
  • Loop table: a first-level table used to identify
    loops and index into the loop cache
  • Loop cache: a second-level table used to hold
    unrolled loop bodies (a software sketch of these
    three structures follows below)

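As a rough illustration of how these three structures fit together, here is a minimal software sketch; all field and method names are illustrative assumptions, not details taken from the actual hardware design.

    # Sketch of the three loop-caching structures (illustrative fields only).

    class LoopTableEntry:
        """First-level entry: identifies a loop and indexes into the loop cache."""
        def __init__(self, head_addr, tail_addr, loop_cache_index):
            self.head_addr = head_addr                # address of the loop head
            self.tail_addr = tail_addr                # address of the backward branch
            self.loop_cache_index = loop_cache_index  # where the unrolled body lives

    class LoopCache:
        """Second-level table holding unrolled loop bodies (dynamic traces)."""
        def __init__(self):
            self.lines = {}                           # index -> list of instructions

        def store(self, index, unrolled_trace):
            self.lines[index] = unrolled_trace

        def fetch(self, index):
            return self.lines.get(index)

    class LoopStack:
        """Profiles presently live loops; the stack accommodates nesting."""
        def __init__(self, depth=8):
            self.depth = depth
            self.entries = []                         # innermost live loop on top

        def push(self, head_addr):
            if len(self.entries) < self.depth:
                self.entries.append({"head": head_addr, "iterations": 0})

        def record_iteration(self):
            if self.entries:
                self.entries[-1]["iterations"] += 1

        def pop(self):
            return self.entries.pop() if self.entries else None
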
8
Outline
  • Introduction
  • Related work
  • Loop terminology
  • Loop-based workload characterization
  • Loop caching and unrolling hardware
  • Experimental methodology
  • Results
  • Hybrid fetch engine
  • Summary and future work

9
Related work
  • Software-based
  • loop unrolling [Ellis 1986]
  • software pipelining [Lam 1988]
  • loop quantization [Nicolau 1988]
  • static loop characteristics [Davidson 1995]
  • Limitations
  • A compiler cannot unroll a loop if
  • the loop body is too large
  • the loop induction variable is not an integer
  • the loop induction variable is not incremented or
    decremented by 1
  • the increment/decrement value cannot be deduced
    at compile time
  • the loop exit condition is not based on the value
    of a constant
  • there is conditional control flow in the loop
    body
  • More than 50% of the loops present in our
    workloads could not be unrolled by the compiler

10
Related work
  • Hardware-based
  • loop buffer [Thornton 1964; Anderson 1967;
    Hintz 1972]
  • multiple-block-ahead loop prediction [Seznec 1996]
  • trace cache [Rotenberg 1996]
  • loop detection [Kobayashi 1984; Tubella 1998;
    Gonzalez 1998]
  • dynamic vectorization [Vajapeyam 1999]
  • loop termination [Sherwood 2000]
  • loop caches [Texas Instruments: Uh 1999;
    Motorola: Vahid 2002]
  • hybrid approaches [Holler 1997; Hinton 2001]
  • Limitations
  • These techniques can effectively cache
    well-structured loops
  • Conditional control flow present in loop bodies
    can limit the number of loops that can be cached
  • The conditional control flow found in loops
    generates complex, yet predictable patterns

11
Outline
  • Introduction
  • Related work
  • Loop terminology
  • Loop-based workload characterization
  • Loop caching and unrolling hardware
  • Experimental methodology
  • Results
  • Hybrid fetch engine
  • Summary and future work

12
Loop terminology
Static terms
[Figure: static instruction sequence i0 b1 i1 b2 i3 b3 i4 i5 i6 i7 b4 i8, with the loop head, loop body, and loop tail labeled]
13
Loop terminology
Dynamic terms
[Figure: the same instruction sequence, annotated with the path to the loop and two paths-in-iteration: A (b3 not taken) and B (b3 taken)]
  • loop visit: one entry into a loop body
  • iteration: returning to the loop head before
    exiting the loop
  • path-in-loop: the complete set of
    path-in-iterations for an entire loop visit
14
Importance of path-to-loop
[Figure: control-flow graph with branches b1-b12 and taken (T) / not-taken (NT) edges; branch b9 decides loop entry]
Enter loop: path b1, b2, b6, b9 with outcomes NT, NT, T, NT
Do not enter loop: path b1, b3, b5, b9 with outcomes T, NT, NT, T
15
Importance of path-to-loop
For loop caching to be successful, we must be able
to predict b9 very accurately.
[Figure: the same control-flow graph as the previous slide, highlighting branch b9]
Enter loop: path b1, b2, b6, b9 with outcomes NT, NT, T, NT
Do not enter loop: path b1, b3, b5, b9 with outcomes T, NT, NT, T
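One way to read these two slides: the outcome of b9 correlates strongly with the path taken to reach it, so a predictor keyed on recent branch outcomes can separate the two paths. Below is a minimal sketch of that idea using standard path-history correlation; the counter width and history length are illustrative, not the actual predictor.

    # Sketch: predicting the loop-entry branch (b9) from the path leading to it.
    from collections import defaultdict

    class PathBasedPredictor:
        def __init__(self, history_bits=4):
            self.history_bits = history_bits
            self.history = 0                     # shift register of recent outcomes
            self.counters = defaultdict(int)     # (branch, history) -> 2-bit counter

        def predict(self, branch):
            return self.counters[(branch, self.history)] >= 0  # True = taken

        def update(self, branch, taken):
            key = (branch, self.history)
            delta = 1 if taken else -1
            self.counters[key] = max(-2, min(1, self.counters[key] + delta))
            mask = (1 << self.history_bits) - 1
            self.history = ((self.history << 1) | int(taken)) & mask

    p = PathBasedPredictor()
    # Path b1, b2, b6 = NT, NT, T precedes a not-taken b9 (loop entered):
    for branch, taken in [("b1", False), ("b2", False), ("b6", True), ("b9", False)]:
        p.update(branch, taken)
    # Path b1, b3, b5 = T, NT, NT precedes a taken b9 (loop skipped):
    for branch, taken in [("b1", True), ("b3", False), ("b5", False), ("b9", True)]:
        p.update(branch, taken)
    # After training, the history seen at b9 disambiguates the two outcomes.
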
16
Static view of loop
[Figure: assembly-style listing of a loop (instructions i1-i19; labels Top, A, B; conditional branches ba, bb, bc testing zero), shown alongside the dynamic view of the loop: all possible paths during a single loop iteration]
17
Static view of loop
[Figure: the same static and dynamic views as the previous slide]
For loop caching to be successful, we must be able
to predict the path followed on each iteration.
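Because the path followed in one iteration is fully determined by the outcomes of the in-body conditional branches (ba, bb, bc above), it can be recorded compactly as one bit per branch. A hypothetical encoding:

    # Sketch: a path-in-iteration as the taken/not-taken outcomes of the
    # in-body branches, one bit per branch in program order.

    def encode_path(outcomes):
        """outcomes: list of booleans, True = taken."""
        bits = 0
        for taken in outcomes:
            bits = (bits << 1) | int(taken)
        return bits

    # An iteration that takes ba, falls through bb, and takes bc -> 0b101.
    assert encode_path([True, False, True]) == 0b101
    # A path-in-loop is then the sequence of such patterns, one per iteration.
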
18
Outline
  • Introduction
  • Related work
  • Loop terminology
  • Loop-based workload characterization
  • Loop caching and unrolling hardware
  • Experimental methodology
  • Results
  • Hybrid fetch engine
  • Summary and future work

19
Loop characterization
  • It is important to characterize loop behavior in
    order to guide the design tradeoffs associated
    with the implementation of the loop cache
  • Loops possess a range of characteristics that
    affect their predictability:
  • number of loops
  • number of loop visits
  • number of iterations per loop visit
  • dynamic loop body size
  • number of conditional branches found in an
    iteration
  • many more are covered in the thesis (a sketch of
    the bookkeeping behind these metrics follows
    below)

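These metrics can be gathered with a simple trace-driven profiler. The sketch below shows the bookkeeping only; the event interface (one call per loop visit) is an assumption, not the authors' tool.

    # Sketch of the bookkeeping behind the characterization metrics above.
    from collections import defaultdict

    class LoopStats:
        def __init__(self):
            self.visits = defaultdict(int)        # loop head -> number of visits
            self.iterations = defaultdict(list)   # loop head -> iterations per visit
            self.body_sizes = defaultdict(list)   # loop head -> dynamic insts per iter
            self.branches = defaultdict(list)     # loop head -> cond. branches per iter

        def record_visit(self, head, iters, insts_per_iter, branches_per_iter):
            self.visits[head] += 1
            self.iterations[head].append(iters)
            self.body_sizes[head].extend(insts_per_iter)
            self.branches[head].extend(branches_per_iter)

        def summary(self):
            total_visits = sum(self.visits.values())
            total_iters = sum(sum(v) for v in self.iterations.values())
            return {
                "num_loops": len(self.visits),
                "total_visits": total_visits,
                "avg_iters_per_visit": total_iters / max(1, total_visits),
            }

    stats = LoopStats()
    stats.record_visit(head=0x60, iters=4,
                       insts_per_iter=[9, 8, 9, 11], branches_per_iter=[4, 4, 4, 4])
    print(stats.summary())
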
20
Application of Characterization Study
  • The number of loops found in our applications
    ranged from 21 (g721) to 1266 (gcc), with most
    applications containing fewer than 100 loops
  • Guides the choice of the number of entries in
    the first-level loop table
  • In 9 of the 12 benchmarks studied, more than 40%
    of the loops were visited only 2-64 times
  • Guides the design of the loop cache replacement
    algorithm
  • For more than 80% of all loop visits, the number
    of iterations executed per visit was fewer than 17
  • Guides the design of the hardware unrolling logic

21
Application of Characterization Study
  • The weighted average number of instructions
    executed per iteration ranged from 15.87 to 99.83
  • Guides the selection of the length of the loop
    cache line
  • In 10 of the 12 benchmarks studied, the largest
    loop body size was less than 8192 instructions
  • Guides the selection of the size of the loop
    cache
  • On average, 85% of loop iterations contained 3 or
    fewer conditional branches in their loop bodies
  • Guides the selection of the path-in-iteration
    pattern history register width
  • The maximum level of nesting was 2-7 loops
  • Guides the selection of the loop cache stack
    depth

22
Outline
  • Introduction
  • Related work
  • Loop terminology
  • Loop-based workload characterization
  • Loop caching and unrolling hardware
  • Experimental methodology
  • Results
  • Hybrid fetch engine
  • Summary and future work

23
Dynamic loop profiling
  • To be able to accurately predict loop execution,
    we need to:
  • dynamically identify loops
  • predict loop entry
  • select loops to cache
  • select loops to unroll
  • We utilize a loop cache stack mechanism that
    builds loop history before entering a loop into
    the loop cache (a sketch of the identification
    step follows below)

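A common way to identify loops dynamically is to watch for taken backward branches: a taken branch whose target lies at a lower address marks a loop tail, and its target is the loop head. The sketch below illustrates that detection step together with the stack-based nesting bookkeeping; the exit heuristic shown (a taken forward branch past the tail) is our simplification.

    # Sketch: dynamic loop identification from taken branches.

    def on_taken_branch(branch_pc, target_pc, loop_stack, loop_table):
        if target_pc < branch_pc:                   # backward branch: a loop tail
            if loop_stack and loop_stack[-1]["head"] == target_pc:
                loop_stack[-1]["iterations"] += 1   # same loop, next iteration
            else:                                   # newly entered (possibly nested) loop
                loop_stack.append({"head": target_pc, "tail": branch_pc,
                                   "iterations": 1})
        else:
            # A taken forward branch past the tail of a live loop ends its visit;
            # the collected history is then recorded for that loop.
            while loop_stack and target_pc > loop_stack[-1]["tail"]:
                done = loop_stack.pop()
                loop_table[done["head"]] = done

    stack, table = [], {}
    on_taken_branch(0x90, 0x60, stack, table)   # backward branch: loop detected
    on_taken_branch(0x90, 0x60, stack, table)   # second iteration of the same loop
    on_taken_branch(0x94, 0xA0, stack, table)   # forward branch past the tail: visit ends
    print(table[0x60]["iterations"])            # 2
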
24
Building Loop Cache Histories
25
Dynamic loop caching and unrolling
  • Loop unrolling hardware
  • captures loop visits in the loop prediction table
  • interrogates the loop predictor to obtain
    information for future loops
  • utilizes loop predictor information to
    dynamically replicate loop bodies in the loop
    cache

26
[Figure: block diagram of the loop unrolling datapath: the loop predictor supplies loop identification, predicted iterations, and paths-in-iteration to the loop unrolling control; instructions come from the I-cache, unrolled loop bodies are held in the loop cache, and the queue of speculated instructions feeds dispatch/decode and execution]
27
Dynamic loop caching and unrolling
  • Loop unrolling hardware:
  • uses path-to-loop information to predict a future
    loop visit
  • extracts the number of predicted iterations and
    uses it as the initial unroll factor (unless the
    loop cache size would be exceeded)
  • as long as the number of predicted iterations is
    larger than 1, unrolls the loop in the loop cache
  • uses the paths-in-iteration information to create
    the correct trace of instructions on every loop
    iteration (see the sketch below)

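Putting those steps together, the unroller replicates the loop body once per predicted iteration, selecting the instruction trace that matches that iteration's predicted path. A minimal software sketch of this policy; representing the body as one trace per path pattern is our simplification.

    # Sketch: building an unrolled trace from per-iteration path predictions.

    def unroll(body_variants, predicted_paths, cache_capacity):
        """body_variants: path pattern -> instruction trace for one iteration.
        predicted_paths: one predicted pattern per predicted iteration."""
        if len(predicted_paths) <= 1:
            return []                              # only unroll when iterations > 1
        trace = []
        for pattern in predicted_paths:
            iteration = body_variants.get(pattern)
            if iteration is None:                  # unseen path: stop unrolling
                break
            if len(trace) + len(iteration) > cache_capacity:
                break                              # loop cache size would be exceeded
            trace.extend(iteration)
        return trace

    # Two paths through the body; the visit is predicted to run 3 iterations.
    variants = {0b0: ["i1", "i2", "ba"], 0b1: ["i1", "i3", "ba"]}
    print(unroll(variants, [0b0, 0b0, 0b1], cache_capacity=16))
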
28
Loop prediction table
[Figure: loop prediction table lookup. The path-to-loop branch history (bn-1, bn-2, bn-3, ..., b0) is hashed to index the table. If the tag matches the last branch and the predicted number of iterations is greater than 1, the entry is used to interrogate the loop cache hardware; otherwise there is no information for this loop and normal fetching proceeds.]
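In software terms, the lookup in this figure might look like the following sketch; the hash function and entry layout are illustrative assumptions.

    # Sketch of the loop prediction table lookup described in the figure.

    def lookup(table, path_to_loop, last_branch):
        index = hash(tuple(path_to_loop)) % len(table)
        entry = table[index]
        if entry is None or entry["tag"] != last_branch:
            return None                  # no information: proceed with normal fetching
        if entry["pred_iterations"] <= 1:
            return None                  # not worth unrolling
        return entry                     # use it to interrogate the loop cache

    table = [None] * 512
    table[hash(("b1", "b2", "b6")) % 512] = {
        "tag": "b9",
        "pred_iterations": 4,
        "paths": [0b011, 0b011, 0b111, 0b1001],   # one pattern per iteration
    }
    print(lookup(table, ["b1", "b2", "b6"], "b9"))
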
29
Loop Cache Lookup Table
[Figure: loop cache lookup. The loop start address is split into tag and index to probe the loop cache lookup table; on a hit, the captured loop is delivered from the LOOP CACHE in dynamic-trace fetch mode, otherwise fetching proceeds in basic fetch mode.]
30
Loop cache
[Figure: loop cache fill example. The I-cache holds the loop at addresses 60-94: 60 i1, 64 i2, 68 b0, 6c i3, 70 i4, 74 i5, 78 b1, 7c i6, 80 b2, 84 i7, 88 b3, 8c i8, 90 b (back to 60), 94 i9. The loop table entry (60, 90, 4, 011, 011, 111, 1001) summarizes the loop, and the loop cache holds one dynamic trace per iteration, e.g. i1 i2 b0 i3 i4 i5 b1 b2 i8 b and i1 i2 b0 i4 i5 b1 b2 i8 b]
Loop cache control: if the loop is not in the loop
cache, request instructions from the I-cache and
build dynamic traces according to the information
in the loop table.
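Reading the example tuple (60, 90, 4, 011, 011, 111, 1001) as loop start address, loop end address, iteration count, and one branch-outcome pattern per iteration (our interpretation of the slide), trace construction can be sketched as:

    # Sketch: decoding a loop table entry and building the dynamic trace.

    def build_dynamic_trace(entry, fetch_iteration):
        start, end, iters, *patterns = entry
        trace = []
        for pattern in patterns[:iters]:
            # fetch_iteration walks start..end in the I-cache, steering each
            # conditional branch by the next bit of `pattern` (assumed helper).
            trace.extend(fetch_iteration(start, end, pattern))
        return trace

    def fake_fetch_iteration(start, end, pattern):   # stand-in for the I-cache walk
        return ["trace[%x..%x] path=%s" % (start, end, bin(pattern))]

    print(build_dynamic_trace((0x60, 0x90, 4, 0b011, 0b011, 0b111, 0b1001),
                              fake_fetch_iteration))
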
31
Outline
  • Introduction
  • Related work
  • Loop terminology
  • Loop-based workload characterization
  • Loop caching and unrolling hardware
  • Experimental methodology
  • Results
  • Hybrid fetch engine
  • Summary and future work

32
Experimental methodology
  • Modified the SimpleScalar 3.0b Alpha EV6 pipeline
    model to include the following features:
  • loop head/tail detection
  • loop visit prediction
  • loop cache fill/replacement management
  • loop stack and first-level table operations
  • trace cache model
  • hybrid fetch engine operation

33
(No Transcript)
34
Baseline Architecture Parameters

Decode width: 16
Commit width: 16
Instruction fetch queue: 16 entries
Integer functional units: 8, 1-cycle latency
Integer multipliers: 2, 7-cycle latency
FP adders: 4, 4-cycle latency
FP multipliers: 2, 4-cycle latency
FP divide units: 2, 12-cycle latency
FP square-root units: 2, 23-cycle latency
Branch prediction: bimodal 4096-entry, 2-level adaptive, 8-entry RAS
L1 D-cache: 16KB, 4-way set-associative
L1 I-cache: 16KB, 4-way set-associative
L1 latency: 2 cycles
L2 unified cache: 256KB, 4-way set-associative
L2 latency: 10 cycles
Memory latency: 250 cycles
TLB: 128-entry, 4-way set-associative, 4KB pages, 30-cycle miss penalty
35
Loop Cache Architecture Parameters

Loop table: 512 entries, 4-way, 1-cycle hit latency, 3-cycle miss penalty, path length up to 16 branches, up to 16 iterations captured
Loop stack: 8 entries, 1-cycle access
Loop cache: 8KB, 1-cycle hit latency
36
Outline
  • Introduction
  • Related work
  • Loop terminology
  • Loop-based workload characterization
  • Loop caching and unrolling hardware
  • Experimental methodology
  • Results
  • Hybrid fetch engine
  • Summary and future work

37
(No Transcript)
38
(No Transcript)
39
Outline
  • Introduction
  • Related work
  • Loop terminology
  • Loop-based workload characterization
  • Loop caching and unrolling hardware
  • Experimental methodology
  • Results
  • Hybrid fetch engine
  • Summary and future work

40
Hybrid fetch engine
  • Trace caches have been shown to greatly improve
    fetch efficiency [Rosner 2001]
  • Trace caches have not been designed to handle
    loops effectively
  • Replication of loop bodies in the trace cache
    space
  • Handling of changing conditional control flow in
    loop bodies
  • We explore how to combine a loop cache with a
    trace cache to filter out problematic loops

41
Hybrid fetch engine strategy
  • Capture all non-loop instructions with a trace
    cache
  • Capture easy-to-predict loops with a trace cache
  • Capture complex loops with a loop cache
  • Provide new fetch logic to steer instruction
    fetching to the appropriate source

42
Hybrid fetch engine strategy
  • A trace cache miss occurs when:
  • the branch flags in the trace mismatch the
    multiple-branch predictor
  • the branches in the trace are mispredicted
  • the trace is not found in the trace cache
  • Loop cache fetching is triggered when any of
    these occur (see the sketch below)

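The steering decision can be summarized as: try the trace cache first, redirect to the loop cache on any of the three miss conditions above, and fall back to the L1 I-cache otherwise. A minimal sketch of that policy; the lookup interfaces are assumptions.

    # Sketch of the hybrid fetch steering policy.

    def select_fetch_source(pc, trace_cache, loop_cache):
        trace = trace_cache.lookup(pc)
        trace_miss = (
            trace is None                        # trace not found in the trace cache
            or trace.flags_mismatch()            # branch flags vs. multiple-branch predictor
            or trace.branches_mispredicted()     # branches in the trace mispredicted
        )
        if not trace_miss:
            return "trace", trace
        loop = loop_cache.lookup(pc)
        if loop is not None:
            return "loop", loop                  # loop cache fetching triggered
        return "basic", None                     # fall back to the L1 I-cache

    class _AlwaysMiss:                           # tiny stub to show the fallback
        def lookup(self, pc):
            return None

    print(select_fetch_source(0x60, _AlwaysMiss(), _AlwaysMiss()))  # ('basic', None)
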
43
Hybrid fetching scheme
[Figure: hybrid fetching scheme. A tri-state bus arbiter, driven by the current fetch mode (trace, loop cache, or basic), enables exactly one source to fill the fetch queue: the trace cache (delivering α × IFQ-width instructions), the loop cache (β × IFQ-width), or the L1 cache through a demux (φ × IFQ-width)]
44
(No Transcript)
45
(No Transcript)
46
(No Transcript)
47
TC vs. TC + LC
48
Publications on Loop Prediction
  • M. R. de Alba, D. R. Kaeli, and J. Gonzalez,
    "Improving the Effectiveness of Trace Caching
    Using a Loop Cache," NUCAR technical report.
  • M. R. de Alba and D. R. Kaeli, "Characterization
    and Evaluation of Hardware Loop Unrolling," 1st
    Boston Area Architecture Conference, Jan. 2003.
  • M. R. de Alba and D. R. Kaeli, "Path-based
    Hardware Loop Prediction," Proc. of the 4th
    International Conference on Control, Virtual
    Instrumentation and Digital Systems, August 2002.
  • A. Uht, D. Morano, A. Khalafi, M. de Alba, and
    D. R. Kaeli, "Realizing High IPC Using
    Time-Tagged Resource-Flow Computing," Proc. of
    Euro-Par, August 2002.
  • M. R. de Alba and D. R. Kaeli, "Runtime
    Predictability of Loops," Proc. of the 4th
    Annual IEEE International Workshop on Workload
    Characterization, December 2001.
  • M. R. de Alba, D. R. Kaeli, and E. S. Kim,
    "Dynamic Analysis of Loops," Proc. of the 3rd
    International Conference on Control, Virtual
    Instrumentation and Digital Systems, August 2001.

49
Conclusions
  • Branch correlation helps detect loops in advance
  • Loops have patterns of behavior (iterations,
    dynamic body size, in-loop paths)
  • Across the studied benchmarks, on average more
    than 50% of loops contain conditional branches
  • In-loop branches can be predicted and used to
    guide unrolling

50
Conclusions
  • Dynamic instruction traces are built using loop
    profiling and prediction
  • Multiple loops can be simultaneously unrolled
  • By combining a trace cache and a loop cache, more
    useful and less redundant instruction streams are
    built
  • Performance benefits are gained with a hybrid
    fetch engine mechanism