Exposing ILP in the Presence of Loops - PowerPoint PPT Presentation

About This Presentation
Title:

Exposing ILP in the Presence of Loops

Description:

Exposing ILP in the Presence of Loops. Marcos Rubén de Alba Rosano and David Kaeli, Department of Electrical and Computer Engineering, Northeastern University

Transcript and Presenter's Notes

Title: Exposing ILP in the Presence of Loops


1
Exposing ILP in the Presence of Loops
  • Marcos Rubén de Alba Rosano
  • David Kaeli
  • Department of Electrical and Computer Engineering
  • Northeastern University

2
Exposing ILP in the Presence of Loops
  • To enable wide-issue microarchitectures to obtain
    high throughput rates, a large window of
    instructions must be available
  • Programs spend 90% of their execution time in 10%
    of their code (in loops) [HP]
  • Current compilers can unroll fewer than 50% of all
    loops in integer codes [de Alba 2000]
  • Present microarchitectures are not designed to
    execute loops efficiently [Vajapeyam 1999, Rosner
    2001]
  • We may need to consider developing customized
    loop predication hardware [de Alba 2001]

3
Exposing ILP in the Presence of Loops
  • We need to understand whether entire loop
    executions can be predicted
  • Could expose large amounts of instruction level
    parallelism
  • If patterns exist in loop execution, we need to
    build a dynamic profiling system that can capture
    these patterns
  • If we are able to detect patterns through
    profiling, we can guide aggressive instruction
    fetch/issue, effectively unrolling multiple
    iterations of the loop

4
Outline
  • Introduction
  • Related work
  • Loop terminology
  • Loop-based workload characterization
  • Loop caching and unrolling hardware
  • Experimental methodology
  • Results
  • Hybrid fetch engine
  • Summary and future work

5
Introduction
  • Loops possess high temporal locality and present
    a good opportunity for caching
  • Reduce pressure on other instruction caching
    structures
  • Provide for aggressive runtime unrolling
  • Applications that possess large loops tend to be
    good targets to extract ILP
  • Common in scientific codes (e.g., SPECfp2000)
  • Uncommon in integer and multimedia codes (e.g.,
    SPECint2000 and Mediabench)

6
Introduction
  • We propose a path-based multi-level
    hardware-based loop caching scheme that can
  • identify the presence of a loop in the execution
    stream
  • profile loop execution, building loop execution
    histories
  • cache entire unrolled loop execution traces in a
    dedicated loop cache
  • utilize loop history to predict future loop
    visits at runtime
  • Combine a loop prediction mechanism with other
    aggressive instruction fetch mechanisms to
    improve instruction delivery

7
Loop cache elements
  • Loop cache stack: profiles loops that are
    presently live; uses a stack structure to
    accommodate nesting
  • Loop table: a first-level table used to identify
    loops and index into the loop cache
  • Loop cache: a second-level table used to hold
    unrolled loop bodies

8
Outline
  • Introduction
  • Related work
  • Loop terminology
  • Loop-based workload characterization
  • Loop caching and unrolling hardware
  • Experimental methodology
  • Results
  • Hybrid fetch engine
  • Summary and future work

9
Related work
  • Software-based
      • loop unrolling [Ellis, 1986]
      • software pipelining [Lam, 1988]
      • loop quantization [Nicolau, 1988]
      • static loop characteristics [Davidson, 1995]
  • Limitations
      • A compiler cannot unroll a loop if
          • the loop body is too large
          • the loop induction variable is not an integer
          • the loop induction variable is not incremented
            or decremented by 1
          • the increment or decrement value cannot be
            deduced at compile time
          • the loop exit condition is not based on the
            value of a constant
          • there is conditional control flow in the loop
            body
      • More than 50% of the loops present in our
        workloads could not be unrolled by the compiler
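
The limitations listed on this slide can be read as a checklist over a loop's static properties. The following is a minimal sketch of that checklist, not a description of any real compiler: the `LoopInfo` record, its field names, and the 64-instruction body limit are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LoopInfo:
    """Hypothetical static description of a loop (illustrative fields only)."""
    body_size: int                # instructions in the loop body
    induction_is_integer: bool    # is the induction variable an integer?
    step: Optional[int]           # inc/dec value, None if unknown at compile time
    exit_bound_is_constant: bool  # is the exit condition based on a constant?
    has_conditional_flow: bool    # conditional control flow inside the body?


def unroll_blockers(loop: LoopInfo, max_body: int = 64) -> List[str]:
    """Return the reasons (from the slide's list) that block static unrolling."""
    reasons = []
    if loop.body_size > max_body:  # max_body is an assumed threshold
        reasons.append("loop body is too large")
    if not loop.induction_is_integer:
        reasons.append("induction variable is not an integer")
    if loop.step is None:
        reasons.append("inc/dec value cannot be deduced at compile time")
    elif abs(loop.step) != 1:
        reasons.append("induction variable is not inc/dec by 1")
    if not loop.exit_bound_is_constant:
        reasons.append("exit condition is not based on a constant")
    if loop.has_conditional_flow:
        reasons.append("conditional control flow in the loop body")
    return reasons


simple = LoopInfo(12, True, 1, True, False)
branchy = LoopInfo(12, True, 1, True, True)
print(unroll_blockers(simple))   # no blockers for the simple counted loop
print(unroll_blockers(branchy))  # blocked by in-loop conditional control flow
```

An empty list means none of the listed conditions apply; any non-empty list corresponds to a loop that would fall into the "could not be unrolled" group.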

10
Related work
  • Hardware-based
      • loop buffer [Thornton, 1964; Anderson, 1967;
        Hintz, 1972]
      • multiple-block ahead loop prediction [Seznec,
        1996]
      • trace cache [Rotenberg, 1996]
      • loop detection [Kobayashi, 1984; Tubella, 1998;
        Gonzalez, 1998]
      • dynamic vectorization [Vajapeyam, 1999]
      • loop termination [Sherwood, 2000]
      • loop caches [Texas Inst.; Uh, 1999; Motorola;
        Vahid, 2002]
      • hybrid approaches [Holler, 1997; Hinton, 2001]
  • Limitations
      • These techniques can effectively cache
        well-structured loops
      • Conditional control flow present in loop bodies
        can limit the number of loops that can be cached
      • The conditional control flow found in loops
        generates complex, yet predictable patterns

11
Outline
  • Introduction
  • Related work
  • Loop terminology
  • Loop-based workload characterization
  • Loop caching and unrolling hardware
  • Hybrid fetch engine
  • Experimental methodology
  • Results
  • Summary and future work

12
Loop terminology
Static terms
[Figure: instruction sequence i0 b1 i1 b2 i3 b3 i4 i5 i6 i7 b4 i8,
annotated with the loop head, loop body, and loop tail]
13
Loop terminology
Dynamic terms
[Figure: instruction sequence i0 b1 i1 b2 i3 b3 i4 i5 i6 i7 b4 i8,
annotated with the path to loop, path-in-iteration A (b3 not taken),
and path-in-iteration B (b3 taken)]
  • loop visit: entering a loop body
  • iteration: returning to the loop head before exiting a
    loop
  • path-in-loop: the complete set of path-in-iterations
    for an entire loop visit
14
Importance of path-to-loop
[Figure: control-flow graph of branches b1 through b12, with each edge
labeled taken (T) or not taken (NT)]
  • Enter loop: b1, b2, b6, b9 with outcomes NT, NT, T, NT
  • Do not enter loop: b1, b3, b5, b9 with outcomes T, NT, NT, T
15
Importance of path-to-loop
For loop caching to be successful, we must
be able to predict b9 very accurately.
[Figure: control-flow graph of branches b1 through b12, with each edge
labeled taken (T) or not taken (NT)]
  • Enter loop: b1, b2, b6, b9 with outcomes NT, NT, T, NT
  • Do not enter loop: b1, b3, b5, b9 with outcomes T, NT, NT, T
16
Static view of loop
      i1
Top:  i2 i3 i4 i5 i6 i7
      ba zero, A
      i8 i9
A:    i10 i11 i12 i13
      bb zero, B
      i14 i15
B:    i16 i17 i18
      bc zero, Top
      i19
Dynamic view of a loop (all possible paths during
a single loop iteration)
17
Static view of loop
      i1
Top:  i2 i3 i4 i5 i6 i7
      ba zero, A
      i8 i9
A:    i10 i11 i12 i13
      bb zero, B
      i14 i15
B:    i16 i17 i18
      bc zero, Top
      i19
Dynamic view of a loop (all possible paths during
a single loop iteration)
For loop caching to be successful, we must
be able to predict the path followed on each
iteration.
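
To make "all possible paths during a single loop iteration" concrete, the sketch below enumerates them for the static loop on this slide. It assumes that `Top`, `A`, and `B` are labels, that `ba` jumps to `A` (instruction i10) and `bb` jumps to `B` (instruction i16) when taken, and that `bc` is the loop-closing branch; these readings are inferred from the flattened listing and may not match the original figure exactly.

```python
from itertools import product

# Loop body from Top (i2) to the loop-closing branch bc, in program order.
LOOP = ["i2", "i3", "i4", "i5", "i6", "i7", "ba",
        "i8", "i9", "i10", "i11", "i12", "i13", "bb",
        "i14", "i15", "i16", "i17", "i18", "bc"]

# Assumed taken-targets of the two forward branches (labels A and B).
TARGET = {"ba": "i10", "bb": "i16"}


def iteration_path(outcomes):
    """Instructions executed in one iteration for given branch outcomes.

    outcomes maps each forward branch name to True (taken) or False.
    """
    path, i = [], 0
    while i < len(LOOP):
        insn = LOOP[i]
        path.append(insn)
        if insn in TARGET and outcomes[insn]:
            i = LOOP.index(TARGET[insn])  # skip to the branch target
        else:
            i += 1                        # fall through
    return path


# One path-in-iteration per combination of (ba, bb) outcomes.
paths = {bits: iteration_path(dict(zip(("ba", "bb"), bits)))
         for bits in product((False, True), repeat=2)}

for bits, path in paths.items():
    print(bits, len(path), " ".join(path))
```

Two forward branches give four candidate paths-in-iteration, which is why the predictor has to record an outcome pattern per iteration rather than a single trip count.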
18
Outline
  • Introduction
  • Related work
  • Loop terminology
  • Loop-based workload characterization
  • Loop caching and unrolling hardware
  • Experimental methodology
  • Results
  • Hybrid fetch engine
  • Summary and future work

19
Loop characterization
  • Important to characterize loop behavior in order
    to guide design tradeoff choices associated with
    the implementation of the loop cache
  • Loops possess a range of characteristics that
    affect their predictability
  • number of loops
  • number of loop visits
  • number of iterations per loop visit
  • dynamic loop body size
  • number of conditional branches found in an
    iteration
  • many more in the thesis

20
Application of Characterization Study
  • The number of loops found in our applications
    ranged from 21 (g721) to 1266 (gcc), with most
    applications containing fewer than 100 loops
      • Guides the size choice for the number of entries
        in the first-level loop table
  • In 9 of the 12 benchmarks studied, more than 40%
    of the loops were visited only 2 to 64 times
      • Guides the design of the loop cache replacement
        algorithm
  • For more than 80% of all loop visits, the number
    of iterations executed per visit was less than 17
      • Guides the design of the hardware unrolling logic
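
The statistics on this slide can all be derived from a log of loop visits. The sketch below shows one way to compute them, assuming a hypothetical log format of `(loop_id, iterations)` pairs; the sample data is invented for illustration and is not taken from the benchmarks.

```python
from collections import Counter


def characterize(visits):
    """Summarize a loop-visit log.

    visits: list of (loop_id, iterations) tuples, one per loop visit.
    """
    visits_per_loop = Counter(loop_id for loop_id, _ in visits)
    short_visits = sum(1 for _, iters in visits if iters < 17)
    few_visits = sum(1 for count in visits_per_loop.values() if 2 <= count <= 64)
    return {
        "loops": len(visits_per_loop),                      # sizes the loop table
        "visits": len(visits),
        "frac_loops_visited_2_to_64": few_visits / len(visits_per_loop),
        "frac_visits_under_17_iterations": short_visits / len(visits),
    }


# Hypothetical log: loop L1 visited twice, L2 once, L3 three times.
sample = [("L1", 4), ("L1", 12), ("L2", 40), ("L3", 3), ("L3", 3), ("L3", 16)]
stats = characterize(sample)
print(stats)
```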

21
Application of Characterization Study
  • The weighted average for the number of
    instructions executed per iteration ranged from
    15.87 to 99.83
      • Guides the selection of the length of the loop
        cache line
  • In 10 of the 12 benchmarks studied, the largest
    loop body size was less than 8192 instructions
      • Guides the selection of the size of the loop cache
  • On average, 85% of loop iterations contained 3 or
    fewer conditional branches in their loop bodies
      • Guides the selection of the path-in-iteration
        pattern history register width
  • Maximum level of nesting: 2 to 7 loops
      • Guides the selection of the loop cache stack
        depth

22
Outline
  • Introduction
  • Related work
  • Loop terminology
  • Loop-based workload characterization
  • Loop caching and unrolling hardware
  • Experimental methodology
  • Results
  • Hybrid fetch engine
  • Summary and future work

23
Dynamic loop profiling
  • To be able to accurately predict loop execution,
    we need to
  • dynamically identify loops
  • predict loop entry
  • select loops to cache
  • select loops to unroll
  • We utilize a loop cache stack mechanism that
    builds loop history before entering a loop into
    the loop cache
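
The slides do not spell out how loops are identified at run time. A common heuristic treats a taken backward branch as a loop tail and its target as the loop head, and a stack then tracks nested live loops. The sketch below follows that heuristic; the class name, the 4-byte instruction size, and the exit rule are assumptions, and only the stack depth of 8 matches the loop stack parameters given later in the deck.

```python
class LoopStack:
    """Tracks live loops; the innermost loop is the last element of `live`."""

    def __init__(self, depth=8, insn_bytes=4):
        self.depth = depth
        self.insn_bytes = insn_bytes
        self.live = []      # entries are [head, tail, back_edges_taken]
        self.history = []   # completed visits as (head, tail, iterations)

    def _retire_until(self, encloses):
        # Pop finished loops until the innermost live loop satisfies `encloses`.
        while self.live and not encloses(self.live[-1]):
            head, tail, back_edges = self.live.pop()
            self.history.append((head, tail, back_edges + 1))

    def observe(self, pc, taken, target):
        """Feed one executed branch: its address, outcome, and target."""
        if taken and target <= pc:
            # Taken backward branch: `target` is a loop head, `pc` its tail.
            self._retire_until(lambda lp: lp[0] <= target and pc <= lp[1])
            if self.live and self.live[-1][:2] == [target, pc]:
                self.live[-1][2] += 1                  # another iteration
            elif len(self.live) < self.depth:
                self.live.append([target, pc, 1])      # newly detected loop
        else:
            # Any other branch may leave one or more live loops.
            next_pc = target if taken else pc + self.insn_bytes
            self._retire_until(lambda lp: lp[0] <= next_pc <= lp[1])


stack = LoopStack()
for _ in range(3):
    stack.observe(0x90, True, 0x60)   # loop-closing branch taken three times
stack.observe(0x90, False, 0x60)      # fall through: the visit ends
print(stack.history)                  # one visit of four iterations
```

The recorded `(head, tail, iterations)` tuples are the kind of history a loop table could be trained on before a loop is admitted to the loop cache.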

24
Building Loop Cache Histories
25
Dynamic loop caching and unrolling
  • Loop unrolling hardware
  • captures loop visits in loop prediction table
  • interrogates the loop predictor to obtain
    information for future loops
  • utilizes loop predictor information to
    dynamically replicate loop bodies in the loop
    cache

26
[Figure: block diagram with the following components: loop predictor,
I-cache, loop unrolling control (inputs: loop identification, predicted
iterations, paths-in-iteration), dispatch/decode, loop cache, and a
queue of speculated instructions that carries the unrolled loop to
execution]
27
Dynamic loop caching and unrolling
  • Loop unrolling hardware
  • uses path-to-loop information to predict a future
    loop visit
  • extracts the number of predicted iterations and
    uses it as the initial unroll factor (unless the
    loop cache size is exceeded)
  • as long as the number of predicted iterations is
    larger than 1, unrolls the loop in the loop cache
  • uses the paths-in-iteration information to create
    the correct trace of instructions on every loop
    iteration
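
The unroll-factor rule described on this slide can be sketched as a small function. This is an illustration under stated assumptions: instructions are taken to be 4 bytes (Alpha), the 8KB default matches the loop cache size given in the parameters slide, and a single fixed body size stands in for what is really a per-path dynamic size.

```python
INSN_BYTES = 4  # assumed fixed instruction size (Alpha)


def unroll_factor(predicted_iterations, body_insns, cache_bytes=8 * 1024):
    """Number of iteration copies to place in the loop cache.

    Returns 0 when the loop should not be unrolled.
    """
    if predicted_iterations <= 1:
        return 0                               # nothing to unroll
    capacity = cache_bytes // INSN_BYTES       # instructions the cache can hold
    fit = capacity // body_insns               # whole iterations that fit
    return min(predicted_iterations, fit)      # capped by the loop cache size


print(unroll_factor(16, 100))   # all 16 predicted iterations fit
print(unroll_factor(16, 200))   # capped by capacity
print(unroll_factor(1, 100))    # a single predicted iteration is not unrolled
```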

28
Loop prediction table
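[Figure: loop prediction table lookup]
  • The path-to-loop (branches bn-1, bn-2, bn-3, ..., b0)
    is passed through a hash function to index the table
  • Tag match on the last branch?
      • N: There is no information for this loop; proceed
        with normal fetching.
      • Y: Are the predicted iterations greater than 1?
        If so, the information is used to interrogate the
        loop cache hardware.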
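
The lookup flow on this slide can be sketched as a small table indexed by a hash of the path-to-loop and tagged with the last branch on that path. Everything specific below is an assumption made for illustration: the shift-and-XOR hash, the direct-mapped organization (the parameters slide lists the real table as 4-way), and the branch addresses used in the example.

```python
class LoopPredictionTable:
    """Direct-mapped sketch of a path-indexed loop prediction table."""

    def __init__(self, entries=512):
        self.entries = entries
        self.table = [None] * entries

    def _index(self, path):
        # Hypothetical hash: fold each (branch address, outcome) into one word.
        h = 0
        for addr, taken in path:
            h = ((h << 1) ^ (addr >> 2) ^ int(taken)) & 0xFFFFFFFF
        return h % self.entries

    def update(self, path, iterations, paths_in_iteration):
        tag = path[-1][0]  # address of the last branch on the path
        self.table[self._index(path)] = (tag, iterations, paths_in_iteration)

    def lookup(self, path):
        """Return (iterations, paths_in_iteration), or None for normal fetch."""
        entry = self.table[self._index(path)]
        if entry is None or entry[0] != path[-1][0]:
            return None                 # no information for this loop
        _, iterations, paths_in_iteration = entry
        if iterations <= 1:
            return None                 # not worth unrolling
        return iterations, paths_in_iteration


# Hypothetical branch addresses for the two paths from the path-to-loop slide.
B1, B2, B3, B5, B6, B9 = 0x100, 0x110, 0x120, 0x140, 0x150, 0x180
enter = [(B1, False), (B2, False), (B6, True), (B9, False)]
skip = [(B1, True), (B3, False), (B5, False), (B9, True)]

lpt = LoopPredictionTable()
lpt.update(enter, 4, ["011", "011", "111", "1001"])
print(lpt.lookup(enter))  # prediction available for the path that enters the loop
print(lpt.lookup(skip))   # no entry for the other path: normal fetching
```

Both example paths end at b9, so the tag alone cannot tell them apart; it is the path hash that separates the "enter loop" history from the "do not enter" one.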
29
Loop Cache Lookup Table
[Figure: the loop start address is split into a tag and an index to
access the loop cache lookup table]
  • Captured loop? Y: fetch mode LOOP CACHE
  • Captured loop? N: build dynamic trace, fetch mode basic
30
Loop cache
I-cache contents (address: instruction)
  60: i1    64: i2    68: b0    6c: i3    70: i4    74: i5    78: b1
  7c: i6    80: b2    84: i7    88: b3    8c: i8    90: b 60  94: i9
Loop table entry
  (60, 90, 4, 011, 011, 111, 1001)
Loop cache contents (one trace per iteration)
  i1 i2 b0 i3 i4 i5 b1 b2 i8 b
  i1 i2 b0 i3 i4 i5 b1 b2 i8 b
  i1 i2 b0 i4 i5 b1 b2 i8 b
  i1 i2 b0 i4 i5 b1 i6 b2 i7 b3 i9
Loop cache control
  If loop is not in loop cache, then request
  instructions from the I-cache and build
  dynamic traces according to information in loop
  table.
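
The example on this slide can be replayed in software. The sketch below reads the loop table entry as (loop head 60, loop tail 90, 4 iterations, then one taken/not-taken bit string per iteration, in branch order). That reading, and the taken-targets assigned to b0 through b3, are inferred from the traces shown on the slide rather than stated there.

```python
# I-cache image from the slide: address -> instruction.
ICACHE = {0x60: "i1", 0x64: "i2", 0x68: "b0", 0x6C: "i3", 0x70: "i4",
          0x74: "i5", 0x78: "b1", 0x7C: "i6", 0x80: "b2", 0x84: "i7",
          0x88: "b3", 0x8C: "i8", 0x90: "b", 0x94: "i9"}

# Assumed taken-targets, inferred from which instructions each trace skips.
TARGETS = {"b0": 0x70, "b1": 0x80, "b2": 0x8C, "b3": 0x94}


def build_traces(head, tail, paths_in_iteration):
    """Build one instruction trace per iteration from recorded outcomes."""
    traces = []
    for bits in paths_in_iteration:
        outcomes = iter(bits)          # one bit per conditional branch met
        trace, pc = [], head
        while True:
            insn = ICACHE[pc]
            trace.append(insn)
            if pc == tail:             # loop-closing branch ends the iteration
                break
            if insn in TARGETS and next(outcomes) == "1":
                pc = TARGETS[insn]
                if not head <= pc <= tail:
                    trace.append(ICACHE[pc])   # loop exit: first insn after it
                    break
            else:
                pc += 4                # fall through to the next instruction
        traces.append(trace)
    return traces


traces = build_traces(0x60, 0x90, ["011", "011", "111", "1001"])
for trace in traces:
    print(" ".join(trace))
```

Under these assumptions the four generated traces match the four loop cache lines on the slide, including the final iteration that leaves the loop through b3.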
31
Outline
  • Introduction
  • Related work
  • Loop terminology
  • Loop-based workload characterization
  • Loop caching and unrolling hardware
  • Experimental methodology
  • Results
  • Hybrid fetch engine
  • Summary and future work

32
Experimental methodology
  • Modified the SimpleScalar 3.0b Alpha EV6 pipeline
    to model the following features
  • loop head/tail detection
  • loop visit prediction
  • loop cache fill/replacement management
  • loop stack and first-level table operations
  • trace cache model
  • hybrid fetch engine operation

34
Baseline Architecture Parameters
  Decode width               16
  Commit width               16
  Instruction fetch queue    16
  Integer functional units   8, 1-cycle latency
  Integer multipliers        2, 7-cycle latency
  FP adders                  4, 4-cycle latency
  FP multipliers             2, 4-cycle latency
  FP divide units            2, 12-cycle latency
  FP SQRT units              2, 23-cycle latency
  Branch prediction          Bi-modal 4096-entry, 2-level adaptive, 8-entry RAS
  L1 D-cache                 16KB, 4-way set-associative
  L1 I-cache                 16KB, 4-way set-associative
  L1 latency                 2 cycles
  L2 unified cache           256KB, 4-way set-associative
  L2 latency                 10 cycles
  Memory latency             250 cycles
  TLB                        128-entry, 4-way set-associative, 4KB page, 30 cycle
35
Loop Cache Architecture Parameters
  Loop table   512 entries, 4-way, 1-cycle hit latency, 3-cycle hit penalty,
               16-branch path length, up to 16 iterations captured
  Loop stack   8 entries, 1-cycle access
  Loop cache   8KB, 1-cycle hit latency
36
Outline
  • Introduction
  • Related work
  • Loop terminology
  • Loop-based workload characterization
  • Loop caching and unrolling hardware
  • Experimental methodology
  • Results
  • Hybrid fetch engine
  • Summary and future work

39
Outline
  • Introduction
  • Related work
  • Loop terminology
  • Loop-based workload characterization
  • Loop caching and unrolling hardware
  • Experimental methodology
  • Results
  • Hybrid fetch engine
  • Summary and future work

40
Hybrid fetch engine
  • Trace caches have been shown to greatly improve
    fetch efficiency [Rosner 2001]
  • Trace caches have not been designed to handle
    loops effectively
      • Replication of loop bodies in the trace cache
        space
      • Handling of changing conditional control flow in
        loop bodies
  • We explore how to combine a loop cache with a
    trace cache to filter out problematic loops

41
Hybrid fetch engine strategy
  • Capture all non-loop instructions with a trace
    cache
  • Capture easy-to-predict loops with a trace cache
  • Capture complex loops with a loop cache
  • Provide new fetch logic to steer instruction
    fetching to the appropriate source

42
Hybrid fetch engine strategy
  • Trace cache misses
      • branch flags do not match the multiple-branch
        predictor
      • branches in the trace are mispredicted
      • the trace is not found in the trace cache
  • Trigger loop cache fetching when any of these
    happen
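
The steering decision described on this slide can be sketched as a small selection function. The three fetch modes (trace, loop cache, basic) come from the slides; the function name, its boolean inputs, and the fallback to basic fetching when the loop has not been captured are assumptions made for illustration.

```python
from enum import Enum


class FetchMode(Enum):
    TRACE = "trace"
    LOOP_CACHE = "loop cache"
    BASIC = "basic"  # fetch from the L1 I-cache


def select_fetch_mode(trace_found, flags_match, branches_correct, loop_captured):
    """Pick the instruction source for the next fetch."""
    # Any one of the three listed conditions counts as a trace cache miss.
    trace_miss = not (trace_found and flags_match and branches_correct)
    if not trace_miss:
        return FetchMode.TRACE
    if loop_captured:
        return FetchMode.LOOP_CACHE   # a trace miss triggers loop cache fetching
    return FetchMode.BASIC            # assumed fallback when no loop is cached


print(select_fetch_mode(True, True, True, False))
print(select_fetch_mode(True, False, True, True))
print(select_fetch_mode(False, True, True, False))
```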

43
Hybrid fetching scheme
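[Figure: hybrid fetching scheme. The trace cache, loop cache, and L1
cache each have an enable (EN) input selected by a demux on the fetch
mode (trace, loop cache, or basic). Their outputs, labeled α·IFQ width,
β·IFQ width, and φ·IFQ width, pass through a tri-state bus arbiter that
delivers IFQ-width instructions to the fetch queue.]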
47
TC vs. TC + LC
48
Publications on Loop Prediction
  • M. R. de Alba, D. R. Kaeli, and J. Gonzalez,
    "Improving the Effectiveness of Trace Caching
    Using a Loop Cache," NUCAR technical report.
  • M. R. de Alba and D. R. Kaeli, "Characterization
    and Evaluation of Hardware Loop Unrolling," 1st
    Boston Area Architecture Conference, Jan 2003.
  • M. R. de Alba and D. R. Kaeli, "Path-based
    Hardware Loop Prediction," Proc. of the 4th
    International Conference on Control, Virtual
    Instrumentation and Digital Systems, August 2002.
  • A. Uht, D. Morano, A. Khalafi, M. de Alba, D. R.
    Kaeli, "Realizing High IPC Using Time-Tagged
    Resource-Flow Computing," Proc. of Europar,
    August 2002.
  • M. R. de Alba and D. R. Kaeli, "Runtime
    Predictability of Loops," Proc. of the 4th
    Annual IEEE International Workshop on Workload
    Characterization, December 2001.
  • M. R. de Alba, D. R. Kaeli, and E. S. Kim,
    "Dynamic Analysis of Loops," Proc. of the 3rd
    International Conference on Control, Virtual
    Instrumentation and Digital Systems, August 2001.

49
Conclusions
  • Branch correlation helps to detect loops in
    advance
  • Loops have patterns of behavior (iterations,
    dynamic body size, in-loop paths)
  • In the benchmarks studied, on average more than 50%
    of loops contain branches
  • In-loop branches can be predicted and used to
    guide unrolling

50
Conclusions
  • Dynamic instruction traces are built using loop
    profiling and prediction
  • Multiple loops can be simultaneously unrolled
  • Combining a trace cache and a loop cache builds
    more useful and less redundant instruction
    streams
  • Performance benefits are gained with a hybrid
    fetch engine mechanism