Lecture 11: Advanced Static ILP

Provided by: RajeevBala4


1
Lecture 11: Advanced Static ILP
  • Topics: loop unrolling, software pipelining (Section 4.4)

2
Loop Dependences
  • If a loop only has dependences within an iteration, the loop is
    considered parallel → multiple iterations can be executed together
    so long as order within an iteration is preserved
  • If a loop has dependences across iterations, it is not parallel,
    and these dependences are referred to as loop-carried
  • Not all loop-carried dependences imply lack of parallelism
  • Parallel loops are especially desirable in a multiprocessor system

3
Examples
for (i=1000; i>0; i=i-1)
    x[i] = x[i] + s;
No dependences

for (i=1; i<=100; i=i+1) {
    A[i+1] = A[i] + C[i];      /* S1 */
    B[i+1] = B[i] + A[i+1];    /* S2 */
}
S2 depends on S1 in the same iteration
S1 depends on S1 from prev iteration
S2 depends on S2 from prev iteration

for (i=1; i<=100; i=i+1) {
    A[i] = A[i] + B[i];        /* S1 */
    B[i+1] = C[i] + D[i];      /* S2 */
}
S1 depends on S2 from prev iteration

for (i=1000; i>0; i=i-1)
    x[i] = x[i-3] + s;         /* S1 */
S1 depends on S1 from 3 prev iterations
Referred to as a recurrence
Dependence distance 3; limited parallelism
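To make "dependence distance 3, limited parallelism" concrete, here is a minimal Python sketch. It assumes an ascending index, so that the value read was produced three iterations earlier, and the function names are hypothetical. With distance 3, elements i, i+3, i+6, ... form one chain, and the three chains never touch each other's data.

```python
def recurrence_sequential(x, s):
    """Run x[i] = x[i-3] + s for i = 3 .. len(x)-1 in program order."""
    x = list(x)
    for i in range(3, len(x)):
        x[i] = x[i - 3] + s
    return x


def recurrence_by_chains(x, s, distance=3):
    """Same recurrence, processed as `distance` independent chains.

    No chain reads a value written by another chain, so the outer loop
    iterations could run in parallel; each chain itself stays serial.
    """
    x = list(x)
    for start in range(distance):                 # independent chains
        for i in range(start + distance, len(x), distance):
            x[i] = x[i - distance] + s            # serial within a chain
    return x
```

Both functions return the same list for any input, which is the sense in which the parallelism is limited to three chains rather than absent.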
4
Finding Dependences the GCD Test
  • Do $A[a i + b]$ and $A[c i + d]$ refer to the same element?
  • Restrict ourselves to affine array indices (expressible as
    $a i + b$, where $i$ is the loop index and $a$ and $b$ are constants)
  • Example of a non-affine index: $x[y[i]]$
  • For a dependence to exist, there must be two indices $j$ and $k$
    within the loop bounds such that
  • $a j + b = c k + d$
  • $a j - c k = d - b$
  • $G = \mathrm{GCD}(a, c)$
  • $(a j / G - c k / G) = (d - b) / G$
  • The left side is an integer because $G$ divides both $a$ and $c$,
    so if $(d - b)/G$ is not an integer, the initial equality cannot be true
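The test above can be sketched in a few lines of Python. The function name and the handling of the degenerate case where both coefficients are zero are assumptions added for illustration. Note that the test is conservative: it ignores loop bounds, so a True result only means a dependence is possible.

```python
from math import gcd


def gcd_test_may_depend(a, b, c, d):
    """Return False when A[a*i + b] and A[c*i + d] can never overlap.

    True means a dependence is possible, not certain: loop bounds
    are not checked here.
    """
    g = gcd(a, c)
    if g == 0:
        # Both coefficients are zero (assumed edge case): the indices
        # are the constants b and d.
        return b == d
    return (d - b) % g == 0
```

For example, for a write to A[2*i + 3] and a read of A[2*i], gcd_test_may_depend(2, 3, 2, 0) is False, because G = 2 does not divide d - b = -3.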

5
Static vs. Dynamic ILP
Loop:  L.D     F0, 0(R1)       ; F0 = array element
       ADD.D   F4, F0, F2      ; add scalar
       S.D     F4, 0(R1)       ; store result
       DADDUI  R1, R1, #-8     ; decrement address pointer
       BNE     R1, R2, Loop    ; branch if R1 != R2

Statically unrolled loop:
Loop:  L.D     F0, 0(R1)
       L.D     F6, -8(R1)
       L.D     F10, -16(R1)
       L.D     F14, -24(R1)
       ADD.D   F4, F0, F2
       ADD.D   F8, F6, F2
       ADD.D   F12, F10, F2
       ADD.D   F16, F14, F2
       S.D     F4, 0(R1)
       S.D     F8, -8(R1)
       DADDUI  R1, R1, #-32
       S.D     F12, 16(R1)
       BNE     R1, R2, Loop
       S.D     F16, 8(R1)

Large window dynamic OOO processor:
       L.D     F0, 0(R1)
       ADD.D   F4, F0, F2
       S.D     F4, 0(R1)
       DADDUI  R1, R1, #-8
       BNE     R1, R2, Loop
       L.D     F0, 0(R1)
       ADD.D   F4, F0, F2
       S.D     F4, 0(R1)
       DADDUI  R1, R1, #-8
       BNE     R1, R2, Loop
       L.D     F0, 0(R1)
       ADD.D   F4, F0, F2
       S.D     F4, 0(R1)
       ...
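The same transformation can be sketched at source level in Python. This is an illustration only: the function names are hypothetical, and the unrolled version assumes the element count is a multiple of 4, as the assembly listing implicitly does. The point is that the index update and loop test are paid once per four elements instead of once per element.

```python
def add_scalar_rolled(x, s):
    """One element per iteration, walking from the last index down to 0."""
    x = list(x)
    i = len(x) - 1
    while i >= 0:
        x[i] = x[i] + s
        i -= 1                  # loop overhead paid once per element
    return x


def add_scalar_unrolled4(x, s):
    """Four elements per iteration; assumes len(x) is a multiple of 4."""
    if len(x) % 4 != 0:
        raise ValueError("length must be a multiple of 4 in this sketch")
    x = list(x)
    i = len(x) - 1
    while i >= 0:
        # Four independent load/add groups, using separate temporaries
        # in the same way the listing uses separate registers.
        t0 = x[i] + s
        t1 = x[i - 1] + s
        t2 = x[i - 2] + s
        t3 = x[i - 3] + s
        x[i], x[i - 1], x[i - 2], x[i - 3] = t0, t1, t2, t3
        i -= 4                  # loop overhead paid once per four elements
    return x
```

A production version would add a clean-up loop for lengths that are not a multiple of the unroll factor.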
6
Dynamic ILP
Original                     Renamed
L.D     F0, 0(R1)            L.D     F0, 0(R1)
ADD.D   F4, F0, F2           ADD.D   F4, F0, F2
S.D     F4, 0(R1)            S.D     F4, 0(R1)
DADDUI  R1, R1, #-8          DADDUI  R3, R1, #-8
BNE     R1, R2, Loop         BNE     R3, R2, Loop
L.D     F0, 0(R1)            L.D     F6, 0(R3)
ADD.D   F4, F0, F2           ADD.D   F8, F6, F2
S.D     F4, 0(R1)            S.D     F8, 0(R3)
DADDUI  R1, R1, #-8          DADDUI  R4, R3, #-8
BNE     R1, R2, Loop         BNE     R4, R2, Loop
L.D     F0, 0(R1)            L.D     F10, 0(R4)
ADD.D   F4, F0, F2           ADD.D   F12, F10, F2
S.D     F4, 0(R1)            S.D     F12, 0(R4)
DADDUI  R1, R1, #-8          DADDUI  R5, R4, #-8
BNE     R1, R2, Loop         BNE     R5, R2, Loop
L.D     F0, 0(R1)            L.D     F14, 0(R5)
ADD.D   F4, F0, F2           ADD.D   F16, F14, F2
S.D     F4, 0(R1)            S.D     F16, 0(R5)
DADDUI  R1, R1, #-8          DADDUI  R6, R5, #-8
BNE     R1, R2, Loop         BNE     R6, R2, Loop
7
Dynamic ILP
Original                     Renamed                      Cycle of Issue
L.D     F0, 0(R1)            L.D     F0, 0(R1)            1
ADD.D   F4, F0, F2           ADD.D   F4, F0, F2           3
S.D     F4, 0(R1)            S.D     F4, 0(R1)            6
DADDUI  R1, R1, #-8          DADDUI  R3, R1, #-8          1
BNE     R1, R2, Loop         BNE     R3, R2, Loop         3
L.D     F0, 0(R1)            L.D     F6, 0(R3)            2
ADD.D   F4, F0, F2           ADD.D   F8, F6, F2           4
S.D     F4, 0(R1)            S.D     F8, 0(R3)            7
DADDUI  R1, R1, #-8          DADDUI  R4, R3, #-8          2
BNE     R1, R2, Loop         BNE     R4, R2, Loop         4
L.D     F0, 0(R1)            L.D     F10, 0(R4)           3
ADD.D   F4, F0, F2           ADD.D   F12, F10, F2         5
S.D     F4, 0(R1)            S.D     F12, 0(R4)           8
DADDUI  R1, R1, #-8          DADDUI  R5, R4, #-8          3
BNE     R1, R2, Loop         BNE     R5, R2, Loop         5
L.D     F0, 0(R1)            L.D     F14, 0(R5)           4
ADD.D   F4, F0, F2           ADD.D   F16, F14, F2         6
S.D     F4, 0(R1)            S.D     F16, 0(R5)           9
DADDUI  R1, R1, #-8          DADDUI  R6, R5, #-8          4
BNE     R1, R2, Loop         BNE     R6, R2, Loop         6
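The renaming step can be sketched as a small Python routine. This is a simplified illustration, not the hardware's actual mechanism: the tuple encoding of instructions and the fresh names P0, P1, ... are assumptions, and the numbering differs from the register names used on the slide. Every written register gets a fresh name, and every source reads the most recent name, so only true dependences remain between iterations.

```python
def rename(instrs):
    """Give every written register a fresh name (P0, P1, ...).

    Each instruction is (opcode, dest_or_None, [sources]). Sources are
    rewritten to the most recent name of the register they read.
    """
    latest = {}          # architectural register -> current fresh name
    fresh = 0
    out = []
    for op, dest, srcs in instrs:
        # Look up sources first, so "DADDUI R1, R1" reads the old R1.
        new_srcs = [latest.get(r, r) for r in srcs]
        new_dest = None
        if dest is not None:
            new_dest = f"P{fresh}"
            fresh += 1
            latest[dest] = new_dest
        out.append((op, new_dest, new_srcs))
    return out


# One loop iteration in the assumed tuple encoding (immediates omitted).
ITERATION = [
    ("L.D",    "F0", ["R1"]),
    ("ADD.D",  "F4", ["F0", "F2"]),
    ("S.D",    None, ["F4", "R1"]),
    ("DADDUI", "R1", ["R1"]),
    ("BNE",    None, ["R1", "R2"]),
]
```

Calling rename(ITERATION * 2) shows the second iteration's load reading the name produced by the first iteration's DADDUI, while its F0 and F4 no longer collide with the first iteration's.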
8
Loop Pipeline
[Figure: successive loop iterations, each consisting of L.D, ADD.D, S.D, DADDUI, BNE, laid out as a pipeline]
9
Statically Unrolled Loop
Loop:  L.D     F0, 0(R1)
       L.D     F6, -8(R1)
       L.D     F10, -16(R1)
       L.D     F14, -24(R1)
       L.D     F18, -32(R1)      ADD.D   F4, F0, F2
       L.D     F22, -40(R1)      ADD.D   F8, F6, F2
       L.D     F26, -48(R1)      ADD.D   F12, F10, F2
       L.D     F30, -56(R1)      ADD.D   F16, F14, F2
       L.D     F34, -64(R1)      ADD.D   F20, F18, F2      S.D     F4, 0(R1)
       L.D     F38, -72(R1)      ADD.D   F24, F22, F2      S.D     F8, -8(R1)

                                                           S.D     F12, 16(R1)

                                                           S.D     F16, 8(R1)
       DADDUI  R1, R1, #-32                                S.D
       BNE     R1, R2, Loop                                S.D
10
Static Vs. Dynamic
[Plot: Dynamic ILP; y-axis: new iterations completed (marked at 1), x-axis: cycles]
[Plot: Static ILP; y-axis: new iterations completed (marked at 1), x-axis: cycles]
  • What if I doubled the number of resources in each processor?
  • What if I unrolled the loop and executed it on a dynamic ILP
    processor?

11
Static vs. Dynamic
  • Dynamic: because of the loop index, at most one iteration can start
    every cycle, and even fewer if there are resource constraints. In
    other words, we have a pipeline that has a throughput of one
    iteration per cycle!
  • Static: by eliminating the loop index, each iteration is
    independent → as many iterations can start in a cycle as there are
    resources. However, after a while, we don't start any more
    iterations. Thus, loop unrolling provides a brief steady state,
    where an iteration starts/finishes every cycle, and the rest is
    start-up/wind-down for each unrolled loop

12
Software Pipeline?!
[Figure: successive loop iterations, each consisting of L.D, ADD.D, S.D, DADDUI, BNE, laid out as a pipeline]
13
Software Pipelining
Original loop:
Loop:  L.D     F0, 0(R1)
       ADD.D   F4, F0, F2
       S.D     F4, 0(R1)
       DADDUI  R1, R1, #-8
       BNE     R1, R2, Loop

Software-pipelined loop:
Loop:  S.D     F4, 16(R1)
       ADD.D   F4, F0, F2
       L.D     F0, 0(R1)
       DADDUI  R1, R1, #-8
       BNE     R1, R2, Loop

  • Advantages: achieves nearly the same effect as loop unrolling, but
    without the code expansion. An unrolled loop may have
    inefficiencies at the start and end of each iteration, while a
    sw-pipelined loop is almost always in steady state. A sw-pipelined
    loop can also be unrolled to reduce loop overhead
  • Disadvantages: does not reduce loop overhead, may require more
    registers
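A source-level Python sketch of the same idea follows. The function names are hypothetical, the index ascends rather than descends for readability, and the sketch assumes at least two elements. Each pass through the steady-state loop stores for one original iteration, adds for the next, and loads for the one after that, in the same store/add/load order as the software-pipelined listing.

```python
def add_scalar_simple(x, s):
    """Reference loop: load, add and store one element per iteration."""
    return [v + s for v in x]


def add_scalar_sw_pipelined(x, s):
    """Software-pipelined version; assumes len(x) >= 2."""
    n = len(x)
    if n < 2:
        raise ValueError("this sketch needs at least two elements")
    out = list(x)
    # Start-up: fill the pipeline.
    loaded = out[0]             # load for iteration 0
    summed = loaded + s         # add for iteration 0
    loaded = out[1]             # load for iteration 1
    # Steady state: three steps from three different iterations.
    for i in range(n - 2):
        out[i] = summed         # store for iteration i
        summed = loaded + s     # add for iteration i+1
        loaded = out[i + 2]     # load for iteration i+2
    # Wind-down: drain the pipeline.
    out[n - 2] = summed
    out[n - 1] = loaded + s
    return out
```

The two extra variables that carry values across passes (loaded and summed) are the source-level counterpart of the additional register pressure noted under disadvantages.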

14
Midterm Exam
  • Show up early! Time will be a constraint
  • Attempt all questions; grading will be lenient
  • Open books and notes
