Title: Array Dependence Analysis with the Chains of Recurrences Framework for Loop Optimization
1Array Dependence Analysis with the Chains of
Recurrences Framework for Loop Optimization
- Robert van Engelen
- Florida State University
Also thanks to J. Birch, Y. Shou, and K. Gallivan
2Outline
- Motivation
- Restructuring compilers
- Chains of recurrences algebra and associated algorithms for the GCC and Polaris compilers
- Nonlinear array dependence testing for loop restructuring and vectorization
- Experimental results
- Conclusions
3Motivation
- Intel CTO: the increased power requirements of newer chips will lead to CPUs that are hotter than the surface of the sun by 2010
- Enter multi-core CPUs
- Increase the overall system speed by adding CPU cores
- Speed up multi-threaded applications
- Can effectively lower the power consumption
- Enter (more?) multi-media extensions
- Vector-like instruction sets MMX, SSE, AltiVec
- Speed up multi-media codes, such as JPEG, MPEG
4Code Optimization by Hand or Automatic?
- Rewriting applications by hand to exploit parallelism is doable, if
- Tasks can be identified that run independently, such as a Web browser's rendering and communications tasks
- Coarse-grain parallelism: tasks must have sufficient work
- Rewriting applications by hand to exploit lots of fine-grain parallelism is not doable
- Thousands of read-after-write (RAW), write-after-read (WAR), and write-after-write (WAW) data dependences must be analyzed
5Restructuring Compilers
- A restructuring compiler typically applies source-code transformations automatically to meet various performance enhancement criteria
- Exploit parallelism in loops by reordering the loop structure to run loop iterations in parallel
- Find small loops to replace with vector instructions
- Optimize data locality by reordering code to change memory access order and cache
- All code changes are safe as long as RAW, WAR, and WAW data dependences are preserved!
6Example Loop Fission
S1 DO I = 1, 10
S2   DO J = 1, 10
S3     A(I,J) = B(I,J) + C(I,J)
S4     D(I,J) = A(I,J-1) * 2.0
S5   ENDDO
S6 ENDDO
S3 δ(=,<) S4
- Loop fission splits a single loop into multiple loops
- Allows vectorization and parallelization of the new loops when original loop was sequential
- Loop fission must preserve all dependence relations of the original loop
After loop fission:
S1 DO I = 1, 10
S2   DO J = 1, 10
S3     A(I,J) = B(I,J) + C(I,J)
Sx   ENDDO
Sy   DO J = 1, 10
S4     D(I,J) = A(I,J-1) * 2.0
S5   ENDDO
S6 ENDDO
S3 δ(=,<) S4
After vectorization and parallelization:
S1 PARALLEL DO I = 1, 10
S3   A(I,1:10) = B(I,1:10) + C(I,1:10)
S4   D(I,1:10) = A(I,0:9) * 2.0
S6 ENDDO
S3 δ(=,<) S4
7Loop Fission Algorithm
S1 DO I = 1, 10
S2   A(I) = A(I) + B(I-1)
S3   B(I) = C(I-1)*X + Z
S4   C(I) = 1/B(I)
S5   D(I) = sqrt(C(I))
S6 ENDDO
- Compute the acyclic condensation of the dependence graph to find a legal order of the loops
Dependences: S3 δ(<) S2, S4 δ(<) S3, S3 δ(=) S4, S4 δ(=) S5
After loop fission and vectorization:
S1 DO I = 1, 10
S3   B(I) = C(I-1)*X + Z
S4   C(I) = 1/B(I)
Sx ENDDO
S2 A(1:10) = A(1:10) + B(0:9)
S5 D(1:10) = sqrt(C(1:10))
[Figure: dependence graph over S2, S3, S4, S5 with edges labeled by dependence distance 0 or 1, and its acyclic condensation with nodes {S3, S4}, S2, and S5]
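The condensation step can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the compiler's implementation: the edge list is transcribed from the four dependences of this slide, and the helper names `reachable` and `condensation_order` are hypothetical. Statements that reach each other through dependences form one component (one loop); components are then emitted in topological order.

```python
# Dependence edges (source -> sink) transcribed from the slide (assumption).
NODES = ["S2", "S3", "S4", "S5"]
EDGES = [("S3", "S2"), ("S4", "S3"), ("S3", "S4"), ("S4", "S5")]

def reachable(nodes, edges):
    """Transitive closure: reach[n] is the set of nodes reachable from n."""
    reach = {n: {n} for n in nodes}
    changed = True
    while changed:
        changed = False
        for a, b in edges:
            for n in nodes:
                if a in reach[n] and not reach[b] <= reach[n]:
                    reach[n] |= reach[b]
                    changed = True
    return reach

def condensation_order(nodes, edges):
    """Return the strongly connected components in a legal (topological) order."""
    reach = reachable(nodes, edges)
    comps = []
    for n in nodes:
        # Two statements share a component when each reaches the other.
        comp = frozenset(m for m in nodes if m in reach[n] and n in reach[m])
        if comp not in comps:
            comps.append(comp)
    order, pending = [], list(comps)
    while pending:
        for c in pending:
            # Emit a component that has no predecessor among the pending ones.
            has_pred = any(a in d and b in c
                           for d in pending if d != c for a, b in edges)
            if not has_pred:
                order.append(sorted(c))
                pending.remove(c)
                break
    return order
```

For the graph above this yields the loop over S3 and S4 first, then S2, then S5, which matches the order of the restructured code.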
8Example Loop Interchange
S1 DO I = 1, N
S2   DO J = 1, M
S3     A(I,J) = A(I,J-1) + B(I,J)
S4   ENDDO
S5 ENDDO
S3 δ(=,<) S3
- Changes the loop nesting order
- Allows vectorization of an outer loop and more effective parallelization of an inner loop
- Can be used to improve spatial locality
- Loop interchange must preserve all dependence relations of the original loop
After loop interchange:
S2 DO J = 1, M
S1   DO I = 1, N
S3     A(I,J) = A(I,J-1) + B(I,J)
S4   ENDDO
S5 ENDDO
S3 δ(<,=) S3
After vectorization:
S2 DO J = 1, M
S3   A(1:N,J) = A(1:N,J-1) + B(1:N,J)
S5 ENDDO
S3 δ(<,=) S3
9Loop Interchange Algorithm
S1 DO I = 1, N
S2   DO J = 1, M
S3     DO K = 1, L
S4       A(I+1,J+1,K) = A(I,J,K) + A(I,J+1,K+1)
S5     ENDDO
S6   ENDDO
S7 ENDDO
- Compute the direction matrix and find which columns (and therefore which loops) can be permuted without violating dependence relations in the original loop nest
Dependences: S4 δ(<,<,=) S4, S4 δ(<,=,>) S4
Direction matrix (columns I, J, K):
  <  <  =
  <  =  >
[Figure: one invalid and one valid column permutation of the direction matrix; a permutation is valid only if the leftmost non-= entry of every row remains <]
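The legality check on the direction matrix can be sketched as below. This is an illustrative sketch, assuming the two direction vectors (<,<,=) and (<,=,>) for the I-J-K nest of this slide; the function names are hypothetical. A column permutation is accepted only if no permuted row has > as its leftmost non-= entry.

```python
from itertools import permutations

# Direction matrix for the I-J-K nest (rows are dependences), as an assumption.
DIRECTION_MATRIX = [("<", "<", "="), ("<", "=", ">")]

def is_legal(matrix, perm):
    """True if every row's leftmost non-'=' entry is '<' after permuting columns."""
    for row in matrix:
        permuted = [row[c] for c in perm]
        leading = next((d for d in permuted if d != "="), "=")
        if leading == ">":
            return False
    return True

def legal_permutations(matrix):
    """List all legal loop orders as tuples of original column indices."""
    n = len(matrix[0])
    return [p for p in permutations(range(n)) if is_legal(matrix, p)]
```

With columns numbered I = 0, J = 1, K = 2, the legal orders are (I,J,K), (I,K,J), and (J,I,K); any order that moves K ahead of both I and J exposes the > entry and is rejected.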
10Complications
- Loop restructuring is complicated by
- The presence of several induction variables
- Nonlinear and symbolic array index expressions
- The use of pointer arithmetic instead of arrays in C
- Non-unit loop strides and unstructured loops
- Control flow
- Need loop normalization and preprocessing
- Apply induction variable substitution
- Convert pointer dereferences to array accesses
- Normalize the loop iteration space
11Induction Variable Substitution
Example loop:
  I = 0
  J = 1
  while (I < N)
    I = I+1
    ... = A[J]
    J = J+2
    K = 2*I
    A[K] = ...
  endwhile
After IV substitution (IVS) (note the affine indexes):
  for i = 0 to N-1
    S1: ... = A[2*i+1]
    S2: A[2*i+2] = ...
  endfor
After parallelization:
  forall (i = 0, N-1)
    ... = A[2*i+1]
    A[2*i+2] = ...
  endforall
Dep test: GCD test to solve dependence equation 2*id - 2*iu = -1. Since 2 does not divide 1, there is no data dependence.
[Figure: array A with interleaved reads (R) of A[2*i+1] and writes (W) of A[2*i+2]]
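A minimal sketch of the GCD test used on this slide. The function name `gcd_test` is hypothetical; it only decides whether a linear equation with integer coefficients can have any integer solution, ignoring loop bounds.

```python
from functools import reduce
from math import gcd

def gcd_test(coeffs, rhs):
    """Return True if sum(c_k * x_k) = rhs may have an integer solution.

    The equation is solvable over the integers exactly when the GCD of the
    coefficients divides the right-hand side. Loop bounds are not considered,
    so True means "dependence cannot be ruled out by this test".
    """
    g = reduce(gcd, (abs(c) for c in coeffs))
    if g == 0:
        return rhs == 0
    return rhs % g == 0
```

For the equation 2*id - 2*iu = -1, the call `gcd_test([2, -2], -1)` reports that no integer solution exists, so the accesses are independent.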
12IV Recognition on SSA Forms
[Cytron91, Wolfe92]
  I1 = 3
  M1 = 0
  do
    I2 = φ(I1,I3)
    J1 = φ(?,J3)
    K1 = φ(?,K2)
    L1 = φ(?,L2)
    M2 = φ(M1,M3)
    J2 = 3
    I3 = I2+1
    L2 = M2+1
    M3 = L2+2
    J3 = I3+J2
    K2 = 2*J3
  while (...)
[Figure: spanning tree of the SSA graph]
I2(i) = 3+i
J1(i) = 7+i
L2(i) = 1+3i
K1(i) = 14+2i
M2(i) = 3i
13Symbolic Differencing
[Haghighat95]
Use abstract interpretation to evaluate loop iterations and construct symbolic difference table of the IV values
  do
    x = x+z
    y = z+1
    z = y+1
  while (...)
Iteration | x      | Δx  | Δ²x | y   | Δy | z   | Δz
1         | x+z    |     |     | z+1 |    | z   |
2         | x+2z+2 | z+2 |     | z+3 | 2  | z+2 | 2
3         | x+3z+6 | z+4 | 2   | z+5 | 2  | z+4 | 2
x(i) = x0 + z0*i + (i^2 - i)
y(i) = z0 + 2i + 1
z(i) = z0 + 2i
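The closed forms for x and z can be checked against a direct execution of the loop. This is a sketch, assuming i counts completed iterations (so x(0) = x0 and z(0) = z0); the function names are hypothetical.

```python
def run_loop(x0, z0, iterations):
    """Execute x = x+z; y = z+1; z = y+1 and record (x, z) before each iteration."""
    x, z = x0, z0
    trace = []
    for _ in range(iterations):
        trace.append((x, z))
        x = x + z
        y = z + 1
        z = y + 1
    return trace

def closed_form(x0, z0, i):
    """Closed forms x(i) = x0 + z0*i + (i^2 - i) and z(i) = z0 + 2i."""
    return x0 + z0 * i + (i * i - i), z0 + 2 * i
```

Comparing `run_loop(x0, z0, n)[i]` with `closed_form(x0, z0, i)` for a few starting values is a cheap sanity check on the difference table.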
14Pointer-to-Array Conversion
[vanEngelen01, Franke01]
Lsp_az speech codec segment from ETSI with pointer updates:
  f += 2;
  lsp += 2;
  for (i = 2; i <= 5; i++) {
    *f = f[-2];
    for (j = 1; j < i; j++, f--)
      *f += f[-2]-2*(*lsp)*f[-1];
    *f -= 2*(*lsp);
    f += i;
    lsp += 2;
  }
Lsp_az speech codec segment after pointer-to-array conversion:
  for (i = 0; i <= 3; i++) {
    f[i+2] = f[i];
    for (j = 0; j <= i; j++)
      f[i-j+2] += f[i-j]-2*lsp[2*i+2]*f[i-j+1];
    f[1] -= 2*lsp[2*i+2];
  }
Note that all array index expressions are affine.
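The equivalence of the two versions can be checked by simulation. The sketch below is an assumption-laden transliteration into Python: pointers are modeled as integer offsets (`fp`, `lp`) into lists, the loop bounds are read as i = 2..5 and j = 1..i-1 for the pointer version, and the function names are hypothetical.

```python
def with_pointers(f, lsp):
    """Pointer version: fp and lp model the moving pointers f and lsp."""
    f = list(f)
    fp, lp = 2, 2                      # f += 2; lsp += 2
    for i in range(2, 6):              # i = 2..5
        f[fp] = f[fp - 2]
        for _ in range(1, i):          # j = 1..i-1, with f-- each step
            f[fp] += f[fp - 2] - 2 * lsp[lp] * f[fp - 1]
            fp -= 1
        f[fp] -= 2 * lsp[lp]
        fp += i
        lp += 2
    return f

def with_arrays(f, lsp):
    """Array version with affine index expressions only."""
    f = list(f)
    for i in range(0, 4):              # i = 0..3
        f[i + 2] = f[i]
        for j in range(0, i + 1):      # j = 0..i
            f[i - j + 2] += f[i - j] - 2 * lsp[2 * i + 2] * f[i - j + 1]
        f[1] -= 2 * lsp[2 * i + 2]
    return f
```

Integer test data keeps the comparison exact; f needs at least 6 elements and lsp at least 9.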
15Control-Flow Issues
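- Conditional array accesses and conditionally updated induction variables present problems
Conditionally updated IV (extensive analysis reveals that J = J+3 on both paths):
  do
    K = 3
    K = K+J
    if (...) J = K
    else J = J+3
    ... A[J] ...
  while (J < N)
Conditionally updated IV (problem: J has no single recurrence form):
  DO I = 1,10
    IF (...) J = J+2
    ELSE J = I
    ENDIF
    ... A(J) ...
  ENDDO
Conditional array accesses (assume RAW and WAR dependences):
  for (...)
    if (...) ... A[I] ...
    else ... A[J] ...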
16Chains of Recurrences for Compiler Optimization
- Chains of recurrence forms and algebra can be used to
- Detect (non)linear coupled IVs
- Analyze pointer arithmetic
- Effectively handle control flow
- Implement array dependence testing
17Chains of Recurrences
- A chain of recurrences (CR) represents a polynomial or exponential function or mix evaluated over a unit-distance grid [Zima92]
- Basic form: {init, ⊙, stride}, where ⊙ is + or *
Iteration | {init, ⊙, stride}               | f(i) = 2i+1 = {1, +, 2} | f(i) = 2^i = {1, *, 2}
i = 0     | init                            | 1                       | 1
i = 1     | init ⊙ stride                   | 3                       | 2
i = 2     | init ⊙ stride ⊙ stride          | 5                       | 4
i = 3     | init ⊙ stride ⊙ stride ⊙ stride | 7                       | 8
18Chains of Recurrences: General Formulation
- The key idea is to represent a non-constant CR stride in CR form itself, thereby forming a chain of recurrences
- Example: f(i) = i^2 = {0, +, s(i-1)} = {0, +, 1, +, 2}, where s(i-1) = {1, +, 2}
Iteration | {init, +, s(i-1)}         | s(i) = {1, +, 2} | f(i) = {0, +, s(i-1)}
i = 0     | init                      | 1                | 0
i = 1     | init + s(0)               | 3                | 1
i = 2     | init + s(0) + s(1)        | 5                | 4
i = 3     | init + s(0) + s(1) + s(2) | 7                | 9
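A chain can be evaluated over the grid with one update per coefficient and per iteration. The sketch below assumes a CR is written as a flat Python list `[c0, op1, c1, op2, c2, ...]` with operators "+" or "*"; this encoding and the name `cr_values` are illustrative choices, not notation from the slides.

```python
def cr_values(cr, n):
    """Evaluate the CR [c0, op1, c1, ...] at i = 0, 1, ..., n-1."""
    coeffs = list(cr[0::2])
    ops = list(cr[1::2])
    out = []
    for _ in range(n):
        out.append(coeffs[0])
        # Update left to right so that each coefficient is combined with the
        # value its stride had in the current iteration.
        for k, op in enumerate(ops):
            if op == "+":
                coeffs[k] = coeffs[k] + coeffs[k + 1]
            else:
                coeffs[k] = coeffs[k] * coeffs[k + 1]
    return out
```

For example, `cr_values([0, "+", 1, "+", 2], 4)` produces the squares 0, 1, 4, 9 using additions only.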
19CRs for Expediting Function Evaluations on Grids
- Suppose f(i) = a + b*i + c*i^2 = {a, +, b+c, +, 2c}
- We have two IVs x and y: f(i) = x = {x0, +, y} with x0 = a, and s(i) = y = {y0, +, 2c} with y0 = b+c
- Implement loop to update x and y for efficient evaluation of f(i) over a unit-distance grid i = 0, ..., n
  x = a
  y = b+c
  for i = 0 to n
    f[i] = x
    x = x+y
    y = y+2c
  endfor
20Multi-Dimensional Example
- Let f(i,j) = i^2 + i*j + 1
- Create IV k for f(i,j) in j-loop: f(i,j) = k_j = {p_i, +, r_i}_j with p_i = i^2 + 1 and r_i = i
- Create IVs for p_i and r_i in i-loop: p_i = {p0, +, q_i}_i with p0 = 1; q_i = {q0, +, 2}_i with q0 = 1; r_i = {r0, +, 1}_i with r0 = 0
- Implement k, p, q, and r in i-j-loop nest
  p = 1
  q = 1
  r = 0
  for i = 0 to n
    k = p
    for j = 0 to m
      f[i,j] = k
      k = k+r
    endfor
    p = p+q
    q = q+2
    r = r+1
  endfor
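The loop nest with IVs k, p, q, and r can be run directly and compared against the formula f(i,j) = i^2 + i*j + 1. This is a straightforward Python transliteration of the pseudocode; the function name `fill_with_ivs` is hypothetical.

```python
def fill_with_ivs(n, m):
    """Fill f[i][j] for i = 0..n, j = 0..m using additions only."""
    f = [[0] * (m + 1) for _ in range(n + 1)]
    p, q, r = 1, 1, 0
    for i in range(n + 1):
        k = p                  # k starts at p_i = i^2 + 1
        for j in range(m + 1):
            f[i][j] = k
            k = k + r          # stride r_i = i in the j direction
        p = p + q              # p_i = {1, +, q_i}
        q = q + 2              # q_i = {1, +, 2}
        r = r + 1              # r_i = {0, +, 1}
    return f
```

No multiplication appears inside the nest, which is the point of the CR formulation.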
21CR Construction with the CR Algebra
- To construct the CR form of a symbolic function f(i)
- Replace i with CR {0, +, 1}
- Apply CR algebra rewrite rules (selected rules shown)
- Example: f(i) = c(i+a) = c({0, +, 1} + a) = c{a, +, 1} = {ca, +, c}
  {x, +, y} + c ⇒ {x+c, +, y}
  c{x, +, y} ⇒ {cx, +, cy}
  {x, +, y} + {u, +, v} ⇒ {x+u, +, y+v}
  {x, +, y} * {u, +, v} ⇒ {xu, +, y{u, +, v} + v{x, +, y} + yv}
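The selected rewrite rules can be sketched for pure-sum CRs with numeric coefficients. The list encoding `[c0, c1, ..., ck]` for {c0, +, c1, +, ..., +, ck} and the function names are assumptions for illustration; the product applies the multiplication rule recursively, with the stride of each operand being the tail of its list.

```python
def cr_add(f, g):
    """{x,+,y} + {u,+,v} => {x+u,+,y+v}; a constant is a one-element CR."""
    n = max(len(f), len(g))
    f = f + [0] * (n - len(f))
    g = g + [0] * (n - len(g))
    return [a + b for a, b in zip(f, g)]

def cr_scale(c, f):
    """c{x,+,y} => {cx,+,cy}."""
    return [c * a for a in f]

def cr_mul(f, g):
    """{x,+,y}*{u,+,v} => {xu,+, y*g + v*f + y*v}, with y and v the strides."""
    if len(f) == 1:
        return cr_scale(f[0], g)
    if len(g) == 1:
        return cr_scale(g[0], f)
    df, dg = f[1:], g[1:]
    stride = cr_add(cr_add(cr_mul(df, g), cr_mul(dg, f)), cr_mul(df, dg))
    return [f[0] * g[0]] + stride
```

With i = [0, 1], the slide's example c(i+a) for c = 3 and a = 2 becomes `cr_scale(3, cr_add([0, 1], [2]))`, and `cr_mul([0, 1], [0, 1])` gives the CR of i^2.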
22Loop Analysis with CR Forms
[vanEngelen01]
- The basic idea
- Scan the loop to detect IV updates
- Construct the CR form for each IV using the CR algebra
  do
    J = J+I
    I = I+3
    P = 2*P
  while (...)
CR forms:
  J = {J0, +, I} ⇒ J = {J0, +, I0, +, 3}
  I = {I0, +, 3}
  P = {P0, *, 2}
23Algorithm 1 Find Recurrences
- Input: loop L with live variable information
- Output: set S of recurrence relations of IVs
- Start with set S = {⟨v, v⟩ | v is live at loop header}
- Search L from bottom to top: for each assignment v = x of expression x to scalar variable v, update tuples ⟨u, y⟩ in S by replacing v in y with x
Loop L (with step numbers):
  do
    M = 2          (step 5)
    L = J-H        (step 4)
    J = L+M        (step 3)
    K = K+M*I      (step 2)
    I = I+1        (step 1)
  while (...)
Changes to S:
  S0 = {⟨H, H⟩, ⟨I, I⟩, ⟨J, J⟩, ⟨K, K⟩}
  S1 = {⟨H, H⟩, ⟨I, I+1⟩, ⟨J, J⟩, ⟨K, K⟩}
  S2 = {⟨H, H⟩, ⟨I, I+1⟩, ⟨J, J⟩, ⟨K, K+M*I⟩}
  S3 = {⟨H, H⟩, ⟨I, I+1⟩, ⟨J, L+M⟩, ⟨K, K+M*I⟩}
  S4 = {⟨H, H⟩, ⟨I, I+1⟩, ⟨J, J-H+M⟩, ⟨K, K+M*I⟩}
  S5 = {⟨H, H⟩, ⟨I, I+1⟩, ⟨J, J-H+2⟩, ⟨K, K+2*I⟩}
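The bottom-to-top substitution of Algorithm 1 can be sketched with plain string rewriting. This is an illustrative sketch only: expressions are kept as strings, substitution is a whole-word textual replacement with added parentheses, no simplification is performed, and the names `find_recurrences` and `LOOP` are hypothetical.

```python
import re

# Loop body of the example, top to bottom, as (variable, expression) pairs.
LOOP = [("M", "2"), ("L", "J-H"), ("J", "L+M"), ("K", "K+M*I"), ("I", "I+1")]

def find_recurrences(statements, live):
    """Return {v: expression for v after one iteration} for each live variable."""
    S = {v: v for v in live}                     # start with the tuples <v, v>
    for var, expr in reversed(statements):       # search from bottom to top
        pattern = re.compile(r"\b%s\b" % re.escape(var))
        for u in S:
            # Replace var in the right-hand side of <u, y> with (expr).
            S[u] = pattern.sub(lambda _m: "(%s)" % expr, S[u])
    return S
```

The resulting strings are unsimplified (for example `((J-H)+(2))` for J), but they denote the same recurrences as the set S5 of the example.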
24Algorithm 2 Compute CR Forms
- Input: set S with recurrence relations
- Output: CR forms for IVs in S
- For each relation ⟨v, x⟩ in S do
  if x is of the form v then v = v0 (v is loop invariant)
  if x is of the form v + y then v = {v0, +, y}
  if x is of the form v * y then v = {v0, *, y}
  if x does not contain v then v = {v0, #, y} (v is wrap around)
- Simplify the CR forms with the CR algebra rewrite rules
Recurrence relation in S | CR form           | Simplified CR form
⟨H, H⟩                   | H = H0            | H = H0
⟨I, I+1⟩                 | I = {I0, +, 1}    | I = {I0, +, 1}
⟨J, J-H+2⟩               | J = {J0, +, 2-H}  | J = {J0, +, 2-H0}
⟨K, K+2*I⟩               | K = {K0, +, 2*I}  | K = {K0, +, 2*I0, +, 2}
25Algorithm 3 Solve
- Input: CR forms for IVs
- Output: closed-form solutions for IVs (when possible)
- For each CR form of v apply the CR inverse algebra, assuming loop is normalized for i = 0, ..., n
- Certain exotic mixed non-polynomial and non-exponential CR forms may not have closed forms
Loop L:
  do
    M = 2
    L = J-H
    J = L+M
    K = K+M*I
    I = I+1
  while (...)
Simplified CR form       | Closed form
J = {J0, +, 2-H0}        | J(i) = J0 + (2-H0)*i
K = {K0, +, 2*I0, +, 2}  | K(i) = K0 + i^2 + (2*I0-1)*i
I = {I0, +, 1}           | I(i) = I0 + i
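The closed forms can be validated by running the loop and comparing values at the start of each iteration. A small sketch, assuming i = 0 denotes the state on loop entry; the function names are hypothetical.

```python
def simulate(H0, I0, J0, K0, n):
    """Run the loop body n times and record (I, J, K) before each iteration."""
    H, I, J, K = H0, I0, J0, K0
    trace = []
    for _ in range(n):
        trace.append((I, J, K))
        M = 2
        L = J - H
        J = L + M
        K = K + M * I
        I = I + 1
    return trace

def closed(H0, I0, J0, K0, i):
    """Closed forms I(i), J(i), K(i) from the table."""
    return (I0 + i,
            J0 + (2 - H0) * i,
            K0 + i * i + (2 * I0 - 1) * i)
```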
26Example 1
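Loop L (with step numbers):
  x = 2
  z = 0
  do
    A(x) = A(z)
    x = x+z        (step 3)
    y = z+1        (step 2)
    z = y+1        (step 1)
  while (z < N)
Changes to S:
  S0 = {⟨x, x⟩, ⟨z, z⟩}
  S1 = {⟨x, x⟩, ⟨z, y+1⟩}
  S2 = {⟨x, x⟩, ⟨z, z+2⟩}
  S3 = {⟨x, x+z⟩, ⟨z, z+2⟩}
CR form: x = {x0, +, z}, z = {z0, +, 2}
Closed form: x(i) = x0 + z0*i + i^2 - i, z(i) = z0 + 2*i
After IV substitution:
  do i = 0, 2N-2
    A(i*i-i+2) = A(2*i)
  end do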
27Example 2
TRFD code segment from Perfect Benchmark with IV updates:
  DO I = 1,M
    DO J = 1,I
      ij = ij+1
      ijkl = ijkl+I-J+1
      DO K = I+1,M
        DO L = 1,K
          ijkl = ijkl+1
          xijkl(ijkl) = xkl(L)
        ENDDO
      ENDDO
      ijkl = ijkl+ij+left
    ENDDO
  ENDDO
TRFD after aggressive induction variable substitution (IVS):
  DO I = 0,M-1
    DO J = 0,I
      DO K = 0,M-I-2
        DO L = 0,I+K+1
          tmp = ijkl+L+I*(K+(M+M*M+2*left+6)/4)+J*(left+(M+M*M)/2)+((I*I*M*M)+2*(K*K+3*K+I*I*(left+1))+M*I*I)/4+2
          xijkl(tmp) = xkl(L+1)
        ENDDO
      ENDDO
    ENDDO
  ENDDO
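The substituted index expression can be cross-checked against the original IV updates by simulation. The sketch below makes several assumptions: all lost operators in the expression are read as + and *, ij and ijkl are 0 on entry, and the divisions are exact rational divisions (hence `Fraction`), since the individual quotients are not always integers. The function names are hypothetical.

```python
from fractions import Fraction

def original(M, left):
    """Original nest: return the sequence of (xijkl index, xkl index) pairs."""
    ij = ijkl = 0
    seq = []
    for I in range(1, M + 1):
        for J in range(1, I + 1):
            ij += 1
            ijkl += I - J + 1
            for K in range(I + 1, M + 1):
                for L in range(1, K + 1):
                    ijkl += 1
                    seq.append((ijkl, L))
            ijkl += ij + left
    return seq

def substituted(M, left):
    """Normalized nest with the closed-form index tmp (ijkl fixed at 0)."""
    ijkl = 0
    seq = []
    for I in range(0, M):
        for J in range(0, I + 1):
            for K in range(0, M - I - 1):
                for L in range(0, I + K + 2):
                    tmp = (ijkl + L
                           + I * (K + Fraction(M + M * M + 2 * left + 6, 4))
                           + J * (left + Fraction(M + M * M, 2))
                           + Fraction(I * I * M * M
                                      + 2 * (K * K + 3 * K + I * I * (left + 1))
                                      + M * I * I, 4)
                           + 2)
                    seq.append((tmp, L + 1))
    return seq
```

Both functions enumerate the same access sequence for small M, which supports the reading of the expression; it is not a proof for all M.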
28Example 3 (SSA)
Source loop:
  a = 1
  while (a < 10)
    x = a+2
    a = a+1
SSA form:
  a0 = 1
  if (a0 >= 10) goto L2
  L1: a1 = φ(a0, a2)
      x0 = a1+2
      a2 = a1+1
      if (a2 < 10) goto L1
  L2:
a1 = {1, +, 1}
GCC 4.x uses our approach applied to SSA form. Note: GCC developers refer to CRs as scalar evolutions.
29Example 4 (SSA)
Source loop:
  x = 0
  i = 1
  while (i < 10)
    x = x+i
    i = i+1
SSA form:
  x0 = 0
  i0 = 1
  if (i0 >= 10) goto L2
  L1: x1 = φ(x0, x2)
      i1 = φ(i0, i2)
      x2 = x1+i1
      i2 = i1+1
      if (i2 < 10) goto L1
  L2:
i1 = {1, +, 1}
x1 = {0, +, i1} = {0, +, 1, +, 1}
30Example 5 (SSA)
Source loop:
  j = 0
  i = 1
  while (i < 10)
    if (p)
      j = j+2
    else
      j = j+3
    i = i+1
SSA form:
  j0 = 0
  i0 = 1
  if (i0 >= 10) goto L2
  L1: i1 = φ(i0, i2)
      j1 = φ(j0, j4)
      if (!p) goto L3
      j2 = j1+2
      goto L4
  L3: j3 = j1+3
  L4: j4 = φ(j2, j3)
      i2 = i1+1
      if (i2 < 10) goto L1
  L2:
{0, +, 2} <= j1 <= {0, +, 3}
31Recognizing Mixed Functional Forms and Reductions
Factorial:
  Loop L:
    I = 1
    do
      F = F*I
      I = I+1
    while (...)
  Simplified CR form: F = {F0, *, 1, +, 1}, I = {1, +, 1}
  Closed form: F = F0 * i!
Reduction:
  Loop L:
    I = 0
    S = 0
    do
      S = S+A[I]
      I = I+2
    while (...)
  Simplified CR form: S = {0, +, A[{0, +, 2}]}, I = {0, +, 2}
  Closed form: S = Σ A[2i]
32Pointer Access Descriptions of Pointer and Array
References
- A pointer access description (PAD) [vanEngelen01] is a CR form of a pointer or array reference in a loop nest
- PADs are computed with the CR-based IV algorithms
  short a[...], *p;
  int i;
  p = a;
  for (i = 0; ; i++)
Loop code    | PAD               | Sequence
a[i]         | {a, +, 1}         | a[0], a[1], a[2], a[3]
a[2*i+1]     | {a+1, +, 2}       | a[1], a[3], a[5], a[7]
a[(i*i-i)/2] | {a, +, 0, +, 1}   | a[0], a[0], a[1], a[3]
a[1<<i]      | {a+1, +, 1, *, 2} | a[1], a[2], a[4], a[8]
*p++         | {a, +, 1}         | a[0], a[1], a[2], a[3]
p += i       | {a, +, 0, +, 1}   | a[0], a[0], a[1], a[3]
33CR-Enhanced Array Dependence Testing
- Basic idea: construct dependence equations in CR form for both pointer and array accesses
- Determine the solution intervals by computing the value ranges of the equations in CR form
- If the solution space is empty, there is no dependence
34Example
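  float a[...], *p, *q;
  p = a;
  q = a+2*n;
  for (i = 0; i < n; i++) {
    t = *p;
S:  *p++ = *q;
    *q-- = t;
  }
p = {a, +, 1}
q = {a+2n, +, -1}
Dependence equation: {a, +, 1}_id = {a+2n, +, -1}_iu
Constraints: 0 <= id <= n-1, 0 <= iu <= n-1
Rewrite dependence equation:
  {a, +, 1}_id = {a+2n, +, -1}_iu
  ⇒ {a, +, 1}_id - {a+2n, +, -1}_iu = 0
  ⇒ {{-2n, +, 1}_iu, +, 1}_id = 0
Compute solution interval:
  Low[{{-2n, +, 1}_iu, +, 1}_id] = Low[{-2n, +, 1}_iu] = -2n
  Up[{{-2n, +, 1}_iu, +, 1}_id] = Up[{-2n, +, 1}_iu + n-1] = -2n + 2n - 2 = -2
The interval [-2n, -2] does not contain 0: no dependence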
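The interval argument can be reproduced numerically for concrete n. This sketch assumes the accesses are p = a + id and q = a + 2n - iu with 0 <= id, iu <= n-1, so the dependence equation is id - (2n - iu) = 0; the function names are hypothetical, and the brute-force search is only there to confirm the interval reasoning.

```python
def solution_interval(n):
    """Bounds of -2n + iu + id over 0 <= id, iu <= n-1.

    Both strides are +1, so the expression is increasing in id and iu:
    the minimum is at id = iu = 0 and the maximum at id = iu = n-1.
    """
    low = -2 * n
    up = -2 * n + 2 * (n - 1)
    return low, up

def may_depend(n):
    """A dependence is possible only if 0 lies inside the solution interval."""
    low, up = solution_interval(n)
    return low <= 0 <= up

def brute_force(n):
    """Exhaustively search for iterations where p and q touch the same element."""
    return any(i_d == 2 * n - i_u for i_d in range(n) for i_u in range(n))
```

For every n >= 1 the upper bound is -2, so the interval never contains 0, in agreement with the exhaustive search.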
35Determining the Value Range of a CR Form
- Suppose x(i) = {x0, +, s(i-1)} for i = 0, ..., n
- If s(i-1) > 0 then x(i) is monotonically increasing
- If s(i-1) < 0 then x(i) is monotonically decreasing
- If a function is monotonic on its domain, then it is trivial to find its exact value range
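A sketch of the range computation for pure-sum CRs with numeric coefficients. It uses a sufficient (not necessary) monotonicity condition: if every coefficient of the stride CR has the same sign, the stride keeps that sign for all i >= 0. The list encoding `[c0, c1, ..., ck]` for {c0, +, c1, +, ..., +, ck} and the function names are assumptions for illustration.

```python
from math import comb

def cr_value(coeffs, i):
    """Closed form of {c0,+,c1,+,...,+,ck} at i: sum of c_k * C(i, k)."""
    return sum(c * comb(i, k) for k, c in enumerate(coeffs))

def cr_range(coeffs, n):
    """Exact (min, max) over i = 0..n when monotonicity is evident, else None."""
    strides = coeffs[1:]
    if all(c >= 0 for c in strides):       # stride >= 0: non-decreasing
        return cr_value(coeffs, 0), cr_value(coeffs, n)
    if all(c <= 0 for c in strides):       # stride <= 0: non-increasing
        return cr_value(coeffs, n), cr_value(coeffs, 0)
    return None                            # sign of the stride is not evident
```

When the function returns None, a real test would need a finer analysis, for example splitting the iteration range where the stride changes sign.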
36Example Nonlinear and Symbolic Dependence Testing
  float a[...], *p, *q;
  p = q = a;
  for (i = 0; i < n; i++) {
    for (j = 0; j <= i; j++)
      *q += *++p;
    q++;
  }

  DO I = 1, M+1
S1  A[I*N+10] = ...
S2  ... = A[2*I+K]
    K = 2*K+N
  ENDDO
S1 AN10, , Ni S2 AK02N, , K0 N2,
, 2i
p = {{a+1, +, 1, +, 1}_i, +, 1}_j = a + (i^2+i)/2 + j + 1
q = {a, +, 1}_i = a + i
CR range test disproves dependence when KN > 10 and K > 2
CR dep. test disproves flow dependence (<, <)
37Results
- Implemented a CR-enhanced trapezoidal Banerjee test
- Relatively simple test
- Enhanced with support for nonlinear forms
- Enhanced with support for conditional flow
- Construct dependence equations in CR form
- Implementation based on the Polaris compiler
- Pros: can compare to powerful dependence tests such as Omega and Range test
- Cons: Fortran only
38Additional Independences Filtered over Omega Test
LAPACK
Perf. Benchmark
39Additional Independences Filtered over Range Test
40Additional Independences Filtered over OmegaRange
41Percentage of Conditional IVs w/o Closed Forms in
LAPACK
42Timing Comparison Perf Bench.
43Timing Comparison LAPACK
44Conclusions
- A CR-based compiler framework has advantages
- Applicable to CFG, AST, and SSA forms
- Handles conditional flow
- Handles nonlinear and symbolic induction variable expressions
- Allows array and pointer-based dependence testing to be applied directly to the CR forms without induction variable substitution
- Future work
- Improve GCC implementation
- Enhance other dependence tests with CR forms
45Further Reading
- Robert van Engelen, Johnnie Birch, Yixin Shou, Burt Walsh, and Kyle Gallivan, "A Unified Framework for Nonlinear Dependence Testing and Symbolic Analysis", in the proceedings of the ACM International Conference on Supercomputing (ICS), 2004, pages 106-115.
- Robert van Engelen, Johnnie Birch, and Kyle Gallivan, "Array Dependence Testing with the Chains of Recurrences Algebra", in the proceedings of the IEEE International Workshop on Innovative Architectures for Future Generation High-Performance Processors and Systems (IWIA), January 2004, pages 70-81.
- Robert van Engelen and Kyle Gallivan, "An Efficient Algorithm for Pointer-to-Array Access Conversion for Compiling and Optimizing DSP Applications", in proceedings of the 2001 International Workshop on Innovative Architectures for Future Generation High-Performance Processors and Systems (IWIA), January 2001, pages 80-89.
- Robert van Engelen, "Efficient Symbolic Analysis for Optimizing Compilers", in proceedings of the International Conference on Compiler Construction, ETAPS 2001, LNCS 2027, pages 118-132.
46The End