Title: Introducing Control Flow into Vectorized Code
1Introducing Control Flow into Vectorized Code
- PACT'07, September 19, 2007
2Outline
- Introduction
- Automatic vectorization
- Vectorization in the presence of control flow
- Nested control flow in vector codes
- Experimental results
- Conclusion
3The Fastest Computer BG/L
Taken from top500.org
4Levels of Parallelism
5Scalar vs. SIMD Operation
[Figure: scalar operation add r1,r2,r3 with operand values 1, 2, and 3]
6Automatic Vectorization (ICC, GCC, XLC, ...)
for (i=0; i<1024; i++) C[i] = A[i]+B[i];
for (i=0; i<1024; i+=4) C[i:i+3] = A[i:i+3]+B[i:i+3];
7Issues
- Control flow
- Vectorization in the presence of complex control flow
- Overhead of executing all control paths
- Function calls
- Function calls are not vectorized.
- Function calls prevent vectorization.
- Alignment
- Aligned: 0, 16, 32, ...
- Misaligned: 4, 20, 36, ...
- Unaligned: 0, 8, 16, ...
- Etc.
- Nonunit strides
- Dependences
2005
2004,2007
8Vectorization in the Presence of Control Flow
(2005)
IN
for (i=0; i<1024; i++) {
  if (A[i]>0) C[i] = B[i];
  else D[i] = D[i-1];
}
If-conversion
for (i=0; i<1024; i++) {
  P = A[i]>0; NP = !P;
  C[i] = B[i]; <P>
  D[i] = D[i-1]; <NP>
}
Vectorization
for (i=0; i<1024; i+=4) {
  vP = A[i:i+3]>(0,0,0,0); vNP = vec_not(vP);
  C[i:i+3] = B[i:i+3]; <vP>
  (NP1,NP2,NP3,NP4) = vNP;
  D[i] = D[i-1]; <NP1>
  D[i+1] = D[i]; <NP2>
  D[i+2] = D[i+1]; <NP3>
  D[i+3] = D[i+2]; <NP4>
}
Remove vector predicates
for (i=0; i<1024; i+=4) {
  vP = A[i:i+3]>(0,0,0,0); vNP = vec_not(vP);
  C[i:i+3] = vec_sel(C[i:i+3], B[i:i+3], vP);
  (NP1,NP2,NP3,NP4) = vNP;
  D[i] = D[i-1]; <NP1>
  D[i+1] = D[i]; <NP2>
  D[i+2] = D[i+1]; <NP3>
  D[i+3] = D[i+2]; <NP4>
}
Remove scalar predicates (OUT)
for (i=0; i<1024; i+=4) {
  vP = A[i:i+3]>(0,0,0,0); vNP = vec_not(vP);
  C[i:i+3] = vec_sel(C[i:i+3], B[i:i+3], vP);
  (NP1,NP2,NP3,NP4) = vNP;
  if (NP1) D[i] = D[i-1];
  if (NP2) D[i+1] = D[i];
  if (NP3) D[i+2] = D[i+1];
  if (NP4) D[i+3] = D[i+2];
}
9Branch-On-Superword-Condition-Code (BOSCC)
branch-on-none( src ) <target label>
branch-on-all( src ) <target label>
10BOSCC by Example (2004)
for (i=0; i<16; i++)
  if (a[i] != 0) b[i]++;
vectorization
for (i=0; i<16; i+=4) {
  pred = a[i:i+3] != (0,0,0,0);
  old = b[i:i+3];
  new = old + (1,1,1,1);
  b[i:i+3] = SELECT(old, new, pred);
}
Bypass vector instructions
for (i=0; i<16; i+=4) {
  pred = a[i:i+3] != (0,0,0,0);
  branch-on-none(pred) L1;
  old = b[i:i+3];
  new = old + (1,1,1,1);
  b[i:i+3] = SELECT(old, new, pred);
  L1:
}
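The bypass can be emulated in portable C as a rough sketch. It assumes a vector length of four; none_set is a hypothetical stand-in for the branch-on-none test, and the return value (the number of bypassed vector iterations) is added here only to make the effect observable.

```c
#include <assert.h>

#define VL 4 /* assumed vector length */

/* Hypothetical stand-in for branch-on-none: 1 when no lane of pred is set. */
static int none_set(const int *pred) {
    for (int k = 0; k < VL; k++)
        if (pred[k]) return 0;
    return 1;
}

/* Increments b[i] wherever a[i] != 0 and returns how many vector
   iterations skipped the load/add/select sequence. */
int increment_nonzero_boscc(const int *a, int *b, int n) {
    assert(n % VL == 0); /* the sketch handles whole vectors only */
    int bypassed = 0;
    for (int i = 0; i < n; i += VL) {
        int pred[VL], old[VL], upd[VL];
        for (int k = 0; k < VL; k++) pred[k] = a[i + k] != 0;
        if (none_set(pred)) { /* branch-on-none(pred) L1 */
            bypassed++;
            continue;
        }
        for (int k = 0; k < VL; k++) old[k] = b[i + k];
        for (int k = 0; k < VL; k++) upd[k] = old[k] + 1;
        for (int k = 0; k < VL; k++) /* SELECT(old, new, pred) */
            b[i + k] = pred[k] ? upd[k] : old[k];
    }
    return bypassed;
}
```

When every element of a four-element group of a is zero, the whole vector body is skipped, which is the saving the BOSCC is meant to provide.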
11Existing Approach: single-level BOSCCs
[Figure: scalar CFG and the corresponding vector CFG]
12Our Approach: nesting BOSCCs (2007)
[Figure: scalar CFG and the corresponding vector CFG]
13Key Idea: Inverse-Implies
- Definition: A inverse-implies B iff ¬A → ¬B, that is, B is false whenever A is false.
- A region of instructions guarded by one vector predicate (B) can be nested inside another (A) if A inverse-implies B.
branch-on-none(A) LA
  ...
LA:
branch-on-none(B) LB
  ...
LB:
Nest BOSCC regions!
branch-on-none(A) LA
  BOSCC region A
  BOSCC region B
LA:
Nest BOSCCs!
branch-on-none(A) LA
  BOSCC region A
  branch-on-none(B) LB
    BOSCC region B
  LB:
LA:
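A small emulation may make the nesting concrete. This sketch assumes predicate A is x > 0 and predicate B is x > 9, so B can be true only where A is true; the regions (incrementing y, setting z) and the names nested_boscc and boscc_stats are hypothetical. The counters record how often each BOSCC test is executed.

```c
#include <assert.h>

#define VL 4 /* assumed vector length */

typedef struct { int tests_a; int tests_b; } boscc_stats;

/* Hypothetical stand-in for branch-on-none: 1 when no lane of pred is set. */
static int none_set(const int *pred) {
    for (int k = 0; k < VL; k++)
        if (pred[k]) return 0;
    return 1;
}

/* Region A (guarded by x > 0) increments y; region B (guarded by x > 9)
   sets z to 1. The BOSCC for B is nested inside the region for A. */
boscc_stats nested_boscc(const int *x, int *y, int *z, int n) {
    boscc_stats s = {0, 0};
    assert(n % VL == 0); /* the sketch handles whole vectors only */
    for (int i = 0; i < n; i += VL) {
        int pA[VL], pB[VL];
        for (int k = 0; k < VL; k++) pA[k] = x[i + k] > 0;
        s.tests_a++;
        if (none_set(pA)) continue; /* branch-on-none(A) LA */
        for (int k = 0; k < VL; k++) /* BOSCC region A */
            y[i + k] = pA[k] ? y[i + k] + 1 : y[i + k];
        for (int k = 0; k < VL; k++) pB[k] = pA[k] && x[i + k] > 9;
        s.tests_b++;
        if (!none_set(pB)) { /* branch-on-none(B) LB */
            for (int k = 0; k < VL; k++) /* BOSCC region B */
                z[i + k] = pB[k] ? 1 : z[i + k];
        }
    }
    return s;
}
```

In a vector iteration where no element satisfies A, neither region runs and the test for B is never executed.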
14A Inverse-Implies B if ...
- A is an ancestor of B: all conditions used to compute B are derived directly or indirectly from A.
- B implies A: a subset of the conditions used to compute B is used to compute A.
C code
if (b[i]<0) {
  b[i] = 0;
  if (a[i]>9) { c[i] = 1; goto L1; }
} else {
  b[i] = 1;
  if (a[i]>0)
L1: c[i] = 2;
}
[Figure: CFG of the C code with T/F branch edges, and its Inverse-Implies Graph]
- p2 is an ancestor of p5.
- p5 implies p6.
Ph.D. dissertation, Scott Mahlke, 1995
15Algorithm
- Collect the instructions guarded by the same or the descendant predicate so that they are textually contiguous.
- Post-DFS order of II-graph
- Insert BOSCCs
- Reverse post-DFS order of II-graph
- Profitability of a BOSCC based on conditional probability of the nested predicate being true with respect to the nesting predicate
branch-on-none(A) LA
  BOSCC region A
  branch-on-none(B) LB
    BOSCC region B
  LB:
LA:
P(B|A) = P(A∩B)/P(A) = P(B)/P(A), since B is true only when A is true
ex) P(A) = 10%, P(B) = 10%: P(B|A) = 100%
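The conditional probability in the example can be computed from profile counts, as in the sketch below. The pred_profile record and cond_prob_b_given_a are hypothetical names. The sketch assumes B is true only when A is true, and it does not attempt to model how the result feeds the profitability decision, which the slide does not spell out.

```c
#include <assert.h>

/* Hypothetical profile record: how many profiled iterations there were,
   and in how many of them predicate A (or B) was true. */
typedef struct { long total; long a_true; long b_true; } pred_profile;

/* P(B|A), assuming B can be true only when A is true, so that
   P(A and B) = P(B). Returns 0.0 when A was never true. */
double cond_prob_b_given_a(pred_profile p) {
    assert(p.b_true <= p.a_true && p.a_true <= p.total);
    if (p.a_true == 0) return 0.0;
    return (double)p.b_true / (double)p.a_true;
}
```

With 100 profiled iterations in which A and B are each true 10 times, the function returns 1.0, matching the slide's example.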
16Nesting Over Single-Level BOSCCs
17vec_any_*
if (a[i] > 0)
Previous approach
pT = vec_cmpgt(a[i:i+3], (0,0,0,0));
if (vec_any_ne(pT, (0,0,0,0)))
Our approach
if (vec_any_gt(a[i:i+3], (0,0,0,0)))
  pT = vec_cmpgt(a[i:i+3], (0,0,0,0));
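The difference between the two orderings can be seen in a plain-C emulation of "our approach". It assumes a vector length of four; any_gt_zero and cmpgt_zero are hypothetical stand-ins for vec_any_gt and vec_cmpgt against a zero vector (with 1/0 lanes rather than all-ones masks), and the guarded operation, zeroing positive elements, is chosen only for illustration.

```c
#include <assert.h>

#define VL 4 /* assumed vector length */

/* Hypothetical stand-in for vec_any_gt(v, (0,0,0,0)). */
static int any_gt_zero(const int *v) {
    for (int k = 0; k < VL; k++)
        if (v[k] > 0) return 1;
    return 0;
}

/* Hypothetical stand-in for vec_cmpgt(v, (0,0,0,0)); lanes are 1 or 0. */
static void cmpgt_zero(const int *v, int *mask) {
    for (int k = 0; k < VL; k++) mask[k] = v[k] > 0;
}

/* Sets every positive element of a to zero. The predicate vector pT is
   built only after the any-test succeeds. Returns how many times pT
   was built. */
int zero_positives(int *a, int n) {
    assert(n % VL == 0); /* the sketch handles whole vectors only */
    int masks_built = 0;
    for (int i = 0; i < n; i += VL) {
        int pT[VL];
        if (!any_gt_zero(&a[i])) continue; /* if (vec_any_gt(...)) */
        cmpgt_zero(&a[i], pT);             /* pT = vec_cmpgt(...)  */
        masks_built++;
        for (int k = 0; k < VL; k++)
            a[i + k] = pT[k] ? 0 : a[i + k];
    }
    return masks_built;
}
```

In the previous approach the compare would run in every iteration before the vec_any_ne test; here it runs only in iterations that are not bypassed.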
18vec_any_* Over vec_any_ne Only
19Predicate Restoring vs. Predicate Preserving
20Predicate Restoring Over Predicate Preserving
21Overall Speedups Over the Scalar Baseline
22Conclusion
- By nesting BOSCCs,
- we can bypass multiple predicate regions by a single BOSCC, and
- even the associated BOSCCs are bypassed.
- Experiments on 14 benchmarks show speedups of 11.94 over the previous BOSCC generation technique.
- A significant portion of the vectorization overhead in the presence of arbitrarily complex control flow is eliminated.
- Future work
- Whole function vectorization
- Look for more applications
23Questions ?
24Why SIMD ?
- Simple design
- Replicated functional units
- Small die area
- No heavily ported register files
- Die area
- MAX-2(HP) 0.1
- VIS(Sun) 3.0
- More parallelism
- When parallelism is abundant
- SIMD in addition to ILP
- Must be explicitly exposed to the hardware
- By the compiler or by the programmer
Low cost
High performance
25Programming SIMD Units
- Language extension: no standard!
- Programming interface similar to function call
- C built-in functions, Fortran intrinsics
- Most native compilers support SIMD intrinsics of the machine
- AltiVec: dst = vec_add(src1, src2)
- SSE2: dst = _mm_add_ps(src1, src2)
- BG/L: dst = __fpadd(src1, src2)
- GCC: -faltivec, -msse2
- Library calls: not portable (as is)!
- BG/L: MASS, MASSV, ESSL
- Ex: y = sqrt(x) vs. vsqrt(Y,X,500)
- Need automatic tools.
26Approach 1: Loop-Level Automatic Vectorization
for (i=0; i<1024; i++) C[i] = A[i]+B[i];
Strip mining
for (i=0; i<1024; i+=4)
  for (ii=0; ii<4; ii++) C[i+ii] = A[i+ii]+B[i+ii];
Loop distribution
for (i=0; i<1024; i+=4) {
  for (ii=0; ii<4; ii++) tA[ii] = A[i+ii];
  for (ii=0; ii<4; ii++) tB[ii] = B[i+ii];
  for (ii=0; ii<4; ii++) tC[ii] = tA[ii]+tB[ii];
  for (ii=0; ii<4; ii++) C[i+ii] = tC[ii];
}
Code generation
for (i=0; i<1024; i+=4) {
  vA = vec_ld(&A[i]);
  vB = vec_ld(&B[i]);
  vC = vec_add(vA,vB);
  vec_st(vC,&C[i]);
}
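The strip-mined and distributed form can be written as ordinary C and compared against the original loop. This is a sketch assuming a vector length of four and a trip count that is a multiple of four; the function names are hypothetical, and each inner loop marks the spot where code generation would emit one vector instruction.

```c
#include <assert.h>

#define VL 4 /* assumed vector length */

/* Original scalar loop. */
void add_scalar(const float *A, const float *B, float *C, int n) {
    for (int i = 0; i < n; i++) C[i] = A[i] + B[i];
}

/* After strip mining and loop distribution: each inner loop corresponds
   to one vector instruction in the generated code. */
void add_strip_mined(const float *A, const float *B, float *C, int n) {
    assert(n % VL == 0); /* the sketch handles whole vectors only */
    for (int i = 0; i < n; i += VL) {
        float tA[VL], tB[VL], tC[VL];
        for (int ii = 0; ii < VL; ii++) tA[ii] = A[i + ii];        /* vec_ld  */
        for (int ii = 0; ii < VL; ii++) tB[ii] = B[i + ii];        /* vec_ld  */
        for (int ii = 0; ii < VL; ii++) tC[ii] = tA[ii] + tB[ii];  /* vec_add */
        for (int ii = 0; ii < VL; ii++) C[i + ii] = tC[ii];        /* vec_st  */
    }
}
```

Both functions perform the same additions in the same element order, so their outputs match exactly.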
27Approach 2: Basic Block-Level Automatic Vectorization
for (i=0; i<1024; i++) C[i] = A[i]+B[i];
unrolling
for (i=0; i<1024; i+=4) {
  sA0 = A[i+0]; sB0 = B[i+0]; sC0 = sA0+sB0; C[i+0] = sC0;
  ...
  sA3 = A[i+3]; sB3 = B[i+3]; sC3 = sA3+sB3; C[i+3] = sC3;
}
Packing isomorphic statements
for (i=0; i<1024; i+=4) {
  (sA0,sA1,sA2,sA3) = A[i:i+3];
  (sB0,sB1,sB2,sB3) = B[i:i+3];
  (sC0,sC1,sC2,sC3) = (sA0,sA1,sA2,sA3)+(sB0,sB1,sB2,sB3);
  C[i:i+3] = (sC0,sC1,sC2,sC3);
}
Code generation
for (i=0; i<1024; i+=4) {
  vA = vec_ld(&A[i]);
  vB = vec_ld(&B[i]);
  vC = vec_add(vA,vB);
  vec_st(vC,&C[i]);
}
28Inverse Implies Graph of logf