Introducing Control Flow into Vectorized Code

Provided by: Jaew2
1
Introducing Control Flow into Vectorized Code
  • PACT'07, September 19, 2007

2
Outline
  • Introduction
  • Automatic vectorization
  • Vectorization in the presence of control flow
  • Nested control flow in vector codes
  • Experimental results
  • Conclusion

3
The Fastest Computer: BG/L
(Taken from top500.org)
4
Levels of Parallelism



5
Scalar vs. SIMD Operation
Scalar: add r1,r2,r3 (r2 = 2, r3 = 1, giving r1 = 3)
6
Automatic Vectorization: ICC, GCC, XLC, ...
  for (i=0; i<1024; i++) C[i] = A[i]+B[i];
  for (i=0; i<1024; i+=4) C[i:i+3] = A[i:i+3]+B[i:i+3];
7
Issues
  • Control flow
    • Vectorization in the presence of complex control
      flow (2005)
    • Overhead of executing all control paths (2004, 2007)
  • Function calls
    • Function calls are not vectorized.
    • Function calls prevent vectorization.
  • Alignment
    • Aligned: 0, 16, 32, ...
    • Misaligned: 4, 20, 36, ...
    • Unaligned: 0, 8, 16, ...
  • Etc.
    • Non-unit strides
    • Dependences

8
Vectorization in the Presence of Control Flow (2005)
IN
  for (i=0; i<1024; i++)
    if (A[i]>0) C[i] = B[i];
    else D[i] = D[i-1];
If-conversion
  for (i=0; i<1024; i++) {
    P = A[i]>0; NP = !P;
    C[i] = B[i];   <P>
    D[i] = D[i-1]; <NP>
  }
Vectorization
  for (i=0; i<1024; i+=4) {
    vP = A[i:i+3]>(0,0,0,0); vNP = vec_not(vP);
    C[i:i+3] = B[i:i+3]; <vP>
    (NP1,NP2,NP3,NP4) = vNP;
    D[i] = D[i-1];   <NP1>
    D[i+1] = D[i];   <NP2>
    D[i+2] = D[i+1]; <NP3>
    D[i+3] = D[i+2]; <NP4>
  }
Remove vector predicates
  for (i=0; i<1024; i+=4) {
    vP = A[i:i+3]>(0,0,0,0); vNP = vec_not(vP);
    C[i:i+3] = vec_sel(C[i:i+3],B[i:i+3],vP);
    (NP1,NP2,NP3,NP4) = vNP;
    D[i] = D[i-1];   <NP1>
    D[i+1] = D[i];   <NP2>
    D[i+2] = D[i+1]; <NP3>
    D[i+3] = D[i+2]; <NP4>
  }
Remove scalar predicates
OUT
  for (i=0; i<1024; i+=4) {
    vP = A[i:i+3]>(0,0,0,0); vNP = vec_not(vP);
    C[i:i+3] = vec_sel(C[i:i+3],B[i:i+3],vP);
    (NP1,NP2,NP3,NP4) = vNP;
    if (NP1) D[i] = D[i-1];
    if (NP2) D[i+1] = D[i];
    if (NP3) D[i+2] = D[i+1];
    if (NP4) D[i+3] = D[i+2];
  }
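The last stage of this slide (select for the vector predicate, plain ifs for the unpacked scalar predicates) can be sketched in portable C. This is a minimal emulation, not the compiler's output: each 4-lane superword is modeled as an ordinary inner loop, the names scalar_loop, if_converted_loop and VL are hypothetical, and the caller is assumed to pass D pointing one element past the start of its buffer so that D[i-1] is readable at i = 0.

```c
#include <assert.h>

#define VL 4 /* assumed vector length: four lanes per superword */

/* Scalar reference loop. D must point one element past the start of its
   buffer so that D[-1] is a valid read at i = 0 (assumption of this sketch). */
void scalar_loop(const int *A, const int *B, int *C, int *D, int n) {
    for (int i = 0; i < n; i++) {
        if (A[i] > 0) C[i] = B[i];
        else          D[i] = D[i - 1];
    }
}

/* If-converted, strip-wise version: the vector predicate is removed by a
   select, the scalar predicates NP1..NP4 by plain ifs. */
void if_converted_loop(const int *A, const int *B, int *C, int *D, int n) {
    assert(n % VL == 0); /* no remainder loop in this sketch */
    for (int i = 0; i < n; i += VL) {
        int vP[VL], vNP[VL];
        for (int l = 0; l < VL; l++) {
            vP[l]  = A[i + l] > 0;   /* vP  = A[i:i+3] > (0,0,0,0) */
            vNP[l] = !vP[l];         /* vNP = vec_not(vP)          */
        }
        for (int l = 0; l < VL; l++) /* C[i:i+3] = vec_sel(C[i:i+3], B[i:i+3], vP) */
            C[i + l] = vP[l] ? B[i + l] : C[i + l];
        for (int l = 0; l < VL; l++) /* serial updates guarded by NP1..NP4 */
            if (vNP[l]) D[i + l] = D[i + l - 1];
    }
}
```

The D updates stay scalar and in order because each one reads the element written by the previous iteration, which is why the slide unpacks vNP instead of using a second select.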
9
Branch-On-Superword-Condition-Code (BOSCC)
branch-on-none( src ) <target label>
branch-on-all( src ) <target label>
10
BOSCC by Example (2004)
  for (i=0; i<16; i++)
    if (a[i] != 0) b[i]++;
Vectorization
  for (i=0; i<16; i+=4) {
    pred = a[i:i+3] != (0, 0, 0, 0);
    old = b[i:i+3];
    new = old + (1, 1, 1, 1);
    b[i:i+3] = SELECT(old, new, pred);
  }
Bypass vector instructions
  for (i=0; i<16; i+=4) {
    pred = a[i:i+3] != (0, 0, 0, 0);
    branch-on-none(pred) L1;
    old = b[i:i+3];
    new = old + (1, 1, 1, 1);
    b[i:i+3] = SELECT(old, new, pred);
  L1:
  }
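The bypass on this slide can be emulated in portable C to make its effect visible. This is a sketch under stated assumptions: the superword is modeled as a 4-iteration inner loop, branch-on-none becomes an ordinary test on an OR-reduction of the predicate lanes, and the function name increment_nonzero_boscc and its return value (the number of bypassed strips) are hypothetical additions for illustration.

```c
#include <assert.h>

#define VL 4 /* assumed vector length */

/* Increments b[i] wherever a[i] != 0, skipping any 4-wide strip whose
   predicate is false in every lane. Returns how many strips were skipped. */
int increment_nonzero_boscc(const int *a, int *b, int n) {
    assert(n % VL == 0); /* no remainder loop in this sketch */
    int bypassed = 0;
    for (int i = 0; i < n; i += VL) {
        int pred[VL], any = 0;
        for (int l = 0; l < VL; l++) {      /* pred = a[i:i+3] != (0,0,0,0) */
            pred[l] = a[i + l] != 0;
            any |= pred[l];
        }
        if (!any) {                         /* branch-on-none(pred) L1 */
            bypassed++;
            continue;
        }
        for (int l = 0; l < VL; l++) {
            int old = b[i + l];             /* old = b[i:i+3]          */
            int upd = old + 1;              /* new = old + (1,1,1,1)   */
            b[i + l] = pred[l] ? upd : old; /* SELECT(old, new, pred)  */
        }
    }
    return bypassed;
}
```

The bypass only pays off when all-false strips are common enough to outweigh the extra test on every strip.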
11
Existing Approach: single-level BOSCCs
Scalar CFG
Vector CFG
12
Our Approach: nesting BOSCCs (2007)
Scalar CFG
Vector CFG
13
Key Idea: Inverse-Implies
  • Definition: A inverse-implies B iff A ⇐ B, that is,
    B can be true only where A is true.
  • A region of instructions guarded by one vector
    predicate (B) can be nested inside another (A) if
    A ⇐ B.

  branch-on-none(A) LA
    ...
  LA:
  branch-on-none(B) LB
    ...
  LB:
Nest BOSCC regions!
  branch-on-none(A) LA
    BOSCC region A
    BOSCC region B
  LA:
Nest BOSCCs!
  branch-on-none(A) LA
    BOSCC region A
    branch-on-none(B) LB
      BOSCC region B
    LB:
  LA:
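A small portable C emulation shows what nesting buys. This is a sketch, not the compiler transformation itself: the predicates A (a[i] > 0) and B (a[i] > 9) are hypothetical and chosen so that B is true only where A is true, each superword is modeled as a 4-iteration loop, and nested_boscc returns the number of BOSCC tests it executed so the saving can be observed.

```c
#include <assert.h>

#define VL 4 /* assumed vector length */

/* Emulates the condition tested by branch-on-none. */
static int none_true(const int *p) {
    for (int l = 0; l < VL; l++)
        if (p[l]) return 0;
    return 1;
}

/* Region A: c[i]++ where a[i] > 0. Region B: d[i]++ where a[i] > 9.
   B's region and B's BOSCC are both nested inside A's BOSCC.
   Returns the number of BOSCC tests executed. */
int nested_boscc(const int *a, int *c, int *d, int n) {
    assert(n % VL == 0); /* no remainder loop in this sketch */
    int tests = 0;
    for (int i = 0; i < n; i += VL) {
        int A[VL], B[VL];
        for (int l = 0; l < VL; l++) A[l] = a[i + l] > 0;
        tests++;
        if (none_true(A)) continue;          /* branch-on-none(A) LA */
        for (int l = 0; l < VL; l++) {       /* BOSCC region A */
            c[i + l] = A[l] ? c[i + l] + 1 : c[i + l];
            B[l] = a[i + l] > 9;             /* B computed inside region A */
        }
        tests++;
        if (none_true(B)) continue;          /* branch-on-none(B) LB */
        for (int l = 0; l < VL; l++)         /* BOSCC region B */
            d[i + l] = B[l] ? d[i + l] + 1 : d[i + l];
    }
    return tests;
}
```

With single-level BOSCCs every strip would pay for two tests; here a strip where A is all-false pays for one and never evaluates B at all.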
14
A ⇐ B if ...
  • A is an ancestor of B: all conditions used to
    compute B are derived directly or indirectly from
    A.
  • B implies A: a subset of the conditions used to
    compute B is used to compute A.

C code
  if (b[i]<0) b[i]=0;
  if (a[i]>9) { c[i]=1; goto L1; }
  else b[i]=1;
  if (a[i]>0)
  L1: c[i]=2;
[Figure panels: C code, CFG with T/F edges, Inverse-Implies Graph]
  p2 is an ancestor of p5.
  p5 implies p6.
Ph.D. dissertation, Scott Mahlke, 1995
15
Algorithm
  • Collect the instructions guarded by the same
    predicate or a descendant predicate so that they
    are textually contiguous.
    • Post-DFS order of the II-graph
  • Insert BOSCCs.
    • Reverse post-DFS order of the II-graph
    • Profitability of a BOSCC is based on the
      conditional probability of the nested predicate
      being true with respect to the nesting predicate.

  branch-on-none(A) LA
    BOSCC region A
    branch-on-none(B) LB
      BOSCC region B
    LB:
  LA:
  P(B|A) = P(A∩B)/P(A) = P(B)/P(A)
  Ex) P(A) = 10%, P(B) = 10%, so P(B|A) = 100%
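The conditional probability used in the profitability check can be worked as a one-line computation. This is only a sketch of the formula on this slide: it assumes the nested predicate B is true only where the nesting predicate A is true, so P(A∩B) = P(B); the function name is hypothetical, and how the resulting value feeds the insert-or-skip decision is not shown in the transcript.

```c
#include <assert.h>

/* P(B|A) = P(A and B) / P(A) = P(B) / P(A), valid when B is true only
   where A is true. Probabilities are fractions in [0, 1]. */
double nested_predicate_probability(double p_a, double p_b) {
    assert(p_a > 0.0 && p_b >= 0.0 && p_b <= p_a);
    return p_b / p_a;
}
```

For the slide's example, P(A) = P(B) = 0.10 gives P(B|A) = 1.0: whenever region A is entered, B is never all-false, so a nested BOSCC on B would never branch and would only add overhead.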
16
Nesting Over Single-Level BOSCCs
17
vec_any_*
  if (a[i] > 0)
Previous approach
  pT = vec_cmpgt(a[i:i+3],(0,0,0,0));
  if (vec_any_ne(pT,(0,0,0,0)))
Our approach
  if (vec_any_gt(a[i:i+3],(0,0,0,0)))
    pT = vec_cmpgt(a[i:i+3],(0,0,0,0));
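The two guard shapes on this slide can be compared with a portable emulation. The emu_* helpers below are hypothetical stand-ins written for this sketch; they mimic the lane-wise behavior of the AltiVec operations vec_cmpgt, vec_any_ne and vec_any_gt but are not those intrinsics. The counter mask_ops is also an assumption, added only to show that the second shape materializes the mask solely on strips that are not bypassed.

```c
#include <assert.h>

#define VL 4 /* assumed vector length */
typedef struct { int lane[VL]; } vec4;

int mask_ops = 0; /* number of emulated compare-to-mask operations */

static vec4 emu_cmpgt(vec4 x, vec4 y) {   /* all-ones lane where x > y */
    vec4 m;
    mask_ops++;
    for (int l = 0; l < VL; l++) m.lane[l] = x.lane[l] > y.lane[l] ? -1 : 0;
    return m;
}
static int emu_any_ne(vec4 x, vec4 y) {   /* true if any lane differs */
    for (int l = 0; l < VL; l++) if (x.lane[l] != y.lane[l]) return 1;
    return 0;
}
static int emu_any_gt(vec4 x, vec4 y) {   /* true if any lane has x > y */
    for (int l = 0; l < VL; l++) if (x.lane[l] > y.lane[l]) return 1;
    return 0;
}

/* Previous approach: build the mask, then test it against zero. */
int guard_previous(vec4 a, vec4 *pT) {
    const vec4 zero = {{0, 0, 0, 0}};
    *pT = emu_cmpgt(a, zero);
    return emu_any_ne(*pT, zero);
}

/* Approach with vec_any_gt: test the data directly, build the mask
   only when the guarded region will actually run. */
int guard_any(vec4 a, vec4 *pT) {
    const vec4 zero = {{0, 0, 0, 0}};
    if (!emu_any_gt(a, zero)) return 0;
    *pT = emu_cmpgt(a, zero);
    return 1;
}
```

Both guards take the same branch for every input; they differ only in how much work precedes the branch on a bypassed strip.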
18
vec_any_* Over vec_any_ne Only
19
Predicate Restoring vs. Predicate Preserving
20
Predicate Restoring Over Predicate Preserving
21
Overall Speedups Over the Scalar Baseline
22
Conclusion
  • By nesting BOSCCs,
    • we can bypass multiple predicate regions with a
      single BOSCC, and
    • even the associated BOSCCs are bypassed.
  • Experiments on 14 benchmarks show a speedup of
    11.94% over the previous BOSCC generation
    technique.
  • A significant portion of the vectorization
    overhead in the presence of arbitrarily complex
    control flow is eliminated.
  • Future work
    • Whole function vectorization
    • Look for more applications

23
Questions?
24
Why SIMD?
  • Low cost
    • Simple design
      • Replicated functional units
    • Small die area
      • No heavily ported register files
      • Die area: MAX-2 (HP) 0.1%, VIS (Sun) 3.0%
  • High performance
    • More parallelism
      • When parallelism is abundant
      • SIMD in addition to ILP
  • Must be explicitly exposed to the hardware
    • By the compiler or by the programmer

25
Programming SIMD Units
  • Language extension: no standard!
    • Programming interface similar to a function call
    • C built-in functions, Fortran intrinsics
    • Most native compilers support the SIMD intrinsics
      of the machine
      • AltiVec: dst = vec_add(src1, src2)
      • SSE2: dst = _mm_add_ps(src1, src2)
      • BG/L: dst = __fpadd(src1, src2)
    • GCC: -faltivec, -msse2
  • Library calls: not portable (as is)!
    • BG/L: MASS, MASSV, ESSL
    • Ex) y = sqrt(x) vs. vsqrt(Y,X,500)
  • Need automatic tools.

26
Approach 1: Loop-Level Automatic Vectorization
  for (i=0; i<1024; i++)
    C[i] = A[i]+B[i];
Strip mining
  for (i=0; i<1024; i+=4)
    for (ii=0; ii<4; ii++)
      C[i+ii] = A[i+ii]+B[i+ii];
Loop distribution
  for (i=0; i<1024; i+=4) {
    for (ii=0; ii<4; ii++) tA[ii] = A[i+ii];
    for (ii=0; ii<4; ii++) tB[ii] = B[i+ii];
    for (ii=0; ii<4; ii++) tC[ii] = tA[ii]+tB[ii];
    for (ii=0; ii<4; ii++) C[i+ii] = tC[ii];
  }
Code generation
  for (i=0; i<1024; i+=4) {
    vA = vec_ld(&A[i]);
    vB = vec_ld(&B[i]);
    vC = vec_add(vA,vB);
    vec_st(vC,&C[i]);
  }
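The first two steps of this approach are ordinary loop transformations, so they can be checked in plain C before any vector code is generated. This is a minimal sketch with hypothetical function names; it assumes int elements and the slide's trip count of 1024, and it stops before code generation because the vector load, add and store are target-specific.

```c
#include <assert.h>

#define N  1024 /* trip count from the slide */
#define VL 4    /* assumed vector length */

/* Original scalar loop. */
void add_scalar(const int *A, const int *B, int *C) {
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}

/* After strip mining: the outer loop steps by VL, the inner loop covers
   one strip. */
void add_strip_mined(const int *A, const int *B, int *C) {
    for (int i = 0; i < N; i += VL)
        for (int ii = 0; ii < VL; ii++)
            C[i + ii] = A[i + ii] + B[i + ii];
}

/* After loop distribution: each inner loop performs one operation on a
   whole strip, which is the shape a vector load, add, or store replaces. */
void add_distributed(const int *A, const int *B, int *C) {
    for (int i = 0; i < N; i += VL) {
        int tA[VL], tB[VL], tC[VL];
        for (int ii = 0; ii < VL; ii++) tA[ii] = A[i + ii];        /* load A  */
        for (int ii = 0; ii < VL; ii++) tB[ii] = B[i + ii];        /* load B  */
        for (int ii = 0; ii < VL; ii++) tC[ii] = tA[ii] + tB[ii];  /* add     */
        for (int ii = 0; ii < VL; ii++) C[i + ii] = tC[ii];        /* store C */
    }
}
```

Each of the four inner loops in add_distributed corresponds to one line of the generated vector code on this slide.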
27
Approach 2: Basic Block-Level Automatic Vectorization
  for (i=0; i<1024; i++)
    C[i] = A[i]+B[i];
Unrolling
  for (i=0; i<1024; i+=4) {
    sA0 = A[i+0]; sB0 = B[i+0]; sC0 = sA0+sB0; C[i+0] = sC0;
    ...
    sA3 = A[i+3]; sB3 = B[i+3]; sC3 = sA3+sB3; C[i+3] = sC3;
  }
Packing isomorphic statements
  for (i=0; i<1024; i+=4) {
    (sA0,sA1,sA2,sA3) = A[i:i+3];
    (sB0,sB1,sB2,sB3) = B[i:i+3];
    (sC0,sC1,sC2,sC3) = (sA0,sA1,sA2,sA3)+(sB0,sB1,sB2,sB3);
    C[i:i+3] = (sC0,sC1,sC2,sC3);
  }
Code generation
  for (i=0; i<1024; i+=4) {
    vA = vec_ld(&A[i]);
    vB = vec_ld(&B[i]);
    vC = vec_add(vA,vB);
    vec_st(vC,&C[i]);
  }
28
Inverse Implies Graph of logf