Introducing Control Flow into Vectorized Code


Provided by: Jaew2


Transcript and Presenter's Notes

Title: Introducing Control Flow into Vectorized Code

1
Introducing Control Flow into Vectorized Code

PACT'07, September 19, 2007

2
Outline

Introduction
Automatic vectorization
Vectorization in the presence of control flow
Nested control flow in vector codes
Experimental results
Conclusion

3
The Fastest Computer: BG/L
(Taken from top500.org)
4
Levels of Parallelism

5
Scalar vs. SIMD Operation
[Figure: scalar add r1,r2,r3 with r3 = 1, r2 = 2, r1 = 3]
6
Automatic Vectorization: ICC, GCC, XLC, ...
for (i=0; i<1024; i++) C[i] = A[i]+B[i];
for (i=0; i<1024; i+=4) C[i:i+3] = A[i:i+3]+B[i:i+3];
7
Issues

Control flow
  Vectorization in the presence of complex control flow (2005)
  Overhead of executing all control paths (2004, 2007)
Function calls
  Function calls are not vectorized.
  Function calls prevent vectorization.
Alignment
  Aligned: 0, 16, 32, ...
  Misaligned: 4, 20, 36, ...
  Unaligned: 0, 8, 16, ...
Etc.
  Non-unit strides
  Dependences

8
Vectorization in the Presence of Control Flow (2005)

IN
for (i=0; i<1024; i++)
  if (A[i]>0) C[i] = B[i];
  else D[i] = D[i-1];

If-conversion
for (i=0; i<1024; i++) {
  P = A[i]>0;
  NP = !P;
  C[i] = B[i];        <P>
  D[i] = D[i-1];      <NP>
}

Vectorization
for (i=0; i<1024; i+=4) {
  vP = A[i:i+3] > (0,0,0,0);
  vNP = vec_not(vP);
  C[i:i+3] = B[i:i+3];      <vP>
  (NP1,NP2,NP3,NP4) = vNP;
  D[i] = D[i-1];            <NP1>
  D[i+1] = D[i];            <NP2>
  D[i+2] = D[i+1];          <NP3>
  D[i+3] = D[i+2];          <NP4>
}

Remove vector predicates
for (i=0; i<1024; i+=4) {
  vP = A[i:i+3] > (0,0,0,0);
  vNP = vec_not(vP);
  C[i:i+3] = vec_sel(C[i:i+3], B[i:i+3], vP);
  (NP1,NP2,NP3,NP4) = vNP;
  D[i] = D[i-1];            <NP1>
  D[i+1] = D[i];            <NP2>
  D[i+2] = D[i+1];          <NP3>
  D[i+3] = D[i+2];          <NP4>
}

Remove scalar predicates (OUT)
for (i=0; i<1024; i+=4) {
  vP = A[i:i+3] > (0,0,0,0);
  vNP = vec_not(vP);
  C[i:i+3] = vec_sel(C[i:i+3], B[i:i+3], vP);
  (NP1,NP2,NP3,NP4) = vNP;
  if (NP1) D[i] = D[i-1];
  if (NP2) D[i+1] = D[i];
  if (NP3) D[i+2] = D[i+1];
  if (NP4) D[i+3] = D[i+2];
}
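The IN and OUT loops of this slide can be compared in plain C. The following is only a sketch under stated assumptions: the vector width is fixed at four lanes, each vector operation is emulated by a small inner loop instead of real AltiVec intrinsics, and the helper names (`sel4`, `scalar_loop`, `if_converted_loop`) are hypothetical. `D` is assumed to point one element past the start of its buffer so that `D[-1]` is a valid access.

```c
#include <assert.h>

#define VLEN 4 /* assumed vector width: four lanes */

/* Scalar reference: the IN loop. D must point at least one element
   past the start of its buffer so that D[-1] is valid. */
void scalar_loop(const int *A, const int *B, int *C, int *D, int n) {
    for (int i = 0; i < n; i++) {
        if (A[i] > 0) C[i] = B[i];
        else          D[i] = D[i - 1];
    }
}

/* Emulated vec_sel: per lane, take b where mask is set, otherwise a. */
static void sel4(int *dst, const int *a, const int *b, const int *mask) {
    for (int l = 0; l < VLEN; l++) dst[l] = mask[l] ? b[l] : a[l];
}

/* OUT loop, with each vector operation written as a 4-lane inner loop. */
void if_converted_loop(const int *A, const int *B, int *C, int *D, int n) {
    assert(n % VLEN == 0); /* this sketch handles whole vectors only */
    for (int i = 0; i < n; i += VLEN) {
        int vP[VLEN], vNP[VLEN];
        for (int l = 0; l < VLEN; l++) {
            vP[l]  = A[i + l] > 0; /* vP  = A[i:i+3] > (0,0,0,0) */
            vNP[l] = !vP[l];       /* vNP = vec_not(vP)          */
        }
        sel4(&C[i], &C[i], &B[i], vP); /* C[i:i+3] = vec_sel(C, B, vP) */
        /* The D update carries a dependence from lane to lane, so it
           stays scalar, guarded by the unpacked predicates NP1..NP4. */
        for (int l = 0; l < VLEN; l++)
            if (vNP[l]) D[i + l] = D[i + l - 1];
    }
}
```

Calling both functions on copies of the same input should leave `C` and `D` identical, which is the property the if-conversion and predicate-removal steps are meant to preserve.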
9
Branch-On-Superword-Condition-Code (BOSCC)
branch-on-none( src ) <target label>
branch-on-all( src ) <target label>
10
BOSCC by Example (2004)

for (i=0; i<16; i++)
  if (a[i] != 0) b[i]++;

Vectorization
for (i=0; i<16; i+=4) {
  pred = a[i:i+3] != (0, 0, 0, 0);
  old = b[i:i+3];
  new = old + (1, 1, 1, 1);
  b[i:i+3] = SELECT(old, new, pred);
}

Bypass vector instructions
for (i=0; i<16; i+=4) {
  pred = a[i:i+3] != (0, 0, 0, 0);
  branch-on-none(pred) L1;
  old = b[i:i+3];
  new = old + (1, 1, 1, 1);
  b[i:i+3] = SELECT(old, new, pred);
L1:
}
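A minimal C sketch of the bypass, assuming four lanes per vector and emulating `branch-on-none` with an ordinary test over the lanes. The helper `none4` and the function name are hypothetical. The return value counts how many vector groups were skipped, which is the saving a BOSCC is meant to expose.

```c
#include <assert.h>

#define VLEN 4 /* assumed vector width: four lanes */

/* Emulated branch-on-none condition: 1 when no lane of pred is set. */
static int none4(const int *pred) {
    for (int l = 0; l < VLEN; l++)
        if (pred[l]) return 0;
    return 1;
}

/* Increments b[i] wherever a[i] != 0, bypassing whole vector groups
   whose predicate is false in every lane. Returns the number of
   groups that were bypassed. */
int increment_nonzero_boscc(const int *a, int *b, int n) {
    assert(n % VLEN == 0); /* this sketch handles whole vectors only */
    int bypassed = 0;
    for (int i = 0; i < n; i += VLEN) {
        int pred[VLEN];
        for (int l = 0; l < VLEN; l++) pred[l] = a[i + l] != 0;
        if (none4(pred)) { /* branch-on-none(pred) L1 */
            bypassed++;
            continue;
        }
        for (int l = 0; l < VLEN; l++) {
            int old = b[i + l];
            int upd = old + 1;
            b[i + l] = pred[l] ? upd : old; /* SELECT(old, new, pred) */
        }
    }
    return bypassed;
}
```

With an input whose non-zero elements are clustered, most groups take the branch and the select sequence runs only where it can change something.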
11
Existing Approach: Single-Level BOSCCs
[Figure: scalar CFG and vector CFG]
12
Our Approach: Nesting BOSCCs (2007)
[Figure: scalar CFG and vector CFG]
13
Key Idea: Inverse-Implies

Definition: A inverse-implies B iff A ⇐ B (equivalently, ¬A ⇒ ¬B).
A region of instructions guarded by one vector predicate (B) can be nested inside another (A) if A ⇐ B.

branch-on-none(A) LA
  ...
LA:
branch-on-none(B) LB
  ...
LB:

Nest BOSCCs!
branch-on-none(A) LA
  BOSCC region A
  branch-on-none(B) LB
    BOSCC region B
  LB:
LA:

Nest BOSCC regions!
branch-on-none(A) LA
  BOSCC region A
  BOSCC region B
LA:
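The effect of nesting can be sketched in plain C by counting how many BOSCC tests execute. The example assumes four lanes and two hypothetical predicates, A: `x > 0` and B: `x > 9`, chosen so that B can only be true where A is true. The function and helper names are illustrative, not taken from the presentation.

```c
#include <assert.h>

#define VLEN 4 /* assumed vector width: four lanes */

/* Emulated branch-on-none condition: 1 when no lane of p is set. */
static int none4(const int *p) {
    for (int l = 0; l < VLEN; l++)
        if (p[l]) return 0;
    return 1;
}

/* Single-level BOSCCs: every region is guarded by its own test.
   Returns the number of BOSCC tests executed. */
int single_level_bosccs(const int *x, int *y, int *z, int n) {
    assert(n % VLEN == 0);
    int tests = 0;
    for (int i = 0; i < n; i += VLEN) {
        int pA[VLEN], pB[VLEN];
        for (int l = 0; l < VLEN; l++) {
            pA[l] = x[i + l] > 0; /* predicate A */
            pB[l] = x[i + l] > 9; /* predicate B, true only where A is */
        }
        tests++; /* branch-on-none(A) */
        if (!none4(pA))
            for (int l = 0; l < VLEN; l++) if (pA[l]) y[i + l] += 1;
        tests++; /* branch-on-none(B) */
        if (!none4(pB))
            for (int l = 0; l < VLEN; l++) if (pB[l]) z[i + l] += 1;
    }
    return tests;
}

/* Nested BOSCCs: B's test sits inside A's region, so a group with no
   A lanes skips region A, region B, and B's BOSCC in one branch. */
int nested_bosccs(const int *x, int *y, int *z, int n) {
    assert(n % VLEN == 0);
    int tests = 0;
    for (int i = 0; i < n; i += VLEN) {
        int pA[VLEN], pB[VLEN];
        for (int l = 0; l < VLEN; l++) pA[l] = x[i + l] > 0;
        tests++; /* branch-on-none(A) LA */
        if (none4(pA)) continue;
        for (int l = 0; l < VLEN; l++) if (pA[l]) y[i + l] += 1;
        for (int l = 0; l < VLEN; l++) pB[l] = x[i + l] > 9;
        tests++; /* branch-on-none(B) LB */
        if (!none4(pB))
            for (int l = 0; l < VLEN; l++) if (pB[l]) z[i + l] += 1;
    }
    return tests;
}
```

Both functions should produce the same `y` and `z`; the nested form executes fewer tests whenever some group has no lane satisfying A.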
14
A ⇐ B if ...

A is an ancestor of B: all conditions used to compute B are derived directly or indirectly from A.
B implies A: a subset of the conditions used to compute B is used to compute A.

C code
if (b[i]<0) b[i] = 0; if (a[i]>9) c[i] = 1;
goto L1; else b[i] = 1; if (a[i]>0)
L1: c[i] = 2;

[Figures: CFG and inverse-implies graph of the C code]
p2 is an ancestor of p5.
p5 implies p6.
Ph.D. dissertation, Scott Mahlke, 1995
15
Algorithm

Collect the instructions guarded by the same predicate or a descendant predicate so that they are textually contiguous.
  Post-DFS order of the II-graph
Insert BOSCCs.
  Reverse post-DFS order of the II-graph
The profitability of a BOSCC is based on the conditional probability of the nested predicate being true with respect to the nesting predicate.

branch-on-none(A) LA
  BOSCC region A
  branch-on-none(B) LB
    BOSCC region B
  LB:
LA:

P(B|A) = P(A∩B)/P(A) = P(B)/P(A)
ex) P(A) = 10%, P(B) = 10%: P(B|A) = 100%
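The conditional probability used in the profitability test can be worked through in a few lines of C. This is a sketch only: it assumes the nested predicate B is true only where the nesting predicate A is true, so that P(A∩B) = P(B), and both function names are hypothetical. The second function estimates the same quantity from profiled predicate values, which is one plausible way such probabilities could be obtained.

```c
#include <assert.h>

/* P(B|A) = P(A and B) / P(A). When B is true only where A is true,
   P(A and B) = P(B), so the ratio reduces to P(B) / P(A). */
double nested_cond_prob(double pA, double pB) {
    assert(pA > 0.0 && pB >= 0.0 && pB <= pA);
    return pB / pA;
}

/* Estimates P(B|A) from n profiled samples; maskA[i] and maskB[i] are
   nonzero when the predicate held in sample i. Returns -1.0 when A
   never held, since the ratio is undefined in that case. */
double estimate_cond_prob(const int *maskA, const int *maskB, int n) {
    int countA = 0, countAB = 0;
    for (int i = 0; i < n; i++) {
        if (maskA[i]) {
            countA++;
            if (maskB[i]) countAB++;
        }
    }
    return countA ? (double)countAB / countA : -1.0;
}
```

For the slide's example, `nested_cond_prob(0.10, 0.10)` gives 1.0: whenever region A is entered, B is certain to be true in some lane, so a BOSCC on B would never be taken.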
16
Nesting Over Single-Level BOSCCs
17
vec_any_*
if (a[i] > 0)
Previous approach
  pT = vec_cmpgt(a[i:i+3], (0,0,0,0));
  if (vec_any_ne(pT, (0,0,0,0)))
Our approach
  if (vec_any_gt(a[i:i+3], (0,0,0,0)))
    pT = vec_cmpgt(a[i:i+3], (0,0,0,0));
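The difference between the two code shapes can be sketched in plain C by counting how often the predicate-defining compare executes. The helpers `cmpgt4`, `any_ne4`, and `any_gt4` are hypothetical four-lane emulations standing in for the AltiVec operations named on the slide; they are not the real intrinsics, and the counting functions exist only for illustration.

```c
#include <assert.h>

#define VLEN 4 /* assumed vector width: four lanes */

/* Emulated vec_cmpgt against a splatted scalar: all-ones mask per lane. */
static void cmpgt4(int *m, const int *v, int s) {
    for (int l = 0; l < VLEN; l++) m[l] = v[l] > s ? -1 : 0;
}

/* Emulated vec_any_ne against a splatted scalar. */
static int any_ne4(const int *m, int s) {
    for (int l = 0; l < VLEN; l++) if (m[l] != s) return 1;
    return 0;
}

/* Emulated vec_any_gt against a splatted scalar. */
static int any_gt4(const int *v, int s) {
    for (int l = 0; l < VLEN; l++) if (v[l] > s) return 1;
    return 0;
}

/* Previous shape: the compare always runs, then the branch tests it.
   Returns the number of compares executed. */
int previous_approach(const int *a, int n, int *positives) {
    assert(n % VLEN == 0);
    int compares = 0;
    *positives = 0;
    for (int i = 0; i < n; i += VLEN) {
        int pT[VLEN];
        cmpgt4(pT, &a[i], 0);
        compares++;
        if (any_ne4(pT, 0))
            for (int l = 0; l < VLEN; l++) if (pT[l]) (*positives)++;
    }
    return compares;
}

/* New shape: the branch tests the data directly, so the compare that
   defines pT is itself bypassed when no lane is positive. */
int our_approach(const int *a, int n, int *positives) {
    assert(n % VLEN == 0);
    int compares = 0;
    *positives = 0;
    for (int i = 0; i < n; i += VLEN) {
        if (any_gt4(&a[i], 0)) {
            int pT[VLEN];
            cmpgt4(pT, &a[i], 0);
            compares++;
            for (int l = 0; l < VLEN; l++) if (pT[l]) (*positives)++;
        }
    }
    return compares;
}
```

Both functions count the same number of positive lanes; only the number of executed compares differs.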
18
vec_any_* Over vec_any_ne Only
19
Predicate Restoring vs. Predicate Preserving
20
Predicate Restoring Over Predicate Preserving
21
Overall Speedups Over the Scalar Baseline
22
Conclusion

By nesting BOSCCs,
  a single BOSCC can bypass multiple predicated regions, and
  even the BOSCCs associated with those regions are bypassed.
Experiments on 14 benchmarks show speedups of 11.94 over the previous BOSCC generation technique.
A significant portion of the vectorization overhead in the presence of arbitrarily complex control flow is eliminated.
Future work
  Whole-function vectorization
  Look for more applications

23
Questions?
24
Why SIMD?

Simple design (low cost)
  Replicated functional units
  Small die area
    No heavily ported register files
    Die area: MAX-2 (HP) 0.1, VIS (Sun) 3.0
More parallelism (high performance)
  When parallelism is abundant
  SIMD in addition to ILP
Must be explicitly exposed to the hardware
  By the compiler or by the programmer
25
Programming SIMD Units

Language extension: no standard!
  Programming interface similar to function calls
  C built-in functions, Fortran intrinsics
  Most native compilers support the SIMD intrinsics of the machine
    AltiVec: dst = vec_add(src1, src2)
    SSE2: dst = _mm_add_ps(src1, src2)
    BG/L: dst = __fpadd(src1, src2)
  GCC: -faltivec, -msse2
Library calls: not portable (as is)!
  BG/L: MASS, MASSV, ESSL
  Ex: y = sqrt(x) vs. vsqrt(Y, X, 500)
Need automatic tools.

26
Approach 1: Loop-Level Automatic Vectorization

for (i=0; i<1024; i++)
  C[i] = A[i]+B[i];

Strip mining
for (i=0; i<1024; i+=4)
  for (ii=0; ii<4; ii++)
    C[i+ii] = A[i+ii]+B[i+ii];

Loop distribution
for (i=0; i<1024; i+=4) {
  for (ii=0; ii<4; ii++) tA[ii] = A[i+ii];
  for (ii=0; ii<4; ii++) tB[ii] = B[i+ii];
  for (ii=0; ii<4; ii++) tC[ii] = tA[ii]+tB[ii];
  for (ii=0; ii<4; ii++) C[i+ii] = tC[ii];
}

Code generation
for (i=0; i<1024; i+=4) {
  vA = vec_ld(&A[i]);
  vB = vec_ld(&B[i]);
  vC = vec_add(vA,vB);
  vec_st(vC,&C[i]);
}
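The strip-mining and loop-distribution steps can be checked against the original loop in plain C. This is a sketch assuming a strip length of four and a trip count that is a multiple of four; the function names are hypothetical. The final code-generation step is not shown because it would replace each inner loop with one vector instruction on the target machine.

```c
#include <assert.h>

#define VLEN 4 /* assumed strip length, matching a four-lane vector */

/* Original scalar loop. */
void add_scalar(const float *A, const float *B, float *C, int n) {
    for (int i = 0; i < n; i++) C[i] = A[i] + B[i];
}

/* After strip mining: the loop is split into strips of VLEN iterations. */
void add_strip_mined(const float *A, const float *B, float *C, int n) {
    assert(n % VLEN == 0); /* no remainder loop in this sketch */
    for (int i = 0; i < n; i += VLEN)
        for (int ii = 0; ii < VLEN; ii++)
            C[i + ii] = A[i + ii] + B[i + ii];
}

/* After loop distribution: one inner loop per operation, with the
   temporaries tA, tB, tC playing the role of vector registers. Each
   inner loop corresponds to one vector load, add, or store. */
void add_distributed(const float *A, const float *B, float *C, int n) {
    assert(n % VLEN == 0);
    for (int i = 0; i < n; i += VLEN) {
        float tA[VLEN], tB[VLEN], tC[VLEN];
        for (int ii = 0; ii < VLEN; ii++) tA[ii] = A[i + ii];
        for (int ii = 0; ii < VLEN; ii++) tB[ii] = B[i + ii];
        for (int ii = 0; ii < VLEN; ii++) tC[ii] = tA[ii] + tB[ii];
        for (int ii = 0; ii < VLEN; ii++) C[i + ii] = tC[ii];
    }
}
```

All three functions perform the same additions in the same element order, so their outputs should match exactly for any input of suitable length.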
27
Approach 2: Basic Block-Level Automatic Vectorization

for (i=0; i<1024; i++)
  C[i] = A[i]+B[i];

Unrolling
for (i=0; i<1024; i+=4) {
  sA0 = A[i+0]; sB0 = B[i+0]; sC0 = sA0+sB0; C[i+0] = sC0;
  ...
  sA3 = A[i+3]; sB3 = B[i+3]; sC3 = sA3+sB3; C[i+3] = sC3;
}

Packing isomorphic statements
for (i=0; i<1024; i+=4) {
  (sA0,sA1,sA2,sA3) = A[i:i+3];
  (sB0,sB1,sB2,sB3) = B[i:i+3];
  (sC0,sC1,sC2,sC3) = (sA0,sA1,sA2,sA3)+(sB0,sB1,sB2,sB3);
  C[i:i+3] = (sC0,sC1,sC2,sC3);
}

Code generation
for (i=0; i<1024; i+=4) {
  vA = vec_ld(&A[i]);
  vB = vec_ld(&B[i]);
  vC = vec_add(vA,vB);
  vec_st(vC,&C[i]);
}
28
Inverse-Implies Graph of logf
