Auto-Vectorization of Interleaved Data for SIMD


PLDI 2006. Auto-Vectorization of Interleaved Data for SIMD. Dorit Nuzman, Ira ... We show how a classic compiler loop-based auto-SIMDizing optimization was ...

1
Auto-Vectorization of Interleaved Data for SIMD
  • Dorit Nuzman, Ira Rosen, Ayal Zaks
  • IBM Haifa Research Lab, Israel (HiPEAC member)
  • {dorit, ira, zaks}@il.ibm.com

2
Main Message
  1. Most SIMD targets support access to packed data
    in memory (SIMpD), but there are important
    applications which access non-consecutive data
  2. We show how a classic compiler loop-based
    auto-SIMDizing optimization was augmented to
    support accesses to strided, interleaved data
  3. This can serve as a first step to combine
    traditional loop-based vectorization with
    (if-converted) basic-block vectorization (SLP)

3
SIMD: Single Instruction Multiple Data
SIMpD: Single Instruction Multiple packed Data
[Figure: vectorization replaces the scalar operations OP(a) OP(b) OP(c) OP(d) with one vector operation VOP( a, b, c, d ) on vector register VR1. Data in memory: a b c d e f g h i j k l m n o p]
4
SIMD: Single Instruction Multiple Data
SIMpD: Single Instruction Multiple packed Data
[Figure: the consecutive (packed) elements a, b, c, d of the data in memory (a b c d e f g h i j k l m n o p) fill vector register VR1, so OP(a) OP(b) OP(c) OP(d) become VOP( a, b, c, d )]
5
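SIMD: Single Instruction Multiple Data
SIMpD: Single Instruction Multiple packed Data
[Figure: the operations OP(a) OP(f) OP(k) OP(p) access the non-consecutive elements a, f, k, p of the data in memory (a b c d e f g h i j k l m n o p); these must be brought together in vector register VR5 for VOP( a, f, k, p )]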
6
[Figure: OP(a) OP(f) OP(k) OP(p) become VOP( a, f, k, p ) on vector register VR5; the elements a, f, k, p are picked out of the data in memory (a b c d e f g h i j k l m n o p) by the sequence]
  mask <- ...
  loop:
    (VR1,...,VR4) <- vload (mem)
    VR5 <- pack (VR1,...,VR4), mask
    VOP(VR5)
7
Application accessing non-consecutive data
Viterbi decoder (before)
[Figure: dataflow of the Viterbi decoder kernel. Memory accesses are labeled Stride 1, Stride 2, Stride 2 and Stride 4; operations shown: -, << 1, << 11, max, sel]
8
Application accessing non-consecutive data
Viterbi decoder (after)
[Figure: dataflow of the Viterbi decoder kernel. Memory accesses are labeled Stride 1, Stride 2, Stride 2 and Stride 4; operations shown: -, << 1, << 11, max, sel]
9
Application accessing non-consecutive data
Audio downmix (before)
[Figure: dataflow of the audio downmix kernel. Memory accesses are labeled Stride 4 and Stride 2; operations shown: four >> 1 shifts]
10
Application accessing non-consecutive data
Audio downmix (after)
[Figure: dataflow of the audio downmix kernel. Memory accesses are labeled Stride 4 and Stride 2; operations shown: four >> 1 shifts]
11
Basic unpacking and packing operations for
strided access
  • Use two pairs of inverse operations widely
    supported on SIMD platforms
    • extract_even, extract_odd
    • interleave_high, interleave_low
  • Use them recursively to support strided accesses
    with power-of-2 strides
  • Support several data types
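
The recursive use of these operations for loads can be sketched in Python, modelling a vector register as a list of VF elements. The four operation names come from the slide, but their exact semantics here (each takes a pair of registers, and "high" means the first half of a register) are assumptions of this sketch rather than a description of any particular instruction set. `deinterleave` is a hypothetical helper name.

```python
def extract_even(v1, v2):
    """Even-indexed elements of the concatenation v1 + v2."""
    return (v1 + v2)[0::2]

def extract_odd(v1, v2):
    """Odd-indexed elements of the concatenation v1 + v2."""
    return (v1 + v2)[1::2]

def interleave_high(v1, v2):
    """Interleave the first halves of v1 and v2 (assumed meaning of 'high')."""
    h = len(v1) // 2
    return [x for pair in zip(v1[:h], v2[:h]) for x in pair]

def interleave_low(v1, v2):
    """Interleave the second halves of v1 and v2 (assumed meaning of 'low')."""
    h = len(v1) // 2
    return [x for pair in zip(v1[h:], v2[h:]) for x in pair]

def deinterleave(regs):
    """Turn d consecutive registers into d registers, one per stride offset.

    Runs log2(d) levels; each level applies d/2 extract_even and
    d/2 extract_odd operations, i.e. d * log2(d) extracts in total.
    """
    d = len(regs)
    assert d & (d - 1) == 0, "stride must be a power of 2"
    level = 1
    while level < d:
        out = [None] * d
        for j in range(d // 2):
            out[j] = extract_even(regs[2 * j], regs[2 * j + 1])
            out[j + d // 2] = extract_odd(regs[2 * j], regs[2 * j + 1])
        regs = out
        level *= 2
    return regs

# Stride d = 4, VF = 4: 16 consecutive elements in 4 registers.
mem = list(range(16))
loaded = [mem[i:i + 4] for i in range(0, 16, 4)]
by_offset = deinterleave(loaded)  # by_offset[j] holds mem[j], mem[j+4], ...
```

In this model the two pairs are inverses in the sense that interleaving the even and odd extracts of two registers gives the two registers back.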

12
Classic loop-based auto-vectorization
  • vect_analyze_loop (loop)
      if (!1_analyze_counted_single_bb_loop (loop)) FAIL
      if (!2_determine_VF (loop)) FAIL
      if (!3_analyze_memory_access_patterns (loop)) FAIL
      if (!4_analyze_scalar_dependence_cycles (loop)) FAIL
      if (!5_analyze_data_dependence_distances (loop)) FAIL
      if (!6_analyze_consecutive_data_accesses (loop)) FAIL
      if (!7_analyze_data_alignment (loop)) FAIL
      if (!8_analyze_vops_exist_forall_ops (loop)) FAIL
      SUCCEED
  • vect_transform_loop (loop)
      FOR_ALL_STMTS_IN_LOOP (loop, stmt)
        replace_OP_by_VOP (stmt)
      decrease_loop_bound_by_factor_VF (loop)
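
The effect of the transformation step on a simple unit-stride loop can be illustrated with a small Python model. This is only a sketch: `vadd` stands in for a VOP, lists stand in for vector registers, VF = 4 is chosen arbitrarily, and the trip count is assumed to be a multiple of VF. None of these names come from the slides.

```python
VF = 4  # vectorization factor (assumed for this example)

def scalar_loop(a, b):
    """Original loop: one scalar OP per iteration."""
    out = [0] * len(a)
    for i in range(len(a)):
        out[i] = a[i] + b[i]
    return out

def vadd(va, vb):
    """Stand-in for a vector add: one VOP on VF elements."""
    return [x + y for x, y in zip(va, vb)]

def vectorized_loop(a, b):
    """Each OP replaced by a VOP; loop bound decreased by a factor of VF."""
    n = len(a)
    assert n % VF == 0, "sketch assumes the trip count is a multiple of VF"
    out = [0] * n
    for i in range(n // VF):
        lo = i * VF
        out[lo:lo + VF] = vadd(a[lo:lo + VF], b[lo:lo + VF])
    return out
```

Both loops compute the same result; the vectorized one runs n/VF iterations and relies on every access being consecutive, which is exactly the restriction the following slides relax.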

13
Vectorizing non-unit-stride access
  • One VOP accessing data with stride d requires
    loading d*VF elements
  • Several otherwise unrelated VOPs can share these
    loaded elements
    • If they all share the same stride d
    • If they all start close to each other
    • Up to d VOPs; if fewer, there are gaps
  • Recognize this spatial reuse potential to
    eliminate redundant load and extract operations
  • Better to make the decision earlier rather than
    later; without such elimination
    • vectorizing the loop may be non-beneficial (for
      loads)
    • vectorizing the loop may be prohibited (for
      stores)

14
Augmenting the vectorizer, step 1/3: build
spatial groups
  • 5_analyze_data_dependence_distances already
    traversed all pairs of loads/stores to analyze
    their dependence distance
      if (cross_iteration_dependence_distance < (VF-1)*stride)
        if (read,write) or (write,read) or (write,write)
          ok = dep_resolve()
        endif
      endif
  • Augment this traversal to look for spatial reuse
    between pairs of independent loads and stores,
    building spatial groups
      if ok and (intra_iteration_address_distance < stride*u)
        if (read,read) or (write,write)
          ok = analyze_and_build_spatial_groups()
        endif
      endif
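
The grouping condition above can be sketched in Python. This is a simplified model, not the vectorizer's implementation: the `Access` record and `build_spatial_groups` are hypothetical names, offsets and strides are counted in elements (so the unit u is taken to be 1), and two accesses are grouped only if they have the same kind, base array and stride and the whole group spans less than one stride.

```python
from collections import namedtuple

# kind is "read" or "write"; the access touches base[stride*i + offset]
Access = namedtuple("Access", "name kind base offset stride")

def build_spatial_groups(accesses):
    """Group (read,read) or (write,write) accesses that lie within one stride."""
    groups = []
    for acc in accesses:
        for group in groups:
            first = group[0]
            offsets = [a.offset for a in group] + [acc.offset]
            if (acc.kind == first.kind
                    and acc.base == first.base
                    and acc.stride == first.stride
                    and max(offsets) - min(offsets) < acc.stride):
                group.append(acc)
                break
        else:
            groups.append([acc])  # no matching group: start a new one
    for group in groups:
        group.sort(key=lambda a: a.offset)  # position within the group
    return groups

accesses = [
    Access("a0", "read", "a", 0, 4),
    Access("a3", "read", "a", 3, 4),
    Access("b0", "write", "b", 0, 2),
    Access("a1", "read", "a", 1, 4),
    Access("b1", "write", "b", 1, 2),
]
groups = build_spatial_groups(accesses)
group_names = [[a.name for a in g] for g in groups]
```

Here the three reads of `a` form one group with a gap at offset 2, and the two writes to `b` form a fully interleaved group.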

15
Augmenting the vectorizer, step 2/3: check
spatial groups
  • 6_analyze_consecutive_data_accesses already
    traversed each individual load/store to analyze
    its access pattern
  • Augment this traversal by
    • Allowing non-consecutive accesses
    • Building singleton groups for strided ungrouped
      load/stores
    • Checking for gaps and profitability of spatial
      groups
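
A gap check on a spatial group might look like the following Python sketch. The function names are hypothetical, offsets are in elements, and the rule that a store group with gaps is rejected is an assumption of this sketch (full-vector stores would overwrite the untouched positions); the slides do not spell out the exact policy. Profitability is not modelled here.

```python
def group_gaps(offsets, stride):
    """Positions (relative to the first member) that no access in the group covers."""
    first = min(offsets)
    used = {o - first for o in offsets}
    return [p for p in range(stride) if p not in used]

def group_is_supported(kind, offsets, stride):
    """Assumed policy: loads may have gaps, stores may not."""
    if kind == "write":
        return not group_gaps(offsets, stride)
    return True

# A singleton group for an ungrouped strided load: d - 1 gaps.
singleton_gaps = group_gaps([0], 4)
```

A load group with offsets 0, 1 and 3 at stride 4 has a single gap at position 2, so one of the extracted vectors is simply unused.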

16
Augmenting the vectorizer, step 3/3:
transformation
  • vect_transform_stmt generates vector code per
    scalar OP
  • Augment this by considering
    • If OP is a load/store in first position of a
      spatial group
      • generate d load/stores
      • handle their alignment according to the starting
        address
      • generate d log d extract/interleaves
    • If OP belongs to a spatial group, connect it to
      the appropriate extract/interleave according to
      its position
  • Unused extract/interleaves are discarded by
    subsequent DCE
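
For the store side, the d log d interleaves can be sketched in Python as follows. As before, registers are modelled as lists, "high" is assumed to mean the first half of a register, and `interleave_chain` is a hypothetical name; the sketch also counts the interleave operations it performs so the d log d figure can be checked.

```python
def interleave_high(v1, v2):
    """Interleave the first halves of v1 and v2 (assumed meaning of 'high')."""
    h = len(v1) // 2
    return [x for pair in zip(v1[:h], v2[:h]) for x in pair]

def interleave_low(v1, v2):
    """Interleave the second halves of v1 and v2 (assumed meaning of 'low')."""
    h = len(v1) // 2
    return [x for pair in zip(v1[h:], v2[h:]) for x in pair]

def interleave_chain(cols):
    """cols[j] holds the VF values destined for stride offset j.

    Returns (regs, ops): d registers in memory order, ready for d
    consecutive vector stores, and the number of interleaves used.
    """
    d = len(cols)
    assert d > 0 and d & (d - 1) == 0, "stride must be a power of 2"
    ops = 0
    step = 1
    while step < d:
        out = [None] * d
        for j in range(d // 2):
            out[2 * j] = interleave_high(cols[j], cols[j + d // 2])
            out[2 * j + 1] = interleave_low(cols[j], cols[j + d // 2])
            ops += 2
        cols = out
        step *= 2
    return cols, ops

# Stride d = 4, VF = 4: four result vectors, one per position in the group.
cols = [[0, 4, 8, 12], [1, 5, 9, 13], [2, 6, 10, 14], [3, 7, 11, 15]]
regs, ops = interleave_chain(cols)
flat = [x for r in regs for x in r]  # what the d vector stores would write
```

If one of the interleave results were never stored, it would have no uses, which is the situation the slide leaves to dead-code elimination.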

17
Performance: qualitative, VF/(1 + log d)
  d     VF=4   VF=8   VF=16
  1     4      8      16
  2     2      4      8
  4     1.3    2.6    5.3
  8     1      2      4
  16    0.8    1.6    3.2
  32    0.6    1.2    2.4
  • Vectorized code has d load/stores and (d log d)
    extract/interleaves
  • Scalar code has d*VF loads/stores
  • Performance improvement factor in number of
    load/store/extract/interleave operations is
    VF/(1 + log d)

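
The estimate follows from the two operation counts: the scalar code needs d*VF loads/stores, while the vectorized code needs d + d log d = d(1 + log d) operations, giving a ratio of VF/(1 + log d). A short Python sketch of the calculation, assuming log is base 2 (which matches the rows for d = 1, 2, 8 and 16); the function names are made up for this example.

```python
from math import log2

def vector_ops(d):
    """d load/stores plus d*log2(d) extract/interleaves."""
    return d + d * log2(d)

def scalar_ops(d, vf):
    """d*VF scalar loads/stores."""
    return d * vf

def improvement(vf, d):
    """Qualitative improvement factor VF / (1 + log2 d)."""
    return scalar_ops(d, vf) / vector_ops(d)

for d in (1, 2, 4, 8, 16, 32):
    print(d, [round(improvement(vf, d), 2) for vf in (4, 8, 16)])
```

A few of the table's entries (for example 1.2 and 2.4 at d = 32) differ slightly from what this formula gives (about 1.33 and 2.67), so the table is best read as qualitative.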
18
Performance: empirically (on PowerPC 970 with
Altivec)
  • Stride of 2 always provides speedups
  • Strides of 8, 16 suffer from increased code size,
    which turns off loop unrolling
  • Stride of 32 suffers from high register pressure
    (d+1)
  • If non-permute operations exist: speedups for
    all strides if VF >= 8

19
Performance: stride of 8 with gaps
  • Position of gaps affects the number of extracts
    (interleaves) needed
  • Improvement is observed even for a single strided
    access (VF=16 with arithmetic operations)

20
Performance - kernels
  • 4 groups: VF=4, 8, 16, 16-with-gaps
  • Strides prefix each kernel
  • Slowdown when doing only memory operations at
    VF=4, d=8

21
Future direction: towards loop-aware SLP
  • When building spatial groups, we consider
    distinct operations accessing adjacent/close
    addresses; this is the first step of building SLP
    chains
  • SLP looks for VF fully interleaved accesses,
    without gaps; this may require earlier loop unrolling
  • Next step is to consider the operations that use
    a spatial group of loads; if they're isomorphic,
    try to postpone the extracts
  • Analogous to handling alignment using zero-shift,
    lazy-shift, eager-shift policies

22
Conclusions
  • Existing SIMD targets supporting SIMpD can
    provide improved performance for important
    power-of-2 strided applications; don't be afraid
    of d > 2
  • Existing compiler loop-based auto-vectorization
    can be augmented efficiently to handle such
    strided accesses
  • This can serve as a first step towards combining
    traditional loop-based vectorization with
    (if-converted) basic-block vectorization (SLP)
  • This area of work is fertile;
    consider details (d, gaps, positions, VF,
    non-mem ops) for it not to be futile!

23
Questions
?