Transcript: Vectorization for Modern Architectures
1
Vectorization for Modern Architectures
Wednesday, September 23, 2009
  • Jaewook Shin

2
The Fastest Computer: BG/L
3
Overview
  • Introduction
  • What is / why SIMD?
  • Multimedia extensions
  • Automatic parallelization
  • Vectorization for vector machines
  • Superword-Level Parallelization
  • Issues
  • Conclusion

4
Multiple Levels of Parallelism
  • Data: SIMD
  • Instructions: ILP
  • Threads: Netscape
  • Processors: BG/L
  • Computers: Grid
5
Scalar vs. SIMD Operation
[Figure: a scalar add r1,r2,r3 computes r1 = r2 + r3 (e.g. 3 = 2 + 1) on a
single pair of operands; a SIMD add applies the same operation to every lane
of wide registers at once.]
6
Why SIMD ?
  • More parallelism
  • When parallelism is abundant
  • SIMD in addition to ILP
  • Simple design
  • Replicated functional units
  • Small die area
  • No heavily ported register files
  • Die area cost
  • MAX-2 (HP): 0.1% of the die
  • VIS (Sun): 3.0% of the die
  • Must be explicitly exposed to the hardware
  • By the compiler or by the programmer

7
Multimedia / Scientific Applications
  • Image
  • Graphics: 3D games, movies
  • Image recognition
  • Video encoding/decoding: JPEG, MPEG4
  • Sound
  • Encoding/decoding: IP phone, MP3
  • Speech recognition
  • Digital signal processing: cell phones
  • Scientific applications
  • Double-precision GEneral Matrix-Matrix
    Multiplication (DGEMM)
  • Y = aX + Y (SAXPY)

8
Characteristics of Multimedia Applications
  • Regular data access pattern
  • Data items are contiguous in memory
  • Short data types
  • 8, 16, 32 bits
  • Data streaming through a series of processing
    stages
  • Sometimes some temporal reuse for such data
    streams
  • Many constants
  • Short iteration counts
  • Requires saturation arithmetic

9
Multimedia Extensions
  • At the core of multimedia extensions
  • SIMD parallelism
  • Variable-sized data fields
  • Vector length = register width / type size

[Figure: one wide SIMD unit operating on variable-sized data fields]
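With the 128-bit registers of AltiVec or SSE2, this formula gives 16, 8, or 4
lanes for 8-, 16-, or 32-bit types. A minimal C sketch, assuming a 128-bit
register width:

    /* vector length = register width / element size; 128-bit registers assumed */
    #define REG_BITS 128
    #define VL(type) (REG_BITS / (8 * (int)sizeof(type)))
    /* VL(char) == 16, VL(short) == 8, VL(int) == 4, VL(float) == 4 */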
10
Multimedia Extension (cont.)
  • Additions to all major ISAs
  • Alignment
  • AltiVec: only aligned memory accesses
  • SSE: unaligned memory accesses are allowed but
    more expensive
  • Not all operations are directly supported
  • AltiVec: no division, no 32-bit integer
    multiplication

11
PIM DIVA (Data IntensiVe Architecture)
[Chip diagram:]
  • DRAM array (32 MB), DRAM page (2048 bits)
  • Scalar registers: 32 x 32b (128 B)
  • Wide registers: 32 x 256b (1 KB)
  • I-cache (4 KB)
  • Wide functional unit and scalar functional unit
12
2nd Generation DIVA PIM Chip
  • Fabrication technology
  • TSMC 0.18 µm
  • Size
  • 10.5mm x 11.5mm, 56.6 million transistors
  • Package
  • 35mm, 420 BGA
  • Status
  • 140MHz
  • 1W
  • Integrated in HP Longs Peak IA64 server

[Photos: BGA top and bottom views]
13
PIM-enhanced HP IA64 System
Two 2nd-generation PIM cards in an HP Longs Peak
IA64 host; the host could hold four boards with
eight PIM chips.
14
Supercomputer BlueGene/L: PowerPC 440 with Double FPU
15
Programming Multimedia Extensions
  • Language extension
  • Programming interface similar to function call
  • C built-in functions, Fortran intrinsics
  • Most native compilers support their own
    multimedia extensions
  • AltiVec: dst = vec_add(src1, src2)
  • SSE2: dst = _mm_add_ps(src1, src2)
  • GCC: -faltivec, -msse2
  • BG/L: dst = __fpadd(src1, src2)
  • No standard!
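For example, a minimal C sketch using the SSE intrinsics named above (the
function name vadd4 is ours; pointers are assumed 16-byte aligned):

    #include <xmmintrin.h>                       /* SSE intrinsics */

    /* add four packed floats with one SIMD add */
    void vadd4(float *dst, const float *src1, const float *src2)
    {
        __m128 a = _mm_load_ps(src1);            /* one wide load: src1[0..3] */
        __m128 b = _mm_load_ps(src2);
        _mm_store_ps(dst, _mm_add_ps(a, b));     /* dst[0..3] = src1[0..3] + src2[0..3] */
    }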

16
Programming Multimedia Extensions (cont.)
  • Library calls and (inline) assembly
  • Difficult to program
  • Not portable
  • Different extensions to the same ISA
  • MMX and SSE
  • SSE vs. 3DNow!
  • Need automatic compilation

17
Overview
  • Introduction
  • What is / why SIMD?
  • Multimedia extensions
  • Automatic parallelization
  • Vectorization for vector machines
  • Superword-Level Parallelization
  • Issues
  • Alignment
  • Strided memory access

18
Automatic Parallelization by Compiler
  • Two approaches
  • Targeting loops
  • Vector machines
  • Vectorization
  • Targeting basic blocks
  • Multimedia extensions
  • Superword-Level Parallelization

for (i=0; i<64; i++)
    a[i] = b[i] + c[i];

        |  compiler
        v

for (i=0; i<64; i+=4)
    a[i:i+3] = b[i:i+3] + c[i:i+3];
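Today this can be as simple as a compiler flag. A minimal sketch using GCC's
auto-vectorizer, available since GCC 4.0 (the file and function names are
hypothetical):

    /* $ gcc -O3 -ftree-vectorize -msse2 -c add64.c */
    void add64(float *a, const float *b, const float *c)
    {
        int i;
        for (i = 0; i < 64; i++)
            a[i] = b[i] + c[i];   /* contiguous, independent: vectorizable */
    }

In practice the compiler may also need restrict qualifiers or a run-time alias
check on a, b, and c before it will vectorize.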
19
Parallelizing Compilers
  • Research compilers
  • SLP compiler (MIT, Sam Larsen)
  • Vectorizing SUIF Compiler (U. Toronto, Derek De
    Vries)
  • Commercial compilers
  • Intel compiler
  • XL compilers for BG/L (IBM)
  • GCC 4.0.1
  • VAST/AltiVec (Crescent Bay Software)
  • Most compilers for multimedia extensions are
    based on conventional vectorization techniques.

20
Vectorization
  • Pros
  • Successful for vector computers
  • Large body of research
  • Cons
  • Involved transformations
  • Targets loop nests

21
Vectorization (cont.)
Scalar loop:

for (i=0; i<64; i++) {
    Rb = b[i]; Rc = c[i];
    Ra = Rb + Rc;
    a[i] = Ra;
}

Strip-mined by the vector length (4):

for (i=0; i<64; i+=4)
    for (j=0; j<4; j++) {
        Rb = b[i+j]; Rc = c[i+j];
        Ra = Rb + Rc;
        a[i+j] = Ra;
    }

Loop-distributed:

for (i=0; i<64; i+=4) {
    for (j=0; j<4; j++) Rb[j] = b[i+j];
    for (j=0; j<4; j++) Rc[j] = c[i+j];
    for (j=0; j<4; j++) Ra[j] = Rb[j] + Rc[j];
    for (j=0; j<4; j++) a[i+j] = Ra[j];
}

Vector code:

vector int Va, Vb, Vc;
for (i=0; i<64; i+=4) {
    Vb = b[i:i+3]; Vc = c[i:i+3];
    Va = Vb + Vc;
    a[i:i+3] = Va;
}
22
Superword-Level Parallelism (SLP)
  • Fine grain SIMD parallelism in aggregate data
    objects larger than a machine word

23
1. Independent ALU Ops
R = R + XR * 1.08327;
G = G + XG * 1.89234;
B = B + XB * 1.29835;
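A hedged SSE sketch of how these three independent updates become one SIMD
multiply and one SIMD add (the function name and in/out convention are ours;
the fourth lane is unused padding):

    #include <xmmintrin.h>

    void update_rgb(float *R, float *G, float *B, float XR, float XG, float XB)
    {
        __m128 rgb = _mm_set_ps(0.0f, *B, *G, *R);            /* pack (a cost) */
        __m128 x   = _mm_set_ps(0.0f, XB, XG, XR);
        __m128 k   = _mm_set_ps(0.0f, 1.29835f, 1.89234f, 1.08327f);
        rgb = _mm_add_ps(rgb, _mm_mul_ps(x, k));  /* 1 mul + 1 add replace 6 scalar ops */
        float t[4];
        _mm_storeu_ps(t, rgb);                                /* unpack (a cost) */
        *R = t[0]; *G = t[1]; *B = t[2];
    }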
24
2. Adjacent Memory References
R = R + X[i+0];
G = G + X[i+1];
B = B + X[i+2];
25
3. Vectorizable Loops
for (i=0; i<100; i+=1)
    A[i+0] = A[i+0] + B[i+0];
26
3. Vectorizable Loops
for (i=0; i<100; i+=4) {
    A[i+0] = A[i+0] + B[i+0];
    A[i+1] = A[i+1] + B[i+1];
    A[i+2] = A[i+2] + B[i+2];
    A[i+3] = A[i+3] + B[i+3];
}
27
4. Partially Vectorizable Loops
for (i=0; i<16; i+=1) {
    L = A[i+0] - B[i+0];
    D = D + abs(L);
}
28
4. Partially Vectorizable Loops
for (i=0; i<16; i+=2) {
    L = A[i+0] - B[i+0];
    D = D + abs(L);
    L = A[i+1] - B[i+1];
    D = D + abs(L);
}
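A hedged SSE sketch of this loop after superword parallelization: the subtract
runs as one SIMD op, while abs() and the accumulation into D remain scalar
after unpacking (the function name sad16 is ours):

    #include <xmmintrin.h>
    #include <math.h>

    float sad16(const float *A, const float *B)
    {
        float D = 0.0f;
        int i, j;
        for (i = 0; i < 16; i += 4) {
            __m128 L4 = _mm_sub_ps(_mm_loadu_ps(&A[i]),
                                   _mm_loadu_ps(&B[i]));   /* vectorized part */
            float l[4];
            _mm_storeu_ps(l, L4);                          /* unpack */
            for (j = 0; j < 4; j++)
                D += fabsf(l[j]);                          /* scalar part */
        }
        return D;
    }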
29
Exploiting SLP with SIMD Execution
  • Benefit
  • Multiple ALU ops → one SIMD op
  • Multiple ld/st ops → one wide memory op

30
Exploiting SLP with SIMD Execution
  • Benefit
  • Multiple ALU ops → one SIMD op
  • Multiple ld/st ops → one wide memory op
  • Cost
  • Packing and unpacking
  • Reshuffling within a register

31
Packing/Unpacking Costs
C = A + 2;
D = B + 3;
32
Packing/Unpacking Costs
  • Packing source operands

A = f();
B = g();
C = A + 2;
D = B + 3;

33
Packing/Unpacking Costs
  • Packing source operands
  • Unpacking destination operands

A = f();
B = g();
C = A + 2;
D = B + 3;

E = C / 5;
F = D * 7;
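A hedged SSE sketch making both costs visible (f, g, and the function name are
placeholders):

    #include <xmmintrin.h>

    float f(void); float g(void);            /* scalar producers (assumed) */

    void packing_costs(float *E, float *F)
    {
        float A = f(), B = g();
        __m128 ab = _mm_set_ps(0, 0, B, A);  /* packing cost: build (A, B) */
        __m128 cd = _mm_add_ps(ab, _mm_set_ps(0, 0, 3.0f, 2.0f)); /* (C,D)=(A,B)+(2,3) */
        float t[4];
        _mm_storeu_ps(t, cd);                /* unpacking cost: spill (C, D) */
        *E = t[0] / 5.0f;                    /* scalar consumers force the unpack */
        *F = t[1] * 7.0f;
    }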
34
Optimizing Program Performance
  • To achieve the best speedup
  • Maximize parallelization
  • Minimize packing/unpacking

35
Optimizing Program Performance
  • To achieve the best speedup
  • Maximize parallelization
  • Minimize packing/unpacking
  • Many packing possibilities
  • Worst case: n ops → n! configurations
  • Different cost/benefit for each choice

36
Observation 1: Packing Costs Can Be Amortized
  • Use packed result operands

A = B + C
D = E + F
G = A - H
I = D - J
37
Observation 1: Packing Costs Can Be Amortized
  • Use packed result operands
  • Share packed source operands

A = B + C        A = B + C
D = E + F        D = E + F
G = A - H        G = B + H
I = D - J        I = E + J

Left: the packed result (A, D) is reused as a source.
Right: the packed source (B, E) is packed once and shared.
38
Observation 2: Adjacent Memory is Key
  • Large potential performance gains
  • Eliminate ld/st instructions
  • Few packing possibilities
  • Only one ordering exploits pre-packing

39-45
SLP Extraction Algorithm
  • Identify adjacent memory references
  • Follow def-use chains
  • Follow use-def chains

(Slides 39-45 step through the example below, highlighting the statements
packed at each stage.)

A = X[i+0]
C = E * 3
B = X[i+1]
H = C - A
D = F * 5
J = D - B
46
SLP vs. Vector Parallelism

[Diagram: SLP packs isomorphic statements within a basic block]
47
SLP vs. Vector Parallelism
[Diagram: vector parallelism packs the same operation across loop iterations]
48
SLP vs. Vector Parallelism
  • Extracted with a simple analysis
  • SLP is fine grain → basic blocks
  • Superset of vector parallelism
  • Unrolling transforms VP to SLP
  • Handles partially vectorizable loops

49
SLP vs. Vector Parallelism
[Chart: percentage of dynamic SUIF instructions eliminated, SLP vs. vectorization]
50
SLP vs. ILP
  • Subset of instruction level parallelism
  • SIMD hardware is simpler
  • SIMD instructions are more compact
  • Reduces instruction fetch bandwidth

51
SLP and ILP
  • SLP and ILP can be exploited together
  • Many architectures can already do this
  • ex) MPC7450
  • 2 vector integer units, 1 vector FPU, 1 vector
    permute unit
  • Mix well with scalar instructions
  • SLP and ILP may compete
  • Occurs when parallelism is scarce
  • Unroll the loop more times
  • When ILP is due to loop level parallelism

52
Issues
  • Alignment
  • Strided memory access
  • Cache misses
  • Control flow
  • Profitability model
  • True data dependence
  • Function calls
  • Inlined assembly code
  • Indirect memory accesses
  • Loop bounds varying during run time

53
Overall Improvements Over Scalar (small)
54
Conclusion
  • SIMD parallelism cannot be replaced by other
    types of parallelism.
  • SLP suits the requirements of modern SIMD
    architectures better than the conventional
    technique.
  • Research opportunities to exploit SIMD
    parallelism

55
Backup Slides
  • Alignment
  • Strided memory access
  • More on vectorization

56
Alignment
  • Most multimedia extensions require aligned memory
    accesses.
  • What is an aligned memory access?
  • A memory access is aligned to a 16-byte boundary
    if the address is a multiple of 16.
  • Ex) For 16-byte memory accesses in AltiVec, the
    last 4 bits of the address are ignored.
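A minimal C sketch of this check (the function name is ours):

    #include <stdint.h>

    /* aligned to a 16-byte boundary: the low 4 address bits are zero */
    int is_aligned16(const void *p)
    {
        return ((uintptr_t)p & 0xF) == 0;
    }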

57
Alignment Code Generation
  • Aligned memory access
  • The address is always a multiple of 16 bytes
  • Just one superword load or store instruction

float a[64];
for (i=0; i<64; i+=4)
    Va = a[i:i+3];

58
Alignment Code Generation (cont.)
  • Misaligned memory access
  • The address is always a non-zero constant offset
    away from the 16-byte boundaries.
  • Static alignment For a misaligned load, issue
    two adjacent aligned loads followed by a merge.

float a[64];
for (i=0; i<60; i+=4)
    Va = a[i+2 : i+5];
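A hedged x86 sketch of the two-aligned-loads-plus-merge pattern for a constant
offset of two floats; on AltiVec the merge would be a vec_perm, here we use
SSSE3's byte-wise align (the function name is ours):

    #include <tmmintrin.h>                       /* SSSE3: _mm_alignr_epi8 */

    /* load a[i+2..i+5] when a+i is 16-byte aligned */
    __m128 load_offset2(const float *a, int i)
    {
        __m128i lo = _mm_load_si128((const __m128i *)(a + i));      /* a[i..i+3]   */
        __m128i hi = _mm_load_si128((const __m128i *)(a + i + 4));  /* a[i+4..i+7] */
        return _mm_castsi128_ps(_mm_alignr_epi8(hi, lo, 8));        /* bytes 8..23 */
    }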

59
Alignment Code Generation (cont.)
  • Unaligned memory access
  • The offset from the 16-byte boundaries varies, or
    not enough information is available.
  • Dynamic alignment The merging point is computed
    during run time.

float a[64];
for (i=0; i<60; i++)
    Va = a[i:i+3];
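A hedged AltiVec sketch of dynamic alignment: vec_ld ignores the low four
address bits, and vec_lvsl computes the merge permutation at run time (the
function name is ours):

    #include <altivec.h>

    /* unaligned load of a[i..i+3]: two aligned loads merged at run time */
    vector float load_unaligned(const float *a, int i)
    {
        vector float lo = vec_ld(0, a + i);            /* aligned load at/below */
        vector float hi = vec_ld(16, a + i);           /* next aligned quadword */
        vector unsigned char pv = vec_lvsl(0, a + i);  /* run-time permute vector */
        return vec_perm(lo, hi, pv);                   /* merge point from the address */
    }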

60
Alignment Analysis
  • Iterative data flow analysis

[Lattice diagram: congruence classes of the form a·n + b, with ⊤ at the top,
8n+0 ... 8n+7 below it, then 4n+0 ... 4n+3, then 2n+0 and 2n+1, and n+0 at the
bottom; e.g. 8n+0 and 8n+4 meet at 4n+0, and 8n+0 and 4n+2 meet at 2n+0.]
61
Transfer Functions
For congruences of the form a·n + b:

  Add:      a = gcd(a1, a2);                  b = (b1 + b2) mod a
  Subtract: a = gcd(a1, a2);                  b = (b1 - b2) mod a
  Multiply: a = gcd(a1·a2, a1·b2, a2·b1, C);  b = (b1·b2) mod a
  Meet:     a = gcd(a1, a2, b1 - b2);         b = b1 mod a
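A minimal C sketch of the meet operation under the definitions above (the
struct and function names are ours):

    #include <stdlib.h>

    /* a congruence class a*n + b, with 0 <= b < a */
    typedef struct { int a, b; } Congruence;

    static int gcd(int x, int y) { while (y) { int t = x % y; x = y; y = t; } return x; }

    Congruence meet(Congruence c1, Congruence c2)
    {
        Congruence r;
        r.a = gcd(gcd(c1.a, c2.a), abs(c1.b - c2.b));
        r.b = r.a ? c1.b % r.a : c1.b;    /* guard the degenerate a == 0 case */
        return r;
    }
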
62
Strided Memory Accesses: Packing Through Memory

w = ((float *)a)[0];
x = ((float *)b)[0];
y = ((float *)c)[0];
z = ((float *)d)[0];
((float *)p)[0] = w;
((float *)p)[1] = x;
((float *)p)[2] = y;
((float *)p)[3] = z;

[Diagram: the four scalars w, x, y, z pass through memory into the wide register at p]
63-66
Strided Memory Accesses: Packing in Registers

Packing through memory:

w = ((float *)a)[0];
x = ((float *)b)[0];
y = ((float *)c)[0];
z = ((float *)d)[0];
((float *)p)[0] = w;
((float *)p)[1] = x;
((float *)p)[2] = y;
((float *)p)[3] = z;

Packing in registers:

temp1 = replicate(a, 0);
temp2 = replicate(b, 0);
temp3 = replicate(c, 0);
temp4 = replicate(d, 0);
p = shift_and_load(p, temp1);
p = shift_and_load(p, temp2);
p = shift_and_load(p, temp3);
p = shift_and_load(p, temp4);

(Slides 63-66 animate the register version one replicate/shift_and_load step
at a time.)
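replicate and shift_and_load above are DIVA-style wide-register operations. A
hedged SSE analogue that packs the four scalars entirely in registers, using
shuffles instead (the function name is ours):

    #include <xmmintrin.h>

    /* pack w, x, y, z from four arrays into one register, no memory round-trip */
    __m128 pack4(const float *a, const float *b, const float *c, const float *d)
    {
        __m128 va = _mm_load_ss(a);            /* (w, 0, 0, 0) */
        __m128 vb = _mm_load_ss(b);
        __m128 vc = _mm_load_ss(c);
        __m128 vd = _mm_load_ss(d);
        __m128 wx = _mm_unpacklo_ps(va, vb);   /* (w, x, 0, 0) */
        __m128 yz = _mm_unpacklo_ps(vc, vd);   /* (y, z, 0, 0) */
        return _mm_movelh_ps(wx, yz);          /* (w, x, y, z) */
    }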
67
Strided Memory Accesses: Power-of-2 Interleaved Data
  • dst = vperm(src1, src2, index_vector)
  • Fully interleaved data require d·log d steps, where
    d is the interleaving factor.
  • Dorit Nuzman et al., PLDI 2006

[Diagram: vperm gathers bytes from the concatenation of src1 and src2 into dst]
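A hedged AltiVec sketch of one such permute step, gathering the even-indexed
floats of a stride-2 stream (the index constants and function name are ours):

    #include <altivec.h>

    /* one vperm step: gather elements 0,2 of src1 and 0,2 of src2 */
    vector float gather_even(vector float src1, vector float src2)
    {
        const vector unsigned char idx =
            { 0,1,2,3,  8,9,10,11,  16,17,18,19,  24,25,26,27 };
        return vec_perm(src1, src2, idx);
    }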
68
Vectorization: Dependence Issues
  • True cyclic dependences of distance smaller than
    the vector length can prevent vectorization.

Scalar loop (carries a dependence of distance 2):

for (i=4; i<64; i++) {
    Ra = a[i-2];
    Rb = b[i];
    Ra = Ra + Rb;
    a[i] = Ra;
}

Vector length 4 (invalid: reads a[i..i+1] before this iteration writes them):

for (i=4; i<64; i+=4) {
    Va = a[i-2 : i+1];
    Vb = b[i : i+3];
    Va = Va + Vb;
    a[i : i+3] = Va;
}

Vector length 2 (valid: every read sees values written by earlier iterations):

for (i=4; i<64; i+=2) {
    Va = a[i-2 : i-1];
    Vb = b[i : i+1];
    Va = Va + Vb;
    a[i : i+1] = Va;
}
69
Vectorization: Involved Transformations
  • Loop interchange
  • Vectorize one of the outer loops
  • Choose loop with larger iteration count
  • Loop distribution
  • Partial vectorization
  • Node split
  • Break cyclic dependences with an anti-dependence
  • And more
  • Loop skewing, loop peeling, index set splitting,
    reduction recognition, ...

70
References
  • S. Larsen and S. Amarasinghe, "Exploiting
    Superword Level Parallelism with Multimedia
    Instruction Sets," PLDI 2000.
  • S. Larsen, E. Witchel, and S. Amarasinghe,
    "Increasing and Detecting Memory Address
    Congruence," PACT 2002.
  • J. Shin, M. Hall, and J. Chame, "Compiler-Controlled
    Caching in Superword Register Files for
    Multimedia Extension Architectures," PACT 2002.
  • D. Nuzman, I. Rosen, and A. Zaks, "Auto-Vectorization
    of Interleaved Data for SIMD," PLDI 2006.