Title: Vectorization for Modern Architectures
1. Vectorization for Modern Architectures
Wednesday, September 23, 2009
2. The Fastest Computer: BG/L
3. Overview
- Introduction
- What is / why SIMD?
- Multimedia extensions
- Automatic parallelization
- Vectorization for vector machines
- Superword-Level Parallelization
- Issues
- Conclusion
4. Multiple Levels of Parallelism
- Data: SIMD
- Instructions: ILP
- Threads: e.g., Netscape
- Processors: e.g., BG/L
- Computers: e.g., Grid
5. Scalar vs. SIMD Operation
[Figure: a scalar "add r1,r2,r3" produces a single result (here r1 = r2 + r3: 3 = 2 + 1), while a SIMD add operates on every element of its wide registers in one instruction.]
6. Why SIMD?
- More parallelism
- When parallelism is abundant
- SIMD in addition to ILP
- Simple design
- Replicated functional units
- Small die area
- No heavily ported register files
- Die area
- MAX-2 (HP): 0.1%
- VIS (Sun): 3.0%
- Must be explicitly exposed to the hardware
- By the compiler or by the programmer
7. Multimedia / Scientific Applications
- Image
- Graphics: 3D games, movies
- Image recognition
- Video encoding/decoding: JPEG, MPEG4
- Sound
- Encoding/decoding: IP phone, MP3
- Speech recognition
- Digital signal processing: cell phones
- Scientific applications
- Double-precision GEneral Matrix-Matrix multiplication (DGEMM)
- Y = aX + Y (SAXPY); see the scalar sketch below
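For reference, scalar SAXPY in C (names are illustrative); each element is an independent multiply-add, exactly the shape SIMD hardware wants:

/* SAXPY: y = a*x + y, single precision */
void saxpy(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}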
8. Characteristics of Multimedia Applications
- Regular data access pattern
- Data items are contiguous in memory
- Short data types
- 8, 16, 32 bits
- Data streaming through a series of processing stages
- Sometimes some temporal reuse for such data streams
- Many constants
- Short iteration counts
- Requires saturation arithmetic (see the sketch below)
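A one-line illustration of saturating arithmetic in C; multimedia ISAs provide this clamping in hardware so no explicit test is needed:

#include <stdint.h>

/* 8-bit unsigned saturating add: clamps at 255 instead of
   wrapping around the way ordinary modular addition does. */
uint8_t sat_add_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b;
    return sum > 255 ? 255 : (uint8_t)sum;
}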
9. Multimedia Extensions
- At the core of multimedia extensions
- SIMD parallelism
- Variable-sized data fields
- Vector length = register width / type size (e.g., a 128-bit register holds four 32-bit, eight 16-bit, or sixteen 8-bit elements)
10. Multimedia Extensions (cont.)
- Additions to all major ISAs
- Alignment
- AltiVec: only aligned memory accesses
- SSE: unaligned memory accesses are more expensive
- Not all operations are directly supported
- AltiVec: no division, no 32-bit integer multiplication
11. PIM: DIVA (Data IntensiVe Architecture)
[Block diagram of a DIVA PIM node:]
- DRAM Array (32 MB)
- DRAM Page (2048 bits)
- Scalar registers, 32 x 32b (128 B)
- Wide registers, 32 x 256b (1 KB)
- I-Cache (4 KB)
- Wide functional unit
- Scalar functional unit
12. 2nd-Generation DIVA PIM Chip
- Fabrication technology
- TSMC 0.18 µm
- Size
- 10.5mm x 11.5mm, 56.6 million transistors
- Package
- 35 mm, 420-pin BGA
- Status
- 140MHz
- 1W
- Integrated in HP Longs Peak IA64 server
[Photos: BGA top and bottom views]
13. PIM-enhanced HP IA64 System
Two 2nd-generation PIM cards in an HP Longs Peak IA64 host. The host could hold four boards with eight PIM chips.
14. Supercomputer BlueGene/L: PowerPC 440 with Double FPU
15. Programming Multimedia Extensions
- Language extension
- Programming interface similar to function calls
- C built-in functions, Fortran intrinsics
- Most native compilers support their own multimedia extensions (sketch below)
- AltiVec: dst = vec_add(src1, src2)
- SSE2: dst = _mm_add_ps(src1, src2)
- GCC: -faltivec, -msse2
- BG/L: dst = __fpadd(src1, src2)
- No standard!
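A minimal sketch of this intrinsic style in C, using the SSE _mm_add_ps named above (the function, array names, and alignment assumptions are illustrative; AltiVec's vec_add is used the same way):

#include <xmmintrin.h>  /* SSE; compile with e.g. -msse */

/* dst[i] = src1[i] + src2[i], four floats per iteration.
   Assumes n is a multiple of 4 and 16-byte-aligned arrays. */
void vec_add_f32(float *dst, const float *src1, const float *src2, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 a = _mm_load_ps(&src1[i]);        /* load 4 floats */
        __m128 b = _mm_load_ps(&src2[i]);
        _mm_store_ps(&dst[i], _mm_add_ps(a, b)); /* one SIMD add  */
    }
}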
16. Programming Multimedia Extensions (cont.)
- Library calls and (inline) assembly
- Difficult to program
- Not portable
- Different extensions to the same ISA
- MMX and SSE
- SSE vs. 3DNow!
- Need automatic compilation
17. Overview
- Introduction
- What is / why SIMD?
- Multimedia extensions
- Automatic parallelization
- Vectorization for vector machines
- Superword-Level Parallelization
- Issues
- Alignment
- Strided memory access
18. Automatic Parallelization by Compiler
- Two approaches
- Targeting loops
- Vector machines
- Vectorization
- Targeting basic blocks
- Multimedia extensions
- Superword-Level Parallelization
for (i = 0; i < 64; i++)
    a[i] = b[i] + c[i];

    -- compiler -->

for (i = 0; i < 64; i += 4)
    a[i:i+3] = b[i:i+3] + c[i:i+3];
19. Parallelizing Compilers
- Research compilers
- SLP compiler (MIT, Sam Larsen)
- Vectorizing SUIF Compiler (U. Toronto, Derek De Vries)
- Commercial compilers
- Intel compiler
- XL compilers for BG/L (IBM)
- GCC 4.0.1
- VAST/AltiVec (Crescent Bay Software)
- Most compilers for multimedia extensions are based on conventional vectorization techniques.
20. Vectorization
- Pros
- Successful for vector computers
- Large body of research
- Cons
- Involved transformations
- Targets loop nests
21. Vectorization (cont.)
Scalar loop:
for (i = 0; i < 64; i++) {
    Rb = b[i]; Rc = c[i]; Ra = Rb + Rc; a[i] = Ra;
}
After strip-mining:
for (i = 0; i < 64; i += 4)
    for (j = 0; j < 4; j++) {
        Rb = b[i+j]; Rc = c[i+j]; Ra = Rb + Rc; a[i+j] = Ra;
    }
After loop distribution:
for (i = 0; i < 64; i += 4) {
    for (j = 0; j < 4; j++) Rb[j] = b[i+j];
    for (j = 0; j < 4; j++) Rc[j] = c[i+j];
    for (j = 0; j < 4; j++) Ra[j] = Rb[j] + Rc[j];
    for (j = 0; j < 4; j++) a[i+j] = Ra[j];
}
Vector code:
vector int Va, Vb, Vc;
for (i = 0; i < 64; i += 4) {
    Vb = b[i:i+3]; Vc = c[i:i+3]; Va = Vb + Vc; a[i:i+3] = Va;
}
22. Superword-Level Parallelism (SLP)
- Fine-grain SIMD parallelism in aggregate data objects larger than a machine word
23. 1. Independent ALU Ops
R = R + XR * 1.08327
G = G + XG * 1.89234
B = B + XB * 1.29835
24. 2. Adjacent Memory References
R = R + X[i+0]
G = G + X[i+1]
B = B + X[i+2]
25. 3. Vectorizable Loops
for (i = 0; i < 100; i += 1)
    A[i+0] = A[i+0] + B[i+0];
26. 3. Vectorizable Loops (unrolled four times)
for (i = 0; i < 100; i += 4) {
    A[i+0] = A[i+0] + B[i+0];
    A[i+1] = A[i+1] + B[i+1];
    A[i+2] = A[i+2] + B[i+2];
    A[i+3] = A[i+3] + B[i+3];
}
27. 4. Partially Vectorizable Loops
for (i = 0; i < 16; i += 1) {
    L = A[i+0] - B[i+0];
    D = D + abs(L);
}
28. 4. Partially Vectorizable Loops (unrolled twice)
for (i = 0; i < 16; i += 2) {
    L = A[i+0] - B[i+0];
    D = D + abs(L);
    L = A[i+1] - B[i+1];
    D = D + abs(L);
}
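A hedged sketch of where this loop lands on SIMD hardware: the subtractions pack into one vector op, while the abs() accumulation into the scalar D stays sequential (names, widths, and the SSE lowering are illustrative, not from the slides):

#include <math.h>
#include <xmmintrin.h>

/* Vector subtract + scalar reduction, 4 elements per iteration.
   Assumes A and B are 16-byte aligned and n is a multiple of 4. */
float sum_abs_diff(const float *A, const float *B, int n)
{
    float D = 0.0f;
    for (int i = 0; i < n; i += 4) {
        __m128 vl = _mm_sub_ps(_mm_load_ps(&A[i]), _mm_load_ps(&B[i]));
        float L[4];
        _mm_storeu_ps(L, vl);    /* unpack the packed differences */
        for (int j = 0; j < 4; j++)
            D += fabsf(L[j]);    /* this part remains scalar */
    }
    return D;
}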
29. Exploiting SLP with SIMD Execution
- Benefit
- Multiple ALU ops → one SIMD op
- Multiple ld/st ops → one wide memory op
30. Exploiting SLP with SIMD Execution
- Benefit
- Multiple ALU ops → one SIMD op
- Multiple ld/st ops → one wide memory op
- Cost
- Packing and unpacking
- Reshuffling within a register
31. Packing/Unpacking Costs
C = A + 2
D = B + 3
32. Packing/Unpacking Costs
A = f()
B = g()
C = A + 2
D = B + 3
33. Packing/Unpacking Costs
- Packing source operands
- Unpacking destination operands
A = f()
B = g()
C = A + 2
D = B + 3
E = C / 5
F = D * 7
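A hedged SSE sketch of these costs: moving the scalar results of f() and g() into vector lanes (packing) and extracting C and D again (unpacking) each cost extra shuffle/move/memory instructions around the single SIMD add (this lowering is illustrative, not the slides'):

#include <xmmintrin.h>

/* C = A + 2 and D = B + 3 with one SIMD add, paying explicit
   pack (_mm_set_ps) and unpack (store + extract) overhead. */
void packed_add(float A, float B, float *C, float *D)
{
    __m128 v = _mm_set_ps(0.0f, 0.0f, B, A);       /* pack scalars      */
    __m128 k = _mm_set_ps(0.0f, 0.0f, 3.0f, 2.0f);
    float out[4];
    _mm_storeu_ps(out, _mm_add_ps(v, k));          /* unpack via memory */
    *C = out[0];
    *D = out[1];
}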
34. Optimizing Program Performance
- To achieve the best speedup
- Maximize parallelization
- Minimize packing/unpacking
35. Optimizing Program Performance
- To achieve the best speedup
- Maximize parallelization
- Minimize packing/unpacking
- Many packing possibilities
- Worst case: n ops → n! packing configurations
- Different cost/benefit for each choice
36. Observation 1: Packing Costs Can Be Amortized
- Use packed result operands
A = B + C
D = E + F
G = A - H
I = D - J
37. Observation 1: Packing Costs Can Be Amortized
- Use packed result operands:
A = B + C
D = E + F
G = A - H
I = D - J
- Share packed source operands:
A = B + C
D = E + F
G = B + H
I = E + J
38. Observation 2: Adjacent Memory Is Key
- Large potential performance gains
- Eliminate ld/st instructions
- Few packing possibilities
- Only one ordering exploits pre-packing
39. SLP Extraction Algorithm
- Identify adjacent memory references
A = X[i+0]
C = E * 3
B = X[i+1]
H = C - A
D = F * 5
J = D - B
[Slides 40-45 step through this example: the adjacent loads X[i+0] and X[i+1] seed the pack <A,B>, and following the def-use chains then packs <H,J> and <C,D>. A toy version of the procedure follows below.]
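A toy transcription of that seed-then-extend structure in C, hard-coded to the example above. It is only a sketch: Larsen's real algorithm also checks independence, considers both operand positions, and iterates to a fixed point.

#include <stdio.h>

/* dst = src1 op src2; op 'L' marks a load dst = X[i+off], off in src2. */
typedef struct { char dst, src1, src2, op; } Stmt;

/* A = X[i+0]; C = E*3; B = X[i+1]; H = C-A; D = F*5; J = D-B */
static const Stmt s[] = {
    {'A','X', 0 ,'L'}, {'C','E','3','*'}, {'B','X', 1 ,'L'},
    {'H','C','A','-'}, {'D','F','5','*'}, {'J','D','B','-'},
};
enum { N = 6 };

static int uses(const Stmt *t, char v) { return t->src1 == v || t->src2 == v; }

int main(void)
{
    int pa[N], pb[N], np = 0;

    /* 1. Seed packs from adjacent memory references. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (s[i].op == 'L' && s[j].op == 'L' &&
                s[i].src1 == s[j].src1 && s[j].src2 == s[i].src2 + 1) {
                pa[np] = i; pb[np] = j; np++;
            }

    /* 2. Follow def-use chains: pair same-opcode statements that
       consume the two results of an existing pack. */
    for (int p = 0; p < np; p++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (i != j && s[i].op == s[j].op && s[i].op != 'L' &&
                    uses(&s[i], s[pa[p]].dst) && uses(&s[j], s[pb[p]].dst)) {
                    pa[np] = i; pb[np] = j; np++;
                }

    /* 3. Follow use-def chains: pair the statements that define the
       first operands of an existing pack. */
    for (int p = 0; p < np; p++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (i != j && s[i].op == s[j].op &&
                    s[i].dst == s[pa[p]].src1 && s[j].dst == s[pb[p]].src1) {
                    pa[np] = i; pb[np] = j; np++;
                }

    for (int p = 0; p < np; p++)   /* prints <A,B>, <H,J>, <C,D> */
        printf("pack: <%c, %c>\n", s[pa[p]].dst, s[pb[p]].dst);
    return 0;
}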
46. SLP vs. Vector Parallelism
[Figure: SLP packs isomorphic statements within a basic block.]
47. SLP vs. Vector Parallelism
[Figure: vector parallelism spans loop iterations.]
48. SLP vs. Vector Parallelism
- Extracted with a simple analysis
- SLP is fine grain → basic blocks
- Superset of vector parallelism
- Unrolling transforms VP to SLP
- Handles partially vectorizable loops
49. SLP vs. Vector Parallelism
[Chart: dynamic SUIF instructions eliminated]
50. SLP vs. ILP
- Subset of instruction level parallelism
- SIMD hardware is simpler
- SIMD instructions are more compact
- Reduces instruction fetch bandwidth
51. SLP and ILP
- SLP and ILP can be exploited together
- Many architectures can already do this
- ex) MPC7450
- 2 vector integer units, 1 vector FPU, 1 vector permute unit
- Mix well with scalar instructions
- SLP and ILP may compete
- Occurs when parallelism is scarce
- Unroll the loop more times when the ILP comes from loop-level parallelism
52. Issues
- Alignment
- Strided memory access
- Cache misses
- Control flow
- Profitability model
- True data dependence
- Function calls
- Inlined assembly code
- Indirect memory accesses
- Loop bounds that vary at run time
53. Overall Improvements Over Scalar (small)
[Chart: measured speedups over scalar code]
54. Conclusion
- SIMD parallelism cannot be replaced by other types of parallelism.
- SLP suits the requirements of modern SIMD architectures better than the conventional technique.
- Research opportunities remain in exploiting SIMD parallelism.
55. Backup Slides
- Alignment
- Strided memory access
- More on vectorization
56. Alignment
- Most multimedia extensions require aligned memory accesses.
- What is an aligned memory access?
- A memory access is aligned to a 16-byte boundary if the address is a multiple of 16.
- Ex) For 16-byte memory accesses in AltiVec, the last 4 bits of the address are ignored (see the check below).
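A small, self-contained illustration of that definition (C11 aligned_alloc is an assumption about the build environment, not from the slides):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* 16-byte-aligned block, as SIMD loads and stores expect */
    float *a = aligned_alloc(16, 64 * sizeof(float));

    /* aligned to 16 bytes iff the low 4 address bits are zero */
    printf("&a[0] aligned: %d\n", ((uintptr_t)&a[0] & 15u) == 0);  /* 1 */
    printf("&a[2] aligned: %d\n", ((uintptr_t)&a[2] & 15u) == 0);  /* 0 */
    free(a);
    return 0;
}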
57. Alignment Code Generation
- Aligned memory access
- The address is always a multiple of 16 bytes
- Just one superword load or store instruction
float a[64];
for (i = 0; i < 64; i += 4)
    Va = a[i:i+3];
58. Alignment Code Generation (cont.)
- Misaligned memory access
- The address is always a non-zero constant offset away from the 16-byte boundaries.
- Static alignment: for a misaligned load, issue two adjacent aligned loads followed by a merge (sketched below).
float a[64];
for (i = 0; i < 60; i += 4)
    Va = a[i+2:i+5];
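One possible lowering of the two-loads-plus-merge scheme, sketched with SSE/SSSE3 intrinsics (the slides target AltiVec; _mm_alignr_epi8 stands in for its merge, and the helper is illustrative):

#include <tmmintrin.h>  /* SSSE3: _mm_alignr_epi8 (palignr) */

/* Va = a[i+2 : i+5], assuming a is 16-byte aligned and i % 4 == 0:
   two aligned loads, then one merge with a constant 8-byte shift. */
__m128 load_off2(const float *a, int i)
{
    __m128i lo = _mm_load_si128((const __m128i *)&a[i]);      /* a[i..i+3]   */
    __m128i hi = _mm_load_si128((const __m128i *)&a[i + 4]);  /* a[i+4..i+7] */
    return _mm_castsi128_ps(_mm_alignr_epi8(hi, lo, 8));      /* a[i+2..i+5] */
}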
59. Alignment Code Generation (cont.)
- Unaligned memory access
- The offset from the 16-byte boundaries varies, or not enough information is available.
- Dynamic alignment: the merging point is computed at run time (sketched below).
float a[64];
for (i = 0; i < 60; i++)
    Va = a[i:i+3];
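The classic AltiVec sequence for this case, as a sketch (vec_lvsl, vec_ld, and vec_perm are real AltiVec intrinsics; the wrapper is illustrative, and note it reads one quadword past p when p happens to be aligned):

#include <altivec.h>

/* Load 4 floats from an arbitrarily aligned p: the permute mask
   comes from the run-time address, so the merge point is dynamic. */
vector float load_unaligned(const float *p)
{
    vector unsigned char mask = vec_lvsl(0, p);  /* shift from low addr bits */
    vector float lo = vec_ld(0, p);              /* aligned load at/below p  */
    vector float hi = vec_ld(16, p);             /* next aligned quadword    */
    return vec_perm(lo, hi, mask);               /* run-time merge           */
}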
60. Alignment Analysis
- Iterative data flow analysis
[Figure: lattice of congruence classes a·n + b, from the most precise classes 8n+0 ... 8n+7 up through 4n+0 ... 4n+3 and 2n+0, 2n+1 to n+0 (no information); the running example tracks 8n+0 and 4n+2.]
61. Transfer Functions
For addresses abstracted as congruence classes a·n + b (a C transcription of two appears below):
- Add: a = gcd(a1, a2); b = (b1 + b2) mod a
- Subtract: a = gcd(a1, a2); b = (b1 - b2) mod a
- Multiply: a = gcd(a1·a2, a1·b2, a2·b1, C); b = (b1·b2) mod a
- Meet: a = gcd(a1, a2, b1 - b2); b = b1 mod a
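A direct C transcription of two of these transfer functions (a sketch; nonzero strides a1, a2 are assumed so the gcd stays well defined):

#include <stdlib.h>

/* A congruence class a*n + b means: address ≡ b (mod a). */
static int gcd(int x, int y) { while (y) { int t = x % y; x = y; y = t; } return x; }

/* Meet: a = gcd(a1, a2, b1 - b2); b = b1 mod a */
void meet(int a1, int b1, int a2, int b2, int *a, int *b)
{
    *a = gcd(gcd(a1, a2), abs(b1 - b2));
    *b = ((b1 % *a) + *a) % *a;          /* non-negative remainder */
}

/* Add: a = gcd(a1, a2); b = (b1 + b2) mod a */
void add(int a1, int b1, int a2, int b2, int *a, int *b)
{
    *a = gcd(a1, a2);
    *b = (((b1 + b2) % *a) + *a) % *a;
}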
62. Strided Memory Accesses: Packing Through Memory
w = *((float *)a + 0);
x = *((float *)b + 0);
y = *((float *)c + 0);
z = *((float *)d + 0);
*((float *)p + 0) = w;
*((float *)p + 1) = x;
*((float *)p + 2) = y;
*((float *)p + 3) = z;
[Figure: w, x, y, z travel through memory before being loaded together into the wide register p.]
63. Strided Memory Accesses: Packing in Registers
The same gathers as above, but packed in registers instead of through memory:
w = *((float *)a + 0);
x = *((float *)b + 0);
y = *((float *)c + 0);
z = *((float *)d + 0);
*((float *)p + 0) = w;
*((float *)p + 1) = x;
*((float *)p + 2) = y;
*((float *)p + 3) = z;
becomes
temp1 = replicate(a, 0);
temp2 = replicate(b, 0);
temp3 = replicate(c, 0);
temp4 = replicate(d, 0);
p = shift_and_load(p, temp1);
p = shift_and_load(p, temp2);
p = shift_and_load(p, temp3);
p = shift_and_load(p, temp4);
[Slides 64-66 animate this sequence, one replicate/shift_and_load pair per element.]
67. Strided Memory Accesses: Power-of-2 Interleaved Data
- dst = vperm(src1, src2, index_vector)
- Fully interleaved data require d·log d steps, where d is the interleave factor.
- Dorit Nuzman et al., PLDI 2006
[Figure: vperm gathering selected elements of src1 and src2 into dst.]
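For intuition, the same de-interleaving idea with SSE shuffles standing in for vperm (a sketch for d = 2; names are illustrative):

#include <xmmintrin.h>

/* Split {x0,y0,x1,y1,x2,y2,x3,y3} into {x0..x3} and {y0..y3}
   with one shuffle step per output vector. */
void deinterleave2(const float *in, __m128 *xs, __m128 *ys)
{
    __m128 a = _mm_loadu_ps(&in[0]);   /* x0 y0 x1 y1 */
    __m128 b = _mm_loadu_ps(&in[4]);   /* x2 y2 x3 y3 */
    *xs = _mm_shuffle_ps(a, b, _MM_SHUFFLE(2, 0, 2, 0));  /* even lanes */
    *ys = _mm_shuffle_ps(a, b, _MM_SHUFFLE(3, 1, 3, 1));  /* odd lanes  */
}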
68. Vectorization: Dependence Issues
- True cyclic dependences of distance smaller than the vector length can prevent vectorization.
Scalar loop (distance-2 dependence through a[]):
for (i = 4; i < 64; i++) {
    Ra = a[i-2]; Rb = b[i]; Ra = Ra + Rb; a[i] = Ra;
}
Invalid with vector length 4 (a[i+2] and a[i+3] need values computed in the same vector op):
for (i = 4; i < 64; i += 4) {
    Va = a[i-2:i+1]; Vb = b[i:i+3]; Va = Va + Vb; a[i:i+3] = Va;
}
Valid with vector length 2:
for (i = 4; i < 64; i += 2) {
    Va = a[i-2:i-1]; Vb = b[i:i+1]; Va = Va + Vb; a[i:i+1] = Va;
}
69. Vectorization: Involved Transformations
- Loop interchange
- Vectorize one of the outer loops
- Choose the loop with the larger iteration count
- Loop distribution
- Partial vectorization
- Node splitting
- Break cyclic dependences that contain an anti-dependence
- And more
- Loop skewing, loop peeling, index-set splitting, reduction recognition, ...
70. References
- S. Larsen and S. Amarasinghe, "Exploiting Superword Level Parallelism with Multimedia Instruction Sets," PLDI 2000.
- S. Larsen, E. Witchel, and S. Amarasinghe, "Increasing and Detecting Memory Address Congruence," PACT 2002.
- J. Shin, M. Hall, and J. Chame, "Compiler-Controlled Caching in Superword Register Files for Multimedia Extension Architectures," PACT 2002.
- D. Nuzman, I. Rosen, and A. Zaks, "Auto-Vectorization of Interleaved Data for SIMD," PLDI 2006.