Title: Vectorization for Modern Architectures
1. Vectorization for Modern Architectures
Wednesday, September 23, 2009
2. The Fastest Computer: BG/L
3. Overview
- Introduction
- What is / why SIMD?
- Multimedia extensions
- Automatic parallelization
- Vectorization for vector machines
- Superword-Level Parallelization
- Issues
- Conclusion
4. Multiple Levels of Parallelism
- Data: SIMD
- Instructions: ILP
- Threads: e.g., Netscape
- Processors: e.g., BG/L
- Computers: e.g., Grid
5. Scalar vs. SIMD Operation
[Figure: a scalar "add r1,r2,r3" produces a single result (here r1 = r2 + r3: 3 = 2 + 1), while a SIMD add operates on every element of its wide registers in one instruction.]
6. Why SIMD?
- More parallelism
- When parallelism is abundant
- SIMD in addition to ILP
- Simple design
- Replicated functional units
- Small die area
- No heavily ported register files
- Die area
- MAX-2 (HP): 0.1%
- VIS (Sun): 3.0%
- Must be explicitly exposed to the hardware
- By the compiler or by the programmer
7. Multimedia / Scientific Applications
- Image
- Graphics: 3D games, movies
- Image recognition
- Video encoding/decoding: JPEG, MPEG4
- Sound
- Encoding/decoding: IP phone, MP3
- Speech recognition
- Digital signal processing: cell phones
- Scientific applications
- Double-precision GEneral Matrix-Matrix multiplication (DGEMM)
- Y = aX + Y (SAXPY); see the scalar sketch below
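For reference, scalar SAXPY in C (names are illustrative); each element is an independent multiply-add, exactly the shape SIMD hardware wants:

/* SAXPY: y = a*x + y, single precision */
void saxpy(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}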
8. Characteristics of Multimedia Applications
- Regular data access pattern
- Data items are contiguous in memory
- Short data types
- 8, 16, 32 bits
- Data streaming through a series of processing stages
- Sometimes some temporal reuse for such data streams
- Many constants
- Short iteration counts
- Requires saturation arithmetic (see the sketch below)
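A one-line illustration of saturating arithmetic in C; multimedia ISAs provide this clamping in hardware so no explicit test is needed:

#include <stdint.h>

/* 8-bit unsigned saturating add: clamps at 255 instead of
   wrapping around the way ordinary modular addition does. */
uint8_t sat_add_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b;
    return sum > 255 ? 255 : (uint8_t)sum;
}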
9. Multimedia Extensions
- At the core of multimedia extensions
- SIMD parallelism
- Variable-sized data fields
- Vector length = register width / type size (e.g., a 128-bit register holds four 32-bit, eight 16-bit, or sixteen 8-bit elements)
10. Multimedia Extensions (cont.)
- Additions to all major ISAs
- Alignment
- AltiVec: only aligned memory accesses
- SSE: unaligned memory accesses are more expensive
- Not all operations are directly supported
- AltiVec: no division, no 32-bit integer multiplication
11. PIM: DIVA (Data IntensiVe Architecture)
[Block diagram of a DIVA PIM node:]
- DRAM Array (32 MB)
- DRAM Page (2048 bits)
- Scalar registers, 32 x 32b (128 B)
- Wide registers, 32 x 256b (1 KB)
- I-Cache (4 KB)
- Wide functional unit
- Scalar functional unit
12. 2nd-Generation DIVA PIM Chip
- Fabrication technology
- TSMC 0.18 µm
- Size
- 10.5mm x 11.5mm, 56.6 million transistors
- Package
- 35 mm, 420-pin BGA
- Status
- 140MHz
- 1W
- Integrated in HP Longs Peak IA64 server
[Photos: BGA top and bottom views]
13. PIM-enhanced HP IA64 System
Two 2nd-generation PIM cards in an HP Longs Peak IA64 host. The host could hold four boards with eight PIM chips.
14. Supercomputer BlueGene/L: PowerPC 440 with Double FPU
15. Programming Multimedia Extensions
- Language extension
- Programming interface similar to function calls
- C built-in functions, Fortran intrinsics
- Most native compilers support their own multimedia extensions (sketch below)
- AltiVec: dst = vec_add(src1, src2)
- SSE2: dst = _mm_add_ps(src1, src2)
- GCC: -faltivec, -msse2
- BG/L: dst = __fpadd(src1, src2)
- No standard!
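A minimal sketch of this intrinsic style in C, using the SSE _mm_add_ps named above (the function, array names, and alignment assumptions are illustrative; AltiVec's vec_add is used the same way):

#include <xmmintrin.h>  /* SSE; compile with e.g. -msse */

/* dst[i] = src1[i] + src2[i], four floats per iteration.
   Assumes n is a multiple of 4 and 16-byte-aligned arrays. */
void vec_add_f32(float *dst, const float *src1, const float *src2, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 a = _mm_load_ps(&src1[i]);        /* load 4 floats */
        __m128 b = _mm_load_ps(&src2[i]);
        _mm_store_ps(&dst[i], _mm_add_ps(a, b)); /* one SIMD add  */
    }
}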
16. Programming Multimedia Extensions (cont.)
- Library calls and (inline) assembly
- Difficult to program
- Not portable
- Different extensions to the same ISA
- MMX and SSE
- SSE vs. 3DNow!
- Need automatic compilation
17. Overview
- Introduction
- What is / why SIMD?
- Multimedia extensions
- Automatic parallelization
- Vectorization for vector machines
- Superword-Level Parallelization
- Issues
- Alignment
- Strided memory access
18. Automatic Parallelization by Compiler
- Two approaches
- Targeting loops
- Vector machines
- Vectorization
- Targeting basic blocks
- Multimedia extensions
- Superword-Level Parallelization
for (i = 0; i < 64; i++)
    a[i] = b[i] + c[i];

    -- compiler -->

for (i = 0; i < 64; i += 4)
    a[i:i+3] = b[i:i+3] + c[i:i+3];
19. Parallelizing Compilers
- Research compilers
- SLP compiler (MIT, Sam Larsen)
- Vectorizing SUIF Compiler (U. Toronto, Derek De Vries)
- Commercial compilers
- Intel compiler
- XL compilers for BG/L (IBM)
- GCC 4.0.1
- VAST/AltiVec (Crescent Bay Software)
- Most compilers for multimedia extensions are based on conventional vectorization techniques.
20. Vectorization
- Pros
- Successful for vector computers
- Large body of research
- Cons
- Involved transformations
- Targets loop nests
21. Vectorization (cont.)
Scalar loop:
for (i = 0; i < 64; i++) {
    Rb = b[i]; Rc = c[i]; Ra = Rb + Rc; a[i] = Ra;
}
After strip-mining:
for (i = 0; i < 64; i += 4)
    for (j = 0; j < 4; j++) {
        Rb = b[i+j]; Rc = c[i+j]; Ra = Rb + Rc; a[i+j] = Ra;
    }
After loop distribution:
for (i = 0; i < 64; i += 4) {
    for (j = 0; j < 4; j++) Rb[j] = b[i+j];
    for (j = 0; j < 4; j++) Rc[j] = c[i+j];
    for (j = 0; j < 4; j++) Ra[j] = Rb[j] + Rc[j];
    for (j = 0; j < 4; j++) a[i+j] = Ra[j];
}
Vector code:
vector int Va, Vb, Vc;
for (i = 0; i < 64; i += 4) {
    Vb = b[i:i+3]; Vc = c[i:i+3]; Va = Vb + Vc; a[i:i+3] = Va;
}
22. Superword-Level Parallelism (SLP)
- Fine-grain SIMD parallelism in aggregate data objects larger than a machine word
23. 1. Independent ALU Ops
R = R + XR * 1.08327
G = G + XG * 1.89234
B = B + XB * 1.29835
24. 2. Adjacent Memory References
R = R + X[i+0]
G = G + X[i+1]
B = B + X[i+2]
25. 3. Vectorizable Loops
for (i = 0; i < 100; i += 1)
    A[i+0] = A[i+0] + B[i+0];
26. 3. Vectorizable Loops (unrolled four times)
for (i = 0; i < 100; i += 4) {
    A[i+0] = A[i+0] + B[i+0];
    A[i+1] = A[i+1] + B[i+1];
    A[i+2] = A[i+2] + B[i+2];
    A[i+3] = A[i+3] + B[i+3];
}
27. 4. Partially Vectorizable Loops
for (i = 0; i < 16; i += 1) {
    L = A[i+0] - B[i+0];
    D = D + abs(L);
}
28. 4. Partially Vectorizable Loops (unrolled twice)
for (i = 0; i < 16; i += 2) {
    L = A[i+0] - B[i+0];
    D = D + abs(L);
    L = A[i+1] - B[i+1];
    D = D + abs(L);
}
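A hedged sketch of where this loop lands on SIMD hardware: the subtractions pack into one vector op, while the abs() accumulation into the scalar D stays sequential (names, widths, and the SSE lowering are illustrative, not from the slides):

#include <math.h>
#include <xmmintrin.h>

/* Vector subtract + scalar reduction, 4 elements per iteration.
   Assumes A and B are 16-byte aligned and n is a multiple of 4. */
float sum_abs_diff(const float *A, const float *B, int n)
{
    float D = 0.0f;
    for (int i = 0; i < n; i += 4) {
        __m128 vl = _mm_sub_ps(_mm_load_ps(&A[i]), _mm_load_ps(&B[i]));
        float L[4];
        _mm_storeu_ps(L, vl);    /* unpack the packed differences */
        for (int j = 0; j < 4; j++)
            D += fabsf(L[j]);    /* this part remains scalar */
    }
    return D;
}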
29. Exploiting SLP with SIMD Execution
- Benefit
- Multiple ALU ops → one SIMD op
- Multiple ld/st ops → one wide memory op
30. Exploiting SLP with SIMD Execution
- Benefit
- Multiple ALU ops → one SIMD op
- Multiple ld/st ops → one wide memory op
- Cost
- Packing and unpacking
- Reshuffling within a register
31. Packing/Unpacking Costs
C = A + 2
D = B + 3
32. Packing/Unpacking Costs
A = f()
B = g()
C = A + 2
D = B + 3
33. Packing/Unpacking Costs
- Packing source operands
- Unpacking destination operands
A = f()
B = g()
C = A + 2
D = B + 3
E = C / 5
F = D * 7
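A hedged SSE sketch of these costs: moving the scalar results of f() and g() into vector lanes (packing) and extracting C and D again (unpacking) each cost extra shuffle/move/memory instructions around the single SIMD add (this lowering is illustrative, not the slides'):

#include <xmmintrin.h>

/* C = A + 2 and D = B + 3 with one SIMD add, paying explicit
   pack (_mm_set_ps) and unpack (store + extract) overhead. */
void packed_add(float A, float B, float *C, float *D)
{
    __m128 v = _mm_set_ps(0.0f, 0.0f, B, A);       /* pack scalars      */
    __m128 k = _mm_set_ps(0.0f, 0.0f, 3.0f, 2.0f);
    float out[4];
    _mm_storeu_ps(out, _mm_add_ps(v, k));          /* unpack via memory */
    *C = out[0];
    *D = out[1];
}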
34. Optimizing Program Performance
- To achieve the best speedup
- Maximize parallelization
- Minimize packing/unpacking
35. Optimizing Program Performance
- To achieve the best speedup
- Maximize parallelization
- Minimize packing/unpacking
- Many packing possibilities
- Worst case: n ops → n! packing configurations
- Different cost/benefit for each choice
36. Observation 1: Packing Costs Can Be Amortized
- Use packed result operands
A = B + C
D = E + F
G = A - H
I = D - J
37. Observation 1: Packing Costs Can Be Amortized
- Use packed result operands:
A = B + C
D = E + F
G = A - H
I = D - J
- Share packed source operands:
A = B + C
D = E + F
G = B + H
I = E + J
38. Observation 2: Adjacent Memory Is Key
- Large potential performance gains
- Eliminate ld/st instructions
- Few packing possibilities
- Only one ordering exploits pre-packing
39. SLP Extraction Algorithm
- Identify adjacent memory references
A = X[i+0]
C = E * 3
B = X[i+1]
H = C - A
D = F * 5
J = D - B
[Slides 40-45 step through this example: the adjacent loads X[i+0] and X[i+1] seed the pack <A,B>, and following the def-use chains then packs <H,J> and <C,D>. A toy version of the procedure follows below.]
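A toy transcription of that seed-then-extend structure in C, hard-coded to the example above. It is only a sketch: Larsen's real algorithm also checks independence, considers both operand positions, and iterates to a fixed point.

#include <stdio.h>

/* dst = src1 op src2; op 'L' marks a load dst = X[i+off], off in src2. */
typedef struct { char dst, src1, src2, op; } Stmt;

/* A = X[i+0]; C = E*3; B = X[i+1]; H = C-A; D = F*5; J = D-B */
static const Stmt s[] = {
    {'A','X', 0 ,'L'}, {'C','E','3','*'}, {'B','X', 1 ,'L'},
    {'H','C','A','-'}, {'D','F','5','*'}, {'J','D','B','-'},
};
enum { N = 6 };

static int uses(const Stmt *t, char v) { return t->src1 == v || t->src2 == v; }

int main(void)
{
    int pa[N], pb[N], np = 0;

    /* 1. Seed packs from adjacent memory references. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (s[i].op == 'L' && s[j].op == 'L' &&
                s[i].src1 == s[j].src1 && s[j].src2 == s[i].src2 + 1) {
                pa[np] = i; pb[np] = j; np++;
            }

    /* 2. Follow def-use chains: pair same-opcode statements that
       consume the two results of an existing pack. */
    for (int p = 0; p < np; p++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (i != j && s[i].op == s[j].op && s[i].op != 'L' &&
                    uses(&s[i], s[pa[p]].dst) && uses(&s[j], s[pb[p]].dst)) {
                    pa[np] = i; pb[np] = j; np++;
                }

    /* 3. Follow use-def chains: pair the statements that define the
       first operands of an existing pack. */
    for (int p = 0; p < np; p++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (i != j && s[i].op == s[j].op &&
                    s[i].dst == s[pa[p]].src1 && s[j].dst == s[pb[p]].src1) {
                    pa[np] = i; pb[np] = j; np++;
                }

    for (int p = 0; p < np; p++)   /* prints <A,B>, <H,J>, <C,D> */
        printf("pack: <%c, %c>\n", s[pa[p]].dst, s[pb[p]].dst);
    return 0;
}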
46. SLP vs. Vector Parallelism
[Figure: SLP packs isomorphic statements within a basic block.]
47. SLP vs. Vector Parallelism
[Figure: vector parallelism spans loop iterations.]
48. SLP vs. Vector Parallelism
- Extracted with a simple analysis
- SLP is fine grain → basic blocks
- Superset of vector parallelism
- Unrolling transforms VP to SLP
- Handles partially vectorizable loops
49. SLP vs. Vector Parallelism
[Chart: dynamic SUIF instructions eliminated]
50. SLP vs. ILP
- Subset of instruction level parallelism
- SIMD hardware is simpler
- SIMD instructions are more compact
- Reduces instruction fetch bandwidth
51. SLP and ILP
- SLP and ILP can be exploited together
- Many architectures can already do this
- ex) MPC7450
- 2 vector integer units, 1 vector FPU, 1 vector permute unit
- Mix well with scalar instructions
- SLP and ILP may compete
- Occurs when parallelism is scarce
- Unroll the loop more times when the ILP comes from loop-level parallelism
52. Issues
- Alignment
- Strided memory access
- Cache misses
- Control flow
- Profitability model
- True data dependence
- Function calls
- Inlined assembly code
- Indirect memory accesses
- Loop bounds that vary at run time
53. Overall Improvements Over Scalar (small)
[Chart: measured speedups over scalar code]
54. Conclusion
- SIMD parallelism cannot be replaced by other types of parallelism.
- SLP suits the requirements of modern SIMD architectures better than the conventional technique.
- Research opportunities remain in exploiting SIMD parallelism.
55. Backup Slides
- Alignment
- Strided memory access
- More on vectorization
56. Alignment
- Most multimedia extensions require aligned memory accesses.
- What is an aligned memory access?
- A memory access is aligned to a 16-byte boundary if the address is a multiple of 16.
- Ex) For 16-byte memory accesses in AltiVec, the last 4 bits of the address are ignored (see the check below).
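A small, self-contained illustration of that definition (C11 aligned_alloc is an assumption about the build environment, not from the slides):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* 16-byte-aligned block, as SIMD loads and stores expect */
    float *a = aligned_alloc(16, 64 * sizeof(float));

    /* aligned to 16 bytes iff the low 4 address bits are zero */
    printf("&a[0] aligned: %d\n", ((uintptr_t)&a[0] & 15u) == 0);  /* 1 */
    printf("&a[2] aligned: %d\n", ((uintptr_t)&a[2] & 15u) == 0);  /* 0 */
    free(a);
    return 0;
}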
57. Alignment Code Generation
- Aligned memory access
- The address is always a multiple of 16 bytes
- Just one superword load or store instruction
float a[64];
for (i = 0; i < 64; i += 4)
    Va = a[i:i+3];
58. Alignment Code Generation (cont.)
- Misaligned memory access
- The address is always a non-zero constant offset away from the 16-byte boundaries.
- Static alignment: for a misaligned load, issue two adjacent aligned loads followed by a merge (sketched below).
float a[64];
for (i = 0; i < 60; i += 4)
    Va = a[i+2:i+5];
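One possible lowering of the two-loads-plus-merge scheme, sketched with SSE/SSSE3 intrinsics (the slides target AltiVec; _mm_alignr_epi8 stands in for its merge, and the helper is illustrative):

#include <tmmintrin.h>  /* SSSE3: _mm_alignr_epi8 (palignr) */

/* Va = a[i+2 : i+5], assuming a is 16-byte aligned and i % 4 == 0:
   two aligned loads, then one merge with a constant 8-byte shift. */
__m128 load_off2(const float *a, int i)
{
    __m128i lo = _mm_load_si128((const __m128i *)&a[i]);      /* a[i..i+3]   */
    __m128i hi = _mm_load_si128((const __m128i *)&a[i + 4]);  /* a[i+4..i+7] */
    return _mm_castsi128_ps(_mm_alignr_epi8(hi, lo, 8));      /* a[i+2..i+5] */
}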
59. Alignment Code Generation (cont.)
- Unaligned memory access
- The offset from the 16-byte boundaries varies, or not enough information is available.
- Dynamic alignment: the merging point is computed at run time (sketched below).
float a[64];
for (i = 0; i < 60; i++)
    Va = a[i:i+3];
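The classic AltiVec sequence for this case, as a sketch (vec_lvsl, vec_ld, and vec_perm are real AltiVec intrinsics; the wrapper is illustrative, and note it reads one quadword past p when p happens to be aligned):

#include <altivec.h>

/* Load 4 floats from an arbitrarily aligned p: the permute mask
   comes from the run-time address, so the merge point is dynamic. */
vector float load_unaligned(const float *p)
{
    vector unsigned char mask = vec_lvsl(0, p);  /* shift from low addr bits */
    vector float lo = vec_ld(0, p);              /* aligned load at/below p  */
    vector float hi = vec_ld(16, p);             /* next aligned quadword    */
    return vec_perm(lo, hi, mask);               /* run-time merge           */
}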
60. Alignment Analysis
- Iterative data flow analysis
[Figure: lattice of congruence classes a·n + b, from the most precise classes 8n+0 ... 8n+7 up through 4n+0 ... 4n+3 and 2n+0, 2n+1 to n+0 (no information); the running example tracks 8n+0 and 4n+2.]
61. Transfer Functions
For addresses abstracted as congruence classes a·n + b (a C transcription of two appears below):
- Add: a = gcd(a1, a2); b = (b1 + b2) mod a
- Subtract: a = gcd(a1, a2); b = (b1 - b2) mod a
- Multiply: a = gcd(a1·a2, a1·b2, a2·b1, C); b = (b1·b2) mod a
- Meet: a = gcd(a1, a2, b1 - b2); b = b1 mod a
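A direct C transcription of two of these transfer functions (a sketch; nonzero strides a1, a2 are assumed so the gcd stays well defined):

#include <stdlib.h>

/* A congruence class a*n + b means: address ≡ b (mod a). */
static int gcd(int x, int y) { while (y) { int t = x % y; x = y; y = t; } return x; }

/* Meet: a = gcd(a1, a2, b1 - b2); b = b1 mod a */
void meet(int a1, int b1, int a2, int b2, int *a, int *b)
{
    *a = gcd(gcd(a1, a2), abs(b1 - b2));
    *b = ((b1 % *a) + *a) % *a;          /* non-negative remainder */
}

/* Add: a = gcd(a1, a2); b = (b1 + b2) mod a */
void add(int a1, int b1, int a2, int b2, int *a, int *b)
{
    *a = gcd(a1, a2);
    *b = (((b1 + b2) % *a) + *a) % *a;
}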
62. Strided Memory Accesses: Packing Through Memory
w = *((float *)a + 0);
x = *((float *)b + 0);
y = *((float *)c + 0);
z = *((float *)d + 0);
*((float *)p + 0) = w;
*((float *)p + 1) = x;
*((float *)p + 2) = y;
*((float *)p + 3) = z;
[Figure: w, x, y, z travel through memory before being loaded together into the wide register p.]
63. Strided Memory Accesses: Packing in Registers
The same gathers as above, but packed in registers instead of through memory:
w = *((float *)a + 0);
x = *((float *)b + 0);
y = *((float *)c + 0);
z = *((float *)d + 0);
*((float *)p + 0) = w;
*((float *)p + 1) = x;
*((float *)p + 2) = y;
*((float *)p + 3) = z;
becomes
temp1 = replicate(a, 0);
temp2 = replicate(b, 0);
temp3 = replicate(c, 0);
temp4 = replicate(d, 0);
p = shift_and_load(p, temp1);
p = shift_and_load(p, temp2);
p = shift_and_load(p, temp3);
p = shift_and_load(p, temp4);
[Slides 64-66 animate this sequence, one replicate/shift_and_load pair per element.]
67. Strided Memory Accesses: Power-of-2 Interleaved Data
- dst = vperm(src1, src2, index_vector)
- Fully interleaved data require d·log d steps, where d is the interleave factor.
- Dorit Nuzman et al., PLDI 2006
[Figure: vperm gathering selected elements of src1 and src2 into dst.]
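For intuition, the same de-interleaving idea with SSE shuffles standing in for vperm (a sketch for d = 2; names are illustrative):

#include <xmmintrin.h>

/* Split {x0,y0,x1,y1,x2,y2,x3,y3} into {x0..x3} and {y0..y3}
   with one shuffle step per output vector. */
void deinterleave2(const float *in, __m128 *xs, __m128 *ys)
{
    __m128 a = _mm_loadu_ps(&in[0]);   /* x0 y0 x1 y1 */
    __m128 b = _mm_loadu_ps(&in[4]);   /* x2 y2 x3 y3 */
    *xs = _mm_shuffle_ps(a, b, _MM_SHUFFLE(2, 0, 2, 0));  /* even lanes */
    *ys = _mm_shuffle_ps(a, b, _MM_SHUFFLE(3, 1, 3, 1));  /* odd lanes  */
}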
68. Vectorization: Dependence Issues
- True cyclic dependences of distance smaller than the vector length can prevent vectorization.
Scalar loop (distance-2 dependence through a[]):
for (i = 4; i < 64; i++) {
    Ra = a[i-2]; Rb = b[i]; Ra = Ra + Rb; a[i] = Ra;
}
Invalid with vector length 4 (a[i+2] and a[i+3] need values computed in the same vector op):
for (i = 4; i < 64; i += 4) {
    Va = a[i-2:i+1]; Vb = b[i:i+3]; Va = Va + Vb; a[i:i+3] = Va;
}
Valid with vector length 2:
for (i = 4; i < 64; i += 2) {
    Va = a[i-2:i-1]; Vb = b[i:i+1]; Va = Va + Vb; a[i:i+1] = Va;
}
69. Vectorization: Involved Transformations
- Loop interchange
- Vectorize one of the outer loops
- Choose the loop with the larger iteration count
- Loop distribution
- Partial vectorization
- Node splitting
- Break cyclic dependences that contain an anti-dependence
- And more
- Loop skewing, loop peeling, index-set splitting, reduction recognition, ...
70. References
- S. Larsen and S. Amarasinghe, "Exploiting Superword Level Parallelism with Multimedia Instruction Sets," PLDI 2000.
- S. Larsen, E. Witchel, and S. Amarasinghe, "Increasing and Detecting Memory Address Congruence," PACT 2002.
- J. Shin, M. Hall, and J. Chame, "Compiler-Controlled Caching in Superword Register Files for Multimedia Extension Architectures," PACT 2002.
- D. Nuzman, I. Rosen, and A. Zaks, "Auto-Vectorization of Interleaved Data for SIMD," PLDI 2006.