Title: Programming the Velocity Engine
1 Programming the Velocity Engine
Academic Developers Conference 2001
- Bing-Chang Lai
- Phillip John McKerrow
- University of Wollongong
2 Introduction
- What is a Vector Processor?
- The Velocity Engine
- Programming the Velocity Engine
- Discuss Examples 1 to 3 only
- Q&A
3 What is a Vector Processor?
- Supports Single Instruction Multiple Data (SIMD) instructions
- Originally used in supercomputers for crunching scientific programs
- Now popular on the desktop as well, for crunching multimedia-related applications
4 What is a Vector Processor?
- On the desktop, it is usually part of a larger processor
- Examples of vector processor technologies
- MMX, SSE, 3DNow!, AltiVec
5 The Velocity Engine
- Apple's name for AltiVec technology
- What is AltiVec technology then?
- Refers to the technique Motorola used to add vector processing capabilities to the G4 (74xx) family of processors
6 The Velocity Engine
- G4 Processor
- Load/Store Unit
- Integer Unit
- Floating Point Unit
- Vector Unit (AltiVec)
7 Programming the Velocity Engine
- Specifications
- AltiVec Technology Programming Interface Manual
- Available from
- http://e-www.motorola.com/brdata/PDFDB/MICROPROCESSORS/32_BIT/POWERPC/ALTIVEC/ALTIVECPIM.pdf
- http://www.altivec.org/tech_specifications/altivec_pim.pdf
8 Programming the Velocity Engine
- Compilers
- Apple's AltiVec-related patches to GCC 2.95.2
- Metrowerks CodeWarrior
- Vector types
- All vectors are 128 bits long
- Start with the keyword vector or __vector
- Followed by the element type, e.g. unsigned char, unsigned int, signed int and so on
10 Programming the Velocity Engine
- The use of long in vector types has been deprecated
12 Programming the Velocity Engine
- Vector operations
- Arithmetic operations
- vec_abs (absolute value), vec_add (addition), vec_sub (subtraction) ...
- Boolean operations
- vec_and (logical AND), vec_or (logical OR) ...
- vec_cmpeq (equality), vec_cmple (less than or equal to)
13 Programming the Velocity Engine
- Vector operations
- Miscellaneous operations
- vec_perm (permutation), vec_mergeh and vec_mergel (merge two vectors into one) ...
- Memory operations
- vec_st (store), vec_ld (load) ...
- Data stream operations
- vec_dst (Vector Data Stream Touch), vec_dss (Vector Data Stream Stop) ...
14 Programming the Velocity Engine
- Constraints
- Vector operations all work on 128 bits at a time, no more and no less.
- vec_ld (load) and vec_st (store) operate only on 16-byte (128-bit) boundaries.
- This leads to data alignment issues
- Loading data from memory into the processor is one of the main bottlenecks.
- Use the cache functions (e.g. vec_dst) to mark data for loading before the operation takes place
15 Programming the Velocity Engine
- The following examples from the paper will be discussed
- Example 1: Element-by-Element Access
- Example 2: Alignment
- Example 3: Unaligned Loads and Stores
- The Image Addition program in the Appendix will not be discussed
16 Programming the Velocity Engine
- Example 1: Element-by-Element Access

    #include <iostream>

    // The union lets the same 16 bytes be viewed as a vector or as an array
    typedef union {
        __vector unsigned char AsVector;
        unsigned char AsUChar[16];
    } vec_uchar;

    int main()
    {
        vec_uchar v1;
        v1.AsVector = (__vector unsigned char) (
            '0', '1', '2', '3', '4', '5', '6', '7',
            '8', '9', 'A', 'B', 'C', 'D', 'E', 'F');
        for (int i = 0; i < 16; i++)
            std::cout << v1.AsUChar[i];
        std::cout << std::endl;
        return 0;
    }
17 Programming the Velocity Engine
- Example 1: Element-by-Element Access
- Outputs
- 0123456789ABCDEF
- Instead of using the union, you can also access elements by taking the vector's address and casting

    __vector unsigned char v1;
    // Treat the vector's storage as an array of 16 bytes
    for (int i = 0; i < 16; i++)
        std::cout << ((unsigned char *)(&v1))[i];
18 Programming the Velocity Engine
- Example 2: Alignment
- 16-byte aligned locations have addresses with the least significant 4 bits set to 0, e.g. 0xf0, 0x10 and so on
- The AltiVec specification specifies vec_malloc and vec_free for creating 16-byte aligned blocks for vectors.
- The code finds the aligned address by setting the 4 least significant bits to 0 and then adding 16.
- Please note that Apple GCC aligns everything to 16-byte boundaries
19 Programming the Velocity Engine
- Example 2: Alignment - Allocate

    template <class Element>
    Element *allocate(unsigned int n)
    {
        // Allocate n * sizeof(Element) + 16 bytes
        Element *p_unal = (Element *)operator new(n * sizeof(Element) + 16);
        // Align the pointer
        Element *p_al = (Element *)align16(p_unal);
        // Store difference (in bytes) between aligned and unaligned in
        // byte at location (p_al - 1)
        unsigned char *p_offset = (unsigned char *)p_al - 1;
        *p_offset = (unsigned char)((unsigned char *)p_al - (unsigned char *)p_unal);
        return p_al;
    }
20 Programming the Velocity Engine
- Example 2: Alignment - Deallocate

    template <class Element>
    void deallocate(Element *p_al)
    {
        // Fetch difference (in bytes) between aligned and unaligned from
        // byte at location (p_al - 1)
        // and calculate p_unal
        unsigned char *p_offset = (unsigned char *)p_al - 1;
        Element *p_unal = (Element *)((unsigned char *)p_al - *p_offset);

        operator delete(p_unal);
    }
21 Programming the Velocity Engine
- Example 2: Alignment - Using

    // Allocate aligned COUNT unsigned char
    unsigned char *p_aligned = allocate<unsigned char>(COUNT);

    // Now that it is aligned, we can load into a vector
    // (vec_ld takes the byte offset first, then the pointer)
    __vector unsigned char v = vec_ld(0, p_aligned);

    // Use v for calculations
    // ....

    // Free buffer
    deallocate<unsigned char>(p_aligned);
22 Programming the Velocity Engine
- Example 3: Unaligned Loads and Stores

    // Load a vector from an unaligned location in memory
    __vector unsigned char LoadUnaligned(__vector unsigned char *p_v)
    {
        // Permute vector derived from the 4 least significant bits of p_v
        __vector unsigned char permuteVector = vec_lvsl(0, (int *)(p_v));
        // The two aligned vectors that straddle the unaligned data
        __vector unsigned char low = vec_ld(0, p_v);
        __vector unsigned char high = vec_ld(16, p_v);
        return vec_perm(low, high, permuteVector);
    }
23 Programming the Velocity Engine
- Example 3: Unaligned Loads and Stores

    void StoreUnaligned(__vector unsigned char v,
                        __vector unsigned char *p_v)
    {
        // Load the two aligned vectors that the store will overlap
        __vector unsigned char low = vec_ld(0, p_v);
        __vector unsigned char high = vec_ld(16, p_v);
        __vector unsigned char permvec = vec_lvsr(0, (int *)p_v);
        // vec_splat_u8 takes a literal from -16 to 15; -1 sets every bit
        __vector unsigned char oxFF = vec_splat_u8(-1);
        __vector unsigned char ox00 = vec_splat_u8(0);
        // mask is 0x00 where existing memory is kept, 0xff where v goes
        __vector unsigned char mask = vec_perm(ox00, oxFF, permvec);
        v = vec_perm(v, v, permvec);
        low = vec_sel(low, v, mask);
        high = vec_sel(v, high, mask);
        vec_st(low, 0, p_v);
        vec_st(high, 16, p_v);
    }
24 Programming the Velocity Engine
- Example 3: Unaligned Loads and Stores (trace of StoreUnaligned, values in hexadecimal)

    4 l.s.b of p_v          7
    v                       0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
    low                     0  0  0 4f  0  0  0  8  0  0  0  6  0  0  0  d
    high                    0  0  0  2  0  0  0  4 41 10 f7 8c bf ff fa 58
    perm                    9  a  b  c  d  e  f 10 11 12 13 14 15 16 17 18
    mask                    0  0  0  0  0  0  0 ff ff ff ff ff ff ff ff ff
    v = vec_perm(v,v,perm)  9  a  b  c  d  e  f  0  1  2  3  4  5  6  7  8
    vec_sel(low,v,mask)     0  0  0 4f  0  0  0  0  1  2  3  4  5  6  7  8
    vec_sel(v,high,mask)    9  a  b  c  d  e  f  4 41 10 f7 8c bf ff fa 58
25 Resources
- The code for this paper will be available
- At http://www.bclai.net (probably by the end of the week)
- Email me at bl12_at_uow.edu.au
- Other important resources
- AltiVec Information Source
- At http://www.altivec.org
- Email group list
- Apple's AltiVec Homepage
- At http://developer.apple.com/hardware/ve/
- Tutorials
- Vector Libraries
- AlienOrb AltiVec Page
- At http://www.alienorb.com/AltiVec/
- AltiVec Tutorial
- AltiVec code examples on lookup tables, streaming data fetch instructions ...
26 References
- Bing-Chang Lai, Phillip John McKerrow. Programming the Velocity Engine, AUC, 2001.
- Motorola, Inc. AltiVec Technology Programming Interface Manual, 1999. See http://e-www.motorola.com/brdata/PDFDB/MICROPROCESSORS/32_BIT/POWERPC/ALTIVEC/ALTIVECPIM.pdf
- Ian Ollmann, Ph.D. AltiVec, 2001. See http://www.alienorb.com/AltiVec/Altivec.pdf
27 Q&A