Title: Exploiting SIMD parallelism with the CGiS compiler framework
1 Exploiting SIMD parallelism with the CGiS compiler framework
- Nicolas Fritz, Philipp Lucas, Reinhard Wilhelm
- Saarland University
2 Outline
- CGiS
  - Language, compiler and GPU back-end
- SIMD back-end
  - Hardware
  - Challenges
  - Transformations and optimizations
- Experimental results
- Future work
- Conclusion
3 CGiS
- C-like data-parallel programming language
- Goals
  - Exploitation of parallel processing units in common PCs (GPU, SIMD units)
  - Easy access for inexperienced programmers
  - High abstraction level
- 32-bit scalar and small vector data types
- Two forms of explicit parallelism
  - SPMD (iteration), SIMD (vector types)
4 CGiS Example: YUV to RGB
PROGRAM yuv_to_rgb;
INTERFACE
  extern in float3 YUV<_>;
  extern out float3 RGB<_>;
CODE
  procedure yuv2rgb (in float3 yuv, out float3 rgb)
  {
    rgb = yuv.x + [0, 0.344, 1.77] * yuv.y + [1.403, 0.714, 0] * yuv.z;
  }
CONTROL
  forall (yuv in YUV, rgb in RGB)
    yuv2rgb (yuv, rgb);
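The forall semantics of the example can be pictured as a plain loop over independent stream elements. The following C sketch is only an illustration, not output of the CGiS compiler: the `float3` struct and the function names are hypothetical, and the component-wise multiply-add reading of the kernel (with the coefficients taken from the slide) is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C counterpart of the CGiS float3 type. */
typedef struct { float x, y, z; } float3;

/* Kernel body: one stream element in, one stream element out.
   Coefficients as on the slide; the operators are an assumed reading. */
static float3 yuv2rgb(float3 yuv)
{
    float3 rgb;
    rgb.x = yuv.x + 0.0f   * yuv.y + 1.403f * yuv.z;
    rgb.y = yuv.x + 0.344f * yuv.y + 0.714f * yuv.z;
    rgb.z = yuv.x + 1.77f  * yuv.y + 0.0f   * yuv.z;
    return rgb;
}

/* forall (yuv in YUV, rgb in RGB): iterations do not depend on each
   other, so any execution order (or parallel execution) is valid. */
static void yuv_to_rgb(const float3 *YUV, float3 *RGB, size_t n)
{
    assert(YUV != NULL && RGB != NULL);
    for (size_t i = 0; i < n; ++i)
        RGB[i] = yuv2rgb(YUV[i]);
}
```

The loop corresponds to the SPMD level (iteration over the stream), while the three component assignments correspond to the SIMD level (one float3 operation).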
5 CGiS Compiler Overview
6 CGiS for GPUs
- nVidia G80
  - 128 floating-point units
  - Processes scalar and vector data
- 2-on-2 mapping of CGiS parallelism
- Code generation for various GPU generations
  - NV30, NV40, G80, CUDA
- Limited access to hardware features through the driver
7 SIMD Hardware
- Every common PC features SIMD units
  - Intel's SSE and Freescale's AltiVec
- SIMD parallelism not easily accessible for standard compilers
  - Well-known vectorization problems
- Data access
  - Hardware requires 16-byte aligned loads
  - Slow but cached
- Only 4-way SIMD vector parallelism usable
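To make the 16-byte alignment requirement concrete, here is a small C sketch. It checks whether a float pointer can be used with an aligned 128-bit load and computes how many leading elements would have to be handled by scalar code first. This loop-peeling idea is a common technique and not something the slides claim for CGiS; both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* True if p sits on a 16-byte boundary, as an aligned 128-bit load requires. */
static int is_aligned16(const float *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Number of leading floats to process with scalar code before p
   reaches the next 16-byte boundary (0 if p is already aligned). */
static size_t peel_count(const float *p)
{
    uintptr_t addr = (uintptr_t)p;
    assert((addr & 3u) == 0);   /* assumption: floats are 4-byte aligned */
    return (size_t)(((16u - (addr & 15u)) & 15u) / sizeof(float));
}
```

With 4-byte floats the peel count is always between 0 and 3, which matches the 4-way parallelism mentioned above.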
8 The SIMD Back-end
- Goal is mapping of CGiS parallelisms to SIMD hardware
  - 2-on-1 mapping
- SIMD vectorization problems
  - Avoided by design: data dependency analyses
  - Control flow
    - Divergence in consecutive elements
  - Misalignment and data layout
    - Reordering might be needed
    - Gathering operations are bottlenecks in load-heavy algorithms on multidimensional streams
9 Transformations and Optimizations
- Control flow conversion
  - If/loop conversion
- Loop sectioning for 2D streams
  - Increase cache performance for gather accesses
- Kernel flattening
  - IR transformation that replaces compound variables and operations by scalar ones
  - 2-on-1
10 Control Flow Conversion
- Full inlining
- If/loop conversion with slightly modified Allen-Kennedy algorithm
  - No guarded assignments
  - Masks for select operations are the results of vector compares
  - Live and written variables after a control flow join are copied at the branching
  - Select operations are inserted at the join
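The if-conversion steps listed above can be sketched in portable C that emulates a 4-way SIMD register with a small struct. This is an illustration under assumptions, not the code the CGiS compiler emits: `vec4f`, `mask4`, `cmp_lt`, `select4` and the example kernel are all hypothetical names, and real back-ends would use SSE or AltiVec compare and select instructions instead of the per-lane loops.

```c
#include <assert.h>
#include <stdint.h>

enum { LANES = 4 };                            /* 4-way SIMD */
typedef struct { float    v[LANES]; } vec4f;   /* stand-in for a vector register */
typedef struct { uint32_t m[LANES]; } mask4;   /* per lane: all ones or all zeros */

/* Vector compare: produces the mask later consumed by select4. */
static mask4 cmp_lt(vec4f a, vec4f b)
{
    mask4 r;
    for (int l = 0; l < LANES; ++l)
        r.m[l] = a.v[l] < b.v[l] ? 0xFFFFFFFFu : 0u;
    return r;
}

/* Select: lane-wise choice between two fully computed values. */
static vec4f select4(mask4 m, vec4f if_true, vec4f if_false)
{
    vec4f r;
    for (int l = 0; l < LANES; ++l) {
        assert(m.m[l] == 0u || m.m[l] == 0xFFFFFFFFu);
        r.v[l] = m.m[l] ? if_true.v[l] : if_false.v[l];
    }
    return r;
}

/* Scalar original with divergent control flow. */
static float kernel_scalar(float x, float limit)
{
    float y;
    if (x < limit) y = 2.0f * x; else y = limit;
    return y;
}

/* If-converted form: both branches run unguarded on all lanes,
   and a select merges the written variable y at the join. */
static vec4f kernel_converted(vec4f x, vec4f limit)
{
    mask4 cond = cmp_lt(x, limit);    /* mask from the branch condition */
    vec4f y_then, y_else;             /* per-branch copies of y */
    for (int l = 0; l < LANES; ++l) {
        y_then.v[l] = 2.0f * x.v[l];  /* then-branch */
        y_else.v[l] = limit.v[l];     /* else-branch */
    }
    return select4(cond, y_then, y_else);
}
```

The price of this scheme is that every lane pays for both branches, which is why divergence between consecutive elements matters for performance.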
11 Loop Sectioning
- Adaptation of iteration sequence to better exploit cached data
- Only interesting for 2D streams
- Iterations subdivided in stripes
  - Width depends on access pattern, cache size and local variables
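As a rough illustration of loop sectioning, the C sketch below walks a 2D stream stripe by stripe instead of row by row. The kernel (a vertical three-tap sum with clamped borders) and the name `vsum_sectioned` are hypothetical, chosen only because the kernel gathers from neighbouring rows; how CGiS actually picks the stripe width is not shown here.

```c
#include <assert.h>
#include <stddef.h>

/* For every element, add its upper and lower neighbours (clamped at the
   borders). The iteration space is cut into vertical stripes of `stripe`
   columns, so the three rows touched inside one stripe are more likely
   to still be cached when the next row of the stripe is processed. */
static void vsum_sectioned(const float *in, float *out,
                           size_t width, size_t height, size_t stripe)
{
    assert(stripe > 0);
    for (size_t x0 = 0; x0 < width; x0 += stripe) {
        size_t x1 = x0 + stripe < width ? x0 + stripe : width;
        for (size_t y = 0; y < height; ++y) {
            size_t up = y > 0 ? y - 1 : y;
            size_t dn = y + 1 < height ? y + 1 : y;
            for (size_t x = x0; x < x1; ++x)
                out[y * width + x] = in[up * width + x]
                                   + in[y  * width + x]
                                   + in[dn * width + x];
        }
    }
}
```

Calling the function with `stripe` equal to `width` gives the unsectioned row-by-row order; the results are identical, only the traversal order changes.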
12 Kernel Flattening
procedure yuv2rgb (in float3 yuv, out float3 rgb)
{
  rgb = yuv.x + [0, 0.344, 1.77] * yuv.y + [1.403, 0.714, 0] * yuv.z;
}
- SIMD vectorization for yuv2rgb not applicable
- Thus flatten the procedure or kernel
  - Code transformation on the IR
  - All variables and all statements are split into scalar ones
  - Those can be subjected to SIMD vectorization
13 Kernel Flattening Example
procedure yuv2rgb_f (in float yuv_x, in float yuv_y, in float yuv_z,
                     out float rgb_x, out float rgb_y, out float rgb_z)
{
  float cy = 0.344, cz = 1.77, dx = 1.403, dy = 0.714;
  rgb_x = yuv_x + dx * yuv_z;
  rgb_y = yuv_x + cy * yuv_y + dy * yuv_z;
  rgb_z = yuv_x + cz * yuv_y;
}
- Procedure yuv2rgb_f now features data types suitable to be SIMD-parallelized
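Once the kernel is flattened, four consecutive stream elements can be processed together, one component per vector. The C sketch below mimics this with an inner loop over four lanes; on real hardware each line of the inner loop body would correspond to a few 4-way SIMD instructions. The separate per-component arrays, the function signature and the additive reading of the formula are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { LANES = 4 };   /* 4-way SIMD */

/* Flattened kernel applied to component arrays (structure-of-arrays
   layout). n must be a multiple of LANES in this simplified sketch. */
static void yuv2rgb_f(const float *yuv_x, const float *yuv_y, const float *yuv_z,
                      float *rgb_x, float *rgb_y, float *rgb_z, size_t n)
{
    const float cy = 0.344f, cz = 1.77f, dx = 1.403f, dy = 0.714f;
    assert(n % LANES == 0);
    for (size_t i = 0; i < n; i += LANES) {
        /* The lane loop stands in for one vector operation per statement. */
        for (size_t l = 0; l < LANES; ++l) {
            rgb_x[i + l] = yuv_x[i + l] + dx * yuv_z[i + l];
            rgb_y[i + l] = yuv_x[i + l] + cy * yuv_y[i + l] + dy * yuv_z[i + l];
            rgb_z[i + l] = yuv_x[i + l] + cz * yuv_y[i + l];
        }
    }
}
```

All four lanes now do useful work, whereas a float3 operation would leave one lane of a 4-way register idle.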
14 Kernel Flattening
- But data layout doesn't fit
  - No stride-one access for single components
- Reordering of data required
  - Locally via permutes or shuffles
  - Globally via memory copy
15 Kernel Flattening: Data Reordering
16 Global vs. Local Reordering
- Global reordering
  - Reusable for further iterations
  - Simple, but expensive in-memory copy
  - Destroys locality for gather accesses
- Local reordering
  - Original stream data untouched
  - Insertion of possibly many relatively cheap in-register permutation operations
  - Locality for gathering preserved
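The two reordering strategies can be contrasted in a short C sketch. `deinterleave4` models local reordering: it takes four consecutive float3 elements (twelve floats, i.e. three vector registers) and produces one vector per component, which hardware would do with permute or shuffle instructions rather than scalar moves. `reorder_global` models the global variant as a one-time copy of the whole stream into component arrays. Both function names and the array-based interface are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Local reordering: split 4 interleaved float3 elements
   (x0 y0 z0 x1 y1 z1 ...) into per-component 4-vectors.
   The original stream data stays untouched. */
static void deinterleave4(const float aos[12], float x[4], float y[4], float z[4])
{
    for (int l = 0; l < 4; ++l) {
        x[l] = aos[3 * l];
        y[l] = aos[3 * l + 1];
        z[l] = aos[3 * l + 2];
    }
}

/* Global reordering: copy the whole stream of n float3 elements into
   three component arrays once, reusable by later iterations. */
static void reorder_global(const float *aos, float *xs, float *ys, float *zs, size_t n)
{
    assert(aos != NULL && xs != NULL && ys != NULL && zs != NULL);
    for (size_t i = 0; i < n; ++i) {
        xs[i] = aos[3 * i];
        ys[i] = aos[3 * i + 1];
        zs[i] = aos[3 * i + 2];
    }
}
```

The local variant is executed inside the kernel for every group of four elements, while the global variant runs once up front and needs extra memory of the same size as the stream.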
17 Experimental Results
- Tested on Intel Core 2 Duo 1.83GHz and PowerPC G5 1.8GHz
- Compiled with intrinsics on gcc 4.0.1
- Examples
  - Image processing: Gaussian blur
    - Loop sectioning
  - Computation of the Mandelbrot set
    - Control flow conversion
  - Block cipher encryption: RC5
    - Kernel flattening
18 Experimental Results
19 Future Work
- Replace intrinsics by inline assembly
  - Improvement of conditionals
  - Better control over register allocation
- Improvement of register re-utilization for AltiVec
  - Raises with inline assembly
- Cell back-end
  - SIMD instruction set close to AltiVec
  - Work list algorithm to distribute stream parts to single PEs
- More applications
20 Conclusion
- CGiS abstracts GPUs as well as SIMD units
- SIMD back-end of the CGiS compiler produces efficient code
- Other transformations and optimizations needed than for the GPU back-end
  - Full control flow conversion needed
  - Gather accesses gain speed with loop sectioning
  - Kernel flattening enables better exploitation