Exploiting SIMD parallelism with the CGiS compiler framework

1
Exploiting SIMD parallelism with the CGiS
compiler framework

Nicolas Fritz, Philipp Lucas, Reinhard Wilhelm
Saarland University

2
Outline

CGiS
Language, compiler and GPU back-end
SIMD back-end
Hardware
Challenges
Transformations and optimizations
Experimental results
Future Work
Conclusion

3
CGiS

C-like data-parallel programming language
Goals
Exploitation of parallel processing units in
common PCs (GPU, SIMD units)
Easy access for inexperienced programmers
High abstraction level
32-bit scalar and small vector data types
Two forms of explicit parallelism
SPMD (iteration), SIMD (vector types)

4
CGiS Example: YUV to RGB

PROGRAM yuv_to_rgb;
INTERFACE
  extern in float3 YUV<_>;
  extern out float3 RGB<_>;
CODE
  procedure yuv2rgb (in float3 yuv, out float3 rgb)
  {
    rgb = yuv.x + [0, 0.344, 1.77] * yuv.y + [1.403, 0.714, 0] * yuv.z;
  }
CONTROL
  forall (yuv in YUV, rgb in RGB)
    yuv2rgb (yuv, rgb);

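
As a rough illustration of what this program computes, the following Python sketch mimics the kernel and the forall iteration. It is not CGiS and not compiler output. It reads the kernel as rgb = yuv.x + [0, 0.344, 1.77] * yuv.y + [1.403, 0.714, 0] * yuv.z, where the operators are an assumption, and the helper name forall is only a stand-in for the CONTROL section.

```python
# Assumed coefficient vectors, taken from the kernel on the slide.
C = (0.0, 0.344, 1.77)    # multiplied by yuv.y
D = (1.403, 0.714, 0.0)   # multiplied by yuv.z


def yuv2rgb(yuv):
    """Kernel body: one float3 in, one float3 out (SIMD-style vector arithmetic)."""
    y, u, v = yuv
    # The scalar yuv.x is broadcast to all three components.
    return tuple(y + c * u + d * v for c, d in zip(C, D))


def forall(kernel, stream):
    """Stand-in for CONTROL's forall: every stream element is processed independently (SPMD)."""
    return [kernel(elem) for elem in stream]


YUV = [(0.5, 0.0, 0.0), (0.5, 0.1, 0.2)]
RGB = forall(yuv2rgb, YUV)
```

Because no iteration reads the result of another, the elements may be processed in any order or in parallel, which is the property the back-ends rely on.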
5
CGiS Compiler Overview
6
CGiS for GPUs

nVidia G80
128 floating-point units
Scalar and vector data processable
2-on-2 mapping of CGiS parallelism
Code generation for various GPU generations
NV30, NV40, G80, CUDA
Limited access to hardware features through the
driver

7
SIMD Hardware

Every common PC features SIMD units
Intel's SSE and Freescale's AltiVec
SIMD parallelism not easily accessible for
standard compilers
Well-known vectorization problems
Data access
Hardware requires 16-byte aligned loads
Slow but cached
Only 4-way SIMD vector parallelism usable

8
The SIMD Back-end

Goal is mapping of CGiS parallelisms to SIMD
hardware
2-on-1 mapping
SIMD vectorization problems
Avoided by design: data dependency analyses
Control flow
Divergence in consecutive elements
Misalignment and data layout
Reordering might be needed
Gathering operations are bottlenecks in
load-heavy algorithms on multidimensional streams
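
The data layout point can be made concrete with a small Python sketch of one possible reordering: a stream of float3 elements (array of structures) is split into one array per component, so that four consecutive elements fill one 4-lane register. The lane count, the zero padding and the function names aos_to_soa and lanes are assumptions for illustration, not the CGiS runtime's interface.

```python
LANES = 4  # assumed SIMD width: four 32-bit values per 128-bit register


def aos_to_soa(stream, pad=0.0):
    """Split a stream of (x, y, z) tuples into three arrays, padded to a multiple of LANES."""
    n = len(stream)
    padded = n + (-n) % LANES
    xs, ys, zs = ([pad] * padded for _ in range(3))
    for i, (x, y, z) in enumerate(stream):
        xs[i], ys[i], zs[i] = x, y, z
    return xs, ys, zs


def lanes(component):
    """Cut one component array into LANES-wide chunks, one chunk per SIMD register."""
    return [component[i:i + LANES] for i in range(0, len(component), LANES)]


stream = [(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12), (13, 14, 15)]
xs, ys, zs = aos_to_soa(stream)
x_registers = lanes(xs)
```

After the reordering each chunk can be fetched with a single contiguous load, whereas the original layout would need a gather of every third value.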

9
Transformations and Optimizations

Control flow conversion
If/loop conversion
Loop sectioning for 2D streams
Increase cache performance for gather accesses
Kernel flattening
IR transformation that replaces compound
variables and operations by scalar ones
2-on-1

10
Control Flow Conversion

Full inlining
If/loop conversion with slightly modified
Allen-Kennedy algorithm
No guarded assignments
Masks for select operations are the results of
vector compares
Live and written variables after a control flow
join are copied at the branching
Select operations are inserted at the join
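
A minimal Python sketch of the if-conversion idea follows, with SIMD registers modelled as lists. The example kernel and the helper names vcmp_lt and vselect are hypothetical. It shows only the mask-and-select pattern described above, not the modified Allen-Kennedy algorithm itself.

```python
def vcmp_lt(a, b):
    """Vector compare: yields one mask entry per lane."""
    return [x < y for x, y in zip(a, b)]


def vselect(mask, if_true, if_false):
    """Per-lane select, the operation inserted at the control flow join."""
    return [t if m else f for m, t, f in zip(mask, if_true, if_false)]


def scalar_kernel(x):
    """Original control flow for a single element."""
    if x < 0.0:
        y = -x
    else:
        y = 2.0 * x
    return y


def converted_kernel(xs):
    """If-converted form: both branches run on all lanes, a select merges the results."""
    mask = vcmp_lt(xs, [0.0] * len(xs))   # mask comes from a vector compare
    y_then = [-x for x in xs]             # 'then' branch writes its own copy of y
    y_else = [2.0 * x for x in xs]        # 'else' branch writes its own copy of y
    return vselect(mask, y_then, y_else)  # select at the join


xs = [-1.0, 2.0, -3.0, 0.5]
ys = converted_kernel(xs)
```

Both branches are always executed, so the conversion pays off mainly when the branches are short or when neighbouring elements would diverge anyway.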

11
Loop Sectioning

Adaptation of iteration sequence to better
exploit cached data
Only interesting for 2D streams
Iterations subdivided into stripes
Width depends on access pattern, cache size and
local variables
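
The change in iteration sequence can be sketched in Python as follows. The vertical orientation of the stripes and the fixed stripe_width parameter are assumptions. In the compiler the width would be derived from the access pattern, cache size and local variables, as listed above.

```python
def row_major_order(width, height):
    """Unsectioned iteration: sweep each full row before moving to the next."""
    return [(x, y) for y in range(height) for x in range(width)]


def sectioned_order(width, height, stripe_width):
    """Sectioned iteration: finish one stripe of columns before starting the next."""
    order = []
    for x0 in range(0, width, stripe_width):
        for y in range(height):
            for x in range(x0, min(x0 + stripe_width, width)):
                order.append((x, y))
    return order


order = sectioned_order(4, 2, 2)
```

Both orders visit the same iterations. With stripes, neighbouring rows of a 2D stream are touched close together in time, so data fetched for a gather in one row is more likely to still be cached when the next row needs it.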

12
Kernel Flattening
procedure yuv2rgb (in float3 yuv, out float3 rgb)
{
  rgb = yuv.x + [0, 0.344, 1.77] * yuv.y + [1.403, 0.714, 0] * yuv.z;
}

SIMD vectorization for yuv2rgb not applicable
Thus flatten the procedure or kernel
Code transformation on the IR
All variables and all statements are split into
scalar ones
Those can be subjected to SIMD vectorization

13
Kernel Flattening Example

procedure yuv2rgb_f (in float yuv_x, in float yuv_y, in float yuv_z,
                     out float rgb_x, out float rgb_y, out float rgb_z)
{
  float cy = 0.344, cz = 1.77, dx = 1.403, dy = 0.714;
  rgb_x = yuv_x + dx * yuv_z;
  rgb_y = yuv_x + cy * yuv_y + dy * yuv_z;
  rgb_z = yuv_x + cz * yuv_y;
}
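
To show why the flattened form vectorizes, here is a Python sketch in which every scalar variable of yuv2rgb_f becomes a 4-lane vector holding that component for four consecutive stream elements. The helpers vadd and vscale stand in for SIMD instructions and are hypothetical, and the + and * operators are an assumption about the kernel.

```python
CY, CZ, DX, DY = 0.344, 1.77, 1.403, 0.714  # constants from the flattened kernel


def vadd(a, b):
    """Lane-wise addition of two vectors."""
    return [x + y for x, y in zip(a, b)]


def vscale(c, a):
    """Multiply every lane by the scalar constant c."""
    return [c * x for x in a]


def yuv2rgb_f_simd(yuv_x, yuv_y, yuv_z):
    """One call handles four stream elements; each argument holds one component of all four."""
    rgb_x = vadd(yuv_x, vscale(DX, yuv_z))
    rgb_y = vadd(vadd(yuv_x, vscale(CY, yuv_y)), vscale(DY, yuv_z))
    rgb_z = vadd(yuv_x, vscale(CZ, yuv_y))
    return rgb_x, rgb_y, rgb_z


rgb_x, rgb_y, rgb_z = yuv2rgb_f_simd([0.5] * 4, [0.0, 0.1, 0.0, 0.1], [0.0, 0.0, 0.2, 0.2])
```

All four lanes do useful work here, whereas operating directly on float3 values would leave one lane of each 4-way register unused.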