Exploiting SIMD parallelism with the CGiS compiler framework

1
Exploiting SIMD parallelism with the CGiS
compiler framework
  • Nicolas Fritz, Philipp Lucas, Reinhard Wilhelm
  • Saarland University

2
Outline
  • CGiS
  • Language, compiler and GPU back-end
  • SIMD back-end
  • Hardware
  • Challenges
  • Transformations and optimizations
  • Experimental results
  • Future Work
  • Conclusion

3
CGiS
  • C-like data-parallel programming language
  • Goals
    • Exploitation of parallel processing units in
      common PCs (GPU, SIMD units)
    • Easy access for inexperienced programmers
    • High abstraction level
  • 32-bit scalar and small vector data types
  • Two forms of explicit parallelism
    • SPMD (iteration), SIMD (vector types)

4
CGiS Example: YUV to RGB
PROGRAM yuv_to_rgb;
INTERFACE
  extern in  float3 YUV<_>;
  extern out float3 RGB<_>;
CODE
  procedure yuv2rgb (in float3 yuv, out float3 rgb)
  {
    rgb = yuv.x + [0, 0.344, 1.77] * yuv.y
                + [1.403, 0.714, 0] * yuv.z;
  }
CONTROL
  forall (yuv in YUV, rgb in RGB)
    yuv2rgb (yuv, rgb);
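
For readers more at home in C, the CGiS program above corresponds roughly to the sketch below. It assumes that the operators lost in the transcript are + and *, and the float3 struct and both function names are hypothetical stand-ins rather than output of the CGiS compiler. The component-wise arithmetic in yuv2rgb is the SIMD (vector type) form of parallelism; the loop in forall_yuv2rgb is the iteration form, with every iteration independent of the others.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C stand-in for the CGiS float3 type. */
typedef struct { float x, y, z; } float3;

/* Mirrors the yuv2rgb procedure, assuming the stripped operators are + and *:
   rgb = yuv.x + [0, 0.344, 1.77] * yuv.y + [1.403, 0.714, 0] * yuv.z */
float3 yuv2rgb(float3 yuv) {
    float3 rgb;
    rgb.x = yuv.x + 0.0f   * yuv.y + 1.403f * yuv.z;
    rgb.y = yuv.x + 0.344f * yuv.y + 0.714f * yuv.z;
    rgb.z = yuv.x + 1.77f  * yuv.y + 0.0f   * yuv.z;
    return rgb;
}

/* Mirrors the CONTROL section: the kernel is applied to every stream
   element, and no iteration depends on another. */
void forall_yuv2rgb(const float3 *YUV, float3 *RGB, size_t n) {
    assert(n == 0 || (YUV != NULL && RGB != NULL));
    for (size_t i = 0; i < n; ++i)
        RGB[i] = yuv2rgb(YUV[i]);
}
```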

5
CGiS Compiler Overview

6
CGiS for GPUs
  • nVidia G80
    • 128 floating-point units
    • Processes both scalar and vector data
  • 2-on-2 mapping of CGiS parallelism
  • Code generation for various GPU generations
    • NV30, NV40, G80, CUDA
  • Limited access to hardware features through the
    driver

7
SIMD Hardware
  • Every common PC features SIMD units
    • Intel's SSE and Freescale's AltiVec
  • SIMD parallelism not easily accessible to
    standard compilers
    • Well-known vectorization problems
  • Data access
    • Hardware requires 16-byte aligned loads
    • Slow but cached
  • Only 4-way SIMD vector parallelism usable
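
The 16-byte alignment requirement is easiest to see in code. The sketch below shows loop peeling, one common way to cope with it: a scalar prologue runs until the address is 16-byte aligned, then the loop proceeds in 4-wide blocks, and a scalar epilogue handles the rest. This is a generic illustration and not necessarily what the CGiS back-end does. The names is_aligned16 and scale are hypothetical, and the inner 4-iteration loop merely stands in for one aligned SIMD operation so that the sketch compiles on any platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LANES = 4 };  /* 4 x 32-bit float = one 16-byte SIMD register */

/* True if p sits on a 16-byte boundary. */
int is_aligned16(const void *p) {
    return ((uintptr_t)p % 16u) == 0;
}

/* Multiply n floats by s in place. */
void scale(float *a, size_t n, float s) {
    size_t i = 0;

    /* Scalar prologue: advance until a + i is 16-byte aligned. */
    while (i < n && !is_aligned16(a + i)) {
        a[i] *= s;
        ++i;
    }

    /* Main body: every block starts on a 16-byte boundary. */
    for (; i + LANES <= n; i += LANES) {
        assert(is_aligned16(a + i));
        for (int l = 0; l < LANES; ++l)  /* stands in for one 4-wide op */
            a[i + l] *= s;
    }

    /* Scalar epilogue: fewer than LANES elements remain. */
    for (; i < n; ++i)
        a[i] *= s;
}
```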

8
The SIMD Back-end
  • Goal is mapping of CGiS parallelisms to SIMD
    hardware
    • 2-on-1 mapping
  • SIMD vectorization problems
    • Data dependency analyses: avoided by design
    • Control flow
      • Divergence in consecutive elements
    • Misalignment and data layout
      • Reordering might be needed
      • Gathering operations are bottlenecks in
        load-heavy algorithms on multidimensional streams

9
Transformations and Optimizations
  • Control flow conversion
    • If/loop conversion
  • Loop sectioning for 2D streams
    • Increase cache performance for gather accesses
  • Kernel flattening
    • IR transformation that replaces compound
      variables and operations by scalar ones
    • 2-on-1

10
Control Flow Conversion
  • Full inlining
  • If/loop conversion with slightly modified
    Allen-Kennedy algorithm
  • No guarded assignments
  • Masks for select operations are the results of
    vector compares
  • Live and written variables after a control flow
    join are copied at the branching
  • Select operations are inserted at the join
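
A minimal sketch of if-conversion as described above, in portable C. The 4-lane types vec4f and mask4 and the helpers cmp_lt and select4 are hypothetical emulations of a SIMD register, a vector compare, and a select instruction; the scalar source statement is an invented example. Both branches are executed on copies of the written variable, and a select at the join picks the right value per lane.

```c
#include <assert.h>
#include <stdint.h>

enum { LANES = 4 };

typedef struct { float v[LANES]; } vec4f;     /* emulated 4-lane register */
typedef struct { uint32_t m[LANES]; } mask4;  /* all ones or all zeros per lane */

/* Emulated vector compare: produces a per-lane mask. */
mask4 cmp_lt(vec4f a, vec4f b) {
    mask4 r;
    for (int l = 0; l < LANES; ++l)
        r.m[l] = (a.v[l] < b.v[l]) ? 0xFFFFFFFFu : 0u;
    return r;
}

/* Emulated select: takes t where the mask is set, f elsewhere. */
vec4f select4(mask4 m, vec4f t, vec4f f) {
    vec4f r;
    for (int l = 0; l < LANES; ++l)
        r.v[l] = m.m[l] ? t.v[l] : f.v[l];
    return r;
}

/* Scalar source (invented example):
       if (x < 0) y = -x; else y = 2 * x;
   Converted form: no branch, no guarded assignment. */
vec4f converted(vec4f x) {
    vec4f zero = {{0.0f, 0.0f, 0.0f, 0.0f}};
    mask4 m = cmp_lt(x, zero);          /* mask from a vector compare */

    vec4f y_then, y_else;               /* copies of y made at the branch */
    for (int l = 0; l < LANES; ++l) {
        assert(x.v[l] == x.v[l]);       /* sketch assumes no NaN input */
        y_then.v[l] = -x.v[l];          /* then-branch, all lanes */
        y_else.v[l] = 2.0f * x.v[l];    /* else-branch, all lanes */
    }
    return select4(m, y_then, y_else);  /* select inserted at the join */
}
```

Because both branches always run, the converted code pays for the more expensive path in every lane; that cost is the price of handling divergence between consecutive elements.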

11
Loop Sectioning
  • Adaptation of iteration sequence to better
    exploit cached data
  • Only interesting for 2D streams
  • Iterations subdivided into stripes
  • Width depends on access pattern, cache size and
    local variables
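
The idea can be sketched in C as follows. The kernel gather3 (a vertical 3-point average with clamped borders) and the function names are hypothetical; the fixed stripe parameter stands in for a width that, per the slide, would be derived from access pattern, cache size and local variables. Only the iteration order changes: the sectioned version walks the 2D stream stripe by stripe so that rows gathered for one output row are more likely to still be cached for the next.

```c
#include <assert.h>
#include <stddef.h>

static int clampi(int v, int lo, int hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Hypothetical gather kernel: average of the element and its vertical
   neighbours, with row indices clamped at the borders. */
float gather3(const float *in, int w, int h, int x, int y) {
    int up = clampi(y - 1, 0, h - 1);
    int down = clampi(y + 1, 0, h - 1);
    return (in[up * w + x] + in[y * w + x] + in[down * w + x]) / 3.0f;
}

/* Plain row-major iteration over the whole stream. */
void blur_plain(const float *in, float *out, int w, int h) {
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            out[y * w + x] = gather3(in, w, h, x, y);
}

/* Sectioned iteration: the stream is cut into vertical stripes and each
   stripe is processed top to bottom before moving to the next one. */
void blur_sectioned(const float *in, float *out, int w, int h, int stripe) {
    assert(stripe > 0);
    for (int x0 = 0; x0 < w; x0 += stripe) {
        int x1 = (x0 + stripe < w) ? x0 + stripe : w;
        for (int y = 0; y < h; ++y)
            for (int x = x0; x < x1; ++x)
                out[y * w + x] = gather3(in, w, h, x, y);
    }
}
```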

12
Kernel Flattening
procedure yuv2rgb (in float3 yuv, out float3 rgb)
{
  rgb = yuv.x + [0, 0.344, 1.77] * yuv.y
              + [1.403, 0.714, 0] * yuv.z;
}
  • SIMD vectorization for yuv2rgb not applicable
  • Thus flatten the procedure or kernel
  • Code transformation on the IR
  • All variables and all statements are split into
    scalar ones
  • Those can be subjected to SIMD vectorization

13
Kernel Flattening Example
procedure yuv2rgb_f (in float yuv_x, in float yuv_y, in float yuv_z,
                     out float rgb_x, out float rgb_y, out float rgb_z)
{
  float cy = 0.344, cz = 1.77, dx = 1.403, dy = 0.714;
  rgb_x = yuv_x + dx * yuv_z;
  rgb_y = yuv_x + cy * yuv_y + dy * yuv_z;
  rgb_z = yuv_x + cz * yuv_y;
}
  • Procedure yuv2rgb_f now features data types
    suitable to be SIMD-parallelized
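
Once every variable is a scalar float, four stream elements can share one SIMD register per variable. The sketch below illustrates this in portable C, assuming the stripped operators are + and * and assuming the data is already stored one component per array (the layout question is taken up on the next slides). The names yuv2rgb_f4 and forall_flat are hypothetical, and each statement inside the lane loop stands in for one 4-wide SIMD instruction.

```c
#include <assert.h>
#include <stddef.h>

enum { LANES = 4 };

/* Flattened kernel applied to 4 consecutive stream elements at once.
   Each pointer addresses 4 floats of a single component. */
void yuv2rgb_f4(const float *yuv_x, const float *yuv_y, const float *yuv_z,
                float *rgb_x, float *rgb_y, float *rgb_z) {
    const float cy = 0.344f, cz = 1.77f, dx = 1.403f, dy = 0.714f;
    for (int l = 0; l < LANES; ++l) {  /* lane loop: one SIMD op per line */
        rgb_x[l] = yuv_x[l] + dx * yuv_z[l];
        rgb_y[l] = yuv_x[l] + cy * yuv_y[l] + dy * yuv_z[l];
        rgb_z[l] = yuv_x[l] + cz * yuv_y[l];
    }
}

/* Iterates over the stream in steps of 4 elements; for brevity the
   sketch requires n to be a multiple of 4. */
void forall_flat(const float *Y, const float *U, const float *V,
                 float *R, float *G, float *B, size_t n) {
    assert(n % LANES == 0);
    for (size_t i = 0; i < n; i += LANES)
        yuv2rgb_f4(Y + i, U + i, V + i, R + i, G + i, B + i);
}
```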

14
Kernel Flattening
  • But data layout doesn't fit
    • No stride-one access for single components
  • Reordering of data required
    • Locally via permutes or shuffles
    • Globally via memory copy

15
Kernel Flattening: Data Reordering

16
Global vs. Local Reordering
  • Global reordering
    • Reusable for further iterations
    • Simple, but expensive in-memory copy
    • Destroys locality for gather accesses
  • Local reordering
    • Original stream data untouched
    • Insertion of possibly many relatively cheap
      in-register permutation operations
    • Locality for gathering preserved
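
The two strategies can be contrasted on a stream of interleaved float3 elements (x0 y0 z0 x1 y1 z1 ...). The sketch below is an illustration in portable C with hypothetical function names. reorder_global copies the whole stream into one array per component. reorder_local takes 12 consecutive floats, as three 4-wide loads would, and permutes them into one x, one y and one z register; the index pattern is what a handful of shuffle or permute instructions would implement on real hardware.

```c
#include <assert.h>
#include <stddef.h>

enum { LANES = 4 };

/* Global reordering: one pass over memory, result reusable afterwards. */
void reorder_global(const float *aos, float *xs, float *ys, float *zs,
                    size_t n) {
    for (size_t i = 0; i < n; ++i) {
        xs[i] = aos[3 * i + 0];
        ys[i] = aos[3 * i + 1];
        zs[i] = aos[3 * i + 2];
    }
}

/* Local reordering: 4 interleaved elements (12 floats) are loaded as
   three registers and permuted in place; the stream is not modified. */
void reorder_local(const float *aos12,
                   float x[LANES], float y[LANES], float z[LANES]) {
    float r0[LANES], r1[LANES], r2[LANES];
    assert(aos12 != NULL);
    for (int l = 0; l < LANES; ++l) {   /* three emulated 4-wide loads */
        r0[l] = aos12[l];               /* x0 y0 z0 x1 */
        r1[l] = aos12[LANES + l];       /* y1 z1 x2 y2 */
        r2[l] = aos12[2 * LANES + l];   /* z2 x3 y3 z3 */
    }
    /* Emulated permutes: pick each component out of the three registers. */
    x[0] = r0[0]; x[1] = r0[3]; x[2] = r1[2]; x[3] = r2[1];
    y[0] = r0[1]; y[1] = r1[0]; y[2] = r1[3]; y[3] = r2[2];
    z[0] = r0[2]; z[1] = r1[1]; z[2] = r2[0]; z[3] = r2[3];
}
```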

17
Experimental Results
  • Tested on Intel Core 2 Duo 1.83 GHz and PowerPC G5
    1.8 GHz
  • Compiled with intrinsics on gcc 4.0.1
  • Examples
    • Image processing: Gaussian blur
      • Loop sectioning
    • Computation of the Mandelbrot set
      • Control flow conversion
    • Block cipher encryption: RC5
      • Kernel flattening
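
To give a rough idea of why the Mandelbrot example exercises control flow conversion, the sketch below shows a converted escape-time loop for 4 points at once. It is an illustration in portable C, not the benchmark code; mandel4 and its parameters are hypothetical. The scalar loop condition becomes a per-lane mask, updates are applied through selects, and the loop exits only when no lane is active any more.

```c
#include <assert.h>

enum { LANES = 4 };

/* Iterates z = z*z + c for 4 points at once and counts iterations per
   lane. Scalar source: while (|z|^2 <= 4 && it < max_it) { ... } */
void mandel4(const float cre[LANES], const float cim[LANES], int max_it,
             int iters[LANES]) {
    float zr[LANES] = {0.0f}, zi[LANES] = {0.0f};
    int active[LANES];
    assert(max_it >= 0);
    for (int l = 0; l < LANES; ++l) {
        iters[l] = 0;
        active[l] = 1;
    }
    for (int it = 0; it < max_it; ++it) {
        int any = 0;
        for (int l = 0; l < LANES; ++l) {
            /* Vector compare: a lane stays active while |z|^2 <= 4. */
            active[l] = active[l] && (zr[l] * zr[l] + zi[l] * zi[l] <= 4.0f);
            any |= active[l];
        }
        if (!any)
            break;  /* leave the loop only when every lane has finished */
        for (int l = 0; l < LANES; ++l) {
            float nr = zr[l] * zr[l] - zi[l] * zi[l] + cre[l];
            float ni = 2.0f * zr[l] * zi[l] + cim[l];
            /* Select: finished lanes keep their old values. */
            zr[l] = active[l] ? nr : zr[l];
            zi[l] = active[l] ? ni : zi[l];
            iters[l] += active[l];
        }
    }
}
```

All four lanes run as long as the slowest one, so the benefit depends on how similar the iteration counts of neighbouring points are.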

18
Experimental Results

19
Future Work
  • Replace intrinsics by inline assembly
    • Improvement of conditionals
    • Better control over register allocation
  • Improvement of register re-utilization for
    AltiVec
    • Raises with inline assembly
  • Cell back-end
    • SIMD instruction set close to AltiVec
    • Work list algorithm to distribute stream parts to
      single PEs
  • More applications

20
Conclusion
  • CGiS abstracts GPUs as well as SIMD units
  • SIMD back-end of the CGiS compiler produces
    efficient code
  • Different transformations and optimizations needed
    than for the GPU back-end
    • Full control flow conversion needed
    • Gather accesses gain speed with loop sectioning
    • Kernel flattening enables better exploitation