Transcript and Presenter's Notes

Title: The Graphics Card As A Stream Computer


1
The Graphics Card As A Stream Computer
  • Suresh Venkatasubramanian
  • AT&T Labs - Research

2
ATI Animusic Demo
3
CPU performance growth is slowing
4
Why Program on the GPU?
From Stream Programming Environments, Hanrahan, 2004.
5
How has this come about?
  • Game design has become ever more sophisticated.
  • Fast GPUs are used to implement complex shader
    and rendering operations for real-time effects.
  • In turn, the demand for speed has led to
    ever-increasing innovation in card design.
  • The NV40 architecture has 225 million transistors, compared to about 175 million for the Pentium 4 EE 3.2 GHz chip.

6
GPU = Fast co-processor?
  • GPU speed is increasing at a cubed Moore's Law rate.
  • This is a consequence of the data-parallel, streaming aspects of the GPU.
  • GPUs are cheap! Put enough together, and you can get a supercomputer.

So can we use the GPU for general-purpose computing?

NYT, May 26, 2003, TECHNOLOGY: "From PlayStation to Supercomputer for $50,000" - the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign builds a supercomputer using 70 individual Sony PlayStation 2 machines; the project required no hardware engineering other than mounting the PlayStations in a rack and connecting them with a high-speed network switch.
7
Yes! A wealth of applications
Data Analysis
Motion Planning
Particle Systems
Voronoi Diagrams
Force-field simulation
Geometric Optimization
Graph Drawing
Molecular Dynamics
Physical Simulation
Matrix Multiplication
Database queries
Stream Mining
Conjugate Gradient
Sorting and Searching
Range queries
Video Editing
Signal Processing
Image Analysis
and graphics too !!
8
When does GPU = fast co-processor work?
  • Real-time visualization of complex phenomena
  • The GPU (like a fast parallel processor) can
    simulate physical processes like fluid flow,
    n-body systems, molecular dynamics

9
When does GPU = fast co-processor work?
  • Interactive data analysis
  • For effective visualization of data,
    interactivity is key

10
When does GPU = fast co-processor work?
  • Rendering complex scenes (like the Animusic
    demo)
  • Procedural shaders can offload much of the
    expensive rendering work to the GPU. Still not
    the Holy Grail of 80 million triangles at 30
    frames/sec, but it helps.

(The 80 million triangles target is due to Alvy Ray Smith, Pixar.)
11
General-purpose Programming on the GPU: What do you need?
  • In the abstract:
    • A model of the processor
    • A high-level language
  • In practical terms:
    • Programming tools (compiler/debugger/optimizer/...)
    • Benchmarking

12
Follow the language
  • GPU architecture details hidden (unlike CPUs).
  • OpenGL (or DirectX) provides a state machine that
    represents the rendering pipeline.
  • Early GPU programs used properties of the state
    machine to program the GPU.
  • Tools like RenderMan provided sophisticated shader languages, but these were not part of the rendering pipeline.

13
Programming using OpenGL state
  • One programmed in OpenGL using state variables like blend functions, depth tests and stencil tests:
  • glEnable( GL_BLEND )                  // turn on blending
  • glBlendEquationEXT( GL_MIN_EXT )      // combine source and destination with min instead of add
  • glBlendFunc( GL_ONE, GL_ONE )         // source/destination factors (min ignores them)
  • (This configuration computes a running per-pixel minimum over all fragments rendered to that pixel.)

14
Follow the language
  • As the rendering pipeline became more complex,
    new functionality was added to the state machine
    (via extensions)
  • With the introduction of vertex and fragment
    programs, full programmability was introduced to
    the pipeline.

15
Follow the language
  • With fragment programs, one could write general programs at each fragment:
  • MUL tmp, fragment.texcoord0, size.x   # scale the texture coordinate
  • FLR intg, tmp                         # integer part
  • FRC frac, tmp                         # fractional part
  • SUB frac_1, frac, 1.0                 # fractional part minus one
  • But writing (pseudo)-assembly code is clumsy and
    error-prone.

16
Follow the language
  • Finally, with the advent of high-level languages like Cg, BrookGPU, and Sh, general-purpose programming has become easy:
  • float4 main( in float2 texcoords : TEXCOORD0,
  •              in float2 wpos : WPOS,
  •              uniform samplerRECT pbuffer,
  •              uniform sampler2D nvlogo ) : COLOR
  • {
  •   float4 currentColor = texRECT(pbuffer, wpos);  // read the pixel's current color
  •   float4 logo = tex2D(nvlogo, texcoords);        // sample the logo texture
  •   return currentColor + (logo * 0.0003);         // blend a faint logo into the buffer
  • }

17
A Unifying Theme: Streaming
  • All the language models share basic properties:
  • They view the frame buffer as an array of pixel
    computers, with the same program running at each
    pixel (SIMD)
  • Fragments are streamed to each pixel computer
  • The pixel programs have limited state.

18
What is stream programming?
  • A stream is a sequence of data (could be numbers, colors, RGBA vectors, ...).
  • A kernel is a (fragment) program that runs on each element of a stream, generating an output stream (pixel buffer); a minimal kernel is sketched below.
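A minimal sketch of such a kernel in Cg, assuming the usual GPGPU setup in which the input stream is stored one element per texel of a rectangular texture and texture coordinates address texels directly (the names scale_kernel, inputTex and factor are illustrative, not from the talk):

    // One invocation of this kernel runs per output pixel (SIMD over the frame buffer).
    float4 scale_kernel( in float2 coords : TEXCOORD0,   // this pixel's position in the stream
                         uniform samplerRECT inputTex,   // input stream, one element per texel
                         uniform float factor ) : COLOR  // a kernel parameter
    {
        float4 element = texRECT(inputTex, coords);  // read this pixel's stream element
        return element * factor;                     // write the corresponding output element
    }

Drawing a screen-sized quad then runs the kernel over the whole stream in a single rendering pass.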

19
Stream Program → GPU
  • Kernel = vertex/fragment program
  • Input stream = stream of fragments, vertices, or texture data
  • Output stream = frame buffer, pixel buffer, or texture
  • Multiple kernels = multi-pass rendering sequence on the GPU

20
To program the GPU, one must think of it as a
(parallel) stream processor.
21
What is the cost of a program?
  • Each kernel represents one pass of a multi-pass
    computation on the GPU.
  • Readbacks from the GPU to main memory are
    expensive, and so is transferring data to the
    GPU.
  • Thus, the number of kernels in a stream program
    is one measure of how expensive a computation is.

22
What is the cost of a program?
  • Each kernel is a vertex/fragment program. The more complex the program, the longer a fragment takes to move through the rendering pipeline.
  • The complexity of a kernel is another measure of cost in a stream program.

23
What is the cost of a program?
  • Texture accesses on the GPU can be expensive if the accesses are non-local.
  • The number of memory accesses is another measure of complexity in a stream program.

24
Some Examples
25
Computing an inner product
  • Given two (long) vectors a, b, compute c = a · b.
  • If a is presented as a stream a1, a2, ..., then repeating the update c = c + ai * bi suffices.

In c = c + ai * bi, the c on the left-hand side is a write operation and the c on the right-hand side is a read operation: this kernel is not stateless, since c is part of its internal state.
26
Computing an inner product: Simple streams
  • GPU programs are stateless: no read-write operations are allowed.
  • On the GPU, the inner product therefore becomes a sequence of kernels (sketched in Cg below):
  •   K1: ci = ai * bi
  •   K2: ci = ci + ci-1
  •   ...
  •   K(log n): c = c1 + c2
  • This is an expensive operation!
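A hedged sketch of the two kinds of kernels in Cg, assuming a, b and the intermediate stream c are stored one element per texel in rectangular textures and that texture coordinates address texels directly (all names are illustrative); the summation passes are written here as a pairwise reduction that halves the stream each pass:

    // K1: elementwise multiply, ci = ai * bi
    float4 multiply_kernel( in float2 coords : TEXCOORD0,
                            uniform samplerRECT a,
                            uniform samplerRECT b ) : COLOR
    {
        return texRECT(a, coords) * texRECT(b, coords);
    }

    // K2 .. K(log n): each pass sums adjacent pairs, halving the stream,
    // so after log n passes a single texel holds the inner product.
    float4 reduce_kernel( in float2 coords : TEXCOORD0,
                          uniform samplerRECT c ) : COLOR
    {
        float i = floor(coords.x);                       // index of this output element
        float2 left  = float2(2.0 * i + 0.5, coords.y);  // texel centre of element 2i
        float2 right = float2(2.0 * i + 1.5, coords.y);  // texel centre of element 2i + 1
        return texRECT(c, left) + texRECT(c, right);
    }

Each reduction step is a separate rendering pass into a half-sized buffer; the log n passes, plus the final readback of a single pixel, are what make this simple operation expensive on the GPU.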

27
Matrix Operations
  • To multiply two matrices, we can use repeated
    inner-product calculations (good for sparse
    matrices)
  • Or, we can compute partial results for each entry in a single pass (one such pass is sketched below).

Both (naïve) approaches have been used in
practice for effective linear system solvers.
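A hedged sketch in Cg of one accumulation pass of the second approach, assuming each matrix is stored one entry per texel of a rectangular texture (row = y, column = x) and that C is ping-ponged between two buffers across passes; all names are illustrative:

    // One pass of C = A * B: adds the k-th partial product A[i][k] * B[k][j]
    // to every entry C[i][j]. Running n such passes, k = 0 .. n-1, completes the product.
    float4 matmul_pass( in float2 ij : TEXCOORD0,      // x = j + 0.5, y = i + 0.5
                        uniform samplerRECT A,
                        uniform samplerRECT B,
                        uniform samplerRECT C_prev,    // accumulator from the previous pass
                        uniform float k ) : COLOR      // which partial product this pass adds
    {
        float4 a = texRECT(A, float2(k + 0.5, ij.y));  // A[i][k]
        float4 b = texRECT(B, float2(ij.x, k + 0.5));  // B[k][j]
        return texRECT(C_prev, ij) + a * b;            // accumulate the partial result
    }

Packing several matrix entries into the RGBA channels of each texel can reduce both the number of fetches and the number of passes; the sketch ignores such optimizations for clarity.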
28
Strassen's Method?
  • Can we use Strassen's approach by packing submatrices? Each subproblem is half the size, so we can pack 4 subproblems into one matrix.
  • Naïve matrix multiplication: 8 recursive calls → 2 passes:
  •   P(n) = 2 P(n/2)
  • Strassen's approach: 7 recursive calls:
  •   P(n) = (7/4) P(n/2) ?
  • It is difficult to spread recursive problems among passes, so we compromise:
  •   P(n) = 7 P(n/8),
  • which yields P(n) = O(n^0.94): sublinear passes! (A worked solution of the recurrence follows.)
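For reference, a worked solution of the compromise recurrence, assuming a constant number of passes at the base case:

    P(n) = 7\,P(n/8), \qquad P(1) = O(1)
    \;\Longrightarrow\; P(n) = O\!\left(n^{\log_8 7}\right) = O\!\left(n^{0.936}\right) \approx O(n^{0.94}).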

29
Geometric Optimization: Parallel Streams
Can also solve width, bounding box, min-width annulus, line of regression, best-fit circle, ...
Precision issues can be handled by choosing the grid resolution appropriately.
30
GPU = Parallel Computer (FFT)
[Figure: the FFT's parallel structure unfolding over rendering passes 1, 2 and 3]
Parallel step = GPU rendering pass; log n rendering passes are needed for the computation. Individual computations within a pass are performed at different pixels.
31
The GPGPU Challenge
  • Be cognizant of the stream nature of the GPU.
  • Design algorithms that minimize cost under
    streaming measures of complexity rather than
    traditional measures.
  • Implement these algorithms efficiently on the
    GPU, keeping in mind the limited resources
    (memory, program length) and various bottlenecks
    (geometry, fill rate) on the card.

32
Some Open Questions
  • Can we extend standard theoretical models to this new architecture?
  • How many stream kernels are needed to multiply two matrices?
  • What are the limitations of stream architectures?
  • When should we implement an algorithm on the GPU?
  • What's the right way to write and compile GPU programs?
  • Is there a Visual Studio for stream programming?
  • What new hardware support is needed to extend the capabilities of the GPU?
  • Hardware-assisted cryptography?

33
The future
  • The Cell Processor (IBM/Sony/Toshiba)
  • 256 GFLOPS (single precision)
  • 234 M transistors, 4 GHz

Cell prototype die (Pham et al., ISSCC 2005)
34
The future
  • The Physics Processing Unit (?)
  • Ageia proposes a new processing unit dedicated to
    performing the physics calculations in real-time
    simulations
  • collision detection, gravity, forces, etc.
  • No hardware model has been revealed as yet.

35
The GPGPU Community
  • The AT&T GPGPU group
  • http://www.research.att.com/areas/visualization/gpgpu
  • gpgpu@research.att.com
  • GPGPU.org (repository of papers)
  • Three courses at SIGGRAPH, dedicated workshop,
    courses in various universities (ongoing course
    at Penn)
  • Acknowledgements
  • Many thanks to Shankar Krishnan, Sudipto Guha and
    Nabil Mustafa.
  • http://www.research.att.com/suresh/research.html