1
Shader Performance Analysis on a Modern GPU Architecture
Victor Moya, Carlos González, Jordi Roca, Agustín Fernández
Department of Computer Architecture, UPC
Roger Espasa
Intel DEG Barcelona
2
Introduction
  • Shaders in GPUs are evolving towards general programming
    • Branches, generic loads, scatter
    • New types of shaders: geometry shaders in DX10
  • Current specialized shaders
    • Area hungry
    • Workload imbalance leads to inefficiencies
  • This paper: unify all shaders
    • 8% higher performance with fewer area resources

3
Outline
  • Attila: our GPU architecture
  • Attila-Classic: non-unified shaders
  • Attila-Unified: unified shaders
  • Simulation Framework
  • Results

5
ATTILA
  • Our implementation of current GPUs
    • Inspired by both NVIDIA and ATI
    • Not exact to either pipeline
    • Lack of detailed microarchitecture information
    • Educated guessing on our side
  • Implemented features
    • 2D homogeneous recursive rasterization
    • Tiled rasterization
    • Hierarchical Z
    • Texture compression
    • Anisotropic filtering
    • Depth compression, fast z/stencil and color clear

7
Attila Classic
[Figure: Attila Classic pipeline. Vertex Fetch -> 4 Vertex Shaders -> Primitive Assembly -> Clipping -> Triangle Setup -> Rasterization -> Hierarchical Z -> 4 Fragment Shaders -> 4 ROPs -> 4 Memory Controllers. Annotation: "Specialized Shaders".]
8
Specialized Shader Issues
  • Workload imbalance
    • In fragment-shading-limited scenarios (typical), up to 30% of the processing power remains idle (for a GPU with 8 vertex and 4 fragment shaders)
    • In vertex-shading-limited scenarios, up to 70% of the processing power remains idle
  • Dedicated area
    • 4 unused vertex shaders have the same processing power as 1 fragment shader
    • 4 vertex shaders require 66% of the area of a fragment shader
  • Different designs
    • Increase the complexity of the microarchitecture
    • Increase development and verification time
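
The idle-power figures on this slide can be roughly reproduced with a little arithmetic. The sketch below is an illustration, not the authors' method: it assumes the slide's ratio of 4 vertex shaders per fragment shader, and it assumes the worst case in which the non-bottleneck shader type sits completely idle. The function name `idle_fraction` is hypothetical.

```python
from fractions import Fraction


def idle_fraction(vertex_units, fragment_units, limited_by, vertex_per_fragment=4):
    """Fraction of total shading power left idle when one shader type is the bottleneck.

    Power is measured in fragment-shader equivalents. vertex_per_fragment vertex
    shaders are assumed to match one fragment shader (the slide's 4:1 ratio).
    Worst-case assumption: the non-bottleneck shader type is completely idle.
    """
    vertex_power = Fraction(vertex_units, vertex_per_fragment)
    fragment_power = Fraction(fragment_units)
    total = vertex_power + fragment_power
    if limited_by == "fragment":
        return vertex_power / total    # vertex shaders have nothing to do
    if limited_by == "vertex":
        return fragment_power / total  # fragment shaders have nothing to do
    raise ValueError("limited_by must be 'fragment' or 'vertex'")


# Configuration quoted on the slide: 8 vertex and 4 fragment shaders.
frag_limited = idle_fraction(8, 4, "fragment")
vert_limited = idle_fraction(8, 4, "vertex")
print(f"fragment-limited: {float(frag_limited):.0%} idle")  # 33%
print(f"vertex-limited: {float(vert_limited):.0%} idle")    # 67%
```

Under these assumptions the results (1/3 and 2/3) are close to the "up to 30%" and "up to 70%" quoted on the slide.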

10
Attila Unified
[Figure: Attila Unified pipeline. Blocks: Vertex Fetch, Primitive Assembly, Clipping, Triangle Setup, Rasterization, Hierarchical Z, a Scheduler and Distributor, a Unified Shader Pool of 4 Shaders, 4 ROPs and 4 Memory Controllers.]
11
Unified Shader Architecture
  • Benefits
    • Unified programming model
      • DX10/SM4 and OpenGL/GLSlang are already pushing for it
      • The same features for all the program targets: texturing, branching, outputs
    • Not just vertex and fragment programs
      • DX10 geometry shader
      • General Purpose GPU or Stream Processor
    • Workload balance
      • Shading resources allocated as required at any point of the rendering

12
Unified Shader Architecture
  • Costs
    • Scheduler
      • Selects which kind of workload must be processed next
      • Already partly implemented by the multithreading used in the fragment shader to hide texture access latency
    • Larger instruction memory and constant bank
    • Rerouting required
      • All the paths cross the shader pool
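
To make the scheduler's job concrete, here is a minimal sketch of a unit that selects which kind of workload a free unified shader processes next. The slides do not describe the actual selection policy, so the fixed priority order used here is purely an assumption, and the class and method names (`WorkloadScheduler`, `submit`, `select`) are hypothetical.

```python
from collections import deque


class WorkloadScheduler:
    """Toy scheduler that picks the next input for a free unified shader.

    Assumption: a fixed priority order between workload kinds. The real
    Attila policy is not given in the slides.
    """

    def __init__(self, priority=("vertex", "fragment")):
        self.priority = priority
        self.queues = {kind: deque() for kind in priority}

    def submit(self, kind, item):
        """Queue a pending input of the given kind."""
        self.queues[kind].append(item)

    def select(self):
        """Return (kind, item) for the next input to shade, or None if idle."""
        for kind in self.priority:
            if self.queues[kind]:
                return kind, self.queues[kind].popleft()
        return None


scheduler = WorkloadScheduler()
scheduler.submit("fragment", "f0")
scheduler.submit("vertex", "v0")
scheduler.submit("fragment", "f1")
issued = [scheduler.select() for _ in range(4)]
print(issued)
```

Swapping the `priority` tuple changes which workload wins when both queues hold work, which is the kind of decision the slide attributes to the scheduler.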

14
ATTILA Framework
  • OpenGL Interceptor tool
  • OpenGL library for Attila GPU
  • Driver for our Attila GPU
  • Attila GPU simulator
  • Signal Visualizer Tool

15
[Figure: ATTILA framework flow in four phases: Collect, Verify, Simulate, Analyze. Blocks: OpenGL Application, GLInterceptor, Trace, GLPlayer, Vendor OpenGL Driver on an ATI R520/NVidia G70 (twice), ATTILA OpenGL Driver, ATTILA Simulator, three Framebuffers with two "CHECK!" comparisons, Statistics, Signal Traffic and Signal Visualizer.]
16
GLInterceptor
  • Captures a trace of OpenGL API calls from a real game
[Figure: ATTILA framework flow diagram (Collect, Verify, Simulate, Analyze)]
17
GLPlayer
  • Reproduces the captured trace
[Figure: ATTILA framework flow diagram (Collect, Verify, Simulate, Analyze)]
18
OpenGL Library
  • Transforms fixed function into shader code
  • 200 API calls supported
  • ARB vertex and fragment extensions
  • Alpha and fog emulated via shader code
Driver
  • Low-level access
  • Attila memory management
[Figure: ATTILA framework flow diagram (Collect, Verify, Simulate, Analyze)]
19
ATTILA Simulator
  • Detailed cycle-by-cycle simulation of all pipeline stages
  • 20 boxes, modeling a 100-deep pipeline
  • Execute-at-execute functionality embedded at each pipeline stage
[Figure: ATTILA framework flow diagram (Collect, Verify, Simulate, Analyze)]
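
The cycle-by-cycle, box-based simulation mentioned on this slide can be sketched as follows. This is a simplified illustration under stated assumptions, not the simulator's code: boxes are pipeline stages clocked once per cycle, they are connected by signals with a fixed latency, and each box performs its functional work when the data reaches it. The classes `Signal` and `Box` and the three-signal pipeline are hypothetical.

```python
from collections import deque


class Signal:
    """Wire between two boxes that delivers data a fixed number of cycles later."""

    def __init__(self, latency):
        self.latency = latency
        self.in_flight = deque()  # (arrival_cycle, data), in write order

    def write(self, cycle, data):
        self.in_flight.append((cycle + self.latency, data))

    def read(self, cycle):
        """Return data that has arrived by this cycle, or None."""
        if self.in_flight and self.in_flight[0][0] <= cycle:
            return self.in_flight.popleft()[1]
        return None


class Box:
    """Pipeline stage: reads its input signal, does its work, writes its output."""

    def __init__(self, name, inp, out, work):
        self.name, self.inp, self.out, self.work = name, inp, out, work

    def clock(self, cycle):
        data = self.inp.read(cycle)
        if data is not None:
            # The functional result is computed here, inside the stage.
            self.out.write(cycle, self.work(data))


# Two boxes joined by three signals with latencies 1, 2 and 3 cycles.
s0, s1, s2 = Signal(1), Signal(2), Signal(3)
boxes = [Box("A", s0, s1, lambda x: x + 1), Box("B", s1, s2, lambda x: x * 2)]

s0.write(0, 5)  # inject one input at cycle 0
results = []
for cycle in range(10):
    for box in boxes:
        box.clock(cycle)
    out = s2.read(cycle)
    if out is not None:
        results.append((cycle, out))
print(results)  # [(6, 12)]
```

In this toy model the end-to-end depth is the sum of the signal latencies (6 cycles here), which is how a small number of boxes can stand for a much deeper pipeline.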
20
Find the differences?
[Figure: side-by-side renderings labeled Attila and NVIDIA GeForce FX 5900XT]
22
Benchmark
  • Unreal Tournament 2004
    • Fixed-function OpenGL API
    • Vertex and fragment shaders generated by our library
  • 1024x768 resolution
  • 8x anisotropic filtering
  • 160 of 450 frames simulated
    • 40 frames: 1 day of simulation
    • On a Xeon P4 @ 2.0 GHz

23
Baseline Configuration
  • Four vertex shaders (only for Attila-Classic)
  • Fragment and unified shader configuration
    • 32 threads
    • 4 fragments/vertices per thread
    • 16 128-bit FP registers available for temporary storage per thread
    • n SIMD ALUs
    • 1 scalar ALU (optional)
  • 1 texture unit per shader unit
    • 16 KB texture cache
    • Single-cycle bilinear and two-cycle trilinear
    • AF up to 16x
  • Geometry and rasterization pipelines limited to 1 vertex and 1 triangle per cycle
  • Two ROPs: 8 z and 8 color values written per cycle
  • Four 64-bit DDR buses: peak bandwidth 64 bytes/cycle
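
The peak-bandwidth figure follows directly from the bus configuration. As a quick check, assuming that DDR here means two transfers per memory clock cycle on each bus (the helper name `peak_bytes_per_cycle` is hypothetical):

```python
def peak_bytes_per_cycle(buses, bus_width_bits, transfers_per_cycle):
    """Peak memory bandwidth in bytes per cycle across all buses."""
    return buses * (bus_width_bits // 8) * transfers_per_cycle


# Four 64-bit buses, double data rate (2 transfers per cycle).
peak = peak_bytes_per_cycle(buses=4, bus_width_bits=64, transfers_per_cycle=2)
print(peak)  # 64
```

Each bus moves 8 bytes per transfer, so 4 buses x 8 bytes x 2 transfers gives the 64 bytes/cycle quoted on the slide.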

24
Classic Performance
[Figure: Attila Classic performance chart for 2, 4, 6 and 8 shaders (2sh, 4sh, 6sh, 8sh); chart annotations: 7, 40, 45, 75, 8]
  • 8% improvement for 2-way
  • Near-linear improvement for 4 shaders
  • Sublinear improvement for 6 and 8 shaders
    • Limited by memory bandwidth and latency

25
Frame 330 Detailed Zoom
Vertex shading limited
[Figure: vertex shader and fragment shader workload for 4 vertex shader units and 2 fragment shader units]
26
Unified Shader Performance
[Figure: unified shader performance chart for 2, 4, 6 and 8 shaders (2sh, 4sh, 6sh, 8sh)]
  • Unified improvement ranges from 1% (2 shaders) to 8% (eight 1-way shaders)
  • Fragment shading limited
  • Vertex fetch limited
  • Geometry pipeline limited

27
Area Estimation
[Table: area estimation; layout lost in extraction. Surviving cells in original order: 160, 120, 40, 2, vertex shader, 2.5, 2, fragment shader, 15, 5, (other)]
28
Shader Scaling vs Transistors
[Figure: shader scaling versus transistors for 2, 4, 6 and 8 shaders (2sh, 4sh, 6sh, 8sh)]
  • Linear up to 4 shader units, sublinear for more than 4 shader units
  • Up to 30% more efficient per area for the unified architecture (two 1-way shaders)

29
Conclusion
  • Attila Unified architecture has better performance than Attila Classic with less hardware
    • Up to 8% better performance
    • 8% to 25% less area required
    • 10% to 30% better performance per area
  • Up to 8% better performance for 2-way shader units
  • 160% better performance from 2 to 8 fragment or unified shader units
  • Memory bandwidth limited beyond 4 shaders
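
The scaling claim can be put in terms of efficiency. The sketch below assumes that "160 better performance" means 160% above the 2-shader baseline, that is, a 2.6x speedup, and compares it with the ideal 4x from quadrupling the shader units. The function name `scaling_efficiency` is hypothetical.

```python
def scaling_efficiency(improvement_pct, units_before, units_after):
    """Return (speedup, efficiency) for a given percentage improvement.

    Efficiency is the achieved speedup divided by the ideal linear speedup.
    """
    speedup = 1 + improvement_pct / 100
    ideal = units_after / units_before
    return speedup, speedup / ideal


speedup, efficiency = scaling_efficiency(160, units_before=2, units_after=8)
print(f"speedup {speedup:.1f}x, efficiency {efficiency:.0%}")  # speedup 2.6x, efficiency 65%
```

Under that reading, going from 2 to 8 units delivers about 65% of linear scaling, which is consistent with the conclusion that memory bandwidth limits performance beyond 4 shaders.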

30
  • Questions

31
Performance of Attila Unified vs Classic Attila