Title: Shader Performance Analysis on a Modern GPU Architecture
1 Shader Performance Analysis on a Modern GPU Architecture
Victor Moya, Carlos González, Jordi Roca, Agustín Fernández
Department of Computer Architecture, UPC
Roger Espasa
Intel DEG, Barcelona
2 Introduction
- Shaders in GPUs are evolving towards general programming
- Branches, generic loads, scatter
- New types of shaders: geometry shaders in DX10
- Current specialized shaders
- Area hungry
- Unbalancing leads to inefficiencies
- This paper: unify all shaders
- 8% higher performance with less area
3 Outline
- Attila: our GPU architecture
- Attila-Classic: non-unified shaders
- Attila-Unified: unified shaders
- Simulation framework
- Results
5 ATTILA
- Our implementation of current GPUs
- Inspired by both NVIDIA and ATI
- Not an exact match to either pipeline
- Detailed microarchitecture information is lacking
- Educated guessing on our side
- Implemented features
- 2D homogeneous recursive rasterization
- Tiled rasterization
- Hierarchical Z
- Texture compression
- Anisotropic filtering
- Depth compression, fast z/stencil and color clear
7 Attila Classic
[Block diagram: Vertex Fetch, 4x Vertex Shader, Primitive Assembly, Clipping, Triangle Setup, Rasterization, Hierarchical Z, 4x Fragment Shader, 4x ROP, 4x Memory Controller. Label: "Specialized Shaders".]
8 Specialized Shader Issues
- Unbalancing
- In fragment-shading-limited scenarios (typical), up to 30% of the processing power remains idle (for a GPU with 8 vertex and 4 fragment shaders)
- In vertex-shading-limited scenarios, up to 70% of the processing power remains idle
- Dedicated area
- 4 unused vertex shaders have the same processing power as 1 fragment shader
- 4 vertex shaders require 66% of the area of a fragment shader
- Different designs
- Increase the complexity of the microarchitecture
- Increase development and verification time
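The idle-power figures above can be approximated with a toy model. The sketch below is our own reading, not the authors' stated derivation: it assumes a vertex shader has one quarter of the processing power of a fragment shader (from "4 vertex shaders = 1 fragment shader") and that the non-limiting shader type sits completely idle, which is the worst case. The function name and parameters are hypothetical.

```python
def idle_fraction(n_vertex, n_fragment, limited_by,
                  vs_power=0.25, fs_power=1.0):
    """Worst-case fraction of total shader power left idle.

    Assumption: the shader type that is NOT the bottleneck is fully idle.
    vs_power/fs_power are relative processing powers (4 VS = 1 FS).
    """
    vertex_total = n_vertex * vs_power
    fragment_total = n_fragment * fs_power
    total = vertex_total + fragment_total
    if limited_by == "fragment":
        idle = vertex_total      # vertex shaders wait for fragment work
    elif limited_by == "vertex":
        idle = fragment_total    # fragment shaders starve
    else:
        raise ValueError("limited_by must be 'fragment' or 'vertex'")
    return idle / total


# 8 vertex and 4 fragment shaders, as on the slide
print(round(idle_fraction(8, 4, "fragment"), 2))  # 0.33
print(round(idle_fraction(8, 4, "vertex"), 2))    # 0.67
```

Under these assumptions the model gives roughly one third and two thirds idle, which is of the same order as the 30% and 70% quoted on the slide.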
10 Attila Unified
[Block diagram: Vertex Fetch, Scheduler, Distributor, Primitive Assembly, Clipping, Triangle Setup, Rasterization, Hierarchical Z, Unified Shader Pool (4x Shader), 4x ROP, 4x Memory Controller.]
11 Unified Shader Architecture
- Benefits
- Unified programming model
- DX10/SM4 and OpenGL/GLSlang are already pushing for it
- The same features for all the program targets
- Texturing, branching, outputs
- Not just vertex and fragment programs
- DX10 geometry shader
- General Purpose GPU or Stream Processor
- Workload balance
- Shading resources allocated as required at any point of the rendering
12 Unified Shader Architecture
- Costs
- Scheduler
- Select which kind of workload must be processed next
- Partly implemented with multithreading in the fragment shader to hide texture access latency
- Larger instruction memory and constant bank
- Rerouting required
- All the paths cross the shader pool
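The scheduler cost listed above comes down to choosing, each time a shader unit is free, which kind of pending work to run next. A minimal sketch follows. The class name, the queue layout and the fixed "fragments first" priority are all assumptions made for illustration; the slides do not state the policy Attila actually uses.

```python
from collections import deque


class UnifiedShaderScheduler:
    """Toy scheduler for a pool of identical shader units (hypothetical)."""

    PRIORITY = ("fragment", "vertex")  # assumed policy, not from the slides

    def __init__(self, num_units):
        self.free_units = num_units
        self.queues = {kind: deque() for kind in self.PRIORITY}

    def submit(self, kind, item):
        """Queue a vertex or fragment work item."""
        self.queues[kind].append(item)

    def issue(self):
        """Assign pending work to free units; return the (kind, item) pairs."""
        issued = []
        while self.free_units > 0:
            for kind in self.PRIORITY:
                if self.queues[kind]:
                    issued.append((kind, self.queues[kind].popleft()))
                    self.free_units -= 1
                    break
            else:
                break  # no pending work of any kind
        return issued

    def retire(self, count=1):
        """Mark units as free again once their work completes."""
        self.free_units += count


sched = UnifiedShaderScheduler(num_units=3)
sched.submit("vertex", "v0")
sched.submit("vertex", "v1")
sched.submit("fragment", "f0")
print(sched.issue())  # [('fragment', 'f0'), ('vertex', 'v0'), ('vertex', 'v1')]
```

Because every unit can run either kind of work, no unit stays idle while any queue is non-empty, which is the workload-balance benefit claimed for the unified design.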
14 ATTILA Framework
- OpenGL Interceptor tool
- OpenGL library for Attila GPU
- Driver for our Attila GPU
- Attila GPU simulator
- Signal Visualizer Tool
15 Collect, Verify, Simulate, Analyze
[Workflow diagram with stages Collect, Verify, Simulate, Analyze. Blocks: OpenGL Application, GLInterceptor, Trace, GLPlayer, Vendor OpenGL Driver, ATTILA OpenGL Driver, ATI R520/NVidia G70, ATTILA Simulator, Framebuffer, Statistics, Signal Traffic, Signal Visualizer. Two points are marked "CHECK!".]
16 Collect, Verify, Simulate, Analyze
- GLInterceptor
- Captures a trace of OpenGL API calls from a real game
17 Collect, Verify, Simulate, Analyze
- GLPlayer
- Reproduces the captured trace
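The capture and replay steps can be sketched as a wrapper that logs every API call and a player that re-issues the log against another backend. This is an illustration of the idea only: `Interceptor`, `play` and `FakeGL` are hypothetical names, the JSON-lines trace format is an assumption, and the real GLInterceptor and GLPlayer tools work on the actual OpenGL library rather than a Python object.

```python
import json


class Interceptor:
    """Forwards calls to a backend and records each one in a trace list."""

    def __init__(self, backend, trace):
        self._backend = backend
        self._trace = trace

    def __getattr__(self, name):
        target = getattr(self._backend, name)

        def wrapper(*args):
            # Record the call before forwarding it to the real backend
            self._trace.append(json.dumps({"call": name, "args": list(args)}))
            return target(*args)

        return wrapper


def play(trace, backend):
    """Re-issue every recorded call, in order, against a backend."""
    for line in trace:
        record = json.loads(line)
        getattr(backend, record["call"])(*record["args"])


class FakeGL:
    """Stand-in for an OpenGL implementation; it only logs what it receives."""

    def __init__(self):
        self.log = []

    def glClear(self, mask):
        self.log.append(("glClear", mask))

    def glDrawArrays(self, mode, first, count):
        self.log.append(("glDrawArrays", mode, first, count))


trace = []
app_gl = Interceptor(FakeGL(), trace)
app_gl.glClear(0x4000)          # GL_COLOR_BUFFER_BIT
app_gl.glDrawArrays(4, 0, 3)    # GL_TRIANGLES, one triangle

replayed = FakeGL()
play(trace, replayed)
print(replayed.log)
```

Replaying the same trace against two backends (here it would be the vendor driver and the simulator) is what makes a framebuffer-by-framebuffer comparison possible.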
18 Collect, Verify, Simulate, Analyze
- OpenGL Library
- Transforms fixed function into shader code
- 200 API calls supported
- ARB vertex and fragment extensions
- Alpha and fog emulated via shader code
- Driver
- Low-level access
- Attila memory management
19 Collect, Verify, Simulate, Analyze
- ATTILA Simulator
- Detailed cycle-by-cycle simulation of all pipeline stages
- 20 boxes, modeling a 100-deep pipeline
- Execute@Execute functionality embedded at each pipeline stage
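A cycle-by-cycle simulator built from boxes can be sketched as a set of objects that are clocked once per cycle and communicate only through fixed-latency signals. The classes below are hypothetical and far simpler than the real simulator; they only show the structure, assuming every signal has a latency of at least one cycle.

```python
from collections import deque


class Signal:
    """Fixed-latency wire between two boxes (assumed latency >= 1)."""

    def __init__(self, latency):
        self.latency = latency
        self._in_flight = deque()  # (arrival_cycle, data), in send order

    def write(self, cycle, data):
        self._in_flight.append((cycle + self.latency, data))

    def read(self, cycle):
        """Return the next item that has arrived by this cycle, else None."""
        if self._in_flight and self._in_flight[0][0] <= cycle:
            return self._in_flight.popleft()[1]
        return None


class Producer:
    """Box that sends one queued item per cycle."""

    def __init__(self, out, items):
        self.out = out
        self.items = deque(items)

    def clock(self, cycle):
        if self.items:
            self.out.write(cycle, self.items.popleft())


class Consumer:
    """Box that records what it receives and when."""

    def __init__(self, inp):
        self.inp = inp
        self.received = []

    def clock(self, cycle):
        data = self.inp.read(cycle)
        if data is not None:
            self.received.append((cycle, data))


def simulate(boxes, cycles):
    """Clock every box once per simulated cycle."""
    for cycle in range(cycles):
        for box in boxes:
            box.clock(cycle)


wire = Signal(latency=3)
sink = Consumer(wire)
simulate([Producer(wire, ["a", "b"]), sink], cycles=6)
print(sink.received)  # [(3, 'a'), (4, 'b')]
```

Logging what crosses each signal per cycle is the kind of data a signal visualizer can then display.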
20 Find the differences?
[Side-by-side images: Attila | NVIDIA GeForce FX 5900XT]
22 Benchmark
- Unreal Tournament 2004
- Fixed function OpenGL API
- Vertex and fragment shaders generated by our library
- 1024x768 resolution
- 8x anisotropic filtering
- 160 of 450 frames simulated
- 40 frames: 1 day of simulation
- On a Xeon P4 @ 2.0 GHz
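The simulation cost quoted above can be turned into a rough time budget. The sketch assumes a constant rate of 40 frames per day, which is a simplification since frame complexity varies; the function name is hypothetical.

```python
FRAMES_PER_DAY = 40  # from the slide: 40 frames take 1 day of simulation


def simulation_days(frames, frames_per_day=FRAMES_PER_DAY):
    """Estimated simulation time in days at a constant frame rate."""
    return frames / frames_per_day


print(simulation_days(160))  # 4.0   (the simulated subset)
print(simulation_days(450))  # 11.25 (the full trace)
```

At that rate the full 450-frame trace would take more than 11 days per configuration.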
23 Baseline Configuration
- Four vertex shaders (only for Attila-Classic)
- Fragment and unified shader configuration
- 32 threads
- 4 fragments/vertices per thread
- 16 128-bit FP registers available for temporary storage per thread
- n SIMD ALUs
- 1 scalar ALU (optional)
- 1 texture unit per shader unit
- 16 KB texture cache
- Single-cycle bilinear and two-cycle trilinear
- AF up to 16x
- Geometry and rasterization pipelines limited to 1 vertex and 1 triangle per cycle
- Two ROPs: 8 z and 8 color values written per cycle
- Four 64-bit DDR buses: peak bandwidth 64 bytes/cycle
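The peak bandwidth figure follows from the bus configuration: four buses, 8 bytes wide, two transfers per cycle for DDR. The sketch below reproduces that arithmetic and compares it with the raw ROP write rate. The 4 bytes per z or color value and the reading of "8 z and 8 color values" as a total across both ROPs are assumptions, not stated on the slide.

```python
def peak_bytes_per_cycle(num_buses, bus_width_bits, transfers_per_cycle):
    """Peak memory bandwidth in bytes per cycle."""
    return num_buses * (bus_width_bits // 8) * transfers_per_cycle


def rop_write_bytes_per_cycle(z_values, color_values, bytes_per_value=4):
    """Uncompressed ROP write traffic (assumes 4-byte z and color values)."""
    return (z_values + color_values) * bytes_per_value


memory = peak_bytes_per_cycle(num_buses=4, bus_width_bits=64,
                              transfers_per_cycle=2)  # DDR: 2 per cycle
rop = rop_write_bytes_per_cycle(z_values=8, color_values=8)

print(memory)  # 64
print(rop)     # 64
```

Under these assumptions, uncompressed ROP writes alone could consume the whole peak bandwidth, leaving nothing for texture and vertex fetches.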
24 Classic Performance
[Chart: series 2sh, 4sh, 6sh and 8sh; annotated values 7, 40, 45, 75 and 8.]
- 8% improvement for 2-way
- Near-linear improvement for 4 shaders
- Sublinear improvement for 6 and 8 shaders
- Limited by memory bandwidth and latency
25 Frame 330 Detailed Zoom
Vertex shading limited
[Chart: vertex shader and fragment shader workload for 4 vertex shader units and 2 fragment shader units]
26 Unified Shader Performance
[Chart: series 2sh, 4sh, 6sh and 8sh.]
- Unified improvement ranges from 1% (2 shaders) to 8% (eight 1-way shaders)
- Fragment shading limited
- Vertex fetch limited
- Geometry pipeline limited
27 Area Estimation
160 120 40 2 vertex shader 2.5 2
fragments shader 15 5 (other)
28 Shader Scaling vs Transistors
[Chart: series 2sh, 4sh, 6sh and 8sh.]
- Linear for 4 shader units, sublinear for more than 4 shader units
- Up to 30% more efficient per area for the unified architecture (two 1-way shaders)
29 Conclusion
- Attila Unified architecture has better performance than Attila Classic with less hardware
- Up to 8% better performance
- 8% to 25% less area required
- 10% to 30% better performance per area
- Up to 8% better performance for 2-way shader units
- 160% better performance from 2 to 8 fragment or unified shader units
- Memory bandwidth limited beyond 4 shaders
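Performance per area combines the two previous figures: a design that is faster by a fraction p and smaller by a fraction a improves performance per area by (1 + p) / (1 - a) - 1. The sketch below applies that formula. The pairing of inputs in the example is illustrative only, since the slides do not say which performance gain goes with which area saving.

```python
def perf_per_area_gain(perf_gain, area_saving):
    """Relative performance-per-area improvement.

    perf_gain:   fractional speedup (0.08 means 8% faster)
    area_saving: fractional area reduction (0.25 means 25% less area)
    """
    if not 0.0 <= area_saving < 1.0:
        raise ValueError("area_saving must be in [0, 1)")
    return (1.0 + perf_gain) / (1.0 - area_saving) - 1.0


# Illustrative pairing (assumption): 1% faster with 8% less area
print(round(perf_per_area_gain(0.01, 0.08), 3))  # 0.098
```

With that pairing the result is close to the lower end of the 10% to 30% range quoted above.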
31 Performance of Attila Unified vs Classic Attila