Title: A User-Programmable Vertex Engine
1. A User-Programmable Vertex Engine
- Erik Lindholm
- Mark Kilgard
- Henry Moreton
- NVIDIA Corporation
- Presented by Han-Wei Shen
2. Where does the Vertex Engine fit?
Traditional Graphics Pipeline:
Transform & Lighting -> setup, rasterizer -> texture blending -> frame-buffer anti-aliasing
3. GeForce 3 Vertex Engine
Vertex Program (alongside Transform & Lighting) -> setup, rasterizer -> texture blending -> frame-buffer anti-aliasing
4. API Support
- Designed to fit into OpenGL and D3D APIs
- Program mode vs. Fixed function mode
- Load and bind program
- Simple to add to old D3D and OpenGL programs
5. Programming Model
- Enable vertex program
- glEnable(GL_VERTEX_PROGRAM_NV)
- Create vertex program object
- Bind vertex program object
- Execute vertex program object
6. Create Vertex Program
- Programs (assembly) are defined inline as character strings
static const GLubyte vpgm[] = "!!VP1.0\
  DP4 oHPOS.x, c0, v0\
  DP4 oHPOS.y, c1, v0\
  DP4 oHPOS.z, c2, v0\
  DP4 oHPOS.w, c3, v0\
  MOV oCOL0, v3\
  END";
7. Create Vertex Program (2)
- Load and bind vertex programs, similar to texture objects
- glLoadProgramNV(GL_VERTEX_PROGRAM_NV, 7, strlen(programString), programString)
- ...
- glBindProgramNV(GL_VERTEX_PROGRAM_NV, 7)
8. Invoke Vertex Program
- The vertex program runs each time a vertex is issued, i.e., on every glVertex call:
- glBegin()
- glVertex3f(x,y,z)
- ...
- glEnd()
9. Let's look at the sample program
static const GLubyte vpgm[] = "!!VP1.0\
  DP4 oHPOS.x, c0, v0\
  DP4 oHPOS.y, c1, v0\
  DP4 oHPOS.z, c2, v0\
  DP4 oHPOS.w, c3, v0\
  MOV oCOL0, v3\
  END";
- oHPOS = M(c0,c1,c2,c3) * v0 -> HPOS?
- oCOL0 = v3 -> COL0?
- Calculate the clip-space position of the vertex and assign v3 as its diffuse color
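The sample program can be traced on the CPU. The sketch below is an illustration only, not the hardware's implementation: `Vec4`, `VertexOut`, `dp4`, and `run_sample_program` are hypothetical names, and the constants c0..c3 are assumed to hold the four rows of the transform matrix.

```c
/* One 4-component register (x, y, z, w). Hypothetical type for illustration. */
typedef struct { float x, y, z, w; } Vec4;

/* The two output registers the sample program writes. Hypothetical type. */
typedef struct { Vec4 hpos; Vec4 col0; } VertexOut;

/* DP4: 4-component dot product of two registers. */
static float dp4(Vec4 a, Vec4 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

/* Emulates the sample program for one vertex.
   c[0..3]: matrix rows (assumed), v0: position, v3: incoming color. */
static VertexOut run_sample_program(const Vec4 c[4], Vec4 v0, Vec4 v3) {
    VertexOut out;
    out.hpos.x = dp4(c[0], v0);   /* DP4 oHPOS.x, c0, v0 */
    out.hpos.y = dp4(c[1], v0);   /* DP4 oHPOS.y, c1, v0 */
    out.hpos.z = dp4(c[2], v0);   /* DP4 oHPOS.z, c2, v0 */
    out.hpos.w = dp4(c[3], v0);   /* DP4 oHPOS.w, c3, v0 */
    out.col0 = v3;                /* MOV oCOL0, v3       */
    return out;
}
```

Each DP4 produces one component of the matrix-vector product, so four of them give the full transformed position.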
10. Programming Model
- Vertex Source: v0 ... v15 (16x4 registers)
- Program Constants: c0 ... c95 (96x4 registers)
- Temporary Registers: R0 ... R11 (12x4 registers)
- Vertex Program: 128 instructions
- Vertex Output: oHPOS, oCOL0, oCOL1, oFOGP, oPSIZ, oTEX0 ... oTEX7 (15x4 registers)
- All quad floats
11. Input Vertex Attributes
- v0 ... v15
- Aliased (tracked) with conventional per-vertex attributes (Table 3)
- Use glVertexAttribNV() to explicitly assign values
- Can also specify a scalar value to the vertex attribute array
- glVertexAttributesNV()
- Can change values inside or outside a glBegin()/glEnd() pair
12. Program Constants
- Can only change values outside a glBegin()/glEnd() pair
- No automatic aliasing
- Can be used to track OpenGL matrices (modelview, projection, texture, etc.)
- Example
- glTrackMatrixNV(GL_VERTEX_PROGRAM_NV, 0, GL_MODELVIEW_PROJECTION_NV, GL_IDENTITY_NV)
- tracks 4 contiguous program constants starting with c0
13. Program Constants (cont'd)
- DP4 oHPOS.x, c0, vOPOS
- DP4 oHPOS.y, c1, vOPOS
- DP4 oHPOS.z, c2, vOPOS
- DP4 oHPOS.w, c3, vOPOS
- What does it do?
14. Program Constants (cont'd)
- glTrackMatrixNV(GL_VERTEX_PROGRAM_NV, 4, GL_MODELVIEW, GL_INVERSE_TRANSPOSE_NV)
- DP3 R0.x, c4, vNRML
- DP3 R0.y, c5, vNRML
- DP3 R0.z, c6, vNRML
- What does it do?
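One way to answer the question is to trace the three DP3 instructions by hand. The sketch below assumes the caller supplies the first three rows of the inverse-transpose modelview matrix, as the tracked constants c4..c6 would hold them. `Vec4`, `dp3`, and `transform_normal` are hypothetical names.

```c
/* One 4-component register (x, y, z, w). Hypothetical type for illustration. */
typedef struct { float x, y, z, w; } Vec4;

/* DP3: dot product over x, y, z only; the w components are ignored. */
static float dp3(Vec4 a, Vec4 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/* rows[0..2] stand in for c4, c5, c6 (inverse-transpose modelview rows,
   assumed to be provided). Returns the eye-space normal in R0.xyz. */
static Vec4 transform_normal(const Vec4 rows[3], Vec4 normal) {
    Vec4 r0 = {0.0f, 0.0f, 0.0f, 0.0f};
    r0.x = dp3(rows[0], normal);  /* DP3 R0.x, c4, vNRML */
    r0.y = dp3(rows[1], normal);  /* DP3 R0.y, c5, vNRML */
    r0.z = dp3(rows[2], normal);  /* DP3 R0.z, c6, vNRML */
    return r0;
}
```

With a modelview that scales x by 2, the inverse transpose scales x by 0.5, and the transformed normal stays perpendicular to the transformed surface tangent.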
15. Hardware Block Diagram
Vertex In -> Vertex Attribute Buffer (VAB) -> Vector FP Core -> Vertex Out
16. Vertex Attribute Buffer (VAB)
[Figure: the VAB with entries 0, 1, ..., 14, 15, each 128 bits wide (32 x 4), with dirty bits; a 128-bit path leads to the IB]
17. HW Block Diagram
18. Data Path
[Figure: three source operands (X, Y, Z, W each) -> Swizzle -> Negate -> FPU Core -> Write Mask -> destination (X, Y, Z, W)]
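The source modifiers and the destination write mask in this data path can be modeled in a few lines. This is a software sketch under assumed names (`Vec4`, `swizzle`, `negate`, `write_masked`), not a description of the actual circuitry; component indices 0..3 stand for x, y, z, w.

```c
/* One register; c[0..3] = x, y, z, w. Hypothetical type for illustration. */
typedef struct { float c[4]; } Vec4;

/* Swizzle: sel[i] names the source component routed to lane i. */
static Vec4 swizzle(Vec4 src, const int sel[4]) {
    Vec4 r;
    for (int i = 0; i < 4; ++i) r.c[i] = src.c[sel[i]];
    return r;
}

/* Negate: flips the sign of all four components. */
static Vec4 negate(Vec4 src) {
    Vec4 r;
    for (int i = 0; i < 4; ++i) r.c[i] = -src.c[i];
    return r;
}

/* Write mask: lane i of dst is overwritten only if bit i of mask is set
   (bit 0 = x, bit 1 = y, bit 2 = z, bit 3 = w). */
static void write_masked(Vec4 *dst, Vec4 value, unsigned mask) {
    for (int i = 0; i < 4; ++i) {
        if (mask & (1u << i)) dst->c[i] = value.c[i];
    }
}
```

For example, `MOV R1.xz, -R0.yzxw` corresponds to a swizzle with `{1, 2, 0, 3}`, a negate, and a write with mask `0x5`, leaving R1.y and R1.w untouched.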
19. Instruction Set: The ops
- 17 instructions total
- MOV, MUL, ADD, MAD, DST
- DP3, DP4
- MIN, MAX, SLT, SGE
- RCP, RSQ, LOG, EXP, LIT
- ARL
20. Instruction Set: The Core Features
- Immediate access to sources
- Swizzle/negate on all sources
- Write mask on all destinations
- DP3, DP4: the most common graphics ops
- Cross product is MUL+MAD with swizzling
- LIT instruction implements Phong lighting
21. Dot Product Instruction
- DP3 R0.x, R1, R2
- R0.x = R1.x * R2.x + R1.y * R2.y + R1.z * R2.z
- DP4 R0.x, R1, R2
- 4-component dot product
22. MUL instruction
- MUL R1, R0, R2 (component-wise mult.)
- R1.x = R0.x * R2.x
- R1.y = R0.y * R2.y
- R1.z = R0.z * R2.z
- R1.w = R0.w * R2.w
23. MAD instruction
- MAD R1, R2, R3, R4
- R1 = R2 * R3 + R4
- component-wise multiply and add
- Example
- MAD R1, R0.yzxw, R2.zxyw, -R1
- What does it do?
24. Cross Product Coding Example
Cross product: R2 = R0 x R1
MUL R2, R0.zxyw, R1.yzxw
MAD R2, R0.yzxw, R1.zxyw, -R2
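The two-instruction sequence can be checked by emulating it in plain C. The helpers below (`Vec4`, `swz`, `neg`, `mul`, `mad`, `cross`) are hypothetical names for this sketch; component indices 0..3 stand for x, y, z, w.

```c
/* One register; c[0..3] = x, y, z, w. Hypothetical type for illustration. */
typedef struct { float c[4]; } Vec4;

/* Swizzle: pick source components i0..i3 for lanes x, y, z, w. */
static Vec4 swz(Vec4 s, int i0, int i1, int i2, int i3) {
    Vec4 r = {{ s.c[i0], s.c[i1], s.c[i2], s.c[i3] }};
    return r;
}

/* Negate all four components. */
static Vec4 neg(Vec4 s) {
    Vec4 r = {{ -s.c[0], -s.c[1], -s.c[2], -s.c[3] }};
    return r;
}

/* MUL: component-wise product. */
static Vec4 mul(Vec4 a, Vec4 b) {
    Vec4 r;
    for (int i = 0; i < 4; ++i) r.c[i] = a.c[i] * b.c[i];
    return r;
}

/* MAD: component-wise a * b + c. */
static Vec4 mad(Vec4 a, Vec4 b, Vec4 c) {
    Vec4 r;
    for (int i = 0; i < 4; ++i) r.c[i] = a.c[i] * b.c[i] + c.c[i];
    return r;
}

/* R2 = R0 x R1 using the MUL + MAD sequence from the slide. */
static Vec4 cross(Vec4 r0, Vec4 r1) {
    enum { X, Y, Z, W };
    Vec4 r2 = mul(swz(r0, Z, X, Y, W), swz(r1, Y, Z, X, W)); /* MUL R2, R0.zxyw, R1.yzxw */
    r2 = mad(swz(r0, Y, Z, X, W), swz(r1, Z, X, Y, W), neg(r2)); /* MAD R2, R0.yzxw, R1.zxyw, -R2 */
    return r2;
}
```

The w lane computes R0.w * R1.w minus the same product, so it comes out as zero and the result is a direction vector.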
25. Lighting instruction
- LIT R1, R0 (Phong light model)
- Input R0 = (diffuse, specular, ??, shininess)
- Output R1 = (1, diffuse, specular^shininess, 1)
- Usually followed by
- DP3 oCOL0, c21, R1 (assuming c21 is used)
- where cxx = (ka, kd, ks, ??)
26. Ready to trace some programs?
27. Previous Work: Geometry Engine
- High bandwidth, lots of Flops
- Low clock rate
- No architectural continuity
- VERY hard to program
- Some high-level language support (maybe)
- A compromise solution (vtx, prim, pix, ...)
28. Alternative: The CPU
- Low bandwidth, reasonable Flops
- High clock rate
- Excellent architectural continuity
- VERY hard to use efficiently
- Excellent high-level language support
- Flexible, but often too slow
29. New Design: The Vertex Engine
- Simple hardware for a commodity GPU
- Allows user to manipulate vertex transform
- Simple-to-use programming model
- Superset of fixed function mode
30. Why Vertex Processing?
- Very parallel
- Use single vertex programming model
- Hardware can batch or interleave
- KISS
31. Why Not Primitive Processing?
- Face culling and clipping break parallelism
- Complicates memory accesses
- Inefficient (control takes time)
- Let hardware designers optimize
32. Programming Model: Vertex I/O
- Streaming vertex architecture
- Source data converted to floats
- Source data loaded
- Run program
- Destination data drained
- Destination data re-formatted for hw
33. Hardware Implementation
- Vector SIMD Unit + Special Function Unit
- Multithreaded and pipelined to hide latency
- Any one instruction/cycle
- All instructions equal latency
- Free swizzling/negate/write mask support
34. Conclusion
- Very simple, efficient implementation
- Allows vertex programming continuity
- Stanford Imagine Architecture
- A work in progress, lots more to come
- We welcome your feedback