A User-Programmable Vertex Engine

Provided by: Comdex
Learn more at: https://cse.osu.edu
1
A User-Programmable Vertex Engine
  • Erik Lindholm
  • Mark Kilgard
  • Henry Moreton
  • NVIDIA Corporation
  • Presented by Han-Wei Shen

2
Where does the Vertex Engine fit?
Traditional Graphics Pipeline
  • Transform & Lighting
  • Setup / rasterizer
  • Texture blending
  • Frame-buffer anti-aliasing
3
GeForce 3 Vertex Engine
  • Vertex Program, alongside fixed-function Transform & Lighting
  • Setup / rasterizer
  • Texture blending
  • Frame-buffer anti-aliasing
4
API Support
  • Designed to fit into OpenGL and D3D APIs
  • Program mode vs. Fixed function mode
  • Load and bind program
  • Simple to add to old D3D and OpenGL programs

5
Programming Model
  • Enable vertex program
  • glEnable(GL_VERTEX_PROGRAM_NV)
  • Create vertex program object
  • Bind vertex program object
  • Execute vertex program object

6
Create Vertex Program
  • Programs (assembly) are defined inline as character strings

static const GLubyte vpgm[] = "\
!!VP1.0\
  DP4 o[HPOS].x, c[0], v[0];\
  DP4 o[HPOS].y, c[1], v[0];\
  DP4 o[HPOS].z, c[2], v[0];\
  DP4 o[HPOS].w, c[3], v[0];\
  MOV o[COL0], v[3];\
END";
7
Create Vertex Program (2)
  • Load and bind vertex programs similar to texture
    objects
  • glLoadProgramNV(GL_VERTEX_PROGRAM_NV, 7,
    strlen(programString), programString)
  • .
  • glBindProgramNV(GL_VERTEX_PROGRAM_NV, 7)
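
The load step needs the program text and its length in bytes. The following is a minimal sketch in C of how those two values might be prepared, assuming the NV_vertex_program bracket syntax (o[HPOS], c[0], v[0]). The GLubyte typedef is a stand-in so the block compiles without OpenGL headers, and vpgm_length() is a hypothetical helper; the GL calls appear only in a comment because they need a live context that exposes the extension.

```c
#include <string.h>

/* Stand-in for the OpenGL type so this sketch compiles without GL headers. */
typedef unsigned char GLubyte;

/* Adjacent string literals are concatenated by the compiler, which avoids
   backslash-newline continuations inside a single literal. */
static const GLubyte vpgm[] =
    "!!VP1.0\n"
    "DP4 o[HPOS].x, c[0], v[0];\n"
    "DP4 o[HPOS].y, c[1], v[0];\n"
    "DP4 o[HPOS].z, c[2], v[0];\n"
    "DP4 o[HPOS].w, c[3], v[0];\n"
    "MOV o[COL0], v[3];\n"
    "END";

/* Hypothetical helper: program length in bytes, without the trailing NUL. */
static size_t vpgm_length(void)
{
    return strlen((const char *)vpgm);
}

/* With a context that exposes NV_vertex_program, the calls would be:
 *   glLoadProgramNV(GL_VERTEX_PROGRAM_NV, 7, (GLsizei)vpgm_length(), vpgm);
 *   glBindProgramNV(GL_VERTEX_PROGRAM_NV, 7);
 *   glEnable(GL_VERTEX_PROGRAM_NV);
 */
```

The program id 7 is the one used on the slide; any unused id would do.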

8
Invoke Vertex Program
  • The vertex program is executed once for each vertex
    that is submitted, i.e., for each glVertex call in
  • glBegin()
  • glVertex3f(x,y,z)
  • glEnd()

9
Let's look at the sample program
static const GLubyte vpgm[] = "\
!!VP1.0\
  DP4 o[HPOS].x, c[0], v[0];\
  DP4 o[HPOS].y, c[1], v[0];\
  DP4 o[HPOS].z, c[2], v[0];\
  DP4 o[HPOS].w, c[3], v[0];\
  MOV o[COL0], v[3];\
END";
  • o[HPOS] = M(c[0], c[1], c[2], c[3]) * v[0] -> HPOS: calculate the clip-space position of the vertex
  • o[COL0] = v[3] -> COL0: assign v[3] to the vertex as its diffuse color
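
To make the effect of the sample program concrete, here is a small CPU-side emulation in C. It is only a sketch: the vec4 type and the function names dp4 and run_sample_program are made up for illustration, and the constants c[0] to c[3] are assumed to hold the rows of the transform matrix.

```c
/* One 4-component register ("quad float"). */
typedef struct { float x, y, z, w; } vec4;

/* DP4: 4-component dot product of two registers. */
static float dp4(vec4 a, vec4 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

/* Emulates the sample program. c[0..3] are the matrix rows,
   v0 is the vertex position and v3 is the vertex color. */
static void run_sample_program(const vec4 c[4], vec4 v0, vec4 v3,
                               vec4 *o_hpos, vec4 *o_col0)
{
    o_hpos->x = dp4(c[0], v0);  /* DP4 o[HPOS].x, c[0], v[0]; */
    o_hpos->y = dp4(c[1], v0);  /* DP4 o[HPOS].y, c[1], v[0]; */
    o_hpos->z = dp4(c[2], v0);  /* DP4 o[HPOS].z, c[2], v[0]; */
    o_hpos->w = dp4(c[3], v0);  /* DP4 o[HPOS].w, c[3], v[0]; */
    *o_col0 = v3;               /* MOV o[COL0], v[3];         */
}
```

The four DP4 instructions together form one matrix-vector product, one output component per instruction, and the MOV passes the color through unchanged.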
10
Programming Model
  • Vertex Source: v[0] ... v[15], 16x4 registers
  • Program Constants: c[0] ... c[95], 96x4 registers
  • Temporary Registers: R0 ... R11, 12x4 registers
  • Vertex Program: 128 instructions
  • Vertex Output: o[HPOS], o[COL0], o[COL1], o[FOGP], o[PSIZ], o[TEX0] ... o[TEX7], 15x4 registers
  • All quad floats
11
Input Vertex Attributes
  • v[0] ... v[15]
  • Aliased (tracked) with conventional per-vertex
    attributes (Table 3)
  • Use glVertexAttrib*NV() (e.g., glVertexAttrib4fNV()) to explicitly assign values
  • Can also set several consecutive vertex attributes
    from an array - glVertexAttribs*NV()
  • Can change values inside or outside
    glBegin()/glEnd() pair

12
Program Constants
  • Can only change values outside glBegin()/glEnd()
    pair
  • No automatic aliasing
  • Can be used to track OpenGL matrices (modelview,
    projection, texture, etc.)
  • Example
  • glTrackMatrixNV(GL_VERTEX_PROGRAM_NV, 0,
    GL_MODELVIEW_PROJECTION_NV, GL_IDENTITY_NV)
  • - tracks 4 contiguous program constants starting
    with c[0]

13
Program Constants (contd)
  • DP4 o[HPOS].x, c[0], v[OPOS];
  • DP4 o[HPOS].y, c[1], v[OPOS];
  • DP4 o[HPOS].z, c[2], v[OPOS];
  • DP4 o[HPOS].w, c[3], v[OPOS];
  • What does it do?

14
Program Constants (contd)
  • glTrackMatrixNV(GL_VERTEX_PROGRAM_NV, 4,
    GL_MODELVIEW, GL_INVERSE_TRANSPOSE_NV)
  • DP3 R0.x, c[4], v[NRML];
  • DP3 R0.y, c[5], v[NRML];
  • DP3 R0.z, c[6], v[NRML];
  • What does it do?
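
One way to answer the question is to emulate the three DP3 instructions on the CPU. The sketch below, in C, assumes that the tracked constants hold the rows of the inverse-transpose modelview matrix; vec4, dp3 and transform_normal are illustrative names, not part of any API.

```c
/* One 4-component register ("quad float"). */
typedef struct { float x, y, z, w; } vec4;

/* DP3: dot product over x, y and z only; w is ignored. */
static float dp3(vec4 a, vec4 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/* rows[0..2] stand for c[4], c[5], c[6]: the first three rows of the
   inverse-transpose modelview matrix. Returns the eye-space normal. */
static vec4 transform_normal(const vec4 rows[3], vec4 normal)
{
    vec4 r0 = {0.0f, 0.0f, 0.0f, 0.0f};  /* w is not written by the DP3s */
    r0.x = dp3(rows[0], normal);         /* DP3 R0.x, c[4], v[NRML]; */
    r0.y = dp3(rows[1], normal);         /* DP3 R0.y, c[5], v[NRML]; */
    r0.z = dp3(rows[2], normal);         /* DP3 R0.z, c[6], v[NRML]; */
    return r0;
}
```

For a modelview matrix that scales x by 2, the inverse transpose scales x by 0.5. A normal (1, -1, 0) then becomes (0.5, -1, 0), which is still perpendicular to the transformed tangent (2, 1, 0), whereas transforming the normal with the modelview matrix itself would not preserve that.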

15
Hardware Block Diagram
  • Vertex In -> Vertex Attribute Buffer (VAB) -> Vector FP Core -> Vertex Out
16
Vertex Attribute Buffer (VAB)
[Diagram: VAB entries 0 ... 15, each 128 bits (32 x 4), with dirty bits; a 128-bit path leads to the IB]
17
HW Block Diagram
 
18
Data Path
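[Diagram: three source operands (X, Y, Z, W each) pass through per-source Swizzle and Negate stages into the FPU Core; the result passes through a Write Mask to the destination X, Y, Z, W components]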
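
The data path can be mimicked in a few lines of C. This is a sketch under assumptions: registers are stored as a 4-element array so that a swizzle is just an index table, and read_source and write_masked are hypothetical names.

```c
/* One register; c[0..3] are the x, y, z, w components. */
typedef struct { float c[4]; } vec4;

/* Source modifiers: swizzle (reorder components), then optional negate.
   swz[i] names the source component that feeds output component i. */
static vec4 read_source(vec4 r, const int swz[4], int negate)
{
    vec4 out;
    for (int i = 0; i < 4; i++) {
        float v = r.c[swz[i]];
        out.c[i] = negate ? -v : v;
    }
    return out;
}

/* Destination write mask: only components with mask[i] != 0 are updated. */
static void write_masked(vec4 *dst, vec4 result, const int mask[4])
{
    for (int i = 0; i < 4; i++) {
        if (mask[i]) {
            dst->c[i] = result.c[i];
        }
    }
}
```

For example, reading (1, 2, 3, 4) as -R.yzxw gives (-2, -3, -1, -4), and writing that result with an .xy mask leaves the z and w components of the destination untouched.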
19
Instruction Set: The Ops
  • 17 instructions total
  • MOV, MUL, ADD, MAD, DST
  • DP3, DP4
  • MIN, MAX, SLT, SGE
  • RCP, RSQ, LOG, EXP, LIT
  • ARL

20
Instruction Set: The Core Features
  • Immediate access to sources
  • Swizzle/negate on all sources
  • Write mask on all destinations
  • DP3 and DP4 are the most common graphics ops
  • Cross product is MUL + MAD with swizzling
  • LIT instruction implements Phong lighting

21
Dot Product Instruction
  • DP3 R0.x, R1, R2
  • R0.x = R1.x * R2.x + R1.y * R2.y + R1.z * R2.z
  • DP4 R0.x, R1, R2
  • 4-component dot product

22
MUL instruction
  • MUL R1, R0, R2 (component-wise mult.)
  • R1.x = R0.x * R2.x
  • R1.y = R0.y * R2.y
  • R1.z = R0.z * R2.z
  • R1.w = R0.w * R2.w

23
MAD instruction
  • MAD R1, R2, R3, R4
  • R1 = R2 * R3 + R4
  • component-wise multiply and add
  • Example
  • MAD R1, R0.yzxw, R2.zxyw, -R1
  • What does it do?

24
Cross Product Coding Example
  • Cross product: R2 = R0 x R1
  • MUL R2, R0.zxyw, R1.yzxw;
  • MAD R2, R0.yzxw, R1.zxyw, -R2;
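
The two-instruction cross product can be checked with a CPU-side emulation. The C sketch below spells out each swizzle as a small helper; vec4 and all function names are illustrative.

```c
/* One 4-component register ("quad float"). */
typedef struct { float x, y, z, w; } vec4;

/* Swizzles used by the cross-product idiom. */
static vec4 yzxw(vec4 a) { vec4 r = {a.y, a.z, a.x, a.w}; return r; }
static vec4 zxyw(vec4 a) { vec4 r = {a.z, a.x, a.y, a.w}; return r; }

/* Source negate modifier. */
static vec4 neg(vec4 a) { vec4 r = {-a.x, -a.y, -a.z, -a.w}; return r; }

/* MUL: component-wise product. */
static vec4 mul(vec4 a, vec4 b)
{
    vec4 r = {a.x * b.x, a.y * b.y, a.z * b.z, a.w * b.w};
    return r;
}

/* MAD: component-wise multiply and add, a * b + c. */
static vec4 mad(vec4 a, vec4 b, vec4 c)
{
    vec4 r = {a.x * b.x + c.x, a.y * b.y + c.y,
              a.z * b.z + c.z, a.w * b.w + c.w};
    return r;
}

/* R2 = R0 x R1, using the same two instructions as the slide. */
static vec4 cross(vec4 r0, vec4 r1)
{
    vec4 r2 = mul(zxyw(r0), yzxw(r1));      /* MUL R2, R0.zxyw, R1.yzxw;      */
    r2 = mad(yzxw(r0), zxyw(r1), neg(r2));  /* MAD R2, R0.yzxw, R1.zxyw, -R2; */
    return r2;
}
```

With R0 = (1, 2, 3) and R1 = (4, 5, 6), the MUL produces (15, 6, 8) and the MAD then yields (12 - 15, 12 - 6, 5 - 8) = (-3, 6, -3), which matches the textbook cross product. The w component cancels to zero.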
25
Lighting instruction
  • LIT R1, R0 (Phong lighting model)
  • Input R0 (diffuse, specular, ??, shininess)
  • Output R1 (1, diffuse, specular^shininess, 1)
  • Usually followed by
  • DP3 o[COL0], c[21], R1 (assuming the coefficients
    are in c[21])
  • where c[xx] = (ka, kd, ks, ??)
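
The LIT and DP3 pair can be sketched in C as follows. Two simplifications are assumptions of this sketch, not properties of the hardware: the shininess is treated as a non-negative integer so that the power can be computed by repeated multiplication, and negative dot products are clamped to zero, with the specular term dropped when the diffuse term is zero. All names are illustrative.

```c
/* One 4-component register ("quad float"). */
typedef struct { float x, y, z, w; } vec4;

/* Integer power by repeated multiplication (simplification for the sketch). */
static float ipow(float base, int n)
{
    float r = 1.0f;
    while (n-- > 0) {
        r *= base;
    }
    return r;
}

/* LIT R1, R0 with R0 = (diffuse dot, specular dot, ??, shininess). */
static vec4 lit(vec4 r0)
{
    float diffuse  = r0.x > 0.0f ? r0.x : 0.0f;
    float specular = r0.y > 0.0f ? r0.y : 0.0f;
    vec4 r1;
    r1.x = 1.0f;                                               /* ambient weight  */
    r1.y = diffuse;                                            /* diffuse weight  */
    r1.z = diffuse > 0.0f ? ipow(specular, (int)r0.w) : 0.0f;  /* specular weight */
    r1.w = 1.0f;
    return r1;
}

/* DP3 against (ka, kd, ks, ??) combines the three weights into one intensity. */
static float dp3(vec4 a, vec4 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}
```

With R0 = (0.5, 0.5, ??, 2), LIT returns (1, 0.5, 0.25, 1), and a DP3 with coefficients (ka, kd, ks) = (0.25, 0.5, 1.0) gives 0.25 + 0.25 + 0.25 = 0.75.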

26
Ready to trace some program?
27
Previous Work: Geometry Engine
  • High bandwidth, lots of FLOPS
  • Low clock rate
  • No architectural continuity
  • VERY hard to program
  • Some high-level language support (maybe)
  • A compromise solution (vtx,prim,pix,)

28
Alternative: The CPU
  • Low bandwidth, reasonable FLOPS
  • High clock rate
  • Excellent architectural continuity
  • VERY hard to use efficiently
  • Excellent high-level language support
  • Flexible, but often too slow

29
New Design: The Vertex Engine
  • Simple hardware for a commodity GPU
  • Allows user to manipulate vertex transform
  • Simple-to-use programming model
  • Superset of fixed function mode

30
Why Vertex Processing?
  • Very parallel
  • Use single vertex programming model
  • Hardware can batch or interleave
  • KISS

31
Why Not Primitive Processing?
  • Face culling and clipping break parallelism
  • Complicates memory accesses
  • Inefficient (control takes time)
  • Let hardware designers optimize

32
Programming Model: Vertex I/O
  • Streaming vertex architecture
  • Source data converted to floats
  • Source data loaded
  • Run program
  • Destination data drained
  • Destination data re-formatted for hw
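
The streaming steps above can be sketched as a simple loop in C. Everything here is an illustrative assumption: the source format (three shorts per vertex), the output format (three ints per vertex), and the names vertex_program, scale_by_two and stream_vertices.

```c
#include <stddef.h>

/* One 4-component register ("quad float"). */
typedef struct { float x, y, z, w; } vec4;

/* A per-vertex program: it sees one vertex and nothing else. */
typedef vec4 (*vertex_program)(vec4 position);

/* Example program for the sketch: doubles x, y and z. */
static vec4 scale_by_two(vec4 p)
{
    vec4 r = {2.0f * p.x, 2.0f * p.y, 2.0f * p.z, p.w};
    return r;
}

/* Streams n vertices through the program, one at a time. */
static void stream_vertices(const short src[][3], size_t n,
                            vertex_program program, int dst[][3])
{
    for (size_t i = 0; i < n; i++) {
        /* Source data converted to floats and loaded (w defaults to 1). */
        vec4 in = {(float)src[i][0], (float)src[i][1], (float)src[i][2], 1.0f};

        /* Run program. */
        vec4 out = program(in);

        /* Destination data drained and re-formatted (here: cast to int). */
        dst[i][0] = (int)out.x;
        dst[i][1] = (int)out.y;
        dst[i][2] = (int)out.z;
    }
}
```

Because each iteration depends only on its own vertex, the iterations could be batched or interleaved in any order, which is the property the single-vertex programming model relies on.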

33
Hardware Implementation
  • Vector SIMD Unit + Special Function Unit
  • Multithreaded and pipelined to hide latency
  • Any one instruction/cycle
  • All instructions equal latency
  • Free swizzling/negate/write mask support

34
Conclusion
  • Very simple, efficient implementation
  • Allows vertex programming continuity
  • Stanford Imagine Architecture
  • A work in progress, lots more to come
  • We welcome your feedback