Graphics on GRAMPS - PowerPoint PPT Presentation

Slides: 32
Provided by: KayvonFa1

1
Graphics on GRAMPS
  • Jeremy Sugerman
  • Kayvon Fatahalian

2
Background
  • Context: broader research investigation
    generalizing GPU/Cell/compute cores and
    combining them with CPUs.
  • Fundamental beliefs:
  • Real data parallel apps still have
    performance-critical non-data parallel pieces
  • Existing parallel programming models are too
    constrained (GPUs) or too hard/vague (CPUs)
  • Queues are an excellent idiom to capture
    producer-consumer parallelism (thread and data)
  • Fixed function execution units are not a problem,
    but fixed control paths are

3
Compute Cores
  • CPUs designed for single threads per core
  • Minimal FLOPS per core
  • Compute cores designed for lots of math per core
  • Many threads per core
  • Sometimes wider SIMD per thread
  • SIMD width hardware threads ops / core
  • And, more compute than CPU cores fit per chip
  • Many examples: GPU, Cell, Niagara, Larrabee

4
Simplified Direct3D Pipeline
  • Application launches some drawing
  • Vertex Assembly (Fixed, Non-Data Parallel)
  • Vertex Processing (Programmable, Data Parallel)
  • Primitive Assembly (Fixed, Non-Data Parallel)
  • Primitive Processing (Programmable, Data
    Parallel)
  • Fragment Assembly (Fixed, Non-Data Parallel)
  • Fragment Processing (Programmable, Data Parallel)
  • Pixel / Image Assembly (Fixed, Non-Data Parallel)
  • Only Data Parallel stages are programmable!

5
Direct3D Pipeline Properties
  • There is a reason only data parallel stages are
    programmable.
  • Shader stages are inherently per-element (e.g.
    vertex / primitive / fragment) and stateless
    between them.
  • Assembly stages also run on many elements, but
    they have inter-element dependencies
  • State can be remembered (vertex caching)
  • Inputs can be used by multiple outputs (strips)
  • Programmable Assembly requires heavier (more
    serial) threads than Shaders.

6
Question
  • Can fixed-function control be decoupled from
    efficient graphics performance on a compute-
    heavy architecture?
  • Does not necessarily exclude fixed-function
    execution blocks (e.g. rasterizer, texture units)

7
This Talk
  • GRAMPS: our current model for programming compute
    cores.
  • Implementing Direct3D 10 in software with
    GRAMPS.
  • (Potentially) thoughts about how REYES, ray
    tracers map to GRAMPS.
  • No explicit discussion of heterogeneous cores.
  • No fancy scheduling algorithms (yet?)

8
Example: Simple 3D Pipeline
Input Vertices → Vertex Shading → Transformed Vertices
→ Primitive Assembly → Primitives → Rasterize (Assemble)
→ Fragments → Fragment Shading → Shaded Fragments
→ Image Assembly → Framebuffer Pixels
9
GRAMPS
  • General Runtime/Architecture for Multicore
    Parallel Systems
  • Models execution as a graph of queues connected
    by threads
  • Graph specified by host program
  • Simulator for exploring compute cores
  • Currently conflates hardware and runtime
  • Number of cores, thread contexts, SIMD width are
    all parameters

10
Simple GRAMPS core
  • T - threads/core
  • S - SIMD ALUs/core
  • R - registers/thread
  • 1 thread runs in each clock
  • Threads issue vector instructions (think S-wide
    SSE)

[Diagram: one core with an L1 data cache (or scratchpad),
thread contexts Thread 0 through Thread T-1, each with R
registers, and ALU 0 through ALU S-1]

11
D3D10 Setup
  • App defines 3 shading environments
  • Vertex, geometry, fragment
  • Attach programs and resources
  • App configures fixed function units
  • Fixed number of modes
  • Attach resources
  • App submits work (vertices) to pipeline
  • Graphics runtime executes until completion

12
GRAMPS Setup
  • App defines a set of queues
  • App defines a set of thread environments
  • App attaches queues as thread inputs and outputs
  • App bootstraps computation by inserting data into
    queue
  • Runtime executes threads until completion
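
The setup steps above can be sketched in C as a small graph-building interface. This is only an illustration under stated assumptions: `Graph`, `ThreadEnv`, `graph_add_queue`, `graph_add_thread`, `graph_attach_input`, `graph_attach_output` and `graph_consumers` are hypothetical names, not the GRAMPS API, and bootstrapping and execution are not modeled.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical limits, chosen only to keep the sketch static. */
enum { MAX_QUEUES = 8, MAX_THREADS = 8, MAX_PORTS = 4 };

typedef enum { THREAD_SHADER, THREAD_ASSEMBLE, THREAD_FIXED } ThreadType;

typedef struct {
    ThreadType type;
    int inputs[MAX_PORTS];   /* ids of queues this environment reads  */
    int outputs[MAX_PORTS];  /* ids of queues this environment writes */
    int num_inputs;
    int num_outputs;
} ThreadEnv;

typedef struct {
    int num_queues;
    int num_threads;
    ThreadEnv threads[MAX_THREADS];
} Graph;

void graph_init(Graph *g) { memset(g, 0, sizeof *g); }

/* "App defines a set of queues": returns the new queue id. */
int graph_add_queue(Graph *g) {
    assert(g->num_queues < MAX_QUEUES);
    return g->num_queues++;
}

/* "App defines a set of thread environments": returns the new thread id. */
int graph_add_thread(Graph *g, ThreadType type) {
    assert(g->num_threads < MAX_THREADS);
    g->threads[g->num_threads].type = type;
    return g->num_threads++;
}

/* "App attaches queues as thread inputs and outputs". */
void graph_attach_input(Graph *g, int thread, int queue) {
    ThreadEnv *t = &g->threads[thread];
    assert(queue < g->num_queues && t->num_inputs < MAX_PORTS);
    t->inputs[t->num_inputs++] = queue;
}

void graph_attach_output(Graph *g, int thread, int queue) {
    ThreadEnv *t = &g->threads[thread];
    assert(queue < g->num_queues && t->num_outputs < MAX_PORTS);
    t->outputs[t->num_outputs++] = queue;
}

/* Number of thread environments that read from the given queue. */
int graph_consumers(const Graph *g, int queue) {
    int n = 0;
    for (int t = 0; t < g->num_threads; ++t)
        for (int i = 0; i < g->threads[t].num_inputs; ++i)
            if (g->threads[t].inputs[i] == queue) ++n;
    return n;
}
```

A host program would call `graph_init`, add its queues and thread environments, wire them together, and only then insert the first data into an input queue.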

13
GRAMPS Entities: Execution
  • Threads: Assemble, Shader, Fixed
  • Assemble: stateful, akin to a regular thread
  • Fixed: special-purpose hardware wrapped to appear
    as an Assemble thread
  • Shader: stateless and data parallel

14
GRAMPS Entities: Data
  • Queues for producer-consumer parallelism
  • Queues for aggregating coherent work
  • Queues support push and reserve/commit for
    in-place Assembly
  • Chunks are the units / granularity at which
    Queues are manipulated.

15
GRAMPS Scheduling
  • GRAMPS assigns Threads to hw contexts
  • Based on graph, current Queue contents
  • Tiered scheduling model
  • Tier-0: Trivially puts threads onto hw threads
  • Tier-1: Builds schedules for Tier-0.
  • Tier-N: Arbitrarily clever. Doesn't exist.

16
System (how it works today)
17
D3D10 on GRAMPS
[Diagram: the D3D10 pipeline expressed as a GRAMPS graph.
Legend: shader thread, assemble thread, fixed function in GPU.]
  • Threads: idxVtxAssemble, vtxShade, primAssemble,
    primShade, rastAssemble, tri setup / clip / cull,
    rasterize, fragShade, blend / ztest
  • Queues: Index, preVtxShade, postVtxShade,
    prePrimAssemble, prePrimShade, postPrimShade,
    preRast, tri queue 0 through tri queue N,
    preFragShade, postFragShade
  • rasterize, preFragShade queue, fragShade,
    postFragShade queue and blend / ztest are drawn
    once per tri queue
18
Internal Queues
  • Queues are just memory plus a state struct (see
    below)
  • For now, Queues are finite
  • Queues are a contiguous array of chunks
  • Chunks: granularity of manipulation

queue {
    BYTE ptr[num_chunks * chunk_byte_width];
    int  num_chunks;
    int  chunk_byte_width;
    int  head;
    int  tail;
    int  reclaim;
    bool done[num_chunks];
}
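
As a rough illustration of a finite queue stored as a contiguous array of chunks, here is a minimal single-threaded ring-buffer sketch in C. The slides do not define what `head`, `tail`, `reclaim` or `done` mean, so this sketch assumes `head` counts chunks consumed and `tail` counts chunks produced, and it leaves out `reclaim` and `done`. The sizes and the names `queue_init`, `queue_push` and `queue_pop` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

typedef unsigned char BYTE;

/* Hypothetical sizes; in the simulator these would be per-queue parameters. */
enum { NUM_CHUNKS = 4, CHUNK_BYTE_WIDTH = 64 };

typedef struct {
    BYTE ptr[NUM_CHUNKS * CHUNK_BYTE_WIDTH]; /* contiguous chunk storage        */
    int  head;  /* assumption: total chunks consumed so far */
    int  tail;  /* assumption: total chunks produced so far */
} Queue;

void queue_init(Queue *q) { memset(q, 0, sizeof *q); }

/* Copy one whole chunk into the queue; returns false if the queue is full. */
bool queue_push(Queue *q, const BYTE *chunk) {
    if (q->tail - q->head == NUM_CHUNKS) return false;
    memcpy(&q->ptr[(q->tail % NUM_CHUNKS) * CHUNK_BYTE_WIDTH],
           chunk, CHUNK_BYTE_WIDTH);
    q->tail++;
    return true;
}

/* Copy the oldest chunk out of the queue; returns false if it is empty. */
bool queue_pop(Queue *q, BYTE *chunk) {
    if (q->tail == q->head) return false;
    memcpy(chunk, &q->ptr[(q->head % NUM_CHUNKS) * CHUNK_BYTE_WIDTH],
           CHUNK_BYTE_WIDTH);
    q->head++;
    return true;
}
```

All traffic moves in whole chunks, which is the point of the "granularity of manipulation" bullet. A real runtime would block or reschedule the thread instead of returning `false`.
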
19
Ex: GRAMPS has chunks
Index queue → idxVtxAssemble → preVtxShade queue → vtxShade
→ postVtxShade queue
  • index_queue chunks contain vertex indices
  • preVtxShade_queue chunks contain 16
    pre-transformed vertices
  • postVtxShade_queue chunks contain 16 transformed
    vertices
20
Ex: GRAMPS has chunks
rasterize → preFragShade queue → fragShade
preFragShade_queue chunks contain:
  • Interpolated inputs for 16 fragments
  • Liveness mask per fragment
  • x,y position per quad
  • Uniform data shared across all fragments
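
One way to picture a chunk that carries data at several frequencies is a plain C struct. This is a hypothetical layout, not the simulator's: the number of interpolants, the field names and the `triangle_id` stand-in for uniform data are all assumptions. The 16 fragments per chunk and 4 fragments per quad come from the slides.

```c
#include <assert.h>
#include <stdint.h>

enum {
    FRAGS_PER_CHUNK  = 16,
    FRAGS_PER_QUAD   = 4,
    QUADS_PER_CHUNK  = FRAGS_PER_CHUNK / FRAGS_PER_QUAD,
    NUM_INTERPOLANTS = 4   /* assumption: e.g. one float4 per fragment */
};

typedef struct { int x, y; } QuadPos;

typedef struct {
    float    inputs[FRAGS_PER_CHUNK][NUM_INTERPOLANTS]; /* per fragment */
    uint16_t live_mask;                  /* per fragment: bit i = fragment i is live */
    QuadPos  quad_pos[QUADS_PER_CHUNK];  /* per quad */
    int      triangle_id;                /* per chunk: stand-in for uniform data */
} FragChunk;

/* Count the live fragments in a chunk by walking the liveness mask. */
int frag_chunk_live_count(const FragChunk *c) {
    int n = 0;
    for (int i = 0; i < FRAGS_PER_CHUNK; ++i)
        n += (c->live_mask >> i) & 1;
    return n;
}
```
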
21
Queue API
  • Window: view into a contiguous range of chunks
    for assemble threads
  • Symmetric for producing/consuming access

qwin {
    BYTE ptr;
    int  num;
    int  id;
}
  • Shader threads just have push

22
Queue manipulation
(All threads)
void produce()    // push
(Assemble shader only)
qwin reserve(qwin q, int num_chunks)
qwin commit(qwin q, int num_chunks)
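
The reserve/commit pair can be sketched for the producing side as follows. The two signatures match the slide, but everything else is an assumption made for illustration: `ptr` is treated as a byte pointer, `id` as a queue index, the queue does not wrap, and a failed `reserve` returns the window unchanged where a real runtime would block the thread.

```c
#include <assert.h>

typedef unsigned char BYTE;

enum { NUM_CHUNKS = 4, CHUNK_BYTE_WIDTH = 64 };  /* hypothetical sizes */

typedef struct {
    BYTE *ptr;  /* first chunk covered by the window     */
    int   num;  /* number of chunks covered by the window */
    int   id;   /* assumption: index of the owning queue  */
} qwin;

typedef struct {
    BYTE storage[NUM_CHUNKS * CHUNK_BYTE_WIDTH];
    int  committed;  /* chunks already published to consumers      */
    int  reserved;   /* chunks handed to the producer (>= committed) */
} Queue;

static Queue g_queues[1];  /* hypothetical queue table, zero-initialized */

/* Grow the window by num_chunks if the queue still has room. */
qwin reserve(qwin w, int num_chunks) {
    Queue *q = &g_queues[w.id];
    if (q->reserved + num_chunks > NUM_CHUNKS)
        return w;  /* no room: a real runtime would block here */
    q->reserved += num_chunks;
    w.ptr = &q->storage[q->committed * CHUNK_BYTE_WIDTH];
    w.num = q->reserved - q->committed;
    return w;
}

/* Publish the first num_chunks of the window and shrink it accordingly. */
qwin commit(qwin w, int num_chunks) {
    Queue *q = &g_queues[w.id];
    assert(num_chunks <= w.num);
    q->committed += num_chunks;
    w.ptr += num_chunks * CHUNK_BYTE_WIDTH;
    w.num -= num_chunks;
    return w;
}
```

Because the window always covers a contiguous range, an assemble thread can build chunks in place and commit only the ones that are complete.
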
23
Internal threads
  • Defines a type of thread

ThreadEnv {
    type: shader, assemble, fixed-func
    Program code
    uniforms / constant data
    sampler / texture / resource id bindings
    List of input queues
    List of output queues
}
24
Shader threads
  • Shading language unchanged (HLSL)
  • Still write shaders in terms of single elements
  • Compilation produces code to operate on chunks

void hlsl_likefn(const element inputEl,
                 element outputEl,
                 const sampler foo,
                 const tex3d tex)
25
Internal shader threads
  • Shader thread code processes chunks
  • Input:
  • GRAMPS pre-reserved chunks from in/out queues
  • Environment info (uniforms, consts, etc.)

void shaderFn(const chunk in_chunks,
              chunk out_chunks,
              const env env)
  • Dispatched shader threads run to completion
  • Completion implies:
  • in_chunks are released
  • out_chunks are committed
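
The step from a per-element shader to a chunk-level entry point can be illustrated with a hand-written wrapper. This is a sketch of what generated code might look like, not the actual compiler output: `Element`, `Chunk`, `Env`, `scale_shader` and `shader_fn` are hypothetical, and the 16-element chunk size is borrowed from the vertex queue example.

```c
#include <assert.h>

enum { CHUNK_ELEMENTS = 16 };  /* assumption: 16 elements per chunk */

typedef struct { float x, y, z, w; } Element;
typedef struct { Element el[CHUNK_ELEMENTS]; } Chunk;
typedef struct { float scale; } Env;  /* stand-in for uniforms/constants */

/* The shader as the programmer writes it: one element in, one element out. */
static void scale_shader(const Element *in, Element *out, const Env *env) {
    out->x = in->x * env->scale;
    out->y = in->y * env->scale;
    out->z = in->z * env->scale;
    out->w = in->w * env->scale;
}

/* Chunk-level entry point: apply the per-element shader to every element. */
void shader_fn(const Chunk *in_chunk, Chunk *out_chunk, const Env *env) {
    for (int i = 0; i < CHUNK_ELEMENTS; ++i)
        scale_shader(&in_chunk->el[i], &out_chunk->el[i], env);
}
```

Since the per-element function keeps no state between calls, the loop iterations are independent, so a compiler could equally map them onto SIMD lanes.
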

26
Assemble threads
  • Assemble threads build chunks
  • Access queue data via windows
  • Commit/reserve/consume may block thread

void assembleFn(qwin in_win,
                qwin out_win,
                const env env)
27
Ex: primitive assembly
  • Input chunks: 16 verts
  • Output chunks: 16 prims
  • Prim structure depends on type of prim
  • Points, lines, triangles, triangles w/ adjacency,
    etc.
  • Creating prims from verts depends on topology
  • Strips or lists
  • Triangle strip data for an output chunk comes
    from multiple input chunks

prePrimAssemble queue → primAssemble → prePrimShade queue
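
The reason strip assembly needs a stateful thread can be shown with a small C sketch: the assembler must remember the last two vertices so a triangle can span two input chunks. This is an illustration under assumptions, not the deck's implementation. It works on vertex ids rather than vertex data, ignores the 16-prim output chunk limit, and the names `StripState`, `Tri` and `assemble_strip` are hypothetical.

```c
#include <assert.h>

typedef struct { int v[3]; } Tri;

typedef struct {
    int prev[2];  /* last two vertices seen, carried across input chunks */
    int seen;     /* vertices consumed so far in the current strip       */
} StripState;

/* Consume one input chunk of vertex ids and append the triangles it
 * completes to out. Returns the number of triangles written. */
int assemble_strip(StripState *s, const int *verts, int num_verts, Tri *out) {
    int emitted = 0;
    for (int i = 0; i < num_verts; ++i) {
        if (s->seen >= 2) {
            Tri t;
            /* Every other triangle swaps its first two vertices so that
             * all triangles keep the same winding order. */
            if ((s->seen & 1) == 0) {
                t.v[0] = s->prev[0];
                t.v[1] = s->prev[1];
            } else {
                t.v[0] = s->prev[1];
                t.v[1] = s->prev[0];
            }
            t.v[2] = verts[i];
            out[emitted++] = t;
        }
        s->prev[0] = s->prev[1];
        s->prev[1] = verts[i];
        s->seen++;
    }
    return emitted;
}
```

Calling it once per input chunk with the same `StripState` lets the first triangle of a new chunk reuse the two vertices left over from the previous one, which is exactly the inter-element dependency a stateless shader thread cannot express.
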
28
Ex: frag assembly (rast)
for (each input triangle)
    add triangle uniform data to chunk
    while (chunk not full && triangle not done)
        rasterize next tile of quads
        for (each nonempty quad)
            add 4 fragments to chunk
            add quad description per chunk
        if (chunk is full)
            qwin_out = commit(qwin_out, 1)
            grow window with reserve() if necessary

Building chunks:
1. Compact valid quads
2. Data at various frequencies
29
Execution: Tier 1
[Diagram: queues connected to shader threadEnvs and
assemble threadEnvs, with a Tier 1 to Tier 0 FIFO]
  • Labels on the FIFO: ShaderThr dispatch,
    AssembleThr resume
  • Thread events: Thread_Done() (implicit commit),
    Produce(), Reserve(), Commit()
30
Execution: Tier 0
  • Each cycle: round robin over runnable threads
  • Thread stalls: place on wait list
  • When a thread completes:
  • Pull next thread from FIFO, assign to empty
    thread slot
  • Send completion message to tier 0
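
The round-robin selection described above can be sketched in a few lines of C. This is a minimal illustration, not the simulator's scheduler: `Core`, `SlotState` and `tier0_pick` are hypothetical names, the core has a fixed `T` of 4 thread contexts, and the FIFO refill and completion message are left out.

```c
#include <assert.h>

enum { T = 4 };  /* hypothetical number of thread contexts per core */

typedef enum { SLOT_EMPTY, SLOT_RUNNABLE, SLOT_WAITING } SlotState;

typedef struct {
    SlotState state[T];
    int last;  /* slot that issued most recently */
} Core;

/* Pick the next runnable slot after the one that issued last.
 * Returns the slot index, or -1 if no thread can run this cycle. */
int tier0_pick(Core *c) {
    for (int step = 1; step <= T; ++step) {
        int slot = (c->last + step) % T;
        if (c->state[slot] == SLOT_RUNNABLE) {
            c->last = slot;
            return slot;
        }
    }
    return -1;
}
```

Stalled threads sit in `SLOT_WAITING` and are skipped until something marks them runnable again. An `SLOT_EMPTY` slot is where a thread pulled from the Tier 1 to Tier 0 FIFO would be placed.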

[Diagram: the Tier 1 to Tier 0 FIFO feeds the Tier 0
Scheduler of a core with an L1 data cache (or scratchpad),
thread contexts Thread 0 through Thread T-1 with R
registers, and ALU 0 through ALU S-1]
31
Validation
  • Fat enough cores for assemble threads can
    deliver sufficient FLOPS
  • Assemble threads can keep compute cores
    fixed-function units busy
  • Can give up domain-specific heuristics in the
    scheduling