Processor Opportunities - PowerPoint PPT Presentation

About This Presentation

Title:

Processor Opportunities

Description:

Processor Opportunities Jeremy Sugerman Kayvon Fatahalian – PowerPoint PPT presentation

Slides: 34

Provided by: Kayvo

Learn more at: http://graphics.stanford.edu

1
Processor Opportunities

Jeremy Sugerman
Kayvon Fatahalian

2
Outline

Background
Introduction and Overview
Phase One / The First Paper

3
Background

Evolved from the GPU, Cell, and x86 ray tracing
work with Tim (Foley).
Grew out of the FLASHG talk Jeremy gave in
February 2006 and Kayvon's experiences with
Sequoia.
Daniel, Mike, and Jeremy pursued related
short-term manifestations in our I3D 2007 paper.

4
GPU K-D Tree Ray Tracing

k-D construction really hard. Especially lazy.
Ray k-D Tree intersection painful
Entirely data dependent control flow and access
patterns
With SSE packets, extremely CPU efficient
Local shading runs great on GPUs, though
Highly coherent data access (textures)
Highly coherent execution (materials)
Insult to injury: rasterization dominates tracing
eye rays

5
Fixing the GPU

Ray segments are a lot like fragments
Add frame buffer (x,y) and weight
Otherwise independent, but highly coherent
But rays can generate more rays
What if:
Fragments could create fragments?
Shading and intersecting fragments could both
be runnable at once?
But:
SIMD still inefficient and (lazy) k-D build still
hard
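A minimal sketch of "fragments could create fragments", assuming a ray segment is just a work item tagged with its frame buffer (x, y) and a weight. The `shade` function, the `depth` field, and the 0.5 factors are hypothetical placeholders, not anything from the talk.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class RaySegment:
    x: int          # frame buffer column this ray contributes to
    y: int          # frame buffer row
    weight: float   # share of the pixel's colour carried by this ray
    depth: int = 0  # bounce count (hypothetical, only used to stop recursion)


def shade(ray, max_depth=2):
    """Hypothetical shader: returns (colour contribution, spawned rays)."""
    contribution = 0.5 * ray.weight  # stand-in for real shading
    children = []
    if ray.depth < max_depth:
        # A "fragment" creating a "fragment": one secondary ray, same pixel.
        children.append(RaySegment(ray.x, ray.y, ray.weight * 0.5, ray.depth + 1))
    return contribution, children


def render(width, height):
    framebuffer = [[0.0] * width for _ in range(height)]
    # Start with one eye ray per pixel.
    work = deque(RaySegment(x, y, 1.0) for y in range(height) for x in range(width))
    while work:
        ray = work.popleft()
        colour, children = shade(ray)
        framebuffer[ray.y][ray.x] += colour  # weighted accumulate at (x, y)
        work.extend(children)                # new work enters the same pool
    return framebuffer
```

Because every segment carries its own (x, y) and weight, segments stay independent of each other and can be processed in any order.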

6
Moving to the Present

7
Applications are FLOP Hungry

Important workloads want lots of FLOPS
Video processing, Rendering
Physics / Game Physics
Computational Biology, Finance
Even OS compositing managers!
All can soak vastly more compute than current
CPUs deliver
All can utilize thread or data parallelism.

8
Compute-Maximizing Processors

Or throughput oriented
Packed with ALUs / FPUs
Trade single-thread ILP for higher level
parallelism
Offer an order of magnitude potential performance
boost
Available in many flavours: SIMD, massively
threaded, hordes of tiny cores, ...

9
Compute-Maximizing Processors

Generally offered as off-board accelerators
Performance is only achieved when utilization
stays high. Which is hard.
Mapping / porting algorithms is a labour
intensive and complex effort.
This is intrinsic. Within any given area / power
/ transistor budget, an order of magnitude
advantage over CPU performance comes at a cost
If it didn't, the CPU designers would steal it.

10
Real Applications are Complicated

Complete applications have aspects both well
suited to and pathological for compute-maximizing
processors.
Often co-mingled.
Porting is often primarily disentangling into
large enough chunks to be worth offloading.
Difficulty in partitioning and cost of transfer
disqualifies some likely seeming applications.

11
Enter Multi-Core

Single threaded CPU scaling is very hard.
Multi-core and multi-threaded cores are already
mainstream
2- and 4-way x86es, 9-way Cell, 16-way GPU
Multi-core allows heterogeneous cores per chip
Qualitatively easier acceptance than multiple
single core packages.
Qualitatively better than an accelerator model

12
Heterogeneous Multi-Core

Balance the mix of conventional and compute cores
based upon target market.
Area / Power budget can be maximized for e.g.
Consumer / Laptop versus Server
Always worth having at least one throughput core
per chip.
Order of magnitude advantage when it works
Video processing and window system effects
A small compute core is not a huge trade off.

13
Heterogeneous Multi-Core

Three significant advantages
(Obvious) Inter-core communication and
coordination become lighter weight.
(Subtle) Compute-maximizing cores become
ubiquitous CPU elements and thus create a unified
architectural model predicated on their
availability. Not just a CPU plus accelerator!
The CPU-Compute interconnect and software
interface have a single owner and can thus be
extended in key ways.

14
Changing the Rules

AMD (+ ATI) already rumbling about Fusion
Just gluing a CPU to a GPU misses out, though.
(Still CPU + Accelerator, with a fat pipe)
A few changes break the most onerous flexibility
limitations AND ease the CPU-Compute
communication and scheduling model.
Without being impractical (i.e. dropping down to
CPU level performance)

15
Changing the Rules

Work queues / Write buffers as first class items
Simple, but useful building block already
pervasive for coordination / scheduling in
parallel apps.
Plus: unified address space, simple
sync/atomicity, ...

16
Queue / Buffer Details

Conventional or Compute threads can enqueue for
queues associated with any core.
Dequeue / Dispatch mechanisms vary by core
HW Dispatched for a GPU-like compute core
TBD (Likely SW) for thin multi-threaded cores
SW Dispatched on CPU cores
Queues can be entirely application defined or
reflect hardware resource needs of entries.
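A toy model of these queue details, as a sketch only. The `Machine` class and its method names are hypothetical. "hw" dispatch is modelled as a step that drains queues without the application asking, and "sw" dispatch as an explicit poll by code on the owning core.

```python
from collections import deque


class Machine:
    """Toy model: named queues, each tied to a core and a dispatch style."""

    def __init__(self):
        self.queues = {}   # name -> (core, dispatch, deque of items)
        self.ran = []      # (core, item) pairs, in execution order

    def add_queue(self, name, core, dispatch):
        if dispatch not in ("hw", "sw"):
            raise ValueError("dispatch must be 'hw' or 'sw'")
        self.queues[name] = (core, dispatch, deque())

    def enqueue(self, name, item):
        # Any thread, conventional or compute, may target any core's queue.
        self.queues[name][2].append(item)

    def hw_step(self):
        # Hardware dispatch: queues are drained without the application asking.
        for core, dispatch, items in self.queues.values():
            while dispatch == "hw" and items:
                self.ran.append((core, items.popleft()))

    def sw_dequeue(self, name):
        # Software dispatch: code on the owning core polls explicitly.
        core, dispatch, items = self.queues[name]
        if dispatch != "sw":
            raise RuntimeError(f"queue {name!r} is hardware dispatched")
        item = items.popleft() if items else None
        if item is not None:
            self.ran.append((core, item))
        return item
```

The enqueue side is identical for every queue. Only the dequeue side differs by core type, which is the split the slide describes.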

17
CPU+GPU hybrid

18
What should change?

Accelerator model of computing
Today: work created by CPU, in batches
Batch processing not a prerequisite for efficient
coherent execution
Paper 1: GPU threads create new GPU threads
(fragments generate fragments)

19
What should change?

GPU threads to create new GPU threads
GPU threads to create new CPU work (paper 2)
Efficiently run data parallel algorithms on a GPU
where per-element processing
Goes through an unpredictable number of stages
Spends an unpredictable amount of time in each stage
May dynamically create new data elements
Processing is still coherent, but unpredictably
so
(have to dynamically find coherence to run fast)
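One way to picture "dynamically find coherence" is to bin elements by the stage they need next and always run a full-width batch from one bin. The sketch below is an illustration under assumptions, not the authors' scheduler. The Collatz rule stands in for a kernel with an unpredictable number of stages, and `SIMD_WIDTH` and the fullest-bin policy are arbitrary choices.

```python
from collections import defaultdict

SIMD_WIDTH = 4  # assumed batch width


def stage_of(v):
    # The stage an element needs next depends only on its current value.
    return "halve" if v % 2 == 0 else "triple"


def run(values):
    """Process every value down to 1; return (batches, lane utilization)."""
    queues = defaultdict(list)
    for v in values:
        if v != 1:
            queues[stage_of(v)].append(v)
    batches = 0
    lanes_used = 0
    while any(queues.values()):
        # Pick the fullest bin: everything in it runs the same code.
        stage = max(queues, key=lambda s: len(queues[s]))
        batch = queues[stage][:SIMD_WIDTH]
        queues[stage] = queues[stage][SIMD_WIDTH:]
        batches += 1
        lanes_used += len(batch)
        for v in batch:
            v = v // 2 if stage == "halve" else 3 * v + 1
            if v != 1:
                queues[stage_of(v)].append(v)  # re-bin for its next stage
    utilization = lanes_used / (batches * SIMD_WIDTH) if batches else 1.0
    return batches, utilization
```

No element knows in advance how many stages it will pass through, yet each batch is fully coherent because it is drawn from a single bin.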

20
Queues

Model GPU as collection of work queues
Applications consist of many small tasks
Task is either running or in a queue
Software enqueue creates a new task
Hardware decides when to dequeue and start
running task
All the work in a queue is in a similar stage

21
Queues

GPUs today have similar queuing mechanisms
They are implicit/fixed function (invisible)

22
GPU as a giant scheduler

[Diagram of the graphics pipeline. Labels: cmd buffer, on-chip queues, data buffer, MC, off-chip buffers (data), stream out.]
Stages, with the amplification label shown next to each:
IA
VS: 1-to-1
GS: 1-to-N (bounded)
RS: 1-to-N (unbounded)
PS: 1-to-(0 or X) (X static)
OM

23
GPU as a giant scheduler

[Diagram of the GPU redrawn around a hardware scheduler. Labels: hardware scheduler, thread scoreboard, processing cores, VS/GS/PS, IA, RS, OM, MC, off-chip buffers (data), (read-modify-write).]
On-chip queues: command queue, vertex queue, geometry queue, fragment queue, memory queues

24
GPU as a giant scheduler

Rasterizer (+ input cmd processor) is a
domain-specific work scheduler
Millions of work items/frame
On-chip queues of work
Thousands of HW threads active at once
CPU threads (via DirectX commands), GS programs,
fixed function logic generate work
Pipeline describes dependencies
What is the work here?
Vertices
Geometric primitives
Fragments

Well-defined resource requirements for each
category.

25
GPU Delta

Allow application to define queues
Just like other GPU state management
No longer hard-wired into chip
Make enqueue visible to software
Make it a shader instruction
Preserve shader execution
Wide SIMD execution
Stackless lightweight threads
Isolation
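A sketch of what "make enqueue a shader instruction" could look like from the kernel's side. All names here are hypothetical. The lanes of a batch are modelled by a plain loop, and the kernel sees only its own item plus an `enqueue` callable, which approximates the isolation and stackless properties listed above.

```python
SIMD_WIDTH = 4  # assumed batch width


def run_kernel_simd(kernel, batch, queues):
    """Run one kernel over a batch of at most SIMD_WIDTH items.

    `kernel(item, enqueue)` runs to completion per lane and keeps no state
    between invocations; `enqueue(name, item)` plays the role of the
    proposed shader instruction.
    """
    if len(batch) > SIMD_WIDTH:
        raise ValueError("batch wider than the SIMD unit")
    pending = []

    def enqueue(queue_name, item):
        pending.append((queue_name, item))

    for item in batch:          # lanes, modelled sequentially
        kernel(item, enqueue)   # a lane may enqueue zero or more items

    # New work becomes visible only after the whole batch has finished.
    for name, item in pending:
        queues.setdefault(name, []).append(item)
```

Usage, with a hypothetical kernel in which only some lanes create work:

```python
def kernel(item, enqueue):
    if item % 2:
        enqueue("out", item * 2)

queues = {}
run_kernel_simd(kernel, [1, 2, 3, 4], queues)
```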

26
Research Challenges

Make create queue / enqueue operations feasible in
HW
Constrained global operations
Key challenge: scheduling work in all the queues
without domain-specific knowledge
Keep queue lengths small to fit on chip
What is a good scheduling algorithm?
Define metrics
What information does scheduler need?

27
Role of queues

Recall GPU has queues for commands, vertices,
fragments, etc.
Well-defined processing/resource requirements
associated with queues
Now: software associates properties with queues
during queue instantiation
A.k.a. queues are typed

28
Role of queues

Associate execution properties with queues during
queue instantiation
Simple: 1 kernel per queue
Tasks using no more than X regs
Tasks that do not perform gathers
Tasks that do not create new tasks
Future: tasks to execute on CPU
Notice: COHERENCE HINTS!
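A sketch of "typed" queues under the properties listed above. `QueueType`, `Task`, and `TypedQueue` are hypothetical names. The properties are fixed when the queue is created and checked on every enqueue, so a scheduler could rely on them as coherence hints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueType:
    kernel: str                  # 1 kernel per queue
    max_regs: int                # tasks use no more than this many registers
    allows_gather: bool = False  # may tasks perform gathers?
    allows_spawn: bool = False   # may tasks create new tasks?


@dataclass(frozen=True)
class Task:
    kernel: str
    regs: int
    gathers: bool = False
    spawns: bool = False


class TypedQueue:
    def __init__(self, qtype):
        self.qtype = qtype  # fixed at instantiation
        self.tasks = []

    def enqueue(self, task):
        t = self.qtype
        problems = []
        if task.kernel != t.kernel:
            problems.append("wrong kernel")
        if task.regs > t.max_regs:
            problems.append("too many registers")
        if task.gathers and not t.allows_gather:
            problems.append("gathers not allowed")
        if task.spawns and not t.allows_spawn:
            problems.append("task creation not allowed")
        if problems:
            raise TypeError(", ".join(problems))
        self.tasks.append(task)
```

Everything that reaches `tasks` runs the same kernel within the same resource envelope, so any batch drawn from one queue is coherent by construction.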

29
Role of queues

Denote coherence groupings (where HW finds
coherent work)
Describe dependencies connecting kernels
Enqueue: async. add new work into system
Enqueue & terminate:
Point where coherence groupings change
Point where resource/environment changes

30
Design space

Queue setup commands / enqueue instructions
Scheduling algorithm (what are inputs?)
What properties associated with queues
Ordering guarantees
Determinism
Failure handling (kill or spill when queues
full?)
Inter-task synch (or maintain isolation)
Resource cleanup
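The "kill or spill" question can be made concrete with a bounded queue. This is a sketch with hypothetical names. `on_chip` has a fixed capacity, `spilled` stands in for off-chip memory, and the policy is chosen when the queue is created.

```python
from collections import deque


class BoundedQueue:
    def __init__(self, capacity, on_full="spill"):
        if on_full not in ("kill", "spill"):
            raise ValueError("on_full must be 'kill' or 'spill'")
        self.capacity = capacity
        self.on_full = on_full
        self.on_chip = deque()
        self.spilled = deque()  # stands in for off-chip memory
        self.killed = 0

    def enqueue(self, task):
        """Return True if the task was accepted, False if it was killed."""
        # Once anything has spilled, later arrivals spill too, to keep FIFO order.
        if len(self.on_chip) < self.capacity and not self.spilled:
            self.on_chip.append(task)
            return True
        if self.on_full == "spill":
            self.spilled.append(task)
            return True
        self.killed += 1
        return False

    def dequeue(self):
        # Raises IndexError if the queue is empty.
        task = self.on_chip.popleft()
        if self.spilled:
            # Refill the freed on-chip slot from the spill area.
            self.on_chip.append(self.spilled.popleft())
        return task
```

Spilling keeps every task at the cost of off-chip traffic. Killing keeps the queue on chip but pushes recovery onto the application, which ties into the ordering and determinism items in the same list.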

31
Implementation

GPU shader interpreter (SM4 extensions)
"Hello world": run CPU thread + GPU threads
GPU threads create other threads
Identify GPU ISA additions
GPU raytracer formulation
May investigate DX10 geometry shader
Establish what information scheduler needs
Compare scheduling strategies

32
Alternatives

Multi-pass rendering
Compare scheduling resources
Compare bandwidth savings
On chip state / performance tradeoff
Large monolithic kernel (branching)
CUDA/CTM
Multi-core x86

33
Three interesting fronts

Paper 1: GPU micro-architecture
GPU work creating new GPU work
Software defined queues
Generalization of DirectX 10 GS?
GPU resource management
Ability to correctly manage/virtualize GPU
resources
CPU/compute-maximized integration
Compute cores? GPU/Niagara/Larrabee
compute cores as first-class execution
environments (dump the accelerator model)
Unified view of work throughout machine
Any core creates work for other cores