Processor Opportunities - PowerPoint PPT Presentation

Processor Opportunities Jeremy Sugerman Kayvon Fatahalian – PowerPoint PPT presentation

Slides: 34

Transcript and Presenter's Notes

Title: Processor Opportunities


1
Processor Opportunities
  • Jeremy Sugerman
  • Kayvon Fatahalian

2
Outline
  • Background
  • Introduction and Overview
  • Phase One / The First Paper

3
Background
  • Evolved from the GPU, Cell, and x86 ray tracing
    work with Tim (Foley).
  • Grew out of the FLASHG talk Jeremy gave in
    February 2006 and Kayvon's experiences with
    Sequoia.
  • Daniel, Mike, and Jeremy pursued related
    short-term manifestations in our I3D 2007 paper.

4
GPU K-D Tree Ray Tracing
  • k-D tree construction is really hard, especially
    lazy construction.
  • Ray / k-D tree intersection is painful
  • Entirely data dependent control flow and access
    patterns
  • With SSE packets, extremely CPU efficient
  • Local shading runs great on GPUs, though
  • Highly coherent data access (textures)
  • Highly coherent execution (materials)
  • Insult to injury: rasterization dominates tracing
    eye rays

5
Fixing the GPU
  • Ray segments are a lot like fragments
  • Add frame buffer (x,y) and weight
  • Otherwise independent, but highly coherent
  • But rays can generate more rays
  • What if:
  • Fragments could create fragments?
  • Shading and intersecting fragments could both
    be runnable at once?
  • But:
  • SIMD is still inefficient and (lazy) k-D build is
    still hard

6
Moving to the Present

7
Applications are FLOP Hungry
  • Important workloads want lots of FLOPS
  • Video processing, Rendering
  • Physics / Game Physics
  • Computational Biology, Finance
  • Even OS compositing managers!
  • All can soak vastly more compute than current
    CPUs deliver
  • All can utilize thread or data parallelism.

8
Compute-Maximizing Processors
  • Or "throughput-oriented"
  • Packed with ALUs / FPUs
  • Trade single-thread ILP for higher-level
    parallelism
  • Offer an order of magnitude potential performance
    boost
  • Available in many flavours: SIMD, massively
    threaded, hordes of tiny cores, ...

9
Compute-Maximizing Processors
  • Generally offered as off-board accelerators
  • Performance is only achieved when utilization
    stays high. Which is hard.
  • Mapping / porting algorithms is a
    labour-intensive and complex effort.
  • This is intrinsic. Within any given area / power
    / transistor budget, an order of magnitude
    advantage over CPU performance comes at a cost.
  • If it didn't, the CPU designers would steal it.

10
Real Applications are Complicated
  • Complete applications have aspects both well
    suited to and pathological for compute-maximizing
    processors.
  • Often co-mingled.
  • Porting is often primarily disentangling into
    large enough chunks to be worth offloading.
  • Difficulty in partitioning and cost of transfer
    disqualify some likely-seeming applications.

11
Enter Multi-Core
  • Single-threaded CPU scaling is very hard.
  • Multi-core and multi-threaded cores are already
    mainstream
  • 2-, 4-way x86es, 9-way Cell, 16-way GPU
  • Multi-core allows heterogeneous cores per chip
  • Qualitatively easier acceptance than multiple
    single-core packages.
  • Qualitatively better than an accelerator model

12
Heterogeneous Multi-Core
  • Balance the mix of conventional and compute cores
    based upon target market.
  • Area / Power budget can be maximized for e.g.
    Consumer / Laptop versus Server
  • Always worth having at least one throughput core
    per chip.
  • Order of magnitude advantage when it works
  • Video processing and window system effects
  • A small compute core is not a huge trade off.

13
Heterogeneous Multi-Core
  • Three significant advantages
  • (Obvious) Inter-core communication and
    coordination become lighter weight.
  • (Subtle) Compute-maximizing cores become
    ubiquitous CPU elements and thus create a unified
    architectural model predicated on their
    availability. Not just a CPU plus accelerator!
  • The CPU-Compute interconnect and software
    interface have a single owner and can thus be
    extended in key ways.

14
Changing the Rules
  • AMD (+ ATI) already rumbling about Fusion
  • Just gluing a CPU to a GPU misses out, though.
  • (Still CPU + Accelerator, with a fat pipe)
  • A few changes break the most onerous flexibility
    limitations AND ease the CPU-Compute
    communication and scheduling model.
  • Without being impractical (i.e. dropping down to
    CPU level performance)

15
Changing the Rules
  • Work queues / Write buffers as first class items
  • Simple, but useful building block already
    pervasive for coordination / scheduling in
    parallel apps.
  • Plus: unified address space, simple
    sync/atomicity, ...

16
Queue / Buffer Details
  • Conventional or Compute threads can enqueue for
    queues associated with any core.
  • Dequeue / Dispatch mechanisms vary by core
  • HW Dispatched for a GPU-like compute core
  • TBD (Likely SW) for thin multi-threaded cores
  • SW Dispatched on CPU cores
  • Queues can be entirely application defined or
    reflect hardware resource needs of entries.
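
A minimal sketch of the queue arrangement on this slide, assuming one queue per core and a per-queue tag saying who dispatches from it. The class name `CoreQueue`, the core names, and the task tuples are all hypothetical; the slides do not define an interface.

```python
from collections import deque


class CoreQueue:
    """Work queue attached to one core (hypothetical model, not a real API)."""

    def __init__(self, core, dispatch):
        self.core = core
        self.dispatch = dispatch  # "hw" or "sw": who pulls entries off the queue
        self.entries = deque()

    def enqueue(self, entry):
        # Any thread, conventional or compute, may call this.
        self.entries.append(entry)

    def dequeue(self):
        # Returns the oldest entry, or None if the queue is empty.
        return self.entries.popleft() if self.entries else None


# One queue per core; the dispatch mechanism differs by core type.
queues = {
    "cpu0": CoreQueue("cpu0", dispatch="sw"),
    "gpu0": CoreQueue("gpu0", dispatch="hw"),
}

# A CPU thread creates work for the GPU-like core...
queues["gpu0"].enqueue(("shade", 42))
# ...and a compute thread creates work for a CPU core.
queues["cpu0"].enqueue(("build_kd_node", 7))
```

The point of the sketch is only that enqueue is uniform across cores while dequeue / dispatch is a per-core property.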

17
CPU+GPU hybrid

18
What should change?
  • Accelerator model of computing
  • Today: work created by CPU, in batches
  • Batch processing not a prerequisite for efficient
    coherent execution
  • Paper 1: GPU threads create new GPU threads
  • (fragments generate fragments)

19
What should change?
  • GPU threads to create new GPU threads
  • GPU threads to create new CPU work (paper 2)
  • Efficiently run data parallel algorithms on a GPU
    where per-element processing:
  • Goes through an unpredictable number of stages
  • Spends an unpredictable amount of time in each
    stage
  • May dynamically create new data elements
  • Processing is still coherent, but unpredictably
    so
  • (have to dynamically find coherence to run fast)
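
To make "dynamically find coherence" concrete, here is a toy cost model, assuming a SIMD batch that mixes k distinct stages costs k passes (each stage runs with the other lanes masked off). The function names and the cost model are illustrative assumptions, not measurements from the talk.

```python
from collections import Counter
from math import ceil


def passes_in_arrival_order(stages, width):
    """SIMD passes when batches are cut from the stream as it arrives.

    A batch that mixes k distinct stages is assumed to cost k passes.
    """
    return sum(len(set(stages[i:i + width]))
               for i in range(0, len(stages), width))


def passes_grouped_by_stage(stages, width):
    """SIMD passes when elements are first regrouped by stage."""
    return sum(ceil(n / width) for n in Counter(stages).values())


stream = ["intersect", "shade"] * 4  # eight interleaved elements
mixed = passes_in_arrival_order(stream, width=4)    # 4 passes
grouped = passes_grouped_by_stage(stream, width=4)  # 2 passes
```

Under this model the same eight elements take half as many passes once they are regrouped by stage, which is the effect the queues are meant to provide.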

20
Queues
  • Model GPU as collection of work queues
  • Applications consist of many small tasks
  • Task is either running or in a queue
  • Software enqueue creates a new task
  • Hardware decides when to dequeue and start
    running a task
  • All the work in a queue is in a similar stage
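
The model on this slide can be sketched as a small simulation. Everything named here is an assumption for illustration: `QueueMachine`, the batch width of 4, the "run the fullest queue" policy, and the toy ray example in which `intersect` tasks enqueue `shade` tasks and `shade` tasks may enqueue one secondary ray.

```python
from collections import deque

SIMD_WIDTH = 4   # hypothetical batch size
MAX_DEPTH = 1    # hypothetical bounce limit for the toy ray example


class QueueMachine:
    """GPU modelled as named work queues, one kernel per queue."""

    def __init__(self, kernels):
        self.kernels = kernels
        self.queues = {name: deque() for name in kernels}
        self.batches = []  # log of (queue name, batch size)

    def enqueue(self, name, payload):
        # Software-visible: running tasks call this to create new tasks.
        self.queues[name].append(payload)

    def run(self):
        # Stand-in for the hardware scheduler: drain until no work remains.
        while any(self.queues.values()):
            # Assumed policy: run a batch from the fullest queue.
            name = max(self.queues, key=lambda n: len(self.queues[n]))
            queue = self.queues[name]
            batch = [queue.popleft()
                     for _ in range(min(SIMD_WIDTH, len(queue)))]
            self.batches.append((name, len(batch)))
            for payload in batch:
                self.kernels[name](self, payload)


def intersect(machine, depth):
    machine.enqueue("shade", depth)  # every ray hits something in this toy


def shade(machine, depth):
    if depth < MAX_DEPTH:
        machine.enqueue("intersect", depth + 1)  # one secondary ray


machine = QueueMachine({"intersect": intersect, "shade": shade})
for _ in range(8):
    machine.enqueue("intersect", 0)
machine.run()
```

With eight initial rays every batch in `machine.batches` comes from a single queue and is full width, even though tasks create tasks as they run.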

21
Queues
  • GPUs today have similar queuing mechanisms
  • They are implicit/fixed function (invisible)

22
GPU as a giant scheduler
[Diagram: the GPU pipeline drawn as stages connected by
queues. Labels: cmd buffer, on-chip queues, data buffers,
off-chip buffers (data), MC, IA, VS (1-to-1),
GS (1-to-N, bounded), stream out, RS (1-to-N, unbounded),
PS (1-to-(0 or X), X static), OM.]

23
GPU as a giant scheduler
[Diagram: the same GPU redrawn around a hardware scheduler.
Labels: hardware scheduler, VS/GS/PS, IA, RS, thread
scoreboard, processing cores, MC, OM, off-chip buffers
(data), on-chip queues (read-modify-write): command queue,
vertex queue, geometry queue, fragment queue, memory
queues.]

24
GPU as a giant scheduler
  • Rasterizer (+ input cmd processor) is a
    domain-specific work scheduler
  • Millions of work items/frame
  • On-chip queues of work
  • Thousands of HW threads active at once
  • CPU threads (via DirectX commands), GS programs,
    fixed function logic generate work
  • Pipeline describes dependencies
  • What is the work here?
  • Vertices
  • Geometric primitives
  • Fragments

Well-defined resource requirements for each
category.

25
GPU Delta
  • Allow application to define queues
  • Just like other GPU state management
  • No longer hard-wired into chip
  • Make enqueue visible to software
  • Make it a shader instruction
  • Preserve shader execution
  • Wide SIMD execution
  • Stackless lightweight threads
  • Isolation

26
Research Challenges
  • Make create queue / enqueue operations feasible
    in HW
  • Constrained global operations
  • Key challenge: scheduling work in all the queues
    without domain-specific knowledge
  • Keep queue lengths small to fit on chip
  • What is a good scheduling algorithm?
  • Define metrics
  • What information does scheduler need?

27
Role of queues
  • Recall GPU has queues for commands, vertices,
    fragments, etc.
  • Well-defined processing/resource requirements
    associated with queues
  • Now: software associates properties with queues
    during queue instantiation
  • A.k.a. queues are typed

28
Role of queues
  • Associate execution properties with queues during
    queue instantiation
  • Simple: 1 kernel per queue
  • Tasks using no more than X regs
  • Tasks that do not perform gathers
  • Tasks that do not create new tasks
  • Future: tasks to execute on CPU
  • Notice: COHERENCE HINTS!
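
A sketch of what "typed" queues could look like, assuming the properties are fixed at instantiation and checked on enqueue. The names `QueueProperties` and `TypedQueue`, the field names, and the register limit of 16 are hypothetical; the slides only list the kinds of property.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class QueueProperties:
    kernel: str                        # the one kernel this queue runs
    max_registers: int                 # tasks use no more than this many regs
    allows_gather: bool = True         # may tasks perform gathers?
    allows_task_creation: bool = True  # may tasks create new tasks?
    target: str = "gpu"                # "cpu" would cover the future case


@dataclass
class TypedQueue:
    props: QueueProperties
    tasks: list = field(default_factory=list)

    def enqueue(self, kernel, registers, gathers=False, creates_tasks=False):
        """Accept a task only if it respects the queue's declared type."""
        p = self.props
        if kernel != p.kernel:
            raise ValueError("queue runs a different kernel")
        if registers > p.max_registers:
            raise ValueError("task needs too many registers")
        if gathers and not p.allows_gather:
            raise ValueError("queue forbids gathers")
        if creates_tasks and not p.allows_task_creation:
            raise ValueError("queue forbids task creation")
        self.tasks.append((kernel, registers))


shade_queue = TypedQueue(QueueProperties(kernel="shade", max_registers=16,
                                         allows_task_creation=False))
shade_queue.enqueue("shade", registers=12)
```

Because every task in `shade_queue` runs the same kernel within the same resource envelope, a scheduler can treat the whole queue as one coherence grouping.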

29
Role of queues
  • Denote coherence groupings (where HW finds
    coherent work)
  • Describe dependencies connecting kernels
  • Enqueue: async. add new work into system
  • Enqueue terminate
  • Point where coherence groupings change
  • Point where resource/environment changes

30
Design space
  • Queue setup commands / enqueue instructions
  • Scheduling algorithm (what are inputs?)
  • What properties are associated with queues
  • Ordering guarantees
  • Determinism
  • Failure handling (kill or spill when queues
    full?)
  • Inter-task synch (or maintain isolation)
  • Resource cleanup
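
The "kill or spill" choice can be sketched with a bounded queue, assuming a fixed on-chip capacity and a second list standing in for off-chip memory. `BoundedQueue` and its `on_full` parameter are hypothetical names.

```python
from collections import deque


class BoundedQueue:
    """Fixed-capacity queue with a configurable overflow policy."""

    def __init__(self, capacity, on_full):
        if on_full not in ("kill", "spill"):
            raise ValueError("on_full must be 'kill' or 'spill'")
        self.capacity = capacity
        self.on_full = on_full
        self.on_chip = deque()
        self.spilled = deque()  # stands in for off-chip memory
        self.killed = 0         # tasks dropped under the "kill" policy

    def enqueue(self, task):
        if len(self.on_chip) < self.capacity:
            self.on_chip.append(task)
        elif self.on_full == "spill":
            self.spilled.append(task)
        else:
            self.killed += 1

    def dequeue(self):
        # Raises IndexError if the queue is empty.
        task = self.on_chip.popleft()
        if self.spilled:
            # Refill from spilled work so FIFO order is preserved.
            self.on_chip.append(self.spilled.popleft())
        return task


spill_q = BoundedQueue(capacity=2, on_full="spill")
kill_q = BoundedQueue(capacity=2, on_full="kill")
for t in range(4):
    spill_q.enqueue(t)
    kill_q.enqueue(t)
```

Spilling keeps all work at the cost of off-chip traffic; killing keeps the queue on chip but loses tasks, so the application would need some way to observe and recover from the drop.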

31
Implementation
  • GPU shader interpreter (SM4 extensions)
  • Hello world: run CPU thread + GPU threads
  • GPU threads create other threads
  • Identify GPU ISA additions
  • GPU raytracer formulation
  • May investigate DX10 geometry shader
  • Establish what information scheduler needs
  • Compare scheduling strategies

32
Alternatives
  • Multi-pass rendering
  • Compare scheduling resources
  • Compare bandwidth savings
  • On chip state / performance tradeoff
  • Large monolithic kernel (branching)
  • CUDA/CTM
  • Multi-core x86

33
Three interesting fronts
  • Paper 1: GPU micro-architecture
  • GPU work creating new GPU work
  • Software defined queues
  • Generalization of DirectX 10 GS?
  • GPU resource management
  • Ability to correctly manage/virtualize GPU
    resources
  • CPU/compute-maximized integration
  • Compute cores? GPU/Niagara/Larrabee
  • Compute cores as first-class execution
    environments (dump the accelerator model)
  • Unified view of work throughout machine
  • Any core creates work for other cores