1
Wrapup and Open Issues
  • Kevin Skadron
  • University of Virginia Dept. of Computer Science
  • LAVA Lab

2
Outline
  • Objective of this segment: explore interesting
    open questions and research challenges
  • CUDA restrictions (we have covered most of these)
  • CUDA ecosystem
  • CUDA as a generic platform for manycore research
  • Manycore architecture/programming in general

3
CUDA Restrictions: Thread Launch
  • Threads may only be created by launching a kernel
    (a minimal launch sketch follows this list)
  • Thread blocks run to completion; there is no
    provision to yield
  • Thread blocks can run in arbitrary order
  • No writable on-chip persistent state between
    thread blocks
  • Together, these restrictions enable:
  • Lightweight thread creation
  • Scalable programming model (not specific to
    number of cores)
  • Simple synchronization
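
A minimal sketch of the launch-only model; the scale kernel and the launch configuration are illustrative assumptions, not from the slides:

    __global__ void scale(float *data, float alpha, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
        if (i < n)
            data[i] *= alpha;                            // runs to completion; no yield
    }

    void launch_scale(float *d_data, float alpha, int n)
    {
        // All parallelism is expressed up front in the launch configuration;
        // the resulting thread blocks may be scheduled in any order.
        int threadsPerBlock = 256;
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        scale<<<blocks, threadsPerBlock>>>(d_data, alpha, n);
    }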

4
CUDA Restrictions: Concurrency
  • Task parallelism is restricted
  • Global communication is restricted (see the
    reduction sketch after this list)
  • No writable on-chip persistent state between
    thread blocks
  • Recursion not allowed
  • Data-driven thread creation and irregular
    fork-join parallelism are therefore difficult
  • Together, these restrictions allow a focus on:
  • perf/mm²
  • scalability
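
A sketch of the global-communication restriction, assuming a simple sum reduction (the kernel name and the fixed 256-thread block size are assumptions): communication between thread blocks is expressed as separate kernel launches, with the launch boundary acting as the global barrier.

    __global__ void partial_sums(const float *in, float *block_sums, int n)
    {
        __shared__ float buf[256];                       // per-block scratchpad only
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                                 // barrier within one block only
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // assumes blockDim.x == 256
            if (threadIdx.x < s)
                buf[threadIdx.x] += buf[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            block_sums[blockIdx.x] = buf[0];             // results spill to global memory
    }
    // A second kernel (or the host) then reduces block_sums; there is no way to
    // keep writable on-chip state live across thread blocks within one kernel.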

5
CUDA Restrictions: GPU-centric
  • No OS services within a kernel
  • e.g., no malloc
  • CUDA is a manycore but single-chip solution
  • Multi-chip, e.g. multi-GPU, requires explicit
    management in the host program, e.g. separately
    managing multiple CUDA devices (see the sketch
    after this list)
  • Compute and 3D rendering can't run concurrently
    (just like dependent kernels)
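
A sketch of the explicit multi-GPU management mentioned above, using standard CUDA runtime calls; the placeholder work kernel, the even data split, and the launch sizes are assumptions:

    #include <cuda_runtime.h>

    __global__ void work(float *data, int n)             // placeholder kernel
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;
    }

    void run_on_all_gpus(const float *h_in, int n)
    {
        int devCount = 0;
        cudaGetDeviceCount(&devCount);
        int chunk = n / devCount;                         // assume n divides evenly
        for (int d = 0; d < devCount; ++d) {
            cudaSetDevice(d);                             // select one device at a time
            float *d_buf;
            cudaMalloc(&d_buf, chunk * sizeof(float));    // cleanup omitted in this sketch
            cudaMemcpy(d_buf, h_in + d * chunk, chunk * sizeof(float),
                       cudaMemcpyHostToDevice);
            work<<<(chunk + 255) / 256, 256>>>(d_buf, chunk);  // per-device launch
        }
        // The host program, not CUDA itself, partitions data and coordinates devices.
    }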

6
CUDA Ecosystem Issues
  • These are really general parallel-programming
    ecosystem issues
  • Issue #1: need parallel libraries and skeletons
  • Ideally the API is platform independent
  • Issue #2: need higher-level languages
  • Simplify programming
  • Provide information about desired outcomes, data
    structures
  • May need to be domain specific
  • Allow underlying implementation to manage
    parallelism, data
  • Examples: D3D, MATLAB, R, etc.
  • Issue #3: debugging for correctness and
    performance
  • CUDA debugger and profiler in beta, but
  • Even if you have the mechanics, it is an
    information visualization challenge
  • GPUs do not have precise exceptions

7
Outline
  • CUDA restrictions
  • CUDA ecosystem
  • CUDA as a generic platform for manycore research
  • Manycore architecture/programming in general

8
CUDA for Architecture Research
  • CUDA is a promising vehicle for exploring many
    aspects of parallelism at an interesting scale on
    real hardware
  • CUDA is also a great vehicle for developing good
    parallel benchmarks
  • What about for architecture research?
  • We can't change the HW
  • CUDA is good for exploring bottlenecks
  • Programming model: what is hard to express, and how
    could architecture help?
  • Performance bottlenecks: where are the
    inefficiencies in real programs?
  • But how to evaluate benefits of fixing these
    bottlenecks?
  • Open question
  • Measure cost at bottlenecks, estimate benefit of
    a solution?
  • Will our research community accept this?

9
Programming Model
  • General manycore challenges
  • Balance flexibility and abstraction vs.
    efficiency and scalability
  • MIMD vs SIMD, SIMT vs. vector
  • Barriers vs. fine-grained synchronization
  • More flexible task parallelism
  • High thread count requirement of GPUs
  • Balance ease of first program against efficiency
  • Software-managed local store vs. cache (see the
    sketch after this list)
  • Coherence
  • Genericity vs. ability to exploit fixed-function
    hardware
  • Ex: OpenGL exposes more GPU hardware that can be
    used to good effect for certain algorithms, but it
    is harder to program
  • Balance ability to drill down against
    portability, scalability
  • Control thread placement
  • Challenges due to high off-chip offload costs
  • Fine-grained global RW sharing
  • Seamless scalability across multiple chips
    (manychip?), distributed systems
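
A sketch of the software-managed local store point (the 3-point smoothing kernel and the 256-thread block size are assumptions): the program, not the hardware, decides what lives in __shared__ and when it becomes visible.

    __global__ void smooth(const float *in, float *out, int n)
    {
        __shared__ float tile[256 + 2];                  // block-sized tile plus halo
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int t = threadIdx.x + 1;
        if (i < n) tile[t] = in[i];                      // explicit staging into the tile
        if (threadIdx.x == 0 && i > 0)                  tile[0]     = in[i - 1];
        if (threadIdx.x == blockDim.x - 1 && i + 1 < n) tile[t + 1] = in[i + 1];
        __syncthreads();                                 // make the tile visible block-wide
        if (i > 0 && i + 1 < n)
            out[i] = (tile[t - 1] + tile[t] + tile[t + 1]) / 3.0f;
    }

A hardware cache would capture this reuse automatically; the scratchpad requires the staging and the barrier to be written by hand.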

10
Scaling
  • How do we use all the transistors?
  • ILP
  • More L1 storage?
  • More L2?
  • More PEs per core?
  • More cores?
  • More true MIMD?
  • More specialized hardware?
  • How do we scale when we run into the power wall?

11
Thank you
  • Questions?

12
CUDA Restrictions: Task Parallelism (extended)
  • Task parallelism is restricted and can be awkward
  • Hard to get efficient fine-grained task
    parallelism
  • Between threads: allowed (SIMT), but expensive
    due to divergence and block-wide barrier
    synchronization
  • Between warps: allowed, but awkward due to
    block-wide barrier synchronization
  • Between thread blocks: much more efficient, code
    still a bit awkward (SPMD style)
  • Communication among thread blocks is inadvisable
  • Communication among blocks generally requires a new
    kernel call, i.e. a global barrier
  • Coordination (e.g., a shared queue pointer) is OK
    (see the sketch after this list)
  • No writable on-chip persistent state between
    thread blocks
  • These restrictions stem from a focus on:
  • perf/mm²
  • support for a scalable number of cores
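
A sketch of the coordination-is-OK point (the queue layout and all names are assumptions): blocks claim work items through a global pointer with atomics, but do not exchange results within the kernel.

    __device__ unsigned int next_item;                   // host must zero this before launch

    __global__ void process_queue(const int *items, unsigned int num_items)
    {
        __shared__ unsigned int my_item;
        while (true) {
            if (threadIdx.x == 0)
                my_item = atomicAdd(&next_item, 1u);     // block claims one item
            __syncthreads();
            if (my_item >= num_items) break;             // queue drained; whole block exits
            // ... all threads in the block cooperate on items[my_item] ...
            __syncthreads();                             // before thread 0 claims the next item
        }
    }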

13
CUDA Restrictions: HW Exposure
  • CUDA is still low-level, like C: good and bad
  • Requires/allows manual optimization
  • Manage concurrency, communication, data transfers
  • Manage scratchpad
  • Performance is more sensitive to HW
    characteristics than C on a uniprocessor (but
    this is probably true of any manycore system)
  • Thread count must be high to get efficiency on the
    GPU
  • Sensitivity to the number of registers allocated:
  • Fixed total register file size, variable
    registers/thread (e.g., G80): registers/thread
    limits threads/SM
  • Fixed register file/thread (e.g., Niagara): a small
    register allocation wastes register file space
  • SW multithreading: context-switch cost is
    f(registers/thread)
  • Memory coalescing/locality (see the sketch after
    this list)
  • These issues will arise in most manycore
    architectures
  • Many more interacting hardware sensitivities
    (regs/thread, threads/block, shared memory
    usage, etc.); performance tuning is tricky
  • Need libraries and higher-level APIs that have
    been optimized to account for all these factors
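
A sketch of the coalescing sensitivity (both kernels are illustrative): adjacent threads touching adjacent words coalesce into a few memory transactions, while a strided pattern turns the same copy into many transactions.

    __global__ void copy_coalesced(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];                       // thread k reads word k: coalesced
    }

    __global__ void copy_strided(const float *in, float *out, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[(i * stride) % n];        // scattered reads: poorly coalesced
    }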

14
SIMD Organizations
  • SIMT: independent threads, SIMD hardware
  • First used in Solomon and ILLIAC IV
  • Each thread has its own PC and stack, allowed to
    follow unique execution path, call/return
    independently
  • Implementation is optimized for lockstep
    execution of these threads (32-wide warp in
    Tesla)
  • Divergent threads require masking, handled in HW
    (see the sketch after this list)
  • Each thread may access unrelated locations in
    memory
  • Implementation is optimized for spatial locality:
    adjacent words will be coalesced into one memory
    transaction
  • Uncoalesced loads require separate memory
    transactions, handled in HW
  • Datapath for each thread is scalar
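
A sketch of divergence within a warp (the branch and the device functions are illustrative): the hardware masks inactive lanes and serializes the two paths.

    __device__ float work_even(float x) { return x * x; }     // illustrative
    __device__ float work_odd (float x) { return x + 1.0f; }  // illustrative

    __global__ void divergent(float *data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadIdx.x % 2 == 0)                        // even/odd lanes diverge
            data[i] = work_even(data[i]);
        else
            data[i] = work_odd(data[i]);
        // Both paths execute back to back for the warp, with non-participating
        // lanes masked off by the hardware.
    }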

15
SIMD Organizations (2)
  • Vector
  • First implemented by CDC; commercialized in a big
    way with MMX/SSE/AltiVec, etc.
  • Multiple lanes handled with a single PC and stack
  • Divergence handled with predication
  • Promotes code with good lockstep properties
  • Datapath operates on vectors
  • Computing on data that is non-adjacent in memory
    requires a vector gather if available, else scalar
    loads and packing
  • Promotes code with good spatial locality
  • Chief difference: more burden on software
  • What is the right answer?

16
Backup
17
Other Manycore PLs
  • Data-parallel languages (HPF, Co-Array Fortran,
    various data-parallel C dialects)
  • DARPA HPCS languages (X10, Chapel, Fortress)
  • OpenMP
  • Titanium
  • Java/pthreads/etc.
  • MPI
  • Streaming
  • Key CUDA differences
  • Virtualizes PEs
  • Supports offload
  • Scales to large numbers of threads
  • Allows shared memory between (subsets of) threads

18
Persistent Threads
  • You can cheat on the rule that thread blocks
    should be independent
  • Instantiate only as many thread blocks as there
    are SMs
  • Now you don't have to worry about execution order
    among thread blocks
  • Lose the hardware global barrier
  • Need to parameterize the code so that it is not
    tied to a specific chip / number of SMs (see the
    sketch after this list)
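
A sketch of the persistent-threads pattern (the kernel body and block size are assumptions): launch only as many blocks as the chip has SMs, query that count at run time, and let each block loop over the work.

    #include <cuda_runtime.h>

    __global__ void persistent(const float *in, float *out, int num_items)
    {
        // Each block strides through the work items instead of owning exactly one.
        for (int item = blockIdx.x; item < num_items; item += gridDim.x) {
            int i = item * blockDim.x + threadIdx.x;     // assumes num_items * blockDim.x elements
            out[i] = in[i] * 2.0f;                       // placeholder per-item work
        }
    }

    void launch_persistent(const float *d_in, float *d_out, int num_items)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);               // query, rather than hard-code, the SM count
        persistent<<<prop.multiProcessorCount, 256>>>(d_in, d_out, num_items);
    }

Note that the hardware global barrier is indeed lost: nothing synchronizes these blocks except atomics or the end of the kernel.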

19
DirectX: a good case study
  • High-level abstractions
  • Serial ordering among primitives
  • implicit synchronization
  • No guarantees about ordering within primitives
  • no fine-grained synchronization
  • Domain-specific API is convenient for programmers
    and provides lots of semantic information to
    middleware: parallelism, load balancing, etc.
  • Domain-specific API is convenient for hardware
    designers: the API has evolved while underlying
    architectures have been radically different from
    generation to generation and company to company
  • Similar arguments apply to Matlab, SQL,
    Map-Reduce, etc.
  • I'm not advocating any particular API, but these
    examples show that high-level, domain-specific
    APIs are commercially viable and effective in
    exposing parallelism
  • Middleware (hopefully common) translates
    domain-specific APIs to general-purpose
    architecture that supports many different app.
    domains

20
HW Implications of Abstractions
  • If we are developing high-level abstractions and
    supporting middleware, low-level
    macro-architecture is less important
  • Look at the dramatic changes in GPU architecture
    under DirectX
  • If middleware understands triangles/matrices/
    graphs/etc., it can translate them to SIMD or
    anything else
  • HW design should focus on
  • Reliability
  • Scalability (power, bandwidth, etc.)
  • Efficient support for important parallel
    primitives
  • Scan, sort, shuffle, etc.
  • Memory model
  • SIMD divergent code: work queues, conditional
    streams, etc.
  • Efficient producer-consumer communication
  • Flexibility (need economies of scale)
  • Need to support legacy code (MPI, OpenMP,
    pthreads) and other low-level languages (X10,
    Chapel, etc.)
  • Low-level abstractions cannot cut out these users
  • Programmers must be able to drill down if
    necessary