Tradeoffs in Designing Accelerator Architectures for Visual Computing - PowerPoint PPT Presentation
1
Tradeoffs in Designing Accelerator Architectures
for Visual Computing
  • Aqeel Mahesri
  • Daniel R. Johnson
  • Neal Crago
  • Sanjay J. Patel
  • Center for Reliable and High Performance
    Computing
  • University of Illinois

2
Outline
  • Motivation
  • Applications
  • Accelerator Meta-Architecture
  • Micro-Architecture
  • Conclusion

3
Outline
  • Motivation
  • Applications
  • Accelerator Meta-Architecture
  • Micro-Architecture
  • Conclusion

4
Motivation: New Opportunities
  • Researchers creating new applications
  • computer vision
  • data mining
  • computational biology
  • New applications highly scalable
  • demand high performance
  • offer massive parallelism
  • Moore's Law scaling enables high compute density
  • GFLOPS/mm²

5
Motivation: Existing Challenges
  • Limitations of existing CPU architectures
  • complex cores cannot maximize compute density
  • legacy, single-threaded applications
  • Limitations of existing GPU architectures
  • specialized graphics hardware
  • new applications less structured than graphics

6
New Class of Architecture
  • A generalized accelerator architecture
  • call it xPU
  • Aim at emerging applications
  • broader than graphics
  • but not burdened by legacy
  • What should this architecture look like?

7
Outline
  • Motivation
  • Applications
  • Accelerator Meta-Architecture
  • Micro-Architecture
  • Conclusion

8
Applications for Accelerator Architectures
  • Visual computing
  • processing, rendering, modeling of visual
    information
  • performance drivers for high-end computing
  • Traditional
  • graphics rendering
  • video encoding
  • Emerging
  • computer vision
  • real-time simulation

9
Evaluating Applications
  • Need a proxy to represent app areas
  • create a benchmark suite
  • Requirements
  • representative
  • real, production quality apps
  • generally targeted, suitable for design space
    exploration
  • Alas... only open-source apps were available to us

10
VISBench
  • Visualization, Interaction, Simulation Benchmarks
  • Blender
  • high quality scanline renderer
  • POVRay
  • advanced ray tracer
  • Open Dynamics Engine-based PhysicsBench
  • physics simulation
  • OpenCV-based face detection
  • computer vision
  • H.264 motion estimation kernel
  • High-fidelity MRI reconstruction kernel

11
Parallelizing VISBench
  • Starting apps targeted to PC
  • make appropriate for xPU study
  • Identify parallel loops
  • remove pthreads/OpenMP calls
  • replace with annotations
  • Tune data structures, code
  • but minimal algorithm-level change
  • Gives us experimental proxy for xPU code
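The flow above can be sketched in code. The talk does not show its annotation syntax, so the `parallel_loop` decorator below is a hypothetical stand-in, invented here just to illustrate marking a loop whose iterations the simulation back end may fan out across cores.

```python
# Hypothetical sketch of the annotation step: pthreads/OpenMP calls are
# removed and replaced with an annotation the simulation front end can
# detect. `parallel_loop` is an invented stand-in, not the talk's syntax;
# it merely records which functions hold independent loop iterations.

parallel_regions = []

def parallel_loop(fn):
    """Mark fn as a loop body whose iterations are independent."""
    parallel_regions.append(fn.__name__)
    return fn

@parallel_loop
def shade_row(row):
    # independent per-row work, as in the VISBench renderers
    return [2 * p for p in row]

assert shade_row([1, 2, 3]) == [2, 4, 6]
assert "shade_row" in parallel_regions
```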

12
VISBench on xPU
13
Outline
  • Motivation
  • Applications
  • Accelerator Meta-Architecture
  • Micro-Architecture
  • Conclusion

14
Accelerator Meta-Architecture
15
Architecture Issues
  • synchronization and communication
  • MIMD vs. SIMD
  • core pipeline
  • multithreading
  • cache sizing
  • memory bandwidth

16
Outline
  • Motivation
  • Applications
  • Accelerator Meta-Architecture
  • Micro-Architecture
  • Conclusion

17
Architecture Optimization Problem
  • Given a fixed area budget, architect chip to
    maximize performance
  • assume 400 mm² area budget at 65 nm
  • 100 mm² L2 global cache
  • 200 mm² compute array
  • 100 mm² interconnect, memory controllers, I/O
  • Fill the compute array to maximize performance

18
Area Methodology
  • Needs to work for wide-ranging exploration
  • Hybrid approach
  • SRAMs
  • model using CACTI
  • Functional units
  • use published area numbers
  • Pipeline logic
  • estimate number of NAND2 gates
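The logic term of the hybrid model can be sketched as below; the ~1.5 µm² per NAND2-equivalent gate for a 65 nm standard-cell library is an assumption of this sketch, not a number from the talk.

```python
# Pipeline-logic area from a NAND2-equivalent gate count. The NAND2 cell
# area is an assumed value for 65 nm, not a figure from the talk; SRAM
# (CACTI) and functional-unit areas are estimated separately.

NAND2_UM2 = 1.5  # assumed 65 nm NAND2 cell area, um^2

def logic_area_mm2(nand2_equiv_gates):
    """Convert a NAND2-equivalent gate count to mm^2."""
    return nand2_equiv_gates * NAND2_UM2 / 1e6

# e.g. 50k gates of pipeline control stays under 0.1 mm^2, small next
# to the SRAM and FPU terms of the hybrid model.
assert abs(logic_area_mm2(50_000) - 0.075) < 1e-12
```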

19
Performance Methodology
  • Simulation
  • simulate whole chip
  • Run annotated VISBench binaries through
    sequential front end
  • Simulation back end detects annotations, runs
    threads on different cores

20
Synchronization and Communication Issues
  • How much support to provide?
  • none vs. bulk synchronous vs. arbitrary threading
    vs. something in between
  • Shared memory vs. message passing
  • message passing easier to scale
  • shared memory makes shared data structures easier
  • Coherence
  • facilitates synchronization
  • facilitates write sharing

21
Synchronization and Communication
  • study data sharing in VISBench
  • classify shared reads and writes
  • how many
  • how many need synchronization

22
Synchronization and Communication
23
Synchronization and Communication
  • limited number of non-private writes
  • mostly at barriers
  • don't need coherence for communication at barriers

24
Synchronization and Communication
  • sharing within barriers rare
  • only ODE and POVRay
  • fine-grained synchronization needs to be present
  • but it doesn't need to be high-throughput

25
Proposed Memory Model
  • Shared address space
  • facilitates shared data structures
  • No hardware cache coherence
  • Support for fine-grained synchronization
  • global memory operations
  • special global load and store
  • always bypass private caches
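A toy model of these semantics follows; the class and method names are hypothetical, chosen only to illustrate how global operations synchronize without hardware coherence.

```python
# Toy model of the proposed memory model: incoherent private caches plus
# global loads/stores that always bypass them. Names are hypothetical;
# only the semantics follow the slide.

class Core:
    def __init__(self, mem):
        self.mem = mem      # shared global memory, modeled as a dict
        self.cache = {}     # private cache: never invalidated by others

    def load(self, addr):
        # ordinary load: may return a stale private copy
        return self.cache.setdefault(addr, self.mem.get(addr, 0))

    def store(self, addr, val):
        # ordinary store: stays in the private cache in this toy model
        self.cache[addr] = val

    def global_load(self, addr):
        return self.mem.get(addr, 0)      # bypasses the private cache

    def global_store(self, addr, val):
        self.mem[addr] = val              # immediately visible to all cores

mem = {}
a, b = Core(mem), Core(mem)
a.store(0x10, 7)                  # sits in a's private cache...
assert b.load(0x10) == 0          # ...so b reads stale data: no coherence
a.global_store(0x20, 1)           # a flag written with a global store...
assert b.global_load(0x20) == 1   # ...synchronizes without coherence traffic
```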

26
Execution Model
  • GPUs run clusters of cores in SIMD lockstep
  • works well for graphics
  • save area overhead of instruction fetch and
    decode
  • Drawbacks of SIMD
  • loss of efficiency if control flow diverges
  • programmer effort
  • Does SIMD work well for VISBench?
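The divergence cost can be modeled to first order. This sketch assumes lanes branch independently and both branch paths are equally long; both are simplifying assumptions, not measurements from VISBench.

```python
# First-order model of SIMD lockstep divergence loss: each of w lanes
# independently takes a branch with probability p, and a divergent
# cluster must serialize both (equal-length) paths.

def simd_utilization(w, p):
    """Expected fraction of issued lane-slots that do useful work."""
    agree = p ** w + (1 - p) ** w   # all lanes pick the same path
    # Agreement: one path issued, every lane useful (utilization 1).
    # Divergence: both paths issued, half the lane-slots idle (0.5).
    return agree + (1 - agree) * 0.5

# MIMD (w = 1) never diverges; wide SIMD tends to 50% on a 50/50 branch.
assert simd_utilization(1, 0.5) == 1.0
assert abs(simd_utilization(32, 0.5) - 0.5) < 1e-6
```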

27
SIMD vs. MIMD
  • look at efficiency loss due to control divergence
  • assume perfect memory

28
SIMD Area vs. Performance
  • Replicated portion of SIMD pipelines is 60% of
    MIMD core area
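The area arithmetic implied by that figure can be worked out directly: with the per-lane portion at 60% of a full core, a w-lane SIMD cluster costs roughly 0.4 + 0.6·w MIMD-core equivalents versus w independent cores.

```python
# Break-even lane utilization for SIMD vs. MIMD under a fixed area
# budget, using the slide's 60% replicated-area figure.

REPLICATED = 0.6   # per-lane fraction of a MIMD core's area (from the slide)

def simd_cluster_area(w):
    """Area of a w-lane SIMD cluster, in MIMD-core equivalents."""
    return (1 - REPLICATED) + REPLICATED * w

def breakeven_utilization(w):
    """Lane utilization above which SIMD beats MIMD in perf/area."""
    return simd_cluster_area(w) / w

# An 8-lane cluster saves only ~35% area, so its lanes must stay more
# than 65% utilized to win -- which control divergence makes hard.
assert abs(breakeven_utilization(8) - 0.65) < 1e-9
assert abs(breakeven_utilization(1) - 1.0) < 1e-12
```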

29
Core Pipeline
  • pipeline organization
  • in-order vs. out-of-order
  • utilization vs. density
  • single-issue vs. superscalar
  • ILP vs. core count
  • compare 1-wide in-order, 2-wide in-order, 2-wide
    out-of-order pipelines

30
Core Pipeline Organization
  • in1 (1-wide, in-order) configurations
  • vary cache sizes, instruction mix, multithreading
  • plot performance/area versus core complexity

31
Core Pipeline Organization
  • in2 (2-wide, in-order) configurations

32
Core Pipeline Organization
  • out2 (2-wide, out-of-order) configurations

33
Fine-Grained Multithreading
  • Covers latency from cache misses, long latency FP
    operations
  • Area cost
  • additional pipeline latches
  • muxes for thread selection
  • added register file
  • grows linearly with thread count
  • Extra cache pressure

34
Multithreading
  • 2-way performance benefit
  • 8% average
  • 12% area overhead
  • 4-way no benefit
  • 2-way covers most miss latency
  • 36% area overhead
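In a fixed-area compute array, those per-core gains trade against fitting fewer cores, which a quick perf/area check makes explicit:

```python
# Perf/area check on the multithreading numbers above: a per-core
# speedup must be weighed against the cores it displaces in a
# fixed-area array.

def perf_per_area(speedup, area_overhead):
    return speedup / (1 + area_overhead)

base     = perf_per_area(1.00, 0.00)   # single-threaded core
two_way  = perf_per_area(1.08, 0.12)   # +8% perf for +12% area
four_way = perf_per_area(1.08, 0.36)   # no extra perf for +36% area

# 4-way clearly loses; 2-way is roughly a wash per unit area
# (1.08/1.12 ~= 0.96), so its value lies in covering miss latency
# rather than raw compute density.
assert four_way < two_way < base
```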

35
Cache Sizing
  • small caches don't fit working set
  • large caches take up too much area

36
Memory and Cache Bandwidth
  • Look at highest performing configuration with
    varying BW
  • Mem BW
  • perf saturates at 64GB/s
  • Cache BW
  • perf saturates at 768GB/s

37
Performance Summary
  • Highest performing configuration
  • 2-issue in-order
  • 4KB I / 8KB D L1 caches
  • 573 cores in 200 mm² compute array
  • 165GOP/s on parallel sections
  • 103X speedup over 2.2GHz Opteron
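The summary numbers are self-consistent to first order, as simple arithmetic on the stated figures shows:

```python
# First-order check on the summary slide's figures (arithmetic only,
# all inputs taken from the talk): 573 cores in a 200 mm^2 compute
# array sustaining 165 GOP/s on parallel sections.

cores, array_mm2, gops = 573, 200, 165

area_per_core = array_mm2 / cores   # ~0.35 mm^2 per core, caches included
gops_per_core = gops / cores        # ~0.29 GOP/s per 2-issue core

assert 0.3 < area_per_core < 0.4
assert 0.25 < gops_per_core < 0.35
```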

38
Conclusion
  • Highly parallel, visual computing applications
    are long-term performance drivers
  • accelerator architectures can achieve much higher
    throughput than CPUs
  • Architectural insights
  • non-coherent shared memory model
  • MIMD execution model better than SIMD
  • in-order cores work well, even with sub-optimal
    scheduling
  • multithreading improves performance, up to a
    point
  • cache sizing requires compromise

39
More Information
  • more microarchitectural exploration
  • Omid Azizi, Aqeel Mahesri, Sanjay J. Patel, Mark
    Horowitz, "Area Efficiency in CMP Core Design:
    Co-Optimization of Microarchitecture and Physical
    Design," dasCMP 2008
  • more on incoherent shared memory model
  • John H. Kelm, Daniel R. Johnson, Aqeel Mahesri,
    Steven S. Lumetta, Matthew Frank, Sanjay J.
    Patel, "SChISM: Scalable Cache Incoherent Shared
    Memory," University of Illinois Technical Report
    UILU-ENG-08-2212, August 2008
  • Rigel Project
  • http://rigel.crhc.uiuc.edu
  • public release of VISBench
  • coming soon
  • e-mail: mahesri@illinois.edu

40
Backup
  • what needs to go here?
  • icc vs. gcc?

41
Vector Instructions
  • area cost exceeds perf gain

42
Accelerator Meta-Architecture
  • Accelerator model
  • co-processor attached to main CPU
  • used to off-load compute-intensive sections
  • Accelerator components
  • array of compute cores
  • shared global cache
  • high-bandwidth memory system

43
Memory and Cache Bandwidth
  • Assumed memory bandwidth
  • 8x 64-bit channels, 2GHz effective clock
  • 128GB/s total
  • Assumed global cache bandwidth
  • 32-way banked
  • 32B cache lines
  • 1 access per cycle per bank
  • 1024GB/s total
  • Is this sufficient?
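Both totals fall out of channel and bank arithmetic; the 1 GHz global-cache clock below is inferred here to reproduce the 1024 GB/s figure and is not stated on the slide.

```python
# Bandwidth arithmetic behind the assumed totals. The 2 GHz effective
# memory clock, channel count, bank count, and line size are from the
# slide; the 1 GHz cache clock is inferred, not stated.

GB = 1e9

mem_bw = 8 * (64 // 8) * 2e9       # 8 channels x 8 B x 2 GHz effective
assert mem_bw == 128 * GB          # matches the slide's 128 GB/s

cache_bw = 32 * 32 * 1e9           # 32 banks x 32 B line x 1 access/cycle
assert cache_bw == 1024 * GB       # matches the slide's 1024 GB/s
```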