Title: Tradeoffs in Designing Accelerator Architectures for Visual Computing
1. Tradeoffs in Designing Accelerator Architectures for Visual Computing
- Aqeel Mahesri
- Daniel R. Johnson
- Neal Crago
- Sanjay J. Patel
- Center for Reliable and High-Performance Computing, University of Illinois
2. Outline
- Motivation
- Applications
- Accelerator Meta-Architecture
- Micro-Architecture
- Conclusion
4. Motivation: New Opportunities
- Researchers creating new applications
- computer vision
- data mining
- computational biology
- New applications highly scalable
- demand high performance
- offer massive parallelism
- Moore's Law scaling enables high compute density
- GFLOPS/mm2
5. Motivation: Existing Challenges
- Limitations of existing CPU architectures
- complex cores cannot maximize compute density
- legacy, single-threaded applications
- Limitations of existing GPU architectures
- specialized graphics hardware
- new applications less structured than graphics
6. New Class of Architecture
- A generalized accelerator architecture
- call it xPU
- Aim at emerging applications
- broader than graphics
- but not burdened by legacy
- What should this architecture look like?
8. Applications for Accelerator Architectures
- Visual computing
- processing, rendering, and modeling of visual information
- performance drivers for high-end computing
- Traditional
- graphics rendering
- video encoding
- Emerging
- computer vision
- real-time simulation
9. Evaluating Applications
- Need a proxy to represent app areas
- create a benchmark suite
- Requirements
- representative
- real, production quality apps
- generally targeted, suitable for design-space exploration
- Alas... only open-source apps were available to us
10. VISBench
- Visualization, Interaction, Simulation Benchmarks
- Blender
- high quality scanline renderer
- POVRay
- advanced ray tracer
- Open Dynamics Engine-based PhysicsBench
- physics simulation
- OpenCV-based face detection
- computer vision
- H.264 motion estimation kernel
- High-fidelity MRI reconstruction kernel
11. Parallelizing VISBench
- Starting apps were targeted at the PC
- make them appropriate for the xPU study
- Identify parallel loops
- remove pthreads/OpenMP calls
- replace with annotations
- Tune data structures, code
- but minimal algorithm-level change
- Gives us an experimental proxy for xPU code
12. VISBench on xPU
14. Accelerator Meta-Architecture
15. Architecture Issues
- synchronization and communication
- MIMD vs. SIMD
- core pipeline
- multithreading
- cache sizing
- memory bandwidth
17. Architecture Optimization Problem
- Given a fixed area budget, architect the chip to maximize performance
- assume a 400mm2 area budget at 65nm
- 100mm2 L2 global cache
- 200mm2 compute array
- 100mm2 interconnect, memory controllers, I/O
- Fill the compute array to maximize performance
18. Area Methodology
- Needs to work for wide-ranging exploration
- Hybrid approach
- SRAMs
- model using CACTI
- Functional units
- use published area numbers
- Pipeline logic
- estimate number of NAND2 gates
19. Performance Methodology
- Simulation
- simulate whole chip
- Run annotated VISBench binaries through a sequential front end
- Simulation back end detects annotations and runs threads on different cores
20. Synchronization and Communication Issues
- How much support to provide?
- none vs. bulk-synchronous vs. arbitrary threading vs. something in between
- Shared memory vs. message passing
- message passing easier to scale
- shared memory makes shared data structures easier
- Coherence
- facilitates synchronization
- facilitates write sharing
21. Synchronization and Communication
- study data sharing in VISBench
- classify shared reads and writes
- how many
- how many need synchronization
22. Synchronization and Communication
23. Synchronization and Communication
- limited number of non-private writes
- mostly at barriers
- don't need coherence for communication at barriers
24. Synchronization and Communication
- sharing within barriers rare
- only ODE and POVRay
- fine-grained synchronization needs to be present
- but it doesn't need to be high-throughput
25. Proposed Memory Model
- Shared address space
- facilitates shared data structures
- No hardware cache coherence
- Support for fine-grained synchronization
- global memory operations
- special global load and store
- always bypass private caches
26. Execution Model
- GPUs run clusters of cores in SIMD lockstep
- works well for graphics
- saves the area overhead of instruction fetch and decode
- Drawbacks of SIMD
- loss of efficiency if control flow diverges
- programmer effort
- Does SIMD work well for VISBench?
27. SIMD vs. MIMD
- look at efficiency loss due to control divergence
- assume perfect memory
28. SIMD Area vs. Performance
- The replicated portion of a SIMD pipeline is 60% of the MIMD core area
29. Core Pipeline
- pipeline organization
- in-order vs. out-of-order
- utilization vs. density
- single-issue vs. superscalar
- ILP vs. core count
- compare 1-wide in-order, 2-wide in-order, and 2-wide out-of-order pipelines
30. Core Pipeline Organization
- in1 configurations
- vary cache sizes, instruction mix, multithreading
- plot performance/area versus core complexity
31. Core Pipeline Organization
32. Core Pipeline Organization
33. Fine-Grained Multithreading
- Covers latency from cache misses and long-latency FP operations
- Area cost
- additional pipeline latches
- muxes for thread selection
- added register file
- grows linearly with thread count
- Extra cache pressure
34. Multithreading
- 2-way: performance benefit
- 8% on average
- 12% area overhead
- 4-way: no benefit
- 2-way already covers most miss latency
- 36% area overhead
35. Cache Sizing
- small caches don't fit the working set
- large caches take up too much area
36. Memory and Cache Bandwidth
- Look at the highest-performing configuration with varying BW
- Memory BW
- perf saturates at 64GB/s
- Cache BW
- perf saturates at 768GB/s
37. Performance Summary
- Highest performing configuration
- 2-issue in-order
- 4KB instruction / 8KB data L1 caches
- 573 cores in 200mm2 compute array
- 165GOP/s on parallel sections
- 103X speedup over 2.2GHz Opteron
38. Conclusion
- Highly parallel visual computing applications are long-term performance drivers
- accelerator architectures can achieve much higher throughput than CPUs
- Architectural insights
- non-coherent shared memory model
- MIMD execution model better than SIMD
- in-order cores work well, even with sub-optimal scheduling
- multithreading improves performance, up to a point
- cache sizing requires compromise
39. More Information
- more microarchitectural exploration
- Omid Azizi, Aqeel Mahesri, Sanjay J. Patel, Mark Horowitz, "Area Efficiency in CMP Core Design: Co-Optimization of Microarchitecture and Physical Design," dasCMP 2008
- more on the incoherent shared memory model
- John H. Kelm, Daniel R. Johnson, Aqeel Mahesri, Steven S. Lumetta, Matthew Frank, Sanjay J. Patel, "SChISM: Scalable Cache Incoherent Shared Memory," University of Illinois Technical Report UILU-ENG-08-2212, August 2008
- Rigel Project
- http://rigel.crhc.uiuc.edu
- public release of VISBench
- coming soon
- e-mail mahesri_at_illinois.edu
40. Backup
- what needs to go here?
- icc vs. gcc?
41. Vector Instructions
- area cost exceeds the performance gain
42. Accelerator Meta-Architecture
- Accelerator model
- co-processor attached to main CPU
- used to off-load compute-intensive sections
- Accelerator components
- array of compute cores
- shared global cache
- high-bandwidth memory system
43. Memory and Cache Bandwidth
- Assumed memory bandwidth
- 8x 64-bit channels, 2GHz effective clock
- 128GB/s total
- Assumed global cache bandwidth
- 32-way banked
- 32B cache lines
- 1 access per cycle per bank
- 1024GB/s total
- Is this sufficient?