Title: Tradeoffs in Designing Accelerator Architectures for Visual Computing
1. Tradeoffs in Designing Accelerator Architectures for Visual Computing
- Aqeel Mahesri
- Daniel R. Johnson
- Neal Crago
- Sanjay J. Patel
- Center for Reliable and High-Performance Computing, University of Illinois
2. Outline
- Motivation
- Applications
- Accelerator Meta-Architecture
- Micro-Architecture
- Conclusion
4. Motivation: New Opportunities
- Researchers creating new applications
- computer vision
- data mining
- computational biology
- New applications highly scalable
- demand high performance
- offer massive parallelism
- Moore's Law scaling enables high compute density
- GFLOPS/mm2
5. Motivation: Existing Challenges
- Limitations of existing CPU architectures
- complex cores cannot maximize compute density
- legacy, single-threaded applications
- Limitations of existing GPU architectures
- specialized graphics hardware
- new applications less structured than graphics
6. New Class of Architecture
- A generalized accelerator architecture
- call it xPU
- Aim at emerging applications
- broader than graphics
- but not burdened by legacy
- What should this architecture look like?
8. Applications for Accelerator Architectures
- Visual computing
- processing, rendering, and modeling of visual information
- performance drivers for high-end computing
- Traditional
- graphics rendering
- video encoding
- Emerging
- computer vision
- real-time simulation
9. Evaluating Applications
- Need a proxy to represent app areas
- create a benchmark suite
- Requirements
- representative
- real, production quality apps
- generally targeted, suitable for design-space exploration
- Alas... only open-source apps were available to us
10. VISBench
- Visualization, Interaction, Simulation Benchmarks
- Blender
- high quality scanline renderer
- POVRay
- advanced ray tracer
- Open Dynamics Engine-based PhysicsBench
- physics simulation
- OpenCV-based face detection
- computer vision
- H.264 motion estimation kernel
- High-fidelity MRI reconstruction kernel
11. Parallelizing VISBench
- Starting apps were targeted at the PC
- make them appropriate for the xPU study
- Identify parallel loops
- remove pthreads/OpenMP calls
- replace with annotations
- Tune data structures, code
- but minimal algorithm-level change
- Gives us an experimental proxy for xPU code
12. VISBench on xPU
14. Accelerator Meta-Architecture
15. Architecture Issues
- synchronization and communication
- MIMD vs. SIMD
- core pipeline
- multithreading
- cache sizing
- memory bandwidth
17. Architecture Optimization Problem
- Given a fixed area budget, architect the chip to maximize performance
- assume a 400mm2 area budget at 65nm
- 100mm2 L2 global cache
- 200mm2 compute array
- 100mm2 interconnect, memory controllers, I/O
- Fill the compute array to maximize performance
18. Area Methodology
- Needs to work for wide-ranging exploration
- Hybrid approach
- SRAMs
- model using CACTI
- Functional units
- use published area numbers
- Pipeline logic
- estimate number of NAND2 gates
19. Performance Methodology
- Simulation
- simulate whole chip
- Run annotated VISBench binaries through a sequential front end
- Simulation back end detects annotations and runs threads on different cores
20. Synchronization and Communication Issues
- How much support to provide?
- none vs. bulk-synchronous vs. arbitrary threading vs. something in between
- Shared memory vs. message passing
- message passing easier to scale
- shared memory makes shared data structures easier
- Coherence
- facilitates synchronization
- facilitates write sharing
21. Synchronization and Communication
- study data sharing in VISBench
- classify shared reads and writes
- how many
- how many need synchronization
22. Synchronization and Communication
23. Synchronization and Communication
- limited number of non-private writes
- mostly at barriers
- don't need coherence for communication at barriers
24. Synchronization and Communication
- sharing within barriers rare
- only ODE and POVRay
- fine-grained synchronization needs to be present
- but it doesn't need to be high-throughput
25. Proposed Memory Model
- Shared address space
- facilitates shared data structures
- No hardware cache coherence
- Support for fine-grained synchronization
- global memory operations
- special global load and store
- always bypass private caches
26. Execution Model
- GPUs run clusters of cores in SIMD lockstep
- works well for graphics
- saves the area overhead of instruction fetch and decode
- Drawbacks of SIMD
- loss of efficiency if control flow diverges
- programmer effort
- Does SIMD work well for VISBench?
27. SIMD vs. MIMD
- look at efficiency loss due to control divergence
- assume perfect memory
28. SIMD Area vs. Performance
- The replicated portion of a SIMD pipeline is 60% of the MIMD core area
29. Core Pipeline
- pipeline organization
- in-order vs. out-of-order
- utilization vs. density
- single-issue vs. superscalar
- ILP vs. core count
- compare 1-wide in-order, 2-wide in-order, and 2-wide out-of-order pipelines
30. Core Pipeline Organization
- in1 configurations
- vary cache sizes, instruction mix, multithreading
- plot performance/area versus core complexity
31. Core Pipeline Organization
32. Core Pipeline Organization
33. Fine-Grained Multithreading
- Covers latency from cache misses and long-latency FP operations
- Area cost
- additional pipeline latches
- muxes for thread selection
- added register file
- grows linearly with thread count
- Extra cache pressure
34. Multithreading
- 2-way: performance benefit
- 8% on average
- 12% area overhead
- 4-way: no benefit
- 2-way already covers most miss latency
- 36% area overhead
35. Cache Sizing
- small caches don't fit the working set
- large caches take up too much area
36. Memory and Cache Bandwidth
- Look at the highest-performing configuration with varying BW
- Memory BW
- perf saturates at 64GB/s
- Cache BW
- perf saturates at 768GB/s
37. Performance Summary
- Highest performing configuration
- 2-issue in-order
- 4KB instruction / 8KB data L1 caches
- 573 cores in 200mm2 compute array
- 165GOP/s on parallel sections
- 103X speedup over 2.2GHz Opteron
38. Conclusion
- Highly parallel visual computing applications are long-term performance drivers
- accelerator architectures can achieve much higher throughput than CPUs
- Architectural insights
- non-coherent shared memory model
- MIMD execution model better than SIMD
- in-order cores work well, even with sub-optimal scheduling
- multithreading improves performance, up to a point
- cache sizing requires compromise
39. More Information
- more microarchitectural exploration
- Omid Azizi, Aqeel Mahesri, Sanjay J. Patel, Mark Horowitz, "Area Efficiency in CMP Core Design: Co-Optimization of Microarchitecture and Physical Design," dasCMP 2008
- more on the incoherent shared memory model
- John H. Kelm, Daniel R. Johnson, Aqeel Mahesri, Steven S. Lumetta, Matthew Frank, Sanjay J. Patel, "SChISM: Scalable Cache Incoherent Shared Memory," University of Illinois Technical Report UILU-ENG-08-2212, August 2008
- Rigel Project
- http://rigel.crhc.uiuc.edu
- public release of VISBench
- coming soon
- e-mail mahesri_at_illinois.edu
40. Backup
- what needs to go here?
- icc vs. gcc?
41. Vector Instructions
- area cost exceeds the performance gain
42. Accelerator Meta-Architecture
- Accelerator model
- co-processor attached to main CPU
- used to off-load compute-intensive sections
- Accelerator components
- array of compute cores
- shared global cache
- high-bandwidth memory system
43. Memory and Cache Bandwidth
- Assumed memory bandwidth
- 8x 64-bit channels, 2GHz effective clock
- 128GB/s total
- Assumed global cache bandwidth
- 32-way banked
- 32B cache lines
- 1 access per cycle per bank
- 1024GB/s total
- Is this sufficient?