Title: High Performance Mobile Computing Using Flexible Wide SIMD Processors
1. High Performance Mobile Computing Using Flexible Wide SIMD Processors
- Scott Mahlke
- in collaboration with
- Mark Woh, Sangwon Seo, Amir Hormati, Yoonseo Choi, Trevor Mudge, Chaitali Chakrabarti (ASU), Krisztian Flautner (ARM Ltd.)
- Advanced Computer Architecture Laboratory
- University of Michigan
2. The Modern Mobile Phone
The Old Mobile Phone
- Future phones are becoming more complex
- Richer applications require both more performance and more flexibility
- Modern phones look like Franken-chips
Video Recording
Video Editing
Higher Data Rates
3D Rendering
Advanced Image Processing
3. Power/Performance Requirements for Multiple Systems
Different applications have different power/performance characteristics. Each application must be kept in mind during design, which calls for a domain-specific processor rather than a general-purpose processor (GPP).
4. 4G Wireless Basics
NTT DoCoMo 4G test setup
- Three kernels make up the majority of the work
- FFT: extract data from signals
- STBC: combine data into a more reliable stream
- LDPC: error correction on the data stream
5. High Definition Video (H.264) Basics
4CIF at 30 fps
6. Mobile Signal Processing Algorithm Characteristics
- Problems with traditional SIMD
- High register file power
- Large data movement/alignment cost
- Inconsistent lane utilization
- SIMD implies a single thread
- Algorithms have different SIMD widths
- From very large to very small
- Though SIMD width varies, all algorithms can exploit it
- A large percentage of the work can be SIMDized
- Algorithms with larger SIMD widths tend to have less thread-level parallelism (TLP)
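The "inconsistent lane utilization" problem can be illustrated with a little arithmetic. The sketch below is not from the talk: the function name `lane_utilization` and the 64-lane machine are assumptions chosen for illustration, and the vector lengths are example values only.

```python
import math

def lane_utilization(vector_len: int, simd_width: int) -> float:
    """Fraction of lane-slots doing useful work when a vector of
    `vector_len` elements runs on a fixed `simd_width`-lane machine."""
    if vector_len < 1 or simd_width < 1:
        raise ValueError("vector_len and simd_width must be positive")
    passes = math.ceil(vector_len / simd_width)  # SIMD issues needed
    return vector_len / (passes * simd_width)

# Illustrative vector lengths on a hypothetical 64-lane datapath.
for n in (4, 96, 1024):
    print(n, lane_utilization(n, 64))
```

Under these assumptions a very short vector leaves most lanes idle, while a vector that is a multiple of the machine width keeps every lane busy.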
7. So, What's the Right Solution?
- Alternatives
- More processors, fewer lanes?
- Configurable hardware that can be SIMD or MIMD?
- Franken-chip?
- SIMD is the answer! It provides high performance and power efficiency
- Low control cost
- More area-efficient scaling
- Single thread context
- Simpler memory system design: no cache coherence
8. A Closer Look at SIMD Power Breakdown
- Register file power disproportionately high in a
traditional SIMD architecture
9. Register File Accesses
Lots of power is wasted on unneeded register file accesses!
- Many register file accesses do not have to go back to the main register file
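A rough way to see how many accesses could skip the main register file is to count them over a short instruction trace. This is only a sketch under stated assumptions: the trace format, the function `rf_accesses`, and the bypass policy (forward a value only to the immediately following instruction, and skip the write-back when that is its only use) are hypothetical and are not taken from the talk. Each destination is assumed to be written once.

```python
def rf_accesses(trace, live_out=frozenset()):
    """trace: list of (dest, [srcs]) in program order, each dest written once.
    Returns (baseline, with_bypass) counts of main register file accesses."""
    # Baseline: every source is read from, and every result written to, the RF.
    baseline = sum(1 + len(srcs) for _, srcs in trace)

    with_bypass = 0
    for i, (dest, srcs) in enumerate(trace):
        prev_dest = trace[i - 1][0] if i > 0 else None
        # A source produced by the previous instruction is forwarded, not read.
        with_bypass += sum(1 for s in srcs if s != prev_dest)
        # The write-back is needed only if the value outlives the next instruction.
        use_sites = [j for j in range(i + 1, len(trace)) if dest in trace[j][1]]
        needs_write = dest in live_out or any(j != i + 1 for j in use_sites)
        with_bypass += int(needs_write)
    return baseline, with_bypass

# Hypothetical four-instruction chain; only "out" is needed afterwards.
example_trace = [
    ("t0", ["a", "b"]),
    ("t1", ["t0", "c"]),
    ("t2", ["t1", "d"]),
    ("out", ["t2", "a"]),
]
print(rf_accesses(example_trace, live_out={"out"}))
```

In this toy chain every intermediate value is consumed by the next instruction only, so the model halves the main register file traffic. Real kernels will differ.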
10. LDPC Scaling Performance with SIMD Width
- SIMD loses effectiveness when lanes cannot be put to productive use
- SIMD on distributed data (SIMdD)
- Efficient data rearrangement is critical to the success of SIMD
11. Data Alignment Issues
Intra-Prediction
Traditional SIMD machines take too long or cost too much to do this. Good news: a small, fixed number of patterns per kernel.
- H.264 intra-prediction has 9 different prediction modes
- Each prediction mode requires a specific permutation
12. 4G/H.264 Summary
- Lots of different-sized parallelism
- From 4-wide to 96-wide to 1024-wide SIMD
- Which means many different SIMD widths need to be supported
- TLP (disjoint SIMD) often available
- Very short-lived values
- Lots of potential for instruction fusing (beyond pairwise)
- Limited set of shuffle patterns required for each kernel
13. AnySP: Push SIMD, But Increase the Inherent Flexibility and Efficiency
14. AnySP Architecture: High Level
16 Banked Memory with SRAM-based Crossbar
8 Groups of 8-Wide Flexible Function Units
Multiple Output Adder Tree
128x128 16-bit Swizzle Network
Temporary Buffer and Bypass Network
Datapath AGU and Scalar Pipeline
15. Multi-Width SIMD Support
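One way to picture multi-width support on a datapath built from 8 groups of 8 lanes is to hand out whole groups to independent threads. The sketch below is a guess at such a policy, not the AnySP allocation scheme, which the slides do not spell out. The function `partition` and its whole-group rule are assumptions. Only the 8x8 organization comes from the architecture slide.

```python
import math

LANES_PER_GROUP = 8  # from the architecture slide: 8-wide function unit groups
NUM_GROUPS = 8       # from the architecture slide: 8 groups

def partition(kernel_width: int):
    """Assumed policy: each thread gets whole groups, and as many identical
    threads as fit run side by side.
    Returns (groups_per_thread, concurrent_threads, lane_utilization)."""
    if kernel_width < 1:
        raise ValueError("kernel_width must be positive")
    groups_per_thread = min(NUM_GROUPS, math.ceil(kernel_width / LANES_PER_GROUP))
    threads = NUM_GROUPS // groups_per_thread
    # Lanes doing useful work in one issue across all threads.
    busy = threads * min(kernel_width, groups_per_thread * LANES_PER_GROUP)
    return groups_per_thread, threads, busy / (NUM_GROUPS * LANES_PER_GROUP)

for width in (4, 16, 1024):
    print(width, partition(width))
```

Under this assumed policy a narrow kernel fills the machine with several threads instead of leaving most lanes idle, provided enough thread-level parallelism is available.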
16. Using SIMD Lanes for Deeper Subgraphs
- Flexible Functional Unit allows us to
- Exploit pipeline parallelism by joining two lanes together
- Handle register bypass and the temporary buffer
- Join multiple pipelines to process deeper subgraphs
- Fuse instruction pairs
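Pair fusion can be sketched as a greedy pass over a straight-line trace. This is an illustrative model only: the trace format, the function `fuse_pairs`, and the fusion rule (fuse a producer with the next instruction when that instruction is its sole consumer) are assumptions, not the compiler or hardware algorithm used for AnySP. Each destination is assumed to be written once.

```python
def fuse_pairs(trace, live_out=frozenset()):
    """trace: list of (dest, op, [srcs]) in program order.
    Returns a list of tuples of original indices, one tuple per issued
    operation; a 2-tuple is a fused producer/consumer pair."""
    issued = []
    i = 0
    while i < len(trace):
        dest = trace[i][0]
        users = [j for j in range(i + 1, len(trace)) if dest in trace[j][2]]
        # Fuse only if the very next instruction is the sole consumer and the
        # intermediate value is not needed after the trace.
        if users == [i + 1] and dest not in live_out:
            issued.append((i, i + 1))
            i += 2
        else:
            issued.append((i,))
            i += 1
    return issued

# Hypothetical five-instruction subgraph.
example_ops = [
    ("t0", "mul", ["a", "b"]),
    ("t1", "add", ["t0", "c"]),
    ("t2", "sub", ["d", "e"]),
    ("t3", "shr", ["t2", "f"]),
    ("out", "add", ["t1", "t3"]),
]
print(fuse_pairs(example_ops, live_out={"out"}))
```

In this example five instructions collapse into three issued operations, and the fused intermediates never need a register file entry.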
17. SRAM-based Crossbar
- Multiple SRAM cells replace the MUX of a traditional crossbar
- Each cell stores configuration information
- The controller selects the specific configuration based on the instruction parameter
- Each cell can store up to 6 different configurations
- Power reduced by 50% for a 128x128 crossbar
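The behaviour described here (a few shuffle patterns stored in the crossbar and selected per instruction) can be modelled in software. This is a behavioural sketch only. The class `ConfigurableCrossbar` and its methods are hypothetical names, and nothing here reflects the circuit design. Only the limit of 6 stored configurations comes from the slide.

```python
class ConfigurableCrossbar:
    MAX_CONFIGS = 6  # from the slide: each cell stores up to 6 configurations

    def __init__(self, width: int):
        self.width = width
        self.configs = []  # stored routing patterns, indexed by slot number

    def store(self, pattern):
        """pattern[i] is the input index routed to output i.
        Returns the slot number an instruction would use to select it."""
        if len(self.configs) >= self.MAX_CONFIGS:
            raise ValueError("all configuration slots are in use")
        if len(pattern) != self.width or any(not 0 <= p < self.width for p in pattern):
            raise ValueError("pattern must list one valid input index per output")
        self.configs.append(tuple(pattern))
        return len(self.configs) - 1

    def route(self, slot: int, data):
        """Apply the stored pattern selected by `slot` to one input vector."""
        if len(data) != self.width:
            raise ValueError("data length must match crossbar width")
        return [data[src] for src in self.configs[slot]]

# Small 4-wide example; the patterns are arbitrary illustrations.
xbar = ConfigurableCrossbar(4)
reverse = xbar.store([3, 2, 1, 0])
broadcast0 = xbar.store([0, 0, 0, 0])
print(xbar.route(reverse, [10, 20, 30, 40]))
print(xbar.route(broadcast0, [10, 20, 30, 40]))
```

Because each kernel needs only a limited set of shuffle patterns, a small number of preloaded slots can cover a kernel without reconfiguring the network on every instruction.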
18. AnySP vs SIMD-based Architecture
- SIMD width doubled
- But that provides only half of the performance gain; the other half is due to the flexibility features
19. AnySP Energy-Delay vs SIMD-based Architecture
- Comparison based on 90nm synthesis results
- Flexibility increases utilization of datapath and
hence its efficiency
20. AnySP Power Breakdown
- We estimate that both H.264 and 4G wireless can
be done in under 1 Watt at 45nm
21. Conclusions
- Scaling traditional SIMD for mobile applications
- Wide-SIMD hardware under-utilized
- Large fraction of power spent on non-computation
- AnySP design
- Can possibly meet the requirements of 100 Mbps 4G and HD video on the same platform at 45nm
- Flexibility/efficiency improvements
- Increase SIMD utilization (FFUs, multiple short vectors)
- Reduce register file power (bypass buffer)
- More efficient data shuffling (SRAM-based crossbar)
22. Questions
- For more information
- http://cccp.eecs.umich.edu