Title: AnySP: Anytime Anywhere Anyway Signal Processing
1. AnySP: Anytime Anywhere Anyway Signal Processing
- Mark Woh¹, Sangwon Seo¹, Scott Mahlke¹, Trevor Mudge¹, Chaitali Chakrabarti², Krisztian Flautner³
- ¹University of Michigan, ACAL
- ²Arizona State University
- ³ARM, Ltd.
2. The Modern Mobile Phone
The Old Mobile Phone
- Future phones are becoming more complex
- Richer applications impose much greater requirements
- How do phones handle this now?
Video Recording
Video Editing
Higher Data Rates
3D Rendering
Advanced Image Processing
Photos from:
- http://www.engadget.com/2009/06/10/iphone-3g-s-supports-opengl-es-2-0-but-3g-only-supports-1-1/
- http://www.apple.com/iphone
3. Inside Today's Smart Phones
- Modern phones are looking like Frankenchips!
- Some cores unused and functionality duplicated
4. Cost for Multi-System Support
- Programmable Unified Architectures Provide
  - Lower Cost
  - Faster Time to Market
  - Support for Multiple Applications (Current and Future)
  - Bug Fixes After Manufacturing
- So where do we start?
- Supporting multiple systems is reserved for the most expensive phones
  - Cost is in supporting all the systems that may or may not be used at once
Data gathered from:
- Ramacher, U. 2007. Software-Defined Radio Prospects for Multistandard Mobile Phones. Computer 40, 10 (Oct. 2007).
- Finchelstein, D.F.; Sze, V.; Sinangil, M.E.; Koken, Y.; Chandrakasan, A.P., "A low-power 0.7-V H.264 720p video decoder," Solid-State Circuits Conference, 2008. A-SSCC '08.
5. Power/Performance Requirements for Multiple Systems
- Different applications have different power/performance characteristics!
- We need to design with each application in mind: not a general-purpose processor (GPP) but a domain-specific processor
6. The Applications
- Is there anything we can learn from the
applications themselves?
7. H.264 Basics
T.-A. Liu, T.-M. Lin, S.-Z. Wang, et al., "A low-power dual-mode video decoder for mobile applications," IEEE Communications Magazine, vol. 44, no. 8, pp. 119-126, Aug. 2006.
8. 4G Wireless Basics
- Three kernels make up the majority of the work
  - FFT: Extract data from signals
  - STBC: Combine data into a more reliable stream
  - LDPC: Error correction on the data stream
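To make the STBC step concrete, here is a minimal sketch of how a space-time block code combines two received samples into more reliable symbol estimates. The slides do not say which code the 4G setup uses; the two-antenna Alamouti scheme below is an assumption chosen for illustration, and the function names, the noise-free channel, and the channel gains are all hypothetical.

```python
def alamouti_encode(s1, s2):
    """Return two time slots; each slot holds (antenna 1, antenna 2) symbols."""
    return [(s1, s2), (-s2.conjugate(), s1.conjugate())]


def flat_channel(slots, h1, h2):
    """Noise-free flat-fading channel: one received sample per time slot."""
    return [a1 * h1 + a2 * h2 for a1, a2 in slots]


def alamouti_combine(r1, r2, h1, h2):
    """Combine two received samples into estimates of the two sent symbols."""
    gain = abs(h1) ** 2 + abs(h2) ** 2
    s1_hat = (h1.conjugate() * r1 + h2 * r2.conjugate()) / gain
    s2_hat = (h2.conjugate() * r1 - h1 * r2.conjugate()) / gain
    return s1_hat, s2_hat


# Hypothetical symbols and channel gains, for illustration only.
s1, s2 = complex(1, 1), complex(-1, 1)
h1, h2 = complex(0.8, -0.3), complex(0.2, 0.9)

r1, r2 = flat_channel(alamouti_encode(s1, s2), h1, h2)
s1_hat, s2_hat = alamouti_combine(r1, r2, h1, h2)
```

Each estimate needs only a handful of complex multiplies and adds on a few operands, which is consistent with a kernel whose natural vector width is small.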
9. Mobile Signal Processing Algorithm Characteristics
Algorithm                  SIMD Workload (%)  Scalar Workload (%)  Overhead Workload (%)  SIMD Width (Elements)  Amount of TLP
4G FFT                     75                 5                    20                     1024                   Low
4G STBC                    81                 5                    14                     4                      High
4G LDPC                    49                 18                   33                     96                     Low
H.264 Deblocking Filter    72                 13                   15                     8                      Medium
H.264 Intra-Prediction     85                 5                    10                     16                     Medium
H.264 Inverse Transform    80                 5                    15                     8                      High
H.264 Motion Compensation  75                 5                    10                     8                      High
- SIMD comes at a cost!
  - Register File Power
  - Data Movement/Alignment Cost
- SIMD architectures have to deal with this!
- Algorithms have different SIMD widths
  - From very large to very small
- Though SIMD width varies, all algorithms can exploit it
  - A large percentage of the work can be SIMDized
- Larger SIMD widths tend to have less TLP
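The effect of mismatched widths can be estimated with a small calculation. The sketch below takes the SIMD widths from the table above and computes what fraction of a fixed 32-wide datapath does useful work, assuming one kernel instance is mapped at a time and no threads are packed side by side. The function name and that mapping assumption are illustrative, not from the slides.

```python
import math

# SIMD widths (elements) copied from the characteristics table.
SIMD_WIDTHS = {
    "4G FFT": 1024,
    "4G STBC": 4,
    "4G LDPC": 96,
    "H.264 Deblocking Filter": 8,
    "H.264 Intra-Prediction": 16,
    "H.264 Inverse Transform": 8,
    "H.264 Motion Compensation": 8,
}


def lane_utilization(algorithm_width, machine_width=32):
    """Fraction of lanes doing useful work when one kernel instance
    of `algorithm_width` elements runs on a `machine_width` SIMD unit."""
    passes = math.ceil(algorithm_width / machine_width)
    return algorithm_width / (passes * machine_width)


for name, width in SIMD_WIDTHS.items():
    print(f"{name:27s} width {width:4d}  utilization {lane_utilization(width):.1%}")
```

Under these assumptions the wide kernels (FFT, LDPC) fill every lane, while the 4-, 8- and 16-wide kernels use only 12.5%, 25% and 50% of a 32-wide machine unless several threads share it.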
10. Traditional SIMD Power Breakdown
- The register file consumes a large share of the power in a traditional 32-wide SIMD architecture
11. Register File Access
Lots of power is wasted on unneeded register file accesses!
- Many of the register file accesses do not have to go back to the main register file
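One way to see how many accesses could avoid the main register file is to classify each source read by how recently its value was produced. The sketch below does this for a toy instruction trace. The trace, the three categories, and the window of 4 instructions are all assumptions made for illustration; they are not measurements from the talk.

```python
def classify_reads(trace, buffer_window=4):
    """Classify every source read in `trace`.

    `trace` is a list of (dest, [sources]) tuples in program order.
    A read is 'bypass' if the value was written by the previous
    instruction, 'temp_buffer' if written within `buffer_window`
    instructions, and 'main_rf' otherwise (including live-in values).
    """
    last_write = {}
    counts = {"bypass": 0, "temp_buffer": 0, "main_rf": 0}
    for index, (dest, sources) in enumerate(trace):
        for src in sources:
            producer = last_write.get(src)
            if producer is None:
                counts["main_rf"] += 1
            elif index - producer == 1:
                counts["bypass"] += 1
            elif index - producer <= buffer_window:
                counts["temp_buffer"] += 1
            else:
                counts["main_rf"] += 1
        last_write[dest] = index
    return counts


# Hypothetical four-instruction trace.
toy_trace = [
    ("v1", ["v0", "c0"]),
    ("v2", ["v1", "c1"]),
    ("v3", ["v2", "v1"]),
    ("v4", ["v3", "v0"]),
]
read_counts = classify_reads(toy_trace)
```

In this toy trace half of the eight reads consume a value produced one or two instructions earlier, so they would not need a main register file access.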
12. Instruction Pair Frequency
- As with the Multiply-Accumulate (MAC) instruction, there is opportunity to fuse other instructions
- A few instruction pairs (3-5) make up the majority of all instruction pairs!
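A pair-frequency profile of this kind can be gathered by counting producer/consumer opcode pairs in a dependence trace. The following is a minimal sketch with a hypothetical trace format of (opcode, dest, sources) tuples; the opcodes and register names are invented for illustration.

```python
from collections import Counter


def pair_frequency(trace):
    """Count (producer opcode, consumer opcode) pairs linked by a data dependence."""
    producer_of = {}  # register name -> opcode that last wrote it
    pairs = Counter()
    for opcode, dest, sources in trace:
        for src in sources:
            if src in producer_of:
                pairs[(producer_of[src], opcode)] += 1
        producer_of[dest] = opcode
    return pairs


# Hypothetical trace: two multiply-then-add steps followed by a shift.
toy_trace = [
    ("mul", "t0", ["a", "b"]),
    ("add", "acc", ["acc", "t0"]),
    ("mul", "t1", ["c", "d"]),
    ("add", "acc", ["acc", "t1"]),
    ("shr", "out", ["acc"]),
]
pairs = pair_frequency(toy_trace)
most_common_pair = pairs.most_common(1)[0]
```

Here the multiply-then-add pair dominates, which is the pattern the MAC instruction already fuses; the same counting applied to real kernels would show which other pairs are worth fusing.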
13. Data Alignment Problem!
Intra-Prediction
- Traditional SIMD machines take too long or cost too much to do this
- Good news: each kernel needs only a small, fixed number of patterns
- H.264 Intra-prediction has 9 different prediction modes
  - Each prediction mode requires a specific permutation
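A fixed permutation per prediction mode can be expressed as an index pattern applied to the neighbouring pixels. The sketch below shows this for two of the nine 4x4 intra modes, vertical and horizontal. The layout of the neighbour vector (four top pixels followed by four left pixels) and the names are assumptions for illustration; the remaining modes also involve filtering and are not shown.

```python
def swizzle(source, pattern):
    """Gather: output element i is source[pattern[i]]."""
    return [source[index] for index in pattern]


# Neighbour vector layout (assumed): [top0, top1, top2, top3, left0, left1, left2, left3]
# Output is the 4x4 predicted block in row-major order.
VERTICAL = [0, 1, 2, 3] * 4                                  # every row copies the top pixels
HORIZONTAL = [4 + row for row in range(4) for _ in range(4)]  # every row repeats its left pixel

neighbours = [10, 20, 30, 40, 1, 2, 3, 4]  # hypothetical pixel values
vertical_block = swizzle(neighbours, VERTICAL)
horizontal_block = swizzle(neighbours, HORIZONTAL)
```

Because each pattern is a constant table, a kernel only needs a handful of them preloaded, rather than a general-purpose shuffle computed at run time.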
14. Summary
- Conclusions about 4G and H.264
  - Lots of different sized parallelism
    - From 4-wide to 96-wide to 1024-wide SIMD
    - Which means many different SIMD widths need to be supported
  - Very short-lived values
  - Lots of potential for instruction fusing
  - Limited set of shuffle patterns required for each kernel
15. AnySP Design
16. Traditional SIMD Architectures
32-Wide SIMD with Simple Shuffle Network
17. AnySP Architecture: High Level
16 Banked Memory with SRAM-based Crossbar
8 Groups of 8-Wide Flexible Function Units
Multiple Output Adder Tree
128x128 16-bit Swizzle Network
Temporary Buffer and Bypass Network
Datapath AGU and Scalar Pipeline
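The slide only names the multiple output adder tree. One way to picture it, assuming it returns one partial sum per group of adjacent lanes and that group sizes are powers of two, is the sketch below. The function name and the behaviour are an illustrative guess, not the documented hardware.

```python
def multi_output_adder_tree(lanes, group_size):
    """Return one sum per group of `group_size` adjacent lanes.

    Models a binary adder tree that is tapped at the level where each
    node covers `group_size` inputs, so narrower groups yield more outputs.
    """
    if group_size < 1 or group_size & (group_size - 1):
        raise ValueError("group_size must be a power of two")
    if len(lanes) % group_size:
        raise ValueError("lane count must be a multiple of group_size")
    level = list(lanes)
    covered = 1  # number of input lanes each node at this level covers
    while covered < group_size:
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        covered *= 2
    return level


lanes = list(range(16))  # hypothetical lane values 0..15
sums_of_4 = multi_output_adder_tree(lanes, 4)
sum_of_16 = multi_output_adder_tree(lanes, 16)
```

With the same tree, a 4-wide kernel gets four independent sums per 16 lanes while a 16-wide kernel gets one, which is how a single reduction structure could serve several SIMD widths.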
18. Multi-Width Support
19. AnySP FFU Datapath
- The Flexible Functional Unit allows us to
  - Exploit pipeline parallelism by joining two lanes together
  - Handle register bypass and the temporary buffer
  - Join multiple pipelines to process deeper subgraphs
  - Fuse instruction pairs
20. AnySP Results
21. Simulation Environment
- Traditional SIMD architecture comparison
  - SODA at 90nm technology
- AnySP
  - Synthesized at 90nm TSMC
  - Power, timing, and area numbers were extracted
- Performance and power for each kernel were generated using synthesized data on an in-house simulator
  - 4G based on an NTT DoCoMo 4G test setup
  - H.264 4CIF at 30 fps
22. AnySP Speedup vs SIMD-based Architecture
- For all benchmarks, we perform more than 2x better than a SIMD-based architecture
23. AnySP Energy-Delay vs SIMD-based Architecture
- More importantly, energy efficiency is much better!
24. AnySP Power Breakdown
- We estimate that both H.264 and 4G wireless can
be done in under 1 Watt at 45nm
25. Conclusion and Future Work
- Conclusion
  - We have presented an example architecture that could meet the requirements of 100 Mbps 4G and HD video on the same platform
  - Under the power budget and meeting the performance at 45nm
- Future and Ongoing Work
  - Application-specific language
  - Larger class of algorithms for AnySP
  - Better utilization of resources for non-parallel kernels
  - Speed up sequential parts
26. The End