AnySP: Anytime Anywhere Anyway Signal Processing

Slides: 27
Provided by: milc


1
AnySP: Anytime Anywhere Anyway Signal Processing
  • Mark Woh (1), Sangwon Seo (1), Scott Mahlke (1), Trevor Mudge (1),
    Chaitali Chakrabarti (2), Krisztian Flautner (3)
  • (1) University of Michigan, ACAL
  • (2) Arizona State University
  • (3) ARM, Ltd.

2
The Modern Mobile Phone
The Old Mobile Phone
  • Future phones are becoming more complex
  • Richer applications impose far more demanding
    requirements
  • How do phones handle this now?

Video Recording
Video Editing
Higher Data Rates
3D Rendering
Advanced Image Processing
Photos from:
http://www.engadget.com/2009/06/10/iphone-3g-s-supports-opengl-es-2-0-but-3g-only-supports-1-1/
http://www.apple.com/iphone
3
Inside Today's Smart Phones
  • Modern phones are looking like Frankenchips!
  • Some cores unused and functionality duplicated

4
Cost for Multi-System Support
  • Programmable unified architectures provide:
      • Lower cost
      • Faster time to market
      • Support for multiple applications (current and
        future)
      • Bug fixes after manufacturing
  • So where do we start?
  • Supporting multiple systems is reserved for the
    most expensive phones
  • Cost is in supporting all the systems that may or
    may not be used at once

Data gathered from:
  • Ramacher, U. "Software-Defined Radio Prospects for
    Multistandard Mobile Phones." Computer 40, 10 (Oct. 2007).
  • Finchelstein, D.F., Sze, V., Sinangil, M.E., Koken, Y.,
    Chandrakasan, A.P. "A low-power 0.7-V H.264 720p video
    decoder." Solid-State Circuits Conference, 2008 (A-SSCC '08).
5
Power/Performance Requirements for Multiple Systems
Different applications have different
power/performance characteristics! We need to
design with each application in mind! (Not a
general-purpose processor (GPP) but a domain-specific
processor)
6
The Applications
  • Is there anything we can learn from the
    applications themselves?

7
H.264 Basics
T.-A. Liu, T.-M. Lin, S.-Z. Wang, et al. "A
low-power dual-mode video decoder for mobile
applications," IEEE Communications Magazine,
vol. 44, no. 8, pp. 119-126, Aug. 2006.
8
4G Wireless Basics
  • Three kernels make up the majority of the work:
      • FFT: extract data from signals
      • STBC: combine data into a more reliable stream
      • LDPC: error correction on the data stream

9
Mobile Signal Processing Algorithm Characteristics
Algorithm                  SIMD          Scalar        Overhead      SIMD Width  Amount
                           Workload (%)  Workload (%)  Workload (%)  (Elements)  of TLP
4G FFT                     75            5             20            1024        Low
4G STBC                    81            5             14            4           High
4G LDPC                    49            18            33            96          Low
H.264 Deblocking Filter    72            13            15            8           Medium
H.264 Intra-Prediction     85            5             10            16          Medium
H.264 Inverse Transform    80            5             15            8           High
H.264 Motion Compensation  75            5             10            8           High
  • SIMD comes at a cost!
      • Register file power
      • Data movement/alignment cost
  • SIMD architectures have to deal with this!
  • Algorithms have different SIMD widths
      • From very large to very small
  • Though SIMD width varies, all algorithms can
    exploit it
      • A large percentage of the work can be SIMDized
  • Algorithms with larger SIMD widths tend to have less TLP
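
To make the "large percentage of work can be SIMDized" point concrete, here is a rough Amdahl-style sketch over the table's numbers. It assumes that only the SIMD share of the work is accelerated, by a factor of min(algorithm width, hardware lanes), and that the scalar and overhead shares are not accelerated at all. The function name, the `lanes` parameter, and this cost model are illustrative assumptions, not part of the presentation.

```python
# Workload split per kernel, copied from the table above:
# (SIMD %, scalar %, overhead %, SIMD width in elements)
KERNELS = {
    "4G FFT": (75, 5, 20, 1024),
    "4G STBC": (81, 5, 14, 4),
    "4G LDPC": (49, 18, 33, 96),
    "H.264 Deblocking Filter": (72, 13, 15, 8),
    "H.264 Intra-Prediction": (85, 5, 10, 16),
    "H.264 Inverse Transform": (80, 5, 15, 8),
    # This row sums to 90 in the source table; it is used as given.
    "H.264 Motion Compensation": (75, 5, 10, 8),
}


def simd_speedup_bound(simd, scalar, overhead, width, lanes):
    """Upper bound on speedup over fully scalar execution.

    Assumption (not from the slides): only the SIMD share is sped up,
    by min(width, lanes); scalar and overhead shares stay serial.
    """
    total = simd + scalar + overhead
    factor = min(width, lanes)
    return total / (simd / factor + scalar + overhead)


if __name__ == "__main__":
    for name, (simd, scalar, overhead, width) in KERNELS.items():
        bound = simd_speedup_bound(simd, scalar, overhead, width, lanes=32)
        print(f"{name:26s} width={width:4d} bound={bound:.2f}x")
```

Under this simple model the scalar and overhead shares cap the achievable speedup regardless of how many lanes are available, which is one way to read the "SIMD comes at a cost" bullet.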

10
Traditional SIMD Power Breakdown
  • The register file consumes a large share of the power in a
    traditional 32-wide SIMD architecture

11
Register File Access
Lots of power is wasted on unneeded register file
accesses!
  • Many register file accesses do not have to
    go back to the main register file
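
One way to estimate how many accesses could avoid the main register file is to walk an instruction trace and count source operands whose producer ran only a few instructions earlier. The sketch below does that for a toy trace; the trace format, the `window` parameter, and the function name are hypothetical and are not taken from the AnySP toolchain.

```python
def count_bypassable_reads(trace, window=1):
    """Count source reads whose value was produced at most `window`
    instructions earlier, so a bypass path or small temporary buffer
    could supply it instead of the main register file.

    `trace` is a list of (dest_register, source_registers) tuples.
    Returns (bypassable_reads, total_reads).
    """
    last_write = {}  # register name -> index of its most recent producer
    total = 0
    bypassable = 0
    for index, (dest, sources) in enumerate(trace):
        for src in sources:
            total += 1
            if src in last_write and index - last_write[src] <= window:
                bypassable += 1
        last_write[dest] = index
    return bypassable, total


# Toy trace: each value is consumed by the very next instruction.
example_trace = [
    ("r1", ("r8", "r9")),
    ("r2", ("r1", "r10")),
    ("r3", ("r2", "r2")),
    ("r4", ("r8", "r3")),
]
bypassable, total = count_bypassable_reads(example_trace, window=1)
print(f"{bypassable} of {total} reads could skip the main register file")
```

Registers that are never written inside the trace (r8, r9, r10 here) are treated as living in the main register file.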

12
Instruction Pair Frequency
As with the Multiply-Accumulate (MAC) instruction,
there is an opportunity to fuse other instructions.
A few instruction pairs (3-5) make up the majority
of all instruction pairs!
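
A pair-frequency measurement of this kind can be sketched by counting (producer opcode, consumer opcode) pairs in a trace, where the consumer reads a value the producer wrote. The trace format and opcode names below are made up for illustration; the presentation does not describe how its statistics were gathered.

```python
from collections import Counter


def count_instruction_pairs(trace):
    """Count (producer_opcode, consumer_opcode) pairs linked by a
    register dependence.

    `trace` is a list of (opcode, dest_register, source_registers).
    """
    producer = {}  # register name -> opcode that last wrote it
    pairs = Counter()
    for opcode, dest, sources in trace:
        # set() so an instruction reading the same register twice
        # counts as a single pair.
        for src in set(sources):
            if src in producer:
                pairs[(producer[src], opcode)] += 1
        producer[dest] = opcode
    return pairs


example_trace = [
    ("mul", "r1", ("r8", "r9")),
    ("add", "r2", ("r1", "r2")),
    ("mul", "r3", ("r8", "r10")),
    ("add", "r4", ("r3", "r2")),
    ("shuf", "r5", ("r4",)),
    ("sub", "r6", ("r5", "r5")),
]
pairs = count_instruction_pairs(example_trace)
for pair, count in pairs.most_common():
    print(pair, count)
```

In this toy trace the multiply-then-add pair is the most frequent, which is the familiar MAC case; the most frequent remaining pairs would be the candidates for fusion.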
13
Data Alignment Problem!
Intra-Prediction
Traditional SIMD machines take too long or cost
too much to do this. Good news: there is a small,
fixed number of patterns per kernel
  • H.264 Intra-prediction has 9 different prediction
    modes
  • Each prediction mode requires a specific
    permutation
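
As an illustration of "each prediction mode requires a specific permutation", the sketch below expresses two of the H.264 4x4 intra-prediction modes (vertical and horizontal) as fixed index patterns applied to a vector of neighbouring pixels. The input layout (four top neighbours followed by four left neighbours) and the `swizzle` helper are assumptions made for this example; the remaining modes also need arithmetic and are not shown.

```python
# Assumed input layout: indices 0-3 hold the four pixels above the
# block, indices 4-7 hold the four pixels to its left.
TOP, LEFT = 0, 4

# Vertical mode: every row of the 4x4 block copies the top neighbours.
VERTICAL = [TOP + col for row in range(4) for col in range(4)]

# Horizontal mode: every pixel in a row copies that row's left neighbour.
HORIZONTAL = [LEFT + row for row in range(4) for col in range(4)]


def swizzle(vector, pattern):
    """Gather: output[i] = vector[pattern[i]]."""
    return [vector[i] for i in pattern]


neighbours = [10, 20, 30, 40, 1, 2, 3, 4]
vertical_block = swizzle(neighbours, VERTICAL)
horizontal_block = swizzle(neighbours, HORIZONTAL)

for row in range(4):
    print(vertical_block[4 * row: 4 * row + 4],
          horizontal_block[4 * row: 4 * row + 4])
```

Because each pattern is a constant table, a kernel only needs a handful of them stored ahead of time, which matches the "small, fixed number of patterns per kernel" observation.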

14
Summary
  • Conclusions about 4G and H.264
      • Parallelism comes in many different sizes
          • From 4-wide to 96-wide to 1024-wide SIMD
          • So many different SIMD widths need to be
            supported
      • Very short-lived values
      • Lots of potential for instruction fusion
      • A limited set of shuffle patterns is required for each
        kernel

15
AnySP Design
16
Traditional SIMD Architectures
32-Wide SIMD with Simple Shuffle Network
17
AnySP Architecture High Level
16 Banked Memory with SRAM-based Crossbar
8 Groups of 8-Wide Flexible Function Units
Multiple Output Adder Tree
128x128 16-bit Swizzle Network
Temporary Buffer and Bypass Network
Datapath AGU and Scalar Pipeline
18
Multi-Width Support
19
AnySP FFU Datapath
  • The Flexible Functional Unit (FFU) allows us to:
      • Exploit pipeline parallelism by joining two lanes
        together
      • Handle register bypass and the temporary buffer
      • Join multiple pipelines to process deeper
        subgraphs
      • Fuse instruction pairs

20
AnySP Results
21
Simulation Environment
  • Traditional SIMD architecture comparison
      • SODA at 90nm technology
  • AnySP
      • Synthesized at 90nm TSMC
      • Power, timing, and area numbers were extracted
  • Performance and power for each kernel were
    generated using the synthesized data on an in-house
    simulator
  • 4G: based on an NTT DoCoMo 4G test setup
  • H.264: 4CIF at 30 fps

22
AnySP Speedup vs SIMD-based Architecture
  • For all benchmarks we perform more than 2x better
    than a SIMD-based architecture

23
AnySP Energy-Delay vs SIMD-based Architecture
  • More importantly energy efficiency is much
    better!
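
For reference, the energy-delay product weights run time twice: energy is power times delay, and the product multiplies by delay again. The sketch below works through that arithmetic with made-up numbers; none of the values are measurements from the presentation.

```python
def energy_delay_product(power_watts, delay_seconds):
    """Energy-delay product in joule-seconds: (P * t) * t."""
    energy_joules = power_watts * delay_seconds
    return energy_joules * delay_seconds


def edp_improvement(base_power, base_delay, new_power, new_delay):
    """How many times lower the new design's energy-delay product is."""
    return (energy_delay_product(base_power, base_delay)
            / energy_delay_product(new_power, new_delay))


# Hypothetical case: same power, half the run time.
print(edp_improvement(base_power=1.0, base_delay=2.0,
                      new_power=1.0, new_delay=1.0))
```

With equal power, a 2x speedup alone gives a 4x better energy-delay product, so any power saving on top of the speedup compounds the gain.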

24
AnySP Power Breakdown
  • We estimate that both H.264 and 4G wireless can
    be done in under 1 Watt at 45nm

25
Conclusion & Future Work
  • Conclusion
      • We have presented an example architecture that
        could possibly meet the requirements of 100Mbps
        4G and HD video on the same platform
      • Within the power budget while meeting the
        performance requirements at 45nm
  • Future and ongoing work
      • Application-specific language
      • Larger class of algorithms for AnySP
      • Better utilization of resources for non-parallel
        kernels
      • Speed up sequential parts

26
The End