High Performance Mobile Computing Using Flexible Wide SIMD Processors

1
High Performance Mobile Computing Using Flexible
Wide SIMD Processors
  • Scott Mahlke
  • in collaboration with
  • Mark Woh, Sangwon Seo, Amir Hormati, Yoonseo
    Choi, Trevor Mudge, Chaitali Chakrabarti (ASU),
    Krisztian Flautner (ARM Ltd.)
  • Advanced Computer Architecture Laboratory
  • University of Michigan

2
The Modern Mobile Phone
The Old Mobile Phone
  • Future phones are becoming more complex
  • Richer applications require both more
    performance and more flexibility
  • Modern phones look like Franken-chips

Video Recording
Video Editing
Higher Data Rates
3D Rendering
Advanced Image Processing
3
Power/Performance Requirements for Multiple
Systems
Different applications have different
power/performance characteristics, so the design
must keep each application in mind: a
domain-specific processor, not a general-purpose
processor (GPP).
4
4G Wireless Basics
NTT DoCoMo 4G test setup
  • Three kernels make up the majority of the work
  • FFT: extract data from signals
  • STBC: combine data into a more reliable stream
  • LDPC: error correction on the data stream
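
To make the first kernel concrete, here is a minimal radix-2 FFT sketch in Python. It is illustrative only and is not the implementation from the talk; the function name `fft_radix2` is an assumption. The point is that every stage applies the same butterfly to each pair of elements, which is the kind of regular, data-parallel work that maps onto SIMD lanes.

```python
import cmath

def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT (length must be a power of two)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Butterfly: the same multiply, add, and subtract for every k,
        # so all n/2 iterations could run side by side in SIMD lanes.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

# An impulse transforms to a flat spectrum.
spectrum = fft_radix2([1, 0, 0, 0])
```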

5
High Definition Video (H.264) Basics
4CIF @ 30 fps
6
Mobile Signal Processing Algorithm Characteristics
  • Problems with traditional SIMD
  • High register file power
  • Large data movement/alignment cost
  • Inconsistent lane utilization
  • SIMD implies single thread
  • Algorithms have different SIMD widths
  • From very large to very small
  • Though SIMD width varies, all algorithms can
    exploit it
  • A large percentage of the work can be SIMDized
  • Larger SIMD widths tend to have less TLP

7
So, What's the Right Solution?
  • Alternatives
  • More processors, fewer lanes?
  • Configurable hardware that can be SIMD or MIMD?
  • Franken-chip?
  • SIMD is the answer! It provides high performance
    and power efficiency
  • Low control cost
  • More area-efficient scaling
  • Single thread context
  • Simpler memory system design: no cache coherence

8
A Closer Look at SIMD Power Breakdown
  • Register file power disproportionately high in a
    traditional SIMD architecture

9
Register File Accesses
Lots of power wasted on unneeded register file
accesses!
  • Many of the register file accesses do not have
    to go back to the main register file
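
A rough way to see this effect is to count how many operand reads could be served by a small bypass window instead of the main register file. The sketch below is a toy model, not the measurement method used in the talk: the instruction format, the `window` size, and the name `count_rf_reads` are all assumptions made for illustration.

```python
def count_rf_reads(instrs, window=2):
    """Split operand reads into bypassable reads and main register file reads.

    instrs: list of (dest, [sources]) tuples in program order (hypothetical format).
    window: how many instructions a result stays available for bypass (assumed).
    """
    last_write = {}            # register name -> index of the instruction that last wrote it
    bypassed = main_rf = 0
    for i, (dest, sources) in enumerate(instrs):
        for src in sources:
            producer = last_write.get(src)
            if producer is not None and i - producer <= window:
                bypassed += 1  # value is recent enough to be forwarded
            else:
                main_rf += 1   # value has to come from the main register file
        last_write[dest] = i
    return bypassed, main_rf

# Hypothetical chain with short-lived temporaries t0..t2.
program = [
    ("t0", ["a", "b"]),
    ("t1", ["t0", "c"]),
    ("t2", ["t1", "d"]),
    ("r0", ["t2", "t0"]),
]
bypassed, main_rf = count_rf_reads(program)
print(bypassed, main_rf)
```

In this made-up program, three of the eight operand reads are produced one instruction earlier and never need the main register file.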

10
LDPC Scaling Performance with SIMD Width
  • SIMD loses effectiveness when lanes cannot be
    put to productive use
  • SIMD on distributed data (SIMdD)
  • Efficient data rearrangement critical to success
    of SIMD
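
The utilization problem can be illustrated with a simple model: a vector of a given natural length is processed in passes over a fixed number of lanes, and any lanes left over in the final pass sit idle. This is only a sketch under that assumption (it ignores data rearrangement cost entirely), and `lane_utilization` is a hypothetical helper, not something from the talk. The 96-element vector is one of the widths listed in the summary slide.

```python
import math

def lane_utilization(vector_len, simd_width):
    """Fraction of lane-slots doing useful work when vector_len elements
    are processed on simd_width lanes, one full-width pass at a time."""
    passes = math.ceil(vector_len / simd_width)
    return vector_len / (passes * simd_width)

# A 96-element vector on progressively wider machines.
results = {width: lane_utilization(96, width) for width in (32, 64, 128)}
print(results)
```

Under this model the 32-wide machine keeps every lane busy, while the 64-wide and 128-wide machines each leave a quarter of their lane-slots idle.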

11
Data Alignment Issues
Intra-Prediction
Traditional SIMD machines take too long or cost
too much to do this. Good news: a small, fixed
number of patterns per kernel.
  • H.264 Intra-prediction has 9 different prediction
    modes
  • Each prediction mode requires a specific
    permutation
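
As an illustration of "one permutation per mode", the sketch below expresses two of the nine 4x4 intra-prediction modes, vertical and horizontal, as fixed gather patterns over the neighboring pixels. It is a simplified model: the remaining modes also involve averaging or filtering of neighbors, and the names `shuffle` and `PATTERNS` are assumptions for this example.

```python
def shuffle(src, pattern):
    """Gather: output lane i takes src[pattern[i]]."""
    return [src[j] for j in pattern]

# A 4x4 block in row-major order occupies 16 output lanes.
# Each pattern indexes into the 4 neighboring pixels used by that mode.
PATTERNS = {
    # every lane copies the pixel above its column
    "vertical": [col for row in range(4) for col in range(4)],
    # every lane copies the pixel to the left of its row
    "horizontal": [row for row in range(4) for col in range(4)],
}

top = [10, 20, 30, 40]   # pixels above the block (made-up values)
left = [1, 2, 3, 4]      # pixels to the left of the block (made-up values)
pred_v = shuffle(top, PATTERNS["vertical"])
pred_h = shuffle(left, PATTERNS["horizontal"])
```

Because the pattern table is small and fixed per kernel, it can be prepared once and selected by mode at run time.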

12
4G/H.264 Summary
  • Parallelism comes in many different sizes
  • From 4 wide to 96 wide to 1024 wide SIMD
  • Which means many different SIMD widths need to be
    supported
  • TLP (disjoint SIMD) often available
  • Very short-lived values
  • Lots of potential for instruction fusing (beyond
    pairwise)
  • Limited set of shuffle patterns required for each
    kernel

13
AnySP: Push SIMD, But Increase the Inherent
Flexibility and Efficiency
14
AnySP Architecture: High Level
16-Banked Memory with SRAM-based Crossbar
8 Groups of 8-Wide Flexible Function Units
Multiple Output Adder Tree
128x128 16-bit Swizzle Network
Temporary Buffer and Bypass Network
Datapath AGU and Scalar Pipeline
15
Multi-Width SIMD Support
16
Using SIMD Lanes for Deeper Subgraphs
  • Flexible Functional Unit allows us to
  • Exploit Pipeline-parallelism by joining two lanes
    together
  • Handle register bypass and the temporary buffer
  • Join multiple pipelines to process deeper
    subgraphs
  • Fuse Instruction Pairs
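
One way to picture pairwise fusion is to look for back-to-back instructions where the second is the only reader of the first one's result, so the intermediate value could pass directly between joined lanes. The sketch below is a hypothetical illustration, not the selection method used for AnySP: the instruction format and the name `fusable_pairs` are assumptions, and it assumes each register name is written only once.

```python
from collections import Counter

def fusable_pairs(instrs):
    """Return (i, i+1) index pairs where instruction i+1 is the only reader
    of instruction i's result.

    instrs: list of (op, dest, [sources]) tuples in program order
    (hypothetical format; each dest is assumed to be written once).
    """
    reads = Counter(src for _, _, sources in instrs for src in sources)
    pairs = []
    i = 0
    while i < len(instrs) - 1:
        dest = instrs[i][1]
        next_sources = instrs[i + 1][2]
        if dest in next_sources and reads[dest] == 1:
            pairs.append((i, i + 1))
            i += 2   # both instructions are consumed by the fused pair
        else:
            i += 1
    return pairs

# Made-up kernel fragment.
kernel = [
    ("mul", "t0", ["a", "b"]),
    ("add", "t1", ["t0", "c"]),
    ("sub", "t2", ["t1", "d"]),
    ("shr", "r0", ["t2", "one"]),
]
pairs = fusable_pairs(kernel)
print(pairs)
```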

17
SRAM-based Crossbar
Multiple SRAM cells replace the MUX of a traditional
crossbar. Each cell stores configuration
information. The controller selects the specific
configuration based on the instruction
parameter. Each cell can store up to 6 different
configurations. Power is reduced by 50% for a
128x128 crossbar.
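
The behavior described above can be sketched as a small software model: routing patterns are stored ahead of time in a limited number of slots, and each instruction only names the slot to use. This is a toy model under stated assumptions, not the hardware design: it stores one input index per output rather than per-crosspoint bits, and the class name `SramCrossbar` and its methods are hypothetical.

```python
class SramCrossbar:
    """Toy model: each output remembers, per configuration slot, which input feeds it."""

    def __init__(self, size, slots=6):
        self.size = size
        # configs[slot][out] = index of the input routed to output `out` (None if unset)
        self.configs = [[None] * size for _ in range(slots)]

    def store(self, slot, pattern):
        """Program one configuration slot with a full routing pattern."""
        if len(pattern) != self.size or any(not 0 <= p < self.size for p in pattern):
            raise ValueError("pattern must map every output to a valid input")
        self.configs[slot] = list(pattern)

    def route(self, slot, inputs):
        """Select a stored configuration, as the controller would, and apply it."""
        pattern = self.configs[slot]
        if None in pattern:
            raise ValueError("slot has not been programmed")
        return [inputs[p] for p in pattern]

xbar = SramCrossbar(size=4)
xbar.store(0, [3, 2, 1, 0])   # reverse the lanes
xbar.store(1, [0, 0, 2, 2])   # broadcast within pairs
reversed_lanes = xbar.route(0, ["a", "b", "c", "d"])
paired_lanes = xbar.route(1, ["a", "b", "c", "d"])
```

Since each kernel needs only a limited set of shuffle patterns, a handful of slots can cover a kernel without reprogramming the network between instructions.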
18
AnySP vs SIMD-based Architecture
  • SIMD width doubled
  • But that only provides half the performance gain;
    the other half is due to flexibility features

19
AnySP Energy-Delay vs SIMD-based Architecture
  • Comparison based on 90nm synthesis results
  • Flexibility increases utilization of datapath and
    hence its efficiency

20
AnySP Power Breakdown
  • We estimate that both H.264 and 4G wireless can
    be done in under 1 Watt at 45nm

21
Conclusions
  • Scaling traditional SIMD for mobile applications
  • Wide-SIMD hardware under-utilized
  • Large fraction of power on non-computation
  • AnySP design
  • Can possibly meet the requirements of 100Mbps 4G
    and HD video on the same platform at 45nm
  • Flexibility/Efficiency improvements
  • Increase SIMD utilization (FFUs, multiple short
    vectors)
  • Reduce register file power (bypass buffer)
  • More efficient data shuffling (SRAM-based
    crossbar)

22
Questions
  • For more information
  • http://cccp.eecs.umich.edu