High Performance Mobile Computing Using Flexible Wide SIMD Processors

Slides: 23

Provided by: milc

Learn more at: https://cccp.eecs.umich.edu

Transcript and Presenter's Notes


1
High Performance Mobile Computing Using Flexible
Wide SIMD Processors

Scott Mahlke
in collaboration with
Mark Woh, Sangwon Seo, Amir Hormati, Yoonseo
Choi, Trevor Mudge, Chaitali Chakrabarti (ASU),
Krisztian Flautner (ARM Ltd.)
Advanced Computer Architecture Laboratory
University of Michigan

2
The Modern Mobile Phone
The Old Mobile Phone

Future phones are becoming more complex
Richer applications require both more
performance and more flexibility
Modern phones look like Franken-chips

Video Recording
Video Editing
Higher Data Rates
3D Rendering
Advanced Image Processing
3
Power/Performance Requirements for Multiple
Systems
Different applications have different
power/performance characteristics, so the design
must keep each application in mind: a
domain-specific processor rather than a
general-purpose processor (GPP)
4
4G Wireless Basics
NTT DoCoMo 4G test setup

Three kernels make up the majority of the work
FFT: Extract Data from Signals
STBC: Combine Data into More Reliable Stream
LDPC: Error Correction on Data Stream

5
High Definition Video (H.264) Basics
4CIF @ 30fps
6
Mobile Signal Processing Algorithm Characteristics

Problems with traditional SIMD
High register file power
Large data movement/alignment cost
Inconsistent lane utilization
SIMD implies single thread

Algorithms have different SIMD widths
From very large to very small
Though SIMD width varies, all algorithms can
exploit it
A large percentage of the work can be SIMDized
Larger SIMD widths tend to have less thread-level
parallelism (TLP)

7
So, What's the Right Solution?

Alternatives
More processors, fewer lanes?
Configurable hardware that can be SIMD or MIMD?
Franken chip?
SIMD is the answer! It provides high performance
and power efficiency
Low control cost
More area-efficient scaling
Single thread context
Simpler memory system design: no cache coherence

8
A Closer Look at SIMD Power Breakdown

9
Register File Accesses
Lots of power is wasted on unneeded register file
accesses!

Many register file accesses do not have to
go back to the main register file
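
The point about avoidable register file accesses can be illustrated with a toy instruction trace. This is only a sketch: the trace format, the register names, and the `window` parameter (how many instructions a result stays available on a bypass path) are assumptions made for illustration, not details taken from the slides.

```python
def classify_reads(trace, window=1):
    """Split source reads into bypass-served and main-register-file reads.

    trace:  list of (dest, [sources]) tuples in program order (hypothetical format).
    window: number of instructions a result stays reachable without
            going back to the main register file (assumed parameter).
    """
    last_write = {}  # register name -> index of the instruction that wrote it
    bypass = 0
    main_rf = 0
    for i, (dest, srcs) in enumerate(trace):
        for s in srcs:
            if s in last_write and i - last_write[s] <= window:
                bypass += 1   # value was produced very recently
            else:
                main_rf += 1  # value must come from the main register file
        last_write[dest] = i
    return bypass, main_rf


# Hypothetical four-instruction trace with short-lived temporaries t0..t2.
trace = [
    ("t0", ["a", "b"]),
    ("t1", ["t0", "c"]),
    ("t2", ["t1", "t0"]),
    ("out", ["t2", "a"]),
]
bypass_reads, rf_reads = classify_reads(trace)
```

In this made-up trace, 3 of the 8 source reads are consumed by the very next instruction, so they never need to touch the main register file. Widening the window to 2 raises that to 4 of 8.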

10
LDPC Scaling Performance with SIMD Width

SIMD loses effectiveness when lanes cannot be
put to productive use
SIMD on distributed data (SIMdD)
Efficient data rearrangement critical to success
of SIMD
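
A rough way to see why lanes go unused is to compute the fraction of lane slots doing useful work when a vector of a given length runs on a fixed-width machine. The sketch below assumes a simple model in which a vector is processed in whole passes of `simd_width` lanes. The function name and the model are illustrative assumptions, not measurements from the talk.

```python
def lane_utilization(vector_len, simd_width):
    """Fraction of lane slots doing useful work (simple whole-pass model)."""
    passes = -(-vector_len // simd_width)  # ceiling division
    return vector_len / (passes * simd_width)


# Illustrative vector lengths on an assumed 64-lane machine.
for vector_len in (4, 96, 1024):
    print(vector_len, lane_utilization(vector_len, 64))
```

Under this model a 4-element vector keeps only 1/16 of a 64-lane datapath busy, and a 96-element vector needs two passes with the second pass half empty.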

11
Data Alignment Issues
Intra-Prediction
Traditional SIMD machines take too long or cost
too much to do this
Good news: small, fixed number of patterns per
kernel

H.264 Intra-prediction has 9 different prediction
modes
Each prediction mode requires a specific
permutation
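
As an illustration of a fixed permutation per prediction mode, the sketch below expresses two of the 4x4 intra-prediction modes (vertical and horizontal) as index patterns applied to a neighbor vector. The neighbor layout `[top0..top3, left0..left3]` and the helper name `permute` are assumptions for this example. The other modes also involve arithmetic on the neighbors and are not shown.

```python
# Neighbor vector layout (assumed for this sketch): [top0..top3, left0..left3]

# Vertical mode: every row of the 4x4 block copies the four top neighbors.
VERTICAL = [c for r in range(4) for c in range(4)]

# Horizontal mode: every pixel in row r copies left neighbor r.
HORIZONTAL = [4 + r for r in range(4) for c in range(4)]


def permute(neighbors, pattern):
    """Gather neighbors into a 16-element block (row-major) using a fixed pattern."""
    return [neighbors[i] for i in pattern]


neighbors = [10, 20, 30, 40, 1, 2, 3, 4]  # made-up pixel values
block_v = permute(neighbors, VERTICAL)
block_h = permute(neighbors, HORIZONTAL)
```

Because each mode maps to one pattern known ahead of time, the patterns can be precomputed once per kernel rather than built from generic shuffle instructions at run time.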

12
4G/H.264 Summary

Parallelism comes in many different sizes
From 4-wide to 96-wide to 1024-wide SIMD
So many different SIMD widths need to be
supported
TLP (disjoint SIMD) often available
Very short-lived values
Lots of potential for instruction fusing (beyond
pairwise)
Limited set of shuffle patterns required for each
kernel

13
AnySP: Push SIMD, But Increase the Inherent
Flexibility and Efficiency
14
AnySP Architecture: High Level
16 Banked Memory with SRAM-based Crossbar
8 Groups of 8-Wide Flexible Function Units
Multiple Output Adder Tree
128x128 16bit Swizzle Network
Temporary Buffer and Bypass Network
Datapath AGU and Scalar Pipeline
15
Multi-Width SIMD Support
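
Given the 8 groups of 8-wide function units listed on the architecture slide, a back-of-the-envelope sketch shows how many independent vectors of a given width could occupy the datapath at once. It assumes each vector occupies whole groups and that all vectors issued together share one width. Both are simplifying assumptions for illustration, and `concurrent_vectors` is a hypothetical helper.

```python
GROUPS = 8           # from the architecture slide: 8 groups
LANES_PER_GROUP = 8  # each group is 8 lanes wide


def concurrent_vectors(width):
    """Number of same-width vectors that fit in the datapath at once.

    Assumes a vector occupies whole groups (an illustrative assumption).
    Returns 0 when one vector is wider than the whole datapath.
    """
    groups_per_vector = -(-width // LANES_PER_GROUP)  # ceiling division
    if groups_per_vector > GROUPS:
        return 0
    return GROUPS // groups_per_vector


for width in (8, 16, 32, 64):
    print(width, concurrent_vectors(width))
```

Under these assumptions the same 64 lanes can serve eight 8-wide, four 16-wide, two 32-wide, or one 64-wide operation.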
16
Using SIMD Lanes for Deeper Subgraphs

Flexible Functional Unit allows us to
Exploit Pipeline-parallelism by joining two lanes
together
Handle register bypass and the temporary buffer
Join multiple pipelines to process deeper
subgraphs
Fuse Instruction Pairs
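
Instruction-pair fusion can be sketched as a small pass over a straight-line instruction list. This is a toy model and not the AnySP compiler: the `(op, dest, sources)` format, the single-assignment register names, and the rule "fuse when the intermediate value is read only by the next instruction" are all assumptions chosen to keep the example short.

```python
from collections import Counter


def fuse_pairs(instrs):
    """Fuse adjacent producer/consumer pairs whose intermediate has one use.

    instrs: list of (op, dest, [sources]); each dest is written once (assumed).
    """
    uses = Counter(s for _, _, srcs in instrs for s in srcs)
    fused = []
    i = 0
    while i < len(instrs):
        op, dest, srcs = instrs[i]
        if i + 1 < len(instrs):
            op2, dest2, srcs2 = instrs[i + 1]
            if dest in srcs2 and uses[dest] == 1:
                # The intermediate never needs to be written back.
                merged = list(srcs) + [s for s in srcs2 if s != dest]
                fused.append((op + "+" + op2, dest2, merged))
                i += 2
                continue
        fused.append((op, dest, list(srcs)))
        i += 1
    return fused


program = [
    ("mul", "t0", ["a", "b"]),
    ("add", "t1", ["t0", "c"]),   # only reader of t0: fusable with the mul
    ("sub", "t2", ["a", "c"]),
    ("add", "t3", ["t2", "t1"]),  # t2 is read again below, so no fusion
    ("max", "t4", ["t2", "d"]),
]
result = fuse_pairs(program)
```

Here the `mul`/`add` pair collapses into one fused operation, while `sub` stays separate because its result has two readers.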

17
SRAM-based Crossbar
Multiple SRAM cells replace the MUX of a
traditional crossbar
Each cell stores configuration information
The controller selects the specific configuration
based on the instruction parameter
Each cell can store up to 6 different
configurations
Power reduced by 50% for 128x128 crossbar
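
The idea of a crossbar that holds a few stored configurations and selects one per instruction can be modeled in software as follows. This is a behavioral sketch only: the class name, the method names, and the 8-wide example are assumptions. Only the limit of 6 stored configurations comes from the slide.

```python
class SwizzleNetwork:
    """Behavioral model of a crossbar with a small set of stored patterns."""

    MAX_CONFIGS = 6  # per the slide: up to 6 configurations are stored

    def __init__(self, width):
        self.width = width
        self.configs = []

    def store(self, pattern):
        """Store a pattern where pattern[out] is the input lane feeding out."""
        if len(self.configs) >= self.MAX_CONFIGS:
            raise ValueError("no free configuration slot")
        if len(pattern) != self.width or any(
            not 0 <= src < self.width for src in pattern
        ):
            raise ValueError("pattern does not match crossbar width")
        self.configs.append(list(pattern))
        return len(self.configs) - 1  # id an instruction would carry

    def route(self, config_id, data):
        """Apply the stored configuration selected by config_id."""
        pattern = self.configs[config_id]
        return [data[src] for src in pattern]


# Hypothetical 8-wide example with two stored patterns.
net = SwizzleNetwork(8)
reverse_id = net.store([7, 6, 5, 4, 3, 2, 1, 0])
rotate_id = net.store([1, 2, 3, 4, 5, 6, 7, 0])
reversed_data = net.route(reverse_id, list(range(8)))
rotated_data = net.route(rotate_id, list(range(8)))
```

Since each kernel needs only a limited set of shuffle patterns, a handful of stored configurations selected by a small id can stand in for fully general per-cycle routing control.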
18
AnySP vs SIMD-based Architecture

SIMD width doubled
But that only provides half the performance gain,
other half due to flexibility features

19
AnySP Energy-Delay vs SIMD-based Architecture

Comparison based on 90nm synthesis results
Flexibility increases utilization of datapath and
hence its efficiency

20
AnySP Power Breakdown

We estimate that both H.264 and 4G wireless can
be done in under 1 Watt at 45nm

21
Conclusions

Scaling traditional SIMD for mobile applications
Wide-SIMD hardware under-utilized
Large fraction of power on non-computation
AnySP design
Can possibly meet the requirements of 100Mbps 4G
and HD video on the same platform @ 45nm
Flexibility/Efficiency improvements
Increase SIMD utilization (FFUs, multiple short
vectors)
Reduce register file power (bypass buffer)
More efficient data shuffling (SRAM-based
crossbar)