Title: Adaptive System on a Chip (aSoC) for Low-Power Signal Processing
1Adaptive System on a Chip (aSoC) for Low-Power
Signal Processing
- Andrew Laffely, Jian Liang, Prashant Jain, Ning Weng, Wayne Burleson, Russell Tessier
- Department of Electrical and Computer Engineering
- University of Massachusetts, Amherst
- {alaffely, jliang, pjain, nweng, burleson, tessier}_at_ecs.umass.edu
This material is based upon work supported by the
National Science Foundation under Grant No.
9988238. Any opinions, findings, and conclusions
or recommendations expressed in this material are
those of the author(s) and do not necessarily
reflect the views of the National Science
Foundation.
2Overview
- Motivation
- Video Processing
- Architecture
- Dynamic Power Management
- Core, Interconnect, and Clock
3Problem
- Wireless video processing requires
- High throughput
- Low power
- Flexibility
4System on a Chip Solutions
- Take advantage of parallelism
- Possible improved performance
- Allow use and reuse of existing integrated components
- If
- The application can be partitioned
- The appropriate architecture is used
5Proposed Architecture: aSoC
- High throughput
- Heterogeneous processor elements
- Use the right tool for the job
- Fast and predictable interconnect
- Flexible
- Runtime reconfiguration of cores and interconnect
- Power consumption
- Implement power saving features in both cores and interconnect
- Use reconfiguration to dynamically control power consumption
6aSoC adaptive System on a Chip
7aSoC adaptive System on a Chip
- Tiled SoC architecture
- Supports the use of independently developed heterogeneous cores
- Pick and place cores which best perform the given application
- Increase performance
- Save power
- Cores may be any number of tiles in size
8aSoC adaptive System on a Chip
- Tiled SoC architecture
- Supports the use of independently developed heterogeneous cores
- Connected with an interconnect mesh
- Restricted to near neighbor communications
- Creates pipeline
- Decreases cycle time
9aSoC adaptive System on a Chip
- Tiled SoC architecture
- Supports the use of independently developed heterogeneous cores
- Connected with a fixed interconnect mesh
- Using a communication interface (CI) to manage data
- Network port (Coreport) for each core
- Each CI uses a memory and FSM to repeatedly process a predefined schedule of communications
- Crossbar
10Stream Control
- Instruction memory
- Holds the predetermined schedule of communications
- PC
- Selects and synchronizes the communications
- Decoder
- Sets crossbar
- Controller
- Sets PC
- Interprets incoming configuration commands
- Crossbar
- Any input to any set of outputs
[Figure: communication interface block diagram. Crossbar inputs and outputs: Core, North, South, East, West. Control blocks: Local Config., Decoder/Controller, PC, Instruction Memory]
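The roles of the instruction memory, PC, decoder, and crossbar listed above can be pictured with a small software model. This is a minimal Python sketch, not the aSoC hardware: the class name `CommunicationInterface`, the port names, and the dictionary encoding of an instruction are assumptions made for illustration.

```python
PORTS = ("core", "north", "south", "east", "west")


class CommunicationInterface:
    """Toy model of one tile's CI: instruction memory, PC, decoder, crossbar."""

    def __init__(self, schedule):
        # Instruction memory: one crossbar setting per schedule cycle.
        # Each instruction maps an input port to a tuple of output ports
        # (hypothetical encoding, chosen for readability).
        for instr in schedule:
            for src, dsts in instr.items():
                assert src in PORTS and all(d in PORTS for d in dsts)
        self.instruction_memory = list(schedule)
        self.pc = 0

    def step(self, inputs):
        """Run one cycle: decode the instruction, route data, advance the PC."""
        instr = self.instruction_memory[self.pc]  # decoder reads the entry at PC
        outputs = {}
        for src, dsts in instr.items():  # crossbar: any input to any set of outputs
            if src in inputs:
                for dst in dsts:
                    outputs[dst] = inputs[src]
        # The schedule repeats, so the PC wraps around at the end.
        self.pc = (self.pc + 1) % len(self.instruction_memory)
        return outputs


ci = CommunicationInterface([
    {"core": ("east",)},          # cycle 0: local core sends east
    {"west": ("east", "core")},   # cycle 1: west input fans out to two outputs
])
first = ci.step({"core": "a0"})
second = ci.step({"west": "b0"})
third = ci.step({"core": "a1"})   # the schedule has wrapped around
```

The second instruction shows the "any set of outputs" property: one input word is copied to both the east port and the local core in the same cycle.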
11Example Communication
- A given application requires periodic communications from Core A to Core C
- aSoC uses a prescheduled communication STREAM
- Core A places the data in a dedicated STREAM between the two tiles
- Core C pulls the data from that STREAM
- The tile to tile communication uses 3 cycles
12Example Stream
- 1: Core to East
13Example Stream
- 2: West to East
14Example Stream
- 3: West to Core
15Example Stream
- 1: Core to East
- 2: West to East
- 3: West to Core
- Loop Back
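The three-step stream above (Core to East, West to East, West to Core, then loop back) can be traced in a few lines. This is a rough sketch under stated assumptions: the tile names A, B, C in a west-to-east row, one hop per schedule step, and the function name `run_stream` are all illustrative and not taken from the aSoC design.

```python
# Three tiles in a row: A (west), B (middle), C (east).
# Each step names the tile that acts and the route its crossbar makes.
SCHEDULE = [
    ("A", "core", "east"),  # step 1: Core to East
    ("B", "west", "east"),  # step 2: West to East
    ("C", "west", "core"),  # step 3: West to Core, then loop back
]
EAST_NEIGHBOR = {"A": "B", "B": "C"}


def run_stream(words):
    """Send each word from core A to core C, one schedule pass per word."""
    received, steps = [], 0
    for word in words:
        # Data waiting on a tile's input port, keyed by (tile, port).
        latched = {("A", "core"): word}
        for tile, src, dst in SCHEDULE:
            value = latched.pop((tile, src))
            if dst == "east":
                # Hand the word to the west input of the tile to the east.
                latched[(EAST_NEIGHBOR[tile], "west")] = value
            else:
                # dst == "core": deliver to the local core.
                received.append(value)
            steps += 1
    return received, steps


received, steps = run_stream(["w0", "w1"])
```

The sketch runs one word at a time for clarity and does not model overlapping transfers; in this model each word takes three schedule steps to travel from Core A to Core C.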
16Static Scheduled Communications
- Creates system scalability by eliminating network congestion
- Many interconnect segments managed with time division multiplexing
- Lots of bandwidth
- Improves SoC performance by up to a factor of 8
17Power Consumption?
- Provide reconfiguration methods for cores and CI
- Develop programmable clocking systems at each tile
18Power Aware Core
- Custom motion estimation core
- Choose search method
- Full search
- 600-960 mW (depending on bit width and pel sub-sampling)
- Spiral search
- 76 mW
- Three step search
- 25 mW
- Data taken with Synopsys Power Compiler at the RTL level
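One way to read "choose search method" is as a selection against a power budget. The sketch below uses the power figures quoted on this slide; the preference order (most exhaustive search first), the budget values in the usage lines, and the function name `choose_search` are assumptions for illustration, not part of the presented core.

```python
# Power figures quoted on the slide (mW). Full search spans a range that
# depends on bit width and pel sub-sampling, so both ends are kept.
SEARCH_POWER_MW = {
    "full": (600, 960),
    "spiral": (76, 76),
    "three_step": (25, 25),
}
# Assumed preference order, most exhaustive search first (not from the source).
PREFERENCE = ("full", "spiral", "three_step")


def choose_search(budget_mw):
    """Return the first preferred method whose worst-case power fits the budget."""
    for method in PREFERENCE:
        _, worst_case = SEARCH_POWER_MW[method]
        if worst_case <= budget_mw:
            return method
    return None  # nothing fits; the caller has to decide what to do


plenty = choose_search(1000)
tight = choose_search(100)
minimal = choose_search(30)
```

Comparing against the worst-case figure is a conservative choice; a configuration with reduced bit width or sub-sampling could fit a full search into a smaller budget.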
19aSoC Support
- Multiple streams in and out through dedicated coreports
- Easy to manage on both sides of the port
- Schedule configuration streams in with the data
- Stream A: Input Frame
- Stream B: Configuration (choose search mode and size)
- Stream C: Motion Vectors
[Figure: Motion Estimation Core with coreports in1, in2, out1, out2 carrying Streams A, B, and C]
20Reconfigurable Interconnect
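[Figure: two block diagrams. One contains Input Frame, ME, MC, a difference node (Σ, -), and DCT; the other contains only Input Frame and DCT]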
21aSoC Support
Motion Estimation Compensation
DCT
- Lumped ME, MC and Summation into one double core
22aSoC Support P-Frame
Motion Estimation Compensation
DCT
Input Frame (Stream A)
Difference Frame (Stream B)
23aSoC Support Schedule Change
Motion Estimation Compensation
DCT
Input Frame (Stream A)
Difference Frame (Stream B)
Configuration Streams (C, D)
24aSoC Support Schedule Change
Motion Estimation Compensation
DCT
Input Frame (Stream A)
Difference Frame (Stream B)
Schedule 1
PC
Schedule 2
Configuration (Stream C)
26aSoC Support Schedule Change
Motion Estimation Compensation
DCT
Input Frame (Stream A)
Schedule 1
PC
Schedule 2
Configuration (Stream D)
28aSoC Support I-Frame
OFF
Motion Estimation Compensation
DCT
Input Frame (Stream A)
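The schedule change shown on the preceding slides, where a configuration stream moves the PC from Schedule 1 (P-frame) to Schedule 2 (I-frame), can be sketched as follows. This is an illustrative model only: the class `TileController`, its method names, and the string form of each instruction are assumptions, and the real controller's command format is not given in the slides.

```python
class TileController:
    """Keeps several schedules in one instruction memory; the PC loops in one."""

    def __init__(self, schedules):
        self.memory, self.bounds = [], {}
        for name, instrs in schedules.items():
            start = len(self.memory)
            self.memory.extend(instrs)
            self.bounds[name] = (start, len(self.memory))  # [start, end) addresses
        self.active = next(iter(schedules))  # begin in the first schedule given
        self.pc = self.bounds[self.active][0]

    def configure(self, name):
        """Configuration command: point the PC at another stored schedule."""
        self.active = name
        self.pc = self.bounds[name][0]

    def step(self):
        """Return the current instruction and advance, looping within the schedule."""
        instr = self.memory[self.pc]
        start, end = self.bounds[self.active]
        self.pc = start if self.pc + 1 == end else self.pc + 1
        return instr


ctrl = TileController({
    "schedule_1": ["A: frame -> ME/MC", "B: difference -> DCT"],  # P-frame
    "schedule_2": ["A: frame -> DCT"],                            # I-frame, ME/MC off
})
p_frame = [ctrl.step() for _ in range(3)]
ctrl.configure("schedule_2")
i_frame = [ctrl.step() for _ in range(2)]
```

Because both schedules are already resident in the instruction memory, switching modes in this model is just a PC update, which matches the slides' picture of a PC pointing at Schedule 1 or Schedule 2.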
29Operating Frequency?
- Interconnect synchronized
- H-tree clock distribution
- Core frequencies depend on critical path
- Tile provides clock reference
- Coreport provides asynchronous boundary
- Dynamic core configuration requires dynamic clock configuration
- aSoC clock reference provides multiples of interconnect clock (4x, 2x, 1x, 0.5x, 0.25x, ...)
- Configured through the tile controller
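Choosing a core clock from the available multiples can be illustrated with a short helper. This is a sketch under assumptions: only the five multiples named on the slide are modeled, the frequencies in the usage lines are made-up numbers, and `pick_multiple` is a hypothetical name.

```python
CLOCK_MULTIPLES = (4.0, 2.0, 1.0, 0.5, 0.25)  # multiples named on the slide


def pick_multiple(required_mhz, interconnect_mhz):
    """Return the slowest listed multiple that still meets the required core rate."""
    feasible = [m for m in CLOCK_MULTIPLES if m * interconnect_mhz >= required_mhz]
    return min(feasible) if feasible else None


# Hypothetical numbers: a 100 MHz interconnect clock and three core demands.
slow_core = pick_multiple(30, 100)
matched_core = pick_multiple(90, 100)
too_fast = pick_multiple(500, 100)
```

Picking the slowest feasible multiple reflects the power motivation of the talk: a core that does not need the worst-case rate is not clocked at it.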
30Mixed vs. Fixed Core Frequencies
- Cores not designed with clock gating
- Core power from Synopsys RTL simulation
- Interconnect from SPICE
- Assumes 10 cycle schedule, 4 pixels/word
31Current Density and Clocking
- Red: fixed worst-case clocking
- Short spikes of high current
- Green: optimal independent clocking
- Slow and low
- Optimal clocking eliminates current spikes
(improved battery life)
[Figure: current versus time, from process start to deadline, for ME Full Search, ME Spiral, ME Three Step Search, and DCT]
32Configuration Overhead
- Configuration adds up to 2 streams per tile
- Only 2 required for data
- Total BW = 5 x T x N
- 5 streams/(cycle,tile)
- T tiles
- N cycles in schedule
- A single tile can support up to 50 different streams in a 10 cycle schedule
[Figure: DCT core with Input Frame (Stream B), Transform Frame (Stream D), and configuration streams]
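The bandwidth figures on this slide follow from simple arithmetic, worked here as a check. The slide's numbers (5 streams per cycle per tile, a 10 cycle schedule, 2 data streams, up to 2 configuration streams) are used as given; treating each stream as occupying exactly one slot, the 16-tile example, and the function names are assumptions.

```python
STREAMS_PER_CYCLE_PER_TILE = 5  # figure from the slide


def stream_slots(tiles, schedule_cycles):
    """Total BW = 5 x T x N, counted in stream slots per schedule."""
    return STREAMS_PER_CYCLE_PER_TILE * tiles * schedule_cycles


def free_slots(tiles, schedule_cycles, data_streams=2, config_streams=2):
    """Slots left per schedule if every stream takes one slot on each tile."""
    used = (data_streams + config_streams) * tiles
    return stream_slots(tiles, schedule_cycles) - used


single_tile = stream_slots(1, 10)      # one tile, 10 cycle schedule
sixteen_tiles = stream_slots(16, 10)   # hypothetical 16-tile array
remaining = free_slots(1, 10)
```

Under the one-slot-per-stream assumption, data plus configuration traffic uses only a small part of a tile's 50 slots.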
33Configuration Power Overhead
- Configuration streams used infrequently
- Once per macroblock or once per frame
- Architecture disables unused streams
- Data valid bit already used for flow control
- Only 4-9% of interconnect power is due to configuration streams
34Conclusion
- aSoC supports dynamic power management with reconfiguration
- Cores
- Interconnect
- Clocks
- Low configuration overhead in both
- Communication Bandwidth
- Power
35Future Work
- Add reconfigurable voltage supplies at each tile
- Finish test chip
- Import larger applications
36Questions
37aSoC adaptive System on a Chip
[Figure: aSoC layout with labels Tile, Motion Estimation and Compensation, Cores, Interconnect, Interface]
38Example Stream
39Partitioning
- Automated partitioning is a nontrivial problem
- For small signal processing systems, user-defined partitioning may be possible
- Key: perfectly partitioning the system may not be possible
- How can the SoC mitigate the penalty?