Title: Architectures and Design Techniques for Energy Efficient Embedded DSP and Multimedia
1Architectures and Design Techniques for Energy
Efficient Embedded DSP and Multimedia
- Ingrid Verbauwhede, Christian Piguet, Bart
Kienhuis/Ed Deprettere, Patrick Schaumont
2Outline
- Intro
- Low Power Observations
- SoC Architectures (Ingrid)
- Low Power Components (Christian)
- Design Methods (Ed)
- Design Methods (Patrick)
- Conclusion
3Introduction
- Applications
- Mapped
- onto
- Architectures
- Embedded DSP Multimedia
- Design Methods
- Low Power!
Ingrid Verbauwhede, UCLA, K.U.Leuven
4Low Power Observation 1: architecture tuned to application
5Observation 2: Energy-flexibility trade-off
[Figure: power and cost plotted against application range, from a fixed ASIC, through a platform (???), to general purpose]
6Example DSP processors
- Specialized instructions: MAC
- Dedicated co-processors: Viterbi acceleration
7FIR on TI C55x Dual MAC
FIR filter: two outputs in parallel with 3 busses
y(0) = c(0)x(0) + c(1)x(-1) + c(2)x(-2) + . . . + c(N-1)x(1-N)
y(1) = c(0)x(1) + c(1)x(0) + c(2)x(-1) + . . . + c(N-1)x(2-N)
y(2) = c(0)x(2) + c(1)x(1) + c(2)x(0) + . . . + c(N-1)x(3-N)
y(3) = c(0)x(3) + c(1)x(2) + c(2)x(1) + . . . + c(N-1)x(4-N)
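Adjacent outputs of the FIR filter use the same coefficients, so a dual-MAC datapath can fetch each coefficient once and feed it to two accumulators. The Python sketch below models only that arithmetic; the function names and the dictionary-based sample storage are illustrative assumptions, not C55x code.

```python
def fir_dual_mac(coeffs, samples, n):
    """Compute y(n) and y(n+1) together, fetching each coefficient once.

    `samples` maps a time index (possibly negative) to x(index); missing
    indices count as zero. This models the arithmetic only, not the
    C55x instruction set.
    """
    acc0 = 0  # accumulator for y(n)
    acc1 = 0  # accumulator for y(n+1)
    for k, c in enumerate(coeffs):               # one coefficient fetch per step
        acc0 += c * samples.get(n - k, 0)        # first MAC unit
        acc1 += c * samples.get(n + 1 - k, 0)    # second MAC unit
    return acc0, acc1


def fir_reference(coeffs, samples, n):
    """Single-output FIR: y(n) = sum over k of c(k) * x(n - k)."""
    return sum(c * samples.get(n - k, 0) for k, c in enumerate(coeffs))


coeffs = [1, 2, 3]
samples = {-2: 4, -1: 5, 0: 6, 1: 7}
y0, y1 = fir_dual_mac(coeffs, samples, 0)
```

Both accumulators advance in the same loop iteration, which is the software analogue of issuing two multiply-accumulates per cycle.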
8Energy comparison
Total energy for one output sample
- Adaptation of the datapath: MAC, DMAC
- Adaptation of the memory architecture and bus network
- Adaptation of the instruction set
9Viterbi on TI C54x
ALU and CSSU: CMPS instruction
G = min(G1 + m1, G2 + m2)
[Figure: datapath inputs DB0(16), DB1(16), and TREG, carrying the branch metrics m1 and m2]
- ALU splits in 16 bit halves
- ACC splits in half
- Shortest distance saved
- CSSU compares halves
- Path indicator saved
- 4 cycles / butterfly
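[Figure: CMPS datapath. The split ALU produces G1 and G2 in the accumulator halves; the comparator (Comp) drives the MSW/LSW select, the selected G goes to memory over data bus EB, and the decision bit is shifted into the TRN register]
Source: TI Application Report, Viterbi Decoding in the TMS320C54x family, document SPRA071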
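The add-compare-select step described on this slide can be sketched in a few lines. The Python below is a behavioral model only: the function names, the 16-bit transition-register width, and the polarity of the decision bit are assumptions made for illustration, not details taken from the TI report.

```python
def acs_butterfly(g1, g2, m1, m2):
    """Add-compare-select: return (survivor metric, decision bit).

    The decision bit is 0 when the first candidate path survives and
    1 otherwise (polarity is an assumption for this sketch).
    """
    cand1 = g1 + m1   # extend path 1 by its branch metric
    cand2 = g2 + m2   # extend path 2 by its branch metric
    if cand1 <= cand2:
        return cand1, 0
    return cand2, 1


def shift_in(trn, decision, width=16):
    """Shift one decision bit into a transition register of `width` bits."""
    return ((trn << 1) | decision) & ((1 << width) - 1)


trn = 0
metric_a, bit_a = acs_butterfly(10, 7, 3, 9)   # 13 vs 16: path 1 survives
trn = shift_in(trn, bit_a)
metric_b, bit_b = acs_butterfly(10, 7, 9, 3)   # 19 vs 10: path 2 survives
trn = shift_in(trn, bit_b)
```

The recorded bits are what a later traceback pass would read to recover the most likely path.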
10Observation: general-purpose architectures are also becoming heterogeneous.
Source: Xilinx webpage
11Question
- Energy efficiency and flexibility are opposing demands!
- How to navigate in this jungle?
- 3D design space
- Next question: how to map (or compile) an application onto such an architecture?
[Figure: 3D design space with axes computational abstraction level, binding rate, and reconfigurable feature]
12Flexibility (1) - Abstraction level
Axis: computational abstraction level
- Instruction set level: programmable
- CLB level: reconfigurable
13Flexibility (2) - Reconfigurable feature
Computational abstraction level | Communication | Storage | Computation
Systems | Interconnect network | Memory hierarchy | Number and type of processes
Instruction set architecture | Size of address/data bus | Register set | Custom instructions
Micro-architecture | Cross-bar, busses | Register file | Execution unit type
Implementation | Switches, muxes | RAM details | CLB
(Rows: computational abstraction level. Columns: reconfigurable feature.)
14Flexibility (3) - Binding rate
Binding rate
- Compare processing to binding
- Configurable (compile-time)
- Re-configurable
- Dynamic reconfigurable (adaptive)
15SoC architecture: RINGS
16Instruction set extension
- Instruction set extension
- Register mapped
- Tightly coupled
- Experiment: DFT
1000 iterations | SW on embedded proc. | SW with HW datapath | Improvement
Energy | 67.6 mJ | 5.76 mJ | 12.5 times
17Co-processor
- Memory mapped
- Loosely coupled
- Experiment: AES
- Local memory
175 iterations | SW on embedded proc. | SW with HW datapath | Improvement
Energy | 89.2 mJ | 13.5 mJ | 25 times
18Independent IP
- Loosely coupled
- Network on chip connected
- Flexible interconnect
- Experiment: TCP/IP checksum
[Figure: IP blocks attached to network-on-chip routers]
100 packets | SW on embedded proc. | HW datapath | Improvement
Energy | 17.0 mJ | 0.20 mJ | 84 times
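For readers unfamiliar with the offloaded kernel, the sketch below shows the standard 16-bit ones' complement Internet checksum as described in RFC 1071. The slide does not say which variant or packet format was used in the experiment, so this is an assumption; the function name is illustrative.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum over `data` (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"                          # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]    # sum big-endian 16-bit words
    while total >> 16:                           # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF                       # ones' complement of the folded sum


# Example byte sequence used in RFC 1071.
packet = bytes([0x00, 0x01, 0xF2, 0x03, 0xF4, 0xF5, 0xF6, 0xF7])
checksum = internet_checksum(packet)
```

The loop is a chain of additions with end-around carry, which is why it maps well onto a small dedicated datapath.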
19Communication Energy-flexibility
Daly-DAC01
- The energy-flexibility conflict applies to communication as well!
- General-purpose NoC tiles
- FPGA: general purpose
- Therefore: domain-specific NoC
20Conclusion
- Low Power by going domain-specific
- Energy-flexibility conflict
- How to program this RINGS?
- Next: Ultra-low-power components (Christian)
- Design exploration (Ed)
- Co-design environment (Patrick)
21Introduction
- Applications
- Mapped
- onto
- Architectures
- Embedded DSP Multimedia
- Design Methods
- Low Power!
22Efficient Embedded DSP: Ultra-Low-Power Components
23Ultra Low Power DSP Processors
- The design of DSP processors is very challenging, as it has to take into account contradictory goals:
  - an increased throughput demand
  - at a reduced energy budget
- New issues arise from very deep submicron technologies, such as interconnect delays and leakage
- History of hearing-aid circuits:
  - analog filters 15 years ago
  - digital ASIC-like circuits 5 years ago
  - powerful DSP processors today, below 1 Volt and 1 mW
24DSP Architectures for Low-Power
- Single-MAC DSP core of 5-10 years ago
- Parallel architectures with several MACs working in parallel
- VLIW or multitask DSP architectures
- Benchmark:
  - number of simple operations executed per clock cycle, up to 50 or more
- Drawbacks of VLIW:
  - very large instruction words, up to 256 bits
  - some instructions in the set are still missing
  - the high transistor count works against leakage reduction
25VLIW TMS320C6x (VelociTI)
Peak:
- C62: about 1000 MIPS/watt (5000 MOPS/watt)
- C64: about 5000 MIPS/watt (25000 MOPS/watt)
- About 5 Op/Instr.
26Three Ways to be more Energy Efficient
- To design specific, very small DSP engines for each task, so that each DSP task is executed in the most energy-efficient way on the smallest piece of hardware (N co-processors)
- To design reconfigurable architectures such as the DART cluster, in which configuration bits allow the user to modify the hardware so that it fits the executed algorithms much better.
27Co-processors
- Best option regarding power
- Minimal number of transistors and transitions to perform a task
- Control code on a microcontroller
- Main issue is the software mapping of a given application onto so many heterogeneous processors and co-processors
- Cut-off for leakage reduction
- Ph.D. of O. Paker, DTU, Lyngby, June 2002 (Minicores):
  - IIR: 73000 MOPS/watt
  - FIR: 33000 MOPS/watt
28DART O. Sentieys, ENSSAT
- RDP: Reconfigurable DataPath
- Many identical co-processors or hardware accelerators for parallel tasks
- Reconfigurable interconnections
- In 0.18 µm, 1.9 V, 130 MHz, WCDMA: 6.2 GOPS, 40000 MOPS/watt
- For comparison: FPGA Xcv200E 4000 MOPS/watt, DSP C64 3000 MOPS/watt
29Reconfigurable DSP Architectures
- Not an FPGA: much more efficient than an FPGA. The key point is to reconfigure only a limited number of units:
  - Reconfigurable datapath
  - Reconfigurable interconnections
  - Reconfigurable Addressing Units (AGU)
- Comparison with FPGA:
  - MACGIC DSP consumes 1 mW/MHz in 0.18 µm
  - Same MACGIC in Altera Stratix consumes 10 mW/MHz plus 900 mW of static power, so 1000 mW at 10 MHz
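The FPGA figure follows from adding frequency-dependent and static power. The short Python sketch below redoes that arithmetic at 10 MHz; the function name is illustrative, and treating the dedicated implementation's static power as zero is an assumption, since the slide quotes no static figure for it.

```python
def total_power_mw(dynamic_mw_per_mhz, freq_mhz, static_mw=0.0):
    """Total power = dynamic power per MHz * clock frequency + static power."""
    return dynamic_mw_per_mhz * freq_mhz + static_mw


FREQ_MHZ = 10.0

# Dedicated 0.18 um implementation: 1 mW/MHz, static power assumed negligible.
dedicated_mw = total_power_mw(1.0, FREQ_MHZ)

# Same design on the FPGA: 10 mW/MHz plus 900 mW static power.
fpga_mw = total_power_mw(10.0, FREQ_MHZ, static_mw=900.0)

ratio = fpga_mw / dedicated_mw
```

At this low clock rate the static term dominates the FPGA total, so the gap is much larger than the 10x difference in dynamic power alone.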
30Reconfigurable Datapaths
31Reconfigurable Addressing Modes
- Operand fetch is generally a severe bottleneck in parallel machines, which require 8-16 operands each clock cycle.
- Sophisticated addressing modes can be dynamically reconfigured depending on the DSP task to be executed
PR2 PR1 PR0 OFFA W
- Examples:
  - an <- (an + on) % mn + OFFA
  - an <- (an + OFFA) % mn + 1
  - an <- (an + 1) % mn + OFFA
  - an <- (an + 1) % mn
Selection of 3 operations (MACGIC)
32MACGIC Performance results
- Power consumption for a 24-bit, 10 MHz synthesis @ 0.9 V in the 0.18 µm TSMC technology (SYNOPSYS MACHTA/PA simulations):
  - NOP: 25 µA/MHz
  - ADD 24-bit: 102 µA/MHz, 98k MOPS/Watt
  - MAC 24-bit/56-bit: 137 µA/MHz, 81k MOPS/Watt
  - 4 ADD 24-bit: 167 µA/MHz, 120k MOPS/Watt
  - 4 MAC 24-bit/56-bit: 283 µA/MHz, 86k MOPS/Watt
  - MACV 24-bit/56-bit: 269 µA/MHz, 90k MOPS/Watt
  - CBFY4 radix-4 FFT: 273 µA/MHz, 131k MOPS/Watt
- Number of transistors for this 24-bit version: 600000
- Number of transistors for a 16-bit version: 400000
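The current and MOPS/Watt columns can be related with a little arithmetic. The Python sketch below assumes that MOPS/Watt means operations per second divided by supply power (current times the 0.9 V supply); under that assumption it derives how many operations per clock cycle each quoted figure implies. The operation counts are not stated on the slide, so the results are only what the quoted numbers imply, and the function name is illustrative.

```python
SUPPLY_V = 0.9  # supply voltage quoted for the synthesis


def implied_ops_per_cycle(mops_per_watt, current_ua_per_mhz, supply_v=SUPPLY_V):
    """Operations per clock cycle implied by a MOPS/Watt and a uA/MHz figure.

    Power per MHz of clock is I * V, so
    ops/cycle = (MOPS/Watt) * (Watt per MHz).
    """
    watts_per_mhz = current_ua_per_mhz * 1e-6 * supply_v
    return mops_per_watt * watts_per_mhz


# (MOPS/Watt, uA/MHz) pairs taken from the list above.
quoted = {"ADD": (98e3, 102), "MAC": (81e3, 137), "CBFY4": (131e3, 273)}
implied = {name: implied_ops_per_cycle(m, i) for name, (m, i) in quoted.items()}
```

Under this reading a single instruction counts for several simple operations, which fits the earlier benchmark definition of simple operations executed per clock cycle.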
33Performance results: 64-point complex FFT
- MACGIC: 250 clock cycles
- CARMEL: 526 clock cycles
- PalmDSPCore: 450 clock cycles
- SC140 StarCore: 288 clock cycles
- R.E.A.L DSP: 850 clock cycles
- SP-5flex (3DSP): 500 clock cycles
- TI C62x: 675 clock cycles
- TI C64x: 276 clock cycles
34Comparison
35Introduction
- Applications
- Mapped
- onto
- Architectures
- Embedded DSP Multimedia
- Design Methods
- Low Power!
36Design Architecture Exploration
- Ed Deprettere, Professor
- Bart Kienhuis, Assistant Professor
- Leiden University
- LIACS, The Netherlands
37Embedded DSP Architectures
Examples: Synfora, PicoChip, Silicon Hive, Art Builder
- CPU: a simple microprocessor
- RPU: Reconfigurable Processing Unit
- IPcore: dedicated accelerator block
- NoC: Network on a Chip
Weakly coupled processing elements
38System Level Design
- Three aspects are important in System Level Design:
  - The Architecture
  - The Application
  - How the Application is Mapped on the Architecture
- To optimize a system, you need to take all three aspects into consideration.
- This is expressed in terms of the Y-Chart
39Y-chart Approach
Tuning the architecture to get better performance
40Y-chart Approach
[Figure: Y-chart. Applications are mapped onto an architecture instance; performance analysis of the mapping yields performance numbers]
There are three different ways to improve a
system.
41Design Space Exploration
Decision Variables
Valuation System
Metrics
Acquisition of Insight
42Design Space Exploration
Exploration is about finding the inverse relationship, from metrics back to decision variables.
[Figure: Y-chart (applications, architecture instance, mapping, performance analysis, performance numbers) relating decision variables to metrics]
43Y-chart Design For GP Processors
- Applications: in C
- Architecture: ARM processor, Tensilica, X86 processor
- Mapping: compiler (GNU GCC)
- Performance analysis yields performance numbers:
  - Clock speed
  - Power consumption
  - Throughput
  - Latency
44Y-chart Design For DSP Applications
- Applications: in VHDL/C
- Architecture template: Xilinx FPGA
- Mapping: Synplicity, Xilinx Foundation tools
- Performance analysis yields performance numbers:
  - Clock speed
  - Power consumption
  - Throughput
  - Latency
Companies like PICO and ART (Arm/Adelante) are trying to provide the Y-chart components as well.
45How to improve performance
- How can we improve the performance of the system we are interested in?
- Others focus on the architecture; we want to focus on the application.
- For a low-power architecture, parallelism is important:
  - Exploiting more parallelism leads to faster calculations
  - Using voltage and frequency scaling, we assume that power is saved
- A lot of theory has already been developed to exploit bit-level, instruction-level, and task-level parallelism.
- Task-level parallelism in particular is becoming more and more important for mapping DSP applications effectively onto the newly emerging architectures.
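The link between parallelism and power can be made concrete with the usual first-order model of dynamic power, P = C * V^2 * f. The Python sketch below works in normalized units and assumes, as an idealization not stated on the slide, that halving the clock frequency allows the supply voltage to be halved as well; real voltage scaling is less favorable, and leakage is ignored.

```python
def dynamic_power(c_eff, vdd, freq):
    """First-order dynamic power model: P = C * V^2 * f (normalized units)."""
    return c_eff * vdd ** 2 * freq


# Baseline: one processing unit at nominal voltage and frequency.
baseline = dynamic_power(c_eff=1.0, vdd=1.0, freq=1.0)

# Two parallel units (twice the switched capacitance), each at half the
# clock rate, so throughput is unchanged. Halving the voltage along with
# the frequency is an idealized assumption.
parallel = dynamic_power(c_eff=2.0, vdd=0.5, freq=0.5)

saving = 1.0 - parallel / baseline
```

Under these assumptions the parallel version delivers the same throughput at a quarter of the dynamic power.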
46Programming Problem
Application
Programming
Compaan
Laura
47Kahn Process Network (KPN)
- Kahn Process Networks [Kahn 1974] [Parks & Lee 95]:
  - Processes run autonomously
  - Communicate via unbounded FIFOs
  - Synchronize via blocking read
- A process is either:
  - executing (execute)
  - communicating (send/get)
- Characteristics:
  - Deterministic
  - Distributed control: no global scheduler is needed
  - Distributed memory: no memory contention
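A Kahn Process Network can be imitated with threads and queues. The Python sketch below is a minimal illustration, not output of Compaan or Laura: three processes run autonomously, communicate over unbounded FIFOs (`queue.Queue()` with its default size), and synchronize only through blocking reads. The `END` marker and all function names are assumptions introduced so that the example terminates.

```python
import queue
import threading

END = object()  # hypothetical end-of-stream token, not part of the KPN model


def producer(out_fifo, values):
    """Source process: writes every value, then the end marker."""
    for v in values:
        out_fifo.put(v)
    out_fifo.put(END)


def scale(in_fifo, out_fifo, factor):
    """Middle process: blocking read, compute, write."""
    while True:
        token = in_fifo.get()            # blocks until a token is available
        if token is END:
            out_fifo.put(END)
            return
        out_fifo.put(token * factor)


def collect(in_fifo, results):
    """Sink process: gathers tokens in arrival order."""
    while True:
        token = in_fifo.get()
        if token is END:
            return
        results.append(token)


def run_network(values, factor):
    a_to_b = queue.Queue()               # default maxsize 0 means unbounded
    b_to_c = queue.Queue()
    results = []
    threads = [
        threading.Thread(target=producer, args=(a_to_b, values)),
        threading.Thread(target=scale, args=(a_to_b, b_to_c, factor)),
        threading.Thread(target=collect, args=(b_to_c, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


output = run_network([1, 2, 3], 10)
```

However the threads are scheduled, the output sequence is the same, which is the determinism property listed above.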
48Kahn Process Network (KPN)
[Figure: Processes A, B, and C distributed over CPU 1, FPGA A, and FPGA B, connected by FIFOs]
- Autonomously operating processes: no global schedule needed
- Blocking read: simple to realize in hardware
- Buffer sizes of the FIFOs are quite often very small
49Matlab to Process Networks to FPGA
Matlab Program
C / SystemC / YAPI
Compaan Compiler
Kahn Process Network
Laura
Functional Simulation in Ptolemy II
Synthesizable VHDL
50Matlab to Matlab Transformations
- To make the flow from Matlab to FPGA interesting, we had to give designers the means to change the characteristics:
  - Unrolling (unfolding): increases parallelism
  - Retiming (skewing): improves pipeline behavior
  - Clustering (merging): reduces parallelism
- All these operations can be applied at the Matlab source level, leading to a new Matlab program
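As an illustration of a source-level transformation, the Python sketch below unrolls a loop by a factor of two. It is a hand-written analogy, not the output of the Matlab toolbox: the function names and the loop body are assumptions. The point is that the rewritten program computes the same result while exposing two loops with no dependence between them, which could become two parallel processes.

```python
def original(xs):
    """One loop: every iteration is handled by the same process."""
    out = [0] * len(xs)
    for i in range(len(xs)):
        out[i] = 2 * xs[i] + 1
    return out


def unrolled_by_two(xs):
    """Same computation split into two independent loops (even and odd i)."""
    out = [0] * len(xs)
    for i in range(0, len(xs), 2):   # even iterations: candidate process 1
        out[i] = 2 * xs[i] + 1
    for i in range(1, len(xs), 2):   # odd iterations: candidate process 2
        out[i] = 2 * xs[i] + 1
    return out


data = [3, 1, 4, 1, 5, 9, 2]
```

Clustering is the reverse move: merging the two loops back into one reduces parallelism again.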
51Y-chart Design For DSP Applications
Algorithmic Toolbox: Retiming, Unrolling, Merging
Matlab Program
Compaan
Laura Synplicity Xilinx Foundation
Process Network
Xilinx/Altera FPGA
Performance Analysis
Performance Numbers
52Case Study
- We use the Y-chart environment on a real case study
- Adaptive QR [DAES 2002]
- Using commercial IP cores from QinetiQ Ltd.:
  - Vectorize: 42 pipeline stages
  - Rotate: 55 pipeline stages
- QR is interesting as it requires deeply pipelined IP cores
- Most design tools have difficulties with such IP cores
- We will explore a number of simple steps to improve the performance of the QR algorithm
- Was reported to run at 12 Mflops.
53Example Adaptive QR (Step 1)
for k = 1:1:21,
  for j = 1:1:7,
    for i = j:1:7,
      if i <= j,
        [r(j,j), rr(j,j), a(j), b(j), d(k)] = bcell( r(j,j), rr(j,j), x(k,j), d(k) );
      else
        [r(j,i), x(k,i), a(i), b(i)] = icell( r(j,i), x(k,i), a(i), b(i) );
      end
    end
  end
end
- This algorithm runs at 60 Mflops @ 100 MHz, out of the box. This is more than the 12 Mflops of [DAES 2002], due to improved hardware design of the processes and controllers.
- Still, the efficiency of the pipelines of bcell/icell is very poor (<1%), for example due to the self-loop created by variables a and b.
54Example Adaptive QR (Step 2)
for k = 1:1:21,
  for j = 1:1:7,
    for i = j:1:7,
      if i <= j,
        [r(j,j), rr(j,j), a(j), b(j), d(k)] = bcell( r(j,j), rr(j,j), x(k,j), d(k) );
      else
        [a(i)] = Pass_qr( a(i-1) );
        [b(i)] = Pass_qr( b(i-1) );
        [r(j,i), x(k,i)] = icell( r(j,i), x(k,i), a(i-1), b(i-1) );
      end
    end
  end
end
- We broke the self-loops on variables a and b by introducing a pass function. This gave a 5% improvement, leading to 67.7 Mflops.
- Still, the efficiency of bcell (3.76%) / icell (1.56%) is very poor.
55Example Adaptive QR (Step 3)
for k = 1:1:21,
  for j = 1:1:7,
    for i = j:1:7,
      for p = 1:1:10,
        if i <= j,
          [r(p,j,i), rr(p,j,i), a(p,i), b(p,i), d(p,k)] = bcell( r(p,j,i), rr(p,j,i), x(p,k,i), d(p,k) );
        else
          [a(p,i)] = Pass_qr( a(p,i-1) );
          [b(p,i)] = Pass_qr( b(p,i-1) );
          [r(p,j,i), x(p,k,i)] = icell( r(p,j,i), x(p,k,i), a(p,i-1), b(p,i-1) );
        end
      end
    end
  end
end
- To fill the pipelines, we run independent problems over the same resources (sometimes referred to as strip mining). This fills the pipelines more efficiently, leading to 472.55 Mflops (bcell 8.2% / icell 24.2%).
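Why independent problems help can be seen with a toy pipeline model. The Python sketch below is an illustrative simulation, not a model of the actual QR hardware: it assumes a single pipelined unit that accepts one operation per cycle, where each operation of a problem must wait for the previous result of the same problem. The latency of 55 cycles borrows the stage count quoted for the Rotate core; the operation counts and function name are assumptions.

```python
def pipeline_utilization(latency, ops_per_problem, problems):
    """Fraction of cycles in which a new operation enters the pipeline.

    Each problem is a chain of dependent operations; independent problems
    are interleaved on the same pipelined unit.
    """
    next_ready = [0] * problems            # earliest issue cycle per problem
    remaining = [ops_per_problem] * problems
    total_ops = ops_per_problem * problems
    issued = 0
    cycle = 0
    finish = 0
    while issued < total_ops:
        for p in range(problems):          # issue at most one operation per cycle
            if remaining[p] > 0 and next_ready[p] <= cycle:
                remaining[p] -= 1
                next_ready[p] = cycle + latency   # result available after `latency`
                finish = cycle + latency
                issued += 1
                break
        cycle += 1
    return total_ops / finish


single = pipeline_utilization(latency=55, ops_per_problem=100, problems=1)
interleaved = pipeline_utilization(latency=55, ops_per_problem=100, problems=10)
```

With one dependent chain the unit is busy roughly one cycle in 55; interleaving ten independent problems raises that by nearly a factor of ten in this model.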
56Example Adaptive QR (Step 4)
for k = 1:1:21,
  for j = 1:1:7,
    for i = j:1:7,
      for p = 1:1:10,
        if i <= j,
          [r(p,j,i), rr(p,j,i), a(p,i), b(p,i), d(p,k)] = bcell( r(p,j,i), rr(p,j,i), x(p,k,i), d(p,k) );
        else
          [a(p,i)] = Pass_qr( a(p,i-1) );
          [b(p,i)] = Pass_qr( b(p,i-1) );
          if mod(p,2) <= 0,
            [r(p,j,i), x(p,k,i)] = icell( r(p,j,i), x(p,k,i), a(p,i-1), b(p,i-1) );
          else
            [r(p,j,i), x(p,k,i)] = icell( r(p,j,i), x(p,k,i), a(p,i-1), b(p,i-1) );
          end
        end
      end
    end
  end
end
- The icell and bcell workload is not balanced. The bcell produces more work than the icell can process. Therefore, we unroll the icell, leading to 673.81 Mflops (bcell 11.8% / icell 2 x 17.4% = 34.8%).
57Conclusions
- Optimizing a System Level Design for Low-Power requires that you look at the architecture, the mapping, and the application.
- The Y-chart gives a simple framework to tune a system for optimal performance.
- The Y-chart forms the basis for DSE
- We showed that by changing the way applications are written, we get, in a number of steps, an order of magnitude better performance: 60 Mflops -> 673 Mflops. This without changing the architecture!
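The gain of each rewriting step can be read off the throughput figures quoted for the four Adaptive QR versions. The Python sketch below only redoes that arithmetic; the dictionary labels are illustrative.

```python
# Throughput in Mflops as quoted for the four versions of the algorithm.
mflops = {
    "step 1 (out of the box)": 60.0,
    "step 2 (pass function)": 67.7,
    "step 3 (independent problems)": 472.55,
    "step 4 (unrolled icell)": 673.81,
}

baseline = mflops["step 1 (out of the box)"]

# Cumulative speedup of every step relative to the first version.
speedup = {step: value / baseline for step, value in mflops.items()}
```

Most of the gain comes from step 3, where independent problems fill the deep pipelines.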
58Introduction
- Applications
- Mapped
- onto
- Architectures
- Embedded DSP Multimedia
- Design Methods
- Low Power!
59Domain-Specific Co-design Environments
60Let's waste no time!
You
61Design of the RINGS Multiprocessor
[Figure: two cores and a hardware co-processor, each attached through a NoC interface to the NoC routers and topology]
- Need an environment that combines:
  - Instruction-set simulation (multi-core)
  - Hardware simulation
62ARMZILLA
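[Figure: ARMZILLA. C programs pass through a cross compiler to executables (EXE) that run on ARM ISS instances; FDL descriptions of the hardware processors and the network-on-chip run on the GEZEL kernel; a configuration unit (Config) and memory-mapped channels connect the two sides]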
63ARMZILLA
[Figure: ARMZILLA. FDL descriptions of the hardware processors and the network-on-chip, the GEZEL kernel, the configuration unit (Config), and the memory-mapped channels]
64Requirements for an ISS in a multiprocessor simulator
[Figure: the ISS calls memoryread( ) and memorywrite( ) in the hardware simulation kernel]
- Linkable
- Reentrant
65ARMZILLA
[Figure: ARMZILLA. C programs, cross compiler, executables (EXE), and ARM ISS instances, next to the FDL descriptions of the hardware processors and the network-on-chip on the GEZEL kernel]
66Memory-mapped channels: easy in C
volatile int *k = (int *) 0x80000000;
b = *k; // read
Software on the core uses initialized pointers
[Figure: ARM ISS connected to the memory bus]
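How a simulator can route such an access to a hardware model is sketched below in Python. This is a toy illustration, not the ARMZILLA or GEZEL API: the class, the `map_channel` method, and the queue objects are assumptions. Only the address 0x80000000 and the idea of memory read and write hooks come from the slides.

```python
from collections import deque


class MemoryBus:
    """Toy memory bus: plain RAM plus address-mapped channel handlers."""

    def __init__(self):
        self.ram = {}
        self.readers = {}
        self.writers = {}

    def map_channel(self, address, on_read=None, on_write=None):
        """Attach hardware-side callbacks to one address (hypothetical helper)."""
        if on_read is not None:
            self.readers[address] = on_read
        if on_write is not None:
            self.writers[address] = on_write

    def memoryread(self, address):
        """Called by the ISS for every load."""
        if address in self.readers:
            return self.readers[address]()     # served by the hardware model
        return self.ram.get(address, 0)

    def memorywrite(self, address, value):
        """Called by the ISS for every store."""
        if address in self.writers:
            self.writers[address](value)       # forwarded to the hardware model
        else:
            self.ram[address] = value


CHANNEL_ADDR = 0x80000000
hw_to_sw = deque([42])     # data the hardware model has produced
sw_to_hw = []              # data the software sends to the hardware model

bus = MemoryBus()
bus.map_channel(CHANNEL_ADDR, on_read=hw_to_sw.popleft, on_write=sw_to_hw.append)

b = bus.memoryread(CHANNEL_ADDR)         # corresponds to reading through the pointer
bus.memorywrite(CHANNEL_ADDR, b + 1)     # corresponds to writing through the pointer
bus.memorywrite(0x1000, 7)               # ordinary memory is unaffected
```

From the software's point of view the channel is just a pointer dereference, which is why no special driver code is needed on the core.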
67Hardware Simulation Kernel
[Figure: ARMZILLA. C programs, cross compiler, executables (EXE), ARM ISS instances, configuration unit (Config), and memory-mapped channels]
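68GEZEL Hardware Simulation Kernel has Hybrid Architecture
[Figure: GEZEL kernel. The GEZEL language (.fdl) is read by a parser into a symbol table; code generators, a cycle simulator, and an HDL translation build on it; C, sim API, and syn API layers connect to SH3DSP, LEON2 Sparc, ARM, SystemC, and VHDL]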
69The PONG Example
[Figure: PONG playfield with two paddles and a ball]
- Designed in 1967 by Baer as an interactive TV feature
- 1977: General Instruments AY-3-8500 pong-in-a-chip
- Magnavox, Coleco, Atari, Philips, URL, GHP, ...
70Multiprocessor Model of PONG
Four-processor model:
- Two paddle processors
- BALL (ball dynamics)
- FIELD (rendering)
71Multiprocessor Operation - Initialize
Initialization Sequence
PADDLE1
PADDLE2
FIELD
BALL
TIME
72Multiprocessor Operation - Play
Play Sequence
PADDLE1
PADDLE2
FIELD
BALL
(CONDITIONAL)LOOP
73Let's play!
74Multiprocessor Architecture
Point-to-point model
[Figure: PADDLE1, BALL, PADDLE2, and FIELD, each with memory-mapped input (I) and output (O) ports, connected point to point; DISPLAY HW attached]
75Multiprocessor Architecture
Network-on-chip model
[Figure: PADDLE1, BALL, PADDLE2, and FIELD, each attached through a memory-mapped interface to a 1D router; DISPLAY HW attached]
76Links
- SimIt-ARM ISS (W. Qin, Princeton University): http://www.ee.princeton.edu/~wqin/armsim.htm
- Cross compiler arm-linux-gcc: ftp://ftp.arm.linux.org.uk
- GEZEL, ARMZILLA: http://www.ee.ucla.edu/~schaum/gezel
All these tools are free! (free as in freedom, not as in free beer)
77Conclusion
- Applications
- Mapped
- onto
- Architectures
- Embedded DSP Multimedia
- Design Methods
- Low Power!
78Thanks for your attention !