Architectures and Design Techniques for Energy Efficient Embedded DSP and Multimedia


Slides: 79

Provided by: jau53

Learn more at: http://www.emsec.ee.ucla.edu


Transcript and Presenter's Notes


1
Architectures and Design Techniques for Energy
Efficient Embedded DSP and Multimedia

Ingrid Verbauwhede, Christian Piguet, Bart
Kienhuis/Ed Deprettere, Patrick Schaumont

2
Outline

Intro
Low Power Observations
SoC Architectures: Ingrid
Low Power Components: Christian
Design Methods: Ed
Design Methods: Patrick
Conclusion

3
Introduction

Applications
Mapped
onto
Architectures

Embedded DSP and Multimedia
Design Methods
Low Power!

Ingrid Verbauwhede, UCLA, K.U.Leuven
4
Low Power observation 1: architecture tuned to application
5
Observation 2: Energy-flexibility trade-off
[Figure: energy-flexibility trade-off. Labels: Application, Power, Cost, ASIC, Fixed, ???, Platform, General Purpose]
6
Example DSP processors

Specialized instructions: MAC
Dedicated co-processors: Viterbi acceleration

7
FIR on TI C55x: Dual MAC
FIR filter: two outputs in parallel with 3 busses
y(0) = c(0)x(0) + c(1)x(-1) + c(2)x(-2) + ... + c(N-1)x(1-N)
y(1) = c(0)x(1) + c(1)x(0) + c(2)x(-1) + ... + c(N-1)x(2-N)
y(2) = c(0)x(2) + c(1)x(1) + c(2)x(0) + ... + c(N-1)x(3-N)
y(3) = c(0)x(3) + c(1)x(2) + c(2)x(1) + ... + c(N-1)x(4-N)
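
As a rough illustration of the dual-MAC idea, the Python sketch below computes two adjacent FIR outputs per pass, so that one coefficient fetch feeds both accumulators. The function name `fir_dual_mac` and the treatment of samples before x(0) as zero are assumptions made for this sketch; it is not TI's implementation.

```python
def fir_dual_mac(c, x):
    """Compute y(n) = sum over k of c(k) * x(n-k), two outputs per pass.

    Samples before x(0) are treated as zero (an assumption of this sketch).
    """
    def sample(i):
        return x[i] if 0 <= i < len(x) else 0

    y = [0] * len(x)
    for n in range(0, len(x), 2):
        acc0 = 0  # accumulator of the first MAC unit, builds y(n)
        acc1 = 0  # accumulator of the second MAC unit, builds y(n+1)
        for k, coeff in enumerate(c):
            # One coefficient fetch is shared by both MAC units.
            acc0 += coeff * sample(n - k)
            acc1 += coeff * sample(n + 1 - k)
        y[n] = acc0
        if n + 1 < len(x):
            y[n + 1] = acc1
    return y
```

Calling `fir_dual_mac([1, 2, 3], [1, 0, 0, 0])` returns the impulse response of the filter, which makes the sharing of each coefficient between the two accumulators easy to trace by hand.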
8
Energy comparison
Total energy for one output sample
Adaptation of the datapath: MAC, DMAC
Adaptation of the memory architecture and bus network
Adaptation of the instruction set
9
Viterbi on TI C54x
ALU and CSSU: CMPS instruction
G = min( (G1 + m1), (G2 + m2) )
[Figure labels: DB1(16), DB0(16), m1, m2, TREG]

ALU splits in 16 bit halves
ACC splits in half
Shortest distance saved
CSSU compares halves
Path indicator saved
4 cycles / butterfly
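
Reading the slide's relation as G = min(G1 + m1, G2 + m2), one add-compare-select step could be sketched as follows. The function name `acs_step`, the tie-breaking rule, the polarity of the decision bit, and the 16-bit width of the transition register are assumptions of this sketch rather than details taken from the TI report.

```python
def acs_step(g1, m1, g2, m2, trn):
    """One add-compare-select step.

    g1, g2: accumulated path metrics of the two candidate paths
    m1, m2: branch metrics added to each candidate
    trn:    transition register collecting one decision bit per step
    Returns the surviving metric and the updated transition register.
    """
    cand1 = g1 + m1          # "add" for the first path
    cand2 = g2 + m2          # "add" for the second path
    decision = 0 if cand1 <= cand2 else 1   # "compare"
    g = cand1 if decision == 0 else cand2   # "select" the shortest distance
    # Shift the path indicator into the (assumed 16-bit) register.
    trn = ((trn << 1) | decision) & 0xFFFF
    return g, trn
```

A traceback stage would later read the decision bits back out of `trn` to recover the surviving path; that part is not shown here.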

[Figure: C54x Viterbi datapath. Labels: ALU, G1, G2, Accumulator, MSW/LSW select, decision bit, Comp, G, TRN reg, data bus EB to memory]
Source: TI Application Report, Viterbi Decoding in the TMS320C54x family, document SPRA071
10
Observation: general-purpose architectures also become heterogeneous.
Source: Xilinx webpage
11
Question

Energy - flexibility are opposite demands!
How to navigate in this jungle?
3D design space: Computational Abstraction Level, Binding rate, Reconfigurable feature
Next question: how to map (or compile) an application onto such an architecture?
12
Flexibility (1) - Abstraction level
Computational Abstraction Level

Instruction set level: programmable
CLB level: reconfigurable

13
Flexibility (2) - Reconfigurable feature

Basic components (reconfigurable feature) per computational abstraction level

Computational Abstraction Level | Communication           | Storage          | Computation
Systems                         | Interconnect network    | Memory hierarchy | Number and type of processes
Instruction set Architecture    | Size address / data bus | Register set     | Custom instructions
Micro-architecture              | Cross-bar, Busses       | Register file    | Execution unit type
Implementation                  | Switches, Muxes         | RAM details      | CLB
14
Flexibility (3) - Binding rate
Binding rate

Compare processing to binding
Configurable (compile-time)
Re-configurable
Dynamic reconfigurable (adaptive)

15
SoC architecture: RINGS
16
Instruction set extension

Instruction set extension
Register mapped
Tightly coupled
Experiment: DFT

1000 iterations | SW on embedded proc. | SW with HW datapath | Improvement
Energy          | 67.6 mJ              | 5.76 mJ             | 12.5 times
17
Co-processor

Memory mapped
Loosely coupled
Experiment: AES

[Figure label: Local Memory]
175 iterations | SW on emb. proc. | SW with HW datapath | Improvement
Energy         | 89.2 mJ          | 13.5 mJ             | 25 times
18
Independent IP

Loosely coupled
Network on chip connected
Flexible interconnect
Experiment: TCP/IP checksum

[Figure labels: router, router]
100 packets | SW on emb. proc. | HW datapath | Improvement
Energy      | 17.0 mJ          | 0.20 mJ     | 84 times
19
Communication: Energy-flexibility
[Daly-DAC01]

Also energy - flexibility conflict!
General purpose NOC tiles
FPGA general purpose
Therefore: domain-specific NOC

20
Conclusion

Low Power by going domain-specific
Energy-flexibility conflict
How to program this RINGS?
Next:
Ultra-low power components: Christian
Design exploration: Ed
Co-design environment: Patrick

21
Introduction

Applications
Mapped
onto
Architectures

Embedded DSP and Multimedia
Design Methods
Low Power!

22
Efficient Embedded DSP: Ultra-Low-Power Components

Christian Piguet, CSEM

23
Ultra Low Power DSP Processors

The design of DSP processors is very challenging, as it has to take into account contradictory goals:
- an increased throughput request
- at a reduced energy budget
New issues due to very deep submicron technologies, such as interconnect delays and leakage
History of hearing aid circuits:
- analog filters 15 years ago
- digital ASIC-like circuits 5 years ago
- powerful DSP processors today, below 1 Volt and 1 mW

24
DSP Architectures for Low-Power

- single MAC DSP core of 5-10 years ago
- parallel architectures with several MACs working in parallel
- VLIW or multitask DSP architectures
Benchmark:
- number of simple operations executed per clock cycle, up to 50 or more
Drawbacks of VLIW:
- very large instruction words, up to 256 bits
- some instructions in the set are still missing
- transistor count is not favorable to reduce leakage

25
VLIW: TMS320C6x (VelociTI)
Peak:
C62: about 1000 MIPS/watt (5000 MOPS/watt)
C64: about 5000 MIPS/watt (25000 MOPS/watt)
About 5 Op/Instr.
26
3 Ways to be more Energy Efficient

- to design specific, very small DSP engines for each task, so that each DSP task is executed in the most energy-efficient way on the smallest piece of hardware (N co-processors)
- to design reconfigurable architectures such as the DART cluster, in which configuration bits allow the user to modify the hardware so that it fits the executed algorithms much better

27
Co-processors

- best one regarding power
- minimal number of transistors and transitions
to perform a task
- control code on a microcontroller
- main issue is the software mapping of a given
application onto so many heterogeneous processors
and co-processors
- cut-off for leakage reduction

Ph.D. of O. Paker, DTU, Lyngby, June 2002: Minicores
IIR: 73000 MOPS/watt
FIR: 33000 MOPS/watt

28
DART O. Sentieys, ENSSAT

RDPx

RDP Reconfigurable DataPath

- Many identical co-processors or hardware accelerators for parallel tasks
- Reconfigurable interconnections
- in 0.18 µm, 1.9 V, 130 MHz, WCDMA: 6.2 GOPS, 40000 MOPS/watt
- FPGA Xcv200E: 4000 MOPS/watt
- DSP C64: 3000 MOPS/watt

29
Reconfigurable DSP Architectures

Not FPGA; much more efficient than FPGA. The key point is to reconfigure only a limited number of units:
- Reconfigurable datapath
- Reconfigurable interconnections
- Reconfigurable Addressing Units (AGU)
FPGA comparison:
- MACGIC DSP consumes 1 mW/MHz in 0.18 µm
- Same MACGIC in Altera Stratix consumes 10 mW/MHz plus 900 mW of static power, so 1000 mW at 10 MHz
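
The arithmetic behind the comparison can be checked with a minimal sketch. The helper name `total_power_mw` is hypothetical, and neglecting static power for the 0.18 µm implementation is an assumption, since the slide quotes only its 1 mW/MHz figure.

```python
def total_power_mw(dynamic_mw_per_mhz, freq_mhz, static_mw=0.0):
    """Total power = frequency-dependent part + frequency-independent part."""
    return dynamic_mw_per_mhz * freq_mhz + static_mw

# MACGIC in 0.18 µm at 10 MHz (static power assumed negligible here).
asic_mw = total_power_mw(1.0, 10.0)

# Same MACGIC mapped onto the FPGA at 10 MHz, with 900 mW static power.
fpga_mw = total_power_mw(10.0, 10.0, static_mw=900.0)

ratio = fpga_mw / asic_mw
```

At this low clock frequency the static term dominates the FPGA figure, which is why the gap is much larger than the 10x difference in dynamic power alone.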

30
Reconfigurable Datapaths
31
Reconfigurable Addressing Modes

Operand fetch is generally a severe bottleneck in parallel machines, for which 8-16 operands are required each clock cycle.
Sophisticated addressing modes can be dynamically reconfigured depending on the DSP task to be executed.

PR2 PR1 PR0 OFFA W

Examples
an <- (anon)mn OFFA
an <- (anOFFA)mn 1
an <- (an1)mn OFFA
an <- (an1)mn

Selection of 3 operations
(MACGIC)
32
MACGIC Performance results

Power consumption for a 24-bit, 10 MHz synthesis @ 0.9 V in the 0.18 µm TSMC technology (SYNOPSYS MACHTA/PA simulations)

Instruction         | Current    | Efficiency
NOP                 | 25 µA/MHz  |
ADD 24-bit          | 102 µA/MHz | 98k MOPS/Watt
MAC 24-bit/56-bit   | 137 µA/MHz | 81k MOPS/Watt
4 ADD 24-bit        | 167 µA/MHz | 120k MOPS/Watt
4 MAC 24-bit/56-bit | 283 µA/MHz | 86k MOPS/Watt
MACV 24-bit/56-bit  | 269 µA/MHz | 90k MOPS/Watt
CBFY4 radix-4 FFT   | 273 µA/MHz | 131k MOPS/Watt

Number of transistors for this 24-bit version: 600000
Number of transistors for a 16-bit version: 400000
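
To relate the current figures to the MOPS/Watt figures, the sketch below converts a current per MHz at a given supply voltage into operations per second per watt. The function name `mops_per_watt` and the `ops_per_cycle` parameter are assumptions; the source does not say how many simple operations it counts per instruction.

```python
def mops_per_watt(current_ua_per_mhz, vdd_volts, ops_per_cycle=1):
    """Millions of operations per second per watt.

    Power per MHz in microwatts is current (µA/MHz) times voltage (V).
    One MHz delivers ops_per_cycle million operations per second.
    """
    power_uw_per_mhz = current_ua_per_mhz * vdd_volts
    return ops_per_cycle * 1e6 / power_uw_per_mhz

# ADD 24-bit at 0.9 V, counted as a single operation per cycle.
add_single = mops_per_watt(102, 0.9)

# Same instruction counted as nine simple operations (an inferred count).
add_nine = mops_per_watt(102, 0.9, ops_per_cycle=9)
```

Under this formula, the 98k MOPS/Watt quoted for ADD would correspond to roughly nine simple operations per instruction. That count is an inference from the arithmetic, not a number stated in the source.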

33
Performance results: 64 pts Cpx FFT

Macgic: 250 clock cycles
CARMEL: 526 clock cycles
PalmDSPCore: 450 clock cycles
SC140 Starcore: 288 clock cycles
R.E.A.L DSP: 850 clock cycles
SP-5flex (3DSP): 500 clock cycles
TI C62x: 675 clock cycles
TI C64x: 276 clock cycles

34
Comparison
35
Introduction

Applications
Mapped
onto
Architectures

Embedded DSP and Multimedia
Design Methods
Low Power!

36
Design Architecture Exploration

Ed Deprettere, Professor
Bart Kienhuis, Assistant Professor
Leiden University
LIACS, The Netherlands

37
Embedded DSP Architectures
Synfora, PicoChip, Silicon Hive, Art Builder
CPU: a simple microprocessor
RPU: Reconfigurable Processing Unit
IP core: dedicated accelerator block
NoC: Network on a Chip
Weakly coupled processing elements
38
System Level Design

Three aspects are important in System Level Design:
- The Architecture
- The Application
- How the Application is mapped onto the Architecture
To optimize a system, you need to take all three aspects into consideration.
This is expressed in terms of the Y-chart.

39
Y-chart Approach
Tuning the architecture to get better performance
40
Y-chart Approach
[Figure: Y-chart. Labels: Applications, Architecture Instance, Mapping, Performance Analysis, Performance Numbers]
There are three different ways to improve a system.
41
Design Space Exploration
Decision Variables
Valuation System
Metrics
Acquisition of Insight
42
Design Space Exploration
Exploration is about finding the inverse relationship between Decision Variables and Metrics
[Figure: Y-chart. Labels: Applications, Architecture Instance, Mapping, Performance Analysis, Performance Numbers]
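
A toy sketch of such an exploration loop is given below: it sweeps the decision variables, evaluates each candidate, and keeps the lowest-power point that meets a throughput target. The function names `evaluate` and `explore` and the cost formulas are invented for illustration and do not come from the presentation or from any measurement.

```python
from itertools import product

def evaluate(num_units, clock_mhz):
    """Toy valuation system: invented formulas, not measured data."""
    throughput = num_units * clock_mhz                     # arbitrary units
    power = 0.5 * num_units * clock_mhz + 5.0 * num_units  # arbitrary units
    return throughput, power

def explore(unit_choices, clock_choices, min_throughput):
    """Go from a required metric back to decision variables.

    Returns (num_units, clock_mhz, power) of the lowest-power candidate
    that reaches min_throughput, or None if no candidate qualifies.
    """
    best = None
    for n, f in product(unit_choices, clock_choices):
        throughput, power = evaluate(n, f)
        if throughput >= min_throughput and (best is None or power < best[2]):
            best = (n, f, power)
    return best
```

In a real Y-chart flow, `evaluate` would be replaced by mapping the application onto an architecture instance and running performance analysis; the search loop around it stays the same.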
43
Y-chart Design For GP Processors
Architecture: ARM processor, Tensilica, x86 processor
Applications: in C
Mapping: compiler (GNU GCC)
Performance Analysis yields Performance Numbers

Numbers are
Clock Speed
Power Consumption
Throughput
Latency
44
Y-chart Design For DSP Applications
Architecture: Architecture Template, Xilinx FPGA
Applications: in VHDL/C
Mapping: Synplicity, Xilinx Foundation Tools
Companies like PICO and ART (Arm/Adelante) are trying to provide the Y-chart components as well.
Performance Analysis yields Performance Numbers

Numbers are
Clock Speed
Power Consumption
Throughput
Latency
45
How to improve performance

How can we improve the performance of the system we are interested in?
Others focus on the architecture; we want to focus on the application.
For a low-power architecture, parallelism is important:
- Exploiting more parallelism leads to faster calculations
- Using voltage and frequency scaling, we assume that power is saved
There is already a lot of theory developed to employ bit-level parallelism, instruction-level parallelism, and task-level parallelism.
Task-level parallelism especially is becoming more and more important for mapping DSP applications effectively onto the newly emerging architectures.
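
The assumption that parallelism plus voltage and frequency scaling saves power can be made concrete with the usual first-order dynamic power model P = C * V^2 * f. The sketch below is a simplification: the function names and the example voltages are hypothetical, and leakage, interconnect overhead, and the question of whether the lower voltage actually sustains the lower clock are all ignored.

```python
def dynamic_power(c_eff, vdd, freq):
    """First-order dynamic power model: P = C * V^2 * f."""
    return c_eff * vdd ** 2 * freq

def parallel_power_ratio(n_units, vdd_nominal, vdd_scaled):
    """Power of n_units at 1/n of the clock and a scaled supply voltage,
    relative to one unit at full clock and nominal voltage.
    Both configurations deliver the same throughput."""
    baseline = dynamic_power(1.0, vdd_nominal, 1.0)
    parallel = n_units * dynamic_power(1.0, vdd_scaled, 1.0 / n_units)
    return parallel / baseline

# Hypothetical example: two units at half the clock, supply halved.
ratio = parallel_power_ratio(2, 1.8, 0.9)
```

In this model the unit count and the clock reduction cancel, so the saving comes entirely from the quadratic dependence on the supply voltage.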

46
Programming Problem
Application
Programming
Compaan
Laura
47
Kahn Process Network (KPN)

Kahn Process Networks [Kahn 1974] [Parks & Lee 95]
- Processes run autonomously
- Communicate via unbounded FIFOs
- Synchronize via blocking read
Process is either
- executing (execute)
- communicating (send/get)
Characteristics
- Deterministic
- Distributed Control: no global scheduler is needed
- Distributed Memory: no memory contention
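
These properties can be imitated in a few lines of Python with threads as processes and `queue.Queue` objects as unbounded FIFOs. This is a minimal sketch, not Compaan or Laura output: the process names are hypothetical, and the `None` end-of-stream marker is an addition so that the example terminates.

```python
import queue
import threading

def producer(out_fifo, values):
    """Process A: sends a stream of tokens, then an end marker."""
    for v in values:
        out_fifo.put(v)          # unbounded FIFO, so the write never blocks
    out_fifo.put(None)

def scaler(in_fifo, out_fifo, gain):
    """Process B: blocking read, execute, send."""
    while True:
        v = in_fifo.get()        # blocks until a token is available
        if v is None:
            out_fifo.put(None)
            return
        out_fifo.put(gain * v)

def run_network(values, gain=2):
    a_to_b = queue.Queue()       # maxsize 0 means unbounded
    b_to_c = queue.Queue()
    threads = [
        threading.Thread(target=producer, args=(a_to_b, values)),
        threading.Thread(target=scaler, args=(a_to_b, b_to_c, gain)),
    ]
    for t in threads:
        t.start()
    results = []
    while True:                  # the caller acts as the consuming process
        v = b_to_c.get()
        if v is None:
            break
        results.append(v)
    for t in threads:
        t.join()
    return results
```

No global scheduler appears anywhere: each process advances whenever its input FIFO has data, and the output stream is the same regardless of how the threads are interleaved.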

48
Kahn Process Network (KPN)
Fifo
CPU 1
FPGA A
Process A
Process B
Fifo
Fifo
Fifo
Fifo
FPGA B
Process C

Autonomously operating processes: no global schedule needed
Blocking read: simple to realize in hardware
Buffer sizes of the FIFOs are quite often very small

49
Matlab to Process Networks to FPGA
Matlab Program
C / SystemC / YAPI
Compaan Compiler
Kahn Process Network
Laura
Functional Simulation in Ptolemy II
Synthesizable VHDL
50
Matlab to Matlab Transformations

To make the flow from Matlab to FPGA interesting, we had to give the designers means to change the characteristics:
- Unrolling (Unfolding): increases parallelism
- Retiming (Skewing): improves pipeline behavior
- Clustering (Merging): reduces parallelism
All these operations can be applied at the source level of Matlab, leading to a new Matlab program

51
Y-chart Design For DSP Applications
Algorithmic Toolbox Retiming UnrollingMerging
Matlab Program
Compaan
Laura Synplicity Xilinx Foundation
Process Network
Xilinx/Altera FPGA
Performance Analysis
Performance Numbers
52
Case Study

We use the Y-chart environment on a real case study
Adaptive QR [DAES 2002]
Using commercial IP cores (QinetiQ Ltd.)
- Vectorize: 42 pipeline stages
- Rotate: 55 pipeline stages
QR is interesting as it requires deeply pipelined IP cores
Most design tools have difficulties with such IP cores
We will explore a number of simple steps to improve the performance of the QR algorithm
Was reported to run at 12 Mflops.

53
Example: Adaptive QR (Step 1)
for k = 1:1:21,
  for j = 1:1:7,
    for i = j:1:7,
      if i <= j,
        [r(j,j), rr(j,j), a(j), b(j), d(k)] = bcell( r(j,j), rr(j,j), x(k,j), d(k) )
      else
        [r(j,i), x(k,i), a(i), b(i)] = icell( r(j,i), x(k,i), a(i), b(i) )
      end
    end
  end
end

This algorithm runs at 60 Mflops @ 100 MHz, out-of-the-box. This is more than the 12 Mflops of [DAES2002], due to improved hardware design of the processes and controllers.
Still, the efficiency of the pipelines of bcell/icell is very poor (<1, for example), due to the self-loop created by variables a and b.

54
Example: Adaptive QR (Step 2)
for k = 1:1:21,
  for j = 1:1:7,
    for i = j:1:7,
      if i <= j,
        [r(j,j), rr(j,j), a(j), b(j), d(k)] = bcell( r(j,j), rr(j,j), x(k,j), d(k) )
      else
        a(i) = Pass_qr( a(i-1) )
        b(i) = Pass_qr( b(i-1) )
        [r(j,i), x(k,i)] = icell( r(j,i), x(k,i), a(i-1), b(i-1) )
      end
    end
  end
end

We broke the self-loops on variables a and b by introducing a pass function. This gave a 5 improvement, leading to 67.7 Mflops.
Still, the efficiency, bcell (3.76) / icell (1.56), is very poor.

55
Example: Adaptive QR (Step 3)
for k = 1:1:21,
  for j = 1:1:7,
    for i = j:1:7,
      for p = 1:1:10,
        if i <= j,
          [r(p,j,i), rr(p,j,i), a(p,i), b(p,i), d(p,k)] = bcell( r(p,j,i), rr(p,j,i), x(p,k,i), d(p,k) )
        else
          a(p,i) = Pass_qr( a(p,i-1) )
          b(p,i) = Pass_qr( b(p,i-1) )
          [r(p,j,i), x(p,k,i)] = icell( r(p,j,i), x(p,k,i), a(p,i-1), b(p,i-1) )
        end
      end
    end
  end
end

To fill the pipelines, we run independent problems over the same resources (sometimes referred to as strip mining). This fills the pipelines more efficiently, leading to 472.55 Mflops (bcell 8.2 / icell 24.2).
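
Why interleaving independent problems helps can be seen from a deliberately simplified model: if each problem may only issue a new operation once its previous result has left the pipeline, a pipeline of depth D holds at most one operation per problem. The function below encodes just that idea; its name is hypothetical, and the model is not expected to reproduce the measured efficiencies quoted on the slides.

```python
def pipeline_utilization(depth, independent_problems):
    """Fraction of pipeline stages doing useful work.

    Simplified model: each problem has one operation in flight at a time,
    so at most min(independent_problems, depth) stages are occupied.
    """
    in_flight = min(independent_problems, depth)
    return in_flight / depth

# Rotate core with 55 pipeline stages (depth taken from the case study).
single_problem = pipeline_utilization(55, 1)
ten_problems = pipeline_utilization(55, 10)
```

In this model, going from one problem to ten multiplies utilization by ten, and utilization saturates once the number of independent problems reaches the pipeline depth.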

56
Example: Adaptive QR (Step 4)
for k = 1:1:21,
  for j = 1:1:7,
    for i = j:1:7,
      for p = 1:1:10,
        if i <= j,
          [r(p,j,i), rr(p,j,i), a(p,i), b(p,i), d(p,k)] = bcell( r(p,j,i), rr(p,j,i), x(p,k,i), d(p,k) )
        else
          a(p,i) = Pass_qr( a(p,i-1) )
          b(p,i) = Pass_qr( b(p,i-1) )
          if mod(p,2) <= 0,
            [r(p,j,i), x(p,k,i)] = icell( r(p,j,i), x(p,k,i), a(p,i-1), b(p,i-1) )
          else
            [r(p,j,i), x(p,k,i)] = icell( r(p,j,i), x(p,k,i), a(p,i-1), b(p,i-1) )
          end
        end
      end
    end
  end
end

The icell and bcell workload is not balanced. The bcell produces more work than the icell can process. Therefore, we unroll the icell, leading to 673.81 Mflops (bcell 11.8 / icell 2 x 17.4 = 34.8).

57
Conclusions

Optimizing a system-level design for low power requires that you look at the architecture, the mapping, and the application.
The Y-chart gives a simple framework to tune a system for optimal performance.
The Y-chart forms the basis for DSE.
We showed that by changing the way applications are written, we get, in a number of steps, an order of magnitude better performance: 60 Mflops -> 673 Mflops. This without changing the architecture!

58
Introduction

Applications
Mapped
onto
Architectures

Embedded DSP and Multimedia
Design Methods
Low Power!

59
Domain-Specific Co-design Environments

Patrick Schaumont, UCLA

60
Let's waste no time!
You
61
Design of the RINGS Multiprocessor
[Figure: RINGS multiprocessor. Labels: Core (x2), Hardware Co-processor, NoC Interface (x3), NoC Routers, Topology]

Need an environment that combines
Instruction-set Simulation (multi-core)
Hardware Simulation

62
ARMZILLA
[Figure: ARMZILLA. Labels: Hardware Processors, Network On Chip, C (x3), FDL (x2), Cross Compiler, EXE (x3), Config, Configuration Unit, ARM ISS (x3), GEZEL Kernel, Memory-mapped Channels]
63
ARMZILLA
[Figure: ARMZILLA. Labels: Hardware Processors, Network On Chip, FDL (x2), Config, Configuration Unit, GEZEL Kernel, Memory-mapped Channels]
64
Requirements for an ISS in a multiprocessor simulator
[Figure labels: Hardware Simulation Kernel, memorywrite( ), memoryread( ), ISS]
Linkable
Reentrant
65
ARMZILLA
[Figure: ARMZILLA. Labels: Hardware Processors, Network On Chip, C (x3), FDL (x2), Cross Compiler, EXE (x3), ARM ISS (x3), GEZEL Kernel]
66
Memory-mapped channels: easy in C
volatile int *k = (volatile int *) 0x80000000;
b = *k; // read
Software on core: Initialized Pointers
[Figure labels: ARM ISS, Memory bus]
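
On the simulator side, a memory-mapped channel amounts to intercepting reads and writes at chosen addresses and routing them to a hardware model instead of ordinary memory. The sketch below illustrates that dispatch. The class names `Channel` and `Bus`, the method signatures, and returning 0 from an empty channel are assumptions of this sketch; only the address 0x80000000 and the names `memoryread` and `memorywrite` appear on the slides, and this is not the GEZEL or ARMZILLA API.

```python
from collections import deque

CHANNEL_BASE = 0x80000000  # address used in the C example on the slide

class Channel:
    """Hypothetical hardware-side endpoint backed by a simple FIFO."""
    def __init__(self):
        self.fifo = deque()

    def write(self, value):
        self.fifo.append(value)

    def read(self):
        # A real channel would likely stall; returning 0 keeps the sketch simple.
        return self.fifo.popleft() if self.fifo else 0

class Bus:
    """Routes ISS memory accesses either to RAM or to a mapped channel."""
    def __init__(self):
        self.ram = {}
        self.channels = {}

    def map_channel(self, address, channel):
        self.channels[address] = channel

    def memorywrite(self, address, value):
        if address in self.channels:
            self.channels[address].write(value)
        else:
            self.ram[address] = value

    def memoryread(self, address):
        if address in self.channels:
            return self.channels[address].read()
        return self.ram.get(address, 0)
```

With this arrangement the software on the core needs nothing beyond an initialized pointer: a load from `CHANNEL_BASE` becomes a `memoryread` call that the bus forwards to the hardware model.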
67
Hardware Simulation Kernel
[Figure: ARMZILLA. Labels: C (x3), Cross Compiler, EXE (x3), Config, Configuration Unit, ARM ISS (x3), Memory-mapped Channels]
68
GEZEL Hardware Simulation Kernel has Hybrid Architecture
[Figure: GEZEL kernel. Labels: GEZEL Language (.fdl), parser, symbol table, code generators, cycle simulator, HDL translation, C, sim API, syn API, SH3DSP, LEON2 Sparc, ARM, SystemC, VHDL]
69
The PONG Example
paddle
paddle
ball

Designed in 1967 by Baer - interactive TV feature
1977, General Instruments AY-3-8500
pong-in-a-chip
Magnavox, Coleco, Atari, Philips, URL, GHP, ...

70
Multiprocessor Model of PONG
Four-processor Model
[Figure labels: paddle, paddle, ball, BALL (Ball dynamics), FIELD (Rendering)]
71
Multiprocessor Operation - Initialize
Initialization Sequence
PADDLE1
PADDLE2
FIELD
BALL
TIME
72
Multiprocessor Operation - Play
Play Sequence
PADDLE1
PADDLE2
FIELD
BALL
(CONDITIONAL)LOOP
73
Let's play!
74
Multiprocessor Architecture
Point-to-point Model
[Figure labels: DISPLAY HW, PADDLE1, BALL, PADDLE2, FIELD, memory-mapped (x4), O/I (x4)]
75
Multiprocessor Architecture
Network-on-Chip Model
[Figure labels: PADDLE1, BALL, PADDLE2, FIELD, memory-mapped (x4), DISPLAY HW, 1D Router (x4)]
76
Links
SimIt-ARM ISS (W. Qin, Princeton University): http://www.ee.princeton.edu/wqin/armsim.htm
Cross Compiler arm-linux-gcc: ftp://ftp.arm.linux.org.uk
GEZEL, ARMZILLA: http://www.ee.ucla.edu/schaum/gezel
All these tools are free! (free as in freedom, not as in free beer)
77
Conclusion