Architectures and Design Techniques for Energy Efficient Embedded DSP and Multimedia

Slides: 79
Provided by: jau53
1
Architectures and Design Techniques for Energy
Efficient Embedded DSP and Multimedia
  • Ingrid Verbauwhede, Christian Piguet, Bart
    Kienhuis/Ed Deprettere, Patrick Schaumont

2
Outline
  • Intro
  • Low Power Observations
  • SoC Architectures (Ingrid)
  • Low Power Components (Christian)
  • Design Methods (Ed)
  • Design Methods (Patrick)
  • Conclusion

3
Introduction
  • Applications
  • Mapped
  • onto
  • Architectures
  • Embedded DSP Multimedia
  • Design Methods
  • Low Power!

Ingrid Verbauwhede, UCLA, K.U.Leuven
4
Low Power Observation 1: architecture tuned to application
5
Observation 2: Energy-flexibility trade-off
[Figure: application power and cost across the range from ASIC (fixed), through platform (marked "???"), to general purpose.]
6
Example: DSP processors
  • Specialized instructions: MAC
  • Dedicated co-processors: Viterbi acceleration

7
FIR on TI C55x: Dual MAC
FIR filter: two outputs in parallel with 3 busses
y(0) = c(0)x(0) + c(1)x(-1) + c(2)x(-2) + . . . + c(N-1)x(1-N)
y(1) = c(0)x(1) + c(1)x(0) + c(2)x(-1) + . . . + c(N-1)x(2-N)
y(2) = c(0)x(2) + c(1)x(1) + c(2)x(0) + . . . + c(N-1)x(3-N)
y(3) = c(0)x(3) + c(1)x(2) + c(2)x(1) + . . . + c(N-1)x(4-N)
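
The dual-MAC idea above can be sketched in a few lines of Python. This is only a behavioural model, not C55x code: the function name `fir_dual_mac`, the zero-padding of samples outside the input, and the list-based indexing are assumptions made for illustration. Each loop iteration performs three reads (one coefficient, two samples), mirroring the three busses, and feeds two accumulators so that y(n) and y(n+1) are produced together.

```python
def fir_dual_mac(c, x, n):
    """Return (y(n), y(n+1)) for coefficients c and samples x.

    Samples outside the range of x are treated as zero (an assumption
    of this sketch, not something stated on the slide).
    """
    def sample(k):
        return x[k] if 0 <= k < len(x) else 0

    acc0 = 0  # accumulates y(n)
    acc1 = 0  # accumulates y(n+1)
    for k, coeff in enumerate(c):   # read 1: shared coefficient c(k)
        x0 = sample(n - k)          # read 2: x(n-k) for the first MAC
        x1 = sample(n + 1 - k)      # read 3: x(n+1-k) for the second MAC
        acc0 += coeff * x0
        acc1 += coeff * x1
    return acc0, acc1


def fir_single(c, x, n):
    """Reference: one output per pass, one MAC per coefficient."""
    return sum(coeff * (x[n - k] if 0 <= n - k < len(x) else 0)
               for k, coeff in enumerate(c))
```

Because the coefficient is fetched once and used by both multiply-accumulates, the number of coefficient reads per output sample is halved compared with `fir_single`.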
8
Energy comparison
Total energy for one output sample
  • Adaptation of the datapath: MAC, DMAC
  • Adaptation of the memory architecture and bus network
  • Adaptation of the instruction set
9
Viterbi on TI C54x
ALU and CSSU: CMPS instruction
G = min(G1 + m1, G2 + m2)
  • ALU splits in 16-bit halves
  • ACC splits in half
  • Shortest distance saved
  • CSSU compares halves
  • Path indicator saved
  • 4 cycles / butterfly

[Figure: datapath with labels DB1(16), DB0(16), TREG (m1, m2), ALU (G1, G2), Accumulator, Comp, MSW/LSW Select, decision bit, G, TRN reg, Data bus EB (to memory).]
Source: TI Application Report, Viterbi Decoding
in the TMS320C54x family, document SPRA071
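
The add-compare-select step summarized on this slide can be modelled as follows. This is a hedged behavioural sketch, not the CMPS instruction itself: the function names, the 16-bit width of the decision register, and the left-shift direction are assumptions, and the selection follows the slide's "min" convention.

```python
def acs_butterfly(g1, g2, m1, m2):
    """Add-compare-select: return (surviving metric, decision bit).

    g1, g2 are the two incoming path metrics; m1, m2 the branch metrics.
    The decision bit is 0 when the first path survives, 1 otherwise.
    """
    cand1 = g1 + m1   # "add" for the first path
    cand2 = g2 + m2   # "add" for the second path
    if cand2 < cand1:            # "compare"
        return cand2, 1          # "select" second path
    return cand1, 0              # "select" first path (also on ties)


def shift_in_decision(trn, decision_bit, width=16):
    """Shift one decision bit into a transition register of `width` bits."""
    return ((trn << 1) | decision_bit) & ((1 << width) - 1)
```

On the hardware described by the slide, the two additions happen in parallel in the split ALU and the compare/select is done by the CSSU, which is how the butterfly fits in four cycles.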
10
Observation: general-purpose architectures also
become heterogeneous.
Source: Xilinx webpage
11
Question
  • Energy and flexibility are opposing demands!
  • How to navigate in this jungle?
  • 3D design space: computational abstraction level,
    binding rate, reconfigurable feature
  • Next question: how to map (or compile) an
    application onto such an architecture?
12
Flexibility (1) - Abstraction level
Computational Abstraction Level
  • Instruction set level: programmable
  • CLB level: reconfigurable

13
Flexibility (2) - Reconfigurable feature
  • Basic components

Reconfigurable feature (columns) by computational abstraction level (rows):
Abstraction level            | Communication            | Storage          | Computation
Systems                      | Interconnect network     | Memory hierarchy | Number, type of processes
Instruction set architecture | Size of address/data bus | Register set     | Custom instructions
Micro-architecture           | Cross-bar, busses        | Register file    | Execution unit type
Implementation               | Switches, muxes          | RAM details      | CLB
14
Flexibility (3) - Binding rate
Binding rate
  • Compare processing to binding
  • Configurable (compile-time)
  • Re-configurable
  • Dynamic reconfigurable (adaptive)

15
SOC architecture RINGS
16
Instruction set extension
  • Instruction set extension
  • Register mapped
  • Tightly coupled
  • Experiment: DFT

1000 iterations | SW on embedded proc. | SW with HW datapath | Improvement
Energy          | 67.6 mJ              | 5.76 mJ             | 12.5 times
17
Co-processor
  • Memory mapped
  • Loosely coupled
  • Experiment: AES

[Figure label: Local Memory]
175 iterations | SW on embedded proc. | SW with HW datapath | Improvement
Energy         | 89.2 mJ              | 13.5 mJ             | 25 times
18
Independent IP
  • Loosely coupled
  • Network on chip connected
  • Flexible interconnect
  • Experiment: TCP/IP checksum

[Figure labels: router, router]
100 packets | SW on embedded proc. | HW datapath | Improvement
Energy      | 17.0 mJ              | 0.20 mJ     | 84 times
19
Communication: Energy-flexibility
[Daly-DAC01]
  • Also an energy-flexibility conflict!
  • General purpose: NoC tiles
  • FPGA: general purpose
  • Therefore: domain-specific NoC

20
Conclusion
  • Low Power by going domain-specific
  • Energy-flexibility conflict
  • How to program this RINGS?
  • Next: Ultra-low-power components (Christian)
  • Design exploration (Ed)
  • Co-design environment (Patrick)

21
Introduction
  • Applications
  • Mapped
  • onto
  • Architectures
  • Embedded DSP Multimedia
  • Design Methods
  • Low Power!

22
Efficient Embedded DSP: Ultra-Low-Power
Components
  • Christian Piguet, CSEM

23
Ultra Low Power DSP Processors
  • The design of DSP processors is very challenging,
    as it has to take into account contradictory
    goals:
  • an increased throughput request
  • at a reduced energy budget
  • New issues arise from very deep submicron
    technologies, such as interconnect delays and
    leakage
  • History of hearing aid circuits:
  • analog filters 15 years ago
  • digital ASIC-like circuits 5 years ago
  • powerful DSP processors today, below 1 V and
    1 mW

24
DSP Architectures for Low-Power
  • Single-MAC DSP cores of 5-10 years ago
  • Parallel architectures with several MACs working
    in parallel
  • VLIW or multitask DSP architectures
  • Benchmark:
  • number of simple operations executed per clock
    cycle, up to 50 or more
  • Drawbacks of VLIW:
  • very large instruction words, up to 256 bits
  • some instructions in the set are still missing
  • the transistor count is not favorable for reducing
    leakage

25
VLIW: TMS320C6x (VelociTI)
Peak:
  • C62: about 1000 MIPS/watt (5000 MOPS/watt)
  • C64: about 5000 MIPS/watt (25000 MOPS/watt)
  • About 5 Op/Instr.
26
3 Ways to be more Energy Efficient
  • To design specific, very small DSP engines for
    each task, in such a way that each DSP task is
    executed in the most energy-efficient way on the
    smallest piece of hardware (N co-processors)
  • To design reconfigurable architectures such as
    the DART cluster, in which configuration bits
    allow the user to modify the hardware so that it
    fits the executed algorithms much better.

27
Co-processors
  • Best option regarding power
  • Minimal number of transistors and transitions
    to perform a task
  • Control code on a microcontroller
  • Main issue is the software mapping of a given
    application onto so many heterogeneous processors
    and co-processors
  • Cut-off for leakage reduction
  • Minicores: Ph.D. of O. Paker,
    DTU, Lyngby, June 2002
  • IIR: 73000 MOPS/watt
  • FIR: 33000 MOPS/watt

28
DART (O. Sentieys, ENSSAT)
  • RDPx
  • RDP: Reconfigurable DataPath
  • Many identical co-processors or hardware
    accelerators for parallel tasks
  • Reconfigurable interconnections
  • In 0.18 µm, 1.9 V, 130 MHz, WCDMA: 6.2 GOPS,
    40000 MOPS/watt
  • FPGA Xcv200E: 4000 MOPS/watt
  • DSP C64: 3000 MOPS/watt

29
Reconfigurable DSP Architectures
  • Not an FPGA, and much more efficient than an FPGA.
    The key point is to reconfigure only a limited
    number of units:
  • Reconfigurable datapath
  • Reconfigurable interconnections
  • Reconfigurable Address Generation Units (AGU)
  • FPGA comparison:
  • MACGIC DSP consumes 1 mW/MHz in 0.18 µm
  • The same MACGIC in an Altera Stratix consumes
    10 mW/MHz plus 900 mW of static power, so
    1000 mW at 10 MHz
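
The arithmetic behind the MACGIC comparison can be written out as a short sketch. The function name and the simple "dynamic per MHz times frequency plus static" model are assumptions for illustration; the input figures are the ones quoted on the slide, and the dedicated implementation is assumed to have negligible static power because the slide gives none.

```python
def total_power_mw(dynamic_mw_per_mhz, freq_mhz, static_mw=0.0):
    """Total power = frequency-proportional dynamic power + static power."""
    return dynamic_mw_per_mhz * freq_mhz + static_mw


# Figures from the slide, both evaluated at 10 MHz.
asic_mw = total_power_mw(1.0, 10.0)           # MACGIC in 0.18 um
fpga_mw = total_power_mw(10.0, 10.0, 900.0)   # same design in Altera Stratix

print(asic_mw, fpga_mw, fpga_mw / asic_mw)
```

At this low clock rate the static term dominates the FPGA figure, which is why the gap is far larger than the 10x difference in dynamic power alone.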

30
Reconfigurable Datapaths
31
Reconfigurable Addressing Modes
  • Operand fetch is generally a severe bottleneck
    in parallel machines, for which 8-16 operands are
    required each clock cycle.
  • Sophisticated addressing modes can be dynamically
    reconfigured depending on the DSP task to be
    executed.

PR2 PR1 PR0 OFFA W
  • Examples:
  • an <- (anon)mn OFFA
  • an <- (anOFFA)mn 1
  • an <- (an1)mn OFFA
  • an <- (an1)mn

Selection of 3 operations
(MACGIC)
32
MACGIC Performance results
  • Power consumption for a 24-bit, 10 MHz synthesis
    @ 0.9 V in the 0.18 µm TSMC technology (SYNOPSYS
    MACHTA/PA simulations):
  • NOP: 25 µA/MHz
  • ADD 24-bit: 102 µA/MHz, 98k MOPS/Watt
  • MAC 24-bit/56-bit: 137 µA/MHz, 81k MOPS/Watt
  • 4 ADD 24-bit: 167 µA/MHz, 120k MOPS/Watt
  • 4 MAC 24-bit/56-bit: 283 µA/MHz, 86k MOPS/Watt
  • MACV 24-bit/56-bit: 269 µA/MHz, 90k MOPS/Watt
  • CBFY4 radix-4 FFT: 273 µA/MHz, 131k MOPS/Watt
  • Number of transistors for this 24-bit version:
    600000
  • Number of transistors for a 16-bit version:
    400000

33
Performance results: 64-point complex FFT
  • MACGIC: 250 clock cycles
  • CARMEL: 526 clock cycles
  • PalmDSPCore: 450 clock cycles
  • SC140 StarCore: 288 clock cycles
  • R.E.A.L. DSP: 850 clock cycles
  • SP-5flex (3DSP): 500 clock cycles
  • TI C62x: 675 clock cycles
  • TI C64x: 276 clock cycles

34
Comparison
35
Introduction
  • Applications
  • Mapped
  • onto
  • Architectures
  • Embedded DSP Multimedia
  • Design Methods
  • Low Power!

36
Design Architecture Exploration
  • Ed Deprettere, Professor
  • Bart Kienhuis, Assistant Professor
  • Leiden University
  • LIACS, The Netherlands

37
Embedded DSP Architectures
Examples: Synfora, PicoChip, Silicon Hive, Art Builder
  • CPU: a simple microprocessor
  • RPU: Reconfigurable Processing Unit
  • IPcore: dedicated accelerator block
  • NoC: Network on a Chip
Weakly coupled processing elements
38
System Level Design
  • Three aspects are important in System Level
    Design:
  • The Architecture
  • The Application
  • How the Application is mapped onto the
    Architecture
  • To optimize a system, you need to take all three
    aspects into consideration.
  • This is expressed in terms of the Y-chart.

39
Y-chart Approach
Tuning the architecture to get better performance
40
Y-chart Approach
[Figure: Y-chart with labels Applications, Architecture Instance, Mapping, Performance Analysis, Performance Numbers.]
There are three different ways to improve a
system.
41
Design Space Exploration
Decision Variables
Valuation System
Metrics
Acquisition of Insight
42
Design Space Exploration
Exploration is about finding the inverse
relationship between Decision Variables and Metrics.
[Figure: Y-chart with labels Applications, Architecture Instance, Mapping, Performance Analysis, Performance Numbers.]
43
Y-chart Design For GP Processors
[Figure: Y-chart instantiated for general-purpose processors.]
  • Architecture: ARM processor, Tensilica, x86 processor
  • Applications: in C
  • Mapping: compiler (GNU GCC)
  • Performance Analysis yields Performance Numbers:
  • Clock Speed
  • Power Consumption
  • Throughput
  • Latency
44
Y-chart Design For DSP Applications
[Figure: Y-chart instantiated for DSP applications.]
  • Architecture template: Xilinx FPGA
  • Applications: in VHDL/C
  • Mapping: Synplicity, Xilinx Foundation tools
  • Performance Analysis yields Performance Numbers:
  • Clock Speed
  • Power Consumption
  • Throughput
  • Latency
Companies like PICO and ART (Arm/Adelante) are
trying to provide the Y-chart components as well.
45
How to improve performance
  • How can we improve the performance of the system
    we are interested in?
  • Others focus on the architecture; we want to focus
    on the application.
  • For a low-power architecture, parallelism is
    important:
  • Exploiting more parallelism leads to faster
    calculations
  • Using voltage and frequency scaling, we assume
    that power is saved
  • A lot of theory has already been developed to
    exploit bit-level parallelism, instruction-level
    parallelism and task-level parallelism.
  • Task-level parallelism in particular is becoming
    more and more important for effectively mapping DSP
    applications onto the new emerging architectures.
46
Programming Problem
Application
Programming
Compaan
Laura
47
Kahn Process Network (KPN)
  • Kahn Process Networks [Kahn 1974] [Parks & Lee 95]
  • Processes run autonomously
  • Communicate via unbounded FIFOs
  • Synchronize via blocking read
  • A process is either:
  • executing (execute)
  • communicating (send/get)
  • Characteristics:
  • Deterministic
  • Distributed control
  • No global scheduler is needed
  • Distributed memory
  • No memory contention
48
Kahn Process Network (KPN)
[Figure: Process A on CPU 1, Process B on FPGA A and Process C on FPGA B, connected by FIFOs.]
  • Autonomously operating processes: no global
    schedule needed
  • Blocking read: simple to realize in hardware
  • Buffer sizes of the FIFOs are quite often very
    small
49
Matlab to Process Networks to FPGA
Matlab Program
C / SystemC / YAPI
Compaan Compiler
Kahn Process Network
Laura
Functional Simulation in Ptolemy II
Synthesizable VHDL
50
Matlab to Matlab Transformations
  • To make the flow from Matlab to FPGA interesting,
    we had to give designers the means to change the
    characteristics of the network:
  • Unrolling (unfolding): increases parallelism
  • Retiming (skewing): improves pipeline behavior
  • Clustering (merging): reduces parallelism
  • All these operations can be applied at the source
    level of Matlab, leading to a new Matlab program
51
Y-chart Design For DSP Applications
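[Figure: Y-chart for the Compaan/Laura flow.]
  • Algorithmic toolbox: retiming, unrolling, merging
  • Application: Matlab program, converted by Compaan
    into a process network
  • Mapping: Laura, Synplicity, Xilinx Foundation
  • Architecture: Xilinx/Altera FPGA
  • Performance Analysis yields Performance Numbers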
52
Case Study
  • We use the Y-chart environment on a real case
    study:
  • Adaptive QR [DAES 2002]
  • Using commercial IP cores from QinetiQ Ltd.:
  • Vectorize: 42 pipeline stages
  • Rotate: 55 pipeline stages
  • QR is interesting as it requires deeply pipelined
    IP cores
  • Most design tools have difficulties with such IP
    cores
  • We will explore a number of simple steps to
    improve the performance of the QR algorithm
  • It was reported to run at 12 Mflops.

53
Example: Adaptive QR (Step 1)
for k = 1:1:21,
   for j = 1:1:7,
      for i = j:1:7,
         if i <= j,
            [r(j,j), rr(j,j), a(j), b(j), d(k)] = bcell( r(j,j), rr(j,j), x(k,j), d(k) );
         else
            [r(j,i), x(k,i), a(i), b(i)] = icell( r(j,i), x(k,i), a(i), b(i) );
         end
      end
   end
end
  • This algorithm runs at 60 Mflops @ 100 MHz,
    out of the box. This is more than the 12 Mflops of
    [DAES 2002], due to improved hardware design of the
    processes and controllers.
  • Still, the efficiency of the bcell/icell pipelines
    is very poor (< 1%), for example because of
    the self-loop created by variables a and b.
54
Example: Adaptive QR (Step 2)
for k = 1:1:21,
   for j = 1:1:7,
      for i = j:1:7,
         if i <= j,
            [r(j,j), rr(j,j), a(j), b(j), d(k)] = bcell( r(j,j), rr(j,j), x(k,j), d(k) );
         else
            [a(i)] = Pass_qr( a(i-1) );
            [b(i)] = Pass_qr( b(i-1) );
            [r(j,i), x(k,i)] = icell( r(j,i), x(k,i), a(i-1), b(i-1) );
         end
      end
   end
end
  • We broke the self-loops on variables a and b
    by introducing a pass function. This gave a 5%
    improvement, leading to 67.7 Mflops.
  • Still, the efficiency of bcell (3.76%) / icell
    (1.56%) is very poor.

55
Example: Adaptive QR (Step 3)
for k = 1:1:21,
   for j = 1:1:7,
      for i = j:1:7,
         for p = 1:1:10,
            if i <= j,
               [r(p,j,i), rr(p,j,i), a(p,i), b(p,i), d(p,k)] = bcell( r(p,j,i), rr(p,j,i), x(p,k,i), d(p,k) );
            else
               [a(p,i)] = Pass_qr( a(p,i-1) );
               [b(p,i)] = Pass_qr( b(p,i-1) );
               [r(p,j,i), x(p,k,i)] = icell( r(p,j,i), x(p,k,i), a(p,i-1), b(p,i-1) );
            end
         end
      end
   end
end
  • To fill the pipelines, we run independent
    problems over the same resources (sometimes
    referred to as strip mining). This fills the
    pipelines more efficiently, leading to 472.55
    Mflops (bcell 8.2% / icell 24.2%).
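
Why interleaving independent problems fills a deep pipeline can be seen with a toy cycle-level model. This is a sketch under strong assumptions and is not a model of the actual QR network: each problem is taken to be a chain of dependent operations, so its next operation can only issue once the previous one has left the pipeline, and the issue policy and function name are invented for illustration.

```python
def pipeline_utilization(depth, problems, ops_per_problem):
    """Fraction of cycles in which a pipelined unit accepts a new operation.

    depth           -- number of pipeline stages (result latency in cycles)
    problems        -- independent problems interleaved on the same unit
    ops_per_problem -- dependent operations in each problem's chain
    """
    ready_at = [0] * problems              # earliest issue cycle per problem
    remaining = [ops_per_problem] * problems
    cycle = 0
    issued = 0
    while any(remaining):
        for p in range(problems):
            if remaining[p] and ready_at[p] <= cycle:
                ready_at[p] = cycle + depth   # wait for this result
                remaining[p] -= 1
                issued += 1
                break                         # one issue slot per cycle
        cycle += 1
    return issued / cycle


# 55 stages as quoted for the Rotate core; 1 versus 10 interleaved problems.
single = pipeline_utilization(55, 1, 100)
interleaved = pipeline_utilization(55, 10, 100)
print(single, interleaved)
```

In this model ten interleaved problems raise utilization by close to a factor of ten, and utilization only reaches 100% once the number of independent problems matches the pipeline depth.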

56
Example: Adaptive QR (Step 4)
for k = 1:1:21,
   for j = 1:1:7,
      for i = j:1:7,
         for p = 1:1:10,
            if i <= j,
               [r(p,j,i), rr(p,j,i), a(p,i), b(p,i), d(p,k)] = bcell( r(p,j,i), rr(p,j,i), x(p,k,i), d(p,k) );
            else
               [a(p,i)] = Pass_qr( a(p,i-1) );
               [b(p,i)] = Pass_qr( b(p,i-1) );
               if mod(p,2) <= 0,
                  [r(p,j,i), x(p,k,i)] = icell( r(p,j,i), x(p,k,i), a(p,i-1), b(p,i-1) );
               else
                  [r(p,j,i), x(p,k,i)] = icell( r(p,j,i), x(p,k,i), a(p,i-1), b(p,i-1) );
               end
            end
         end
      end
   end
end
  • The icell and bcell workload is not balanced.
    The bcell produces more work than the icell can
    process. Therefore, we unroll the icell, leading
    to 673.81 Mflops (bcell 11.8% / icell 2 x 17.4% =
    34.8%).

57
Conclusions
  • Optimizing a System Level Design for low power
    requires that you look at the architecture, the
    mapping, and the application.
  • The Y-chart gives a simple framework to tune a
    system for optimal performance.
  • The Y-chart forms the basis for DSE.
  • We showed that by changing the way
    applications are written, we get, in a number of
    steps, an order of magnitude better performance:
    60 Mflops -> 673 Mflops. This without changing the
    architecture!

58
Introduction
  • Applications
  • Mapped
  • onto
  • Architectures
  • Embedded DSP Multimedia
  • Design Methods
  • Low Power!

59
Domain-Specific Co-design Environments
  • Patrick Schaumont, UCLA

60
Let's waste no time!
You
61
Design of the RINGS Multiprocessor
[Figure: two cores and a hardware co-processor, each attached through a NoC interface to the NoC routers and topology.]
  • Need an environment that combines:
  • Instruction-set simulation (multi-core)
  • Hardware simulation

62
ARMZILLA
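[Figure: ARMZILLA overview. C sources go through the cross compiler to executables (EXE) that run on ARM ISS instances; FDL descriptions of the hardware processors and the network on chip run on the GEZEL kernel; a Config file drives the configuration unit; the ISSs and the kernel are connected by memory-mapped channels.]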
63
ARMZILLA
[Figure: ARMZILLA, hardware side. Labels: hardware processors, network on chip, FDL, Config, configuration unit, GEZEL kernel, memory-mapped channels.]
64
Requirements for an ISS in a multiprocessor
simulator
  • Linkable with the hardware simulation kernel
  • Exposes memoryread( ) and memorywrite( )
  • Reentrant
65
ARMZILLA
[Figure: ARMZILLA, software side. Labels: hardware processors, network on chip, C sources, FDL, cross compiler, EXE, ARM ISS instances, GEZEL kernel.]
66
Memory-mapped channels: easy in C
volatile int *k = (volatile int *) 0x80000000;
b = *k; // read
Software on core: initialized pointers
ARM ISS
Memorybus
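
How an instruction-set simulator might route such an access to a hardware model can be sketched as follows. This is a toy model only: the class names, the dictionary-backed RAM, and the single-word channel are assumptions for illustration and do not reflect the ARMZILLA or GEZEL interfaces. Only the address 0x80000000 and the hook names memoryread/memorywrite are taken from the slides.

```python
CHANNEL_ADDRESS = 0x80000000  # address used in the slide's C fragment


class HardwareChannel:
    """Hypothetical one-word channel shared with a hardware model."""

    def __init__(self):
        self.value = 0

    def read(self):
        return self.value

    def write(self, value):
        self.value = value


class MemoryBus:
    """Toy memory bus: mapped addresses go to channels, the rest to RAM."""

    def __init__(self):
        self.ram = {}
        self.channels = {}

    def map_channel(self, address, channel):
        self.channels[address] = channel

    def memoryread(self, address):
        channel = self.channels.get(address)
        if channel is not None:
            return channel.read()          # access reaches the hardware model
        return self.ram.get(address, 0)    # ordinary memory access

    def memorywrite(self, address, value):
        channel = self.channels.get(address)
        if channel is not None:
            channel.write(value)
        else:
            self.ram[address] = value


bus = MemoryBus()
channel = HardwareChannel()
bus.map_channel(CHANNEL_ADDRESS, channel)

channel.write(42)                          # hardware side produces a value
b = bus.memoryread(CHANNEL_ADDRESS)        # software side: b = *k
bus.memorywrite(0x1000, 7)                 # ordinary store stays in RAM
```

From the software's point of view nothing distinguishes the channel from memory, which is why an initialized pointer is all the C code on the core needs.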
67
Hardware Simulation Kernel
[Figure: ARMZILLA with the hardware simulation kernel highlighted. Labels: C sources, cross compiler, EXE, Config, configuration unit, ARM ISS instances, memory-mapped channels.]
68
GEZEL Hardware Simulation Kernel has Hybrid
Architecture
[Figure labels: GEZEL language (.fdl), parser, symbol table, code generators, cycle simulator, HDL translation, C, sim API, syn API, SH3DSP, LEON2 Sparc, ARM, SystemC, VHDL.]
69
The PONG Example
paddle
paddle
ball
  • Designed in 1967 by Baer - interactive TV feature
  • 1977, General Instruments AY-3-8500
    pong-in-a-chip
  • Magnavox, Coleco, Atari, Philips, URL, GHP, ...

70
Multiprocessor Model of PONG
Four-processor Model
paddle
paddle
ball
BALL (Ball dynamics)
FIELD(Rendering)
71
Multiprocessor Operation - Initialize
Initialization Sequence
PADDLE1
PADDLE2
FIELD
BALL
TIME
72
Multiprocessor Operation - Play
Play Sequence
PADDLE1
PADDLE2
FIELD
BALL
(CONDITIONAL)LOOP
73
Lets play!
74
Multiprocessor Architecture
Point-to-point Model
[Figure: PADDLE1, BALL, PADDLE2 and FIELD, each with memory-mapped input (I) and output (O) ports, connected point-to-point; DISPLAY HW attached.]
75
Multiprocessor Architecture
Network-on-Chip Model
[Figure: PADDLE1, BALL, PADDLE2 and FIELD, each memory-mapped onto a 1D router; DISPLAY HW attached.]
76
Links
  • SimIt-ARM ISS (W. Qin, Princeton University):
    http://www.ee.princeton.edu/wqin/armsim.htm
  • Cross compiler arm-linux-gcc:
    ftp://ftp.arm.linux.org.uk
  • GEZEL, ARMZILLA:
    http://www.ee.ucla.edu/schaum/gezel
All these tools are free! (free as in freedom,
not as in free beer)
77
Conclusion
  • Applications
  • Mapped
  • onto
  • Architectures
  • Embedded DSP Multimedia
  • Design Methods
  • Low Power!

78
Thanks for your attention !