Embedded Systems in Silicon TD5102 Introduction and overview

1
Embedded Systems in Silicon TD5102: Introduction and overview
Henk Corporaal
http://www.ics.ele.tue.nl/heco/courses/EmbSystems
Technical University Eindhoven
DTI / NUS Singapore 2005/2006
2
Contents
  • Trends
  • Platforms
  • Application mapping
  • Design flow
  • Summary

3
Observation 1: The 3 Cs
  • Convergence of 3 Cs: computers, communications and consumer electronics
  • The computer enters the 3rd phase: computing power - networking - intelligent
    processing
  • The world is one network: wherever, whenever, all information and
    communication available

We get a smart environment
4
Observation 2: Current design practice
  • Y-Chart (Gajski-Kuhn)
  • Design flow is a path in the Y-chart
  • Largely manual flow until RT level

5
Observation 3: Informal system specification
6
Observation 4: Design productivity
  • Yes, we can fabricate the ICs, but
  • Can we design them?
  • Can we program them?

7
Observation 5: More dynamic applications

Video
P. Kuhn, G. Diebel, Complexity Analysis of the MPEG-4 VM 8.0, ISO/IEC JTC1/SC29/WG11/MPEG97/m2862, Fribourg, October 1997

3D
8
Observation 6: Memory problem
[Figure: performance (log scale, 1 to 1000) versus time (1980 to 2000). CPU (µProc) performance grows 55%/year (Moore's Law), DRAM performance grows 7%/year. Source: Patterson]
9
What do we learn from these observations?
We need:
  • Short Time-to-Market: reuse, short design time
  • Flexible solution: programmability, reconfigurability
  • Scalability
  • Low power
  • Low cost
  • QoS control

At sufficient performance !
10
Solution ?
  • Platforms
  • HW and SW IP reuse
  • Standardization (interfaces)
  • QoS (quality of service) hooks
  • Advanced Design Flow for Platforms
  • Raise abstraction level
  • Tool support
  • Modeling of Power, Cost, Performance
  • Predictable design

11
Lecture 1 Introduction
  • Trends
  • Platforms
  • Application mapping
  • Design flow
  • Summary

12
What is a platform?
A platform is a generic, but domain-specific, information processing (sub-)system.
In the future it will be available as a single chip (SoC) or package (SiP).
13
What is a platform?
  • HW properties
  • One or more programmable processors
  • Advanced memory organization
  • Programmable communication network
  • I/O (highly domain dependent)
  • Possible extra HW features
  • Reconfigurable logic
  • Domain specific accelerators

14
What is a platform?
  • SW components
  • Standardized RTOS
  • Proper tooling for platform system design
  • Compilers, Models, Exploration, Debugging,
    Simulation,
  • Possible extra SW features
  • Middleware layer on top of OS for features like
  • QoS
  • Domain specific protocols
  • Domain specific SW interfaces
  • Control reconfigurable logic
  • Library components
  • Distributed / Active network processing
  • Billing
  • Security

15
Example platform: Philips Nexperia
  • Available in the Billion Transistor Era
  • E.g. TI OMAP, Sony Cell, Philips Nexperia, TRIPS,
    Xilinx Virtex-4 Pro,

16
Future platforms
  • Example: Smart Networked Devices

[Figure: layer stack of a smart networked device. Labels: active packets; Virtual Machine; Protocols, Multimedia (MPEG 21), Network; OS; library; reconfig. hardware, accelerator hardware, programmable hardware; radio]
17
Future platform architecture concept
[Figure: hierarchy of levels 0, 1, ..., N. Labels repeated across the levels: CPUs, Accelerators, Reconfigurable HW blocks, Memory, I/O, Communication network]
18
Future platforms
Network interface
On-chip Network
IP core
  • IP - Isles
  • 32 RISC microprocessor 20 Kgates
  • MPEG decoding 100 Kgates
  • Wavelet filtering 40 Kgates
  • SRAM
  • DRAM
  • FPGA block

19
Lecture 1 Introduction
  • Trends
  • Platforms
  • Application mapping
  • Design flow
  • Summary

20
Platform and platform design
[Figure labels: Applications; SDT: system design technology; Design technology; Platform; PDT: platform design technology; Enabling technologies]
21
What is the system designer's problem?
  • Idea

Specification
Implementation
Find, for an application (idea/specification), an efficient mapping/implementation on a given realization space, under given constraints (cost, P, E, T, ED, throughput, pins, ..)
22
A (single) processor: how does it look inside?
23
Mapping: placing operations in space and time
  • d = a * b
  • e = a + d
  • f = 2 * b + d
  • r = f - e
  • x = z + y

24
How to map these operations?
  • Architecture 1
  • One Function Unit
  • All operations single cycle latency

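[Figure: dataflow graph of the operations above, with inputs a, b, 2, z, y, intermediate values d, e, f and results x, r]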
25
How to map these operations?
  • Architecture 2
  • One Add-Sub and one Mul unit
  • All operations single cycle latency

26
How to map these operations?
  • Architecture 3
  • One Add-sub and one Mul unit
  • Add/Sub 1 cycle, Mul 2 cycles
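
The effect of the three architectures on schedule length can be sketched with a small greedy list scheduler. This is only an illustration under stated assumptions: the operators lost from the transcript are taken to be d = a*b, e = a+d, f = 2*b+d, r = f-e, x = z+y; function units are assumed not to be pipelined; and the names Op, ops and schedule are hypothetical, not from the slides.

```c
#include <assert.h>

enum { ADD, MUL };   /* operation classes; subtraction uses the add/sub class */
#define NOPS 6
#define NDEPS 2

typedef struct {
    const char *name;
    int kind;            /* ADD or MUL */
    int dep[NDEPS];      /* indices of producer operations, -1 if none */
} Op;

/* Dataflow graph (assumed operators); t = 2*b is the multiply inside f. */
static const Op ops[NOPS] = {
    {"d = a*b", MUL, {-1, -1}},
    {"t = 2*b", MUL, {-1, -1}},
    {"e = a+d", ADD, { 0, -1}},
    {"f = t+d", ADD, { 1,  0}},
    {"r = f-e", ADD, { 3,  2}},
    {"x = z+y", ADD, {-1, -1}},
};

/* Greedy list scheduling. unit_of[k] is the function unit that executes
   class k, latency[k] is its latency in cycles. start[i] receives the issue
   cycle of operation i. Returns the schedule length in cycles. */
int schedule(const int unit_of[2], const int latency[2], int start[NOPS])
{
    int finish[NOPS];
    int unit_free[2] = {0, 0};   /* first cycle at which each unit is idle */
    int done = 0, length = 0;

    assert(latency[ADD] > 0 && latency[MUL] > 0);
    for (int i = 0; i < NOPS; i++) start[i] = -1;

    for (int cycle = 0; done < NOPS; cycle++) {
        for (int i = 0; i < NOPS; i++) {
            if (start[i] >= 0) continue;               /* already scheduled */
            int u = unit_of[ops[i].kind];
            if (unit_free[u] > cycle) continue;        /* unit still busy */
            int ready = 1;
            for (int j = 0; j < NDEPS; j++) {
                int p = ops[i].dep[j];
                if (p >= 0 && (start[p] < 0 || finish[p] > cycle)) ready = 0;
            }
            if (!ready) continue;
            start[i] = cycle;
            finish[i] = cycle + latency[ops[i].kind];
            unit_free[u] = finish[i];
            if (finish[i] > length) length = finish[i];
            done++;
        }
    }
    return length;
}
```

Mapping both classes to unit 0 with unit latencies models Architecture 1; mapping ADD to unit 0 and MUL to unit 1 models Architectures 2 and 3, with the multiply latency set to 1 or 2 respectively.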

27
There are many mapping solutions
Let S be the solution space containing solutions x = (x_i); then x is a Pareto point ⇔ x ∈ S, and ∀ y ∈ S (y ≠ x) ∃ i: x_i < y_i
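
A Pareto filter over a small set of mapping solutions might look like the sketch below. It is an assumption-laden illustration: it uses the common dominance test (no worse in every criterion, strictly better in at least one) rather than the strict form on the slide, it assumes all criteria are minimized, and NCOST, dominates and pareto_filter are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NCOST 2   /* cost criteria per solution (assumed: cycles, function units) */

/* y dominates x if y is no worse in every criterion and strictly better
   in at least one; all criteria are minimized. */
static bool dominates(const double *y, const double *x)
{
    bool strictly_better = false;
    for (size_t i = 0; i < NCOST; i++) {
        if (y[i] > x[i]) return false;
        if (y[i] < x[i]) strictly_better = true;
    }
    return strictly_better;
}

/* costs holds n solutions of NCOST values each, stored row by row.
   Sets is_pareto[k] for every solution and returns the number of Pareto points. */
size_t pareto_filter(const double *costs, size_t n, bool *is_pareto)
{
    assert(costs != NULL && is_pareto != NULL);
    size_t count = 0;
    for (size_t k = 0; k < n; k++) {
        is_pareto[k] = true;
        for (size_t j = 0; j < n && is_pareto[k]; j++)
            if (j != k && dominates(costs + j * NCOST, costs + k * NCOST))
                is_pareto[k] = false;
        if (is_pareto[k]) count++;
    }
    return count;
}
```

For example, with the made-up solutions (6 cycles, 1 unit), (4, 2), (6, 2) and (4, 3), only the first two survive: (6, 2) and (4, 3) are both dominated by (4, 2).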
28
Can we do better?
Yes !!
  • Much better !!
  • transforming the specification
  • a different architecture
  • a different mapping
  • speculative execution
  • be creative ..

29
Transforming the specification (1)
Example: tree height reduction
Based on associativity of the + operation: a + (b + c) = (a + b) + c
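
As a minimal sketch of tree height reduction, the two functions below compute the same four-operand sum as a chain and as a balanced tree, and the two depth helpers give the critical path length in additions. The function names are hypothetical, and integer operands without overflow are assumed.

```c
#include <assert.h>

/* Chain: ((a + b) + c) + d, three dependent additions. */
int sum_chain(int a, int b, int c, int d)
{
    int t1 = a + b;
    int t2 = t1 + c;
    return t2 + d;
}

/* Balanced tree: (a + b) + (c + d). t1 and t2 are independent, so two
   adders can compute them in the same cycle; the critical path is two additions. */
int sum_tree(int a, int b, int c, int d)
{
    int t1 = a + b;
    int t2 = c + d;
    return t1 + t2;
}

/* Critical path length, in additions, for summing n operands as a chain. */
int chain_depth(int n)
{
    assert(n >= 1);
    return n - 1;
}

/* Critical path length for a balanced tree: ceil(log2(n)). */
int tree_depth(int n)
{
    assert(n >= 1);
    int depth = 0;
    for (int width = 1; width < n; width *= 2) depth++;
    return depth;
}
```

The shorter critical path only pays off if the architecture offers enough adders to run the independent additions in parallel. For floating-point operands the reassociation can change rounding, so the two forms need not give bit-identical results.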
30
Transforming the specification (2)
Original: d = a * b; e = a + d; f = 2 * b + d; r = f - e; x = z + y
Transformed: r = f - e = 2b + d - (a + d) = 2b - a; x = z + y
31
Changing the architecture: adding more complex units
4-input adder: why is this faster?
32
Changing the architecture: adding more complex units
  • In the extreme case, put everything into one unit!

Spatial mapping - no control flow
33
More complex control flow
Program part:
-a-
If cond Then -b- Else -c-
-d-
34
Mapping the CFG example: 3 options, what's the best?
Option 1: -a- br c; -b- jmp d; -c-; -d-
Option 2: -a- br b; -c- jmp d; -b-; -d-
Option 3: -a- br c; -b-; -d-; -c- jmp d
35
Why not remove the control flow?
36
If-conversion shortens the schedule
Before: -a- br c; -b- jmp d; -c-; -d-
After: -a-; cond -b-, !cond -c-; -d-
Using guarded instructions like
r3 add r1,r2,r5
!r3 mul r4,r5,3
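
The guarded instruction pair above can be mimicked in C as a hedged sketch. It assumes a destination-first operand order (add r1,r2,r5 meaning r1 = r2 + r5) and that r3 is the guard; the Regs type and the function names are hypothetical. Whether a compiler emits conditional moves or branches for the second version depends on the target.

```c
#include <assert.h>

typedef struct { int r1, r4; } Regs;   /* the two destination registers */

/* Branching version: only one of the two operations executes. */
Regs with_branch(int r3, int r2, int r5, Regs in)
{
    Regs out = in;
    if (r3)
        out.r1 = r2 + r5;
    else
        out.r4 = r5 * 3;
    return out;
}

/* If-converted version: both operations are issued unconditionally and the
   guard only decides which result is committed. */
Regs if_converted(int r3, int r2, int r5, Regs in)
{
    int guard = (r3 != 0);
    int sum = r2 + r5;
    int product = r5 * 3;
    Regs out;
    out.r1 = guard ? sum : in.r1;       /* r3:  add r1,r2,r5 */
    out.r4 = guard ? in.r4 : product;   /* !r3: mul r4,r5,3  */
    return out;
}

/* Self-check: both versions agree for a true and a false guard. */
void check_if_conversion(void)
{
    Regs in = {10, 20};
    for (int r3 = 0; r3 <= 1; r3++) {
        Regs a = with_branch(r3, 2, 5, in);
        Regs b = if_converted(r3, 2, 5, in);
        assert(a.r1 == b.r1 && a.r4 == b.r4);
    }
}
```

The if-converted form trades extra executed operations for the removal of the branch, which is what lets both guarded operations share a schedule slot.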
37
Speculative execution makes it even shorter!
Before: -a- br c; -b- jmp d; -c-; -d-
After: -a-, -b-, -c-; -d-
Why not execute -d- in parallel as well?
38
However: real life is much more complex
E.g. MPEG-4 multimedia
Huge requirements:
  • > 10 GOP/s
  • > 6 GB/s
  • > 10 MB storage
Software specification:
  • more than 200 000 lines of C
  • hundreds of files
  • written by approx. 80 teams
39
Can we handle this?
Today's implementations:
  • small images
  • decoding only
  • not real-time
  • several W
  • single task
  • limited dynamism
Wanted features:
  • large images (HDTV)
  • encoding and decoding
  • real-time
  • 100 mW (mobile)
  • multiple tasks
  • dealing with dynamism
40
Lecture 1 Introduction
  • Trends
  • Platforms
  • Application mapping
  • Design flow
  • Summary

41
Embedded system design
How to map your application graph A(L,A,D) to hardware graph (L,N,C)
  • L: design level (e.g. architecture, implementation or realization level)
  • A: application components (e.g. tasks, operations, data structures)
  • D: dependences between application components
  • N: hardware components (e.g. processors, ASICs, FPGA, memories)
  • C: connections between hardware components
42
Abstraction levels
[Figure: three columns (level specification, inter-level transformation, system specification level languages)]
  • Levels: Level 0 Requirements (idea), Level 1 Architecture, Level 2 Implementation, Level 3 Realization
  • Inter-level transformations: is modeled by, is implemented by, compiles into
  • Languages, top to bottom: English; ES/RT-UML, Esterel, SDL; C, JAVA, ; C, VHDL, SystemC; Machine code, ; Hardware modules
43
Design space exploration
[Figure: exploration at level n, starting from a design point at level n-1. Labels: cost, LT(n-1,n), design transformation, exploration search area, realization space, global optimum]
44
Design space exploration framework: another Y-chart
45
Design flow steps and constraints
[Figure: refinement steps and transformations lead from the idea (high abstraction level) to the realization (low abstraction level), subject to architecture / platform constraints]
46
In which order should we perform the steps?
Decision trees
47
Well-known phase ordering examples
  • Concurrency versus Data management
  • e.g. loop partitioning versus array partitioning
    for a multiprocessor
  • Scheduling versus Register allocation
  • Logic synthesis versus Placement and Routing

48
Rule of thumb!
  • Perform steps with biggest impact first
  • Biggest impact
  • depends on your interest ( cost function)
  • min. E, P, ED, D, Area, Npins, ...

49
Phase ordering example: Why fix data storage/transfer before concurrency management issues?
Recursive image processing algorithm on local neighborhoods:
for (i = 0 .. I-1)
  for (j = 0 .. J-1)
    img[i][j] = f(img[i][j-k], old_img[i][j])
50
Why fix data storage/transfer before concurrency management issues?
  • Unrolling the outer loop (i) M times needs:
  • M J-word FIFOs (image lines)
  • M data paths

51
Why fix data storage/transfer before concurrency management issues?
Unrolling the inner loop (j), limited by k: M - 1 buffer registers
for (i = 0 .. I-1)
  for (j = 0 .. (J div 2)-1)
    img[i][2j-1] = f(img[i][2j-k-1], old_img[i][2j-1])
    img[i][2j] = f(img[i][2j-k], old_img[i][2j])
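
The inner-loop unrolling can be sketched in C as follows. Several details are assumptions made only for illustration: the kernel f is not defined in the slides, so a placeholder is used; the image size and K = 2 are arbitrary; the first K pixels of each line are treated as initial values; and the loop steps j by 2 instead of using the 2j indexing of the slide.

```c
#include <assert.h>

#define ROWS 4
#define COLS 16
#define K 2   /* recursion distance along a line (assumed value) */

/* Placeholder kernel; the slides do not define f. */
static int f(int prev, int old)
{
    return (prev + old) / 2;
}

/* Original loop nest: pixel j depends on pixel j-K of the same line. */
void filter(int img[ROWS][COLS], int old_img[ROWS][COLS])
{
    for (int i = 0; i < ROWS; i++)
        for (int j = K; j < COLS; j++)
            img[i][j] = f(img[i][j - K], old_img[i][j]);
}

/* Inner loop unrolled by 2. The two statements in the body do not depend on
   each other because K >= 2, so they can be mapped onto two data paths. */
void filter_unrolled(int img[ROWS][COLS], int old_img[ROWS][COLS])
{
    assert(K >= 2 && (COLS - K) % 2 == 0);
    for (int i = 0; i < ROWS; i++)
        for (int j = K; j < COLS; j += 2) {
            img[i][j]     = f(img[i][j - K],     old_img[i][j]);
            img[i][j + 1] = f(img[i][j + 1 - K], old_img[i][j + 1]);
        }
}
```

The unroll factor is bounded by K: with a larger factor, a statement in the unrolled body would read a pixel produced earlier in the same iteration, and the statements could no longer run in parallel.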
52
Proposed System Design Methodology
[Figure: design flow. System specification → system-level exploration and refinement → optimized algorithms (C/C++ specification) → SW/HW partitioning/exploration → architecture (parallelizing) compiler steps and traditional HW synthesis steps → code per (parallel) proc. and structural VHDL code]
53
Design flow
54
Remove OO overhead
55
Object-based versus Object-oriented
Object-oriented
  • calls through function pointer
  • cannot be inlined
Object-based
  • direct calls
  • can be inlined

=> OO is good for specification, not for implementation
56
Whole-system optimization techniques
  • Aggressive use of traditional inter-procedural
    techniques
  • in the embedded world you often know the whole
    application !
  • OO specific optimization
  • Data allocation optimization

57
Example data inlining
  • Eliminate
  • dynamic allocation
  • pointer de-reference
  • polymorphic calls

class A {
  B *b;
  A() { b = new C; }      // dynamic allocation
  ~A() { delete b; }
  void f() { b->g(); }    // pointer de-reference and polymorphic call
};
58
Example dynamic allocation removal
  • Eliminate dynamic allocation
  • Re-use stack memory already needed for
    other call tree branches

Before:
void teq(..., short size, ...) {
  float *Ryy;
  Ryy = new float[size];
  // teq computation
  delete[] Ryy;
}
teq(..., 64, ...); teq(..., 64, ...);

After:
void teq(..., ...) {
  float Ryy[64];
  // teq computation
}
teq(..., ...); teq(..., ...);
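
The same transformation can be sketched in plain C, with malloc/free standing in for new/delete. The body of teq is not given in the slides, so teq_body is a made-up placeholder; TEQ_SIZE, teq_dynamic and teq_static are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

#define TEQ_SIZE 64   /* every call site in the example passes 64 */

/* Placeholder for the "teq computation": fills the scratch buffer and
   returns the sum of its elements. */
static float teq_body(float *ryy, int size)
{
    float sum = 0.0f;
    for (int i = 0; i < size; i++) {
        ryy[i] = (float)i;
        sum += ryy[i];
    }
    return sum;
}

/* Before: the scratch buffer is allocated on the heap at every call. */
float teq_dynamic(short size)
{
    assert(size > 0);
    float *ryy = malloc((size_t)size * sizeof *ryy);
    if (ryy == NULL) return -1.0f;   /* allocation failure */
    float result = teq_body(ryy, size);
    free(ryy);
    return result;
}

/* After: the size is a known constant, so the buffer lives on the stack and
   the size parameter disappears from the interface. */
float teq_static(void)
{
    float ryy[TEQ_SIZE];
    return teq_body(ryy, TEQ_SIZE);
}
```

The static version is only valid because whole-program knowledge shows that all callers use the same size; with varying sizes the buffer would have to be dimensioned for the largest one.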
59
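ADSL result: footprint -33%
Total memory footprint (code + data), relative to the ARM C optimized build (= 100):
  • Unoptimized: 106
  • ARM C optimized (-O2 -Ospace): 100
  • Inlining, dead code, constant prop.: 83
  • Virtual call elimination: 82
  • Data alloc. optim.: 67
[Figure: bar chart with axis marks at 200kB and 400kB]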
60
Dynamic Memory Management
  • Data type refinement
  • Virtual memory management

61
Data type refinement
ATM_cell *Data_In;
Association_Table *Routing_Table;
Routing_Table = new Association_Table();
Data_In = new ATM_cell();
if ( Routing_Table->Lookup(Data_In) ) ...
Impl. alternatives
62
Data type refinement Array
ATM_cell *Data_In;
Array *Routing_Table;
Routing_Table = new Array();
Data_In = new ATM_cell();
if ( Routing_Table->Lookup(Data_In) ) ...
Impl. alternatives
63
Data type refinement Linked List
ATM_cell *Data_In;
Linked_List *Routing_Table;
Routing_Table = new Linked_List();
Data_In = new ATM_cell();
if ( Routing_Table->Lookup(Data_In) ) ...
Impl. alternatives
64
Data type refinement Binary Tree
ATM_cell *Data_In;
Binary_Tree *Routing_Table;
Routing_Table = new Binary_Tree();
Data_In = new ATM_cell();
if ( Routing_Table->Lookup(Data_In) ) ...
Impl. alternatives
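
The three implementation alternatives can be compared with a C sketch in which the same lookup question is answered by an array, a linked list and a binary search tree. All types, field names (such as vci) and function names here are hypothetical stand-ins; the slides only show the abstract Lookup call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct { int vci; } ATM_cell;   /* hypothetical cell: channel id only */

/* Alternative 1: array, linear scan. */
typedef struct { int keys[8]; size_t count; } ArrayTable;

bool array_lookup(const ArrayTable *t, const ATM_cell *cell)
{
    assert(t != NULL && cell != NULL);
    for (size_t i = 0; i < t->count; i++)
        if (t->keys[i] == cell->vci) return true;
    return false;
}

/* Alternative 2: singly linked list. */
typedef struct Node { int key; struct Node *next; } Node;

bool list_lookup(const Node *head, const ATM_cell *cell)
{
    for (const Node *n = head; n != NULL; n = n->next)
        if (n->key == cell->vci) return true;
    return false;
}

/* Alternative 3: binary search tree. */
typedef struct Tree { int key; struct Tree *left, *right; } Tree;

bool tree_lookup(const Tree *root, const ATM_cell *cell)
{
    while (root != NULL) {
        if (cell->vci == root->key) return true;
        root = (cell->vci < root->key) ? root->left : root->right;
    }
    return false;
}
```

All three give the same answers but differ in memory footprint and in the number of memory accesses per lookup, which is the trade-off that data type refinement explores.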
65
Task Concurrency Management
Going from specification concurrency to
implementation concurrency
66
Modelling MTG
67
TCM transformations
  • Why transformations?
  • shift existing Pareto curves
  • create new points on the Pareto curves
  • improve available task level parallelism

68
TCM Transformations
[Figure: schedules of tasks T1 to T6 and HW1 on processors P1 and P2, shown against shared memory area (MA) and cycle budget. Labels: tasks freely assigned to 2 processors; tasks order constrained to reduce memory requirements (less memory); independent, dynamic tasks assigned to 1 processor; partial order constraints; conflict]
69
Static Memory Management
DTSE: data transfer and storage exploration
70
Static data memory management (DMM)
[Figure: memory hierarchy. On chip: processor data paths, local latches 1..N with banks 1..N, cache bank recombine, L1 cache, L2 cache; off chip: SDRAM]
  • 3 Exploit memory hierarchy
  • 5 Meet real-time constraints
  • 6 Exploit limited life-time and data layout freedom
71
DMM: how to improve locality?
Before:
FOR i=1 TO N DO B[i]=f(A[i])
FOR i=1 TO N DO C[i]=g(B[i])
After:
FOR i=1 TO N DO
  B[i]=f(A[i])
  C[i]=g(B[i])
[Figure: same memory hierarchy as on the previous slide, from processor data paths to off-chip SDRAM]
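
The loop merging shown on this slide can be written out in C as a small sketch. The functions f and g are not defined in the slides, so trivial placeholders are used, and N and the function names are assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define N 8

static int f(int a) { return 2 * a; }   /* placeholder */
static int g(int b) { return b + 1; }   /* placeholder */

/* Two separate loops: all of B is written before any element is read back,
   so for large N each B[i] travels to a higher memory layer and back. */
void separate_loops(const int A[N], int B[N], int C[N])
{
    assert(A != NULL && B != NULL && C != NULL);
    for (int i = 0; i < N; i++) B[i] = f(A[i]);
    for (int i = 0; i < N; i++) C[i] = g(B[i]);
}

/* Merged loop: B[i] is consumed right after it is produced, so the value can
   stay in a register or local latch. */
void fused_loop(const int A[N], int B[N], int C[N])
{
    assert(A != NULL && B != NULL && C != NULL);
    for (int i = 0; i < N; i++) {
        int b = f(A[i]);
        B[i] = b;        /* can be dropped if B is not needed afterwards */
        C[i] = g(b);
    }
}
```

Both versions compute the same B and C; the merged form shortens the distance between the production and the consumption of each B[i], which is the locality improvement.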
72
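Exploiting Memory Hierarchy
[Figure: processor data paths with a register file and memory layers M'']
Accesses (A) and relative power per access (P) per layer: A = 100 at P = 0.01, A = 10 at P = 0.1, A = 1 at P = 1
P (before) = 100
P (after) = 100×0.01 + 10×0.1 + 1×1 = 3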
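
The power estimate above is a weighted sum of access counts, which a few lines of C make explicit. MemLevel and total_power are hypothetical names, and the numbers are the relative values from the slide.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double accesses;           /* relative number of accesses to this layer */
    double power_per_access;   /* relative power cost of one access */
} MemLevel;

/* Total relative power: sum over all layers of accesses * power per access. */
double total_power(const MemLevel *levels, size_t n)
{
    assert(levels != NULL || n == 0);
    double p = 0.0;
    for (size_t i = 0; i < n; i++)
        p += levels[i].accesses * levels[i].power_per_access;
    return p;
}
```

Calling it with a single layer {100, 1.0} models the situation before the transformation; the three layers {100, 0.01}, {10, 0.1}, {1, 1.0} model the hierarchy after it.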
73
How to Avoid N-port Memories?
74
Address Optimization
75
Algebraic Transformations and Aggressive Code Hoisting for Expression Elimination
Initial
for (y = 0..9; y++)
  for (x = 0..99; x++)
    if (x > 1) A[(y%3)*3 + (x-2)%3] ...
    if (x > 4) ... A[(y%3)*3 + (x-5)%3]
76
Modulo substitution for piece-wise linear addressing
Optimised-1st
for (y = 0..9; y++)
  v_y = (y%3)*3
  for (x = 0..99; x++)
    v_yx = (x-2)%3 + v_y
    if (x > 1) A[v_yx]
    if (x > 4) A[v_yx]
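
That the hoisted address is equivalent to both original expressions can be checked with a short C sketch. It rests on an assumption: the operators lost in the transcript are taken to be (y%3)*3 + (x-2)%3 and (y%3)*3 + (x-5)%3, which is the reading under which both accesses share one address. All function names are hypothetical.

```c
#include <assert.h>

/* Initial address expressions (assumed reconstruction). */
int addr_first(int y, int x)  { return (y % 3) * 3 + (x - 2) % 3; }
int addr_second(int y, int x) { return (y % 3) * 3 + (x - 5) % 3; }

/* Walks the loop nest with the hoisted expressions and counts the accesses
   whose address differs from the initial form. */
int count_mismatches(void)
{
    int mismatches = 0;
    for (int y = 0; y < 10; y++) {
        int v_y = (y % 3) * 3;               /* invariant in x: hoisted */
        for (int x = 0; x < 100; x++) {
            int v_yx = (x - 2) % 3 + v_y;    /* shared by both accesses */
            if (x > 1) {
                assert(v_yx >= 0 && v_yx < 9);
                if (v_yx != addr_first(y, x)) mismatches++;
            }
            if (x > 4 && v_yx != addr_second(y, x)) mismatches++;
        }
    }
    return mismatches;
}
```

The second access can reuse v_yx because x-5 and x-2 differ by a multiple of 3, so they leave the same remainder whenever both are non-negative, which the guard x > 4 ensures.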
77
What do we gain? Running example: cavity detection
  • Application domain
  • Computer Tomography in medical imaging
  • Algorithm
  • Cavity detection in CT-scans
  • Detect dark regions in successive images
  • Indicate cavity in brain

78
Starting point
[Figure labels: Max Value, Compute Edges, Gauss Blur x, Reverse, Detect Roots, Gauss Blur y]
  • Reference (conceptual) C code for the algorithm
  • all functions: image_in[N x M](t-1) -> image_out[N x M](t)
  • new value of pixel depends on its neighbors
  • neighbor pixels read from background memory
  • approximately 110 lines of C code (ignoring file I/O etc)
  • experiments with N x M = 640 x 400 pixels
  • straightforward implementation: 6 image buffers

79
Cavity Detector Results
80
Lecture 1 Introduction
  • Trends
  • Platforms
  • Application mapping
  • Design flow
  • Summary

81
Summary
  • Billions of Embedded systems, everywhere!!!
  • Multi-media applications become extremely complex
    and dynamic
  • Time-to-Market pressure
  • Solution
  • Platforms as design target (raise abstraction
    level)
  • Advanced emb. system design flow needed

82
Traditional Design Methodology
[Figure: design flow. System specification → SW/HW partitioning/exploration → (SW system exploration) and HW system exploration → optimized SW spec (C specification) and optimized HW spec (VHDL specification) → architecture (parallelizing) compiler steps and traditional HW synthesis steps → code per (parallel) proc. and structural VHDL code]
83
Proposed System Design Methodology
[Figure: design flow. System specification → system-level exploration and refinement (our main focus) → optimized algorithms (C/C++ specification) → SW/HW partitioning/exploration → architecture (parallelizing) compiler steps and traditional HW synthesis steps → code per (parallel) proc. and structural VHDL code]