Title: Predictable Design of Embedded Systems using Networked Architectures
1Predictable Design of Embedded Systems using Networked Architectures
- Henk Corporaal
- www.ics.ele.tue.nl/heco
- ASCI Winterschool on Embedded Systems
- Rockanje, March 2006
2Outline
- Trends and design problems
- Unpredictability
- Platforms
- Predictable design
- Proposed design flow
- Open issues
- Note this lecture is not about a solved problem
3Outline
- Trends and design problems
- Embedded systems everywhere
- Design practice
- Design complexity
- Memory wall
- Unpredictability
- Platforms
- Predictable design
- Design flow
- Open issues
4Embedded systems everywhere
- Convergence of 3 Cs
- computers, communications and consumer electronics
- The computer enters the 3rd phase
- computing power - networking - intelligent processing
- The world is 1 network
- wherever, whenever, all information and communication available
We get a smart environment
5Design practice Informal system specification
6Design practice
System
Structure description
Behavioral specification
Algorithm
R/T
Logic
circuit
- Y-Chart (Gajski-Kuhn)
- Design Flow is path in Y chart
- Till RT-level largely manual flow
Physical realization
7Design complexity problem
8Hitting the memory wall
[Figure: performance (log scale, 1 to 1000) versus time (1980 to 2005); curves: CPU (µProc, 55%/year, Moore's Law) and DRAM (7%/year). Source: Patterson]
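As a rough illustration of the gap in the figure above, the sketch below compounds the two annual growth figures from the slide, read here as 55% and 7% per year (an assumption about the units), over the 1980 to 2005 span of the time axis. The function name is illustrative.

```python
def performance_gap(cpu_growth: float, dram_growth: float, years: int) -> float:
    """Ratio of CPU to DRAM performance after `years`, both starting at 1."""
    return ((1.0 + cpu_growth) / (1.0 + dram_growth)) ** years

# Growth rates as read from the slide (assumed to be fractions per year).
gap_per_year = performance_gap(0.55, 0.07, 1)
gap_25_years = performance_gap(0.55, 0.07, 2005 - 1980)
print(f"gap grows by a factor {gap_per_year:.2f} per year")
print(f"gap after 25 years: about {gap_25_years:.0f}x")
```

Even a modest difference in yearly growth compounds into several orders of magnitude over the plotted period, which is the point of the slide.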
9Outline
- Trends and design problems
- Unpredictability
- Platforms
- Predictable design
- Proposed design flow
- Open issues
10Unpredictability at all levels
applications
architectures
DSM VLSI design
- Uncertainty increases at all levels
11Application Two forms of unpredictability
- Applications can be data dependent
- Applications may have different scenarios
12In addition: dynamically changing set of applications
- Multi-standard modem operation
- Several applications have to be activated simultaneously
- Too many combinations for an analysis at design time (non-deterministic events)
Philips EVP
13Architecture unpredictability
ext. mem
- Local schedulers
- OS
- task switching
- interrupts
- cache strategy
- cache pollution
- interconnect
- busses, bridges
- networks
- memory controllers
- external memory
- e.g. RR, TDMA, FCFS, LRU, EDLF, FIFO, priority,
mem arb.
[Figure label: interconnect (three instances)]
What is the global behavior (end-to-end), composed of interacting local solutions?
14DSM VLSI Unpredictability
- Global wiring delay becomes dominant over gate
delay (timing closure)
15DSM VLSI Unpredictability
Length of Isosynchronous zone as function of
frequency
- Other DSM problems
- Clock distribution, skew
- VDD and VSS voltage drop
- Signal integrity, cross-talk
- Variance in process parameters increases
16Unpredictability Design Closure problems
- Design closure
- a realization meets all requirements, including
functionality, speed, power, area, yield, etc.,
without design iterations
application
mapping scheduling
architecture
placement routing
Closure problem at all levels
FPGA realization
VLSI realization
17Unpredictability Design Closure problems
Computational Requirements ?
Orders of Magnitude
Time ?
Mapping with performance guarantees looks
impossible !!
18Solution ingredients
- Higher abstraction levels
- SW and HW IP reuse / PnP principle
- Standards
- Avoid large design iterations
- Design correct by synthesis
- Avoid worst case resource requirements
- How do we achieve all of this?
19Outline
- Trends and design problems
- Unpredictability
- Platforms
- Predictable design
- Design flow
- Open issues
20What is a platform?
- A platform is a generic, but domain specific, information processing (sub-)system
- Generic means that it is flexible, containing programmable component(s).
- Platforms are meant to quickly realize your next system (in a certain domain).
- Single chip?
21Platforms, why?
- Reuse
- Short Time-to-Market
- High Quality
- Flexible and Programmable
- Large software component
- Standardization
- Optimized for specific domain
- and you do not have to solve this design closure
problem !!
22Platforms separate the design communities !
23Platform examples Digital camera
Sanyo Okada99
24TI OMAP
Up to 192Mbyte off-chip memory
25SpaceCake (Philips research)
- Homogeneous set of equal tiles
- Per tile e.g.
- n MIPS
- m TriMedia
- Accelerators
- k L2 Cache bank
- Shared memory
- Cache coherency
- Big interconnect switch
- Inter Tile
- Router
- Message passing
- Working on inter tile cache coherence
switch
L2 cache memory banks
Single tile
26IMAGINE Stream Processor (Stanford)
- IMAGINE: SIMD of VLIWs
- It is controlled by a host processor, which sends it stream instructions (load, store, receive, send, VLIW op, load microcode)
27Hybrid FPGAs Xilinx Virtex 4-Pro
Memory blocks Multipliers
PowerPCs
ReConfig. logic
Reconfigurable logic blocks
Courtesy of Xilinx (Virtex II Pro)
28Fundamental platform design decisions
- Homogeneous versus Heterogeneous ?
- Bus versus Network ?
- Shared memory versus Message passing ?
- QoS support, Guarantees built-in ?
- Generic versus Application specific ?
- What types of parallelism to support ?
- ILP, DLP, TLP
- Focus on Performance, Power or Cost ?
- Memory organisation ?
- HW or SW reconfigurable ?
- And further
- OS support, Middleware ?
- Mapping support?
29Homogeneous or Heterogeneous
- Homogeneous
- replication effect
- memory dominated anyway
- solve realization issues once and for all
- less flexible
30Homogeneous or Heterogeneous
- Heterogeneous
- more flexible
- better fit to application domain
- smaller increments
- no tile reuse
31Homogeneous or Heterogeneous
- Middle of the road approach
- Flexible tiles
- Fixed tile structure at top level
tile
router
32HW or SW reconfigurable?
reset
Reconfiguration time
loopbuffer
context
Subword parallelism
1 cycle
fine
coarse
Data path granularity
33Outline
- Trends and design problems
- Unpredictability
- Platforms
- Predictable design
- Current practice
- Predictability
- Architecture consequences
- Design consequences
- Design flow
- Open issues
34How should we design ?
- Trajectory, from Idea to Realization
- Decisions based on models
- Abstract from implementation details (not all known yet)
- Relatively cheap to create, validate and simulate
Idea
Design Time
Concepts Requirements
Design Problem
- Generate Ideas
- Construct Models
- Evaluate Properties
- Make Design Decisions
Steers
Realization
35Current practice: Mapping, easy, but...
Idea
- Given
- reference C code for application, e.g. MPEG-4 Motion Estimation
- platform SUPERDUPER-LX50
- Task
- map application on architecture
- But wait a moment
- me@work> CC o2 mpeg4_me mpeg4_me.c
  Thank you for running SUPERDUPER-LX50 compiler.
  Your program uses 257321886 bytes memory, 78 Watt, 428798765291 clock cycles
ab5d for (...) ..
36Current design process
application
mapping
constraints
OK ?
no
yes
- Post analysis: check constraints after mapping
- Simulation based
- Does it still work for other data?
- Does it still work when other applications are active?
- Too many iterations
- Easy to program, hard to tune
- Can this be improved?
- e.g. Constraints input
37Predictable design
- What is it?
- Being able to reason at a high level about a design (in terms of functional and non-functional properties), and
- Being able to realize this design without time-consuming iterations in the design flow (design closure)
- How?
- Predictable architecture
- Making resources predictable
- Proper modeling of less predictable elements
- Predictable design flow
- Compositionality
- Composability
- Design time analysis ? Run time analysis
38Making architectures predictable
- Getting rid of all unpredictable elements
- Caches?
- No problem, but WCET estimation may be big and unacceptable!
- Software controlled
- locked cache lines
- non-cachable memory
- controlled replacement
- Shared memory
- Communication
39Making architectures predictable: NoC (Philips AETHEREAL)
Router provides both guaranteed throughput (GT) and best effort (BE) services to communicate with IPs. Combination of GT and BE leads to efficient use of bandwidth and a simple programming model.
[Figure: router network; each IP connects to it through a network interface]
40Making the NoC predictable: how to support GT traffic?
- Time wheel concept
- control injection traffic at network interface
[Figure: time wheel with slots numbered 1 to 8, turning with time]
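The time wheel can be pictured as a TDMA slot table at each network interface: a GT connection may inject a flit only in the slots it owns, so its bandwidth share and worst-case wait follow directly from the table. The sketch below is a simplified model under that assumption, not the AETHEREAL implementation; the 8 slots follow the figure, while the connection names and the slot assignment are made up.

```python
from typing import Optional

WHEEL_SIZE = 8  # slots on the time wheel, as in the figure

# Hypothetical reservation: slot index -> owning GT connection (None = left to BE traffic).
slot_table: list[Optional[str]] = ["c1", None, "c2", None, "c1", None, None, None]

def may_inject(connection: str, cycle: int) -> bool:
    """A GT connection may inject only in a slot it owns."""
    return slot_table[cycle % WHEEL_SIZE] == connection

def guaranteed_fraction(connection: str) -> float:
    """Fraction of the link bandwidth reserved for the connection."""
    return slot_table.count(connection) / WHEEL_SIZE

def worst_case_wait(connection: str) -> int:
    """Largest number of slots a ready flit waits before an owned slot starts."""
    owned = [i for i, owner in enumerate(slot_table) if owner == connection]
    if not owned:
        raise ValueError(f"{connection} has no reserved slot")
    worst = 0
    for arrival in range(WHEEL_SIZE):
        wait = min((s - arrival) % WHEEL_SIZE for s in owned)
        worst = max(worst, wait)
    return worst
```

Because the table is fixed, both numbers are known before any traffic is sent; spreading a connection's slots evenly over the wheel lowers its worst-case wait without changing its bandwidth share.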
41Making the design flow predictable
Compositionality
42Making the design flow predictable
- Design time
- Determine upper bounds on time and resources
- Pareto curves
- Scenario discovery
- separate your application into parts for which upper bounds are not too far from worst case
43What do we want ? Design time analysis
- Single application
- Reasoning about end-to-end timing constraints (for given resources and quality): predictability
- Which local arbitration mechanisms are needed?
- How to translate this to the global level?
44Scenarios MP3
45What do we want ? Composability
- Multiple applications
- If app. 1 and app. 2 each fit individually, what can be said about the combination?
- Concept of virtual platform
46Predictability and composability: can we add Pareto points?
[Figure: two plots of quality Q versus cost (resources); application 1 with Pareto point (q1,c1), application 2 with Pareto point (q2,c2)]
(q1+q2, c1+c2) ?
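One way to read the question on this slide: if each application offers a set of (quality, cost) Pareto points, the candidate combinations are the pairwise sums, and only the non-dominated sums are worth keeping. The sketch below assumes that quality and cost are simply additive and that the applications do not interfere, which is exactly what composability has to justify; the numbers are made up.

```python
from itertools import product

Point = tuple[float, float]  # (quality, cost)

def dominates(p: Point, q: Point) -> bool:
    """p dominates q if it is at least as good in quality and cost, and differs."""
    return p[0] >= q[0] and p[1] <= q[1] and p != q

def pareto_min(points: list[Point]) -> list[Point]:
    """Keep only the non-dominated points, sorted by cost."""
    unique = set(points)
    kept = [p for p in unique if not any(dominates(q, p) for q in unique)]
    return sorted(kept, key=lambda p: p[1])

def combine(a: list[Point], b: list[Point]) -> list[Point]:
    """Add every point of a to every point of b, then prune dominated sums."""
    sums = [(qa + qb, ca + cb) for (qa, ca), (qb, cb) in product(a, b)]
    return pareto_min(sums)

# Hypothetical Pareto points for two applications.
app1 = [(1.0, 10.0), (2.0, 30.0)]
app2 = [(1.0, 20.0), (3.0, 25.0)]
combined = combine(app1, app2)
print(combined)
```

Note that one of the four sums drops out: adding Pareto points does not automatically yield Pareto points, so a pruning step is needed after every combination.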
47Problem: Predictable resource utilization?
[Figure: application graphs A and B; all annotated values are 50]
48Problem: Predictable resource utilization?
Add ordering dependences (edges)
Only 50% processor utilization!
49Where is the problem?
- Different throughput obtained for different order of actors
- Possibilities of overall graph increase exponentially with number of actors and individual graphs
- Very difficult to do a complete analysis to obtain an optimal order
- Hard to model and analyze different arbitration strategies realistically
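To get a feel for the growth, the sketch below counts the ways in which the actors of several applications can be interleaved on one shared processor when each application keeps its own internal firing order. This is a simplification chosen for illustration (real graphs usually allow several internal orders as well), and the function name is hypothetical.

```python
from math import factorial

def interleavings(actors_per_graph: list[int]) -> int:
    """Number of merged orders that preserve each graph's own actor order."""
    total = factorial(sum(actors_per_graph))
    for n in actors_per_graph:
        total //= factorial(n)
    return total

for graphs in ([2, 2], [4, 4], [4, 4, 4], [8, 8, 8]):
    print(graphs, interleavings(graphs))
```

Already for three graphs of eight actors the count is far beyond what an exhaustive design-time analysis can enumerate.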
50Problem Too many possibilities!
A
B
C
51So, what is Composability?
- The degree to which we can analyze the applications in isolation
- Throughput, Latency, Resource utilization, Deadlock, Switching / reconfiguration overhead, etc.
- Design time analysis for complete system is too expensive and often infeasible
- Each job should be executed as if it had access to its own dedicated resources: virtualization
- Consider applications separately and then reason about the behavior of overall system
52Providing a Bound for Resources
- Arbitration strategy plays an important role in determining resource requirement
- A naive strategy leads to over-estimation of resources
- Worst-case estimate is not always possible
- Need predictable arbitration mechanism
- More realistic worst case bounds
- Handle dynamism in the system
- An overall quality versus resources Pareto curve is needed
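As a small illustration of why the arbitration strategy determines the bound: with non-preemptive round-robin arbitration, an actor that becomes ready waits at most for one execution of every other actor sharing the resource. The sketch below computes that simple bound; the actor names and execution times are made up, and the model ignores arbitration overhead.

```python
def round_robin_response_bound(exec_times: dict[str, int], actor: str) -> int:
    """Worst-case response time: own execution plus one execution of each other actor."""
    waiting = sum(t for name, t in exec_times.items() if name != actor)
    return waiting + exec_times[actor]

# Hypothetical actors mapped on one processor, execution times in cycles.
exec_times = {"a0": 50, "b0": 50, "c0": 20}
bounds = {name: round_robin_response_bound(exec_times, name) for name in exec_times}
print(bounds)
```

The bound depends only on the set of actors sharing the resource, not on their arrival pattern, which is what makes such an arbiter analyzable per application.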
53Making the design flow predictable: Run-time aspects
- Scalable applications
- QoS management
Application n / Scenario m
Local manager
QoS protocol
Global manager
Platform
54Match quality with resources
55Outline
- Trends and design problems
- Unpredictability
- Platforms
- Predictable design
- Design flow
- Open issues
56Design flow
Idea
Requirements spec
Models
POOSL/SystemC
Spec
Reactive Process Network
Kahn Process Network (YAPI)
BDF
SDF
correct by synthesis
Platform
57RPN (Reactive Process Networks) events and
streaming
- Processing of events
- Finite State Machine
- Controlling host-CPU (e.g. ARM)
- RTOS hard real-time
- classical SW complexity
Event_in
Event_out
status
mode
- Soft Real-time
- Compute intensive
- Special hardware
Stream_in
Stream_out
58POOSL Modeling Language
- Mathematically defined semantics
- Allows formal analysis of model properties
- Can formally describe
- concurrency
- synchronous communication
- timing (delay statements)
- functionality
P1
P2
delay 1
59POOSL Phases of Model Execution
State space
State space
State space
Synchronous time passage
Asynchronous actions execution
model time
60From Model to Realization
Possible execution (timed) traces:
(S1, t1), (S2, t1), (S3, t1+d1), (S5, t1+d1)
(S1, t1), (S2, t1), (S4, t1+d2), (S6, t1+d2)
(S1, t1), (S2, t1+wcet(a)), (S3, t1+d1), (S5, t1+d1+wcet(b))
(S1, t1), (S2, t1+wcet(a)), (S4, t1+wcet(a)+wcet(c)), (S6, t1+d2)
a()() sel delay d1 b()() or c()() delay d2 les
61ε-Hypothesis: property preservation
- If the time-deviation between two timed execution traces is less than ε, then, if one trace satisfies a real-time property, that property, weakened up to ε, is preserved in the second one as well
ε1, ε2 < ε
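A minimal sketch of the idea, assuming a timed trace is a list of (state, time) pairs, that the two traces visit the same states in the same order, and that the real-time property is a deadline on reaching a given state. These representations are assumptions for illustration, not the formal definitions behind the slide; the state names follow the traces on the previous slide and the numbers are made up.

```python
Trace = list[tuple[str, float]]

def time_deviation(model: Trace, realization: Trace) -> float:
    """Largest timing difference between corresponding states of two traces."""
    if [s for s, _ in model] != [s for s, _ in realization]:
        raise ValueError("traces must visit the same states in the same order")
    return max(abs(tm - tr) for (_, tm), (_, tr) in zip(model, realization))

def meets_deadline(trace: Trace, state: str, deadline: float) -> bool:
    """Deadline property: `state` is reached no later than `deadline`."""
    return any(s == state and t <= deadline for s, t in trace)

# Hypothetical numbers: t1 = 0, d1 = 10, wcet(a) = 0.5, wcet(b) = 0.25.
model = [("S1", 0.0), ("S2", 0.0), ("S3", 10.0), ("S5", 10.0)]
realization = [("S1", 0.0), ("S2", 0.5), ("S3", 10.0), ("S5", 10.25)]

eps = 1.0
deviation = time_deviation(model, realization)
assert deviation < eps
# The model meets deadline 10; the realization meets the deadline weakened by eps.
preserved = (meets_deadline(model, "S5", 10.0)
             and meets_deadline(realization, "S5", 10.0 + eps))
print(deviation, preserved)
```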
62Extending SDF
- SADF Scenario Aware Data Flow
- Can deal with dynamism
- Still possible to reason about
- deadlock,
- resource utilization,
- latency and throughput
- Currently implemented in POOSL
63SADF example MPEG-2 Decoder
- Pipelined MPEG-2 decoder for I and P frames
- VLD and IDCT fire per macro-block
- MC and RC fire per frame
- FD (frame detector) models control part of VLD that determines frame type
- Image size 176x144
- I-frame
- 99 macro-blocks
- No motion vectors
- Px-frame
- x macro-blocks
- Motion vectors from VLD to MC
- Previous frame from RC to MC
- P0-frame (still video)
- Copy previous frame
- FD model based on occurrence probability of frame types
- Execution time distributions of kernels determined with profiling tool
Rate   I    P0   Px
a      0    0    1
b      0    0    x
c      99   1    x
d      1    0    1
e      99   0    x
x ∈ {30, 40, 50, 60, 70, 80, 99}
64Results for MPEG-2 Decoder
Process   Throughput   Rel. error
VLD       0.063        0.036
IDCT      0.063        0.036
MC        0.00106      0.190
RC        0.00106      0.191
Accuracy results based on confidence levels of 0.95
Latency between successive firings (rel. error in parentheses):
Process   Maximum   Average         Variance
VLD       710       15.99 (0.031)   75.38 (0.18)
IDCT      698       15.99 (0.031)   56.45 (4.99)
MC        3305      940.3 (0.017)   2.4·10^5 (3.46)
RC        2216      940.3 (0.017)   1.5·10^5 (4.99)
Channel memory between processes (rel. error in parentheses):
Channel        Maximum occupancy   Time-average occupancy   Time-variance in occupancy
VLD and IDCT   9                   1.910 (0.064)            0.528 (1.99)
IDCT and RC    154                 60.19 (0.178)            671.8 (4.55)
VLD and MC     133                 34.73 (0.517)            698.4 (4.39)
MC and RC      1                   0.577 (0.561)            0.244 (3.27)
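As a quick sanity check on the tables above: in steady state, the throughput of a process should be close to the reciprocal of its average latency between successive firings. The snippet below applies that relation to the reported numbers; it is a cross-check on the figures from the slide, not part of the original analysis.

```python
# (throughput, average latency between successive firings) as reported on the slide
reported = {
    "VLD":  (0.063, 15.99),
    "IDCT": (0.063, 15.99),
    "MC":   (0.00106, 940.3),
    "RC":   (0.00106, 940.3),
}

def relative_gap(throughput: float, avg_latency: float) -> float:
    """Relative difference between the throughput and 1 / average latency."""
    expected = 1.0 / avg_latency
    return abs(throughput - expected) / expected

gaps = {name: relative_gap(*values) for name, values in reported.items()}
for name, gap in gaps.items():
    print(f"{name}: 1/latency = {1.0 / reported[name][1]:.5f}, gap = {gap:.1%}")
```

The two columns agree to within the rounding of the reported values, which is well inside the stated relative errors.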
65Design flow
- Run-time
- Combine pareto points
- exploit pareto algebra
- QoS management / scalable application
66Mapping multiple jobs
T1
T2
T0
- Multiple jobs can be active simultaneously.
- When can a second job start ?
- Are the requested resources available ?
- If not, can the quality level be lowered ?
- If not, can other jobs go for a lower quality ?
- If yes, independent from other jobs ?
- How to give guarantees?
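The questions on this slide amount to an admission procedure in a resource manager. A minimal sketch, assuming each job offers a few quality levels with a known resource demand and the platform has a single scalar capacity; the class, the greedy policy of lowering the new job first, and all numbers are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    demands: list[int]  # resource demand per quality level, best quality first
    level: int = 0      # index into demands of the currently selected quality

    @property
    def demand(self) -> int:
        return self.demands[self.level]

def admit(running: list[Job], new: Job, capacity: int) -> bool:
    """Try to start `new`; lower its quality first, then that of running jobs."""
    saved = [job.level for job in running]
    others = sum(job.demand for job in running)
    # 1. Find the best quality of the new job that fits the free resources.
    for level in range(len(new.demands)):
        if others + new.demands[level] <= capacity:
            new.level = level
            running.append(new)
            return True
    # 2. Otherwise lower running jobs step by step, new job at its lowest quality.
    new.level = len(new.demands) - 1
    for job in running:
        while job.level < len(job.demands) - 1:
            job.level += 1
            if sum(j.demand for j in running) + new.demand <= capacity:
                running.append(new)
                return True
    # 3. No feasible combination found: restore the running jobs and reject.
    for job, level in zip(running, saved):
        job.level = level
    return False

running = [Job("video", [60, 40]), Job("audio", [30, 20])]
ok = admit(running, Job("modem", [40, 25]), capacity=100)
print(ok, [(j.name, j.level) for j in running])
```

Admission only succeeds when the summed demand stays within capacity, so every admitted job keeps its reserved share; that reservation is what a guarantee would rest on in this simplified model.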
67Combining Pareto points
[Figure: cost versus cycle budget curves for applications 1, 2 and 3; cycle budgets 80 and 100 are marked on the curves of applications 1 and 2]
- A new thread frame coming
- 20 cycle budgets available
68Combining Pareto points
[Figure: cost versus cycle budget curves for applications 1, 2 and 3; cycle budgets 80 and 100 are marked for applications 1 and 2, cycle budget 20 for application 3]
feasible, but optimal?
69Combining Pareto points
[Figure: cost versus cycle budget curves for applications 1 and 2, with marked budgets 80, 80 and 100 and a cost increase Δ1; curve for application 3 with marked budgets 40 and 20]
a better solution: cost decrease, and Δ2 > Δ1
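The three slides can be replayed with a small search: given a cost curve per application (cost as a function of the assigned cycle budget), pick one point per application so that the total budget fits and the summed cost is minimal. The budgets 80, 100, 20 and 40 are taken from the figures; the total of 200 cycle budgets and all cost values are invented so that the outcome mirrors the slides.

```python
from itertools import product

# Hypothetical cost curves: cycle budget -> cost (lower is better).
curves = {
    "application 1": {80: 5},
    "application 2": {80: 6, 100: 4},  # giving up 20 budgets costs 2 more
    "application 3": {20: 9, 40: 3},   # receiving 20 more budgets costs 6 less
}
TOTAL_BUDGET = 200  # assumption

def best_allocation(curves: dict[str, dict[int, int]], total: int):
    """Exhaustively pick one point per application within the total budget."""
    names = list(curves)
    best = None
    for choice in product(*(curves[n].items() for n in names)):
        budget = sum(b for b, _ in choice)
        cost = sum(c for _, c in choice)
        if budget <= total and (best is None or cost < best[0]):
            best = (cost, dict(zip(names, (b for b, _ in choice))))
    return best

first_fit_cost = 5 + 4 + 9  # budgets 80 + 100 + 20: feasible, but optimal?
best_cost, best_budgets = best_allocation(curves, TOTAL_BUDGET)
print(first_fit_cost, best_cost, best_budgets)
```

Exhaustive search is only workable for a handful of points per application; for larger sets the combinations would have to be pruned, for example by keeping only Pareto-optimal partial sums.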
70Outline
- Trends and design problems
- Unpredictability
- Platforms
- Predictable design
- Design flow
- Open issues
71Open issues
- Gap between specification and architecture modeling
- High level modeling
- use of modeling pattern library
- Incorporate multiple Pareto solutions into DSE
- Pareto Algebra
- Get synthesis correct for
- control applications including compute intensive tasks
- mapping to multi-processor
- Managing QoS
- Scenario detection, merging, prediction and exploitation
- Runtime resource manager optimizing overall quality
- Measuring overall quality
72Open issues (cont'd)
- Architecture modeling
- how to deal with local memory (scratch pad / cache)
- Modeling scheduling and arbitration
- make things composable!
- Definition NAL (run-time services)
- Automatic partitioning
- e.g., SPRINT tool of IMEC is a good start (C to SystemC)
- VLSI tiling
- ... and many more, e.g. see Ogras e.a., "Key research problems in NoC Design: A holistic perspective", CODES+ISSS 2005
73Thanks