Title: A Performance-Oriented Hardware/Software Partitioning for Datapath Applications
1A Performance-Oriented Hardware/Software Partitioning for Datapath Applications
- Laura Frigerio Politecnico di Milano
- Fabio Salice Politecnico di Milano
In collaboration with the European Technology
Center, Altera Corporation, High Wycombe, UK
2Outline
- Introduction
- Overview of the proposed approach
- Used formalism
- Timed Petri Net Model
- Performance Evaluation
- Exploration of the solution space
- Experimental results
3Introduction
- Embedded system design challenges
- Increasing complexity
- Conflicting requirements: timing, area, flexibility, time to market, ...
- Need for methodologies and tools to support the design choices
- Datapath applications, where dataflow processing dominates over control-flow constructs, are becoming increasingly popular in embedded system design
- DSP applications
- Packet processing applications
- ...
- Meet strict timing constraints without sacrificing too much flexibility, and at a reasonable cost.
4Proposed approach
To manage the exploration of the solution space: Y-chart approach
[Figure: Y-chart flow diagram. Box labels: Application Modelling, Architecture Modelling, Reference Platform, Mapping, Performance Analysis, Timed Petri Net, Decision, Bounds, Branch and Bound]
5Proposed approach
- Timed Petri Net to model the application-architecture mapping
- Petri Nets are a simple and powerful graphical and mathematical tool for system modelling
- Petri Nets can represent concurrent, asynchronous, distributed, parallel systems
- Suitable for HW/SW systems
- Timed Petri Nets make it possible to evaluate the system performance
- The mathematical formalism makes it possible to extract properties of the system that can be used to automatically explore the solution space
6Formalism
- A Petri Net is defined by places, transitions, arcs, a weight function and an initial marking
- The dynamic behavior of a PN is described in terms of two rules: the enabling rule and the firing rule.
The incidence matrix $A = [a_{tp}]$ is an $n \times m$ matrix of integers ($m$ places, $n$ transitions), where $a_{tp} = a_{tp}^{+} - a_{tp}^{-}$, $a_{tp}^{+} = w(t,p)$, $a_{tp}^{-} = w(p,t)$.
[Figure: example net with its incidence matrix $A$; entries $-1$, $-1$, $1$, $1$]
An integer solution $y$ of $Ay = 0$ is called an S-invariant. An integer solution $x$ of $A^{T}x = 0$ is called a T-invariant.
In a Timed Petri Net, each transition is associated with a time.
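As an illustration of these definitions, the sketch below builds the incidence matrix of a small two-place, two-transition net and checks candidate invariants. The net, the arc dictionary layout and all function names are assumptions made for this example; they are not taken from the slides.

```python
def incidence_matrix(places, transitions, arcs):
    """Return A (rows: transitions, columns: places).

    arcs maps (source, target) -> weight, so that
    a_tp = w(t, p) - w(p, t), with missing arcs counted as weight 0.
    """
    return [[arcs.get((t, p), 0) - arcs.get((p, t), 0) for p in places]
            for t in transitions]


def mat_vec(matrix, vector):
    """Plain matrix-vector product on lists of integers."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def is_s_invariant(A, y):
    """y (one entry per place) is an S-invariant if A y = 0."""
    return all(v == 0 for v in mat_vec(A, y))


def is_t_invariant(A, x):
    """x (one entry per transition) is a T-invariant if A^T x = 0."""
    return all(v == 0 for v in mat_vec(transpose(A), x))


# Hypothetical net: a token cycles p1 -> t1 -> p2 -> t2 -> p1.
places = ["p1", "p2"]
transitions = ["t1", "t2"]
arcs = {("p1", "t1"): 1, ("t1", "p2"): 1, ("p2", "t2"): 1, ("t2", "p1"): 1}

A = incidence_matrix(places, transitions, arcs)
s_ok = is_s_invariant(A, [1, 1])   # tokens in p1 + p2 stay constant
t_ok = is_t_invariant(A, [1, 1])   # firing t1 and t2 once restores the marking
s_bad = is_s_invariant(A, [1, 0])  # p1 alone is not conserved
```

Here `[1, 1]` is both an S-invariant and a T-invariant of the example net, while `[1, 0]` is not an S-invariant.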
7Application modeling
- Datapath applications are dominated by dataflow behavior, with few control-flow constructs.
- Decomposed into distinct tasks at a coarse level of granularity (called functions)
- The tasks are computation intensive and internally strongly interconnected.
- They have an iterative nature: they repeatedly execute over different sets of input data
- Each independent chunk of input data is referred to as a data unit.
- Data-dependency-based task graph at coarse granularity.
8Architecture modeling
- The architecture is composed of executors (called resources)
- Processors
- Hardware modules.
- There can be multiple instances of the same executor, in order to satisfy the performance requirements
- The availability is the number of instances of an executor
9Mapping
- A mapping ($g$) associates application functions ($F$) to architecture resources ($R$).
- $g: F \to R$
- The execution of a function $F_i$ on a resource $R_j$ requires a certain execution time $e_{ij}$.
- Values $e_{ij}$ are known if the design process is based on IPs (Intellectual Property) or can be estimated on the basis of previous, similar implementations.
- A Timed Petri Net is used to model the mapping
10Timed Petri Nets for the mapping
- F-Place represents a function
- R-Place represents a resource
- The initial marking is the availability
- Q-Place represents a queue
- Transitions are annotated with the timing
11Timed Petri Nets for the mapping
- Pipelined resource
- Execution time $e_{ij}$: total time to execute the function.
- Stage time $s_{ij}$: rate at which the input data can be accepted (usually equal to one clock cycle).
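The difference between the two times can be illustrated with a short sketch. It assumes an idealized pipeline that accepts a new data unit every stage time and never stalls; the function name and the numbers are hypothetical.

```python
def completion_time(n_units, exec_time, stage_time=None):
    """Time to process n_units (n_units >= 1) on one resource instance.

    Non-pipelined (stage_time is None): units are processed back to back.
    Pipelined: a new unit enters every stage_time, and the last unit
    leaves exec_time after it entered.
    """
    if stage_time is None:
        return n_units * exec_time
    return exec_time + (n_units - 1) * stage_time


# Hypothetical values: execution time 5 cycles, stage time 1 cycle.
sequential = completion_time(10, 5)               # 10 * 5
pipelined = completion_time(10, 5, stage_time=1)  # 5 + 9 * 1
```

With many data units the pipelined resource approaches one result per stage time, which is why the stage time, not the execution time, limits its throughput.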
12Timed Petri Nets for the mapping
- Limit on the number of data units that can be processed simultaneously.
- It is explicit or implicit depending on the platform
- P-Place whose initial marking is a number of tokens equal to the maximum number of data units allowed in the system.
- If the communication introduces substantial overheads, it can be modeled using the same framework
- for example, a data transfer becomes a function and a bus becomes a resource
13Performance Evaluation
- The Petri Net model is consistent ($\exists\, x > 0$ such that $A^{T}x = 0$)
- The minimum cycle time of the net can be computed over the set of all the S-invariants (solutions of the equation $Ay = 0$), with $D$ the diagonal matrix of times associated with transitions and $M_0$ the initial marking
- S-invariants are the rows of $B_f$
[Figure: partition of $A$, where $A_{11}$ is a nonsingular $r \times r$ matrix, with $r$ the rank of $A$]
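The transcript does not preserve the equation shown on this slide. For orientation only, the minimum cycle time of a consistent timed Petri net is commonly stated in the Petri net literature in the form below. That the slide used exactly this form is an assumption, as are the symbol $\tau_{\min}$ and the notation $A^{-} = [a_{tp}^{-}]$ for the matrix of input-arc weights.

```latex
\tau_{\min} \;=\; \max_{k}\;
  \frac{y_k^{T}\,(A^{-})^{T}\,D\,x}{y_k^{T}\,M_0}
% y_k : k-th S-invariant (A y_k = 0)
% x   : positive T-invariant (A^T x = 0), which exists because the net is consistent
% D   : diagonal matrix of the times associated with transitions
% M_0 : initial marking
```

The numerator accumulates the time tokens spend in the places covered by $y_k$ during one cycle; the denominator is the number of tokens that the invariant keeps in those places.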
14Performance Evaluation
[Figure: partition of $A$ by place type. Labels: F-Places and Q-Places; R-Places and P-Place]
15Performance Evaluation
- There are $m - r$ S-invariants
- $m - r - 1$ corresponding to the $m - r - 1$ R-Places in the system. Each vector has elements equal to 1 for the R-Place and for the F-Places using that resource (other elements are equal to 0).
- One S-invariant corresponding to the P-Place. This vector has elements equal to 1 for the P-Place and all the F-Places and Q-Places in the system (other elements are equal to 0).
- Intuitively, the minimum cycle time is related to the processing time required by the resources to process a data unit.
- Resources that execute functions requiring long processing times are more likely to influence the minimum cycle time.
- A long computational path, even if supported by several resources, can affect the system performance.
16Performance Evaluation
- Considering a semantic interpretation, the minimum cycle time can be expressed over the $v$ functions and $z$ resources
- $M_0(R_j)$ is the initial marking of the place associated with $R_j$
- If the resource is pipelined, the time $s_{ij}$ is used instead of $e_{ij}$ when computing $rl_j$
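The slide's equation is not preserved in the transcript. The sketch below shows one plausible reading, built from the two kinds of S-invariants described earlier: each resource contributes the time of the functions mapped to it divided by its availability $M_0(R_j)$, and the P-Place contributes the total execution time divided by the maximum number of data units in flight. This reading, the interpretation of $rl_j$ as a per-resource load, and all names and numbers are assumptions, not the authors' formula.

```python
def min_cycle_estimate(mapping, exec_time, stage_time, availability,
                       pipelined, max_units):
    """Estimate the minimum cycle time of a (non-empty) mapping.

    mapping:      function -> resource
    exec_time:    (function, resource) -> e_ij
    stage_time:   (function, resource) -> s_ij (pipelined resources only)
    availability: resource -> number of instances, M0(Rj)
    pipelined:    set of pipelined resources
    max_units:    tokens initially in the P-Place
    """
    load = {}
    path_time = 0.0
    for f, r in mapping.items():
        # A pipelined resource is occupied for one stage time per data unit.
        busy = stage_time[(f, r)] if r in pipelined else exec_time[(f, r)]
        load[r] = load.get(r, 0.0) + busy
        # The data unit itself still needs the full execution time.
        path_time += exec_time[(f, r)]
    resource_bounds = {r: load[r] / availability[r] for r in load}
    path_bound = path_time / max_units
    cycle = max(max(resource_bounds.values()), path_bound)
    return cycle, resource_bounds, path_bound


# Hypothetical example: F1 on a pipelined hardware block, F2 and F3 on a
# processor with two instances, at most 4 data units in the system.
mapping = {"F1": "HW1", "F2": "uP", "F3": "uP"}
exec_time = {("F1", "HW1"): 4.0, ("F2", "uP"): 10.0, ("F3", "uP"): 6.0}
stage_time = {("F1", "HW1"): 1.0}
availability = {"HW1": 1, "uP": 2}

cycle, per_resource, path = min_cycle_estimate(
    mapping, exec_time, stage_time, availability, {"HW1"}, 4)
```

In this example the processor is the bottleneck: its load of 16 time units spread over two instances gives a bound of 8, larger than both the hardware bound (1) and the path bound (5).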
17Resources selection algorithm (hw/sw partitioning)
- The exploration of the solution space is automated using the previous equations, given a throughput constraint
- $v$ functions, $z$ resources $\Rightarrow z^{v}$ alternatives
- Coarse grain (10-20 functions)
- Not all the resources can execute all the functions
- Real alternatives $\ll z^{v}$
- Branch and bound algorithm
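A quick count shows how restricting the candidate resources shrinks the space. The function and resource counts below are made up for illustration.

```python
import math

# Upper bound: every one of v functions may go to any of z resources.
v, z = 10, 3
all_alternatives = z ** v

# Hypothetical restriction: six functions have two candidate resources,
# the other four can only run on one.
candidates_per_function = [2] * 6 + [1] * 4
real_alternatives = math.prod(candidates_per_function)
```

With these numbers the unrestricted space has 59049 mappings, while only 64 are actually feasible.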
18Exploration of the solution space
- Algorithm: Branch and Bound approach
- At each level a new function is assigned to the available resources (branching operation)
- The bounds provided by the semantic framework are evaluated (bounding operation)
- The generated tree is pruned according to the result of the bounding
[Figure: search tree. Node labels: F1→uP, F1→HW1, F2→uP, F2→HW2, F2→HW1, F2→uP]
At each node: kill the branch if Required throughput > 1 / (minimum cycle time)
- Packet processing application performing an IP (Internet Protocol) packet forwarding function
- The system receives a MAC (Medium Access Control) input packet, verifies that the packet is valid, modifies some packet fields, computes the destination MAC address and issues the packet
- Reference platform: HW/SW architecture for datapath applications developed by Altera
- Two phases
- verify the suitability of describing a system with the presented Timed Petri Net approach.
- apply the algorithm for the solution space exploration.
20Reference architecture
- Altera HW/SW solution for high-performance datapath applications
- Processor that can execute 8 threads simultaneously by means of a non-conventional multithreading scheme
- Represented as a resource with availability equal to eight and frequency equal to Fsoft/8.
- Asynchronous execution paradigm that combines Tasks, which are executed in software, and Events, which are executed by dedicated hardware blocks
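The processor representation can be made concrete with a small arithmetic sketch. The clock frequency and cycle count are hypothetical, and dividing the task time by the availability to obtain a per-resource bound is an assumption about how the model is used.

```python
def thread_resource(f_soft_mhz, threads=8):
    """Model the multithreaded processor as `threads` slower instances."""
    return {"availability": threads, "frequency_mhz": f_soft_mhz / threads}


def task_time_us(cycles, resource):
    """Execution time of a software task on one thread, in microseconds."""
    return cycles / resource["frequency_mhz"]


proc = thread_resource(200.0)            # hypothetical Fsoft = 200 MHz
time_per_task = task_time_us(100, proc)  # 100 cycles at 200/8 = 25 MHz
bound = time_per_task / proc["availability"]
```

Each task takes 4 µs on its own thread, but with eight threads in flight the processor still completes one task every 0.5 µs, the same rate as 100 cycles at the full 200 MHz.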
21Petri Net Model verification
- Comparison of the value of the minimum cycle time obtained by
- defining and simulating the Timed Petri Net with a PN simulation tool (CPN Tools)
- implementing and simulating the system through the Altera toolchain, which combines an ISS for the processor with software models of hardware blocks
- applying the performance analysis
22Petri Net Model verification
[Figure: CPN model]
23Petri Net Model verification
- Fixed partitioning
- F1, F3, F5 and F7 are executed by hardware modules (a shared module is used for F3 and F5) and F2, F4, F6 and F8 are executed on the multithreaded processor (with 8 threads)
- Different configurations
24Exploration algorithm
- Identify the solution that minimizes the area,
while maintaining the flexibility and satisfying
the throughput constraints
25Conclusion
- This paper presents a method for the solution space exploration of datapath applications with stringent throughput constraints.
- Timed Petri Nets represent the mapping of the application onto the architecture, in a Y-chart approach.
- A set of bounds is exploited by an exploration algorithm, based on a branch and bound approach, to search the solution space for a suitable performance/area configuration.
- Experimental results on a packet processing application show that
- the Petri Net model can accurately represent the behavior of a real system
- the exploration algorithm is able to find a suitable area/throughput compromise in reasonable time