Title: Hardware-Software Co-synthesis for Digital Systems
1. Hardware-Software Co-synthesis for Digital Systems
- The study of embedded computing system design
- Gupta and De Micheli, IEEE Design & Test of Computers 10, no. 3 (Sept. 1993): 29-41
- P5 of HW/SW Co-design Book
- Prepared by Dr. Kocan
2. The Problems in HW/SW Co-design
- Co-specification
  - Creating specifications that describe
    - Hardware elements
    - Software elements
    - The relationships between the elements
- Co-synthesis
  - Automatic or semi-automatic design of hardware and software to meet a specification
- Co-simulation
  - Simultaneous simulation of hardware and software elements (often at different levels of abstraction)
3. Co-synthesis Phases
- Scheduling: choosing the times at which computations occur
- Allocation: determining the processing elements (PEs) on which computations occur
- Partitioning: dividing up the functionality into units of computation
- Mapping: choosing particular component types for the allocated units
These phases are interdependent.
4. HW-SW Co-synthesis for Digital Systems
- Embedded system applications
  - Built from general-purpose processors, ASICs, and memory
  - Application-specific constraints on the relative timing of their actions
- Real-time embedded systems
5. Challenges in Real-time ESD
- Performance estimation
- Selection of appropriate parts for system implementation
- Verification of temporal and functional properties of the system
7. Synthesis-oriented approach
11. Capturing the system specification
- Capture system functionality using an HDL, e.g., HardwareC, Verilog, VHDL
- HardwareC: a programming language that models hardware correctly and unambiguously
- A HardwareC description is a set of interacting concurrent processes
  - A process restarts itself on completion
  - The process body contains nested concurrent and sequential operations
12. Example HDL functionality specification
- Two data input operations
- A conditional operation to generate the counter seed z
- A while loop to implement a down-counter
- A graph-based representation captures this specification
14. System model
- A system model consists of a set of hierarchically related sequencing graphs
  - Vertices represent language-level operations
  - Edges represent dependencies between operations
- Advantages of the graph representation
  - Makes explicit the concurrency inherent in the input specification
  - Makes it easier to reason about properties of the input description
  - Allows analysis of the timing properties of the input description
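A sequencing graph can be sketched as a small Python class. This is a minimal sketch under assumptions: the names `SequencingGraph`, `add_dep`, `close`, and `concurrent` are hypothetical, source and sink are plain no-op vertices, and hierarchy (nested graphs) is omitted. It shows how the graph makes concurrency explicit: two operations may run concurrently when neither is reachable from the other.

```python
from collections import defaultdict


class SequencingGraph:
    """Hypothetical minimal sequencing graph (hierarchy omitted).

    Vertices are operations; an edge u -> v means v depends on u.
    'source' and 'sink' are no-operation vertices bracketing the graph.
    """

    def __init__(self):
        self.succ = defaultdict(set)
        self.ops = {"source", "sink"}

    def add_dep(self, before, after):
        # 'after' may start only once 'before' has completed
        self.ops.update((before, after))
        self.succ[before].add(after)

    def close(self):
        # Connect ops without predecessors to source and ops without successors to sink.
        has_pred = {v for targets in list(self.succ.values()) for v in targets}
        for op in self.ops - {"source", "sink"}:
            if op not in has_pred:
                self.succ["source"].add(op)
            if not self.succ[op]:
                self.succ[op].add("sink")

    def reachable(self, start):
        # All vertices reachable from 'start' (iterative depth-first search).
        seen, stack = set(), [start]
        while stack:
            for nxt in self.succ[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def concurrent(self, a, b):
        # Neither op is ordered before the other, so they may execute concurrently.
        return b not in self.reachable(a) and a not in self.reachable(b)


# Slide 12 example, assuming both reads feed the conditional that computes z,
# which then seeds the down-counter loop (op names are illustrative).
g = SequencingGraph()
g.add_dep("read_a", "cond_z")
g.add_dep("read_b", "cond_z")
g.add_dep("cond_z", "down_counter")
g.close()
print(g.concurrent("read_a", "read_b"))        # True
print(g.concurrent("read_a", "down_counter"))  # False
```

The two reads are unordered in the graph, so an implementation may overlap them; the down-counter is reachable from both and must wait.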
15. Graph Model Properties
- Sink and source vertices represent no-operations
- A set of variables defines the shared memory between operations in the graph
  - Storage common to the operations
  - Facilitates communication between operations
- Single-rate execution of operations in a graph: exactly one execution of an operation for each execution of any other operation
16. Multiple Graph Models
- Operations across graph models follow multi-rate execution semantics
  - An operation may execute a variable number of times for each execution of an operation in another graph model
- Message-passing primitives (send/receive) implement communication across graph models
  - This keeps the specification of inter-model communication simple
18. Modeling Heterogeneous Systems
- Use a multi-rate specification
  - E.g., the ASIC and the processor run on clocks of different speeds
19. Nondeterministic Delay (ND) operations
- Operations that represent synchronization to external events
  - E.g., the receive() operation
- Data-dependent loop operations
- ND operations have unknown execution delays
- Modeling ND operations is vital for reactive embedded system descriptions
ND operations are drawn as double circles.
20. Many possible implementations per system model
- Timing constraints define the performance requirements of the desired implementation
- Two types of timing constraints
  - Min/max delay constraints
  - Execution rate constraints
22. Timing constraints
- Min/max delay constraints
- Execution rate constraints
- Together these are sufficient to capture the constraints needed by most real-time systems
23. Modeling of delay constraints
(Figure: min/max delay constraints and rate constraints in the constraint graph)
- Edge weight = delay of the source operation
- Backward edges = maximum delay constraints
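The edge rules above can be sketched in Python. This is a minimal sketch assuming the common constraint-graph encoding in which an edge (u, v, w) means start(v) >= start(u) + w: sequencing edges and minimum delay constraints are forward edges, and a maximum delay constraint m from u to v becomes a backward edge v -> u with weight -m. The function name `constraint_edges` and its argument layout are hypothetical.

```python
def constraint_edges(delays, deps, min_cons, max_cons):
    """Build the weighted edge list of a constraint graph (hypothetical helper).

    delays:   {op: execution delay}
    deps:     [(u, v)]        sequencing dependencies, u before v
    min_cons: [(u, v, lmin)]  v starts at least lmin cycles after u starts
    max_cons: [(u, v, umax)]  v starts at most umax cycles after u starts
    Each returned (src, dst, w) means start(dst) >= start(src) + w.
    """
    edges = [(u, v, delays[u]) for u, v in deps]            # forward: delay of the source op
    edges += [(u, v, lmin) for u, v, lmin in min_cons]      # forward: minimum separation
    edges += [(v, u, -umax) for u, v, umax in max_cons]     # backward: start(u) >= start(v) - umax
    return edges


# Op 'a' takes 2 cycles; 'b' must start within 4 cycles of 'a'.
print(constraint_edges({"a": 2, "b": 3}, [("a", "b")], [], [("a", "b", 4)]))
# [('a', 'b', 2), ('b', 'a', -4)]
```

The negative backward weight turns "at most" into the same ">=" form as every other edge, which is what lets one longest-path analysis handle both constraint kinds.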
24. Model Analysis
- Estimate system performance
- Verify the consistency of the specified constraints
- Performance measures
  - Estimation of operation delays
  - Separate estimates for hardware and software implementations
    - Software: based on the processor that runs the software
    - Hardware: based on the type of hardware to be used
25. Processor cost model
- Execution delay function for a basic set of processor operations
- Memory address calculation function
- Memory access time
- Processor interrupt response time
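A processor cost model of this shape can be sketched as a lookup table plus per-access and per-interrupt terms. Every number below is a hypothetical placeholder, not data for any real processor, and `software_delay` is an illustrative name; the sketch only shows how the four ingredients listed above combine into a delay estimate for a straight-line sequence of operations.

```python
# Hypothetical cost table in processor cycles (placeholders, not real processor data).
OP_DELAY = {"add": 1, "sub": 1, "mul": 4, "cmp": 1, "branch": 2}
ADDR_CALC = 1        # cycles to compute one memory address (assumed constant)
MEM_ACCESS = 3       # cycles per memory access (assumed constant)
INTERRUPT_RESP = 20  # cycles to respond to an interrupt (assumed constant)


def software_delay(ops, mem_refs, interrupt_driven=False):
    """Estimate the delay of a straight-line sequence of basic operations.

    ops:              basic operation names, executed in order
    mem_refs:         number of operand loads/stores the sequence performs
    interrupt_driven: True if the sequence starts in response to an interrupt
    """
    delay = sum(OP_DELAY[op] for op in ops)
    delay += mem_refs * (ADDR_CALC + MEM_ACCESS)
    if interrupt_driven:
        delay += INTERRUPT_RESP
    return delay


print(software_delay(["add", "mul", "cmp", "branch"], mem_refs=2))  # 16
```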
26. Timing constraint analysis
- Can the imposed constraints be satisfied for a given implementation?
  - Assign appropriate delays to the operations with known delays in the graph model
- CONSTRAINT SATISFIABILITY
  - Relates graph structure, actual delays, and constraint values
  - Some structural properties of graphs may make a constraint unsatisfiable (ND operations)
  - Some constraints may be mutually inconsistent
    - E.g., a maximum delay constraint between two operations that also have a larger minimum delay constraint
    - No assignment of nonnegative operation delay values can satisfy such constraints
27. Presence of ND operations
- A timing constraint is satisfiable if it is satisfied for all possible (possibly unbounded) delay values of the ND operations
- A timing constraint is marginally satisfiable if it can be satisfied for all possible values within the specified bounds on the delays of the ND operations
  - This relies on implementation assumptions (acceptable bounds on ND operation delays)
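For a maximum delay constraint across a chain of operations, marginal satisfiability can be checked by evaluating the chain at the worst case the assumed bounds allow. The sketch below assumes the constrained operations form a simple chain whose total delay grows with each ND delay, so checking the upper bounds covers every value in range; `marginally_satisfiable` and the bound values are hypothetical.

```python
def marginally_satisfiable(path, fixed_delay, nd_bounds, max_delay):
    """Check a max-delay constraint along a chain of operations.

    path:        operations executed between the constrained pair, in order
    fixed_delay: {op: known delay}
    nd_bounds:   {op: (lo, hi)} assumed bounds on the delays of ND operations
    max_delay:   the maximum delay constraint across the chain
    The chain delay is nondecreasing in every ND delay, so the constraint
    holds for all values within the bounds iff it holds at the upper bounds.
    """
    worst = sum(nd_bounds[op][1] if op in nd_bounds else fixed_delay[op] for op in path)
    return worst <= max_delay


# A read, a data-dependent loop assumed to take 0..10 cycles, then a write.
fixed = {"read": 2, "write": 2}
nd = {"loop": (0, 10)}
print(marginally_satisfiable(["read", "loop", "write"], fixed, nd, 14))  # True
print(marginally_satisfiable(["read", "loop", "write"], fixed, nd, 13))  # False
```

Without the assumed bound on the loop, no finite maximum delay could be guaranteed, which is exactly why only marginal satisfiability is available here.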
28. Timing Analysis by graph analysis
- (1) No ND operations in the graph
  - All edges have finite, known weights
  - A min/max delay constraint cannot be satisfied if the graph model contains a positive cycle (a cycle whose edge weights sum to a positive value)
- (2) ND operations exist
  - Satisfiable if no cycle contains ND operations (and the graph has no positive cycles)
  - If a cycle contains ND operations, the satisfiability of the timing constraints cannot be determined; only marginal satisfiability can be guaranteed
  - Cycles can be broken by graph transformation
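The positive-cycle test for case (1) can be sketched with a Bellman-Ford-style longest-path relaxation. This assumes the edge convention (u, v, w) meaning start(v) >= start(u) + w, with maximum delay constraints already encoded as negative-weight backward edges; `has_positive_cycle` is a hypothetical name. If start times keep growing after |V| - 1 relaxation rounds, some cycle has positive total weight and no start-time assignment satisfies the constraints.

```python
def has_positive_cycle(nodes, edges):
    """Return True if the constraint graph has a cycle of positive total weight.

    edges: (src, dst, w) meaning start(dst) >= start(src) + w.
    """
    # Equivalent to a virtual source with 0-weight edges to every node.
    start = {n: 0 for n in nodes}
    for _ in range(len(nodes) - 1):
        changed = False
        for u, v, w in edges:
            if start[u] + w > start[v]:
                start[v] = start[u] + w
                changed = True
        if not changed:
            return False  # longest paths converged: no positive cycle
    # Any further improvement is only possible around a positive cycle.
    return any(start[u] + w > start[v] for u, v, w in edges)


# Slide 26 case: b must start at least 5 cycles after a (min) and at most 3 after a (max).
print(has_positive_cycle(["a", "b"], [("a", "b", 5), ("b", "a", -3)]))  # True
# Consistent case: at least 2 and at most 4 cycles apart.
print(has_positive_cycle(["a", "b"], [("a", "b", 2), ("b", "a", -4)]))  # False
```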
29. Timing analysis
- Nonpipelined implementations
  - Rate constraints can be expressed as min/max delay constraints between the corresponding source and sink operations of the graph model
  - The min/max constraint satisfiability criterion then applies to the analysis of rate constraints
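For a nonpipelined implementation in which the process restarts itself on completion, the interval between successive executions of an operation equals the source-to-sink latency, so a rate constraint reduces to bounds on that latency. The sketch below assumes an acyclic graph with known delays; `latency`, `meets_rate`, and the example delays are hypothetical.

```python
from functools import lru_cache


def latency(succ, delay, source="source"):
    """Longest source-to-sink path delay of an acyclic sequencing graph."""

    @lru_cache(maxsize=None)
    def finish(op):
        # Time from the start of 'op' until the sink is reached.
        nxt = succ.get(op, ())
        return delay.get(op, 0) + max((finish(n) for n in nxt), default=0)

    return finish(source)


def meets_rate(succ, delay, min_rate=None, max_rate=None):
    """Check execution-rate bounds for a nonpipelined implementation.

    Successive executions are exactly one latency apart, so
    rate >= min_rate  <=>  latency <= 1 / min_rate   (a max delay constraint)
    rate <= max_rate  <=>  latency >= 1 / max_rate   (a min delay constraint)
    """
    lat = latency(succ, delay)
    if min_rate is not None and lat > 1 / min_rate:
        return False
    if max_rate is not None and lat < 1 / max_rate:
        return False
    return True


succ = {"source": ["read"], "read": ["body"], "body": ["sink"]}
delay = {"read": 2, "body": 6}  # source and sink are no-ops
print(latency(succ, delay))                    # 8
print(meets_rate(succ, delay, min_rate=0.1))   # True  (8 <= 10)
print(meets_rate(succ, delay, min_rate=0.2))   # False (8 > 5)
```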
30. Example: Rate constraints (graphs with ND ops)

    process test(p, ...)
        in port p[SIZE];
    {
        boolean v[INT_SIZE];

        v = read(p);
        while (v > 0) {
            <loop-body>
            v = v - 1;
        }
    }

- Rate constraint on the read operation
- The unbounded while operation is an ND operation
- v: a Boolean array that represents an integer
31. (continued)
- The overall execution time of the while loop determines the interval between successive executions of the read operation
- Because of this variable-delay while loop operation, the input rate at port p is variable
  - The required rate constraint cannot always be guaranteed
- Ensure marginal satisfiability of the rate constraint by graph transformation and by using a finite-size buffer
32. P transformed into fragments Q and R
- The rate constraint is modeled as an edge from sink to source
33. Software Implementation of Ex. A
- Two threads: for each execution of T1, T2 executes v times

    Thread T1:
        read v
        detach

    Thread T2:
        loop synch
        <loop_body>
        v = v - 1
        detach
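The two-thread scheme, together with the finite-size buffer from slide 31, can be sketched with Python generators standing in for coroutines: each `yield` plays the role of `detach`, and a round-robin loop plays the runtime scheduler. The buffer capacity, the round-robin policy, and the names `t1`, `t2`, and `run` are assumptions made for illustration; the trace records the value of v at each loop-body execution.

```python
from collections import deque


def t1(port, buf, cap):
    """Thread T1: read v from port p, pass it to T2 through a bounded buffer, detach."""
    for v in port:
        while len(buf) >= cap:   # buffer full: detach and retry later
            yield
        buf.append(v)            # read v (handed to T2 via the buffer)
        yield                    # detach


def t2(buf, trace):
    """Thread T2: synchronize on a buffered v, then run the loop body v times."""
    while True:
        while not buf:           # loop synch: wait for T1 to supply a value
            yield
        v = buf.popleft()
        while v > 0:
            trace.append(v)      # stand-in for <loop_body>
            v = v - 1
            yield                # detach after each iteration


def run(port, cap=2, max_steps=100):
    """Round-robin scheduler (assumed policy); stops after max_steps dispatches."""
    buf, trace = deque(), []
    ready = deque([t1(port, buf, cap), t2(buf, trace)])
    for _ in range(max_steps):
        if not ready:
            break
        thread = ready.popleft()
        try:
            next(thread)
            ready.append(thread)
        except StopIteration:
            pass                 # T1 finishes once the port is exhausted
    return trace


print(run([2, 3]))  # [2, 1, 3, 2, 1]
```

Because T1 only blocks when the buffer is full, a long loop in T2 no longer stalls the read of the next input, which is the intent of the transformation.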
34. Process P with ND operation
- An ND operation due to an unbounded loop
- The ND operation induces a bipartition of the calling process P: P = F ∪ B
  - F (e.g., the read operation): operations that must be performed before invoking the loop body
  - B: operations that can only be performed after the executions of the loop body complete
- Functional pipelining of F, B, and the loop improves the reaction rate of P
- Note: we assume nonpipelined hardware, so the pipelining is done in software
35. Constraint Analysis and Software
- Software running on a single processor imposes linear execution semantics
  - This complicates constraint analysis for the software implementation of a graph model
- A complete order of operations is necessary to perform delay analysis
  - A complete ordering may create unbounded cycles, which make constraints unsatisfiable
36. Example: Complete Ordering of Operations
37. Communication Ops in SW
- Computation operations must be performed serially
- Communication operations can proceed concurrently
  - Overlap the execution of ND operations (waiting for synchronization or communication) with some unrelated computation
  - Requires dynamic software scheduling
  - Simultaneously active ND operations may complete in orders that cannot be determined statically
38. Software model: a set of fixed-latency concurrent threads
- Delay overheads of dynamic scheduling
39. Thread
- A linearized set of operations
- May or may not begin with an ND operation (indicated by a circle)
- A thread contains no ND operation other than the one it may begin with
- The delay of the initial ND operation is part of the scheduling delay (not included in the latency of the thread)
- Multiple threads avoid the complete serialization of all operations, which may create unbounded cycles
- The SW model enables checking the marginal satisfiability of constraints on operations belonging to different threads
  - Assumes a fixed and known delay for the scheduling operations associated with ND operations
40. System Partitioning
- The system-level partitioning problem: assigning operations to hardware or software
  - The assignment determines the delay of the operations
  - Communication overheads depend on ASIC or processor assignment
    - Minimize communication delay
  - Move more operations to SW to increase processor utilization
- Overall system performance
  - The HW/SW partition affects both the utilization of the processor AND the bandwidth of the bus between the processor and the ASIC hardware
- Devise a partitioning cost function
  - Sizes of the HW and SW parts
  - Timing behavior: capture timing performance during partitioning
    - Timing behavior is hard to capture (use approximation techniques)
41. Hardware/Program partitioning
- HW partitioning divides the circuits that implement scheduled operations
- Program-level partitioning addresses operations that are scheduled at runtime
  - Uses statistical timing properties to drive partitioning algorithms
42. Use of timing properties in the partition cost function
Use deterministic bounds on timing properties that are incrementally computable in the partition cost function.
43. Characterization of SW
- Thread latency (L): the execution delay of a program thread
- Thread execution rate (R): the invocation rate of the thread
- Processor utilization: $P = \sum_i (L_i \times R_i)$
- Bus utilization (B): the total amount of communication between HW and SW
  - To transfer m variables: $B = \sum_{j=1}^{m} r_j$
  - $r_j$: the inverse of the minimum time interval between two consecutive samples of variable j
- Calculate static bounds on SW performance from L, R, B, and P
  - These bounds overestimate the performance parameters. Why?
    - Thread invocations and communications are distributed according to actual data values, which are unknown statically
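The two utilization bounds can be computed directly from per-thread and per-variable figures. This sketch assumes worst-case (static) values are already known: each thread is given as a (latency in seconds, invocation rate in Hz) pair, and each variable crossing the boundary by its minimum inter-sample interval in seconds, so r_j is its reciprocal. The function names and numbers are hypothetical.

```python
def processor_utilization(threads):
    """P = sum of L_i * R_i over program threads; threads = [(latency_s, rate_hz)]."""
    return sum(lat * rate for lat, rate in threads)


def bus_utilization(min_intervals):
    """B = sum of r_j over the m variables crossing the HW/SW boundary,
    where r_j = 1 / (minimum interval between consecutive samples of variable j)."""
    return sum(1.0 / t for t in min_intervals)


# Two threads: 2 us latency invoked at 100 kHz, 5 us latency invoked at 40 kHz.
P = processor_utilization([(2e-6, 100e3), (5e-6, 40e3)])
# Two variables sampled at most every 10 us and every 20 us.
B = bus_utilization([10e-6, 20e-6])
print(round(P, 3), round(B))  # 0.4 150000
```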
44. Hardware Size, Interface Characterization
- HW size: the sum of the size estimates of the resources implementing the operations
- Assign ports for communication between HW and SW: one port per variable
- Bus bandwidth captures the overhead of communication
45. Partitioning a specification into HW and SW implementations
- Given the cost models for software, hardware, and the interface
- Given a set of sequencing graph models and timing constraints between operations, create two sets of sequencing graph models such that one can be implemented in hardware and the other in software
46. Constraints after Partitioning
- Timing constraints are satisfied for the two sets of graph models
- Processor utilization: $P < 1$
- Bus utilization: $B < \bar{B}$ (the available bus bandwidth)
- A partition cost function to minimize: $\min f(S_{HW}, B, P^{-1}, m)$
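The slides name the arguments of f but not its form. The sketch below assumes, purely for illustration, a weighted sum; the weights, the `Partition` record, and the helper names are hypothetical. It checks the two utilization constraints and scores a partition so that smaller hardware, less bus traffic, higher processor utilization (smaller P^-1), and fewer interface variables m all lower the cost.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    hw_size: float  # S_HW: summed size estimates of the hardware resources
    bus: float      # B: bus utilization (transfers per second)
    proc: float     # P: processor utilization
    m: int          # number of variables crossing the HW/SW interface


def feasible(p, bus_bandwidth):
    # Constraints after partitioning: P < 1 and B < B_bar.
    return p.proc < 1 and p.bus < bus_bandwidth


def cost(p, w_size=1.0, w_bus=1e-4, w_proc=10.0, w_m=5.0):
    """Hypothetical weighted-sum form of f(S_HW, B, P^-1, m)."""
    inv_p = float("inf") if p.proc == 0 else 1.0 / p.proc  # all-HW partition: P = 0
    return w_size * p.hw_size + w_bus * p.bus + w_proc * inv_p + w_m * p.m


p = Partition(hw_size=100, bus=1e4, proc=0.5, m=2)
print(feasible(p, bus_bandwidth=2e4), cost(p))
```

A weighted sum is only one choice; any function that is monotone in the same directions would express the same preferences.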
47. Intuitive features of the partitioning algorithm
- Identify operations that can be implemented in software such that
  - The constraint graph implementation can be satisfied
  - The resulting software meets the rate constraints on its inputs/outputs
- Initial partition
  - The ND operations of the data-dependent loop operations define the beginnings of the program threads
  - All other operations are in HW
- Compute the reaction rates of the threads
  - Maximum reaction rate of a thread = the inverse of its latency
  - The latency of a program thread is computed from the processor delay cost model and a fixed scheduling overhead delay
- Iterative improvement
  - Migrate an operation; this affects (1) the execution delay, (2) the latency, and (3) the reaction rate of the thread into which the operation moves
  - Compute its effect on processor utilization and bus bandwidth
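The initial-partition and migration steps can be sketched as a greedy loop. This is a simplified, hypothetical version, not the published algorithm: every operation starts in hardware, each operation is pre-assigned to the thread it would join, candidates are tried in order of decreasing hardware size, and a move is kept only if every thread still meets its rate (1/L >= R), P < 1, and the bus load stays below the bus bandwidth. Field names and numbers are illustrative.

```python
def greedy_partition(ops, thread_rate, overhead, bus_bandwidth, base_bus=0.0):
    """Greedy migration of operations from hardware to software (illustrative only).

    ops:         {name: (thread, sw_delay, hw_size, bus_load)}  -- hypothetical fields
    thread_rate: {thread: required invocation rate R}
    overhead:    fixed scheduling overhead included in every thread's latency
    Returns (software_ops, hardware_ops).
    """
    def acceptable(sw_ops):
        # Thread latency from the (assumed) delay estimates plus scheduling overhead.
        lat = {t: overhead for t in thread_rate}
        for op in sw_ops:
            lat[ops[op][0]] += ops[op][1]
        # Rate constraint: maximum reaction rate 1/L must reach R, i.e. L * R <= 1.
        if any(lat[t] * thread_rate[t] > 1 for t in thread_rate):
            return False
        proc = sum(lat[t] * thread_rate[t] for t in thread_rate)  # P
        bus = base_bus + sum(ops[op][3] for op in sw_ops)         # B
        return proc < 1 and bus < bus_bandwidth

    # Initial partition: all operations in hardware; threads hold only their ND entry points.
    software = set()
    # Iterative improvement: try moving the largest hardware operations first.
    for op in sorted(ops, key=lambda o: ops[o][2], reverse=True):
        if acceptable(software | {op}):
            software.add(op)
    return software, set(ops) - software


ops = {"a": ("T1", 2, 50, 1), "b": ("T1", 3, 30, 1), "c": ("T1", 10, 40, 1)}
sw, hw = greedy_partition(ops, {"T1": 0.1}, overhead=1, bus_bandwidth=5)
print(sorted(sw), sorted(hw))  # ['a', 'b'] ['c']
```

Here "c" stays in hardware because adding its 10-cycle software delay would push the thread latency to 13, above the 10-cycle period implied by R = 0.1.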
48. System Synthesis
- Synthesize the individual HW and SW components
- Here: generation of the interface and software from the partitioned models
  - The program threads are known
  - Use a coroutine scheme for program generation
  - Limit all external dependencies to the first and last statements of the threads, so that the threads are convex
    - Concurrency might be reduced!
49. Rate Constraints and Software
- Because of dependencies on ND operations, the SW implementation may not meet the data rate constraints on its I/O ports
- Synchronization-related ND operations
  - Assign a context-switch delay to the respective wait operations
  - Check for marginal satisfiability of the timing constraints
- Unbounded loop-related ND operations
  - Estimate loop index values for marginal satisfiability analysis
50. Example C
- Goal: obtain a deterministic bound on the reaction rate of the calling thread T1
- Unroll the looping thread into a variable number of program threads
  - Each new thread adds scheduling overhead
  - Dynamic creation of threads may violate the processor utilization constraint
- Overlap the execution of T1 and T2 to ensure marginal timing constraint satisfiability
  - Remove the wait2 operation if T2 does not modify a common variable; use a buffer to maintain the reaction rate
51. No unbounded-delay operations
- Simplify a SW component into a single program thread and a single data channel
  - All data transfers are serialized
- Disadvantage of this approach: no support for reordering or branching
52. Example D
- HW/SW interface
  - Data queues on each channel
  - A control FIFO (holds the thread_ids in the order in which their input data arrives)
  - FIFO depth = the number of threads of execution
- Nonpreemptive, priority-based scheduling with FIFO control
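The data-queue and control-FIFO arrangement can be sketched as a small software model. This is a minimal sketch under stated assumptions: the class and method names are hypothetical, each thread is a plain Python function that consumes one data item, hardware writes are simulated by method calls, and thread priorities are not modeled, so dispatch follows arrival order only. The control FIFO is sized to the number of threads, as on the slide.

```python
from collections import deque


class ControlFifoInterface:
    """Hypothetical model of a HW/SW interface: one data queue per channel plus a
    control FIFO of thread ids in the order their input data arrived."""

    def __init__(self, threads):
        self.threads = threads                         # {thread_id: handler(data)}
        self.data = {tid: deque() for tid in threads}  # one data queue per channel
        self.control = deque()                         # control FIFO of thread ids
        self.depth = len(threads)                      # FIFO depth = number of threads

    def hw_write(self, tid, value):
        # Hardware side: enqueue the data item, then record which thread must consume it.
        if len(self.control) >= self.depth:
            raise OverflowError("control FIFO full")
        self.data[tid].append(value)
        self.control.append(tid)

    def sw_dispatch(self):
        # Software side: run each thread to completion (nonpreemptive), in arrival order.
        results = []
        while self.control:
            tid = self.control.popleft()
            results.append(self.threads[tid](self.data[tid].popleft()))
        return results


# Graphics-controller flavor: line and circle threads (handlers are stand-ins).
iface = ControlFifoInterface({
    "line": lambda seg: ("line", seg),
    "circle": lambda r: ("circle", r),
})
iface.hw_write("circle", 5)
iface.hw_write("line", (0, 0, 3, 4))
print(iface.sw_dispatch())  # [('circle', 5), ('line', (0, 0, 3, 4))]
```

Because the control FIFO preserves arrival order, data items are consumed in the order in which the hardware produced them, regardless of which channel they came from.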
53. Example E
- Actual interconnection schematic between HW and SW for a single data queue
- The ControlFIFO and its associated control logic can be implemented as part of the ASIC or in software
54. Marginal timing satisfiability analysis
The input rate at port p is variable, so the reaction rate of T1 cannot be guaranteed.
55. Hardware-software interface
- Data transfer from HW to SW must be explicitly synchronized
  - Polling strategy
- Accommodation of different rates of execution among HW and SW components (also due to unbounded-delay operations)
- Dynamic scheduling of the different threads of execution
  - Use a control FIFO for scheduling
  - Data items are consumed in the order in which they are produced
56. Interface Protocol for Graphics Controller
Two threads generate line and circle coordinates in software; the control FIFO holds the ids of the threads.