Title: Characterizing Embedded Applications for Instruction-Set Extensible Processors
1Characterizing Embedded Applications for
Instruction-Set Extensible Processors
- By Pan Yu at embedded system seminar
- Mar. 11th, 2004
2Overview
- What is the limit of potential performance
speedup using instruction-set extensible
processors?
- How will different constraints restrict
performance potential?
- By relaxing control flow, can we gain much?
3Agenda
- Background introduction about custom architecture
- Aim and motivation of this work
- Methodology and results interpretation
- Implications and conclusion
4Background Introduction about Custom Architecture
5The efficiency of HW
- 8 years ago
- When CPU is not sufficient, use HW
- Efficiency of HW
In the age of 486 processors
200 MPEG-I decompression card!
6Energy Efficiency
- The same amount of energy
- HW: read data from memory, do computation, store
data back
- SW: the same as the above, plus...
- Read instruction from the memory,
- decode instruction,
- schedule and issue instruction,
- prefetch instruction,
- pipeline flushing
4 hours of MPEG-2 decoding on a 900 mAh battery
using dedicated decoding HW
OR
2 hours of MPEG-2 decoding on a 3800 mAh battery
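As a rough reading of the two battery figures, and assuming both batteries run at a similar voltage (an assumption; the slide gives only capacities), the average current draw of each configuration works out as follows. The symbols I_1 and I_2 are introduced here only for illustration.

```latex
\[
I_{1} \approx \frac{900\ \mathrm{mAh}}{4\ \mathrm{h}} = 225\ \mathrm{mA},
\qquad
I_{2} \approx \frac{3800\ \mathrm{mAh}}{2\ \mathrm{h}} = 1900\ \mathrm{mA},
\qquad
\frac{I_{2}}{I_{1}} \approx 8.4
\]
```

Under that assumption, the dedicated-HW configuration uses roughly one eighth of the charge per hour of decoding.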
7Comparison between HW and SW
- SW
- Highly adaptive
- Suitable for control oriented computation
- Lower throughput, higher energy consumption
- HW
- Highly specialized (not adaptive)
- Suitable for computation-intensive tasks
- Higher throughput, lower energy consumption
- Custom architecture
- Easy to design (adaptive, flexible)
- Higher performance, lower energy consumption
8Typical Instruction-set extensible processor
- A fragment of the program's data dependence graph
is mapped to a CFU (Custom Functional Unit)
- Higher performance, lower energy consumption
- Release the use of temporary registers
- Reduce code size
- CFUs interface to the software as ISA Extension
9Aim and Motivation of This Work
10Sub questions of custom instruction
identification automation
- What topology of data dependence subgraph can
be formed as a custom instruction?
- How to identify all potential subgraphs
efficiently?
- How to evaluate and pick out the ones that are
most useful in terms of performance speedup or
energy efficiency, under different design
constraints?
Multiple Input Single Output (MISO)
Multiple Input Multiple Output (MIMO)
Simultaneously schedulable
11A Limit Study
- Questions about custom instruction identification
should include how limiting the different
constraints are
- Number of operands
- Number of custom instructions
- Area for custom logic
- Control flow constraint
- Control flow constraint
- Previous work identifies and evaluates custom
instructions within individual basic blocks.
This simplifies compiler support, but restricts
potential performance improvement
- We use a trace-based method to break the control flow
constraint
12Three questions to answer
- What is the limit of potential performance
speedup using instruction-set extensible
processors?
- How will different constraints restrict
performance potential?
- By relaxing control flow, can we gain much?
13Methodology and Results Interpretation
14Pattern Identification
- Enumerate all the MIMO subgraphs of the DDG
- Contain only arithmetic operations
- Number of input/output constraints
- Convexity condition
- An exhaustive enumeration of the existence/non-existence
of each node
- Number all the nodes by a reverse topological
sort
- Enumeration is from lower-numbered nodes to
higher-numbered ones
- The search space can be viewed as an N-level
binary tree (where N is the total number of nodes
in the DDG)
Non-convex subgraph
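The convexity condition can be sketched as a small check: a subgraph is convex if no data path leaves it and later re-enters it, since such a path would make the subgraph impossible to issue as one atomic instruction. The function name `is_convex`, the `succ` adjacency dictionary, and the diamond-shaped example DDG are all assumptions made for illustration, not taken from the slides.

```python
def is_convex(cut, succ):
    """Return True if no path leaves `cut` and later re-enters it.

    cut  -- iterable of node ids forming the candidate subgraph
    succ -- dict: node id -> list of consumer node ids (DDG edges)
    """
    cut = set(cut)
    # Nodes just outside the cut that consume a value produced inside it.
    stack = [s for v in cut for s in succ[v] if s not in cut]
    seen = set()
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        for s in succ[v]:
            if s in cut:
                return False  # the path went out of the cut and came back
            stack.append(s)
    return True


# Hypothetical diamond-shaped DDG: node 3 feeds 1 and 2, which both feed 0.
DDG = {0: [], 1: [0], 2: [0], 3: [1, 2]}
print(is_convex({0, 3}, DDG))        # False: 3 -> 1 -> 0 passes through node 1
print(is_convex({0, 1, 2, 3}, DDG))  # True
```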
15Pattern Identification (contd)
- Prune the search space when violating the convexity
condition or the number-of-outputs constraint
Example of identifying patterns containing fewer
than 2 outputs
Non-convex. Conflict. Prune sub-search space.
3 outputs. Conflict. Prune sub-search space.
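The binary-tree search with pruning can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: nodes are assumed to be numbered 0..N-1 so that every consumer has a smaller number than its producer (a reverse topological order), a node with no consumers is assumed to be live-out, and the names `enumerate_patterns`, `forbidden`, and the example DDG are hypothetical. Because all consumers of a node are decided before the node itself, a convexity or output-count violation can never be repaired deeper in the tree, which is what makes pruning safe. The input count is checked only at the leaves, since adding a producer can remove an input.

```python
def enumerate_patterns(succ, max_in, max_out, forbidden=frozenset()):
    """Enumerate convex subgraphs with bounded inputs and outputs.

    succ      -- dict: node -> list of consumers, each numbered lower than node
    forbidden -- nodes that may not join a pattern (e.g. non-arithmetic ops)
    """
    n = len(succ)
    pred = {v: set() for v in succ}
    for v, consumers in succ.items():
        for s in consumers:
            pred[s].add(v)

    in_cut = [False] * n   # current include/exclude decision per node
    reach = [False] * n    # node is in the cut or can reach a cut node
    found = []

    def num_inputs():
        ext = set()
        for v in range(n):
            if in_cut[v]:
                ext.update(p for p in pred[v] if not in_cut[p])
        return len(ext)

    def search(v, n_out):
        if v == n:  # leaf of the N-level binary tree
            if any(in_cut) and num_inputs() <= max_in:
                found.append(frozenset(i for i in range(n) if in_cut[i]))
            return
        # 0-branch: node v is not in the pattern.
        reach[v] = any(reach[s] for s in succ[v])
        search(v + 1, n_out)
        # 1-branch: node v is in the pattern.
        if v in forbidden:
            return
        if any(reach[s] and not in_cut[s] for s in succ[v]):
            return  # non-convex: prune the whole subtree
        is_output = (not succ[v]) or any(not in_cut[s] for s in succ[v])
        if n_out + is_output > max_out:
            return  # too many outputs: prune the whole subtree
        in_cut[v] = True
        reach[v] = True
        search(v + 1, n_out + is_output)
        in_cut[v] = False

    search(0, 0)
    return found


# Hypothetical DDG: loads 4 and 5 feed node 3; 3 feeds 1 and 2; both feed 0.
DDG = {0: [], 1: [0], 2: [0], 3: [1, 2], 4: [3], 5: [3]}
patterns = enumerate_patterns(DDG, max_in=2, max_out=1, forbidden={4, 5})
for p in sorted(patterns, key=lambda s: (len(s), sorted(s))):
    print(sorted(p))
```

With these limits the non-convex set {0, 3} and the two-output set {1, 2} are both cut off before their subtrees are explored.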
16Pattern evaluation and selection
- Objective
- Cover each original instruction in the code with
zero or one custom instruction, such that
- 1. The best program acceleration is achieved
- 2. Where multiple patterns partially
overlap, only one is selected
- 3. The different design constraints are respected
17Pattern evaluation and selection ILP formulation
- Isomorphic subgraphs are instances of the same
pattern, which can use the same CFU
- Objective function
- Maximize total performance gain
- Covering exclusion constraint
- Area constraint
- Number of custom instructions constraint
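One way to write down an ILP with these four ingredients is sketched here. The notation is an assumption made for illustration and is not taken from the slides: x_i selects pattern instance i, y_p says that pattern p gets a CFU, g_i is the performance gain of instance i, a_p is the area of pattern p, p(i) is the pattern that instance i belongs to, I(v) is the set of instances covering original instruction v, A is the area budget, and K is the maximum number of custom instructions.

```latex
\begin{align*}
\max \quad & \sum_{i} g_i \, x_i
  && \text{(total performance gain)} \\
\text{s.t.} \quad
  & \sum_{i \in I(v)} x_i \le 1 \quad \forall v
  && \text{(covering exclusion)} \\
  & x_i \le y_{p(i)} \quad \forall i
  && \text{(an instance needs its pattern's CFU)} \\
  & \sum_{p} a_p \, y_p \le A
  && \text{(area)} \\
  & \sum_{p} y_p \le K
  && \text{(number of custom instructions)} \\
  & x_i,\ y_p \in \{0, 1\}
\end{align*}
```

Counting area once per pattern rather than once per instance reflects the point that isomorphic instances can share a CFU.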
18Heuristic selection methods
- ILP is intractable when there are too many variables
and constraints
- Heuristic methods
- Rank each pattern instance with a priority
function
- Scan the ranked list once, greedily selecting the current
highest-ranked pattern instance if it does not violate
the constraints
- If selected, throw out the instances that collide with
the current one
- 3 priority functions
- 1. Performance/cost ratio
- 2. Software execution time
- 3. Performance gain
- All 3 heuristic methods (one per priority function)
are run, and the result with the highest total
performance gain is used, as it is the nearest
to the optimal solution given by the ILP method
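The greedy scan can be sketched as follows. This is a minimal illustration, not the authors' code: the `Instance` record, its field names, the reading of "cost" as CFU area, the choice to charge area once per pattern, and the example numbers are all assumptions.

```python
from collections import namedtuple

# Hypothetical record for one pattern instance.
Instance = namedtuple("Instance", "pattern nodes gain sw_time area")

# The three priority functions; "cost" is assumed to mean CFU area (> 0).
PRIORITIES = {
    "gain_per_area": lambda i: i.gain / i.area,
    "sw_time": lambda i: i.sw_time,
    "gain": lambda i: i.gain,
}


def greedy_select(instances, priority, max_area, max_patterns):
    """Scan the ranked list once, taking each instance that still fits."""
    chosen, covered, pattern_area = [], set(), {}
    for inst in sorted(instances, key=priority, reverse=True):
        if covered & inst.nodes:
            continue  # collides with an already selected instance
        if inst.pattern not in pattern_area:
            # A new pattern needs a new CFU: check both budgets.
            if len(pattern_area) + 1 > max_patterns:
                continue
            if sum(pattern_area.values()) + inst.area > max_area:
                continue
            pattern_area[inst.pattern] = inst.area
        covered |= inst.nodes
        chosen.append(inst)
    return chosen


def best_of_three(instances, max_area, max_patterns):
    """Run all three heuristics and keep the one with the highest total gain."""
    runs = [greedy_select(instances, p, max_area, max_patterns)
            for p in PRIORITIES.values()]
    return max(runs, key=lambda sel: sum(i.gain for i in sel))


# Made-up example: two instances of pattern A, one each of B and C.
instances = [
    Instance("A", frozenset({1, 2, 3}), gain=6, sw_time=9, area=4),
    Instance("A", frozenset({7, 8, 9}), gain=6, sw_time=9, area=4),
    Instance("B", frozenset({3, 4}), gain=5, sw_time=6, area=1),
    Instance("C", frozenset({5, 6}), gain=2, sw_time=3, area=1),
]
best = best_of_three(instances, max_area=5, max_patterns=2)
print(sum(i.gain for i in best), sorted({i.pattern for i in best}))
```

In this example the ratio-based ranking picks B and C first and ends with a lower total gain than the other two rankings, which is why running all three and keeping the best result is useful.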
19Number of output operand MISO vs MIMO
- Significant improvement from MISO to MIMO
- Usually, 2 outputs are good enough
20Number of input operand
- 4-5 inputs are quite good for both MISO and MIMO
21Resource Constraint
- 25 adders' worth of resources is good for most benchmarks
- (for jpeg, the biggest benchmark, 200 adders' worth of
resources will suffice)
22Total No. of custom instructions
- Usually, 5 custom instructions will be quite good
23Breaking Basic Block Boundary Using Compact
Trace
- Break Basic Block Boundaries (4B)
- Need execution order among basic blocks
- Control Flow Graph with profiling
- Accumulative information can only partially
deduce the program's dynamic behavior
- Program Trace
- Marnix Arnold's PhD thesis
- Huge in space
- Traversing the same program region multiple times
- Compact Trace
- Reduce both space and time complexity considerably
24Whole Program Path (WPP) using SEQUITUR compressor
- SEQUITUR: an on-the-fly string compressor
- Input: a string (the trace) over the alphabet of
basic blocks
- Output: a context-free grammar representing the
trace
- Repeated sequences of substrings are
represented by the same non-terminal symbol of
the grammar rules
- An example of a short trace: 232324245
The WPP is a DAG in which leaf nodes represent single
basic blocks and interior nodes represent a sequence
of consecutively executed basic blocks
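For the short trace 232324245, a grammar of the kind SEQUITUR produces can be written out by hand and checked by expanding it back into the trace. The rule names S, A, and B and the dictionary encoding are assumptions for illustration; the sketch only expands a given grammar and does not implement the SEQUITUR compressor itself.

```python
# Hand-written grammar for the trace "232324245":
#   S -> A A B B 5,  A -> 2 3,  B -> 2 4
# Keys are non-terminals (interior WPP nodes); any other symbol is a
# basic block id (a WPP leaf).
GRAMMAR = {
    "S": ["A", "A", "B", "B", "5"],
    "A": ["2", "3"],
    "B": ["2", "4"],
}


def expand(symbol, grammar):
    """Recursively expand a symbol into the basic-block sequence it covers."""
    if symbol not in grammar:
        return symbol  # terminal: a single basic block
    return "".join(expand(s, grammar) for s in grammar[symbol])


print(expand("S", GRAMMAR))  # 232324245
print(expand("A", GRAMMAR))  # 23
```

Each non-terminal corresponds to one interior node of the DAG, so the repeated block sequences 23 and 24 are each stored once.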
25Traversing WPP
- Hierarchically walk the WPP
- Starting from the WPP leaf nodes, identify all
patterns within each single basic block
- Walk upwards to the next higher level, identify
all patterns within each interior node (a
sequence of consecutively executed basic blocks)
- To avoid producing a large number of low-frequency
patterns, only walk through hot leaves and hot
interior nodes
- Walk up at most 3-4 levels, because useful
computation usually won't span many basic
blocks, and widely spread computation will be
hard for the compiler to implement
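Selecting the hot nodes for this walk can be sketched as below, assuming the WPP is stored as a grammar dictionary in which keys are interior nodes and all other symbols are basic blocks. The function names, the `min_count` threshold, and the example grammar (a hand-written one for the trace 232324245) are illustrative assumptions. Execution counts are propagated from the root downwards over the DAG, so the trace never has to be expanded.

```python
from collections import Counter


def node_heights(grammar):
    """Level of each WPP node: leaves are 0, a rule is 1 + its tallest child."""
    memo = {}

    def h(sym):
        if sym not in memo:
            if sym in grammar:
                memo[sym] = 1 + max(h(c) for c in grammar[sym])
            else:
                memo[sym] = 0
        return memo[sym]

    for sym in grammar:
        h(sym)
    return memo


def execution_counts(grammar, start, height):
    """How many times each node's block sequence occurs in the full trace."""
    counts = Counter({start: 1})
    # Parents are taller than their children, so visiting rules from the
    # tallest down means a rule's count is final before it is propagated.
    for sym in sorted(grammar, key=lambda s: height[s], reverse=True):
        for child in grammar[sym]:
            counts[child] += counts[sym]
    return counts


def hot_nodes(grammar, start, min_count, max_level):
    """Nodes to visit, bottom-up: frequent enough and not too high up."""
    height = node_heights(grammar)
    counts = execution_counts(grammar, start, height)
    order = sorted(height, key=lambda s: (height[s], s))
    return [s for s in order
            if counts[s] >= min_count and height[s] <= max_level]


GRAMMAR = {"S": ["A", "A", "B", "B", "5"], "A": ["2", "3"], "B": ["2", "4"]}
print(hot_nodes(GRAMMAR, "S", min_count=2, max_level=3))
```

The returned list is ordered leaves first, matching the bottom-up direction of the walk; pattern identification would then run on each listed node in turn.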
26Control flow constraint (Across basic block)
- Significant speedup (up to 148) using across
basic block patterns
- MIMO has more chance to improve performance
27Across basic block Crossing if/loop branch
- Either of the 2 cases can contribute considerably
- Predicated execution and loop unrolling can help
explore them
28Conclusion and implication
29Conclusion
- Reasonable constraints on resources and the number of
custom instructions won't affect performance
much
- With 4 inputs and 2 outputs, we can achieve most of the
performance speedup
- Significant improvement by relaxing control flow
constraints (using across basic block patterns)
30Implication
- The resources offered by moderate custom logic
hardware are almost enough for a single embedded
application
- Research on multi-tasking custom architectures
is needed
31Summary
- A feasible approach to identify computation
patterns on a program trace
- Patterns across basic block boundaries
- Evaluating and selecting the most valuable patterns
by ILP or heuristic methods
- A limit study on the potential benefit of
instruction-set extensible processors