Title: Characterizing Embedded Applications for Instruction-Set Extensible Processors
1 Characterizing Embedded Applications for Instruction-Set Extensible Processors
Speaker: Pan Yu
Pan Yu and Tulika Mitra
School of Computing, National University of Singapore
Republic of Singapore, 117543
June 10th, 2004
2 Instruction-set extensible processors
- Typical architecture
- Examples of commercially available instruction-set extensible processors
  - S5000
  - Xtensa
  - Nios II
3 Instruction-set extensible processors (contd)
- A fragment of the program's dataflow graph is mapped to a CFU (Custom Functional Unit)
  - Higher performance (fewer cycles, less fetch/decode overhead)
  - Frees temporary registers
  - Reduces code size
  - Lowers energy consumption
4 Outline
- Aim and motivation of this work
- Methodology and results interpretation
- Conclusions and implications
5 Aim and Motivation of This Work
6 Constraints on Custom Instructions
- Various technology or design constraints on custom instructions
  - Number of input/output operands
    - Limited number of ports to the register file
    - Limited instruction encoding length
    - Methodology-imposed constraint to achieve a fast algorithm
  - Number of custom instructions
    - Micro-architectural limits
    - Limited number of free slots in the ISA
  - Area of custom logic
    - Cost and complexity
  - Control flow boundary
    - Most works operate within individual basic blocks
7 A Limit Study
- What is the limit of potential performance speedup using instruction-set extensible processors under relaxed design constraints?
- How will different constraints impact the performance potential?
- How much can we gain by relaxing the control flow constraint?
8 Methodology and Results Interpretation
9 Custom Instruction Identification
- Sub-problem 1: Candidate pattern identification
  - MISO: Pozzi01, Goodwin03 and Cong04
  - MIMO: Arnold01, Baleani02, Clark03 and Atasu03
  - Disconnected components: Brisk02 and Atasu03
- Exhaustive enumeration of all the valid MIMO subgraphs of the DFG (based on K. Atasu et al., DAC 2003)
[Figure: example subgraphs: Multiple Input Single Output (MISO), Multiple Input Multiple Output (MIMO), and simultaneously schedulable disconnected components]
10 Custom Instruction Identification (contd)
- Sub-problem 2: Pattern evaluation and selection
- Objective
  - Cover each original instruction in the code with zero or one custom instruction, such that
    1. The best program acceleration is achieved
    2. When multiple patterns partially overlap, only one is selected
    3. The design constraints are satisfied
- Techniques
  - Arnold01: dynamic programming based
  - Lee02: ILP (integer linear programming) based
  - Clark03: heuristic based
11 Pattern evaluation and selection: ILP formulation
- Isomorphic subgraphs
  - Are instances of the same pattern
  - Can be executed by the same CFU
- Objective function: maximize total performance gain
  - s_ij: boolean variable, presence/absence of the j-th instance of the i-th pattern
  - F_ij: execution count of the corresponding instance
  - P_i: performance gain of the i-th pattern compared to software execution
[Figure: Pattern 1 and its two isomorphic instances, Instance 1 and Instance 2; node labels "--" and "<<"]
12 Pattern evaluation and selection: ILP formulation
- Covering exclusion constraint
  - Each primitive operation is covered by at most one pattern
- Area constraint
  - Area is not more than a predefined R
- Number of custom instructions constraint
  - Fewer than M custom instructions
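The objective and the three constraints can be collected into one program. This is a reconstruction from the definitions on these two slides, not necessarily the authors' exact notation. The symbols $p_i$ (a boolean that is 1 when pattern $i$ is implemented as a CFU), $R_i$ (area of pattern $i$) and $\mathrm{ops}(i,j)$ (the set of primitive operations covered by instance $j$ of pattern $i$) are assumptions introduced here, as is the choice to charge the area of a pattern once regardless of how many of its instances are selected.

```latex
\begin{align*}
\text{maximize}\quad
  & \sum_{i}\sum_{j} P_i \, F_{ij} \, s_{ij} \\
\text{subject to}\quad
  & \sum_{(i,j)\,:\,v \,\in\, \mathrm{ops}(i,j)} s_{ij} \;\le\; 1
      && \text{for every primitive operation } v
      && \text{(covering exclusion)} \\
  & s_{ij} \;\le\; p_i
      && \text{for all } i, j
      && \text{(an instance needs its CFU)} \\
  & \sum_{i} R_i \, p_i \;\le\; R
      && && \text{(area)} \\
  & \sum_{i} p_i \;<\; M
      && && \text{(number of custom instructions)} \\
  & s_{ij},\; p_i \in \{0, 1\}
\end{align*}
```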
13 Heuristic selection methods
- ILP is intractable with too many variables and constraints
- Heuristic methods
  - Rank each pattern instance with a priority function
  - Scan the ranked list once, greedily selecting the current highest-ranked pattern instance if it does not violate any constraint
  - Given a selected pattern instance, throw out others that collide with it
- Three priority functions
  1. Performance/cost ratio
  2. Software execution time
  3. Performance gain
- All three priority functions are applied at each design point; the result with the highest performance gain is kept, as it is the nearest to the optimal solution given by the ILP method
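The greedy scan described on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Instance` record, its field names, and the reading of the three priority functions (each weighted by execution count) are assumptions. So are the decisions to charge a pattern's area once and to treat `max_patterns` as an inclusive budget.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List


@dataclass(frozen=True)
class Instance:
    """One occurrence of a candidate pattern (hypothetical record layout)."""
    pattern: str          # pattern (CFU) this instance belongs to
    ops: FrozenSet[int]   # ids of the primitive operations it covers
    gain: float           # cycles saved per execution
    count: int            # execution count of this instance
    sw_time: float        # software cycles per execution
    area: float           # area of the pattern, assumed > 0


# The three priority functions, as assumed readings of the slide.
def ratio_priority(inst: Instance) -> float:
    return inst.gain * inst.count / inst.area


def sw_time_priority(inst: Instance) -> float:
    return inst.sw_time * inst.count


def gain_priority(inst: Instance) -> float:
    return inst.gain * inst.count


def greedy_select(instances: List[Instance],
                  priority: Callable[[Instance], float],
                  max_area: float, max_patterns: int) -> List[Instance]:
    """Scan the ranked list once, keeping every instance that fits."""
    chosen: List[Instance] = []
    covered = set()      # operations already covered by a chosen instance
    pattern_area = {}    # implemented patterns and their area
    for inst in sorted(instances, key=priority, reverse=True):
        if inst.ops & covered:
            continue     # collides with an earlier selection
        if inst.pattern not in pattern_area:
            if len(pattern_area) + 1 > max_patterns:
                continue
            if sum(pattern_area.values()) + inst.area > max_area:
                continue
            pattern_area[inst.pattern] = inst.area
        covered |= inst.ops
        chosen.append(inst)
    return chosen


def total_gain(selection: List[Instance]) -> float:
    return sum(inst.gain * inst.count for inst in selection)


def best_selection(instances: List[Instance],
                   max_area: float, max_patterns: int) -> List[Instance]:
    """Run all three priority functions and keep the best result."""
    runs = [greedy_select(instances, p, max_area, max_patterns)
            for p in (ratio_priority, sw_time_priority, gain_priority)]
    return max(runs, key=total_gain)


# Made-up design point: B1 overlaps A1 on operation 3.
demo = [
    Instance("A", frozenset({1, 2, 3}), 2, 100, 3, 4),
    Instance("A", frozenset({7, 8, 9}), 2, 50, 3, 4),
    Instance("B", frozenset({3, 4}), 1, 120, 2, 1),
    Instance("C", frozenset({5, 6}), 1, 90, 2, 2),
]
best = best_selection(demo, max_area=5, max_patterns=2)
```

On this made-up data the ratio-based ranking picks the cheap pattern B first and blocks the more profitable pattern A, which shows why running all three rankings and keeping the best can matter.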
14 Experiment Environment
- SimpleScalar v3.0
  - A MIPS-like superscalar processor simulator
- gcc 2.7.2.3 with -O3 optimization
- Benchmarks: MiBench suite
15 Number of output operands: MISO vs MIMO
- Within basic blocks
- Significant improvement going from MISO to MIMO
- Usually, 2 outputs are good enough
16 Number of input operands
- 4 to 5 inputs are quite good for both MISO and MIMO
17 Resource Impact
- Resources equivalent to 25 adders are good enough for most benchmarks
  - (Djpeg, the biggest benchmark: resources equivalent to 200 adders will suffice)
18 Total No. of custom instructions
- Usually, 5 custom instructions are quite good
19 Breaking Basic Block Boundary Using Compact Trace
- Break basic block boundaries
  - From frequently executed basic block sequences
- Control flow graph with profiling
  - Accumulated information can only partially deduce the program's dynamic behavior
- Program trace
  - Arnold01
  - Huge in space
  - Traverses the same program region multiple times
- Compact trace
  - Reduces both space and time complexity considerably
20 Whole Program Path (WPP) using SEQUITUR compressor
- SEQUITUR: an on-the-fly string compressor, Nevill-Manning97
  - Input: a string
  - Output: a hierarchical context-free grammar representing the string
  - Repeated sequences of substrings are represented by the same non-terminal symbol of the grammar rules
- Whole Program Path (WPP), Larus99
  - The string is the program path trace
  - The grammar is represented as a DAG
  - Good compression ratio
21 Whole Program Path (WPP) using SEQUITUR compressor
- Our case
  - On the alphabet of basic blocks
  - An example of a short trace: 232324245
- Advantage
  - Process the same consecutively executed basic block sequence on the interior node only once
A DAG, in which leaf nodes represent single basic blocks and interior nodes represent a sequence of consecutively executed basic blocks
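To make the trace-to-grammar idea concrete, here is a small sketch that turns the example trace into a hierarchical grammar. It is deliberately not SEQUITUR. SEQUITUR works on-the-fly and maintains digram-uniqueness and rule-utility invariants, whereas this sketch uses a simpler offline, Re-Pair-style pass that repeatedly replaces the most frequent adjacent pair. The rule names `R0`, `R1`, ... and the root name `S` are assumptions and must not clash with basic block ids.

```python
from collections import Counter
from typing import Dict, List, Sequence


def build_grammar(trace: Sequence[str]) -> Dict[str, List[str]]:
    """Replace the most frequent adjacent pair with a new rule until
    no pair occurs twice. The root rule is stored under "S"."""
    seq = list(trace)
    rules: Dict[str, List[str]] = {}
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:
            break
        name = f"R{len(rules)}"
        rules[name] = list(pair)
        # Rewrite the sequence left to right, skipping overlapping matches.
        rewritten, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                rewritten.append(name)
                i += 2
            else:
                rewritten.append(seq[i])
                i += 1
        seq = rewritten
    rules["S"] = seq
    return rules


def expand(symbol: str, rules: Dict[str, List[str]]) -> List[str]:
    """Expand a grammar symbol back into its basic block sequence."""
    if symbol not in rules:
        return [symbol]  # a leaf: a single basic block
    blocks: List[str] = []
    for child in rules[symbol]:
        blocks.extend(expand(child, rules))
    return blocks


grammar = build_grammar("232324245")
for name, body in grammar.items():
    print(name, "->", " ".join(body))
```

Each rule of the resulting grammar corresponds to an interior node of the DAG, and each basic block id to a leaf. Expanding the root rule reproduces the original trace.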
22 Traversing WPP
- Hierarchically walk on the WPP
  - Starting from the WPP leaf nodes, identify all patterns within each single basic block
  - Walk upwards to the next higher level, identify all patterns within each interior node (a sequence of consecutively executed basic blocks)
  - Walk through only hot leaves and hot interior nodes
  - Walk up at most 3 to 4 levels
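The bottom-up walk over hot nodes can be sketched as follows. This is an assumption-laden illustration: the grammar `WPP` is one possible grammar for the trace 232324245 written out by hand, "hot" is taken to mean an execution count of at least `min_count`, and the level of a node is taken to be its height above the leaves. The pattern identification step itself is left out. The sketch only lists the block sequences that would be handed to it.

```python
from typing import Dict, List, Tuple

Grammar = Dict[str, List[str]]

# Hand-written grammar for the trace 232324245 (hypothetical example).
WPP: Grammar = {
    "S": ["R0", "R0", "R1", "R1", "5"],
    "R0": ["2", "3"],
    "R1": ["2", "4"],
}


def children_first(rules: Grammar, root: str) -> List[str]:
    """Post-order over the DAG: every node appears after its children."""
    order, seen = [], set()

    def visit(sym: str) -> None:
        if sym in seen:
            return
        seen.add(sym)
        for child in rules.get(sym, []):
            visit(child)
        order.append(sym)

    visit(root)
    return order


def expand(sym: str, rules: Grammar, memo: Dict[str, List[str]]) -> List[str]:
    """Basic block sequence of a node, computed once per node."""
    if sym not in memo:
        kids = rules.get(sym, [])
        memo[sym] = ([b for k in kids for b in expand(k, rules, memo)]
                     if kids else [sym])
    return memo[sym]


def hot_nodes(rules: Grammar, root: str = "S", min_count: int = 2,
              max_level: int = 3) -> List[Tuple[str, int, List[str]]]:
    order = children_first(rules, root)
    # Execution counts: propagate from parents to children.
    counts = dict.fromkeys(order, 0)
    counts[root] = 1
    for sym in reversed(order):
        for child in rules.get(sym, []):
            counts[child] += counts[sym]
    # Levels: leaves are level 0, an interior node sits above its children.
    levels: Dict[str, int] = {}
    for sym in order:
        kids = rules.get(sym, [])
        levels[sym] = 1 + max(levels[k] for k in kids) if kids else 0
    memo: Dict[str, List[str]] = {}
    picked = [s for s in order
              if counts[s] >= min_count and levels[s] <= max_level]
    picked.sort(key=lambda s: (levels[s], s))  # walk upwards level by level
    return [(s, counts[s], expand(s, rules, memo)) for s in picked]


for node, count, blocks in hot_nodes(WPP, max_level=1):
    print(node, count, blocks)
```

Because every interior node is visited once, a block sequence that is executed many times in the trace is analysed only once, which is the advantage claimed for the compact trace.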
23 Control flow constraint (across basic blocks)
- Significant speedup (up to 148) using across-basic-block patterns
- MIMO has more chance to improve performance
- Number of outputs may need to increase to 3
24 Number of inputs and area (across basic blocks)
- Number of inputs
  - Under 3 outputs
  - 4 to 5 suffice to achieve near optimal
- Area
  - Still quite reasonable
25 Across basic blocks: crossing if/loop branches
- Either of the two cases can contribute considerably
- Predicated execution or loop unrolling can help
26 Conclusion and implication
27 Conclusion
- Reasonable constraints on the amount of resources and the number of custom instructions do not limit performance
- With 5-input, 3-output operands, we can achieve close to optimal speedup
- Significant improvement by relaxing control flow constraints (using across-basic-block patterns)
28 Implication
- The resources offered by moderate custom logic hardware are almost enough for a single embedded application
- Resource requirements for multi-tasking applications running on ISA-extensible processors need to be explored in the future
29 Additional Material
30 Latency and Area estimation
- From the Synopsys synthesis tool
- Timing estimation
  - Normalized against the MAC operation
  - P_i = T_SW - T_HW
  - T_HW: critical path latency (longest latency) of each pattern
  - T_SW: summation of the execution cycles of each primitive operation
- Area estimation (R_i)
  - Normalized against that of an adder
  - Sum up all operations in a pattern
Primitive Opr.   Latency (normalized to MAC)   Area (normalized to adder)
Add              0.16                          1
Sub              0.17                          0.94
And              0.02                          0.12
Or               0.02                          0.12
Nor              0.03                          0.06
Xor              0.03                          0.18
Mult             0.73                          17.3
MAC              1                             21.4
Div              6.00                          26.36
Ashift           0.42                          3.03
Bshift           0.17                          0.94
Cmp              0.13                          0.37
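The estimation rules on this slide can be sketched with the table values. Several points are assumptions made for illustration only: the gain is taken as software cycles minus hardware cycles, one MAC latency is taken to be one processor cycle, the hardware latency is rounded up to whole cycles, and every primitive operation costs one cycle in software unless `sw_cycles` says otherwise. The example pattern `OPS`/`EDGES` is made up.

```python
import math
from typing import Dict, List, Optional, Tuple

# Values from the table: latency normalized to MAC, area normalized to an adder.
LATENCY = {"add": 0.16, "sub": 0.17, "and": 0.02, "or": 0.02, "nor": 0.03,
           "xor": 0.03, "mult": 0.73, "mac": 1.0, "div": 6.00,
           "ashift": 0.42, "bshift": 0.17, "cmp": 0.13}
AREA = {"add": 1.0, "sub": 0.94, "and": 0.12, "or": 0.12, "nor": 0.06,
        "xor": 0.18, "mult": 17.3, "mac": 21.4, "div": 26.36,
        "ashift": 3.03, "bshift": 0.94, "cmp": 0.37}


def critical_path(ops: Dict[int, str], edges: List[Tuple[int, int]]) -> float:
    """Longest accumulated latency over any path of the pattern (a DAG)."""
    preds: Dict[int, List[int]] = {n: [] for n in ops}
    for src, dst in edges:
        preds[dst].append(src)
    finish: Dict[int, float] = {}

    def finish_time(n: int) -> float:
        if n not in finish:
            start = max((finish_time(p) for p in preds[n]), default=0.0)
            finish[n] = start + LATENCY[ops[n]]
        return finish[n]

    return max(finish_time(n) for n in ops)


def estimate(ops: Dict[int, str], edges: List[Tuple[int, int]],
             sw_cycles: Optional[Dict[str, int]] = None
             ) -> Tuple[int, int, float]:
    """Return (T_SW, T_HW, area) for one pattern."""
    sw_cycles = sw_cycles or {}
    t_sw = sum(sw_cycles.get(op, 1) for op in ops.values())  # assumed 1 cycle each
    t_hw = math.ceil(critical_path(ops, edges))              # assumed rounding
    area = sum(AREA[op] for op in ops.values())
    return t_sw, t_hw, area


# Made-up pattern: a subtract feeding a shift feeding an add.
OPS = {0: "sub", 1: "bshift", 2: "add"}
EDGES = [(0, 1), (1, 2)]
t_sw, t_hw, area = estimate(OPS, EDGES)
gain = t_sw - t_hw  # assumed definition of the performance gain
```

Chaining three cheap operations illustrates where the gain comes from: their combined hardware latency still fits within the latency of one MAC, while software needs one instruction per operation.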
31 Candidate Pattern Identification
- Enumerate all the MIMO subgraphs of the DFG under the constraints of
  - Contain only arithmetic/logic operations
  - Number of input/output constraints
  - Convexity condition
- An exhaustive enumeration of the existence/non-existence of each node (based on K. Atasu et al., DAC 2003)
  - Enumeration in reverse topologically sorted order of nodes
  - Search space can be viewed as an N-level binary tree (where N is the total number of nodes in the DFG)
  - Prune the search space when the convexity condition and/or the number-of-outputs constraint is violated
[Figure: example of a non-convex subgraph]
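The binary-tree search described above can be sketched as follows. This is a simplified illustration in the spirit of the cited enumeration, not a reproduction of it. The DFG encoding (a dict from node to its operands, where names that are not keys are live-in values), the `forbidden` set for non-arithmetic operations, and the rule that a node with no consumer in the DFG counts as an output are all assumptions.

```python
from typing import Dict, FrozenSet, List


def enumerate_patterns(preds: Dict[str, List[str]], max_in: int, max_out: int,
                       forbidden: FrozenSet[str] = frozenset()
                       ) -> List[FrozenSet[str]]:
    """All non-empty convex subgraphs that meet the input/output limits."""
    succs: Dict[str, List[str]] = {n: [] for n in preds}
    for node, operands in preds.items():
        for p in operands:
            if p in succs:
                succs[p].append(node)

    # Reverse topological order: every node comes after all its consumers.
    order: List[str] = []
    seen = set()

    def visit(n: str) -> None:
        if n in seen:
            return
        seen.add(n)
        for s in succs[n]:
            visit(s)
        order.append(n)

    for n in preds:
        visit(n)

    results: List[FrozenSet[str]] = []
    chosen = set()
    reach: Dict[str, bool] = {}  # reach[n]: a chosen node is reachable from n

    def search(k: int, n_out: int) -> None:
        if k == len(order):
            if chosen:
                inputs = {p for n in chosen for p in preds[n]
                          if p not in chosen}
                if len(inputs) <= max_in:
                    results.append(frozenset(chosen))
            return
        n = order[k]
        reach[n] = any(s in chosen or reach[s] for s in succs[n])
        search(k + 1, n_out)  # branch 1: node n is absent
        if n in forbidden:
            return
        # Convexity: no path may leave the subgraph and re-enter it.
        if any(s not in chosen and reach[s] for s in succs[n]):
            return
        # All consumers are already decided, so output status is final.
        is_output = int(not succs[n] or any(s not in chosen for s in succs[n]))
        if n_out + is_output > max_out:
            return
        chosen.add(n)         # branch 2: node n is present
        search(k + 1, n_out + is_output)
        chosen.remove(n)

    search(0, 0)
    return results


# Made-up diamond-shaped DFG: a feeds b and c, which both feed d.
DFG = {"a": ["x", "y"], "b": ["a", "z"], "c": ["a"], "d": ["b", "c"]}
found = enumerate_patterns(DFG, max_in=4, max_out=2)
```

Visiting consumers before producers is what makes the pruning safe: once a node is added, both its output status and any convexity violation are final, so the whole subtree below a violating choice can be discarded. The input count can still shrink as producers are added, so it is only checked at the leaves.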
32 Conclusion
- Evaluating and selecting the most valuable patterns by ILP or heuristic methods
- A limit study on the potential benefit of instruction-set extensible processors
- A feasible approach to identify computation patterns on program traces
  - Patterns across basic block boundaries