Characterizing Embedded Applications for Instruction-Set Extensible Processors

1
Characterizing Embedded Applications for
Instruction-Set Extensible Processors
Speaker: Pan Yu
Pan Yu and Tulika Mitra
School of Computing, National University of Singapore
Republic of Singapore, 117543
June 10th, 2004
2
Instruction-set extensible processors

Typical architecture
Examples of commercially available instruction-set extensible processors

S5000
Xtensa
Nios II
3
Instruction-set extensible processors (cont'd)

A fragment of the program's dataflow graph is mapped to a CFU (Custom Functional Unit)
Higher performance (fewer cycles, less fetch/decode overhead)
Frees temporary registers
Reduces code size
Lowers energy consumption

4
Outline

Aim and motivation of this work
Methodology and results interpretation
Conclusions and Implications

5
Aim and Motivation of This Work
6
Constraints on Custom Instructions

Various technology or design constraints on custom instructions:
Number of input/output operands: limited number of ports to the register file; limited instruction encoding length; methodology-imposed constraints to keep the identification algorithm fast
Number of custom instructions: micro-architectural limits; limited number of free slots in the ISA
Area of custom logic: cost and complexity
Control flow boundary: most existing work operates within individual basic blocks

7
A Limit Study

What is the limit of the potential performance speedup of instruction-set extensible processors under relaxed design constraints?
How do different constraints affect the performance potential?
How much can we gain by relaxing the control flow constraint?

8
Methodology and Results Interpretation
9
Custom Instruction Identification

Sub-problem 1: Candidate pattern identification
MISO: Pozzi01, Goodwin03 and Cong04
MIMO: Arnold01, Baleani02, Clark03 and Atasu03
Disconnected components: Brisk02 and Atasu03
Exhaustive enumeration of all the valid MIMO subgraphs of the DFG (based on K. Atasu et al., DAC 2003)

[Figure: example subgraphs labeled Multiple Input Single Output (MISO), Multiple Input Multiple Output (MIMO), and simultaneously schedulable disconnected components]
10
Custom Instruction Identification (cont'd)

Sub-problem 2: Pattern evaluation and selection
Objective
Cover each original instruction in the code with zero or one custom instruction, such that:
1. The best program acceleration is achieved
2. When multiple patterns partially overlap, only one of them is selected
3. The design constraints are satisfied
Techniques
Arnold01: dynamic programming based
Lee02: ILP (integer linear programming) based
Clark03: heuristic based

11
Pattern evaluation and selection: ILP formulation

Isomorphic subgraphs
Are instances of the same pattern
Can be executed by the same CFU
Objective function: maximize total performance gain
s_ij: boolean variable, presence/absence of the j-th instance of the i-th pattern
F_ij: execution count of the corresponding instance
P_i: performance gain of the i-th pattern compared to s/w execution

[Figure: Pattern 1 with two isomorphic instances, Instance 1 and Instance 2]
12
Pattern evaluation and selection: ILP formulation

Covering exclusion constraint: each primitive operation is covered by at most one pattern
Area constraint: total area is not more than a predefined R
Number of custom instructions constraint: fewer than M custom instructions
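
The objective and the three constraints can be written out as one integer linear program. The formulas on the original slides did not survive in this transcript, so the block below is a reconstruction from the stated definitions and not the authors' exact notation. Three symbols are assumptions introduced here: a pattern-level boolean S_i (pattern i is implemented as a CFU), the area R_i of pattern i, and C(v), the set of instances that contain primitive operation v.

```latex
\begin{align*}
\text{maximize}\quad & \sum_{i}\sum_{j} P_i \, F_{ij} \, s_{ij} \\
\text{subject to}\quad
 & \sum_{(i,j)\,\in\, C(v)} s_{ij} \le 1
   && \text{for every primitive operation } v \\
 & s_{ij} \le S_i
   && \text{for every instance } j \text{ of pattern } i \\
 & \sum_{i} R_i \, S_i \le R \\
 & \sum_{i} S_i < M \\
 & s_{ij},\; S_i \in \{0, 1\}
\end{align*}
```

The linking constraint s_ij <= S_i is one way to charge the area of a pattern once, however many of its instances are selected, which matches the statement that isomorphic instances can share a CFU.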

13
Heuristic selection methods

ILP becomes intractable with too many variables and constraints
Heuristic methods
Rank each pattern instance with a priority function
Scan the ranked list once, greedily selecting the current highest-ranked pattern instance if it does not violate any constraint
Once a pattern instance is selected, discard the others that collide with it
Three priority functions
1. Performance/cost ratio
2. Software execution time
3. Performance gain
All three priority functions are applied at each design point, and the result with the highest performance gain is kept, since it is the nearest to the optimal solution given by the ILP method
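
The greedy procedure on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The `Instance` record, its field names, the weighting of each priority by execution count, and the rule that a pattern's area is charged only once are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    pattern: str       # pattern id (hypothetical naming)
    nodes: frozenset   # primitive operations covered by this instance
    exec_count: int    # F_ij, execution count of the instance
    gain: float        # P_i, gain of the pattern per execution
    area: float        # R_i, area of the pattern in adder equivalents
    sw_time: float     # software execution time of the pattern


# The three priority functions named on the slide. Weighting by the
# execution count is an assumption of this sketch.
PRIORITIES = {
    "gain_per_cost": lambda x: x.gain * x.exec_count / x.area,
    "software_time": lambda x: x.sw_time * x.exec_count,
    "gain": lambda x: x.gain * x.exec_count,
}


def greedy_select(instances, priority, max_area, max_patterns):
    """Scan the ranked list once and keep every instance that fits."""
    chosen, covered, patterns, area = [], set(), set(), 0.0
    for inst in sorted(instances, key=priority, reverse=True):
        if inst.nodes & covered:
            continue  # collides with an instance selected earlier
        if inst.pattern not in patterns:
            # A new pattern needs a new CFU: check both budgets.
            if len(patterns) + 1 > max_patterns or area + inst.area > max_area:
                continue
            patterns.add(inst.pattern)
            area += inst.area
        chosen.append(inst)
        covered |= inst.nodes
    return chosen


def total_gain(chosen):
    return sum(x.gain * x.exec_count for x in chosen)


def best_of_three(instances, max_area, max_patterns):
    """Try all three priority functions and keep the best result."""
    results = [greedy_select(instances, p, max_area, max_patterns)
               for p in PRIORITIES.values()]
    return max(results, key=total_gain)
```

Calling `best_of_three(instances, max_area, max_patterns)` at each design point mirrors the last bullet of the slide: the three rankings are cheap to compute, so running all of them and keeping the best costs little.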

14
Experiment Environment

SimpleScalar v3.0
A MIPS-like superscalar processor simulator
gcc 2.7.2.3 with -O3 optimization
Benchmarks: MiBench suite

15
Number of output operands: MISO vs. MIMO

Within basic blocks
Significant improvement from MISO to MIMO
Usually, 2 outputs are good enough

16
Number of input operands

4-5 inputs are quite good for both MISO and MIMO

17
Resource Impact

Resources equivalent to 25 adders are good enough for most benchmarks
(For djpeg, the biggest benchmark, resources equivalent to 200 adders suffice)

18
Total No. of custom instructions

Usually, 5 custom instructions are quite good

19
Breaking Basic Block Boundary Using Compact Trace

Break basic block boundaries
From frequently executed basic block sequences
Control flow graph with profiling: the accumulated information can only partially reveal the program's dynamic behavior
Program trace (Arnold01): huge in space; the same program region is traversed multiple times
Compact trace: reduces both space and time complexity considerably

20
Whole Program Path (WPP) using SEQUITUR compressor

SEQUITUR: an on-the-fly string compressor (Nevill-Manning97)
Input: a string
Output: a hierarchical context-free grammar representing the string
Repeated substrings are represented by the same non-terminal symbol in the grammar rules
Whole Program Path (WPP) (Larus99)
The string is the program path trace
The grammar is represented as a DAG
Good compression ratio
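
To make the idea of a grammar that stands for a trace concrete, here is a small Python sketch. It is deliberately not SEQUITUR: SEQUITUR builds its grammar on the fly while reading the input, whereas this sketch uses a simpler offline scheme (repeatedly replacing the most frequent adjacent pair, in the style of Re-Pair). The rule names `R1`, `R2`, ... and the start symbol `S` are assumptions of the example; basic blocks are given as integers so they cannot clash with rule names.

```python
from collections import Counter


def count_digrams(seq):
    """Count non-overlapping occurrences of each adjacent pair."""
    counts, last_pos = Counter(), {}
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        if last_pos.get(pair) == i - 1:
            continue  # overlaps the previous occurrence, e.g. in "aaa"
        counts[pair] += 1
        last_pos[pair] = i
    return counts


def replace_digram(seq, pair, symbol):
    """Replace each left-to-right occurrence of pair with symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def compress(trace):
    """Return grammar rules; rule "S" derives the whole trace."""
    rules, seq = {}, list(trace)
    while True:
        counts = count_digrams(seq)
        if not counts:
            break
        pair, n = counts.most_common(1)[0]
        if n < 2:
            break  # no pair repeats any more
        symbol = f"R{len(rules) + 1}"
        rules[symbol] = list(pair)
        seq = replace_digram(seq, pair, symbol)
    rules["S"] = seq
    return rules


def expand(rules, symbol="S"):
    """Expand a rule back into the sequence of basic blocks."""
    out = []
    for s in rules[symbol]:
        out.extend(expand(rules, s) if s in rules else [s])
    return out
```

Each rule body corresponds to an interior node of the DAG and each basic block to a leaf, so a repeated block sequence appears once in the grammar no matter how often it occurs in the trace.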

21
Whole Program Path (WPP) using SEQUITUR compressor

Our case
The alphabet is the set of basic blocks
An example of a short trace: 232324245
Advantage
A sequence of consecutively executed basic blocks is processed only once, on its interior node

[Figure: a DAG in which leaf nodes represent single basic blocks and interior nodes represent sequences of consecutively executed basic blocks]
22
Traversing WPP

Walk the WPP hierarchically
Starting from the WPP leaf nodes, identify all patterns within each single basic block
Walk upwards to the next higher level and identify all patterns within each interior node (a sequence of consecutively executed basic blocks)
Walk through only hot leaves and hot interior nodes
Walk up at most 3-4 levels

[Figure: WPP]
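
The hierarchical walk can be illustrated on a small grammar. The sketch below is an assumption-laden example, not the authors' tool: the grammar is hand-written for the trace 2 3 2 3 2 4 2 4 5 (the DAG on the original slide is not recoverable), and the hotness threshold `min_count` and the level cap `max_level` are hypothetical parameters.

```python
from collections import Counter

# Hand-written grammar for the trace 2 3 2 3 2 4 2 4 5 (an assumption).
# Integers are basic blocks (leaves); strings are interior nodes.
GRAMMAR = {
    "S": ["A", "A", "B", "B", 5],
    "A": [2, 3],
    "B": [2, 4],
}


def levels(grammar):
    """Level 0 for leaves; an interior node is one above its highest child."""
    memo = {}

    def level(sym):
        if sym not in grammar:
            return 0
        if sym not in memo:
            memo[sym] = 1 + max(level(c) for c in grammar[sym])
        return memo[sym]

    for sym in grammar:
        level(sym)
    return memo


def execution_counts(grammar, root="S"):
    """How often each node occurs in the full trace; each rule is visited once."""
    lv = levels(grammar)
    counts = Counter({root: 1})
    # Parents sit strictly above their children, so a node's count is
    # final by the time it is propagated downwards.
    for sym in sorted(grammar, key=lambda s: lv[s], reverse=True):
        for child in grammar[sym]:
            counts[child] += counts[sym]
    return counts


def block_sequence(grammar, sym):
    """The basic block sequence that a node stands for."""
    if sym not in grammar:
        return [sym]
    return [b for c in grammar[sym] for b in block_sequence(grammar, c)]


def hot_regions(grammar, root="S", min_count=2, max_level=4):
    """Hot leaves and hot interior nodes, lowest level first."""
    lv = levels(grammar)
    counts = execution_counts(grammar, root)
    regions = [(lv.get(sym, 0), sym, n, block_sequence(grammar, sym))
               for sym, n in counts.items()
               if n >= min_count and lv.get(sym, 0) <= max_level]
    return sorted(regions, key=lambda r: (r[0], str(r[1])))
```

Pattern identification would then run once per entry returned by `hot_regions`, first on the single blocks at level 0 and then on the multi-block sequences at the higher levels.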
23
Control flow constraint (across basic blocks)

Significant speedup (up to 148) using patterns that span basic blocks
MIMO has more chances to improve performance
The number of outputs may need to increase to 3

24
Number of Inputs and Area (across basic blocks)

Number of inputs
With a 3-output limit, 4-5 inputs suffice to achieve near-optimal speedup
Area
Still quite reasonable

25
Across basic blocks: crossing if/loop branches

Either of the two cases can contribute considerably
Predicated execution or loop unrolling can help

26
Conclusion and implication
27
Conclusion

Reasonable constraints on resources and on the number of custom instructions do not limit performance
With 5-input, 3-output operand constraints, we can achieve close to optimal speedup
Significant improvement from relaxing control flow constraints (using patterns that span basic blocks)

28
Implication

The resources offered by moderate custom logic hardware are almost enough for a single embedded application
Resource requirements for multi-tasking applications running on ISA-extensible processors need to be explored in the future

29
Additional Material
30
Latency and Area estimation

From the Synopsys synthesis tool
Timing estimation
Normalized against the MAC operation
P_i = T_SW - T_HW
T_HW: critical path latency (longest latency) of each pattern
T_SW: sum of the execution cycles of the primitive operations in the pattern
Area estimation (R_i)
Normalized against that of an adder
Sum over all operations in a pattern

Primitive Opr.   Area    Latency
Add              1       0.16
Sub              0.94    0.17
And              0.12    0.02
Or               0.12    0.02
Nor              0.06    0.03
Xor              0.18    0.03
Mult             17.3    0.73
MAC              21.4    1
Div              26.36   6.00
Ashift           3.03    0.42
Bshift           0.94    0.17
Cmp              0.37    0.13
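
The estimation rules on this slide can be applied mechanically to a pattern. The Python sketch below uses the area and latency values from the table and rests on several assumptions that the slide does not confirm: the gain is read as T_SW minus T_HW, every primitive operation costs one software cycle unless overridden, and the hardware latency is rounded up to whole cycles with one cycle equal to one MAC latency. The pattern encoding (a dict of node ids to operation names plus a list of edges) is hypothetical.

```python
import math

# (area in adder equivalents, latency in MAC equivalents), from the table.
OPS = {
    "add": (1.0, 0.16), "sub": (0.94, 0.17), "and": (0.12, 0.02),
    "or": (0.12, 0.02), "nor": (0.06, 0.03), "xor": (0.18, 0.03),
    "mult": (17.3, 0.73), "mac": (21.4, 1.0), "div": (26.36, 6.0),
    "ashift": (3.03, 0.42), "bshift": (0.94, 0.17), "cmp": (0.37, 0.13),
}


def pattern_area(ops):
    """R_i: sum of the areas of all operations in the pattern."""
    return sum(OPS[op][0] for op in ops.values())


def critical_path(ops, edges):
    """Longest accumulated latency along any path through the pattern."""
    preds = {n: [] for n in ops}
    for src, dst in edges:
        preds[dst].append(src)
    memo = {}

    def finish(n):
        if n not in memo:
            start = max((finish(p) for p in preds[n]), default=0.0)
            memo[n] = start + OPS[ops[n]][1]
        return memo[n]

    return max(finish(n) for n in ops)


def pattern_gain(ops, edges, sw_cycles=None):
    """P_i under the assumptions stated above (not confirmed by the slide)."""
    sw_cycles = sw_cycles or {}
    t_sw = sum(sw_cycles.get(op, 1) for op in ops.values())
    t_hw = math.ceil(critical_path(ops, edges))
    return t_sw - t_hw


# Hypothetical pattern: ((x + y) shifted) xor (u and v)
example_ops = {1: "add", 2: "bshift", 3: "xor", 4: "and"}
example_edges = [(1, 2), (2, 3), (4, 3)]
```

For the example pattern, `pattern_area(example_ops, )` style calls are not needed beyond the three functions shown: `pattern_area(example_ops)`, `critical_path(example_ops, example_edges)`, and `pattern_gain(example_ops, example_edges)` give the area, the hardware latency, and the estimated gain respectively.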
31
Candidate Pattern Identification

Enumerate all the MIMO subgraphs of the DFG under the constraints of:
Contain only arithmetic/logic operations
Number of input/output constraints
Convexity condition
An exhaustive enumeration of the existence/non-existence of each node (based on K. Atasu et al., DAC 2003)
Enumeration in reverse topologically sorted order of nodes
The search space can be viewed as an N-level binary tree (where N is the total number of nodes in the DFG)
Prune the search space when the convexity condition and/or the number-of-outputs constraint is violated
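
The enumeration described above can be sketched as a recursive binary search over the nodes. This is an illustrative sketch in the spirit of the cited approach, not a reproduction of it. The graph encoding is an assumption: `succs` maps every node (including sinks, with an empty list) to its consumers, and input values and disallowed operations such as memory accesses are modeled as nodes outside `valid`. Treating a node without consumers as an output is also an assumption. It uses `graphlib` from the Python standard library (3.9 or later).

```python
from graphlib import TopologicalSorter


def enumerate_patterns(succs, valid, max_in, max_out):
    """Return all convex subgraphs of valid nodes within the I/O limits."""
    preds = {n: [] for n in succs}
    for n, consumers in succs.items():
        for s in consumers:
            preds[s].append(n)
    # Reverse topological order: every consumer is decided before its producer.
    order = list(TopologicalSorter(preds).static_order())[::-1]
    results, chosen = [], set()
    reach = {}  # excluded node -> True if a chosen node is reachable from it

    def search(k, n_out):
        if k == len(order):
            # Inputs can shrink as producers are added, so count them last.
            n_in = len({p for n in chosen for p in preds[n] if p not in chosen})
            if chosen and n_in <= max_in:
                results.append(frozenset(chosen))
            return
        n = order[k]
        # Branch 0: node n is absent from the subgraph.
        reach[n] = any(s in chosen or reach[s] for s in succs[n])
        search(k + 1, n_out)
        del reach[n]
        # Branch 1: node n is present, if it is a permitted operation.
        if n not in valid:
            return
        outside = [s for s in succs[n] if s not in chosen]
        if any(reach[s] for s in outside):
            return  # a path would leave and re-enter the subgraph: not convex
        if outside or not succs[n]:
            n_out += 1  # the value of n is used outside the subgraph
        if n_out > max_out:
            return  # outputs never decrease deeper in the tree, so prune
        chosen.add(n)
        search(k + 1, n_out)
        chosen.remove(n)

    search(0, 0)
    return results
```

Processing consumers first is what makes the two pruning rules safe: when a node is added, all of its descendants are already decided, so a convexity or output violation found at that point cannot be repaired further down the search tree.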