Characterizing Embedded Applications for Instruction-Set Extensible Processors

1
Characterizing Embedded Applications for
Instruction-Set Extensible Processors

By Pan Yu at embedded system seminar
Mar. 11th, 2004

2
Overview

What is the limit of the potential performance
speedup from instruction-set extensible
processors?
How do different constraints restrict the
performance potential?
By relaxing the control flow constraint, can we gain much?

3
Agenda

Background introduction about custom architecture
Aim and motivation of this work
Methodology and results interpretation
Implications and conclusion

4
Background Introduction about Custom Architecture
5
The efficiency of HW

8 years ago
When the CPU is not sufficient, use HW
Efficiency of HW

In the age of 486 processors
200 MPEG-I decompression card!
6
Energy Efficiency

The same amount of energy
HW: read data from memory, do computation, store
data back
SW: the same as the above, plus...
Read instruction from the memory,
decode instruction,
schedule and issue instruction,
prefetch instruction,
pipeline flushing

4 hours of MPEG-2 decoding on a 900 mAh battery
using dedicated decoding HW
OR
2 hours of MPEG-2 decoding on a 3800 mAh battery
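
As a rough worked comparison of the two battery figures above: assuming both batteries run at the same voltage (not stated on the slide) and that the 3800 mAh figure refers to decoding without dedicated hardware (implied, not stated), the average current draw in each case is:

```latex
% Assumption: equal battery voltage, so average current is a proxy for power.
I_{\mathrm{HW}} = \frac{900\ \mathrm{mAh}}{4\ \mathrm{h}} = 225\ \mathrm{mA},
\qquad
I_{\mathrm{other}} = \frac{3800\ \mathrm{mAh}}{2\ \mathrm{h}} = 1900\ \mathrm{mA},
\qquad
\frac{I_{\mathrm{other}}}{I_{\mathrm{HW}}} = \frac{1900}{225} \approx 8.4
```

Under these assumptions, the dedicated decoder draws about 8.4 times less current for the same task.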
7
Comparison between HW and SW

SW
Highly adaptive
Suitable for control-oriented computation
Lower throughput, higher energy consumption
HW
Highly specialized (not adaptive)
Suitable for computation-intensive tasks
Higher throughput, lower energy consumption
Custom architecture
Easy to design (adaptive, flexible)
Higher performance, lower energy consumption

8
Typical instruction-set extensible processor

A fragment of the program's data dependence graph
is mapped to a CFU (Custom Functional Unit)
Higher performance, lower energy consumption
Frees up temporary registers
Reduces code size
CFUs are exposed to the software as ISA extensions

9
Aim and Motivation of This Work
10
Sub-questions in automating custom instruction
identification

What topology of data dependence subgraph can
form a custom instruction?
How to identify all potential subgraphs
efficiently?
How to evaluate and pick out the ones that are
most useful in terms of performance speedup or
energy efficiency, under different design
constraints?

Multiple-Input Single-Output (MISO)
Multiple-Input Multiple-Output (MIMO)
Simultaneously schedulable
11
A Limit Study

Questions about custom instruction identification
should include how limiting the different
constraints are
Number of operands
Number of custom instructions
Area for custom logic
Control flow constraint
Control flow constraint
Previous work identifies and evaluates custom
instructions within individual basic blocks.
This simplifies compiler support, but restricts
the potential performance improvement
We use a trace-based method to break the control
flow constraint

12
Three questions to answer

What is the limit of the potential performance
speedup from instruction-set extensible
processors?
How do different constraints restrict the
performance potential?
By relaxing the control flow constraint, can we gain much?

13
Methodology and Results Interpretation
14
Pattern Identification

Enumerate all the MIMO subgraphs of the DDG
Contain only arithmetic operations
Constraints on the number of inputs/outputs
Convexity condition
An exhaustive enumeration of the existence/non-existence
of each node
Number all the nodes by a reverse topological
sort
Enumeration goes from lower-numbered nodes to
higher-numbered ones
The search space can be viewed as an N-level
binary tree (where N is the total number of nodes
in the DDG)

[Figure: a non-convex subgraph]
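
The enumeration described on this slide can be sketched as follows. This is a minimal illustration, not the presenter's implementation. It assumes the DDG is a dict mapping every node to its list of consumers, that nodes with no consumers are live-out values, and that non-arithmetic nodes are passed in a hypothetical `forbidden` set. Because nodes are decided in reverse topological order, a convexity or output-count violation can never be repaired deeper in the tree, so the branch can be cut right there. The input count is only checked once a subset is complete.

```python
def enumerate_patterns(succs, forbidden=frozenset(), max_in=4, max_out=2):
    """Enumerate convex subgraphs of a DDG under input/output limits.

    succs: dict node -> list of consumer nodes (every node must be a key).
    forbidden: nodes that may not be part of a pattern (assumed: non-arithmetic ops).
    """
    preds = {n: set() for n in succs}
    for n, consumers in succs.items():
        for s in consumers:
            preds[s].add(n)

    # Reverse topological order: every node appears after all of its consumers.
    order, seen = [], set()

    def visit(n):
        if n in seen:
            return
        seen.add(n)
        for s in succs[n]:
            visit(s)
        order.append(n)

    for n in succs:
        visit(n)

    found = []

    def search(i, chosen, tainted, n_out):
        # tainted: excluded nodes from which some chosen node is reachable.
        if i == len(order):
            if chosen:
                inputs = {p for n in chosen for p in preds[n]} - chosen
                if len(inputs) <= max_in:
                    found.append(chosen)
            return
        n = order[i]
        # Branch "node exists in the pattern".
        if n not in forbidden:
            # All consumers are already decided, so output status is final.
            is_output = not succs[n] or any(s not in chosen for s in succs[n])
            # A tainted consumer means a path leaves and re-enters the pattern.
            convex = not any(s in tainted for s in succs[n])
            if convex and n_out + is_output <= max_out:
                search(i + 1, chosen | {n}, tainted, n_out + is_output)
        # Branch "node does not exist in the pattern".
        reaches = any(s in chosen or s in tainted for s in succs[n])
        search(i + 1, chosen, tainted | {n} if reaches else tainted, n_out)

    search(0, frozenset(), frozenset(), 0)
    return found
```

For a three-node graph `{"a": ["b", "c"], "b": ["c"], "c": []}`, the subset `{a, c}` is rejected as non-convex because the path from `a` to `c` through `b` leaves the pattern.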
15
Pattern Identification (cont'd)

Prune the search space when the convexity
condition or the output-count constraint is violated

Example of identifying patterns that contain at
most 2 outputs
Non-convex. Conflict. Prune the sub-search space.
3 outputs. Conflict. Prune the sub-search space.
16
Pattern evaluation and selection

Objective
Cover each original instruction in the code with
zero or one custom instruction, so as to
1. Achieve the best program acceleration
2. Select only one pattern when multiple patterns
partially overlap
3. Respect the different design constraints

17
Pattern evaluation and selection: ILP formulation

Isomorphic subgraphs are instances of the same
pattern, which can use the same CFU
Objective function
Maximize total performance gain
Covering exclusion constraint
Area constraint
Number of custom instructions constraint
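
The slide's equations did not survive in the transcript. One plausible way to write the objective and constraints listed above is sketched below. All symbols are assumptions introduced for illustration: $I$ is the set of pattern instances, $P$ the set of patterns, $p(i)$ the pattern of instance $i$, $V$ the set of DDG nodes, $g_i$ the performance gain of instance $i$, $a_p$ the area of the CFU for pattern $p$, $A$ the area budget, and $K$ the maximum number of custom instructions.

```latex
% Hypothetical formulation; not taken from the original slides.
\begin{aligned}
\text{maximize}\quad & \sum_{i \in I} g_i \, x_i
  && \text{(total performance gain)} \\
\text{subject to}\quad
  & \sum_{i \in I \,:\, v \in i} x_i \le 1 \quad \forall v \in V
  && \text{(covering exclusion)} \\
  & x_i \le y_{p(i)} \quad \forall i \in I
  && \text{(an instance needs its pattern's CFU)} \\
  & \sum_{p \in P} a_p \, y_p \le A
  && \text{(area)} \\
  & \sum_{p \in P} y_p \le K
  && \text{(number of custom instructions)} \\
  & x_i \in \{0,1\},\quad y_p \in \{0,1\}
\end{aligned}
```

Here the area is charged once per pattern rather than once per instance, which reflects the statement that isomorphic subgraphs can use the same CFU.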

18
Heuristic selection methods

ILP is intractable when there are too many
variables and constraints
Heuristic methods
Rank each pattern instance with a priority
function
Scan the ranked list once, greedily selecting the
current highest-ranked pattern instance if it does not
violate the constraints
If selected, discard the instances that collide with
the current one
3 priority functions
1. Performance/cost ratio
2. Software execution time
3. Performance gain
The 3 heuristic methods, one per priority function,
are all run, and the result with the highest total
performance gain is used, as it is the nearest
to the optimal solution given by the ILP method
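
The greedy scan above can be sketched as follows. This is only an illustration under stated assumptions: the `Instance` fields, the use of area as the "cost", and the rule that area is charged once per pattern are all hypothetical choices, not details taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    pattern: str       # hypothetical pattern id shared by isomorphic instances
    nodes: frozenset   # DDG nodes covered by this instance
    gain: float        # performance gain if implemented as a custom instruction
    sw_time: float     # software execution time of the covered nodes
    area: float        # assumed cost of the pattern's CFU (must be > 0)


# The three priority functions named on the slide.
PRIORITIES = {
    "gain_per_cost": lambda c: c.gain / c.area,
    "sw_time": lambda c: c.sw_time,
    "gain": lambda c: c.gain,
}


def greedy_select(cands, priority, max_area, max_patterns):
    """Scan the ranked list once, taking each instance that still fits."""
    covered, used_area, chosen = set(), {}, []
    for c in sorted(cands, key=priority, reverse=True):
        if covered & c.nodes:
            continue  # collides with an already selected instance
        is_new = c.pattern not in used_area
        if is_new and len(used_area) >= max_patterns:
            continue  # would exceed the number of custom instructions
        if is_new and sum(used_area.values()) + c.area > max_area:
            continue  # would exceed the area budget
        used_area[c.pattern] = c.area
        covered |= c.nodes
        chosen.append(c)
    return chosen


def best_of_heuristics(cands, max_area, max_patterns):
    """Run all three heuristics and keep the one with the highest total gain."""
    results = [greedy_select(cands, p, max_area, max_patterns)
               for p in PRIORITIES.values()]
    return max(results, key=lambda sel: sum(c.gain for c in sel))
```

Skipping colliding instances during the scan has the same effect as discarding them from the list at the moment a pattern instance is selected.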

19
Number of output operands: MISO vs. MIMO

Significant improvement from MISO to MIMO
Usually, 2 outputs are good enough

20
Number of input operands

4-5 inputs are quite good for both MISO and MIMO

21
Resource Constraint

A resource of 25 adders is good for most benchmarks
(for jpeg, the biggest benchmark, a resource of
200 adders will suffice)

22
Total No. of custom instructions

Usually, 5 custom instructions will be quite good

23
Breaking Basic Block Boundaries Using a Compact
Trace

Break Basic Block Boundaries (4B)
Need the execution order among basic blocks
Control Flow Graph with profiling
Accumulated information can only partially
capture the program's dynamic behavior
Program Trace
Marnix Arnold's PhD thesis
Huge in space
Traverses the same program region multiple times
Compact Trace
Reduces both space and time complexity considerably

24
Whole Program Path (WPP) using the SEQUITUR compressor

SEQUITUR: an on-the-fly string compressor
Input: a string (the trace) over the alphabet of
basic blocks
Output: a context-free grammar representing the
trace
Repeated sequences of substrings are
represented by the same non-terminal symbol of
the grammar rules
An example of a short trace: 232324245

A DAG in which leaf nodes represent single
basic blocks and interior nodes represent a sequence
of consecutively executed basic blocks
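
To illustrate how a trace turns into a grammar, here is a minimal sketch. It is not SEQUITUR itself, which works on the fly and maintains extra invariants. It is a simpler offline stand-in that repeatedly replaces the most frequent repeated pair of symbols with a new rule, which yields the same kind of context-free grammar. The rule names `R0`, `R1`, ... are hypothetical and are assumed not to clash with basic block names.

```python
def count_digrams(seq):
    """Count non-overlapping occurrences of each adjacent symbol pair."""
    counts, last_start = {}, {}
    for i in range(len(seq) - 1):
        d = (seq[i], seq[i + 1])
        if last_start.get(d) == i - 1:
            continue  # overlaps the previous occurrence (e.g. in "aaa")
        counts[d] = counts.get(d, 0) + 1
        last_start[d] = i
    return counts


def replace_digram(seq, digram, name):
    """Replace occurrences of digram, left to right, with the rule name."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == digram:
            out.append(name)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def compress(trace):
    """Return (start_rule, rules) for a trace of basic block symbols."""
    seq, rules = list(trace), {}
    while True:
        counts = count_digrams(seq)
        if not counts:
            break
        # First pair with the highest count (ties go to the earliest pair seen).
        digram, freq = max(counts.items(), key=lambda kv: kv[1])
        if freq < 2:
            break  # nothing repeats any more
        name = f"R{len(rules)}"
        rules[name] = digram
        seq = replace_digram(seq, digram, name)
    return seq, rules


def expand(seq, rules):
    """Expand a symbol sequence back into the original trace."""
    out = []
    for s in seq:
        if s in rules:
            out.extend(expand(rules[s], rules))
        else:
            out.append(s)
    return out
```

For the example trace 232324245 this sketch gives the start rule `R0 R0 R1 R1 5` with `R0 -> 2 3` and `R1 -> 2 4`. SEQUITUR may produce a differently shaped grammar for the same trace.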
25
Traversing WPP

Walk hierarchically on the WPP
Starting from the WPP leaf nodes, identify all
patterns within each single basic block
Walk upwards to the next higher level, identify
all patterns within each interior node (a
sequence of consecutively executed basic blocks)
To avoid producing a large number of low-frequency
patterns, only walk through hot leaves and hot
interior nodes
Walk up at most 3-4 levels, because useful
computation usually won't span many basic
blocks, and widely spread computation is
hard for the compiler to implement

WPP
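
The bottom-up walk can be sketched as a selection of which WPP nodes to visit. The sketch assumes the WPP is given as a start sequence plus a dict of rules (a hypothetical representation), that "hot" means an execution count at or above a caller-chosen threshold, and that a leaf has level 0. Pattern identification inside each selected node is left out.

```python
from collections import defaultdict


def hot_nodes(start, rules, hot_threshold, max_level=3):
    """Return (level, node, count) for hot WPP nodes, lowest level first.

    start: list of symbols in the top-level rule.
    rules: dict rule name -> tuple of symbols (basic blocks or rule names).
    """
    level = {}

    def height(sym):
        # 0 for a leaf (single basic block), else 1 + the deepest child.
        if sym not in rules:
            return 0
        if sym not in level:
            level[sym] = 1 + max(height(c) for c in rules[sym])
        return level[sym]

    # Post-order DFS: children are listed before their parents.
    order, seen = [], set()

    def visit(sym):
        if sym in rules and sym not in seen:
            seen.add(sym)
            for child in rules[sym]:
                visit(child)
            order.append(sym)

    for sym in start:
        visit(sym)

    # Execution counts: push each parent's count down to its children.
    count = defaultdict(int)
    for sym in start:
        count[sym] += 1
    for sym in reversed(order):  # parents before children
        for child in rules[sym]:
            count[child] += count[sym]

    picked = [(height(s), s, n) for s, n in count.items()
              if n >= hot_threshold and height(s) <= max_level]
    return sorted(picked)
```

Sorting by level gives the visiting order described above: hot leaves first, then hot interior nodes one level at a time, up to `max_level`.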
26
Control flow constraint (across basic blocks)

Significant speedup (up to 148) using patterns
that span basic blocks
MIMO has more chance to improve performance

27
Across basic blocks: crossing if/loop branches

Either of the 2 cases can contribute considerably
Use predicated execution and loop unrolling to help
explore them

28
Conclusion and implication
29
Conclusion

Reasonable constraints on resources and on the
number of custom instructions won't affect
performance much
With 4 inputs and 2 outputs, we can achieve most
of the performance speedup
Significant improvement by relaxing control flow
constraints (using patterns that span basic blocks)

30
Implication

The resource offered by moderate custom logic
hardware is almost enough for a single embedded
application
Research on multi-tasking custom architectures
is needed

31
Summary