Characterizing Embedded Applications for Instruction-Set Extensible Processors

1
Characterizing Embedded Applications for
Instruction-Set Extensible Processors
  • By Pan Yu at embedded system seminar
  • Mar. 11th, 2004

2
Overview
  • What is the limit of potential performance
    speedup using instruction-set extensible
    processors?
  • How do different constraints restrict the
    performance potential?
  • By relaxing control flow constraints, can we gain
    much?

3
Agenda
  • Background on custom architecture
  • Aim and motivation of this work
  • Methodology and results interpretation
  • Implications and conclusion

4
Background on Custom Architecture

5
The efficiency of HW
  • 8 years ago
  • When the CPU is not sufficient, use HW
  • Efficiency of HW

In the age of 486 processors:
200 MPEG-I decompression card!

6
Energy Efficiency
  • The same amount of energy
  • HW: read data from memory, do computation, store
    data back
  • SW: the same as the above, plus
    • read instructions from memory,
    • decode instructions,
    • schedule and issue instructions,
    • prefetch instructions,
    • flush the pipeline

4 hours of MPEG-2 decoding on a 900 mAh battery
using dedicated decoding HW
OR
2 hours of MPEG-2 decoding on a 3800 mAh battery
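The two battery figures can be turned into a rough comparison of average current draw. This is only a back-of-the-envelope sketch: it assumes both batteries run at the same voltage and are fully drained over the stated time, neither of which the slide states. The function name is made up for illustration.

```python
def average_current_ma(capacity_mah: float, hours: float) -> float:
    """Average current (mA), assuming the battery is fully drained in `hours`."""
    return capacity_mah / hours


# Figures taken from the slide.
hw_ma = average_current_ma(900, 4)      # dedicated decoding HW
other_ma = average_current_ma(3800, 2)  # the alternative on the slide

# Ratio of average current; equals the power ratio only at equal voltage.
ratio = other_ma / hw_ma

print(f"dedicated HW: {hw_ma:.0f} mA, alternative: {other_ma:.0f} mA")
print(f"ratio: {ratio:.1f}x")
```

Under these assumptions the dedicated hardware draws roughly one eighth of the current of the alternative.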
7
Comparison between HW and SW
  • SW
    • Highly adaptive
    • Suitable for control-oriented computation
    • Lower throughput, higher energy consumption
  • HW
    • Highly specialized (not adaptive)
    • Suitable for computation-intensive tasks
    • Higher throughput, lower energy consumption
  • Custom architecture
    • Easy to design (adaptive, flexible)
    • Higher performance, lower energy consumption

8
Typical Instruction-Set Extensible Processor
  • A fragment of the program's data dependence graph
    is mapped to a CFU (Custom Functional Unit)
  • Higher performance, lower energy consumption
  • Frees up temporary registers
  • Reduces code size
  • CFUs are exposed to the software as ISA extensions

9
Aim and Motivation of This Work

10
Sub-questions in automating custom instruction
identification
  • What topology of data dependence subgraph can
    form a custom instruction?
  • How to identify all potential subgraphs
    efficiently?
  • How to evaluate and pick out the ones that are
    most useful in terms of performance speedup or
    energy efficiency, under different design
    constraints?

Multiple-Input Single-Output (MISO)
Multiple-Input Multiple-Output (MIMO)
Simultaneously schedulable

11
A Limit Study
  • Questions about custom instruction identification
    should include how limiting the different
    constraints are
    • Number of operands
    • Number of custom instructions
    • Area for custom logic
    • Control flow constraint
  • Control flow constraint
    • Previous work identifies and evaluates custom
      instructions within individual basic blocks.
      This simplifies compiler support, but restricts
      the potential performance improvement
    • We use a trace-based method to break the control
      flow constraint

12
Three questions to answer
  • What is the limit of potential performance
    speedup using instruction-set extensible
    processors?
  • How do different constraints restrict the
    performance potential?
  • By relaxing control flow constraints, can we gain
    much?

13
Methodology and Results Interpretation

14
Pattern Identification
  • Enumerate all the MIMO subgraphs of the DDG
    • Contain only arithmetic operations
    • Satisfy the constraints on the number of
      inputs/outputs
    • Satisfy the convexity condition
  • An exhaustive enumeration of the
    existence/non-existence of each node
    • Number all the nodes by a reverse topological
      sort
    • Enumerate from lower-numbered nodes to
      higher-numbered ones
    • The search space can be viewed as an N-level
      binary tree (where N is the total number of
      nodes in the DDG)

Non-convex subgraph

15
Pattern Identification (cont'd)
  • Prune the search space when the convexity
    condition or the number-of-outputs constraint
    is violated

Example of identifying patterns that contain less
than 2 outputs
Non-convex. Conflict. Prune the sub-search space.
3 outputs. Conflict. Prune the sub-search space.
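The enumeration with pruning described on the last two slides can be sketched as follows. This is a minimal illustration, not the presenter's implementation: the DDG encoding (a successor map), the function names, the toy graph and the rule that a node without consumers counts as an output are all assumptions. Nodes are visited in reverse topological order, so every consumer of a node is decided before the node itself; a convexity or output-count violation therefore can never be repaired deeper in the binary tree, which is what makes pruning safe. The input count can still shrink as producers are added, so it is only checked on complete subgraphs.

```python
def reverse_topological_order(succs):
    """DFS post-order over successors: consumers come before producers."""
    order, seen = [], set()

    def visit(n):
        if n in seen:
            return
        seen.add(n)
        for s in succs[n]:
            visit(s)
        order.append(n)

    for n in succs:
        visit(n)
    return order


def breaks_convexity(u, chosen, succs):
    """True if some path from u re-enters `chosen` through an outside node."""
    stack = [s for s in succs[u] if s not in chosen]
    seen = set()
    while stack:
        w = stack.pop()
        if w in seen:
            continue
        seen.add(w)
        for s in succs[w]:
            if s in chosen:
                return True
            stack.append(s)
    return False


def enumerate_patterns(succs, arith, max_in, max_out):
    """All non-empty convex subgraphs of arithmetic nodes within the I/O limits."""
    preds = {n: [] for n in succs}
    for n, ss in succs.items():
        for s in ss:
            preds[s].append(n)
    order = [n for n in reverse_topological_order(succs) if n in arith]
    found = []

    def num_outputs(chosen):
        # Assumption: a node with no consumer in the DDG is live-out.
        return sum(1 for n in chosen
                   if not succs[n] or any(s not in chosen for s in succs[n]))

    def num_inputs(chosen):
        return len({p for n in chosen for p in preds[n] if p not in chosen})

    def search(i, chosen):
        if i == len(order):
            if chosen and num_inputs(chosen) <= max_in:
                found.append(chosen)
            return
        search(i + 1, chosen)                  # branch: node absent
        u = order[i]
        if breaks_convexity(u, chosen, succs):
            return                             # prune: non-convex
        grown = chosen | {u}
        if num_outputs(grown) > max_out:
            return                             # prune: too many outputs
        search(i + 1, grown)                   # branch: node present

    search(0, frozenset())
    return found


# Toy DDG (hypothetical): a, b, c are external operands, "ld" is a load.
succs = {
    "a": ["add1"], "b": ["add1"], "c": ["mul"],
    "add1": ["ld", "mul"], "ld": ["sub"], "mul": ["sub"], "sub": [],
}
patterns = enumerate_patterns(succs, {"add1", "mul", "sub"}, max_in=3, max_out=2)
for p in patterns:
    print(sorted(p))
```

In this toy graph, any subgraph containing both add1 and sub is rejected as non-convex, because the path add1, ld, sub passes through the load, which cannot be part of a custom instruction.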
16
Pattern evaluation and selection
  • Objective
  • Cover each original instruction in the code with
    zero or one custom instruction, such that
    • 1. The best program acceleration is achieved
    • 2. When multiple patterns partially overlap,
      only one of them is selected
    • 3. The different design constraints are met

17
Pattern evaluation and selection: ILP formulation
  • Isomorphic subgraphs are instances of the same
    pattern, which can use the same CFU
  • Objective function
    • Maximize total performance gain
  • Covering exclusion constraint
  • Area constraint
  • Number of custom instructions constraint
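One way such an ILP could be written down is sketched below. The slide lists only the objective and the three constraint types, so every symbol here is an assumption made for illustration: x_i selects pattern instance i with gain g_i, y_p says whether a CFU is built for pattern p with area a_p, p(i) is the pattern of instance i, A is the area budget and K is the maximum number of custom instructions.

```latex
\begin{align*}
\text{maximize}\quad   & \sum_{i} g_i \, x_i
  && \text{(total performance gain)} \\
\text{subject to}\quad & \sum_{i \,:\, v \in i} x_i \le 1 \quad \forall\, v
  && \text{(each instruction } v \text{ is covered at most once)} \\
                       & x_i \le y_{p(i)} \quad \forall\, i
  && \text{(an instance needs the CFU of its pattern)} \\
                       & \sum_{p} a_p \, y_p \le A
  && \text{(area)} \\
                       & \sum_{p} y_p \le K
  && \text{(number of custom instructions)} \\
                       & x_i,\; y_p \in \{0, 1\}
\end{align*}
```

Because isomorphic instances share y_p, the area of a CFU is paid once no matter how many instances of its pattern are selected.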

18
Heuristic selection methods
  • ILP becomes intractable when there are too many
    variables and constraints
  • Heuristic methods
    • Rank each pattern instance with a priority
      function
    • Scan the ranked list once, greedily selecting the
      current highest-ranked pattern instance if it
      violates no constraint
    • If selected, discard the instances that collide
      with the current one
  • 3 priority functions
    • 1. Performance/cost ratio
    • 2. Software execution time
    • 3. Performance gain
  • All 3 heuristics (one per priority function) are
    run, and the result with the highest total
    performance gain is kept, as it is the nearest to
    the optimal solution given by the ILP method
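The greedy scan above can be sketched roughly as follows. The record layout, field names and the toy numbers are all hypothetical; in particular, charging the area once per pattern (since isomorphic instances share a CFU) is an assumption drawn from the ILP slide, not something this slide states.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Instance:
    pattern: str           # isomorphic instances share a pattern and a CFU
    nodes: FrozenSet[int]  # instructions covered by this instance
    gain: float            # performance gain if implemented in a CFU
    sw_time: float         # software execution time of the covered code
    area: float            # cost of the CFU for this pattern (assumed > 0)


# The three priority functions named on the slide.
PRIORITIES = {
    "gain_per_cost": lambda inst: inst.gain / inst.area,
    "sw_time": lambda inst: inst.sw_time,
    "gain": lambda inst: inst.gain,
}


def greedy_select(instances, priority, max_area, max_patterns):
    """Scan the ranked list once; keep an instance if nothing is violated."""
    chosen, covered, used = [], set(), {}  # used: pattern -> area
    for inst in sorted(instances, key=priority, reverse=True):
        if inst.nodes & covered:
            continue  # collides with an already selected instance
        if inst.pattern not in used:
            if len(used) + 1 > max_patterns:
                continue
            if sum(used.values()) + inst.area > max_area:
                continue
            used[inst.pattern] = inst.area
        covered |= inst.nodes
        chosen.append(inst)
    return chosen


def best_of_three(instances, max_area, max_patterns):
    """Run all three heuristics and keep the highest total gain."""
    results = [greedy_select(instances, f, max_area, max_patterns)
               for f in PRIORITIES.values()]
    return max(results, key=lambda sel: sum(i.gain for i in sel))


# Toy data (hypothetical): one large pattern competing with two small ones.
instances = [
    Instance("P1", frozenset({1, 2, 3}), gain=6, sw_time=9, area=6),
    Instance("P2", frozenset({1, 2}), gain=4, sw_time=5, area=2),
    Instance("P3", frozenset({3, 4}), gain=4, sw_time=6, area=2),
]
best = best_of_three(instances, max_area=6, max_patterns=2)
print([i.pattern for i in best], sum(i.gain for i in best))
```

On this toy input the ratio-based ranking picks the two small patterns, while the other two rankings commit to the large one first; running all three and keeping the best total gain is what shields the result from a single poor ranking.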

19
Number of output operands: MISO vs. MIMO
  • Significant improvement from MISO to MIMO
  • Usually, 2 outputs are good enough

20
Number of input operand
  • 4-5 inputs are quite good for both MISO and MIMO

21
Resource Constraint
  • A resource budget of 25 adders is good for most
    benchmarks
  • (for jpeg, the biggest benchmark, a budget of 200
    adders will suffice)

22
Total No. of custom instructions
  • Usually, 5 custom instructions will be quite good

23
Breaking Basic Block Boundaries Using a Compact
Trace
  • Break Basic Block Boundaries (4B)
    • Need the execution order among basic blocks
  • Control Flow Graph with profiling
    • Accumulative information can only partially
      deduce the program's dynamic behavior
  • Program Trace
    • Marnix Arnold's PhD thesis
    • Huge in space
    • Traverses the same program region multiple times
  • Compact Trace
    • Reduces both space and time complexity
      considerably

24
Whole Program Path (WPP) using SEQUITUR compressor
  • SEQUITUR: an on-the-fly string compressor
  • Input: a string (the trace) over the alphabet of
    basic blocks
  • Output: a context-free grammar representing the
    trace
  • Repeated substrings are represented by the same
    non-terminal symbol in the grammar rules
  • An example of a short trace: 232324245

A DAG, in which leaf nodes represent single
basic blocks and interior nodes represent sequences
of consecutively executed basic blocks
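To make the idea of a grammar over basic blocks concrete, here is a small sketch on the example trace. It is not SEQUITUR: SEQUITUR builds its grammar on the fly with digram-uniqueness and rule-utility invariants, whereas this stand-in is a much simpler offline scheme (in the spirit of Re-Pair) that repeatedly replaces the most frequent repeated pair of symbols with a new rule. All function names and the rule naming (R1, R2, ...) are made up for illustration. The result has the same flavor: terminals are basic blocks, and each non-terminal stands for a sequence of consecutively executed basic blocks.

```python
from collections import Counter


def count_digrams(seq):
    """Count non-overlapping occurrences of each adjacent pair of symbols."""
    counts, last_pos = Counter(), {}
    for i in range(len(seq) - 1):
        d = (seq[i], seq[i + 1])
        if last_pos.get(d) == i - 1:  # overlaps the previous one, e.g. "aaa"
            continue
        counts[d] += 1
        last_pos[d] = i
    return counts


def replace_digram(seq, digram, symbol):
    """Replace occurrences of `digram`, scanning left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == digram:
            out.append(symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def compress(trace):
    """Return (start sequence, rules) with rules[name] = (left, right)."""
    seq, rules = list(trace), {}
    while True:
        counts = count_digrams(seq)
        if not counts:
            break
        digram, n = counts.most_common(1)[0]
        if n < 2:
            break  # no pair repeats any more
        name = f"R{len(rules) + 1}"
        rules[name] = digram
        seq = replace_digram(seq, digram, name)
    return seq, rules


def expand(symbol, rules):
    """Expand a symbol back into the basic blocks it stands for."""
    if symbol not in rules:
        return [symbol]
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)


trace = "232324245"  # the short trace from the slide, one digit per basic block
start, rules = compress(trace)
restored = "".join(b for s in start for b in expand(s, rules))
print("start:", start)
print("rules:", rules)
print("restored:", restored)
```

Expanding the start sequence reproduces the original trace, so the grammar is a lossless and shorter description of it.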
25
Traversing WPP
  • Walk the WPP hierarchically
  • Starting from the WPP leaf nodes, identify all
    patterns within each single basic block
  • Walk upwards to the next higher level, identify
    all patterns within each interior node (a
    sequence of consecutively executed basic blocks)
  • To avoid producing a large number of low-frequency
    patterns, only walk through hot leaves and hot
    interior nodes
  • Walk up at most 3-4 levels, because useful
    computation usually won't span many basic
    blocks, and widely spread computation is hard
    for the compiler to implement

WPP
26
Control flow constraint (across basic blocks)
  • Significant speedup (up to 148) using patterns
    that cross basic blocks
  • MIMO has more chance to improve performance

27
Across basic blocks: crossing if/loop branches
  • Either of the 2 cases can contribute considerably
  • Predicated execution and loop unrolling can help
    explore them

28
Conclusion and implication

29
Conclusion
  • Reasonable constraints on resources and on the
    number of custom instructions won't affect
    performance much
  • Under a 4-input, 2-output constraint, we can
    achieve most of the performance speedup
  • Significant improvement by relaxing control flow
    constraints (using patterns across basic blocks)

30
Implication
  • The resources offered by moderate custom logic
    hardware are almost enough for a single embedded
    application
  • Research on multi-tasking custom architectures
    is needed

31
Summary
  • A feasible approach to identify computation
    patterns on a program trace
    • Patterns across basic block boundaries
  • Evaluating and selecting most valuable patterns
    by ILP or heuristic methods
  • A limit study on the potential benefit of
    instruction-set extensible processors