Characterizing Embedded Applications for Instruction-Set Extensible Processors

1
Characterizing Embedded Applications for
Instruction-Set Extensible Processors
Speaker: Pan Yu
Pan Yu and Tulika Mitra
School of Computing, National University of Singapore
Republic of Singapore, 117543
June 10th, 2004
2
Instruction-set extensible processors
  • Typical architecture
  • Examples of commercially available
    instruction-set extensible processors:
    S5000, Xtensa, Nios II
3
Instruction-set extensible processors (contd)
  • A fragment of the program's dataflow graph is
    mapped to a CFU (Custom Functional Unit)
  • Higher performance (fewer cycles, less
    fetch/decode overhead)
  • Frees up temporary registers
  • Reduce code size
  • Lower energy consumption

4
Outline
  • Aim and motivation of this work
  • Methodology and results interpretation
  • Conclusions and Implications

5
Aim and Motivation of This Work
6
Constraints on Custom Instructions
  • Various technology or design constraints on
    custom instructions
  • Number of input/output operands
      • Limited number of ports to register file
      • Limited instruction encoding length
      • Methodology-imposed constraint to achieve a
        fast algorithm
  • Number of custom instructions
      • Micro-architectural limits
      • Limited number of free slots in ISA
  • Area of custom logic
      • Cost and complexity
  • Control flow boundary
      • Most work stays within individual basic blocks

7
A Limit Study
  • What is the limit of the potential performance
    speedup of instruction-set extensible
    processors under relaxed design constraints?
  • How do the different constraints affect the
    performance potential?
  • How much can we gain by relaxing the control
    flow constraint?

8
Methodology and Results Interpretation
9
Custom Instruction Identification
  • Sub-problem 1: Candidate pattern identification
      • MISO: Pozzi01, Goodwin03 and Cong04
      • MIMO: Arnold01, Baleani02, Clark03 and
        Atasu03
      • Disconnected components: Brisk02 and Atasu03
  • Exhaustive enumeration of all the valid MIMO
    subgraphs of the DFG (based on K. Atasu et al.,
    DAC 2003)

[Figure labels: Multiple Input Single Output (MISO);
Multiple Input Multiple Output (MIMO);
simultaneously schedulable disconnected components]
10
Custom Instruction Identification (contd)
  • Sub-problem 2: Pattern evaluation and selection
  • Objective
      • Cover each original instruction in the code
        with zero or one custom instruction, such that
      • 1. The best program acceleration is achieved
      • 2. When multiple patterns partially overlap,
        only one is selected
      • 3. The design constraints are satisfied
  • Techniques
      • Arnold01: Dynamic programming based
      • Lee02: ILP (integer linear programming) based
      • Clark03: Heuristic based

11
Pattern evaluation and selection: ILP formulation
  • Isomorphic subgraphs
      • Are instances of the same pattern
      • Can be executed by the same CFU
  • Objective function: maximize total performance
    gain
      • s_ij: boolean variable, presence/absence of
        the jth instance of the ith pattern
      • F_ij: execution count of the corresponding
        instance
      • P_i: performance gain of the ith pattern
        compared to s/w execution

[Figure: Pattern 1 with two isomorphic instances
(Instance 1 and Instance 2); node labels include
-- and <<]
12
Pattern evaluation and selection: ILP formulation (contd)
  • Covering exclusion constraint
      • Each primitive operation is covered by at most
        one pattern
  • Area constraint
      • Total area is not more than a predefined R
  • Number of custom instructions constraint
      • Less than M custom instructions
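
The formulas themselves did not survive in this transcript. The block below is one possible reading of the ILP, reconstructed only from the definitions on the two slides above (s_ij, F_ij, P_i, R, M) and the per-pattern area R_i from the additional material. The indicator p_i (pattern i is implemented at all) and the notation inst(i,j) (the set of primitive operations covered by instance j of pattern i) are assumptions introduced here, and the bound on the number of custom instructions is written as inclusive although the slide says "less than M".

```latex
\begin{align*}
\text{maximize}\quad
  & \sum_{i}\sum_{j} s_{ij}\, F_{ij}\, P_i \\
\text{subject to}\quad
  & \sum_{(i,j)\,:\,k \,\in\, \mathrm{inst}(i,j)} s_{ij} \;\le\; 1
    && \text{for every primitive operation } k \\
  & p_i \;\ge\; s_{ij}
    && \text{for all } i, j \\
  & \sum_{i} p_i\, R_i \;\le\; R \\
  & \sum_{i} p_i \;\le\; M \\
  & s_{ij},\; p_i \in \{0, 1\}
\end{align*}
```

Charging the area R_i once per pattern rather than once per instance follows from the statement that all instances of a pattern can be executed by the same CFU.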

13
Heuristic selection methods
  • ILP is intractable with too many variables and
    constraints
  • Heuristic methods
      • Rank each pattern instance with a priority
        function
      • Scan the ranked list once, greedily selecting
        the current highest-ranked pattern instance if
        it does not violate any constraint
      • Given a selected pattern instance, discard the
        others that collide with it
  • Three priority functions
      • 1. Performance/cost ratio
      • 2. Software execution time
      • 3. Performance gain
  • All three priority functions are applied at each
    design point; the result with the highest
    performance gain is kept, as it is the nearest to
    the optimal solution given by the ILP method
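
A minimal Python sketch of the greedy scan described on this slide, under stated assumptions. The `Instance` fields, the exact form of the three priority functions (each weighted by execution count), and the choice to charge area once per pattern are illustrative guesses, not taken from the talk. Area is assumed to be positive, and the instruction limit is treated as inclusive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    pattern: str       # pattern id (instances of one pattern share a CFU)
    ops: frozenset     # primitive operations covered by this instance
    exec_count: int    # F_ij, execution count of the instance
    gain: float        # P_i, gain per execution (assumed: cycles saved)
    area: float        # R_i, area of the pattern (assumed > 0)
    sw_time: float     # software execution time of the pattern


# Assumed forms of the three priority functions named on the slide.
PRIORITIES = {
    "gain_per_cost": lambda x: x.exec_count * x.gain / x.area,
    "software_time": lambda x: x.exec_count * x.sw_time,
    "gain": lambda x: x.exec_count * x.gain,
}


def greedy_select(instances, priority, max_area, max_patterns):
    """Scan the ranked list once, keeping instances that violate nothing."""
    chosen, covered, patterns, area = [], set(), set(), 0.0
    for inst in sorted(instances, key=priority, reverse=True):
        if covered & inst.ops:
            continue  # collides with an already selected instance
        is_new = inst.pattern not in patterns
        if is_new and (len(patterns) + 1 > max_patterns
                       or area + inst.area > max_area):
            continue  # would break the instruction-count or area limit
        chosen.append(inst)
        covered |= inst.ops
        if is_new:
            patterns.add(inst.pattern)
            area += inst.area  # assumption: area is paid once per pattern
    return chosen


def total_gain(chosen):
    return sum(x.exec_count * x.gain for x in chosen)


def best_of_three(instances, max_area, max_patterns):
    """Run all three priority functions and keep the best result."""
    results = [greedy_select(instances, p, max_area, max_patterns)
               for p in PRIORITIES.values()]
    return max(results, key=total_gain)
```

Calling `best_of_three(instances, max_area, max_patterns)` once per design point mirrors the last bullet: each priority function yields one selection, and the one with the largest total gain is reported.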

14
Experiment Environment
  • SimpleScalar v3.0
      • A MIPS-like superscalar processor simulator
  • gcc 2.7.2.3 with -O3 optimization
  • Benchmarks: MiBench suite

15
Number of output operands: MISO vs. MIMO
  • Within basic blocks
  • Significant improvement going from MISO to MIMO
  • Usually, 2 outputs are good enough
16
Number of input operands
  • 4-5 inputs are quite good for both MISO and MIMO

17
Resource Impact
  • Resource equivalent to 25 adders is good enough
    for most benchmarks
  • (For djpeg, the biggest benchmark, resources
    equivalent to 200 adders will suffice)

18
Total No. of custom instructions
  • Usually, 5 custom instructions will be quite good

19
Breaking Basic Block Boundaries Using a Compact
Trace
  • Break basic block boundaries
      • Starting from frequently executed basic block
        sequences
  • Control Flow Graph with profiling
      • Accumulated profile information can only
        partially capture the program's dynamic behavior
  • Program trace
      • Arnold01
      • Huge in space
      • Traverses the same program region multiple
        times
  • Compact trace
      • Reduces both space and time complexity
        considerably

20
Whole Program Path (WPP) using the SEQUITUR compressor
  • SEQUITUR: an on-the-fly string compressor
    (Nevill-Manning97)
      • Input: a string
      • Output: a hierarchical context-free grammar
        representing the string
      • Repeated sequences of substrings are
        represented by the same non-terminal symbol of
        the grammar rules
  • Whole Program Path (WPP) (Larus99)
      • The string is the program path trace
      • The grammar is represented as a DAG
      • Good compression ratio

21
Whole Program Path (WPP) using the SEQUITUR compressor (contd)
  • Our case
      • Works on the alphabet of basic blocks
      • An example of a short trace: 232324245
  • Advantage
      • The same consecutively executed basic block
        sequence, held in an interior node, is processed
        only once

[Figure: a DAG in which leaf nodes represent single
basic blocks and interior nodes represent a sequence
of consecutively executed basic blocks]
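
To make the idea of a grammar over basic-block traces concrete, here is a small Python sketch. It is not SEQUITUR: it uses a simpler offline pair-replacement scheme (in the spirit of Re-Pair) that repeatedly replaces the most frequent adjacent pair with a fresh non-terminal. The rule names `R0`, `R1`, ... and the function names are hypothetical. It is only meant to show how repeated block sequences collapse into shared grammar rules.

```python
from collections import Counter


def compress(trace):
    """Build a small grammar for a trace by repeated pair replacement.

    Returns (top_level_sequence, rules), where rules maps a non-terminal
    name to the pair of symbols it stands for.
    """
    seq = list(trace)
    rules = {}
    while len(seq) > 1:
        pair, count = Counter(zip(seq, seq[1:])).most_common(1)[0]
        if count < 2:
            break  # no adjacent pair repeats any more
        name = f"R{len(rules)}"
        out, i, replaced = [], 0, 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(name)
                i += 2
                replaced += 1
            else:
                out.append(seq[i])
                i += 1
        if replaced < 2:
            break  # occurrences overlapped (e.g. "aaa"); stop the sketch
        rules[name] = pair
        seq = out
    return seq, rules


def expand(symbol, rules):
    """Expand a symbol back into the list of terminals it represents."""
    if symbol in rules:
        left, right = rules[symbol]
        return expand(left, rules) + expand(right, rules)
    return [symbol]


# The short trace from the slide, one character per basic block.
top, rules = compress("232324245")
restored = "".join(t for s in top for t in expand(s, rules))
```

Each entry of `rules` corresponds to an interior node of the DAG, so any analysis attached to a rule is done once, however often that block sequence occurs in the trace.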
22
Traversing WPP
  • Walk the WPP hierarchically
      • Starting from the WPP leaf nodes, identify all
        patterns within each single basic block
      • Walk upwards to the next higher level and
        identify all patterns within each interior node
        (a sequence of consecutively executed basic
        blocks)
  • Walk through only hot leaves and hot interior
    nodes
  • Walk up at most 3-4 levels

[Figure: WPP]
23
Control flow constraint (across basic blocks)
  • Significant speedup (up to 148) using patterns
    that span basic blocks
  • MIMO has more chances to improve performance
  • The number of outputs may need to increase to 3

24
Number of Inputs and Area (across basic blocks)
  • Number of inputs
      • With up to 3 outputs
      • 4-5 inputs suffice to achieve near-optimal
        speedup
  • Area
      • Still quite reasonable

25
Across basic blocks: crossing if/loop branches
  • Either of the two cases can contribute
    considerably
  • Predicated execution or loop unrolling can help

26
Conclusion and implication
27
Conclusion
  • Reasonable limits on the amount of resources and
    the number of custom instructions do not limit
    performance
  • With 5-input, 3-output operands, we can achieve
    close to optimal speedup
  • Significant improvement from relaxing control
    flow constraints (using patterns across basic
    blocks)

28
Implication
  • The resources offered by moderate custom logic
    hardware are almost enough for a single embedded
    application
  • Resource requirements for multi-tasking
    applications running on ISA-extensible processors
    need to be explored in the future

29
Additional Material
30
Latency and Area estimation
  • From the Synopsys synthesis tool
  • Timing estimation
      • Normalized against the MAC operation
      • P_i = T_SW - T_HW
      • T_HW: critical path latency (longest latency)
        of each pattern
      • T_SW: sum of the execution cycles of each
        primitive operation
  • Area estimation (R_i)
      • Normalized against that of an adder
      • Sum over all operations in a pattern

Primitive Opr.   Latency (vs. MAC)   Area (vs. adder)
Add              0.16                1
Sub              0.17                0.94
And              0.02                0.12
Or               0.02                0.12
Nor              0.03                0.06
Xor              0.03                0.18
Mult             0.73                17.3
MAC              1                   21.4
Div              6.00                26.36
Ashift           0.42                3.03
Bshift           0.17                0.94
Cmp              0.13                0.37
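
The estimation on this slide can be sketched in Python as follows. The latency and area figures are copied from the table above; everything else is an assumption made for illustration: the function names, one software cycle per primitive operation, rounding the hardware critical path up to whole cycles, and reading the gain as T_SW minus T_HW.

```python
import math

# op -> (latency normalized to a MAC, area normalized to an adder),
# copied from the slide's table.
COST = {
    "add": (0.16, 1.0), "sub": (0.17, 0.94), "and": (0.02, 0.12),
    "or": (0.02, 0.12), "nor": (0.03, 0.06), "xor": (0.03, 0.18),
    "mult": (0.73, 17.3), "mac": (1.0, 21.4), "div": (6.00, 26.36),
    "ashift": (0.42, 3.03), "bshift": (0.17, 0.94), "cmp": (0.13, 0.37),
}


def critical_path(ops, edges):
    """Longest latency path through the pattern's DAG.

    ops:   {node: operation name}
    edges: iterable of (producer, consumer) pairs inside the pattern
    """
    preds = {n: [] for n in ops}
    for src, dst in edges:
        preds[dst].append(src)
    memo = {}

    def finish(n):
        if n not in memo:
            start = max((finish(p) for p in preds[n]), default=0.0)
            memo[n] = start + COST[ops[n]][0]
        return memo[n]

    return max(finish(n) for n in ops)


def estimate(ops, edges, sw_cycles_per_op=1):
    """Return T_SW, T_HW, gain and area for one pattern."""
    t_hw = math.ceil(critical_path(ops, edges))  # assumption: whole cycles
    t_sw = sw_cycles_per_op * len(ops)           # assumption: 1 cycle per op
    area = sum(COST[op][1] for op in ops.values())
    return {"t_sw": t_sw, "t_hw": t_hw, "gain": t_sw - t_hw, "area": area}


# A hypothetical three-operation chain: add -> bshift -> xor.
result = estimate({"a": "add", "b": "bshift", "c": "xor"},
                  [("a", "b"), ("b", "c")])
```

For this chain the hardware critical path is well under one MAC latency, so three software operations collapse into a single custom-instruction cycle at a cost of slightly more than two adders of area.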
31
Candidate Pattern Identification
  • Enumerate all the MIMO subgraphs of the DFG under
    the following constraints
      • Contain only arithmetic/logic operations
      • Number of input/output constraints
      • Convexity condition
  • An exhaustive enumeration of the
    existence/non-existence of each node (based on
    K. Atasu et al., DAC 2003)
      • Enumeration in reverse topologically sorted
        order of nodes
      • The search space can be viewed as an N-level
        binary tree (where N is the total number of
        nodes in the DFG)
      • Prune the search space when the convexity
        condition and/or the number-of-outputs
        constraint is violated

[Figure: non-convex subgraph]
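
The constraints listed on this slide can be illustrated with a brute-force Python sketch. It checks every subset of eligible nodes instead of performing the pruned binary-tree search in reverse topological order, so it is only practical for tiny graphs. The graph encoding, the function names, and the rule that a node without successors counts as an output are assumptions made for the example.

```python
from itertools import combinations


def reachable(start, succs):
    """Nodes reachable from `start` by following at least one edge."""
    seen, stack = set(), list(start)
    while stack:
        node = stack.pop()
        for nxt in succs.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def is_convex(sub, succs):
    """Convex: no path leaves the subgraph and later re-enters it."""
    outside = {m for n in sub for m in succs.get(n, ()) if m not in sub}
    return not (reachable(outside, succs) & sub)


def enumerate_patterns(ops, edges, max_in, max_out, allowed):
    """List all node subsets that satisfy the slide's constraints."""
    succs, preds = {}, {}
    for src, dst in edges:
        succs.setdefault(src, set()).add(dst)
        preds.setdefault(dst, set()).add(src)
    # Only arithmetic/logic operations may be part of a pattern.
    candidates = [n for n in ops if ops[n] in allowed]
    found = []
    for size in range(1, len(candidates) + 1):
        for combo in combinations(candidates, size):
            sub = set(combo)
            ins = {p for n in sub for p in preds.get(n, ()) if p not in sub}
            # Assumption: a node with no successors is live-out.
            outs = {n for n in sub
                    if not succs.get(n) or succs[n] - sub}
            if (len(ins) <= max_in and len(outs) <= max_out
                    and is_convex(sub, succs)):
                found.append(frozenset(sub))
    return found


# Hypothetical DFG: add feeds xor directly and also through a load,
# which cannot be part of a custom instruction.
ops = {"i1": "in", "i2": "in", "a": "add", "m": "load", "c": "xor"}
edges = [("i1", "a"), ("i2", "a"), ("a", "m"), ("a", "c"), ("m", "c")]
patterns = enumerate_patterns(ops, edges, max_in=4, max_out=2,
                              allowed={"add", "xor"})
```

In this example the subset containing both `a` and `c` is rejected as non-convex, because the path from `a` to `c` through the load `m` leaves the subgraph and re-enters it.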
32
Conclusion
  • Evaluating and selecting the most valuable
    patterns by ILP or heuristic methods
  • A limit study on the potential benefit of
    instruction-set extensible processors
  • A feasible approach to identifying computation
    patterns on a program trace
  • Patterns across basic block boundaries