1
IC Research Day 2003
Embedded Processors: From Prêt-à-Porter to Tailor-made
  • Paolo Ienne
  • (joint work with Kubilay Atasu and Laura Pozzi)
  • Processor Architecture Laboratory (LAP)
  • Centre for Advanced Digital Systems (CSDA)
  • Swiss Federal Institute of Technology Lausanne
    (EPFL)

2
Evolution?
From Tailor-Made
3
Processor Evolution
  • From Tailor-made to Prêt-à-Porter to
    Tailor-made!!
  • From Tailor-made to Prêt-à-Porter

[Figure: number of different processor architectures over time (1960s to 2010), from mainframes (IBM360, B5000, B6500, CDC6600, VAX, PDP-11) through microprocessors (4004, 8080, Z80, 6502, 68k, x86) to RISC processors (MIPS, SPARC, PA-RISC, Alpha, PowerPC, RS6000)]
4
Tailor-Made Embedded Processors?
  • Binary compatibility is less of an issue for
    embedded systems
  • Unlike general-purpose processors (→ x86
    dominance)
  • Performance/cost is at a premium, not performance
    alone as in PCs (Intel model)
  • Other metrics, such as energy, are now
    fundamental too
  • Systems-on-Chip can include any processor on chip
    → no strong need for standard devices
  • Application-specific integrated circuits (ASICs)

5
Tailoring Instruction Set Extensions
[Figure: a processor datapath augmented with an application-specific functional unit]
6
Instruction Set Extensions
  • Collapse a subset of the Directed Acyclic Graph
    (DAG) nodes into a single Application-specific
    Functional Unit (AFU)
  • Exploit cheaply the parallelism within the basic
    block
  • Simplify operations with constant operands
  • Optimise sequences of instructions (logic,
    arithmetic, etc.)
  • Exploit limited precision

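The "simplify operations with constant operands" point above can be illustrated with the classic strength reduction: a multiplication by a known constant collapses into a few shifts and adds inside an AFU. The function names below are illustrative, not from the talk:

```python
def shift_add_decompose(c):
    """Decompose multiplication by a known constant c into shifts.

    Returns the shift amounts k such that x * c == sum of (x << k),
    one term per set bit of c: a constant multiply needs no general
    multiplier in hardware, only shifts and adds.
    """
    shifts = []
    k = 0
    while c:
        if c & 1:
            shifts.append(k)
        c >>= 1
        k += 1
    return shifts

def mul_by_const(x, shifts):
    """Multiply x by the constant encoded as shift amounts."""
    result = 0
    for k in shifts:
        result += x << k
    return result

# Multiplying by 10 (binary 1010) needs only two shift-add terms.
shifts = shift_add_decompose(10)   # [1, 3]
assert mul_by_const(7, shifts) == 70
```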
7
Outline
  • Problem definition
  • Main algorithm
  • Speed-up estimation
  • More tailoring problems
  • Conclusions

8
Instruction Set Extensions
  • Typical approach (I): find frequent patterns
  • Typically rather small
  • Might have too many inputs or outputs for the
    register file
  • Reuse is not a good heuristic for high speedup
  • E.g., ChoiJun99, KastnerOct02, ArnoldApr01
  • Typical approach (II): grow clusters until an I/O
    violation occurs
  • Limits possibilities
  • Usually only a single output
  • E.g., RazdanNov94, Geurts97, AlippiMar99,
    KastrupApr99, YeJun00, BaleaniMay02

9
Existing Solutions Miss Potential Speed-ups
  • Goal: find subgraphs
  • having a user-defined maximum number of inputs
    and outputs,
  • including disconnected components, and
  • that maximise the overall speedup

10
Problem Statement
  • Gi (V, E) are the graphs representing the DAGs of
    the algorithm basic blocks
  • S is a subgraph of G
  • M(S) represents the gain achievable by
    implementing the subgraph S as a special
    instruction
  • Problem Find the Ninstr subgraphs Sj of any Gi
    such that
  • Under the following constraints for each
    subgraphs Sj
  • Number of inputs of Sj lt Nin
  • Number of outputs of Sj lt Nout
  • Sj is convex

11
Convexity Constraint
A non-convex subgraph cannot be implemented as a
special instruction (or it imposes overly complex
constraints on the compiler's scheduler)
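The I/O and convexity constraints of slides 10 and 11 can be checked mechanically on a basic-block DAG. The sketch below uses illustrative names and a simplified formulation, not the published one:

```python
def violates_constraints(dag, subgraph, n_in, n_out):
    """Check the Nin/Nout and convexity constraints for a candidate
    subgraph of a basic-block DAG.

    dag maps each node to its list of successors; subgraph is the
    set of nodes considered for collapsing into one instruction.
    """
    preds = {v: set() for v in dag}
    for u, succs in dag.items():
        for v in succs:
            preds[v].add(u)

    # Inputs: values produced outside the subgraph, consumed inside.
    inputs = {u for v in subgraph for u in preds[v] if u not in subgraph}
    # Outputs: values produced inside, consumed outside (sink nodes
    # count as outputs, since their value is live-out).
    outputs = {u for u in subgraph
               if not dag[u] or any(v not in subgraph for v in dag[u])}
    if len(inputs) > n_in or len(outputs) > n_out:
        return True

    # Convexity: no path may leave the subgraph and re-enter it.
    external = {v for u in subgraph for v in dag[u] if v not in subgraph}
    stack, seen = list(external), set(external)
    while stack:
        u = stack.pop()
        if u in subgraph:
            return True        # re-entered through an external node
        for v in dag[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return False

# b and c both depend on a and feed d: {a, d} is non-convex.
dag = {'a': ['b', 'c'], 'b': ['d'], 'c': ['d'], 'd': []}
assert violates_constraints(dag, {'a', 'd'}, 4, 4)        # non-convex
assert not violates_constraints(dag, {'b', 'c', 'd'}, 2, 1)
```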
12
Outline
  • Problem definition
  • Main algorithm
  • Speed-up estimation
  • More tailoring problems
  • Conclusions

13
Identification Algorithms
1. Single Subgraph, Single Basic Block
2. Multiple Subgraphs (e.g., 2), Single Basic Block
3/4. Multiple Subgraphs (e.g., 3), Multiple Basic
Blocks
14
Single Subgraph within a Single Basic Block
  • A graph with N nodes has 2^N subgraphs
  • Potential solutions are in fact rather sparse
  • How to avoid exploring the whole design space
    unnecessarily?

15
Search Space Pruning
16
Search Space Pruning
Nodes are numbered based on a topological
sort (→ source nodes follow destination
nodes). Example: Nout = 1
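The pruning idea can be sketched as follows: when nodes are decided in reverse topological order (destinations before sources), every successor of a node is already decided at the moment the node itself is, so a node's output status is final and any branch that already violates the output constraint can be cut. This is a sketch of the idea, not the exact published algorithm:

```python
def enumerate_subgraphs(dag, order, n_out):
    """Enumerate candidate subgraphs, pruning on the output constraint.

    order lists nodes in reverse topological order (destination nodes
    before source nodes), which makes the output count of a partial
    selection final and monotone, so pruning is sound.
    """
    results = []

    def n_outputs(selected):
        # Output = some successor outside the subgraph, or a sink.
        return sum(1 for u in selected
                   if not dag[u] or any(v not in selected for v in dag[u]))

    def recurse(i, selected):
        if n_outputs(selected) > n_out:
            return                                  # prune the branch
        if i == len(order):
            if selected:
                results.append(frozenset(selected))
            return
        recurse(i + 1, selected)                    # exclude order[i]
        recurse(i + 1, selected | {order[i]})       # include order[i]

    recurse(0, set())
    return results

# A three-node chain a -> b -> c; with Nout = 1 every nonempty
# subgraph except {a, c} (two outputs) is a valid candidate.
dag = {'a': ['b'], 'b': ['c'], 'c': []}
found = enumerate_subgraphs(dag, ['c', 'b', 'a'], n_out=1)
assert len(found) == 6
assert frozenset({'a', 'c'}) not in found
```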
17
Algorithm Performance
[Figure: number of subgraphs considered vs. number of graph nodes (N), with an output-port constraint of two]
18
Identification Algorithms
1. Single Subgraph, Single Basic Block
2. Multiple Subgraphs (e.g., 2), Single Basic Block
3/4. Multiple Subgraphs (e.g., 3), Multiple Basic
Blocks
19
Outline
  • Problem definition
  • Main algorithm
  • Speed-up estimation
  • More tailoring problems
  • Conclusions

20
Estimated Speedups
Input/Output constraints
The speedup grows as the constraints are relaxed.
The heuristic algorithm (Iterative) for multiple
subgraphs is practically as good as the Optimal
algorithm
21
Identification Results
  • Only the Iterative algorithm was run, due to
    complexity

22
Outline
  • Problem definition
  • Main algorithm
  • Speed-up estimation
  • More tailoring problems
  • Conclusions

23
What's Next?
Automatic Instruction-Set Extensions: Atasu,
Pozzi, Ienne, DAC 2003 (Best Paper Award)
Arithmetic Optimisations: ongoing work
Symbolic Algebra for Instruction
Selection: Peymandoust, Pozzi, Ienne, De Micheli,
ASAP 2003
25
Instruction Selection Based on Symbolic Algebra
  • Symbolic algebra instruction-selection engine
    used to add single-output functional units to
    Xtensa processors
  • Branch & Bound algorithm to select polynomial
    candidates, and Maple V as the symbolic
    simplification engine (→ Gröbner bases)
  • 18% to 77% execution-time improvement, with an
    average of 41%, reported by a cycle-accurate
    instruction-set simulator

26
Area Cost
  • Added a total of ten different instructions
  • Complexity ranges from 2 to 20 operations per
    instruction
  • Small area penalty (max 20%)

27
Outline
Automatic Instruction-Set Extensions: Atasu,
Pozzi, Ienne, DAC 2003 (Best Paper Award)
Arithmetic Optimisations: ongoing work
Symbolic Algebra for Instruction
Selection: Peymandoust, Pozzi, Ienne, De Micheli,
ASAP 2003
28
Example of High-level Arithmetic Optimisations
Significant speed-up possible at no hardware cost
29
Conclusions
[Figure: design flow from the application algorithm through manual partitioning into the SW and HW parts of a System-on-Chip]
30
Acknowledgements
  • LAP:
  • Laura Pozzi
  • Kubilay Atasu, Miljan Vuletic
  • Ajay Kumar Verma
  • Other Groups:
  • Armita Peymandoust, Giovanni De Micheli
    (Stanford Univ.)
  • Rainer Leupers (RWTH Aachen)

31
IC Research Day 2003
Embedded Processors: From Prêt-à-Porter to Tailor-made
  • Paolo Ienne
  • (joint work with Kubilay Atasu and Laura Pozzi)
  • Processor Architecture Laboratory (LAP)
  • Centre for Advanced Digital Systems (CSDA)
  • Swiss Federal Institute of Technology Lausanne
    (EPFL)

32
New Design Methodologies!
System Synthesis
Instruction Set Extensions / Coprocessor Synthesis
Symbolic Mapping
Retargetable Compilers
High-Level Synthesis
Arithmetic Optimisation
Logic Synthesis
Physical Design
33
Outline
Automatic Instruction-Set Extensions: Atasu,
Pozzi, Ienne, DAC 2003 (Best Paper Award)
Symbolic Algebra for Instruction
Selection: Peymandoust, Pozzi, Ienne, De Micheli,
ASAP 2003
35
Instruction Selection
  • Given a code segment,
  • and having identified special instructions,
  • find all places where the instruction can be
    used in the code
  • Maximise usage, and hence advantage
  • Increase robustness to changes in the application
Y1 = a·R1 + b·G1 + c·B1 + d
Y2 = a·R2 + b·G2 + c·B2 + d
Y = Y2 + q·(Y1 - Y2)
36
Instruction Selection
Best?!
37
Simple Algebraic Properties
  • Neutral elements
  • Associativity, distributivity, etc.

c = c · 1
(a + b) · c = a · c + b · c
38
Simple Algebraic Properties
39
More Use of Algebra
  • Given a code segment

Y1 = a·R1 + b·G1 + c·B1 + d
Y2 = a·R2 + b·G2 + c·B2 + d
Y = Y2 + q·(Y1 - Y2)
40
Rewriting the Polynomial to Expose foo
41
Gröbner Bases Instruction Selection
  • Polynomial representation S of the dataflow
  • Polynomial representation L of the special
    instructions
  • To implement S with elements of L:
  • Simplify S using elements of L
  • Can be done with Buchberger's algorithm and
    Gröbner bases
  • Available in Maple (simplify), Mathematica
    (AlgebraicRules), …
  • Reduce S iteratively. Can S be reduced
    completely?
  • If yes, S can be implemented using only elements
    of L
  • If not, map the residual polynomial to adders and
    multipliers
  • Find the minimal-cost reduction (e.g., minimal
    number of instructions)

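The simplify-with-side-relations flow shown on the following slides can be mimicked with SymPy's multivariate polynomial division: `reduced` performs the one-step reduction at the heart of a Gröbner-basis-style simplification. The polynomials mirror the slides' example; the SymPy encoding itself is an assumption of this sketch, not the authors' tool:

```python
from sympy import symbols, expand, reduced

a, b, c, d, p, q = symbols('a b c d p q')
R1, R2, G1, G2, B1, B2 = symbols('R1 R2 G1 G2 B1 B2')
s1, s2, s3 = symbols('s1 s2 s3')

# Dataflow polynomial S: the Y computation from the example slides.
Y = a*q*R1 + a*p*R2 + b*q*G1 + b*p*G2 + c*q*B1 + c*p*B2 + d

# Side relations: each special-instruction output si is a pair of
# multiply-accumulates (the 'foo' instruction of the slides),
# written as polynomials that vanish when si takes its value.
siderels = [q*R1 + p*R2 - s1,
            q*G1 + p*G2 - s2,
            q*B1 + p*B2 - s3]

# Multivariate division: Y = sum(quotients[i]*siderels[i]) + remainder.
# With a lex order that ranks R1, G1, B1 highest, the remainder is
# expressed in the instruction outputs s1, s2, s3.
quotients, remainder = reduced(
    Y, siderels,
    R1, G1, B1, R2, G2, B2, q, p, a, b, c, d, s1, s2, s3,
    order='lex')

assert expand(remainder - (a*s1 + b*s2 + c*s3 + d)) == 0
```

The remainder `a*s1 + b*s2 + c*s3 + d` matches the `Z` result of the Maple session on the next slides, so the whole dataflow is covered by three instances of the special instruction plus a final multiply-accumulate.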
42
Polynomials Representing the Instructions
  • For each instruction available:
  • many polynomial candidates to simplify S
  • Large search space:
  • which instructions should be used to simplify?
  • which polynomials should be generated with such
    instructions?

t1 = a·b + R2·G2
t2 = a·R2 + b·G2
t3 = a·b + c·d
43
Choice of the Polynomials
  • Example using Maple V commands and syntax
  • Y := a*q*R1 + a*p*R2 + b*q*G1 + b*p*G2 + c*q*B1 + c*p*B2 + d
  • siderels := {s1 = q*R1 + p*R2, s2 = q*G1 + p*G2,
    s3 = q*B1 + p*B2}
  • Z := simplify(Y, siderels, [R1, R2, G1, G2, B1, B2])
  • Z = a*s1 + b*s2 + c*s3 + d

Side Relations in Maple correspond to Algebraic
Transformation Rules in Mathematica
Y = foo(foo(p,R2,q,R1), a, foo(p,G2,q,G1), b) +
foo(foo(p,B2,q,B1), c, d, 1)
44
Choice of the Polynomials
  • Example using Maple V commands and syntax
  • Y := a*q*R1 + a*p*R2 + b*q*G1 + b*p*G2 + c*q*B1 + c*p*B2 + d
  • siderels := {s1 = q*R1 + p*R2, s2 = q*G1 + p*G2,
    s3 = q*B1 + p*B2}
  • Z := simplify(Y, siderels, [R1, R2, G1, G2, B1, B2])
  • Z = a*s1 + b*s2 + c*s3 + d
  • siderel2 := {s4 = s1*a + s2*b, s5 = c*s3 + d}
  • simplify(Z, siderel2, [s1, s2, s3])
  • = s4 + s5

Y = foo(foo(p,R2,q,R1), a, foo(p,G2,q,G1), b) +
foo(foo(p,B2,q,B1), c, d, 1)
45
Optimal Instruction Selection
  • Build the decomposition tree from left to right
  • Li ← polynomials generated from the instructions
  • Base cost: implementing the basic block with
    adders and multipliers
  • Branch & Bound algorithm:
  • continue until the result of simplify is a
    library element
  • bound the tree depth every time a better solution
    is found

46
Overall View
Polynomial Representation of Critical Code
Polynomial Rep. of Library Elements
47
Symbolic Algebra Results (Speedup)
  • Symbolic algebra instruction-selection engine
    used to add single-output functional units to
    Xtensa processors
  • 18% to 77% execution-time improvement, with an
    average of 41%, reported by a cycle-accurate
    instruction-set simulator

48
Outline
Automatic Instruction-Set Extensions: Atasu,
Pozzi, Ienne, DAC 2003 (Best Paper Award)
Arithmetic Optimisations: ongoing work
Symbolic Algebra for Instruction
Selection: Peymandoust, Pozzi, Ienne, De Micheli,
ASAP 2003
49
Arithmetic Optimisations
  • Arithmetic operations often appear in sequence
  • Direct implementation is not efficient
  • Exploit the effectiveness of carry-save adders,
    column compressors, etc.
  • Can bring large advantages, often without
    additional hardware cost
  • Typical example: MAC
  • marginally slower than the corresponding MUL
  • practically the same complexity

50
Example of Problem
  • Code of multiplication
  • Graph

51
Carry-Save Representation
  • Don't do successive sums (each T = O(f(N))), but
    use the CS representation wherever possible
    (each T = O(1))

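The carry-save idea can be demonstrated on plain integers: a 3:2 compressor reduces three addends to two using only bitwise operations (constant delay, independent of the word length N), and a single carry-propagate addition is paid at the very end. A minimal sketch, with illustrative function names:

```python
def compress_3_2(a, b, c):
    """3:2 compressor (carry-save adder): three addends become two.

    Sum and carry words are produced purely bitwise, so the delay is
    O(1) in the word length, unlike a carry-propagate addition.
    The invariant a + b + c == sum_word + carry_word always holds.
    """
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (b & c) | (a & c)) << 1
    return sum_word, carry_word

def carry_save_sum(operands):
    """Sum a list of numbers in carry-save (two-word) form, with a
    single carry-propagate addition at the very end."""
    s, t = 0, 0
    for x in operands:
        s, t = compress_3_2(s, t, x)
    return s + t      # the only carry-propagate (O(f(N))) addition

assert carry_save_sum([12, 34, 56, 78]) == 180
```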
52
Example of Problem
  • Graph again with critical path
  • Horror, disgrace!

53
Transformations (I)
  • Expand / make explicit multiplications,
    subtractions, ...
  • MUL → PP + COMP + ADD
  • SUB → NOT + ADD
  • Use the CS representation
  • ADD + ADD → COMP

54
Transformations (II)
  • Help clustering operations of the same kind
  • Homomorphism: ADD + SL → SL/SL + ADD
  • Pseudo-homomorphism: COMP + NEG → NEG/NEG + COMP
    (with some corrections, K-1)
  • Then they can be collapsed:
  • COMP + COMP → COMP
  • SL + SL → SL

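The ADD/SL homomorphism can be checked on plain integers: a shift-left after an addition equals shifting each operand and then adding, so shifts can be hoisted toward the operands until shifts of the same kind meet and collapse (SL + SL → SL). A small sanity check, ignoring the fixed word widths a hardware datapath would impose:

```python
def add_then_sl(a, b, k):
    """ADD followed by SL: (a + b) << k."""
    return (a + b) << k

def sl_then_add(a, b, k):
    """SL on each operand followed by ADD: the rewritten form."""
    return (a << k) + (b << k)

# The homomorphism holds for arbitrary integers and shift amounts.
for a, b, k in [(3, 5, 2), (-7, 9, 4), (1023, 1, 10)]:
    assert add_then_sl(a, b, k) == sl_then_add(a, b, k)

# Two cascaded shifts collapse into one: SL + SL -> SL.
assert ((6 << 3) << 2) == (6 << 5)
```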
55
Results
  • Graph again, now good for synthesis, with a
    reasonable critical path
  • Speed and area, before and after, compared with a
    standard multiplier from a library

56
High-level Arithmetic Optimisations Example
[Figure: transformation sequence - original, implement multipliers, apply some rules, apply more rules]
57
High-level Arithmetic Optimisations Example
Significant speed-up possible at no hardware cost
58
Outline
  • Three Sample Problems:
  • Automatic Instruction-Set Extensions
    (Atasu, Pozzi, Ienne, DAC 2003, Best Paper Award)
  • Advanced Instruction Selection
    (Peymandoust, Pozzi, Ienne, De Micheli, ASAP 2003)
  • Arithmetic Optimisations / Synthesis
    (ongoing work)
  • Architectural Outlook and Conclusions

59
Billion-Transistor Systems-on-Chip
60
A Reconfigurable HW Platform for MegaWatch
61
A Reconfigurable HW Platform for MegaWatch
Dynamic Reconfiguration
RF Design: low-power software radio
Logic Design: low-power reconfigurable datapath
Processor Architecture: datapath function definition
Computer Science: operating system support
Communications: low-power on-chip communication
network