Title: I
1IC Research Day 2003
Embedded Processors: From Prêt-à-Porter to Tailor-made
- Paolo Ienne
- (common work with Kubilay Atasu and Laura Pozzi)
- Processor Architecture Laboratory (LAP)
- Centre for Advanced Digital Systems (CSDA)
- Swiss Federal Institute of Technology Lausanne
(EPFL)
2Evolution?
From Tailor-Made
3Processor Evolution
- From Tailor-made to Prêt-à-Porter to
Tailor-made!!
- From Tailor-made to Prêt-à-Porter
[Figure: number of different architectures versus time, 1960 to 2010, covering computers, mainframes, microprocessors (uP) and RISC. Labelled architectures: B5000, IBM360, CDC6600, B6500, PDP-11, VAX, 4004, 8080, 6502, Z80, x86, 68k, MIPS, SPARC, PA-RISC, RS600, Alpha, PowerPC]
4Tailor-Made Embedded Processors?
- Binary compatibility is less of an issue for
embedded systems
- Unlike general processors (→ x86 dominance)
- Performance/cost is at a premium, not performance
alone as in PCs (Intel model)
- Also other metrics are now fundamental, such as
energy
- Systems-on-Chip can include any processor on chip
→ no strong need for standard devices
- Application-specific integrated circuits (ASICs)
5Tailoring Instruction Set Extensions
Processor
?
6Instruction Set Extensions
- Collapse a subset of the Directed Acyclic Graph
nodes into a single Functional Unit (AFU)
- Exploit cheaply the parallelism within the basic
block
- Simplify operations with constant operands
- Optimise sequences of instructions (logic,
arithmetic, etc.)
- Exploit limited precision
7Outline
- Problem definition
- Main algorithm
- Speed-up estimation
- More tailoring problems
- Conclusions
8Instruction Set Extensions
- Typical approach (I)
- Find frequent patterns
- Typically rather small
- Might have too many inputs or outputs for
the register file
- Reuse is not a good heuristic for high speedup
- E.g., ChoiJun99, KastnerOct02, ArnoldApr01
- Typical approach (II)
- Grow clusters until I/O violation occurs
- Limits possibilities
- Usually only single output
- E.g., RazdanNov94, Geurts97, AlippiMar99,
KastrupApr99, YeJun00, BaleaniMay02
9Existing Solutions Miss Potential Speed-ups
- Goal: Find subgraphs
- having a user-defined maximum number of inputs
and outputs,
- including disconnected components, and
- that maximize the overall speedup
10Problem Statement
- Gi(V, E) are the graphs representing the DAGs of
the algorithm's basic blocks
- S is a subgraph of G
- M(S) represents the gain achievable by
implementing the subgraph S as a special
instruction
- Problem: Find the Ninstr subgraphs Sj of any Gi
such that the total gain (the sum of M(Sj) over all j) is maximal
- Under the following constraints for each
subgraph Sj
- Number of inputs of Sj ≤ Nin
- Number of outputs of Sj ≤ Nout
- Sj is convex
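A minimal Python sketch of how the input and output counts of a candidate subgraph could be computed. The graph layout (a dictionary mapping each node to its successors), the function names, and the small example block are illustrative assumptions, not taken from the slides. Values entering or leaving the basic block are modelled as explicit nodes, and Nin and Nout are treated as inclusive maxima.

```python
def io_counts(succs, subgraph):
    """Count inputs and outputs of a node set in a DAG (node -> successors)."""
    s = set(subgraph)
    # Inputs: distinct nodes outside S that feed at least one node of S.
    inputs = {u for u, vs in succs.items()
              if u not in s and any(v in s for v in vs)}
    # Outputs: nodes of S whose value is consumed by a node outside S.
    outputs = {u for u in s if any(v not in s for v in succs.get(u, ()))}
    return len(inputs), len(outputs)


def satisfies_io(succs, subgraph, n_in, n_out):
    """Check the I/O constraints, treating n_in and n_out as inclusive maxima."""
    ins, outs = io_counts(succs, subgraph)
    return ins <= n_in and outs <= n_out


# Hypothetical basic block: out = ((a * b) + c) >> 1
succs = {
    "a": ["mul"], "b": ["mul"], "c": ["add"],
    "mul": ["add"], "add": ["shr"], "shr": ["out"], "out": [],
}
print(io_counts(succs, {"mul", "add"}))           # (3, 1)
print(satisfies_io(succs, {"mul", "add"}, 2, 1))  # False: three inputs needed
```

Collapsing `mul` and `add` therefore needs three register-file read ports and one write port in this toy case.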
11Convexity Constraint
A nonconvex subgraph cannot be implemented in a
special instruction (or it imposes overly complex
constraints on the compiler's scheduler)
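One way to test convexity, sketched in Python under the usual definition (assumed here, since the slide does not spell it out): a subgraph S is convex if no path between two nodes of S passes through a node outside S. The graph layout and names are hypothetical.

```python
def is_convex(succs, subgraph):
    """Return True if no path leaves the subgraph and later re-enters it."""
    s = set(subgraph)
    # Start from every outside node that directly consumes a value of S.
    stack = [v for u in s for v in succs.get(u, ()) if v not in s]
    seen = set()
    while stack:
        w = stack.pop()
        if w in seen:
            continue
        seen.add(w)
        for v in succs.get(w, ()):
            if v in s:
                return False  # path S -> outside -> S found
            stack.append(v)
    return True


# Hypothetical diamond-shaped DAG: a feeds b and c, both feed d.
diamond = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(is_convex(diamond, {"a", "b"}))       # True
print(is_convex(diamond, {"a", "b", "d"}))  # False: a -> c -> d leaves and re-enters
```

In the second case the special instruction would need the result of `c` before it can finish, while `c` needs the instruction's own intermediate value `a`, so the two cannot be ordered.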
13Identification Algorithms
1 Single Subgraph Single Basic Block
2 Multiple Subgraphs (e.g., 2) Single Basic Block
3/4 Multiple Subgraphs (e.g., 3) Multiple Basic
Blocks
14Single Subgraph within a Single Basic Block
- A graph with N nodes has 2^N subgraphs
- Potential solutions are in fact rather sparse
- How to avoid unnecessarily exploring the whole
design space?
15Search Space Pruning
16Search Space Pruning
Nodes are numbered based on a topological
sort (→ source nodes follow destination
nodes)
Example: Nout = 1
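The pruning idea can be sketched as a binary include/exclude search in Python. This is an illustrative reconstruction, not the authors' code: the helper names, the example graph, and the inclusive treatment of the bounds are assumptions. Because every node is decided after all of its successors, an output-count or convexity violation can never be repaired by adding later nodes, so the whole subtree is skipped. The input count is only checked, not used for pruning, since adding producers can reduce it.

```python
def count_inputs(succs, s):
    return len({u for u in succs if u not in s and any(v in s for v in succs[u])})

def count_outputs(succs, s):
    return sum(1 for u in s if any(v not in s for v in succs[u]))

def is_convex(succs, s):
    stack = [v for u in s for v in succs[u] if v not in s]
    seen = set()
    while stack:
        w = stack.pop()
        if w in seen:
            continue
        seen.add(w)
        for v in succs[w]:
            if v in s:
                return False
            stack.append(v)
    return True

def find_subgraphs(succs, order, n_in, n_out):
    """order lists the candidate nodes so that each comes after its successors."""
    feasible, considered = [], 0

    def recurse(i, chosen):
        nonlocal considered
        if i == len(order):
            return
        grown = chosen | {order[i]}          # branch 1: include node i
        considered += 1
        if count_outputs(succs, grown) <= n_out and is_convex(succs, grown):
            if count_inputs(succs, grown) <= n_in:
                feasible.append(grown)
            recurse(i + 1, grown)
        # else: violation is permanent, prune every superset below this point
        recurse(i + 1, chosen)               # branch 0: exclude node i

    recurse(0, frozenset())
    return feasible, considered

# Hypothetical block: n3 = x0 + x1; n2 = n3 * x2; n1 = n3 << 1; n0 = n1 + n2
succs = {"x0": ["n3"], "x1": ["n3"], "x2": ["n2"], "n3": ["n1", "n2"],
         "n2": ["n0"], "n1": ["n0"], "n0": ["out"], "out": []}
order = ["n0", "n1", "n2", "n3"]             # destinations first
feasible, considered = find_subgraphs(succs, order, n_in=2, n_out=1)
print(len(feasible), "feasible subgraphs;", considered, "subsets considered")
```

On a graph this small the saving is marginal; the benefit of pruning grows with the number of nodes, since each cut removes an entire subtree of supersets.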
17Algorithm Performance
[Figure: number of subgraphs considered versus number of graph nodes (N), using an output
port constraint of two]
20Estimated Speedups
[Figure: estimated speedup for different input/output constraints]
- The speedup grows with the relaxation of
constraints
- The heuristic algorithm (Iterative)
for multiple subgraphs is practically as good as
the Optimal algorithm
21Identification Results
- Only Iterative algorithm run due to complexity
23What's Next?
- Automatic Instruction-Set Extensions
- Atasu, Pozzi, Ienne DAC 2003 (Best Paper Award)
- Arithmetic Optimisations
- Ongoing work
- Symbolic Algebra for Instruction Selection
- Peymandoust, Pozzi, Ienne, De Micheli ASAP 2003
25Instruction Selection Based on Symbolic Algebra
- Symbolic algebra instruction-selection engine
used to add single-output functional units to
Xtensa processors
- Branch-and-Bound algorithm to select polynomial
candidates and Maple V as the symbolic
simplification engine (→ Gröbner bases)
- 18% to 77%, with an average of 41%, execution-time
improvement reported by a cycle-accurate
instruction-set simulator
26Area Cost
- Added a total of ten different instructions
- Complexity ranges from 2 to 20 operations per
instruction
- Small area penalty (max 20%)
28Example of High-level Arithmetic Optimisations
Significant speed-up possible at no hardware cost
29Conclusions
Application Algorithm
Manual Partitioning
SW Systems-on-Chip HW
30Acknowledgements
- LAP
- Laura Pozzi
- Kubilay Atasu, Miljan Vuletic
- Ajay Kumar Verma
- Other Groups
- Armita Peymandoust, Giovanni De Micheli, Stanford Univ.
- Rainer Leupers, RWTH Aachen
32New Design Methodologies!
System Synthesis
Instruction Set Extensions / Coprocessor Synthesis
Symbolic Mapping
Retargetable Compilers
High-Level Synthesis
Arithmetic Optimisation
Logic Synthesis
Physical Design
33Outline
- Automatic Instruction-Set Extensions
- Atasu, Pozzi, Ienne DAC 2003 (Best Paper Award)
- Symbolic Algebra for Instruction Selection
- Peymandoust, Pozzi, Ienne, De Micheli ASAP 2003
35Instruction Selection
- Given a code segment
- And having identified special instructions
- Find out all places where the instruction can be
used in the code
- Maximise usage and hence advantage
- Increase robustness to changes in the application
Y1 = a*R1 + b*G1 + c*B1 + d
Y2 = a*R2 + b*G2 + c*B2 + d
Y = Y2 + q*(Y1 - Y2)
36Instruction Selection
Best?!
37Simple Algebraic Properties
- Neutral elements
- Associativity, distributivity, etc.
c = c * 1
(a + b) * c = a*c + b*c
38Simple Algebraic Properties
39More Use of Algebra
Y1 = a*R1 + b*G1 + c*B1 + d
Y2 = a*R2 + b*G2 + c*B2 + d
Y = Y2 + q*(Y1 - Y2)
40Rewriting the Polynomial to Expose foo
41Gröbner Bases Instruction Selection
- Polynomial representation S of the dataflow
- Polynomial representation of the special
instructions L
- To implement S with elements of L
- Simplify S using elements of L
- Can be done with Buchberger's algorithm and
Gröbner bases
- Available in Maple (simplify), Mathematica
(AlgebraicRules),
- Reduce S iteratively. Can S be reduced
completely?
- If yes, S can be implemented using only elements
of L
- If not, map the residual polynomial to adders and
multipliers
- Find minimal cost reduction (e.g., minimal number
of instructions)
42Polynomials Representing the Instructions
- For each instruction available
- Many polynomial candidates to simplify S
- Large search space
- Which instructions are used to simplify?
- Which polynomials are generated with such
instructions?
t1 = a*b + R2*G2
t2 = a*R2 + b*G2
t3 = a*b + c*d
43Choice of the Polynomials
- Example using Maple V commands and syntax
- Y := a*q*R1 + a*p*R2 + b*q*G1 + b*p*G2 + c*q*B1 + c*p*B2 + d
- siderels := {s1 = q*R1 + p*R2, s2 = q*G1 + p*G2,
s3 = q*B1 + p*B2}
- Z := simplify(Y, siderels, [R1,R2,G1,G2,B1,B2])
- Z = a*s1 + b*s2 + c*s3 + d
Side Relations in Maple; Algebraic Transformation
Rules in Mathematica
Y = foo(foo(p,R2,q,R1),a,foo(p,G2,q,G1),b)
+ foo(foo(p,B2,q,B1),c,d,1)
44Choice of the Polynomials
- Example using Maple V commands and syntax
- Y := a*q*R1 + a*p*R2 + b*q*G1 + b*p*G2 + c*q*B1 + c*p*B2 + d
- siderels := {s1 = q*R1 + p*R2, s2 = q*G1 + p*G2,
s3 = q*B1 + p*B2}
- Z := simplify(Y, siderels, [R1,R2,G1,G2,B1,B2])
- Z = a*s1 + b*s2 + c*s3 + d
- siderel2 := {s4 = s1*a + s2*b, s5 = c*s3 + d}
- simplify(Z, siderel2, [s1, s2, s3])
- s4 + s5
Y = foo(foo(p,R2,q,R1),a,foo(p,G2,q,G1),b)
+ foo(foo(p,B2,q,B1),c,d,1)
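The final mapping can be sanity-checked numerically. The Python sketch below rests on three assumptions that the slide text does not state explicitly: the special instruction computes foo(x, y, z, w) = x*y + z*w, the two foo expressions are added, and p stands for 1 - q. The reference expression is read as Y1 = a*R1 + b*G1 + c*B1 + d, Y2 = a*R2 + b*G2 + c*B2 + d, Y = Y2 + q*(Y1 - Y2). Exact rational arithmetic avoids rounding noise.

```python
from fractions import Fraction
import random


def foo(x, y, z, w):
    # Assumed semantics of the special instruction: sum of two products.
    return x * y + z * w


def reference(a, b, c, d, q, R1, G1, B1, R2, G2, B2):
    # Straightforward evaluation of the original code segment.
    Y1 = a * R1 + b * G1 + c * B1 + d
    Y2 = a * R2 + b * G2 + c * B2 + d
    return Y2 + q * (Y1 - Y2)


def mapped(a, b, c, d, q, R1, G1, B1, R2, G2, B2):
    # Same value expressed only through calls to foo.
    p = 1 - q  # assumption: p abbreviates 1 - q
    s4 = foo(foo(p, R2, q, R1), a, foo(p, G2, q, G1), b)
    s5 = foo(foo(p, B2, q, B1), c, d, 1)
    return s4 + s5


def agree(trials=100, seed=0):
    """Compare both forms on random rational inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        args = [Fraction(rng.randint(-50, 50), rng.randint(1, 9))
                for _ in range(11)]
        if reference(*args) != mapped(*args):
            return False
    return True


print(agree())  # True
```

Under these assumptions the whole segment needs five invocations of foo, with the constant 1 acting as the neutral element in the last one.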
45Optimal Instruction Selection
- Build decomposition tree from left to right
- Li: polynomials generated from instructions
- Base cost: implementing basic block with adders
and multipliers
- Branch-and-Bound algorithm
- Continue until the result of simplify is a
library element
- Bound the tree depth every time a better solution
is found
46Overall View
Polynomial Representation of Critical Code
Polynomial Rep. of Library Elements
47Symbolic Algebra Results (Speedup)
- Symbolic algebra instruction-selection engine
used to add single-output functional units to
Xtensa processors
- 18% to 77%, with an average of 41%, execution-time
improvement reported by a cycle-accurate
instruction-set simulator
49Arithmetic Optimisations
- Arithmetic operations often appear in sequence
- Direct implementation not efficient
- Exploit effectiveness of carry-save adders,
column compressors, etc.
- Can bring large advantages, often without
additional hardware cost
- Typical example: MAC
- Marginally slower than corresponding MUL
- Practically same complexity
50Example of Problem
- Code of multiplication
- Graph
51Carry-Save Representation
- Don't do successive sums (each T = O(f(N))) but use
carry-save (CS) representation wherever possible (each T = O(1))
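A small Python model of the carry-save idea, given as an illustration rather than as the design used in the talk. Each 3:2 compressor step works bit by bit with no carry propagation, so its delay does not depend on the word length. Only the final conversion back to an ordinary number needs one carry-propagate addition. The function names are hypothetical and the operands are taken to be non-negative integers.

```python
def carry_save(a, b, c):
    """3:2 compressor: return (s, cy) such that a + b + c == s + cy."""
    s = a ^ b ^ c                            # per-bit sum, no carry chain
    cy = ((a & b) | (a & c) | (b & c)) << 1  # per-bit majority, shifted left
    return s, cy


def sum_many(values):
    """Accumulate in carry-save form, then do one carry-propagate addition."""
    s, cy = 0, 0
    for v in values:
        s, cy = carry_save(s, cy, v)
    return s + cy  # the only full addition


print(carry_save(5, 6, 7))      # (4, 14), and 4 + 14 == 18
print(sum_many([3, 5, 7, 11]))  # 26
```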
52Example of Problem
- Graph again with critical path
- Horror, disgrace!
53Transformations (I)
- Expand/explicit multiplications, subtractions,
- MUL → PP COMP ADD
- SUB → NOT ADD
- Use CS representation
- ADD ADD → COMP
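The two expansion rules can be illustrated in Python, reading them as "a multiplication becomes partial-product generation, compression and one final addition" and "a subtraction becomes a bitwise NOT followed by an addition". The 16-bit word width, the 8-bit operand width, and all function names are assumptions made for this sketch.

```python
MASK = (1 << 16) - 1  # hypothetical 16-bit datapath


def sub_via_not_add(a, b):
    """SUB -> NOT ADD: a - b equals a + NOT(b) + 1 in two's complement."""
    return (a + (~b & MASK) + 1) & MASK


def partial_products(a, b, bits=8):
    """PP: one shifted copy of a for every set bit of b."""
    return [(a << i) if (b >> i) & 1 else 0 for i in range(bits)]


def compress(rows):
    """COMP: reduce many rows to two using 3:2 compressors."""
    rows = list(rows)
    while len(rows) > 2:
        a, b, c = rows.pop(), rows.pop(), rows.pop()
        rows.append(a ^ b ^ c)
        rows.append(((a & b) | (a & c) | (b & c)) << 1)
    return rows


def mul(a, b, bits=8):
    """MUL -> PP COMP ADD, for unsigned operands below 2**bits."""
    return sum(compress(partial_products(a, b, bits)))  # final ADD


print(sub_via_not_add(10, 3))  # 7
print(mul(13, 11))             # 143
```

Once the multiplier is opened up like this, its compressor stage sits next to the compressors obtained from neighbouring additions, which is what makes merging them possible.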
54Transformations (II)
- Help clustering operations of the same kind
- Homomorphism: ADD SL → SL/SL ADD
- Pseudo-homomorphism: COMP NEG → NEG/NEG COMP
(with some corrections K-1)
- Then they can be collapsed
- COMP COMP → COMP
- SL SL → SL
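These rewrite rules correspond to simple integer identities, checked below in Python. The interpretation is an assumption: SL is taken as a left shift, COMP as a multi-operand sum, and NEG as a bitwise complement. Under that reading, complementing a sum of K operands equals the sum of the complemented operands plus K-1, which is one possible meaning of the correction term on the slide.

```python
def check_rules(xs, j, k):
    """Return the truth value of each identity for operands xs and shifts j, k."""
    a, b = xs[0], xs[1]
    K = len(xs)
    return {
        # ADD SL -> SL/SL ADD: shifting a sum equals summing the shifted operands.
        "add_sl": (a + b) << k == (a << k) + (b << k),
        # SL SL -> SL: two successive shifts collapse into one.
        "sl_sl": (a << j) << k == a << (j + k),
        # COMP NEG -> NEG/NEG COMP, with a constant correction of K - 1.
        "comp_neg": ~sum(xs) == sum(~x for x in xs) + (K - 1),
        # COMP COMP -> COMP: a two-stage sum equals a single wide sum.
        "comp_comp": sum([sum(xs[:2]), *xs[2:]]) == sum(xs),
    }


print(check_rules([13, 7, 42], j=2, k=3))
```

Moving shifts and complements towards the inputs in this way leaves the additions adjacent to each other, so they can be merged into a single compressor.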
55Results
- Graph again but good for synthesis, CP reasonable
- Show speed and area, before and after and
standard multiplier from a library
56High-level Arithmetic Optimisations Example
original
implement multipliers
apply some rules
apply more rules
57High-level Arithmetic Optimisations Example
Significant speed-up possible at no hardware cost
58Outline
- Three Sample Problems
- Automatic Instruction-Set Extensions
- Atasu, Pozzi, Ienne DAC 2003 (Best Paper Award)
- Advanced Instruction Selection
- Peymandoust, Pozzi, Ienne, De Micheli ASAP 2003
- Arithmetic Optimisations / Synthesis
- Ongoing work
- Architectural Outlook and Conclusions
59Billion-Transistor Systems-on-Chip
60A Reconfigurable HW Platform for MegaWatch
RF
61A Reconfigurable HW Platform for MegaWatch
Dynamic Reconfiguration
- RF Design: Low-power software radio
- Logic Design: Low-power reconfigurable datapath
- Processor Architecture: Datapath function definition
- Computer Science: Operating system support
- Communications: Low power on-chip communication network