1
IC Research Day 2003
Embedded Processors: From Prêt-à-Porter to Tailor-made

Paolo Ienne
(common work with Kubilay Atasu and Laura Pozzi)
Processor Architecture Laboratory (LAP)
Centre for Advanced Digital Systems (CSDA)
Swiss Federal Institute of Technology Lausanne
(EPFL)

2
Evolution?
From Tailor-Made
3
Processor Evolution

From Tailor-made to Prêt-à-Porter to
Tailor-made!!

From Tailor-made to Prêt-à-Porter

[Figure: number of different architectures over time, 1960 to 2010. Labels: Mainframes, Computers, uP, RISC. Architectures shown: IBM360, B5000, B6500, CDC6600, PDP-11, VAX, 4004, 8080, Z80, 6502, 68k, x86, MIPS, SPARC, PA-RISC, RS600, Alpha, PowerPC]
4
Tailor-Made Embedded Processors?

Binary compatibility is less of an issue for
embedded systems
Unlike general-purpose processors (→ x86 dominance)
Performance/cost is at a premium, not performance
alone as in PCs (Intel model)
Other metrics, such as energy, are now fundamental
too
Systems-on-Chip can include any processor on chip
→ no strong need for standard devices
Application-specific integrated circuits (ASICs)

5
Tailoring Instruction Set Extensions
Processor
?
6
Instruction Set Extensions

Collapse a subset of the Directed Acyclic Graph
nodes into a single Functional Unit (AFU)
Cheaply exploit the parallelism within the basic
block
Simplify operations with constant operands
Optimise sequences of instructions (logic,
arithmetic, etc.)
Exploit limited precision

7
Outline

Problem definition
Main algorithm
Speed-up estimation
More tailoring problems
Conclusions

8
Instruction Set Extensions

Typical approach (I)
Find frequent patterns
Typically rather small
Might have too many inputs or outputs for
register file
Reuse is not a good heuristic for high speedup
E.g., ChoiJun99, KastnerOct02, ArnoldApr01
Typical approach (II)
Grow clusters until I/O violation occurs
Limits possibilities
Usually only single output
E.g., RazdanNov94, Geurts97, AlippiMar99,
KastrupApr99, YeJun00, BaleaniMay02

9
Existing Solutions Miss Potential Speed-ups

Goal: find subgraphs
having a user-defined maximum number of inputs
and outputs,
including disconnected components, and
that maximize the overall speedup

10
Problem Statement

G_i(V, E) are the graphs representing the DAGs of
the algorithm's basic blocks
S is a subgraph of G
M(S) represents the gain achievable by
implementing the subgraph S as a special
instruction
Problem: find the N_instr subgraphs S_j of any G_i
such that the sum of the gains M(S_j) over all j is maximal
Under the following constraints for each
subgraph S_j:
Number of inputs of S_j ≤ N_in
Number of outputs of S_j ≤ N_out
S_j is convex
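
The input and output constraints can be made concrete with a small sketch. The toy basic block, its node names (`mul`, `add`, `shr`, `sub`) and the helper functions are hypothetical and not taken from the talk. A node is assumed to count as an output when its value is read outside the subgraph or is still needed after the block.

```python
# Hypothetical basic block: each node maps to the nodes it reads.
# "a", "b", "c" are block inputs; "sub" is the only value used after the block.
PREDS = {
    "a": [], "b": [], "c": [],
    "mul": ["a", "b"],
    "add": ["mul", "c"],
    "shr": ["mul"],
    "sub": ["add", "shr"],
}
LIVE_OUT = {"sub"}


def successors(preds):
    """Invert the operand lists: node -> set of nodes that read it."""
    succs = {v: set() for v in preds}
    for v, operands in preds.items():
        for u in operands:
            succs[u].add(v)
    return succs


def subgraph_inputs(preds, s):
    """Values read by the subgraph s but produced outside it."""
    return {u for v in s for u in preds[v] if u not in s}


def subgraph_outputs(preds, live_out, s):
    """Nodes of s whose value is needed outside s."""
    succs = successors(preds)
    return {v for v in s if v in live_out or succs[v] - s}


def satisfies_io(preds, live_out, s, n_in, n_out):
    """Check the port constraints (at most n_in inputs, n_out outputs)."""
    return (len(subgraph_inputs(preds, s)) <= n_in
            and len(subgraph_outputs(preds, live_out, s)) <= n_out)
```

In this toy block the candidate `{"mul", "add"}` needs two output ports because `shr` still reads `mul`, whereas the whole block collapses into a three-input, one-output instruction.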

11
Convexity Constraint
A nonconvex subgraph cannot be implemented in a
special instruction (or it imposes overly complex
constraints on the compiler's scheduler)
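
A minimal sketch of a convexity test, assuming the usual definition: a subgraph is convex when no path between two of its nodes passes through a node outside it. The toy graph and function names are hypothetical.

```python
# Hypothetical basic block: each node maps to the nodes it reads.
PREDS = {
    "a": [], "b": [], "c": [],
    "mul": ["a", "b"],
    "add": ["mul", "c"],
    "shr": ["mul"],
    "sub": ["add", "shr"],
}


def reachable(adj, start):
    """Nodes reachable from any node in start by following adj at least once."""
    seen, stack = set(), list(start)
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def is_convex(preds, s):
    """True if no outside node lies on a path between two nodes of s."""
    succs = {v: set() for v in preds}
    for v, operands in preds.items():
        for u in operands:
            succs[u].add(v)
    below = reachable(succs, s)  # direct and indirect consumers of s
    above = reachable(preds, s)  # direct and indirect producers of s
    # An outside node that both consumes from s and feeds s breaks convexity.
    return not ((below & above) - set(s))
```

Here `{"mul", "sub"}` is nonconvex because `add` and `shr` sit between the two nodes, while the disconnected pair `{"add", "shr"}` is convex.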
12
Outline

Problem definition
Main algorithm
Speed-up estimation
More tailoring problems
Conclusions

13
Identification Algorithms
1: Single Subgraph, Single Basic Block
2: Multiple Subgraphs (e.g., 2), Single Basic Block
3/4: Multiple Subgraphs (e.g., 3), Multiple Basic
Blocks
14
Single Subgraph within a Single Basic Block

A graph with N nodes has 2^N subgraphs
Potential solutions are in fact rather sparse
How to avoid unnecessarily exploring the whole
design space?

15
Search Space Pruning
16
Search Space Pruning
Nodes are numbered based on a topological
sort (→ source nodes follow destination
nodes)
Example: N_out = 1
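
The pruning idea can be sketched as a binary include/exclude search over nodes visited consumers first. This is an illustration under assumptions, not the authors' implementation: the toy graph and all names are hypothetical. Because every consumer of a node is decided before the node itself, an output or convexity violation can never be repaired by adding later nodes, so the whole subtree is cut. Inputs are only checked when recording a candidate, since adding producers can still reduce them.

```python
# Hypothetical basic block: each node maps to the nodes it reads.
PREDS = {
    "a": [], "b": [], "c": [],
    "mul": ["a", "b"], "add": ["mul", "c"],
    "shr": ["mul"], "sub": ["add", "shr"],
}
LIVE_OUT = {"sub"}
OPS = ["mul", "add", "shr", "sub"]  # candidate nodes; block inputs excluded


def consumers_first(preds, ops):
    """Order the candidate nodes so that each one follows all its consumers."""
    order, seen = [], set()

    def visit(v):
        if v in seen:
            return
        seen.add(v)
        for u in preds[v]:
            visit(u)
        order.append(v)  # appended only after all of its producers

    for v in ops:
        visit(v)
    return [v for v in reversed(order) if v in ops]


def reachable(adj, start):
    seen, stack = set(), list(start)
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def enumerate_subgraphs(preds, live_out, ops, n_in, n_out):
    succs = {v: set() for v in preds}
    for v, operands in preds.items():
        for u in operands:
            succs[u].add(v)
    order = consumers_first(preds, ops)
    found, visited = [], 0

    def n_outputs(s):
        return sum(1 for v in s if v in live_out or succs[v] - s)

    def n_inputs(s):
        return len({u for v in s for u in preds[v] if u not in s})

    def convex(s):
        return not ((reachable(succs, s) & reachable(preds, s)) - s)

    def descend(i, s):
        nonlocal visited
        if i == len(order):
            return
        with_v = s | {order[i]}
        visited += 1
        # Output and convexity violations are permanent: prune the subtree.
        if n_outputs(with_v) <= n_out and convex(with_v):
            if n_inputs(with_v) <= n_in:
                found.append(with_v)
            descend(i + 1, with_v)
        descend(i + 1, s)  # the branch that leaves order[i] out

    descend(0, frozenset())
    return found, visited


FOUND, VISITED = enumerate_subgraphs(PREDS, LIVE_OUT, OPS, n_in=3, n_out=1)
```

On a graph this small the pruning saves very little; the benefit is expected to show only as the number of nodes grows.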
17
Algorithm Performance
[Figure: number of subgraphs considered versus number of graph nodes (N), using an output port constraint of two]
18
Identification Algorithms
1: Single Subgraph, Single Basic Block
2: Multiple Subgraphs (e.g., 2), Single Basic Block
3/4: Multiple Subgraphs (e.g., 3), Multiple Basic
Blocks
19
Outline

Problem definition
Main algorithm
Speed-up estimation
More tailoring problems
Conclusions

20
Estimated Speedups
Input/Output constraints
The speedup grows with the relaxation of
constraints
The heuristic algorithm (Iterative) for multiple
subgraphs is practically as good as the Optimal
algorithm
21
Identification Results

Only Iterative algorithm run due to complexity

22
Outline

Problem definition
Main algorithm
Speed-up estimation
More tailoring problems
Conclusions

23
What's Next?
Automatic Instruction-Set Extensions: Atasu,
Pozzi, Ienne, DAC 2003 (Best Paper Award)
Arithmetic Optimisations: Ongoing work
Symbolic Algebra for Instruction Selection:
Peymandoust, Pozzi, Ienne, De Micheli, ASAP 2003
25
Instruction Selection Based on Symbolic Algebra

Symbolic algebra instruction-selection engine
used to add single-output functional units to
Xtensa processors
Branch-and-bound algorithm to select polynomial
candidates and Maple V as the symbolic
simplification engine (→ Gröbner bases)
18% to 77% execution-time improvement (41% on
average) reported by a cycle-accurate
instruction-set simulator

26
Area Cost

Added a total of ten different instructions
Complexity ranges from 2 to 20 operations per
instruction
Small area penalty (max 20%)

27
Outline
Automatic Instruction-Set Extensions: Atasu,
Pozzi, Ienne, DAC 2003 (Best Paper Award)
Arithmetic Optimisations: Ongoing work
Symbolic Algebra for Instruction Selection:
Peymandoust, Pozzi, Ienne, De Micheli, ASAP 2003
28
Example of High-level Arithmetic Optimisations
Significant speed-up possible at no hardware cost
29
Conclusions
Application Algorithm
Manual Partitioning
SW Systems-on-Chip HW
30
Acknowledgements

LAP
Laura Pozzi
Kubilay Atasu, Miljan Vuletic
Ajay Kumar Verma
Other Groups
Armita Peymandoust, Giovanni De Micheli
Stanford Univ.
Rainer Leupers RWTH Aachen

32
New Design Methodologies!
System Synthesis
Instruction Set Extensions / Coprocessor Synthesis
Symbolic Mapping
Retargetable Compilers
High-Level Synthesis
Arithmetic Optimisation
Logic Synthesis
Physical Design
33
Outline
Automatic Instruction-Set Extensions: Atasu,
Pozzi, Ienne, DAC 2003 (Best Paper Award)
Symbolic Algebra for Instruction Selection:
Peymandoust, Pozzi, Ienne, De Micheli, ASAP 2003
35
Instruction Selection

Given a code segment
And having identified special instructions
Find out all places where the instruction can be
used in the code
Maximise usage and hence advantage
Increase robustness to changes in the application

Y1 = a*R1 + b*G1 + c*B1 + d
Y2 = a*R2 + b*G2 + c*B2 + d
Y = Y2 + q*(Y1 - Y2)
36
Instruction Selection
Best?!
37
Simple Algebraic Properties

Neutral elements
Associativity, distributivity, etc.

c = c * 1
(a + b) * c = a * c + b * c
38
Simple Algebraic Properties
39
More Use of Algebra

Given a code segment

Y1 = a*R1 + b*G1 + c*B1 + d
Y2 = a*R2 + b*G2 + c*B2 + d
Y = Y2 + q*(Y1 - Y2)
40
Rewriting the Polynomial to Expose foo
41
Gröbner Bases Instruction Selection

Polynomial representation S of the dataflow
Polynomial representation of the special
instructions L
To implement S with elements of L
Simplify S using elements of L
Can be done with Buchberger's algorithm and
Gröbner bases
Available in Maple (simplify), Mathematica
(AlgebraicRules), ...
Reduce S iteratively. Can S be reduced
completely?
If yes, S can be implemented using only elements
of L
If not, map the residual polynomial to adders and
multipliers
Find minimal cost reduction (e.g., minimal number
of instructions)

42
Polynomials Representing the Instructions

For each instruction available
Many polynomial candidates to simplify S
Large search space
Which instructions should be used to simplify?
Which polynomials are generated with such
instructions?

t1 = a*b + R2*G2
t2 = a*R2 + b*G2
t3 = a*b + c*d
43
Choice of the Polynomials

Example using Maple V commands and syntax
Y := a*q*R1 + a*p*R2 + b*q*G1 + b*p*G2 + c*q*B1 + c*p*B2 + d;
siderels := {s1 = q*R1 + p*R2, s2 = q*G1 + p*G2,
s3 = q*B1 + p*B2};
Z := simplify(Y, siderels, [R1,R2,G1,G2,B1,B2]);
Z = a*s1 + b*s2 + c*s3 + d

Side Relations in Maple correspond to Algebraic Transformation
Rules in Mathematica
Y = foo(foo(p,R2,q,R1),a,foo(p,G2,q,G1),b)
+ foo(foo(p,B2,q,B1),c,d,1)
44
Choice of the Polynomials

Example using Maple V commands and syntax
Y := a*q*R1 + a*p*R2 + b*q*G1 + b*p*G2 + c*q*B1 + c*p*B2 + d;
siderels := {s1 = q*R1 + p*R2, s2 = q*G1 + p*G2,
s3 = q*B1 + p*B2};
Z := simplify(Y, siderels, [R1,R2,G1,G2,B1,B2]);
Z = a*s1 + b*s2 + c*s3 + d
siderel2 := {s4 = s1*a + s2*b, s5 = c*s3 + d};
simplify(Z, siderel2, [s1, s2, s3]);
s4 + s5

Y = foo(foo(p,R2,q,R1),a,foo(p,G2,q,G1),b)
+ foo(foo(p,B2,q,B1),c,d,1)
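
The decomposition can be checked numerically. The sketch below assumes that the special instruction computes `foo(w, x, y, z) = w*x + y*z`; this definition is inferred from the side relations (s1 = q*R1 + p*R2, s4 = s1*a + s2*b, s5 = c*s3 + d) and is not stated explicitly on the slides. It also assumes `p = 1 - q`, which is what makes the expanded polynomial agree with `Y = Y2 + q*(Y1 - Y2)`.

```python
def foo(w, x, y, z):
    """Assumed special instruction (inferred, hypothetical): w*x + y*z."""
    return w * x + y * z


def y_expanded(a, b, c, d, p, q, R1, R2, G1, G2, B1, B2):
    """The polynomial Y as given to the simplifier."""
    return (a * q * R1 + a * p * R2 + b * q * G1 + b * p * G2
            + c * q * B1 + c * p * B2 + d)


def y_with_foo(a, b, c, d, p, q, R1, R2, G1, G2, B1, B2):
    """Y mapped onto five uses of foo plus one addition."""
    s4 = foo(foo(p, R2, q, R1), a, foo(p, G2, q, G1), b)
    s5 = foo(foo(p, B2, q, B1), c, d, 1)  # uses the neutral element: d = d * 1
    return s4 + s5


def y_source(a, b, c, d, q, R1, R2, G1, G2, B1, B2):
    """Y as written in the original code segment."""
    y1 = a * R1 + b * G1 + c * B1 + d
    y2 = a * R2 + b * G2 + c * B2 + d
    return y2 + q * (y1 - y2)
```

The three functions agree for any integer operands, for example `y_with_foo(2, 3, 5, 7, -3, 4, 10, 20, 30, 40, 50, 60)` against `y_source(2, 3, 5, 7, 4, 10, 20, 30, 40, 50, 60)`.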
45
Optimal Instruction Selection

Build decomposition tree from left to right
L_i: polynomials generated from instructions
Base cost: implementing the basic block with adders
and multipliers
Branch-and-bound algorithm
Continue until the result of simplify is a
library element
Bound the tree depth every time a better solution
is found

46
Overall View
Polynomial Representation of Critical Code
Polynomial Rep. of Library Elements
47
Symbolic Algebra Results (Speedup)

Symbolic algebra instruction-selection engine
used to add single-output functional units to
Xtensa processors
18% to 77% execution-time improvement (41% on
average) reported by a cycle-accurate
instruction-set simulator

48
Outline
Automatic Instruction-Set Extensions: Atasu,
Pozzi, Ienne, DAC 2003 (Best Paper Award)
Arithmetic Optimisations: Ongoing work
Symbolic Algebra for Instruction Selection:
Peymandoust, Pozzi, Ienne, De Micheli, ASAP 2003
49
Arithmetic Optimisations

Arithmetic operations often appear in sequence
Direct implementation not efficient
Exploit effectiveness of carry-save adders,
column compressors, etc.
Can bring large advantages, often without
additional hardware cost
Typical example: MAC
Marginally slower than corresponding MUL
Practically same complexity

50
Example of Problem

Code of multiplication
Graph

51
Carry-Save Representation

Don't do successive sums (each T = O(f(N))) but use
CS representation wherever possible (each T = O(1))
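
A minimal software model of the idea, using Python integers as bit vectors. The function names are hypothetical. Each carry-save step is a row of independent full adders, so its delay does not depend on the word length; only the final addition propagates carries.

```python
def carry_save_add(x, y, z):
    """3:2 compressor on non-negative ints: returns (s, c) with s + c == x + y + z."""
    s = x ^ y ^ z                             # per-bit sum, no carry propagation
    c = ((x & y) | (x & z) | (y & z)) << 1    # per-bit majority, moved one column left
    return s, c


def sum_carry_save(values):
    """Accumulate in carry-save form and do one carry-propagate add at the end."""
    s, c = 0, 0
    for v in values:
        s, c = carry_save_add(s, c, v)
    return s + c  # the only addition that ripples carries
```

For instance, `sum_carry_save([3, 5, 7, 11])` performs four constant-depth compression steps followed by a single conventional addition.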

52
Example of Problem

Graph again with critical path
Horror, disgrace!

53
Transformations (I)

Expand/make explicit multiplications, subtractions, ...
MUL → PP COMP ADD
SUB → NOT ADD
Use CS representation
ADD ADD → COMP
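
These rewrites can be modelled in a few lines. The sketch assumes that PP means partial-product generation, COMP a carry-save compressor, and that SUB is rewritten in two's complement as NOT followed by an addition with carry-in 1; the word width and all function names are hypothetical.

```python
WIDTH = 16                 # hypothetical word length
MASK = (1 << WIDTH) - 1


def partial_products(a, b):
    """PP: one shifted copy of a for every set bit of b (non-negative ints)."""
    return [a << i for i in range(b.bit_length()) if (b >> i) & 1]


def compress(terms):
    """COMP: apply 3:2 carry-save compressors until at most two terms remain."""
    terms = list(terms)
    while len(terms) > 2:
        x, y, z = terms.pop(), terms.pop(), terms.pop()
        terms.append(x ^ y ^ z)
        terms.append(((x & y) | (x & z) | (y & z)) << 1)
    return terms


def mul(a, b):
    """MUL -> PP, COMP, ADD (one final carry-propagate addition)."""
    return sum(compress(partial_products(a, b)))


def mac(a, b, acc):
    """Multiply-accumulate: the addend is just one more row for the compressor."""
    return sum(compress(partial_products(a, b) + [acc]))


def sub(a, b):
    """SUB -> NOT, ADD: a - b == a + NOT(b) + 1 modulo 2**WIDTH."""
    return (a + (~b & MASK) + 1) & MASK
```

The `mac` function shows why a multiply-accumulate can cost little more than the multiplication alone: the extra operand joins the compression tree instead of requiring a second carry-propagate adder.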

54
Transformations (II)

Help clustering operations of the same kind
Homomorphism: ADD SL → SL/SL ADD
Pseudo-homomorphism: COMP NEG → NEG/NEG COMP
(with some corrections K-1)
Then they can be collapsed
COMP COMP → COMP
SL SL → SL
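
The identities behind these rules can be checked on fixed-width integers. The sketch assumes SL is a left shift and reads NEG as bitwise complement, which is one interpretation consistent with the K-1 correction (complementing K operands and adding them differs from complementing their sum by K-1). The width and names are hypothetical.

```python
WIDTH = 16                 # hypothetical word length
MASK = (1 << WIDTH) - 1


def bit_not(x):
    """Bitwise complement within WIDTH bits."""
    return ~x & MASK


def shift_then_add(a, b, k):
    return ((a << k) + (b << k)) & MASK


def add_then_shift(a, b, k):
    """Homomorphism: the shift can be moved past the addition."""
    return ((a + b) << k) & MASK


def two_shifts(a, j, k):
    return ((a << j) << k) & MASK


def one_shift(a, j, k):
    """Collapse: SL SL -> SL."""
    return (a << (j + k)) & MASK


def sum_of_nots(values):
    return sum(bit_not(v) for v in values) & MASK


def not_of_sum_corrected(values):
    """Pseudo-homomorphism: complement once, then subtract the K-1 correction."""
    k = len(values)
    return (bit_not(sum(values) & MASK) - (k - 1)) & MASK
```

Moving the shifts and complements to one side of the additions leaves the additions adjacent, so they can be merged into a single compressor.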

55
Results

Graph again but good for synthesis, CP reasonable
Show speed and area, before and after and
standard multiplier from a library

56
High-level Arithmetic Optimisations Example
original
implement multipliers
apply some rules
apply more rules

57
High-level Arithmetic Optimisations: Example
Significant speed-up possible at no hardware cost
58
Outline

Three Sample Problems
Automatic Instruction-Set Extensions
Atasu, Pozzi, Ienne DAC 2003 (Best Paper Award)
Advanced Instruction Selection
Peymandoust, Pozzi, Ienne, De Micheli ASAP 2003
Arithmetic Optimisations / Synthesis
Ongoing work
Architectural Outlook and Conclusions

59
Billion-Transistor Systems-on-Chip
60
A Reconfigurable HW Platform for MegaWatch
RF
61
A Reconfigurable HW Platform for MegaWatch
Dynamic Reconfiguration
RF Design: Low-power software radio
Logic Design: Low-power reconfigurable datapath
Processor Architecture: Datapath function definition
Computer Science: Operating system support
Communications: Low-power on-chip communication network
