Embedded Computer Architecture - PowerPoint PPT Presentation

Embedded Computer Architecture

Description:

Title: No Slide Title Author: abc Last modified by: Henk Corporaal Created Date: 7/10/1998 11:19:28 AM Document presentation format: On-screen Show (4:3) – PowerPoint PPT presentation

Slides: 64

1
Embedded Computer Architecture
VLIW architectures: Generating VLIW code
  • TU/e 5kk73
  • Henk Corporaal

2
VLIW lectures overview
  • Enhance performance architecture methods
  • Instruction Level Parallelism
  • VLIW
  • Examples
  • C6
  • TM
  • TTA
  • Clustering and Reconfigurable components
  • Code generation
  • compiler basics
  • mapping and scheduling
  • TTA code generation
  • Design space exploration
  • Hands-on

3
Compiler basics
  • Overview
  • Compiler trajectory / structure / passes
  • Control Flow Graph (CFG)
  • Mapping and Scheduling
  • Basic block list scheduling
  • Extended scheduling scope
  • Loop scheduling
  • Loop transformations
  • separate lecture

4
Compiler basics: trajectory
  • Source program → Preprocessor → Compiler → Assembler → Loader/Linker → Object program
  • The compiler also produces error messages
  • The loader/linker also takes in library code
5
Compiler basics: structure / passes
Source code
  • Lexical analyzer: token generation
  • Parsing: check syntax, check semantics, parse tree generation
Intermediate code
  • Code optimization: data flow analysis, local optimizations, global optimizations
  • Code generation: code selection, peephole optimizations
  • Register allocation: making interference graph, graph coloring, spill code insertion, caller / callee save and restore code
Sequential code
  • Scheduling and allocation: exploiting ILP
Object code
6
Compiler basics: structure. Simple example: from HLL to (sequential) assembly code
  • Source: position := initial + rate * 60
  • Lexical analyzer: id1 := id2 + id3 * 60
  • Syntax analyzer: parse tree of this assignment
  • Intermediate code generator: temp1 := inttoreal(60); temp2 := id3 * temp1; temp3 := id2 + temp2; id1 := temp3
  • Code optimizer: temp1 := id3 * 60.0; id1 := id2 + temp1
  • Code generator: movf id3, r2; mulf 60, r2, r2; movf id2, r1; addf r2, r1; movf r1, id1
7
Compiler basics: Control flow graph (CFG)
The CFG shows the flow between basic blocks.
C input code:
  if (a > b) r = a % b; else r = b % a;
CFG basic blocks:
  1: sub t1, a, b; bgz t1, 2, 3
  2: rem r, a, b; goto 4
  3: rem r, b, a; goto 4
  4: ...
A program is a collection of functions; each function is a collection of basic blocks; each basic block contains a set of instructions; each instruction consists of several transports, ...
8
Compiler basics Basic optimizations
  • Machine independent optimizations
  • Machine dependent optimizations

9
Compiler basics Basic optimizations
  • Machine independent optimizations
  • Common subexpression elimination
  • Constant folding
  • Copy propagation
  • Dead-code elimination
  • Induction variable elimination
  • Strength reduction
  • Algebraic identities
  • Commutative expressions
  • Associativity: tree height reduction
  • Note: not always allowed (due to limited precision)
  • For details check any good compiler book!
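The precision caveat on associativity can be illustrated with a minimal Python sketch. The values 0.1, 0.2 and 0.3 are illustrative assumptions, not taken from the slides; the point is only that regrouping floating-point additions, as tree height reduction does, may change the rounded result.

```python
# Floating-point addition is not associative: regrouping the operands
# (as tree height reduction would) can change the rounded result.
a, b, c = 0.1, 0.2, 0.3   # illustrative values, not from the slides

left = (a + b) + c        # original evaluation order
right = a + (b + c)       # reassociated evaluation order

reassociation_is_exact = (left == right)
print(left, right, reassociation_is_exact)
```

For integer arithmetic without overflow the two orders always agree, which is why a compiler can typically apply this transformation more freely there.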

10
Compiler basics Basic optimizations
  • Machine dependent optimization example
  • What's the optimal implementation of a*34?
  • Use multiplier: mul Tb, Ta, 34
  • Pro: no thinking required
  • Con: may take many cycles
  • Alternative (34 = 2 + 32):
  • SHL Tb, Ta, 1
  • SHL Tc, Ta, 5
  • ADD Tb, Tb, Tc
  • Pro: may take fewer cycles
  • Cons:
  • Uses more registers
  • Additional instructions (I-cache load / code size)
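The shift-and-add alternative can be sketched in Python as follows. The function names `shift_amounts` and `multiply_by_shifts` are hypothetical helpers introduced here for illustration; the sketch only handles positive constants and does not model cycle counts or register pressure.

```python
def shift_amounts(constant):
    """Return the positions of the 1-bits of a positive integer constant."""
    if constant <= 0:
        raise ValueError("this sketch handles positive constants only")
    return [bit for bit in range(constant.bit_length()) if (constant >> bit) & 1]


def multiply_by_shifts(value, constant):
    """Compute value * constant using only shifts and adds."""
    return sum(value << shift for shift in shift_amounts(constant))


# 34 = 2 + 32, so a*34 needs one shift by 1, one shift by 5 and one add.
print(shift_amounts(34))
print(multiply_by_shifts(7, 34))
```

Whether this beats a hardware multiply depends on the target: a constant with many 1-bits needs many shifts and adds, so a real compiler would weigh the sequence length against the multiplier latency.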

11
Compiler basics Register allocation
  • Register Organization
  • Conventions needed for parameter passing and register usage across function calls

12
Register allocation using graph coloring
  • Given a set of registers, what is the most efficient mapping of registers to program variables in terms of execution time of the program?
  • Some definitions:
  • A variable is defined at a point in a program when a value is assigned to it.
  • A variable is used at a point in a program when its value is referenced in an expression.
  • The live range of a variable is the execution range between definitions and uses of a variable.

13
Register allocation using graph coloring
Live Ranges
[Figure: live ranges, each running from a define to a use]
14
Register allocation using graph coloring
Interference Graph
[Figure: interference graph with nodes a, b, c, d]
Coloring: a = red, b = green, c = blue, d = green
Graph needs 3 colors => program needs 3 registers
Question: map coloring requires (at most) 4 colors; what's the maximum number of colors (= registers) needed for register interference graph coloring?
15
Register allocation using graph coloring
Spill / Reload code
Spill / reload code is needed when there are not enough colors (registers) to color the interference graph.
Example: only two registers available!
16
Register allocation for a monolithic RF
Scheme of the optimistic register allocator
[Figure: phases Renumber → Build → Spill costs → Simplify → Select, with a Spill code step feeding back into the phases]
The Select phase selects a color (= machine register) for a variable that minimizes the heuristic h:
  h = fdep(col, var) + caller_callee(col, var)
where
  • fdep(col, var): a measure for the introduction of false dependencies
  • caller_callee(col, var): cost for mapping var on a caller or callee saved register
17
Some explanation of reg allocation phases
  • Renumber: The first phase finds all live ranges in a procedure and numbers (renames) them uniquely.
  • Build: This phase constructs the interference graph.
  • Spill Costs: In preparation for coloring, a spill cost estimate is computed for every live range. The cost is simply the sum of the execution frequencies of the transports that define or use the variable of the live range.
  • Simplify: This phase removes nodes with degree < k in an arbitrary order from the graph and pushes them on a stack. Whenever it discovers that all remaining nodes have degree ≥ k, it chooses a spill candidate. This node is also removed from the graph and optimistically pushed on the stack, hoping a color will be available in spite of its high degree.
  • Select: Colors are selected for nodes. In turn, each node is popped from the stack, reinserted in the interference graph and given a color.
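The Simplify and Select phases can be sketched in Python as below. This is a simplified illustration, not the allocator from the slides: the spill candidate is chosen by highest degree instead of by spill cost, Select picks the lowest free color instead of minimizing the heuristic h, and the edges of the example graph are an assumption (the slides only give the node names a, b, c, d and their coloring).

```python
def color_graph(graph, k):
    """Optimistic coloring. graph: node -> set of neighbours; k: number of registers.
    Returns (coloring, spilled) where coloring maps node -> color index."""
    work = {node: set(adj) for node, adj in graph.items()}
    stack = []
    # Simplify: remove nodes with degree < k; otherwise push a spill candidate.
    while work:
        low = [node for node in work if len(work[node]) < k]
        # Assumption: the spill candidate is simply the highest-degree node.
        node = low[0] if low else max(work, key=lambda n: len(work[n]))
        stack.append(node)
        for neighbour in work.pop(node):
            work[neighbour].discard(node)
    # Select: pop nodes and give each a color not used by colored neighbours.
    coloring, spilled = {}, []
    while stack:
        node = stack.pop()
        used = {coloring[n] for n in graph[node] if n in coloring}
        free = [color for color in range(k) if color not in used]
        if free:
            coloring[node] = free[0]
        else:
            spilled.append(node)   # no color left: this live range must be spilled
    return coloring, spilled


# Hypothetical interference edges for the four variables of the example.
example = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"a", "c"}}
print(color_graph(example, 3))
print(color_graph(example, 2))
```

With three registers the sketch colors every node; with two registers at least one node ends up in the spill list, which is the situation where spill / reload code has to be inserted and the phases repeated.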

18
Compiler basics: Code selection
  • CISC era (before 1985)
  • Code size important
  • Determine shortest sequence of code
  • Many options may exist
  • Pattern matching
  • Example M68020:
  • D1 := D1 + M[ M[10 + A1] + 16*D2 + 20 ]
  • ADD ([10,A1], D2*16, 20), D1
  • RISC era
  • Performance important
  • Only few possible code sequences
  • New implementations of old architectures optimize the RISC part of the instruction set only, e.g. i486 / Pentium / M68020

19
Overview
  • Enhance performance architecture methods
  • Instruction Level Parallelism
  • VLIW
  • Examples
  • C6
  • TM
  • TTA
  • Clustering
  • Code generation
  • Compiler basics
  • Mapping and Scheduling of Operations
  • Design Space Exploration TTA framework
  • What is scheduling
  • Basic Block Scheduling
  • Extended Basic Block Scheduling
  • Loop Scheduling

20
Mapping / Scheduling: placing operations in space and time
  • d = a * b
  • e = a + d
  • f = 2 * b + d
  • r = f - e
  • x = z + y

21
How to map these operations?
  • Architecture constraints
  • One Function Unit
  • All operations single cycle latency

22
How to map these operations?
  • Architecture constraints
  • One Add-sub and one Mul unit
  • All operations single cycle latency

23
There are many mapping solutions
24
Scheduling Overview
  • Transforming a sequential program into a parallel
    program
  • read sequential program
  • read machine description file
  • for each procedure do
  • perform function inlining
  • for each procedure do
  • transform an irreducible CFG into a reducible CFG
  • perform control flow analysis
  • perform loop unrolling
  • perform data flow analysis
  • perform memory reference disambiguation
  • perform register allocation
  • for each scheduling scope do
  • perform instruction scheduling
  • write out the parallel program

25
Basic Block Scheduling
  • Basic Block: piece of code which can only be entered from the top (first instruction) and left at the bottom (final instruction)
  • Scheduling a basic block: assign resources and a cycle to every operation
  • List Scheduling: heuristic scheduling approach, scheduling the operations one by one
  • Time complexity: O(N), where N is the number of operations
  • Optimal scheduling has time complexity O(exp(N))
  • Question: what is a good scheduling heuristic?

26
Basic Block Scheduling
  • Make a Data Dependence Graph (DDG)
  • Determine minimal length of the DDG (for the given architecture): the minimal number of cycles to schedule the graph (assuming sufficient resources)
  • Determine:
  • ASAP (As Soon As Possible) cycle: earliest cycle an instruction can be scheduled
  • ALAP (As Late As Possible) cycle: latest cycle an instruction can be scheduled
  • Slack of each operation = ALAP - ASAP
  • Priority of operations = f(slack, descendants, register impact, ...)
  • Place each operation in the first cycle with sufficient resources
  • Notes:
  • Basic block: a (maximal) piece of consecutive instructions which can only be entered at the first instruction and left at the end
  • Scheduling order: sequential
  • Scheduling priority: determined by the heuristic used, e.g. slack + other contributions
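The ASAP, ALAP and slack computation can be sketched as follows, assuming single-cycle operations and cycles counted from 1. The function name `asap_alap` and the example graph are assumptions made for illustration: the operations are modelled on the d, e, f, r, x example used elsewhere in the lecture, with the product 2*b split off as a hypothetical extra operation `t`.

```python
def asap_alap(ops, edges):
    """ops: operation names in topological order; edges: (u, v) data dependences.
    Assumes every operation takes one cycle. Returns op -> (asap, alap, slack)."""
    preds = {v: [] for v in ops}
    succs = {v: [] for v in ops}
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)
    asap = {}
    for v in ops:                       # forward pass: earliest possible cycle
        asap[v] = 1 + max((asap[u] for u in preds[v]), default=0)
    length = max(asap.values())         # minimal schedule length of the DDG
    alap = {}
    for v in reversed(ops):             # backward pass: latest possible cycle
        alap[v] = min((alap[s] for s in succs[v]), default=length + 1) - 1
    return {v: (asap[v], alap[v], alap[v] - asap[v]) for v in ops}


# Hypothetical DDG: d = a*b, t = 2*b, e = a+d, f = t+d, r = f-e, x = z+y
ops = ["d", "t", "e", "f", "r", "x"]
edges = [("d", "e"), ("d", "f"), ("t", "f"), ("e", "r"), ("f", "r")]
timing = asap_alap(ops, edges)
for op in ops:
    print(op, timing[op])
```

Operations with slack 0 lie on a critical path; `x` is independent of the rest and can be placed in any of the three cycles, so a slack-based priority would schedule it last.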

27
Basic Block Scheduling: determine ASAP and ALAP cycles
[Figure: DDG with inputs A, B, C, operations ADD, SUB, ADD, NEG, LD, LD, MUL, ADD, and results z, y, X. Each operation is annotated with <ASAP cycle, ALAP cycle>, from which the slack follows: <1,1>, <2,2>, <3,3>, <1,3>, <2,3>, <4,4>, <2,4>, <1,4>.]
We assume all operations are single cycle!
28
Cycle based list scheduling
proc Schedule(DDG = (V,E))
beginproc
  ready  = { v | ¬∃ (u,v) ∈ E }
  ready' = ready
  sched  = ∅
  current_cycle = 0
  while sched ≠ V do
    for each v ∈ ready' (select in priority order) do
      if ¬ResourceConfl(v, current_cycle, sched) then
        cycle(v) = current_cycle
        sched = sched ∪ {v}
      endif
    endfor
    current_cycle = current_cycle + 1
    ready  = { v | v ∉ sched ∧ ∀ (u,v) ∈ E, u ∈ sched }
    ready' = { v | v ∈ ready ∧ ∀ (u,v) ∈ E, cycle(u) + delay(u,v) ≤ current_cycle }
  endwhile
endproc
29
Extended Scheduling Scope: look at the CFG
Code:
  A; If cond Then B Else C; D; If cond Then E Else F; G
[Figure: CFG (Control Flow Graph) of this code]
Q: Why enlarge the scheduling scope?
30
Extended basic block scheduling: Code Motion
Q: Why move code?
  • Downward code motions?
  • a → B, a → C, a → D, c → D, d → D
  • Upward code motions?
  • c → A, d → A, e → B, e → C, e → A

31
Possible Scheduling Scopes
32
Create and Enlarge Scheduling Scope
33
Create and Enlarge Scheduling Scope
34
Comparing scheduling scopes
35
Code movement (upwards) within regions: what to check?
[Figure: an add operation moved upwards from the source block to the destination block, with blocks marked I in between]
36
Extended basic block scheduling: Code Motion
  • A dominates B ⇒ A is always executed before B
  • Consequently:
  • A does not dominate B ⇒ code motion from B to A requires code duplication
  • B post-dominates A ⇒ B is always executed after A
  • Consequently:
  • B does not post-dominate A ⇒ code motion from B to A is speculative

  • Q1: does C dominate E?
  • Q2: does C dominate D?
  • Q3: does F post-dominate D?
  • Q4: does D post-dominate B?
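Dominance and post-dominance can be computed with a simple iterative data-flow sketch. The CFG used below is an assumption: it is the one implied by the code "A; if cond then B else C; D; if cond then E else F; G" from the scheduling-scope slide, and the questions are taken to refer to that graph. The function names `dominators` and `reverse` are hypothetical helpers.

```python
def dominators(cfg, entry):
    """cfg: block -> list of successor blocks (every block must be a key).
    Returns block -> set of blocks that dominate it (including itself)."""
    nodes = set(cfg)
    preds = {n: set() for n in nodes}
    for n, succs in cfg.items():
        for s in succs:
            preds[s].add(n)
    dom = {n: set(nodes) for n in nodes}     # start from "everything dominates"
    dom[entry] = {entry}
    changed = True
    while changed:                           # iterate until a fixed point is reached
        changed = False
        for n in nodes - {entry}:
            pred_doms = [dom[p] for p in preds[n]]
            new = {n} | (set.intersection(*pred_doms) if pred_doms else set())
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def reverse(cfg):
    """Reverse all edges; dominators of the reversed CFG are post-dominators."""
    rev = {n: [] for n in cfg}
    for n, succs in cfg.items():
        for s in succs:
            rev[s].append(n)
    return rev


# Assumed CFG: A -> B, C; B, C -> D; D -> E, F; E, F -> G
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E", "F"],
       "E": ["G"], "F": ["G"], "G": []}
dom = dominators(cfg, "A")
pdom = dominators(reverse(cfg), "G")
print("Q1 C dominates E:", "C" in dom["E"])
print("Q2 C dominates D:", "C" in dom["D"])
print("Q3 F post-dominates D:", "F" in pdom["D"])
print("Q4 D post-dominates B:", "D" in pdom["B"])
```

Under that assumed CFG only Q4 holds: D lies on every path from B to the exit, whereas C and F each sit on just one side of a branch.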
37
Scheduling Loops
Loop Optimizations
[Figure: loop with blocks A, B, C, D]
38
Scheduling Loops
  • Problems with unrolling
  • Exploits only parallelism within sets of n
    iterations
  • Iteration start-up latency
  • Code expansion

[Figure: resource utilization over time for basic block scheduling, basic block scheduling with unrolling, and software pipelining]
39
Software pipelining
  • Software pipelining a loop is
  • Scheduling the loop such that iterations start
    before preceding iterations have finished
  • Or
  • Moving operations across the backedge

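[Figure: overlapped schedule of three iterations of LD, ML, ST, one row per cycle]
  LD
  LD ML
  LD ML ST
  ML ST
  ST
  • Basic block scheduling: 3 cycles/iteration
  • Unrolling (3 times): 5/3 cycles/iteration
  • Software pipelining: 1 cycle/iteration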
40
Software pipelining (cont'd)
  • Basic loop scheduling techniques
  • Modulo scheduling (Rau, Lam)
  • list scheduling with modulo resource constraints
  • Kernel recognition techniques
  • unroll the loop
  • schedule the iterations
  • identify a repeating pattern
  • Examples
  • Perfect pipelining (Aiken and Nicolau)
  • URPR (Su, Ding and Xia)
  • Petri net pipelining (Allan)
  • Enhanced pipeline scheduling (Ebcioglu)
  • fill first cycle of iteration
  • copy this instruction over the backedge

This algorithm is the most used in commercial compilers
41
Software pipelining: Modulo scheduling
Example: Modulo scheduling a loop
  • Prologue fills the SW pipeline with iterations
  • Epilogue drains the SW pipeline

42
Software pipelining: determine II, the Initiation Interval
Cyclic data dependences
  for (i = 0; ...) A[i+6] = 3*A[i] - 1
Loop body:
  ld  r1, (r2)
  mul r3, r1, 3
  sub r4, r3, 1
  st  r4, (r5)
[Figure: dependence graph of the loop body; each edge is labelled (delay, iteration distance), with labels (1,0), (0,1) and (1,6)]
Initiation Interval constraint for every dependence edge (u,v):
  cycle(v) ≥ cycle(u) + delay(u,v) - II·distance(u,v)
43
Modulo scheduling constraints
MII, the minimum initiation interval, is bounded by cyclic dependences (RecMinII) and resources (ResMinII):
MII = max{ResMinII, RecMinII}
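The two lower bounds can be sketched in Python. Summing the constraint cycle(v) ≥ cycle(u) + delay(u,v) - II·distance(u,v) around a dependence cycle gives II ≥ (sum of delays) / (sum of distances), which is the recurrence bound used below. The function names, the unit counts and the dependence cycle are illustrative assumptions, and the dependence cycles are supplied by hand instead of being enumerated from a graph.

```python
import math


def res_min_ii(op_counts, unit_counts):
    """Resource bound: op_counts[t] operations per iteration need unit type t,
    of which unit_counts[t] units are available."""
    return max(math.ceil(op_counts[t] / unit_counts[t]) for t in op_counts)


def rec_min_ii(cycles):
    """Recurrence bound: each cycle is a list of (delay, distance) edges with a
    total distance >= 1. Returns 1 if the loop has no dependence cycles."""
    return max((math.ceil(sum(d for d, _ in c) / sum(k for _, k in c))
                for c in cycles), default=1)


def min_ii(op_counts, unit_counts, cycles):
    """MII = max(ResMinII, RecMinII)."""
    return max(res_min_ii(op_counts, unit_counts), rec_min_ii(cycles))


# Assumed machine: one load/store unit, one multiplier, one add-sub unit.
op_counts = {"ldst": 2, "mul": 1, "addsub": 1}      # ld, st, mul, sub per iteration
unit_counts = {"ldst": 1, "mul": 1, "addsub": 1}
# Assumed recurrence ld -> mul -> sub -> st -> ld, carried over 6 iterations.
recurrence = [[(1, 0), (1, 0), (1, 0), (1, 6)]]
print(res_min_ii(op_counts, unit_counts), rec_min_ii(recurrence),
      min_ii(op_counts, unit_counts, recurrence))
```

In this assumed setup the loop is resource bound: the long iteration distance makes the recurrence harmless, while the two memory operations share a single load/store unit.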
44
Let's go back to The Role of the Compiler
  • 9 steps required to translate an HLL program (see online book chapter):
  • Front-end compilation
  • Determine dependencies
  • Graph partitioning: make multiple threads (or tasks)
  • Bind partitions to compute nodes
  • Bind operands to locations
  • Bind operations to time slots: Scheduling
  • Bind operations to functional units
  • Bind transports to buses
  • Execute operations and perform transports

45
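Division of responsibilities between hardware and compiler
[Figure: the translation steps from top to bottom: Application, Frontend, Determine Dependencies, Binding of Operands, Scheduling, Binding of Operations, Binding of Transports, Execute. For each architecture class a boundary separates the steps that are the responsibility of the compiler (above) from those that are the responsibility of the hardware (below). The boundary lies after Frontend for Superscalar, after Determine Dependencies for Dataflow, after Binding of Operands for Multi-threaded, after Scheduling for Indep. Arch, after Binding of Operations for VLIW, and after Binding of Transports for TTA.]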
46
Overview
  • Enhance performance architecture methods
  • Instruction Level Parallelism
  • VLIW
  • Examples
  • C6
  • TM
  • TTA
  • Clustering
  • Code generation
  • Design Space Exploration TTA framework

47
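Mapping applications to processors: MOVE framework
[Figure: MOVE framework. An optimizer, steered by user interaction, sets the architecture parameters for a parametric compiler and a hardware generator and receives feedback from both. The compiler produces parallel object code and the hardware generator produces the chip; together they form the TTA based system.]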
48
TTA (MOVE) organization
Data Memory
Socket
Instruction Memory
49
Code generation trajectory for TTAs
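  • Frontend: GCC or SUIF (adapted)

[Figure: Application (C) → Compiler frontend → Sequential code → Compiler backend → Parallel code. Sequential code feeds a sequential simulation (with input/output) that yields profiling data; the compiler backend takes the profiling data and the architecture description; parallel code feeds a parallel simulation (with input/output).]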
50
Exploration: TTA resource reduction
51
Exploration: TTA connectivity reduction
[Figure: execution time versus number of connections removed. Annotations: reducing bus delay; FU stage constrains cycle time; critical connections disappear.]
52
Can we do better?
Yes we can !!
  • How ?
  • Code Transformations
  • SFUs Special Function Units
  • Vector processing
  • Multiple Processors

53
Transforming the specification (1)



Based on associativity of the + operation: a + (b + c) = (a + b) + c
54
Transforming the specification (2)
Original:
  d = a * b; e = a + d; f = 2 * b + d; r = f - e; x = z + y
Transformed:
  r = 2 * b - a; x = z + y
Changing the architecture: adding SFUs (special function units)
4-input adder: why is this faster?
56
Changing the architecture: adding SFUs (special function units)
  • In the extreme case put everything into one unit!

Spatial mapping - no control flow
However no flexibility / programmability !!
but could use FPGAs
57
SFUs: fine grain patterns
  • Why use fine grain SFUs?
  • Code size reduction
  • Register file ports reduction
  • Could be cheaper and/or faster
  • Transport reduction
  • Power reduction (avoid charging non-local wires)
  • Supports whole application domain!
  • Coarse grain would only help certain specific applications
  • Which patterns need support?
  • Detection of recurring operation patterns needed

58
SFUs: covering results
Adding only 20 'patterns' of 2 operations dramatically reduces the number of operations (by about 40%)!!
59
Exploration resulting architecture
  • Architecture for image processing
  • Several SFUs
  • Note the reduced connectivity

60
Conclusions
  • Billions of embedded processing systems / year
  • how to design these systems quickly, cheap,
    correct, low power,.... ?
  • what will their processing platform look like?
  • VLIWs are very powerful and flexible
  • can be easily tuned to application domain
  • TTAs even more flexible, scalable, and lower power

61
Conclusions
  • Compilation for ILP architectures is mature
  • used in commercial compilers
  • However
  • Great discrepancy between available and
    exploitable parallelism
  • Advanced code scheduling techniques needed to
    exploit ILP

62
Bottom line
Do not pay for hardware if
you can do it in software !!
63
Handson-1 (2014)
  • HOW FAR ARE YOU?
  • VLIW processor of Silicon Hive (Intel)
  • Map your algorithm
  • Optimize the mapping
  • Optimize the architecture
  • Perform DSE (Design Space Exploration), trading off (=> Pareto curves)
  • Performance,
  • Energy and
  • Area (= Cost)