Title: High-Level Synthesis: Creating Custom Circuits from High-Level Code
1. High-Level Synthesis: Creating Custom Circuits from High-Level Code
- Greg Stitt
- ECE Department
- University of Florida
2. Existing FPGA Tool Flow
- Register-transfer (RT) synthesis
- Specify RT structure (muxes, registers, etc.)
- Allows precise specification
- But, time consuming, difficult, error prone
- Flow: HDL -> RT Synthesis -> Netlist -> Physical Design (Technology Mapping, Placement, Routing) -> Bitfile
3. Future FPGA Tool Flow?
- Flow: C/C++, Java, etc. -> High-Level Synthesis -> HDL -> RT Synthesis -> Netlist -> Physical Design (Technology Mapping, Placement, Routing) -> Bitfile
4. High-Level Synthesis
- Wouldn't it be nice to write high-level code?
- Ratio of C to VHDL developers (10,000:1?)
- Easier to specify
- Separates function from architecture
- More portable
- But, hardware potentially slower
- Similar to assembly code era
- Programmers could always beat the compiler
- But, no longer the case
- Hopefully, high-level synthesis will catch up to manual effort
5. High-Level Synthesis
- More challenging than compilation
- Compilation maps behavior into assembly instructions
- Architecture is known to the compiler
- High-level synthesis creates a custom architecture to execute behavior
- Huge hardware exploration space
- Best solution may include microprocessors
- Should handle any high-level code
- Not all code appropriate for hardware
6. High-Level Synthesis
- First, consider how to manually convert high-level code into a circuit
- Steps
- 1) Build FSM for controller
- 2) Build datapath based on FSM

acc = 0;
for (i=0; i < 128; i++)
  acc += a[i];
7. Manual Example
- Build a FSM (controller)
- Decompose code into states

acc = 0;
for (i=0; i < 128; i++)
  acc += a[i];

FSM states: acc=0, i=0  ->  if (i < 128) (exit to Done when false)  ->  load a[i]  ->  acc += a[i]  ->  i++  (back to the comparison)
8. Manual Example
- Build a datapath
- Allocate resources for each state

acc = 0;
for (i=0; i < 128; i++)
  acc += a[i];

[Datapath diagram: registers for i, acc, and a[i]; an adder for i + 1, an adder for acc + a[i], a comparator for i < 128, and an address output (addr) driven by i]
9. Manual Example
- Build a datapath
- Determine register inputs

acc = 0;
for (i=0; i < 128; i++)
  acc += a[i];

[Datapath diagram: 2x1 muxes select each register's input - i loads 0 or i+1, acc loads 0 or acc + a[i], and a[i] loads from memory ("In from memory")]
10. Manual Example
- Build a datapath
- Add outputs

acc = 0;
for (i=0; i < 128; i++)
  acc += a[i];

[Same datapath with outputs added: the memory address (driven by i) and the acc result]
11. Manual Example
- Build a datapath
- Add control signals

acc = 0;
for (i=0; i < 128; i++)
  acc += a[i];

[Same datapath with control signals added for the mux selects and registers]
12. Manual Example
- Combine controller and datapath

acc = 0;
for (i=0; i < 128; i++)
  acc += a[i];

[Diagram: the controller drives the datapath's control signals and the memory read; the datapath provides the Done condition (i < 128), the memory address, and the acc output]
13. Manual Example
- Alternatives
- Use one adder (plus muxes)

[Diagram: a single adder shared by i + 1 and acc + a[i], with muxes selecting its inputs; memory address and acc outputs as before]
14. Manual Example
- Comparison with high-level synthesis
- Determining when to perform each operation -> Scheduling
- Allocating a resource for each operation -> Resource allocation
- Mapping operations onto resources -> Binding
15. Another Example

x = 0;
for (i=0; i < 100; i++) {
  if (a[i] > 0)
    x++;
  else
    x -= a[i];
}
// output x

- Steps
- 1) Build FSM (do not perform if conversion)
- 2) Build datapath based on FSM
16. High-Level Synthesis
- High-level Code: could be C, C++, Java, Perl, Python, SystemC, ImpulseC, etc.
- High-Level Synthesis
- Custom Circuit: usually an RT-level VHDL description, but could be as low-level as a bitfile
17. High-Level Synthesis

acc = 0;
for (i=0; i < 128; i++)
  acc += a[i];

[High-Level Synthesis transforms the code above into a custom circuit]
18. Main Steps
- Front-end: Syntactic Analysis - converts code to an intermediate representation, which allows all following steps to use a language-independent format
- Optimization (on the intermediate representation)
- Back-end
- Scheduling/Resource Allocation - determines when each operation will execute, and the resources used
- Binding/Resource Sharing - maps operations onto physical resources
- Result: Controller + Datapath

Flow: High-level Code -> Syntactic Analysis -> Intermediate Representation -> Optimization -> Scheduling/Resource Allocation -> Binding/Resource Sharing -> Controller + Datapath
19. Syntactic Analysis
- Definition: analysis of code to verify syntactic correctness
- Converts code into an intermediate representation
- 2 steps
- 1) Lexical analysis (lexing)
- 2) Parsing

Flow: High-level Code -> Lexical Analysis -> Parsing -> Intermediate Representation
20. Lexical Analysis
- Lexical analysis (lexing) breaks code into a series of defined tokens
- Token: a defined language construct

x = 0; if (y < z) x = 1;

Lexical Analysis

ID(x), ASSIGN, INT(0), SEMICOLON, IF, LPAREN, ID(y), LT, ID(z), RPAREN, ID(x), ASSIGN, INT(1), SEMICOLON
21. Lexing Tools
- Define tokens using regular expressions - the tool outputs C code that lexes the input
- Common tool is lex

/* braces and parentheses */
"{"   { YYPRINT; return LBRACE; }
"}"   { YYPRINT; return RBRACE; }
","   { YYPRINT; return COMMA; }
";"   { YYPRINT; return SEMICOLON; }
"!"   { YYPRINT; return EXCLAMATION; }
"["   { YYPRINT; return LBRACKET; }
"]"   { YYPRINT; return RBRACKET; }
"-"   { YYPRINT; return MINUS; }

/* integers */
[0-9]+  { yylval.intVal = atoi( yytext ); return INT; }
22. Parsing
- Analysis of the token sequence to determine correct grammatical structure
- Languages defined by a context-free grammar

Correct Programs:
  x = 0; y = 1;
  x = 0;
  if (a < b) x = 10;
  if (var1 != var2) x = 10;
  x = 0; if (y < z) x = 1;
  x = 0; if (y < z) x = 1; y = 5; t = 1;

Grammar:
  Program -> Exp
  Exp     -> Stmt SEMICOLON | IF LPAREN Cond RPAREN Exp | Exp Exp
  Cond    -> ID Comp ID
  Stmt    -> ID ASSIGN INT
  Comp    -> LT | NE
23. Parsing

Incorrect Programs:
  x = 3 + 5;
  x 5
  x 5
  if (x5 > y) x = 2;
  x = y;

Grammar:
  Program -> Exp
  Exp     -> S SEMICOLON | IF LPAREN Cond RPAREN Exp | Exp Exp
  Cond    -> ID Comp ID
  S       -> ID ASSIGN INT
  Comp    -> LT | NE
24. Parsing Tools
- Define grammar in a special language
- Automatically creates a parser based on the grammar
- Popular tool is yacc - yet-another-compiler-compiler

program   : functions           { $1; }
functions : function            { $1; }
          | functions function  { $1; }
function  : HEXNUMBER LABEL COLON code  { $2; }
25. Intermediate Representation
- Parser converts tokens to an intermediate representation
- Usually, an abstract syntax tree

x = 0; if (y < z) x = 1; d = 6;

[Abstract syntax tree: Assign(x, 0); If with Cond(y < z) and body Assign(x, 1); Assign(d, 6)]
26. Intermediate Representation
- Why use an intermediate representation?
- Easier to analyze/optimize than source code
- Theoretically can be used for all languages
- Makes the synthesis back end language independent

C Code / Java / Perl -> Syntactic Analysis (one per language) -> Intermediate Representation -> Back End

Scheduling, resource allocation, and binding are independent of the source language - sometimes optimizations too
27. Intermediate Representation
- Different Types
- Abstract Syntax Tree
- Control/Data Flow Graph (CDFG)
- Sequencing Graph
- Etc.
- We will focus on the CDFG
- Combines control flow graph (CFG) and data flow graph (DFG)
28. Control Flow Graphs
- CFG
- Represents control flow dependencies of basic blocks
- Basic block is a section of code that always executes from beginning to end
- i.e., no jumps into or out of the block

acc = 0;
for (i=0; i < 128; i++)
  acc += a[i];

[CFG: block "acc=0, i=0" -> block "if (i < 128)" -> (true) block "acc += a[i]; i++" back to the comparison / (false) Done]
29. Control Flow Graphs
- Your turn
- Create a CFG for this code

i = 0;
while (j < 10) {
  if (x < 5)
    y = 2;
  else if (z < 10)
    y = 6;
}
30. Data Flow Graphs
- DFG
- Represents data dependencies between operations

x = a + b;
y = c + d;
z = x - y;

[DFG: a, b feed an add producing x; c, d feed an add producing y; x, y feed a subtract producing z]
31. Control/Data Flow Graph
- Combines CFG and DFG
- Maintains a DFG for each node of the CFG

acc = 0;
for (i=0; i < 128; i++)
  acc += a[i];

[CDFG: the "acc=0, i=0" block holds a DFG assigning the constants 0 to acc and i; the "if (i < 128)" block holds the comparison; the loop-body block holds a DFG computing acc + a[i] and i + 1; Done otherwise]
32. High-Level Synthesis: Optimization
33. Synthesis Optimizations
- After creating the CDFG, high-level synthesis optimizes the graph
- Goals
- Reduce area
- Improve latency
- Increase parallelism
- Reduce power/energy
- 2 types
- Data flow optimizations
- Control flow optimizations
34. Data Flow Optimizations
- Tree-height reduction
- Generally made possible by commutativity, associativity, and distributivity
- Example: ((a + b) + c) + d, a chain of depth 3, becomes (a + b) + (c + d), a tree of depth 2, so the two inner adds can execute in parallel
35. Data Flow Optimizations
- Operator Strength Reduction
- Replacing an expensive (strong) operation with a faster one
- Common example: replacing multiply/divide with shift

b[i] = a[i] * 8;     ->   b[i] = a[i] << 3;                  (1 multiplication -> 0 multiplications)
a = b * 5;           ->   c = b << 2;  a = b + c;
a = b * 13;          ->   c = b << 2;  d = b << 3;  a = c + d + b;
36. Data Flow Optimizations
- Constant propagation
- Statically evaluate expressions with constants

x = 0;  y = x * 15;  z = y + 10;     ->     x = 0;  y = 0;  z = 10;
37. Data Flow Optimizations
- Function Specialization
- Create specialized code for common inputs
- Treat common inputs as constants
- If inputs are not known statically, must include an if statement for each call to the specialized function

int f (int x) { y = x * 15; return y + 10; }

Treat the frequent input (x == 0) as a constant:

int f_opt () { return 10; }

for (i=0; i < 1000; i++) f(0);     ->     for (i=0; i < 1000; i++) f_opt();
38. Data Flow Optimizations
- Common sub-expression elimination
- If an expression appears more than once, repetitions can be replaced

a = x + y;
. . .
b = c * 25 + x + y;

becomes (x + y already determined):

a = x + y;
. . .
b = c * 25 + a;
39. Data Flow Optimizations
- Dead code elimination
- Remove code that is never executed
- May seem like stupid code, but often comes from constant propagation or function specialization

int f (int x) {
  if (x > 0) a = b * 15;
  else a = b / 4;
  return a;
}

Specialized version for x > 0 does not need the else branch - dead code:

int f_opt () { a = b * 15; return a; }
40. Data Flow Optimizations
- Code motion (hoisting/sinking)
- Avoid repeated computation

for (i=0; i < 100; i++) {
  z = x + y;
  b[i] = a[i] + z;
}

becomes

z = x + y;
for (i=0; i < 100; i++) {
  b[i] = a[i] + z;
}
41. Control Flow Optimizations
- Loop Unrolling
- Replicate the body of the loop
- May increase parallelism

for (i=0; i < 128; i++)
  a[i] = b[i] + c[i];

becomes

for (i=0; i < 128; i+=2) {
  a[i]   = b[i]   + c[i];
  a[i+1] = b[i+1] + c[i+1];
}
42. Control Flow Optimizations
- Function Inlining
- Replace function call with the body of the function
- Common for both SW and HW
- SW - eliminates function call instructions
- HW - eliminates unnecessary control states

for (i=0; i < 128; i++)
  a[i] = f( b[i], c[i] );
. . . .
int f (int a, int b) { return a + b * 15; }

becomes

for (i=0; i < 128; i++)
  a[i] = b[i] + c[i] * 15;
43. Control Flow Optimizations
- Conditional Expansion
- Replace if with a logic expression
- Execute if/else bodies in parallel

y = ab;
if (a) x = b + d;
else   x = bd;

can be expanded to (DeMicheli):

y = ab;  x = a(b + d) + a'bd

Can be further optimized to:

y = ab;  x = y + d(a + b)
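A quick Boolean-algebra check of that last step (not on the original slide), with + as OR and juxtaposition as AND:

x = a(b + d) + a'bd
  = ab + ad + a'bd
  = ab + ad + bd          (ab + a'bd = b(a + a'd) = b(a + d) = ab + bd)
  = ab + d(a + b)
  = y + d(a + b)          (since y = ab)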
44. Example

x = 0;
y = a + b;
if (x < 15)
  z = a + b - c;
else
  z = x + 12;
output z + 12;
45. High-Level Synthesis: Scheduling/Resource Allocation
46. Scheduling
- Scheduling assigns a start time to each operation in the DFG
- Start times must not violate dependencies in the DFG
- Start times must meet performance constraints
- Alternatively, resource constraints
- Performed on the DFG of each CFG node
- => Can't execute multiple CFG nodes in parallel
47. Examples

[Three example schedules of the same small DFG over a, b, c, d: two different 3-cycle schedules and one 2-cycle schedule]
48. Scheduling Problems
- Several types of scheduling problems
- Usually some combination of performance and resource constraints
- Problems
- Unconstrained
- Not very useful, every schedule is valid
- Minimum latency
- Latency constrained
- Minimum-latency, resource constrained
- i.e., find the schedule with the shortest latency that uses less than a specified number of resources
- NP-Complete
- Minimum-resource, latency constrained
- i.e., find the schedule that meets the latency constraint (which may be anything) and uses the minimum number of resources
- NP-Complete
49. Minimum Latency Scheduling
- ASAP (as soon as possible) algorithm (sketched in code below)
- Find a candidate node
- Candidate is a node whose predecessors have been scheduled and completed (or has no predecessors)
- Schedule node one cycle later than the max cycle of its predecessors
- Repeat until all nodes scheduled

[Example DFG with adds, a subtract, and a compare (<) scheduled into Cycle 1 through Cycle 4; minimum possible latency - 4 cycles]
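Below is a minimal C sketch of ASAP (not from the slides; the array-based DFG representation, the name asap(), and the unit-latency assumption are illustrative):

/* ASAP sketch: every operation takes one cycle.
 * pred[i][j] is the j-th predecessor of node i, npred[i] its predecessor count.
 * cycle[i] receives the assigned start cycle; the return value is the latency. */
#define N 8                        /* number of DFG nodes in this example */

int asap(int npred[N], int pred[N][N], int cycle[N]) {
    int scheduled[N] = {0};
    int done = 0, latency = 0;
    while (done < N) {
        for (int i = 0; i < N; i++) {
            if (scheduled[i]) continue;
            int ready = 1, max_pred = 0;
            for (int j = 0; j < npred[i]; j++) {
                int p = pred[i][j];
                if (!scheduled[p]) { ready = 0; break; }      /* predecessor not placed yet */
                if (cycle[p] > max_pred) max_pred = cycle[p];
            }
            if (!ready) continue;
            cycle[i] = max_pred + 1;    /* one cycle later than the latest predecessor */
            scheduled[i] = 1;
            done++;
            if (cycle[i] > latency) latency = cycle[i];
        }
    }
    return latency;
}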
50. Minimum Latency Scheduling
- ALAP (as late as possible) algorithm
- Run ASAP, get minimum latency L
- Find a candidate
- Candidate is a node whose successors are scheduled (or has none)
- Schedule node one cycle before the min cycle of its successors
- Nodes with no successors scheduled to cycle L
- Repeat until all nodes scheduled

[Example: the same DFG scheduled with ALAP; L = 4 cycles, and several operations move to later cycles than in the ASAP schedule]
51. Minimum Latency Scheduling
- ALAP (continued)

[Final ALAP schedule for the example: all operations placed, L = 4 cycles]
52. Minimum Latency Scheduling
- ALAP
- Has to run ASAP first, seems pointless
- But, many heuristics need the mobility/slack of each operation (see the sketch below)
- ASAP gives the earliest possible time for an operation
- ALAP gives the latest possible time for an operation
- Slack = difference between earliest and latest possible schedule
- Slack of 0 implies the operation has to be done in the currently scheduled cycle
- The larger the slack, the more options a heuristic has to schedule the operation
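A companion C sketch for ALAP and slack (again, the names and the unit-latency assumption are illustrative; L is the latency returned by the ASAP sketch above):

/* ALAP + slack sketch: succ[i][j] is the j-th successor of node i, nsucc[i] its count. */
#define N 8

void alap_and_slack(int L, int nsucc[N], int succ[N][N],
                    int asap_cycle[N], int alap_cycle[N], int slack[N]) {
    int scheduled[N] = {0};
    int done = 0;
    while (done < N) {
        for (int i = 0; i < N; i++) {
            if (scheduled[i]) continue;
            int ready = 1, min_succ = L + 1;       /* nodes w/o successors end up in cycle L */
            for (int j = 0; j < nsucc[i]; j++) {
                int s = succ[i][j];
                if (!scheduled[s]) { ready = 0; break; }
                if (alap_cycle[s] < min_succ) min_succ = alap_cycle[s];
            }
            if (!ready) continue;
            alap_cycle[i] = min_succ - 1;          /* one cycle before the earliest successor */
            scheduled[i] = 1;
            done++;
        }
    }
    for (int i = 0; i < N; i++)
        slack[i] = alap_cycle[i] - asap_cycle[i];  /* 0 => no scheduling freedom */
}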
53. Latency-Constrained Scheduling
- Instead of finding the minimum latency, find a latency less than L
- Solutions
- Use ASAP, verify that the minimum latency is less than L
- Use ALAP starting with cycle L instead of the minimum latency (don't need ASAP)
54. Scheduling with Resource Constraints
- Schedule must use less than a specified number of resources

Constraints: 1 ALU (+/-), 1 Multiplier

[Example DFG scheduled in 5 cycles under these constraints]
55. Scheduling with Resource Constraints
- Schedule must use less than a specified number of resources

Constraints: 2 ALUs (+/-), 1 Multiplier

[The same DFG scheduled in 4 cycles with the extra ALU]
56. Minimum-Latency, Resource-Constrained Scheduling
- Definition: given resource constraints, find the schedule that has the minimum latency
- Example

Constraints: 1 ALU (+/-), 1 Multiplier

[Example DFG scheduled in 6 cycles under these constraints]
57. Minimum-Latency, Resource-Constrained Scheduling
- Definition: given resource constraints, find the schedule that has the minimum latency
- Example

Constraints: 1 ALU (+/-), 1 Multiplier

[The same DFG scheduled in 5 cycles - different schedules may use the same resources, but have different latencies]
58. Minimum-Latency, Resource-Constrained Scheduling
- Hu's Algorithm
- Assumes one type of resource
- Basic Idea (sketched in code below)
- Input: graph, # of resources r
- 1) Label each node by max distance from output
- i.e., use path length as priority
- 2) Determine C, the set of scheduling candidates
- Candidate if either no predecessors, or predecessors scheduled
- 3) From C, schedule up to r nodes to the current cycle, using label as priority
- 4) Increment the current cycle, repeat from 2) until all nodes scheduled
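A C sketch of these steps (not from the slides; the representation and the name hu_schedule() are illustrative, and operations are assumed to take one cycle):

/* Hu's algorithm sketch: one resource type, r units.
 * label[i] = max distance from node i to an output (the priority). */
#define N 9

void hu_schedule(int r, int label[N], int npred[N], int pred[N][N], int cycle[N]) {
    int done = 0, cur = 1;
    for (int i = 0; i < N; i++) cycle[i] = 0;            /* 0 = unscheduled */
    while (done < N) {
        int cand[N], ncand = 0;
        for (int i = 0; i < N; i++) {                    /* step 2: find candidates */
            if (cycle[i]) continue;
            int ready = 1;
            for (int j = 0; j < npred[i]; j++) {
                int p = pred[i][j];
                if (!cycle[p] || cycle[p] >= cur) { ready = 0; break; }
            }
            if (ready) cand[ncand++] = i;
        }
        for (int k = 0; k < r && ncand > 0; k++) {       /* step 3: up to r ops by priority */
            int best = 0;
            for (int m = 1; m < ncand; m++)
                if (label[cand[m]] > label[cand[best]]) best = m;
            cycle[cand[best]] = cur;
            cand[best] = cand[--ncand];
            done++;
        }
        cur++;                                           /* step 4: next cycle */
    }
}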
59. Minimum-Latency, Resource-Constrained Scheduling

[Example DFG for Hu's algorithm: a mix of multiply and add/subtract operations over inputs a-k; r = 3]
60. Minimum-Latency, Resource-Constrained Scheduling
- Hu's Algorithm
- Step 1 - Label each node by max distance from output
- i.e., use path length as priority

[The example DFG with labels 4, 4, 3, 3, 2, 2, 1, 1 on its nodes; r = 3]
61. Minimum-Latency, Resource-Constrained Scheduling
- Hu's Algorithm
- Step 2 - Determine C, the set of scheduling candidates

[Cycle 1: C = the nodes with no predecessors; r = 3]
62. Minimum-Latency, Resource-Constrained Scheduling
- Hu's Algorithm
- Step 3 - From C, schedule up to r nodes to the current cycle, using label as priority

[Cycle 1: the three candidates with the highest labels are scheduled; one candidate is not scheduled due to lower priority; r = 3]
63. Minimum-Latency, Resource-Constrained Scheduling

[Cycle 2: the candidate set C is recomputed now that more predecessors are scheduled; r = 3]
64. Minimum-Latency, Resource-Constrained Scheduling

[Cycle 2: up to r of the candidates are scheduled, again by label priority; r = 3]
65. Minimum-Latency, Resource-Constrained Scheduling
- Hu's Algorithm
- Skipping to finish

[Final schedule for the example: all nodes placed in Cycles 1-4; r = 3]
66. Minimum-Latency, Resource-Constrained Scheduling
- Hu's is a simplified problem
- Common Extensions
- Multiple resource types
- Multi-cycle operations

[Small example: operations on a, b, c, d including a subtract and a divide, scheduled across Cycle 1 and Cycle 2]
67. Minimum-Latency, Resource-Constrained Scheduling
- List Scheduling (minimum-latency, resource-constrained version)
- Extension for multiple resource types
- Basic Idea - Hu's algorithm for each resource type (sketched in code below)
- Input: graph, set of constraints R for each resource type
- 1) Label nodes based on max distance to output
- 2) For each resource type t
- 3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled predecessors)
- 4) Schedule up to Rt operations from C based on priority, to the current cycle
- Rt is the constraint on resource type t
- 5) Increment cycle, repeat from 2) until all nodes scheduled
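A C sketch of this list-scheduling loop (illustrative names; unit-latency operations; type[i] selects each op's resource type):

/* List scheduling sketch: minimum latency under resource constraints R[t]. */
#define N 8
#define T 2                         /* e.g., 0 = ALU, 1 = multiplier */

void list_schedule(int R[T], int type[N], int label[N],
                   int npred[N], int pred[N][N], int cycle[N]) {
    int done = 0, cur = 1;
    for (int i = 0; i < N; i++) cycle[i] = 0;            /* 0 = unscheduled */
    while (done < N) {
        for (int t = 0; t < T; t++) {                    /* step 2: each resource type */
            int used = 0;
            while (used < R[t]) {                        /* step 4: up to R[t] ops this cycle */
                int best = -1;
                for (int i = 0; i < N; i++) {            /* step 3: best ready candidate */
                    if (cycle[i] || type[i] != t) continue;
                    int ready = 1;
                    for (int j = 0; j < npred[i]; j++) {
                        int p = pred[i][j];
                        if (!cycle[p] || cycle[p] >= cur) { ready = 0; break; }
                    }
                    if (ready && (best < 0 || label[i] > label[best])) best = i;
                }
                if (best < 0) break;                     /* no candidates of this type left */
                cycle[best] = cur;
                done++;
                used++;
            }
        }
        cur++;                                           /* step 5: next cycle */
    }
}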
68. Minimum-Latency, Resource-Constrained Scheduling
- List scheduling - minimum latency
- Step 1 - Label nodes based on max distance to output (labels not shown, so you can see the operations)
- Nodes given IDs 1-8 for illustration purposes

Constraints: 2 ALUs (+/-), 2 Multipliers

[Example DFG: multiplies 1, 5, 6 and ALU operations 2, 3, 4, 7, 8 (adds and subtracts) over inputs a-k]
69. Minimum-Latency, Resource-Constrained Scheduling
- List scheduling - minimum latency
- For each resource type t
- 3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled predecessors)
- 4) Schedule up to Rt operations from C based on priority, to the current cycle
- Rt is the constraint on resource type t

Constraints: 2 ALUs (+/-), 2 Multipliers

Candidates - Cycle 1: ALU: 2, 3, 4; Mult: 1
70. Minimum-Latency, Resource-Constrained Scheduling
- List scheduling - minimum latency (continued)

Constraints: 2 ALUs (+/-), 2 Multipliers

Candidates - Cycle 1: ALU: 2, 3, 4; Mult: 1

[Cycle 1: two of the ALU candidates and the multiply are scheduled; the remaining ALU op is a candidate, but not scheduled due to low priority]
71. Minimum-Latency, Resource-Constrained Scheduling
- List scheduling - minimum latency (continued)

Constraints: 2 ALUs (+/-), 2 Multipliers

[Cycle 2: the candidate list is recomputed from the newly scheduled nodes]
72. Minimum-Latency, Resource-Constrained Scheduling
- List scheduling - minimum latency (continued)

Constraints: 2 ALUs (+/-), 2 Multipliers

[Cycle 2: the remaining ALU candidate (4) and the multiplier candidates (5, 6) are scheduled]
73. Minimum-Latency, Resource-Constrained Scheduling
- List scheduling - minimum latency (continued)

Constraints: 2 ALUs (+/-), 2 Multipliers

[Cycles 3 and 4: the remaining operations are scheduled as their predecessors complete]
74. Minimum-Latency, Resource-Constrained Scheduling
- List scheduling - (minimum latency)
- Final schedule
- Note - ASAP would require more resources
- ALAP wouldn't in this case, but in general it would

Constraints: 2 ALUs (+/-), 2 Multipliers

[Final schedule: Cycle 1 - ops 1, 2, 3; Cycle 2 - ops 4, 5, 6; Cycle 3 - op 7; Cycle 4 - op 8]
75. Minimum-Latency, Resource-Constrained Scheduling
- Extension for multicycle operations
- Same idea; the differences are the candidate test and the resource count (sketched in code below)
- Input: graph, set of constraints R for each resource type
- 1) Label nodes based on max cycle latency to output
- 2) For each resource type t
- 3) Determine candidate nodes, C (those w/ no predecessors or w/ scheduled and completed predecessors)
- 4) Schedule up to (Rt - nt) operations from C based on priority, one cycle after their predecessors
- Rt is the constraint on resource type t
- nt is the number of resources of type t in use from previous cycles
- Repeat from 2) until all nodes scheduled
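The multicycle variant in C, extending the earlier sketch (illustrative; delay[i] gives each op's cycle count):

#define N 9
#define T 2

static int completed(int p, int cur, const int cycle[N], const int delay[N]) {
    /* predecessor p must be scheduled and finished before cycle cur */
    return cycle[p] != 0 && cycle[p] + delay[p] <= cur;
}

void list_schedule_mc(int R[T], int type[N], int label[N], int delay[N],
                      int npred[N], int pred[N][N], int cycle[N]) {
    int done = 0, cur = 1;
    for (int i = 0; i < N; i++) cycle[i] = 0;
    while (done < N) {
        for (int t = 0; t < T; t++) {
            int busy = 0;                                /* nt: units still occupied */
            for (int i = 0; i < N; i++)
                if (type[i] == t && cycle[i] && cycle[i] + delay[i] > cur) busy++;
            while (busy < R[t]) {                        /* schedule up to Rt - nt ops */
                int best = -1;
                for (int i = 0; i < N; i++) {
                    if (cycle[i] || type[i] != t) continue;
                    int ready = 1;
                    for (int j = 0; j < npred[i]; j++)
                        if (!completed(pred[i][j], cur, cycle, delay)) { ready = 0; break; }
                    if (ready && (best < 0 || label[i] > label[best])) best = i;
                }
                if (best < 0) break;
                cycle[best] = cur;
                done++;
                busy++;
            }
        }
        cur++;
    }
}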
76. Minimum-Latency, Resource-Constrained Scheduling

Constraints: 2 ALUs (+/-), 2 Multipliers

[Multicycle example: the same DFG now scheduled over 7 cycles when operations take multiple cycles]
77. List Scheduling (Min Latency)
- Your turn (2 ALUs, 1 Mult)
- Steps (will be on test)
- 1) Label nodes with priority
- 2) Update candidate list for each cycle
- 3) Redraw graph to show schedule

[Exercise DFG with 11 operations, including multiplies, adds, and subtracts]
78. List Scheduling (Min Latency)
- Your turn (2 ALUs, 1 Mult, Mults take 2 cycles)

[Exercise DFG with 7 operations over inputs a-g]
79. Minimum-Resource, Latency-Constrained
- Note that if no resource constraints are given, the schedule determines the number of required resources
- Max of each resource type used in a single cycle

[Example: a 4-cycle schedule whose widest cycle uses 3 ALUs and 2 Mults, so those become the required resources]
80. Minimum-Resource, Latency-Constrained
- Minimum-Resource, Latency-Constrained Scheduling
- For all schedules that have latency less than the constraint, find the one that uses the fewest resources

Latency Constraint <= 4

[Two 4-cycle schedules of the same DFG: one needs 3 ALUs and 2 Mults, the other only 2 ALUs and 1 Mult]
81. Minimum-Resource, Latency-Constrained
- List scheduling (minimum-resource version), sketched in code below
- Basic Idea
- 1) Compute latest start times for each op using ALAP with the specified latency constraint
- Latest start times must account for multicycle operations
- 2) For each resource type
- 3) Determine candidate nodes
- 4) Compute slack for each candidate
- Slack = latest possible cycle - current cycle
- 5) Schedule ops with 0 slack
- Update the required number of resources (assume 1 of each to start with)
- 6) Schedule ops that require no extra resources
- 7) Repeat from 2) until all nodes scheduled
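A C sketch of the minimum-resource version (illustrative; unit-latency ops; lpc[i] is the latest possible cycle from the ALAP pass, and the latency constraint is assumed feasible):

#define N 7
#define T 2

void list_schedule_minres(int type[N], int lpc[N],
                          int npred[N], int pred[N][N],
                          int cycle[N], int used[T]) {
    int done = 0, cur = 1;
    for (int i = 0; i < N; i++) cycle[i] = 0;
    for (int t = 0; t < T; t++) used[t] = 1;             /* assume 1 of each to start */
    while (done < N) {
        for (int t = 0; t < T; t++) {
            int busy = 0;
            for (int pass = 0; pass < 2; pass++) {       /* pass 0: zero-slack ops; pass 1: ops that fit */
                for (int i = 0; i < N; i++) {
                    if (cycle[i] || type[i] != t) continue;
                    int ready = 1;
                    for (int j = 0; j < npred[i]; j++) {
                        int p = pred[i][j];
                        if (!cycle[p] || cycle[p] >= cur) { ready = 0; break; }
                    }
                    if (!ready) continue;
                    int slack = lpc[i] - cur;            /* 0 => must schedule this cycle */
                    if (pass == 0 && slack == 0) {
                        cycle[i] = cur; done++; busy++;
                        if (busy > used[t]) used[t] = busy;    /* grow the resource count */
                    } else if (pass == 1 && busy < used[t]) {  /* needs no extra resources */
                        cycle[i] = cur; done++; busy++;
                    }
                }
            }
        }
        cur++;
    }
}

On the example that follows, this should reproduce the result on the slides: two zero-slack ALU ops in cycle 1 force a second ALU, and the two multiplies in cycle 2 force a second multiplier.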
82. Minimum-Resource, Latency-Constrained

Latency Constraint: 3 cycles

[ALAP with the 3-cycle constraint defines the last possible cycle (LPC) for each operation: ops 1, 2, 3 - cycle 1; ops 4, 5, 6 - cycle 2; op 7 - cycle 3]
83. Minimum-Resource, Latency-Constrained
- 2) For each resource type
- 3) Determine candidate nodes C
- 4) Compute slack for each candidate
- Slack = latest possible cycle - current cycle

Cycle 1 - Candidates: 1, 2, 3, 4
[Node/LPC/Slack table: ops 1, 2, 3 have LPC 1 and slack 0; op 4 has LPC 2 and slack 1]
Initial Resources: 1 Mult, 1 ALU
84. Minimum-Resource, Latency-Constrained
- 5) Schedule ops with 0 slack
- Update required number of resources
- 6) Schedule ops that require no extra resources

Cycle 1 - Candidates: 1, 2, 3, 4
[Ops 1, 2, 3 (slack 0) are scheduled; op 4 requires 1 more ALU - not scheduled]
Resources: 1 Mult, 2 ALU
85. Minimum-Resource, Latency-Constrained
- 2) For each resource type
- 3) Determine candidate nodes C
- 4) Compute slack for each candidate
- Slack = latest possible cycle - current cycle

Cycle 2 - Candidates: 4, 5, 6
[Node/LPC/Slack table: ops 4, 5, 6 all have slack 0 at cycle 2]
Resources: 1 Mult, 2 ALU
86. Minimum-Resource, Latency-Constrained
- 5) Schedule ops with 0 slack
- Update required number of resources
- 6) Schedule ops that require no extra resources

Cycle 2 - Candidates: 4, 5, 6
[Ops 5 and 6 force a second multiplier; there is already 1 free ALU, so op 4 can be scheduled]
Resources: 2 Mult, 2 ALU
87. Minimum-Resource, Latency-Constrained
- 2) For each resource type
- 3) Determine candidate nodes C
- 4) Compute slack for each candidate
- Slack = latest possible cycle - current cycle

Cycle 3 - Candidates: 7
[Op 7 has slack 0 at cycle 3 and is scheduled]
Resources: 2 Mult, 2 ALU
88. Minimum-Resource, Latency-Constrained

Required Resources: 2 Mult, 2 ALU

[Final schedule: Cycle 1 - ops 1, 2, 3; Cycle 2 - ops 4, 5, 6; Cycle 3 - op 7]
89. Other Extensions
- Chaining
- Multiple operations in a single cycle
- Multiple adds may be faster than 1 divide - perform the adds in one cycle
- Pipelining
- Input: DFG, data delivery rate
- For a fully pipelined circuit, must have one resource per operation (remember systolic arrays)

[Example DFG with several adds chained in one cycle feeding a divide]
90. Summary
- Scheduling assigns each operation in a DFG a start time
- Done for each DFG in the CDFG
- Different Types
- Minimum Latency
- ASAP, ALAP
- Latency-constrained
- ASAP, ALAP
- Minimum-latency, resource-constrained
- Hu's Algorithm
- List Scheduling
- Minimum-resource, latency-constrained
- List Scheduling
91. High-Level Synthesis: Binding/Resource Sharing
92. Binding
- During scheduling, we determined
- When ops will execute
- How many resources are needed
- We still need to decide which ops execute on which resources
- => Binding
- If multiple ops use the same resource
- => Resource Sharing
93. Binding
- Basic Idea - map operations onto resources such that operations in the same cycle don't use the same resource

Resources: 2 ALUs (+/-), 2 Multipliers

[Scheduled DFG from before (Cycle 1: ops 1, 2, 3; Cycle 2: ops 4, 5, 6; Cycle 3: op 7; Cycle 4: op 8) with each op assigned to ALU1, ALU2, Mult1, or Mult2]
94. Binding
- Many possibilities
- Bad binding may increase resources, require huge steering logic, reduce clock, etc.

Resources: 2 ALUs (+/-), 2 Multipliers

[The same scheduled DFG with a different, legal but less desirable assignment to ALU1, ALU2, Mult1, and Mult2]
95. Binding
- Can't do this
- 1 resource can't perform multiple ops simultaneously!

Resources: 2 ALUs (+/-), 2 Multipliers

[The same scheduled DFG with an illegal binding: two operations in the same cycle mapped to one resource]
96. Binding
- How to automate?
- More graph theory
- Compatibility Graph (a sketch follows)
- Each node is an operation
- Edges represent compatible operations
- Compatible - if two ops can share a resource
- i.e., ops that use the same type of resource (ALU, etc.) and are scheduled to different cycles
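A C sketch of the compatibility test plus a simple greedy binder (a heuristic stand-in for the exact clique partitioning discussed a few slides later; names are illustrative):

/* Ops i and j are compatible when they use the same resource type and are
 * scheduled to different cycles.  The greedy pass assigns each op to the first
 * existing resource whose ops are all compatible with it, else opens a new one. */
#define N 8

void bind_ops(int type[N], int cycle[N], int binding[N]) {
    int nres = 0;                                 /* resources allocated so far */
    for (int i = 0; i < N; i++) {
        binding[i] = -1;
        for (int r = 0; r < nres && binding[i] < 0; r++) {
            int ok = 1;
            for (int j = 0; j < i; j++)           /* check every op already on resource r */
                if (binding[j] == r &&
                    (type[j] != type[i] || cycle[j] == cycle[i])) { ok = 0; break; }
            if (ok) binding[i] = r;
        }
        if (binding[i] < 0) binding[i] = nres++;  /* open a new resource */
    }
}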
97. Compatibility Graph

[Scheduled DFG and its compatibility graph: ALU ops 2, 3, 4, 7, 8 and multiplier ops 1, 5, 6; 2 and 3 not compatible (same cycle); 5 and 6 not compatible (same cycle)]
98. Compatibility Graph

[Same compatibility graph, highlighting that fully connected subgraphs can share a resource (all involved nodes are compatible)]
99. Compatibility Graph

[Another fully connected subgraph highlighted on the same compatibility graph]
100. Compatibility Graph

[A further fully connected subgraph highlighted on the same compatibility graph]
101. Compatibility Graph
- Binding: find the minimum number of fully connected subgraphs that cover the entire graph
- Well-known problem: clique partitioning (NP-complete)
- Cliques: {2,8,7,4}, {3}, {1,5}, {6}
- ALU1 executes 2, 8, 7, 4
- ALU2 executes 3
- MULT1 executes 1, 5
- MULT2 executes 6
102. Compatibility Graph

[The scheduled DFG annotated with this binding: ALU1 = {2,8,7,4}, ALU2 = {3}, MULT1 = {1,5}, MULT2 = {6}]
103. Compatibility Graph
- Alternative Final Binding

[The same scheduled DFG with a different legal clique partitioning / binding]
104. Translation to Datapath
- Add resources and registers
- Add a mux for each input
- Add an input to the left mux for each left input in the DFG
- Do the same for the right mux
- If only 1 input, remove the mux
- (A sketch of the mux construction follows the diagram.)

[Datapath for the bound schedule: Mult(1,5), Mult(6), ALU(2,7,8,4), ALU(3), each with an output register (Reg) and input muxes selecting among the DFG inputs a-i]
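A small C sketch of the mux construction described above (illustrative names; left[i]/right[i] are the names of op i's DFG sources and binding[i] its resource from the binding step):

#include <stdio.h>
#include <string.h>

#define N     8     /* ops */
#define MAXIN 8     /* max distinct inputs per mux in this example */

void build_muxes(int nops, const char *left[N], const char *right[N],
                 int binding[N], int nres) {
    for (int r = 0; r < nres; r++) {
        const char *lin[MAXIN], *rin[MAXIN];
        int nl = 0, nr = 0;
        for (int i = 0; i < nops; i++) {
            if (binding[i] != r) continue;
            int seen = 0;                          /* collect distinct left-mux inputs */
            for (int k = 0; k < nl; k++) if (!strcmp(lin[k], left[i])) seen = 1;
            if (!seen) lin[nl++] = left[i];
            seen = 0;                              /* and distinct right-mux inputs */
            for (int k = 0; k < nr; k++) if (!strcmp(rin[k], right[i])) seen = 1;
            if (!seen) rin[nr++] = right[i];
        }
        printf("resource %d: left mux with %d input(s), right mux with %d input(s)%s\n",
               r, nl, nr, (nl <= 1 && nr <= 1) ? " (muxes removed)" : "");
    }
}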
105. Left Edge Algorithm
- Alternative to clique partitioning
- Take the scheduled DFG and rotate it 90 degrees

Resources: 2 ALUs (+/-), 2 Multipliers
106Left Edge Algorithm
2 ALUs (/-), 2 Multipliers
4
- Initialize right_edge to 0
- Find a node N whose left edge is gt right_edge
- Bind N to a particular resource
- Update right_edge to the right edge of N
- Repeat from 2) for nodes using the same resource
type until right_edge passes all nodes - Repeat from 1) until all nodes bound
right_edge
6
3
-
8
7
2
5
1
Cycle7
Cycle6
Cycle4
Cycle1
Cycle5
Cycle3
Cycle2
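A C sketch of the left-edge loop (illustrative; each op i occupies cycles left[i]..right[i], and the function packs the ops of one resource type t):

#define N 8

void left_edge(int type[N], int left[N], int right[N], int t, int binding[N]) {
    for (int i = 0; i < N; i++)
        if (type[i] == t) binding[i] = -1;        /* ops of other types are handled separately */
    int res = 0;
    for (;;) {
        int right_edge = 0, placed = 0;           /* step 1: new resource, right_edge = 0 */
        for (;;) {
            int best = -1;
            for (int i = 0; i < N; i++)           /* step 2: leftmost unbound op past right_edge */
                if (type[i] == t && binding[i] < 0 && left[i] > right_edge &&
                    (best < 0 || left[i] < left[best])) best = i;
            if (best < 0) break;                  /* right_edge has passed all remaining ops */
            binding[best] = res;                  /* step 3: bind the node to this resource */
            right_edge = right[best];             /* step 4: update right_edge */
            placed = 1;
        }
        if (!placed) break;                       /* step 6: everything of this type is bound */
        res++;                                    /* start a new resource */
    }
}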
107-114. Left Edge Algorithm (continued)

[The steps above are applied one node at a time: right_edge advances along each resource's row until it passes all remaining nodes, then a new resource is started, until all eight operations are bound]