Title: DAOmap: A Depth-optimal Area Optimization Mapping Algorithm for FPGA Designs
1. DAOmap: A Depth-optimal Area Optimization Mapping Algorithm for FPGA Designs
- Deming Chen, Jason Cong
- Computer Science Department
- University of California, Los Angeles
This work is partially supported by the
California MICRO program and the NSF Grant
CCR-0306682
2. Outline
- Introduction
- Related Works
- Definitions and Problem Formulation
- Algorithm Description
- Cut Enumeration
- Delay and Area Propagation
- Cost Function for a Cut
- Global and Local Cost Adjustments
- Cut Selection
- Experimental Results
- Conclusions and Future Work
3. Introduction
- Field Programmable Gate Arrays (FPGAs) have become increasingly popular
- Fast to market
- No or very low NRE (non-recurring engineering) cost
- The LUT-based FPGA architecture dominates the existing programmable chip industry
- FPGA technology mapping converts a given Boolean circuit into a functionally equivalent network comprised only of LUTs
- FPGA technology mapping is a crucial optimization step in the FPGA design flow
4. Related Works on FPGA Mapping
- Area Minimization
- Chortle-crf, Francis, et al, DAC91
- MIS-pga, Murgai, et al, ICCAD91
- Praetor, Cong, et al, FPGA99
- Anti-fuse FPGA Mapper, Kang, et al, ASPDAC04
- Delay Minimization
- DAG-Map, Chen, et al, DTC92
- FlowMap, Cong, et al, ICCAD92
- Edge-map, Yang, et al, ICCAD94
- Power Minimization
- PowerMinMap, Li, et al, ASPDAC03
- Emap, Lamoureux, et al, ICCAD03
- DVmap, Chen, et al, FPGA04
- Simultaneous Delay and Area Minimization
- FlowMap-r, Cong, et al, TVLSI94
- CutMap, Cong, et al, FPGA95
- BoolMap-D, Legl, et al, DAC96
5. Definitions
- DAG: Boolean network
- Cone $C_v$: sub-network rooted on node $v$
- K-feasible cone: $|input(C_v)| \le K$
- Fanin cone $F_v$: the largest $C_v$
- K-feasible cut: a K-feasible $C_v$
- Unit delay model
- One LUT contributes one unit delay
- No edge delay
[Figure: example network with primary inputs (PIs), internal nodes a, b, c, d, e, and root node v]
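The definitions above can be made concrete with a minimal sketch. The toy network, the node names, and the helper names `cone_inputs` and `is_k_feasible` are hypothetical and are not taken from the slides; the sketch only illustrates what "inputs of a cone" and "K-feasible" mean.

```python
# Hypothetical toy network: each node maps to the list of its fanin nodes.
# Primary inputs (PIs) have no fanins.
FANINS = {
    "p1": [], "p2": [], "p3": [],
    "a": ["p1", "p2"],
    "b": ["p2", "p3"],
    "v": ["a", "b"],
}

def cone_inputs(cone, fanins):
    """Return the nodes outside `cone` that drive at least one node inside it."""
    return {u for node in cone for u in fanins[node] if u not in cone}

def is_k_feasible(cone, fanins, k):
    """A cone is K-feasible when it has at most K distinct inputs."""
    return len(cone_inputs(cone, fanins)) <= k

cone_v = {"v", "a", "b"}                 # a cone rooted on node v
inputs_v = cone_inputs(cone_v, FANINS)   # the cut that corresponds to this cone
```

In this toy network the cone has three inputs, so it fits into one 3-input LUT but not into a 2-input LUT.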
6. Problem Formulation
- Delay-optimal area optimization problem
- Given: a Boolean network and an integer K
- Goal: cover the network with K-feasible cones (K-LUTs), such that
- The mapping depth is optimal
- The area (number of LUTs) is minimized
- Area minimization is an NP-hard problem
7. Highlights of Our Algorithm
- Consider potential node duplications and make mapping-area estimation close to reality
- Search the solution space considering both global and local optimality information
- Carry out an iterative cut selection procedure on top of cost adjustment to further improve solution quality
- The techniques used are simple and intuitive
- The key is the right combination of them
8. Cut Enumeration
[Figure: example network with nodes a, b, c, d, w, x, y, z]
- Combine sub-cuts on the inputs of the gate
- Process each gate in topological order from PIs to POs
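The two steps above can be sketched as follows. This is a minimal illustration of the standard cut-merging formulation, not the DAOmap implementation: the function name `enumerate_cuts`, the toy network, and the choice to keep the trivial cut `{node}` for use at fanout gates are assumptions.

```python
from itertools import product

def enumerate_cuts(fanins, order, k):
    """Enumerate all K-feasible cuts of every node.

    fanins: dict node -> list of fanin nodes (empty for primary inputs)
    order:  nodes in topological order, PIs first
    Returns dict node -> set of frozensets (each frozenset is one cut).
    """
    cuts = {}
    for node in order:
        trivial = frozenset([node])       # the node itself as a one-input cut
        if not fanins[node]:              # primary input: only the trivial cut
            cuts[node] = {trivial}
            continue
        merged = set()
        # Pick one sub-cut from each fanin and take the union of the picks.
        for combo in product(*(cuts[f] for f in fanins[node])):
            cut = frozenset().union(*combo)
            if len(cut) <= k:             # keep only K-feasible cuts
                merged.add(cut)
        merged.add(trivial)
        cuts[node] = merged
    return cuts

# Hypothetical network: x = g(a, b), y = g(b, c), z = g(x, y).
fanins = {"a": [], "b": [], "c": [], "x": ["a", "b"], "y": ["b", "c"], "z": ["x", "y"]}
order = ["a", "b", "c", "x", "y", "z"]
cuts = enumerate_cuts(fanins, order, 3)
```

With K = 3, node z gets the cuts {x, y}, {a, b, y}, {x, b, c}, {a, b, c}, and the trivial cut {z}.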
9. Complexity Analysis
- The number of cuts on a node in the worst case is $O(n^K)$
- In practice, it is a small constant for small K
Both the maximum and average numbers are obtained by averaging over the 20 largest MCNC benchmarks
10. Delay and Area Propagation
[Figure: example network with nodes a, b, c, d, e, f, g, w, x, y, z; three nodes are annotated "Optimal Delay 1, Area 1" and one node is annotated "Optimal Delay 2, Area 2"]
- The propagation process visits cuts and nodes iteratively
- The longest best delay on the POs is the optimal mapping delay
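A minimal sketch of the delay part of this propagation under the unit delay model is shown below. The function name `propagate_delay`, the hard-coded cut lists, and the toy network are hypothetical; area propagation is left out of this sketch.

```python
def propagate_delay(cuts, order, primary_inputs):
    """Best achievable arrival time per node under the unit delay model.

    cuts: dict node -> iterable of cuts (each a set of input nodes);
          a cut equal to {node} itself is skipped as trivial.
    """
    arrival = {}
    for node in order:                        # topological order, PIs first
        if node in primary_inputs:
            arrival[node] = 0
            continue
        arrival[node] = min(
            1 + max(arrival[i] for i in cut)  # one LUT adds one unit of delay
            for cut in cuts[node]
            if cut != {node}
        )
    return arrival

# Hypothetical cuts for x = g(a, b), y = g(b, c), z = g(x, y).
CUTS = {
    "x": [{"a", "b"}],
    "y": [{"b", "c"}],
    "z": [{"x", "y"}, {"a", "b", "c"}],
}
ORDER = ["a", "b", "c", "x", "y", "z"]
arrival = propagate_delay(CUTS, ORDER, {"a", "b", "c"})

# The longest best delay over the primary outputs is the optimal mapping depth.
optimal_depth = max(arrival[po] for po in ["z"])
```

Here the cut {a, b, c} lets z be implemented in a single LUT, so its best delay is 1 even though the cut {x, y} would give a delay of 2.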
11. Area Estimation
- $A_C = \sum_{i \in input(C)} [A_i / f(i)] + U_C$
- $A_i$: estimated area of the fanin cone on signal $i$
- $f(i)$: fanout number of $i$
- $U_C$: area of the cut itself
- Tries to estimate area considering the fanout effect
- From Praetor [Cong, et al, FPGA99]
- Can under-estimate the area because of node duplications
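[Figure: example network with nodes m, n, o, p, q, r, s, t, u; node p has estimated area $A_p$ and fanout $f(p) = 2$; cuts $C$, $C_t$, and $C_u$ are marked]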
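The area estimate $A_C$ can be evaluated as in the short sketch below. The function name `cut_area`, the default unit area of 1 per LUT, and the numeric values are assumptions chosen for illustration.

```python
def cut_area(cut_inputs, best_area, fanout, unit_area=1.0):
    """Estimated area of a cut: sum of A_i / f(i) over its inputs, plus U_C.

    best_area: dict signal -> estimated area A_i of its fanin cone
    fanout:    dict signal -> fanout number f(i)
    unit_area: area U_C of the cut itself (assumed to be one LUT here)
    """
    return sum(best_area[i] / fanout[i] for i in cut_inputs) + unit_area

# Hypothetical values: p feeds two gates, so each fanout is charged half of A_p.
best_area = {"p": 2.0, "q": 1.0}
fanout = {"p": 2, "q": 1}
area_c = cut_area({"p", "q"}, best_area, fanout)   # 2/2 + 1/1 + 1
```

Dividing by the fanout assumes the cone of p is shared by all of its fanouts. If a selected cut instead covers p and duplicates it, the real area is larger, which is the under-estimation noted above.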
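12. Cost (Area) Function of a Cut
- Some key parameters
- $I_C$: cutsize of $C$
- $N_C$: number of nodes covered by $C$
- $f(v)$: fanout number of the root node $v$
- $P_f$: duplication cost
[Figure: example network with nodes a, b, c, d, e and root node v, showing cuts $C_1$ and $C_2$]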
13. Duplication Cost Adjustment
- Consider potential node duplications
- Check the sub-cuts for multiple fanouts
- Propagate adjusted cost globally
- Duplication cost
- $N_{C_f}$: number of nodes the subcut $C_f$ contains
- $I_C$: cutsize of $C$
[Figure: example network with nodes m, n, o, p, q, r, s; subcut $C_{f1}$; subcut $C_{f2}$ with $N_{C_{f2}} = 1$; new cut $C$ with $I_C = 4$; a node with multiple fanouts is marked]
14. Cut Selection: Mapping Generation
- From POs to PIs
- Critical paths: optimal delay, best area available
- Non-critical paths: relaxed delay, better area
[Figure: example network with nodes a, b, c, d, e, f, g, w, x, y, z; LUT roots in list L at successive steps: L = {f, g}, L = {g, e, d}, L = {e, d}, L = {b}]
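The PO-to-PI traversal with a list L of LUT roots can be sketched as below. This is a simplified illustration: it assumes one cut per node has already been chosen (`best_cut`), whereas the slides describe choosing the cut by cost when the node is visited. The function name and the toy network are hypothetical.

```python
from collections import deque

def generate_mapping(best_cut, primary_inputs, primary_outputs):
    """Walk from POs to PIs, turning every visited node into a LUT root.

    best_cut: dict node -> set of inputs of the cut chosen for that node
    Returns dict LUT root -> inputs of its LUT.
    """
    pending = deque(primary_outputs)   # the list L of LUT roots still to map
    luts = {}
    while pending:
        root = pending.popleft()
        if root in luts or root in primary_inputs:
            continue                   # already mapped, or nothing to map
        luts[root] = best_cut[root]
        # Inputs of the chosen cut become LUT roots themselves (unless PIs).
        pending.extend(sorted(best_cut[root]))
    return luts

# Hypothetical chosen cuts; a, b, c are PIs and f, g are POs.
best_cut = {
    "f": {"d", "e"},
    "g": {"b", "e"},
    "d": {"a", "b"},
    "e": {"b", "c"},
}
luts = generate_mapping(best_cut, {"a", "b", "c"}, ["f", "g"])
```

Node e is an input of both f and g but is mapped only once, so the result uses four LUTs.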
15. Techniques for Better Cut Selection
- Cut selection is equivalent to the min-cover problem
- A greedy approach will not work well
- Use heuristics to guide the selection procedure
- Iterative Cut Selection Procedure
- Local Cost Adjustment
- Input Sharing
- Slack Distribution
- Cut Probing
16. Iterative Cut Selection (ICS)
- Some valuable information on area is unknown until after mapping
- Mapped LUT root nodes
- Duplicated nodes
- ICS carries out multiple mapping iterations
17. Local Cost Adjustment: Input Sharing
- Takes advantage of existing resources
- Considers roots from previous iterations
- The more a cut shares inputs with others, the
better for the cut
[Figure: example with nodes d, e, f, g]
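One way to picture input sharing is to count how many inputs of a candidate cut are already LUT roots. The slides do not give the exact adjustment, so the scoring below (a plain count) and the names `shared_input_count` and `known_roots` are assumptions for illustration only.

```python
def shared_input_count(cut, known_roots):
    """Number of cut inputs that are already LUT roots (existing resources)."""
    return len(set(cut) & set(known_roots))

# Hypothetical state: d and e were selected as LUT roots earlier.
known_roots = {"d", "e"}
candidates = [{"d", "e"}, {"a", "b", "e"}]

# Prefer the candidate that reuses the most existing roots.
best = max(candidates, key=lambda c: shared_input_count(c, known_roots))
```

The cut {d, e} reuses two existing LUT outputs and needs no new roots below it, while {a, b, e} reuses only one.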
18. Local Cost Adjustment: Slack Distribution
- $Slack_C = Req_v - 1 - \max_{i \in input(C)} (Arr_i)$
- If $Slack_C < 0$, $C$ is not a timing-feasible cut
- The larger the $Slack_C$, the better for $C$ in terms of the slack distribution effect
[Figure: cut C rooted at node d, with nodes a, b, c, w, x, y, z; annotations: the largest arrival time among the inputs, and $Req_d$, the required time of the root]
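The slack of a cut follows directly from the formula above: the required time of the root, minus one unit for the LUT, minus the largest arrival time among the cut inputs. The function name `cut_slack` and the numeric values are hypothetical.

```python
def cut_slack(required_root, cut_inputs, arrival):
    """Slack_C = Req_v - 1 - max(Arr_i) over the inputs i of cut C."""
    return required_root - 1 - max(arrival[i] for i in cut_inputs)

def is_timing_feasible(required_root, cut_inputs, arrival):
    """A cut with negative slack cannot meet the required time of its root."""
    return cut_slack(required_root, cut_inputs, arrival) >= 0

# Hypothetical arrival times at three candidate input signals.
arrival = {"a": 0, "b": 1, "c": 2}

slack_ab = cut_slack(3, {"a", "b"}, arrival)   # 3 - 1 - 1
slack_bc = cut_slack(3, {"b", "c"}, arrival)   # 3 - 1 - 2
```

With a required time of 3, both cuts are timing-feasible and {a, b} leaves more slack; with a required time of 2, the cut {b, c} has slack -1 and must be rejected.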
19. Local Cost Adjustment: Cut Probing
- Probe the amount of area gain locally before making decisions about a cut
- Reduce connections between LUTs
- Reduce potential node duplications based on previous duplication profiling
- Reconvergent paths handling
Use $C_{final}$ to guide cut selection
20. Experimental Results: Settings
- DAOmap is implemented in C within the UCLA RASP system
- We compare LUT counts and runtime to CutMap [Cong, FPGA95], a state-of-the-art delay-optimal area minimization algorithm
- Run on a 750 MHz SunBlade 1000 Solaris machine
- Use the 20 largest MCNC benchmarks and a set of industrial benchmarks
- Test on LUT input numbers from 4 to 6
21. Experimental Results of DAOmap over CutMap on MCNC Benchmarks
After mapping
After mapping and packing (cutmap x mpack)
22. Detailed Experimental Results on Industrial Benchmarks
After mapping into 5-LUTs
23. Individual Technique Analysis
24. Mapping Iteration Analysis
[Figure: improvement (y-axis, 0.0 to 2.5) versus number of mapping iterations (x-axis, 1 to 6)]
- For a single iteration only (the base case), use manual profiling [Chen, FPGA04]
- Beyond 3 iterations, additional iterations are no longer helpful
25. Conclusions and Future Work
- We presented a new mapping algorithm, DAOmap, to minimize FPGA delay and area
- We built novel cost-adjustment heuristics and used an iterative mapping procedure
- DAOmap achieved significant area and runtime reductions over CutMap, a state-of-the-art algorithm
- Future work includes adding cut-pruning techniques for mapping with larger K values