DAOmap: A Depth-optimal Area Optimization Mapping Algorithm for FPGA Designs

1
DAOmap: A Depth-optimal Area Optimization Mapping
Algorithm for FPGA Designs

Deming Chen and Jason Cong
Computer Science Department
University of California, Los Angeles

This work is partially supported by the
California MICRO program and the NSF Grant
CCR-0306682
2
Outline

Introduction
Related Works
Definitions and Problem Formulation
Algorithm Description
  Cut Enumeration
  Delay and Area Propagation
  Cost Function for a Cut
  Global and Local Cost Adjustments
  Iterative Cut Selection
Experimental Results
Conclusions and Future Work

3
Introduction

Field Programmable Gate Arrays (FPGAs) have become
increasingly popular
Fast to market
No or very low NRE (non-recurring engineering) cost
The LUT-based FPGA architecture dominates the
existing programmable chip industry
FPGA technology mapping converts a given Boolean
circuit into a functionally equivalent network
comprised only of LUTs
FPGA technology mapping is a crucial optimization
step in the FPGA design flow

4
Related Works on FPGA Mapping

Area Minimization
  Chortle-crf, Francis, et al, DAC91
  MIS-pga, Murgai, et al, ICCAD91
  Praetor, Cong, et al, FPGA99
  Anti-fuse FPGA Mapper, Kang, et al, ASPDAC04
Delay Minimization
  DAG-Map, Chen, et al, DTC92
  FlowMap, Cong, et al, ICCAD92
  Edge-map, Yang, et al, ICCAD94
Power Minimization
  PowerMinMap, Li, et al, ASPDAC03
  Emap, Lamoureux, et al, ICCAD03
  DVmap, Chen, et al, FPGA04
Simultaneous Delay and Area Minimization
  FlowMap-r, Cong, et al, TVLSI94
  CutMap, Cong, et al, FPGA95
  BoolMap-D, Legl, et al, DAC96

5
Definitions

DAG: a Boolean network
Cone C_v: a sub-network rooted at a node v
K-feasible cone: |input(C_v)| ≤ K
Fanin cone F_v: the largest C_v
K-feasible cut: a K-feasible C_v; occupies one K-LUT
Unit delay model: one LUT contributes one unit of delay; no edge delay
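
As a small illustration of the K-feasibility test above, the following Python sketch collects the inputs of a cone and compares their count with K. The example network, the `fanins` dictionary layout, and the function names are assumptions made for this illustration; they are not taken from DAOmap.

```python
def cone_inputs(fanins, cone):
    """Return the nodes outside `cone` that drive at least one node inside it."""
    cone = set(cone)
    return {u for v in cone for u in fanins.get(v, ()) if u not in cone}


def is_k_feasible(fanins, cone, k):
    """A cone is K-feasible when it has at most k inputs."""
    return len(cone_inputs(fanins, cone)) <= k


# Hypothetical network: a, b, c are PIs; d = f(a, b); e = f(b, c); v = f(d, e)
fanins = {"d": ["a", "b"], "e": ["b", "c"], "v": ["d", "e"]}
cone_v = {"v", "d", "e"}

print(sorted(cone_inputs(fanins, cone_v)))  # ['a', 'b', 'c']
print(is_k_feasible(fanins, cone_v, 3))     # True: fits in one 3-LUT
print(is_k_feasible(fanins, cone_v, 2))     # False: three inputs exceed K = 2
```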

6
Problem Formulation

Delay-optimal Area Optimization problem
Given: a Boolean network and an integer K
Goal: cover the network with K-feasible cones
(K-LUTs), such that
the mapping depth is optimal
the area (number of LUTs) is minimized
The area minimization problem is NP-hard

7
Highlights of Our Algorithm

Consider potential node duplications and make
mapping-area estimation close to reality
Search solution space considering both global and
local optimality information
Carry out an iterative cut selection procedure on
top of cost adjustment to further improve
solution quality
Each technique used is simple and intuitive
The key is the right combination of them

8
Cut Enumeration
Combine sub-cuts on the inputs of the gate
Process each gate in topological order from
PIs to POs
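
The two steps above can be sketched in Python as follows. This is a minimal sketch of generic cut enumeration, not the DAOmap implementation: the `fanins` dictionary, the node names, and the choice to keep each node's trivial cut (so that its fanouts can use the node itself as an input) are assumptions for the example.

```python
from itertools import product


def enumerate_cuts(fanins, topo_order, k):
    """Return {node: set of cuts}; each cut is a frozenset of at most k inputs."""
    cuts = {}
    for v in topo_order:                      # PIs first, POs last
        cuts[v] = {frozenset([v])}            # trivial cut: the node itself
        ins = fanins.get(v, [])
        if not ins:                           # primary input, nothing to combine
            continue
        # Pick one sub-cut per fanin and merge them into a cut of v.
        for combo in product(*(cuts[u] for u in ins)):
            merged = frozenset().union(*combo)
            if len(merged) <= k:              # keep only K-feasible cuts
                cuts[v].add(merged)
    return cuts


# Hypothetical network: a, b, c are PIs; d = f(a, b); e = f(b, c); v = f(d, e)
fanins = {"d": ["a", "b"], "e": ["b", "c"], "v": ["d", "e"]}
topo = ["a", "b", "c", "d", "e", "v"]

cuts3 = enumerate_cuts(fanins, topo, k=3)
cuts2 = enumerate_cuts(fanins, topo, k=2)
print(len(cuts3["v"]))  # 5: {v}, {d,e}, {b,c,d}, {a,b,e}, {a,b,c}
print(len(cuts2["v"]))  # 2: {v}, {d,e}
```

Merging one sub-cut per fanin is what makes the worst-case count grow quickly with K, so the size check inside the loop is the only pruning this sketch performs.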
9
Complexity Analysis

The number of cuts on a node in the worst case is
O(n^K)
In practice, it is a small constant for small K

Average over 20 largest MCNC benchmarks
10
Delay and Area Propagation
[Figure: example network with nodes a to g and w to z, annotated with the labels "Delay 1 Area 1" (three times) and "Delay 2 Area 2"]
The propagation process visits cuts and nodes
iteratively
The longest best delay on the POs is
the optimal mapping delay
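
A minimal Python sketch of the delay part of this propagation, under the unit delay model (each LUT adds one unit, PIs arrive at time 0). The cut lists are written out by hand and the names `propagate_delay`, `cuts_k3`, and `cuts_k2` are assumptions for the example; area propagation is left out.

```python
def propagate_delay(cuts, topo_order, primary_inputs):
    """Best arrival time per node: min over cuts of 1 + max arrival of cut inputs."""
    arrival = {}
    for v in topo_order:
        if v in primary_inputs:
            arrival[v] = 0
        else:
            arrival[v] = min(1 + max(arrival[i] for i in cut) for cut in cuts[v])
    return arrival


# Hypothetical network: a, b, c are PIs; d = f(a, b); e = f(b, c); v = f(d, e)
topo = ["a", "b", "c", "d", "e", "v"]
pis = {"a", "b", "c"}
pos = ["v"]

# Non-trivial cuts per node, listed by hand for K = 3 and K = 2.
cuts_k3 = {"d": [{"a", "b"}], "e": [{"b", "c"}], "v": [{"d", "e"}, {"a", "b", "c"}]}
cuts_k2 = {"d": [{"a", "b"}], "e": [{"b", "c"}], "v": [{"d", "e"}]}

depth_k3 = max(propagate_delay(cuts_k3, topo, pis)[po] for po in pos)
depth_k2 = max(propagate_delay(cuts_k2, topo, pis)[po] for po in pos)
print(depth_k3)  # 1: the cut {a, b, c} absorbs d and e into one 3-LUT
print(depth_k2)  # 2: with K = 2, v must be fed by the LUTs of d and e
```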
11
Area Estimation
$A_C = \sum_{i \in \mathrm{input}(C)} \frac{A_i}{f(i)} + U_C$
A_i: estimated area of the fanin cone on signal i
f(i): fanout number of i
U_C: area of the cut itself
Tries to estimate area considering the fanout effect
(Praetor, Cong, et al, FPGA99)
Can under-estimate the area because of node
duplications
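
The estimate can be evaluated directly. The sketch below is only a worked instance of the formula; the signal names, the area and fanout values, and the default U_C = 1 (one LUT per cut) are assumptions chosen for illustration.

```python
def estimate_cut_area(cut_inputs, est_area, fanout, cut_own_area=1.0):
    """A_C = sum over inputs i of A_i / f(i), plus U_C (the area of the cut itself)."""
    return sum(est_area[i] / fanout[i] for i in cut_inputs) + cut_own_area


# Hypothetical values: signal p has a fanin cone of estimated area 2 shared by
# two fanouts (f(p) = 2); signal q has estimated area 1 and a single fanout.
est_area = {"p": 2.0, "q": 1.0}
fanout = {"p": 2, "q": 1}

area_c = estimate_cut_area(["p", "q"], est_area, fanout)
print(area_c)  # 3.0 = 2/2 + 1/1 + 1
```

Dividing by f(p) charges this cut only half of the area below p. If the other fanout of p ends up duplicating those nodes instead of sharing them, the real area is larger than the estimate, which is the under-estimation noted above.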

[Figure: example network with nodes m to u showing Cut C, Cut Ct, and Cut Cu, with f(p) = 2]
12
Cost (Area) Function of a Cut

Some key parameters
I_C: cut size of C
N_C: number of nodes covered by C
f(v): fanout number of the root node v
P_f: duplication cost

[Figure: cuts C1 and C2 in an example network with nodes a to e and v]
13
Duplication Cost Adjustment

Consider potential node duplications
Check the sub-cuts for multiple fanouts
Propagate adjusted cost globally

Duplication cost
N_Cf: number of nodes the subcut C_f contains
I_C: cut size of C

[Figure: new cut C with I_C = 4, formed from subcuts C_f1 and C_f2 (N_Cf2 = 1), with a multiple-fanout node marked]
14
Cut Selection: Mapping Generation

From POs to PIs
Critical paths: optimal delay, best area
available
Non-critical paths: relaxed delay, better area
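
A minimal Python sketch of this backward pass, assuming arrival times and per-cut costs are already known. The cost numbers stand in for the adjusted area cost and are invented for the example, as are the network and the name `generate_mapping`; picking the cheapest timing-feasible cut is a simplification of the selection described in this talk.

```python
import math


def generate_mapping(cuts, arrival, topo_order, primary_inputs, primary_outputs):
    """Walk from POs to PIs, choosing for each needed node the cheapest cut
    that still meets its required time (unit delay model)."""
    depth = max(arrival[po] for po in primary_outputs)
    required = {po: depth for po in primary_outputs}
    mapping = {}
    for v in reversed(topo_order):
        if v in primary_inputs or v not in required:
            continue                      # PI, or node absorbed inside another LUT
        feasible = [
            (cost, sorted(inputs)) for inputs, cost in cuts[v]
            if 1 + max(arrival[i] for i in inputs) <= required[v]
        ]
        _, inputs = min(feasible)         # lowest cost among timing-feasible cuts
        mapping[v] = inputs
        for i in inputs:                  # tighten required times of the cut inputs
            required[i] = min(required.get(i, math.inf), required[v] - 1)
    return mapping


# Hypothetical network: PIs a, b, c, g; d = f(a, b); e = f(d, c); z = f(e, g)
topo = ["a", "b", "c", "g", "d", "e", "z"]
pis = {"a", "b", "c", "g"}
arrival = {"a": 0, "b": 0, "c": 0, "g": 0, "d": 1, "e": 1, "z": 2}
cuts = {  # (cut inputs, invented cost) pairs for K = 3
    "d": [(frozenset("ab"), 1.0)],
    "e": [(frozenset("abc"), 1.0), (frozenset("dc"), 0.5)],
    "z": [(frozenset("eg"), 2.0), (frozenset("dcg"), 2.5)],
}

mapping = generate_mapping(cuts, arrival, topo, pis, primary_outputs=["z", "e"])
print(mapping)  # {'z': ['e', 'g'], 'e': ['a', 'b', 'c']}
```

In this run node e lies on the critical path of z, so its cheaper cut {d, c} is rejected for timing and d is never mapped as a LUT root.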

15
Techniques for Better Cut Selection

Cut selection is equivalent to a min-cover problem
A greedy approach does not work well
Use heuristics to guide the selection
Iterative Cut Selection Procedure
Local Cost Adjustment
  Input Sharing
  Slack Distribution
  Cut Probing

16
Iterative Cut Selection (ICS)

Some valuable information on area is unknown
until after mapping:
  mapped LUT root nodes
  duplicated nodes
ICS carries out multiple mapping iterations

17
Local Cost Adjustment: Input Sharing

Takes advantage of existing resources
Considers roots from previous iterations
The more a cut shares inputs with others, the
better for the cut

18
Local Cost Adjustment: Slack Distribution

$Slack_C = Req_v - 1 - \max_{i \in \mathrm{input}(C)} Arr_i$
If Slack_C < 0, C is not a timing-feasible cut
The larger the Slack_C, the better for C in terms
of slack distribution effect
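
A short worked example of the slack computation under the unit delay model. The required time, the arrival times, and the function name `cut_slack` are invented for illustration.

```python
def cut_slack(required_root, arrival, cut_inputs, lut_delay=1):
    """Slack_C = Req_v - 1 - max arrival time over the inputs of cut C."""
    return required_root - lut_delay - max(arrival[i] for i in cut_inputs)


# Hypothetical arrival times of candidate cut inputs.
arrival = {"a": 1, "b": 1, "c": 2, "x": 3}

print(cut_slack(3, arrival, ["a", "b"]))  # 1: one spare unit to hand to the inputs
print(cut_slack(3, arrival, ["a", "c"]))  # 0: exactly meets the required time
print(cut_slack(3, arrival, ["c", "x"]))  # -1: not timing-feasible
```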

[Figure: cut C in an example network, marking the largest arrival time among the inputs and Req_d, the required time of the root]
19
Local Cost Adjustment: Cut Probing

Probe the amount of area gain locally before
making decisions about a cut
Reduce connections between LUTs
Reduce potential node duplications based on
previous duplication profiling
Handle reconvergent paths

Use C_final to guide cut selection
20
Experimental Results: Settings

DAOmap is implemented in C within the
UCLA RASP system
LUT counts and runtime are compared to CutMap (Cong et
al, FPGA95)
Experiments run on a 750 MHz SunBlade-1000 Solaris machine
Tested with LUT input sizes from 4 to 6
Benchmarks:
  20 largest MCNC benchmarks
  A set of large industrial benchmarks

21
Experimental Results of DAOmap over CutMap on
MCNC Benchmarks

After mapping:

LUT size  Average Area Reduction (%)  Average Run Time Improvement
4-LUT     -13.98                      13.2X
5-LUT     -16.02                      24.2X
6-LUT     -12.44                      4.7X

After mapping and packing, (daomap mpack) vs.
(cutmap x mpack):

LUT size  Average Area Reduction (%)  Average Run Time Improvement
4-LUT     -7.50                       57.7X
5-LUT     -11.31                      38.7X
6-LUT     -7.90                       10.1X
22
Detailed Experimental Results on Industrial
Benchmarks (after mapping into 5-LUTs)

            CutMap                 DAOmap                 Comparison
Benchmark   LUT No.  Run Time (s)  LUT No.  Run Time (s)  LUT Reduction (%)  Run Time Improvement (X)
big1        9928     301           9169     93            -7.6               3.2
big2        -        > 10 h        14625    708           -                  -
big3        10005    28926         9031     106           -9.7               272.9
big4        11800    583           9364     156           -20.6              3.7
big5        -        > 10 h        32230    3377          -                  -
big6        39000    14437         32028    402           -17.9              35.9
Ave.                                                      -13.98             78.9
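
The comparison columns follow from the raw counts: LUT reduction is the relative change in LUT count in percent, and run time improvement is the ratio of the two run times. The Python sketch below recomputes them for the four benchmarks that CutMap completed; the variable names are assumptions for the example, and the numbers are copied from the results above.

```python
# (CutMap LUTs, CutMap run time in s, DAOmap LUTs, DAOmap run time in s)
rows = {
    "big1": (9928, 301, 9169, 93),
    "big3": (10005, 28926, 9031, 106),
    "big4": (11800, 583, 9364, 156),
    "big6": (39000, 14437, 32028, 402),
}

lut_change = {}   # percent change in LUT count (negative means fewer LUTs)
speedup = {}      # CutMap run time divided by DAOmap run time
for name, (cut_luts, cut_time, dao_luts, dao_time) in rows.items():
    lut_change[name] = 100.0 * (dao_luts - cut_luts) / cut_luts
    speedup[name] = cut_time / dao_time
    print(name, round(lut_change[name], 1), round(speedup[name], 1))

avg_change = sum(lut_change.values()) / len(lut_change)
avg_speedup = sum(speedup.values()) / len(speedup)
print(round(avg_speedup, 1))  # 78.9
```

The averages cover only these four rows, since big2 and big5 have no CutMap result to compare against.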
23
Individual Technique Analysis

Techniques dropped
Cut Enumeration
  Min-cost propagation: 4.35
  Global cost adjustment: 2.68
Cut Selection
  Input sharing: 4.55
  Iterative cut selection (ICS): 2.04
  Others: < 1
24
Mapping Iteration Analysis
[Figure: Improvement (y-axis, 0.0 to 2.5) versus Mapping Iterations (x-axis, 1 to 6)]

For a single iteration only (the base case), manual
profiling is used (Chen et al, FPGA04)
More than 3 iterations bring no further
improvement

25
Conclusions and Future Work

We presented a new mapping algorithm, DAOmap, to
minimize FPGA delay and area
We built several cost-adjustment heuristics and
used an iterative mapping procedure
DAOmap achieved significant area and
runtime reductions over the state-of-the-art
algorithm CutMap
Future work includes adding cut-pruning
techniques for mapping with larger K values
