Title: HW/SW Co-Design of Embedded Reconfigurable Architectures
1. HW/SW Co-Design of Embedded Reconfigurable Architectures
- Yanbing Li, Tim Callahan, Ervan Darnell, Randolph Harr, Uday Kurkure, Jon Stockwood
- Synopsys Inc.
- EECS, Univ. of California, Berkeley
- Silicon Spice
2. Outline
- Nimble compiler overview
- HW/SW partitioning algorithm
- Results
- Conclusions
4. Nimble Project
Retargetable compiler for Agileware: configurable systems using intelligent tools
- Off-the-shelf ANSI C to Agileware
- Automatic, quick HW/SW partition and compilation
- Embedded DSP target market
- Goal: show that Agileware provides a better compilation model and better performance than existing DSP / VLIW architectures
- Agileware
- CPU + reconfigurable datapath
- Memory bus interface
- Quick configuration
- Examples: GARP, ACE
[Figure: block diagram with labels Embedded CPU, SW, HW, SRAM / Cache, Reconfigurable Datapath]
7. Nimble Compiler Overview
[Figure: compiler flow]
- Inputs: C code, Agileware Description Language
- Front-end compiler: profiling, optimizations, automatic HW/SW partition
- HW kernels as DFG
- ADAPT datapath synthesis: scheduling, mapping, floorplanning
- FLAME, generator libraries
- Automatic P&R (vendor tools)
- Configuration bit stream
- C code to run on CPU
- Embedded GCC
- Output: mixed HW/SW executable
9. HW/SW Partitioning: the Nimble Approach
- Spatial vs. temporal partition
[Figure: kernels k1 and k2 on the reconfigurable HW, shown once as a spatial partition and once as a temporal partition]
- Loops as HW candidates
- Focus on a small number of dominating loops
- MPEG2 encoder: 180 loops in total, the top 20 contribute >90% of the time
- Instruction-level parallelism
- Compiler transformations generate multiple versions
- Loop transformations, pipelining, etc.
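The idea of focusing on a few dominating loops can be sketched as a coverage cut over profiled loop times. This is only an illustration, not the Nimble implementation: the function name, the 0.90 default and the profile numbers are assumptions chosen to mirror the ">90% of the time" figure on the slide.

```python
def select_dominating_loops(loop_times, coverage=0.90):
    """Return the loops, largest profiled time first, needed to reach
    `coverage` of the total profiled loop time (hypothetical helper)."""
    total = sum(loop_times.values())
    if total <= 0:
        return []
    selected, covered = [], 0.0
    ranked = sorted(loop_times.items(), key=lambda kv: kv[1], reverse=True)
    for name, t in ranked:
        if covered >= coverage * total:
            break  # remaining loops are treated as unimportant
        selected.append(name)
        covered += t
    return selected


# Made-up profile; time units are arbitrary.
profile = {"L1": 50, "L2": 30, "L3": 15, "L4": 3, "L5": 2}
dominating = select_dominating_loops(profile)
```

With this made-up profile, three of the five loops already cover the requested share, so only those three would go on to HW synthesis.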
10. Related Work
- HW/SW partitioning: spatial partitioning
- Single CPU + ASIC
- COSYMA97, Kalavade et al. 94, Wolf94
- Heterogeneous architectures
- SOS92, Jha et al. 97, Li & Wolf 98
- HW/SW partitioning and compilation for reconfigurable architectures
- Callahan98, Kaul et al. 99, Luk et al.
12. Problem Specification
- Partition application onto CPU and configurable datapath (DP)
- Under area constraint
- Goal: maximize overall application performance
- Assumptions
- One HW kernel (loop) per configuration
- HW/SW serial execution
[Figure: SW version, HW versions and partitioning result per loop]
- Loop1: SW version sw1; HW versions hw1.1, hw1.2, hw1.3; result sw1
- Loop2: SW version sw2; HW version hw2.1; result hw2.1
- Loop3: SW version sw3; HW versions hw3.1, hw3.2; result hw3.2
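14. Problem Formulation
Time terms: HW and SW time, HW/SW interface time, config time
- Need global optimization
- A loop's config time depends on the HW/SW partitions of other loops
Example:
    for (i = 1; i <= 100; i++) {
        /* loop A */ for (...) { ... }
        /* loop B */ for (...) { ... }
    }
[Figure: loops A and B inside the outer loop, which runs 100 times]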
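The dependence of one loop's config time on the other loops' partitions can be made concrete with a small cost model. This is a sketch under stated assumptions, not the formulation used in Nimble: the datapath is assumed to hold one kernel at a time with no configuration cache, all times are per loop entry, and every name and number below is hypothetical.

```python
def count_configurations(entry_trace, hw_loops):
    """Count reconfigurations, assuming the datapath holds a single kernel
    and is reloaded whenever a different HW loop is entered."""
    loaded = None
    count = 0
    for loop in entry_trace:
        if loop in hw_loops and loop != loaded:
            loaded = loop
            count += 1
    return count


def total_time(entry_trace, hw_loops, sw_time, hw_time, interface_time, config_time):
    """Sum SW time, HW time, HW/SW interface time and config time
    for one HW/SW partition (serial HW/SW execution assumed)."""
    run = sum(
        (hw_time[l] + interface_time[l]) if l in hw_loops else sw_time[l]
        for l in entry_trace
    )
    return run + config_time * count_configurations(entry_trace, hw_loops)


# Loops A and B entered alternately 100 times, as in the slide's loop nest.
trace = ["A", "B"] * 100
sw = {"A": 10, "B": 10}      # made-up per-entry SW times
hw = {"A": 2, "B": 2}        # made-up per-entry HW times
iface = {"A": 1, "B": 1}     # made-up per-entry interface times

all_sw = total_time(trace, set(), sw, hw, iface, config_time=20)
only_a = total_time(trace, {"A"}, sw, hw, iface, config_time=20)
both_hw = total_time(trace, {"A", "B"}, sw, hw, iface, config_time=20)
```

With these numbers, moving only A to HW needs one configuration, while moving both A and B forces a reload on every entry, so the all-HW partition ends up slower than all-SW. That is why the choice for one loop cannot be made in isolation.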
16. Overview: Our Partitioning Algorithm
Preprocessing
- Interesting loop selection (>1%)
- SW performance profiling: SW times
- Loop entry trace profiling: loop entry trace for config times
- Compiler transformations
- Quick HW synthesis: HW times / areas
HW/SW Partitioning
- Local opt: for each loop, intra-loop selection
- Global opt: for all loops, inter-loop selection
21. Nimble HW/SW Partitioning Process
[Figure: partitioning pipeline over Loop1 to Loop6, in four stages]
- ILD: AllLoops (99) reduced to the top 90% loops, each with its SW version (sw1, sw2, sw3)
- Transform: heuristics to create multiple kernel versions (hw1.1, hw1.2, hw1.3; hw2.1; hw3.1, hw3.2)
- Intra-loop selection: per-loop selection
- Inter-loop selection: application-level selection
25. Intra-Loop Selection
[Figure: design space for a loop, delay vs. HW area, with points sw and hw1 to hw6 and the available area marked]
- Quick synthesis of multiple versions of each loop
- Select the fastest HW loop version within the area constraint
26. Intra-Loop Selection
[Figure: the same design space after selection, with only sw and hw6 remaining]
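The intra-loop rule (pick the fastest HW version that fits the available area) is simple enough to sketch directly. The tuple layout, function name and all area/delay numbers are assumptions made for illustration; the slides give no values.

```python
def select_hw_version(versions, area_available):
    """Return the lowest-delay (name, area, delay) tuple whose area fits,
    or None if no HW version fits and the loop must stay in SW."""
    feasible = [v for v in versions if v[1] <= area_available]
    return min(feasible, key=lambda v: v[2], default=None)


# Made-up quick-synthesis estimates: (name, area, delay).
candidates = [
    ("hw1", 20, 90),
    ("hw2", 40, 70),
    ("hw3", 60, 55),
    ("hw4", 120, 30),  # fastest overall, but too large
    ("hw5", 110, 35),  # too large
    ("hw6", 90, 40),
]
chosen = select_hw_version(candidates, area_available=100)
```

In this example the two fastest versions exceed the area budget, so the selection falls on the fastest of the remaining ones. The chosen version and the SW version are the only ones passed on to inter-loop selection.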
27. Inter-Loop Selection
- Select which loops execute in HW and which in SW
- Approach
- Divide loops into small clusters and perform optimal selection in each loop cluster.
- Loop clustering is based on the loop-procedure hierarchy graph.
28. Inter-Loop Selection
Loop-procedure hierarchy graph (Wavelet benchmark shown)
[Figure: hierarchy graph rooted at Main, with nodes W, I, R, FW, Q, RLE, E, BQ, I1, R1 to R4, FW1 to FW7, Q1 to Q6, RLE1 to RLE3, E1 to E4]
29. Inter-Loop Selection
Clustering of dominating loops based on hierarchy graph
[Figure: the Wavelet loop-procedure hierarchy graph, with its dominating loops grouped into clusters]
37. Optimal Selection in Loop Cluster
- Inter-loop selection inside each loop cluster
- Compute configuration cost of all loops in each partitioning possibility
- Configuration cache considered
- Example
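One way to read "optimal selection in each loop cluster" is exhaustive enumeration of every HW/SW assignment inside the cluster, costing each one against the loop entry trace. The sketch below assumes an LRU configuration cache, per-entry times with interface time folded into the HW time, and made-up numbers; none of these details are given on the slides.

```python
from collections import OrderedDict
from itertools import combinations


def config_loads(trace, hw_loops, cache_slots=1):
    """Count configuration loads with an LRU cache of `cache_slots`
    kernels (the replacement policy is an assumption)."""
    cache = OrderedDict()
    loads = 0
    for loop in trace:
        if loop not in hw_loops:
            continue  # SW loops do not touch the datapath
        if loop in cache:
            cache.move_to_end(loop)  # cache hit, no reload
        else:
            loads += 1
            cache[loop] = True
            if len(cache) > cache_slots:
                cache.popitem(last=False)  # evict least recently used
    return loads


def best_partition(cluster, trace, sw_time, hw_time, config_time, cache_slots=1):
    """Try all 2^C HW/SW assignments of the cluster; return (cost, hw_set)."""
    best = None
    for r in range(len(cluster) + 1):
        for hw in combinations(cluster, r):
            hw_set = set(hw)
            cost = sum(hw_time[l] if l in hw_set else sw_time[l] for l in trace)
            cost += config_time * config_loads(trace, hw_set, cache_slots)
            if best is None or cost < best[0]:
                best = (cost, hw_set)
    return best


trace = ["A", "B"] * 100          # entry trace restricted to the cluster
sw = {"A": 10, "B": 10}           # made-up per-entry SW times
hw = {"A": 3, "B": 4}             # made-up per-entry HW + interface times

no_cache = best_partition(["A", "B"], trace, sw, hw, config_time=20)
with_cache = best_partition(["A", "B"], trace, sw, hw, config_time=20, cache_slots=2)
```

With a single resident configuration, only loop A is worth moving to HW. Once two configurations can be cached, both loops go to HW. The enumeration is exponential in the cluster size, which stays affordable only because clusters are kept small.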
38. Results
[Figure: normalized application exec time (most optimistic assumptions made)]
- Our partitioning algorithm finds optimal or close-to-optimal results for the benchmarks tested
39. Algorithm Exec Performance
[Figure: algorithm CPU time (sec)]
- KS algorithm is fast (< 2 secs for the benchmarks tested) and is not the bottleneck of the Nimble flow
40. Conclusions
- Implemented HW/SW partitioning in the context of the Nimble flow
- Fully automatic flow
- Our algorithm is efficient and effective in finding close-to-optimal HW/SW partitions.
- Global time optimization
- SW time, HW time, config time, HW/SW interface time
- SW time profiling and HW quick synthesis are essential for evaluating HW/SW tradeoffs.
- More compiler transformations and heuristics to drive transformations.
41. Algorithm Optimality
- Overall KS algorithm optimality
- Depends on accuracy of HW and SW time estimation
- Unimportant loops eliminated
- Depends on level of loop clustering
- 1st level guarantees optimality
- Complexity
- Select dominating loops
- Loop entry trace profiling
- Partitioning
Complexity terms: $O(l)$, $O(P)$, $O(kn)$, $O(n^2)$, $O(2^C)$
l = total loops, P = app exec time, k = versions per loop, n = dominating loops, C = max cluster size (constant)
42. Loop Trace Profiling
- Record entry trace of all HW-feasible loops
- Lossless information for computing configuration time
- Online trace compression
- Reduce storage size for trace
- Analysis based on compressed trace
- Example: MPEG2 encoder
- 200MB -> 2KB
Example: Wavelet
Original trace: EDEDED... CBCBCB... EDED... CBCB... EDED... CBCB...
Compressed trace: (ED)^64 (CB)^64 (ED)^32 (CB)^32 (ED)^16 (CB)^16
[Figure: diagram with labels 3, E, D, C, B]
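The Wavelet example suggests a run-length encoding over repeated loop-entry patterns. The sketch below is an offline, greedy version of that idea, written to reproduce the compressed form shown above. The online scheme actually used is not described on the slides, so the function names and the `max_period` bound are assumptions.

```python
def compress_trace(trace, max_period=4):
    """Greedy run-length encoding of repeated patterns: at each position,
    take the shortest pattern (up to `max_period` symbols) that repeats."""
    out = []
    i, n = 0, len(trace)
    while i < n:
        best_p, best_count = 1, 1
        for p in range(1, max_period + 1):
            pattern = trace[i:i + p]
            if len(pattern) < p:
                break  # not enough symbols left for this period
            count = 1
            while trace[i + count * p:i + (count + 1) * p] == pattern:
                count += 1
            if count > 1:
                best_p, best_count = p, count
                break  # shortest repeating period wins
        out.append((trace[i:i + best_p], best_count))
        i += best_p * best_count
    return out


def decompress_trace(runs):
    """Expand (pattern, count) runs back into the original trace."""
    return "".join(pattern * count for pattern, count in runs)


original = "ED" * 64 + "CB" * 64 + "ED" * 32 + "CB" * 32 + "ED" * 16 + "CB" * 16
runs = compress_trace(original)
```

For this trace the 448 loop entries collapse to six (pattern, count) runs, and expanding the runs restores the original exactly, which is the lossless property needed to compute configuration time from the compressed form.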