Title: Increasing Hardware Efficiency with Multifunction Loop Accelerators
1Increasing Hardware Efficiency with Multifunction
Loop Accelerators
- Kevin Fan, Manjunath Kudlur,
- Hyunchul Park, Scott Mahlke
- Advanced Computer Architecture Laboratory
- University of Michigan
- October 25, 2006
2Introduction
- Emerging applications have high performance, cost, and energy demands
- H.264, wireless, software radio, signal processing
- 10-100 Gops required
- 200 mW power budget
- Applications dominated by tight loops processing
large amounts of streaming data
[Figure: CPU and accelerators]
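The performance and power figures on this slide imply an efficiency target. A minimal sketch of that arithmetic follows; the function name is illustrative and not from the talk.

```python
def required_gops_per_watt(gops, power_mw):
    """Efficiency (Gops per watt) needed to deliver `gops` within `power_mw` milliwatts."""
    return gops * 1000 / power_mw

# Slide figures: 10-100 Gops within a 200 mW budget.
low = required_gops_per_watt(10, 200)
high = required_gops_per_watt(100, 200)
print(f"{low:.0f} to {high:.0f} Gops/W")
```

Under these numbers the target is roughly 50 to 500 Gops/W.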
3Loop Accelerators
- Order-of-magnitude performance and efficiency wins
- Viterbi: 100x speedup vs. ARM9
4Prescribed Throughput Accelerators
- Traditional behavioral synthesis
- Directly translate C operators into gates
[Figure: operation graph and datapath]
5Outline
- Loop accelerator schema and design flow
- Cost sensitive scheduling
- Designing multifunction accelerators
- Naïve
- Joint scheduling
- Datapath union
- Synthesis results
6Loop Accelerator Template
- Hardware realization of modulo scheduled loop
- Parameterized execution resources, storage,
connectivity
7Loop Accelerator Design Flow
[Figure: design flow. Labels: C code (.c), performance (throughput), FU Alloc, FU, RF, Abstract Arch]
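The FU allocation step can be illustrated with a small sketch. It assumes fully pipelined FUs, one FU type per operation kind, and a throughput target expressed as an initiation interval (II, cycles between loop iterations). The operation counts and the name `allocate_fus` are hypothetical, not taken from the talk.

```python
def allocate_fus(op_counts, ii):
    """Minimum FUs per operation kind so that all ops of one iteration
    fit into `ii` issue slots per FU (integer ceiling division)."""
    return {kind: (n + ii - 1) // ii for kind, n in op_counts.items()}

# Hypothetical loop body: 5 adds, 2 multiplies, 3 memory ops, target II = 2.
ops = {"add": 5, "mul": 2, "mem": 3}
fus = allocate_fus(ops, ii=2)
print(fus)
```

A lower II (higher throughput) raises the FU count; a higher II lets operations share FUs across cycles.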
8Datapath Derived from Schedule
- Schedule to abstract architecture (FUs)
- Determine register and interconnect requirements
from schedule
[Figure: source code (operands r1, r2, r3, Mem, 12)]
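One simple way to estimate register requirements from a schedule is through value lifetimes. The sketch below assumes each value is live for a known number of cycles and that a new iteration starts every II cycles, so copies from overlapping iterations must be held at once. The lifetimes are hypothetical and the estimate is not necessarily the method used in the talk.

```python
def registers_needed(lifetimes, ii):
    """Per-value register estimate: a value live for `life` cycles overlaps
    with ceil(life / ii) in-flight iterations of a modulo scheduled loop."""
    return {name: (life + ii - 1) // ii for name, life in lifetimes.items()}

# Hypothetical lifetimes in cycles for two values, with II = 2.
lifetimes = {"r1": 3, "r3": 1}
regs = registers_needed(lifetimes, ii=2)
total = sum(regs.values())
print(regs, total)
```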
9Cost Sensitive Scheduling
- Traditional scheduling is hardware unaware
- Intelligent scheduling needed to reduce hardware
cost
[Figure: two schedules of LD1 and LD2 on FU1, FU2, FU3 over time]
- 27% cost reduction with same performance [MICRO '05]
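A toy model shows why the choice of FU for each operation affects cost even when the schedule length is unchanged. It assumes an FU must implement every operation kind assigned to it and that its cost is the sum of those kinds' costs. The cost table, the ADD operations, and the function name are hypothetical.

```python
# Hypothetical relative gate costs per operation kind.
OP_COST = {"ld": 10, "add": 3}

def datapath_cost(schedule):
    """schedule maps op name -> (fu, time slot, kind).
    Each FU pays once for every distinct kind it must support."""
    kinds_per_fu = {}
    for fu, _slot, kind in schedule.values():
        kinds_per_fu.setdefault(fu, set()).add(kind)
    return sum(OP_COST[k] for kinds in kinds_per_fu.values() for k in kinds)

# Both schedules use 2 FUs and 2 time slots (same performance).
unaware = {"LD1": ("FU1", 0, "ld"), "ADD1": ("FU1", 1, "add"),
           "LD2": ("FU2", 1, "ld"), "ADD2": ("FU2", 0, "add")}
aware = {"LD1": ("FU1", 0, "ld"), "LD2": ("FU1", 1, "ld"),
         "ADD1": ("FU2", 0, "add"), "ADD2": ("FU2", 1, "add")}

print(datapath_cost(unaware), datapath_cost(aware))
```

Grouping both loads on one FU means only that FU needs load capability, which halves the cost in this toy model.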
10Multifunction Accelerator
- Map multiple loops to single accelerator
- Improve hardware efficiency via reuse
- Opportunities for sharing
- Disjoint stages (loops 2, 3)
- Pipeline slack (loops 4, 5)
[Figure: application with Loop 1, a "Frame Type?" decision, Loop 2, Loop 3, Loop 4, Block 5]
11Design Strategies
- Naïve method: design single-function accelerators, place side by side
- Misses potential hardware sharing of FUs, storage, interconnect
[Figure: Loop 1 and Loop 2 each pass through a cost sensitive modulo scheduler; the resulting FUs are placed side by side in the multifunction datapath]
12Joint Scheduling
[Figure: Loop 1 and Loop 2 feed a joint cost sensitive modulo scheduler]
- Loops are independent, so the number of possible schedules is exponential in the number of loops!
- Infeasible for modest problems
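The growth of the joint search space can be made concrete with a short sketch. It assumes each loop has some number of candidate schedules and that any combination is a valid joint schedule; the per-loop counts are hypothetical.

```python
import math

def joint_search_space(schedules_per_loop):
    """Joint scheduling must consider every combination of per-loop schedules."""
    return math.prod(schedules_per_loop)

def independent_search_space(schedules_per_loop):
    """Scheduling each loop on its own only explores the sum."""
    return sum(schedules_per_loop)

# Hypothetical: 4 loops with 1000 candidate schedules each.
counts = [1000] * 4
print(independent_search_space(counts), joint_search_space(counts))
```

Each added loop multiplies the joint space by its own schedule count, which is the exponential growth the slide refers to.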
13Multifunction Gate Costs
[Figure: gate costs, x-axis labels A through J]
- 43% average savings over sum of accelerators
14Datapath Union
[Figure: Loop 1 and Loop 2 are scheduled separately by cost sensitive modulo schedulers, each producing its own FUs]
15Datapath Union
- Combine similar components → better hardware sharing → lower cost
- Trade off FU and register cost
- Combining dissimilar FUs can enable register cost savings
- ILP formulation minimizes FU and register cost
[Figure: Accel 1 and Accel 2 merged into a multifunction accelerator]
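The idea of pairing FUs from two accelerators can be sketched with an exhaustive search. This is a stand-in for the ILP formulation named on the slide, not a reproduction of it: it considers FU cost only (no registers or interconnect), assumes a merged FU costs the sum of the distinct operation kinds it supports, and uses a hypothetical cost table.

```python
from itertools import permutations

# Hypothetical relative gate costs per operation kind.
OP_COST = {"add": 3, "mul": 8, "ld": 10}

def fu_cost(kinds):
    return sum(OP_COST[k] for k in kinds)

def best_union(accel1, accel2):
    """Try every one-to-one pairing of FUs and keep the cheapest merged datapath.
    The shorter accelerator is padded with empty FUs."""
    n = max(len(accel1), len(accel2))
    a = accel1 + [frozenset()] * (n - len(accel1))
    b = accel2 + [frozenset()] * (n - len(accel2))
    best = None
    for perm in permutations(range(n)):
        merged = [a[i] | b[j] for i, j in enumerate(perm)]
        cost = sum(fu_cost(m) for m in merged)
        if best is None or cost < best[0]:
            best = (cost, merged)
    return best

accel1 = [frozenset({"add"}), frozenset({"mul"})]
accel2 = [frozenset({"mul"}), frozenset({"add", "ld"})]

naive = sum(fu_cost(f) for f in accel1 + accel2)  # side-by-side placement
union_cost, union_fus = best_union(accel1, accel2)
print(naive, union_cost)
```

Exhaustive pairing grows factorially with the FU count, so it only serves to illustrate the objective on tiny examples.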
16Multifunction Gate Costs
[Figure: gate costs, x-axis labels A through J]
- Smart union within 3% of joint scheduling solution
17Conclusion
- Multifunction accelerators highly effective in exploiting coarse grained hardware sharing
- Joint scheduling achieves 43% average cost savings, but is impractical
- Smart union of independent accelerators achieves 40% average savings
- Compile times of 5 minutes to 1 hour
18Questions?