Title: Integrated Coupling and Clock Frequency Assignment of Accelerators during Hardware/Software Partitioning
1 Integrated Coupling and Clock Frequency Assignment of Accelerators during Hardware/Software Partitioning
- Scott Sirowy and Frank Vahid
- Department of Computer Science and Engineering
- University of California, Riverside
- ssirowy,vahid_at_cs.ucr.edu
- Also with the Center for Embedded Computer Systems at UC Irvine
- This work was supported in part by the National Science Foundation and the Semiconductor Research Corporation
2 Introduction: HW/SW Partitioning
- Speedups of 2X to 10X common
- Balboni, Fornaciari, Sciuto, CODES'96; Eles, Peng, Kuchcinski, Doboli, DAES'97; Gajski, Vahid, Narayan; many others
- Speedups of 1000X possible
- E.g., Cameron project, FCCM'02
[Figure: SW alongside Accelerator A, Accelerator B, and Accelerator C]
3 Introduction: Multiple Clock Domains
ASIC/FPGA
4 Introduction: Two-Level Architecture
[Figure: Accelerators A, B, and C with Memory and DMA; frequency labels 500 MHz, 200 MHz, 1000 MHz]
- Accelerators require single-cycle access to memory
- This requires a single clock, meaning every accelerator must run at the frequency of the slowest accelerator
5 Introduction: Two-Level Architecture
[Figure: two-level architecture with Memory, DMA, and Accelerators A through E on a System Bus and a Peripheral Bus; clock labels Clock 1 (200 MHz), Clock 2, Clock 3, Clock 4]
- Accelerator can now run at its maximum clock frequency (1000 MHz)
6 Previous Work: Two-Level Accelerator Partitioning
[Figure: Accelerators A, B, and C with Memory and DMA; System Bus, Bridge, Peripheral Bus]
- Tightly Coupled (1 clock): single-cycle access to memory, but every accelerator runs at the same frequency
- Loosely Coupled (Clock 2, Clock 3, ..., Clock n): high memory access penalty, but each accelerator can run at its own maximum frequency
7 Two-Level Accelerator Partitioning: Problem Definition
Given a set of accelerators (Accelerator A, Accelerator B, Accelerator C):
- Each with its own maximum frequency: 1000 MHz, 500 MHz, 200 MHz
- Computation cycles for each: 52, 10, 21
- Memory access cycles for each: 10, 5, 21
- And area (in LUTs) for each: 85, 95, 25
- Loosely coupled memory access penalty: 4
[Figure: Tightly Coupled set on a single Clock; Bridge; Loosely Coupled set on Clock 1, Clock 2, ..., Clock n]
Find a mapping of the set of accelerators to the tightly and loosely coupled sets so that the application execution time is minimized.
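The problem definition above can be made concrete with a small exhaustive search. The sketch below is illustrative only: the slides do not state the execution-time formula, so it assumes that tightly coupled accelerators share one clock equal to the slowest maximum frequency in that set, and that the loosely coupled penalty multiplies the memory access cycles. Pairing the listed values with Accelerators A, B, and C in order is also an assumption, area is ignored here, and the names `exec_time` and `best_coupling` are hypothetical.

```python
from itertools import combinations

# Values in the order listed on the slide; pairing them with A, B, C is an assumption.
ACCELS = {
    "A": {"fmax_mhz": 1000, "comp": 52, "mem": 10},
    "B": {"fmax_mhz": 500, "comp": 10, "mem": 5},
    "C": {"fmax_mhz": 200, "comp": 21, "mem": 21},
}
LOOSE_PENALTY = 4  # assumed: memory access cycles are multiplied by 4 when loosely coupled


def exec_time(tight, accels=ACCELS, penalty=LOOSE_PENALTY):
    """Total time in cycles/MHz for a given tightly coupled set (assumed model)."""
    tight = set(tight)
    # Assumed: the tightly coupled set shares one clock, the slowest max frequency in it.
    f_tight = min((accels[n]["fmax_mhz"] for n in tight), default=None)
    total = 0.0
    for name, a in accels.items():
        if name in tight:
            total += (a["comp"] + a["mem"]) / f_tight
        else:
            total += (a["comp"] + penalty * a["mem"]) / a["fmax_mhz"]
    return total


def best_coupling(accels=ACCELS, penalty=LOOSE_PENALTY):
    """Try every subset as the tightly coupled set and keep the fastest mapping."""
    names = sorted(accels)
    candidates = (
        subset for r in range(len(names) + 1) for subset in combinations(names, r)
    )
    return min(candidates, key=lambda s: exec_time(s, accels, penalty))


print(best_coupling())
```

Under these assumptions the search tightly couples only the 200 MHz accelerator, whose memory access count is high relative to its computation, and leaves the two faster accelerators loosely coupled at their own maximum frequencies. Exhaustive search grows exponentially with the number of accelerators, which is why a dynamic programming formulation is attractive.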
8 N-Knapsacks Dynamic Programming (NKDP)
[Figure: Accelerators A (500 MHz), B (200 MHz), C (1000 MHz), ..., N (1200 MHz) divided between the Tightly Coupled and Loosely Coupled sets]
- Sort from fastest to slowest
- Pick any accelerator and assume it will be tightly coupled
- All slower accelerators have to be mapped to the loosely coupled set
9 N-Knapsacks Dynamic Programming (NKDP)
[Figure: accelerators sorted from fastest to slowest and split between the Tightly Coupled and Loosely Coupled sets; knapsack table columns 0, 1, 2, 3, ..., 1000, 1001, ..., Area Constraint]
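The NKDP steps above (sort, pick an accelerator to be tightly coupled, force all slower ones to be loosely coupled) can be sketched as one 0/1 knapsack per candidate accelerator. This is a reconstruction under stated assumptions, not the authors' implementation: it assumes the picked accelerator sets the shared tightly coupled clock, that the area constraint limits the total LUTs of the tightly coupled set, that the loosely coupled penalty multiplies memory access cycles, and that areas are integers. The function names and the `area_budget` parameter are hypothetical.

```python
def knapsack(items, capacity):
    """0/1 knapsack over (name, weight, gain) items; returns (best_gain, chosen_names)."""
    best = [(0.0, ())] * (capacity + 1)
    for name, weight, gain in items:
        if gain <= 0 or weight > capacity:
            continue  # never worth (or never possible) to take this item
        for cap in range(capacity, weight - 1, -1):  # descending: each item used once
            cand = best[cap - weight][0] + gain
            if cand > best[cap][0]:
                best[cap] = (cand, best[cap - weight][1] + (name,))
    return best[capacity]


def nkdp(accels, penalty, area_budget):
    """accels: list of (name, fmax_mhz, comp_cycles, mem_cycles, area_luts)."""
    order = sorted(accels, key=lambda a: a[1], reverse=True)  # fastest first

    def loose_time(a):
        return (a[2] + penalty * a[3]) / a[1]

    all_loose = sum(loose_time(a) for a in order)
    best_time, best_tight = all_loose, ()
    for i, (name, f_tight, comp, mem, area) in enumerate(order):
        if area > area_budget:
            continue
        # Assume this accelerator is the slowest tightly coupled one, so it sets the
        # shared clock; everything after it in the sorted order stays loosely coupled.
        pivot_gain = loose_time(order[i]) - (comp + mem) / f_tight
        # Faster accelerators may join the tightly coupled set if area allows and
        # running at f_tight without the penalty actually saves time.
        items = [
            (a[0], a[4], loose_time(a) - (a[2] + a[3]) / f_tight) for a in order[:i]
        ]
        gain, chosen = knapsack(items, area_budget - area)
        total = all_loose - pivot_gain - gain
        if total < best_time:
            best_time, best_tight = total, chosen + (name,)
    return best_time, best_tight
```

With the problem-definition values paired in order with A, B, and C and a hypothetical budget of 200 LUTs, `nkdp([("A", 1000, 52, 10, 85), ("B", 500, 10, 5, 95), ("C", 200, 21, 21, 25)], penalty=4, area_budget=200)` tightly couples only C. Running one knapsack per candidate clock is what gives the "N knapsacks" structure in this sketch.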
10 Introduction: But ASIC/FPGAs have limited clock resources
- Accelerators may have to share a clock
- Some may run slower than their maximum frequency
[Figure: ASIC/FPGA with a Clock Frequency Module; frequency labels 500 MHz, 233 MHz, 178 MHz, 145 MHz]
11 Clock Frequency Assignment: Problem Definition
Given a set of accelerators, each with its own maximum frequency and total clock cycles:
- Accelerator A: maximum frequency 1000 MHz, total clock cycles 5
- Accelerator B: maximum frequency 500 MHz, total clock cycles 10
- Accelerator C: maximum frequency 200 MHz, total clock cycles 2
Also given: number of available clock frequencies = 2
For every accelerator, find a frequency that is ≤ the accelerator's maximum frequency, with the number of distinct frequency values ≤ the number of available frequencies, such that execution time E is minimized → Clock Frequency Assignment problem
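The Clock Frequency Assignment problem above can be sketched as a dynamic program over accelerators sorted by maximum frequency. This is a minimal sketch, not necessarily the formulation used in the talk: it assumes execution time is the sum of cycles divided by assigned frequency, and it relies on the observation that accelerators sharing a clock form a contiguous group in the sorted order, with the group running at its slowest member's maximum frequency. The name `assign_clocks` is hypothetical.

```python
from functools import lru_cache


def assign_clocks(accels, num_clocks):
    """accels: list of (name, fmax_mhz, cycles). Returns (total_time, {name: freq_mhz})."""
    order = sorted(accels, key=lambda a: a[1], reverse=True)  # fastest first
    n = len(order)

    def group_time(i, j):
        # order[i:j] share one clock: the slowest maximum frequency in the group.
        freq = order[j - 1][1]
        return sum(a[2] for a in order[i:j]) / freq

    @lru_cache(maxsize=None)
    def solve(i, clocks_left):
        # Best (time, group end points) for order[i:] using at most clocks_left clocks.
        if i == n:
            return 0.0, ()
        if clocks_left == 0:
            return float("inf"), ()
        best = (float("inf"), ())
        for j in range(i + 1, n + 1):
            rest_time, rest_cuts = solve(j, clocks_left - 1)
            cand = group_time(i, j) + rest_time
            if cand < best[0]:
                best = (cand, (j,) + rest_cuts)
        return best

    total, cuts = solve(0, num_clocks)
    assignment, start = {}, 0
    for end in cuts:
        for a in order[start:end]:
            assignment[a[0]] = order[end - 1][1]
        start = end
    return total, assignment


total, freqs = assign_clocks([("A", 1000, 5), ("B", 500, 10), ("C", 200, 2)], 2)
print(total, freqs)
```

For the three accelerators above and two available frequencies, the sketch runs A and B at 500 MHz and C at 200 MHz for a total of about 0.04 in units of cycles/MHz, compared with 0.085 when only one frequency is available. If the cycle counts are read as millions of cycles, 0.04 corresponds to the ".040 s" optimum reported on the Dynamic Programming Solution slide.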
12 Dynamic Programming Solution
[Figure: dynamic programming table with entries .005, .005, .030, .025, .040, .085]
Optimal Execution Time: .040 s
13 Coupling and Clock Assignment Integration Heuristics
- How do we combine these two optimal solutions?
- Sequential Search
- Straightforward
- Returns suboptimal solutions
- Need a better heuristic
- No Penalty Migration
- Nested Dynamic Programming
14 No Penalty Migration
Only two clocks available for the entire design (Available Clocks: 2)
[Figure: accelerators a_a (500 MHz), a_b (200 MHz), a_c (1000 MHz), and a_n (1200 MHz) split between the Tightly Coupled and Loosely Coupled sets; clock labels Clk_1000 MHz, Clk_200 MHz, Clk_500 MHz]
- Run NKDP to determine optimal two-level coupling
15 No Penalty Migration
Only two clocks available for the entire design (Available Clocks: 2)
[Figure: accelerators a_a (500 MHz), a_b (200 MHz), a_c (1000 MHz), and a_n (1200 MHz) split between the Tightly Coupled and Loosely Coupled sets; clock labels Clk_500 MHz, Clk_200 MHz]
- Run NKDP to determine optimal two-level coupling
- Determine if any loosely coupled accelerators should migrate back to the tightly coupled set
- Rerun clock partitioning on the loosely coupled set with the modified accelerator set
16 Nested Dynamic Programming
Available Clocks: 2
[Figure: accelerators sorted from fastest to slowest, N (1200 MHz), C (1000 MHz), A (500 MHz), B (200 MHz), split between the Tightly Coupled set (Clk_500MHz) and the Loosely Coupled set; table entries .005, .005, .030, .025]
17 Nested Dynamic Programming
Available Clocks: 2
[Figure: accelerators N, C, A, and B sorted from fastest to slowest and split between the Tightly Coupled and Loosely Coupled sets; labels: Clock Frequency Assignment, N-Knapsacks Dynamic Programming Iteration]
18 Results: H.264 Decoder
- Heuristics find the same solutions for a larger number of available frequencies
- Over 3x speedup over a single-frequency, single-coupling implementation
19 Results: Synthetic Benchmarks
[Figure: results with 2 available clock frequencies]
- Nested Dynamic Programming consistently finds a better partitioning than the other heuristics
- On average, No Penalty Migration yielded 15% better solutions than Sequential Search, and Nested Dynamic Programming yielded 15% better solutions than No Penalty Migration
20 Results: Synthetic Benchmarks
[Figure: results with 2 available clock frequencies and with 8 available clock frequencies]
- Nested Dynamic Programming consistently finds a better partitioning than the other heuristics
- Average 5x speedup
21 Conclusions
- Consideration of both coupling and multiple clock domains can lead to substantial speedup over implementations that don't consider either
- 5x speedup
- On top of already gained HW/SW speedups
- Developed heuristics that integrate coupling and clock frequency assignment
- Heuristics run in seconds