Title: Clock-Frequency Assignment for Multiple Clock Domain Systems-on-a-Chip
1 Clock-Frequency Assignment for Multiple Clock Domain Systems-on-a-Chip
- Scott Sirowy, Yonghui Yu, Stefano Lonardi, Frank Vahid
- Department of Computer Science and Engineering
- University of California, Riverside
- ssirowy, yonghui, stelo, vahid_at_cs.ucr.edu
- Also with the Center for Embedded Computer Systems at UC Irvine
- This work was supported in part by the National Science Foundation and the Semiconductor Research Corporation
2 Introduction: HW/SW Partitioning
- Speedups of 2X to 10X common
- Balboni, Fornaciari, Sciuto (CODES96); Eles, Peng, Kuchcinski, Doboli (DAES97); Gajski, Vahid, Narayan; many others
- Speedups of 1000X possible
- E.g., Cameron project, FCCM02
[Figure: SW code alongside Accelerator A, Accelerator B, and Accelerator C]
3 Introduction: Multiple Clock Domains
[Figure: ASIC/FPGA]
4 Introduction: But ASIC/FPGAs Have Limited Clock Resources
- Accelerators may have to share clock
- Some may run slower than their max frequency
[Figure: ASIC/FPGA with a Clock Frequency Module; frequencies shown: 500 MHz, 233 MHz, 178 MHz, 145 MHz]
5 Clock Frequency Assignment: Problem Definition
Given a set of accelerators, each with its own maximum frequency and total clock cycles:
- Accelerator A: maximum frequency 1000 MHz, total clock cycles 5
- Accelerator B: maximum frequency 500 MHz, total clock cycles 10
- Accelerator C: maximum frequency 200 MHz, total clock cycles 2
Also given: number of available clock frequencies = 2
For every accelerator, find a frequency that is <= the accelerator's max freq, with the number of distinct frequency values <= the number of available frequencies, such that execution time E is minimized. This is the Clock Frequency Assignment problem.
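To make the problem statement concrete, here is a minimal brute-force sketch for the three-accelerator example. It is not the authors' method, only an exhaustive check. Assumptions not stated on the slide: E is the sum of cycles / frequency over all accelerators (following the arithmetic used on the later slides), candidate clock values are taken from the accelerators' own maximum frequencies, and the names `execution_time` and `brute_force` are hypothetical.

```python
from itertools import combinations

def execution_time(cycles, max_freqs, clocks):
    """Total time when each accelerator uses the fastest clock it can tolerate."""
    total = 0.0
    for c, fmax in zip(cycles, max_freqs):
        usable = [f for f in clocks if f <= fmax]
        if not usable:
            return float("inf")  # this clock set cannot drive the accelerator
        total += c / max(usable)  # assumed cost model: cycles / frequency
    return total

def brute_force(cycles, max_freqs, num_clocks):
    """Try every set of clock values drawn from the accelerators' max frequencies."""
    candidates = sorted(set(max_freqs))
    k = min(num_clocks, len(candidates))
    best = (float("inf"), ())
    for clocks in combinations(candidates, k):
        best = min(best, (execution_time(cycles, max_freqs, clocks), clocks))
    return best

# Slide example: accelerators A, B, C with two available clock frequencies.
cycles = [5, 10, 2]
max_freqs = [1000, 500, 200]
E, clocks = brute_force(cycles, max_freqs, 2)
print(E, clocks)
```

Exhaustive search like this grows combinatorially with the number of accelerators, so it is only useful as a reference for small instances.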
6 Heuristics vs. Optimal Solution
- Could develop heuristic, but
- Clock Partitioning likely is subpart of larger (iterative) exploration
- Suboptimal solutions to sub-parts could accumulate
- Is there a fast optimal solution?
- We developed a fast dynamic programming approach
[Figure: Accelerator A, Accelerator B, Accelerator C -> Clock Frequency Assignment -> Optimal Mapping]
7 Dynamic Programming Solution
[Table: Accelerators by Clocks]
8 Dynamic Programming Solution
[Table: Accelerators by Clocks]
X(1,1) = 5/1000 = .005
X(2,1) = (5+10)/500 = .030
X(3,1) = (5+10+2)/200 = .085
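The single-clock column can be summarized in one formula. The notation here is an assumption, not from the slides: c_k is the total clock cycles of accelerator k, f_i is the maximum frequency of accelerator i, and accelerators are ordered by decreasing maximum frequency. With one shared clock, the first i accelerators can run no faster than the slowest of them, so:

```latex
X(i,1) = \frac{\sum_{k=1}^{i} c_k}{f_i},
\qquad f_1 \ge f_2 \ge \dots \ge f_i
% Example from the slide: X(3,1) = (5 + 10 + 2) / 200 = 0.085
```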
9 Dynamic Programming Solution
[Table: Accelerators by Clocks]
X(1,1) = 5/1000 = .005
X(2,1) = (5+10)/500 = .030
X(3,1) = (5+10+2)/200 = .085
10 Dynamic Programming Solution
Algorithm time complexity: O(nF^2)
[Table: Accelerators by Clocks]
X(1,2) = 5/1000 = .005
X(2,2) = 5/1000 + 10/500 = .025
X(3,2) = min of:
  X(2,1) + 2/200 = .040
  X(1,1) + (10+2)/200 = .065
E = .040 seconds
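The table-filling steps above can be sketched in code. This is a minimal illustration consistent with the slide's arithmetic, not the authors' implementation. Assumptions: accelerators are sorted by decreasing maximum frequency, each group of accelerators sharing a clock runs at the lowest maximum frequency in the group, and the function name `assign_clocks` is hypothetical.

```python
def assign_clocks(cycles, max_freqs, num_clocks):
    """Return (E, X), where X[i][j] corresponds to the slide's X(i+1, j+1)."""
    # Assumed ordering: highest maximum frequency first (A, B, C on the slide).
    order = sorted(range(len(cycles)), key=lambda i: -max_freqs[i])
    c = [cycles[i] for i in order]
    f = [max_freqs[i] for i in order]
    n = len(c)

    # prefix[i] = total cycles of the first i accelerators.
    prefix = [0]
    for x in c:
        prefix.append(prefix[-1] + x)

    X = [[float("inf")] * num_clocks for _ in range(n)]
    for i in range(n):
        # One clock: accelerators 0..i all run at f[i], the slowest maximum.
        X[i][0] = prefix[i + 1] / f[i]
        for j in range(1, num_clocks):
            best = X[i][j - 1]  # an extra clock never hurts
            for k in range(i):
                # Accelerators 0..k use j clocks; k+1..i share one clock at f[i].
                shared = (prefix[i + 1] - prefix[k + 1]) / f[i]
                best = min(best, X[k][j - 1] + shared)
            X[i][j] = best
    return X[n - 1][num_clocks - 1], X

# Slide example: cycles 5, 10, 2 and maximum frequencies 1000, 500, 200 MHz.
E, X = assign_clocks([5, 10, 2], [1000, 500, 200], 2)
for row in X:
    print([round(v, 3) for v in row])
print("E =", round(E, 3))
```

Each table entry in this sketch scans all earlier accelerators, so the work grows with the number of clocks times the square of the number of accelerators.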
11 Dynamic Programming Solution
[Table: Accelerators by Clocks]
Accelerators   1 clock   2 clocks
1              .005      .005
2              .030      .025
3              .085      .040
E = .040 seconds
12 Example: H.264 Decoder
Function                        SW time (s)   HW cycles   HW max clk freq (MHz)
<MotionComp_00>                 0.040733      1           281
<InvTransform4x4>               0.034787      8           194
<FindHorizontalBS>              0.025026      1           140
<GetBits>                       0.024681      1           200
<FindVerticalBS>                0.02366       1           140
<MotionCompChromaFullXFullY>    0.023577      1           285
<FilterHorizontalLuma>          0.023559      4           134
<FilterVerticalLuma>            0.020008      4           138
<FilterHorizontalChroma>        0.018803      4           134
<CombineZerosInvQuantScan>      0.018438      1           120
<MotionCompensate>              0.016822      10          40
<FilterVerticalChroma>          0.016035      4           138
<MotionChromaFracXFracY>        0.016023      32          78
<ReadLeadingZerosAndOne>        0.015665      1           106
13 H.264 Results
- Over 2.7x speedup over a design with only one clock frequency
- Only a few distinct frequencies give most speedup
14 Results: Synthetic Benchmarks
[Figure: Algorithm runtimes] <1 sec even for largest examples
[Figure: Application runtimes] Over 3.5x speedup
15 Conclusions
- Multiple clock domains can yield significant speedups during partitioning
- 1.5x to 3x speedups using just three clock frequencies
- On top of speedups already gained from HW/SW partitioning
- Efficient optimal algorithm is possible
- Dynamic programming approach
- Likely applies to other clock-frequency assignment problems