Integrated Coupling and Clock Frequency Assignment of Accelerators during Hardware/Software Partitioning

1
Integrated Coupling and Clock Frequency
Assignment of Accelerators during
Hardware/Software Partitioning
  • Scott Sirowy and Frank Vahid
  • Department of Computer Science and Engineering
  • University of California, Riverside
  • {ssirowy,vahid}@cs.ucr.edu
  • Also with the Center for Embedded Computer
    Systems at UC Irvine
  • This work was supported in part by the National
    Science Foundation and the Semiconductor Research
    Corporation

2
Introduction: HW/SW Partitioning
  • Speedups of 2X to 10X common
  • [Balboni, Fornaciari, Sciuto CODES'96; Eles, Peng,
    Kuchcinski, Doboli DAES'97; Gajski, Vahid,
    Narayan; many others]
  • Speedups of 1000X possible
  • E.g., Cameron project, FCCM'02

[Figure: SW code regions alongside Accelerator A, Accelerator B, and Accelerator C]
3
Introduction: Multiple Clock Domains
[Figure: ASIC/FPGA]
4
Introduction: Two-Level Architecture
[Figure: Accelerators A, B, and C with Memory and DMA; frequency labels 500 MHz, 200 MHz, 1000 MHz]
Requires single-cycle access to memory
Requires a single clock, meaning every accelerator
must run at the slowest accelerator's frequency
5
Introduction: Two-Level Architecture
[Figure: two-level architecture. Labels: Memory, DMA, System Bus, Peripheral Bus, Accelerators A to E, Clock 1, Clock 2, Clock 3, Clock 4, 200 MHz, 1000 MHz]
Accelerator can now run at its maximum clock
frequency
6
Previous Work: Two-Level Accelerator Partitioning
[Figure: Accelerators A, B, and C with Memory and DMA on the System Bus; a Bridge leads to the Peripheral Bus]
  • Tightly Coupled (1 clock): single-cycle access to memory, but every
    accelerator runs at the same frequency
  • Loosely Coupled (Clock 2, Clock 3, …, Clock n): high memory access penalty,
    but each accelerator can run at its own maximum frequency
7
Two-Level Accelerator Partitioning: Problem
Definition
Given a set of accelerators (Accelerator A, Accelerator B, Accelerator C)
  • Each with its own maximum frequency: 1000 MHz, 500 MHz, 200 MHz
  • Computation cycles for each: 52, 10, 21
  • Memory access cycles for each: 10, 5, 21
  • And area (in LUTs) for each: 85, 95, 25
[Figure: Tightly Coupled set on a single Clock; Loosely Coupled set on Clock 1, Clock 2, …, Clock n behind a Bridge]
Loosely coupled memory access penalty: 4
Find a mapping of the set of accelerators to the
tightly and loosely coupled sets so that the
application execution time is minimized
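The problem definition above can be made concrete with a small exhaustive search. This is only a sketch under stated assumptions, not the authors' cost model: the slide values are read in the order A, B, C; the penalty of 4 is treated as a multiplier on memory access cycles for loosely coupled accelerators; tightly coupled accelerators share the slowest maximum frequency in their set; and time is reported in cycles per MHz. The names `Accel`, `exec_time`, `best_mapping`, and `tight_area_budget` are hypothetical.

```python
from itertools import product
from typing import NamedTuple


class Accel(NamedTuple):
    name: str
    fmax_mhz: float   # maximum clock frequency
    comp_cycles: int  # computation cycles
    mem_cycles: int   # memory access cycles
    area_luts: int    # area in LUTs


def exec_time(accels, tight, penalty=4):
    """Total time (cycles / MHz) of one mapping under the assumed model."""
    tight_freqs = [a.fmax_mhz for a, t in zip(accels, tight) if t]
    # Assumption: the tightly coupled set shares one clock, set by its slowest member.
    shared = min(tight_freqs) if tight_freqs else None
    total = 0.0
    for a, t in zip(accels, tight):
        if t:
            total += (a.comp_cycles + a.mem_cycles) / shared
        else:
            # Assumption: each memory access cycle costs `penalty` cycles when loosely coupled.
            total += (a.comp_cycles + penalty * a.mem_cycles) / a.fmax_mhz
    return total


def best_mapping(accels, penalty=4, tight_area_budget=None):
    """Try every tight/loose assignment and keep the fastest feasible one."""
    best = None
    for tight in product((False, True), repeat=len(accels)):
        area = sum(a.area_luts for a, t in zip(accels, tight) if t)
        if tight_area_budget is not None and area > tight_area_budget:
            continue  # hypothetical area limit on the tightly coupled set
        t = exec_time(accels, tight, penalty)
        if best is None or t < best[0]:
            best = (t, tight)
    return best


# Slide values, read in the order A, B, C (an assumption about the figure layout).
ACCELS = [
    Accel("A", 1000, 52, 10, 85),
    Accel("B", 500, 10, 5, 95),
    Accel("C", 200, 21, 21, 25),
]

best_time, tight = best_mapping(ACCELS)
print(best_time, dict(zip("ABC", tight)))
```

Under these assumptions the search tightly couples only Accelerator C (about 0.362 time units, versus about 0.677 with everything loosely coupled). Exhaustive search is exponential in the number of accelerators, which is why a dynamic programming formulation is attractive.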
8
N-Knapsacks Dynamic Programming (NKDP)
  • 0-1 Knapsack Variant

Sort from fastest to slowest
Pick any accelerator and assume it will be
tightly coupled
All slower accelerators have to be mapped to the
loosely coupled set
[Figure: Accelerator A (500 MHz), Accelerator B (200 MHz), Accelerator C (1000 MHz), …, Accelerator N (1200 MHz), divided between the Tightly Coupled and Loosely Coupled sets]
9
N-Knapsacks Dynamic Programming (NKDP)
  • 0-1 Knapsack Variant

Sort from fastest to slowest
[Figure: dynamic programming table with rows for Accelerators N, C, A, B and columns for Area Constraint values 0, 1, 2, 3, …, 1000, 1001, …; accelerators are placed in the Tightly Coupled or Loosely Coupled set]
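One possible reading of the NKDP steps on these two slides: sort by maximum frequency, treat each accelerator in turn as the slowest tightly coupled one (which fixes the shared clock and forces all slower accelerators to be loosely coupled), then solve a 0-1 knapsack over the faster accelerators. The sketch below follows that reading and is not the authors' implementation. It assumes the area constraint limits the tightly coupled set, that the loosely coupled penalty multiplies memory access cycles, and that tuples are laid out as `(fmax_mhz, comp_cycles, mem_cycles, area_luts)`. All function names are hypothetical.

```python
def knapsack_max_gain(items, capacity):
    """Classic 0-1 knapsack: items are (gain, area) pairs, areas are integers."""
    best = [0.0] * (capacity + 1)
    for gain, area in items:
        if gain <= 0:
            continue  # tight coupling would not help this accelerator
        for c in range(capacity, area - 1, -1):
            best[c] = max(best[c], best[c - area] + gain)
    return best[capacity]


def nkdp_time(accels, area_budget, penalty=4):
    """Best total time (cycles / MHz); accels are (fmax_mhz, comp, mem, area)."""

    def loose(a):
        return (a[1] + penalty * a[2]) / a[0]

    def tight(a, f):
        return (a[1] + a[2]) / f

    order = sorted(accels, key=lambda a: a[0], reverse=True)  # fastest first
    all_loose = sum(loose(a) for a in order)
    best = all_loose  # empty tightly coupled set is always allowed
    for p, pivot in enumerate(order):
        if pivot[3] > area_budget:
            continue
        f = pivot[0]  # pivot is the slowest tightly coupled accelerator
        # Only faster accelerators (earlier in the order) may join the pivot.
        items = [(loose(a) - tight(a, f), a[3]) for a in order[:p]]
        gain = loose(pivot) - tight(pivot, f)
        gain += knapsack_max_gain(items, area_budget - pivot[3])
        best = min(best, all_loose - gain)
    return best


# Values from the problem-definition slide, read in the order A, B, C (assumption).
accels = [(1000, 52, 10, 85), (500, 10, 5, 95), (200, 21, 21, 25)]
print(nkdp_time(accels, area_budget=120))
print(nkdp_time(accels, area_budget=20))
```

One knapsack is solved per candidate pivot, which matches the "N knapsacks" name; each knapsack is pseudo-polynomial in the area budget.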
10
Introduction: But ASICs/FPGAs have limited clock
resources
  • Accelerators may have to share a clock
  • Some may run slower than their max frequency

[Figure: ASIC/FPGA with clock frequency modules at 500 MHz, 233 MHz, 178 MHz, and 145 MHz]
11
Clock Frequency Assignment: Problem Definition
Given a set of accelerators (Accelerator A, Accelerator B, Accelerator C)
  • Each with its own maximum frequency: 1000 MHz, 500 MHz, 200 MHz
  • And total clock cycles for each: 5, 10, 2
Also given the number of available clock frequencies: 2

For every accelerator, find a frequency that is ≤ the
accelerator's max frequency, with the number of distinct
frequency values ≤ the number of available frequencies, such that
execution time E is minimized → Clock Frequency
Assignment problem
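A dynamic programming sketch for this problem is shown below. It is not taken from the presentation; it relies on the observation that, once accelerators are sorted by maximum frequency, an optimal assignment groups neighbours in that order and runs each group at its slowest member's maximum frequency. The cycle counts 5, 10, and 2 are read as millions of cycles for A, B, and C in that order (an assumption), so that cycles divided by MHz gives seconds. The name `assign_clocks` and the tuple layout are hypothetical.

```python
from functools import lru_cache


def assign_clocks(accels, num_clocks):
    """accels: (name, fmax_mhz, mcycles). Returns (total_seconds, {name: MHz})."""
    order = sorted(accels, key=lambda a: a[1], reverse=True)  # fastest first
    n = len(order)
    prefix = [0]
    for _, _, cycles in order:
        prefix.append(prefix[-1] + cycles)  # running sum of cycle counts

    @lru_cache(maxsize=None)
    def solve(i, k):
        """Best (time, group end indices) for order[i:] with at most k clocks."""
        if i == n:
            return 0.0, ()
        if k == 0:
            return float("inf"), ()
        best = (float("inf"), ())
        for j in range(i + 1, n + 1):
            # order[i:j] shares one clock: the max frequency of its slowest member.
            group_time = (prefix[j] - prefix[i]) / order[j - 1][1]
            rest_time, rest_cuts = solve(j, k - 1)
            if group_time + rest_time < best[0]:
                best = (group_time + rest_time, (j,) + rest_cuts)
        return best

    total, cuts = solve(0, num_clocks)
    freqs, start = {}, 0
    for j in cuts:
        for name, _, _ in order[start:j]:
            freqs[name] = order[j - 1][1]
        start = j
    return total, freqs


# Slide values, read in the order A, B, C; cycles taken as millions (assumption).
total, freqs = assign_clocks([("A", 1000, 5), ("B", 500, 10), ("C", 200, 2)], 2)
print(total, freqs)
```

With two clocks, this sketch runs A and B at 500 MHz and C at 200 MHz for a total of about 0.040 s, which agrees with the optimal execution time reported on the next slide. A single clock at 200 MHz would give about 0.085 s.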
12
Dynamic Programming Solution
[Figure: dynamic programming table with entries .005, .005, .030, .025, .040, .085]
Optimal Execution Time: .040 s
13
Coupling and Clock Assignment Integration
Heuristics
  • How do we combine these two optimal solutions?
  • Sequential Search
      • Straightforward
      • Returns suboptimal solutions
  • Need a better heuristic
      • No Penalty Migration
      • Nested Dynamic Programming

14
No Penalty Migration
Only two clocks available for entire design
Available Clocks: 2
[Figure: accelerators A (500 MHz), B (200 MHz), C (1000 MHz), …, N (1200 MHz); Accelerators C, N, A, B divided between the Tightly Coupled and Loosely Coupled sets; clock labels Clk_1000 MHz, Clk_200 MHz, Clk_500 MHz]
  • Run NKDP to determine optimal two-level coupling

15
No Penalty Migration
Only two clocks available for entire design
Available Clocks: 2
[Figure: accelerators A (500 MHz), B (200 MHz), C (1000 MHz), …, N (1200 MHz); Accelerators C, N, A, B divided between the Tightly Coupled and Loosely Coupled sets; clock labels Clk_500 MHz, Clk_200 MHz]
  • Run NKDP to determine optimal two-level coupling
  • Determine if any loosely coupled accelerators
    should migrate back to tightly coupled set
  • Rerun clock partitioning on loosely coupled set
    with modified accelerator set

16
Nested Dynamic Programming
Available Clocks: 2
Sort from fastest to slowest
[Figure: table over Accelerators N (1200 MHz), C (1000 MHz), A (500 MHz), B (200 MHz), split into Tightly Coupled (Clk_500MHz) and Loosely Coupled sets; entries .005, .005, .030, .025]
17
Nested Dynamic Programming
Available Clocks: 2
Sort from fastest to slowest
[Figure: labels Clock Frequency Assignment and N-Knapsacks Dynamic Programming Iteration; Accelerators N, C, A, B; Tightly Coupled and Loosely Coupled sets]
18
Results: H.264 Decoder
Heuristics find the same solutions for a larger
number of available frequencies
Over 3x speedup over a single-frequency,
single-coupling implementation
19
Results: Synthetic Benchmarks
Nested Dynamic Programming consistently finds a
better partitioning than other heuristics
On average, No Penalty Migration yielded 15%
better solutions than Sequential Search, and
Nested Dynamic Programming yielded 15% better
solutions than No Penalty Migration
2 available clock frequencies
20
Results: Synthetic Benchmarks
Nested Dynamic Programming consistently finds a
better partitioning than other heuristics
Average 5x speedup
2 available clock frequencies
8 available clock frequencies
21
Conclusions
  • Consideration of both coupling and multiple clock
    domains can lead to substantial speedup over
    implementations that don't consider either
  • 5x speedup
  • On top of already gained hw/sw speedups
  • Developed heuristics that integrate coupling and
    clock frequency assignment
  • Heuristics run in seconds