Data Partitioning for Reconfigurable Architectures with Distributed Block RAM

1
Data Partitioning for Reconfigurable
Architectures with Distributed Block RAM

Wenrui Gong, Gang Wang, Ryan Kastner
Department of Electrical and Computer Engineering
University of California, Santa Barbara
{gong, wanggang, kastner}@ece.ucsb.edu
http://express.ece.ucsb.edu
June 10, 2005

2
What are we dealing with?

Mapping high-level programs onto FPGA-based
reconfigurable computing architectures with
distributed block RAM modules
Objective: improve utilization of available
storage resources, optimize system performance,
and meet design goals

3
Outline

Target architectures
Data partitioning problem
Memory optimizations
Experimental results
Concluding remarks

5
Target Architecture

FPGA-based fine-grained reconfigurable computing
architecture with distributed block RAM modules

6
Memory Access Latencies

Memory access delay includes the block RAM access
delay and propagation delays. Propagation delays
are variable.
One clock cycle to access nearby data; two or
more to access data far away from the CLB.
Difficult to tell which block RAM modules are near
and which are remote before physical synthesis
More difficult than traditional data partitioning
in parallelizing compilation for NUMA machines

7
Outline

Target architectures
Data partitioning problem
Problem formulation
Data partitioning algorithm
Memory optimizations
Experimental results
Concluding remarks

8
Problem Formulation

Inputs:
An l-level nested loop L
A set of n data arrays N
An architecture with m block RAM modules M
Assumptions:
Index expressions of array references are affine
functions of loop indices
No indirect array references or other similar
pointer operations
All data arrays are assigned to block RAM modules
No duplicate data

9
Problem Formulation (cont'd)

Partitioning problem: partition the data arrays N
into a set of data portions P, and find an
assignment from P to the block RAM modules M.
Constraints:
1) hardware resource constraint
2) capacity constraint of each block RAM module
3) all data arrays are assigned to block RAM, and
each data element is assigned to one and only one
block RAM module.
Objective: minimize the total execution time (or
maximize the system throughput) under the above
constraints.
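
The three constraints can be checked mechanically for any candidate assignment. The sketch below is illustrative only: it assumes each data portion is a set of (array name, element index) pairs and that capacity is counted in elements. The function name `check_assignment` and all parameter names are hypothetical and do not come from the presentation.

```python
from collections import Counter

def check_assignment(arrays, portions, assignment, num_brams, capacity):
    """Return the list of violated constraints (empty if the assignment is valid).

    arrays:     dict array name -> number of elements (assumed representation)
    portions:   dict portion id -> set of (array name, element index)
    assignment: dict portion id -> block RAM index (every portion must appear)
    num_brams:  number of block RAM modules available (m)
    capacity:   number of elements that fit in one block RAM module
    """
    violations = []

    # 1) Hardware resource constraint: only existing block RAMs may be used.
    if any(not 0 <= b < num_brams for b in assignment.values()):
        violations.append("resource")

    # 2) Capacity constraint of each block RAM module.
    load = Counter()
    for pid, elems in portions.items():
        load[assignment[pid]] += len(elems)
    if any(n > capacity for n in load.values()):
        violations.append("capacity")

    # 3) Every element is stored exactly once: no gaps and no duplicates.
    seen = Counter(e for elems in portions.values() for e in elems)
    expected = {(name, i) for name, size in arrays.items() for i in range(size)}
    if set(seen) != expected or any(c != 1 for c in seen.values()):
        violations.append("coverage")

    return violations
```

A search over partitionings would call such a check on every candidate before estimating its execution time.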

10
Overview of Data Partitioning Algorithm

Code analysis to determine possible partitioning
directions
Architectural-level synthesis to discover the
design properties
Resource allocation, scheduling, and binding
Granularity adjustment
Use an empirical cost function to estimate
performance

11
Code Analysis

Calculate the iteration space IS(L)
Calculate the data space DS(N_i) of each array N_i
Obtain data access footprint F using the affine
functions of loop indices
Analyze F and IS(L) to obtain a set of possible
partitioning directions.
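
As an illustration of these steps, the sketch below enumerates a small rectangular iteration space, computes the footprint of a set of affine references, and counts how many array elements two halves of the iteration space would share if the loop nest were split along a given loop. Using a small overlap as the sign of a good partitioning direction is an assumption made for this example; the slides do not state the criterion. All function names and the example stencil are hypothetical.

```python
from itertools import product

def iteration_space(bounds):
    """All iteration vectors of a rectangular loop nest.

    bounds is a list of (lo, hi) pairs, one per loop, with hi exclusive.
    """
    return list(product(*(range(lo, hi) for lo, hi in bounds)))

def footprint(iterations, refs):
    """Array elements touched by the given iterations.

    Each reference is an affine index function (rows, offset):
    index[d] = sum(rows[d][k] * it[k] for k) + offset[d].
    """
    touched = set()
    for it in iterations:
        for rows, offset in refs:
            touched.add(tuple(
                sum(c * x for c, x in zip(row, it)) + o
                for row, o in zip(rows, offset)
            ))
    return touched

def overlap_when_split(bounds, refs, dim):
    """Halve the iteration space along loop `dim`; count elements both halves touch."""
    lo, hi = bounds[dim]
    mid = (lo + hi) // 2
    first = [(lo, mid) if k == dim else b for k, b in enumerate(bounds)]
    second = [(mid, hi) if k == dim else b for k, b in enumerate(bounds)]
    return len(footprint(iteration_space(first), refs)
               & footprint(iteration_space(second), refs))

# Example: for i in 1..6, j in 0..7, reading A[i-1][j], A[i][j], A[i+1][j].
identity = [(1, 0), (0, 1)]
refs = [(identity, (-1, 0)), (identity, (0, 0)), (identity, (1, 0))]
bounds = [(1, 7), (0, 8)]

print(overlap_when_split(bounds, refs, 0))  # 16: rows 3 and 4 are shared
print(overlap_when_split(bounds, refs, 1))  # 0: columns are disjoint
```

For this vertical stencil, splitting along j shares no data between the two halves, so under the stated assumption j would be the preferred partitioning direction.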

12
Architectural-level Synthesis

Synthesize and pipeline the innermost iteration
body, and collect the execution time T, the
initiation interval II, and the resource
utilizations u_m, u_r, and u_a
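
The collected figures feed a standard estimate for a pipelined loop: once the pipeline is full, a new iteration starts every II cycles, so N iterations take about T + (N - 1) * II cycles. The sketch below assumes that T is the latency of one iteration in clock cycles and that II is the initiation interval; the function names and the sample numbers are illustrative and are not results from the presentation.

```python
def pipelined_cycles(latency, ii, iterations):
    """Clock cycles to run `iterations` iterations of a pipelined loop body.

    latency: cycles for one iteration from start to finish (T, assumed in cycles)
    ii:      initiation interval, cycles between successive iteration starts
    """
    if iterations <= 0:
        return 0
    return latency + (iterations - 1) * ii

def execution_time_us(latency, ii, iterations, freq_mhz):
    """Execution time in microseconds at the given clock frequency in MHz."""
    return pipelined_cycles(latency, ii, iterations) / freq_mhz

# Hypothetical body: 10-cycle latency, II of 2, 1000 iterations.
print(pipelined_cycles(10, 2, 1000))  # 2008
```

Because the iteration count dominates for long loops, halving II roughly halves the execution time, which is why the cost function concentrates on estimating II.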

13
Granularity Adjustment

For each possible partitioning direction, check
different granularities to obtain the best
performance
Calculate the finest and coarsest grain for a
homogeneous partitioning
Finest: as few iterations as possible in one
block RAM module, using all block RAM modules
Coarsest: use as few block RAM modules as
possible
Estimate the global memory accesses m_r, the total
memory accesses m_t, and their ratio
Use the cost function to estimate the execution
time
14
Cost Function

An empirical formulation based on our
architectural-level synthesis results.
Estimates the initiation intervals of pipelined
designs
Favors memory accesses to nearby block RAM
modules
Different resource utilizations and granularities
affect the initiation intervals

15
Outline

Target architectures
Data partitioning problem
Memory optimizations
Scalar replacement
Data prefetching
Experimental results
Concluding remarks

16
Scalar Replacement

Scalar replacement increases data reuse and
reduces memory accesses
Some memory locations were already accessed in
the previous iteration
Use the contents already held in registers rather
than accessing memory again
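
A small software model shows the effect. The sketch below uses a 3-point sum as a stand-in loop body and a wrapper class that counts reads in place of a block RAM; both are illustrative assumptions, not code from the presentation.

```python
class CountingMemory:
    """List-like wrapper that counts reads, standing in for a block RAM."""

    def __init__(self, data):
        self.data = list(data)
        self.reads = 0

    def __getitem__(self, i):
        self.reads += 1
        return self.data[i]

    def __len__(self):
        return len(self.data)

def smooth_naive(a):
    """b[i] = a[i-1] + a[i] + a[i+1], reading all three operands from memory."""
    return [a[i - 1] + a[i] + a[i + 1] for i in range(1, len(a) - 1)]

def smooth_scalar_replaced(a):
    """Same result, but values loaded in earlier iterations stay in scalars."""
    if len(a) < 3:
        return []
    prev, cur = a[0], a[1]      # loaded once before the loop
    out = []
    for i in range(1, len(a) - 1):
        nxt = a[i + 1]          # the only memory read in each iteration
        out.append(prev + cur + nxt)
        prev, cur = cur, nxt    # rotate the "registers"
    return out

naive_mem = CountingMemory(range(10))
replaced_mem = CountingMemory(range(10))
assert smooth_naive(naive_mem) == smooth_scalar_replaced(replaced_mem)
print(naive_mem.reads, replaced_mem.reads)  # 24 10
```

In hardware, the scalars `prev` and `cur` correspond to registers, so each iteration issues one block RAM read instead of three.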

17
Data Prefetching and Buffer Insertion

Buffer insertion shortens critical paths and
improves achievable clock frequencies.
Schedule the global memory access one cycle
earlier
Reduce the length of critical paths

18
Outline

Target architectures
Data partitioning problem
Memory optimizations
Experimental results
Concluding remarks

19
Experimental Setup

Target architecture: Xilinx Virtex II FPGA.
Target frequency: 150 MHz.
Benchmarks: image processing and DSP applications
SOBEL edge detection
Bilinear filtering
2D Gauss blurring
1D Gauss filter
SUSAN principle

20
Results: Architectural Exploration

Correlation bank
Different partitions of the array S yield a
wide variety of candidate solutions, with quite
different overall performance after synthesis and
physical design.

21
Results: Execution Time

The average speedup is 2.75 times; after
further optimizations, the average speedup is
4.80 times.

22
Results: Achievable Clock Frequencies

Partitioned designs are about 10 percent slower
than the original ones.
After optimizations, about 7 percent faster than
the partitioned ones.

23
Outline

Target architectures
Data partitioning problem
Memory optimizations
Experimental results
Concluding remarks

24
Concluding Remarks

A data and iteration space partitioning approach
for homogeneous block RAM modules
Integrated with existing architectural-level
synthesis techniques
Parallelizes input designs
Dramatically improves system performance
Future work:
Irregular memory access
Heterogeneous block RAM modules

25
Thank You