Data Partitioning for Reconfigurable Architectures with Distributed Block RAM

1
Data Partitioning for Reconfigurable
Architectures with Distributed Block RAM
  • Wenrui Gong, Gang Wang, Ryan Kastner
  • Department of Electrical and Computer Engineering,
    University of California, Santa Barbara
  • {gong, wanggang, kastner}@ece.ucsb.edu
  • http://express.ece.ucsb.edu
  • June 10, 2005

2
What are we dealing with?
  • Mapping high-level programs into FPGA-based
    reconfigurable computing architectures with
    distributed block RAM modules
  • Objective: improve utilization of available
    storage resources, optimize system performance,
    and meet design goals

3
Outline
  • Target architectures
  • Data partitioning problem
  • Memory optimizations
  • Experimental results
  • Concluding remarks

5
Target Architecture
  • FPGA-based fine-grained reconfigurable computing
    architecture with distributed block RAM modules

6
Memory Access Latencies
  • Memory access delay consists of the access delay
    itself plus propagation delays; the propagation
    delays are variable.
  • One clock cycle to access nearby data; two or
    more to access data far away from the CLB.
  • Difficult to distinguish which ones are near and
    which ones are remote before physical synthesis
  • More difficult than traditional data partitioning
    in parallelizing compilation for NUMA machines

7
Outline
  • Target architectures
  • Data partitioning problem
      • Problem formulation
      • Data partitioning algorithm
  • Memory optimizations
  • Experimental results
  • Concluding remarks

8
Problem Formulation
  • Inputs
      • An l-level nested loop L
      • A set of n data arrays N
      • An architecture with m BRAM modules M
  • Assumptions
      • Index expressions of array references are affine
        functions of loop indices
      • No indirect array references or other similar
        pointer operations
      • All data arrays are assigned to block RAM modules
      • No duplicate data

9
Problem Formulation (contd)
  • Partitioning problem: partition the data arrays N
    into a set of data portions P, and seek an
    assignment from P to the block RAM modules M.
  • Constraints
      • 1) Hardware resource constraint
      • 2) Capacity constraint of each block RAM module
      • 3) All data arrays are assigned to block RAM, and
        each data element is assigned to one and only one
        block RAM module
  • Objective: minimize the total execution time (or
    maximize the system throughput) under the above
    constraints.
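
A minimal sketch of how the three constraints could be checked for a candidate assignment. The data layout and every name here (check_assignment, portions, capacities) are hypothetical, not from the slides, and constraint 1 is interpreted, as an assumption, as "only the m available BRAM modules may be used".

```python
from typing import Dict, List, Set, Tuple

Element = Tuple[str, int]  # (array name, flat element index): hypothetical encoding


def check_assignment(
    arrays: Dict[str, int],             # array name -> number of elements
    portions: Dict[str, Set[Element]],  # portion name -> elements it holds
    assignment: Dict[str, int],         # portion name -> BRAM module index
    capacities: List[int],              # capacity (in elements) of each BRAM
) -> List[str]:
    """Return the list of violated constraints (empty if the assignment is valid)."""
    errors = []
    # Constraint 1 (assumed reading): only existing BRAM modules may be used.
    for p, m in assignment.items():
        if not 0 <= m < len(capacities):
            errors.append(f"portion {p} mapped to nonexistent BRAM {m}")
    # Constraint 2: capacity of each block RAM module.
    load = [0] * len(capacities)
    for p, m in assignment.items():
        if 0 <= m < len(capacities):
            load[m] += len(portions[p])
    for m, (used, cap) in enumerate(zip(load, capacities)):
        if used > cap:
            errors.append(f"BRAM {m} over capacity: {used} > {cap}")
    # Constraint 3: every element is placed exactly once (no gaps, no duplicates).
    placed: Dict[Element, int] = {}
    for p in assignment:
        for e in portions[p]:
            placed[e] = placed.get(e, 0) + 1
    for name, size in arrays.items():
        for i in range(size):
            count = placed.get((name, i), 0)
            if count != 1:
                errors.append(f"{name}[{i}] placed {count} times")
    return errors
```

An empty result means the candidate satisfies all three checks; the objective (execution time) would then be evaluated only for such candidates.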

10
Overview of Data Partitioning Algorithm
  • Code analysis to determine possible partitioning
    directions
  • Architectural-level synthesis to discover the design
    properties
      • Resource allocation, scheduling, and binding
  • Granularity adjustment
  • Use an empirical cost function to estimate
    performance

11
Code Analysis
  • Calculate the iteration space IS(L)
  • Calculate the data space DS(Ni)
  • Obtain data access footprint F using the affine
    functions of loop indices
  • Analyze F and IS(L) to obtain a set of possible
    partitioning directions.
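
The steps above can be sketched for small loop nests by brute-force enumeration. This is an illustration only: the slides do not say how F is analyzed, so the "overlap when the iteration space is cut in half" measure below is an assumed stand-in for judging a partitioning direction, and all function names are hypothetical.

```python
from itertools import product


def footprint(bounds, refs):
    """Set of array elements touched by affine references over IS(L).

    bounds: list of (lo, hi) per loop index, hi exclusive.
    refs:   list of (matrix, offset); element index = matrix * iteration + offset.
    """
    touched = set()
    for it in product(*(range(lo, hi) for lo, hi in bounds)):
        for matrix, offset in refs:
            idx = tuple(
                sum(c * v for c, v in zip(row, it)) + off
                for row, off in zip(matrix, offset)
            )
            touched.add(idx)
    return touched


def overlap_when_split(bounds, refs, axis):
    """Number of elements shared by both halves when IS(L) is cut along `axis`."""
    lo, hi = bounds[axis]
    mid = (lo + hi) // 2
    first, second = list(bounds), list(bounds)
    first[axis] = (lo, mid)
    second[axis] = (mid, hi)
    return len(footprint(first, refs) & footprint(second, refs))


# Vertical 3-tap stencil: references S[i-1][j], S[i][j], S[i+1][j].
identity = [[1, 0], [0, 1]]
vertical = [(identity, (d, 0)) for d in (-1, 0, 1)]
bounds = [(1, 9), (1, 9)]  # i and j both run over 1..8

print(len(footprint(bounds, vertical)))         # 80 elements touched
print(overlap_when_split(bounds, vertical, 0))  # 16 shared elements when cutting along i
print(overlap_when_split(bounds, vertical, 1))  # 0 shared elements when cutting along j
```

In this example, cutting along j yields disjoint footprints, so j is the more attractive direction under the assumed measure.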

12
Architectural-level Synthesis
  • Synthesize and pipeline the innermost iteration
    body, and collect the execution time T, initiation
    interval II, and resource utilizations um, ur,
    and ua
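
The slide does not say how T and II are combined. A common textbook relation for a pipelined loop, assuming T is the latency of one iteration body in cycles and a new iteration starts every II cycles without stalls, is T + (n - 1) × II cycles for n iterations. The function name is hypothetical.

```python
def pipelined_cycles(n_iterations: int, latency: int, ii: int) -> int:
    """Cycles to run n iterations of a loop pipelined with initiation interval ii.

    Assumes `latency` is the cycle count of one iteration body (T on the slide)
    and that a new iteration is started every `ii` cycles with no stalls.
    """
    if n_iterations <= 0:
        return 0
    return latency + (n_iterations - 1) * ii


# 64 iterations, 10-cycle body, II = 2: 10 + 63 * 2 = 136 cycles,
# versus 64 * 10 = 640 cycles if iterations ran back to back without pipelining.
print(pipelined_cycles(64, 10, 2))
```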

13
Granularity Adjustment
  • For each possible partitioning direction, check
    different granularities to obtain the best
    performance
  • Calculate the finest and coarsest grain for a
    homogeneous partitioning
      • Finest: as few iterations as possible per
        block RAM module, using all block RAM modules
      • Coarsest: use as few block RAM modules as
        possible
  • Estimate global memory accesses mr and total
    memory accesses mt, and their ratio
  • Use the cost function to estimate the execution time
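
A minimal sketch of the sweep from coarsest to finest grain for a row-wise homogeneous partitioning. The assumptions are loud ones: accesses are counted per row rather than per element, a "global" access is taken to mean an access that lands in a block other than the one holding the current row, and the names (granularity_sweep, halo, rows_per_bram) are hypothetical.

```python
from math import ceil


def granularity_sweep(rows, halo, n_bram, rows_per_bram):
    """Enumerate homogeneous row-block partitions from coarsest to finest.

    Returns one tuple (modules_used, block_rows, m_r, m_t, m_r / m_t) per
    distinct block size, where m_r counts accesses that fall in another block
    and m_t counts all accesses of a stencil reaching `halo` rows up and down.
    """
    coarsest = ceil(rows / rows_per_bram)   # fewest modules that hold the data
    if coarsest > n_bram:
        raise ValueError("data does not fit in the available block RAM")
    results = []
    seen_blocks = set()
    for k in range(coarsest, n_bram + 1):   # the finest grain uses all modules
        block = ceil(rows / k)
        if block in seen_blocks:
            continue
        seen_blocks.add(block)
        m_r = m_t = 0
        for r in range(rows):
            for d in range(-halo, halo + 1):
                x = r + d
                if 0 <= x < rows:
                    m_t += 1
                    if x // block != r // block:
                        m_r += 1
        results.append((ceil(rows / block), block, m_r, m_t, m_r / m_t))
    return results


for modules, block, m_r, m_t, ratio in granularity_sweep(64, 1, 8, 32):
    print(modules, block, m_r, m_t, round(ratio, 4))
```

Finer grains expose more parallel memory ports but raise the mr/mt ratio, which is the trade-off the cost function has to weigh.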

14
Cost Function
  • An empirical formulation based on our
    architectural-level synthesis results
  • Estimates initiation intervals for pipelined designs
  • Favors memory accesses to nearby block RAM
    modules
  • Different resource utilizations and granularities
    affect the initiation intervals

15
Outline
  • Target architectures
  • Data partitioning problem
  • Memory optimizations
      • Scalar replacement
      • Data prefetching
  • Experimental results
  • Concluding remarks

16
Scalar Replacement
  • Scalar replacement increases data reuse and
    reduces memory accesses
  • Data was already read from memory in the previous
    iteration
  • Use the contents already in registers rather than
    accessing memory again
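
A small before/after sketch of scalar replacement on a 3-tap sum, written in Python for illustration (the slides target hardware, and the filter and function names are hypothetical). The read counter stands in for block RAM accesses.

```python
def filter_naive(a):
    """3-tap sum; every iteration reads three array elements from memory."""
    reads = 0
    out = []
    for i in range(1, len(a) - 1):
        left, mid, right = a[i - 1], a[i], a[i + 1]
        reads += 3
        out.append(left + mid + right)
    return out, reads


def filter_scalar_replaced(a):
    """Same filter; values read in the previous iteration stay in scalars."""
    if len(a) < 3:
        return [], 0
    left, mid = a[0], a[1]       # preload the first two elements
    reads = 2
    out = []
    for i in range(1, len(a) - 1):
        right = a[i + 1]         # the only new memory read per iteration
        reads += 1
        out.append(left + mid + right)
        left, mid = mid, right   # rotate the registers for the next iteration
    return out, reads


data = list(range(10))
print(filter_naive(data)[1])            # 24 reads
print(filter_scalar_replaced(data)[1])  # 10 reads
```

Both versions produce the same output; only the number of memory reads changes.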

17
Data Prefetching and Buffer Insertion
  • Buffer insertion shortens critical paths and
    improves achievable clock frequencies.
  • Schedule the global memory access one cycle
    earlier
  • Reduce the length of critical paths
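
A software analogy of scheduling a remote access one step earlier and holding the value in a buffer, so that the load and the computation no longer share a step. This is a sketch under stated assumptions: "step" stands in for a clock cycle, and the function names are hypothetical.

```python
def run_direct(remote, compute):
    """Load and use each remote value in the same step (load on the critical path)."""
    out, issued_at = [], []
    for step in range(len(remote)):
        value = remote[step]            # load issued in the step that consumes it
        issued_at.append(step)
        out.append(compute(value))
    return out, issued_at


def run_prefetched(remote, compute):
    """Issue each remote load one step early and hold the value in a buffer."""
    out, issued_at = [], []
    if not remote:
        return out, issued_at
    buffer = remote[0]                  # prologue: first load issued before step 0
    issued_at.append(-1)
    for step in range(len(remote)):
        value = buffer                  # consume the buffered value
        if step + 1 < len(remote):
            buffer = remote[step + 1]   # issue the next load one step early
            issued_at.append(step)
        out.append(compute(value))
    return out, issued_at
```

The results are identical; only the step in which each load is issued moves earlier by one.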

18
Outline
  • Target architectures
  • Data partitioning problem
  • Memory optimizations
  • Experimental results
  • Concluding remarks

19
Experimental Setup
  • Target architecture: Xilinx Virtex-II FPGA
  • Target frequency: 150 MHz
  • Benchmarks: image processing and DSP applications
      • SOBEL edge detection
      • Bilinear filtering
      • 2D Gauss blurring
      • 1D Gauss filter
      • SUSAN principle

20
Results: Architectural Exploration
  • Correlation bank
  • Different partitions of the array S deliver a
    wide variety of candidate solutions
  • With quite different overall performance after
    synthesis and physical design.

21
Results: Execution Time
  • The average speedup is 2.75 times; after
    further optimizations, the average speedup is
    4.80 times.

22
Results: Achievable Clock Frequencies
  • Partitioned designs are about 10 percent slower
    than the original ones. After optimizations, they
    are about 7 percent faster than the partitioned
    ones.

23
Outline
  • Target architectures
  • Data partitioning problem
  • Memory optimizations
  • Experimental results
  • Concluding remarks

24
Concluding Remarks
  • A data and iteration space partitioning approach
    for homogeneous block RAM modules
      • Integrated with existing architectural-level
        synthesis techniques
      • Parallelizes input designs
      • Improves system performance: average speedup of
        2.75 times, or 4.80 times after further
        optimizations
  • Future work
      • Irregular memory accesses
      • Heterogeneous block RAM modules

25
Thank You
  • Prof Ryan Kastner and Gang Wang
  • Reviewers
  • All audiences

26
Questions