Title: MAPLD 2004


DESIGN OF FAST AND EFFICIENT HYBRID-FPGAs FOR
NUMERICALLY INTENSIVE APPLICATIONS IN FLUID
DYNAMICS, MOLECULAR MODELING AND IMAGE/VIDEO
PROCESSING
A. Akoglu, A. Dasu, S. Panchanathan
Center for Ubiquitous Computing, Arizona State
University, Tempe, AZ
{akoglu, dasu, panch}@asu.edu
Abstract
This research work presents results obtained from the hybrid FPGA
architecture design methodology proposed in earlier work. The hybrid
architecture is formed of ASIC units and LUT-based processing
elements. ASIC units represent tasks or core clusters obtained
through common sub-graph analysis between basic blocks within and
across routines of computation-intensive applications; they are
essentially recurring patterns. Results show that partial
reconfiguration with computation cores embedded in a sea of LUTs
offers the potential for massive savings in gate density by
eliminating the need for redundant sub-circuit pattern
configurations. Since the ASICs cover only parts of the data flow
graphs, the remaining computations are implemented on LUT-based
reconfigurable hardware. A new packing function is proposed to form
the LUT-based processing elements. The packing cost function
prioritizes reducing the number of input/output pins of the clusters
being formed. Results show that the proposed method yields
significant savings in the number of nets to be routed.
The methodology (Figure 1) involves extracting tasks or core
clusters from the Control Data Flow Graphs (CDFGs) of applications,
then designing the architecture to embed them in Hybrid-FPGA
environments (Figures 2a, 2b). By hybrid, we mean that the proposed
FPGA architectures involve both LUT and ASIC regions. Tasks or core
clusters obtained through common sub-graph analysis between basic
blocks within and across routines are essentially recurring
computation patterns, implemented as ASICs on a non-reconfigurable
area (CIPEs). Designing processing elements by identifying
correlated compute-intensive regions within each application and
between applications results in large amounts of processing in
localized regions of the chip. This reduces the number of
reconfigurations and the amount of on-chip communication, and hence
results in faster application switching and reduced power
consumption.

This task consists of finding the common sub-graphs (Figure 3),
which is closely related to the Largest Common Sub-Graph problem (a
proven NP-complete problem). Core reusable regions that peer
research efforts have detected as common within or across
applications have been either at the granularity of MAC units
(2 nodes) or at the granularity of entire function modules. There
has been no reported work that detects core reusable regions
consisting of several operation nodes (multiply, add, divide, etc.)
between basic blocks in applications, with emphasis on accelerating
data flow in hardware. Our method generates ASIC cores of higher
granularity by focusing specifically on data flow graphs of hardware
computations and taking advantage of the restrictions that they
offer.

In the Hybrid-FPGA model, we propose that the CIPE region be
constrained and mapped onto a slab (a region of LUTs isolated by
MacroBus, as in the Virtex architecture) or implemented as gates in
ASIC technology. Even though a large on-chip ASIC increases mask
design costs, it offers the maximum gate savings. The remaining
slabs are implemented on LUT-based reconfigurable hardware. To the
best of our knowledge, no existing technology maps regions within a
single DFG onto multiple slabs for partial reconfiguration.

We have conducted experiments on several complex routines from the
target applications: MPEG-4 VVM, the GNU Scientific Library (GSL)
and the NAMD molecular modeling library. A map report targeting the
Spartan 2E architecture was generated from the synthesis report.
Results show that partial reconfiguration with computation cores
embedded in a sea of LUTs offers the potential for massive savings
in gate density by eliminating the need for redundant sub-circuit
pattern configurations (Table 1).

Since CIPEs cover only parts of the DFGs, the remaining computations
(reconfigurable data flow computations) are then implemented on
LUT-based reconfigurable hardware. The methodology in Figure 4 aims
to provide optimum interconnection pathways between different
hierarchy levels with variable-size processing elements, allocating
just enough switching and wiring resources based on profiling the
computational characteristics of the application domains.

In existing approaches, packing treats the number of intersecting
nets as a positive gain and does not address how the wiring
requirement grows after an LUT is included in a cluster. We also
argue that the cost function should give priority to nets that cause
a decrease in the number of input or output pins of the target
cluster. A Rent's rule based packing mechanism (Figure 5), designed
to improve the routing architecture by reducing the number of nets
to be routed, has been implemented. This mechanism prioritizes
(Figures 6, 7) nets that lead to a reduction in the number of
input/output pins during packing, in addition to the
routability-driven cost metrics defined by other researchers.
Table 2 presents the performance of the proposed packing compared to
V-Pack (Rose et al.) and R-Pack (Sarrafzadeh et al.).
Figure 2b. Hybrid-FPGA
Figure 6. How to prioritize net reduction
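The pin-reduction priority argued for in the text can be illustrated with a small sketch. This is not the cost function of Figure 7, which is not recoverable from this transcript; it is a minimal illustration under assumed conventions. The netlist encoding, the block and pad names, and the functions `cluster_pins`, `pin_delta`, `shared_nets` and `pick_candidate` are all hypothetical.

```python
from typing import Dict, Set, Tuple

# Hypothetical netlist encoding: net name -> (driver block, sink blocks).
# Blocks named pad_* stand for chip I/O and never join a cluster.
Net = Tuple[str, Set[str]]

NETS: Dict[str, Net] = {
    "i0": ("pad_i0", {"A", "B"}),
    "i1": ("pad_i1", {"A", "B", "D"}),
    "nA": ("A", {"C"}),
    "nB": ("B", {"C"}),
    "nC": ("C", {"pad_o0"}),
    "i2": ("pad_i2", {"D"}),
    "nD": ("D", {"pad_o1"}),
}

def cluster_pins(cluster: Set[str], nets: Dict[str, Net]) -> int:
    """Count the input plus output pins the cluster needs."""
    pins = 0
    for driver, sinks in nets.values():
        if driver in cluster:
            # An output pin is needed only if some sink is outside.
            if any(s not in cluster for s in sinks):
                pins += 1
        elif any(s in cluster for s in sinks):
            # Driven from outside and consumed inside: one input pin.
            pins += 1
    return pins

def pin_delta(cluster: Set[str], candidate: str, nets: Dict[str, Net]) -> int:
    """Change in pin count if the candidate LUT joins the cluster."""
    return cluster_pins(cluster | {candidate}, nets) - cluster_pins(cluster, nets)

def shared_nets(cluster: Set[str], candidate: str, nets: Dict[str, Net]) -> int:
    """Number of nets connecting the candidate to the cluster."""
    count = 0
    for driver, sinks in nets.values():
        terminals = {driver} | sinks
        if candidate in terminals and terminals & cluster:
            count += 1
    return count

def pick_candidate(cluster: Set[str], candidates: Set[str],
                   nets: Dict[str, Net]) -> str:
    """Prefer the smallest pin growth, then the most shared nets."""
    return min(
        sorted(candidates),
        key=lambda c: (pin_delta(cluster, c, nets),
                       -shared_nets(cluster, c, nets)),
    )

if __name__ == "__main__":
    cluster = {"A", "B"}
    for lut in ("C", "D"):
        print(lut, pin_delta(cluster, lut, NETS), shared_nets(cluster, lut, NETS))
    print(pick_candidate(cluster, {"C", "D"}, NETS))
```

In this toy netlist, adding C absorbs both nets nA and nB into the cluster, so the cluster needs one pin fewer, while adding D brings in a new input and a new output. Ranking by pin change first therefore selects C.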
Introduction
Numerical simulations in computational fluid dynamics and molecular
modeling share common computation features and are performed
iteratively. Similarly, video and image processing applications
involve tasks (mosaic building to compress video into images, image
compression such as DCT, DWT, etc.) that also share common
computation features and require iterative processing. Several
researchers have shown that these applications are well suited to
execution on spatially parallel processor architectures. FPGAs in
particular offer large numbers of on-chip spatially parallel units
and are thus capable of performing orders of magnitude faster than
regular serial processors. But FPGAs suffer from being application
agnostic, and hence incur penalties: clock cycles lost to redundant
reconfigurations, generic routing and poor memory architectures, all
of which impact speed, power and silicon area. These factors have
led us to explore the reconfigurable architecture design space with
the application domain prioritized.
Figure 7. Packing Cost Function
Figure 4. LUT Based Architecture Methodology
Table 2. Savings in number of nets and tracks
Conclusion
From this research effort, we believe that partial reconfiguration
with computation cores embedded in a sea of LUTs offers the
potential for massive savings in gate density by eliminating the
need for unnecessary and redundant sub-circuit pattern
configurations. We believe that this direction will lead to
next-generation FPGA devices geared towards computationally
intensive applications such as bio-chemical algorithms and
scientific applications.
Figure 3. Finding Common Sub-Graph; Dominant Sub-Graph
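To make the idea of recurring computation patterns across basic blocks concrete, here is a deliberately simplified sketch. It is not the authors' common sub-graph algorithm, and it does not solve the NP-complete largest common sub-graph problem. It only counts depth-limited operator-tree patterns, ignoring shared nodes. The DFG encoding, the operation names and the functions `signature` and `recurring_patterns` are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical DFG encoding: node name -> (operation, operand node names).
# Nodes with no operands are block inputs.
DFG = Dict[str, Tuple[str, List[str]]]

# Assumption: only these operations may have their operands reordered.
COMMUTATIVE = {"add", "mul"}

def signature(dfg: DFG, node: str, depth: int) -> str:
    """Canonical string for the operator tree rooted at node."""
    op, operands = dfg[node]
    if depth == 0 or not operands:
        return "_"  # pattern boundary: any value may feed in here
    parts = [signature(dfg, o, depth - 1) for o in operands]
    if op in COMMUTATIVE:
        parts.sort()
    return f"{op}({','.join(parts)})"

def recurring_patterns(blocks: Dict[str, DFG], depth: int = 2,
                       min_ops: int = 2, min_count: int = 2) -> Dict[str, int]:
    """Patterns with at least min_ops operations seen min_count times."""
    counts: Counter = Counter()
    for dfg in blocks.values():
        for node, (_, operands) in dfg.items():
            if operands:
                sig = signature(dfg, node, depth)
                if sig.count("(") >= min_ops:
                    counts[sig] += 1
    return {sig: n for sig, n in counts.items() if n >= min_count}

# Three toy basic blocks: y = a*b + c, z = (r + p*q) / s, w = u - v.
BLOCKS: Dict[str, DFG] = {
    "bb1": {
        "a": ("in", []), "b": ("in", []), "c": ("in", []),
        "m": ("mul", ["a", "b"]),
        "y": ("add", ["m", "c"]),
    },
    "bb2": {
        "p": ("in", []), "q": ("in", []), "r": ("in", []), "s": ("in", []),
        "m2": ("mul", ["p", "q"]),
        "a2": ("add", ["r", "m2"]),
        "z": ("div", ["a2", "s"]),
    },
    "bb3": {
        "u": ("in", []), "v": ("in", []),
        "w": ("sub", ["u", "v"]),
    },
}

if __name__ == "__main__":
    print(recurring_patterns(BLOCKS))
```

On these toy blocks the only pattern with two or more operations that recurs is the multiply-add, which appears in bb1 and bb2 despite the different operand order. A pattern like this would be a candidate for a fixed core, while the divide and subtract would stay on reconfigurable logic.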
Figure 1. Methodology
Table 1. Savings in configuration bits and clock cycles
Selected Publications
  • 1. A. Dasu et al., Reconfigurable Media Processing,
    Parallel Computing, Vol. 28, August 2002, pp. 1111-1139.
  • 2. A. Akoglu, A. Dasu and S. Panchanathan, A Framework for the
    Design of the Heterogeneous Hierarchical Routing Architecture of
    a Dynamically Reconfigurable Application Specific Media
    Processor, Workshop on Embedded Systems for Media Processing,
    Dec 17, 2003, Hyderabad, India.
  • 3. A. Dasu, A. Akoglu and S. Panchanathan, An Analysis Tool Set
    for Reconfigurable Media Processing, The International
    Conference on Engineering of Reconfigurable Systems and
    Algorithms (ERSA'03), June 2003, Las Vegas.
  • 4. 3 patents pending under US and international protection in
    the technologies for Designing High Performance Reconfigurable
    Processing

Figure 5. Rent's Rule Based Packing Parameters
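For reference, Rent's rule relates the number of external terminals T of a region of logic to the number of blocks G it contains: T = t * G^p, where t is the average number of terminals per block and p is the Rent exponent. How the parameters in Figure 5 enter the authors' packing is not recoverable from this transcript. The sketch below only shows how t and p could be estimated from measured (blocks, terminals) pairs with a log-log least-squares fit. The function names and the sample data are hypothetical, and the samples are synthetic.

```python
import math
from typing import List, Tuple

def fit_rent(samples: List[Tuple[int, int]]) -> Tuple[float, float]:
    """Fit log T = log t + p * log G by least squares; return (t, p)."""
    xs = [math.log(g) for g, _ in samples]
    ys = [math.log(terms) for _, terms in samples]
    n = len(samples)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    p = sxy / sxx
    t = math.exp(mean_y - p * mean_x)
    return t, p

def expected_terminals(blocks: int, t: float, p: float) -> float:
    """Terminal count predicted by Rent's rule for a region of given size."""
    return t * blocks ** p

# Synthetic (blocks, terminals) pairs generated with t = 4 and p = 0.5.
SAMPLES = [(1, 4), (4, 8), (16, 16), (64, 32)]

if __name__ == "__main__":
    t, p = fit_rent(SAMPLES)
    print(round(t, 3), round(p, 3))
    print(round(expected_terminals(256, t, p), 1))
```

A packer could compare a cluster's actual pin count against such a prediction to judge whether adding another LUT keeps the wiring demand in line with the rest of the design. That use is a suggestion for illustration, not a claim about the method presented here.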
Figure 2a. Hybrid-FPGA