Title: High-level Partitioning Of Discrete Signal Transforms For Distributed Hardware Architectures
1High-level Partitioning Of Discrete Signal
Transforms For Distributed Hardware Architectures
- Rafael Arce Nazario, PhD
- Department of Physics and Electronics
- University of Puerto Rico, Humacao
- University of Puerto Rico, Rio Piedras Campus
- November 13, 2007
2Presentation Overview
- Background / Motivation
- Problem statement
- Related Work
- Research methodology
- Partitioning Tools
- Formulation Exploration
- Results and discussion
- Conclusions, Contributions, and Future Work
3Motivation and Objective
- Discrete Signal Transforms (DSTs)
- DFT, DCT: many applications
- Hardware accelerated, but at high area cost
- Example: 4P DFT formulation by Dr. Rodríguez
- Distributed (dedicated) hardware architectures
(DHAs)
- Partitioning plays a key role
- Partitioning beyond multi-device
- Multi-core GPPs: IBM Cell BE, Intel Core Duo
- System-on-Chip, Network-on-Chip
- Need tools to aid in partition exploration,
mapping, and implementation → design automation
[Figures: 4P DFT of size 16; Virtex family FPGA]
4Philosophy
- Hypothesis
- Automated partitioning of DSTs can be improved by
introducing high-level DST considerations, such
as graph-level and algorithmic properties.
5Problem statement - Inputs
- Given
- T: a high-level description of a DST,
T = (T0, T1, ..., TM-1)
- N: points, R: resolution, F: numerical format
- A description of the target architecture as a
hypergraph H(D,C)
- D: set of devices
- Wd(di): weight of device di
- C: set of communication channels
- Wc(ci): weight of channel ci
- B(ci): bitwidth of channel ci
- Determine a mapping function f: T → D such that
- Minimize: latency of the overall transform
implementation, i.e., the time from the beginning
of computation until the last data point is
processed.
- Constraints
- At any time, the bits sent from task i to task j
don't exceed the communication resources.
- The resources needed to implement the tasks
mapped to a given device don't exceed the device
resources.
6Background Partitioning in CAD
[Figure: design flow from Design Idea through
High-level synthesis, Logic Design, Technology
Mapping, and Place and Route to Fabrication or
Bit-stream Transfer onto the DHA; axis labels:
More information, More structure]
- High-level / behavioral partitioning (HLP):
partitions a high-level specification which has
been represented as a flow graph
- Structural partitioning: partitions a netlist at
the register-transfer level or gate level
- We choose HLP
- More information about functional aspects
- In CAD, a higher level brings higher benefits
7Previous Work DHA Partitioning
- Partitioning is an NP-complete problem [Garey76]
- Use heuristics: stochastic and/or deterministic
- Main limitations in previous strategies
- Exploration limited to graph level and below
- Most available methods are structural or pure DFG
- Functional properties not appropriately
considered
- Generic methodology
- Good for the common case, not excellent
- Representation granularity
- Either finest granularity or user-specified
coarse nodes
- Comparison to other multi-device implementations
- Lack of accepted benchmarks: algorithms and
architectures.
8DMAGIC Partitioning Methodology
DMAGIC: DST Mapping using Algorithmic and Graph
Interaction and Computation
DST features are introduced at two abstraction
levels as part of our methodology: the graph
level and the algorithmic level.
9Inputs
Distributed Hardware Architecture (DHA)
[Figure: DHA model with optional ring connection]
- Representative of commercial and academic
platforms
- Practically scalable
- Discrete Signal Transforms
- Discrete Fourier Transform (DFT), Discrete
Cosine Transform (DCT)
- Focus on one-dimensional transforms
- Extensible to multidimensional transforms through
(anti)lexicographic ordering into a vector.
- Kronecker Products Algebra (KPA) used for
algorithmic representation
- Compact framework: formulation implies
structure
- Exploration of formulations for various
architectures [VanLoan92] [Johnson90]
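As a worked instance of the multidimensional extension noted above, the following sketch uses a standard Kronecker-product identity; it is not taken from the slides. The notation is assumed: F_m is the m-point DFT matrix, X is an m x n input array, and vec_r stacks the rows of a matrix into a single vector (lexicographic ordering).

```latex
% Assumed notation: F_m = m-point DFT matrix, X = m x n input array,
% vec_r(.) = row-wise (lexicographic) stacking of a matrix into a vector.
\begin{align*}
Y &= F_m \, X \, F_n^{T}
  && \text{(2-D DFT: transform along columns, then along rows)} \\
\operatorname{vec}_r(Y) &= (F_m \otimes F_n)\, \operatorname{vec}_r(X)
  && \text{(from } \operatorname{vec}_r(AXB) = (A \otimes B^{T})\,\operatorname{vec}_r(X)\text{)}
\end{align*}
```

Under this ordering the two-dimensional transform becomes a single KPA formulation acting on a vector, so the one-dimensional machinery applies unchanged.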
10Research methodology
11Kronecker to Graph Conversion Tool
KTG
- Operator matrices: Identity, Transform,
Permutation, Unitary, Unitary Transpose, Twiddle
- KPA operations: Tensor product (⊗),
Direct Sum (⊕), Matrix Multiplication
- Output: Weighted graph with level information
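To make the KPA operators listed above concrete, here is a minimal pure-Python sketch, not the KTG tool itself. It builds the tensor product, direct sum, a stride permutation, and a twiddle matrix, then checks numerically that the textbook Cooley-Tukey formulation (F_m ⊗ I_n) T (I_m ⊗ F_n) L reproduces the DFT matrix of size mn. All function names are hypothetical, and matrices are plain lists of lists.

```python
import cmath

def identity(n):
    return [[1 if r == c else 0 for c in range(n)] for r in range(n)]

def dft(n):
    """n-point DFT matrix with entries w^(r*c), w = exp(-2*pi*i/n)."""
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (r * c) for c in range(n)] for r in range(n)]

def kron(a, b):
    """Tensor (Kronecker) product A (x) B."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]

def direct_sum(a, b):
    """Direct sum A (+) B: block-diagonal matrix with blocks A and B."""
    top = [row + [0] * len(b[0]) for row in a]
    bottom = [[0] * len(a[0]) + row for row in b]
    return top + bottom

def matmul(a, b):
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]

def stride_permutation(m, n):
    """Permutation P with (P x)[i*n + j] = x[j*m + i]: gathers stride-m samples."""
    size = m * n
    p = [[0] * size for _ in range(size)]
    for i in range(m):
        for j in range(n):
            p[i * n + j][j * m + i] = 1
    return p

def twiddle(m, n):
    """Diagonal twiddle matrix with entry w_{mn}^(i*k) at position i*n + k."""
    size = m * n
    w = cmath.exp(-2j * cmath.pi / size)
    t = [[0] * size for _ in range(size)]
    for i in range(m):
        for k in range(n):
            t[i * n + k][i * n + k] = w ** (i * k)
    return t

def cooley_tukey(m, n):
    """Multiply out (F_m (x) I_n) T (I_m (x) F_n) L for a size m*n DFT."""
    stages = [kron(dft(m), identity(n)), twiddle(m, n),
              kron(identity(m), dft(n)), stride_permutation(m, n)]
    result = stages[0]
    for stage in stages[1:]:
        result = matmul(result, stage)
    return result

def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

if __name__ == "__main__":
    print(max_abs_diff(cooley_tukey(2, 4), dft(8)) < 1e-9)
```

Each factor in `stages` corresponds to one column of operations in the resulting dataflow graph, which is the sense in which a formulation implies structure.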
12Graph Partitioning/Placement Heuristic
[Figure: Unpartitioned DFG → Partition/Placement
(min. cost function) → Partitioned DFG]
- P/P inspired by the Kernighan-Lin bipartition
heuristic
- Extended to k-way partitioning for heterogeneous
channels
- Cost function sensitive to the main DHA concerns
[Equations: cost function of the previous
implementation vs. our Partition/Placement; terms:
weight of channel i, required communications
through channel i]
- Better reflects the impact on DHA resources
- Communication channels are given and
heterogeneous in cost
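The slide's cost function is not reproduced in this transcript. The sketch below shows one plausible form built only from the two terms that are named (the weight of channel i and the required communications through channel i), combined as a weighted sum. The summation itself, the data layout, and every name in the code are assumptions made for illustration.

```python
def placement_cost(edges, placement, channel_of, channel_weight):
    """Weighted communication cost of a placement (assumed form).

    edges:          {(src_task, dst_task): bits transferred}
    placement:      {task: device}
    channel_of:     {frozenset({device_a, device_b}): channel}
    channel_weight: {channel: weight}
    """
    traffic = {}
    for (src, dst), bits in edges.items():
        dev_a, dev_b = placement[src], placement[dst]
        if dev_a == dev_b:
            continue  # communication inside a device uses no channel
        channel = channel_of[frozenset((dev_a, dev_b))]
        traffic[channel] = traffic.get(channel, 0) + bits
    # Sum over channels: weight of channel i times traffic through channel i.
    return sum(channel_weight[ch] * bits for ch, bits in traffic.items())

# Hypothetical three-task example on two devices joined by one channel.
edges = {("t0", "t1"): 16, ("t1", "t2"): 32, ("t0", "t2"): 8}
placement = {"t0": "d0", "t1": "d0", "t2": "d1"}
channel_of = {frozenset(("d0", "d1")): "c0"}
channel_weight = {"c0": 2.0}

if __name__ == "__main__":
    print(placement_cost(edges, placement, channel_of, channel_weight))
```

Weighting each channel separately is what lets a cost of this kind distinguish heterogeneous channels, which a plain cut-size count cannot do.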
13Partition/Placement Engine
DST considerations
- Initial solution: balanced horizontal linear
partitioning
- Scheduling consideration: swap nodes from the
same computational stage.
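The two DST considerations above can be sketched as follows. The sketch assumes a DFG whose nodes are laid out on a regular grid of (stage, row) positions, as in an FFT butterfly network. The grid model and both function names are hypothetical, and the gain evaluation that would choose among the candidate swaps is omitted.

```python
from itertools import combinations

def horizontal_partition(num_rows, num_stages, num_devices):
    """Initial solution: assign each node (stage, row) to a device by
    contiguous bands of rows, so every device gets about the same share."""
    return {(stage, row): row * num_devices // num_rows
            for stage in range(num_stages)
            for row in range(num_rows)}

def same_stage_swaps(placement):
    """Yield candidate swaps: node pairs that sit in the same computational
    stage but are currently placed on different devices."""
    for a, b in combinations(sorted(placement), 2):
        if a[0] == b[0] and placement[a] != placement[b]:
            yield a, b

if __name__ == "__main__":
    placement = horizontal_partition(num_rows=8, num_stages=3, num_devices=2)
    print(sum(1 for _ in same_stage_swaps(placement)))
```

Restricting swaps to a single stage keeps the per-stage load on each device unchanged, which is one way to read the scheduling consideration on the slide.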
14Research methodology
15Formulation Exploration
Objective
- Use DST rules to explore the space of equivalent
formulations in search of one that better suits
the target architecture.
Challenges
- Combinatorial explosion of the exploration space.
- Find rules amenable to hardware implementation.
Approach
- Conducted experiments to assess the impact of
transformations on partition quality. Results
used to devise an exploration strategy.
16Experiments 1 and 2
- Effect of inter-column permutations (ICP)
[Figure panels: Cooley-Tukey, G. Sande, Stockham,
T. Stockham, Pease]
ICP and granularity affect solution quality, yet
it is hard to establish a heuristic from them.
17Experiment 3 Breakdown strategy
- Breakdown strategy: order and divisors with
which a transform is decomposed using a rule such
as the Cooley-Tukey factorization of a size-n
transform, where n = mp
- Split trees: common graphical representation of
a breakdown strategy
- Example: two split trees, (a) and (b), for a DFT
of size 64.
[Figure: split trees (a) and (b)]
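The idea of enumerating breakdown strategies can be sketched in a few lines. The representation is an assumption: a leaf is an integer size that is left unsplit, and an internal node is a tuple (n, left, right) with n = m * p, where the order of the two factors matters. The slides do not say which leaf sizes the actual tools allow, so the counts below hold only under this convention.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def split_trees(n):
    """Return every split tree for a size-n transform.

    A tree is either the integer n (leaf, left unsplit) or a tuple
    (n, left, right) where left and right are split trees for sizes
    m and p with n = m * p and m, p > 1.
    """
    trees = [n]
    for m in range(2, n):
        if n % m == 0:
            p = n // m
            for left in split_trees(m):
                for right in split_trees(p):
                    trees.append((n, left, right))
    return tuple(trees)

if __name__ == "__main__":
    for size in (16, 32, 64):
        print(size, len(split_trees(size)))
```

Even for these small sizes the number of trees grows quickly (15 for n = 16 and 188 for n = 64 under this convention).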
18Experiment 3 Results
- Procedure
- Exhaustive generation of split trees for DFT
sizes n = 16 to 256.
- Formulations partitioned for various topologies
using tools.
- Results represented as mega-trees
- Observation of split tree decisions that lead to
partition-friendly formulations → heuristics
[Figure: Mega-tree for 32-point FFT]
19Formulation Exploration Heuristic
Greedy top-down formulation exploration using
breakdown strategy.
[Flowchart: Start → Gen. Initial Splits → DFG
Partition → Latency Improvement? (T: Det. Next
Split Leaf → Reformulate; F: End)]
- Results compared against best results of
Simulated Annealing heuristic.
- Latency reduction: Average 10.8%, Peak 13.3%
- Run-time reduction: Average 96.3%, Peak 99.4%
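The greedy top-down loop in the flowchart can be sketched roughly as below. The sketch makes several assumptions. Split trees are nested tuples (n, left, right) with integer leaves. The `latency` argument is a hypothetical callable standing in for the "DFG Partition" step (formulation to graph to partitioner). Because the slide does not say how the next leaf to split is determined, leaves are simply tried left to right. `toy_latency` is a made-up stand-in and not the author's estimator.

```python
def factor_pairs(n):
    """Ordered factorizations n = m * p with m, p > 1."""
    return [(m, n // m) for m in range(2, n) if n % m == 0]

def leaves(tree, path=()):
    """Yield (path, size) for every leaf; path indexes into the tuples."""
    if isinstance(tree, int):
        yield path, tree
    else:
        _, left, right = tree
        yield from leaves(left, path + (1,))
        yield from leaves(right, path + (2,))

def replace_leaf(tree, path, subtree):
    """Return a copy of tree with the leaf at path replaced by subtree."""
    if not path:
        return subtree
    node = list(tree)
    node[path[0]] = replace_leaf(tree[path[0]], path[1:], subtree)
    return tuple(node)

def greedy_explore(n, latency):
    """Greedy top-down search: start from the best one-level split, then
    keep splitting a leaf for as long as the latency improves."""
    candidates = [(n, m, p) for m, p in factor_pairs(n)] or [n]
    best = min(candidates, key=latency)
    best_latency = latency(best)
    improved = True
    while improved:
        improved = False
        for path, size in leaves(best):
            for m, p in factor_pairs(size):
                candidate = replace_leaf(best, path, (size, m, p))
                candidate_latency = latency(candidate)
                if candidate_latency < best_latency:
                    best, best_latency = candidate, candidate_latency
                    improved = True
                    break
            if improved:
                break
    return best, best_latency

def toy_latency(tree):
    """Made-up stand-in: penalizes large unsplit leaves."""
    return sum(size * size for _, size in leaves(tree))

if __name__ == "__main__":
    print(greedy_explore(16, toy_latency))
```

Each accepted step costs one partitioning run per candidate tried, rather than one run per tree in the exhaustive space.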
20Discrete Cosine Transform
Motivation
- Encouraging results obtained from FFT formulation
exploration with CT factorization.
Challenges
- DCT is not as regular as FFT.
- No Cooley-Tukey equivalent for DCT.
Approach
- Study existing DCT formulations and their
appropriateness for distributed implementation.
- Derived a formulation that allows proper
exploration via CT-like decomposition.
21Regular CT-like DCT Formulation
Permutation Rules
CT-like Formulation
- Arbitrary decomposition of a size 2n DCT into
m- and k-sized components
22DCT Experiment and Results
- Compared latency and run-time for DCT
formulations
[Plot: log scale]
- Up to 18.3% latency reduction vs. the best of
the other formulations. Average: 10.8%
- Up to 83.3% run-time reduction vs. the best of
the other formulations. Average: 47.7%
23Research methodology
24Results Comparison with Previous Methodology
- Latency Reduction: Average 23.3%, Peak 34.1%
- Run-Time Reduction: Average 98.0%, Peak 99.9%
25Results Comparison with Previous Methodology
- Previous methodology [Srinivasan01]
- Features
- Generic partitioning methodology for multi-FPGA
boards
- Reported results of partitioning small (16-point)
DFT and DCT
- Well documented → allows third-party
implementation
- Methodology
- Input: (fully expanded) dataflow graph
- Partition heuristic: Genetic Algorithms
optimizing an objective function
26Conclusions
- Hypothesis was proven correct!
- Multiple opportunities to improve partitioning by
taking advantage of DST and DHA features.
- Graph level
- Regularity of permutations and operations.
- Graph partitioning and area estimator can be made
more sensitive to DHA and DST concerns
- Algorithmic level
- Reformulation has significant impact on partition
quality
- Improvements over generic methodology
- Latency reduction (23.3% avg, 34.1% max)
- Run-time reduction (98.8% avg, 99.9% max)
27Main Contributions
- Automated high-level partitioning methodology for
DSTs onto distributed hardware architectures
- Fusion of CAD / DSP knowledge
- KTG: automated and extensible methodology for
conversion of Kronecker product algebra (KPA)
formulations into DFGs
- Architectural model and high-level estimator for
the implementation of distributed DSTs
- k-way graph partitioning heuristic for mapping
DSTs to DHAs
- New knowledge of the effect of DST formulations
on their distributed implementation
- New heuristic for exploring DST formulation space
- New arbitrary decomposition DCT formulation
28Future Work
- Study partitioning in alternative architectures
and for diverse objectives
- Network-on-chip architectures
- Power efficiency objective
- Computer-driven search of heuristics
- Machine learning / genetic programming
- Development of a production-quality tool
- Study of further DSTs
- Partition-aware scheduling heuristics
29Other future work
- Application of EDA algorithms to bioinformatics
problems
- Analogy of abstraction levels: electronics vs.
molecules
- EDA abstraction levels for transistors
- differential equations at the electronic level
- Boolean equations for digital electronics
- Biological abstraction models: molecular
reactions
- differential equations at the physical chemistry
level
- discrete models of molecular behavior with
probabilistic considerations
- Research intention: discover novel ways in which
these two fields may benefit from each other.
- Data structures
- Algorithms
30Examples
Discrete genetic modeling can benefit from
representation and manipulation techniques
commonly used for logic optimization and
minimization in digital circuits
[Ideker00] [Riedel].
[Figure: Cartoon of a Boolean network with two
inputs per node. Colors represent the state of a
node ("on" or "off"). At each time step, each
color is updated according to the node's truth
table and the states of its input nodes.]
→ Binary Decision Diagram
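The update rule described in the figure caption above can be sketched as a synchronous Boolean network step. The three-node network, its wiring, and the truth tables below are made up for illustration and do not come from the cited work.

```python
def step(state, inputs, truth_tables):
    """Synchronously update every node from the states of its two inputs."""
    return {node: truth_tables[node][(state[a], state[b])]
            for node, (a, b) in inputs.items()}

BOOLS = (False, True)
AND = {(x, y): x and y for x in BOOLS for y in BOOLS}
OR = {(x, y): x or y for x in BOOLS for y in BOOLS}
XOR = {(x, y): x != y for x in BOOLS for y in BOOLS}

# Hypothetical network: each node reads the other two nodes.
inputs = {"A": ("B", "C"), "B": ("A", "C"), "C": ("A", "B")}
truth_tables = {"A": AND, "B": OR, "C": XOR}

if __name__ == "__main__":
    state = {"A": True, "B": False, "C": True}
    for _ in range(3):
        state = step(state, inputs, truth_tables)
        print(state)
```

Each truth table here is an explicit lookup table. The link the slide draws to logic optimization is that such tables can be stored and manipulated more compactly, for example as binary decision diagrams.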
- Protein structure prediction from amino acid
sequence information has been improved by using
heuristics common in graph partitioning.
31Examples (cont.)
Previously: use of stochastic methods (S.A.,
G.A.). Recently: use of deterministic methods
inspired by EDA tasks [DeRonne07]
32Electronic device security
- Electronic devices are being used in
security-sensitive situations
- Need to authenticate devices to prevent
identity theft.
33Traditional Authentication
- Store secret key inside device.
[Diagram labels: Send a random number; Private Key;
Encrypted Number; Public Key]
- Only the valid secret key can generate a
valid response
- Expensive in terms of energy, fabrication
- Key may be read using electromagnetic or
microscopy methods.
34Physically Unclonable Functions
- Use electrical characteristics of device as a
digital fingerprint.
- No two electronic devices are exactly alike.
- Implement digital circuits that take advantage of
this to distinguish one chip from the other
(authenticate)
36Physically Unclonable Functions (PUFs)
- Only the authentic device will have a given
response to a series of challenges.
- Currently, a few logic circuits have been
proposed to implement PUFs
- Conjecture: Electronic Design Automation
techniques can be applied to find improved PUFs
for various types of architectures, optimizing
for various characteristics.
37Foreseeable tasks
- Define models to describe physical
characteristics.
- Objective functions
- Optimization methodology
- Heuristics
- Stochastic optimization algorithms
38Publications
Journal Articles
- R. Arce Nazario, M. Jiménez, D. Rodríguez.
Mapping of Discrete Cosine Transforms onto
Distributed Hardware Architectures. Submitted to
Journal of VLSI Signal Processing, Springer.
April 2007. Status: under revision.
- R. Arce Nazario, M. Jiménez, D. Rodriguez.
Algorithmic-level Exploration of Discrete Signal
Transforms for Partitioning to Distributed
Hardware Architectures. Accepted for publication
in IET Computers & Digital Techniques. April
2007.
Articles in peer-reviewed IEEE and ACM conferences
- R. Arce Nazario, M. Jiménez, D. Rodríguez.
Partitioning Exploration for Automated Mapping
of Discrete Cosine Transforms onto Distributed
Hardware Architectures. Accepted to the 50th
IEEE Midwest Symposium on Circuits and Systems.
August 2007. Montreal, Canada.
- R. Arce Nazario, M. Jiménez, D. Rodríguez.
High-level Partitioning of Discrete Signal
Transforms for Multi-FPGA Architectures. 16th
IEEE International Conference on Field
Programmable Logic and Applications. August
2006. Madrid, Spain.
- R. Arce Nazario, M. Jiménez, D. Rodríguez.
Functionally-aware Partitioning of Discrete
Signal Transforms for Distributed Hardware
Architectures. 49th IEEE Midwest Symposium on
Circuits and Systems. August 2006. San Juan, PR.
- R. Arce Nazario, M. Jiménez, D. Rodríguez.
Effects of High-Level Discrete Signal Transform
Formulations on Partitioning for Distributed
Hardware Architectures. IEEE Symposium on
Field-Programmable Custom Computing Machines.
April 2006. Napa, CA.
39Publications
Articles in peer-reviewed IEEE and ACM conferences
(continued)
- R. Arce Nazario, M. Jiménez, D. Rodríguez. An
assessment of high-level partitioning techniques
for implementing discrete signal transforms on
distributed hardware architectures. 48th IEEE
Midwest Symposium on Circuits and Systems.
August 2005. Cincinnati, Ohio.
- R. Lembach, R. Arce-Nazario, D. Eisenmenger, and
C. Wood. A diagnostic method for detecting and
assessing the impact of physical design
optimizations on routing. ACM International
Symposium on Physical Design. April 2005. San
Francisco, CA.
- R. Arce Nazario, M. Jiménez. Integer Pair
Representation for Multiple Output Logic. 47th
IEEE Midwest Symposium on Circuits and Systems.
December 2003. Cairo, Egypt.
Other publications, posters and presentations
- Rafael A. Arce-Nazario and Manuel Jimenez.
High-Level Partitioning of Discrete Signal
Transforms for Distributed Hardware
Architectures. Poster presentation, Puerto Rico
Interdisciplinary Scientific Meeting. February
2007. Bayamón, Puerto Rico.
- Rafael A. Arce-Nazario and Manuel Jimenez.
High-Level Partitioning of Discrete Signal
Transforms for Distributed Hardware
Architectures. Poster presentation, Workshop on
Grid Services, Automated Information Processing,
and Wireless Sensor Networks. February 2007. San
Juan, Puerto Rico.
- Rafael A. Arce-Nazario and Manuel Jimenez.
High-Level Partitioning of Discrete Signal
Transforms for Distributed Hardware
Architectures. Poster presentation, Puerto Rico
Interdisciplinary Scientific Meeting. March 2006.
Cayey, Puerto Rico.
- Rafael A. Arce-Nazario, Manuel Jimenez, and
Domingo Rodriguez. High-level Partitioning of
Discrete Signal Transforms for Multi-FPGA
Architectures. Poster presented at WALSAIP
Project HP Labs research visit. October 2006.
Mayagüez, Puerto Rico.
- R. Arce Nazario, M. Jimenez, D. Rodriguez.
High-Level Partitioning Techniques For
Implementing Discrete Signal Transforms On
Distributed Hardware Architectures. Poster
presentation, Annual EPSCoR conference.
September 2005. Rio Grande, PR.
40Publications
Other publications, posters and presentations
(continued)
- R. Arce Nazario, M. Jimenez. High-Level
Partitioning Of DSP Algorithms For Multi-FPGA
Systems. Poster presentation. GEM Consortium
Future Faculty and Professionals Symposium. Las
Vegas, NV, June 2004.
- R. Arce Nazario, M. Jiménez. High-Level
Partitioning Of DSP Algorithms For Multi-FPGA
Systems - A First Approach. Proceedings of the
Computing Research Conference, Mayagüez, PR,
April 2004.
- R. Arce Nazario, M. Jimenez. Integer Pair
Representation for Multiple Output Logic. PRSGC
Second Congress on Integrating NASA Research and
Education Projects in Puerto Rico, San Juan, PR,
November 2003.
- R. Arce Nazario, M. Jiménez. Integer Pair
Representation for Multiple Output Logic.
Proceedings of the Computing Research Conference,
Mayagüez, PR, April 2003.
41References cited in presentation
- [Garey76] M. Garey, D. Johnson, and L.
Stockmeyer. Some simplified NP-complete graph
problems. Theoretical Computer Science,
(1):237-267, 1976.
- [VanLoan92] Charles Van Loan. Computational
frameworks for the fast Fourier transform. SIAM,
1992.
- [Johnson90] J. Johnson, R. Johnson, D.
Rodriguez, and R. Tolimieri. A Methodology for
Designing, Modifying, and Implementing Fourier
Transform Algorithms on Various Architectures.
Circuits, Systems, and Signal Processing, 9:
449-500, 1990.
- [Srinivasan01] Vinoo Srinivasan, Sriram
Govindarajan, and Ranga Vemuri. Fine-grained and
coarse-grained behavioral partitioning with
effective utilization of memory and design space
exploration for multi-FPGA architectures. IEEE
Trans. Very Large Scale Integr. Syst.,
9(1):140-159, 2001.
- [SPIRAL05] Püschel, et al. SPIRAL: Code
Generation for DSP Transforms. Proceedings of the
IEEE, special issue on "Program Generation,
Optimization, and Adaptation," Vol. 93, No. 2,
2005, pp. 232-275.
- [Hagen95] Lars W. Hagen, Dennis J. H. Huang, and
Andrew B. Kahng. Quantified suboptimality of VLSI
layout heuristics. In Proceedings of the 32nd
ACM/IEEE conference on Design automation, pages
216-221, New York, NY, USA, 1995. ACM Press.
42Questions