1
Mapping of Regular Nested Loop Programs
to Coarse-grained Reconfigurable Arrays
Constraints and Methodology
F. Hannig, H. Dutta, W. Tichy, and Jürgen Teich
University of Erlangen-Nuremberg, Germany
Proceedings of the 18th International Parallel and
Distributed Processing Symposium (IPDPS'04)
Presented by Luis Ortiz
Department of Computer Science
The University of Texas at San Antonio
2
Outline
  • Overview
  • The Problem
  • Reconfigurable Architectures
  • Design Flow for Regular Mapping
  • Parallelizing Transformations
  • Constraints Related to CG Reconfigurable Arrays
  • Case Study
  • Results
  • Conclusions and Future Work

3
Overview
  • Constructing a parallel program is equivalent to
    specifying its execution order
  • the operations of a program form a set, and its
    execution order is a binary, transitive and
    asymmetric relation
  • the relevant sets are (unions of) Z-polytopes
  • most of the optimizations may be presented as
    transformation of the original program
  • The problem of automatic parallelization
  • given a set of operations E and a strict total
    order on it
  • find a partial order on E such that execution of
    E under it is determinate and gives the same
    results as the original program

4
Overview (cont.)
  • Defining a polyhedron
  • a set of linear inequalities Ax + a ≥ 0
  • the polyhedron is the set of all x which
    satisfy these inequalities
  • the basic property of a polyhedron is convexity
  • if two points a and b belong to a polyhedron,
    then so do all convex combinations
  • λa + (1 − λ)b, 0 ≤ λ ≤ 1
  • a bounded polyhedron is called a polytope
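
The definition above can be checked numerically. The following is a minimal sketch, not taken from the slides: the triangle, the helper names in_polyhedron and convex_combination, and the sample points p and q are all illustrative assumptions.

```python
from fractions import Fraction

def in_polyhedron(A, a, x):
    """Return True if A x + a >= 0 holds in every row."""
    return all(sum(r * xi for r, xi in zip(row, x)) + ai >= 0
               for row, ai in zip(A, a))

def convex_combination(lam, p, q):
    """Return lam * p + (1 - lam) * q for 0 <= lam <= 1."""
    return [lam * pi + (1 - lam) * qi for pi, qi in zip(p, q)]

# Assumed example polytope (a triangle): x >= 0, y >= 0, -x - y + 4 >= 0
A = [[1, 0], [0, 1], [-1, -1]]
a = [0, 0, 4]

# Two points of the triangle and a few convex combinations of them
p, q = [0, 4], [4, 0]
all_inside = all(
    in_polyhedron(A, a, convex_combination(Fraction(k, 4), p, q))
    for k in range(5)
)
```

Exact fractions are used for λ so that the boundary test is not disturbed by floating-point rounding.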

5
Overview (cont.)
  • The essence of the polytope model is to apply
    affine transformations to the iteration spaces of
    a program
  • the iteration domain of statement S
  • Dom(S) = { x | D_S x + d_S ≥ 0 }
  • D_S and d_S are the matrix and constant vector
    which define the iteration polytope. d_S may
    depend linearly on the structure parameters
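
As an illustration of an iteration domain, the sketch below enumerates the integer points of Dom(S) for an assumed triangular loop nest (for i in 0..N-1, for j in 0..i). The loop nest, the matrix D, the vector d and the function names are hypothetical examples, not taken from the paper; note that d depends linearly on the structure parameter N.

```python
from itertools import product

def iteration_domain(D, d, box):
    """Integer points x inside the bounding box with D x + d >= 0."""
    return [x for x in product(*(range(lo, hi + 1) for lo, hi in box))
            if all(sum(r * xi for r, xi in zip(row, x)) + di >= 0
                   for row, di in zip(D, d))]

def triangular_domain(N):
    """Dom(S) for the assumed nest: for i in 0..N-1, for j in 0..i."""
    D = [[1, 0],    #  i         >= 0
         [-1, 0],   # -i + N - 1 >= 0
         [0, 1],    #  j         >= 0
         [1, -1]]   #  i - j     >= 0
    d = [0, N - 1, 0, 0]   # only d depends on the parameter N
    return iteration_domain(D, d, [(0, N - 1), (0, N - 1)])
```

The enumeration visits the same points, in the same lexicographic order, as the loop nest itself.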

6
Overview (cont.)
  • Coarse-grained reconfigurable architectures
  • provide flexibility of software combined with the
    performance of hardware
  • but, hardware complexity is a problem due to a
    lack of mapping tools
  • Parallelization techniques and compilers
  • map computationally intensive algorithms
    efficiently to coarse-grained reconfigurable
    arrays

7
The Problem
Mapping a certain class of regular nested
loop programs onto a dedicated processor array
8
Reconfigurable Architectures
  • Span a wide range of abstraction levels
  • from fine-grained Look-Up Table (LUT) based
    reconfigurable logic devices to distributed and
    hierarchical systems with heterogeneous
    reconfigurable components
  • Efficiency comparison
  • standard arithmetic is less efficient on
    fine-grained architectures
  • due to the large routing area overhead
  • Little research deals with compilation to
    coarse-grained reconfigurable architectures

9
Design Flow for Regular Mapping
10
Design Flow for Regular Mapping (cont.)
  • A piecewise regular algorithm contains N
    quantified equations
  • each equation S_i[I] is of the form
    x_i[I] = f_i(..., x_j[I − d_ji], ...), for all I ∈ I_i
  • x_i[I] are indexed variables
  • f_i are arbitrary functions
  • d_ji ∈ Z^n are constant data dependence vectors,
    and "..." denote similar arguments
  • I_i are called index spaces
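
A small example may help to read this notation. The sketch below evaluates an assumed two-equation piecewise regular algorithm on a square index space: a boundary equation x[I] = 1 and an interior equation x[I] = x[I − (1,0)] + x[I − (0,1)]. The recurrence and the function name evaluate are illustrative choices, not the algorithm used in the paper.

```python
from itertools import product

def evaluate(n):
    """Evaluate x[I] for all I in {0..n-1}^2 and return the table."""
    deps = [(1, 0), (0, 1)]   # constant data dependence vectors
    x = {}
    # Lexicographic order visits I - d before I for both vectors.
    for I in product(range(n), repeat=2):
        if 0 in I:
            x[I] = 1          # equation on the boundary index space
        else:                 # equation on the interior index space
            x[I] = sum(x[(I[0] - d[0], I[1] - d[1])] for d in deps)
    return x
```

Each equation is valid only on its own index space, which is what makes the algorithm piecewise regular.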

11
Design Flow for Regular Mapping (cont.)
  • Linearly bounded lattice
  • this set is affinely mapped onto iteration
    vectors I using an affine transformation
  • Block pipelining period
  • time interval between the initiations of two
    successive problem instances (β)
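
The defining formula of the linearly bounded lattice is not preserved in this transcript. The sketch below assumes the commonly used form I = M κ + c with A κ ≥ b; the matrices, the bounding box for κ and the function name are illustrative assumptions only.

```python
from itertools import product

def linearly_bounded_lattice(M, c, A, b, kappa_box):
    """Points I = M kappa + c for integer kappa with A kappa >= b.

    kappa_box is a list of (lo, hi) search bounds per kappa component.
    """
    points = set()
    for kappa in product(*(range(lo, hi + 1) for lo, hi in kappa_box)):
        if all(sum(r * k for r, k in zip(row, kappa)) >= bi
               for row, bi in zip(A, b)):
            points.add(tuple(sum(m * k for m, k in zip(row, kappa)) + ci
                             for row, ci in zip(M, c)))
    return sorted(points)

# Assumed example: kappa in {0, 1}^2, mapped to I = (2 * k1 + 1, k2)
M = [[2, 0], [0, 1]]
c = [1, 0]
A = [[1, 0], [-1, 0], [0, 1], [0, -1]]
b = [0, -1, 0, -1]
lattice = linearly_bounded_lattice(M, c, A, b, [(-2, 2), (-2, 2)])
```

In this example the polytope in κ is the unit square, and the affine map stretches it so that only odd values appear in the first coordinate.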

12
Parallelizing Transformations
  • Based on the representation of equations and
    index spaces, several combinations of
    parallelizing transformations in the polytope
    model can be applied
  • Affine Transformations
  • Localization
  • Operator Splitting
  • Exploration of Space-Time Mappings
  • Partitioning
  • Control Generation
  • HDL Generation & Synthesis

13
Constraints Related to CG Reconfigurable Arrays
  • Coarse-grained (re)configurable architectures
    consist of
  • array of processor elements (PE)
  • one or more dedicated functional units or
  • one or more arithmetic logic units (ALU)
  • memory
  • local memory → register files
  • memory banks
  • an instruction memory is required if the PE
    contains an instruction programmable ALU
  • interconnect structures
  • I/O ports
  • synchronization and reconfiguration mechanisms

14
Case Study
  • Regular mapping methodology applied for a matrix
    multiplication algorithm
  • target architecture
  • PACT XPP64-A reconfigurable processor array
  • 64 ALU-PAEs with 24-bit data width in an 8x8 array
  • each ALU-PAE consists of three objects
  • the ALU-PAE
  • Back-Register-object (BREG)
  • Forward-Register-object (FREG)
  • all objects are connected to horizontal routing
    channels

15
Case Study (cont.)
  • RAM-PAEs are located in two columns at the left
    and right borders of the array, each with two
    ports for independent read/write operations
  • RAM can be configured to FIFO mode
  • each RAM-PAE has a 512x24 bit storage capacity
  • four independent I/O interfaces located in the
    corners of the array

16
Case Study (cont.)
Figure: Structure of the PACT XPP64-A reconfigurable
processor
Figure: ALU-PAE objects
17
Case Study (cont.)
  • Matrix multiplication algorithm
  • C = A · B
  • A ∈ Z^(N×N)
  • B ∈ Z^(N×N)
  • computations may be represented by a dependence
    graph (DG)
  • dependence graphs can be represented in a reduced
    form
  • Reduced Dependence Graph: to each edge
    e = (v_i, v_j) there is associated a dependence
    vector d_ij ∈ Z^n
  • virtual Processor Elements (VPEs) are used to map
    the PE obtained from the design flow to the given
    architecture

18
Case Study (cont.)
Listing: Matrix multiplication algorithm, C code
Listing: Matrix multiplication algorithm after
parallelization, operator splitting, embedding,
and localization
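
The two listings of this slide are not preserved in the transcript. As a rough illustration of the idea, the sketch below (in Python rather than C) contrasts the plain triple loop with an assumed textbook-style localized form, where a and b are propagated between neighbouring index points and the multiply-add is split into two operations. The variable names and the propagation directions are assumptions and may differ from the slide.

```python
def matmul(A, B):
    """Sequential triple loop computing C = A * B."""
    N = len(A)
    C = [[0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            for k in range(N):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_localized(A, B):
    """Same result; every value is read from a neighbouring index point."""
    N = len(A)
    a, b, c = {}, {}, {}
    for i in range(N):
        for j in range(N):
            for k in range(N):
                # a enters at j == 0 and is propagated along j
                a[i, j, k] = A[i][k] if j == 0 else a[i, j - 1, k]
                # b enters at i == 0 and is propagated along i
                b[i, j, k] = B[k][j] if i == 0 else b[i - 1, j, k]
                # operator splitting: multiply first, then accumulate
                z = a[i, j, k] * b[i, j, k]
                c[i, j, k] = z if k == 0 else c[i, j, k - 1] + z
    return [[c[i, j, N - 1] for j in range(N)] for i in range(N)]
```

After localization every dependence is one of the constant vectors (0,1,0), (1,0,0) or (0,0,1), which is what allows a mapping onto an array with nearest-neighbour connections.
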
19
Case Study (cont.)
Figure: DG of transformed matrix multiplication
algorithm (N = 2)
Figure: 4 x 4 processor array
Figure: Reduced dependence graph
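
A space-time mapping assigns every index point a processor and a start time. The sketch below assumes the dependence vectors (0,1,0), (1,0,0), (0,0,1), a schedule vector λ = (1,1,1) and a projection along k, so that index point (i,j,k) runs on PE (i,j) at time i + j + k. These choices are illustrative and are not confirmed by the transcript.

```python
from itertools import product

def space_time_map(N, lam=(1, 1, 1)):
    """Map each (i, j, k) to PE (i, j); return the start times per PE."""
    deps = [(0, 1, 0), (1, 0, 0), (0, 0, 1)]   # assumed dependence vectors
    # Causality: every dependence must take at least one time step.
    if any(sum(l * d for l, d in zip(lam, dv)) < 1 for dv in deps):
        raise ValueError("schedule vector violates a data dependence")
    schedule = {}
    for i, j, k in product(range(N), repeat=3):
        t = lam[0] * i + lam[1] * j + lam[2] * k
        schedule.setdefault((i, j), []).append(t)
    return schedule
```

Under these assumptions, N = 4 yields 16 processor elements, each busy for N consecutive time steps.
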
20
Case Study (cont.)
  • Output data
  • O_x: the output-variable space of variable x of
    the space-time mapped or partitioned index space
  • the output can be two-dimensional
  • the transformed output variables are distributed
    over the entire array
  • collect the data from one processor line PL and
    feed them out to an array border
  • m ∈ Z^(1×n) denote the time instances
    t ∈ T_x(P_i,j) where the variable x produces an
    output at processor element P_i,j

21
Case Study (cont.)
  • if one of the following conditions holds, output
    data can be serialized

22
Case Study (cont.)
  • Partitioned implementation of the matrix
    multiplication algorithm
  • Dataflow graph of the LPGS-partitioned matrix
    multiplication 4 x 4 example
  • Dataflow graph after performing localization
    inside each tile
  • Array implementation of the partitioned example
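
LPGS stands for "locally parallel, globally sequential": the virtual processor array is cut into tiles the size of the physical array, the PEs of one tile run in parallel, and the tiles are processed one after another. The sketch below shows only this index arithmetic; the array sizes and the function name are assumptions, since the transcript does not state the tile size used in the paper.

```python
def lpgs_partition(n_rows, n_cols, p_rows, p_cols):
    """Map each virtual PE (i, j) to (tile index, physical PE)."""
    mapping = {}
    for i in range(n_rows):
        for j in range(n_cols):
            tile = (i // p_rows, j // p_cols)   # processed sequentially
            pe = (i % p_rows, j % p_cols)       # runs in parallel in a tile
            mapping[(i, j)] = (tile, pe)
    return mapping

# Assumed example: 4 x 4 virtual array on a 2 x 2 physical array
mapping = lpgs_partition(4, 4, 2, 2)
tiles = sorted({tile for tile, _ in mapping.values()})
```

In the assumed example each physical PE serves four virtual PEs, one per tile.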

23
Results
  • Both implementations (full-size and partitioned)
    show optimal utilization of resources
  • Each configured MAC-unit performs one operation
    per cycle
  • A better implementation achieves more
    performance per cycle while using fewer
    resources
  • The number of ALUs is reduced from O(3N) to O(N)
  • Merging and writing of output data streams is
    overlapped with computations in PEs

24
Conclusions and Future Work
  • The mapping methodology based on loop
    parallelization in the polytope model provides
    results that are efficient in terms of
    utilization of resources and execution time
  • Future work focuses on the automatic
    compilation of nested loop programs