Mapping of Regular Nested Loop Programs to Coarse-grained Reconfigurable Arrays:
Constraints and Methodology
F. Hannig, H. Dutta, W. Tichy, and Jürgen Teich
University of Erlangen-Nuremberg, Germany
Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS'04)
Presented by Luis Ortiz
Department of Computer Science, The University of Texas at San Antonio
Outline
- Overview
- The Problem
- Reconfigurable Architectures
- Design Flow for Regular Mapping
- Parallelizing Transformations
- Constraints Related to CG Reconfigurable Arrays
- Case Study
- Results
- Conclusions and Future Work
Overview
- Constructing a parallel program is equivalent to specifying its execution order
  - the operations of a program form a set, and its execution order is a binary, transitive, and asymmetric relation
  - the relevant sets are (unions of) Z-polytopes
  - most of the optimizations may be presented as transformations of the original program
- The problem of automatic parallelization (formalized in the sketch below)
  - given a set of operations E and a strict total order on it
  - find a partial order on E such that execution of E under it is determinate and gives the same results as the original program
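A hedged formalization of this problem statement, using standard dependence terminology rather than notation from the paper: say two operations conflict if they access a common memory location and at least one of them writes it; determinacy then requires the new partial order to preserve the original order of every conflicting pair.

    \text{Given } (E, <) \text{ with } < \text{ a strict total order, find a partial order } \prec \text{ on } E
    \text{ such that } \forall\, u, v \in E:\ (u, v \text{ conflict}) \wedge u < v \ \Longrightarrow\ u \prec v .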
Overview (cont.)
- Defining a polyhedron
  - a set of linear inequalities Ax + a ≥ 0
  - the polyhedron is the set of all x which satisfy these inequalities
  - the basic property of a polyhedron is convexity
    - if two points a and b belong to a polyhedron, then so do all convex combinations λa + (1 − λ)b, 0 ≤ λ ≤ 1
  - a bounded polyhedron is called a polytope (see the worked example below)
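A minimal worked example, not from the slides: four inequalities in the form Ax + a ≥ 0 carve out the unit square, a polytope.

    A = \begin{pmatrix} 1 & 0 \\ -1 & 0 \\ 0 & 1 \\ 0 & -1 \end{pmatrix}, \quad
    a = \begin{pmatrix} 0 \\ 1 \\ 0 \\ 1 \end{pmatrix}
    \quad\Longrightarrow\quad
    \{\, x \in \mathbb{R}^2 \mid Ax + a \ge 0 \,\} = [0,1] \times [0,1].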
Overview (cont.)
- The essence of the polytope model is to apply affine transformations to the iteration spaces of a program
- The iteration domain of statement S (illustrated in the loop nest below)
  - Dom(S) = { x | D_s·x + d_s ≥ 0 }
  - D_s and d_s are the matrix and constant vector which define the iteration polytope; d_s may depend linearly on the structure parameters
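As an illustration (the loop nest and statement S are assumed here, not taken from the paper), a triangular loop nest and its iteration polytope; note the structure parameter N entering d_s linearly:

    #define N 8
    void S(int i, int j);   /* statement body, assumed elsewhere */

    /* Dom(S) = { x = (i,j)^T | D_s x + d_s >= 0 } with rows
     * i >= 0, N-1-i >= 0, j >= 0, i-j >= 0:
     *
     *        (  1  0 )          (  0  )
     *  D_s = ( -1  0 )    d_s = ( N-1 )
     *        (  0  1 )          (  0  )
     *        (  1 -1 )          (  0  )
     */
    void iterate(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j <= i; j++)
                S(i, j);
    }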
Overview (cont.)
- Coarse-grained reconfigurable architectures
  - provide the flexibility of software combined with the performance of hardware
  - but hardware complexity is a problem due to a lack of mapping tools
- Parallelization techniques and compilers
  - map computationally intensive algorithms efficiently to coarse-grained reconfigurable arrays
The Problem
Mapping a certain class of regular nested
loop programs onto a dedicated processor array
8Reconfigurable Architectures
- Span a wide range of abstraction levels
- from fine-grained Look-Up Table (LUT) based
reconfigurable logic devices to distributed and
hierarchical systems with heterogeneous
reconfigurable components - Efficiency comparison
- standard arithmetic is less efficient on
fine-grained architectures - due to the large routing area overhead
- Few research work which deals with the
compilation to coarse-grained reconfigurable
architecture
Design Flow for Regular Mapping
Design Flow for Regular Mapping (cont.)
- A piecewise regular algorithm contains N quantified equations (a concrete instance follows the list)
  - each equation S_i[I] is of the form
    x_i[I] = f_i( ..., x_j[I − d_ji], ... )  for all I ∈ I_i
  - x_i[I] are indexed variables
  - f_i are arbitrary functions
  - d_ji ∈ Z^n are constant data dependence vectors; the ellipses "..." denote similar arguments
  - I_i are called index spaces
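For instance, the localized matrix multiplication used in the later case study can be written as a piecewise regular algorithm; this rendering (variable names assumed, not copied from the paper) makes every data dependence vector constant:

    \begin{aligned}
    a[i,j,k] &= a[i,\,j-1,\,k],                      & d &= (0,1,0)^T,\\
    b[i,j,k] &= b[i-1,\,j,\,k],                      & d &= (1,0,0)^T,\\
    c[i,j,k] &= c[i,\,j,\,k-1] + a[i,j,k]\,b[i,j,k], & d &= (0,0,1)^T,
    \end{aligned}
    \qquad \forall\, I = (i,j,k)^T \in \{\, I \in \mathbb{Z}^3 \mid 1 \le i,j,k \le N \,\}.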
Design Flow for Regular Mapping (cont.)
- Linearly bounded lattice (a small example follows the list)
  - I = { I ∈ Z^n | I = M·κ + c ∧ A·κ ≥ b }, with κ ∈ Z^l
  - { κ ∈ Z^l | A·κ ≥ b } defines an integral polytope
  - this set is affinely mapped onto iteration vectors I using the affine transformation I = M·κ + c
- Block pipelining period (β)
  - time interval between the initiations of two successive problem instances
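A small assumed example of a linearly bounded lattice: with M = (2), c = 0, and the one-dimensional polytope {κ ∈ Z | κ ≥ 0 ∧ K − κ ≥ 0}, the lattice consists of the even points of a segment.

    \mathcal{I} = \{\, I \in \mathbb{Z} \mid I = 2\kappa,\ 0 \le \kappa \le K \,\} = \{0, 2, 4, \ldots, 2K\}.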
Parallelizing Transformations
- Based on the representation of equations and index spaces, several combinations of parallelizing transformations in the polytope model can be applied (a space-time mapping example follows the list)
  - Affine Transformations
  - Localization
  - Operator Splitting
  - Exploration of Space-Time Mappings
  - Partitioning
  - Control Generation
  - HDL Generation / Synthesis
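As an illustration of one step, a space-time mapping for the localized matrix multiplication (this particular allocation and schedule are assumptions for illustration, not results quoted from the paper): projecting along k yields the processor coordinates, and the schedule vector λ = (1, 1, 1) is valid because λ·d ≥ 1 for every dependence vector d of the algorithm.

    \begin{pmatrix} p_1 \\ p_2 \\ t \end{pmatrix} =
    \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 1 & 1 \end{pmatrix}
    \begin{pmatrix} i \\ j \\ k \end{pmatrix},
    \qquad \lambda \cdot d = 1 \ \text{ for } d \in \{(1,0,0)^T, (0,1,0)^T, (0,0,1)^T\}.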
Constraints Related to CG Reconfigurable Arrays
- Coarse-grained (re)configurable architectures consist of an array of processor elements (PEs)
- Characteristic resources and constraints (a toy data model follows the list):
  - one or more dedicated functional units, or
  - one or more arithmetic logic units (ALUs)
  - memory
    - local memory / register files
    - memory banks
    - an instruction memory is required if the PE contains an instruction-programmable ALU
  - interconnect structures
  - I/O ports
  - synchronization and reconfiguration mechanisms
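A toy C data model of these constraints (entirely illustrative; all names are assumptions, not part of the paper or any vendor toolchain), of the kind a mapping tool might consult when checking a design against the target array:

    #include <stdbool.h>
    #include <stddef.h>

    /* Per-PE resource constraints. */
    typedef struct {
        unsigned num_func_units;     /* dedicated functional units     */
        unsigned num_alus;           /* arithmetic logic units         */
        size_t   register_file_b;    /* local register file, bytes     */
        size_t   memory_bank_b;      /* attached memory bank, bytes    */
        bool     instr_programmable; /* if true, an instruction memory
                                        is required as well            */
        size_t   instr_memory_b;     /* instruction memory, bytes      */
    } pe_constraints;

    /* Array-level resources. */
    typedef struct {
        unsigned rows, cols;         /* array dimensions, e.g. 8 x 8   */
        unsigned io_ports;           /* independent I/O interfaces     */
        pe_constraints pe;           /* constraints of each PE         */
    } array_constraints;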
Case Study
- Regular mapping methodology applied to a matrix multiplication algorithm
- Target architecture: PACT XPP64-A reconfigurable processor array
  - 64 ALU-PAEs with 24-bit data width in an 8x8 array
  - each ALU-PAE consists of three objects
    - the ALU object
    - the Back-Register object (BREG)
    - the Forward-Register object (FREG)
  - all objects are connected to horizontal routing channels
Case Study (cont.)
- RAM-PAEs are located in two columns at the left and right borders of the array, with two ports for independent read/write operations
  - the RAM can be configured to FIFO mode
  - each RAM-PAE has a 512 x 24-bit storage capacity
- Four independent I/O interfaces are located in the corners of the array
Case Study (cont.)
[Figure: structure of the PACT XPP64-A reconfigurable processor]
[Figure: ALU-PAE objects]
Case Study (cont.)
- Matrix multiplication algorithm
  - C = A · B, with A ∈ Z^(N×N) and B ∈ Z^(N×N)
  - computations may be represented by a dependence graph (DG)
  - dependence graphs can be represented in a reduced form
    - Reduced Dependence Graph (RDG): with each edge e = (v_i, v_j) there is associated a dependence vector d_ij ∈ Z^n
- Virtual Processor Elements (VPEs) are used to map the PEs obtained from the design flow to the given architecture
Case Study (cont.)
[Code figure: matrix multiplication algorithm, C code]
[Code figure: matrix multiplication algorithm after parallelization, operator splitting, embedding, and localization]
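The two listings are figures in the original slides; the sketch below shows what they plausibly contain (array names and the exact form of the embedding are assumptions). In the localized version, the broadcasts of A and B become nearest-neighbor propagations with the uniform dependence vectors noted in the comment.

    #define N 4

    /* Plain matrix multiplication, C = A * B. */
    void matmul(int A[N][N], int B[N][N], int C[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                C[i][j] = 0;
                for (int k = 0; k < N; k++)
                    C[i][j] += A[i][k] * B[k][j];
            }
    }

    /* After embedding into a 3-D index space and localization:
     * uniform dependences (0,1,0) for a, (1,0,0) for b, and
     * (0,0,1) for c; the result is read out as c[i][j][N-1]. */
    void matmul_localized(int A[N][N], int B[N][N], int C[N][N]) {
        static int a[N][N][N], b[N][N][N], c[N][N][N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++) {
                    a[i][j][k] = (j == 0) ? A[i][k] : a[i][j-1][k];
                    b[i][j][k] = (i == 0) ? B[k][j] : b[i-1][j][k];
                    c[i][j][k] = ((k == 0) ? 0 : c[i][j][k-1])
                               + a[i][j][k] * b[i][j][k];
                }
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                C[i][j] = c[i][j][N-1];    /* output equation */
    }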
Case Study (cont.)
[Figure: DG of the transformed matrix multiplication algorithm, N = 2]
[Figure: 4 x 4 processor array]
[Figure: reduced dependence graph]
Case Study (cont.)
- Output data
  - O_x is the output-variable space of variable x of the space-time mapped or partitioned index space
  - the output can be two-dimensional
  - the transformed output variables are distributed over the entire array
  - collect the data from one processor line PL and feed it out to an array border
  - a vector m ∈ Z^(1×n) determines the time instances t ∈ T_x(P_i,j) at which variable x produces an output at processor element P_i,j
Case Study (cont.)
- If one of the following conditions holds, the output data can be serialized
[Equations: serialization conditions]
Case Study (cont.)
- Partitioned implementation of the matrix multiplication algorithm (a sequential sketch of the partitioning scheme follows the list)
  - [Figure: dataflow graph of the LPGS-partitioned matrix multiplication, 4 x 4 example]
  - [Figure: dataflow graph after performing localization inside each tile]
  - [Figure: array implementation of the partitioned example]
Results
- Both implementations (full-size and partitioned) show optimal utilization of resources
- Each configured MAC unit performs one operation per cycle
- With a better implementation, more performance per cycle can be achieved using fewer resources
- The number of ALUs is reduced by about a factor of three (from 3N to N)
- Merging and writing of output data streams is overlapped with computations in the PEs
Conclusions and Future Work
- The mapping methodology, based on loop parallelization in the polytope model, provides results that are efficient in terms of resource utilization and execution time
- Future work focuses on fully automatic compilation of nested loop programs