1
Chapter 7. Systolic Array
2
Systolic Array
  • A systolic array is an array processor architecture that consists of
    • an array of identical processing units (PUs), or nodes,
    • inter-connected with localized data links,
  • and that performs
    • pipelined computation between PUs, with
    • identical computation at each node.
  • Motivations
    • Low communication overhead
    • Easy to design
    • Suitable for VLSI implementation
  • Applications
    • Implementation of algorithms that can be formulated as nested loops
    • Numerical linear algebra
    • Signal and image processing

3
Systolic Design Methodology
  • Algorithm mapping
    • Individual PUs are assigned indices in the processor index space.
    • Assignment: each node of the iteration dependence graph (DG) in the
      index space is mapped onto a PU's index.
    • Scheduling: each node of the iteration DG is assigned an integer
      schedule indicating the time step at which it is to be executed.
  • Linear Mapping Methodology
    • In general, assignment and scheduling are nonlinear operations.
    • However, if the iteration DG corresponds to a regular iterative
      algorithm (RIA), assignment and scheduling can be accomplished by
      linearly projecting each node of the DG onto the index space of the
      PUs and assigning it a linear schedule.

4
Formulate Algorithm in RIA format
  • Single Assignment Transformation
    • Removes unnecessary false data dependencies between iterations.
    • Accomplished by introducing a new variable, or an array of variables,
      to hold intermediate values during the computation.
    • May impose unnecessary dependence constraints during mapping.
  • Pipelined Data Duplication
    • Replaces data broadcasting, which would require a global data bus,
      without affecting the algorithm's performance.
    • Accomplished by introducing an intermediate variable that is
      propagated among the index nodes of the DG (see the sketch below).
    • May impose unnecessary dependence constraints during mapping.
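A tiny Python sketch of both transformations (the function names, the test data, and the scalar coefficient b are made up for illustration): the accumulator rewrite shows the single-assignment transformation, and the propagated copy b1 shows pipelined duplication of a broadcast value.

    # Original loop: the accumulator is overwritten every iteration (a false
    # dependence between iterations), and b is read by every iteration (a broadcast).
    def dot_broadcast(a, b):
        acc = 0
        for i in range(len(a)):
            acc = acc + b * a[i]
        return acc

    # Single-assignment form with pipelined duplication of b.
    def dot_single_assignment(a, b):
        n = len(a)
        s = [0] * (n + 1)              # s[i] holds the partial sum after iteration i
        b1 = [0] * (n + 1)
        b1[0] = b                      # b enters at the boundary ...
        for i in range(n):
            b1[i + 1] = b1[i]          # ... and is propagated node to node (no broadcast)
            s[i + 1] = s[i] + b1[i + 1] * a[i]   # each s[i+1] is written exactly once
        return s[n]

    a = [1, 2, 3, 4]
    print(dot_broadcast(a, 5), dot_single_assignment(a, 5))   # 50 50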

5
Example FIR Filter
  • FIR filter formulation
  • Single-assignment format with broadcast data (see the sketch below):
  •   Do n = 1, 2, . . .
  •     y1(n,-1) = 0
  •     Do k = 0, K
  •       y1(n,k) = y1(n,k-1) + h(k) x(n-k)
  •     enddo
  •     y(n) = y1(n,K)
  •   Enddo
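For reference, a minimal Python sketch of this single-assignment, broadcast formulation (the filter coefficients and input samples below are made-up test data; K + 1 = len(h)):

    # y1(n,k) = y1(n,k-1) + h(k) x(n-k),  y(n) = y1(n,K)
    def fir_broadcast(x, h):
        K = len(h) - 1
        y = []
        for n in range(len(x)):
            y1 = 0                                     # y1(n,-1) = 0
            for k in range(K + 1):
                xnk = x[n - k] if n - k >= 0 else 0    # x(n-k), taken as 0 before time 0
                y1 = y1 + h[k] * xnk
            y.append(y1)                               # y(n) = y1(n,K)
        return y

    h = [1, 2, 3, 4, 5]                 # h(0)..h(4), i.e. K = 4 as in the figure
    x = [1, 0, 2, -1, 3, 0, 1]          # x(0)..x(6), arbitrary test data
    print(fir_broadcast(x, h))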

[Figure: FIR filter dependence graph in the (n, k) index plane. Inputs x(0), x(1), ..., x(6) enter along the n axis, coefficients h(0), ..., h(4) lie along the k axis, and outputs y(0), ..., y(6) are taken at k = K.]
6
Example FIR Filter
  • Regular recurrent equations
  •   y1(n,-1) = 0,            n = 0, 1, 2, ...
  •   h1(0,k) = h(k),          k = 0, ..., K
  •   x1(n,0) = x(n),          n = 0, 1, 2, ...
  •   For n = 0, 1, 2, ... and k = 0, ..., K:
  •     y1(n,k) = y1(n,k-1) + h1(n,k) x1(n,k)
  •     h1(n,k) = h1(n-1,k)
  •     x1(n,k) = x1(n-1,k-1)
  •   y(n) = y1(n,K),          n = 0, 1, 2, ...
  • Leads to a shift-invariant dependence graph (SIDG); see the sketch below.
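A minimal Python sketch of this localized recurrence (same made-up test data as the previous sketch; x(n) is taken as 0 for n < 0). It produces the same outputs as the broadcast formulation:

    # Localized (RIA) FIR: h1 propagates along n, x1 along the (n,k) diagonal.
    def fir_ria(x, h):
        K, N = len(h) - 1, len(x)
        y1 = [[0] * (K + 1) for _ in range(N)]
        h1 = [[0] * (K + 1) for _ in range(N)]
        x1 = [[0] * (K + 1) for _ in range(N)]
        y = []
        for n in range(N):
            for k in range(K + 1):
                h1[n][k] = h[k] if n == 0 else h1[n - 1][k]      # h1(n,k) = h1(n-1,k)
                if k == 0:
                    x1[n][k] = x[n]                              # boundary: x1(n,0) = x(n)
                else:
                    x1[n][k] = x1[n - 1][k - 1] if n > 0 else 0  # x1(n,k) = x1(n-1,k-1)
                prev = y1[n][k - 1] if k > 0 else 0              # y1(n,-1) = 0
                y1[n][k] = prev + h1[n][k] * x1[n][k]
            y.append(y1[n][K])                                   # y(n) = y1(n,K)
        return y

    h = [1, 2, 3, 4, 5]
    x = [1, 0, 2, -1, 3, 0, 1]
    print(fir_ria(x, h))     # identical to the broadcast formulation's output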

7
Linear Schedule and Assignment
  • A schedule t(i) is a mapping from an index i in the DG to a positive
    integer t (the time index).
  • A time index is the quantum of time it takes to execute the operations
    of one iteration.
  • A linear schedule maps all indices i on the same hyperplane to the same
    time index. It can be characterized by the normal vector s of the
    equi-temporal hyperplanes.
  • An assignment is a mapping from an index i in the DG to an index n in
    the systolic array's processor index space.
  • The processor index space has a lower dimension than that of the DG.
  • A linear assignment assigns all indices lying along the same vector d
    in the DG to the same processor (index).

8
Algebraic Formulation of Linear Assignment and
Schedule
  • Entries of the assignment (projection) vector d and the scheduling
    vector s must be integers. Their dimension is the same as that of the
    DG indices.
  • Processor space: the subspace orthogonal to d (spanned by the columns
    of the matrix P). Its entries are also integers.
  • PE assignment by index-node mapping:
    • n = P^T i
  • Scheduling by arc (dependence-vector) mapping:
    • D(e) = s^T v = number of delays on edge e of the systolic DFG
    • e: edge of the systolic array obtained by mapping the dependence
      vector v
    • v: dependence vector in the DG (both mappings are sketched below)
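A minimal numpy sketch of the two mappings (the particular P, s, and test vectors are the FIR choices introduced later, used here only for illustration):

    import numpy as np

    def pe_assignment(P, i):
        # Node mapping: PE index n = P^T i for DG index i
        return np.asarray(P).T @ np.asarray(i)

    def edge_delays(s, v):
        # Arc mapping: number of delays carried by the mapped edge for dependence vector v
        return int(np.asarray(s) @ np.asarray(v))

    P = np.array([[0], [1]])          # P = [0 1]^T stored as an n x 1 column
    s = np.array([1, 1])              # scheduling vector s = [1 1]^T
    print(pe_assignment(P, [3, 2]))   # DG node i = (n, k) = (3, 2) -> PE index [2]
    print(edge_delays(s, [1, 0]))     # dependence vector (1, 0) -> 1 delay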

9
Affine Transformation
  • The processor space matrix P spans the subspace in which the processor
    array index space lies.
  • Any DG index i is assigned to the PE whose index is
  •   p(i) = P i - p_o
  •   where p_o = min_{i∈DG} P i is an offset.
  • The iteration i is scheduled to be executed at time step
  •   t(i) = s^T i - t_o
  •   where t_o = min_{i∈DG} s^T i is an offset.
  • If iterations i and j are both assigned to the same PE, then
    i - j = k d, where k is an integer, and the difference of their
    schedules is
  •   t(i) - t(j) = s^T (i - j) = k (s^T d).
  • Thus, when k = 1, s^T d is the iteration interval between the execution
    of two successive iterations on the same PE (see the numeric check
    below).
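A small numeric check of these relations, using the s and d chosen for the FIR example later (values picked only to illustrate the algebra):

    import numpy as np

    s = np.array([1, 1])           # scheduling vector
    d = np.array([1, 0])           # projection (assignment) vector

    i = np.array([3, 2])
    k = 2
    j = i - k * d                  # i and j = (1, 2) are mapped to the same PE

    t = lambda idx: int(s @ idx)   # linear part of the schedule (the offset t_o cancels)
    print(t(i) - t(j))             # -> 2
    print(k * int(s @ d))          # -> 2, confirming t(i) - t(j) = k (s^T d)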

10
Finding Processor Space Matrix
  • Problem
    • Given the projection vector d, find a processor space matrix P such
      that α P P^T + β d d^T = I, where α and β are scaling constants chosen
      so that P and d have integer entries.
  • Solution
    • Find an n × (n-1) matrix V such that d^T V = 0.
    • Convert the entries of V into integers to yield the P matrix.
  • Finding the matrix V
    • Compute a matrix M whose columns span the subspace orthogonal to d
      (e.g., the projector M = I - d d^T/(d^T d)).
    • Factorize the M matrix; this can be accomplished using an LU
      factorization, an eigenvalue decomposition, or a singular value
      decomposition of M.
    • Scale the entries of the V matrix so that all entries are integers
      (see the sketch below).
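The matrix computed in the first step is not reproduced on the slide; a natural choice, assumed here, is the projector M = I - d d^T/(d^T d). A sketch of the recipe using sympy, so that the rational basis can be scaled to integer entries exactly:

    import sympy as sp

    def processor_space(d):
        # Given an integer projection vector d, return an n x (n-1) integer
        # matrix P whose columns are orthogonal to d (d^T P = 0).
        d = sp.Matrix(d)
        n = d.shape[0]
        # Assumed "Compute" step: projector onto the subspace orthogonal to d
        M = sp.eye(n) - (d * d.T) / (d.T * d)[0, 0]
        cols = []
        for v in M.columnspace():                         # rational basis of that subspace
            scale = sp.ilcm(*[sp.Rational(x).q for x in v])
            cols.append(scale * v)                        # clear denominators -> integers
        return sp.Matrix.hstack(*cols)

    d = [1, 0]                          # the FIR example's projection vector
    P = processor_space(d)
    print(P)                            # Matrix([[0], [1]]), i.e. P = [0 1]^T
    print(P.T * sp.Matrix(d))           # Matrix([[0]]): columns of P are orthogonal to d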

11
FIR Filter Linear Mapping Example
  • Dependence matrix: the columns are the dependence vectors of the
    recurrence, (0,1) for y1, (1,0) for h1, and (1,1) for x1.
  • Choose s = [1 1]^T.
  • Choose d = [1 0]^T. Then P = [0 1]^T.
  • PE assignment: n = P^T i = k.
  • Linear schedule (arc mapping): each dependence vector v maps to an edge
    carrying s^T v delays (see the sketch below).
  • Input/output mapping.
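A sketch that carries out the node and arc mappings for this choice of s, d, and P; the three dependence vectors are those of the FIR recurrence above:

    import numpy as np

    s = np.array([1, 1])                    # scheduling vector s = [1 1]^T
    P = np.array([0, 1])                    # P = [0 1]^T stored as a 1-D array

    deps = {"y1": np.array([0, 1]),         # y1(n,k) depends on y1(n,k-1)
            "h1": np.array([1, 0]),         # h1(n,k) depends on h1(n-1,k)
            "x1": np.array([1, 1])}         # x1(n,k) depends on x1(n-1,k-1)

    print("PE of DG node (n,k)=(3,2):", int(P @ np.array([3, 2])))   # -> 2 (= k)
    for name, v in deps.items():
        edge = int(P @ v)                   # displacement in the 1-D PE array
        delays = int(s @ v)                 # number of delays on that edge
        print(f"{name}: moves {edge} PE(s), {delays} delay(s)")
    # y1 moves 1 PE with 1 delay, h1 stays with 1 delay, x1 moves 1 PE with 2 delays
    # (cf. the D and 2D labels in the mapped DFG figures that follow).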

12
FIR Linear Mapping
[Figure: the FIR DG in the (n, k) plane annotated with the projection direction d and the scheduling vector s; arcs carry D and 2D delays, inputs x(0), x(1), x(2), ... enter along the array, and outputs y(0), y(1), y(2), ... leave it.]
13
FIR Linear Mapping
Weight stays, input is pipelined; long critical path for output y.
[Figure: the resulting 1-D systolic array with the projection d and schedule s indicated; inputs x(0), x(1), x(2), ... and outputs y(0), y(1), y(2), ... stream through PEs separated by single delays D.]
14
FIR Linear Mapping
Weight stays, input is pipelined; long critical path for output y.
[Figure: the mapped array drawn with D, 2D, and 3D delays on its edges; inputs x(0), x(1), x(2), ... and outputs y(0), y(1), y(2), ... stream through the PEs.]
15
Requirements for Valid Linear Assignment and
Schedule Vectors
  • Causality constraint: s^T v > 0
    • s: scheduling vector
    • v: any dependence vector
    • If iteration i has a data dependence on iteration j, then t(i) > t(j).
    • s^T v = 0 is permitted only if v is a dependence vector introduced by
      the localization of a broadcast variable.
  • Resource conflict avoidance: s^T d ≠ 0
    • s: scheduling vector
    • d: assignment vector
    • Note that s^T d is the iteration interval between two successive
      iterations executed on the same PE. If s^T d = 0, a resource conflict
      will occur (see the check below).
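A minimal check of both constraints for a candidate (s, d) pair; the dependence vectors are the FIR example's, and the broadcast-localization exception above is not modeled in this sketch:

    import numpy as np

    def valid_mapping(s, d, deps):
        # Causality: s^T v > 0 for every dependence vector v (columns of deps);
        # resource-conflict avoidance: s^T d != 0.
        s, d = np.asarray(s), np.asarray(d)
        causal = all(int(s @ v) > 0 for v in np.asarray(deps).T)
        no_conflict = int(s @ d) != 0
        return causal and no_conflict

    deps = np.array([[0, 1, 1],             # columns: y1, h1, x1 dependence vectors
                     [1, 0, 1]])
    print(valid_mapping([1, 1], [1, 0], deps))    # True: the choice used for the FIR mapping
    print(valid_mapping([1, -1], [1, 0], deps))   # False: violates causality for v = (0,1)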

16
Example Sorting
  • Input x(i); output m(i): a permutation of the inputs with
    m(i) ≥ m(j) for i < j (i.e., sorted in descending order).
  • RIA formulation (initialization: m1(i,i) = -∞; see the sketch below)
  •   for i = 1:N,
  •     x1(i,1) = x(i)
  •     for j = 1:i,
  •       m1(i+1,j) = max(x1(i,j), m1(i,j))
  •       x1(i,j+1) = min(x1(i,j), m1(i,j))
  •       if i == N,
  •         m(j) = m1(N+1,j)
  •       end
  •     end
  •   end
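A minimal Python sketch of this recurrence (1-based indices kept via dictionaries; the test input is made up). It returns the inputs sorted in descending order:

    import math

    def ria_sort(x):
        N = len(x)
        m1, x1 = {}, {}
        for i in range(1, N + 2):
            m1[(i, i)] = -math.inf                # initialization: m1(i,i) = -infinity
        m = [None] * (N + 1)
        for i in range(1, N + 1):
            x1[(i, 1)] = x[i - 1]                 # x1(i,1) = x(i)
            for j in range(1, i + 1):
                m1[(i + 1, j)] = max(x1[(i, j)], m1[(i, j)])   # keep the larger value
                x1[(i, j + 1)] = min(x1[(i, j)], m1[(i, j)])   # pass the smaller value on
                if i == N:
                    m[j] = m1[(N + 1, j)]         # m(j) = m1(N+1,j)
        return m[1:]

    print(ria_sort([3, 1, 4, 1, 5]))              # -> [5, 4, 3, 1, 1]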

[Figure: dependence graph of the sorting recurrence in the (i, j) plane, with nodes carrying the x1(i,j) and m1(i,j) variables; see sort734.m.]
17
Sorting: Three Different Mappings
[Figure: three different linear mappings of the sorting DG, yielding systolic arrays that realize insertion sort, bubble sort, and selection sort, respectively.]
18
Optimal Design Method
  • Total computation time T_comp
    • p, q: node indices in the DG
  • Sampling period T_s
  • Constrained optimization formulation
    • Find d and s to minimize T_comp or T_s,
    • subject to s^T v > 0 and s^T d > 0 (see the sketch below).
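Since T_comp's exact expression is not reproduced on the slide, the sketch below assumes the common form T_comp = max over DG nodes p, q of s^T(p - q) + 1, which depends only on s; d is merely checked against the constraints. A brute-force search over small integer vectors for the FIR example's DG:

    import itertools
    import numpy as np

    def t_comp(s, nodes):
        # Assumed objective: spread of the linear schedule over the DG, plus 1
        times = [int(np.dot(s, p)) for p in nodes]
        return max(times) - min(times) + 1

    def best_mapping(nodes, deps, candidates=range(-2, 3)):
        dim = len(nodes[0])
        best = None
        for s in itertools.product(candidates, repeat=dim):
            if not all(int(np.dot(s, v)) > 0 for v in deps):   # causality: s^T v > 0
                continue
            for d in itertools.product(candidates, repeat=dim):
                if not any(d) or int(np.dot(s, d)) <= 0:       # constraint: s^T d > 0
                    continue
                cost = t_comp(s, nodes)
                if best is None or cost < best[0]:
                    best = (cost, s, d)
        return best

    nodes = [(n, k) for n in range(7) for k in range(5)]   # FIR DG: n = 0..6, k = 0..4
    deps = [(0, 1), (1, 0), (1, 1)]
    print(best_mapping(nodes, deps))   # minimum T_comp is 11, attained e.g. by s = (1, 1)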

19
Multi-Projection
  • Since dim(d) = 1, the dimension of the processor space (i.e., the
    dimension of the systolic array) is n - 1, where n is the dimension of
    the DG.
  • If the desired dimension of the systolic array is smaller than n - 1,
    multi-projection can be applied.
  • Each projection introduces a new scale in time:
    • after the 1st projection, each delay D corresponds to 1 iteration;
    • after the 2nd projection, each delay corresponds to M iterations.
  • Example
[Figure: multi-projection example; the 1-D DFG obtained from the first projection (edges with delays D) is projected again along d with schedule s, introducing the new delay unit τ.]
20
Comparison of schedule
  • After the first projection
    • Execution time = 4D with D = 4 t.u., where the execution time of one
      iteration (one node) is 1 t.u.
  • After the second projection
    • Execution time = 3D + 4t, with D = 4 t.u. and t = 1 t.u., i.e. D = 4t,
      so both schedules take 16 t.u.
    • This is the same as using s = [1 5]^T directly with D = 1 t.u.

[Figure: node execution times under the two schedules, advancing in steps of t within a block (0, t, 2t, 3t) and in steps of D across blocks (0, D, 2D, 3D), up to 3D + 3t.]
21
Multi-projection
  • New representation
    • The logical processor space obtained after an earlier linear mapping
      can be regarded as an iteration DFG (IDFG), where each node represents
      the execution of an iteration and a delay D represents a dependence on
      a previous iteration.
    • The IDFG contains delays and cycles.
    • An instance graph (IG) can be created from the IDFG by removing all
      edges with delays.
  • Node mapping: same as before.
  • Arc mapping: the delays on each dependence edge are
    • the D delays carried over from the previous mapping, plus
    • (s^T v) τ additional delays in the new time scale after the
      second-level projection.

22
Illustration of Multi-projection
  • Remove the arcs with delays to create the instance graph (IG).
  • Project the instance graph along the new projection direction and add
    the appropriate delays to the edges of the IG.
  • Put back the edges with delays, with corrections due to the new delay
    unit.

[Figure: the DFG, its instance graph IG (delay edges removed), and the re-projected graph with delays in units of D and τ.]