Title: Chapter 7: Systolic Array
Slide 1: Chapter 7. Systolic Array
Slide 2: Systolic Array
- A systolic array is an array processor architecture that consists of
  - An array of identical processing units (PUs) or nodes
  - Interconnected with localized data links
- and that performs
  - Pipelined computation between PUs, with
  - Identical computation at each node
- Motivations
  - Low communication overhead
  - Easy to design
  - Suitable for VLSI implementation
- Applications
  - Implementation of algorithms that can be formulated as nested loops
  - Numerical linear algebra
  - Signal and image processing
Slide 3: Systolic Design Methodology
- Algorithm mapping
  - Individual PUs are assigned indices in the processor index space
- Assignment
  - Each node of the iteration DG in the index space is mapped onto a PU index
- Scheduling
  - Each node of the iteration DG is assigned an integer schedule indicating the time step at which it is to be executed
- Linear Mapping Methodology
  - In general, assignment and scheduling are nonlinear operations
  - However, if the iteration DG corresponds to an RIA (regular iterative algorithm), assignment and scheduling can be accomplished by a linear projection of each node of the DG onto the index space of the PUs, together with a linear schedule
Slide 4: Formulate the Algorithm in RIA Format
- Single Assignment Transformation
  - Removes unnecessary (false) data dependencies between iterations
  - Accomplished by introducing a new variable, or an array of variables, to hold intermediate values during the computation
  - May impose unnecessary dependence constraints during mapping
- Pipelined Data Duplication
  - Replaces data broadcasting, which would require a global data bus, without affecting the algorithm performance
  - Accomplished by introducing an intermediate variable that is propagated among index nodes in the DG
  - May impose unnecessary dependence constraints during mapping
Slide 5: Example: FIR Filter
- FIR filter formulation
- Single-assignment format, with broadcast data x(n-k) (see the sketch after this list):
  - Do n = 1, 2, ...
    - y1(n,-1) = 0
    - Do k = 0, K
      - y1(n,k) = y1(n,k-1) + h(k) x(n-k)
    - enddo
    - y(n) = y1(n,K)
  - Enddo
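To make the broadcast formulation concrete, here is a minimal Python sketch of it; the function name fir_broadcast and the zero-padding of x(n-k) for n < k are illustrative choices, not part of the slides.

```python
def fir_broadcast(x, h):
    """Single-assignment FIR: y1(n,k) = y1(n,k-1) + h(k)*x(n-k).

    x(n-k) is 'broadcast' to every k of iteration n, which is the
    dependency that the pipelined-duplication step later removes.
    """
    K = len(h) - 1
    y = []
    for n in range(len(x)):
        y1 = 0.0                                   # y1(n,-1) = 0
        for k in range(K + 1):
            xnk = x[n - k] if n - k >= 0 else 0.0  # broadcast input sample
            y1 = y1 + h[k] * xnk                   # y1(n,k) = y1(n,k-1) + h(k)*x(n-k)
        y.append(y1)                               # y(n) = y1(n,K)
    return y

# Example: 3-tap filter applied to a short input sequence.
print(fir_broadcast([1, 2, 3, 4], [0.5, 0.25, 0.25]))
```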
[Figure: dependence graph of the FIR filter in the (n, k) index plane, with inputs x(0)...x(6), weights h(0)...h(4), and outputs y(0)...y(6).]
Slide 6: Example: FIR Filter
- Regular recurrence equations (see the sketch after this list):
  - y1(n,-1) = 0, n = 0, 1, 2, ...
  - h1(0,k) = h(k), k = 0, ..., K
  - For n = 0, 1, 2, ... and k = 0, ..., K:
    - y1(n,k) = y1(n,k-1) + h1(n,k) x1(n,k)
    - h1(n,k) = h1(n-1,k)
    - x1(n,k) = x1(n-1,k-1)
  - y(n) = y1(n,K), n = 0, 1, 2, ...
- Leads to a shift-invariant DG (SIDG)
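A minimal Python sketch of the localized recurrence follows; the boundary condition x1(n,0) = x(n) and the function name fir_localized are assumptions added here to make the sketch self-contained.

```python
def fir_localized(x, h):
    """Localized (RIA) FIR recurrence from the slide:
        y1(n,k) = y1(n,k-1) + h1(n,k)*x1(n,k)
        h1(n,k) = h1(n-1,k),      h1(0,k) = h(k)
        x1(n,k) = x1(n-1,k-1)
    The boundary x1(n,0) = x(n) (and 0 outside the input range) is an
    assumption added to make the recurrence self-contained.
    """
    K, N = len(h) - 1, len(x)
    h1 = {}   # propagated weights
    x1 = {}   # pipelined input samples
    y = []
    for n in range(N):
        y1 = 0.0                                        # y1(n,-1) = 0
        for k in range(K + 1):
            h1[n, k] = h[k] if n == 0 else h1[n - 1, k]
            if k == 0:
                x1[n, k] = x[n]                         # assumed boundary condition
            else:
                x1[n, k] = x1.get((n - 1, k - 1), 0.0)
            y1 = y1 + h1[n, k] * x1[n, k]
        y.append(y1)                                    # y(n) = y1(n,K)
    return y

# Should match the broadcast version above.
print(fir_localized([1, 2, 3, 4], [0.5, 0.25, 0.25]))
```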
Slide 7: Linear Schedule and Assignment
- A schedule t(i) is a mapping from an index i in the DG to a positive integer t (the time index)
  - A time index is the quantum of time it takes to execute the operations of one iteration
  - A linear schedule maps all indices i on the same hyperplane to the same time index
  - It is characterized by the normal vector s of the equi-temporal hyperplanes
- An assignment is a mapping from an index i in the DG to an index n in the systolic array processor index space
  - The processor index space has lower dimension than that of the DG
  - A linear assignment assigns all indices lying along the same vector d in the DG to the same processor (index)
Slide 8: Algebraic Formulation of Linear Assignment and Schedule
- Entries of the assignment (projection) vector d and the scheduling vector s must be integers. Their dimension equals that of the DG indices.
- Processor space: the orthogonal subspace of d. Its entries are also integers.
- PE assignment by index (node) mapping:
  - n = Pᵀ i
- Scheduling by arc (dependence vector) mapping:
  - delta(e) = sᵀ v = number of delays on the corresponding edge of the DFG
  - e: edge of the systolic array
  - v: dependence vector
Slide 9: Affine Transformation
- The processor space matrix P spans the subspace in which the processor array index space lies.
- Any DG index i is assigned to the PE whose index is
  - p(i) = Pᵀ i - p_o
  - where p_o = min_{i in DG} Pᵀ i is an offset.
- If iterations i and j are both assigned to the same PE, then i - j = k d, where k is an integer.
- Iteration i is scheduled to be executed at time step
  - t(i) = sᵀ i - t_o
  - where t_o = min_{i in DG} sᵀ i is an offset.
- If iterations i and j are both assigned to the same PE, their schedule separation is
  - t(i) - t(j) = sᵀ(i - j) = k (sᵀ d).
- Thus, when k = 1, sᵀd is the iteration interval between the execution of two successive iterations on the same PE (a small numeric sketch follows).
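The sketch below (Python/numpy, with illustrative names) evaluates the affine assignment and schedule over an enumerated DG, using the FIR values that appear in the later example.

```python
import numpy as np

def affine_map(indices, P, s):
    """Map DG indices to (PE index, time step) via
    p(i) = P^T i - p_o and t(i) = s^T i - t_o,
    with offsets chosen so the smallest assignment/schedule is zero."""
    idx = np.array(indices)            # each row is one DG index i
    pe = idx @ P                       # P^T i for every node (row-wise)
    t = idx @ s                        # s^T i for every node
    pe -= pe.min(axis=0)               # subtract offset p_o
    t -= t.min()                       # subtract offset t_o
    return pe, t

# FIR example DG: i = (n, k) with n = 0..3, k = 0..2,
# projection d = [1, 0]^T, processor space P = [0, 1]^T, schedule s = [1, 1]^T.
nodes = [(n, k) for n in range(4) for k in range(3)]
P = np.array([[0], [1]])
s = np.array([1, 1])
pe, t = affine_map(nodes, P, s)
for i, node in enumerate(nodes):
    print(f"i = {node}: PE {int(pe[i, 0])}, time {int(t[i])}")
```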
Slide 10: Finding the Processor Space Matrix
- Problem
  - Given the projection vector d, find a processor space matrix P such that alpha P Pᵀ + beta d dᵀ = I, where alpha and beta are scaling constants chosen so that P and d have integer entries.
- Solution (a numpy sketch follows)
  - Find an n x (n-1) matrix V such that M = V Vᵀ.
  - Convert the entries of V into integers to yield the P matrix.
- Finding the matrix V
  - Compute M = I - d dᵀ / (dᵀ d).
  - Factorize the M matrix; this can be accomplished using LU factorization, or the eigenvalue or singular value decomposition of M.
  - Scale the entries of the V matrix so that all entries are integers.
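A numpy sketch of this procedure, assuming M = I - d dᵀ/(dᵀd) and an SVD-based factorization; the integer-scaling step here is a simplification intended only for small examples like the ones in this chapter.

```python
import numpy as np

def processor_space(d):
    """Find an integer basis P (n x (n-1)) for the subspace orthogonal to
    the projection vector d, following the M = I - d d^T/(d^T d) recipe.

    The SVD of M exposes its (n-1)-dimensional range; the scaling to
    integer entries (divide by the smallest magnitude, then round) is a
    simplification that works for small textbook examples only.
    """
    d = np.asarray(d, dtype=float).reshape(-1, 1)
    n = d.shape[0]
    M = np.eye(n) - (d @ d.T) / float(d.T @ d)   # projector onto the subspace orthogonal to d
    U, _, _ = np.linalg.svd(M)
    V = U[:, :n - 1]                             # columns spanning that subspace (M has rank n-1)
    P = np.empty_like(V)
    for j in range(V.shape[1]):                  # scale each column to integer entries
        col = V[:, j]
        col = col / np.abs(col[np.abs(col) > 1e-9]).min()
        P[:, j] = np.round(col)
    return P.astype(int)

# FIR example: d = [1, 0]^T gives P = [0, 1]^T (up to sign).
print(processor_space([1, 0]))
```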
Slide 11: FIR Filter Linear Mapping Example
- Dependence matrix: columns v_y = [0 1]ᵀ, v_h = [1 0]ᵀ, v_x = [1 1]ᵀ, read off from the recurrences for y1, h1, and x1 on Slide 6.
- Choose s = [1 1]ᵀ.
- Choose d = [1 0]ᵀ. Then P = [0 1]ᵀ.
- PE assignment: node (n, k) maps to PE Pᵀ i = k.
- Linear schedule (arc mapping): t(n, k) = n + k; edge delays sᵀv_y = 1, sᵀv_h = 1, sᵀv_x = 2.
- Input/output mapping: x(n) enters at PE 0 at time n; y(n) emerges from PE K at time n + K. (A sketch computing these mappings follows.)
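A quick numeric check of these numbers (Python/numpy; the variable names are illustrative):

```python
import numpy as np

# FIR mapping from this slide: schedule s and processor space P.
s = np.array([1, 1])
P = np.array([0, 1])

# Dependence vectors taken from the RIA recurrence on Slide 6.
deps = {"y1": np.array([0, 1]),
        "h1": np.array([1, 0]),
        "x1": np.array([1, 1])}

for name, v in deps.items():
    delay = s @ v        # number of delays on the mapped edge (s^T v)
    hop = P @ v          # PE-index displacement of the edge (P^T v)
    print(f"{name}: {delay} delay(s), moves {hop} PE(s)")

# Expected: y1 -> 1 delay, 1 PE; h1 -> 1 delay, 0 PEs (weight stays);
#           x1 -> 2 delays, 1 PE.
```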
Slide 12: FIR Linear Mapping
[Figure: the FIR dependence graph with the projection direction d and the schedule normal s overlaid, and the resulting one-dimensional systolic array. The weights h(k) stay in fixed PEs; the y edges carry one delay (D) and the x edges two delays (2D); inputs x(0), x(1), x(2), ... enter serially and outputs y(0), y(1), y(2), ... leave serially.]
Slide 13: FIR Linear Mapping
Weight stays, input pipelined, long critical path for output y.
[Figure: the corresponding array structure; inputs x(0), x(1), x(2), ... and outputs y(0), y(1), y(2), ... with unit delays (D) on the edges.]
Slide 14: FIR Linear Mapping
Weight stays, input pipelined, long critical path for output y.
[Figure: another realization of the mapping, with D, 2D, and 3D delays on the edges.]
Slide 15: Requirements for Valid Linear Assignment and Schedule Vectors
- Causality constraint: sᵀv > 0 (a checker sketch follows this list)
  - s: scheduling vector
  - v: any dependence vector
  - If iteration i has a data dependence on iteration j, then t(i) > t(j).
  - sᵀv = 0 is permitted if v is a dependence vector due to the localization of a broadcast variable.
- Resource conflict avoidance: sᵀd != 0
  - s: scheduling vector
  - d: assignment (projection) vector
  - Note that sᵀd is the iteration interval between two successive iterations executed on the same PE. If sᵀd = 0, a resource conflict will occur.
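A compact checker for the two conditions above; this is only a sketch, and the treatment of localized-broadcast vectors (tolerating sᵀv = 0) reflects the exception noted above.

```python
import numpy as np

def check_mapping(s, d, dep_vectors, localized=()):
    """Check causality (s^T v > 0 for every dependence vector v, with
    s^T v = 0 tolerated for localized-broadcast vectors) and resource
    conflict avoidance (s^T d != 0)."""
    s, d = np.asarray(s), np.asarray(d)
    allowed_zero = set(map(tuple, localized))
    ok = True
    for v in map(np.asarray, dep_vectors):
        delay = int(s @ v)
        if delay < 0 or (delay == 0 and tuple(v) not in allowed_zero):
            print(f"causality violated on v = {v.tolist()} (s^T v = {delay})")
            ok = False
    if int(s @ d) == 0:
        print("resource conflict: s^T d = 0")
        ok = False
    return ok

# FIR example from Slide 11: every edge gets a positive delay, s^T d = 1.
print(check_mapping([1, 1], [1, 0], [[0, 1], [1, 0], [1, 1]]))
```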
Slide 16: Example: Sorting
- Input x(i), output m(i)
  - m(i) is a permutation of the inputs x(i), with m(i) >= m(j) for i < j.
- RIA formulation (with m1(i,i) = -infinity), implemented in the sketch after this list:
  - for i = 1:N,
    - x1(i,1) = x(i)
    - for j = 1:i,
      - m1(i+1,j) = max(x1(i,j), m1(i,j))
      - x1(i,j+1) = min(x1(i,j), m1(i,j))
      - if i == N,
        - m(j) = m(N+1,j)
      - end
    - end
  - end
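A direct Python transcription of this recurrence (indices shifted to 0-based; the function name ria_sort is illustrative). It realizes the insertion-sort reading of the DG: each new input displaces smaller values down the m array.

```python
def ria_sort(x):
    """Sort by the RIA recurrence: each new x(i) is compared against the
    running maxima m1(.,j); the larger value stays, the smaller one
    propagates to the next column.  Returns m in descending order."""
    N = len(x)
    m = [float("-inf")] * N        # m1(i,i) = -infinity initialisation
    for i in range(N):
        xi = x[i]                  # x1(i,1) = x(i)
        for j in range(i + 1):
            # m1(i+1,j) = max(x1(i,j), m1(i,j)); x1(i,j+1) = min(x1(i,j), m1(i,j))
            m[j], xi = max(xi, m[j]), min(xi, m[j])
        # after the last i, m holds the sorted output m(j) = m1(N+1,j)
    return m

print(ria_sort([3, 1, 4, 1, 5]))   # [5, 4, 3, 1, 1]
```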
[Figure: triangular dependence graph of the sorting recurrence over indices (i, j), 1 <= j <= i <= N, with nodes x(i,j) and m(i,j) and the sorted outputs m(j) leaving along the last row.]
sort734.m
Slide 17: Sorting - Three Different Mappings
[Figure: three linear mappings of the sorting DG, yielding arrays that correspond to insertion sort, bubble sort, and selection sort.]
Slide 18: Optimal Design Method
- Total computation time: Tcomp = max over p, q of sᵀ(p - q) + 1
  - p, q: node indices in the DG
- Sampling period Ts (time between successive input samples)
- Constrained optimization formulation (see the search sketch below):
  - Find d and s to minimize Tcomp or Ts
  - subject to sᵀv > 0 and sᵀd > 0
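An exhaustive-search sketch of this formulation for a small 2-D DG; the search range, the example DG, and the tie-breaking rule are illustrative assumptions rather than part of the slides.

```python
import itertools
import numpy as np

def optimal_mapping(nodes, deps, search_range=range(-2, 3)):
    """Brute-force the constrained design problem: choose integer s and d
    minimizing Tcomp = max_{p,q in DG} s^T (p - q) + 1, subject to
    s^T v > 0 for every dependence vector v and s^T d > 0.
    Ties are broken toward small s^T d (iteration interval) and small d."""
    nodes = np.array(nodes)
    best = None
    vectors = list(itertools.product(search_range, repeat=2))
    for s_tup, d_tup in itertools.product(vectors, repeat=2):
        s, d = np.array(s_tup), np.array(d_tup)
        if any(int(s @ v) <= 0 for v in deps) or int(s @ d) <= 0:
            continue                                    # violates a constraint
        times = nodes @ s
        tcomp = int(times.max() - times.min()) + 1      # total computation time
        score = (tcomp, int(s @ d), int(d @ d))
        if best is None or score < best[0]:
            best = (score, s, d)
    (tcomp, _, _), s, d = best
    return tcomp, s, d

# FIR DG with n = 0..3, k = 0..2 and its three dependence vectors.
nodes = [(n, k) for n in range(4) for k in range(3)]
deps = [np.array(v) for v in ([0, 1], [1, 0], [1, 1])]
tcomp, s, d = optimal_mapping(nodes, deps)
print(f"Tcomp = {tcomp}, s = {s.tolist()}, d = {d.tolist()}")
# Several d are equally good here; [0, 1] and [1, 0] give the same objective.
```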
Slide 19: Multi-Projection
- Since dim(d) = 1, the dimension of the processor space (the dimension of the systolic array) is n-1, where n is the dimension of the DG.
- If the desired dimension of the systolic array is smaller than n-1, multi-projection can be applied.
- Each projection introduces a new time scale:
  - 1st projection: each delay D = 1 iteration
  - 2nd projection: each delay of the new time unit = M iterations
[Figure: the array obtained from the first projection, and the result of a second projection along d with schedule s, with edges carrying delays in both time scales.]
Slide 20: Comparison of Schedules
- After the first projection
  - Execution time = 4D, with D = 4 t.u., where the execution time of one iteration (one node) is 1 t.u.
- After the second projection
  - Execution time = 3D + 4t, with D = 4 t.u. and t = 1 t.u., i.e. D = 4t
  - Same as using s = [1 5]ᵀ with D = 1 t.u.
[Figure: node execution times before the second projection (0, D, 2D, 3D) and after it (0, t, 2t, 3t, ..., 3D + 3t).]
Slide 21: Multi-Projection
- New representation
  - The logical processor space after an earlier linear mapping can be regarded as an iteration DFG (IDFG), where each node represents the execution of one iteration and a delay D represents dependence on the previous iteration.
  - The IDFG contains delays and cycles.
  - An instance graph (IG) can be created from the IDFG by removing all edges with delays.
- Node mapping: same as before
- Arc mapping: the delays on each dependence edge are
  - the D delays carried over from the previous mapping, plus
  - (sᵀv) additional delays in the new time scale after the second-level multi-projection
Slide 22: Illustration of Multi-Projection
- Remove the arcs with delays to create the instance graph (IG).
- Project the instance graph along the new projection direction and add the appropriate delays (in the new time unit) to the edges of the IG.
- Put back the edges carrying D delays, with corrections due to the new delay unit.
[Figure: a DFG reduced to its instance graph, projected, and then restored with delays in both the old (D) and new time units.]