Title: CprE / ComS 583 Reconfigurable Computing
1. CprE / ComS 583: Reconfigurable Computing
Prof. Joseph Zambreno
Department of Electrical and Computer Engineering, Iowa State University
Lecture 12: Systolic Computing
2. Recap: Multi-FPGA Systems
- Crossbar topology
- Devices A-D are routing only
- Gives predictable performance
- Potential waste of resources for near-neighbor connections
3. Recap: Logic Emulation
- Emulation takes a sizable amount of resources
- Compilation time can be large due to FPGA compiles
4. Recap: Virtual Wires
- Overcome pin limitations by multiplexing pins and signals
- Schedule when communication will take place
5. Outline
- Recap
- Introduction and Motivation
- Common Systolic Structures
- Algorithmic Mapping
- Mapping Examples
- Finite impulse response
- Matrix-vector product
- Banded matrix-vector product
- Banded matrix multiplication
6. Systolic Computing
- systole (sĭs′tə-lē) n. the rhythmic contraction of the heart, especially of the ventricles, by which blood is driven through the aorta and pulmonary artery after each dilation or diastole. [Greek systolē, from systellein, to contract, from syn- + stellein, to send]
- systolic (sĭs-tŏl′ĭk) adj.
- "Data flows from memory in a rhythmic fashion, passing through many processing elements before it returns to memory." – Kung, 1982
7. Systolic Architectures
- Goal: a general methodology for mapping computations into hardware (spatial computing) structures
- Composition:
- Simple compute cells (e.g., add, sub, max, min)
- Regular interconnect pattern
- Pipelined communication between cells
- I/O at boundaries
[Figure: pipelined array of min compute cells; x values stream in, result c comes out]
8. Motivation
- Effectively utilize VLSI
- Reduce Von Neumann Bottleneck
- Target compute-intensive applications
- Reduce design cost
- Simplicity
- Regularity
- Exploit concurrency
- Local communication
- Short wires (small delay, less area)
- Scalable
9. Why Study?
- Original motivation: a specialized accelerator for an application
- Model/goals are a close match to reconfigurable computing
- Target algorithms match
- Well-developed theory, techniques, and solutions
- One big difference: Kung's approach targeted custom silicon (not a reconfigurable fabric)
- Compute elements needed to be more general
10. Common Systolic Structures
- One-dimensional linear array
11. Hexagonal Array
- Squared-up representation
- Communicates with six nearest neighbors
12. Binary Tree
13. Mapping Approach
- Allocate PEs
- Schedule computation
- Schedule PEs
- Schedule data flow
- Optimize
- Available Transformations
- Preload repeated values
- Replace feedback loops with registers
- Internalize data flow
- Broadcast common input
14. Example: Finite Impulse Response
- A Finite Impulse Response (FIR) filter is a type of digital filter
- Finite response to an impulse: the output eventually settles to zero
- Requires no feedback

for (i = 1; i <= n; i++)
  for (j = 1; j <= k; j++)
    y[i] += w[j] * x[i+j-1];
15. FIR Attempt 1
- Parallelize the outer loop
for (i = 1; i <= n; i++)      /* in parallel */
  for (j = 1; j <= k; j++)    /* sequential */
    y[i] += w[j] * x[i+j-1];
[Figure: n cells, one per output; the weight w_j is broadcast to all cells while cell i reads x_{i+j-1} and accumulates y_i]
16. FIR Attempt 1 (cont.)
for (i = 1; i <= n; i++)      /* in parallel */
  for (j = 1; j <= k; j++)    /* sequential */
    y[i] += w[j] * x[i+j-1];
17. FIR Attempt 1 (cont.)
- Retime to eliminate broadcast
for (i = 1; i <= n; i++)      /* in parallel */
  for (j = 1; j <= k; j++)    /* sequential */
    y[i] += w[j] * x[i+j-1];
18. FIR Attempt 1 (cont.)
for (i = 1; i <= n; i++)      /* in parallel */
  for (j = 1; j <= k; j++)    /* sequential */
    y[i] += w[j] * x[i+j-1];
19. FIR Attempt 2
- Parallelize the inner loop
for (i = 1; i <= n; i++)      /* sequential */
  for (j = 1; j <= k; j++)    /* in parallel */
    y[i] += w[j] * x[i+j-1];
[Figure: k cells, one per tap; cell j holds w_j and reads x_{i+j-1}, and the partial products combine into y_i]
20. FIR Attempt 2 (cont.)
for (i = 1; i <= n; i++)      /* sequential */
  for (j = 1; j <= k; j++)    /* in parallel */
    y[i] += w[j] * x[i+j-1];
21. FIR Attempt 2 (cont.)
for (i = 1; i <= n; i++)      /* sequential */
  for (j = 1; j <= k; j++)    /* in parallel */
    y[i] += w[j] * x[i+j-1];
22. FIR Attempt 2 (cont.)
for (i = 1; i <= n; i++)      /* sequential */
  for (j = 1; j <= k; j++)    /* in parallel */
    y[i] += w[j] * x[i+j-1];
23. FIR Attempt 2 (cont.)
for (i = 1; i <= n; i++)      /* sequential */
  for (j = 1; j <= k; j++)    /* in parallel */
    y[i] += w[j] * x[i+j-1];
24. FIR Attempt 2 (cont.)
- Retime to eliminate broadcast
for (i = 1; i <= n; i++)      /* sequential */
  for (j = 1; j <= k; j++)    /* in parallel */
    y[i] += w[j] * x[i+j-1];
25. FIR Summary
- Sequential:
- Memory bandwidth per output: 2k + 1
- O(k) cycles per output
- O(1) hardware
- Systolic:
- Memory bandwidth per output: 2
- O(1) cycles per output
- O(k) hardware
[Figure: four-tap systolic FIR — x_i streams through multiplier cells holding w1–w4 and the products accumulate into y_i]
26. Example: Matrix-Vector Product
for (i = 1; i <= n; i++)
  for (j = 1; j <= n; j++)
    y[i] += a[i][j] * x[j];
27. Matrix-Vector Product (cont.)
[Figure: data flow for a linear systolic array — the a_ij enter skewed in time (a11 at t=1; a21, a12 at t=2; a31, a22, a13 at t=3; a41, a32, a23, a14 at t=4; …) against the stream x1 … xn; outputs emerge as y1 at t=n, y2 at t=n+1, y3 at t=n+2, y4 at t=n+3]
28. Banded Matrix-Vector Product

[Figure: matrix A with a band of width p + q − 1]

for (i = 1; i <= n; i++)
  for (j = 1; j <= p+q-1; j++)
    y[i] += a[i][i+j-q] * x[i+j-q];
29. Banded Matrix-Vector Product (cont.)
[Figure: linear array of cells with ports a_in, x_in/x_out, y_in/y_out; each cell computes y_out = y_in + a_in · x_in; the band diagonals (a11; a12, a21; a22, a31; a23, a32; …) and the inputs x1, x2, x3, … enter the array on successive cycles]
30. Banded Matrix Multiplication
31. Banded Matrix Multiplication (cont.)
[Figure: hexagonal array — each cell computes c_out = c_in + a_in · b_in and passes a and b through; the a, b, and c bands flow through the array in three directions, with c11 emerging at t=4, c21 and c12 at t=5, c31 and c32 at t=6, and c41, c22, c14 at t=7]
32. Summary
- Systolic structures are good for computation-bound problems
- Model costs in VLSI systems:
- Minimize the number of memory accesses
- Emphasize local interconnections (long wires are bad)
- Candidate algorithms:
- Make multiple use of input data (e.g., n inputs, O(n³) computations)
- Concurrency
- Simple control flow, simple processing elements