Title: CprE / ComS 583 Reconfigurable Computing
1. CprE / ComS 583: Reconfigurable Computing
Prof. Joseph Zambreno
Department of Electrical and Computer Engineering, Iowa State University
Lecture 12: Systolic Computing
2. Recap: Multi-FPGA Systems
- Crossbar topology
- Devices A-D are routing only
- Gives predictable performance
- Potential waste of resources for near-neighbor connections
3. Recap: Logic Emulation
- Emulation takes a sizable amount of resources
- Compilation time can be large due to FPGA compiles
4. Recap: Virtual Wires
- Overcome pin limitations by multiplexing pins and signals
- Schedule when communication will take place
5. Outline
- Recap
- Introduction and Motivation
- Common Systolic Structures
- Algorithmic Mapping
- Mapping Examples
- Finite impulse response
- Matrix-vector product
- Banded matrix-vector product
- Banded matrix multiplication
6. Systolic Computing
- systole (sĭs′tə-lē) n. the rhythmic contraction of the heart, especially of the ventricles, by which blood is driven through the aorta and pulmonary artery after each dilation or diastole. [Greek systolē, from systellein, to contract, from syn- + stellein, to send]
- systolic (sĭs-tŏl′ĭk) adj.
- "Data flows from memory in a rhythmic fashion, passing through many processing elements before it returns to memory." – Kung, 1982
7. Systolic Architectures
- Goal: a general methodology for mapping computations into hardware (spatial computing) structures
- Composition:
- Simple compute cells (e.g., add, sub, max, min)
- Regular interconnect pattern
- Pipelined communication between cells
- I/O at boundaries
[Figure: pipelined array of min compute cells; x values stream in, result c comes out]
8. Motivation
- Effectively utilize VLSI
- Reduce Von Neumann Bottleneck
- Target compute-intensive applications
- Reduce design cost
- Simplicity
- Regularity
- Exploit concurrency
- Local communication
- Short wires (small delay, less area)
- Scalable
9. Why Study?
- Original motivation: a specialized accelerator for an application
- Model/goals are a close match to reconfigurable computing
- Target algorithms match
- Well-developed theory, techniques, and solutions
- One big difference: Kung's approach targeted custom silicon (not a reconfigurable fabric)
- Compute elements needed to be more general
10. Common Systolic Structures
- One-dimensional linear array
11. Hexagonal Array
- Squared-up representation
- Communicates with six nearest neighbors
12. Binary Tree
13. Mapping Approach
- Allocate PEs
- Schedule computation
- Schedule PEs
- Schedule data flow
- Optimize
- Available Transformations
- Preload repeated values
- Replace feedback loops with registers
- Internalize data flow
- Broadcast common input
14. Example: Finite Impulse Response
- A Finite Impulse Response (FIR) filter is a type of digital filter
- Finite response to an impulse: the output eventually settles to zero
- Requires no feedback

for (i = 1; i <= n; i++)
  for (j = 1; j <= k; j++)
    y[i] += w[j] * x[i+j-1];
15. FIR Attempt 1
- Parallelize the outer loop
for (i = 1; i <= n; i++)      /* in parallel */
  for (j = 1; j <= k; j++)    /* sequential */
    y[i] += w[j] * x[i+j-1];
[Figure: n cells, one per output; the weight w_j is broadcast to all cells while cell i reads x_{i+j-1} and accumulates y_i]
16. FIR Attempt 1 (cont.)
for (i = 1; i <= n; i++)      /* in parallel */
  for (j = 1; j <= k; j++)    /* sequential */
    y[i] += w[j] * x[i+j-1];
17. FIR Attempt 1 (cont.)
- Retime to eliminate broadcast
for (i = 1; i <= n; i++)      /* in parallel */
  for (j = 1; j <= k; j++)    /* sequential */
    y[i] += w[j] * x[i+j-1];
18. FIR Attempt 1 (cont.)
for (i = 1; i <= n; i++)      /* in parallel */
  for (j = 1; j <= k; j++)    /* sequential */
    y[i] += w[j] * x[i+j-1];
19. FIR Attempt 2
- Parallelize the inner loop
for (i = 1; i <= n; i++)      /* sequential */
  for (j = 1; j <= k; j++)    /* in parallel */
    y[i] += w[j] * x[i+j-1];
[Figure: k cells, one per tap; cell j holds w_j and reads x_{i+j-1}, and the partial products combine into y_i]
20. FIR Attempt 2 (cont.)
for (i = 1; i <= n; i++)      /* sequential */
  for (j = 1; j <= k; j++)    /* in parallel */
    y[i] += w[j] * x[i+j-1];
21. FIR Attempt 2 (cont.)
for (i = 1; i <= n; i++)      /* sequential */
  for (j = 1; j <= k; j++)    /* in parallel */
    y[i] += w[j] * x[i+j-1];
22. FIR Attempt 2 (cont.)
for (i = 1; i <= n; i++)      /* sequential */
  for (j = 1; j <= k; j++)    /* in parallel */
    y[i] += w[j] * x[i+j-1];
23. FIR Attempt 2 (cont.)
for (i = 1; i <= n; i++)      /* sequential */
  for (j = 1; j <= k; j++)    /* in parallel */
    y[i] += w[j] * x[i+j-1];
24. FIR Attempt 2 (cont.)
- Retime to eliminate broadcast
for (i = 1; i <= n; i++)      /* sequential */
  for (j = 1; j <= k; j++)    /* in parallel */
    y[i] += w[j] * x[i+j-1];
25. FIR Summary
- Sequential:
- Memory bandwidth per output: 2k + 1
- O(k) cycles per output
- O(1) hardware
- Systolic:
- Memory bandwidth per output: 2
- O(1) cycles per output
- O(k) hardware
[Figure: four-tap systolic FIR — x_i streams through multiplier cells holding w1–w4 and the products accumulate into y_i]
26. Example: Matrix-Vector Product
for (i = 1; i <= n; i++)
  for (j = 1; j <= n; j++)
    y[i] += a[i][j] * x[j];
27. Matrix-Vector Product (cont.)
[Figure: data flow for a linear systolic array — the a_ij enter skewed in time (a11 at t=1; a21, a12 at t=2; a31, a22, a13 at t=3; a41, a32, a23, a14 at t=4; …) against the stream x1 … xn; outputs emerge as y1 at t=n, y2 at t=n+1, y3 at t=n+2, y4 at t=n+3]
28. Banded Matrix-Vector Product

[Figure: matrix A with a band of width p + q − 1]

for (i = 1; i <= n; i++)
  for (j = 1; j <= p+q-1; j++)
    y[i] += a[i][i+j-q] * x[i+j-q];
29. Banded Matrix-Vector Product (cont.)
[Figure: linear array of cells with ports a_in, x_in/x_out, y_in/y_out; each cell computes y_out = y_in + a_in · x_in; the band diagonals (a11; a12, a21; a22, a31; a23, a32; …) and the inputs x1, x2, x3, … enter the array on successive cycles]
30. Banded Matrix Multiplication
31. Banded Matrix Multiplication (cont.)
[Figure: hexagonal array — each cell computes c_out = c_in + a_in · b_in and passes a and b through; the a, b, and c bands flow through the array in three directions, with c11 emerging at t=4, c21 and c12 at t=5, c31 and c32 at t=6, and c41, c22, c14 at t=7]
32. Summary
- Systolic structures are good for computation-bound problems
- Model costs in VLSI systems:
- Minimize the number of memory accesses
- Emphasize local interconnections (long wires are bad)
- Candidate algorithms:
- Make multiple use of input data (e.g., n inputs, O(n³) computations)
- Concurrency
- Simple control flow, simple processing elements