CprE / ComS 583 Reconfigurable Computing

1
CprE / ComS 583 Reconfigurable Computing
Prof. Joseph Zambreno
Department of Electrical and Computer Engineering
Iowa State University
Lecture 12: Systolic Computing
2
Recap: Multi-FPGA Systems
  • Crossbar topology
  • Devices A-D are used only for routing
  • Gives predictable performance
  • Potential waste of resources for near-neighbor
    connections

3
Recap: Logic Emulation
  • Emulation takes a sizable amount of resources
  • Compilation time can be large due to the many
    FPGA compiles required

4
Recap: Virtual Wires
  • Overcome pin limitations by multiplexing pins and
    signals
  • Schedule when communication will take place
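The sketch below shows the pin-multiplexing idea in its simplest form. It maps S logical signals onto P physical pins, which takes ceil(S/P) time slots per emulation cycle. It assumes that every signal crossing the FPGA boundary is ready at the start of the emulation cycle and that slots are handed out round-robin. A real virtual-wire scheduler must also respect dependencies between signals. The name `schedule_virtual_wires` is hypothetical.

```python
import math


def schedule_virtual_wires(signals, num_pins):
    """Assign each logical signal a (pin, time slot) pair, round-robin.

    Simplifying assumption: all signals are available at the start of the
    emulation cycle, so no dependency ordering is needed.
    """
    slots = math.ceil(len(signals) / num_pins)  # virtual clocks per emulation cycle
    schedule = {}
    for idx, sig in enumerate(signals):
        # Signals sharing a pin are separated in time by their slot number
        schedule[sig] = (idx % num_pins, idx // num_pins)
    return slots, schedule


slots, sched = schedule_virtual_wires([f"s{i}" for i in range(10)], 4)
print(slots, sched["s9"])
```

With 10 signals and 4 pins this needs 3 slots, and signal s9 travels on pin 1 during slot 2.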

5
Outline
  • Recap
  • Introduction and Motivation
  • Common Systolic Structures
  • Algorithmic Mapping
  • Mapping Examples
    • Finite impulse response
    • Matrix-vector product
    • Banded matrix-vector product
    • Banded matrix multiplication

6
Systolic Computing
  • systole (sĭs′tə-lē) n. The rhythmic
    contraction of the heart, especially of the
    ventricles, by which blood is driven through the
    aorta and pulmonary artery after each dilation or
    diastole
    • From Greek systolē, from systellein, to contract;
      from syn- + stellein, to send
  • systolic (sĭs-tŏl′ĭk) adj.
    • Data flows from memory in a rhythmic fashion,
      passing through many processing elements before
      it returns to memory (Kung, 1982)

7
Systolic Architectures
  • Goal: a general methodology for mapping
    computations into hardware (spatial computing)
    structures
  • Composition:
    • Simple compute cells (e.g., add, sub, max, min)
    • Regular interconnect pattern
    • Pipelined communication between cells
    • I/O at boundaries

8
Motivation
  • Effectively utilize VLSI
  • Reduce the von Neumann bottleneck
  • Target compute-intensive applications
  • Reduce design cost
    • Simplicity
    • Regularity
  • Exploit concurrency
  • Local communication
    • Short wires (small delay, less area)
  • Scalable

9
Why Study?
  • Original motivation: a specialized accelerator for
    an application
  • The model and goals are a close match to
    reconfigurable computing
    • Target algorithms match
    • Well-developed theory, techniques, and solutions
  • One big difference: Kung's approach targeted
    custom silicon (not a reconfigurable fabric)
    • Compute elements needed to be more general

10
Common Systolic Structures
  • One-dimensional linear array
  • Two-dimensional mesh

11
Hexagonal Array
Squared-up representation
  • Communicates with six nearest neighbors

12
Binary Tree
  • H-Tree Representation

13
Mapping Approach
  • Allocate PEs
  • Schedule computation
    • Schedule PEs
    • Schedule data flow
  • Optimize
    • Available transformations:
      • Preload repeated values
      • Replace feedback loops with registers
      • Internalize data flow
      • Broadcast common input

14
Example Finite Impulse Response
  • A Finite Impulse Response (FIR) filter is a type
    of digital filter
  • Finite: its response to an impulse settles to zero
    after a finite number of samples (k samples for
    k weights)
  • Requires no feedback

for (i = 1; i <= n; i++)
  for (j = 1; j <= k; j++)
    y[i] += w[j] * x[i+j-1];
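Here is the loop above as runnable Python. It uses 0-based indices, so y[i] = w[0]·x[i] + ... + w[k-1]·x[i+k-1], and it also counts memory accesses. The count assumes the partial sum is kept in a register and written once per output. Under that assumption, each output costs k weight reads, k input reads and 1 write, which is the 2k + 1 figure in the FIR summary. The name `fir_sequential` is hypothetical.

```python
def fir_sequential(w, x):
    """Sequential FIR: y[i] = sum_j w[j] * x[i + j] (0-based), counting memory accesses."""
    k = len(w)
    n = len(x) - k + 1          # number of complete outputs
    y = [0] * n
    accesses = 0
    for i in range(n):
        acc = 0                  # partial sum held in a register
        for j in range(k):
            acc += w[j] * x[i + j]
            accesses += 2        # one weight read + one input read
        y[i] = acc
        accesses += 1            # one output write
    return y, accesses


print(fir_sequential([1, 2, 3], [1, 2, 3, 4, 5]))
```

For k = 3 and three outputs this gives y = [14, 20, 26] and 3 · (2·3 + 1) = 21 accesses.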
15
FIR Attempt 1
  • Parallelize the outer loop

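for (i = 1; i <= n; i++)        // in parallel
  for (j = 1; j <= k; j++)      // sequential
    y[i] += w[j] * x[i+j-1];
[Figure: n PEs, one per output. At step j every PE receives w[j]; PE i also receives x[i+j-1] and updates its accumulator y[i].]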
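The sketch below runs Attempt 1 in lock step, using 0-based Python. The inner loop over i stands for the n PEs working concurrently. Each pass of the outer loop is one step in which w[j] is broadcast to every PE. The design finishes in k steps using n PEs, but every step needs n different x values from memory. The name `fir_outer_parallel` is hypothetical.

```python
def fir_outer_parallel(w, x):
    """Attempt 1: one PE per output y[i]; the k products are accumulated sequentially."""
    k = len(w)
    n = len(x) - k + 1
    y = [0] * n               # y[i] lives in PE i's accumulator
    steps = 0
    for j in range(k):        # sequential: one step per weight
        wj = w[j]             # w[j] is broadcast to all n PEs
        for i in range(n):    # conceptually, all n PEs do this at once
            y[i] += wj * x[i + j]
        steps += 1
    return y, steps


print(fir_outer_parallel([1, 2, 3], [1, 2, 3, 4, 5]))
```

At step j, PE i needs x[i+j], which is exactly the value PE i+1 used at step j-1. The x values can therefore be passed from neighbor to neighbor instead of being refetched. This kind of reuse is what retiming turns into local communication.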
16
FIR Attempt 1 (cont.)
  • Broadcast common inputs

17
FIR Attempt 1 (cont.)
  • Retime to eliminate broadcast

18
FIR Attempt 1 (cont.)
  • Broadcast common values

19
FIR Attempt 2
  • Parallelize the inner loop

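for (i = 1; i <= n; i++)        // sequential
  for (j = 1; j <= k; j++)      // in parallel
    y[i] += w[j] * x[i+j-1];
[Figure: k PEs, one per weight. PE j holds w[j], receives x[i+j-1], and adds w[j] * x[i+j-1] to the partial sum y[i] passing through it.]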
20
FIR Attempt 2 (cont.)
  • Internalize data flow

21
FIR Attempt 2 (cont.)
  • Allocation schedule

22
FIR Attempt 2 (cont.)
  • Preload repeated values

23
FIR Attempt 2 (cont.)
  • Broadcast common values

24
FIR Attempt 2 (cont.)
  • Retime to eliminate broadcast

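The retimed figure cannot be recovered, so here is a sketch of one well-known broadcast-free design. It is Kung's "weights stay, inputs and results move in the same direction at different speeds" variant, which is assumed here rather than read off the slide. Cell c holds w[k-1-c]. Each partial sum advances one cell per cycle, and each x advances one cell every two cycles through two registers per cell. In steady state every cycle reads one x and writes one finished y. That gives the FIR summary's memory bandwidth of 2 per output and O(1) cycles per output, using O(k) cells. Indices are 0-based, and `systolic_fir` is a hypothetical name.

```python
def systolic_fir(w, x):
    """Cycle-level sketch of a weights-stay systolic FIR: y[i] = sum_j w[j] * x[i + j]."""
    k = len(w)
    n = len(x) - k + 1
    weight = [w[k - 1 - c] for c in range(k)]  # preloaded: cell c holds w[k-1-c]
    xa = [None] * k  # x register feeding each cell's multiplier
    xb = [None] * k  # extra delay register: x moves one cell every 2 cycles
    yr = [None] * k  # (i, partial y[i]); moves one cell per cycle
    y = [None] * n
    for t in range(n + 2 * k - 1):
        # Boundary I/O: one new x enters, one new partial sum starts at zero
        xa[0] = x[t] if t < len(x) else None
        i_new = t - (k - 1)
        yr[0] = (i_new, 0) if 0 <= i_new < n else None
        # All cells multiply-accumulate in parallel
        for c in range(k):
            if yr[c] is not None:
                i, acc = yr[c]
                yr[c] = (i, acc + weight[c] * xa[c])
        # The partial sum in the last cell is now complete
        if yr[k - 1] is not None:
            i, acc = yr[k - 1]
            y[i] = acc
        # Clock edge: x goes xa[c] -> xb[c] -> xa[c+1]; y goes yr[c] -> yr[c+1]
        xa, xb = [None] + xb[:-1], xa[:]
        yr = [None] + yr[:-1]
    return y


print(systolic_fir([1, 2, 3], [1, 2, 3, 4, 5]))
```

The partial sum y[i] reaches cell c in the same cycle as x[i+k-1-c], so it collects w[k-1-c] · x[i+k-1-c] for every c. Those are exactly the k terms of the FIR sum. This prints [14, 20, 26], matching the sequential loop.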
25
FIR Summary
  • Sequential
    • Memory bandwidth per output: 2k + 1
    • O(k) cycles per output
    • O(1) hardware
  • Systolic
    • Memory bandwidth per output: 2
    • O(1) cycles per output
    • O(k) hardware

[Figure: systolic FIR for k = 4. x_i streams in at one end, four cells hold w1, w2, w3, w4, and y_i streams out.]
26
Example Matrix-Vector Product
for (i = 1; i <= n; i++)
  for (j = 1; j <= n; j++)
    y[i] += a[i][j] * x[j];
27
Matrix-Vector Product (cont.)
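[Figure: linear systolic array. The inputs x1, x2, x3, x4, ..., xn enter in order, and the matrix entries arrive skewed, one anti-diagonal per cycle:
  t = 1: a11
  t = 2: a21, a12
  t = 3: a31, a22, a13
  t = 4: a41, a32, a23, a14
Results emerge one per cycle: y1 at t = n, y2 at t = n+1, y3 at t = n+2, y4 at t = n+3.]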
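The figure is lost, so this cycle-level sketch assumes one array consistent with the schedule above: PE i keeps y[i] in place while the x values shift one PE per cycle. With 0-based indices, a[i][j] reaches PE i at cycle i + j, which is the same anti-diagonal skew. PE i finishes at cycle i + n - 1, which matches the slide's y1 at t = n, y2 at t = n+1, and so on, once cycles are counted from 1. The name `systolic_matvec` is hypothetical.

```python
def systolic_matvec(A, x):
    """Linear array of n PEs: PE i accumulates y[i]; x shifts right one PE per cycle."""
    n = len(x)
    x_reg = [None] * n     # x value currently held by each PE
    y = [0] * n            # accumulator inside each PE
    finish = [None] * n    # cycle (0-based) at which PE i completes y[i]
    for t in range(2 * n - 1):
        # Clock edge: every x moves one PE to the right; x[t] enters PE 0
        x_reg = [x[t] if t < n else None] + x_reg[:-1]
        for i in range(n):            # all PEs work in parallel
            j = t - i                 # skewed feed: a[i][j] reaches PE i at cycle i + j
            if 0 <= j < n:
                y[i] += A[i][j] * x_reg[i]   # x_reg[i] holds x[j] in this cycle
                if j == n - 1:
                    finish[i] = t
    return y, finish


print(systolic_matvec([[1, 2], [3, 4]], [5, 6]))
```

This prints ([17, 39], [1, 2]): y[0] is ready one cycle before y[1], as in the staggered output schedule.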
28
Banded Matrix-Vector Product
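[Figure: band matrix. The nonzero band covers q diagonals at and below the main diagonal and p diagonals at and above it, for a band width of w = p + q - 1.]
for (i = 1; i <= n; i++)
  for (j = 1; j <= p+q-1; j++)
    y[i] += a[i][j-q+i] * x[j-q+i];   // terms with an index outside 1..n are zero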
29
Banded Matrix-Vector Product (cont.)
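[Figure: linear systolic array for the banded product. Each cell receives a_in from above; x values move one way (x_in → x_out) while partial sums move the other way (y_in → y_out), and each cell computes y_out = y_in + a_in · x_in. Matrix entries enter one anti-diagonal per cycle:
  t = 2: a11
  t = 3: a12, a21
  t = 4: a22, a31
  t = 5: a23, a32
The inputs x1, x2, x3 enter at t = 1, 3, 5.]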
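Below is a cycle-level sketch of a bidirectional array for the banded product. Its details are assumed:
- indices are 0-based, and row i uses columns i-q+1 through i+p-1;
- the array has w = p + q - 1 cells;
- x values move right while partial sums move left;
- both streams are injected every other cycle, as with x1, x2, x3 above.

The injection cycles (x[j] at cycle 2j + q, y[i] at cycle 2i + p) are chosen so that y[i] and x[j] share cell i - j + p - 1 in exactly one cycle. The slide's absolute cycle numbers may differ. Both function names are hypothetical.

```python
def banded_matvec_reference(A, x, p, q):
    """Direct loop: row i uses columns i-q+1 .. i+p-1 (0-based, clipped to the matrix)."""
    n = len(x)
    return [sum(A[i][j] * x[j] for j in range(max(0, i - q + 1), min(n, i + p)))
            for i in range(n)]


def systolic_banded_matvec(A, x, p, q):
    """Cycle-level sketch: w = p + q - 1 cells, x moves right, partial y moves left."""
    n = len(x)
    w = p + q - 1
    x_reg = [None] * w   # (j, x[j]) pairs, one cell right per cycle
    y_reg = [None] * w   # (i, partial y[i]) pairs, one cell left per cycle
    y = [None] * n
    for t in range(2 * n + p + q + w):
        # Inputs arrive every other cycle so that each x/y pair meets inside a cell
        if t >= q and (t - q) % 2 == 0 and (t - q) // 2 < n:
            j = (t - q) // 2
            x_reg[0] = (j, x[j])
        if t >= p and (t - p) % 2 == 0 and (t - p) // 2 < n:
            y_reg[w - 1] = ((t - p) // 2, 0)
        # Each cell holding both an x and a y computes y_out = y_in + a_in * x_in,
        # where a_in = A[i][j] is the entry fed to that cell in this cycle
        for c in range(w):
            if x_reg[c] is not None and y_reg[c] is not None:
                j, xv = x_reg[c]
                i, yv = y_reg[c]
                y_reg[c] = (i, yv + A[i][j] * xv)
        # A partial sum leaving the leftmost cell is a finished y[i]
        if y_reg[0] is not None:
            i, yv = y_reg[0]
            y[i] = yv
        # Clock edge: x shifts right, y shifts left
        x_reg = [None] + x_reg[:-1]
        y_reg = y_reg[1:] + [None]
    return y
```

The meeting cell i - j + p - 1 lies inside the array exactly when a[i][j] is in the band. All occupied cells at a given cycle share the same parity, so an x and a y can never swap places without meeting. Because of this, every band term is accumulated once and no out-of-band entry is read.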
30
Banded Matrix Multiplication
31
Banded Matrix Multiplication (cont.)
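[Figure: hexagonal systolic array for banded matrix multiplication C = A · B. Each cell F receives a_in, b_in, and c_in, computes c_out = c_in + a_in · b_in, and passes a_out and b_out on unchanged. Entries of A (a11, a12, a13, a21, a31, ...) and of B enter from two sides while the c_ij flow along the third direction, starting with c11 at t = 4, then c21 and c12 at t = 5.]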
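The hexagonal figure cannot be recovered. As a stand-in, here is a sketch of one linear space-time mapping of the kind used to derive such arrays. The mapping itself, cell = (j - k, i - k) and cycle = i + j + k, is an assumption and was not read off the slide. Under it, each a_ik moves one cell in direction (+1, 0) per cycle, b_kj moves (0, +1), and c_ij moves (-1, -1), giving three directions of a hexagonal neighborhood. The code does three things:
- it builds the schedule for banded A and B and asserts that no cell performs two operations in one cycle;
- it replays the schedule to check that c_ij hops (-1, -1) between updates;
- it returns C, which can be compared against an ordinary product.

For bands of widths w_A and w_B it uses at most w_A · w_B cells, each busy only every third cycle. The function names are hypothetical.

```python
def hex_schedule(n, a_band, b_band):
    """Map each multiply-accumulate (i, j, k) of C = A * B to a (cell, cycle) slot.

    a_band = (lo, hi): A[i][k] may be nonzero only when -lo <= k - i <= hi.
    b_band = (lo, hi): B[k][j] may be nonzero only when -lo <= j - k <= hi.
    Assumed mapping: cell = (j - k, i - k), cycle = i + j + k.
    """
    ops = {}
    for i in range(n):
        for k in range(n):
            if not -a_band[0] <= k - i <= a_band[1]:
                continue
            for j in range(n):
                if not -b_band[0] <= j - k <= b_band[1]:
                    continue
                slot = ((j - k, i - k), i + j + k)
                assert slot not in ops, "two operations in one cell in one cycle"
                ops[slot] = (i, j, k)
    return ops


def run_schedule(ops, A, B, n):
    """Replay the schedule in cycle order, checking how each c_ij travels."""
    C = [[0] * n for _ in range(n)]
    last = {}  # (i, j) -> (cell, cycle) of the previous update of c_ij
    for (cell, t), (i, j, k) in sorted(ops.items(), key=lambda item: item[0][1]):
        if (i, j) in last:
            (pu, pv), pt = last[(i, j)]
            # c_ij must hop to neighbor (-1, -1) in the very next cycle
            assert cell == (pu - 1, pv - 1) and t == pt + 1
        C[i][j] += A[i][k] * B[k][j]   # cell function: c_out = c_in + a_in * b_in
        last[(i, j)] = (cell, t)
    return C
```

With tridiagonal A and B (both bands of width 3), the schedule occupies a 3 × 3 = 9-cell array.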
32
Summary
  • Systolic structures are good for
    computation-bound problems
  • They model costs in VLSI systems:
    • Minimize the number of memory accesses
    • Emphasize local interconnections (long wires are
      bad)
  • Candidate algorithms have:
    • Multiple uses of input data (e.g., n inputs,
      O(n³) computations)
    • Concurrency
    • Simple control flow and simple processing elements