Systolic Architectures: Why is RC fast - PowerPoint PPT Presentation

Transcript and Presenter's Notes



1
Systolic Architectures: Why is RC fast?
  • Greg Stitt
  • ECE Department
  • University of Florida

2
Why are microprocessors slow?
  • Von Neumann architecture
  • Stored-program machine
  • Memory for instructions (and data)

3
Von Neumann architecture
  • Summary
  • 1) Fetch instruction
  • 2) Decode instruction, fetch data
  • 3) Execute
  • 4) Store results
  • 5) Repeat from 1 until end of program
  • Problem
  • Inherently sequential
  • Only executes one instruction at a time
  • Does not take into consideration parallelism of
    application

4
Von Neumann architecture
  • Problem 2: Von Neumann bottleneck
  • Constantly reading/writing data for every
    instruction requires high memory bandwidth
  • Performance limited by bandwidth of memory

[Diagram: Control + Datapath connected to RAM; bandwidth not sufficient -> Von Neumann bottleneck]
5
Improvements
  • Increase resources in datapath to execute
    multiple instructions in parallel
  • VLIW - very long instruction word
  • Compiler encodes parallelism into very-long
    instructions
  • Superscalar
  • Architecture determines parallelism at run time -
    out-of-order instruction execution
  • Von Neumann bottleneck still problem

[Diagram: Control with multiple Datapaths ( . . . ) sharing one RAM]
6
Why is RC fast?
  • RC implements custom circuits for an application
  • Circuits can exploit massive amount of
    parallelism
  • VLIW/Superscalar Parallelism
  • 5 instructions/cycle in best case (rarely occurs)
  • RC
  • Potentially thousands
  • As many ops as will fit in device
  • Also, supports different types of parallelism

7
Types of Parallelism
  • Bit-level

C Code for Bit Reversal
x = (x >> 16) | (x << 16);
x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);
8
Types of Parallelism
  • Arithmetic-level Parallelism

C Code
for (i=0; i < 128; i++) y += c[i] * x[i];
Circuit
[128 multipliers feeding a tree of adders]
  • Circuit: 7 cycles
  • Speedup > 100x for same clock
  • uP: 1000s of instructions, several thousand cycles

9
Types of Parallelism
  • Pipeline Parallelism

for (j=0; j < n; j++) {
  for (i=0; i < 128; i++)
    y += c[i] * x[i];
  // output y; y = 0;
}
Start new inner loop every cycle
After filling up pipeline, performs 128 mults +
127 adds every cycle
10
Types of Parallelism
  • Task-level Parallelism
  • e.g. MPEG-2
  • Execute each task
  • in parallel

11
How to exploit parallelism?
  • General Idea
  • Identify tasks
  • Create circuit for each task
  • Communication between tasks with buffers
  • How to create circuit for each task?
  • Want to exploit bit-level, arithmetic-level, and
    pipeline-level parallelism
  • Solution: systolic architectures (a.k.a. systolic
    arrays / systolic computing)

12
Systolic Architectures
  • Systolic definition
  • The rhythmic contraction of the heart, especially
    of the ventricles, by which blood is driven
    through the aorta and pulmonary artery after each
    dilation or diastole.
  • Analogy with heart pumping blood
  • We want architecture that pumps data through
    efficiently.
  • "Data flows from memory in a rhythmic fashion,
    passing through many processing elements before
    it returns to memory." [Hung]

13
Systolic Architecture
  • General Idea Fully pipelined circuit, with I/O
    at top and bottom level
  • Local connections - each element communicates
    with elements at same level or level below

Inputs arrive each cycle
Outputs depart each cycle, after
pipeline is full
14
Systolic Architecture
  • Simple Example
  • Create DFG (data flow graph) for body of loop
  • Represent data dependencies of code

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
[DFG: b[i] and b[i+1] enter the first adder; its sum and b[i+2] enter the second adder, producing a[i]]
15
Simple Example
  • Add pipeline stages to each level of DFG

[Same DFG with a pipeline register inserted after each adder level]
16
Simple Example
  • Allocate one resource (adder, ALU, etc) for each
    operation in DFG
  • Resulting systolic architecture

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
[Cycle 1: b[0], b[1], b[2] enter the array]

17
Simple Example
  • Allocate one resource for each operation in DFG
  • Resulting systolic architecture

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
[Cycle 2: b[1], b[2], b[3] enter; stage 1 holds b[0]+b[1] and b[2]]

18
Simple Example
  • Allocate one resource for each operation in DFG
  • Resulting systolic architecture

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
[Cycle 3: b[2], b[3], b[4] enter; stage 1 holds b[1]+b[2] and b[3]; stage 2 holds b[0]+b[1]+b[2]]
19
Simple Example
  • Allocate one resource for each operation in DFG
  • Resulting systolic architecture

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
[Cycle 4: b[3], b[4], b[5] enter; stage 1 holds b[2]+b[3] and b[4]; stage 2 holds b[1]+b[2]+b[3]]
First output a[0] appears; takes 4 cycles to fill pipeline
20
Simple Example
  • Allocate one resource for each operation in DFG
  • Resulting systolic architecture

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
[Cycle 5: b[4], b[5], b[6] enter; stage 1 holds b[3]+b[4] and b[5]; stage 2 holds b[2]+b[3]+b[4]; output a[1] appears]
Total cycles = 4 (init) + 99 = 103
One output per cycle at this point, 99 more until completion
21
uP Performance Comparison
  • Assumptions
  • 10 instructions for loop body
  • CPI = 1.5
  • Clk 10x faster than FPGA
  • Total SW cycles
  • 100 × 10 × 1.5 = 1,500 cycles
  • RC Speedup
  • (1500/103) × (1/10) = 1.46x

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
22
uP Performance Comparison
  • What if uP clock is 15x faster?
  • e.g. 3 GHz vs. 200 MHz
  • RC Speedup
  • (1500/103) × (1/15) = 0.97x
  • RC is slightly slower
  • But!
  • RC requires much less power
  • Several Watts vs 100 Watts
  • SW may be practical for embedded uPs => low power
  • Clock may be just 2x faster
  • (1500/103) × (1/2) = 7.3x faster
  • RC may be cheaper
  • Depends on area needed
  • This example would certainly be cheaper

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
23
Simple Example, Cont.
  • Improvement to systolic array
  • Why not execute multiple iterations at same time?
  • No data dependencies
  • Loop unrolling

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
Unrolled DFG
[Two copies of the DFG side by side: b[i], b[i+1], b[i+2] -> a[i] and b[i+1], b[i+2], b[i+3] -> a[i+1], repeated . . .]
24
Simple Example, Cont.
  • How much to unroll?
  • Limited by memory bandwidth and area

Must get all inputs once per cycle
[Unrolled DFG: inputs b[i]..b[i+3] across the top, outputs a[i], a[i+1] at the bottom]
Must write all outputs once per cycle
Must be sufficient area for all ops in DFG
25
Unrolling Example
  • Original circuit

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
1st iteration requires 3 inputs
[b[0], b[1], b[2] -> adders -> a[0]]
26
Unrolling Example
  • Original circuit

Each unrolled iteration requires one additional
input
for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
[b[0]..b[3] -> two overlapping DFG copies -> a[0], a[1]]
27
Unrolling Example
  • Original circuit

Each cycle brings in 4 inputs (instead of 6)
for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
[Cycle 2: b[1]..b[4] enter; stage 1 holds b[0]+b[1], b[2] and b[1]+b[2], b[3]]
28
Performance after unrolling
for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
  • How much unrolling?
  • Assume b elements are 8 bits
  • First iteration requires 3 elements = 24 bits
  • Each unrolled iteration requires 1 element = 8
    bits
  • Due to overlapping inputs
  • Assume memory bandwidth = 64 bits/cycle
  • Can perform 6 iterations in parallel
  • (24 + 8 + 8 + 8 + 8 + 8) = 64 bits
  • New performance
  • Unrolled systolic architecture requires
  • 4 cycles to fill pipeline + 100/6 iterations
  • = 21 cycles
  • With unrolling, RC is (1500/21) × (1/15) = 4.8x
    faster than 3 GHz microprocessor!!!

29
Importance of Memory Bandwidth
  • Performance with wider memories
  • 128-bit bus
  • 14 iterations in parallel
  • 64 extra bits / 8 bits per iteration = 8 more
    parallel iterations
  • 6 original unrolled iterations + 8 = 14 total
    parallel iterations
  • Total cycles = 4 to fill pipeline + 100/14 ≈ 11
  • Speedup = (1500/11) × (1/15) = 9.1x
  • Doubling memory width increased speedup from 4.8x
    to 9.1x!!!
  • Important Point
  • Performance of hardware often limited by memory
    bandwidth
  • More bandwidth => more unrolling => more
    parallelism => BIG SPEEDUP

30
Delay Registers
  • Common mistake
  • Forgetting to add registers for values not used
    during a cycle
  • Values delayed or passed on until needed



[Diagrams: the correct version passes a value not yet needed through a delay register at each stage; the incorrect version wires it directly to a later stage]
31
Delay Registers
  • Illustration of incorrect delays

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
[Cycle 1: b[0], b[1], b[2] enter; no delay register for b[2]]


32
Delay Registers
  • Illustration of incorrect delays

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
[Cycle 2: b[1], b[2], b[3] enter; stage 1 holds b[0]+b[1], but b[2] was never registered, so the second adder sees b[2] + ?????]
33
Delay Registers
  • Illustration of incorrect delays

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
[Cycle 3: stage 1 holds b[1]+b[2]; without the delay register the second adder computes b[0]+b[1]+b[3] instead of b[0]+b[1]+b[2]]
34
Another Example
  • Your turn
  • Steps
  • Build DFG for body of loop
  • Add pipeline stages
  • Map operations to hardware resources
  • Assume divide takes one cycle
  • Determine maximum amount of unrolling
  • Memory bandwidth 128 bits/cycle
  • Determine performance compared to uP
  • Assume 15 instructions per iteration, CPI 1.5,
    CLK 15x faster than RC

short b[1004], a[1000];
for (i=0; i < 1000; i++)
  a[i] = avg( b[i], b[i+1], b[i+2], b[i+3], b[i+4] );
35
Another Example, Cont.
  • What if divider takes 20 cycles?
  • But, fully pipelined
  • Calculate the effect on performance

In systolic architectures, performance usually
dominated by throughput of pipeline, not latency
36
Dealing with Dependencies
  • op2 is dependent on op1 when the input to op2 is
    an output from op1
  • Problem: limits arithmetic parallelism, increases
    latency
  • i.e., can't execute op2 before op1
  • Serious problem: FPGAs rely on parallelism for
    performance
  • Little parallelism => bad performance

op1
op2
37
Dealing with Dependencies
  • Partial solution
  • Parallelizing transformations
  • e.g. tree height reduction

[Before: serial chain ((a+b)+c)+d, depth = # of adders.
 After: balanced tree (a+b)+(c+d), depth = log2(# of adders)]
38
Dealing with Dependencies
  • Simple example w/ inter-iteration dependency -
    potential problem for systolic arrays
  • Cant keep pipeline full

a[0] = 0;
for (i=1; i < 8; i++) a[i] = b[i] + b[i+1] + a[i-1];
Can't execute until 1st iteration completes - limited
arithmetic parallelism, increased latency
[DFG: b[i]+b[i+1] computed freely, but the final +a[i-1] must wait for the previous iteration]
39
Dealing with Dependencies
  • But, systolic arrays also have pipeline-level
    parallelism - latency less of an issue

a[0] = 0;
for (i=1; i < 8; i++) a[i] = b[i] + b[i+1] + a[i-1];
[b[1]+b[2], b[2]+b[3], ... computed in parallel; the +a[i-1] adds form a serial chain]
40
Dealing with Dependencies
  • But, systolic arrays also have pipeline-level
    parallelism - latency less of an issue

a[0] = 0;
for (i=1; i < 8; i++) a[i] = b[i] + b[i+1] + a[i-1];
[a[1] = (b[1]+b[2]) + a[0] completes while later b[i]+b[i+1] sums wait]
41
Dealing with Dependencies
  • But, systolic arrays also have pipeline-level
    parallelism - latency less of an issue

a[0] = 0;
for (i=1; i < 8; i++) a[i] = b[i] + b[i+1] + a[i-1];
[The chain continues: a[2] = (b[2]+b[3]) + a[1], one new result per step . . .]
42
Dealing with Dependencies
  • But, systolic arrays also have pipeline-level
    parallelism - latency less of an issue

a[0] = 0;
for (i=1; i < 8; i++) a[i] = b[i] + b[i+1] + a[i-1];
[Same structure with pipeline registers between levels]
Add pipeline stages => systolic array
Only works if loop is fully unrolled! Requires sufficient memory bandwidth
Outputs not shown
43
Dealing with Dependencies
  • Your turn

char b[1006];
for (i=0; i < 1000; i++) {
  acc = 0;
  for (j=0; j < 6; j++)
    acc += b[i+j];
}
  • Steps
  • Build DFG for inner loop (note dependencies)
  • Fully unroll inner loop (check to see if memory
    bandwidth allows)
  • Assume bandwidth 64 bits/cycle
  • Add pipeline stages
  • Map operations to hardware resources
  • Determine performance compared to uP
  • Assume 15 instructions per iteration, CPI = 1.5,
    CLK 15x faster than RC

44
Dealing with Control
  • If statements

Can't wait for result of condition - stalls pipeline
char b[1006], a[1000];
for (i=0; i < 1000; i++) {
  if (i % 2 == 0) a[i] = b[i] + b[i+1];
  else a[i] = b[i+2] + b[i+3];
}
Convert control into computation - if conversion
[Circuit: b[i]+b[i+1] and b[i+2]+b[i+3] computed every cycle; i % 2 drives a MUX selecting a[i]]
45
Dealing with Control
  • If conversion, not always so easy

char b[1006], a[1000], a2[1000];
for (i=0; i < 1000; i++) {
  if (i % 2 == 0) a[i] = b[i] + b[i+1];
  else a2[i] = b[i+2] + b[i+3];
}
[Circuit: both sums computed every cycle; i % 2 drives two MUXes selecting what is written to a[i] and to a2[i]]
46
Other Challenges
  • Outputs can also limit unrolling
  • Example
  • 4 outputs, 1 input
  • Each output 32 bits
  • Total output bandwidth for 1 iteration 128 bits
  • Memory bus 128 bits
  • Can't unroll, even though inputs only use 32 bits

long b[1004], a[1000];
for (i=0, j=0; i < 1000; i+=4, j++) {
  a[i]   = b[j] + 10;
  a[i+1] = b[j] + 23;
  a[i+2] = b[j] - 12;
  a[i+3] = b[j] * b[j];
}
47
Other Challenges
  • Requires streaming data to work well
  • Systolic array
  • But, pipelining is wasted because the data stream
    is small
  • Point - systolic arrays work best with repeated
    computation

for (i=0; i < 4; i++) a[i] = b[i] + b[i+1];
[Array sees only b[0]..b[4]: the pipeline barely fills before the stream ends, producing a[0]..a[3]]
48
Other Challenges
  • Memory bandwidth
  • Values so far are peak values
  • Can only be achieved if all input data stored
    sequentially in memory
  • Often not the case
  • Example
  • Two-dimensional arrays

long a[100][100], b[100][100];
for (i=1; i < 100; i++)
  for (j=1; j < 100; j++)
    a[i][j] = avg( b[i-1][j], b[i][j-1], b[i+1][j], b[i][j+1] );
49
Other Challenges
  • Memory bandwidth, cont.
  • Example 2
  • Multiple array inputs
  • b and c stored in different locations
  • Memory accesses may jump back and forth
  • Possible solutions
  • Use multiple memories, or multiported memory
    (high cost)
  • Interleave data from b and c in memory
    (programming effort)
  • If no compiler support, requires manual rewrite

long a[100], b[100], c[100];
for (i=0; i < 100; i++) a[i] = b[i] + c[i];
50
Other Challenges
  • Dynamic memory access patterns
  • Sequence of addresses not known until run time
  • Clearly, not sequential
  • Possible solution
  • Something creative enough for a Ph.D. thesis

int f( int val ) {
  long a[100], b[100], c[100];
  for (i=0; i < 100; i++)
    a[i] = b[rand()%100] + c[i] * val;
}
51
Other Challenges
  • Pointer-based data structures
  • Even if scanning through list, data could be all
    over memory
  • Very unlikely to be sequential
  • Can cause aliasing problems
  • Greatly limits optimization potential
  • Solutions are another Ph.D.
  • Pointers ok if used as array

int f( int val ) {
  long a[100], b[100];
  for (i=0; i < 100; i++)
    a[i] = b[i] + 1;
}

equivalent to

int f( int val ) {
  long a[100], b[100];
  long *p = b;
  for (i=0; i < 100; i++, p++)
    a[i] = *p + 1;
}
52
Other Challenges
  • Not all code is just one loop
  • Yet another Ph.D.
  • Main point to remember
  • Systolic arrays are extremely fast, but only
    certain types of code work
  • What can we do instead of systolic arrays?

53
Other Options
  • Try something completely different
  • Try slight variation
  • Example - 3 inputs, but can only read 2 per cycle

Not possible - can only read two inputs per cycle
for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
[Original two-adder array needs 3 reads per cycle]
54
Variations
  • Example, cont.
  • Break previous rules - use extra delay registers

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
[Variation: b[i] and b[i+1] feed the first adder; b[i+2] passes through an extra delay register]
55
Variations
  • Example, cont.
  • Break previous rules - use extra delay registers

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
[Cycle 1: b[0], b[1] read; remaining input is junk]
56
Variations
  • Example, cont.
  • Break previous rules - use extra delay registers

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
[Cycle 2: b[2] read; stage 1 holds b[0]+b[1] and junk]
57
Variations
  • Example, cont.
  • Break previous rules - use extra delay registers

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
[Cycle 3: b[1], b[2] read; second adder receives b[0]+b[1] and the delayed b[2]; junk elsewhere]
58
Variations
  • Example, cont.
  • Break previous rules - use extra delay registers

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
[Cycle 4: b[3] read; stage 2 holds b[0]+b[1]+b[2]; junk in alternate slots]
59
Variations
  • Example, cont.
  • Break previous rules - use extra delay registers

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
[Cycle 5: b[2], b[3] read; stage 2 holds b[1]+b[2]]
First output a[0] after 5 cycles
60
Variations
  • Example, cont.
  • Break previous rules - use extra delay registers

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
[Cycle 6: b[4] read; stage 2 holds b[1]+b[2]+b[3]]
Junk on next cycle
61
Variations
  • Example, cont.
  • Break previous rules - use extra delay registers

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
[Cycle 7: b[3], b[4] read; stage 2 holds b[2]+b[3]]
Second output a[1], 2 cycles later
Valid output every 2 cycles - approximately 1/2 the performance
62
Entire Circuit
[Diagram: RAM -> Input Address Generator -> Buffer -> Datapath (with Controller) -> Buffer -> Output Address Generator -> RAM]
Buffers handle differences in speed between RAM and datapath