Title: Systolic Architectures: Why is RC fast?
1. Systolic Architectures: Why is RC fast?
- Greg Stitt
- ECE Department
- University of Florida
2. Why are microprocessors slow?
- Von Neumann architecture
- Stored-program machine
- Memory for instructions (and data)
3. Von Neumann architecture
- Summary
- 1) Fetch instruction
- 2) Decode instruction, fetch data
- 3) Execute
- 4) Store results
- 5) Repeat from 1 until end of program
- Problem
- Inherently sequential
- Only executes one instruction at a time
- Does not take the application's parallelism into consideration
4. Von Neumann architecture
- Problem 2: the Von Neumann bottleneck
- Constantly reading/writing data for every instruction requires high memory bandwidth
- Performance limited by the bandwidth of memory
[Figure: RAM feeding Control and Datapath; bus bandwidth not sufficient - the Von Neumann bottleneck]
5. Improvements
- Increase resources in the datapath to execute multiple instructions in parallel
- VLIW - very long instruction word
- Compiler encodes parallelism into very-long instructions
- Superscalar
- Architecture determines parallelism at run time - out-of-order instruction execution
- Von Neumann bottleneck is still a problem
[Figure: RAM feeding one Control unit and multiple parallel Datapaths]
6. Why is RC fast?
- RC implements custom circuits for an application
- Circuits can exploit a massive amount of parallelism
- VLIW/Superscalar parallelism
- ~5 instructions/cycle in the best case (rarely occurs)
- RC
- Potentially thousands
- As many ops as will fit in the device
- Also supports different types of parallelism
7. Types of Parallelism
- Bit-level Parallelism
C Code for Bit Reversal:
x = (x >> 16) | (x << 16);
x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);
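The lines above form the classic five-step 32-bit bit reversal; every mask/shift pair is independent, so hardware can evaluate them with pure wiring. A self-contained, runnable version (the function name `reverse_bits` is mine):

```c
#include <stdint.h>

/* Reverse the 32 bits of x by swapping progressively smaller groups:
   halves, then bytes, nibbles, bit pairs, and finally single bits. */
uint32_t reverse_bits(uint32_t x)
{
    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ffu) | ((x << 8) & 0xff00ff00u);
    x = ((x >> 4) & 0x0f0f0f0fu) | ((x << 4) & 0xf0f0f0f0u);
    x = ((x >> 2) & 0x33333333u) | ((x << 2) & 0xccccccccu);
    x = ((x >> 1) & 0x55555555u) | ((x << 1) & 0xaaaaaaaau);
    return x;
}
```

In an FPGA the same computation is just a reordering of wires, which is exactly the bit-level parallelism the slide is pointing at.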
8. Types of Parallelism
- Arithmetic-level Parallelism
C Code:
for (i=0; i < 128; i++)
    y += c[i] * x[i];
- 1000s of instructions
- Several thousand cycles
[Circuit: 128 parallel multipliers feeding a balanced adder tree]
- 7 cycles
- Speedup > 100x for the same clock
9. Types of Parallelism
- Pipeline-level Parallelism
for (j=0; j < n; j++) {
    for (i=0; i < 128; i++)
        y += c[i] * x[i];
    // output y; y = 0;
}
- Start a new inner loop every cycle
- After filling up the pipeline, performs 128 mults + 127 adds every cycle
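One plausible software rendering of this pattern is a 128-tap sliding inner product (a sketch of mine; I assume the window slides over `x` by the outer index, which is what makes "a new inner loop every cycle" meaningful in hardware):

```c
#define TAPS 128

/* Sliding-window inner product: each outer iteration produces one
   output from TAPS multiplies and TAPS-1 adds. In a systolic pipeline,
   a new window enters every cycle once the pipeline is full. */
void fir(const int *c, const int *x, int *y, int n_out)
{
    for (int j = 0; j < n_out; j++) {
        int acc = 0;
        for (int i = 0; i < TAPS; i++)
            acc += c[i] * x[j + i];   /* 128 mults + 127 adds per output */
        y[j] = acc;
    }
}
```

In software the inner loop takes thousands of cycles per output; the pipelined circuit retires one complete inner loop per clock.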
10. Types of Parallelism
- Task-level Parallelism
- e.g. MPEG-2
- Execute each task in parallel
11. How to exploit parallelism?
- General Idea
- Identify tasks
- Create a circuit for each task
- Communicate between tasks with buffers
- How to create the circuit for each task?
- Want to exploit bit-level, arithmetic-level, and pipeline-level parallelism
- Solution: systolic architectures (arrays/computing)
12. Systolic Architectures
- Systolic, definition:
- "The rhythmic contraction of the heart, especially of the ventricles, by which blood is driven through the aorta and pulmonary artery after each dilation or diastole."
- Analogy with the heart pumping blood
- We want an architecture that pumps data through efficiently.
- "Data flows from memory in a rhythmic fashion, passing through many processing elements before it returns to memory." [Kung]
13. Systolic Architecture
- General Idea: a fully pipelined circuit, with I/O at the top and bottom levels
- Local connections - each element communicates with elements at the same level or the level below
- Inputs arrive each cycle
- Outputs depart each cycle, once the pipeline is full
14. Systolic Architecture
- Simple Example
- Create a DFG (data flow graph) for the body of the loop
- Represents the data dependencies of the code
for (i=0; i < 100; i++)
    a[i] = b[i] + b[i+1] + b[i+2];
[DFG: b[i] and b[i+1] feed an adder; its sum and b[i+2] feed a second adder producing a[i]]
15. Simple Example
- Add pipeline stages to each level of the DFG
[Same DFG with a register after each level: inputs b[i], b[i+1], b[i+2]; output a[i]]
16. Simple Example
- Allocate one resource (adder, ALU, etc.) for each operation in the DFG
- Resulting systolic architecture:
for (i=0; i < 100; i++)
    a[i] = b[i] + b[i+1] + b[i+2];
Cycle 1: input registers load b[0], b[1], b[2]
17. Simple Example
- Allocate one resource for each operation in the DFG
- Resulting systolic architecture:
for (i=0; i < 100; i++)
    a[i] = b[i] + b[i+1] + b[i+2];
Cycle 2: inputs load b[1], b[2], b[3]; the first level holds b[0]+b[1] and the delayed b[2]
18. Simple Example
- Allocate one resource for each operation in the DFG
- Resulting systolic architecture:
for (i=0; i < 100; i++)
    a[i] = b[i] + b[i+1] + b[i+2];
Cycle 3: inputs load b[2], b[3], b[4]; the first level holds b[1]+b[2] and b[3]; the second level holds b[0]+b[1]+b[2]
19. Simple Example
- Allocate one resource for each operation in the DFG
- Resulting systolic architecture:
for (i=0; i < 100; i++)
    a[i] = b[i] + b[i+1] + b[i+2];
Cycle 4: inputs load b[3], b[4], b[5]; the first level holds b[2]+b[3] and b[4]; the second level holds b[1]+b[2]+b[3]
- First output a[0] appears; it takes 4 cycles to fill the pipeline
20. Simple Example
- Allocate one resource for each operation in the DFG
- Resulting systolic architecture:
for (i=0; i < 100; i++)
    a[i] = b[i] + b[i+1] + b[i+2];
Cycle 5: inputs load b[4], b[5], b[6]; the first level holds b[3]+b[4] and b[5]; the second level holds b[2]+b[3]+b[4]; output a[1]
- One output per cycle from this point on; 99 more until completion
- Total cycles = 4 (init) + 99 = 103
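The cycle-by-cycle walkthrough above can be checked with a small software model (entirely my own sketch) that mimics the three register levels; it reproduces both the 4-cycle fill and the 103-cycle total:

```c
/* Cycle-accurate model of the two-adder systolic pipeline:
   level 0 = input registers, level 1 = b[i]+b[i+1] plus a delayed b[i+2],
   level 2 = the full sum; the output is written one cycle after that.
   Returns the total number of cycles used to produce all n outputs. */
int run_pipeline(const int *b, int *a, int n)
{
    int r_b0 = 0, r_b1 = 0, r_b2 = 0;  /* level-0 (input) registers */
    int r_sum = 0, r_dly = 0;          /* level-1 registers         */
    int r_out = 0;                     /* level-2 register          */
    int cycle, written = 0;

    for (cycle = 1; written < n; cycle++) {
        /* update downstream stages first so each reads last cycle's values */
        if (cycle >= 4) a[written++] = r_out;                  /* memory write */
        if (cycle >= 3) r_out = r_sum + r_dly;                 /* second adder */
        if (cycle >= 2) { r_sum = r_b0 + r_b1; r_dly = r_b2; } /* first adder + delay reg */
        if (cycle - 1 < n) {                                   /* fetch iteration i = cycle-1 */
            int i = cycle - 1;
            r_b0 = b[i]; r_b1 = b[i+1]; r_b2 = b[i+2];
        }
    }
    return cycle - 1;
}
```

Running it with n = 100 returns 103 cycles, matching the slide's 4 + 99 count.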
21. uP Performance Comparison
- Assumptions
- 10 instructions for the loop body
- CPI = 1.5
- uP clock 10x faster than the FPGA
- Total SW cycles:
- 100 * 10 * 1.5 = 1,500 cycles
- RC Speedup:
- (1500/103) * (1/10) = 1.46x
for (i=0; i < 100; i++)
    a[i] = b[i] + b[i+1] + b[i+2];
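The speedup arithmetic on this slide and the next follows a single formula; a tiny helper (mine) makes the pattern explicit:

```c
/* Speedup of RC over software when the uP clock is clk_ratio times
   faster than the RC clock: (SW cycles / RC cycles) / clk_ratio */
double rc_speedup(double sw_cycles, double rc_cycles, double clk_ratio)
{
    return (sw_cycles / rc_cycles) / clk_ratio;
}
```

With 1500 SW cycles and 103 RC cycles, a 10x clock gap gives about 1.46x, a 15x gap about 0.97x, and a 2x gap about 7.3x, matching the slides.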
22. uP Performance Comparison
- What if the uP clock is 15x faster?
- e.g. 3 GHz vs. 200 MHz
- RC Speedup:
- (1500/103) * (1/15) = 0.97x
- RC is slightly slower
- But!
- RC requires much less power
- Several watts vs. ~100 watts
- vs. low-power embedded uPs, the uP clock may be just 2x faster
- (1500/103) * (1/2) = 7.3x faster
- RC may be cheaper
- Depends on the area needed
- This example would certainly be cheaper
for (i=0; i < 100; i++)
    a[i] = b[i] + b[i+1] + b[i+2];
23. Simple Example, Cont.
- Improvement to the systolic array
- Why not execute multiple iterations at the same time?
- No data dependencies
- Loop unrolling
for (i=0; i < 100; i++)
    a[i] = b[i] + b[i+1] + b[i+2];
[Unrolled DFG: iteration i uses b[i], b[i+1], b[i+2] -> a[i]; iteration i+1 uses b[i+1], b[i+2], b[i+3] -> a[i+1]; and so on]
24. Simple Example, Cont.
- How much to unroll?
- Limited by memory bandwidth and area
- Must read all inputs once per cycle
- Must write all outputs once per cycle
- Must be sufficient area for all ops in the DFG
25. Unrolling Example
for (i=0; i < 100; i++)
    a[i] = b[i] + b[i+1] + b[i+2];
- The 1st iteration requires 3 inputs: b[0], b[1], b[2] -> a[0]
26. Unrolling Example
- Each unrolled iteration requires one additional input
for (i=0; i < 100; i++)
    a[i] = b[i] + b[i+1] + b[i+2];
- Two iterations: inputs b[0], b[1], b[2], b[3] -> outputs a[0], a[1]
27. Unrolling Example
- Each cycle brings in 4 inputs (instead of 6)
for (i=0; i < 100; i++)
    a[i] = b[i] + b[i+1] + b[i+2];
- Cycle 2: inputs b[1]..b[4]; the first-level registers hold b[0]+b[1] with delayed b[2], and b[1]+b[2] with delayed b[3]
28. Performance after unrolling
for (i=0; i < 100; i++)
    a[i] = b[i] + b[i+1] + b[i+2];
- How much unrolling?
- Assume b elements are 8 bits
- The first iteration requires 3 elements = 24 bits
- Each additional unrolled iteration requires 1 more element = 8 bits
- Due to overlapping inputs
- Assume memory bandwidth = 64 bits/cycle
- Can perform 6 iterations in parallel
- 24 + 8 + 8 + 8 + 8 + 8 = 64 bits
- New performance
- Unrolled systolic architecture requires
- 4 cycles to fill the pipeline + 100/6 iterations
- ~21 cycles
- With unrolling, RC is (1500/21) * (1/15) = 4.8x faster than the 3 GHz microprocessor!!!
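The unroll-factor and cycle-count arithmetic generalizes to any bandwidth; a sketch (function names are mine):

```c
/* Max iterations per cycle: the first iteration needs first_bits of
   input; each further iteration adds only extra_bits, because its
   inputs overlap the previous iteration's. */
int max_unroll(int bw_bits, int first_bits, int extra_bits)
{
    return 1 + (bw_bits - first_bits) / extra_bits;
}

/* Total cycles: pipeline fill plus one cycle per group of `unroll`
   iterations (ceiling division). */
int total_cycles(int iters, int unroll, int fill)
{
    return fill + (iters + unroll - 1) / unroll;
}
```

For the 64-bit bus this gives max_unroll(64, 24, 8) = 6 and total_cycles(100, 6, 4) = 21, as above. With ceiling division the 128-bit case comes out to 12 cycles; the next slide's 11 comes from rounding 100/14 down.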
29. Importance of Memory Bandwidth
- Performance with wider memories
- 128-bit bus
- 14 iterations in parallel
- 64 extra bits / 8 bits per iteration = 8 additional parallel iterations
- 6 original unrolled iterations + 8 = 14 total parallel iterations
- Total cycles = 4 (to fill the pipeline) + 100/14 = ~11
- Speedup = (1500/11) * (1/15) = 9.1x
- Doubling the memory width increased the speedup from 4.8x to 9.1x!!!
- Important Point
- Performance of hardware is often limited by memory bandwidth
- More bandwidth -> more unrolling -> more parallelism -> BIG SPEEDUP
30. Delay Registers
- Common mistake
- Forgetting to add registers for values not used during a cycle
- Values must be delayed (passed on through registers) until needed
[Figure: the correct design adds a delay register for b[i+2]; the incorrect design feeds it straight to the second adder]
31. Delay Registers
- Illustration of incorrect delays
for (i=0; i < 100; i++)
    a[i] = b[i] + b[i+1] + b[i+2];
Cycle 1: inputs load b[0], b[1], b[2]
32. Delay Registers
- Illustration of incorrect delays
for (i=0; i < 100; i++)
    a[i] = b[i] + b[i+1] + b[i+2];
Cycle 2: inputs load b[1], b[2], b[3]; the first level holds b[0]+b[1], but without a delay register the second adder sees the wrong b[2] (?????)
33. Delay Registers
- Illustration of incorrect delays
for (i=0; i < 100; i++)
    a[i] = b[i] + b[i+1] + b[i+2];
Cycle 3: inputs load b[2], b[3], b[4]; the first level holds b[1]+b[2]; the output computes b[0]+b[1]+b[3] instead of b[0]+b[1]+b[2] (?????)
34. Another Example
- Your turn
- Steps
- Build the DFG for the body of the loop
- Add pipeline stages
- Map operations to hardware resources
- Assume divide takes one cycle
- Determine the maximum amount of unrolling
- Memory bandwidth = 128 bits/cycle
- Determine performance compared to the uP
- Assume 15 instructions per iteration, CPI = 1.5, uP clock 15x faster than RC
short b[1004], a[1000];
for (i=0; i < 1000; i++)
    a[i] = avg( b[i], b[i+1], b[i+2], b[i+3], b[i+4] );
35. Another Example, Cont.
- What if the divider takes 20 cycles?
- But is fully pipelined
- Calculate the effect on performance
- In systolic architectures, performance is usually dominated by the throughput of the pipeline, not its latency
36. Dealing with Dependencies
- op2 is dependent on op1 when an input of op2 is an output of op1
- Problem: limits arithmetic parallelism, increases latency
- i.e., can't execute op2 before op1
- Serious problem: FPGAs rely on parallelism for performance
- Little parallelism = bad performance
37. Dealing with Dependencies
- Partial solution
- Parallelizing transformations
- e.g. tree height reduction
[Figure: ((a+b)+c)+d as a chain vs. (a+b)+(c+d) as a balanced tree]
- Chain: depth = # of adders
- Tree: depth = log2(# of adders)
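Tree-height reduction can be expressed as a balanced recursive sum; this sketch (mine) computes the same integer result as the sequential chain, but with only log2(n) dependent adder levels:

```c
/* Sequential chain: n-1 dependent adds, depth = n-1 levels. */
int chain_sum(const int *v, int n)
{
    int s = v[0];
    for (int i = 1; i < n; i++)
        s = s + v[i];
    return s;
}

/* Balanced tree: the same adds regrouped so that the two halves are
   independent; depth = ceil(log2(n)) levels. */
int tree_sum(const int *v, int n)
{
    if (n == 1) return v[0];
    int half = n / 2;
    return tree_sum(v, half) + tree_sum(v + half, n - half);
}
```

Integer addition is associative, so the regrouping is always safe here; for floating point the two forms can differ slightly, which is why compilers only apply this transformation when told it is allowed.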
38. Dealing with Dependencies
- Simple example with an inter-iteration dependency - a potential problem for systolic arrays
- Can't keep the pipeline full
a[0] = 0;
for (i=1; i < 8; i++)
    a[i] = b[i] + b[i+1] + a[i-1];
- The 2nd iteration (b[2]+b[3]+a[1]) can't execute until the 1st completes - limited arithmetic parallelism, increased latency
39. Dealing with Dependencies
- But systolic arrays also have pipeline-level parallelism - latency is less of an issue
a[0] = 0;
for (i=1; i < 8; i++)
    a[i] = b[i] + b[i+1] + a[i-1];
[Figure: stage 1 computes a[1] = b[1]+b[2]+a[0]; its result feeds stage 2, b[2]+b[3]+a[1]]
40. Dealing with Dependencies
- But systolic arrays also have pipeline-level parallelism - latency is less of an issue
[Figure, cont.: b[2]+b[3] is computed while a[1] is still in flight]
41. Dealing with Dependencies
- But systolic arrays also have pipeline-level parallelism - latency is less of an issue
a[0] = 0;
for (i=1; i < 8; i++)
    a[i] = b[i] + b[i+1] + a[i-1];
[Figure, cont.: the chain continues - b[1]+b[2]+a[0] -> a[1], b[2]+b[3]+a[1] -> a[2], b[3]+b[4]+a[2], and so on]
42. Dealing with Dependencies
- But systolic arrays also have pipeline-level parallelism - latency is less of an issue
a[0] = 0;
for (i=1; i < 8; i++)
    a[i] = b[i] + b[i+1] + a[i-1];
- Add pipeline stages -> systolic array
- Only works if the loop is fully unrolled! Requires sufficient memory bandwidth
(Outputs not shown)
43. Dealing with Dependencies
char b[1006];
for (i=0; i < 1000; i++) {
    acc = 0;
    for (j=0; j < 6; j++)
        acc += b[i+j];
}
- Steps
- Build the DFG for the inner loop (note the dependencies)
- Fully unroll the inner loop (check that memory bandwidth allows it)
- Assume bandwidth = 64 bits/cycle
- Add pipeline stages
- Map operations to hardware resources
- Determine performance compared to the uP
- Assume 15 instructions per iteration, CPI = 1.5, uP clock 15x faster than RC
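A software reference model for this exercise (my sketch) is useful for checking a hardware implementation against:

```c
/* 6-element sliding sum over b, matching the nested loop above.
   Fully unrolling the inner loop yields a 6-input adder tree, which
   tree-height reduction keeps at log2(6) ~ 3 levels deep. */
void window6(const unsigned char *b, int *out, int n)
{
    for (int i = 0; i < n; i++) {
        int acc = 0;
        for (int j = 0; j < 6; j++)
            acc += b[i + j];
        out[i] = acc;
    }
}
```

With 8-bit elements, one iteration reads 48 bits and each additional unrolled iteration only 8 more, so the 64-bit bus allows 3 parallel iterations (48 + 8 + 8 = 64) by the same overlap argument used on slide 28.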
44. Dealing with Control
- Can't wait for the result of the condition - stalls the pipeline
char b[1006], a[1000];
for (i=0; i < 1000; i++) {
    if (i % 2 == 0) a[i] = b[i] + b[i+1];
    else            a[i] = b[i+2] + b[i+3];
}
- Convert control into computation - if-conversion
[Circuit: adders compute b[i]+b[i+1] and b[i+2]+b[i+3] in parallel; i % 2 drives a MUX that selects a[i]]
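If-conversion can be checked in software: compute both branches unconditionally and select, which is exactly what the MUX does. A sketch (mine):

```c
/* Branch form: control decides which sum gets computed. */
int branch_form(const char *b, int i)
{
    if (i % 2 == 0) return b[i] + b[i+1];
    else            return b[i+2] + b[i+3];
}

/* If-converted form: both sums are always computed (both datapaths
   stay busy every cycle) and a mux picks the result. */
int mux_form(const char *b, int i)
{
    int even_sum = b[i]   + b[i+1];
    int odd_sum  = b[i+2] + b[i+3];
    return (i % 2 == 0) ? even_sum : odd_sum;   /* the MUX */
}
```

In hardware the "wasted" adder costs only area, not time, so the pipeline never stalls on the condition.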
45. Dealing with Control
- If-conversion is not always so easy
char b[1006], a[1000], a2[1000];
for (i=0; i < 1000; i++) {
    if (i % 2 == 0) a[i]  = b[i] + b[i+1];
    else            a2[i] = b[i+2] + b[i+3];
}
[Circuit: both sums computed; i % 2 drives two MUXes, one selecting the value written to a[i] and one the value written to a2[i]]
46. Other Challenges
- Outputs can also limit unrolling
- Example
- 4 outputs, 1 input
- Each output is 32 bits
- Total output bandwidth for 1 iteration = 128 bits
- Memory bus = 128 bits
- Can't unroll, even though the inputs only use 32 bits
long b[1004], a[1000];
for (i=0, j=0; i < 1000; i+=4, j++) {
    a[i]   = b[j] + 10;
    a[i+1] = b[j] + 23;
    a[i+2] = b[j] - 12;
    a[i+3] = b[j] * b[j];
}
47. Other Challenges
- Requires streaming data to work well
- A systolic array can be built, but the pipelining is wasted on a small data stream
- Point: systolic arrays work best with repeated computation
for (i=0; i < 4; i++)
    a[i] = b[i] + b[i+1];
[Figure: a 4-iteration stream barely fills the pipeline before it drains]
48. Other Challenges
- Memory bandwidth
- Values so far are peak values
- Only achievable if all input data is stored sequentially in memory
- Often not the case
- Example
- Two-dimensional arrays
long a[100][100], b[100][100];
for (i=1; i < 100; i++)
    for (j=1; j < 100; j++)
        a[i][j] = avg( b[i-1][j], b[i][j-1], b[i+1][j], b[i][j+1] );
49. Other Challenges
- Memory bandwidth, cont.
- Example 2
- Multiple array inputs
- b and c stored in different locations
- Memory accesses may jump back and forth
- Possible solutions
- Use multiple memories, or a multiported memory (high cost)
- Interleave data from b and c in memory (programming effort)
- Without compiler support, requires a manual rewrite
long a[100], b[100], c[100];
for (i=0; i < 100; i++)
    a[i] = b[i] + c[i];
50. Other Challenges
- Dynamic memory access patterns
- Sequence of addresses not known until run time
- Clearly not sequential
- Possible solution
- Something creative enough for a Ph.D. thesis
int f( int val ) {
    long a[100], b[100], c[100];
    for (i=0; i < 100; i++)
        a[i] = b[rand()%100] + c[i] * val;
}
51. Other Challenges
- Pointer-based data structures
- Even if scanning through a list, the data could be scattered all over memory
- Very unlikely to be sequential
- Can cause aliasing problems
- Greatly limits optimization potential
- Solutions are another Ph.D.
- Pointers are OK if used as arrays:
int f( int val ) {
    long a[100], b[100];
    for (i=0; i < 100; i++)
        a[i] = b[i] + 1;
}
is equivalent to
int f( int val ) {
    long a[100], b[100];
    long *p = b;
    for (i=0; i < 100; i++, p++)
        a[i] = *p + 1;
}
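The two listings really do compute the same thing; a runnable check (mine):

```c
#include <string.h>

void array_form(const long *b, long *a, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i] + 1;              /* indexed access */
}

void pointer_form(const long *b, long *a, int n)
{
    const long *p = b;                /* pointer walks b like an array */
    for (int i = 0; i < n; i++, p++)
        a[i] = *p + 1;
}
```

Because `p` only ever strides linearly through `b`, a compiler (or hardware generator) can recover the same sequential access pattern as the indexed form; arbitrary pointer chasing offers no such guarantee.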
52. Other Challenges
- Not all code is just one loop
- Yet another Ph.D.
- Main point to remember
- Systolic arrays are extremely fast, but only for certain types of code
- What can we do instead of systolic arrays?
53. Other Options
- Try something completely different
- Try a slight variation
- Example: 3 inputs, but we can only read 2 per cycle
for (i=0; i < 100; i++)
    a[i] = b[i] + b[i+1] + b[i+2];
- Not possible as before - can only read two inputs per cycle
54. Variations
- Example, cont.
- Break the previous rules - use extra delay registers
for (i=0; i < 100; i++)
    a[i] = b[i] + b[i+1] + b[i+2];
[Figure: read b[i] and b[i+1] in one cycle, b[i+2] the next; extra delay registers align them]
55. Variations
- Example, cont.
- Break the previous rules - use extra delay registers
for (i=0; i < 100; i++)
    a[i] = b[i] + b[i+1] + b[i+2];
Cycle 1: read b[0], b[1]; the remaining registers hold junk
56. Variations
- Example, cont.
Cycle 2: read b[2]; b[0] and b[1] advance one level; junk propagates through the other registers
57. Variations
- Example, cont.
Cycle 3: read b[1], b[2] for the next iteration; the first level holds b[0]+b[1] and the delayed b[2]; junk propagates
58. Variations
- Example, cont.
Cycle 4: read b[3]; the second level holds b[0]+b[1]+b[2]; junk propagates
59. Variations
- Example, cont.
Cycle 5: read b[2], b[3]; the first level holds b[1]+b[2]; first output a[0] appears after 5 cycles
60. Variations
- Example, cont.
Cycle 6: read b[4]; the second level holds b[1]+b[2]+b[3]; junk appears on the next output cycle
61. Variations
- Example, cont.
Cycle 7: second output a[1], 2 cycles after the first
- Valid output every 2 cycles - approximately 1/2 the performance of the original array
62. Entire Circuit
[Figure: RAM -> Input Address Generator -> Buffer -> Datapath (with Controller) -> Buffer -> Output Address Generator -> RAM]
- Buffers handle differences in speed between the RAM and the datapath