Systolic Architectures: Why is RC fast - PowerPoint PPT Presentation

Transcript and Presenter's Notes



1
Systolic Architectures: Why is RC fast?
  • Greg Stitt
  • ECE Department
  • University of Florida

2
Why are microprocessors slow?
  • Von Neumann architecture
  • Stored-program machine
  • Memory for instructions (and data)

3
Von Neumann architecture
  • Summary
  • 1) Fetch instruction
  • 2) Decode instruction, fetch data
  • 3) Execute
  • 4) Store results
  • 5) Repeat from 1 until end of program
  • Problem
  • Inherently sequential
  • Only executes one instruction at a time
  • Does not take into consideration parallelism of
    application

4
Von Neumann architecture
  • Problem 2: Von Neumann bottleneck
  • Constantly reading/writing data for every
    instruction requires high memory bandwidth
  • Performance limited by bandwidth of memory

[Diagram: Control + Datapath connected to RAM; bandwidth not sufficient -> Von Neumann bottleneck]
5
Improvements
  • Increase resources in datapath to execute
    multiple instructions in parallel
  • VLIW - very long instruction word
  • Compiler encodes parallelism into very-long
    instructions
  • Superscalar
  • Architecture determines parallelism at run time -
    out-of-order instruction execution
  • Von Neumann bottleneck still problem

[Diagram: Control with multiple Datapaths ( . . . ) sharing one RAM]
6
Why is RC fast?
  • RC implements custom circuits for an application
  • Circuits can exploit massive amount of
    parallelism
  • VLIW/Superscalar Parallelism
  • 5 instructions/cycle in best case (rarely occurs)
  • RC
  • Potentially thousands
  • As many ops as will fit in device
  • Also, supports different types of parallelism

7
Types of Parallelism
  • Bit-level

C Code for Bit Reversal
x = (x >> 16) | (x << 16);
x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);
8
Types of Parallelism
  • Arithmetic-level Parallelism

C Code
for (i=0; i < 128; i++) y += c[i] * x[i];
Circuit
[128 multipliers feeding a tree of adders]
  • Circuit: 7 cycles
  • Speedup > 100x for same clock
  • uP: 1000s of instructions, several thousand cycles

9
Types of Parallelism
  • Pipeline Parallelism

for (j=0; j < n; j++) {
  for (i=0; i < 128; i++)
    y += c[i] * x[i];
  // output y; y = 0;
}
Start new inner loop every cycle
After filling up pipeline, performs 128 mults +
127 adds every cycle
10
Types of Parallelism
  • Task-level Parallelism
  • e.g. MPEG-2
  • Execute each task
  • in parallel

11
How to exploit parallelism?
  • General Idea
  • Identify tasks
  • Create circuit for each task
  • Communication between tasks with buffers
  • How to create circuit for each task?
  • Want to exploit bit-level, arithmetic-level, and
    pipeline-level parallelism
  • Solution: systolic architectures (a.k.a. systolic
    arrays / systolic computing)

12
Systolic Architectures
  • Systolic definition
  • The rhythmic contraction of the heart, especially
    of the ventricles, by which blood is driven
    through the aorta and pulmonary artery after each
    dilation or diastole.
  • Analogy with heart pumping blood
  • We want architecture that pumps data through
    efficiently.
  • "Data flows from memory in a rhythmic fashion,
    passing through many processing elements before
    it returns to memory." [Hung]

13
Systolic Architecture
  • General Idea Fully pipelined circuit, with I/O
    at top and bottom level
  • Local connections - each element communicates
    with elements at same level or level below

Inputs arrive each cycle
Outputs depart each cycle, after
pipeline is full
14
Systolic Architecture
  • Simple Example
  • Create DFG (data flow graph) for body of loop
  • Represent data dependencies of code

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
[DFG: b[i] and b[i+1] enter the first adder; its sum and b[i+2] enter the second adder, producing a[i]]
15
Simple Example
  • Add pipeline stages to each level of DFG

[Same DFG with a pipeline register inserted after each adder level]
16
Simple Example
  • Allocate one resource (adder, ALU, etc) for each
    operation in DFG
  • Resulting systolic architecture

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
[Cycle 1: b[0], b[1], b[2] enter the array]

17
Simple Example
  • Allocate one resource for each operation in DFG
  • Resulting systolic architecture

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
[Cycle 2: b[1], b[2], b[3] enter; stage 1 holds b[0]+b[1] and b[2]]

18
Simple Example
  • Allocate one resource for each operation in DFG
  • Resulting systolic architecture

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
[Cycle 3: b[2], b[3], b[4] enter; stage 1 holds b[1]+b[2] and b[3]; stage 2 holds b[0]+b[1]+b[2]]
19
Simple Example
  • Allocate one resource for each operation in DFG
  • Resulting systolic architecture

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
[Cycle 4: b[3], b[4], b[5] enter; stage 1 holds b[2]+b[3] and b[4]; stage 2 holds b[1]+b[2]+b[3]]
First output a[0] appears; takes 4 cycles to fill pipeline
20
Simple Example
  • Allocate one resource for each operation in DFG
  • Resulting systolic architecture

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
[Cycle 5: b[4], b[5], b[6] enter; stage 1 holds b[3]+b[4] and b[5]; stage 2 holds b[2]+b[3]+b[4]; output a[1] appears]
Total cycles = 4 (init) + 99 = 103
One output per cycle at this point, 99 more until completion
21
uP Performance Comparison
  • Assumptions
  • 10 instructions for loop body
  • CPI = 1.5
  • Clk 10x faster than FPGA
  • Total SW cycles
  • 100 × 10 × 1.5 = 1,500 cycles
  • RC Speedup
  • (1500/103) × (1/10) = 1.46x

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
22
uP Performance Comparison
  • What if uP clock is 15x faster?
  • e.g. 3 GHz vs. 200 MHz
  • RC Speedup
  • (1500/103) × (1/15) = 0.97x
  • RC is slightly slower
  • But!
  • RC requires much less power
  • Several Watts vs 100 Watts
  • SW may be practical for embedded uPs => low power
  • Clock may be just 2x faster
  • (1500/103) × (1/2) = 7.3x faster
  • RC may be cheaper
  • Depends on area needed
  • This example would certainly be cheaper

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
23
Simple Example, Cont.
  • Improvement to systolic array
  • Why not execute multiple iterations at same time?
  • No data dependencies
  • Loop unrolling

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
Unrolled DFG
[Two copies of the DFG side by side: b[i], b[i+1], b[i+2] -> a[i] and b[i+1], b[i+2], b[i+3] -> a[i+1], repeated . . .]
24
Simple Example, Cont.
  • How much to unroll?
  • Limited by memory bandwidth and area

Must get all inputs once per cycle
[Unrolled DFG: inputs b[i]..b[i+3] across the top, outputs a[i], a[i+1] at the bottom]
Must write all outputs once per cycle
Must be sufficient area for all ops in DFG
25
Unrolling Example
  • Original circuit

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
1st iteration requires 3 inputs
[b[0], b[1], b[2] -> adders -> a[0]]
26
Unrolling Example
  • Original circuit

Each unrolled iteration requires one additional
input
for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
[b[0]..b[3] -> two overlapping DFG copies -> a[0], a[1]]
27
Unrolling Example
  • Original circuit

Each cycle brings in 4 inputs (instead of 6)
for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
[Cycle 2: b[1]..b[4] enter; stage 1 holds b[0]+b[1], b[2] and b[1]+b[2], b[3]]
28
Performance after unrolling
for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
  • How much unrolling?
  • Assume b elements are 8 bits
  • First iteration requires 3 elements = 24 bits
  • Each unrolled iteration requires 1 element = 8
    bits
  • Due to overlapping inputs
  • Assume memory bandwidth = 64 bits/cycle
  • Can perform 6 iterations in parallel
  • (24 + 8 + 8 + 8 + 8 + 8) = 64 bits
  • New performance
  • Unrolled systolic architecture requires
  • 4 cycles to fill pipeline + 100/6 iterations
  • = 21 cycles
  • With unrolling, RC is (1500/21) × (1/15) = 4.8x
    faster than 3 GHz microprocessor!!!

29
Importance of Memory Bandwidth
  • Performance with wider memories
  • 128-bit bus
  • 14 iterations in parallel
  • 64 extra bits / 8 bits per iteration = 8 more
    parallel iterations
  • 6 original unrolled iterations + 8 = 14 total
    parallel iterations
  • Total cycles = 4 to fill pipeline + 100/14 ≈ 11
  • Speedup = (1500/11) × (1/15) = 9.1x
  • Doubling memory width increased speedup from 4.8x
    to 9.1x!!!
  • Important Point
  • Performance of hardware often limited by memory
    bandwidth
  • More bandwidth => more unrolling => more
    parallelism => BIG SPEEDUP

30
Delay Registers
  • Common mistake
  • Forgetting to add registers for values not used
    during a cycle
  • Values delayed or passed on until needed



[Diagrams: the correct version passes a value not yet needed through a delay register at each stage; the incorrect version wires it directly to a later stage]
31
Delay Registers
  • Illustration of incorrect delays

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
[Cycle 1: b[0], b[1], b[2] enter; no delay register for b[2]]


32
Delay Registers
  • Illustration of incorrect delays

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
[Cycle 2: b[1], b[2], b[3] enter; stage 1 holds b[0]+b[1], but b[2] was never registered, so the second adder sees b[2] + ?????]
33
Delay Registers
  • Illustration of incorrect delays

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
[Cycle 3: stage 1 holds b[1]+b[2]; without the delay register the second adder computes b[0]+b[1]+b[3] instead of b[0]+b[1]+b[2]]
34
Another Example
  • Your turn
  • Steps
  • Build DFG for body of loop
  • Add pipeline stages
  • Map operations to hardware resources
  • Assume divide takes one cycle
  • Determine maximum amount of unrolling
  • Memory bandwidth 128 bits/cycle
  • Determine performance compared to uP
  • Assume 15 instructions per iteration, CPI 1.5,
    CLK 15x faster than RC

short b[1004], a[1000];
for (i=0; i < 1000; i++)
  a[i] = avg( b[i], b[i+1], b[i+2], b[i+3], b[i+4] );
35
Another Example, Cont.
  • What if divider takes 20 cycles?
  • But, fully pipelined
  • Calculate the effect on performance

In systolic architectures, performance usually
dominated by throughput of pipeline, not latency
36
Dealing with Dependencies
  • op2 is dependent on op1 when the input to op2 is
    an output from op1
  • Problem: limits arithmetic parallelism, increases
    latency
  • i.e., can't execute op2 before op1
  • Serious problem: FPGAs rely on parallelism for
    performance
  • Little parallelism => bad performance

op1
op2
37
Dealing with Dependencies
  • Partial solution
  • Parallelizing transformations
  • e.g. tree height reduction

[Before: serial chain ((a+b)+c)+d, depth = # of adders.
 After: balanced tree (a+b)+(c+d), depth = log2(# of adders)]
38
Dealing with Dependencies
  • Simple example w/ inter-iteration dependency -
    potential problem for systolic arrays
  • Cant keep pipeline full

a[0] = 0;
for (i=1; i < 8; i++) a[i] = b[i] + b[i+1] + a[i-1];
Can't execute until 1st iteration completes - limited
arithmetic parallelism, increased latency
[DFG: b[i]+b[i+1] computed freely, but the final +a[i-1] must wait for the previous iteration]
39
Dealing with Dependencies
  • But, systolic arrays also have pipeline-level
    parallelism - latency less of an issue

a[0] = 0;
for (i=1; i < 8; i++) a[i] = b[i] + b[i+1] + a[i-1];
[b[1]+b[2], b[2]+b[3], ... computed in parallel; the +a[i-1] adds form a serial chain]
40
Dealing with Dependencies
  • But, systolic arrays also have pipeline-level
    parallelism - latency less of an issue

a[0] = 0;
for (i=1; i < 8; i++) a[i] = b[i] + b[i+1] + a[i-1];
[a[1] = (b[1]+b[2]) + a[0] completes while later b[i]+b[i+1] sums wait]
41
Dealing with Dependencies
  • But, systolic arrays also have pipeline-level
    parallelism - latency less of an issue

a[0] = 0;
for (i=1; i < 8; i++) a[i] = b[i] + b[i+1] + a[i-1];
[The chain continues: a[2] = (b[2]+b[3]) + a[1], one new result per step . . .]
42
Dealing with Dependencies
  • But, systolic arrays also have pipeline-level
    parallelism - latency less of an issue

a[0] = 0;
for (i=1; i < 8; i++) a[i] = b[i] + b[i+1] + a[i-1];
[Same structure with pipeline registers between levels]
Add pipeline stages => systolic array
Only works if loop is fully unrolled! Requires sufficient memory bandwidth
Outputs not shown
43
Dealing with Dependencies
  • Your turn

char b[1006];
for (i=0; i < 1000; i++) {
  acc = 0;
  for (j=0; j < 6; j++)
    acc += b[i+j];
}
  • Steps
  • Build DFG for inner loop (note dependencies)
  • Fully unroll inner loop (check to see if memory
    bandwidth allows)
  • Assume bandwidth 64 bits/cycle
  • Add pipeline stages
  • Map operations to hardware resources
  • Determine performance compared to uP
  • Assume 15 instructions per iteration, CPI = 1.5,
    CLK 15x faster than RC

44
Dealing with Control
  • If statements

Can't wait for result of condition - stalls pipeline
char b[1006], a[1000];
for (i=0; i < 1000; i++) {
  if (i % 2 == 0) a[i] = b[i] + b[i+1];
  else a[i] = b[i+2] + b[i+3];
}
Convert control into computation - if conversion
[Circuit: b[i]+b[i+1] and b[i+2]+b[i+3] computed every cycle; i % 2 drives a MUX selecting a[i]]
45
Dealing with Control
  • If conversion, not always so easy

char b[1006], a[1000], a2[1000];
for (i=0; i < 1000; i++) {
  if (i % 2 == 0) a[i] = b[i] + b[i+1];
  else a2[i] = b[i+2] + b[i+3];
}
[Circuit: both sums computed every cycle; i % 2 drives two MUXes selecting what is written to a[i] and to a2[i]]
46
Other Challenges
  • Outputs can also limit unrolling
  • Example
  • 4 outputs, 1 input
  • Each output 32 bits
  • Total output bandwidth for 1 iteration 128 bits
  • Memory bus 128 bits
  • Can't unroll, even though inputs only use 32 bits

long b[1004], a[1000];
for (i=0, j=0; i < 1000; i+=4, j++) {
  a[i]   = b[j] + 10;
  a[i+1] = b[j] + 23;
  a[i+2] = b[j] - 12;
  a[i+3] = b[j] * b[j];
}
47
Other Challenges
  • Requires streaming data to work well
  • Systolic array
  • But, pipelining is wasted because the data stream
    is small
  • Point - systolic arrays work best with repeated
    computation

for (i=0; i < 4; i++) a[i] = b[i] + b[i+1];
[Array sees only b[0]..b[4]: the pipeline barely fills before the stream ends, producing a[0]..a[3]]
48
Other Challenges
  • Memory bandwidth
  • Values so far are peak values
  • Can only be achieved if all input data stored
    sequentially in memory
  • Often not the case
  • Example
  • Two-dimensional arrays

long a[100][100], b[100][100];
for (i=1; i < 100; i++)
  for (j=1; j < 100; j++)
    a[i][j] = avg( b[i-1][j], b[i][j-1], b[i+1][j], b[i][j+1] );
49
Other Challenges
  • Memory bandwidth, cont.
  • Example 2
  • Multiple array inputs
  • b and c stored in different locations
  • Memory accesses may jump back and forth
  • Possible solutions
  • Use multiple memories, or multiported memory
    (high cost)
  • Interleave data from b and c in memory
    (programming effort)
  • If no compiler support, requires manual rewrite

long a[100], b[100], c[100];
for (i=0; i < 100; i++) a[i] = b[i] + c[i];
50
Other Challenges
  • Dynamic memory access patterns
  • Sequence of addresses not known until run time
  • Clearly, not sequential
  • Possible solution
  • Something creative enough for a Ph.D. thesis

int f( int val ) {
  long a[100], b[100], c[100];
  for (i=0; i < 100; i++)
    a[i] = b[rand()%100] + c[i] * val;
}
51
Other Challenges
  • Pointer-based data structures
  • Even if scanning through list, data could be all
    over memory
  • Very unlikely to be sequential
  • Can cause aliasing problems
  • Greatly limits optimization potential
  • Solutions are another Ph.D.
  • Pointers ok if used as array

int f( int val ) {
  long a[100], b[100];
  for (i=0; i < 100; i++)
    a[i] = b[i] + 1;
}

equivalent to

int f( int val ) {
  long a[100], b[100];
  long *p = b;
  for (i=0; i < 100; i++, p++)
    a[i] = *p + 1;
}
52
Other Challenges
  • Not all code is just one loop
  • Yet another Ph.D.
  • Main point to remember
  • Systolic arrays are extremely fast, but only
    certain types of code work
  • What can we do instead of systolic arrays?

53
Other Options
  • Try something completely different
  • Try slight variation
  • Example - 3 inputs, but can only read 2 per cycle

Not possible - can only read two inputs per cycle
for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
[Original two-adder array needs 3 reads per cycle]
54
Variations
  • Example, cont.
  • Break previous rules - use extra delay registers

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
[Variation: b[i] and b[i+1] feed the first adder; b[i+2] passes through an extra delay register]
55
Variations
  • Example, cont.
  • Break previous rules - use extra delay registers

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
[Cycle 1: b[0], b[1] read; remaining input is junk]
56
Variations
  • Example, cont.
  • Break previous rules - use extra delay registers

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
[Cycle 2: b[2] read; stage 1 holds b[0]+b[1] and junk]
57
Variations
  • Example, cont.
  • Break previous rules - use extra delay registers

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
[Cycle 3: b[1], b[2] read; second adder receives b[0]+b[1] and the delayed b[2]; junk elsewhere]
58
Variations
  • Example, cont.
  • Break previous rules - use extra delay registers

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
[Cycle 4: b[3] read; stage 2 holds b[0]+b[1]+b[2]; junk in alternate slots]
59
Variations
  • Example, cont.
  • Break previous rules - use extra delay registers

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
[Cycle 5: b[2], b[3] read; stage 2 holds b[1]+b[2]]
First output a[0] after 5 cycles
60
Variations
  • Example, cont.
  • Break previous rules - use extra delay registers

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
[Cycle 6: b[4] read; stage 2 holds b[1]+b[2]+b[3]]
Junk on next cycle
61
Variations
  • Example, cont.
  • Break previous rules - use extra delay registers

for (i=0; i < 100; i++) a[i] = b[i] + b[i+1] + b[i+2];
[Cycle 7: b[3], b[4] read; stage 2 holds b[2]+b[3]]
Second output a[1], 2 cycles later
Valid output every 2 cycles - approximately 1/2 the performance
62
Entire Circuit
[Diagram: RAM -> Input Address Generator -> Buffer -> Datapath (with Controller) -> Buffer -> Output Address Generator -> RAM]
Buffers handle differences in speed between RAM and datapath