Trying to avoid pipeline delays - PowerPoint PPT Presentation

Slides: 24
Provided by: MichaelR223
Transcript and Presenter's Notes

1
Trying to avoid pipeline delays
  • Interleaving two sets of operations; X-Y compute
    blocks

2
Tackled today
  • Review of coding a hardware circular buffer
  • Roughly understanding where pipeline delays may
    occur
  • Refactor the working code to improve the speed
    without spending any time on examining whether the
    delays are really there (works, for the moment, in
    principle)
  • Refactor the working code to perform operations
    using both X and Y ALUs: in principle, twice the
    speed

3
DCRemoval( )
Memory intensive, addition intensive
Loops for main code
FIFO implemented as circular buffer
  • Not as complex as FIR, but many of the same
    requirements
  • Easier to handle
  • You will use the same ideas when optimizing FIR over
    Labs 2 and 3
  • Two issues: speed and accuracy. Develop suitable
    tests for the CPP code and check that the various
    assembly language versions satisfy the same tests

4
Alternative approach
  • Move pointers rather than memory values
  • In principle: 1 memory read, 1 memory write,
    pointer addition, conditional equate
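
The two approaches can be compared with a small model. This is a minimal sketch in Python rather than the lecture's C++ or assembly code; the names fifo_by_moving_data and SoftwareCircularBuffer, the buffer length and the sample values are all made up for illustration.

```python
# Illustrative Python model; names and data are hypothetical.

def fifo_by_moving_data(fifo, new_value):
    """Update the FIFO by shifting every stored value down one slot."""
    for i in range(len(fifo) - 1):
        fifo[i] = fifo[i + 1]          # one read and one write per element
    fifo[-1] = new_value


class SoftwareCircularBuffer:
    """Update the FIFO by moving an index (the 'pointer'), not the data."""

    def __init__(self, length):
        self.data = [0] * length
        self.index = 0

    def insert(self, new_value):
        oldest = self.data[self.index]       # 1 memory read
        self.data[self.index] = new_value    # 1 memory write
        self.index += 1                      # pointer addition
        if self.index == len(self.data):     # conditional equate
            self.index = 0
        return oldest


samples = [3, 1, 4, 1, 5, 9, 2, 6]
moved = [0] * 4
circular = SoftwareCircularBuffer(4)
evicted = []
for s in samples:
    fifo_by_moving_data(moved, s)
    evicted.append(circular.insert(s))
```

Both versions end up holding the same last four samples, so the same tests can be run against either one. The circular version touches one slot per new sample, but pays for a compare and a conditional reset each time.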

5
Note: Software circular buffer is NOT necessarily
more efficient than data moves
  • Are we now spending more time on moving / checking the
    software circular buffer pointers than on moving the
    data?

[Figure labels: SLOWER, FASTER]
6
Next step: Hardware circular buffer
  • Do exactly the same pointer calculations as with
    software circular buffers, but now the
    calculations are done behind the scenes at high
    speed using specialized pointer features
  • Only available with J0, J1, J2 and J3 registers
    (On older ADSP-21061 all pointer registers)
  • Jx -- The pointer register
  • JBx -- The BASE register, set to the start of the
    FIFO array
  • JLx -- The LENGTH register, set to the length of the
    FIFO array
  • VERY BIG WARNING? Reset to zero. On older
    ADSP-21061 it was very important that the length
    register be reset to zero, otherwise all the
    other functions using this register would
    suddenly start using circular buffers by mistake.
  • Resetting is still advisable, but special syntax is
    now needed to cause circular buffer operations to
    occur

7
Store values into hardware FIFO
  • CB instruction ONLY works on POST-MODIFY
    operations
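
What the hardware does behind the scenes on a circular-buffer post-modify access can be modelled roughly as follows. This is a Python sketch, not TigerSHARC code: the class CircularPointer, the method name, and the memory contents are hypothetical, and the wrap rule is the usual base-plus-offset-modulo-length calculation assumed for illustration.

```python
class CircularPointer:
    """Toy model of a pointer register with its base and length registers."""

    def __init__(self, base, length):
        self.j = base        # Jx: the pointer register
        self.jb = base       # JBx: start of the FIFO array
        self.jl = length     # JLx: length of the FIFO array

    def cb_post_modify(self, memory, modify):
        """Use the current address, THEN update the pointer with wrap-around."""
        value = memory[self.j]
        self.j = self.jb + (self.j - self.jb + modify) % self.jl
        return value


memory = list(range(100, 120))          # hypothetical memory contents
pointer = CircularPointer(base=8, length=4)
reads = [pointer.cb_post_modify(memory, 1) for _ in range(6)]
```

After four reads the pointer wraps back to the base address, so the fifth and sixth reads return the first two values again without any explicit compare in the program.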

8
Next stage in improving code speed: Hardware
circular buffers
  • Set up pointers to buffers: 2
  • Insert values into buffers: 8 (was 4)
  • SUM LOOP: 3 + N * 4 (was 4 + N * 5)
  • SHIFT LOOP: 1 (was 1 + 2 * log2N)
  • Update outgoing parameters: 6
  • Update FIFO: 14 (was 3 + 6 * N)
  • Function return: 2
  • ---------------------------
  • Total: 37 + 4 * N (was 23 + 5 * N)
  • N = 128: instructions = 549 cycles
  • 549 cycles plus about 300 delay cycles: 879 cycles
    in total. Delays are now > 50% of useful time
  • Was: 677 cycles plus about 360 delay cycles: 1011
    cycles in total

9
On TigerSHARC: Pipeline Issue
  • After you issue the command to read from memory,
    you must then wait for the value to arrive
  • Problem: may be trading memory wait delays for
    I-ALU delays

Memory pipeline delay
    XR5 = CB [J0 += 1];;
    XR4 = R4 + R5;;
    XR6 = CB [J1 += 1];;
    XR7 = R7 + R6;;
No memory pipeline delay
    XR5 = CB [J0 += 1];;
    XR6 = CB [J1 += 1];;
    XR4 = R4 + R5;;
    XR7 = R7 + R6;;
10
Now perform math operation using circular buffer
operation
  • Note the possible memory delays
  • Memory cache helps?

Wait for read of R2, use it, then wait for read
of R3 and then use it
11
Simple interleaving of code: possible saving of
memory delays
Original order: 1, 2, 3, 4. New order: 1, 3, 2, 4
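
Why the new order helps can be seen with a toy in-order pipeline model. This is a sketch in Python under stated assumptions: one instruction issues per cycle, an instruction stalls until its source registers are ready, and a memory read takes LOAD_LATENCY cycles to deliver its value. The latency of 3 is a made-up number, not a TigerSHARC figure, and the four instructions are assumed to be two load / add pairs.

```python
LOAD_LATENCY = 3   # hypothetical cycles before a loaded value can be used
ALU_LATENCY = 1    # hypothetical cycles before an ALU result can be used


def count_cycles(program):
    """Return total cycles for (dest, sources, latency) instructions in order."""
    ready_at = {}                      # register -> cycle its value is ready
    cycle = 0
    for dest, sources, latency in program:
        # Stall until every source register has arrived.
        start = max([cycle] + [ready_at.get(s, 0) for s in sources])
        ready_at[dest] = start + latency
        cycle = start + 1
    return cycle


load_r5 = ("R5", [], LOAD_LATENCY)             # 1: read first value
add_r4 = ("R4", ["R4", "R5"], ALU_LATENCY)     # 2: use first value
load_r6 = ("R6", [], LOAD_LATENCY)             # 3: read second value
add_r7 = ("R7", ["R7", "R6"], ALU_LATENCY)     # 4: use second value

original = [load_r5, add_r4, load_r6, add_r7]  # order 1 2 3 4
reordered = [load_r5, load_r6, add_r4, add_r7] # order 1 3 2 4

original_stalls = count_cycles(original) - len(original)
reordered_stalls = count_cycles(reordered) - len(reordered)
```

With these assumed latencies the original order loses 4 cycles to stalls and the reordered version loses 1: the second read is issued while the first value is still on its way.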
12
Interleaving of code: same instructions,
different order
  • Set up pointers to buffers: 2
  • Insert values into buffers: 8 (was 4)
  • SUM LOOP: 3 + N * 4 (was 4 + N * 5)
  • SHIFT LOOP: 1 (was 1 + 2 * log2N)
  • Update outgoing parameters: 6
  • Update FIFO: 14 (was 3 + 6 * N)
  • Function return: 2
  • ---------------------------
  • Total: 37 + 4 * N (was 23 + 5 * N)
  • N = 128: instructions = 549 cycles
  • 549 cycles plus about 50 delay cycles: 594 cycles
    in total. Delays are now about 10% of useful time
  • Was: 549 cycles plus about 300 delay cycles: 879
    cycles in total. Delays were > 50% of useful time

13
The code is too slow because we are not taking
advantage of the available resources
  • Bring in up to 128 bits (4 instructions) per
    cycle
  • Ability to bring in 4 32-bit values along J data
    bus (data1) and 4 along K bus (data2)
  • Perform address calculations in J and K ALUs:
    single-cycle hardware circular buffers
  • Perform math operations on both X and Y compute
    blocks
  • Background DMA activity
  • Off-load some of the processing to the second
    processor

14
Understanding how to use MIMD mode: Process left
filter in X-Compute, right in Y
  • XR6 = 0;;   Puts 0 into XR6 register
  • YR6 = 0;;   Puts 0 into YR6 register
  • XYR6 = 0;;  Puts 0 into XR6 and YR6 at same time
  • 1 instruction saved

15
Understanding how to use MIMD mode: Process left
filter in X-Compute, right in Y
  • XR6 = R6 + R2;;   Adds XR6 + XR2 registers
  • YR6 = R6 + R2;;   Adds YR6 + YR2 registers
  • XYR6 = R6 + R2;;  Adds XR6 + XR2, AND YR6 + YR2 at
    same time
  • N instructions saved

16
Understanding how to use MIMD mode: Process left
filter in X-Compute, right in Y
  • XR6 = ASHIFT R6 BY -7;;   XR6 = XR6 >> 7
  • YR6 = ASHIFT R6 BY -7;;   YR6 = YR6 >> 7
  • XYR6 = ASHIFT R6 BY -7;;  XR6 = XR6 >> 7 and
    YR6 = YR6 >> 7 at same time
  • 1 instruction saved
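
The effect of the XY prefix can be imitated with two register files. This is a Python model only: ComputeBlock, issue and the register contents are hypothetical, and the three operations stand in for the clear, add and arithmetic-shift instructions on these slides (a shift right by 7 divides by 128, which matches N = 128).

```python
class ComputeBlock:
    """Toy model of one compute block with registers R0..R31."""

    def __init__(self, r2_value):
        self.r = [0] * 32
        self.r[2] = r2_value


def issue(blocks, operation, counter):
    """Issue ONE instruction that acts on every compute block listed."""
    for block in blocks:
        operation(block)
    counter["instructions"] += 1


def clear_r6(block):
    block.r[6] = 0


def add_r2_to_r6(block):
    block.r[6] = block.r[6] + block.r[2]


def ashift_r6_by_minus_7(block):
    block.r[6] = block.r[6] >> 7      # arithmetic shift right, keeps the sign


operations = [clear_r6, add_r2_to_r6, ashift_r6_by_minus_7]

# X and Y handled with separate instructions (left then right channel).
x1, y1 = ComputeBlock(1280), ComputeBlock(-2560)
separate = {"instructions": 0}
for op in operations:
    issue([x1], op, separate)
    issue([y1], op, separate)

# X and Y handled together, as with the XY prefix.
x2, y2 = ComputeBlock(1280), ComputeBlock(-2560)
dual = {"instructions": 0}
for op in operations:
    issue([x2, y2], op, dual)
```

Both runs leave the same values in R6 of each block, but the dual version issues half as many instructions.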

17
Final operation: dual subtraction
18
MIMD mode
  • Set up pointers to buffers: 2
  • Insert values into buffers: 8 (was 4)
  • SUM LOOP: 3 + N * 3 (was 4 + N * 5)
  • SHIFT LOOP: 1 (was 1 + 2 * log2N)
  • Update outgoing parameters: 6
  • Update FIFO: 14 (was 3 + 6 * N)
  • Function return: 2
  • ---------------------------
  • Total: 37 + 3 * N (was 37 + 4 * N)
  • N = 128: instructions = 421 cycles
  • 421 cycles plus about 180 delay cycles: 590 cycles
    in total
  • Now delays are about 50% of useful time
  • Was: 549 cycles plus about 50 delay cycles: 594
    cycles in total. Delays were about 10% of useful
    time
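
The totals quoted on these slides can be cross-checked with a few lines of arithmetic. The sketch below is Python, with the instruction formulas and total cycle counts taken from the slides; the dictionary layout and names are just for illustration. The delay counts it derives (total minus instructions) differ a little from the rounded delay figures quoted on the slides.

```python
def instructions(fixed, per_sample, n):
    """Instruction count for a routine costing fixed + per_sample * n."""
    return fixed + per_sample * n


N = 128
versions = {
    "hardware circular buffer": {"instr": instructions(37, 4, N), "total": 879},
    "interleaved": {"instr": instructions(37, 4, N), "total": 594},
    "X-Y compute": {"instr": instructions(37, 3, N), "total": 590},
}

for version in versions.values():
    # Delay cycles implied by the quoted totals.
    version["delay"] = version["total"] - version["instr"]
    # Delay as a fraction of the useful (instruction) cycles.
    version["delay_fraction"] = version["delay"] / version["instr"]
```

Going to both compute blocks removes 128 instructions, yet the implied delay count rises from 45 to 169 cycles, so the total barely moves (594 to 590).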

19
Why no improvement? Extra delays from where?
Back to having to wait for R2 to come in from
memory before the sum can occur
20
The code is too slow because we are not taking
advantage of the available resources
  • Bring in up to 128 bits (4 instructions) per
    cycle
  • Ability to bring in 4 32-bit values along J data
    bus (data1) and 4 along K bus (data2)
  • Perform address calculations in J and K ALUs:
    single-cycle hardware circular buffers
  • Perform math operations on both X and Y compute
    blocks
  • Background DMA activity
  • Off-load some of the processing to the second
    processor

21
Multiple data busses
  • Many issues to solve before we can bring in 8
    data values per cycle
  • Are the data values aligned so can access 4
    values at once?
  • If they are not aligned what can you do?
  • One step at a time: next lecture
  • Let us bring 1 value in along the J-data bus and
    another in along the K-data bus
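
One possible answer to the alignment question is to use narrower accesses until the address reaches an aligned boundary. The Python sketch below only illustrates that idea: it assumes a 4-word access must start at an address that is a multiple of 4 and a 2-word access at a multiple of 2, and the function name plan_accesses and the example addresses are made up.

```python
def plan_accesses(start, count):
    """Split `count` 32-bit words starting at word address `start` into the
    widest accesses allowed by the assumed alignment rule."""
    accesses = []
    address, remaining = start, count
    while remaining > 0:
        for width in (4, 2, 1):
            # Assumed rule: an access of `width` words must be width-aligned.
            if address % width == 0 and remaining >= width:
                accesses.append((address, width))
                address += width
                remaining -= width
                break
    return accesses


aligned_plan = plan_accesses(8, 8)       # buffer starts on a 4-word boundary
unaligned_plan = plan_accesses(6, 10)    # buffer starts 2 words early
odd_plan = plan_accesses(5, 4)           # buffer starts on an odd address
```

An aligned buffer needs only 4-word accesses; the unaligned ones need a short run of narrower accesses first, which is part of why alignment has to be sorted out before 8 values per cycle are possible.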

22
Exercise on handling interleaving of instructions
and X-Y compute operations
23
Tackled today
  • Review of coding a hardware circular buffer
  • Roughly understanding where pipeline delays may
    occur
  • Refactor the working code to improve the speed
    without spending any time on examining whether the
    delays are really there (works, for the moment, in
    principle)
  • Refactor the working code to perform operations
    using both X and Y ALUs: in principle, twice the
    speed