Title: Trying to avoid pipeline delays
1. Trying to avoid pipeline delays
- Interleaving two sets of operations; XY compute block
2. Tackled today
- Review of coding a hardware circular buffer
- Roughly understanding where pipeline delays may occur
- Refactor the working code to improve the speed without spending any time on examining whether the delays are really there (the "works at the moment" principle)
- Refactoring working code to perform operations using both X and Y ALUs: in principle, twice the speed
3. DCRemoval( )
(Code annotations: memory intensive; addition intensive; loops for main code; FIFO implemented as circular buffer)
- Not as complex as FIR, but many of the same requirements
- Easier to handle
- You use the same ideas in optimizing FIR over Labs 2 and 3
- Two issues: speed and accuracy. Develop suitable tests for the CPP code and check that the various assembly language versions satisfy the same tests
4. Alternative approach
- Move pointers rather than memory values
- In principle: 1 memory read, 1 memory write, pointer addition, conditional equate
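The four steps listed above can be sketched in C as a software circular buffer. This is a minimal illustration, not the lab code: the `SoftFifo` type, the `fifo_replace_oldest` name and the length of 4 are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define FIFO_LEN 4  /* hypothetical length, kept small for illustration */

typedef struct {
    int    data[FIFO_LEN];
    size_t next;        /* index of the oldest value, the slot to overwrite */
} SoftFifo;

/* Store a new sample and return the value it replaces (the oldest one).
   The data stays where it is; only the index moves. */
int fifo_replace_oldest(SoftFifo *f, int sample)
{
    assert(f != NULL && f->next < FIFO_LEN);
    int oldest = f->data[f->next];   /* 1 memory read    */
    f->data[f->next] = sample;       /* 1 memory write   */
    f->next = f->next + 1;           /* pointer addition */
    if (f->next == FIFO_LEN)         /* conditional equate: wrap to start */
        f->next = 0;
    return oldest;
}
```

Each call costs the same small number of operations whatever the FIFO length, whereas shifting every stored value costs work proportional to the length.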
5. Note: Software circular buffer is NOT necessarily more efficient than data moves
- Now spending more time on moving / checking the software circular buffer pointers than moving the data?
(Figure labels: SLOWER, FASTER)
6. Next step: Hardware circular buffer
- Do exactly the same pointer calculations as with software circular buffers, but now the calculations are done behind the scenes at high speed using specialized pointer features
- Only available with J0, J1, J2 and J3 registers (on older ADSP-21061, all pointer registers)
- Jx: the pointer register
- JBx: the BASE register, set to the start of the FIFO array
- JLx: the length register, set to the length of the FIFO array
- VERY BIG WARNING? Reset to zero. On older ADSP-21061 it was very important that the length register be reset to zero, otherwise all the other functions using this register would suddenly start using circular buffer by mistake.
- Still advisable, but special syntax is now needed to cause circular buffer operations to occur
7. Store values into hardware FIFO
- CB instruction ONLY works on POST-MODIFY operations
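To make the roles of the pointer, BASE and length registers concrete, here is a small C model of a circular-buffer post-modify access. It is a sketch under stated assumptions: the `CircRegs` struct and the `cb_read_postmodify` / `cb_write_postmodify` names are hypothetical, and the model only imitates in software what the hardware does behind the scenes.

```c
#include <assert.h>

/* Hypothetical software model of one pointer register set. */
typedef struct {
    int *j;    /* Jx:  the pointer register                     */
    int *jb;   /* JBx: base, start of the FIFO array            */
    int  jl;   /* JLx: length of the FIFO array, in words       */
} CircRegs;

/* Step the pointer by 'step' words, wrapping inside [jb, jb + jl). */
static void cb_step(CircRegs *r, int step)
{
    assert(r->jl > 0 && step > -r->jl && step < r->jl);
    int offset = (int)(r->j - r->jb) + step;
    if (offset >= r->jl) offset -= r->jl;   /* ran past the end     */
    if (offset < 0)      offset += r->jl;   /* ran before the start */
    r->j = r->jb + offset;
}

/* Post-modify read: use the current pointer first, then update it. */
int cb_read_postmodify(CircRegs *r, int step)
{
    int value = *r->j;
    cb_step(r, step);
    return value;
}

/* Post-modify write: store through the current pointer, then update it. */
void cb_write_postmodify(CircRegs *r, int value, int step)
{
    *r->j = value;
    cb_step(r, step);
}
```

In both functions the access uses the unmodified pointer and the wrap-around is applied to the update that follows, which is the post-modify order the slide refers to.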
8. Next stage in improving code speed: hardware circular buffers
- Set up pointers to buffers: 2
- Insert values into buffers: 8 (was 4)
- SUM LOOP: 3 + N * 4 (was 4 + N * 5)
- SHIFT LOOP: 1 (was 1 + 2 * log2N)
- Update outgoing parameters: 6
- Update FIFO: 14 (was 3 + 6 * N)
- Function return: 2
- ---------------------------
- Total: 37 + 4 * N (was 23 + 5 * N)
- N = 128: instructions = 549 cycles
- 549 + about 300 delay cycles = 879 cycles. Delays are now >50% of useful time
- Was: 677 + about 360 delay cycles = 1011 cycles
9. On TigerSHARC: Pipeline issue
- After you issue the command to read from memory, you must then wait for the value to come
- Problem: may be trading memory wait delays for I-ALU delays
Memory pipeline delay:
XR5 = CB [J0 += 1];;
XR4 = R4 + R5;;
XR6 = CB [J1 += 1];;
XR7 = R7 + R6;;
No memory pipeline delay:
XR5 = CB [J0 += 1];;
XR6 = CB [J1 += 1];;
XR4 = R4 + R5;;
XR7 = R7 + R6;;
10. Now perform math operation using circular buffer operation
- Note the possible memory delays
- Memory cache helps?
Wait for the read of R2, use it, then wait for the read of R3 and then use it
11. Simple interleaving of code: possible saving of memory delays
Original order: 1 2 3 4
New order: 1 3 2 4
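The reordering above can be written out in C to show that it changes only the order of the operations, not the result. This is an illustrative sketch with hypothetical function names; it assumes the four pointers do not overlap. A C compiler is free to reorder these statements itself, so the point here is the dependence pattern, not the timing.

```c
#include <assert.h>

/* Original order 1 2 3 4: each sum needs the value read just before it,
   so on a pipelined memory each add would wait for its read to arrive. */
void sum_original(const int *a, const int *b, int *sum_a, int *sum_b)
{
    int r5 = *a;             /* 1: read from the first buffer  */
    *sum_a = *sum_a + r5;    /* 2: uses r5 immediately         */
    int r6 = *b;             /* 3: read from the second buffer */
    *sum_b = *sum_b + r6;    /* 4: uses r6 immediately         */
}

/* New order 1 3 2 4: both reads are issued first, so the second read
   is in flight while the first value is still arriving. */
void sum_interleaved(const int *a, const int *b, int *sum_a, int *sum_b)
{
    int r5 = *a;             /* 1 */
    int r6 = *b;             /* 3 */
    *sum_a = *sum_a + r5;    /* 2 */
    *sum_b = *sum_b + r6;    /* 4 */
}
```

Because steps 2 and 4 touch different sums and steps 1 and 3 read different buffers, swapping the middle two steps is safe, which is what allows the same instructions to run in a different order.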
12. Interleaving of code: same instructions, different order
- Set up pointers to buffers: 2
- Insert values into buffers: 8 (was 4)
- SUM LOOP: 3 + N * 4 (was 4 + N * 5)
- SHIFT LOOP: 1 (was 1 + 2 * log2N)
- Update outgoing parameters: 6
- Update FIFO: 14 (was 3 + 6 * N)
- Function return: 2
- ---------------------------
- Total: 37 + 4 * N (was 23 + 5 * N)
- N = 128: instructions = 549 cycles
- 549 + about 50 delay cycles = 594 cycles. Delays are now 10% of useful time
- Was: 549 + about 300 delay cycles = 879 cycles. Delays were >50% of useful time
13. The code is too slow because we are not taking advantage of the available resources
- Bring in up to 128 bits (4 instructions) per cycle
- Ability to bring in 4 32-bit values along the J data bus (data1) and 4 along the K bus (data2)
- Perform address calculations in the J and K ALUs: single-cycle hardware circular buffers
- Perform math operations on both X and Y compute blocks
- Background DMA activity
- Off-load some of the processing to the second processor
14. Understanding how to use MIMD mode: process left filter in X-Compute, right in Y
- XR6 = 0;; Puts 0 into XR6 register
- YR6 = 0;; Puts 0 into YR6 register
- XYR6 = 0;; Puts 0 into XR6 and YR6 at the same time
- 1 instruction saved
15. Understanding how to use MIMD mode: process left filter in X-Compute, right in Y
- XR6 = R6 + R2;; Adds XR6 + XR2 registers
- YR6 = R6 + R2;; Adds YR6 + YR2 registers
- XYR6 = R6 + R2;; Adds XR6 + XR2, AND YR6 + YR2, at the same time
- N instructions saved
16. Understanding how to use MIMD mode: process left filter in X-Compute, right in Y
- XR6 = ASHIFT R6 BY -7;; XR6 = XR6 >> 7
- YR6 = ASHIFT R6 BY -7;; YR6 = YR6 >> 7
- XYR6 = ASHIFT R6 BY -7;; XR6 = XR6 >> 7 and YR6 = YR6 >> 7 at the same time
- 1 instruction saved
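The three XY operations shown on these slides (set to zero, add, arithmetic shift by -7) can be modelled in C by pairing the same-numbered register from each compute block. This is only a sketch: the `XYReg` type and the `xy_*` function names are hypothetical, and the model assumes values small enough that the additions do not overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: the same-numbered register in each compute block,
   for example XR6 (left channel) and YR6 (right channel). */
typedef struct {
    int32_t x;
    int32_t y;
} XYReg;

/* Models XYR6 = 0: one operation clears both blocks. */
XYReg xy_set(int32_t v)
{
    XYReg r = { v, v };
    return r;
}

/* Models XYR6 = R6 + R2: one operation adds in both blocks. */
XYReg xy_add(XYReg a, XYReg b)
{
    XYReg r = { a.x + b.x, a.y + b.y };
    return r;
}

/* Arithmetic shift right by 7 that rounds toward minus infinity and
   avoids C's implementation-defined right shift of negative values. */
static int32_t asr7(int32_t v)
{
    return (v >= 0) ? (v >> 7) : ~((~v) >> 7);
}

/* Models XYR6 = ASHIFT R6 BY -7: both blocks divide by 128. */
XYReg xy_ashift_by_minus7(XYReg a)
{
    XYReg r = { asr7(a.x), asr7(a.y) };
    return r;
}
```

A left/right sum would be cleared with `xy_set(0)`, accumulated with `xy_add`, and scaled with `xy_ashift_by_minus7`; each call stands for one instruction that would otherwise be issued twice.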
17. Final operation: dual subtraction
18. MIMD mode
- Set up pointers to buffers: 2
- Insert values into buffers: 8 (was 4)
- SUM LOOP: 3 + N * 3 (was 4 + N * 5)
- SHIFT LOOP: 1 (was 1 + 2 * log2N)
- Update outgoing parameters: 6
- Update FIFO: 14 (was 3 + 6 * N)
- Function return: 2
- ---------------------------
- Total: 37 + 3 * N (was 37 + 4 * N)
- N = 128: instructions = 421 cycles
- 421 + about 180 delay cycles = 590 cycles. Now delays are 50% of useful time
- Was: 549 + about 50 delay cycles = 594 cycles. Delays were 10% of useful time
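The instruction totals quoted on this slide (37 + 4N before using both compute blocks, 37 + 3N after) can be checked with a few lines of C. The function names are hypothetical, and the sketch counts instruction cycles only; the delay cycles are measured separately on the slide and are not modelled here.

```c
#include <assert.h>

/* Instruction-cycle estimate with one compute block: 37 + 4 * N. */
int cycles_single_block(int n)
{
    assert(n >= 0);
    return 37 + 4 * n;
}

/* Instruction-cycle estimate using X and Y together: 37 + 3 * N. */
int cycles_xy_blocks(int n)
{
    assert(n >= 0);
    return 37 + 3 * n;
}

/* Instructions saved by the dual add inside the sum loop: N. */
int cycles_saved(int n)
{
    return cycles_single_block(n) - cycles_xy_blocks(n);
}
```

For N = 128 these give 549 and 421 instruction cycles, a saving of 128, which matches the "N instructions saved" noted for the dual add.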
19. Why no improvement? Extra delays from where?
Back to having to wait for R2 to come in from memory before the sum can occur
20. The code is too slow because we are not taking advantage of the available resources
- Bring in up to 128 bits (4 instructions) per cycle
- Ability to bring in 4 32-bit values along the J data bus (data1) and 4 along the K bus (data2)
- Perform address calculations in the J and K ALUs: single-cycle hardware circular buffers
- Perform math operations on both X and Y compute blocks
- Background DMA activity
- Off-load some of the processing to the second processor
21. Multiple data busses
- Many issues to solve before we can bring in 8 data values per cycle
- Are the data values aligned so that we can access 4 values at once?
- If they are not aligned, what can you do?
- One step at a time (next lecture)
- Let us bring 1 value in along the J-data bus and another in along the K-data bus
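The alignment question can be made concrete with a small C check. This is a sketch on a byte-addressed host, not TigerSHARC code: the function names are hypothetical, and the assumption is that "aligned" means four consecutive 32-bit words starting on a 4-word boundary. Counting the words up to the next boundary is one common way to handle an unaligned start, offered here as an illustration and not as the answer the lecture goes on to give.

```c
#include <assert.h>
#include <stdint.h>

#define WORDS_PER_QUAD 4u   /* four 32-bit values fetched at once */

/* 1 if four consecutive 32-bit words starting at p begin on a
   4-word boundary, 0 otherwise. */
int is_quad_aligned(const int32_t *p)
{
    return ((uintptr_t)p % (WORDS_PER_QUAD * sizeof(int32_t))) == 0;
}

/* Number of single-word accesses needed before p reaches the next
   4-word boundary (0 if it is already aligned). */
unsigned words_until_quad_aligned(const int32_t *p)
{
    unsigned word_in_quad =
        (unsigned)(((uintptr_t)p / sizeof(int32_t)) % WORDS_PER_QUAD);
    assert(word_in_quad < WORDS_PER_QUAD);
    return (WORDS_PER_QUAD - word_in_quad) % WORDS_PER_QUAD;
}
```

A loop could process `words_until_quad_aligned(p)` values one at a time and then switch to four values per access once the pointer is aligned.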
22. Exercise on handling interleaving of instructions and X-Y compute operations
23. Tackled today
- Review of coding a hardware circular buffer
- Roughly understanding where pipeline delays may occur
- Refactor the working code to improve the speed without spending any time on examining whether the delays are really there (the "works at the moment" principle)
- Refactoring working code to perform operations using both X and Y ALUs: in principle, twice the speed