Title: Understanding the TigerSHARC ALU pipeline
1Understanding the TigerSHARC ALU pipeline
- Determining the speed of one stage of IIR filter
Part 4IIR operation with Memory
2Understanding the TigerSHARC ALU pipeline
- TigerSHARC has many pipelines
- Review of the COMPUTE pipeline works
- Interaction of memory (data) operations with
COMPUTE operations - What we want to be able to do?
- The problems we are expecting to have to solve
- Using the pipeline viewer to see what really
happens - Changing code practices to get better performance
- Specialized C compiler options and pragmas
(Will be covered by individual student
presentation) - Optimized assembly code and optimized C
3Processor Architecture
- 3 128-bitdata busses
- 2 Integer ALU
- 2 ComputationalBlocks
- ALU (Float and integer)
- SHIFTER
- MULTIPLIER
- COMMUNICATIONSCLU
4Simple ExampleIIR -- Biquad
S0 S1 S2
- For (Stages 0 to 3) Do
- S0 Xin H5 S2 H3 S1 H4
- Yout S0 H0 S1 H1 S2 H2
- S2 S1
- S1 S0
5Rewrite Tests so that IIR( ) function can take
parameters
6Rewrite the C code
I leave the old fixed values in until I can get
the code to work. Proved useful this time as the
code failed Why did it fail to return the
correct value?
7Explore design issues memory opsProbable
memory stalls expected
- XR0 0.0 // Set Fsum 0
- XR1 J1 1 // Fetch a coefficient from
memory - XFR2 R1 R4 // Multiply by Xinput (XR4)
- XFR0 R0 R2 // Add to sum
- XR3 J1 1 // Fetch a coefficient from
memory - XR5 J2 1 // Fetch a state value from
memory - XFR5 R3 R5 // Multiply coeff and state
- XFR0 R0 R5 // Perform a sum
- XR5 XR12 // Update a state variable (dummy)
- XR12 XR13 // Update a state variable (dummy)
- J3 1 XR12 // Store state variable to
memory - J3 1 XR5 // Store state variable to
memory
8Looking much better.
Use 10 nops to flush the instruction pipeline
9Pipeline performance predicted
When you start reading values from memory, 1
cycle delay for value fetched available for use
within the COMPUTE COMPUTE operations 1 cycle
delay expected if next instruction needs the
result of previous instruction When you have
adjacent memory accesses (read or write) does the
pipeline work better with J1 1 or withJ1
J4 where J4 has been set to 1? J1 1
works just fine here (no delay).Worry about J1
J4 another day
10Use C IIR code as comments
Things to think about Register name
reorganization Keep XR4 for xInput
save a cycle Put S1 and S2 into XR0 and XR1
-- chance to fetch 2 memory values in
one cycle using L Put H0 to H5 in
XR12 to XR16 -- chance to fetch 4 memory
values in one cycle using
Q followed by one normal fetch --
Problems if more than one IIR stage
then the second stage fetches are not
quad aligned There are two sets of
multiplications using S1 and S2. Can these by
done in X and Y compute blocks in one cycle?
float copyStateStartAddress stateS1
stateS2 state
copyStateStartAddress S1copyStateStartAddr
ess S2
11New assembly code step 1
Things to think about Register name
reorganization Keep XR4 for xInput
save a cycle Put S1 and S2 into XR10 and XR11
-- chance to fetch 2 memory values in
one cycle using L Put H0 to H5 in
XR12 to XR16 -- chance to fetch 4 memory
values in one cycle using
Q followed by one normal fetch --
Problems if more than one IIR stage
then the second stage fetches are not
quad aligned There are two sets of
multiplications using S1 and S2. Can these by
done in X and Y compute blocks in one cycle?
- Make copy of COMPUTE optimized codefloat
IIRASM_Memory(void) - Change the register names and make sure that it
still works
12Write new testsNOTE New register names dont
overlap with old namesMakes the name conversion
very straight forward
13Register name conversion done in steps
Setting Xin XR4and Yout XR8saves one cycle
Bulk conversionwith no error
So many errors made during bulk conversion that
went to Find/replace/ test for each register
individually
14Update tests to use IIRASM_Memory( ) version with
real memory access
15Fix bringing state variables in
QUESTION We haveXR18 J6 1
(load S1) andR19 J6 1
(load S2) Both are valid What is the
difference?
16Send state variables outGo for the gusto use
L (64-bit)
- Need to recalculate the test resultstate1 is
NOT Yout
17Redo calculation for value stored as S1
- S0 Xin 5.5 S1 H4
2 5 S2 H3 3 4 - S1 S0
- Expect stored value of 27.5
- Need to fix testof state values after function
- CHECK(state0 27.5)
18Working solution -- I
19Working Solution -- Part 2
20Working solution Part 3
I could not spot where any extra stalls would
occur because of memory pipeline reads and
writes All values were in place when
needed Need to check with pipeline viewer
21Lets look at DATA MEMORY and COMPUTE pipeline
issues -- 1
No problems here
22Lets look at DATA MEMORY and COMPUTE pipeline
issues -- 2
No problems here
23Weird stuff happening with INSTRUCTION pipeline
Only 9 instructions being fetched but we are
executing 21! Why all these instruction stalls?
24Adjust pipeline view for closer look.Adjust
dis-assembler window
25Analysis
- We are seeing the impact of the processor doing
quad-fetches of instructions (128-bits) into IAB
(instruction alignment buffer) - Once in the IAB, then the instructions
(32-bits) are issued to the various
executionunits as needed.
26Note the fetch into the next subroutine despite
return (CJMP)
27Note that processor continues to fetch the
wrong instructions
28Understanding the TigerSHARC ALU pipeline
- TigerSHARC has many pipelines
- Review of the COMPUTE pipeline works
- Interaction of memory (data) operations with
COMPUTE operations - What we want to be able to do?
- The problems we are expecting to have to solve
- Using the pipeline viewer to see what really
happens - Changing code practices to get better performance
- Specialized C compiler options and pragmas
(Will be covered by individual student
presentation) - Optimized assembly code and optimized C