Title: Understanding the TigerSHARC ALU pipeline
1. Understanding the TigerSHARC ALU pipeline
- Determining the speed of one stage of IIR filter
Part 2: Understanding the pipeline
2. Understanding the TigerSHARC ALU pipeline
- TigerSHARC has many pipelines
- If these pipelines stall then the processor speed goes down
- Need to understand how the ALU pipeline works
- Learn to use the pipeline viewer
- Understanding what the pipeline viewer tells us in detail
- Avoiding having to use the pipeline viewer
- Improving code efficiency
- Excel and Project (Gantt charts) are useful tools
3. Register File and COMPUTE Units
4. Simple Example: IIR -- Biquad
S0 S1 S2
- For (Stages 0 to 3) Do
- S0 = Xin * H5 + S2 * H3 + S1 * H4
- Yout = S0 * H0 + S1 * H1 + S2 * H2
- S2 = S1
- S1 = S0
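The biquad update above can be sketched in Python to make the data flow explicit. This is a minimal sketch, not the course's code: it assumes the operators lost from the slide text are multiplications within each value/coefficient pair and additions between pairs, and the function name `biquad_stage` and the tuple layout of `h` and `state` are hypothetical.

```python
def biquad_stage(x_in, h, state):
    """One biquad stage, following the slide's update order.

    h     -- coefficients (H0, H1, H2, H3, H4, H5)
    state -- (S1, S2) carried over from the previous sample
    Returns (Yout, new_state).
    Assumption: '*' inside each pair and '+' between pairs.
    """
    s1, s2 = state
    s0 = x_in * h[5] + s2 * h[3] + s1 * h[4]
    y_out = s0 * h[0] + s1 * h[1] + s2 * h[2]
    # S2 = S1; S1 = S0
    return y_out, (s0, s1)


if __name__ == "__main__":
    y, new_state = biquad_stage(1.0, (1.0, 2.0, 3.0, 4.0, 5.0, 6.0), (1.0, 1.0))
    print(y, new_state)
```

Each stage needs six multiplications and four additions, and every addition consumes the result of an earlier multiplication. Those dependencies are what the pipeline analysis on the later slides is about.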
5. Code returns float when using XR8 register. NOTE: NOT XFR8
6. Step 2: Using C code as comments, set up the coefficients
XFR0 = 0.0 does not exist; XR0 = 0.0 DOES EXIST. Bit patterns require integer registers. Leave what you wanted to do behind as comments.
7. Expect to take 8 cycles to execute
8. PIPELINE STAGES: See page 8-34 of Processor manual
- 10 pipeline stages, but may be completely desynchronized (happen semi-independently)
- Instruction fetch -- F1, F2, F3 and F4
- Integer ALU -- PreDecode, Decode, Integer, Access
- Compute Block -- EX1 and EX2
9. Pipeline Viewer Result
XR0 = 1.0 enters PD stage at cycle 39825, enters E2 stage at cycle 39830, and is stored into XR0 at cycle 39831 -- 7 cycles execution time
10. Pipeline Viewer Result
XR6 = 5.5 enters PD stage at cycle 39832, enters E2 stage at cycle 39837, and is stored into XR6 at cycle 39838 -- 7 cycles execution time
Each instruction takes 7 cycles, but there is one new result each cycle once the pipeline is filled.
Result: 8 cycles for 8 register transfer operations
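The difference between the 7-cycle latency and the one-result-per-cycle throughput can be sketched with a small timing model. This is an illustration only: the stage list `STAGES_FROM_PD` is an assumption built from the stage names on the pipeline-stages slide (PreDecode, Decode, Integer, Access, EX1, EX2), and the helper names are hypothetical.

```python
# Assumed stage order from PreDecode onward, one cycle per stage when unstalled.
STAGES_FROM_PD = ["PD", "D", "I", "A", "EX1", "EX2"]


def timeline(pd_cycle):
    """Cycle at which an unstalled instruction occupies each stage.

    The result is written to the register in the cycle after EX2.
    """
    t = {stage: pd_cycle + i for i, stage in enumerate(STAGES_FROM_PD)}
    t["stored"] = pd_cycle + len(STAGES_FROM_PD)
    return t


def latency(t):
    """Cycles from entering PD to the result being stored, inclusive."""
    return t["stored"] - t["PD"] + 1


if __name__ == "__main__":
    # Eight back-to-back instructions entering PD on consecutive cycles.
    for i in range(8):
        t = timeline(i)
        print(i, t["PD"], t["EX2"], t["stored"], latency(t))
```

Every instruction in the loop reports a latency of 7, yet the `stored` cycles are consecutive, so eight unstalled transfers still cost eight cycles of throughput.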
11. Doing filter operations generates different results
XR8 = XR6 enters PD at 39833, enters EX2 at 39838, stored 39839 -- 7 cycles
XFR23 = R9 * R4 enters PD at 39834, enters EX2 at 39839, stored 39840 -- 7 cycles
XFR0 = R0 + R23 enters PD at 39835, enters EX2 at 39841, stored 39842 -- 8 cycles. WHY?
FIND OUT WITH MOUSE CLICK ON S MARKER THEN CONTROL
12. Instruction 0x17e (XFR8 = R8 + R23) is STALLED (waiting) for 0x17d (XFR23 = R8 * R4) to complete
Bubble B means that the pipeline is doing nothing, meaning that the instruction shown is a place holder (garbage)
13. Information on Window Event Icons
14. Result of Analysis
- Can't use a float result immediately after the calculation
- Writing
    XFR23 = R8 * R4;;
    XFR8 = R8 + R23;;   // MUST WAIT FOR XFR23 calculation to be completed
  is the same as coding
    XFR23 = R8 * R4;;
    NOP;;               // Note DOUBLE ;; -- extra cycle because of stall
    XFR8 = R8 + R23;;
- Proof: write the code with the stalls shown in it
- Writing this way means we don't have to use the pipeline viewer all the time
- Pipeline viewer is only available with the (slow) simulator
- #define SHOW_ALU_STALL nop
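The stall rule from this analysis can be sketched as a cycle counter, so the expected cycle count can be predicted without the pipeline viewer. This is a simplified model under stated assumptions: every instruction is treated as a compute operation whose result cannot be read by the instruction immediately after it, each stall costs exactly one cycle, and the `(dest, sources)` tuple representation and the name `count_cycles` are hypothetical.

```python
def count_cycles(program):
    """Count stalls and total cycles for a straight-line program.

    program -- list of (dest, sources) tuples; a NOP is (None, ()).
    Assumed rule: reading a register written by the immediately
    preceding instruction costs one stall cycle.
    Returns (stalls, total_cycles).
    """
    stalls = 0
    prev_dest = None
    for dest, sources in program:
        if prev_dest is not None and prev_dest in sources:
            stalls += 1
        prev_dest = dest
    return stalls, len(program) + stalls


if __name__ == "__main__":
    # Register names follow the slide; the operators are assumed.
    dependent_pair = [
        ("R23", ("R8", "R4")),   # XFR23 = R8 * R4
        ("R8", ("R8", "R23")),   # XFR8 = R8 + R23, needs R23 at once
    ]
    with_explicit_nop = [
        ("R23", ("R8", "R4")),
        (None, ()),              # NOP written where the stall occurs
        ("R8", ("R8", "R23")),
    ]
    print(count_cycles(dependent_pair))
    print(count_cycles(with_explicit_nop))
```

Both versions take three cycles in this model: the stall in the first is replaced by the visible NOP in the second, which is the point of writing the stalls into the source.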
15. Code with stalls shown
- 8 code lines
- 5 expected stalls
- Expect 13 cycles to complete if theory is correct
16. Analysis approach IS correct
17. Process for coding for improved speed -- code re-organization
- Make a copy of the code so we can test iirASM( ) and iirASM_Optimized( ) to make sure we get the correct result
- Make a table of code showing ALU resource usage (paper, EXCEL, Project (Gantt chart))
- Identify data dependencies
- Make all temp operations use different registers
- Move instructions forward to fill delay slots, BUT don't break data dependencies
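The "identify data dependencies" step can be sketched as a pairwise scan over the instruction list. This is an illustrative sketch, not the tool used in the course: the `(dest, sources)` representation and the name `find_dependencies` are hypothetical, and the scan is deliberately conservative because it reports every conflicting pair without checking whether an intervening write makes the constraint redundant.

```python
def find_dependencies(program):
    """List ordering constraints between instructions.

    program -- list of (dest, sources) tuples in original order.
    Returns (earlier, later, kind) triples where kind is
      RAW: later reads what earlier writes (true dependency)
      WAR: later overwrites what earlier reads
      WAW: both write the same register
    """
    deps = []
    for j, (dest_j, srcs_j) in enumerate(program):
        for i in range(j):
            dest_i, srcs_i = program[i]
            if dest_i is not None and dest_i in srcs_j:
                deps.append((i, j, "RAW"))
            if dest_j is not None and dest_j in srcs_i:
                deps.append((i, j, "WAR"))
            if dest_i is not None and dest_i == dest_j:
                deps.append((i, j, "WAW"))
    return deps


if __name__ == "__main__":
    program = [
        ("R23", ("R8", "R4")),
        ("R8", ("R8", "R23")),
        ("R23", ("R9", "R5")),   # reuses R23 as a temporary
    ]
    for dep in find_dependencies(program):
        print(dep)
```

Only the RAW entries are real data flow. The WAR and WAW entries come from reusing the same temporary register, which is why the process asks for each temporary to get its own register before any instruction is moved.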
18. Copy and paste to make IIRASM_Optimized( )
19. Need to re-order instructions to fill delay slots with useful instructions
- After refactoring code to fill delay slots, must run tests to ensure that we still have the correct result
- Change and check
- NOT EASY
- MUST HAVE A PLAN
- I USE EXCEL
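The "change and check" step can be sketched as a comparison harness that runs the original and the re-ordered routine on the same inputs. This is a sketch under assumptions: `stage_reference` and `stage_reordered` are hypothetical Python stand-ins for the assembly routines, the arithmetic assumes multiply-within-pair and add-between-pairs, and the tolerance is there because re-ordering floating-point additions may change rounding.

```python
def stage_reference(x, s1, s2, h):
    """Stand-in for the original routine (S0 computation only)."""
    return x * h[5] + s2 * h[3] + s1 * h[4]


def stage_reordered(x, s1, s2, h):
    """Stand-in for the re-ordered routine: same products, different order."""
    a = s1 * h[4]
    b = s2 * h[3]
    c = x * h[5]
    return a + b + c


def check_same(reference, candidate, cases, tolerance=1e-6):
    """Return the input cases on which the two routines disagree."""
    failures = []
    for args in cases:
        if abs(reference(*args) - candidate(*args)) > tolerance:
            failures.append(args)
    return failures


if __name__ == "__main__":
    h = (1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
    cases = [(1.0, 1.0, 1.0, h), (0.5, -2.0, 3.0, h), (0.0, 0.0, 0.0, h)]
    print(check_same(stage_reference, stage_reordered, cases))
```

Running the check after every single move keeps a failure tied to the one change that caused it.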
20. Show resource usage and data dependencies
21. Change all temporary registers to use different register names. Then check that the code produces the correct answer
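Giving every temporary its own register can be sketched as a renaming pass. This is a minimal sketch assuming a straight-line `(dest, sources)` instruction list; the function name `rename_temporaries`, the set of temporaries, and the spare register names `R24` and `R25` are hypothetical and not taken from the course code.

```python
def rename_temporaries(program, temporaries, free_registers):
    """Give each write to a temporary register a fresh register name.

    program        -- list of (dest, sources) tuples in program order
    temporaries    -- set of register names used only as scratch
    free_registers -- unused register names, consumed in order
    Later reads are redirected to the most recent fresh name.
    Raises IndexError if the free registers run out.
    """
    free = list(free_registers)
    current = {}  # temporary name -> register currently holding its value
    renamed = []
    for dest, sources in program:
        # Sources are mapped first so "R23 = R23 + ..." reads the old value.
        new_sources = tuple(current.get(s, s) for s in sources)
        if dest in temporaries:
            current[dest] = free.pop(0)
            dest = current[dest]
        renamed.append((dest, new_sources))
    return renamed


if __name__ == "__main__":
    program = [
        ("R23", ("R8", "R4")),
        ("R0", ("R0", "R23")),
        ("R23", ("R9", "R5")),   # second, unrelated use of R23
        ("R0", ("R0", "R23")),
    ]
    for line in rename_temporaries(program, {"R23"}, ["R24", "R25"]):
        print(line)
```

After renaming, the second product no longer has to wait for the first use of the temporary to finish, so it becomes a candidate for moving earlier.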
22. Move instructions forward, without breaking data dependencies
What appears possible! DO one thing at a time and then check that the code still works
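Moving instructions forward to fill stall slots can be sketched as a greedy re-ordering. This is an illustration of the idea, not the procedure used on the slides: it assumes the one-cycle stall rule (an instruction stalls if it reads the register written by the instruction just before it), treats every RAW, WAR and WAW conflict as an ordering constraint, and uses the hypothetical name `schedule`.

```python
def schedule(program):
    """Greedily re-order instructions to avoid back-to-back dependencies.

    program -- list of (dest, sources) tuples in original order.
    Returns (order, stalls): the original indices in issue order and
    the number of stalls that could not be avoided.
    """
    n = len(program)
    # must_precede[j]: earlier instructions that have to stay before j.
    must_precede = []
    for j, (dest_j, srcs_j) in enumerate(program):
        before = set()
        for i in range(j):
            dest_i, srcs_i = program[i]
            if dest_i in srcs_j or dest_j in srcs_i or dest_i == dest_j:
                before.add(i)
        must_precede.append(before)

    done, order, stalls, prev_dest = set(), [], 0, None
    while len(order) < n:
        ready = [j for j in range(n)
                 if j not in done and must_precede[j] <= done]
        no_stall = [j for j in ready if prev_dest not in program[j][1]]
        if no_stall:
            pick = no_stall[0]   # earliest instruction that fills the slot
        else:
            pick = ready[0]      # nothing useful to move up: accept a stall
            stalls += 1
        order.append(pick)
        done.add(pick)
        prev_dest = program[pick][0]
    return order, stalls


if __name__ == "__main__":
    program = [
        ("R23", ("R8", "R4")),
        ("R8", ("R8", "R23")),   # would stall on R23
        ("R24", ("R9", "R5")),   # independent product
        ("R8", ("R8", "R24")),   # would stall on R24
    ]
    print(schedule(program))
```

In the usage example the independent product is moved up into the first stall slot, while the final instruction still stalls because it reads the register written just before it. A stall can remain after a move even though its cause has changed.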
23. Check that the code still operates: 1 cycle saved
24. Move the next multiplication up. NOTE: certain stalls remain, although the reason for the STALL changes
25. Move up the R10 and R9 assignment operations -- check
4 cycle improvement?
26. CHECK THE PIPELINE AFTER TESTING
27. Are there still more improvements possible? (I can see 4 more moves)
28. Problems with approach
- Identifying all the data dependencies
- Keeping track of how the data dependencies change as you move the code around
- Handling all of this automatically
- I started the following design tool as something that might work, but it actually turned out very useful.
M. R. Smith and J. Miller, "Microprocessor Scheduling -- the irony of using Microsoft Project", "Don't say CAN'T do it - Say Gantt it! The irony of organizing microprocessors with a big business tool", Circuit Cellar magazine, Vol. 184, pp 26 - 35, November 2005.
29. Using Microsoft Project: Step 1
30. Add dependencies and resource usage, then activate level
31. Microsoft Project as a microprocessor design tool
- Will look at this in more detail when we start using memory operations to fill the coefficient and state arrays
32. Understanding the TigerSHARC ALU pipeline
- TigerSHARC has many pipelines
- If these pipelines stall then the processor speed goes down
- Need to understand how the ALU pipeline works
- Learn to use the pipeline viewer
- Understanding what the pipeline viewer tells us in detail
- Avoiding having to use the pipeline viewer
- Improving code efficiency
- Excel and Project (Gantt charts) are useful tools