Title: General Optimization Issues
1 General Optimization Issues
- Solving the exercise issues
2 To be tackled today
- Exercise 1
- Solving the loop problem SIZE 128
- Exercise 2
- Solving the loop problem SIZE 127
- Exercise 3
- Moving from SISD to SIMD mode, SIZE 128
- Exercise 4
- Removing any expected stalls
3 Most optimized SIMD floating point (32-bit) TigerSHARC instruction
- xR3:0 = CB Q[j0 += 4]; yR3:0 = CB Q[k0 += 4];
  xyFR4 = R5 * R6; xyFR7 = R8 + R9, FR10 = R8 - R9;
- xR3:0 = CB Q[j0 += 4]: fetches 4 values on the J bus into the X compute registers XR3, XR2, XR1, XR0; increments the J register and adjusts for circular buffer operation
- yR3:0 = CB Q[k0 += 4]: fetches 4 values on the K bus into the Y compute registers YR3, YR2, YR1, YR0; increments the K register and adjusts for circular buffer operation
- xyFR4 = R5 * R6: two multiplications, XFR5 * XFR6 and YFR5 * YFR6
- xyFR7 = R8 + R9, FR10 = R8 - R9: two additions, XFR8 + XFR9 and YFR8 + YFR9, and two subtractions, XFR8 - XFR9 and YFR8 - YFR9
- The same registers must be used on either side of the + and - operators
4 Steps to optimize
- Get the algorithm to work in C
- Determine how much time is available
- If the timing is already okay, quit
- Determine the maximum number of each type of operation (add, subtract, multiply, memory fetch)
- Divide each calculated maximum by the number of available resources for that type of operation
- The largest division result is the theoretical minimum number of cycles needed for the algorithm
- If that minimum time is more than 100% of the time available, find a new algorithm
- If that minimum time is less than 40% of the time available, perhaps you can optimize the code to meet the speed requirements
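The divide-and-take-the-largest steps above can be sketched in C. This is only an illustration: the function names and the resource counts passed in are hypothetical, not figures taken from the slides or from any processor manual.

```c
#include <assert.h>

/* Ceiling division: cycles needed when `units` operations fit in one cycle. */
static unsigned ceil_div(unsigned ops, unsigned units)
{
    assert(units > 0);
    return (ops + units - 1) / units;
}

/* Theoretical minimum cycle count: the busiest resource sets the bound.
   All unit counts are caller-supplied assumptions. */
unsigned min_cycles(unsigned adds, unsigned add_units,
                    unsigned mults, unsigned mult_units,
                    unsigned fetches, unsigned fetch_units)
{
    unsigned bound = ceil_div(adds, add_units);
    unsigned mult_bound = ceil_div(mults, mult_units);
    unsigned fetch_bound = ceil_div(fetches, fetch_units);

    if (mult_bound > bound)
        bound = mult_bound;
    if (fetch_bound > bound)
        bound = fetch_bound;
    return bound;
}
```

For instance, 256 additions and 256 fetches with two hypothetical units of each give a bound of 128 cycles. That figure is then compared against the time available.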
5 Code optimization: 32-bit integers or 32-bit floats
- 2 * SIZE additions
- 2 * SIZE memory fetches
- Left fetched on the J bus and done in X-compute
- Right fetched on the K bus and done in Y-compute
- SIZE / 2 cycles in theory
6 Stage 1: Get the C code to work
7 Stage 2: Rewrite in simplest format
- Note the naming convention
- Single operation per line
- Note the other changes
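The slide's own code is not in this transcript, so the following is a minimal sketch of what a "simplest format" rewrite might look like. It assumes the kernel totals a left and a right channel buffer (an assumption, chosen to match the 2 * SIZE additions and 2 * SIZE fetches counted on slide 5). The register-flavoured variable names are a hypothetical naming convention.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel: total the left and right channel buffers. */
void sum_simple(const int *left, const int *right, size_t size,
                int *left_total, int *right_total)
{
    assert(left && right && left_total && right_total);

    int xR0_leftSum = 0;     /* accumulator intended for X-compute */
    int yR0_rightSum = 0;    /* accumulator intended for Y-compute */
    int xR1_leftValue;       /* value fetched for X-compute */
    int yR1_rightValue;      /* value fetched for Y-compute */
    size_t count = 0;

    while (count < size) {
        xR1_leftValue = left[count];                   /* fetch */
        yR1_rightValue = right[count];                 /* fetch */
        xR0_leftSum = xR0_leftSum + xR1_leftValue;     /* add   */
        yR0_rightSum = yR0_rightSum + yR1_rightValue;  /* add   */
        count = count + 1;
    }

    *left_total = xR0_leftSum;
    *right_total = yR0_rightSum;
}
```

Keeping one operation per line makes each statement a candidate for a single assembly instruction, and the name prefixes record which register each value is meant to occupy.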
8 Step 3: Unwrap the loop
- Again, note the naming convention
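As a rough illustration of unwrapping (unrolling) by two, the sketch below assumes the same hypothetical left/right totalling kernel and an even buffer size. The names are illustrative, not the slide's.

```c
#include <assert.h>
#include <stddef.h>

/* Loop body written out twice; `size` is assumed to be even. */
void sum_unrolled(const int *left, const int *right, size_t size,
                  int *left_total, int *right_total)
{
    assert(size % 2 == 0);

    int left_sum = 0;
    int right_sum = 0;

    for (size_t i = 0; i < size; i += 2) {
        /* first copy of the body */
        int left_a = left[i];
        int right_a = right[i];
        left_sum = left_sum + left_a;
        right_sum = right_sum + right_a;

        /* second copy of the body */
        int left_b = left[i + 1];
        int right_b = right[i + 1];
        left_sum = left_sum + left_b;
        right_sum = right_sum + right_b;
    }

    *left_total = left_sum;
    *right_total = right_sum;
}
```

Each pass now contains four fetches and four additions, and the loop runs size / 2 times.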
9 Step 4: Overlap the first and second parts of the loop
- Note: the C code goes no faster, but code translated from this format into parallel assembly will
- Timing: Step 1: 4 * N; Step 3: 8 * (N / 2) + 2; Step 4: 6 * (N / 2) + 2
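One way to picture the overlap, again assuming a hypothetical left/right totalling kernel and an even size: the fetches belonging to the second copy of the body are moved up beside the first copy's work. In C the statements still run one after another, so this is only a layout that makes parallel candidates visible.

```c
#include <assert.h>
#include <stddef.h>

/* Unrolled body with the two copies interleaved; `size` assumed even. */
void sum_overlapped(const int *left, const int *right, size_t size,
                    int *left_total, int *right_total)
{
    assert(size % 2 == 0);

    int left_sum = 0;
    int right_sum = 0;

    for (size_t i = 0; i < size; i += 2) {
        int left_a = left[i];              /* fetch, first copy            */
        int right_a = right[i];            /* fetch, first copy            */

        int left_b = left[i + 1];          /* fetch, second copy ...       */
        left_sum = left_sum + left_a;      /* ... next to first-copy add   */

        int right_b = right[i + 1];        /* fetch, second copy ...       */
        right_sum = right_sum + right_a;   /* ... next to first-copy add   */

        left_sum = left_sum + left_b;      /* add, second copy             */
        right_sum = right_sum + right_b;   /* add, second copy             */
    }

    *left_total = left_sum;
    *right_total = right_sum;
}
```

Each fetch/add pair placed side by side uses different values, so the pair could in principle be issued in one instruction line when translated to assembly.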
10 Step 5A: Rearrange start-up and ending code
- Software pipeline: move the first read outside the loop
- Need to add an extra read at the end of the loop
- Timing 2 (N/2 1) 6
- Need to adjust the loop start (is it done correctly, or are we one out?)
- CAUTION: NEED TO FIX
11 Step 5B: Rearrange start-up and ending code
- Can now parallel the additional adds and memory fetches
- Note: the loop is still in error
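The slide code is not available here, so the sketch below is not that code. It shows one possible arrangement of a software-pipelined loop in which the trailing fetch stays inside the buffer. It assumes the same hypothetical left/right totalling kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Software-pipelined totals: fetch runs one element ahead of the add. */
void sum_pipelined(const int *left, const int *right, size_t size,
                   int *left_total, int *right_total)
{
    assert(left_total && right_total);

    int left_sum = 0;
    int right_sum = 0;

    if (size > 0) {
        /* Start-up: the first read is moved outside the loop. */
        int left_value = left[0];
        int right_value = right[0];

        /* Steady state: add the values already held, then fetch the next.
           Starting at 1 means the last fetch reads element size - 1. */
        for (size_t i = 1; i < size; i++) {
            left_sum = left_sum + left_value;
            right_sum = right_sum + right_value;
            left_value = left[i];
            right_value = right[i];
        }

        /* Ending code: the last fetched values still need adding. */
        left_sum = left_sum + left_value;
        right_sum = right_sum + right_value;
    }

    *left_total = left_sum;
    *right_total = right_sum;
}
```

Because the add and the fetch inside the loop touch different values, they are the pair that could be placed in parallel. The start-up and ending code account for the one-element offset between them.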
12 Exercise 1: Get the loop control correct
- BUFFER_SIZE = 1
- BUFFER_SIZE = 2
- BUFFER_SIZE = 4
- BUFFER_SIZE = 5
- BUFFER_SIZE = 8
- BUFFER_SIZE = 128
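When checking loop control against these sizes, it may help to tabulate how many two-element passes a loop unwrapped by two would make and how many elements are left over. The helper below is a hypothetical aid, not part of the exercise code.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t pairs;     /* passes of a loop that handles two elements */
    size_t leftover;  /* elements left for clean-up code (0 or 1)   */
} loop_plan;

/* Split a buffer size into two-element passes plus a remainder. */
loop_plan plan_loop(size_t buffer_size)
{
    loop_plan plan;
    plan.pairs = buffer_size / 2;      /* integer division truncates */
    plan.leftover = buffer_size % 2;
    assert(plan.pairs * 2 + plan.leftover == buffer_size);
    return plan;
}
```

For the sizes listed, this gives 0 passes with 1 leftover for size 1, 2 passes with 1 leftover for size 5, and 64 passes with none left over for size 128.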
16 Unrecognized second key error
- What is it?
- How do you fix it?
18 Exercise 2: Rewrite the code when it is known that BUFFER_SIZE = 129
- SIZE = 129, but the loop only handles 128 elements, since 129 / 2 = 128 / 2 in integer division
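A minimal sketch of one way to cope with an odd size is to let the two-element loop cover the even part and handle the last element in clean-up code. It assumes a hypothetical left/right totalling kernel, and it is not the exercise's official answer.

```c
#include <assert.h>
#include <stddef.h>

/* Two-element loop plus clean-up for an odd final element. */
void sum_any_size(const int *left, const int *right, size_t size,
                  int *left_total, int *right_total)
{
    assert(left_total && right_total);

    int left_sum = 0;
    int right_sum = 0;
    size_t even_part = size - (size % 2);   /* 129 becomes 128 */

    for (size_t i = 0; i < even_part; i += 2) {
        left_sum = left_sum + left[i] + left[i + 1];
        right_sum = right_sum + right[i] + right[i + 1];
    }

    if (even_part < size) {                 /* one element remains */
        left_sum = left_sum + left[even_part];
        right_sum = right_sum + right[even_part];
    }

    *left_total = left_sum;
    *right_total = right_sum;
}
```

With size 129 the loop makes 64 passes over elements 0 to 127, and the clean-up branch adds element 128.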
21 Code to this point is SISD parallel optimization
- SISD: single instruction, single data
- Using X_compute block and J memory bus
- Next stage: SIMD, single instruction, multiple data
- Using X_compute block and J memory bus for left
- Using Y_compute block and K memory bus for right
- Will need similar but different code when you are doing FIR in Lab 3
22 Exercise 3: BUFFER_SIZE = 128. Rewrite so that X and Y ops are done together
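Plain C cannot issue one instruction to two compute blocks, but the idea can be modelled. In the sketch below a hypothetical `xy_reg` type holds an X lane and a Y lane, and each helper applies the same operation to both lanes. All names are illustrative, and the left/right totalling kernel is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Model of a register pair: same register number in X and Y compute. */
typedef struct {
    int x;   /* lane handled by X-compute (left channel)  */
    int y;   /* lane handled by Y-compute (right channel) */
} xy_reg;

/* One "instruction": fetch for both lanes (J pointer and K pointer). */
static xy_reg xy_fetch(const int *j_ptr, const int *k_ptr)
{
    xy_reg r = { *j_ptr, *k_ptr };
    return r;
}

/* One "instruction": the same add performed in both lanes. */
static xy_reg xy_add(xy_reg a, xy_reg b)
{
    xy_reg r = { a.x + b.x, a.y + b.y };
    return r;
}

void sum_simd_model(const int *left, const int *right, size_t size,
                    int *left_total, int *right_total)
{
    assert(left_total && right_total);

    xy_reg sum = { 0, 0 };
    for (size_t i = 0; i < size; i++) {
        xy_reg value = xy_fetch(&left[i], &right[i]);
        sum = xy_add(sum, value);
    }

    *left_total = sum.x;
    *right_total = sum.y;
}
```

Written this way, the loop body is two operations per element rather than four, which is the effect the exercise is after when the X and Y operations are issued together.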
24 Exercise 4: BUFFER_SIZE = 128. Rewrite so that no data dependency stalls are expected
- BUFFER_SIZE = 1, N = 2, N = 4, N = 5, N = 8, N = 128
- Leave this one for a while until we have handled multiple memory accesses, as the answer may change
25 Tackled today
- Exercise 1
- Solving the loop problem SIZE 128
- Exercise 2
- Solving the loop problem SIZE 127
- Exercise 3
- Moving from SISD to SIMD mode, SIZE 128
- Incomplete
- Exercise 4
- Removing any expected stalls: left for later