General Optimization Issues - PowerPoint PPT Presentation

About This Presentation
Title:

General Optimization Issues

Description:

General Optimization Issues M. Smith – PowerPoint PPT presentation

Slides: 30
Provided by: Micha878

Transcript and Presenter's Notes

Title: General Optimization Issues


1
General Optimization Issues
  • M. Smith

2
To be tackled today
  • Most optimized TigerSHARC instruction
  • Integer and float
  • Systematic optimization procedure
  • SISD and SIMD modes
  • Exercises

3
Most optimized SIMD floating-point (32-bit) TigerSHARC instruction
  • xR3:0 = CB Q[j0 += 4]; yR3:0 = CB Q[k0 += 4];
    xyFR4 = R5 * R6; xyFR7 = R8 + R9, FR10 = R8 - R9;;
  • xR3:0 = CB Q[j0 += 4]; /* Fetches 4 values on the J
    bus into x compute registers XR3, XR2, XR1, XR0.
    Increments the J register and adjusts for circular
    buffer operation */
  • yR3:0 = CB Q[k0 += 4]; /* Fetches 4 values on the K
    bus into y compute registers YR3, YR2, YR1, YR0.
    Increments the K register and adjusts for circular
    buffer operation */
  • xyFR4 = R5 * R6; /* Two multiplications: XFR5 * XFR6
    and YFR5 * YFR6 */
  • xyFR7 = R8 + R9, FR10 = R8 - R9; /* Two additions,
    XFR8 + XFR9 and YFR8 + YFR9, AND two subtractions,
    XFR8 - XFR9 and YFR8 - YFR9 */
  • /* The same registers must be used on either side
    of the + and - operators */

4
Most optimized SIMD integer (short) (16-bit) TigerSHARC instruction
  • xR3:0 = CB Q[j0 += 4]; yR3:0 = CB Q[k0 += 4];
    xyR7:6 = R5:4 * R3:2;
    xySR9:8 = R7:6 + R1:0, SR11:10 = R7:6 - R1:0;;
  • xR3:0 = CB Q[j0 += 4]; /* Fetches 4 values on the J
    bus into x compute registers XR3, XR2, XR1, XR0.
    Increments the J register and adjusts for circular
    buffer operation */
  • yR3:0 = CB Q[k0 += 4]; /* Fetches 4 values on the K
    bus into y compute registers YR3, YR2, YR1, YR0.
    Increments the K register and adjusts for circular
    buffer operation */
  • xyR7:6 = R5:4 * R3:2; /* Eight multiplications:
    XR5.H * XR3.H, XR5.L * XR3.L, XR4.H * XR2.H,
    XR4.L * XR2.L; ditto for YR */
  • xySR9:8 = R7:6 + R1:0, SR11:10 = R7:6 - R1:0;
    /* Eight additions ???????
    AND eight subtractions
    ????????????????? */

5
Exercise: Write out the 16 operations performed
  • xySR9:8 = R7:6 + R1:0, SR11:10 = R7:6 - R1:0;
    /* Eight additions ???????
    AND eight subtractions
    ????????????????? */
  • Now do a sideways add on xySR9:8 and get a value
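
As a starting point for the exercise, the lane-wise behaviour can be modelled in plain C. This is a sketch under assumptions: it treats a register pair such as R7:6 as four independent 16-bit lanes in one compute block, it ignores overflow handling, and the names `dual_add_sub` and `sideways_add` are hypothetical helpers, not TigerSHARC intrinsics.

```c
#include <stdint.h>

#define LANES 4  /* four 16-bit shorts in one 64-bit register pair such as R7:6 */

/* Lane-wise model of SR9:8 = R7:6 + R1:0, SR11:10 = R7:6 - R1:0 for ONE
   compute block. Overflow handling is ignored in this sketch. */
void dual_add_sub(const int16_t r76[LANES], const int16_t r10[LANES],
                  int16_t r98[LANES], int16_t r1110[LANES])
{
    for (int lane = 0; lane < LANES; ++lane) {
        r98[lane]   = (int16_t)(r76[lane] + r10[lane]);  /* addition lane */
        r1110[lane] = (int16_t)(r76[lane] - r10[lane]);  /* subtraction lane */
    }
}

/* Sideways add: collapse the four lanes of one register pair into one value. */
int32_t sideways_add(const int16_t r[LANES])
{
    int32_t total = 0;
    for (int lane = 0; lane < LANES; ++lane)
        total += r[lane];
    return total;
}
```

Running the model once for the X block and once for the Y block gives 4 additions and 4 subtractions each, which accounts for the 16 operations.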

6
Steps to optimize
  • Get the algorithm to work in C
  • Determine how much time is available
  • If the timing is already okay, quit
  • Determine the maximum number of each type of
    operation (add, subtract, multiply, memory
    fetch)
  • Divide the calculated maximum by the number of
    available resources for that type of operation
  • The largest division result is the theoretical
    minimum number of cycles needed for the algorithm
  • If that minimum time is more than 100% of the
    time available, find a new algorithm
  • If that minimum time is less than 40% of the time
    available, perhaps you can optimize the code to
    meet the speed requirements
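
The resource-bound estimate in the steps above can be sketched in C. The struct and function names (`op_budget`, `min_cycles`) are hypothetical, and the per-cycle figures passed in are whatever the reader takes from the processor documentation; nothing here is a TigerSHARC API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record: one entry per operation type (add, multiply, fetch...). */
struct op_budget {
    unsigned long count;      /* operations of this type the algorithm needs */
    unsigned long per_cycle;  /* how many of them can be issued in one cycle */
};

/* Theoretical minimum cycle count: the largest count / per_cycle ratio,
   rounded up, taken over all operation types. */
unsigned long min_cycles(const struct op_budget *ops, size_t n_types)
{
    unsigned long worst = 0;
    for (size_t i = 0; i < n_types; ++i) {
        assert(ops[i].per_cycle > 0);
        unsigned long cycles =
            (ops[i].count + ops[i].per_cycle - 1) / ops[i].per_cycle;
        if (cycles > worst)
            worst = cycles;
    }
    return worst;
}
```

For example, with 2 * SIZE additions and 2 * SIZE memory fetches, 2 of each per cycle, and SIZE = 128, `min_cycles` returns 128, i.e. SIZE cycles. That figure is then compared with the time available using the 100% and 40% thresholds.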

7
Code optimization: 32-bit integers or 32-bit floats
  • 2 * SIZE additions
  • 2 * SIZE memory fetches
  • If done correctly, can do 2 additions AND 2 memory
    fetches each cycle
  • Therefore the optimum is SIZE cycles, IFF we can find
    all the optimizations
8
Code optimization: 32-bit integers or 32-bit floats
  • 2 * SIZE additions
  • 2 * SIZE memory fetches
  • Left fetched on the J-bus and done in X-compute
  • Right fetched on the K-bus and done in Y-compute
9
16-bit integers (short int) might be okay in some
circumstances
  • 2 * SIZE additions
  • 2 * SIZE memory fetches
  • If done correctly, can do 8 short additions AND 32
    short memory fetches each cycle
  • Therefore the optimum is SIZE / 4 cycles, IFF we can
    find all the optimizations
10
FIR optimization
  • SIZE additions
  • SIZE multiplications
  • SIZE * 2 memory fetches
  • 2 additions, 2 multiplications and 8 fetches per cycle
  • Should be able to do it in SIZE / 2 cycles
11
FIR optimization
  • SIZE additions
  • SIZE multiplications
  • SIZE * 2 memory fetches
  • Fetch 2 values along the J-bus into XA and YA compute
  • Fetch 2 coefficients along the K-bus into XB and YB
    compute
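
To make the operation counts concrete, here is a minimal C sketch of one FIR output sample, written with two accumulators to mirror work split across the X and Y compute blocks. The function name `fir_sample`, the even-length restriction, and the mapping of accumulators to compute blocks are assumptions made for illustration; the deck's own FIR code is not in this transcript.

```c
#include <assert.h>
#include <stddef.h>

/* One FIR output: the dot product of `size` delayed inputs and `size`
   coefficients. That is `size` multiplications, `size` additions and
   2 * size memory fetches. `size` is assumed even in this sketch. */
float fir_sample(const float *delay, const float *coeff, size_t size)
{
    assert(size % 2 == 0);
    float acc_x = 0.0f;  /* partial sum that could live in the X compute block */
    float acc_y = 0.0f;  /* partial sum that could live in the Y compute block */
    for (size_t i = 0; i < size; i += 2) {
        acc_x += delay[i]     * coeff[i];
        acc_y += delay[i + 1] * coeff[i + 1];
    }
    return acc_x + acc_y;  /* combine the two partial sums at the end */
}
```

Each pass of the loop performs 2 multiplications and 2 additions, so `size / 2` passes are needed, in line with the SIZE / 2 cycle target.
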
12
Need a systematic approach to handling the
optimization of code
  • Get the C code to work
  • Rewrite the code in its simplest format: one operation
    per line
  • Recommended: rewrite the code using register names
  • Unwrap the loop; start with twice
  • Rewrite the second part of the loop using
    different register names; this avoids setting up
    unexpected dependencies
  • Overlap the first and second parts of the loop
  • Rearrange the start-up and ending code

13
Step 1 -- Get the C code to work
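
The slide's code listing is not part of this transcript. As a stand-in, here is a minimal C sketch of the kind of loop this step calls for, assuming the task is to sum a buffer of samples; the function name `sum_buffer` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the first step: a plain, obviously correct reference version
   that later, faster versions can be checked against. */
int sum_buffer(const int *buffer, size_t n)
{
    assert(buffer != NULL || n == 0);
    int sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += buffer[i];
    return sum;
}
```
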
15
Step 2 -- Rewrite in simplest format
  • Note the naming convention
  • Single operation per line
  • Note the other changes
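
The slide's listing is missing from this transcript, so the following is only a sketch of what "one operation per line" might look like for a buffer-summing loop. The register-style variable names (`j0_ptr`, `xr0_sum`, `xr1_value`, `lc0_count`) are illustrative assumptions, not the deck's actual naming convention.

```c
#include <assert.h>
#include <stddef.h>

/* One operation per line, with each variable named after the kind of
   register it might later occupy (names are assumptions). */
int sum_one_op_per_line(const int *buffer, size_t n)
{
    assert(buffer != NULL || n == 0);
    const int *j0_ptr = buffer;  /* would become a J-bus pointer register */
    int xr0_sum = 0;             /* running total in the X compute block */
    int xr1_value;               /* most recently fetched sample */
    size_t lc0_count = n;        /* loop counter */

    while (lc0_count > 0) {
        xr1_value = *j0_ptr;            /* memory fetch */
        j0_ptr = j0_ptr + 1;            /* pointer increment */
        xr0_sum = xr0_sum + xr1_value;  /* addition */
        lc0_count = lc0_count - 1;      /* loop control */
    }
    return xr0_sum;
}
```

Writing the loop this way makes it easy to count operations per pass and to see which lines depend on which.
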
17
Step 3 -- Unwrap the loop
  • Again, note the naming convention
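
As an illustration only (the slide's listing is not in this transcript), a buffer-summing loop unwrapped twice might look like the sketch below. The second copy of the body uses its own variable, `xr2_value`, so it does not reuse the register of the first copy. All names are hypothetical, and the sketch handles even lengths only.

```c
#include <assert.h>
#include <stddef.h>

/* Loop body duplicated once, so each pass handles two samples. */
int sum_unrolled_twice(const int *buffer, size_t n)
{
    assert(buffer != NULL || n == 0);
    assert(n % 2 == 0);          /* this sketch only handles even lengths */
    const int *j0_ptr = buffer;
    int xr0_sum = 0;
    int xr1_value;               /* used by the first part of the loop */
    int xr2_value;               /* used by the second part of the loop */
    size_t lc0_count = n / 2;    /* half as many passes */

    while (lc0_count > 0) {
        /* first part */
        xr1_value = *j0_ptr;
        j0_ptr = j0_ptr + 1;
        xr0_sum = xr0_sum + xr1_value;
        /* second part, with its own value register */
        xr2_value = *j0_ptr;
        j0_ptr = j0_ptr + 1;
        xr0_sum = xr0_sum + xr2_value;
        lc0_count = lc0_count - 1;
    }
    return xr0_sum;
}
```
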
19
Step 4 -- Overlap the first and second parts of the
loop
  • Note: the C code goes no faster, but the parallel
    assembly code translated from this format will
  • Step 1 -- 4 * N
  • Step 3 -- 8 * (N / 2) + 2
  • Step 4 -- 6 * (N / 2) + 2
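
A sketch of what overlapping the two parts might look like for a buffer-summing loop is given below. It is an assumption-laden stand-in for the missing slide listing: the two fetches are grouped, the pointer is advanced once, and each part keeps its own partial sum so the two additions do not depend on each other. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* First and second parts interleaved so that independent operations sit
   next to each other, ready to be placed in the same instruction line. */
int sum_overlapped(const int *buffer, size_t n)
{
    assert(buffer != NULL || n == 0);
    assert(n % 2 == 0);          /* this sketch only handles even lengths */
    const int *j0_ptr = buffer;
    int xr0_sum = 0;             /* partial sum for the first part */
    int xr3_sum = 0;             /* partial sum for the second part */
    int xr1_value;
    int xr2_value;
    size_t lc0_count = n / 2;

    while (lc0_count > 0) {
        xr1_value = *j0_ptr;            /* fetch for the first part */
        xr2_value = *(j0_ptr + 1);      /* fetch for the second part */
        j0_ptr = j0_ptr + 2;            /* single pointer update */
        xr0_sum = xr0_sum + xr1_value;  /* independent of the next line */
        xr3_sum = xr3_sum + xr2_value;
        lc0_count = lc0_count - 1;
    }
    return xr0_sum + xr3_sum;           /* combine the partial sums */
}
```

In C this version does the same arithmetic as before; the point of the layout is that adjacent independent lines are candidates for parallel issue in assembly.
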
21
Step 5A -- Rearrange start-up and ending code
  • Software pipeline: move the first read outside the
    loop
  • Need to add an extra read at the end of the loop
  • Timing 2 (N/2 1) 6
  • Need to adjust the loop start (Is it done correctly?
    Are we one out?)
  • CAUTION: NEED TO FIX
22
Step 5B -- Rearrange start-up and ending code
  • Can now run additional adds and memory fetches in
    parallel
  • Note: the loop is still in error
23
Exercise -- Get the loop control correct
24
Exercise 1 -- Get the loop control correct
  • BUFFER_SIZE = 1, BUFFER_SIZE = 2, BUFFER_SIZE = 4,
    BUFFER_SIZE = 5, BUFFER_SIZE = 8, BUFFER_SIZE = 128
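
One way to check an answer to this exercise is to compare a pipelined loop against a plain reference loop for each listed size. The sketch below assumes the loop sums a buffer (the slide's own listing is not in this transcript). All function names are hypothetical, and the loop is deliberately not unrolled so that odd sizes such as 5 also work.

```c
#include <assert.h>
#include <stddef.h>

/* Plain reference version used to check the pipelined one. */
int sum_reference(const int *buffer, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += buffer[i];
    return sum;
}

/* Software-pipelined stand-in: the first read happens before the loop, and
   each pass adds the previously fetched value while fetching the next one. */
int sum_pipelined(const int *buffer, size_t n)
{
    assert(buffer != NULL && n >= 1);
    const int *j0_ptr = buffer;
    int xr0_sum = 0;
    int xr1_value = *j0_ptr;      /* start-up code: first read outside loop */
    j0_ptr = j0_ptr + 1;

    size_t lc0_count = n - 1;     /* one read is already done */
    while (lc0_count > 0) {
        xr0_sum = xr0_sum + xr1_value;  /* add the previous fetch */
        xr1_value = *j0_ptr;            /* fetch the next value */
        j0_ptr = j0_ptr + 1;
        lc0_count = lc0_count - 1;
    }
    xr0_sum = xr0_sum + xr1_value;      /* ending code: drain the last value */
    return xr0_sum;
}

/* Returns 1 when both versions agree for a test buffer of length n (1..128). */
int pipelined_matches_reference(size_t n)
{
    int buffer[128];
    assert(n >= 1 && n <= 128);
    for (size_t i = 0; i < n; ++i)
        buffer[i] = (int)(i * 3 + 1);
    return sum_pipelined(buffer, n) == sum_reference(buffer, n);
}
```

Rather than adding an extra read at the end of the loop, this sketch runs n - 1 passes and drains the last value afterwards, so it never reads past the end of the buffer. That is a design choice of the sketch, not necessarily the answer intended by the exercise.
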
25
Exercise 2 -- Rewrite the code when it is known
that BUFFER_SIZE 127
  • BUFFER_SIZE = 1, N = 2, N = 4, N = 5, N = 8, N = 128
26
Code to this point is SISD parallel optimization
  • SISD: single instruction, single data
  • Using X_compute block and J memory bus
  • Next stage is SIMD: single instruction, multiple
    data
  • Using X_compute block and J memory bus for left
  • Using Y_compute block and K memory bus for right
  • Will need similar but different code when you are
    doing FIR in Lab. 3
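
The move from SISD to SIMD can be pictured in C as two independent running sums advanced in the same loop pass. This is a sketch under assumptions: the function name `sum_left_right` is hypothetical, and the mapping of `x_sum` to the X compute block and `y_sum` to the Y compute block is an analogy, not something a C compiler guarantees.

```c
#include <assert.h>
#include <stddef.h>

/* Two independent streams processed together: "left" stands in for work on
   the X compute block via the J bus, "right" for the Y block via the K bus. */
void sum_left_right(const int *left, const int *right, size_t n,
                    int *left_sum, int *right_sum)
{
    assert(left_sum != NULL && right_sum != NULL);
    int x_sum = 0;
    int y_sum = 0;
    for (size_t i = 0; i < n; ++i) {
        x_sum += left[i];    /* X-side addition on a J-bus fetch */
        y_sum += right[i];   /* Y-side addition on a K-bus fetch */
    }
    *left_sum = x_sum;
    *right_sum = y_sum;
}
```

Because the two sums never touch each other's data, the same instruction can in principle act on both at once, which is the property SIMD mode relies on.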

27
Exercise 3 -- BUFFER_SIZE 128: Rewrite so that
X and Y ops are done together
  • BUFFER_SIZE = 1, N = 2, N = 4, N = 5, N = 8, N = 128
28
Exercise 4 -- BUFFER_SIZE 128: Rewrite so that
no data dependency stalls are expected
  • BUFFER_SIZE = 1, N = 2, N = 4, N = 5, N = 8, N = 128