Title: Optimization techniques for high performance DSPs
1SE 746-NT Embedded Software Systems Development
Robert Oshana
Lecture 30
For more information, please contact:
NTU Tape Orders, NTU Media Services, (970) 495-6455
oshana_at_airmail.net
tapeorders_at_ntu.edu
2Power Optimization
3Power Agenda
- Introduction
- Silicon Story in a Nutshell
- SW Management Makes a Difference
- RTOS
- Code gen
- Measurement/Estimation/Analysis
- CCP/Emulation
- ACT (Munin, Power Profiling, Real-time PBC)
- Simulation
- Simulation/Emulation Combos
- Strategy Summary
4The Power Pyramid
[Figure: the power pyramid. Labels recoverable from the diagram: Power Components (Switching, Leakage); Circuit Design Techniques; SI Story; Function Design; Power Management; Predict; Design Decisions Happen Here; Software Management (RTOS, Voltage Scaling, Code Model); Little Work In This Area; Measurement/Estimation; Confirm Predictions; Assure; Consult; Chips; P.S.]
5Energy vs. Power
- Power = Energy / Time
- Energy relates to battery life
- How many mW over how long a time
- Power relates to heat dissipation and current draw
- Temperature change tracks average power over 1,000s to 1,000,000s of cycles
- Intel Pentium-4 showed a maximum temperature change of 1 degree/msec
- Current draw tracks average power over 100s to 10,000s of cycles
- Due to chip, board-level, and power supply capacitance
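The distinction between energy and power can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names and the units (mW, ms, mWh) are assumptions chosen for the example, not something taken from the lecture.

```c
/* Energy in millijoules used by a task that draws avg_power_mw
   milliwatts for duration_ms milliseconds (mW * s = mJ). */
double energy_mj(double avg_power_mw, double duration_ms)
{
    return avg_power_mw * duration_ms / 1000.0;
}

/* Hours a battery of capacity_mwh milliwatt-hours lasts at a
   constant average draw of avg_power_mw milliwatts. */
double battery_life_hours(double capacity_mwh, double avg_power_mw)
{
    return capacity_mwh / avg_power_mw;
}
```

For example, energy_mj(200.0, 500.0) and energy_mj(100.0, 1000.0) give the same energy: the second case draws half the power but runs twice as long, so it produces less heat yet costs the battery the same.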
6Power optimization
- Relatively little emphasis has been placed on code size optimization, and even less on power optimization
- The growing embedded computing market is very different in that it does place a budget on everything: it's all about cost
- Code size relates directly to cost, as every extra memory chip needed makes the system cost a little bit more, which can be a big deal when you sell a million units
7Power optimization
- Power also relates to cost
- If your application draws too much power, the required battery can grow to make the product expensive, unwieldy, and undesirable
- How, then, to keep the power consumption under control if you're programming in C?
- Make the application run in as few cycles as possible
8Power optimization
- Every instruction executed consumes power
- The cheapest instruction is the one you don't execute at all
- If you're meeting your real-time deadlines, you can reduce the clock speed
- Power consumed is roughly proportional to the cube of the frequency when the supply voltage is lowered along with the clock (dynamic power scales as C * V^2 * f)
- Reducing the clock speed can be a big win
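The cubic relationship follows from the usual first-order model of dynamic switching power, P ~ C * V^2 * f, under the assumption that the supply voltage can be scaled in proportion to the clock. The sketch below just evaluates that model relative to a baseline; the function name is made up for illustration, and leakage power is ignored.

```c
/* Dynamic power relative to a baseline operating point, using the
   first-order model P ~ C * V^2 * f. v_ratio and f_ratio are the new
   voltage and frequency divided by the baseline values. */
double relative_dynamic_power(double v_ratio, double f_ratio)
{
    return v_ratio * v_ratio * f_ratio;
}
```

Halving both voltage and clock, relative_dynamic_power(0.5, 0.5), gives one eighth of the baseline power. Halving only the clock, relative_dynamic_power(1.0, 0.5), gives one half, and since the task then takes twice as long, the energy per task is roughly unchanged in this model.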
9Power optimization
- Apply lessons learned for code size reduction to power optimizations
- The fewer instructions there are in your program, the less power it will use
- Fewer instructions means a smaller memory footprint
- Fewer memory chips need to be kept powered up
- Fewer fetches are made from memory, and each fetch takes power
- Your program is more likely to fit in the cache, from which fetches are lower power
10Power optimization
- Optimizing for speed and code size can go a long way toward reducing power
- There may be even larger gains to be made by exploiting hardware power reduction techniques
- Some DSPs have multiple functional units
- The hardware automatically detects whether upcoming instructions will need a particular functional unit
- It turns off power to unused units automatically (no programmer intervention)
11C code modifications
- First run the cycle profiler on your code to identify the "hot spots"
- Code often spends 90% of its time in a handful of loops
- Since this is where you're spending most of your time, it's pretty likely where you're spending the most power
- Focus your efforts on these loops, and you might not need to examine the other code
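When a cycle profiler is not at hand, a coarse stand-in is to time candidate regions with the standard C clock() function. The sketch below is not the profiler the slide refers to; time_region and workload are hypothetical names, and clock() resolution is far coarser than a cycle count.

```c
#include <time.h>

volatile long sink;   /* keeps the workload from being optimized away */

/* Stand-in workload: sums a small constant array. */
void workload(void)
{
    static const int data[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    long sum = 0;
    int i;
    for (i = 0; i < 8; i++)
        sum += data[i];
    sink = sum;
}

/* CPU seconds spent calling fn() reps times, measured with clock(). */
double time_region(void (*fn)(void), int reps)
{
    clock_t start = clock();
    int i;
    for (i = 0; i < reps; i++)
        fn();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_region(workload, 100000) for each candidate loop and comparing the results gives a rough ranking of where the time, and probably the power, is going.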
12Power optimization
- Some DSPs offer a zero-overhead loop feature (RPTB)
- You don't pay the branch latency penalty for every iteration of the loop
- Loop buffer: if your loop is small enough that it fits entirely into this cache, the DSP CPU will fetch from this very low-power, high-speed cache rather than memory
- Examine the assembly code the compiler emits to see if you have loops which are just slightly too big (the assembler can help), and tweak your source code to try to reduce the size
13Power optimization
- A loop that appears to have more instructions, but a smaller code size, than an equivalent loop can run faster because it fits in the loop cache
- Fetches from memory take power
- The compiler will try to avoid fetching the same value repeatedly, but if the code has complex pointer manipulations (particularly multiple pointers), it might not be able to prove to itself that the memory location always has the same value
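The repeated-fetch problem can be illustrated with two versions of a scaling loop. This is a generic sketch with hypothetical function names; whether a particular compiler actually reloads the value depends on its alias analysis and optimization level.

```c
/* out[] might overlap *gain as far as the compiler can tell, so it
   may have to reload *gain from memory on every iteration. */
void scale_reload(int *out, const int *in, const int *gain, int n)
{
    int i;
    for (i = 0; i < n; i++)
        out[i] = in[i] * *gain;
}

/* Copying the value into a local makes it obviously loop-invariant:
   one fetch before the loop, none inside it. */
void scale_local(int *out, const int *in, const int *gain, int n)
{
    const int g = *gain;
    int i;
    for (i = 0; i < n; i++)
        out[i] = in[i] * g;
}
```

Both functions produce the same output when the buffers do not overlap; the second simply removes the question the compiler could not answer on its own.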
14Power optimization
- Avoid using complicated pointer expressions when you can
- Arrays are preferred
- Write your algorithms in a straightforward fashion
- The more clever you are with your code, the more trouble the compiler will have with it
- Many DSP compilers are highly optimizing
- The compiler might be able to do some of those tricks itself
- Give the compiler a chance
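As an illustration of the array-versus-pointer advice, here are two equivalent dot products. The function names are invented for this sketch, and a good optimizer may well generate identical code for both; the point is only that the first form states the trip count and access pattern directly.

```c
/* Straightforward array form: the loop bound and the indexing are
   visible to the compiler at a glance. */
long dot_array(const int a[], const int b[], int n)
{
    long sum = 0;
    int i;
    for (i = 0; i < n; i++)
        sum += (long)a[i] * b[i];
    return sum;
}

/* "Clever" pointer form: same result, but the trip count is hidden
   in a pointer comparison and the accesses in post-increments. */
long dot_pointer(const int *a, const int *b, int n)
{
    long sum = 0;
    const int *end = a + n;
    while (a != end)
        sum += (long)*a++ * *b++;
    return sum;
}
```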
15Power optimization
- Use the linker command file to place critical sections of your application in on-chip (typically lower-power) memory
- Try to fit as much of the application (code and data) on-chip as possible
- Be sure to place each function in its own section so that the linker has more freedom to pack
- Power-down modes for idle loops
- These can be big power savers
- Although not directly accessible from C code, they could certainly be placed in a library
16Power optimization
- Look for the availability of power-saving library features, and use them in your C code
- Problems of speed, size, and power are not independent
- What might be an optimal solution for speed might not be the optimal power solution
- You may find yourself making contorted hand-optimizations to get the last tenth of a percent of performance from a CPU, but unwittingly sacrificing too much power to do so
17Power optimization
- idle instruction
- Scales down clock
- Turns off peripherals
- Turns off cache
- Dedicated interrupt to wake up
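A background loop built around an idle instruction might look roughly like the sketch below. Everything here is hypothetical: cpu_idle() stands in for whatever library routine or intrinsic wraps the device's idle instruction, and work_pending for a flag that an interrupt service routine would set. The stub only counts calls so that the example runs on a host.

```c
int idle_calls = 0;             /* host-side stand-in bookkeeping */
int frames_done = 0;
volatile int work_pending = 0;  /* would be set by an ISR on real hardware */

/* Hypothetical wrapper for the idle instruction (stub: counts calls). */
void cpu_idle(void)
{
    idle_calls++;
}

/* Hypothetical unit of work. */
void process_frame(void)
{
    frames_done++;
}

/* One pass of the background loop: process pending work if there is
   any, otherwise idle until the wake-up interrupt. Returns 1 if work
   was processed, 0 if the CPU idled. */
int background_step(void)
{
    if (work_pending) {
        work_pending = 0;
        process_frame();
        return 1;
    }
    cpu_idle();
    return 0;
}
```

On real hardware the window between testing the flag and entering idle has to be protected (for example by masking interrupts around the check), otherwise a wake-up that arrives in that window can be missed.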
18Power optimization
- A trick that works acceptably in one application might be counter-productive in another
- Your best bet is to be familiar with your application and with the features of your hardware
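19Example: a tweak that improves power performance
cl55 -o power.c

#include <stdio.h>
#include <string.h>   /* strlen */

char *reverse(char *str)
{
    int i;
    char *beg = str;
    char *end = str + strlen(str) - 1;

    /* strlen(str) is called in the loop condition on every iteration */
    for (i = 0; i < strlen(str)/2; i++)
    {
        char t = *end;
        *end-- = *beg;
        *beg++ = t;
    }
    return str;
}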
20Example: a tweak that improves power performance
char *reverse_rptblocal(char *str)
{
    int i;
    char *beg = str;
    char *end = str + strlen(str) - 1;
    int len = strlen(str);   /* computed once, outside the loop */

    for (i = 0; i < len/2; i++)
    {
        char t = *end;
        *end-- = *beg;
        *beg++ = t;
    }
    return str;
}
Moving the strlen() call out of the loop allows the optimizer to turn the loop into an RPTBLOCAL loop
21Summary Power Optimization with Code gen
- DO LESS
- Execute fewer instructions
- Access memory less (common sub-expression elimination, etc.)
- Many performance optimizations also save power
- USE LOWER-POWER RESOURCES
- Use the lowest-power functional unit for an operation
- Put important code on-chip
- Put important data on-chip
- Restructure code to maximize utilization of caches
- Keep the program close to the CPU (less bus and pin traffic)
- Identify critical code and data!
22Power Optimization with RTOS
- Keep as much HW as possible powered down
- Peripherals
- Memory modules
- HW teams claim that most power is consumed not by the CPU but by peripherals
- A 20% CPU power saving is not nearly as important as a 30-60% power reduction in the peripherals, which seems possible
- Use the lowest voltage and clock rates
- Use dynamic voltage scheduling (DVS)
- Must predict the MIPS requirements in the near future
- The extra computing needed must not offset the power gains
- Flat MIPS meeting all deadlines is probably best
- Peak MIPS needs force deviation from the flat MIPS model
23Dynamic Voltage Scheduling
- Dynamic Voltage Scheduling (DVS) approaches
- on-line
- off-line
- inter-task
- intra-task
- The system then adjusts the voltage so that the CPU is (ideally) never idle and all threads continue to meet real-time deadlines
- In all cases, one must predict the MIPS requirements in the immediate future. The real trick is in predicting these MIPS requirements.
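Once a MIPS prediction is available, the voltage-selection step itself can be very simple. The sketch below assumes a small table of voltage/frequency operating points and one instruction per cycle (1 MIPS per MHz); the table values, the struct, and the function name are all invented for illustration and do not describe any particular device.

```c
/* Hypothetical operating points, sorted from lowest to highest power. */
struct op_point {
    unsigned mhz;
    unsigned millivolts;
};

const struct op_point op_table[] = {
    {  50,  900 },
    { 100, 1100 },
    { 200, 1300 },
};

#define NUM_OPS ((unsigned)(sizeof op_table / sizeof op_table[0]))

/* Return the index of the slowest operating point whose clock covers
   the predicted load. If none does, return the fastest point.
   Assumes 1 MIPS per MHz. */
unsigned select_op_point(unsigned predicted_mips)
{
    unsigned i;
    for (i = 0; i < NUM_OPS; i++)
        if (op_table[i].mhz >= predicted_mips)
            return i;
    return NUM_OPS - 1;
}
```

With this table, a predicted load of 40 MIPS selects the 50 MHz point and 120 MIPS selects the 200 MHz point. The hard part, as the slide says, is producing the prediction.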
24Off-line Approach
- Compute a static schedule of voltages that may vary in real time but does not adjust to "new" algorithms being downloaded to the platform
- Can spend an arbitrary amount of time determining an "optimal" schedule
- Processor must be a closed system
- OK for many "classic" DSP applications
25On-line Approach
- Requires the RTOS to dynamically adjust to new algorithms
- Can't spend a huge amount of time computing an optimal schedule
- Simple heuristics are employed (because they must execute in real time on the target)
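One example of a heuristic cheap enough to run on the target is an exponential moving average of recently measured load. This is only a sketch of the general idea, not a method taken from the lecture; the function name and the 1/4 weighting are arbitrary choices.

```c
/* Predict the next interval's MIPS demand as an exponential moving
   average: 3/4 of the previous prediction plus 1/4 of the latest
   measurement. Integer math only, so it is cheap on a DSP. */
unsigned predict_next_mips(unsigned prev_prediction, unsigned measured_mips)
{
    return (3u * prev_prediction + measured_mips) / 4u;
}
```

A predictor like this lags behind sudden load increases, so a real scheduler would likely add headroom or a fast path that jumps to full speed when a deadline is at risk.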
26Inter-task Approach
- "Pure" RTOS solution: no change is made to algorithm code
- Voltage scheduling decisions are made only at context-switch boundaries
- RTOS is given a "black-box" characterization of the execution time requirements of each task in the system
27Intra-task Approach
- An algorithm solution
- Algorithms "call out" to the OS to inform the OS about how much more work is required before completing the current frame of data
- Works well in algorithms with large variations between worst-case and average-case execution times (e.g., MPEG4)
- Theoretically, compilers can automatically compute the points in an algorithm where it should call out to the OS: the compiler can compute worst-case execution times from various points in the call-graph
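The arithmetic behind such a call-out can be sketched as follows. The function is hypothetical: it assumes the algorithm reports its remaining worst-case cycle count and the time left in the current frame, and it returns the clock rate the OS would need to meet the deadline.

```c
#include <limits.h>

/* Clock rate in MHz needed to finish remaining_cycles of work within
   usec_to_deadline microseconds (cycles per microsecond equals MHz).
   The result is rounded up. A deadline of zero returns UINT_MAX,
   meaning "run as fast as possible". */
unsigned required_mhz(unsigned long remaining_cycles,
                      unsigned long usec_to_deadline)
{
    if (usec_to_deadline == 0)
        return UINT_MAX;
    return (unsigned)((remaining_cycles + usec_to_deadline - 1)
                      / usec_to_deadline);
}
```

For example, 1,000,000 remaining cycles with 10 ms left in the frame needs 100 MHz. If an early checkpoint shows that the frame is an easy one, the reported remaining work drops and the OS can lower the clock and voltage for the rest of the frame.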