Title: DSP Lecture 23/3-2009
1. DSP Lecture 23/3-2009
2. Next DSP lecture moved
- New date: April 6.
- Time: 15.15.
3. Agenda
- Starting CCS
- Comparing MATLAB and DSP results.
- Profiling when comparing MATLAB and DSP results.
- MATLAB <-> DSP communication.
- EDMA
- EDMA_RTDX_GPIO, QUAD_DAC_ADC.
- _empty
- State-machine using case statement.
- Data formats.
- Overlap and add.
- Stack and heap.
- Simple optimization rules.
- Cache
- Some advice.
4. Starting CCS
- CCStudio v3.3 is the code development environment.
- Use Setup CCStudio v3.3 when you need to change between targets:
  - C6713 DSK-USB
  - C6713 Device Cycle Accurate Simulator (little endian)
  - C6416 Device Cycle Accurate Simulator (little endian)
- Connect to MATLAB:
  - cc = ccsdsp
  - cc.visible(0), cc.run, cc.isrunning.
The hardware
When doing tutorial
5. Comparing MATLAB and DSP results
- Principle to test isolated functions, e.g. a decoder:
  - Generate input in MATLAB.
  - Write input to the DSP.
  - Call the DSP version of the function.
  - Read output from the DSP.
  - Call the MATLAB version of the function.
  - Compare results.
- Let's have a look at the compare_with_matlab_31 skeleton!
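In the skeleton the final comparison is done in MATLAB. Purely as an illustration of the "compare results" step, here is a minimal C sketch. The names max_abs_error and results_match are hypothetical and are not part of compare_with_matlab_31; the tolerance is something you would choose per function.

```c
#include <assert.h>
#include <stddef.h>

/* Largest absolute difference between two result vectors of length n.
 * Hypothetical helper, not part of the compare_with_matlab_31 skeleton. */
float max_abs_error(const float *dsp_out, const float *ref_out, size_t n)
{
    float worst = 0.0f;
    size_t i;

    assert(dsp_out != NULL && ref_out != NULL);
    for (i = 0; i < n; i++) {
        float err = dsp_out[i] - ref_out[i];
        if (err < 0.0f) {
            err = -err;
        }
        if (err > worst) {
            worst = err;
        }
    }
    return worst;
}

/* Returns 1 if every element agrees within tol, otherwise 0. */
int results_match(const float *dsp_out, const float *ref_out,
                  size_t n, float tol)
{
    return max_abs_error(dsp_out, ref_out, n) <= tol;
}
```

A small nonzero tolerance is usually needed, since MATLAB computes in double precision by default while the DSP code typically uses float.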
6. Test important functions by
- Copying the entire compare_with_matlab_31.pjt project.
- Replacing FuncionToBeTested with your code:
  - In the C code.
  - In the MATLAB code.
- Defining input and output data and parameters as relevant for your function.
- Changing the MATLAB code to generate relevant input data.
- This setup is sometimes called a test harness in industry.
7. MATLAB <-> DSP communication 1(3)
- Sending data between MATLAB and DSP when the DSP is not running (MATLAB code):
  - Input_obj = createobj(cc,'Input'). Input is a global in the DSP code.
  - write(Input_obj,Input) writes data.
  - Input = read(Input_obj) reads data.
8. DSP -> PC communication 2(3)
When the DSP is running (RTDX):
On the DSP side:
RTDX_write(&ctrl_chan_dsp2pc, data_to_matlab, sizeof(float)*NO_FLOATS_TO_MATLAB);
On the MATLAB side:
data_from_DSP = readmsg(cc.rtdx,'ctrl_chan_dsp2pc','single')
Recommendation: Re-use code in the _empty skeletons.
9. MATLAB <-> DSP communication 3(3)
- The PC <-> DSP interface is slow?
- Allowed cheating (if necessary):
  - Pre-read data into memory before real-time processing.
  - Read result from memory after real-time processing.
- Large memory areas are available in external memory:
  - #pragma DATA_SECTION(Data,".external_mem") // On DSP
  - short Data[1000]; // On DSP
  - write(cc,h_Data.address(1), int16(Data)) In MATLAB
- The data is not cleared when the program is reloaded.
10. Enhanced Direct Memory Access (EDMA)
Leaves the DSP free from moving data back and forth to the ADC/DAC!
11. EDMA PaRAM
12. Ping-Pong Buffering
Diagram: two EDMA reload parameter sets, hEdmaReloadXmtPing and hEdmaReloadXmtPong.
- SRC: gBufferXmtPing and gBufferXmtPong.
- DST: DXR in both.
- LINK: hEdmaReloadXmtPing and hEdmaReloadXmtPong.
Let me show you EDMA_RTDX_GPIO_empty and QUAD_DAC_ADC_empty!
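The idea behind ping-pong buffering can be sketched in plain C without any EDMA calls: while the DMA drains one buffer, the CPU fills the other, and the roles swap on every transfer-complete event. This is only a host-side model under that assumption. The PingPong type and its functions are hypothetical and do not appear in the skeletons, where the EDMA LINK fields do the switching in hardware.

```c
#include <assert.h>
#include <stddef.h>

#define BUFFSIZE 256

/* Host-side model of a transmit ping-pong pair (hypothetical type). */
typedef struct {
    short ping[BUFFSIZE];
    short pong[BUFFSIZE];
    int dma_uses_ping;      /* 1: the DMA is currently sending ping */
} PingPong;

void pingpong_init(PingPong *pp)
{
    int i;

    assert(pp != NULL);
    for (i = 0; i < BUFFSIZE; i++) {
        pp->ping[i] = 0;
        pp->pong[i] = 0;
    }
    pp->dma_uses_ping = 1;
}

/* Buffer the CPU may write now: the one the DMA is NOT using. */
short *pingpong_cpu_buffer(PingPong *pp)
{
    return pp->dma_uses_ping ? pp->pong : pp->ping;
}

/* To be called on every transfer-complete event: the DMA has moved on
 * to the other buffer, so the roles of ping and pong swap. */
void pingpong_on_transfer_complete(PingPong *pp)
{
    pp->dma_uses_ping = !pp->dma_uses_ping;
}
```

The CPU has one full buffer period to finish its work on the free buffer; if it takes longer, the DMA starts sending data that is only half updated.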
13. Skeleton programs handling EDMA and RTDX
- Single-antenna:
  - EDMA_RTDX_GPIO_31_empty (code development)
  - EDMA_RTDX_GPIO_31 (MATLAB prototype)
- Dual-antenna:
  - QUAD_ADC_DAC_31_empty (code development)
  - QUAD_ADC_DAC_31 (MATLAB prototype)
14. EDMA_RTDX_GPIO
- Let's go through EDMA_RTDX_GPIO_31_empty.
- Then go through EDMA_RTDX_GPIO_31.
  - This is the DSP <-> MATLAB interface to be used in the MATLAB prototype!
  - Note: documentation in main.c!
15. State Machine using Case Statement in appl_Process
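A state machine built on a case statement might look like the sketch below. The states (idle, sync, data) and the three input flags are invented for illustration and are not taken from the skeleton's appl_Process; the point is only the switch structure, with one case per state and an explicit default.

```c
/* Hypothetical receiver states, for illustration only. */
typedef enum { STATE_IDLE, STATE_SYNC, STATE_DATA } AppState;

/* One step of the state machine, intended to run once per buffer.
 * Returns the next state given the current state and three flags. */
AppState app_step(AppState state, int energy_detected,
                  int sync_found, int frame_done)
{
    switch (state) {
    case STATE_IDLE:
        if (energy_detected) {
            state = STATE_SYNC;     /* something arrived: look for sync */
        }
        break;
    case STATE_SYNC:
        if (sync_found) {
            state = STATE_DATA;     /* sync located: start decoding */
        } else if (!energy_detected) {
            state = STATE_IDLE;     /* signal disappeared again */
        }
        break;
    case STATE_DATA:
        if (frame_done) {
            state = STATE_IDLE;     /* frame finished: wait for the next */
        }
        break;
    default:
        state = STATE_IDLE;         /* corrupted state: recover */
        break;
    }
    return state;
}
```

Keeping the transition logic in a function that only maps (state, inputs) to the next state makes it easy to check against a MATLAB model before it is placed in real-time code.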
16. Data formats
- C types: char = 8 bits, short = 16 bits, int = 32 bits, float = 32 bits.
- Integers are signed or unsigned.
- Float: sign = 1 bit, exponent = 8 bits, fraction = 23 bits.
- In C, conversion is automatic (when pointers are not involved).
- However, note the range ...
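The range matters most when a float result is written back to a short buffer: converting a float that is outside the range of short is undefined behaviour in C, so the automatic conversion cannot be trusted there. A minimal sketch of a saturating conversion, assuming a 16-bit short as on the DSP (float_to_short_sat is a hypothetical name):

```c
#include <assert.h>
#include <limits.h>

/* Convert a float to short, clipping to the range of short instead of
 * relying on the out-of-range conversion. Hypothetical helper. */
short float_to_short_sat(float x)
{
    assert(SHRT_MAX == 32767);      /* this sketch assumes a 16-bit short */

    if (x != x) {
        return 0;                   /* NaN: pick a defined value */
    }
    if (x >= (float)SHRT_MAX) {
        return SHRT_MAX;
    }
    if (x <= (float)SHRT_MIN) {
        return SHRT_MIN;
    }
    return (short)x;                /* in range: truncates toward zero */
}
```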
17. The buffers in EDMA_RTDX_GPIO
- appl_Process(short *receive_buffer, short *transmit_buffer)
- The buffers consist of BUFFSIZE shorts (range -2^15 to 2^15-1).
- BUFFSIZE is defined in EDMA_RTDX_GPIO.h to be 256.
- The number of bytes is 2*BUFFSIZE = 512.
- In EDMA_RTDX_GPIO there are 2 channels (i.e. ADC and DAC converters) which are interleaved.
- Thus the number of 2-dimensional vector samples is BUFFSIZE/2 = 128.
- In QUAD_ADC_DAC there are 4 channels which are interleaved.
- Thus the number of 4-dimensional vector samples is BUFFSIZE/4 = 64.
- BUFFSIZE can be changed.
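To make the interleaving concrete, here is a sketch that splits a 2-channel receive buffer into two per-channel arrays. It assumes the sample order is channel 0, channel 1, channel 0, channel 1, and so on, which should be checked against the skeleton; the function deinterleave is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define BUFFSIZE 256
#define NUM_CHANNELS 2
#define SAMPLES_PER_CHANNEL (BUFFSIZE / NUM_CHANNELS)   /* 128 */

/* Split an interleaved buffer [c0 c1 c0 c1 ...] into two arrays.
 * The channel order is an assumption made for this sketch. */
void deinterleave(const short *buffer,
                  short ch0[SAMPLES_PER_CHANNEL],
                  short ch1[SAMPLES_PER_CHANNEL])
{
    int n;

    assert(buffer != NULL);
    for (n = 0; n < SAMPLES_PER_CHANNEL; n++) {
        ch0[n] = buffer[NUM_CHANNELS * n];
        ch1[n] = buffer[NUM_CHANNELS * n + 1];
    }
}
```

For QUAD_ADC_DAC the same pattern applies with a stride of 4 and 64 samples per channel.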
18. Overlap and add
- Say we want to implement a FIR filter.
- The input buffer is 128 samples.
- The filter is 10 samples.
- The filtered signal is 128+10-1 = 137 samples.
- But the output buffer is 128 samples.
- Solution: overlap and add.
- Variant 1: Save the last 9 samples. Add them to the next buffer.
- Variant 2: Overlap-and-add. See next slide.
19. Overlap and Add: With additional buffer
Diagram labels: "Move 128+9 samples", "128 samples", "128 samples", "9", "Zero these samples", "Add the new signal".
Good if the transmit signal is 128 samples and unsynchronized!
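Variant 1 can be sketched as follows: filter each 128-sample block into a 137-sample temporary, add the 9 samples saved from the previous block, output the first 128 and save the last 9. The type OlaState and the function names are hypothetical, and the direct-form convolution is written for clarity rather than speed.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_LEN  128
#define FILTER_LEN 10
#define TAIL_LEN   (FILTER_LEN - 1)     /* 9 samples carried over */

typedef struct {
    float tail[TAIL_LEN];   /* end of the previous block's convolution */
} OlaState;

void ola_init(OlaState *st)
{
    int n;

    assert(st != NULL);
    for (n = 0; n < TAIL_LEN; n++) {
        st->tail[n] = 0.0f;
    }
}

/* Filter one block with overlap-and-add (variant 1). */
void ola_fir_block(OlaState *st, const float h[FILTER_LEN],
                   const float in[BLOCK_LEN], float out[BLOCK_LEN])
{
    float full[BLOCK_LEN + TAIL_LEN];   /* 137 samples */
    int n, k;

    for (n = 0; n < BLOCK_LEN + TAIL_LEN; n++) {
        full[n] = 0.0f;
    }
    for (n = 0; n < BLOCK_LEN; n++) {   /* full convolution of the block */
        for (k = 0; k < FILTER_LEN; k++) {
            full[n + k] += in[n] * h[k];
        }
    }
    for (n = 0; n < TAIL_LEN; n++) {    /* add the saved 9 samples */
        full[n] += st->tail[n];
    }
    for (n = 0; n < BLOCK_LEN; n++) {   /* first 128 samples are output */
        out[n] = full[n];
    }
    for (n = 0; n < TAIL_LEN; n++) {    /* last 9 go to the next block */
        st->tail[n] = full[BLOCK_LEN + n];
    }
}
```

Feeding an impulse at the last sample of one block and zeros in the next shows the carry-over: the filter taps h[1] to h[9] appear at the start of the second output block.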
20. Stack and Heap
- float myfunction(short *buffer)
- {
-     float internal_buffer[1000];
- }
This data is stored on the stack. At least 4000 bytes are needed.
The stack size is set in build options. No warning is given by the compiler if the stack size is too small!!!
Allocated on the heap:
float *internal_buffer;
internal_buffer = (float *) malloc(1000*sizeof(float));
The heap size is also set in build options. Also no warning!!!
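Since the tools give no warning, the heap case can at least be guarded at run time by checking the return value of malloc. A minimal sketch (alloc_buffer is a hypothetical helper; a stack that is too small cannot be detected this way):

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate n floats on the heap and zero them.
 * Returns NULL if the heap is too small; the caller must check. */
float *alloc_buffer(size_t n)
{
    float *p;
    size_t i;

    assert(n > 0);
    p = (float *)malloc(n * sizeof(float));
    if (p == NULL) {
        return NULL;        /* heap size in build options is too small */
    }
    for (i = 0; i < n; i++) {
        p[i] = 0.0f;
    }
    return p;
}
```

In a real-time program the allocation would typically be done once during initialization, not inside the processing loop.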
21. Code Optimization
- Let me show you optimization_example.
22. Simple Optimization Rules 1(2)
- Turn optimization on. Flags: -o3, program mode compilation -pm and -op3 if possible.
- Turn debug off, i.e. do not use -g.
- Avoid function calls inside loops!
- Use of division (/) is a function call! Use _rcpsp instead. For other intrinsics, see table 8-6 in spru187n.
- Avoid math functions such as sin(x); use look-up tables instead.
- Check that all important loops are pipelined by searching for "SOFTWARE PIPELINE INFORMATION" in the generated .asm files.
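When the divisor is the same for every sample, the division can be moved out of the loop entirely, which is a portable relative of the _rcpsp advice. The sketch below uses ordinary C only, so it also runs on a PC; scale_by_reciprocal is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Divide every element of x by divisor using one division in total:
 * the reciprocal is computed once and the loop only multiplies. */
void scale_by_reciprocal(float *x, int n, float divisor)
{
    float inv;
    int i;

    assert(x != NULL);
    assert(divisor != 0.0f);

    inv = 1.0f / divisor;           /* the only division */
    for (i = 0; i < n; i++) {
        x[i] *= inv;                /* no function call inside the loop */
    }
}
```

Multiplying by a rounded reciprocal can differ from a true division in the last bit, so compare against the MATLAB reference with a tolerance.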
23. Simple Optimization Rules 2(2)
- Allocate all time-critical code and data in internal memory (in our skeletons this is the default; allocating to external memory requires a pragma statement).
- Use the touch function in an initialization routine to have the most important data structures cached in internal memory. (This function can be copied from the cache_miss_example skeleton.)
  - float ImportantData[100];
  - ...
  - touch(ImportantData,100);
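The principle of a touch function is simply to read one location in every cache line of the object, so that each line is loaded before the real-time code needs it. The sketch below is not the touch from cache_miss_example; touch_bytes is a hypothetical stand-in that takes a size in bytes and assumes 32-byte lines.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_BYTES 32

/* Read one byte from every cache line covered by the object.
 * The sum is returned so that the reads cannot be optimized away. */
unsigned touch_bytes(const void *data, size_t num_bytes)
{
    const volatile unsigned char *p = (const volatile unsigned char *)data;
    unsigned sum = 0;
    size_t i;

    assert(data != NULL);
    for (i = 0; i < num_bytes; i += CACHE_LINE_BYTES) {
        sum += p[i];
    }
    if (num_bytes > 0) {
        sum += p[num_bytes - 1];    /* covers a last line when unaligned */
    }
    return sum;
}
```

On a PC this only demonstrates the access pattern; the caching effect itself has to be measured on the DSP or in the cycle-accurate simulator.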
24. TMS320C6713 cache
25. One-way cache (L1P)
Cache line    SDRAM addresses that map to the line
Line 0        0x0000-0x001F, 0x1000-0x101F
Line 1        0x0020-0x003F, 0x1020-0x103F
...
Line 127      0x0FE0-0x0FFF, 0x1FE0-0x1FFF
26. Two-way cache (L1D)
Cache line    SDRAM addresses that map to the same set
Line 0A       0x0000-0x001F, 0x0800-0x081F
Line 1A       0x0020-0x003F, 0x0820-0x083F
...
Line 63A      0x07E0-0x07FF, 0x0FE0-0x0FFF
27. L1D cache
L1D address allocation
- A new line of 32 bytes is loaded on a read miss with a penalty of 4 clock cycles.
- If two words are loaded per clock cycle (reading sequentially from a memory segment), the overhead is (8/32)*4 = 1 clock cycle per instruction cycle.
- A write miss doesn't lead to loading of a new line. A write buffer of four words handles up to four misses without penalty.
28. cache_miss_example
- main.c: Illustrates the impact of L1D write and read misses (compulsory misses).
- main2.c: Illustrates the problem with several data objects in the same set (thrashing).
- Two data objects are in the same set if
  - Aa + K*2048 = Ab,
  - for some addresses Aa and Ab in Object A and B respectively, and for some integer K.
- Two code objects are in the same set if
  - Aa + K*4096 = Ab,
  - for some addresses Aa and Ab in Object A and B respectively, and for some integer K.
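The same-set rule can be checked numerically, for example with addresses taken from the .map file. The sketch below follows the figures used in these slides (32-byte lines, 2048 bytes per way for data, 4096 bytes for code); the function names are hypothetical, and it tests two individual addresses rather than whole objects.

```c
#include <assert.h>

#define LINE_BYTES      32ul
#define L1D_WAY_BYTES   2048ul  /* data: Aa + K*2048 = Ab */
#define L1P_CACHE_BYTES 4096ul  /* code: Aa + K*4096 = Ab */

/* Index of the cache set (line) that an address maps to. */
unsigned cache_set_index(unsigned long addr, unsigned long way_bytes)
{
    assert(way_bytes >= LINE_BYTES && way_bytes % LINE_BYTES == 0);
    return (unsigned)((addr % way_bytes) / LINE_BYTES);
}

/* Returns 1 if the two addresses compete for the same set. */
int same_set(unsigned long a, unsigned long b, unsigned long way_bytes)
{
    return cache_set_index(a, way_bytes) == cache_set_index(b, way_bytes);
}
```

For whole objects, the test has to be repeated for every 32-byte line that each object covers.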
29. What to consider when programming to make good use of the cache
- Align all data buffers on 32-byte boundaries (#pragma DATA_ALIGN).
- Avoid allocating more than two objects that map to the same set in the same algorithm.
- Avoid having two or more computationally complex algorithms that map to the same set.
- Profile the algorithms with and without cached data and program (see cache_miss_example).
- Force caching of important data and code before the real-time program starts (e.g. in appl_Init()) by reading the data (touch) and calling the functions.
- Test processing data in smaller buffers to see if performance improves.
30. Some advice 1(2)
- Start with a skeleton.
- Only insert functions which have been checked against MATLAB.
- Make one change at a time => much easier to find out what went wrong.
- Save before and after code.
- Don't use printf.
31. Some advice 2(2)
- Check that all pointers are initialized.
- If a variable is corrupted, check the .map file to see how it could be overwritten.
- Use an extern declaration both in the file where the variable is defined and where it is used.
- In real-time debugging, store results to debug globals.
- When using sqrt, log, log10, use #include <math.h>.
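Storing results to debug globals could be done as in this sketch: a small global ring buffer that the real-time code writes to and that is inspected afterwards from CCS or read from MATLAB. The names debug_log, debug_log_count and debug_store are hypothetical, and the buffer length is arbitrary.

```c
#define DEBUG_LOG_LEN 64

/* Globals, so they show up in the .map file and can be read by name.
 * volatile keeps the optimizer from removing the stores. */
volatile float debug_log[DEBUG_LOG_LEN];
volatile unsigned debug_log_count = 0;

/* Store one value; the oldest entries are overwritten when full. */
void debug_store(float value)
{
    debug_log[debug_log_count % DEBUG_LOG_LEN] = value;
    debug_log_count++;
}
```

Unlike printf, a store like this costs only a few cycles, so it does not disturb the real-time behaviour being debugged.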