Title: Real-time Implementation of G.729A Speech Codec on Media Processor Architectures
1 Real-time Implementation of G.729A Speech Codec on Media Processor Architectures
4th International Bhurban Conference on Applied Sciences and Technology
Bhurban, Pakistan. June 11-18, 2005
- M. Tahir Awan
- Faisal Abdullah
- Shahid Masud
- Nadeem A. Khan
Lahore University of Management Sciences
2 Overview of Presentation
- Speech Codecs, G.729A
- Blackfin DSP Architecture
- DSP Design Flow
- Multi-level Optimization Methodology
- Results Analysis
- Conclusion
- Q&A
3 Speech Codecs
- Applications
  - IP Telephony
  - Wireless Communications
  - Multimedia applications
- Classification
  - Waveform Coders
  - Parametric Coders
- Characterized by
  - Bandwidth
  - Algorithmic Complexity
  - Speech Quality
4 The G.729A Speech Codec
- Parametric Codec by ITU-T
- Analysis-by-Synthesis Approach
- Algebraic-CELP Algorithmic Structure
- Bit rate is 8 kbps
- Frame length is 10 ms
- High Algorithmic Complexity
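The bit rate and frame length above fix the size of each coded frame and the real-time deadline. The following is a minimal sketch of that arithmetic. It assumes the 8 kHz sampling rate that G.729 operates on, which the slide does not state, and the function names are illustrative only.

```c
#include <assert.h>

/* Bits produced per coded frame: bit rate (bit/s) times frame length (ms). */
static int bits_per_frame(int bit_rate_bps, int frame_ms)
{
    assert(bit_rate_bps > 0 && frame_ms > 0);
    return bit_rate_bps * frame_ms / 1000;
}

/* Input samples consumed per frame at the given sampling rate (Hz). */
static int samples_per_frame(int sample_rate_hz, int frame_ms)
{
    assert(sample_rate_hz > 0 && frame_ms > 0);
    return sample_rate_hz * frame_ms / 1000;
}

/* Frames that must be processed per second to keep up with real time. */
static int frames_per_second(int frame_ms)
{
    assert(frame_ms > 0);
    return 1000 / frame_ms;
}
```

With 8000 bit/s and 10 ms frames, each frame carries 80 bits and covers 80 input samples, and 100 frames must be processed every second.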
5 G.729A Block Diagram
6 Overview of Blackfin ADSP-BF533
- 16-bit Fixed-point Media processor
- VLIW Architecture
- Dual MAC and 2 parallel ALU units
- 8 Data Registers, 8 Address Registers, 4 Index Registers and 1 barrel shifter
- Both 16-bit and 32-bit Instructions
- Operating Frequency is up to 750 MHz
7 Blackfin BF533 Block Diagram
[Figure: BF533 block diagram; labeled units: Index Registers, Address Registers, Data Registers, 2 MAC Units]
8 Traditional DSP Design-flow
9 Real Time Implementation Issues
- Timing requirements for media applications are stringent
- Compiler optimization does not work well for DSP processors
- Compiler-generated code does not exploit DSP processor architectures
- Optimization results in lower power consumption
- Support for multiple channels of the application
10 Implementation Methodology
- Implementation Stages
  - Profiling
  - Project Level Optimizations
  - Function Level C Optimizations
  - Assembly Optimizations
  - Compliance Tests
11 1. Profiling
- Profiling is a performance and complexity analysis tool
- Profiling points out the cycle-intensive functions
- Porting the entire algorithm to assembly is time consuming
- Visual C++/VisualDSP++ platforms have been used for profiling
- Profiling Example
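What a profiler records per function (hit count and accumulated time) can be sketched by hand. This is an illustrative sketch only: the `ProfileEntry` type and the `profile_*` helpers are hypothetical names, it uses the standard `clock()` call on a host machine, and it is not how the Visual C or VisualDSP profilers are implemented.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical per-function profile record (not part of the codec sources). */
typedef struct {
    const char *name;
    unsigned long hits;   /* number of calls */
    clock_t ticks;        /* accumulated processor time */
} ProfileEntry;

/* Call at function entry; returns the start time stamp. */
static clock_t profile_enter(void)
{
    return clock();
}

/* Call at function exit; adds one hit and the elapsed time. */
static void profile_leave(ProfileEntry *e, clock_t start)
{
    assert(e != NULL);
    e->hits += 1;
    e->ticks += clock() - start;
}

/* Share of the total time spent in one function, in percent. */
static double profile_percent(const ProfileEntry *e, clock_t total)
{
    return total > 0 ? 100.0 * (double)e->ticks / (double)total : 0.0;
}
```

Sorting such records by accumulated time gives the ranking used to pick which functions are worth porting to assembly.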
12 2. Project Level Optimizations
- High-level optimization (C-language environment)
- Cycle-intensive parts of large functions (>100 lines) are called as separate subroutines
- Functions to be ported to assembly are placed in separate files for debugging and interfacing with C code
13 3. Function Level Optimizations
- High-level optimization (C-language environment)
- Intrinsic operations have been inlined
- Removal of overflow and saturation checks, in compliance with ITU specifications
- Replacing intrinsic operations with assembly instructions (left/right shift, round, norm)
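The effect of inlining the intrinsic operations and dropping saturation checks can be illustrated in C. The sketch below is modeled on the behaviour of the ITU-T basic operators `L_add`, `L_mult` and `L_mac`; the `_sat` and `_fast` names are illustrative and are not taken from the reference sources. The unchecked variant is only valid where the operand ranges are known to rule out overflow.

```c
#include <assert.h>
#include <stdint.h>

typedef int16_t Word16;
typedef int32_t Word32;

/* Saturating 32-bit add, computed in 64 bits so the check cannot overflow. */
static inline Word32 L_add_sat(Word32 a, Word32 b)
{
    int64_t s = (int64_t)a + (int64_t)b;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (Word32)s;
}

/* Fractional multiply (a * b) << 1; the only overflow case,
   -32768 * -32768, saturates to the largest positive value. */
static inline Word32 L_mult_sat(Word16 a, Word16 b)
{
    Word32 p = (Word32)a * (Word32)b;
    return (p == 0x40000000) ? INT32_MAX : p * 2;
}

/* Multiply-accumulate with saturation on both the product and the sum. */
static inline Word32 L_mac_sat(Word32 acc, Word16 a, Word16 b)
{
    return L_add_sat(acc, L_mult_sat(a, b));
}

/* Unchecked multiply-accumulate: no saturation logic at all.
   The caller must guarantee that the result stays within 32 bits. */
static inline Word32 L_mac_fast(Word32 acc, Word16 a, Word16 b)
{
    assert(!(a == INT16_MIN && b == INT16_MIN));
    return acc + (Word32)a * (Word32)b * 2;
}
```

Declaring the operators `static inline` removes the function-call overhead, and the unchecked form reduces to a plain multiply-accumulate that the processor's MAC unit can execute directly.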
14 4. Assembly Optimizations
- Low-level optimization (assembly-language environment)
- These optimizations include
  - Register Scheduling
  - Loop Optimizations
  - Software Pipelining
  - Memory Access Management
15 Loop Optimizations
- Loops are critical parts of DSP algorithms
- Loops occur in nested structures
- Use of zero-overhead loop instructions
- Loop unrolling helps while porting code to assembly
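Loop unrolling can be shown on a plain multiply-accumulate loop. The sketch below is a generic C illustration with made-up function names, not code from the codec: unrolling by two with independent accumulators exposes two multiply-accumulates per pass, which is the shape a dual-MAC processor can execute in parallel.

```c
#include <assert.h>
#include <stdint.h>

/* Straightforward multiply-accumulate loop: one product per pass. */
static int32_t dot_simple(const int16_t *a, const int16_t *x, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)a[i] * x[i];
    return acc;
}

/* Same sum, unrolled by two with independent accumulators, so the two
   multiply-accumulates in each pass do not depend on each other. */
static int32_t dot_unrolled2(const int16_t *a, const int16_t *x, int n)
{
    assert(n % 2 == 0);
    int32_t acc0 = 0, acc1 = 0;
    for (int i = 0; i < n; i += 2) {
        acc0 += (int32_t)a[i] * x[i];
        acc1 += (int32_t)a[i + 1] * x[i + 1];
    }
    return acc0 + acc1;
}
```

The unrolled version halves the number of loop passes; on hardware with zero-overhead loops the remaining loop control costs no extra cycles.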
16 Software Pipelining
- Scheduling technique that restructures the code
- Keeps a check on processor resources
- For the BF533, a software pipeline consists of 2 arithmetic instructions and 2 load/store instructions
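The restructuring that software pipelining performs can be sketched in C: the loads for the next pass are issued alongside the arithmetic of the current pass, with a prologue to fill the pipeline and an epilogue to drain it. This is a generic single-MAC illustration with a hypothetical function name; it does not reproduce the actual BF533 schedule.

```c
#include <assert.h>
#include <stdint.h>

/* Software-pipelined dot product: operands for pass i+1 are loaded while
   the multiply-accumulate for pass i is computed. */
static int32_t dot_pipelined(const int16_t *a, const int16_t *x, int n)
{
    assert(n >= 1);
    int32_t acc = 0;

    /* Prologue: fetch the first operand pair. */
    int16_t a_cur = a[0];
    int16_t x_cur = x[0];

    /* Kernel: two loads for the next pair, one MAC on the current pair. */
    for (int i = 1; i < n; i++) {
        int16_t a_next = a[i];
        int16_t x_next = x[i];
        acc += (int32_t)a_cur * x_cur;
        a_cur = a_next;
        x_cur = x_next;
    }

    /* Epilogue: consume the last pair; nothing is left to load. */
    acc += (int32_t)a_cur * x_cur;
    return acc;
}
```

In the kernel, the loads and the multiply-accumulate touch different variables, so a processor with parallel load and arithmetic slots can issue them together.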
17 Code Example
C-Code (cycle count: 40,117)

    void Residu(
        Word16 a[],   /* (i) prediction coeffs */
        Word16 x[],   /* (i) speech vector x   */
        Word16 y[],   /* (o) residual signal   */
        Word16 lg     /* (i) size of filtering */
    )
    {
        Word16 i, j;
        Word32 s;
        for (i = 0; i < lg; i++)            /* outer loop, lg = 40 */
        {
            s = L_mult(x[i], a[0]);
            for (j = 1; j <= M; j++)        /* inner loop */
                s = L_mac(s, a[j], x[i-j]);
            s = L_shl(s, 3);
            y[i] = round(s);
        }
    }

Blackfin Assembly (cycle count: 702)

    .section G729a_Residu;
    .global _residu_asm;
    _residu_asm:
        P1 = R0;                        // P1 --> a
        P2 = R1;                        // P2 --> x
        P3 = R2;                        // P3 --> y
        R7 = [SP+0x30];                 // R7 = lg
        R0 = W[P1](X);                  // R0 = a0 (16-bit load)
        R7 >>= 1;                       // R7 = 20
        LC0 = R7;
        LSETUP (begin_loop_residu, end_loop_residu) LC0;    // outer loop
    begin_loop_residu:
        I3 = P2;                        // I3 --> x-j
        A0 = 0 || R6 = W[P1](X) || R5.l = W[I3];
        A1 = 0 || R5.h = W[I3] || R3 = W[P2++](X);
        A0 += R6.l*R5.l, A1 += R6.l*R5.h || R6 = W[P1+2](X)  || R5.l = W[I3--];
        A0 += R6.l*R5.h, A1 += R6.l*R5.l || R6 = W[P1+4](X)  || R5.h = W[I3--];
        A0 += R6.l*R5.l, A1 += R6.l*R5.h || R6 = W[P1+6](X)  || R5.l = W[I3--];
        A0 += R6.l*R5.h, A1 += R6.l*R5.l || R6 = W[P1+8](X)  || R5.h = W[I3--];
        A0 += R6.l*R5.l, A1 += R6.l*R5.h || R6 = W[P1+10](X) || R5.l = W[I3--];
        A0 += R6.l*R5.h, A1 += R6.l*R5.l || R6 = W[P1+12](X) || R5.h = W[I3--];
        A0 += R6.l*R5.l, A1 += R6.l*R5.h || R6 = W[P1+14](X) || R5.l = W[I3--];
        A0 += R6.l*R5.h, A1 += R6.l*R5.l || R6 = W[P1+16](X) || R5.h = W[I3--];
        A0 += R6.l*R5.l, A1 += R6.l*R5.h || R6 = W[P1+18](X) || R5.h = W[I3--];
        A0 += R6.l*R5.l, A1 += R6.l*R5.h || R6 = W[P1+20](X) || R5.h = W[I3--];
        A0 += R6.l*R0.l, A1 += R6.l*R0.h;
        R4 = A0, R5 = A1 || R3 = W[P2++](X);
        R4 = R4 << 3 (S);
        R4.l = R4 (RND);
        R5 = R5 << 3 (S);
        W[P3++] = R4;
        R5.l = R5 (RND);
    end_loop_residu:
        W[P3++] = R5;                   // store y[i+1]
        RTS;
    _residu_asm.end:
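The assembly version halves the loop count (lg/2 = 20 passes) by producing two output samples per pass with two accumulators. That structure can be modeled in portable C as a rough sketch. The assumptions are: M = 10 (the G.729 LPC order), the helper names `sat32`, `mac` and `round16` are illustrative stand-ins for the ITU-T basic operators with saturation modeled in 64-bit arithmetic, and `x` points into a buffer with M valid history samples before `x[0]`.

```c
#include <assert.h>
#include <stdint.h>

#define M 10   /* LPC order in G.729 */

/* Saturate a 64-bit value to the 32-bit range. */
static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* acc + 2*a*b with saturation, in the manner of L_mac. */
static int32_t mac(int32_t acc, int16_t a, int16_t b)
{
    return sat32((int64_t)acc + sat32(2 * (int64_t)a * b));
}

/* Add 0x8000 with saturation and keep the upper 16 bits. */
static int16_t round16(int32_t v)
{
    int32_t r = sat32((int64_t)v + 0x8000);
    return (int16_t)((r - (r & 0xFFFF)) / 65536);
}

/* Reference form: one output sample per outer-loop pass. */
static void residu_ref(const int16_t a[], const int16_t x[], int16_t y[], int lg)
{
    for (int i = 0; i < lg; i++) {
        int32_t s = 0;
        for (int j = 0; j <= M; j++)
            s = mac(s, a[j], x[i - j]);
        y[i] = round16(sat32((int64_t)s * 8));   /* shift left by 3 */
    }
}

/* Two output samples per pass: each coefficient a[j] is fetched once
   and feeds both accumulators. */
static void residu_dual(const int16_t a[], const int16_t x[], int16_t y[], int lg)
{
    assert(lg % 2 == 0);
    for (int i = 0; i < lg; i += 2) {
        int32_t s0 = 0, s1 = 0;
        for (int j = 0; j <= M; j++) {
            s0 = mac(s0, a[j], x[i - j]);
            s1 = mac(s1, a[j], x[i + 1 - j]);
        }
        y[i]     = round16(sat32((int64_t)s0 * 8));
        y[i + 1] = round16(sat32((int64_t)s1 * 8));
    }
}
```

Both functions produce identical output; the dual form only changes the order of work so that the coefficient loads are shared between two accumulators.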
18
19 MCPS Count for Multiple Stages
Development Stage | Processor Load for BF533 (MCPS) | Reduction in Load (MCPS)
Reference C Code | 550 | ---
C Level Optimizations | 358 | 192
Intrinsic Functions Optimizations | 278 | 80
Assembly Optimizations (4 Functions) | 137 | 141
Assembly Optimizations (10 Functions) | 30 | 107
20 Functions Ported to Assembly
Function | Calls per Frame | Cycles per Call (C Code) | Cycles per Call (Assembly) | Gain per Call (cycles) | Gain per Frame (cycles) | Net Gain (cumulative, cycles)
Syn_filt() | 14 | 39332 | 526 | 38806 | 543284 | 543284
Residu() | 4 | 40117 | 702 | 39415 | 157660 | 700944
Cor_h_x() | 4 | 72175 | 623 | 71552 | 286208 | 987152
Autocorr() | 1 | 235934 | 6233 | 229701 | 229701 | 1216853
Pred_lt_3() | 8 | 66659 | 630 | 66029 | 528232 | 1745085
D4i40_17_fast() | 2 | 194902 | 9953 | 184949 | 369898 | 2114983
Pitch_ol_fast() | 1 | 326497 | 7145 | 319352 | 319352 | 2434335
Chebps() | 73 | 2212 | 97 | 2115 | 154395 | 2588730
Qua_gain() | 2 | 54601 | 2145 | 52456 | 104912 | 2693642
test_err() | 2 | 260 | 49 | 211 | 422 | 2694064
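The derived columns of the table follow from the measured ones: gain per call is the C cycle count minus the assembly cycle count, gain per frame multiplies that by the calls per frame, and net gain is the running total. A short sketch that recomputes them from the table data is given below; the struct and function names are illustrative.

```c
#include <assert.h>

/* One table row: calls per frame and measured cycles per call. */
typedef struct {
    const char *name;
    long calls_per_frame;
    long c_cycles;
    long asm_cycles;
} FuncRow;

static long gain_per_call(const FuncRow *r)
{
    return r->c_cycles - r->asm_cycles;
}

static long gain_per_frame(const FuncRow *r)
{
    return r->calls_per_frame * gain_per_call(r);
}

/* Cumulative gain over the first n rows (the net gain column). */
static long net_gain(const FuncRow *rows, int n)
{
    assert(rows != 0 && n >= 0);
    long total = 0;
    for (int i = 0; i < n; i++)
        total += gain_per_frame(&rows[i]);
    return total;
}

/* Measured values copied from the table. */
static const FuncRow g729a_rows[] = {
    {"Syn_filt",      14,  39332,  526},
    {"Residu",         4,  40117,  702},
    {"Cor_h_x",        4,  72175,  623},
    {"Autocorr",       1, 235934, 6233},
    {"Pred_lt_3",      8,  66659,  630},
    {"D4i40_17_fast",  2, 194902, 9953},
    {"Pitch_ol_fast",  1, 326497, 7145},
    {"Chebps",        73,   2212,   97},
    {"Qua_gain",       2,  54601, 2145},
    {"test_err",       2,    260,   49},
};
```

Calling `net_gain(g729a_rows, 10)` reproduces the final net gain of 2,694,064 cycles.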
21 Comparison of Optimization Stages
22 Final Implementation Statistics
- Original codec runs at 550 MCPS
- Optimized codec runs at 30 MCPS
- 25 full-duplex channels can be supported on a 750 MHz BF533 processor
- Implementation is compliant with ITU specifications
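The channel count follows from dividing the clock budget by the per-channel load. A minimal sketch of that arithmetic, with illustrative function names, and assuming the whole clock budget is available to the codec:

```c
#include <assert.h>

/* Whole number of codec channels that fit in the clock budget. */
static int max_channels(int cpu_mhz, int mcps_per_channel)
{
    assert(cpu_mhz > 0 && mcps_per_channel > 0);
    return cpu_mhz / mcps_per_channel;
}

/* Ratio of the processor load before and after optimization. */
static double speedup(double mcps_before, double mcps_after)
{
    assert(mcps_after > 0.0);
    return mcps_before / mcps_after;
}
```

Here `max_channels(750, 30)` gives 25, and `speedup(550, 30)` is about 18.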
23 Conclusion
- Optimization of DSP algorithms is necessary to meet MIPS requirements
- Both high-level and low-level optimizations play an important role in reducing MIPS
- Understanding the underlying hardware architecture is important for maximum performance
- A MIPS reduction of 10 times or more can be achieved for fixed-point algorithms (here, from 550 to 30 MCPS)
25
26 Profiling Example (G.729A Encoder)
Profile: Function timing, sorted by time
Date: Wed Oct 27 10:34:13 2004

Program Statistics
------------------
    Command line at 2004 Oct 27 10:32: "G729aWS" test.inp test.outp
    Total time: 7768.530 millisecond
    Time outside of functions: 0.262 millisecond
    Call depth: 10
    Total functions: 126

Module Statistics for g729aws.exe
---------------------------------
    Time in module: 7768.268 millisecond
    Percent of time in module: 100.0%
    Functions in module: 126

        Func            Func+Child        Hit
        Time    %         Time     %     Count  Function
---------------------------------------------------------
     191.747   2.5     860.104  11.1      1002  _Pred_lt_3 (pred_lt3.obj)