Real-time Implementation of G.729A Speech Codec on Media Processor Architectures

1
Real-time Implementation of G.729A Speech Codec on Media Processor Architectures
4th International Bhurban Conference on Applied Sciences and Technology, Bhurban, Pakistan, June 11-18, 2005

M. Tahir Awan
Faisal Abdullah
Shahid Masud
Nadeem A. Khan

Lahore University of Management Sciences
2
Overview of Presentation

Speech Codecs, G.729A
Blackfin DSP Architecture
DSP Design Flow
Multi-level Optimization Methodology
Results Analysis
Conclusion
Q & A

3
Speech Codecs

Applications
IP Telephony
Wireless Communications
Multimedia applications
Classification
Waveform Coders
Parametric Coders
Characterized by
Bandwidth
Algorithmic Complexity
Speech Quality

4
The G.729A Speech Codec

Parametric Codec by ITU-T
Analysis by Synthesis Approach
Algebraic-CELP Algorithmic Structure
Coded bit rate is 8 kbit/s
Frame length is 10 ms
High Algorithmic Complexity
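
The bit rate and frame length above fix the basic frame arithmetic. The sketch below works it out, assuming the 8 kHz sampling rate of narrowband speech and a 16-bit linear PCM input; the constant and function names are illustrative and do not come from the ITU-T reference code.

```c
/* Framing constants: bit rate and frame length are from the slide;
 * the 8 kHz sampling rate and 16-bit PCM input are assumptions. */
enum {
    SAMPLE_RATE_HZ = 8000,  /* assumed narrowband sampling rate */
    FRAME_MS       = 10,    /* frame length from the slide */
    BIT_RATE_BPS   = 8000,  /* coded bit rate from the slide */
    PCM_BITS       = 16     /* assumed linear PCM sample width */
};

/* Samples in one frame: 8000 * 10 / 1000 */
static int samples_per_frame(void)
{
    return SAMPLE_RATE_HZ * FRAME_MS / 1000;
}

/* Coded bits in one frame: 8000 * 10 / 1000 */
static int bits_per_frame(void)
{
    return BIT_RATE_BPS * FRAME_MS / 1000;
}

/* Frames the codec must process every second: 1000 / 10 */
static int frames_per_second(void)
{
    return 1000 / FRAME_MS;
}

/* Compression ratio against uncompressed PCM: (8000 * 16) / 8000 */
static int compression_ratio(void)
{
    return SAMPLE_RATE_HZ * PCM_BITS / BIT_RATE_BPS;
}
```

Under these assumptions a frame holds 80 samples and 80 coded bits, and the processor has to finish 100 frames per second per channel.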

5
G.729A Block Diagram
6
Overview of Blackfin ADSP-BF533

16-bit Fixed-point Media processor
VLIW Architecture
Dual MAC and 2 parallel ALU units
8 Data Registers, 8 Address Registers, 4 Index Registers and 1 barrel shifter
Both 16-bit and 32-bit Instructions
Operating Frequency is up to 750 MHz

7
Blackfin BF533 Block Diagram
Index Registers
Address Registers
Data Registers
2 MAC Units
8
Traditional DSP Design-flow
9
Real Time Implementation Issues

Timing requirements for media applications are stringent
Compiler optimization alone does not work well for DSP processors
Compiler-generated code does not fully exploit DSP processor architectures
Optimization results in lower power consumption
Optimization makes room for multiple channels of the application

10
Implementation Methodology

Implementation Stages
Profiling
Project Level Optimizations
Function Level C Optimizations
Assembly Optimizations
Compliance Tests

11
1. Profiling

Profiling is a performance and complexity analysis tool
Profiling points out the cycle-intensive functions
Porting the entire algorithm to assembly is time consuming, so only the most cycle-intensive functions are ported
Visual C and VisualDSP++ platforms have been used for profiling
Profiling Example
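
A minimal sketch of what a function-level profiler records, namely a hit count and accumulated time per function. It uses the standard C `clock()` call on a host machine; the `ProfEntry` type, the helper names and the `work()` placeholder are hypothetical and are not part of Visual C or VisualDSP++.

```c
#include <time.h>

/* Hypothetical per-function profile record. */
typedef struct {
    const char   *name;   /* function name for the report */
    unsigned long hits;   /* number of calls */
    clock_t       ticks;  /* accumulated clock() ticks inside the function */
} ProfEntry;

static void prof_begin(clock_t *t0)
{
    *t0 = clock();
}

static void prof_end(ProfEntry *e, clock_t t0)
{
    e->ticks += clock() - t0;
    e->hits++;
}

static ProfEntry prof_work = { "work", 0, 0 };

/* Placeholder workload standing in for a codec function:
 * returns the sum of i*i for i = 0 .. n-1. */
static long work(int n)
{
    clock_t t0;
    long acc = 0;
    int i;

    prof_begin(&t0);
    for (i = 0; i < n; i++)
        acc += (long)i * i;
    prof_end(&prof_work, t0);
    return acc;
}

/* Share of the total run time spent in one function, in percent. */
static double prof_percent(const ProfEntry *e, clock_t total)
{
    return total > 0 ? 100.0 * (double)e->ticks / (double)total : 0.0;
}
```

Sorting such records by accumulated time gives the same kind of ranking that the profiler report uses to pick candidates for assembly porting.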

12
2. Project Level Optimizations

High-level optimization (C-language environment)
Cycle-intensive parts of large functions (> 100 lines) are split out and called as separate subroutines
Functions to be ported to assembly are placed in separate files for debugging and interfacing with the C code

13
3. Function Level Optimizations

High-level optimization (C-language environment)
Intrinsic operations have been inlined
Overflow and saturation checks are removed where compliance with the ITU-T specification is maintained
Intrinsic operations are replaced with assembly instructions (left/right shift, round, norm)
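
To illustrate what the intrinsic (basic) operations do, here is a simplified C sketch of saturating fractional arithmetic in the style of the ITU-T basic operators. It is an approximation for illustration: the reference operators also maintain a global overflow flag, which is omitted here, and `round_q31` is a name chosen for this sketch to avoid clashing with `round` from `math.h`.

```c
#include <stdint.h>

typedef int16_t Word16;
typedef int32_t Word32;

/* Saturating 32-bit addition. */
static inline Word32 L_add(Word32 x, Word32 y)
{
    int64_t s = (int64_t)x + (int64_t)y;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (Word32)s;
}

/* Fractional multiply: 2 * a * b, saturated.
 * The only overflow case is a = b = -32768. */
static inline Word32 L_mult(Word16 a, Word16 b)
{
    Word32 p = (Word32)a * (Word32)b;
    if (p == (Word32)0x40000000)
        return INT32_MAX;
    return p * 2;
}

/* Multiply-accumulate: acc + 2 * a * b, saturated at each step. */
static inline Word32 L_mac(Word32 acc, Word16 a, Word16 b)
{
    return L_add(acc, L_mult(a, b));
}

/* Round to 16 bits: add 0x8000 with saturation, keep the upper half.
 * The upper half is extracted without shifting a negative value. */
static inline Word16 round_q31(Word32 x)
{
    Word32 r = L_add(x, (Word32)0x8000);
    return (Word16)((r - (r & 0xffff)) / 65536);
}
```

Every call carries a range check, which is why inlining these operations, or mapping them onto single saturating DSP instructions, removes a large share of the cycle count.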

14
4. Assembly Optimizations

Low-level optimization (Assembly-language environment)
These Optimizations include
Register Scheduling
Loop Optimizations
Software Pipelining
Memory Access Management

15
Loop Optimizations

Loops are critical parts of DSP algorithms
Loops often occur in nested structures
Zero-overhead loop instructions are used
Loop unrolling helps when porting code to assembly
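
A sketch of loop unrolling applied to an LPC residual filter of order 10: the outer loop is unrolled by two so that each coefficient load feeds two independent accumulators, which is the shape a dual-MAC processor can exploit. The helper names (`sat32`, `mac`, `scale_round`, `residu_ref`, `residu_unroll2`) are made up for this example, the arithmetic is a simplified stand-in for the ITU-T basic operators, and `lg` is assumed to be even.

```c
#include <stdint.h>

#define M 10   /* LPC filter order */

typedef int16_t Word16;
typedef int32_t Word32;

static Word32 sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (Word32)v;
}

/* acc + 2*a*b, saturating the product and then the sum. */
static Word32 mac(Word32 acc, Word16 a, Word16 b)
{
    Word32 prod = sat32(2 * (int64_t)a * (int64_t)b);
    return sat32((int64_t)acc + prod);
}

/* Saturating scale by 2^3, then round to the upper 16 bits. */
static Word16 scale_round(Word32 s)
{
    Word32 r = sat32((int64_t)s * 8);
    r = sat32((int64_t)r + 0x8000);
    return (Word16)((r - (r & 0xffff)) / 65536);
}

/* Straightforward version: one output per outer iteration.
 * x[-M] .. x[-1] must hold valid history samples. */
static void residu_ref(const Word16 a[], const Word16 x[], Word16 y[], int lg)
{
    int i, j;
    for (i = 0; i < lg; i++) {
        Word32 s = mac(0, x[i], a[0]);
        for (j = 1; j <= M; j++)
            s = mac(s, a[j], x[i - j]);
        y[i] = scale_round(s);
    }
}

/* Outer loop unrolled by 2: two outputs per iteration (lg must be even). */
static void residu_unroll2(const Word16 a[], const Word16 x[], Word16 y[], int lg)
{
    int i, j;
    for (i = 0; i < lg; i += 2) {
        Word32 s0 = mac(0, x[i],     a[0]);
        Word32 s1 = mac(0, x[i + 1], a[0]);
        for (j = 1; j <= M; j++) {
            Word16 c = a[j];               /* one load feeds both MACs */
            s0 = mac(s0, c, x[i - j]);
            s1 = mac(s1, c, x[i + 1 - j]);
        }
        y[i]     = scale_round(s0);
        y[i + 1] = scale_round(s1);
    }
}
```

Each output is still accumulated in the same order, so the unrolled version produces the same values as the plain one while halving the number of outer iterations.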

16
Software Pipelining

Scheduling Technique that restructures the Code
Keeps a check on processor resources
For the BF533, the software pipeline consists of 2 arithmetic instructions and 2 load/store instructions
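
The idea of software pipelining can be sketched in C on a simple dot product: the loads for the next iteration are issued alongside the multiply-accumulate of the current one, with a prologue and an epilogue around the steady-state loop. This is an illustration only; the function names are invented, the accumulation is not saturated, and on the BF533 the overlap would be expressed with parallel-issue instructions rather than C statements.

```c
#include <stdint.h>

/* Plain loop: each iteration loads its operands, then multiplies. */
static int32_t dot_plain(const int16_t *a, const int16_t *x, int n)
{
    int32_t acc = 0;
    int i;
    for (i = 0; i < n; i++)
        acc += (int32_t)a[i] * x[i];
    return acc;
}

/* Software-pipelined loop: loads run one iteration ahead of the MAC. */
static int32_t dot_pipelined(const int16_t *a, const int16_t *x, int n)
{
    int32_t acc = 0;
    int16_t ca, cx;
    int i;

    if (n <= 0)
        return 0;

    ca = a[0];                         /* prologue: first pair of loads */
    cx = x[0];
    for (i = 1; i < n; i++) {
        int16_t na = a[i];             /* loads for iteration i ...      */
        int16_t nx = x[i];
        acc += (int32_t)ca * cx;       /* ... overlap the MAC for i - 1  */
        ca = na;
        cx = nx;
    }
    acc += (int32_t)ca * cx;           /* epilogue: last MAC */
    return acc;
}
```

Both functions return the same sum; the pipelined form simply exposes a load and an arithmetic operation that are independent of each other within every loop iteration.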

17
Code Example
C-Code
Blackfin Assembly
.section G729a_Residu;
.global _residu_asm;
_residu_asm:
    P1 = R0;                          // P1 ---> a
    P2 = R1;                          // P2 ---> x
    P3 = R2;                          // P3 ---> y
    R7 = [SP+0x30];                   // R7 = lg
    R0 = W[P1](X);                    // R0 = a[0] (16-bit load)
    R7 >>= 1;                         // R7 = 20
    LC0 = R7;
    LSETUP (begin_loop_residu, end_loop_residu) LC0;
begin_loop_residu:
    I3 = P2;                          // I3 ---> x[i-j]
    A0 = 0 || R6 = W[P1](X) || R5.l = W[I3];
    A1 = 0 || R5.h = W[I3] || R3 = W[P2](X);
    A0 += R6.l*R5.l, A1 += R6.l*R5.h || R6 = W[P1+2](X)  || R5.l = W[I3--];
    A0 += R6.l*R5.h, A1 += R6.l*R5.l || R6 = W[P1+4](X)  || R5.h = W[I3--];
    A0 += R6.l*R5.l, A1 += R6.l*R5.h || R6 = W[P1+6](X)  || R5.l = W[I3--];
    A0 += R6.l*R5.h, A1 += R6.l*R5.l || R6 = W[P1+8](X)  || R5.h = W[I3--];
    A0 += R6.l*R5.l, A1 += R6.l*R5.h || R6 = W[P1+10](X) || R5.l = W[I3--];
    A0 += R6.l*R5.h, A1 += R6.l*R5.l || R6 = W[P1+12](X) || R5.h = W[I3--];
    A0 += R6.l*R5.l, A1 += R6.l*R5.h || R6 = W[P1+14](X) || R5.l = W[I3--];
    A0 += R6.l*R5.h, A1 += R6.l*R5.l || R6 = W[P1+16](X) || R5.h = W[I3--];
    A0 += R6.l*R5.l, A1 += R6.l*R5.h || R6 = W[P1+18](X) || R5.h = W[I3--];
    A0 += R6.l*R5.l, A1 += R6.l*R5.h || R6 = W[P1+20](X) || R5.h = W[I3--];
    A0 += R6.l*R0.l, A1 += R6.l*R0.h;
    R4 = A0, R5 = A1 || R3 = W[P2](X);
    R4 = R4 << 3 (S);
    R4.l = R4 (RND);
    R5 = R5 << 3 (S);
    W[P3] = R4;
    R5.l = R5 (RND);
end_loop_residu:
    W[P3] = R5;                       // store y[i+1]
    RTS;
_residu_asm.end:

void Residu(
    Word16 a[],   /* (i) prediction coefficients */
    Word16 x[],   /* (i) speech vector x         */
    Word16 y[],   /* (o) residual signal         */
    Word16 lg     /* (i) size of filtering       */
)
{
    Word16 i, j;
    Word32 s;

    for (i = 0; i < lg; i++)            /* outer loop, lg = 40 */
    {
        s = L_mult(x[i], a[0]);
        for (j = 1; j <= M; j++)        /* inner loop */
            s = L_mac(s, a[j], x[i-j]);
        s = L_shl(s, 3);
        y[i] = round(s);
    }
}

Cycle Count (C code): 40,117
Cycle Count (Blackfin assembly): 702
18

Results Analysis

19
MCPS count for Multiple stages
Development Stage                       Processor Load for BF533 (MCPS)   Reduction in Load (MCPS)
Reference C Code                        550                               ---
C Level Optimizations                   358                               192
Intrinsic Functions Optimizations       278                               80
Assembly Optimizations (4 Functions)    137                               141
Assembly Optimizations (10 Functions)   30                                107
20
Functions ported to Assembly
Function          Calls per Frame   Cycles per Call (C Code)   Cycles per Call (Assembly)   Gain per Call (cycles)   Gain per Frame (cycles)   Net Gain (cycles)
Syn_filt()        14                39332                      526                          38806                    543284                    543284
Residu()          4                 40117                      702                          39415                    157660                    700944
Cor_h_x()         4                 72175                      623                          71552                    286208                    987152
Autocorr()        1                 235934                     6233                         229701                   229701                    1216853
Pred_lt_3()       8                 66659                      630                          66029                    528232                    1745085
D4i40_17_fast()   2                 194902                     9953                         184949                   369898                    2114983
Pitch_ol_fast()   1                 326497                     7145                         319352                   319352                    2434335
Chebps()          73                2212                       97                           2115                     154395                    2588730
Qua_gain()        2                 54601                      2145                         52456                    104912                    2693642
test_err()        2                 260                        49                           211                      422                       2694064
21
Comparison of optimization stages
22
Final Implementation Statistics

Original codec runs at 550 MCPS
Optimized codec runs at 30 MCPS
25 full-duplex channels can be supported on a 750 MHz BF533 processor
Implementation is compliant with ITU-T specifications
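
The channel count follows from the processor clock and the per-channel load. The sketch below spells out that arithmetic, assuming the 30 MCPS figure covers one full-duplex channel and that frames arrive every 10 ms; the function names are illustrative.

```c
/* 10 ms frames give 100 frames per second. */
#define FRAMES_PER_SECOND 100

/* Million cycles per second needed for a given cycle count per frame. */
static double mcps_from_cycles(double cycles_per_frame)
{
    return cycles_per_frame * FRAMES_PER_SECOND / 1e6;
}

/* Whole channels that fit into the available clock rate. */
static int max_channels(int clock_mhz, int mcps_per_channel)
{
    return clock_mhz / mcps_per_channel;
}
```

With these assumptions, `max_channels(750, 30)` gives the 25 channels quoted above, and a 30 MCPS load corresponds to a budget of 300,000 cycles per 10 ms frame.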

23
Conclusion

Optimization of DSP algorithms is necessary to meet MIPS requirements
Both high-level and low-level optimizations play an important role in reducing MIPS
Understanding the underlying hardware architecture is important for maximum performance
A MIPS reduction of more than 10 times can be achieved for fixed-point algorithms (550 to 30 MCPS in this work)

Questions?

26
Profiling Example (G.729A Encoder)

Profile: Function timing, sorted by time
Date:    Wed Oct 27 10:34:13 2004

Program Statistics
------------------
    Command line at 2004 Oct 27 10:32: "G729aWS" test.inp test.outp
    Total time: 7768.530 millisecond
    Time outside of functions: 0.262 millisecond
    Call depth: 10
    Total functions: 126

Module Statistics for g729aws.exe
---------------------------------
    Time in module: 7768.268 millisecond
    Percent of time in module: 100.0%
    Functions in module: 126

        Func          Func+Child           Hit
        Time   %         Time      %      Count  Function
---------------------------------------------------------
     191.747   2.5      860.104  11.1     1002   _Pred_lt_3 (pred_lt3.obj)
