Real-time Implementation of G.729A Speech Codec on Media Processor Architectures

1
Real-time Implementation of G.729A Speech Codec on Media Processor Architectures
4th International Bhurban Conference on Applied Sciences and Technology, Bhurban, Pakistan, June 11-18, 2005
  • M.Tahir Awan
  • Faisal Abdullah
  • Shahid Masud
  • Nadeem A. Khan

Lahore University of Management Sciences
2
Overview of Presentation
  • Speech Codecs, G.729A
  • Blackfin DSP Architecture
  • DSP Design Flow
  • Multi-level Optimization Methodology
  • Results Analysis
  • Conclusion
  • Q&A

3
Speech Codecs
  • Applications
      • IP Telephony
      • Wireless Communications
      • Multimedia Applications
  • Classification
      • Waveform Coders
      • Parametric Coders
  • Characterized by
      • Bandwidth
      • Algorithmic Complexity
      • Speech Quality

4
The G.729A Speech Codec
  • Parametric Codec by ITU-T
  • Analysis by Synthesis Approach
  • Algebraic-CELP Algorithmic Structure
  • Bit rate is 8 kbps
  • Frame length is 10 ms
  • High Algorithmic Complexity

5
G.729A Block Diagram
6
Overview of Blackfin ADSP-BF533
  • 16-bit Fixed-point Media processor
  • VLIW Architecture
  • Dual MAC and 2 parallel ALU units
  • 8 Data Registers, 8 Address Registers,
  • 4 Index Registers and 1 barrel shifter
  • Both 16-bit and 32-bit Instructions
  • Operating Frequency is up to 750MHz

7
Blackfin BF533 Block Diagram
Index Registers
Address Registers
Data Registers
2 MAC Units
8
Traditional DSP Design-flow
9
Real Time Implementation Issues
  • Timing requirements for media applications are
    stringent
  • Compiler optimization does not work well for DSP
    processors
  • Compiler generated Code does not exploit DSP
    processor Architectures
  • Optimization results in less Power Consumption
  • Support for Multiple channels of application

10
Implementation Methodology
  • Implementation Stages
  • Profiling
  • Project Level Optimizations
  • Function Level C Optimizations
  • Assembly Optimizations
  • Compliance Tests

11
1. Profiling
  • Profiling is a performance and complexity
    analysis tool
  • Profiling identifies the cycle-intensive
    functions
  • Porting the entire algorithm to assembly is
    time-consuming, so only these functions are ported
  • Visual C and Visual DSP platforms have been used
    for profiling
  • Profiling Example

12
2. Project Level Optimizations
  • High-level optimization (C-language environment)
  • Cycle-intensive parts of large functions
    (>100 lines) are split out into separate
    subroutines
  • Functions to be ported to assembly are placed in
    separate files for debugging and interfacing with
    the C code

13
3. Function Level Optimizations
  • High-level optimization (C-language environment)
  • Intrinsic Operations have been inlined
  • Removal of overflow and saturation checks in
    compliance with ITU-Specs
  • Replacing intrinsic operations with assembly
    instructions (left/right shift, round, norm)
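
As a rough illustration of what inlining the intrinsic operations can look like in C, the sketch below defines saturating versions of L_mult and L_mac as static inline functions. The names follow the basic operators used in the Residu() code of this presentation; the bodies, the L_add helper, and the typedefs are assumptions written for this example, not the codec's actual source.

```c
#include <stdint.h>

typedef int16_t Word16;   /* assumed 16-bit fixed-point type */
typedef int32_t Word32;   /* assumed 32-bit fixed-point type */

/* Fractional multiply: (a * b) << 1. The only overflow case is
   -32768 * -32768, which is clamped to the largest positive value. */
static inline Word32 L_mult(Word16 a, Word16 b)
{
    Word32 p = (Word32)a * (Word32)b;
    return (p == (Word32)0x40000000) ? INT32_MAX : p * 2;
}

/* Saturating 32-bit addition (illustrative helper). */
static inline Word32 L_add(Word32 x, Word32 y)
{
    int64_t s = (int64_t)x + (int64_t)y;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (Word32)s;
}

/* Multiply-accumulate: acc + L_mult(a, b), with saturation. */
static inline Word32 L_mac(Word32 acc, Word16 a, Word16 b)
{
    return L_add(acc, L_mult(a, b));
}
```

Because the functions are static inline, a compiler is free to expand them at each call site, which removes the call overhead inside inner loops.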

14
4. Assembly Optimizations
  • Low-level optimization (Assembly-language
    environment)
  • These Optimizations include
  • Register Scheduling
  • Loop Optimizations
  • Software Pipelining
  • Memory Access Management

15
Loop Optimizations
  • Loops are Critical parts of DSP Algorithms
  • Loops occur in nested structure
  • Use of Zero-overhead Loop Instructions
  • Loop unrolling helps when porting code to
    assembly
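
A minimal C sketch of loop unrolling, assuming a plain 16-bit dot product as the workload. The function names are hypothetical; the second version unrolls by two and keeps two independent accumulators, which is the shape that maps naturally onto a dual-MAC unit.

```c
#include <stddef.h>
#include <stdint.h>

/* Straightforward form: one multiply-accumulate per iteration. */
int32_t dot_simple(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)a[i] * b[i];
    return acc;
}

/* Unrolled by two: half as many loop iterations, and the two
   accumulators have no dependency on each other. */
int32_t dot_unrolled2(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc0 = 0, acc1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        acc0 += (int32_t)a[i]     * b[i];
        acc1 += (int32_t)a[i + 1] * b[i + 1];
    }
    if (i < n)                      /* leftover element when n is odd */
        acc0 += (int32_t)a[i] * b[i];
    return acc0 + acc1;
}
```

Both functions return the same sum for inputs that do not overflow 32 bits; only the loop structure differs.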

16
Software Pipelining
  • Scheduling Technique that restructures the Code
  • Keeps a check on processor resources
  • For the BF533, the software pipeline consists of
    2 arithmetic instructions and 2 load/store
    instructions
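
The idea can be imitated in C, as a sketch only: the loads for the next iteration are issued in the same loop body as the multiply-accumulate for the current one, with a prologue and an epilogue around the steady-state loop. The function name is hypothetical, and on a real DSP this scheduling is done at the instruction level rather than in C.

```c
#include <stddef.h>
#include <stdint.h>

/* Dot product written in software-pipelined form. */
int32_t dot_pipelined(const int16_t *a, const int16_t *b, size_t n)
{
    if (n == 0)
        return 0;

    int32_t acc = 0;
    int16_t cur_a = a[0];            /* prologue: first pair of loads */
    int16_t cur_b = b[0];

    for (size_t i = 1; i < n; i++) {
        int16_t next_a = a[i];       /* loads for iteration i ...       */
        int16_t next_b = b[i];
        acc += (int32_t)cur_a * cur_b;  /* ... overlap the MAC for i - 1 */
        cur_a = next_a;
        cur_b = next_b;
    }

    acc += (int32_t)cur_a * cur_b;   /* epilogue: last MAC, no loads left */
    return acc;
}
```

In the steady-state loop, every pass contains both memory accesses and arithmetic that do not depend on each other, which is what lets a processor with parallel load and MAC units issue them together.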

17
Code Example

C-Code (Cycle Count: 40,117)

    void Residu(
      Word16 a[],    /* (i) prediction coeffs  */
      Word16 x[],    /* (i) speech vector x    */
      Word16 y[],    /* (o) residual signal    */
      Word16 lg      /* (i) size of filtering  */
    )
    {
      Word16 i, j;
      Word32 s;

      for (i = 0; i < lg; i++)          /* outer loop, lg = 40 */
      {
        s = L_mult(x[i], a[0]);
        for (j = 1; j <= M; j++)        /* inner loop */
          s = L_mac(s, a[j], x[i-j]);
        s = L_shl(s, 3);
        y[i] = round(s);
      }
    }

Blackfin Assembly (Cycle Count: 702)

    .section G729a_Residu;
    .global _residu_asm;
    _residu_asm:
        P1 = R0;                    // P1 --> a
        P2 = R1;                    // P2 --> x
        P3 = R2;                    // P3 --> y
        R7 = [SP+0x30];             // R7 = lg
        R0 = W[P1](X);              // R0 = a[0] (16-bit load)
        R7 >>= 1;                   // R7 = 20
        LC0 = R7;
        LSETUP (begin_loop_residu, end_loop_residu) LC0;   // outer loop
    begin_loop_residu:
        I3 = P2;                    // I3 --> x[-j]
        A0 = 0 || R6 = W[P1](X) || R5.l = W[I3];
        A1 = 0 || R5.h = W[I3] || R3 = W[P2](X);
        A0 += R6.l * R5.l, A1 += R6.l * R5.h || R6 = W[P1+2](X)  || R5.l = W[I3--];
        A0 += R6.l * R5.h, A1 += R6.l * R5.l || R6 = W[P1+4](X)  || R5.h = W[I3--];
        A0 += R6.l * R5.l, A1 += R6.l * R5.h || R6 = W[P1+6](X)  || R5.l = W[I3--];
        A0 += R6.l * R5.h, A1 += R6.l * R5.l || R6 = W[P1+8](X)  || R5.h = W[I3--];
        A0 += R6.l * R5.l, A1 += R6.l * R5.h || R6 = W[P1+10](X) || R5.l = W[I3--];
        A0 += R6.l * R5.h, A1 += R6.l * R5.l || R6 = W[P1+12](X) || R5.h = W[I3--];
        A0 += R6.l * R5.l, A1 += R6.l * R5.h || R6 = W[P1+14](X) || R5.l = W[I3--];
        A0 += R6.l * R5.h, A1 += R6.l * R5.l || R6 = W[P1+16](X) || R5.h = W[I3--];
        A0 += R6.l * R5.l, A1 += R6.l * R5.h || R6 = W[P1+18](X) || R5.h = W[I3--];
        A0 += R6.l * R5.l, A1 += R6.l * R5.h || R6 = W[P1+20](X) || R5.h = W[I3--];
        A0 += R6.l * R0.l, A1 += R6.l * R0.h;
        R4 = A0, R5 = A1;
        R3 = W[P2](X);
        R4 = R4 << 3 (S);
        R4.l = R4 (RND);
        R5 = R5 << 3 (S);
        W[P3] = R4;
        R5.l = R5 (RND);
    end_loop_residu:
        W[P3] = R5;                 // store y[i+1]
        RTS;
    _residu_asm.end:
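
The assembly version halves the outer loop count (lg >> 1) and keeps two accumulators. A minimal C sketch of that restructuring, assuming simplified stand-ins for the basic operators (sat32, L_shl3 and round16 are hypothetical helpers written for this example, and residu_ref / residu_pair are illustrative names, not the codec's functions):

```c
#include <stdint.h>

typedef int16_t Word16;
typedef int32_t Word32;

#define M 10  /* LPC order, so a[] holds M + 1 coefficients */

/* Clamp a 64-bit intermediate to the 32-bit range. */
static Word32 sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (Word32)v;
}

static Word32 L_mult(Word16 a, Word16 b) { return sat32(2 * (int64_t)a * b); }
static Word32 L_mac(Word32 s, Word16 a, Word16 b) { return sat32((int64_t)s + L_mult(a, b)); }
static Word32 L_shl3(Word32 s) { return sat32((int64_t)s * 8); }

/* Round to the upper 16 bits; assumes >> on a negative value is arithmetic. */
static Word16 round16(Word32 s) { return (Word16)(sat32((int64_t)s + 0x8000) >> 16); }

/* One output sample per outer iteration. x must have M samples of
   history before x[0]. */
void residu_ref(const Word16 a[], const Word16 x[], Word16 y[], int lg)
{
    for (int i = 0; i < lg; i++) {
        Word32 s = L_mult(x[i], a[0]);
        for (int j = 1; j <= M; j++)
            s = L_mac(s, a[j], x[i - j]);
        y[i] = round16(L_shl3(s));
    }
}

/* Two output samples per outer iteration, one per accumulator.
   lg is assumed to be even. */
void residu_pair(const Word16 a[], const Word16 x[], Word16 y[], int lg)
{
    for (int i = 0; i < lg; i += 2) {
        Word32 s0 = L_mult(x[i],     a[0]);
        Word32 s1 = L_mult(x[i + 1], a[0]);
        for (int j = 1; j <= M; j++) {
            s0 = L_mac(s0, a[j], x[i - j]);      /* coefficient a[j] is */
            s1 = L_mac(s1, a[j], x[i + 1 - j]);  /* shared by both MACs */
        }
        y[i]     = round16(L_shl3(s0));
        y[i + 1] = round16(L_shl3(s1));
    }
}
```

Each accumulator sees exactly the same sequence of operations as in the one-sample version, so the two functions produce identical output; only the amount of work per loop iteration changes.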
18
Results Analysis

19
MCPS count for Multiple stages

Development Stage                        Processor Load for BF533 (MCPS)   Reduction in Load (MCPS)
Reference C Code                         550                               ---
C Level Optimizations                    358                               192
Intrinsic Functions Optimizations        278                               80
Assembly Optimizations (4 Functions)     137                               141
Assembly Optimizations (10 Functions)    30                                107
20
Functions ported to Assembly

Function          Calls per   Cycles per Call   Cycles per Call   Gain per Call   Gain per Frame   Net Gain
                  Frame       (C Code)          (Assembly Code)   (cycles)        (cycles)         (cycles)
Syn_filt()        14          39332             526               38806           543284           543284
Residu()          4           40117             702               39415           157660           700944
Cor_h_x()         4           72175             623               71552           286208           987152
Autocorr()        1           235934            6233              229701          229701           1216853
Pred_lt_3()       8           66659             630               66029           528232           1745085
D4i40_17_fast()   2           194902            9953              184949          369898           2114983
Pitch_ol_fast()   1           326497            7145              319352          319352           2434335
Chebps()          73          2212              97                2115            154395           2588730
Qua_gain()        2           54601             2145              52456           104912           2693642
test_err()        2           260               49                211             422              2694064
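
The derived columns follow from the measured ones: gain per call is the C cycle count minus the assembly cycle count, gain per frame multiplies that by the calls per frame, and net gain is the running total down the table. A small C sketch of that arithmetic, with a hypothetical FuncProfile struct and helper names introduced only for illustration:

```c
#include <stddef.h>
#include <stdint.h>

/* One row of the table (measured columns only). */
typedef struct {
    const char *name;
    uint32_t calls_per_frame;
    uint32_t c_cycles;      /* cycles per call, C code        */
    uint32_t asm_cycles;    /* cycles per call, assembly code */
} FuncProfile;

uint32_t gain_per_call(const FuncProfile *f)
{
    return f->c_cycles - f->asm_cycles;
}

uint64_t gain_per_frame(const FuncProfile *f)
{
    return (uint64_t)f->calls_per_frame * gain_per_call(f);
}

/* Cumulative gain over the first `count` rows. */
uint64_t net_gain(const FuncProfile *rows, size_t count)
{
    uint64_t total = 0;
    for (size_t i = 0; i < count; i++)
        total += gain_per_frame(&rows[i]);
    return total;
}
```

For example, Residu() gives 40117 - 702 = 39415 cycles per call and 4 × 39415 = 157660 cycles per frame, matching the table.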
21
Comparison of optimization stages
22
Final Implementation Statistics
  • Original codec runs at 550 MCPS
  • Optimized codec runs at 30 MCPS
  • 25 full-duplex channels can be supported on a
    750 MHz BF533 processor (750 / 30 = 25)
  • Implementation is compliant with
    ITU-Specifications
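
The channel count is the clock rate divided by the per-channel load, and with a 10 ms frame (100 frames per second) an MCPS figure can also be read as a per-frame cycle budget. A short C sketch of both calculations, using hypothetical helper names:

```c
#include <stdint.h>

#define FRAMES_PER_SECOND 100u   /* one frame every 10 ms */

/* Cycles consumed per frame by one channel running at `mcps`. */
uint32_t cycles_per_frame(uint32_t mcps)
{
    return mcps * 1000000u / FRAMES_PER_SECOND;
}

/* Whole channels that fit into the processor's clock budget. */
uint32_t max_channels(uint32_t clock_mhz, uint32_t mcps_per_channel)
{
    if (mcps_per_channel == 0)
        return 0;                /* no meaningful answer for a zero load */
    return clock_mhz / mcps_per_channel;
}
```

With the figures above, max_channels(750, 30) gives 25, and 30 MCPS corresponds to 300,000 cycles per 10 ms frame.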

23
Conclusion
  • Optimization of DSP algorithms is necessary to
    meet MIPS requirements
  • Both high-level and low-level optimizations play
    an important role in reducing MIPS
  • Understanding the underlying hardware
    architecture is important for maximum performance
  • A MIPS reduction of more than 10 times can be
    achieved for fixed-point algorithms (here, 550 to
    30 MCPS)

25
  • Questions?

26
Profiling Example (G.729A Encoder)

    Profile: Function timing, sorted by time
    Date:    Wed Oct 27 10:34:13 2004

    Program Statistics
    ------------------
        Command line at 2004 Oct 27 10:32: "G729aWS" test.inp test.outp
        Total time: 7768.530 millisecond
        Time outside of functions: 0.262 millisecond
        Call depth: 10
        Total functions: 126

    Module Statistics for g729aws.exe
    ---------------------------------
        Time in module: 7768.268 millisecond
        Percent of time in module: 100.0%
        Functions in module: 126

            Func          Func+Child           Hit
            Time   %         Time      %      Count  Function
    ---------------------------------------------------------
         191.747   2.5      860.104  11.1     1002  _Pred_lt_3 (pred_lt3.obj)
