Speech Coding Techniques - PowerPoint PPT Presentation

Provided by: cmpe5

Transcript and Presenter's Notes

1
Speech Coding Techniques
These slides are adapted from the book by Daniel Collins, as well as lecture notes
from Professor Ai-Chun Pang of National Taiwan University.
Please send any comments, enhancements, or corrections to Frank.Lin_at_sjsu.edu.
Thank you.
2
Introduction

Efficient speech-coding techniques
Advantages for VoIP
Digital streams of ones and zeros
The lower the bandwidth, the lower the quality
RTP payload types
Processing power
Better quality (for a given bandwidth) requires a more complex algorithm
A balance between quality and cost

3
Voice Quality - MOS

Voice quality is subjective
MOS, Mean Opinion Score
ITU-T Recommendation P.800
Excellent 5
Good 4
Fair 3
Poor 2
Bad 1
A minimum of 30 people
Listeners rate voice samples or take part in conversations
P.800 recommendations
The selection of participants
The test environment
Explanations to listeners
Analysis of results
Toll quality
A MOS of 4.0 or higher
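
As a rough illustration of the scoring above, the sketch below averages listener ratings on the 1 to 5 scale and checks the result against the toll-quality threshold. The function names and the sample ratings are made up for illustration; P.800 itself covers much more (participant selection, environment, analysis) than this arithmetic.

```python
from statistics import mean

MIN_LISTENERS = 30        # "a minimum of 30 people"
TOLL_QUALITY_MOS = 4.0    # "a MOS of 4.0 or higher"


def mean_opinion_score(ratings):
    """Average ratings on the scale 1 (Bad) to 5 (Excellent)."""
    if len(ratings) < MIN_LISTENERS:
        raise ValueError("at least 30 listeners are needed")
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    return mean(ratings)


def is_toll_quality(mos):
    """True when the score reaches the toll-quality threshold."""
    return mos >= TOLL_QUALITY_MOS


# Hypothetical panel: 10 x Excellent, 15 x Good, 5 x Fair.
sample_ratings = [5] * 10 + [4] * 15 + [3] * 5
sample_mos = mean_opinion_score(sample_ratings)
```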

4
Voice Quality - PSQM

Perceptual Speech Quality Measurement
ITU-T Recommendation P.861
Algorithmic approach
Compares the coded output against a known input
type of speakers (male, female, child)
loudness
delay
silence ratio
clipping
environmental noise
PSQM score can be converted to MOS

5
About Speech

Speech
Air pushed from the lungs past the vocal cords
and along the vocal tract
The basic vibrations come from the vocal cords
The sound is altered by the disposition of the vocal tract (tongue and mouth)
Model the vocal tract as a filter
The shape changes relatively slowly
The vibrations at the vocal cords
The excitation signal

6
Speech sounds

Voiced sounds
The vocal cords vibrate, opening and closing
Quasi-periodic pulses of air
The rate of the opening and closing is the pitch
Unvoiced sounds
Forcing air at high velocities through a
constriction
Noise-like turbulence
Show little long-term periodicity
Short-term correlations still present
Plosive sounds
A complete closure in the vocal tract
Air pressure is built up and released suddenly

7
Sampling and Quantization

Sampling rate is based on desired frequency
range/cutoff
300-3800 Hz human speech
Nyquist's theorem => sampling rate of 8 kHz
Quantization: how many bits per sample
quantization noise (more bits => less noise)
uniform quantization
favors the loud speaker: (11.2 - 11)/11.2 = 1.8%, (2.2 - 2)/2.2 = 9%
SNR is better for the loud speaker
non-uniform quantization
smaller quantization steps at smaller signal
try to achieve uniform SNR
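
The relative-error figures for uniform quantization can be reproduced with a few lines. This is only a sketch: the step size of 1.0 is inferred from the slide's numbers (11.2 rounds to 11, 2.2 rounds to 2), and the function names and the finer step of 0.25 are illustrative assumptions.

```python
def quantize_uniform(x, step=1.0):
    """Round x to the nearest multiple of step."""
    return step * round(x / step)


def relative_error_percent(x, step=1.0):
    """Quantization error as a percentage of the sample value."""
    return 100 * abs(x - quantize_uniform(x, step)) / abs(x)


loud = relative_error_percent(11.2)    # about 1.8 %
quiet = relative_error_percent(2.2)    # about 9 %

# A smaller step for the small signal narrows the gap, which is the
# idea behind non-uniform quantization.
quiet_fine_step = relative_error_percent(2.2, step=0.25)
```

With the same step, the quiet sample suffers roughly five times the relative error of the loud one; shrinking the step only for small signals pulls the two closer together.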

8
Types of Speech Coders

Waveform codecs
high quality, simple, large bandwidth
Source codecs (a.k.a. vocoders)
model the vocal tract; the model parameters are sent
low bit rate, low quality
Hybrid codecs
9
G.711

The most commonplace codec
Waveform coder
Used in circuit-switched telephone network
PCM, Pulse-Code Modulation
With uniform quantization
12 bits × 8 k samples/sec = 96 kbps
With non-uniform quantization
8 bits × 8 k samples/sec = 64 kbps (DS0 rate)
u-law
North America
A-law
Other countries; a little friendlier to lower signal levels
An MOS of about 4.3
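
To make the non-uniform quantization idea concrete, here is a minimal sketch of the continuous u-law companding curve with the constant 255. It is not the G.711 bit layout: the standard uses a segmented, piecewise-linear approximation with its own sign/segment/mantissa encoding, whereas the simple 8-bit mapping below (`encode_8bit`, `decode_8bit`) is a hypothetical stand-in chosen for readability.

```python
import math

MU = 255  # u-law companding constant


def mu_law_compress(x):
    """Map a sample in [-1, 1] onto the companded scale [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mu_law_expand(y):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def encode_8bit(x):
    """Illustrative only: quantize the companded value to 256 levels."""
    return round((mu_law_compress(x) + 1) / 2 * 255)


def decode_8bit(code):
    """Illustrative only: undo encode_8bit."""
    return mu_law_expand(code / 255 * 2 - 1)


def relative_error(x):
    """Round-trip error relative to the sample size."""
    return abs(decode_8bit(encode_8bit(x)) - x) / abs(x)


# Bit rates quoted on the slide.
UNIFORM_RATE_BPS = 12 * 8000   # 96 kbps
G711_RATE_BPS = 8 * 8000       # 64 kbps
```

Calling `relative_error` on a quiet sample such as 0.01 and on a loud one such as 0.5 gives errors of the same order of magnitude, which is the "uniform SNR" goal of companding.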

10
ADPCM (adaptive differential PCM)

Waveform codec, no algorithmic delay, relatively high bandwidth
In plain PCM, each sample is coded independently of the others
DPCM: differential PCM
assumption: voice changes slowly
predicts the next sample and sends only the difference between the actual and predicted values (the prediction error)
simplest DPCM: no prediction, just send the difference between samples N and N+1
ADPCM: adaptive differential PCM
Adaptive prediction: based on past samples
Adaptive quantization: not fixed bits
G.721: ADPCM speech at 32 kbps
G.726 (A-law or u-law)
16, 24, 32, 40 kbps
MOS of 4.0 at 32 kbps
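
The simplest form of DPCM can be sketched as below: each sample is coded as the quantized difference from the previous reconstructed sample. This is a toy with a fixed step size, not G.721 or G.726; those adapt both the predictor and the quantizer. The function names and the step of 4 are assumptions for illustration.

```python
def dpcm_encode(samples, step=4):
    """Code each sample as a quantized difference from the last
    reconstructed value (the value the decoder will also have)."""
    codes = []
    prediction = 0
    for sample in samples:
        code = round((sample - prediction) / step)
        codes.append(code)
        prediction += code * step   # track the decoder's state
    return codes


def dpcm_decode(codes, step=4):
    """Rebuild samples by accumulating the dequantized differences."""
    samples = []
    value = 0
    for code in codes:
        value += code * step
        samples.append(value)
    return samples


# Hypothetical slowly varying input.
original = [0, 10, 22, 30, 33, 31]
codes = dpcm_encode(original)
rebuilt = dpcm_decode(codes)
```

Because the encoder predicts from the reconstructed value rather than the true previous sample, quantization errors do not accumulate: every rebuilt sample stays within half a step of the original. The codes are also much smaller than the samples, so they need fewer bits.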

11
Analysis-by-Synthesis (AbS) Codecs

Hybrid codec
Fill the gap between waveform and source codecs
The most successful and commonly used
Time-domain AbS codecs
Vocal-tract prediction filter: same model as the LPC vocoder
Excitation is not a simple two-state (voiced/unvoiced) choice
Different excitation signals are attempted
The one that comes closest to the original waveform is selected
MPE, Multi-Pulse Excited (first AbS codec ..
1982)
RPE, Regular-Pulse Excited
CELP, Code-Excited Linear Predictive
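
The analysis-by-synthesis loop (try each candidate excitation, synthesize, keep the closest match) can be sketched roughly as follows. Everything concrete here is an illustrative assumption: the four-entry codebook, the one-tap filter coefficient of 0.9, and the function names do not come from any standard, and real codecs add perceptual weighting and far larger codebooks.

```python
def synthesize(excitation, lpc, gain=1.0):
    """All-pole filter: y[n] = gain * e[n] + sum_k lpc[k] * y[n-1-k]."""
    out = []
    for n, e in enumerate(excitation):
        y = gain * e
        for k, a in enumerate(lpc):
            if n - 1 - k >= 0:
                y += a * out[n - 1 - k]
        out.append(y)
    return out


def abs_search(target, codebook, lpc):
    """Return (index, gain, squared_error) of the best codebook entry."""
    best = None
    for index, vector in enumerate(codebook):
        candidate = synthesize(vector, lpc)
        energy = sum(c * c for c in candidate)
        if energy == 0:
            continue
        # Least-squares gain for this candidate.
        gain = sum(t * c for t, c in zip(target, candidate)) / energy
        error = sum((t - gain * c) ** 2 for t, c in zip(target, candidate))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best


# Hypothetical codebook and filter.
codebook = [
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [1, -1, 1, -1, 1],
    [1, 1, 1, 1, 1],
]
lpc = [0.9]
target = synthesize(codebook[2], lpc, gain=2.0)
index, gain, error = abs_search(target, codebook, lpc)
```

Only the winning index and gain (plus the filter description) need to be sent; the decoder holds the same codebook and repeats the synthesis step.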

12
G.728 LD-CELP

CELP codecs
A filter: its characteristics change over time
A codebook of acoustic vectors (1024 vectors)
A vector: a set of elements representing various characteristics of the excitation
Transmit
Filter coefficients, gain, a pointer to the vector chosen
Low-Delay CELP
Backward-adaptive coder
Uses previous samples to determine the filter coefficients
Operates on 5 samples at a time
Delay < 1 ms (sample rate 8 kHz: 125 µs × 5 = 0.625 ms)
Only the vector pointer (10 bits) is transmitted for every 5 samples
G.728 LD-CELP rate is 16 kbps (8K / 5 × 10 = 16K)
MOS score of 3.9
Processor intensive
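
The G.728 figures follow directly from the sample rate, the vector length, and the index width. The short calculation below spells that out; the constant names are illustrative.

```python
SAMPLE_RATE_HZ = 8000
SAMPLES_PER_VECTOR = 5
INDEX_BITS = 10

# A 10-bit pointer addresses a codebook of 1024 vectors.
codebook_size = 2 ** INDEX_BITS

# Time needed to collect one 5-sample vector, in milliseconds.
buffering_delay_ms = SAMPLES_PER_VECTOR * 1000 / SAMPLE_RATE_HZ

# One 10-bit index per 5 samples.
vectors_per_second = SAMPLE_RATE_HZ // SAMPLES_PER_VECTOR
bit_rate_bps = vectors_per_second * INDEX_BITS
```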

13
G.723.1 ACELP (algebraic)

6.3 or 5.3 kbps
Both mandatory
Can change from one to another during a
conversation
The coder
A band-limited input speech signal
Sampled at 8 kHz, 16-bit uniform PCM quantization
Operates on blocks of 240 samples (30 ms) at a time
A look-ahead of 7.5 ms
A total algorithmic delay of 37.5 ms (30 ms + 7.5 ms), plus other delays
A high-pass filter to remove any DC component

14
G.723.1 Annex A

Silence Insertion Descriptor (SID) frames of size four octets
The two LSBs of the first octet indicate the frame type:
Bits   Frame type   Octets/frame
00     6.3 kbps     24
01     5.3 kbps     20
10     SID frame    4
MOS 3.8
At least 37.5 ms delay
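
A receiver can use the two least significant bits of the first octet to tell the frame types apart and to know how many octets to consume. The sketch below covers only the three codes listed on the slide; the function names and the handling of the remaining code (`11`) as an error are assumptions.

```python
# Frame type code -> (description, frame size in octets)
FRAME_TYPES = {
    0b00: ("6.3 kbps speech", 24),
    0b01: ("5.3 kbps speech", 20),
    0b10: ("SID", 4),
}


def classify_frame(first_octet):
    """Look up the frame type from the two LSBs of the first octet."""
    code = first_octet & 0b11
    if code not in FRAME_TYPES:
        raise ValueError("frame type code 11 is not covered here")
    return FRAME_TYPES[code]


def split_frames(stream):
    """Split a byte string of back-to-back frames into (type, bytes)."""
    frames = []
    offset = 0
    while offset < len(stream):
        kind, size = classify_frame(stream[offset])
        frames.append((kind, stream[offset:offset + size]))
        offset += size
    return frames


# Hypothetical stream: one 6.3 kbps frame, one SID, one 5.3 kbps frame.
stream = (bytes([0b00]) + bytes(23)
          + bytes([0b10]) + bytes(3)
          + bytes([0b01]) + bytes(19))
frames = split_frames(stream)
```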

15
G.729, G.729A, G.729B

8 kbps
Input frames of 10 ms, 80 samples at 8 kHz sampling
5 ms look-ahead => algorithmic delay of 15 ms (10 ms + 5 ms)
An 80-bit frame for 10 ms of speech
G.729 is a complex codec with an MOS of 4.0
G.729A is a simplified version with an MOS of 3.7
G.729B
VAD, Voice Activity Detection
Based on the current frame plus the two preceding frames (i.e. 3 silence frames)
DTX, Discontinuous Transmission
Send nothing or send an SID frame
SID frame (15 bits) used to generate comfort noise
CNG, Comfort Noise Generation
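
The interplay of VAD and DTX can be illustrated with a toy decision loop. This is not the G.729B algorithm: the energy threshold, the function name, and the rule "declare silence only when the current frame and the two preceding frames are all quiet" are simplifying assumptions that loosely mirror the three-frame window mentioned above.

```python
def dtx_decisions(frame_energies, threshold):
    """Decide per frame what to transmit: 'SPEECH', 'SID', or 'NONE'."""
    decisions = []
    quiet_history = []
    in_silence = False
    for energy in frame_energies:
        quiet_history.append(energy < threshold)
        # Silence needs the current and two preceding frames quiet.
        silent = len(quiet_history) >= 3 and all(quiet_history[-3:])
        if not silent:
            decisions.append("SPEECH")
            in_silence = False
        elif not in_silence:
            decisions.append("SID")    # start of silence: describe noise
            in_silence = True
        else:
            decisions.append("NONE")   # silence continues: send nothing
    return decisions


# Hypothetical per-frame energies with a pause in the middle.
energies = [9, 8, 1, 1, 1, 1, 7, 1]
decisions = dtx_decisions(energies, threshold=2)
```

On the receiving side, the SID frame would drive comfort noise generation until speech frames resume.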

16
GSM Adaptive Multi-Rate (AMR)

Eight different modes
4.75 kbps to 12.2 kbps
12.2 kbps, GSM EFR
7.4 kbps, IS-641 (TDMA cellular systems)
Change the mode at any time
Offer discontinuous transmission
The coding choice of many 3G wireless networks

17
Tones, Signals, and DTMF Digits

The hybrid codecs are optimized for human speech
Other data may need to be transmitted
Tones: fax tones, ringing, busy, congestion, etc.
DTMF digits: for two-stage dialing or voice-mail
G.711 carries them acceptably; G.723.1 and G.729 can make them unintelligible
The ingress gateway needs to
Intercept the tones and DTMF digits
Use an external signaling system to relay the info
Easy at the start of a call, difficult in the middle of a call
Encode in RTP packet with tone name and duration
Encode in RTP packet containing frequency, volume
and duration
Encode in RTP payload format as redundant audio
data and send both types of RTP payload
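
The slides do not say which payload format they have in mind for sending a tone name and duration in RTP. One widely used format of this kind is the telephone-event payload of RFC 4733 (originally RFC 2833), and the sketch below packs a DTMF digit in that layout: an 8-bit event code, an end bit, a reserved bit, a 6-bit volume, and a 16-bit duration. Only the 4-octet payload is built, not the RTP header, and the function names and default volume are assumptions.

```python
import struct

# DTMF event codes: digits 0-9, then *, #, A-D.
DTMF_EVENTS = {str(d): d for d in range(10)}
DTMF_EVENTS.update({"*": 10, "#": 11, "A": 12, "B": 13, "C": 14, "D": 15})


def pack_dtmf_event(digit, duration, volume=10, end=False):
    """Build the 4-octet event payload for one DTMF digit.

    duration is in RTP timestamp units (800 = 100 ms at 8 kHz);
    volume is the tone level in -dBm0, from 0 to 63.
    """
    if not 0 <= volume <= 63:
        raise ValueError("volume must fit in 6 bits")
    if not 0 <= duration <= 0xFFFF:
        raise ValueError("duration must fit in 16 bits")
    flags = (0x80 if end else 0x00) | volume   # reserved bit stays 0
    return struct.pack("!BBH", DTMF_EVENTS[digit], flags, duration)


def unpack_dtmf_event(payload):
    """Return (event, end, volume, duration) from a 4-octet payload."""
    event, flags, duration = struct.unpack("!BBH", payload)
    return event, bool(flags & 0x80), flags & 0x3F, duration


# Final packet of a 100 ms digit "5".
payload = pack_dtmf_event("5", duration=800, volume=10, end=True)
```

Because the digit travels as a named event rather than as coded audio, it survives low-bit-rate codecs such as G.723.1 and G.729 intact.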