Title: Speech Coding Techniques
1. Speech Coding Techniques
These slides are adapted from a book by Daniel Collins as well as lecture notes
from Professor Ai-Chun Pang of National Taiwan University. Please send any
comments, enhancements, or corrections to Frank.Lin_at_sjsu.edu. Thank you.
2. Introduction
- Efficient speech-coding techniques
- Advantages for VoIP
- Digital streams of ones and zeros
- The lower the bandwidth, the lower the quality
- RTP payload types
- Processing power
- Better quality (for a given bandwidth) requires a more complex algorithm
- A balance between quality and cost
3. Voice Quality - MOS
- Voice quality is subjective
- MOS, Mean Opinion Score
- ITU-T Recommendation P.800
- Excellent 5
- Good 4
- Fair 3
- Poor 2
- Bad 1
- A minimum of 30 people
- Listen to voice samples or take part in conversations
- P.800 recommendations
- The selection of participants
- The test environment
- Explanations to listeners
- Analysis of results
- Toll quality
- A MOS of 4.0 or higher
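As a rough illustration of how a MOS comes out of a listening test, the sketch below averages listener ratings on the five-point scale and compares the result with the toll-quality threshold of 4.0. The function names and the sample ratings are made up for illustration; P.800 itself covers much more (participant selection, test environment, analysis).

```python
from statistics import mean

MIN_LISTENERS = 30      # minimum panel size from the slide
TOLL_QUALITY_MOS = 4.0  # toll-quality threshold from the slide


def mean_opinion_score(ratings):
    """Average listener ratings on the 1 (Bad) to 5 (Excellent) scale."""
    if len(ratings) < MIN_LISTENERS:
        raise ValueError("need at least 30 listeners")
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    return mean(ratings)


def is_toll_quality(mos):
    """True when the score reaches the toll-quality threshold."""
    return mos >= TOLL_QUALITY_MOS


# Hypothetical panel: 12 "Excellent", 15 "Good", and 3 "Fair" votes.
ratings = [5] * 12 + [4] * 15 + [3] * 3
mos = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f}, toll quality: {is_toll_quality(mos)}")
```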
4. Voice Quality - PSQM
- Perceptual Speech Quality Measurement
- ITU-T Recommendation P.861
- Algorithmic approach
- Compare coding output against known input
- type of speakers (male, female, child)
- loudness
- delay
- silence ratio
- clipping
- environmental noise
- PSQM score can be converted to MOS
5. About Speech
- Speech
- Air is pushed from the lungs past the vocal cords and along the vocal tract
- The basic vibrations come from the vocal cords
- The sound is altered by the disposition of the vocal tract (tongue and mouth)
- Model the vocal tract as a filter
- The shape changes relatively slowly
- The vibrations at the vocal cords
- The excitation signal
6. Speech Sounds
- Voiced sounds
- The vocal cords vibrate open and closed
- Quasi-periodic pulses of air
- The rate of the opening and closing determines the pitch
- Unvoiced sounds
- Forcing air at high velocities through a constriction
- Noise-like turbulence
- Show little long-term periodicity
- Short-term correlations still present
- Plosive sounds
- A complete closure in the vocal tract
- Air pressure is built up and released suddenly
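The excitation-plus-filter view of speech can be sketched in a few lines. This is only a toy model: the 8 kHz sampling rate, the one-pole filter, and all function names are assumptions chosen for illustration, not a real vocal-tract model.

```python
import random

SAMPLE_RATE = 8000  # Hz, assumed telephone-band sampling rate


def voiced_excitation(pitch_hz, n_samples):
    """Quasi-periodic pulses: one unit pulse per pitch period."""
    period = round(SAMPLE_RATE / pitch_hz)  # pitch period in samples
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def unvoiced_excitation(n_samples, seed=0):
    """Noise-like excitation with no long-term periodicity."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]


def vocal_tract(excitation, a=0.9):
    """Toy one-pole filter y[n] = x[n] + a*y[n-1] standing in for the vocal tract."""
    out, prev = [], 0.0
    for x in excitation:
        prev = x + a * prev
        out.append(prev)
    return out


# A 100 Hz pitch gives an 80-sample pitch period at 8 kHz.
voiced = vocal_tract(voiced_excitation(100, 240))
unvoiced = vocal_tract(unvoiced_excitation(240))
```

The same filter shapes both signals; only the excitation differs between the voiced and unvoiced cases.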
7. Sampling & Quantization
- Sampling rate is based on desired frequency range/cutoff
- 300-3800 Hz for human speech
- Nyquist's theorem => sampling rate of 8 kHz
- Quantization: how many bits per sample
- quantization noise (more bits => less noise)
- uniform quantization
- favors loud speakers: (11.2 - 11)/11.2 ≈ 1.8%, (2.2 - 2)/2.2 ≈ 9%
- SNR is better for loud speakers
- non-uniform quantization
- smaller quantization steps at smaller signal levels
- try to achieve uniform SNR
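The arithmetic on this slide can be checked with a short sketch. The function names and the step size of 1.0 are assumptions for illustration; the sample values 11.2 and 2.2 are the ones used in the slide's example.

```python
def nyquist_rate(max_freq_hz):
    """Minimum sampling rate for a signal band-limited to max_freq_hz."""
    return 2 * max_freq_hz


def uniform_quantize(x, step=1.0):
    """Round x to the nearest multiple of a fixed step size."""
    return round(x / step) * step


def relative_error(x, step=1.0):
    """Quantization error as a fraction of the signal value."""
    return abs(x - uniform_quantize(x, step)) / abs(x)


print(nyquist_rate(3800))   # 7600 Hz, rounded up to 8 kHz in practice
for sample in (11.2, 2.2):
    print(sample, f"{relative_error(sample):.1%}")
```

With the same step size, the quieter sample (2.2) suffers a relative error about five times larger than the louder one (11.2), which is the motivation for non-uniform quantization.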
8. Types of Speech Coders
- Waveform codecs
- high quality, simple, large bandwidth
- Source codecs (a.k.a. vocoders)
- model the vocal tract; model parameters are sent
- low bit rate, low quality
- Hybrid codecs
9. G.711
- The most commonplace codec
- Waveform coder
- Used in circuit-switched telephone network
- PCM, Pulse-Code Modulation
- If uniform quantization
- 12 bits x 8 k samples/sec = 96 kbps
- Non-uniform quantization
- 8 bits x 8 k samples/sec = 64 kbps, the DS0 rate
- u-law
- North America
- A-law
- Other countries; a little friendlier to lower signal levels
- An MOS of about 4.3
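To illustrate why non-uniform quantization helps quiet signals, the sketch below compares a plain 8-bit uniform quantizer with one wrapped in the continuous mu-law companding curve (mu = 255). This is an assumption-laden illustration: G.711 itself uses a segmented, bit-exact approximation of this curve, and the helper names here are invented.

```python
import math

MU = 255  # mu-law parameter


def mu_compress(x):
    """Map x in [-1, 1] to [-1, 1], stretching small amplitudes."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mu_expand(y):
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def quantize(v, bits=8):
    """Uniform quantizer on [-1, 1] with step size 2 / 2**bits."""
    levels = 2 ** (bits - 1)
    return round(v * levels) / levels


def uniform_codec(x, bits=8):
    """Quantize the sample directly."""
    return quantize(x, bits)


def mu_law_codec(x, bits=8):
    """Compress, quantize uniformly, then expand."""
    return mu_expand(quantize(mu_compress(x), bits))


for x in (0.01, 0.3):
    uni = abs(x - uniform_codec(x)) / x
    mu = abs(x - mu_law_codec(x)) / x
    print(f"x={x}: uniform error {uni:.2%}, mu-law error {mu:.2%}")

# Bit rates from the slide: bits per sample times 8000 samples per second.
print(12 * 8000, 8 * 8000)   # 96000 64000
```

For the quiet sample the companded quantizer gives a much smaller relative error than the uniform one, while for the louder sample both are of similar size.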
10. ADPCM (Adaptive Differential PCM)
- Waveform codec, no algorithmic delay, relatively high BW
- In PCM, each sample is coded independently of the others
- DPCM: differential PCM
- assumption: voice changes slowly
- predicts the next sample and sends only the difference between the actual and predicted values (the prediction error)
- simplest DPCM: no prediction, just send the difference between samples N and N+1
- ADPCM: adaptive differential PCM
- Adaptive prediction: based on past samples
- Adaptive quantization: quantization steps are not fixed
- G.721: ADPCM speech at 32 kbps
- G.726 (A-law or u-law)
- 16, 24, 32, 40 kbps
- MOS of 4.0 at 32 kbps
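The simplest DPCM variant described above (no prediction, just successive differences) can be sketched as follows. The sample values and function names are hypothetical, and unlike real ADPCM this sketch does not quantize the differences or adapt anything, so it is lossless.

```python
def dpcm_encode(samples):
    """Send the first sample as-is, then only the sample-to-sample differences."""
    diffs, prev = [], 0
    for s in samples:
        diffs.append(s - prev)
        prev = s
    return diffs


def dpcm_decode(diffs):
    """Rebuild the samples by accumulating the received differences."""
    samples, prev = [], 0
    for d in diffs:
        prev += d
        samples.append(prev)
    return samples


# Hypothetical slowly varying PCM samples.
pcm = [100, 104, 109, 112, 111, 108, 102]
diffs = dpcm_encode(pcm)
print(diffs)                 # [100, 4, 5, 3, -1, -3, -6]
assert dpcm_decode(diffs) == pcm
```

Because the signal changes slowly, the differences span a much smaller range than the samples themselves and can be coded with fewer bits.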
11. Analysis-by-Synthesis (AbS) Codecs
- Hybrid codec
- Fill the gap between waveform and source codecs
- The most successful and commonly used
- Time-domain AbS codecs
- Vocal tract prediction filter model is the same as in an LPC vocoder
- Excitation is not a simple two-state (voiced/unvoiced) model
- Different excitation signals are attempted
- Closest to the original waveform is selected
- MPE, Multi-Pulse Excited (the first AbS codec, 1982)
- RPE, Regular-Pulse Excited
- CELP, Code-Excited Linear Predictive
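The analysis-by-synthesis idea (try several excitations, keep the one whose synthesized output is closest to the original) can be sketched as a brute-force search. Everything concrete here is assumed for illustration: the one-pole synthesis filter, the four-entry codebook, and the squared-error measure. Real codecs use LPC-derived filters and perceptually weighted error.

```python
def synthesize(excitation, a=0.8):
    """Toy one-pole synthesis filter y[n] = x[n] + a*y[n-1] (assumed, not LPC-derived)."""
    out, prev = [], 0.0
    for x in excitation:
        prev = x + a * prev
        out.append(prev)
    return out


def squared_error(u, v):
    """Sum of squared sample differences between two equal-length signals."""
    return sum((p - q) ** 2 for p, q in zip(u, v))


def abs_search(target, candidates):
    """Try every candidate excitation; return the index whose output is closest to target."""
    errors = [squared_error(target, synthesize(c)) for c in candidates]
    return min(range(len(candidates)), key=errors.__getitem__)


# Hypothetical 4-entry excitation codebook, 5 samples per vector.
codebook = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, -1.0, 0.0, 1.0],
    [0.5, 0.5, 0.5, 0.5, 0.5],
]
# Pretend the "original speech" is entry 2 through the filter, slightly perturbed.
original = [s + 0.01 for s in synthesize(codebook[2])]
best = abs_search(original, codebook)
print(best)   # 2
```

Only the index of the winning excitation needs to be sent, which is the basis of the code-excited (CELP) approach.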
12. G.728 LD-CELP
- CELP codecs
- A filter: its characteristics change over time
- A codebook of acoustic vectors (1024 vectors)
- A vector: a set of elements representing various characteristics of the excitation
- Transmit
- Filter coefficients, gain, a pointer to the vector chosen
- Low-Delay CELP
- Backward-adaptive coder
- Uses previous samples to determine filter coefficients
- Operates on 5 samples at a time
- Delay < 1 ms (sample rate 8 kHz: 125 us x 5 = 0.625 ms)
- Only the vector pointer (10 bits) is transmitted for every 5 samples
- G.728 LD-CELP rate is 16 kbps (8K / 5 x 10 = 16K)
- MOS of 3.9
- Processing intensive
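The delay and bit-rate figures for G.728 follow directly from the block size and the pointer width. The short calculation below reproduces them; the constant names are chosen for illustration only.

```python
SAMPLE_RATE_HZ = 8000
SAMPLES_PER_VECTOR = 5
BITS_PER_VECTOR = 10      # index into the 1024-entry codebook

codebook_size = 2 ** BITS_PER_VECTOR
sample_period_us = 1_000_000 / SAMPLE_RATE_HZ
block_delay_ms = SAMPLES_PER_VECTOR * sample_period_us / 1000
bit_rate_bps = SAMPLE_RATE_HZ // SAMPLES_PER_VECTOR * BITS_PER_VECTOR

print(codebook_size)     # 1024
print(sample_period_us)  # 125.0
print(block_delay_ms)    # 0.625
print(bit_rate_bps)      # 16000
```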
13. G.723.1 ACELP (algebraic)
- 6.3 or 5.3 kbps
- Both mandatory
- Can change from one to the other during a conversation
- The coder
- A band-limited input speech signal
- Sampled at 8 kHz, 16-bit uniform PCM quantization
- Operates on blocks of 240 samples (30 ms) at a time
- A look-ahead of 7.5 ms
- A total algorithmic delay of 37.5 ms, plus other delays
- A high-pass filter to remove any DC component
14. G.723.1 Annex A
- Silence Insertion Descriptor (SID) frames of size four octets
- The two LSBs of the first octet
- 00: 6.3 kbps, 24 octets/frame
- 01: 5.3 kbps, 20 octets/frame
- 10: SID frame, 4 octets/frame
- MOS 3.8
- At least 37.5 ms delay
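A receiver can use the two least significant bits of the first octet to tell how long each frame is. The sketch below shows one way to do that, using only the three codes listed on this slide; the function names are hypothetical, and the remaining code (11) is deliberately rejected because the slide does not describe it.

```python
# Frame types keyed by the two least significant bits of the first octet.
FRAME_TYPES = {
    0b00: ("6.3 kbps speech", 24),
    0b01: ("5.3 kbps speech", 20),
    0b10: ("SID", 4),
}


def classify_frame(first_octet):
    """Return (description, frame size in octets) for a G.723.1 frame."""
    code = first_octet & 0b11
    if code not in FRAME_TYPES:
        # The value 11 is not covered by the slide, so this sketch rejects it.
        raise ValueError(f"frame type {code:02b} not handled in this sketch")
    return FRAME_TYPES[code]


def split_frames(payload):
    """Split back-to-back frames using the size implied by each first octet."""
    frames, pos = [], 0
    while pos < len(payload):
        _, size = classify_frame(payload[pos])
        frames.append(payload[pos:pos + size])
        pos += size
    return frames


# Hypothetical payload: one 6.3 kbps frame, one 5.3 kbps frame, one SID frame.
payload = bytes([0x00] * 24) + bytes([0x01] + [0] * 19) + bytes([0x02, 0, 0, 0])
print([len(f) for f in split_frames(payload)])   # [24, 20, 4]
```

Because the frame type is carried in every frame, the sender can switch between the two rates, or to SID frames, without extra signaling.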
15. G.729, G.729A, G.729B
- 8 kbps
- Input frames of 10 ms, 80 samples for 8 kHz sampling
- 5 ms look-ahead => algorithmic delay of 15 ms (10 ms + 5 ms)
- An 80-bit frame for 10 ms of speech
- G.729 is a complex codec with MOS of 4.0
- G.729A is a simplified version with MOS of 3.7
- G.729B
- VAD, Voice Activity Detection
- current frame plus the two preceding frames (i.e. 3 silence frames)
- DTX, Discontinuous Transmission
- Send nothing or send an SID frame
- SID frame (15 bits) used to generate comfort noise
- CNG, Comfort Noise Generation
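The interplay of VAD and DTX can be sketched with a toy decision loop. This is not the G.729B algorithm: the energy threshold, the rule of sending a single SID at the start of each pause, and all names are assumptions for illustration. The only element taken from the slide is that silence is declared on the current frame plus the two preceding frames.

```python
from collections import deque

FRAME_SAMPLES = 80        # 10 ms at 8 kHz
ENERGY_THRESHOLD = 1e-4   # hypothetical threshold; the real VAD uses several features


def frame_is_active(frame):
    """Toy VAD feature: mean energy of the frame against a fixed threshold."""
    return sum(s * s for s in frame) / len(frame) > ENERGY_THRESHOLD


def dtx_decisions(frames):
    """Yield 'SPEECH', 'SID' or 'NONE' for each 10 ms frame.

    A frame counts as silence only if it and the two preceding frames are all
    inactive. The first silent frame of a pause gets a SID; later ones send nothing.
    """
    history = deque(maxlen=3)
    in_silence = False
    for frame in frames:
        history.append(frame_is_active(frame))
        silent = len(history) == 3 and not any(history)
        if not silent:
            in_silence = False
            yield "SPEECH"
        elif not in_silence:
            in_silence = True
            yield "SID"
        else:
            yield "NONE"


loud = [0.1] * FRAME_SAMPLES
quiet = [0.0] * FRAME_SAMPLES
print(list(dtx_decisions([loud, quiet, quiet, quiet, quiet, loud])))
# ['SPEECH', 'SPEECH', 'SPEECH', 'SID', 'NONE', 'SPEECH']
```

On the receiving side, the parameters carried in the SID frame would drive comfort noise generation until speech frames resume.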
16. GSM Adaptive Multi-Rate (AMR)
- Eight different modes
- 4.75 kbps to 12.2 kbps
- 12.2 kbps, GSM EFR
- 7.4 kbps, IS-641 (TDMA cellular systems)
- Change the mode at any time
- Offer discontinuous transmission
- The coding choice of many 3G wireless networks
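Mode switching can be pictured as choosing among a fixed table of bit rates. In the sketch below, the individual mode rates come from the AMR narrowband specification rather than from the slide, which gives only the two endpoints and two named modes. The selection policy and function name are purely hypothetical; real networks adapt the mode based on radio channel conditions.

```python
# The eight AMR narrowband mode bit rates in kbps (values from the AMR
# specification, not listed individually on the slide).
AMR_MODES_KBPS = [4.75, 5.15, 5.90, 6.70, 7.40, 7.95, 10.2, 12.2]


def pick_mode(available_kbps):
    """Hypothetical policy: highest mode that fits the budget, else the lowest mode."""
    fitting = [m for m in AMR_MODES_KBPS if m <= available_kbps]
    return max(fitting) if fitting else AMR_MODES_KBPS[0]


for budget in (13.0, 8.0, 3.0):
    print(budget, "->", pick_mode(budget))
```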
17. Tones, Signals, and DTMF Digits
- The hybrid codecs are optimized for human speech
- Other data may need to be transmitted
- Tones: fax tones, ringing, busy, congestion, etc.
- DTMF digits for two-stage dialing or voice mail
- G.711 is OK; G.723.1 and G.729 can make them unintelligible
- The ingress gateway needs to
- Intercept the tones and DTMF digits
- Use an external signaling system to relay the info
- Easy at the start of a call, difficult in the middle of a call
- Encode in an RTP packet with tone name and duration
- Encode in an RTP packet containing frequency, volume, and duration
- Encode in RTP payload format as redundant audio data and send both types of RTP payload
18. RTP Payload Format for DTMF Digits
- E: end of the tone
- R: reserved
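A minimal sketch of packing and unpacking such a payload follows. The slide only preserves the meaning of the E and R bits, so the full field layout used here (8-bit event, E bit, R bit, 6-bit volume, 16-bit duration) is an assumption taken from the RTP telephone-event format of RFC 2833 / RFC 4733, and the function names are invented for illustration.

```python
import struct


def pack_dtmf_event(event, end, volume, duration):
    """Pack one 4-octet telephone-event payload.

    Layout assumed here (RFC 2833 / RFC 4733): event (8 bits), E (1 bit),
    R (1 bit, sent as 0), volume (6 bits), duration (16 bits, timestamp units).
    """
    if not 0 <= event <= 255 or not 0 <= volume <= 63 or not 0 <= duration <= 0xFFFF:
        raise ValueError("field out of range")
    flags_volume = (0x80 if end else 0) | volume
    return struct.pack("!BBH", event, flags_volume, duration)


def unpack_dtmf_event(payload):
    """Decode a 4-octet telephone-event payload into its fields."""
    event, flags_volume, duration = struct.unpack("!BBH", payload)
    return {
        "event": event,
        "end": bool(flags_volume & 0x80),
        "reserved": (flags_volume >> 6) & 1,
        "volume": flags_volume & 0x3F,
        "duration": duration,
    }


# Digit "5" (event code 5), end of tone, volume 10, 800 timestamp units
# (100 ms at 8 kHz).
payload = pack_dtmf_event(5, True, 10, 800)
print(payload.hex())   # 058a0320
print(unpack_dtmf_event(payload))
```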