Speech Coding Techniques - PowerPoint PPT Presentation

Slides: 19
Provided by: cmpe5


1
Speech Coding Techniques
These slides are adapted from the book by Daniel
Collins, as well as lecture notes from Professor
Ai-Chun Pang of National Taiwan University. Please
send any comments, enhancements, or corrections to
Frank.Lin_at_sjsu.edu. Thank you.
2
Introduction
  • Efficient speech-coding techniques
  • Advantages for VoIP
  • Digital streams of ones and zeros
  • The lower the bandwidth, the lower the quality
  • RTP payload types
  • Processing power
  • Better quality (for a given bandwidth) requires a
    more complex algorithm
  • A balance between quality and cost

3
Voice Quality - MOS
  • Voice quality is subjective
  • MOS, Mean Opinion Score
  • ITU-T Recommendation P.800
  • Excellent 5
  • Good 4
  • Fair 3
  • Poor 2
  • Bad 1
  • A minimum of 30 people
  • Listen to voice samples or in conversations
  • P.800 recommendations
  • The selection of participants
  • The test environment
  • Explanations to listeners
  • Analysis of results
  • Toll quality
  • A MOS of 4.0 or higher
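
The scoring described above reduces to averaging listener ratings. The following is a minimal sketch, assuming integer ratings on the 1 to 5 scale; the panel data and the function name are made up for illustration and are not part of P.800.

```python
from statistics import mean

# Opinion scale as listed on the slide.
SCALE = {5: "Excellent", 4: "Good", 3: "Fair", 2: "Poor", 1: "Bad"}
MIN_LISTENERS = 30   # minimum panel size, per the slide
TOLL_QUALITY = 4.0   # MOS threshold for toll quality, per the slide


def mean_opinion_score(ratings):
    """Average integer ratings (1..5) from a listening panel into a MOS."""
    if len(ratings) < MIN_LISTENERS:
        raise ValueError("at least 30 listeners are required")
    if any(r not in SCALE for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    return mean(ratings)


# Hypothetical panel of 30 listeners (invented ratings).
ratings = [5] * 8 + [4] * 16 + [3] * 6
mos = mean_opinion_score(ratings)
is_toll_quality = mos >= TOLL_QUALITY
print(f"MOS = {mos:.2f}, toll quality: {is_toll_quality}")
```

The sketch covers only the averaging step; the selection of participants, the test environment, and the analysis of results are procedural and are not modeled here.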

4
Voice Quality - PSQM
  • Perceptual Speech Quality Measurement
  • ITU-T Recommendation P.861
  • Algorithm approach
  • Compare coding output against known input
  • type of speakers (male, female, child)
  • loudness
  • delay
  • silence ratio
  • clipping
  • environmental noise
  • PSQM score can be converted to MOS

5
About Speech
  • Speech
  • Air pushed from the lungs past the vocal cords
    and along the vocal tract
  • The basic vibrations: the vocal cords
  • The sound is altered by the disposition of the
    vocal tract (tongue and mouth)
  • Model the vocal tract as a filter
  • The shape changes relatively slowly
  • The vibrations at the vocal cords are the
    excitation signal

6
Speech sounds
  • Voiced sound
  • The vocal cords vibrate open and closed
  • Quasi-periodic pulses of air
  • The rate of the opening and closing: the pitch
  • Unvoiced sounds
  • Forcing air at high velocities through a
    constriction
  • Noise-like turbulence
  • Show little long-term periodicity
  • Short-term correlations still present
  • Plosive sounds
  • A complete closure in the vocal tract
  • Air pressure is built up and released suddenly

7
Sampling & Quantization
  • Sampling rate is based on the desired frequency
    range/cutoff
  • 300-3800 Hz: human speech
  • Nyquist's theorem => sampling rate of 8 kHz
  • Quantization: how many bits per sample
  • quantization noise (more bits => less noise)
  • uniform quantization
  • favors the loud speaker: (11.2 - 11)/11.2 = 1.8%,
    (2.2 - 2)/2.2 = 9%
  • SNR is better for the loud speaker
  • non-uniform quantization
  • smaller quantization steps at smaller signal
  • try to achieve uniform SNR
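
The relative-error arithmetic for uniform quantization can be checked with a short sketch. It assumes a quantizer that rounds to the nearest multiple of a fixed step; the step of 0.25 used for the quiet signal is a hypothetical value chosen only to illustrate smaller steps at smaller signal levels.

```python
def uniform_quantize(x, step=1.0):
    """Round x to the nearest multiple of step (a uniform quantizer)."""
    return round(x / step) * step


def relative_error(x, step=1.0):
    """Quantization error as a fraction of the signal amplitude."""
    return abs(x - uniform_quantize(x, step)) / abs(x)


loud, quiet = 11.2, 2.2

loud_err = relative_error(loud)    # (11.2 - 11) / 11.2
quiet_err = relative_error(quiet)  # (2.2 - 2) / 2.2

# Non-uniform idea: use a finer step where the signal is small.
# The step size 0.25 is an assumption for illustration only.
quiet_fine_err = relative_error(quiet, step=0.25)

print(f"loud:  {loud_err:.1%}")
print(f"quiet: {quiet_err:.1%}")
print(f"quiet with finer step: {quiet_fine_err:.1%}")
```

With the same step size the quiet signal suffers the larger relative error, which is why non-uniform quantization spends its finer steps on small amplitudes.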

8
Types of Speech Coders
  • Waveform codecs
  • high quality, simple, large bandwidth
  • source codecs (a.k.a. vocoders)
  • model vocal tract, model parameters are sent
  • low bit rate, low quality
  • hybrid codecs
9
G.711
  • The most commonplace codec
  • Waveform coder
  • Used in circuit-switched telephone network
  • PCM, Pulse-Code Modulation
  • If uniform quantization
  • 12 bits × 8 k samples/sec = 96 kbps
  • Non-uniform quantization
  • 64 kbps, the DS0 rate
  • u-law
  • North America
  • A-law
  • Other countries, a little friendlier to lower
    signal levels
  • An MOS of about 4.3
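
The effect of non-uniform quantization can be sketched with the continuous mu-law companding formula. This is an illustration under stated assumptions, not the bit-exact G.711 encoder: the standard uses a segmented approximation of this curve, and the 8-bit quantizer below is a simplified stand-in.

```python
import math

MU = 255  # companding constant for mu-law


def mu_compress(x):
    """Logarithmically compress x in [-1, 1] into [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mu_expand(y):
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def quantize_8bit(v):
    """Uniformly quantize v in [-1, 1] to a code in 0..255."""
    return round((v + 1) / 2 * 255)


def dequantize_8bit(code):
    """Map a code in 0..255 back to [-1, 1]."""
    return code / 255 * 2 - 1


x = 0.01  # a quiet sample, relative to full scale

# 8 bits, uniform steps
uniform_out = dequantize_8bit(quantize_8bit(x))
uniform_err = abs(uniform_out - x) / x

# 8 bits, applied after compression (smaller effective steps near zero)
companded_out = mu_expand(dequantize_8bit(quantize_8bit(mu_compress(x))))
companded_err = abs(companded_out - x) / x

# Bit rates from the slide: bits per sample times 8000 samples per second
uniform_rate_bps = 12 * 8000  # 96 kbps
g711_rate_bps = 8 * 8000      # 64 kbps

print(f"uniform error:   {uniform_err:.1%}")
print(f"companded error: {companded_err:.1%}")
```

For the quiet sample, the companded path gives a much smaller relative error than plain 8-bit uniform quantization, which is how 8 bits per sample can stand in for roughly 12 uniform bits.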

10
ADPCM (Adaptive Differential PCM)
  • Waveform codec, no algorithmic delay, relatively
    high bandwidth
  • In PCM, each sample is coded independently of the
    others
  • DPCM: differential PCM
  • Assumption: voice changes slowly
  • Predicts the next sample and sends only the
    difference between the actual and predicted
    values (the prediction error)
  • Simplest DPCM: no prediction, just send the
    difference between samples N and N+1
  • ADPCM: adaptive differential PCM
  • Adaptive prediction: based on past samples
  • Adaptive quantization: not fixed bits
  • G.721: ADPCM speech at 32 kbps
  • G.726 (A-law or u-law)
  • 16, 24, 32, 40 kbps
  • MOS of 4.0 at 32 kbps
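
The simplest form of DPCM, with no predictor beyond the previous sample, can be sketched as follows. This is an assumption-laden illustration: the differences are left unquantized, so the round trip is lossless, whereas a real ADPCM coder quantizes the differences and adapts both the predictor and the step size.

```python
def dpcm_encode(samples):
    """Send each sample as its difference from the previous one."""
    prev = 0
    diffs = []
    for s in samples:
        diffs.append(s - prev)
        prev = s
    return diffs


def dpcm_decode(diffs):
    """Rebuild the samples by accumulating the differences."""
    prev = 0
    out = []
    for d in diffs:
        prev += d
        out.append(prev)
    return out


# Hypothetical slowly changing samples.
samples = [100, 102, 105, 104, 101, 99]
diffs = dpcm_encode(samples)
decoded = dpcm_decode(diffs)

# Bits per sample implied by the G.726 rates at 8000 samples per second.
bits_per_sample = {kbps: kbps * 1000 // 8000 for kbps in (16, 24, 32, 40)}

print(diffs)
print(bits_per_sample)
```

After the first value, the differences are much smaller than the samples themselves, so they can be represented with fewer bits.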

11
Analysis-by-Synthesis (AbS) Codecs
  • Hybrid codec
  • Fill the gap between waveform and source codecs
  • The most successful and commonly used
  • Time-domain AbS codecs
  • The vocal-tract prediction filter model is the
    same as in an LPC vocoder
  • The excitation is not a simple two-state,
    voiced/unvoiced model
  • Different excitation signals are attempted
  • Closest to the original waveform is selected
  • MPE, Multi-Pulse Excited (first AbS codec ..
    1982)
  • RPE, Regular-Pulse Excited
  • CELP, Code-Excited Linear Predictive
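
The analysis-by-synthesis loop can be sketched as: run each candidate excitation through the synthesis filter and keep the one whose output is closest to the original. The filter coefficient, the candidate excitations, and the plain squared-error measure below are all assumptions for illustration; real codecs also search gains and use a perceptually weighted error.

```python
def synthesize(excitation, coeffs):
    """All-pole filter: y[n] = x[n] + sum(coeffs[k] * y[n-1-k])."""
    out = []
    for n, x in enumerate(excitation):
        y = x
        for k, a in enumerate(coeffs):
            if n - 1 - k >= 0:
                y += a * out[n - 1 - k]
        out.append(y)
    return out


def squared_error(a, b):
    """Sum of squared differences between two equal-length signals."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def abs_search(target, candidates, coeffs):
    """Try every candidate excitation; return (best index, its error)."""
    errors = [squared_error(target, synthesize(c, coeffs)) for c in candidates]
    best = min(range(len(candidates)), key=errors.__getitem__)
    return best, errors[best]


coeffs = [0.9]  # hypothetical first-order vocal-tract filter
candidates = [  # hypothetical excitation signals
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, -1.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
]

# Build a target from candidate 2 plus a small offset standing in for
# the mismatch between real speech and any synthetic excitation.
target = [v + 0.01 for v in synthesize(candidates[2], coeffs)]
best_index, best_error = abs_search(target, candidates, coeffs)
print(best_index, best_error)
```

Only the index of the winning excitation (plus filter and gain information) needs to be transmitted, since the decoder holds the same candidates.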

12
G.728 LD-CELP
  • CELP codecs
  • A filter: its characteristics change over time
  • A codebook of acoustic vectors (1024 vectors)
  • A vector: a set of elements representing various
    characteristics of the excitation
  • Transmit
  • Filter coefficients, gain, a pointer to the
    vector chosen
  • Low Delay CELP
  • Backward-adaptive coder
  • Use previous samples to determine filter
    coefficients
  • Operates on 5 samples at a time
  • Delay < 1 ms (sample rate 8 kHz: 125 µs × 5 = 0.625
    ms)
  • Only the vector pointer (10 bits) is transmitted
    for every 5 samples
  • G.728 LD-CELP rate is 16 kbps (8K/5 × 10 = 16K)
  • MOS score 3.9
  • Processing intensive
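
The delay and bit-rate figures for G.728 follow directly from the numbers on the slide. A small worked check, using only those numbers (the variable names are illustrative):

```python
SAMPLE_RATE_HZ = 8000
SAMPLES_PER_VECTOR = 5
INDEX_BITS = 10

# One sample lasts 1/8000 s = 125 microseconds.
sample_period_us = 1_000_000 / SAMPLE_RATE_HZ

# Buffering 5 samples takes 5 * 125 us = 0.625 ms.
buffering_delay_ms = SAMPLES_PER_VECTOR * sample_period_us / 1000

# 8000 / 5 = 1600 vectors per second, 10 bits each.
vectors_per_second = SAMPLE_RATE_HZ // SAMPLES_PER_VECTOR
bit_rate_bps = vectors_per_second * INDEX_BITS

# A 10-bit pointer addresses 2**10 codebook entries.
codebook_size = 2 ** INDEX_BITS

print(buffering_delay_ms, bit_rate_bps, codebook_size)
```

The 10-bit pointer matches the 1024-vector codebook, and sending nothing else is possible because the filter coefficients are derived from previously decoded samples.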

13
G.723.1 ACELP (algebraic)
  • 6.3 or 5.3 kbps
  • Both mandatory
  • Can change from one to another during a
    conversation
  • The coder
  • A band-limited input speech signal
  • Sampled at 8 kHz, 16-bit uniform PCM quantization
  • Operates on blocks of 240 samples at a time
  • A look-ahead of 7.5 ms
  • A total algorithmic delay of 37.5 ms + other
    delays
  • A high-pass filter to remove any DC component

14
G.723.1 Annex A
  • Silence Insertion Descriptor (SID) frames of
    size four octets
  • The two LSBs of the first octet give the frame
    type
  • 00: 6.3 kbps, 24 octets/frame
  • 01: 5.3 kbps, 20 octets/frame
  • 10: SID frame, 4 octets/frame
  • MOS 3.8
  • At least 37.5 ms delay
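
The frame-type bits make it possible to walk a buffer of concatenated G.723.1 frames without any other length field. The sketch below assumes only the three codes listed on the slide; the value 11 is not described there, so the sketch rejects it. Function and variable names are illustrative.

```python
# Two least significant bits of the first octet -> (frame type, octets).
FRAME_TYPES = {
    0b00: ("6.3 kbps", 24),
    0b01: ("5.3 kbps", 20),
    0b10: ("SID", 4),
}


def frame_info(first_octet):
    """Return (type name, frame size in octets) for a frame's first octet."""
    code = first_octet & 0b11
    if code not in FRAME_TYPES:
        raise ValueError("frame type 11 is not covered by this sketch")
    return FRAME_TYPES[code]


def split_frames(payload):
    """Split a byte string of concatenated frames into individual frames."""
    frames = []
    pos = 0
    while pos < len(payload):
        _, size = frame_info(payload[pos])
        if pos + size > len(payload):
            raise ValueError("truncated frame")
        frames.append(payload[pos:pos + size])
        pos += size
    return frames


# Timing check: 240 samples at 8 kHz is a 30 ms frame.
frame_ms = 240 / 8000 * 1000
algorithmic_delay_ms = frame_ms + 7.5  # frame plus look-ahead
high_rate_bps = 24 * 8 / (frame_ms / 1000)  # 24 octets every 30 ms
low_rate_bps = 20 * 8 / (frame_ms / 1000)   # 20 octets every 30 ms

# Hypothetical payload: one frame of each type, zero-filled after the header.
payload = bytes([0b00] + [0] * 23) + bytes([0b01] + [0] * 19) + bytes([0b10] + [0] * 3)
sizes = [len(f) for f in split_frames(payload)]
print(sizes, algorithmic_delay_ms, round(high_rate_bps), round(low_rate_bps))
```

The packed frame sizes work out to 6400 and about 5333 bits per second on the wire, which correspond to the nominal 6.3 and 5.3 kbps modes.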

15
G.729, G.729A, G.729B
  • 8 kbps
  • Input frames of 10 ms, 80 samples for 8 kHz
    sampling
  • 5 ms look-ahead => algorithmic delay of 15 ms (10
    ms + 5 ms)
  • An 80-bit frame for 10 ms of speech
  • G.729 is a complex codec with MOS of 4.0
  • G.729A is a simplified version with MOS of 3.7
  • G.729B
  • VAD, Voice Activity Detection
  • current frame plus two preceding frames (i.e. 3
    silence frames)
  • DTX, Discontinuous Transmission
  • Send nothing or send an SID frame
  • SID frame (15 bits) used to generate comfort
    noise
  • CNG, Comfort Noise Generation
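
The saving from discontinuous transmission can be sketched by deciding, for each 10 ms frame, whether to send a speech frame, a SID frame, or nothing. The rule used here (send one SID frame at the start of each silent run, then nothing) is a simplifying assumption, not the G.729B procedure, and the VAD decisions are supplied as made-up input rather than computed.

```python
FRAME_MS = 10
SPEECH_BITS = 80  # one G.729 frame
SID_BITS = 15     # one SID frame


def dtx_schedule(vad_flags):
    """Per frame: 'speech', 'sid' (first frame of a silent run), or None."""
    schedule = []
    prev_active = True
    for active in vad_flags:
        if active:
            schedule.append("speech")
        elif prev_active:
            schedule.append("sid")
        else:
            schedule.append(None)
        prev_active = active
    return schedule


def bits_sent(schedule):
    """Total bits transmitted for a schedule."""
    sizes = {"speech": SPEECH_BITS, "sid": SID_BITS, None: 0}
    return sum(sizes[item] for item in schedule)


# Continuous rate: 80 bits every 10 ms.
continuous_bps = SPEECH_BITS * 1000 // FRAME_MS

# Hypothetical VAD output for six frames (True means speech detected).
vad = [True, True, False, False, False, True]
schedule = dtx_schedule(vad)
with_dtx = bits_sent(schedule)
without_dtx = SPEECH_BITS * len(vad)
print(schedule, with_dtx, without_dtx)
```

The receiver uses the SID frame to drive comfort noise generation during the frames for which nothing arrives.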

16
GSM Adaptive Multi-Rate (AMR)
  • Eight different modes
  • 4.75 kbps to 12.2 kbps
  • 12.2 kbps, GSM EFR
  • 7.4 kbps, IS-641 (TDMA cellular systems)
  • Change the mode at any time
  • Offer discontinuous transmission
  • The coding choice of many 3G wireless networks

17
Tones, Signals, and DTMF Digits
  • The hybrid codecs are optimized for human speech
  • Other data may need to be transmitted
  • Tones: fax tones, ringing, busy, congestion, etc.
  • DTMF digits for two-stage dialing or voice-mail
  • G.711 is OK; with G.723.1 and G.729, tones can be
    unintelligible
  • The ingress gateway needs to
  • Intercept the tones and DTMF digits
  • Use an external signaling system to relay the
    info
  • Easy at the start of a call, difficult in the
    middle of a call
  • Encode in RTP packet with tone name and duration
  • Encode in RTP packet containing frequency, volume
    and duration
  • Encode in RTP payload format as redundant audio
    data and send both types of RTP payload

18
RTP Payload Format for DTMF Digits
  • Payload format

E: end of the tone; R: reserved
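
The payload diagram itself did not survive in this transcript. The sketch below assumes the four-octet named-event layout of RFC 2833, which the slide does not cite: an 8-bit event code, the E bit, the R bit, a 6-bit volume, and a 16-bit duration in timestamp units. The function names are illustrative.

```python
import struct

# Event codes for DTMF digits, as assumed from RFC 2833.
DTMF_EVENTS = {str(d): d for d in range(10)}
DTMF_EVENTS.update({"*": 10, "#": 11, "A": 12, "B": 13, "C": 14, "D": 15})


def pack_dtmf(digit, end, volume, duration):
    """Build the 4-octet payload: event, E|R|volume, duration."""
    if not 0 <= volume <= 63:
        raise ValueError("volume must fit in 6 bits")
    if not 0 <= duration <= 0xFFFF:
        raise ValueError("duration must fit in 16 bits")
    flags = (0x80 if end else 0) | volume  # R bit (0x40) is left as 0
    return struct.pack("!BBH", DTMF_EVENTS[digit], flags, duration)


def unpack_dtmf(payload):
    """Parse a 4-octet payload back into its fields."""
    event, flags, duration = struct.unpack("!BBH", payload)
    return {
        "event": event,
        "end": bool(flags & 0x80),
        "reserved": (flags >> 6) & 1,
        "volume": flags & 0x3F,
        "duration": duration,
    }


# Digit 5, final packet of the tone, volume 10, 800 timestamp units
# (100 ms at an 8 kHz clock).
payload = pack_dtmf("5", end=True, volume=10, duration=800)
fields = unpack_dtmf(payload)
print(payload.hex(), fields)
```

Carrying the digit as a named event rather than as audio means the receiving gateway can regenerate a clean tone regardless of which speech codec is in use.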