Title: Advanced Speech Coding for VoIP
1 Advanced Speech Coding for VoIP
- Contents
- 1) speech codec properties
- 2) current speech coding standards
- advanced speech coding methods
- 3) speech quality factors in VoIP
- 4) wideband speech coding over VoIP
- Tero Piirainen, tp105475_at_cs.tut.fi
- Speech coding basics can be found in "Speech
Coding for VoIP" by Konsta Koppinen (VoIP seminar
20.10.2000)
2 Critical speech codec properties
- 1) Bit rate
- 2) Codec complexity
- 3) Speech quality
- intelligibility
- echo
- delay
3 Bit rate
- wireline quality speech is 64 kbps PCM coded
- human speech contains lots of redundancy; removing redundancy saves bits
- normal speech can be compressed effectively 1:10 while maintaining wireline quality, and with silence suppression up to 1:20 (a quick arithmetic check follows this list)
- advanced coding means more delay and more complexity
- background noise makes it much more difficult to code speech
- low bit rate codecs perform worse in noisy environments
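A quick arithmetic check of these figures (the 1:10 and 1:20 ratios are read from the slide; the rest is standard PCM bookkeeping):

```python
# 64 kbps wireline PCM = 8 kHz sampling x 8 bits per sample
pcm_kbps = 8_000 * 8 / 1_000          # 64 kbps
print(pcm_kbps / 10)                  # ~6.4 kbps at a 1:10 compression ratio
print(pcm_kbps / 20)                  # ~3.2 kbps with silence suppression (1:20)
```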
4 Complexity
- better coding (lower bit rate, better quality) requires more processing
- low bit rate codecs require 20-40 MIPS
- at the same time, processing power is also needed for
- echo cancellation
- noise suppression
- etc.
- minimizing complexity means minimizing hardware / CPU MHz requirements
5 Speech quality
- Speech quality
- intelligibility
- echo
- delay
- Intelligibility can be measured by MOS
- subjective listener tests, ratings ranging from 1 to 5 (a small averaging sketch follows this list)
- 1 bad
- 2 poor
- 3 fair
- 4 good (wireline quality)
- 5 excellent
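Purely as an illustration (not from the slides): a MOS value is simply the mean of the subjective listener ratings collected for a test condition. The ratings below are hypothetical.

```python
# Hypothetical listener ratings (1-5) for two test conditions;
# the MOS of a condition is the average of its ratings.
ratings = {
    "codec_A": [4, 4, 3, 5, 4, 4],
    "codec_B": [3, 2, 3, 3, 4, 3],
}
for name, scores in ratings.items():
    mos = sum(scores) / len(scores)
    print(f"{name}: MOS = {mos:.2f}")
```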
6 Speech coding standards
- G.711
- PCM, wireline quality
- high bit rate (64 kbps)
- G.723.1
- near-wireline quality at 5.3/6.3 kbps
- chosen as the default IP telephony speech codec by the International Multimedia Teleconferencing Consortium (IMTC) Voice over IP (VoIP) forum
- heavy computation (30 MIPS)
- narrowband reference codec
- G.729A
- almost wireline quality at 8 kbps
- low delay (35 ms), low complexity
7 Speech coding standards
- G.722
- SB-ADPCM, Sub-Band Adaptive Differential Pulse Code Modulation
- wideband codec, sampling 16 kHz audio signals
- bit rate 48..64 kbps
- used in many applications that require audio frequency bandwidth coding, such as video conferencing and multimedia
- wideband reference codec
- ETSI GSM AMR
- variable rate speech coding suitable for packet networks
- based on the earlier ETSI GSM speech codecs (FR, HR, EFR)
- offers robust coding and wireline quality whilst increasing network capacity
- adaptive coding to one of eight data rate modes, 4.75...12.2 kbps
- the AMR speech coding algorithm is based on ACELP (Algebraic Code Excited Linear Prediction)
8 ITU-T SpC summary
9 ETSI SpC summary
10 Speech codecs in H.323
- H.323 mandatory codec
- G.711
- H.323 supported speech codecs
- G.722
- G.723
- G.728
- G.729
- GSM codecs (FR, EFR, AMR)
- also audio codecs (MPEG-1, ...)
11 Speech production
12 Speech Synthesis Model
13 Linear prediction
- The vocal tract forms a tube characterized by resonances, which are called formants.
- LPC analyzes the speech signal by estimating the formants (a minimal analysis sketch follows this list).
- The LPC parameters are transmitted and used as input to LPC synthesis at the receiver end.
- Because speech signals vary, LPC is done in short frames, normally 30 to 50 frames per second.
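A minimal sketch of one way LPC analysis can be done for a single frame, via the autocorrelation method and the Levinson-Durbin recursion. The frame length, window and order are illustrative choices, not taken from the slides.

```python
import numpy as np

def lpc(frame, order=10):
    """Estimate A(z) coefficients for one frame (autocorrelation method)."""
    frame = frame * np.hamming(len(frame))
    # autocorrelation values r[0..order]
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:][:order + 1]
    # Levinson-Durbin recursion
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err   # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err            # predictor polynomial and residual energy

# usage: 20 ms frame at 8 kHz -> 160 samples (random stand-in for real speech)
frame = np.random.randn(160)
a, e = lpc(frame, order=10)
```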
14 Linear Prediction
15 LPC example - unvoiced sound
16 LPC spectrum over time
(figure: LPC spectrum evolving over time; axes: frequency, time, intensity)
17 Long term correlation
- The vocal cords produce the signal, which is characterized by its intensity (loudness) and frequency (pitch).
- Long term correlation is represented by the lag. The lag is the number of samples between long-term periods in a continuous signal.
- Lag values typically range between 20 and 150 samples, corresponding to the frequency range 400-50 Hz (a simple lag-estimation sketch follows this list).
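A minimal sketch of how such a lag could be estimated for one frame by maximizing the normalized correlation between the frame and a delayed copy of itself. The 20-150 sample search range comes from the slide; the test signal is an illustrative stand-in.

```python
import numpy as np

def open_loop_lag(frame, min_lag=20, max_lag=150):
    """Pick the lag that maximizes the normalized correlation between
    the frame and a delayed copy of itself."""
    best_lag, best_score = min_lag, -np.inf
    for lag in range(min_lag, max_lag + 1):
        x, y = frame[lag:], frame[:-lag]
        score = np.dot(x, y) / (np.sqrt(np.dot(x, x) * np.dot(y, y)) + 1e-12)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

fs = 8000
t = np.arange(240) / fs
frame = np.sign(np.sin(2 * np.pi * 100 * t))    # 100 Hz "voiced" test signal
lag = open_loop_lag(frame)
print(lag, "samples ->", fs / lag, "Hz pitch estimate")   # expect 80 -> 100 Hz
```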
18 Analysis-by-Synthesis Coding
- In the analysis-by-synthesis speech coding method the encoder includes a local decoder used for speech synthesis.
- The input speech data is analyzed to obtain the required coefficients for the synthesis filters.
- Excitation vectors are generated and passed through the local decoder, i.e. the synthesis filters.
- The synthesized speech for each excitation vector is subtracted from the original speech to form an objective error.
- The objective error is spectrally weighted to obtain a perceptually more meaningful measure of the coding error. The excitation vector and gain that minimize the subjective error are selected (a minimal search sketch follows this list).
- The search for the long-term periodic component of the speech signal using the analysis-by-synthesis method can be interpreted as the use of an adaptive codebook. The codebook is indexed by the lag, and the gain corresponds to the excitation gain.
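A minimal analysis-by-synthesis search sketch (illustrative, not any standard codec): each candidate excitation is passed through a synthesis filter 1/A(z), the error against the target speech is optionally weighted, and the candidate and gain giving the smallest weighted error are kept. The weighting filter here is a generic FIR placeholder for the perceptual weighting a real codec would derive from A(z).

```python
import numpy as np
from scipy.signal import lfilter

def abs_search(target, codebook, a, w=None):
    """target: speech subframe; codebook: rows are candidate excitations;
    a: LPC polynomial A(z); w: optional FIR weighting filter."""
    target_w = lfilter(w, [1.0], target) if w is not None else target
    best = (None, 0.0, np.inf)                 # (index, gain, weighted error)
    for idx, exc in enumerate(codebook):
        synth = lfilter([1.0], a, exc)         # local decoder: 1/A(z)
        synth_w = lfilter(w, [1.0], synth) if w is not None else synth
        gain = np.dot(target_w, synth_w) / (np.dot(synth_w, synth_w) + 1e-12)
        err = np.sum((target_w - gain * synth_w) ** 2)
        if err < best[2]:
            best = (idx, gain, err)
    return best

# toy usage: 40-sample subframe, 64 random candidates, 2nd order A(z)
rng = np.random.default_rng(0)
target = rng.standard_normal(40)
codebook = rng.standard_normal((64, 40))
index, gain, err = abs_search(target, codebook, a=[1.0, -0.9, 0.4])
```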
19 Basics of Adaptive Codebook
- The LTP state memory can be interpreted as an adaptive codebook in which consecutive codevectors differ only by one new value and a shift.
- LAG is an index into the codebook/delay line and GAIN is the excitation gain.
- The excitation from the adaptive codebook is combined with a fixed excitation.
- The delay line is updated with the "best" codevector.
20 Virtual LAG
(figure: input speech with the LAG, virtual LAG and used LAG marked)
- In the simple case, the LAG is larger than the subframe length
- If smaller LAG values are used, the virtual LAG trick is needed because in the decoder only past samples are available
- For samples whose delayed counterparts would fall inside the current frame, multiples of the LAG value are used, utilizing periodicity (see the sketch after this list)
- Enhances voices with high pitch (children, female)
21 Basics of vector quantization
- Each vector (e.g. LPC coefficients) can take arbitrary values in k-dimensional space
- A vector is replaced and represented by a centroid
- A centroid is a vector in the parameter space whose distance to a cluster of vectors is at a minimum
- A discrete number of centroids gives a quantization from R^k to C, where C is the number of centroids
- The table of centroids is called a codebook
22 Basics of vector quantization
- Quantization with a codebook
- For each vector, the centroid or codevector that minimizes the distortion between the original sample and the codevector is searched (see the sketch after this list)
- The codevector is represented by its index in the codebook
- An exhaustive-search codebook has exponentially growing requirements for computation and storage complexity
- By using a specific codebook structure, the search can be sped up and the growth of complexity and storage made linear
- Example: binary tree structure
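A minimal exhaustive nearest-codevector search (illustrative; the codebook here is random, whereas a real one would be trained): each input vector is replaced by the index of the closest codevector under a squared-error distortion, and the decoder looks the codevector up again.

```python
import numpy as np

def vq_encode(x, codebook):
    """Return the index of the codevector closest to x (squared error)."""
    dist = np.sum((codebook - x) ** 2, axis=1)   # distortion to every entry
    return int(np.argmin(dist))

def vq_decode(index, codebook):
    return codebook[index]

# usage: 10-dimensional parameter vectors, 256-entry codebook (8 bits to send)
rng = np.random.default_rng(1)
codebook = rng.standard_normal((256, 10))        # stand-in for a trained codebook
x = rng.standard_normal(10)
i = vq_encode(x, codebook)
x_hat = vq_decode(i, codebook)
```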
23 Two level vector quantizer
- Speech codec parameters can be quantized as vectors instead of quantizing each parameter separately. Vector quantization results in savings in the output bit rate.
- The best quantization vector is searched from a predefined codebook.
- In order to decrease the complexity of the search, the search can be done in stages. The original vector is divided into smaller parts (vector splitting, band splitting etc.)
- For example, in the GSM HR speech codec the four best candidate vectors out of 64 are selected in a prequantizer. In the next phase, 4 x 32 vectors are evaluated and the best one is chosen (a sketch of such a staged search follows this list).
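A sketch of a two-stage (prequantizer plus main stage) search loosely following the 4-best-of-64, then 4 x 32 structure mentioned above. The way the second-stage codebooks refine the prequantizer entries is an assumption made for this sketch, not the actual GSM HR structure.

```python
import numpy as np

def two_stage_search(x, pre_codebook, main_codebooks, n_best=4):
    """pre_codebook: (64, k) first-stage entries;
    main_codebooks: dict mapping pre-index -> (32, k) refinement entries
    (assumed structure for this sketch)."""
    # stage 1: keep the n_best closest prequantizer entries
    d = np.sum((pre_codebook - x) ** 2, axis=1)
    candidates = np.argsort(d)[:n_best]
    # stage 2: evaluate n_best x 32 combinations, keep the overall best
    best = (None, None, np.inf)
    for c in candidates:
        refined = pre_codebook[c] + main_codebooks[c]     # residual refinement
        d2 = np.sum((refined - x) ** 2, axis=1)
        j = int(np.argmin(d2))
        if d2[j] < best[2]:
            best = (int(c), j, float(d2[j]))
    return best            # (prequantizer index, main index, distortion)

rng = np.random.default_rng(2)
k = 10
pre = rng.standard_normal((64, k))
main = {c: 0.1 * rng.standard_normal((32, k)) for c in range(64)}
print(two_stage_search(rng.standard_normal(k), pre, main))
```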
24 Silence suppression
- In a two-way conversation 60% of the time is only background noise; the silent periods can be suppressed without worsening quality.
- VAD (voice activity detection)
- detects voice activity; only speech is coded and transmitted
- CNG (comfort noise generation)
- completely silent periods feel uncomfortable to the receiver
- CNG algorithms fill the silent periods with generated noise
- some noise parameters are transmitted to maintain realistic noise characteristics
- Lowers the usage of bandwidth -> well suited for packet communication (a minimal VAD/CNG sketch follows this list).
- VAD performance
- If the VAD sensitivity is low, the algorithm will fail to notice the beginning of speech -> front-end clipping.
- If the VAD is too sensitive -> inefficiency.
- The performance of the VAD algorithm becomes apparent in noisy environments like an office, where there can be conversation or music as the background noise.
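A minimal energy-based VAD with comfort-noise generation, purely illustrative: standardized VADs (e.g. those in GSM/AMR) are far more elaborate, and a real system transmits only noise parameters (SID frames) so that the receiver generates the comfort noise. This toy combines classification and noise generation in one loop.

```python
import numpy as np

def vad_cng(frames, margin_db=6.0):
    """frames: iterable of equal-length sample arrays. Yields (is_speech, frame),
    where frames classified as non-speech are replaced by generated noise."""
    noise_energy = None
    rng = np.random.default_rng(0)
    for f in frames:
        e = np.mean(f ** 2) + 1e-12
        if noise_energy is None:
            noise_energy = e                              # bootstrap noise estimate
        is_speech = 10 * np.log10(e / noise_energy) > margin_db
        if is_speech:
            yield True, f                                 # code and transmit speech
        else:
            noise_energy = 0.9 * noise_energy + 0.1 * e   # track the noise floor
            cn = rng.standard_normal(len(f)) * np.sqrt(noise_energy)
            yield False, cn                               # generated comfort noise
```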
25 Example: Analysis-by-Synthesis Coding (EFR)
(block diagram; labelled signals and parameters: input speech, LPC/LSP, LAG, GAIN, pulse positions)
26 Main Blocks of GSM EFR Codec
- High-pass filtering of the input speech frame in order to remove the DC offset.
- LPC analysis is done twice per speech frame (every 10 ms), i.e. two sets of coefficients are calculated for the 10th order linear prediction filter. The LPC coefficients are represented as Line Spectrum Pairs (LSP) and quantized.
- Open-loop lag search is done twice per speech frame (every 10 ms), producing candidate LAG values.
- Closed-loop lag search is done for each subframe (every 5 ms) on the basis of the candidate lags. Optimization of the LTP parameters (LTP lag and gain) is done using an analysis-by-synthesis search (adaptive codebook). Lag values can be fractional (non-integer) with an accuracy of 1/6th. In addition, virtual lag is used, i.e. the lag can be less than the size of the subframe (pitch over 200 Hz). The lag value for the first subframe is fully coded; for the other three subframes, delta coding is utilized.
- The lag is the number of samples between long-term periods in a continuous signal. The range of lag values for the GSM EFR codec is 18-143, corresponding to the frequency range 444-56 Hz. LTP resolution is enhanced in EFR by using fractional LAG values, which are computed from an upsampled signal.
27 Main Blocks of GSM EFR Codec
- If lag values smaller than the subframe length are used, virtual lag calculation is needed because in the decoder only past samples are available. For samples whose delayed counterparts would fall inside the current frame, multiples of the lag value are used, utilizing the periodicity of the speech signal. Virtual lag enhances the high-pitched voices of children and women.
- The codec architecture is called Algebraic Code Excited Linear Prediction (ACELP). The excitation vector utilizes an algebraic codebook: the vector is formed by 10 non-zero pulses (equal to -1 or +1) for each subframe (40 samples) in order to minimize the coding error (see the sketch after this list).
- The synthesis filter includes the effect of the subjective weighting filters.
- The optimal combination of pulse positions is determined using an analysis-by-synthesis search. The codebook index, i.e. the combination of pulse positions, is calculated for each subframe.
- The gains are quantized. For the algebraic codebook gain, moving average prediction is used.
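An illustrative sketch of the algebraic-codebook idea only: 10 pulses of amplitude +/-1 placed on a 40-sample subframe so as to reduce the synthesis error. The greedy one-pulse-at-a-time placement and the toy filter are assumptions for this sketch; the real EFR search is a structured nested-loop search over position tracks with perceptual weighting.

```python
import numpy as np
from scipy.signal import lfilter

def greedy_algebraic_excitation(target, a, n_pulses=10, subframe=40):
    """Place n_pulses of +/-1 one at a time, each time choosing the position
    and sign that most reduce the squared synthesis error against target."""
    exc = np.zeros(subframe)
    for _ in range(n_pulses):
        best = (0, 1.0, np.inf)                      # (position, sign, error)
        for pos in range(subframe):
            for sign in (+1.0, -1.0):
                cand = exc.copy()
                cand[pos] += sign
                err = np.sum((target - lfilter([1.0], a, cand)) ** 2)
                if err < best[2]:
                    best = (pos, sign, err)
        exc[best[0]] += best[1]
    return exc

rng = np.random.default_rng(3)
target = rng.standard_normal(40)                     # stand-in for the target signal
exc = greedy_algebraic_excitation(target, a=[1.0, -0.9, 0.4])
```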
28 AMR in short
- Improved speech quality at variable rates
- Based on the GSM FR, HR and EFR codecs
- Ability to trade speech quality and capacity smoothly and flexibly by codec mode adaptation
- Improved robustness to errors
- Wideband option is under consideration
- Narrowband codec specifications ready in December 1998
- A link adaptation mechanism is required for measuring channel quality and selecting speech codec modes
29 Speech quality in VoIP
- These factors have a major effect on speech quality:
- Delay
- Jitter
- Packet loss
- Echo
- Tandeming
30 Delay
- a major factor in speech quality
- Preferred maximum one-way delay is 200-400 ms without echo; with echo the maximum is 25 ms
- three levels of delay (a simple budget sum is sketched below)
- algorithmic delay
- frame sampling + lookahead delay, typically 15-40 ms
- processing delay
- speech frame encoding + decoding delay, typically 5-10 ms
- communications delay
- channel delay between encoder and decoder, typically ?
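A rough one-way delay budget using the typical figures above; the communications (network) delay is a placeholder value, since the slide leaves it open.

```python
# One-way delay budget (milliseconds)
algorithmic_ms    = 25     # frame + lookahead (15-40 ms range above)
processing_ms     = 8      # encode + decode   (5-10 ms range above)
communications_ms = 80     # assumed network delay, purely illustrative
total = algorithmic_ms + processing_ms + communications_ms
print(total, "ms one-way; preferred maximum without echo: 200-400 ms")
```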
31 Jitter
- variations in delay
- data buffering must be used at the network edges, ensuring that a constant stream of speech frames can be reproduced
- jitter may vary during a VoIP call
- smart buffering, where the buffer size can be changed during a call (a minimal adaptive playout buffer is sketched below)
- the network condition is constantly measured in order to decide on making these buffer adjustments
- if the buffer size is increased, additional speech segments must be synthesised
- if the buffer size is decreased, some parts of the speech signal must be dropped
- -> these adjustments would preferably be made during silent periods
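A minimal adaptive playout (jitter) buffer sketch in the spirit of the smart buffering described above; the smoothing constants and safety margin are arbitrary choices for illustration.

```python
class JitterBuffer:
    """Holds received frames and adapts the playout delay to measured jitter."""

    def __init__(self, safety=4.0):
        self.delay_est = 0.0      # running estimate of network delay (ms)
        self.jitter_est = 0.0     # running estimate of delay variation (ms)
        self.safety = safety      # headroom in units of jitter
        self.frames = {}          # sequence number -> frame

    def push(self, seq, frame, network_delay_ms):
        # exponentially weighted estimates of delay and its variation
        self.delay_est = 0.9 * self.delay_est + 0.1 * network_delay_ms
        self.jitter_est = 0.9 * self.jitter_est + 0.1 * abs(network_delay_ms - self.delay_est)
        self.frames[seq] = frame

    def playout_delay_ms(self):
        # buffer depth adapted to jitter; in practice the change would be
        # applied during silent periods
        return self.delay_est + self.safety * self.jitter_est

    def pop(self, seq):
        # a missing frame would trigger concealment (see the packet loss slide)
        return self.frames.pop(seq, None)
```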
32 Packet loss
- a problem in heavily loaded networks, where packet loss rates may be up to 10%
- speech codecs are robust to random bit errors, but not to losing complete speech frames
- forward error correction for voice frames is not relevant in IP networks because it cannot handle losing whole packets
- interleaving cannot be used because of the extra delay
- retransmissions are not possible because of the real-time requirements
- current solutions usually include an error detection function, usually a plain CRC check, which can be used as an indication for a packet loss recovery procedure (a simple concealment sketch follows this list)
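A simple packet-loss concealment sketch (illustrative): a lost frame is replaced by a faded repetition of the last good frame. Real codecs conceal in the parameter domain (lag, gains, LSPs) rather than on raw samples.

```python
import numpy as np

def conceal_stream(frames, fade=0.7):
    """frames: iterable of sample arrays, or None for a lost packet."""
    last, attenuation = None, 1.0
    for f in frames:
        if f is not None:
            last, attenuation = f, 1.0
            yield f
        elif last is not None:
            attenuation *= fade              # mute gradually on repeated losses
            yield attenuation * last
        else:
            yield np.zeros(160)              # nothing to repeat yet (20 ms @ 8 kHz)
```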
33 Echo
- Echo cancelling should be used when the round-trip delay exceeds 50 ms
- in VoIP, echo cancelling becomes complicated, as the delays are longer and may vary -> VoIP terminals must implement echo cancelling (a basic adaptive-filter sketch follows this list)
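A basic sketch of acoustic echo cancellation with an NLMS adaptive filter (illustrative only; deployed cancellers add double-talk detection and nonlinear processing, and must cope with long, varying delays). Filter length and step size are arbitrary.

```python
import numpy as np

def nlms_echo_cancel(far_end, microphone, taps=128, mu=0.5, eps=1e-6):
    """Subtract an adaptive estimate of the far-end echo from the microphone signal."""
    w = np.zeros(taps)                        # echo-path estimate
    x_buf = np.zeros(taps)                    # most recent far-end samples
    out = np.zeros(len(microphone))
    for n in range(len(microphone)):
        x_buf = np.roll(x_buf, 1)
        x_buf[0] = far_end[n]
        echo_est = np.dot(w, x_buf)
        e = microphone[n] - echo_est          # residual sent to the network
        w += mu * e * x_buf / (np.dot(x_buf, x_buf) + eps)   # NLMS update
        out[n] = e
    return out
```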
34 Tandem effect
- intermediate speech coding and decoding phases deteriorate speech quality considerably
- most evident in low bit rate codecs
- especially important in cases where VoIP is interworking with other networks that use speech coding (mobile networks, etc.)
- e.g. transcoding between G.723.1 and a GSM codec produces poor speech quality
- TFO (tandem-free operation)
- standardization going on in ETSI for TFO with GSM networks
- considerable savings in operational costs can be made by TFO
35 Wideband Speech
- the fundamental bandwidth limitation in the public switched telephone network prevents speech quality from being enhanced further
- -> most current codecs achieve good performance only for narrowband speech, where the audio bandwidth is limited to 3.4 kHz
- in wideband speech coding the audio bandwidth is extended to 7 kHz
- wideband coding exceeds wireline quality
36 Why Wideband Speech?
- Wideband speech supplies superior speech quality over current narrowband speech services and exceeds the quality of wireline phones
- In narrowband speech (100-3600 Hz band) important high-frequency components are lost (e.g. in 's' sounds)
- Wideband uses the 50-7000 Hz band, thus improving naturalness, presence and intelligibility
- wideband speech offers superior voice quality over the existing narrowband services (cellular systems, PSTN)
- especially suitable for applications with high quality audio parts
37 Wideband vs. narrowband
38 Wideband speech quality
- results indicate that there is a significant benefit in the wideband solution over narrowband
- the increased audio bandwidth provided by the wide speech bandwidth creates an effect of proximity between the users
- it will almost completely eliminate the feeling of "talking over a wire" of the wireline network
- Codec MOS
- GSM EFR (narrowband 3.4 kHz): 3.3
- G.722, 48 kbit/s (wideband 7 kHz): 4.06
- G.722, 56 kbit/s (wideband 7 kHz): 4.57
39 Nokia AMR WB Speech Codec
- Collaboration with VoiceAge (University of Sherbrooke in Canada)
- ACELP technology very similar to AMR NB and GSM EFR
- Multirate codec, 9 modes
- Same coding algorithm in each mode
- Very high code and data ROM reuse between modes (much better than in the AMR NB codec)
- VAD (integrated into the speech codec) and comfort noise generation
- Link Adaptation
- 9 speech codec modes
- 6.60, 8.85, 12.65, 14.25, 15.85, 18.25, 19.85, 23.05, 23.85 kbit/s
40 AMR WB Standardization in 3GPP
- Initiated based on a feasibility study in SMG11 (2Q 1999)
- Initially 9 candidates; 5 companies were qualified into the selection phase: Ericsson, the FDNS consortium (FT, DT, Nortel, Siemens), Motorola, Nokia and Texas Instruments
- The Nokia WB codec was selected as the best codec in the 3GPP TSG S4 meeting in October 2000 -> will be standardized
- The final specifications are approved in March 2001 (R4)
- The selected Nokia WB codec has also been approved to proceed into the ITU WB codec selection in March 2001
41 Speech Codec Bit Allocation into Parameter Groups
42 AMR WB Speech Quality vs. ITU G.722 WB
(figure: speech quality of the Nokia AMR WB modes at 6.6-23.85 kbit/s compared with G.722 at 48, 56 and 64 kbit/s)
43 AMR-WB Speech Quality in GSM Channel
(figure: subjective speech quality degradation as a function of channel quality, carrier-to-interference ratio C/I from error-free down to 4 dB, for AMR-WB, AMR-NB and EFR)
44 Applications for Wide Band Speech
- Wide band telephony
- AMR NB equal to PSTN speech quality
- AMR WB improves the quality and provides naturalness
- Conferencing (conversational multimedia)
- Quality improvement over the current codecs (G.722 at 48 and 56 kbit/s)
- Bit rate drops to half or less compared to G.722
- Streaming
- Low complexity, low bit rate solution for browsing type of applications
45 ITU-T wideband activity around 16 kbit/s
- In 1999, the following guidelines were considered relevant in ITU-T for the new wideband activity around 16 kbit/s (12, 16, 20, and 24 kbit/s):
- Input and output audio signals should have a bandwidth of 7 kHz at a sampling rate of 16 kHz.
- Primary signals of interest are clean speech and speech in background noise.
- High speech quality, with the objective of equivalence to G.722 at 56/64 kbit/s.
- 16 kbit/s is the main bit rate. The candidate is required to be able to scale in bit rate down to lower bit rates (less than 16 kbit/s) and up to 24 kbit/s with no fundamental changes in either the technology or the algorithm used.
- Robustness to frame erasures and random bit errors.
- Low algorithmic delay (frame size of 20 ms or integer sub-multiples).
- The applications for the new activity were considered as follows: Voice over IP (VoIP) and Internet applications, PSTN applications, mobile communications, ISDN wideband telephony, and ISDN videotelephony and video-conferencing.
46 Introduction of wideband into systems
- implementation of WB requires
- 16 kHz sampling frequency in A/D and D/A
- Acoustic design of handset
- Tandem Free Operation (TFO)