Title: Advanced Speech Coding for VoIP
1 Advanced Speech Coding for VoIP
- Contents
- 1) speech codec properties
- 2) current speech coding standards
- advanced speech coding methods
- 3) speech quality factors in VoIP
- 4) wideband speech coding over VoIP
- Tero Piirainen, tp105475_at_cs.tut.fi
- Speech coding basics can be found in "Speech
Coding for VoIP" by Konsta Koppinen (VoIP seminar
20.10.2000)
2 Critical speech codec properties
- 1) Bit rate
- 2) Codec complexity
- 3) Speech quality
- intelligibility
- echo
- delay
3 Bit rate
- wireline quality speech is 64 kbps PCM coded
- human speech contains lots of redundancy; removing redundancy saves bits
- normal speech can be compressed effectively 1:10 while maintaining wireline quality, and with silence suppression up to 1:20 (a quick arithmetic check follows this list)
- advanced coding means more delay and more complexity
- background noise makes it much more difficult to code speech
- low bit rate codecs perform worse in noisy environments
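A quick arithmetic check of these figures (the 1:10 and 1:20 ratios are read from the slide; the rest is standard PCM bookkeeping):

```python
# 64 kbps wireline PCM = 8 kHz sampling x 8 bits per sample
pcm_kbps = 8_000 * 8 / 1_000          # 64 kbps
print(pcm_kbps / 10)                  # ~6.4 kbps at a 1:10 compression ratio
print(pcm_kbps / 20)                  # ~3.2 kbps with silence suppression (1:20)
```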
4 Complexity
- better coding (lower bit rate, better quality) requires more processing
- low bit rate codecs require 20-40 MIPS
- at the same time, processing power is also needed for
- echo cancellation
- noise suppression
- etc.
- minimizing complexity means minimizing hardware / CPU MHz requirements
5 Speech quality
- Speech quality
- intelligibility
- echo
- delay
- Intelligibility can be measured by MOS
- subjective listener tests, ratings ranging from 1 to 5 (a small averaging sketch follows this list)
- 1 bad
- 2 poor
- 3 fair
- 4 good (wireline quality)
- 5 excellent
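Purely as an illustration (not from the slides): a MOS value is simply the mean of the subjective listener ratings collected for a test condition. The ratings below are hypothetical.

```python
# Hypothetical listener ratings (1-5) for two test conditions;
# the MOS of a condition is the average of its ratings.
ratings = {
    "codec_A": [4, 4, 3, 5, 4, 4],
    "codec_B": [3, 2, 3, 3, 4, 3],
}
for name, scores in ratings.items():
    mos = sum(scores) / len(scores)
    print(f"{name}: MOS = {mos:.2f}")
```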
6 Speech coding standards
- G.711
- PCM, wireline quality
- high bit rate (64 kbps)
- G.723.1
- near-wireline quality at 5.3/6.3 kbps
- chosen as the default IP telephony speech codec by the International Multimedia Teleconferencing Consortium (IMTC) Voice over IP (VoIP) forum
- heavy computation (30 MIPS)
- narrowband reference codec
- G.729A
- almost wireline quality at 8 kbps
- low delay (35 ms), low complexity
7 Speech coding standards
- G.722
- SB-ADPCM, Sub-Band Adaptive Differential Pulse Code Modulation
- wideband codec, sampling 16 kHz audio signals
- bit rate 48..64 kbps
- used in many applications that require audio frequency bandwidth coding, such as video conferencing and multimedia
- wideband reference codec
- ETSI GSM AMR
- variable rate speech coding suitable for packet networks
- based on the earlier ETSI GSM speech codecs (FR, HR, EFR)
- offers robust coding and wireline quality whilst increasing network capacity
- adaptive coding to one of eight data rate modes, 4.75...12.2 kbps
- the AMR speech coding algorithm is based on ACELP (Algebraic Code Excited Linear Prediction)
8 ITU-T SpC summary
9 ETSI SpC summary
10 Speech codecs in H.323
- H.323 mandatory codec
- G.711
- H.323 supported speech codecs
- G.722
- G.723
- G.728
- G.729
- GSM codecs (FR, EFR, AMR)
- also audio codecs (MPEG-1, ...)
11 Speech production
12 Speech Synthesis Model
13 Linear prediction
- The vocal tract forms a tube characterized by resonances, which are called formants.
- LPC analyzes the speech signal by estimating the formants (a minimal analysis sketch follows this list).
- The LPC parameters are transmitted and used as input to LPC synthesis at the receiver end.
- Because speech signals vary, LPC is done in short frames, normally 30 to 50 frames per second.
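A minimal sketch of one way LPC analysis can be done for a single frame, via the autocorrelation method and the Levinson-Durbin recursion. The frame length, window and order are illustrative choices, not taken from the slides.

```python
import numpy as np

def lpc(frame, order=10):
    """Estimate A(z) coefficients for one frame (autocorrelation method)."""
    frame = frame * np.hamming(len(frame))
    # autocorrelation values r[0..order]
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:][:order + 1]
    # Levinson-Durbin recursion
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err   # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err            # predictor polynomial and residual energy

# usage: 20 ms frame at 8 kHz -> 160 samples (random stand-in for real speech)
frame = np.random.randn(160)
a, e = lpc(frame, order=10)
```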
14 Linear Prediction
15 LPC example - unvoiced sound
16 LPC spectrum over time
(figure: LPC spectrum evolving over time; axes: frequency, time, intensity)
17 Long term correlation
- The vocal cords produce the signal, which is characterized by its intensity (loudness) and frequency (pitch).
- Long term correlation is represented by the lag. The lag is the number of samples between long-term periods in a continuous signal.
- Lag values typically range between 20 and 150 samples, corresponding to the frequency range 400-50 Hz (a simple lag-estimation sketch follows this list).
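A minimal sketch of how such a lag could be estimated for one frame by maximizing the normalized correlation between the frame and a delayed copy of itself. The 20-150 sample search range comes from the slide; the test signal is an illustrative stand-in.

```python
import numpy as np

def open_loop_lag(frame, min_lag=20, max_lag=150):
    """Pick the lag that maximizes the normalized correlation between
    the frame and a delayed copy of itself."""
    best_lag, best_score = min_lag, -np.inf
    for lag in range(min_lag, max_lag + 1):
        x, y = frame[lag:], frame[:-lag]
        score = np.dot(x, y) / (np.sqrt(np.dot(x, x) * np.dot(y, y)) + 1e-12)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

fs = 8000
t = np.arange(240) / fs
frame = np.sign(np.sin(2 * np.pi * 100 * t))    # 100 Hz "voiced" test signal
lag = open_loop_lag(frame)
print(lag, "samples ->", fs / lag, "Hz pitch estimate")   # expect 80 -> 100 Hz
```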
18 Analysis-by-Synthesis Coding
- In the analysis-by-synthesis speech coding method the encoder includes a local decoder used for speech synthesis.
- The input speech data is analyzed to obtain the required coefficients for the synthesis filters.
- Excitation vectors are generated and passed through the local decoder, i.e. the synthesis filters.
- The synthesized speech for each excitation vector is subtracted from the original speech to form an objective error.
- The objective error is spectrally weighted to obtain a perceptually more meaningful measure of the coding error. The excitation vector and gain that minimize the subjective error are selected (a minimal search sketch follows this list).
- The search for the long-term periodic component of the speech signal using the analysis-by-synthesis method can be interpreted as the use of an adaptive codebook. The codebook is indexed by the lag, and the gain corresponds to the excitation gain.
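A minimal analysis-by-synthesis search sketch (illustrative, not any standard codec): each candidate excitation is passed through a synthesis filter 1/A(z), the error against the target speech is optionally weighted, and the candidate and gain giving the smallest weighted error are kept. The weighting filter here is a generic FIR placeholder for the perceptual weighting a real codec would derive from A(z).

```python
import numpy as np
from scipy.signal import lfilter

def abs_search(target, codebook, a, w=None):
    """target: speech subframe; codebook: rows are candidate excitations;
    a: LPC polynomial A(z); w: optional FIR weighting filter."""
    target_w = lfilter(w, [1.0], target) if w is not None else target
    best = (None, 0.0, np.inf)                 # (index, gain, weighted error)
    for idx, exc in enumerate(codebook):
        synth = lfilter([1.0], a, exc)         # local decoder: 1/A(z)
        synth_w = lfilter(w, [1.0], synth) if w is not None else synth
        gain = np.dot(target_w, synth_w) / (np.dot(synth_w, synth_w) + 1e-12)
        err = np.sum((target_w - gain * synth_w) ** 2)
        if err < best[2]:
            best = (idx, gain, err)
    return best

# toy usage: 40-sample subframe, 64 random candidates, 2nd order A(z)
rng = np.random.default_rng(0)
target = rng.standard_normal(40)
codebook = rng.standard_normal((64, 40))
index, gain, err = abs_search(target, codebook, a=[1.0, -0.9, 0.4])
```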
19 Basics of Adaptive Codebook
- The LTP state memory can be interpreted as an adaptive codebook in which consecutive codevectors differ only by one new value and a shift.
- LAG is an index into the codebook/delay line and GAIN is the excitation gain.
- The excitation from the adaptive codebook is combined with a fixed excitation.
- The delay line is updated with the "best" codevector.
20 Virtual LAG
(figure: input speech with the LAG, virtual LAG and used LAG marked)
- In the simple case, the LAG is larger than the subframe length
- If smaller LAG values are used, the virtual LAG trick is needed because in the decoder only past samples are available
- For samples whose delayed counterparts would fall inside the current frame, multiples of the LAG value are used, utilizing periodicity (see the sketch after this list)
- Enhances voices with high pitch (children, female)
21 Basics of vector quantization
- Each vector (e.g. LPC coefficients) can take arbitrary values in k-dimensional space
- A vector is replaced and represented by a centroid
- A centroid is a vector in the parameter space whose distance to a cluster of vectors is at a minimum
- A discrete number of centroids gives a quantization from R^k to C, where C is the number of centroids
- The table of centroids is called a codebook
22 Basics of vector quantization
- Quantization with a codebook
- For each vector, the centroid or codevector that minimizes the distortion between the original sample and the codevector is searched (see the sketch after this list)
- The codevector is represented by its index in the codebook
- An exhaustive-search codebook has exponentially growing requirements for computation and storage complexity
- By using a specific codebook structure, the search can be sped up and the growth of complexity and storage made linear
- Example: binary tree structure
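A minimal exhaustive nearest-codevector search (illustrative; the codebook here is random, whereas a real one would be trained): each input vector is replaced by the index of the closest codevector under a squared-error distortion, and the decoder looks the codevector up again.

```python
import numpy as np

def vq_encode(x, codebook):
    """Return the index of the codevector closest to x (squared error)."""
    dist = np.sum((codebook - x) ** 2, axis=1)   # distortion to every entry
    return int(np.argmin(dist))

def vq_decode(index, codebook):
    return codebook[index]

# usage: 10-dimensional parameter vectors, 256-entry codebook (8 bits to send)
rng = np.random.default_rng(1)
codebook = rng.standard_normal((256, 10))        # stand-in for a trained codebook
x = rng.standard_normal(10)
i = vq_encode(x, codebook)
x_hat = vq_decode(i, codebook)
```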
23 Two level vector quantizer
- Speech codec parameters can be quantized as vectors instead of quantizing each parameter separately. Vector quantization results in savings in the output bit rate.
- The best quantization vector is searched from a predefined codebook.
- In order to decrease the complexity of the search, the search can be done in stages. The original vector is divided into smaller parts (vector splitting, band splitting etc.)
- For example, in the GSM HR speech codec the four best candidate vectors out of 64 are selected in a prequantizer. In the next phase, 4 x 32 vectors are evaluated and the best one is chosen (a sketch of such a staged search follows this list).
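A sketch of a two-stage (prequantizer plus main stage) search loosely following the 4-best-of-64, then 4 x 32 structure mentioned above. The way the second-stage codebooks refine the prequantizer entries is an assumption made for this sketch, not the actual GSM HR structure.

```python
import numpy as np

def two_stage_search(x, pre_codebook, main_codebooks, n_best=4):
    """pre_codebook: (64, k) first-stage entries;
    main_codebooks: dict mapping pre-index -> (32, k) refinement entries
    (assumed structure for this sketch)."""
    # stage 1: keep the n_best closest prequantizer entries
    d = np.sum((pre_codebook - x) ** 2, axis=1)
    candidates = np.argsort(d)[:n_best]
    # stage 2: evaluate n_best x 32 combinations, keep the overall best
    best = (None, None, np.inf)
    for c in candidates:
        refined = pre_codebook[c] + main_codebooks[c]     # residual refinement
        d2 = np.sum((refined - x) ** 2, axis=1)
        j = int(np.argmin(d2))
        if d2[j] < best[2]:
            best = (int(c), j, float(d2[j]))
    return best            # (prequantizer index, main index, distortion)

rng = np.random.default_rng(2)
k = 10
pre = rng.standard_normal((64, k))
main = {c: 0.1 * rng.standard_normal((32, k)) for c in range(64)}
print(two_stage_search(rng.standard_normal(k), pre, main))
```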
24 Silence suppression
- In a two-way conversation 60% of the time is only background noise; the silent periods can be suppressed without worsening quality.
- VAD (voice activity detection)
- detects voice activity; only speech is coded and transmitted
- CNG (comfort noise generation)
- completely silent periods feel uncomfortable to the receiver
- CNG algorithms fill the silent periods with generated noise
- some noise parameters are transmitted to maintain realistic noise characteristics
- Lowers the usage of bandwidth -> well suited for packet communication (a minimal VAD/CNG sketch follows this list).
- VAD performance
- If the VAD sensitivity is low, the algorithm will fail to notice the beginning of speech -> front-end clipping.
- If the VAD is too sensitive -> inefficiency.
- The performance of the VAD algorithm becomes apparent in noisy environments like an office, where there can be conversation or music as the background noise.
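A minimal energy-based VAD with comfort-noise generation, purely illustrative: standardized VADs (e.g. those in GSM/AMR) are far more elaborate, and a real system transmits only noise parameters (SID frames) so that the receiver generates the comfort noise. This toy combines classification and noise generation in one loop.

```python
import numpy as np

def vad_cng(frames, margin_db=6.0):
    """frames: iterable of equal-length sample arrays. Yields (is_speech, frame),
    where frames classified as non-speech are replaced by generated noise."""
    noise_energy = None
    rng = np.random.default_rng(0)
    for f in frames:
        e = np.mean(f ** 2) + 1e-12
        if noise_energy is None:
            noise_energy = e                              # bootstrap noise estimate
        is_speech = 10 * np.log10(e / noise_energy) > margin_db
        if is_speech:
            yield True, f                                 # code and transmit speech
        else:
            noise_energy = 0.9 * noise_energy + 0.1 * e   # track the noise floor
            cn = rng.standard_normal(len(f)) * np.sqrt(noise_energy)
            yield False, cn                               # generated comfort noise
```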
25 Example: Analysis-by-Synthesis Coding (EFR)
(block diagram; labelled signals and parameters: input speech, LPC/LSP, LAG, GAIN, pulse positions)
26 Main Blocks of GSM EFR Codec
- High-pass filtering of the input speech frame in order to remove the DC offset.
- LPC analysis is done twice per speech frame (every 10 ms), i.e. two sets of coefficients are calculated for the 10th order linear prediction filter. The LPC coefficients are represented as Line Spectrum Pairs (LSP) and quantized.
- Open-loop lag search is done twice per speech frame (every 10 ms), producing candidate LAG values.
- Closed-loop lag search is done for each subframe (every 5 ms) on the basis of the candidate lags. Optimization of the LTP parameters (LTP lag and gain) is done using an analysis-by-synthesis search (adaptive codebook). Lag values can be fractional (non-integer) with an accuracy of 1/6th. In addition, virtual lag is used, i.e. the lag can be less than the size of the subframe (pitch over 200 Hz). The lag value for the first subframe is fully coded; for the other three subframes, delta coding is utilized.
- The lag is the number of samples between long-term periods in a continuous signal. The range of lag values for the GSM EFR codec is 18-143, corresponding to the frequency range 444-56 Hz. LTP resolution is enhanced in EFR by using fractional LAG values, which are computed from an upsampled signal.
27 Main Blocks of GSM EFR Codec
- If lag values smaller than the subframe length are used, virtual lag calculation is needed because in the decoder only past samples are available. For samples whose delayed counterparts would fall inside the current frame, multiples of the lag value are used, utilizing the periodicity of the speech signal. Virtual lag enhances the high-pitched voices of children and women.
- The codec architecture is called Algebraic Code Excited Linear Prediction (ACELP). The excitation vector utilizes an algebraic codebook: the vector is formed by 10 non-zero pulses (equal to -1 or +1) for each subframe (40 samples) in order to minimize the coding error (see the sketch after this list).
- The synthesis filter includes the effect of the subjective weighting filters.
- The optimal combination of pulse positions is determined using an analysis-by-synthesis search. The codebook index, i.e. the combination of pulse positions, is calculated for each subframe.
- The gains are quantized. For the algebraic codebook gain, moving average prediction is used.
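An illustrative sketch of the algebraic-codebook idea only: 10 pulses of amplitude +/-1 placed on a 40-sample subframe so as to reduce the synthesis error. The greedy one-pulse-at-a-time placement and the toy filter are assumptions for this sketch; the real EFR search is a structured nested-loop search over position tracks with perceptual weighting.

```python
import numpy as np
from scipy.signal import lfilter

def greedy_algebraic_excitation(target, a, n_pulses=10, subframe=40):
    """Place n_pulses of +/-1 one at a time, each time choosing the position
    and sign that most reduce the squared synthesis error against target."""
    exc = np.zeros(subframe)
    for _ in range(n_pulses):
        best = (0, 1.0, np.inf)                      # (position, sign, error)
        for pos in range(subframe):
            for sign in (+1.0, -1.0):
                cand = exc.copy()
                cand[pos] += sign
                err = np.sum((target - lfilter([1.0], a, cand)) ** 2)
                if err < best[2]:
                    best = (pos, sign, err)
        exc[best[0]] += best[1]
    return exc

rng = np.random.default_rng(3)
target = rng.standard_normal(40)                     # stand-in for the target signal
exc = greedy_algebraic_excitation(target, a=[1.0, -0.9, 0.4])
```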
28 AMR in short
- Improved speech quality at variable rates
- Based on the GSM FR, HR and EFR codecs
- Ability to trade speech quality and capacity smoothly and flexibly by codec mode adaptation
- Improved robustness to errors
- Wideband option is under consideration
- Narrowband codec specifications ready in December 1998
- A link adaptation mechanism is required for measuring channel quality and selecting speech codec modes
29 Speech quality in VoIP
- These factors have a major effect on speech quality:
- Delay
- Jitter
- Packet loss
- Echo
- Tandeming
30 Delay
- a major factor in speech quality
- Preferred maximum one-way delay is 200-400 ms without echo; with echo the maximum is 25 ms
- three levels of delay (a simple budget sum is sketched below)
- algorithmic delay
- frame sampling + lookahead delay, typically 15-40 ms
- processing delay
- speech frame encoding + decoding delay, typically 5-10 ms
- communications delay
- channel delay between encoder and decoder, typically ?
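A rough one-way delay budget using the typical figures above; the communications (network) delay is a placeholder value, since the slide leaves it open.

```python
# One-way delay budget (milliseconds)
algorithmic_ms    = 25     # frame + lookahead (15-40 ms range above)
processing_ms     = 8      # encode + decode   (5-10 ms range above)
communications_ms = 80     # assumed network delay, purely illustrative
total = algorithmic_ms + processing_ms + communications_ms
print(total, "ms one-way; preferred maximum without echo: 200-400 ms")
```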
31 Jitter
- variations in delay
- data buffering must be used at the network edges, ensuring that a constant stream of speech frames can be reproduced
- jitter may vary during a VoIP call
- smart buffering, where the buffer size can be changed during a call (a minimal adaptive playout buffer is sketched below)
- the network condition is constantly measured in order to decide on making these buffer adjustments
- if the buffer size is increased, additional speech segments must be synthesised
- if the buffer size is decreased, some parts of the speech signal must be dropped
- -> these adjustments would preferably be made during silent periods
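A minimal adaptive playout (jitter) buffer sketch in the spirit of the smart buffering described above; the smoothing constants and safety margin are arbitrary choices for illustration.

```python
class JitterBuffer:
    """Holds received frames and adapts the playout delay to measured jitter."""

    def __init__(self, safety=4.0):
        self.delay_est = 0.0      # running estimate of network delay (ms)
        self.jitter_est = 0.0     # running estimate of delay variation (ms)
        self.safety = safety      # headroom in units of jitter
        self.frames = {}          # sequence number -> frame

    def push(self, seq, frame, network_delay_ms):
        # exponentially weighted estimates of delay and its variation
        self.delay_est = 0.9 * self.delay_est + 0.1 * network_delay_ms
        self.jitter_est = 0.9 * self.jitter_est + 0.1 * abs(network_delay_ms - self.delay_est)
        self.frames[seq] = frame

    def playout_delay_ms(self):
        # buffer depth adapted to jitter; in practice the change would be
        # applied during silent periods
        return self.delay_est + self.safety * self.jitter_est

    def pop(self, seq):
        # a missing frame would trigger concealment (see the packet loss slide)
        return self.frames.pop(seq, None)
```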
32 Packet loss
- a problem in heavily loaded networks, where packet loss rates may be up to 10%
- speech codecs are robust to random bit errors, but not to losing complete speech frames
- forward error correction for voice frames is not relevant in IP networks because it cannot handle losing whole packets
- interleaving cannot be used because of the extra delay
- retransmissions are not possible because of the real-time requirements
- current solutions usually include an error detection function, usually a plain CRC check, which can be used as an indication for a packet loss recovery procedure (a simple concealment sketch follows this list)
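A simple packet-loss concealment sketch (illustrative): a lost frame is replaced by a faded repetition of the last good frame. Real codecs conceal in the parameter domain (lag, gains, LSPs) rather than on raw samples.

```python
import numpy as np

def conceal_stream(frames, fade=0.7):
    """frames: iterable of sample arrays, or None for a lost packet."""
    last, attenuation = None, 1.0
    for f in frames:
        if f is not None:
            last, attenuation = f, 1.0
            yield f
        elif last is not None:
            attenuation *= fade              # mute gradually on repeated losses
            yield attenuation * last
        else:
            yield np.zeros(160)              # nothing to repeat yet (20 ms @ 8 kHz)
```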
33 Echo
- Echo cancelling should be used when the round-trip delay exceeds 50 ms
- in VoIP, echo cancelling becomes complicated, as the delays are longer and may vary -> VoIP terminals must implement echo cancelling (a basic adaptive-filter sketch follows this list)
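A basic sketch of acoustic echo cancellation with an NLMS adaptive filter (illustrative only; deployed cancellers add double-talk detection and nonlinear processing, and must cope with long, varying delays). Filter length and step size are arbitrary.

```python
import numpy as np

def nlms_echo_cancel(far_end, microphone, taps=128, mu=0.5, eps=1e-6):
    """Subtract an adaptive estimate of the far-end echo from the microphone signal."""
    w = np.zeros(taps)                        # echo-path estimate
    x_buf = np.zeros(taps)                    # most recent far-end samples
    out = np.zeros(len(microphone))
    for n in range(len(microphone)):
        x_buf = np.roll(x_buf, 1)
        x_buf[0] = far_end[n]
        echo_est = np.dot(w, x_buf)
        e = microphone[n] - echo_est          # residual sent to the network
        w += mu * e * x_buf / (np.dot(x_buf, x_buf) + eps)   # NLMS update
        out[n] = e
    return out
```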
34 Tandem effect
- intermediate speech coding and decoding phases deteriorate speech quality considerably
- most evident in low bit rate codecs
- especially important in cases where VoIP is interworking with other networks that use speech coding (mobile networks, etc.)
- e.g. transcoding between G.723.1 and a GSM codec produces poor speech quality
- TFO (tandem-free operation)
- standardization going on in ETSI for TFO with GSM networks
- considerable savings in operational costs can be made by TFO
35 Wideband Speech
- the fundamental bandwidth limitation in the public switched telephone network prevents speech quality from being enhanced further
- -> most current codecs achieve good performance only for narrowband speech, where the audio bandwidth is limited to 3.4 kHz
- in wideband speech coding the audio bandwidth is extended to 7 kHz
- wideband coding exceeds wireline quality
36 Why Wideband Speech?
- Wideband speech supplies superior speech quality over current narrowband speech services and exceeds the quality of wireline phones
- In narrowband speech (100-3600 Hz band) important high-frequency components are lost (e.g. in 's' sounds)
- Wideband uses the 50-7000 Hz band, thus improving naturalness, presence and intelligibility
- wideband speech offers superior voice quality over the existing narrowband services (cellular systems, PSTN)
- especially suitable for applications with high quality audio parts
37 Wideband vs. narrowband
38 Wideband speech quality
- results indicate that there is a significant benefit in the wideband solution over narrowband
- the increased audio bandwidth provided by the wide speech bandwidth creates an effect of proximity between the users
- it will almost completely eliminate the feeling of "talking over a wire" of the wireline network
- Codec MOS
- GSM EFR (narrowband 3.4 kHz): 3.3
- G.722, 48 kbit/s (wideband 7 kHz): 4.06
- G.722, 56 kbit/s (wideband 7 kHz): 4.57
39 Nokia AMR WB Speech Codec
- Collaboration with VoiceAge (University of Sherbrooke in Canada)
- ACELP technology very similar to AMR NB and GSM EFR
- Multirate codec, 9 modes
- Same coding algorithm in each mode
- Very high code and data ROM reuse between modes (much better than in the AMR NB codec)
- VAD (integrated into the speech codec) and comfort noise generation
- Link Adaptation
- 9 speech codec modes
- 6.60, 8.85, 12.65, 14.25, 15.85, 18.25, 19.85, 23.05, 23.85 kbit/s
40 AMR WB Standardization in 3GPP
- Initiated based on a feasibility study in SMG11 (2Q 1999)
- Initially 9 candidates; 5 companies were qualified into the selection phase: Ericsson, the FDNS consortium (FT, DT, Nortel, Siemens), Motorola, Nokia and Texas Instruments
- The Nokia WB codec was selected as the best codec in the 3GPP TSG S4 meeting in October 2000 -> will be standardized
- The final specifications are approved in March 2001 (R4)
- The selected Nokia WB codec has also been approved to proceed into the ITU WB codec selection in March 2001
41 Speech Codec Bit Allocation into Parameter Groups
42 AMR WB Speech Quality vs. ITU G.722 WB
(figure: speech quality of the Nokia AMR WB modes at 6.6-23.85 kbit/s compared with G.722 at 48, 56 and 64 kbit/s)
43 AMR-WB Speech Quality in GSM Channel
(figure: subjective speech quality degradation as a function of channel quality, carrier-to-interference ratio C/I from error-free down to 4 dB, for AMR-WB, AMR-NB and EFR)
44 Applications for Wide Band Speech
- Wide band telephony
- AMR NB equal to PSTN speech quality
- AMR WB improves the quality and provides naturalness
- Conferencing (conversational multimedia)
- Quality improvement over the current codecs (G.722 at 48 and 56 kbit/s)
- Bit rate drops to half or less compared to G.722
- Streaming
- Low complexity, low bit rate solution for browsing type of applications
45 ITU-T wideband activity around 16 kbit/s
- In 1999, the following guidelines were considered relevant in ITU-T for the new wideband activity around 16 kbit/s (12, 16, 20, and 24 kbit/s):
- Input and output audio signals should have a bandwidth of 7 kHz at a sampling rate of 16 kHz.
- Primary signals of interest are clean speech and speech in background noise.
- High speech quality, with the objective of equivalence to G.722 at 56/64 kbit/s.
- 16 kbit/s is the main bit rate. The candidate is required to be able to scale in bit rate down to lower bit rates (less than 16 kbit/s) and up to 24 kbit/s with no fundamental changes in either the technology or the algorithm used.
- Robustness to frame erasures and random bit errors.
- Low algorithmic delay (frame size of 20 ms or integer sub-multiples).
- The applications for the new activity were considered as follows: Voice over IP (VoIP) and Internet applications, PSTN applications, mobile communications, ISDN wideband telephony, and ISDN videotelephony and video-conferencing.
46 Introduction of wideband into systems
- implementation of WB requires
- 16 kHz sampling frequency in A/D and D/A
- Acoustic design of handset
- Tandem Free Operation (TFO)