Title: Class 5: Speech Coding, or how to reduce bit count without losing much quality
1. Class 5: Speech Coding, or how to reduce bit count without losing much quality
2. Information rate of speech
If we neglect the issues of speech quality and individual characterization of speech (the ability to recognize the speaker), as well as other issues like coarticulation, intonation, emotion, speaking-rate variability, etc. which impinge on quality, then the ability to recognize the phonetic content should be satisfied with a rate of about 72 bits/second. Since some 40-50 different phonemes can be coded with 6 bits, and the average speaking rate is about 12 phonemes/second, we obtain 6 × 12 = 72 b/s. We clearly introduce errors in the fragmentation of speech and in its coding and processing by neglecting factors such as prosody and pitch variability; these affect the quality of the reconstituted signal.
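The arithmetic above can be checked in a few lines of Python (a sketch; the 40-50 phoneme count and the 12 phonemes/second rate are the figures quoted in the slide):

```python
import math

num_phonemes = 48                                      # slide assumes some 40-50 phonemes
bits_per_phoneme = math.ceil(math.log2(num_phonemes))  # 6 bits cover up to 64 symbols
rate = bits_per_phoneme * 12                           # at about 12 phonemes/second
print(bits_per_phoneme, rate)                          # 6, 72 b/s
```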
3. Redundancies in speech
- Redundancies to exploit in compressing speech:
- With the exception of closures, our sampling frequency Fs is much greater (>>) than the rate of change of the vocal tract
- F0 (or perceived pitch) changes slowly compared to the windowing rate
- Adjacent windows correlate rather well
- The spectral waveform changes slowly, and most of the energy is at the low end of frequencies, so it changes even more slowly there (the important part of speech)
- It is possible to model phones as periodic/noisy filtered excitation and still obtain reasonable quality
- Speech parameters may be weighted since they occur nonuniformly (different probabilities)
- The ear is insensitive to phase, so it can be discarded
4. Quantization
- We usually want to encode the speech signal in as simple a form as possible in order to do one of three things: transmit it, process it, or store it. We do this because of the finite capacity of the channel, processor, or storage medium. This concept holds for either speech-to-text (STT) or synthesis (text-to-speech or TTS), and also for speaker recognition. We use sample-and-hold electronic circuits to capture the analog value of the signal, but we must convert it to digital form by means of an analog-to-digital converter (A/D) before trying to compress the signal, and later use a digital-to-analog (D/A) converter after we decompress it to retrieve an analog signal. Remember that a signal of zero amplitude maps to the midrange of the binary scale, so that both positive and negative values may be represented.
5. Quantization noise
- The first decision when we digitize the signal is how many bits of information to use. Since we are converting an analog signal into a finite number of levels, we should expect inaccuracies in the conversion. We consider those inaccuracies to create a noise or error in the signal, since the analog value will seldom coincide exactly with a quantization level when a conservative number of bits is used. Most frequently we use between 8 and 16 bits to encode the analog sample. The larger the number of bits, the smaller the size of the step Δ and the smaller the noise, at the cost of increasing the rate of signal conversion and transmission.
6. SNR of quantization noise
- Quantization errors e(n), the differences between the actual value of the signal x(n) and the value of the digitally encoded waveform, may be compared as a ratio of the squares of x(n) and e(n), summed over the N samples in the window: SNR = Σ x²(n) / Σ e²(n). This sum is the inverse power ratio of the errors over a window, which is larger (because it is an inverse) for larger peak-to-peak (2Xmax) signals, and depends also on the step size Δ of the quanta. Taking 10 log10 of this power ratio yields decibels (dB).
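The SNR above can be illustrated with a small sketch: a uniform mid-rise quantizer applied to a full-scale sine, with the measured ratio compared against the familiar rule of thumb of roughly 6 dB per bit. The quantizer and the test signal here are illustrative choices, not from the text:

```python
import math

def quantize(x, bits, xmax=1.0):
    """Uniform mid-rise quantizer: 2**bits levels spanning [-xmax, xmax]."""
    delta = 2 * xmax / (2 ** bits)          # step size
    q = (math.floor(x / delta) + 0.5) * delta
    # clamp the topmost edge so x = +/-xmax stays inside the code range
    return max(-xmax + delta / 2, min(xmax - delta / 2, q))

def snr_db(bits, n=10000):
    """Measured SNR (dB) of quantizing a full-scale sine with `bits` bits."""
    x = [math.sin(2 * math.pi * 7 * i / n) for i in range(n)]
    e = [xi - quantize(xi, bits) for xi in x]
    return 10 * math.log10(sum(xi * xi for xi in x) / sum(ei * ei for ei in e))

for b in (8, 12, 16):
    print(b, round(snr_db(b), 1))   # roughly 6.02*b + 1.76 dB for a full-scale sine
```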
7. Considerations about samples to be quantized
- It is important to remember that, without compression, a 300-3300 Hz analog signal sampled at 8 kHz and quantized to 8 bits (256 quanta) would take about 20 times the bandwidth in digital form compared with the original analog signal.
- In addition, when vector-quantizing LPC parameters we may choose to use spectral perception metrics (mel or Bark scale) to bias the distance metric.
- The choice of the segments to be vector-quantized should come from a variety of utterances from different speakers, rich in a balance of phonemes, speaking rates, regional accents, etc., to be representative of speaker-independent speech (TIMIT). Samples are also scrambled to avoid initial biases.
- The number of codebook entries is limited by the time it takes to search it and by the desired compression.
8. Nonuniform quanta in the same sample
- So far we have assumed the same Δ for all steps in all samples, but since the relative quantization error is larger for smaller values of x(t), that is, far from Xmax, we can instead use a logarithm-like function (why not strictly logarithmic? See Fig. 7.8 in the text) to create smaller steps, and hence smaller errors, at small signal values and larger steps at large ones. This process is called companding at the source-encoding end and decompanding at the decoding (D/A) end. The net effect is to make the sum of the quantization errors smaller and more uniform percentage-wise. Fig. 7.9 in the text shows two approaches to this logarithmic scaling: the A-law and the µ-law.
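A minimal sketch of companding using the continuous µ-law curve (µ = 255, the North American telephony constant); real codecs follow a piecewise-linear approximation of this curve, which is not shown here:

```python
import math

MU = 255.0  # mu-law constant used in North American telephony

def compress(x):
    """mu-law compand a sample x in [-1, 1] (encoder side)."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def expand(y):
    """Inverse mapping (decoder side): undo the companding."""
    return math.copysign((math.exp(abs(y) * math.log1p(MU)) - 1) / MU, y)

# Small inputs are boosted before uniform quantization, so the relative
# quantization error becomes more uniform across amplitudes.
for x in (0.01, 0.1, 1.0):
    print(x, round(compress(x), 3))
```

Note how an input of 0.01 is mapped to roughly 0.23 of full scale: the uniform quantizer that follows spends far more of its levels on quiet samples than it would without companding.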
9. Where is the power in the frequency spectrum of speech in general?
Notice that the frequency scale is logarithmic in this figure. Speech in general has higher power at the lower frequencies for sonorants, and less power above 3.3 kHz, as shown here.
10. Vector Quantization and applicable approaches
- Once an utterance has been divided into T segments of N samples each, it is possible to: 1) find a reduced set of representatives for groups of the T segments (averaged over a set of many sample utterances) originating from the large set of utterances (time-domain or waveform coding); or 2) work on the spectral waveform and obtain a reduced set of representative groups of spectral parameters representing the segments of the broad set of utterances in the speech corpus. A third variation, 3), separates the segments that originate from voiced and non-sonorant parts of the speech (excitation) from those that come mostly from the vocal tract shape (such coders are called source coders or vocoders). Approaches 2) and 3) are spectral approaches.
11. Codebooks
The process to compute the set of appropriate representative codewords (called a codebook) via vector quantization is usually based on clustering the set of vector samples (regardless of what they represent) using K-means or a similar variation, to minimize the error that, in the case of speech, we call the distortion distance (the square root of the distance between a centroid and the sample). The simplest K-means (or a variation thereof) approach is to divide all samples in half and find the centroid of each half. Those two centroids are then used to divide the samples again, four centroids are found, and so on m times, yielding 2^m centroids, each indexed by m bits. Clearly a reduction from n to m bits is accomplished, where m < n. Improved distortion is obtained if the inverse of the variance of the distance is considered (Mahalanobis distance).
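The binary-splitting procedure described above (often called the LBG algorithm) can be sketched as follows on a toy 2-D training set; the perturbation factor, iteration count, and the two-cluster data are illustrative choices:

```python
import random

def centroid(pts):
    """Component-wise mean of a list of vectors."""
    return [sum(c) / len(pts) for c in zip(*pts)]

def kmeans(pts, cents, iters=20):
    """Plain K-means refinement of an initial set of centroids."""
    for _ in range(iters):
        buckets = [[] for _ in cents]
        for p in pts:
            i = min(range(len(cents)),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, cents[j])))
            buckets[i].append(p)
        cents = [centroid(b) if b else c for b, c in zip(buckets, cents)]
    return cents

def lbg(pts, m, eps=0.01):
    """Binary-split codebook design: m splits give 2**m codewords."""
    book = [centroid(pts)]
    for _ in range(m):
        # split each codeword into a slightly perturbed pair, then refine
        book = [[x * (1 + s) for x in c] for c in book for s in (eps, -eps)]
        book = kmeans(pts, book)
    return book

random.seed(0)
# toy training set: two well-separated 2-D clusters
pts = [[random.gauss(0, 0.1), random.gauss(0, 0.1)] for _ in range(200)] + \
      [[random.gauss(3, 0.1), random.gauss(3, 0.1)] for _ in range(200)]
book = lbg(pts, 1)                        # 2**1 = 2 codewords
print(sorted(round(c[0]) for c in book))  # the two cluster centers, ≈ [0, 3]
```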
12. Considerations and metrics on speech quality
When we code sample by sample it is easy to objectively evaluate the SNR or even the segmental SNR. Subjective measures are more difficult but significant, the most important of which is intelligibility (the fraction of words or phones correctly perceived). Fortunately, all but the lowest-bit-rate coders do well in intelligibility. Additional standard subjective measures are: 1) the Mean Opinion Score or MOS (a five-point opinion survey rating with midpoint at 0 dB SNR and each point worth 6 dB); 2) the opinion-equivalent Quality value or QOP for decoded speech (in which the original signal with added modulated noise is compared to the decoded signal for equivalent perceptual quality); and 3) the Articulation Equivalent Loss or AEN for synthetic speech (which lowers the intensity of the synthetic signal until 80% of the synthetic speech is intelligibly recognized). There are also the Diagnostic Rhyme Test (DRT), testing for intelligibility of consonants (0-100 scale), and the Diagnostic Acceptability Measure (DAM), which rates naturalness along perceptual scales.
13. PCM and log PCM
- Time-waveform coding methods are generally simpler to implement than spectral-waveform approaches. Both PCM and log PCM digitize one sample at a time, as the samples come, taking no advantage of the redundancies among neighboring samples, whether later ones (feedforward) or earlier ones (feedback). If we take those surrounding samples into account, some delay is unavoidable because of the buffering needed, and on-time (sample-by-sample) PCM and log PCM coding is no longer possible.
14. Differential (DPCM) or Delta (DM) Modulation?
- To take advantage of the correlation between adjacent samples of s(n), we can quantize only the amplitude difference (Δ) between adjacent samples, giving rise to DPCM (Differential PCM). This difference can be encoded with fewer bits than the samples themselves. Delta Modulation is a particularly popular case of Differential PCM in which we encode the slope of the difference with a single bit.
- Both Delta Modulation and Continuously Variable Slope Delta Modulation (CVSD) are time-waveform techniques using one bit for quantizing. An alternative to CVSD is the simpler linear Delta Modulation. See the next page for the more commonly used CVSD.
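A minimal sketch of linear Delta Modulation, where each sample is reduced to one slope bit; the fixed step size and the sine test signal are illustrative choices:

```python
import math

def dm_encode(signal, delta=0.1):
    """Linear delta modulation: one bit per sample (1 = step up, 0 = step down)."""
    bits, approx = [], 0.0
    for s in signal:
        bit = 1 if s > approx else 0
        approx += delta if bit else -delta   # encoder tracks its own reconstruction
        bits.append(bit)
    return bits

def dm_decode(bits, delta=0.1):
    """Rebuild the staircase approximation from the bit stream."""
    out, approx = [], 0.0
    for bit in bits:
        approx += delta if bit else -delta
        out.append(approx)
    return out

# a slow sine: its per-sample slope (~0.063) is below delta, so DM can track it
sig = [math.sin(2 * math.pi * i / 100) for i in range(200)]
rec = dm_decode(dm_encode(sig))
err = max(abs(a - b) for a, b in zip(sig, rec))
print(round(err, 2))
```

If delta is made much smaller the staircase can no longer keep up with the sine's slope (slope overload); if it is made larger the flat parts of the signal pick up granular noise. That trade-off is exactly what the adaptive step of the next slide addresses.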
15. Adaptive Delta Modulation and Continuously Variable Slope Delta Modulation (ADM/CVSD)
- ADM/CVSD is delta modulation with a continuously variable, adaptive quantizer, thus allowing small signals to be represented with accuracy without sacrificing the accuracy of larger ones. In Linear Delta Modulation, for input s(n), a zero bit may be used for a negative slope and a 1 for a positive one, and one fixed Δ is added to (or subtracted from) the cumulative s(n-1) when decoding to reproduce the sequence.
- In CVSD we adjust (adapt) Δ to avoid slope overload and granular noise in the encoded signal. An excellent tutorial from MX.COM, Inc. has been placed on the web site, or is available from the web under "Continuously Variable Slope Delta Modulation: A Tutorial"; it is required reading. (See Homework.)
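The adaptation idea can be sketched as follows: a run of identical bits signals slope overload and grows Δ, while mixed bits shrink Δ toward a floor. The run length of 3 and the adaptation constants here are illustrative, not those of any standard or of the MX.COM tutorial:

```python
import math

def _adapt(bits, i, delta, d_min, d_max, grow, decay):
    """Grow the step on a run of 3 equal bits (slope overload), else shrink it."""
    if i >= 2 and bits[i] == bits[i - 1] == bits[i - 2]:
        return min(d_max, delta * grow)
    return max(d_min, delta * decay)

def cvsd_encode(signal, d_min=0.05, d_max=1.0, grow=1.5, decay=0.9):
    bits, approx, delta = [], 0.0, d_min
    for s in signal:
        bits.append(1 if s > approx else 0)
        delta = _adapt(bits, len(bits) - 1, delta, d_min, d_max, grow, decay)
        approx += delta if bits[-1] else -delta
    return bits

def cvsd_decode(bits, d_min=0.05, d_max=1.0, grow=1.5, decay=0.9):
    # the decoder sees the same bits, so it replicates the encoder's step size
    out, approx, delta = [], 0.0, d_min
    for i, bit in enumerate(bits):
        delta = _adapt(bits, i, delta, d_min, d_max, grow, decay)
        approx += delta if bit else -delta
        out.append(approx)
    return out

# a sine steep enough that fixed-step DM at d_min would slope-overload badly
sig = [math.sin(2 * math.pi * i / 40) for i in range(120)]
rec = cvsd_decode(cvsd_encode(sig))
err = max(abs(a - b) for a, b in zip(sig, rec))
print(round(err, 2))
```

Note that the adaptation depends only on the transmitted bits, so encoder and decoder stay in lockstep with no side information: this is the key property of CVSD.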
16. Summary on subjective quality metrics for speech
17. Most Commonly Used Time Waveform Coders
18. Comments on time-based waveform coders
We can improve SNR and rate by adding logarithmic quantizing to plain PCM. We can obtain lower complexity and a lower bit rate, at the price of lower (less than toll) quality, by using adaptive delta modulation with continuously variable slope.
19. Low-Delay Code-Excited Linear Prediction (LD-CELP)
Since low-bit-rate coders and decoders go through buffering and computation, they incur delays unacceptable for real time. Lower delays are obtained by making the excitation vectors very short (5 samples, or 0.625 ms). LD-CELP also uses backward adaptation for the predictor. It yields quality equal to 32 kb/s ADPCM, increasing the computation but at half the bit rate. Standardized as ITU-T G.728 for speech coding, it operates at 16 kbit/s. The linear prediction is calculated backwards with a 50th-order LPC filter. The excitation is generated with gain-scaled VQ. The complexity of the codec is 30 MIPS, and 2 kilobytes of RAM are needed for codebooks.
20. Most Common Spectral Waveform Coders and Comments
In this and the preceding and following slides we present the 11 most common types of speech coders, grouped in 3 categories. Coders using the speech spectrum are classified into waveform coders and vocoders (next chart). Fidelity and SNR (MOS 4/5) are emphasized in waveform coders, at the cost of higher bit rates, when compared with vocoders, where the emphasis is on subjective quality at low bit rates. Low-Delay Code-Excited Linear Prediction coders are popular spectral waveform coders using vector quantization.
21. Channel, Homomorphic, or Linear Predictive Models of Vocoders
- All vocoders require a binary voicing decision (1 bit in the side information).
- Channel vocoders exploit the equal separation of harmonics by transmitting only the magnitude of F0 (5-6 bits) and some 5 bits/channel. See page 26.
- Homomorphic (Cepstral) vocoders: called homomorphic because a nonlinear mapping (the logarithm) takes the signal to a different domain (separating the excitation e(n) and the vocal tract response h(n) with time windows), where linear filtering techniques are applied, followed by an (exponential) mapping back to the original domain. See page 28.
- LP vocoders: many vocoders use LP in combination with other techniques such as VQ.
22. Linear Predictive Coding Revisited (see pages 24, 25, 27 in Class 4)
LPC is a general technique for data compression that can be applied to any waveform, time-based or spectral. It consists of representing the samples of a segment of the waveform by the coefficients of an all-pole (autoregressive) predictor. We have seen that we use the poles in its autoregressive (AR) computation and neglect the zeros, and therefore the phase of the signal (why?). Given that the result of the LPC computation is representative of the vocal tract originating the signal, we saw that we could code a voiced or unvoiced segment with pulsed or noise excitation of the LPC filter (as we will see next, this can be used in vocoders also).
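The LPC computation itself (autocorrelation followed by the Levinson-Durbin recursion to obtain the AR coefficients) can be sketched as follows; the test signal is a synthetic 2nd-order AR process with known coefficients rather than real speech:

```python
import math
import random

def autocorr(x, order):
    """Autocorrelation lags r[0..order] of the windowed signal x."""
    return [sum(x[i] * x[i + k] for i in range(len(x) - k)) for k in range(order + 1)]

def levinson_durbin(r, order):
    """Solve the normal equations for the prediction polynomial A(z)."""
    a, e = [1.0] + [0.0] * order, r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / e                      # reflection coefficient
        a = [a[j] + k * a[i - j] if 0 < j < i else a[j] for j in range(order + 1)]
        a[i] = k
        e *= 1 - k * k                    # remaining prediction-error energy
    return a, e

random.seed(0)
# synthetic "speech": x[n] = 0.5 x[n-1] - 0.3 x[n-2] + white noise,
# so the true prediction polynomial is A(z) = 1 - 0.5 z^-1 + 0.3 z^-2
x = [0.0, 0.0]
for _ in range(5000):
    x.append(0.5 * x[-1] - 0.3 * x[-2] + random.gauss(0, 1))

a_est, err = levinson_durbin(autocorr(x, 2), 2)
print([round(c, 2) for c in a_est])  # ≈ [1.0, -0.5, 0.3]
```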
23. Vector-Sum-Excited Linear Prediction (VSELP)
- VSELP is a speech coding method used in the IS-54 standard for TDMA cell phones in the United States, with a variation used in Japan. It was also used in the first version of RealAudio for audio over the Internet. IS-54 VSELP specifies an encoding of each 20 ms of speech into 159-bit frames, thus achieving a raw data rate of 7.95 kbit/s. In an actual TDMA cell phone, the vocoder output is packaged with error correction and signaling information, resulting in an over-the-air data rate of 16.2 kbit/s. For Internet audio, each 159-bit frame is stored in 20 bytes, leaving 1 bit unused. The resulting file thus has a data rate of exactly 8 kbit/s. A major drawback of VSELP is its limited ability to encode non-speech sounds, so it performs poorly when encoding speech in the presence of background noise. For this reason, use of VSELP is being gradually phased out in favor of newer codecs. (From Wikipedia, with changes.)
24. CODECs (COder-DECoders)
- A CODEC is an algorithm, which may be implemented in hardware or, most frequently today, in software, for COding and DECoding a digital or (less frequently now) analog signal, often to transform it for transmission, storage, error correction, or encryption, although most frequently for compression. The quality of the reproduction (audio, video, data) is an important consideration in the process and relates to the bit rate of the CODEC. There are lossy and lossless CODECs, the latter used mostly for archiving data. The computational requirements of the algorithm's implementation (in MIPS) are another important consideration, particularly if real time is required.
25. Code-Excited Linear Prediction (CELP) vocoder
- CELP is a term used mostly to designate a class of algorithms for coding speech in the 4-8 kb/s range (or at 16 kb/s for almost toll quality), based mostly on standard linear prediction filters but with a two-codebook (one fixed, the other adaptive) VQ-guided excitation in a perceptually weighted domain. VQ is applied separately to the excitation and to the residual (10-bit codebook). Variations of CELP include LD-CELP, CS-CELP, and CS-ACELP, among others. The success of the CELP approach is due to its superior excitation model. It also requires an F0 estimator. The excitation is optimized by minimizing the squared error.
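The analysis-by-synthesis idea behind CELP can be sketched with a single fixed codebook and a toy 1st-order synthesis filter (a real CELP coder adds an adaptive codebook, perceptual weighting, and pitch prediction, all omitted here):

```python
import random

def synthesize(excitation, a):
    """Run an excitation through an all-pole LPC synthesis filter 1/A(z)."""
    out = []
    for e in excitation:
        s = e - sum(a[k] * out[-1 - k] for k in range(len(a)) if len(out) > k)
        out.append(s)
    return out

def celp_search(target, codebook, a):
    """Analysis-by-synthesis: pick the codeword (and gain) whose synthesized
    output is closest to the target segment in squared error."""
    best = None
    for idx, cw in enumerate(codebook):
        syn = synthesize(cw, a)
        num = sum(t * s for t, s in zip(target, syn))   # least-squares optimal
        den = sum(s * s for s in syn) or 1e-12          # gain for this codeword
        g = num / den
        e = sum((t - g * s) ** 2 for t, s in zip(target, syn))
        if best is None or e < best[0]:
            best = (e, idx, g)
    return best

random.seed(1)
a = [-0.9]                      # toy 1st-order synthesis filter coefficients
codebook = [[random.gauss(0, 1) for _ in range(10)] for _ in range(64)]
# build the target from a known codeword, so the search should recover index 7
target = [2.0 * s for s in synthesize(codebook[7], a)]
err, idx, gain = celp_search(target, codebook, a)
print(idx, round(gain, 2))  # 7 2.0
```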
26. Conjugate-Structure Algebraic Code-Excited LP (CS-ACELP)
- In algebraic CELP (ACELP), the residual samples are not VQ-ed but derived directly from an algebraic computation, to be used in exciting the LP synthesizer, accelerating the search for the optimal excitation. The main advantage of ACELP is that the algebraic codebook it uses can be made very large (> 50 bits) without running into storage or CPU-time problems. A 16-bit algebraic codebook is used in the innovative codebook search, the aim of which is to find the best innovation and gain parameters. The innovation vector contains, at most, four non-zero pulses. In ACELP, a block of N speech samples is synthesized by filtering an appropriate innovation sequence from a codebook, scaled by a gain factor, through two time-varying filters: a long-term (pitch) synthesis filter and a shorter-term synthesis filter. The Conjugate-Structure ACELP can yield toll-quality speech with a 10th-order LPC, among other requirements.
27. Channel Vocoder (CV)
The input signal is filtered by a number of non-overlapping, contiguous, non-uniform-width band-pass filters, plus the output of voicing and pitch detectors (5-6 bits for F0), which allows one-bit switching at the synthesis end. The example is from the telephony band, with the output of the filters rectified and filtered by 0-20 Hz low-pass filters (requiring at least 40 samples/sec).
28. LPC U.S. Government 10E Algorithm
Also known as G.728, an ITU-T standard for speech coding operating at 16 kbit/s. The technology used is LD-CELP, Low-Delay Code-Excited Linear Prediction. The delay of the codec is only 5 samples (0.625 ms). The linear prediction is calculated backwards with a 50th-order LPC filter. The excitation is generated with gain-scaled VQ. The standard was finished in 1992 in the form of algorithm-exact floating-point code; in 1994 a bit-exact fixed-point codec was released. G.728 passes low-bit-rate modem signals up to 2400 bit/s, and network signaling also goes through. The complexity of the codec is 30 MIPS, and 2 kilobytes of RAM are needed for codebooks. (From Wikipedia.)
29. Homomorphic or Cepstral Vocoders
Cepstral vocoders see little use because they produce synthetic-quality speech at high complexity, but they do have a low bit rate (4.8 kb/s).
30. Most Commonly Used Vocoders
(Table of vocoders; annotations in the original chart: "(Low use)" and "Use Vector Quantization".)
31. Comments on vocoders
Vocoders function on the model of speech that separates the excitation and the vocal tract response. In general, a vocoder's main bit stream consists of a crude but effective model that includes 1) a voiced/unvoiced bit, 2) F0 information, and 3) the spectrum amplitude or, equivalently, the window's average energy. VQ further helps reduce the bit rate significantly, and in CELP all spectral and prosodic parameters are coded as a group in the speech frame, exploiting interparameter redundancies.
33. Homework 4 (Midterm on second half of class on October 10)
- Read the tutorial from MX.com and answer the following questions: a) What is the compromise when the sampling frequency gets closer to the Nyquist-Shannon limit? b) In Fig. 4, what is the meaning of Si 0 dB? c) What are the two kinds of Delta Modulation? Briefly describe each in your own words.
- Compare the advantages and disadvantages of DPCM, DM, ADM, and ADM/CVSD. Explain the concepts of adaptive, differential, and delta in modulation types.
- What are the advantages and disadvantages of increasing the number of channels in a channel vocoder?
- Give at least one example of coding techniques, and their bit rates, used in telephony, mobile or cell phones, the Internet, and CD recording. Give your source of information.