Class 5: Speech Coding, or how to reduce bit count without much loss of quality

Provided by: ogar1
Transcript and Presenter's Notes



1
Class 5: Speech Coding, or how to reduce bit
count without much loss of quality
2
Information rate of speech
If we neglect the issues of speech quality and
individual characterization of speech (the
ability to recognize the speaker), as well as
other issues such as coarticulation, intonation,
emotion, speaking-rate variability, etc., which
impinge on quality, then the ability to recognize
the phonetic content should be satisfied with a
rate of about 72 bits/second: since the 40-50
different phonemes can be coded with 6 bits, and
the average speaking rate is about 12
phonemes/second, we obtain 6 × 12 = 72 b/s. We
clearly introduce errors in this fragmentation,
coding and processing of speech by neglecting
factors such as prosody and pitch variability,
and these affect the quality of the reconstituted
signal.
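The arithmetic above can be checked with a short Python sketch (the function name and defaults are ours, taken from the figures quoted in the slide):

```python
import math

def phonetic_bit_rate(n_phonemes=50, phonemes_per_sec=12):
    """Bits/second needed to transmit phoneme identity alone."""
    bits_per_phoneme = math.ceil(math.log2(n_phonemes))  # 6 bits cover up to 64 symbols
    return bits_per_phoneme * phonemes_per_sec

print(phonetic_bit_rate())  # 6 x 12 = 72 bits/second
```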
3
Redundancies in speech
  • Redundancies to exploit in compressing speech
  • With the exception of closures, our sampling
    frequency Fs is >> the rate of change of the
    vocal tract
  • F0 (or perceived pitch) changes slowly as
    compared to windowing rate
  • Adjacent windows correlate rather well
  • Spectral waveform changes slowly and most of the
    energy is at the low end of frequencies so it
    changes even more slowly there (important part of
    speech)
  • It is possible to model phones as periodic/noisy
    filtered excitation and still obtain reasonable
    quality
  • Speech parameters may be weighted since they
    occur nonuniformly (different probabilities)
  • The ear is insensitive to phase so it can be
    discarded

4
Quantization
  • We usually want to encode the speech signal in as
    simple a form as possible in order to do one of
    three things: transmit it, process it, or store
    it. We do this because of the finite capacity of
    the channel, processor or storage media. This
    concept holds for either
    speech-to-text (STT) or synthesis (
    text-to-speech or TTS) and also for speaker
    recognition. We use sample and hold electronic
    circuits to find the analog value of the signal
    but we must convert it to digital form by means
    of an analog to digital converter (A/D) before
    trying to compress the signal and later a digital
    to analog (D/A) converter after we decompress it
    to retrieve an analog signal. Remember that a
    signal of zero amplitude goes midrange of the
    binary scale so positive and negative values may
    be represented.

5
Quantization noise
  • The first decision when we digitize the signal is
    how many bits of information to use. Since we are
    converting an analog signal into a finite number
    of levels, we should expect inaccuracies in the
    conversion. We consider those inaccuracies to be
    a noise or error in the signal, since with a
    conservative number of bits the analog value will
    seldom coincide exactly with one of the
    quantization levels. Most frequently we use
    between 8 and 16 bits to encode the analog
    sample. The larger the number of bits, the
    smaller the size of the step Δ and the smaller
    the noise, at the cost of increasing the rate of
    signal conversion and transmission.

6
SNR of quantization noise
  • Quantization errors e(n), the differences between
    the actual value of the signal x(n) and the value
    of the digitally encoded waveform, may be
    compared as a ratio of the sums of the squares of
    x(n) and e(n) over the N samples in the window.
    This sum is the inverse power ratio of the errors
    over a window, which is larger (because it is an
    inverse) for larger peak-to-peak (2Xmax) signals
    and also depends on the size Δ of the
    quantization steps. Taking 10·log10 of this power
    ratio yields decibels (dB).
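The SNR computation described above can be sketched in Python for a uniform quantizer (a minimal sketch; the mid-rise rounding rule is our assumption):

```python
import numpy as np

def quantization_snr_db(x, n_bits, x_max):
    """Empirical SNR in dB of an n_bits uniform (mid-rise) quantizer."""
    delta = 2 * x_max / 2 ** n_bits              # step size over the 2*Xmax range
    xq = delta * (np.floor(x / delta) + 0.5)     # quantized waveform
    e = x - xq                                   # quantization error e(n)
    return 10 * np.log10(np.sum(x ** 2) / np.sum(e ** 2))

# A full-scale sine gains roughly 6 dB per added bit (~6.02B + 1.76 dB).
t = np.arange(8000) / 8000
x = np.sin(2 * np.pi * 100 * t)
for b in (8, 12, 16):
    print(b, "bits:", round(quantization_snr_db(x, b, 1.0), 1), "dB")
```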

7
Considerations about samples to be quantized
  • It is important to remember that without
    compression a 300-3300 Hz analog signal sampled
    at 8 kHz and quantized to 8 bits (256 quanta)
    would take about 20 times the bandwidth in
    digital form that the original analog signal
    takes
  • In addition, when vector-quantizing LPC
    parameters we may choose to use spectral
    perception metrics (mel or bark scale) to bias
    the distance metric.
  • The choice of the segments to be vector-quantized
    hopefully comes from a variety of utterances from
    different speakers, rich in a balance of
    phonemes, speaking rates, regional accents, etc.,
    to be representative of speaker-independent
    speech (e.g., TIMIT). Samples are also scrambled
    to avoid initial biases.
  • The number of codebook entries is limited by the
    time it takes to search it and the desired
    compression.
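The factor-of-20 claim in the first bullet above can be reproduced with a rough calculation (assuming, crudely, binary signalling at about 1 bit/s per Hz of channel bandwidth):

```python
def bandwidth_expansion(fs_hz=8000, bits_per_sample=8, analog_bw_hz=3000):
    """Ratio of digital-channel bandwidth to the original analog bandwidth."""
    bit_rate = fs_hz * bits_per_sample        # 64,000 bit/s of raw PCM
    digital_bw_hz = bit_rate                  # ~1 bit/s per Hz assumed
    return digital_bw_hz / analog_bw_hz

print(round(bandwidth_expansion(), 1))  # ~21x, roughly the 20x quoted
```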

8
Nonuniform quanta in the same sample
  • So far we have assumed the same Δ for all steps
    in all samples, but since the relative
    quantization error is larger for smaller values
    of x(t), that is, far from Xmax, we can instead
    use a logarithm-like function (why not exactly
    logarithmic? See Fig. 7.8 in the text) to obtain
    smaller steps, and hence smaller errors, at small
    signal values, and larger ones at large values.
    This process is called companding at the source
    (encoding) end and decompanding at the decoding
    (D/A) end. The net effect is to make the sum of
    the quantization errors smaller and more uniform
    percentage-wise. Fig. 7.9 in the text shows two
    approaches to this logarithmic scaling, the A-law
    and the µ-law.
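The µ-law curve mentioned above can be sketched as follows (a minimal continuous µ-law compander with µ = 255, as used in telephony; the G.711 standard additionally quantizes the companded value to 8 bits, which this sketch omits):

```python
import numpy as np

def mu_law_compress(x, mu=255.0):
    """Companding: boost small |x|, compress large |x| (x in [-1, 1])."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_law_expand(y, mu=255.0):
    """Decompanding at the decoding (D/A) end: exact inverse mapping."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

x = np.linspace(-1.0, 1.0, 9)
y = mu_law_compress(x)
print(np.allclose(mu_law_expand(y), x))  # True: lossless before quantizing
```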

9
Where is the power in the frequency spectrum of
speech in general?
Notice that the frequency scale is logarithmic in
this figure. Speech in general has higher power at
the lower frequencies (sonorants) and less power
above 3.3 kHz, as shown here.
10
Vector Quantization and applicable approaches
  • Once an utterance has been divided into T
    segments of N samples each, it is possible to
    1) find a reduced set of representatives for
    groups of the T segments (averaged over a large
    set of sample utterances), i.e. time-domain or
    waveform coding; or 2) work on the spectral
    waveform and obtain a reduced set of
    representative groups of spectral parameters
    representing the segments of the broad set of
    utterances in the speech corpus. A third
    variation, of 1), consists of 3) separating the
    segments that originate from the excitation
    (voiced and non-sonorant parts of the speech)
    from those that come mostly from the vocal tract
    shape (called source coders or vocoders). Both
    2) and 3) use spectral approaches.

11
Codebooks
The process of computing the set of appropriate
representative codewords (called a codebook) via
vector quantization is usually based on clustering
the set of vector samples (regardless of what they
are) using K-means or a similar variation to
minimize the error that, in the case of speech, we
call the distortion distance (the square root of
the distance between a centroid and the sample).
The simplest approach, a binary-split variation of
K-means, is to divide all samples in half and find
the centroid of each half. Those two centroids are
then used to divide the samples again, yielding
four centroids, and so on m times to obtain 2^m
centroids, each indexable with m bits. Clearly a
reduction of n - m bits is accomplished, where
m < n. Improved distortion is obtained if the
inverse of the variance of the distance is
considered (Mahalanobis distance).
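The binary-split clustering described above can be sketched as follows (an LBG-style procedure; the additive split perturbation and the iteration count are our choices):

```python
import numpy as np

def train_codebook(samples, m, n_iter=10, eps=1e-3):
    """Grow a VQ codebook from 1 to 2**m centroids by binary splitting."""
    codebook = samples.mean(axis=0, keepdims=True)    # one global centroid
    for _ in range(m):                                # m doubling stages
        # split every centroid into a slightly perturbed pair
        codebook = np.vstack([codebook + eps, codebook - eps])
        for _ in range(n_iter):                       # K-means refinement
            dist = np.linalg.norm(samples[:, None] - codebook[None], axis=2)
            nearest = dist.argmin(axis=1)
            for k in range(len(codebook)):
                members = samples[nearest == k]
                if len(members):                      # keep empty cells as-is
                    codebook[k] = members.mean(axis=0)
    return codebook                                   # index with m bits
```

Each input vector is then transmitted as the m-bit index of its nearest codeword.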
12
Considerations and metrics on speech quality
When we code sample by sample it is easy to
objectively evaluate the SNR or even the segmental
SNR. Subjective measures are more difficult but
significant, the most important of which is
intelligibility (the fraction of words or phones
correctly perceived). Fortunately, all but the
lowest-bit-rate coders do well in intelligibility.
Three standard additional subjective measures are
1) the Mean Opinion Score or MOS (a five-point
opinion survey rating with its midpoint at 0 dB
SNR and each point worth 6 dB); 2) the
opinion-equivalent Quality value or QOP for
decoded speech (in which the original signal with
added modulated noise is compared to the decoded
signal for equivalent perceptual quality); and
3) the Articulation Equivalent Loss or AEN for
synthetic speech (which lowers the intensity of
the synthetic signal until 80% of the synthetic
speech is intelligibly recognized). There are also
the diagnostic rhyme test (DRT), testing for
intelligibility of consonants (0-100 scale), and
the diagnostic acceptability measure (DAM), which
rates naturalness along perceptual scales.
13
PCM and log PCM
  • Time waveform coding methods are generally
    simpler to implement than spectral waveform
    approaches. Both PCM and log PCM digitize one
    sample at a time, as the samples come, without
    taking advantage of the redundancies among the
    neighboring samples, whether later ones
    (feedforward) or earlier ones (feedback). If we
    take those surrounding samples into account, some
    delay is unavoidable because of the buffering
    needed, and the on-time coding of PCM and log PCM
    is not possible.

14
Differential (DPCM) or Delta (DM)?
  • To take advantage of the correlation between
    adjacent samples of s(n) we can quantize only the
    amplitude difference (Δ, or delta) between
    adjacent samples, giving rise to Differential PCM
    (DPCM). The difference Δ can be encoded with
    fewer bits than the sample itself. Delta
    Modulation is a particularly popular special case
    of DPCM in which we encode the slope of the
    difference with one bit.
  • Both Delta Modulation and Continuously Variable
    Slope Delta Modulation (CVSD) are time waveform
    techniques using one bit for quantizing. An
    alternative to CVSD is the simpler Linear Delta
    Modulation. See the next page for the more
    commonly used CVSD.

15
Adaptive Delta Modulation and Continuously
Variable Slope Delta modulation (ADM/CVSD)
  • ADM/CVSD is delta modulation with a continuously
    variable adaptive quantizer, allowing small
    signals to be represented accurately without
    sacrificing the accuracy of larger ones. In
    Linear Delta Modulation, for input s(n) a zero
    bit may be used for a negative slope and a 1 for
    a positive one, and one fixed Δ is added to (or
    subtracted from) the cumulative s(n-1) when
    decoded to reproduce the sequence.
  • In CVSD we adjust (adapt) Δ to avoid slope
    overload and granular noise in the encoded
    signal. An excellent tutorial from MX.COM, Inc.,
    "Continuously Variable Slope Delta Modulation: a
    Tutorial", has been placed on the web site and is
    also available from the web; it is required
    reading. (See Homework.)
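The adaptation idea can be sketched as follows (a toy one-bit coder; the grow/shrink rule and step limits are our assumptions, much simplified from the syllabic filtering a real CVSD chip uses):

```python
import numpy as np

def adm_encode_decode(s, delta0=0.05, grow=1.5, dmin=0.01, dmax=1.0):
    """One-bit adaptive delta modulation: the step grows on runs of equal
    bits (slope overload) and shrinks when bits alternate (granular noise)."""
    bits, recon = [], []
    est, delta = 0.0, delta0
    for x in s:
        bit = 1 if x >= est else 0
        if bits and bit == bits[-1]:
            delta = min(delta * grow, dmax)   # overload: take bigger steps
        else:
            delta = max(delta / grow, dmin)   # granular: take smaller steps
        bits.append(bit)
        est += delta if bit else -delta       # decoder just integrates +/-delta
        recon.append(est)
    return np.array(bits), np.array(recon)

s = np.sin(np.linspace(0, 2 * np.pi, 200))
bits, r = adm_encode_decode(s)
print("mean abs error:", round(float(np.mean(np.abs(r - s))), 3))
```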

16
Summary on subjective quality metrics for speech
17
Most Commonly Used Time Waveform Coders
18
Comments on time based waveform coders
We can improve SNR and rate by adding logarithmic
quantizing to plain PCM. We can obtain lower
complexity and bit transmission rate, at the price
of lower (below-toll) quality, by using adaptive
delta modulation with continuously variable slope.
19
Low Delay-Code Excited Linear Predictor (LD-CELP)
Since low-bit-rate coders and decoders go through
buffering and computation, they incur delays
unacceptable for real time. Lower delays are
obtained by making the excitation vectors very
short (5 samples, or 0.625 ms). LD-CELP also uses
backward adaptation for the predictor. It yields
quality equal to 32 kb/s ADPCM, increasing the
computation but at half the bit rate. Standardized
by the ITU-T as G.728, it operates at 16 kbit/s.
The linear prediction is calculated backwards with
a 50th-order LPC filter. The excitation is
generated with gain-scaled VQ. The complexity of
the codec is 30 MIPS, and 2 kilobytes of RAM are
needed for codebooks.
20
Most Common Spectral Waveform Coder and Comments
In this and the preceding and following slides we
present the 11 most common types of speech coders,
grouped in 3 categories. Coders using the speech
spectrum are classified into waveform coders and
vocoders (next chart). Fidelity and SNR (MOS 4/5)
are emphasized in waveform coders, at the cost of
higher bit rates, when compared with vocoders,
where the emphasis is on subjective quality at low
bit rates. Low-Delay Code-Excited Linear
Prediction coders are popular spectral waveform
coders using vector quantization.
21
Channel, Homomorphic or Linear Predictive Models
of Vocoders
  • All vocoders require a binary voicing decision (1
    bit in the side information).
  • Channel vocoders exploit the equal separation of
    harmonics by transmitting only F0 (5-6 bits) and
    a magnitude per channel (some 5 bits/channel).
    See page 26.
  • Homomorphic (Cepstral) Vocoders: called
    homomorphic because a nonlinear mapping (the
    logarithm) takes the signal to a different domain
    (separating the excitation e(n) and the vocal
    tract response h(n) with time windows) in which
    linear filtering techniques are applied, followed
    by (an exponential) mapping back to the original
    domain. See page 28.
  • LP Vocoders: many vocoders use LP in combination
    with other techniques such as VQ.

22
Linear Predictive Coding Revisited (See pages
24,25,27 in class 4)
LPC is a general technique for data compression
that can be applied to any waveform, time-based or
spectral. It consists of representing the samples
of a segment of the waveform by the coefficients
of a linear predictor. We have seen that we use
the poles in its autoregressive (AR) computation
and neglect the zeros, and therefore the phase of
the signal (why?). Given that the result of the
LPC computation is representative of the vocal
tract originating the signal, we saw that we could
code a voiced or unvoiced segment with pulsed or
noise excitation of the LPC filter (as we will see
next, this is also used in vocoders).
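A minimal sketch of the LPC computation (the autocorrelation method with the Levinson-Durbin recursion; the sign convention s(n) ≈ Σ a_k s(n-k) is ours):

```python
import numpy as np

def lpc_coefficients(frame, order):
    """LPC via the autocorrelation method and Levinson-Durbin recursion.
    Returns a[1..p] such that s(n) is predicted as sum_k a[k] * s(n-k)."""
    frame = np.asarray(frame, dtype=float)
    r = np.array([frame[:len(frame) - k] @ frame[k:] for k in range(order + 1)])
    a, err = np.zeros(order), r[0]
    for i in range(order):
        k = (r[i + 1] - a[:i] @ r[i:0:-1]) / err   # reflection coefficient
        a[:i] = a[:i] - k * a[:i][::-1]            # update earlier coefficients
        a[i] = k
        err *= 1 - k * k                           # remaining prediction error
    return a

# Fit a known AR(2) signal: s(n) = 0.9 s(n-1) - 0.2 s(n-2) + white noise
rng = np.random.default_rng(0)
s, e = np.zeros(5000), rng.normal(size=5000)
for n in range(2, 5000):
    s[n] = 0.9 * s[n - 1] - 0.2 * s[n - 2] + e[n]
print(lpc_coefficients(s, 2))  # close to [0.9, -0.2]
```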
23
Vector-Sum-Excited Linear Prediction
  • VSELP is a speech coding method used in the
    IS-54 standard for TDMA cell phones in the United
    States, with a variation used in Japan. It was also used
    in the first version of RealAudio for audio over
    the Internet. IS-54 VSELP specifies an encoding
    of each 20 ms of speech into 159-bit frames, thus
    achieving a raw data rate of 7.95 kbit/s. In an
    actual TDMA cell phone, the vocoder output is
    packaged with error correction and signaling
    information, resulting in an over-the-air data
    rate of 16.2 kbit/s. For internet audio, each
    159-bit frame is stored in 20 bytes, leaving 1
    bit unused. The resulting file thus has a data
    rate of exactly 8 kbit/s. A major drawback of
    VSELP is its limited ability to encode non-speech
    sounds, so that it performs poorly when encoding
    speech in the presence of background noise. For
    this reason, use of VSELP is being gradually
    phased out in favor of newer codecs. (From
    Wikipedia with changes.)
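The bit-rate figures in this description are easy to verify:

```python
def vselp_rates(frame_bits=159, frame_ms=20):
    """Raw IS-54 VSELP rate, and the rate after 20-byte internet framing."""
    raw = frame_bits * 1000 / frame_ms        # 159 bits every 20 ms
    stored = 20 * 8 * 1000 / frame_ms         # 160 bits (1 unused) every 20 ms
    return raw, stored

print(vselp_rates())  # (7950.0, 8000.0) bit/s
```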

24
CODECs (COder-DECoders)
  • A CODEC is an algorithm, which may be implemented
    in hardware or, most frequently today, in
    software, for COding and DECoding a digital or
    (less frequently now) analog signal, often to
    transform it for transmission, storage, error
    correction or encryption, although most
    frequently for compression. The quality of the
    reproduction (audio, video, data) is an important
    consideration in the process and relates to the
    bit rate of the CODEC. There are lossy and
    lossless CODECs, the latter used mostly for
    archiving data. The computational requirements of
    the algorithm's implementation (MIPS) are another
    important consideration, particularly if real
    time is required.

25
Code-excited linear prediction vocoder
  • CELP is a term used mostly to designate a class
    of algorithms for coding speech in the 4-8 kb/s
    range (or at 16 kb/s for almost toll quality),
    based mostly on standard linear prediction
    filters but with a two-codebook (one fixed, the
    other variable) VQ-guided excitation in a
    perceptually weighted domain. VQ is applied
    separately to the excitation and to the residual
    (10-bit codebook). Variations of CELP include
    LD-CELP, CS-CELP and CS-ACELP, among others. The
    success of the CELP approach is due to its
    superior excitation model. It also requires an F0
    estimator. The excitation is optimized by
    minimizing the squared error.

26
Conjugate structure algebraic code-excited LP
  • In algebraic CELP (ACELP) the residual samples
    are not VQ-ed but derived directly from an
    algebraic computation used to excite the LP
    synthesizer, accelerating the search for the
    optimal excitation. The main advantage of ACELP
    is that the algebraic codebook it uses can be
    made very large (> 50 bits) without running into
    storage or CPU-time problems. A 16-bit algebraic
    codebook is used in the innovative codebook
    search, the aim of which is to find the best
    innovation and gain parameters. The innovation
    vector contains at most four non-zero pulses. In
    ACELP a block of N speech samples is synthesized
    by filtering an appropriate innovation sequence
    from a codebook, scaled by a gain factor, through
    two time-varying filters, one a long-term (pitch)
    synthesis filter and the other a shorter-term
    synthesis filter. The Conjugate Structure ACELP
    can yield toll-quality speech with a 10th-order
    LPC and other requirements.

27
Channel Vocoder (CV)
The input signal is filtered by a number of
non-overlapping, contiguous, non-uniform-width
band-pass filters, plus the output of voicing and
pitch detectors (5-6 bits for F0), which will
allow one-bit switching at the synthesis end. The
example is from the telephony band, with the
output of the filters rectified and filtered by
low-pass 0-20 Hz filters (requiring at least 40
samples/sec).
28
LPC U. S. Government 10E Algorithm
Also known as G.728, an ITU-T standard for speech
coding operating at 16 kbit/s. The technology used
is LD-CELP, Low-Delay Code-Excited Linear
Prediction. The delay of the codec is only 5
samples (0.625 ms). The linear prediction is
calculated backwards with a 50th-order LPC filter.
The excitation is generated with gain-scaled VQ.
The standard was finished in 1992 in the form of
algorithm-exact floating-point code; in 1994 a
bit-exact fixed-point codec was released. G.728
passes low-bit-rate modem signals up to 2400
bit/s, and network signaling also goes through.
The complexity of the codec is 30 MIPS, and 2
kilobytes of RAM are needed for codebooks. (From
Wikipedia.)
29
Homomorphic or Cepstral vocoders
Cepstral vocoders see little use because they
produce synthetic-quality speech at high
complexity, although they do have a low bit rate
(4.8 kb/s).
30
Most Commonly Used Vocoders
(Low use)
Use Vector Quantization
31
Comments on vocoders
Vocoders function on the model of speech that
separates the excitation and the vocal tract
response. In general, a vocoder's main bit stream
consists of a crude but effective model that
includes 1) a voiced/unvoiced bit, 2) F0
information, and 3) the spectrum amplitude or,
equivalently, the window's average energy. VQ
further aids in reducing the bit rate
significantly, and in CELP all spectral and
prosodic parameters are coded as a group in the
speech frame, exploiting interparameter
redundancies.
32
(No Transcript)
33
Homework 4 (Midterm on second half of class on
October 10)
  • Read the tutorial from MX.com and answer the
    following questions: a) What is the compromise
    when the sampling frequency gets closer to the
    Nyquist-Shannon limit? b) In Fig. 4, what is the
    meaning of Si 0 dB? c) What are the two kinds of
    Delta Modulation? Briefly describe each in your
    own words.
  • Compare the advantages and disadvantages of DPCM,
    DM, ADM and ADM/CVSD. Explain the concepts of
    adaptive, differential and delta in modulation
    types.
  • What are the advantages and disadvantages of
    increasing the number of channels in a channel
    vocoder?
  • Give at least one example, with its bit rate, of
    a coding technique used in each of telephony,
    mobile or cell phones, the internet, and CD
    recording. Give your source of information.