Title: Voice DSP Processing III
1VoiceDSPProcessing III
- Yaakov J. Stein
- Chief ScientistRAD Data Communications
2Voice DSP
- Part 1 Speech biology and what we can learn from
it - Part 2 Speech DSP (AGC, VAD, features, echo
cancellation) - Part 3 Speech compression techiques
- Part 4 Speech Recognition
3Voice DSP - Part 3
- Simple coders
- G.711 A-law m-law
- Delta
- ADPCM
- CELP coders
- LPC-10
- RELP/GSM
- CELP
- Other methods
- MBE
- MELP
- STC
- Waveform Interpolation
4Encoder Criteria
- Encoders can be compared in many ways
- the most important are
- Bit rate (Kbps)
- Speech quality (MOS)
- Delay (algorithmic framelookahead
computational propagation) - Computational Complexity
- Often less important
- Bit exactness (interoperability)
- Transcoding robustness
- Behavior on non-speech (babble noise, tones,
music) - Bit error robustness
5PSTN Quality Coders
- Rate ITU-T encoder
- 128 Kbps 16bit linear
sampling - 64 Kbps G.711 A-law/m-law 8bit
log sampling - 32 Kbps G.726 ADPCM
- 16 Kbps G.728 LDCELP
- 8 Kbps G.729 CS-ACELP
- 4 Kbps SG16Q21 ???
- toll quality MOS rating, but higher delay
6Digital Cellular Standards
7Military / Satellite Standards
8Voice DSP
9G.711
- 16 bit linear sampling at 8 KHz means 128 Kbps
- Minimal toll quality linear sampling is 12 bit
(96 Kbps) - 8 bit linear sampling (256 levels) is noticeably
noisy - Due to
- prevalence of low amplitudes
- logarithmic response of ear
- we can use logarithmic sampling
- Different standards for different places
10G.711 - cont.
North America m 255
- m-law
- A-law
- Although very different looking they are nearly
identical - G.711 approximates these expressions by 16
staircase straight-line segments - (8 negative and 8 positive)
- m-law horizontal segment through origin, A-law
vertical segment
Rest Of World A 87.56
11DPCM
- Due to low-pass character of speech
- differences are usually smaller than signal
values - and hence require fewer bits to quantize
- Simplest Delta-PCM (DPCM) quantize first
difference signal D - Delta-PCM quantize difference between signal
and prediction - sn p ( sn-1 , sn-2 , , sn-N ) S pi sn-i
- If predict using linear combination (FIR filter),
this is linear prediction - Delta-modulation (DM) use only sign of
difference (1bit DPCM) - Sigma-delta (1bit) oversample, DM, trade-off
rate for bits
i
12DPCM with prediction
- If the linear prediction works well, then the
prediction error - en sn - sn
- will be lower in energy and whiter than sn
itself ! - Only the error is needed for reconstruction,
- since the predictable portion can be predicted sn
sn en!
sn
prediction filter
13DPCM - post-filtering
- Simplest case
- if highly oversampled
- then previous sample sn-1 predicts sn well,
- so we can use DM,
- if sgn(en) lt 0 then -D else D
- For DM there is no way to encode zero prediction
error - so decoded signal oscillates wildly
- Standard remedy is a post-filter that low-pass
filters this noise - But there is a b i g g e r problem!
14Open-loop Prediction
- The encoder (linear predictor) is present in the
decoder - but there runs as feedback
- The decoders predictions are accurate with the
precise error en - but it gets the quantized error en and the models
diverge!
15Side Information
- There are two ways to solve the problem ...
- The first way is to send the prediction
coefficients - from the encoder to the decoder
- and not to let the decoder derive them
- The coefficients sent are called side-information
- Using side-information means higher bit-rate
- (since both en and coefficients must be sent)
- The second way does not require increasing bit
rate
16Closed-loop Prediction
- To ensure that the encoder and decoder stay
in-sync - we put the decoder into the encoder
- Thus the encoders predictions are identical to
the decoders - and no model difference accumulates
en
sn
en
sn
Q
IQ
IQ
PF
PF
17Two types of error
- For DM there are two types of error (depending on
step size)
D too small
D OK
D too large
18Adaptive Step Size
- Speech signals are very nonstationary
- We need to adapt the step size to match signal
behavior - Increase D when signal changes rapidly
- Decrease D when signal is relatively constant
- Simplest method (for DM only)
- If present bit is the same as previous multiply D
by K (K1.5) - If present bit is different, divide D by K
- Constrain D to a predefined range
- More general method
- Collect N samples in buffer (N 128 512)
- Compute standard deviation in buffer
- Set D to a fraction of standard deviation
- Send D to decoder as side-information or
- Use backward adaptation (closed-loop D
computation)
19ADPCM
- G.726 has
- Adaptive predictor
- Adaptive quantizer and inverse quantizer
- Adaptation speed control
- Tone and transition detector
- Mechanism to prevent loss from tandeming
- Computational complexity relatively high (10
MIPS) - 24 and 16 Kbps modes defined, but not toll
quality - G.727 same rates but embedded for packetize
networks - ADPCM only used general low-pass characteristic
of speech - What is the next step?
20Scalar Quantization
- Standard A/D has preset, evenly distributed
levels - G.711 has preset, non-evenly distributed levels
- With a criterion we can make an adaptive
quantizer - Simplest criterion minimum squared quantization
error - en sn - sn E lt en2 gt
- Need algorithm to find optimal placement of
levels EM-type algorithms -
21Vector Quantization
- We can do the same thing in higher dimensions
- Here we wish to match input data xi i 1
.. N - to a codebook of codewords Cj j 1 .. M
- with Minimal Mean Squared Error
- E Si1..N xi - C 2
- where C is the codeword closest to xi in the
codebook
xi
22LBG Algorithm for VQ
- Input xi i 1 .. N clustering, unsupervised
learning - Randomly initialize codebook Cj j 1 .. M
- Loop until converge
- Classification Step
- for i 1 .. N
- for j 1 .. M
- compute Dij2 xi - Cj 2
- classify xi to Cj with minimal Dij2
- Expectation Step
- for j 1 .. M correct center Cj S
i e Cj xi
1
Nj
23Speech Application of VQ
- OK, I understand what to do with scalar
quantization - what is VQ good for ?
- We could try to simply VQ frames of speech
samples - but this doesnt work well !
- We can VQ spectra or sub-band components
- We often VQ parameter sets (e.g. LPC
coefficients) - We also VQ model error signals
24Voice DSP
25LPC-10
- Based on 10th order LPC (obviously) Bishnu
Atal - 180 sample blocks are encoded into 54 bits
- Pitch U/V (found using AMDF) 7 bits
- Gain
5 bits - 10 reflection coefficients found by covariance
method - first two coefficients converted to log area
ratios - L1, L2, a3, a4 5 bits each
- a5, a6, a7, a8 4 bits each
- a9 3 bits a10 2 bits 41 bits
- 1 sync bit 1
bit - 54 bits 44.44 times per second results in 2400
bps - By using VQ could reduce bit rate to under 1
Kbps! - LPC-10 speech is intelligible, but synthetic
sounding - and much of the speaker identity is lost !
26The Residual
- Recover sn by adding back the residual error
signal - sn sn en
- So if we send en as side-information we can
recover sn - en is smaller than sn so may require fewer bits
! - But en is whiter than sn so may require many
bits! - The question has now become
- How can we compress the residual?
27Encoding the Residual
- RELP (6-9.6 Kbps)
- Low-pass filter and downsample residual to 1 KHz
- Encode using ADPCM
- VQ-RELP (4.8 Kbps)
- VQ coding of residual
- RELP (4.8 Kbps)
- Perform FFT on residual
- Baseband coding
- RPE-LTP (GSM-FR at 13 Kbps)
- Residual Pulse Excitation - Long Term Predictor
- Perform Long Term Prediction (pitch recovery)
- Subtract to obtain new residual
- Decimate by 3, use phase with maximum energy
- Extract 6-bit overall gain
- Encode remainder with 3 bits/sample
28Residual and Excitation
- Synthesis filter sn
en S am sn-m - Analysis filter rn
sn - S am sn-m - So rn en !
excitation
residual
Note all-zero filter is the inverse of the
all-pole filter
29CELP
- Atals idea
- Find a way to efficiently encode the excitation !
- Questions
- How can we find the excitation?
- Theoretically, by algebra (invert the filter!)
- How can we efficiently encode the residual?
- VQ - Code Excited Linear Prediction
- How can we efficiently find the best codeword?
- Exhaustive search
30CELP - cont.
- Atal and friends (Schroeder, Remde, Singhal,
etc.) discoveries - Even random codebooks work well Gaussian,
uniform - Dont need large codebooks e.g. 1024 codewords
for 40 samples - Can center-clip with little loss
- Codebook with constant amplitude almost as good
- So we can use codebooks with structure (and save
storage/search/bits) - Multipulse (MP)
-
- Constant Amplitude Pulse
Regular Pulse (RP)
31Special Excitations
- Shift technique reduces random CB operations from
O(N2) to O(N) - a b c d e f c d e f g h e f g h I j ...
- Using a small number of 1 amplitude pulses
leads to MIPS reduction - Since most values are zero, there are few
operations - Since amplitudes 1 no true multiplications
- In a CB containing CW and -CW we can save half
- Algebraic codebooks exploit algebraic structure
- Example choose pulses according to Hadamard
matrix - Using FHT reduces computation
- Conjugate structure codebooks
- Excitation is sum of codewords from two related
CBs
32Analysis by Synthesis
- Finding the best codeword by exhaustive search
sn
Compute energy
-
LPC
find minimum
33Perceptual Weighting
- The criterion for selecting the best codeword
should be perceptual - not simply the energy of the difference signal!
- We perceptually weight the signal and the
synthesized signal
sn
PW
-
Since PW is a filter we need use it only once
CB
LPC
34Perceptual Weighting - cont.
- The most important PW effect is masking
- Coding error energy near formants is not heard
anyway - so we allow higher error near formants
- but demand lower perceivable error energy
- To do this we de-emphasize according to the LPC
spectrum! - Simplest filter is 1 - S ai z-I where ai are
the LPC coefficients - How do we take the critical bandwidth into
account? - We perform bandwidth expansion Denominator
expansion gt numerator 1 - S g1i ai z-I - 1 - S g2i ai z-I
BW - ln(g) Fs p
1 gt g1 gt g2 gt 0
Typical values g1 0.9 g2 0.6
35Post-filter
- Not related to the subject, but if we are already
here - In order to increase the subjective quality of
many coders - post-filters are often used to emphasize the
formant structure - These have the same form as the perceptual
weighting filter - but 1 gt g2 gt g1 gt 0 with typical values g1 0.5
g2 0.75 - Denominator expansion lt numerator!
- the post-filter also reinforces tilt
- which should then be compensated by an IIR filter
- since the spectral valleys are de-emphasized
- we should change the PW filter parameters g1 and
g2 - Originally proposed for ADPCM !
36Subframes
- Coders with large frames (gt 10 ms) need a long
excitation signal - and hence a lot of bits to encode
- An alternative is to divide the frame into (2-4)
subframes - each of which has its own codeword excitation
frame n-1
frame n1
frame n
We really should recompute LPC per subframe but
we can get away with interpolating !
37Lookahead
- If we are already dividing up the frame
- we can compute the LPC based on a shifted frame
- This is called lookahead, and it adds processing
delay ! - To decrease delay we can use backward looking IIR
filter - and then we neednt send/store the LPC
coefficients at all!
------- LPC -------
------- LPC -------
CW
CW
CW
CW
CW
CW
CW
CW
38What happened to the pitch?
- Unlike LPC, the ABS CELP coder is excited by
codebook - Where does the pitch come from?
- Random CB minimi zation will prefer good
excitation - Regular/Multi pulse pulse spacing (not enough
pulses for high pitch) - But this is usually not enough (residual has
pitch periodicity) - Two solutions
- Adaptive codebook (Klejn, etal)
- Long term prediction (Atal Singhal)
- Both of these reinforce the pitch component
39Adaptive CB
- Adaptive codebook is repetitions of previous
excitations - Total excitation is weighted sum of stochastic CB
(random, MP, RP, etc) - and adaptive CB
Adaptive CB
Ga
LPC
Gs
Fixed CB
40Long Term Prediction
- Using long-term (pitch predictor) and short-term
(LPC) prediction - Long term predictor may have only
- one delay, but then non-integer
- 1
- 1 - b z - d
sn
pitch predictor
gain
-
codebook
LPC
perceptual weighting
error computation
41Federal Standard CELP
- FS 1016 at 4.8 Kbps has MOS 3.2
- Developed by ATT Bell Labs for DOD 144 bits /
30 ms frame - 10th order LPC on 30 ms Hamming window
- no pre-emphasis, additional 15 Hz BW expansion
(quality and LSP robustness) - Conversion to LSP and nonuniform scalar
quantization to 34 bits - 4 subframes (7.5 ms) LSP interpolation
- 512 entry fixed CB - static -1,0,1 from
center-clipped Gaussian - 5 bit nonuniform quantized gain 56 bits
- 256 entry adaptive CB - 8 bits 5 bit nonuniform
quantized gain 48 bits - optional noninteger delays, optional
- Perceptual weighting
- Postfilter spectral tilt compensation,
removable for noise or tandeming - FEC 4 bits SYNC 1 bit reserved 1 bit
42G.728
- 16 Kbps with MOS similar to G.726 at 32 Kbps
- Low 5 sample (0.625 msec) delay
- High computational complexity (about 30 MIPS)
- CELP with Backward LPC
- LPC order 50 (why not? - we dont transmit
side-information!) - Frame of 2.5 ms (20 samples)
- 4 subframes of 0.625 ms (5 samples)
- Perceptual weighting
- Only 10 bit index to fixed CB is transmitted
- 10 bits per 0.625 ms is 16 Kbps !
43G.729
- 8 Kbps toll-quality coder for DSVD and VoIP
- Computational complexity 20 MIPS, but G.729a is
about 10 MIPS - frame 10 ms (80 samples) lookahead 5 ms (1
subframe) - LPC, LSP, VQ, LSP interpolation
- CS-ACELP CB (Interleaved single pulse
permutation) 4 1 pulses / subframe - closed loop pitch prediction and adaptive CB
(delaygain) - 2 (40 sample) subframes per frame
- For each frame the encoder outputs 80 bits
- LSF coefficients 18 bits pitch
8 bits gain CB 14 bits - adaptive CB 5 bits parity check 1
bit - pulse positions 26 bits pulse signs 8
bits
44G.729 annexes
- A Compatible reduced complexity encoder with
minimal MOS reduction - B VAD and CNG
- C Floating point implementation
- D 6.4 Kbps version
- similar to G.729 but 64 output bits per frame,
quality better than G.726 at 24Kbps - LSF coefficients 18b pitchadaptive CB 84b
gain CB 12b fixed CB 22b - E 11.8 Kbps coder for high quality and music
45G.723.1
- 6.4 (MP-MLQ) and 5.4 (ACELP) Kbps rates
- About 18 MIPS on DSP
- frame 30 ms (240 samples) lookahead 15 ms.
- LPC on 30 ms (240 sample) frames, LSP and VQ
- open-loop pitch computation on half-frames (120
sample) - excitation on 4 subframes (60 samples) per frame
- perceptual weighting and harmonic noise weighting
- fifth-order closed loop pitch predictor
- MP-MLQ 5 or 6 1 pulses / subframe, positions
all even or all odd - ACELP 4 1 pulses / subframe, positions differ
by 8 - Annex A VAD-CNG Annex B floating point
implementation
46Voice DSP
- Other
- Methods
- MBE/MELP
- STC/WI
47MBE coder
- LPC10 makes hard U/V decision - no mixed voicing
- Multi Band Excitation uses a different excitation
- harmonics of pitch frequency
- frequency-dependent binary U/V decision
- large number of sub-bands (gt16)
- Simultaneous ABS estimation of pitch and spectral
envelope - Then U/V decision made based on spectral fit
- Use of dynamic programming for pitch tracking
48MBE coder - cont.
- DVSI made various MBE, AMBE and IMBE for
satellite (INMARSAT) - Bit rates 2.4 - 9.6 Kbps (toll quality at 3.6
Kbps) - Integral FEC for bit-error robustness
- As an example
- 128 bits for each 20 ms frame
- pitch 8 bits
- U/V decisions K bits (K lt 12)
- spectral amplitudes (DCT) 75-K bits
- FEC (Golay codes) 45 bits
49MELP
- DOD wanted a new 2.4 Kbps coder with MOS similar
to FS1016 - Main problems with LPC10
- voicing determination errors
- no handling of partially voiced speech
- Unlike MBE MELP uses standard LPC model
- MELP excitation is pulse train plus random noise
- Soft decision in small number (5) of sub-bands
- Frame 22.5 ms (180 samples)
- 10th order LPC, 15 Hz BW expansion, LSF,
interpolation, VQ - pitch refinement
- 5 sub-bands (0-500-1000-2000-3000-4000Hz) pitch
and noise excitation - FEC
50Sinusoidal Transform Coder
- McAulay and Quatieri model
- instead of LPC use sum of sine waves
- sn Si 1 .. N Ai cos ( wi n fi )
- For each analysis frame (10 - 20 ms) need to
extract N Ai fi s - Voiced speech
- Use pitch and important harmonics from
pitch-synchronized STFT - Unvoiced speech
- Use peaks of STFT points where slope changes
from to - - At high bit-rates keep magnitudes, frequencies
and phases - At low bit-rates frequencies constrained and
phases modeled
51STC - cont.
- Sparse spectrum is updated at regularly spaced
times - Amplitude linearly interpolated between updates
- Interpolated phase must obey 4 conditions (w f
w f)
overlapped windowing
sn
FFT
sum of sinusoids
sn
peak picker
spectrum encoder
spectrum decoder
e.g. all-pole model
52STC - cont.
- Tracking the sinusoidal components
birth
frequency
death
time
53Waveform Interpolation
- Voiced speech is a sequence of pitch-cycle
waveforms - The characteristic waveform usually changes
slowly with time
Useful to think of waveform in 2d
time
Phase in pitch period
This waveform can be the speech signal or the LPC
residual
54WI - cont.
- Per frame LPC and pitch are extracted
- Represent CW by features (e.g. DFT coefficients)
- Alignment by circular shift until maximum
correlation - Separate treatment for voice and unvoiced
segments
LPC pitch tracking
sn
Characteristic waveform extraction
conversion to 1d
sn
2d CW alignment
waveform interpolation
quantization
decoding