Voice DSP Processing III - PowerPoint PPT Presentation

About This Presentation
Title:

Voice DSP Processing III

Description:

DSP Processing III Yaakov J. Stein Chief Scientist RAD Data Communications – PowerPoint PPT presentation

Slides: 55
Provided by: YJS5

Title: Voice DSP Processing III


1
Voice DSP Processing III
  • Yaakov J. Stein
  • Chief Scientist, RAD Data Communications

2
Voice DSP
  • Part 1: Speech biology and what we can learn from
    it
  • Part 2: Speech DSP (AGC, VAD, features, echo
    cancellation)
  • Part 3: Speech compression techniques
  • Part 4: Speech recognition

3
Voice DSP - Part 3
  • Simple coders
      • G.711 A-law, μ-law
      • Delta
      • ADPCM
  • CELP coders
      • LPC-10
      • RELP/GSM
      • CELP
  • Other methods
      • MBE
      • MELP
      • STC
      • Waveform Interpolation

4
Encoder Criteria
  • Encoders can be compared in many ways;
    the most important are
      • Bit rate (Kbps)
      • Speech quality (MOS)
      • Delay (algorithmic [frame + lookahead] +
        computational + propagation)
      • Computational complexity
  • Often less important
      • Bit exactness (interoperability)
      • Transcoding robustness
      • Behavior on non-speech (babble noise, tones,
        music)
      • Bit error robustness

5
PSTN Quality Coders
  • Rate        ITU-T encoder
  • 128 Kbps    16-bit linear sampling
  • 64 Kbps     G.711 A-law/μ-law, 8-bit log sampling
  • 32 Kbps     G.726 ADPCM
  • 16 Kbps     G.728 LD-CELP
  • 8 Kbps      G.729 CS-ACELP
  • 4 Kbps      SG16Q21 ???
  • toll quality MOS rating, but higher delay

6
Digital Cellular Standards
7
Military / Satellite Standards
8
Voice DSP
  • Simple coders

9
G.711
  • 16-bit linear sampling at 8 kHz means 128 Kbps
  • Minimal toll quality linear sampling is 12-bit
    (96 Kbps)
  • 8 bit linear sampling (256 levels) is noticeably
    noisy
  • Due to
  • prevalence of low amplitudes
  • logarithmic response of ear
  • we can use logarithmic sampling
  • Different standards for different places

10
G.711 - cont.
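  • μ-law (North America, μ = 255):
    $y = \mathrm{sgn}(x) \, \frac{\ln(1 + \mu |x|)}{\ln(1 + \mu)}$
  • A-law (rest of world, A = 87.56):
    $y = \mathrm{sgn}(x) \, \frac{A |x|}{1 + \ln A}$ for $|x| < 1/A$,
    $y = \mathrm{sgn}(x) \, \frac{1 + \ln(A |x|)}{1 + \ln A}$ for $1/A \le |x| \le 1$
  • Although very different looking they are nearly
    identical
  • G.711 approximates these expressions by 16
    staircase straight-line segments
  • (8 negative and 8 positive)
  • μ-law: horizontal segment through origin; A-law:
    vertical segment
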
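To see why logarithmic sampling helps at low amplitudes, here is a minimal sketch that compares an 8-bit uniform quantizer with an 8-bit quantizer applied after continuous μ-law compression. It uses the smooth μ-law curve, not the 16-segment approximation that G.711 actually specifies, and the helper names (`mu_compress`, `mu_expand`, `quantize`) are illustrative assumptions.

```python
import math

MU = 255.0  # North American value quoted on the slide


def mu_compress(x, mu=MU):
    """Map x in [-1, 1] to [-1, 1] with the continuous mu-law curve."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_expand(y, mu=MU):
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def quantize(v, bits=8):
    """Uniform quantizer on [-1, 1] with mid-point reconstruction."""
    levels = 2 ** bits
    step = 2.0 / levels
    idx = min(levels - 1, int((v + 1.0) / step))
    return -1.0 + (idx + 0.5) * step


def codec_linear(x):
    return quantize(x)


def codec_mulaw(x):
    return mu_expand(quantize(mu_compress(x)))


quiet = 0.01  # a low-amplitude sample, typical of speech
err_linear = abs(codec_linear(quiet) - quiet)
err_mulaw = abs(codec_mulaw(quiet) - quiet)
print(err_linear, err_mulaw)
```

For this quiet sample the companded path reconstructs the value much more closely than the linear path with the same number of bits, which is the point of the slide.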
11
DPCM
  • Due to low-pass character of speech
  • differences are usually smaller than signal
    values
  • and hence require fewer bits to quantize
  • Simplest Delta-PCM (DPCM): quantize the first
    difference signal Δ
  • Delta-PCM: quantize the difference between the signal
    and its prediction
  • $\hat{s}_n = p(s_{n-1}, s_{n-2}, \ldots, s_{n-N}) = \sum_i p_i s_{n-i}$
  • If we predict using a linear combination (FIR filter),
    this is linear prediction
  • Delta-modulation (DM): use only the sign of the
    difference (1-bit DPCM)
  • Sigma-delta (1-bit): oversample, DM, trade off
    rate for bits

12
DPCM with prediction
  • If the linear prediction works well, then the
    prediction error
  • $e_n = s_n - \hat{s}_n$
  • will be lower in energy and whiter than $s_n$
    itself!
  • Only the error is needed for reconstruction,
  • since the predictable portion can be predicted:
    $s_n = \hat{s}_n + e_n$

[Figure: block diagram in which $s_n$ feeds a prediction filter]
13
DPCM - post-filtering
  • Simplest case
  • if highly oversampled
  • then the previous sample $s_{n-1}$ predicts $s_n$ well,
  • so we can use DM:
  • if $e_n < 0$ then $-\Delta$ else $+\Delta$
  • For DM there is no way to encode zero prediction
    error
  • so the decoded signal oscillates wildly
  • Standard remedy is a post-filter that low-pass
    filters this noise
  • But there is a bigger problem!

14
Open-loop Prediction
  • The encoder's linear predictor is also present in the
    decoder,
  • but there it runs as feedback
  • The decoder's predictions are accurate given the
    precise error $e_n$,
  • but it receives only the quantized error, so the models
    diverge!

15
Side Information
  • There are two ways to solve the problem ...
  • The first way is to send the prediction
    coefficients
  • from the encoder to the decoder
  • and not to let the decoder derive them
  • The coefficients sent are called side-information
  • Using side-information means higher bit-rate
  • (since both en and coefficients must be sent)
  • The second way does not require increasing bit
    rate

16
Closed-loop Prediction
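  • To ensure that the encoder and decoder stay
    in sync
  • we put the decoder into the encoder
  • Thus the encoder's predictions are identical to
    the decoder's
  • and no model difference accumulates

[Figure: closed-loop encoder and decoder built from a quantizer (Q), inverse quantizers (IQ) and prediction filters (PF), with signals $s_n$ and $e_n$]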
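The difference between open-loop and closed-loop prediction can be sketched with the simplest possible predictor (the previous sample) and a uniform quantizer. This is an illustration under those assumptions, not the structure of any particular standard; `STEP` and the function names are hypothetical.

```python
STEP = 0.1  # quantizer step size (assumed for the demo)


def quantize(e, step=STEP):
    return step * round(e / step)


def encode_open_loop(signal):
    """Predict from the ORIGINAL previous sample (the decoder never sees it)."""
    prev, out = 0.0, []
    for s in signal:
        out.append(quantize(s - prev))
        prev = s
    return out


def encode_closed_loop(signal):
    """Predict from the RECONSTRUCTED previous sample, as the decoder will."""
    prev, out = 0.0, []
    for s in signal:
        eq = quantize(s - prev)
        out.append(eq)
        prev = prev + eq  # the decoder embedded in the encoder
    return out


def decode(errors):
    prev, out = 0.0, []
    for eq in errors:
        prev = prev + eq
        out.append(prev)
    return out


signal = [0.04 * n for n in range(200)]  # slow ramp
open_out = decode(encode_open_loop(signal))
closed_out = decode(encode_closed_loop(signal))
max_open = max(abs(a - b) for a, b in zip(signal, open_out))
max_closed = max(abs(a - b) for a, b in zip(signal, closed_out))
print(max_open, max_closed)
```

On the ramp, every open-loop difference rounds to zero, so the decoder drifts without bound, while the closed-loop error never exceeds half a quantizer step.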
17
Two types of error
  • For DM there are two types of error (depending on
    step size)

[Figure: DM tracking a signal with Δ too small (slope overload), Δ OK, and Δ too large (granular noise)]
18
Adaptive Step Size
  • Speech signals are very nonstationary
  • We need to adapt the step size to match signal
    behavior
  • Increase Δ when the signal changes rapidly
  • Decrease Δ when the signal is relatively constant
  • Simplest method (for DM only)
  • If the present bit is the same as the previous one, multiply Δ
    by K (K = 1.5)
  • If the present bit is different, divide Δ by K
  • Constrain Δ to a predefined range
  • More general method
  • Collect N samples in a buffer (N = 128 to 512)
  • Compute the standard deviation in the buffer
  • Set Δ to a fraction of the standard deviation
  • Send Δ to the decoder as side-information or
  • Use backward adaptation (closed-loop Δ
    computation)
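The "simplest method" above can be sketched as adaptive delta modulation with backward adaptation: the decoder repeats the same step-size rule on the received bits, so no side-information is sent. K = 1.5 comes from the slide; the initial step and the range limits are assumptions for the demo.

```python
def _adapt(delta, bit, prev_bit, k, dmin, dmax):
    """Multiply the step by k on repeated bits, divide on changes, then clamp."""
    if prev_bit is None:
        return delta
    delta = delta * k if bit == prev_bit else delta / k
    return min(dmax, max(dmin, delta))


def adm_encode(signal, delta0=0.01, k=1.5, dmin=0.001, dmax=1.0):
    """Return (bits, estimates); estimates are what the decoder will rebuild."""
    bits, track = [], []
    est, delta, prev_bit = 0.0, delta0, None
    for s in signal:
        bit = 1 if s >= est else 0
        delta = _adapt(delta, bit, prev_bit, k, dmin, dmax)
        est += delta if bit else -delta
        bits.append(bit)
        track.append(est)
        prev_bit = bit
    return bits, track


def adm_decode(bits, delta0=0.01, k=1.5, dmin=0.001, dmax=1.0):
    """Rebuild the signal from bits alone, using the same adaptation rule."""
    out, est, delta, prev_bit = [], 0.0, delta0, None
    for bit in bits:
        delta = _adapt(delta, bit, prev_bit, k, dmin, dmax)
        est += delta if bit else -delta
        out.append(est)
        prev_bit = bit
    return out


step_input = [1.0] * 50  # sudden jump from 0 to 1
bits, encoder_track = adm_encode(step_input)
decoded = adm_decode(bits)
print(decoded[-1])
```

With a fixed step of 0.01 the decoder would need 100 samples to climb to 1.0; here the step grows geometrically during the run of identical bits and shrinks again once the bits start alternating.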

19
ADPCM
  • G.726 has
  • Adaptive predictor
  • Adaptive quantizer and inverse quantizer
  • Adaptation speed control
  • Tone and transition detector
  • Mechanism to prevent loss from tandeming
  • Computational complexity relatively high (10
    MIPS)
  • 24 and 16 Kbps modes defined, but not toll
    quality
  • G.727: same rates but embedded, for packetized
    networks
  • ADPCM only uses the general low-pass characteristic
    of speech
  • What is the next step?

20
Scalar Quantization
  • Standard A/D has preset, evenly distributed
    levels
  • G.711 has preset, non-evenly distributed levels
  • With a criterion we can make an adaptive
    quantizer
  • Simplest criterion: minimum squared quantization
    error
  • $e_n = s_n - \hat{s}_n$, $E = \langle e_n^2 \rangle$
  • Need an algorithm to find the optimal placement of
    levels: EM-type algorithms

21
Vector Quantization
  • We can do the same thing in higher dimensions
  • Here we wish to match input data $x_i$, $i = 1 \ldots N$
  • to a codebook of codewords $C_j$, $j = 1 \ldots M$
  • with minimal mean squared error
  • $E = \sum_{i=1}^{N} \lVert x_i - C \rVert^2$
  • where $C$ is the codeword closest to $x_i$ in the
    codebook

22
LBG Algorithm for VQ
  • Input $x_i$, $i = 1 \ldots N$ (clustering, unsupervised
    learning)
  • Randomly initialize codebook $C_j$, $j = 1 \ldots M$
  • Loop until converged
  • Classification Step
  • for $i = 1 \ldots N$
  • for $j = 1 \ldots M$
  • compute $D_{ij}^2 = \lVert x_i - C_j \rVert^2$
  • classify $x_i$ to the $C_j$ with minimal $D_{ij}^2$
  • Expectation Step
  • for $j = 1 \ldots M$ correct center
    $C_j = \frac{1}{N_j} \sum_{i \in C_j} x_i$

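The loop above can be sketched in a few lines of plain Python. One assumption beyond the slide: the codebook is initialized by drawing M distinct input vectors at random, which is one common way to "randomly initialize". The toy data set and the names `lbg` and `dist2` are illustrative.

```python
import random


def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def lbg(data, m, iters=50, seed=0):
    rng = random.Random(seed)
    codebook = [list(x) for x in rng.sample(data, m)]  # assumed initialization
    for _ in range(iters):
        # Classification step: assign each x_i to its nearest codeword.
        cells = [[] for _ in range(m)]
        for x in data:
            j = min(range(m), key=lambda k: dist2(x, codebook[k]))
            cells[j].append(x)
        # Center-correction step: move each codeword to the mean of its cell.
        new = [
            [sum(c) / len(cell) for c in zip(*cell)] if cell else codebook[j]
            for j, cell in enumerate(cells)
        ]
        if new == codebook:  # converged
            break
        codebook = new
    return codebook


# Two tight, well-separated clusters around (0.1, 0.1) and (10.1, 10.1).
grid = [(0.1 * i, 0.1 * j) for i in range(3) for j in range(3)]
data = grid + [(10.0 + x, 10.0 + y) for x, y in grid]
codebook = sorted(lbg(data, m=2))
print(codebook)
```

Each pass costs N × M distance computations, which is why structured codebooks matter once the vectors and codebooks grow.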
23
Speech Application of VQ
  • OK, I understand what to do with scalar
    quantization
  • what is VQ good for ?
  • We could try to simply VQ frames of speech
    samples
  • but this doesn't work well!
  • We can VQ spectra or sub-band components
  • We often VQ parameter sets (e.g. LPC
    coefficients)
  • We also VQ model error signals

24
Voice DSP
  • CELP coders

25
LPC-10
  • Based on 10th order LPC (obviously), Bishnu
    Atal
  • 180 sample blocks are encoded into 54 bits
  • Pitch + U/V (found using AMDF): 7 bits
  • Gain: 5 bits
  • 10 reflection coefficients found by covariance
    method
  • first two coefficients converted to log area
    ratios
  • L1, L2, a3, a4: 5 bits each
  • a5, a6, a7, a8: 4 bits each
  • a9: 3 bits, a10: 2 bits (coefficients total: 41 bits)
  • 1 sync bit: 1 bit
  • 54 bits 44.44 times per second results in 2400
    bps
  • By using VQ could reduce bit rate to under 1
    Kbps!
  • LPC-10 speech is intelligible, but synthetic
    sounding
  • and much of the speaker identity is lost !
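The LPC-10 bit budget can be checked with a few lines of arithmetic. The 8 kHz sampling rate is taken from the G.711 slide; the dictionary keys are just labels for this sketch.

```python
SAMPLE_RATE = 8000   # Hz, as on the G.711 slide
FRAME_SAMPLES = 180  # samples per LPC-10 block

bits = {
    "pitch_and_voicing": 7,
    "gain": 5,
    # L1, L2, a3, a4 at 5 bits; a5..a8 at 4 bits; a9 at 3 bits; a10 at 2 bits
    "coefficients": 4 * 5 + 4 * 4 + 3 + 2,
    "sync": 1,
}

frame_bits = sum(bits.values())
frames_per_second = SAMPLE_RATE / FRAME_SAMPLES
bit_rate = frame_bits * SAMPLE_RATE / FRAME_SAMPLES

print(frame_bits, round(frames_per_second, 2), bit_rate)
```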

26
The Residual
  • Recover $s_n$ by adding back the residual error
    signal
  • $s_n = \hat{s}_n + e_n$
  • So if we send $e_n$ as side-information we can
    recover $s_n$
  • $e_n$ is smaller than $s_n$ so may require fewer bits!
  • But $e_n$ is whiter than $s_n$ so may require many
    bits!
  • The question has now become
  • How can we compress the residual?

27
Encoding the Residual
  • RELP (6-9.6 Kbps)
      • Low-pass filter and downsample residual to 1 kHz
      • Encode using ADPCM
  • VQ-RELP (4.8 Kbps)
      • VQ coding of residual
  • RELP (4.8 Kbps)
      • Perform FFT on residual
      • Baseband coding
  • RPE-LTP (GSM-FR at 13 Kbps)
      • Regular Pulse Excitation - Long Term Prediction
      • Perform Long Term Prediction (pitch recovery)
      • Subtract to obtain new residual
      • Decimate by 3, use phase with maximum energy
      • Extract 6-bit overall gain
      • Encode remainder with 3 bits/sample

28
Residual and Excitation
  • Synthesis filter: $s_n = e_n + \sum_m a_m s_{n-m}$
  • Analysis filter: $r_n = s_n - \sum_m a_m s_{n-m}$
  • So $r_n = e_n$!

Here $e_n$ is the excitation and $r_n$ is the residual.
Note: the all-zero filter is the inverse of the
all-pole filter
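That the residual equals the excitation can be checked numerically. The sketch below implements the two difference equations directly; the second-order coefficients and the excitation are arbitrary values chosen for the demo (the filter is stable since its poles have magnitude $\sqrt{0.6}$).

```python
def synthesis(excitation, a):
    """All-pole filter: s[n] = e[n] + sum_m a[m] * s[n-m], m = 1..len(a)."""
    s = []
    for n, e in enumerate(excitation):
        acc = e
        for m, am in enumerate(a, start=1):
            if n - m >= 0:
                acc += am * s[n - m]
        s.append(acc)
    return s


def analysis(signal, a):
    """All-zero filter: r[n] = s[n] - sum_m a[m] * s[n-m]."""
    r = []
    for n, sn in enumerate(signal):
        acc = sn
        for m, am in enumerate(a, start=1):
            if n - m >= 0:
                acc -= am * signal[n - m]
        r.append(acc)
    return r


a = [1.2, -0.6]  # assumed predictor coefficients
excitation = [1.0, 0.0, -0.5, 0.0, 0.0, 0.25, 0.0, 0.0, 0.0, 1.0]
speech = synthesis(excitation, a)
residual = analysis(speech, a)
max_diff = max(abs(r - e) for r, e in zip(residual, excitation))
print(max_diff)
```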
29
CELP
  • Atal's idea
  • Find a way to efficiently encode the excitation !
  • Questions
  • How can we find the excitation?
  • Theoretically, by algebra (invert the filter!)
  • How can we efficiently encode the residual?
  • VQ - Code Excited Linear Prediction
  • How can we efficiently find the best codeword?
  • Exhaustive search

30
CELP - cont.
  • Discoveries of Atal and friends (Schroeder, Remde, Singhal,
    etc.)
  • Even random codebooks work well (Gaussian,
    uniform)
  • Don't need large codebooks, e.g. 1024 codewords
    for 40 samples
  • Can center-clip with little loss
  • Codebook with constant amplitude almost as good
  • So we can use codebooks with structure (and save
    storage/search/bits)
  • Multipulse (MP)
  • Constant Amplitude Pulse
  • Regular Pulse (RP)

31
Special Excitations
  • Shift technique reduces random CB operations from
    $O(N^2)$ to $O(N)$
  • a b c d e f | c d e f g h | e f g h i j | ...
  • Using a small number of ±1 amplitude pulses
    leads to MIPS reduction
  • Since most values are zero, there are few
    operations
  • Since amplitudes are ±1, no true multiplications
  • In a CB containing CW and -CW we can save half
  • Algebraic codebooks exploit algebraic structure
  • Example choose pulses according to Hadamard
    matrix
  • Using FHT reduces computation
  • Conjugate structure codebooks
  • Excitation is sum of codewords from two related
    CBs

32
Analysis by Synthesis
  • Finding the best codeword by exhaustive search

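[Figure: each codeword is passed through the LPC synthesis filter and subtracted from $s_n$; the energy of the difference is computed and the minimum is found]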
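A minimal sketch of the exhaustive analysis-by-synthesis search: every codeword is run through the synthesis filter and the one with the smallest error energy wins. Two simplifications are assumed here and are not taken from the slide: the filter starts from zero state for each candidate, and a least-squares gain is fitted per codeword. The codebook size (64) and filter coefficients are demo values.

```python
import random


def synthesize(excitation, a):
    """All-pole LPC synthesis, zero initial state."""
    s = []
    for n, e in enumerate(excitation):
        acc = e
        for m, am in enumerate(a, start=1):
            if n - m >= 0:
                acc += am * s[n - m]
        s.append(acc)
    return s


def best_codeword(target, codebook, a):
    """Exhaustive search; returns (index, gain, error_energy)."""
    best = None
    for idx, cw in enumerate(codebook):
        y = synthesize(cw, a)
        yy = sum(v * v for v in y)
        if yy == 0.0:
            continue
        gain = sum(t * v for t, v in zip(target, y)) / yy  # least-squares gain
        err = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if best is None or err < best[2]:
            best = (idx, gain, err)
    return best


rng = random.Random(1)
a = [1.2, -0.6]  # assumed LPC coefficients
codebook = [[rng.gauss(0.0, 1.0) for _ in range(40)] for _ in range(64)]
# Build a target that really was produced by codeword 17 at gain 0.5.
target = synthesize([0.5 * v for v in codebook[17]], a)
index, gain, err = best_codeword(target, codebook, a)
print(index, gain, err)
```

The cost is one filtering operation per codeword per subframe, which motivates the structured codebooks discussed on the "Special Excitations" slide.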
33
Perceptual Weighting
  • The criterion for selecting the best codeword
    should be perceptual
  • not simply the energy of the difference signal!
  • We perceptually weight the signal and the
    synthesized signal

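[Figure: analysis-by-synthesis loop with CB, LPC and PW blocks; the synthesized signal is subtracted from $s_n$]
Since PW is a filter we need to use it only once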
34
Perceptual Weighting - cont.
  • The most important PW effect is masking
  • Coding error energy near formants is not heard
    anyway
  • so we allow higher error near formants
  • but demand lower perceivable error energy
  • To do this we de-emphasize according to the LPC
    spectrum!
  • Simplest filter is $1 - \sum_i a_i z^{-i}$ where $a_i$ are
    the LPC coefficients
  • How do we take the critical bandwidth into
    account?
  • We perform bandwidth expansion (denominator
    expansion > numerator):
  • $\frac{1 - \sum_i \gamma_1^i a_i z^{-i}}{1 - \sum_i \gamma_2^i a_i z^{-i}}$

$BW = -\ln(\gamma) \, F_s / \pi$
$1 > \gamma_1 > \gamma_2 > 0$
Typical values: $\gamma_1 = 0.9$, $\gamma_2 = 0.6$
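The weighting filter can be sketched directly from the expressions above: scale the i-th LPC coefficient by $\gamma^i$ for numerator and denominator, then run the resulting pole-zero difference equation. The sampling rate of 8 kHz and the second-order coefficients are assumptions for the demo.

```python
import math


def expand(a, gamma):
    """Bandwidth expansion: a_i -> gamma**i * a_i, i = 1..len(a)."""
    return [gamma ** i * ai for i, ai in enumerate(a, start=1)]


def bandwidth_hz(gamma, fs=8000.0):
    """Bandwidth added by the expansion, BW = -ln(gamma) * Fs / pi."""
    return -math.log(gamma) * fs / math.pi


def weighting_filter(signal, a, g1=0.9, g2=0.6):
    """Apply (1 - sum g1^i a_i z^-i) / (1 - sum g2^i a_i z^-i) to signal."""
    num, den = expand(a, g1), expand(a, g2)
    out = []
    for n, x in enumerate(signal):
        acc = x
        for i, (b, c) in enumerate(zip(num, den), start=1):
            if n - i >= 0:
                acc += -b * signal[n - i] + c * out[n - i]
        out.append(acc)
    return out


a = [1.2, -0.6]  # assumed LPC coefficients
impulse = [1.0] + [0.0] * 7
response = weighting_filter(impulse, a)
print(round(bandwidth_hz(0.9)), round(bandwidth_hz(0.6)))
print(response[:3])
```

The same `bandwidth_hz` relation gives the γ behind the 15 Hz expansion quoted for FS 1016 and MELP: γ = exp(-15π/8000), roughly 0.994.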
35
Post-filter
  • Not related to the subject, but if we are already
    here
  • In order to increase the subjective quality of
    many coders
  • post-filters are often used to emphasize the
    formant structure
  • These have the same form as the perceptual
    weighting filter
  • but $1 > \gamma_2 > \gamma_1 > 0$ with typical values $\gamma_1 = 0.5$,
    $\gamma_2 = 0.75$
  • Denominator expansion < numerator!
  • the post-filter also reinforces tilt
  • which should then be compensated by an IIR filter
  • since the spectral valleys are de-emphasized
  • we should change the PW filter parameters g1 and
    g2
  • Originally proposed for ADPCM !

36
Subframes
  • Coders with large frames (> 10 ms) need a long
    excitation signal
  • and hence a lot of bits to encode
  • An alternative is to divide the frame into (2-4)
    subframes
  • each of which has its own codeword excitation

[Figure: frames n-1, n, n+1, each divided into subframes]
We really should recompute LPC per subframe, but
we can get away with interpolating!
37
Lookahead
  • If we are already dividing up the frame
  • we can compute the LPC based on a shifted frame
  • This is called lookahead, and it adds processing
    delay !
  • To decrease delay we can use backward looking IIR
    filter
  • and then we needn't send/store the LPC
    coefficients at all!

[Figure: two shifted LPC analysis windows above a row of eight codeword (CW) subframes]
38
What happened to the pitch?
  • Unlike LPC, the ABS CELP coder is excited by a
    codebook
  • Where does the pitch come from?
  • Random CB: minimization will prefer good
    excitation
  • Regular/Multi pulse: pulse spacing (not enough
    pulses for high pitch)
  • But this is usually not enough (residual has
    pitch periodicity)
  • Two solutions
  • Adaptive codebook (Kleijn et al.)
  • Long term prediction (Atal & Singhal)
  • Both of these reinforce the pitch component

39
Adaptive CB
  • Adaptive codebook is repetitions of previous
    excitations
  • Total excitation is weighted sum of stochastic CB
    (random, MP, RP, etc)
  • and adaptive CB

[Figure: adaptive CB output scaled by gain Ga and fixed CB output scaled by gain Gs are summed and fed to the LPC filter]
40
Long Term Prediction
  • Using long-term (pitch predictor) and short-term
    (LPC) prediction
  • Long term predictor may have only
  • one delay, but then non-integer:
  • $\frac{1}{1 - b z^{-d}}$

[Figure: codebook, gain, pitch predictor and LPC in cascade; the output is subtracted from $s_n$, followed by perceptual weighting and error computation]
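A one-tap long-term predictor $1/(1 - b z^{-d})$ is defined by a delay d and a gain b. The sketch below finds them open-loop by minimizing $\sum_n (r_n - b\, r_{n-d})^2$ over a range of integer delays; non-integer delays, as mentioned on the slide, are not handled. The search range and the toy residual are assumptions for the demo.

```python
def ltp_search(residual, dmin, dmax):
    """Return (delay, gain, error_energy) minimizing sum (r[n] - b*r[n-d])**2."""
    span = range(dmax, len(residual))  # same analysis window for every delay
    best = (None, 0.0, sum(residual[n] ** 2 for n in span))
    for d in range(dmin, dmax + 1):
        num = sum(residual[n] * residual[n - d] for n in span)
        den = sum(residual[n - d] ** 2 for n in span)
        if den == 0.0:
            continue
        b = num / den  # least-squares gain for this delay
        err = sum((residual[n] - b * residual[n - d]) ** 2 for n in span)
        if err < best[2]:
            best = (d, b, err)
    return best


# Toy residual: one pulse every 50 samples, each 0.8 times the previous one.
PERIOD = 50
residual = [0.8 ** (n // PERIOD) if n % PERIOD == 0 else 0.0 for n in range(400)]
delay, gain, err = ltp_search(residual, dmin=20, dmax=80)
print(delay, gain, err)
```

If the search range also covered multiples of the true period, those delays would fit this toy signal equally well; real coders have to guard against such pitch doubling.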
41
Federal Standard CELP
  • FS 1016 at 4.8 Kbps has MOS 3.2
  • Developed by AT&T Bell Labs for DoD; 144 bits /
    30 ms frame
  • 10th order LPC on 30 ms Hamming window
  • no pre-emphasis, additional 15 Hz BW expansion
    (quality and LSP robustness)
  • Conversion to LSP and nonuniform scalar
    quantization to 34 bits
  • 4 subframes (7.5 ms), LSP interpolation
  • 512 entry fixed CB - static -1,0,1 from
    center-clipped Gaussian
  • 5 bit nonuniform quantized gain: 56 bits
  • 256 entry adaptive CB - 8 bits, 5 bit nonuniform
    quantized gain: 48 bits
  • optional noninteger delays, optional
  • Perceptual weighting
  • Postfilter, spectral tilt compensation,
    removable for noise or tandeming
  • FEC 4 bits, SYNC 1 bit, reserved 1 bit

42
G.728
  • 16 Kbps with MOS similar to G.726 at 32 Kbps
  • Low 5 sample (0.625 msec) delay
  • High computational complexity (about 30 MIPS)
  • CELP with Backward LPC
  • LPC order 50 (why not? - we don't transmit
    side-information!)
  • Frame of 2.5 ms (20 samples)
  • 4 subframes of 0.625 ms (5 samples)
  • Perceptual weighting
  • Only 10 bit index to fixed CB is transmitted
  • 10 bits per 0.625 ms is 16 Kbps !

43
G.729
  • 8 Kbps toll-quality coder for DSVD and VoIP
  • Computational complexity 20 MIPS, but G.729a is
    about 10 MIPS
  • frame 10 ms (80 samples), lookahead 5 ms (1
    subframe)
  • LPC, LSP, VQ, LSP interpolation
  • CS-ACELP CB (interleaved single pulse
    permutation): 4 ±1 pulses / subframe
  • closed loop pitch prediction and adaptive CB
    (delay + gain)
  • 2 (40 sample) subframes per frame
  • For each frame the encoder outputs 80 bits
  • LSF coefficients 18 bits, pitch 8 bits, gain CB 14 bits
  • adaptive CB 5 bits, parity check 1 bit
  • pulse positions 26 bits, pulse signs 8 bits

44
G.729 annexes
  • A: Compatible reduced complexity encoder with
    minimal MOS reduction
  • B: VAD and CNG
  • C: Floating point implementation
  • D: 6.4 Kbps version
  • similar to G.729 but 64 output bits per frame,
    quality better than G.726 at 24 Kbps
  • LSF coefficients 18b, pitch + adaptive CB 8+4b,
    gain CB 12b, fixed CB 22b
  • E: 11.8 Kbps coder for high quality and music

45
G.723.1
  • 6.3 (MP-MLQ) and 5.3 (ACELP) Kbps rates
  • About 18 MIPS on DSP
  • frame 30 ms (240 samples), lookahead 7.5 ms
  • LPC on 30 ms (240 sample) frames, LSP and VQ
  • open-loop pitch computation on half-frames (120
    sample)
  • excitation on 4 subframes (60 samples) per frame
  • perceptual weighting and harmonic noise weighting
  • fifth-order closed loop pitch predictor
  • MP-MLQ: 5 or 6 ±1 pulses / subframe, positions
    all even or all odd
  • ACELP: 4 ±1 pulses / subframe, positions differ
    by 8
  • Annex A: VAD-CNG; Annex B: floating point
    implementation

46
Voice DSP
  • Other methods
  • MBE/MELP
  • STC/WI

47
MBE coder
  • LPC10 makes hard U/V decision - no mixed voicing
  • Multi Band Excitation uses a different excitation
  • harmonics of pitch frequency
  • frequency-dependent binary U/V decision
  • large number of sub-bands (> 16)
  • Simultaneous ABS estimation of pitch and spectral
    envelope
  • Then U/V decision made based on spectral fit
  • Use of dynamic programming for pitch tracking

48
MBE coder - cont.
  • DVSI made various MBE, AMBE and IMBE for
    satellite (INMARSAT)
  • Bit rates 2.4 - 9.6 Kbps (toll quality at 3.6
    Kbps)
  • Integral FEC for bit-error robustness
  • As an example
  • 128 bits for each 20 ms frame
  • pitch: 8 bits
  • U/V decisions: K bits (K < 12)
  • spectral amplitudes (DCT): 75-K bits
  • FEC (Golay codes): 45 bits

49
MELP
  • DOD wanted a new 2.4 Kbps coder with MOS similar
    to FS1016
  • Main problems with LPC10
  • voicing determination errors
  • no handling of partially voiced speech
  • Unlike MBE, MELP uses the standard LPC model
  • MELP excitation is pulse train plus random noise
  • Soft decision in small number (5) of sub-bands
  • Frame 22.5 ms (180 samples)
  • 10th order LPC, 15 Hz BW expansion, LSF,
    interpolation, VQ
  • pitch refinement
  • 5 sub-bands (0-500-1000-2000-3000-4000 Hz), pitch
    and noise excitation
  • FEC

50
Sinusoidal Transform Coder
  • McAulay and Quatieri model
  • instead of LPC use sum of sine waves
  • $s_n = \sum_{i=1}^{N} A_i \cos(\omega_i n + \phi_i)$
  • For each analysis frame (10 - 20 ms) need to
    extract the N triples $(A_i, \omega_i, \phi_i)$
  • Voiced speech
  • Use pitch and important harmonics from
    pitch-synchronized STFT
  • Unvoiced speech
  • Use peaks of STFT: points where slope changes
    from + to -
  • At high bit-rates keep magnitudes, frequencies
    and phases
  • At low bit-rates frequencies constrained and
    phases modeled
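A minimal sketch of the sinusoidal model: synthesize a frame as a sum of cosines, take a DFT, and pick peaks where the magnitude slope changes from + to -. For simplicity the sinusoids sit exactly on DFT bins and a fixed magnitude threshold rejects numerical noise; both are assumptions of this demo, not part of the STC design.

```python
import cmath
import math


def synthesize(params, n):
    """Sum of sinusoids: s[t] = sum_i A_i * cos(w_i * t + phi_i)."""
    return [sum(a * math.cos(w * t + p) for a, w, p in params) for t in range(n)]


def dft_magnitude(x):
    """Magnitude of the DFT for bins 0 .. N/2 (direct O(N^2) evaluation)."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


def pick_peaks(mag):
    """Bins where the slope changes from + to -."""
    return [k for k in range(1, len(mag) - 1) if mag[k - 1] < mag[k] >= mag[k + 1]]


N = 64
params = [
    (1.0, 2 * math.pi * 5 / N, 0.3),    # (amplitude, frequency, phase)
    (0.5, 2 * math.pi * 12 / N, -1.0),
]
frame = synthesize(params, N)
mag = dft_magnitude(frame)
peaks = [k for k in pick_peaks(mag) if mag[k] > 1.0]  # threshold is assumed
amplitudes = [2 * mag[k] / N for k in peaks]
print(peaks, amplitudes)
```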

51
STC - cont.
  • Sparse spectrum is updated at regularly spaced
    times
  • Amplitude linearly interpolated between updates
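  • Interpolated phase must obey 4 conditions (matching
    $\omega$ and $\phi$ at both the previous and the next update)

[Figure: encoder: $s_n$, overlapped windowing, FFT, peak picker, spectrum encoder (e.g. all-pole model); decoder: spectrum decoder, sum of sinusoids, $s_n$]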
52
STC - cont.
  • Tracking the sinusoidal components

[Figure: sinusoidal tracks in the time-frequency plane, showing the birth and death of components]
53
Waveform Interpolation
  • Voiced speech is a sequence of pitch-cycle
    waveforms
  • The characteristic waveform usually changes
    slowly with time

Useful to think of the waveform in 2d (axes: time, and phase in pitch period)
This waveform can be the speech signal or the LPC
residual
54
WI - cont.
  • Per frame LPC and pitch are extracted
  • Represent CW by features (e.g. DFT coefficients)
  • Alignment by circular shift until maximum
    correlation
  • Separate treatment for voiced and unvoiced
    segments

[Figure: encoder: $s_n$, LPC and pitch tracking, characteristic waveform extraction, 2d CW alignment, quantization; decoder: decoding, waveform interpolation, conversion to 1d, $s_n$]
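The alignment step ("circular shift until maximum correlation") can be sketched as a brute-force search over all shifts of one pitch cycle. The characteristic waveform here is kept as raw samples rather than DFT coefficients, and the 40-sample test cycle is an assumption for the demo.

```python
import math


def circular_shift(w, k):
    """Rotate the list w left by k positions (k may be negative)."""
    k %= len(w)
    return w[k:] + w[:k]


def align(reference, waveform):
    """Return (shift, shifted waveform) maximizing correlation with reference."""
    def corr(k):
        return sum(r * v for r, v in zip(reference, circular_shift(waveform, k)))

    best = max(range(len(waveform)), key=corr)
    return best, circular_shift(waveform, best)


P = 40  # samples in one pitch cycle (assumed)
reference = [
    math.sin(2 * math.pi * t / P) + 0.5 * math.sin(4 * math.pi * t / P + 0.7)
    for t in range(P)
]
later = circular_shift(reference, -7)  # same cycle, rotated by 7 samples
shift, aligned = align(reference, later)
print(shift)
```

Once successive cycles are aligned, the waveform changes slowly along the time axis, which is what makes interpolation between transmitted cycles work.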