SPEECH ENHANCEMENT - PowerPoint PPT Presentation

Description:

Orthogonality principle: Wiener-Hopf equation. 5/29/2006. Widex ... Orthogonality principle (frequency domain): Transfer function: MSE of the Wiener filter: ... – PowerPoint PPT presentation

Slides: 72
Provided by: lichu3

1
SPEECH ENHANCEMENT
  • Chunjian Li
  • Aalborg University, Denmark

2
Introduction
  • Applications
  • Improving quality and intelligibility (hearing
    aids, cockpit comm., video conferencing ...)
  • Source coding (mobile phone, video conferencing,
    IP phone ...)
  • Pre-processor for other speech processing
    applications (speech recognition, speaker
    verification ...)

3
Introduction
  • Classification 1
  • Single channel
  • Multi-channel
  • with acoustic barrier (Adaptive
    Noise Canceling)
  • without acoustic barrier (Beam forming, ICA)
  • Classification 2
  • Spectral domain methods (Power Spectral
    Subtraction, Amplitude Spectral Subtraction,
    Autocorrelation Subtraction, Non-causal IIR
    Wiener Filtering)
  • Time domain methods (Adaptive noise canceling,
    Kalman Filtering, non-stationary Wiener
    filtering)

4
Spectral subtraction
5
Power Spectral Subtraction
  • Stochastic Model
  • Noise process broadband, stationary (or
    short-time stationary), uncorrelated to speech,
    additive.
  • Speech process short-time stationary.
  • Need short-time processing

6
Power Spectral Subtraction
  • Important relation in the Power Spectral Density
    (PSD) domain
  • This is true only when the noise is
    uncorrelated with the speech signal.
  • To be concise, the index m is dropped in the
    following discussion

7
Power Spectral Subtraction
(1)
Power Spectral Subtraction methods use the
noisy phase spectrum to synthesize the enhanced
signal.
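The subtraction step on this slide can be sketched for a single frame as follows. This is a minimal illustration, not the presenter's implementation: the function name is hypothetical, and `noise_psd` is assumed to be an estimate of the noise power per rfft bin in the same (unnormalized) units as the squared magnitude of the frame's DFT.

```python
import numpy as np

def power_spectral_subtraction(noisy_frame, noise_psd):
    """Enhance one frame by subtracting a noise power estimate per bin.

    noise_psd is assumed to be given on the rfft grid, in the same units
    as |rfft(noisy_frame)|**2 (an assumption of this sketch).
    """
    spectrum = np.fft.rfft(noisy_frame)
    noisy_power = np.abs(spectrum) ** 2
    # Subtract the noise power and half-wave rectify negative results.
    clean_power = np.maximum(noisy_power - noise_psd, 0.0)
    # Reuse the noisy phase when synthesizing the enhanced frame.
    enhanced = np.sqrt(clean_power) * np.exp(1j * np.angle(spectrum))
    return np.fft.irfft(enhanced, n=len(noisy_frame))
```

In practice the frames would be windowed and overlap-added; that part is omitted here.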
8
Generalized Spectral Subtraction and its variants
  • Generalization
  • Eq(1) can be written as

(2)
  • When a = 1, Eq. (2) is called Amplitude Spectral
    Subtraction (Boll, 1979).
  • Variant: Correlation subtraction
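A common way to write the generalization is |X_hat|^a = |Y|^a - b |D|^a, with a = 2 giving power subtraction and a = 1 giving amplitude subtraction. The sketch below assumes that form; the over-subtraction factor `b` and the function name are assumptions of this example and may not match the exact form of Eq. (2) on the slide.

```python
import numpy as np

def generalized_subtraction_gain(noisy_mag, noise_mag, a=2.0, b=1.0):
    """Gain G such that |X_hat| = G * |Y|, assuming |X_hat|^a = |Y|^a - b * |D|^a.

    a = 2 corresponds to power subtraction, a = 1 to amplitude subtraction.
    b is a hypothetical over-subtraction factor (b = 1 means plain subtraction).
    """
    noisy_mag = np.asarray(noisy_mag, dtype=float)
    noise_mag = np.asarray(noise_mag, dtype=float)
    diff = noisy_mag ** a - b * noise_mag ** a
    # Half-wave rectification keeps the estimate non-negative.
    est_mag = np.maximum(diff, 0.0) ** (1.0 / a)
    return est_mag / np.maximum(noisy_mag, 1e-12)
```

Setting `b` above 1 gives the over-suppression discussed on the following slides.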

9
Comments on Spectral Subtraction methods
  • Low complexity
  • Severe musical noise
  • Usually need further enhancement
  • - Smoothing in time and frequency
  • - Rectification

[Audio samples: noisy speech (0 dB), Power Spectral Subtraction, Amplitude Spectral Subtraction]
10
Comments on Spectral Subtraction methods
  • Oversuppressing and smoothing can reduce residual
    noise but result in distortion to the speech
    spectrum.

[Audio samples: oversuppressing ASS, oversuppressing PSS, smoothing in time]
11
Wiener Filter
12
Wiener Filtering
  • The non-causal infinite impulse response (IIR)
    Wiener filter is regarded as a spectral domain
    method, although the filtering problem is
    originally formulated in the time domain.
  • The non-causal IIR Wiener filter with AR modeling
    of speech can be applied iteratively, so that
    signal estimation and parameter estimation each
    build on the other's latest result.

13
Non-causal IIR Wiener Filter
  • A linear Minimum Mean Squared Error Filter
  • Orthogonality principle

Wiener-Hopf equation
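The equations for this slide did not survive in the transcript. In the standard textbook form they read as below; the symbols (noisy observation y(n), clean speech x(n), impulse response h(i), correlations r_yy(k) = E{y(n) y(n-k)} and r_xy(k) = E{x(n) y(n-k)}) are assumptions of this sketch and may differ from the slide's notation.

```latex
% Non-causal linear estimator
\hat{x}(n) = \sum_{i=-\infty}^{\infty} h(i)\, y(n-i)

% Orthogonality principle: the error is uncorrelated with every observation
E\{[x(n) - \hat{x}(n)]\, y(n-k)\} = 0 \quad \text{for all } k

% Substituting the estimator gives the Wiener-Hopf equation
\sum_{i=-\infty}^{\infty} h(i)\, r_{yy}(k-i) = r_{xy}(k) \quad \text{for all } k
```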
14
Non-causal Wiener Filter
  • Orthogonality principle (frequency domain)
  • Transfer function
  • MSE of the Wiener filter
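When the noise is additive and uncorrelated with the speech, the frequency-domain solution reduces to the familiar gain P_x / (P_x + P_d), and the error power per frequency is P_x P_d / (P_x + P_d). The sketch below assumes exactly that setting; the function names are hypothetical.

```python
import numpy as np

def wiener_gain(speech_psd, noise_psd):
    """Non-causal Wiener gain per frequency, assuming uncorrelated additive noise."""
    speech_psd = np.asarray(speech_psd, dtype=float)
    noise_psd = np.asarray(noise_psd, dtype=float)
    return speech_psd / (speech_psd + noise_psd)

def wiener_mse_density(speech_psd, noise_psd):
    """Error power per frequency of the Wiener estimate; integrate over frequency for the MSE."""
    speech_psd = np.asarray(speech_psd, dtype=float)
    noise_psd = np.asarray(noise_psd, dtype=float)
    return speech_psd * noise_psd / (speech_psd + noise_psd)
```

The gain is real and non-negative, which is why the filter leaves the phase spectrum untouched.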

15
Comments on Non-causal WF
  • Requires estimates of the power spectra of speech
    and noise.
  • Performance depends very much on the estimates of
    the speech and noise spectra.
  • The WF over-suppresses the speech spectrum, which
    results in a muffling effect.
  • The WF does not process the phase spectrum.

16
Comments on Non-causal WF
  • Muffling effect caused by over-suppression

[Figure legend: blue = original; black = Wiener filter; green = square-root Wiener filter]
17
Comments on Non-causal WF
  • Roughness caused by phase noise
  • The phase spectrum is not processed, which results
    in a loss of phase coherence in voiced speech. The
    effect is called roughness or reverberance.
  • Samples of muffling and roughness

[Audio samples: clean, muffling, roughness, muffling and roughness]
18
Iterative Wiener Filtering
  • A parametric method using an all-pole model
  • A sequential MAP estimator of both speech
    waveform and LP coefficients.
  • Lim & Oppenheim, 1978

19
Iterative Wiener Filtering
  • All-pole modeling of speech
  • - The speech amplitude spectrum can be well modeled
    by an all-pole transfer function (the vocal
    tract) excited by white noise or a pulse train (the
    glottal pulses). The coefficients of the all-pole
    model can be found by Linear Prediction (LP)
    analysis and are therefore called LP coefficients;
    the excitation is called the residual.
  • - The LP model is minimum phase, which is
    generally not the true phase of the vocal tract.

20
Iterative Wiener Filtering
  • The algorithm
  • Estimate the LP coef. from the noisy
    observation samples. Estimate the noise spectrum
    during non-speech activity.
  • Estimate the signal using the non-causal IIR WF
    given the current estimate of LP coef. and
    current estimate of the noise spectrum.
  • Estimate the LP coef. again given the current
    estimate of the waveform.
  • Iterate until the convergence criterion is
    satisfied.
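The loop above can be sketched for one frame as follows. This is a simplified illustration under stated assumptions, not the Lim and Oppenheim algorithm in full: the LP coefficients come from the autocorrelation method, the filter is applied circularly through the FFT, a fixed iteration count stands in for the convergence criterion, and `noise_psd` is assumed to be in per-sample variance units on the rfft grid. All function names are hypothetical.

```python
import numpy as np

def lpc(signal, order):
    """Autocorrelation-method LP coefficients a[1..p] and residual power."""
    n = len(signal)
    r = np.array([signal[: n - k] @ signal[k:] for k in range(order + 1)]) / n
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R += 1e-9 * r[0] * np.eye(order)  # tiny regularization for numerical safety
    a = np.linalg.solve(R, r[1:])
    gain = r[0] - a @ r[1:]
    return a, max(gain, 1e-12)

def all_pole_psd(a, gain, n_fft):
    """PSD of the all-pole model, gain / |A(e^{jw})|^2, with A(z) = 1 - sum a_k z^-k."""
    A = np.fft.rfft(np.concatenate(([1.0], -a)), n=n_fft)
    return gain / np.abs(A) ** 2

def iterative_wiener(noisy, noise_psd, order=10, n_iter=3):
    """Alternate LP estimation and Wiener filtering for a fixed number of passes."""
    n = len(noisy)
    Y = np.fft.rfft(noisy)
    estimate = noisy.copy()  # first pass: LP coefficients from the noisy samples
    for _ in range(n_iter):
        a, gain = lpc(estimate, order)
        speech_psd = all_pole_psd(a, gain, n)
        H = speech_psd / (speech_psd + noise_psd)  # non-causal Wiener gain
        estimate = np.fft.irfft(H * Y, n=n)
    return estimate
```

The fixed `n_iter` reflects that a heuristic stopping rule is needed; more passes tend to sharpen the formants further.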

21
Iterative Wiener Filtering
  • Comments
  • Convergence is not guaranteed, so a heuristic
    stopping criterion is needed
  • Results in unrealistically sharp formants and
    pole jittering
  • Suffers from musical noise
  • Needs some kind of smoothing

[Audio samples: 10 dB noisy sample, iterative WF, iterative WF with smoothing]
22
Further enhancement to IWF
  • Constrained IWF (Hansen & Clements, 1987)
  • Apply spectral constraints inter-frame and
    intra-frame using the LSP transformation.
  • Pole-zero modeling (Flanagan, 1972)
  • Replace the WF with Kalman filtering (Gibson, 1991)
  • Vector quantization method (Gibson, 1988)
  • Use of HMMs (Ephraim, 1988)

23
Phase issues
  • The majority of the noise reduction methods only
    process amplitude spectrum, while the noisy phase
    spectrum is left unprocessed.
  • The reasons are
  • - Human ears are less sensitive to phase than to
    the amplitude spectrum.
  • - Masking of amplitude to phase (6dB/0.6rad
    threshold).

24
MMSE methods
25
MMSE approaches to speech enhancement
  • Wiener filtering, MMSE amplitude spectrum
    estimator, MMSE log-amplitude spectrum estimator,
    non-Gaussian prior MMSE approaches.
  • The dominant technique, because it performs
    better than the Spectral Subtraction methods.
  • Need a priori info. of the speech and noise
    (i.e., covariance matrices or PSD).

26
MMSE amplitude spectrum estimator (Ephraim-Malah
filter)
  • Ephraim & Malah, 1984
  • The basis of the noise reduction function of
    the MELPe coding standard
  • Consists of two parts: the Decision-Directed
    method for estimating the speech spectrum, and the
    MMSE Short-Time Spectral Amplitude (STSA)
    estimator

27
MMSE STSA estimator
  • Assumptions
  • Stationary additive Gaussian noise with known
    PSD.
  • An estimate of the speech spectrum is available.
  • Spectral components (DFT coefficients) are
    statistically independent and each follows
    Gaussian distribution (the DFT amplitude follows
    Rayleigh distribution).
  • The DFT phase follows uniform distribution and is
    independent of the amplitude.
  • The signal model

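Let $Y_k = R_k e^{j\vartheta_k}$, $X_k = A_k e^{j\alpha_k}$, and $D_k$
denote the kth spectral component of
the noisy observation y(t), the signal x(t), and
the noise d(t), so that $Y_k = X_k + D_k$.
28
MMSE STSA estimator
With the following PDFs,
$$p(Y_k \mid a_k, \alpha_k) = \frac{1}{\pi \lambda_d(k)} \exp\left(-\frac{|Y_k - a_k e^{j\alpha_k}|^2}{\lambda_d(k)}\right), \qquad p(a_k, \alpha_k) = \frac{a_k}{\pi \lambda_x(k)} \exp\left(-\frac{a_k^2}{\lambda_x(k)}\right),$$
and Bayes' rule, the estimator can be shown
to be
$$\hat{A}_k = \Gamma(1.5)\,\frac{\sqrt{v_k}}{\gamma_k}\,\exp\left(-\frac{v_k}{2}\right)\left[(1+v_k)\,I_0\left(\frac{v_k}{2}\right) + v_k\,I_1\left(\frac{v_k}{2}\right)\right] R_k,$$
where $I_0$ and $I_1$ denote the modified
Bessel functions of zero and first order, and
$v_k$ is defined by
$$v_k = \frac{\xi_k}{1+\xi_k}\,\gamma_k.$$
29
MMSE STSA estimator
Here $\xi_k$ and $\gamma_k$ are defined by
$$\xi_k = \frac{\lambda_x(k)}{\lambda_d(k)}, \qquad \gamma_k = \frac{R_k^2}{\lambda_d(k)},$$
where $\lambda_x(k) = E\{|X_k|^2\}$ and $\lambda_d(k) = E\{|D_k|^2\}$.
$\xi_k$ and $\gamma_k$ are interpreted as the a priori
and a posteriori signal-to-noise ratio,
respectively. $\xi_k$ is estimated by the
Decision-Directed method.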
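The gain of the MMSE STSA estimator (the factor multiplying the noisy amplitude) can be evaluated numerically as sketched below, taking the a priori SNR `xi` and the a posteriori SNR `gamma` as inputs. The exponentially scaled Bessel functions are computed here from their integral representation to keep the example dependent on NumPy only; the helper and function names are hypothetical.

```python
import math
import numpy as np

def _scaled_bessel(order, x, n_nodes=2000):
    """exp(-x) * I_order(x) from (1/pi) * int_0^pi exp(x (cos t - 1)) cos(order t) dt."""
    t = (np.arange(n_nodes) + 0.5) * math.pi / n_nodes  # midpoint rule nodes
    x = np.asarray(x, dtype=float)[..., None]
    return np.mean(np.exp(x * (np.cos(t) - 1.0)) * np.cos(order * t), axis=-1)

def mmse_stsa_gain(xi, gamma):
    """Ephraim-Malah MMSE STSA gain for a priori SNR xi and a posteriori SNR gamma."""
    xi = np.asarray(xi, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    v = xi / (1.0 + xi) * gamma
    # exp(-v/2) is folded into the scaled Bessel functions to avoid overflow.
    bessel_term = (1.0 + v) * _scaled_bessel(0, v / 2.0) + v * _scaled_bessel(1, v / 2.0)
    return math.gamma(1.5) * np.sqrt(v) / gamma * bessel_term
```

At high a priori SNR the result approaches the Wiener gain xi / (1 + xi); at low a priori SNR it grows as `gamma` shrinks.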
30
Decision-Directed method
  • An estimate of the a priori SNR.
  • A combination of Power Spectral Subtraction, half
    wave rectification and inter-frame smoothing.
  • The smoothing factor $\alpha$ is usually chosen to be
    0.98 in order to get the best smoothing
    performance. The higher $\alpha$ is, the less musical
    noise, but the more distortion to the speech.
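The three ingredients listed above can be seen in a one-frame sketch of the decision-directed rule as it is usually stated. The argument names are hypothetical: `prev_clean_power` stands for the squared amplitude estimate from the previous frame, and `alpha` is the smoothing factor.

```python
import numpy as np

def decision_directed_snr(noisy_power, noise_psd, prev_clean_power, alpha=0.98):
    """Return (a priori SNR estimate, a posteriori SNR) for one frame."""
    noisy_power = np.asarray(noisy_power, dtype=float)
    noise_psd = np.asarray(noise_psd, dtype=float)
    gamma = noisy_power / noise_psd  # a posteriori SNR
    # gamma - 1 is power spectral subtraction in SNR units; max(., 0) is the
    # half-wave rectification.
    instantaneous = np.maximum(gamma - 1.0, 0.0)
    # Inter-frame smoothing: weight the previous frame's estimate by alpha.
    xi = alpha * np.asarray(prev_clean_power, dtype=float) / noise_psd \
        + (1.0 - alpha) * instantaneous
    return xi, gamma
```

With `alpha` close to 1 the estimate follows the previous frame closely, which suppresses musical noise at the cost of slower tracking of speech onsets.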

31
Comments on the MMSE STSA estimator
  • Comparison of the suppression gains of Wiener
    filter and MMSE STSA
  • The instantaneous SNR can be interpreted as the a
    priori SNR estimated without smoothing.
  • WF gains do not vary with the instantaneous SNR;
    they vary only with the a priori SNR. The MMSE
    STSA gains, in contrast, vary with both the
    instantaneous SNR and the a priori SNR.
  • When the a priori SNR is high, the MMSE STSA
    estimator has gain curves very close to the WF.
    When the a priori SNR is low, the MMSE STSA shows
    a higher gain, which is strongly affected by the
    instantaneous SNR.

32
Comments on the MMSE STSA estimator
  • A comparison of the suppression gains of PSS, WF
    and MMSE STSA estimator

[Figure: suppression gain versus estimated a priori SNR. The MMSE STSA curves are parameterized by Rpost, the a priori SNR estimated without smoothing (the instantaneous SNR). Solid line: power subtraction; dashed line: Wiener filter.]
33
Comments on the MMSE STSA estimator
  • The gain curve transitions smoothly between the
    power subtraction curve and the Wiener curve. This
    transition is controlled by the unsmoothed estimate
    of the a priori SNR (Rpost). The larger Rpost, the
    stronger the attenuation.
  • This counter-intuitive behavior flattens the
    spurious spectral peaks caused by the noise in the
    low SNR part of the spectrum, whereas the WF tends
    to sharpen them.
  • The phase of the noisy speech is used as the
    phase of the enhanced speech, because of the
    assumption of uniformly distributed phase. An
    independent MMSE estimate of the phasor has
    non-unity modulus and thus cannot be combined with
    the MMSE STSA.
  • Suffers less from musical noise than the WF.

34
MMSE Log-Spectral Amplitude Estimator
  • A modification to the MMSE STSA based on the fact
    that a distortion measure based on the
    mean-square error of the log-spectra is more
    suitable for speech processing.
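  • Minimize the distortion measure between the clean
    amplitude $A_k$ and its estimate $\hat{A}_k$:
    $E\{(\log A_k - \log \hat{A}_k)^2\}$
  • The MMSE LSA estimator can be shown to be
    $\hat{A}_k = \frac{\xi_k}{1+\xi_k} \exp\left(\frac{1}{2}\int_{v_k}^{\infty} \frac{e^{-t}}{t}\,dt\right) R_k$
  • where $v_k = \frac{\xi_k}{1+\xi_k}\gamma_k$, $R_k$ is the noisy
    spectral amplitude, and $\xi_k$ and $\gamma_k$ are the a
    priori SNR and a posteriori SNR as defined before.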
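The MMSE LSA gain is usually written as xi / (1 + xi) times exp(0.5 * E1(v)), where E1 is the exponential integral and v = xi / (1 + xi) * gamma. The sketch below evaluates it with the standard library only; the `exp1` helper (power series for small arguments, continued fraction otherwise) and the function names are this example's own.

```python
import math

def exp1(x, terms=60):
    """Exponential integral E1(x) = int_x^inf exp(-t)/t dt, for x > 0."""
    if x <= 1.0:
        # Series: E1(x) = -euler_gamma - ln(x) - sum_{k>=1} (-x)^k / (k * k!)
        total, term = 0.0, 1.0
        for k in range(1, terms + 1):
            term *= -x / k  # term now holds (-x)^k / k!
            total += term / k
        return -0.5772156649015329 - math.log(x) - total
    # Continued fraction e^-x / (x + 1/(1 + 1/(x + 2/(1 + 2/(x + ...))))),
    # evaluated from the innermost level outwards.
    frac = 0.0
    for k in range(terms, 0, -1):
        frac = k / (1.0 + k / (x + frac))
    return math.exp(-x) / (x + frac)

def mmse_lsa_gain(xi, gamma):
    """MMSE log-spectral amplitude gain for scalar a priori SNR xi and a posteriori SNR gamma."""
    v = xi / (1.0 + xi) * gamma
    return xi / (1.0 + xi) * math.exp(0.5 * exp1(v))
```

For large v the exponential term tends to 1, so the gain tends to the Wiener gain.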

35
MMSE Log-Spectral Amplitude Estimator
  • Comparison of the suppression gains of MMSE STSA
    and MMSE LSA

  • - The gain curves of MMSE LSA are always lower
    than those of MMSE STSA, resulting in lower
    residual noise.
  • - When the a priori SNR is high, the gain curve of
    MMSE LSA is very flat, similar to the Wiener
    filter. When the a priori SNR is low, the gain
    curve of the MMSE LSA varies with the instantaneous
    SNR, as the MMSE STSA does.
[Audio samples: noisy sample (0 dB), Decision-Directed Wiener Filter, MMSE LSA]
36
MMSE estimator with non-Gaussian prior
How well does the Gaussian model fit the real
probability distribution of DFT coefficients?
[Figure: histogram of speech DFT amplitude; histogram
of the DFT amplitude of noise recorded in a market place.
The histograms are taken from one hour of speech.]
37
MMSE estimator with non-Gaussian prior
  • The probability density function of the DFT
    coefficients of speech can be better modeled by
    super-Gaussian functions (e.g. Gamma or Laplace)
    than by the Gaussian function (Martin, 2002,
    2003).
  • An even more exact probability density function
    is one tailored to fit the shape of the
    histogram of the DFT coefficients (Lotter & Vary,
    2003).
  • Using these density functions in place of the
    Gaussian density function (for speech or noise
    processes) in the MMSE estimator can result in
    better noise reduction.
  • The non-Gaussian prior MMSE estimator is nonlinear
    and non-zero-phase.

38
MMSE estimator with non-Gaussian prior
  • Compared with WF
  • Better output SNR (Gaussian/Gamma)
  • Less musical noise (Laplace/Gamma)
  • Less distortion to the speech

39
Noise PSD estimation
Sørensen & Andersen, EURASIP J. Applied Signal
Processing, 2005
40
Noise PSD estimation
  • Most conventional speech enhancement methods rely
    on good estimates of noise PSD.
  • When the noise is non-stationary, online tracking
    of the noise PSD is needed.

41
Methods
  • Minimum statistics (Martin, 2001)
  • Minima Controlled Recursive Averaging (Cohen &
    Berdugo, 2002)
  • Connected Region Speech Presence Detection
    (CRSPD) (Sørensen & Andersen, 2005)
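To give a feel for what online noise tracking involves, the sketch below follows the minimum of a recursively smoothed periodogram over a sliding window of frames. It is a deliberately simplified stand-in: it is none of the three published methods listed above (it has no bias compensation, adaptive smoothing, or speech presence detection), and the function name and default parameters are assumptions.

```python
import numpy as np

def track_noise_floor(periodograms, smoothing=0.85, window=50):
    """Sliding-window minimum of a smoothed periodogram.

    periodograms: array of shape (n_frames, n_bins).
    Returns an array of the same shape with a crude noise floor estimate.
    """
    periodograms = np.asarray(periodograms, dtype=float)
    smoothed = np.empty_like(periodograms)
    smoothed[0] = periodograms[0]
    for m in range(1, len(periodograms)):
        # First-order recursive smoothing along time, per frequency bin.
        smoothed[m] = smoothing * smoothed[m - 1] + (1.0 - smoothing) * periodograms[m]
    noise = np.empty_like(smoothed)
    for m in range(len(smoothed)):
        # Speech bursts are short, so the minimum over the window tracks the noise.
        noise[m] = smoothed[max(0, m - window + 1): m + 1].min(axis=0)
    return noise
```

The window length trades tracking speed against the risk of speech energy leaking into the noise estimate.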

42
CRSPD
43
Highway traffic noise (5 dB)
44
Smoothed periodogram
46
Estimated noise periodogram and noise PSD
47
Enhanced spectrogram
48
Demos
  • Highway traffic noise (0 dB)
  • Noisy sample
  • MMSE-LSA
  • CRSPD
  • Interior car noise (0 dB)
  • Noisy sample
  • MMSE-LSA
  • CRSPD

49
MMSE joint estimator for amplitude and phase
spectra
C. Li, S. V. Andersen, "Inter-Frequency
Dependency in MMSE Speech Enhancement,"
NORSIG 2004.
C. Li, S. V. Andersen, "A Block-Based
Linear MMSE Noise Reduction with a High Temporal
Resolution Modeling of the Speech Excitation,"
EURASIP Journal on Applied Signal Processing.
50
Why MMSE joint estimator?
  • Phase is found to be important for noise
    reduction of low SNR sources, yet an independent
    optimum amplitude estimator and an optimum phase
    estimator do not coexist.
  • Finite frame length and temporal power
    localization introduce correlation between
    spectral components. This correlation can be
    exploited to improve the estimate of low SNR
    frequency bins.
  • Time localization can be modeled with the joint
    MMSE estimator, but cannot be modeled by the
    frequency domain Wiener filter. Time localization
    indicates how much the phase is linearly related.

51
Formulation
Signal model
$y = F S + v$
where F is the inverse Fourier matrix, S is the
Fourier coefficient vector, and v is a white
Gaussian noise vector.
The MMSE estimator of S can be shown to be
$\hat{S} = C_S (C_S + C_V)^{-1} F^{-1} y$
with $C_S$ and $C_V$ being the spectral covariance
matrices of the signal and the noise, respectively
(both need to be estimated).
52
Estimating covariance matrix
Let 1/A(z) denote the transfer function of the
all-pole model of speech, r denote the LPC
residual, and H denote the Toeplitz analysis
matrix consisting of the coefficients of A(z), such that
$r = H s$
The covariance matrix of r can be written as a
diagonal matrix with the squared elements of r on
its diagonal, $C_r = \mathrm{diag}(r_1^2, \ldots, r_N^2)$. Then,
with s = F S, the covariance matrices of
s and S can be written respectively as
$C_s = H^{-1} C_r H^{-T}, \qquad C_S = F^{-1} C_s F^{-H}$
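The construction described on this slide and the previous one can be sketched in NumPy as below. This is an illustration under stated assumptions rather than the authors' implementation: the LP coefficients and the residual are taken as given, A(z) is assumed to be 1 - sum a_k z^-k, the DFT is taken as unitary so that white noise of variance `noise_var` stays white in the frequency domain, and the function name is hypothetical.

```python
import numpy as np

def joint_mmse_spectrum(noisy, lp_coefs, residual, noise_var):
    """Linear MMSE estimate of the Fourier coefficients with a full spectral covariance."""
    n = len(noisy)
    # Analysis matrix H (r = H s): lower-triangular Toeplitz built from A(z).
    a_poly = np.concatenate(([1.0], -np.asarray(lp_coefs, dtype=float)))
    H = np.zeros((n, n))
    for k, coef in enumerate(a_poly[:n]):
        H += coef * np.eye(n, k=-k)
    H_inv = np.linalg.inv(H)
    # Time-localized excitation: diagonal covariance from the squared residual.
    C_r = np.diag(np.asarray(residual, dtype=float) ** 2)
    C_s = H_inv @ C_r @ H_inv.T
    # Unitary DFT matrix W; the inverse Fourier matrix is F = W^H.
    W = np.fft.fft(np.eye(n)) / np.sqrt(n)
    C_S = W @ C_s @ W.conj().T          # full matrix, not diagonal
    C_V = noise_var * np.eye(n)
    Y = W @ np.asarray(noisy, dtype=float)
    S_hat = C_S @ np.linalg.solve(C_S + C_V, Y)
    return S_hat, (W.conj().T @ S_hat).real  # spectrum and time-domain estimate
```

Keeping only the diagonal of `C_S` would reduce this to a per-bin Wiener filter, which is the contrast drawn on the next slide.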
53
Joint estimator vs. spectral amplitude MMSE
estimators
  • In the joint estimator, the spectral covariance
    matrix is assumed to be a full matrix, while
    the Wiener filter and MMSE LSA estimator assume
    it is a diagonal matrix.
  • This allows the joint estimator to exploit the
    correlation between frequency components, which
    is ignored by the frequency domain MMSE
    estimators.

54
Correlation of frequency components
The covariance matrix of the frequency components
55
Covariance in time and frequency
56
Estimated spectra
The TFE-MMSE estimator preserves the signal
spectrum better than the Wiener filter.
57
White Gaussian noise (10 dB)
58
Results
  • TFE-MMSE estimator
  • TFE-Kalman filtering
  • Compared to
  • WF
  • Noisy (10dB)

59
Iterative Kalman filtering
C. Li, S. V. Andersen, "An Iterative Speech
Enhancement Scheme Based on Kalman Filtering,"
EUSIPCO 2005
60
Motivation
  • Kalman filter is a time domain MMSE estimator,
    which is a joint amplitude-phase estimator when
    the non-stationarity of the signal (or the system
    noise) is faithfully represented.
  • We designed a non-stationary Kalman filter with
    high temporal resolution modeling of the
    excitation.
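A generic textbook Kalman filter for an AR speech model observed in white noise, with an excitation variance that may change at every sample, is sketched below. It only illustrates the idea of a non-stationary system noise; it is not the authors' filter, and the function name, the initial covariance, and the assumption that the LP coefficients and variances are already known are all choices of this example.

```python
import numpy as np

def kalman_ar_filter(noisy, lp_coefs, excitation_var, noise_var):
    """Filter y(n) = x(n) + d(n), where x(n) = sum_k a_k x(n-k) + e(n).

    excitation_var[n] is the variance of e(n) at sample n (time-varying).
    noise_var is the variance of the white observation noise d(n).
    """
    p = len(lp_coefs)
    # Companion-form transition matrix; state = [x(n), x(n-1), ..., x(n-p+1)].
    A = np.zeros((p, p))
    A[0, :] = lp_coefs
    A[1:, :-1] = np.eye(p - 1)
    state = np.zeros(p)
    P = np.eye(p)  # initial error covariance (arbitrary choice)
    out = np.empty(len(noisy))
    for n, y in enumerate(noisy):
        # Predict: the excitation only drives the first state component.
        state = A @ state
        P = A @ P @ A.T
        P[0, 0] += excitation_var[n]
        # Update with the scalar observation of the first state component.
        k = P[:, 0] / (P[0, 0] + noise_var)
        state = state + k * (y - state[0])
        P = P - np.outer(k, P[0, :])
        out[n] = state[0]
    return out
```

Feeding a pulse-like `excitation_var` (large at glottal pulses, small elsewhere) is one way to represent the temporal localization of voiced excitation.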

61
Two steps each iteration
62
The algorithm
  • Speech spectrum is estimated by the WPSS and LPC
    block.
  • Speech excitation is estimated by the WPSS and
    PEKF block.
  • Need only one iteration. Iterations are done
    sequentially, exploiting the correlation between
    consecutive spectra.

63
Results
64
Demo
  • White Gaussian noise (10 dB)
  • Noisy sample
  • Conventional Kalman filtering
  • Iterative Wiener filtering (EM)
  • Proposed IKF

65
Blind system identification for non-Gaussian
speech analysis
C. Li, S. V. Andersen, "Efficient blind system
identification of non-Gaussian Autoregressive
models with HMM modeling of the excitation,"
IEEE Trans. Signal Processing, 2006, submitted.
66
Motivation
  • Non-Gaussian model for voiced speech
  • Estimate the excitation model, vocal tract model
    parameters, and the noise variance jointly.
  • Blind identification: the only known quantity is
    the noisy observation.

67
E-HMARM model
68
Estimated spectra
Input SNR is 15 dB, white Gaussian noise.
69
Convergence
70
What is the significance of the E-HMARM?
  • A non-Gaussian speech analysis method (LPC is
    not)
  • A noise robust speech analysis method (LPC is
    not)
  • A blind deconvolution of the excitation from the
    vocal tract filter
  • Potential in single channel source separation
    (due to the non-Gaussian model)

71
Future work
  • Single channel BSS
  • Gaussian AR and Gaussian AR
  • Non-Gaussian AR and Gaussian AR
  • Non-Gaussian AR and non-Gaussian AR
  • More comprehensive non-Gaussian non-stationary
    speech models
  • Speech manipulation
  • Speech interpolation