Title: SPEECH ENHANCEMENT
1SPEECH ENHANCEMENT
- Chunjian Li
- Aalborg University, Denmark
2Introduction
- Applications
- Improving quality and intelligibility (hearing
aids, cockpit comm., video conferencing ...)
- Source coding (mobile phone, video conferencing,
IP phone ...)
- Pre-processor for other speech processing
applications (speech recognition, speaker
verification ...)
3Introduction
- Classification 1
- Single channel
- Multi-channel
- with acoustic barrier (Adaptive
Noise Canceling)
- without acoustic barrier (Beamforming, ICA)
- Classification 2
- Spectral domain methods (Power Spectral
Subtraction, Amplitude Spectral Subtraction,
Autocorrelation Subtraction, Non-causal IIR
Wiener Filtering)
- Time domain methods (Adaptive noise canceling,
Kalman Filtering, non-stationary Wiener
filtering)
4Spectral subtraction
5Power Spectral Subtraction
- Stochastic Model
- Noise process: broadband, stationary (or
short-time stationary), uncorrelated with speech,
additive.
- Speech process: short-time stationary.
- Need short-time processing
6Power Spectral Subtraction
- Important relation in the Power Spectral Density
(PSD) domain
- This is true only when the noise is
uncorrelated with the speech signal.
- To be concise, the index m is dropped in the
following discussion
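Writing the noisy frame as $y_m(n) = s_m(n) + d_m(n)$ (symbol names assumed here, with $m$ the frame index), the additive PSD relation that holds under the stochastic model above takes the standard form:

```latex
P_y(m,\omega) = P_s(m,\omega) + P_d(m,\omega)
```

The cross terms between speech and noise vanish only in expectation, so on a single short-time frame the relation is an approximation.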
7Power Spectral Subtraction
(1)
Power Spectral Subtraction methods use the
noisy phase spectrum to synthesize the enhanced
signal.
8Generalized Spectral Subtraction and its variants
- Generalization
- Eq(1) can be written as
(2)
- When a = 1, Eq. (2) is called Amplitude Spectral
Subtraction (Boll, 1979).
- Variant: Correlation subtraction
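A minimal sketch of generalized spectral subtraction on one frame, assuming the common form in which the subtraction is carried out on the a-th power of the magnitude spectrum (a = 2 for power subtraction, a = 1 for amplitude subtraction). The function name, the `floor` parameter, and the use of `noise_psd ** (a / 2)` as the noise term are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def spectral_subtraction(noisy_frame, noise_psd, a=2.0, floor=0.0):
    """Enhance one frame by generalized spectral subtraction (sketch).

    noisy_frame : real-valued time-domain frame.
    noise_psd   : estimate of E|D(w)|^2 on the rfft grid (assumed given).
    a           : exponent; a=2 power subtraction, a=1 amplitude subtraction.
    floor       : lower bound applied after subtraction (0 = half-wave
                  rectification).
    """
    noisy_frame = np.asarray(noisy_frame, dtype=float)
    noise_psd = np.asarray(noise_psd, dtype=float)
    Y = np.fft.rfft(noisy_frame)
    mag = np.abs(Y)
    # Approximate |D|^a by (E|D|^2)^(a/2); this is an assumption of the sketch.
    noise_term = noise_psd ** (a / 2.0)
    diff = np.maximum(mag ** a - noise_term, floor)  # rectify negative values
    enhanced_mag = diff ** (1.0 / a)
    # The noisy phase is reused for synthesis.
    S_hat = enhanced_mag * np.exp(1j * np.angle(Y))
    return np.fft.irfft(S_hat, n=len(noisy_frame))
```

In practice the frame would be windowed and the outputs overlap-added; that part is omitted here.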
9Comments on Spectral Subtraction methods
- Low complexity
- Severe musical noise
- Usually need further enhancement
- - Smoothing in time and frequency
- - Rectification
Audio samples: Noisy speech sample (0 dB); Amplitude Spectral Subtraction; Power Spectral Subtraction
10Comments on Spectral Subtraction methods
- Oversuppressing and smoothing can reduce residual
noise but result in distortion to the speech
spectrum.
Audio samples: Oversuppressing ASS; Oversuppressing PSS; Smoothing in time
11Wiener Filter
12Wiener Filtering
- The non-causal infinite impulse response Wiener
filter (hereafter Non-causal IIR Wiener
Filter) is regarded as a spectral domain method,
although the filtering problem originated in the time
domain.
- The non-causal IIR Wiener filter with AR modeling of
speech can be applied in an iterative manner, such
that signal estimation and parameter estimation
each build on the latest result of the other.
13Non-causal IIR Wiener Filter
- A linear Minimum Mean Squared Error Filter
- Orthogonality principle
- Wiener-Hopf equation
14Non-causal Wiener Filter
- Orthogonality principle (frequency domain)
- Transfer function
- MSE of the Wiener filter
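For additive noise that is uncorrelated with the speech, the non-causal Wiener transfer function has the standard form H(w) = P_s(w) / (P_s(w) + P_d(w)), and the minimum MSE contributed at each frequency is P_s(w) P_d(w) / (P_s(w) + P_d(w)). The sketch below evaluates both on a frequency grid; the function names and the `eps` guard against division by zero are assumptions made for illustration.

```python
import numpy as np

def wiener_gain(speech_psd, noise_psd, eps=1e-12):
    """Zero-phase Wiener gain P_s / (P_s + P_d) per frequency bin."""
    speech_psd = np.asarray(speech_psd, dtype=float)
    noise_psd = np.asarray(noise_psd, dtype=float)
    return speech_psd / (speech_psd + noise_psd + eps)

def wiener_mse_per_bin(speech_psd, noise_psd, eps=1e-12):
    """Per-bin minimum MSE P_s * P_d / (P_s + P_d) of the Wiener filter."""
    speech_psd = np.asarray(speech_psd, dtype=float)
    noise_psd = np.asarray(noise_psd, dtype=float)
    return speech_psd * noise_psd / (speech_psd + noise_psd + eps)
```

The gain is real and lies between 0 and 1, so it attenuates the amplitude spectrum and leaves the phase untouched.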
15Comments on Non-causal WF
- Requires estimates of the power spectra of speech
and noise.
- Performance depends strongly on the quality of
the speech and noise spectrum estimates.
- The WF over-suppresses the speech spectrum, which results in
a muffling effect.
- The WF does not process the phase spectrum.
16Comments on Non-causal WF
- Muffling effect caused by over-suppression
Figure legend: blue = original; black = Wiener filter; green =
square-root Wiener filter
17Comments on Non-causal WF
- Roughness caused by phase noise
- The phase spectrum is not processed, which results in
a loss of phase coherence in voiced speech. The
effect is called roughness or reverberance.
- Samples of muffling and roughness
Audio samples: Clean samples; Muffling; Roughness; Muffling and
roughness
18Iterative Wiener Filtering
- A parametric method using an all-pole model
- A sequential MAP estimator of both the speech
waveform and the LP coefficients.
- Lim, Oppenheim 1978
19Iterative Wiener Filtering
- All-pole modeling of speech
- - The speech amplitude spectrum can be well modeled
by an all-pole transfer function (the vocal
tract) excited by white noise or a pulse train (the
glottal pulses). The coefficients of the all-pole
model can be found by Linear Prediction
analysis and are therefore called LP coefficients; the
excitation is called the residual.
- - The LP model is minimum phase, which is
generally not the true phase of the vocal tract.
20Iterative Wiener Filtering
- The algorithm
- Estimate the LP coef. from the noisy
observation samples. Estimate the noise spectrum
during non-speech activity.
- Estimate the signal using the non-causal IIR WF
given the current estimate of the LP coef. and the
current estimate of the noise spectrum.
- Estimate the LP coef. again given the current
estimate of the waveform.
- Iterate until the convergence criterion is
satisfied.
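The steps above can be sketched for a single frame as follows. This is a simplified illustration, not the Lim and Oppenheim algorithm in full: it runs a fixed number of iterations instead of testing convergence, uses the autocorrelation method for the LP analysis, and assumes the noise PSD is supplied on the same per-sample scale as the all-pole PSD (white noise of variance sigma^2 has PSD sigma^2). All function and parameter names are assumptions.

```python
import numpy as np

def lpc(x, order):
    """LP coefficients [1, a1..ap] and prediction-error power
    (autocorrelation method, normal equations solved directly)."""
    n = len(x)
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(order + 1)]) / n
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R = R + 1e-9 * (r[0] + 1e-12) * np.eye(order)  # small ridge for stability
    a = np.linalg.solve(R, -r[1:])
    err_power = r[0] + np.dot(a, r[1:])
    return np.concatenate(([1.0], a)), max(err_power, 1e-12)

def allpole_psd(a, err_power, nfft):
    """PSD of the all-pole model err_power / |A(e^jw)|^2 on the rfft grid."""
    A = np.fft.rfft(a, n=nfft)
    return err_power / np.maximum(np.abs(A) ** 2, 1e-12)

def iterative_wiener(noisy, noise_psd, order=10, n_iter=3):
    """Alternate LP estimation and non-causal Wiener filtering of one frame."""
    noisy = np.asarray(noisy, dtype=float)
    n = len(noisy)
    Y = np.fft.rfft(noisy)
    estimate = noisy.copy()            # first LP analysis uses the noisy frame
    for _ in range(n_iter):            # fixed count stands in for a stop rule
        a, err_power = lpc(estimate, order)
        speech_psd = allpole_psd(a, err_power, n)
        H = speech_psd / (speech_psd + noise_psd)   # Wiener gain
        estimate = np.fft.irfft(H * Y, n=n)         # filter the noisy frame
    return estimate
```

Each pass filters the original noisy spectrum with a gain built from the latest LP model, which is the mechanism behind the progressively sharper formants reported for this method.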
21Iterative Wiener Filtering
- Comments
- Convergence is not guaranteed; a heuristic stopping
criterion is needed
- Results in unrealistically sharp formants and
pole jittering
- Suffers from musical noise
- Needs some kind of smoothing
Audio samples: 10 dB noisy sample; Iterative WF; Iterative WF with smoothing
22Further enhancement to IWF
- Constrained IWF (Hansen, Clements 1987)
- Apply spectral constraints inter-frame and
intra-frame using the LSP transformation.
- Pole-zero modeling (Flanagan 1972)
- Replace WF with Kalman filtering (Gibson 1991)
- Vector quantization method (Gibson 1988)
- Use HMM (Ephraim 1988)
23Phase issues
- The majority of noise reduction methods
process only the amplitude spectrum, while the noisy phase
spectrum is left unprocessed.
- The reasons are
- - Human ears are less sensitive to phase than to
the amplitude spectrum.
- - Masking of amplitude to phase (6dB/0.6rad
threshold).
24MMSE methods
25MMSE approaches to speech enhancement
- Wiener filtering, MMSE amplitude spectrum
estimator, MMSE log-amplitude spectrum estimator,
non-Gaussian prior MMSE approaches.
- The dominant technique, because of better
performance than the Spectral Subtraction
methods.
- Need a priori information about the speech and noise
(i.e., covariance matrices or PSDs).
26MMSE amplitude spectrum estimator (Ephraim-Malah
filter)
- Ephraim, Malah 1984
- The basis of the noise reduction function of the
MELPe coding standard
- Consists of two parts: the Decision-Directed
method, which estimates the speech spectrum, and the
MMSE Short-Time Spectral Amplitude (STSA)
estimator
27MMSE STSA estimator
- Assumptions
- Stationary additive Gaussian noise with known
PSD.
- An estimate of the speech spectrum is available.
- Spectral components (DFT coefficients) are
statistically independent and each follows a
Gaussian distribution (the DFT amplitude follows a
Rayleigh distribution).
- The DFT phase follows a uniform distribution and is
independent of the amplitude.
Let $Y_k$, $X_k$, and $D_k$ denote the kth spectral component of
the noisy observation y(t), the signal x(t), and
the noise d(t), respectively.
28MMSE STSA estimator
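With the following PDFs
$$p(Y_k \mid A_k, \alpha_k) = \frac{1}{\pi \lambda_d(k)} \exp\left\{-\frac{|Y_k - A_k e^{j\alpha_k}|^2}{\lambda_d(k)}\right\}, \qquad p(A_k, \alpha_k) = \frac{A_k}{\pi \lambda_x(k)} \exp\left\{-\frac{A_k^2}{\lambda_x(k)}\right\},$$
and Bayes rule, the estimator can be shown
to be
$$\hat{A}_k = \frac{\sqrt{\pi}}{2} \frac{\sqrt{v_k}}{\gamma_k} \exp\left(-\frac{v_k}{2}\right) \left[(1+v_k)\, I_0\!\left(\frac{v_k}{2}\right) + v_k\, I_1\!\left(\frac{v_k}{2}\right)\right] R_k$$
Where $A_k$ and $\alpha_k$ are the amplitude and phase of the speech
spectral component, $R_k$ is the amplitude of the noisy spectral
component, $I_0$ and $I_1$ denote the modified
Bessel functions of zero and first order, and
$v_k$ is defined by
$$v_k = \frac{\xi_k}{1+\xi_k}\,\gamma_k$$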
29MMSE STSA estimator
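Where $\xi_k$ and $\gamma_k$ are defined by
$$\xi_k = \frac{\lambda_x(k)}{\lambda_d(k)}, \qquad \gamma_k = \frac{R_k^2}{\lambda_d(k)}$$
Where $\lambda_x(k) = E\{|X_k|^2\}$ and $\lambda_d(k) = E\{|D_k|^2\}$ are the
variances of the kth speech and noise spectral components, and
$R_k$ is the amplitude of the kth noisy spectral component.
$\xi_k$ and $\gamma_k$ are interpreted as the a priori
and a posteriori signal-to-noise ratio,
respectively. $\xi_k$ is estimated by the
Decision-Directed method.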
30Decision-Directed method
- An estimate of the a priori SNR.
- A combination of Power Spectrum Subtraction, half
wave rectification and inter-frame smoothing.
- The smoothing factor $\alpha$ is usually chosen to be 0.98 in order to get
the best smoothing performance. The higher
$\alpha$ is, the less musical noise, but the more
distortion to the speech.
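The decision-directed estimate can be sketched as a weighted sum of the previous frame's amplitude estimate (squared and normalized by the noise PSD) and the half-wave rectified instantaneous SNR of the current frame. The argument names and the lower bound `xi_min` are assumptions added for illustration; the default `alpha=0.98` is the value quoted above.

```python
import numpy as np

def decision_directed_xi(prev_amp, noisy_power, noise_psd,
                         alpha=0.98, xi_min=1e-3):
    """A priori SNR estimate for the current frame (sketch).

    prev_amp    : enhanced spectral amplitude from the previous frame.
    noisy_power : |Y_k|^2 of the current frame.
    noise_psd   : noise variance per bin (assumed known or tracked).
    Returns (xi, gamma): a priori and a posteriori SNR per bin.
    """
    prev_amp = np.asarray(prev_amp, dtype=float)
    noisy_power = np.asarray(noisy_power, dtype=float)
    noise_psd = np.asarray(noise_psd, dtype=float)
    gamma = noisy_power / noise_psd
    # Power subtraction followed by half-wave rectification.
    instantaneous = np.maximum(gamma - 1.0, 0.0)
    # Inter-frame smoothing controlled by alpha.
    xi = alpha * prev_amp ** 2 / noise_psd + (1.0 - alpha) * instantaneous
    return np.maximum(xi, xi_min), gamma
```

With alpha close to 1 the estimate follows the previous frame's output closely, which smooths out isolated spectral peaks at the cost of a slower response to speech onsets.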
31Comments on the MMSE STSA estimator
- Comparison of the suppression gains of Wiener
filter and MMSE STSA
- The instantaneous SNR can be interpreted as the a
priori SNR estimated without smoothing.
- WF gains do not vary with the instantaneous SNR;
they vary only with the a priori SNR, whereas the MMSE
STSA gains vary with both the instantaneous SNR and the a
priori SNR.
- When the a priori SNR is high, the MMSE STSA
estimator has gain curves very close to the WF.
When the a priori SNR is low, the MMSE STSA shows
a higher gain, which is strongly affected by the
instantaneous SNR.
32Comments on the MMSE STSA estimator
- A comparison of the suppression gains of PSS, WF
and MMSE STSA estimator
Figure: suppression gain versus estimated a priori SNR.
Curves: the MMSE STSA, where Rpost denotes the a priori SNR
estimated without smoothing (the instantaneous
SNR); solid line: power subtraction; dashed
line: Wiener filter.
33Comments on the MMSE STSA estimator
- The gain curve transitions smoothly between the power
subtraction curve and the Wiener curve. This
transition is controlled by the un-smoothed estimate
of the a priori SNR (Rpost). The larger Rpost, the
stronger the attenuation.
- This counter-intuitive behavior manages to
flatten the spurious spectral peaks caused by the
noise in the low SNR part of the spectrum, whereas the
WF tends to sharpen the spurious peaks in the low
SNR part of the spectrum.
- The phase of the noisy speech is used as the
phase of the enhanced speech, because of the
assumption of uniformly distributed phase. An
independent MMSE estimate of the phasor has
non-unity modulus, and thus cannot be combined with
the MMSE STSA.
- Suffers less musical noise than the WF.
34MMSE Log-Spectral Amplitude Estimator
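- A modification to the MMSE STSA based on the fact
that a distortion measure based on the
mean-square error of the log-spectra is more
suitable for speech processing.
- Minimize the distortion measure
$$E\left\{(\log A_k - \log \hat{A}_k)^2\right\}$$
- The MMSE LSA estimator can be shown to be
$$\hat{A}_k = \frac{\xi_k}{1+\xi_k} \exp\left(\frac{1}{2}\int_{v_k}^{\infty} \frac{e^{-t}}{t}\,dt\right) R_k$$
- where $v_k = \frac{\xi_k}{1+\xi_k}\gamma_k$, $R_k$ is the amplitude of the noisy spectral component, and
$\xi_k$ and $\gamma_k$ are the a priori SNR and a posteriori SNR as defined before.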
35MMSE Log-Spectral Amplitude Estimator
- Comparison of the suppression gains of MMSE STSA
and MMSE LSA
- The gain curves of MMSE LSA are always lower
than those of MMSE STSA, resulting in lower
residual noise.
- When the a priori SNR is high,
the gain curve of MMSE LSA is very flat, which is
similar to the Wiener filter. When the a priori SNR
is low, the gain curve of the MMSE LSA varies
w.r.t. the instantaneous SNR as the MMSE STSA
does.
Audio samples: Decision-Directed Wiener Filter; MMSE LSA; Noisy sample (0 dB)
36MMSE estimator with non-Gaussian prior
How well does the Gaussian model fit the real
probability distribution of DFT coefficients?
Figure: histogram of speech DFT amplitude.
Figure: histogram of noise (recorded from market place)
DFT amplitude.
The histograms are taken from one hour of speech
37MMSE estimator with non-Gaussian prior
- The probability density function of the DFT
coefficients of speech can be better modeled by
super-Gaussian functions (e.g. Gamma or Laplace)
than by the Gaussian function (Rainer Martin 2002,
2003).
- An even more exact probability density function
is one tailored to fit the shape of the
histogram of the DFT coefficients (Lotter, Vary
2003).
- Using these density functions in place of the
Gaussian density function (for speech or noise
processes) in the MMSE estimator can result in
better noise reduction.
- The non-Gaussian prior MMSE estimator is nonlinear and
non-zero-phase.
38MMSE estimator with non-Gaussian prior
- Compared with WF
- Better output SNR (Gaussian/Gamma)
- Less musical noise (Laplace/Gamma)
- Less distortion to the speech
39Noise PSD estimation
Sørensen, Andersen, EURASIP J. Applied Signal
Processing, 2005
40Noise PSD estimation
- Most conventional speech enhancement methods rely
on good estimates of the noise PSD.
- When the noise is non-stationary, online tracking
of the noise PSD is needed.
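As an illustration of online tracking, the sketch below follows the basic idea of minimum tracking: smooth the periodogram recursively over time, then take the minimum of the smoothed values over a sliding window in each frequency bin. It is a deliberately simplified stand-in and omits the bias compensation and adaptive smoothing of published minimum-statistics methods; the function name and the `smooth` and `window` defaults are assumptions.

```python
import numpy as np

def track_noise_psd(periodograms, smooth=0.85, window=50):
    """Simplified minimum-tracking noise PSD estimate.

    periodograms : array of shape (n_frames, n_bins) holding |Y|^2.
    smooth       : recursive smoothing constant in [0, 1).
    window       : number of past frames searched for the minimum.
    Returns an array of the same shape with the tracked noise PSD.
    """
    P = np.asarray(periodograms, dtype=float)
    smoothed = np.empty_like(P)
    noise = np.empty_like(P)
    smoothed[0] = P[0]
    for m in range(1, len(P)):
        # First-order recursive averaging of the periodogram.
        smoothed[m] = smooth * smoothed[m - 1] + (1.0 - smooth) * P[m]
    for m in range(len(P)):
        start = max(0, m - window + 1)
        # Speech bursts shorter than the window do not raise the minimum.
        noise[m] = smoothed[start:m + 1].min(axis=0)
    return noise
```

The window length trades tracking speed against robustness: a short window reacts quickly to rising noise but may latch onto speech, while a long window is safer but slower.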
41Methods
- Minimum statistics (Martin 2001)
- Minima Controlled Recursive Averaging (Cohen,
Berdugo 2002)
- Connected Region Speech Presence Detection
(CRSPD) (Sørensen, Andersen 2005)
42CRSPD
43Highway traffic noise (5 dB)
44Smoothed periodogram
46Estimated noise periodogram and noise PSD
47Enhanced spectrogram
48Demos
- Highway traffic noise (0 dB)
- Noisy sample
- MMSE-LSA
- CRSPD
- Interior car noise (0 dB)
- Noisy sample
- MMSE-LSA
- CRSPD
49MMSE joint estimator for amplitude and phase
spectra
C. Li, S. V. Andersen, "Inter-Frequency
Dependency in MMSE Speech Enhancement,"
NORSIG04
C. Li, S. V. Andersen, "A Block-Based
Linear MMSE Noise Reduction with a High Temporal
Resolution Modeling of the Speech Excitation,"
EURASIP Journal on Applied Signal Processing
50Why MMSE joint estimator?
- Phase is found to be of importance for noise
reduction of low SNR sources, yet an independent
optimum amplitude estimator and an optimum phase
estimator do not coexist.
- Finite frame length and temporal power
localization introduce correlation between
spectral components. This correlation can be
exploited to improve the estimate of low SNR
frequency bins.
- Time localization can be modeled with the joint
MMSE estimator, but cannot be modeled by the
frequency domain Wiener filter. Time localization
indicates how much the phase is linearly related.
51Formulation
Signal model
$$y = F S + v$$
Where y is the noisy observation vector, F is the inverse Fourier matrix, S is the
vector of Fourier coefficients, and v is a white
Gaussian noise vector.
The MMSE estimator of S can be shown to be
$$\hat{S} = C_S \left(C_S + C_V\right)^{-1} F^{-1} y$$
with $C_S$ and $C_V$ being the spectral covariance matrices
of the signal and the noise, respectively (both need
to be estimated).
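A minimal sketch of a linear MMSE estimate of the spectrum under the signal model described above, assuming zero-mean signal and noise, white noise of known variance, and a signal spectral covariance matrix that is supplied by the caller. The function name and argument names are assumptions; the DFT matrix follows NumPy's `fft` convention, so it plays the role of the inverse of F.

```python
import numpy as np

def lmmse_spectrum(y, C_S, noise_var):
    """Linear MMSE estimate of the DFT coefficient vector S from y = F S + v.

    y         : real time-domain observation of length n.
    C_S       : (n, n) spectral covariance matrix of the signal (may be full).
    noise_var : variance of the white time-domain noise.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    W = np.fft.fft(np.eye(n))              # forward DFT matrix, i.e. F^{-1}
    Y = W @ y                              # noisy spectrum
    # Spectral covariance of white noise: noise_var * W W^H = noise_var * n * I
    C_V = noise_var * (W @ W.conj().T)
    # S_hat = C_S (C_S + C_V)^{-1} Y, solved without forming the inverse.
    return C_S @ np.linalg.solve(C_S + C_V, Y)
```

When `C_S` is diagonal the estimator collapses to an independent Wiener gain per bin; the off-diagonal entries are what let a full `C_S` carry information between frequency components.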
52Estimating covariance matrix
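Let 1/A(Z) denote the transfer function of the
all-pole model of speech, r denote the LPC
residual, s denote the time-domain speech vector, and H denote the Toeplitz analysis
matrix consisting of the coefficients of A(Z), such that
$$r = H s$$
The covariance matrix of r, $C_r$, can be written as a
diagonal matrix with the squared elements of r as its
diagonal elements. Then the covariance matrices of
s and S can be written respectively as
$$C_s = H^{-1} C_r H^{-T}, \qquad C_S = F^{-1} C_s F^{-H}$$
where F is the inverse Fourier matrix.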
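The construction described above can be sketched numerically: build the lower-triangular Toeplitz analysis matrix from the LP coefficients, form the diagonal residual covariance, and map it to the time-domain and spectral covariance matrices. Zero initial filter state is assumed, as are the function names; the DFT matrix uses NumPy's `fft` convention for the inverse of the inverse Fourier matrix.

```python
import numpy as np

def analysis_matrix(a, n):
    """Lower-triangular Toeplitz H with r = H @ s for A(z) = 1 + a1 z^-1 + ...

    a : LP coefficients [1, a1, ..., ap]; zero initial state is assumed.
    """
    H = np.zeros((n, n))
    for k, coef in enumerate(a):
        H += coef * np.eye(n, k=-k)        # k-th sub-diagonal carries a_k
    return H

def signal_covariances(a, residual):
    """Time-domain and spectral covariance matrices from the LP model."""
    residual = np.asarray(residual, dtype=float)
    n = len(residual)
    H = analysis_matrix(a, n)
    H_inv = np.linalg.inv(H)               # H has unit diagonal, so it is invertible
    C_r = np.diag(residual ** 2)           # diagonal residual covariance
    C_s = H_inv @ C_r @ H_inv.T            # covariance of s = H^{-1} r
    W = np.fft.fft(np.eye(n))              # forward DFT matrix
    C_S = W @ C_s @ W.conj().T             # covariance of S = W s
    return C_s, C_S
```

Because the residual power varies from sample to sample, `C_s` is not Toeplitz and `C_S` is not diagonal, which is how temporal localization of the excitation shows up as correlation between frequency components.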
53Joint estimator vs. spectral amplitude MMSE
estimators
- In the joint estimator, the spectral covariance
matrix is assumed to be a full matrix, while
the Wiener filter and MMSE LSA estimator assume
it is a diagonal matrix.
- This allows the joint estimator to exploit the
correlation between frequency components, which
is ignored by the frequency domain MMSE
estimators.
54Correlation of frequency components
The covariance matrix of the frequency components
55Covariance in time and frequency
56Estimated spectra
The TFE-MMSE estimator preserves the signal
spectrum better than the Wiener filter.
57White Gaussian noise (10 dB)
58Results
- TFE-MMSE estimator
- TFE-Kalman filtering
- Compared to
- WF
- Noisy (10dB)
59Iterative Kalman filtering
C. Li, S. V. Andersen, "An Iterative Speech
Enhancement Scheme Based on Kalman Filtering,"
EUSIPCO 2005
60Motivation
- The Kalman filter is a time domain MMSE estimator,
which is a joint amplitude-phase estimator when
the non-stationarity of the signal (or the system
noise) is faithfully represented.
- We designed a non-stationary Kalman filter with
high temporal resolution modeling of the
excitation.
61Two steps each iteration
62The algorithm
- Speech spectrum is estimated by the WPSS and LPC
block.
- Speech excitation is estimated by the WPSS and
PEKF block.
- Need only one iteration. Iterations are done
sequentially, exploiting the correlation between
consecutive spectra.
63Results
64Demo
- White Gaussian noise (10 dB)
- Noisy sample
- Conventional Kalman filtering
- Iterative Wiener filtering (EM)
- Proposed IKF
65Blind system identification for non-Gaussian
speech analysis
C. Li, S. V. Andersen, "Efficient blind system
identification of non-Gaussian Autoregressive
models with HMM modeling of the excitation,"
IEEE Trans. Signal Processing, 2006 (submitted).
66Motivation
- Non-Gaussian model for voiced speech
- Estimate the excitation model, vocal tract model
parameters, and the noise variance jointly.
- Blind identification. The only known quantity is the noisy
observation.
67E-HMARM model
68Estimated spectra
Input SNR is 15 dB, white Gaussian noise.
69Convergence
70What is the significance of the E-HMARM?
- A non-Gaussian speech analysis method (LPC is
not)
- A noise robust speech analysis method (LPC is
not)
- A blind deconvolution of the excitation from the
vocal tract filter
- Potential in single channel source separation
(due to the non-Gaussian model)
71Future work
- Single channel BSS
- Gaussian AR and Gaussian AR
- Non-Gaussian AR and Gaussian AR
- Non-Gaussian AR and non-Gaussian AR
- More comprehensive non-Gaussian non-stationary
speech models
- Speech manipulation
- Speech interpolation