1
ICASSP 2006 Robustness Techniques Survey
ShihHsiang 2006
2
PARAMETRIC NONLINEAR FEATURE EQUALIZATION FOR ROBUST SPEECH RECOGNITION
Luz García, José C. Segura, Javier Ramírez, Ángel de la Torre, Carmen Benítez
Dpto. Teoría de la Señal, Telemática y Comunicaciones (TSTC), Universidad de Granada
3
Introduction
  • HEQ has been successfully applied to deal with the nonlinear effects of the acoustic environment in the feature domain
  • It normalizes the probability distributions of the features in such a way that the acoustic environment effects are (partially) removed
  • HEQ still suffers from several limitations
  • It relies on a local estimation of the probability distributions of the features, based on a reduced number of observations belonging to the single utterance to be equalized
  • The nonlinear transformation is based on mapping the global CDF of each feature into a reference one
  • The transformations are usually based on a component-by-component equalization of the feature vector, thus discarding any cross-information between features in the equalization process

4
Introduction (cont.)
  • In this paper, a parametric nonlinear equalization technique is proposed
  • It relies on a two-Gaussian model for the probability distribution of the features
  • and on a simple Gaussian classifier to label the input frames as belonging to the speech or non-speech classes
  • Recognition experiments have been performed on the AURORA 4 database, and the effectiveness of the algorithm is analyzed in comparison with other linear and nonlinear feature equalization techniques

5
Review Histogram Equalization
  • For a given random variable y with probability density function p_y(y), HEQ defines a transformation x = F(y) that maps the distribution of y into a reference distribution p_x(x); the standard choice is F(y) = C_x^{-1}(C_y(y)), where C_y and C_x are the cumulative distribution functions of y and of the reference
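A minimal sketch of such a quantile-based equalization for one feature dimension, assuming NumPy and a precomputed reference sample `ref_values` drawn from the clean training data (names and interface are illustrative, not the authors' implementation):

```python
import numpy as np

def histogram_equalize(feature, ref_values):
    """Map one feature dimension onto a reference distribution.

    feature    : 1-D array of observed feature values (one utterance).
    ref_values : 1-D array of values drawn from the reference
                 (e.g. clean training) distribution.
    """
    # Empirical CDF value of each observation within the utterance.
    ranks = np.argsort(np.argsort(feature))
    cdf = (ranks + 0.5) / len(feature)

    # Inverse reference CDF: look up the quantiles of the reference sample.
    ref_sorted = np.sort(ref_values)
    ref_cdf = (np.arange(len(ref_sorted)) + 0.5) / len(ref_sorted)
    return np.interp(cdf, ref_cdf, ref_sorted)
```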

6
Review Histogram Equalization (cont.)
  • The relative content of non-speech frames is a cause of variability in the HEQ transformation
  • because an estimate of the global probability distribution is used, which takes into account both speech and non-speech frames

7
Review Histogram Equalization (cont.)
  • The unwanted variability of the transformation induced by the variable proportion of non-speech frames in each utterance can be
  • reduced by removing non-speech frames before estimating the transformation
  • Another possibility is to use different transformations for speech and non-speech frames
  • Instead of using a single transformation to map the global CDFs of the features, we can build separate mappings for speech and non-speech frames
  • As an alternative, the authors propose a parametric form of the equalization transform based on a two-Gaussian mixture model
  • The first Gaussian represents non-speech frames, while the second one represents speech frames

8
Two-class parametric equalization
  • For each class, a parametric linear
    transformation is defined to map the clean and
    noisy representation spaces
  • The clean Gaussians for speech and non-speech
    frames can be estimated from the training
    database, while the noisy Gaussians should be
    estimated from the utterance to be equalized

[Figure: clean and noisy Gaussians for the speech and non-speech classes, with the mapping from noisy speech and noise to the clean space]
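A minimal sketch of one such per-class linear mapping, assuming diagonal (per-component) statistics: the frame is whitened with the noisy class Gaussian and re-colored with the clean one. The function name and the diagonal assumption are illustrative, not taken from the paper:

```python
import numpy as np

def class_linear_transform(y, mu_noisy, sigma_noisy, mu_clean, sigma_clean):
    """Map a frame from the noisy to the clean space for one class.

    y                      : feature vector of one frame.
    mu_noisy, sigma_noisy  : mean / std of the class, estimated on the
                             utterance to be equalized.
    mu_clean, sigma_clean  : mean / std of the class, estimated on the
                             clean training set.
    """
    # Whiten with the noisy Gaussian, re-color with the clean Gaussian.
    return mu_clean + sigma_clean * (y - mu_noisy) / sigma_noisy
```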
9
Two-class parametric equalization (cont.)
  • In order to select whether the current frame y is speech or non-speech, a voice activity detector could be used
  • This implies a hard decision between the two linear transformations, which could create discontinuities at the non-speech/speech decision boundary
  • Instead, a soft decision can be used

The equalized feature is obtained by weighting the two class transformations with the posterior probabilities P(n|y) and P(s|y), which are obtained using a simple two-class Gaussian classifier on MFCC C0
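A sketch of this soft decision, assuming the classifier is stored as a dict of priors, means and standard deviations (index 0 = non-speech, 1 = speech) trained on C0; `transform_nonspeech` and `transform_speech` stand for the two per-class transformations above (all names are illustrative):

```python
import numpy as np
from scipy.stats import norm

def soft_equalize(frame, c0, clf, transform_nonspeech, transform_speech):
    """Interpolate the two class transformations with P(n|y) and P(s|y).

    clf : dict with keys 'prior', 'mean', 'std', each a length-2 array
          (index 0 = non-speech, 1 = speech), trained on C0.
    """
    # Class likelihoods of the frame's C0 value under the two Gaussians.
    lik = norm.pdf(c0, loc=clf['mean'], scale=clf['std'])
    post = clf['prior'] * lik
    post /= post.sum()                     # posteriors P(n|y), P(s|y)
    return post[0] * transform_nonspeech(frame) + post[1] * transform_speech(frame)
```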
10
Two-class parametric equalization (cont.)
  • Training the two-class Gaussian classifier
  • Initially, those frames with C0 below the mean
    value are assigned to the non-speech class and
    those with C0 above the mean are assigned to the
    speech class
  • The EM algorithm is then iterated until
    convergence (usually, 10 iterations are enough)
    to obtain the final classifier
  • This classifier is used to obtain the class probabilities P(n|y) and P(s|y), and also to obtain the means and covariance matrices μ_n,y, Σ_n,y, μ_s,y and Σ_s,y for the non-speech and speech classes of the given noisy input utterance
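A sketch of this training procedure, assuming a 1-D two-Gaussian model fitted to the utterance's C0 track with the mean-split initialization and a fixed number of EM iterations; it returns the classifier in the same dict form used in the earlier sketch (illustrative code, not the authors'):

```python
import numpy as np
from scipy.stats import norm

def train_two_class_classifier(c0, n_iter=10):
    """Fit a 2-Gaussian model to the C0 track of one utterance with EM."""
    # Initialization: frames below the mean C0 -> non-speech, above -> speech.
    labels = (c0 >= c0.mean()).astype(float)          # 0 = non-speech, 1 = speech
    resp = np.stack([1.0 - labels, labels], axis=1)    # hard initial responsibilities

    for _ in range(n_iter):
        # M-step: class priors, means and standard deviations.
        weight = resp.sum(axis=0)
        prior = weight / len(c0)
        mean = (resp * c0[:, None]).sum(axis=0) / weight
        var = (resp * (c0[:, None] - mean) ** 2).sum(axis=0) / weight
        std = np.sqrt(var)

        # E-step: posterior P(class | C0) for every frame.
        lik = norm.pdf(c0[:, None], loc=mean, scale=std)
        resp = prior * lik
        resp /= resp.sum(axis=1, keepdims=True)

    return {'prior': prior, 'mean': mean, 'std': std}
```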

11
Two-class parametric equalization (cont.)
[Figure: the two-Gaussian model for the C0 and C1 cepstral coefficients (used as reference model), along with the histograms of the speech and non-speech frames for a set of clean utterances]
12
Experimental Results
  • The proposed parametric equalization algorithm has been tested on the AURORA4 (WSJ0) database
  • The recognition system used in all cases is based on continuous cross-word triphone models with 3 tied states and a mixture of 6 Gaussians per state
  • The language model is the standard bigram for the WSJ0 task
  • A feature vector of 13 cepstral coefficients is used as the basic parameterization of the speech signal, using C0 instead of the logarithmic energy
  • The baseline reference system (BASE) uses sentence-by-sentence subtraction of the mean values of each cepstral coefficient (CMS); a minimal sketch follows this list
  • The parameters of the reference distribution have been obtained by averaging over the whole clean training set of utterances
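For reference, the CMS baseline mentioned above, sketched under the assumption of a frames-by-coefficients feature matrix per sentence:

```python
import numpy as np

def cepstral_mean_subtraction(features):
    """Subtract the per-sentence mean of each cepstral coefficient.

    features : array of shape (num_frames, num_coefficients).
    """
    return features - features.mean(axis=0, keepdims=True)
```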

13
Experimental Results (cont.)
  • The first row (BASE) corresponds to the baseline system, which is based on a simple CMS linear normalization technique
  • The second row (HEQ) shows the word error rates when using a standard quantile-based implementation of HEQ
  • relative word error reduction of 17.8%
  • The performance of HEQ is clearly improved by PEQ, as shown in the third row, with a relative word error reduction of 30.8%
  • This result is very close to the one obtained for the AFE, which yields a 31.4% reduction of the word error rate
  • Moreover, PEQ outperforms AFE in half of the tests (i.e. test sets 02, 06, 08, 09, 10, 11 and 13)

14
Conclusions and Future Work
  • The transformation is based on a nonlinear interpolation of two independent linear transformations
  • The linear transformations are obtained using a simple Gaussian model for the speech and non-speech classes of features
  • The technique has been evaluated on a complex continuous speech recognition task, showing competitive performance against linear and nonlinear feature equalization techniques such as CMS and HEQ
  • A study of the influence of within-class cross-correlations is currently under development

15
MODEL-BASED WIENER FILTER FOR NOISE ROBUST SPEECH RECOGNITION
Takayuki Arakawa, Masanori Tsujikawa and Ryosuke Isotani
Media and Information Research Laboratories, NEC Corporation, Japan
t-arakawa@cp.jp.nec.com, tujikawa@cb.jp.nec.com, r-isotani@bp.jp.nec.com
16
Introduction
  • Various kinds of background noise exist in the real world
  • Therefore, robustness against various kinds of noise is quite important
  • Several approaches have been proposed to deal with this issue
  • Signal-processing-based spectral enhancement
  • Spectral Subtraction (SS), Wiener Filter (WF)
  • Lower computational cost, but requires extensive tuning depending on the kind of noise and the signal-to-noise ratio (SNR)
  • Statistical-model-based noise adaptation
  • The acoustic model, i.e. a hidden Markov model (HMM), is adapted to the noisy environment
  • It requires a large computational cost to adapt the distributions to a noisy environment

17
Introduction (cont.)
  • Statistical-model-based compensation
  • uses a Gaussian mixture model (GMM)
  • The computational cost is still much higher than that of signal-processing-based spectral enhancement
  • In this paper, they propose the Model-Based Wiener filter (MBW)

[Figure: concept of the MBW approach]
18
Proposed Method (Cont.)
  • A GMM with K Gaussian distributions is used as prior knowledge of clean speech in the cepstrum domain

[Figure: MBW algorithm]
19
Proposed Method (Cont.)
  • The noisy speech signal X(t) is modeled in the spectrum domain as clean speech plus noise, X(t) = S(t) + N(t)
  • Step 1: Perform Spectral Subtraction (SS) in the spectrum domain to obtain a temporary clean speech estimate from the estimated noise, using a flooring parameter to avoid negative spectral values
  • Step 2: Derive the expected value of the clean speech in the cepstrum domain by MMSE estimation over the clean-speech GMM
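A rough sketch of Steps 1 and 2 under simplifying assumptions: power spectra, a precomputed noise estimate, a diagonal-covariance clean-speech GMM in the cepstrum domain, and the MMSE estimate realized as a posterior-weighted combination of mixture means. The paper's exact estimator and domain conversions may differ; all names here are illustrative:

```python
import numpy as np
from scipy.stats import norm

def spectral_subtraction(noisy_spec, noise_est, alpha=0.1):
    """Step 1: temporary clean-speech spectrum, floored by alpha * noisy_spec."""
    return np.maximum(noisy_spec - noise_est, alpha * noisy_spec)

def mmse_clean_cepstrum(temp_cepstrum, gmm):
    """Step 2: expected clean-speech cepstrum under the clean-speech GMM.

    gmm : dict with 'weight' (K,), 'mean' (K, D), 'std' (K, D),
          i.e. K diagonal-covariance Gaussians over D cepstral coefficients.
    """
    # Per-mixture likelihood of the temporary clean cepstrum.
    lik = norm.pdf(temp_cepstrum[None, :], loc=gmm['mean'], scale=gmm['std']).prod(axis=1)
    post = gmm['weight'] * lik
    post /= post.sum()
    # Posterior-weighted combination of the mixture means.
    return post @ gmm['mean']
```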
20
Proposed Method (Cont.)
  • Step 3: Calculate the Wiener gain (a smoothing parameter is applied)
  • Step 4: Get the final estimated clean speech
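A rough sketch of Steps 3 and 4, assuming power spectra and that the smoothing parameter β is applied to the gain across frames (one plausible reading; the paper may apply it differently):

```python
import numpy as np

def wiener_gain(clean_est_spec, noise_est, prev_gain=None, beta=0.98):
    """Step 3: Wiener gain from the estimated clean-speech and noise spectra,
    smoothed over time with beta (placement of beta is an assumption)."""
    gain = clean_est_spec / (clean_est_spec + noise_est)
    if prev_gain is not None:
        gain = beta * prev_gain + (1.0 - beta) * gain
    return gain

def apply_wiener_filter(noisy_spec, gain):
    """Step 4: final estimated clean-speech spectrum."""
    return gain * noisy_spec
```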
21
Experiments and Results
  • Experiment conditions
  • The Mel-frequency cepstral coefficients (MFCC) and their 1st and 2nd derivatives are used as the speech features (including C0)
  • The feature vector for the GMM is composed of the 13-dimensional MFCCs only
  • The flooring parameter α is set to 0.1, and the smoothing parameter β is set to 0.98
  • The MBW method was tested on the Aurora2-J task
  • which contains Japanese utterances of consecutive digit strings recorded in clean environments
  • The other conditions are the same as in Aurora2

22
Experiments and Results (cont.)
  • The performance for different numbers of GMM mixtures (5 dB restaurant noise)
  • At 128 or 256 mixtures, the performance becomes saturated
23
Experiments and Results (cont.)
  • The word accuracy for each SNR is almost equivalent to that of the AFE
24
Experiments and Results (cont.)
  • The word accuracy over the SNRs for each kind of noise
  • These results show that the proposed method is much more robust than the AFE against various kinds of noise
25
Conclusions
  • Review of the MBW algorithm
  • It roughly estimates the clean speech signal using SS
  • and compensates it using a GMM to improve robustness against non-stationary noise
  • The compensated speech signal is used to calculate the Wiener gain
  • and Wiener filtering is then performed
  • The results show that the proposed method performs as well as the ETSI AFE
  • These results demonstrate that the proposed method is robust against various kinds of noise