1
Survey of Robust Techniques
Graduate Institute of Computer Science, National
Taiwan Normal University
  • 2005/3/3
  • Presented by Chen-Wei Liu

2
Outline
  • Spectral Entropy based Feature for Robust ASR
  • ICASSP, 2004
  • IDIAP, Martigny, Switzerland
  • Hemant Misra, Shajith Ikbal, Hervé Bourlard,
    Hynek Hermansky
  • An Energy Normalization Scheme for Improved
    Robustness in Speech Recognition
  • ICSLP, 2004
  • Amirkabir University of Technology, Iran;
    University of Waterloo, Canada
  • S.M. Ahadi, H. Sheikhzadeh, R.L. Brennan,
    G.H. Freeman
  • Robust Speech Recognition with Spectral
    Subtraction in Low SNR
  • ICSLP, 2004
  • Nara Institute of Science and Technology, Japan
  • Randy Gomez, Akinobu Lee, Hiroshi Saruwatari,
    Kiyohiro Shikano

3
Introduction (1/2) Spectral Entropy based
Feature for Robust ASR
  • Most state-of-the-art ASR systems
  • Use cepstral features derived from the short-time
    Fourier transform (STFT) spectrum of the speech
    signal
  • While cepstral features are a fairly good
    representation, they capture the absolute energy
    response of the spectrum
  • Further, we are not sure that all the information
    present in the STFT spectrum is captured by them
  • This paper suggests capturing further
    information from the spectrum by computing its
    entropy

4
Introduction (2/2) Spectral Entropy based
Feature for Robust ASR
  • For voiced sounds, spectra have clear formants
  • Entropies of such spectra will be low
  • On the other hand, spectra of unvoiced sounds are
    flatter
  • Their entropies should be higher
  • Therefore, the entropy of a spectrum can be used
    as an estimate for voicing/un-voicing decisions
  • This paper extends the idea further and introduces
    multi-band/multi-resolution entropy features

5
Spectral Entropy Feature (1/3)
  • Entropy can be used to capture the peakiness of a
    PMF (probability mass function)
  • A PMF with sharp peaks will have low entropy,
    while a PMF with a flat distribution will have
    high entropy
  • In the case of STFT spectra of speech, we observe
    distinct peaks, and the positions of these peaks
    depend on the phoneme under consideration
  • These formants are the ones that characterize a
    sound

6
Spectral Entropy Feature (2/3)
  • The problem with computing the entropy of a
    spectrum is that a spectrum is not a PMF
  • So we convert the spectrum into a PMF
  • For each frame, the entropy is computed as shown
    below
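The slide's equation is an image that did not survive extraction; the following is a standard reconstruction from the surrounding description (normalize the spectral bins into a PMF, then take its entropy), where s_i is the i-th spectral bin, N the number of bins, and the logarithm base is an assumption:

```latex
p_i = \frac{s_i}{\sum_{j=1}^{N} s_j}, \qquad
H = -\sum_{i=1}^{N} p_i \log p_i
```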

7
Spectral Entropy Feature (3/3)
  • The formants are the parts of the spectrum least
    affected by noise, compared to the other parts

8
Multi-band/Multi-resolution Entropy (1/3)
  • The entropy of the full-band spectrum is not a
    strong feature on its own
  • We want to capture the formants of the spectrum
    as well as their locations
  • Hence the so-called multi-band entropy features
  • Divide the full-band spectrum into J
    non-overlapping sub-bands of equal size
  • Entropy is computed for each sub-band, so we
    obtain one entropy value per sub-band

9
Multi-band/Multi-resolution Entropy (2/3)
  • These sub-band entropy values indicate the
    presence or absence of formants in that sub-band
  • When J = 1, we obtain one entropy value
  • In this experiment, J is varied from 1 to 5
  • All the entropy values obtained by varying J
    were appended to form a 15 (= 1+2+3+4+5)
    dimensional entropy feature vector, as sketched
    below
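A minimal sketch of the multi-band/multi-resolution entropy computation for one frame, assuming magnitude spectra and natural-log entropy; np.array_split approximates the equal-size split when the bin count is not divisible by J:

```python
import numpy as np

def multiband_entropy(spectrum, max_bands=5, eps=1e-12):
    """For each J in 1..max_bands, split the spectrum into J equal,
    non-overlapping sub-bands, normalize each sub-band into a PMF,
    and compute its entropy.  Concatenating all values yields a
    1+2+...+max_bands = 15-dimensional vector for max_bands = 5."""
    feats = []
    for j in range(1, max_bands + 1):
        for band in np.array_split(spectrum, j):
            pmf = band / (band.sum() + eps)                 # sub-band -> PMF
            feats.append(-np.sum(pmf * np.log(pmf + eps)))  # sub-band entropy
    return np.array(feats)

# Example: a single 129-bin magnitude-spectrum frame
frame = np.abs(np.fft.rfft(np.random.randn(256)))
print(multiband_entropy(frame).shape)  # -> (15,)
```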

10
Multi-band/Multi-resolution Entropy (3/3)

11
Experimental Setup
  • Numbers95 database
  • US English connected-digit telephone speech
  • 30 words in the database, represented by 27
    phonemes
  • Training is performed on clean speech
  • Testing data are corrupted by factory noise from
    the Noisex92 database, added at different SNRs to
    the Numbers95 database
  • 3,330 utterances for training; 1,143 utterances
    for testing
  • PLP features are used

12
Experimental Results (1/2)
  • WERs on clean speech for multi-band entropy
    features alone

13
Experimental Results (2/2)

14
Conclusion
  • Good improvement in performance is obtained
    when the multi-band entropy feature is appended
    to the usual PLP cepstral features
  • Especially in the case of noise
  • The new feature seems to be quite robust to noise
  • The robustness can be attributed to the fact that
    the multi-band entropy feature tries to capture
    the locations of the formants, and formants are
    less affected by noise

15
Introduction (1/4) An Energy Normalization
Scheme for Improved Robustness in Speech
Recognition
  • It is widely accepted that the energy of the
    speech signal contains important information
  • Regarding the phonetic content of speech
  • The energy of voiced sounds and vowels is usually
    higher than that of unvoiced sounds
  • Such as unvoiced fricatives or unvoiced plosives
  • However, the signal energy can change dramatically
    according to the conditions

16
Introduction (2/4) An Energy Normalization
Scheme for Improved Robustness in Speech
Recognition
  • The same sentence uttered in different ways (e.g.
    whispered or shouted) may have energies differing
    by orders of magnitude
  • Meanwhile, the speech recognizer may face
    energy-mismatch conditions
  • A different speaker or emotion, the environment,
    etc.
  • Therefore, direct use of frame energy is not
    considered to be helpful under realistic
    conditions

17
Introduction (3/4) An Energy Normalization
Scheme for Improved Robustness in Speech
Recognition
  • Accordingly, a log-energy parameter is usually
    used in speech recognition systems
  • The logarithm makes the parameter less vulnerable
    to abrupt changes in the energy level by applying
    a large compression factor
  • Traditional energy normalization techniques
  • Subtract the maximum utterance (log) energy from
    each frame's energy
  • To bring the maximum energy level close to zero
    throughout the utterance

18
Introduction (4/4) An Energy Normalization
Scheme for Improved Robustness in Speech
Recognition
  • However, this normalization is not very useful
    for speech recognition in noisy environments
  • This paper proposes another energy normalization
    approach
  • Direct use of the raw energy parameter
  • With a normalization similar to that used for
    cepstral parameters

19
Energy Normalization (1/2)
  • A rather standard energy normalization technique
    is as follows
  • Another energy normalization approach can be
    inspired by CN, the normalization traditionally
    applied to cepstral parameters

20
Energy Normalization (2/2)
  • The approach is as follows; a sketch of both
    schemes is given below
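The equations on slides 19 and 20 are images that did not survive extraction; a minimal sketch of the two schemes as described in the text, assuming max-subtraction for the log energy and CN-style mean/variance normalization for the energy parameter:

```python
import numpy as np

def max_normalize_log_energy(log_e):
    """Traditional scheme: subtract the maximum utterance log energy
    from each frame so the peak sits close to zero."""
    return log_e - np.max(log_e)

def cn_normalize_energy(e, eps=1e-12):
    """CN-inspired scheme: normalize the energy parameter to zero
    mean and unit variance over the utterance, as is traditionally
    done for cepstral coefficients."""
    return (e - np.mean(e)) / (np.std(e) + eps)
```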

21
Direct use of Energy (1/2)
  • Additive noise
  • Simply increases the total signal energy and its
    long-time average energy
  • Hence, it can be partly compensated for by
    subtracting the mean of the energy parameter
  • For the log energy, however, subtracting the
    long-time average is not as meaningful
  • Since the log operator is applied after the
    addition of signal and noise (see below)
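In symbols (my notation, not the paper's): for roughly uncorrelated speech and noise the frame energies add linearly, so mean subtraction removes the additive term, whereas the logarithm does not distribute over the sum:

```latex
E_y \approx E_s + E_n
\;\Rightarrow\;
E_y - \overline{E}_n \approx E_s,
\qquad\text{but}\qquad
\log E_y = \log(E_s + E_n) \neq \log E_s + \log E_n .
```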

22
Direct use of Energy (2/2)

23
Experimental Setup
  • Aurora 2 was used for the evaluations
  • A connected-digit speech corpus
  • Test sets A, B, C
  • 8 different kinds of additive noise
  • 6 different SNRs

24
Experimental Evaluations (1/3)

25
Experimental Evaluations (2/3)

26
Experimental Evaluations (3/3)

27
Conclusions
  • This paper introduces a scheme which allows
    direct use of the frame energy parameter
  • It performs well in the presence of additive noise
  • Its combination with traditional CN has led to
    error rate improvements of up to 55% on the
    Aurora 2 task

28
Introduction Robust Speech Recognition with
Spectral Subtraction in Low SNR
  • It is practical to investigate the effects of the
    over-subtraction parameter on the recognition
    performance of the speech recognizer
  • SS as used in robust speech recognition addresses
    the suppression of noise
  • But we do not know to what extent noise has to be
    suppressed and how much distortion is allowed
  • SS is usually implemented merely as a speech
    enhancement technique
  • This paper uses NRR (Noise Reduction Rate) and
    MelCD (Mel Cepstrum Distortion) to study the
    effects of suppressing noise using SS
  • To tailor-fit the noise suppression to the
    recognizer (see the sketch of SS below)
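For reference, a minimal sketch of power-domain spectral subtraction with an over-subtraction factor alpha and a spectral floor beta; the flooring rule and default values are assumptions, not the paper's exact formulation (the constant alpha = 2 appears only as the conventional baseline in its comparison):

```python
import numpy as np

def spectral_subtraction(noisy_power, noise_power, alpha=2.0, beta=0.01):
    """Subtract an over-estimated noise power spectrum from the noisy
    power spectrum, flooring the result to avoid negative power.

    noisy_power : |Y(f)|^2 of one frame
    noise_power : estimated noise power spectrum |N(f)|^2
    alpha       : over-subtraction factor (larger -> more suppression,
                  but also more distortion)
    beta        : spectral floor factor
    """
    cleaned = noisy_power - alpha * noise_power
    return np.maximum(cleaned, beta * noisy_power)
```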

29
NRR and MelCD (1/4)
  • HMM-based speech recognition is basically an
    issue of how acoustically similar the test data
    are to the trained acoustic model
  • The degree of mismatch is very crucial
  • It can be indirectly characterized by NRR and
    the degree of distortion (MelCD)

30
NRR and MelCD (2/4)
  • At 25 dB SNR there exists a correlation between
    the improvement in NRR due to SS and the
    recognition accuracy

31
NRR and MelCD (3/4)
  • On the other hand, under low SNR the correlation
    no longer holds
  • In fact, there is no longer a correlation between
    maximum NRR and maximum accuracy at very low SNR
  • In short, neither the NRR due to SS nor the MelCD
    can give us a hint of what the recognition
    performance might be under low-SNR conditions

32
NRR and MelCD (4/4)

33
System Implementation (1/3)
  • How can SS be made more effective in the front
    end of speech recognition?
  • By optimizing the over-subtraction parameter α of
    traditional SS
  • Optimization steps (see the sketch below)
  • Select some utterances from the training data and
    superimpose different types of noise at various
    SNR conditions
  • Obtain a matched α
  • Obtain a generalized α
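A hedged sketch of what the matched-α search could look like; the scoring callable is hypothetical (run SS with the candidate α on the selected noisy utterances, decode, and return word accuracy), and the candidate grid and any rule for deriving a generalized α from the matched ones are assumptions, since the slide only outlines the procedure:

```python
def find_matched_alpha(score_fn,
                       alphas=(0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0)):
    """Grid-search the over-subtraction factor that maximizes
    recognition accuracy for one noise type / SNR condition.

    score_fn : callable mapping a candidate alpha to word accuracy
               (hypothetical stand-in for SS + HMM decoding).
    """
    return max(alphas, key=score_fn)

# Usage (hypothetical): best = find_matched_alpha(my_decoder_accuracy)
```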

34
System Implementation (2/3)

35
System Implementation (3/3)

36
Experimental Results
  • Test set
  • 200 sentences from 46 speakers
  • 8 noise types
  • Original: car, office, booth, crowd
  • Extended: mall, poster, par, train
  • The single acoustic model
  • Trained by superimposing 25 dB office noise on
    the JNAS database
  • Matched model
  • The noise superimposed on the training set was
    the same as that on the testing set

37
Experimental Results

38
Experimental Results

39
Experimental Results

40
Conclusion
  • It is shown that conventional SS is not effective
    under low-SNR conditions
  • This paper tailor-fits SS to optimize recognition
    performance
  • By deriving the optimal α from the training data,
    which is directly related to the HMM models
  • 26.0% and 7.6% relative improvements for the
    proposed matched and generalized α, as compared
    with the conventional approach's constant α = 2

41
Experiment Approach

[Diagram: training data → HEQ table(s), either a uni-table built
from all training data or a label-based multi-table → HTK training
(151) with HEQ / Multi-HEQ processed data; testing data → HEQ,
Multi-HEQ, or Dynamic HEQ → HTK recognition against the uni-table
model or the multi-table multi-model (3).]