Title: Survey of Robust Techniques, Graduate Institute of Computer Science, National Taiwan Normal University
1. Survey of Robust Techniques
Graduate Institute of Computer Science, National Taiwan Normal University
- 2005/3/3
- Presented by Chen-Wei Liu
2. Outline
- Spectral Entropy based Feature for Robust ASR
  - ICASSP, 2004
  - IDIAP, Martigny, Switzerland
  - Hemant Misra, Shajith Ikbal, Hervé Bourlard, Hynek Hermansky
- An Energy Normalization Scheme for Improved Robustness in Speech Recognition
  - ICSLP, 2004
  - Amirkabir University of Technology, Iran; University of Waterloo, Canada
  - S.M. Ahadi, H. Sheikhzadeh, R.L. Brennan, G.H. Freeman
- Robust Speech Recognition with Spectral Subtraction in Low SNR
  - ICSLP, 2004
  - Nara Institute of Science and Technology, Japan
  - Randy Gomez, Akinobu Lee, Hiroshi Saruwatari, Kiyohiro Shikano
3. Introduction (1/2): Spectral Entropy based Feature for Robust ASR
- Most state-of-the-art ASR systems use cepstral features derived from the short-time Fourier transform (STFT) spectrum of the speech signal
- While cepstral features are a fairly good representation, they capture the absolute energy response of the spectrum
- Further, we are not sure that all the information present in the STFT spectrum is captured by them
- This paper suggests capturing further information from the spectrum by computing its entropy
4. Introduction (2/2): Spectral Entropy based Feature for Robust ASR
- For voiced sounds, spectra have clear formants, so the entropies of such spectra will be low
- On the other hand, spectra of unvoiced sounds are flatter, so their entropies should be higher
- Therefore, the entropy of a spectrum can be used as an estimate for a voiced/unvoiced decision
- This paper extends the idea further and introduces multi-band/multi-resolution entropy features
5. Spectral Entropy Feature (1/3)
- Entropy can be used to capture the peakiness of a PMF (probability mass function)
- A PMF with sharp peaks will have low entropy, while a PMF with a flat distribution will have high entropy
- In the case of STFT spectra of speech, we observe distinct peaks, and the positions of these peaks depend on the phoneme under consideration
- These formants are the ones that characterize a sound
6. Spectral Entropy Feature (2/3)
- The problem with computing the entropy of a spectrum is that a spectrum is not a PMF, so we convert the spectrum into a PMF
- For each frame, the entropy was then computed from the resulting PMF
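The normalization and entropy computation can be sketched as follows (a minimal NumPy sketch; the FFT size, the use of the magnitude rather than the power spectrum, and the log base are assumptions, not taken from the paper):

```python
import numpy as np

def spectral_entropy(frame, n_fft=256):
    """Entropy of one STFT frame, with the spectrum treated as a PMF."""
    spectrum = np.abs(np.fft.rfft(frame, n_fft))
    pmf = spectrum / np.sum(spectrum)      # normalize so the bins sum to 1
    pmf = pmf[pmf > 0]                     # drop zero bins to avoid log(0)
    return -np.sum(pmf * np.log2(pmf))     # H = -sum_i p_i log p_i
```

A peaked (voiced-like) spectrum yields a low value, while a flat (unvoiced-like) spectrum yields a high value, which is exactly the voicing cue the slides describe.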
7. Spectral Entropy Feature (3/3)
- The formants are the parts of the spectrum least affected by noise, as compared to the other parts of the spectrum
8. Multi-band/Multi-resolution Entropy (1/3)
- The full-band spectral entropy is not a strong feature on its own
- If we want to capture the formants of the spectrum as well as their locations, we use the so-called multi-band entropy features
- Divide the full-band spectrum into J non-overlapping sub-bands of equal size
- Entropy is computed for each sub-band, yielding one entropy value per sub-band
9. Multi-band/Multi-resolution Entropy (2/3)
- These sub-band entropy values indicate the presence or absence of formants in each sub-band
- When J = 1, we obtain one entropy value
- In these experiments, J is varied from 1 to 5
- All the entropy values obtained by varying J were appended to form a 15-dimensional (1+2+3+4+5) entropy feature vector
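The 15-dimensional feature construction above can be sketched as follows (assuming, as one reading of the slides, that each sub-band is normalized to its own PMF before its entropy is computed):

```python
import numpy as np

def multiband_entropy(spectrum, max_bands=5):
    """Concatenate sub-band entropies for J = 1..max_bands non-overlapping,
    (near-)equal-size sub-bands: 1+2+3+4+5 = 15 values for max_bands=5."""
    feats = []
    for J in range(1, max_bands + 1):
        for band in np.array_split(spectrum, J):
            pmf = band / band.sum()          # each sub-band as its own PMF
            pmf = pmf[pmf > 0]
            feats.append(-np.sum(pmf * np.log2(pmf)))
    return np.array(feats)
```

For J = 1 this reduces to the full-band entropy; larger J localizes the entropy estimate, so the resulting vector encodes roughly where in the spectrum the formant peaks sit.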
10. Multi-band/Multi-resolution Entropy (3/3)
11. Experimental Setup
- Numbers95 database: US English connected-digit telephone speech
- 30 words in the database, represented by 27 phonemes
- Training is performed on clean speech
- Testing data are corrupted by factory noise from the Noisex92 database, added to the Numbers95 data at different SNRs
- 3330 utterances for training, 1143 utterances for testing
- PLP features are used
12. Experimental Results (1/2)
- WERs on clean speech for the multi-band entropy features alone
13. Experimental Results (2/2)
14. Conclusion
- A good improvement in performance is obtained when the multi-band entropy feature is appended to the usual PLP cepstral features, especially in the presence of noise
- The new feature seems to be quite robust to noise
- The robustness can be attributed to the fact that the multi-band entropy feature tries to capture the locations of the formants, and formants are less affected by noise
15. Introduction (1/4): An Energy Normalization Scheme for Improved Robustness in Speech Recognition
- It is widely accepted that the energy of the speech signal contains important information regarding the phonetic content of speech
- The energy of voiced sounds and vowels is usually higher than that of unvoiced sounds, such as unvoiced fricatives or unvoiced plosives
- However, the signal energy can change dramatically according to the conditions
16. Introduction (2/4): An Energy Normalization Scheme for Improved Robustness in Speech Recognition
- The same sentence uttered in different ways (e.g. whispered or shouted) may feature energies differing by orders of magnitude
- Meanwhile, the speech recognizer may face energy-mismatch conditions: a different speaker or emotion, a different environment, etc.
- Therefore, direct use of the frame energy is not considered helpful under realistic conditions
17. Introduction (3/4): An Energy Normalization Scheme for Improved Robustness in Speech Recognition
- Accordingly, a log-energy parameter is usually used in speech recognition systems
- The logarithm makes the parameter less vulnerable to abrupt changes in the energy level by applying a large compression factor
- Traditional energy normalization techniques subtract the maximum utterance (log) energy from each frame's energy, to provide a maximum energy level close to zero throughout the utterance
18. Introduction (4/4): An Energy Normalization Scheme for Improved Robustness in Speech Recognition
- However, this normalization is not very useful for speech recognition in noisy environments
- This paper proposes another energy normalization approach: direct use of the raw energy parameter, with a normalization similar to that used for cepstral parameters
19. Energy Normalization (1/2)
- A rather standard energy normalization technique is as follows
- Another energy normalization approach could be inspired by the normalization (CN) used for traditional cepstral parameters
20. Energy Normalization (2/2)
- The approach is as follows
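The two normalization schemes discussed on these slides can be sketched as follows (a minimal sketch; whether the CN-style scheme also scales by the standard deviation, as cepstral mean-and-variance normalization does, is an assumption here):

```python
import numpy as np

def max_normalize(log_energy):
    """Standard scheme: subtract the utterance maximum so that the
    peak log-energy is close to zero throughout the utterance."""
    return log_energy - np.max(log_energy)

def cn_style_normalize(energy):
    """CN-inspired scheme applied to the raw energy track: remove the
    utterance mean and (assumed here) scale by the standard deviation,
    as is done for cepstral coefficients."""
    return (energy - np.mean(energy)) / (np.std(energy) + 1e-9)
```

The first operates on the compressed log-energy; the second operates directly on the raw energy parameter, which is the point of the proposed approach.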
21. Direct Use of Energy (1/2)
- Additive noise simply increases the total signal energy and its long-time average energy; hence, it can be partly compensated for by subtracting the mean of the energy parameter
- For the log energy, however, subtracting the long-time average is not as meaningful, since the log operator is applied after the addition of signal and noise
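This point can be demonstrated numerically (hypothetical energy values, assuming perfectly stationary noise that adds a constant energy offset per frame):

```python
import numpy as np

rng = np.random.default_rng(0)
speech_energy = rng.uniform(1.0, 10.0, 1000)   # per-frame clean energies
noise_energy = 4.0                              # constant additive offset

noisy = speech_energy + noise_energy

# Linear domain: mean subtraction cancels the constant offset exactly,
# so clean and noisy normalized tracks coincide.
clean_ms = speech_energy - speech_energy.mean()
noisy_ms = noisy - noisy.mean()

# Log domain: log(x + n) != log(x) + const, so the offset is not removed
# by subtracting the long-time average of the log energy.
clean_log = np.log(speech_energy) - np.log(speech_energy).mean()
noisy_log = np.log(noisy) - np.log(noisy).mean()
```

Here `clean_ms` and `noisy_ms` match to machine precision, while `clean_log` and `noisy_log` differ substantially, which is the mismatch the paper's direct-energy scheme avoids.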
22. Direct Use of Energy (2/2)
23. Experimental Setup
- Aurora 2 was used for the evaluations: a connected-digit speech corpus
- Test sets A, B, C
- 8 different kinds of additive noise
- Six different SNRs
24. Experimental Evaluations (1/3)
25. Experimental Evaluations (2/3)
26. Experimental Evaluations (3/3)
27. Conclusions
- This paper introduces a scheme that allows direct use of the frame-energy parameter and performs well in the presence of additive noise
- Its combination with traditional CN has led to error-rate improvements of up to 55% on the Aurora 2 task
28. Introduction: Robust Speech Recognition with Spectral Subtraction in Low SNR
- It is practical to investigate the effects of the over-subtraction parameter on the recognition performance of the speech recognizer
- However, SS (spectral subtraction) as used in robust speech recognition addresses the suppression of noise, but we do not know to what extent the noise has to be suppressed and how much distortion is allowed
- SS is implemented solely as a mere speech-enhancement technique
- This paper uses NRR (Noise Reduction Rate) and MelCD (Mel Cepstrum Distortion) to study the effects of suppressing noise using SS, in order to tailor-fit the noise suppression to the recognizer
29. NRR and MelCD (1/4)
- HMM-based speech recognition is basically an issue of how acoustically similar the test data are to the trained acoustic model
- The degree of mismatch is very crucial
- It can be indirectly characterized by the NRR and the degree of distortion (MelCD)
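For reference, these two measures are commonly defined as follows; these are the standard textbook formulations, and the paper's exact definitions may differ slightly. NRR is the gain in SNR achieved by the enhancement, and MelCD is the Euclidean distance between the mel-cepstral vectors of the clean and processed speech:

```latex
\mathrm{NRR} = \mathrm{SNR}_{\mathrm{out}} - \mathrm{SNR}_{\mathrm{in}} \;\;[\mathrm{dB}],
\qquad
\mathrm{MelCD} = \frac{10}{\ln 10}\,\sqrt{2\sum_{d=1}^{D}\bigl(mc_d - \widehat{mc}_d\bigr)^2}\;\;[\mathrm{dB}]
```

Here $mc_d$ and $\widehat{mc}_d$ denote the $d$-th mel-cepstral coefficients of the reference and processed signals: a large NRR means more noise removed, while a large MelCD means more spectral distortion introduced.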
30. NRR and MelCD (2/4)
- At 25 dB SNR there exists a correlation between the improvement in NRR due to SS and the recognition accuracy
31. NRR and MelCD (3/4)
- On the other hand, under low SNR the correlation no longer holds
- In fact, there is no longer any correlation between maximum NRR and maximum accuracy at very low SNR
- In short, neither the NRR due to SS nor the MelCD can give us a hint of what the recognition performance might be under low-SNR conditions
32. NRR and MelCD (4/4)
33. System Implementation (1/3)
- How can SS be made more effective in the front end of speech recognition? By optimizing the over-subtraction parameter
- Traditional SS
- Optimization steps:
  - Select some utterances from the training data and superimpose different types of noise at various SNR conditions
  - Obtain the matched over-subtraction parameter
  - Obtain the generalized over-subtraction parameter
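The conventional SS baseline that these steps optimize can be sketched as follows (alpha is the over-subtraction factor, whose conventional constant setting of 2 the conclusion compares against; the spectral floor beta is a common textbook choice assumed here, not taken from the paper):

```python
import numpy as np

def spectral_subtract(noisy_mag, noise_mag, alpha=2.0, beta=0.01):
    """Conventional magnitude spectral subtraction.

    alpha: over-subtraction factor (larger -> more noise removed,
           but more distortion, which is the trade-off the paper tunes).
    beta:  spectral floor that keeps bins from going negative.
    """
    subtracted = noisy_mag - alpha * noise_mag   # over-subtract the noise estimate
    return np.maximum(subtracted, beta * noisy_mag)
```

The paper's proposal amounts to choosing alpha not for maximum NRR or minimum MelCD, but for maximum recognition accuracy against the HMMs, either per noise condition (matched) or across conditions (generalized).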
34. System Implementation (2/3)
35. System Implementation (3/3)
36. Experimental Results
- Test set: 200 sentences from 46 speakers
- 8 noise types
  - Original: car, office, booth, crowd
  - Extended: mall, poster, par, train
- The single acoustic model: created by superimposing 25 dB office noise on the JNAS database
- Matched model: the noise superimposed on the training set was the same as that on the testing set
37. Experimental Results
38. Experimental Results
39. Experimental Results
40. Conclusion
- It is shown that conventional SS is not effective under low-SNR conditions
- This paper tailor-fits SS to optimize the recognition performance, by deriving the optimal over-subtraction parameter from the training data, which is directly related to the HMM models
- 26.0% and 7.6% relative improvements for the proposed matched and generalized parameters, respectively, compared with the conventional approach's constant factor of 2
41. Experiment Approach
- (Flow diagram in the original slide; recoverable labels only)
- Training side: all training data → HEQ table (Uni-Table); labeled training data → Multi-Table; HTK (151); Multi-HEQ; Multi-Model (3)
- Testing side: HEQ (Uni-Table) → HTK recognition; Multi-HEQ (Multi-Model, Multi-Table) → HTK recognition
- Dynamic HEQ