Title: Survey of Robust Techniques, Graduate Institute of Computer Science, National Taiwan Normal University
1. Survey of Robust Techniques
Graduate Institute of Computer Science, National Taiwan Normal University
- 2005/3/3
- Presented by Chen-Wei Liu
2. Outline
- Spectral Entropy based Feature for Robust ASR
  - ICASSP, 2004
  - IDIAP, Martigny, Switzerland
  - Hemant Misra, Shajith Ikbal, Hervé Bourlard, Hynek Hermansky
- An Energy Normalization Scheme for Improved Robustness in Speech Recognition
  - ICSLP, 2004
  - Amirkabir University of Technology, Iran; University of Waterloo, Canada
  - S.M. Ahadi, H. Sheikhzadeh, R.L. Brennan, G.H. Freeman
- Robust Speech Recognition with Spectral Subtraction in Low SNR
  - ICSLP, 2004
  - Nara Institute of Science and Technology, Japan
  - Randy Gomez, Akinobu Lee, Hiroshi Saruwatari, Kiyohiro Shikano
3. Introduction (1/2): Spectral Entropy based Feature for Robust ASR
- Most state-of-the-art ASR systems use cepstral features derived from the short-time Fourier transform (STFT) spectrum of the speech signal
- While cepstral features are a fairly good representation, they capture the absolute energy response of the spectrum
- Further, we are not sure that all the information present in the STFT spectrum is captured by them
- This paper suggests capturing further information from the spectrum by computing its entropy
4. Introduction (2/2): Spectral Entropy based Feature for Robust ASR
- For voiced sounds, spectra have clear formants, so the entropies of such spectra will be low
- On the other hand, spectra of unvoiced sounds are flatter, so their entropies should be higher
- Therefore, the entropy of a spectrum can be used as an estimate for a voiced/unvoiced decision
- This paper extends the idea further and introduces multi-band/multi-resolution entropy features
5. Spectral Entropy Feature (1/3)
- Entropy can be used to capture the peakiness of a PMF (probability mass function)
- A PMF with sharp peaks will have low entropy, while a PMF with a flat distribution will have high entropy
- In the case of STFT spectra of speech, we observe distinct peaks, and the positions of these peaks depend on the phoneme under consideration
- These formants are the ones that characterize a sound
6. Spectral Entropy Feature (2/3)
- The problem with computing the entropy of a spectrum is that a spectrum is not a PMF, so we convert the spectrum into a PMF
- For each frame, the entropy was then computed from the resulting PMF
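The normalization and entropy computation can be sketched as follows (a minimal NumPy sketch; the FFT size, the use of the magnitude rather than the power spectrum, and the log base are assumptions, not taken from the paper):

```python
import numpy as np

def spectral_entropy(frame, n_fft=256):
    """Entropy of one STFT frame, with the spectrum treated as a PMF."""
    spectrum = np.abs(np.fft.rfft(frame, n_fft))
    pmf = spectrum / np.sum(spectrum)      # normalize so the bins sum to 1
    pmf = pmf[pmf > 0]                     # drop zero bins to avoid log(0)
    return -np.sum(pmf * np.log2(pmf))     # H = -sum_i p_i log p_i
```

A peaked (voiced-like) spectrum yields a low value, while a flat (unvoiced-like) spectrum yields a high value, which is exactly the voicing cue the slides describe.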
7. Spectral Entropy Feature (3/3)
- The formants are the parts of the spectrum least affected by noise, as compared to the other parts of the spectrum
8. Multi-band/Multi-resolution Entropy (1/3)
- The full-band spectral entropy is not a strong feature on its own
- If we want to capture the formants of the spectrum as well as their locations, we use the so-called multi-band entropy features
- Divide the full-band spectrum into J non-overlapping sub-bands of equal size
- Entropy is computed for each sub-band, yielding one entropy value per sub-band
9. Multi-band/Multi-resolution Entropy (2/3)
- These sub-band entropy values indicate the presence or absence of formants in each sub-band
- When J = 1, we obtain one entropy value
- In these experiments, J is varied from 1 to 5
- All the entropy values obtained by varying J were appended to form a 15-dimensional (1+2+3+4+5) entropy feature vector
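The 15-dimensional feature construction above can be sketched as follows (assuming, as one reading of the slides, that each sub-band is normalized to its own PMF before its entropy is computed):

```python
import numpy as np

def multiband_entropy(spectrum, max_bands=5):
    """Concatenate sub-band entropies for J = 1..max_bands non-overlapping,
    (near-)equal-size sub-bands: 1+2+3+4+5 = 15 values for max_bands=5."""
    feats = []
    for J in range(1, max_bands + 1):
        for band in np.array_split(spectrum, J):
            pmf = band / band.sum()          # each sub-band as its own PMF
            pmf = pmf[pmf > 0]
            feats.append(-np.sum(pmf * np.log2(pmf)))
    return np.array(feats)
```

For J = 1 this reduces to the full-band entropy; larger J localizes the entropy estimate, so the resulting vector encodes roughly where in the spectrum the formant peaks sit.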
10. Multi-band/Multi-resolution Entropy (3/3)
11. Experimental Setup
- Numbers95 database: US English connected-digit telephone speech
- 30 words in the database, represented by 27 phonemes
- Training is performed on clean speech
- Testing data are corrupted by factory noise from the Noisex92 database, added to the Numbers95 data at different SNRs
- 3330 utterances for training, 1143 utterances for testing
- PLP features are used
12. Experimental Results (1/2)
- WERs on clean speech for the multi-band entropy features alone
13. Experimental Results (2/2)
14. Conclusion
- A good improvement in performance is obtained when the multi-band entropy feature is appended to the usual PLP cepstral features, especially in the presence of noise
- The new feature seems to be quite robust to noise
- The robustness can be attributed to the fact that the multi-band entropy feature tries to capture the locations of the formants, and formants are less affected by noise
15. Introduction (1/4): An Energy Normalization Scheme for Improved Robustness in Speech Recognition
- It is widely accepted that the energy of the speech signal contains important information regarding the phonetic content of speech
- The energy of voiced sounds and vowels is usually higher than that of unvoiced sounds, such as unvoiced fricatives or unvoiced plosives
- However, the signal energy can change dramatically according to the conditions
16. Introduction (2/4): An Energy Normalization Scheme for Improved Robustness in Speech Recognition
- The same sentence uttered in different ways (e.g. whispered or shouted) may feature energies differing by orders of magnitude
- Meanwhile, the speech recognizer may face energy-mismatch conditions: a different speaker or emotion, a different environment, etc.
- Therefore, direct use of the frame energy is not considered helpful under realistic conditions
17. Introduction (3/4): An Energy Normalization Scheme for Improved Robustness in Speech Recognition
- Accordingly, a log-energy parameter is usually used in speech recognition systems
- The logarithm makes the parameter less vulnerable to abrupt changes in the energy level by applying a large compression factor
- Traditional energy normalization techniques subtract the maximum utterance (log) energy from each frame's energy, to provide a maximum energy level close to zero throughout the utterance
18. Introduction (4/4): An Energy Normalization Scheme for Improved Robustness in Speech Recognition
- However, this normalization is not very useful for speech recognition in noisy environments
- This paper proposes another energy normalization approach: direct use of the raw energy parameter, with a normalization similar to that used for cepstral parameters
19. Energy Normalization (1/2)
- A rather standard energy normalization technique is as follows
- Another energy normalization approach could be inspired by the normalization (CN) used for traditional cepstral parameters
20. Energy Normalization (2/2)
- The approach is as follows
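The two normalization schemes discussed on these slides can be sketched as follows (a minimal sketch; whether the CN-style scheme also scales by the standard deviation, as cepstral mean-and-variance normalization does, is an assumption here):

```python
import numpy as np

def max_normalize(log_energy):
    """Standard scheme: subtract the utterance maximum so that the
    peak log-energy is close to zero throughout the utterance."""
    return log_energy - np.max(log_energy)

def cn_style_normalize(energy):
    """CN-inspired scheme applied to the raw energy track: remove the
    utterance mean and (assumed here) scale by the standard deviation,
    as is done for cepstral coefficients."""
    return (energy - np.mean(energy)) / (np.std(energy) + 1e-9)
```

The first operates on the compressed log-energy; the second operates directly on the raw energy parameter, which is the point of the proposed approach.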
21. Direct Use of Energy (1/2)
- Additive noise simply increases the total signal energy and its long-time average energy; hence, it can be partly compensated for by subtracting the mean of the energy parameter
- For the log energy, however, subtracting the long-time average is not as meaningful, since the log operator is applied after the addition of signal and noise
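This point can be demonstrated numerically (hypothetical energy values, assuming perfectly stationary noise that adds a constant energy offset per frame):

```python
import numpy as np

rng = np.random.default_rng(0)
speech_energy = rng.uniform(1.0, 10.0, 1000)   # per-frame clean energies
noise_energy = 4.0                              # constant additive offset

noisy = speech_energy + noise_energy

# Linear domain: mean subtraction cancels the constant offset exactly,
# so clean and noisy normalized tracks coincide.
clean_ms = speech_energy - speech_energy.mean()
noisy_ms = noisy - noisy.mean()

# Log domain: log(x + n) != log(x) + const, so the offset is not removed
# by subtracting the long-time average of the log energy.
clean_log = np.log(speech_energy) - np.log(speech_energy).mean()
noisy_log = np.log(noisy) - np.log(noisy).mean()
```

Here `clean_ms` and `noisy_ms` match to machine precision, while `clean_log` and `noisy_log` differ substantially, which is the mismatch the paper's direct-energy scheme avoids.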
22. Direct Use of Energy (2/2)
23. Experimental Setup
- Aurora 2 was used for the evaluations: a connected-digit speech corpus
- Test sets A, B, C
- 8 different kinds of additive noise
- Six different SNRs
24. Experimental Evaluations (1/3)
25. Experimental Evaluations (2/3)
26. Experimental Evaluations (3/3)
27. Conclusions
- This paper introduces a scheme that allows direct use of the frame-energy parameter and performs well in the presence of additive noise
- Its combination with traditional CN has led to error-rate improvements of up to 55% on the Aurora 2 task
28. Introduction: Robust Speech Recognition with Spectral Subtraction in Low SNR
- It is practical to investigate the effects of the over-subtraction parameter on the recognition performance of the speech recognizer
- However, SS (spectral subtraction) as used in robust speech recognition addresses the suppression of noise, but we do not know to what extent the noise has to be suppressed and how much distortion is allowed
- SS is implemented solely as a mere speech-enhancement technique
- This paper uses NRR (Noise Reduction Rate) and MelCD (Mel Cepstrum Distortion) to study the effects of suppressing noise using SS, in order to tailor-fit the noise suppression to the recognizer
29. NRR and MelCD (1/4)
- HMM-based speech recognition is basically an issue of how acoustically similar the test data are to the trained acoustic model
- The degree of mismatch is very crucial
- It can be indirectly characterized by the NRR and the degree of distortion (MelCD)
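For reference, these two measures are commonly defined as follows; these are the standard textbook formulations, and the paper's exact definitions may differ slightly. NRR is the gain in SNR achieved by the enhancement, and MelCD is the Euclidean distance between the mel-cepstral vectors of the clean and processed speech:

```latex
\mathrm{NRR} = \mathrm{SNR}_{\mathrm{out}} - \mathrm{SNR}_{\mathrm{in}} \;\;[\mathrm{dB}],
\qquad
\mathrm{MelCD} = \frac{10}{\ln 10}\,\sqrt{2\sum_{d=1}^{D}\bigl(mc_d - \widehat{mc}_d\bigr)^2}\;\;[\mathrm{dB}]
```

Here $mc_d$ and $\widehat{mc}_d$ denote the $d$-th mel-cepstral coefficients of the reference and processed signals: a large NRR means more noise removed, while a large MelCD means more spectral distortion introduced.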
30. NRR and MelCD (2/4)
- At 25 dB SNR there exists a correlation between the improvement in NRR due to SS and the recognition accuracy
31. NRR and MelCD (3/4)
- On the other hand, under low SNR the correlation no longer holds
- In fact, there is no longer any correlation between maximum NRR and maximum accuracy at very low SNR
- In short, neither the NRR due to SS nor the MelCD can give us a hint of what the recognition performance might be under low-SNR conditions
32. NRR and MelCD (4/4)
33. System Implementation (1/3)
- How can SS be made more effective in the front end of speech recognition? By optimizing the over-subtraction parameter
- Traditional SS
- Optimization steps:
  - Select some utterances from the training data and superimpose different types of noise at various SNR conditions
  - Obtain the matched over-subtraction parameter
  - Obtain the generalized over-subtraction parameter
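The conventional SS baseline that these steps optimize can be sketched as follows (alpha is the over-subtraction factor, whose conventional constant setting of 2 the conclusion compares against; the spectral floor beta is a common textbook choice assumed here, not taken from the paper):

```python
import numpy as np

def spectral_subtract(noisy_mag, noise_mag, alpha=2.0, beta=0.01):
    """Conventional magnitude spectral subtraction.

    alpha: over-subtraction factor (larger -> more noise removed,
           but more distortion, which is the trade-off the paper tunes).
    beta:  spectral floor that keeps bins from going negative.
    """
    subtracted = noisy_mag - alpha * noise_mag   # over-subtract the noise estimate
    return np.maximum(subtracted, beta * noisy_mag)
```

The paper's proposal amounts to choosing alpha not for maximum NRR or minimum MelCD, but for maximum recognition accuracy against the HMMs, either per noise condition (matched) or across conditions (generalized).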
34. System Implementation (2/3)
35. System Implementation (3/3)
36. Experimental Results
- Test set: 200 sentences from 46 speakers
- 8 noise types
  - Original: car, office, booth, crowd
  - Extended: mall, poster, par, train
- The single acoustic model: created by superimposing 25 dB office noise on the JNAS database
- Matched model: the noise superimposed on the training set was the same as that on the testing set
37. Experimental Results
38. Experimental Results
39. Experimental Results
40. Conclusion
- It is shown that conventional SS is not effective under low-SNR conditions
- This paper tailor-fits SS to optimize the recognition performance, by deriving the optimal over-subtraction parameter from the training data, which is directly related to the HMM models
- 26.0% and 7.6% relative improvements for the proposed matched and generalized parameters, respectively, compared with the conventional approach's constant factor of 2
41. Experiment Approach
- (Flow diagram in the original slide; recoverable labels only)
- Training side: all training data → HEQ table (Uni-Table); labeled training data → Multi-Table; HTK (151); Multi-HEQ; Multi-Model (3)
- Testing side: HEQ (Uni-Table) → HTK recognition; Multi-HEQ (Multi-Model, Multi-Table) → HTK recognition
- Dynamic HEQ