1 Separation and Robust Recognition of Noisy, Convolutive Speech Mixtures using Time-Frequency Masking and Missing Data Techniques
Dorothea Kolossa, Aleksander Klimas, Reinhold Orglmeister
Berlin University of Technology
2 Overview
- Introduction
- Time-Frequency Masking
- Missing Data Techniques
- Interfacing Time-Frequency Masking and Speech Recognition
  - Transformation of uncertain features to recognition domain
  - Modified missing data technique
- Experiments and Results
- Conclusions
3 Time-Frequency Masking
Time-Frequency Masking can provide effective source separation.
[Figure: speech signals, mixture, masking function]
4 Time-Frequency Masking
- Criterion suitable for convolutive mixtures: ICA subband energies
- Motivation for criterion
  - ICA is inherently noise robust
  - Frequency-variant ICA model is capable of capturing reverberation effects
- Audio example: car data, 100 km/h (in, no mask, masked)
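The slide names ICA subband energies as the masking criterion but does not preserve the exact rule. A minimal sketch, assuming a binary mask that keeps each time-frequency bin for whichever ICA output has the larger energy there (the function name `energy_based_masks` and the `floor` parameter are hypothetical, not from the talk):

```python
import numpy as np

def energy_based_masks(Y1, Y2, floor=0.0):
    """Binary masks from the time-frequency energies of two ICA outputs.

    Y1, Y2: complex STFT arrays of equal shape (frequency bins x frames).
    Assumed rule (not confirmed by the slides): a bin is kept for output 1
    where its energy exceeds that of output 2, and vice versa.
    floor: mask value used for rejected bins (0.0 gives a hard mask).
    """
    E1 = np.abs(Y1) ** 2
    E2 = np.abs(Y2) ** 2
    M1 = np.where(E1 > E2, 1.0, floor)
    M2 = np.where(E2 >= E1, 1.0, floor)
    return M1, M2

# Toy 2x2 "spectrograms" for illustration only
Y1 = np.array([[2 + 0j, 0.1 + 0j], [0.5j, 3 + 0j]])
Y2 = np.array([[1 + 0j, 1 + 0j], [2j, 0.2 + 0j]])
M1, M2 = energy_based_masks(Y1, Y2)
S1_masked = M1 * Y1  # masked estimate of source 1
```

With `floor=0.0` every bin is assigned to exactly one source; a small positive floor would attenuate rejected bins instead of zeroing them.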
5 Time-Frequency Masking
- Time-Frequency Masking
- Average SNR gain of 3.4 dB over ICA alone with ICA-band-energy-based mask
- Robust with respect to noise and reverberation
- But performance improvement does not translate to improved speech recognition performance
- This is most likely due to feature distortions
- but human recognition performance suggests that sufficient information is present despite distortion
- Solution suggested here is based on missing data techniques (e.g. Barker, Green & Cooke, 2001)
6 Missing Data Techniques
- Dealing with missing frequency bins
- Missing Data Techniques
- Integration over Uncertainty Ranges
- Marginalization
- Data Imputation
[Diagram, point estimate: x1(t), x2(t) -> Source Separation -> S(ω) -> HMM Speech Recognition]
[Diagram, uncertainty range: x1(t), x2(t) -> Source Separation -> S(ω), σS(ω) -> HMM Speech Recognition]
7 Missing Data Techniques
Using Variance in Recognition
8 Missing Data Marginalization
Assign an uncertainty range for each unreliable feature x_u.
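The marginalization formula itself is not preserved on this slide. As a sketch of the general idea, assuming a diagonal-covariance Gaussian state model: reliable features contribute their density value, while each unreliable feature contributes the probability mass of its uncertainty range. All function and parameter names below are illustrative.

```python
import math

def gauss_pdf(x, mu, var):
    """Density of a scalar Gaussian N(mu, var) at x."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def gauss_cdf(x, mu, var):
    """Cumulative distribution of a scalar Gaussian N(mu, var) at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / math.sqrt(2.0 * var)))

def marginalized_likelihood(x, reliable, lo, hi, mu, var):
    """State likelihood with bounded marginalization (diagonal Gaussian assumed).

    x: feature values; reliable: booleans per feature;
    lo, hi: uncertainty range per feature (only read for unreliable ones);
    mu, var: state mean and variance per feature.
    """
    p = 1.0
    for k in range(len(x)):
        if reliable[k]:
            p *= gauss_pdf(x[k], mu[k], var[k])
        else:
            # integrate the state density over the uncertainty range
            p *= gauss_cdf(hi[k], mu[k], var[k]) - gauss_cdf(lo[k], mu[k], var[k])
    return p

inf = float("inf")
# Second feature is unreliable with an unbounded range, so it drops out entirely
p = marginalized_likelihood([0.0, 5.0], [True, False], [0.0, -inf], [0.0, inf],
                            [0.0, 0.0], [1.0, 1.0])
```

An unbounded range reduces to ignoring the feature; a narrower range lets partial knowledge about the feature still discriminate between states.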
9 Missing Data Imputation
Data Imputation: When no information is available (the feature is completely uncertain), the recognizer model mean is used.
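The rule above can be sketched in a few lines. This assumes a per-feature reliability flag and a single state mean vector; the function name `impute_with_state_mean` is hypothetical.

```python
import numpy as np

def impute_with_state_mean(x, reliable, mu_q):
    """Replace completely uncertain features by the mean of state q.

    x: observed feature vector; reliable: boolean flags per feature;
    mu_q: mean vector of the recognizer state being evaluated.
    """
    x = np.asarray(x, dtype=float)
    reliable = np.asarray(reliable, dtype=bool)
    mu_q = np.asarray(mu_q, dtype=float)
    return np.where(reliable, x, mu_q)

x_hat = impute_with_state_mean([1.0, 9.0, 3.0], [True, False, True], [0.5, 2.0, 2.5])
```

Because the imputed value depends on the state mean, the imputation has to be repeated for every state that is evaluated.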
10 Missing Data Techniques
Application of Missing Data Techniques to Speech Recognition
Source separation works in the frequency domain. Speech recognition works best on MFCCs or derived features. -> Mismatch between domains.
Usual compromise:
[Diagram, uncertainty range: x1(t), x2(t) -> Source Separation -> S(ω), σS(ω) (STFT domain) -> Missing Data HMM Speech Recognition]
Problem: Recognition performs significantly better in other domains, so that the missing feature approach performs worse overall than feature reconstruction (Raj, Seltzer & Stern, 2004).
11 Uncertain Feature Transformation
Possible Solution
[Diagram, uncertainty range: x1(t), x2(t) -> Source Separation -> S(ω), σS(ω) (STFT domain) -> Missing Data HMM Speech Recognition]
[Diagram, with uncertain feature transformation: x1(t), x2(t) -> Source Separation -> S(ω), σS(ω) (STFT domain) -> Uncertain Feature Transformation -> Scep, σScep (e.g. MFCC domain) -> Missing Data HMM Speech Recognition]
12 Uncertain Feature Transformation
Transforming Uncertain Features: Flow Diagram
All grey boxes contain purely linear transforms.
[Flow diagram: Microphone Signals Xi -> Preprocessing (ICA) -> S(ω,t), σS(ω,t) -> |·| -> |S|(ω,t), σabs(ω,t) -> Feature Transformation for Recognition: Mel Filter Bank -> Smel(ω,t), σmel(ω,t) -> log -> Slog(ω,t), σlog(ω,t) -> DCT -> Scep(τ,t), σcep(τ,t) -> Δ, Acceleration -> δcep(τ,t), δδcep(τ,t), σδ(ω,t), σδδ(ω,t) -> Feature Vector]
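For the purely linear stages of the flow diagram, mean and covariance can be propagated exactly: for y = A x, the mean is A μ and the covariance is A Σ Aᵀ. A minimal sketch, using an unnormalized DCT-II matrix as a stand-in for the DCT stage (matrix sizes and values are made up for illustration):

```python
import numpy as np

def propagate_linear(A, mean, cov):
    """Exact propagation of mean and covariance through y = A @ x."""
    A = np.asarray(A, dtype=float)
    return A @ mean, A @ cov @ A.T

def dct_matrix(n_out, n_in):
    """Unnormalized DCT-II matrix mapping n_in log-mel values to n_out cepstra."""
    k = np.arange(n_out)[:, None]
    n = np.arange(n_in)[None, :]
    return np.cos(np.pi * k * (n + 0.5) / n_in)

# Illustrative log-mel mean and (diagonal) uncertainty for one frame
mean_log = np.array([1.0, 2.0, 3.0, 4.0])
cov_log = np.diag([0.1, 0.2, 0.1, 0.4])

C = dct_matrix(3, 4)
mean_cep, cov_cep = propagate_linear(C, mean_log, cov_log)
```

Note that a diagonal input covariance generally becomes a full matrix after the transform; keeping only its diagonal would be a further approximation.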
13 Uncertain Feature Transformation
Transformation of Variance in Nonlinearities
Analytical integration:
- must be done beforehand, manually
- often no analytic solution exists, or it may be hard to find
14 Uncertain Feature Transformation
Transformation of Variance in Nonlinearities
Integrals to solve: [equations]
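The integrals shown on the slide are not preserved. In generic form, for a nonlinearity y = f(x) applied to a feature x with density p(x), the quantities to compute are the mean and variance of y (the symbols μ_y and σ_y² are introduced here for illustration):

```latex
\mu_y = \int f(x)\, p(x)\, \mathrm{d}x,
\qquad
\sigma_y^2 = \int \bigl(f(x) - \mu_y\bigr)^2\, p(x)\, \mathrm{d}x
```

In the flow diagram, f would be the absolute value or the logarithm, which is where closed-form solutions become hard to obtain.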
15 Uncertain Feature Transformation
Transformation of Variance in Nonlinearities
Absolute Value: [equations]
16 Uncertain Feature Transformation
Transformation of Variance in Nonlinearities
Logarithm (Gales 1996): analytical solution for the MFCC transform, used for comparison
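The closed-form expressions on this slide are not preserved. One commonly used closed form for the logarithm stage treats the mel-spectral value as log-normally distributed, so that mean and variance map exactly between the linear and log domains. Whether this matches the solution used in the talk is an assumption; the sketch below only illustrates that mapping.

```python
import math

def lognormal_to_log_moments(m, v):
    """Mean and variance of log(x), given mean m > 0 and variance v of x.

    Assumes x is log-normally distributed (an assumption, not from the slides).
    """
    var_log = math.log(1.0 + v / (m * m))
    mean_log = math.log(m) - 0.5 * var_log
    return mean_log, var_log

def log_moments_to_lognormal(mean_log, var_log):
    """Inverse mapping: moments of x from the moments of log(x)."""
    m = math.exp(mean_log + 0.5 * var_log)
    v = (math.exp(var_log) - 1.0) * m * m
    return m, v

mean_log, var_log = lognormal_to_log_moments(2.0, 0.5)
m_back, v_back = log_moments_to_lognormal(mean_log, var_log)
```

The round trip recovers the original moments, which is a quick consistency check on the two formulas.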
17 Uncertain Feature Transformation
Transformation of Variance in Nonlinearities
Alternatives to analytical integration:
- Monte Carlo simulation: too expensive computationally
- Pseudo-Monte-Carlo: interesting
- Unscented Transform
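To make the cost argument concrete, a plain Monte Carlo estimate draws many samples from the input distribution and pushes each one through the nonlinearity. A minimal sketch for a scalar Gaussian input (the function name and sample count are illustrative; `np.exp` is used only because its output moments are known in closed form and can serve as a check):

```python
import numpy as np

def monte_carlo_moments(f, mean, std, n_samples=100_000, seed=0):
    """Estimate mean and variance of f(x) for x ~ N(mean, std**2) by sampling."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mean, std, size=n_samples)
    y = f(x)
    return y.mean(), y.var()

mc_mean, mc_var = monte_carlo_moments(np.exp, 0.0, 0.5)
```

Each feature in each frame would need its own batch of samples, which is why this approach quickly becomes expensive in a recognizer front end.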
18 Uncertain Feature Transformation
- Transforming Variance via Unscented Transform
- Generate set of Sigma-Points which capture statistics (in contrast to Monte-Carlo methods, far fewer points are needed)
- Propagate Sigma-Points through nonlinearity
- Accurate up to those moments which are correctly represented by Sigma-Points
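The steps above can be sketched with the basic symmetric sigma-point set (2n+1 points for an n-dimensional input). The scaling parameter `kappa` and its default are a choice made for this illustration; the talk does not state which variant or parameters were used.

```python
import numpy as np

def unscented_transform(f, mean, cov, kappa=1.0):
    """Propagate mean and covariance through a nonlinearity f via sigma points.

    f maps a 1-D array of length n to a 1-D array; kappa is a scaling
    parameter (illustrative default, not taken from the slides).
    """
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    n = mean.size
    # Columns of L are the sigma-point offsets: L @ L.T == (n + kappa) * cov
    L = np.linalg.cholesky((n + kappa) * cov)
    points = np.vstack([mean, mean + L.T, mean - L.T])  # shape (2n+1, n)
    weights = np.full(2 * n + 1, 1.0 / (2.0 * (n + kappa)))
    weights[0] = kappa / (n + kappa)
    # Propagate every sigma point through the nonlinearity
    Y = np.array([np.atleast_1d(f(p)) for p in points])
    y_mean = weights @ Y
    d = Y - y_mean
    y_cov = (weights[:, None] * d).T @ d
    return y_mean, y_cov

# Illustrative use: uncertainty of two positive mel-like values after the log
log_mean, log_cov = unscented_transform(np.log, [2.0, 3.0], np.diag([0.1, 0.2]))
```

For a linear f the result is exact, which makes a convenient sanity check; for the log or absolute value it is an approximation whose accuracy depends on how well the sigma points represent the input distribution.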
19 Uncertain Feature Transformation
Transforming Uncertain Features: Flow Diagram
All grey boxes contain purely linear transforms.
[Flow diagram repeated from slide 12]
20 Modified Imputation
- Using Variance in Recognition
- Define uncertainty interval as function of variance and perform integration
- or
- Maximize [equation] to obtain modified imputation equations.
21 Modified Imputation
Using Variance in Imputation
Maximization of [equation] leads to the observation estimate [equation], which can be found by [equation] for single Gaussians
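The maximized expression and the resulting estimate are not preserved on this slide. As an illustration only, assume the quantity being maximized is the product of the state output Gaussian N(x; μ_q, σ_q²) and a Gaussian N(x; x_obs, σ_x²) describing the uncertain feature. The maximizer is then a variance-weighted average of the observation and the state mean:

```python
def modified_imputation(x_obs, var_obs, mu_q, var_q):
    """Maximizer of N(x; mu_q, var_q) * N(x; x_obs, var_obs) over x.

    x_obs, var_obs: uncertain feature and its variance;
    mu_q, var_q: mean and variance of the state output Gaussian.
    The objective is an assumption; the slide's own formula is not preserved.
    """
    return (var_q * x_obs + var_obs * mu_q) / (var_q + var_obs)

x_certain = modified_imputation(2.0, 0.0, 0.0, 1.0)    # no uncertainty
x_equal = modified_imputation(2.0, 1.0, 0.0, 1.0)      # equal variances
x_uncertain = modified_imputation(2.0, 1e12, 0.0, 1.0) # nearly no information
```

Under this assumption the estimate stays at the observation when the feature variance is zero and approaches the state mean as the variance grows, which agrees with the plain imputation rule for completely uncertain features.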
22 Modified Imputation
Using Variance in Imputation
and for MOG models is approximated by [equation]
Finally, the state output probability p(o|q) is replaced by [equation]
23 Modified Imputation
Imputation for the case of uncertain information
24 Experiments
Reverberant Room Recordings
25 Experiments
In-Car Recordings
- Outline
  - 8 channel microphone array
  - Speech reproduced with artificial mouth from CD with TI digits
  - Simultaneous recording with 4 cardioid microphones
  - 2 reference signals
- Speech
  - TI digits, 10 different speakers, 2 min each
- Setup
  - Car: Mercedes S 320
  - Pre-Amplifier: MidiMan
  - Recorder: 16 channel, 12 kHz, 16 bit
  - Artificial heads: HEAD acoustics
26 Results
% correct on connected digits task

                        config a   config b   car 0 km/h   car 100 km/h @ -9.6 dB SNR
noisy                   56.1       49.8       50.6         21.5
TF-Masked               59.3       50.8       48.5         20.6
dB gain                 5.5 dB     7.1 dB     15.4 dB      12.3 dB
Analytic Integration    88.5       89.0       80.9         66.2
Unscented               87.2       86.1       80.1         68.4

config a and b: trev = 300 ms; in-car: trev = 70 ms
27 Conclusions
- Integration of ICA and Speech Recognition
- ICA results can be improved by time-frequency masking
- Speech recognition results can suffer despite improvements in SNR
- To improve recognition performance, variance information on spectral features can be derived in frequency domain
- For transforming variances to cepstral domain, analytical integration was compared with unscented transform. Results are similar.
- Unscented Transform may provide generally applicable solution for coupling TF-signal processing to speech recognition in its optimal domain of operation
28 Literature
J. Barker, P. Green and M.P. Cooke: Linking Auditory Scene Analysis and Robust ASR by Missing Data Techniques, Proceedings WISP 2001, Stratford, UK. Available at http://hoarsenet.org/spandh/projects/respite/publications/publications.html
M. Gales: Model-Based Techniques for Noise Robust Speech Recognition, PhD Thesis, Cambridge University, 1996.
D. Kolossa and R. Orglmeister: Nonlinear Postprocessing for Blind Speech Separation, Proceedings ICA2004, Lecture Notes in Computer Science.
B. Raj, M. Seltzer and R. Stern: Reconstruction of Missing Features for Robust Speech Recognition, Speech Communication 43, pp. 275-296, 2004.
R. Stern: Signal Separation Motivated by Human Auditory Perception: Applications to Automatic Speech Recognition, in Speech Separation by Humans and Machines, P. Divenyi (Ed.), Kluwer 2005.
29 Thank you!