Slide 1: HIWIRE
Computer Vision, Speech Communication and Signal Processing Research Group
Slide 2: HIWIRE: Involved CVSP Members
- Group Leader: Prof. Petros Maragos
- Ph.D. Students / Graduate Research Assistants:
  - D. Dimitriadis (speech recognition, modulations)
  - V. Pitsikalis (speech recognition, fractals/chaos, fusion)
  - A. Katsamanis (speech modulations, statistical processing, recognition, fusion)
  - G. Papandreou (vision PDEs, active contours, level sets, AV-ASR, fusion)
  - G. Evangelopoulos (vision/speech texture, modulations, fractals)
  - S. Leukimmiatis (speech statistical processing, microphone arrays)
Slide 3: ICCS-NTUA Task Involvement
- WP1: Environment and Sensor Robustness (26 MM)
  - Task 1: Sensor Integration Independence (11 MM)
    - Subject 1: Multi-Microphone Systems (5 MM)
    - Subject 5: Multi-Modal Features (audio-visual) (6 MM)
  - Task 2: Noise Independence (15 MM)
    - Subject 2: Advanced Signal Processing (15 MM)
- WP2: User Robustness (8 MM)
  - Task 1: Improved Speaker Independence (4 MM)
  - Task 2: Rapid Speaker Adaptation (4 MM)
- WP3: System Integration (4 MM)
- WP4: Evaluation (5 MM)
- WP5: Exploitation and Dissemination (1 MM)
Slide 4: ICCS-NTUA in HIWIRE
- Evaluation
  - Databases / Baseline: Completed
  - Platform Front-end Release: 1st Version
- WP1
  - Noise Robust Features: Completed
  - Multi-mic. Array Enhancement: Prelim. Results
  - Fusion: Prelim. Results
  - Audio-Visual ASR: Baseline + Adv. Visual Features
  - VAD: Completed + Integration
- WP2
  - VTLN Platform Integration: Completed
  - Speaker Normalization Research: Prelim. Results
  - Non-native Speech Database: Completed
Slide 6: HIWIRE Advanced Front-end: Challenges
- Points considered during implementation:
  - Modular architecture
  - Implementation in C code
  - Incorporation of different ideas/algorithms
  - User-friendly interface providing additional options to deal with the on-site demands of the project
Slide 7: HIWIRE Advanced Front-end: Options
- Support for Input Speech Signals
  - Different Sampling Frequencies
    - 8 kHz
    - 11 kHz
    - 16 kHz
  - Different Byte Ordering
    - Little-endian
    - Big-endian
  - Different Input File Formats
    - RAW
    - NIST
    - HTK
- Flags/Options Provided
  - Preprocessing / Smoothing of Speech Signals
    - Hamming Windowing
    - Pre-emphasis
  - Denoising / VAD Algorithms
    - LTSD-VAD Algorithm (UGR)
    - MTE-VAD Algorithm (ICCS-NTUA)
    - Wiener Denoising Algorithm (used only with a VAD algorithm)
  - Output Features
    - MFCC
    - TECC
    - C0 or logE
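In practice, the input-signal options above reduce to a little decoding glue. Below is a minimal Python/NumPy sketch of the RAW case with selectable byte order; it is illustrative only (the actual front-end is implemented in C), and the file name and helper are made up for the example.

```python
import os
import tempfile

import numpy as np

def read_raw_pcm(path, big_endian=False):
    """Read a headerless RAW file of 16-bit linear PCM samples.

    RAW carries no metadata, so the byte order must be supplied by
    the user, as in the front-end's options: '>i2' means big-endian,
    '<i2' little-endian 16-bit signed integers.
    """
    dtype = ">i2" if big_endian else "<i2"
    return np.fromfile(path, dtype=dtype).astype(np.float64)

# Round trip: byte order only matters on disk; decoded with the
# matching dtype, the sample values come back unchanged.
path = os.path.join(tempfile.gettempdir(), "demo_le.raw")
np.array([1, -2, 300], dtype="<i2").tofile(path)
x = read_raw_pcm(path, big_endian=False)
```

NIST and HTK inputs differ only in that a fixed-size header has to be parsed and skipped before the same sample-decoding step.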
Slide 8: HIWIRE Advanced Front-end: Things to Be Done
- The script is in the testing phase
- Create a CVS repository where additional modules should be included
- Test further on speech databases (evaluation in progress)
- Fine-tuning is necessary
- The final version should be faster (real-time processing)
- Incorporate it into the HIWIRE platform
Slide 9: Aurora 3 - Spanish
- Connected digits, sampling frequency 8 kHz
- Training Set
  - WM (Well-Matched): 3392 utterances (quiet 532, low 1668 and high noise 1192)
  - MM (Medium-Mismatch): 1607 utterances (quiet 396 and low noise 1211)
  - HM (High-Mismatch): 1696 utterances (quiet 266, low 834 and high noise 596)
- Testing Set
  - WM: 1522 utterances (quiet 260, low 754 and high noise 508), 8056 digits
  - MM: 850 utterances (quiet 0, low 0 and high noise 850), 4543 digits
  - HM: 631 utterances (quiet 0, low 377 and high noise 254), 3325 digits
- 2 back-end ASR systems (??? and BLasr)
- Feature vectors: MFCC+AM-FM (or Auditory AM-FM), TECC
- All-pair, unweighted grammar (or word-pair grammar)
- Performance criterion: word (digit) accuracy rates
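The word (digit) accuracy criterion used above is computed from an edit-distance alignment of the recognized string against the reference: Acc = (N - S - D - I) / N, where N is the number of reference words and S, D, I count substitutions, deletions and insertions. A minimal sketch (plain Levenshtein alignment, not the project's scoring tool):

```python
def word_accuracy(ref, hyp):
    """Word accuracy (N - S - D - I) / N, with substitutions,
    deletions and insertions counted by an edit-distance alignment
    of the hypothesis against the reference word sequence."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                       # all deletions
    for j in range(m + 1):
        d[0][j] = j                       # all insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 1.0 - d[n][m] / n

# One substitution plus one insertion over 4 reference digits:
acc = word_accuracy(list("1234"), list("12954"))  # (4 - 2) / 4 = 0.5
```

Note that insertions can drive the accuracy negative, which is why connected-digit tasks report accuracy rather than plain percent-correct.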
Slide 10: Databases: Aurora 2
- Task: speaker-independent recognition of digit sequences
- TI-Digits at 8 kHz
- Training (8440 utterances per scenario, 55M/55F)
  - Clean (8 kHz, G.712)
  - Multi-Condition (8 kHz, G.712)
    - 4 noises (artificial): subway, babble, car, exhibition
    - 5 SNRs: 5, 10, 15, 20 dB, and clean
- Testing, artificially added noise
  - 7 SNRs: -5, 0, 5, 10, 15, 20 dB, and clean
  - Set A: noises as in multi-condition training, G.712 (28028 utterances)
  - Set B: restaurant, street, airport, train station, G.712 (28028 utterances)
  - Set C: subway, street (MIRS) (14014 utterances)
Slide 11: ICCS-NTUA in HIWIRE, 1st and 2nd Year
- Evaluation
  - Databases / Baseline: Completed
  - Platform Front-end Release: 1st Version
- WP1
  - Noise Robust Features: Completed
  - Multi-mic. Array Enhancement: Prelim. Results
  - Fusion: Prelim. Results
  - Audio-Visual ASR: Baseline + Adv. Visual Features
  - VAD: Completed; Integration?
- WP2
  - VTLN Platform Integration: Completed
  - Speaker Normalization Research: Prelim. Results
  - Non-native Speech Database: Completed
Slide 12: Microphone Arrays
- Multi-channel speech enhancement for diffuse noise fields
- MVDR (Minimum Variance Distortionless Response) beamforming
- Single-channel linear and non-linear post-filtering
  - The MSE criterion leads to the linear Wiener post-filter.
  - The MSE-STSA and MSE log-STSA criteria lead to non-linear post-filters.
Slide 13: Microphone Arrays
- The overall speech enhancement system includes the following steps:
  - The noisy channel inputs are fed into a time-alignment module (different propagation paths for every input channel).
  - The time-aligned noisy observations are projected to a single-channel output with minimum noise variance through the MVDR beamformer.
  - The output of the beamformer is further processed by a post-filter according to the speech enhancement criterion used (MSE, MSE-STSA, MSE log-STSA).
  - Since the post-filters depend on second-order statistics of the source and noise signals, an estimation scheme has to be developed.
- Results on the CMU Database
  - 10 speakers (13 utterances each)
  - Diffuse noise
  - SSNR Enhancement = SSNR_output - E{SSNR_input}, where E{} denotes the mean over the N input channels
  - LAR, LSD, IS, LLR: low values signify high speech quality; these measures are found to correlate highly with human perception.
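The beamforming and post-filtering steps above have standard closed forms. A minimal narrowband sketch for a single frequency bin, assuming the noise cross-spectral matrix and the steering (propagation) vector are already estimated; this is illustrative only, not the evaluated system:

```python
import numpy as np

def mvdr_weights(Phi_nn, d):
    """MVDR beamformer for one frequency bin: minimise output noise
    power subject to a distortionless response in the look
    direction, w = Phi_nn^{-1} d / (d^H Phi_nn^{-1} d)."""
    p = np.linalg.solve(Phi_nn, d)
    return p / (d.conj() @ p)

def wiener_postfilter(S_ss, S_nn):
    """Single-channel MSE (Wiener) post-filter H = S_ss / (S_ss + S_nn),
    applied to the beamformer output; requires estimates of the
    source and noise power spectra (the second-order statistics
    mentioned above)."""
    return S_ss / (S_ss + S_nn)

# Four time-aligned channels with spatially white noise: MVDR
# reduces to a plain average, and w^H d = 1 (distortionless) holds.
d = np.ones(4, dtype=complex)
w = mvdr_weights(np.eye(4, dtype=complex), d)
```

The non-linear MSE-STSA and MSE log-STSA post-filters replace the Wiener gain with the corresponding short-time spectral amplitude estimators, but slot into the same position after the beamformer.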
Slide 14: Results on the CMU Database
Slide 15: Spectrograms, CMU Database
Slide 16: Multi-Microphone ASR Experiments
- Details of the ASR task setup:
  - 700 sentences for training and 300 for testing
  - 12-state, left-right HMMs with Gaussian mixtures
  - All-pair, unweighted grammar
  - MFCC + C0 + Δ + ΔΔ (39 coefficients in total)
Slide 17: ICCS-NTUA in HIWIRE, 1st and 2nd Year
- Evaluation
  - Databases / Baseline: Completed
  - Platform Front-end Release: 1st Version
- WP1
  - Noise Robust Features: Completed
  - Multi-mic. Array Enhancement: Prelim. Results
  - Fusion: Prelim. Results
  - Audio-Visual ASR: Baseline + Adv. Visual Features
  - VAD: Completed; Integration?
- WP2
  - VTLN Platform Integration: Completed
  - Speaker Normalization Research: Prelim. Results
  - Non-native Speech Database: Completed
Slide 18: Multi-Cue Feature Fusion
- Goal
  - Fuse heterogeneous information streams optimally and adaptively
- Our approach
  - Explicitly model uncertainty in all feature measurements (due to noise or model-fitting errors)
  - Adjust model training to accommodate uncertainty
  - Dynamically compensate feature uncertainty during decoding
- Feature uncertainty estimation in the AV-ASR case
  - For the audio stream / MFCC: the speech enhancement process
  - For the visual stream: model-fitting variance
- Properties
  - Adaptation at the frame level
  - Explains and generalizes cue weighting through stream exponents
  - Integrates with a wide range of models, e.g. GMM, HMM
  - Applicable to both audio-audio and audio-visual scenarios
  - Can be combined with asynchronous models, e.g. Product-HMM
Slide 19: Measurement Noise and Adaptive Fusion
- Conventional view: features are directly observable
- Our view: we can only measure noise-corrupted features
Ref: Katsamanis, Papandreou, Pitsikalis, and Maragos, EUSIPCO 2006
Slide 20: EM Training with Partially Known Features
- Even training data can be uncertain
- [Graphical models: in the conventional view the features are observed; in our view the clean features are hidden and only noisy measurements are observed]
Ref: Papandreou, Katsamanis, Pitsikalis, and Maragos, submission to NIPS 2006
Slide 21: EM Training Results for GMM
- E-step: similar to the conventional update rules, but with uncertainty-compensated scores
- M-step: uses the filtered feature estimate
- Formulas for the HMM case are similar
Slide 22: Decoding Uncertain Features
- Variance-compensated (soft) scoring
- Probabilistic justification for stream exponents
  - Relative measurement error
  - Adaptation at each frame: stream-, class- and mixture-dependent stream weights
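Variance-compensated (soft) scoring has a simple Gaussian form: if the measured feature equals the clean one plus zero-mean Gaussian noise of known variance, each Gaussian is evaluated with its model variance inflated by the measurement variance. A minimal diagonal-covariance sketch (illustrative, not the HIWIRE decoder):

```python
import numpy as np

def log_gauss_diag(y, mu, var):
    """Diagonal-covariance Gaussian log-density."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (y - mu) ** 2 / var)

def soft_score(y, mu, var, var_err):
    """Variance-compensated (soft) score: the model variance is
    inflated by the per-frame measurement uncertainty, so unreliable
    frames are automatically down-weighted."""
    return log_gauss_diag(y, mu, var + var_err)

mu, var = np.zeros(2), np.ones(2)
y = np.array([2.0, 0.0])           # a somewhat outlying measurement
hard = log_gauss_diag(y, mu, var)  # conventional scoring
soft = soft_score(y, mu, var, var_err=np.full(2, 3.0))
# With large measurement uncertainty, the outlying frame is
# penalised less than under conventional scoring.
```

Because var_err varies per frame and per stream, this recovers frame-level adaptive weighting without hand-tuned stream exponents.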
Slide 23: Audio-Visual Asynchrony Modeling
- Multi-stream HMM
- Product HMM
Ref: Gravier et al., 2002
Slide 24: Fusion: Multi-Cue, Audio-Audio
- Feature uncertainty for audio features
- Baseline audio features: MFCC
  - Enhancement using a GMM of clean speech and a Vector Taylor Series approximation
  - Uncertainty is Gaussian, with variance given by the enhancement process
  - Used for audio-visual fusion
- Fractal audio features: MFD
  - Ongoing research applying a similar framework (GMM, VTS)
Slide 25: MFD: From Noisy Speech to Feature Uncertainty
- [Figure: MFD feature trajectories for clean speech, noise, true noisy and estimated noisy speech; white noise at 0 dB]
- Ongoing research: noise compensation for MFD
Slide 26: ICCS-NTUA in HIWIRE, 1st and 2nd Year
- Evaluation
  - Databases / Baseline: Completed
  - Platform Front-end Release: 1st Version
- WP1
  - Noise Robust Features: Completed
  - Multi-mic. Array Enhancement: Prelim. Results
  - Fusion: Prelim. Results
  - Audio-Visual ASR: Baseline + Adv. Visual Features
  - VAD: Completed; Integration?
- WP2
  - VTLN Platform Integration: Completed
  - Speaker Normalization Research: Prelim. Results
  - Non-native Speech Database: Completed
Slide 27: Showcase: Audio-Visual Speech Recognition
- Both shape and texture can assist lipreading
- Active Appearance Models for face modeling
  - Shape and texture of faces live on low-dimensional manifolds
- Features: AAM fitting (a nonlinear least-squares problem)
- Visual feature uncertainty is related to the sensitivity of the least-squares solution
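The least-squares sensitivity argument above is the standard Gauss-Newton one: near the optimum, the parameter covariance is approximately sigma^2 (J^T J)^{-1}, with J the Jacobian of the fitting residual. A minimal sketch under that approximation (illustrative, not the actual AAM implementation):

```python
import numpy as np

def fit_covariance(J, residual):
    """Gauss-Newton estimate of the parameter covariance of a
    nonlinear least-squares fit: sigma^2 (J^T J)^{-1}, with the
    noise level sigma^2 estimated from the residual."""
    m, n = J.shape
    sigma2 = (residual @ residual) / (m - n)
    return sigma2 * np.linalg.inv(J.T @ J)

# A nearly flat direction in J (second column) means the image data
# barely constrains that parameter, so its variance comes out large:
# the fit is uncertain exactly where the image is uninformative.
J = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 0.1]])
cov = fit_covariance(J, residual=np.array([0.1, -0.1, 0.05]))
```

It is this covariance that the visual front-end passes downstream as the feature uncertainty for variance-compensated fusion.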
Slide 28: Demo: AAM Fitting and Uncertainty Estimates
- The visual front-end supplies both features and their respective uncertainties.
Slide 29: Audio-Visual ASR Database
- Subset of the CUAVE database used
  - 36 speakers (30 training, 6 testing)
  - 5 sequences of 10 connected digits per speaker
  - Training set: 1500 digits (30x5x10)
  - Test set: 300 digits (6x5x10)
- The CUAVE database also contains more complex data sets: speaker moving around, speaker in profile, continuous digits, two speakers (to be used in future evaluations)
- CUAVE was kindly provided by Clemson University
Slide 30: Evaluation on the CUAVE Database
Slide 31: Audio-Visual Speech Classification with MS-HMM
Ref: Katsamanis, Papandreou, Pitsikalis, and Maragos, EUSIPCO 2006
Slide 32: AV Digit Classification Results (Word Accuracy)

| SNR (babble) | Audio | Visual | AV MS-HMM | AV MS-HMM Var-Comp | AV P-HMM | AV P-HMM Var-Comp |
|--------------|-------|--------|-----------|--------------------|----------|-------------------|
| Clean        | 100   | 68.7   | 95.1      | 97.0               | 95.4     | 99.6              |
| 10 dB        | 92.8  | -      | 88.3      | 90.2               | 90.6     | 92.5              |
| 5 dB         | 73.9  | -      | 84.5      | 86.8               | 87.2     | 89.1              |
| 0 dB         | 54.7  | -      | 79.6      | 81.1               | 83.8     | 82.6              |

- Audio: MFCC_D_Z (26 features)
- Visual: 6 shape + 12 texture AAM coefficients
- AV MS-HMM: audio-visual multistream HMM, weights (1,1)
- AV MS-HMM Var-Comp: audio-visual multistream HMM + variance compensation
- AV P-HMM: audio-visual product HMM, weights (1,1)
- AV P-HMM Var-Comp: audio-visual product HMM + variance compensation

Ref: Pitsikalis, Katsamanis, Papandreou, and Maragos, ICSLP 2006
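The multistream baselines in the table combine the per-stream log-likelihoods with fixed exponents, log p = lambda_a log p_a + lambda_v log p_v, here with weights (1,1); variance compensation effectively makes this weighting frame-dependent. A minimal sketch of the fixed-exponent rule (illustrative values only):

```python
def multistream_score(log_pa, log_pv, lam_a=1.0, lam_v=1.0):
    """Multistream HMM state score: exponent-weighted sum of the
    audio and visual per-stream log-likelihoods."""
    return lam_a * log_pa + lam_v * log_pv

# With weights (1,1), as in the table's MS-HMM baseline, both
# streams count equally; lowering lam_a discounts a noisy audio
# stream in favour of the video stream.
clean = multistream_score(-2.0, -5.0)             # -7.0
noisy = multistream_score(-2.0, -5.0, lam_a=0.2)  # -5.4
```

The product HMM applies the same combination but on a composite state space that allows limited asynchrony between the two streams.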
Slide 33: AV-ASR Results with Uncertain Training
Ref: Papandreou, Katsamanis, Pitsikalis, and Maragos, submission to NIPS 2006
Slide 34: ICCS-NTUA in HIWIRE, 1st and 2nd Year
- Evaluation
  - Databases / Baseline: Completed
  - Platform Front-end Release: 1st Version
- WP1
  - Noise Robust Features: Completed
  - Multi-mic. Array Enhancement: Prelim. Results
  - Fusion: Prelim. Results
  - Audio-Visual ASR: Baseline + Adv. Visual Features
  - VAD: Completed; Integration?
- WP2
  - VTLN Platform Integration: Completed
  - Speaker Normalization Research: Prelim. Results
  - Non-native Speech Database: Completed
Slide 35: Databases: Aurora 4
- Task: 5000-word continuous speech recognition
- WSJ0 (16 / 8 kHz), artificially added noise
  - 2 microphones: Sennheiser, other
  - Filtering: G.712, P.341
  - Noises: car, babble, restaurant, street, airport, train station
- Training (7138 utterances per scenario)
  - Clean: Sennheiser mic.
  - Multi-Condition: Sennheiser + other mic., 75% with artificially added noise at SNR 10-20 dB
  - Noisy: Sennheiser, artificially added noise at SNR 10-20 dB
- Testing (330 utterances / 166 utterances each, 8 speakers)
  - SNR 5-15 dB
  - Test sets 1-7: Sennheiser microphone
  - Test sets 8-14: other microphone
Slide 36: VTLN on the Platform
- Warping in the front-end
  - Piecewise-linear warping function
  - Warping in the filterbank domain by stretching or compressing the frequency axis
- Training: HTK implementation
- Testing: fast implementation using a GMM representing normalized speech to estimate warping factors per utterance
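The piecewise-linear warping function can be written down directly: frequencies are scaled by a factor alpha up to a knee point, then mapped linearly so that the Nyquist frequency is fixed. A minimal sketch; the knee at 0.8 of Nyquist is an assumption for illustration, not the platform's setting:

```python
def warp_freq(f, alpha, f_nyq, knee=0.8):
    """Piecewise-linear VTLN warp: scale by alpha below the knee,
    then interpolate linearly so that f_nyq maps onto itself
    (keeping the warped axis within the analysis bandwidth)."""
    f0 = knee * f_nyq
    if f <= f0:
        return alpha * f
    # line from (f0, alpha * f0) to (f_nyq, f_nyq)
    slope = (f_nyq - alpha * f0) / (f_nyq - f0)
    return alpha * f0 + slope * (f - f0)

# alpha < 1 compresses the axis; the Nyquist endpoint stays fixed.
lo = warp_freq(1000.0, 0.9, 8000.0)  # 900.0
hi = warp_freq(8000.0, 0.9, 8000.0)  # 8000.0
```

In the filterbank-domain implementation the same map is applied to the filter centre frequencies rather than to the spectrum itself, which is why it extends naturally to the Bark-spaced TECC filterbank on the next slides.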
Slide 37: VTLN on the Platform: Results
Slide 38: VTLN Research: TECC Features
- Teager Energy Cepstrum Coefficients are in effect energy measurements at the output of a Gammatone filterbank, similarly to MFCC
- VTLN can be applied in a similar manner: the Bark scale along which the filters are uniformly positioned is stretched or shrunk appropriately to achieve warping
- Evaluation is currently in progress
Slide 39: VTLN Research: Using Formants
Slide 40: Raw Formants: Dynamic Programming
Slide 41: Formant Tracking
Slide 42: ICCS-NTUA in HIWIRE, 1st and 2nd Year
- Evaluation
  - Databases / Baseline: Completed
  - Platform Release: 1st Version
- WP1
  - Noise Robust Features: Completed
  - Multi-mic. Array Enhancement: Prelim. Results
  - Fusion: Prelim. Results
  - Audio-Visual ASR: Baseline + Adv. Visual Features
  - VAD: Completed; Integration?
- WP2
  - VTLN Platform Integration: Completed
  - Speaker Normalization Research: Prelim. Results
  - Non-native Speech Database: Completed
Slide 43: Next...
- Fusion
  - Audio + Audio
  - Audio + Visual
  - Nonlinear Features + Visual
- Visual Front-end
- VAD + Nonlinear Features