Computer Vision, Speech Communication

1
HIWIRE
Computer Vision, Speech Communication and Signal
Processing Research Group
2
HIWIRE: Involved CVSP Members
  • Group Leader: Prof. Petros Maragos
  • Ph.D. Students / Graduate Research Assistants
      • D. Dimitriadis (speech recognition,
        modulations)
      • V. Pitsikalis (speech recognition,
        fractals/chaos, fusion)
      • A. Katsamanis (speech modulations, statistical
        processing, recognition, fusion)
      • G. Papandreou (vision PDEs, active contours,
        level sets, AV-ASR, fusion)
      • G. Evangelopoulos (vision/speech texture,
        modulations, fractals)
      • S. Leukimmiatis (speech statistical processing,
        microphone arrays)

3
ICCS-NTUA Tasks Involvement
  • WP1: Environment and Sensor Robustness (26 MM)
      • Task 1: Sensor Integration Independence (11 MM)
          • Subject 1: Multi-Microphone Systems (5 MM)
          • Subject 5: Multi-Modal Features (audio-visual)
            (6 MM)
      • Task 2: Noise Independence (15 MM)
          • Subject 2: Advanced Signal Processing (15 MM)
  • WP2: User Robustness (8 MM)
      • Task 1: Improved Speaker Independence (4 MM)
      • Task 2: Rapid Speaker Adaptation (4 MM)
  • WP3: System Integration (4 MM)
  • WP4: Evaluation (5 MM)
  • WP5: Exploitation and Dissemination (1 MM)

4
ICCS-NTUA in HIWIRE
  • Evaluation
      • Databases Baseline: Completed
      • Platform Front-end Release: 1st Version
  • WP1
      • Noise Robust Features: Completed
      • Multi-mic. Array Enhancement: Prelim. Results
      • Fusion: Prelim. Results
      • Audio-Visual ASR: Baseline, Adv. Visual Features
      • VAD: Completed, Integration
  • WP2
      • VTLN Platform Integration: Completed
      • Speaker Normalization Research: Prelim. Results
      • Non-native Speech Database: Completed

6
HIWIRE Advanced Front-end: Challenges
  • Points Considered during Implementation
      • Modular architecture
      • Implementation in C code
      • Incorporation of different ideas/algorithms
      • User-friendly interface providing additional
        options that deal with the on-site demands of
        the project

7
HIWIRE Advanced Front-end: Options
  • Support for Input Speech Signals
      • Different Sampling Frequencies
          • 8 kHz
          • 11 kHz
          • 16 kHz
      • Different Byte-Ordering
          • Little-endian
          • Big-endian
      • Different Input File Formats
          • RAW
          • NIST
          • HTK

  • Provides Flags/Options
      • Preprocessing/Smoothing of Speech Signals
          • Hamming Windowing
          • Pre-emphasis
      • Denoising/VAD Algorithms
          • LTSD-VAD Algorithm (UGR)
          • MTE-VAD Algorithm (ICCS-NTUA)
          • Wiener Denoising Algorithm (used only with a
            VAD algorithm)
      • Output Features
          • MFCC
          • TECC
          • C0 or logE
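
The input and preprocessing options listed above can be illustrated with a minimal Python sketch. It is not the HIWIRE C implementation: the function names, the 16-bit sample width, and the pre-emphasis coefficient 0.97 are assumptions chosen for illustration only.

```python
import array
import math
import sys


def read_raw_pcm16(data, byte_order="little"):
    """Decode headerless 16-bit PCM bytes, given an explicit byte order."""
    if byte_order not in ("little", "big"):
        raise ValueError("byte_order must be 'little' or 'big'")
    samples = array.array("h")
    samples.frombytes(data)
    if byte_order != sys.byteorder:
        samples.byteswap()  # convert to the machine's native order
    return list(samples)


def pre_emphasize(signal, coeff=0.97):
    """y[n] = x[n] - coeff * x[n-1]; the first sample is passed through."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - coeff * signal[n - 1]
                          for n in range(1, len(signal))]


def hamming_window(length):
    """Standard Hamming window of the given length (length >= 2)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def frame_signal(signal, frame_len, frame_shift):
    """Split the signal into overlapping frames and window each frame."""
    window = hamming_window(frame_len)
    return [[signal[start + n] * window[n] for n in range(frame_len)]
            for start in range(0, len(signal) - frame_len + 1, frame_shift)]
```

At 8 kHz, a 25 ms frame with a 10 ms shift would correspond to `frame_signal(x, 200, 80)`; these frame sizes are likewise an assumption, not values stated on the slides.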

8
HIWIRE Advanced Front-end: Things to Be Done
  • Script is in testing phase
  • Create a CVS repository where additional modules
    should be included
  • Further testing on speech databases
      • Evaluation in progress
      • Fine-tuning is necessary
  • Final version should be faster (real-time
    processing)
  • Incorporate it in the HIWIRE platform

9
Aurora 3 - Spanish
  • Connected digits, sampling frequency 8 kHz
  • Training Set
      • WM (Well-Matched): 3392 utterances (quiet 532,
        low 1668 and high noise 1192)
      • MM (Medium-Mismatch): 1607 utterances (quiet 396
        and low noise 1211)
      • HM (High-Mismatch): 1696 utterances (quiet 266,
        low 834 and high noise 596)
  • Testing Set
      • WM: 1522 utterances (quiet 260, low 754 and high
        noise 508), 8056 digits
      • MM: 850 utterances (quiet 0, low 0 and high
        noise 850), 4543 digits
      • HM: 631 utterances (quiet 0, low 377 and high
        noise 254), 3325 digits
  • 2 Back-end ASR Systems (??? and BLasr)
  • Feature Vectors: MFCC+AM-FM (or Auditory+AM-FM),
    TECC
  • All-Pair, Unweighted Grammar (or Word-Pair
    Grammar)
  • Performance Criterion: Word (digit) Accuracy Rates
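
As a rough illustration of the performance criterion, word (digit) accuracy is commonly computed as (N - S - D - I) / N, where N is the number of reference words and S, D, I are the substitutions, deletions and insertions of the best alignment. The sketch below uses a unit-cost edit distance, which is an assumption: scoring tools may use different alignment penalties, and the function names are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Minimum number of substitutions, deletions and insertions."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_accuracy(pairs):
    """Accuracy in percent over (reference, hypothesis) word-list pairs."""
    total_words = sum(len(ref) for ref, _ in pairs)
    total_errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    return 100.0 * (total_words - total_errors) / total_words
```

Because insertions are counted as errors, this accuracy can become negative for very noisy hypotheses, unlike the percentage of correctly recognized words.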

10
Databases: Aurora 2
  • Task: Speaker-Independent Recognition of Digit
    Sequences
  • TI-Digits at 8 kHz
  • Training (8440 utterances per scenario, 55M/55F)
      • Clean (8 kHz, G712)
      • Multi-Condition (8 kHz, G712)
          • 4 noises (artificial): subway, babble, car,
            exhibition
          • 5 SNRs: 5, 10, 15, 20 dB, clean
  • Testing, artificially added noise
      • 7 SNRs: -5, 0, 5, 10, 15, 20 dB, clean
      • A: noises as in multi-condition training, G712
        (28028 utterances)
      • B: restaurant, street, airport, train station,
        G712 (28028 utterances)
      • C: subway, street (MIRS) (14014 utterances)
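
To make "artificially added noise" at a given SNR concrete, the following minimal sketch scales a noise segment so that the speech-to-noise power ratio matches a target value. It uses plain average power over the whole utterance, which is a simplifying assumption and not the level measurement and filtering procedure used to build the Aurora sets; `mix_at_snr` is a hypothetical name.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the average-power SNR equals snr_db."""
    if len(noise) < len(speech):
        raise ValueError("noise segment is shorter than the speech signal")
    noise = noise[:len(speech)]
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if speech_power == 0.0 or noise_power == 0.0:
        raise ValueError("speech and noise must both have non-zero power")
    # Gain g such that speech_power / (g^2 * noise_power) = 10^(snr_db / 10).
    gain = math.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

For example, `mix_at_snr(speech, noise, 5.0)` would produce the 5 dB condition under these assumptions.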

11
ICCS-NTUA in HIWIRE: 1st, 2nd Year
  • Evaluation
      • Databases Baseline: Completed
      • Platform Front-end Release: 1st Version
  • WP1
      • Noise Robust Features: Completed
      • Multi-mic. Array Enhancement: Prelim. Results
      • Fusion: Prelim. Results
      • Audio-Visual ASR: Baseline, Adv. Visual Features
      • VAD: Completed, Integration?
  • WP2
      • VTLN Platform Integration: Completed
      • Speaker Normalization Research: Prelim. Results
      • Non-native Speech Database: Completed

12
Microphone Arrays
  • Multi-channel Speech Enhancement for Diffuse
    Noise Fields
      • MVDR (Minimum Variance Distortionless Response)
        Beamforming
      • Single-Channel Linear and Non-linear
        Post-Filtering
          • The MSE criterion leads to the linear Wiener
            post-filter.
          • The MSE STSA and MSE log-STSA criteria lead to
            non-linear post-filters.
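
The MVDR weights have the well-known closed form w = R⁻¹d / (dᵀR⁻¹d), where R is the noise covariance across channels and d the steering vector. The sketch below is a simplified illustration, not the group's implementation: it assumes channels that are already time-aligned (so d is all ones) and a real-valued covariance matrix, and `solve_linear`, `mvdr_weights` and `beamform` are hypothetical helper names.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("noise covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def mvdr_weights(noise_cov):
    """w = R^-1 d / (d^T R^-1 d) with d = all ones (time-aligned channels)."""
    n = len(noise_cov)
    r_inv_d = solve_linear(noise_cov, [1.0] * n)
    denom = sum(r_inv_d)  # d^T R^-1 d for an all-ones d
    return [v / denom for v in r_inv_d]


def beamform(channels, weights):
    """Weighted sum of time-aligned channels, sample by sample."""
    return [sum(w * ch[t] for w, ch in zip(weights, channels))
            for t in range(len(channels[0]))]
```

The weights sum to one, which is the distortionless constraint for this steering vector, while channels with larger noise variance receive smaller weights.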

13
Microphone Arrays
  • The overall speech enhancement system includes
    the following steps:
      • The noisy channel inputs are fed into a
        time-alignment module (each input channel has a
        different propagation path).
      • The time-aligned noisy observations are projected
        to a single-channel output with minimum noise
        variance through the MVDR beamformer.
      • The output of the beamformer is further processed
        by a post-filter chosen according to the speech
        enhancement criterion used (MSE, MSE STSA, MSE
        log-STSA).
  • The post-filters depend on second-order statistics
    of the source and noise signals, so an estimation
    scheme has to be developed for them.
  • Results on CMU Database
      • 10 speakers (13 utterances)
      • Diffuse noise
      • SSNR Enhancement:
        $\mathrm{SSNR}_{\text{output}} - E\{\mathrm{SSNR}_{\text{input}}\}$,
        where $E\{\cdot\}$ is the mean over the N input
        channels
      • LAR, LSD, IS, LLR: low values signify high
        speech quality. These measures have been found
        to correlate highly with human perception.
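
The SSNR enhancement measure above can be sketched as follows. Segmental SNR is taken here as the frame-averaged SNR in dB between the clean reference and a processed signal; the frame length and the clamping limits of -10 dB and 35 dB are common conventions assumed for illustration, not values given on the slides, and the function names are hypothetical.

```python
import math


def segmental_snr(clean, processed, frame_len=256,
                  floor_db=-10.0, ceil_db=35.0):
    """Mean per-frame SNR in dB, clamped to [floor_db, ceil_db]."""
    values = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        ref = clean[start:start + frame_len]
        est = processed[start:start + frame_len]
        signal_energy = sum(c * c for c in ref)
        error_energy = sum((c - p) ** 2 for c, p in zip(ref, est))
        if signal_energy == 0.0:
            continue  # skip silent reference frames
        if error_energy == 0.0:
            snr = ceil_db
        else:
            snr = 10.0 * math.log10(signal_energy / error_energy)
        values.append(min(max(snr, floor_db), ceil_db))
    if not values:
        raise ValueError("no usable frames")
    return sum(values) / len(values)


def ssnr_enhancement(clean, input_channels, output, frame_len=256):
    """SSNR of the output minus the mean SSNR of the N input channels."""
    mean_input = sum(segmental_snr(clean, ch, frame_len)
                     for ch in input_channels) / len(input_channels)
    return segmental_snr(clean, output, frame_len) - mean_input
```

A positive value indicates that the enhanced output is, on average, closer to the clean reference than the individual microphone channels are.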

14
Results CMU Database
15
Spectrograms CMU Database
16
Multi-Microphone ASR Experiments
  • Details on Setup of ASR Tasks
      • 700 sentences for training and 300 for testing
      • 12-state, left-right HMM w. Gaussian mixtures
      • All-pair, unweighted grammar
      • MFCC+C0+D+DD (39 coefficients in total)
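
The count of 39 coefficients is consistent with 13 static coefficients (12 MFCCs plus C0) extended by first- and second-order time derivatives. A minimal sketch of the usual regression formula for such derivatives is given below; the regression window of 2 frames and the edge handling by repeating the first and last frame are assumptions, and the function names are hypothetical.

```python
def delta(features, window=2):
    """Regression-based time derivative of a list of feature vectors.

    d[t] = sum_k k * (c[t+k] - c[t-k]) / (2 * sum_k k^2), k = 1..window,
    with frames beyond the edges replaced by the first or last frame.
    """
    num_frames = len(features)
    denom = 2.0 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(num_frames):
        row = []
        for d in range(len(features[t])):
            acc = 0.0
            for k in range(1, window + 1):
                ahead = features[min(t + k, num_frames - 1)][d]
                behind = features[max(t - k, 0)][d]
                acc += k * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out


def add_dynamic_features(static):
    """Append first and second derivatives: 13 static -> 39 coefficients."""
    first = delta(static)
    second = delta(first)
    return [s + d1 + d2 for s, d1, d2 in zip(static, first, second)]
```

For a feature that grows linearly over time, the first derivative away from the edges equals the slope and the second derivative is zero.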

18
Multi-Cue Feature Fusion
  • Goal
      • Fuse heterogeneous information streams optimally
        and adaptively
  • Our approach
      • Explicitly model uncertainty in all feature
        measurements (due to noise or model fitting
        errors)
      • Adjust model training to account for
        uncertainty
      • Dynamically compensate for feature uncertainty
        during decoding
  • Feature uncertainty estimation in the AV-ASR
    case
      • For the audio stream/MFCC: speech enhancement
        process
      • For the visual stream: model fitting variance
  • Properties
      • Adaptation at the frame level
      • Explains and generalizes cue weighting through
        stream exponents
      • Integrates with a wide range of models, e.g. GMM,
        HMM
      • Applicable to both audio-audio and audio-visual
        scenarios
      • Can be combined with asynchronous models, e.g.
        Product-HMM

19
Measurement Noise and Adaptive Fusion
Conventional View: Features are directly observable
Our View: We can only measure noise-corrupted
features
Ref Katsamanis, Papandreou, Pitsikalis, and
Maragos, EUSIPCO06
20
EM-Training with Partially Known Features
  • Even training data can be uncertain

[Figure: diagrams of the Conventional View and Our View, each labeled with Hidden and Observed variables]
Ref Papandreou, Katsamanis, Pitsikalis, and
Maragos, submission to NIPS06
21
EM-Training Results for GMM
E-Step
Similar to conventional update rules
Uncertainty-compensated scores
M-Step
Filtered feature estimate
  • Formulas for HMM are similar

22
Decoding Uncertain Features
  • Variance-Compensated (Soft) Scoring
  • Probabilistic Justification for Stream Exponents

Relative Measurement Error
Adaptation at each frame: stream-, class- and
mixture-dependent stream weights
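
A minimal sketch of variance-compensated scoring for one diagonal-covariance Gaussian follows. It relies on a standard result: if the clean feature x is Gaussian with mean μ and variance σ², and the measured feature is y = x + e with independent zero-mean Gaussian measurement noise of variance σₑ², then y is Gaussian with mean μ and variance σ² + σₑ². The function names are hypothetical and the sketch does not reproduce the exact formulas of the cited papers.

```python
import math


def log_gaussian_diag(y, mean, var):
    """Log-density of a diagonal-covariance Gaussian at y."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (yi - m) ** 2 / v)
               for yi, m, v in zip(y, mean, var))


def variance_compensated_log_score(y, mean, model_var, measurement_var):
    """Score y after inflating the model variance by the feature uncertainty."""
    total_var = [mv + ev for mv, ev in zip(model_var, measurement_var)]
    return log_gaussian_diag(y, mean, total_var)
```

When the measurement variance of a feature is large at a given frame, its score is flattened across classes, so that feature contributes less to the decision at that frame. This is one way to read the link between feature uncertainty and frame-dependent stream weights.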
23
Audio-visual Asynchrony Modeling
Multi-stream HMM
Product HMM
Ref Gravier et al., 2002
24
Fusion: Multi-Cue Audio-Audio
  • Feature Uncertainty for Audio Features
  • Baseline Audio Features: MFCC
      • Enhancement using a GMM of clean speech and
        Vector Taylor Series (VTS) approximation
      • Uncertainty is Gaussian with variance given by
        the enhancement process
      • Used for audio-visual fusion
  • Fractal Audio Features: MFD
      • Ongoing research applying a similar framework
        (GMM, VTS)

25
MFD: From Noisy Speech to Feature Uncertainty
[Figure: MFD curves with legend True Noisy, Noise, Estimated Noisy, Clean; condition: White Noise (0 dB)]
  • Ongoing Research: Noise Compensation for MFD

27
Showcase: Audio-Visual Speech Recognition
  • Both shape and texture can assist lipreading
  • Active Appearance Models (AAM) for face modeling
      • Shape and texture of faces live in
        low-dimensional manifolds
  • Features: AAM fitting (nonlinear least-squares
    problem)
  • Visual feature uncertainty: related to the
    sensitivity of the least-squares solution

28
Demo AAM fitting and uncertainty estimates
  • The visual front-end supplies both features and
    their respective uncertainty.

29
Audio-Visual ASR Database
  • Subset of CUAVE database used
      • 36 speakers (30 training, 6 testing)
      • 5 sequences of 10 connected digits per speaker
      • Training set: 1500 digits (30x5x10)
      • Test set: 300 digits (6x5x10)
  • CUAVE database also contains more complex data
    sets: speaker moving around, speaker shows
    profile, continuous digits, two speakers (to be
    used in future evaluations)
  • CUAVE was kindly provided by Clemson University

30
Evaluation on the CUAVE Database
31
Audio-Visual Speech Classification with MS-HMM
Ref Katsamanis, Papandreou, Pitsikalis, and
Maragos, EUSIPCO06
32
AV Digit Classification Results (Word Accuracy)

SNR (babble) | Audio | Visual | AV MS-HMM | AV MS-HMM Var-Comp | AV P-HMM | AV P-HMM Var-Comp
Clean        | 100   | 68.7   | 95.1      | 97.0               | 95.4     | 99.6
10 dB        | 92.8  | -      | 88.3      | 90.2               | 90.6     | 92.5
5 dB         | 73.9  | -      | 84.5      | 86.8               | 87.2     | 89.1
0 dB         | 54.7  | -      | 79.6      | 81.1               | 83.8     | 82.6

  • Audio: MFCC_D_Z (26 features)
  • Visual: 6 shape + 12 texture AAM coefficients
  • AV MS-HMM: Audio-Visual Multistream HMM, weights
    (1,1)
  • AV MS-HMM, Var-Comp: Audio-Visual Multistream
    HMM + Variance Compensation
  • AV P-HMM: Audio-Visual Product HMM, weights (1,1)
  • AV P-HMM, Var-Comp: Audio-Visual Product HMM +
    Variance Compensation

Ref Pitsikalis, Katsamanis, Papandreou, and
Maragos, ICSLP06
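
As a brief illustration of the stream weights mentioned in the legend, a multistream model typically combines per-stream emission likelihoods as a product of powers, which becomes a weighted sum in the log domain. The sketch below assumes that form; `multistream_log_score` is a hypothetical name.

```python
def multistream_log_score(stream_log_scores, stream_weights):
    """Weighted sum of per-stream log-likelihoods (product of powers)."""
    if len(stream_log_scores) != len(stream_weights):
        raise ValueError("one weight is required per stream")
    return sum(w * s for w, s in zip(stream_weights, stream_log_scores))
```

With weights (1,1), the audio and visual log-likelihoods are simply added, which treats both streams as equally reliable in every frame.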
33
AV-ASR Results with Uncertain Training
Ref Papandreou, Katsamanis, Pitsikalis, and
Maragos, submission to NIPS06
35
Databases: Aurora 4
  • Task: 5000-Word Continuous Speech Recognition
  • WSJ0 (16 / 8 kHz), artificially added noise
  • 2 microphones: Sennheiser, Other
  • Filtering: G712, P341
  • Noises: Car, Babble, Restaurant, Street, Airport,
    Train Station
  • Training (7138 utterances per scenario)
      • Clean: Sennheiser mic.
      • Multi-Condition: Sennheiser and Other mic.,
        75% with artificially added noise at SNR 10-20
        dB
      • Noisy: Sennheiser, artificially added noise,
        SNR 10-20 dB
  • Testing (330 utterances, 166 utterances each;
    8 speakers)
      • SNR 5-15 dB
      • 1-7: Sennheiser microphone
      • 8-14: Other microphone

36
VTLN on the Platform
  • Warping in the front-end
      • Piecewise linear warping function
      • Warping in the filterbank domain by stretching or
        compressing the frequency axis
  • Training: HTK implementation
  • Testing: fast implementation using a GMM
    representing normalized speech to estimate warping
    factors per utterance.
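
A piecewise linear warping function and a per-utterance warping-factor search can be sketched as follows. This is an illustrative form, not the HTK or platform implementation: the breakpoint ratio of 0.85, the function names, and the `score_fn` callable (standing in for the GMM log-likelihood of the utterance features warped by a given factor) are all assumptions.

```python
def piecewise_linear_warp(freq, alpha, f_max, break_ratio=0.85):
    """Warp freq by alpha below a breakpoint, then join linearly to f_max.

    The second segment guarantees that f_max maps to f_max, so the warped
    axis covers the same band for every alpha.
    """
    if not 0.0 <= freq <= f_max:
        raise ValueError("freq must lie in [0, f_max]")
    f_break = break_ratio * f_max / max(alpha, 1.0)
    if freq <= f_break:
        return alpha * freq
    slope = (f_max - alpha * f_break) / (f_max - f_break)
    return alpha * f_break + slope * (freq - f_break)


def select_warp_factor(score_fn, candidates):
    """Pick the candidate warping factor with the highest score."""
    return max(candidates, key=score_fn)
```

In the filterbank domain, the warp would be applied to the filter centre frequencies, and `select_warp_factor` would be called once per utterance over a small grid such as 0.88 to 1.12 (a grid chosen here for illustration).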

37
VTLN on the Platform, Results
38
VTLN Research, TECC Features
  • Teager Energy Cepstrum Coefficients are actually
    energy measurements at the output of a Gammatone
    filterbank, similarly to MFCC
  • VTLN can be applied in a similar manner
  • The bark scale along which the filters are
    uniformly positioned is properly stretched or
    shrunk to achieve warping
  • Evaluation is currently in progress
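
The energy measurement behind TECC is based on the discrete Teager-Kaiser energy operator, Ψ[x](n) = x(n)² - x(n-1)·x(n+1). The sketch below shows only this operator; the Gammatone filterbank, the per-band averaging, the logarithm and the cepstral transform of a full TECC front-end are not reproduced, and `teager_energy` is a hypothetical name.

```python
def teager_energy(signal):
    """Discrete Teager-Kaiser energy for the interior samples of a signal."""
    return [signal[n] ** 2 - signal[n - 1] * signal[n + 1]
            for n in range(1, len(signal) - 1)]
```

For a sinusoid A·cos(Ωn + φ) the operator returns the constant A²·sin²(Ω), so it reflects both the amplitude and the frequency of the oscillation, unlike the squared amplitude alone.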

39
VTLN Research, using Formants
40
Raw Formants-Dynamic Programming
41
Formant Tracking
42
ICCS-NTUA in HIWIRE: 1st, 2nd Year
  • Evaluation
      • Databases Baseline: Completed
      • Platform Release: 1st Version
  • WP1
      • Noise Robust Features: Completed
      • Multi-mic. Array Enhancement: Prelim. Results
      • Fusion: Prelim. Results
      • Audio-Visual ASR: Baseline, Adv. Visual Features
      • VAD: Completed, Integration?
  • WP2
      • VTLN Platform Integration: Completed
      • Speaker Normalization Research: Prelim. Results
      • Non-native Speech Database: Completed

43
Next...
  • Fusion
      • Audio-Audio
      • Audio-Visual
      • Nonlinear Features-Visual
  • Visual Front-end
  • VAD, Nonlinear Features