Comparing Audio and Visual Information for Speech Processing

1
Comparing Audio and Visual Information for Speech
Processing
  • David Dean, Patrick Lucey, Sridha Sridharan
    and Tim Wark
  • Presented by David Dean

2
Audio-Visual Speech Processing - Overview
  • Speech and speaker recognition have traditionally
    been audio-only
  • Mature area of research
  • Significant problems in real-world environments
    (Wark2001)
  • High acoustic noise
  • Variation of speech
  • Audio-visual speech processing adds an additional
    modality to help alleviate these problems

3
Audio-Visual Speech Processing - Overview
  • Speech and speaker recognition tasks have many
    overlapping areas
  • The same configuration can be used for both
    text-dependent speaker recognition and
    speaker-dependent speech recognition
  • Train speaker-dependent word (or sub-word) models
  • Speaker recognition chooses amongst speakers for
    a particular word, or
  • Word recognition chooses amongst words for a
    particular speaker (see the sketch below)
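
A minimal sketch of this shared configuration, in Python; the score function and the dummy models are hypothetical stand-ins for real HMM log-likelihoods:

    import numpy as np

    def score(model, features):
        # Placeholder log-likelihood; a real system would evaluate an HMM here.
        return float(np.sum(features @ model))

    # models[speaker][word] -> model parameters (random stand-ins for illustration)
    rng = np.random.default_rng(0)
    models = {spk: {w: rng.normal(size=(12, 1)) for w in ("zero", "one")}
              for spk in ("alice", "bob")}

    def recognise_speaker(models, features, word):
        # Text-dependent speaker recognition: fix the word, choose amongst speakers.
        return max(models, key=lambda spk: score(models[spk][word], features))

    def recognise_word(models, features, speaker):
        # Speaker-dependent speech recognition: fix the speaker, choose amongst words.
        return max(models[speaker], key=lambda w: score(models[speaker][w], features))

    feats = rng.normal(size=(50, 12))   # 50 frames of 12-dimensional features
    print(recognise_speaker(models, feats, "zero"))
    print(recognise_word(models, feats, "alice"))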

4
Audio-Visual Speech Processing - Overview
  • Little research has been done into how the two
    applications (speaker vs. speech) differ in areas
    other than the set of models chosen for
    recognition
  • One area of interest in this research is the
    reliance on each modality
  • Acoustic features typically work equally well in
    either application (Young2002)
  • Little consensus has been reached on the
    suitability of visual features for each
    application

5
Experimental Setup
[Block diagram: acoustic and visual feature extraction (the visual stream preceded by lip location and tracking) feed acoustic and visual speech/speaker models; decision fusion combines the two streams into the speech/speaker decision]
6
Lip location and tracking
7
Finding Faces
  • Red, green and blue skin thresholds were manually
    trained for each speaker
  • Faces were located by applying these thresholds
    to the video frames (see the sketch below)
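
A rough sketch of this kind of RGB skin thresholding; the per-channel bounds are illustrative defaults, not the per-speaker values trained in the paper:

    import numpy as np

    def skin_mask(frame_rgb, r_rng=(95, 255), g_rng=(40, 200), b_rng=(20, 180)):
        # Binary skin mask from per-channel thresholds (illustrative bounds only).
        r, g, b = frame_rgb[..., 0], frame_rgb[..., 1], frame_rgb[..., 2]
        return ((r >= r_rng[0]) & (r <= r_rng[1]) &
                (g >= g_rng[0]) & (g <= g_rng[1]) &
                (b >= b_rng[0]) & (b <= b_rng[1]))

    def face_bounding_box(mask):
        # Bounding box of the skin-coloured pixels, used as a crude face estimate.
        ys, xs = np.nonzero(mask)
        if xs.size == 0:
            return None
        return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

    frame = np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8)  # stand-in video frame
    print(face_bounding_box(skin_mask(frame)))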

8
Finding and tracking eyes
  • Top half of face region is searched for eyes
  • A shifted version of Cr-Cb thresholding was
    performed to locate possible eye regions
    (Butler2003)
  • Invalid eye candidate regions were removed, and
    the most likely pair of candidates chosen as the
    eyes
  • New eye locations were compared to the previous
    ones and ignored if they had moved too far
  • About 40% of the sequences had to be manually
    eye-tracked every 50 frames
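
A hedged sketch of this step using OpenCV; the exact shifted Cr-Cb transform follows Butler2003, so the plain Cr minus Cb difference and the threshold below are only stand-ins:

    import cv2
    import numpy as np

    def eye_candidates(face_bgr, thresh=20):
        # Search only the top half of the face region for eyes.
        top = face_bgr[: face_bgr.shape[0] // 2]
        ycrcb = cv2.cvtColor(top, cv2.COLOR_BGR2YCrCb).astype(np.int16)
        diff = ycrcb[..., 1] - ycrcb[..., 2]          # Cr - Cb (stand-in for the shifted transform)
        mask = (diff < thresh).astype(np.uint8)       # eye pixels fall outside the skin chrominance range
        n, labels, stats, centroids = cv2.connectedComponentsWithStats(mask)
        # Ignore the background label and keep the two largest blobs as the eye pair.
        order = np.argsort(stats[1:, cv2.CC_STAT_AREA])[::-1] + 1
        return [tuple(centroids[i]) for i in order[:2]]

    # eyes = eye_candidates(face_crop)  # candidates would then be checked against the previous frame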

9
Finding and tracking lips
  • Eye locations are used to define
    rotation-normalised lip search region (LSR)
  • LSR converted to Red/Green colour-space and
    thresholded
  • Unlikely lip-candidates are removed
  • The rectangular area containing the most
    lip-candidate area is taken as the lip ROI
    (see the sketch below)
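
A hedged sketch of the lip step using OpenCV; the red/green ratio threshold is illustrative, and the bounding box of the largest candidate blob stands in for the paper's best-covering rectangle:

    import cv2
    import numpy as np

    def lip_roi(lsr_bgr, rg_thresh=1.5):
        # Threshold the red/green ratio (lips are redder than the surrounding skin).
        b, g, r = cv2.split(lsr_bgr.astype(np.float32) + 1.0)   # +1 avoids division by zero
        mask = ((r / g) > rg_thresh).astype(np.uint8)
        mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8))  # remove speckle
        n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
        if n < 2:
            return None
        biggest = 1 + int(np.argmax(stats[1:, cv2.CC_STAT_AREA]))
        x, y, w, h = stats[biggest, :4]
        return int(x), int(y), int(w), int(h)   # lip ROI within the lip search region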

10
Feature Extraction and Datasets
  • MFCC: 15 coefficients + 1 energy, with deltas and
    accelerations = 48 features
  • PCA: 20 eigenlip coefficients, with deltas and
    accelerations = 60 features
  • Eigenlip-space trained on entire data set of lip
    images
  • Stationary speech from CUAVE (Patterson2002)
  • 5 sequences for training, 2 for testing (per
    speaker)
  • Testing was also performed on speech-babble
    corrupted noisy versions
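
A sketch of comparable feature extraction, assuming librosa and scikit-learn (the paper's acoustic features were HTK-style MFCCs, and the file and array names below are hypothetical):

    import numpy as np
    import librosa
    from sklearn.decomposition import PCA

    # Acoustic: 15 MFCCs + energy, plus deltas and accelerations -> 48 dimensions.
    y, sr = librosa.load("utterance.wav", sr=16000)       # hypothetical audio file
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=16)    # 16 static coefficients (C0 as an energy proxy)
    acoustic = np.vstack([mfcc,
                          librosa.feature.delta(mfcc),
                          librosa.feature.delta(mfcc, order=2)])      # (48, n_frames)

    # Visual: 20 eigenlip coefficients, plus deltas and accelerations -> 60 dimensions.
    lips = np.load("lip_images.npy")                      # hypothetical (n_frames, h*w) array
    pca = PCA(n_components=20).fit(lips)                  # eigenlip space from all lip images
    eig = pca.transform(lips).T                           # (20, n_frames)
    visual = np.vstack([eig,
                        librosa.feature.delta(eig),
                        librosa.feature.delta(eig, order=2)])         # (60, n_frames)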

11
Training
  • Phone transcriptions obtained from earlier
    research (Lucey2004) were used to train
    speaker-independent HMM phone models in both the
    audio and visual domains
  • Speaker-dependent models were adapted from the
    speaker-independent models using MLLR adaptation
  • The HMM Toolkit (HTK) was used (Young2002)
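
The paper's models were trained and adapted with HTK; as a rough stand-in (not HTK, and without MLLR adaptation), a single phone model could be fitted with hmmlearn like this, assuming per-phone lists of feature segments:

    import numpy as np
    from hmmlearn import hmm

    def train_phone_hmm(segments, n_states=3):
        # Fit one Gaussian HMM per phone from a list of (n_frames, n_dims) feature segments.
        X = np.vstack(segments)
        lengths = [len(s) for s in segments]
        model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
        model.fit(X, lengths)
        return model

    # Hypothetical usage, with phone_segments mapping phone labels to feature segments:
    # phone_models = {ph: train_phone_hmm(segs) for ph, segs in phone_segments.items()}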

12
Comparing acoustic and visual information for
speech processing
  • Investigated using the identification rates of
    speaker-dependent acoustic and visual phoneme
    models
  • Test segments were freely transcribed using all
    speaker-dependent phoneme models
  • No restriction to a specified speaker or word
  • Confusion tables for speech (phoneme) and speaker
    recognition were examined to obtain identification
    rates (see the sketch below)
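
A minimal sketch of reading an identification rate off such a confusion table (the toy counts are invented for illustration):

    import numpy as np

    def identification_rate(confusion):
        # Rows are actual classes, columns are recognised classes;
        # the rate is the fraction of the total mass on the diagonal.
        confusion = np.asarray(confusion, dtype=float)
        return np.trace(confusion) / confusion.sum()

    # Toy 3-phoneme confusion table.
    print(identification_rate([[8, 1, 1],
                               [2, 7, 1],
                               [0, 3, 7]]))   # -> 22/30, i.e. about 0.73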

13
Example Confusion Table (Phonemes in Clean Acoustic Speech)
[Confusion table of recognised phonemes against actual phonemes]
14
Example Confusion Table (Phonemes in Clean Visual Speech)
[Confusion table of recognised phonemes against actual phonemes]
15
Likelihood of speaker and phone identification
using phoneme models
16
Fusion
  • Because of the differing performance of each
    modality at speech and speaker recognition, the
    fusion configuration for each task must be
    adjusted with these performances in mind
  • For these experiments:
  • Weighted-sum fusion of the top 10 normalised
    scores in each modality was used
  • The audio weight α ranges from 0 (video only) to
    1 (audio only)
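
A minimal sketch of this weighted-sum fusion; min-max normalisation is assumed here, and the score values are invented for illustration:

    import numpy as np

    def fuse(audio_scores, video_scores, alpha):
        # alpha = 1.0 uses audio only, alpha = 0.0 uses video only.
        a = np.asarray(audio_scores, dtype=float)
        v = np.asarray(video_scores, dtype=float)
        a = (a - a.min()) / (a.max() - a.min())    # min-max normalisation (one plausible choice)
        v = (v - v.min()) / (v.max() - v.min())
        return int(np.argmax(alpha * a + (1.0 - alpha) * v))

    # Hypothetical top-N log-likelihoods for the same candidate list in each modality.
    print(fuse([-310.2, -305.8, -320.4], [-12.1, -11.7, -13.0], alpha=0.7))  # index of the fused best candidate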

17
Speech vs Speaker
  • The response of each system to speech-babble
    noise was compared over a selected range of α
    values

[Chart: Word Identification]
18
Speech vs Speaker
  • The response of each system to speech-babble
    noise was compared over a selected range of α
    values

[Chart: Speaker Identification]
19
Speech vs Speaker
  • Acoustic performance is basically equal for both
    tasks
  • Visual performance is clearly better for speaker
    recognition
  • Speech recognition fusion is catastrophic at
    nearly all noise levels
  • Speaker recognition is only catastrophic at high
    noise levels
  • We can also get an idea of the dominance of each
    modality by looking at the values of α that
    produce the best curves (ideal adaptive fusion);
    a sketch of this sweep follows below
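
A hypothetical sketch of such a sweep, assuming lists of already-normalised score vectors, a shared candidate label list, and ground-truth labels:

    import numpy as np

    def best_alpha(audio_scores, video_scores, labels, truth,
                   alphas=np.linspace(0.0, 1.0, 21)):
        # Return the audio weight that maximises identification accuracy at one
        # noise level, i.e. one point on the ideal adaptive fusion curve.
        def accuracy(alpha):
            correct = 0
            for a, v, t in zip(audio_scores, video_scores, truth):
                fused = alpha * np.asarray(a) + (1.0 - alpha) * np.asarray(v)
                correct += int(labels[int(np.argmax(fused))] == t)
            return correct / len(truth)
        return max(alphas, key=accuracy)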

20
Best Fusion
21
Conclusion and Further Work
  • PCA-based visual features are mostly
    person-dependent
  • Should be used with care in visual speech
    recognition tasks
  • It is believed that this dependency stems from
    the large amount of static person-specific
    information captured along with the dynamic lip
    configuration
  • Skin colour, facial hair, etc.
  • Visual information for speech recognition is only
    useful in high noise situations

22
Conclusion and Further Work
  • Even at very low levels of acoustic noise, visual
    speech information can provide similar
    performance to acoustic information for speaker
    recognition
  • Adaptive fusion for speaker recognition should
    therefore be biased towards visual features for
    best performance
  • Further study is needed into methods of improving
    the visual modality for speech recognition by
    focusing more on the dynamic speech-related
    information
  • Mean-image removal, optical flow, contour
    representations

23
References
  • (Butler2003) D. Butler, C. McCool, M. McKay, S.
    Lowther, V. Chandran, and S. Sridharan, "Robust
    Face Localisation Using Motion, Colour and
    Fusion," presented at the Seventh International
    Conference on Digital Image Computing: Techniques
    and Applications (DICTA 2003), Macquarie
    University, Sydney, Australia, 2003.
  • (Lucey2004) P. Lucey, T. Martin, and S.
    Sridharan, "Confusability of Phonemes Grouped
    According to their Viseme Classes in Noisy
    Environments," presented at SST 2004, Sydney,
    Australia, 2004.
  • (Patterson2002) E. Patterson, S. Gurbuz, Z.
    Tufekci, and J. N. Gowdy, "CUAVE: A new
    audio-visual database for multimodal
    human-computer interface research," presented at
    the IEEE International Conference on Acoustics,
    Speech, and Signal Processing (ICASSP '02), 2002.
  • (Young2002) S. Young, G. Evermann, D. Kershaw, G.
    Moore, J. Odell, D. Ollason, D. Povey, V.
    Valtchev, and P. Woodland, The HTK Book, version
    3.2. Cambridge, UK: Cambridge University
    Engineering Department, 2002.
  • (Wark2001) T. Wark and S. Sridharan, "Adaptive
    fusion of speech and lip information for robust
    speaker identification," Digital Signal
    Processing, vol. 11, pp. 169-186, 2001.

24
Questions?