Title: Comparing Audio and Visual Information for Speech Processing
1. Comparing Audio and Visual Information for Speech Processing
- David Dean, Patrick Lucey, Sridha Sridharan and Tim Wark
- Presented by David Dean
2. Audio-Visual Speech Processing - Overview
- Speech or speaker recognition is traditionally audio-only
  - Mature area of research
- Significant problems in real-world environments (Wark2001)
  - High acoustic noise
  - Variation of speech
- Audio-visual speech processing adds an additional modality to help alleviate these problems
3. Audio-Visual Speech Processing - Overview
- Speech and speaker recognition tasks have many overlapping areas
  - The same configuration can be used for both text-dependent speaker recognition and speaker-dependent speech recognition
    - Train speaker-dependent word (or sub-word) models
    - Speaker recognition chooses amongst speakers for a particular word, or
    - Word recognition chooses amongst words for a particular speaker
4. Audio-Visual Speech Processing - Overview
- Little research has been done into how the two applications (speaker vs. speech recognition) differ in areas other than the set of models chosen for recognition
- One area of interest in this research is the reliance on each modality
  - Acoustic features typically work equally well in either application (Young2002)
  - Little consensus has been reached on the suitability of visual features for each application
5. Experimental Setup
- System overview: lip location and tracking feeds visual feature extraction and the visual speech/speaker models; acoustic feature extraction feeds the acoustic speech/speaker models; decision fusion combines the two model outputs to produce the speech/speaker decision.
6. Lip location and tracking
7. Finding Faces
- Manual Red, Green and Blue skin thresholds were trained for each speaker
- Faces were located by applying these thresholds to the video frames (a sketch follows below)
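A minimal sketch of per-speaker RGB skin thresholding for face localisation. The threshold values and helper names are illustrative assumptions, not the values trained in the original experiments.

```python
import numpy as np

def skin_mask(frame_rgb, r_range, g_range, b_range):
    """Boolean mask of pixels whose R, G and B values fall inside the
    manually trained per-speaker ranges (ranges are assumed values)."""
    r, g, b = frame_rgb[..., 0], frame_rgb[..., 1], frame_rgb[..., 2]
    return ((r_range[0] <= r) & (r <= r_range[1]) &
            (g_range[0] <= g) & (g <= g_range[1]) &
            (b_range[0] <= b) & (b <= b_range[1]))

def face_bounding_box(mask):
    """Bounding box of the skin-coloured region, taken as the face."""
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        return None
    return rows.min(), rows.max(), cols.min(), cols.max()
```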
8. Finding and tracking eyes
- The top half of the face region is searched for eyes
- A shifted version of Cr-Cb thresholding was performed to locate possible eye regions (Butler2003)
- Invalid eye candidate regions were removed, and the most likely pair of candidates chosen as the eyes
- The new eye location is compared to the old one, and ignored if it is too far from the old (a sketch of this check follows below)
- About 40% of sequences had to be manually eye-tracked every 50 frames
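A minimal sketch of the temporal consistency check on eye locations: a newly detected eye position is ignored if it jumps too far from the previous one. The distance threshold is an illustrative assumption.

```python
import numpy as np

MAX_JUMP_PIXELS = 20.0  # assumed threshold, not from the original work

def accept_new_eyes(prev_eyes, new_eyes, max_jump=MAX_JUMP_PIXELS):
    """prev_eyes, new_eyes: arrays of shape (2, 2) holding (x, y) per eye.
    Keep the new locations only if both eyes moved less than max_jump,
    otherwise fall back to the previous locations."""
    jumps = np.linalg.norm(np.asarray(new_eyes) - np.asarray(prev_eyes), axis=1)
    return new_eyes if np.all(jumps < max_jump) else prev_eyes
```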
9. Finding and tracking lips
- The eye locations are used to define a rotation-normalised lip search region (LSR)
- The LSR is converted to the Red/Green colour-space and thresholded
- Unlikely lip candidates are removed
- The rectangular area containing the largest amount of lip-candidate area is taken as the lip ROI (a sketch follows below)
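A minimal sketch of lip ROI selection inside the LSR, under one simple reading of the slide: threshold the red/green ratio, then keep the fixed-size rectangle containing the most lip-candidate pixels. The ratio threshold and ROI dimensions are illustrative assumptions.

```python
import numpy as np

def lip_candidate_mask(lsr_rgb, rg_threshold=1.5):
    """Pixels whose red/green ratio exceeds the (assumed) threshold are lip candidates."""
    r = lsr_rgb[..., 0].astype(float)
    g = lsr_rgb[..., 1].astype(float) + 1e-6
    return (r / g) > rg_threshold

def best_roi(mask, roi_h=32, roi_w=64):
    """Slide a roi_h x roi_w window over the mask (via an integral image) and
    return the top-left corner of the window covering the most candidate pixels."""
    integral = np.pad(mask.astype(int).cumsum(0).cumsum(1), ((1, 0), (1, 0)))
    counts = (integral[roi_h:, roi_w:] - integral[:-roi_h, roi_w:]
              - integral[roi_h:, :-roi_w] + integral[:-roi_h, :-roi_w])
    top, left = np.unravel_index(np.argmax(counts), counts.shape)
    return top, left
```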
10. Feature Extraction and Datasets
- Acoustic: 15 MFCCs + 1 energy coefficient, plus deltas and accelerations = 48 features (see the sketch after this list)
- Visual: 20 eigenlip (PCA) coefficients, plus deltas and accelerations = 60 features
  - Eigenlip-space trained on the entire data set of lip images
- Stationary speech from CUAVE (Patterson2002)
  - 5 sequences for training, 2 for testing (per speaker)
  - Testing was also performed on speech-babble corrupted noisy versions
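A minimal sketch of the acoustic feature layout: 16 static coefficients (treating the 0th cepstral coefficient as the energy term and the remaining 15 as the MFCCs proper, which is an assumption about the exact configuration) with delta and acceleration coefficients appended, giving 16 x 3 = 48 features per frame. librosa is used purely for illustration; the original work used HTK-style feature extraction.

```python
import librosa
import numpy as np

def acoustic_features(wav_path):
    y, sr = librosa.load(wav_path, sr=None)
    static = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=16)   # 16 static coefficients
    delta = librosa.feature.delta(static)                   # first derivatives
    accel = librosa.feature.delta(static, order=2)          # second derivatives
    feats = np.vstack([static, delta, accel])               # shape: (48, n_frames)
    return feats.T                                          # one 48-dim vector per frame
```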
11Training
- Phone transcriptions obtained from earlier
research (Lucey 2004) were used to train speaker
independent HMM phone models in both audio and
visual domains - Speaker dependent models adapted using MLLR
adaption from speaker independent models - HMM Toolkit (HTK) was used (Young 2002)
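A minimal sketch of what MLLR adaptation does: the Gaussian means of the speaker-independent HMMs are moved towards a target speaker by a shared affine transform, mu_adapted = A mu + b. In the experiments this was carried out with HTK; the code below only illustrates applying an already estimated (A, b) to every mean vector, with estimation itself not shown.

```python
import numpy as np

def apply_mllr_mean_transform(means, A, b):
    """means: (n_gaussians, dim) speaker-independent Gaussian means.
    A: (dim, dim) transform, b: (dim,) bias, assumed to have been estimated
    from the target speaker's adaptation data."""
    return means @ A.T + b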
12. Comparing acoustic and visual information for speech processing
- Investigated using the identification rates of speaker-dependent acoustic and visual phoneme models
- Test segments were freely transcribed using all speaker-dependent phoneme models
  - No restriction to a specified user or word
- Confusion tables for speech (phoneme) and speaker recognition were examined to obtain identification rates (a sketch follows below)
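A minimal sketch of reading an identification rate off a confusion table: correct identifications sit on the diagonal, so the rate is the diagonal sum divided by the total number of test segments.

```python
import numpy as np

def identification_rate(confusion):
    """confusion[i, j] = number of segments of true class i recognised as
    class j (classes are phonemes or speakers, depending on the task)."""
    confusion = np.asarray(confusion, dtype=float)
    return np.trace(confusion) / confusion.sum()
```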
13. Example Confusion Table (Phonemes in Clean Acoustic Speech)
- Confusion table figure: actual phonemes vs. recognised phonemes.
14. Example Confusion Table (Phonemes in Clean Visual Speech)
- Confusion table figure: actual phonemes vs. recognised phonemes.
15. Likelihood of speaker and phone identification using phoneme models
16. Fusion
- Because each modality performs differently at speech and speaker recognition, the fusion configuration for each task must be adjusted with these performances in mind
- For these experiments:
  - Weighted-sum fusion of the top 10 normalised scores in each modality
  - The fusion weight ranges from 0 (video only) to 1 (audio only); a sketch follows this list
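A minimal sketch of the weighted-sum fusion described above: the top-10 scores from each modality are normalised, then combined per candidate with a single weight (0 = video only, 1 = audio only). The min-max normalisation is an illustrative assumption about the normalisation step.

```python
import numpy as np

def normalise(scores):
    """Map a dict of candidate -> score onto [0, 1] (min-max, assumed scheme)."""
    vals = np.array(list(scores.values()))
    lo, hi = vals.min(), vals.max()
    return {c: (s - lo) / (hi - lo + 1e-12) for c, s in scores.items()}

def fuse(audio_scores, video_scores, alpha, top_n=10):
    """Keep the top_n candidates per modality, normalise, then combine the
    candidates that appear in either list with a weighted sum."""
    def top(d):
        return dict(sorted(d.items(), key=lambda kv: kv[1], reverse=True)[:top_n])
    a, v = normalise(top(audio_scores)), normalise(top(video_scores))
    candidates = set(a) | set(v)
    return {c: alpha * a.get(c, 0.0) + (1 - alpha) * v.get(c, 0.0) for c in candidates}
```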
17. Speech vs Speaker
- The response of each system to speech-babble noise over a selected range of fusion-weight values was compared.
- Figure: word identification results.
18. Speech vs Speaker
- The response of each system to speech-babble noise over a selected range of fusion-weight values was compared.
- Figure: speaker identification results.
19. Speech vs Speaker
- Acoustic performance is basically equal for both tasks
- Visual performance is clearly better for speaker recognition
- Speech recognition fusion is catastrophic at nearly all noise levels
- Speaker recognition fusion is only catastrophic at high noise levels
- We can also get an idea of the dominance of each modality by looking at the fusion-weight values that produce the best results at each noise level (ideal adaptive fusion); a sketch follows below
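A minimal sketch of finding the "ideal adaptive fusion" weight: for each noise level, sweep the fusion weight over a grid and keep the value giving the highest identification rate. The evaluate() callable is a hypothetical stand-in for running the full recogniser at a given weight and noise level.

```python
import numpy as np

def best_alpha(evaluate, noise_level, grid=np.linspace(0.0, 1.0, 21)):
    """evaluate(alpha, noise_level) -> identification rate in [0, 1].
    Returns the best weight on the grid and the rate it achieves."""
    rates = [evaluate(alpha, noise_level) for alpha in grid]
    return grid[int(np.argmax(rates))], max(rates)
```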
20. Best Fusion
21. Conclusion and Further Work
- PCA-based visual features are mostly person-dependent
  - They should be used with care in visual speech recognition tasks
  - It is believed that this dependency stems from the large amount of static person-specific information captured along with the dynamic lip configuration
    - Skin colour, facial hair, etc.
- Visual information for speech recognition is only useful in high-noise situations
22. Conclusion and Further Work
- Even at very low levels of acoustic noise, visual speech information can provide similar performance to acoustic information for speaker recognition
  - Adaptive fusion for speaker recognition should therefore be biased towards visual features for best performance
- Further study is needed into methods of improving the visual modality for speech recognition by focusing more on the dynamic speech-related information
  - Mean-image removal, optical flow, contour representations
23. References
- (Butler2003) D. Butler, C. McCool, M. McKay, S. Lowther, V. Chandran, and S. Sridharan, "Robust Face Localisation Using Motion, Colour and Fusion," presented at the Seventh International Conference on Digital Image Computing: Techniques and Applications (DICTA 2003), Macquarie University, Sydney, Australia, 2003.
- (Lucey2004) P. Lucey, T. Martin, and S. Sridharan, "Confusability of Phonemes Grouped According to their Viseme Classes in Noisy Environments," presented at SST 2004, Sydney, Australia, 2004.
- (Patterson2002) E. Patterson, S. Gurbuz, Z. Tufekci, and J. N. Gowdy, "CUAVE: a new audio-visual database for multimodal human-computer interface research," presented at the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '02), 2002.
- (Young2002) S. Young, G. Evermann, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book, 3.2 ed. Cambridge, UK: Cambridge University Engineering Department, 2002.
- (Wark2001) T. Wark and S. Sridharan, "Adaptive fusion of speech and lip information for robust speaker identification," Digital Signal Processing, vol. 11, pp. 169-186, 2001.
24. Questions?