Title: Comparing Audio and Visual Information for Speech Processing
1. Comparing Audio and Visual Information for Speech Processing
- David Dean, Patrick Lucey, Sridha Sridharan and Tim Wark
- Presented by David Dean
2. Audio-Visual Speech Processing - Overview
- Speech or speaker recognition is traditionally audio-only
  - Mature area of research
- Significant problems in real-world environments (Wark2001)
  - High acoustic noise
  - Variation of speech
- Audio-visual speech processing adds an additional modality to help alleviate these problems
3. Audio-Visual Speech Processing - Overview
- Speech and speaker recognition tasks have many overlapping areas
  - The same configuration can be used for both text-dependent speaker recognition and speaker-dependent speech recognition
    - Train speaker-dependent word (or sub-word) models
    - Speaker recognition chooses amongst speakers for a particular word, or
    - Word recognition chooses amongst words for a particular speaker
4. Audio-Visual Speech Processing - Overview
- Little research has been done into how the two applications (speaker vs. speech recognition) differ in areas other than the set of models chosen for recognition
- One area of interest in this research is the reliance on each modality
  - Acoustic features typically work equally well in either application (Young2002)
  - Little consensus has been reached on the suitability of visual features for each application
5. Experimental Setup
- System overview: lip location and tracking feeds visual feature extraction and the visual speech/speaker models; acoustic feature extraction feeds the acoustic speech/speaker models; decision fusion combines the two model outputs to produce the speech/speaker decision.
6. Lip location and tracking
7. Finding Faces
- Manual Red, Green and Blue skin thresholds were trained for each speaker
- Faces were located by applying these thresholds to the video frames (a sketch follows below)
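A minimal sketch of per-speaker RGB skin thresholding for face localisation. The threshold values and helper names are illustrative assumptions, not the values trained in the original experiments.

```python
import numpy as np

def skin_mask(frame_rgb, r_range, g_range, b_range):
    """Boolean mask of pixels whose R, G and B values fall inside the
    manually trained per-speaker ranges (ranges are assumed values)."""
    r, g, b = frame_rgb[..., 0], frame_rgb[..., 1], frame_rgb[..., 2]
    return ((r_range[0] <= r) & (r <= r_range[1]) &
            (g_range[0] <= g) & (g <= g_range[1]) &
            (b_range[0] <= b) & (b <= b_range[1]))

def face_bounding_box(mask):
    """Bounding box of the skin-coloured region, taken as the face."""
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        return None
    return rows.min(), rows.max(), cols.min(), cols.max()
```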
8. Finding and tracking eyes
- The top half of the face region is searched for eyes
- A shifted version of Cr-Cb thresholding was performed to locate possible eye regions (Butler2003)
- Invalid eye candidate regions were removed, and the most likely pair of candidates chosen as the eyes
- The new eye location is compared to the old one, and ignored if it is too far from the old (a sketch of this check follows below)
- About 40% of sequences had to be manually eye-tracked every 50 frames
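A minimal sketch of the temporal consistency check on eye locations: a newly detected eye position is ignored if it jumps too far from the previous one. The distance threshold is an illustrative assumption.

```python
import numpy as np

MAX_JUMP_PIXELS = 20.0  # assumed threshold, not from the original work

def accept_new_eyes(prev_eyes, new_eyes, max_jump=MAX_JUMP_PIXELS):
    """prev_eyes, new_eyes: arrays of shape (2, 2) holding (x, y) per eye.
    Keep the new locations only if both eyes moved less than max_jump,
    otherwise fall back to the previous locations."""
    jumps = np.linalg.norm(np.asarray(new_eyes) - np.asarray(prev_eyes), axis=1)
    return new_eyes if np.all(jumps < max_jump) else prev_eyes
```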
9. Finding and tracking lips
- The eye locations are used to define a rotation-normalised lip search region (LSR)
- The LSR is converted to the Red/Green colour-space and thresholded
- Unlikely lip candidates are removed
- The rectangular area containing the largest amount of lip-candidate area is taken as the lip ROI (a sketch follows below)
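A minimal sketch of lip ROI selection inside the LSR, under one simple reading of the slide: threshold the red/green ratio, then keep the fixed-size rectangle containing the most lip-candidate pixels. The ratio threshold and ROI dimensions are illustrative assumptions.

```python
import numpy as np

def lip_candidate_mask(lsr_rgb, rg_threshold=1.5):
    """Pixels whose red/green ratio exceeds the (assumed) threshold are lip candidates."""
    r = lsr_rgb[..., 0].astype(float)
    g = lsr_rgb[..., 1].astype(float) + 1e-6
    return (r / g) > rg_threshold

def best_roi(mask, roi_h=32, roi_w=64):
    """Slide a roi_h x roi_w window over the mask (via an integral image) and
    return the top-left corner of the window covering the most candidate pixels."""
    integral = np.pad(mask.astype(int).cumsum(0).cumsum(1), ((1, 0), (1, 0)))
    counts = (integral[roi_h:, roi_w:] - integral[:-roi_h, roi_w:]
              - integral[roi_h:, :-roi_w] + integral[:-roi_h, :-roi_w])
    top, left = np.unravel_index(np.argmax(counts), counts.shape)
    return top, left
```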
10. Feature Extraction and Datasets
- Acoustic: 15 MFCCs + 1 energy coefficient, plus deltas and accelerations = 48 features (see the sketch after this list)
- Visual: 20 eigenlip (PCA) coefficients, plus deltas and accelerations = 60 features
  - Eigenlip-space trained on the entire data set of lip images
- Stationary speech from CUAVE (Patterson2002)
  - 5 sequences for training, 2 for testing (per speaker)
  - Testing was also performed on speech-babble corrupted noisy versions
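A minimal sketch of the acoustic feature layout: 16 static coefficients (treating the 0th cepstral coefficient as the energy term and the remaining 15 as the MFCCs proper, which is an assumption about the exact configuration) with delta and acceleration coefficients appended, giving 16 x 3 = 48 features per frame. librosa is used purely for illustration; the original work used HTK-style feature extraction.

```python
import librosa
import numpy as np

def acoustic_features(wav_path):
    y, sr = librosa.load(wav_path, sr=None)
    static = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=16)   # 16 static coefficients
    delta = librosa.feature.delta(static)                   # first derivatives
    accel = librosa.feature.delta(static, order=2)          # second derivatives
    feats = np.vstack([static, delta, accel])               # shape: (48, n_frames)
    return feats.T                                          # one 48-dim vector per frame
```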
11Training
- Phone transcriptions obtained from earlier
research (Lucey 2004) were used to train speaker
independent HMM phone models in both audio and
visual domains - Speaker dependent models adapted using MLLR
adaption from speaker independent models - HMM Toolkit (HTK) was used (Young 2002)
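A minimal sketch of what MLLR adaptation does: the Gaussian means of the speaker-independent HMMs are moved towards a target speaker by a shared affine transform, mu_adapted = A mu + b. In the experiments this was carried out with HTK; the code below only illustrates applying an already estimated (A, b) to every mean vector, with estimation itself not shown.

```python
import numpy as np

def apply_mllr_mean_transform(means, A, b):
    """means: (n_gaussians, dim) speaker-independent Gaussian means.
    A: (dim, dim) transform, b: (dim,) bias, assumed to have been estimated
    from the target speaker's adaptation data."""
    return means @ A.T + b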
12. Comparing acoustic and visual information for speech processing
- Investigated using the identification rates of speaker-dependent acoustic and visual phoneme models
- Test segments were freely transcribed using all speaker-dependent phoneme models
  - No restriction to a specified user or word
- Confusion tables for speech (phoneme) and speaker recognition were examined to obtain identification rates (a sketch follows below)
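A minimal sketch of reading an identification rate off a confusion table: correct identifications sit on the diagonal, so the rate is the diagonal sum divided by the total number of test segments.

```python
import numpy as np

def identification_rate(confusion):
    """confusion[i, j] = number of segments of true class i recognised as
    class j (classes are phonemes or speakers, depending on the task)."""
    confusion = np.asarray(confusion, dtype=float)
    return np.trace(confusion) / confusion.sum()
```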
13. Example Confusion Table (Phonemes in Clean Acoustic Speech)
- Confusion table figure: actual phonemes vs. recognised phonemes.
14. Example Confusion Table (Phonemes in Clean Visual Speech)
- Confusion table figure: actual phonemes vs. recognised phonemes.
15. Likelihood of speaker and phone identification using phoneme models
16. Fusion
- Because each modality performs differently at speech and speaker recognition, the fusion configuration for each task must be adjusted with these performances in mind
- For these experiments:
  - Weighted-sum fusion of the top 10 normalised scores in each modality
  - The fusion weight ranges from 0 (video only) to 1 (audio only); a sketch follows this list
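A minimal sketch of the weighted-sum fusion described above: the top-10 scores from each modality are normalised, then combined per candidate with a single weight (0 = video only, 1 = audio only). The min-max normalisation is an illustrative assumption about the normalisation step.

```python
import numpy as np

def normalise(scores):
    """Map a dict of candidate -> score onto [0, 1] (min-max, assumed scheme)."""
    vals = np.array(list(scores.values()))
    lo, hi = vals.min(), vals.max()
    return {c: (s - lo) / (hi - lo + 1e-12) for c, s in scores.items()}

def fuse(audio_scores, video_scores, alpha, top_n=10):
    """Keep the top_n candidates per modality, normalise, then combine the
    candidates that appear in either list with a weighted sum."""
    def top(d):
        return dict(sorted(d.items(), key=lambda kv: kv[1], reverse=True)[:top_n])
    a, v = normalise(top(audio_scores)), normalise(top(video_scores))
    candidates = set(a) | set(v)
    return {c: alpha * a.get(c, 0.0) + (1 - alpha) * v.get(c, 0.0) for c in candidates}
```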
17. Speech vs Speaker
- The response of each system to speech-babble noise over a selected range of fusion-weight values was compared.
- Figure: word identification results.
18. Speech vs Speaker
- The response of each system to speech-babble noise over a selected range of fusion-weight values was compared.
- Figure: speaker identification results.
19. Speech vs Speaker
- Acoustic performance is basically equal for both tasks
- Visual performance is clearly better for speaker recognition
- Speech recognition fusion is catastrophic at nearly all noise levels
- Speaker recognition fusion is only catastrophic at high noise levels
- We can also get an idea of the dominance of each modality by looking at the fusion-weight values that produce the best results at each noise level (ideal adaptive fusion); a sketch follows below
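A minimal sketch of finding the "ideal adaptive fusion" weight: for each noise level, sweep the fusion weight over a grid and keep the value giving the highest identification rate. The evaluate() callable is a hypothetical stand-in for running the full recogniser at a given weight and noise level.

```python
import numpy as np

def best_alpha(evaluate, noise_level, grid=np.linspace(0.0, 1.0, 21)):
    """evaluate(alpha, noise_level) -> identification rate in [0, 1].
    Returns the best weight on the grid and the rate it achieves."""
    rates = [evaluate(alpha, noise_level) for alpha in grid]
    return grid[int(np.argmax(rates))], max(rates)
```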
20. Best Fusion
21. Conclusion and Further Work
- PCA-based visual features are mostly person-dependent
  - They should be used with care in visual speech recognition tasks
  - It is believed that this dependency stems from the large amount of static person-specific information captured along with the dynamic lip configuration
    - Skin colour, facial hair, etc.
- Visual information for speech recognition is only useful in high-noise situations
22. Conclusion and Further Work
- Even at very low levels of acoustic noise, visual speech information can provide similar performance to acoustic information for speaker recognition
  - Adaptive fusion for speaker recognition should therefore be biased towards visual features for best performance
- Further study is needed into methods of improving the visual modality for speech recognition by focusing more on the dynamic speech-related information
  - Mean-image removal, optical flow, contour representations
23. References
- (Butler2003) D. Butler, C. McCool, M. McKay, S. Lowther, V. Chandran, and S. Sridharan, "Robust Face Localisation Using Motion, Colour and Fusion," presented at the Seventh International Conference on Digital Image Computing: Techniques and Applications (DICTA 2003), Macquarie University, Sydney, Australia, 2003.
- (Lucey2004) P. Lucey, T. Martin, and S. Sridharan, "Confusability of Phonemes Grouped According to their Viseme Classes in Noisy Environments," presented at SST 2004, Sydney, Australia, 2004.
- (Patterson2002) E. Patterson, S. Gurbuz, Z. Tufekci, and J. N. Gowdy, "CUAVE: a new audio-visual database for multimodal human-computer interface research," presented at the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '02), 2002.
- (Young2002) S. Young, G. Evermann, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book, 3.2 ed. Cambridge, UK: Cambridge University Engineering Department, 2002.
- (Wark2001) T. Wark and S. Sridharan, "Adaptive fusion of speech and lip information for robust speaker identification," Digital Signal Processing, vol. 11, pp. 169-186, 2001.
24. Questions?