An Automatic Lip-reading Method Based on Polynomial Fitting

Transcript and Presenter's Notes

1
An Automatic Lip-reading Method Based on
Polynomial Fitting
Meng LI
Supervisor: Dr. Yiu-ming CHEUNG
Department of Computer Science, Hong Kong Baptist University
2
Content
Introduction
Lip segmentation
Visual speech recognition
Experiment
Conclusion and future work
3
Introduction
Speech perception is multimodal: it involves
information from at least two sensory modalities.
4
Introduction
Silent Environment (bar chart, scale 0 to 100)
Visual Only: 73
Audio Only: 91
Visual-Audio: 97
Noisy Environment (bar chart, scale 0 to 100)
Visual Only: 73
Audio Only: 47
Visual-Audio: 87
5
Introduction
The hottest research direction in lip-reading is
visual speech recognition (with audio
information, or visual only).
Chart categories: Speech recognition in noisy environment, Visual-only speech recognition, Identification, Others
Chart values (in order of appearance): 63, 31, 1, 5
6
Introduction
The basic structure of a typical AVSR (Automatic
Visual-Speech Recognition) system
Preprocessing
Feature Extraction
AV Fusion
7
Introduction
Pixel Based: Use all pixels in the lip region as the feature.
Motion Based: Capture the motion of all or part of the lip
during pronunciation.
Shape Based: Extract the boundary of the lip as the feature.
Model Based: Assume a lip model, match the model to the lip
shape, and use a few parameters to represent the
shape of the lip.
8
Introduction
9
Introduction
Disadvantage
  • Sensitive to the illumination condition.
  • Sensitive to rotation and scale transforms.
  • Speaker dependent.
  • High dimension of feature data.
Advantage
  • All information is utilized.
  • Highest recognition rate under ideal illumination
    conditions.

10
Introduction
11
Introduction
Disadvantage
  • Sensitive to the illumination condition.
  • Sensitive to rotation and scale transforms.
  • Speaker dependent.
  • High dimension of feature data.
Advantage
  • Represents the motion of the lip directly and
    completely.

12
Introduction
13
Introduction
Advantage
  • Low dimension of feature data.
  • Robust to rotation and scale transformations.
  • If the model is appropriate, speaker independence can
    be achieved.
  • Convenient to match with classical methods (e.g.
    HMM).
Disadvantage
  • High computational complexity.

14
Introduction
15
Introduction
The rest of this presentation.
Lip segmentation under gray-level
  • Based on gray-level image.
  • Locate the minimum enclosing rectangle of the
    mouth.
  • High processing speed.
  • Low computation complexity.

16
Introduction
The rest of this presentation.
Lip segmentation in colour space
  • Based on RGB, HSV and Lab colour spaces.
  • Can extract the outer boundary of the lip.
  • High accuracy.
  • High computation complexity.

17
Introduction
The rest of this presentation.
Visual only speech recognition
  • Based on polynomial fitting.
  • High processing speed. Suitable for real-time
    systems.
  • Performs well with a limited training set.

18
Lip segmentation (1)
19
Lip segmentation (1)
20
Lip segmentation (1)
21
Lip segmentation (2)
First, we transform the source image from RGB
color space into Lab space. In the a channel,
negative values indicate green while positive
values indicate magenta, so this channel helps to
separate the lip region from the skin.
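The role of the a channel can be illustrated with a minimal sketch. It assumes 8-bit sRGB input and a D65 white point, neither of which the slides state, and the two sample pixel values are hypothetical rather than taken from the experiment.

```python
def srgb_to_lab(r, g, b):
    """Convert one 8-bit sRGB pixel to CIE L*a*b* (assumes a D65 white point)."""
    def linearize(c):
        # Undo the sRGB gamma curve to get linear light in [0, 1].
        c /= 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    rl, gl, bl = linearize(r), linearize(g), linearize(b)

    # Linear RGB to CIE XYZ (standard sRGB matrix).
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl

    def f(t):
        # Piecewise cube-root function from the CIE Lab definition.
        delta = 6.0 / 29.0
        if t > delta ** 3:
            return t ** (1.0 / 3.0)
        return t / (3 * delta ** 2) + 4.0 / 29.0

    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)
    return 116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)


# Hypothetical sample colours: a reddish "lip-like" pixel and a "skin-like" pixel.
lip_like = srgb_to_lab(180, 80, 90)
skin_like = srgb_to_lab(220, 170, 140)
print("a channel, lip-like pixel: ", lip_like[1])
print("a channel, skin-like pixel:", skin_like[1])
```

In this sketch the reddish pixel gets the larger a value, which is the property the slide relies on to highlight the lip against the skin.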
22
Lip segmentation (2)
23
Lip segmentation (2)
In the source image, we take the pixels located in the
non-black area and transform them into HSV color
space. Then we obtain a vector as follows [equation not reproduced in transcript].
We assume the data follow a normal distribution
and estimate the mean and variance via maximum likelihood (ML).
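The estimation step can be sketched as follows. The exact vector used on the slide is not in the transcript, so this sketch assumes the full (H, S, V) triple per pixel and an independent normal distribution per channel. The function names are hypothetical.

```python
import colorsys


def hsv_of_non_black(pixels_rgb):
    """Convert non-black 8-bit RGB pixels to HSV triples in [0, 1].

    Black pixels are treated as the masked-out background and skipped.
    """
    return [
        colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        for r, g, b in pixels_rgb
        if (r, g, b) != (0, 0, 0)
    ]


def ml_gaussian(samples):
    """Maximum-likelihood mean and variance per channel.

    The ML variance divides by N (not N - 1).
    Assumes at least one sample and independent channels.
    """
    n = len(samples)
    dims = len(samples[0])
    mean = [sum(s[d] for s in samples) / n for d in range(dims)]
    var = [sum((s[d] - mean[d]) ** 2 for s in samples) / n for d in range(dims)]
    return mean, var


# Hypothetical pixels: one background (black) pixel and three reddish pixels.
pixels = [(0, 0, 0), (180, 80, 90), (190, 85, 95), (170, 70, 85)]
mean, var = ml_gaussian(hsv_of_non_black(pixels))
print("mean (H, S, V):", mean)
print("variance (H, S, V):", var)
```

Hue is an angle, so a plain mean is only reasonable when the lip hues do not straddle the wrap-around at 0; the sketch ignores that case.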
24
Lip segmentation (2)
We transform the source image into HSV color
space and obtain the vector as follows [equation not reproduced in transcript].
Then we can compute a new image [equation not reproduced in transcript].
A lighter pixel means it is more similar to the lip
region in color space.
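The formula for the new image is not in the transcript. One plausible reading, shown here purely as an assumption, is that each pixel is scored by an unnormalised Gaussian likelihood under the estimated mean and variance and the score is mapped to a grey level. The names `lip_similarity` and `similarity_image` are hypothetical.

```python
import math


def lip_similarity(hsv, mean, var):
    """Score in (0, 1]: 1.0 when the pixel equals the mean colour.

    Assumes independent channels and strictly positive variances.
    """
    d2 = sum((x - m) ** 2 / v for x, m, v in zip(hsv, mean, var))
    return math.exp(-0.5 * d2)


def similarity_image(hsv_image, mean, var):
    """Map every pixel of a 2-D HSV image to a grey level in 0..255."""
    return [
        [round(255 * lip_similarity(pixel, mean, var)) for pixel in row]
        for row in hsv_image
    ]


# Hypothetical model parameters and a 1x2 image: one pixel at the mean, one far away.
mean = [0.97, 0.55, 0.70]
var = [0.001, 0.01, 0.01]
image = [[(0.97, 0.55, 0.70), (0.30, 0.20, 0.90)]]
print(similarity_image(image, mean, var))
```

Pixels whose colour is close to the lip model come out light and distant colours come out dark, matching the description on the slide.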
25
Lip segmentation (2)
We select the block that contains
the center of gravity as the lip region.
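A minimal sketch of this selection step follows. It assumes the similarity image has already been thresholded into a binary mask, that a "block" means a 4-connected region, and that the center of gravity is the mean position of the foreground pixels. None of these details are stated in the transcript.

```python
from collections import deque


def block_containing_centroid(mask):
    """Return the set of (row, col) cells in the 4-connected block that
    contains the centroid of all foreground cells.

    mask is a list of rows of 0/1 values. An empty set is returned when
    there is no foreground or the centroid falls on a background cell.
    """
    points = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not points:
        return set()
    cr = round(sum(r for r, _ in points) / len(points))
    cc = round(sum(c for _, c in points) / len(points))
    if not mask[cr][cc]:
        return set()

    # Breadth-first flood fill starting from the centroid cell.
    block = {(cr, cc)}
    queue = deque([(cr, cc)])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < len(mask) and 0 <= nc < len(mask[nr])
            if inside and mask[nr][nc] and (nr, nc) not in block:
                block.add((nr, nc))
                queue.append((nr, nc))
    return block


# Hypothetical mask: an isolated noise pixel plus a 2x2 lip-like block.
mask = [
    [1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
lip_block = block_containing_centroid(mask)
print(sorted(lip_block))
```

The isolated pixel in the corner shifts the centroid only slightly, so the 2x2 block is still the one selected.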
26
Visual speech recognition
27
Visual speech recognition
For each utterance, we obtain two curves
corresponding to the change in the width and height
of the lip, respectively.
We use least-squares estimation (LSE) to construct two
polynomials that fit the two curves.
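The least-squares fit can be sketched in plain Python by solving the normal equations. The default degree of 3 is an assumption (the following slide appears to set n to 3), time is assumed to be normalised to [0, 1], and the width samples below are synthetic, not measured data.

```python
def fit_polynomial(ts, ys, degree=3):
    """Least-squares polynomial fit.

    Returns coefficients c such that p(t) = c[0] + c[1]*t + ... + c[degree]*t**degree.
    Requires at least degree + 1 distinct sample times.
    """
    m = degree + 1
    # Normal equations (A^T A) c = A^T y for the Vandermonde matrix A[i][k] = ts[i]**k.
    ata = [[sum(t ** (r + c) for t in ts) for c in range(m)] for r in range(m)]
    aty = [sum(y * t ** r for t, y in zip(ts, ys)) for r in range(m)]

    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(ata[r][col]))
        ata[col], ata[pivot] = ata[pivot], ata[col]
        aty[col], aty[pivot] = aty[pivot], aty[col]
        for r in range(col + 1, m):
            factor = ata[r][col] / ata[col][col]
            for c in range(col, m):
                ata[r][c] -= factor * ata[col][c]
            aty[r] -= factor * aty[col]

    # Back substitution.
    coeffs = [0.0] * m
    for r in range(m - 1, -1, -1):
        s = sum(ata[r][c] * coeffs[c] for c in range(r + 1, m))
        coeffs[r] = (aty[r] - s) / ata[r][r]
    return coeffs


# Synthetic lip-width curve sampled at 10 frames, time normalised to [0, 1].
ts = [i / 9 for i in range(10)]
widths = [2.0 - 3.0 * t + 0.5 * t ** 2 + 4.0 * t ** 3 for t in ts]
width_coeffs = fit_polynomial(ts, widths)
print(width_coeffs)
```

The same function would be called a second time on the height curve, giving one polynomial per curve as the slide describes.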
28
Visual speech recognition
In this work, we set n = 3.
The maximum, the minimum and the rightmost point are
recorded as the feature vector.
Each utterance is assigned a label j, and we
use the following equations to train [equations not reproduced in transcript].
We use the following equations to test (F is the
input feature vector, and T is the trained
template feature vector) [equations not reproduced in transcript].
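The training and test equations are not in the transcript, so the following is only a sketch under stated assumptions, not the authors' method. Features are read off a fitted polynomial sampled on [0, 1], a template T is taken to be the mean of the training feature vectors for a label, and an input F is assigned to the label with the nearest template in Euclidean distance. All function names are hypothetical.

```python
import math


def poly_eval(coeffs, t):
    """Evaluate c[0] + c[1]*t + c[2]*t**2 + ... at t."""
    return sum(c * t ** k for k, c in enumerate(coeffs))


def curve_features(coeffs, samples=101):
    """Position and value of the maximum and minimum, plus the value at
    the rightmost point, for the polynomial sampled on [0, 1]."""
    ts = [i / (samples - 1) for i in range(samples)]
    vals = [poly_eval(coeffs, t) for t in ts]
    i_max = max(range(samples), key=vals.__getitem__)
    i_min = min(range(samples), key=vals.__getitem__)
    return [ts[i_max], vals[i_max], ts[i_min], vals[i_min], vals[-1]]


def train_templates(labelled_features):
    """Average the feature vectors of each label (assumed training rule)."""
    sums, counts = {}, {}
    for label, feat in labelled_features:
        acc = sums.setdefault(label, [0.0] * len(feat))
        for k, value in enumerate(feat):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def classify(feature, templates):
    """Return the label whose template T is closest to the input F."""
    return min(templates, key=lambda label: math.dist(feature, templates[label]))


# Hypothetical two-class example with 2-D feature vectors.
templates = train_templates([("a", [0.0, 0.0]), ("a", [2.0, 2.0]), ("b", [10.0, 10.0])])
print(classify([1.5, 0.5], templates))
```

In the setting of the slides, the width and height feature vectors of one utterance would presumably be concatenated before training and matching; that detail is also an assumption.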
29
Experiment
The illumination source is an 18 W fluorescent
lamp, the camera resolution is 320×240 at 30 FPS,
and the overall setup is shown below.
Our task is to recognize 10 isolated digits (0 to
9) in Mandarin Chinese.
Five speakers (4 males and 1 female) took
part in the experiment. For each digit,
each speaker was asked to repeat it 10 times to train
the system, and fifty times to test it.
30
Experiment
The experimental results are shown below.
Digit  Accuracy
0      0.972
1      0.952
2      0.976
3      0.964
4      0.788
5      0.912
6      0.964
7      0.744
8      0.952
9      0.932
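As a quick consistency check, the unweighted mean of the ten per-digit accuracies can be computed directly. It matches the overall accuracy of 0.9156 reported for the approach in the comparison with existing methods. The variable names are illustrative.

```python
# Per-digit accuracies copied from the results table.
per_digit = {
    0: 0.972, 1: 0.952, 2: 0.976, 3: 0.964, 4: 0.788,
    5: 0.912, 6: 0.964, 7: 0.744, 8: 0.952, 9: 0.932,
}

mean_accuracy = sum(per_digit.values()) / len(per_digit)
hardest = sorted(per_digit, key=per_digit.get)[:2]  # two lowest-accuracy digits

print(round(mean_accuracy, 4))
print(hardest)
```

The digits 7 and 4 are the weakest cases, with accuracies of 0.744 and 0.788 respectively.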
31
Experiment
Comparison with some existing approaches that also
use the width and height of the lip as visual
features
Method        Accuracy
1             0.8127
2             0.7741
3             0.9149
4             0.7720
Our approach  0.9156
Methods 1, 2 and 3: S. L. Wang, W. H. Lau, A. W. C. Liew, and
S. H. Leung. Automatic lipreading with limited
training data. In Proc. ICPR 2006, pp. 881-884,
2006.
Method 4: A. R. Baig, R. Seguier, and G. Vaucher.
Image sequence analysis using a spatio-temporal
coding for automatic lipreading. In Proc. ICIAP
1999, pp. 544-549, 1999.
32
Experiment
33
Conclusion & Future work
In this paper, we have proposed a new approach to
automatic lip reading based upon
polynomial fitting. The feature vector of our
approach has low dimension, and the approach
needs only a small training data set. Experiments have
shown promising results for the proposed
approach in comparison with existing methods.
However, more difficult tasks,
e.g. recognizing words or sentences, require an
appropriate model. This is the
focus of the next stage of our research.
34
Thank you! 31-08-2009