Title: An Automatic Lip-reading Method Based on Polynomial Fitting
1An Automatic Lip-reading Method Based on
Polynomial Fitting
Meng LI
Supervisor: Dr. Yiu-ming CHEUNG
Department of Computer Science, Hong Kong Baptist University
2Content
Introduction
Lip segmentation
Visual speech recognition
Experiment
Conclusion and future work
3Introduction
Speech perception is multimodal: it involves
information from at least two sensory modalities.
4Introduction
Chart: recognition results by modality (axis 0 to 100)
Environment Visual Only Audio Only Visual-Audio
Silent 73 91 97
Noisy 73 47 87
5Introduction
The hottest research direction in lip-reading is
visual speech recognition (with audio
information, or visual only).
Chart: research directions in lip-reading.
Categories: speech recognition in noisy environment,
visual-only speech recognition, identification, others.
Values shown: 63, 31, 1, 5.
6Introduction
The basic structure of a typical AVSR (Automatic
Visual-Speech Recognition) system
Preprocessing
Feature Extraction
AV Fusion
7Introduction
Pixel Based: use all pixels in the lip region as the feature.
Motion Based: capture the motion of all or part of the lip
during pronunciation.
Shape Based: extract the boundary of the lip as the feature.
Model Based: assume a lip model, match the lip shape to
the model, and use a few parameters to represent the
shape of the lip.
8Introduction
9Introduction
Advantage
- All information is utilized.
- Highest recognition rate under ideal illumination
conditions.
Disadvantage
- Sensitive to the illumination condition.
- Sensitive to rotation and scale transforms.
- Speaker dependence.
- High dimension of feature data.
10Introduction
11Introduction
Advantage
- Represents the motion of the lip directly and
completely.
Disadvantage
- Sensitive to the illumination condition.
- Sensitive to rotation and scale transforms.
- Speaker dependence.
- High dimension of feature data.
12Introduction
13Introduction
Advantage
- Low dimension of feature data.
- Robust to rotation and scale transformations.
- If the model is appropriate, speaker independence can
be achieved.
- Convenient to match with classical methods (e.g.
HMM).
Disadvantage
- High computation complexity.
14Introduction
15Introduction
The rest of this presentation.
Lip segmentation in gray-level images
- Based on the gray-level image.
- Locates the minimum enclosing rectangle of the
mouth.
- High processing speed.
- Low computation complexity.
16Introduction
The rest of this presentation.
Lip segmentation in colour space
- Based on the RGB, HSV and Lab colour spaces.
- Can extract the outer boundary of lip.
- High accuracy.
- High computation complexity.
17Introduction
The rest of this presentation.
Visual only speech recognition
- Based on polynomial fitting.
- High processing speed; suitable for real-time
systems.
- Performs well with a limited training set.
18Lip segmentation (1)
19Lip segmentation (1)
20Lip segmentation (1)
21Lip segmentation (2)
First, we transform the source image from RGB
color space into Lab space. In the a channel,
negative values indicate green while positive
values indicate magenta, so this channel helps to
highlight the lip region against the skin.
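A minimal sketch of why the a channel separates lip from skin, assuming the source image is 8-bit sRGB with a D65 white point (the slides only say "RGB"). The function name `srgb_to_lab_a` and the sample pixel values are illustrative, not taken from the presentation.

```python
def srgb_to_lab_a(r, g, b):
    """Return the CIELAB a* value of one 8-bit sRGB pixel (D65 white point)."""
    def linearize(c):
        # Undo the sRGB gamma curve to get linear light in [0, 1].
        c = c / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    def f(t):
        # Standard CIELAB companding function.
        delta = 6.0 / 29.0
        if t > delta ** 3:
            return t ** (1.0 / 3.0)
        return t / (3 * delta ** 2) + 4.0 / 29.0

    rl, gl, bl = linearize(r), linearize(g), linearize(b)
    # Linear sRGB -> CIE XYZ (only X and Y are needed for a*).
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    # a* compares X and Y relative to the white point (Xn = 0.95047, Yn = 1).
    return 500.0 * (f(x / 0.95047) - f(y / 1.0))


# Hypothetical sample colours: a reddish lip pixel and a skin-tone pixel.
a_lip = srgb_to_lab_a(180, 80, 90)
a_skin = srgb_to_lab_a(220, 180, 150)
a_green = srgb_to_lab_a(0, 255, 0)
a_gray = srgb_to_lab_a(128, 128, 128)
```

With these sample colours the lip pixel gets a clearly larger a value than the skin pixel, green is negative, and gray is close to zero, which is the behaviour the slide relies on.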
22Lip segmentation (2)
23Lip segmentation (2)
In the source image, we take the pixels located in the
non-black area and transform them into HSV color
space. Then we can get a vector as follows:
We assume the data follow a normal distribution,
and estimate the mean and variance via maximum
likelihood (ML).
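The ML step can be sketched as follows. The exact vector used on the slide is not recoverable from the text, so this sketch assumes one (h, s, v) triple per selected pixel and treats the three channels as independent; `ml_gaussian_per_channel` is a hypothetical helper name. Hue is treated as an ordinary number here, which ignores its circular nature.

```python
import colorsys


def ml_gaussian_per_channel(rgb_pixels):
    """ML estimates of mean and variance for each HSV channel.

    rgb_pixels: list of (r, g, b) tuples with 8-bit values, already
    restricted to the candidate (non-black) area.
    """
    hsv = [colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
           for r, g, b in rgb_pixels]
    n = len(hsv)
    # ML mean: the sample average of each channel.
    means = [sum(p[k] for p in hsv) / n for k in range(3)]
    # ML variance: divide by n (not n - 1), which is the maximum-likelihood form.
    variances = [sum((p[k] - means[k]) ** 2 for p in hsv) / n for k in range(3)]
    return means, variances


# Two hypothetical red pixels of different brightness.
means, variances = ml_gaussian_per_channel([(255, 0, 0), (128, 0, 0)])
```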
24Lip segmentation (2)
We transform the source image into HSV color
space and get the vector as follows:
Then we can get a new image:
A lighter pixel is more similar to the lip
region in color space.
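The formula for the new image did not survive in the text. One plausible form, given the normal-distribution assumption, is an unnormalized Gaussian score per pixel; the sketch below uses that as an assumption, with independent HSV channels and hypothetical names `lip_similarity` and `similarity_image`.

```python
import colorsys
import math


def lip_similarity(rgb_pixel, means, variances, eps=1e-6):
    """Score in (0, 1]; 1 means the pixel equals the estimated lip colour."""
    r, g, b = rgb_pixel
    hsv = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    # Squared distance to the mean, scaled by the variance of each channel.
    # eps avoids division by zero when a channel has zero variance.
    d2 = sum((x - m) ** 2 / (v + eps) for x, m, v in zip(hsv, means, variances))
    return math.exp(-0.5 * d2)


def similarity_image(rgb_image, means, variances):
    """Map every pixel of a row-major RGB image to its similarity score."""
    return [[lip_similarity(p, means, variances) for p in row]
            for row in rgb_image]


# Hypothetical lip colour model centred on pure red.
model_means = [0.0, 1.0, 1.0]
model_vars = [0.01, 0.01, 0.01]
scores = similarity_image([[(255, 0, 0), (0, 0, 255)]], model_means, model_vars)
```

In this sketch the red pixel scores 1 (lightest) and the blue pixel scores close to 0 (darkest), matching the "lighter means more lip-like" reading of the slide.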
25Lip segmentation (2)
We select the block that includes
the gravity center as the lip region.
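A rough sketch of this selection step, under assumptions the slide does not spell out: the image has already been thresholded to a binary mask, "block" means a 4-connected foreground region, and the gravity center is the centroid of all foreground pixels. The name `block_containing_centroid` is hypothetical.

```python
from collections import deque


def block_containing_centroid(mask):
    """Return the 4-connected foreground block that contains the centroid.

    mask: list of rows of 0/1 values. Returns a set of (row, col) pixels,
    or an empty set if the centroid falls on a background pixel.
    """
    points = [(r, c) for r, row in enumerate(mask)
              for c, v in enumerate(row) if v]
    if not points:
        return set()
    # Gravity center of all foreground pixels, rounded to a pixel position.
    cr = round(sum(r for r, _ in points) / len(points))
    cc = round(sum(c for _, c in points) / len(points))
    if not mask[cr][cc]:
        return set()
    # Breadth-first flood fill starting from the centroid pixel.
    block, queue = {(cr, cc)}, deque([(cr, cc)])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < len(mask) and 0 <= nc < len(mask[0])
            if inside and mask[nr][nc] and (nr, nc) not in block:
                block.add((nr, nc))
                queue.append((nr, nc))
    return block


# Hypothetical mask: one isolated noise pixel plus a larger central block.
demo_mask = [
    [1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
lip_block = block_containing_centroid(demo_mask)
```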
26Visual speech recognition
27Visual speech recognition
For each utterance, we obtain two curves
corresponding to the changes in lip width and lip
height, respectively.
We employ least-squares estimation (LSE) to construct
two polynomials that fit the two curves.
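A minimal sketch of a least-squares polynomial fit, solved through the normal equations with plain Gaussian elimination. It would be applied once to the width curve and once to the height curve. The function name `polyfit_lse`, the default degree, and the sample data are assumptions for illustration.

```python
def polyfit_lse(ts, ys, degree=3):
    """Least-squares polynomial fit.

    Returns coefficients [c0, c1, ..., c_degree], lowest order first,
    minimizing sum((c0 + c1*t + ... - y)^2) over the samples.
    """
    m = degree + 1
    # Normal equations A c = b with A[i][j] = sum t^(i+j), b[i] = sum y*t^i.
    A = [[sum(t ** (i + j) for t in ts) for j in range(m)] for i in range(m)]
    b = [sum(y * t ** i for t, y in zip(ts, ys)) for i in range(m)]
    # Forward elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, m):
            factor = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * m
    for i in range(m - 1, -1, -1):
        s = b[i] - sum(A[i][j] * coeffs[j] for j in range(i + 1, m))
        coeffs[i] = s / A[i][i]
    return coeffs


# Hypothetical width curve sampled at 10 frames, time normalized to [0, 1].
frame_times = [i / 9 for i in range(10)]
widths = [1 + 2 * t - 3 * t ** 2 + 0.5 * t ** 3 for t in frame_times]
width_coeffs = polyfit_lse(frame_times, widths, degree=3)
```

Normalizing frame times to [0, 1] keeps the normal equations reasonably well conditioned for a low-degree fit; for higher degrees a QR-based solver would be the safer choice.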
28Visual speech recognition
In this work, we set n = 3.
The maximum, the minimum and the rightmost point of
each fitted curve are recorded as the feature vector.
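A sketch of how these three points could be read off a fitted cubic, assuming n is the polynomial degree and that each point is stored as a (t, value) pair; neither detail is stated on the slide. The helper name `cubic_features` is hypothetical.

```python
import math


def cubic_features(coeffs, t_end):
    """Local maximum, local minimum and rightmost point of a cubic.

    coeffs: [c0, c1, c2, c3], lowest order first.
    t_end: the last time value of the utterance (rightmost point).
    Returns [(t_max, p_max), (t_min, p_min), (t_end, p_end)], or None
    if the cubic has no distinct local maximum and minimum.
    """
    c0, c1, c2, c3 = coeffs

    def p(t):
        return c0 + c1 * t + c2 * t ** 2 + c3 * t ** 3

    # Critical points solve p'(t) = c1 + 2*c2*t + 3*c3*t^2 = 0.
    a, b, c = 3 * c3, 2 * c2, c1
    disc = b * b - 4 * a * c
    if a == 0 or disc <= 0:
        return None
    r1 = (-b - math.sqrt(disc)) / (2 * a)
    r2 = (-b + math.sqrt(disc)) / (2 * a)
    # p''(t) = 2*c2 + 6*c3*t is negative at the local maximum.
    if 2 * c2 + 6 * c3 * r1 < 0:
        t_max, t_min = r1, r2
    else:
        t_max, t_min = r2, r1
    return [(t_max, p(t_max)), (t_min, p(t_min)), (t_end, p(t_end))]


# Worked example: p(t) = t^3 - 3t has a maximum at t = -1 and a minimum at t = 1.
features = cubic_features([0.0, -3.0, 0.0, 1.0], t_end=2.0)
```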
Each utterance is assigned a label j, and we
use the following equations to train:
We use the following equations to test (F is the
input feature vector, and T is the trained
template feature vector):
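The training and testing equations are not recoverable from the text. As a stand-in only, the sketch below assumes the simplest template scheme: each template T is the mean feature vector of the training utterances with label j, and an input F is assigned the label of the nearest template in Euclidean distance. The names `train_templates` and `classify` are hypothetical, and the presentation's actual equations may differ.

```python
import math


def train_templates(samples):
    """Build one template per label by averaging its feature vectors.

    samples: list of (label, feature_vector) pairs.
    Returns a dict {label: mean feature vector}.
    """
    sums, counts = {}, {}
    for label, vec in samples:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc]
            for label, acc in sums.items()}


def classify(F, templates):
    """Return the label j whose template T is closest to F."""
    return min(templates, key=lambda j: math.dist(F, templates[j]))


# Hypothetical 2-D feature vectors for two labels.
templates = train_templates([(0, [0.0, 0.0]), (0, [2.0, 0.0]), (1, [10.0, 10.0])])
```

For example, `classify([1.5, 0.5], templates)` picks label 0 and `classify([9.0, 9.0], templates)` picks label 1.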
29Experiment
The illumination source is an 18 W fluorescent
lamp, the camera resolution is 320×240 at 30 FPS,
and the entire environment is shown below.
Our task is to recognize 10 isolated digits (0 to
9) in Mandarin Chinese.
Five speakers (4 males and 1 female) took
part in the experiment. For each digit, each
speaker was asked to repeat it 10 times to train
the system and 50 times to test it.
30Experiment
The experimental results are shown below.
Digit Accuracy Digit Accuracy
0 0.972 5 0.912
1 0.952 6 0.964
2 0.976 7 0.744
3 0.964 8 0.952
4 0.788 9 0.932
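As a quick arithmetic check, the per-digit accuracies in the table average to 0.9156, the overall figure quoted for this approach in the comparison with other methods. The variable names are illustrative.

```python
# Per-digit accuracies copied from the table above.
per_digit_accuracy = {
    0: 0.972, 1: 0.952, 2: 0.976, 3: 0.964, 4: 0.788,
    5: 0.912, 6: 0.964, 7: 0.744, 8: 0.952, 9: 0.932,
}

# Unweighted mean over the 10 digits.
mean_accuracy = sum(per_digit_accuracy.values()) / len(per_digit_accuracy)

# Digits with the lowest accuracy (candidates for closer inspection).
hardest = sorted(per_digit_accuracy, key=per_digit_accuracy.get)[:2]
```

The two weakest digits in this table are 7 and 4, which pull the average down from the mid-0.9 range of the other eight.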
31Experiment
Comparison with some existing approaches that also
use the width and height of the lip as visual
features
Method Accuracy
1 0.8127
2 0.7741
3 0.9149
4 0.7720
Our approach 0.9156
Methods 1, 2 and 3: S. L. Wang, W. H. Lau, A. W. C. Liew, and
S. H. Leung. Automatic lipreading with limited
training data. In Proc. ICPR 2006, pp. 881-884,
2006.
Method 4: A. R. Baig, R. Seguier, and G. Vaucher.
Image sequence analysis using a spatio-temporal
coding for automatic lipreading. In Proc. ICIAP
1999, pp. 544-549, 1999.
32Experiment
33Conclusion and future work
In this paper, we have proposed a new approach to
automatic lip reading based on
polynomial fitting. The feature vectors of our
approach have low dimension, and the approach
needs only a small training data set. Experiments have
shown promising results for the proposed
approach in comparison with existing methods.
However, for more difficult tasks,
e.g. recognizing words or sentences, an
appropriate model is required. This is the
focus of the next stage of research.
34Thank you! 31-08-2009