Title: Gesture Recognition
1 Gesture Recognition
- 1. Recognition of parameterized gestures
- 2. Real-time sign language recognition using a single video camera
- Jaron Schaeffer
- Jaron.Schaeffer_at_jayweb.de
2 TOC
- Recognition of parameterized gestures
  - Parametric gestures
  - Previous Approaches
  - Parametric Gaussian Hidden Markov Models
  - Training/Testing
  - Results
- Real-time sign language recognition using a single video camera
  - Objective
  - Feature Extraction
  - The desk-based recognizer
  - The wearable-based recognizer
3 Part 1: Recognition and Interpretation of Parametric Gesture
1. Parametric gestures
2. Previous Approaches
3. Parametric Gaussian Hidden Markov Models
4. Training/Testing
5. Results
4 Recognition and Interpretation of Parametric Gesture
- What is a parametric gesture?
  - A gesture that has a parameter T which is needed to fully understand the gesture.
  - In the example, T is the size of the fish, given by the distance between the signer's hands.
  - Another example: a pointing gesture, where T is the direction pointed to.
I caught a fish. It was this big.
5 Recognition of Parametric Gesture: Previous Approaches 1
- Ad-hoc method for each gesture to be recognized
  - Use an ad-hoc method to extract the parameter T for each different parametric gesture
- Problems
  - Difficult to write
  - Only works for gestures already labeled
  - Unknown gestures have to be modelled as noise from an existing prototype
  - A new method is needed for each gesture
6 Recognition of Parametric Gesture: Previous Approaches 2
- Use multiple HMMs to cover the parameter space
  - Use an HMM for each possible value of T in parameter space
- Problems
  - It is unknown how many separate models will be necessary
  - As the dimensionality of the parameter space increases, a large number of models will be needed
  - Unreasonable demands on the amount of training data
7 Repetition: Standard continuous Gaussian HMMs
8 Repetition: Standard continuous Gaussian HMMs
[Figure: diagram with labels 1, 2, 3]
Likelihood for output of 2 is about 10 given the system is in state 2
9 Parametric Gaussian HMMs: The model
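The model figure for this slide is not preserved. As a rough sketch only, the linear parametric HMM of the Wilson and Bobick report listed in the references lets the mean of each state's Gaussian output density shift linearly with the parameter T. The function and argument names below (`W_j`, `mu_bar_j`, `cov_j`, `theta` standing for T) are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def phmm_state_mean(W_j, mu_bar_j, theta):
    """Mean of state j for parameter value theta (assumed linear form)."""
    return W_j @ theta + mu_bar_j

def phmm_log_output_density(x, W_j, mu_bar_j, cov_j, theta):
    """Log of the Gaussian output density of state j at observation x."""
    diff = x - phmm_state_mean(W_j, mu_bar_j, theta)
    dim = x.shape[0]
    _, log_det = np.linalg.slogdet(cov_j)
    mahalanobis = diff @ np.linalg.solve(cov_j, diff)
    return -0.5 * (dim * np.log(2.0 * np.pi) + log_det + mahalanobis)

# Example: a 2-D observation and a scalar parameter (e.g. fish size).
W = np.array([[1.0], [0.5]])       # hypothetical state-specific weights
mu_bar = np.array([0.0, 1.0])      # hypothetical mean at theta = 0
cov = np.eye(2)
theta = np.array([2.0])
x = phmm_state_mean(W, mu_bar, theta)
log_density_at_mean = phmm_log_output_density(x, W, mu_bar, cov, theta)
```

With W_j set to zero for every state, this reduces to the standard continuous Gaussian HMM from the repetition slides.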
10 Parametric Gaussian HMMs: Training
- Training means: set the HMM parameters to maximize the probability of the training sequences
- Each training sequence is paired with a value of T
- The Baum-Welch form of the expectation-maximization algorithm is used to update the parameters of the output probability distributions
11 Training Parametric Gaussian HMMs: Expectation-Maximization algorithm
- Assumption: in addition to the observable data (the observation sequence xt), there is hidden data (the state sequence qt)
- Expectation-Maximization algorithm
  - Expectation
    - Compute/guess the value of the hidden data given the observable data and the current parameters (Forward/Backward algorithm)
  - Maximization
    - Given this guess at the hidden data, compute an updated value of the parameters
  - Repeat until satisfied (change in parameters is small)
- A lot of math; no more details here
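The two EM steps can be illustrated with a minimal sketch for a standard HMM with one-dimensional Gaussian outputs. It is a simplification, not the parametric training procedure from the slides: the expectation step computes the state posteriors with a scaled forward/backward pass, and the maximization step shown here only re-estimates the state means. All names are illustrative.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def state_posteriors(obs, start, trans, means, variances):
    """E-step: gamma[t, j] = P(q_t = j | x_1..x_T), scaled forward/backward."""
    obs = np.asarray(obs, dtype=float)
    start = np.asarray(start, dtype=float)
    trans = np.asarray(trans, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    n_frames, n_states = len(obs), len(start)
    # b[t, j]: output density of state j at frame t
    b = gaussian_pdf(obs[:, None], means[None, :], variances[None, :])
    alpha = np.zeros((n_frames, n_states))
    beta = np.ones((n_frames, n_states))
    scale = np.zeros(n_frames)
    alpha[0] = start * b[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, n_frames):
        alpha[t] = (alpha[t - 1] @ trans) * b[t]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]
    for t in range(n_frames - 2, -1, -1):
        beta[t] = (trans @ (b[t + 1] * beta[t + 1])) / scale[t + 1]
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

def update_means(obs, gamma):
    """M-step for the means: posterior-weighted average of the observations."""
    obs = np.asarray(obs, dtype=float)
    return (gamma * obs[:, None]).sum(axis=0) / gamma.sum(axis=0)

# One EM iteration on a toy two-state left-to-right model.
obs = [0.1, -0.2, 0.0, 5.1, 4.9, 5.2]
gamma = state_posteriors(obs, [1.0, 0.0], [[0.8, 0.2], [0.0, 1.0]],
                         means=[0.0, 5.0], variances=[1.0, 1.0])
new_means = update_means(obs, gamma)
```

In practice the two steps are repeated until the parameters stop changing noticeably; the variances and transition probabilities are re-estimated in the same posterior-weighted way.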
12 Training Parametric Gaussian HMMs: Training results
- After applying the EM algorithm for each training sequence, we get new values for the HMM parameters
- Ready for testing!
13 Recognition of Parametric Gesture: Testing
- Testing
  - Given a parameterized HMM and an input sequence, we wish to compute T and the probability of the input sequence.
- Extracting T
  - Complicated in contrast to normal HMM testing
  - Again, use an Expectation-Maximization (EM) algorithm that finally leads to an estimate of T
- Probability of the input sequence given T: use Viterbi.
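The update formula for T is not preserved on this slide. As a hedged sketch: if each state mean is assumed to have the linear form W_j T + mu_bar_j, then maximizing the posterior-weighted Gaussian log-likelihood with respect to T is a weighted least-squares problem with a closed-form solution. The function below implements that solution; the names and array layouts are assumptions for illustration.

```python
import numpy as np

def estimate_theta(obs, gamma, W, mu_bar, cov):
    """Maximization step for the gesture parameter (theta stands for T).

    obs:    (T, d) observation sequence
    gamma:  (T, N) state posteriors from the expectation step
    W:      (N, d, k) per-state linear maps
    mu_bar: (N, d) per-state mean offsets
    cov:    (N, d, d) per-state covariances
    """
    k = W.shape[2]
    lhs = np.zeros((k, k))
    rhs = np.zeros(k)
    for j in range(W.shape[0]):
        Wt_prec = W[j].T @ np.linalg.inv(cov[j])            # (k, d)
        lhs += gamma[:, j].sum() * (Wt_prec @ W[j])
        rhs += Wt_prec @ (gamma[:, j] @ (obs - mu_bar[j]))  # weighted residual sum
    return np.linalg.solve(lhs, rhs)

# Toy check: observations generated exactly from theta = 3 are recovered.
W = np.array([[[1.0], [0.0]], [[0.0], [2.0]]])
mu_bar = np.array([[0.0, 0.0], [1.0, 1.0]])
cov = np.stack([np.eye(2), np.eye(2)])
obs = np.array([[3.0, 0.0], [3.0, 0.0], [1.0, 7.0]])
gamma = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
theta_hat = estimate_theta(obs, gamma, W, mu_bar, cov)
```

During testing, this update would alternate with a forward/backward pass that recomputes `gamma` for the current estimate, until the estimate stops changing.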
14 Recognition of Parametric Gesture: Results
STIVE input and output
- Testing for the fish size gesture
  - 30 examples of the fish gesture were collected using STIVE (STereo INteractive Virtual Environment) at a frame rate of 20 Hz
  - STIVE returned the 3D positions of head and hands
  - Each sequence was on average 43 samples long
  - T interpreted as fish size in inches
  - Values varied from 7.7 in (small fish) to 36.6 in (respectable catch)
15 Recognition of Parametric Gesture: Results
STIVE input and output
- Testing for the fish size gesture
  - 6-state parameterized HMM with no skip transitions or back transitions
  - Training with 15 randomly chosen sequences out of the 30, the rest for testing
16 Recognition of Parametric Gesture: Results
Testing for the size gesture
[Figure: mean and standard deviation]
Average absolute error of only 0.16 in
17 Recognition of Parametric Gesture: Results
- Testing for the pointing gesture
  - HMM now parameterized by more than one variable ((X/Y) position on the plane in front of the user)
  - Motion capture system to record the wrist position of the right hand at a frame rate of 30 Hz
  - 50 sequences collected
  - T interpreted as position of the wrist on the pointing plane
  - 8-state parameterized HMM with no skip transitions or back transitions
  - 20 sequences for training, 30 for testing
18 Recognition of Parametric Gesture: Results
- Testing for the pointing gesture: Results
19 Recognition of Parametric Gesture: Results under noise
The average error as a function of noise
- N(0, x)-distributed noise added for testing
- f(x) is the mean error between the estimated/measured T under noise and the measured T in the noise-free case
- Under noise, the HMM performs even better than directly measuring T
- Why?
  - Direct measuring is more sensitive to noise, since only one still image is used to measure T; the HMM uses the complete sequence to extract T.
[Figure: f(x) plotted against x]
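The intuition behind this result can be shown with a toy simulation. It is not the experiment from the slides: it simply assumes that every frame carries an equally informative, independently noisy measurement of T, and compares reading T from a single frame with averaging over the whole sequence. The sequence length of 43 is borrowed from the fish gesture data; all other numbers are made up.

```python
import numpy as np

def compare_errors(noise_std, true_value=20.0, frames=43, trials=2000, seed=0):
    """Mean absolute error of a single-frame vs. a whole-sequence estimate."""
    rng = np.random.default_rng(seed)
    noisy = true_value + rng.normal(0.0, noise_std, size=(trials, frames))
    # Direct measurement: read the value off one still frame.
    single_frame = np.abs(noisy[:, frames // 2] - true_value).mean()
    # Sequence-based estimate: pool the evidence of all frames.
    whole_sequence = np.abs(noisy.mean(axis=1) - true_value).mean()
    return single_frame, whole_sequence

single_frame_error, whole_sequence_error = compare_errors(noise_std=2.0)
```

Under these assumptions the pooled estimate has a clearly smaller error, because independent noise averages out over the frames. The parametric HMM pools evidence in a more refined way, weighting frames by state, but the same effect applies.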
20 Recognition of Parametric Gesture: Results
- Results quite good
- Why?
  - Magnitude of Wj greatest for states corresponding to the middle phase of the gestures
  - In the middle phases of the gestures, variation of T maximally impacts the execution of the gesture
  - System automatically learns which segment of the gesture is most diagnostic of T
21 Part 2: Real-time sign language recognition using a single video camera
1. Objective
2. Feature Extraction
3. The desk-based recognizer
4. The wearable-based recognizer
22 Objective
- Recognition of sentence-level American Sign Language (ASL)
- Sentences of the form
  - personal pronoun, verb, noun, adjective, (same) personal pronoun
- are to be recognized
- Example: I like cars red
23 The American Sign Language
- Language of choice for most deaf people in the United States
- Uses approx. 6000 gestures for common words and finger spelling for communicating obscure words
- Signed conversations proceed at about the pace of spoken conversation
- Some aspects of ASL ignored for simplification
  - Storing objects in space for later reference, moving of eyebrows for questions or directives
24 Understanding ASL: The Task
- Two extensible HMM-based systems are provided for recognition, both using one color camera
  - Desk-mounted camera in front of the user
  - Camera mounted in a cap worn by the user
- Tracking stage does not attempt a fine description of hand shape; instead it concentrates on the evolution of the gestures through time
- 40-word test lexicon with words that would generate coherent sentences given the grammar constraint
25 Understanding ASL: Hidden Markov Modeling
- Estimate the number of different states involved in specifying a sign to determine the initial HMM topology
- For less complicated signs, skip transitions can be introduced
- Here, a 4-state HMM with one skip transition was determined to be appropriate
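A topology like this can be written down as an initial transition matrix. The sketch below builds a left-to-right matrix with self-loops, forward steps and any listed skip arcs. Which states the single skip transition connects is not stated on the slide, so the arc from the first to the third state is an assumption, as is the uniform initialization of the allowed transitions.

```python
import numpy as np

def left_to_right_transitions(n_states, skips=()):
    """Row-stochastic initial transition matrix for a left-to-right HMM.

    Each state may stay where it is or move to the next state; `skips`
    lists extra (source, destination) arcs that jump over a state.
    """
    allowed = np.zeros((n_states, n_states), dtype=bool)
    for i in range(n_states):
        allowed[i, i] = True
        if i + 1 < n_states:
            allowed[i, i + 1] = True
    for source, destination in skips:
        allowed[source, destination] = True
    # Spread the probability mass uniformly over the allowed arcs of each row.
    return allowed / allowed.sum(axis=1, keepdims=True)

# 4-state model with one skip transition (assumed: state 0 -> state 2).
sign_model = left_to_right_transitions(4, skips=[(0, 2)])
```

Calling the function without `skips`, for example `left_to_right_transitions(6)`, gives the kind of topology without skip or back transitions used for the gesture models in Part 1. Training then adjusts the non-zero entries, while entries initialized to zero stay zero under Baum-Welch re-estimation.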
26 Understanding ASL: Feature extraction - Hardware
- Hands are tracked in real time using a single color camera
- 320x243 pixel resolution
- Silicon Graphics 200 MHz workstation maintains hand tracking at 10 frames per second (sufficient)
- Natural color of hands is needed
27 Understanding ASL: Feature Extraction - Hand segmentation
- Hand segmentation
  - To segment each hand initially, find a pixel of the natural hand color in the image
  - Take this pixel as a seed and tolerantly grow the hand region by checking the 8 neighbours for the appropriate color
  - The labels "left hand" and "right hand" are assigned to whichever blob is leftmost and rightmost.
[Figure: seed pixels, right hand, left hand]
What about occluding hands?
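The seed-and-grow step can be sketched as a breadth-first region growing over the 8-neighbourhood. The color test used here (maximum per-channel difference to a target color within a tolerance) is an assumption; the slides do not specify how color similarity is measured.

```python
from collections import deque

import numpy as np

def grow_region(image, seed, target_color, tolerance):
    """Boolean mask of pixels 8-connected to `seed` that match the color.

    image: (H, W, 3) array, seed: (row, col) tuple.
    """
    height, width = image.shape[:2]
    target = np.asarray(target_color, dtype=float)

    def matches(row, col):
        return np.abs(image[row, col].astype(float) - target).max() <= tolerance

    mask = np.zeros((height, width), dtype=bool)
    if not matches(*seed):
        return mask
    mask[seed] = True
    queue = deque([seed])
    while queue:
        row, col = queue.popleft()
        for d_row in (-1, 0, 1):
            for d_col in (-1, 0, 1):
                if d_row == 0 and d_col == 0:
                    continue
                n_row, n_col = row + d_row, col + d_col
                inside = 0 <= n_row < height and 0 <= n_col < width
                if inside and not mask[n_row, n_col] and matches(n_row, n_col):
                    mask[n_row, n_col] = True
                    queue.append((n_row, n_col))
    return mask

# Tiny example: two diagonally touching skin-colored pixels and one far away.
skin = np.array([200, 150, 120])
image = np.zeros((5, 5, 3), dtype=np.uint8)
image[1, 1] = image[2, 2] = image[4, 4] = skin
hand_mask = grow_region(image, seed=(1, 1), target_color=skin, tolerance=20)
```

The diagonal neighbour is included because all 8 neighbours are checked, while the isolated pixel at (4, 4) is not reached from the seed.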
28 Understanding ASL: Feature extraction - Features used
- 16-element feature vector constructed from both hands, 8 elements per hand
  - Centroid (X,Y) position
  - Change in (X,Y) from the previous frame
  - Area in pixels
  - Angle of axis of least inertia [1] (found by first eigenvector of the blob)
  - Length of this eigenvector
  - Eccentricity [2] of bounding ellipse
[1] Inertia (German: Trägheit). [2] Eccentricity: here, the deviation from a circular shape.
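The listed features give eight values per hand. A minimal sketch of computing them from a binary blob mask follows. Several details are assumptions, since the slides do not define them: the "length of this eigenvector" is taken as the square root of the largest eigenvalue of the pixel covariance, and the eccentricity as sqrt(1 - smallest/largest eigenvalue), which is 0 for a circular blob and approaches 1 for an elongated one.

```python
import numpy as np

def blob_features(mask, previous_centroid=None):
    """Eight features of one hand blob: x, y, dx, dy, area, angle, length, eccentricity."""
    ys, xs = np.nonzero(mask)
    area = xs.size
    centroid = np.array([xs.mean(), ys.mean()])
    if previous_centroid is None:
        delta = np.zeros(2)
    else:
        delta = centroid - np.asarray(previous_centroid, dtype=float)
    # Second moments of the pixel coordinates around the centroid.
    centered = np.stack([xs, ys], axis=1) - centroid
    covariance = centered.T @ centered / area
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)  # ascending order
    major_axis = eigenvectors[:, 1]  # direction of largest spread = axis of least inertia
    angle = np.arctan2(major_axis[1], major_axis[0]) % np.pi
    length = np.sqrt(eigenvalues[1])
    if eigenvalues[1] > 0:
        eccentricity = np.sqrt(max(0.0, 1.0 - eigenvalues[0] / eigenvalues[1]))
    else:
        eccentricity = 0.0
    return np.array([centroid[0], centroid[1], delta[0], delta[1],
                     area, angle, length, eccentricity])

# Horizontal bar of 7 pixels as a stand-in for a hand blob.
mask = np.zeros((5, 9), dtype=bool)
mask[2, 1:8] = True
features = blob_features(mask, previous_centroid=(3.0, 2.0))
```

Concatenating the feature arrays of the left and the right hand gives the 16-element observation vector for one frame.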
29 Understanding ASL: Feature Extraction - Occluding hands
- Occlusion in hand segmentation
  - Only one large blob
  - Assign each of the two hands the features of this single large blob
  - This method, combined with the time context provided by the HMM, is sufficient to distinguish many different signs that have hand occlusions as a trait
30 Understanding ASL: The desk-based recognizer
- Camera on a desk in front of the user
- 478 sentences used, constructed from the 40-word lexicon
- Each sign is 1 to 3 seconds long
- No pause between signs in a sentence, but sentences themselves are distinct
- 384 sentences used for training, rest for testing
31 Understanding ASL: The desk-based recognizer - Training
- Sentences are divided into five equal portions for initial segmentation
- Initial estimates for the means and variances of the output probabilities are provided iteratively using Viterbi alignment
- Results are fed into a Baum-Welch re-estimator whose estimates are refined in embedded training
- Contexts are not used, since they would require more data to train
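The first training step can be sketched as follows: a sentence's frame sequence is cut into five equal portions, one per word, and each portion yields a first mean and variance estimate for that word's output distribution. This only illustrates the initial segmentation; the Viterbi alignment and Baum-Welch refinement that follow are not shown, and the function name is hypothetical.

```python
import numpy as np

def initial_word_statistics(frames, n_words=5):
    """Split a sentence into equal portions and estimate mean/variance per word.

    frames: (T, d) array of feature vectors for one sentence.
    Returns a list of (mean, variance) pairs, one per word position.
    """
    portions = np.array_split(np.asarray(frames, dtype=float), n_words)
    return [(portion.mean(axis=0), portion.var(axis=0)) for portion in portions]

# Toy sentence: 10 frames with 2 features each, so every word gets 2 frames.
frames = np.arange(20).reshape(10, 2)
statistics = initial_word_statistics(frames)
first_mean, first_variance = statistics[0]
```

`np.array_split` also handles sentence lengths that are not a multiple of five by making some portions one frame longer.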
32 Understanding ASL: The desk-based recognizer - Test 1
- Uses part-of-speech grammar
  - personal pronoun, verb, noun, adjective, (same) personal pronoun
- Word recognition accuracy Acc is calculated by
  - Acc = (N - S) / N
  - N: total number of words in test set
  - S: number of substitutions
- No insertions or deletions, since number and class of words to be recognized is known
- Acc: percentage of correctly recognized words
33 Understanding ASL: The desk-based recognizer - Test 2
- Does not use part-of-speech grammar
- Word recognition accuracy Acc is calculated by
  - Acc = (N - D - S - I) / N
  - N: total number of words in test set
  - S: number of substitutions
  - I: number of insertions
  - D: number of deletions
- Insertions and deletions possible, since number of words and word class unknown
- Acc can now be negative
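The accuracy measure for both tests can be written as one small function, assuming the usual word accuracy definition Acc = (N - D - S - I) / N, which matches the quantities named on the slides. With no insertions or deletions it reduces to the fraction of correctly recognized words from Test 1.

```python
def word_accuracy(n_words, substitutions, insertions=0, deletions=0):
    """Word accuracy Acc = (N - D - S - I) / N as a fraction (multiply by 100 for percent)."""
    if n_words <= 0:
        raise ValueError("n_words must be positive")
    return (n_words - deletions - substitutions - insertions) / n_words

# Test 1 style: only substitutions are possible.
with_grammar = word_accuracy(100, substitutions=5)
# Test 2 style: many insertions can push the accuracy below zero.
without_grammar = word_accuracy(5, substitutions=2, insertions=4)
```

The second call shows why Acc can become negative: insertions are counted as errors but do not increase N.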
34 Understanding ASL: The desk-based recognizer - Results
- Third test performed: strip the absolute (X,Y) positions from the feature vector
  - Simulates daily use of the recognizer when the signer is not always in the same position
- Word accuracy results
35 Understanding ASL: The wearable-based recognizer
- Camera mounted on a cap worn by the signer
- Same 500 sentences
- At the beginning and end of a sentence, hands were often found in a resting position
  - To take this into account, another token called "silence" was added to the dictionary
- 400 sentences for training, 100 for testing
36 Understanding ASL: The wearable-based recognizer
- New grammar for testing purposes: the only restriction is that each sentence is 5 words long
- Word accuracy rate Acc is calculated in the same way as with the desk-based recognizer
37 Understanding ASL: The wearable-based recognizer
- Results
38 End of presentation
- Thanks for your attention!
- References
  - Real-Time American Sign Language Recognition Using Desk and Wearable Computer Based Video
    - Thad Starner, Joshua Weaver, Alex Pentland
    - M.I.T. Media Laboratory Perceptual Computing Section Technical Report No. 466
    - IEEE PAMI 1998
  - Recognition and Interpretation of Parametric Gesture
    - Andrew D. Wilson, Aaron F. Bobick
    - M.I.T. Media Laboratory Perceptual Computing Section Technical Report No. 421
    - International Conference on Computer Vision, 1998
  - An Introduction to Hidden Markov Models
    - L.R. Rabiner and B.H. Juang
    - IEEE ASSP Magazine, pp. 4-16, Jan 1986
Any questions?