Gesture Recognition in Complex Scenes (presentation transcript, 60 slides)

1
Gesture Recognition in Complex Scenes
  • Vassilis Athitsos
  • Computer Science and Engineering Department
  • University of Texas at Arlington

2
Collaborators
  • Jonathan Alon (ex-BU, now Negevtech, Israel).
  • Jingbin Wang (ex-BU, now Google).
  • Quan Yuan (Boston University).
  • Alexandra Stefan (Boston University).
  • Stan Sclaroff (Boston University).
  • George Kollios (Boston University).
  • Margrit Betke (Boston University).

3
Motivation: ASL Dictionary
4
Motivation: ASL Dictionary
  • Addresses the needs of a large community:
  • 500,000 to 2 million ASL users in the US.
  • ??? in the European Union.
  • Direct impact on the education of Deaf children:
  • Most are born to hearing parents and learn ASL at
    school.
  • Challenging problems in vision, learning, and
    database indexing:
  • Large-scale motion-based video retrieval.
  • Efficient large-scale multiclass recognition.
  • Learning complex patterns from few examples.

5
Sources of Information
  • Hand motion.
  • Hand pose.
  • Shape.
  • Orientation.
  • Facial expressions.
  • Body pose.

6
Dynamic Gestures
  • What gesture did the user perform?

Class 8
7
Typical Motion Recognition Approach
[Pipeline: input sequence → Detector/Tracker → trajectory → Classifier → class 0]
8
Bottom-up Shortcoming
input frame
hand likelihood
  • Hand detection is often hard!
  • Color, motion, background subtraction are often
    not enough.
  • Bottom-up frameworks are a fundamental computer
    vision bottleneck.

9
Key Idea
hand candidates
input frame
  • Hand detection can return multiple candidates.
  • Design a recognition module for this type of
    input.

10
Nearest-Neighbor Recognition
[Figure: query gesture compared to database gesture models M1, M2, …, MN]
  • Question: how should we measure similarity?
11
Database Sequences
Example database gesture
  • Assumption: the hand location is known in all frames
    of the database gestures.
  • The database is built offline.
  • In the worst case, manual annotation.
  • Online user experience is not affected.

12
Comparing Trajectories
  • M(i) is the hand position at frame i.
  • Temporary assumption: the hand location is known.
  • How do we compare these trajectories?

13
Comparing Trajectories
  • Comparing i-th frame to i-th frame is
    problematic.
  • What do we do with frame 8?

14
Comparing Trajectories
  • Alignment: ((f1, g1), …, (fm, gm)).
  • Must include all frames of both sequences.
  • A frame can occur multiple consecutive times.

15
Comparing Trajectories
  • ((1,1), (1,2), (2,3), (3,4), (4,5), (5,6), (6,7),
    (7,8))
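The alignment rules from the previous slide (cover all frames of both sequences; a frame may repeat consecutively) can be checked mechanically. A minimal sketch with a hypothetical helper name, using 1-based (f, g) pairs as on the slides:

```python
def is_valid_alignment(pairs, m, n):
    """Check the slide's alignment rules for sequences of length
    m and n: start at (1, 1), end at (m, n), and advance f, g,
    or both by exactly one at each step (so every frame of both
    sequences is covered, and frames may repeat consecutively)."""
    if pairs[0] != (1, 1) or pairs[-1] != (m, n):
        return False
    for (f1, g1), (f2, g2) in zip(pairs, pairs[1:]):
        # allowed steps: repeat f, repeat g, or advance both
        if (f2 - f1, g2 - g1) not in {(0, 1), (1, 0), (1, 1)}:
            return False
    return True
```

For instance, the alignment shown on this slide passes the check for sequences of lengths 7 and 8.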

16
Optimal Alignment
  • The cost of ((f1, g1), …, (fm, gm)) has two terms:
  • Correspondence cost: the average cost of each (fi,
    gi).
  • Transition cost: the cost of two consecutive
    pairings.
  • Dynamic Time Warping (DTW) computes the optimal
    alignment.
  • Complexity: quadratic in the sequence lengths.

17
DTW
[Figure: DTW cost table; one axis indexes frames 1 … 50 … 80 of one sequence, the other frames 1 … 32 … 51 of the other]
  • For each cell (i, j):
  • Compute the optimal alignment of M(1:i) to Q(1:j).
  • The answer depends only on (i-1, j), (i, j-1), and
    (i-1, j-1).
  • Time complexity is proportional to the size of the
    table.
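A minimal sketch of this table computation (hypothetical function name; for brevity only the correspondence cost is used, omitting the transition-cost term from the previous slide):

```python
from math import hypot, inf

def dtw(M, Q):
    """Optimal DTW alignment cost between trajectories M and Q,
    given as lists of (x, y) hand positions, one per frame.
    D[i][j] holds the optimal cost of aligning M[:i] with Q[:j],
    so each cell depends only on (i-1, j), (i, j-1), (i-1, j-1)."""
    m, n = len(M), len(Q)
    D = [[inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # correspondence cost: Euclidean distance between positions
            cost = hypot(M[i-1][0] - Q[j-1][0], M[i-1][1] - Q[j-1][1])
            D[i][j] = cost + min(D[i-1][j], D[i][j-1], D[i-1][j-1])
    return D[m][n]
```

Identical trajectories give cost 0; a shorter trajectory can align to a longer one by repeating frames, as allowed by the alignment rules.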

18
DSTW
[Figure: DSTW cost volume; besides the model (M) and query (Q) axes, each query frame has K candidate hand locations, k = 1, 2, …, K]
  • Alignment: ((f1, g1, k1), …, (fm, gm, km)).
  • Matching cost: the average cost of each (fi, gi,
    ki).
  • Transition cost: the cost of two consecutive
    pairings.
  • How do we find the optimal alignment?

19
DSTW
[Figure: DSTW cost volume, as on the previous slide]
  • For each cell (i, j, k):
  • Compute the optimal alignment of M(1:i) to Q(1:j),
    using the k-th candidate for frame Q(j).
  • The answer depends on (i-1, j, k), and on (i, j-1, k')
    and (i-1, j-1, k') for any candidate k'.
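A minimal sketch of this recurrence (hypothetical names; as in the DTW sketch, only the matching cost is used, with no transition term). Each query frame carries its own list of candidate hand locations, and back-pointers recover the chosen candidate per frame:

```python
from math import hypot

def dstw(M, Qc):
    """M: model hand positions [(x, y), ...], one per frame.
    Qc: for each query frame, a list of candidate (x, y) locations.
    Returns (optimal cost, chosen candidate index per query frame).
    D[i][j] maps candidate k to the best cost of aligning M[:i]
    with Q[:j] when candidate k is the hand in query frame j."""
    m, n = len(M), len(Qc)
    D = [[{} for _ in range(n + 1)] for _ in range(m + 1)]
    B = [[{} for _ in range(n + 1)] for _ in range(m + 1)]  # back-pointers
    D[0][0][0] = 0.0  # dummy start cell
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            for k, (x, y) in enumerate(Qc[j - 1]):
                cost = hypot(M[i - 1][0] - x, M[i - 1][1] - y)
                # predecessors: (i-1, j, k) keeps the same candidate;
                # (i, j-1, k') and (i-1, j-1, k') may use any candidate.
                preds = [(i - 1, j, k)] if k in D[i - 1][j] else []
                preds += [(i, j - 1, kk) for kk in D[i][j - 1]]
                preds += [(i - 1, j - 1, kk) for kk in D[i - 1][j - 1]]
                if not preds:
                    continue
                p = min(preds, key=lambda t: D[t[0]][t[1]][t[2]])
                D[i][j][k] = cost + D[p[0]][p[1]][p[2]]
                B[i][j][k] = p
    k = min(D[m][n], key=D[m][n].get)
    total, chosen, (i, j) = D[m][n][k], {}, (m, n)
    while (i, j) != (0, 0):  # walk back-pointers to the start
        chosen[j] = k
        i, j, k = B[i][j][k]
    return total, [chosen[j] for j in range(1, n + 1)]
```

Backtracking through the volume yields exactly the "hand locations for free" of the next slide: the candidate selected for each query frame.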

20
DSTW
[Figure: DSTW cost volume with the optimal alignment path highlighted]
  • Result: the optimal alignment
  • ((f1, g1, k1), (f2, g2, k2), …, (fm, gm, km)).
  • We get hand locations for free!

21
Application: Gesture Recognition with Short Sleeves!
22
Experiment: 10 Digits.
23
Experiment: 10 Digits.
  • Test set: 90 gestures, from 3 users.
  • Database: 90 gestures, from 3 users.
  • Each test gesture was matched only to the 60
    examples from the other users.
  • Accuracy: 91%.

24
Discussion
  • The higher-level module (recognition) is tolerant to
    lower-level (detection) ambiguities.
  • Recognition disambiguates detection.
  • This is important for designing plug-and-play
    modules.
  • Use in the ASL dictionary:
  • The user signs an unknown word in front of the
    computer.
  • Video sequences of signs are ranked in order of
    DSTW score.

25
Static Gestures (Hand Poses)
  • Given a hand model and a single image of a hand,
    estimate:
  • 3D hand shape (joint angles).
  • 3D hand orientation.

Joints
Input image
Articulated hand model
26
Static Gestures
  • Given a hand model and a single image of a hand,
    estimate:
  • 3D hand shape (joint angles).
  • 3D hand orientation.

Input image
Articulated hand model
27
Similarity Based Matching
  • Goal:
  • Estimate the class of the query gesture q.
  • Method:
  • Find the most similar database gestures.

q
query gesture
database
28
Problems
  • How do we measure similarity?
  • Must tolerate errors in feature extraction:
  • Hand detection and segmentation.
  • How do we achieve efficient retrieval?
  • Efficient approximations of slow similarity
    measures.

q
query gesture
database
29
Goal: Hand Tracking Initialization
  • Given the 3D hand pose in the previous frame,
    estimate it in the current frame.
  • Problem: there is no good way to automatically
    initialize a tracker.
  • Rehg et al. (1995), Heap et al. (1996),
    Shimada et al. (2001),
  • Wu et al. (2001), Stenger et al. (2001), Lu
    et al. (2003), …

30
Assumptions in Our Approach
  • A few tens of distinct hand shapes.

31
Assumptions in Our Approach
  • A few tens of distinct hand shapes.
  • All 3D orientations should be allowed.
  • Motivation: American Sign Language.

32
Assumptions in Our Approach
  • A few tens of distinct hand shapes.
  • All 3D orientations should be allowed.
  • Motivation: American Sign Language.
  • Input: a single image, plus a bounding box of the
    hand.

33
Assumptions in Our Approach
input image
skin detection
segmented hand
  • We do not assume precise segmentation!
  • No clean contour is extracted.

34
Approach: Database Search
  • Over 100,000 computer-generated images.
  • Known hand pose.

input
35
Why?
  • We avoid direct estimation of 3D info.
  • With a database, we only match 2D to 2D.
  • We can find all plausible estimates.
  • Hand pose is often ambiguous.

input
36
Building the Database
26 hand shapes
37
Building the Database
4,128 images are generated for each hand
shape. Total: 107,328 images.
38
Features: Edge Pixels
  • We use edge images.
  • Easy to extract.
  • Stable under illumination changes.

input
edge image
39
Similarity Measure: Chamfer Distance
input
model
Overlaying input and model
How far apart are they?
40
Directed Chamfer Distance
  • Input: two sets of points,
  • red and green.
  • c(red, green):
  • The average distance from each red point to the
    nearest green point.

41
Directed Chamfer Distance
  • Input: two sets of points,
  • red and green.
  • c(red, green):
  • The average distance from each red point to the
    nearest green point.
  • c(green, red):
  • The average distance from each green point to the
    nearest red point.

42
Chamfer Distance
  • Input: two sets of points,
  • red and green.
  • c(red, green):
  • The average distance from each red point to the
    nearest green point.
  • c(green, red):
  • The average distance from each green point to the
    nearest red point.

Chamfer distance: C(red, green) = c(red, green) +
c(green, red)
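The two directed distances and their sum can be written directly (hypothetical function names; brute-force nearest-point search over 2D point sets):

```python
from math import hypot

def directed_chamfer(A, B):
    """c(A, B): average distance from each point of A to its
    nearest point of B.  A and B are lists of (x, y) points."""
    return sum(min(hypot(a[0] - b[0], a[1] - b[1]) for b in B)
               for a in A) / len(A)

def chamfer(A, B):
    """Symmetric chamfer distance C(A, B) = c(A, B) + c(B, A)."""
    return directed_chamfer(A, B) + directed_chamfer(B, A)
```

Note that the directed distance is asymmetric, which is why both directions are summed; in practice the nearest-point lookup is usually accelerated with a distance transform of the edge image.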
43
Evaluating Retrieval Accuracy
  • A database image is a correct match for the input
    if:
  • the hand shapes are the same, and
  • the 3D hand orientations differ by at most 30 degrees.
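The 30-degree criterion can be evaluated from the two 3x3 rotation matrices describing the hand orientations. A sketch (hypothetical name), using the identity trace(R1ᵀR2) = 1 + 2 cos θ for the relative rotation angle θ:

```python
from math import acos, degrees

def orientation_difference_deg(R1, R2):
    """Angle in degrees of the relative rotation between two 3x3
    rotation matrices (lists of lists), from the identity
    trace(R1^T R2) = 1 + 2 cos(theta)."""
    # trace(R1^T R2) equals the elementwise sum of products
    t = sum(R1[i][j] * R2[i][j] for i in range(3) for j in range(3))
    c = max(-1.0, min(1.0, (t - 1.0) / 2.0))  # clamp for safety
    return degrees(acos(c))
```

A database image would then count as a correct match when the hand shapes agree and this angle is at most 30 degrees.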

correct matches
input
incorrect matches
44
Evaluating Retrieval Accuracy
  • An input image has 25-35 correct matches among
    the 107,328 database images.
  • Ground truth for input images is estimated by
    humans.

45
Evaluating Retrieval Accuracy
  • Retrieval accuracy measure: what is the rank of
    the highest-ranking correct match?
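This measure, and the cumulative percentages reported on a later slide, can be computed as follows (hypothetical names; ranks are 1-based):

```python
def rank_of_best_correct(ranked_ids, correct_ids):
    """1-based rank of the highest-ranking correct match in a
    ranked list of database image ids (None if none is present)."""
    correct = set(correct_ids)
    for rank, img in enumerate(ranked_ids, start=1):
        if img in correct:
            return rank
    return None

def cumulative_accuracy(ranks, cutoffs=(1, 10, 100)):
    """Fraction of test images whose best correct match falls
    within each rank cutoff (cf. the 1 / 1-10 / 1-100 rows)."""
    return {c: sum(r <= c for r in ranks) / len(ranks) for c in cutoffs}
```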

46
Evaluating Retrieval Accuracy
[Figure: input image and its database matches at ranks 1-6, with the highest-ranking correct match indicated]
47
Results on 703 Real Hand Images
Rank of highest-ranking correct match    Percentage of test images
1                                        15%
1-10                                     40%
1-100                                    73%
48
Results on 703 Real Hand Images
Rank of highest-ranking correct match    Percentage of test images
1                                        15%
1-10                                     40%
1-100                                    73%
  • Results are better on nicer images:
  • Dark background.
  • Frontal view.
  • For half of the images, the top match was correct.

49
Examples
segmented hand
edge image
initial image
correct match
rank 1
50
Examples
segmented hand
edge image
initial image
correct match
rank 644
51
Examples
segmented hand
edge image
initial image
incorrect match
rank 1
52
Examples
segmented hand
edge image
initial image
correct match
rank 1
53
Examples
segmented hand
edge image
initial image
correct match
rank 33
54
Examples
segmented hand
edge image
initial image
incorrect match
rank 1
55
Examples
segmented hand
edge image
hard case
segmented hand
edge image
easy case
56
Discussion
  • 3D pose estimation from a single image is hard!
  • What is our system good for?
  • Cleanly segmented frontal views.
  • Generating hypotheses that domain
    knowledge/constraints can disambiguate.
  • Tracker initialization and error recovery.
  • How would our system be integrated with a tracker?

57
Research Directions
  • More accurate similarity measures.
  • Problem: higher-level features are more
    informative, but harder to compute.
  • Better tolerance to segmentation errors:
  • Clutter.
  • Incorrect scale and translation.

58
Current Work: ASL Dictionary
59
Current Work: ASL Dictionary
  • Computer vision challenge:
  • Estimate hand pose and motion accurately and
    fast.
  • Our existing hand pose method leaves many
    questions unanswered.
  • Machine learning challenge:
  • Currently, in DSTW, there is no learning.
  • Learn models of signs:
  • 4,000 classes, 1-5 examples per sign.
  • Data mining challenge:
  • Indexing methods for large numbers of classes.

60
  • Comments, questions, complaints:
  • E-mail: athitsos@uta.edu
  • Web: http://crystal.uta.edu/athitsos/
