Gesture Recognition in Complex Scenes (presentation transcript, 60 slides)

1
Gesture Recognition in Complex Scenes
  • Vassilis Athitsos
  • Computer Science and Engineering Department
  • University of Texas at Arlington

2
Collaborators
  • Jonathan Alon (ex-BU, now Negevtech, Israel).
  • Jingbin Wang (ex-BU, now Google).
  • Quan Yuan (Boston University).
  • Alexandra Stefan (Boston University).
  • Stan Sclaroff (Boston University).
  • George Kollios (Boston University).
  • Margrit Betke (Boston University).

3
Motivation: ASL Dictionary
4
Motivation: ASL Dictionary
  • Addresses the needs of a large community:
  • 500,000 to 2 million ASL users in the US.
  • ??? in the European Union.
  • Direct impact on the education of Deaf children:
  • Most are born to hearing parents and learn ASL at
    school.
  • Challenging problems in vision, learning, and
    database indexing:
  • Large-scale motion-based video retrieval.
  • Efficient large-scale multiclass recognition.
  • Learning complex patterns from few examples.

5
Sources of Information
  • Hand motion.
  • Hand pose.
  • Shape.
  • Orientation.
  • Facial expressions.
  • Body pose.

6
Dynamic Gestures
  • What gesture did the user perform?

Class 8
7
Typical Motion Recognition Approach
[Pipeline: input sequence → Detector/Tracker → trajectory → Classifier → class 0]
8
Bottom-up Shortcoming
input frame
hand likelihood
  • Hand detection is often hard!
  • Color, motion, background subtraction are often
    not enough.
  • Bottom-up frameworks are a fundamental computer
    vision bottleneck.

9
Key Idea
hand candidates
input frame
  • Hand detection can return multiple candidates.
  • Design a recognition module for this type of
    input.

10
Nearest-Neighbor Recognition
[Figure: query gesture compared to database gesture models M1, M2, …, MN]
  • Question: how should we measure similarity?
11
Database Sequences
Example database gesture
  • Assumption: the hand location is known in all frames
    of the database gestures.
  • The database is built offline.
  • In the worst case, manual annotation.
  • Online user experience is not affected.

12
Comparing Trajectories
  • M(i) is the hand position at frame i.
  • Temporary assumption: the hand location is known.
  • How do we compare these trajectories?

13
Comparing Trajectories
  • Comparing i-th frame to i-th frame is
    problematic.
  • What do we do with frame 8?

14
Comparing Trajectories
  • Alignment: ((f1, g1), …, (fm, gm)).
  • Must include all frames of both sequences.
  • A frame can occur multiple consecutive times.

15
Comparing Trajectories
  • ((1,1), (1,2), (2,3), (3,4), (4,5), (5,6), (6,7),
    (7,8))
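The alignment rules from the previous slide (cover all frames of both sequences; a frame may repeat consecutively) can be checked mechanically. A minimal sketch with a hypothetical helper name, using 1-based (f, g) pairs as on the slides:

```python
def is_valid_alignment(pairs, m, n):
    """Check the slide's alignment rules for sequences of length
    m and n: start at (1, 1), end at (m, n), and advance f, g,
    or both by exactly one at each step (so every frame of both
    sequences is covered, and frames may repeat consecutively)."""
    if pairs[0] != (1, 1) or pairs[-1] != (m, n):
        return False
    for (f1, g1), (f2, g2) in zip(pairs, pairs[1:]):
        # allowed steps: repeat f, repeat g, or advance both
        if (f2 - f1, g2 - g1) not in {(0, 1), (1, 0), (1, 1)}:
            return False
    return True
```

For instance, the alignment shown on this slide passes the check for sequences of lengths 7 and 8.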

16
Optimal Alignment
  • The cost of ((f1, g1), …, (fm, gm)) has two terms:
  • Correspondence cost: the average cost of each (fi,
    gi).
  • Transition cost: the cost of two consecutive
    pairings.
  • Dynamic Time Warping (DTW) computes the optimal
    alignment.
  • Complexity: quadratic in the sequence lengths.

17
DTW
[Figure: DTW cost table; one axis indexes frames 1 … 50 … 80 of one sequence, the other frames 1 … 32 … 51 of the other]
  • For each cell (i, j):
  • Compute the optimal alignment of M(1:i) to Q(1:j).
  • The answer depends only on (i-1, j), (i, j-1), and
    (i-1, j-1).
  • Time complexity is proportional to the size of the
    table.
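A minimal sketch of this table computation (hypothetical function name; for brevity only the correspondence cost is used, omitting the transition-cost term from the previous slide):

```python
from math import hypot, inf

def dtw(M, Q):
    """Optimal DTW alignment cost between trajectories M and Q,
    given as lists of (x, y) hand positions, one per frame.
    D[i][j] holds the optimal cost of aligning M[:i] with Q[:j],
    so each cell depends only on (i-1, j), (i, j-1), (i-1, j-1)."""
    m, n = len(M), len(Q)
    D = [[inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # correspondence cost: Euclidean distance between positions
            cost = hypot(M[i-1][0] - Q[j-1][0], M[i-1][1] - Q[j-1][1])
            D[i][j] = cost + min(D[i-1][j], D[i][j-1], D[i-1][j-1])
    return D[m][n]
```

Identical trajectories give cost 0; a shorter trajectory can align to a longer one by repeating frames, as allowed by the alignment rules.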

18
DSTW
[Figure: DSTW cost volume; besides the model (M) and query (Q) axes, each query frame has K candidate hand locations, k = 1, 2, …, K]
  • Alignment: ((f1, g1, k1), …, (fm, gm, km)).
  • Matching cost: the average cost of each (fi, gi,
    ki).
  • Transition cost: the cost of two consecutive
    pairings.
  • How do we find the optimal alignment?

19
DSTW
[Figure: DSTW cost volume, as on the previous slide]
  • For each cell (i, j, k):
  • Compute the optimal alignment of M(1:i) to Q(1:j),
    using the k-th candidate for frame Q(j).
  • The answer depends on (i-1, j, k), and on (i, j-1, k')
    and (i-1, j-1, k') for any candidate k'.
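A minimal sketch of this recurrence (hypothetical names; as in the DTW sketch, only the matching cost is used, with no transition term). Each query frame carries its own list of candidate hand locations, and back-pointers recover the chosen candidate per frame:

```python
from math import hypot

def dstw(M, Qc):
    """M: model hand positions [(x, y), ...], one per frame.
    Qc: for each query frame, a list of candidate (x, y) locations.
    Returns (optimal cost, chosen candidate index per query frame).
    D[i][j] maps candidate k to the best cost of aligning M[:i]
    with Q[:j] when candidate k is the hand in query frame j."""
    m, n = len(M), len(Qc)
    D = [[{} for _ in range(n + 1)] for _ in range(m + 1)]
    B = [[{} for _ in range(n + 1)] for _ in range(m + 1)]  # back-pointers
    D[0][0][0] = 0.0  # dummy start cell
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            for k, (x, y) in enumerate(Qc[j - 1]):
                cost = hypot(M[i - 1][0] - x, M[i - 1][1] - y)
                # predecessors: (i-1, j, k) keeps the same candidate;
                # (i, j-1, k') and (i-1, j-1, k') may use any candidate.
                preds = [(i - 1, j, k)] if k in D[i - 1][j] else []
                preds += [(i, j - 1, kk) for kk in D[i][j - 1]]
                preds += [(i - 1, j - 1, kk) for kk in D[i - 1][j - 1]]
                if not preds:
                    continue
                p = min(preds, key=lambda t: D[t[0]][t[1]][t[2]])
                D[i][j][k] = cost + D[p[0]][p[1]][p[2]]
                B[i][j][k] = p
    k = min(D[m][n], key=D[m][n].get)
    total, chosen, (i, j) = D[m][n][k], {}, (m, n)
    while (i, j) != (0, 0):  # walk back-pointers to the start
        chosen[j] = k
        i, j, k = B[i][j][k]
    return total, [chosen[j] for j in range(1, n + 1)]
```

Backtracking through the volume yields exactly the "hand locations for free" of the next slide: the candidate selected for each query frame.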

20
DSTW
[Figure: DSTW cost volume with the optimal alignment path highlighted]
  • Result: the optimal alignment
  • ((f1, g1, k1), (f2, g2, k2), …, (fm, gm, km)).
  • We get hand locations for free!

21
Application: Gesture Recognition with Short Sleeves!
22
Experiment: 10 Digits.
23
Experiment: 10 Digits.
  • Test set: 90 gestures, from 3 users.
  • Database: 90 gestures, from 3 users.
  • Each test gesture was matched only to the 60
    examples from the other users.
  • Accuracy: 91%.

24
Discussion
  • The higher-level module (recognition) is tolerant to
    lower-level (detection) ambiguities.
  • Recognition disambiguates detection.
  • This is important for designing plug-and-play
    modules.
  • Use in the ASL dictionary:
  • The user signs an unknown word in front of the
    computer.
  • Video sequences of signs are ranked in order of
    DSTW score.

25
Static Gestures (Hand Poses)
  • Given a hand model and a single image of a hand,
    estimate:
  • 3D hand shape (joint angles).
  • 3D hand orientation.

Joints
Input image
Articulated hand model
26
Static Gestures
  • Given a hand model and a single image of a hand,
    estimate:
  • 3D hand shape (joint angles).
  • 3D hand orientation.

Input image
Articulated hand model
27
Similarity Based Matching
  • Goal:
  • Estimate the class of the query gesture q.
  • Method:
  • Find the most similar database gestures.

q
query gesture
database
28
Problems
  • How do we measure similarity?
  • Must tolerate errors in feature extraction:
  • Hand detection and segmentation.
  • How do we achieve efficient retrieval?
  • Efficient approximations of slow similarity
    measures.

q
query gesture
database
29
Goal: Hand Tracking Initialization
  • Given the 3D hand pose in the previous frame,
    estimate it in the current frame.
  • Problem: there is no good way to automatically
    initialize a tracker.
  • Rehg et al. (1995), Heap et al. (1996),
    Shimada et al. (2001),
  • Wu et al. (2001), Stenger et al. (2001), Lu
    et al. (2003), …

30
Assumptions in Our Approach
  • A few tens of distinct hand shapes.

31
Assumptions in Our Approach
  • A few tens of distinct hand shapes.
  • All 3D orientations should be allowed.
  • Motivation: American Sign Language.

32
Assumptions in Our Approach
  • A few tens of distinct hand shapes.
  • All 3D orientations should be allowed.
  • Motivation: American Sign Language.
  • Input: a single image, plus a bounding box of the
    hand.

33
Assumptions in Our Approach
input image
skin detection
segmented hand
  • We do not assume precise segmentation!
  • No clean contour is extracted.

34
Approach: Database Search
  • Over 100,000 computer-generated images.
  • Known hand pose.

input
35
Why?
  • We avoid direct estimation of 3D info.
  • With a database, we only match 2D to 2D.
  • We can find all plausible estimates.
  • Hand pose is often ambiguous.

input
36
Building the Database
26 hand shapes
37
Building the Database
4,128 images are generated for each hand
shape. Total: 107,328 images.
38
Features: Edge Pixels
  • We use edge images.
  • Easy to extract.
  • Stable under illumination changes.

input
edge image
39
Similarity Measure: Chamfer Distance
input
model
Overlaying input and model
How far apart are they?
40
Directed Chamfer Distance
  • Input: two sets of points,
  • red and green.
  • c(red, green):
  • The average distance from each red point to the
    nearest green point.

41
Directed Chamfer Distance
  • Input: two sets of points,
  • red and green.
  • c(red, green):
  • The average distance from each red point to the
    nearest green point.
  • c(green, red):
  • The average distance from each green point to the
    nearest red point.

42
Chamfer Distance
  • Input: two sets of points,
  • red and green.
  • c(red, green):
  • The average distance from each red point to the
    nearest green point.
  • c(green, red):
  • The average distance from each green point to the
    nearest red point.

Chamfer distance: C(red, green) = c(red, green) +
c(green, red)
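The two directed distances and their sum can be written directly (hypothetical function names; brute-force nearest-point search over 2D point sets):

```python
from math import hypot

def directed_chamfer(A, B):
    """c(A, B): average distance from each point of A to its
    nearest point of B.  A and B are lists of (x, y) points."""
    return sum(min(hypot(a[0] - b[0], a[1] - b[1]) for b in B)
               for a in A) / len(A)

def chamfer(A, B):
    """Symmetric chamfer distance C(A, B) = c(A, B) + c(B, A)."""
    return directed_chamfer(A, B) + directed_chamfer(B, A)
```

Note that the directed distance is asymmetric, which is why both directions are summed; in practice the nearest-point lookup is usually accelerated with a distance transform of the edge image.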
43
Evaluating Retrieval Accuracy
  • A database image is a correct match for the input
    if:
  • the hand shapes are the same, and
  • the 3D hand orientations differ by at most 30 degrees.
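The 30-degree criterion can be evaluated from the two 3x3 rotation matrices describing the hand orientations. A sketch (hypothetical name), using the identity trace(R1ᵀR2) = 1 + 2 cos θ for the relative rotation angle θ:

```python
from math import acos, degrees

def orientation_difference_deg(R1, R2):
    """Angle in degrees of the relative rotation between two 3x3
    rotation matrices (lists of lists), from the identity
    trace(R1^T R2) = 1 + 2 cos(theta)."""
    # trace(R1^T R2) equals the elementwise sum of products
    t = sum(R1[i][j] * R2[i][j] for i in range(3) for j in range(3))
    c = max(-1.0, min(1.0, (t - 1.0) / 2.0))  # clamp for safety
    return degrees(acos(c))
```

A database image would then count as a correct match when the hand shapes agree and this angle is at most 30 degrees.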

correct matches
input
incorrect matches
44
Evaluating Retrieval Accuracy
  • An input image has 25-35 correct matches among
    the 107,328 database images.
  • Ground truth for input images is estimated by
    humans.

45
Evaluating Retrieval Accuracy
  • Retrieval accuracy measure: what is the rank of
    the highest-ranking correct match?
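This measure, and the cumulative percentages reported on a later slide, can be computed as follows (hypothetical names; ranks are 1-based):

```python
def rank_of_best_correct(ranked_ids, correct_ids):
    """1-based rank of the highest-ranking correct match in a
    ranked list of database image ids (None if none is present)."""
    correct = set(correct_ids)
    for rank, img in enumerate(ranked_ids, start=1):
        if img in correct:
            return rank
    return None

def cumulative_accuracy(ranks, cutoffs=(1, 10, 100)):
    """Fraction of test images whose best correct match falls
    within each rank cutoff (cf. the 1 / 1-10 / 1-100 rows)."""
    return {c: sum(r <= c for r in ranks) / len(ranks) for c in cutoffs}
```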

46
Evaluating Retrieval Accuracy
[Figure: input image and its database matches at ranks 1-6, with the highest-ranking correct match indicated]
47
Results on 703 Real Hand Images
Rank of highest-ranking correct match    Percentage of test images
1                                        15%
1-10                                     40%
1-100                                    73%
48
Results on 703 Real Hand Images
Rank of highest-ranking correct match    Percentage of test images
1                                        15%
1-10                                     40%
1-100                                    73%
  • Results are better on nicer images:
  • Dark background.
  • Frontal view.
  • For half of the images, the top match was correct.

49
Examples
segmented hand
edge image
initial image
correct match
rank 1
50
Examples
segmented hand
edge image
initial image
correct match
rank 644
51
Examples
segmented hand
edge image
initial image
incorrect match
rank 1
52
Examples
segmented hand
edge image
initial image
correct match
rank 1
53
Examples
segmented hand
edge image
initial image
correct match
rank 33
54
Examples
segmented hand
edge image
initial image
incorrect match
rank 1
55
Examples
segmented hand
edge image
hard case
segmented hand
edge image
easy case
56
Discussion
  • 3D pose estimation from a single image is hard!
  • What is our system good for?
  • Cleanly segmented frontal views.
  • Generating hypotheses that domain
    knowledge/constraints can disambiguate.
  • Tracker initialization and error recovery.
  • How would our system be integrated with a tracker?

57
Research Directions
  • More accurate similarity measures.
  • Problem: higher-level features are more
    informative, but harder to compute.
  • Better tolerance to segmentation errors:
  • Clutter.
  • Incorrect scale and translation.

58
Current Work: ASL Dictionary
59
Current Work: ASL Dictionary
  • Computer vision challenge:
  • Estimate hand pose and motion accurately and
    fast.
  • Our existing hand pose method leaves many
    questions unanswered.
  • Machine learning challenge:
  • Currently, in DSTW, there is no learning.
  • Learn models of signs:
  • 4,000 classes, 1-5 examples per sign.
  • Data mining challenge:
  • Indexing methods for large numbers of classes.

60
  • Comments, questions, complaints:
  • E-mail: athitsos@uta.edu
  • Web: http://crystal.uta.edu/athitsos/
