1
Multi-Level Particle Filter Fusion of Features
and Cues for Audio-Visual Person Tracking
  • Keni Bernardin, Tobias Gehrig, Rainer
    Stiefelhagen
  • Universität Karlsruhe
  • keni@ira.uka.de, tgehrig@ira.uka.de,
    stiefel@ira.uka.de
  • 8.5.2007

2
Particle Filter-based Fusion
  • Uses low-level features
    – foreground segmentation maps
    – image pixel colors
  • and high-level cues
    – upper body detections
    – person regions from blob tracking
    – SLOC estimates
  • The algorithm keeps one separate particle filter
    per person track.

  • A particle represents one point on a person.
    Only 75 particles are used per track. Particles
    are scored on observed features and penalized
    for proximity to other tracks (track repulsion)
  • Sampling Importance Resampling (SIR)
  • Propagation: 2 sets of particles with low/high
    dynamics (max 3738)
  • No specific room knowledge used (dimensions,
    objects, background)
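The per-track SIR loop above (propagate, score on features, penalize by track repulsion, resample) can be sketched as follows. This is a generic sketch, not the authors' implementation: the 2D particle layout, noise level, and score function are illustrative assumptions.

```python
import numpy as np

def sir_step(particles, weights, score_fn, noise_std=0.05, repulsion_fn=None):
    """One Sampling Importance Resampling step for a single person track.

    particles: (N, d) array of points on the person (layout assumed).
    score_fn:  maps particles -> non-negative feature scores.
    repulsion_fn: optional penalty for proximity to other tracks.
    """
    # Propagate: diffuse particles with Gaussian dynamics noise.
    particles = particles + np.random.normal(0.0, noise_std, particles.shape)
    # Score on observed features, penalize proximity to other tracks.
    w = score_fn(particles)
    if repulsion_fn is not None:
        w = w * repulsion_fn(particles)
    w = np.maximum(w, 1e-12)
    w = w / w.sum()
    # Systematic resampling proportional to weight.
    idx = np.searchsorted(np.cumsum(w),
                          (np.arange(len(w)) + np.random.rand()) / len(w))
    return particles[idx], np.full(len(w), 1.0 / len(w))
```

With 75 particles per track, as on the slide, repeated `sir_step` calls concentrate the particle cloud on high-scoring regions.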

3
Upper Body Detectors
  • Boosted classifier cascades based on
    Haar-features to detect upper body region (only
    corner cameras).
  • Only standard cascades implemented in OpenCV. No
    adaptation or tuning to CHIL rooms.
  • Entire image is scanned for all corner cameras.
    This is time consuming factor! (approx. 10-12fps,
    640x480). Without detections RT factor 1.48

  • The inside rectangle is used to build the upper
    body histogram, the outside rectangle for the
    initial background histogram (used by scout
    trackers)
  • Reprojection of detections into the 3D scene
  • 3D location estimated from window position and
    width
  • Location uncertainty also computed, expressed
    as a covariance matrix
  • A similar procedure for detected person regions
    (top camera) and SLOC estimates.
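Estimating a 3D location from window position and width can be sketched with a pinhole camera model: depth follows from an assumed physical body width and the detection window's pixel width. The focal length, principal point, body width, and pixel-noise figure below are all assumed values, not the authors' calibration.

```python
import numpy as np

def reproject_detection(u, v, w_px, f=800.0, cx=320.0, cy=240.0,
                        body_width_m=0.55):
    """Estimate a 3D location (camera frame) from the detection window's
    centre (u, v) and pixel width w_px, via a pinhole model."""
    z = f * body_width_m / w_px      # depth from apparent width
    x = (u - cx) * z / f
    y = (v - cy) * z / f
    return np.array([x, y, z])

def detection_covariance(z, sigma_px=5.0, f=800.0, body_width_m=0.55):
    """Rough location uncertainty as a covariance matrix: pixel noise
    scaled into metres, larger along depth (hypothetical diagonal model)."""
    lateral = (sigma_px * z / f) ** 2
    depth = (sigma_px * z**2 / (f * body_width_m)) ** 2
    return np.diag([lateral, lateral, depth])
```

The depth term grows quadratically with distance, which is why the covariance matters when fusing detections from several views.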

4
Person Regions / Foreground
  • Adaptive background modeling (10 learning
    frames, then run-on), foreground segmentation
    using a fixed threshold (all camera views)
  • Person region tracking (top camera only)
  • Extraction of FG blobs
  • Initialization/deletion of person models
    (x, y, radius) based on FG blob support
  • EM adaptation of model parameters based on
    spatial overlap
  • Reprojection into the 3D scene
  • RT factor 0.9 (could be much faster).
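A minimal sketch of the run-on adaptive background model and fixed-threshold segmentation; the learning rate and threshold below are assumptions, not the slide's values.

```python
import numpy as np

def update_background(bg, frame, alpha=0.02):
    """Run-on adaptive background: per-pixel exponential moving average."""
    return (1.0 - alpha) * bg + alpha * frame.astype(float)

def foreground_mask(bg, frame, thresh=25.0):
    """Fixed-threshold foreground segmentation on the absolute
    background difference."""
    return np.abs(frame.astype(float) - bg) > thresh
```

FG blobs would then be extracted as connected components of the mask and matched against the (x, y, radius) person models.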


5
Upper Body Color Histograms
  • Modified HSV cone for more compact color
    histograms
  • v = var (max 10 bins)
  • s = sat · var (max 10 bins)
  • h = hue · sat · var (max 16 bins)
  • Adaptation of upper body histograms upon a
    (matching) upper body detection or SLOC
    estimate (Mahalanobis distance using the 3D
    detection location and covariance matrix).
  • Views where no detection was made use the
    average histogram of the other views
  • Continuous adaptation of the top-view histogram
    and of all backgrounds
  • Upper body histograms are filtered
    independently for all views using the
    background
  • H_filt = minmax(H_body) · (1 − minmax(H_bg))

6
SLOC Estimates
  • JPDAF acoustic tracker output is used
  • The acoustic estimate's 3D position and
    localization uncertainty (covariance matrix)
    are used to score particles
  • As with detections: adaptation of the matching
    tracks' color histograms upon a SLOC estimate
  • Simulation of the upper body classifier cascade
    detection window for corner views
  • Simulation of the person region detection
    circle for the top view

  • SLOC estimates are used as high-level cues.
    Just as other features, they serve to score
    particles and therefore initialize or maintain
    tracks, update positions, etc. (feature-level
    fusion of modalities)
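Scoring particles against an estimate that carries a 3D position and a covariance matrix can be sketched as a Gaussian likelihood of the Mahalanobis distance; the exponential form is an assumption, not the authors' stated scoring function.

```python
import numpy as np

def mahalanobis_score(particles, est_xyz, est_cov):
    """Score (N, 3) particles against a 3D detection / SLOC estimate:
    Gaussian likelihood exp(-0.5 * d^T Cov^-1 d)."""
    inv = np.linalg.inv(est_cov)
    d = particles - est_xyz
    m2 = np.einsum('ni,ij,nj->n', d, inv, d)   # squared Mahalanobis distance
    return np.exp(-0.5 * m2)
```

The same distance can gate histogram adaptation: only tracks whose particles score highly against the estimate adapt their color models.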

7
Track Creation / Deletion
  • "Scout" tracks scan the room, scored on FG +
    detection proximity (+ detection color) +
    person region overlap
  • A scout track is validated when
  • Particle spread is small (std deviation < 60 cm)
    AND
  • Average particle score is above the activation
    threshold AND
  • Upper body histograms (sampled at projected
    particle positions) are balanced across all
    corner views (Bhattacharyya distance) AND
    dissimilar to the histogram of the neighboring
    background (sampled from a 60 cm circle around
    the track center) in the respective view
  • Tracks are invalidated when
  • Particle spread > 90 cm OR
  • Average (body color + detection + person
    region) score is below the deactivation
    threshold
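The spread and score conditions, plus the Bhattacharyya distance used for the histogram checks, can be sketched as follows. The spread definition and the activation threshold are assumptions; only the 60 cm spread limit comes from the slide.

```python
import numpy as np

def bhattacharyya(h1, h2):
    """Bhattacharyya distance between two histograms (normalized inside)."""
    h1 = np.asarray(h1, dtype=float) / np.sum(h1)
    h2 = np.asarray(h2, dtype=float) / np.sum(h2)
    bc = np.sum(np.sqrt(h1 * h2))            # Bhattacharyya coefficient
    return np.sqrt(max(0.0, 1.0 - bc))

def validate_scout(particles, scores, activation=0.5, max_spread=0.60):
    """Spread + score conditions for validating a scout track."""
    spread = np.sqrt(np.mean(np.var(particles, axis=0)))  # ~std deviation, m
    return bool(spread < max_spread and np.mean(scores) > activation)
```

The histogram conditions would add: small Bhattacharyya distance between views (balanced) and large distance to the neighboring-background histogram (dissimilar).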


8
JPDAF-based Acoustic Tracker
  • GCC-PHAT based time delay estimation
  • One observation vector per microphone array,
    using only time delays above a threshold
  • Tracking of multiple targets using a JPDAF
  • Only observations inside the validation region
    of a target are associated with that target
  • Internally maintained IEKFs are updated
    according to the probability that the
    observation originated from the respective
    targets
  • The current active speaker is selected based on
    the volume of the error ellipsoid given by the
    state error covariance matrix
  • New targets are created when an observation
    cannot be associated with the existing targets,
    and deleted when they do not initialize within
    a given time or have not been active for some
    time
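A common form of GCC-PHAT time delay estimation between two microphone signals (a generic sketch, not the authors' implementation; the sample rate is an assumption):

```python
import numpy as np

def gcc_phat(sig, ref, fs=16000, max_tau=None):
    """Estimate the time delay of `sig` relative to `ref` via the
    Generalized Cross-Correlation with PHAT weighting."""
    n = len(sig) + len(ref)                    # zero-pad to avoid wrap-around
    S = np.fft.rfft(sig, n=n)
    R = np.fft.rfft(ref, n=n)
    cross = S * np.conj(R)
    cross /= np.maximum(np.abs(cross), 1e-12)  # PHAT: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift  # peak lag in samples
    return shift / fs
```

The resulting delays (those above a threshold) form the per-array observation vectors fed to the JPDAF.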

9
UKA Top-View AV Tracker
  • Visual tracking
  • Adaptive foreground segmentation with a fixed
    threshold
  • Person models (x, y, radius)
  • Expectation-Maximization approach for the
    assignment of blobs to models and the model
    parameter update
  • Creation of a new model when an unmatched FG
    blob exists, deletion of models which are
    unsupported (a high time delay ensures greater
    stability)
  • Acoustic source localization: output of the
    Joint Probabilistic Data Association Filter
    system
  • Audio-visual fusion: fusion is done at decision
    level using a 3-state finite state machine,
    which selects or averages the video and audio
    tracks.


  • FSM states: audio matching video track /
    unmatched audio track / only video track
    available
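A minimal sketch of the 3-state decision-level fusion. The matching threshold and the exact behavior in each state (averaging on a match, falling back to the available modality otherwise) are assumptions beyond what the slide states.

```python
def fuse_av(video_track, audio_track, match_dist=0.5):
    """Decision-level fusion with three states: audio matching video track,
    unmatched audio track, only video track available.
    Tracks are (x, y) tuples in metres or None; match_dist is assumed."""
    if video_track is None:
        return ('unmatched_audio', audio_track)
    if audio_track is None:
        return ('video_only', video_track)
    dist = sum((v - a) ** 2 for v, a in zip(video_track, audio_track)) ** 0.5
    if dist < match_dist:
        # Audio matches the video track: average the two estimates.
        fused = tuple((v + a) / 2.0 for v, a in zip(video_track, audio_track))
        return ('audio_matching_video', fused)
    # Audio present but not matching any video track.
    return ('unmatched_audio', audio_track)
```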
10
Example Video 1

11
Example Video 2

12
Example Video 3

13
Thank you!