Title: 3D Scene Models
1. 3D Scene Models
6.870 Object recognition and scene understanding
Krista Ehinger
2. Questions
- What makes a good 3D scene model? How accurate
does it need to be?
- How far can you get with automatic surface
detection? Where do you need human input?
3. Modelling the scene
- Real scenes have way too many surfaces
4. Modelling the scene
5. Tour Into the Picture (TIP)
- Model the scene as 5 planes + foreground objects
- Easy implementation: planes/objects defined by
humans
Y. Horry, K.I. Anjyo and K. Arai. "Tour Into the
Picture: Using a spidery mesh user interface to
make animation from a single image". ACM SIGGRAPH
1997
6. TIP Implementation
- User defines vanishing point, rear wall of the
scene (inner rectangle)
- Given some assumptions about the camera,
position/size of all planes can be computed...
7. Defining the box
- Define planes: floor at y = 0, ceiling at y = H
- Given horizon (vanishing point), corners of
floor, ceiling can be computed from 2D image
position
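As a rough illustration of how a 2D image position becomes a 3D corner, the sketch below back-projects a pixel onto a horizontal plane (floor y = 0 or ceiling y = H). It assumes a simple pinhole camera with a level optical axis, so the horizon sits at the principal-point row. The function name `backproject_to_plane` and the parameters `cam_height`, `focal_px`, `cx`, `cy` are illustrative assumptions, not taken from the TIP paper.

```python
def backproject_to_plane(u, v, plane_y, cam_height, focal_px, cx, cy):
    """Intersect the viewing ray through pixel (u, v) with the plane y = plane_y.

    Assumed setup (not from the slides): pinhole camera at (0, cam_height, 0),
    looking along +z, image rows growing downward, horizon at row cy.
    """
    # Ray direction with dz = 1; image v points down, world y points up.
    dx = (u - cx) / focal_px
    dy = -(v - cy) / focal_px
    if dy == 0:
        raise ValueError("pixel lies on the horizon; ray never meets the plane")
    t = (plane_y - cam_height) / dy  # because dz = 1, t is also the depth z
    if t <= 0:
        raise ValueError("plane is not visible at this pixel")
    return (t * dx, plane_y, t)


# A floor corner seen 100 rows below the horizon, 100 columns right of centre.
floor_corner = backproject_to_plane(420, 340, plane_y=0.0, cam_height=1.6,
                                    focal_px=500.0, cx=320.0, cy=240.0)
```

Pixels below the horizon land on the floor and pixels above it land on the ceiling; the same call with `plane_y=H` gives the ceiling corners.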
8. Defining the box
- Once the positions of the planes are known,
compute the texture of the planes
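One way to compute a plane's texture is to map each texel of a flat square texture back into the image quadrilateral that the plane occupies. The sketch below uses the standard closed-form projective map from the unit square to a quadrilateral and nearest-neighbour sampling. The helper names (`square_to_quad`, `texel_to_pixel`, `rectify`) and the image-as-nested-list representation are assumptions for illustration; the slides do not say how TIP resamples textures.

```python
def square_to_quad(quad):
    """Coefficients of the projective map from the unit square to `quad`.

    `quad` lists image corners matching texture corners (0,0), (1,0), (1,1), (0,1).
    """
    (x0, y0), (x1, y1), (x2, y2), (x3, y3) = quad
    dx1, dx2, sx = x1 - x2, x3 - x2, x0 - x1 + x2 - x3
    dy1, dy2, sy = y1 - y2, y3 - y2, y0 - y1 + y2 - y3
    det = dx1 * dy2 - dx2 * dy1
    if det == 0:
        raise ValueError("degenerate quadrilateral")
    g = (sx * dy2 - dx2 * sy) / det
    h = (dx1 * sy - sx * dy1) / det
    a, b, c = x1 - x0 + g * x1, x3 - x0 + h * x3, x0
    d, e, f = y1 - y0 + g * y1, y3 - y0 + h * y3, y0
    return (a, b, c, d, e, f, g, h)


def texel_to_pixel(coeffs, s, t):
    """Map texture coordinates (s, t) in [0, 1]^2 to image coordinates."""
    a, b, c, d, e, f, g, h = coeffs
    w = g * s + h * t + 1.0
    return ((a * s + b * t + c) / w, (d * s + e * t + f) / w)


def rectify(image, quad, size):
    """Nearest-neighbour resample of `quad` into a size x size texture."""
    coeffs = square_to_quad(quad)
    texture = []
    for row in range(size):
        out_row = []
        for col in range(size):
            x, y = texel_to_pixel(coeffs, (col + 0.5) / size, (row + 0.5) / size)
            # Clamp to the image so slightly out-of-range samples stay valid.
            xi = min(max(int(round(x)), 0), len(image[0]) - 1)
            yi = min(max(int(round(y)), 0), len(image) - 1)
            out_row.append(image[yi][xi])
        texture.append(out_row)
    return texture


# A floor seen in perspective: narrow far edge at the top, wide near edge below.
floor_quad = [(2, 0), (6, 0), (8, 4), (0, 4)]
coeffs = square_to_quad(floor_quad)
```

Sampling from texture to image (rather than the other way round) guarantees that every texel receives a value.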
9. What about foreground objects?
- Assume a quadrangle attached to floor, compute
attachment points, upper points
- Hierarchical model of foreground objects
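The attachment-point idea can be sketched as follows: because the quadrangle's lower edge touches the floor, its image row fixes the object's depth, and the upper edge at the same depth then gives its height. This is a minimal sketch under an assumed level pinhole camera; `billboard_from_pixels` and its parameters are hypothetical names, not from the paper.

```python
def billboard_from_pixels(v_bottom, v_top, cam_height, focal_px, cy):
    """Depth and height of a vertical quadrangle standing on the floor (y = 0).

    Assumes a level pinhole camera at height cam_height, with the horizon at
    image row cy and rows growing downward.
    """
    if v_bottom <= cy:
        raise ValueError("attachment point must lie below the horizon")
    # The attachment point is on the floor, which fixes the object's depth.
    depth = cam_height * focal_px / (v_bottom - cy)
    # The upper edge shares that depth, so its row gives the object's height.
    height = cam_height - depth * (v_top - cy) / focal_px
    return depth, height


depth, height = billboard_from_pixels(v_bottom=340, v_top=190,
                                      cam_height=1.6, focal_px=500.0, cy=240.0)
```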
10. Extracting foreground objects
- Foreground objects removed, added to mask
- Holes in background filled in using photo
completion software
11. TIP Demonstration
12. TIP Discussion
- Pros
  - Accurate model (due to human input)
  - Deals with foreground objects, occlusions
- Cons
  - Requires human input, not automatic
  - Model too simple for many real-world scenes
13. Modelling the scene
- Option 2: Pop-up book world
14. Automatic Photo Pop-Up
- Three classes of surface: ground, sky, vertical
- Not just a box: can model more kinds of scenes
- Automatic classification, no manual labeling
D. Hoiem, A.A. Efros, and M. Hebert, "Automatic
Photo Pop-up", ACM SIGGRAPH 2005.
15. Photo Pop-Up Implementation
- Pixels → superpixels → constellations
- Automatic labeling of constellations as ground,
vertical, or sky
- Define angles of vertical planes (using
attachment to ground)
- Map textures to vertical planes (as in TIP)
16. Superpixels, constellations
- Superpixels are neighboring pixels that have
nearly the same color (Tao et al, 2001)
- Superpixels assigned to constellations according
to how likely they are to share a label (ground,
vertical, sky) based on difference between
feature vectors
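A minimal sketch of the grouping step, assuming a greedy scheme: seed each constellation with one superpixel, then add every remaining superpixel to the constellation it is most likely to share a label with. The function `form_constellations`, the matrix `same_label_prob`, and the greedy rule itself are assumptions for illustration; the slide only says that assignment depends on same-label likelihood.

```python
def form_constellations(same_label_prob, order, n_constellations):
    """Greedily group superpixels using pairwise same-label likelihoods.

    same_label_prob[i][j] is the estimated probability that superpixels i and j
    share a label. `order` is the visiting order (it could be random; it is
    fixed here so the example is repeatable).
    """
    # Seed each constellation with one superpixel.
    groups = [[i] for i in order[:n_constellations]]
    for i in order[n_constellations:]:
        # Score each constellation by its mean likelihood of matching superpixel i.
        scores = [sum(same_label_prob[i][j] for j in g) / len(g) for g in groups]
        groups[scores.index(max(scores))].append(i)
    return groups


# Four superpixels: 0 and 1 look alike, 2 and 3 look alike.
P = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
groups = form_constellations(P, order=[0, 2, 1, 3], n_constellations=2)
```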
17. Feature vectors
- Color features: RGB, hue, saturation
- Texture features: Difference of oriented
Gaussians, Textons
- Location (absolute and percentile)
- Number of superpixels in constellation
- Line and intersection detectors
- Not used: constellation shape (contiguous, number
of sides), some texture features
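To make the color and location entries concrete, here is a small sketch that builds part of such a feature vector for one superpixel using only the Python standard library. The input layout (a list of `(x, y, (r, g, b))` tuples with channels in [0, 1]) and the name `superpixel_features` are assumptions; texture, line, and intersection features are left out.

```python
import colorsys


def superpixel_features(pixels, width, height):
    """Mean colour and normalised location features for one superpixel.

    `pixels` is a list of (x, y, (r, g, b)) with colour channels in [0, 1].
    """
    n = len(pixels)
    mean_r = sum(p[2][0] for p in pixels) / n
    mean_g = sum(p[2][1] for p in pixels) / n
    mean_b = sum(p[2][2] for p in pixels) / n
    # Hue and saturation of the mean colour (value is dropped).
    hue, sat, _ = colorsys.rgb_to_hsv(mean_r, mean_g, mean_b)
    # Location of the superpixel centroid as a fraction of the image size.
    x_norm = sum(p[0] for p in pixels) / n / width
    y_norm = sum(p[1] for p in pixels) / n / height
    return [mean_r, mean_g, mean_b, hue, sat, x_norm, y_norm]


red_patch = [(0, 0, (1.0, 0.0, 0.0)), (2, 4, (1.0, 0.0, 0.0))]
features = superpixel_features(red_patch, width=4, height=8)
```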
18. Training process
- For each of 82 labeled training images:
  - Compute superpixels, features, pairwise
  likelihoods
  - Form a set of N constellations (N = 3 to 25),
  each labeled with ground truth
  - Compute constellation features
  - Compute constellation label, homogeneity
  likelihood
19. Training process
- Adaboost weak classifiers learn to estimate
whether superpixels have same label (based on
feature vector)
- Another set of Adaboost weak classifiers learns
constellation label, homogeneity likelihood
(expressed as percent ground, vertical, sky,
mixed)
- Emphasis on classifying larger constellations
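For readers unfamiliar with boosting, the sketch below shows textbook discrete AdaBoost with one-feature threshold stumps, applied to a toy "same label or not" problem where the single feature is the difference between two superpixels' feature vectors. This is an assumed stand-in: the slides do not specify which weak learners or AdaBoost variant the paper uses, and all names here are illustrative.

```python
import math


def stump_predict(x, feature, threshold, polarity):
    """Weak classifier: +polarity if x[feature] >= threshold, else -polarity."""
    return polarity if x[feature] >= threshold else -polarity


def train_stump(X, y, w):
    """Return (weighted error, feature, threshold, polarity) of the best stump."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({x[j] for x in X}):
            for pol in (1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict(xi, j, thr, pol) != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best


def adaboost(X, y, rounds=5):
    """Train a weighted vote of stumps; labels in y must be +1 or -1."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, j, thr, pol = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, j, thr, pol))
        # Increase the weight of misclassified examples, then renormalise.
        w = [wi * math.exp(-alpha * yi * stump_predict(xi, j, thr, pol))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def predict(model, x):
    score = sum(alpha * stump_predict(x, j, thr, pol) for alpha, j, thr, pol in model)
    return 1 if score >= 0 else -1


# Toy data: small feature difference -> same label (+1), large -> different (-1).
X = [[0.1], [0.2], [0.8], [0.9]]
y = [1, 1, -1, -1]
model = adaboost(X, y)
```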
20. Building the 3D model
- Along vertical/ground boundary, fit line segments
(Hough transform): goal is to find simplest
shape (fewest lines)
- Project lines up from corners of boundary lines,
cut and fold
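The voting step of a Hough transform can be sketched in a few lines: every boundary point votes for all (theta, rho) lines that pass through it, and the bin with the most votes is the best-supported line. This sketch covers only that step and returns a single line; choosing the fewest segments that explain the whole boundary is not shown. The function name and the bin sizes are assumptions.

```python
import math


def hough_best_line(points, n_theta=180, rho_step=1.0):
    """Return (theta, rho, votes) of the strongest line through `points`.

    Lines use the normal form x*cos(theta) + y*sin(theta) = rho.
    """
    votes = {}
    for x, y in points:
        for k in range(n_theta):
            theta = k * math.pi / n_theta
            rho_bin = round((x * math.cos(theta) + y * math.sin(theta)) / rho_step)
            votes[(k, rho_bin)] = votes.get((k, rho_bin), 0) + 1
    (k, rho_bin), count = max(votes.items(), key=lambda item: item[1])
    return k * math.pi / n_theta, rho_bin * rho_step, count


# Boundary points lying on the horizontal image line y = 5.
boundary = [(x, 5) for x in range(0, 100, 10)]
theta, rho, count = hough_best_line(boundary)
```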
21. Photo Pop-Up Demonstration
22. Photo Pop-Up Discussion
- Pros
  - Automatic
  - Can handle a variety of scenes, not just boxes
- Cons
  - No handling of foreground objects
  - Misclassification leads to very strange models
  - Only 2 kinds of surface: ground, vertical
23. Modelling the scene
- Option 3: Actually try to model surface angles
24. 3D Scene Structure from Still Image
- Compute surface normal for each surface
- No right-angle assumptions: surfaces can have any
angle
- Automatic (trained on images with known depth
maps)
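One convenient way to represent a surface with arbitrary orientation is a single 3-vector `alpha` such that every point `q` on the plane satisfies alpha · q = 1. The direction of `alpha` is the surface normal and 1/|alpha| is the plane's distance from the camera. The sketch below shows how depth along a viewing ray follows from this. The parameterization and function names are assumptions made for illustration; the slides do not state how planes are encoded.

```python
import math


def depth_along_ray(alpha, ray):
    """Depth t at which the unit `ray` from the camera hits the plane alpha . q = 1."""
    dot = sum(a * r for a, r in zip(alpha, ray))
    if dot <= 0:
        raise ValueError("ray does not hit the plane in front of the camera")
    # A point on the ray is q = t * ray, so alpha . (t * ray) = 1 gives t = 1 / dot.
    return 1.0 / dot


def normal_and_distance(alpha):
    """Split alpha into a unit surface normal and the plane's distance from the camera."""
    norm = math.sqrt(sum(a * a for a in alpha))
    return [a / norm for a in alpha], 1.0 / norm


# A fronto-parallel wall 4 units in front of the camera (the plane z = 4).
wall = (0.0, 0.0, 0.25)
straight_ahead = depth_along_ray(wall, (0.0, 0.0, 1.0))
off_axis = depth_along_ray(wall, (0.6, 0.0, 0.8))
```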
25. 3D Scene Implementation
- Segment image into superpixels
- Estimate surface normal of each superpixel (using
Markov Random Field model)
- Optional: Detect and extract foreground objects
- Map textures to planes
[Figure: original image and modeled depth map]
A. Saxena, M. Sun, A. Y. Ng. "Learning 3-D Scene
Structure from a Single Still Image". In ICCV
workshop on 3D Representation for Recognition
(3dRR-07), 2007
26. Image features
- Superpixel features (x_i)
  - Color and texture features as in Photo Pop-Up
  - Vector also includes features of neighboring
  superpixels
- Boundary features (x_ij)
  - Color difference, texture difference, edge
  detector
27. Markov Random Field Model
- First term: model planes in terms of image
features of superpixels
- Second term: model planes in terms of pairs of
superpixels, with constraints...
28. Model constraints
- Connected structure: except where there is an
occlusion, neighboring superpixels are likely to
be connected
- Coplanar structure: except where there are folds,
neighboring superpixels are likely to lie on the
same plane
- Co-linearity: long straight lines in the image
correspond to straight lines in 3D
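The difference between "connected" and "coplanar" can be illustrated with a simple depth-gap measure between two neighboring planes. Evaluated along a ray through their shared boundary, a small gap means the surfaces are connected; evaluated along a ray through the interior of one superpixel, a small gap means they are also coplanar. This is a hedged sketch: the plane encoding (alpha · q = 1) and the absolute-depth-difference penalty are assumptions, not the exact terms used in the paper.

```python
def depth_gap(alpha_i, alpha_j, ray):
    """Absolute difference between the depths of two planes along one unit ray."""
    def depth(alpha):
        # Plane alpha . q = 1 meets the ray q = t * ray at t = 1 / (alpha . ray).
        return 1.0 / sum(a * r for a, r in zip(alpha, ray))
    return abs(depth(alpha_i) - depth(alpha_j))


wall = (0.0, 0.0, 0.25)         # the plane z = 4
side = (-0.25, 0.0, 0.25)       # the plane z - x = 4, meeting the wall along x = 0
boundary_ray = (0.0, 0.0, 1.0)  # passes through the shared edge
interior_ray = (0.6, 0.0, 0.8)  # passes through the interior of one superpixel

connected_cost = depth_gap(wall, side, boundary_ray)
coplanar_cost = depth_gap(wall, side, interior_ray)
```

Here the two planes form a fold: the connectivity cost is zero at the shared edge, while the coplanarity cost is large away from it.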
29. Foreground objects
- Automatically-detected foreground objects may be
removed from model (for example pedestrians,
using Dalal-Triggs detector)
- Detected objects add 3D cues (pedestrians are
basically vertical, occlude other surfaces)
30. 3D Scene Demonstration
31. Results
32. 3D Scene Discussion
- Pros
  - Handles a variety of scene types
  - Fairly accurate (about 2/3 of scenes correct)
  - Automatic
  - Handles foreground objects
- Cons
  - Still fails on 1/3 of scenes
33. Discussion
- Simple 3D models are adequate for many scenes
- You can get pretty far without human input (but
human annotation of scenes would still give better
results)
- Extensions?
  - Use photo completion techniques to handle
  occlusions?
  - Massive training sets → better 3D models?