1
Content-based Image Retrieval
2
  • Retrieval
  • The process of getting back information that has
    been stored in a database
  • Image Retrieval
  • User, search engine, and database

3
  • Text-based image retrieval
  • Use keywords or captions,
  • interpret with high-level semantics,
  • retrieve by matching the keywords

Retrieval result of Google Images for "Airplane"
4
  • However,
  • Labelling is time-consuming and expensive
  • Keywords only partially describe the visual
    content
  • Human subjectivity

"sky" (50 points), "bird" (60 points), "soaring"
(120 points), or "frigate bird" (150 points).
5
  • Content-based image retrieval
  • Annotators are replaced by computers
  • Keywords are replaced by low-level visual
    features (Color, texture, shape)
  • Retrieval by comparing the visual features
  • Semantic gap
  • A critical issue in content-based image retrieval
  • Low-level visual features used by computers
  • High-level concepts used by humans

6
  • Three retrieval modes
  • Target search (Pattern matching)
  • Search for a specific image
  • Category search (Most studied in CBIR)
  • Search for images from a particular class
  • Association search (Most difficult)
  • No specific target at the beginning of retrieval

Vincent van Gogh, Starry Night (1889)
7
  • The image domain in CBIR
  • Narrow domain
  • Limited and predictable variability in visual
    content
  • Domain knowledge can be very helpful
  • Semantic gap is small
  • Broad domain
  • Unlimited and unpredictable variability in visual
    content
  • Less domain knowledge for CBIR
  • Semantic gap is large

8
  • Visual features
  • Invariant vs. discriminative
  • Local vs. global
  • Color
  • Color spaces: RGB, Lab, HSV, ...
  • Color-based features: color histogram, color
    moments, ... (see the sketch after this list)
  • Texture
  • Co-occurrence matrix, Tamura texture
    representation, wavelet-transform-based
    representations, ...
  • Shape
  • Boundary-based or region-based
  • Chain code, Fourier descriptors, moment
    invariants, finite element method, ...
  • Salient features
  • Focus attention on the most salient fragments or
    objects
  • Local invariant features
  • Structure and layout
  • Spatial relationships between feature values,
    point sets, or object sets
  • Graph-based description

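A minimal sketch of one such color feature, assuming NumPy and an HxWx3 uint8 RGB array (the function name and bin count are illustrative, not from the slides):

```python
import numpy as np

def color_histogram(image, bins=8):
    """Global color feature: an 8x8x8-bin RGB histogram, L1-normalized."""
    # Quantize each channel into `bins` levels (256 / 8 = 32 values per bin).
    quantized = (image // (256 // bins)).reshape(-1, 3)
    # Flatten each pixel's 3-D bin index into a single histogram index.
    idx = quantized[:, 0] * bins * bins + quantized[:, 1] * bins + quantized[:, 2]
    hist = np.bincount(idx, minlength=bins ** 3).astype(np.float64)
    return hist / hist.sum()  # normalize so images of different sizes compare
```
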
9
  • Measure image similarity
  • Derive an interpretation from the features
  • Measure with a similarity function
  • Euclidean distance
  • Mahalanobis distance
  • Minkowski distance
  • Distances between histograms
  • Minkowski distance, histogram intersection,
    Kullback-Leibler divergence between
    distributions, chi-square statistics,
    quadratic-form distance, cumulative histogram
    distance, earth mover's distance, ... (see the
    sketch after this list)
  • Similarity of structural features, similarity of
    salient features, similarity of two shapes, ...

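A minimal sketch of three of these histogram measures, assuming L1-normalized NumPy histograms (function names are illustrative):

```python
import numpy as np

def histogram_intersection(h1, h2):
    # Similarity in [0, 1] for L1-normalized histograms: sum of bin-wise minima.
    return np.minimum(h1, h2).sum()

def chi_square_distance(h1, h2, eps=1e-10):
    # Chi-square statistic between two histograms; eps avoids division by zero.
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def minkowski_distance(h1, h2, p=2):
    # p = 1 gives the city-block distance, p = 2 the Euclidean distance.
    return np.sum(np.abs(h1 - h2) ** p) ** (1.0 / p)
```
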
A demonstration system
10
  • Learn a similarity metric
  • Narrow down the semantic gap
  • Adapt to different retrieval sessions
  • Deal with the subjectivity of human perception on
    visual content
  • Interaction between users and a retrieval system

11
  • Relevance feedback
  • Originates from text information retrieval
  • Judge the relevance of the initial retrieval
    result and refine the query
  • Explicit feedback
  • The user ranks or labels the retrieved documents
  • Implicit feedback
  • Whether the user read a document, and for how
    long
  • Pseudo feedback
  • Assume the top k retrieved documents are relevant
  • Rocchio algorithm (a sketch follows below)

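A minimal sketch of one Rocchio refinement step, assuming NumPy feature vectors; the weights alpha, beta, and gamma are conventional defaults, not values from the slides:

```python
import numpy as np

def rocchio_update(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Refine a query vector from user-labeled relevant/irrelevant examples."""
    new_query = alpha * query
    if len(relevant) > 0:
        new_query += beta * np.mean(relevant, axis=0)     # move toward relevant
    if len(irrelevant) > 0:
        new_query -= gamma * np.mean(irrelevant, axis=0)  # move away from irrelevant
    return new_query
```
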
12
  • Image retrieval with relevance feedback
  • Ask the user to evaluate the retrieved images and
    feed the evaluation back to the retrieval system
  • The system responds to the evaluation and gives a
    refined result
  • Explicit feedback dominates in CBIR
  • Relevant (positive) vs. irrelevant (negative)
    (most popular)
  • Multi-level score
  • Ranking
  • Relative judgment (image A is more relevant than
    image B)

13
  • What can be learned from user feedback
  • Refine the query
  • The importance of different features in retrieval
  • Combination of different visual features
  • Color, texture, shape, ...
  • The subspace where the relevant images lie
  • PCA, kernel PCA, MDS, ...
  • A classifier separating relevant images from
    irrelevant ones (see the sketch after this list)
  • Linear classifier (Fisher discriminant analysis)
  • Quadratic classifier (model with Gaussian
    distributions)
  • Neural networks
  • Support vector machines, kernel-based
    classifiers, boosting, ...

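A minimal sketch of the SVM option, assuming scikit-learn and pre-computed feature arrays; `relevant_features`, `irrelevant_features`, and `database_features` are hypothetical names:

```python
import numpy as np
from sklearn.svm import SVC

# Feedback from the current session: 1 = relevant, 0 = irrelevant.
X = np.vstack([relevant_features, irrelevant_features])
y = np.array([1] * len(relevant_features) + [0] * len(irrelevant_features))

# An RBF-kernel SVM as the relevance classifier; probability estimates
# let us re-rank the whole database by predicted relevance.
clf = SVC(kernel="rbf", probability=True).fit(X, y)
scores = clf.predict_proba(database_features)[:, 1]
ranking = np.argsort(-scores)  # most relevant images first
```
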
14
  • The small sample problem
  • Nobody is willing to label many images in a
    retrieval session
  • The scarcity of labeled images prevents efficient
    and reliable learning
  • The class of irrelevant images is too
    heterogeneous to be represented by a small number
    of labeled images
  • Existing ways to deal with the small sample
    problem
  • Make assumptions about data distributions
  • Perform dimensionality reduction
  • Active learning
  • Learning with unlabelled data (Semi-supervised
    learning)
  • Incorporate prior knowledge
  • Long-term relevance feedback learning

15
  • Image retrieval with active learning
  • Goal: learn the most but label the least
  • What should be presented to the user for
    labeling?
  • Most relevant images vs. most uncertain images
  • Active learning
  • A classifier, for example an SVM
  • An initial labeled data set
  • A pool of unlabelled data
  • A query function
  • Goal: maximize the classification accuracy with
    minimal labeling cost
  • Present the user with the most uncertain images
    (see the sketch after this list)

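A minimal sketch of the query function as uncertainty sampling with a margin classifier such as the SVM above (the function name is illustrative):

```python
import numpy as np

def most_uncertain(clf, unlabeled_features, batch_size=10):
    """Pick the unlabeled images the current classifier is least sure about."""
    # For a margin classifier, uncertainty is distance to the decision
    # boundary: |decision_function| near zero means "most uncertain".
    margins = np.abs(clf.decision_function(unlabeled_features))
    return np.argsort(margins)[:batch_size]  # indices to present for labeling
```
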
16
  • Incorporating prior knowledge into CBIR
  • Learning = prior knowledge (domain theory) +
    observation
  • Prior knowledge should be built into the design
    of a classifier
  • Knowledge for CBIR
  • PK1: For a randomly sampled unlabelled image,
    before seeing any additional evidence, its label
    is negative with higher probability
  • PK2: The negative image samples collected from
    relevance feedback do not sufficiently represent
    the true distribution of the negative image class
  • PK3: For a randomly sampled unlabelled image, its
    probability of being positive can be predicted
    based on the given query and the retrieval
    history

17
  • A general model of knowledge-based CBIR
  • Given
  • a training sample set D = {(xi, yi)},
  • a prior knowledge set K,
  • a classifier f
  • Find the classifier f* which best fits both D
    and K
  • That is,
    f* = argmin_f  sum_i L(f(xi), yi) + lambda R(f, K)
  • where L denotes a loss function and lambda is a
    regularization parameter weighting the prior
    knowledge term R(f, K)

18
  • Kernel biased discriminant analysis
  • When finding a discriminative projection
    direction, different strategies are applied to
    the relevant and irrelevant image classes
  • Relevant image samples are required to be well
    clustered after projection
  • Irrelevant image samples are only required to be
    far from the relevant images
  • S1: the scatter matrix of the irrelevant images
    about the mean of the relevant images
  • S2: the scatter matrix of the relevant images
    about the mean of the relevant images
    (see the sketch after this list)

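A minimal sketch of the linear core of this idea (plain biased discriminant analysis, without the kernel), assuming NumPy/SciPy arrays of relevant (`pos`) and irrelevant (`neg`) feature vectors; the regularizer is an assumption added to keep S2 invertible with few samples:

```python
import numpy as np
from scipy.linalg import eigh

def biased_discriminant_directions(pos, neg, n_dims=2, reg=1e-3):
    """Directions maximizing S1 (irrelevant scatter about the relevant mean)
    relative to S2 (relevant scatter about the relevant mean)."""
    mu = pos.mean(axis=0)
    d_neg, d_pos = neg - mu, pos - mu
    S1 = d_neg.T @ d_neg                               # irrelevant vs. relevant mean
    S2 = d_pos.T @ d_pos + reg * np.eye(pos.shape[1])  # relevant vs. relevant mean
    # Generalized eigenproblem S1 w = lambda S2 w: the largest eigenvalues
    # give projections where irrelevant samples scatter far while relevant
    # samples stay tightly clustered.
    vals, vecs = eigh(S1, S2)
    return vecs[:, np.argsort(vals)[::-1][:n_dims]]
```
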
19
  • Long-term learning from user feedback
  • The log of user feedback from different users and
    different retrieval sessions
  • It shows the consistency of human perception of
    the relationships between images
  • It also includes the subjectivity of human
    perception ---- the relationships may be quite
    noisy
  • Boost the accuracy of the initial retrieval
    result to a higher level, so that the user
    feedback is only for adjusting
  • Some research on using log data has appeared in
    recent years, but it is still at an initial stage
  • Advanced machine learning and data analysis tools
    are needed to handle the vast amount of noisy
    data and extract the information behind it

20
  • Concluding remarks on CBIR
  • CBIR has passed its early years and now
    concentrates on deeper problems
  • On narrow domains, the retrieval performance can
    be quite satisfactory; however, retrieval on
    broad domains still faces many problems
  • For generic CBIR, the human is an indispensable
    component of a retrieval system
  • Incorporating the information in log data will be
    a promising way to significantly improve
    retrieval performance
  • Learning to narrow down the semantic gap
  • Integration of text-based retrieval with
    content-based retrieval

21
Object retrieval in a supermarket --- Where Is
The WeetBix?
  • Efficiently locate an item in a supermarket
  • Formulated as an object retrieval problem
  • Query: an image of the item you are looking for
  • Database: images of all the shelves in a
    supermarket

22
(No Transcript)
23
  • Characteristics of this retrieval problem
  • Target search (Pattern matching)
  • Relatively narrow domain (images within a
    supermarket)
  • No relevance feedback
  • The query image only occupies a small area of the
    database image---all the other areas are actually
    clutter
  • Large scale difference between the query image
    and the database image
  • In each database image, there are often multiple
    copies of each item
  • Each database image is full of striking signs and
    patterns in all sorts of colors

How to find it?
24
  • A local invariant feature approach
  • Interest points/regions in an image
  • Invariant to scale and affine transformations
  • Robust to illumination and viewpoint changes
  • Corners, edges, ridges, multi-junctions,
    blob-like structures, ...
  • Repeatability
  • Descriptors of the detected regions
  • Each image becomes a bag of descriptors

(Mikolajczyk & Schmid, ECCV 2002)
25
Local invariant feature detection: the
Hessian-Affine detector
26
The detected image region
27
A close-up view
28
A close-up view
29
  • Describe the visual content in a detected region
  • The SIFT descriptor (Histogram of local
    gradients)

Example descriptor values: 29 0 1 2 0 0 8 24 132 1
26 14 89 1 0 ...
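A minimal sketch of extracting such descriptors with OpenCV; note that OpenCV's SIFT uses a difference-of-Gaussians detector rather than the Hessian-Affine detector shown earlier, and the filename is hypothetical:

```python
import cv2

# Detect interest regions and compute 128-d SIFT descriptors
# (OpenCV >= 4.4; older versions need the contrib build).
img = cv2.imread("shelf.jpg", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
# descriptors: an N x 128 array, one histogram-of-gradients row per region.
print(descriptors.shape)
```
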
30
A 2272-by-1704 image yields about 20,000 SIFT
features; the database of 3000 images yields 52
million SIFT features in total.
31
  • Each image becomes a bag of descriptors
  • How do we match the query with the database
    images?
  • A straightforward point-to-point comparison is
    computationally infeasible

32
  • Visual vocabulary
  • The descriptors from all images are clustered
    into k clusters. Each cluster is called a visual
    word.
  • Each image is projected to a k-dimensional
    histogram representing the distribution of the
    visual words in this image (see the sketch after
    this list)
  • A more efficient way to evaluate the similarity
    of images

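A minimal sketch of both steps, assuming scikit-learn and a pooled descriptor array `all_descriptors` (a hypothetical name); k is kept small here, unlike the 1M-word vocabulary on the following slides:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Step 1: cluster the pooled descriptors of all database images into
# k visual words (cluster centers).
k = 1000
kmeans = MiniBatchKMeans(n_clusters=k, batch_size=10_000).fit(all_descriptors)

# Step 2: project one image's bag of descriptors onto the vocabulary by
# assigning each descriptor to its nearest visual word and counting.
def to_histogram(descriptors):
    words = kmeans.predict(descriptors)
    return np.bincount(words, minlength=k).astype(np.float64)
```
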
33
Database: 52 million SIFT descriptors.
After normalization, each SIFT descriptor maps to
a point on the unit sphere in 128-d space.
34
Hierarchical clustering via k-means: the 52
million SIFT descriptors are clustered
hierarchically (1 root, then 100, 10K, and
finally 1M nodes).
35
Descriptors in the same cluster have similar
visual appearance ----- a visual word. If two
images contain descriptors falling in the same
cluster, they share this visual word.
Some visual words
36
  • Each image becomes a histogram

(w1, w2, ..., w1M): a 1M-dimensional vector in
which wi shows how many instances of word i can be
found in this image.

Actually, this vector is very sparse: only about
20K of the entries are non-zero. This
representation is much more efficient than a
20K-by-128 matrix full of SIFT descriptors.
37
(Figure: the 1M-dimensional histogram vector
(w1, w2, ..., w1M).)
38
  • Similarity between two images
  • Count the number of non-zero dimensions shared
    by the two sparse vectors,
  • or take the exact count of each visual word into
    account (see the sketch below)

im1  0 0 1 0 1 0 0 3 0
im2  1 0 2 0 0 1 0 1 1
min  0 0 1 0 0 0 0 1 0   (2 shared words)
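A minimal sketch of both options on the toy vectors above:

```python
import numpy as np

im1 = np.array([0, 0, 1, 0, 1, 0, 0, 3, 0])
im2 = np.array([1, 0, 2, 0, 0, 1, 0, 1, 1])

# Option 1: count the non-zero dimensions (visual words) the images share.
shared = np.count_nonzero((im1 > 0) & (im2 > 0))   # -> 2

# Option 2: take the counts into account via bin-wise minima, as in the
# table above, and sum them.
intersection = np.minimum(im1, im2).sum()          # -> 2
```
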
39
However, in text categorization, when classifying
a file, some of the words are not as descriptive
as others, such as "the", "and", "is", etc. The
same situation exists for visual words, too. How
can we know which visual words are discriminative
and which are not?
40
Previous work in text categorization has a good
solution: a weighting scheme named term
frequency-inverse document frequency (tf-idf),

    ti = (nid / nd) x log(N / ni)

where
    nid --- the number of occurrences of word i in
            document d,
    nd  --- the total number of words in document d,
    ni  --- the number of occurrences of word i in
            the whole database,
    N   --- the number of documents in the whole
            database,
    ti  --- the weight of word i.
41
  • Stop list
  • The weights of some visual words are extremely
    high, whereas those of others are extremely low.
    They need to be ignored in retrieval because
  • an extremely low weight means such a visual word
    is so non-discriminative that it helps nothing in
    retrieval, and
  • an extremely high weight means such a visual word
    will dominate the similarity evaluation.
  • A widely used practice for the stop list is to
    ignore the 5% of visual words with the highest or
    lowest weights.

42
  • A loose geometric constraint
  • The local features are much richer in a database
    image.
  • As a result, a high matching score can easily be
    obtained between the query image and most
    database images.
  • Partition each database image into 25 overlapping
    sub-images, each of which is one ninth of the
    area of the original (see the sketch after this
    list).
  • A sub-image is large enough to contain the object
    to retrieve but has much less background clutter.
  • Neighboring sub-images overlap by half their area
    to reduce the risk of splitting one object across
    two sub-images.

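A minimal sketch of this partition, assuming a NumPy image array; for a 5x5 grid, windows of one third of the width and height placed at half-window strides give exactly 25 sub-images with half-area overlap:

```python
def sub_images(image, grid=5):
    """Partition an image into grid x grid overlapping windows."""
    h, w = image.shape[:2]
    win_h, win_w = h // 3, w // 3            # each window: 1/9 of the area
    step_h, step_w = win_h // 2, win_w // 2  # half-window stride -> half overlap
    return [image[r:r + win_h, c:c + win_w]
            for r in range(0, step_h * grid, step_h)
            for c in range(0, step_w * grid, step_w)]
```
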
43
  • Similarity measure

44
One last issue: the speed
(Figure: the database as a 3000 x 1M sparse matrix
of histograms, one row per image, im1 ... im3000.)
Although we can take advantage of the sparsity,
scanning through 3000 images will still cost some
time.
45
What does an inverted file mean?
(Figure: the same 3000 x 1M sparse matrix of
histograms, rows im1 ... im3000.)
46
Actually, in a normal (forward) file, given an
image id, we get the words it contains, but in an
inverted file, given a visual word, we know the
ids of the images containing it! (A sketch follows
below.)
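A minimal sketch of an inverted file, assuming `database_word_lists` (a hypothetical name) holds the list of visual word ids per database image:

```python
from collections import defaultdict

# Build: visual word id -> set of image ids containing that word.
inverted = defaultdict(set)
for image_id, words in enumerate(database_word_lists):
    for w in words:
        inverted[w].add(image_id)

def candidate_images(query_words):
    """Rank images by how many of the query's visual words they contain;
    only images sharing at least one word are ever touched."""
    votes = defaultdict(int)
    for w in set(query_words):
        for image_id in inverted[w]:
            votes[image_id] += 1
    return sorted(votes, key=votes.get, reverse=True)
```
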
47
With the visual vocabulary and the inverted file,
we can search through a database containing 52
million 128-dimensional features within 0.025
seconds.
48
(No Transcript)
49
(No Transcript)
(Figures, slides 50-61: the spatial consistency
check between matched regions in a query image A
and a database image B.)
62
(No Transcript)
63
(No Transcript)
64
(No Transcript)
65
(No Transcript)
66
(No Transcript)
67
(No Transcript)
68
  • Thanks!
  • Questions?