Title: XRCE at ImageCLEF 07

1. XRCE at ImageCLEF 07
- Stéphane Clinchant, Jean-Michel Renders and Gabriela Csurka
- Xerox Research Centre Europe, France
2. Outline
- Problem statement
- Image Similarity
- Text Similarity
- Fusion between text and image
- Cross-Media Similarities
- Experimental results
- Conclusion
3. Problem Statement
- Problem
- Retrieve relevant images from a cross-media database (images with text), given a set of query images and a query text
- Proposed solution
- Rank the images in the database based on image similarity (1), text similarity (2) and cross-media similarities (3)
4. Image Similarity
- The goal is to define an image similarity measure that best reflects the semantic similarity of the images.
- E.g. sim(·, ·) > sim(·, ·) for a semantically matching image pair versus a non-matching one (the slide's example images are omitted)
- Our proposed solution (detailed in the next slides) is to
- consider both local color and local texture features
- build a generative model (GMM) in the low-level feature space
- represent the image based on Fisher Kernel principles
- define a similarity measure between Fisher Vectors
5. Fisher Vector
- Given a generative model with parameters λ (a GMM),
- the gradient vector ∇_λ log p(X|λ),
- normalized by the Fisher information matrix,
- leads to a unique, model-dependent representation of the image, called the Fisher Vector.
- As similarity between Fisher Vectors, the L1 norm was used.

Fisher Kernels on Visual Vocabularies for Image Categorization, F. Perronnin and C. Dance, CVPR 2007.
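A rough sketch of the representation just described: the gradient of the GMM log-likelihood with respect to the means (only), with the Fisher-information normalization of Perronnin and Dance, compared under the L1 norm. The GMM parameters here are illustrative inputs, not the model trained by the authors.

```python
import numpy as np

def fisher_vector(descriptors, weights, means, sigmas):
    """Means-only Fisher Vector: normalized gradient of the GMM
    log-likelihood w.r.t. the Gaussian means (diagonal covariances).
    descriptors: (T, D), weights: (K,), means/sigmas: (K, D)."""
    T = len(descriptors)
    K, D = means.shape
    diff = descriptors[:, None, :] - means[None, :, :]          # (T, K, D)
    # per-component Gaussian log-densities, shape (T, K)
    log_dens = (-0.5 * np.sum((diff / sigmas) ** 2, axis=2)
                - np.sum(np.log(sigmas), axis=1)
                - 0.5 * D * np.log(2 * np.pi))
    log_post = np.log(weights) + log_dens
    log_post -= log_post.max(axis=1, keepdims=True)             # stability
    gamma = np.exp(log_post)
    gamma /= gamma.sum(axis=1, keepdims=True)                   # posteriors
    # soft-assignment-weighted gradient w.r.t. each mean
    grad = np.einsum('tk,tkd->kd', gamma, diff / sigmas)
    fv = grad / (T * np.sqrt(weights)[:, None])                 # normalization
    return fv.ravel()

def l1_similarity(fv1, fv2):
    """The slide's similarity: negative L1 distance (higher = more similar)."""
    return -np.abs(fv1 - fv2).sum()
```

In practice the full representation also includes gradients with respect to the mixture weights and variances; the means-only version above is the core of the idea.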
6. Text Similarity
- The text is first pre-processed, including
- tokenization, lemmatization, word decompounding and stop-word removal
- The text is modeled by a multinomial language model, smoothed via the Jelinek-Mercer method: p(w|d) = (1 − λ) p_ML(w|d) + λ p_ML(w|C)
- where p_ML(w|d) ∝ tf(w, d) and p_ML(w|C) ∝ Σ_d tf(w, d)
- The textual similarity between two documents is defined by the cross-entropy function
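A minimal sketch of the scoring just described: Jelinek-Mercer smoothing of a document's multinomial model, ranked by cross-entropy against the query. Pre-processing (lemmatization, decompounding, stop-word removal) is assumed to have already produced the token lists, and the smoothing weight 0.5 is an illustrative default.

```python
import math
from collections import Counter

def jm_language_model(doc_tokens, corpus_counts, corpus_len, lam=0.5):
    """Jelinek-Mercer smoothed multinomial LM:
    p(w|d) = (1 - lam) * p_ML(w|d) + lam * p_ML(w|C)."""
    tf = Counter(doc_tokens)
    dl = len(doc_tokens)
    def p(w):
        return (1 - lam) * tf[w] / dl + lam * corpus_counts[w] / corpus_len
    return p

def cross_entropy_score(query_tokens, doc_lm):
    """Ranking score sum_w p_ML(w|q) * log p(w|d); higher is better.
    Assumes every query word occurs somewhere in the corpus."""
    qtf = Counter(query_tokens)
    ql = len(query_tokens)
    return sum(c / ql * math.log(doc_lm(w)) for w, c in qtf.items())
```

The corpus model p_ML(w|C) is what keeps the log defined for query words absent from a particular document, which is the practical point of the smoothing.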
7. Enriching the Text Using an External Corpus
- Reason: the texts related to the images in the corpus are poor (title only).
- Each text in the corpus was enriched as follows:
- for each term in the document, we add related terms based on a clustered usage analysis of an external corpus
- The external corpus was the Flickr image database
- The relationship between terms was based on the frequency of their co-occurrence as tags of the same image in Flickr (top-5 examples below)

classroom: school, class, students, teacher, children
Riviera: france, nice, sea, beach, french
Jesus: christ, church, cross, religion, god
Ecuador: galapagos, quito, southamerica, germany, worldcup
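The co-occurrence statistic behind expansions like these can be sketched as follows; the function and its top-k cutoff are an illustration of tag co-occurrence counting, not the authors' exact clustered-usage analysis.

```python
from collections import Counter, defaultdict

def cooccurrence_expansion(tagged_images, top_k=5):
    """For each tag, return the top-k tags that most often co-occur
    on the same image -- the statistic behind examples such as
    'classroom -> school, class, students, ...'."""
    co = defaultdict(Counter)
    for tags in tagged_images:
        uniq = set(tags)               # count each pair once per image
        for t in uniq:
            for u in uniq:
                if u != t:
                    co[t][u] += 1
    return {t: [u for u, _ in c.most_common(top_k)] for t, c in co.items()}
```

Run over a real tag corpus like Flickr, counts like these surface loose cultural associations as well as strict synonyms, which matches the Ecuador example on the slide.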
8. Fusion Between Image and Text
- Early fusion
- simple concatenation of image and text features (e.g. bag-of-words and bag-of-visual-words)
- estimating their co-occurrences or joint probabilities (Mori et al., Vinokourov et al., Duygulu et al., Blei et al., Jeon et al., etc.)
- Late fusion
- simply combining the scores of mono-media searches (Maillot et al., Clinchant et al.)
- Intermediate-level fusion
- relevance models (Jeon et al.)
- trans-media (or inter-media) feedback (Maillot et al., Chang et al.)
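For concreteness, the simplest of the three families, late fusion, might look like this: per-document mono-media scores are min-max normalized and linearly combined. The weight alpha is an illustrative parameter, not a value from the slides.

```python
def late_fusion(text_scores, image_scores, alpha=0.5):
    """Late fusion sketch: min-max normalize each mono-media score
    list, then combine linearly. alpha weights text vs. image
    (an illustrative choice)."""
    def norm(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0        # avoid division by zero
        return {d: (s - lo) / span for d, s in scores.items()}
    t, i = norm(text_scores), norm(image_scores)
    return {d: alpha * t[d] + (1 - alpha) * i[d] for d in t}
```

Normalizing before combining matters because text and image similarity scores typically live on very different scales.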
9. Intermediate-Level Fusion
- Compute mono-media similarities between an aggregate of objects, coming from a first retrieval step, and a multimodal object.
- Use the duality of the data to switch media during the feedback process.

Pseudo-feedback: top-N ranked images based on image similarity
→ aggregate textual information
→ final rank: documents re-ranked based on textual similarity
10. Aggregating Information from Pseudo-Feedback
- Aim
- compute similarities between an aggregate of objects Nimg(q), corresponding to a first retrieval for query q, and a new multimodal object u in the corpus
- where Nimg(q) = {T(I1), T(I2), ..., T(IN)} and T(Ik) is the textual part of the k-th image Ik in the (pseudo-)feedback group, based on image similarity
- Possible solutions
- Direct concatenation: aggregate (concatenate) the T(Ik), k = 1..N, to form a single object and compute the text similarity between it and T(u).
- Trans-media document re-ranking: aggregate all similarity measures between couples of objects.
- Complementary (or inter-media) feedback: use a pseudo-feedback algorithm to extract relevant features of Nimg(q) and use them to compute the similarity with T(u).
11. Trans-Media Document Re-ranking
- We define a similarity measure between an aggregate of objects Nimg(q) and a multimodal object u
- Notes
- This approach can be seen as a document re-ranking method rather than a query-expansion mechanism.
- The values simTXT(T(u), T(v)) can be pre-computed offline if the corpus is of reasonable size.
- By duality, we can invert the roles of images and text.
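One natural instantiation of such an aggregate similarity: each candidate u is scored by the sum of its text similarities to the top-N feedback images, each weighted by that image's visual score. The weighting scheme is an assumption for illustration; the exact formula on the slide may differ.

```python
def transmedia_rerank(image_scores, text_sim, candidates, top_n=10):
    """Trans-media re-ranking sketch: take the top-N images of the
    visual run as pseudo-feedback, then score each candidate u by the
    image-score-weighted sum of text similarities to those images.
    text_sim(v, u) is the pre-computable simTXT of the slide."""
    feedback = sorted(image_scores, key=image_scores.get,
                      reverse=True)[:top_n]
    return {u: sum(image_scores[v] * text_sim(v, u) for v in feedback)
            for u in candidates}
```

Because only pairwise text similarities are needed, the re-ranking step itself involves no second retrieval, which is the contrast with complementary feedback noted on the next slide.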
12. Complementary Feedback
- We derive a language model θ_F for the relevance concept from the text set F = Nimg(q).
- θ_F is assumed to be multinomial (peaked at relevant terms), estimated by EM from the mixture (1 − α) θ_F(w) + α p(w|C)   (1)
- where p(w|C) is the word probability built upon the corpus, and α (= 0.5) a fixed parameter.
- The similarity between Nimg(q) and T(u) is given by the cross-entropy similarity between θ_F and T(u); alternatively, we can first interpolate θ_F with the query text.
- Notes
- α (0.5 in our experiments) can be seen as a mixing weight between image and text.
- Unlike the trans-media re-ranking method, this needs a second retrieval step.
- We can invert the roles of images and text if we use Rocchio's method instead of (1).

A Study of Smoothing Methods for Language Models Applied to Information Retrieval, Zhai and Lafferty, SIGIR 2001.
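The EM estimate of θ_F can be sketched as follows, assuming each feedback-word occurrence is drawn from the mixture (1 − α) θ_F(w) + α p(w|C), in the spirit of Zhai and Lafferty's model-based feedback; the iteration count and test corpus model are illustrative.

```python
from collections import Counter

def feedback_model(feedback_texts, corpus_prob, alpha=0.5, iters=20):
    """EM estimate of the relevance model theta_F: each feedback word
    is assumed drawn from (1-alpha)*theta_F(w) + alpha*p(w|C), so EM
    pushes corpus-frequent words (e.g. near stop-words) out of theta_F."""
    counts = Counter(w for t in feedback_texts for w in t)
    total = sum(counts.values())
    theta = {w: c / total for w, c in counts.items()}   # ML initialization
    for _ in range(iters):
        # E-step: probability each occurrence of w came from theta_F
        z = {w: (1 - alpha) * theta[w] /
                ((1 - alpha) * theta[w] + alpha * corpus_prob(w))
             for w in counts}
        # M-step: renormalize the expected counts
        norm = sum(counts[w] * z[w] for w in counts)
        theta = {w: counts[w] * z[w] / norm for w in counts}
    return theta
```

The background component absorbs generic vocabulary, so θ_F ends up peaked at the terms that distinguish the feedback set, which is exactly what the slide means by "peaked at relevant terms".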
13. XRCE's ImageCLEF Runs

Run name | Modality | Query | Approach | MAP
1. EN-EN-AUTO-FB-TXT_FLR | Text only | TXT | LM FLR | 0.2075
2. AUTO-NOFB-IMG_COMBFK | Image only | IMG | FV L1 | 0.1890
3. AUTO-FB-TXTIMG_PREFFKTXT | Mixed | IMG | TR | 0.2801
4. AUTO-FB-TXTIMG_PREFFKTXT_FLR | Mixed | IMG | TR FLR | 0.2761
5. EN-EN-AUTO-FB-TXTIMG_QTXT_COMBPREFFKTXT | Mixed | IQTQ | TR R1 | 0.3020
6. EN-EN-AUTO-FB-TXTIMG_MPRF | Mixed | IQTQ | CF | 0.3168
7. DE-EN-AUTO-FB-TXTIMG_MPRF_FLR | Mixed | IQTQ | QT CF FLR | 0.2899
8. EN-DE-AUTO-FB-TXTIMG_MPRF | Mixed | IQTQ | QT CF | 0.2776

- LM: language model with cross-entropy
- FV L1: Fisher Vector with L1 norm
- FLR: text enriched with Flickr tags
- TR: trans-media re-ranking
- CF: complementary feedback
- Ri: run i
- QT: query translation
14. Conclusion
- Our image similarity measure (L1 norm on Fisher Vectors) seems to be quite suitable for CBIR.
- It was the second-best visual-only system and, unlike the first system, it does not use any query expansion (nor feedback).
- Combining it with text similarity within an intermediate-level fusion allowed for a significant improvement.
- Mixing the modalities increased performance by about 50% (relative) over the mono-media (pure text or pure image) systems.
- Three of our six proposed cross-media systems were the best three automatic mixed runs.
- The system performed well even when the query and the corpus were in different languages (English versus German).
15. Thank you for your attention!