Title: XRCE at ImageCLEF 07

1. XRCE at ImageCLEF 07
- Stéphane Clinchant, Jean-Michel Renders and Gabriela Csurka
- Xerox Research Centre Europe, France
2. Outline
- Problem statement
- Image Similarity
- Text Similarity
- Fusion between text and image
- Cross-Media Similarities
- Experimental results
- Conclusion
3. Problem Statement
- Problem
- Retrieve relevant images from a cross-media database (images with text), given a set of query images and a query text
- Proposed solution
- Rank the images in the database based on image similarity (1), text similarity (2) and cross-media similarities (3)
4. Image Similarity
- The goal is to define an image similarity measure that best reflects the semantic similarity of the images.
- E.g. sim(·, ·) > sim(·, ·) for a semantically matching image pair versus a non-matching one (the slide's example images are omitted)
- Our proposed solution (detailed in the next slides) is to
- consider both local color and local texture features
- build a generative model (GMM) in the low-level feature space
- represent the image based on Fisher Kernel principles
- define a similarity measure between Fisher Vectors
5. Fisher Vector
- Given a generative model with parameters λ (a GMM),
- the gradient vector ∇_λ log p(X|λ),
- normalized by the Fisher information matrix,
- leads to a unique, model-dependent representation of the image, called the Fisher Vector.
- As similarity between Fisher Vectors, the L1 norm was used.

Fisher Kernels on Visual Vocabularies for Image Categorization, F. Perronnin and C. Dance, CVPR 2007.
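A rough sketch of the representation just described: the gradient of the GMM log-likelihood with respect to the means (only), with the Fisher-information normalization of Perronnin and Dance, compared under the L1 norm. The GMM parameters here are illustrative inputs, not the model trained by the authors.

```python
import numpy as np

def fisher_vector(descriptors, weights, means, sigmas):
    """Means-only Fisher Vector: normalized gradient of the GMM
    log-likelihood w.r.t. the Gaussian means (diagonal covariances).
    descriptors: (T, D), weights: (K,), means/sigmas: (K, D)."""
    T = len(descriptors)
    K, D = means.shape
    diff = descriptors[:, None, :] - means[None, :, :]          # (T, K, D)
    # per-component Gaussian log-densities, shape (T, K)
    log_dens = (-0.5 * np.sum((diff / sigmas) ** 2, axis=2)
                - np.sum(np.log(sigmas), axis=1)
                - 0.5 * D * np.log(2 * np.pi))
    log_post = np.log(weights) + log_dens
    log_post -= log_post.max(axis=1, keepdims=True)             # stability
    gamma = np.exp(log_post)
    gamma /= gamma.sum(axis=1, keepdims=True)                   # posteriors
    # soft-assignment-weighted gradient w.r.t. each mean
    grad = np.einsum('tk,tkd->kd', gamma, diff / sigmas)
    fv = grad / (T * np.sqrt(weights)[:, None])                 # normalization
    return fv.ravel()

def l1_similarity(fv1, fv2):
    """The slide's similarity: negative L1 distance (higher = more similar)."""
    return -np.abs(fv1 - fv2).sum()
```

In practice the full representation also includes gradients with respect to the mixture weights and variances; the means-only version above is the core of the idea.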
6. Text Similarity
- The text is first pre-processed, including
- tokenization, lemmatization, word decompounding and stop-word removal
- The text is modeled by a multinomial language model, smoothed via the Jelinek-Mercer method: p(w|d) = (1 − λ) p_ML(w|d) + λ p_ML(w|C)
- where p_ML(w|d) ∝ tf(w, d) and p_ML(w|C) ∝ Σ_d tf(w, d)
- The textual similarity between two documents is defined by the cross-entropy function
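A minimal sketch of the scoring just described: Jelinek-Mercer smoothing of a document's multinomial model, ranked by cross-entropy against the query. Pre-processing (lemmatization, decompounding, stop-word removal) is assumed to have already produced the token lists, and the smoothing weight 0.5 is an illustrative default.

```python
import math
from collections import Counter

def jm_language_model(doc_tokens, corpus_counts, corpus_len, lam=0.5):
    """Jelinek-Mercer smoothed multinomial LM:
    p(w|d) = (1 - lam) * p_ML(w|d) + lam * p_ML(w|C)."""
    tf = Counter(doc_tokens)
    dl = len(doc_tokens)
    def p(w):
        return (1 - lam) * tf[w] / dl + lam * corpus_counts[w] / corpus_len
    return p

def cross_entropy_score(query_tokens, doc_lm):
    """Ranking score sum_w p_ML(w|q) * log p(w|d); higher is better.
    Assumes every query word occurs somewhere in the corpus."""
    qtf = Counter(query_tokens)
    ql = len(query_tokens)
    return sum(c / ql * math.log(doc_lm(w)) for w, c in qtf.items())
```

The corpus model p_ML(w|C) is what keeps the log defined for query words absent from a particular document, which is the practical point of the smoothing.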
7. Enriching the Text Using an External Corpus
- Reason: the texts related to the images in the corpus are poor (title only).
- Each text in the corpus was enriched as follows:
- for each term in the document, we add related terms based on a clustered usage analysis of an external corpus
- The external corpus was the Flickr image database
- The relationship between terms was based on the frequency of their co-occurrence as tags of the same image in Flickr (top-5 examples below)

classroom: school, class, students, teacher, children
Riviera: france, nice, sea, beach, french
Jesus: christ, church, cross, religion, god
Ecuador: galapagos, quito, southamerica, germany, worldcup
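The co-occurrence statistic behind expansions like these can be sketched as follows; the function and its top-k cutoff are an illustration of tag co-occurrence counting, not the authors' exact clustered-usage analysis.

```python
from collections import Counter, defaultdict

def cooccurrence_expansion(tagged_images, top_k=5):
    """For each tag, return the top-k tags that most often co-occur
    on the same image -- the statistic behind examples such as
    'classroom -> school, class, students, ...'."""
    co = defaultdict(Counter)
    for tags in tagged_images:
        uniq = set(tags)               # count each pair once per image
        for t in uniq:
            for u in uniq:
                if u != t:
                    co[t][u] += 1
    return {t: [u for u, _ in c.most_common(top_k)] for t, c in co.items()}
```

Run over a real tag corpus like Flickr, counts like these surface loose cultural associations as well as strict synonyms, which matches the Ecuador example on the slide.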
8. Fusion Between Image and Text
- Early fusion
- simple concatenation of image and text features (e.g. bag-of-words and bag-of-visual-words)
- estimating their co-occurrences or joint probabilities (Mori et al., Vinokourov et al., Duygulu et al., Blei et al., Jeon et al., etc.)
- Late fusion
- simply combining the scores of mono-media searches (Maillot et al., Clinchant et al.)
- Intermediate-level fusion
- relevance models (Jeon et al.)
- trans-media (or inter-media) feedback (Maillot et al., Chang et al.)
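For concreteness, the simplest of the three families, late fusion, might look like this: per-document mono-media scores are min-max normalized and linearly combined. The weight alpha is an illustrative parameter, not a value from the slides.

```python
def late_fusion(text_scores, image_scores, alpha=0.5):
    """Late fusion sketch: min-max normalize each mono-media score
    list, then combine linearly. alpha weights text vs. image
    (an illustrative choice)."""
    def norm(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0        # avoid division by zero
        return {d: (s - lo) / span for d, s in scores.items()}
    t, i = norm(text_scores), norm(image_scores)
    return {d: alpha * t[d] + (1 - alpha) * i[d] for d in t}
```

Normalizing before combining matters because text and image similarity scores typically live on very different scales.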
9. Intermediate-Level Fusion
- Compute mono-media similarities between an aggregate of objects, coming from a first retrieval step, and a multimodal object.
- Use the duality of the data to switch media during the feedback process.

Pseudo-feedback: top-N ranked images based on image similarity
→ aggregate textual information
→ final rank: documents re-ranked based on textual similarity
10. Aggregating Information from Pseudo-Feedback
- Aim
- compute similarities between an aggregate of objects Nimg(q), corresponding to a first retrieval for query q, and a new multimodal object u in the corpus
- where Nimg(q) = {T(I1), T(I2), ..., T(IN)} and T(Ik) is the textual part of the k-th image Ik in the (pseudo-)feedback group, based on image similarity
- Possible solutions
- Direct concatenation: aggregate (concatenate) the T(Ik), k = 1..N, to form a single object and compute the text similarity between it and T(u).
- Trans-media document re-ranking: aggregate all similarity measures between couples of objects.
- Complementary (or inter-media) feedback: use a pseudo-feedback algorithm to extract relevant features of Nimg(q) and use them to compute the similarity with T(u).
11. Trans-Media Document Re-ranking
- We define a similarity measure between an aggregate of objects Nimg(q) and a multimodal object u
- Notes
- This approach can be seen as a document re-ranking method rather than a query-expansion mechanism.
- The values simTXT(T(u), T(v)) can be pre-computed offline if the corpus is of reasonable size.
- By duality, we can invert the roles of images and text.
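One natural instantiation of such an aggregate similarity: each candidate u is scored by the sum of its text similarities to the top-N feedback images, each weighted by that image's visual score. The weighting scheme is an assumption for illustration; the exact formula on the slide may differ.

```python
def transmedia_rerank(image_scores, text_sim, candidates, top_n=10):
    """Trans-media re-ranking sketch: take the top-N images of the
    visual run as pseudo-feedback, then score each candidate u by the
    image-score-weighted sum of text similarities to those images.
    text_sim(v, u) is the pre-computable simTXT of the slide."""
    feedback = sorted(image_scores, key=image_scores.get,
                      reverse=True)[:top_n]
    return {u: sum(image_scores[v] * text_sim(v, u) for v in feedback)
            for u in candidates}
```

Because only pairwise text similarities are needed, the re-ranking step itself involves no second retrieval, which is the contrast with complementary feedback noted on the next slide.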
12. Complementary Feedback
- We derive a language model θ_F for the relevance concept from the text set F = Nimg(q).
- θ_F is assumed to be multinomial (peaked at relevant terms), estimated by EM from the mixture (1 − α) θ_F(w) + α p(w|C)   (1)
- where p(w|C) is the word probability built upon the corpus, and α (= 0.5) a fixed parameter.
- The similarity between Nimg(q) and T(u) is given by the cross-entropy similarity between θ_F and T(u); alternatively, we can first interpolate θ_F with the query text.
- Notes
- α (0.5 in our experiments) can be seen as a mixing weight between image and text.
- Unlike the trans-media re-ranking method, this needs a second retrieval step.
- We can invert the roles of images and text if we use Rocchio's method instead of (1).

A Study of Smoothing Methods for Language Models Applied to Information Retrieval, Zhai and Lafferty, SIGIR 2001.
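The EM estimate of θ_F can be sketched as follows, assuming each feedback-word occurrence is drawn from the mixture (1 − α) θ_F(w) + α p(w|C), in the spirit of Zhai and Lafferty's model-based feedback; the iteration count and test corpus model are illustrative.

```python
from collections import Counter

def feedback_model(feedback_texts, corpus_prob, alpha=0.5, iters=20):
    """EM estimate of the relevance model theta_F: each feedback word
    is assumed drawn from (1-alpha)*theta_F(w) + alpha*p(w|C), so EM
    pushes corpus-frequent words (e.g. near stop-words) out of theta_F."""
    counts = Counter(w for t in feedback_texts for w in t)
    total = sum(counts.values())
    theta = {w: c / total for w, c in counts.items()}   # ML initialization
    for _ in range(iters):
        # E-step: probability each occurrence of w came from theta_F
        z = {w: (1 - alpha) * theta[w] /
                ((1 - alpha) * theta[w] + alpha * corpus_prob(w))
             for w in counts}
        # M-step: renormalize the expected counts
        norm = sum(counts[w] * z[w] for w in counts)
        theta = {w: counts[w] * z[w] / norm for w in counts}
    return theta
```

The background component absorbs generic vocabulary, so θ_F ends up peaked at the terms that distinguish the feedback set, which is exactly what the slide means by "peaked at relevant terms".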
13. XRCE's ImageCLEF Runs

Run name | Modality | Query | Approach | MAP
1. EN-EN-AUTO-FB-TXT_FLR | Text only | TXT | LM FLR | 0.2075
2. AUTO-NOFB-IMG_COMBFK | Image only | IMG | FV L1 | 0.1890
3. AUTO-FB-TXTIMG_PREFFKTXT | Mixed | IMG | TR | 0.2801
4. AUTO-FB-TXTIMG_PREFFKTXT_FLR | Mixed | IMG | TR FLR | 0.2761
5. EN-EN-AUTO-FB-TXTIMG_QTXT_COMBPREFFKTXT | Mixed | IQTQ | TR R1 | 0.3020
6. EN-EN-AUTO-FB-TXTIMG_MPRF | Mixed | IQTQ | CF | 0.3168
7. DE-EN-AUTO-FB-TXTIMG_MPRF_FLR | Mixed | IQTQ | QT CF FLR | 0.2899
8. EN-DE-AUTO-FB-TXTIMG_MPRF | Mixed | IQTQ | QT CF | 0.2776

- LM: language model with cross-entropy
- FV L1: Fisher Vector with L1 norm
- FLR: text enriched with Flickr tags
- TR: trans-media re-ranking
- CF: complementary feedback
- Ri: run i
- QT: query translation
14. Conclusion
- Our image similarity measure (L1 norm on Fisher Vectors) seems to be quite suitable for CBIR.
- It was the second-best visual-only system and, unlike the first system, it does not use any query expansion (nor feedback).
- Combining it with text similarity within an intermediate-level fusion allowed for a significant improvement.
- Mixing the modalities increased performance by about 50% (relative) over the mono-media (pure text or pure image) systems.
- Three of our six proposed cross-media systems were the best three automatic mixed runs.
- The system performed well even when the query and the corpus were in different languages (English versus German).
15. Thank you for your attention!