1
An Empirical Study on Large-Scale Content-Based
Image Retrieval
  • Group Meeting
  • Presented by Wyman
  • 23-1-2007

2
Introduction
  • Content-based Image Retrieval (CBIR)
  • The process of searching for digital images in
    large databases based on image contents
  • In general, it consists of four modules
  • data acquisition and processing
  • feature representation
  • data indexing
  • query and feedback processing

3
Introduction
  • Content-based Image Retrieval (CBIR)
  • Extensively studied in both academia and industry
  • Yet, traditional data indexing techniques do not
    scale to high-dimensional data
  • Image contents live in a high-dimensional feature
    space
  • Thus, traditional CBIR systems are inefficient at
    large scale

4
Contribution
  • Propose a scalable CBIR scheme by applying
    locality-sensitive hashing (LSH)
  • LSH is good at indexing high dimensional data
  • Conduct a comprehensive empirical evaluation of
    CBIR over a half million images
  • Empirical studies of CBIR systems at this scale
    are very limited
  • Address some challenges for building scalable
    CBIR systems on large-scale data
  • some innovative ideas are suggested for tackling
    these issues

5
Hashing Algorithm
  • Hashing algorithms are commonly employed for
    indexing large-scale databases
  • Due to their fast database lookup capability
  • O(1) on average
  • However, conventional hashing is unsuitable for
    similarity indexing
  • It indexes data for finding an exact copy of the
    query key, not for finding near neighbors
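As a toy illustration of this limitation (not from the presentation), a plain hash table — here a Python dict — finds only the exact key:

```python
# A plain hash table indexes exact keys only (illustrative example).
table = {(1.00, 2.00): "img_42"}

# Exact-match lookup succeeds in O(1) on average:
found = table.get((1.00, 2.00))      # "img_42"

# A near neighbor of the key hashes to an unrelated bucket, so it is missed:
missed = table.get((1.01, 2.00))     # None
```

This is exactly why a locality-sensitive family of hash functions, rather than an ordinary one, is needed for near-neighbor search.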

6
Locality-Sensitive Hashing
  • An emerging new indexing algorithm
  • Proposed to solve high-dimensional near neighbor
    searching problem in Euclidean space l2
  • It can answer queries in sublinear time
  • Each near neighbor is reported with a fixed
    probability
  • Principles
  • Two close points most likely share the same hash
    value
  • By looking into the hash bucket of the query
    point, we obtain many of its near neighbors
  • A large fraction of the data points is never
    examined

7
Locality-Sensitive Hashing
  • We employ the E2LSH (Exact Euclidean LSH) package
  • The probability of finding near neighbors is
    controlled by two parameters, L and k
  • L: the number of hash functions g(v) = (h1(v),
    ..., hk(v))
  • a larger L increases the probability of finding
    all R-near neighbors
  • a larger k reduces the chance of hitting data
    points that are not R-near neighbors
  • Each h(v) is controlled by two parameters, a and b
  • a: a d-dimensional vector with entries chosen
    independently from a Gaussian distribution
  • b: a real number chosen uniformly from the range
    [0, w)

h(v) = ⌊(a·v + b) / w⌋: two points share the same value of h when their projections a·v + b fall within the same interval [(n-1)w, nw)
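The scheme above can be sketched as follows; the dimension d, bucket width w, and k used here are small illustrative values, not the parameters from the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, w = 8, 4, 4.0  # dimension, functions per table, bucket width (illustrative)

# One LSH table: k hash functions h_i(v) = floor((a_i . v + b_i) / w),
# with a_i ~ Gaussian and b_i ~ Uniform[0, w), as in E2LSH.
A = rng.standard_normal((k, d))
b = rng.uniform(0.0, w, size=k)

def g(v):
    """Bucket label: the concatenation (h_1(v), ..., h_k(v))."""
    return tuple(np.floor((A @ v + b) / w).astype(int))

v = rng.standard_normal(d)
u = v + 1e-3 * rng.standard_normal(d)   # a very near neighbor of v
# With high probability g(u) == g(v), while a distant point
# almost always lands in a different bucket.
```

In the full index, L independent tables of this form are built, and a query probes the bucket g(q) in each of them.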
8
Main problem of E2LSH
  • One main problem is that E2LSH is a memory-based
    implementation
  • All the data points and the data structures are
    stored in the main memory
  • Maximum database size limited by the amount of
    free main memory available

9
Our Scalable Implementation
  • We propose a multi-partition indexing approach
  • We divide the whole database into multiple
    partitions
  • Each of them is small enough to be loaded into
    the main memory
  • Then we can process queries on each partition in
    turn

10
The Detailed Procedure
  1. Divide the database into n partitions, where
     n = ⌈database size / max partition size⌉.
  2. Run E2LSH indexing on each of the partitions to
    create the hash tables
  3. Given a query q, load the pre-built hash tables T
    of one partition into the main memory
  4. Run E2LSH query on T to retrieve top k ranking
    images with respect to q
  5. Repeat steps (3) and (4) until all partitions
     are processed.
  6. Collect the results from all partitions and
    return top k ranking results with respect to the
    query q.
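A minimal sketch of steps 3-6, assuming a wrapper `lsh_query(index, q, top_k)` that returns (distance, image_id) pairs; both names are hypothetical stand-ins for the actual E2LSH calls:

```python
import heapq

def query_partitioned(partitions, q, top_k, lsh_query):
    """Query each partition's pre-built index in turn, then merge.

    partitions: the pre-built per-partition indexes (steps 1-2).
    lsh_query:  hypothetical wrapper returning [(distance, image_id), ...].
    """
    all_hits = []
    for index in partitions:                 # steps 3-5: one partition at a time
        all_hits.extend(lsh_query(index, q, top_k))
    # step 6: global top-k by distance across all partitions
    return heapq.nsmallest(top_k, all_hits)
```

Because each partition contributes at most top_k candidates, the final merge touches only n × top_k items regardless of database size.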

11
Some Critical Issues
  • Disk-access overhead for loading the hash tables
    into the main memory
  • We can consider parallel solutions to overcome
    this overhead and speed up the overall system in
    future work

12
Feature Representation
  • We represent an image with three types of visual
    feature: color, shape, and texture
  • For color, we use Grid Color Moment
  • For shape, we employ an edge direction histogram
  • For texture, we adopt Gabor feature
  • In total, a 238-dimensional feature vector is
    employed to represent each image in our image
    database
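As an illustration of the color component, a sketch of grid color moments; the 5x5 grid and the choice of three moments per channel are assumptions, since the slide does not give the breakdown of the 238 dimensions:

```python
import numpy as np

def grid_color_moments(img, grid=(5, 5)):
    """Per-cell color moments (mean, std dev, cube root of the third
    central moment) for each channel of an H x W x 3 image.

    The 5x5 grid is an assumed illustrative choice.
    """
    H, W, C = img.shape
    gh, gw = grid
    feats = []
    for i in range(gh):
        for j in range(gw):
            cell = img[i * H // gh:(i + 1) * H // gh,
                       j * W // gw:(j + 1) * W // gw].reshape(-1, C)
            mu = cell.mean(axis=0)
            sigma = cell.std(axis=0)
            skew = np.cbrt(((cell - mu) ** 3).mean(axis=0))
            feats.extend([mu, sigma, skew])
    return np.concatenate(feats)  # grid cells x 3 moments x 3 channels
```

With a 5x5 grid this yields a 225-dimensional color vector; the shape and texture features would be concatenated alongside it to form the full descriptor.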

13
Empirical Evaluation
  • Experimental Testbed
  • A testbed containing 500,000 images crawled from
    the Web
  • 5,000 images from the COREL image data set are
    used as the query set
  • Contains 50 categories
  • Each category consists of exactly 100 images that
    are randomly selected
  • Every category represents a different semantic
    topic, such as butterfly, car, cat, dog, etc

14
Performance Metrics
  • Two standard performance metrics
  • Precision and recall
  • An image is relevant if it is one of the 100
    COREL images in the database that share the same
    category as the query image
  • Efficiency is evaluated by the average CPU time
    elapsed on a given query
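The two metrics can be sketched as follows (a minimal sketch, with image ids as plain strings; there are always 100 relevant COREL images per query category):

```python
def precision_recall(retrieved, relevant, num_relevant_total=100):
    """Precision and recall for one query.

    retrieved: ranked list of returned image ids (e.g. top 20 or top 50).
    relevant:  set of same-category COREL image ids found so far.
    """
    hits = sum(1 for img in retrieved if img in relevant)
    precision = hits / len(retrieved)          # fraction of results that are relevant
    recall = hits / num_relevant_total         # fraction of the 100 relevant images found
    return precision, recall
```

These per-query values are then averaged over the whole query set to give the figures reported on the later slides.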

15
Experimental Setup
  • Form a query set of 1,000 image examples by
    randomly selecting 20 images from each of the 50
    COREL image categories
  • Prepare 10 image databases of different sizes
    ranging from 50,000 images to 500,000 images
  • A database of size N contains the following
  • 5,000 COREL images
  • N-5000 other images selected from our testbed
  • Extract the image features using the techniques
    discussed

16
Experimental Setup
  • Perform a series of experiments using our CBIR
    system with LSH indexing on all the 10 databases
  • LSH parameters: L = 550 and k = 34
  • Retrieve top 20 and top 50 ranking images for
    each query image
  • Calculate recall and precision
  • Simulate two rounds of relevance feedback
  • re-querying the database with relevant examples
    with respect to the given query
  • Environments
  • 3.2GHz Intel Pentium 4 PC with 2GB memory running
    Linux kernel 2.6
  • All implementations are programmed in the C
    language
  • The same experiments are repeated with exhaustive
    search (ES) indexing

17
Average Precision of TOP20
  • The results of LSH are very close to the ES
    results
  • Their maximal difference is no more than 5% at
    any database size

18
Average Recall of TOP50
  • Average recall and average precision of our CBIR
    system decrease with database size
  • Yet the rate of decrease diminishes as the
    database size increases

19
Average Query Time
  • The query time for ES is linear in the database
    size, while that for LSH is sublinear

20
Time Performance of LSH over ES on different
databases
  • The LSH approach is much faster than the ES
    solution, with an average speedup greater than 4x
  • The performance gap between them widens as the
    database size increases

21
Conclusion and Future Work
  • Proposed a scalable CBIR scheme based on a fast
    high dimensional indexing technique, LSH
  • Conducted extensive empirical evaluations on a
    large testbed of a half million images
  • Our scalable CBIR system is more efficient than
    traditional exhaustive linear searching methods
  • Our system is scalable to large-scale CBIR
    applications
  • Addressed some limitations and challenges in our
    current solution
  • Will consider parallel solutions in future work

22
Q & A