Word Hashing for Efficient Search in Document Image Collections

1
Word Hashing for Efficient Search in Document
Image Collections
  • Anand Kumar
  • Advisors
  • Dr. C. V. Jawahar
  • IIIT Hyderabad
  • Dr. R. Manmatha
  • University of Massachusetts, Amherst, USA

2
Overview
  • Introduction
  • The problem
  • Previous work
  • Contributions
  • Searching in document images
  • Annotation for retrieval
  • Summary
  • Future work

3
Introduction
[Figure: system diagram with blocks Documents, Scanning, Processing, Database, Input Query, Image Matching, Retrieved Documents]
  • Matching images of words, NOT the text (ASCII) words.
4
Challenges
  • Direct matching of images is an expensive
    process.
  • Represent word images as feature vectors and
    match.
  • Representation should capture the characteristics
    (mainly content) of words.
  • On every query, searching in large word image
    database by matching is time consuming.
  • The scalability issues arise with the increase in
    size of the document image collection.

5
Basic Directions of Solution
  • Convert the images into text using recognizers
    and build index using text search methods.
  • If the converted text has errors, will the text
    search methods deliver the expected performance?

6
Basic Directions of Solution
  • Group similar words in the document image
    collection and annotate (label with text) the
    groups.
  • Apply text search methods for accessing the
    documents.
  • Is it possible to annotate large groups of words
    found in a large collection of document images?

7
The Problem
  • Building an index using matching or other
    existing methods is not scalable for even
    moderate collections.
  • Given a large collection of document images, how
    can we search efficiently for similar words so that
    queries are answered quickly (in milliseconds)?

8
Previous Work
  • Recognition based methods
  • Chan et al. use bi-gram letter transition model
    for recognition of words.
  • BYBLOS system uses similar approach for line
    recognition.
  • The recognizers may fail in presence of
    degradations.
  • There are no good recognizers and language
    modeling approaches for Indian languages.
  • Jim Chan, Celal Ziftci, and David A. Forsyth.
    Searching Online Arabic Documents. In Proc. of
    Conference on Computer Vision and Pattern
    Recognition (CVPR) (2), pages 1455-1462, 2006.
  • Zhidong Lu, Richard Schwartz, Premkumar
    Natarajan, Issam Bazzi, and John Makhoul.
    Advances in the BBN BYBLOS OCR System. In Proc.
    of International Conference on Document Analysis
    and Recognition (ICDAR), pages 337-340, 1999.
  • U. Pal and B.B. Chaudhuri. Indian Script
    Character Recognition: A Survey. Pattern
    Recognition, 37:1887-1899, 2004.

9
Previous Work
  • Recognition free methods
  • Word spotting in handwritten documents
  • Words are clustered and the clusters are
    annotated to enable search.
  • Dynamic time warping (DTW) is used for matching
    words.
  • George Washington's handwritten documents.
  • Similar approach for printed Indian language
    documents.
  • Word spotting in Ottoman documents
  • Successive pruning stages eliminate wrong words.
  • Toni M. Rath and R. Manmatha. Word Image
    Matching Using Dynamic Time Warping. In Proc. of
    Conference on Computer Vision and Pattern
    Recognition (CVPR)(2), pages 521-527, 2003.
  • A. Balasubramanian, M. Meshesha, and C. V.
    Jawahar. Retrieval from Document Image
    Collections. In International Workshop on
    Document Analysis Systems (DAS), pages 1-12,
    2006.
  • Esra Ataer and Pinar Duygulu. Retrieval of
    Ottoman documents. In Multimedia Information
    Retrieval (MIR) workshop, pages 155-162, 2006.

10
Contribution of This Work
  • Data is processed quickly using the proposed
    technique to help search efficiently in large
    collections.
  • The effects of word image representation and document
    types on the proposed technique are analyzed.
  • Scalability of the proposed method is
    demonstrated on a collection of Kalidasa's books.
  • Groups of similar words retrieved using the
    proposed approach are labeled (automatically) for
    annotation based access to documents.
  • A method to improve the automatic word labeling
    (annotation) accuracy is presented.

11
Overview
  • Introduction
  • The problem
  • Previous work
  • Contributions
  • Searching in document images
  • Word image representation
  • Similarity search
  • Content sensitive hashing
  • Fitting in retrieval system
  • Experimental results
  • Annotation for retrieval
  • Summary
  • Future work

12
Word Image Representation
  • Profile Features
  • Ink transitions
  • Number of black to white pixel transitions in the
    image row or column. Calculated for both rows and
    columns.
  • Projection profiles
  • Sum over the pixel values of a column

13
Word Image Representation
  • Profile Features
  • Upper word profiles
  • Black pixel distance from top boundary of the
    image.
  • Lower word profiles
  • Black pixel distance from bottom of the image.
  • If no pixel is found in a column, the value is
    taken as height of the image.
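
The profile features on the last two slides can be sketched with NumPy. This is a minimal illustration, not the authors' implementation: the function name `profile_features`, the convention that nonzero pixels are ink, and the scan direction for transitions (top to bottom, left to right) are assumptions.

```python
import numpy as np

def profile_features(img):
    """Profile features of a binarized word image (nonzero = ink; an assumed convention)."""
    ink = np.asarray(img) > 0
    height = ink.shape[0]
    has_ink = ink.any(axis=0)
    return {
        # Projection profile: number of ink pixels in each column.
        "projection": ink.sum(axis=0),
        # Upper profile: distance from the top to the first ink pixel in each column;
        # columns without ink get the image height, as stated on the slide.
        "upper": np.where(has_ink, ink.argmax(axis=0), height),
        # Lower profile: the same distance measured from the bottom.
        "lower": np.where(has_ink, ink[::-1].argmax(axis=0), height),
        # Ink transitions: black-to-white changes down each column and along each row.
        "col_transitions": (ink[:-1] & ~ink[1:]).sum(axis=0),
        "row_transitions": (ink[:, :-1] & ~ink[:, 1:]).sum(axis=1),
    }

word = [[0, 1, 0],
        [0, 1, 0],
        [0, 0, 0],
        [0, 1, 1]]
features = profile_features(word)
```

Each profile is a one-dimensional signal with one value per column (or row), which is what the Fourier step on the next slide operates on.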

14
Word Image Representation
  • Region based moments
  • Central moments
  • Discrete Fourier Transform (DFT) coefficients.
  • Projection and word profile features are
    segmented vertically into four equal parts.
  • 1D Fourier transform of the segmented profile
    features is obtained.
  • $n = 4$ real parts and the last $n - 1 = 3$ imaginary parts of
    the DFT are taken as features.
  • In total, 84 Fourier coefficients are taken from each
    image.

$3 \times (7 \times 4) = 84$
features $\times$ (coefficients $\times$ segments) = total
coefficients for every image
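
A hedged sketch of how the 84-dimensional descriptor could be assembled. The slide does not say exactly which coefficients are kept. Here it is assumed that the real parts of DFT coefficients 0 to 3 and the imaginary parts of coefficients 1 to 3 are used (the imaginary part of coefficient 0 is always zero for a real signal). The names `dft_features` and `word_descriptor` are hypothetical.

```python
import numpy as np

def dft_features(profile, segments=4, n=4):
    """7 DFT coefficients per segment: n real parts plus n - 1 imaginary parts."""
    feats = []
    # Assumes the profile has at least `segments` samples.
    for part in np.array_split(np.asarray(profile, dtype=float), segments):
        # Zero-pad very short segments so that n coefficients always exist.
        spectrum = np.fft.fft(part, n=max(len(part), n))
        feats.extend(spectrum[:n].real)   # coefficients 0..n-1, real part
        feats.extend(spectrum[1:n].imag)  # coefficients 1..n-1, imaginary part
    return np.array(feats)

def word_descriptor(projection, upper, lower):
    """Concatenate the features of the three profiles: 3 x (7 x 4) = 84 values."""
    return np.concatenate([dft_features(p) for p in (projection, upper, lower)])

descriptor = word_descriptor(np.arange(40), np.ones(40), np.zeros(40))
```

Keeping only the low-frequency coefficients gives a fixed-length vector regardless of the word image width, which is what the hashing step needs.
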
15
Similarity search
  • Given word image representations as vectors
    (points) in some space,
  • We need to search for similar vectors (points)
    i.e., nearest neighbor search (NNS).
  • k-d tree, B-tree or R-tree can be used for the
    NNS.
  • How to handle the slight differences in the
    representation of similar words?
  • Approximate nearest neighbor search has to be
    carried out.
  • Since the representations are high dimensional
    (more than 84 dimensions in our case), traditional
    tree-based search is inefficient.
  • Locality sensitive hashing (LSH) is an
    approximate nearest neighbor search method with
    sub-linear query time.

  • Jon Louis Bentley. Multidimensional Binary
    Search Trees Used for Associative Searching.
    Communications of the ACM, 18(9):509-517, 1975.
  • Sunil Arya and David M. Mount. Approximate
    Nearest Neighbor Queries in Fixed Dimensions. In
    SODA '93, pages 271-280, 1993.
  • Rudolf Bayer and E. McCreight. Organization and
    Maintenance of Large Ordered Indexes. Acta
    Informatica, 1(3):173-189, 1972.
  • M. Datar, N. Immorlica, P. Indyk, and V. S.
    Mirrokni. Locality-Sensitive Hashing Scheme
    Based on p-Stable Distributions. In ACM SOCG,
    pages 253-262, 2004.
16
Content Sensitive Hashing
  • A similarity search problem in which it is not
    necessary to find the exact answer; an
    approximate answer is sufficient.
  • The key idea is
  • To hash points using several hash functions so
    that, for each function, the probability
    of collision is much higher for objects which are
    close to each other than for those which are far
    apart.
  • When a query point is given,
  • Hash the query point and retrieve the elements stored
    in the buckets that the query hashes to.

17
Content Sensitive Hashing
  • Hashing Technique
  • Given a set $P$ of $n$ points and the number of hash
    tables $L$.
  • for each hash table $T_i$, $i = 1, \dots, L$
  • for each point $p_j$, $j = 1, \dots, n$
  • store $p_j$ in bucket $g_i(p_j)$ of hash table $T_i$,
  • where $g_i(p)$, $i = 1, \dots, L$, is the hash function of table $T_i$.
  • The hash function can be a combination of other
    functions.
  • Some Examples
  • $g(v_1, \dots, v_k) = (a_1 v_1 + \dots + a_k v_k) \bmod M$
  • where $M$ is the hash table size and $a_1, \dots, a_k$ are
    random numbers from the interval $[0, M-1]$
  • $g(p) = (h_1(p), \dots, h_k(p))$
  • where $h_i(p) = \lfloor (a_i \cdot p + b_i)/w \rfloor$ and $a_i$ is a $d$-dimensional
    vector
  • $g_L(p) = (v_1(p), \dots, v_L(p))$
  • $v(p) = \mathrm{Unary}_c(x_1) \dots \mathrm{Unary}_c(x_d)$
  • $\mathrm{Unary}_c(x)$ = $x$ 1s followed by $c - x$ 0s
  • $v_i(p)$ => select some bits from $v(p)$, $i = 1, \dots, L$

18
Content Sensitive Hashing
  • Querying
  • To process a query $q$:
  • Look up the buckets $g_1(q), \dots, g_L(q)$ and
    collect all points stored in those $L$ buckets of the hash tables.
  • Run a linear search over the collected points.
  • Output the points within distance $R$ of the query.
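
The build and query steps above can be sketched as a small in-memory index. This is an illustration under assumptions, not the system from the talk: it uses the hash family of the form floor((a·p + b)/w) with Gaussian `a`, as in the p-stable LSH scheme, and the class name `HashIndex` and the default values for `num_tables`, `k` and `w` are made up for the example.

```python
import numpy as np
from collections import defaultdict

class HashIndex:
    """L hash tables, each keyed by k concatenated hashes floor((a.p + b) / w)."""

    def __init__(self, dim, num_tables=4, k=3, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.a = rng.normal(size=(num_tables, k, dim))    # random projections
        self.b = rng.uniform(0.0, w, size=(num_tables, k))  # random offsets
        self.w = w
        self.tables = [defaultdict(list) for _ in range(num_tables)]
        self.points = []

    def _keys(self, p):
        # One key g_i(p) per table: a tuple of k integer hash values.
        h = np.floor((self.a @ p + self.b) / self.w).astype(int)
        return [tuple(row) for row in h]

    def add(self, p):
        p = np.asarray(p, dtype=float)
        idx = len(self.points)
        self.points.append(p)
        for table, key in zip(self.tables, self._keys(p)):
            table[key].append(idx)

    def query(self, q, radius):
        q = np.asarray(q, dtype=float)
        candidates = set()
        for table, key in zip(self.tables, self._keys(q)):
            candidates.update(table.get(key, []))
        # Linear scan over the collected candidates only.
        return sorted(i for i in candidates
                      if np.linalg.norm(self.points[i] - q) <= radius)

index = HashIndex(dim=3)
index.add([0, 0, 0])
index.add([10, 10, 10])
near_origin = index.query([0, 0, 0], radius=1.0)
```

Only points that share a bucket with the query in at least one table are compared exactly, which is where the saving over a full linear scan comes from.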

19
Content Sensitive Hashing
  • Example
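  • Let $p = (1,3,2)$, $q = (1,2,3)$, $r = (3,1,1)$, $s = (2,1,1)$
    be $d = 3$ dimensional points, and let $c = 3$ be the maximum value
    in any dimension.
  • $v(p) = \mathrm{Unary}_c(x_1) \dots \mathrm{Unary}_c(x_d)$
  • $\mathrm{Unary}_c(x)$ = $x$ 1s followed by $c - x$ 0s
  • $v(p) = v(1,3,2) = 100\ 111\ 110$
  • A new dimension $d' = c \cdot d = 9$ is obtained, i.e., an
    index set $I = \{1,2,3,4,5,6,7,8,9\}$.
  • Let the number of hash tables be $L = 2$, and let $I_1 = \{1,5,6\}$,
    $I_2 = \{2,3,7,9\}$ be $L$ subsets of $I$.
  • The hash function is $g_L(p) = (v_1(p), \dots, v_L(p))$
  • $v_i(p)$ => select the bits indexed by $I_i$ from $v(p)$, $i = 1, \dots, L$

Unary(1) = 100
Unary(3) = 111
Unary(2) = 110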
20
Content Sensitive Hashing
  • Example
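  • $v(p) = v(1,3,2) = 100\ 111\ 110$
  • $g_1(p) = 111$, $g_2(p) = 0010$, i.e. $(7, 2)$
  • $v(q) = v(1,2,3) = 100\ 110\ 111$
  • $g_1(q) = 110$, $g_2(q) = 0011$, i.e. $(6, 3)$
  • $v(r) = v(3,1,1) = 111\ 100\ 100$
  • $g_1(r) = 100$, $g_2(r) = 1110$, i.e. $(4, 14)$
  • Query $s = (2,1,1)$
  • $v(s) = 110\ 100\ 100$
  • $g_1(s) = 100$, $g_2(s) = 1010$, i.e. $(4, 10)$
  • The resulting point is $r$, which shares bucket $g_1 = 100$ with $s$.

$I_1 = \{1,5,6\}$ and $I_2 = \{2,3,7,9\}$
$v(p) = v(1,3,2) = 100\ 111\ 110$
$g_1(p) = 111$ (bits 1, 5 and 6 of $v(p)$)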
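
The worked example can be checked with a few lines of Python. This is a sketch that reuses only the numbers from the slides; the helper names `unary`, `embed`, `g` and `query` are illustrative.

```python
from collections import defaultdict

def unary(x, c):
    """x ones followed by c - x zeros."""
    return "1" * x + "0" * (c - x)

def embed(point, c):
    """v(p): concatenation of the unary codes of all coordinates."""
    return "".join(unary(x, c) for x in point)

def g(point, index_set, c):
    """Select the bits of v(p) listed in index_set (1-based positions)."""
    bits = embed(point, c)
    return "".join(bits[i - 1] for i in index_set)

C = 3
I1, I2 = (1, 5, 6), (2, 3, 7, 9)
points = {"p": (1, 3, 2), "q": (1, 2, 3), "r": (3, 1, 1)}

# One hash table per index set; each bucket stores point names.
tables = [defaultdict(list), defaultdict(list)]
for name, pt in points.items():
    for table, index_set in zip(tables, (I1, I2)):
        table[g(pt, index_set, C)].append(name)

def query(pt):
    """Collect the names found in the buckets the query hashes to."""
    found = []
    for table, index_set in zip(tables, (I1, I2)):
        for name in table.get(g(pt, index_set, C), []):
            if name not in found:
                found.append(name)
    return found

print(query((2, 1, 1)))  # ['r']
```

The query s = (2,1,1) lands in bucket 100 of the first table, where r is stored, and in an empty bucket (1010) of the second table.
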
21
Fitting in Retrieval System
[Figure: retrieval system diagram with an offline process and an online process. Blocks: Document Images, Pre-processing, Segmentation and word detection, Feature Extraction, Hashing, Hashed Words, Textual Query, Cross Lingual, Word Rendering, Feature Extraction, Relevant Documents]
22
Experimental Results
Performance on different data sets of English language
[Figure: example query and results]
23
Experimental Results
Performance on different data sets of English language
  • Performance of individual features
  • Performance with combination of features
24
Experimental Results
Searching in Kalidasa's collection.
Cross-lingual search
25
Experimental Results
Searching in Kalidasa's collection.
Comparison with Dynamic Time Warping based NNS
26
Overview
  • Introduction
  • The problem
  • Previous work
  • Contributions
  • Searching in document images
  • Annotation for retrieval
  • Annotation based search
  • Annotation correction
  • Experimental results
  • Summary
  • Future work

27
Annotation for Retrieval
  • Annotation is the process of identifying objects
    in images and labeling with meaningful
    description.
  • Search is easy and efficient in annotated
    document images.
  • Challenges
  • Recognition for annotation may be inaccurate.
  • Manual annotation is impractical

28
Annotation for Retrieval
  • Can we use image search to speed up annotation
    and increase accuracy?
  • Image search produces clusters of similar words.
  • A single representative is required to annotate
    words of the whole cluster.
  • Cluster of recognized words can be obtained to
    get the representative.
  • The cluster information can be used to obtain
    correct annotation of the cluster.

29
Annotation Based Search
[Figure: annotation based search diagram with an offline process and an online process. Blocks: Document Images, Pre-processing, Segmentation and word detection, Feature Extraction, Hashing, Hashed Words, Cluster of Word Images, Word Annotation by Recognition, Text Search Engine, Textual Query, Relevant Documents]
30
Annotation Correction
Correction by Majority Voting
[Figure: word image cluster, recognition, text words of cluster, ordered words, final word]
  • Text words of cluster (recognition output): ambidextrous ambidextro4s ambidextrous
    ambidextrous ab idex tro4s ambiderous an biderous
    ambiderous ambideous
  • Ordered words: word length 12
  • Final word: ambidextrous
  • What if too erroneous?
31
Annotation Correction
Correction by Majority Voting
  • Input: cluster $C$ of words.
  • Output: representative word $W_R$ for $C$.
  • $S$ = sort $C$ based on string length
  • Get $M \subseteq S$ such that, for all $A$, $B$ in $M$, the edit distance of $A$
    and $B$ is less than half of the lengths of $A$ and
    $B$
  • If $l$ is the length of most of the strings
    (majority), the cluster representative $W_R$ has
    length $l$.
  • For each character position $i = 1, \dots, l$ do
  • Get all $k$ words of length $l$
  • Find the majority character for position $i$ of $W_R$
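
A simplified sketch of the voting part of this procedure. It picks the majority length and then votes per character position. The edit distance filtering step is left out for brevity, and the function name `majority_vote` is hypothetical. The sample cluster is the one from the previous slide, under the assumption that "ab idex tro4s" and "an biderous" are single noisy recognizer outputs.

```python
from collections import Counter

def majority_vote(cluster):
    """Representative word: majority length, then majority character per position."""
    # Length shared by the largest number of words in the cluster.
    length, _ = Counter(len(w) for w in cluster).most_common(1)[0]
    same_length = [w for w in cluster if len(w) == length]
    # zip(*words) yields the characters found at each position across the words.
    return "".join(Counter(chars).most_common(1)[0][0]
                   for chars in zip(*same_length))

cluster = ["ambidextrous", "ambidextro4s", "ambidextrous", "ambidextrous",
           "ab idex tro4s", "ambiderous", "an biderous", "ambiderous",
           "ambideous"]
print(majority_vote(cluster))  # ambidextrous
```

Four of the nine words have length 12, and at the position where one of them reads "4" the other three read "u", so the vote recovers the correct word.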

32
Annotation Correction
Correction by Alignment
[Figure: correction by alignment for the word "ambidextrous"]
  • Text words of cluster: abidextro4s, ambiderous, ambidextrous, ambidextro4s
  • Aligned words: a m b i d e x t r o u s, a b i d e x t r o 4 s,
    a m b i d e r o u s, a m b i d e x t r o u s, a m b i d e x t r o 4 s
  • The figure also shows a m b i d e x t r o 4 s and a m b i d e x t r o u s,
    with the labels "Word obtained by majority voting" and "Final word".
33
Annotation Correction
Correction by Alignment
  • Input: cluster $C$ of words $W_i$, $i = 1, \dots, n$
  • Output: cluster representative $W_R$ of $C$
  • for each $i = 1, \dots, n$ do
  • for each $j = 1, \dots, n$ do
  • if $j \neq i$ then do
  • Align words $W_i$ and $W_j$
  • Record errors $E_k$, $k = 1, \dots, m$ in $W_i$
  • Record possible corrections $G_p$, $p = 1, \dots, q$ for $E_k$
    from $W_j$
  • end if
  • end for
  • Find correction $C_h \in G_p$ by majority voting
  • Correct $E_k$ with $C_h$
  • $O \leftarrow O \cup W_i$
  • end for
  • Find the correct word $W_R$ from the alignments $O$ with
    majority voting.
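
One possible reading of this procedure, sketched with Python's standard `difflib`. The alignment method used in the talk is not specified, so `SequenceMatcher` is a stand-in. The slot-based voting scheme, the function names `correct_word` and `representative`, and the five-word sample cluster are assumptions made for the illustration.

```python
from collections import Counter
from difflib import SequenceMatcher

def correct_word(word, others):
    """Correct `word` by voting over its alignments with every other word."""
    char_votes = [Counter() for _ in word]                 # one ballot per character
    gap_votes = [Counter() for _ in range(len(word) + 1)]  # insertions between characters
    for other in others:
        matcher = SequenceMatcher(None, word, other, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "insert":
                gap_votes[i1][other[j1:j2]] += 1
            elif tag == "delete":
                for k in range(i1, i2):
                    char_votes[k][""] += 1
            elif i2 - i1 == j2 - j1:  # "equal", or a same-length "replace"
                for k in range(i2 - i1):
                    char_votes[i1 + k][other[j1 + k]] += 1
            else:  # uneven "replace": the whole substitute goes on the first slot
                char_votes[i1][other[j1:j2]] += 1
                for k in range(i1 + 1, i2):
                    char_votes[k][""] += 1
    out = []
    for k in range(len(word) + 1):
        if gap_votes[k]:
            text, count = gap_votes[k].most_common(1)[0]
            if 2 * count > len(others):  # insertion backed by a strict majority
                out.append(text)
        if k < len(word):
            votes = char_votes[k]
            out.append(votes.most_common(1)[0][0] if votes else word[k])
    return "".join(out)

def representative(cluster):
    """Correct every word against the rest, then vote over the corrected words."""
    corrected = [correct_word(w, cluster[:i] + cluster[i + 1:])
                 for i, w in enumerate(cluster)]
    return Counter(corrected).most_common(1)[0][0]

words = ["ambidextrous", "ambidextro4s", "abidextrous", "ambiderous",
         "ambidextrous"]
print(representative(words))  # ambidextrous
```

Unlike position-wise voting, alignment lets words of different lengths (here "abidextrous" and "ambiderous") contribute, because missing or extra characters are matched to the right positions first.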

34
Experimental Results
35
Experimental Results
Effect of cluster size on the retrieval performance
36
Summary
  • Direct hashing of the word features eliminates
    costly processing before building an index.
  • Query results can be obtained in milliseconds
    using content sensitive hashing (CSH).
  • Scalability of the proposed method is
    demonstrated on a collection of Kalidasa's books.
  • Two methods to improve the automatic word
    labeling (annotation) accuracy are presented.
  • An annotation based retrieval technique
    using the automatic annotations of document
    images is demonstrated.

37
Future Work
  • Indexing of document images in different fonts.
  • Searching in multi-lingual documents is one of
    the challenging tasks.
  • Many Indian language documents are translated to
    other languages.
  • Usage of cluster information
    for improving the accuracy of character
    recognizers.
  • Annotation becomes difficult in the presence of
    errors in every recognized word of a cluster.
  • Need to explore new techniques for annotation.
38
Related Publications
  • Anand Kumar, C. V. Jawahar and R. Manmatha.
    "Efficient Search in Document Image Collections".
    Asian Conference on Computer Vision (ACCV), pages
    586-595, November 18-22, 2007, Tokyo, Japan.
  • C. V. Jawahar and Anand Kumar. "Content Level
    Annotation of Large Collection of Printed
    Document Images". International Conference on
    Document Analysis and Recognition (ICDAR), pages
    799-803, September 23-26, 2007, Brazil.
  • Anand Kumar, A. Balasubramanian, Anoop M.
    Namboodiri and C. V. Jawahar. "Model-Based
    Annotation of Online Handwritten Datasets".
    International Workshop on Frontiers in
    Handwriting Recognition (IWFHR), October 23-26,
    2006, La Baule, France.

39
Thank You
  • Questions?

40
Dynamic Time Warping
41
Partial Matching