Title: Word Hashing for Efficient Search in Document Image Collections
1Word Hashing for Efficient Search in Document
Image Collections
- Anand Kumar
- Advisors
- Dr. C. V. Jawahar
- IIIT Hyderabad
- Dr. R. Manmatha
- University of Massachusetts, Amherst, USA
2Overview
- Introduction
- The problem
- Previous work
- Contributions
- Searching in document images
- Annotation for retrieval
- Summary
- Future work
3Introduction
Processing
Input Query
Scanning
Database
Image Matching
Retrieved Documents
Documents
NOT the text (ASCII) words.
Matching images of words,
4Challenges
- Direct matching of images is an expensive
process. - Represent word images as feature vectors and
match. - Representation should capture the characteristics
(mainly content) of words. - On every query, searching in large word image
database by matching is time consuming. - The scalability issues arise with the increase in
size of the document image collection.
5Basic Directions of Solution
- Convert the images into text using recognizers
and build index using text search methods. - If the converted text has errors, will the text
search methods deliver the expected performance?
6Basic Directions of Solution
- Group similar words in the document image
collection and annotate (label with text) the
groups. - Apply text search methods for accessing the
documents. - Is it possible to annotate large groups of words
found in a large collection of document images?
7The Problem
- Building an index using matching or other
existing methods is not scalable for even
moderate collections. - Given a large collection of document images, how
to search efficiently for similar words so that
queries are answered quickly (in milli seconds)?
8Previous Work
- Recognition based methods
- Chan et al. use bi-gram letter transition model
for recognition of words. - BYBLOS system uses similar approach for line
recognition. - The recognizers may fail in presence of
degradations. - There are no good recognizers and language
modeling approaches for Indian languages.
- Jim Chan, Celal Ziftci, and David A. Forsyth.
Searching Online Arabic Documents. In Proc. of
Conference on Computer Vision and Pattern
Recognition (CVPR) (2), pages 1455-1462, 2006. - Zhidong Lu, Richard Schwartz, Premkumar
Natarajan, Issam Bazzi, and John Makhoul.
Advances in the BBN BYBLOS OCR System. In Proc.
of International Conference on Document Analysis
and Recognition (ICDAR), pages 337-340, 1999. - U. Pal and B.B. Chaudhuri. Indian Script
Character Recognition A Survey. Pattern
Recognition, 371887-1899, 2004.
9Previous Work
- Recognition free methods
- Word spotting in handwritten documents
- Words are clustered and the clusters are
annotated to enable search. - Dynamic time warping (DTW) is used for matching
words. - George Washingtons handwritten documents.
- Similar approach for printed Indian language
documents. - Word spotting in Ottoman documents
- Successive pruning stages eliminate wrong words.
- Toni M. Rath and R. Manmatha. Word Image
Matching Using Dynamic Time Warping. In Proc. of
Conference on Computer Vision and Pattern
Recognition (CVPR)(2), pages 521-527, 2003. - A. Balasubramanian, M. Meshesha, and C. V.
Jawahar. Retrieval from Document Image
Collections. In International Workshop on
Document Analysis Systems (DAS), pages 1-12,
2006. - Esra Ataer and Pinar Duygulu. Retrieval of
Ottoman documents. In Multimedia Information
Retrieval (MIR) workshop, pages 155-162, 2006.
10Contribution of This Work
- Data is processed quickly using the proposed
technique to help search efficiently in large
collection. - Effect of word image representation and document
types on the proposed technique are analyzed. - Scalability of the proposed method is
demonstrated on a collection of Kalidasas books. - The group of similar words retrieved using the
proposed approach are labeled (automatically) for
annotation based access to documents. - A method to improve the automatic word labeling
(annotation) accuracy is presented.
11Overview
- Introduction
- The problem
- Previous work
- Contributions
- Searching in document images
- Word image representation
- Similarity search
- Content sensitive hashing
- Fitting in retrieval system
- Experimental results
- Annotation for retrieval
- Summary
- Future work
12Word Image Representation
- Profile Features
- Ink transitions
- Number of black to white pixel transitions in the
image row or column. Calculated for both rows and
columns. - Projection profiles
- Sum over the pixel values of a column
13Word Image Representation
- Profile Features
- Upper word profiles
- Black pixel distance from top boundary of the
image. - Lower word profiles
- Black pixel distance from bottom of the image.
- If no pixel is found in a column, the value is
taken as height of the image.
14Word Image Representation
- Region based moments
- Central moments
- Discrete Fourier Transform (DFT) coefficients.
- Projection and word profile features are
segmented vertically into four equal parts. - 1D Fourier transform of the segmented profile
features is obtained. - n4 real parts and last n-13 imaginary parts of
the DFT are taken as features. - Total 84 Fourier coefficients are taken from each
image.
3 x (7 x 4) 84
features x (coefficients x segments) total
coefficients for every image
15Similarity search
- Given word image representations as vectors
(points) in some space, - We need to search for similar vectors (points)
i.e., nearest neighbor search (NNS). - k-d tree, B-tree or R-tree can be used for the
NNS. - How to handle the slight differences in the
representation of similar words? - Approximate nearest neighbor search has to be
carried out. - Since the representations are in high dimension
(more than 84 in our case), traditional way of
searching is inefficient. - Locality sensitive hashing (LSH) is an
approximate nearest neighbor search method for
sub-linear time complexity.
Jon Louis Bentley. Multidimensional Binary
Search Trees Used for Associative Searching.
Communications of the ACM, 18(9)509-517, 1975.
Sunil Arya and David M. Mount. Approximate
Nearest Neighbor Queries in Fixed Dimensions. In
SODA '93. pages 271-280, 1993.
Rudolf Bayer and E. McCreight. Organization and
Maintenance of Large Ordered Indexes. Acta
Informatica, 1(3)173-189, 1972.
M. Datar, N. Immorlica, P. Indyk, and V. S.
Mirrokni. Locality-Sensitive Hashing Scheme
Based on p-Stable Distributions. In ACM SOCG,
pages 253-262, 2004.
16Content Sensitive Hashing
- A similarity search problem in which it is not
necessary to find exact answer instead determine
approximate answer. - The key idea is
- To hash points using several hash functions so as
to ensure that for each function the probability
of collision is much higher for objects which are
close to each other than for those which are far
apart. - When a query point is given,
- Hash the query point and retrieve elements stored
in buckets containing that point.
17Content Sensitive Hashing
- Hashing Technique
- Given set P of n points and number of hash
tables L. - for each hash table Ti, i 1,,L
- for each point pj, j1,n
- store pj on bucket gi(pj) of hash table Ti.
- where gi(p), i1,,L is hash function of table Ti
- Hash function can be combination of other
functions. - Some Examples
- g(v1,,vk) a1.v1ak.vk mod M
- where M is hash table size and a1,,ak are
random numbers from interval 0M-1 - g(p) h1(p),,hk(p)
- where hi(p) (ai.pbi)/w, ai is a d dimensional
vector - gL(p) v1(p)vL(p)
- v(p) Unaryc(x1)Unaryc(xd).
- Unaryc(x) x 1s followed by c-x 0s
- vi(p) gt select some bits from v(p), i 1..L
18Content Sensitive Hashing
- Querying
- To process a query q
- we search all indices of g1(q),,gL(q) and
collect all points from L indices of hash tables. - Linear search on the collected points.
- Output points within distance R from query.
19Content Sensitive Hashing
- Example
- Let, p1,3,2, q1,2,3, r3,1,1, s2,1,1
are d3 dimensional points and c3 is max value
in the dimensions. - v(p) Unaryc(x1)Unaryc(xd).
- Unaryc(x) x 1s followed by c-x 0s
- v(p) v(1,3,2) 100 111 110
- A new dimensions d cd 9 is obtained i.e., a
set I 1,2,3,4,5,6,7,8,9. - Let number of hash tables L2, and I11,5,6,
I22,3,7,9 be L subsets from of I. - Hash function is gL(p) v1(p)vL(p)
- vi(p) gt select Ii bits from v(p), i 1L
Unary(1) 100
Unary(3) 111
Unary(2) 110
20Content Sensitive Hashing
- Example
- v(p) v(1,3,2) 100 111 110
- g1(p) 111, g2(p) 0010. (7, 2)
- v(q) v(1,2,3) 100 110 111
- g1(q) 110, g2(q) 0011. (6, 3)
- v(r) v(3,1,1) 111 100 100
- g1(r) 100, g2(r) 1110. (4, 13)
- Query s 2,1,1
- v(s) 110 100 100
- g1(s) 100, g2(s) 1010. (4, 10)
- Resulting point is r
I11,5,6 and I22,3,7,9
v(p) v(1,3,2) 100 111 110
g1(p)
1
1
1
21Fitting in Retrieval System
Document Images
Textual Query
Relevant Documents
Cross Lingual
Pre-processing
Segmentation and word detection
Word Rendering
Hashed Words
Feature Extraction
Feature Extraction
Hashing
Offline Process
Online Process
22Experimental Results
Performance on different data sets of English
language
query
results
23Experimental Results
Performance on different data sets of English
language
Performance of individual features
Performance with combination of features
24Experimental Results
Searching in Kalidasas Collection.
Cross-lingual search
25Experimental Results
Searching in Kalidasas Collection.
Comparison with Dynamic Time Warping based NNS
26Overview
- Introduction
- The problem
- Previous work
- Contributions
- Searching in document images
- Annotation for retrieval
- Annotation based search
- Annotation correction
- Experimental results
- Summary
- Future work
27Annotation for Retrieval
- Annotation is the process of identifying objects
in images and labeling with meaningful
description. - Search is easy and efficient in annotated
document images. - Challenges
- Recognition for annotation may be inaccurate.
- Manual annotation is impractical
28Annotation for Retrieval
- Can we use image search to speed up annotation
and increase accuracy? - Image search produces clusters of similar words.
- A single representative is required to annotate
words of the whole cluster. - Cluster of recognized words can be obtained to
get the representative. - The cluster information can be used to obtain
correct annotation of the cluster.
29Annotation Based Search
Document Images
Word Annotation by Recognition
Relevant Documents
Cluster of Word Images
Pre-processing
Text Search Engine
Segmentation and word detection
Hashed Words
Feature Extraction
Textual Query
Hashing
Online Process
Offline Process
30Annotation Correction
Correction by Majority Voting
What if too erroneous?
ambidextrous ambidextro4s ambidextrous
ambidextrous ab idex tro4s ambiderous an biderous
ambiderous ambideous
recognition
Ordered words
Word length 12
ambidextrous
Final word
Text words of cluster
Word image cluster
31Annotation Correction
Correction by Majority Voting
- Input Cluster C of words.
- Output Representative word WR for C
- S Sort C based on string length
- Get M S for all A, B in S edit distance of A
and B is less than half of the lengths of A and
B - If l is the length of most of the strings
(majority) the cluster representative WR has
length l. - For each character i 1,,l do
- Get all k words of length l
- Find majority of characters for position i of WR
32Annotation Correction
Correction by Alignment
ambidextrous
a m b i d e x t r o
u s a b i d e x t
r o 4 s a m b i d e
r o u s a m b i d
e x t r o u s a m b i
d e x t r o 4 s
abidextro4s
ambiderous
ambidextrous
ambidextro4s
Aligned words
Text words of cluster
a m b i d e x t r o 4 s
a m b i d e x t r o
u s
Final word
Word obtained by majority voting
33Annotation Correction
Correction by Alignment
- Input Cluster C of Wi 1,,n words
- Output Cluster representative WR of C
- for each i 1,,n do
- for each j 1,,n do
- if j ? i then do
- Align word Wi and Wj
- Record errors Ek, k 1,,m in Wi
- Record possible correction Gp, p 1,,q for Ek
from Wj - end if
- end for
- Find correction Ch Gp by majority voting
- Correct Ek with Ch
- O ? O U Wi
- end for
- Find correct word WR from the alignments O with
majority voting.
34Experimental Results
35Experimental Results
Effect of cluster size on the retrieval
performance
36Summary
- Direct hashing of the word features eliminates
costly processing before building an index. - Query results can be obtained in milliseconds
using the content sensitive hashing (CSH). - Scalability of the proposed method is
demonstrated on a collection of Kalidasas books. - Two methods to improve the automatic word
labeling (annotation) accuracy are presented. - Demonstrated annotation based retrieval technique
using the automatic annotations of document
images.
37Future Work
- Indexing of documents images in different fonts.
- Searching in Multi-lingual documents is one of
the challenging tasks. - Many Indian language documents are translated to
other languages. - Usage of cluster information
- for improving the accuracy of character
recognizers. - Annotation becomes difficult in presence of
errors in every recognized word of a cluster. - Need to explore new techniques for annotation
38Related Publications
- Anand Kumar, C.V.Jawahar and R. Manmatha.
"Efficient Search in Document Image Collections".
Asian Conference on Computer Vision (ACCV), pages
586-595, November 18-22, 2007, Tokyo, Japan. - C.V.Jawahar and Anand Kumar. "Content Level
Annotation of Large Collection of Printed
Document Images". International Conference on
Document Analysis and Recognition (ICDAR), pages
799-803, September 23-26, 2007, Brazil. - Anand Kumar, A. Balasubramanian, Anoop M.
Namboodiri and C.V. Jawahar. "Model-Based
Annotation of Online Handwritten Datasets",
International Workshop on Frontiers in
Handwriting Recognition (IWFHR), October 23-26,
2006, La Baule, France.
39Thank You
40Dynamic Time Warping
41Partial Matching