Word Hashing for Efficient Search in Document Image Collections - PowerPoint PPT Presentation

1 / 41

About This Presentation

Title:

Word Hashing for Efficient Search in Document Image Collections

Description:

Direct matching of images is an ... Word spotting in Ottoman documents. Successive pruning stages eliminate wrong words. ... Retrieval of Ottoman documents' ... – PowerPoint PPT presentation

Number of Views:225

Avg rating:3.0/5.0

Slides: 42

Provided by: anand9

Category:

more less

Transcript and Presenter's Notes

Title: Word Hashing for Efficient Search in Document Image Collections

1
Word Hashing for Efficient Search in Document
Image Collections

Anand Kumar
Advisors
Dr. C. V. Jawahar
IIIT Hyderabad
Dr. R. Manmatha
University of Massachusetts, Amherst, USA

2
Overview

Introduction
The problem
Previous work
Contributions
Searching in document images
Annotation for retrieval
Summary
Future work

3
Introduction
Processing
Input Query
Scanning
Database
Image Matching
Retrieved Documents
Documents
NOT the text (ASCII) words.
Matching images of words,
4
Challenges

Direct matching of images is an expensive
process.
Represent word images as feature vectors and
match.
Representation should capture the characteristics
(mainly content) of words.
On every query, searching in large word image
database by matching is time consuming.
The scalability issues arise with the increase in
size of the document image collection.

5
Basic Directions of Solution

Convert the images into text using recognizers
and build index using text search methods.
If the converted text has errors, will the text
search methods deliver the expected performance?

6
Basic Directions of Solution

Group similar words in the document image
collection and annotate (label with text) the
groups.
Apply text search methods for accessing the
documents.
Is it possible to annotate large groups of words
found in a large collection of document images?

7
The Problem

Building an index using matching or other
existing methods is not scalable for even
moderate collections.
Given a large collection of document images, how
to search efficiently for similar words so that
queries are answered quickly (in milli seconds)?

8
Previous Work

Recognition based methods
Chan et al. use bi-gram letter transition model
for recognition of words.
BYBLOS system uses similar approach for line
recognition.
The recognizers may fail in presence of
degradations.
There are no good recognizers and language
modeling approaches for Indian languages.

Jim Chan, Celal Ziftci, and David A. Forsyth.
Searching Online Arabic Documents. In Proc. of
Conference on Computer Vision and Pattern
Recognition (CVPR) (2), pages 1455-1462, 2006.
Zhidong Lu, Richard Schwartz, Premkumar
Natarajan, Issam Bazzi, and John Makhoul.
Advances in the BBN BYBLOS OCR System. In Proc.
of International Conference on Document Analysis
and Recognition (ICDAR), pages 337-340, 1999.
U. Pal and B.B. Chaudhuri. Indian Script
Character Recognition A Survey. Pattern
Recognition, 371887-1899, 2004.

9
Previous Work

Recognition free methods
Word spotting in handwritten documents
Words are clustered and the clusters are
annotated to enable search.
Dynamic time warping (DTW) is used for matching
words.
George Washingtons handwritten documents.
Similar approach for printed Indian language
documents.
Word spotting in Ottoman documents
Successive pruning stages eliminate wrong words.

Toni M. Rath and R. Manmatha. Word Image
Matching Using Dynamic Time Warping. In Proc. of
Conference on Computer Vision and Pattern
Recognition (CVPR)(2), pages 521-527, 2003.
A. Balasubramanian, M. Meshesha, and C. V.
Jawahar. Retrieval from Document Image
Collections. In International Workshop on
Document Analysis Systems (DAS), pages 1-12,
2006.
Esra Ataer and Pinar Duygulu. Retrieval of
Ottoman documents. In Multimedia Information
Retrieval (MIR) workshop, pages 155-162, 2006.

10
Contribution of This Work

Data is processed quickly using the proposed
technique to help search efficiently in large
collection.
Effect of word image representation and document
types on the proposed technique are analyzed.
Scalability of the proposed method is
demonstrated on a collection of Kalidasas books.
The group of similar words retrieved using the
proposed approach are labeled (automatically) for
annotation based access to documents.
A method to improve the automatic word labeling
(annotation) accuracy is presented.

11
Overview

Introduction
The problem
Previous work
Contributions
Searching in document images
Word image representation
Similarity search
Content sensitive hashing
Fitting in retrieval system
Experimental results
Annotation for retrieval
Summary
Future work

12
Word Image Representation

Profile Features
Ink transitions
Number of black to white pixel transitions in the
image row or column. Calculated for both rows and
columns.
Projection profiles
Sum over the pixel values of a column

13
Word Image Representation

Profile Features
Upper word profiles
Black pixel distance from top boundary of the
image.
Lower word profiles
Black pixel distance from bottom of the image.
If no pixel is found in a column, the value is
taken as height of the image.

14
Word Image Representation

Region based moments
Central moments
Discrete Fourier Transform (DFT) coefficients.
Projection and word profile features are
segmented vertically into four equal parts.
1D Fourier transform of the segmented profile
features is obtained.
n4 real parts and last n-13 imaginary parts of
the DFT are taken as features.
Total 84 Fourier coefficients are taken from each
image.

3 x (7 x 4) 84
features x (coefficients x segments) total
coefficients for every image
15
Similarity search

Given word image representations as vectors
(points) in some space,
We need to search for similar vectors (points)
i.e., nearest neighbor search (NNS).
k-d tree, B-tree or R-tree can be used for the
NNS.
How to handle the slight differences in the
representation of similar words?
Approximate nearest neighbor search has to be
carried out.
Since the representations are in high dimension
(more than 84 in our case), traditional way of
searching is inefficient.
Locality sensitive hashing (LSH) is an
approximate nearest neighbor search method for
sub-linear time complexity.

Jon Louis Bentley. Multidimensional Binary
Search Trees Used for Associative Searching.
Communications of the ACM, 18(9)509-517, 1975.
Sunil Arya and David M. Mount. Approximate
Nearest Neighbor Queries in Fixed Dimensions. In
SODA '93. pages 271-280, 1993.
Rudolf Bayer and E. McCreight. Organization and
Maintenance of Large Ordered Indexes. Acta
Informatica, 1(3)173-189, 1972.
M. Datar, N. Immorlica, P. Indyk, and V. S.
Mirrokni. Locality-Sensitive Hashing Scheme
Based on p-Stable Distributions. In ACM SOCG,
pages 253-262, 2004.
16
Content Sensitive Hashing

A similarity search problem in which it is not
necessary to find exact answer instead determine
approximate answer.
The key idea is
To hash points using several hash functions so as
to ensure that for each function the probability
of collision is much higher for objects which are
close to each other than for those which are far
apart.
When a query point is given,
Hash the query point and retrieve elements stored
in buckets containing that point.

17
Content Sensitive Hashing

Hashing Technique
Given set P of n points and number of hash
tables L.
for each hash table Ti, i 1,,L
for each point pj, j1,n
store pj on bucket gi(pj) of hash table Ti.
where gi(p), i1,,L is hash function of table Ti

Hash function can be combination of other
functions.
Some Examples
g(v1,,vk) a1.v1ak.vk mod M
where M is hash table size and a1,,ak are
random numbers from interval 0M-1
g(p) h1(p),,hk(p)
where hi(p) (ai.pbi)/w, ai is a d dimensional
vector
gL(p) v1(p)vL(p)
v(p) Unaryc(x1)Unaryc(xd).
Unaryc(x) x 1s followed by c-x 0s
vi(p) gt select some bits from v(p), i 1..L

18
Content Sensitive Hashing

Querying
To process a query q
we search all indices of g1(q),,gL(q) and
collect all points from L indices of hash tables.
Linear search on the collected points.
Output points within distance R from query.

19
Content Sensitive Hashing

Example
Let, p1,3,2, q1,2,3, r3,1,1, s2,1,1
are d3 dimensional points and c3 is max value
in the dimensions.
v(p) Unaryc(x1)Unaryc(xd).
Unaryc(x) x 1s followed by c-x 0s
v(p) v(1,3,2) 100 111 110
A new dimensions d cd 9 is obtained i.e., a
set I 1,2,3,4,5,6,7,8,9.
Let number of hash tables L2, and I11,5,6,
I22,3,7,9 be L subsets from of I.
Hash function is gL(p) v1(p)vL(p)
vi(p) gt select Ii bits from v(p), i 1L

Unary(1) 100
Unary(3) 111
Unary(2) 110
20
Content Sensitive Hashing

Example
v(p) v(1,3,2) 100 111 110
g1(p) 111, g2(p) 0010. (7, 2)
v(q) v(1,2,3) 100 110 111
g1(q) 110, g2(q) 0011. (6, 3)
v(r) v(3,1,1) 111 100 100
g1(r) 100, g2(r) 1110. (4, 13)
Query s 2,1,1
v(s) 110 100 100
g1(s) 100, g2(s) 1010. (4, 10)
Resulting point is r

I11,5,6 and I22,3,7,9
v(p) v(1,3,2) 100 111 110
g1(p)
1
1
1
21
Fitting in Retrieval System
Document Images
Textual Query
Relevant Documents
Cross Lingual
Pre-processing
Segmentation and word detection
Word Rendering
Hashed Words
Feature Extraction
Feature Extraction
Hashing
Offline Process
Online Process
22
Experimental Results
Performance on different data sets of English
language
query
results
23
Experimental Results
Performance on different data sets of English
language
Performance of individual features
Performance with combination of features
24
Experimental Results
Searching in Kalidasas Collection.
Cross-lingual search
25
Experimental Results
Searching in Kalidasas Collection.
Comparison with Dynamic Time Warping based NNS
26
Overview

Introduction
The problem
Previous work
Contributions
Searching in document images
Annotation for retrieval
Annotation based search
Annotation correction
Experimental results
Summary
Future work

27
Annotation for Retrieval

Annotation is the process of identifying objects
in images and labeling with meaningful
description.
Search is easy and efficient in annotated
document images.
Challenges
Recognition for annotation may be inaccurate.
Manual annotation is impractical

28
Annotation for Retrieval

Can we use image search to speed up annotation
and increase accuracy?
Image search produces clusters of similar words.
A single representative is required to annotate
words of the whole cluster.
Cluster of recognized words can be obtained to
get the representative.
The cluster information can be used to obtain
correct annotation of the cluster.

29
Annotation Based Search
Document Images
Word Annotation by Recognition
Relevant Documents
Cluster of Word Images
Pre-processing
Text Search Engine
Segmentation and word detection
Hashed Words
Feature Extraction
Textual Query
Hashing
Online Process
Offline Process
30
Annotation Correction
Correction by Majority Voting
What if too erroneous?
ambidextrous ambidextro4s ambidextrous
ambidextrous ab idex tro4s ambiderous an biderous
ambiderous ambideous
recognition
Ordered words
Word length 12
ambidextrous
Final word
Text words of cluster
Word image cluster
31
Annotation Correction
Correction by Majority Voting

Input Cluster C of words.
Output Representative word WR for C
S Sort C based on string length
Get M S for all A, B in S edit distance of A
and B is less than half of the lengths of A and
B
If l is the length of most of the strings
(majority) the cluster representative WR has
length l.
For each character i 1,,l do
Get all k words of length l
Find majority of characters for position i of WR

32
Annotation Correction
Correction by Alignment
ambidextrous
a m b i d e x t r o
u s a b i d e x t
r o 4 s a m b i d e
r o u s a m b i d
e x t r o u s a m b i
d e x t r o 4 s
abidextro4s
ambiderous
ambidextrous
ambidextro4s
Aligned words
Text words of cluster
a m b i d e x t r o 4 s
a m b i d e x t r o
u s
Final word
Word obtained by majority voting
33
Annotation Correction
Correction by Alignment

Input Cluster C of Wi 1,,n words
Output Cluster representative WR of C
for each i 1,,n do
for each j 1,,n do
if j ? i then do
Align word Wi and Wj
Record errors Ek, k 1,,m in Wi
Record possible correction Gp, p 1,,q for Ek
from Wj
end if
end for
Find correction Ch Gp by majority voting
Correct Ek with Ch
O ? O U Wi
end for
Find correct word WR from the alignments O with
majority voting.

34
Experimental Results
35
Experimental Results
Effect of cluster size on the retrieval
performance
36
Summary

Direct hashing of the word features eliminates
costly processing before building an index.
Query results can be obtained in milliseconds
using the content sensitive hashing (CSH).
Scalability of the proposed method is
demonstrated on a collection of Kalidasas books.
Two methods to improve the automatic word
labeling (annotation) accuracy are presented.
Demonstrated annotation based retrieval technique
using the automatic annotations of document
images.

37
Future Work

Indexing of documents images in different fonts.
Searching in Multi-lingual documents is one of
the challenging tasks.
Many Indian language documents are translated to
other languages.
Usage of cluster information
for improving the accuracy of character
recognizers.
Annotation becomes difficult in presence of
errors in every recognized word of a cluster.
Need to explore new techniques for annotation

38
Related Publications

Anand Kumar, C.V.Jawahar and R. Manmatha.
"Efficient Search in Document Image Collections".
Asian Conference on Computer Vision (ACCV), pages
586-595, November 18-22, 2007, Tokyo, Japan.
C.V.Jawahar and Anand Kumar. "Content Level
Annotation of Large Collection of Printed
Document Images". International Conference on
Document Analysis and Recognition (ICDAR), pages
799-803, September 23-26, 2007, Brazil.
Anand Kumar, A. Balasubramanian, Anoop M.
Namboodiri and C.V. Jawahar. "Model-Based
Annotation of Online Handwritten Datasets",
International Workshop on Frontiers in
Handwriting Recognition (IWFHR), October 23-26,
2006, La Baule, France.