Title: Document Image Matching Based on Component Blocks
1Document Image Matching Based on Component Blocks
- Fuhui Long, Hanchuan Peng, Zheru Chi, and Wanchi Siu
- Center for Multimedia Signal Processing,
- Department of Electronic Information Eng.,
- The Hong Kong Polytechnic Univ.,
- Email fhlong, phc, enzheru_at_eie.polyu.edu.hk
- http//pandora.rad.jhu.edu/phc/publications.html
2Outline
- Introduction
- Component Block Data Structure
- Matching Algorithm
- Experiments
- Discussion & Conclusion
3Document Image Matching
- Key technique for document image registration & retrieval
- Can be applied widely in office automation, digital libraries, video-conferencing, etc.
4Current Techniques
- Existing methods are mainly based on local features of the document page image
- Cesarini's form-reader system (attributed relational graphs)
- Shimotsuji's cell-structure-based 2-dimensional hash table
- Watanabe's blank form structure of repetitions and positions of cells
- Tseng and Chen's line-segment-based method
- Fan and Chang's line crossing relationship matrix
- Watanabe and Huang's predefined logical structure for business cards
- Safari's projective geometry method
- etc.
5Our Approach
- Decompose a document page image into local component blocks
- Propose measurements to combine
------local block information
------global page layout information
- It is closely related to our e-Doc technique, which is developed for document databases
6e-Doc
- The block-oriented e-Doc technique can be very useful for document-database applications, including document page image retrieval.
7Preprocessing
- Noise removal
- Region-based binarization and foreground extraction
- Correlation-based skew correction
- Image blocking
- Scan from bottom to top and from left to right
- Use the simplest region growing method (not pixel-by-pixel but line-by-line, i.e. if there is a foreground pixel on the outer boundary of the current block, grow the block out by one more line)
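The line-by-line growing rule above can be sketched as follows. This is only a minimal illustration, assuming a binary image stored as a list of rows (1 = foreground) and boxes given as inclusive (top, left, bottom, right); the names `grow_block` and `find_blocks` are hypothetical and not taken from the slides.

```python
def grow_block(img, r, c):
    """Grow an inclusive box (top, left, bottom, right) from seed pixel (r, c).

    A side is pushed out by one whole line whenever the line just outside
    that side (corners included) contains a foreground pixel.
    """
    h, w = len(img), len(img[0])
    top, left, bottom, right = r, c, r, c
    grew = True
    while grew:
        grew = False
        x0, x1 = max(left - 1, 0), min(right + 1, w - 1)
        if top > 0 and any(img[top - 1][x] for x in range(x0, x1 + 1)):
            top -= 1
            grew = True
        if bottom < h - 1 and any(img[bottom + 1][x] for x in range(x0, x1 + 1)):
            bottom += 1
            grew = True
        y0, y1 = max(top - 1, 0), min(bottom + 1, h - 1)
        if left > 0 and any(img[y][left - 1] for y in range(y0, y1 + 1)):
            left -= 1
            grew = True
        if right < w - 1 and any(img[y][right + 1] for y in range(y0, y1 + 1)):
            right += 1
            grew = True
    return top, left, bottom, right


def find_blocks(img):
    """Scan bottom-to-top, left-to-right; grow one block per unclaimed seed."""
    work = [row[:] for row in img]  # working copy, so the input is untouched
    h = len(work)
    w = len(work[0]) if h else 0
    blocks = []
    for r in range(h - 1, -1, -1):
        for c in range(w):
            if work[r][c]:
                top, left, bottom, right = grow_block(work, r, c)
                blocks.append((top, left, bottom, right))
                # Erase the claimed region so it cannot seed another block.
                for y in range(top, bottom + 1):
                    for x in range(left, right + 1):
                        work[y][x] = 0
    return blocks
```

Erasing each finished box is a design choice of this sketch, not something the slides specify; it keeps later seeds from re-claiming pixels but does not merge boxes that happen to overlap.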
8Component Block List
We use only the block location and size information for matching.
9Matching Algorithm
(Notation: TBL = Template Block List, CBL = Component Block List)
Procedure CBL-MA
Input: a CBL for the input document image, and a handle to a template image database of K TBLs
Output: the TBL with the minimum distance D to the CBL
Preprocessing:
  for k = 1, ..., K do
  begin
    sort the kth TBL by block size (from small to large)
  end.
Note: the preprocessing is not a part of CBL-MA and needs to be done only once beforehand.
begin
  sort the CBL by block size (from small to large)
  for k = 1, ..., K do
  begin
    compute Dk, the distance between the CBL and the kth TBL
  end
  select the TBL with the minimum Dk as output
end.
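The control flow of CBL-MA can be sketched as below. The distance itself is a placeholder: the cost in `block_cost`, the search window `window`, and the averaging in `list_distance` are assumptions made for illustration, not the size, location, and total distances defined in the talk. Blocks are assumed to be (x, y, w, h) tuples in page-normalized coordinates.

```python
from bisect import bisect_left


def area(block):
    """Block size, taken here as width * height (an assumption)."""
    return block[2] * block[3]


def block_cost(t, c):
    """Placeholder cost: relative size difference plus L1 location offset."""
    biggest = max(area(t), area(c)) or 1.0  # avoid division by zero
    size_term = abs(area(t) - area(c)) / biggest
    loc_term = abs(t[0] - c[0]) + abs(t[1] - c[1])
    return size_term + loc_term


def list_distance(cbl_sorted, sizes, tbl, window=2):
    """Distance between a size-sorted CBL and one TBL (illustrative only)."""
    if not cbl_sorted or not tbl:
        return float("inf")
    total = 0.0
    for t in tbl:
        # Binary search for the CBL block closest in size to t ...
        i = min(bisect_left(sizes, area(t)), len(cbl_sorted) - 1)
        lo = max(0, i - window)
        hi = min(len(cbl_sorted), i + window + 1)
        # ... then pick the best candidate among its size neighbours.
        total += min(block_cost(t, c) for c in cbl_sorted[lo:hi])
    return total / len(tbl)


def cbl_ma(cbl, tbls, window=2):
    """Return (index of the best-matching TBL, its distance)."""
    cbl_sorted = sorted(cbl, key=area)       # sort the CBL by block size
    sizes = [area(c) for c in cbl_sorted]
    dists = [list_distance(cbl_sorted, sizes, tbl, window) for tbl in tbls]
    best = min(range(len(dists)), key=dists.__getitem__)
    return best, dists[best]
```

For example, `cbl_ma(cbl, [tbl_a, tbl_b])` returns the index of whichever template list is closer under this placeholder distance, together with that distance.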
10Distance Definitions
Size Matching
Location Matching
Total Distance
11Illustration of the Algorithm
[Figure: the component block list is sequenced by size into a sequenced block list, which is then matched against the template block list in two steps: 1. matching with size, 2. matching with location. Labels in the figure: A, B, BT.]
12Experimental Data
- A large document template data set of 1350 templates. Define 5 subsets with sizes
- 50 templates
- 100 templates
- 200 templates
- 500 templates
- 1350 templates
- All test data are computer-generated (deformed images derived from these templates)
13Deformation Types
- Detection Error
- Block misdetection rate Pm
- Block misaddition rate Pa
- Block Size Variation
- Block size variation rate Ps
- Block size variation scale Ss
- Block Location Displacement
- Block displacement rate Pd
- Block displacement scale Sd
- Block Rotation
- Block rotation rate Pr
- Block rotation angle Dr
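The deformation parameters above could drive a synthetic test generator along the following lines. This is a rough sketch under stated assumptions: blocks are (x, y, w, h) tuples with (x, y) the block centre in page-normalized units, each rate is applied per block, scales are drawn uniformly, and rotation is modelled by the axis-aligned bounding box of the rotated block. None of these choices, nor the names `deform` and `rotated_bbox`, come from the slides.

```python
import math
import random


def rotated_bbox(w, h, degrees):
    """Width and height of the axis-aligned box around a rotated w x h block."""
    a = math.radians(degrees)
    c, s = abs(math.cos(a)), abs(math.sin(a))
    return w * c + h * s, w * s + h * c


def deform(blocks, pm=0.0, pa=0.0, ps=0.0, ss=0.0,
           pd=0.0, sd=0.0, pr=0.0, dr=0.0, rng=None):
    """Return a deformed copy of a block list (illustrative semantics)."""
    rng = rng or random.Random()
    out = []
    for x, y, w, h in blocks:
        if rng.random() < pm:            # misdetection: the block is lost
            continue
        if rng.random() < ps:            # size variation, same factor for w and h
            f = 1.0 + rng.uniform(-ss, ss)
            w, h = w * f, h * f
        if rng.random() < pd:            # displacement, relative to block size
            x += rng.uniform(-sd, sd) * w
            y += rng.uniform(-sd, sd) * h
        if rng.random() < pr:            # rotation by at most dr degrees
            w, h = rotated_bbox(w, h, rng.uniform(-dr, dr))
        out.append((x, y, w, h))
    for _ in blocks:                     # misaddition: spurious extra blocks
        if rng.random() < pa:
            out.append((rng.random(), rng.random(),
                        rng.uniform(0.05, 0.3), rng.uniform(0.05, 0.3)))
    return out
```

Passing a seeded `random.Random` as `rng` makes a generated test set reproducible.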
14Data Examples
Template image
Deformation image
15Results for Detection Errors
(1) CBL-MA performs well (matching accuracy rc > 85%) even when 50% of the blocks in the block list are lost or wrongly added (see the columns Pm = 0.5 and Pa = 0.5). Even when 80% of the blocks are lost or added, the algorithm still produces a matching accuracy of nearly 60% (see the columns Pm = 0.8 and Pa = 0.8).
(2) CBL-MA is less sensitive to block misaddition than to block misdetection. This is reasonable: when additional blocks are wrongly put into the CBL, the original blocks still play a role, although a weaker one. In contrast, the information lost to block misdetection is not recoverable.
16Results for Detection Errors
Pm: block misdetection rate; Pa: block misaddition rate
17Results for Block Size Variation
For block size variation, the influence of parameters Ps and Ss (we use the same scale factor for both block width and height) on rc is given in Table 2. When blocks expand or shrink greatly, CBL-MA keeps rc above 90% (the fourth row of Table 2). Likewise, even when all blocks have size variations (Ps = 1.0), our algorithm produces a very high matching accuracy of 95%. The latter case corresponds to many office automation applications, where Ss is not very large but most blocks are subject to some degree of size variation, i.e. Ps is close to 1.
18Results for Block Size Variation
Ps: block size variation rate; Ss: block size variation scale
19Results for Block Displacement
For block location displacement, the influence of parameters Pd and Sd on rc is given in Table 3. CBL-MA is evidently robust to block location variation (rc is always larger than 95%).
20Results for Block Displacement
Pd: block displacement rate; Sd: block displacement scale
21Results for Block Rotation
For block rotation, the influence of parameters Pr and Dr on rc is given in Table 4. In both cases (Dr = 15° with Pr varying from 0.2 to 1.0, and Pr = 0.5 with Dr varying from 5° to 45°), CBL-MA produces satisfactory classification even when the test images contain strong deformation, e.g. 50% of the component blocks are rotated by up to 45°, or all component blocks are rotated by up to 15°. Note that block rotation directly leads to a significant change in block sizes.
22Results for Block Rotation
Pr: block rotation rate; Dr: block rotation angle
23Results for Template Set Size
Here, under a general parameter setting (Pa = 0.2, Pm = 0.2, Ps = 0.2, Ss = 0.2, Pd = 0.5, Sd = 0.5, Pr = 0.5, Dr = 15°), we examine the influence of the template set size on the matching accuracy. All five template sets are used. For each template set, we independently generate at least 2000 test images. The results are listed in Table 5. Even when the template set size grows to 500, the matching accuracy remains satisfactory (> 80%). For the 1350-template set (Set-E), rc is still around 70%.
24Results for Template Set Size
25Comparison to Other Algorithms
- A failure to detect local features (e.g. line segments) usually leads directly to poor performance in several other algorithms.
- In contrast, our experiments demonstrate that even when the block information is partially lost or inaccurate, the performance of CBL-MA does not degrade significantly.
26Computational Complexity
- Let n be the number of blocks in a CBL, m the number of blocks in a TBL, and K the number of template images in a document image database. When we use quicksort and binary search, the typical computational complexity is
- O(n log n) for CBL sorting
- O(Km log n) for CBL/TBL size matching
- O(Km(2T+1)) for CBL/TBL location matching
- O(K) for CBL/TBL distance calculation
- O(log K) for finding the minimum distance
- In total, O((Km+n) log n)
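As a quick check of the total, the per-step costs can be added up. This assumes the location-matching term reads Km(2T+1), that the parameter T (not defined on this slide) satisfies 2T+1 = O(log n), and that m ≥ 1 and n ≥ 2:

```latex
\begin{align*}
\underbrace{n\log n}_{\text{CBL sorting}}
 + \underbrace{Km\log n}_{\text{size matching}}
 + \underbrace{Km(2T+1)}_{\text{location matching}}
 + \underbrace{K}_{\text{distances}}
 + \underbrace{\log K}_{\text{minimum}}
 &= (Km+n)\log n + Km(2T+1) + K + \log K \\
 &= O\bigl((Km+n)\log n\bigr).
\end{align*}
```

The first two terms dominate: K and log K are each at most Km log n, and Km(2T+1) is absorbed under the assumption on T.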
27Application: Column Table Data Extraction
This page matching algorithm is then refined and applied to automatic data extraction from column forms.
---All fields in the input image differ greatly from each other, so the local-feature-based approach does not work well.
---Our algorithm is a powerful tool for finding the correct image template, which is then used to annotate the image data fields and complete the data extraction.
28Conclusions
- Based on the block list (and tree), our algorithm effectively uses both the local information of each page block and the global information of the page layout.
- We present a method for effective document image matching. The algorithm gives satisfactory performance under various image deformations.
- The algorithm is robust to image distortion, filled-in text, and noise.
- We report a successful application of our algorithm to automatic reading of column table data.