Automatic Document Orientation Detection and Categorization by Document Vectorization


Title: Automatic Document Orientation Detection and Categorization by Document Vectorization


Automatic Document Orientation Detection and
Categorization by Document Vectorization
Shijian Lu (dcslsj_at_nus.edu.sg) and Chew Lim Tan
(tancl_at_comp.nus.edu.sg)
School of Computing, National University of Singapore
Figure 4 shows the document vectors of the four
scripts under study (Arabic, Hebrew, Chinese,
Roman). Eight document vectors are plotted for
each script at each orientation (upright and
upside-down).
Introduction
With the proliferation of digital libraries, an
increasing number of documents of different
scripts and orientations, as illustrated in
Figure 1, are being produced. Knowledge of the
underlying document script and document
orientation is required to facilitate subsequent
document processing tasks such as layout analysis
and OCR.
Figure 1 Sample documents of different scripts
with different orientations
Methods
A series of document preprocessing steps is
applied, as illustrated in Figure 2. First, a
median filter is applied, as shown in Figure 2(b).
After document binarization, shown in Figure
2(c), the document images are filtered by a size
filter, as shown in Figure 2(d).
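The three preprocessing steps above can be sketched in plain Python as follows. This is a minimal illustration only: the window radius, the fixed binarization threshold, the 4-connectivity, and the minimum component size are all assumptions, and the function names are hypothetical, since the poster does not state the parameters the authors used.

```python
from statistics import median

def median_filter(img, radius=1):
    """Replace each pixel by the median of its neighbourhood (clipped at borders)."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - radius), min(h, y + radius + 1))
                      for i in range(max(0, x - radius), min(w, x + radius + 1))]
            out[y][x] = median(window)
    return out

def binarize(img, threshold=128):
    """Mark pixels darker than the (assumed, fixed) threshold as foreground (1)."""
    return [[1 if p < threshold else 0 for p in row] for row in img]

def label_components(binary):
    """Group foreground pixels into 4-connected components via flood fill."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for y in range(h):
        for x in range(w):
            if binary[y][x] and not seen[y][x]:
                stack, pixels = [(y, x)], []
                seen[y][x] = True
                while stack:
                    cy, cx = stack.pop()
                    pixels.append((cy, cx))
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w and binary[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                components.append(pixels)
    return components

def size_filter(binary, min_pixels=3):
    """Keep only components with at least min_pixels pixels (assumed criterion)."""
    out = [[0] * len(binary[0]) for _ in binary]
    for pixels in label_components(binary):
        if len(pixels) >= min_pixels:
            for y, x in pixels:
                out[y][x] = 1
    return out

def preprocess(gray):
    """Apply the steps in the order given in the text: median, binarize, size filter."""
    return size_filter(binarize(median_filter(gray)))
```

The connected components returned by `label_components` are the units that the later vectorization step operates on.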
Figure 4 Document vectors of four scripts at two
opposite orientations
Lastly, the document script and orientation are
determined from the Bray-Curtis distance between
the converted document vector and multiple
pre-constructed reference document vectors, using
the k-nearest-neighbor (KNN) algorithm.
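The classification step can be sketched as follows, using the standard Bray-Curtis definition (sum of absolute differences divided by sum of absolute sums). The value of k, the majority-vote rule, the toy three-element vectors, and the `(script, orientation)` label format are assumptions made for illustration; the poster does not give these details.

```python
from collections import Counter

def bray_curtis(u, v):
    """Bray-Curtis distance: sum|u_i - v_i| / sum|u_i + v_i|."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(abs(a + b) for a, b in zip(u, v))
    return num / den if den else 0.0

def knn_classify(query, references, k=3):
    """Return the majority label among the k references nearest to query.

    references is a list of (vector, label) pairs; here each label is a
    hypothetical (script, orientation) tuple.
    """
    nearest = sorted(references, key=lambda ref: bray_curtis(query, ref[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy reference vectors, invented for illustration only.
references = [
    ([0.9, 0.1, 0.0], ("Roman", "upright")),
    ([0.8, 0.2, 0.0], ("Roman", "upright")),
    ([0.1, 0.9, 0.0], ("Roman", "upside-down")),
    ([0.2, 0.8, 0.0], ("Roman", "upside-down")),
    ([0.0, 0.1, 0.9], ("Arabic", "upright")),
]
label = knn_classify([0.85, 0.15, 0.0], references, k=3)
```

Because each reference carries both a script and an orientation, a single nearest-neighbor vote yields both decisions at once.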
Note that the proposed technique tolerates
document skew, provided that the skew angle lies
within a small range. Figure 5 shows the
Bray-Curtis distance between the document vectors
of skewed document images and that of the same
document with no skew. The distance stays close
to zero while the skew angle remains within a
small range.
Figure 2 Document preprocessing procedure
Each labeled connected component is then
converted into a component vector based on the
vertical component cuts illustrated in Figure 3.
Figure 5 Document vector variation with document
skew
Experimental Results
The first eight and the last two vector elements
are constructed as follows
The proposed technique has been tested on 80
document images, 20 printed in each of the four
scripts under study. Table 1 shows the
Bray-Curtis distance between the document vectors
of the 80 test documents and the document vector
templates, which are derived by averaging the
reference document vectors of the corresponding
scripts.
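The template comparison described above can be sketched as follows: a template is the element-wise mean of a script's reference vectors, and a test document is compared against each template by Bray-Curtis distance. The function names and the dictionary layout are hypothetical, and the numbers in the usage lines are invented; none of this reproduces the values in Table 1.

```python
def bray_curtis(u, v):
    """Bray-Curtis distance: sum|u_i - v_i| / sum|u_i + v_i|."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(abs(a + b) for a, b in zip(u, v))
    return num / den if den else 0.0

def average_template(vectors):
    """Element-wise mean of equally long reference document vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def distances_to_templates(doc_vector, templates):
    """Map each template name to its Bray-Curtis distance from doc_vector."""
    return {name: bray_curtis(doc_vector, t) for name, t in templates.items()}

# Invented two-element vectors, for illustration only.
templates = {
    "script A": average_template([[1.0, 2.0], [3.0, 4.0]]),
    "script B": average_template([[4.0, 1.0], [6.0, 1.0]]),
}
dists = distances_to_templates([2.0, 3.0], templates)
closest = min(dists, key=dists.get)
```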
Table 1 Bray Curtis distance between document
vector and the vector template
Figure 2 Definition of vertical component run.
Figure 3 Illustration of the vertical component
cuts
Conclusions
  1. A fast and efficient identification technique is
     reported that identifies the document script
     and orientation effectively.
  2. The proposed technique works well for scripts
     with different stroke density and distribution.
     However, it cannot handle scripts with similar
     stroke density and distribution. We will
     address this in future work.

Each document image can thus be converted into a
document vector by summing up the converted
component vectors as follows
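The authors' formula did not survive in this transcript. Taking the sentence literally, a plain element-wise sum of the component vectors can be sketched as below; any normalization or weighting the authors may apply is not shown, and the function name is hypothetical.

```python
def document_vector(component_vectors):
    """Sum equally long component vectors element by element."""
    return [sum(column) for column in zip(*component_vectors)]

# Two invented component vectors, for illustration only.
doc = document_vector([[1, 0, 2], [0, 3, 1]])
```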