Generation of Synthetic Datasets for Performance Evaluation of Text/Graphics Document OCR

M. Delalandre. Generation of Synthetic Datasets for Performance Evaluation of Text/Graphics Document OCR. DAG Meeting, Barcelona, Spain, 19th of November 2008.

1
Generation of Synthetic Datasets for Performance
Evaluation of Text/Graphics Document OCR
  • Mathieu Delalandre
  • CVC, Barcelona, Spain
  • DAG Meeting
  • CVC, Barcelona, Spain
  • Wednesday 19th of November 2008

2
Introduction
  • Text/graphics documents

Text/graphics documents are used in a variety of
fields such as geography, engineering and the social
sciences.
Some examples are (figures not preserved in the transcript).
A huge amount of data exists, from two main sources (not preserved in the transcript).
3
Introduction
  • OCR of text/graphics documents

Character recognition systems working with text/graphics documents
  • First related work: [Brown1979]
  • More than 50 references on this topic today [Fletcher1988] [Zenzo1992] [Goto1999] [Adam2000]

Processing chain (each step with its output), starting from the full image
  • Text/graphics separation: image of text-lines
  • Text-line detection: images of single text-lines
  • Character segmentation: images of single characters
  • Character recognition: ASCII

Problematics (either general to any documents or specific to text/graphics documents)
  • letter segmentation
  • multi-font recognition
  • scale variation
  • text/graphics separation
  • rotation variation
  • text-line detection
  • no reading order
  • no dictionary
4
Introduction
  • About performance evaluation

The case of general OCR [Kanungo1999]
  • More than 40 references on the topic [Kanungo1999]
  • Several standard databases exist (NIST, MARS, CD-ROM English, ...)
  • Annual evaluation reports [Rice1992] [Rice1993]
  • Black-box evaluation: the evaluation considers the OCR system as an
    indivisible unit and evaluates it from its final results (i.e. OCR
    output vs. ASCII transcription of the text, using string edit distances).
  • White-box evaluation: the evaluation aims to characterize the
    performance of individual sub-modules of the OCR system (skewing,
    letter segmentation, block identification, character recognition, etc.).

The case of text/graphics document OCR [Wenyin1997]
  • Only 1 reference on the topic
  • No standard databases
  • No complete evaluation done through 20 years of research
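
The black-box comparison mentioned above (OCR output vs. ASCII transcription, using a string edit distance) can be sketched as follows. This is a minimal illustration, assuming the plain Levenshtein distance; the function names and the accuracy formula are hypothetical and are not taken from the talk or from the cited evaluation reports.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(ocr_output: str, transcription: str) -> float:
    """Hypothetical score: 1 - errors / length of the transcription,
    clamped to 0 when the OCR output is worse than an empty string."""
    if not transcription:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(ocr_output, transcription)
    return max(0.0, 1.0 - errors / len(transcription))
```

For instance, `edit_distance("He1lo Wor1d", "Hello World")` counts the two substituted characters, and the accuracy is then computed relative to the 11 characters of the transcription.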
5
Introduction
  • Scope of the proposed work

Processing chain: text/graphics separation, text-line detection, character segmentation, character recognition
Evaluation steps: groundtruthing, characterization
Performance evaluation of text/graphics document OCR
  • white-box evaluation
  • groundtruthing step
  • datasets for text-line detection and character recognition
  • the generation algorithms are simple; the main purpose of the talk
    concerns the contributions on the setting
6
Plan
  1. Groundtruth definition
  2. Datasets for character recognition
  3. Datasets for text-line detection
  4. In progress datasets

7
Groundtruth definition
  • Character level
    • ASCII code
    • font (name, size, style)
    • location point
    • oriented bounding box
    • orientation (?)
    • scale (?)
  • Text level
    • first location point
    • groundtruth of characters
    • characters/word positions

char    H e l l o W o r l d
p-word  0 0 0 0 0 1 1 1 1 1
p-char  0 1 2 3 4 0 1 2 3 4
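
The characters/word positions of the table above can be derived mechanically from a text-line. The sketch below assumes that words are separated by whitespace and that both indices start at 0, as in the "Hello World" example; the function name is hypothetical.

```python
from typing import List, Tuple


def word_char_positions(text: str) -> Tuple[List[str], List[int], List[int]]:
    """Return (char, p-word, p-char) rows for a text-line.

    p-word is the index of the word containing the character,
    p-char is the index of the character inside that word.
    Whitespace is used only as a separator and is not emitted.
    """
    chars: List[str] = []
    p_word: List[int] = []
    p_char: List[int] = []
    for w, word in enumerate(text.split()):
        for c, ch in enumerate(word):
            chars.append(ch)
            p_word.append(w)
            p_char.append(c)
    return chars, p_word, p_char


if __name__ == "__main__":
    rows = word_char_positions("Hello World")
    for label, row in zip(("char", "p-word", "p-char"), rows):
        print(f"{label:7s}", *row)
```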
8
Datasets for character recognition (1/2)
  • Problematics

How to generate single character images? How many classes? Which image
resolution? Which size for the datasets? Which fonts? Etc.

  • Published experiments

               image size  class  size    learning  font(s)  rotation  scaling
Brown1981      68²         ??/10  20 000  -         -        yes       yes
Zenzo1992      ??          ??/62  72 000  -         -        yes       yes
Takahashi1992  24²         ??/10  6 400   50        -        yes       yes
Adam2000       28²         51/62  15 000  33        -        yes       yes
Chen2003       16²-512²    26/26  1 000   14        1        no        yes
Choisy2004     28²         51/62  15 000  80        -        yes       yes
Hase2004       32²         ??/26  3 000   33        3        yes       no
Pal2006        13²-34²     40/62  18 000  80        2        yes       yes
Roy2008        13²-74²     40/62  8 000   80        many     yes       yes
               (1)         (2)    (3)     (3)       (4)      (5)       (5)

"??" marks a value that could not be determined, "-" a value not listed on the slide; the numbers (1) to (5) refer to the main conclusions.
  • Main conclusions
  1. The real sizes of characters can only be
    estimated.
  2. The confusion problem (e.g. 6 vs 9) is still not
    well defined; the 62-class problem (a-z A-Z 0-9)
    is the main goal.
  3. It is not possible to fix a standard size for the
    training/test sets, although this information is
    well defined in the published experiments; several
    thousand images are mandatory for training.
  4. The impact of fonts is little studied and should be
    taken into account in the evaluation.
  5. Invariance to rotation and scaling is the
    final goal, but the two are rarely studied independently.

9
Datasets for character recognition (2/2)
  • Generation setting

letter class        62 (a-z A-Z 0-9)
font class          30 fonts (http://www.codestyle.org/) with lower and upper case, no cursive
basic fonts         3 (times, courier, arial)
character size      32² pixels (max dxdy of font symbols)
dataset size        5 000 / font (62 classes, 40 samples/class, 50/50)
training            free; ranked files allow a training specification (20% training on file-4001 to file-5000)
character scaling   1.0 to 2.0 with a gap of 1/1000
character rotation  0 to 2π with a gap of π/500

  • Datasets

Geometry invariance
tests  scaling  rotation  font(s)/test  fonts  images
3      no       no        1             3      15 000
3      yes      no        1             3      15 000
3      no       yes       1             3      15 000
3      yes      yes       1             3      15 000

Font adequacy
tests  scaling  rotation  font(s)/test  fonts  images
30     yes      yes       1             30     150 000

Font scalability
tests  scaling  rotation  font(s)/test  fonts  images
4      yes      yes       3, 6, 9, 12   12     150 000 (15 000 + 30 000 + 45 000 + 60 000)

  • Generation algorithm
    • font manager, centering, scale and rotation processes
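
The scaling and rotation settings above define discrete grids of values, and the image counts of the dataset tables follow from 5 000 images per font. The sketch below illustrates both points under stated assumptions: uniform sampling on each grid, 2π excluded from the rotation grid because it duplicates 0, and hypothetical function names. It does not render any image and is not the generation algorithm of the talk.

```python
import math
import random

IMAGES_PER_FONT = 5000  # "dataset size: 5 000 / font" in the setting


def sample_scale(rng: random.Random) -> float:
    """Scale factor in [1.0, 2.0] on a grid of step 1/1000."""
    return 1.0 + rng.randint(0, 1000) / 1000


def sample_rotation(rng: random.Random) -> float:
    """Angle in [0, 2*pi) on a grid of step pi/500 (assumption: 2*pi excluded)."""
    return rng.randrange(1000) * math.pi / 500


def sample_character_parameters(rng: random.Random,
                                scaling: bool = True,
                                rotation: bool = True) -> dict:
    """Parameters for one character image; a disabled transform
    falls back to its identity value."""
    return {
        "scale": sample_scale(rng) if scaling else 1.0,
        "rotation": sample_rotation(rng) if rotation else 0.0,
    }


def images_in_dataset(fonts_per_test) -> int:
    """Total number of images for a list of tests, given the fonts used by each."""
    return sum(n * IMAGES_PER_FONT for n in fonts_per_test)


if __name__ == "__main__":
    rng = random.Random(2008)
    print(sample_character_parameters(rng))
    print(images_in_dataset([1, 1, 1]))     # geometry invariance, one row
    print(images_in_dataset([1] * 30))      # font adequacy
    print(images_in_dataset([3, 6, 9, 12])) # font scalability
```

With these settings the three calls to `images_in_dataset` reproduce the 15 000, 150 000 and 150 000 images listed in the tables.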
10
Datasets for text-line detection (1/2)
  • Problematics

How to generate single character images? How many words per image? Which
image size? Which size for the datasets? How many fonts? Etc.

                 use-case           images  text-lines  curved  font/img  scaling
Roy2008          geographic map     ??      5 000       yes     many      yes
Pal2004          artistic document  ??      1 521       yes     many      yes
Loo2002          poster, newspaper  2       118         yes     many      yes
Park2001         poster, publicity  30      1 265       yes     many      yes
Goto1999         Japanese form      170     9 831       yes     many      yes
Tan1998          map                8       96          no      many      yes
He1996           drawing            1       16          no      many      yes
Burge1995        cadastral map      4       150         no      many      yes
Deseilligny1995  cadastral map      3       1 250       no      many      yes
                 (1)                (1)     (1)         (2)     (3)       (3)
  • Main conclusions
  1. The use-cases are heterogeneous and the sizes and
    resolutions of images are rarely provided, so the text
    density is difficult to estimate; images
    with significant text content are preferred.
  2. Depending on the use-case, not all the methods work
    on curved text; a combination of curved and
    straight text is necessary.
  3. All the methods use context to extract the
    text-lines (i.e. font type, character size, line
    model). The size of characters can change a
    lot; the number of fonts is generally small (fewer
    than ten).

11
Datasets for text-line detection (2/2)
  • Setting

dictionary      422 text-lines (countries and capitals)
font class      30 fonts (http://www.codestyle.org/) with lower and upper case, no cursive
character size  32² pixels (max dxdy of font symbols)
image size      640², 10-50 text-lines per image
dataset size    100 images
text scaling    1.0 to 1.5 with a gap of 1/1000
text rotation   -π/2 to π/2 with a gap of π/500

  • Datasets

Text-line density
test  text-line/img  scaling  curved  font(s)/test  words
1     low            yes      no      3             in progress
1     medium         yes      no      3             in progress
1     high           yes      no      3             in progress

Font context
test  text-line/img  scaling  curved  font(s)/test  words
1     medium         no       no      9             in progress
1     medium         no       no      6             in progress
1     medium         no       no      3             in progress
1     medium         no       no      1             in progress

Size context
test  text-line/img  scaling  curved  font(s)/test  words
1     medium         no       no      1             in progress
1     medium         yes      no      1             in progress

  • Generation algorithm
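
The setting above (640² images, 10 to 50 text-lines per image, rotation from -π/2 to π/2 with a gap of π/500, scaling from 1.0 to 1.5 with a gap of 1/1000) can be turned into a groundtruth layout sampler. The sketch below only draws parameters: it assumes uniform sampling, does not check for overlaps between text-lines, and uses hypothetical names and a four-entry sample dictionary in place of the 422 text-lines. It is an illustration of the setting, not the generation algorithm of the talk.

```python
import math
import random
from typing import Dict, List, Sequence


def sample_text_line_layout(dictionary: Sequence[str],
                            n_lines: int,
                            rng: random.Random,
                            image_size: int = 640,
                            scaling: bool = True) -> List[Dict]:
    """Draw n_lines text-lines with a first location point,
    a rotation and a scale for one synthetic image."""
    if not 10 <= n_lines <= 50:
        raise ValueError("the setting uses 10 to 50 text-lines per image")
    layout = []
    for text in rng.choices(dictionary, k=n_lines):  # sampling with replacement
        layout.append({
            "text": text,
            # first location point, in pixels, anywhere in the image
            "first_location": (rng.randrange(image_size),
                               rng.randrange(image_size)),
            # -pi/2 to pi/2 on a grid of step pi/500
            "rotation": rng.randint(-250, 250) * math.pi / 500,
            # 1.0 to 1.5 on a grid of step 1/1000, or no scaling
            "scale": 1.0 + rng.randint(0, 500) / 1000 if scaling else 1.0,
        })
    return layout


if __name__ == "__main__":
    rng = random.Random(42)
    sample_dictionary = ["Spain", "Madrid", "France", "Paris"]
    for line in sample_text_line_layout(sample_dictionary, 10, rng)[:3]:
        print(line)
```

A real generator would additionally need a placement strategy that avoids overlapping text-lines, and a mapping from the low/medium/high densities to concrete numbers of text-lines, neither of which is given in the transcript.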
12
In progress datasets
13
Conclusions
  • Conclusions
    • work in progress
    • character recognition datasets are ready
    • bags of words are still being packaged, but will
      be ready soon
  • Perspectives
    • medium term: experiments with standard
      feature extraction methods [Roy2008]
      [Valveny2007]
    • long term: experiments with bags of
      words and text/graphics documents
      [Delalandre2007] [Wenyin1997]

14
References (1/2)
  1. R. Brown, M. Lybanon, L. K. Gronmeyer.
    Recognition of Handprinted Characters for
    Automated Cartography: A Progress Report.
    Proceedings of the SPIE, vol (205), 1979.
  2. L.A. Fletcher, R. Kasturi. A Robust Algorithm
    for Text String Separation from Mixed
    Text/Graphics Images. Transactions on Pattern
    Analysis and Machine Intelligence (PAMI), vol
    (10), pp. 910-918, 1988.
  3. S.D. Zenzo, M.D. Buno, M. Meucci, A. Spirito.
    Optical recognition of hand-printed characters of
    any size, position, and orientation. IBM Journal
    of Research and Development, vol (36), pp.
    487-501, 1992.
  4. H. Goto, H. Aso. Extracting curved text lines
    using local linearity of the text line.
    International Journal on Document Analysis and
    Recognition (IJDAR), vol (2), pp. 111-119, 1999.
  5. S. Adam, J.M. Ogier, C. Cariou, R. Mullot, J.
    Labiche, J. Gardes. Symbol and Character
    Recognition: Application to Engineering
    Drawings. International Journal on Document
    Analysis and Recognition (IJDAR), vol (3), pp.
    89-101, 2000.
  6. T. Kanungo, G.A. Marton, O. Bulbu. Performance
    evaluation of two Arabic OCR products. Workshop
    on Advances in Computer-Assisted Recognition
    (AIPR), SPIE Proceedings, vol (3584), pp. 76-83,
    1999.
  7. S.V. Rice, J. Kanai, T.A. Nartker. A Report on
    the Accuracy of OCR Devices. Information Science
    Research Institute, University of Nevada, USA,
    1992.
  8. S.V. Rice, J. Kanai, T.A. Nartker. An Evaluation
    of OCR Accuracy. Information Science Research
    Institute, University of Nevada, USA, 1993.
  9. L. Wenyin, D. Dori. A Protocol for Performance
    Evaluation of Line Detection Algorithms. Machine
    Vision and Applications, vol (9), pp. 240-250,
    1997.
  10. R.M. Brown. Handprinted Symbol Recognition
    System: A Very High Performance Approach To
    Pattern Analysis Of Free-form Symbols. Conference
    Southeastcon, pp. 5-8, 1981.
  11. H. Takahashi. Neural network architectures for
    rotated character recognition. International
    Conference on Pattern Recognition (ICPR), pp.
    623-626, 1992.
  12. Q. Chen. Evaluation of OCR algorithms for images
    with different spatial resolutions and noises.
    School of Information Technology and Engineering,
    University of Ottawa, Canada, 2003.
  13. C. Choisy, H. Cecotti, A. Belaid. Character
    Rotation Absorption Using a Dynamic Neural
    Network Topology: Comparison With Invariant
    Features. International Conference on Enterprise
    Information Systems (ICEIS), pp. 90-97, 2004.

15
References (2/2)
  1. H. Hase, T. Shinokawa, S. Tokai, C.Y. Suen. A
    robust method of recognizing multi-font rotated
    characters. International Conference on Pattern
    Recognition (ICPR), vol (2), pp. 363-366,
    2004.
  2. U. Pal, F. Kimura, K. Roy, T. Pal. Recognition
    of English Multi-oriented Characters.
    International Conference on Pattern Recognition
    (ICPR), vol (2), pp. 873-876, 2006.
  3. P.P. Roy, U. Pal, J. Llados. Multi-oriented
    character recognition from graphical documents.
    International Conference on Cognition and
    Recognition (ICCR), pp. 30-35, 2008.
  4. U. Pal, P.P. Roy. Multi-oriented and curved
    text lines extraction from Indian documents. IEEE
    Transactions on Systems, Man and Cybernetics,
    Part B, vol (34), pp. 1676-1684, 2004.
  5. P.K. Loo, C.L. Tan. Word and Sentence
    Extraction Using Irregular Pyramid. Workshop on
    Document Analysis System (DAS), Lecture Notes in
    Computer Science (LNCS), vol (2423), pp. 307-318,
    2002.
  6. H.C. Park, S.Y. Ok, Y.J. Yu, H.G. Cho. Word
    Extraction in Text/Graphic Mixed Image Using
    3-Dimensional Graph Model. International Journal
    on Document Analysis and Recognition (IJDAR), vol
    (4), pp. 115-130, 2001.
  7. C.L. Tan, P.O. Ng. Text extraction using
    pyramid. Pattern Recognition (PR), vol (31), pp.
    63-72, 1998.
  8. S. He, N. Abe, C.L. Tan. A clustering-based
    approach to the separation of text strings from
    mixed text/graphics documents. International
    Conference on Pattern Recognition (ICPR), pp.
    706-710, 1996.
  9. M. Burge, G. Monagan. Extracting Words and Multi
    Part Symbols in Graphics Rich Documents.
    International Conference on Image Analysis and
    Processing (ICIAP), 1995.
  10. M. Deseilligny, H. Le Men, G. Stamon. Characters
    string recognition on maps, a method for high
    level reconstruction. International Conference on
    Document Analysis and Recognition (ICDAR), pp.
    249-252, 1995.
  11. E. Valveny, S. Tabbone, O. Ramos, E. Philippot.
    Performance Characterization of Shape Descriptors
    for Symbol Representation. Workshop on Graphics
    Recognition (GREC), 2007.
  12. M. Delalandre, T. Pridmore, E. Valveny, E. Trupin,
    H. Locteau. Building Synthetic Graphical
    Documents for Performance Evaluation. Workshop on
    Graphics Recognition (GREC), Lecture Notes in
    Computer Science (LNCS), vol (5046), pp. 288-298,
    2008.