Generation of Synthetic Datasets for Performance Evaluation of Text/Graphics Document OCR

M. Delalandre. Generation of Synthetic Datasets for Performance Evaluation of Text/Graphics Document OCR. DAG Meeting, Barcelona, Spain, 19th of November 2008.

Slides: 16

Provided by: mathieu.delalandre

1
Generation of Synthetic Datasets for Performance
Evaluation of Text/Graphics Document OCR

Mathieu Delalandre
CVC, Barcelona, Spain
DAG Meeting
CVC, Barcelona, Spain
Wednesday 19th of November 2008

2
Introduction

Text/graphics documents

Text/graphics documents are used in a variety of
fields such as geography, engineering and social
sciences.
Some examples are: [figure: example documents]
A huge amount of data exists, from two main sources: [figure: the two main sources]
3
Introduction

OCR of text/graphics documents

Character recognition systems working with
text/graphics documents
First related work: Brown1979
More than 50 references on this topic today: Fletcher1988, Zenzo1992, Goto1999, Adam2000

Problematics
- letter segmentation
- multi-font recognition
- scale variation
- text/graphics separation
- rotation variation
- text-line detection
- no reading order
- no dictionary

Processing chain (stage: output)
Text/graphics separation: full image of text-lines
Text-line detection: images of single text-line
Character segmentation: images of single character
Character recognition: ASCII
Diagram labels: general to any documents, specific to text/graphics documents
4
Introduction

About performance evaluation

The case of general OCR: Kanungo1999
- More than 40 references on the topic: Kanungo1999
- Several standard databases exist (NIST, MARS, CD-ROM English, etc.)
- Annual evaluation reports: Rice1992, Rice1993
- Black-box evaluation: the evaluation considers the OCR system as an
  indivisible unit and evaluates it from its final results (i.e. OCR
  output vs. ASCII transcription of the text, using string edit distances).
- White-box evaluation: the evaluation aims to characterize the
  performance of individual sub-modules of the OCR system (skewing,
  letter segmentation, block identification, character recognition, etc.).

The case of text/graphics document OCR: Wenyin1997
- Only 1 reference on the topic
- No standard databases
- No complete evaluation done through 20 years of research
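
The black-box comparison of OCR output against an ASCII transcription can be illustrated with a short sketch. It uses the Levenshtein distance as one possible string edit distance; the slide does not say which distance or which normalization the cited evaluations use, so the function names and the accuracy formula below are assumptions made for illustration only.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum number of single-character insertions,
    deletions and substitutions that turn `reference` into `hypothesis`."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete ref_char
                current[j - 1] + 1,      # insert hyp_char
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Hypothetical score: 1 - distance / len(reference), clamped at 0."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    distance = edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - distance / len(reference))


if __name__ == "__main__":
    truth = "Hello World"
    ocr_output = "He1lo Wor1d"  # two 'l' characters misread as '1'
    print(edit_distance(truth, ocr_output))
    print(character_accuracy(truth, ocr_output))
```

A white-box evaluation would instead score each sub-module against its own groundtruth, which is what the datasets in the rest of the talk are meant to support.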
5
Introduction

Scope of the proposed work

Processing chain: text/graphics separation, text-line detection, character segmentation, character recognition
Evaluation steps: groundtruthing, characterization

Contributions
- performance evaluation of text/graphics document OCR: white-box evaluation
- groundtruthing step: datasets for text-line detection and character recognition
- the generation algorithms are simple, so the talk mainly concerns the settings
6
Plan

Groundtruth definition
Datasets for character recognition
Datasets for text-line detection
In progress datasets

7
Groundtruth definition

Character level
ASCII code
font (name, size, style)
location point
orientated bounding box
orientation (?)
scale (?)
Text level
first location point
groundtruth of characters
characters/word positions

char    H  e  l  l  o  W  o  r  l  d
p-word  0  0  0  0  0  1  1  1  1  1
p-char  0  1  2  3  4  0  1  2  3  4
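
The characters/word positions in the table above can be derived mechanically from the text of a line. The following is a minimal sketch, assuming words are separated by whitespace and that whitespace itself carries no groundtruth entry; the function name is hypothetical.

```python
def word_char_positions(text: str) -> list[tuple[str, int, int]]:
    """Return one (char, p-word, p-char) triple per non-space character.

    p-word is the index of the word inside the text-line,
    p-char is the index of the character inside its word.
    """
    rows = []
    for word_index, word in enumerate(text.split()):
        for char_index, char in enumerate(word):
            rows.append((char, word_index, char_index))
    return rows


if __name__ == "__main__":
    for char, p_word, p_char in word_char_positions("Hello World"):
        print(char, p_word, p_char)
```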
8
Datasets for character recognition (1/2)

Problematics

Published experiments

reference      image size  class  size    learning  font(s)  rotation  scaling
Brown1981      68²         ??/10  20 000                     yes       yes
Zenzo1992      ??          ??/62  72 000                     yes       yes
Takahashi1992  24²         ??/10  6 400   50                 yes       yes
Adam2000       28²         51/62  15 000  33                 yes       yes
Chen2003       16²-512²    26/26  1 000   14        1        no        yes
Choisy2004     28²         51/62  15 000  80                 yes       yes
Hase2004       32²         ??/26  3 000   33        3        yes       no
Pal2006        13²-34²     40/62  18 000  80        2        yes       yes
Roy2008        13²-74²     40/62  8 000   80        many     yes       yes
               (1)         (2)    (3)     (3)       (4)      (5)       (5)

How to generate single character images?
Which number of classes? Which image resolution?
Which size for the datasets? Which fonts? Etc.

Main conclusions

The real sizes of characters can only be
estimated.
The confusion problem (e.g. 6 vs 9) is still not
well defined; the 62-class problem (a-z, A-Z, 0-9)
is the main goal.
It is not possible to fix a standard size for the
training/test sets, as this information is still
not well defined; several thousand images are
mandatory for the training.
The impact of fonts is little studied and should be
taken into account in the evaluation.
Invariance to rotation and scaling is the final
goal, but the two are rarely studied independently.

9
Datasets for character recognition (2/2)

Generation setting

letter class        62 (a-z, A-Z, 0-9)
font class          30 fonts from http://www.codestyle.org/ with lower and upper case, no cursive
basic fonts         3 (times, courier, arial)
character size      32² pixels, max dxdy of font symbols
dataset size        5 000 / font, 62 classes, 40 samples/class, 50/50
training            free, ranked files allow a training specification (20% training on file-4001 to file-5000)
character scaling   1.0 to 2.0 with a gap of 1/1000
character rotation  0 to 2π with a gap of π/500

Generation algorithm
font manager, centering, scale and
rotation processes

Datasets

Geometry invariance
tests  scaling  rotation  font(s)/test  fonts  images
3      no       no        1             3      15 000
3      yes      no        1             3      15 000
3      no       yes       1             3      15 000
3      yes      yes       1             3      15 000

Font adequacy
tests  scaling  rotation  font(s)/test  fonts  images
30     yes      yes       1             30     150 000

Font scalability
tests  scaling  rotation  font(s)/test  fonts  images
4      yes      yes       3, 6, 9, 12   12     150 000 (15 000, 30 000, 45 000, 60 000)
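
The settings above can be read as discrete parameter grids from which each synthetic character is drawn. The sketch below samples a letter class, a scale and a rotation from those grids and checks the image counts of the dataset tables. It is an illustration under assumptions, not the generator used in the talk: uniform sampling, the exclusion of 2π (which equals 0) and all function names are hypothetical, and the font manager and rendering steps are left out.

```python
import math
import random
import string

# The 62-class problem: a-z, A-Z, 0-9.
LETTER_CLASSES = string.ascii_lowercase + string.ascii_uppercase + string.digits

SCALE_STEPS = 1000      # scaling from 1.0 to 2.0 with a gap of 1/1000
ROTATION_STEPS = 1000   # rotation from 0 to 2*pi with a gap of pi/500
IMAGES_PER_FONT = 5000  # dataset size per font


def scale_value(step: int) -> float:
    """Scale factor for a grid step in [0, SCALE_STEPS]."""
    return 1.0 + step / SCALE_STEPS


def rotation_value(step: int) -> float:
    """Rotation angle in radians for a grid step."""
    return step * math.pi / 500


def sample_character(rng: random.Random, scaling: bool = True,
                     rotation: bool = True) -> dict:
    """Draw the parameters of one synthetic character image.

    Assumption: uniform sampling on the grids; the angle 2*pi is
    excluded because it duplicates 0.
    """
    scale_step = rng.randint(0, SCALE_STEPS) if scaling else 0
    rotation_step = rng.randrange(ROTATION_STEPS) if rotation else 0
    return {
        "letter": rng.choice(LETTER_CLASSES),
        "scale": scale_value(scale_step),
        "rotation": rotation_value(rotation_step),
    }


def images_per_test(fonts_per_test: int) -> int:
    """Number of images in a test that mixes `fonts_per_test` fonts."""
    return fonts_per_test * IMAGES_PER_FONT


if __name__ == "__main__":
    rng = random.Random(2008)
    print(sample_character(rng))
    # Font scalability: tests with 3, 6, 9 and 12 fonts.
    print([images_per_test(n) for n in (3, 6, 9, 12)])
```

With 5 000 images per font, the four font scalability tests add up to the 150 000 images listed in the table.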
10
Datasets for text-line detection (1/2)

Problematics

reference        use-case           images  text-lines  curved  font/img  scaling
Roy2008          geographic map     ??      5 000       yes     many      yes
Pal2004          artistic document  ??      1 521       yes     many      yes
Loo2002          poster, newspaper  2       118         yes     many      yes
Park2001         poster, publicity  30      1 265       yes     many      yes
Goto1999         Japanese form      170     9 831       yes     many      yes
Tan1998          map                8       96          no      many      yes
He1996           drawing            1       16          no      many      yes
Burge1995        cadastral map      4       150         no      many      yes
Deseilligny1995  cadastral map      3       1 250       no      many      yes
                 (1)                (1)     (1)         (2)     (3)       (3)

How to generate single character images?
Which number of words per image? Which image
size? Which size for the datasets?
Which number of fonts? Etc.

Main conclusions

The use-cases are heterogeneous and the sizes and
resolutions of images are rarely provided, so the
text density is difficult to estimate; images
with significant text content are preferred.
Depending on the use-case, not all the methods work
on curved text; a combination of curved and
straight text is necessary.
All the methods use context to extract the
text-line (i.e. font type, character size, line
model). The size of characters can change a
lot, while the number of fonts is generally small
(less than ten).

11
Datasets for text-line detection (2/2)

Setting

dictionary      422 text-lines (countries and capitals)
font class      30 fonts from http://www.codestyle.org/ with lower and upper case, no cursive
character size  32² pixels, max dxdy of font symbols
image size      640², 10-50 text-lines per image
dataset size    100 images
text scaling    1.0 to 1.5 with a gap of 1/1000
text rotation   -π/2 to π/2 with a gap of π/500

Generation algorithm

Datasets

Text-line density
test  text-line/img  scaling  curved  font(s)/test  words
1     low            yes      no      3             in progress
1     medium         yes      no      3             in progress
1     high           yes      no      3             in progress

Font context
test  text-line/img  scaling  curved  font(s)/test  words
1     medium         no       no      9             in progress
1     medium         no       no      6             in progress
1     medium         no       no      3             in progress
1     medium         no       no      1             in progress

Size context
test  text-line/img  scaling  curved  font(s)/test  words
1     medium         no       no      1             in progress
1     medium         yes      no      1             in progress
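
As a rough illustration of these settings, the sketch below draws 10 to 50 text-lines for one image, each with a scale and a rotation taken from the grids above, and computes the corners of an oriented bounding box as listed in the groundtruth definition. Everything here is an assumption for illustration: the sampling is uniform, the box is rotated about its first location point, overlap between text-lines is not checked, and the function names and the three-entry dictionary are hypothetical.

```python
import math
import random


def oriented_bounding_box(x: float, y: float, width: float, height: float,
                          angle: float) -> list[tuple[float, float]]:
    """Corners of a width x height box rotated by `angle` radians
    about its first location point (x, y)."""
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    local_corners = [(0.0, 0.0), (width, 0.0), (width, height), (0.0, height)]
    return [(x + dx * cos_a - dy * sin_a, y + dx * sin_a + dy * cos_a)
            for dx, dy in local_corners]


def sample_text_line(rng: random.Random, dictionary: list[str],
                     scaling: bool = True) -> dict:
    """Draw one text-line: scale in [1.0, 1.5] with a gap of 1/1000,
    rotation in [-pi/2, pi/2] with a gap of pi/500."""
    scale = (1.0 + rng.randint(0, 500) / 1000) if scaling else 1.0
    rotation = rng.randint(-250, 250) * math.pi / 500
    return {"text": rng.choice(dictionary), "scale": scale, "rotation": rotation}


def sample_image(rng: random.Random, dictionary: list[str],
                 min_lines: int = 10, max_lines: int = 50) -> list[dict]:
    """Draw the text-lines of one image (10 to 50 per image)."""
    count = rng.randint(min_lines, max_lines)
    return [sample_text_line(rng, dictionary) for _ in range(count)]


if __name__ == "__main__":
    rng = random.Random(2008)
    # Hypothetical stand-in for the 422-entry dictionary.
    dictionary = ["Spain", "Barcelona", "France"]
    lines = sample_image(rng, dictionary)
    first = lines[0]
    print(len(lines), first)
    print(oriented_bounding_box(100.0, 200.0, 96.0, 32.0, first["rotation"]))
```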
12
In progress datasets
13
Conclusions

Conclusions
- work in progress
- character recognition datasets are ready
- bags of words are still being packaged, but will
  be ready soon
Perspectives
- middle term: experiments with standard
  feature extraction methods Roy2008, Valveny2007
- long term: experiments with bags of
  words and text/graphics documents
  Delalandre2007, Wenyin1997

14
References (1/2)

R. Brown, M. Lybanon, L.K. Gronmeyer. Recognition of Handprinted Characters for Automated Cartography: A Progress Report. Proceedings of the SPIE, vol (205), 1979.
L.A. Fletcher, R. Kasturi. A Robust Algorithm for Text String Separation from Mixed Text/Graphics Images. Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol (10), pp. 910-918, 1988.
S.D. Zenzo, M.D. Buno, M. Meucci, A. Spirito. Optical recognition of hand-printed characters of any size, position, and orientation. IBM Journal of Research and Development, vol (36), pp. 487-501, 1992.
H. Goto, H. Aso. Extracting curved text lines using local linearity of the text line. International Journal on Document Analysis and Recognition (IJDAR), vol (2), pp. 111-119, 1999.
S. Adam, J.M. Ogier, C. Cariou, R. Mullot, J. Labiche, J. Gardes. Symbol and Character Recognition: Application to Engineering Drawings. International Journal on Document Analysis and Recognition (IJDAR), vol (3), pp. 89-101, 2000.
T. Kanungo, G.A. Marton, O. Bulbu. Performance evaluation of two Arabic OCR products. Workshop on Advances in Computer-Assisted Recognition (AIPR), SPIE Proceedings, vol (3584), pp. 76-83, 1999.
S.V. Rice, J. Kanai, T.A. Nartker. A Report on the Accuracy of OCR Devices. Information Science Research Institute, University of Nevada, USA, 1992.
S.V. Rice, J. Kanai, T.A. Nartker. An Evaluation of OCR Accuracy. Information Science Research Institute, University of Nevada, USA, 1993.
L. Wenyin, D. Dori. A Protocol for Performance Evaluation of Line Detection Algorithms. Machine Vision and Applications, vol (9), pp. 240-250, 1997.
R.M. Brown. Handprinted Symbol Recognition System: A Very High Performance Approach To Pattern Analysis Of Free-form Symbols. Conference Southeastcon, pp. 5-8, 1981.
H. Takahashi. Neural network architectures for rotated character recognition. International Conference on Pattern Recognition (ICPR), pp. 623-626, 1992.
Q. Chen. Evaluation of OCR algorithms for images with different spatial resolutions and noises. School of Information Technology and Engineering, University of Ottawa, Canada, 2003.
C. Choisy, H. Cecotti, A. Belaid. Character Rotation Absorption Using a Dynamic Neural Network Topology: Comparison With Invariant Features. International Conference on Enterprise Information Systems (ICEIS), pp. 90-97, 2004.

15
References (2/2)

H. Hase, T. Shinokawa, S. Tokai, C.Y. Suen. A robust method of recognizing multi-font rotated characters. International Conference on Pattern Recognition (ICPR), vol (2), pp. 363-366, 2004.
U. Pal, F. Kimura, K. Roy, T. Pal. Recognition of English Multi-oriented Characters. International Conference on Pattern Recognition (ICPR), vol (2), pp. 873-876, 2006.
P.P. Roy, U. Pal, J. Llados. Multi-oriented character recognition from graphical documents. International Conference on Cognition and Recognition (ICCR), pp. 30-35, 2008.
U. Pal, P.P. Roy. Multi-oriented and curved text lines extraction from Indian documents. IEEE Transactions on Systems, Man and Cybernetics, Part B, vol (34), pp. 1676-1684, 2004.
P.K. Loo, C.L. Tan. Word and Sentence Extraction Using Irregular Pyramid. Workshop on Document Analysis System (DAS), Lecture Notes in Computer Science (LNCS), vol (2423), pp. 307-318, 2002.
H.C. Park, S.Y. Ok, Y.J. Yu, H.G. Cho. Word Extraction in Text/Graphic Mixed Image Using 3-Dimensional Graph Model. International Journal on Document Analysis and Recognition (IJDAR), vol (4), pp. 115-130, 2001.
C.L. Tan, P.O. Ng. Text extraction using pyramid. Pattern Recognition (PR), vol (31), pp. 63-72, 1998.
S. He, N. Abe, C.L. Tan. A clustering-based approach to the separation of text strings from mixed text/graphics documents. International Conference on Pattern Recognition (ICPR), pp. 706-710, 1996.
M. Burge, G. Monagan. Extracting Words and Multi Part Symbols in Graphics Rich Documents. International Conference on Image Analysis and Processing (ICIAP), 1995.
M. Deseilligny, H. Le Men, G. Stamon. Characters string recognition on maps, a method for high level reconstruction. International Conference on Document Analysis and Recognition (ICDAR), pp. 249-252, 1995.
E. Valveny, S. Tabbone, O. Ramos, E. Philippot. Performance Characterization of Shape Descriptors for Symbol Representation. Workshop on Graphics Recognition (GREC), 2007.
M. Delalandre, T. Pridmore, E. Valveny, E. Trupin, H. Locteau. Building Synthetic Graphical Documents for Performance Evaluation. Workshop on Graphics Recognition (GREC), Lecture Notes in Computer Science (LNCS), vol (5046), pp. 288-298, 2008.