Title: Generation of Synthetic Datasets for Performance Evaluation of Text/Graphics Document OCR
- Mathieu Delalandre
- CVC, Barcelona, Spain
- DAG Meeting
- CVC, Barcelona, Spain
- Wednesday 19th of November 2008
2 Introduction
Text/graphics documents are used in a variety of fields such as geography, engineering and the social sciences.
Some examples are: [example images not reproduced]
A huge amount of data exists, from two main sources: [figure not reproduced]
3 Introduction
- OCR of text/graphics documents
Character recognition system working with text/graphics documents.
First related work: Brown1979.
More than 50 references on this topic today: Fletcher1988, Zenzo1992, Goto1999, Adam2000.
Problematics:
- letter segmentation
- multi-font recognition
- scale variation
- text/graphics separation
- rotation variation
- text-line detection
- no reading order
- no dictionary
Processing chain (diagram):
Text/graphics separation -> full image of text-lines -> Text-line detection -> images of single text-line -> Character segmentation -> images of single character -> Character recognition -> ASCII
Diagram annotations: "general to any documents", "specific to text/graphics documents"
4 Introduction
- About performance evaluation
The case of general OCR (Kanungo1999):
- More than 40 references on the topic (Kanungo1999)
- Several standard databases exist (NIST, MARS, CD-ROM English, ...)
- Annual evaluation reports (Rice1992, Rice1993)
- Black-box evaluation: the evaluation considers the OCR system as an indivisible unit and evaluates it from its final results (i.e. OCR output vs. ASCII transcription of the text, using string edit distances).
- White-box evaluation: the evaluation aims to characterize the performance of individual sub-modules of the OCR system (skewing, letter segmentation, block identification, character recognition, etc.).
The case of text/graphics document OCR (Wenyin1997):
- Only 1 reference on the topic
- No standard databases
- No complete evaluation done through 20 years of research
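As an illustration of the black-box evaluation mentioned above (OCR output vs. ASCII transcription using string edit distances), here is a minimal Python sketch. It uses the classic Levenshtein distance; the `character_accuracy` score and its normalization are assumptions for illustration, not a metric taken from the talk or from the cited evaluation reports.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions
    and substitutions needed to turn `reference` into `hypothesis`."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete a reference character
                current[j - 1] + 1,      # insert a hypothesis character
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Hypothetical score: 1 - distance / reference length, floored at 0."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    distance = edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - distance / len(reference))


if __name__ == "__main__":
    truth = "Hello World"
    ocr_output = "He1lo Wor1d"  # two substitution errors
    print(edit_distance(truth, ocr_output))
    print(character_accuracy(truth, ocr_output))
```

A white-box evaluation would instead score each stage (text-line detection, character segmentation, character recognition) against its own groundtruth, which is why the datasets in this talk are defined per stage.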
5 Introduction
- Scope of the proposed work
Processing chain (diagram): Text/graphics separation -> Text-line detection -> Character segmentation -> Character recognition
Evaluation steps (diagram): Groundtruthing, Characterization
Performance evaluation of text/graphics document OCR:
- white-box evaluation
- groundtruthing step
- datasets for text-line detection and character recognition
- the generation algorithms are simple; the main purpose of the talk concerns the setting
- contributions
6 Plan
- Groundtruth definition
- Datasets for character recognition
- Datasets for text-line detection
- In progress datasets
7 Groundtruth definition
- Character level
  - ASCII code
  - font (name, size, style)
  - location point
  - oriented bounding box
  - orientation (?)
  - scale (?)
- Text level
  - first location point
  - groundtruth of characters
  - characters/word positions
Example of characters/word positions for the text "Hello World":
char    H e l l o W o r l d
p-word  0 0 0 0 0 1 1 1 1 1
p-char  0 1 2 3 4 0 1 2 3 4
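The groundtruth fields listed above could be held in a structure like the following. This is a minimal sketch: the class names, field names and the `oriented_box` helper are hypothetical, chosen only to mirror the slide. The `word_positions` function reproduces the p-word / p-char table for "Hello World".

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class CharGroundtruth:
    """Character-level groundtruth (field names are assumptions)."""
    ascii_code: int
    font_name: str
    font_size: int
    font_style: str
    location: Point
    bounding_box: List[Point]   # four corners of the oriented box
    orientation: float = 0.0    # radians; marked "(?)" on the slide
    scale: float = 1.0          # marked "(?)" on the slide


@dataclass
class TextGroundtruth:
    """Text-level groundtruth: first location point, characters, positions."""
    first_location: Point
    characters: List[CharGroundtruth] = field(default_factory=list)
    p_word: List[int] = field(default_factory=list)
    p_char: List[int] = field(default_factory=list)


def word_positions(text: str) -> Tuple[List[str], List[int], List[int]]:
    """Return characters with their word index and in-word index (spaces dropped)."""
    chars, p_word, p_char = [], [], []
    for word_index, word in enumerate(text.split()):
        for char_index, char in enumerate(word):
            chars.append(char)
            p_word.append(word_index)
            p_char.append(char_index)
    return chars, p_word, p_char


def oriented_box(center: Point, width: float, height: float, angle: float) -> List[Point]:
    """Corners of a width x height box rotated by `angle` radians about its center."""
    cx, cy = center
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    corners = []
    for dx, dy in ((-width / 2, -height / 2), (width / 2, -height / 2),
                   (width / 2, height / 2), (-width / 2, height / 2)):
        corners.append((cx + dx * cos_a - dy * sin_a, cy + dx * sin_a + dy * cos_a))
    return corners


if __name__ == "__main__":
    chars, p_word, p_char = word_positions("Hello World")
    print(" ".join(chars))
    print(p_word)
    print(p_char)
```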
8 Datasets for character recognition (1/2)
reference      image size  class  size    learning  font(s)  rotation  scaling
Brown1981      68²         ??/10  20 000  -         -        yes       yes
Zenzo1992      ??          ??/62  72 000  -         -        yes       yes
Takahashi1992  24²         ??/10  6 400   50        -        yes       yes
Adam2000       28²         51/62  15 000  33        -        yes       yes
Chen2003       16²-512²    26/26  1 000   14        1        no        yes
Choisy2004     28²         51/62  15 000  80        -        yes       yes
Hase2004       32²         ??/26  3 000   33        3        yes       no
Pal2006        13²-34²     40/62  18 000  80        2        yes       yes
Roy2008        13²-74²     40/62  8 000   80        many     yes       yes
notes          (1)         (2)    (3)     (3)       (4)      (5)       (5)
How to generate single character images?
Which number of classes? Which image resolution? Which size for the datasets? Which fonts? Etc.
- (1) The real sizes of characters can only be estimated.
- (2) The confusion problem (e.g. 6 vs 9) is still not well defined; the 62-class problem (a-z, A-Z, 0-9) is the main goal.
- (3) It is not possible to fix a standard size for the training/test sets, although this information is well defined; several thousand images are mandatory for training.
- (4) The impact of fonts has been little studied and should be taken into account in the evaluation.
- (5) Invariance to rotation and scaling is the final goal; the two are rarely studied independently.
9 Datasets for character recognition (2/2)
Settings
letter class        62 (a-z, A-Z, 0-9)
font class          30 fonts from http://www.codestyle.org/, with lower and upper case, no cursive
basic fonts         3 (Times, Courier, Arial)
character size      32² pixels, max dxdy of font symbols
dataset size        5 000 / font, 62 classes, 40 samples/class, 50/50
training            free; ranked files allow a training specification (20% training on file-4001 to file-5000)
character scaling   1.0 to 2.0 with a gap of 1/1000
character rotation  0 to 2π with a gap of π/500

Geometry invariance
tests  scaling  rotation  font(s)/test  fonts  images
3      no       no        1             3      15 000
3      yes      no        1             3      15 000
3      no       yes       1             3      15 000
3      yes      yes       1             3      15 000

Font adequacy
tests  scaling  rotation  font(s)/test  fonts  images
30     yes      yes       1             30     150 000

Font scalability
tests  scaling  rotation  font(s)/test  fonts  images
4      yes      yes       3, 6, 9, 12   12     150 000 (15 000 + 30 000 + 45 000 + 60 000)

- Generation algorithm
  - font manager, centering, scale and rotation processes
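The settings above (62 classes, 5 000 images per font, scaling from 1.0 to 2.0 in steps of 1/1000, rotation from 0 to 2π in steps of π/500, training on file-4001 to file-5000) can be sketched as a parameter sampler. This is only an illustration under assumptions: the function names, the round-robin class assignment and the uniform random sampling are hypothetical, and the actual rendering (font manager, centering) is left out.

```python
import math
import random
import string

# 62 letter classes: a-z, A-Z, 0-9
CLASSES = string.ascii_lowercase + string.ascii_uppercase + string.digits
IMAGES_PER_FONT = 5000


def sample_character_specs(font, count=IMAGES_PER_FONT, scaling=True,
                           rotation=True, seed=0):
    """Draw one (class, scale, rotation) specification per ranked file."""
    rng = random.Random(seed)
    specs = []
    for rank in range(1, count + 1):
        specs.append({
            "rank": rank,  # ranked files: file-1 ... file-5000
            "font": font,
            # Assumption: classes are assigned round-robin to keep them balanced.
            "char": CLASSES[(rank - 1) % len(CLASSES)],
            # Scale grid: 1.0, 1.001, ..., 2.0 (gap of 1/1000)
            "scale": 1.0 + rng.randint(0, 1000) / 1000 if scaling else 1.0,
            # Rotation grid: 0, pi/500, ..., 999*pi/500 (gap of pi/500, 2*pi excluded)
            "rotation": rng.randrange(1000) * math.pi / 500 if rotation else 0.0,
        })
    return specs


def split_ranked(specs, first_training=4001, last_training=5000):
    """Training on the given rank range; the remaining files form the test set."""
    training = [s for s in specs if first_training <= s["rank"] <= last_training]
    test = [s for s in specs if not first_training <= s["rank"] <= last_training]
    return training, test


if __name__ == "__main__":
    specs = sample_character_specs("Arial")
    training, test = split_ranked(specs)
    print(len(CLASSES), len(specs), len(training), len(test))
    # Font scalability: images per test for 3, 6, 9 and 12 fonts
    print([fonts * IMAGES_PER_FONT for fonts in (3, 6, 9, 12)])
```

The image counts in the test tables follow directly from 5 000 images per font: 3 fonts give 15 000 images, 30 fonts give 150 000, and the four scalability tests sum to 150 000.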
10 Datasets for text-line detection (1/2)
reference        use-case           images  text-lines  curved  font/img  scaling
Roy2008          geographic map     ??      5 000       yes     many      yes
Pal2004          artistic document  ??      1 521       yes     many      yes
Loo2002          poster, newspaper  2       118         yes     many      yes
Park2001         poster, publicity  30      1 265       yes     many      yes
Goto1999         Japanese form      170     9 831       yes     many      yes
Tan1998          map                8       96          no      many      yes
He1996           drawing            1       16          no      many      yes
Burge1995        cadastral map      4       150         no      many      yes
Deseilligny1995  cadastral map      3       1 250       no      many      yes
notes            (1)                (1)     (1)         (2)     (3)       (3)
How to generate text-line images?
Which number of words per image? Which image size? Which size for the datasets? Which number of fonts? Etc.
- (1) The use-cases are heterogeneous and the sizes and resolutions of images are rarely provided, so the text density is difficult to estimate; images with significant text content are preferred.
- (2) Depending on the use-case, not all the methods work on curved text; a combination of curved and straight text is necessary.
- (3) All the methods use context to extract the text-line (i.e. font type, character size, line model). The size of characters can change a lot; the number of fonts is generally small (fewer than ten).
11 Datasets for text-line detection (2/2)
Settings
dictionary      422 text-lines, countries and capitals
font class      30 fonts from http://www.codestyle.org/, with lower and upper case, no cursive
character size  32² pixels, max dxdy of font symbols
image size      640² pixels, 10-50 text-lines per image
dataset size    100 images
text scaling    1.0 to 1.5 with a gap of 1/1000
text rotation   -π/2 to π/2 with a gap of π/500

Text-line density
test  text-line/img  scaling  curved  font(s)/test  words
1     low            yes      no      3             in progress
1     medium         yes      no      3             in progress
1     high           yes      no      3             in progress

Font context
test  text-line/img  scaling  curved  font(s)/test  words
1     medium         no       no      9             in progress
1     medium         no       no      6             in progress
1     medium         no       no      3             in progress
1     medium         no       no      1             in progress

Size context
test  text-line/img  scaling  curved  font(s)/test  words
1     medium         no       no      1             in progress
1     medium         yes      no      1             in progress
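The text-line settings above (640² images, 10-50 text-lines per image, scaling from 1.0 to 1.5 in steps of 1/1000, rotation from -π/2 to π/2 in steps of π/500) could be sampled as sketched below. The numeric ranges chosen for the "low", "medium" and "high" densities are hypothetical, since the talk only gives the overall 10-50 range, and the function name and record layout are likewise assumptions. Overlap between text-lines is not handled here.

```python
import math
import random

IMAGE_SIZE = 640  # images are 640 x 640 pixels

# Hypothetical split of the 10-50 text-lines per image into three densities.
DENSITY_RANGES = {"low": (10, 20), "medium": (20, 35), "high": (35, 50)}


def sample_text_lines(dictionary, fonts, density="medium", scaling=True,
                      rotation=True, seed=0):
    """Draw the text-line specifications for one synthetic image."""
    rng = random.Random(seed)
    low, high = DENSITY_RANGES[density]
    lines = []
    for _ in range(rng.randint(low, high)):
        lines.append({
            "text": rng.choice(dictionary),
            "font": rng.choice(fonts),
            # Scale grid: 1.0, 1.001, ..., 1.5 (gap of 1/1000)
            "scale": 1.0 + rng.randint(0, 500) / 1000 if scaling else 1.0,
            # Rotation grid: -pi/2, ..., pi/2 (gap of pi/500)
            "rotation": rng.randint(-250, 250) * math.pi / 500 if rotation else 0.0,
            # First location point of the text-line, drawn uniformly in the image
            "location": (rng.randrange(IMAGE_SIZE), rng.randrange(IMAGE_SIZE)),
        })
    return lines


if __name__ == "__main__":
    # Toy dictionary; the real one holds 422 text-lines (countries and capitals).
    dictionary = ["Spain", "Madrid", "France", "Paris"]
    fonts = ["Times", "Courier", "Arial"]
    for line in sample_text_lines(dictionary, fonts, density="low")[:3]:
        print(line)
```

A font-context test would restrict `fonts` to 9, 6, 3 or 1 entries, and a size-context test would toggle `scaling`, matching the rows of the tables above.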
12 In progress datasets
13 Conclusions
- Conclusions
  - work in progress
  - character recognition datasets are ready
  - bags of words are still being packaged, but will be ready soon
- Perspectives
  - medium term: experiments with standard feature extraction methods (Roy2008, Valveny2007)
  - long term: experiments with bags of words and text/graphics documents (Delalandre2007, Wenyin1997)
14 References (1/2)
- R. Brown, M. Lybanon, L.K. Gronmeyer. Recognition of Handprinted Characters for Automated Cartography: A Progress Report. Proceedings of the SPIE, vol (205), 1979.
- L.A. Fletcher, R. Kasturi. A Robust Algorithm for Text String Separation from Mixed Text/Graphics Images. Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol (10), pp. 910-918, 1988.
- S.D. Zenzo, M.D. Buno, M. Meucci, A. Spirito. Optical recognition of hand-printed characters of any size, position, and orientation. IBM Journal of Research and Development, vol (36), pp. 487-501, 1992.
- H. Goto, H. Aso. Extracting curved text lines using local linearity of the text line. International Journal on Document Analysis and Recognition (IJDAR), vol (2), pp. 111-119, 1999.
- S. Adam, J.M. Ogier, C. Cariou, R. Mullot, J. Labiche, J. Gardes. Symbol and Character Recognition: Application to Engineering Drawings. International Journal on Document Analysis and Recognition (IJDAR), vol (3), pp. 89-101, 2000.
- T. Kanungo, G.A. Marton, O. Bulbu. Performance evaluation of two Arabic OCR products. Workshop on Advances in Computer-Assisted Recognition (AIPR), SPIE Proceedings, vol (3584), pp. 76-83, 1999.
- S.V. Rice, J. Kanai, T.A. Nartker. A Report on the Accuracy of OCR Devices. Information Science Research Institute, University of Nevada, USA, 1992.
- S.V. Rice, J. Kanai, T.A. Nartker. An Evaluation of OCR Accuracy. Information Science Research Institute, University of Nevada, USA, 1993.
- L. Wenyin, D. Dori. A Protocol for Performance Evaluation of Line Detection Algorithms. Machine Vision and Applications, vol (9), pp. 240-250, 1997.
- R.M. Brown. Handprinted Symbol Recognition System: A Very High Performance Approach To Pattern Analysis Of Free-form Symbols. Conference Southeastcon, pp. 5-8, 1981.
- H. Takahashi. Neural network architectures for rotated character recognition. International Conference on Pattern Recognition (ICPR), pp. 623-626, 1992.
- Q. Chen. Evaluation of OCR algorithms for images with different spatial resolutions and noises. School of Information Technology and Engineering, University of Ottawa, Canada, 2003.
- C. Choisy, H. Cecotti, A. Belaid. Character Rotation Absorption Using a Dynamic Neural Network Topology: Comparison With Invariant Features. International Conference on Enterprise Information Systems (ICEIS), pp. 90-97, 2004.
15 References (2/2)
- H. Hase, T. Shinokawa, S. Tokai, C.Y. Suen. A robust method of recognizing multi-font rotated characters. International Conference on Pattern Recognition (ICPR), vol (2), pp. 363-366, 2004.
- U. Pal, F. Kimura, K. Roy, T. Pal. Recognition of English Multi-oriented Characters. International Conference on Pattern Recognition (ICPR), vol (2), pp. 873-876, 2006.
- P.P. Roy, U. Pal, J. Llados. Multi-oriented character recognition from graphical documents. International Conference on Cognition and Recognition (ICCR), pp. 30-35, 2008.
- U. Pal, P.P. Roy. Multi-oriented and curved text lines extraction from Indian documents. IEEE Transactions on Systems, Man and Cybernetics - Part B, vol (34), pp. 1676-1684, 2004.
- P.K. Loo, C.L. Tan. Word and Sentence Extraction Using Irregular Pyramid. Workshop on Document Analysis System (DAS), Lecture Notes in Computer Science (LNCS), vol (2423), pp. 307-318, 2002.
- H.C. Park, S.Y. Ok, Y.J. Yu, H.G. Cho. Word Extraction in Text/Graphic Mixed Image Using 3-Dimensional Graph Model. International Journal on Document Analysis and Recognition (IJDAR), vol (4), pp. 115-130, 2001.
- C.L. Tan, P.O. Ng. Text extraction using pyramid. Pattern Recognition (PR), vol (31), pp. 63-72, 1998.
- S. He, N. Abe, C.L. Tan. A clustering-based approach to the separation of text strings from mixed text/graphics documents. International Conference on Pattern Recognition (ICPR), pp. 706-710, 1996.
- M. Burge, G. Monagan. Extracting Words and Multi Part Symbols in Graphics Rich Documents. International Conference on Image Analysis and Processing (ICIAP), 1995.
- M. Deseilligny, H. Le Men, G. Stamon. Characters string recognition on maps, a method for high level reconstruction. International Conference on Document Analysis and Recognition (ICDAR), pp. 249-252, 1995.
- E. Valveny, S. Tabbone, O. Ramos, E. Philippot. Performance Characterization of Shape Descriptors for Symbol Representation. Workshop on Graphics Recognition (GREC), 2007.
- M. Delalandre, T. Pridmore, E. Valveny, E. Trupin, H. Locteau. Building Synthetic Graphical Documents for Performance Evaluation. Workshop on Graphics Recognition (GREC), Lecture Notes in Computer Science (LNCS), vol (5046), pp. 288-298, 2008.