Extraction of text data and hyperlink structure from scanned images of mathematical journals

1
Extraction of text data and hyperlink structure
from scanned images of mathematical journals

Ann Arbor, March 19, 2002
Masakazu Suzuki
(Kyushu University)

2
Outline of the talk

1. Motivation of our project INFTY.
2. What are the goals?
3. What are the difficulties in mathematical
document recognition?
4. Present state of our system, with demo.
5. Work flow of retrodigitization
6. Alpha-Test Home Page
7. Conclusion.

3
1. INFTY
INFTY is an OCR system (document reader)
- for mathematical documents,
- developed in my laboratory at Kyushu University,
- in cooperation with the OCR section of Toshiba
Corporation e-Solution Company, especially with
the development team of the Toshiba document
reader called ExpressReader Pro.
4
1. INFTY

Recognition of scanned page images of (English /
Japanese) mathematical documents
Intuitive and easy user interface to correct the
recognition results
Output of the recognition results in XML, MathML,
LaTeX, and Braille codes

5
1. INFTY

Clearly printed documents
400-600 DPI

Recognition of scanned page images of (English /
Japanese) mathematical documents
Intuitive and easy user interface to correct the
recognition results
Output of the recognition results in XML, MathML,
LaTeX, and Braille codes

6
1. Motivation

Help visually impaired students / people to study
/ work in scientific fields
Retro-digitization of mathematical journals to
include them in searchable digital libraries.

7
2. Goal

Text data with coordinates → Title, Author
info., References, Keywords,
Hyperlink structure.
Full recognition including mathematical
expressions and logical structure of the
document → Reproduction of Contents, Automatic
translation, Verification

8
3. Case of Mathematical Journals

After 1960
Good quality in printing and paper
1940-1960
Low-quality paper → noise
18th century, 19th century, beginning of the 20th century
1. Sometimes stained yellow → noise
2. Use of fonts (beautiful fonts) different
from recent ones

9
3. What is difficult?

1. Noise reduction.
2. Character and symbol recognition.
3. Layout analysis
   1. Block segmentation
   2. Line segmentation
   3. Segmentation of text / math areas
4. Structure analysis of mathematical
expressions.
5. Logical structure analysis.

10
3. Recognition Process Flow

Skew correction and Noise reduction
Layout analysis (Block segmentation),
Segmentation of text area into lines,
Character recognition in text area
Segmentation of text/math areas,
Character and symbol recognition in math. area,
Structure analysis of math. expressions,
Correction of text/math segmentation,
Output.
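
As an illustration of the first step only, here is a minimal noise-reduction sketch in Python. It assumes the page has already been binarized into a list of rows of 0/1 pixels and removes small connected components (specks). The slides do not say how INFTY reduces noise, so the function name `remove_small_components` and the `min_size` threshold are hypothetical, and the technique is a common one swapped in for illustration.

```python
from collections import deque


def remove_small_components(image, min_size):
    """Return a copy of a binary image (list of lists of 0/1) in which every
    4-connected group of 1-pixels with fewer than min_size pixels is erased."""
    height = len(image)
    width = len(image[0]) if height else 0
    result = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first search collects one connected component.
            component = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and image[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Components below the (hypothetical) threshold count as noise.
            if len(component) < min_size:
                for cy, cx in component:
                    result[cy][cx] = 0
    return result
```

In mathematical documents the threshold has to be chosen with care, because dots, primes and accents are small components that carry meaning.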

11
4. Character Recognition

1. Sample image database
of special symbols.
2. Touching characters and broken characters
in mathematical expressions.

12
4. Character Recognition

1. Sample image database
of special symbols.
2. Touching characters and broken characters
in mathematical expressions.

Collecting a large number of sample images of
mathematical symbols is very hard work.
13
4. Character Recognition

Currently, INFTY recognizes, in addition to
alphanumeric characters and Greek characters,
about 250 kinds of other mathematical symbols.
It distinguishes italic from upright
alphanumeric characters well.
However, distinguishing boldface from
normal font is left to future research.

14
4. Character Recognition

1. Sample image database
of special symbols.
2. Touching characters and broken characters
in mathematical expressions.

In text areas: 1. DP method, 2.
Bigrams, trigrams, 3. Word dictionaries,
etc. However, in math areas: ?
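
To make the text-area idea concrete, the following Python sketch combines per-character OCR scores with bigram scores by dynamic programming and returns the best-scoring reading. It is only an assumption about how such techniques can be combined, not a description of INFTY or ExpressReader Pro. The function `best_reading`, the score tables and the `default_logp` penalty are all hypothetical.

```python
def best_reading(candidates, bigram_logp, default_logp=-10.0):
    """Pick the most plausible string from per-position OCR candidates.

    candidates: one list per character position, each holding
        (character, ocr_log_score) pairs.
    bigram_logp: dict mapping (previous_char, char) to a log-probability;
        unseen pairs get default_logp (a hypothetical penalty).
    """
    if not candidates:
        return ""
    # best[c] = (score of the best path ending in c, that path as a string)
    best = {c: (s, c) for c, s in candidates[0]}
    for position in candidates[1:]:
        new_best = {}
        for c, s in position:
            # Choose the predecessor that gives the highest total score.
            score, path = max(
                (prev_score + bigram_logp.get((p, c), default_logp), prev_path)
                for p, (prev_score, prev_path) in best.items()
            )
            new_best[c] = (score + s, path + c)
        best = new_best
    return max(best.values())[1]


# Toy example: OCR alone slightly prefers "b" in the middle position,
# but the bigram scores favour "th" and "he".
toy_candidates = [[("t", -0.1)], [("b", -0.5), ("h", -0.7)], [("e", -0.1)]]
toy_bigrams = {("t", "h"): -1.0, ("h", "e"): -1.0,
               ("t", "b"): -6.0, ("b", "e"): -2.0}
print(best_reading(toy_candidates, toy_bigrams))
```

In math areas the neighbouring symbols follow no such word statistics, which is one way to read the open question on the slide.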
15
5. Layout Analysis
16
5. Layout Analysis
17
5. Layout Analysis
18
5. Layout Analysis
19
5. Layout Analysis

Currently, INFTY supports only graphical layout
analysis.
Logical structure analysis, such as the
recognition of titles, author information,
section/subsection structure, indexing, theorem
description areas, citation links, etc.,
is left to future work.

20
6. Line Segmentation
21
6. Line Segmentation
22
6. Line Segmentation (sample)
23
6. Line Segmentation (sample)
24
6. Line Segmentation (sample)
25
6. Line Segmentation (sample)
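
The transcript keeps only the titles of the line segmentation slides, so the method INFTY uses is not recorded here. As a point of reference, the sketch below shows a standard baseline, horizontal projection: each maximal run of pixel rows containing ink is treated as one line. The function name `segment_lines` and the 0/1 image representation are assumptions made for the example.

```python
def segment_lines(image):
    """Split a binary page image (list of rows of 0/1 pixels) into lines.

    Returns (top, bottom) row ranges, bottom exclusive, one per maximal
    run of consecutive rows that contain at least one ink pixel.
    """
    lines = []
    start = None
    for y, row in enumerate(image):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                      # a new line begins here
        elif not has_ink and start is not None:
            lines.append((start, y))       # blank row closes the line
            start = None
    if start is not None:
        lines.append((start, len(image)))  # line touching the bottom edge
    return lines
```

This baseline tends to fail on pages with mathematics: tall fractions can merge neighbouring lines, and limits or indices set above and below a symbol can be cut off as separate lines.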
26
7. Text/Math Segmentation
[Figure: page image with areas labeled Math and Text]
27
7. Text/Math Segmentation
Segmentation of text/math areas, using the character
recognition results of ExpressReader Pro
Character and symbol recognition in math areas
and structure analysis of math expressions
Correction of text/math segmentation
28
7. Text/Math Segmentation

Difficulties in criteria
Isolated letter a in italic font,
Isolated capital letters (initials, etc.),
Numerals (items, citations, section numbers,
theorem numbers, or numbers in math
expressions?),
Abbreviations (i.e., e.g., etc.)
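
The ambiguous cases listed above can be written down as a small rule table. The Python sketch below only flags tokens that fall into one of these cases. It does not decide whether they are text or math, and it is not INFTY's criterion. The function `ambiguity`, its `italic` flag and the returned labels are hypothetical.

```python
import re

# Abbreviations named on the slide.
ABBREVIATIONS = {"i.e.", "e.g.", "etc."}


def ambiguity(token, italic=False):
    """Return a label for the ambiguous case a token falls into, or None."""
    if token.lower() in ABBREVIATIONS:
        return "abbreviation"
    if token == "a" and italic:
        # Could be the English article in an italic sentence or a variable.
        return "isolated italic a"
    if len(token) == 1 and token.isalpha() and token.isupper():
        # Could be an initial in a name or a math symbol.
        return "isolated capital"
    if re.fullmatch(r"\d+(\.\d+)*\.?", token):
        # Item, citation, section or theorem number, or a number in math.
        return "numeral"
    return None
```

A token flagged this way needs surrounding context (neighbouring words, font, position in the line) before it can be assigned to a text or math area.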

29
7. Text/Math Segmentation

Examples
See the demonstration HTML files
1. Comment_Math_Helv_69_039_048.html
2. Comment_Math_Helv_71_060_069.html
These samples were generated automatically by
our recognition system INFTY on March 19, 2002
at Ann Arbor. They include some errors and show
the present state of our system, since no manual
correction has been applied to the results. The
hyperlinks were also generated by the system.
To view the results correctly, you have to
install the INFTY fonts
Infty Font 1.TTF, Infty Font 2.TTF,
Infty Font 3.TTF,
on your computer before opening these HTML
files.
(Notes added on April 4th, 2002 at
Fukuoka)

30
8. Structure Analysis of Mathematical Expressions
31
8. Structure Analysis of Mathematical Expressions
32
8. Structure Analysis of Mathematical Expressions
33
9. Output format

Intermediate XML format →
XML format as final result output →
Embedding of hyperlink structure →
LaTeX, HTML, etc.
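
One small part of embedding hyperlink structure in HTML output can be sketched as follows: bracketed citation numbers in recognized text are turned into links to reference entries. This is an illustrative assumption, not INFTY's procedure. The function `link_citations` and the `ref-N` anchor naming are hypothetical.

```python
import re


def link_citations(text, known_refs):
    """Wrap citations such as [3] in HTML links to anchors named ref-3.

    known_refs: set of reference numbers that exist in the reference
    list; other bracketed numbers are left untouched.
    """
    def replace(match):
        number = match.group(1)
        if int(number) not in known_refs:
            return match.group(0)  # not a known reference, keep as is
        return '<a href="#ref-{0}">[{0}]</a>'.format(number)

    return re.sub(r"\[(\d+)\]", replace, text)
```

Checking against the known reference numbers avoids linking bracketed numbers that are not citations.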

34
10. Work Flow of Digitization

1. Pre-processing of image files
   - Erase large peripheral noise
   - Erase figure areas and table areas
2. Get the recognition results using Andos
interface
3. Extract the data you need from our XML
output.
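
The extraction step can be illustrated with Python's standard `xml.etree.ElementTree` module. The slides do not show INFTY's XML schema, so the element names (`page`, `line`, `char`) and attributes (`x`, `y`, `w`, `h`, `type`) in this sketch are invented for the example, as is the function `extract_chars`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real INFTY XML output may look quite different.
SAMPLE = """<page number="1">
  <line>
    <char x="100" y="200" w="12" h="18" type="text">f</char>
    <char x="114" y="200" w="10" h="18" type="math">x</char>
  </line>
</page>"""


def extract_chars(xml_text, kind=None):
    """Return (character, (x, y, w, h)) pairs from the XML text.

    kind: if given, keep only characters whose type attribute equals it.
    """
    root = ET.fromstring(xml_text)
    result = []
    for ch in root.iter("char"):
        if kind is not None and ch.get("type") != kind:
            continue
        box = tuple(int(ch.get(key)) for key in ("x", "y", "w", "h"))
        result.append((ch.text, box))
    return result


print(extract_chars(SAMPLE, kind="math"))
```

Keeping the coordinates with each character is what allows extracted items such as titles or references to be located again on the page image.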

35
INFTY alpha-test site

Currently, we have an alpha-test site for our system:
http://133.5.158.104/Infty/index.html
If you upload TIF files of scanned page images of
a mathematical paper (TIF Grade3, 400 DPI / 600 DPI),
you can then download the recognition results
in either LaTeX or HTML format.

36
Further problems

Further improvement of the character
recognition rate,
Further improvement of layout analysis,
Recognition of touching characters and broken
characters,
Logical structure analysis of the document,
Automatic detection of keywords, etc.

37
Database

To advance research on
mathematical/scientific document recognition, we
need a large-scale database of page image
files with correct recognition results that keep
the coordinate correspondence between each character
and the original image (ground truth).
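
A ground-truth record of this kind, and one way it could be used to score a recognizer, might look like the Python sketch below. The record layout `GroundTruthChar`, the overlap measure and the 0.5 threshold are assumptions made for illustration. The slides do not define a database format or an evaluation procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundTruthChar:
    code: str   # the correct character or symbol name
    x: int      # left edge of the bounding box in the page image
    y: int      # top edge
    w: int      # width
    h: int      # height


def overlap_ratio(a, b):
    """Intersection area divided by union area of two bounding boxes."""
    ix = max(0, min(a.x + a.w, b.x + b.w) - max(a.x, b.x))
    iy = max(0, min(a.y + a.h, b.y + b.h) - max(a.y, b.y))
    inter = ix * iy
    union = a.w * a.h + b.w * b.h - inter
    return inter / union if union else 0.0


def character_accuracy(truth, recognized, min_overlap=0.5):
    """Fraction of ground-truth characters matched by a recognized
    character with the same code and sufficiently overlapping box."""
    if not truth:
        return 1.0
    correct = 0
    for t in truth:
        # Find the recognized character whose box overlaps t the most.
        best = max(recognized, key=lambda r: overlap_ratio(t, r), default=None)
        if (best is not None and best.code == t.code
                and overlap_ratio(t, best) >= min_overlap):
            correct += 1
    return correct / len(truth)
```

Matching by coordinates rather than by reading order keeps the comparison meaningful even when line or text/math segmentation differs between the recognizer and the ground truth.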

38
INFTY