Title: Document Analysis and Recognition
1Document Analysis and Recognition
2What is a Document?
- A written or printed paper that bears the original, official, or legal form of something and can be used to furnish decisive evidence or information.
- Something, such as a recording or a photograph, that can be used to furnish evidence or information.
- A writing that contains information.
- Computer Science. A piece of work created with an application, as by a word processor.
- Computer Science. A computer file that is not an executable file and contains data for use by applications.
3Document Image Analysis
- DIA is the theory and practice of recovering the symbol structures of digital images scanned from paper or produced by computer.
- DIA is a subfield of digital image processing.
- Digital images of natural objects (X-rays, fingerprints, faces, scenery, etc.) are NOT part of DIA.
- Digital images of symbolic objects (postal addresses, printed articles, forms, music sheets, engineering drawings, topographic maps) belong to DIA.
- Source: scanners, printers, fax machines, hand!
- Incidental text: license plates, billboards, subtitles in photos and video.
- WWW ??
- DIA's grand goal is to take us to the land of the paperless office.
4Paperless Office?
- Traditionally, information has been transmitted and stored on paper documents.
- Documents are increasingly originating on the computer.
- Documents are printed for reading, dissemination, and markup.
- Paper in the office has increased!!
- Goal: deal with the flow of electronic and paper documents in an efficient and integrated manner.
- Implication: unlike computer media, paper documents should be read by both the computer and people.
5Short Tour of DIA
- The field started before digital computers could represent information that traditionally appeared on paper.
- Patents on OCR for the telegraph and reading machines for the blind were filed in the 19th century, and working models were demonstrated in 1916.
- OCR on specially designed fonts was used in the 1950s.
- The first postal address reader was installed in 1965.
- OCR systems for reading scanned pages came into their own in the 1980s with the advent of low-cost microprocessors, bit-mapped displays, and scanners.
- Large-capacity storage devices have now ignited the field with the prospect of digital libraries.
- Document imaging today is a billion-dollar industry, but document interpretation is only a small part of it.
6Document Image Analysis
- Textual Processing
  - Optical Character Recognition: text
  - Page Layout Analysis: skew, blocks, paragraphs
- Graphical Processing
  - Line Processing: lines, curves, corners
  - Region and Symbol Processing: filled regions
7Current
- Processors getting faster
- Storage costs are down
- Pictures are typically 512 x 512 pixels
- Speech signals are typically 256 sample points
- Business letters are typically 2550 x 3300 pixels at 300 dpi
- Engineering drawings are typically 34000 x 44000 pixels at 1000 dpi
- Digital libraries need a WWW interface
- Information retrieval and search
- OCR accuracy on the rise
- Contextual models improved
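The pixel counts quoted on this slide follow directly from sheet size times scanning resolution. The sketch below reproduces that arithmetic and adds a rough uncompressed storage estimate. The 34 x 44 inch sheet size is inferred from the quoted pixel count rather than stated on the slide, and the bit depths (1, 8, and 24 bits per pixel) and the function names are illustrative assumptions.

```python
def page_pixels(width_in, height_in, dpi):
    """Return (width_px, height_px) for a sheet scanned at the given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


def raw_size_bytes(width_px, height_px, bits_per_pixel):
    """Uncompressed image size in bytes, rounded up to a whole byte."""
    return (width_px * height_px * bits_per_pixel + 7) // 8


# Business letter: 8.5 x 11 in at 300 dpi (figures from the slide).
letter = page_pixels(8.5, 11, 300)
# Engineering drawing: 34 x 44 in at 1000 dpi (sheet size is an assumption
# inferred from the 34000 x 44000 pixel figure).
drawing = page_pixels(34, 44, 1000)

for name, (w, h) in [("letter", letter), ("drawing", drawing)]:
    for bpp in (1, 8, 24):  # assumed depths: bilevel, grayscale, RGB color
        print(f"{name}: {w} x {h} px, {bpp} bpp -> {raw_size_bytes(w, h, bpp):,} bytes")
```

Even at one bit per pixel, a single engineering drawing is far larger than a 512 x 512 picture, which is why falling storage costs matter for this field.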
8Document page
- Document page: 300 dpi, 8.5 x 11 in, 255 gray x 3 color, 2,550 x 3,300 pixels
- Data capture: 10^7 pixels
- Pixel-level processing
  - 7,500 character boxes, 15 x 20 pixels each
  - 500 line and curve segments, 20 to 20,000 pixels each
  - 10 filled regions, 20 x 20 to 200 x 200 pixels each
- Feature-level processing
  - 7,500 x 10 character features
  - 500 x 5 line and curve features
  - 10 x 5 region features
- Text analysis and recognition: 1,500 words, 10 paragraphs, 1 title, 2 subtitles, etc.
- Graphics analysis and recognition: 2 line diagrams, 1 company logo, etc.
- Document description
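The stages on this slide can be read as a progressive reduction in data volume. The sketch below tallies the numbers quoted on the slide. Reading each "N x M" entry as N items with M feature values apiece is an assumption, as are the variable and function names.

```python
# Pixel count of the captured page (2,550 x 3,300, roughly 10^7).
PAGE_PIXELS = 2550 * 3300

# (number of items, feature values per item), as quoted on the slide.
# Interpreting "7500x10" as 7,500 characters with 10 features each is an assumption.
feature_counts = {
    "character": (7500, 10),
    "line_and_curve": (500, 5),
    "region": (10, 5),
}


def total_feature_values(counts):
    """Sum items * features-per-item over all feature groups."""
    return sum(items * per_item for items, per_item in counts.values())


features = total_feature_values(feature_counts)
reduction = PAGE_PIXELS / features

print(f"pixels captured:      {PAGE_PIXELS:,}")
print(f"feature values:       {features:,}")
print(f"pixels per feature:   {reduction:.1f}")
```

Under these assumptions, feature-level processing compresses the page by roughly two orders of magnitude before any recognition takes place.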
9Document Image Analysis
10Document Taxonomy
11Postal Examples
12Forms
13Unconstrained Text
14Graphics Documents
15Personal DL
16DAS 02, Princeton, NJ
- OCR Features and Systems
  - Degradation models, script ID, bilingual OCR, Kannada OCR, Tamil OCR, machine-printed versus handwritten checks, traffic ticket reading
- Handwriting Recognition
  - Stochastic models, holistic methods, Japanese OCR
- Classifiers and Learning
  - Multi-classifier systems
- Layout Analysis
  - Skew correction, geometric methods, text/graphics separation, logical labeling
- Tables and Forms
  - Detecting tables in HTML documents, use of graph grammars, semantics
- Text Extraction
- Indexing and Retrieval
- Document Engineering
- New Applications
  - CAPTCHA, tachograph chart system, accessing driving directions
17ICDAR 03, Edinburgh, UK
- Multiple Classifiers
- Postal Automation and Check Processing
- Document Understanding
- HMM Classifiers
- Segmentation
- Character Recognition
- Graphics Recognition
- Non-Latin Alphabets: Kanji/Chinese, Korean/Hangul, Arabic/Indian
- Web Documents, Video
- Word Recognition
- Image Processing
- Writer Identification
- Forms and Tables
18CS 661 Class Schedule
19Grading
- Home Assignments and Quizzes
  - 4 x 10 = 40 points
  - Schedule is tentative, to preserve the surprise element
  - Based on class participation and paper handouts
- Midterm project
  - Demo: 10
  - Report: 15
- Final project
  - Demo: 10
  - Report: 25
20References
- Handbook of Character Recognition and Document Image Analysis, H. Bunke and P. S. P. Wang (editors), World Scientific Press
- Document Image Analysis, O'Gorman and Kasturi, IEEE Computer Society Press
- International Conference on Document Analysis and Recognition proceedings
- International Workshop on Document Analysis Systems proceedings
- Symposium on Document Image Understanding Technology