Title: Document Image Content Inventories
1. Document Image Content Inventories
- Henry S. Baird
- Michael A. Moll
- Chang An
- Matthew R. Casey
DRR XIV, February 1, 2007
2. Document Image Content
- Given an image of a document
- Find regions containing handwriting, machine-print text, graphics, line-art, logos, photographs, noise, etc.
- Challenge: select images to cover a vast variety
  - Bitonal, greyscale, and color
  - English, Arabic, Chinese
  - Simple to complex layouts
  - Historical and modern documents
  - Low-quality, distorted images to higher-quality, clear images
  - Machine print, handwriting, photographs, and blank content
  - Combinations of all of these types
3. Test Document Examples
4-5. More Document Examples (image-only slides)
6. Classification Algorithms
- Consider brute-force 5-Nearest Neighbors (5-NN) as the gold standard
- Hashed k-D trees used to approximate 5-NN
  - Very fast, large speedup
  - Small loss in accuracy
- Classifiers discussed in detail in previously published work (DRR XIII, 2006)
- A sample is a single pixel in a document image
  - Not classifying entire regions
- Classifiers run to completion in CPU hours
  - Just fast enough to allow us to push accuracy, etc.
- A minimal sketch of this style of pixel classifier appears below
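The sketch below is not the authors' implementation: it uses SciPy's exact k-d tree in place of the paper's hashed k-D trees, and the majority-vote rule and variable names are illustrative assumptions.

import numpy as np
from collections import Counter
from scipy.spatial import cKDTree

def train_5nn(train_features, train_labels):
    # train_features: (n_samples, n_features) per-pixel feature vectors
    # train_labels:   (n_samples,) content-class labels, e.g. "MP", "HW", "PH", "BL"
    return cKDTree(train_features), np.asarray(train_labels)

def classify_pixels(tree, labels, test_features, k=5):
    # For each test pixel, find its k nearest training pixels and take a majority vote.
    _, idx = tree.query(test_features, k=k)
    return np.array([Counter(labels[row]).most_common(1)[0][0] for row in idx])

Swapping the exact tree for an approximate one (the paper's hashed k-D trees) trades a small loss in accuracy for a large speedup, as the slide notes.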
7-8. Classification Example (image-only slides)
9. Choosing the Feature Set
- Each test point is a single pixel in the image
  - We do not classify regions
  - Therefore we do not restrict region shape
  - We allow complex, non-rectangular shapes
- Represented by scalar features
  - Extracted from the luminosity channel of the HSL image
  - Extracted over a small window surrounding the pixel
  - Windows from +/- 11 pixels to +/- 20 pixels
- Trial-and-error exploration of over 60 features
- We now use a set of 26 features for testing (an extraction sketch follows below)
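A minimal sketch, assuming an OpenCV-style HLS conversion, of pulling the luminosity window around one pixel; the half-width and the edge padding are illustrative choices, not the authors' exact parameters.

import numpy as np
import cv2

def luminosity(image_bgr):
    # In OpenCV's HLS channel ordering, channel 1 is luminosity (L).
    return cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HLS)[:, :, 1].astype(np.float32)

def pixel_window(lum, row, col, half=11):
    # Window of +/- `half` pixels around (row, col), edge-padded at image borders.
    padded = np.pad(lum, half, mode="edge")
    return padded[row:row + 2 * half + 1, col:col + 2 * half + 1]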
10. Features Used
- Region Luminosity
  - 1x1 region
- Line Luminosity Average
  - Horizontal and vertical
  - Line length of 25 pixels
- Line Average Difference
  - Line length of 25 pixels
- Line Luminosity Average Difference
  - Diagonals only
  - Line length of 25 pixels
- Line Luminosity Max Difference
  - Four directions
  - Line length of 41 pixels
- Revised Distance to Max-Difference Pair
  - Eight directions
  - Line length of 41 pixels
- Revised Distance to Max-Difference Pixel
  - Eight directions
  - Line length of 41 pixels
(Two of these line features are sketched below)
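The precise feature definitions are in the DRR XIII paper; the sketch below is one plausible reading of two of the named features (Line Luminosity Average and Line Luminosity Max Difference, taken along a horizontal line), offered only as illustration.

import numpy as np

def line_luminosity_average(lum, row, col, length=25):
    # Mean luminosity along a horizontal line of `length` pixels centered on (row, col).
    half = length // 2
    lo, hi = max(col - half, 0), min(col + half + 1, lum.shape[1])
    return float(lum[row, lo:hi].mean())

def line_luminosity_max_difference(lum, row, col, length=41):
    # Largest luminosity difference between any two pixels on the centered horizontal line.
    half = length // 2
    lo, hi = max(col - half, 0), min(col + half + 1, lum.shape[1])
    line = lum[row, lo:hi]
    return float(line.max() - line.min())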
11. Experimental Design
- Training set of 31 images and testing set of 86 images
- Each image in the test set has at least one similar image in the training set (from the same source)
  - We are not testing the classifier's ability to generalize strongly to different images
- Testing set consists of 178,793,163 samples
12. Speed-Ups by Decimation
- Trials and intuition show that large speed-ups in classification (regardless of method) can be obtained by randomly throwing away training data
- Since we are classifying pixels, we expect high redundancy in the training data
  - Partially due to isogeny, the tendency for data in the same image to have been generated by the same source and process
- The following slide shows an example; a minimal decimation sketch also appears below
  - Classifier trained on one image and tested on a second image from the same source
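A minimal sketch of random decimation of the training set; treating the decimation factor as "keep roughly 1/factor of the pixels" is an assumption about how the factors in the next table are applied.

import numpy as np

def decimate(train_features, train_labels, factor, seed=0):
    # Keep about 1/factor of the training pixels, chosen uniformly at random.
    rng = np.random.default_rng(seed)
    n = len(train_labels)
    keep = rng.choice(n, size=max(1, n // factor), replace=False)
    return train_features[keep], train_labels[keep]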
13. Speed-Ups by Decimation

Decimation factor    1      10     100    500    1000
Speed-up (x)         0      7.9    57.9   212.5  354.2
Accuracy (%)         80.4   72.9   76.2   70.0   66.6
14. Example Analysis
- Consider this image and its output
- Per-pixel accuracy: 62%
15. Per-Pixel Classification versus Inventory
- Per-pixel classification confusion matrix (rows: ground truth; columns: classifier output; Type 1 and Type 2 are the row and column error totals):

         BL        HW        MP        PH        Type 1
BL       0.0661    0.0194    0.00217   0.00183   0.0234
HW       0         0         0         0         0
MP       0.0863    0.0603    0.417     0.0464    0.193
PH       0.0119    0.00848   0.0734    0.207     0.0938
Type 2   0.0982    0.0882    0.0756    0.04823   0.3706

- Per-page inventory, fraction of content (%):

Content   True     Classifier   Accuracy
BL        6.817    24.18        20.96
HW        0        11.3         0
MP        46.75    42.14        75.85
PH        23.06    22.38        70.91
16. Per-Pixel Accuracy
- Fraction of all pixels in a document that are correctly classified
  - Class label matches the class specified by ground truth
- Objective and quantitative
  - However, arbitrary due to the methods of ground-truthing and their inconsistencies
  - Per-pixel accuracy scores tend to be worse than the image may subjectively appear
- For our test set, across all images, the average per-pixel accuracy score was 62.4% (see the definition below)
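In symbols (our notation, not the slides'): for a page with $N$ pixels $p_1, \dots, p_N$, ground-truth labels $g(p_i)$, and classifier labels $c(p_i)$,
\[
\text{per-pixel accuracy} \;=\; \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\bigl[\,c(p_i) = g(p_i)\,\bigr].
\]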
17. Confusion Matrix
Fractions of all test-set pixels (rows: ground truth; columns: classifier output; Type 1 and Type 2 are the row and column error totals):

         BL       HW        MP        PH        Type 1
BL       0.159    0.0279    0.0318    0.00539   0.0651
HW       0.0283   0.0231    0.0135    0.00128   0.0431
MP       0.0456   0.0291    0.353     0.0390    0.114
PH       0.0228   0.00739   0.0465    0.167     0.0767
Type 2   0.0967   0.0644    0.0918    0.0457    0.386
18. Per-Page Inventory
- For each content class we measure the fraction of each page that is classified as that class
- Allows a user to query a database of page images in a variety of useful and natural ways
- For example, answer a query like:
  - Find all pages containing ≥ 70% Photograph
  - and ≥ 10% Machine Print
- This is an information retrieval problem for which precision and recall are natural evaluation metrics
- Most images in the test set are of mixed content and do not contain a majority of any one class
19. Per-Page Inventory
- We tried queries on our complete test set, e.g.:
  - Find all images that contain at least a 20% fraction of MP pixels
- A sketch of such a query and its precision/recall evaluation appears below
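A minimal sketch, under an assumed data layout (a dict of per-page class fractions), of running such a threshold query and scoring it with precision and recall against ground-truth inventories.

def query_pages(inventories, cls, threshold):
    # inventories: {page_id: {"BL": f, "HW": f, "MP": f, "PH": f}}, fractions in [0, 1]
    return {page for page, inv in inventories.items() if inv[cls] >= threshold}

def precision_recall(retrieved, relevant):
    # Standard IR definitions: precision over retrieved pages, recall over relevant pages.
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 1.0
    recall = hits / len(relevant) if relevant else 1.0
    return precision, recall

# Example: pages with at least 20% machine print, judged against ground-truth inventories.
# retrieved = query_pages(classifier_inventories, "MP", 0.20)
# relevant  = query_pages(groundtruth_inventories, "MP", 0.20)
# precision, recall = precision_recall(retrieved, relevant)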
20. Precision and Recall Curves
21. Machine Print PR Curves
22. Per-Page Inventory
- If we assume all thresholds are equally likely, we can estimate expected precision and recall (a sketch of this averaging follows below):

          Recall (%)   Precision (%)
BL        96.7         55.6
HW        45.1         80.1
MP        80.9         77.2
PH        76.0         78.8

- Vs. 62.4% per-pixel accuracy
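A minimal sketch of estimating expected precision and recall by averaging over a uniform grid of thresholds; the grid spacing and the reuse of query_pages and precision_recall from the slide-19 sketch are assumptions, not the authors' exact procedure.

import numpy as np

def expected_precision_recall(classifier_inv, truth_inv, cls,
                              thresholds=np.linspace(0.05, 0.95, 19)):
    # Average precision and recall over equally likely fraction thresholds.
    ps, rs = [], []
    for t in thresholds:
        retrieved = query_pages(classifier_inv, cls, t)
        relevant = query_pages(truth_inv, cls, t)
        p, r = precision_recall(retrieved, relevant)
        ps.append(p)
        rs.append(r)
    return float(np.mean(ps)), float(np.mean(rs))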
23-29. Discussion of Results (image-only example slides)
30. Conclusion
- Modest per-pixel classification accuracies (of, e.g., 60-70%) support useful recall and precision rates (of, e.g., 80-90%) for retrieval queries of document collections seeking pages containing a given minimum fraction of a certain content type
- On a per-page basis, inventories tend to be more accurate than per-pixel classification
31. Future Work
- Analysis of the relationship between per-pixel accuracy scores and per-page inventory queries
  - Under what assumptions can we relate the two?
  - Which is more useful/descriptive?
- Building classifiers on classified images (iterated classification)
- Content class masks
- Automated feature selection
- Massive tests with GRID computing
32. Iterated Classification
33. Content Class Masks (mask images for HW, MP, PH)
34. Thank You!
- Henry S. Baird
- Michael A. Moll
- Chang An
- Matthew R. Casey
35. Raw Pixel Counts
Rows: ground truth; columns: classifier output (a normalization sketch follows below).

          BL          HW          MP          PH          Total
BL        28385810    4992050     5678089     963871      40019820
HW        5054560     4128702     2422925     228225      11834412
MP        8150037     5196000     63137187    6968964     83452188
PH        4080809     1321661     8310500     29773773    43486743
Total     45671216    15638413    79548701    37934833    178793163
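A minimal sketch of turning these raw counts into row-normalized rates (the fraction of each ground-truth class assigned to each output class), which is how the "43% of HW misclassified as BL" figure on the next slide can be reproduced; the row/column orientation is inferred from that slide.

import numpy as np

classes = ["BL", "HW", "MP", "PH"]
counts = np.array([
    [28385810, 4992050, 5678089, 963871],
    [5054560, 4128702, 2422925, 228225],
    [8150037, 5196000, 63137187, 6968964],
    [4080809, 1321661, 8310500, 29773773],
], dtype=float)

rates = counts / counts.sum(axis=1, keepdims=True)  # rows sum to 1 (ground truth)
print(rates[classes.index("HW"), classes.index("BL")])  # ~0.43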
36. Analyzing the Confusion Matrix
- The classifier is best at classifying PH and MP
- Some trouble with BL and lots of trouble with HW
  - 43% of HW is misclassified as BL
- However, a similar amount of content is classified as each class as was labeled in the ground truth
- Suggests problems with zoning, not necessarily with classification
37. Photograph PR Curves
38. Discussion of Results
- Locates HW in detail, not just as rectangular blocks as in zoning
- Some difficulty separating handwriting from background (lines on a legal pad) that was included in the zoning
39. Discussion of Results
- Good segmentation of machine print, with some confusions with handwriting
- In the photograph of a football player, identifies the jersey letters as MP
40. Discussion of Results
- Discriminates text from photos very well
- Does a very good job of discovering the actual text layout
- Trouble distinguishing HW from MP
41. Discussion of Results
- Both examples show remarkable segmentation of MP from PH, regardless of background, etc.
42. Discussion of Results
- Complex magazine layouts
- Non-rectilinear text layouts
- Text overlapping photographs, etc.
43. Discussion of Results
- On the left, a particularly complex layout that shows the classifier making many correct, small distinctions between photograph and machine print
- On the right, an interesting case where the blank background blends indistinguishably into the photograph at the bottom
44. Discussion of Results
- The image on the left shows the methodological problem of zoning large, dim areas of photographs that are statistically indistinguishable from blank
- The image on the right shows excellent subjective segmentation of MP and PH but confusion with HW and BL