Title: Document Image Content Inventories
1. Document Image Content Inventories
- Henry S. Baird
- Michael A. Moll
- Chang An
- Matthew R. Casey
DRR XIV, February 1, 2007
2. Document Image Content
- Given an image of a document
- Find regions containing handwriting, machine-print text, graphics, line-art, logos, photographs, noise, etc.
- Challenge: select images to cover a vast variety
  - Bitonal, greyscale, and color
  - English, Arabic, Chinese
  - Simple to complex layouts
  - Historical and modern documents
  - Low-quality, distorted images to higher-quality, clear images
  - Machine print, handwriting, photographs, and blank content
  - Combinations of all of these types
3. Test Document Examples
4-5. More Document Examples (image-only slides)
6. Classification Algorithms
- Consider brute-force 5-Nearest Neighbors (5-NN) as the gold standard
- Hashed k-D trees used to approximate 5-NN
  - Very fast, large speedup
  - Small loss in accuracy
- Classifiers discussed in detail in previously published work (DRR XIII, 2006)
- A sample is a single pixel in a document image
  - Not classifying entire regions
- Classifiers run to completion in CPU hours
  - Just fast enough to allow us to push accuracy, etc.
- A minimal sketch of this style of pixel classifier appears below
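The sketch below is not the authors' implementation: it uses SciPy's exact k-d tree in place of the paper's hashed k-D trees, and the majority-vote rule and variable names are illustrative assumptions.

import numpy as np
from collections import Counter
from scipy.spatial import cKDTree

def train_5nn(train_features, train_labels):
    # train_features: (n_samples, n_features) per-pixel feature vectors
    # train_labels:   (n_samples,) content-class labels, e.g. "MP", "HW", "PH", "BL"
    return cKDTree(train_features), np.asarray(train_labels)

def classify_pixels(tree, labels, test_features, k=5):
    # For each test pixel, find its k nearest training pixels and take a majority vote.
    _, idx = tree.query(test_features, k=k)
    return np.array([Counter(labels[row]).most_common(1)[0][0] for row in idx])

Swapping the exact tree for an approximate one (the paper's hashed k-D trees) trades a small loss in accuracy for a large speedup, as the slide notes.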
7-8. Classification Example (image-only slides)
9. Choosing the Feature Set
- Each test point is a single pixel in the image
  - We do not classify regions
  - Therefore we do not restrict region shape
  - We allow complex, non-rectangular shapes
- Represented by scalar features
  - Extracted from the luminosity channel of the HSL image
  - Extracted over a small window surrounding the pixel
  - Windows from +/- 11 pixels to +/- 20 pixels
- Trial-and-error exploration of over 60 features
- We now use a set of 26 features for testing (an extraction sketch follows below)
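A minimal sketch, assuming an OpenCV-style HLS conversion, of pulling the luminosity window around one pixel; the half-width and the edge padding are illustrative choices, not the authors' exact parameters.

import numpy as np
import cv2

def luminosity(image_bgr):
    # In OpenCV's HLS channel ordering, channel 1 is luminosity (L).
    return cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HLS)[:, :, 1].astype(np.float32)

def pixel_window(lum, row, col, half=11):
    # Window of +/- `half` pixels around (row, col), edge-padded at image borders.
    padded = np.pad(lum, half, mode="edge")
    return padded[row:row + 2 * half + 1, col:col + 2 * half + 1]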
10. Features Used
- Region Luminosity
  - 1x1 region
- Line Luminosity Average
  - Horizontal and vertical
  - Line length of 25 pixels
- Line Average Difference
  - Line length of 25 pixels
- Line Luminosity Average Difference
  - Diagonals only
  - Line length of 25 pixels
- Line Luminosity Max Difference
  - Four directions
  - Line length of 41 pixels
- Revised Distance to Max-Difference Pair
  - Eight directions
  - Line length of 41 pixels
- Revised Distance to Max-Difference Pixel
  - Eight directions
  - Line length of 41 pixels
(Two of these line features are sketched below)
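The precise feature definitions are in the DRR XIII paper; the sketch below is one plausible reading of two of the named features (Line Luminosity Average and Line Luminosity Max Difference, taken along a horizontal line), offered only as illustration.

import numpy as np

def line_luminosity_average(lum, row, col, length=25):
    # Mean luminosity along a horizontal line of `length` pixels centered on (row, col).
    half = length // 2
    lo, hi = max(col - half, 0), min(col + half + 1, lum.shape[1])
    return float(lum[row, lo:hi].mean())

def line_luminosity_max_difference(lum, row, col, length=41):
    # Largest luminosity difference between any two pixels on the centered horizontal line.
    half = length // 2
    lo, hi = max(col - half, 0), min(col + half + 1, lum.shape[1])
    line = lum[row, lo:hi]
    return float(line.max() - line.min())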
11. Experimental Design
- Training set of 31 images and testing set of 86 images
- Each image in the test set has at least one similar image in the training set (from the same source)
  - We are not testing the classifier's ability to generalize strongly to different images
- Testing set consists of 178,793,163 samples
12. Speed-Ups by Decimation
- Trials and intuition show that large speed-ups in classification (regardless of method) can be obtained by randomly throwing away training data
- Since we are classifying pixels, we expect high redundancy in the training data
  - Partially due to isogeny, the tendency for data in the same image to have been generated by the same source and process
- The following slide shows an example; a minimal decimation sketch also appears below
  - Classifier trained on one image and tested on a second image from the same source
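A minimal sketch of random decimation of the training set; treating the decimation factor as "keep roughly 1/factor of the pixels" is an assumption about how the factors in the next table are applied.

import numpy as np

def decimate(train_features, train_labels, factor, seed=0):
    # Keep about 1/factor of the training pixels, chosen uniformly at random.
    rng = np.random.default_rng(seed)
    n = len(train_labels)
    keep = rng.choice(n, size=max(1, n // factor), replace=False)
    return train_features[keep], train_labels[keep]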
13. Speed-Ups by Decimation

Decimation factor    1      10     100    500    1000
Speed-up (x)         0      7.9    57.9   212.5  354.2
Accuracy (%)         80.4   72.9   76.2   70.0   66.6
14. Example Analysis
- Consider this image and its output
- Per-pixel accuracy: 62%
15. Per-Pixel Classification versus Inventory
- Per-pixel classification confusion matrix (rows: ground truth; columns: classifier output; Type 1 and Type 2 are the row and column error totals):

         BL        HW        MP        PH        Type 1
BL       0.0661    0.0194    0.00217   0.00183   0.0234
HW       0         0         0         0         0
MP       0.0863    0.0603    0.417     0.0464    0.193
PH       0.0119    0.00848   0.0734    0.207     0.0938
Type 2   0.0982    0.0882    0.0756    0.04823   0.3706

- Per-page inventory, fraction of content (%):

Content   True     Classifier   Accuracy
BL        6.817    24.18        20.96
HW        0        11.3         0
MP        46.75    42.14        75.85
PH        23.06    22.38        70.91
16. Per-Pixel Accuracy
- Fraction of all pixels in a document that are correctly classified
  - Class label matches the class specified by ground truth
- Objective and quantitative
  - However, arbitrary due to the methods of ground-truthing and their inconsistencies
  - Per-pixel accuracy scores tend to be worse than the image may subjectively appear
- For our test set, across all images, the average per-pixel accuracy score was 62.4% (see the definition below)
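In symbols (our notation, not the slides'): for a page with $N$ pixels $p_1, \dots, p_N$, ground-truth labels $g(p_i)$, and classifier labels $c(p_i)$,
\[
\text{per-pixel accuracy} \;=\; \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\bigl[\,c(p_i) = g(p_i)\,\bigr].
\]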
17. Confusion Matrix
Fractions of all test-set pixels (rows: ground truth; columns: classifier output; Type 1 and Type 2 are the row and column error totals):

         BL       HW        MP        PH        Type 1
BL       0.159    0.0279    0.0318    0.00539   0.0651
HW       0.0283   0.0231    0.0135    0.00128   0.0431
MP       0.0456   0.0291    0.353     0.0390    0.114
PH       0.0228   0.00739   0.0465    0.167     0.0767
Type 2   0.0967   0.0644    0.0918    0.0457    0.386
18. Per-Page Inventory
- For each content class we measure the fraction of each page that is classified as that class
- Allows a user to query a database of page images in a variety of useful and natural ways
- For example, answer a query like:
  - Find all pages containing ≥ 70% Photograph
  - and ≥ 10% Machine Print
- This is an information retrieval problem for which precision and recall are natural evaluation metrics
- Most images in the test set are of mixed content and do not contain a majority of any one class
19. Per-Page Inventory
- We tried queries on our complete test set, e.g.:
  - Find all images that contain at least a 20% fraction of MP pixels
- A sketch of such a query and its precision/recall evaluation appears below
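A minimal sketch, under an assumed data layout (a dict of per-page class fractions), of running such a threshold query and scoring it with precision and recall against ground-truth inventories.

def query_pages(inventories, cls, threshold):
    # inventories: {page_id: {"BL": f, "HW": f, "MP": f, "PH": f}}, fractions in [0, 1]
    return {page for page, inv in inventories.items() if inv[cls] >= threshold}

def precision_recall(retrieved, relevant):
    # Standard IR definitions: precision over retrieved pages, recall over relevant pages.
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 1.0
    recall = hits / len(relevant) if relevant else 1.0
    return precision, recall

# Example: pages with at least 20% machine print, judged against ground-truth inventories.
# retrieved = query_pages(classifier_inventories, "MP", 0.20)
# relevant  = query_pages(groundtruth_inventories, "MP", 0.20)
# precision, recall = precision_recall(retrieved, relevant)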
20. Precision and Recall Curves
21. Machine Print PR Curves
22. Per-Page Inventory
- If we assume all thresholds are equally likely, we can estimate expected precision and recall (a sketch of this averaging follows below):

          Recall (%)   Precision (%)
BL        96.7         55.6
HW        45.1         80.1
MP        80.9         77.2
PH        76.0         78.8

- Vs. 62.4% per-pixel accuracy
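A minimal sketch of estimating expected precision and recall by averaging over a uniform grid of thresholds; the grid spacing and the reuse of query_pages and precision_recall from the slide-19 sketch are assumptions, not the authors' exact procedure.

import numpy as np

def expected_precision_recall(classifier_inv, truth_inv, cls,
                              thresholds=np.linspace(0.05, 0.95, 19)):
    # Average precision and recall over equally likely fraction thresholds.
    ps, rs = [], []
    for t in thresholds:
        retrieved = query_pages(classifier_inv, cls, t)
        relevant = query_pages(truth_inv, cls, t)
        p, r = precision_recall(retrieved, relevant)
        ps.append(p)
        rs.append(r)
    return float(np.mean(ps)), float(np.mean(rs))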
23-29. Discussion of Results (image-only example slides)
30. Conclusion
- Modest per-pixel classification accuracies (of, e.g., 60-70%) support useful recall and precision rates (of, e.g., 80-90%) for retrieval queries of document collections seeking pages containing a given minimum fraction of a certain content type
- On a per-page basis, inventories tend to be more accurate than per-pixel classification
31. Future Work
- Analysis of the relationship between per-pixel accuracy scores and per-page inventory queries
  - Under what assumptions can we relate the two?
  - Which is more useful/descriptive?
- Building classifiers on classified images (iterated classification)
- Content class masks
- Automated feature selection
- Massive tests with GRID computing
32. Iterated Classification
33. Content Class Masks (mask images for HW, MP, PH)
34. Thank You!
- Henry S. Baird
- Michael A. Moll
- Chang An
- Matthew R. Casey
35. Raw Pixel Counts
Rows: ground truth; columns: classifier output (a normalization sketch follows below).

          BL          HW          MP          PH          Total
BL        28385810    4992050     5678089     963871      40019820
HW        5054560     4128702     2422925     228225      11834412
MP        8150037     5196000     63137187    6968964     83452188
PH        4080809     1321661     8310500     29773773    43486743
Total     45671216    15638413    79548701    37934833    178793163
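A minimal sketch of turning these raw counts into row-normalized rates (the fraction of each ground-truth class assigned to each output class), which is how the "43% of HW misclassified as BL" figure on the next slide can be reproduced; the row/column orientation is inferred from that slide.

import numpy as np

classes = ["BL", "HW", "MP", "PH"]
counts = np.array([
    [28385810, 4992050, 5678089, 963871],
    [5054560, 4128702, 2422925, 228225],
    [8150037, 5196000, 63137187, 6968964],
    [4080809, 1321661, 8310500, 29773773],
], dtype=float)

rates = counts / counts.sum(axis=1, keepdims=True)  # rows sum to 1 (ground truth)
print(rates[classes.index("HW"), classes.index("BL")])  # ~0.43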
36. Analyzing the Confusion Matrix
- The classifier is best at classifying PH and MP
- Some trouble with BL and lots of trouble with HW
  - 43% of HW is misclassified as BL
- However, a similar amount of content is classified as each class as was labeled in the ground truth
- Suggests problems with zoning, not necessarily with classification
37. Photograph PR Curves
38. Discussion of Results
- Locates HW in detail, not just as rectangular blocks as in zoning
- Some difficulty separating handwriting from background (lines on a legal pad) that was included in the zoning
39. Discussion of Results
- Good segmentation of machine print, with some confusions with handwriting
- In the photograph of a football player, identifies the jersey letters as MP
40. Discussion of Results
- Discriminates text from photos very well
- Does a very good job of discovering the actual text layout
- Trouble distinguishing HW from MP
41. Discussion of Results
- Both examples show remarkable segmentation of MP from PH, regardless of background, etc.
42. Discussion of Results
- Complex magazine layouts
- Non-rectilinear text layouts
- Text overlapping photographs, etc.
43. Discussion of Results
- On the left, a particularly complex layout that shows the classifier making many correct, small distinctions between photograph and machine print
- On the right, an interesting case where the blank background blends indistinguishably into the photograph at the bottom
44. Discussion of Results
- The image on the left shows the methodological problem of zoning large, dim areas of photographs that are statistically indistinguishable from blank
- The image on the right shows excellent subjective segmentation of MP and PH but confusion with HW and BL