Document Image Retrieval

1
Document Image Retrieval

LBSC 796/CMSC 828o
Douglas W. Oard
April 12, 2004
mostly adapted from
A lecture by David Doermann

2
Agenda

Questions
Definitions - Document, Image, Retrieval
Document Image Analysis
Page decomposition
Optical character recognition
Traditional Indexing with Conversion
Confusion matrix
Shape codes
Doing things Without Conversion
Duplicate Detection, Classification,
Summarization, Abstracting
Keyword spotting, etc
Example Chinese document images

3
Goals of this Class

Expand your definition of what is a DOCUMENT
To get an appreciation of the issues in document
image indexing
To look at different ways of solving the same
problems with different media
Your job: compare/contrast with other media

4
Document

Basic Medium for Recording Information
Transient
Space
Time
Multiple Forms
Hardcopy (paper, stone, ...) / Electronic (CDROM,
Internet, ...)
Written/Auditory/Visual (symbolic, scenic)
Access Requirements
Search
Browse
Read

5
Sources of Document Images

The Web
Some PDF files come from scanned documents
Arabic news stories are often GIF images
Digital copiers
Produce corporate memory as a byproduct
Digitization projects
Provide improved access to hardcopy documents

6
Some Definitions

Modality
A means of expression
Linguistic modalities
Electronic text, printed, handwritten, spoken,
signed
Nonlinguistic modalities
Music, drawings, paintings, photographs, video
Media
The means by which the expression reaches you
Internet, videotape, paper, canvas, ...

7
Document Images

A collection of dots called pixels
Arranged in a grid and called a bitmap
Pixels often binary-valued (black, white)
But greyscale or color is sometimes needed
300 dots per inch (dpi) gives the best results
But images are quite large (1 MB per page)
Faxes are normally 72 dpi
Usually stored in TIFF or PDF format
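
The "1 MB per page" figure can be checked with simple arithmetic. The sketch below assumes a US Letter page (8.5 x 11 inches), which the slides do not state, and computes the uncompressed size of a scan; the function name page_bytes is made up for illustration.

```python
def page_bytes(width_in, height_in, dpi, bits_per_pixel=1):
    """Uncompressed size in bytes of a scanned page."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8

# Assumed page size: US Letter, 8.5 x 11 inches
binary_300 = page_bytes(8.5, 11, 300)        # 1 bit per pixel (black/white)
color_300 = page_bytes(8.5, 11, 300, 24)     # 24 bits per pixel (color)

print(binary_300)   # 1051875 bytes, roughly 1 MB
print(color_300)    # 25245000 bytes, roughly 25 MB
```

Under these assumptions a binary 300 dpi page is about 1 MB before compression, and a 24-bit color scan of the same page is 24 times larger.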

8
Images

Pixel representation of intensity map
No explicit content, only relations
Image analysis
Attempts to mimic human visual behavior
Draw conclusions, hypothesize and verify

Image databases
Use primitive image analysis to represent content
Transform semantic queries into image features
Color, shape, texture, spatial relations

9
Document Images

Scanned Pixel representation of document
Data Intensive (100-300dpi, 1-24 bpp)
NO EXPLICIT CONTENT
Document image analysis or manual annotation
required
takes pixels -> contents
automatic means are not guaranteed
Yet we want to be able to process them like text
files!

10
Document Image Database

Collection of scanned images
Need to be available for indexing and retrieval,
abstracting, routing, editing, dissemination,
interpretation

11
Information Retrieval
Document Understanding
Document Image Retrieval
12
Managing Document Image Databases

Document Image Databases are often influenced by
traditional DB indexing and retrieval
philosophies
We are comfortable with them
They work
Problem: Requires content to be accessible
Techniques
Content based retrieval (keywords, natural
language)
Query by structure (logical/physical)
Query by Functional attributes (titles, bold, ...)
Requirements
Ability to Browse, search and read

13
Indexing Page Images(Traditional)
Page Image
Structure Representation
Document
Page Decomposition
Scanner
Text Regions
Character or Shape Codes
Optical Character Recognition
14
Document Image Analysis

General Flow
Obtain Image - Digitize
Preprocessing
Feature Extraction
Classification
General Tasks
Logical and Physical Page Structure Analysis
Zone Classification
Language ID
Zone Specific Processing
Recognition
Vectorization

15
Page Analysis

Skew correction
Based on finding the primary orientation of lines
Image and text region detection
Based on texture and dominant orientation
Structural classification
Infer logical structure from physical layout
Text region classification
Title, author, letterhead, signature block, etc.

16
Image Detection
17
Text Region Detection
18
Language Identification

Language-independent skew detection
Accommodate horizontal and vertical writing
Script class recognition
Asian scripts have blocky characters
Connected scripts can't be segmented easily
Language identification
Shape statistics work well for western languages
Competing classifiers work for Asian languages

19
Optical Character Recognition

Pattern-matching approach
Standard approach in commercial systems
Segment individual characters
Recognize using a neural network classifier
Hidden Markov model approach
Experimental approach developed at BBN
Segment into sub-character slices
Limited lookahead to find best character choice
Useful for connected scripts (e.g., Arabic)

20
OCR Accuracy Problems

Character segmentation errors
In English, segmentation often changes "m" to
"rn"
Character confusion
Characters with similar shapes often confounded
OCR on copies is much worse than on originals
Pixel bloom, character splitting, binding bend
Uncommon fonts can cause problems
If not used to train a neural network

21
Measures of OCR Accuracy

Character accuracy
Word accuracy
IDF coverage
Query coverage
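
The slides list these measures without defining them. As a minimal sketch, the code below uses one common convention for the first two (an assumption, not necessarily the lecture's definition): accuracy = (reference length - edit distance) / reference length, computed over characters or over words. IDF coverage and query coverage are not covered here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def accuracy(ref, hyp):
    """(len(ref) - errors) / len(ref); can go negative for very bad output."""
    return (len(ref) - edit_distance(ref, hyp)) / len(ref)

truth = "modern information retrieval"
ocr = "rnodern inforrnation retrieval"   # two m -> rn segmentation errors

char_acc = accuracy(truth, ocr)                   # 24/28, about 0.857
word_acc = accuracy(truth.split(), ocr.split())   # 1/3, about 0.333
print(round(char_acc, 3), round(word_acc, 3))
```

The example shows why the two measures diverge: four character-level errors cost about 14% of character accuracy but ruin two of the three words.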

22
Improving OCR Accuracy

Image preprocessing
Mathematical morphology for bloom and splitting
Particularly important for degraded images
Voting between several OCR engines helps
Individual systems depend on specific training
data
Linguistic analysis can correct some errors
Use confusion statistics, word lists, syntax, ...
But more harmful errors might be introduced

23
OCR Speed

Neural networks take about 10 seconds a page
Hidden Markov models are slower
Voting can improve accuracy
But at a substantial speed penalty
Easy to speed things up with several machines
For example, by batch processing - using desktop
computers at night

24
Problem: Logical Page Analysis (Reading Order)

Can be hard to guess in some cases
Newspaper columns, figure captions, appendices, ...
Sometimes there are explicit guides
"Continued on page 4" (but page 4 may be big!)
Structural cues can help
Column 1 might continue to column 2
Content analysis is also useful
Word co-occurrence statistics, syntax analysis

25
Processing Converted Text

Typical Document Image Indexing
Convert hardcopy to an electronic document
OCR
Page Layout Analysis
Graphics Recognition
Use structure to add metadata
Manually supplement with keywords
Use traditional text indexing and retrieval
techniques?

26
Information Retrieval on OCR

Requires robust ways of indexing
Statistical methods with large documents work
best
Key Evaluations
Success for high quality OCR (Croft et al 1994,
Taghva 1994)
Limited success for poor quality OCR (1996 TREC,
UNLV)
Clustering successful for > 85% accuracy (Tsuda
et al, 1995)

27
Proposed Solutions

Improve OCR
Automatic Correction
Taghva et al, 1994
Enhance IR techniques
Lopresti and Zhou, 1996
NGrams
Applications
Cornell CS TR Collection (Lagoze et al, 1995)
Degraded Text Simulator (Doermann and Yao, 1995)

28
N-Grams

Powerful, Inexpensive statistical method for
characterizing populations
Approach
Split up document into overlapping n-character
sequences
Use traditional indexing representations to
perform analysis
DOCUMENT -> DOC, OCU, CUM, UME, MEN, ENT
Advantages
Statistically robust to small numbers of errors
Rapid indexing and retrieval
Works from 70-85% character accuracy where
traditional IR fails
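
The DOCUMENT example can be reproduced in a few lines. This is a minimal sketch: char_ngrams and ngram_overlap are illustrative names, and the overlap score is a simple stand-in for a real retrieval model.

```python
def char_ngrams(text, n=3):
    """Overlapping character n-grams, in order of occurrence."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def ngram_overlap(a, b, n=3):
    """Fraction of a's distinct n-grams that also occur in b."""
    grams_a, grams_b = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    return len(grams_a & grams_b) / len(grams_a) if grams_a else 0.0

print(char_ngrams("DOCUMENT"))
# ['DOC', 'OCU', 'CUM', 'UME', 'MEN', 'ENT']

# One OCR error (M read as N) breaks an exact word match,
# but half of the trigrams still match.
print(ngram_overlap("DOCUMENT", "DOCUNENT"))   # 0.5
```

A single character error destroys a whole-word match but only the n-grams that span the error, which is the robustness the slide refers to.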

29
Matching with OCR Errors

Above 80% character accuracy, use words
With linguistic correction
Between 75% and 80%, use n-grams
With n somewhat shorter than usual
And perhaps with character confusion statistics
Below 75%, use word-length shape codes
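
The rule of thumb above can be written as a small lookup. This is only a sketch: the function name is hypothetical, and how the boundary values of exactly 75% and 80% are treated is an assumption, since the slide does not say.

```python
def indexing_unit(char_accuracy):
    """Pick an indexing unit from estimated OCR character accuracy (0.0-1.0)."""
    if char_accuracy > 0.80:
        return "words"          # with linguistic correction
    if char_accuracy >= 0.75:   # boundary handling is an assumption
        return "n-grams"        # with n somewhat shorter than usual
    return "shape codes"        # word-length shape codes

for acc in (0.95, 0.78, 0.60):
    print(acc, indexing_unit(acc))
```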

30
Handwriting Recognition

With stroke information, can be automated
Basis for input pads
Simple things can be read without strokes
Postal addresses, filled-in forms
Free text requires human interpretation
But repeated recognition is then possible

31
Conversion?

Full Conversion often required
Conversion is difficult!
Noisy data
Complex Layouts
Non-text components

Points to Ponder
Do we really need to convert?
Can we expect to fully describe documents
without assumptions?

32
Researchers are seeing a progression from full
conversion to image based approach

Applications
Indexing and Retrieval
Information Extraction
Duplicate Detection
Clustering (Document Similarity)
Summarization
Advantages
Makes use of powerful image properties (Function,
IVC 1998)
Can be cheaper than conversion
Makes use of redundancy in the language.

33
Outline

Processing Converted Text
Manipulating Images of Text
Title Extraction
Named Entity Extraction
Keyword Spotting
Abstracting and Summarization
Indexing based on Structure
Graphics and Drawings
Related Work and Applications

34
Processing Images of Text

Characteristics
Does not require expensive OCR/Conversion
Applicable to filtering applications
May be more robust to noise
Possible Disadvantages
Application domain may be very limited
Processing time may be an issue if indexing is
otherwise required

35
Proper Noun Detection (DeSilva and Hull, 1994)

Problem: Filter proper nouns in images of text
People, Places, Things
Advantages of the Image Domain
Saves converting all of the text
Allows application of word recognition approaches
Limits post-processing to a subset of words
Able to use features which are not available in
the text
Approach
Identify Word Features
Capitalization, location, length, and syntactic
categories
Classify using rule-set
Achieve 75-85% accuracy without conversion

36
Keyword Spotting

Techniques
Word Shape/HMM - (Chen et al, 1995)
Word Image Matching - (Trenkle and Vogt, 1993;
Hull et al)
Character Stroke Features - (Decurtins and Chen,
1995)
Shape Coding - (Tanaka and Torii; Spitz, 1995;
Kia, 1996)
Applications
Filing System (Spitz - SPAM, 1996)
Numerous IR
Processing handwritten documents
Formal Evaluation
Scribble vs. OCR (DeCurtins, SDIUT 1997)

37
Shape Coding

Approach
Use of Generic Character Descriptors
Make Use of Power of Language to resolve
ambiguity
Map characters based on shape features including
ascenders, descenders, punctuation and characters
with holes

38
Shape Codes

Group all characters that have similar shapes
A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P,
Q, R, S, T, U, V, W, X, Y, Z, 2, 3, 4, 5, 6, 7,
8, 9, 0
a, c, e, n, o, r, s, u, v, x, z
b, d, h, k,
f, t
g, p, q, y
i, j, l, 1
m, w
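
The grouping above translates directly into a lookup table. In this sketch the class labels ("A", "x", "b", ...) are arbitrary names chosen for illustration; only the groupings themselves come from the slide.

```python
# One label per shape class; labels are hypothetical, groups are from the slide.
SHAPE_CLASSES = {
    "A": "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567890",
    "x": "acenorsuvxz",
    "b": "bdhk",
    "f": "ft",
    "g": "gpqy",
    "i": "ijl1",
    "m": "mw",
}
CHAR_TO_CODE = {ch: code for code, chars in SHAPE_CLASSES.items() for ch in chars}

def shape_code(word):
    """Replace each character by its shape class; unknown characters pass through."""
    return "".join(CHAR_TO_CODE.get(ch, ch) for ch in word)

print(shape_code("Document"))                                  # Axxxmxxf
print(shape_code("cat"), shape_code("oat"), shape_code("eat")) # xxf xxf xxf
```

Different words can share one code ("cat", "oat" and "eat" all map to "xxf"), so a match on shape codes never misses the true word but may also return others.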

39
Why Use Shape Codes?

Can recognize shapes faster than characters
Seconds per page, and very accurate
Preserves recall, but with lower precision
Useful as a first pass in any system
Easily extracted from JBIG2 images
Because JBIG2 uses symbol-based compression

40
Additional Applications

Handwritten Archival Manuscripts
(Manmatha, 1997)
Page Classification
(Decurtins and Chen, 1995)
Matching Handwritten Records
(Ganzberger et al, 1994)
Headline Extraction
Document Image Compression (UMD, 1996-1998)

41
Outline

Processing Converted Text
Manipulating Images of Text
Indexing Based on Structure
Logical
Physical
Functional
Graphics and Drawings
Related Work and Applications

42
Document Functionality (ICDAR 1997)

Humans process documents very robustly
When interacting with documents, we can interpret
without recognition
We can judge relevance without reading
We can rapidly navigate documents to find the
information we want
Claims
We must provide basic ways to interact with
documents, and interaction often relies as much
on the structure of a document, as on the content
Traditional geometric properties and
type-dependent logical models are not sufficient

43
The Role of Documents

The role or function of a document is to store
data in symbolic form which has been produced by
a sender (the author) to facilitate transfer to
a receiver (the reader)
Documents are designed to be interpreted by
humans
Authors typically tailor this design to optimize
the transfer of information
Readers use structure to enhance interpretation
In what ways does the design facilitate,
disambiguate or enhance the flow of information?

45
Functional Structures
48
Outline

Processing Converted Text
Manipulating Images of Text
Indexing based on Structure
Graphics and Drawings
Related Work and Applications

49
Graphics

Maps and Drawings
Lorenz and Monagan, 1995
Samet and Soffer, 1995
Amlani and Kasturi, 1988
Graphs
Koga et al, 1993
Logos and Icons
Jaisimha et al, 1996
Doermann et al, 1996
Gudivada and Raghavan, 1993
Technical Drawings
Syeda-Mahmood, 1995

50
Map Interpretation (Samet et al)

Identify Legend on the Map Image
Extract images of map labels and descriptions
Identify labels in the map images
Allow user to query based on extracted images
Bootstraps the information extraction and
interpretation problems

51
Outline

Processing Converted Text
Manipulating Images of Text
Indexing based on Structure
Graphics and Drawings
Related Work and Applications

52
Duplicate Detection

Same content, same format
For example, a xerox copy
Same content, different format
For example, as a web page or on paper
Shared content, same format
For example, a paper with annotations
Shared content, different format
For example, including text with cut-and-paste

53
Duplicate Reconciliation
54
Approach

Use global features to restrict search
Number of pages, number of lines, page moments
Extract a signature
using shape codes
Convert signature
use a set of n-gram keys to index the database
Rank and verify
return top N documents
visual or algorithmic refinement
Advantages
Robust to noise, extracted quickly, extracted
easily, efficiently stored
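
The signature, n-gram key and ranking steps can be sketched as follows. Everything here is illustrative: the class labels, the key length n=5, the DuplicateIndex class and the vote-counting score are assumptions, and the global-feature filter and the final verification step are left out.

```python
from collections import Counter, defaultdict

# Shape classes from the "Shape Codes" slide; the labels are hypothetical.
CLASSES = {"A": "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567890", "x": "acenorsuvxz",
           "b": "bdhk", "f": "ft", "g": "gpqy", "i": "ijl1", "m": "mw"}
TO_CODE = {ch: code for code, chars in CLASSES.items() for ch in chars}

def signature(text):
    """Shape-code string of a page; characters outside the classes are dropped."""
    return "".join(TO_CODE[ch] for ch in text if ch in TO_CODE)

def keys(text, n=5):
    """Set of n-gram keys taken from the shape-code signature."""
    sig = signature(text)
    return {sig[i:i + n] for i in range(len(sig) - n + 1)}

class DuplicateIndex:
    def __init__(self, n=5):
        self.n = n
        self.postings = defaultdict(set)   # n-gram key -> ids of documents

    def add(self, doc_id, text):
        for key in keys(text, self.n):
            self.postings[key].add(doc_id)

    def query(self, text, top_n=3):
        """Rank stored documents by the number of keys shared with the query."""
        votes = Counter()
        for key in keys(text, self.n):
            for doc_id in self.postings.get(key, ()):
                votes[doc_id] += 1
        return votes.most_common(top_n)

index = DuplicateIndex()
index.add("memo", "Please review the attached budget before Friday")
index.add("news", "Heavy rain is expected across the region tonight")

# A noisy copy of the memo with three simulated OCR errors
hits = index.query("Please revievv the attached budqet before Fridav")
print(hits[0][0])   # memo
```

Because "g" and "q" fall in the same shape class, the "budqet" error does not change the signature at all, which illustrates the robustness to noise claimed on the slide.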

55
Cross-Language Duplicate Detection (finding
translations!)
56
Evaluation

The usual approach: Model-based evaluation
Apply confusion statistics to an existing
collection
A bit better: Print-scan evaluation
Scanning is slow, but availability is no problem
Best: Scan-only evaluation
No existing IR collections have printed materials
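
Model-based evaluation can be sketched as a simple character-level noise model. The confusion table below is invented for illustration (real statistics would be measured from OCR output), and degrade is a hypothetical helper name.

```python
import random

# Hypothetical confusion statistics: character -> [(replacement, probability)]
CONFUSIONS = {"m": [("rn", 0.10)], "e": [("c", 0.05), ("o", 0.02)]}

def degrade(text, confusions, rng):
    """Replace each character according to its confusion probabilities."""
    out = []
    for ch in text:
        draw = rng.random()            # uniform in [0, 1)
        cumulative = 0.0
        replacement = ch               # default: recognized correctly
        for alt, prob in confusions.get(ch, ()):
            cumulative += prob
            if draw < cumulative:
                replacement = alt
                break
        out.append(replacement)
    return "".join(out)

rng = random.Random(0)                 # fixed seed so runs are repeatable
print(degrade("document image retrieval", CONFUSIONS, rng))
```

Applying such a function to every document in an existing text collection yields a simulated OCR collection whose relevance judgments can be reused unchanged.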

57
Summary

Many applications benefit from image based
indexing
Less discriminatory features
Features may therefore be easier to compute
More robust to noise
Often computationally more efficient
Many classical IR techniques have application for
DIR
Structure as well as content are important for
indexing
Preservation of structure is essential for
in-depth understanding

58
Example Title Pages (4 9)
59
Title Page Overall Accuracy

57 Title pages, 891 non-title pages
Overall Accuracy 906/948 = 95.57%
Title Page Accuracy 37/57 = 64.91%
False Positives 22
False Negatives 20
Observations
All without Type-Specific Information
Need Functional (or Logical) Features
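
The figures on this slide are consistent with each other, as the short check below shows. The precision value at the end is derived here from the slide's counts and does not appear in the original.

```python
title_pages, non_title_pages = 57, 891
false_positives, false_negatives = 22, 20

total = title_pages + non_title_pages                  # 948 pages
true_positives = title_pages - false_negatives         # 37 title pages found
correct = total - false_positives - false_negatives    # 906 pages classified correctly

overall_accuracy = correct / total                     # 906/948
title_recall = true_positives / title_pages            # 37/57
# Derived, not on the slide: share of predicted title pages that are right
title_precision = true_positives / (true_positives + false_positives)  # 37/59

print(f"{overall_accuracy:.2%} {title_recall:.2%} {title_precision:.2%}")
```

The high overall accuracy is driven by the 891 non-title pages; on the title pages themselves roughly one in three is missed.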

60
Agenda

Questions
Definitions - Document, Image, Retrieval
Document Image Analysis
Traditional Indexing with Conversion
Doing things Without Conversion
Recent work on IR with Chinese document images
Tseng and Oard

61
Document Retrieval Approaches for Images of Text

Full-text search based on manually re-keying the
text
Prohibitively expensive at large scale
Search based on bibliographic metadata
May be difficult to adequately describe the
materials.
Full text based on Optical Character Recognition
(OCR)
Inexpensive and relatively rapid
Sensitive to OCR accuracy

62
Key Questions for Information Retrieval

What to index?
Phrase, words, character, or shape codes
Unigrams or n-grams
How to weight a term in a document?
Term frequency (TF)
Document frequency (DF)
Document length normalization
(Term position)
How to assign scores to documents?
Boolean, vector space, and probabilistic models

63
Chinese Text Retrieval Issues

Words may be any number of characters (typically
2-5)
But some contain only 1 character or more
than 5 characters
e.g., 貓 (cat), 聯合國教科文組織 (UNESCO)
Longer words (over 2 characters) often have
shorter sub-word units
Transliteration is an exception
Written Chinese has no word separator
A sentence can be segmented in different ways,
all may be legal
Similar to the phrase detection problem in
English
Chinese character inventory is very large
13,500 characters in Big-5 code (traditional
Chinese: Taiwan and Hong Kong)
Over 6,000 characters in GB code (simplified
Chinese: China, Singapore)
About 3,000 commonly used characters in each
character set

64
Socio-Cultural Research Center (SCRC) Collection

800,000 newspaper clippings from 1950-1976
Scanned over 300,000 at 300 dpi
30 China, Hong Kong, and Taiwan news agencies
Mostly simplified Chinese, some traditional
Chinese
Focus on diplomatic and military activities

65
Document Preparation

Selected 11,108 scanned document images
OCR yielded 8,438 valid docs (Presto! OCR Pro,
Big-5)
Avg valid document had a 69% system-reported
recognition rate
Computed on a sample of 1,300 documents
Second version prepared using Big-5 to GB
conversion
GB version used in experiments

66
Topic Preparation

Based on contemporaneous Chinese journal articles
From 100 paper titles, 30 were selected and
rewritten as Chinese topics
Made English translations for cross-language
experiments
Translated by native speakers of Chinese

<top>
<num> 12
<title> Anti-Chinese Movements
<description> Activities related to the anti-Chinese movements in Indonesia
<narrative> Articles must deal with activities related to the anti-Chinese movement in Indonesia; case reports or articles dealing with PRC's criticism of the Anti-Chinese movement will be considered partly relevant.
</top>

67
Relevance Judgments

Exhaustive tri-state relevance judgments
Irrelevant (0), partially relevant (1), fully
relevant (2)
Every topic-document pair judged by 3 assessors
2 majored in history, 1 majored in library
science
Averaged 4 minutes per document image (for all 30
topics)
Sum of the judgments provides a final estimate
0 = not relevant, 1-5 = partially relevant, 6 = fully
relevant
Threshold as desired to reflect the intended
application
In our experiments, any score > 0 is treated as
relevant
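
The scoring scheme can be sketched as below: three assessors each give 0, 1 or 2, the sum runs from 0 to 6, and a threshold turns it into a binary decision. Function names are illustrative.

```python
def combined_score(judgments):
    """Sum of three assessors' judgments, each 0, 1 or 2."""
    assert len(judgments) == 3 and all(j in (0, 1, 2) for j in judgments)
    return sum(judgments)

def label(score):
    """Map a 0-6 combined score to the three-way label used on the slide."""
    if score == 0:
        return "not relevant"
    if score == 6:
        return "fully relevant"
    return "partially relevant"

def is_relevant(score, threshold=0):
    """Binary relevance; the experiments treat any score above 0 as relevant."""
    return score > threshold

for judgments in [(0, 0, 0), (1, 0, 2), (2, 2, 2)]:
    score = combined_score(judgments)
    print(judgments, score, label(score), is_relevant(score))
```

Raising the threshold (for example to 3) would model an application that needs majority agreement before a document counts as relevant.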

68
Chinese OCR Text Retrieval Strategies

Indexing method
Both 1-gram (for partial match) and 2-gram (for
preserving sequence)
Example: "ABC" will be indexed with A, B,
C, AB, BC
Compared to 1-gram only and 2-gram only
Weighting scheme
Document terms: TF-IDF = log(1 + tf) × log(N/df)
Query terms: tf × (3w - 1), where w is the length
of the term
Retrieval model
Vector space model compared with probabilistic
model
Document length normalization
byte size for document terms, compared to cosine
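
A minimal sketch of the indexing and weighting steps follows. It reads the document weight as log(1 + tf) × log(N/df) and the query weight as tf × (3w - 1); the operators are partly lost in the transcript, so both readings and the use of the natural logarithm are assumptions.

```python
import math

def index_terms(text):
    """All 1-grams followed by all overlapping 2-grams of a string."""
    unigrams = list(text)
    bigrams = [text[i:i + 2] for i in range(len(text) - 1)]
    return unigrams + bigrams

def doc_weight(tf, df, n_docs):
    """Assumed reading of the slide: log(1 + tf) * log(N / df), natural log."""
    return math.log(1 + tf) * math.log(n_docs / df)

def query_weight(tf, term):
    """Assumed reading of the slide: tf * (3w - 1), w = length of the term."""
    return tf * (3 * len(term) - 1)

print(index_terms("ABC"))          # ['A', 'B', 'C', 'AB', 'BC']
print(query_weight(1, "A"))        # 2
print(query_weight(1, "AB"))       # 5
print(doc_weight(tf=1, df=10, n_docs=1000))
```

Under this reading a matching 2-gram counts 2.5 times as much as a matching 1-gram in the query, which rewards preserved character sequence over isolated character matches.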

69
OCR and Length Normalization

Experiments by Taghva et al showed that
some sophisticated weighting schemes shown to be
more effective for ordinary text might lead to
more unstable results for OCR degraded text.
Singhal, Salton, and Buckley (1996) analyzed this
phenomenon using
Vector space model (SMART system)
Word-based indexing
simulated OCR output of a TREC collection (2GB of
742,202 docs)
50 TREC queries (numbered from 151 to 200)
Specifically, effects of cosine normalization and
IDF are analyzed
Incorrect terms like "systom" have large IDF and
thus affect weights of other terms in the same
document if cosine normalization is used
They correct this problem by using byte size
normalization
(byte size)^0.375
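
The effect can be seen with a toy document. The term weights and the 4096-byte document size below are made-up numbers, and the exponent is read as a power, (byte size)^0.375; the sketch only compares the two normalizations and is not the SMART implementation.

```python
import math

def cosine_normalize(weights):
    """Divide each term weight by the Euclidean length of the weight vector."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

def byte_size_normalize(weights, byte_size, exponent=0.375):
    """Divide each term weight by (byte size) ** exponent."""
    factor = byte_size ** exponent
    return {term: w / factor for term, w in weights.items()}

clean = {"system": 2.0, "retrieval": 3.0}    # hypothetical TF-IDF weights
noisy = {**clean, "systom": 9.0}             # OCR error: rare term, large IDF

cos_clean = cosine_normalize(clean)["retrieval"]
cos_noisy = cosine_normalize(noisy)["retrieval"]
byte_clean = byte_size_normalize(clean, 4096)["retrieval"]
byte_noisy = byte_size_normalize(noisy, 4096)["retrieval"]

print(round(cos_clean, 3), round(cos_noisy, 3))    # 0.832 0.309
print(byte_clean == byte_noisy)                    # True
```

With cosine normalization the single erroneous term cuts the weight of "retrieval" by more than half, while the byte-size factor depends only on document length and leaves the correct terms untouched.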