Title: Document Image Retrieval
1. Document Image Retrieval
- LBSC 796/CMSC 828o
- Douglas W. Oard
- April 12, 2004
- Mostly adapted from a lecture by David Doermann
2. Agenda
- Questions
- Definitions - Document, Image, Retrieval
- Document Image Analysis
- Page decomposition
- Optical character recognition
- Traditional Indexing with Conversion
- Confusion matrix
- Shape codes
- Doing Things Without Conversion
- Duplicate detection, classification, summarization, abstracting
- Keyword spotting, etc.
- Example: Chinese document images
3. Goals of this Class
- Expand your definition of what a DOCUMENT is
- Get an appreciation of the issues in document image indexing
- Look at different ways of solving the same problems with different media
- Your job: compare/contrast with other media
4. Document
- Basic Medium for Recording Information
- Transient
- Space
- Time
- Multiple Forms
- Hardcopy (paper, stone, ...) / Electronic (CD-ROM, Internet, ...)
- Written / Auditory / Visual (symbolic, scenic)
- Access Requirements
- Search
- Browse
- Read
5. Sources of Document Images
- The Web
- Some PDF files come from scanned documents
- Arabic news stories are often GIF images
- Digital copiers
- Produce corporate memory as a byproduct
- Digitization projects
- Provide improved access to hardcopy documents
6. Some Definitions
- Modality
- A means of expression
- Linguistic modalities
- Electronic text, printed, handwritten, spoken, signed
- Nonlinguistic modalities
- Music, drawings, paintings, photographs, video
- Media
- The means by which the expression reaches you
- Internet, videotape, paper, canvas, ...
7. Document Images
- A collection of dots called pixels
- Arranged in a grid and called a bitmap
- Pixels often binary-valued (black, white)
- But greyscale or color is sometimes needed
- 300 dots per inch (dpi) gives the best results
- But images are quite large (1 MB per page; see the arithmetic sketch below)
- Faxes are normally 72 dpi
- Usually stored in TIFF or PDF format
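As a rough check on the numbers above, here is the arithmetic for one bi-level page (a minimal sketch; the 8.5 x 11 inch page size is my assumption, not stated on the slide):

```python
# Size of a bi-level (1 bit per pixel) letter-size page scanned at 300 dpi.
width_px = int(8.5 * 300)           # 2550 pixels across
height_px = 11 * 300                # 3300 pixels down
total_bits = width_px * height_px   # one bit per pixel
print(total_bits / 8 / 2**20)       # ~1.0 MB uncompressed, before TIFF/PDF compression
```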
8. Images
- Pixel representation of intensity map
- No explicit content, only relations
- Image analysis
- Attempts to mimic human visual behavior
- Draw conclusions, hypothesize and verify
- Image databases
- Use primitive image analysis to represent content
- Transform semantic queries into image features: color, shape, texture, spatial relations
9. Document Images
- Scanned: a pixel representation of the document
- Data intensive (100-300 dpi, 1-24 bpp)
- NO EXPLICIT CONTENT
- Document image analysis or manual annotation required
- Takes pixels → content
- Automatic means are not guaranteed
- Yet we want to be able to process them like text files!
10. Document Image Database
- Collection of scanned images
- Need to be available for indexing and retrieval, abstracting, routing, editing, dissemination, interpretation
11. Information Retrieval / Document Understanding / Document Image Retrieval
12. Managing Document Image Databases
- Document image databases are often influenced by traditional DB indexing and retrieval philosophies
- We are comfortable with them
- They work
- Problem: requires content to be accessible
- Techniques
- Content-based retrieval (keywords, natural language)
- Query by structure (logical/physical)
- Query by functional attributes (titles, bold, ...)
- Requirements
- Ability to browse, search, and read
13. Indexing Page Images (Traditional)
- (Pipeline diagram: Document → Scanner → Page Image → Page Decomposition → Text Regions → Optical Character Recognition → Character or Shape Codes → Structure Representation)
14. Document Image Analysis
- General Flow
- Obtain image / digitize
- Preprocessing
- Feature Extraction
- Classification
- General Tasks
- Logical and Physical Page Structure Analysis
- Zone Classification
- Language ID
- Zone Specific Processing
- Recognition
- Vectorization
15. Page Analysis
- Skew correction (see the sketch below)
- Based on finding the primary orientation of lines
- Image and text region detection
- Based on texture and dominant orientation
- Structural classification
- Infer logical structure from physical layout
- Text region classification
- Title, author, letterhead, signature block, etc.
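As a rough illustration of the skew-correction idea above, this sketch scores candidate rotations by the variance of row ink counts; text lines give a strongly peaked row profile when the page is upright. It assumes numpy and scipy are available, and the angle range and step are arbitrary choices, not values from the lecture.

```python
import numpy as np
from scipy.ndimage import rotate

def estimate_skew(binary_page, max_angle=5.0, step=0.25):
    """Estimate page skew in degrees by maximizing the variance of row sums.

    binary_page: 2-D array with 1 where there is ink and 0 for background.
    """
    best_angle, best_score = 0.0, -1.0
    for angle in np.arange(-max_angle, max_angle + step, step):
        rotated = rotate(binary_page, angle, reshape=False, order=0)
        profile = rotated.sum(axis=1)   # ink per row
        score = profile.var()           # peaked profile -> text lines are horizontal
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle

# Usage: deskewed = rotate(page, -estimate_skew(page), reshape=False)
```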
16. Image Detection
17. Text Region Detection
18. Language Identification
- Language-independent skew detection
- Accommodate horizontal and vertical writing
- Script class recognition
- Asian scripts have blocky characters
- Connected scripts can't be segmented easily
- Language identification
- Shape statistics work well for western languages
- Competing classifiers work for Asian languages
19. Optical Character Recognition
- Pattern-matching approach
- Standard approach in commercial systems
- Segment individual characters
- Recognize using a neural network classifier
- Hidden Markov model approach
- Experimental approach developed at BBN
- Segment into sub-character slices
- Limited lookahead to find best character choice
- Useful for connected scripts (e.g., Arabic)
20. OCR Accuracy Problems
- Character segmentation errors
- In English, segmentation often changes "m" to "rn"
- Character confusion
- Characters with similar shapes often confounded
- OCR on copies is much worse than on originals
- Pixel bloom, character splitting, binding bend
- Uncommon fonts can cause problems
- If not used to train a neural network
21. Measures of OCR Accuracy
- Character accuracy
- Word accuracy (both illustrated in the sketch below)
- IDF coverage
- Query coverage
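To make the first two measures concrete, here is a minimal sketch (my own illustration, not from the lecture) that scores OCR output against a reference transcription with edit distance:

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]

def accuracy(reference, hypothesis):
    return 1.0 - edit_distance(reference, hypothesis) / max(len(reference), 1)

truth, ocr = "the modern system", "the rnodern systom"
char_acc = accuracy(truth, ocr)                   # character accuracy
word_acc = accuracy(truth.split(), ocr.split())   # word accuracy
```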
22. Improving OCR Accuracy
- Image preprocessing
- Mathematical morphology for bloom and splitting
- Particularly important for degraded images
- Voting between several OCR engines helps
- Individual systems depend on specific training data
- Linguistic analysis can correct some errors (see the dictionary-lookup sketch below)
- Use confusion statistics, word lists, syntax, ...
- But more harmful errors might be introduced
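One simple form of the word-list correction mentioned above is to snap each OCR token to its closest dictionary entry. This is a hedged sketch using Python's standard difflib; a real system would also weight candidates by character confusion statistics.

```python
import difflib

LEXICON = ["system", "modern", "document", "the"]   # toy word list

def correct_token(token, lexicon=LEXICON, cutoff=0.75):
    """Replace an OCR token with its closest word-list entry, if one is close enough."""
    match = difflib.get_close_matches(token.lower(), lexicon, n=1, cutoff=cutoff)
    return match[0] if match else token

print(correct_token("systom"))   # -> "system"
print(correct_token("qwzjx"))    # no close match, left unchanged
```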
23. OCR Speed
- Neural networks take about 10 seconds a page
- Hidden Markov models are slower
- Voting can improve accuracy
- But at a substantial speed penalty
- Easy to speed things up with several machines
- For example, by batch processing using desktop computers at night
24. Problem: Logical Page Analysis (Reading Order)
- Can be hard to guess in some cases
- Newspaper columns, figure captions, appendices, ...
- Sometimes there are explicit guides
- "Continued on page 4" (but page 4 may be big!)
- Structural cues can help
- Column 1 might continue to column 2
- Content analysis is also useful
- Word co-occurrence statistics, syntax analysis
25. Processing Converted Text
- Typical Document Image Indexing
- Convert hardcopy to an electronic document
- OCR
- Page Layout Analysis
- Graphics Recognition
- Use structure to add metadata
- Manually supplement with keywords
- Use traditional text indexing and retrieval techniques?
26. Information Retrieval on OCR
- Requires robust ways of indexing
- Statistical methods with large documents work best
- Key evaluations
- Success for high-quality OCR (Croft et al., 1994; Taghva, 1994)
- Limited success for poor-quality OCR (1996 TREC, UNLV)
- Clustering successful for > 85% accuracy (Tsuda et al., 1995)
27. Proposed Solutions
- Improve OCR
- Automatic Correction
- Taghva et al, 1994
- Enhance IR techniques
- Lopresti and Zhou, 1996
- N-Grams
- Applications
- Cornell CS TR Collection (Lagoze et al, 1995)
- Degraded Text Simulator (Doermann and Yao, 1995)
28. N-Grams
- Powerful, inexpensive statistical method for characterizing populations
- Approach
- Split the document into overlapping n-character sequences (see the sketch below)
- Use traditional indexing representations to perform analysis
- DOCUMENT → DOC, OCU, CUM, UME, MEN, ENT
- Advantages
- Statistically robust to small numbers of errors
- Rapid indexing and retrieval
- Works from 70-85% character accuracy, where traditional IR fails
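A minimal sketch of the character n-gram idea above (my own illustration; trigrams and the Dice coefficient are common but arbitrary choices here), showing why a single OCR error damages only a few of a word's n-grams:

```python
def char_ngrams(text, n=3):
    """Overlapping character n-grams of a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def dice(a, b):
    """Dice coefficient between two n-gram sets."""
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

clean = char_ngrams("DOCUMENT")   # {'DOC', 'OCU', 'CUM', 'UME', 'MEN', 'ENT'}
noisy = char_ngrams("DOCUMFNT")   # one OCR error: E misread as F
print(dice(clean, noisy))         # 0.5 -- still a strong match despite the error
```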
29. Matching with OCR Errors
- Above 80% character accuracy, use words
- With linguistic correction
- Between 75% and 80%, use n-grams
- With n somewhat shorter than usual
- And perhaps with character confusion statistics
- Below 75%, use word-length shape codes
30. Handwriting Recognition
- With stroke information, can be automated
- Basis for input pads
- Simple things can be read without strokes
- Postal addresses, filled-in forms
- Free text requires human interpretation
- But repeated recognition is then possible
31. Conversion?
- Full Conversion often required
- Conversion is difficult!
- Noisy data
- Complex Layouts
- Non-text components
- Points to Ponder
- Do we really need to convert?
- Can we expect to fully describe documents without assumptions?
32. Researchers are seeing a progression from full conversion to image-based approaches
- Applications
- Indexing and Retrieval
- Information Extraction
- Duplicate Detection
- Clustering (Document Similarity)
- Summarization
- Advantages
- Makes use of powerful image properties (Function, IVC 1998)
- Can be cheaper than conversion
- Makes use of redundancy in the language.
33. Outline
- Processing Converted Text
- Manipulating Images of Text
- Title Extraction
- Named Entity Extraction
- Keyword Spotting
- Abstracting and Summarization
- Indexing based on Structure
- Graphics and Drawings
- Related Work and Applications
34. Processing Images of Text
- Characteristics
- Does not require expensive OCR/Conversion
- Applicable to filtering applications
- May be more robust to noise
- Possible Disadvantages
- Application domain may be very limited
- Processing time may be an issue if indexing is otherwise required
35. Proper Noun Detection (DeSilva and Hull, 1994)
- Problem: filter proper nouns in images of text
- People, Places, Things
- Advantages of the Image Domain
- Saves converting all of the text
- Allows application of word recognition approaches
- Limits post-processing to a subset of words
- Able to use features which are not available in the text
- Approach
- Identify word features
- Capitalization, location, length, and syntactic categories
- Classify using a rule set
- Achieve 75-85% accuracy without conversion
36. Keyword Spotting
- Techniques
- Word Shape/HMM (Chen et al., 1995)
- Word Image Matching (Trenkle and Vogt, 1993; Hull et al.)
- Character Stroke Features (Decurtins and Chen, 1995)
- Shape Coding (Tanaka and Torii; Spitz, 1995; Kia, 1996)
- Applications
- Filing System (Spitz - SPAM, 1996)
- Numerous IR
- Processing handwritten documents
- Formal Evaluation
- Scribble vs. OCR (DeCurtins, SDIUT 1997)
37. Shape Coding
- Approach
- Use of Generic Character Descriptors
- Make use of the power of language to resolve ambiguity
- Map characters based on shape features, including ascenders, descenders, punctuation, and characters with holes
38. Shape Codes
- Group all characters that have similar shapes (see the mapping sketch below)
- A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z, 2, 3, 4, 5, 6, 7, 8, 9, 0
- a, c, e, n, o, r, s, u, v, x, z
- b, d, h, k
- f, t
- g, p, q, y
- i, j, l, 1
- m, w
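A minimal sketch of the grouping on this slide (my own illustration; the one-letter code chosen for each group is an arbitrary label, not from the lecture):

```python
# Map each character class listed above to an arbitrary one-letter code.
SHAPE_GROUPS = {
    "A": "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567890",  # capitals and most digits
    "x": "acenorsuvxz",                          # small x-height letters
    "b": "bdhk",                                 # ascenders
    "f": "ft",                                   # ascenders with a crossbar
    "g": "gpqy",                                 # descenders
    "i": "ijl1",                                 # narrow strokes
    "m": "mw",                                   # wide letters
}
CODE = {ch: code for code, chars in SHAPE_GROUPS.items() for ch in chars}

def shape_code(word):
    """Word-level shape code: one symbol per character, '?' if unmapped."""
    return "".join(CODE.get(ch, "?") for ch in word)

print(shape_code("Document"))   # 'Axxxmxxf' -- many distinct words share a code
```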
39. Why Use Shape Codes?
- Can recognize shapes faster than characters
- Seconds per page, and very accurate
- Preserves recall, but with lower precision
- Useful as a first pass in any system
- Easily extracted from JPEG-2 images
- Because JPEG-2 uses object-based compression
40. Additional Applications
- Handwritten Archival Manuscripts
- (Manmatha, 1997)
- Page Classification
- (Decurtins and Chen, 1995)
- Matching Handwritten Records
- (Ganzberger et al, 1994)
- Headline Extraction
- Document Image Compression (UMD, 1996-1998)
41. Outline
- Processing Converted Text
- Manipulating Images of Text
- Indexing Based on Structure
- Logical
- Physical
- Functional
- Graphics and Drawings
- Related Work and Applications
42. Document Functionality (ICDAR 1997)
- Humans process documents very robustly
- When interacting with documents, we can interpret without recognition
- We can judge relevance without reading
- We can rapidly navigate documents to find the information we want
- Claims
- We must provide basic ways to interact with documents, and interaction often relies as much on the structure of a document as on its content
- Traditional geometric properties and type-dependent logical models are not sufficient
43. The Role of Documents
- The role or function of a document is to store data in symbolic form, produced by a sender (the author) to facilitate transfer to a receiver (the reader)
- Documents are designed to be interpreted by humans
- Authors typically tailor this design to optimize the transfer of information
- Readers use structure to enhance interpretation
- In what ways does the design facilitate, disambiguate, or enhance the flow of information?
45. Functional Structures
48. Outline
- Processing Converted Text
- Manipulating Images of Text
- Indexing based on Structure
- Graphics and Drawings
- Related Work and Applications
49. Graphics
- Maps and Drawings
- Lorenz and Monagan, 1995
- Samet and Soffer, 1995
- Amlani and Kasturi, 1988
- Graphs
- Koga et al, 1993
- Logos and Icons
- Jaisimha et al, 1996
- Doermann et al, 1996
- Gudivada and Raghavan, 1993
- Technical Drawings
- Syeda-Mahmood, 1995
50. Map Interpretation (Samet et al.)
- Identify the legend on the map image
- Extract images, map labels, and descriptions
- Identify labels in the map images
- Allow the user to query based on extracted images
- Bootstraps the information extraction and interpretation problems
51. Outline
- Processing Converted Text
- Manipulating Images of Text
- Indexing based on Structure
- Graphics and Drawings
- Related Work and Applications
52. Duplicate Detection
- Same content, same format
- For example, a xerox copy
- Same content, different format
- For example, as a web page or on paper
- Shared content, same format
- For example, a paper with annotations
- Shared content, different format
- For example, including text with cut-and-paste
53. Duplicate Reconciliation
54. Approach (sketched in code below)
- Use global features to restrict search
- Number of pages, number of lines, page moments
- Extract a signature
- using shape codes
- Convert signature
- use a set of n-gram keys to index the database
- Rank and verify
- return top N documents
- visual or algorithmic refinement
- Advantages
- Robust to noise; extracted quickly and easily; efficiently stored
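A minimal, self-contained sketch of the approach above (my own illustration; a real system would compute the n-gram keys over shape-coded text and filter first on the global features, both simplified away here):

```python
from collections import defaultdict

def signature(page_text, n=4):
    """Crude page signature: the set of character n-gram keys of its text."""
    s = page_text.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

class DuplicateIndex:
    def __init__(self):
        self.postings = defaultdict(set)   # n-gram key -> document ids
        self.sigs = {}

    def add(self, doc_id, page_text):
        sig = signature(page_text)
        self.sigs[doc_id] = sig
        for key in sig:
            self.postings[key].add(doc_id)

    def query(self, page_text, top_n=5):
        """Rank stored documents by overlap of n-gram keys with the query page."""
        sig = signature(page_text)
        votes = defaultdict(int)
        for key in sig:
            for doc_id in self.postings[key]:
                votes[doc_id] += 1
        ranked = sorted(votes, key=lambda d: votes[d] / len(self.sigs[d] | sig),
                        reverse=True)
        return ranked[:top_n]   # candidates for visual or algorithmic verification
```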
55. Cross-Language Duplicate Detection (finding translations!)
56. Evaluation
- The usual approach: model-based evaluation
- Apply confusion statistics to an existing collection (see the degradation sketch below)
- A bit better: print-scan evaluation
- Scanning is slow, but availability is no problem
- Best: scan-only evaluation
- No existing IR collections have printed materials
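A minimal sketch of model-based degradation as described above (my own illustration; the confusion probabilities are invented placeholders, not measured statistics):

```python
import random

# Toy character confusion model: character -> list of (substitute, probability).
CONFUSIONS = {
    "m": [("rn", 0.05)],   # segmentation error
    "e": [("c", 0.03)],
    "l": [("1", 0.04)],
    "o": [("0", 0.02)],
}

def degrade(text, confusions=CONFUSIONS, rng=random.random):
    """Simulate OCR output by applying confusion statistics to clean text."""
    out = []
    for ch in text:
        for substitute, prob in confusions.get(ch, []):
            if rng() < prob:
                ch = substitute
                break
        out.append(ch)
    return "".join(out)

print(degrade("modern document collections"))   # occasionally 'rnodern', 'c0llections', ...
```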
57. Summary
- Many applications benefit from image-based indexing
- Less discriminatory features
- Features may therefore be easier to compute
- More robust to noise
- Often computationally more efficient
- Many classical IR techniques have application for DIR
- Structure as well as content is important for indexing
- Preservation of structure is essential for in-depth understanding
58. Example Title Pages (4 9)
59. Title Page Overall Accuracy
- 57 title pages, 891 non-title pages
- Overall accuracy: 906/948 = 95.57%
- Title page accuracy: 37/57 = 64.91%
- False positives: 22
- False negatives: 20
- Observations
- All without type-specific information
- Need functional (or logical) features
60. Agenda
- Questions
- Definitions - Document, Image, Retrieval
- Document Image Analysis
- Traditional Indexing with Conversion
- Doing things Without Conversion
- Recent work on IR with Chinese document images
- Tseng and Oard
61. Document Retrieval Approaches for Images of Text
- Full-text search based on manually re-keying the text
- Prohibitively expensive at large scale
- Search based on bibliographic metadata
- May be difficult to adequately describe the materials
- Full text based on Optical Character Recognition (OCR)
- Inexpensive and relatively rapid
- Sensitive to OCR accuracy
62. Key Questions for Information Retrieval
- What to index?
- Phrases, words, characters, or shape codes
- Unigrams or n-grams
- How to weight a term in a document?
- Term frequency (TF)
- Document frequency (DF)
- Document length normalization
- (Term position)
- How to assign scores to documents?
- Boolean, vector space, and probabilistic models
63. Chinese Text Retrieval Issues
- Words may be any number of characters (typically 2-5)
- But some contain only 1 character or more than 5 characters
- e.g., 猫 (cat), 联合国教科文组织 (UNESCO)
- Longer words (over 2 characters) often have shorter sub-word units
- Transliteration is an exception
- Written Chinese has no word separator
- A sentence can be segmented in different ways, all of which may be legal (see the sketch below)
- Similar to the phrase detection problem in English
- The Chinese character inventory is very large
- 13,500 characters in Big-5 code (traditional Chinese: Taiwan and Hong Kong)
- Over 6,000 characters in GB code (simplified Chinese: China, Singapore)
- About 3,000 commonly used characters in each character set
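A minimal sketch of the segmentation ambiguity noted above (my own illustration with a tiny hypothetical lexicon; 发展中国家 means "developing country"):

```python
def segmentations(sentence, lexicon):
    """Enumerate every way to split an unsegmented string into lexicon words."""
    if not sentence:
        return [[]]
    results = []
    for i in range(1, len(sentence) + 1):
        word = sentence[:i]
        if word in lexicon:
            for rest in segmentations(sentence[i:], lexicon):
                results.append([word] + rest)
    return results

# Toy lexicon, just for illustration.
LEXICON = {"发展", "发展中", "中", "中国", "国家", "家"}
for seg in segmentations("发展中国家", LEXICON):
    print(" / ".join(seg))
# 发展 / 中 / 国家
# 发展 / 中国 / 家
# 发展中 / 国家
```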
64. Socio-Cultural Research Center (SCRC) Collection
- 800,000 newspaper clippings from 1950-1976
- Scanned over 300,000 at 300 dpi
- 30 news agencies from China, Hong Kong, and Taiwan
- Mostly simplified Chinese, some traditional Chinese
- Focus on diplomatic and military activities
65. Document Preparation
- Selected 11,108 scanned document images
- OCR yielded 8,438 valid docs (Presto! OCR Pro, Big-5)
- Avg valid document had a 69% system-reported recognition rate
- Computed on a sample of 1,300 documents
- Second version prepared using Big-5 to GB conversion
- GB version used in experiments
66. Topic Preparation
- Based on contemporaneous Chinese journal articles
- From 100 paper titles, 30 were selected and rewritten as Chinese topics
- Made English translations for cross-language experiments
- Translated by native speakers of Chinese
- Example topic:
<top> <num> 12 <title> Anti-Chinese Movements
<description> Activities related to the anti-Chinese movements in Indonesia
<narrative> Articles must deal with activities related to the anti-Chinese movement in Indonesia; case reports or articles dealing with PRC's criticism of the anti-Chinese movement will be considered partly relevant.
</top>
67. Relevance Judgments
- Exhaustive tri-state relevance judgments
- Irrelevant (0), partially relevant (1), fully relevant (2)
- Every topic-document pair judged by 3 assessors
- 2 majored in history, 1 majored in library science
- Averaged 4 minutes per document image (for all 30 topics)
- Sum of the judgments provides a final estimate
- 0 = not relevant, 1-5 = partially relevant, 6 = fully relevant
- Threshold as desired to reflect the intended application
- In our experiments, any score > 0 is treated as relevant
68. Chinese OCR Text Retrieval Strategies
- Indexing method
- Both 1-gram (for partial match) and 2-gram (for preserving sequence)
- Example: "ABC" will be indexed with "A", "B", "C", "AB", "BC"
- Compared to 1-gram only and 2-gram only
- Weighting scheme (see the sketch below)
- Document terms: TF x IDF = log(1 + tf) x log(N/df)
- Query terms: tf x (3w - 1), where w is the length of the term
- Retrieval model
- Vector space model compared with probabilistic model
- Document length normalization
- Byte size for document terms, compared to cosine
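A minimal sketch of the indexing and weighting scheme above (my own illustration; the collection statistics passed in are toy values):

```python
import math

def chinese_terms(text):
    """Index both 1-grams and overlapping 2-grams, e.g. 'ABC' -> A, B, C, AB, BC."""
    unigrams = list(text)
    bigrams = [text[i:i + 2] for i in range(len(text) - 1)]
    return unigrams + bigrams

def doc_term_weight(tf, df, N):
    """Document-side weight: log(1 + tf) * log(N / df)."""
    return math.log(1 + tf) * math.log(N / df)

def query_term_weight(tf, term):
    """Query-side weight: tf * (3w - 1), where w is the term length in characters."""
    return tf * (3 * len(term) - 1)

print(chinese_terms("ABC"))                   # ['A', 'B', 'C', 'AB', 'BC']
print(doc_term_weight(tf=3, df=10, N=8438))   # N: number of docs (toy use of the SCRC count)
print(query_term_weight(tf=1, term="AB"))     # a 2-character query term gets weight 5
```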
69. OCR and Length Normalization
- Experiments by Taghva et al. showed that some sophisticated weighting schemes, shown to be more effective for ordinary text, might lead to more unstable results for OCR-degraded text
- Singhal, Salton, and Buckley (1996) analyzed this phenomenon using
- Vector space model (SMART system)
- Word-based indexing
- Simulated OCR output of a TREC collection (2 GB, 742,202 docs)
- 50 TREC queries (numbered from 151 to 200)
- Specifically, the effects of cosine normalization and IDF are analyzed
- Incorrect terms like "systom" have large IDF and thus affect the weights of other terms in the same document if cosine normalization is used
- They correct this problem by using byte size normalization: (byte size)^0.375 (see the sketch below)
70. Results Summary
- 1-gram + 2-gram indexing is best
- ByteSize beats Cosine
- Long queries beat Titles
- Inquery does well
- (Results table: mean average precision for long queries and title queries)
71. Conclusions of Study
- The SCRC test collection is useful
- But more than 30 topics may be needed for statistical significance
- Indexing 1-grams and 2-grams together works well
- If 2-grams are given greater weight in the query
- Byte size normalization outperforms cosine normalization
- But Inquery does better than either on short queries
- OCR errors adversely affect blind relevance feedback
- A clean comparable collection would probably work better
- Pruning seems to help
- Considerable parameter tuning is needed (?, ?, and k)