Document Image Retrieval - PowerPoint PPT Presentation

Provided by: daviddo5

1
Document Image Retrieval
  • LBSC 796/CMSC 828o
  • Douglas W. Oard
  • April 12, 2004
  • mostly adapted from
  • A lecture by David Doermann

2
Agenda
  • Questions
  • Definitions - Document, Image, Retrieval
  • Document Image Analysis
  • Page decomposition
  • Optical character recognition
  • Traditional Indexing with Conversion
  • Confusion matrix
  • Shape codes
  • Doing things Without Conversion
  • Duplicate Detection, Classification,
    Summarization, Abstracting
  • Keyword spotting, etc
  • Example Chinese document images

3
Goals of this Class
  • Expand your definition of what is a DOCUMENT
  • To get an appreciation of the issues in document
    image indexing
  • To look at different ways of solving the same
    problems with different media
  • Your job: compare/contrast with other media

4
Document
  • Basic Medium for Recording Information
  • Transient
  • Space
  • Time
  • Multiple Forms
  • Hardcopy (paper, stone, ...) / Electronic (CD-ROM,
    Internet, ...)
  • Written/Auditory/Visual (symbolic, scenic)
  • Access Requirements
  • Search
  • Browse
  • Read

5
Sources of Document Images
  • The Web
  • Some PDF files come from scanned documents
  • Arabic news stories are often GIF images
  • Digital copiers
  • Produce corporate memory as a byproduct
  • Digitization projects
  • Provide improved access to hardcopy documents

6
Some Definitions
  • Modality
  • A means of expression
  • Linguistic modalities
  • Electronic text, printed, handwritten, spoken,
    signed
  • Nonlinguistic modalities
  • Music, drawings, paintings, photographs, video
  • Media
  • The means by which the expression reaches you
  • Internet, videotape, paper, canvas, ...

7
Document Images
  • A collection of dots called pixels
  • Arranged in a grid and called a bitmap
  • Pixels often binary-valued (black, white)
  • But greyscale or color is sometimes needed
  • 300 dots per inch (dpi) gives the best results
  • But images are quite large (1 MB per page)
  • Faxes are normally 72 dpi
  • Usually stored in TIFF or PDF format
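
As a rough check on the figure of about 1 MB per page, the sketch below computes the uncompressed size of a page bitmap. The letter-size page (8.5 x 11 inches) and the function name `bitmap_bytes` are assumptions made for illustration; they do not come from the slides.

```python
def bitmap_bytes(width_in, height_in, dpi, bits_per_pixel=1):
    """Uncompressed size in bytes of a scanned page bitmap."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    # Round up to a whole number of bytes.
    return (total_bits + 7) // 8


# Assumed page size: US letter, 8.5 x 11 inches.
binary_page = bitmap_bytes(8.5, 11, dpi=300)                     # black and white
color_page = bitmap_bytes(8.5, 11, dpi=300, bits_per_pixel=24)   # 24-bit color
print(binary_page, color_page)
```

Under these assumptions a binary page at 300 dpi comes to roughly 1 MB before compression, and a 24-bit color scan of the same page is 24 times larger.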

8
Images
  • Pixel representation of intensity map
  • No explicit content, only relations
  • Image analysis
  • Attempts to mimic human visual behavior
  • Draw conclusions, hypothesize and verify

  • Image databases
  • Use primitive image analysis to represent content
  • Transform semantic queries into image features
    (color, shape, texture, spatial relations)

9
Document Images
  • Scanned Pixel representation of document
  • Data Intensive (100-300 dpi, 1-24 bpp)
  • NO EXPLICIT CONTENT
  • Document image analysis or manual annotation
    required
  • takes pixels -> contents
  • automatic means are not guaranteed
  • Yet we want to be able to process them like text
    files!

10
Document Image Database
  • Collection of scanned images
  • Need to be available for indexing and retrieval,
    abstracting, routing, editing, dissemination,
    interpretation

11
Information Retrieval
Document Understanding
Document Image Retrieval
12
Managing Document Image Databases
  • Document Image Databases are often influenced by
    traditional DB indexing and retrieval
    philosophies
  • We are comfortable with them
  • They work
  • Problem: requires content to be accessible
  • Techniques
  • Content based retrieval (keywords, natural
    language)
  • Query by structure (logical/physical)
  • Query by functional attributes (titles, bold, ...)
  • Requirements
  • Ability to Browse, search and read

13
Indexing Page Images (Traditional)
[Pipeline figure. Labels: Document, Scanner, Page Image, Page Decomposition, Structure Representation, Text Regions, Optical Character Recognition, Character or Shape Codes]
14
Document Image Analysis
  • General Flow
  • Obtain Image - Digitize
  • Preprocessing
  • Feature Extraction
  • Classification
  • General Tasks
  • Logical and Physical Page Structure Analysis
  • Zone Classification
  • Language ID
  • Zone Specific Processing
  • Recognition
  • Vectorization

15
Page Analysis
  • Skew correction
  • Based on finding the primary orientation of lines
  • Image and text region detection
  • Based on texture and dominant orientation
  • Structural classification
  • Infer logical structure from physical layout
  • Text region classification
  • Title, author, letterhead, signature block, etc.

16
Image Detection
17
Text Region Detection
18
Language Identification
  • Language-independent skew detection
  • Accommodate horizontal and vertical writing
  • Script class recognition
  • Asian scripts have blocky characters
  • Connected scripts can't be segmented easily
  • Language identification
  • Shape statistics work well for western languages
  • Competing classifiers work for Asian languages

19
Optical Character Recognition
  • Pattern-matching approach
  • Standard approach in commercial systems
  • Segment individual characters
  • Recognize using a neural network classifier
  • Hidden Markov model approach
  • Experimental approach developed at BBN
  • Segment into sub-character slices
  • Limited lookahead to find best character choice
  • Useful for connected scripts (e.g., Arabic)

20
OCR Accuracy Problems
  • Character segmentation errors
  • In English, segmentation often changes "m" to "rn"
  • Character confusion
  • Characters with similar shapes often confounded
  • OCR on copies is much worse than on originals
  • Pixel bloom, character splitting, binding bend
  • Uncommon fonts can cause problems
  • If not used to train a neural network

21
Measures of OCR Accuracy
  • Character accuracy
  • Word accuracy
  • IDF coverage
  • Query coverage
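
The slide lists these measures without defining them. One common definition of character and word accuracy, assumed here, is (n - errors) / n, where n is the length of the ground-truth text and errors is the edit distance between the ground truth and the OCR output. The function names are illustrative, and IDF coverage and query coverage are not covered by this sketch.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def accuracy(ref, hyp):
    """(n - errors) / n, with n the length of the reference sequence."""
    if not ref:
        raise ValueError("reference must not be empty")
    return (len(ref) - edit_distance(ref, hyp)) / len(ref)


truth = "modern information"
ocr = "rnodern information"   # the "m" to "rn" segmentation error

char_acc = accuracy(truth, ocr)                  # measured over characters
word_acc = accuracy(truth.split(), ocr.split())  # measured over words
print(round(char_acc, 3), word_acc)
```

A single segmentation error costs two character edits but invalidates a whole word, which is why word accuracy falls much faster than character accuracy.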

22
Improving OCR Accuracy
  • Image preprocessing
  • Mathematical morphology for bloom and splitting
  • Particularly important for degraded images
  • Voting between several OCR engines helps
  • Individual systems depend on specific training
    data
  • Linguistic analysis can correct some errors
  • Use confusion statistics, word lists, syntax, ...
  • But more harmful errors might be introduced

23
OCR Speed
  • Neural networks take about 10 seconds a page
  • Hidden Markov models are slower
  • Voting can improve accuracy
  • But at a substantial speed penalty
  • Easy to speed things up with several machines
  • For example, by batch processing - using desktop
    computers at night

24
Problem: Logical Page Analysis (Reading Order)
  • Can be hard to guess in some cases
  • Newspaper columns, figure captions, appendices, ...
  • Sometimes there are explicit guides
  • "Continued on page 4" (but page 4 may be big!)
  • Structural cues can help
  • Column 1 might continue to column 2
  • Content analysis is also useful
  • Word co-occurrence statistics, syntax analysis

25
Processing Converted Text
  • Typical Document Image Indexing
  • Convert hardcopy to an electronic document
  • OCR
  • Page Layout Analysis
  • Graphics Recognition
  • Use structure to add metadata
  • Manually supplement with keywords
  • Use traditional text indexing and retrieval
    techniques?

26
Information Retrieval on OCR
  • Requires robust ways of indexing
  • Statistical methods with large documents work
    best
  • Key Evaluations
  • Success for high quality OCR (Croft et al 1994,
    Taghva 1994)
  • Limited success for poor quality OCR (1996 TREC,
    UNLV)
  • Clustering successful for > 85% accuracy (Tsuda
    et al, 1995)

27
Proposed Solutions
  • Improve OCR
  • Automatic Correction
  • Taghva et al, 1994
  • Enhance IR techniques
  • Lopresti and Zhou, 1996
  • N-grams
  • Applications
  • Cornell CS TR Collection (Lagoze et al, 1995)
  • Degraded Text Simulator (Doermann and Yao, 1995)

28
N-Grams
  • Powerful, Inexpensive statistical method for
    characterizing populations
  • Approach
  • Split up document into overlapping n-character
    sequences
  • Use traditional indexing representations to
    perform analysis
  • DOCUMENT -> DOC, OCU, CUM, UME, MEN, ENT
  • Advantages
  • Statistically robust to small numbers of errors
  • Rapid indexing and retrieval
  • Works from 70-85% character accuracy, where
    traditional IR fails
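
The DOCUMENT example can be reproduced with a few lines of code. This is a minimal sketch, assuming overlapping character trigrams and no case folding or whitespace handling; the function name and the misrecognized word "DOCUMFNT" are illustrative.

```python
def char_ngrams(text, n=3):
    """Overlapping character n-grams, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


clean = char_ngrams("DOCUMENT")
print(clean)

# A single OCR substitution (E read as F) damages only the n-grams that
# overlap the error; the others still match.
noisy = char_ngrams("DOCUMFNT")
shared = set(clean) & set(noisy)
print(sorted(shared))
```

With one wrong character, half of the trigrams of this word still match, whereas a whole-word index would miss the term entirely.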

29
Matching with OCR Errors
  • Above 80% character accuracy, use words
  • With linguistic correction
  • Between 75% and 80%, use n-grams
  • With n somewhat shorter than usual
  • And perhaps with character confusion statistics
  • Below 75%, use word-length shape codes
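
These rules of thumb amount to a simple decision on estimated character accuracy. The sketch below encodes them, reading the thresholds as 80% and 75%; the slide does not say which side the boundary values fall on, so that choice and the function name are assumptions.

```python
def choose_index_terms(char_accuracy):
    """Pick an indexing unit from estimated OCR character accuracy (0.0 to 1.0)."""
    if not 0.0 <= char_accuracy <= 1.0:
        raise ValueError("char_accuracy must be between 0 and 1")
    if char_accuracy > 0.80:
        return "words"        # with linguistic correction
    if char_accuracy >= 0.75:
        return "n-grams"      # with n somewhat shorter than usual
    return "shape codes"      # word-length shape codes


for acc in (0.95, 0.78, 0.60):
    print(acc, choose_index_terms(acc))
```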

30
Handwriting Recognition
  • With stroke information, can be automated
  • Basis for input pads
  • Simple things can be read without strokes
  • Postal addresses, filled-in forms
  • Free text requires human interpretation
  • But repeated recognition is then possible

31
Conversion?
  • Full Conversion often required
  • Conversion is difficult!
  • Noisy data
  • Complex Layouts
  • Non-text components
  • Points to Ponder
  • Do we really need to convert?
  • Can we expect to fully describe documents
    without assumptions?

32
Researchers are seeing a progression from full
conversion to image-based approaches
  • Applications
  • Indexing and Retrieval
  • Information Extraction
  • Duplicate Detection
  • Clustering (Document Similarity)
  • Summarization
  • Advantages
  • Makes use of powerful image properties (Function,
    IVC 1998)
  • Can be cheaper than conversion
  • Makes use of redundancy in the language.

33
Outline
  • Processing Converted Text
  • Manipulating Images of Text
  • Title Extraction
  • Named Entity Extraction
  • Keyword Spotting
  • Abstracting and Summarization
  • Indexing based on Structure
  • Graphics and Drawings
  • Related Work and Applications

34
Processing Images of Text
  • Characteristics
  • Does not require expensive OCR/Conversion
  • Applicable to filtering applications
  • May be more robust to noise
  • Possible Disadvantages
  • Application domain may be very limited
  • Processing time may be an issue if indexing is
    otherwise required

35
Proper Noun Detection (DeSilva and Hull, 1994)
  • Problem: filter proper nouns in images of text
  • People, Places, Things
  • Advantages of the Image Domain
  • Saves converting all of the text
  • Allows application of word recognition approaches
  • Limits post-processing to a subset of words
  • Able to use features which are not available in
    the text
  • Approach
  • Identify Word Features
  • Capitalization, location, length, and syntactic
    categories
  • Classify using rule-set
  • Achieve 75-85% accuracy without conversion

36
Keyword Spotting
  • Techniques
  • Word Shape/HMM - (Chen et al, 1995)
  • Word Image Matching - (Trenkle and Vogt, 1993;
    Hull et al)
  • Character Stroke Features - (Decurtins and Chen,
    1995)
  • Shape Coding - (Tanaka and Torii; Spitz, 1995;
    Kia, 1996)
  • Applications
  • Filing System (Spitz - SPAM, 1996)
  • Numerous IR
  • Processing handwritten documents
  • Formal Evaluation
  • Scribble vs. OCR (DeCurtins, SDIUT 1997)

37
Shape Coding
  • Approach
  • Use of Generic Character Descriptors
  • Make Use of Power of Language to resolve
    ambiguity
  • Map Character based on Shape features including
    ascenders, descenders, punctuation and character
    with holes

38
Shape Codes
  • Group all characters that have similar shapes
  • A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P,
    Q, R, S, T, U, V, W, X, Y, Z, 2, 3, 4, 5, 6, 7,
    8, 9, 0
  • a, c, e, n, o, r, s, u, v, x, z
  • b, d, h, k
  • f, t
  • g, p, q, y
  • i, j, l, 1
  • m, w
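
The grouping above can be turned into a lookup table that maps each character to its shape class. In this sketch the character groups are copied from the slide, while the one-letter class labels ("A", "x", "b", and so on) are arbitrary names invented for illustration. Real systems derive the classes from image features rather than from known text.

```python
# Character groups from the slide; the class labels (dict keys) are made up.
SHAPE_CLASSES = {
    "A": "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567890",  # capitals and tall digits
    "x": "acenorsuvxz",                           # x-height letters
    "b": "bdhk",                                  # ascenders
    "t": "ft",
    "g": "gpqy",                                  # descenders
    "i": "ijl1",
    "m": "mw",                                    # wide letters
}
CHAR_TO_CODE = {ch: code for code, chars in SHAPE_CLASSES.items() for ch in chars}


def shape_code(word):
    """Replace every character by its shape class; unknown characters become '?'."""
    return "".join(CHAR_TO_CODE.get(ch, "?") for ch in word)


print(shape_code("Document"))
print(shape_code("dog"), shape_code("hay"))
```

Different words can share a code, as "dog" and "hay" do here. A shape-coded query therefore still finds every true match but also returns some false ones.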

39
Why Use Shape Codes?
  • Can recognize shapes faster than characters
  • Seconds per page, and very accurate
  • Preserves recall, but with lower precision
  • Useful as a first pass in any system
  • Easily extracted from JPEG-2 images
  • Because JPEG-2 uses object-based compression

40
Additional Applications
  • Handwritten Archival Manuscripts
  • (Manmatha, 1997)
  • Page Classification
  • (Decurtins and Chen, 1995)
  • Matching Handwritten Records
  • (Ganzberger et al, 1994)
  • Headline Extraction
  • Document Image Compression (UMD, 1996-1998)

41
Outline
  • Processing Converted Text
  • Manipulating Images of Text
  • Indexing Based on Structure
  • Logical
  • Physical
  • Functional
  • Graphics and Drawings
  • Related Work and Applications

42
Document Functionality (ICDAR 1997)
  • Humans process documents very robustly
  • When interacting with documents, we can interpret
    without recognition
  • We can judge relevance without reading
  • We can rapidly navigate documents to find the
    information we want
  • Claims
  • We must provide basic ways to interact with
    documents, and interaction often relies as much
    on the structure of a document, as on the content
  • Traditional geometric properties and
    type-dependent logical models are not sufficient

43
The Role of Documents
  • The role or function of a document is to store
    data in symbolic form which has been produced by
    a sender (the author) to facilitate transfer to
    a receiver (the reader)
  • Documents are designed to be interpreted by
    humans
  • Authors typically tailor this design to optimize
    the transfer of information
  • Readers use structure to enhance interpretation
  • In what ways does the design facilitate,
    disambiguate or enhance the flow of information?

45
Functional Structures
48
Outline
  • Processing Converted Text
  • Manipulating Images of Text
  • Indexing based on Structure
  • Graphics and Drawings
  • Related Work and Applications

49
Graphics
  • Maps and Drawings
  • Lorenz and Monagan, 1995
  • Samet and Soffer, 1995
  • Amlani and Kasturi, 1988
  • Graphs
  • Koga et al, 1993
  • Logos and Icons
  • Jaisimha et al, 1996
  • Doermann et al, 1996
  • Gudivada and Raghavan, 1993
  • Technical Drawings
  • Syeda-Mahmood, 1995

50
Map Interpretation (Samet et al)
  • Identify Legend on the Map Image
  • Extract images of map labels and descriptions
  • Identify labels in the map images
  • Allow user to query based on extracted images
  • Bootstraps the information extraction and
    interpretation problems

51
Outline
  • Processing Converted Text
  • Manipulating Images of Text
  • Indexing based on Structure
  • Graphics and Drawings
  • Related Work and Applications

52
Duplicate Detection
  • Same content, same format
  • For example, a xerox copy
  • Same content, different format
  • For example, as a web page or on paper
  • Shared content, same format
  • For example, a paper with annotations
  • Shared content, different format
  • For example, including text with cut-and-paste

53
Duplicate Reconciliation
54
Approach
  • Use global features to restrict search
  • Number of pages, number of lines, page moments
  • Extract a signature
  • using shape codes
  • Convert signature
  • use a set of n-gram keys to index the database
  • Rank and verify
  • return top N documents
  • visual or algorithmic refinement
  • Advantages
  • Robust to noise, extracted quickly, extracted
    easily, efficiently stored
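
The signature, n-gram key, and ranking steps can be sketched as a small inverted index. Several things here are assumptions: plain text stands in for shape codes that a real system would extract from the page image, the class labels and the key length of 5 are invented for illustration, and both the global-feature filter and the final verification step are left out.

```python
from collections import Counter, defaultdict

# Shape classes as on the Shape Codes slide; the class labels are placeholders.
_CLASSES = {"A": "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567890", "x": "acenorsuvxz",
            "b": "bdhk", "t": "ft", "g": "gpqy", "i": "ijl1", "m": "mw"}
_CODE = {ch: code for code, chars in _CLASSES.items() for ch in chars}


def signature(text):
    """Shape-code each word and join the coded words with single spaces."""
    return " ".join("".join(_CODE.get(ch, "?") for ch in word)
                    for word in text.split())


def ngram_keys(sig, n=5):
    """Set of overlapping n-grams of the signature, used as index keys."""
    return {sig[i:i + n] for i in range(len(sig) - n + 1)}


class DuplicateIndex:
    def __init__(self, n=5):
        self.n = n
        self.postings = defaultdict(set)   # key -> ids of documents containing it

    def add(self, doc_id, text):
        for key in ngram_keys(signature(text), self.n):
            self.postings[key].add(doc_id)

    def query(self, text, top_n=3):
        """Rank indexed documents by the number of keys shared with the query."""
        votes = Counter()
        for key in ngram_keys(signature(text), self.n):
            for doc_id in self.postings.get(key, ()):
                votes[doc_id] += 1
        return votes.most_common(top_n)


index = DuplicateIndex()
index.add("a", "the quick brown fox jumps over the lazy dog")
index.add("b", "document image retrieval with shape codes")

# A degraded copy of document "a": the "w" was split into "vv".
results = index.query("the quick brovvn fox jumps over the lazy dog")
print(results)
```

Because only the keys that overlap the damaged word change, the degraded copy still shares most of its keys with the original and ranks it first.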

55
Cross-Language Duplicate Detection (finding
translations!)
56
Evaluation
  • The usual approach: model-based evaluation
  • Apply confusion statistics to an existing
    collection
  • A bit better: print-scan evaluation
  • Scanning is slow, but availability is no problem
  • Best: scan-only evaluation
  • No existing IR collections have printed materials

57
Summary
  • Many applications benefit from image based
    indexing
  • Less discriminatory features
  • Features may therefore be easier to compute
  • More robust to noise
  • Often computationally more efficient
  • Many classical IR techniques have application for
    DIR
  • Structure as well as content are important for
    indexing
  • Preservation of structure is essential for
    in-depth understanding

58
Example Title Pages (4 9)
59
Title Page Overall Accuracy
  • 57 Title pages, 891 non-title pages
  • Overall Accuracy: 906/948 = 95.57%
  • Title Page Accuracy: 37/57 = 64.91%
  • False Positives: 22
  • False Negatives: 20
  • Observations
  • All without Type-Specific Information
  • Need Functional (or Logical) Features
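
The reported figures are mutually consistent, as the short calculation below shows. The page counts and error counts are taken from the slide. Precision is not reported there; it is derived here from the same counts as an extra check.

```python
title_pages = 57
non_title_pages = 891
false_positives = 22   # non-title pages labeled as title pages
false_negatives = 20   # title pages that were missed

total = title_pages + non_title_pages
true_positives = title_pages - false_negatives
true_negatives = non_title_pages - false_positives

overall_accuracy = (true_positives + true_negatives) / total
title_page_accuracy = true_positives / title_pages   # recall on title pages
precision = true_positives / (true_positives + false_positives)

print(f"overall accuracy    {100 * overall_accuracy:.2f}%")
print(f"title page accuracy {100 * title_page_accuracy:.2f}%")
print(f"precision           {100 * precision:.2f}%")
```

The high overall accuracy mostly reflects the large number of non-title pages. On the title pages themselves, about a third are missed.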

60
Agenda
  • Questions
  • Definitions - Document, Image, Retrieval
  • Document Image Analysis
  • Traditional Indexing with Conversion
  • Doing things Without Conversion
  • Recent work on IR with Chinese document images
  • Tseng and Oard

61
Document Retrieval Approaches for Images of Text
  • Full-text search based on manually re-keying the
    text
  • Prohibitively expensive at large scale
  • Search based on bibliographic metadata
  • May be difficult to adequately describe the
    materials.
  • Full text based on Optical Character Recognition
    (OCR)
  • Inexpensive and relatively rapid
  • Sensitive to OCR accuracy

62
Key Questions for Information Retrieval
  • What to index?
  • Phrase, words, character, or shape codes
  • Unigrams or n-grams
  • How to weight a term in a document?
  • Term frequency (TF)
  • Document frequency (DF)
  • Document length normalization
  • (Term position)
  • How to assign scores to documents?
  • Boolean, vector space, and probabilistic models

63
Chinese Text Retrieval Issues
  • Words may be any number of characters (typically
    2-5)
  • But some contain only 1 character or more
    than 5 characters
  • e.g., 貓 (cat), 聯合國教科文組織 (UNESCO)
  • Longer words (over 2 characters) often have
    shorter sub-word units
  • Transliteration is an exception
  • Written Chinese has no word separator
  • A sentence can be segmented in different ways,
    all may be legal
  • Similar to the phrase detection problem in
    English
  • Chinese character inventory is very large
  • 13,500 characters in Big-5 code (traditional
    Chinese: Taiwan and Hong Kong)
  • Over 6,000 characters in GB code (simplified
    Chinese: China, Singapore)
  • About 3,000 commonly used characters in each
    character set

64
Socio-Cultural Research Center (SCRC) Collection
  • 800,000 newspaper clippings from 1950-1976
  • Scanned over 300,000 at 300 dpi
  • 30 China, Hong Kong, and Taiwan news agencies
  • Mostly simplified Chinese, some traditional
    Chinese
  • Focus on diplomatic and military activities

65
Document Preparation
  • Selected 11,108 scanned document images
  • OCR yielded 8,438 valid docs (Presto! OCR Pro,
    Big-5)
  • Avg valid document had a 69% system-reported
    recognition rate
  • Computed on a sample of 1,300 documents
  • Second version prepared using Big-5 to GB
    conversion
  • GB version used in experiments

66
Topic Preparation
  • Based on contemporaneous Chinese journal articles
  • From 100 paper titles, 30 were selected and
    rewritten as Chinese topics
  • Made English translations for cross-language
    experiments
  • Translated by native speakers of Chinese

<top>
<num> 12
<title> Anti-Chinese Movements
<description> Activities related to the anti-Chinese movements in Indonesia
<narrative> Articles must deal with activities related to the anti-Chinese
movement in Indonesia; case reports or articles dealing with PRC's criticism
of the Anti-Chinese movement will be considered partly relevant.
</top>

67
Relevance Judgments
  • Exhaustive tri-state relevance judgments
  • Irrelevant (0), partially relevant (1), fully
    relevant (2)
  • Every topic-document pair judged by 3 assessors
  • 2 majored in history, 1 majored in library
    science
  • Averaged 4 minutes per document image (for all 30
    topics)
  • Sum of the judgments provides a final estimate
  • 0 = not relevant, 1-5 = partially relevant,
    6 = fully relevant
  • Threshold as desired to reflect the intended
    application
  • In our experiments, any score > 0 is treated as
    relevant
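
The scoring scheme can be written down directly: three assessors each give 0, 1, or 2, the sum runs from 0 to 6, and a threshold turns it into a binary decision. This is a minimal sketch; the function names are illustrative and the default threshold of 0 follows the setting described for the experiments.

```python
def combine_judgments(judgments):
    """Sum three tri-state judgments (0, 1, or 2 each) into a 0-6 score."""
    if len(judgments) != 3 or any(j not in (0, 1, 2) for j in judgments):
        raise ValueError("expected three judgments, each 0, 1, or 2")
    return sum(judgments)


def describe(score):
    """Map a summed score to the three labels used on the slide."""
    if score == 0:
        return "not relevant"
    if score == 6:
        return "fully relevant"
    return "partially relevant"


def is_relevant(score, threshold=0):
    """Binary relevance: a score above the threshold counts as relevant."""
    return score > threshold


score = combine_judgments([0, 1, 2])
print(score, describe(score), is_relevant(score))
```

Raising the threshold gives a stricter notion of relevance, for example `is_relevant(score, threshold=3)` to require that the assessors mostly agree.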

68
Chinese OCR Text Retrieval Strategies
  • Indexing method
  • Both 1-gram (for partial match) and 2-gram (for
    preserving sequence)
  • Example: "ABC" will be indexed with "A", "B",
    "C", "AB", "BC"
  • Compared to 1-gram only and 2-gram only
  • Weighting scheme
  • document terms: TF-IDF = log(1 + tf) × log(N/df)
  • query terms: tf × (3w - 1), where w is the length
    of the term
  • Retrieval model
  • Vector space model compared with probabilistic
    model
  • Document length normalization
  • byte size for document terms, compared to cosine
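
The indexing and weighting scheme can be sketched as follows, reading the slide's formulas as log(1 + tf) × log(N/df) for document terms and tf × (3w - 1) for query terms. The function names are illustrative and natural logarithms are an assumption, since the slide does not give the base. Latin letters stand in for Chinese characters, as in the slide's own example.

```python
import math


def index_terms(text):
    """All character 1-grams followed by all overlapping 2-grams."""
    unigrams = list(text)
    bigrams = [text[i:i + 2] for i in range(len(text) - 1)]
    return unigrams + bigrams


def document_weight(tf, df, n_docs):
    """log(1 + tf) * log(N / df); the log base is assumed to be e."""
    return math.log(1 + tf) * math.log(n_docs / df)


def query_weight(tf, term):
    """tf * (3w - 1), where w is the term length in characters."""
    w = len(term)
    return tf * (3 * w - 1)


print(index_terms("ABC"))
print(query_weight(1, "A"), query_weight(1, "AB"))
print(round(document_weight(tf=3, df=10, n_docs=1000), 3))
```

On this reading a 2-gram gets 2.5 times the query weight of a 1-gram, so matching a character pair in sequence counts for more than matching its two characters separately.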

69
OCR and Length Normalization
  • Experiments by Taghva et al showed that
  • some sophisticated weighting schemes shown to be
    more effective for ordinary text might lead to
    more unstable results for OCR degraded text.
  • Singhal, Salton, and Buckley (1996) analyzed this
    phenomenon using
  • Vector space model (SMART system)
  • Word-based indexing
  • simulated OCR output of a TREC collection (2GB of
    742,202 docs)
  • 50 TREC queries (numbered from 151 to 200)
  • Specifically, effects of cosine normalization and
    IDF are analyzed
  • Incorrect terms like "systom" have large IDF and
    thus affect weights of other terms in the same
    document if cosine normalization is used
  • They correct this problem by using byte size
    normalization
  • (byte size)^0.375
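
A toy calculation shows the effect being described. The term weights and the document size below are invented for illustration; only the exponent 0.375 and the contrast between cosine and byte-size normalization come from the slide.

```python
import math


def cosine_normalize(weights):
    """Divide each weight by the Euclidean length of the weight vector."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


def byte_size_normalize(weights, n_bytes, exponent=0.375):
    """Divide each weight by (document size in bytes) ** exponent."""
    norm = n_bytes ** exponent
    return {term: w / norm for term, w in weights.items()}


# Made-up TF-IDF weights. The OCR'd copy contains one misrecognized term,
# "systom", which is rare in the collection and so has a very high IDF.
clean = {"system": 2.0, "retrieval": 3.0}
noisy = {"system": 2.0, "retrieval": 3.0, "systom": 9.0}
n_bytes = 4096   # assumed size, the same for both copies

cos_clean = cosine_normalize(clean)["retrieval"]
cos_noisy = cosine_normalize(noisy)["retrieval"]
byte_clean = byte_size_normalize(clean, n_bytes)["retrieval"]
byte_noisy = byte_size_normalize(noisy, n_bytes)["retrieval"]

print(round(cos_clean, 3), round(cos_noisy, 3))
print(round(byte_clean, 3), round(byte_noisy, 3))
```

Under cosine normalization the single bad term sharply lowers the weight of the correctly recognized term "retrieval". Under byte-size normalization that weight is unchanged, because the document's size in bytes does not depend on which terms were misrecognized.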

70
Results Summary
  • Combined 1-gram + 2-gram indexing is best
  • ByteSize beats Cosine
  • Long queries beat Titles
  • Inquery does well

[Chart: mean average precision for long queries and title queries]

71
Conclusions of Study
  • The SCRC test collection is useful
  • But more than 30 topics may be needed for
    statistical significance
  • Indexing 1-grams and 2-grams together works well
  • If 2-grams are given greater weight in the query
  • Byte size normalization outperforms cosine
    normalization
  • But Inquery does better than either on short
    queries
  • OCR errors adversely affect blind relevance
    feedback
  • A clean comparable collection would probably work
    better
  • Pruning seems to help
  • Considerable parameter tuning is needed (?, ?,
    and k)