Provided by: Faculte2
1
Document Image Indexing
2
Indexing of document images
  • Apply IR techniques in modified form
  • Different approaches
  • Methods differ in how much analysis they do, i.e.
    how rich the document representations involved
    are.
  • Image objects and/or image structure
  • layout objects and/or layout structure
  • logical objects and/or logical structure

3
Approaches
  • full document conversion
  • geometric analysis, OCR, logical analysis
  • results mostly incomplete
  • methods remain valid only when OCR quality is
    reasonable
  • expensive process, not always feasible to apply
    to millions of pages
  • partial document conversion
  • only recognize important features
  • cheap analysis or cheap pre-processing with
    expensive processing of limited document parts
    only

4
Text characterization (deSilva)
  • Focus on proper nouns
  • Examples
  • names of people, places, important objects
  • Characteristics
  • important for indexing
  • difficult to do extensive post-processing as the
    set of proper nouns can be very large

5
Text characterization (deSilva)
  • Observations made in experiments
  • 95.5% of proper nouns are capitalized
  • 35% of capitalized words are proper nouns
  • at the beginning of a sentence: 10% proper nouns
  • 85% of capitalized words following a one-letter
    word are proper nouns
  • average length is one character larger than for
    other words
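
The observations above suggest a cheap candidate filter that needs no dictionary. A minimal sketch in Python, assuming the text is already split into word tokens; the function name `proper_noun_score` and the idea of returning the observed percentage directly as a score are illustrative assumptions, only the percentages come from the slide.

```python
def proper_noun_score(words, i):
    """Rough score in [0, 1] that words[i] is a proper noun.

    Assumption: the slide's percentages are reused directly as scores.
    """
    word = words[i]
    if not word[:1].isupper():
        # 95.5% of proper nouns are capitalized, so this sketch simply
        # rejects uncapitalized words (an approximation).
        return 0.0
    prev = words[i - 1] if i > 0 else None
    if prev is None or prev.endswith((".", "!", "?")):
        return 0.10  # capitalized at the beginning of a sentence
    if len(prev) == 1 and prev.isalpha():
        return 0.85  # capitalized word following a one-letter word
    return 0.35      # any other capitalized word


words = "Yesterday we saw a Paris cafe .".split()
scores = [proper_noun_score(words, i) for i in range(len(words))]
```

Words with a high score could then be passed on to the more expensive high level analysis.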

6
Text characterization
  • Proper nouns
  • High level features: syntactic category of
    previous/next word
  • Candidate nouns
  • Low level features: capitalization, length of
    word, length of previous/next word, position in
    the sentence
  • Box based image abstraction
  • Characters and words
7
Identification of document function (Doermann)
  • Document functions
  • reading: the user is supposed to read the whole
    document
  • browsing: the user is supposed to quickly go
    through the document
  • searching: the user is supposed to look for
    specific parts of the document
  • Observed properties
  • reading: few titles, large content blocks
  • browsing: large number of head/body pairings
  • searching: large number of small similar-sized
    blocks

8
Identification of document function: Reading,
Browsing, or Searching?
(Figure panels: Browsing, Searching, Reading)
9
Identification of document function
  • Document function
  • High level features: distribution of functional
    units
  • Salient regions: titles, abstracts, index keys
    etc.
  • Low level features: zone properties, position
    on the page
  • Box based image abstraction
  • Zones
10
Presentation
  • Basis is logical tree
  • Functions
  • reading: depth first search of the logical tree
  • browsing: pruned depth first search of the
    logical tree
  • searching: decision tree based on the logical
    tree
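
The reading and browsing presentations can be sketched as two traversals of the same logical tree. This is a minimal illustration in Python, not the presenter's implementation: the `Node` class, the `kind` labels, and the pruning rule (only descend into titles and headings) are assumptions made for the example. The searching case is left out because its decision tree depends on the user.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    kind: str = "body"  # hypothetical kinds: "title", "heading", "body"
    children: list = field(default_factory=list)


def reading_order(node):
    """Reading: full depth first traversal of the logical tree."""
    yield node.label
    for child in node.children:
        yield from reading_order(child)


def browsing_order(node, keep=("title", "heading")):
    """Browsing: depth first traversal that prunes every subtree whose
    root is not of a kind listed in `keep` (assumed pruning rule)."""
    if node.kind not in keep:
        return
    yield node.label
    for child in node.children:
        yield from browsing_order(child, keep)


doc = Node("Manual", "title", [
    Node("1 Setup", "heading", [Node("p1"), Node("p2")]),
    Node("2 Use", "heading", [Node("p3")]),
])
full = list(reading_order(doc))
skim = list(browsing_order(doc))
```

Here `full` visits every node in reading order, while `skim` contains only the title and the two headings.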

11
Presentation by document function
(Figure panels: Searching, where the path depends on the user; Browsing; Reading)
12
Layout similarity (Doermann)
  • Which documents are similar?

13
Layout similarity
  • Different measures required
  • mapping between the typed components
  • one-to-one mapping between components
  • overlap of components which are matched
  • relative positions of document parts
  • shape of document parts
  • The above measures are not independent
  • order and relevance of the different measures has
    to be chosen

14
Edit distance
  • Definition
  • the minimum number of actions that you have to
    perform to transform the one layout structure to
    the other
  • Actions
  • delete an object
  • move an object
  • change the shape of an object
  • Weighting
  • the different actions can have different
    weighting depending on the application
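
A minimal sketch of such a weighted edit distance in Python, using brute force over all matchings, so it is only usable for layouts with a handful of objects. The box representation `(x, y, width, height)`, the function name, and the default weights are assumptions. The slide lists no insert action, so the sketch also assumes that an unmatched object on either side is charged at the delete weight.

```python
from itertools import permutations


def layout_edit_distance(a, b, w_delete=1.0, w_move=1.0, w_shape=1.0):
    """Minimum weighted cost to transform layout `a` into layout `b`.

    Each layout is a list of boxes (x, y, width, height).
    """
    if len(a) < len(b):
        a, b = b, a  # assumption: unmatched objects cost w_delete either way
    best = float("inf")
    # Try every way of assigning the boxes of b to distinct boxes of a.
    for perm in permutations(range(len(a)), len(b)):
        cost = (len(a) - len(b)) * w_delete  # boxes of a left unmatched
        for j, i in enumerate(perm):
            ax, ay, aw, ah = a[i]
            bx, by, bw, bh = b[j]
            if (ax, ay) != (bx, by):
                cost += w_move   # move an object
            if (aw, ah) != (bw, bh):
                cost += w_shape  # change the shape of an object
        best = min(best, cost)
    return best


page_a = [(0, 0, 10, 2), (0, 3, 10, 20)]
page_b = [(0, 0, 10, 2), (0, 5, 10, 20)]
d_move = layout_edit_distance(page_a, page_b)
d_delete = layout_edit_distance(page_a + [(0, 30, 10, 1)], page_a, w_delete=2.0)
```

For larger layouts the exhaustive search would have to be replaced by an assignment algorithm; the weights are where the application-specific choices go.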

15
Graphics indexing (Lorenz)
  • Basic components
  • lines, parallel lines, adjacent lines, junctions
  • text
  • (circles, ellipses, etc.)
  • Feature frequency weighting
  • technique similar to text indexing as before
  • indexing focussed on salient basic components
    which occur often in the query graphics, but are
    rare with respect to the whole collection of
    graphics
  • allows access to heterogeneous collections of
    data (e.g. text and graphics)
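
The weighting described above can be sketched with a tf-idf style formula: a basic component gets a high weight when it is frequent in the query graphic but rare in the collection. Reading "similar to text indexing" as tf-idf is an assumption, as are the function name, the smoothed logarithm, and the representation of each graphic as a list of component names.

```python
import math
from collections import Counter


def feature_weights(query_features, collection):
    """Weight = frequency in the query * inverse collection frequency."""
    tf = Counter(query_features)
    n_docs = len(collection)
    weights = {}
    for feature, count in tf.items():
        # Number of graphics in the collection containing this component.
        df = sum(1 for doc in collection if feature in doc)
        # Smoothed inverse frequency: 0 when every graphic contains it.
        weights[feature] = count * math.log((1 + n_docs) / (1 + df))
    return weights


collection = [
    ["line", "line", "junction"],
    ["line", "text"],
    ["line", "circle"],
]
weights = feature_weights(["line", "line", "circle"], collection)
```

In this toy collection every graphic contains lines, so "line" gets weight zero even though it occurs twice in the query, while the rarer "circle" gets a positive weight.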

16
Example document
17
Example histogram
(Axes: symbol, frequency)
18
Indexing by spatial information
  • observation
  • It turns out that many people in fact do access
    archives by remembering partial layout, knowing
    approximately where things are positioned and how
    they are related

19
Spatial relations and indexing
  • Spatial queries on document information
  • mixture of document labels, document properties,
    keywords
  • examples of labels: abstract, title, footnote
  • examples of properties: large square text box,
    text box with low/high aspect ratio
  • examples of keywords: text box containing the
    word "motorcycle", picture with keyword
    "typing unit" in it
  • spatial relations
  • left-of, right-of, above, below, adjacent, etc.

20
Example
  • give me documents with a large box above a (box
    with large aspect ratio containing the word
    "Titanic")

(Figure: query specification and matching result; labels: Text, Any type, figure, above, "Titanic", "The Titanic arrived")
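
The example query can be sketched as a test over the boxes of a page. This is an illustration only: the `Box` fields, the thresholds for "large" and "large aspect ratio", and the definition of `above` (ends before the other box starts and overlaps it horizontally) are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Box:
    x: float  # left edge
    y: float  # top edge, y grows downward
    w: float  # width, assumed > 0
    h: float  # height, assumed > 0
    kind: str = "text"
    text: str = ""


def above(a, b):
    """True if a ends before b starts and the two overlap horizontally."""
    return a.y + a.h <= b.y and a.x < b.x + b.w and b.x < a.x + a.w


def matches(page, min_area=10000, min_aspect=5.0, word="Titanic"):
    """A large box of any type above a wide box containing `word`."""
    for lower in page:
        if lower.w / lower.h < min_aspect or word not in lower.text:
            continue
        for upper in page:
            if (upper is not lower and upper.w * upper.h >= min_area
                    and above(upper, lower)):
                return True
    return False


hit = [Box(50, 50, 400, 300, "figure"),
       Box(50, 370, 400, 30, "text", "The Titanic arrived")]
miss = [Box(50, 50, 400, 30, "text", "The Titanic arrived"),
        Box(50, 100, 400, 300, "figure")]
```

The page `hit` satisfies the query; `miss` contains the same two boxes but with the caption on top, so it does not.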
21
Conclusion (indexing)
  • partial document analysis
  • relatively cheap methods based on simple
    characteristics
  • capable of indexing documents efficiently and
    effectively
  • can always be combined with full OCR and layout
    and logical analysis
  • methods for text do apply in the same way for
    graphics

22
Multimedia Indexing
23
Authoring versus visual analysis
(Diagram labels)
  • Content descriptive metadata
  • Intentions
  • Multimedia content: text, video objects etc.
  • Partial script
  • Sensory content: objects, images etc.
  • Multimedia script
  • Extracted multimedia structure and content
  • Structure multimedia document
  • (Analog) multimedia document
  • Digital multimedia document
  • Multimedia document
24
Multimedia structures
  • Geometric structure
  • the layout of the multimedia document
  • Logical structure
  • the interpretation of the multimedia document
  • Non-linear (hypertext structure)
  • relations between (logical) elements in the
    document

Note: structures and relations can also be time
based, hence synchronization is important
25
Video example
26
Introduction
  • Single media indexing
  • text (standard information retrieval)
  • video (Brunelli)
  • documents (Doermann)
  • image (Informedia)
  • figure (HyperDoc)
  • audio (Informedia)

27
Multi media indexing (examples)
  • Figure and text
  • manuals with labels in figures and explanations
    in the text
  • caption of the figure explaining the content
  • Text and image
  • caption of newspaper picture
  • context of a picture on a web page
  • Audio and video
  • commentator explaining what you see

28
Multi media indexing (examples)
  • Audio and image
  • expert describing a picture
  • photographer annotating his picture
  • Text and video
  • film script
  • closed captions of news

29
Multi media analysis
  • General approach
  • find common ontology
  • analyze both media and express the result in the
    common ontology
  • Most often based on text + another modality

30
Overview
  • Multimodal Document Indexing
  • The HyperDocument system
  • From document to hypertext
  • The IMAT system
  • From document to reusable fragments
  • Multimodal Video Indexing
  • Name-It
  • Face-Name association
  • Informedia
  • Multimodal Video Summaries
  • Review paper (Snoek)
  • General framework and overview

31
Multimodal Document Indexing
32
The HyperDoc system
  • Data
  • an (old) manual with annotated pictures and
    associated texts
  • Goal
  • WWW based access to the paper version of the
    document

34
Document Structures
(Figure: geometric, logical, and hypertext structures; labels: header, figure, page number, caption, textbody, text)
35
Document structures
  • Structure definition
  • a set of objects and their relations (links)
  • Structure types
  • we identify different (hypertext) structures
    which pose restrictions on the admissible
    relations between objects

36
Hierarchical Structure
  • Tree shaped structure
  • links at each level
  • Example: geometric structure
  • grouping of elements in columns
  • Example: logical structure
  • grouping of captions and figures
  • sections, subsections

37
Linear Structure
  • Set of connected links
  • no loops
  • access to first element only
  • relative links
  • Example: reading order
  • depth first traversal of logical structure of
    main text body
  • Example: lists
  • tables
  • figures

38
Index Structure
  • Ordered set of links
  • outgoing links only
  • Examples
  • index to text elements
  • keywords
  • labels in figures

39
Side-loop Structure
  • Structure consisting of
  • two links in opposite directions, from and to
    one component
  • no other links out of the component
  • Examples
  • footnotes
  • references

40
Cross-group Structure
  • Structure with
  • two components
  • links between them
  • Examples
  • whole text body and set of figures
  • defines scope of each figure
  • one figure and its scope
  • relations between figure content and text

41
Cross-reference Structure
  • Remaining relations
  • semantic relations between keywords
  • semantic relations between paragraphs

42
From Paper to HyperDocument Access
  • Document Image Analysis
  • layout analysis
  • content analysis
  • objects and text (by OCR) in figures
  • text of the paragraphs (using OCR)
  • Logical Analysis
  • interpretation of document parts
  • Hypertext analysis
  • identifying instantiations of the six hypertext
    structures
  • Presentation design
  • present the structures to the user

43
Figure content analysis
  • Here focus on labels in an image
  • plain text labels
  • generic text labels
  • icon labels
  • legend labels

44
From object to content
  • Figure label detection
  • candidate characters should have a height in
    [(1-a) modal_height, (1+a) modal_height]
  • grouping into complete multi-line labels based
    on predicates and actions as explained in the
    document analysis lecture
  • Object content
  • text objects and figure labels are processed with
    commercial OCR
  • logical labels are identified by processing OCR
    output
  • e.g. titles (indicated by "view"), notes
    (indicated by "note")
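
The height test for candidate label characters can be sketched as follows. The sketch assumes the heights of the connected components in the figure are already measured in pixels; the function name and the tolerance `a=0.2` are illustrative choices, not values from the lecture.

```python
from collections import Counter


def label_candidates(heights, a=0.2):
    """Indices of components whose height lies in
    [(1-a) * modal_height, (1+a) * modal_height].

    Assumes `heights` is a non-empty list of component heights.
    """
    modal_height = Counter(heights).most_common(1)[0][0]
    low, high = (1 - a) * modal_height, (1 + a) * modal_height
    return [i for i, h in enumerate(heights) if low <= h <= high]


heights = [12, 12, 11, 13, 40, 12, 3]
candidates = label_candidates(heights)
```

With these heights the modal height is 12, so the tall component (40) and the speck (3) are rejected and the remaining components are kept as candidate characters.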

45
Legends
  • Definition
  • a legend is a list of icon-name pairs
  • Use
  • legends can be very important in document image
    analysis as they provide a relation between
    objects in the image and associated semantic
    concepts

46
Legend label detection and analysis
To decompose the legend picture, projection
profiles in the x- and y-direction (counting the
number of pixels per column and per row) are used
47
From objects to Geometric and Logical Structure
(Diagram labels: Basic geometric object, Basic logical object, content, Column detection, Grouping and text analysis, Reading order, Geometric structure, Logical structure)
Logical structure: search for occurrences from the
start of each textline
  • chapter: <white_space><numeral>
  • section: <white_space><numeral><.><numeral>
  • check whether the sequence is increasing properly
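
The chapter and section patterns can be sketched with regular expressions. The exact expressions, the function names, and the definition of "increasing properly" (each heading is either the next section of the current chapter or the start of the next chapter) are assumptions.

```python
import re

# <white_space><numeral><.><numeral>, not followed by a deeper level
SECTION = re.compile(r"^\s*(\d+)\.(\d+)(?![\d.])")
# <white_space><numeral>, not followed by a dot
CHAPTER = re.compile(r"^\s*(\d+)(?![\d.])")


def find_headings(lines):
    """Return (line_number, (chapter, section)) for each candidate heading.

    Chapter headings get section number 0.
    """
    headings = []
    for number, line in enumerate(lines):
        m = SECTION.match(line)
        if m:
            headings.append((number, (int(m.group(1)), int(m.group(2)))))
            continue
        m = CHAPTER.match(line)
        if m:
            headings.append((number, (int(m.group(1)), 0)))
    return headings


def increasing_properly(headings):
    """Check that each heading follows its predecessor without gaps."""
    keys = [key for _, key in headings]
    for prev, cur in zip(keys, keys[1:]):
        next_section = (prev[0], prev[1] + 1)
        next_chapter = (prev[0] + 1, 0)
        if cur not in (next_section, next_chapter):
            return False
    return True


lines = ["1 Introduction", "Some text", "1.1 Scope", "1.2 Terms",
         "2 Installation", "2.1 Tools"]
headings = find_headings(lines)
ok = increasing_properly(headings)
```

A text line that merely starts with a number would also be picked up as a candidate, which is why the sequence check is needed.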
48
Hypertext Analysis
Logical structure
Hypertext analysis
  • Hierarchical structure
  • Linear structure
  • Index structure
  • Side-loop structure
  • Cross-group structure
  • Cross-references
Structured HyperDocument
49
Hypertext Analysis
  • Hierarchical structure
  • geometric structure irrelevant after document
    image analysis
  • logical structure most important
  • Linear structure
  • detected reading order
  • list of detected figures

50
Hypertext Analysis
  • Index structure
  • list of detected labels
  • important keywords
  • can be found using statistical analysis as
    explained in document indexing
  • Side-loop structure
  • relies on OCR to detect superscripts or other
    conventions
  • Cross-reference structure
  • should be found by semantic analysis of the text

51
Cross-group structure
  • Cross-group links from text to whole figures
  • search for reference patterns, e.g.
  • <Note reference figure> <numeral>
  • <Note reference figure> <numeral> <and>
    <<numeral><,>>
  • Consistency checking
  • check figure number range
  • check for order in one reference sequence
  • Figure scope
  • the part of the text between different references
    defines the scope of the figure in the text
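
A sketch of the reference search, consistency checks, and scope computation in Python. The regular expression (English "figure", "fig.", "figures 2, 3 and 4") and the interpretation of scope as the text from one accepted reference up to the next are assumptions for illustration.

```python
import re

# Matches "figure 3", "fig. 3", "figures 2, 3 and 4" (assumed patterns).
REFERENCE = re.compile(
    r"\bfig(?:ure|\.)?s?\s+(\d+(?:\s*(?:,|and)\s*\d+)*)", re.IGNORECASE)


def figure_references(text, n_figures):
    """Return (offset, figure_numbers) for each consistent reference."""
    refs = []
    for m in REFERENCE.finditer(text):
        numbers = [int(n) for n in re.findall(r"\d+", m.group(1))]
        in_range = all(1 <= n <= n_figures for n in numbers)  # number range
        ordered = numbers == sorted(numbers)  # order within one sequence
        if in_range and ordered:
            refs.append((m.start(), numbers))
    return refs


def figure_scopes(text, n_figures):
    """Return (figure_number, start, end): text between references."""
    refs = figure_references(text, n_figures)
    ends = [start for start, _ in refs[1:]] + [len(text)]
    scopes = []
    for (start, numbers), end in zip(refs, ends):
        for n in numbers:
            scopes.append((n, start, end))
    return scopes


text = ("See figure 1 for the pump. Figures 2 and 3 show the valve. "
        "Figure 9 is missing.")
refs = figure_references(text, n_figures=3)
scopes = figure_scopes(text, n_figures=3)
```

The reference to figure 9 fails the range check for a document with three figures and is dropped.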

52
Cross-group structure
  • Cross-group links between figure and text
  • use scope(s) of specific figure as found in above
    step
  • match text of label with the body of text
  • match each individual word, combine close matches
  • match semantic label of an icon with the body of
    text by considering the legend
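
Matching a label against the text in a figure's scope can be sketched as word-level matching followed by grouping of nearby hits. The tokenization, the `max_gap` threshold, and the coverage score are assumptions; a real system would also have to tolerate OCR errors, which this sketch does not.

```python
import re


def match_label(label, text, max_gap=2):
    """Match each label word in `text` and combine close matches.

    Returns (first_token, last_token, coverage) per group, where coverage
    is the fraction of distinct label words found in the group.
    Assumes the label contains at least one word.
    """
    words = set(re.findall(r"\w+", label.lower()))
    tokens = re.findall(r"\w+", text.lower())
    hits = [i for i, token in enumerate(tokens) if token in words]
    groups = []
    for i in hits:
        if groups and i - groups[-1][-1] <= max_gap:
            groups[-1].append(i)  # close to the previous hit: same group
        else:
            groups.append([i])
    return [(g[0], g[-1], len({tokens[i] for i in g}) / len(words))
            for g in groups]


found = match_label(
    "typing unit",
    "Remove the typing unit before cleaning. The unit is heavy.")
```

The first group covers the full label, the second only half of it, so the first is the stronger anchor for a link from the figure label into the text.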

53
Presentation rules
  • Make structures explicit
  • provide access to all 6 structures identified
  • Allow for media specific navigation
  • provide access to the set of figures and the text
  • Leave out irrelevant information
  • don't show irrelevant layout information
  • show side-loops only on request

54
Document presentation (HyperDoc)
  • Make structures explicit
  • make explicit the logical structure and all links
    derived from the logical structure
  • introduce anchors in both text and figures for
    the links in cross-group structures
  • Allow for media specific navigation
  • Use different frames for figures and text
  • next/prev buttons for figures, scrollbar for text
  • Leave out all irrelevant information
  • remove page numbers
  • show footnotes only on request

55
HyperDoc presentation
56
HyperDoc summary
  • Model
  • for hyperdocuments at least 6 different
    structures can be identified
  • Processing
  • scanning
  • layout analysis
  • content analysis
  • logical analysis
  • hypertext analysis
  • presentation
  • based on the structures

57
The IMAT system
  • Data
  • a large set of manuals from different companies
    with text (in digital format) and figures (in
    both digital and paper format)
  • Goal
  • automatic decomposition of the dataset into
    reusable fragments so that they can be used in
    system assisted generation of training material

58
Introduction
Value = Assets x Reconfigurability (R. Jain, ACM
Multimedia 2000)
Index terms
high value
59
Introduction
Both should be decomposed and indexed for reuse
60
Applications
  • Course development assistance
  • Example scenarios
  • Query based selection of fragments
  • On the job-training
  • Consult limited part of the manual when you need
    it
  • Personalized delivery
  • Deliver information based on task, level of
    expertise, etc

61
Why Difficult?
  • Not meant for reuse
  • Based on linear reading order
  • Information implicit
  • Document structure
  • Conventions used

62
Goal
  • Automatic decomposition and annotation based on
  • Explicit representation of the different levels
    of representation of a document
  • Formalization of the implicit information
  • A general approach suited for both text and
    graphics

63
Datamodel
  • Three levels of document representation
  • Layout primitives and their structure
  • Logical primitives and their structure
  • Indexed fragments and their structure

64
Example graphics data
65
Example text data
<item> <bold> The processor is connected to
amp-1. The purpose of the connection is to
allow disabling .. of the processor
</bold> <italics> A more elaborate description
tells you that .
</italics> </item>
66
Layout Primitives
Definition: the smallest components in the
document with a consistent visual representation.
67
Logical Primitives
Definition: the smallest components in the
document that can be assigned a role.
68
Indexed Fragments
Definition: the logical primitives endowed with
semantic index terms allowing for reuse.
69
Document Knowledge
  • Vocabulary
  • Domain ontology
  • concepts to describe what the manual is about
  • index terms needed for reuse
  • Visual dictionary
  • The set of symbols and their visualization

70
Document Knowledge
  • Knowledge from authoring process

(Diagram labels: Index terms, Semantic style rules, Inverse semantic style rules, Logical primitives, Layout style rules, Inverse layout style rules, Layout primitives, Document Analysis)
71
Layout Analysis
  • Low level analysis
  • Standard tags for text
  • Symbol matching to image
  • Detection of text, lines etc.
  • Optical Character Recognition

XML/SVG tagged datafile
72
Logical Analysis
  • Bottom-up analysis
  • to derive the possible roles in the document
  • Top-down analysis
  • grammar based analysis
  • to select the genuine role

Inverted layout style rules
Note: not unique
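
The two passes can be sketched as follows: inverting the layout style rules gives, for each visual style, a set of possible roles (bottom-up, not unique), and a grammar then selects the role sequences that are actually allowed (top-down). The rule table, the role names, and the tiny grammar are hypothetical and only serve to illustrate the idea.

```python
from itertools import product

# Hypothetical layout style rules from authoring: role -> visual style.
LAYOUT_STYLE_RULES = {
    "summary": "bold",
    "warning": "bold",
    "explanation": "italics",
}

# Hypothetical grammar: the role sequences allowed inside one item.
GRAMMAR = {("summary", "explanation"), ("warning",)}


def invert(rules):
    """Inverted layout style rules: style -> set of possible roles."""
    inverted = {}
    for role, style in rules.items():
        inverted.setdefault(style, set()).add(role)
    return inverted


def analyse(styles):
    """Return the role sequences for `styles` that the grammar accepts."""
    inverted = invert(LAYOUT_STYLE_RULES)
    # Bottom-up: every possible role for each layout primitive.
    candidates = [sorted(inverted.get(style, ())) for style in styles]
    # Top-down: keep only the combinations allowed by the grammar.
    return [seq for seq in product(*candidates) if seq in GRAMMAR]


roles_two = analyse(["bold", "italics"])
roles_one = analyse(["bold"])
```

A bold primitive alone is ambiguous between "summary" and "warning"; the grammar resolves it differently depending on whether an italic primitive follows.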
73
Semantic Analysis
  • Similar analysis as for layout analysis
  • instantiates each component as a concept in the
    ontology

Standardized logical primitives
Inverted semantic style rules
Again not unique
Indexed fragments
74
Graphics storage
75
Authoring functionality
Reasoning/Ontology
76
Disabling of the processor
the disable connection ...
77
Conclusion
  • Summary
  • A set of tools is presented that automatically
    converts a technical manual into a set of indexed
    fragments that can be reused for many different
    purposes
  • Extension
  • Method is general, hence applying the techniques
    to video based training material is an
    interesting and viable option