Document Image Indexing

Provided by: Faculte2


1
Document Image Indexing
2
Indexing of document images

Apply IR techniques in modified form
Different approaches
Methods differ in how much analysis they do, i.e.
how rich the document representations involved
are.
Image objects and/or image structure
layout objects and/or layout structure
logical objects and/or logical structure

3
Approaches

full document conversion
geometric analysis, OCR, logical analysis
results mostly incomplete
methods remain valid only when OCR quality is
reasonable
expensive process, not always feasible to apply
to millions of pages
partial document conversion
only recognize important features
cheap analysis or cheap pre-processing with
expensive processing of limited document parts
only

4
Text characterization (deSilva)

Focus on proper nouns
Examples
names of people, places, important objects
Characteristics
important for indexing
difficult to do extensive post-processing as the
set of proper nouns can be very large

5
Text characterization (deSilva)

Observations made in experiments
95.5% of proper nouns are capitalized
35% of capitalized words are proper nouns
beginning of sentence: 10% proper nouns
85% of capitalized words following a one-letter
word are proper nouns
average length is one larger than for other words

6
Text characterization
Proper nouns
High level features: syntactic category of
previous/next word
Candidate nouns
Low level features: capitalization, length of
word, length of previous/next word, position in
the sentence
Box based image abstraction
Characters and words
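
The low level features listed above can be sketched in a few lines. This is only an illustration under stated assumptions: it works on already transcribed text, whereas the slides derive the features from character and word boxes in the image, and the function names and the sentence-boundary rule are hypothetical.

```python
def low_level_features(words, i):
    """Return the low level features named on the slide for words[i]."""
    word = words[i]
    prev_word = words[i - 1] if i > 0 else ""
    next_word = words[i + 1] if i + 1 < len(words) else ""
    # Assumption: a sentence starts at index 0 or after a token ending in . ! ?
    sentence_start = i == 0 or prev_word[-1:] in ".!?"
    return {
        "capitalized": word[:1].isupper(),
        "length": len(word),
        "prev_length": len(prev_word),
        "next_length": len(next_word),
        "sentence_start": sentence_start,
    }


def candidate_proper_nouns(text):
    """Keep capitalized words that do not open a sentence (a crude filter)."""
    words = text.split()
    candidates = []
    for i, word in enumerate(words):
        features = low_level_features(words, i)
        if features["capitalized"] and not features["sentence_start"]:
            candidates.append(word.strip(".,;:!?"))
    return candidates


print(candidate_proper_nouns("The ship left Southampton on a Wednesday. It sank."))
# prints ['Southampton', 'Wednesday']
```

A real classifier would combine these features with the high level ones (syntactic category of neighbouring words) instead of applying a single hard rule.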
7
Identification of document function (Doermann)

Document functions
reading: user is supposed to read the whole
document
browsing: user is supposed to quickly go through
the document
searching: user is supposed to look for specific
parts of the document
Observed properties
reading: few titles, large content blocks
browsing: large number of head/body pairings
searching: large number of small similar-sized
blocks

8
Identification of document function: Reading,
Browsing, or Searching?
Browsing
Searching
Reading
9
Identification of document function
Document function
High level features: distribution of functional
units
Salient regions: titles, abstracts, index keys
etc.
Low level features: zone properties, position
on the page
Box based image abstraction
Zones
10
Presentation

Basis is logical tree
Functions
reading
depth first search of the logical tree
browsing
pruned depth first search of the logical tree
searching
decision tree based on the logical tree
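
The reading and browsing presentations above can be sketched as two traversals of a logical tree. This is a minimal sketch: the `Node` class, the role names, and the choice of which roles survive pruning are assumptions made for illustration, not part of the original system.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    role: str  # hypothetical roles: "document", "section", "title", "paragraph"
    children: list = field(default_factory=list)


def reading_order(node):
    """Depth first traversal: visit every node of the logical tree."""
    yield node.label
    for child in node.children:
        yield from reading_order(child)


def browsing_order(node, keep_roles=("document", "section", "title")):
    """Pruned depth first traversal: skip subtrees whose role is not kept."""
    if node.role not in keep_roles:
        return
    yield node.label
    for child in node.children:
        yield from browsing_order(child, keep_roles)


doc = Node("doc", "document", [
    Node("s1", "section", [Node("t1", "title"), Node("p1", "paragraph")]),
    Node("s2", "section", [Node("t2", "title"), Node("p2", "paragraph")]),
])
print(list(reading_order(doc)))   # ['doc', 's1', 't1', 'p1', 's2', 't2', 'p2']
print(list(browsing_order(doc)))  # ['doc', 's1', 't1', 's2', 't2']
```

Searching is not covered here, since the slide only states that it uses a decision tree built on the logical tree.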

11
Presentation by document function
Searching: path depends on user
Browsing
Reading
12
Layout similarity (Doermann)

Which documents are similar?

13
Layout similarity

Different measures required
mapping between the typed components
one-to-one mapping between components
overlap of components which are matched
relative positions of document parts
shape of document parts
The above measures are not independent
order and relevance of the different measures have
to be chosen

14
Edit distance

Definition
the minimum number of actions needed to
transform one layout structure into the other
Actions
delete an object
move an object
change the shape of an object
Weighting
the different actions can have different
weighting depending on the application
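
A weighted layout edit distance of this kind can be sketched as follows. The sketch assumes that objects in the two layouts are already matched by a shared identifier, which sidesteps the matching problem; it also counts an object present in only one layout as one delete action. The function name, the box format, and the default weights are hypothetical.

```python
def layout_edit_distance(a, b, w_delete=1.0, w_move=1.0, w_shape=1.0):
    """a, b: dicts mapping an object id to (x, y, width, height)."""
    cost = 0.0
    for key in a.keys() | b.keys():
        if key not in a or key not in b:
            cost += w_delete  # object exists in only one layout
            continue
        ax, ay, aw, ah = a[key]
        bx, by, bw, bh = b[key]
        if (ax, ay) != (bx, by):
            cost += w_move    # position differs: one move action
        if (aw, ah) != (bw, bh):
            cost += w_shape   # size differs: one change-shape action
    return cost


page_a = {"title": (0, 0, 100, 10), "body": (0, 20, 100, 80), "figure": (0, 110, 50, 50)}
page_b = {"title": (0, 0, 100, 10), "body": (0, 30, 100, 60)}
print(layout_edit_distance(page_a, page_b))                # 3.0
print(layout_edit_distance(page_a, page_b, w_delete=2.0))  # 4.0
```

Raising one weight, as in the second call, is how an application would express that a given action matters more for its notion of similarity.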

15
Graphics indexing (Lorenz)

Basic components
lines, parallel lines, adjacent lines, junctions
text
(circles, ellipses, etc.)
Feature frequency weighting
technique similar to text indexing as before
indexing focussed on salient basic components
which occur often in the query graphics, but are
rare with respect to the whole collection of
graphics
allows access to heterogeneous collections of
data (e.g. text and graphics)
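
The weighting idea above (frequent in the query graphics, rare in the collection) resembles tf-idf from text retrieval. The sketch below uses a smoothed tf-idf formula as a stand-in; the exact formula used in the cited work is not given on the slide, and the function name and the primitive names are assumptions.

```python
import math
from collections import Counter


def feature_weights(query_features, collection):
    """query_features: list of primitive names found in the query graphics.
    collection: list of such lists, one per indexed graphics document."""
    n_docs = len(collection)
    term_frequency = Counter(query_features)
    weights = {}
    for feature, count in term_frequency.items():
        document_frequency = sum(1 for doc in collection if feature in doc)
        # Smoothed inverse document frequency: primitives found everywhere get 0.
        idf = math.log((1 + n_docs) / (1 + document_frequency))
        weights[feature] = count * idf
    return weights


collection = [
    ["line", "line", "junction"],
    ["line", "text"],
    ["line", "circle", "circle"],
]
weights = feature_weights(["line", "circle", "circle"], collection)
print(weights["line"])                        # 0.0
print(weights["circle"] > weights["line"])    # True
```

Because the weights depend only on feature names and counts, the same scheme can index text terms and graphics primitives side by side.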

16
Example document
17
Example histogram (symbol versus frequency)
18
Indexing by spatial information

observation
It turns out that many people in fact do access
archives by remembering partial layout, knowing
approximately where things are positioned and how
they are related

19
Spatial relations and indexing

Spatial queries on document information
mixture of
document labels, document properties, keywords
examples of labels: abstract, title, footnote
examples of properties: large square text box,
text box with low/high aspect ratio
examples of keywords: text box containing the
word "motorcycle", picture with keyword "typing
unit" in it
spatial relations
left-of, right-of, above, below, adjacent, etc.

20
Example

give me documents with a large box above a (box
with large aspect ratio containing the word
Titanic)

[Figure: query specification and result; labels: Text, Any type, figure, above, Titanic, The Titanic arrived]
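
The example query can be sketched as a test over the boxes of one page. All names and thresholds here are hypothetical: "large" is taken to mean an area of at least 10000 square pixels, "large aspect ratio" a width/height ratio of at least 5, and the y coordinate is assumed to grow downward.

```python
from dataclasses import dataclass


@dataclass
class Box:
    x: float
    y: float          # top edge; y grows downward (assumption)
    width: float
    height: float
    kind: str = "any"
    text: str = ""

    @property
    def area(self):
        return self.width * self.height

    @property
    def aspect_ratio(self):
        return self.width / self.height


def above(a, b):
    """True if a lies entirely above b and the two overlap horizontally."""
    return (a.y + a.height <= b.y
            and a.x < b.x + b.width
            and b.x < a.x + a.width)


def matches_query(boxes, min_area=10000, min_aspect=5.0, word="Titanic"):
    """A large box above a wide box whose text contains the given word."""
    for lower in boxes:
        if lower.aspect_ratio < min_aspect or word not in lower.text:
            continue
        for upper in boxes:
            if upper is not lower and upper.area >= min_area and above(upper, lower):
                return True
    return False


page_1 = [Box(10, 10, 300, 200, "figure"),
          Box(10, 220, 300, 20, "text", "The Titanic arrived")]
page_2 = [Box(10, 10, 300, 20, "text", "The Titanic arrived"),
          Box(10, 40, 300, 200, "figure")]
print(matches_query(page_1))  # True
print(matches_query(page_2))  # False
```

The other relations on the previous slide (left-of, right-of, below, adjacent) would be written as predicates of the same shape as `above`.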
21
Conclusion (indexing)

partial document analysis
relatively cheap methods based on simple
characteristics
capable of indexing documents efficiently and
effectively
can always be combined with full OCR and layout
and logical analysis
methods for text apply in the same way to
graphics

22
Multimedia Indexing
23
Authoring versus visual analysis
Content descriptive metadata
Intentions
multimedia content: text, video objects etc.
Partial script
Sensory content: objects, images etc.
Multimedia script
Extracted multimedia structure and content
Structure multimedia document
(Analog) multimedia document
Digital multimedia document
Multimedia document
24
Multimedia structures

Geometric structure
the layout of the multimedia document
Logical structure
the interpretation of the multimedia document
Non-linear (hypertext) structure
relations between (logical) elements in the
document

Note: structures and relations can also be time
based, hence synchronization is important
25
Video example
26
Introduction

Single media indexing
text (standard information retrieval)
video (Brunelli)
documents (Doermann)
image (Informedia)
figure (HyperDoc)
audio (Informedia)

27
Multi media indexing (examples)

Figure and text
manuals with labels in figures and explanations
in the text
caption of the figure explaining the content
Text and image
caption of newspaper picture
context of a picture on a web page
Audio and video
commentator explaining what you see

28
Multi media indexing (examples)

Audio and image
expert describing a picture
photographer annotating his picture
Text and video
film script
closed captions of news

29
Multi media analysis

General approach
find common ontology
analyze both media and express the result in the
common ontology
Most often based on text plus one other modality

30
Overview

Multimodal Document Indexing
The HyperDocument system
From document to hypertext
The IMAT system
From document to reusable fragments
Multimodal Video Indexing
Name-It
Face-Name association
Informedia
Multimodal Video Summaries
Review paper (Snoek)
General framework and overview

31
Multimodal Document Indexing
32
The HyperDoc system

Data
an (old) manual with annotated pictures and
associated texts
Goal
WWW based access to the paper version of the
document

34
Document Structures
[Figure: geometric, logical, and hypertext structures; labeled parts: header, figure, page number, caption, textbody, text]
35
Document structures

Structure definition
a set of objects and their relations (links)
Structure types
we identify different (hypertext) structures
which pose restrictions on the admissible
relations between objects

36
Hierarchical Structure

Tree shaped structure
links at each level
Example: Geometric structure
grouping of elements in columns
Example: Logical structure
grouping of captions and figures
sections, subsections

37
Linear Structure

Set of connected links
no loops
access to first element only
relative links
Example: Reading order
depth first traversal of the logical structure of
the main text body
Example: Lists
tables
figures

38
Index Structure

Ordered set of links
outgoing links only
Examples
index to text elements
keywords
labels in figures

39
Side-loop Structure

Structure consisting of
two links in opposite direction, from and to one
component
no other links out of the component
Examples
footnotes
references

40
Cross-group Structure

Structure with
two components
links between them
Examples
whole text body and set of figures
defines scope of each figure
one figure and its scope
relations between figure content and text

41
Cross-reference Structure

Remaining relations
semantic relations between keywords
semantic relations between paragraphs

42
From Paper to HyperDocument Access

Document Image Analysis
layout analysis
content analysis
objects and text (by OCR) in figures
text of the paragraphs (using OCR)
Logical Analysis
interpretation of document parts
Hypertext analysis
identifying instantiations of the six hypertext
structures
Presentation design
present the structures to the user

43
Figure content analysis

Here the focus is on labels in an image
plain text labels
generic text labels
icon labels
legend labels

44
From object to content

Figure label detection
candidate characters should have height in
[(1-a) modal_height, (1+a) modal_height]
grouping into complete multi-line labels based on
predicates and actions as explained in the
document analysis lecture
Object content
text objects and figure labels are processed with
commercial OCR
logical labels are identified by processing OCR
output
e.g. titles (indicated by "view"), notes
(indicated by "note")
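
The height criterion for candidate label characters can be sketched as below. The tolerance `a = 0.2` and the function name are assumptions chosen for illustration; the modal height is taken as the most frequent component height.

```python
from collections import Counter


def label_character_candidates(heights, a=0.2):
    """Keep component heights within (1-a) to (1+a) times the modal height."""
    modal_height = Counter(heights).most_common(1)[0][0]
    low, high = (1 - a) * modal_height, (1 + a) * modal_height
    return [h for h in heights if low <= h <= high]


# Heights (in pixels) of connected components found in a figure.
heights = [10, 10, 10, 11, 9, 30, 2, 10]
print(label_character_candidates(heights))  # [10, 10, 10, 11, 9, 10]
```

Here the components of height 30 and 2 are rejected as too large or too small to be label characters; grouping the survivors into multi-line labels is a separate step.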

45
Legends

Definition
a legend is a list of icon-name pairs
Use
legends can be very important in document image
analysis as they provide a relation between
objects in the image and associated semantic
concepts

46
Legend label detection and analysis
To decompose the legend picture,
projection profiles in x- and y- direction
(counting the number of pixels) are used
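
Projection profiles can be sketched in plain Python for a small binary image. The image format (a list of rows with 1 for ink) and the helper names are assumptions; a real system would work on scanned bitmaps and tolerate noise instead of requiring fully empty rows and columns.

```python
def projection_profiles(image):
    """image: list of rows, each a list of 0/1 pixels (1 = ink).
    Returns the ink count per row and the ink count per column."""
    row_profile = [sum(row) for row in image]
    col_profile = [sum(col) for col in zip(*image)]
    return row_profile, col_profile


def runs(profile):
    """Return (start, end) pairs of consecutive non-empty positions, end exclusive."""
    segments, start = [], None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(profile)))
    return segments


legend = [
    [1, 1, 0, 1, 1, 1],
    [1, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0, 1],
]
rows, cols = projection_profiles(legend)
print(runs(rows))  # [(0, 2), (3, 5)]
print(runs(cols))  # [(0, 2), (3, 6)]
```

In this toy legend the empty row separates two entries and the empty column separates the icon part from the name part of each entry.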
47
From objects to Geometric and Logical Structure
Basic geometric object
Basic logical object
content
Column detection
Grouping and text analysis
Reading order
Geometric structure
Logical structure
Logical structure: search for occurrences from
the start of each textline
chapter: <white_space><numeral>
section: <white_space><numeral><.><numeral>
check whether the sequence is increasing properly
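
The heading search and the sequence check can be sketched with regular expressions. The concrete patterns are assumptions: leading white space is treated as optional, and a heading number must be followed by white space and some text. "Increasing properly" is interpreted as chapters counting up by one and sections counting up by one within the current chapter.

```python
import re

# Assumed patterns: optional leading white space, then the number, then text.
CHAPTER = re.compile(r"^\s*(\d+)\s+\S")
SECTION = re.compile(r"^\s*(\d+)\.(\d+)\s+\S")


def find_headings(lines):
    """Return (kind, line index, number) for headings that continue the sequence."""
    headings = []
    chapter, section = 0, 0
    for index, line in enumerate(lines):
        match = SECTION.match(line)
        if match:
            c, s = int(match.group(1)), int(match.group(2))
            if c == chapter and s == section + 1:
                section = s
                headings.append(("section", index, f"{c}.{s}"))
            continue
        match = CHAPTER.match(line)
        if match and int(match.group(1)) == chapter + 1:
            chapter, section = chapter + 1, 0
            headings.append(("chapter", index, str(chapter)))
    return headings


lines = [
    "1 Introduction",
    "This manual has 12 parts.",
    "1.1 Scope",
    "1.2 Safety",
    "3 pumps are installed here.",
    "2 Installation",
    "2.1 Tools",
]
for heading in find_headings(lines):
    print(heading)
```

The line starting with "3 pumps" matches the chapter pattern but is rejected by the sequence check, which is the purpose of that check.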
48
Hypertext Analysis
Logical structure
- Hierarchical structure
- Linear structure
- Index structure
- Side-loop structure
- Cross-group structure
- Cross-references
Hypertext analysis
Structured HyperDocument
49
Hypertext Analysis

Hierarchical structure
geometric structure irrelevant after document
image analysis
logical structure most important
Linear structure
detected reading order
list of detected figures

50
Hypertext Analysis

Index structure
list of detected labels
important keywords
can be found using statistical analysis as
explained in document indexing
Side-loop structure
relies on OCR to detect superscripts or other
conventions
Cross-reference structure
should be found by semantic analysis of the text

51
Cross-group structure

Cross-group links from text to whole figures
search for reference patterns, e.g.
<Note reference figure> <numeral>
<Note reference figure> <numeral> <and>
<<numeral><,>>
Consistency checking
check figure number range
check for order in one reference sequence
Figure scope
the part of the text between different references
defines the scope of the figure in the text
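
Reference detection, consistency checking, and figure scope can be sketched as follows. The regular expression is an assumption about how references are phrased ("figure 3", "fig. 3", "figures 3, 4 and 5"); the patterns used in the original system may differ. Scope is taken as the text from one accepted reference up to the next.

```python
import re

# Assumed phrasing of references; not the original system's pattern.
REFERENCE = re.compile(
    r"\bfig(?:ure|\.)?s?\s+(\d+(?:\s*(?:,|and)\s*\d+)*)", re.IGNORECASE
)


def find_figure_references(text, n_figures):
    """Return (offset, figure numbers) for references that pass both checks."""
    references = []
    for match in REFERENCE.finditer(text):
        numbers = [int(n) for n in re.findall(r"\d+", match.group(1))]
        in_range = all(1 <= n <= n_figures for n in numbers)  # figure number range
        ordered = numbers == sorted(numbers)                  # order within one reference
        if in_range and ordered:
            references.append((match.start(), numbers))
    return references


def figure_scopes(text, references):
    """Scope of a reference: the text up to the next accepted reference."""
    scopes = []
    for i, (start, numbers) in enumerate(references):
        end = references[i + 1][0] if i + 1 < len(references) else len(text)
        scopes.append((numbers, text[start:end]))
    return scopes


text = "See figure 1 for the pump. Figures 2 and 3 show the valve. Ignore figure 9."
refs = find_figure_references(text, n_figures=4)
for numbers, scope in figure_scopes(text, refs):
    print(numbers, repr(scope))
```

With four figures in the document, the reference to figure 9 fails the range check and therefore does not start a new scope.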

52
Cross-group structure

Cross-group links between figure and text
use scope(s) of a specific figure as found in the
step above
match text of label with the body of text
match each individual word, combine close matches
match semantic label of an icon with the body of
text by considering the legend
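
Matching a label against the text, word by word, and combining close matches can be sketched as below. Exact lower-case word matching and the `max_gap` of three words are assumptions; OCR output would in practice need approximate matching.

```python
import re


def match_label(label, text, max_gap=3):
    """Return (first, last) word-index spans where label words occur close together."""
    words = [w.lower() for w in re.findall(r"\w+", text)]
    label_words = {w.lower() for w in re.findall(r"\w+", label)}
    hits = [i for i, w in enumerate(words) if w in label_words]
    spans = []
    for i in hits:
        if spans and i - spans[-1][1] <= max_gap:
            spans[-1][1] = i       # close to the previous hit: extend that span
        else:
            spans.append([i, i])   # start a new span
    return [(first, last) for first, last in spans]


scope_text = "Remove the oil filter and pump housing. Later, check the oil level."
print(match_label("oil pump", scope_text))  # [(2, 5), (10, 10)]
```

The first span covers both label words and is a stronger link candidate than the second, which matches only one of them.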

53
Presentation rules

Make structures explicit
provide access to all 6 structures identified
Allow for media specific navigation
provide access to the set of figures and the text
Leave out irrelevant information
don't show irrelevant layout information
show side-loops only on request

54
Document presentation (HyperDoc)

Make structures explicit
make explicit the logical structure and all links
derived from the logical structure
introduce anchors in both text and figures for
the links in cross-group structures
Allow for media specific navigation
Use different frames for figures and text
next/prev buttons for figures, scrollbar for text
Leave out all irrelevant information
remove page numbers
show footnotes only on request

55
HyperDoc presentation
56
HyperDoc summary

Model
for hyperdocuments at least 6 different
structures can be identified
Processing
scanning
layout analysis
content analysis
logical analysis
hypertext analysis
presentation
based on the structures

57
The IMAT system

Data
a large set of manuals from different companies
with text (in digital format) and figures (in
both digital and paper format)
Goal
automatic decomposition of the dataset into
reusable fragments so that they can be used in
system assisted generation of training material

58
Introduction
Value = Assets x Reconfigurability (R. Jain, ACM
Multimedia 2000)
Index terms
high value
59
Introduction
Both should be decomposed and indexed for reuse
60
Applications

Course development assistance
Example scenarios
Query based selection of fragments
On-the-job training
Consult limited part of the manual when you need
it
Personalized delivery
Deliver information based on task, level of
expertise, etc.

61
Why Difficult?

Not meant for reuse
Based on linear reading order
Information implicit
Document structure
Conventions used

62
Goal

Automatic decomposition and annotation based on
Explicit representation of the different levels
of representation of a document
Formalization of the implicit information
A general approach suited for both text and
graphics

63
Datamodel

Three levels of document representation
Layout primitives and their structure
Logical primitives and their structure
Indexed fragments and their structure

64
Example graphics data
65
Example text data
<item> <bold> The processor is connected to
amp-1. The purpose of the connection is to
allow disabling .. of the processor
</bold> <italics> A more elaborate description
tells you that .
</italics> </item>
66
Layout Primitives
Definition: the smallest components in the
document with consistent visual representation.
67
Logical Primitives
Definition: the smallest components in the
document that can be assigned a role.
68
Indexed Fragments
Definition: the logical primitives endowed with
semantic index terms allowing for reuse
69
Document Knowledge

Vocabulary
Domain ontology
concepts to describe what the manual is about
index terms needed for reuse
Visual dictionary
The set of symbols and their visualization

70
Document Knowledge

Knowledge from authoring process

Index terms
Inverse semantic style rules
Semantic style rules
Logical primitives
Layout style rules
Inverse layout style rules
Layout primitives
Document Analysis
71
Layout Analysis

Low level analysis
Standard tags for text
Symbol matching to image
Detection of text, lines etc.
Optical Character Recognition

XML/SVG tagged datafile
72
Logical Analysis

Bottom-up analysis
to derive the possible role
in the document
Top-down analysis
grammar based analysis
to select the genuine role

Inverted layout style rules
Note: not unique
73
Semantic Analysis

Similar analysis
as for layout analysis
instantiates each
component as a concept
in the ontology

Standardized logical primitives
Inverted semantic style rules
Again not unique
Indexed fragments
74
Graphics storage
75
Authoring functionality
Reasoning/Ontology
76
Disabling of the processor
the disable connection ...
77
Conclusion

Summary
A set of tools is presented that automatically
converts a technical manual into a set of indexed
fragments that can be reused for many different
purposes
Extension
Method is general, hence applying the techniques
to video based training material is an
interesting and viable option
