Document Images and E-Discovery - PowerPoint PPT Presentation
Transcript and Presenter's Notes

Title: Document Images and E-Discovery


1
Document Images and E-Discovery
  • David Doermann, UMIACS
  • May 4, 2009

2
Goals of This Lecture
  • To help you understand
  • Why you may want to acquire document images
  • How you acquire them
  • What you get when you do
  • What you can do with them
  • Why organizing them is not as easy as you may
    think
  • Discuss some of the issues for those doing
    E-Discovery.

3
Assumptions
  • We have some need to access a set of documents
  • Archiving?
  • Litigation?
  • FOIA request?
  • We often have (massive) heterogeneous collections
  • Different languages
  • Different layouts
  • Different sources
  • We are lucky if metadata is consistent and
    uniform.

4
Why acquire Document Images?
  • Paperless Solution
  • Efficient transfer
  • Organization
  • Convenience
  • Access to a variety of content
  • Universal reader: email, attachments, spreadsheets
  • Don't need original applications
  • Prevent Change?
  • Easier to certify?

5
How do we acquire?
  • Scanning?
  • High speed, automated, multiform books, etc
  • Digital Copiers
  • Corporate Memory
  • Application Output
  • Print to Image
  • Mass Conversion
  • Cameras?
  • Cell Phones?
  • QipIt, ScanR, Hotcard
  • All have implications for use

6
Where do we find them?
  • Internet
  • Email Attachments
  • Online Proceedings
  • Electronic Fax
  • Mass Digitization Repositories

7
What can you do with them?
  • Can we access them?
  • Search
  • Browse
  • Read
  • Index and retrieve them?
  • In their basic form, not really!
  • We can
  • View
  • Print
  • Not much else
  • Why?

8
What is an image?
  • Pixel representation of intensity map
  • No explicit content, only relations
  • Image analysis
  • Attempts to mimic human visual behavior
  • Draw conclusions, hypothesize and verify

Image databases use primitive image analysis to
represent content, transforming semantic queries into
image features: color, shape, texture, and spatial
relations.
9
Document Images
  • A collection of dots called pixels
  • Arranged in a grid and called a bitmap
  • Pixels often binary-valued (black, white)
  • But grayscale or color is sometimes needed
  • 300 dots per inch (dpi) gives the best results
  • But images are quite large (roughly 1 MB per page;
    see the storage sketch after this slide)
  • Faxes are normally 100-200 dpi
  • Usually stored in TIFF or PDF format
  • Yet we want to be able to process them like text
    files!
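
A quick back-of-the-envelope sketch of the storage figures above, assuming a US-letter page (8.5 x 11 inches, an assumption not stated on the slide):

```python
# Rough, uncompressed storage estimate for a scanned page (US-letter size assumed).
def page_size_bytes(dpi: int, bits_per_pixel: int) -> int:
    width_px = int(8.5 * dpi)    # 2550 pixels at 300 dpi
    height_px = int(11 * dpi)    # 3300 pixels at 300 dpi
    return width_px * height_px * bits_per_pixel // 8

print(page_size_bytes(300, 1))   # binary page: 1,051,875 bytes, i.e. about 1 MB per page
print(page_size_bytes(300, 8))   # 8-bit grayscale: about 8.4 MB, which is why binary is common
print(page_size_bytes(150, 1))   # fax-like resolution: roughly a quarter of the 300 dpi size
```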

10
Document Image Database
  • Collection of scanned images
  • Need to be available for indexing and retrieval,
    abstracting, routing, editing, dissemination,
    interpretation

11
(No Transcript)
12
Other Documents
13
(No Transcript)
14
(No Transcript)
15
Indexing Page Images (Traditional Conversion)
(Flow diagram: Document → Scanner → Page Image → Page
Decomposition → Text Regions → Optical Character
Recognition → Character or Shape Codes → Structure
Representation)
16
Document Image Analysis
  • General Flow
  • Obtain Image - Digitize
  • Preprocessing
  • Feature Extraction
  • Classification
  • General Tasks
  • Logical and Physical Page Structure Analysis
  • Zone Classification
  • Language ID
  • Zone Specific Processing
  • Recognition
  • Vectorization

17
Document Analysis
  • What you need to do before you can treat images
    as e-documents.
  • Document Image Analysis
  • Page decomposition
  • Optical character recognition
  • Traditional Indexing with Conversion
  • Confusion matrix
  • Shape codes
  • Doing things Without Conversion
  • Duplicate Detection, Classification,
    Summarization, Abstracting
  • Keyword spotting, etc

18
(No Transcript)
19
Why is document analysis difficult?
  • 2D Array of values
  • Represents a Symbolic Language
  • Many variations in symbols
  • (e.g., the letter "A" rendered in many different
    fonts and styles)
  • 3-4 times larger than normal digital images
  • And this is just machine-printed Latin text!

20
Page Analysis (assume we are looking for text)
  • Skew correction
  • Based on finding the primary orientation of lines
    (see the projection-profile sketch after this slide)
  • Image and text region detection
  • Based on texture and dominant orientation
  • Structural classification
  • Infer logical structure from physical layout
  • Text region classification
  • Title, author, letterhead, signature block, etc.
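
A minimal sketch of one common way to find the primary orientation of lines: try small rotations and keep the angle whose horizontal projection profile is sharpest. NumPy/SciPy, the angle range, and the variance score are illustrative assumptions, not the specific method described in the lecture.

```python
import numpy as np
from scipy.ndimage import rotate

def estimate_skew(binary_img: np.ndarray, angles=np.arange(-5.0, 5.25, 0.25)) -> float:
    """Return the rotation angle (degrees) that makes text-line row profiles sharpest."""
    best_angle, best_score = 0.0, -1.0
    for angle in angles:
        rotated = rotate(binary_img, angle, reshape=False, order=0)
        row_profile = rotated.sum(axis=1)    # amount of "ink" in each row
        score = np.var(row_profile)          # aligned text lines -> strong peaks -> high variance
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle

# Usage, assuming `img` is a 2-D array with 1 = ink and 0 = background:
# deskewed = rotate(img, -estimate_skew(img), reshape=False, order=0)
```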

21
Page Layer Segmentation
  • Document image generation model
  • A document consists of many layers, such as
    handwriting, machine-printed text, background
    patterns, tables, figures, noise, etc.

22
Page Segmentation
  • Typically based on Spatial Proximity
  • White space
  • Margins
  • Differences in Content Type
  • Can be very sensitive to noise
  • Distinguish between
  • Top down: we know what should be there
  • Bottom up: we know what is there locally (see the
    proximity-grouping sketch after this slide)
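
A rough bottom-up sketch of the spatial-proximity idea: label connected components, then merge bounding boxes that lie within a small gap of each other. The gap threshold and the use of scipy.ndimage are illustrative assumptions; real systems also exploit white space, margins, and content-type cues, and are sensitive to noise.

```python
import numpy as np
from scipy.ndimage import label, find_objects

def proximity_blocks(binary_img: np.ndarray, gap: int = 10):
    """Group connected components whose bounding boxes lie within `gap` pixels
    of each other; returns candidate blocks as (top, left, bottom, right)."""
    labeled, _ = label(binary_img)
    boxes = [(s[0].start, s[1].start, s[0].stop, s[1].stop) for s in find_objects(labeled)]
    changed = True
    while changed:                       # keep merging until no boxes can be combined
        changed, merged = False, []
        for box in boxes:
            for i, other in enumerate(merged):
                near_rows = box[0] - gap <= other[2] and other[0] - gap <= box[2]
                near_cols = box[1] - gap <= other[3] and other[1] - gap <= box[3]
                if near_rows and near_cols:
                    merged[i] = (min(box[0], other[0]), min(box[1], other[1]),
                                 max(box[2], other[2]), max(box[3], other[3]))
                    changed = True
                    break
            else:                        # no nearby group found: start a new one
                merged.append(box)
        boxes = merged
    return boxes
```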

23
Image Detection
24
Text Region Detection
25
More Complex Example
(Figure: a region containing printed text, handwriting,
and noise, shown before and after MRF-based
postprocessing)
26
Application to Page Segmentation
(Figure: page segmentation before and after
enhancement)
27
Language Identification
  • Language-independent skew detection
  • Accommodate horizontal and vertical writing
  • Script class recognition
  • Asian scripts have blocky characters
  • Connected scripts can't be segmented easily
  • Language identification
  • Shape statistics work well for western languages
  • Competing classifiers work for Asian languages
  • What about handwriting?

28
Optical Character Recognition
  • Pattern-matching approach (a toy sketch follows
    this slide)
  • Standard approach in commercial systems
  • Segment individual characters
  • Recognize using a neural network classifier
  • Hidden Markov model approach
  • Experimental approach developed at BBN
  • Segment into sub-character slices
  • Limited lookahead to find best character choice
  • Useful for connected scripts (e.g., Arabic)
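
A toy illustration of the pattern-matching idea: pick the stored character template that agrees with a segmented glyph on the most pixels. The `templates` dictionary is a hypothetical input; commercial engines use trained classifiers (e.g., neural networks) rather than raw template matching.

```python
import numpy as np

def classify_glyph(glyph: np.ndarray, templates: dict) -> str:
    """Return the label of the binary template that matches the glyph on the most pixels.
    Assumes the glyph has already been segmented and scaled to the template size."""
    return max(templates, key=lambda label: int(np.sum(glyph == templates[label])))
```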

29
OCR Accuracy Problems
  • Character segmentation errors
  • In English, segmentation often changes "m" to
    "rn"
  • Character confusion
  • Characters with similar shapes often confounded
  • OCR on copies is much worse than on originals
  • Pixel bloom, character splitting, binding bend
  • Uncommon fonts can cause problems
  • If not used to train a neural network

30
Improving OCR Accuracy
  • Image preprocessing
  • Mathematical morphology for bloom and splitting
  • Particularly important for degraded images
  • Voting between several OCR engines helps (see the
    toy voting sketch after this slide)
  • Individual systems depend on specific training
    data
  • Linguistic analysis can correct some errors
  • Use confusion statistics, word lists, syntax, etc.
  • But more harmful errors might be introduced
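
A toy sketch of voting between OCR engines: a per-position majority vote over outputs that are assumed to be already aligned and of equal length. Real voting systems first align the strings (e.g., by edit distance), which this sketch skips.

```python
from collections import Counter

def vote(outputs):
    """Character-level majority vote over pre-aligned, equal-length OCR outputs."""
    return "".join(Counter(chars).most_common(1)[0][0] for chars in zip(*outputs))

print(vote(["recognltion", "recognition", "recognition"]))  # -> "recognition"
```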

31
OCR Speed
  • Neural networks take about 10 seconds a page
  • Hidden Markov models are slower
  • Voting can improve accuracy
  • But at a substantial speed penalty
  • Easy to speed things up with several machines
  • For example, by batch processing - using desktop
    computers at night

32
Problem: Logical Page Analysis (Reading Order)
  • Can be hard to guess in some cases
  • Newspaper columns, figure captions, appendices,
  • Sometimes there are explicit guides
  • "Continued on page 4" (but page 4 may be big!)
  • Structural cues can help
  • Column 1 might continue to column 2
  • Content analysis is also useful
  • Word co-occurrence statistics, syntax analysis

33
Retrieval of OCR'd Text
  • Requires robust ways of indexing
  • Statistical methods with large documents work
    best
  • Key evaluations
  • Success for high-quality OCR (Croft et al., 1994;
    Taghva, 1994)
  • Limited success for poor-quality OCR (1996 TREC,
    UNLV)
  • Clustering successful for > 85% accuracy (Tsuda
    et al., 1995)

34
N-Grams
  • Powerful, Inexpensive statistical method for
    characterizing populations
  • Approach
  • Split the document into overlapping n-character
    sequences (see the sketch after this slide)
  • Use traditional indexing representations to
    perform analysis
  • DOCUMENT -> DOC, OCU, CUM, UME, MEN, ENT
  • Advantages
  • Statistically robust to small numbers of errors
  • Rapid indexing and retrieval
  • Works from 70-85% character accuracy, where
    traditional IR fails
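
A minimal sketch of the n-gram split described above, using character 3-grams as in the DOCUMENT example; feeding the grams into a standard IR index is assumed but not shown.

```python
def char_ngrams(text: str, n: int = 3):
    """Split text into overlapping character n-grams."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("DOCUMENT"))  # ['DOC', 'OCU', 'CUM', 'UME', 'MEN', 'ENT']
```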

35
Matching with OCR Errors
  • Above 80% character accuracy, use words
  • With linguistic correction
  • Between 75% and 80%, use n-grams
  • With n somewhat shorter than usual
  • And perhaps with character confusion statistics
  • Below 75%, use word-length shape codes (see the
    dispatch sketch after this slide)
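
The accuracy thresholds above, restated as a tiny dispatch rule; the function name and the returned labels are illustrative, not a real API.

```python
def matching_strategy(char_accuracy: float) -> str:
    """Choose a matching representation from estimated OCR character accuracy (0.0-1.0)."""
    if char_accuracy > 0.80:
        return "words, with linguistic correction"
    if char_accuracy >= 0.75:
        return "character n-grams (shorter n, plus confusion statistics)"
    return "word-length shape codes"
```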

36
Processing Images of Text
  • Characteristics
  • Does not require expensive OCR/Conversion
  • Applicable to filtering applications
  • May be more robust to noise
  • Possible Disadvantages
  • Application domain may be very limited
  • Processing time may be an issue if indexing is
    otherwise required

37
Keyword Spotting
  • Techniques
  • Word Shape/HMM - (Chen et al., 1995)
  • Word Image Matching - (Trenkle and Vogt, 1993;
    Hull et al.)
  • Character Stroke Features - (DeCurtins and Chen,
    1995)
  • Shape Coding - (Tanaka and Torii; Spitz, 1995;
    Kia, 1996)
  • Applications
  • Filing System (Spitz - SPAM, 1996)
  • Numerous IR
  • Processing handwritten documents
  • Formal Evaluation
  • Scribble vs. OCR (DeCurtins, SDIUT 1997)

38
Shape Coding
  • Approach
  • Use of generic character descriptors
  • Make use of the power of language to resolve
    ambiguity
  • Map characters based on shape features, including
    ascenders, descenders, punctuation, and characters
    with holes (see the sketch after this slide)
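
A minimal sketch of shape coding as described above: map each character to a coarse shape class (ascender, descender, x-height with a hole, other x-height, or other). The particular class letters and character groupings are illustrative assumptions, not Spitz's exact code.

```python
def shape_code(word: str) -> str:
    """Map characters to coarse shape classes:
    A = ascender, D = descender, O = x-height with a hole, x = other x-height,
    U = uppercase/digit, P = punctuation and everything else."""
    ascenders, descenders, holes = set("bdfhklt"), set("gjpqy"), set("aeo")
    out = []
    for ch in word:
        if ch in ascenders:
            out.append("A")
        elif ch in descenders:
            out.append("D")
        elif ch in holes:
            out.append("O")
        elif ch.islower():
            out.append("x")
        elif ch.isalnum():
            out.append("U")
        else:
            out.append("P")
    return "".join(out)

print(shape_code("legal"))  # 'AODOA'
```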

39
Additional Applications
  • Handwritten Archival Manuscripts
  • (Manmatha, 1997)
  • Page Classification
  • (Decurtins and Chen, 1995)
  • Matching Handwritten Records
  • (Ganzberger et al, 1994)
  • Headline Extraction
  • Document Image Compression (UMD, 1996-1998)

40
Some UMD Research
  • Multilingual OCR
  • Evaluation
  • Duplicate detection
  • .

41
Detection
  • Stamps, Logos, Signatures
  • These content regions benefit from a
    detection-based approach
  • Saliency measures adapt to inter-class variation
  • Logos: location, density, symmetry, size
  • Signatures: flow, oscillations
  • Standard classifiers: SVM, Fisher, decision trees
    (toy sketch after this slide)
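
A minimal sketch of the classification step, assuming the saliency features listed above (density, symmetry, relative size, etc.) have already been extracted per region. scikit-learn, the feature values, and the labels are illustrative assumptions, not the actual UMD pipeline.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical per-region feature vectors: [density, symmetry, relative size].
X_train = np.array([[0.82, 0.90, 0.05],    # logo-like region
                    [0.15, 0.20, 0.30],    # signature-like region
                    [0.55, 0.40, 0.60]])   # ordinary text block
y_train = ["logo", "signature", "text"]

clf = SVC(kernel="rbf").fit(X_train, y_train)
print(clf.predict(np.array([[0.80, 0.85, 0.06]])))  # closest to the toy 'logo' example
```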

42
Shape matching
(Figure: illustration of signature matching using
shape contexts and a local-neighborhood graph)
43
Image content categorization
  • Distinguishing between text and non-text
    documents
  • We constructed a 4,500-image database by crawling
    Web images from the Google Image search engine
    using a wide variety of text keywords

44
Page Segmentation
45
Clutter Detection and Removal
Clutter as a single connected component
46
(No Transcript)
47
Is Indexing Enough?
  • Many applications benefit from image based
    indexing
  • Less discriminatory features
  • Features may therefore be easier to compute
  • More robust to noise
  • Often computationally more efficient
  • Many classical IR techniques have application for
    DIR
  • Both structure and content are important for
    indexing
  • Preservation of structure is essential for
    in-depth understanding

48
  • What else is useful?
  • Document Metadata? Logos? Signatures?
  • Where is research heading?
  • Cameras to capture Documents?
  • What massive collections are out there?
  • Tobacco Litigation Documents
  • 49 million page images
  • Google Books
  • Other Digital Libraries

49
What Next for E-Discovery? (Questions to Ask)
  • Now you want to use the images
  • What metadata is required?
  • How was the collection created?

50
Observations
  • Structure is a great indicator of content
  • Locating Letters, Financial Forms, etc
  • Be careful of what you assume about the
    collection
  • Handwriting present?
  • Noise, Scanning resolution?
  • Is there implicit information in the layout?

51
Litigation Specific Issues
  • Volume Scanning Separated from Metadata
  • Document Determination
  • Multiple Edits/Prints of the same document
    (Duplicate Determination)
  • Much harder for images
  • May have unique Bates numbers
  • Cost of scanning or adding manual metadata?
  • When is an image sufficient?
  • Probably not for handwriting analysis

52
Summary
  • Nothing Magic about getting access to images
  • Metadata is typically required, and automation
    can be made more difficult by poor image quality
  • No substitute for eyes on the image, and most
    systems are set up with this in mind

53
E-Discovery Issues?