Word Spotting: Indexing Handwritten Manuscripts


1
Word Spotting: Indexing Handwritten Manuscripts

Michael D. Fecina
IST 497/597
22-JAN-02

2
History

OCR was used in the past for indexing
machine-typed letters and documents
OCR does not work well with handwritten documents
because of
noise (ink marks, perhaps)
variations among writing styles
inconsistencies in formation of letters/words

3
More history

OCR is used to segment a page into words, then
break each word into its characters
OCR successful with clean machine fonts against
clean background
Character segmentation is too difficult with
handwritten documents

4
Motivation

To efficiently index historical handwritten
documents
To simplify reading documents where the
handwriting is particularly hard to read
Eventually, just as with images, it is hoped that
automatic indexing of documents will be available

5
Specific important documents to be indexed

W.E.B. Du Bois
Washington's and other Presidents' writings located
in the Library of Congress
Over 6,400 scanned 8-bit grey level images of
Washington's manuscripts
Serve as valuable resources for scholars as well
as others who wish to consult original source
material

6
What is word spotting?

A method by which handwritten material can be
indexed
Assumes documents are written by same person
Assumes that variation between occurrences of
the same word is minimal
This assumption does not always hold
(a significant contributor to error)

7
More about word spotting

Avoid recognizing the words
Use word images
What is difficult about it?
Segmenting the page into words
Ascenders, descenders
Noise, inconsistencies
Matching the words effectively

8
Methodology

Obtain grey level image of document.
Reduce image by ½ using Gaussian filtering and
sub-sampling.
Image is then binarized by thresholding.
(characters white, background black)
Binary image segmented into words (word images).
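
A minimal sketch of the reduction and thresholding steps, for illustration only. The 3x3 binomial kernel (a common approximation to a Gaussian), the border clamping, and the names reduce_by_half and binarize are assumptions; the slides do not give the filter size or the threshold.

```python
def reduce_by_half(img):
    """Smooth with a 3x3 binomial kernel, then keep every second pixel."""
    kernel = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]  # weights sum to 16
    h, w = len(img), len(img[0])
    out = []
    for y in range(0, h, 2):
        row = []
        for x in range(0, w, 2):
            total = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)  # clamp at the borders
                    xx = min(max(x + dx, 0), w - 1)
                    total += kernel[dy + 1][dx + 1] * img[yy][xx]
            row.append(total / 16)
        out.append(row)
    return out


def binarize(img, threshold):
    """Dark (ink) pixels become 1 (white), background becomes 0 (black)."""
    return [[1 if v < threshold else 0 for v in row] for row in img]
```

Here img is a list of rows of grey levels (0 = dark ink, 255 = bright paper), so the output follows the slide's convention of white characters on a black background.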

9
Methodology

Each word image is tested against every other
word image, but pairs are first pruned based on
image area and aspect ratio.
Matching produces equivalence classes.
The top n equivalence classes are chosen. The top s
classes are removed as stop words. Then the user
provides the ASCII equivalent for the remaining top m
classes.
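
The pruning step could look roughly like the sketch below. The function name is_candidate and both ratio limits are hypothetical; the slides only say that pruning depends on image area and aspect ratio.

```python
def is_candidate(size_a, size_b, max_area_ratio=1.5, max_aspect_ratio=1.3):
    """Decide whether two word images are worth matching at all.

    Each size is the (width, height) of a word image's bounding box.
    The two limits are illustrative values, not taken from the slides.
    """
    (wa, ha), (wb, hb) = size_a, size_b
    area_a, area_b = wa * ha, wb * hb
    aspect_a, aspect_b = wa / ha, wb / hb
    area_ok = max(area_a, area_b) / min(area_a, area_b) <= max_area_ratio
    aspect_ok = max(aspect_a, aspect_b) / min(aspect_a, aspect_b) <= max_aspect_ratio
    return area_ok and aspect_ok
```

Only pairs that pass this cheap test would go on to the expensive image match, which is how pruning cuts the cost of comparing every word against every other word.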

10
Details of Word Segmentation

Spacing between characters is smaller than
spacing between words
If two white pixels are separated by less than a
certain distance k, the intermediate pixels are
made white
Done in both the horizontal and vertical
directions, so that descenders are captured
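
The gap-filling rule can be sketched as follows. This is one reading of the slide, under assumptions: "separated by less than k" is taken to mean fewer than k black pixels between two white pixels, and the vertical pass is run on the output of the horizontal pass. The function names are illustrative.

```python
def fill_gaps_row(row, k):
    """Turn black pixels (0) white (1) when they sit in a gap of fewer
    than k pixels between two white pixels."""
    out = list(row)
    last_white = None
    for x, v in enumerate(row):
        if v == 1:
            if last_white is not None and 0 < x - last_white - 1 < k:
                for i in range(last_white + 1, x):
                    out[i] = 1
            last_white = x
    return out


def fill_gaps(img, k_horizontal, k_vertical):
    """Apply the rule along rows, then along columns of the result."""
    rows = [fill_gaps_row(r, k_horizontal) for r in img]
    cols = [fill_gaps_row(list(c), k_vertical) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]
```

With a suitable k, the small gaps between characters close up while the larger gaps between words stay open, so each word becomes one connected region.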

11
Word segmentation

Errors do occur using this algorithm (e.g. the dot
over an i or j)
However, a minimum length is required. This
prevents the dots of i and j from becoming separate
word images
If large gaps are left in some instances of a
word but not in others, the instances are segmented
as different words

12
Senior Document
13
Segmented Senior Document
14
Two primary algorithms used for word matching

EDM (Euclidean Distance Mapping (D. 1980))
Fast, but assumes that no distortions have
occurred except for relative translation
Does well matching words with relatively low
variations in reference to the template
SLH (Scott and Longuet-Higgins (1991))
Assumes an affine transformation between the
words
Slow, computationally expensive in current
implementations

15
Matching with EDM

Vertical alignment is by baseline, horizontal
alignment by coinciding left sides (thus
vertical alignment > horizontal alignment)
The XOR image is computed: corresponding pixels
are XORed to produce the difference between the images
The XOR image alone is not a good measure of image
difference, since equal weight is given to
isolated pixels and blobs
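
As a small illustration of the XOR step, assuming the two binary word images have already been aligned and padded to the same size (the name xor_image is illustrative):

```python
def xor_image(a, b):
    """Pixelwise XOR of two aligned binary images of equal size.

    A pixel is 1 (white) where exactly one of the two images has ink.
    """
    return [[pa ^ pb for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]


def xor_count(a, b):
    """Number of differing pixels; every pixel counts the same."""
    return sum(sum(row) for row in xor_image(a, b))
```

xor_count makes the weakness on this slide concrete: nine scattered differing pixels and one solid 3x3 blob of differences both score 9, even though the blob is far more likely to reflect a real mismatch.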

16
XOR for Lloyd
What's in one image or the other, but not both
17
EDM Step

The EDM is computed by assigning to each white pixel
in the XOR image its minimum distance to a black pixel
A white pixel inside a blob will get a larger
distance than an isolated white pixel
An error measure, E_EDM, can now be calculated by
summing the distance measures for each pixel
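
A brute-force sketch of this step, written for clarity rather than speed. Two assumptions are made: everything outside the image is treated as black, so that a white pixel always has a black pixel to measure against, and the names edm and edm_error are illustrative. A practical implementation would use a proper distance transform instead of this quadratic search.

```python
import math


def edm(xor_img):
    """For each white pixel (1), the Euclidean distance to the nearest
    black pixel (0). Black pixels get distance 0."""
    h, w = len(xor_img), len(xor_img[0])
    # Assumption: a one-pixel black border surrounds the image.
    black = [(y, x)
             for y in range(-1, h + 1)
             for x in range(-1, w + 1)
             if not (0 <= y < h and 0 <= x < w) or xor_img[y][x] == 0]
    dist = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if xor_img[y][x] == 1:
                dist[y][x] = min(math.hypot(y - by, x - bx) for by, bx in black)
    return dist


def edm_error(xor_img):
    """Sum of the distance values over all pixels."""
    return sum(sum(row) for row in edm(xor_img))
```

An isolated white pixel contributes 1, while the centre of a 3x3 white blob contributes 2, so clustered differences weigh more than scattered ones.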

18
Forming Blobs using EDM

The distance between every white pixel and the
nearest black pixel is computed
Pixels with distance < threshold are assumed to be noise.
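
The thresholding rule might be sketched like this, assuming a distance map is already available as a list of rows of distances. The name blob_mask is hypothetical, and the slides do not give the threshold value.

```python
def blob_mask(dist, threshold):
    """Keep only pixels whose distance to the nearest black pixel is at
    least the threshold; the rest are treated as noise and dropped."""
    return [[1 if d >= threshold else 0 for d in row] for row in dist]
```

Black pixels have distance 0, so any positive threshold removes them along with thin or isolated white pixels.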

19
Problems with EDM

EDM does not discriminate well between good and
bad matches
Fails when there is significant distortion in the
words
Need for a matching algorithm that models some
variation -> SLH

20
SLH Matching technique

Affine transformation allows for scaling and
shear deformations in both directions
Much more accurate than the Euclidean Distance
Mapping technique
Computationally slow and expensive because the
SVD (Singular Value Decomposition) must be
computed for a large matrix
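
To make the deformation model concrete, here is a small sketch of an affine map applied to a point. It is not the SLH algorithm itself, which also has to find point correspondences and involves the SVD; it only shows the family of transformations SLH assumes. The function name and the example matrix are illustrative.

```python
def affine(point, a, t):
    """Apply x' = A x + t to a 2D point.

    a is a 2x2 matrix as nested lists; its entries encode scaling and
    shear. t is a translation (tx, ty).
    """
    x, y = point
    return (a[0][0] * x + a[0][1] * y + t[0],
            a[1][0] * x + a[1][1] * y + t[1])


# Example: stretch x by 2, shear x by 0.5 * y, then shift right by 1.
A = [[2, 0.5],
     [0, 1]]
moved = affine((1, 2), A, (1, 0))
```

EDM allows only the translation t; allowing a general matrix A is what lets SLH match words that are slanted or written at a different size.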

21
Differences (+/-)

EDM does not account for any distortions and thus
performs poorly when handwriting is bad
SLH almost always produces the correct rankings
even if the handwriting is bad
Two areas need to be improved with both
Speed and word validity discrimination

22
Tests . . .

Two documents, Senior and Hudson, were
compared using both matching algorithms
The statistical information for both documents is
as follows

23
Test Information

Since the SLH algorithm is slow but more accurate
than the EDM, EDM was applied first
A cut off threshold was used to limit the number
of classes (words) displayed, and remained
constant in both tests
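
One way to combine a fast matcher with a slow one is a two-stage ranking, sketched below. This is an assumption about how the two algorithms could be chained; the slides say only that EDM was applied first and that a cut-off threshold was used. The cost functions are passed in as parameters and all names are hypothetical.

```python
def two_stage_rank(template, candidates, cheap_cost, expensive_cost, cutoff):
    """Filter with the cheap cost, then rank survivors by the expensive cost.

    cheap_cost and expensive_cost are functions (template, candidate) -> number,
    where a lower value means a better match.
    """
    shortlist = [c for c in candidates if cheap_cost(template, c) <= cutoff]
    return sorted(shortlist, key=lambda c: expensive_cost(template, c))
```

The expensive matcher then runs only on the shortlist, not on every candidate.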

24
Senior Document - classes
stop words
25
EDM on Senior

The EDM algorithm performed quite well on Senior
Document
Average precision of 78%
Since the handwriting is good, this performance
was expected
Remember that EDM does not account for much
variation in word images
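
For reference, a common information-retrieval definition of average precision is sketched here. The slides do not say which exact definition was used, so this is an assumption; this version averages over the relevant items that appear in the ranked list.

```python
def average_precision(ranked_relevance):
    """ranked_relevance: list of booleans, best-ranked match first.

    Returns the mean of precision-at-rank over the ranks that hold a
    correct match, or 0.0 if there are none.
    """
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0
```

A ranking that places all correct matches ahead of all incorrect ones scores 1.0; each incorrect word ranked above a correct one lowers the score.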

26
EDM matches for Lloyd
50% correct
27
EDM problems . . .

The algorithm performs poorly in that it cannot
discriminate well between valid and invalid words.

28
The Hudson Document
29
EDM on Hudson

Average precision was 57.9%, much lower than that
of the Senior Document (78%)
Difference in precision attributed to the
handwriting
Difficult to read even for humans looking at
greyscale images at 300 dpi

30
Problems with EDM/Hudson
31
SLH matches for Lloyd
62.5% correct
3/8 incorrect
32
SLH on Senior

Proved to be very accurate, yet as mentioned
before, slow
Average precision of SLH was 86.3%, compared to
78.7% for EDM
SLH recorded the word rankings correctly, and
also showed a much greater discrimination in
match error

33
SLH on Hudson

Very difficult because of writing, but ranking
proved to be much better than EDM
Performance on templates like "they" was good,
probably because they are simple, repetitive
words
Correct ranking for the word "Standard"

34
"Standard" template with SLH
35
Current matching techniques

SLH is more than reasonably accurate, but slow in
current implementations. References report this
will change ...
Require matching every word against every other
word
O(N^2).
N^2 = (220 x 6400)^2 ~ 10^12 matches!
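
The order of magnitude can be checked with a line of arithmetic. Reading 220 as an approximate number of words per page and 6400 as the number of page images is an assumption; the slide gives only the bare figures.

```python
words_per_page = 220  # assumption: the slide's "220" read as words per page
pages = 6400          # assumption: the slide's "6400" read as page images

n = words_per_page * pages  # total word images, N
pairs = n * n               # all-against-all comparisons, N^2

print(n)      # 1408000
print(pairs)  # 1982464000000
```

About 1.4 million word images therefore give roughly 2 x 10^12 pairwise comparisons, which is why both pruning and faster matching matter.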

36
Main problems with wordspotting

Handwriting style

37
Main problems with wordspotting

Word scaling and connection of ascenders,
descenders

38
Main problems with wordspotting

Skew and Noise

39
Recent Work

Fixed bugs and some problems with algorithm
Recently have successfully segmented the 6400
scanned images of George Washington's documents
It's impractical (too labor intensive) to compute
segmentation statistics on the entire collection.

40
Future Work

Continue working on new methods to match words
effectively and efficiently
Possible ideas for better matching techniques
include
Combining multiple word features
Language probability for automatic indexing
Integrate into indexing scheme (back of book
index)

41
Conclusion

EDM works reasonably well for matching words, but
SLH is better since it accounts for variations
SLH pays the price: it is computationally expensive
Future work is needed but progress thus far is
very encouraging

42
Questions?