Word Spotting: Indexing Handwritten Manuscripts

1
Word Spotting: Indexing Handwritten Manuscripts
  • Michael D. Fecina
  • IST 497/597
  • 22-JAN-02

2
History
  • OCR was used in the past for indexing
    machine-typed letters and documents
  • OCR does not work well with handwritten documents
    because of:
  • noise (ink marks, perhaps)
  • variations among writing styles
  • inconsistencies in formation of letters/words

3
More history
  • OCR is used to segment a page into words, then
    break each word into its characters
  • OCR is successful with clean machine fonts
    against a clean background
  • Character segmentation is too difficult with
    handwritten documents

4
Motivation
  • To efficiently index historical handwritten
    documents
  • To simplify reading documents where the
    handwriting is particularly hard to read
  • Eventually, just as with images, it is hoped that
    automatic indexing of documents will be available

5
Specific important documents to be indexed
  • W.E.B. Du Bois
  • Washington's and other Presidents' writings,
    located in the Library of Congress
  • Over 6,400 scanned 8-bit grey level images of
    Washington's manuscripts
  • Serve as valuable resources for scholars as well
    as others who wish to consult original source
    material

6
What is word spotting?
  • A method by which handwritten material can be
    indexed
  • Assumes documents are written by same person
  • Assumes that variations between occurrences of
    the same word are minimal
  • The above assumption does not always hold true
    (a significant contributor to error)

7
More about word spotting
  • Avoid recognizing the words
  • Use word images
  • What is difficult about it?
  • Segmenting the page into words
  • Ascenders, descenders
  • Noise, inconsistencies
  • Matching the words effectively

8
Methodology
  • Obtain grey level image of document.
  • Reduce image by ½ using Gaussian filtering and
    sub-sampling.
  • Image is then binarized by thresholding
    (characters white, background black).
  • Binary image segmented into words (word images).
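
The reduction and binarization steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the 3x3 binomial kernel stands in for the Gaussian filter, the threshold of 128 is arbitrary, and the function names are hypothetical. None of these details come from the slides.

```python
def smooth_and_halve(img):
    """Blur with a 3x3 binomial kernel (a small Gaussian approximation),
    then keep every second pixel in each direction."""
    kernel = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]  # weights sum to 16
    h, w = len(img), len(img[0])
    out = []
    for y in range(0, h, 2):
        row = []
        for x in range(0, w, 2):
            total = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    # Clamp coordinates so border pixels reuse their nearest neighbour.
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += kernel[dy + 1][dx + 1] * img[yy][xx]
            row.append(total / 16)
        out.append(row)
    return out


def binarize(img, threshold=128):
    """Dark ink (below the assumed threshold) becomes white (1),
    the lighter background becomes black (0)."""
    return [[1 if v < threshold else 0 for v in row] for row in img]


# Tiny 4x4 grey-level "page": a dark 2x2 stroke on a light background.
page = [
    [250, 250, 250, 250],
    [250, 10, 10, 250],
    [250, 10, 10, 250],
    [250, 250, 250, 250],
]
small = smooth_and_halve(page)   # 2x2 grey-level image
binary = binarize(small)         # 2x2 image of 0/1 values
```

Smoothing before sub-sampling is what keeps thin pen strokes from vanishing when every second pixel is dropped.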

9
Methodology
  • Each word image is tested against every other
    word image, but pairs are first pruned based on
    image area and aspect ratio.
  • Matching produces equivalence classes.
  • The top n equivalence classes are chosen. The
    top s of these are removed as stop words. The
    user then provides the ASCII equivalent for the
    remaining top m classes.
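
A minimal sketch of the pruning idea: skip the expensive match whenever two word images differ too much in area or aspect ratio. The ratio limits of 1.3 and the names `may_match` and `candidate_pairs` are illustrative assumptions; the slides do not give the actual thresholds.

```python
def may_match(a, b, max_area_ratio=1.3, max_aspect_ratio=1.3):
    """Return True if two word boxes are similar enough to be worth matching.

    a and b are (width, height) tuples in pixels. Both ratio limits are
    illustrative values, not ones taken from the presentation.
    """
    area_a, area_b = a[0] * a[1], b[0] * b[1]
    aspect_a, aspect_b = a[0] / a[1], b[0] / b[1]
    area_ratio = max(area_a, area_b) / min(area_a, area_b)
    aspect_ratio = max(aspect_a, aspect_b) / min(aspect_a, aspect_b)
    return area_ratio <= max_area_ratio and aspect_ratio <= max_aspect_ratio


def candidate_pairs(boxes):
    """List index pairs (i, j), i < j, that survive pruning."""
    return [
        (i, j)
        for i in range(len(boxes))
        for j in range(i + 1, len(boxes))
        if may_match(boxes[i], boxes[j])
    ]


# Four word boxes: two of similar size, one short word, one very long word.
boxes = [(100, 30), (105, 31), (40, 30), (300, 32)]
pairs = candidate_pairs(boxes)
```

Only the surviving pairs would be passed on to the matching algorithm.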

10
Details of Word Segmentation
  • Spacing between characters is smaller than that
    between words
  • If two white pixels are separated by less than a
    certain distance k, the intermediate pixels are
    made white
  • Done in both the horizontal and vertical
    directions, so that descenders are captured
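
The gap-filling rule above can be sketched as follows. Assumptions: images are lists of 0/1 rows with 1 meaning white (ink), the gap is measured as the number of black pixels between two white pixels, and the horizontal pass is simply followed by a vertical pass. The slides do not say how the two directions are combined, and the function names are hypothetical.

```python
def fill_gaps_1d(line, k):
    """Turn a black run white when it has fewer than k pixels and is
    bounded by white pixels on both sides."""
    out = list(line)
    last_white = None
    for i, v in enumerate(line):
        if v == 1:
            if last_white is not None and 0 < i - last_white - 1 < k:
                for j in range(last_white + 1, i):
                    out[j] = 1
            last_white = i
    return out


def smear(img, k_h, k_v):
    """Fill gaps along every row, then along every column of the result."""
    rows = [fill_gaps_1d(r, k_h) for r in img]
    cols = [fill_gaps_1d(list(c), k_v) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


# The 2-pixel gap (inside a word) closes; the 4-pixel gap (between words) stays.
row = [1, 0, 0, 1, 0, 0, 0, 0, 1]
filled = fill_gaps_1d(row, 3)
```

After smearing, each connected white region would be treated as one word image.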

11
Word segmentation
  • Errors do occur using this algorithm (e.g., the
    dot over an i or j)
  • However, a minimum length is required. This
    prevents the dots of i and j from becoming
    separate word images
  • If large gaps are left in some instances of a
    word but not in others, those instances are
    segmented as different words

12
Senior Document

13
Segmented Senior Document

14
Two primary algorithms used for word matching
  • EDM (Euclidean Distance Mapping (D. 1980))
  • Fast, but assumes that no distortions have
    occurred except for relative translation
  • Does well matching words with relatively low
    variations in reference to the template
  • SLH (Scott and Longuet-Higgins (1991))
  • Assumes an affine transformation between the
    words
  • Slow, computationally expensive in current
    implementations

15
Matching with EDM
  • Align the images vertically by baseline and
    horizontally by making their left sides coincide
    (thus vertical alignment > horizontal alignment)
  • The XOR image is computed by XORing corresponding
    pixels, which produces the difference between
    the images
  • XOR alone is not a good measure of image
    difference, since equal weight is given to
    isolated pixels and blobs
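
A small sketch of the XOR step, assuming the two word images are already aligned and padded to the same size, and are stored as lists of 0/1 rows. The function name is hypothetical.

```python
def xor_image(a, b):
    """Pixel-wise XOR of two equally sized binary images (lists of 0/1 rows)."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have the same size")
    return [[pa ^ pb for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]


template = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
candidate = [
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]
diff = xor_image(template, candidate)

# Counting white pixels treats every differing pixel the same, whether it
# is isolated or part of a large blob. That is the weakness noted above.
xor_count = sum(sum(row) for row in diff)
```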

16
XOR for Lloyd
What's in one image or the other, but not in both

17
EDM Step
  • EDM is computed by assigning to each white pixel
    in the image its minimum distance to a black pixel
  • A white pixel inside a blob will get a larger
    distance than an isolated white pixel
  • An error measure (E_EDM) can now be calculated
    by summing the distance measures over all pixels
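
To make the distance mapping concrete, here is a brute-force sketch. It assumes the input is an XOR image stored as lists of 0/1 rows, and it ignores anything outside the image border. A real implementation would use a fast distance transform; the quadratic search here is only for readability, and the function names are hypothetical.

```python
import math


def distance_map(img):
    """For each white pixel (1), the Euclidean distance to the nearest
    black pixel (0). Black pixels get 0.0."""
    black = [(y, x) for y, row in enumerate(img) for x, v in enumerate(row) if v == 0]
    if not black:
        raise ValueError("image needs at least one black pixel")
    return [
        [
            min(math.hypot(y - by, x - bx) for by, bx in black) if v == 1 else 0.0
            for x, v in enumerate(row)
        ]
        for y, row in enumerate(img)
    ]


def edm_error(xor_img):
    """Sum of the distance values over the whole image."""
    return sum(sum(row) for row in distance_map(xor_img))


isolated = [
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
blob = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
centre_distance = distance_map(blob)[2][2]  # pixel in the middle of the blob
```

The centre of the blob gets a larger distance than the isolated pixel, so blobs contribute more than their pixel count to the summed error.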

18
Forming Blobs using EDM
  • The distance between every white pixel and the
    nearest black pixel is computed
  • If distance < threshold, the pixel is assumed to
    be noise.

19
Problems with EDM
  • EDM does not discriminate well between good and
    bad matches
  • Fails when there is significant distortion in the
    words
  • Need for a matching algorithm that models some
    variation -> SLH

20
SLH Matching technique
  • Affine transformation allows for scaling and
    shear deformations in both directions
  • Much more accurate than the Euclidean Distance
    Mapping technique
  • Computationally slow and expensive because the
    SVD (Singular Value Decomposition) must be
    computed for a large matrix
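
For readers unfamiliar with the term, the sketch below shows what an affine transformation does to a set of points. It is not the SLH algorithm itself, only an illustration of the scaling and shear deformations that the affine model permits. The function name and parameter names are hypothetical.

```python
def affine(points, a, b, c, d, tx=0.0, ty=0.0):
    """Map each (x, y) to (a*x + b*y + tx, c*x + d*y + ty)."""
    return [(a * x + b * y + tx, c * x + d * y + ty) for x, y in points]


# Corners of a unit square standing in for feature points on a word image.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]

# Scaling: twice as wide, half as tall.
scaled = affine(square, 2, 0, 0, 0.5)

# Horizontal shear: x shifts in proportion to y, similar to slanted writing.
sheared = affine(square, 1, 0.5, 0, 1)
```

Two instances of the same word written with a different slant or size differ roughly by such a map, which a pure-translation model like EDM cannot absorb.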

21
Differences (+/-)
  • EDM does not account for any distortions and thus
    performs poorly when handwriting is bad
  • SLH almost always produces the correct rankings
    even if the handwriting is bad
  • Two areas need to be improved in both:
  • Speed and word validity discrimination

22
Tests . . .
  • Two documents, Senior and Hudson, were
    compared using both matching algorithms
  • The statistical information for both documents is
    as follows

23
Test Information
  • Since the SLH algorithm is slow but more accurate
    than the EDM, EDM was applied first
  • A cutoff threshold was used to limit the number
    of classes (words) displayed, and it remained
    constant in both tests

24
Senior Document - classes
stop words

25
EDM on Senior
  • The EDM algorithm performed quite well on the
    Senior Document
  • Average precision of 78%
  • Since the handwriting is good, this performance
    was expected
  • Remember that EDM does not account for much
    variation in word images

26
EDM matches for Lloyd
50% correct

27
EDM problems . . .
  • The algorithm performs poorly in that it cannot
    discriminate well between valid and invalid words.

28
The Hudson Document

29
EDM on Hudson
  • Average precision was 57.9%, much lower than that
    of the Senior Document (78%)
  • Difference in precision is attributed to the
    handwriting
  • Difficult to read even for humans looking at
    greyscale images at 300 dpi

30
Problems with EDM/Hudson

31
SLH matches for Lloyd
62.5% correct
3/8 incorrect

32
SLH on Senior
  • Proved to be very accurate, yet as mentioned
    before, slow
  • Average precision of SLH was 86.3%, compared to
    78.7% for EDM
  • SLH recorded the word rankings correctly, and
    also showed a much greater discrimination in
    match error

33
SLH on Hudson
  • Very difficult because of writing, but ranking
    proved to be much better than EDM
  • Performance on templates like "they" was good,
    probably because they are simple, repetitive
    words
  • Correct ranking for the word "Standard"

34
Standard template with SLH

35
Current matching techniques
  • SLH is more than reasonably accurate, but slow in
    current implementations. References report this
    will change ...
  • Require matching every word against every other
    word
  • O(N^2).
  • N^2 = (220 x 6400)^2 ~ 10^12 matches!
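
The arithmetic behind that estimate can be checked directly. The reading of 220 as an approximate number of words per scanned image is an assumption; the slide gives only the two figures.

```python
words_per_image = 220  # assumed meaning of the slide's "220"
images = 6400          # scanned images in the collection

n_words = words_per_image * images

# Every word matched against every word, as on the slide.
all_pairs = n_words ** 2

# If each unordered pair is matched only once, the count roughly halves.
unordered_pairs = n_words * (n_words - 1) // 2

print(f"N = {n_words:,}")
print(f"N^2 = {all_pairs:.2e}")
print(f"N(N-1)/2 = {unordered_pairs:.2e}")
```

Either way the count stays on the order of 10^12, which is why pruning and a fast first-pass matcher matter.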

36
Main problems with wordspotting
  • Handwriting style

37
Main problems with wordspotting
  • Word scaling and connection of ascenders,
    descenders

38
Main problems with wordspotting
  • Skew and Noise

39
Recent Work
  • Fixed bugs and some problems with algorithm
  • Recently segmented the 6400 scanned images of
    George Washington's documents successfully
  • It's impractical (too labor intensive) to compute
    segmentation statistics on the entire collection.

40
Future Work
  • Continue working on new methods to match words
    effectively and efficiently
  • Possible ideas for better matching techniques
    include
  • Combining multiple word features
  • Language probability for automatic indexing
  • Integrate into indexing scheme (back of book
    index)

41
Conclusion
  • EDM works reasonably well for matching words, but
    SLH is better since it accounts for variations
  • SLH pays the price: it is computationally
    expensive
  • Future work is needed but progress thus far is
    very encouraging

42
Questions?
  • Thanks for listening to me blob (?) about
    Wordspotting.
  • Any Comments/Questions?