Language Models for Handwriting


Transcript and Presenter's Notes

Title: Language Models for Handwriting


1
Language Models for Handwriting
  • Joshua Goodman
  • Microsoft Research
  • Machine Learning and Applied Statistics Group
  • http://www.research.microsoft.com/joshuago

2
What does this say?
3
What does this say?
4
What does this say?
5
What does this say?
6
Without context, very hard to read, if it's even
possible
7
Context is Key
  • Very hard to read any individual part of this
    without context.
  • But the overall string has only one reasonably
    likely interpretation.
  • Language model tells you which strings are
    likely, which ones are not.

8
Overview
  • What are language models
  • (2 Slides of Boring Preliminaries)
  • Who uses language models
  • Every natural language input technique
  • Even a few handwriting systems
  • Reduce errors by about 40% (relative)
  • Why language modeling is hard for handwriting
  • And what to do about it
  • How to build language models
  • Techniques, tools, etc.
  • The future
  • Handwriting researchers using language models for
    almost all applications
  • And contributing back to the speech and language
    modeling communities

9
What is a language model?
  • Gives a probability distribution over text
    strings (characters/words/phrases/documents)
  • May be easier to model
  • P(ink | word sequence) than
  • P(word sequence | ink)
  • Can train language model on much more/different
    data than the ink model

10
Language Model Evaluation: Entropy and Perplexity
  • Language model quality is typically measured with
    entropy or perplexity
  • Entropy is just the number of bits required to encode
    the test data (should be called cross-entropy of the
    model on the test data)
  • Perplexity is 2^entropy (see the sketch below)
  • Lower is better for both
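A minimal Python sketch (mine, not from the talk) of how cross-entropy and perplexity would be computed for any conditional word model; the uniform model at the end is a made-up example.

    import math

    def entropy_and_perplexity(test_words, prob):
        """Cross-entropy (bits per word) of a model on test data, plus perplexity.
        prob(word, history) is assumed to return P(word | history)."""
        total_log2 = 0.0
        for i, word in enumerate(test_words):
            total_log2 += math.log2(prob(word, test_words[:i]))
        entropy = -total_log2 / len(test_words)   # bits per word
        return entropy, 2 ** entropy              # lower is better for both

    # Toy check: a uniform model over a 10,000-word vocabulary
    uniform = lambda word, history: 1.0 / 10000
    print(entropy_and_perplexity("the cat sat on the mat".split(), uniform))
    # about 13.3 bits per word, perplexity 10000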

11
Who uses language models
  • Almost every natural language input technique
    uses language models
  • Speech recognition
  • Typing in Japanese and Chinese
  • Machine Translation
  • A whole bunch of others
  • Handwriting recognition
  • Simple dictionaries with some kind of frequency
    information: everyone who deals with words
  • Bigram and trigram models: about 12 papers, all
    with good results

12
Error-rate correlates with entropy (for speech
recognition)
13
Pinyin Conversion
  • How to enter Chinese text
  • Type phonetically (pinyin)
  • Many characters have same sound
  • Find the character sequence maximizing
    P(pinyin | characters) × P(characters)
  • P(pinyin | characters) is 1 if the pinyin is correct,
    0 otherwise
  • P(characters) is the language model (see the sketch below)
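A Python sketch of the search this slide describes; `pinyin_of`, `candidates`, and `p_lm` are hypothetical stand-ins for a real pronunciation lexicon, candidate generator, and language model.

    def best_characters(pinyin, candidates, pinyin_of, p_lm):
        """Pick the character string whose pinyin matches the typed input and
        whose language-model probability is highest: the channel model only
        contributes the 1/0 match constraint, so the LM does all the ranking."""
        matching = [c for c in candidates if pinyin_of(c) == pinyin]
        return max(matching, key=p_lm) if matching else None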
14
Machine Translation
  • Let f be a French sentence we wish to translate
    into English.
  • Let e be a possible English translation.

ê = argmax over e of P(f | e) × P(e)
(P(f | e) is the translation model; P(e) is the language model)
Thanks to Eugene Charniak for this slide
15
Machine Translation Error rate versus Entropy
(From Franz Och)
These are old results; the newest results from Franz
Och are trained on 200 billion words of data and get
even better results.
16
Other Language Model Uses
  • Information retrieval
  • P(query | document) ×
  • P(document)
  • Telephone Keypad input
  • P(numbers | words) ×
  • P(words)
  • Soft Keyboard input
  • P(pen-down positions | words) ×
  • P(words)
  • Spelling Correction
  • P(observed keys | words) ×
  • P(words)
  • (In each case the second factor is the language model;
    see the sketch below)
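All of these uses share one noisy-channel pattern; a Python sketch with placeholder arguments (not an API from the talk) makes the shared structure explicit.

    def noisy_channel_decode(observation, candidates, p_obs_given_words, p_words):
        """Choose the word string maximizing P(observation | words) * P(words).
        Only the observation model changes between applications (document terms,
        keypad digits, pen-down positions, typed keys); P(words) is the LM."""
        return max(candidates,
                   key=lambda w: p_obs_given_words(observation, w) * p_words(w))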

17
Language Models in Handwriting Systems
  • Kinds of language models (unigram, bigram,
    trigram)
  • Results from handwriting papers
  • Character-based vs. word-based
  • Some of the problems of using LMs for handwriting

18
What kind of language model to use
  • P(a b c … q r s) =
  • P(a) × P(b | a) × P(c | a b) × … ×
  • P(s | a b c … q r)
  • How can we compute P(s | a b c … q r)?
  • Too hard, so approximate it by
  • P(s | q r)  (Trigram)
  • P(s | r)  (Bigram)
  • P(s)  (Unigram)
  • P(s) ≈ 1/voc  (Uniform)
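To make the trigram approximation concrete, here is a minimal unsmoothed maximum-likelihood Python sketch (mine, not the talk's code); the later smoothing slides explain why this cannot be used as-is.

    from collections import Counter

    def train_trigram(words):
        """Collect the counts needed for an unsmoothed trigram model."""
        tri = Counter(zip(words, words[1:], words[2:]))
        bi = Counter(zip(words, words[1:]))
        return tri, bi

    def p_trigram(w, u, v, tri, bi):
        """Maximum-likelihood P(w | u v); zero for any unseen trigram."""
        return tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0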

19
Language modeling for handwriting recognition
  • Highlights from 12 papers that I can find
  • Hybrid neuro-Markovian (Marukatat et al., 01)
  • Error rate drops from 30% to 18% by using a
    bigram
  • Quiniou et al., 05: 18% to 10% (bigram and
    trigram the same)
  • Perraud et al., 03
  • From 34% (uniform) to 29% (unigram) to 22.5%
    error rates
  • Vinciarelli et al., 04
  • Tried different test sets, with semi-realistic
    mismatch between training and test
  • Results highly dependent on match between
    training/test
  • Unigram always much better than uniform (no
    probabilities) (typically 50% more accurate)
  • Typically marginal gains from trigram
  • Sometimes large gains from bigram (up to 20%
    relative) when match was good

20
Pretty good Bibliography of LMs for handwriting
  • Using a Statistical Language Model to Improve the
    Performance of an HMM-Based Cursive Handwriting
    System, Marti et al. IJPRAI 2001
  • On the influence of vocabulary size and language
    models in unconstrained handwritten text
    recognition, Marti et al., ICDAR 2001
  • N-gram and N-Class Models for On line Handwriting
    Recognition, Perraud et al., ICDAR 2003
  • Offline Recognition of Unconstrained Handwritten
    Texts Using HMMs and Statistical Language Models,
    Vinciarelli et al., IEEE PAMI 2004
  • N-Gram Language Models for Offline Handwritten
    Text Recognition, Zimmerman et al., IWFHR 2004
  • Stability Measure of Entropy Estimate and Its
    Application to Language Model Evaluation, Kim,
    J., and Ryu, S., and Kim, J.H., IWFHR 2004
  • An Empirical Study of Statistical Language Models
    for Contextual Post-processing of Chinese Script
    Recognition, Li, Y.-X. and Tan, C.L., IWFHR 2004
  • Statistical Language Models for On-line
    Handwritten Sentence Recognition, Quiniou et al.,
    ICDAR 2005
  • A Data Structure Using Hashing and Tries for
    Efficient Chinese Lexical Access, Y.-K. Lam and Q.
    Huo, ICDAR 2005
  • Document Understanding System Using Stochastic
    Context-Free Grammars, J. Handley, A. Namboodiri,
    and R. Zanibbi, ICDAR 2005
  • Multiple Handwritten Text Line Recognition
    Systems Derived from Specific Integration of a
    Language Model, R. Bertolami and H. Bunke, ICDAR
    2005
  • A Priori and a posteriori integration and
    combination of language models in an on-line
    handwritten sentence recognition system, Solen
    Quiniou and Eric Anquetil, IWFHR 2006

21
Character-based or Word-based
  • Model letter sequences instead of word sequences
  • Example: P(letter-4 | letter-3 letter-2 letter-1)
  • Typically, use a higher-order n-gram (e.g. 6)
  • This is good if you want to model
    out-of-vocabulary words, proper names,
    misspellings, digit sequences, etc.
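A Python sketch of an order-6 character model, using add-one smoothing only to keep the example short (an assumption; a real system would use the smoothing methods discussed later).

    import math
    from collections import defaultdict

    ORDER = 6

    def train_char_counts(words):
        """Count character n-grams and their histories over a word list."""
        counts = defaultdict(int)
        for w in words:
            chars = "<" * (ORDER - 1) + w + ">"    # pad the start, mark the end
            for i in range(ORDER - 1, len(chars)):
                hist = chars[i - ORDER + 1:i]
                counts[(hist, chars[i])] += 1
                counts[hist] += 1
        return counts

    def char_logprob(word, counts, alphabet_size=96):
        """Add-one-smoothed log2 probability of a character string."""
        chars = "<" * (ORDER - 1) + word + ">"
        logp = 0.0
        for i in range(ORDER - 1, len(chars)):
            hist = chars[i - ORDER + 1:i]
            logp += math.log2((counts[(hist, chars[i])] + 1) /
                              (counts[hist] + alphabet_size))
        return logp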

22
Combining character-based and word-based models
  • Can model both in-vocabulary and
    out-of-vocabulary words with language models, at
    the same time.
  • Explicitly model
  • P(out-of-vocabulary | previous word)
  • P(out-of-vocabulary | Mr.)
  • Model out-of-vocabulary words using the character
    model (see the sketch below)
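A Python sketch of that combination; every argument is a placeholder for a trained component rather than anything from the papers above.

    def combined_prob(word, prev_word, vocab, p_word, p_oov_given_prev, p_spelling):
        """Word model for in-vocabulary words; for out-of-vocabulary words,
        pay P(out-of-vocabulary | previous word) times a character-model
        probability of the word's spelling."""
        if word in vocab:
            return p_word(word, prev_word)
        return p_oov_given_prev(prev_word) * p_spelling(word)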

23
Digit Sequences
  • Might think that a language model can't help with
    digit sequences (e.g. street numbers, dollar
    amounts)
  • Numbers are much more likely to start with 1 (30%)
  • Dollar amounts much more likely to end in .00,
    .99, .95

The entropy of first digits is about 10% lower than
uniform entropy; perhaps a 10% lower error rate on
first digits?
24
Why it's hard to use language models for
handwriting
  • Model incompatibilities
  • Lack of training data

25
Model Incompatibilities: Toy Example
  • Toy example of a handwriting system: segment into
    words, then segment words into letters, then
    recognize letters
  • Where do we put the language modeling
    probabilities? How do we integrate them into the
    segmentation and recognition process?

26
A slightly more realistic example: produce a
lattice of scores/results
(Figure: lattice of candidate letters for each ink segment)
  • For each segment, train recognizer (NN, SVM,
    etc.) to recognize the letter in the segment.
    Roughly, you learn P(letter | ink)
  • What happens when you multiply by the language model?
  • P(letter | ink) × P(letter | previous letters)
  • This is not a meaningful probability!
  • Will it work in practice? Maybe, maybe not (a sketch
    of this ad-hoc combination follows)
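A Python sketch of that ad-hoc combination in log space; the language-model weight is my assumption (in practice it would be tuned on held-out data), precisely because the product is a heuristic rather than a derived probability.

    import math

    def segment_score(p_letter_given_ink, p_letter_given_prev, lm_weight=1.0):
        """Heuristic per-segment score: log P(letter | ink) plus a weighted
        log P(letter | previous letters).  Not a meaningful probability,
        but the kind of combination that gets tried in practice."""
        return (math.log(p_letter_given_ink)
                + lm_weight * math.log(p_letter_given_prev))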

27
Some handwriting systems work very well with LMs
  • HMM-based approaches integrate very well with
    language model.
  • Hybrid models (NeuralNet with HMM, e.g. Marukatat
    01) also work well.
  • Use neural net to predict P(ink | state, previous
    ink)
  • Discriminatively trained neural nets can work
    very well, if trained at the sentence level, with
    LM integrated at training time.
  • Gradient-Based Learning Applied to Document
    Recognition (LeCun et al., 98)
  • How to integrate SVMs with LM? Not clear.
  • Often the SVM score is not a probability P(letter | ink)

28
Lots of training data, or none
  • Usually, can train on millions, hundreds of
    millions, or billions of words of data
  • Easy to find text that looks like addresses.
  • Hard to find text that looks like meeting notes
  • Potential gain from language models likely to be
    very application specific

29
Training Data is key
30
Quick Overview of LM Research
  • Trigram models are obviously brain-damaged; lots
    of improvements:
  • Smoothing
  • Caching
  • Clustering
  • Neural methods
  • Other stuff

31
Smoothing: None
  • Lowest perplexity trigram on training data.
  • Terrible on test data: if no occurrences of
    C(xyz), probability is 0.
  • Makes it impossible to recognize the word

32
Smoothing Key Ideas
  • Want some way to combine trigram probabilities,
    bigram probabilities, unigram probabilities
  • Use trigram/bigram for accurate modeling of
    context
  • Use unigram to get best guess you can when data
    is sparse.
  • Lots of different techniques
  • Simple interpolation (sketched below)
  • Absolute Discounting
  • Katz Smoothing (Good-Turing)
  • Interpolated Kneser-Ney Smoothing
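A Python sketch of the simplest of these, linear interpolation; the weights are placeholders that would normally be tuned on held-out data (e.g. with EM).

    from collections import Counter

    def train_counts(words):
        """Unigram, bigram and trigram counts for interpolation."""
        return (Counter(words),
                Counter(zip(words, words[1:])),
                Counter(zip(words, words[1:], words[2:])),
                len(words))

    def p_interpolated(w, u, v, uni, bi, tri, total, lams=(0.6, 0.3, 0.1)):
        """P(w | u v) as a weighted mix of trigram, bigram and unigram
        maximum-likelihood estimates, so it never collapses to zero."""
        p3 = tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0
        p2 = bi[(v, w)] / uni[v] if uni[v] else 0.0
        p1 = uni[w] / total
        return lams[0] * p3 + lams[1] * p2 + lams[2] * p1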

33
Smoothing: Interpolated Kneser-Ney
  • Simple to implement smoothing technique.
  • Consistently the best performing across a very
    wide range of conditions.
  • See the appendix of "A Bit of Progress in Language
    Modeling" for pseudo-code (a bigram-only sketch
    follows)
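The paper's appendix has the full pseudo-code; the Python sketch below shows only the bigram order of the idea (absolute discounting plus a continuation-count lower-order distribution), with the discount fixed at 0.75 as an assumption.

    from collections import Counter, defaultdict

    def train_kn_bigram(words, discount=0.75):
        """Interpolated Kneser-Ney, bigram order only.  Returns prob(w, v),
        an estimate of P(w | v)."""
        uni = Counter(words)
        bi = Counter(zip(words, words[1:]))
        followers = defaultdict(set)   # distinct words seen after each context
        preceders = defaultdict(set)   # distinct contexts seen before each word
        for v, w in bi:
            followers[v].add(w)
            preceders[w].add(v)
        types = len(bi)

        def prob(w, v):
            continuation = len(preceders[w]) / types if types else 0.0
            if uni[v] == 0:
                return continuation              # unseen context: back off fully
            backoff_mass = discount * len(followers[v]) / uni[v]
            return max(bi[(v, w)] - discount, 0) / uni[v] + backoff_mass * continuation

        return prob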

34
Caching
  • If you say something, you are likely to say it
    again later.
  • Interpolate trigram with cache
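A Python sketch of that interpolation with a unigram cache over the user's recent words; the cache weight is a placeholder to be tuned.

    from collections import Counter

    def p_with_cache(word, recent_words, p_static, cache_weight=0.1):
        """Mix the static model's P(word | context), passed in as p_static,
        with a unigram cache estimated from the user's recent text."""
        cache = Counter(recent_words)
        p_cache = cache[word] / len(recent_words) if recent_words else 0.0
        return (1 - cache_weight) * p_static + cache_weight * p_cache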

35
Caching: Real Life
  • Someone writes "The white house"
  • System recognizes "The white horse"
  • Cache remembers!
  • Person writes "The whole house", and, with cache,
    system recognizes "The whole horse"; errors
    are locked in.
  • Caching works well when users correct as they go,
    poorly or even hurts without correction.

36
Cache Results
37
Neural Probabilistic Language Models (Bengio et
al. 2000)
  • Multi-layer neural network
  • Similar to a convolutional-neural net applied to
    language modeling
  • Largest improvement reported for any single
    technique
  • Relatively slow and complex to build and apply,
    but see ongoing research
  • (Combination techniques, e.g. Bit of Progress
    paper, have slightly better overall results.)

38
Clustering
  • CLUSTERING = CLASSES (same thing)
  • What is P(Tuesday | party on)?
  • Similar to P(Monday | party on)
  • Similar to P(Tuesday | celebration on)
  • Put words in clusters
  • WEEKDAY = Sunday, Monday, Tuesday, …
  • EVENT = party, celebration, birthday, …
  • P(Tuesday | WEEKDAY) ×
  • P(WEEKDAY | EVENT PREPOSITION)
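A Python sketch of that factored prediction; `word2cluster`, `p_cluster_given_context` and `p_word_given_cluster` are hypothetical lookups standing in for trained models.

    def p_clustered(word, context, word2cluster,
                    p_cluster_given_context, p_word_given_cluster):
        """P(word | context) is approximated by
        P(cluster(word) | clustered context) * P(word | cluster(word)),
        e.g. P(WEEKDAY | EVENT PREPOSITION) * P(Tuesday | WEEKDAY)."""
        cluster = word2cluster[word]
        clustered_context = tuple(word2cluster[w] for w in context)
        return (p_cluster_given_context(cluster, clustered_context)
                * p_word_given_cluster(word, cluster))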

39
Cluster Results
40
Language Model Compression
  • Use handwriting on devices that are too small for
    a keyboard
  • Will also have memory constraints
  • Same reasoning/problems for speech recognition
  • Use count cutoffs: discard all n-grams that
    don't occur at least k times
  • Use Stolcke (1998) entropy-based pruning
  • Works better than cutoffs
  • Use clustering techniques (Goodman and Gao, 2000)
  • Smallest models
  • Harder to implement
  • Interacts poorly with tree representation in an
    HMM
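The count-cutoff option is simple enough to show directly (a Python sketch; Stolcke pruning instead removes the n-grams whose deletion changes the model's entropy least).

    def apply_count_cutoff(ngram_counts, k=2):
        """Keep only n-grams seen at least k times; the surviving counts
        are then smoothed and stored as usual."""
        return {ngram: c for ngram, c in ngram_counts.items() if c >= k}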

41
Lots and lots of other language model research
  • Endless ways to improve LMs
  • Sentence-mixture models
  • Skipping models
  • Parsing-based models
  • Decision-tree models
  • Maximum entropy (single layer NN) models

42
How to Build Language Models
  • Language modeling is depressingly easy
  • The best practical technique is to simply use a
    bigram or trigram model
  • Works well with tree-structured HMMs
  • Almost all other techniques don't
  • If you have correction information, also use a
    cache
  • If you have adaptation data (user-specific),
    interpolate it in (weighted average of
    probabilities)
  • Use count cutoffs or Stolcke pruning

43
Speech recognizer with language model
  • In theory, pick the words maximizing
    P(acoustics | words) × P(words)
  • In practice, the language model is a better predictor
    -- acoustic probabilities aren't real probabilities
  • In practice, penalize insertions (see the sketch below)
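A Python sketch of how the combination usually looks in practice; the language-model scale and word (insertion) penalty are placeholder constants tuned on held-out data.

    import math

    def hypothesis_score(acoustic_logprob, words, p_lm,
                         lm_scale=10.0, word_penalty=0.5):
        """log P(acoustics or ink | words) + lm_scale * log P(words)
        minus a per-word insertion penalty."""
        lm_logprob = sum(math.log(p_lm(w, words[:i]))
                         for i, w in enumerate(words))
        return acoustic_logprob + lm_scale * lm_logprob - word_penalty * len(words)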

44
Tools
  • SRI Language Modeling Toolkit
  • http://www.speech.sri.com/projects/srilm/
  • Free for non-profit use
  • Can handle clusters, lattices, n-best lists,
    hidden tags
  • CMU Language Modeling Toolkit
  • Can handle bigram, trigrams, more
  • Can handle different smoothing schemes
  • Many separate tools: output of one tool is input
    to the next; easy to use
  • Free for research purposes
  • http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html

45
Synergies between Handwriting and Speech/NLP
  • Language Modeling is only one of the similarities
    between handwriting and speech
  • The two communities have lots of useful ideas
    that they could learn from each other
  • Things for handwriting people to teach speech
    people
  • Other useful ideas in speech recognition

46
From Handwriting to Speech and NLP
  • Neural Nets for Language Modeling
  • Neural Probabilistic Language Model
  • Work by Yoshua Bengio et al. (handwriting
    researcher)
  • Like a convolutional neural net applied to
    language modeling
  • Some of the best and most exciting recent results
    in language modeling
  • Gradient-based learning with multi-layer Neural
    Nets (LeCun et al., 98)
  • Similar to CRFs, which are increasingly popular
    for NLP, but more sophisticated (CRFs are
    basically the single layer version.)
  • How to use SVMs and NNs to recognize sequence
    information
  • Probably much more; this is just stuff I noticed
    as I prepared

47
From Speech to Handwriting: Finite State
Transducers
  • More powerful than HMMs
  • Can encode n-gram models efficiently (like HMMs)
  • Can encode simple grammars like spelled-out
    numbers ("One thousand two hundred and eighteen")
    or addresses.
  • Can encode error models, e.g. spelling errors.
  • Some toolkits can convert transducer to a large
    HMM, then optimize/compress it.
  • Can encode context-dependencies efficiently
    (e.g., the first state of "a" is different when
    preceded by space than when preceded by "d"
    than when preceded by "o".)
  • Lots of toolkits available
  • AT&T Finite State Machine Library
  • MIT Finite-State Transducer Toolkit
  • SFST, the Stuttgart Finite State Transducer Tools

48
From Speech to Handwriting
  • ROVER
  • Technique for combining the final output of
    multiple recognizers, to get overall better
    results
  • HMM expertise
  • Endless tricks used in the speech world to improve
    search: sophisticated thresholding, lattice
    processing, multiple-pass techniques, tree
    structuring, etc.
  • Decision-Tree Senone Models
  • Ways to model context dependencies of phonemes
    (letters) without getting too much data sparsity.

49
More Resources
  • Joshua's web page: www.research.microsoft.com/joshuago
  • These slides
  • Slides for 3 hour language modeling tutorial
  • Smoothing paper with Stan Chen: good introduction
    to smoothing, and lots of details too.
  • "A Bit of Progress in Language Modeling":
    comprehensive overview and experimental
    comparison of LM techniques; best overall LM
    results.
  • Papers on fuzzy keyboard, language model
    compression, more.
  • Appendix to this talk

50
Conclusion
  • Handwriting has special challenges
  • Difficulty of getting training data for some
    applications
  • Difficulty integrating some model types (e.g.
    SVMs)
  • Language Modeling has been very helpful for every
    other kind of natural language input technique
  • Can reduce error rate by 1/3 or ½
  • Results for handwriting with similar improvements
  • Potential of language modeling is not just for
    note-taking
  • Addresses
  • Check-writing
  • Any application where the distribution is not
    uniform
  • May not be easy, but it will be worth it.

51
  • If you can read this slide, you're too close


53
More Resources
  • Joshua's web page: www.research.microsoft.com/joshuago
  • Smoothing technical report: good introduction to
    smoothing, and lots of details too.
  • "A Bit of Progress in Language Modeling", which
    is the journal version of much of this talk.
  • Papers on fuzzy keyboard, language model
    compression, and maximum entropy.

54
More Resources
  • Eugene Charniak's web page:
    http://www.cs.brown.edu/people/ec/home.html
  • Papers on statistical parsing for its own sake
    and for language modeling, as well as using
    language modeling to measure contextual
    influence.
  • Pointers to software for statistical parsing as
    well as statistical parsers optimized for
    language-modeling

55
More Resources: Books
  • Books (all are OK, none focus on language models)
  • Statistical Language Learning by Eugene Charniak
  • Speech and Language Processing by Dan Jurafsky
    and Jim Martin (especially Chapter 6)
  • Foundations of Statistical Natural Language
    Processing by Chris Manning and Hinrich Schütze.
  • Statistical Methods for Speech Recognition, by
    Frederick Jelinek

56
More Resources
  • Sentence Mixture Models (also, caching)
  • Rukmini Iyer, EE Ph.D. Thesis, 1998 "Improving
    and predicting performance of statistical
    language models in sparse domains"
  • Rukmini Iyer and Mari Ostendorf. Modeling long
    distance dependence in language: Topic mixtures
    versus dynamic cache models. IEEE Transactions
    on Acoustics, Speech and Audio Processing,
    7:30--39, January 1999.
  • Caching: Above, plus
  • R. Kuhn. Speech recognition and the frequency of
    recently used words: A modified Markov model for
    natural language. In 12th International
    Conference on Computational Linguistics, pages
    348--350, Budapest, August 1988.
  • R. Kuhn and R. D. Mori. A cache-based natural
    language model for speech recognition. IEEE
    Transactions on Pattern Analysis and Machine
    Intelligence, 12(6):570--583, 1990.
  • R. Kuhn and R. D. Mori. Correction to a
    cache-based natural language model for speech
    recognition. IEEE Transactions on Pattern
    Analysis and Machine Intelligence,
    14(6):691--692, 1992.

57
More Resources: Clustering
  • The seminal reference
  • P. F. Brown, V. J. DellaPietra, P. V. deSouza, J.
    C. Lai, and R. L. Mercer. Class-based n-gram
    models of natural language. Computational
    Linguistics, 18(4):467--479, December 1992.
  • Two-sided clustering
  • H. Yamamoto and Y. Sagisaka. Multi-class
    composite n-gram based on connection direction.
    In Proceedings of the IEEE International
    Conference on Acoustics, Speech and Signal
    Processing, Phoenix, Arizona, May 1999.
  • Fast clustering
  • D. R. Cutting, D. R. Karger, J. R. Pedersen, and
    J. W. Tukey. Scatter/Gather: A cluster-based
    approach to browsing large document collections.
    In SIGIR 92, 1992.
  • Other
  • R. Kneser and H. Ney. Improved clustering
    techniques for class-based statistical language
    modeling. In Eurospeech 93, volume 2, pages
    973--976, 1993.

58
More Resources
  • Structured Language Models
  • Eugene's web page
  • Ciprian Chelba's web page
  • http://www.clsp.jhu.edu/people/chelba/
  • Maximum Entropy
  • Roni Rosenfeld's home page and thesis
  • http://www.cs.cmu.edu/~roni/
  • Joshua's web page
  • Stolcke Pruning
  • A. Stolcke (1998), Entropy-based pruning of
    backoff language models. Proc. DARPA Broadcast
    News Transcription and Understanding Workshop,
    pp. 270-274, Lansdowne, VA. NOTE: get the corrected
    version from http://www.speech.sri.com/people/stolcke

59
More Resources: Skipping
  • Skipping
  • X. Huang, F. Alleva, H.-W. Hon, M.-Y. Hwang,
    K.-F. Lee, and R. Rosenfeld. The SPHINX-II speech
    recognition system: An overview. Computer,
    Speech, and Language, 2:137--148, 1993.
  • Lots of stuff
  • S. Martin, C. Hamacher, J. Liermann, F. Wessel,
    and H. Ney. Assessment of smoothing methods and
    complex stochastic language modeling. In 6th
    European Conference on Speech Communication and
    Technology, volume 5, pages 1939--1942, Budapest,
    Hungary, September 1999.
  • H. Ney, U. Essen, and R. Kneser.
    On structuring probabilistic dependences in
    stochastic language modeling. Computer, Speech,
    and Language, 8:1--38, 1994.

60
  • If you can read this slide, you're too close

61
Fuzzy Keyboard
  • Very small: users can type on a key boundary, or
    hit the wrong key easily
  • A soft keyboard is an image of a keyboard, e.g.
    Palm Pilot or PocketPC.

62
Fuzzy Keyboard: Language Model and Pen Positions
  • Math: language model times pen-position model
  • For pen-down positions, collect data and
    compute a simple Gaussian distribution (see the
    sketch below).
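A Python sketch of that product, with an isotropic Gaussian per key; the variance is a placeholder that would be estimated from the collected pen-down data.

    import math

    def pen_logprob(pen_xy, key_center, sigma=4.0):
        """Log probability of a pen-down position under a simple Gaussian
        centered on the intended key."""
        dx, dy = pen_xy[0] - key_center[0], pen_xy[1] - key_center[1]
        return (-(dx * dx + dy * dy) / (2 * sigma * sigma)
                - math.log(2 * math.pi * sigma * sigma))

    def fuzzy_word_score(pen_positions, word, key_centers, p_lm):
        """Language model times pen-position model, in log space:
        log P(word) + sum over letters of log P(pen position | key)."""
        return math.log(p_lm(word)) + sum(
            pen_logprob(xy, key_centers[ch])
            for xy, ch in zip(pen_positions, word))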

63
Fuzzy Keyboard: Results
  • 40% fewer errors, same speed.