Title: Outline
1. Outline
- Applications
- Spelling correction
- Formal Representation
- Weighted FSTs
- Algorithms
- Bayesian Inference (Noisy channel model)
- Methods to determine weights
- Hand-coded
- Corpus-based estimation
- Dynamic Programming
- Shortest path
2. Detecting and Correcting Spelling Errors
- Sources of lexical/spelling errors
- Speech: lexical access and recognition errors (more later)
- Text: typing and cognitive errors
- OCR: recognition errors
- Applications
- Spell checking
- Handwriting recognition of zip codes, signatures, Graffiti
- Issues
- Correcting non-words in isolation (dg for dog -- why not dig?)
- Correcting non-words could lead to valid words
- Homophone substitution: "parents love there children", "Lets order a desert after dinner"
- Correcting words in context
3. Patterns of Error
- Human typists make different types of errors from OCR systems -- why?
- Error classification I: performance-based
- Insertion: catt
- Deletion: ct
- Substitution: car
- Transposition: cta
- Error classification II: cognitive
- People don't know how to spell (nucular/nuclear, potatoe/potato)
- Homonymous errors (their/there)
4. Probability Refresher
- Population: 10 Princeton students
- What is the probability that a randomly chosen student (rcs) is a vegetarian? p(v) = 0.4
- That a rcs is a CS major? p(c) = 0.3
- That a rcs is a vegetarian and a CS major? p(c,v) = 0.2
- That a vegetarian is a CS major? p(c|v) = 0.5
- That a CS major is a vegetarian? p(v|c) = 0.66
- That a non-CS major is a vegetarian? p(v|¬c) = ??
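A quick Python check of these figures, including the last question. The raw counts (4 vegetarians, 3 CS majors, 2 students who are both, out of 10) are an assumption consistent with the stated probabilities:

    # Sanity-checking the probability refresher with assumed raw counts.
    n, veg, cs, both = 10, 4, 3, 2

    p_v = veg / n                  # p(v) = 0.4
    p_c = cs / n                   # p(c) = 0.3
    p_c_and_v = both / n           # p(c,v) = 0.2
    p_c_given_v = both / veg       # p(c|v) = 0.5
    p_v_given_c = both / cs        # p(v|c) = 0.666...
    p_v_given_not_c = (veg - both) / (n - cs)  # p(v|¬c) = 2/7 = 0.2857...
    print(p_v_given_not_c)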
5. Bayes Rule and Noisy Channel Model
- We know the joint probabilities:
- p(c,v) = p(c) p(v|c) (chain rule)
- p(v,c) = p(c,v) = p(v) p(c|v)
- So we can define the conditional probability p(c|v) in terms of the prior probabilities p(c) and p(v) and the likelihood p(v|c).
- Noisy channel metaphor: the channel corrupts the input; recover the original.
- Think cell-phone conversations!!
- Hearer's challenge: decode what the speaker said (w), given a channel-corrupted observation (O).
[Figure: source model feeding a channel model]
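Written out, the derivation the slide gestures at is:

    p(c,v) = p(v)\,p(c \mid v) = p(c)\,p(v \mid c)
    \quad\Rightarrow\quad
    p(c \mid v) = \frac{p(v \mid c)\,p(c)}{p(v)}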
6. How do we use this model to correct spelling errors?
- Simplifying assumptions:
- We only have to correct non-word errors
- Each non-word (O) differs from its correct word (w) by one step (insertion, deletion, substitution, transposition)
- Generate and Test method (Kernighan et al. 1990)
- Generate a word using one of the substitution, deletion, insertion, or transposition operations
- Test if the resulting word is in the dictionary.
- Example:

  Observation | Correct | Correct letter | Error letter | Position | Type of error
  caat        | cat     | -              | a            | 2        | insertion
  caat        | carat   | r              | -            | 3        | deletion
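A minimal Python sketch of the generate-and-test idea. The dictionary is a toy stand-in, and this is in the spirit of Kernighan et al. (1990) rather than their exact method:

    ALPHABET = "abcdefghijklmnopqrstuvwxyz"
    DICTIONARY = {"cat", "carat", "cart", "coat", "chat"}  # toy stand-in

    def one_edit_candidates(word):
        """Every string one insertion, deletion, substitution, or
        transposition away from `word`."""
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = {l + r[1:] for l, r in splits if r}
        transposes = {l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1}
        substitutes = {l + ch + r[1:] for l, r in splits if r for ch in ALPHABET}
        inserts = {l + ch + r for l, r in splits for ch in ALPHABET}
        return deletes | transposes | substitutes | inserts

    def corrections(typo):
        # Test step: keep only candidates that are dictionary words.
        return one_edit_candidates(typo) & DICTIONARY

    print(corrections("caat"))  # {'cat', 'carat', 'cart', 'coat', 'chat'}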
7. How do we decide which correction is most likely?
- Validate the generated word in a dictionary.
- But there may be multiple valid words -- how to rank them?
- Rank them based on a scoring function:
- P(w | typo) ∝ P(typo | w) P(w)
- Note: there could be other scoring functions
- Propose n-best solutions
- Estimate the likelihood P(typo|w) and the prior P(w)
- Count events from a corpus to estimate these probabilities
- Labeled versus unlabeled corpus
- For spelling correction, what do we need?
- Word occurrence information (unlabeled corpus)
- A corpus of labeled spelling errors
- Approximate word replacement by local letter replacement probabilities: a confusion matrix on letters
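A minimal sketch of estimating such letter-level probabilities, assuming the labeled error corpus has already been reduced to single-letter edit events. The events and counts below are invented for illustration:

    from collections import Counter

    # Each event: (operation, preceding letter, letter involved).
    events = [
        ("ins", "a", "a"),  # an extra 'a' typed after 'a'
        ("ins", "a", "a"),
        ("del", "a", "r"),  # 'r' dropped after 'a'
        ("sub", "e", "i"),  # 'i' typed for 'e'
    ]

    confusion = Counter(events)
    total = sum(confusion.values())
    for event, count in sorted(confusion.items()):
        # Relative frequencies serve as the P(typo|word) factors.
        print(event, count / total)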
8. Cat vs Carat
- Estimating the prior: suppose we look at the occurrence of cat and carat in a large (50M word) AP news corpus
- cat occurs 6500 times, so p(cat) = 0.00013
- carat occurs 3000 times, so p(carat) = 0.00006
- Estimating the likelihood: now we need to find out if inserting an 'a' after an 'a' is more likely than deleting an 'r' after an 'a', in a corrections corpus of 50K corrections (≈ p(typo|word))
- Suppose 'a' insertion after 'a' occurs 5000 times (p(+a) = 0.1) and 'r' deletion occurs 7500 times (p(-r) = 0.15)
- Scoring function: p(word|typo) ∝ p(typo|word) p(word)
- p(cat|caat) ∝ p(+a) p(cat) = 0.1 × 0.00013 = 0.000013
- p(carat|caat) ∝ p(-r) p(carat) = 0.15 × 0.00006 = 0.000009
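The same arithmetic as a few lines of Python:

    p_cat = 6500 / 50_000_000      # 0.00013
    p_carat = 3000 / 50_000_000    # 0.00006
    p_ins_a = 5000 / 50_000        # 0.1  (a-insertion after a)
    p_del_r = 7500 / 50_000        # 0.15 (r-deletion after a)

    score_cat = p_ins_a * p_cat        # 1.3e-05
    score_carat = p_del_r * p_carat    # 9.0e-06
    print("cat" if score_cat > score_carat else "carat")  # cat wins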
9. Encoding One-Error Correction as WFSTs
- Let Σ = {c, a, r, t}
- One-edit model
- Dictionary model
- One-error spelling correction:
- Input ∘ Edit ∘ Dictionary
[Figures: one-edit transducer and dictionary acceptor over Σ]
10. Issues
- What if there are no instances of carat in the corpus?
- Smoothing algorithms
- Estimate of P(typo|word) may not be accurate
- Training probabilities on typo/word pairs
- What if there is more than one error per word?
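One standard remedy for the zero-count problem is add-one (Laplace) smoothing. A minimal sketch, with an assumed vocabulary size:

    def smoothed_prior(count, total_words, vocab_size):
        # Add-one smoothing: every word, seen or not, gets a nonzero prior.
        return (count + 1) / (total_words + vocab_size)

    print(smoothed_prior(0, 50_000_000, 500_000))     # unseen word, still > 0
    print(smoothed_prior(6500, 50_000_000, 500_000))  # ~0.00013 still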
11. Minimum Edit Distance
- How can we measure how different one word is from another word?
- How many operations will it take to transform one word into another?
- caat -> cat, fplc -> fireplace (treat abbreviations as typos??)
- Levenshtein distance: smallest number of insertion, deletion, or substitution operations that transform one string into another (ins = del = subst = 1)
- Alternative: weight each operation by training on a corpus of spelling errors to see which is most frequent
12. Computing Levenshtein Distance
- Dynamic programming algorithm
- The solution for a problem is a function of the solutions of subproblems
- d[i,j] contains the distance between the prefixes s1..i and t1..j
- d[i,j] is computed by combining the distances of shorter substrings using insertion, deletion, and substitution operations (see the sketch below).
- The optimal edit operations are recovered by storing back-pointers.
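A sketch of the dynamic program. Costs are parameters, so the cost-2-substitution variant on the next slide is just the call edit_distance(s, t, sub=2); back-pointers are omitted for brevity:

    def edit_distance(s, t, ins=1, dele=1, sub=1):
        # d[i][j] = distance between the prefixes s[:i] and t[:j].
        m, n = len(s), len(t)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            d[i][0] = i * dele
        for j in range(1, n + 1):
            d[0][j] = j * ins
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                d[i][j] = min(
                    d[i - 1][j] + dele,    # delete s[i-1]
                    d[i][j - 1] + ins,     # insert t[j-1]
                    d[i - 1][j - 1] + (0 if s[i - 1] == t[j - 1] else sub),
                )
        return d[m][n]

    print(edit_distance("caat", "cat"))        # 1
    print(edit_distance("fplc", "fireplace"))  # 5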
13. Edit Distance Matrix
[Figure: edit distance matrix (NB: errors)]
- Cost = 1 for insertions and deletions; cost = 2 for substitutions
- Recompute the matrix with insertions = deletions = substitutions = 1
14. Levenshtein Distance with WFSTs
- Let Σ = {c, a, r, t}
- Edit model
- The two sentences to be compared are encoded as FSTs.
- Levenshtein distance between two sentences:
- Dist(s1, s2) = s1 ∘ Edit ∘ s2
[Figure: one-state edit transducer with identity arcs c:c, a:a, r:r, t:t (cost 0), deletion arcs c:ε, a:ε, r:ε, t:ε (Del), substitution arcs c:a, c:r, c:t, a:c, a:t, ... (Sub), and insertion arcs ε:c, ε:a, ε:r, ε:t (Ins)]
15. Spelling Correction with WFSTs
- Dictionary: FST representation of words
- Isolated word spelling correction:
- AllCorrections(w) = w ∘ Edit ∘ Dictionary
- BestCorrection(w) = BestPath(w ∘ Edit ∘ Dictionary)
- Spelling correction in context: "parents love there children"
- S = w1, w2, ..., wn
- Spelling correction of wi:
- Generate possible edits for wi
- Pick the edit that fits best in context (see the sketch below)
- Use an n-gram language model (LM) to rank the alternatives.
- "love there" vs "love their"; "there children" vs "their children"
- SentenceCorrection(S) = F(S) ∘ Edit ∘ LM
16.
- Aoccdrnig to a rscheearch at an Elingsh uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteers are at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by itslef but the wrod as a wlohe.
- Can humans understand what is meant as opposed to what is said/written?
- How?
- http://www.mrc-cbu.cam.ac.uk/personal/matt.davis/Cmabrigde/
17. Summary
- We can apply probabilistic modeling to NL problems like spell-checking
- Noisy channel model, Bayesian method
- Training priors and likelihoods on a corpus
- Dynamic programming approaches allow us to solve large problems that can be decomposed into subproblems
- e.g., the Minimum Edit Distance algorithm
- A number of speech and language tasks can be cast in this framework:
- Generate alternatives using a generator
- Select the best / rank the alternatives using a model
- If the generator and the model are encodable as FSTs, decoding becomes composition followed by search for the best path.
18. Word Classes and Tagging
19. Word Classes and Tagging
- Words can be grouped into classes based on a number of criteria.
- Application-independent criteria:
- Syntactic class (nouns, verbs, adjectives)
- Proper names (people names, country names)
- Dates, currencies
- Application-specific criteria:
- Product names (Ajax, Slurpee, Lexmark 3100)
- Service names (7-cents plan, GoldPass)
- Tagging: categorizing the words of a sentence into one of the classes.
20. Syntactic Classes in English: Open Class Words
- Nouns
- Defined semantically: words for people, places, things
- Defined syntactically: words that take determiners
- Count nouns: nouns that can be counted
- One book, two computers, hundred men
- Mass nouns: nouns that represent homogeneous groups; can occur without articles
- snow, salt, milk, water, hair
- Proper nouns vs common nouns
- Verbs: words for actions and processes
- hit, love, run, fly, differ, go
- Adjectives: words for describing qualities and properties (modifiers) of objects
- white, black, old, young, good, bad
- Adverbs: words for describing modifiers of actions
- Unfortunately, John walked home extremely slowly yesterday
- Subclasses: locative (home), degree (very), manner (slowly), temporal (yesterday)
21. Syntactic Classes in English: Closed Class Words
- Closed class words:
- A fixed set for a language
- Typically high-frequency words
- Prepositions: relational words for describing relations among objects and events
- in, on, before, by
- Particles: looked up, throw out
- Articles/Determiners: definite versus indefinite
- Indefinite: a, an
- Definite: the
- Conjunctions: used to join two phrases, clauses, or sentences
- Coordinating conjunctions: and, or, but
- Subordinating conjunctions: that, since, because
- Pronouns: shorthand to refer to objects and events
- Personal pronouns: he, she, it, they, us
- Possessive pronouns: my, your, ours, theirs, his, hers, its, one's
- Wh-pronouns: whose, what, who, whom, whomever
- Auxiliary verbs: used to mark tense, aspect, polarity, mood of an action
- Tense: past, present, future
- Aspect: completed or on-going
22. Tagset
- Tagset: the set of tags to use depends on the application.
- Basic tags vs tags with some morphology
- Composition of a number of subtags
- Agglutinative languages
- Popular tagsets for English:
- Penn Treebank tagset: 45 tags
- CLAWS tagset: 61 tags
- C7 tagset: 146 tags
- How do we decide how many tags to use?
- Application utility
- Ease of disambiguation
- Annotation consistency
- IN tag in the Penn Treebank tagset: subordinating conjunctions and prepositions
- TO tag: represents the preposition 'to' and the infinitival marker ('to read')
- Supertags: fold syntactic information into the tagset
- Of the order of 1000 tags
23. Tagging: Disambiguating Words
- Three different models:
- ENGTWOL model (Karlsson et al. 1995)
- Transformation-based model (Brill 1995)
- Hidden Markov Model tagger
- ENGTWOL tagger
- Constraint-based tagger
- 1,100 hand-written constraints to rule out invalid combinations of tags
- Use of probabilistic constraints and syntactic information
- Transformation-based model
- Start with the most likely assignment
- Make note of the context when the most likely assignment is wrong
- Induce a transformation rule that corrects the most likely assignment to the correct tag in that context
- Rules can be seen as rewrites α → β / δ _ γ
- Compilable into an FST
24. Again, the Noisy Channel Model
- Input to the channel: a part-of-speech sequence T
- Output from the channel: a word sequence W
- Decoding task: find T* = argmax_T P(T|W)
- Using Bayes Rule (written out below)
- And since P(W) doesn't change for any hypothetical T:
- T* = argmax_T P(W|T) P(T)
- P(W|T) is the emit probability, and P(T) is the prior, or contextual probability
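Written out, the Bayes Rule step is:

    T^* = \arg\max_T P(T \mid W)
        = \arg\max_T \frac{P(W \mid T)\, P(T)}{P(W)}
        = \arg\max_T P(W \mid T)\, P(T)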
25. Stochastic Tagging: Markov Assumption
- The tagging model is approximated using Markov assumptions.
- T* = argmax_T P(T) P(W|T)
- Markov (first-order) assumption: P(T) ≈ ∏i P(ti|ti-1)
- Independence assumption: P(W|T) ≈ ∏i P(wi|ti)
- Thus: T* = argmax_T ∏i P(wi|ti) P(ti|ti-1)
- The probability distributions are estimated from an annotated corpus.
- Maximum likelihood estimates:
- P(w|t) = count(w,t) / count(t)
- P(ti|ti-1) = count(ti-1, ti) / count(ti-1)
- Don't forget to smooth the counts!!
- There are other means of estimating these probabilities.
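A minimal sketch of these maximum likelihood estimates over a toy tagged corpus. The corpus format is an assumption, and the counts are unsmoothed, so heed the warning above:

    from collections import Counter

    corpus = [[("the", "DT"), ("dog", "NN"), ("runs", "VBZ")]]  # toy corpus

    emit, tag_count, trans, prev_count = Counter(), Counter(), Counter(), Counter()
    for sentence in corpus:
        prev = "BOS"
        for word, tag in sentence:
            emit[(word, tag)] += 1
            tag_count[tag] += 1
            trans[(prev, tag)] += 1
            prev_count[prev] += 1
            prev = tag

    def p_emit(w, t):
        return emit[(w, t)] / tag_count[t]              # count(w,t) / count(t)

    def p_trans(t, t_prev):
        return trans[(t_prev, t)] / prev_count[t_prev]  # bigram / unigram count

    print(p_emit("dog", "NN"), p_trans("NN", "DT"))  # 1.0 1.0 on this toy data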
26. Best Path Search
- Search for the best path pervades many speech and NLP problems.
- ASR: best path through a composition of acoustic, pronunciation, and language models
- Tagging: best path through a composition of lexicon and contextual model
- Edit distance: best path through a search space set up by insertion, deletion, and substitution operations
- In general:
- Decisions/operations create a weighted search space
- Search for the best sequence of decisions
- Dynamic programming solution
- Sometimes only the score is relevant.
- Most often the path (sequence of states = derivation) is relevant.
27. Multi-stage Decision Problems
[Figure: tagging trellis for the sentence "The dog runs ." over the states BOS, DT, NN, VB, NNS, VBZ, ., EOS]
- Transition probabilities: P(DT|BOS) = 1, P(NN|DT) = 0.9, P(VB|DT) = 0.1, P(NNS|NN) = 0.3, P(VBZ|NN) = 0.7, P(NNS|VB) = 0.7, P(VBZ|VB) = 0.3, P(.|NNS) = 0.3, P(.|VBZ) = 0.7, P(EOS|.) = 1
- Emission probabilities: P(the|DT) = 0.999, P(dog|NN) = 0.99, P(dog|VB) = 0.01, P(runs|NNS) = 0.63, P(runs|VBZ) = 0.37, P(.|.) = 0.999
28. Multi-stage Decision Problems
[Figure: the same trellis as above]
- Find the state sequence through this space that maximizes ∏ P(w|t) P(t|t-1)
- cost(BOS, EOS) = 1 × cost(DT, EOS)
- cost(DT, EOS) = max{ P(the|DT) P(NN|DT) cost(NN, EOS), P(the|DT) P(VB|DT) cost(VB, EOS) }
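A minimal Viterbi sketch over this trellis, using the slide's numbers:

    # Transition and emission tables from the trellis above.
    TRANS = {("BOS", "DT"): 1.0, ("DT", "NN"): 0.9, ("DT", "VB"): 0.1,
             ("NN", "NNS"): 0.3, ("NN", "VBZ"): 0.7, ("VB", "NNS"): 0.7,
             ("VB", "VBZ"): 0.3, ("NNS", "."): 0.3, ("VBZ", "."): 0.7,
             (".", "EOS"): 1.0}
    EMIT = {("the", "DT"): 0.999, ("dog", "NN"): 0.99, ("dog", "VB"): 0.01,
            ("runs", "NNS"): 0.63, ("runs", "VBZ"): 0.37, (".", "."): 0.999}
    STAGES = [("the", ["DT"]), ("dog", ["NN", "VB"]),
              ("runs", ["NNS", "VBZ"]), (".", ["."])]

    best = {"BOS": (1.0, [])}  # tag -> (best path score, tag sequence)
    for word, tags in STAGES:
        # For each state, keep the highest-scoring way of reaching it.
        best = {t: max((s * TRANS.get((p, t), 0.0) * EMIT[(word, t)], seq + [t])
                       for p, (s, seq) in best.items())
                for t in tags}

    score, seq = max((s * TRANS.get((p, "EOS"), 0.0), seq)
                     for p, (s, seq) in best.items())
    print(seq, score)  # ['DT', 'NN', 'VBZ', '.'] with score ~0.161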
29. Two Ways of Reasoning
- Forward approach (backward reasoning)
- Compute the best way to get from a state to the goal state.
- Backward approach (forward reasoning)
- Compute the best way to get from the source state to a state.
- A combination of these two approaches is used in unsupervised training of HMMs.
- Forward-backward algorithm (Appendix D)