Title: Outline
1. Outline
- Applications
- Spelling correction
- Formal Representation
- Weighted FSTs
- Algorithms
- Bayesian Inference (Noisy channel model)
- Methods to determine weights
- Hand-coded
- Corpus-based estimation
- Dynamic Programming
- Shortest path
2. Detecting and Correcting Spelling Errors
- Sources of lexical/spelling errors
- Speech: lexical access and recognition errors (more later)
- Text: typing and cognitive errors
- OCR: recognition errors
- Applications
- Spell checking
- Handwriting recognition of zip codes, signatures, Graffiti
- Issues
- Correcting non-words in isolation (dg for dog -- why not dig?)
- Correcting non-words could lead to valid words
- Homophone substitution: "parents love there children", "Lets order a desert after dinner"
- Correcting words in context
3. Patterns of Error
- Human typists make different types of errors from OCR systems -- why?
- Error classification I: performance-based
- Insertion: catt
- Deletion: ct
- Substitution: car
- Transposition: cta
- Error classification II: cognitive
- People don't know how to spell (nucular/nuclear, potatoe/potato)
- Homonymous errors (their/there)
4. Probability Refresher
- Population: 10 Princeton students
- What is the probability that a randomly chosen student (rcs) is a vegetarian? p(v) = 0.4
- That a rcs is a CS major? p(c) = 0.3
- That a rcs is a vegetarian and a CS major? p(c,v) = 0.2
- That a vegetarian is a CS major? p(c|v) = 0.5
- That a CS major is a vegetarian? p(v|c) = 0.66
- That a non-CS major is a vegetarian? p(v|¬c) = ??
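A quick Python check of these figures, including the last question. The raw counts (4 vegetarians, 3 CS majors, 2 students who are both, out of 10) are an assumption consistent with the stated probabilities:

    # Sanity-checking the probability refresher with assumed raw counts.
    n, veg, cs, both = 10, 4, 3, 2

    p_v = veg / n                  # p(v) = 0.4
    p_c = cs / n                   # p(c) = 0.3
    p_c_and_v = both / n           # p(c,v) = 0.2
    p_c_given_v = both / veg       # p(c|v) = 0.5
    p_v_given_c = both / cs        # p(v|c) = 0.666...
    p_v_given_not_c = (veg - both) / (n - cs)  # p(v|¬c) = 2/7 = 0.2857...
    print(p_v_given_not_c)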
5. Bayes Rule and Noisy Channel Model
- We know the joint probabilities:
- p(c,v) = p(c) p(v|c) (chain rule)
- p(v,c) = p(c,v) = p(v) p(c|v)
- So we can define the conditional probability p(c|v) in terms of the prior probabilities p(c) and p(v) and the likelihood p(v|c).
- Noisy channel metaphor: the channel corrupts the input; recover the original.
- Think cell-phone conversations!!
- Hearer's challenge: decode what the speaker said (w), given a channel-corrupted observation (O).
[Figure: source model feeding a channel model]
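Written out, the derivation the slide gestures at is:

    p(c,v) = p(v)\,p(c \mid v) = p(c)\,p(v \mid c)
    \quad\Rightarrow\quad
    p(c \mid v) = \frac{p(v \mid c)\,p(c)}{p(v)}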
6. How do we use this model to correct spelling errors?
- Simplifying assumptions:
- We only have to correct non-word errors
- Each non-word (O) differs from its correct word (w) by one step (insertion, deletion, substitution, transposition)
- Generate and Test method (Kernighan et al. 1990)
- Generate a word using one of the substitution, deletion, insertion, or transposition operations
- Test if the resulting word is in the dictionary.
- Example:

  Observation | Correct | Correct letter | Error letter | Position | Type of error
  caat        | cat     | -              | a            | 2        | insertion
  caat        | carat   | r              | -            | 3        | deletion
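A minimal Python sketch of the generate-and-test idea. The dictionary is a toy stand-in, and this is in the spirit of Kernighan et al. (1990) rather than their exact method:

    ALPHABET = "abcdefghijklmnopqrstuvwxyz"
    DICTIONARY = {"cat", "carat", "cart", "coat", "chat"}  # toy stand-in

    def one_edit_candidates(word):
        """Every string one insertion, deletion, substitution, or
        transposition away from `word`."""
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = {l + r[1:] for l, r in splits if r}
        transposes = {l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1}
        substitutes = {l + ch + r[1:] for l, r in splits if r for ch in ALPHABET}
        inserts = {l + ch + r for l, r in splits for ch in ALPHABET}
        return deletes | transposes | substitutes | inserts

    def corrections(typo):
        # Test step: keep only candidates that are dictionary words.
        return one_edit_candidates(typo) & DICTIONARY

    print(corrections("caat"))  # {'cat', 'carat', 'cart', 'coat', 'chat'}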
7. How do we decide which correction is most likely?
- Validate the generated word in a dictionary.
- But there may be multiple valid words -- how to rank them?
- Rank them based on a scoring function:
- P(w | typo) ∝ P(typo | w) P(w)
- Note: there could be other scoring functions
- Propose n-best solutions
- Estimate the likelihood P(typo|w) and the prior P(w)
- Count events from a corpus to estimate these probabilities
- Labeled versus unlabeled corpus
- For spelling correction, what do we need?
- Word occurrence information (unlabeled corpus)
- A corpus of labeled spelling errors
- Approximate word replacement by local letter replacement probabilities: a confusion matrix on letters
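A minimal sketch of estimating such letter-level probabilities, assuming the labeled error corpus has already been reduced to single-letter edit events. The events and counts below are invented for illustration:

    from collections import Counter

    # Each event: (operation, preceding letter, letter involved).
    events = [
        ("ins", "a", "a"),  # an extra 'a' typed after 'a'
        ("ins", "a", "a"),
        ("del", "a", "r"),  # 'r' dropped after 'a'
        ("sub", "e", "i"),  # 'i' typed for 'e'
    ]

    confusion = Counter(events)
    total = sum(confusion.values())
    for event, count in sorted(confusion.items()):
        # Relative frequencies serve as the P(typo|word) factors.
        print(event, count / total)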
8. Cat vs Carat
- Estimating the prior: suppose we look at the occurrence of cat and carat in a large (50M word) AP news corpus
- cat occurs 6500 times, so p(cat) = 0.00013
- carat occurs 3000 times, so p(carat) = 0.00006
- Estimating the likelihood: now we need to find out if inserting an 'a' after an 'a' is more likely than deleting an 'r' after an 'a', in a corrections corpus of 50K corrections (≈ p(typo|word))
- Suppose 'a' insertion after 'a' occurs 5000 times (p(+a) = 0.1) and 'r' deletion occurs 7500 times (p(-r) = 0.15)
- Scoring function: p(word|typo) ∝ p(typo|word) p(word)
- p(cat|caat) ∝ p(+a) p(cat) = 0.1 × 0.00013 = 0.000013
- p(carat|caat) ∝ p(-r) p(carat) = 0.15 × 0.00006 = 0.000009
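The same arithmetic as a few lines of Python:

    p_cat = 6500 / 50_000_000      # 0.00013
    p_carat = 3000 / 50_000_000    # 0.00006
    p_ins_a = 5000 / 50_000        # 0.1  (a-insertion after a)
    p_del_r = 7500 / 50_000        # 0.15 (r-deletion after a)

    score_cat = p_ins_a * p_cat        # 1.3e-05
    score_carat = p_del_r * p_carat    # 9.0e-06
    print("cat" if score_cat > score_carat else "carat")  # cat wins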
9. Encoding One-Error Correction as WFSTs
- Let Σ = {c, a, r, t}
- One-edit model
- Dictionary model
- One-error spelling correction:
- Input ∘ Edit ∘ Dictionary
[Figures: one-edit transducer and dictionary acceptor over Σ]
10. Issues
- What if there are no instances of carat in the corpus?
- Smoothing algorithms
- Estimate of P(typo|word) may not be accurate
- Training probabilities on typo/word pairs
- What if there is more than one error per word?
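One standard remedy for the zero-count problem is add-one (Laplace) smoothing. A minimal sketch, with an assumed vocabulary size:

    def smoothed_prior(count, total_words, vocab_size):
        # Add-one smoothing: every word, seen or not, gets a nonzero prior.
        return (count + 1) / (total_words + vocab_size)

    print(smoothed_prior(0, 50_000_000, 500_000))     # unseen word, still > 0
    print(smoothed_prior(6500, 50_000_000, 500_000))  # ~0.00013 still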
11. Minimum Edit Distance
- How can we measure how different one word is from another word?
- How many operations will it take to transform one word into another?
- caat -> cat, fplc -> fireplace (treat abbreviations as typos??)
- Levenshtein distance: smallest number of insertion, deletion, or substitution operations that transform one string into another (ins = del = subst = 1)
- Alternative: weight each operation by training on a corpus of spelling errors to see which is most frequent
12. Computing Levenshtein Distance
- Dynamic programming algorithm
- The solution for a problem is a function of the solutions of subproblems
- d[i,j] contains the distance between the prefixes s1..i and t1..j
- d[i,j] is computed by combining the distances of shorter substrings using insertion, deletion, and substitution operations (see the sketch below).
- The optimal edit operations are recovered by storing back-pointers.
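A sketch of the dynamic program. Costs are parameters, so the cost-2-substitution variant on the next slide is just the call edit_distance(s, t, sub=2); back-pointers are omitted for brevity:

    def edit_distance(s, t, ins=1, dele=1, sub=1):
        # d[i][j] = distance between the prefixes s[:i] and t[:j].
        m, n = len(s), len(t)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            d[i][0] = i * dele
        for j in range(1, n + 1):
            d[0][j] = j * ins
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                d[i][j] = min(
                    d[i - 1][j] + dele,    # delete s[i-1]
                    d[i][j - 1] + ins,     # insert t[j-1]
                    d[i - 1][j - 1] + (0 if s[i - 1] == t[j - 1] else sub),
                )
        return d[m][n]

    print(edit_distance("caat", "cat"))        # 1
    print(edit_distance("fplc", "fireplace"))  # 5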
13. Edit Distance Matrix
[Figure: edit distance matrix (NB: errors)]
- Cost = 1 for insertions and deletions; cost = 2 for substitutions
- Recompute the matrix with insertions = deletions = substitutions = 1
14. Levenshtein Distance with WFSTs
- Let Σ = {c, a, r, t}
- Edit model
- The two sentences to be compared are encoded as FSTs.
- Levenshtein distance between two sentences:
- Dist(s1, s2) = s1 ∘ Edit ∘ s2
[Figure: one-state edit transducer with identity arcs c:c, a:a, r:r, t:t (cost 0), deletion arcs c:ε, a:ε, r:ε, t:ε (Del), substitution arcs c:a, c:r, c:t, a:c, a:t, ... (Sub), and insertion arcs ε:c, ε:a, ε:r, ε:t (Ins)]
15. Spelling Correction with WFSTs
- Dictionary: FST representation of words
- Isolated word spelling correction:
- AllCorrections(w) = w ∘ Edit ∘ Dictionary
- BestCorrection(w) = BestPath(w ∘ Edit ∘ Dictionary)
- Spelling correction in context: "parents love there children"
- S = w1, w2, ..., wn
- Spelling correction of wi:
- Generate possible edits for wi
- Pick the edit that fits best in context (see the sketch below)
- Use an n-gram language model (LM) to rank the alternatives.
- "love there" vs "love their"; "there children" vs "their children"
- SentenceCorrection(S) = F(S) ∘ Edit ∘ LM
16.
- Aoccdrnig to a rscheearch at an Elingsh uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteers are at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by itslef but the wrod as a wlohe.
- Can humans understand what is meant as opposed to what is said/written?
- How?
- http://www.mrc-cbu.cam.ac.uk/personal/matt.davis/Cmabrigde/
17. Summary
- We can apply probabilistic modeling to NL problems like spell-checking
- Noisy channel model, Bayesian method
- Training priors and likelihoods on a corpus
- Dynamic programming approaches allow us to solve large problems that can be decomposed into subproblems
- e.g., the Minimum Edit Distance algorithm
- A number of speech and language tasks can be cast in this framework:
- Generate alternatives using a generator
- Select the best / rank the alternatives using a model
- If the generator and the model are encodable as FSTs, decoding becomes composition followed by search for the best path.
18. Word Classes and Tagging
19. Word Classes and Tagging
- Words can be grouped into classes based on a number of criteria.
- Application-independent criteria:
- Syntactic class (nouns, verbs, adjectives)
- Proper names (people names, country names)
- Dates, currencies
- Application-specific criteria:
- Product names (Ajax, Slurpee, Lexmark 3100)
- Service names (7-cents plan, GoldPass)
- Tagging: categorizing the words of a sentence into one of the classes.
20. Syntactic Classes in English: Open Class Words
- Nouns
- Defined semantically: words for people, places, things
- Defined syntactically: words that take determiners
- Count nouns: nouns that can be counted
- One book, two computers, hundred men
- Mass nouns: nouns that represent homogeneous groups; can occur without articles
- snow, salt, milk, water, hair
- Proper nouns vs common nouns
- Verbs: words for actions and processes
- hit, love, run, fly, differ, go
- Adjectives: words for describing qualities and properties (modifiers) of objects
- white, black, old, young, good, bad
- Adverbs: words for describing modifiers of actions
- Unfortunately, John walked home extremely slowly yesterday
- Subclasses: locative (home), degree (very), manner (slowly), temporal (yesterday)
21. Syntactic Classes in English: Closed Class Words
- Closed class words:
- A fixed set for a language
- Typically high-frequency words
- Prepositions: relational words for describing relations among objects and events
- in, on, before, by
- Particles: looked up, throw out
- Articles/Determiners: definite versus indefinite
- Indefinite: a, an
- Definite: the
- Conjunctions: used to join two phrases, clauses, or sentences
- Coordinating conjunctions: and, or, but
- Subordinating conjunctions: that, since, because
- Pronouns: shorthand to refer to objects and events
- Personal pronouns: he, she, it, they, us
- Possessive pronouns: my, your, ours, theirs, his, hers, its, one's
- Wh-pronouns: whose, what, who, whom, whomever
- Auxiliary verbs: used to mark tense, aspect, polarity, mood of an action
- Tense: past, present, future
- Aspect: completed or on-going
22. Tagset
- Tagset: the set of tags to use depends on the application.
- Basic tags vs tags with some morphology
- Composition of a number of subtags
- Agglutinative languages
- Popular tagsets for English:
- Penn Treebank tagset: 45 tags
- CLAWS tagset: 61 tags
- C7 tagset: 146 tags
- How do we decide how many tags to use?
- Application utility
- Ease of disambiguation
- Annotation consistency
- IN tag in the Penn Treebank tagset: subordinating conjunctions and prepositions
- TO tag: represents the preposition 'to' and the infinitival marker ('to read')
- Supertags: fold syntactic information into the tagset
- Of the order of 1000 tags
23. Tagging: Disambiguating Words
- Three different models:
- ENGTWOL model (Karlsson et al. 1995)
- Transformation-based model (Brill 1995)
- Hidden Markov Model tagger
- ENGTWOL tagger
- Constraint-based tagger
- 1,100 hand-written constraints to rule out invalid combinations of tags
- Use of probabilistic constraints and syntactic information
- Transformation-based model
- Start with the most likely assignment
- Make note of the context when the most likely assignment is wrong
- Induce a transformation rule that corrects the most likely assignment to the correct tag in that context
- Rules can be seen as rewrites α → β / δ _ γ
- Compilable into an FST
24. Again, the Noisy Channel Model
- Input to the channel: a part-of-speech sequence T
- Output from the channel: a word sequence W
- Decoding task: find T* = argmax_T P(T|W)
- Using Bayes Rule (written out below)
- And since P(W) doesn't change for any hypothetical T:
- T* = argmax_T P(W|T) P(T)
- P(W|T) is the emit probability, and P(T) is the prior, or contextual probability
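Written out, the Bayes Rule step is:

    T^* = \arg\max_T P(T \mid W)
        = \arg\max_T \frac{P(W \mid T)\, P(T)}{P(W)}
        = \arg\max_T P(W \mid T)\, P(T)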
25. Stochastic Tagging: Markov Assumption
- The tagging model is approximated using Markov assumptions.
- T* = argmax_T P(T) P(W|T)
- Markov (first-order) assumption: P(T) ≈ ∏i P(ti|ti-1)
- Independence assumption: P(W|T) ≈ ∏i P(wi|ti)
- Thus: T* = argmax_T ∏i P(wi|ti) P(ti|ti-1)
- The probability distributions are estimated from an annotated corpus.
- Maximum likelihood estimates:
- P(w|t) = count(w,t) / count(t)
- P(ti|ti-1) = count(ti-1, ti) / count(ti-1)
- Don't forget to smooth the counts!!
- There are other means of estimating these probabilities.
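A minimal sketch of these maximum likelihood estimates over a toy tagged corpus. The corpus format is an assumption, and the counts are unsmoothed, so heed the warning above:

    from collections import Counter

    corpus = [[("the", "DT"), ("dog", "NN"), ("runs", "VBZ")]]  # toy corpus

    emit, tag_count, trans, prev_count = Counter(), Counter(), Counter(), Counter()
    for sentence in corpus:
        prev = "BOS"
        for word, tag in sentence:
            emit[(word, tag)] += 1
            tag_count[tag] += 1
            trans[(prev, tag)] += 1
            prev_count[prev] += 1
            prev = tag

    def p_emit(w, t):
        return emit[(w, t)] / tag_count[t]              # count(w,t) / count(t)

    def p_trans(t, t_prev):
        return trans[(t_prev, t)] / prev_count[t_prev]  # bigram / unigram count

    print(p_emit("dog", "NN"), p_trans("NN", "DT"))  # 1.0 1.0 on this toy data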
26. Best Path Search
- Search for the best path pervades many speech and NLP problems.
- ASR: best path through a composition of acoustic, pronunciation, and language models
- Tagging: best path through a composition of lexicon and contextual model
- Edit distance: best path through a search space set up by insertion, deletion, and substitution operations
- In general:
- Decisions/operations create a weighted search space
- Search for the best sequence of decisions
- Dynamic programming solution
- Sometimes only the score is relevant.
- Most often the path (sequence of states = derivation) is relevant.
27. Multi-stage Decision Problems
[Figure: tagging trellis for the sentence "The dog runs ." over the states BOS, DT, NN, VB, NNS, VBZ, ., EOS]
- Transition probabilities: P(DT|BOS) = 1, P(NN|DT) = 0.9, P(VB|DT) = 0.1, P(NNS|NN) = 0.3, P(VBZ|NN) = 0.7, P(NNS|VB) = 0.7, P(VBZ|VB) = 0.3, P(.|NNS) = 0.3, P(.|VBZ) = 0.7, P(EOS|.) = 1
- Emission probabilities: P(the|DT) = 0.999, P(dog|NN) = 0.99, P(dog|VB) = 0.01, P(runs|NNS) = 0.63, P(runs|VBZ) = 0.37, P(.|.) = 0.999
28. Multi-stage Decision Problems
[Figure: the same trellis as above]
- Find the state sequence through this space that maximizes ∏ P(w|t) P(t|t-1)
- cost(BOS, EOS) = 1 × cost(DT, EOS)
- cost(DT, EOS) = max{ P(the|DT) P(NN|DT) cost(NN, EOS), P(the|DT) P(VB|DT) cost(VB, EOS) }
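A minimal Viterbi sketch over this trellis, using the slide's numbers:

    # Transition and emission tables from the trellis above.
    TRANS = {("BOS", "DT"): 1.0, ("DT", "NN"): 0.9, ("DT", "VB"): 0.1,
             ("NN", "NNS"): 0.3, ("NN", "VBZ"): 0.7, ("VB", "NNS"): 0.7,
             ("VB", "VBZ"): 0.3, ("NNS", "."): 0.3, ("VBZ", "."): 0.7,
             (".", "EOS"): 1.0}
    EMIT = {("the", "DT"): 0.999, ("dog", "NN"): 0.99, ("dog", "VB"): 0.01,
            ("runs", "NNS"): 0.63, ("runs", "VBZ"): 0.37, (".", "."): 0.999}
    STAGES = [("the", ["DT"]), ("dog", ["NN", "VB"]),
              ("runs", ["NNS", "VBZ"]), (".", ["."])]

    best = {"BOS": (1.0, [])}  # tag -> (best path score, tag sequence)
    for word, tags in STAGES:
        # For each state, keep the highest-scoring way of reaching it.
        best = {t: max((s * TRANS.get((p, t), 0.0) * EMIT[(word, t)], seq + [t])
                       for p, (s, seq) in best.items())
                for t in tags}

    score, seq = max((s * TRANS.get((p, "EOS"), 0.0), seq)
                     for p, (s, seq) in best.items())
    print(seq, score)  # ['DT', 'NN', 'VBZ', '.'] with score ~0.161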
29. Two Ways of Reasoning
- Forward approach (backward reasoning)
- Compute the best way to get from a state to the goal state.
- Backward approach (forward reasoning)
- Compute the best way to get from the source state to a state.
- A combination of these two approaches is used in unsupervised training of HMMs.
- Forward-backward algorithm (Appendix D)