Outline - PowerPoint PPT Presentation

Provided by: sri116
Transcript and Presenter's Notes


1
Outline
  • Applications: spelling correction
  • Formal representation: weighted FSTs
  • Algorithms: Bayesian inference (noisy channel model)
  • Methods to determine weights: hand-coded,
    corpus-based estimation
  • Dynamic programming: shortest path

2
Detecting and Correcting Spelling Errors
  • Sources of lexical/spelling errors
  • Speech: lexical access and recognition errors
    (more later)
  • Text: typing and cognitive errors
  • OCR: recognition errors
  • Applications
  • Spell checking
  • Handwriting recognition of zip codes,
    signatures, Graffiti
  • Issues
  • Correcting non-words in isolation ("dg" for "dog";
    why not "dig"?)
  • Correcting non-words could lead to valid words
  • Homophone substitution: "parents love there
    children"; "Let's order a desert after dinner"
  • Correcting words in context

3
Patterns of Error
  • Human typists make different types of errors from
    OCR systems -- why?
  • Error classification I: performance-based
  • Insertion: catt
  • Deletion: ct
  • Substitution: car
  • Transposition: cta
  • Error classification II: cognitive
  • People don't know how to spell (nucular/nuclear,
    potatoe/potato)
  • Homonymous errors (their/there)
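
The four performance-based error types can be told apart mechanically. The following is a minimal sketch, assuming the intended word is "cat" (the slide only shows the misspellings); the function name `classify_single_edit` is hypothetical.

```python
from typing import Optional


def classify_single_edit(typo: str, word: str) -> Optional[str]:
    """Name the single edit that turns `word` into `typo`, or None if there is none."""
    if typo == word:
        return None
    # Insertion: the typo has exactly one extra letter.
    if len(typo) == len(word) + 1:
        for i in range(len(typo)):
            if typo[:i] + typo[i + 1:] == word:
                return "insertion"
    # Deletion: the typo is missing exactly one letter.
    if len(typo) + 1 == len(word):
        for i in range(len(word)):
            if word[:i] + word[i + 1:] == typo:
                return "deletion"
    if len(typo) == len(word):
        diffs = [i for i in range(len(word)) if typo[i] != word[i]]
        # Substitution: exactly one position differs.
        if len(diffs) == 1:
            return "substitution"
        # Transposition: two adjacent letters are swapped.
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and typo[diffs[0]] == word[diffs[1]]
                and typo[diffs[1]] == word[diffs[0]]):
            return "transposition"
    return None


for typo in ["catt", "ct", "car", "cta"]:
    print(typo, classify_single_edit(typo, "cat"))
```

Cognitive errors such as nucular/nuclear usually involve more than one edit, so this sketch returns None for them.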

4
Probability Refresher
  • Population: 10 Princeton students
  • 4 vegetarians
  • 3 CS majors
  • What is the probability that a randomly chosen
    student (rcs) is a vegetarian? p(v) = 0.4
  • That a rcs is a CS major? p(c) = 0.3
  • That a rcs is a vegetarian and CS major? p(c,v)
    = 0.2
  • That a vegetarian is a CS major? p(c|v) = 0.5
  • That a CS major is a vegetarian? p(v|c) ≈ 0.67
  • That a non-CS major is a vegetarian? p(v|¬c) = ??
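
The numbers on this slide can be checked by counting. This is a small sketch using exact fractions; it assumes that 2 of the 10 students are both vegetarians and CS majors, which is what p(c,v) = 0.2 implies. The variable names are illustrative only.

```python
from fractions import Fraction

N_STUDENTS = 10
N_VEGETARIANS = 4
N_CS_MAJORS = 3
N_BOTH = 2  # assumption: implied by p(c,v) = 0.2 on the slide

p_v = Fraction(N_VEGETARIANS, N_STUDENTS)
p_c = Fraction(N_CS_MAJORS, N_STUDENTS)
p_cv = Fraction(N_BOTH, N_STUDENTS)

# Conditional probability: p(a|b) = p(a,b) / p(b)
p_c_given_v = p_cv / p_v
p_v_given_c = p_cv / p_c
# Vegetarians who are not CS majors, divided by all non-CS majors.
p_v_given_not_c = (p_v - p_cv) / (1 - p_c)

print(p_c_given_v, p_v_given_c, p_v_given_not_c)
```

Under that assumption the open question on the slide works out to 2/7, or about 0.29.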

5
Bayes Rule and Noisy Channel model
  • We know the joint probabilities
  • p(c,v) = p(c) p(v|c) (chain rule)
  • p(v,c) = p(c,v) = p(v) p(c|v)
  • So, we can define the conditional probability
    p(c|v) in terms of the prior probabilities p(c)
    and p(v) and the likelihood p(v|c):
    p(c|v) = p(v|c) p(c) / p(v)
  • Noisy channel metaphor: the channel corrupts the
    input; recover the original.
  • Think cell-phone conversations!
  • Hearer's challenge: decode what the speaker said
    (w), given a channel-corrupted observation (O).

Source model: P(w)
Channel model: P(O|w)
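
The decoding rule behind the noisy channel metaphor can be written out as a short derivation. The notation below (a hat for the decoded word, argmax over candidate words w) is an assumption; the slide's own equation did not survive in the transcript.

```latex
\begin{align*}
\hat{w} &= \arg\max_{w} P(w \mid O) \\
        &= \arg\max_{w} \frac{P(O \mid w)\, P(w)}{P(O)} \\
        &= \arg\max_{w} \underbrace{P(O \mid w)}_{\text{channel model}} \;
           \underbrace{P(w)}_{\text{source model}}
\end{align*}
```

The last step drops P(O) because it is the same for every candidate w and so does not affect which candidate wins.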
6
How do we use this model to correct spelling
errors?
  • Simplifying assumptions
  • We only have to correct non-word errors
  • Each non-word (O) differs from its correct word
    (w) by one step (insertion, deletion,
    substitution, transposition)
  • Generate and Test Method (Kernighan et al. 1990)
  • Generate a candidate word by applying one
    substitution, deletion, insertion, or
    transposition operation
  • Test if the resulting word is in the dictionary.
  • Example

Observation  Correct  Correct letter  Error letter  Position  Type of error
caat         cat      -               a             2         insertion
caat         carat    r               -             3         deletion
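
The generate-and-test idea can be sketched in a few lines. This is an illustration under assumptions, not the implementation from Kernighan et al.: the function names and the toy dictionary are hypothetical, and the alphabet is restricted to lowercase ASCII letters.

```python
import string


def one_edit_candidates(typo, alphabet=string.ascii_lowercase):
    """All strings one insertion, deletion, substitution, or transposition away."""
    splits = [(typo[:i], typo[i:]) for i in range(len(typo) + 1)]
    # Removing a letter from the typo undoes an insertion error.
    deletes = {a + b[1:] for a, b in splits if b}
    # Swapping two adjacent letters undoes a transposition error.
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    # Replacing a letter undoes a substitution error.
    replaces = {a + ch + b[1:] for a, b in splits if b for ch in alphabet}
    # Adding a letter to the typo undoes a deletion error.
    inserts = {a + ch + b for a, b in splits for ch in alphabet}
    return (deletes | transposes | replaces | inserts) - {typo}


def generate_and_test(typo, dictionary):
    """Keep only the generated candidates that are dictionary words."""
    return sorted(one_edit_candidates(typo) & set(dictionary))


TOY_DICTIONARY = {"cat", "carat", "dog"}  # hypothetical word list
print(generate_and_test("caat", TOY_DICTIONARY))
```

With this toy dictionary, "caat" yields the two candidates from the table, cat and carat; ranking them is a separate step.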
7
How do we decide which correction is most likely?
  • Validate the generated word in a dictionary.
  • But there may be multiple valid words; how do we
    rank them?
  • Rank them based on a scoring function
  • P(w|typo) ∝ P(typo|w) P(w)
  • Note: there could be other scoring functions
  • Propose n-best solutions
  • Estimate the likelihood P(typo|w) and the prior
    P(w)
  • Count events from a corpus to estimate these
    probabilities
  • Labeled versus Unlabeled corpus
  • For spelling correction, what do we need?
  • Word occurrence information (unlabeled corpus)
  • A corpus of labeled spelling errors
  • Approximate word replacement by local letter
    replacement probabilities: a confusion matrix on
    letters

8
Cat vs Carat
  • Estimating the prior: suppose we look at the
    occurrence of cat and carat in a large (50M word)
    AP news corpus
  • cat occurs 6500 times, so p(cat) = .00013
  • carat occurs 3000 times, so p(carat) = .00006
  • Estimating the likelihood: now we need to find
    out if inserting an a after an a is more
    likely than deleting an r after an a, using a
    corrections corpus of 50K corrections (to
    estimate p(typo|word))
  • Suppose a insertion after a occurs 5000 times
    (p(+a) = .1) and r deletion occurs 7500 times
    (p(-r) = .15)
  • Scoring function: p(word|typo) ∝ p(typo|word)
    p(word)
  • p(cat|caat) ∝ p(+a) p(cat) = .1 × .00013
    = .000013
  • p(carat|caat) ∝ p(-r) p(carat) = .15 × .00006
    = .000009
  • So cat is the higher-scoring correction.
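
The arithmetic on this slide can be reproduced directly from the counts. The sketch below uses the slide's numbers; the dictionary layout and the function name `score` are illustrative choices, not part of the original.

```python
CORPUS_SIZE = 50_000_000   # words in the AP news corpus
N_CORRECTIONS = 50_000     # entries in the corrections corpus

# How often each candidate word occurs in the news corpus.
WORD_COUNTS = {"cat": 6500, "carat": 3000}
# How often the edit that would produce "caat" from each candidate occurs:
# inserting a after a (cat -> caat), deleting r after a (carat -> caat).
EDIT_COUNTS = {"cat": 5000, "carat": 7500}


def score(word):
    """Unnormalized posterior: p(typo|word) * p(word)."""
    prior = WORD_COUNTS[word] / CORPUS_SIZE
    likelihood = EDIT_COUNTS[word] / N_CORRECTIONS
    return likelihood * prior


scores = {word: score(word) for word in WORD_COUNTS}
best = max(scores, key=scores.get)
print(scores, best)
```

The r deletion is the more likely edit, but the higher prior of cat outweighs it.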

9
Encoding One-Error Correction as WFSTs
  • Let Σ = {c, a, r, t}
  • One-edit model
  • Dictionary model
  • One-error spelling correction
  • Input ∘ Edit ∘ Dictionary

10
Issues
  • What if there are no instances of carat in the
    corpus?
  • Smoothing algorithms
  • Estimate of P(typo|word) may not be accurate
  • Training probabilities on typo/word pairs
  • What if there is more than one error per word?
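
One simple way to handle a word with zero corpus count is add-one (Laplace) smoothing. The slide does not say which smoothing algorithm is meant, so this is only one possible choice; the counts and vocabulary below are made up for illustration.

```python
from collections import Counter


def add_one_prior(word, counts, vocab_size):
    """Add-one smoothed unigram probability of `word`."""
    total = sum(counts.values())
    return (counts[word] + 1) / (total + vocab_size)


# Hypothetical toy counts in which "carat" never occurs.
counts = Counter({"cat": 6, "dog": 3})
vocabulary = {"cat", "dog", "carat"}

for word in sorted(vocabulary):
    print(word, add_one_prior(word, counts, len(vocabulary)))
```

The unseen word now gets a small nonzero probability, and the smoothed probabilities still sum to 1 over the vocabulary.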

11
Minimum Edit Distance
  • How can we measure how different one word is from
    another word?
  • How many operations will it take to transform one
    word into another?
  • caat --> cat, fplc --> fireplace (treat
    abbreviations as typos??)
  • Levenshtein distance: smallest number of
    insertion, deletion, or substitution operations
    that transform one string into another
    (ins = del = subst = 1)
  • Alternative: weight each operation by training on
    a corpus of spelling errors to see which is most
    frequent

12
Computing Levenshtein Distance
  • Dynamic Programming algorithm
  • Solution for a problem is a function of the
    solutions of subproblems
  • d[i,j] contains the distance up to s[i] and t[j]
  • d[i,j] is computed by combining the distances of
    shorter substrings using insertion, deletion, and
    substitution operations.
  • The optimal sequence of edit operations is
    recovered by storing back-pointers.
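
The dynamic programming table and its back-pointers can be sketched as follows. This is a minimal illustration rather than the lecture's own code; the function name, the cost parameters, and the operation labels are assumptions.

```python
def edit_distance(s, t, ins_cost=1, del_cost=1, sub_cost=1):
    """Return (distance, operations) for turning string s into string t."""
    m, n = len(s), len(t)
    # d[i][j] = cost of turning s[:i] into t[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    back = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + del_cost
        back[i][0] = "del"
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + ins_cost
        back[0][j] = "ins"
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = s[i - 1] == t[j - 1]
            options = [
                (d[i - 1][j - 1] + (0 if same else sub_cost),
                 "match" if same else "sub"),
                (d[i - 1][j] + del_cost, "del"),
                (d[i][j - 1] + ins_cost, "ins"),
            ]
            d[i][j], back[i][j] = min(options, key=lambda option: option[0])
    # Follow the back-pointers from the bottom-right cell to recover the edits.
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        op = back[i][j]
        ops.append(op)
        if op in ("match", "sub"):
            i, j = i - 1, j - 1
        elif op == "del":
            i -= 1
        else:
            j -= 1
    ops.reverse()
    return d[m][n], ops


print(edit_distance("caat", "cat"))
print(edit_distance("intention", "execution", sub_cost=2)[0])
```

The cost parameters make it easy to compare unit costs with a scheme that charges 2 for a substitution.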

13
Edit Distance Matrix
NB: errors
Cost = 1 for insertions and deletions; cost = 2 for
substitutions. Recompute the matrix with
insertions = deletions = substitutions = 1.
14
Levenshtein Distance with WFSTs
  • Let Σ = {c, a, r, t}
  • Edit model
  • The two sentences to be compared are encoded as
    FSTs.
  • Levenshtein distance between two sentences
  • Dist(s1, s2) = weight of the best path in
    s1 ∘ Edit ∘ s2

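[Figure: edit transducer with state 0 and arcs labeled input:output]
Del: c:ε, a:ε, r:ε, t:ε
Match: c:c, a:a, r:r, t:t
Sub: c:a, c:r, c:t, a:c, a:t
Ins: ε:c, ε:a, ε:r, ε:t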
15
Spelling Correction with WFSTs
  • Dictionary: FST representation of words
  • Isolated word spelling correction
  • AllCorrections(w) = w ∘ Edit ∘ Dictionary
  • BestCorrection(w) = BestPath(w ∘ Edit ∘
    Dictionary)
  • Spelling correction in context: "parents love
    there children"
  • S = w1, w2, ..., wn
  • Spelling correction of wi
  • Generate possible edits for wi
  • Pick the edit that fits best in context
  • Use an n-gram language model (LM) to rank the
    alternatives.
  • "love there" vs. "love their"; "there children"
    vs. "their children"
  • SentenceCorrection(S) = F(S) ∘ Edit ∘ LM
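
Ranking alternatives in context can be illustrated without any FST machinery. In the sketch below everything numeric is hypothetical: the bigram probabilities, the floor for unseen bigrams, and the confusion set that stands in for the Edit step are all invented for the example.

```python
import math

# Hypothetical bigram probabilities P(current | previous); not from the slides.
BIGRAM_PROB = {
    ("love", "their"): 0.02,
    ("love", "there"): 0.001,
    ("their", "children"): 0.03,
    ("there", "children"): 0.0005,
}
UNSEEN_PROB = 1e-6  # crude floor for unseen bigrams (assumption)

# Hypothetical confusion sets standing in for the Edit transducer.
CONFUSABLE = {"there": ["there", "their"], "their": ["their", "there"]}


def sentence_logprob(words):
    """Sum of bigram log-probabilities over the sentence."""
    return sum(math.log(BIGRAM_PROB.get(pair, UNSEEN_PROB))
               for pair in zip(words, words[1:]))


def correct_word_in_context(words, i):
    """Pick the alternative for words[i] that the bigram model scores highest."""
    candidates = CONFUSABLE.get(words[i], [words[i]])
    return max(candidates,
               key=lambda c: sentence_logprob(words[:i] + [c] + words[i + 1:]))


sentence = "parents love there children".split()
print(correct_word_in_context(sentence, 2))
```

Both "love their" and "their children" score better than the versions with "there", so the model prefers "their" even though "there" is a valid word.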

16
  • Aoccdrnig to a rscheearch at an Elingsh
    uinervtisy, it deosn't mttaer in waht oredr the
    ltteers in a wrod are, the olny iprmoetnt tihng
    is taht the frist and lsat ltteers are at the
    rghit pclae. The rset can be a toatl mses and you
    can sitll raed it wouthit a porbelm. Tihs is
    bcuseae we do not raed ervey lteter by itslef but
    the wrod as a wlohe.
  • Can humans understand what is meant as opposed
    to what is said/written?
  • How?
  • http://www.mrc-cbu.cam.ac.uk/personal/matt.davis/Cmabrigde/

17
Summary
  • We can apply probabilistic modeling to NL
    problems like spell-checking
  • Noisy channel model, Bayesian method
  • Training priors and likelihoods on a corpus
  • Dynamic programming approaches allow us to solve
    large problems that can be decomposed into
    subproblems
  • e.g. Minimum Edit Distance algorithm
  • A number of Speech and Language tasks can be cast
    in this framework.
  • Generate alternatives using a generator
  • Select the best / rank the alternatives using a
    model
  • If the generator and the model are encodable as
    FSTs, decoding becomes composition followed by a
    search for the best path.

18
Word Classes and Tagging
19
Word Classes and Tagging
  • Words can be grouped into classes based on a
    number of criteria.
  • Application-independent criteria
  • Syntactic class (nouns, verbs, adjectives)
  • Proper names (people names, country names)
  • Dates, currencies
  • Application-specific criteria
  • Product names (Ajax, Slurpee, Lexmark 3100)
  • Service names (7-cents plan, GoldPass)
  • Tagging: categorizing each word of a sentence
    into one of the classes.

20
Syntactic Classes in English: Open Class Words
  • Nouns
  • Defined semantically: words for people, places,
    things
  • Defined syntactically: words that take
    determiners
  • Count nouns: nouns that can be counted
  • One book, two computers, hundred men
  • Mass nouns: nouns that represent homogeneous
    groups; can occur without articles.
  • snow, salt, milk, water, hair
  • Proper nouns, common nouns
  • Verbs: words for actions and processes
  • Hit, love, run, fly, differ, go
  • Adjectives: words for describing qualities and
    properties (modifiers) of objects
  • White, black, old, young, good, bad
  • Adverbs: words for describing modifiers of
    actions
  • Unfortunately, John walked home extremely slowly
    yesterday
  • Subclasses: locative (home), degree (very),
    manner (slowly), temporal (yesterday)

21
Syntactic Classes in English: Closed Class Words
  • Closed class words
  • Fixed set for a language
  • Typically high-frequency words
  • Prepositions: relational words for describing
    relations among objects and events
  • In, on, before, by
  • Particles: looked up, throw out
  • Articles/Determiners: definite versus indefinite
  • Indefinite: a, an
  • Definite: the
  • Conjunctions: used to join two phrases, clauses,
    or sentences.
  • Coordinating conjunctions: and, or, but
  • Subordinating conjunctions: that, since, because
  • Pronouns: shorthand to refer to objects and
    events.
  • Personal pronouns: he, she, it, they, us
  • Possessive pronouns: my, your, ours, theirs, his,
    hers, its, one's
  • Wh-pronouns: whose, what, who, whom, whomever
  • Auxiliary verbs: used to mark tense, aspect,
    polarity, and mood of an action
  • Tense: past, present, future
  • Aspect: completed or ongoing

22
Tagset
  • Tagset: the set of tags to use depends on the
    application.
  • Basic tags; tags with some morphology
  • Composition of a number of subtags
  • Agglutinative languages
  • Popular tagsets for English
  • Penn Treebank tagset: 45 tags
  • CLAWS tagset: 61 tags
  • C7 tagset: 146 tags
  • How do we decide how many tags to use?
  • Application utility
  • Ease of disambiguation
  • Annotation consistency
  • IN tag in the Penn Treebank tagset: subordinating
    conjunctions and prepositions
  • TO tag: represents the preposition "to" and the
    infinitival marker (as in "to read")
  • Supertags: fold syntactic information into the
    tagset
  • On the order of 1000 tags

23
Tagging: Disambiguating Words
  • Three different models
  • ENGTWOL model (Karlsson et al. 1995)
  • Transformation-based model (Brill 1995)
  • Hidden Markov Model tagger
  • ENGTWOL tagger
  • Constraint-based tagger
  • 1,100 hand-written constraints to rule out
    invalid combinations of tags.
  • Use of probabilistic constraints and syntactic
    information
  • Transformation-based model
  • Start with the most likely assignment
  • Make note of the context when the most likely
    assignment is wrong.
  • Induce a transformation rule that corrects the
    most likely assignment to the correct tag in that
    context.
  • Rules can be seen as α → β | δ _ γ (rewrite α as
    β between left context δ and right context γ)
  • Compilable into an FST

24
Again, the Noisy Channel Model
  • Input to channel: part-of-speech sequence T
  • Output from channel: a word sequence W
  • Decoding task: find T* = argmax_T P(T|W)
  • Using Bayes' rule: T* = argmax_T P(W|T) P(T) / P(W)
  • And since P(W) doesn't change for any
    hypothetical T
  • T* = argmax_T P(W|T) P(T)
  • P(W|T) is the emit probability, and P(T) is the
    prior, or contextual probability

25
Stochastic Tagging: Markov Assumption
  • The tagging model is approximated using Markov
    assumptions.
  • T* = argmax_T P(T) P(W|T)
  • Markov (first-order) assumption:
    P(T) ≈ ∏_i P(t_i|t_{i-1})
  • Independence assumption:
    P(W|T) ≈ ∏_i P(w_i|t_i)
  • Thus T* = argmax_T ∏_i P(t_i|t_{i-1}) P(w_i|t_i)
  • The probability distributions are estimated from
    an annotated corpus.
  • Maximum Likelihood Estimate
  • P(w|t) = count(w,t)/count(t)
  • P(t_i|t_{i-1}) = count(t_{i-1}, t_i)/count(t_{i-1})
  • Don't forget to smooth the counts!
  • There are other means of estimating these
    probabilities.
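
The count-based estimates, with optional smoothing, might look like the sketch below. The two-sentence tagged corpus, the function names, and the add-k smoothing scheme are all assumptions made for illustration; setting k = 0 gives the plain maximum likelihood estimate.

```python
from collections import Counter


def train(tagged_sentences):
    """Collect emission and transition counts from (word, tag) sentences."""
    emit, trans = Counter(), Counter()
    tag_count, prev_count = Counter(), Counter()
    for sentence in tagged_sentences:
        prev = "BOS"
        for word, tag in sentence:
            emit[(word, tag)] += 1
            tag_count[tag] += 1
            trans[(prev, tag)] += 1
            prev_count[prev] += 1
            prev = tag
        trans[(prev, "EOS")] += 1
        prev_count[prev] += 1
    return emit, trans, tag_count, prev_count


def p_emit(word, tag, emit, tag_count, vocab_size, k=1.0):
    """P(word|tag) with add-k smoothing; k=0 is count(w,t)/count(t)."""
    return (emit[(word, tag)] + k) / (tag_count[tag] + k * vocab_size)


def p_trans(tag, prev, trans, prev_count, n_tags, k=1.0):
    """P(tag|prev) with add-k smoothing; k=0 is the plain relative frequency."""
    return (trans[(prev, tag)] + k) / (prev_count[prev] + k * n_tags)


# Hypothetical toy corpus.
CORPUS = [
    [("the", "DT"), ("dog", "NN"), ("runs", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
]
emit, trans, tag_count, prev_count = train(CORPUS)
print(p_emit("dog", "NN", emit, tag_count, vocab_size=5, k=0))
print(p_emit("fish", "NN", emit, tag_count, vocab_size=5, k=1))
```

Without smoothing, an unseen word such as "fish" would get probability 0 under every tag and rule out every tagging of the sentence.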

26
Best Path Search
  • Search for the best path pervades many Speech and
    NLP problems.
  • ASR: best path through a composition of acoustic,
    pronunciation, and language models
  • Tagging: best path through a composition of
    lexicon and contextual model
  • Edit distance: best path through a search space
    set up by insertion, deletion, and substitution
    operations.
  • In general
  • Decisions/operations create a weighted search
    space
  • Search for the best sequence of decisions
  • Dynamic programming solution
  • Sometimes only the score is relevant.
  • Most often the path (the sequence of states, or
    derivation) is relevant.

27
Multi-stage decision problems
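[Figure: tag lattice for "The dog runs ." with states
BOS, DT, NN, VB, NNS, VBZ, ".", EOS]

Transition probabilities:
P(DT|BOS) = 1
P(NN|DT) = 0.9    P(VB|DT) = 0.1
P(NNS|NN) = 0.3   P(VBZ|NN) = 0.7
P(NNS|VB) = 0.7   P(VBZ|VB) = 0.3
P(.|NNS) = 0.3    P(.|VBZ) = 0.7
P(EOS|.) = 1

Emission probabilities:
P(the|DT) = 0.999
P(dog|NN) = 0.99    P(dog|VB) = 0.01
P(runs|NNS) = 0.63  P(runs|VBZ) = 0.37
P(.|.) = 0.999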





28
Multi-stage decision problems
[Figure: the tag lattice for "The dog runs ." repeated
from the previous slide]
  • Find the state sequence through this space that
    maximizes ∏_i P(w_i|t_i) P(t_i|t_{i-1})
  • cost(BOS, EOS) = 1 × cost(DT, EOS)
  • cost(DT, EOS) = max{ P(the|DT) P(NN|DT) cost(NN, EOS),
    P(the|DT) P(VB|DT) cost(VB, EOS) }
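
The recursion above can be run on the lattice for "The dog runs ." as a memoized function. This is a sketch under assumptions: the probabilities are the ones from the lattice slide, but the entries involving the sentence-final period tag "." are reconstructed, since the transcript lost that symbol. The names `cost` and `best_tagging` are illustrative.

```python
from functools import lru_cache

WORDS = ("the", "dog", "runs", ".")
# Candidate tags at each word position, as drawn in the lattice.
CANDIDATES = (("DT",), ("NN", "VB"), ("NNS", "VBZ"), (".",))

TRANS = {  # P(next tag | tag); the "." entries are an assumption
    ("BOS", "DT"): 1.0,
    ("DT", "NN"): 0.9, ("DT", "VB"): 0.1,
    ("NN", "NNS"): 0.3, ("NN", "VBZ"): 0.7,
    ("VB", "NNS"): 0.7, ("VB", "VBZ"): 0.3,
    ("NNS", "."): 0.3, ("VBZ", "."): 0.7,
    (".", "EOS"): 1.0,
}
EMIT = {  # P(word | tag); the "." entry is an assumption
    ("the", "DT"): 0.999,
    ("dog", "NN"): 0.99, ("dog", "VB"): 0.01,
    ("runs", "NNS"): 0.63, ("runs", "VBZ"): 0.37,
    (".", "."): 0.999,
}


@lru_cache(maxsize=None)
def cost(tag, i):
    """Best score from `tag` at word position i to EOS, with the tag path."""
    emit = EMIT.get((WORDS[i], tag), 0.0)
    if i == len(WORDS) - 1:
        return emit * TRANS.get((tag, "EOS"), 0.0), (tag,)
    best_score, best_rest = 0.0, ()
    for nxt in CANDIDATES[i + 1]:
        rest_score, rest_path = cost(nxt, i + 1)
        score = TRANS.get((tag, nxt), 0.0) * rest_score
        if score > best_score:
            best_score, best_rest = score, rest_path
    return emit * best_score, (tag,) + best_rest


def best_tagging():
    """cost(BOS, EOS): best first tag weighted by P(tag | BOS)."""
    return max(
        ((TRANS.get(("BOS", tag), 0.0) * cost(tag, 0)[0], cost(tag, 0)[1])
         for tag in CANDIDATES[0]),
        key=lambda pair: pair[0],
    )


score, path = best_tagging()
print(path, score)
```

Taken alone, the emission probabilities favor NNS for "runs" (0.63 vs. 0.37), but the transition probabilities tip the best path to VBZ. Memoization ensures each (tag, position) cell is computed once, which is what makes this dynamic programming rather than exhaustive search.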

29
Two ways of reasoning
  • Forward approach (Backward reasoning)
  • Compute the best way to get from a state to the
    goal state.
  • Backward approach (Forward reasoning)
  • Compute the best way to get from the source state
    to a state.
  • A combination of these two approaches is used in
    unsupervised training of HMMs.
  • Forward-backward algorithm (Appendix D)