Contemporary Spelling Correction: Decoding the Noisy Channel

1
Contemporary Spelling Correction: Decoding the
Noisy Channel
  • Bob Carpenter
  • Alias-i, Inc.
  • carp_at_alias-i.com

2
Kinds of Spelling Mistakes: Typos
  • Typos are wrong characters typed by mistake
  • Insertions: 'appellate' as 'appellare', 'prejudice' as
    'prejudsice'
  • Deletions: 'plaintiff' as 'paintiff', 'judgement' as
    'judment', 'liability' as 'liabilty', 'discovery'
    as 'dicovery', 'fourth amendment' as
    'fourthamendment'
  • Substitutions: 'habeas' as 'haceas'
  • Transpositions: 'fraud' as 'fruad', 'bankruptcy' as
    'banrkuptcy', 'subpoena' as 'subpeona',
    'plaintiff' as 'plaitniff'

3
Kinds of Spelling Mistakes: Brainos
  • Brainos are wrong characters typed on purpose
  • The kinds of mistakes found in lists of common
    misspellings
  • Very common in general web queries
  • Derive from pronunciation, spelling, or
    deep semantic confusions
  • English is particularly bad due to irregularity
  • Probably (?) common in other languages that import
    words

4
Brainos: Soundalikes
  • Latinates: 'subpoena' as 'supena', 'judicata' as
    'judicada', 'voir' as 'voire'
  • Consonant Clusters and Flaps: 'privelege' as
    'priveledge', 'rescision' as 'recision',
    'collateral' as 'colaterall', 'latter' as 'ladder',
    'estoppel' as 'estopple', 'withholding' as
    'witholding', 'recission' as 'recision'
  • Vowel Reductions: 'collateral' as 'collaterel',
    'punitive' as 'punative'
  • Vowel Clusters: 'respondeat' as 'respondiat', 'lien'
    as 'lein', 'estoppel' as 'estopple', 'habeas' as
    'habeeas', 'conveniens' as 'convieniens'
  • Marker Vowels: 'foreclosure' as 'forclosure'
  • Multiples: 'subpoena' as 'supena' (two deletes)

5
Brainos: Confusions
  • Substitute a more common or just plain different word
  • Names: 'Opperman' as 'Oppenheimer', 'Eisenstein'
    as 'Einstein'
  • Pronunciation Confusions
  • Transpositions: 'preclusion' as 'perclusion',
    'meruit' as 'meriut'
  • Irregular word forms
  • 'juries' as 'jurys' or 'jureys', 'men' as 'mans'
  • English is particularly bad for this, too
  • Tokenization issues
  • 'AT&T' vs. 'AT & T' vs. 'A.T.&T.'
  • Correct variant (if unique) depends on the search
    engine's notion of a word
  • Word Boundaries
  • 'in camera' as 'incamera', 'qui tam' as 'quitam',
    'injunction' as 'in junction', 'foreclosure' as
    'for closure', 'dramshop' as 'dram shop'

6
Old School Spelling Correction
  • Damerau, 1964, A technique for computer
    detection and correction of spelling errors.
    Comms. ACM.
  • One word (token) at a time
  • Only looked at unknown words not in dictionary
  • Suggest closest alternatives (first or multiple
    in order)
  • Closeness measured in number of edits (edit
    distance)
  • Deletions, Insertions, Substitutions, and
    sometimes Transpositions
  • Often results in ties
  • Good word game
  • With 50 characters and a 50-character query, get
    50^50 ≈ 8.9 × 10^84 alternatives
  • Can search whole space in linear time using
    dynamic programming
  • This technique lives on in many apps
  • Simple, fast and only requires a word list
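
The old-school recipe on this slide can be sketched in a few lines. This is a minimal illustration, not Damerau's original program: the function names `edits1` and `suggest`, the lowercase alphabet, and the tiny word list are all assumptions made for the example. It only proposes words exactly one edit away from an unknown token.

```python
import string

def edits1(word, alphabet=string.ascii_lowercase):
    """All strings one deletion, transposition, substitution, or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    substitutes = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + substitutes + inserts)

def suggest(word, dictionary):
    """Suggest dictionary words one edit away; known words are left alone."""
    if word in dictionary:
        return [word]
    return sorted(edits1(word) & dictionary)

# Hypothetical word list; a real deployment would load a full dictionary.
WORDS = {"fraud", "plaintiff", "habeas", "lien", "lean"}

print(suggest("fruad", WORDS))  # one candidate
print(suggest("lein", WORDS))   # two candidates at the same distance (a tie)
```

The second call shows the tie problem from the slide: with nothing but edit counts, 'lean' and 'lien' are equally good corrections of 'lein'.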

7
Edit Distance (Damerau/Levenshtein)
  • Quadratic time, linear space algorithm
  • E.g. $D(\text{John}, \text{Jan}) = 2$, $D(\text{John}, \text{Bob}) = 3$
  • Edits for John/Jan: match J, substitute a for o, delete h,
    match n
  • $\text{score}(I,J) = \min\big(\text{score}(I-1,J-1) + \text{match}(I,J),\ \text{score}(I-1,J) + \text{delete}(J),\ \text{score}(I,J-1) + \text{insert}(I)\big)$
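
The recurrence can be written as a short dynamic program. The sketch below assumes unit costs for insertions, deletions, and substitutions and leaves out transpositions, so it is plain Levenshtein distance rather than the Damerau variant. It keeps only two rows of the table, which is where the linear space bound comes from.

```python
def edit_distance(source, target):
    """Levenshtein distance with unit costs, using two rows of the score table."""
    # previous[j] is the distance from the current source prefix to target[:j].
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning source[:i] into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            match_cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j - 1] + match_cost,  # match or substitute
                previous[j] + 1,               # delete a source character
                current[j - 1] + 1,            # insert a target character
            ))
        previous = current
    return previous[-1]

print(edit_distance("John", "Jan"))  # 2
print(edit_distance("John", "Bob"))  # 3
```

Because transpositions are not modeled here, 'fraud' to 'fruad' costs two edits rather than one.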

8
Middle Aged Spelling Correction
  • Still look at single words not in a dictionary
    and list of common misspellings
  • Model Likely Edits
  • Whole words: 'acceptable' as 'acceptible', 'truant' as
    'truent', etc.
  • Sound Sequences: ie ↔ ei, mm ↔ m
  • Typos
  • Closeness on keyboard (depends on your keyboard
    mixtures)
  • q as w, y as u (substitutions)
  • q as qw or wq (insertions)
  • Position in Word
  • Edits more likely internally, next at end, least
    in front
  • Psychology of reading: left-to-right, early
    resolution
  • plantiff (mid) > plaintff (end) > laintiff
    (front)

9
Contemporary Spelling Correction
  • Find most likely intended query given observed
    query
  • Integrated Probabilistic Model
  • Model of Query Likelihood (source): P(query)
  • Model of Edit Likelihood (channel):
    P(realization | query)
  • Shannon's Noisy Channel Model (1940s)
  • Find most likely query (Q) given realization (R)
  • $\arg\max_Q P(Q \mid R)$ (the problem)
  • $= \arg\max_Q P(R \mid Q)\,P(Q) / P(R)$ (definition of
    conditional probability)
  • $= \arg\max_Q P(R \mid Q)\,P(Q)$ (P(R) is constant
    across candidate queries)

10
Simple Example of Correction
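  • Query Likelihood Model
  • $P(\text{hte}) = 1/1{,}000{,}000$
  • $P(\text{the}) = 1/20$
  • Edit Likelihood Model
  • $P(\text{hte} \mid \text{the}) = P(\text{transpose}(th)) \cdot P(\text{match}(e))$
  • $= 1/500 \cdot 99/100 = 99/50{,}000 \approx 1/500$
  • $P(\text{hte} \mid \text{hte}) = P(\text{match}(h)) \cdot P(\text{match}(t)) \cdot P(\text{match}(e)) \approx 1/1$
  • Therefore
  • $P(\text{hte} \mid \text{the}) \cdot P(\text{the}) \approx 1/500 \cdot 1/20 = 1/10{,}000$
  • $\gg P(\text{hte} \mid \text{hte}) \cdot P(\text{hte}) \approx 1/1 \cdot 1/1{,}000{,}000 = 1/1{,}000{,}000$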
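
The arithmetic on this slide can be checked with exact fractions. The probabilities below are the ones given on the slide; the dictionary layout and the function names `score` and `best_query` are assumptions made for this sketch.

```python
from fractions import Fraction

# Source model P(query), values from the slide.
query_likelihood = {
    "the": Fraction(1, 20),
    "hte": Fraction(1, 1_000_000),
}

# Channel model P(realization | query), keyed by (realization, query).
edit_likelihood = {
    ("hte", "the"): Fraction(1, 500) * Fraction(99, 100),  # transpose(th) * match(e)
    ("hte", "hte"): Fraction(1, 1),                        # three matches, rounded to 1
}

def score(realization, query):
    """Unnormalized noisy-channel score P(realization | query) * P(query)."""
    return edit_likelihood[(realization, query)] * query_likelihood[query]

def best_query(realization, candidates):
    """Candidate query with the highest noisy-channel score."""
    return max(candidates, key=lambda q: score(realization, q))

print(score("hte", "the"))                # 99/1000000
print(score("hte", "hte"))                # 1/1000000
print(best_query("hte", ["hte", "the"]))  # the
```

Without the slide's rounding of 99/50,000 to 1/500, the correction 'the' outscores leaving 'hte' alone by a factor of 99.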

11
General Approach Solves Several Problems
  • Orders alternatives based on likelihood
  • First best or ranked n-best alternatives
  • N-best is a tricky user-interface issue for web
    search
  • Measures likelihood that query is in error
  • Allows tuning of rejection thresholds for
    precision/recall
  • Measures likelihood that correction is correct
  • As posterior probability in the Bayesian model
  • Principled balance of query vs. edit likelihoods
  • Empirical issue determined by measurable user
    behavior
  • E.g. Word processors and web search very
    different
  • Suggests Valid Word Substitutions in Phrases
  • 'pro bono' as 'per bono'
  • 'Peter principle' as 'Peter principal'
  • Google, e.g.: fodr → ford, but fodr baggins →
    frodo baggins

12
Alias-i's Approach
  • Models fully retrainable per application
  • Out-of-the-box solutions not feasible
  • Tailored query and edit models based on user
    application behavior
  • Scalable to gigabytes w/o pruning and to
    arbitrary amounts of data with selective pruning
  • Character-level model for queries P(query)
  • Generalizes to subphrases of unknown tokens
  • E.g. likelihoods flagged as error by PowerPoint
  • E.g. likelihood not flagged as error by
    PowerPoint
  • Or Token-sensitive output (only output known
    words in corpus)
  • Allows efficient search based on prefixes
  • Flexible framework for edit likelihoods:
    P(realization | query)
  • Models likely substitutions in domain

13
Source Language Models
  • Character n-grams
  • $P(c_0, \ldots, c_{n-1})$
  • $= \prod_{i<n} P(c_i \mid c_0, \ldots, c_{i-1})$ (chain
    rule)
  • $\approx \prod_{i<n} P(c_i \mid c_{i-n+1}, \ldots, c_{i-1})$
    (n-gram)
  • Generalized Witten-Bell smoothing (state of the
    art)
  • $P(d \mid cC) = \lambda(cC)\,P_{ML}(d \mid cC) + (1 - \lambda(cC))\,P(d \mid C)$
  • where d, c are characters, and C a sequence of
    characters,
  • $P_{ML}$ is the maximum likelihood estimator,
  • the recursion grounds on the uniform estimate,
    and
  • $\lambda(X) = \text{count}(X) / (\text{count}(X) + K \cdot \text{outcomes}(X)) \in [0,1]$
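
A small character n-gram model with this interpolation might look as follows. This is a sketch written from the formulas on the slide, not LingPipe's implementation: the class name `CharLM`, the default `k=1.0`, and the fixed `alphabet_size` used for the uniform base case are all assumptions.

```python
import math
from collections import defaultdict

class CharLM:
    """Character n-gram model with Witten-Bell style interpolation."""

    def __init__(self, order=3, k=1.0, alphabet_size=256):
        self.order = order                  # n-gram length n
        self.k = k                          # the constant K in lambda(X)
        self.alphabet_size = alphabet_size  # size of the uniform base distribution
        # counts[context][next_char] = number of times next_char followed context
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, text):
        for i, ch in enumerate(text):
            for n in range(self.order):     # context lengths 0 .. order-1
                if i - n < 0:
                    break
                self.counts[text[i - n:i]][ch] += 1

    def prob(self, ch, context):
        """P(ch | last order-1 characters of context)."""
        context = context[-(self.order - 1):] if self.order > 1 else ""
        return self._prob(ch, context)

    def _prob(self, ch, context):
        # Back off by dropping the oldest character; ground on the uniform estimate.
        if context == "":
            backoff = 1.0 / self.alphabet_size
        else:
            backoff = self._prob(ch, context[1:])
        outcomes = self.counts.get(context)
        if not outcomes:
            return backoff                  # unseen context: lambda is 0
        total = sum(outcomes.values())
        lam = total / (total + self.k * len(outcomes))
        return lam * outcomes.get(ch, 0) / total + (1 - lam) * backoff

    def log_prob(self, text):
        """Natural-log probability of text under the chain rule."""
        return sum(math.log(self.prob(ch, text[:i])) for i, ch in enumerate(text))

lm = CharLM(order=3, alphabet_size=27)  # 26 lowercase letters plus space
lm.train("the tort reform the tart")
print(lm.log_prob("tort") > lm.log_prob("tzrt"))  # True
```

Because every level mixes in the shorter context and finally the uniform distribution, unseen strings still get non-zero probability, which is what lets the model score unknown tokens.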

14
Training Data for Query Model P(Query)
  • Trained independently of edit model
  • Captures domain-specific features more than edits
  • Appropriate Text Corpus matches problem
  • Overall stats: trt → tart or tort (depends
    on domain)
  • Phrasal context: linzer trt vs. trt reform
  • Implicitly models number of possible hits for
    query
  • Can train per field for complex queries
  • E.g. author, institution, MeSH term, abstract in
    MEDLINE
  • Can retrain query models as new data arrives
  • Training data must match use data
  • e.g. all caps, mixed case, etc.
  • May normalize queries plus training data

15
Training Data for Edit Model P(realization | query)
  • 1. No training data
  • A priori typos
  • Characters near each other on keyboard are likely
    typos
  • More careful typing near beginning and end of
    word
  • A priori brainos
  • Vowel sequences confusable with vowel sequences
  • Consonants that sound alike easily confused (t
    vs. d, etc.)
  • Consonants likely doubled or undoubled in error
  • More common in unstressed syllables
    (approximately later)
  • 2. Bootstrap raw query logs
  • Can do this step with simpler model, such as
    ispell
  • Better with the first approximation model above
    (like EM)
  • Estimates rate of various errors and likely
    substitutions

16
Training Data for Edit Model P(realization | query)
(cont.)
  • 3. Sample of Correct/Error Classified Queries
  • Better estimate of error edit rates (not specific
    errors)
  • Estimate likely insert/delete/substitute/transpose
    errors
  • Requires unbiased sample of errors and correct
    queries
  • Search engines report 10-15% of queries have
    errors!!!
  • Need about 100 examples of each error type on
    average
  • Requires unbiased sample of errors (correct not
    necessary)
  • Need about 100 examples average per character, or
    about 5K examples total assuming 50 editable
    characters
  • We can find these using active learning or
    bootstrapping
  • Requires best guess of correction using simpler
    method

17
Training Data for Edit Model P(realization | query)
(cont.)
  • 4. Fully Supervised Learning
  • Same samples as in (3) above
  • Editor(s) provides correction for errors
  • Only a few days' work with a halfway decent
    interface
  • Should use two editors on same sample to
    cross-validate
  • Multiple editors also provide a bound on human
    performance
  • Almost always significantly better than bootstrap
    methods

18
Evaluating Accuracy: Correcting the Right
Queries
  • Need the labeled training data!
  • Are we correcting the right queries?
  • Confusion Matrix
  • True Positive: Error that is corrected
  • True Negative: Good query that is not corrected
  • False Positive: Good query that is corrected
  • False Negative: Error that is not corrected
  • Performance Metrics
  • Precision = TP / (TP + FP)
  • % of corrections that were errors
  • Specificity = TN / (TN + FP)
  • % of good queries that were not corrected
  • Recall = TP / (TP + FN)
  • % of errors that are corrected
  • Accuracy = (TP + TN) / (TP + TN + FP + FN)
  • % of queries for which we do the right thing
  • Can balance false alarms and missed corrections

19
Evaluating Accuracy: Returning the Proper
Correction
  • Correction Accuracy
  • % of corrections that were properly corrected
  • Combine with precision on the to-correct decision
  • Overall Accuracy
  • % of queries that are TN or TP with right
    correction

20
Evaluating Accuracy: MSN Case Study
  • Cucerzan and Brill. 2004. Spelling correction as
    an iterative process that exploits the collective
    knowledge of web users. Proc. ACL.
  • 10-15% estimate of queries with errors
  • Training by Bootstrapping Query Logs (method 2)
  • Scoring one human against another: 90%
  • System accuracy against averaged humans: 82%
  • System accuracy on valid queries: 85%
  • System accuracy on queries with errors: 67%
  • System accuracy with baseline edit model:
    80% total, 83% valid queries, 66% queries with
    errors
  • 8% lower estimates for auto-eval over sequential
    logs
  • 5% higher estimate for reasonable vs. exact
    correction
  • Good News
  • Web search is as hard as it gets: multi-topic
    and multi-lingual

21
Evaluating Efficiency
  • May trade accuracy for efficiency along the receiver
    operating characteristic (ROC) curve
  • Smaller model size by token or characters
  • Smaller search space
  • Higher rejection threshold increases efficiency,
    reduces recall, and increases precision
  • Standalone Server Deployment
  • Allows larger shared models in memory
  • Simple timeout robustness from web server
  • Models require concurrent-read, single-write (CRSW)
    synchronization
  • Any number of concurrent queries share same model
    w/o blocking
  • No queries can run while model is changing
  • Correction may be done in parallel to search (not
    pure latency)
  • Do not need to evaluate number of queries
    returned,
  • though this may be combined post-hoc with results
    for tighter rejection
  • Should easily scale to requirements
  • 1 million queries in 8 hours on a single
    multiprocessor server
  • That's 25-50 queries/second
  • LMs run at 2 million characters/second on desktop

22
But wait, that's not all for LingPipe 2.0
  • Character and Token-level Language Models
  • Ranked Terminology Discovery
  • collocations within corpus (chi square
    independence test)
  • what's new across corpora (binomial t-test)
  • Binary and Multiway Classification
  • Bayesian framework language model
    implementations
  • Extensive probabilistic confusion matrix scoring
  • E.g. Topic (e.g. which newsgroup, which section
    of paper)
  • E.g. Sentiment (e.g. positive or negative product
    review)
  • E.g. Language (critical for multi-lingual
    applications)
  • E.g. De-duplication of message streams
  • E.g. Spam detection
  • Hierarchical Clustering
  • General framework Language model implementations
  • E.g. Self-organizing web results
  • Chunking (high throughput Bayesian model)
  • E.g. Named entities, noun phrases and verb
    phrases
  • Implementations of standard evaluations and
    corpora

23
Design Standards
  • Extensive use of standard patterns
  • E.g. corpus visitors, abstract adapters,
    factories for runtime pluggable implementations
  • Mostly immutable and final (efficiency, state
    stability, and testability)
  • Modules all support CRSW synchronization
  • Highly Modular Interfaces
  • Allows implementation plug and play
  • Most interfaces have abstract adapters
  • E.g. SpellChecker interface, AbstractSpellChecker
    adapter with abstract edit model, and
    ConstantSpellChecker and ProbabilisticSpellChecker
    implementations
  • Simple or Complex Tuning Parameterizations
  • Reasonable Defaults
  • M.S./Ph.D.-level tuning options (popular for
    theses)
  • Follows Sun's coding standards

24
Engineering Support Standards
  • Active and Responsive User Group Forum
  • Tutorial examples of all modules
  • Most include industry-standard evaluations
  • Thorough Unit Testing (JUnit)
  • More good examples of API usage
  • Windows XP and Linux for Java 1.4.2 and 1.5.0
  • Profile-based tuning (JProfiler)
  • Speed, Memory and Disk access
  • Full javadoc of public/protected API
  • Classes are shy about their privates as a rule
  • Types are as specific as possible (many adapters)
  • Integration at command-line, XML or API levels

25
Other Applications
  • Case Restoration
  • Source: Train on mixed case data
  • Channel: Case switching costs nothing, others
    infinite
  • E.g. LOUISE MCNALLY TEACHES AT POMPEU FABRA
    becomes Louise McNally teaches at Pompeu Fabra
  • Useful for speech output or some old teletype
    feeds
  • Vlad-Lita et al. 2003. tRuEasIng. ACL 03.
  • Punctuation Restoration
  • Channel: Punctuation insertion costs nothing,
    others infinite
  • Also useful for speech output
  • Chinese Tokenization (Bill Teahan)
  • Source: Train on space-separated tokens
  • Channel: Spaces insert free, others infinite
  • Teahan et al. 2000. A compression-based algorithm
    for Chinese word segmentation. CL Journal.

26
Decoding L33T-speak
  • L33T is l33t-speak for elite
  • Used by gamers (pwn4g3) and spammers (medcatiOn)
  • Substitute numbers (e.g. E to 3, A to 4,
    O to 0, I to 1)
  • Substitute punctuation (e.g. /\ for A,
    for L, \/\/ for W)
  • Some standard typos (e.g. p for o)
  • De-duplicate or duplicate characters freely
  • Delete characters relatively freely
  • Insert/delete space or punctuation freely
  • Get creative
  • Examples from my spam from this week
  • VàLIUM CíAL1SS ViÁGRRA
  • MACR0MEDIA, M1CR0S0FT, SYMANNTEC 20 EACH
  • univers.ty de-gree online
  • HOt penny pick fueed by high demand
  • Fwd cials-tabs, 24 hour sale online
  • HOw 1s yOur health
  • Your C A R D D E B T can be wipe clean
  • Savvy players wOuld be wise tO l0ad up early
  • Im fed up of my Pan medcatiOn pr0bem
  • Y0ur wIfe needs tO cOpe with the PaIn
  • End your gIrlfr1end's Med!ca prOcedures n0w
  • C,EL.EB,R.E'X    2oo  m'gg
  • Piece of cake to correct (pwn4g3 → ownage, a
    popular taunt if you win)
  • More info: http://en.wikipedia.org/wiki/Leet
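
Some of these rewrites can be undone with a lookup table before any probabilistic decoding. The sketch below is a deterministic shortcut, not the noisy-channel decoder described in the talk: it reverses the four digit substitutions listed above, treats repeated characters as free, and looks the result up in a small vocabulary. The names `LEET`, `normalize`, `collapse_repeats`, and `decode`, and the vocabulary itself, are assumptions.

```python
# Digit-for-letter substitutions named on the slide.
LEET = {"3": "e", "4": "a", "0": "o", "1": "i"}

def normalize(token):
    """Lowercase the token and undo the digit substitutions."""
    return "".join(LEET.get(ch, ch) for ch in token.lower())

def collapse_repeats(token):
    """Drop adjacent duplicate characters, since duplication is treated as free."""
    out = []
    for ch in token:
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)

def decode(token, vocabulary):
    """Vocabulary word matching the token after normalization, or None."""
    # Compare collapsed forms so words with legitimate double letters still match.
    index = {collapse_repeats(word): word for word in vocabulary}
    return index.get(collapse_repeats(normalize(token)))

VOCAB = {"microsoft", "macromedia", "symantec", "load", "penny"}

for spam_token in ["M1CR0S0FT", "MACR0MEDIA", "SYMANNTEC", "l0ad"]:
    print(spam_token, "->", decode(spam_token, VOCAB))
```

Tokens such as pwn4g3 normalize to 'pwnage' but still need an ordinary typo model (p for o) to reach 'ownage', which is where the edit model takes over.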