Title: Contemporary Spelling Correction: Decoding the Noisy Channel
1. Contemporary Spelling Correction: Decoding the Noisy Channel
- Bob Carpenter
- Alias I, Inc.
- carp_at_alias-i.com
2. Kinds of Spelling Mistakes: Typos
- Typos are wrong characters by mistake
- Insertions
- appellate as appellare, prejudice as prejudsice
- Deletions
- plaintiff as paintiff, judgement as judment, liability as liabilty, discovery as dicovery, fourth amendment as fourthamendment
- Substitutions
- habeas as haceas
- Transpositions
- fraud as fruad, bankruptcy as banrkuptcy
- subpoena as subpeona
- plaintiff as plaitniff
3. Kinds of Spelling Mistakes: Brainos
- Brainos are wrong characters on purpose
- The kinds of mistakes found in lists of common misspellings
- Very common in general web queries
- Derive from pronunciation, spelling, or deep semantic confusions
- English is particularly bad due to its irregularity
- Probably (?) common in other languages that import words
4. Brainos: Soundalikes
- Latinates
- subpoena as supena, judicata as judicada, voir as voire
- Consonant Clusters / Flaps
- privilege as priveledge, rescission as recision, collateral as colaterall, latter as ladder, estoppel as estopple, withholding as witholding, recission as recision
- Vowel Reductions
- collateral as collaterel, punitive as punative
- Vowel Clusters
- respondeat as respondiat, lien as lein, estoppel as estopple, habeas as habeeas, conveniens as convieniens
- Marker Vowels
- foreclosure as forclosure
- Multiples
- subpoena as supena (two deletes)
5. Brainos: Confusions
- Substitute a more common or just plain different word
- Names: Opperman as Oppenheimer, Eisenstein as Einstein
- Pronunciation Confusions
- Transpositions: preclusion as perclusion, meruit as meriut
- Irregular word forms
- juries as jurys or jureys, men as mans
- English is particularly bad for this, too
- Tokenization issues
- ATT vs. AT&T vs. A.T.T.
- The correct variant (if unique) depends on the search engine's notion of a word
- Word Boundaries
- in camera as incamera, qui tam as quitam, injunction as in junction, foreclosure as for closure, dramshop as dram shop
6. Old-School Spelling Correction
- Damerau. 1964. A technique for computer detection and correction of spelling errors. Comms. ACM.
- One word (token) at a time
- Only looked at unknown words not in the dictionary
- Suggest closest alternatives (first-best or multiple in order)
- Closeness measured in number of edits (edit distance)
- Deletions, Insertions, Substitutions, and sometimes Transpositions
- Often results in ties
- Good word game
- With 50 characters and a 50-character query, get 50^50 ≈ 10^84 alternatives
- Can search the whole space in linear time using dynamic programming
- This technique lives on in many apps
- Simple, fast, and only requires a word list
7. Edit Distance (Damerau/Levenshtein)
- Quadratic-time, linear-space algorithm
- E.g. D(John, Jan) = 2, D(John, Bob) = 3
- Edits: match J, subst a for o, delete h, match n
- Recurrence (sketched in code below):
- score(I,J) = min(
-   score(I-1,J-1) + match(I,J),
-   score(I-1,J) + delete(J),
-   score(I,J-1) + insert(I) )
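A minimal runnable sketch of that dynamic program in Java; the class and method names are illustrative, not LingPipe's API, and all edit costs are fixed at 1, with Damerau's adjacent transposition included:

```java
// Damerau-Levenshtein distance: quadratic time, linear space.
// Unit costs throughout; a weighted channel model would replace them.
public class EditDistance {

    public static int distance(String source, String target) {
        int cols = target.length() + 1;
        int[] prevPrev = new int[cols]; // row i-2, for transpositions
        int[] prev = new int[cols];     // row i-1
        int[] curr = new int[cols];     // row i

        for (int j = 0; j < cols; ++j)
            prev[j] = j; // empty source: j insertions

        for (int i = 1; i <= source.length(); ++i) {
            curr[0] = i; // empty target: i deletions
            for (int j = 1; j <= target.length(); ++j) {
                int subst = prev[j - 1]
                    + (source.charAt(i - 1) == target.charAt(j - 1) ? 0 : 1);
                int delete = prev[j] + 1;
                int insert = curr[j - 1] + 1;
                curr[j] = Math.min(subst, Math.min(delete, insert));
                // Damerau: transposition of adjacent characters
                if (i > 1 && j > 1
                        && source.charAt(i - 1) == target.charAt(j - 2)
                        && source.charAt(i - 2) == target.charAt(j - 1))
                    curr[j] = Math.min(curr[j], prevPrev[j - 2] + 1);
            }
            int[] tmp = prevPrev; prevPrev = prev; prev = curr; curr = tmp;
        }
        return prev[target.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("John", "Jan"));    // 2
        System.out.println(distance("John", "Bob"));    // 3
        System.out.println(distance("fraud", "fruad")); // 1 (transposition)
    }
}
```

Keeping only three rows gives the linear space bound; the full quadratic table is only needed to read back the edit sequence itself.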
8. Middle-Aged Spelling Correction
- Still look at single words not in a dictionary, plus a list of common misspellings
- Model Likely Edits
- Whole words
- acceptable as acceptible, truant as truent, etc.
- Sound Sequences
- ie ↔ ei, mm ↔ m
- Typos
- Closeness on keyboard (depends on your keyboard; mixtures)
- q as w, y as u (substitutions)
- q as qw or wq (insertions)
- Position in Word
- Edits more likely internally, next most likely at the end, least likely in front (see the sketch below)
- Psychology of reading: left-to-right, early resolution
- plantiff (mid) > plaintff (end) > laintiff (front)
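As a toy illustration of position-dependent edit costs, a penalty function of this shape could weight the edit distance above; the constants are invented for the example, not estimated from data:

```java
// Hypothetical position-sensitive edit penalty (constants invented):
// edits are cheapest word-internally, costlier at the end of a word,
// costliest at the front, per the left-to-right reading effect above.
public class EditPosition {

    public static double penalty(int position, int wordLength) {
        if (position == 0)
            return 2.0;  // front edits: least likely, highest penalty
        if (position >= wordLength - 1)
            return 1.5;  // end edits: somewhat unlikely
        return 1.0;      // internal edits: most likely, lowest penalty
    }

    public static void main(String[] args) {
        // plantiff (mid) beats plaintff (end) beats laintiff (front)
        System.out.println(penalty(4, 9)); // 1.0
        System.out.println(penalty(8, 9)); // 1.5
        System.out.println(penalty(0, 9)); // 2.0
    }
}
```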
9. Contemporary Spelling Correction
- Find the most likely intended query given the observed query
- Integrated Probabilistic Model
- Model of Query Likelihood (source): P(query)
- Model of Edit Likelihood (channel): P(realization | query)
- Shannon's Noisy-Channel Model (1940s)
- Find the most likely query (Q) given the realization (R)
- ARGMAX_Q P(Q | R)                  [problem]
- = ARGMAX_Q P(R | Q) P(Q) / P(R)    [defn. of conditional probability]
- = ARGMAX_Q P(R | Q) P(Q)           [P(R) constant in Q]
10. Simple Example of Correction
- Query Likelihood Model
- P(hte) = 1/1,000,000
- P(the) = 1/20
- Edit Likelihood Model
- P(hte | the)
- = P(transpose(th)) × P(match(e))
- = 1/500 × 99/100 = 99/50,000 ≈ 1/500
- P(hte | hte)
- = P(match(h)) × P(match(t)) × P(match(e)) ≈ 1/1
- Therefore (checked in code below)
- P(hte | the) P(the) ≈ 1/500 × 1/20 = 1/10,000
- >> P(hte | hte) P(hte) ≈ 1/1 × 1/1,000,000 = 1/1,000,000
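The same arithmetic as a few lines of Java, with the probabilities copied from the slide:

```java
// The slide's worked example: score both candidate sources for the
// observed string "hte" by P(realization | query) * P(query).
public class NoisyChannelExample {

    public static void main(String[] args) {
        double pThe = 1.0 / 20.0;       // source model: P(the)
        double pHte = 1.0 / 1000000.0;  // source model: P(hte)

        // channel model: P(hte | the) = P(transpose(th)) * P(match(e))
        double pHteGivenThe = (1.0 / 500.0) * (99.0 / 100.0);
        // channel model: P(hte | hte), three matches, roughly 1
        double pHteGivenHte = 1.0;

        System.out.println("the: " + pHteGivenThe * pThe); // ~1/10,000
        System.out.println("hte: " + pHteGivenHte * pHte); //  1/1,000,000
    }
}
```

So "the" wins by two orders of magnitude: the channel penalty of the transposition is far smaller than the source penalty of the rare string "hte".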
11. General Approach Solves Several Problems
- Orders alternatives based on likelihood
- First-best or ranked n-best alternatives
- N-best is a tricky user-interface issue for web search
- Measures likelihood that the query is in error
- Allows tuning of rejection thresholds for precision/recall
- Measures likelihood that the correction is correct
- As a posterior probability in the Bayesian model
- Principled balance of query vs. edit likelihoods
- Empirical issue determined by measurable user behavior
- E.g. word processors and web search are very different
- Suggests Valid Word Substitutions in Phrases
- pro bono as per bono
- Peter principle as Peter principal
- Google: e.g. fodr → ford, but fodr baggins → frodo baggins
12. Alias-i's Approach
- Models fully retrainable per application
- Out-of-the-box solutions not feasible
- Query and edit models tailored to user behavior in the application
- Scalable to gigabytes w/o pruning and to arbitrary amounts of data with selective pruning
- Character-level model for queries: P(query)
- Generalizes to subphrases of unknown tokens
- E.g. likelihoods flagged as errors by PowerPoint
- E.g. likelihoods not flagged as errors by PowerPoint
- Or token-sensitive output (only output words known from the corpus)
- Allows efficient search based on prefixes
- Flexible framework for edit likelihoods: P(realization | query)
- Models likely substitutions in the domain
13. Source Language Models
- Character n-grams
- P(c0,...,cn-1)
- = PROD_{i<n} P(ci | c0,...,ci-1)      [chain rule]
- ≈ PROD_{i<n} P(ci | ci-n+1,...,ci-1)  [n-gram approximation]
- Generalized Witten-Bell smoothing (state of the art; sketched in code below)
- P(d | c,C)
- = lambda(c,C) PML(d | c,C)
- + (1 - lambda(c,C)) P(d | C)
- where d, c are characters and C is a sequence of characters,
- PML is the maximum likelihood estimator,
- the recursion grounds out in the uniform estimate, and
- lambda(X) = count(X) / (count(X) + K × outcomes(X)) ∈ [0,1]
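A minimal sketch of that smoothing scheme in Java; the class, its counter representation, and the training loop are illustrative assumptions, not LingPipe's implementation:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Generalized Witten-Bell smoothed character n-gram, as on the slide:
//   P(d | c,C) = lambda(c,C) * Pml(d | c,C) + (1 - lambda(c,C)) * P(d | C)
//   lambda(X)  = count(X) / (count(X) + K * outcomes(X))
// with the recursion grounding out in the uniform estimate.
public class WittenBellLm {
    private final int maxContext;     // n - 1 for an n-gram model
    private final double hyperParamK; // the K in lambda(X)
    private final int alphabetSize;   // for the uniform base case
    private final Map<String, Integer> contextCounts = new HashMap<>();
    private final Map<String, Integer> extensionCounts = new HashMap<>();
    private final Map<String, Set<Character>> outcomes = new HashMap<>();

    public WittenBellLm(int maxContext, double k, int alphabetSize) {
        this.maxContext = maxContext;
        this.hyperParamK = k;
        this.alphabetSize = alphabetSize;
    }

    public void train(String text) {
        for (int i = 0; i < text.length(); ++i) {
            char d = text.charAt(i);
            // count every context length from 0 up to maxContext
            for (int j = Math.max(0, i - maxContext); j <= i; ++j) {
                String context = text.substring(j, i);
                contextCounts.merge(context, 1, Integer::sum);
                extensionCounts.merge(context + d, 1, Integer::sum);
                outcomes.computeIfAbsent(context, c -> new HashSet<>()).add(d);
            }
        }
    }

    /** P(d | context), recursing to successively shorter contexts. */
    public double prob(char d, String context) {
        if (context.length() > maxContext)
            context = context.substring(context.length() - maxContext);
        double backoff = context.isEmpty()
            ? 1.0 / alphabetSize              // uniform ground case
            : prob(d, context.substring(1));  // drop oldest character
        int cCount = contextCounts.getOrDefault(context, 0);
        if (cCount == 0)
            return backoff;                   // unseen context: back off fully
        int numOutcomes = outcomes.get(context).size();
        double lambda = cCount / (cCount + hyperParamK * numOutcomes);
        double pMl = extensionCounts.getOrDefault(context + d, 0)
                   / (double) cCount;
        return lambda * pMl + (1.0 - lambda) * backoff;
    }

    public static void main(String[] args) {
        WittenBellLm lm = new WittenBellLm(2, 5.0, 26); // trigram model
        lm.train("the theory of the thing");
        System.out.println(lm.prob('e', "th")); // high: "the" is frequent
        System.out.println(lm.prob('q', "th")); // low: never observed
    }
}
```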
14. Training Data for Query Model P(query)
- Trained independently of the edit model
- Captures domain-specific features more than edits
- Appropriate text corpus matches the problem
- Overall stats: trt → tart or tort (depends on domain)
- Phrasal context: linzer trt vs. trt reform
- Implicitly models the number of possible hits for a query
- Can train per field for complex queries
- E.g. author, institution, MeSH term, abstract in MEDLINE
- Can retrain query models as new data arrives
- Training data must match use data
- e.g. all caps, mixed case, etc.
- May normalize queries plus training data
15. Training Data for Edit Model P(realization | query)
- 1. No training data
- A priori typos
- Characters near each other on the keyboard are likely typos
- More careful typing near the beginning and end of a word
- A priori brainos
- Vowel sequences confusable with other vowel sequences
- Consonants that sound alike are easily confused (t vs. d, etc.)
- Consonants likely doubled or undoubled in error
- More common in unstressed syllables (approximately: later in the word)
- 2. Bootstrap raw query logs
- Can do this step with a simpler model, such as ispell
- Better with the first-approximation model above (like EM)
- Estimates rates of various errors and likely substitutions
16. Training Data for Edit Model P(realization | query) (cont.)
- 3. Sample of Correct/Error-Classified Queries
- Better estimates of error and edit rates (not specific errors)
- Estimate likely insert/delete/substitute/transpose errors
- Requires an unbiased sample of errors and correct queries
- Search engines report 10-15% of queries have errors!!!
- Need 100 examples of each type of error on average
- Requires an unbiased sample of errors (corrections not necessary)
- Need about 100 examples on average per character, or about 5K examples total assuming 50 editable characters
- We can find these using active learning or bootstrapping
- Requires a best guess of the correction using a simpler method
17. Training Data for Edit Model P(realization | query) (cont.)
- 4. Fully Supervised Learning
- Same samples as in (3) above
- Editor(s) provide corrections for errors
- Only a few days' work with a halfway decent interface
- Should use two editors on the same sample to cross-validate
- Multiple editors also provide a bound on human performance
- Almost always significantly better than bootstrap methods
18. Evaluating Accuracy: Correcting the Right Queries
- Need the labeled training data!
- Are we correcting the right queries?
- Confusion Matrix
- True Positive: an error that is corrected
- True Negative: a good query that is not corrected
- False Positive: a good query that is corrected
- False Negative: an error that is not corrected
- Performance Metrics (computed in code below)
- Precision = TP / (TP + FP)
- % of corrections that were errors
- Specificity = TN / (TN + FP)
- % of good queries that are not corrected
- Recall = TP / (TP + FN)
- % of errors that are corrected
- Accuracy = (TP + TN) / (TP + TN + FP + FN)
- % of queries for which we do the right thing
- Can balance false alarms and missed corrections
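The four metrics in code, with hypothetical counts standing in for a labeled test set:

```java
// Confusion-matrix metrics for the to-correct decision.
public class CorrectionMetrics {

    public static void main(String[] args) {
        // hypothetical counts from a labeled sample of 1,000 queries
        double tp = 120, tn = 830, fp = 20, fn = 30;

        double precision   = tp / (tp + fp); // corrections that were errors
        double specificity = tn / (tn + fp); // good queries left alone
        double recall      = tp / (tp + fn); // errors that got corrected
        double accuracy    = (tp + tn) / (tp + tn + fp + fn);

        System.out.printf("P=%.3f Spec=%.3f R=%.3f Acc=%.3f%n",
                          precision, specificity, recall, accuracy);
    }
}
```

Raising the rejection threshold shrinks FP at the cost of FN, trading recall for precision.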
19. Evaluating Accuracy: Returning the Proper Correction
- Correction Accuracy
- % of corrections that were properly corrected
- Combine with precision on the to-correct decision
- Overall Accuracy
- % of queries that are TN, or TP with the right correction
20. Evaluating Accuracy: MSN Case Study
- Cucerzan and Brill. 2004. Spelling correction as an iterative process that exploits the collective knowledge of web users. Proc. EMNLP.
- 10-15% estimated rate of queries with errors
- Training by bootstrapping query logs (method 2)
- Scoring one human against another: 90%
- System accuracy against averaged humans: 82%
- System accuracy on valid queries: 85%
- System accuracy on queries with errors: 67%
- System accuracy with baseline edit model:
- 80% total, 83% valid queries, 66% queries with errors
- 8% lower estimates for auto-eval over sequential logs
- 5% higher estimate for reasonable vs. exact correction
- Good News
- Web search is as hard as it gets: multi-topic and multi-lingual
21. Evaluating Efficiency
- May trade accuracy for efficiency along the receiver operating curve
- Smaller model size by tokens or characters
- Smaller search space
- A higher rejection threshold increases efficiency, reduces recall, and increases precision
- Standalone Server Deployment
- Allows larger shared models in memory
- Simple timeout robustness from the web server
- Models require concurrent-read/single-write (CRSW) synchronization
- Any number of concurrent queries share the same model w/o blocking
- No queries can run while the model is changing
- Correction may be done in parallel with search (not pure latency)
- Do not need to evaluate the number of queries returned,
- though this may be combined post hoc with results for tighter rejection
- Should easily scale to requirements
- 1 million queries in 8 hours on a single multiprocessor server
- That's 25-50 queries/second
- LMs run at 2 million characters/second on a desktop
22. But wait, that's not all for LingPipe 2.0
- Character- and Token-level Language Models
- Ranked Terminology Discovery
- Collocations within a corpus (chi-square independence test)
- What's new across corpora (binomial t-test)
- Binary and Multiway Classification
- Bayesian framework, language model implementations
- Extensive probabilistic confusion-matrix scoring
- E.g. topic (e.g. which newsgroup, which section of a paper)
- E.g. sentiment (e.g. positive or negative product review)
- E.g. language (critical for multi-lingual applications)
- E.g. de-duplication of message streams
- E.g. spam detection
- Hierarchical Clustering
- General framework, language model implementations
- E.g. self-organizing web results
- Chunking (high-throughput Bayesian model)
- E.g. named entities, noun phrases and verb phrases
- Implementations of standard evaluations and corpora
23. Design Standards
- Extensive use of standard patterns
- E.g. corpus visitors, abstract adapters, factories for runtime-pluggable implementations
- Mostly immutable/final (efficiency, state stability, testability)
- Modules all support CRSW synchronization
- Highly Modular Interfaces
- Allows implementation plug-and-play
- Most interfaces have abstract adapters
- E.g. SpellChecker interface, AbstractSpellChecker adapter with abstract edit model, and ConstantSpellChecker and ProbabilisticSpellChecker implementations
- Simple or Complex Tuning Parameterizations
- Reasonable defaults
- M.S./Ph.D.-level tuning options (popular for theses)
- Follows Sun's coding standards
24. Engineering Support Standards
- Active and Responsive User Group Forum
- Tutorial examples of all modules
- Most include industry-standard evaluations
- Thorough Unit Testing (JUnit)
- More good examples of API usage
- Windows XP and Linux, for Java 1.4.2 and 1.5.0
- Profile-based tuning (JProfiler)
- Speed, Memory and Disk access
- Full javadoc of public/protected API
- Classes are shy about their privates as a rule
- Types are as specific as possible (many adapters)
- Integration at command-line, XML or API levels
25. Other Applications
- Case Restoration
- Source: train on mixed-case data
- Channel: case switching costs nothing, all other edits are infinite (see the sketch below)
- E.g. LOUISE MCNALLY TEACHES AT POMPEU FABRA becomes Louise McNally teaches at Pompeu Fabra
- Useful for speech output or some old teletype feeds
- Vlad Lita et al. 2003. tRuEcAsIng. ACL 03.
- Punctuation Restoration
- Channel: punctuation insertion costs nothing, all other edits are infinite
- Also useful for speech output
- Chinese Tokenization (Bill Teahan)
- Source: train on space-separated tokens
- Channel: space insertion is free, all other edits are infinite
- Teahan et al. 2000. A compression-based algorithm for Chinese word segmentation. Computational Linguistics.
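All three applications can be read as the same weighted edit distance with most edits forbidden. A hypothetical cost function for the case-restoration channel (class and method names invented for illustration):

```java
// Hypothetical channel costs for case restoration: switching case is
// free, every other edit is forbidden, so the source model alone
// chooses among case variants of the observed string.
public class CaseChannel {

    public static double substituteCost(char observed, char intended) {
        return Character.toLowerCase(observed)
                == Character.toLowerCase(intended)
            ? 0.0                        // case switch: free
            : Double.POSITIVE_INFINITY;  // other substitution: forbidden
    }

    public static double insertCost(char c) {
        return Double.POSITIVE_INFINITY; // no insertions allowed
    }

    public static double deleteCost(char c) {
        return Double.POSITIVE_INFINITY; // no deletions allowed
    }

    public static void main(String[] args) {
        System.out.println(substituteCost('M', 'm')); // 0.0
        System.out.println(substituteCost('M', 'n')); // Infinity
    }
}
```

For punctuation restoration, the insert cost would instead drop to zero for punctuation characters; for Chinese tokenization, for spaces.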
26. Decoding L33T-speak
- L33T is l33t-speak for elite
- Used by gamers (pwn4g3) and spammers (medcatiOn)
- Substitute numbers (e.g. E to 3, A to 4, O to 0, I to 1)
- Substitute punctuation (e.g. /\ for A, |_ for L, \/\/ for W)
- Some standard typos (e.g. p for o)
- De-duplicate or duplicate characters freely
- Delete characters relatively freely
- Insert/delete space or punctuation freely
- Get creative
- Examples from my spam this week:
- VàLIUM CíAL1SS ViÁGRRA MACR0MEDIA, M1CR0S0FT,
SYMANNTEC 20 EACH univers.ty de-gree online
HOt penny pick fueed by high demand Fwd
cials-tabs, 24 hour sale online HOw 1s yOur
health Your C A R D D E B T can be wipe clean
Savvy players wOuld be wise tO l0ad up early Im
fed up of my Pan medcatiOn pr0bem Y0ur wIfe
needs tO cOpe with the PaIn End your
gIrlfr1end's Med!ca prOcedures n0w
C,EL.EB,R.E'X 2oo m'gg
- Piece of cake to correct (pwn4g3 = ownage, a popular taunt if you win); a toy normalizer is sketched below
- More info: http://en.wikipedia.org/wiki/Leet
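A toy normalizer for the single-character number substitutions above; the table is deliberately tiny, and a real system would model these as channel edits with probabilities, also handling duplication, deletion, spacing, and multi-character substitutions like /\ for A:

```java
import java.util.HashMap;
import java.util.Map;

// Toy l33t normalizer: undo common single-character substitutions
// before handing the string to the noisy-channel corrector proper.
public class LeetNormalizer {

    private static final Map<Character, Character> SUBS = new HashMap<>();
    static {
        SUBS.put('3', 'e');
        SUBS.put('4', 'a');
        SUBS.put('0', 'o');
        SUBS.put('1', 'i');
        SUBS.put('!', 'i');
        SUBS.put('$', 's');
    }

    public static String normalize(String leet) {
        StringBuilder sb = new StringBuilder();
        for (char c : leet.toCharArray()) {
            char plain = SUBS.getOrDefault(c, c); // pass through unknowns
            sb.append(plain);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("pwn4g3"));     // pwnage
        System.out.println(normalize("m3d1cat10n")); // medication
    }
}
```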