Title: Contemporary Spelling Correction: Decoding the Noisy Channel
1. Contemporary Spelling Correction: Decoding the Noisy Channel
- Bob Carpenter
- Alias I, Inc.
- carp_at_alias-i.com
2. Kinds of Spelling Mistakes: Typos
- Typos are wrong characters by mistake
- Insertions
- appellate as appellare, prejudice as prejudsice
- Deletions
- plaintiff as paintiff, judgement as judment, liability as liabilty, discovery as dicovery, fourth amendment as fourthamendment
- Substitutions
- habeas as haceas
- Transpositions
- fraud as fruad, bankruptcy as banrkuptcy
- subpoena as subpeona
- plaintiff as plaitniff
3. Kinds of Spelling Mistakes: Brainos
- Brainos are wrong characters on purpose
- The kinds of mistakes found in lists of common misspellings
- Very common in general web queries
- Derive from pronunciation, spelling, or deep semantic confusions
- English is particularly bad due to its irregularity
- Probably (?) common in other languages that import words
4. Brainos: Soundalikes
- Latinates
- subpoena as supena, judicata as judicada, voir as voire
- Consonant Clusters / Flaps
- privilege as priveledge, rescission as recision, collateral as colaterall, latter as ladder, estoppel as estopple, withholding as witholding, recission as recision
- Vowel Reductions
- collateral as collaterel, punitive as punative
- Vowel Clusters
- respondeat as respondiat, lien as lein, estoppel as estopple, habeas as habeeas, conveniens as convieniens
- Marker Vowels
- foreclosure as forclosure
- Multiples
- subpoena as supena (two deletes)
5. Brainos: Confusions
- Substitute a more common or just plain different word
- Names: Opperman as Oppenheimer, Eisenstein as Einstein
- Pronunciation Confusions
- Transpositions: preclusion as perclusion, meruit as meriut
- Irregular word forms
- juries as jurys or jureys, men as mans
- English is particularly bad for this, too
- Tokenization issues
- ATT vs. AT&T vs. A.T.T.
- The correct variant (if unique) depends on the search engine's notion of a word
- Word Boundaries
- in camera as incamera, qui tam as quitam, injunction as in junction, foreclosure as for closure, dramshop as dram shop
6. Old-School Spelling Correction
- Damerau. 1964. A technique for computer detection and correction of spelling errors. Comms. ACM.
- One word (token) at a time
- Only looked at unknown words not in the dictionary
- Suggest closest alternatives (first-best or multiple in order)
- Closeness measured in number of edits (edit distance)
- Deletions, Insertions, Substitutions, and sometimes Transpositions
- Often results in ties
- Good word game
- With 50 characters and a 50-character query, get 50^50 ≈ 10^84 alternatives
- Can search the whole space in linear time using dynamic programming
- This technique lives on in many apps
- Simple, fast, and only requires a word list
7. Edit Distance (Damerau/Levenshtein)
- Quadratic-time, linear-space algorithm
- E.g. D(John, Jan) = 2, D(John, Bob) = 3
- Edits: match J, subst a for o, delete h, match n
- Recurrence (sketched in code below):
- score(I,J) = min(
-   score(I-1,J-1) + match(I,J),
-   score(I-1,J) + delete(J),
-   score(I,J-1) + insert(I) )
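A minimal runnable sketch of that dynamic program in Java; the class and method names are illustrative, not LingPipe's API, and all edit costs are fixed at 1, with Damerau's adjacent transposition included:

```java
// Damerau-Levenshtein distance: quadratic time, linear space.
// Unit costs throughout; a weighted channel model would replace them.
public class EditDistance {

    public static int distance(String source, String target) {
        int cols = target.length() + 1;
        int[] prevPrev = new int[cols]; // row i-2, for transpositions
        int[] prev = new int[cols];     // row i-1
        int[] curr = new int[cols];     // row i

        for (int j = 0; j < cols; ++j)
            prev[j] = j; // empty source: j insertions

        for (int i = 1; i <= source.length(); ++i) {
            curr[0] = i; // empty target: i deletions
            for (int j = 1; j <= target.length(); ++j) {
                int subst = prev[j - 1]
                    + (source.charAt(i - 1) == target.charAt(j - 1) ? 0 : 1);
                int delete = prev[j] + 1;
                int insert = curr[j - 1] + 1;
                curr[j] = Math.min(subst, Math.min(delete, insert));
                // Damerau: transposition of adjacent characters
                if (i > 1 && j > 1
                        && source.charAt(i - 1) == target.charAt(j - 2)
                        && source.charAt(i - 2) == target.charAt(j - 1))
                    curr[j] = Math.min(curr[j], prevPrev[j - 2] + 1);
            }
            int[] tmp = prevPrev; prevPrev = prev; prev = curr; curr = tmp;
        }
        return prev[target.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("John", "Jan"));    // 2
        System.out.println(distance("John", "Bob"));    // 3
        System.out.println(distance("fraud", "fruad")); // 1 (transposition)
    }
}
```

Keeping only three rows gives the linear space bound; the full quadratic table is only needed to read back the edit sequence itself.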
8. Middle-Aged Spelling Correction
- Still look at single words not in a dictionary, plus a list of common misspellings
- Model Likely Edits
- Whole words
- acceptable as acceptible, truant as truent, etc.
- Sound Sequences
- ie ↔ ei, mm ↔ m
- Typos
- Closeness on keyboard (depends on your keyboard; mixtures)
- q as w, y as u (substitutions)
- q as qw or wq (insertions)
- Position in Word
- Edits more likely internally, next most likely at the end, least likely in front (see the sketch below)
- Psychology of reading: left-to-right, early resolution
- plantiff (mid) > plaintff (end) > laintiff (front)
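As a toy illustration of position-dependent edit costs, a penalty function of this shape could weight the edit distance above; the constants are invented for the example, not estimated from data:

```java
// Hypothetical position-sensitive edit penalty (constants invented):
// edits are cheapest word-internally, costlier at the end of a word,
// costliest at the front, per the left-to-right reading effect above.
public class EditPosition {

    public static double penalty(int position, int wordLength) {
        if (position == 0)
            return 2.0;  // front edits: least likely, highest penalty
        if (position >= wordLength - 1)
            return 1.5;  // end edits: somewhat unlikely
        return 1.0;      // internal edits: most likely, lowest penalty
    }

    public static void main(String[] args) {
        // plantiff (mid) beats plaintff (end) beats laintiff (front)
        System.out.println(penalty(4, 9)); // 1.0
        System.out.println(penalty(8, 9)); // 1.5
        System.out.println(penalty(0, 9)); // 2.0
    }
}
```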
9. Contemporary Spelling Correction
- Find the most likely intended query given the observed query
- Integrated Probabilistic Model
- Model of Query Likelihood (source): P(query)
- Model of Edit Likelihood (channel): P(realization | query)
- Shannon's Noisy-Channel Model (1940s)
- Find the most likely query (Q) given the realization (R)
- ARGMAX_Q P(Q | R)                  [problem]
- = ARGMAX_Q P(R | Q) P(Q) / P(R)    [defn. of conditional probability]
- = ARGMAX_Q P(R | Q) P(Q)           [P(R) constant in Q]
10. Simple Example of Correction
- Query Likelihood Model
- P(hte) = 1/1,000,000
- P(the) = 1/20
- Edit Likelihood Model
- P(hte | the)
- = P(transpose(th)) × P(match(e))
- = 1/500 × 99/100 = 99/50,000 ≈ 1/500
- P(hte | hte)
- = P(match(h)) × P(match(t)) × P(match(e)) ≈ 1/1
- Therefore (checked in code below)
- P(hte | the) P(the) ≈ 1/500 × 1/20 = 1/10,000
- >> P(hte | hte) P(hte) ≈ 1/1 × 1/1,000,000 = 1/1,000,000
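The same arithmetic as a few lines of Java, with the probabilities copied from the slide:

```java
// The slide's worked example: score both candidate sources for the
// observed string "hte" by P(realization | query) * P(query).
public class NoisyChannelExample {

    public static void main(String[] args) {
        double pThe = 1.0 / 20.0;       // source model: P(the)
        double pHte = 1.0 / 1000000.0;  // source model: P(hte)

        // channel model: P(hte | the) = P(transpose(th)) * P(match(e))
        double pHteGivenThe = (1.0 / 500.0) * (99.0 / 100.0);
        // channel model: P(hte | hte), three matches, roughly 1
        double pHteGivenHte = 1.0;

        System.out.println("the: " + pHteGivenThe * pThe); // ~1/10,000
        System.out.println("hte: " + pHteGivenHte * pHte); //  1/1,000,000
    }
}
```

So "the" wins by two orders of magnitude: the channel penalty of the transposition is far smaller than the source penalty of the rare string "hte".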
11. General Approach Solves Several Problems
- Orders alternatives based on likelihood
- First-best or ranked n-best alternatives
- N-best is a tricky user-interface issue for web search
- Measures likelihood that the query is in error
- Allows tuning of rejection thresholds for precision/recall
- Measures likelihood that the correction is correct
- As a posterior probability in the Bayesian model
- Principled balance of query vs. edit likelihoods
- Empirical issue determined by measurable user behavior
- E.g. word processors and web search are very different
- Suggests Valid Word Substitutions in Phrases
- pro bono as per bono
- Peter principle as Peter principal
- Google: e.g. fodr → ford, but fodr baggins → frodo baggins
12. Alias-i's Approach
- Models fully retrainable per application
- Out-of-the-box solutions not feasible
- Query and edit models tailored to user behavior in the application
- Scalable to gigabytes w/o pruning and to arbitrary amounts of data with selective pruning
- Character-level model for queries: P(query)
- Generalizes to subphrases of unknown tokens
- E.g. likelihoods flagged as errors by PowerPoint
- E.g. likelihoods not flagged as errors by PowerPoint
- Or token-sensitive output (only output words known from the corpus)
- Allows efficient search based on prefixes
- Flexible framework for edit likelihoods: P(realization | query)
- Models likely substitutions in the domain
13. Source Language Models
- Character n-grams
- P(c0,...,cn-1)
- = PROD_{i<n} P(ci | c0,...,ci-1)      [chain rule]
- ≈ PROD_{i<n} P(ci | ci-n+1,...,ci-1)  [n-gram approximation]
- Generalized Witten-Bell smoothing (state of the art; sketched in code below)
- P(d | c,C)
- = lambda(c,C) PML(d | c,C)
- + (1 - lambda(c,C)) P(d | C)
- where d, c are characters and C is a sequence of characters,
- PML is the maximum likelihood estimator,
- the recursion grounds out in the uniform estimate, and
- lambda(X) = count(X) / (count(X) + K × outcomes(X)) ∈ [0,1]
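A minimal sketch of that smoothing scheme in Java; the class, its counter representation, and the training loop are illustrative assumptions, not LingPipe's implementation:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Generalized Witten-Bell smoothed character n-gram, as on the slide:
//   P(d | c,C) = lambda(c,C) * Pml(d | c,C) + (1 - lambda(c,C)) * P(d | C)
//   lambda(X)  = count(X) / (count(X) + K * outcomes(X))
// with the recursion grounding out in the uniform estimate.
public class WittenBellLm {
    private final int maxContext;     // n - 1 for an n-gram model
    private final double hyperParamK; // the K in lambda(X)
    private final int alphabetSize;   // for the uniform base case
    private final Map<String, Integer> contextCounts = new HashMap<>();
    private final Map<String, Integer> extensionCounts = new HashMap<>();
    private final Map<String, Set<Character>> outcomes = new HashMap<>();

    public WittenBellLm(int maxContext, double k, int alphabetSize) {
        this.maxContext = maxContext;
        this.hyperParamK = k;
        this.alphabetSize = alphabetSize;
    }

    public void train(String text) {
        for (int i = 0; i < text.length(); ++i) {
            char d = text.charAt(i);
            // count every context length from 0 up to maxContext
            for (int j = Math.max(0, i - maxContext); j <= i; ++j) {
                String context = text.substring(j, i);
                contextCounts.merge(context, 1, Integer::sum);
                extensionCounts.merge(context + d, 1, Integer::sum);
                outcomes.computeIfAbsent(context, c -> new HashSet<>()).add(d);
            }
        }
    }

    /** P(d | context), recursing to successively shorter contexts. */
    public double prob(char d, String context) {
        if (context.length() > maxContext)
            context = context.substring(context.length() - maxContext);
        double backoff = context.isEmpty()
            ? 1.0 / alphabetSize              // uniform ground case
            : prob(d, context.substring(1));  // drop oldest character
        int cCount = contextCounts.getOrDefault(context, 0);
        if (cCount == 0)
            return backoff;                   // unseen context: back off fully
        int numOutcomes = outcomes.get(context).size();
        double lambda = cCount / (cCount + hyperParamK * numOutcomes);
        double pMl = extensionCounts.getOrDefault(context + d, 0)
                   / (double) cCount;
        return lambda * pMl + (1.0 - lambda) * backoff;
    }

    public static void main(String[] args) {
        WittenBellLm lm = new WittenBellLm(2, 5.0, 26); // trigram model
        lm.train("the theory of the thing");
        System.out.println(lm.prob('e', "th")); // high: "the" is frequent
        System.out.println(lm.prob('q', "th")); // low: never observed
    }
}
```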
14. Training Data for Query Model P(query)
- Trained independently of the edit model
- Captures domain-specific features more than edits
- Appropriate text corpus matches the problem
- Overall stats: trt → tart or tort (depends on domain)
- Phrasal context: linzer trt vs. trt reform
- Implicitly models the number of possible hits for a query
- Can train per field for complex queries
- E.g. author, institution, MeSH term, abstract in MEDLINE
- Can retrain query models as new data arrives
- Training data must match use data
- e.g. all caps, mixed case, etc.
- May normalize queries plus training data
15. Training Data for Edit Model P(realization | query)
- 1. No training data
- A priori typos
- Characters near each other on the keyboard are likely typos
- More careful typing near the beginning and end of a word
- A priori brainos
- Vowel sequences confusable with other vowel sequences
- Consonants that sound alike are easily confused (t vs. d, etc.)
- Consonants likely doubled or undoubled in error
- More common in unstressed syllables (approximately: later in the word)
- 2. Bootstrap raw query logs
- Can do this step with a simpler model, such as ispell
- Better with the first-approximation model above (like EM)
- Estimates rates of various errors and likely substitutions
16. Training Data for Edit Model P(realization | query) (cont.)
- 3. Sample of Correct/Error-Classified Queries
- Better estimates of error and edit rates (not specific errors)
- Estimate likely insert/delete/substitute/transpose errors
- Requires an unbiased sample of errors and correct queries
- Search engines report 10-15% of queries have errors!!!
- Need 100 examples of each type of error on average
- Requires an unbiased sample of errors (corrections not necessary)
- Need about 100 examples on average per character, or about 5K examples total assuming 50 editable characters
- We can find these using active learning or bootstrapping
- Requires a best guess of the correction using a simpler method
17. Training Data for Edit Model P(realization | query) (cont.)
- 4. Fully Supervised Learning
- Same samples as in (3) above
- Editor(s) provide corrections for errors
- Only a few days' work with a halfway decent interface
- Should use two editors on the same sample to cross-validate
- Multiple editors also provide a bound on human performance
- Almost always significantly better than bootstrap methods
18. Evaluating Accuracy: Correcting the Right Queries
- Need the labeled training data!
- Are we correcting the right queries?
- Confusion Matrix
- True Positive: an error that is corrected
- True Negative: a good query that is not corrected
- False Positive: a good query that is corrected
- False Negative: an error that is not corrected
- Performance Metrics (computed in code below)
- Precision = TP / (TP + FP)
- % of corrections that were errors
- Specificity = TN / (TN + FP)
- % of good queries that are not corrected
- Recall = TP / (TP + FN)
- % of errors that are corrected
- Accuracy = (TP + TN) / (TP + TN + FP + FN)
- % of queries for which we do the right thing
- Can balance false alarms and missed corrections
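The four metrics in code, with hypothetical counts standing in for a labeled test set:

```java
// Confusion-matrix metrics for the to-correct decision.
public class CorrectionMetrics {

    public static void main(String[] args) {
        // hypothetical counts from a labeled sample of 1,000 queries
        double tp = 120, tn = 830, fp = 20, fn = 30;

        double precision   = tp / (tp + fp); // corrections that were errors
        double specificity = tn / (tn + fp); // good queries left alone
        double recall      = tp / (tp + fn); // errors that got corrected
        double accuracy    = (tp + tn) / (tp + tn + fp + fn);

        System.out.printf("P=%.3f Spec=%.3f R=%.3f Acc=%.3f%n",
                          precision, specificity, recall, accuracy);
    }
}
```

Raising the rejection threshold shrinks FP at the cost of FN, trading recall for precision.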
19. Evaluating Accuracy: Returning the Proper Correction
- Correction Accuracy
- % of corrections that were properly corrected
- Combine with precision on the to-correct decision
- Overall Accuracy
- % of queries that are TN, or TP with the right correction
20. Evaluating Accuracy: MSN Case Study
- Cucerzan and Brill. 2004. Spelling correction as an iterative process that exploits the collective knowledge of web users. Proc. EMNLP.
- 10-15% estimated rate of queries with errors
- Training by bootstrapping query logs (method 2)
- Scoring one human against another: 90%
- System accuracy against averaged humans: 82%
- System accuracy on valid queries: 85%
- System accuracy on queries with errors: 67%
- System accuracy with baseline edit model:
- 80% total, 83% valid queries, 66% queries with errors
- 8% lower estimates for auto-eval over sequential logs
- 5% higher estimate for reasonable vs. exact correction
- Good News
- Web search is as hard as it gets: multi-topic and multi-lingual
21. Evaluating Efficiency
- May trade accuracy for efficiency along the receiver operating curve
- Smaller model size by tokens or characters
- Smaller search space
- A higher rejection threshold increases efficiency, reduces recall, and increases precision
- Standalone Server Deployment
- Allows larger shared models in memory
- Simple timeout robustness from the web server
- Models require concurrent-read/single-write (CRSW) synchronization
- Any number of concurrent queries share the same model w/o blocking
- No queries can run while the model is changing
- Correction may be done in parallel with search (not pure latency)
- Do not need to evaluate the number of queries returned,
- though this may be combined post hoc with results for tighter rejection
- Should easily scale to requirements
- 1 million queries in 8 hours on a single multiprocessor server
- That's 25-50 queries/second
- LMs run at 2 million characters/second on a desktop
22. But wait, that's not all for LingPipe 2.0
- Character- and Token-level Language Models
- Ranked Terminology Discovery
- Collocations within a corpus (chi-square independence test)
- What's new across corpora (binomial t-test)
- Binary and Multiway Classification
- Bayesian framework, language model implementations
- Extensive probabilistic confusion-matrix scoring
- E.g. topic (e.g. which newsgroup, which section of a paper)
- E.g. sentiment (e.g. positive or negative product review)
- E.g. language (critical for multi-lingual applications)
- E.g. de-duplication of message streams
- E.g. spam detection
- Hierarchical Clustering
- General framework, language model implementations
- E.g. self-organizing web results
- Chunking (high-throughput Bayesian model)
- E.g. named entities, noun phrases and verb phrases
- Implementations of standard evaluations and corpora
23. Design Standards
- Extensive use of standard patterns
- E.g. corpus visitors, abstract adapters, factories for runtime-pluggable implementations
- Mostly immutable/final (efficiency, state stability, testability)
- Modules all support CRSW synchronization
- Highly Modular Interfaces
- Allows implementation plug-and-play
- Most interfaces have abstract adapters
- E.g. SpellChecker interface, AbstractSpellChecker adapter with abstract edit model, and ConstantSpellChecker and ProbabilisticSpellChecker implementations
- Simple or Complex Tuning Parameterizations
- Reasonable defaults
- M.S./Ph.D.-level tuning options (popular for theses)
- Follows Sun's coding standards
24. Engineering Support Standards
- Active and Responsive User Group Forum
- Tutorial examples of all modules
- Most include industry-standard evaluations
- Thorough Unit Testing (JUnit)
- More good examples of API usage
- Windows XP and Linux, for Java 1.4.2 and 1.5.0
- Profile-based tuning (JProfiler)
- Speed, Memory and Disk access
- Full javadoc of public/protected API
- Classes are shy about their privates as a rule
- Types are as specific as possible (many adapters)
- Integration at command-line, XML or API levels
25. Other Applications
- Case Restoration
- Source: train on mixed-case data
- Channel: case switching costs nothing, all other edits are infinite (see the sketch below)
- E.g. LOUISE MCNALLY TEACHES AT POMPEU FABRA becomes Louise McNally teaches at Pompeu Fabra
- Useful for speech output or some old teletype feeds
- Vlad Lita et al. 2003. tRuEcAsIng. ACL 03.
- Punctuation Restoration
- Channel: punctuation insertion costs nothing, all other edits are infinite
- Also useful for speech output
- Chinese Tokenization (Bill Teahan)
- Source: train on space-separated tokens
- Channel: space insertion is free, all other edits are infinite
- Teahan et al. 2000. A compression-based algorithm for Chinese word segmentation. Computational Linguistics.
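All three applications can be read as the same weighted edit distance with most edits forbidden. A hypothetical cost function for the case-restoration channel (class and method names invented for illustration):

```java
// Hypothetical channel costs for case restoration: switching case is
// free, every other edit is forbidden, so the source model alone
// chooses among case variants of the observed string.
public class CaseChannel {

    public static double substituteCost(char observed, char intended) {
        return Character.toLowerCase(observed)
                == Character.toLowerCase(intended)
            ? 0.0                        // case switch: free
            : Double.POSITIVE_INFINITY;  // other substitution: forbidden
    }

    public static double insertCost(char c) {
        return Double.POSITIVE_INFINITY; // no insertions allowed
    }

    public static double deleteCost(char c) {
        return Double.POSITIVE_INFINITY; // no deletions allowed
    }

    public static void main(String[] args) {
        System.out.println(substituteCost('M', 'm')); // 0.0
        System.out.println(substituteCost('M', 'n')); // Infinity
    }
}
```

For punctuation restoration, the insert cost would instead drop to zero for punctuation characters; for Chinese tokenization, for spaces.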
26. Decoding L33T-speak
- L33T is l33t-speak for elite
- Used by gamers (pwn4g3) and spammers (medcatiOn)
- Substitute numbers (e.g. E to 3, A to 4, O to 0, I to 1)
- Substitute punctuation (e.g. /\ for A, |_ for L, \/\/ for W)
- Some standard typos (e.g. p for o)
- De-duplicate or duplicate characters freely
- Delete characters relatively freely
- Insert/delete space or punctuation freely
- Get creative
- Examples from my spam this week:
- VàLIUM CíAL1SS ViÁGRRA MACR0MEDIA, M1CR0S0FT,
SYMANNTEC 20 EACH univers.ty de-gree online
HOt penny pick fueed by high demand Fwd
cials-tabs, 24 hour sale online HOw 1s yOur
health Your C A R D D E B T can be wipe clean
Savvy players wOuld be wise tO l0ad up early Im
fed up of my Pan medcatiOn pr0bem Y0ur wIfe
needs tO cOpe with the PaIn End your
gIrlfr1end's Med!ca prOcedures n0w
C,EL.EB,R.E'X 2oo m'gg
- Piece of cake to correct (pwn4g3 = ownage, a popular taunt if you win); a toy normalizer is sketched below
- More info: http://en.wikipedia.org/wiki/Leet
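A toy normalizer for the single-character number substitutions above; the table is deliberately tiny, and a real system would model these as channel edits with probabilities, also handling duplication, deletion, spacing, and multi-character substitutions like /\ for A:

```java
import java.util.HashMap;
import java.util.Map;

// Toy l33t normalizer: undo common single-character substitutions
// before handing the string to the noisy-channel corrector proper.
public class LeetNormalizer {

    private static final Map<Character, Character> SUBS = new HashMap<>();
    static {
        SUBS.put('3', 'e');
        SUBS.put('4', 'a');
        SUBS.put('0', 'o');
        SUBS.put('1', 'i');
        SUBS.put('!', 'i');
        SUBS.put('$', 's');
    }

    public static String normalize(String leet) {
        StringBuilder sb = new StringBuilder();
        for (char c : leet.toCharArray()) {
            char plain = SUBS.getOrDefault(c, c); // pass through unknowns
            sb.append(plain);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("pwn4g3"));     // pwnage
        System.out.println(normalize("m3d1cat10n")); // medication
    }
}
```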