Title: Probabilistic Detection of Context-Sensitive Spelling Errors
1 Probabilistic Detection of Context-Sensitive Spelling Errors
- Johnny Bigert
- Royal Institute of Technology, Sweden
- johnny_at_kth.se
2 What?
- Context-Sensitive Spelling Errors
- Example: "Nice whether today."
- All words found in dictionary
- If context is considered, the spelling of "whether" is incorrect
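To make the example concrete, here is a minimal sketch of a plain dictionary lookup, which is roughly what a regular spell checker does. The word list and the function name unknown_words are made up for illustration.

```python
# Toy word list (an assumption); a real spell checker uses a full dictionary.
DICTIONARY = {"nice", "whether", "weather", "today"}

def unknown_words(sentence):
    """Return the words of the sentence that are missing from DICTIONARY."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    return [w for w in words if w and w not in DICTIONARY]

# Every word is a valid dictionary word, so nothing is flagged,
# even though "whether" is wrong in this context.
print(unknown_words("Nice whether today."))
```

A lookup like this only catches non-words such as "wether"; it has no way of noticing that "whether" should have been "weather".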
3 Why?
- Why do we need detection of context-sensitive spelling errors?
- These errors are quite frequent (reports of 16-40% of all errors)
- Larger dictionaries leave more errors undetected
- They cannot be found by regular spell checkers!
4 Why not?
- What about proposing corrections for the errors?
- An interesting topic, but not the topic of this article
- Detection is imperative, correction is an aid
5 Related work?
- Are there no algorithms doing this already?
- A full parser is perfect for the job
- Drawbacks
- high accuracy is required
- not available for many languages
- manual labor is expensive
- not robust
6 Related work?
- Are there no other algorithms?
- Several other algorithms (e.g. Winnow)
- Some do correction
- Drawbacks
- They require a set of easily confused words
- Normally, you don't know your spelling errors beforehand
7 Why?
- What are the benefits of this algorithm?
- Find any error
- Avoid extensive manual work
- Robustness
8 How?
- Prerequisites
- We use PoS tag trigram frequencies from an annotated corpus
- We are given a sentence, and apply a PoS tagger
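As a rough illustration of the first prerequisite, the sketch below counts PoS tag trigrams in a tiny hand-made corpus. The corpus, the tag names and the sentence boundary markers are assumptions for the example, not the data or tag set used in the article.

```python
from collections import Counter

# Hypothetical annotated corpus: one list of (word, tag) pairs per sentence.
TAGGED_CORPUS = [
    [("the", "DT"), ("weather", "NN"), ("is", "VB"), ("nice", "JJ")],
    [("the", "DT"), ("painting", "NN"), ("is", "VB"), ("old", "JJ")],
]

def tag_trigrams(tags):
    """Return all consecutive tag triples, padded with boundary markers."""
    padded = ["<s>"] + list(tags) + ["</s>"]
    return list(zip(padded, padded[1:], padded[2:]))

def count_trigrams(corpus):
    """Count how often each tag trigram occurs in the corpus."""
    counts = Counter()
    for sentence in corpus:
        counts.update(tag_trigrams([tag for _, tag in sentence]))
    return counts

trigram_counts = count_trigrams(TAGGED_CORPUS)
print(trigram_counts[("DT", "NN", "VB")])  # seen once in each sentence
```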
9 How?
- Basic assumption
- If any tag trigram frequency is low, that part
is probably ungrammatical
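The basic assumption can be sketched as a simple threshold test over the tag trigrams of a tagged sentence. The counts, the tags assigned to "Nice whether today" and the threshold value are all invented for illustration.

```python
# Hypothetical trigram counts standing in for counts from an annotated corpus.
TRIGRAM_COUNTS = {
    ("JJ", "NN", "NN"): 120,  # e.g. adjective, noun, noun
    ("JJ", "CS", "NN"): 1,    # e.g. adjective, subordinating conjunction, noun
}

def suspicious_positions(tags, counts, threshold):
    """Return the start index of every tag trigram whose count is below threshold."""
    trigrams = zip(tags, tags[1:], tags[2:])
    return [i for i, tri in enumerate(trigrams) if counts.get(tri, 0) < threshold]

# Assumed tagging of "Nice whether today": JJ CS NN
print(suspicious_positions(["JJ", "CS", "NN"], TRIGRAM_COUNTS, threshold=5))
# Assumed tagging of "Nice weather today": JJ NN NN
print(suspicious_positions(["JJ", "NN", "NN"], TRIGRAM_COUNTS, threshold=5))
```

Unseen trigrams get a count of zero here, so they are always flagged. That is exactly the weakness the next slides address.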
10 But?
- But don't you often encounter rare or unseen trigrams?
- Yes, unfortunately
- We modify the notion of frequency
- Find and use other, syntactically close PoS trigrams
11 Close?
- What is the syntactic distance between two PoS tags?
- A probability that one tag is replaceable by another
- Retain grammaticality
- Distances extracted from corpus
- Unsupervised learning algorithm
12 Then?
- The algorithm
- We have a generalized PoS tag trigram frequency
- If the frequency is below a threshold, the text is probably ungrammatical
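The slides do not spell out how the generalized frequency is computed. One plausible reading, sketched below, sums the counts of all trigrams reachable by replacing tags, each weighted by the product of its replacement probabilities. The weighting scheme, the table REPLACE and all numbers are assumptions made for illustration only.

```python
from itertools import product

# Hypothetical replacement probabilities: REPLACE[a][b] is the assumed
# probability that tag a can be replaced by tag b while retaining grammaticality.
REPLACE = {"NNP": {"NNP": 1.0, "NN": 0.8}}

# Hypothetical trigram counts.
COUNTS = {("JJ", "NNP", "VB"): 2, ("JJ", "NN", "VB"): 100}

def generalized_frequency(trigram, counts, replace):
    """Sum counts of syntactically close trigrams, weighted by replacement probability."""
    # Tags without an entry in the table can only be "replaced" by themselves.
    options = [replace.get(tag, {tag: 1.0}).items() for tag in trigram]
    total = 0.0
    for choice in product(*options):
        candidate = tuple(tag for tag, _ in choice)
        weight = 1.0
        for _, prob in choice:
            weight *= prob
        total += weight * counts.get(candidate, 0)
    return total

def is_suspicious(trigram, counts, replace, threshold):
    """Flag the trigram if its generalized frequency falls below the threshold."""
    return generalized_frequency(trigram, counts, replace) < threshold

print(is_suspicious(("JJ", "NNP", "VB"), COUNTS, REPLACE, threshold=5))
```

With these toy numbers the rare trigram (JJ, NNP, VB) borrows support from the frequent (JJ, NN, VB) and is no longer flagged, while a trigram with no close, frequent neighbors still is.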
13 Result?
- Summary so far
- Unsupervised learning
- Automatic algorithm
- Detection of any error
- No manual labor!
- Alas, phrase boundaries cause problems
14 Phrases?
- What about phrases?
- PoS tag trigrams overlapping two phrases are very productive
- Rare phrases, rare trigrams
- Transformations!
15 Transform?
- How do we transform a phrase?
- Shallow parser
- Transform phrases to most common form
- Normally, the head
- Benefits: retain grammaticality, fewer rare trigrams, longer tagger scope
16 Example?
- Example of phrase transformation
- Only [the paintings that are old]NP are for sale
- Only [the paintings]NP are for sale
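The transformation in this example amounts to replacing the span of a phrase with a shorter form. In the sketch below the NP boundaries and the reduced form are hard-coded assumptions; in the approach described here they would come from a shallow parser.

```python
def replace_phrase(tokens, start, end, replacement):
    """Return a copy of tokens with tokens[start:end] replaced by replacement."""
    return tokens[:start] + replacement + tokens[end:]

tokens = "Only the paintings that are old are for sale".split()

# Assumed shallow-parser output: the NP covers tokens 1..5 (end index exclusive),
# and its reduced form is "the paintings".
np_start, np_end = 1, 6
reduced = replace_phrase(tokens, np_start, np_end, ["the", "paintings"])
print(" ".join(reduced))  # Only the paintings are for sale
```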
17 Then what?
- How do we use the transformations?
- Apply tagger to transformed sentence
- Run first part of algorithm again
- If any transformation yields only trigrams with high frequency, the sentence is OK
- Otherwise, probable error
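The decision rule on this slide can be sketched as follows: accept the sentence as soon as any candidate contains only frequent trigrams. Treating the original sentence as one of the candidates is an assumption of this sketch, and the tags, counts and threshold are illustrative.

```python
def all_trigrams_frequent(tags, counts, threshold):
    """True if every tag trigram in the sequence reaches the threshold."""
    trigrams = zip(tags, tags[1:], tags[2:])
    return all(counts.get(tri, 0) >= threshold for tri in trigrams)

def sentence_ok(candidate_taggings, counts, threshold):
    """True if at least one candidate tagging has only frequent trigrams."""
    return any(all_trigrams_frequent(tags, counts, threshold)
               for tags in candidate_taggings)

# Hypothetical counts and taggings for the paintings example.
COUNTS = {
    ("RB", "DT", "NN"): 40, ("DT", "NN", "VB"): 90,
    ("NN", "VB", "PP"): 30, ("VB", "PP", "NN"): 60,
}
original = ["RB", "DT", "NN", "CS", "VB", "JJ", "VB", "PP", "NN"]
transformed = ["RB", "DT", "NN", "VB", "PP", "NN"]

print(sentence_ok([original, transformed], COUNTS, threshold=5))
```

Here the original tagging contains unseen trigrams around the relative clause, but the transformed sentence consists only of frequent trigrams, so the sentence is accepted.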
18 Result?
- Summary
- Trigram part, fully automatic
- Phrase part, could use machine learning of rules for shallow parser
- Finds many difficult error types
- Threshold determines precision/recall trade-off
19 Evaluation?
- Fully automatic evaluation
- Introduce artificial context-sensitive spelling errors (using software Missplel)
- Automated evaluation procedure for 1%, 2%, 5%, 10% and 20% misspelled words (using software AutoEval)
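Because the positions of the artificial errors are known, detections can be scored automatically. The sketch below shows the standard precision and recall computation on made-up word positions; it does not reproduce the interface or output of Missplel or AutoEval.

```python
def precision_recall(detected, injected):
    """Compare detected error positions against the known injected positions."""
    detected, injected = set(detected), set(injected)
    true_positives = len(detected & injected)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(injected) if injected else 0.0
    return precision, recall

# Hypothetical word positions: four injected errors, three alarms, two of them correct.
injected_positions = {3, 17, 42, 58}
detected_positions = {3, 42, 77}
precision, recall = precision_recall(detected_positions, injected_positions)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Raising the detection threshold produces more alarms, which tends to raise recall and lower precision. Lowering it has the opposite effect.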
20 Results? 1% errors
21 Results? 2% errors