Probabilistic Detection of Context-Sensitive Spelling Errors

1
Probabilistic Detection of Context-Sensitive
Spelling Errors

Johnny Bigert
Royal Institute of Technology, Sweden
johnny_at_kth.se

2
What?

Context-Sensitive Spelling Errors
Example: "Nice whether today."
All words are found in the dictionary
If context is considered, the spelling of "whether"
is incorrect

3
Why?

Why do we need detection of context-sensitive
spelling errors?
These errors are quite frequent (reports of
16-40% of all errors)
Larger dictionaries leave more errors
undetected
They cannot be found by regular spell checkers!

4
Why not?

What about proposing corrections for the errors?
An interesting topic, but not the topic of this
article
Detection is imperative, correction is an aid

5
Related work?

Are there no algorithms doing this already?
A full parser is perfect for the job
Drawbacks
high accuracy is required
not available for many languages
manual labor is expensive
not robust

6
Related work?

Are there no other algorithms?
Several other algorithms (e.g. Winnow)
Some do correction
Drawbacks
They require a set of easily confused words
Normally, you don't know your spelling errors
beforehand

7
Why?

What are the benefits of this algorithm?
Find any error
Avoid extensive manual work
Robustness

8
How?

Prerequisites
We use PoS tag trigram frequencies from an
annotated corpus
We are given a sentence, and apply a PoS tagger

9
How?

Basic assumption
If any tag trigram frequency is low, that part
is probably ungrammatical
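
The prerequisites and the basic assumption can be sketched as follows. This is a minimal illustration, not the author's implementation: the tag names, the toy corpus, and the function names are assumptions, and a real system would use a proper tagger and a large annotated corpus.

```python
from collections import Counter

def tag_trigrams(tags):
    """Return the consecutive PoS tag trigrams of a tag sequence."""
    return [tuple(tags[i:i + 3]) for i in range(len(tags) - 2)]

def count_trigrams(tagged_corpus):
    """Count tag trigrams over a corpus given as a list of tag sequences."""
    counts = Counter()
    for tags in tagged_corpus:
        counts.update(tag_trigrams(tags))
    return counts

def suspicious_positions(tags, counts, threshold):
    """Start indices of trigrams whose corpus frequency is below threshold."""
    return [i for i, tri in enumerate(tag_trigrams(tags))
            if counts[tri] < threshold]

# Toy annotated corpus with hypothetical tag names.
corpus = [
    ["DET", "ADJ", "NOUN", "VERB"],
    ["DET", "ADJ", "NOUN", "VERB"],
    ["ADJ", "NOUN", "VERB", "DET", "NOUN"],
]
counts = count_trigrams(corpus)

# "Nice whether today": assume the tagger labels "whether" as a conjunction.
flagged = suspicious_positions(["ADJ", "CONJ", "NOUN"], counts, threshold=1)
```

Here the trigram starting at position 0 never occurs in the toy corpus, so that part of the sentence is flagged.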

10
But?

But don't you often encounter rare or unseen
trigrams?
Yes, unfortunately
We modify the notion of frequency
Find and use other, syntactically close PoS
trigrams

11
Close?

What is the syntactic distance between two PoS
tags?
A probability that one tag is replaceable by
another
Retain grammaticality
Distances extracted from corpus
Unsupervised learning algorithm

12
Then?

The algorithm
We have a generalized PoS tag trigram frequency
If frequency below threshold, text is probably
ungrammatical
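
One way to picture a generalized frequency is sketched below. The slides do not give the formula, so this is an assumption: each tag may be swapped for a syntactically close tag, and the counts of the resulting trigrams are summed, weighted by the product of the replacement probabilities. The `replace_prob` table, the tag names, and the threshold value are all hypothetical.

```python
from itertools import product

def generalized_frequency(trigram, counts, replace_prob):
    """Sum the counts of syntactically close trigrams, each weighted by the
    product of the assumed per-tag replacement probabilities."""
    # Each tag may stay as it is (weight 1.0) or be swapped for a close tag.
    options = [
        [(tag, 1.0)] + list(replace_prob.get(tag, {}).items())
        for tag in trigram
    ]
    total = 0.0
    for combo in product(*options):
        tags = tuple(t for t, _ in combo)
        weight = 1.0
        for _, p in combo:
            weight *= p
        total += weight * counts.get(tags, 0)
    return total

# Hypothetical corpus counts and replacement probabilities.
counts = {("DET", "NOUN", "VERB"): 40, ("DET", "ADJ", "NOUN"): 25}
replace_prob = {"PROPN": {"NOUN": 0.8}}

unseen = ("DET", "PROPN", "VERB")  # never observed in the toy counts
score = generalized_frequency(unseen, counts, replace_prob)
is_probably_ungrammatical = score < 5.0  # assumed threshold
```

The unseen trigram borrows frequency from the close trigram `("DET", "NOUN", "VERB")`, so it is not flagged even though its raw count is zero.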

13
Result?

Summary so far
Unsupervised learning
Automatic algorithm
Detection of any error
No manual labor!
Alas, phrase boundaries cause problems

14
Phrases?

What about phrases?
PoS tag trigrams overlapping two phrases are very
productive
Rare phrases, rare trigrams
Transformations!

15
Transform?

How do we transform a phrase?
Shallow parser
Transform phrases to most common form
Normally, the head
Benefits: retains grammaticality, fewer rare
trigrams, longer tagger scope

16
Example?

Example of phrase transformation
Only the paintings that are old are for sale
Only the paintings are for sale

[Slide figure: NP labels mark the noun phrase in each sentence]

17
Then what?

How do we use the transformations?
Apply tagger to transformed sentence
Run first part of algorithm again
If any transformation yields only trigrams with
high frequency, the sentence is OK
Otherwise, probable error
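
The decision rule above can be sketched as follows. This is an illustration under assumptions: the tag sequences stand in for the output of a tagger applied to the original and the transformed sentence, and the counts, tag names, and threshold are hypothetical. The shallow parser that produces the transformations is not shown.

```python
def all_trigrams_frequent(tags, frequency, threshold):
    """True if every tag trigram in the sequence has frequency >= threshold."""
    return all(
        frequency(tuple(tags[i:i + 3])) >= threshold
        for i in range(len(tags) - 2)
    )

def sentence_ok(variants, frequency, threshold):
    """Accept the sentence if any variant (original or transformed) passes."""
    return any(all_trigrams_frequent(tags, frequency, threshold)
               for tags in variants)

# Hypothetical trigram counts, for illustration only.
counts = {
    ("ADV", "DET", "NOUN"): 30,
    ("DET", "NOUN", "VERB"): 40,
    ("NOUN", "VERB", "PREP"): 20,
    ("VERB", "PREP", "NOUN"): 35,
}
frequency = lambda tri: counts.get(tri, 0)

# "Only the paintings that are old are for sale" and its transformed form
# "Only the paintings are for sale", as tag sequences from an assumed tagger.
original = ["ADV", "DET", "NOUN", "PRON", "VERB", "ADJ", "VERB", "PREP", "NOUN"]
transformed = ["ADV", "DET", "NOUN", "VERB", "PREP", "NOUN"]

ok = sentence_ok([original, transformed], frequency, threshold=5)
```

The original sequence contains rare trigrams around the relative clause, but the transformed sequence passes, so the sentence is accepted.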

18
Result?

Summary
Trigram part, fully automatic
Phrase part, could use machine learning of rules
for shallow parser
Finds many difficult error types
Threshold determines precision/recall trade-off

19
Evaluation?

Fully automatic evaluation
Introduce artificial context-sensitive spelling
errors (using software Missplel)
Automated evaluation procedure for 1%, 2%, 5%, 10%
and 20% misspelled words (using software AutoEval)
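
The idea of an automatic evaluation can be sketched as below. This does not use or imitate the Missplel or AutoEval interfaces; the function names are hypothetical, and the error injection is deliberately simplified (any other dictionary word is substituted, whereas realistic errors would be close misspellings that happen to be valid words).

```python
import random

def inject_errors(words, dictionary, rate, seed=0):
    """Replace roughly `rate` of the words with a different dictionary word.
    Returns the corrupted words and the set of corrupted positions."""
    rng = random.Random(seed)
    n_errors = max(1, round(rate * len(words)))
    positions = set(rng.sample(range(len(words)), n_errors))
    corrupted = list(words)
    for i in sorted(positions):
        choices = [w for w in dictionary if w != words[i]]
        corrupted[i] = rng.choice(choices)
    return corrupted, positions

def precision_recall(flagged, injected):
    """Score detected positions against the positions of injected errors."""
    hits = len(flagged & injected)
    precision = hits / len(flagged) if flagged else 0.0
    recall = hits / len(injected) if injected else 0.0
    return precision, recall

words = "nice weather today and nice weather tomorrow".split()
dictionary = ["weather", "whether", "today", "nice"]
corrupted, injected = inject_errors(words, dictionary, rate=0.2)

# Suppose a detector flagged positions 1 and 4 while errors sit at 1 and 2.
precision, recall = precision_recall({1, 4}, {1, 2})
```

Because the injected positions are known, precision and recall can be computed without manual annotation, and repeating the run at different thresholds traces the precision/recall trade-off.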

20
Results? 1% errors

21
Results? 2% errors
