SpellChecking PubMed Queries

1
SpellChecking PubMed Queries

By John Wilbur, Won Kim, Natalie Xie

2
Noisy Channel Model

Historically dates to Claude Shannon (1948)
Bayesian model: IW = intended word, OW = observed word

3
Maximal Likelihood

We can ignore the denominator P(OW), which does not depend on the intended word.
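
The formulas on these two slides did not survive in the transcript. As a reminder only, the textbook noisy-channel formulation in terms of IW and OW is sketched below; this is the standard form of Bayes' rule, not necessarily the exact notation used on the slides.

```latex
P(IW \mid OW) = \frac{P(OW \mid IW)\, P(IW)}{P(OW)}
\qquad\Longrightarrow\qquad
\widehat{IW} = \arg\max_{IW} \; P(OW \mid IW)\, P(IW)
```

Because P(OW) is the same for every candidate IW, dropping it does not change which candidate attains the maximum.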

4
Example

P(OW | IW) is just the
probability that someone would accidentally put
an i in place of a y.
P(IW) is proportional to the
frequency of the word enzyme in the PubMed
data.
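
A minimal sketch of how the two factors in this example could be combined to rank candidates. The function name `rank_corrections` and all probabilities and frequencies below are made up for illustration; they are not values from the talk.

```python
def rank_corrections(channel_prob, word_freq):
    """Rank intended-word candidates by P(OW | IW) * frequency(IW).

    channel_prob: {candidate: probability of the typing error that turns
                   the candidate into the observed string}
    word_freq:    {candidate: frequency in the reference data}
    """
    scores = {w: channel_prob[w] * word_freq.get(w, 0) for w in channel_prob}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical numbers for an observed query such as "enzime".
channel_prob = {"enzyme": 4.0e-5, "engine": 1.6e-9}
word_freq = {"enzyme": 300_000, "engine": 20_000}
ranking = rank_corrections(channel_prob, word_freq)
print(ranking[0])
```

Word frequency stands in for the prior here, which is the substitution the slide describes.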

6
Spelling Error Statistics

Harvested from PubMed Log Files
Criteria
Look for word pairs entered within 5 min of each
other
Same IP address
Similar, but not same, spellings
Second member of pair much more frequent in
PubMed.
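
The four criteria above can be sketched as a filter over a query log. This is an illustrative reading, not the authors' code: the log layout, the function names, the edit-distance cutoff `max_dist`, and the frequency ratio `min_ratio` are all assumptions; only the 5-minute window and the same-IP rule come from the slide.

```python
from datetime import datetime, timedelta


def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, replace)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def harvest_pairs(log, freq, window=timedelta(minutes=5), max_dist=2, min_ratio=10):
    """log: list of (ip, time, word) tuples sorted by time.

    Returns (first_word, second_word) pairs that satisfy the criteria.
    max_dist and min_ratio are hypothetical thresholds.
    """
    pairs = []
    for i, (ip1, t1, w1) in enumerate(log):
        for ip2, t2, w2 in log[i + 1:]:
            if t2 - t1 > window:
                break  # log is sorted, so later entries are also too late
            if ip1 != ip2 or w1 == w2:
                continue  # different user, or identical spelling
            if edit_distance(w1, w2) > max_dist:
                continue  # spellings not similar enough
            if freq.get(w2, 0) >= min_ratio * max(freq.get(w1, 0), 1):
                pairs.append((w1, w2))
    return pairs
```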

7
(No Transcript)
8
Many Terms of Low Frequency in PubMed are
Misspellings

Previous slide: one string was not in PubMed.
Next slide: it is in PubMed at low frequencies (0-100).

9
(No Transcript)
10
Errors collected for 63 days of log files
11
Average Rates of Different Types of Errors

Average probability of deletion: 0.00146
Average probability of insertion: 2.925e-05
Average probability of replacement: 4.006e-05
Average probability of transposition: 0.000334

12
Corrections Context Dependent

We collected the context of one letter on either
side of a correction.
Word beginning and ending were marked with
special characters so that the statistics for
them could be specific.
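
A small sketch of what collecting one letter of context on either side might look like for a replacement error. The marker characters `^` and `$` and the function name are assumptions; the slide only says that special characters mark the word beginning and ending.

```python
from collections import Counter


def replacement_context(intended, position, typed_char, start="^", end="$"):
    """Return (left, intended_char, typed_char, right) for a replacement.

    The word is padded with boundary markers so that errors on the first
    or last letter get their own, boundary-specific statistics.
    """
    padded = start + intended + end
    i = position + 1  # shift by one because of the start marker
    return (padded[i - 1], padded[i], typed_char, padded[i + 1])


counts = Counter()
counts[replacement_context("enzyme", 3, "i")] += 1  # y typed as i, mid-word
counts[replacement_context("enzyme", 0, "a")] += 1  # error on the first letter
```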

13
A Model for Calculation

Simplest model: generate all possible corrections
and see if they are in PubMed and at what
frequency.
A better model is to search a tree-like structure
(a trie).
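
One common way to search a trie for near matches is to walk it while carrying a row of the edit-distance table and to prune branches that already exceed the edit budget. The sketch below uses that technique with plain Levenshtein edits (a transposition counts as two edits here); it is an illustration, not the authors' implementation, and the frequencies are made up.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.freq = 0  # nonzero only where a stored word ends


def insert(root, word, freq):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.freq = freq


def search(root, token, max_edits=1):
    """Return (word, edits, freq) for stored words within max_edits of token."""
    results = []

    def walk(node, prefix, prev_row):
        if node.freq and prev_row[-1] <= max_edits:
            results.append((prefix, prev_row[-1], node.freq))
        for ch, child in node.children.items():
            row = [prev_row[0] + 1]
            for j in range(1, len(token) + 1):
                cost = 0 if token[j - 1] == ch else 1
                row.append(min(row[j - 1] + 1, prev_row[j] + 1, prev_row[j - 1] + cost))
            if min(row) <= max_edits:  # prune hopeless branches
                walk(child, prefix + ch, row)

    walk(root, "", list(range(len(token) + 1)))
    # Fewest edits first, then highest frequency.
    return sorted(results, key=lambda r: (r[1], -r[2]))


root = TrieNode()
for word, freq in [("acetylcholine", 50000), ("acetylcholines", 40), ("choline", 9000)]:
    insert(root, word, freq)  # hypothetical frequencies
matches = search(root, "acitylcholine", max_edits=1)
```

Compared with generating every possible correction and looking each one up, the trie walk only explores prefixes that actually occur in the data.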

14
Example
[Trie diagram: the path a c e t y l c h o l i n e, with side branches ending in "l l c h o l i n e", "p h o l i n e", and "v l c h o l i n e"]
Search term: acitylcholine
15
Acetylcholine

Has 28 neighbors at edit distance 1.
Has 29 neighbors at edit distance 2.
Has about
13 one-edit deletions
364 one-edit insertions
325 one-edit replacements
12 one-edit transpositions
Two-edit variations are much greater in number.
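
The counts above follow from the word length (13 letters) and a 26-letter alphabet: 13 deletions, 14 x 26 insertions, 13 x 25 replacements, and 12 adjacent transpositions. A short sketch that reproduces them, assuming a lowercase a-z alphabet (function names are illustrative):

```python
import string


def one_edit_counts(word, alphabet=string.ascii_lowercase):
    """Number of single-edit operations, counted by position."""
    n, k = len(word), len(alphabet)
    return {
        "deletions": n,
        "insertions": (n + 1) * k,
        "replacements": n * (k - 1),
        "transpositions": n - 1,
    }


def one_edit_variants(word, alphabet=string.ascii_lowercase):
    """Distinct strings one edit away (duplicates collapse, hence 'about')."""
    variants = set()
    for i in range(len(word) + 1):
        left, right = word[:i], word[i:]
        if right:
            variants.add(left + right[1:])  # deletion
            variants.update(left + c + right[1:] for c in alphabet if c != right[0])
        if len(right) > 1:
            variants.add(left + right[1] + right[0] + right[2:])  # transposition
        variants.update(left + c + right for c in alphabet)  # insertion
    variants.discard(word)
    return variants


counts = one_edit_counts("acetylcholine")
variants = one_edit_variants("acetylcholine")
```

Different operations can yield the same string (for example, inserting a letter next to an identical letter), so the number of distinct variants is somewhat below the sum of the four counts.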

16
Basic Assumption I

Misspelling rate in PubMed queries is in PubMed
documents.
Therefore if we see acitylcholine as query we
may find

17
Basic Assumption II

Using frequency in the PubMed data to stand in
the place of probability of intent is only valid
at higher frequencies.
Must downgrade the frequency at low frequencies.
Data of Zobel and Dart (1994).

18
Estimating Regions of Success

Difficulty of correction depends on how densely
the space is packed with words.
Packing is more dense for shorter words.

22
Treating a Single Token t: Stage 1

I. If len
II. If in database f1000 do not correct.
III. Search for a most likely correction c
If cannot find a 1 edit correction label case 0.
If can find and pc0.7 or ptcorrection and label case 1.
If not case 0 or case 1 label as case 2.

23
Treating a Single Token t: Stage 2

Case 0: If len9
Try to split into 2 words, both f500. If
succeed, accept as correction, label case 3.
Else try to perform a 2 edit correction. If
succeed, accept as correction, label case 4.
Case 1: If len5 run through stage 1 again.
Remains case 1.
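
A rough sketch of the splitting step in Case 0. The comparison operator next to the 500 threshold is not legible in the transcript, so the strict "greater than" test below is an assumption, as are the function name, the tie-breaking rule, and the example frequencies.

```python
def try_split(token, freq, min_freq=500):
    """Try to split token into two database words that are both frequent.

    Returns (left, right) or None. Assumes "both f500" means both
    frequencies exceed 500; among valid splits, the one whose rarer
    half is most frequent is preferred (an illustrative choice).
    """
    best = None
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        f_left, f_right = freq.get(left, 0), freq.get(right, 0)
        if f_left > min_freq and f_right > min_freq:
            score = min(f_left, f_right)
            if best is None or score > best[0]:
                best = (score, left, right)
    return None if best is None else (best[1], best[2])


freq = {"sauna": 900, "bath": 4000, "sa": 10}  # hypothetical frequencies
```

For example, `try_split("saunabath", freq)` returns the pair `("sauna", "bath")`, while a token with no valid split point returns `None`.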

24
Treating a Single Token t: Stage 3

If len9 and case 1 or case 2 and if frequency
of current best
Do a two edit search for a correction.
If result has f80 and f10prior frequency and
if string not changed too much at beg, accept
result as the correction and label as case 4.

25
Treating a Single Token t: Stage 4

Cases 1, 3, or 4 are accepted as a correction.
Otherwise, if len12 do a deep search and accept
if successful.
If not, try to split with both words in database.
Accept if successful.
Else failure, no correction!

26
Deep Search

Algorithm is allowed to do two edits and performs
a partial match of the token into the database
trie.
Looks for the match using the most letters from
the token.
Among all matches using that maximum number it
chooses the most probable.
This is repeated until a total match is produced
or failure. It must match at least 4 letters at
each step.
Ends with a rationality check.

27
Two Token Phrase

A different problem because of potential context.
Constraints are necessary.

28
Fixed Points

Tokens 1 or 2 characters long are very unlikely
to be misspelled.
We do not try to correct tokens of 1 or 2
characters in a two token phrase.

29
Restricted Points

If a token has between 3 and 6 characters, only
allow a single edit in that token.
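
The fixed-point and restricted-point rules amount to an edit budget per token based on its length. A minimal sketch follows; the budget of 2 for longer tokens is an assumption based on the deck's mention of 1 or 2 edit corrections, and the function name is illustrative.

```python
def max_edits_for_token(token):
    """Edit budget for one token inside a two-token phrase."""
    n = len(token)
    if n <= 2:
        return 0  # fixed point: never corrected
    if n <= 6:
        return 1  # restricted point: single edit only
    return 2      # assumed budget for longer tokens


budgets = [max_edits_for_token(t) for t in ("of", "tumor", "necrosis")]
```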

30
Stopping Rules

Let fp be the frequency of the original phrase
and fm the minimum of the frequencies of the two
tokens making the phrase.
If fp 5 and fm 500 do not correct.
If fp 0 and fm 50 and the length of one string
is not more than 4 do not correct.

31
Attempt Correction

Let fc be frequency of the best 1 or 2 edit
correction for the phrase.
If fc0 and fm100 do correction on individual
tokens.
If 0
Try to split the phrase
Else try deep search
Else correct individual tokens.

32
More than Two Tokens

A trie of truncated phrases: all the initial
two-word phrases coming from longer phrases in the
database.
Take the initial two words of current phrase and
look for it in the truncated data. Do corrections
as appropriate.
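
A sketch of how the truncated data could be built and consulted. A plain Python set stands in for the trie to keep the example short, and the function names and phrases are illustrative.

```python
def truncated_phrases(phrases):
    """Initial two-word prefixes of every phrase longer than two words."""
    prefixes = set()
    for phrase in phrases:
        words = phrase.split()
        if len(words) > 2:
            prefixes.add(" ".join(words[:2]))
    return prefixes


def initial_two_words(query):
    """The part of a longer query that is looked up first."""
    return " ".join(query.split()[:2])


database = [
    "tumor necrosis factor",
    "high pressure liquid chromatography",
    "myocardial infarction",
]
prefixes = truncated_phrases(database)
found = initial_two_words("tumor necrosis factor alpha") in prefixes
```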

33
Extension

If truncated string found in trie of truncated
database strings, trace the initial two token
string in the trie of long phrases and extend the
match with corrections as needed.
Follows deep search strategy (but safer).
If unable to produce a perfect match, backtrack,
looking for the longest partial match.

34
Extension Failure

Treat the two token initial truncation in the
standard manner.
If two token phrase correction fails, treat only
the first token.
If a one or two token initial phrase has been
corrected, start over with what remains.

35
Cleaning Up the PubMed Data

Looked at single words that occurred in at
least 20 PubMed documents and 2-word phrases that
occurred in at least 9 PubMed documents.
Compared these with single edit alternatives that
were at least 10 times as common in the data.
Performed statistical testing.

36
Hypergeometric Test
[Diagram labels: Database, T1, T2, I]
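
The slide's diagram is not recoverable, so the following is only a generic hypergeometric tail computation under one plausible reading of the labels: the database holds N documents, terms T1 and T2 occur in n1 and n2 of them, and I is the number of documents containing both. How the authors actually applied the test is not stated in the transcript.

```python
from math import comb


def hypergeom_tail(N, n1, n2, k):
    """P(overlap >= k) when n2 documents are drawn at random from N,
    of which n1 contain the first term."""
    upper = min(n1, n2)
    favorable = sum(comb(n1, i) * comb(N - n1, n2 - i) for i in range(k, upper + 1))
    return favorable / comb(N, n2)


# Tiny hypothetical example: 10 documents, each term in 3 of them.
p_all_three = hypergeom_tail(10, 3, 3, 3)
```

A small tail probability says the observed overlap is larger than chance co-occurrence would explain.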
37
Wilcoxon-Mann-Whitney Test
myocardial
infraction
infarction
40
What about Phonetics?

Methods
Soundex (Odell and Russell, 1918)
Phonix (Gadd, 1990)
Metaphone (Philips, 1990)
Trials by Zobel and Dart (1994)
Our trials: phonetic encoding is a double-edged
sword. Performance degraded.

41
The Problem with Phonetics

Things are grouped that should not be grouped and
things are separated that should not be
separated.
Zobel and Dart (1994): mad and not map to the
same encoding.
phalanges - flnj(s) and hpalanges - hpln(js).
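
To make the grouping problem concrete, here is a compact sketch of Soundex, the simplest of the methods listed in this deck. It follows the common American Soundex rules and is offered as an illustration; it is not the encoder the authors tested. Note that the slide's examples use Phonix and Metaphone codes, which differ from Soundex.

```python
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Four-character Soundex code: first letter plus three digits."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    encoded = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        if ch not in "hw":  # h and w do not separate equal codes
            prev = code
    return (encoded + "000")[:4]


same = soundex("mad") == soundex("mat")                   # grouped together
split = soundex("philariosis") != soundex("filariosis")   # kept apart
```

Because Soundex keeps the first letter verbatim, philariosis and filariosis get different codes even though they sound alike, while unrelated short words such as mad and mat collapse to one code.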

42
Error Distribution

80% of spelling errors are a single edit
(Damerau, 1964; Morgan, 1970).
Our data looked at 1-3 edit errors and found 76%
1 edit, 22% 2 edit, 2% 3 edit.

43
Error Distribution

We looked at the log data to find corrections
that could be detected by Metaphone, but not
within 2 edits.
For 63 days of logs less than 5,000 legitimate
corrections found.
That is less than 75 examples per day.

44
Examples of Success

miocardi alinfraction - myocardial infarction
terminl illnss - terminal illness
hig pressue liqud chromatogph - high pressure liquid chromatography
tumor necrosisactor - tumor necrosis factor
hmgolbin - hemoglobin
philariosis - filariosis

45
Examples of Failure

Sapna Baht - sauna bath
periostin - periods in
Daniel K E - danieluk m
bisexual molest - bisexual modest
pancreas transplation - pancreas AND translation
stem cell ros - stem cell loss
cupper hair - upper air

46
Post Checking Process

Testing indicates that on PubMed queries about
87% of suggestions made are reasonable.
Suggestions are made for about 10% of queries.
Before using a suggestion, postings are checked
and it is rejected if there are no postings.
Result is that suggestions reach the user for
about 7% of queries.

47
Usage Statistics