Title: SpellChecking PubMed Queries
1SpellChecking PubMed Queries
- By John Wilbur, Won Kim, Natalie Xie
2Noisy Channel Model
- Historically dates to Claude Shannon (1948)
- Bayesian model: IW = intended word, OW = observed word
3Maximal Likelihood
4Example
- P(OW | IW) is just the probability that someone would accidentally put an i in place of a y.
- P(IW) is proportional to the frequency of the word enzyme in the PubMed data.
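As a minimal sketch of the two factors above, a candidate's score is the error probability times the word's relative frequency. The frequency table and the substitution probability below are made-up illustrative values, not statistics from PubMed.

```python
# Toy noisy-channel scoring. All numbers are illustrative assumptions,
# not values taken from the PubMed data.
word_freq = {"enzyme": 250_000, "enzymes": 180_000, "engine": 900}
total = sum(word_freq.values())

def prior(word):
    """P(IW): relative frequency of the intended word in the toy corpus."""
    return word_freq.get(word, 0) / total

def score(p_error, intended):
    """Unnormalized P(IW | OW) = P(OW | IW) * P(IW)."""
    return p_error * prior(intended)

# Hypothetical probability of typing 'i' where 'y' was intended.
P_I_FOR_Y = 4e-5
enzyme_score = score(P_I_FOR_Y, "enzyme")
```

Candidates are then ranked by this score; the normalizing constant P(OW) is the same for every candidate and can be ignored.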
6Spelling Error Statistics
- Harvested from PubMed Log Files
- Criteria:
- Look for word pairs entered within 5 min of each other
- Same IP address
- Similar, but not same, spellings
- Second member of pair much more frequent in PubMed
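The harvesting criteria could be sketched roughly as follows. The log record layout, the edit-distance cutoff (`max_dist`) and the frequency ratio (`ratio`) are assumptions made for illustration; the slides do not give the actual thresholds.

```python
from datetime import datetime, timedelta

def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, replace)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def harvest_pairs(log, freq, window=timedelta(minutes=5), max_dist=2, ratio=10):
    """log: list of (timestamp, ip, query) sorted by time.
    max_dist and ratio are assumed thresholds, not the authors' values."""
    pairs = []
    for i, (t1, ip1, q1) in enumerate(log):
        for t2, ip2, q2 in log[i + 1:]:
            if t2 - t1 > window:          # outside the 5 minute window
                break
            if ip1 != ip2 or q1 == q2:    # same IP, different spelling
                continue
            if edit_distance(q1, q2) > max_dist:
                continue
            # second member must be much more frequent than the first
            if freq.get(q2, 0) >= ratio * max(freq.get(q1, 0), 1):
                pairs.append((q1, q2))
    return pairs

# Hypothetical log entries and frequencies.
log = [
    (datetime(2005, 1, 1, 12, 0), "1.1.1.1", "acitylcholine"),
    (datetime(2005, 1, 1, 12, 2), "1.1.1.1", "acetylcholine"),
    (datetime(2005, 1, 1, 12, 3), "2.2.2.2", "enzime"),
    (datetime(2005, 1, 1, 12, 30), "2.2.2.2", "enzyme"),
]
freq = {"acetylcholine": 50000, "acitylcholine": 3, "enzyme": 200000}
pairs = harvest_pairs(log, freq)
```

Here only the first pair qualifies; the second pair is 27 minutes apart and falls outside the window.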
7(No Transcript)
8Many Terms of Low Frequency in PubMed are Misspellings
- Previous slide: one string was not in PubMed.
- Next slide: it is in PubMed at low frequencies (0-100).
9(No Transcript)
10Errors collected for 63 days of log files
11Average Rates of Different Types of Errors
- Average probability of deletion: 0.00146
- Average probability of insertion: 2.925e-05
- Average probability of replacement: 4.006e-05
- Average probability of transposition: 0.000334
12Corrections Context Dependent
- We collected the context of one letter on either side of a correction.
- Word beginning and ending were marked with special characters so that the statistics for them could be specific.
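One way to picture the context collection is the sketch below. The marker characters `^` and `$` and the shape of the counter key are assumptions; the slides only say that special characters mark the word boundaries.

```python
from collections import Counter

START, END = "^", "$"   # assumed boundary markers

def letter_context(intended, position):
    """Return (left, letter, right) for the letter at `position`,
    using boundary markers at the word edges."""
    padded = START + intended + END
    i = position + 1                     # shift past the start marker
    return padded[i - 1], padded[i], padded[i + 1]

replacement_counts = Counter()

def record_replacement(intended, position, observed_letter):
    """Count one observed replacement together with its one-letter context."""
    left, letter, right = letter_context(intended, position)
    replacement_counts[(left, letter, observed_letter, right)] += 1

# "enzime" typed for "enzyme": 'y' at index 3 was replaced by 'i'.
record_replacement("enzyme", 3, "i")
```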
13A Model for Calculation
- Simplest model: generate all possible corrections and see if they are in PubMed and at what frequency.
- Better model is to search a tree-like structure (a trie).
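A minimal sketch of a trie search with a bounded number of edits is shown here. It uses the common technique of carrying one row of the edit-distance table down each branch and pruning branches that can no longer match; it handles insertions, deletions and replacements only (no transpositions), and the dictionary and frequencies are made up.

```python
def build_trie(words):
    """words: {word: frequency}. Nested dicts; "$" marks end of word."""
    root = {}
    for word, f in words.items():
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = f
    return root

def search(trie, term, max_edits=1):
    """Return {word: frequency} for trie words within max_edits of term."""
    results = {}

    def walk(node, prefix, prev_row):
        if "$" in node and prev_row[-1] <= max_edits:
            results[prefix] = node["$"]
        for ch, child in node.items():
            if ch == "$":
                continue
            row = [prev_row[0] + 1]
            for j in range(1, len(term) + 1):
                row.append(min(row[j - 1] + 1,          # insertion
                               prev_row[j] + 1,         # deletion
                               prev_row[j - 1] + (term[j - 1] != ch)))
            if min(row) <= max_edits:                   # prune hopeless branches
                walk(child, prefix + ch, row)

    walk(trie, "", list(range(len(term) + 1)))
    return results

# Hypothetical dictionary with made-up frequencies.
trie = build_trie({"acetylcholine": 50000, "choline": 30000, "acetyl": 2000})
found = search(trie, "acitylcholine")
```

Compared with generating every candidate string, the trie walk only visits prefixes that actually occur in the database.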
14Example
l l c h o l i n e
a c e t y l c h o l i n e
p h o l i n e
v l c h o l i n e
Search term: acitylcholine
15Acetylcholine
- Has 28 neighbors at edit distance 1.
- Has 29 neighbors at edit distance 2.
- Has about this many candidate strings one edit away:
- 13 deletions
- 364 insertions
- 325 replacements
- 12 transpositions
- Two-edit variations are much greater in number.
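The one-edit counts above follow from simple counting for a 13-letter word, assuming a 26-letter alphabet and counting raw candidates before removing duplicates. A short check:

```python
import string

def one_edit_counts(word, alphabet=string.ascii_lowercase):
    """Raw number of one-edit candidates (duplicates not removed)."""
    n, k = len(word), len(alphabet)
    return {
        "deletions": n,                 # drop any one letter
        "insertions": (n + 1) * k,      # any letter at any of n+1 gaps
        "replacements": n * (k - 1),    # any other letter at each position
        "transpositions": n - 1,        # swap each adjacent pair
    }

acetylcholine_counts = one_edit_counts("acetylcholine")
```

The same formulas applied twice show why two-edit candidates run into the hundreds of thousands, which motivates the trie search over brute-force generation.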
16Basic Assumption I
- Misspelling rate in PubMed queries is in PubMed documents.
- Therefore if we see acitylcholine as query we may find
17Basic Assumption II
- Using frequency in the PubMed data to stand in the place of probability of intent is only valid at higher frequencies.
- Must downgrade the frequency at low frequencies.
- Data of Zobel and Dart (1994).
18Estimating Regions of Success
- Difficulty of correction depends on how densely the space is packed with words.
- Packing is more dense for shorter words.
22Treating a Single Token t Stage 1
- I. If len
- II. If in database f1000 do not correct.
- III. Search for a most likely correction c
- If cannot find a 1 edit correction label case 0.
- If can find and pc0.7 or ptcorrection and label case 1.
- If not case 0 or case 1 label as case 2.
23Treating a Single Token t Stage 2
- Case 0 If len9
- Try to split into 2 words, both f500. If succeed, accept as correction, label case 3.
- Else try to perform a 2 edit correction. If succeed, accept as correction, label case 4.
- Case 1 If len5 run through stage 1 again. Remains case 1.
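The split attempt in this stage might look like the following sketch. Reading "both f500" as "both halves have frequency above 500" is an assumption, as is ranking candidate splits by their less frequent half; the frequency table is made up.

```python
def best_split(token, freq, min_freq=500):
    """Try every split point; keep splits where both halves are frequent.
    min_freq and the ranking rule are assumptions, not the authors' values."""
    best = None
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        fl, fr = freq.get(left, 0), freq.get(right, 0)
        if fl > min_freq and fr > min_freq:
            candidate = (min(fl, fr), left, right)
            if best is None or candidate > best:
                best = candidate
    return (best[1], best[2]) if best else None

# Hypothetical frequencies.
freq = {"heart": 90000, "attack": 40000, "he": 600, "art": 5000}
split = best_split("heartattack", freq)
```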
24Treating a Single Token t Stage 3
- If len9 and case 1 or case 2 and if frequency of current best
- Do a two edit search for a correction.
- If result has f80 and f10prior frequency and if string not changed too much at beginning, accept result as the correction and label as case 4.
25Treating a Single Token t Stage 4
- Cases 1, 3, or 4 are accepted as a correction.
- Otherwise, if len12 do a deep search and accept if successful.
- If not, try to split with both words in database. Accept if successful.
- Else failure, no correction!
26Deep Search
- Algorithm is allowed to do two edits and performs a partial match of the token into the database trie.
- Looks for the match using the most letters from the token.
- Among all matches using that maximum number it chooses the most probable.
- This is repeated until a total match is produced or failure. It must match at least 4 letters at each step.
- Ends with a rationality check.
27Two Token Phrase
- A different problem because of potential context.
- Constraints are necessary.
28Fixed Points
- Tokens 1 or 2 characters long are very unlikely to be misspelled.
- We do not try to correct tokens of 1 or 2 characters in a two token phrase.
29Restricted Points
- If a token has between 3 and 6 characters, only
allow a single edit in that token.
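The fixed-point and restricted-point rules amount to an edit budget per token. A minimal sketch follows; the budget of 2 for longer tokens is an assumption based on the deck's mention of 1 or 2 edit corrections.

```python
def max_edits_allowed(token, default=2):
    """Edit budget for one token of a two-token phrase.
    `default` (budget for tokens longer than 6) is an assumed value."""
    n = len(token)
    if n <= 2:
        return 0      # fixed point: never corrected
    if n <= 6:
        return 1      # restricted point: a single edit only
    return default

budgets = [max_edits_allowed(t) for t in ("of", "heart", "infarction")]
```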
30Stopping Rules
- Let fp be the frequency of the original phrase and fm the minimum of the frequencies of the two tokens making the phrase.
- If fp 5 and fm 500 do not correct.
- If fp 0 and fm 50 and the length of one string is not more than 4 do not correct.
31Attempt Correction
- Let fc be frequency of the best 1 or 2 edit correction for the phrase.
- If fc0 and fm100 do correction on individual tokens.
- If 0
- Try to split the phrase
- Else try deep search
- Else correct individual tokens.
32More than Two Tokens
- A trie of truncated phrases: all the initial two-word phrases coming from longer phrases in the database.
- Take the initial two words of current phrase and look for it in the truncated data. Do corrections as appropriate.
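Building the truncated data can be sketched as below. A plain set stands in for the trie here to keep the example short, and the phrase list is hypothetical.

```python
def truncated_phrases(phrases):
    """Collect the initial two-word prefix of every phrase longer than two words."""
    prefixes = set()
    for phrase in phrases:
        tokens = phrase.split()
        if len(tokens) > 2:
            prefixes.add(" ".join(tokens[:2]))
    return prefixes

def initial_two_words(query):
    """The part of a longer query that is looked up in the truncated data."""
    return " ".join(query.split()[:2])

# Hypothetical database phrases.
database = [
    "tumor necrosis factor",
    "high pressure liquid chromatography",
    "myocardial infarction",
]
truncated = truncated_phrases(database)
hit = initial_two_words("tumor necrosis factor alpha") in truncated
```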
33Extension
- If truncated string found in trie of truncated database strings, trace the initial two token string in the trie of long phrases and extend the match with corrections as needed.
- Follows deep search strategy (but safer).
- If unable to produce perfect match, backtrack looking for longest partial match.
34Extension Failure
- Treat the two token initial truncation in the standard manner.
- If two token phrase correction fails, treat only the first token.
- If a one or two token initial phrase has been corrected, start over with what remains.
35Cleaning Up the PubMed Data
- Looked at single words that occurred in at least 20 PubMed documents and 2 word phrases that occurred in at least 9 PubMed documents.
- Compared these with single edit alternatives that were at least 10 times as common in the data.
- Performed statistical testing.
36Hypergeometric Test
Database
T2
I
T1
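For orientation, a hypergeometric tail probability can be computed as in the sketch below. It assumes that T1 and T2 are two sets of items drawn from the database and that I is their overlap; the slide labels alone do not spell this out, and the function name and arguments are hypothetical.

```python
from math import comb

def hypergeom_tail(N, n1, n2, k):
    """P(overlap >= k) when sets of size n1 and n2 are drawn at random
    from a database of N items (assumed reading of the T1/T2/I diagram)."""
    upper = min(n1, n2)
    favorable = sum(comb(n1, i) * comb(N - n1, n2 - i) for i in range(k, upper + 1))
    return favorable / comb(N, n2)

# Tiny worked case: N=10, both sets of size 5, complete overlap.
p_full_overlap = hypergeom_tail(10, 5, 5, 5)
```

A very small tail probability means the observed overlap is unlikely to arise by chance.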
37Wilcoxon-Mann-Whitney Test
myocardial
infraction
infarction
40What about Phonetics?
- Methods
- Soundex (Odell and Russell, 1918)
- Phonix (Gadd, 1990)
- Metaphone (Philips, 1990)
- Trials by Zobel and Dart (1994)
- Our trials: phonetic encoding is a double-edged sword. Performance degraded.
41The Problem with Phonetics
- Things are grouped that should not be grouped and things are separated that should not be separated.
- Zobel and Dart (1994): mad and not map to the same encoding.
- phalanges -> flnj(s) and hpalanges -> hpln(js).
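Both failure modes can be reproduced with a small phonetic encoder. The sketch below implements standard American Soundex as a simpler stand-in; the codes on the slide look like a Metaphone-style encoding, which is not reproduced here. It assumes a non-empty alphabetic input.

```python
# Letter groups for American Soundex digits 1..6.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], 1):
    for ch in letters:
        CODES[ch] = str(digit)

def soundex(word):
    """American Soundex: first letter plus three digits."""
    word = word.lower()
    out = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "hw":       # h and w do not separate equal codes
            prev = code
    return (out + "000")[:4]

grouped = soundex("Robert") == soundex("Rupert")           # different names, same code
separated = soundex("phalanges") != soundex("hpalanges")   # simple typo, different codes
```

Because the first letter is kept verbatim, a transposition at the start of a word changes the code even though the strings are one edit apart.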
42Error Distribution
- 80% of spelling errors are a single edit (Damerau, 1964; Morgan, 1970).
- Our data looked at 1-3 edit errors and found 76% 1 edit, 22% 2 edit, 2% 3 edit.
43Error Distribution
- We looked at the log data to find corrections that could be detected by Metaphone, but not within 2 edits.
- For 63 days of logs, less than 5,000 legitimate corrections were found.
- That is less than 75 examples per day.
44Examples of Success
- miocardi alinfraction -> myocardial infarction
- terminl illnss -> terminal illness
- hig pressue liqud chromatogph -> high pressure liquid chromatography
- tumor necrosisactor -> tumor necrosis factor
- hmgolbin -> hemoglobin
- philariosis -> filariosis
45Examples of Failure
- Sapna Baht -> sauna bath
- periostin -> periods in
- Daniel K E -> danieluk m
- bisexual molest -> bisexual modest
- pancreas transplation -> pancreas AND translation
- stem cell ros -> stem cell loss
- cupper hair -> upper air
46Post Checking Process
- Testing indicates that on PubMed queries about 87% of suggestions made are reasonable.
- Suggestions are made for about 10% of queries.
- Before using a suggestion, postings are checked and it is rejected if there are no postings.
- Result is that suggestions reach the user for about 7% of queries.
47Usage Statistics
- On deployment about 36% of suggestions reaching the user were accepted.
- This has gradually climbed to about 40%.
- On a good day 80,000 suggestions are accepted.
48Acknowledgments
- Won Kim, Natalie Xie
- David Kenton, Pramod Paranthaman, Volodya
Sirotinin, Grisha Starchenko