SpellChecking PubMed Queries - PowerPoint PPT Presentation

1
SpellChecking PubMed Queries
  • By John Wilbur, Won Kim, Natalie Xie

2
Noisy Channel Model
  • Historically dates to Claude Shannon (1948)
  • Bayesian model, with IW = intended word and
    OW = observed word:
    P(IW | OW) = P(OW | IW) P(IW) / P(OW)

3
Maximal Likelihood
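  • We can ignore P(OW), the probability of the
    observed word, because it is the same for every
    candidate intended word.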

4
Example
  • P(OW | IW), for example with OW = enzime and
    IW = enzyme, is just the
    probability that someone would accidentally put
    an i in place of a y.
  • P(IW) is proportional to the
    frequency of the word enzyme in the PubMed
    data.
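The scoring described on slides 2 to 4 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the toy frequencies, and the toy channel probabilities are all invented for the example.

```python
def best_correction(observed, candidates, channel_prob, frequency):
    """Return the candidate IW that maximizes P(OW | IW) * P(IW).

    P(IW) is approximated by the candidate's raw frequency; the constant
    P(OW) is dropped because it does not change the ranking.
    """
    def score(intended):
        return channel_prob(observed, intended) * frequency.get(intended, 0)
    return max(candidates, key=score)


# Toy numbers, invented for illustration only.
frequency = {"enzyme": 500_000, "enzymes": 300_000, "enzim": 3}
channel = {
    ("enzime", "enzyme"): 4e-5,   # i typed in place of y
    ("enzime", "enzymes"): 1e-9,  # would need two edits
    ("enzime", "enzim"): 1e-3,    # one edit, but a very rare string
}


def toy_channel_prob(observed, intended):
    return channel.get((observed, intended), 0.0)


result = best_correction("enzime", list(frequency), toy_channel_prob, frequency)
print(result)
```

A rare string that is only one edit away loses to a common word here because the frequency term dominates the product.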

6
Spelling Error Statistics
  • Harvested from PubMed Log Files
  • Criteria
  • Look for word pairs entered within 5 min of each
    other
  • Same IP address
  • Similar, but not same, spellings
  • Second member of pair much more frequent in
    PubMed.
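The harvesting criteria above might be applied to a time-sorted query log roughly like this. The record layout, the edit-distance limit of 2 for "similar" spellings, and the factor of 10 for "much more frequent" are assumptions made for the sketch; the slides do not give these values.

```python
from dataclasses import dataclass


@dataclass
class Query:
    ip: str
    time: float  # seconds since the start of the log
    text: str


def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, replace)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def harvest_pairs(log, frequency, window=300, max_dist=2, ratio=10):
    """Collect (first, second) pairs that satisfy the four criteria.

    `log` must be sorted by time. `max_dist` and `ratio` are assumed values.
    """
    pairs = []
    for i, first in enumerate(log):
        for second in log[i + 1:]:
            if second.time - first.time > window:
                break  # later entries are even further away in time
            if second.ip != first.ip:
                continue
            dist = edit_distance(first.text, second.text)
            if dist == 0 or dist > max_dist:
                continue  # identical, or not similar enough
            f_first = max(frequency.get(first.text, 0), 1)
            if frequency.get(second.text, 0) >= ratio * f_first:
                pairs.append((first.text, second.text))
    return pairs
```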

7
(No Transcript)
8
Many Terms of Low Frequency in PubMed are
Misspellings
  • On the previous slide, one string was not in
    PubMed.
  • On the next slide, it is in PubMed at low
    frequencies (0-100).

9
(No Transcript)
10
Errors collected for 63 days of log files
11
Average Rates of Different Types of Errors
  • Average probability of deletion: 0.00146
  • Average probability of insertion: 2.925e-05
  • Average probability of replacement: 4.006e-05
  • Average probability of transposition: 0.000334

12
Corrections Context Dependent
  • We collected the context of one letter on either
    side of a correction.
  • Word beginning and ending were marked with
    special characters so that the statistics for
    them could be specific.
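One way to record such contexts is sketched below for the replacement case only. The boundary markers "^" and "$", the function name, and the choice to take the neighboring letters from the intended word are assumptions for illustration.

```python
def replacement_contexts(wrong, right, start="^", end="$"):
    """List (left, typed, intended, right) for each replaced letter.

    Handles only equal-length strings (pure replacements). The marker
    characters are hypothetical; any symbols outside the alphabet would do.
    """
    if len(wrong) != len(right):
        raise ValueError("this sketch covers replacements only")
    w = start + wrong + end
    r = start + right + end
    return [
        (r[i - 1], w[i], r[i], r[i + 1])
        for i in range(1, len(r) - 1)
        if w[i] != r[i]
    ]


print(replacement_contexts("enzime", "enzyme"))
print(replacement_contexts("anzyme", "enzyme"))
```

In the second call the left context is the start marker, so an error in the first letter gets its own statistics.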

13
A Model for Calculation
  • Simplest model: generate all possible corrections
    and see whether they are in PubMed and at what
    frequency.
  • A better model is to search a tree-like structure
    (a trie).
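A trie search with a bounded number of edits can be sketched as below. It is a generic textbook technique (one Levenshtein row per trie node, with pruning), not necessarily the search the authors used; in particular it counts a transposition as two edits, whereas the slides treat it as one. All names and the toy frequencies are assumptions.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.frequency = 0  # a value above 0 marks the end of a stored word


def build_trie(frequency):
    root = TrieNode()
    for word, count in frequency.items():
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.frequency = count
    return root


def search(root, query, max_edits=1):
    """Return {word: (edits, frequency)} for stored words within max_edits."""
    results = {}

    def walk(node, prefix, prev_row):
        for ch, child in node.children.items():
            # Extend the edit-distance table by one row for this letter.
            row = [prev_row[0] + 1]
            for j in range(1, len(query) + 1):
                cost = 0 if query[j - 1] == ch else 1
                row.append(min(row[j - 1] + 1, prev_row[j] + 1, prev_row[j - 1] + cost))
            word = prefix + ch
            if child.frequency and row[-1] <= max_edits:
                results[word] = (row[-1], child.frequency)
            if min(row) <= max_edits:  # otherwise no extension can succeed
                walk(child, word, row)

    walk(root, "", list(range(len(query) + 1)))
    return results


# Toy frequencies, invented for illustration.
trie = build_trie({"acetylcholine": 90000, "acetylcholines": 5, "choline": 40000})
print(search(trie, "acitylcholine", max_edits=1))
```

Pruning on the row minimum is what makes the trie preferable to generating every variant: whole subtrees are abandoned as soon as no completion can stay within the edit budget.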

14
Example
  • Search term: acitylcholine
  • Trie figure, branch labels as transcribed:
    a c e t y l c h o l i n e, l l c h o l i n e,
    p h o l i n e, v l c h o l i n e
15
Acetylcholine
  • Has 28 neighbors at edit distance 1.
  • Has 29 neighbors at edit distance 2.
  • Has about 13 one-edit deletions, 364 one-edit
    insertions, 325 one-edit replacements, and 12
    one-edit transpositions.
  • Two-edit variations are much greater in number.
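The one-edit counts above follow from the word length n = 13 and a 26-letter alphabet. The sketch below counts edit operations, not distinct resulting strings, which is presumably why the slide says "about"; the function name is hypothetical.

```python
import string


def one_edit_counts(word, alphabet=string.ascii_lowercase):
    """Count single-edit operations for a word of length n over k letters."""
    n, k = len(word), len(alphabet)
    return {
        "deletions": n,                # drop any one letter
        "insertions": (n + 1) * k,     # any letter in any of n + 1 gaps
        "replacements": n * (k - 1),   # any letter changed to a different one
        "transpositions": n - 1,       # swap any adjacent pair
    }


print(one_edit_counts("acetylcholine"))
```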

16
Basic Assumption I
  • Misspelling rate in PubMed queries is in PubMed
    documents.
  • Therefore if we see acitylcholine as query we
    may find

17
Basic Assumption II
  • Using frequency in the PubMed data to stand in
    the place of probability of intent is only valid
    at higher frequencies.
  • Must downgrade the frequency at low frequencies.
  • Data of Zobel and Dart (1994).

18
Estimating Regions of Success
  • Difficulty of correction depends on how densely
    the space is packed with words.
  • Packing is more dense for shorter words.

22
Treating a Single Token t: Stage 1
  • I. If len
  • II. If in database f1000 do not correct.
  • III. Search for a most likely correction c
  • If no 1-edit correction can be found, label as
    case 0.
  • If can find and pc0.7 or ptcorrection and label case 1.
  • If neither case 0 nor case 1, label as case 2.

23
Treating a Single Token t: Stage 2
  • Case 0 If len9
  • Try to split into 2 words, both f500. If
    succeed, accept as correction, label case 3.
  • Else try to perform a 2-edit correction. If
    successful, accept as correction, label case 4.
  • Case 1 If len5 run through stage 1 again.
    Remains case 1.
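The split attempt in stage 2 could look roughly like the sketch below. The comparison operator next to the threshold 500 did not survive in the transcript, so the strict "greater than" used here is an assumption, as are the function name, the tie-breaking rule (prefer the split whose rarer half is most frequent), and the toy frequencies.

```python
def split_token(token, frequency, min_freq=500):
    """Try to split a token into two database words.

    Assumption: both halves must have frequency strictly above min_freq.
    Returns (left, right) or None if no split qualifies.
    """
    best = None
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        f_left = frequency.get(left, 0)
        f_right = frequency.get(right, 0)
        if f_left > min_freq and f_right > min_freq:
            score = min(f_left, f_right)
            if best is None or score > best[0]:
                best = (score, left, right)
    return None if best is None else (best[1], best[2])


# Toy frequencies, invented for illustration.
toy_frequency = {"heart": 200000, "attack": 50000, "he": 900, "art": 30000}
print(split_token("heartattack", toy_frequency))
```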

24
Treating a Single Token t: Stage 3
  • If len9 and case 1 or case 2 and if frequency
    of current best
  • Do a two edit search for a correction.
  • If result has f80 and f10prior frequency and
    if string not changed too much at beg, accept
    result as the correction and label as case 4.

25
Treating a Single Token t: Stage 4
  • Cases 1, 3, or 4 are accepted as a correction.
  • Otherwise, if len12 do a deep search and accept
    if successful.
  • If not, try to split with both words in database.
    Accept if successful.
  • Else failure, no correction!

26
Deep Search
  • Algorithm is allowed to do two edits and performs
    a partial match of the token into the database
    trie.
  • Looks for the match using the most letters from
    the token.
  • Among all matches using that maximum number it
    chooses the most probable.
  • This is repeated until a total match is produced
    or failure. It must match at least 4 letters at
    each step.
  • Ends with a rationality check.

27
Two Token Phrase
  • A different problem because of potential context.
  • Constraints are necessary.

28
Fixed Points
  • Tokens 1 or 2 characters long are very unlikely
    to be misspelled.
  • We do not try to correct tokens of 1 or 2
    characters in a two token phrase.

29
Restricted Points
  • If a token has between 3 and 6 characters, only
    allow a single edit in that token.
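The fixed-point and restricted-point rules amount to a per-token edit budget, sketched below. The budgets of 0 and 1 come from the slides; the default of 2 edits for longer tokens is an assumption based on the later mention of a "1 or 2 edit correction", and the function name is hypothetical.

```python
def max_edits_for_token(token, default_max=2):
    """Edit budget for one token of a two-token phrase."""
    n = len(token)
    if n <= 2:
        return 0  # fixed point: never corrected
    if n <= 6:
        return 1  # restricted point: a single edit at most
    return default_max  # assumed budget for longer tokens


print([max_edits_for_token(t) for t in "of cell necrosis".split()])
```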

30
Stopping Rules
  • Let fp be the frequency of the original phrase
    and fm the minimum of the frequencies of the two
    tokens making the phrase.
  • If fp 5 and fm 500 do not correct.
  • If fp 0 and fm 50 and the length of one string
    is not more than 4 do not correct.

31
Attempt Correction
  • Let fc be frequency of the best 1 or 2 edit
    correction for the phrase.
  • If fc0 and fm100 do correction on individual
    tokens.
  • If 0
  • Try to split the phrase
  • Else try deep search
  • Else correct individual tokens.

32
More than Two Tokens
  • A trie of truncated phrases: all the initial
    two-word phrases coming from longer phrases in
    the database.
  • Take the initial two words of current phrase and
    look for it in the truncated data. Do corrections
    as appropriate.
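Building the truncated data can be sketched as follows. For brevity the sketch uses a dictionary keyed by the two-word prefix instead of a trie, and the function name and sample phrases are assumptions for illustration.

```python
def truncated_phrases(phrases):
    """Map each initial two-word prefix to the longer phrases it starts."""
    index = {}
    for phrase in phrases:
        tokens = phrase.split()
        if len(tokens) > 2:  # only phrases longer than two words contribute
            prefix = " ".join(tokens[:2])
            index.setdefault(prefix, []).append(phrase)
    return index


sample = [
    "tumor necrosis factor",
    "tumor necrosis factor alpha",
    "stem cell",
    "high pressure liquid chromatography",
]
print(truncated_phrases(sample))
```

Looking up the first two tokens of a query in this index tells the checker which longer phrases are worth tracing.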

33
Extension
  • If truncated string found in trie of truncated
    database strings, trace the initial two token
    string in the trie of long phrases and extend the
    match with corrections as needed.
  • Follows deep search strategy (but safer).
  • If unable to produce a perfect match, backtrack,
    looking for the longest partial match.

34
Extension Failure
  • Treat the two token initial truncation in the
    standard manner.
  • If two token phrase correction fails, treat only
    the first token.
  • If a one or two token initial phrase has been
    corrected, start over with what remains.

35
Cleaning Up the PubMed Data
  • Looked at single words that occurred in at
    least 20 PubMed documents and 2-word phrases that
    occurred in at least 9 PubMed documents.
  • Compared these with single edit alternatives that
    were at least 10 times as common in the data.
  • Performed statistical testing.

36
Hypergeometric Test
  • Figure labels: Database, T1, T2, I
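A hypergeometric tail probability can be computed as sketched below. The transcript does not say how the test was set up, so the mapping is an assumption: N is the number of documents in the database, n1 and n2 are the numbers of documents containing terms T1 and T2, and k is the size of their intersection I. The function name is hypothetical.

```python
from math import comb


def hypergeom_tail(N, n1, n2, k):
    """P(overlap >= k) if the n2 documents were drawn at random from N,
    of which n1 are marked. Uses exact integer binomial coefficients."""
    upper = min(n1, n2)
    favorable = sum(comb(n1, i) * comb(N - n1, n2 - i) for i in range(k, upper + 1))
    return favorable / comb(N, n2)


# Tiny worked case: 10 documents, each term in 3 of them.
print(hypergeom_tail(10, 3, 3, 3))
```

A very small tail probability means the two terms share documents far more often than chance would predict.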
37
Wilcoxon-Mann-Whitney Test
  • Example: myocardial infraction vs. myocardial
    infarction
40
What about Phonetics?
  • Methods
  • Soundex (Odell and Russell, 1918)
  • Phonix (Gadd, 1990)
  • Metaphone (Philips, 1990)
  • Trials by Zobel and Dart (1994)
  • Our trials: phonetic encoding is a double-edged
    sword. Performance degraded.

41
The Problem with Phonetics
  • Things are grouped that should not be grouped and
    things are separated that should not be
    separated.
  • Zobel and Dart (1994): mad and not map to the
    same encoding.
  • phalanges - flnj(s) and hpalanges - hpln(js).
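The separation problem can be seen with Soundex, the oldest of the methods listed. The sketch below is the common American Soundex variant, written for illustration; it is not the Metaphone-style encoding shown in the bullet above, and nothing here is taken from the authors' code.

```python
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """American Soundex: first letter plus three digits."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    encoded = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code  # a vowel resets prev, so a repeated code after it is kept
    return (encoded + "000")[:4]


for w in ("phalanges", "hpalanges", "philariosis", "filariosis"):
    print(w, soundex(w))
```

Because the first letter is kept verbatim, a transposition at the start of a word (hpalanges) or a ph/f confusion (philariosis) lands in a different code from the intended word.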

42
Error Distribution
  • 80% of spelling errors are a single edit
    (Damerau, 1964; Morgan, 1970).
  • Our data looked at 1-3 edit errors and found 76%
    1 edit, 22% 2 edit, 2% 3 edit.

43
Error Distribution
  • We looked at the log data to find corrections
    that could be detected by Metaphone, but not
    within 2 edits.
  • For 63 days of logs, fewer than 5,000 legitimate
    corrections were found.
  • That is fewer than 75 examples per day.

44
Examples of Success
  • miocardi alinfraction - myocardial infarction
  • terminl illnss - terminal illness
  • hig pressue liqud chromatogph - high pressure
    liquid chromatography
  • tumor necrosisactor - tumor necrosis factor
  • hmgolbin - hemoglobin
  • philariosis - filariosis

45
Examples of Failure
  • Sapna Baht - sauna bath
  • periostin - periods in
  • Daniel K E - danieluk m
  • bisexual molest - bisexual modest
  • pancreas transplation - pancreas AND translation
  • stem cell ros - stem cell loss
  • cupper hair - upper air

46
Post Checking Process
  • Testing indicates that on PubMed queries about
    87% of suggestions made are reasonable.
  • Suggestions are made for about 10% of queries.
  • Before using a suggestion, postings are checked
    and it is rejected if there are no postings.
  • The result is that suggestions reach the user for
    about 7% of queries.

47
Usage Statistics
  • On deployment, about 36% of suggestions reaching
    the user were accepted.
  • This has gradually climbed to about 40%.
  • On a good day, 80,000 suggestions are accepted.

48
Acknowledgments
  • Won Kim, Natalie Xie
  • David Kenton, Pramod Paranthaman, Volodya
    Sirotinin, Grisha Starchenko