Title: CSCI 7000 Modern Information Retrieval Jim Martin
Today (9/5)
- Review
- Dictionary contents
- Advanced query handling
- Phrases
- Wildcards
- Spelling correction
- First programming assignment
Index: a Dictionary file and a Postings file
Review: Dictionary
- What goes into creating the dictionary?
- Tokenization
- Case folding
- Stemming
- Stop-listing
- Dealing with numbers (and number-like entities)
- Complex morphology
Phrasal queries
- Want to handle queries such as
- "Colorado Buffaloes" as a phrase
- The concept is popular with users: about 10% of web queries are phrasal queries
- Postings that consist of document lists alone are not sufficient to handle phrasal queries
- Two general approaches
- Biword indexing
- Positional indexing
Solution 1: Biword Indexing
- Index every consecutive pair of terms in the text as a phrase
- For example, the text "Friends, Romans, Countrymen" would generate the biwords
- friends romans
- romans countrymen
- Each of these biwords is now a dictionary term
- Two-word phrase query-processing is now free
- Not really.
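As a sketch, generating biwords from a token stream is a one-liner (the function name is illustrative, not from any particular toolkit):

```python
def biwords(tokens):
    """Pair each token with its successor to form biword dictionary terms."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

print(biwords(["friends", "romans", "countrymen"]))
# → ['friends romans', 'romans countrymen']
```

Each resulting string becomes an ordinary dictionary term with its own postings list.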
Longer Phrasal Queries
- Longer phrases can be broken into Boolean AND queries on the component biwords
- "Colorado Buffaloes at Arizona" becomes
- (Colorado Buffaloes) AND (Buffaloes at) AND (at Arizona)
- Susceptible to Type 1 errors (false positives)
Solution 2: Positional Indexing
- Change our posting content
- Store, for each term, entries of the form
- doc1: position1, position2, ...
- doc2: position1, position2, ...
- etc.
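A minimal in-memory sketch of building postings of that shape (names are mine, purely illustrative):

```python
from collections import defaultdict

def build_positional_index(docs):
    """docs maps doc_id -> token list; returns term -> {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            index[term][doc_id].append(pos)  # positions arrive in sorted order
    return index

idx = build_positional_index({1: "to be or not to be".split()})
print(idx["to"][1], idx["be"][1])
# → [0, 4] [1, 5]
```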
9Positional index example
149 4 17, 191, 291, 430, 434 5 363, 367,
Which of docs 1,2,4,5 could contain to be or not
to be?
Processing a phrase query
- Extract inverted index entries for each distinct term: to, be, or, not
- Merge their doc:position lists to enumerate all positions with "to be or not to be"
- to:
- 2: 1, 17, 74, 222, 551; 4: 8, 16, 190, 429, 433; 7: 13, 23, 191; ...
- be:
- 1: 17, 19; 4: 17, 191, 291, 430, 434; 5: 14, 19, 101; ...
- Same general method for proximity searches
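One way to sketch that merge, assuming postings of the form term -> {doc: sorted positions} (a toy layout, not any real engine's): shift each term's positions back by its offset within the phrase and intersect the resulting start positions.

```python
def phrase_match_docs(index, phrase):
    """Return docs where the phrase's terms occur at consecutive positions."""
    terms = phrase.split()
    # Only docs containing every term can match
    docs = set(index.get(terms[0], {}))
    for t in terms[1:]:
        docs &= set(index.get(t, {}))
    hits = []
    for d in sorted(docs):
        starts = set(index[terms[0]][d])
        for i, t in enumerate(terms[1:], start=1):
            starts &= {p - i for p in index[t][d]}  # align to phrase start
        if starts:
            hits.append(d)
    return hits

index = {
    "to": {1: [0, 4], 2: [5]},
    "be": {1: [1, 5], 2: [9]},
}
print(phrase_match_docs(index, "to be"))
# → [1]
```

Proximity search replaces the exact offset alignment with a window test (positions within k of each other).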
Positional index size
- As we'll see, you can compress position values/offsets
- But a positional index still expands the postings storage substantially
- Nevertheless, it is now the standard approach because of the power and usefulness of phrase and proximity queries, whether used explicitly or implicitly in a ranking retrieval system
Rules of thumb
- Positional index size: 35-50% of the volume of the original text
- Caveat: all of this holds for English-like languages
Combination Techniques
- Biwords are faster
- And they cover a large percentage of very frequent (implied) phrasal queries
- "Britney Spears"
- So it can be effective to combine positional indexes with biword indexes for frequent bigrams
Web
Programming Assignment: Part 1
- Download and install Lucene
- How does Lucene handle (by default)
- Case, stemming, and phrasal queries
- Download and index a collection that I will point you at
- How big is the resulting index?
- Terms and size of index
- Return the top N document IDs (hits) from a set of queries I'll provide
Programming Assignment: Part 2
Wild Card Queries
- Two flavors
- Word-based
- Caribb*
- Phrasal
- "Pirates * Caribbean"
- General approach
- Generate a set of new queries from the original
- An operation on the dictionary
- Run those queries in a not-stupid way
Simple Single Wild-card Queries
- Single instance of a *
- * means any string of length 0 or more
- This is not the Kleene *
- mon*: find all docs containing any word beginning with mon
- Index your lexicon on prefixes
- *mon: find words ending in mon
- Maintain a backwards index
- Exercise: from this, how can we enumerate all terms meeting the wild-card query pro*cent?
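A sketch of both lookups over a sorted lexicon (the tiny term list and helper names are mine): prefixes are found by binary search on the forward lexicon, suffixes by the same search on a lexicon of reversed terms.

```python
from bisect import bisect_left

def prefix_matches(sorted_terms, prefix):
    """All terms starting with `prefix`, via binary search on a sorted list."""
    i = bisect_left(sorted_terms, prefix)
    out = []
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        out.append(sorted_terms[i])
        i += 1
    return out

lexicon = sorted(["moderate", "monday", "money", "salmon", "sermon"])
reversed_lexicon = sorted(t[::-1] for t in lexicon)  # the backwards index

print(prefix_matches(lexicon, "mon"))                              # mon*
# → ['monday', 'money']
print([t[::-1] for t in prefix_matches(reversed_lexicon, "nom")])  # *mon
# → ['salmon', 'sermon']
```

For a query like pro*cent, intersect the pro* results from the forward index with the *cent results from the backwards index.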
Arbitrary Wildcards
- How can we handle multiple *s in the middle of a query term?
- The solution: transform every wild-card query so that the *s occur at the end
- This gives rise to the Permuterm Index
Permuterm Index
- For term hello, index under
- hello$, ello$h, llo$he, lo$hel, o$hell
- where $ is a special end-of-term symbol
- Example
- Query: hel*o → Rotate: o$hel* → Look up terms with prefix o$hel
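The rotation machinery as a sketch, using `$` as the end symbol (function names are mine):

```python
def permuterm_rotations(term):
    """All rotations of term + '$'; each rotation is indexed, pointing back to term."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def permuterm_lookup_key(query):
    """For a single-* query: append '$', rotate the * to the end, then drop it."""
    q = query + "$"
    star = q.index("*")
    return q[star + 1:] + q[:star]   # the prefix to look up

print(permuterm_rotations("hello"))
# → ['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello']
print(permuterm_lookup_key("hel*o"))
# → 'o$hel'
```

The rotation o$hell of hello$ begins with the key o$hel, so the prefix lookup finds hello.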
Permuterm query processing
- Rotate the query so the wild-card goes to the end
- Now use indexed (prefix) lookup as before
- Permuterm problem: it roughly quadruples the lexicon size
- An empirical observation for English
Spelling Correction
- Two primary uses
- Correcting document(s) being indexed
- Retrieving matching documents when the query contains a spelling error
- Two main flavors
- Isolated word
- Check each word on its own for misspelling
- Will not catch typos resulting in correctly spelled words, e.g., from → form
- Context-sensitive
- Look at surrounding words, e.g., "I flew form Heathrow to Narita."
Document correction
- Primarily for OCR'ed documents
- Correction algorithms tuned for this
- Goal: the index (dictionary) contains fewer OCR-induced misspellings
- Can use domain-specific knowledge
- E.g., OCR can confuse O and D more often than it would confuse O and I (adjacent on the QWERTY keyboard, so more likely interchanged in typing)
Query correction
- Our principal focus here
- E.g., the query "Alanis Morisett"
- We can either
- Retrieve using that spelling
- Retrieve documents indexed by the correct spelling, OR
- Return several suggested alternative queries with the correct spelling
- "Did you mean ...?"
Isolated word correction
- Fundamental premise: there is a lexicon from which the correct spellings come
- Two basic choices for this
- A standard lexicon, such as
- Webster's English Dictionary
- An industry-specific lexicon, hand-maintained
- The lexicon of the indexed corpus
- E.g., all words on the web
- All names, acronyms, etc.
- (Including the misspellings)
Isolated word correction
- Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q
- What's closest?
- We'll study several alternatives
- Edit distance
- Weighted edit distance
- Character n-gram overlap
Edit distance
- Given two strings S1 and S2, the minimum number of basic operations to convert one to the other
- Basic operations are typically character-level
- Insert
- Delete
- Replace
- E.g., the edit distance from cat to dog is 3.
- Generally found by dynamic programming.
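The standard dynamic program, as a sketch:

```python
def edit_distance(s1, s2):
    """Minimum number of insert/delete/replace operations turning s1 into s2."""
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all of s1's first i characters
    for j in range(n + 1):
        d[0][j] = j          # insert all of s2's first j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete
                          d[i][j - 1] + 1,        # insert
                          d[i - 1][j - 1] + sub)  # replace (free on match)
    return d[m][n]

print(edit_distance("cat", "dog"))    # → 3
print(edit_distance("from", "form"))  # → 2
```

The weighted variant on the next slide replaces the three unit costs with entries from a weight matrix.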
Weighted edit distance
- As above, but the weight of an operation depends on the character(s) involved
- Meant to capture keyboard errors, e.g., m is more likely to be mistyped as n than as q
- Therefore, replacing m by n is a smaller edit distance than replacing it by q
- (The same ideas are usable for OCR, but with different weights)
- Requires a weight matrix as input
- Modify the dynamic programming to handle weights (Viterbi)
Using edit distances
- Given a query, first enumerate all dictionary terms within a preset (weighted) edit distance
- Then look up the enumerated dictionary terms in the term-document inverted index
Edit distance to all dictionary terms?
- Given a (misspelled) query, do we compute its edit distance to every dictionary term?
- Expensive and slow
- How do we cut down the set of candidate dictionary terms?
- We can use character n-gram overlap for this
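A sketch of that filter using character bigrams and Jaccard overlap (the threshold value is an arbitrary illustrative choice):

```python
def char_ngrams(term, n=2):
    """Set of character n-grams of a term."""
    return {term[i:i + n] for i in range(len(term) - n + 1)}

def jaccard(a, b):
    """Jaccard overlap of two sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)

def candidate_terms(query, lexicon, n=2, threshold=0.4):
    """Cheap pre-filter: keep only terms whose n-gram overlap with the
    query is high; run (weighted) edit distance on these few terms only."""
    q = char_ngrams(query, n)
    return [t for t in lexicon if jaccard(q, char_ngrams(t, n)) >= threshold]

print(candidate_terms("bord", ["border", "board", "morbid", "lord"]))
# → ['border', 'board', 'lord']
```

Only the surviving candidates go through the expensive dynamic-programming step.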
Context-sensitive spell correction
- Text: "I flew from Heathrow to Narita."
- Consider the phrase query "flew form Heathrow"
- We'd like to respond
- Did you mean "flew from Heathrow"?
- because no docs matched the original query phrase
Context-sensitive correction
- Need surrounding context to catch this
- Full NLP is too heavyweight for this
- First idea: retrieve dictionary terms close (in weighted edit distance) to each query term
- Then try all possible resulting phrases with one word fixed at a time
- flew from heathrow
- fled form heathrow
- flea form heathrow
- etc.
- Suggest the alternative that has lots of hits?
Exercise
- Suppose that for "flew form Heathrow" we have 7 alternatives for flew, 19 for form, and 3 for heathrow.
- How many corrected phrases will we enumerate in this scheme?
General issues in spell correction
- Will enumerate multiple alternatives for "Did you mean"
- Need to figure out which one (or a small number) to present to the user
- Use heuristics
- The alternative hitting the most docs
- Query log analysis and tweaking
- For especially popular, topical queries
- Language modeling
Computational cost
- Spell-correction is computationally expensive
- Avoid running routinely on every query?
- Run only on queries that matched few docs
Next Time
- On to Chapter 4
- Real indexing