Title: INF 2914 Information Retrieval and Web Search
1. INF 2914: Information Retrieval and Web Search
- Lecture 7: Query Processing
- These slides are adapted from Stanford's class CS276 / LING 286 - Information Retrieval and Web Mining
2. Query processing: AND
- Consider processing the query
- Brutus AND Caesar
- Locate Brutus in the Dictionary
- Retrieve its postings.
- Locate Caesar in the Dictionary
- Retrieve its postings.
- Merge the two postings
[Figure: postings lists. Brutus: 2 → 4 → 8 → 16 → 32 → 64 → 128; Caesar: 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34]
3. The merge
- Walk through the two postings simultaneously, in
time linear in the total number of postings
entries
[Figure: merging the postings of Brutus (2 → 4 → 8 → 16 → 32 → 64 → 128) and Caesar (1 → 2 → 3 → 5 → 8 → 13 → 21 → 34); the intersection is 2 → 8]
If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings sorted by docID.
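A minimal sketch of this merge in Python (the function name is illustrative; the postings values are the ones from the figure above):

```python
def intersect(p1, p2):
    """AND-merge two docID-sorted postings lists in O(x + y) time."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:           # docID present in both lists
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:          # advance the list with the smaller docID
            i += 1
        else:
            j += 1
    return answer

# The Brutus/Caesar postings from the figure:
print(intersect([2, 4, 8, 16, 32, 64, 128],
                [1, 2, 3, 5, 8, 13, 21, 34]))   # -> [2, 8]
```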
4. Boolean queries: Exact match
- The Boolean retrieval model is being able to ask a query that is a Boolean expression
- Boolean queries are queries using AND, OR and NOT to join query terms
- Views each document as a set of words
- Is precise: document matches condition or not.
- Primary commercial retrieval tool for 3 decades.
- Professional searchers (e.g., lawyers) still like Boolean queries
- You know exactly what you're getting.
5. Boolean queries: More general merges
- Exercise: Adapt the merge for the queries:
- Brutus AND NOT Caesar
- Brutus OR NOT Caesar
- Can we still run through the merge in time O(x+y), or what can we achieve?
6. Merging
- What about an arbitrary Boolean formula?
- (Brutus OR Caesar) AND NOT (Antony OR Cleopatra)
- Can we always merge in linear time?
- Linear in what?
- Can we do better?
7. Query optimization
- What is the best order for query processing?
- Consider a query that is an AND of t terms.
- For each of the t terms, get its postings, then
AND them together.
[Figure: postings lists. Brutus: 2 → 4 → 8 → 16 → 32 → 64 → 128; Caesar: 1 → 2 → 3 → 5 → 8 → 16 → 21 → 34; Calpurnia: 13 → 16]
Query: Brutus AND Calpurnia AND Caesar
8. Query optimization example
- Process in order of increasing freq:
- start with smallest set, then keep cutting further.
This is why we kept freq in the dictionary.
Execute the query as (Caesar AND Brutus) AND Calpurnia.
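A hedged sketch of this heuristic, reusing intersect() from the earlier sketch. The frequency values below are hypothetical, chosen only to reproduce the order the slide states; a real system would read them from the dictionary:

```python
def intersect_many(postings_by_term, freqs):
    """AND a conjunction of terms, processing the rarest term first."""
    terms = sorted(freqs, key=freqs.get)   # increasing dictionary freq
    result = postings_by_term[terms[0]]
    for t in terms[1:]:
        if not result:                     # intersection already empty: stop
            break
        result = intersect(result, postings_by_term[t])
    return result

# Hypothetical freqs giving the order (Caesar AND Brutus) AND Calpurnia:
freqs = {"Brutus": 5000, "Caesar": 1000, "Calpurnia": 10000}
postings = {"Brutus": [2, 4, 8, 16, 32, 64, 128],
            "Caesar": [1, 2, 3, 5, 8, 16, 21, 34],
            "Calpurnia": [13, 16]}
print(intersect_many(postings, freqs))     # -> [16]
```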
9. More general optimization
- e.g., (madding OR crowd) AND (ignoble OR strife)
- Get freqs for all terms.
- Estimate the size of each OR by the sum of its freqs (conservative).
- Process in increasing order of OR sizes.
10. Query processing exercises
- If the query is friends AND romans AND (NOT countrymen), how could we use the freq of countrymen?
- Exercise: Extend the merge to an arbitrary Boolean query. Can we always guarantee execution in time linear in the total postings size?
11. Faster postings merges: Skip pointers
12. Recall basic merge
- Walk through the two postings simultaneously, in
time linear in the total number of postings
entries
[Figure: postings lists 2 → 4 → 8 → 41 → 48 → 64 → 128 and 1 → 2 → 3 → 8 → 11 → 17 → 21 → 31]
If the list lengths are m and n, the merge takes O(m+n) operations.
Can we do better? Yes, if we have skip pointers.
13. Augment postings with skip pointers (at indexing time)
[Figure: the postings lists above, augmented with skip pointers]
- Why?
- To skip postings that will not figure in the search results.
- How?
- Where do we place skip pointers?
14. Query processing with skip pointers
[Figure: the skip-augmented postings lists again]
Suppose we've stepped through the lists until we process 8 on each list.
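A sketch of the skip-aware merge. It assumes skips are kept out-of-band as a dict mapping a list position to the position it jumps to; the sqrt(L) spacing is the common placement heuristic (see the next slide), not something the figure prescribes:

```python
import math

def add_skips(postings):
    """Place evenly spaced skips, about sqrt(L) apart (a common heuristic)."""
    step = max(1, int(math.sqrt(len(postings))))
    return {i: i + step for i in range(0, len(postings) - step, step)}

def intersect_with_skips(p1, s1, p2, s2):
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Take the skip only if it does not overshoot the other list's docID.
            i = s1[i] if i in s1 and p1[s1[i]] <= p2[j] else i + 1
        else:
            j = s2[j] if j in s2 and p2[s2[j]] <= p1[i] else j + 1
    return answer

p1 = [2, 4, 8, 41, 48, 64, 128]            # the example lists from the figure
p2 = [1, 2, 3, 8, 11, 17, 21, 31]
print(intersect_with_skips(p1, add_skips(p1), p2, add_skips(p2)))  # -> [2, 8]
```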
15. Where do we place skips?
- Tradeoff:
- More skips → shorter skip spans → more likely to skip. But lots of comparisons to skip pointers.
- Fewer skips → few pointer comparisons, but then long skip spans → few successful skips.
16. B-Trees
- Use B-Trees, instead of skip pointers
- Handle large posting lists
- Top levels of the B-Tree always in memory for most used posting lists
- Better caching performance
- Read-only B-Trees
- Simple implementation
- No internal fragmentation
17. Zig-zag join
- Join all lists at the same time
- Self-optimized
- Heuristic: when a result is found, move the list with the smallest residual term frequency
- Want to move the list which will skip the largest number of entries (see the sketch below)
No need to execute the query (Caesar AND Brutus)
AND Calpurnia.
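A rough sketch of a multi-way zig-zag AND. It assumes each list supports a seek to the first docID ≥ x (simulated here with bisect; a real index would use skip pointers or a B-tree), and it omits the residual-frequency heuristic:

```python
import bisect

def zigzag_and(lists):
    """Intersect several docID-sorted postings lists by seeking in lock-step."""
    answer, candidate = [], 0
    pos = [0] * len(lists)
    while True:
        for k, plist in enumerate(lists):
            # Seek: first entry >= candidate (B-tree / skip structure in practice).
            pos[k] = bisect.bisect_left(plist, candidate, pos[k])
            if pos[k] == len(plist):       # some list exhausted: done
                return answer
            candidate = max(candidate, plist[pos[k]])
        if all(plist[pos[k]] == candidate for k, plist in enumerate(lists)):
            answer.append(candidate)       # every list agreed on this docID
            candidate += 1                 # move past it and continue

print(zigzag_and([[2, 4, 8, 16, 32, 64, 128],
                  [1, 2, 3, 5, 8, 16, 21, 34],
                  [13, 16]]))              # -> [16]
```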
18. Zig-zag example
- Handle ORs and NOTs
- More about Zig-zag join in the XML class
19. Phrase queries
20. Phrase queries
- Want to answer queries such as "stanford university" as a phrase
- Thus the sentence "I went to university at Stanford" is not a match.
- The concept of phrase queries has proven easily understood by users; about 10% of web queries are phrase queries
- No longer suffices to store only
- <term: docs> entries
21. Positional indexes
- Store, for each term, entries of the form:
- <number of docs containing term;
- doc1: position1, position2, …;
- doc2: position1, position2, …;
- etc.>
22. Positional index example
<be: 993427;
1: 7, 18, 33, 72, 86, 231;
2: 3, 149;
4: 17, 191, 291, 430, 434;
5: 363, 367, …>
Which of docs 1, 2, 4, 5 could contain "to be or not to be"?
- Can compress position values/offsets
- Nevertheless, this expands postings storage
substantially
23. Processing a phrase query
- Extract inverted index entries for each distinct term: to, be, or, not.
- Merge their doc:position lists to enumerate all positions with "to be or not to be".
- to:
- 2: 1, 17, 74, 222, 551; 4: 8, 16, 190, 429, 433; 7: 13, 23, 191; ...
- be:
- 1: 17, 19; 4: 17, 191, 291, 430, 434; 5: 14, 19, 101; ...
- Same general method for proximity searches
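A small sketch of the two-term case over a positional index, assuming postings of the form {docID: sorted position list}; the to/be postings are the ones from this slide:

```python
def phrase_intersect(post1, post2):
    """Docs where some term-2 position immediately follows a term-1 position."""
    answer = []
    for doc in sorted(post1.keys() & post2.keys()):
        positions2 = set(post2[doc])
        if any(p + 1 in positions2 for p in post1[doc]):
            answer.append(doc)
    return answer

to = {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]}
be = {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]}
print(phrase_intersect(to, be))   # -> [4]  (position 16 of "to" meets 17 of "be")
```

Generalizing the offset from +1 to a window of k positions gives the proximity search mentioned above.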
24. Positional index size
- You can compress position values/offsets
- Nevertheless, a positional index expands postings storage substantially
- It is now widely used because of the power and usefulness of phrase and proximity queries, whether used explicitly or implicitly in a ranking retrieval system.
25. Rules of thumb
- A positional index is 2-4 times as large as a non-positional index
- Positional index size is 35-50% of the volume of the original text
- Caveat: all of this holds for English-like languages
26. Combination schemes
- Biword and positional indexes can be profitably combined
- For particular phrases ("Michael Jackson", "Britney Spears") it is inefficient to keep on merging positional postings lists
- Even more so for phrases like "The Who"
- Williams et al. (2004) evaluate a more sophisticated mixed indexing scheme
- A typical web query mixture was executed in ¼ of the time of using just a positional index
- It required 26% more space than having a positional index alone
27. Wild-card queries
28. Wild-card queries
- mon*: find all docs containing any word beginning with "mon".
- Easy with a binary tree (or B-tree) lexicon: retrieve all words w in the range mon ≤ w < moo (see the sketch below)
- *mon: find words ending in "mon": harder
- Maintain an additional B-tree for terms written backwards.
Exercise: from this, how can we enumerate all terms meeting the wild-card query pro*cent?
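A minimal sketch of the trailing-* case, using binary search over a sorted in-memory lexicon as a stand-in for the B-tree range lookup (the sample lexicon is illustrative):

```python
import bisect

def prefix_range(lexicon, prefix):
    """All words w with prefix <= w < next-prefix, e.g. mon <= w < moo."""
    lo = bisect.bisect_left(lexicon, prefix)
    # Smallest string greater than every word carrying this prefix:
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    hi = bisect.bisect_left(lexicon, upper)
    return lexicon[lo:hi]

lexicon = sorted(["monday", "money", "month", "moo", "moon", "salmon"])
print(prefix_range(lexicon, "mon"))   # -> ['monday', 'money', 'month']
```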
29. Query processing
- At this point, we have an enumeration of all terms in the dictionary that match the wild-card query
- We still have to look up the postings for each enumerated term
- E.g., consider the query:
- se*ate AND fil*er
- This may result in the execution of many Boolean AND queries
30. B-trees handle *'s at the end of a query term
- How can we handle *'s in the middle of a query term?
- (Especially multiple *'s)
- The solution: transform every wild-card query so that the *'s occur at the end
- This gives rise to the Permuterm Index.
31. Permuterm index
- For term hello, index under:
- hello$, ello$h, llo$he, lo$hel, o$hell
- where $ is a special symbol.
- Queries:
- X: lookup on X$
- X*: lookup on $X*
- *X: lookup on X$*
- *X*: lookup on X*
- X*Y: lookup on Y$X*
- X*Y*Z: ??? Exercise!
32. Permuterm query processing
- Rotate the query wild-card to the right
- Now use B-tree lookup as before.
- Permuterm problem: ≈ quadruples lexicon size
Empirical observation for English.
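A small sketch of both halves: generating the rotations at indexing time and rotating a single-* query so the wildcard falls at the end (plain Python; a real system would hand the returned prefix to the B-tree):

```python
def permuterm_rotations(term):
    """All rotations of term + '$' to be indexed (includes '$hello' as well)."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def permuterm_key(query):
    """Rotate a query containing at most one '*' so the '*' lands at the end."""
    q = query + "$"
    if "*" not in q:
        return q                        # plain term X: lookup X$
    star = q.index("*")
    return q[star + 1:] + q[:star]      # prefix for the B-tree lookup

print(permuterm_rotations("hello"))
print(permuterm_key("hel*"))   # -> '$hel'  (lookup $hel*)
print(permuterm_key("m*n"))    # -> 'n$m'   (lookup n$m*)
```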
33. Bigram indexes
- Enumerate all k-grams (sequences of k chars) occurring in any term
- e.g., from the text "April is the cruelest month" we get the 2-grams (bigrams):
$a, ap, pr, ri, il, l$, $i, is, s$, $t, th, he, e$, $c, cr, ru, ue, el, le, es, st, t$, $m, mo, on, nt, h$
- $ is a special word boundary symbol
- Maintain an inverted index from bigrams to dictionary terms that match each bigram.
34. Bigram index example
[Figure: bigram index postings. $m → mace → madden; mo → among → amortize; on → among → around]
35. Processing n-gram wild-cards
- The query mon* can now be run as:
- $m AND mo AND on
- Fast, space efficient.
- Gets terms that match the AND version of our wildcard query.
- But we'd also enumerate moon.
- Must post-filter these terms against the query.
- Surviving enumerated terms are then looked up in the term-document inverted index (see the sketch below).
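A hedged end-to-end sketch of the bigram trick, including the post-filter that discards false matches like moon (the lexicon is illustrative, and the bigram index is simulated by scanning it):

```python
import re

def bigrams_of(s):
    return {s[i:i + 2] for i in range(len(s) - 1)}

def query_bigrams(pattern):
    """Bigrams implied by a wildcard pattern, e.g. 'mon*' -> {$m, mo, on}."""
    pieces = pattern.split("*")
    pieces[0] = "$" + pieces[0]
    pieces[-1] = pieces[-1] + "$"
    grams = set()
    for piece in pieces:
        grams |= bigrams_of(piece)
    return grams

def wildcard_lookup(pattern, lexicon):
    grams = query_bigrams(pattern)
    # AND of the bigram index: terms containing every query bigram.
    candidates = [t for t in lexicon if grams <= bigrams_of("$" + t + "$")]
    # Post-filter: 'moon' matches the bigrams of mon* but not the pattern.
    rx = re.compile("^" + ".*".join(map(re.escape, pattern.split("*"))) + "$")
    return [t for t in candidates if rx.match(t)]

lexicon = ["moon", "month", "monday", "salmon"]
print(wildcard_lookup("mon*", lexicon))   # -> ['month', 'monday']
```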
36. Processing wild-card queries
- As before, we must execute a Boolean query for each enumerated, filtered term.
- Wild-cards can result in expensive query execution
Search: Type your search terms, use '*' if you need to. E.g., Alex* will match Alexander.
37. Spelling correction
38. Spell correction
- Two principal uses
- Correcting document(s) being indexed
- Retrieving matching documents when the query contains a spelling error
- Two main flavors:
- Isolated word
- Check each word on its own for misspelling
- Will not catch typos resulting in correctly spelled words, e.g., from → form
- Context-sensitive
- Look at surrounding words, e.g., I flew form Heathrow to Narita.
39. Document correction
- Primarily for OCR'ed documents
- Correction algorithms tuned for this
- Goal: the index (dictionary) contains fewer OCR-induced misspellings
- Can use domain-specific knowledge
- E.g., OCR can confuse O and D more often than it
would confuse O and I (adjacent on the keyboard,
so more likely interchanged in typing).
40. Query mis-spellings
- Our principal focus here
- E.g., the query Alanis Morisett
- We can either
- Retrieve documents indexed by the correct spelling, OR
- Return several suggested alternative queries with the correct spelling
- Did you mean "Alanis Morissette"?
41. Isolated word correction
- Fundamental premise: there is a lexicon from which the correct spellings come
- Two basic choices for this
- A standard lexicon such as
- Webster's English Dictionary
- An industry-specific lexicon - hand-maintained
- The lexicon of the indexed corpus
- E.g., all words on the web
- All names, acronyms, etc.
- (Including the mis-spellings)
42. Isolated word correction
- Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q
- What's "closest"?
- We'll study several alternatives
- Edit distance
- Weighted edit distance
- n-gram overlap
43. Edit distance
- Given two strings S1 and S2, the minimum number of basic operations to convert one to the other
- Basic operations are typically character-level
- Insert
- Delete
- Replace
- E.g., the edit distance from cat to dog is 3.
- Generally found by dynamic programming (see the sketch below).
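A standard dynamic-programming sketch (unweighted Levenshtein; the weighted variant of the next slides would replace the unit costs with entries from a weight matrix):

```python
def edit_distance(s1, s2):
    m, n = len(s1), len(s2)
    # dp[i][j] = edits needed to turn s1[:i] into s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                    # i deletions
    for j in range(n + 1):
        dp[0][j] = j                    # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1    # replace cost
            dp[i][j] = min(dp[i - 1][j] + 1,             # delete
                           dp[i][j - 1] + 1,             # insert
                           dp[i - 1][j - 1] + cost)      # replace / keep
    return dp[m][n]

print(edit_distance("cat", "dog"))   # -> 3, as on the slide
```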
44. Edit distance
- Also called Levenshtein distance
- See http://www.merriampark.com/ld.htm for a nice example plus an applet to try on your own
45. Weighted edit distance
- As above, but the weight of an operation depends on the character(s) involved
- Meant to capture keyboard errors, e.g. m is more likely to be mis-typed as n than as q
- Therefore, replacing m by n is a smaller edit distance than replacing it by q
- (Same ideas usable for OCR, but with different weights)
- Requires a weight matrix as input
- Modify the dynamic programming to handle weights
46. Using edit distances
- Given a query, first enumerate all dictionary terms within a preset (weighted) edit distance
- (Some literature formulates weighted edit distance as a probability of the error)
- Then look up the enumerated dictionary terms in the term-document inverted index
- Slow, but no real fix
- Tries help
- Better implementations: see the Kukich and Zobel/Dart references.
47. Edit distance to all dictionary terms?
- Given a (mis-spelled) query, do we compute its edit distance to every dictionary term?
- Expensive and slow
- How do we cut the set of candidate dictionary terms?
- Here we use n-gram overlap for this
48. n-gram overlap
- Enumerate all the n-grams in the query string as well as in the lexicon
- Use the n-gram index to retrieve all lexicon terms matching any of the query n-grams
- Threshold by the number of matching n-grams
- Variants: weight by keyboard layout, etc.
49. Example with trigrams
- Suppose the text is november
- Trigrams are nov, ove, vem, emb, mbe, ber.
- The query is december
- Trigrams are dec, ece, cem, emb, mbe, ber.
- So 3 trigrams overlap (of 6 in each term)
- How can we turn this into a normalized measure of
overlap?
50. One option: Jaccard coefficient
- A commonly-used measure of overlap (remember dup detection)
- Let X and Y be two sets; then the J.C. is |X ∩ Y| / |X ∪ Y|
- Equals 1 when X and Y have the same elements and zero when they are disjoint
- X and Y don't have to be of the same size
- Always assigns a number between 0 and 1
- Now threshold to decide if you have a match
- E.g., if J.C. > 0.8, declare a match
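A tiny sketch tying this back to the trigram example two slides earlier:

```python
def ngrams(term, n=3):
    return {term[i:i + n] for i in range(len(term) - n + 1)}

def jaccard(x, y):
    """|X n Y| / |X u Y|: 0 for disjoint sets, 1 for identical sets."""
    return len(x & y) / len(x | y)

j = jaccard(ngrams("november"), ngrams("december"))
print(round(j, 2))   # 3 shared of 9 distinct trigrams -> 0.33, below a 0.8 threshold
```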
51. Matching n-grams
- Consider the query lord - we wish to identify words matching 2 of its 3 bigrams (lo, or, rd)
[Figure: bigram postings. lo → alone → lord → sloth; or → lord → morbid → border; rd → border → card → ardent]
Standard postings merge will enumerate
Adapt this to using Jaccard (or another) measure.
52. Caveat
- Even for isolated-word correction, the notion of an index token is critical - what's the unit we're trying to correct?
- In Chinese/Japanese, the notions of spell-correction and wildcards are poorly formulated/understood
53. Context-sensitive spell correction
- Text: I flew from Heathrow to Narita.
- Consider the phrase query "flew form Heathrow"
- We'd like to respond:
- Did you mean "flew from Heathrow"?
- because no docs matched the query phrase.
54. Context-sensitive correction
- Need surrounding context to catch this
- NLP too heavyweight for this.
- First idea: retrieve dictionary terms close (in weighted edit distance) to each query term
- Now try all possible resulting phrases with one word "fixed" at a time
- flew from heathrow
- fled form heathrow
- flea form heathrow
- etc.
- Suggest the alternative that has lots of hits?
55. Exercise
- Suppose that for "flew form Heathrow" we have 7 alternatives for flew, 19 for form and 3 for heathrow.
- How many corrected phrases will we enumerate in this scheme?
56. Another approach
- Break the phrase query into a conjunction of biwords
- Look for biwords that need only one term corrected.
- Enumerate phrase matches and rank them!
57. General issue in spell correction
- We will enumerate multiple alternatives for "Did you mean"
- Need to figure out which one (or a small number) to present to the user
- Use heuristics
- The alternative hitting most docs
- Query log analysis + tweaking
- For especially popular, topical queries
58. Computational cost
- Spell-correction is computationally expensive
- Avoid running routinely on every query?
- Run only on queries that matched few docs
59. Thesauri
- Thesaurus: language-specific list of synonyms for terms likely to be queried
- car → automobile, etc.
- Machine learning methods can assist
- Can be viewed as a hand-made alternative to edit-distance, etc.
60. Query expansion
- Usually do query expansion rather than index expansion
- No index blowup
- Query processing slowed down
- Docs frequently contain equivalences
- May retrieve more junk
- puma → jaguar retrieves documents on cars instead of on sneakers.
61. Resources for today's lecture
- IIR 2
- MG 3.6, 4.3; MIR 7.2
- Skip Lists theory: Pugh (1990)
- Multilevel skip lists give the same O(log n) efficiency as trees
- H.E. Williams, J. Zobel, and D. Bahle. 2004. Fast Phrase Querying with Combined Indexes. ACM Transactions on Information Systems.
- http://www.seg.rmit.edu.au/research/research.php?author=4
- D. Bahle, H. Williams, and J. Zobel. Efficient phrase querying with an auxiliary index. SIGIR 2002, pp. 215-221.
62. Resources
- MG 4.2
- Efficient spell retrieval:
- K. Kukich. Techniques for automatically correcting words in text. ACM Computing Surveys 24(4), Dec 1992.
- J. Zobel and P. Dart. Finding approximate matches in large lexicons. Software - Practice and Experience 25(3), March 1995. http://citeseer.ist.psu.edu/zobel95finding.html
- Nice, easy reading on spell correction:
- Mikael Tillenius: Efficient Generation and Ranking of Spelling Error Corrections. Master's thesis, Sweden's Royal Institute of Technology. http://citeseer.ist.psu.edu/179155.html