Query processing: optimizations - PowerPoint PPT Presentation

About This Presentation

Title:

Query processing: optimizations

Description:

Title: Web Algorithmics Author: Paolo Ferragina Last modified by: Paolo Ferragina Created Date: 9/18/2002 4:13:07 PM Document presentation format – PowerPoint PPT presentation

Slides: 19

Provided by: Paolo262

1
Query processing: optimizations

Paolo Ferragina
Dipartimento di Informatica
Università di Pisa

Reading 2.3
2
Augment postings with skip pointers (at indexing
time)
Sec. 2.3
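[Figure: two postings lists with skip pointers; skip targets 41 and 128 on the upper list, 11 and 31 on the lower list]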

How do we deploy them?
Where do we place them?

3
Using skips
Sec. 2.3
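[Figure: two postings lists with skip pointers; skip targets 41 and 128 on the upper list, 11 and 31 on the lower list]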
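Suppose we've stepped through the lists until we process 8 on each list. We match it and advance.
We then have 41 on the upper list and 11 on the lower. 11 is smaller.
The skip pointer at 11 leads to 31, which is still no larger than 41, so we can follow it and skip the postings in between.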
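
The walk-through above can be sketched in Python. This is a minimal illustration, not the course's reference code: the two doc-ID lists are made-up example data chosen to agree with the skip targets named on the slide, and the representation of skip pointers as an `{index: target_index}` dictionary is an assumption.

```python
def intersect_with_skips(a, skips_a, b, skips_b):
    """Intersect two sorted doc-ID lists.

    skips_a / skips_b map a list index to the index its skip pointer
    leads to (hypothetical layout, chosen for readability).
    """
    answer, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            answer.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            target = skips_a.get(i)
            # Follow the skip only if it does not jump past b[j].
            if target is not None and a[target] <= b[j]:
                i = target
            else:
                i += 1
        else:
            target = skips_b.get(j)
            if target is not None and b[target] <= a[i]:
                j = target
            else:
                j += 1
    return answer


# Assumed example lists: skips lead to 41 and 128 (upper), 11 and 31 (lower).
upper = [2, 4, 8, 41, 48, 64, 128]
lower = [1, 2, 3, 8, 11, 17, 21, 31]
common = intersect_with_skips(upper, {0: 3, 3: 6}, lower, {0: 4, 4: 7})
```

After matching 8, the loop compares 41 with 11, follows the skip from 11 to 31 and so never inspects 17 and 21.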
4
Placing skips
Sec. 2.3

Tradeoff
More skips → shorter spans → more likely to skip. But lots of comparisons to skip pointers.
Fewer skips → longer spans → few successful skips. Fewer pointer comparisons.

5
Placing skips
Sec. 2.3

Simple heuristic: for a postings list of length L, use √L evenly-spaced skip pointers.
This ignores the distribution of query terms.
Easy if the index is relatively static.
This is definitely useful for an in-memory index.
For a disk-based index, the I/O cost of loading a bigger list can outweigh the gains!
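
A small Python sketch of the √L heuristic, assuming skip pointers are stored as an `{index: target_index}` dictionary (a hypothetical layout; the function name is illustrative too). The span is the integer square root of L, so the number of pointers is roughly √L.

```python
import math


def sqrt_skip_pointers(postings):
    """Return {index: target_index} with evenly spaced skips of span ~sqrt(L)."""
    n = len(postings)
    span = math.isqrt(n)
    if span < 2:
        return {}  # list too short for skips to pay off
    # Stop early enough that every target index stays inside the list.
    return {i: i + span for i in range(0, n - span, span)}


skips = sqrt_skip_pointers(list(range(100, 116)))  # L = 16, span = 4
```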

6
Placing skips, contd
Sec. 2.3
You can solve it by Shortest Path

What if we know the probability $p_k$ of accessing the k-th element of the inverted list?
$w(i,j) = \sum_{k=i}^{j} p_k$
$L^0(i,j)$ = average cost of accessing an item in the sublist from i to j $= \sum_{k=i}^{j} p_k \, (k-i+1)$
$L^1(1,n) = 1$ (first skip comparison) + (cost to access the two lists)
$= 1 + \min_{u>1} \left[ w(1,u-1) \, L^0(1,u-1) + w(u,n) \, L^1(u,n) \right]$
$L^0(i,j)$ can be tabulated in $O(n^2)$ time
Computing $L^1(i,n)$ takes $O(n)$ time, given $L^1(j,n)$ for $j > i$
Computing the total $L^1(1,n)$ takes $O(n^2)$ time
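
The dynamic program can be sketched in Python under one reading of the recurrence, which is an assumption rather than something the slide spells out: costs are kept probability-weighted (unnormalized), every access that reaches position i pays 1 for the skip comparison there (the term `w(i, n)`), and placing no further skip pointer is always allowed. The function name and the prefix-sum arrays are illustrative.

```python
def optimal_skips(p):
    """p[k-1] = probability of accessing the k-th posting (assumed known).

    Returns (expected cost, 1-based positions reached by the chosen skips).
    """
    n = len(p)
    P = [0.0] * (n + 1)   # P[j]  = p_1 + ... + p_j
    PK = [0.0] * (n + 1)  # PK[j] = 1*p_1 + ... + j*p_j
    for k in range(1, n + 1):
        P[k] = P[k - 1] + p[k - 1]
        PK[k] = PK[k - 1] + k * p[k - 1]

    def w(i, j):
        return P[j] - P[i - 1]

    def scan(i, j):
        # sum_{k=i..j} p_k * (k - i + 1): linear scan starting at i
        return (PK[j] - PK[i - 1]) - (i - 1) * w(i, j)

    best = [0.0] * (n + 2)     # best[i]: minimum cost for the suffix i..n
    choice = [None] * (n + 2)  # choice[i]: skip target u, or None (just scan)
    for i in range(n, 0, -1):
        best[i] = scan(i, n)
        for u in range(i + 1, n + 1):
            cost = w(i, n) + scan(i, u - 1) + best[u]
            if cost < best[i]:
                best[i], choice[i] = cost, u

    targets, i = [], 1
    while choice[i] is not None:
        i = choice[i]
        targets.append(i)
    return best[1], targets


# Skewed access pattern: the last posting is requested 95% of the time.
cost, targets = optimal_skips([0.01, 0.01, 0.01, 0.01, 0.01, 0.95])
```

Each `best[i]` takes O(n) work given the later entries, so the whole table costs O(n²), in line with the bound stated above.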

7
Placing skips, contd
Sec. 2.3

What if the number p of skip pointers that can be allocated is also fixed?
Same as before, but we add the parameter p:
$L^1_p(1,n) = 1 + \min_{u>1} \left[ w(1,u-1) \, L^0(1,u-1) + w(u,n) \, L^1_{p-1}(u,n) \right]$
$L^1_0(i,j) = L^0(i,j)$, i.e. no pointers left, so scan
Computing $L^1_i(j,n)$ takes $O(n)$ time (one min calculation) if the values $L^1_{i-1}(h,n)$ are available for $h > j$
So $L^1_p(1,n)$ takes $O(p n^2)$ time
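
A Python sketch of the budgeted variant, with the same assumptions as any unnormalized reading of the recurrence: costs are probability-weighted, each access reaching a skip pointer pays 1 for the comparison, and leaving part of the budget unused is allowed. The tables `w` and `scan` and the function name are illustrative; `scan` is tabulated in O(n²) as the slides suggest.

```python
def bounded_skip_cost(p, budget):
    """Minimum expected access cost using at most `budget` skip pointers.

    p[k-1] = probability of accessing the k-th posting (assumed known).
    """
    n = len(p)
    # w[i][j] = p_i + ... + p_j ; scan[i][j] = sum p_k * (k - i + 1)
    w = [[0.0] * (n + 1) for _ in range(n + 2)]
    scan = [[0.0] * (n + 1) for _ in range(n + 2)]
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            w[i][j] = w[i][j - 1] + p[j - 1]
            scan[i][j] = scan[i][j - 1] + p[j - 1] * (j - i + 1)

    # cost[q][i]: best cost for the suffix i..n with at most q pointers
    cost = [[0.0] * (n + 2) for _ in range(budget + 1)]
    for i in range(1, n + 1):
        cost[0][i] = scan[i][n]  # no pointers left, so scan
    for q in range(1, budget + 1):
        for i in range(1, n + 1):
            best = cost[q - 1][i]  # option: do not place a pointer here
            for u in range(i + 1, n + 1):
                best = min(best, w[i][n] + scan[i][u - 1] + cost[q - 1][u])
            cost[q][i] = best
    return cost[budget][1]


probs = [0.01, 0.01, 0.01, 0.01, 0.01, 0.95]
costs = [bounded_skip_cost(probs, q) for q in range(3)]
```

The three nested loops give the O(pn²) running time stated above.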

8
Faster queries: caching?

Two opposite approaches:
Cache the query results (exploits query
locality)
Cache pages of posting lists (exploits term
locality)
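
The two approaches can be contrasted in a toy Python sketch. Everything here is an assumption made for illustration: the tiny in-memory `INDEX`, the function names, and the use of `functools.lru_cache`; it also caches whole posting lists rather than pages of them.

```python
from functools import lru_cache

INDEX = {  # stand-in for an on-disk index (made-up data)
    "web": [1, 2, 5, 8],
    "search": [2, 3, 5, 9],
    "engine": [2, 5, 7],
}
DISK_READS = []  # records every simulated disk access


@lru_cache(maxsize=1024)
def load_postings(term):
    """Posting-list cache: exploits term locality."""
    DISK_READS.append(term)
    return tuple(INDEX.get(term, ()))


@lru_cache(maxsize=1024)
def _cached_and_query(terms):
    result = None
    for term in terms:
        docs = set(load_postings(term))
        result = docs if result is None else result & docs
    return tuple(sorted(result or ()))


def and_query(query):
    """Query-result cache: exploits query locality."""
    # Normalize so that "web search" and "search web" share one cache entry.
    return _cached_and_query(tuple(sorted(set(query.split()))))


first = and_query("web search")    # reads both lists from "disk"
again = and_query("search web")    # answered from the result cache
other = and_query("web engine")    # new query, but "web" is already cached
```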

9
Query processing: phrase queries and positional indexes

Paolo Ferragina
Dipartimento di Informatica
Università di Pisa

Reading 2.4
10
Phrase queries
Sec. 2.4

Want to be able to answer queries such as "stanford university" as a phrase
Thus the sentence "I went at Stanford my university" is not a match.

11
Solution 1: Biword indexes
Sec. 2.4.1

For example, the text "Friends, Romans, Countrymen" would generate the biwords:
friends romans
romans countrymen
Each of these biwords is now an entry in the dictionary
Two-word phrase query-processing is immediate.
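
A minimal Python sketch of building a biword index. The tokenizer (lowercase, alphanumeric runs) and the function names are assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Simplistic tokenizer, assumed for illustration only.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_biword_index(docs):
    """docs: {doc_id: text}. Returns {"w1 w2": sorted list of doc IDs}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        for first, second in zip(tokens, tokens[1:]):
            index[f"{first} {second}"].add(doc_id)
    return {biword: sorted(ids) for biword, ids in index.items()}


biword_index = build_biword_index({1: "Friends, Romans, Countrymen"})
```

A two-word phrase query is then a single dictionary lookup, e.g. `biword_index.get("friends romans", [])`.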

12
Longer phrase queries
Sec. 2.4.1

Longer phrases are processed by reducing them to biword queries combined with AND
"stanford university palo alto" can be broken into the Boolean query on biwords:
"stanford university" AND "university palo" AND "palo alto"
We need to check the docs themselves to verify the match
Biword indexes are usually combined with other solutions

Can have false positives! The index blows up
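
The false-positive problem can be seen in a short Python sketch. The two documents are made-up example data, and the helper names are illustrative; whitespace tokenization is assumed for brevity.

```python
def biwords(text):
    words = text.lower().split()
    return [f"{a} {b}" for a, b in zip(words, words[1:])]


def build_index(docs):
    index = {}
    for doc_id, text in docs.items():
        for bw in biwords(text):
            index.setdefault(bw, set()).add(doc_id)
    return index


def candidates(index, phrase):
    """AND together the postings of all biwords of the phrase."""
    sets = [index.get(bw, set()) for bw in biwords(phrase)]
    return sorted(set.intersection(*sets)) if sets else []


def contains_phrase(text, phrase):
    """Verification step: look for the exact word sequence in the document."""
    t, q = text.lower().split(), phrase.lower().split()
    return any(t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1))


docs = {
    1: "stanford university palo alto campus",
    2: "palo alto is near stanford university and university palo verde",
}
query = "stanford university palo alto"
found = candidates(build_index(docs), query)
verified = [d for d in found if contains_phrase(docs[d], query)]
```

Document 2 contains all three biwords but not the four-word phrase, so it survives the Boolean AND and is only rejected when the document itself is checked.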
13
Solution 2: Positional indexes
Sec. 2.4.2

In the postings, store for each term and document the position(s) in which that term occurs:
<term, number of docs containing term;
doc1: position1, position2 ...;
doc2: position1, position2 ...;
etc.>
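
One way to hold such postings in memory is a nested dictionary, sketched below in Python. The layout `{term: {doc_id: [positions]}}`, the 1-based word offsets and the whitespace tokenization are all assumptions made for the example.

```python
from collections import defaultdict


def build_positional_index(docs):
    """docs: {doc_id: text}. Returns {term: {doc_id: [positions]}}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split(), start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return dict(index)


pos_index = build_positional_index({1: "to be or not to be", 2: "be quick"})
doc_freq_be = len(pos_index["be"])  # number of docs containing "be"
```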

14
Processing a phrase query
Sec. 2.4.2

"to be or not to be".
to:
2: 1,17,74,222,551; 4: 8,16,190,429,433; 7: 13,23,191; ...
be:
1: 17,19; 4: 17,191,291,430,434; 5: 14,19,101; ...
Same general method for proximity searches
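
A Python sketch of one step of this processing, matching "to be" on the postings shown above, read here as docID: positions (that reading of the flattened numbers is an assumption). The function names are illustrative, and the proximity variant uses a simple quadratic check rather than a linear merge.

```python
TO = {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]}
BE = {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]}


def phrase_match(first, second, offset=1):
    """Docs and start positions where `second` occurs exactly
    `offset` words after `first`."""
    hits = {}
    for doc in sorted(first.keys() & second.keys()):
        later = set(second[doc])
        starts = [pos for pos in first[doc] if pos + offset in later]
        if starts:
            hits[doc] = starts
    return hits


def within_k(first, second, k):
    """Proximity variant: docs where the two terms occur within k words."""
    return [
        doc
        for doc in sorted(first.keys() & second.keys())
        if any(abs(a - b) <= k for a in first[doc] for b in second[doc])
    ]


to_be = phrase_match(TO, BE)
near = within_k(TO, BE, 5)
```

The full phrase "to be or not to be" would chain the same check across all six terms with offsets 1 to 5 from the first "to".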

15
Query term proximity
Sec. 7.2.2

Free text queries: just a set of terms typed into the query box, as is common on the web
Users prefer docs in which query terms occur within close proximity of each other
We would like the scoring function to take this into account. How?
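
One possible ingredient for such a scoring function is the width of the smallest window in a document that contains every query term. The Python sketch below computes it from per-term position lists; treating this width as the proximity signal is an assumption for illustration, not a method stated on the slide, and the positions are made-up data.

```python
def smallest_window(positions_by_term):
    """positions_by_term: {term: [positions within one document]}.

    Returns the width in words of the smallest window containing all
    terms, or None if some term does not occur.
    """
    if not positions_by_term or any(not ps for ps in positions_by_term.values()):
        return None
    events = sorted((p, t) for t, ps in positions_by_term.items() for p in ps)
    need = len(positions_by_term)
    counts, have, best, left = {}, 0, None, 0
    for right_pos, term in events:
        counts[term] = counts.get(term, 0) + 1
        if counts[term] == 1:
            have += 1
        # Shrink from the left while the window still covers every term.
        while have == need:
            left_pos, left_term = events[left]
            width = right_pos - left_pos + 1
            if best is None or width < best:
                best = width
            counts[left_term] -= 1
            if counts[left_term] == 0:
                have -= 1
            left += 1
    return best


width = smallest_window({"rising": [3, 40], "interest": [10, 41], "rates": [42, 90]})
```

A smaller width suggests the terms appear close together, so a scorer could, for example, boost documents with a small window.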

16
Positional index size
Sec. 2.4.2

You can compress position values/offsets
Nevertheless, a positional index expands postings storage by a factor of 2-4 for English
Even so, a positional index is now commonly used because of the power and usefulness of phrase and proximity queries, whether used explicitly or implicitly in a ranking retrieval system.

17
Combination schemes
Sec. 2.4.3

Biword + positional index is a profitable combination
Biwords are particularly useful for certain phrases ("Michael Jackson", "Britney Spears")
More complicated mixing strategies do exist!

18
Soft-AND
Sec. 7.2.3

E.g. query "rising interest rates"
Run the query as a phrase query
If <K docs contain the phrase "rising interest rates", run the two phrase queries "rising interest" and "interest rates"
If we still have <K docs, run the vector space query "rising interest rates" (see next)
Rank the matching docs (see next)
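
The cascade above can be sketched in Python. The two callables `phrase_search` and `vector_search` are hypothetical stand-ins for the real retrieval functions, and taking the union of the results collected at each stage is an assumption; ranking is left out.

```python
def soft_and(query, k, phrase_search, vector_search):
    """Collect matching docs, relaxing the query until at least k are found.

    phrase_search(phrase) and vector_search(query) must each return an
    iterable of doc IDs (hypothetical interfaces).
    """
    words = query.split()
    docs = set(phrase_search(query))              # 1. the whole phrase
    if len(docs) < k:
        for a, b in zip(words, words[1:]):        # 2. the two-word phrases
            docs |= set(phrase_search(f"{a} {b}"))
    if len(docs) < k:
        docs |= set(vector_search(query))         # 3. free-text query
    return docs                                   # ranking happens afterwards
```

With a toy collection, raising K pulls in progressively looser matches: only exact-phrase documents at first, then documents containing either shorter phrase, then any document sharing a query term.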
