Query processing: optimizations

Title: Web Algorithmics Author: Paolo Ferragina Last modified by: Paolo Ferragina Created Date: 9/18/2002 4:13:07 PM Document presentation format – PowerPoint PPT presentation

Slides: 19
Provided by: Paolo262

1
Query processing: optimizations
  • Paolo Ferragina
  • Dipartimento di Informatica
  • Università di Pisa

Reading 2.3
2
Augment postings with skip pointers (at indexing time)
Sec. 2.3
[Figure: two postings lists augmented with skip pointers; the skip targets shown are 41, 128, 11 and 31.]
  • How do we deploy them?
  • Where do we place them?

3
Using skips
Sec. 2.3
[Figure: two postings lists with skip pointers; the skip targets shown are 41, 128, 11 and 31.]
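Suppose we've stepped through the lists until we
process 8 on each list. We match it and advance.
We then have 41 on the upper list and 11 on the
lower. 11 is smaller. The skip successor of 11 on
the lower list is 31, which is still no larger than
41, so we can skip ahead past the intervening postings.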
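
The walk-through above can be sketched in code. This is a minimal sketch, not the slides' own implementation: it assumes each postings list is a sorted Python list of distinct doc IDs, and that skip pointers are implicit, leaving every position that is a multiple of `step` and pointing `step` positions ahead (with `step` about √L, an assumed placement). The function name and the two example lists in the usage note are hypothetical.

```python
import math


def intersect_with_skips(p1, p2):
    """Intersect two sorted lists of distinct doc IDs, following skip pointers when safe."""
    # Assumed placement: one implicit skip pointer every `step` positions.
    s1 = max(1, math.isqrt(len(p1)))
    s2 = max(1, math.isqrt(len(p2)))

    def can_skip(lst, pos, step, bound):
        # A skip is usable if `pos` carries a pointer and its target does not overshoot `bound`.
        return pos % step == 0 and pos + step < len(lst) and lst[pos + step] <= bound

    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            if can_skip(p1, i, s1, p2[j]):
                while can_skip(p1, i, s1, p2[j]):
                    i += s1  # jump over postings that cannot match
            else:
                i += 1
        else:
            if can_skip(p2, j, s2, p1[i]):
                while can_skip(p2, j, s2, p1[i]):
                    j += s2
            else:
                j += 1
    return answer
```

For example, with the hypothetical lists `[2, 4, 8, 41, 48, 64, 128]` and `[1, 2, 3, 8, 11, 17, 21, 31]`, the call returns `[2, 8]`; after matching 8, the lower list jumps from 11 over the intervening postings because the skip target is still no larger than 41.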
4
Placing skips
Sec. 2.3
  • Tradeoff:
  • More skips → shorter skip spans → more likely to skip.
    But lots of comparisons to skip pointers.
  • Fewer skips → longer skip spans → few successful
    skips. Fewer pointer comparisons.

5
Placing skips
Sec. 2.3
  • Simple heuristic: for postings of length L,
    use √L evenly-spaced skip pointers.
  • This ignores the distribution of query terms.
  • Easy if the index is relatively static.
  • This is definitely useful for an in-memory index.
  • On disk, the I/O cost of loading a bigger list can
    outweigh the gains!
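
A minimal sketch of the √L heuristic, assuming skip pointers are represented as (source index, target index) pairs over a list of length L. The function name and the pair representation are illustrative choices, not part of the slides.

```python
import math


def skip_positions(length):
    """Return (source, target) index pairs for about sqrt(length) evenly spaced skips."""
    step = max(1, math.isqrt(length))  # spacing between consecutive skip pointers
    # A pointer leaves `src` only if its target still falls inside the list.
    return [(src, src + step) for src in range(0, length - step, step)]
```

For a list of 16 postings this gives a spacing of 4 and three pointers: 0→4, 4→8 and 8→12.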

6
Placing skips, contd
Sec. 2.3
You can solve it by Shortest Path
  • What if we know a distribution of accesses, with $p_k$ the
    probability of accessing the k-th element of the inverted list?
  • $w(i,j) = \sum_{k=i}^{j} p_k$
  • $L^0(i,j)$ = average cost of accessing an item in
    the sublist from i to j $= \sum_{k=i}^{j} p_k \,(k-i+1)$
  • $L^1(1,n) = 1$ (first skip cmp) $+$ (cost to access
    the two sublists)
  • $= 1 + \min_{u>1} \left[\, w(1,u-1)\, L^0(1,u-1) + w(u,n)\, L^1(u,n) \,\right]$
  • $L^0(i,j)$ can be tabulated in $O(n^2)$ time
  • Computing $L^1(i,n)$ takes $O(n)$ time, given $L^1(j,n)$
    for $j > i$
  • Computing the total $L^1(1,n)$ takes $O(n^2)$ time
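
The dynamic program on this slide can be sketched as follows. This is one possible reading, with assumptions stated up front: the code tabulates unnormalized costs, C0(i,j) = Σ p_k (k−i+1) and C1(i,n), so that each comparison against a skip pointer contributes w(i,n) (it is paid only by accesses that reach position i); it also allows "place no further skip" as a base case, which the slide does not spell out. Prefix sums replace the O(n²) table for L0. All names are hypothetical.

```python
def optimal_skips(probs):
    """Return (expected cost, skip targets) for access probabilities probs[0..n-1].

    Positions are 1-based as on the slide; probs[k-1] plays the role of p_k.
    """
    n = len(probs)
    W = [0.0] * (n + 1)  # W[k] = p_1 + ... + p_k
    M = [0.0] * (n + 1)  # M[k] = 1*p_1 + ... + k*p_k
    for k in range(1, n + 1):
        W[k] = W[k - 1] + probs[k - 1]
        M[k] = M[k - 1] + k * probs[k - 1]

    def w(i, j):
        return W[j] - W[i - 1]

    def c0(i, j):
        # Cost of a plain scan of positions i..j: sum of p_k * (k - i + 1).
        return (M[j] - M[i - 1]) - (i - 1) * w(i, j)

    c1 = [0.0] * (n + 2)   # c1[i]: best cost for the suffix i..n
    nxt = [None] * (n + 2)  # nxt[i]: target of the skip placed at i, if any
    for i in range(n, 0, -1):
        best, arg = c0(i, n), None  # assumed base case: no further skip
        for u in range(i + 1, n + 1):
            # One skip comparison, then scan i..u-1 or continue from u.
            cand = w(i, n) + c0(i, u - 1) + c1[u]
            if cand < best:
                best, arg = cand, u
        c1[i], nxt[i] = best, arg

    skips, u = [], nxt[1] if n else None
    while u is not None:
        skips.append(u)
        u = nxt[u]
    return c1[1] if n else 0.0, skips
```

With all the probability mass on the last of four postings, `optimal_skips([0, 0, 0, 1.0])` places a single skip straight to position 4, for a cost of 2 comparisons instead of 4.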

7
Placing skips, contd
Sec. 2.3
  • What if the number p of skip pointers that can be
    allocated is also fixed?
  • Same as before, but we add the parameter p
  • $L^1_p(1,n) = 1 + \min_{u>1} \left[\, w(1,u-1)\, L^0(1,u-1) + w(u,n)\, L^1_{p-1}(u,n) \,\right]$
  • $L^1_0(i,j) = L^0(i,j)$, i.e. no pointers left, so
    scan
  • $L^1_i(j,n)$ takes $O(n)$ time (one min calculation) if
    the values $L^1_{i-1}(h,n)$ with $h > j$ are available
  • So $L^1_p(1,n)$ takes $O(p\,n^2)$ time
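
A sketch of the budgeted variant, under the same assumptions as an unnormalized-cost reading of the recurrence: costs are tabulated as Σ p_k (k−i+1) for a scan, each skip comparison contributes the probability mass w(i,n) of the accesses that reach it, and "at most `budget` skips" is allowed rather than exactly that many. The parameter is called `budget` here only to avoid clashing with the probabilities; all names are hypothetical.

```python
def optimal_skips_with_budget(probs, budget):
    """Return (expected cost, skip targets) using at most `budget` skip pointers."""
    n = len(probs)
    W = [0.0] * (n + 1)  # prefix sums of p_k
    M = [0.0] * (n + 1)  # prefix sums of k * p_k
    for k in range(1, n + 1):
        W[k] = W[k - 1] + probs[k - 1]
        M[k] = M[k - 1] + k * probs[k - 1]

    def w(i, j):
        return W[j] - W[i - 1]

    def c0(i, j):
        # Plain scan of positions i..j.
        return (M[j] - M[i - 1]) - (i - 1) * w(i, j)

    # c[b][i]: best cost for the suffix i..n with at most b skips left.
    c = [[0.0] * (n + 2) for _ in range(budget + 1)]
    nxt = [[None] * (n + 2) for _ in range(budget + 1)]
    for i in range(1, n + 1):
        c[0][i] = c0(i, n)  # no pointers left, so scan
    for b in range(1, budget + 1):
        for i in range(1, n + 1):
            best, arg = c0(i, n), None
            for u in range(i + 1, n + 1):
                cand = w(i, n) + c0(i, u - 1) + c[b - 1][u]
                if cand < best:
                    best, arg = cand, u
            c[b][i], nxt[b][i] = best, arg

    skips, b, i = [], budget, 1
    while n and b > 0 and nxt[b][i] is not None:
        i = nxt[b][i]
        skips.append(i)
        b -= 1
    return (c[budget][1] if n else 0.0), skips
```

The three nested loops (budget, start position, skip target) give the O(p n²) running time stated on the slide.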

8
Faster queries: caching?
  • Two opposite approaches
  • Cache the query results (exploits query
    locality)
  • Cache pages of posting lists (exploits term
    locality)

9
Query processing: phrase queries and positional
indexes
  • Paolo Ferragina
  • Dipartimento di Informatica
  • Università di Pisa

Reading 2.4
10
Phrase queries
Sec. 2.4
  • Want to be able to answer queries such as
    "stanford university" as a phrase
  • Thus the sentence "I went at Stanford my
    university" is not a match.

11
Solution 1: Biword indexes
Sec. 2.4.1
  • For example, the text "Friends, Romans,
    Countrymen" would generate the biwords:
  • friends romans
  • romans countrymen
  • Each of these biwords is now an entry in the
    dictionary
  • Two-word phrase query-processing is immediate.
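
A minimal sketch of building a biword index, assuming documents are given as a dict from doc ID to text and that tokens are lowercased alphanumeric runs. The tokenizer and function names are illustrative, not the slides' own.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Assumed tokenizer: lowercase alphanumeric runs, punctuation dropped.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_biword_index(docs):
    """Map every pair of consecutive tokens to the set of docs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        for biword in zip(tokens, tokens[1:]):
            index[biword].add(doc_id)
    return index
```

Indexing `{1: "Friends, Romans, Countrymen"}` produces exactly the two dictionary entries `("friends", "romans")` and `("romans", "countrymen")`, so a two-word phrase query is a single dictionary lookup.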

12
Longer phrase queries
Sec. 2.4.1
  • Longer phrases are processed by reducing them to
    biword queries combined with AND
  • "stanford university palo alto" can be broken into
    a Boolean query on biwords, such as
  • "stanford university" AND "university palo" AND
    "palo alto"
  • Need the docs to verify that the full phrase really occurs
  • Biwords are combined with other solutions

Can have false positives! The index blows up.
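
The reduction to biwords, and the need to verify against the documents, can be sketched as follows. This is a self-contained illustration with hypothetical names and an assumed tokenizer; the verification step simply rescans the text of each candidate doc.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Assumed tokenizer: lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_biword_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        for biword in zip(tokens, tokens[1:]):
            index[biword].add(doc_id)
    return index


def contains_phrase(tokens, phrase):
    m = len(phrase)
    return any(tokens[i:i + m] == phrase for i in range(len(tokens) - m + 1))


def phrase_query(phrase_text, docs, index):
    """Return (candidates from the biword AND, docs verified to contain the phrase)."""
    phrase = tokenize(phrase_text)
    biwords = list(zip(phrase, phrase[1:]))
    if not biwords:
        return set(), set()  # single-word queries are outside this sketch
    candidates = set.intersection(*(index.get(bw, set()) for bw in biwords))
    verified = {d for d in candidates if contains_phrase(tokenize(docs[d]), phrase)}
    return candidates, verified
```

A document such as "university palo alto and stanford university" contains all three biwords of "stanford university palo alto" but not the phrase itself, so it survives the AND and is removed only by the verification step.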
13
Solution 2: Positional indexes
Sec. 2.4.2
  • In the postings, store for each term and document
    the position(s) in which that term occurs
  • <term, number of docs containing term;
  • doc1: position1, position2, ...;
  • doc2: position1, position2, ...;
  • etc.>

14
Processing a phrase query
Sec. 2.4.2
  • Query: "to be or not to be"
  • to:
  • 2: 1,17,74,222,551; 4: 8,16,190,429,433;
    7: 13,23,191; ...
  • be:
  • 1: 17,19; 4: 17,191,291,430,434; 5: 14,19,101; ...
  • Same general method for proximity searches

15
Query term proximity
Sec. 7.2.2
  • Free text queries: just a set of terms typed into
    the query box, common on the web
  • Users prefer docs in which query terms occur
    within close proximity of each other
  • Would like the scoring function to take this into
    account: how?
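
One common ingredient for such a score, not spelled out on this slide, is the width of the smallest window in a document that contains all query terms. The sketch below computes it from per-term position lists with a sliding window; the function name is hypothetical, and how the width is then folded into the score is left open.

```python
def smallest_window(position_lists):
    """Width (in positions) of the smallest window covering one occurrence of every term.

    `position_lists[t]` holds the positions of query term t in one document.
    Returns None if some term does not occur at all.
    """
    events = sorted((pos, t) for t, plist in enumerate(position_lists) for pos in plist)
    need = len(position_lists)
    count = [0] * need  # occurrences of each term inside the current window
    covered = 0
    best = None
    left = 0
    for pos, t in events:
        if count[t] == 0:
            covered += 1
        count[t] += 1
        # Shrink from the left while the window still covers every term.
        while covered == need:
            width = pos - events[left][0] + 1
            if best is None or width < best:
                best = width
            left_term = events[left][1]
            count[left_term] -= 1
            if count[left_term] == 0:
                covered -= 1
            left += 1
    return best
```

For three terms at positions `[1, 10, 20]`, `[12, 30]` and `[11, 40]`, the smallest window spans positions 10 to 12, so the width is 3; a smaller width suggests a better match.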

16
Positional index size
Sec. 2.4.2
  • You can compress position values/offsets
  • Nevertheless, a positional index expands postings
    storage by a factor of 2-4 in English
  • Nevertheless, a positional index is now commonly
    used because of the power and usefulness of
    phrase and proximity queries whether used
    explicitly or implicitly in a ranking retrieval
    system.

17
Combination schemes
Sec. 2.4.3
  • Biword + positional index is a profitable
    combination
  • Biword is particularly useful for particular
    phrases ("Michael Jackson", "Britney Spears")
  • More complicated mixing strategies do exist!

18
Soft-AND
Sec. 7.2.3
  • E.g. query "rising interest rates"
  • Run the query as a phrase query
  • If <K docs contain the phrase "rising interest
    rates", run the two phrase queries "rising interest"
    and "interest rates"
  • If we still have <K docs, run the vector space
    query "rising interest rates" (see next)
  • Rank the matching docs (see next)
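
The cascade above can be sketched as a small driver. This is only an outline under stated assumptions: `phrase_search` and `vector_search` are hypothetical callables supplied by the caller (each takes a list of terms and returns doc IDs), longer queries are relaxed to all consecutive two-word phrases, and the final ranking step is omitted.

```python
def soft_and(terms, k, phrase_search, vector_search):
    """Collect at least k candidate docs by progressively relaxing the query."""
    results = list(phrase_search(terms))  # step 1: the full phrase

    def extend(docs):
        for doc in docs:
            if doc not in results:
                results.append(doc)

    if len(results) < k and len(terms) > 2:
        # Step 2: the shorter phrase queries (here, consecutive term pairs).
        for i in range(len(terms) - 1):
            extend(phrase_search(terms[i:i + 2]))
    if len(results) < k:
        # Step 3: fall back to the vector space query on the bag of terms.
        extend(vector_search(terms))
    return results
```

Docs found by stricter steps stay ahead of those added later, which gives a natural starting order before the actual ranking is applied.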