1
Probabilistic Ranking of Database Query Results
  • Gautam Das, Surajit Chaudhuri, Vagelis Hristidis,
    Gerhard Weikum

Presented by Z.M. Joseph, Spring 2006, CSE, UT Arlington

2
Introduction
  • Addresses the Many-Answers problem
  • A query that is not very selective has many matching
    tuples
  • Thus the results need some ranking
  • But the specified attributes match in every result tuple
  • So the ranking must look into the non-specified attributes

3
Challenge
  • How do you select based on non-specified
    attributes?
  • Difficult to get correlation information
  • Expensive to manage

4
Approach
  • Build on Probabilistic Information Retrieval (PIR)
  • Combine two scores:
  • Global Score: captures the global importance of
    unspecified attributes
  • Conditional Score: captures the strength of correlation
    between unspecified and specified attributes
  • Preprocessing at an intermediate knowledge
    representation layer

5
Recall from PIR
  • PIR already gives a relevance score for a tuple t
  • t can be broken down into X and Y
  • X: the set of specified attributes
  • Y: the set of unspecified attributes
  • R is the ideal set of result tuples
  • D is a single database table (used to approximate the
    complement of R)
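
The formulas on this slide were not captured in the transcript. A hedged reconstruction, assuming the standard PIR odds-ratio score and writing \bar{R} for the complement of R (a symbol that does not appear in the transcript):

```latex
\[
\mathrm{Score}(t) \;\propto\; \frac{p(R \mid t)}{p(\bar{R} \mid t)}
\;\propto\; \frac{p(t \mid R)}{p(t \mid \bar{R})}
\;\approx\; \frac{p(t \mid R)}{p(t \mid D)},
\qquad t = (X, Y)
\]
```

The last step assumes R is small relative to D, so D stands in for the complement of R.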

6
Structured Data
  • Simplifies to
  • This automatically raises the score of tuples whose
    unspecified attribute values occur more often in the
    ideal tuple set R
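
The simplified form is missing from the transcript. A sketch of how it plausibly follows, assuming every tuple in R satisfies the query (so p(X | R) = 1 and p(Y | X, R) = p(Y | R)) and that p(X | D) is the same constant for every result tuple:

```latex
\[
\mathrm{Score}(t) \;\propto\; \frac{p(X, Y \mid R)}{p(X, Y \mid D)}
= \frac{p(X \mid R)\, p(Y \mid X, R)}{p(X \mid D)\, p(Y \mid X, D)}
\;\propto\; \frac{p(Y \mid R)}{p(Y \mid X, D)}
\]
```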

7
Limited Independence Assumptions
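  • Structured data makes it possible to capture
    dependencies and correlations
  • Efficient approach: assume the values within X are
    independent of each other, and likewise the values
    within Y; dependencies between X and Y values are still
    allowed
  • This allows the score to be derived as a product of
    per-value terms
  • This assumption may not always be correct!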
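
The derived expression is not in the transcript. A reconstruction under the limited independence assumption, where C is a placeholder for any conditioning event (R, D, or a single attribute value) and is not a symbol from the slides:

```latex
\[
p(X \mid C) = \prod_{x \in X} p(x \mid C),
\qquad
p(Y \mid C) = \prod_{y \in Y} p(y \mid C)
\]
\[
\mathrm{Score}(t) \;\propto\;
\prod_{y \in Y} \frac{p(y \mid R)}{p(y \mid D)}
\;\prod_{y \in Y} \prod_{x \in X} \frac{1}{p(x \mid y, D)}
\]
```

Factors that are identical for every result tuple, such as powers of p(x | D), are dropped because they do not change the ranking.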

8
Workload-Based R Estimation
  • To use these techniques, the ideal result set R must
    be known, so it has to be estimated
  • Use statistics gathered from the workload W
  • View the workload as a set of "tuples", one per past
    query, each containing that query's specified attributes
  • Thus P(y|R) can be replaced with P(y|X,W)
  • Properties of R can be obtained by examining the
    workload for queries that also requested X in the past
9
Workload-Based R Estimation
  • Thus the ranking function is obtained
  • It does not contain R
  • All quantities are atomic and can be computed from the
    workload and the data
  • The first part is global, the second part is conditional
  • Association rules can be used for the conditional
    quantities
  • These values are stored in the intermediate knowledge
    representation layer
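
The ranking function itself is missing from the transcript. A hedged reconstruction, obtained by substituting p(y | X, W) for p(y | R) and applying the same independence assumption within X:

```latex
\[
\mathrm{Score}(t) \;\propto\;
\underbrace{\prod_{y \in Y} \frac{p(y \mid W)}{p(y \mid D)}}_{\text{global part}}
\;\cdot\;
\underbrace{\prod_{y \in Y} \prod_{x \in X}
  \frac{p(x \mid y, W)}{p(x \mid y, D)}}_{\text{conditional part}}
\]
```

Each conditional factor has the form of an association-rule confidence for the rule y → x, measured once over the workload and once over the data.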

10
Implementation
  • Atomic Probabilities Module: stores the atomic
    quantities in the intermediate knowledge
    representation layer
  • Index Module: uses its inputs and association rules
    to create the global and conditional scores
  • Scan Algorithm: selects the tuples that satisfy the
    query condition, then ranks them by their scores
  • List Merge Algorithm: an alternative to scanning
11
Conclusion
  • Gives a ranking for the Many-Answers problem by
    factoring in unspecified attributes
  • Automated
  • Makes use of workload statistics and correlations
  • Can still be adjusted by users and/or domain
    experts
  • Can use user feedback as well