Query Processing over Incomplete Autonomous Web Databases

Slides: 36
Provided by: jaewo

Transcript and Presenter's Notes


1
Query Processing over Incomplete Autonomous Web
Databases
  • MS Thesis Defense
  • by Hemal Khatri
  • Committee Members
  • Prof. Subbarao Kambhampati (chair)
  • Prof. Chitta Baral
  • Prof. Yi Chen
  • Prof. Huan Liu

2
Introduction to Web databases
  • Many websites let users query through a
    form-based interface and are supported by backend
    databases
  • Consider used-car websites such as
    Cars.com, Yahoo! Autos, etc.

3
Incompleteness in Web databases
  • Web databases are often populated by lay individuals
    without any curation, e.g., Cars.com and Yahoo!
    Autos
  • Web databases are being populated using automated
    information extraction techniques which are
    inherently imperfect
  • The local schema of data sources may not support
    certain attributes supported by the global schema
  • Incomplete/uncertain tuple: a tuple in which one
    or more attributes have a missing value

4
Problem Statement
  • Many entities corresponding to tuples with
    missing values might be relevant to the user
    query
  • Current query processing techniques return
    answers that exactly satisfy the user query
  • Such techniques return results with high
    precision but low recall
  • Relevant uncertain tuple: a tuple that does not
    exactly satisfy the query predicates, but whose
    underlying entity might still be relevant to the
    query
  • How to support query processing over incomplete
    autonomous databases in order to retrieve ranked
    uncertain results?

Example query Q: Make = Honda
5
Challenges Involved
  • How to predict missing values in autonomous
    databases?
  • As autonomous databases are accessible only
    through form-based interfaces, how to retrieve
    relevant uncertain answers?
  • How to keep query processing cost manageable in
    retrieving uncertain tuples?
  • How to rank the retrieved uncertain answers?

6
Related Work
  • Probabilistic databases
  • Incomplete databases are similar to probabilistic
    databases once we assess the probabilities for
    missing values
  • TRIO: uncertainty with lineage
  • ConQuer: handling inconsistency over databases
  • These systems assume probability distributions are
    given for uncertain or inconsistent attributes
  • We assess the probability distribution for a missing
    attribute and use it to rank rewritten queries that
    retrieve relevant answers, since the probabilities
    cannot be stored in the databases
  • Our query rewriting framework is general and can
    be used by these systems if the databases are
    autonomous
  • Handling Missing Values
  • EM algorithm, Bayes Net, Association rules

7
Possible Approaches
  • For a query Q: body style = convt
  • 1. Certain Answers Only (CAO): return certain
    answers only, as in traditional databases
  • 2. All Uncertain Answers (AUA): null matches any
    concrete value, hence return all answers having
    body style = convt along with answers having body
    style as null
  • 3. Relevant Uncertain Answers (RUA): rank
    answers by predicting values of the missing attribute
CAO: low recall
AUA: low precision, infeasible
RUA: costly, infeasible
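
The three answer semantics above can be compared on a toy table. The following is a minimal sketch, not the thesis implementation: the rows, the hidden ground truth, and the predicted probabilities are made up for illustration, and the code assumes direct access to every row, which an autonomous source behind a form interface would not provide.

```python
# Toy comparison of CAO, AUA and RUA for Q: body style = convt.
# ROWS, TRUTH and P_CONVT are illustrative assumptions, not data from the thesis.

ROWS = [
    # (id, model, stored body style; None means the value is missing)
    (1, "z4", "convt"),
    (2, "z4", None),
    (3, "a4", None),
    (4, "a4", "sedan"),
    (5, "boxster", None),
]
# Hypothetical ground truth for the missing values, used only for scoring.
TRUTH = {2: "convt", 3: "sedan", 5: "convt"}
# Hypothetical P(body style = convt | model), standing in for a learned predictor.
P_CONVT = {"z4": 1.0, "boxster": 0.7, "a4": 0.4}


def cao(rows, value):
    """Certain answers only: the stored value must match the query."""
    return [r[0] for r in rows if r[2] == value]


def aua(rows, value):
    """All uncertain answers: a null matches any concrete value."""
    return [r[0] for r in rows if r[2] == value or r[2] is None]


def rua(rows, value, prob):
    """Certain answers first, then incomplete tuples ranked by predicted probability."""
    incomplete = sorted(
        (r for r in rows if r[2] is None), key=lambda r: prob[r[1]], reverse=True
    )
    return cao(rows, value) + [r[0] for r in incomplete]


def precision_recall(answer_ids, rows, value):
    """Score an answer set against the stored values plus the hidden ground truth."""
    actual = {r[0]: (r[2] if r[2] is not None else TRUTH[r[0]]) for r in rows}
    relevant = {i for i, v in actual.items() if v == value}
    hits = len(relevant & set(answer_ids))
    return hits / len(answer_ids), hits / len(relevant)
```

On this toy table CAO returns only tuple 1, AUA also returns every tuple with a null (including the irrelevant tuple 3), and RUA returns the same tuples as AUA but orders the incomplete ones by how likely they are to be convertibles.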
8
Outline
  • Introduction
  • QPIAD: Query Processing over Incomplete
    Autonomous Databases
  • Data Integration over Incomplete Autonomous
    Databases
  • Other Contributions
  • Conclusion

9
QPIAD System Architecture
10
RRUA: Generating Rewritten Queries
  • The Restricted Relevant Uncertain Answers (RRUA)
    approach retrieves only relevant incomplete
    tuples instead of retrieving all tuples as in AUA
    and RUA
  • Consider a query Q: Body style = convt

Base Result Set: RS(Q)
Rewritten queries are based on the determining
set from the AFD for Body style: Model -> Body
style (confidence 0.9)
Q1: model = a4; Q2: model = z4; Q3: model = boxster
Determining Attribute set (dtrSet)
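
The rewriting step on this slide can be sketched as follows. This is a minimal illustration under stated assumptions: rows are plain dictionaries, the helper names `rewritten_queries` and `keep_incomplete` are hypothetical, and the makes in the toy base result set are invented. It only shows how distinct determining-set values from RS(Q) turn into rewritten queries.

```python
def rewritten_queries(base_result_set, dtr_set):
    """One rewritten query per distinct determining-set binding seen in RS(Q)."""
    queries = []
    for row in base_result_set:
        binding = {attr: row[attr] for attr in dtr_set}
        if binding not in queries:  # keep first-seen order, drop duplicates
            queries.append(binding)
    return queries


def keep_incomplete(rows, query_attr):
    """Assumed post-filter: keep only tuples whose query attribute is missing,
    since tuples with a stored value are already certain answers or non-answers."""
    return [r for r in rows if r.get(query_attr) is None]


# Toy base result set for Q: body style = convt (illustrative values only).
base = [
    {"make": "Audi", "model": "a4", "body_style": "convt"},
    {"make": "BMW", "model": "z4", "body_style": "convt"},
    {"make": "Porsche", "model": "boxster", "body_style": "convt"},
    {"make": "BMW", "model": "z4", "body_style": "convt"},
]
queries = rewritten_queries(base, ["model"])
```

With the determining set {Model}, the toy base result set yields the three queries named on the slide: model = a4, model = z4, and model = boxster.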
11
Learning Attribute Correlations
  • AFD: VIN -> Model, where VIN is an Approximate
    Key (AKey) with high confidence
  • VIN will not be useful for query rewriting and
    feature selection since it will not be able to
    retrieve additional new tuples

12
RRUA: Ranking Rewritten Queries
  • Not all queries are equally good at retrieving
    relevant answers
  • Cars with model z4 are more likely to be
    convertibles than cars with model a4
  • When database or network resources are limited,
    the mediator can choose to issue the top K
    queries to get the most relevant uncertain
    answers

13
Learning Value Distributions
  • Used to rank queries based on the determining set
    of attributes from the AFD for the query attribute
  • We use a Naïve Bayes classifier with m-estimates,
    with the AFD serving as a feature selection step
  • Rank of a rewritten query Qi: R(Qi) = P(Am = vm | ti),
    where ti ∈ π_dtrSet(Am)(RS(Q))
  • Q1: model = a4, R(Q1) = P(body style = convt | model = a4)
    = 0.4
  • Q2: model = z4, R(Q2) = P(body style = convt | model = z4)
    = 1.0
  • Q3: model = boxster, R(Q3) = P(body style = convt | model = boxster)
    = 0.7
  • R(Q2) > R(Q3) > R(Q1)
  • Relevant uncertain answers are ranked based on
    the rank of the rewritten query that retrieved them
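
The ranking idea on this slide can be sketched with a small Naïve Bayes estimator. This is an illustrative sketch, not the thesis code: the training sample is invented, the m-estimate uses a uniform prior over each feature's observed domain (an assumption), and the function names `train`, `posterior`, and `rank_queries` are hypothetical.

```python
from collections import Counter


def train(sample, features, target):
    """Collect the counts needed for Naive Bayes over the determining-set features."""
    class_counts = Counter(r[target] for r in sample)
    feat_counts = Counter((f, r[f], r[target]) for r in sample for f in features)
    domain_size = {f: len({r[f] for r in sample}) for f in features}
    return class_counts, feat_counts, domain_size


def posterior(model, features, binding, m=1.0):
    """P(target = c | binding) with m-estimate smoothing of P(feature value | c)."""
    class_counts, feat_counts, domain_size = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total
        for f in features:
            p = 1.0 / domain_size[f]  # assumed uniform prior for the m-estimate
            score *= (feat_counts[(f, binding[f], c)] + m * p) / (n_c + m)
        scores[c] = score
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}


def rank_queries(queries, model, features, value, k=None):
    """Order rewritten queries by the estimated probability of the queried value."""
    scored = [(posterior(model, features, q).get(value, 0.0), q) for q in queries]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]  # k=None keeps every query


# Invented training sample: (model, body style) pairs.
pairs = (
    [("z4", "convt")] * 4
    + [("boxster", "convt")] * 3 + [("boxster", "coupe")]
    + [("a4", "convt")] + [("a4", "sedan")] * 3
)
sample = [{"model": mo, "body_style": b} for mo, b in pairs]
nb = train(sample, ["model"], "body_style")
queries = [{"model": "a4"}, {"model": "z4"}, {"model": "boxster"}]
ranked = rank_queries(queries, nb, ["model"], "convt")
```

On this invented sample the queries come out in the order z4, boxster, a4, mirroring the ordering R(Q2) > R(Q3) > R(Q1) on the slide, although the estimated probabilities differ from the slide's values. Passing `k` keeps only the top K queries, which is how a mediator could limit cost when resources are tight.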

14
Combining AFDs and Classifiers
  • More than one AFD may exist for some attributes
  • Experimented with several approaches
  • Only best-AFD having highest confidence
  • All attributes ignoring AFDs
  • Hybrid One-AFD
  • Ensemble of classifiers

15
Empirical Evaluation of QPIAD
  • Test databases: AutoTrader database containing
    100K tuples and Census database from the UCI
    Repository containing 50K tuples
  • Oracular study: to evaluate the effectiveness of
    our system against a ground truth, we
    artificially insert missing values in 10% of the
    tuples within these databases
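
The oracular setup described above can be sketched as a masking step that hides values while remembering them as ground truth. This is a minimal sketch under assumptions the slides do not state: one attribute is blanked per chosen tuple, the attribute is picked uniformly at random, and the function name `mask_values` is hypothetical.

```python
import random


def mask_values(rows, attrs, fraction, seed=0):
    """Blank one attribute in a fraction of the rows and record the hidden values."""
    rng = random.Random(seed)  # fixed seed so the experiment is repeatable
    n_masked = round(len(rows) * fraction)
    chosen = rng.sample(range(len(rows)), n_masked)
    masked = [dict(r) for r in rows]  # copy so the original data stays intact
    truth = {}
    for i in chosen:
        attr = rng.choice(attrs)
        truth[i] = (attr, masked[i][attr])
        masked[i][attr] = None
    return masked, truth


# Toy table with 20 complete rows (illustrative values only).
table = [{"model": f"m{i % 4}", "body_style": f"b{i % 3}"} for i in range(20)]
masked, truth = mask_values(table, ["model", "body_style"], fraction=0.1)
```

Answers retrieved for the masked tuples can then be checked against `truth` to compute precision and recall.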

16
RRUA vs AUA vs RUA
17
Precision over Top K Tuples
18
Ranking the Rewritten Queries
  • Cars database, Census database

19
Robustness of QPIAD
20
User Relevance Issues with QPIAD
  • When the query processor presents incomplete
    tuples, it becomes a recommender system
  • For a query Q: year = 2000
  • How can users be convinced to trust the system's
    results?

21
Outline
  • Introduction
  • QPIAD: Query Processing over Incomplete
    Autonomous Databases
  • Data Integration over Incomplete Autonomous
    Databases
  • Other Contributions
  • Conclusion

22
Leveraging Correlations between Data Sources
Q: Body style = coupe
Mediator: GS(Make, Model, Year, Price, Mileage, Body style)
23
Correlated Source and Maximum Correlated Source
  • Consider four sources with schemas
  • S1(Make, Model, Year, Price)
  • S2(Engine, Drive, Body style)
  • AFD: Engine, Drive -> Body style (confidence 0.7)
  • S3(Make, Model, Body style)
  • AFD: Model -> Body style (confidence 0.8)
  • S4(Make, Price, Body style)
  • AFD: Make, Price -> Body style (confidence 0.6)
  • Mediator global schema: GS(Make, Model, Year, Price,
    Body style, Engine, Drive)
  • S3 and S4 are correlated sources for S1 on the Body
    style attribute
  • S3 is the maximum correlated source for S1 on the
    Body style attribute
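
The selection on this slide can be sketched in a few lines. The definition used here is read off the example rather than stated in the slides: a source counts as correlated with a target on an attribute when it has an AFD for that attribute whose determining set is fully supported by the target's schema, and the maximum correlated source is the one whose AFD has the highest confidence. The function names are hypothetical.

```python
# Schemas and AFDs copied from the slide's example.
SOURCES = {
    "S1": {"Make", "Model", "Year", "Price"},
    "S2": {"Engine", "Drive", "Body style"},
    "S3": {"Make", "Model", "Body style"},
    "S4": {"Make", "Price", "Body style"},
}
# source -> list of (determining set, dependent attribute, confidence)
AFDS = {
    "S2": [(("Engine", "Drive"), "Body style", 0.7)],
    "S3": [(("Model",), "Body style", 0.8)],
    "S4": [(("Make", "Price"), "Body style", 0.6)],
}


def correlated_sources(target, attr):
    """Sources with an AFD for attr whose determining set the target can be queried on."""
    found = []
    for source, afds in AFDS.items():
        if source == target:
            continue
        for lhs, rhs, confidence in afds:
            if rhs == attr and set(lhs) <= SOURCES[target]:
                found.append((source, lhs, confidence))
    return found


def max_correlated_source(target, attr):
    """Correlated source whose AFD has the highest confidence, or None if there is none."""
    return max(correlated_sources(target, attr), key=lambda c: c[2], default=None)
```

S2 is excluded for S1 because S1 supports neither Engine nor Drive, so its AFD cannot be used to form queries on S1. Between S3 (0.8) and S4 (0.6), S3 has the higher confidence.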

24
Retrieving Relevant Uncertain Answers from
CarsDirect.com
  • Consider a query Q: body style = coupe (GS)
  • Cars.com has an AFD: Model -> Body style (0.9)
  • Cars.com is the maximum correlated source for
    CarsDirect.com, which doesn't support Body style
    but supports the Model attribute

Q1: model = Accord; Q2: model = Mustang; Q3: model = Legend; Q4: model = 325
25
Empirical Evaluation of using Correlation between
Data Sources
  • We consider a mediator performing data
    integration over three sources: Cars.com, Yahoo!
    Autos, and CarsDirect.com
  • Yahoo! Autos and CarsDirect.com do not allow
    querying on body style, but once the tuples are
    retrieved we can check the body style attribute
    to determine whether each tuple has the body
    style specified in the query
  • Evaluation using attribute correlations and value
    distributions learned from Cars.com for 5 test
    queries on body style attribute

26
Retrieving Relevant Answers using Correlations
from Cars.com
27
Handling Joins over Incomplete Autonomous
databases
  • Mediator performing data integration across two
    sources
  • Source S1 is incomplete
  • Source S2 is complete

28
Issues in Handling Joins
  • Performing joins over probabilistic databases
    will lead to a disjunction in join results
  • Consider joining uncertain tuples from the two
    sources

Figure labels: Approximation; 0.6 or 0.4
29
Handling Join Queries
  • Q: σ_Make=Honda(UsedCars)
  • Assume AFDs: Make, Year -> Model; Model ->
    Make

Q1: Model = Odyssey, R(Q1) = 1; Q2: Model = Accord, R(Q2) = 1
Queries on source S2 to join: Q3: Model = Odyssey, R(Q3) = 1; Q4: Model = Accord, R(Q4) = 1; Q5: Model = Civic, R(Q5) = 0.6
Figure labels: 1.0; 0.6; 0.6 Civic; 0.4 Accord
30
Experimental Results: Joins
31
Outline
  • Introduction
  • QPIAD: Query Processing over Incomplete
    Autonomous Databases
  • Data Integration over Incomplete Autonomous
    Databases
  • Other Contributions
  • Conclusion

32
QUIC: Querying under Imprecision and
Incompleteness
  • Consider a query Q: model like Civic (Cars)
  • The user might be interested in similar cars such as
    Accord, Camry, etc.
  • Ranking results in presence of both similar and
    incomplete tuples

33
Other Contributions: Collaboration with Garrett
Wolf
  • Handling multi-attribute selection queries for
    incomplete databases
  • QUIC system for query processing under
    imprecision and incompleteness
  • Online learning of value distribution based on
    base result set to avoid sample biases

34
Conclusion
  • The thesis proposed a framework for query processing
    over incomplete autonomous web databases
  • QPIAD: query processing over incomplete
    autonomous databases
  • QPIAD: data integration over multiple incomplete
    data sources
  • Results of empirical evaluation on real world
    databases show that our system returns relevant
    answers with high precision while keeping the
    query processing cost manageable

35
Thank You!!
  • Questions??