Title: Query Processing over Incomplete Autonomous Web Databases
1. Query Processing over Incomplete Autonomous Web Databases
- MS Thesis Defense
- by Hemal Khatri
- Committee Members
- Prof. Subbarao Kambhampati (chair)
- Prof. Chitta Baral
- Prof. Yi Chen
- Prof. Huan Liu
2. Introduction to Web databases
- Many websites let users pose queries through a form-based interface backed by a database
- Consider used-car websites such as Cars.com, Yahoo! Autos, etc.
3. Incompleteness in Web databases
- Web databases are often populated by lay individuals without any curation, e.g. Cars.com, Yahoo! Autos
- Web databases are also being populated using automated information extraction techniques, which are inherently imperfect
- The local schema of a data source may not support certain attributes supported by the global schema
- Incomplete/Uncertain tuple: A tuple in which one or more attributes have a missing value
4. Problem Statement
- Many entities corresponding to tuples with missing values might be relevant to the user query
- Current query processing techniques return only answers that exactly satisfy the user query
- Such techniques return results with high precision but low recall
- Relevant Uncertain tuple: A tuple which does not exactly satisfy the query predicates, but the entity it represents might be relevant to the query
- How to support query processing over incomplete autonomous databases in order to retrieve ranked uncertain results?
- Example query Q: Make=Honda
5. Challenges Involved
- How to predict missing values in autonomous databases?
- As autonomous databases are accessible only through form-based interfaces, how to retrieve relevant uncertain answers?
- How to keep query processing cost manageable in retrieving uncertain tuples?
- How to rank the retrieved uncertain answers?
6. Related Work
- Probabilistic databases
- Incomplete databases are similar to probabilistic databases once we assess the probabilities for missing values
- TRIO: uncertainty with lineage
- ConQuer: handling inconsistency over databases
- These systems assume probability distributions are given for uncertain or inconsistent attributes
- We assess the probability distribution for a missing attribute and use it to rank rewritten queries to retrieve relevant answers, since the probabilities cannot be stored in the databases
- Our query rewriting framework is general and can be used by these systems if the databases are autonomous
- Handling Missing Values
- EM algorithm, Bayes Net, Association rules
7. Possible Approaches
- For a query Q: Body style=convt
- 1. Certain Answers Only (CAO): Return certain answers only, as in traditional databases. Drawback: low recall
- 2. All Uncertain Answers (AUA): Null matches any concrete value, hence return all answers having Body style=convt along with answers having Body style as null. Drawback: low precision, infeasible
- 3. Relevant Uncertain Answers (RUA): Rank answers by predicting values of the missing attribute. Drawback: costly, infeasible
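To make the difference between the three approaches concrete, here is a minimal sketch over a toy in-memory relation. The tuples, the `predict` function, and the probability for civic are made up for illustration (the values 1.0 and 0.4 for z4 and a4 echo the ranking example later in the talk). A real autonomous source would not expose its tuples this way, which is part of why AUA and RUA are marked infeasible.

```python
# Toy relation; None marks a missing (null) value. All data is hypothetical.
cars = [
    {"make": "Audi", "model": "a4", "body_style": "convt"},
    {"make": "Audi", "model": "a4", "body_style": "sedan"},
    {"make": "BMW", "model": "z4", "body_style": "convt"},
    {"make": "BMW", "model": "z4", "body_style": None},
    {"make": "Audi", "model": "a4", "body_style": None},
    {"make": "Honda", "model": "civic", "body_style": None},
]

def certain_answers(rows, attr, value):
    """CAO: only tuples whose attribute exactly matches the query value."""
    return [r for r in rows if r[attr] == value]

def all_uncertain_answers(rows, attr, value):
    """AUA: exact matches plus every tuple whose attribute is null."""
    return [r for r in rows if r[attr] == value or r[attr] is None]

def relevant_uncertain_answers(rows, attr, value, predict):
    """RUA: exact matches first, then null tuples ranked by predicted probability."""
    certain = certain_answers(rows, attr, value)
    uncertain = [r for r in rows if r[attr] is None]
    ranked = sorted(uncertain, key=lambda r: predict(r, value), reverse=True)
    return certain + ranked

# Hypothetical predictor for P(body_style = convt | model), hard-coded here.
ASSUMED_P_CONVT = {"z4": 1.0, "a4": 0.4, "civic": 0.0}

def predict(row, value):
    return ASSUMED_P_CONVT.get(row["model"], 0.0) if value == "convt" else 0.0

cao = certain_answers(cars, "body_style", "convt")
aua = all_uncertain_answers(cars, "body_style", "convt")
rua = relevant_uncertain_answers(cars, "body_style", "convt", predict)
```

In this sketch CAO misses the z4 with a null body style, AUA returns the civic alongside it with no way to tell them apart, and RUA orders the null tuples but has to fetch all of them first.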
8. Outline
- Introduction
- QPIAD: Query Processing over Incomplete Autonomous Databases
- Data Integration over Incomplete Autonomous Databases
- Other Contributions
- Conclusion
9. QPIAD System Architecture
10. RRUA: Generating Rewritten Queries
- The Restricted Relevant Uncertain Answers (RRUA) approach retrieves only relevant incomplete tuples instead of retrieving all tuples as in AUA and RUA
- Consider a query Q: Body style=convt
- Base Result Set: RS(Q)
- Rewritten queries are based on the determining attribute set (dtrSet) from the AFD for Body style
- AFD: Model -> Body style (confidence 0.9)
- Q1: Model=a4, Q2: Model=z4, Q3: Model=boxster
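The rewriting step above can be sketched as a projection of the base result set onto the determining set, with one rewritten query per distinct projected value. The function name `rewritten_queries`, the dictionary representation of a query, and the sample tuples are assumptions made for this sketch, not the system's actual interface.

```python
def rewritten_queries(base_result_set, dtr_set):
    """Project RS(Q) onto the determining set and emit one rewritten query
    (a dict of attribute=value bindings) per distinct projected tuple."""
    queries = []
    for row in base_result_set:
        binding = {attr: row[attr] for attr in dtr_set}
        if binding not in queries:  # keep first-seen order, drop duplicates
            queries.append(binding)
    return queries

# Hypothetical RS(Q) for Q: Body style=convt
base_result_set = [
    {"make": "Audi", "model": "a4", "body_style": "convt"},
    {"make": "BMW", "model": "z4", "body_style": "convt"},
    {"make": "Porsche", "model": "boxster", "body_style": "convt"},
    {"make": "BMW", "model": "z4", "body_style": "convt"},
]

# AFD Model -> Body style gives dtrSet(Body style) = {Model}
queries = rewritten_queries(base_result_set, dtr_set=["model"])
```

A mediator following this sketch would issue each rewritten query through the source's form interface and keep only the returned tuples whose Body style is missing, since tuples with a known Body style are already covered by the certain answers.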
11. Learning Attribute Correlations
- AFD: VIN -> Model, where VIN is an Approximate Key (AKey) with high confidence
- VIN will not be useful for query rewriting and feature selection, since it will not be able to retrieve additional new tuples
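The slide does not say how AFD or AKey confidence is computed, so the following is only one plausible sketch. It assumes confidence is the fraction of tuples that remain once the minimum number of violating tuples is removed. The function names, the 0.9 threshold, and the sample rows are all hypothetical.

```python
from collections import Counter, defaultdict

def afd_confidence(rows, lhs, rhs):
    """Fraction of tuples that agree with the most common rhs value
    within their lhs group (assumed confidence measure)."""
    groups = defaultdict(Counter)
    for r in rows:
        groups[tuple(r[a] for a in lhs)][r[rhs]] += 1
    kept = sum(counts.most_common(1)[0][1] for counts in groups.values())
    return kept / len(rows)

def key_confidence(rows, attrs):
    """Fraction of tuples left after keeping one tuple per distinct attrs value."""
    return len({tuple(r[a] for a in attrs) for r in rows}) / len(rows)

def useful_for_rewriting(rows, lhs, akey_threshold=0.9):
    """Discard determining sets that are approximate keys: a query on a
    (near-)unique value can only bring back the tuple it was derived from."""
    return key_confidence(rows, lhs) < akey_threshold

rows = [
    {"vin": "V1", "model": "a4", "body_style": "convt"},
    {"vin": "V2", "model": "a4", "body_style": "sedan"},
    {"vin": "V3", "model": "a4", "body_style": "sedan"},
    {"vin": "V4", "model": "z4", "body_style": "convt"},
    {"vin": "V5", "model": "z4", "body_style": "convt"},
]

conf_vin_model = afd_confidence(rows, ["vin"], "model")         # VIN -> Model
conf_model_body = afd_confidence(rows, ["model"], "body_style")  # Model -> Body style
```

On this sample, VIN -> Model holds perfectly but VIN is also a key, so `useful_for_rewriting(rows, ["vin"])` rejects it, while Model passes.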
12. RRUA: Ranking Rewritten Queries
- Not all queries are equally good at retrieving relevant answers
- Cars with model z4 are more likely to be convertibles than cars with model a4
- When database or network resources are limited, the mediator can choose to issue only the top K queries to get the most relevant uncertain answers
13. Learning Value Distributions
- Used to rank queries based on the determining set of attributes from the AFD for the query attribute
- We use a Naïve Bayes Classifier with m-estimates, with the AFD as a feature selection step
- Rank of a rewritten query Qi: R(Qi) = P(Am=vm | ti), where ti ∈ Π_dtrSet(Am)(RS(Q))
- Q1: Model=a4, R(Q1) = P(Body style=convt | Model=a4) = 0.4
- Q2: Model=z4, R(Q2) = P(Body style=convt | Model=z4) = 1.0
- Q3: Model=boxster, R(Q3) = P(Body style=convt | Model=boxster) = 0.7
- R(Q2) > R(Q3) > R(Q1)
- Relevant uncertain answers are ranked based on the rank of the rewritten query that retrieved them
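The ranking above can be sketched with a small hand-rolled Naive Bayes classifier using m-estimate smoothing, restricted to the attributes of the determining set. The training sample, the uniform prior p = 1/|domain|, the choice m = 1, and the function names are assumptions for illustration; the probabilities it produces differ from the 0.4, 1.0 and 0.7 on the slide because the sample is made up.

```python
from collections import Counter

def nb_posterior(train, features, target, evidence, m=1.0):
    """Naive Bayes posterior over target values given evidence,
    with m-estimate smoothing of each conditional probability."""
    class_counts = Counter(r[target] for r in train)
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / len(train)  # class prior P(c)
        for f in features:
            domain = {r[f] for r in train}
            n_cv = sum(1 for r in train if r[target] == c and r[f] == evidence[f])
            # m-estimate: (n_cv + m * p) / (n_c + m), uniform prior p = 1/|domain|
            score *= (n_cv + m / len(domain)) / (n_c + m)
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

def rank_queries(train, queries, target, value, m=1.0):
    """Order rewritten queries by P(target=value | query bindings), highest first."""
    scored = [
        (nb_posterior(train, list(q), target, q, m).get(value, 0.0), q)
        for q in queries
    ]
    return sorted(scored, key=lambda s: s[0], reverse=True)

# Hypothetical training sample drawn from the database
train = (
    [{"model": "a4", "body_style": "convt"}] * 2
    + [{"model": "a4", "body_style": "sedan"}] * 3
    + [{"model": "z4", "body_style": "convt"}] * 5
    + [{"model": "boxster", "body_style": "convt"}] * 3
    + [{"model": "boxster", "body_style": "coupe"}] * 1
)
queries = [{"model": "a4"}, {"model": "z4"}, {"model": "boxster"}]

ranked = rank_queries(train, queries, "body_style", "convt")
top_k = ranked[:2]  # issue only the K best queries when resources are limited
```

Each retrieved uncertain tuple would then inherit the score of the query that fetched it, which matches the last bullet on the slide.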
14. Combining AFDs and Classifiers
- More than one AFD may exist for some attributes
- Experimented with several approaches
- Only best-AFD having highest confidence
- All attributes ignoring AFDs
- Hybrid One-AFD
- Ensemble of classifiers
15. Empirical Evaluation of QPIAD
- Test Databases: AutoTrader database containing 100K tuples and Census database from the UCI Repository containing 50K tuples
- Oracular study: To evaluate the effectiveness of our system against a ground truth, we artificially insert missing values in 10% of the tuples within these databases
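The oracular setup can be sketched as follows: blank out a value in a fixed fraction of tuples, remember what was removed, and score retrieved answers against that ground truth. How the attribute to blank is chosen is not stated on the slide; picking one uniformly at random per tuple is an assumption here, as are the function names and the seed.

```python
import random

def insert_missing(rows, attrs, fraction=0.10, seed=0):
    """Return (masked_rows, ground_truth). One randomly chosen attribute is
    set to None in `fraction` of the tuples; ground_truth maps the row index
    to the (attribute, original value) that was removed."""
    rng = random.Random(seed)
    masked = [dict(r) for r in rows]  # copy so the originals stay intact
    chosen = rng.sample(range(len(rows)), k=round(fraction * len(rows)))
    truth = {}
    for i in chosen:
        attr = rng.choice(attrs)
        truth[i] = (attr, masked[i][attr])
        masked[i][attr] = None
    return masked, truth

def precision_recall(returned, relevant):
    """Precision and recall of a returned set of row ids against the relevant set."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical 100-tuple database
rows = [{"model": f"m{i % 5}", "body_style": "convt" if i % 2 else "sedan"}
        for i in range(100)]
masked, truth = insert_missing(rows, attrs=["model", "body_style"])
```

With the ground truth in hand, an uncertain answer counts as relevant when the value that was blanked out actually satisfies the query.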
16. RRUA vs AUA vs RUA
17. Precision over Top K Tuples
18. Ranking the Rewritten Queries
- Cars database; Census database
19. Robustness of QPIAD
20. User Relevance Issues with QPIAD
- When the query processor presents incomplete tuples, it becomes a recommender system
- For a query Q: Year=2000
- How to convince users to trust the system's results?
21. Outline
- Introduction
- QPIAD: Query Processing over Incomplete Autonomous Databases
- Data Integration over Incomplete Autonomous Databases
- Other Contributions
- Conclusion
22. Leveraging Correlations between Data Sources
- Q: Body style=coupe
- Mediator: GS(Make, Model, Year, Price, Mileage, Body style)
23. Correlated Source and Maximum Correlated Source
- Consider four sources with schemas
- S1(Make, Model, Year, Price)
- S2(Engine, Drive, Body style)
- AFD: Engine, Drive -> Body style, confidence 0.7
- S3(Make, Model, Body style)
- AFD: Model -> Body style, confidence 0.8
- S4(Make, Price, Body style)
- AFD: Make, Price -> Body style, confidence 0.6
- Mediator global schema: GS(Make, Model, Year, Price, Body style, Engine, Drive)
- S3 and S4 are correlated sources with S1 on the Body style attribute
- S3 is the maximum correlated source for S1 on the Body style attribute
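The example above is consistent with the following reading, which is an assumption rather than a definition taken from the slides: a source is correlated with S1 on an attribute if it has an AFD for that attribute whose determining set is fully supported by S1's schema, and the maximum correlated source is the one whose such AFD has the highest confidence. Under that reading S2 is excluded because S1 supports neither Engine nor Drive. A minimal sketch, with hypothetical function names:

```python
# Schemas and AFDs copied from the slide; each AFD is (determining set, attribute, confidence).
sources = {
    "S1": {"schema": {"Make", "Model", "Year", "Price"}, "afds": []},
    "S2": {"schema": {"Engine", "Drive", "Body style"},
           "afds": [({"Engine", "Drive"}, "Body style", 0.7)]},
    "S3": {"schema": {"Make", "Model", "Body style"},
           "afds": [({"Model"}, "Body style", 0.8)]},
    "S4": {"schema": {"Make", "Price", "Body style"},
           "afds": [({"Make", "Price"}, "Body style", 0.6)]},
}

def correlated_sources(sources, target, attr):
    """Sources holding an AFD for attr whose determining set the target supports."""
    target_schema = sources[target]["schema"]
    found = []
    for name, src in sources.items():
        if name == target:
            continue
        for lhs, rhs, conf in src["afds"]:
            if rhs == attr and lhs <= target_schema:  # subset test on attribute sets
                found.append((name, conf))
    return found

def maximum_correlated_source(sources, target, attr):
    """Correlated source whose AFD for attr has the highest confidence, or None."""
    candidates = correlated_sources(sources, target, attr)
    return max(candidates, key=lambda c: c[1])[0] if candidates else None

correlated = correlated_sources(sources, "S1", "Body style")
best = maximum_correlated_source(sources, "S1", "Body style")
```

The statistics learned from the chosen source would then drive query rewriting on S1, which cannot be queried on Body style directly.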
24. Retrieving Relevant Uncertain Answers from CarsDirect.com
- Consider a query Q: Body style=coupe (GS)
- Cars.com has an AFD: Model -> Body style (confidence 0.9)
- Cars.com is the maximum correlated source for CarsDirect.com, which doesn't support Body style but supports the Model attribute
- Rewritten queries: Q1: Model=Accord, Q2: Model=Mustang, Q3: Model=Legend, Q4: Model=325
25. Empirical Evaluation of using Correlation between Data Sources
- We consider a mediator performing data integration over three sources: Cars.com, Yahoo! Autos and CarsDirect.com
- Yahoo! Autos and CarsDirect.com do not allow querying on body style, but when the tuples are retrieved we can check the body style attribute to determine whether the retrieved tuple has the body style specified in the query
- Evaluation using attribute correlations and value distributions learned from Cars.com for 5 test queries on the body style attribute
26. Retrieving Relevant Answers using Correlations from Cars.com
27. Handling Joins over Incomplete Autonomous Databases
- Mediator performing data integration across two sources
- Source S1 is incomplete
- Source S2 is complete
28. Issues in Handling Joins
- Performing joins over probabilistic databases will lead to a disjunction in join results
- Consider joining uncertain tuples from the two sources
[Figure: join example; surviving callouts are "Approximation" and the alternatives "0.6 or 0.4"]
29. Handling Join Queries
- Q: σ_Make=Honda(UsedCars)
- Assume AFDs: Make, Year -> Model; Model -> Make
- Q1: Model=Odyssey, R(Q1)=1; Q2: Model=Accord, R(Q2)=1
- Queries on source S2 to join: Q3: Model=Odyssey, R(Q3)=1; Q4: Model=Accord, R(Q4)=1; Q5: Model=Civic, R(Q5)=0.6
[Figure: join example; surviving callouts are "1.0", "0.6", "0.6 Civic" and "0.4 Accord"]
30. Experimental Results: Joins
31. Outline
- Introduction
- QPIAD: Query Processing over Incomplete Autonomous Databases
- Data Integration over Incomplete Autonomous Databases
- Other Contributions
- Conclusion
32. QUIC: Querying under Imprecision and Incompleteness
- Consider a query Q: Model like Civic (Cars)
- User might be interested in similar cars like Accord, Camry, etc.
- Ranking results in the presence of both similar and incomplete tuples
33. Other Contributions: Collaboration with Garrett Wolf
- Handling multi-attribute selection queries for incomplete databases
- QUIC system for query processing under imprecision and incompleteness
- Online learning of value distributions based on the base result set to avoid sample biases
34. Conclusion
- Thesis proposed a framework for query processing over incomplete autonomous web databases
- QPIAD: Query processing over incomplete autonomous databases
- QPIAD: Data integration over multiple incomplete data sources
- Results of empirical evaluation on real-world databases show that our system returns relevant answers with high precision while keeping the query processing cost manageable
35. Thank You!!