Query Processing over Incomplete Autonomous Web Databases

Provided by: jaewo

Learn more at: https://rakaposhi.eas.asu.edu

1
Query Processing over Incomplete Autonomous Web
Databases

MS Thesis Defense
by Hemal Khatri
Committee Members
Prof. Subbarao Kambhampati (chair)
Prof. Chitta Baral
Prof. Yi Chen
Prof. Huan Liu

2
Introduction to Web databases

Many websites let users pose queries through a form-based
interface and are backed by databases
Consider used-car sales websites such as
Cars.com and Yahoo! Autos

3
Incompleteness in Web databases

Web databases are often populated by lay individuals
without any curation, e.g., Cars.com, Yahoo!
Autos
Web databases are also populated using automated
information extraction techniques, which are
inherently imperfect
The local schema of a data source may not support
certain attributes supported by the global schema
Incomplete/Uncertain tuple: a tuple in which one
or more attributes have a missing value

4
Problem Statement

Many entities corresponding to tuples with
missing values might be relevant to the user
query
Current query processing techniques return
answers that exactly satisfy the user query
Such techniques return results with high
precision but low recall
Relevant uncertain tuple: a tuple that does not
exactly satisfy the query predicates, but the
entity it represents might be
relevant to the query
How can we support query processing over incomplete
autonomous databases in order to retrieve ranked
uncertain results?

Q: Make = Honda
5
Challenges Involved

How to predict missing values in autonomous
databases?
As autonomous databases are accessible only
through form-based interfaces, how to retrieve
relevant uncertain answers?
How to keep query processing cost manageable in
retrieving uncertain tuples?
How to rank the retrieved uncertain answers?

6
Related Work

Probabilistic databases
Incomplete databases are similar to probabilistic
databases once we assess the probabilities for
missing values
TRIO: uncertainty with lineage
ConQuer: handling inconsistency in databases
These systems assume probability distributions are given for
uncertain or inconsistent attributes
We assess the probability distribution for each missing
attribute and use it to rank rewritten queries that
retrieve relevant answers, since the probabilities
cannot be stored in the databases
Our query rewriting framework is general and can
be used by these systems if the databases are
autonomous
Handling Missing Values
EM algorithm, Bayes Net, Association rules

7
Possible Approaches

For a query Q: Body style = convt
1. Certain Answers Only (CAO): return certain
answers only, as in traditional databases (low recall)
2. All Uncertain Answers (AUA): null matches any
concrete value, hence return all answers having
Body style = convt along with answers having
Body style as null (low precision, infeasible)
3. Relevant Uncertain Answers (RUA): rank
answers by predicting the values of the missing attribute (costly, infeasible)
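
The three approaches can be contrasted on a toy in-memory relation. This is only a sketch: the rows, the function names, and the `prob_given_model` lookup are hypothetical, and a real autonomous source would be reached through its query form rather than scanned directly.

```python
# Toy relation; None marks a missing (null) value. All rows are hypothetical.
cars = [
    {"make": "Audi", "model": "a4", "body_style": "convt"},
    {"make": "BMW", "model": "z4", "body_style": None},
    {"make": "Honda", "model": "civic", "body_style": "sedan"},
    {"make": "Audi", "model": "a4", "body_style": None},
]


def certain_answers_only(rows, attr, value):
    """CAO: keep only tuples that exactly satisfy attr = value."""
    return [r for r in rows if r[attr] == value]


def all_uncertain_answers(rows, attr, value):
    """AUA: a null matches any value, so tuples with a null attr are kept too."""
    return [r for r in rows if r[attr] == value or r[attr] is None]


def relevant_uncertain_answers(rows, attr, value, prob_given_model):
    """RUA: certain answers first, then null tuples ranked by a predicted probability.

    prob_given_model is an assumed lookup from model to P(attr = value | model).
    """
    certain = [r for r in rows if r[attr] == value]
    uncertain = [r for r in rows if r[attr] is None]
    uncertain.sort(key=lambda r: prob_given_model.get(r["model"], 0.0), reverse=True)
    return certain + uncertain


cao = certain_answers_only(cars, "body_style", "convt")
aua = all_uncertain_answers(cars, "body_style", "convt")
rua = relevant_uncertain_answers(cars, "body_style", "convt", {"a4": 0.4, "z4": 1.0})
```

Both AUA and RUA need access to every tuple with a null body style, which a form-based interface typically does not offer.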
8
Outline

Introduction
QPIAD: Query Processing over Incomplete
Autonomous Databases
Data Integration over Incomplete Autonomous
Databases
Other Contributions
Conclusion

9
QPIAD System Architecture
10
RRUA: Generating Rewritten Queries

The Restricted Relevant Uncertain Answers (RRUA)
approach retrieves only relevant incomplete
tuples instead of retrieving all tuples as in AUA
and RUA
Consider a query Q: Body style = convt

Base result set: RS(Q)
Rewritten queries are based on the determining
attribute set (dtrSet) from the AFD for Body style
AFD: Model -> Body style (confidence 0.9)
Q1: Model = a4
Q2: Model = z4
Q3: Model = boxster
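
The rewriting step can be sketched as a projection of the base result set onto the determining attribute set, with one rewritten query per distinct value combination. The function name and the sample rows are hypothetical, and the sketch does not show issuing the queries to the source.

```python
def rewritten_queries(base_result_set, dtr_set):
    """Project RS(Q) onto dtr_set; each distinct combination becomes one query."""
    queries = []
    for row in base_result_set:
        query = {attr: row[attr] for attr in dtr_set}
        # Skip combinations with nulls and combinations already seen.
        if None in query.values() or query in queries:
            continue
        queries.append(query)
    return queries


# Hypothetical base result set for Q: Body style = convt.
rs = [
    {"make": "Audi", "model": "a4", "body_style": "convt"},
    {"make": "BMW", "model": "z4", "body_style": "convt"},
    {"make": "Porsche", "model": "boxster", "body_style": "convt"},
    {"make": "BMW", "model": "z4", "body_style": "convt"},
]

# dtrSet(Body style) = {Model}, taken from the AFD Model -> Body style.
queries = rewritten_queries(rs, ["model"])
```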
11
Learning Attribute Correlations

AFD: VIN -> Model, where VIN is an approximate
key (AKey) with high confidence
VIN is not useful for query rewriting or
feature selection, since it cannot
retrieve any additional new tuples
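
A small sketch shows why an approximate key makes a poor determining set. The confidence measures here are assumptions, not definitions taken from the slides: AFD confidence is computed as the fraction of tuples that survive when each determining value keeps only its most common dependent value, and key confidence as the fraction of distinct values. The sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def afd_confidence(rows, lhs, rhs):
    """Fraction of rows consistent with lhs -> rhs (majority value per lhs group)."""
    groups = defaultdict(Counter)
    for r in rows:
        groups[tuple(r[a] for a in lhs)][r[rhs]] += 1
    kept = sum(counts.most_common(1)[0][1] for counts in groups.values())
    return kept / len(rows)


def key_confidence(rows, attrs):
    """Fraction of rows that carry a distinct value combination on attrs."""
    distinct = {tuple(r[a] for a in attrs) for r in rows}
    return len(distinct) / len(rows)


sample = [
    {"vin": "V1", "model": "z4", "body_style": "convt"},
    {"vin": "V2", "model": "z4", "body_style": "convt"},
    {"vin": "V3", "model": "a4", "body_style": "sedan"},
    {"vin": "V4", "model": "a4", "body_style": "convt"},
    {"vin": "V5", "model": "a4", "body_style": "sedan"},
]

model_conf = afd_confidence(sample, ["model"], "body_style")
vin_conf = afd_confidence(sample, ["vin"], "model")
vin_key = key_confidence(sample, ["vin"])
model_key = key_confidence(sample, ["model"])
```

In this sample VIN -> Model holds perfectly, but only because every VIN is unique: a rewritten query on a VIN value can match at most the tuple it came from, so it cannot bring back anything new.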

12
RRUA: Ranking Rewritten Queries

Not all rewritten queries are equally good at retrieving
relevant answers
Cars with model z4 are more likely to be
convertibles than cars with model a4
When database or network resources are limited,
the mediator can choose to issue the top K
queries to get the most relevant uncertain
answers

13
Learning Value Distributions

Used to rank queries based on the determining set
of attributes from the AFD for the query attribute
We use a Naïve Bayes classifier with m-estimates,
with AFDs as a feature selection step
Rank of a rewritten query Qi: R(Qi) = P(Am = vm | ti), where
ti ∈ π_dtrSet(Am)(RS(Q))
Q1: Model = a4, R(Q1) = P(Body style = convt | Model = a4) = 0.4
Q2: Model = z4, R(Q2) = P(Body style = convt | Model = z4) = 1.0
Q3: Model = boxster, R(Q3) = P(Body style = convt | Model = boxster) = 0.7
R(Q2) > R(Q3) > R(Q1)
Relevant uncertain answers are ranked based on
the rank of the rewritten query that retrieved them
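
The ranking can be sketched with a small Naive Bayes estimator that smooths its conditional probabilities with m-estimates, (n_c + m*p) / (n + m). The sample, the uniform prior p = 1/|domain|, the choice m = 1, and all function names are assumptions made for illustration. The sketch does not reproduce the probabilities quoted on the slide.

```python
from collections import Counter


def m_estimate(n_c, n, p, m):
    """Smoothed estimate of a conditional probability."""
    return (n_c + m * p) / (n + m)


def nb_posterior(sample, target, value, evidence, m=1.0):
    """P(target = value | evidence) under Naive Bayes with m-estimates."""
    classes = Counter(r[target] for r in sample)
    scores = {}
    for cls, n_cls in classes.items():
        score = n_cls / len(sample)  # class prior
        for attr, val in evidence.items():
            domain = {r[attr] for r in sample}
            hits = sum(1 for r in sample if r[target] == cls and r[attr] == val)
            score *= m_estimate(hits, n_cls, 1 / len(domain), m)
        scores[cls] = score
    return scores.get(value, 0.0) / sum(scores.values())


def rank_queries(sample, target, value, queries, k=None):
    """Sort rewritten queries by posterior, best first; keep the top k if given."""
    scored = [(nb_posterior(sample, target, value, q), q) for q in queries]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k] if k is not None else scored


# Hypothetical training sample over (model, body_style).
sample = (
    [{"model": "z4", "body_style": "convt"}] * 3
    + [{"model": "boxster", "body_style": "convt"}] * 2
    + [{"model": "boxster", "body_style": "other"}]
    + [{"model": "a4", "body_style": "convt"}]
    + [{"model": "a4", "body_style": "sedan"}] * 2
)
queries = [{"model": "a4"}, {"model": "z4"}, {"model": "boxster"}]
ranked = rank_queries(sample, "body_style", "convt", queries)
top_two = rank_queries(sample, "body_style", "convt", queries, k=2)
```

Passing `k` mirrors the mediator issuing only the top K queries when database or network resources are limited.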

14
Combining AFDs and Classifiers

More than one AFD may exist for some attributes
We experimented with several approaches:
Best-AFD: only the AFD with the highest confidence
All attributes, ignoring AFDs
Hybrid One-AFD
Ensemble of classifiers

15
Empirical Evaluation of QPIAD

Test databases: AutoTrader database containing
100K tuples and Census database from the UCI
Repository containing 50K tuples
Oracular study: to evaluate the effectiveness of
our system against a ground truth, we
artificially insert missing values in 10% of the
tuples within these databases
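
The oracular setup can be sketched as masking one attribute in a random subset of tuples while remembering the original values as ground truth. The fraction 0.10, the fixed seed, the choice of one masked attribute per tuple, and the function name are all assumptions of this sketch.

```python
import random


def mask_values(rows, attrs, fraction, seed=0):
    """Return (masked copy, ground truth) with one attr nulled in a fraction of rows."""
    rng = random.Random(seed)
    chosen = rng.sample(range(len(rows)), round(fraction * len(rows)))
    masked = [dict(r) for r in rows]  # copy so the originals stay intact
    truth = {}
    for i in chosen:
        attr = rng.choice(attrs)
        truth[i] = (attr, masked[i][attr])  # remember what was hidden
        masked[i][attr] = None
    return masked, truth


# Hypothetical complete relation of 100 tuples.
rows = [{"model": "m%d" % i, "body_style": "convt"} for i in range(100)]
masked, truth = mask_values(rows, ["model", "body_style"], 0.10)
```

Answers returned for the masked tuples can then be checked against `truth` to measure precision and recall.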

16
RRUA vs AUA vs RUA
17
Precision over Top K Tuples
18
Ranking the Rewritten Queries

Cars database Census database

19
Robustness of QPIAD
20
User Relevance Issues with QPIAD

When the query processor presents incomplete
tuples, it becomes a recommender system
For a query Q: Year = 2000
How can we convince users to trust the system's
results?

21
Outline

Introduction
QPIAD: Query Processing over Incomplete
Autonomous Databases
Data Integration over Incomplete Autonomous
Databases
Other Contributions
Conclusion

22
Leveraging Correlations between Data Sources
Q: Body style = coupe
Mediator: GS(Make, Model, Year, Price, Mileage, Body style)
23
Correlated Source and Maximum Correlated Source

Consider four sources with schemas:
S1(Make, Model, Year, Price)
S2(Engine, Drive, Body style)
AFD: Engine, Drive -> Body style (confidence 0.7)
S3(Make, Model, Body style)
AFD: Model -> Body style (confidence 0.8)
S4(Make, Price, Body style)
AFD: Make, Price -> Body style (confidence 0.6)
Mediator global schema: GS(Make, Model, Year, Price,
Body style, Engine, Drive)
S3 and S4 are correlated sources for S1 on the Body
style attribute, since S1 supports the determining attributes of their AFDs
S3 is the maximum correlated source for S1 on the
Body style attribute, since its AFD has the highest confidence
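
The selection of a maximum correlated source can be sketched as follows. The rule used here is an assumption inferred from the example: a source is correlated with the target on an attribute when it has an AFD for that attribute whose determining attributes the target supports, and the maximum correlated source is the one with the highest AFD confidence. The data structure and function names are hypothetical.

```python
# Each source lists its attributes and its AFDs as (determining set, attribute, confidence).
sources = {
    "S1": {"attrs": {"Make", "Model", "Year", "Price"}, "afds": []},
    "S2": {
        "attrs": {"Engine", "Drive", "Body style"},
        "afds": [(("Engine", "Drive"), "Body style", 0.7)],
    },
    "S3": {
        "attrs": {"Make", "Model", "Body style"},
        "afds": [(("Model",), "Body style", 0.8)],
    },
    "S4": {
        "attrs": {"Make", "Price", "Body style"},
        "afds": [(("Make", "Price"), "Body style", 0.6)],
    },
}


def correlated_sources(sources, target, attr):
    """Sources whose AFD for attr uses only attributes the target supports."""
    found = []
    for name, src in sources.items():
        if name == target:
            continue
        for lhs, rhs, conf in src["afds"]:
            if rhs == attr and set(lhs) <= sources[target]["attrs"]:
                found.append((name, conf))
    return found


def max_correlated_source(sources, target, attr):
    """Correlated source with the highest AFD confidence, or None if there is none."""
    candidates = correlated_sources(sources, target, attr)
    return max(candidates, key=lambda nc: nc[1])[0] if candidates else None


correlated = correlated_sources(sources, "S1", "Body style")
best = max_correlated_source(sources, "S1", "Body style")
```

Under this rule S2 is excluded because S1 supports neither Engine nor Drive, so rewritten queries built from S2's AFD could not be issued to S1.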

24
Retrieving Relevant Uncertain Answers from
CarsDirect.com

Consider a query Q: Body style = coupe on GS
Cars.com has an AFD: Model -> Body style (confidence 0.9)
Cars.com is the maximum correlated source for
CarsDirect.com, which does not support Body style
but supports the Model attribute

Rewritten queries:
Q1: Model = Accord
Q2: Model = Mustang
Q3: Model = Legend
Q4: Model = 325
25
Empirical Evaluation of using Correlation between
Data Sources

We consider a mediator performing data
integration over three sources: Cars.com, Yahoo!
Autos, and CarsDirect.com
Yahoo! Autos and CarsDirect.com do not allow
querying on body style, but once the tuples are
retrieved we can check their body style attribute
to determine whether each tuple has the body
style specified in the query
Evaluation uses attribute correlations and value
distributions learned from Cars.com for 5 test
queries on the body style attribute

26
Retrieving Relevant Answers using Correlations
from Cars.com
27
Handling Joins over Incomplete Autonomous
databases

Mediator performing data integration across two
sources
Source S1 is incomplete
Source S2 is complete

28
Issues in Handling Joins

Performing joins over probabilistic databases
will lead to a disjunction in join results
Consider joining uncertain tuples from the two
sources

[Figure: approximation of a disjunctive join result; alternatives labeled 0.6 or 0.4]
29
Handling Join Queries

Q: Make = Honda (UsedCars)
Assume AFDs: Make, Year -> Model and Model ->
Make

Q1: Model = Odyssey, R(Q1) = 1
Q2: Model = Accord, R(Q2) = 1
Queries on source S2 to join:
Q3: Model = Odyssey, R(Q3) = 1
Q4: Model = Accord, R(Q4) = 1
Q5: Model = Civic, R(Q5) = 0.6
[Figure labels: 1.0; 0.6; 0.6 Civic; 0.4 Accord]
30
Experimental Results Joins
31
Outline

Introduction
QPIAD: Query Processing over Incomplete
Autonomous Databases
Data Integration over Incomplete Autonomous
Databases
Other Contributions
Conclusion

32
QUIC: Querying under Imprecision and
Incompleteness

Consider a query Q: Model like Civic (Cars)
User might be interested in similar cars like
Accord, Camry, etc
Ranking results in presence of both similar and
incomplete tuples

33
Other Contributions (Collaboration with Garrett
Wolf)

Handling multi-attribute selection queries for
incomplete databases
QUIC system for query processing under
imprecision and incompleteness
Online learning of value distribution based on
base result set to avoid sample biases

34
Conclusion

This thesis proposed a framework for query processing
over incomplete autonomous web databases
QPIAD: query processing over incomplete
autonomous databases
QPIAD: data integration over multiple incomplete
data sources
Results of empirical evaluation on real world
databases show that our system returns relevant
answers with high precision while keeping the
query processing cost manageable

35
Thank You!!