Adaptive Query Processing for Data Aggregation:

Provided by: jianch

Learn more at: https://rakaposhi.eas.asu.edu

1
Adaptive Query Processing for Data Aggregation

Mining, Using and Maintaining Source Statistics

M.S. Thesis Defense by Jianchun Fan
Committee Members: Dr. Subbarao Kambhampati (chair),
Dr. Huan Liu, Dr. Yi Chen

April 13, 2006
2
Introduction

Data Aggregation: Vertical Integration

Mediator: R (A1, A2, A3, A4, A5, A6)
S1: R1 (A1, A2, _, _, A5, A6)
S2: R2 (A1, _, A3, A4, A5, A6)
S3: R3 (A1, A2, A3, A4, A5, _)
3
Introduction

Query Processing in Data Aggregation
Sending every query to all sources?
Increasing work load on sources
Consuming a lot of network resources
Keeping users waiting
Primary processing task
Selecting the most relevant sources with regard to
different user objectives, such as completeness
and quality of the answers and response time
Need several types of source statistics to guide
source selection
Usually not directly available

4
Introduction

Challenges
Automatically gather various types of source
statistics to optimize individual goals
Many answers (high coverage)
Good answers (high density)
Answered quickly (short latency)
Combine different statistics to support
multi-objective query processing
Maintain statistics dynamically

5
System Overview
6
System Overview

Test beds
BibFinder: online bibliography mediator system,
integrating DBLP, IEEE Xplore, CSB, Network
Bibliography, ACM Digital Library, etc.
Synthetic test bed: 30 synthetic data sources
(based on the Yahoo! Auto database) with different
coverage, density and latency characteristics.

7
Outline

Introduction & Overview
Coverage/Overlap Statistics
Learning Density Statistics
Learning Latency Statistics
Multi-Objective Query Processing
Other Contribution
Conclusion

8
Coverage/Overlap Statistics

Coverage: how many answers a source provides for
a given query
Overlap: how many common answers a set of sources
share for a given query
Based on Nie & Kambhampati, ICDE 2004
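
A minimal sketch of these two definitions, assuming each source's answers to one query are available as a set of tuple ids. The source names and answer sets are hypothetical, and normalizing by the union of all known answers is an assumption made here for illustration, not necessarily the definition used in the thesis.

```python
from typing import Dict, Iterable, Set


def coverage(source: str, answers: Dict[str, Set[int]]) -> float:
    """Fraction of all known answers to the query that `source` returns."""
    universe = set().union(*answers.values())
    return len(answers[source]) / len(universe) if universe else 0.0


def overlap(sources: Iterable[str], answers: Dict[str, Set[int]]) -> float:
    """Fraction of all known answers that every source in `sources` returns."""
    universe = set().union(*answers.values())
    common = set.intersection(*(answers[s] for s in sources))
    return len(common) / len(universe) if universe else 0.0


# Hypothetical answer sets (tuple ids) returned by three sources for one query.
answers = {"S1": {1, 2, 3, 4}, "S2": {3, 4, 5}, "S3": {4, 5, 6, 7, 8}}
cov_s1 = coverage("S1", answers)
ovl_s1_s2 = overlap(["S1", "S2"], answers)
```

A source with high coverage but high overlap with an already selected source adds few new answers, which is why both statistics are needed.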

9
Density Statistics

Coverage measures the vertical completeness of the
answer set
Horizontal completeness is important too
(the quality of the individual answers)

Density statistics measure the horizontal
completeness of the individual answer tuples
10
Defining Density

Density of a source w.r.t. a given query
Average of the density of all answers

Select A1, A2, A3, A4 From S Where A1 > v1
Projection attribute set: A1, A2, A3, A4
Selection predicate: A1 > v1
Density = (1 + 0.5 + 0.5 + 0.75) / 4 = 0.6875

Learning density for every possible source/query
combination? Too costly
The number of possible queries is exponential in
the number of attributes
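
The definition above can be sketched as follows, assuming a missing value is represented as None and the answer tuples are hypothetical. The density of one tuple is the fraction of projected attributes it actually fills in, and the density of the source for the query is the average over its answers.

```python
from typing import Dict, List, Optional, Sequence

Row = Dict[str, Optional[str]]


def tuple_density(row: Row, projection: Sequence[str]) -> float:
    """Fraction of projected attributes that are not missing (None) in `row`."""
    present = sum(1 for attr in projection if row.get(attr) is not None)
    return present / len(projection)


def source_density(rows: List[Row], projection: Sequence[str]) -> float:
    """Average tuple density over the answers a source returns for a query."""
    if not rows:
        return 0.0
    return sum(tuple_density(r, projection) for r in rows) / len(rows)


# Hypothetical answers to: Select A1, A2, A3, A4 From S Where A1 > v1
projection = ["A1", "A2", "A3", "A4"]
rows = [
    {"A1": "a", "A2": "b", "A3": "c", "A4": "d"},    # density 1.0
    {"A1": "a", "A2": None, "A3": "c", "A4": None},  # density 0.5
    {"A1": "a", "A2": "b", "A3": None, "A4": None},  # density 0.5
    {"A1": "a", "A2": "b", "A3": "c", "A4": None},   # density 0.75
]
density = source_density(rows, projection)
```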

11
Learning Density Statistics

A more realistic solution: classify the queries
and learn density statistics only w.r.t. the
classes

Assumption: if a tuple t represents a real-world
entity E, then whether or not t has a missing value
on attribute A is independent of E's actual value
of A.

Select A1, A2, A3, A4 From S Where A1 > v1
Projection attribute set: A1, A2, A3, A4
Selection predicate: A1 > v1
12
Learning Density Statistics

Query class for density statistics: the projection
attribute set
For queries whose projection attribute set is
(A1, A2, ..., Am), there are 2^m different types of answers

2^2 different density patterns (¬Ai: Ai is missing)
dp1 (A1, A2), dp2 (A1, ¬A2), dp3 (¬A1, A2), dp4 (¬A1, ¬A2)
Density(A1, A2 | S) = P(dp1 | S) × 1.0 + P(dp2 | S) × 0.5
+ P(dp3 | S) × 0.5 + P(dp4 | S) × 0.0
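
As an illustrative sketch, the density of a projection attribute set can be computed as an expectation over its 2^m present/missing patterns. The pattern probabilities below are hypothetical; in practice they would have to be estimated for each source.

```python
from itertools import product
from typing import Dict, List, Tuple

Pattern = Tuple[bool, ...]  # True = attribute present, False = attribute missing


def density_patterns(m: int) -> List[Pattern]:
    """All 2**m present/missing patterns over m projected attributes."""
    return list(product([True, False], repeat=m))


def pattern_density(pattern: Pattern) -> float:
    """Fraction of attributes present in one pattern."""
    return sum(pattern) / len(pattern)


def expected_density(pattern_probs: Dict[Pattern, float]) -> float:
    """Sum over patterns of P(pattern | S) times the pattern's density."""
    return sum(p * pattern_density(dp) for dp, p in pattern_probs.items())


# Hypothetical P(dp | S) for the projection attribute set (A1, A2).
probs = {
    (True, True): 0.5,      # both present, density 1.0
    (True, False): 0.25,    # A2 missing, density 0.5
    (False, True): 0.125,   # A1 missing, density 0.5
    (False, False): 0.125,  # both missing, density 0.0
}
d = expected_density(probs)
```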
13
Learning Density Statistics
R(A1, A2, ..., An)
2^n possible projection attribute sets
(A1), (A1, A2), (A1, A3), ..., (A1, A2, ..., Am), ...
2^m possible density patterns (¬Ai: Ai is missing)
(A1, A2, ..., Am), (¬A1, A2, ..., Am), (A1, ¬A2, ..., Am), ..., (¬A1, ¬A2, ..., ¬Am)
For each data source S, the mediator needs to
estimate an exponential number of joint probabilities!
14
Learning Density Statistics

Independence Assumption: the probability of tuple
t having a missing value on attribute A1 is
independent of whether or not t has a missing
value on attribute A2.
For queries whose projection attribute set is
(A1, A2, ..., Am), only need to assess m
probability values for each source!

Joint distribution (¬A2: A2 is missing):
P(A1, ¬A2 | S) = P(A1 | S) × (1 - P(A2 | S))
Learned from a sample of the data source
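
A minimal sketch of this step, assuming the sample is a list of tuples with None for missing values (the sample data and function names are hypothetical). Only one probability per attribute is estimated; any joint pattern probability is then a product, and the expected density of a projection reduces to the mean of the per-attribute probabilities.

```python
from typing import Dict, List, Optional, Sequence

Row = Dict[str, Optional[str]]


def present_probs(sample: List[Row], attrs: Sequence[str]) -> Dict[str, float]:
    """Estimate P(Ai present | S) for each attribute from a sample of S."""
    if not sample:
        raise ValueError("sample must not be empty")
    n = len(sample)
    return {a: sum(r.get(a) is not None for r in sample) / n for a in attrs}


def pattern_prob(probs: Dict[str, float], pattern: Dict[str, bool]) -> float:
    """P(pattern | S) under the independence assumption (True = present)."""
    result = 1.0
    for attr, present in pattern.items():
        result *= probs[attr] if present else 1.0 - probs[attr]
    return result


def density(probs: Dict[str, float], projection: Sequence[str]) -> float:
    """Expected density of a projection: mean of P(Ai present | S)."""
    return sum(probs[a] for a in projection) / len(projection)


# Hypothetical sample drawn from source S.
sample = [
    {"A1": "x", "A2": "y"},
    {"A1": "x", "A2": None},
    {"A1": "x", "A2": "y"},
    {"A1": None, "A2": None},
]
probs = present_probs(sample, ["A1", "A2"])
p_a1_not_a2 = pattern_prob(probs, {"A1": True, "A2": False})
dens = density(probs, ["A1", "A2"])
```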
15
Outline

Introduction & Overview
Coverage/Overlap Statistics
Learning Density Statistics
Learning Latency Statistics
Multi-Objective Query Processing
Other Contribution
Conclusion

16
Latency Statistics

Existing work: source-specific measurement of
response time
Variations with time, day of the week, quantity of
data, etc.
However, latency is often query specific
For example, some attributes are indexed
How to classify queries to learn latency?
Binding Pattern

(Figure: example queries with the same binding pattern vs. different binding patterns)
17
Latency Statistics
18
Using Latency Statistics

Learning is straightforward: average over a group
of training queries for each binding pattern
Effectiveness of binding-pattern-based latency
statistics
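
The averaging step can be sketched as below, assuming a binding pattern is identified by the set of attributes a query binds. The training log, attribute names and latencies are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Tuple

BindingPattern = FrozenSet[str]  # the set of attributes bound in a query
Key = Tuple[str, BindingPattern]


def learn_latency(
    training: Iterable[Tuple[str, Iterable[str], float]]
) -> Dict[Key, float]:
    """Average observed latency for each (source, binding pattern) pair."""
    samples: Dict[Key, List[float]] = defaultdict(list)
    for source, bound_attrs, seconds in training:
        samples[(source, frozenset(bound_attrs))].append(seconds)
    return {key: sum(vals) / len(vals) for key, vals in samples.items()}


# Hypothetical training log: (source, bound attributes, latency in seconds).
log = [
    ("S1", ["author"], 1.0),
    ("S1", ["author"], 3.0),
    ("S1", ["title"], 8.0),
    ("S1", ["author", "title"], 2.0),
    ("S1", ["title", "author"], 4.0),  # same binding pattern as the line above
]
stats = learn_latency(log)
```

At query time, the mediator would look up the incoming query's binding pattern in `stats` to predict each source's latency.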

19
Outline

Introduction & Overview
Coverage/Overlap Statistics
Learning Density Statistics
Learning Latency Statistics
Multi-Objective Query Processing
Other Contribution
Conclusion

20
Multi-Objective Query Processing

Users may not be easy to please
"Give me some good answers fast"
"I need many good answers"
These goals are often conflicting!
A decoupled optimization strategy won't work
Example
S1 (coverage = 0.60, density = 0.10)
S2 (coverage = 0.55, density = 0.15)
S3 (coverage = 0.50, density = 0.50)

21
Multi-Objective Query Processing

The mediator needs to select sources that are
good in many dimensions
Overall optimality
Query selection plans can be viewed as
3-dimensional vectors
Option 1: Pareto optimal set
Option 2: aggregating multi-dimensional vectors into
scalar utility values
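
Both options can be sketched on the coverage/density figures from the earlier example. This is an illustration only: it uses two dimensions rather than three, S4 is a hypothetical extra source added to show dominance, and the equal weights are an arbitrary assumption rather than the utility model used in the thesis.

```python
from typing import Dict, List, Tuple

Stats = Tuple[float, float]  # (coverage, density), higher is better for both


def dominates(a: Stats, b: Stats) -> bool:
    """True if a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_set(sources: Dict[str, Stats]) -> List[str]:
    """Option 1: keep every source that no other source dominates."""
    return [
        s for s, v in sources.items()
        if not any(dominates(w, v) for t, w in sources.items() if t != s)
    ]


def weighted_sum(v: Stats, weights: Stats) -> float:
    """Option 2: collapse the vector into one scalar utility value."""
    return sum(w * x for w, x in zip(weights, v))


sources = {
    "S1": (0.60, 0.10),
    "S2": (0.55, 0.15),
    "S3": (0.50, 0.50),
    "S4": (0.40, 0.05),  # hypothetical source, dominated by all the others
}
front = pareto_set(sources)
best = max(sources, key=lambda s: weighted_sum(sources[s], (0.5, 0.5)))
```

Here the Pareto set still contains S1, S2 and S3, so it does not single out one choice, while the scalar utility ranks S3 first.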

22
Combining Density and Coverage
23
Combining Density and Coverage
24
Combining Density and Coverage
25
Multi-Objective Query Processing

discount model
weighted sum model

2D coverage
26
Multi-Objective Query Processing
27
Outline

Introduction & Overview
Coverage/Overlap Statistics
Learning Density Statistics
Learning Latency Statistics
Multi-Objective Query Processing
Other Contribution
Conclusion

28
Other Contribution

Incremental Statistics Maintenance (In Thesis)

29
Other Contribution

A snapshot of public web services (not in Thesis)
SIGMOD Record, Mar. 2005

Implications and lessons learned
Most publicly available web services support
simple data sensing and conversion, and can be
viewed as distributed data sources
Discovery/retrieval of public web services does not
go beyond what the commercial search engines do
Composition
Very few services are available, with little correlation
among them
Most composition problems can be solved with
existing data integration techniques

30
Other Contribution

Query Processing over Incomplete Autonomous
Databases (with Hemal Khatri)
Retrieving uncertain answers where constrained
attributes are missing
Learning Approximate Functional Dependencies and
classifiers to reformulate the original user
queries

Select * from cars where model = civic
(Make, Body Style) → Model
Q1: select * from cars where make = Honda and BodyStyle = sedan
Q2: select * from cars where make = Honda and BodyStyle = coupe
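
A rough sketch of the reformulation idea, assuming the dependency (Make, Body Style) → Model has already been learned and a small sample of complete tuples is available. The sample rows, attribute names and the `reformulate` helper are hypothetical, and the rewritten queries are built as plain strings purely for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, str]


def reformulate(sample: List[Row], determining: Sequence[str],
                dependent: str, value: str, table: str = "cars") -> List[str]:
    """Rewrite `dependent = value` as queries over the determining attributes."""
    combos: List[Tuple[str, ...]] = []
    for row in sample:
        if row.get(dependent) == value:
            combo = tuple(row[a] for a in determining)
            if combo not in combos:  # keep first-seen order, drop duplicates
                combos.append(combo)
    queries = []
    for combo in combos:
        cond = " and ".join(f"{a} = '{v}'" for a, v in zip(determining, combo))
        queries.append(f"select * from {table} where {cond}")
    return queries


# Hypothetical sample of complete tuples from the cars database.
sample = [
    {"make": "Honda", "bodystyle": "sedan", "model": "civic"},
    {"make": "Honda", "bodystyle": "coupe", "model": "civic"},
    {"make": "Toyota", "bodystyle": "sedan", "model": "camry"},
    {"make": "Honda", "bodystyle": "sedan", "model": "civic"},
]
rewritten = reformulate(sample, ["make", "bodystyle"], "model", "civic")
```

The rewritten queries can retrieve tuples whose model value is missing but whose make and body style suggest a likely match.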
31
Conclusion