Title: Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data
1Probabilistic Threshold Range Aggregate Query
Processing over Uncertain Data
Wenjie Zhang University of New South Wales
NICTA, Australia
Joint work Shuxiang Yang, Ying Zhang, Xuemin Lin
(UNSW NICTA)
2Outline
- Background and Preliminaries
- Probabilistic Threshold Range Aggregate Query
- Exact query processing
- Approximate query processing Simple Sampling
Double Sampling - Experiments
- Conclusion
3Applications
- Many applications involve data that is imperfect
due to - data randomness and incompleteness
- limitation of equipment
- delay or lose in data transfer
-
- Applications
- Sensor networks
- Environmental surveillance
- Moving objects
- Data cleaning and integration
-
4Applications
- Sensor Networks
- Sensor readings are often imprecise due to
equipment limitation and periodical reporting
mechanism. - (figures are borrowed from Jian et al,
SIGMOD08)
5Applications
- Mobile Equipments / Moving Objects
- A mobile object reports its location
periodically, the exact location is often
uncertain.
6Applications
7Applications
- Data Quality
- Social Data Collection Errors and estimation
inherent in customer surveys and sampling
8Outline
- Background and Preliminaries
- Modeling Uncertainty Related Work
- Probabilistic Threshold Range Query
- Conclusion
9Modeling Uncertainty ( cont. )
- Uncertain Objects Model
- Continuous case described using a probability
density function (PDF) fU such that
. E.g., uniform distribution, normal
distribution.
10Modeling Uncertainty ( cont. )
- Uncertain Objects Model
- Discrete case described using a set of
instances each instance u has an occurrence
probability pu
11Possible World Semantics
- Given a set of uncertain objects U1,U2, ...,
Un, a possible world W u1,u2, .., un is a
set of n instances --- one instance per uncertain
object - The probability of a possible worlds is
- P(W)
- Let ? be the set of all possible world, clearly,
-
12Probabilistic Queries
- Query Evaluation CKP03, CXPSV04, DS04, DS05,
DS07, SD07 - Aggregate Queries BDJR05, MJ07, CG07
- Join Queries CSP06, AW07
- Top-k queries SIC07, YLSK08, RDS07, HJZL08
- Nearest Neighbor Queries KKR07, CCMC08
- Skyline Queries PJLY07
-
13Range query
- Uncertain objects, exact query
- Probability threshold is often assigned
14Related Work
- Range Queries TCXNKP05, BPS06, AY08
- Given a rectangle r and a probabilistic
threshold t , find all objects that appear in r
with probability at least t. - Appearance probability
-
15U-tree
Probabilistically Constrained Region ( PCR )
TCXNKP05
PCR (0.2)
Multi PCRs
16Outline
- Introduction
- Modeling Uncertainty Related Work
- Probabilistic Threshold Range Aggregate Query
(PTRA) - Conclusion
17Contribution
- Formally define PTRA query
- aU-Tree structure for exact PTRA query
- singleSample and doubleSample techniques for
approximate answer.
18Problem Statement
- Given a set of uncertain objects and query q ,
return the number of uncertain objects with
appearance probability no less than threshold pq
19Problem Definition
- Assume threshold 0.5, if the appearance
probability computed for b is gt 0.5 and for c is
lt 0.5, then the aggregate returned is 2 (a b)
20Exact Query Processing ( aU-Tree)
- Main idea add aggregate information on U-tree
- Advantage stop at intermediate level if pruned
or fully covered by the query - Disadvantage otherwise, still need to drill down
to the leaf nodes. - For a large portion of uncertain objects,
appearance probability needs to be computed - Expensive for a massive number of instances per
object!
21Exact Query Processing ( aU-Tree)
22singleSample
- Sampling the instances of the uncertain objects.
- If m out of m sampled instances are inside query
region, then the approximate appearance
probability is m/m
23singleSample ( cont. )
An immediate application of Chernoff-Hoeffding
bound
24doubleSample
- Single Sampling is expensive when there is a
massive number of objects! - Sampling the uncertain objects as well.
- Naive uniform sampling objects from all
uncertain objects.
25doubleSample Accuracy
- Note appearance probability of each object
follows uniform distribution means spatial
location is uniformly distributed. - Using Chernoff-Hoeffding bound.
26doubleSample Our Approach
- Skew!
- Aim select K disjoint groups covering all
objects with the minimum skew i.e. objects in
each group with uniform distribution. (Then do
uniform sampling of objects in each group.) - The optimization problem is NP-hard.
- Observation
- Min-skew is a good heuristic to conduct such a
group. - aU-tree groups objects with a similar principle
to the min-skew.
27doubleSample Our Approach
- Step 1 choose K subtrees to cover all objects
with the total minimum skew. NP-hard! - Find a level L such that the number of nodes at
level L is smaller than K but the number of nodes
at level L-1 is larger than K. - Feed the min-skew algorithm with the subtrees at
level L. - (note if at a level L, the number of nodes
K, then these K subtrees are chosen.) - Step 2 sample objects in each subtree.
- Step 3. sample instances in each sampled object.
28Experiments
- Algorithms
- exact, singleSample, doubleSample
- Data set
- LB 53k objects at long beach country
- CA 62k objects at California
- Synthetic aircraft dataset in 3D
- 10k instances for each points follow Uniform
or constrained-Gaussian - Setting C, P4 2.8GHz , 2G memory, Debian
linux, Page size 8K
29Efficiency
30Accuracy
31Accuracy ( cont. )
32Conclusion
- Definition of PTRA
- aU-Tree technique
- Sampling technique
- Future work. Any approach with theoretic
guarantee?
33 34Min-Skew technique