Title: Indexing Multi-Dimensional Uncertain Data with Arbitrary Probability Density Functions
1Indexing Multi-Dimensional Uncertain Data with
Arbitrary Probability Density Functions
- Yufei Tao, Reynold Cheng, Xiaokui Xiao, Wang Kai Ngai, Ben Kao, Sunil Prabhakar
- City University of Hong Kong
- Hong Kong Polytechnic University
- University of Hong Kong
- Purdue University
2Multi-dimensional Uncertain Data
- Moving objects
  - An object sends its location to a server whenever its distance from the previously reported location is larger than a certain threshold.
- Sensor readings
  - Each sensor reports the temperature, humidity, UV index, ..., in its neighborhood periodically.
- Querying the (uncertain) data stored in the server directly is meaningless.
3Uncertainty Modeling
An object's location is described by a
probability density function (pdf).
4Probabilistic Range Search
Find the clients that are currently in CityU with
at least 50% probability.
This is a probabilistic range query; 50% is its probability threshold.
5Appearance Probability
E.g., uniform pdf
appearance probability
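For reference, the appearance probability of an object o in a search region rq is commonly written as the integral of its pdf over the part of its uncertainty region that falls inside rq. The symbols P_app and o.ur (uncertainty region) are assumptions here, since the slide's own formula did not survive extraction. For a uniform pdf the integral reduces to an area ratio.

```latex
P_{app}(o, q) = \int_{o.ur \,\cap\, r_q} o.pdf(x)\, dx,
\qquad
\text{uniform pdf: } P_{app}(o, q) = \frac{\mathrm{area}(o.ur \cap r_q)}{\mathrm{area}(o.ur)}
```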
6Appearance Probability
In general, the appearance probability must be calculated numerically.
Calculation time of an appearance probability in 2D space: 1.3 ms
Time for a random access: 10 ms
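To make "calculated numerically" concrete, here is a minimal Monte Carlo sketch. The slides do not say which numerical method was used, so Monte Carlo sampling is an assumption, as are the function names and the rectangle layout (x_lo, y_lo, x_hi, y_hi).

```python
import random

def appearance_probability(sample_pdf, rq, n=20000, seed=0):
    """Estimate P(o in rq) by drawing n points from o's pdf.

    sample_pdf(rng) returns one (x, y) point drawn from the pdf;
    rq is the search rectangle (x_lo, y_lo, x_hi, y_hi).
    """
    rng = random.Random(seed)
    x_lo, y_lo, x_hi, y_hi = rq
    hits = 0
    for _ in range(n):
        x, y = sample_pdf(rng)
        if x_lo <= x <= x_hi and y_lo <= y <= y_hi:
            hits += 1
    return hits / n

def uniform_disk(cx, cy, r):
    """Sampler for a uniform pdf over a circular uncertainty region."""
    def sample(rng):
        while True:  # rejection sampling from the bounding square
            x = rng.uniform(cx - r, cx + r)
            y = rng.uniform(cy - r, cy + r)
            if (x - cx) ** 2 + (y - cy) ** 2 <= r * r:
                return (x, y)
    return sample

# Query rectangle covering exactly the left half of the disk.
p = appearance_probability(uniform_disk(0.0, 0.0, 250.0),
                           (-250.0, -250.0, 0.0, 250.0))
```

Every estimate needs thousands of pdf samples, which is why avoiding these calculations matters as much as avoiding page accesses.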
7A good solution should
- Support any pdf.
- Minimize the number of page accesses.
- Minimize the number of appearance probability calculations.
- Minimize the total cost (I/O + CPU)
8Main Idea
- Pre-compute some auxiliary information that can be used to
  - efficiently decide whether an object appears in a region with at least a certain probability
  - without calculating its actual appearance probability.
9Quick Examples
pq = 20%
10Probabilistically Constrained Regions (PCR)
11Probabilistically Constrained Regions (PCR)
For a query q with search region rq and
probability pq = 0.2
- Observation 1.1 (pruning)
an object o cannot satisfy q if rq does not
intersect o.pcr(0.2)
12Probabilistically Constrained Regions (PCR)
For a query q with search region rq and
probability pq = 0.8 (= 1 - 0.2)
- Observation 1.2 (pruning)
an object o cannot satisfy q if rq does not
fully contain o.pcr(0.2)
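Observations 1.1 and 1.2 reduce to two rectangle tests. The sketch below is a minimal illustration, assuming axis-aligned rectangles stored as (x_lo, y_lo, x_hi, y_hi) tuples and, to keep the PCR computation trivial, an object with a uniform pdf over a rectangular uncertainty region. All function names are hypothetical and not taken from the slides.

```python
def pcr_uniform_rect(mbr, p):
    """o.pcr(p) for a uniform pdf over the rectangle mbr (assumed case).

    Each bounding line cuts off probability p, so every side moves
    inward by a fraction p of the extent along its dimension.
    """
    x_lo, y_lo, x_hi, y_hi = mbr
    dx, dy = (x_hi - x_lo) * p, (y_hi - y_lo) * p
    return (x_lo + dx, y_lo + dy, x_hi - dx, y_hi - dy)

def intersects(a, b):
    """True if rectangles a and b share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def contains(outer, inner):
    """True if rectangle outer fully contains rectangle inner."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

def pruned_by_obs_1_1(rq, pcr):
    """Observation 1.1: pcr must be o.pcr(pq), with pq at most 0.5."""
    return not intersects(rq, pcr)

def pruned_by_obs_1_2(rq, pcr):
    """Observation 1.2: pcr must be o.pcr(1 - pq), with pq above 0.5."""
    return not contains(rq, pcr)

mbr = (0.0, 0.0, 10.0, 10.0)
pcr_02 = pcr_uniform_rect(mbr, 0.2)                        # about (2, 2, 8, 8)
low = pruned_by_obs_1_1((8.5, 0.0, 12.0, 10.0), pcr_02)    # query with pq = 0.2
high = pruned_by_obs_1_2((1.0, 1.0, 7.5, 9.0), pcr_02)     # query with pq = 0.8
```

In the first query the true appearance probability is 0.15 (below 0.2) and in the second it is 0.52 (below 0.8), so both objects are discarded correctly without any integration.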
13Probabilistically Constrained Regions (PCR)
A query q with search region rq and probability
pq = 0.2
- Observation 1.3 (validating)
an object o definitely satisfies q if rq fully
contains the part of o.MBR on the left of l1- (or
on the right of l1+, or below l2-, or above l2+)
14Probabilistically Constrained Regions (PCR)
A query q with search region rq and probability
pq = 0.8
- Observation 1.4 (validating)
an object o definitely satisfies q if rq fully
contains the part of o.MBR on the left of l1+
(or on the right of l1-, or below l2+, or above
l2-)
15Probabilistically Constrained Regions (PCR)
A query q with search region rq and probability
pq = 0.6 (= 1 - 2 × 0.2)
- Observation 1.5 (validating)
an object o must satisfy q if rq fully contains
the part of o.MBR between l1- and l1+ (or between
l2- and l2+)
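The three validating observations all test whether rq covers a slab of o.MBR cut off by PCR lines. The sketch below spells out those slabs in 2D, assuming rectangles stored as (x_lo, y_lo, x_hi, y_hi) and a PCR whose sides are the lines l1-, l2-, l1+, l2+ in that order. The function names are hypothetical.

```python
def contains(outer, inner):
    """True if rectangle outer fully contains rectangle inner."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

def slabs_obs_1_3(mbr, pcr):
    """Parts of o.MBR beyond one PCR line; each carries probability p."""
    mx0, my0, mx1, my1 = mbr
    px0, py0, px1, py1 = pcr
    return [(mx0, my0, px0, my1),   # left of l1-
            (px1, my0, mx1, my1),   # right of l1+
            (mx0, my0, mx1, py0),   # below l2-
            (mx0, py1, mx1, my1)]   # above l2+

def slabs_obs_1_4(mbr, pcr):
    """Parts of o.MBR up to the far PCR line; each carries probability 1 - p."""
    mx0, my0, mx1, my1 = mbr
    px0, py0, px1, py1 = pcr
    return [(mx0, my0, px1, my1),   # left of l1+
            (px0, my0, mx1, my1),   # right of l1-
            (mx0, my0, mx1, py1),   # below l2+
            (mx0, py0, mx1, my1)]   # above l2-

def slabs_obs_1_5(mbr, pcr):
    """Parts of o.MBR between both lines of one dimension; probability 1 - 2p."""
    mx0, my0, mx1, my1 = mbr
    px0, py0, px1, py1 = pcr
    return [(px0, my0, px1, my1),   # between l1- and l1+
            (mx0, py0, mx1, py1)]   # between l2- and l2+

def validated(rq, slabs):
    """An object is validated as soon as rq covers any one slab."""
    return any(contains(rq, s) for s in slabs)

mbr, pcr_02 = (0, 0, 10, 10), (2, 2, 8, 8)   # pcr_02 stands for o.pcr(0.2)
v13 = validated((-1, -1, 2.5, 11), slabs_obs_1_3(mbr, pcr_02))  # pq = 0.2
v14 = validated((-1, -1, 8, 11), slabs_obs_1_4(mbr, pcr_02))    # pq = 0.8
v15 = validated((2, -1, 8, 11), slabs_obs_1_5(mbr, pcr_02))     # pq = 0.6
```

A small query such as (3, 3, 7, 7) covers none of the slabs, so it would still need an explicit appearance probability calculation.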
16Probabilistically Constrained Regions (PCR)
- o.pcr(0.2) provides 5 heuristics to reduce CPU cost
- In general, for a prob-range query with probability threshold pq
  - if pq < 0.5
    - o may be pruned using o.pcr(pq): observation 1.1
    - o may be validated using o.pcr(pq): observation 1.3
    - o may be validated using o.pcr((1 - pq)/2): observation 1.5
  - if pq > 0.5
    - o may be pruned using o.pcr(1 - pq): observation 1.2
    - o may be validated using o.pcr(1 - pq): observation 1.4
    - o may be validated using o.pcr((1 - pq)/2): observation 1.5
- pq in [0, 1] → infinite number of pq → infinite number of PCRs
- Impractical!
- It is possible to use a finite number of PCRs to achieve pruning and validating.
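The case analysis can be written as a small lookup that returns, for a given pq, which observation applies and which PCR parameter it needs. Observation 1.5 uses o.pcr((1 - pq)/2) in both branches because the strip between the two lines of o.pcr(p) carries probability 1 - 2p. This is a sketch with hypothetical names; the slide does not cover pq = 0.5 exactly, and grouping it with the lower branch is an assumption.

```python
def pcr_parameters(pq):
    """Map a probability threshold to the PCR parameters each rule needs.

    Returns a dict with one pruning rule and two validating rules,
    each given as (observation number, p) meaning "use o.pcr(p)".
    """
    if not 0.0 < pq <= 1.0:
        raise ValueError("pq must lie in (0, 1]")
    both_sides = (1.0 - pq) / 2.0      # strip between two lines has 1 - 2p
    if pq <= 0.5:                      # pq == 0.5 placed here by assumption
        return {"prune": ("1.1", pq),
                "validate": [("1.3", pq), ("1.5", both_sides)]}
    return {"prune": ("1.2", 1.0 - pq),
            "validate": [("1.4", 1.0 - pq), ("1.5", both_sides)]}

rules_low = pcr_parameters(0.2)    # prune and validate with o.pcr(0.2), o.pcr(0.4)
rules_high = pcr_parameters(0.6)   # prune with o.pcr(0.4), validate with o.pcr(0.2)
```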
17Using PCRs in a Conservative Way
E.g., U-catalog {0, 0.1, 0.2, 0.3, 0.4, 0.5}
for a query q with search region rq and
probability pq = 0.25
an object o cannot satisfy q if rq does not
intersect o.pcr(0.25)
0.25 is not in the U-catalog, but o.pcr(0.2) encloses o.pcr(0.25), so a weaker rule still holds:
an object o cannot satisfy q if rq does not
intersect o.pcr(0.2)
18Using PCRs in a Conservative Way
U-catalog {0, 0.1, 0.2, 0.3, 0.4, 0.5}
for a query q with search region rq and
probability pq = 0.75
an object o cannot satisfy q if rq does not fully
contain o.pcr(0.25)
0.25 is not in the U-catalog, but o.pcr(0.3) lies inside o.pcr(0.25), so a weaker rule still holds:
an object o cannot satisfy q if rq does not fully
contain o.pcr(0.3)
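Choosing the catalog value for pruning is a rounding step: round down for the intersection test, round up for the containment test. A minimal sketch follows, assuming a sorted catalog that contains both 0 and 0.5; the function name is hypothetical.

```python
import bisect

U_CATALOG = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]

def catalog_value_for_pruning(pq, catalog=U_CATALOG):
    """Pick the catalog PCR that keeps the pruning rule conservative."""
    if pq <= 0.5:
        # Intersection test: need a PCR that encloses o.pcr(pq),
        # i.e. the largest catalog value not exceeding pq.
        return catalog[bisect.bisect_right(catalog, pq) - 1]
    # Containment test: need a PCR enclosed by o.pcr(1 - pq),
    # i.e. the smallest catalog value not below 1 - pq.
    # Rounding removes float noise such as 1 - 0.7 = 0.30000000000000004.
    target = round(1.0 - pq, 9)
    return catalog[bisect.bisect_left(catalog, target)]

p_low = catalog_value_for_pruning(0.25)    # intersection test on this PCR
p_high = catalog_value_for_pruning(0.75)   # containment test on this PCR
```

For pq = 0.25 this selects o.pcr(0.2) and for pq = 0.75 it selects o.pcr(0.3), the same choices as on the two slides.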
19U-catalog Size m
{0, 0.5}: m = 2
{0, 0.25, 0.5}: m = 3
{0, 0.1, 0.2, 0.3, 0.4, 0.5}: m = 6
larger m → more PCRs → greater pruning/validating power → less CPU cost
larger m → higher space consumption → larger I/O cost
m = 9
20Conservative Functional Boxes (CFB)
U-catalog {0, 0.1, 0.2, 0.3, 0.4, 0.5}
o.pcr: 2m values for each dimension
o.cfbout: 4 values for each dimension
o.cfbin: 4 values for each dimension
total: 8 values for each dimension (for m = 9: 8 vs. 18)
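The space comparison is plain arithmetic. The helper below, with a hypothetical name, multiplies the per-dimension counts from the slide by the dimensionality d to get the number of stored values per object.

```python
def values_per_object(m, d):
    """Stored values per object: (all m PCRs, the two CFBs)."""
    pcr_values = 2 * m * d     # 2 bounding lines per dimension per catalog value
    cfb_values = (4 + 4) * d   # 4 per dimension for cfbout plus 4 for cfbin
    return pcr_values, cfb_values

counts_2d = values_per_object(9, 2)   # m = 9 in 2D: 36 values vs. 16 values
```

The CFB cost does not grow with m, so a larger U-catalog no longer inflates each index entry.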
21Conservative Functional Boxes (CFB)
for a query q with search region rq and
probability pq = 0.25
U-catalog {0, 0.1, 0.2, 0.3, 0.4, 0.5}
an object o cannot satisfy q if rq does not
intersect o.pcr(0.25)
since o.pcr(0.2) encloses o.pcr(0.25):
an object o cannot satisfy q if rq does not
intersect o.pcr(0.2)
since o.cfbout(0.2) encloses o.pcr(0.2):
an object o cannot satisfy q if rq does not
intersect o.cfbout(0.2)
22Conservative Functional Boxes (CFB)
for a query q with search region rq and
probability pq = 0.75
U-catalog {0, 0.1, 0.2, 0.3, 0.4, 0.5}
an object o cannot satisfy q if rq does not fully
contain o.pcr(0.25)
since o.pcr(0.3) lies inside o.pcr(0.25):
an object o cannot satisfy q if rq does not fully
contain o.pcr(0.3)
since o.cfbin(0.3) lies inside o.pcr(0.3):
an object o cannot satisfy q if rq does not fully
contain o.cfbin(0.3)
23Comparing CFBs with PCRs
- CFBs have weaker pruning/validating power than PCRs
- But CFBs require less space than PCRs
24Finding Conservative Functional Boxes
- goal: minimize
- for the i-th dimension, minimize
- with the following constraints
- Linear Programming: Simplex Method
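The objective and constraints on this slide were figures and did not survive, so the following is only one plausible formulation and not necessarily the one the authors use. It assumes that one face of o.cfbout along one dimension is a linear function f(p) = a + b·p, that conservativeness means f(p) never exceeds the lower PCR boundary at any catalog value, and that the goal is to minimize the total gap. Since this linear program has only two variables, the sketch swaps the Simplex method for a brute-force check of every line through two data points, which is where the optimum of such a program lies.

```python
from itertools import combinations

def fit_lower_face(ps, lows, eps=1e-9):
    """Fit f(p) = a + b*p with f(p_j) <= lows[j] for all j, minimising the gap.

    ps:   catalog values (distinct); lows: lower PCR boundary at each value.
    Returns (a, b). Hypothetical helper, not the authors' procedure.
    """
    points = list(zip(ps, lows))
    best = None
    for (p1, y1), (p2, y2) in combinations(points, 2):
        b = (y2 - y1) / (p2 - p1)          # candidate line through two points
        a = y1 - b * p1
        if all(a + b * p <= y + eps for p, y in points):   # stays outside all PCRs
            gap = sum(y - (a + b * p) for p, y in points)
            if best is None or gap < best[0]:
                best = (gap, a, b)
    return best[1], best[2]

catalog = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
lower_bounds = [0.0, 1.5, 2.6, 3.5, 4.3, 5.0]   # made-up PCR boundary values
a, b = fit_lower_face(catalog, lower_bounds)
```

With these made-up boundary values the only feasible candidate is the chord from the first to the last point, so the fitted face is f(p) = 10·p. Two numbers (a, b) then replace m stored boundaries for that face.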
25More in Our Paper
- The U-tree
  - a dynamic index designed to accelerate prob-range queries.
26Experimental Results
- data space: [0, 10000]^d
- uncertainty region shape: circle (sphere)
- uncertainty region radius: 250
- data sets
  - Long Beach County (LB): 53k 2D objects, uniform pdf
  - California (CA): 62k 2D objects, Gaussian pdf
  - Aircraft: 100k 3D objects, uniform pdf
- query set: 100 queries for each data set, with various sizes of rq and different pq
27Experimental Results
28Experimental Results
Query performance vs. search region size (LB, pq = 0.6)
29Experimental Results
Query performance vs. search region size (CA, pq = 0.6)
30Experimental Results
Query performance vs. search region size (Aircraft, pq = 0.6)
31Experimental Results
Query performance vs. probability threshold (LB, qs = 1500)
32Experimental Results
Query performance vs. probability threshold (CA, qs = 1500)
33Experimental Results
Query performance vs. probability threshold (Aircraft, qs = 1500)
34Summary
- A fast method for answering probabilistic range
search queries.