Title: PIXML: Probabilistic Semistructured Databases
1PIXML: Probabilistic Semistructured Databases
- Edward Hung, Lise Getoor, V.S. Subrahmanian
- University of Maryland, College Park
2Outline
- Motivating example
- Semistructured data model
- PIXML data model
- Semantics
- Interpretation
- Satisfaction
- Other work done
- Related work
- Future work
3Motivating Example
- Surveillance applications monitor a region of a battlefield
- An image processing system identifies vehicles in convoys appearing in the region at different times
- Convoys
- Timestamp
- Tanks, trucks, etc.
- Uncertainty
- Number of vehicles
- Category and identity of a vehicle, e.g., is it a tank? A T-72?
4Motivating Example
- A Doppler speed system detects the speed and velocity of convoys and infers their possible destinations
- Convoys
- Timestamp
- Possible destinations
- Uncertainty
- Number of places the convoy will go
- The names of the places
5Motivating Example
- Semistructured data model
- General hierarchical structure is known.
- The schema is not fixed
- Number of vehicles
- Properties of vehicles
- Our work stores uncertain information in probabilistic environments.
6Semistructured Data Model
- Instance S = (V, lch, t, val)
- lch(o, l): the set of children of o with label l
- G = (V, lch) is a rooted, directed, edge-labeled graph
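As a rough illustration of the tuple S = (V, lch, t, val), the sketch below stores each component in a plain Python container. The class name, the `children` helper, the type names and the choice of which truck is of type mac are all assumptions made for this example; the slides only fix the four components and the convoy vocabulary.

```python
from dataclasses import dataclass, field


@dataclass
class SemistructuredInstance:
    """S = (V, lch, t, val); the class itself is illustrative, not from the slides."""
    V: set                                    # objects
    lch: dict                                 # (object, label) -> set of children
    t: dict = field(default_factory=dict)     # leaf -> type
    val: dict = field(default_factory=dict)   # leaf -> value

    def children(self, o, label):
        """lch(o, l): the children of o reached through edges labeled l."""
        return self.lch.get((o, label), set())


# Hypothetical fragment of the convoy example.
S = SemistructuredInstance(
    V={"S1", "convoy2", "ts2", "truck3"},
    lch={
        ("S1", "convoy"): {"convoy2"},
        ("convoy2", "ts"): {"ts2"},
        ("convoy2", "truck"): {"truck3"},
    },
    t={"ts2": "timestamp", "truck3": "truck-type"},
    val={"ts2": 15, "truck3": "mac"},
)
```

Keying lch by (object, label) pairs keeps the edge labels explicit, which matches the edge-labeled graph G = (V, lch).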
7Semistructured Data Model
Time 10
8Semistructured Data Model
Time 15
9Semistructured Data Model
10PIXML Data Model
- Uncertainty
- Existence of sub-objects
- Number of sub-objects
- Identity of the sub-objects
11PIXML Data Model
- Weak instance W = (V, lch, t, val, card)
- The cardinality constraint card(o, l) gives lower and upper bounds on the number of sub-objects connected to the same parent o by edges labeled l.
12PIXML Data Model
- Example
- Convoy 2 surely has a timestamp
- card(convoy2, ts) = [1, 1]
- Convoy 2 may have one to two trucks
- card(convoy2, truck) = [1, 2]
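A cardinality constraint can be checked by counting the l-labeled children of an object. The following is a minimal sketch under the assumption that lch and card are dictionaries keyed by (object, label); the function name and the object names ts2 and truck3 are chosen for illustration.

```python
def satisfies_card(lch, card, o, label):
    """True if the number of label-children of o lies within card(o, label)."""
    lo, hi = card[(o, label)]
    k = len(lch.get((o, label), set()))
    return lo <= k <= hi


card = {("convoy2", "ts"): (1, 1), ("convoy2", "truck"): (1, 2)}

# One truck: allowed by card(convoy2, truck) = [1, 2].
one_truck = {("convoy2", "ts"): {"ts2"}, ("convoy2", "truck"): {"truck3"}}
# No truck: violates the lower bound of 1.
no_truck = {("convoy2", "ts"): {"ts2"}}
```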
13PIXML Data Model (Cardinality)
Weak instance W = semistructured instance + card
14PIXML Data Model
- Compatible Instances
- A semistructured instance S = (VS, lchS, tS, valS) is compatible with a weak instance W = (VW, lchW, tW, valW, card) if
-
- (VS, lchS) is a rooted connected graph.
- If o is a leaf in S, then
- If o is also a leaf in W, tS(o) = tW(o) and valS(o) = valW(o); otherwise, the type and value are defined as unknown.
- Otherwise, card(o, l).min ≤ k ≤ card(o, l).max, where k is the number of l-labeled children of o, i.e. k = |lchS(o, l)|
15PIXML Data Model
16PIXML Data Model
- Example
- There are surely 2 convoys.
- card(S, convoy) = [2, 2]
- Convoy 1 surely has a timestamp, a truck and a tank.
- card(convoy1, ts) = [1, 1]
- card(convoy1, truck) = [1, 1]
- card(convoy1, tank) = [1, 1]
- Convoy 2 surely has a timestamp
- card(convoy2, ts) = [1, 1]
- Convoy 2 may have one to two trucks
- card(convoy2, truck) = [1, 2]
17PIXML Data Model
- D(W): the set of all semistructured instances compatible with a weak instance W
19PIXML Data Model (Weak Instance)
- Example of a weak instance W
card(S1, convoy) = [2, 2]
card(convoy1, ts) = [1, 1]
card(convoy1, truck) = [1, 1]
card(convoy1, tank) = [1, 1]
card(convoy2, ts) = [1, 1]
card(convoy2, truck) = [1, 2]
20PIXML Data Model
- Example of an instance compatible with W
card(S1, convoy) = [2, 2]
card(convoy1, ts) = [1, 1]
card(convoy1, truck) = [1, 1]
card(convoy1, tank) = [1, 1]
card(convoy2, ts) = [1, 1]
card(convoy2, truck) = [1, 2]
21- D(W): the set of all semistructured instances compatible with the weak instance W
22PIXML Data Model
- Potential child set
- PC(o), the potential child set of a non-leaf object o in a weak instance W, is
- the set of all possible sets of children of o that satisfy the cardinality constraints
23PIXML Data Model
- Example
- Convoy 2 surely has one timestamp, whose value is surely 15. Convoy 2 may have a truck of type mac and/or a truck of type rover
- card(convoy2, truck) = [1, 2]
- card(convoy2, ts) = [1, 1]
- PC(convoy2) = { {ts2, truck3}, {ts2, truck4}, {ts2, truck3, truck4} }
24Potential child set of convoy2, PC(convoy2)
{ts2, truck3, truck4},
{ts2, truck3},
{ts2, truck4}
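The potential child set can be enumerated mechanically: for each label, pick between card.min and card.max of the available children, then combine the choices across labels. The sketch below assumes lch and card are dictionaries keyed by (object, label); the function name is hypothetical.

```python
from itertools import combinations, product


def potential_child_sets(lch, card, o):
    """All child sets of o that pick, per label, between min and max children."""
    per_label = []
    for (obj, label), children in sorted(lch.items()):
        if obj != o:
            continue
        lo, hi = card[(obj, label)]
        # Every subset of the label's children whose size is within [lo, hi].
        choices = [
            frozenset(c)
            for k in range(lo, hi + 1)
            for c in combinations(sorted(children), k)
        ]
        per_label.append(choices)
    # One choice per label, merged into a single child set.
    return {frozenset().union(*combo) for combo in product(*per_label)}


lch = {("convoy2", "ts"): {"ts2"}, ("convoy2", "truck"): {"truck3", "truck4"}}
card = {("convoy2", "ts"): (1, 1), ("convoy2", "truck"): (1, 2)}
pc_convoy2 = potential_child_sets(lch, card, "convoy2")
```

With the cardinalities of the example, `pc_convoy2` contains exactly the three sets listed for PC(convoy2).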
25PIXML Data Model
- Probabilistic instance I = (V, lch, t, val, card, ipf)
- The interval probability function ipf(o, c) w.r.t. the set PC(o) associates, with each c in PC(o), a closed subinterval [lb(c), ub(c)] ⊆ [0, 1]
-
26PIXML Data Model
- Example
- PC(convoy2) = { {ts2, truck3}, {ts2, truck4}, {ts2, truck3, truck4} }
- ipf(convoy2, {ts2, truck3}) = [0.2, 0.3]
- ipf(convoy2, {ts2, truck4}) = [0.3, 0.5]
- ipf(convoy2, {ts2, truck3, truck4}) = [0.2, 0.4]
27Probabilistic instance I = weak instance W + ipf
ipf(convoy2, {ts2, truck3}) = [0.2, 0.3]
ipf(convoy2, {ts2, truck4}) = [0.3, 0.5]
ipf(convoy2, {ts2, truck3, truck4}) = [0.2, 0.4]
28PIXML Data Model
- Here the ipf assigns a probability interval to each possible set of children.
- More independence assumptions are possible to make the representation more compact
- e.g. independence between trucks and tanks.
- e.g. all trucks are indistinguishable.
29Semantics (Global Interpretation)
- Interpretation
- Global interpretation, P
- a mapping from D(W) (the set of semistructured instances compatible with W) to [0, 1] s.t. $\sum_{S \in D(W)} P(S) = 1$
30Instances S1a to S1f (figure)
- P(S1a) = 0.12
- P(S1b) = 0.08
- P(S1c) = 0.2
- P(S1d) = 0.18
- P(S1e) = 0.12
- P(S1f) = 0.3
31Semantics (Local Interpretation)
- An object probability function (OPF) for an object o w.r.t. a weak instance W is a mapping w: PC(o) → [0, 1] s.t. $\sum_{c \in PC(o)} w(c) = 1$
32Semantics
- Example
- ipf(convoy2, {ts2, truck3}) = [0.2, 0.3]
- ipf(convoy2, {ts2, truck4}) = [0.3, 0.5]
- ipf(convoy2, {ts2, truck3, truck4}) = [0.2, 0.4]
- w_convoy2({ts2, truck3}) = 0.2
- w_convoy2({ts2, truck4}) = 0.5
- w_convoy2({ts2, truck3, truck4}) = 0.3
33Semantics (Local Interpretation)
- Previously, probabilities were assigned to each compatible instance globally.
- Now we assign probabilities to the actual children of each non-leaf object in a local manner.
34Object probability function (OPF) for convoy2 w.r.t. W is a mapping w: PC(convoy2) → [0, 1] s.t.
w_convoy2({ts2, truck3}) = 0.2
w_convoy2({ts2, truck4}) = 0.5
w_convoy2({ts2, truck3, truck4}) = 0.3
35Semantics (Local Interpretation)
- Interpretation
- Local interpretation, p
- a mapping from the set of non-leaf objects to OPFs
- Example
- p(convoy2) = w_convoy2
36Semantics (Local → Global)
- Assume that the probability of any potential child of an object o is independent of the non-descendants of o.
- W operator
- The W operator returns the probabilities assigned to every semistructured instance compatible with a given weak instance, consistent with a given local interpretation.
- Given a semistructured instance S compatible with a weak instance W and a local interpretation p for W
- $W(p)(S) = \prod_{o \in S} p(o)(C_S(o))$, where C_S(o) is the set of children of o in S
- Theorem
- W(p) is a global interpretation for W
37Semantics
- Example
- ipf(S1, {convoy1, convoy2}) = [1, 1]
- w_S1({convoy1, convoy2}) = 1
- ipf(convoy1, {ts1, truck1, tank1}) = [0.2, 0.6]
- ipf(convoy1, {ts1, truck1, tank2}) = [0.4, 0.8]
- w_convoy1({ts1, truck1, tank1}) = 0.4
- w_convoy1({ts1, truck1, tank2}) = 0.6
- ipf(convoy2, {ts2, truck3}) = [0.2, 0.3]
- ipf(convoy2, {ts2, truck4}) = [0.3, 0.5]
- ipf(convoy2, {ts2, truck3, truck4}) = [0.2, 0.4]
- w_convoy2({ts2, truck3}) = 0.2
- w_convoy2({ts2, truck4}) = 0.5
- w_convoy2({ts2, truck3, truck4}) = 0.3
38Semantics
- Example
- W(p)(S1a)
- = p(S1)({convoy1, convoy2}) x p(convoy1)({ts1, truck1, tank1}) x p(convoy2)({ts2, truck3, truck4})
- = w_S1({convoy1, convoy2}) x w_convoy1({ts1, truck1, tank1}) x w_convoy2({ts2, truck3, truck4})
- = 1 x 0.4 x 0.3
- = 0.12
39Semantics
w_S1({convoy1, convoy2}) = 1
w_convoy1({ts1, truck1, tank1}) = 0.4
w_convoy2({ts2, truck3, truck4}) = 0.3
p(S1)({convoy1, convoy2}) x p(convoy1)({ts1, truck1, tank1}) x p(convoy2)({ts2, truck3, truck4})
= w_S1({convoy1, convoy2}) x w_convoy1({ts1, truck1, tank1}) x w_convoy2({ts2, truck3, truck4})
= 1 x 0.4 x 0.3 = 0.12
40Semantics
- Example
- Similarly, we can get
- W(p)(S1a) = 0.12
- W(p)(S1b) = 0.08
- W(p)(S1c) = 0.2
- W(p)(S1d) = 0.18
- W(p)(S1e) = 0.12
- W(p)(S1f) = 0.3
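The six values above can be reproduced by multiplying, for every non-leaf object in an instance, the OPF value of its chosen child set. The sketch below assumes a tree-shaped weak instance and represents the local interpretation as a nested dictionary; the function name and the encoding of an instance as a set of (object, child set) pairs are illustrative choices, not part of the slides.

```python
from itertools import product


def global_interpretation(p, root):
    """Return {instance: probability}; an instance is a frozenset of
    (non-leaf object, chosen child set) pairs reachable from root."""
    def expand(o):
        if o not in p:                      # leaf: nothing to choose
            return [(frozenset(), 1.0)]
        results = []
        for c, w in p[o].items():
            # Independently expand every child in the chosen set.
            subtrees = [expand(child) for child in sorted(c)]
            for combo in product(*subtrees):
                choice = frozenset({(o, c)}).union(*(s for s, _ in combo))
                prob = w
                for _, q in combo:
                    prob *= q
                results.append((choice, prob))
        return results

    return dict(expand(root))


# Local interpretation of the running example (OPF values from the slides).
p = {
    "S1": {frozenset({"convoy1", "convoy2"}): 1.0},
    "convoy1": {
        frozenset({"ts1", "truck1", "tank1"}): 0.4,
        frozenset({"ts1", "truck1", "tank2"}): 0.6,
    },
    "convoy2": {
        frozenset({"ts2", "truck3"}): 0.2,
        frozenset({"ts2", "truck4"}): 0.5,
        frozenset({"ts2", "truck3", "truck4"}): 0.3,
    },
}
P = global_interpretation(p, "S1")
```

The result has six instances whose probabilities are 0.12, 0.08, 0.2, 0.18, 0.12 and 0.3 up to floating-point rounding, and they sum to 1, as the theorem on W(p) requires.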
41Semantics (Global → Local)
- (Same assumption) The probability of any potential child of an object o is independent of the non-descendants of o.
- Given a global interpretation P for a weak instance W
- P satisfies W iff P(c_o | ndes(o)) = P(c_o)
- ndes(o) is the set of non-descendants of o.
42Semantics (Global → Local)
- D operator
- The D operator returns the probabilities assigned to each possible set of children of every non-leaf object, consistent with a given global interpretation.
- Given a global interpretation P that satisfies a weak instance W, for any non-leaf object o and any c in PC(o)
-
- D(P) returns a function defined as follows: for any non-leaf object o, D(P)(o) = w_{P,o}
43Semantics (Global → Local)
- Theorem
- D(P) is a local interpretation for W
- Example
- Derive D(P)(convoy2)
44Instances S1a to S1f (figure)
- P(S1a) = 0.12
- P(S1b) = 0.08
- P(S1c) = 0.2
- P(S1d) = 0.18
- P(S1e) = 0.12
- P(S1f) = 0.3
D(P)(convoy2) = w_{P,convoy2}
- w_{P,convoy2}({ts2, truck3, truck4}) = (0.12 + 0.18)/1 = 0.3
45D(P)(convoy2) = w_{P,convoy2}
- w_{P,convoy2}({ts2, truck3, truck4}) = (0.12 + 0.18)/1 = 0.3
- w_{P,convoy2}({ts2, truck3}) = (0.08 + 0.12)/1 = 0.2
- w_{P,convoy2}({ts2, truck4}) = (0.2 + 0.3)/1 = 0.5
46Semantics
- Example
- Derive D(P)(convoy2) = w_{P,convoy2}
- w_{P,convoy2}({ts2, truck3, truck4}) = (0.12 + 0.18)/1 = 0.3
- w_{P,convoy2}({ts2, truck3}) = (0.08 + 0.12)/1 = 0.2
- w_{P,convoy2}({ts2, truck4}) = (0.2 + 0.3)/1 = 0.5
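The derivation above reads as a normalized marginal: sum P over the instances in which convoy2 has child set c, and divide by the total probability of the instances that contain convoy2 (which is 1 here). That reading is inferred from the worked numbers, since the defining formula is not part of the transcript, so the sketch below should be taken as an assumption. Each instance records only the child set of convoy2, which is all this example needs.

```python
def local_opf(P, o):
    """w_{P,o}(c): probability that o has child set c, given that o occurs."""
    total = sum(prob for inst, prob in P if o in inst)
    opf = {}
    for inst, prob in P:
        if o in inst:
            opf[inst[o]] = opf.get(inst[o], 0.0) + prob / total
    return opf


both = frozenset({"ts2", "truck3", "truck4"})
t3 = frozenset({"ts2", "truck3"})
t4 = frozenset({"ts2", "truck4"})

# Global interpretation P over S1a ... S1f, in that order.
P = [
    ({"convoy2": both}, 0.12),
    ({"convoy2": t3}, 0.08),
    ({"convoy2": t4}, 0.2),
    ({"convoy2": both}, 0.18),
    ({"convoy2": t3}, 0.12),
    ({"convoy2": t4}, 0.3),
]
w = local_opf(P, "convoy2")
```

Up to floating-point rounding, `w` assigns 0.3, 0.2 and 0.5 to the three child sets, matching the derivation.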
47Semantics (Local ↔ Global)
- Theorems
- Suppose p is a local interpretation for a weak instance W; then D(W(p)) = p.
- Suppose P is a global interpretation that satisfies a weak instance W; then W(D(P)) = P.
48Semantics (Satisfaction)
- Given a probabilistic instance I and a non-leaf object o,
- OC(o), the object constraints, are
- p(c) is a real-valued variable denoting the probability that c is the actual set of children of o.
49Semantics (Satisfaction)
- Example
- ipf(convoy2, {ts2, truck3}) = [0.2, 0.3]
- ipf(convoy2, {ts2, truck4}) = [0.3, 0.5]
- ipf(convoy2, {ts2, truck3, truck4}) = [0.2, 0.4]
- OC(convoy2)
50Semantics (Local Satisfaction)
- An OPF w satisfies a non-leaf object o iff w is a probability distribution over PC(o) that respects ipf, that is, each w(c) lies in the interval ipf(o, c).
- A local interpretation p satisfies a non-leaf object o iff p(o) satisfies o.
- A local interpretation p satisfies a probabilistic instance I iff p satisfies every non-leaf object of I.
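Local satisfaction can be checked directly. The sketch below assumes that the object constraints amount to interval bounds on each child set plus a sum-to-one condition; the explicit constraint formulas are not in the transcript, so this reading, the function name and the tolerance parameter are assumptions.

```python
def satisfies(opf, ipf, tol=1e-9):
    """True if opf is a distribution over the same child sets as ipf
    and every value lies inside its [lb, ub] interval."""
    if set(opf) != set(ipf):
        return False
    if abs(sum(opf.values()) - 1.0) > tol:
        return False
    return all(ipf[c][0] - tol <= w <= ipf[c][1] + tol for c, w in opf.items())


t3 = frozenset({"ts2", "truck3"})
t4 = frozenset({"ts2", "truck4"})
both = frozenset({"ts2", "truck3", "truck4"})

ipf_convoy2 = {t3: (0.2, 0.3), t4: (0.3, 0.5), both: (0.2, 0.4)}
w_convoy2 = {t3: 0.2, t4: 0.5, both: 0.3}      # OPF from the example
w_bad = {t3: 0.4, t4: 0.4, both: 0.2}          # 0.4 exceeds ub = 0.3 for {ts2, truck3}
```

Here `satisfies(w_convoy2, ipf_convoy2)` holds, while `w_bad` sums to 1 but violates an upper bound.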
51Semantics (Global Satisfaction)
- A global interpretation P satisfies a probabilistic instance I iff D(P) satisfies I.
- Corollary
- A local interpretation p satisfies a probabilistic instance I iff W(p) satisfies I.
52Semantics (Consistency)
- A probabilistic instance is locally consistent iff there is a local interpretation that satisfies it.
- A probabilistic instance is globally consistent iff there is a global interpretation that satisfies it.
- Theorem
- Every probabilistic instance is locally and globally consistent.
53Other Work Done
- Algebra
- Projection, selection, Cartesian product
- Probabilistic point query
- returns the probability that a given object satisfies a given path expression
- r-answer to a query
- returns the set of objects that satisfy a query with probability r or more
- Implementation of a prototype
54Other Work Done
- Experiment
- Execution time is linear in the total number of ipf entries, i.e., the instance size
- Papers submitted to ICDE and ICDT
55Related Work
- Semistructured Probabilistic Objects (SPOs) (Dekhtyar, Goldsmith, Hawkes, in SSDBM, 2001)
- SPOs express contexts (not random variables) in a semistructured manner
- The PIXML data model stores XML data AND probabilistic information.
56Related Work
- ProTDB (Nierman, Jagadish, in VLDB, 2002)
- Point probabilities VS interval probabilities
- Independent probabilities assigned to each child VS arbitrary distributions over sets of children
- Tree-structured VS arbitrary acyclic
- Our model theory provides two formal semantics
- Differences between their queries and our algebra and queries.
57Related Work
- Algebras TAX, SAL
- TAX (Jagadish, Lakshmanan, Srivastava, 2001)
- uses a pattern tree to extract subsets of nodes, one for each embedding of the pattern tree.
- fixed number of children
- SAL (Beeri, Tzaban, 1999)
- bind objects to variables
- original structure is totally lost
58Future Work
- System implementation
- Query optimization
59Summary
- PIXML data model
- Semistructured instance
- Weak instance (add cardinality)
- Probabilistic instance (add ipf)
- Semantics
- Local and Global
- Interpretation
- Satisfaction
60Algebra
- Operators
- Projection
- Selection
- Cross-product
- Path expression
- o.l1.l2. ... .ln
- e.g. S1.convoy.truck
61Algebra (Projection)
- Ancestor projection
- Descendant projection
- Single projection
62Algebra (Projection)
Semistructured Instance
63Weak Instance
64Probabilistic Instance
Before projection:
card(I2, convoy) = [1, 1]
card(convoy1, ts) = [1, 1]
card(convoy1, truck) = [1, 1]
card(convoy1, tank) = [1, 1]
ipf(I2, {convoy1}) = [1, 1]
ipf(convoy1, {ts1, truck1, tank1}) = [0, 0.3]
ipf(convoy1, {ts1, truck1, tank2}) = [0.1, 0.4]
ipf(convoy1, {ts1, truck2, tank1}) = [0.3, 0.5]
ipf(convoy1, {ts1, truck2, tank2}) = [0.3, 0.6]
After projection:
card(I2, convoy) = [1, 1]
card(convoy1, truck) = [1, 1]
ipf(I2, {convoy1}) = [1, 1]
Children of convoy1 before: C_I2(convoy1) = {ts1, truck1, truck2, tank1, tank2}
Children of convoy1 after: C'_I2(convoy1) = {truck1, truck2}
Let Cd = C_I2(convoy1) − C'_I2(convoy1) = {ts1, tank1, tank2}
PC(convoy1) after projection = { {truck1}, {truck2} }
65Probabilistic Instance
For each c in PC(convoy1),
ipf(convoy1, c) = [a, min(1, b)]
ipf(convoy1) → tight(ipf(convoy1)) (Dekhtyar, Goldsmith, 2002)
66Probabilistic Instance
For {truck1},
a = 0.0 + 0.1 = 0.1
b = 0.3 + 0.4 = 0.7
ipf(convoy1, {truck1}) = [0.1, min(1, 0.7)] = [0.1, 0.7]
67Probabilistic Instance
For {truck2},
a = 0.3 + 0.3 = 0.6
b = 0.5 + 0.6 = 1.1
ipf(convoy1, {truck2}) = [0.6, min(1, 1.1)] = [0.6, 1]
68Probabilistic Instance
ipf(convoy1) → tight(ipf(convoy1))
Before tight: ipf(convoy1, {truck1}) = [0.1, 0.7], ipf(convoy1, {truck2}) = [0.6, 1]
After tight: ipf(convoy1, {truck1}) = [0.1, 0.4], ipf(convoy1, {truck2}) = [0.6, 0.9]
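The projection example can be traced in two steps: sum the lower and upper bounds of the original child sets that agree on the kept children (capping the upper bound at 1), then tighten the resulting intervals. The tightening rule used below is the usual one for interval probabilities, where each bound is clipped by what the other intervals leave available; the slides only name the operation, so the rule and both function names are assumptions that happen to reproduce the numbers shown.

```python
def project_ipf(ipf, keep):
    """Sum bounds of child sets that agree on the kept children; cap ub at 1."""
    out = {}
    for c, (lb, ub) in ipf.items():
        key = c & keep
        a, b = out.get(key, (0.0, 0.0))
        out[key] = (a + lb, b + ub)
    return {c: (a, min(1.0, b)) for c, (a, b) in out.items()}


def tight(ipf):
    """Clip each interval by what the remaining intervals leave available."""
    total_lb = sum(lb for lb, _ in ipf.values())
    total_ub = sum(ub for _, ub in ipf.values())
    return {
        c: (max(lb, 1.0 - (total_ub - ub)), min(ub, 1.0 - (total_lb - lb)))
        for c, (lb, ub) in ipf.items()
    }


ipf_convoy1 = {
    frozenset({"ts1", "truck1", "tank1"}): (0.0, 0.3),
    frozenset({"ts1", "truck1", "tank2"}): (0.1, 0.4),
    frozenset({"ts1", "truck2", "tank1"}): (0.3, 0.5),
    frozenset({"ts1", "truck2", "tank2"}): (0.3, 0.6),
}
projected = project_ipf(ipf_convoy1, frozenset({"truck1", "truck2"}))
tightened = tight(projected)
```

Up to floating-point rounding, `projected` gives [0.1, 0.7] and [0.6, 1] for {truck1} and {truck2}, and `tightened` gives [0.1, 0.4] and [0.6, 0.9].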
69HIDE IT
card(S1, convoy) = [2, 2]
w_S1({convoy1, convoy2}) = 1
card(convoy1, ts) = [1, 1]
card(convoy1, truck) = [1, 1]
card(convoy1, tank) = [1, 1]
w_convoy1({ts1, truck1, tank1}) = 0.4
w_convoy1({ts1, truck1, tank2}) = 0.6
card(convoy2, ts) = [1, 1]
card(convoy2, truck) = [1, 2]
w_convoy2({ts2, truck3}) = 0.2
w_convoy2({ts2, truck4}) = 0.5
w_convoy2({ts2, truck3, truck4}) = 0.3
70Algebra (Projection)
- Descendant projection
card(I3, truck) = [0, 3], ipf(I3, c) = [0, 1]
One naive strategy
Our better strategy: similar to the one used in cross product
71Algebra (Projection)
card(I3, truck) = [0, 3], ipf(I3, c) = [0, 1]
72Algebra (Projection)
Equivalent
73Algebra (Projection)
Equivalent
74Algebra (Projection)
Equivalent
e1 and e2 are each a sequence of zero or more edges. Thus, I.e1.lm can include I.lm, I.l1.lm, I.l2.l3.lm, etc.
75In general non-equivalent
76Algebra (Selection) ( )
- Similar to ancestor projection
- Path expression specifies leaf objects with a
specified value.
77Algebra (Selection)
Semistructured Instance
I1
78Algebra (Selection) ( )
card(I7, convoy) = [1, 2]; w_I7({convoy1}) = 0.2, w_I7({convoy2}) = 0.5, w_I7({convoy1, convoy2}) = 0.3
card(convoy1, tank) = [1, 1]; w_convoy1({tank1}) = 0.3, w_convoy1({tank2}) = 0.7
card(convoy2, tank) = [1, 1]; w_convoy2({tank2}) = 0.4, w_convoy2({tank3}) = 0.6
0.14 + 0.3 + 0.054 + 0.036 + 0.084 = 0.614
D(I7) ?
Instance probabilities (figure): 0.06, 0.14, 0.2, 0.3, 0.036, 0.054, 0.084, 0.126; each of the five terms in the sum above is divided by 0.614.
79Algebra (Selection) ( )
card(I7, convoy) = [1, 2]; ipf(I7, {convoy1}) = [0.1, 0.3], ipf(I7, {convoy2}) = [0.4, 0.6], ipf(I7, {convoy1, convoy2}) = [0.2, 0.4]
card(convoy1, tank) = [1, 1]; ipf(convoy1, {tank1}) = [0.2, 0.4], ipf(convoy1, {tank2}) = [0.6, 0.8]
card(convoy2, tank) = [1, 1]; ipf(convoy2, {tank2}) = [0.3, 0.5], ipf(convoy2, {tank3}) = [0.5, 0.7]
D(I7) ?
Conditionalization of interval probabilities (Dekhtyar, Goldsmith, 2002)
Instance intervals (figure): [0.02, 0.12], [0.06, 0.24], [0.08, 0.24], [0.24, 0.48], [0.012, 0.08], [0.02, 0.112], [0.036, 0.16], [0.06, 0.224]
80Algebra (Cross product (x))
- Probabilistic conjunction strategies
- Example
- Ignorance
- Positive correlation
- Negative Correlation
- Independence
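The slides name the four conjunction strategies without giving formulas. The sketch below uses the interval formulas commonly given for these strategies in the probabilistic database literature; they are an assumption here, not something stated in this deck, and the function names are hypothetical. Intervals are (lb, ub) pairs.

```python
def conj_ignorance(a, b):
    """Nothing is known about the dependence between the two events."""
    return (max(0.0, a[0] + b[0] - 1.0), min(a[1], b[1]))


def conj_positive(a, b):
    """Positive correlation: the events overlap as much as possible."""
    return (min(a[0], b[0]), min(a[1], b[1]))


def conj_negative(a, b):
    """Negative correlation: the events overlap as little as possible."""
    return (max(0.0, a[0] + b[0] - 1.0), max(0.0, a[1] + b[1] - 1.0))


def conj_independence(a, b):
    """Independent events: bounds multiply."""
    return (a[0] * b[0], a[1] * b[1])


# Intervals for truck1 and tank1 from the I4 x I5 example.
truck1 = (0.2, 0.7)
tank1 = (0.1, 0.6)
results = {
    "ignorance": conj_ignorance(truck1, tank1),
    "positive": conj_positive(truck1, tank1),
    "negative": conj_negative(truck1, tank1),
    "independence": conj_independence(truck1, tank1),
}
```

Under independence, for instance, the combined interval for the child set {truck1, tank1} comes out as roughly [0.02, 0.42]; ignorance gives the widest interval of the four.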
81Algebra (Cross product (x))
card(I4, truck) = [1, 1]; ipf(I4, {truck1}) = [0.2, 0.7], ipf(I4, {truck2}) = [0.3, 0.8]
card(I5, tank) = [1, 1]; ipf(I5, {tank1}) = [0.1, 0.6], ipf(I5, {tank2}) = [0.4, 0.9]
card(I6, truck) = [1, 1], card(I6, tank) = [1, 1]
I4 x I5
85Algebra (Cross product)
- Equivalence
- (I1 x I2) x I3
- I1 x (I2 x I3)
- (I1 x I3) x I2
Equivalent
90Summary
- PIXML data model
- Semistructured instance
- Weak instance (add cardinality)
- Probabilistic instance (add ipf)
- Semantics
- Local and Global
- Interpretation
- Satisfaction
- Algebra
- Projections, selection, cross product
95Related Work
- Bayesian net (Pearl, 1988)
- random variables (probability of events)
- ours: existence of children requires existence of parents