Title: PXML: Probabilistic Semistructured Databases
1PXML Probabilistic Semistructured Databases
- Edward Hung, Lise Getoor, V.S. Subrahmanian
- University of Maryland, College Park
2Outline
- Motivating example
- Semistructured data model
- PXML data model
- Semantics
- Interpretation
- Satisfaction
- Algebra
- Related work
- Future work
3Motivating Example
- Surveillance applications monitoring a region of a battlefield
- Image processing system identifies vehicles in convoys appearing in the region at different times
- Convoys
- Timestamp
- Tanks, trucks, etc.
- Uncertainty
- Number of vehicles
- Category and identity of a vehicle, e.g., a tank? A T-72?
4Motivating Example
- Doppler speed system detects the speed and velocity of convoys and infers their possible destinations
- Convoys
- Timestamp
- Possible destinations
- Uncertainty
- Number of places the convoy will go
- The names of the places
5Motivating Example
- Semistructured data model
- General hierarchical structure is known.
- The schema is not fixed
- Number of vehicles
- Properties of vehicles
- Our work stores uncertain information in probabilistic environments.
6Semistructured Data Model
- Instance S = (V, lch, t, val)
- lch(o, l): the set of children of o with label l
- G = (V, lch) is a rooted, directed, edge-labeled graph
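As a minimal sketch of the tuple S = (V, lch, t, val), the instance can be held in a few dictionaries. The class name `Instance`, the method names, and the type name "time" are assumptions made for illustration; the slides do not prescribe a representation.

```python
from dataclasses import dataclass, field

@dataclass
class Instance:
    """Hypothetical container for S = (V, lch, t, val)."""
    root: str
    lch: dict = field(default_factory=dict)  # (object, label) -> set of children
    t: dict = field(default_factory=dict)    # leaf object -> type name
    val: dict = field(default_factory=dict)  # leaf object -> value

    def add_edge(self, parent, label, child):
        # Record child as an l-labeled child of parent.
        self.lch.setdefault((parent, label), set()).add(child)

    def objects(self):
        # V: the root plus every object that appears as a parent or a child.
        v = {self.root}
        for (parent, _label), kids in self.lch.items():
            v.add(parent)
            v |= kids
        return v

# A fragment of the convoy example; the leaf type and value are illustrative.
s = Instance(root="S1")
s.add_edge("S1", "convoy", "convoy1")
s.add_edge("convoy1", "ts", "ts1")
s.add_edge("convoy1", "truck", "truck1")
s.t["ts1"], s.val["ts1"] = "time", 10
print(sorted(s.objects()))
```

Keying lch by (object, label) mirrors the definition of lch(o, l) directly, so looking up the l-labeled children of o is a single dictionary access.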
7Semistructured Data Model
Time 10
8Semistructured Data Model
Time 15
9Semistructured Data Model
10PXML Data Model
- Uncertainty
- Existence of sub-objects
- Number of sub-objects
- Identity of the sub-objects
11PXML Data Model
- Weak instance W = (V, lch, t, val, card)
- The cardinality constraint card(o, l) gives the bounds on the number of sub-objects with edge label l connected to the same parent o.
12PXML Data Model
- Example
- Convoy 2 surely has a timestamp
- card(convoy2, ts) = [1, 1]
- Convoy 2 may have one to two trucks
- card(convoy2, truck) = [1, 2]
13PXML Data Model (Cardinality)
Weak Instance W = Semistructured Instance + card
14PXML Data Model
- Compatible Instances
- A semistructured instance S = (VS, lchS, tS, valS) is compatible with a weak instance W = (VW, lchW, tW, valW, card) if
-
- (VS, lchS) is a rooted connected graph.
- If o is a leaf in S, then
- If o is also a leaf in W, tS(o) = tW(o) and valS(o) = valW(o); otherwise, the type and value are defined as unknown.
- Otherwise, card(o, l).min ≤ k ≤ card(o, l).max, where k is the number of l-labeled children of o, i.e. k = |lchS(o, l)|
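The cardinality condition above can be sketched as a small check. This covers only the bound min ≤ k ≤ max for non-leaf objects of S; the other compatibility conditions are not modeled. The function name `satisfies_card` and the dictionary layout are assumptions for illustration.

```python
def satisfies_card(lch_s, card):
    """Check card(o, l).min <= k <= card(o, l).max for every non-leaf o in S."""
    # Objects with at least one child in S; leaves of S are exempt from the bound.
    parents = {o for (o, _label), kids in lch_s.items() if kids}
    for (o, label), (lo, hi) in card.items():
        if o not in parents:
            continue
        k = len(lch_s.get((o, label), ()))  # number of l-labeled children of o
        if not lo <= k <= hi:
            return False
    return True

# Cardinalities of the convoy example.
card = {
    ("S1", "convoy"): (2, 2),
    ("convoy1", "ts"): (1, 1), ("convoy1", "truck"): (1, 1), ("convoy1", "tank"): (1, 1),
    ("convoy2", "ts"): (1, 1), ("convoy2", "truck"): (1, 2),
}
lch_ok = {
    ("S1", "convoy"): {"convoy1", "convoy2"},
    ("convoy1", "ts"): {"ts1"}, ("convoy1", "truck"): {"truck1"}, ("convoy1", "tank"): {"tank1"},
    ("convoy2", "ts"): {"ts2"}, ("convoy2", "truck"): {"truck3", "truck4"},
}
# Same instance, but convoy2 has lost both trucks (violates [1, 2]).
lch_bad = dict(lch_ok)
lch_bad[("convoy2", "truck")] = set()

print(satisfies_card(lch_ok, card), satisfies_card(lch_bad, card))
```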
15PXML Data Model
16PXML Data Model
- Example
- There are surely 2 convoys.
- card(S, convoy) = [2, 2]
- Convoy 1 surely has a timestamp, a truck and a tank.
- card(convoy1, ts) = [1, 1]
- card(convoy1, truck) = [1, 1]
- card(convoy1, tank) = [1, 1]
- Convoy 2 surely has a timestamp
- card(convoy2, ts) = [1, 1]
- Convoy 2 may have one to two trucks
- card(convoy2, truck) = [1, 2]
17PXML Data Model
- D(W): the set of all semistructured instances compatible with a weak instance W
19PXML Data Model (Weak Instance)
- Example of a weak instance W
card(S1, convoy) = [2, 2]
card(convoy1, ts) = [1, 1]
card(convoy1, truck) = [1, 1]
card(convoy1, tank) = [1, 1]
card(convoy2, ts) = [1, 1]
card(convoy2, truck) = [1, 2]
20PXML Data Model
- Example of an instance compatible with W
card(S1, convoy) = [2, 2]
card(convoy1, ts) = [1, 1]
card(convoy1, truck) = [1, 1]
card(convoy1, tank) = [1, 1]
card(convoy2, ts) = [1, 1]
card(convoy2, truck) = [1, 2]
21- D(W): the set of all semistructured instances compatible with the weak instance W
22PXML Data Model
- Potential child set
- PC(o), the potential child set of a non-leaf object o in a weak instance W, is
- the set of all possible sets of children of o satisfying the cardinality constraints
23PXML Data Model
- Example
- Convoy 2 surely has one timestamp, whose value is surely 15. Convoy 2 may have a truck of type mac and/or a truck of type rover.
- card(convoy2, truck) = [1, 2]
- card(convoy2, ts) = [1, 1]
- PC(convoy2) = { {ts2, truck3}, {ts2, truck4}, {ts2, truck3, truck4} }
24Potential child set of convoy2, PC(convoy2)
{ts2, truck3, truck4},
{ts2, truck3},
{ts2, truck4}
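The potential child set can be enumerated mechanically from lch and card. The following is a minimal sketch: for each label, choose between min and max children, then combine the choices across labels. The function name `potential_child_sets` and the dictionary layout are assumptions for illustration.

```python
from itertools import combinations, product

def potential_child_sets(lch_w, card, o):
    """Enumerate PC(o) for a non-leaf object o of a weak instance."""
    per_label = []
    for (parent, label), kids in lch_w.items():
        if parent != o:
            continue
        lo, hi = card[(parent, label)]
        # All subsets of the l-labeled children whose size lies in [lo, hi].
        choices = [frozenset(c)
                   for k in range(lo, hi + 1)
                   for c in combinations(sorted(kids), k)]
        per_label.append(choices)
    # One choice per label, merged into a single child set.
    return {frozenset().union(*combo) for combo in product(*per_label)}

lch_w = {("convoy2", "ts"): {"ts2"}, ("convoy2", "truck"): {"truck3", "truck4"}}
card = {("convoy2", "ts"): (1, 1), ("convoy2", "truck"): (1, 2)}
pc = potential_child_sets(lch_w, card, "convoy2")
for c in sorted(pc, key=sorted):
    print(sorted(c))
```

For convoy2 this yields the three sets listed on the slide. The number of potential child sets can grow exponentially with the number of children, which is one reason a compact representation matters.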
25PXML Data Model
- Probabilistic instance I = (V, lch, t, val, card, ipf)
- The interval probability function ipf(o, c) w.r.t. the set PC(o) associates, with each c in PC(o), a closed subinterval [lb(c), ub(c)] ⊆ [0, 1]
-
26PXML Data Model
- Example
- PC(convoy2) = { {ts2, truck3}, {ts2, truck4}, {ts2, truck3, truck4} }
- ipf(convoy2, {ts2, truck3}) = [0.2, 0.3]
- ipf(convoy2, {ts2, truck4}) = [0.3, 0.5]
- ipf(convoy2, {ts2, truck3, truck4}) = [0.2, 0.4]
27Probabilistic Instance I = Weak Instance W + ipf
ipf(convoy2, {ts2, truck3, truck4}) = [0.2, 0.3]
ipf(convoy2, {ts2, truck3}) = [0.3, 0.5]
ipf(convoy2, {ts2, truck4}) = [0.2, 0.4]
28PXML Data Model
- Here the ipf assigns a probability interval to each possible set of children.
- More independence assumptions are possible to make the representation more compact
- e.g. independence between trucks and tanks.
- e.g. all trucks are indistinguishable.
29Semantics (Global Interpretation)
- Interpretation
- Global interpretation, P
- a mapping from D(W) (the set of semistructured instances compatible with W) to [0, 1] s.t. $\sum_{S \in D(W)} P(S) = 1$
30[Figure: the six compatible instances S1a to S1f]
P(S1a) = 0.12
P(S1b) = 0.08
P(S1c) = 0.2
P(S1d) = 0.18
P(S1e) = 0.12
P(S1f) = 0.3
31Semantics (Local Interpretation)
- An object probability function (OPF) for an object o w.r.t. a weak instance W is a mapping w: PC(o) → [0, 1] s.t. $\sum_{c \in PC(o)} w(c) = 1$
32Semantics
- Example
- ipf(convoy2, {ts2, truck3}) = [0.2, 0.3]
- ipf(convoy2, {ts2, truck4}) = [0.3, 0.5]
- ipf(convoy2, {ts2, truck3, truck4}) = [0.2, 0.4]
- wconvoy2({ts2, truck3}) = 0.2
- wconvoy2({ts2, truck4}) = 0.5
- wconvoy2({ts2, truck3, truck4}) = 0.3
33Semantics (Local Interpretation)
- Previously, probabilities were assigned to each compatible instance globally.
- Now we assign probabilities to the actual children of each non-leaf object in a local manner.
34Object probability function (OPF) for convoy2 w.r.t. W is a mapping w: PC(convoy2) → [0, 1] s.t.
wconvoy2({ts2, truck3, truck4}) = 0.2
wconvoy2({ts2, truck3}) = 0.5
wconvoy2({ts2, truck4}) = 0.3
35Semantics (Local Interpretation)
- Interpretation
- Local interpretation, p
- a mapping from the set of non-leaf objects to OPFs
- Example
- p(convoy2) = wconvoy2
36Semantics (Local → Global)
- Assume that the probability of any potential child of an object o is independent of the non-descendants of o.
- W operator
- The W operator returns the probabilities assigned to every semistructured instance compatible with a given weak instance, consistent with a given local interpretation.
- Given a semistructured instance S compatible with a weak instance W and a local interpretation p for W:
- $W(p)(S) = \prod_{o \in S} p(o)(C_S(o))$, where $C_S(o)$ is the set of children of o in S
- Theorem
- W(p) is a global interpretation for W
37Semantics
- Example
- ipf(S1, {convoy1, convoy2}) = [1, 1]
- wS1({convoy1, convoy2}) = 1
- ipf(convoy1, {ts1, truck1, tank1}) = [0.2, 0.6]
- ipf(convoy1, {ts1, truck1, tank2}) = [0.4, 0.8]
- wconvoy1({ts1, truck1, tank1}) = 0.4
- wconvoy1({ts1, truck1, tank2}) = 0.6
- ipf(convoy2, {ts2, truck3}) = [0.2, 0.3]
- ipf(convoy2, {ts2, truck4}) = [0.3, 0.5]
- ipf(convoy2, {ts2, truck3, truck4}) = [0.2, 0.4]
- wconvoy2({ts2, truck3}) = 0.2
- wconvoy2({ts2, truck4}) = 0.5
- wconvoy2({ts2, truck3, truck4}) = 0.3
38Semantics
- Example
- W(p)(S1a)
- = p(S1)({convoy1, convoy2}) x p(convoy1)({ts1, truck1, tank1}) x p(convoy2)({ts2, truck3, truck4})
- = wS1({convoy1, convoy2}) x wconvoy1({ts1, truck1, tank1}) x wconvoy2({ts2, truck3, truck4})
- = 1 x 0.4 x 0.3
- = 0.12
39Semantics
wS1({convoy1, convoy2}) = 1
wconvoy1({ts1, truck1, tank1}) = 0.4
wconvoy2({ts2, truck3, truck4}) = 0.3
p(S1)({convoy1, convoy2}) x p(convoy1)({ts1, truck1, tank1}) x p(convoy2)({ts2, truck3, truck4})
= wS1({convoy1, convoy2}) x wconvoy1({ts1, truck1, tank1}) x wconvoy2({ts2, truck3, truck4})
= 1 x 0.4 x 0.3 = 0.12
40Semantics
- Example
- Similarly, we can get
- W(p)(S1a) = 0.12
- W(p)(S1b) = 0.08
- W(p)(S1c) = 0.2
- W(p)(S1d) = 0.18
- W(p)(S1e) = 0.12
- W(p)(S1f) = 0.3
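The local-to-global computation can be sketched in a few lines: multiply, over the non-leaf objects of an instance, the OPF value of each object's actual child set. The helper `fs`, the function name `w_operator`, and the dictionary layout are assumptions for illustration; the OPF values are those of the example.

```python
from math import prod

def fs(*names):
    # Shorthand for an immutable child set.
    return frozenset(names)

# Local interpretation p: non-leaf object -> OPF (child set -> probability).
p = {
    "S1": {fs("convoy1", "convoy2"): 1.0},
    "convoy1": {fs("ts1", "truck1", "tank1"): 0.4,
                fs("ts1", "truck1", "tank2"): 0.6},
    "convoy2": {fs("ts2", "truck3"): 0.2,
                fs("ts2", "truck4"): 0.5,
                fs("ts2", "truck3", "truck4"): 0.3},
}

def w_operator(p, child_sets):
    """W(p)(S): product of p(o)(C_S(o)) over the non-leaf objects o of S."""
    return prod(p[o][c] for o, c in child_sets.items())

# S1a as used in the worked example: tank1 in convoy1, both trucks in convoy2.
s1a = {"S1": fs("convoy1", "convoy2"),
       "convoy1": fs("ts1", "truck1", "tank1"),
       "convoy2": fs("ts2", "truck3", "truck4")}

# Every compatible instance: one child set per convoy.
all_instances = [{"S1": fs("convoy1", "convoy2"), "convoy1": c1, "convoy2": c2}
                 for c1 in p["convoy1"] for c2 in p["convoy2"]]
total = sum(w_operator(p, s) for s in all_instances)

print(round(w_operator(p, s1a), 4), round(total, 4))
```

Summing over all six compatible instances gives a total of 1 (up to floating-point rounding), which is what the theorem that W(p) is a global interpretation requires.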
41Semantics (Global → Local)
- (Same assumption) The probability of any potential child of an object o is independent of the non-descendants of o.
- Given a global interpretation P for a weak instance W:
- P satisfies W iff $P(c_o \mid ndes(o)) = P(c_o)$
- ndes(o) is the set of non-descendants of o.
42Semantics (Global → Local)
- D operator
- The D operator returns the probabilities assigned to each possible set of children of every non-leaf object, consistent with a given global interpretation.
- Given a global interpretation P that satisfies a weak instance W, for any non-leaf object o and any c in PC(o):
- $w_{P,o}(c) = \dfrac{\sum_{S \in D(W),\ o \in S,\ C_S(o) = c} P(S)}{\sum_{S \in D(W),\ o \in S} P(S)}$
- D(P) returns a function defined as follows: for any non-leaf object o, $D(P)(o) = w_{P,o}$
43Semantics (Global → Local)
- Theorem
- D(P) is a local interpretation for W
- Example
- Derive D(P)(convoy2)
44[Figure: the six compatible instances S1a to S1f]
P(S1a) = 0.12
P(S1b) = 0.08
P(S1c) = 0.2
P(S1d) = 0.18
P(S1e) = 0.12
P(S1f) = 0.3
D(P)(convoy2) = w_{P,convoy2}
- w_{P,convoy2}({ts2, truck3, truck4}) = (0.12 + 0.18)/1 = 0.3
45D(P)(convoy2) = w_{P,convoy2}
- w_{P,convoy2}({ts2, truck3, truck4}) = (0.12 + 0.18)/1 = 0.3
- w_{P,convoy2}({ts2, truck3}) = (0.08 + 0.12)/1 = 0.2
- w_{P,convoy2}({ts2, truck4}) = (0.2 + 0.3)/1 = 0.5
46Semantics
- Example
- Derive D(P)(convoy2) = w_{P,convoy2}
- w_{P,convoy2}({ts2, truck3, truck4}) = (0.12 + 0.18)/1 = 0.3
- w_{P,convoy2}({ts2, truck3}) = (0.08 + 0.12)/1 = 0.2
- w_{P,convoy2}({ts2, truck4}) = (0.2 + 0.3)/1 = 0.5
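The derivation above amounts to a marginalization, sketched below. The normalization by the total probability of instances containing o is inferred from the "/1" in the worked example (convoy2 occurs in every instance), so treat it as an assumption. The function name `d_operator` and the data layout are also illustrative; only convoy2's child set is recorded per instance for brevity.

```python
def fs(*names):
    # Shorthand for an immutable child set.
    return frozenset(names)

# Global interpretation P: (non-leaf object -> child set, probability) per instance.
P = [
    ({"convoy2": fs("ts2", "truck3", "truck4")}, 0.12),  # S1a
    ({"convoy2": fs("ts2", "truck3")}, 0.08),            # S1b
    ({"convoy2": fs("ts2", "truck4")}, 0.20),            # S1c
    ({"convoy2": fs("ts2", "truck3", "truck4")}, 0.18),  # S1d
    ({"convoy2": fs("ts2", "truck3")}, 0.12),            # S1e
    ({"convoy2": fs("ts2", "truck4")}, 0.30),            # S1f
]

def d_operator(P, o):
    """Return w_{P,o}: child set -> probability, conditioned on o being present."""
    present = sum(prob for children, prob in P if o in children)
    opf = {}
    for children, prob in P:
        if o in children:
            c = children[o]
            opf[c] = opf.get(c, 0.0) + prob / present
    return opf

w = d_operator(P, "convoy2")
for c, prob in sorted(w.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(c), round(prob, 4))
```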
47Semantics (Local ↔ Global)
- Theorems
- Suppose p is a local interpretation for a weak instance W; then D(W(p)) = p.
- Suppose P is a global interpretation that satisfies a weak instance W; then W(D(P)) = P.
48Semantics (Satisfaction)
- Given a probabilistic instance I and a non-leaf object o,
- OC(o), the object constraints, are
- $lb(c) \le p(c) \le ub(c)$ for every c in PC(o)
- $\sum_{c \in PC(o)} p(c) = 1$
- p(c) is a real-valued variable denoting the probability that c is the actual set of children of o.
49Semantics (Satisfaction)
- Example
- ipf(convoy2, {ts2, truck3}) = [0.2, 0.3]
- ipf(convoy2, {ts2, truck4}) = [0.3, 0.5]
- ipf(convoy2, {ts2, truck3, truck4}) = [0.2, 0.4]
- OC(convoy2)
50Semantics (Local Satisfaction)
- An OPF w satisfies a non-leaf object o iff w is a probability distribution over PC(o) that respects ipf, i.e. every w(c) lies in the interval ipf(o, c).
- A local interpretation p satisfies a non-leaf object o iff p(o) satisfies o.
- A local interpretation p satisfies a probabilistic instance I iff p satisfies every non-leaf object of I.
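Local satisfaction for a single object can be sketched as a direct check: the OPF must sum to 1 over PC(o) and every value must fall inside its ipf interval. The function name `opf_satisfies`, the tolerance parameter, and the dictionary layout are assumptions for illustration.

```python
from math import isclose

def fs(*names):
    # Shorthand for an immutable child set.
    return frozenset(names)

def opf_satisfies(w, ipf_o, tol=1e-9):
    """True if w is a distribution over PC(o) with lb(c) <= w(c) <= ub(c)."""
    if set(w) != set(ipf_o):
        return False  # w must be defined on exactly PC(o)
    if not isclose(sum(w.values()), 1.0, abs_tol=tol):
        return False
    return all(lb - tol <= w[c] <= ub + tol for c, (lb, ub) in ipf_o.items())

# ipf and OPF for convoy2 from the example.
ipf_convoy2 = {
    fs("ts2", "truck3"): (0.2, 0.3),
    fs("ts2", "truck4"): (0.3, 0.5),
    fs("ts2", "truck3", "truck4"): (0.2, 0.4),
}
w_good = {fs("ts2", "truck3"): 0.2,
          fs("ts2", "truck4"): 0.5,
          fs("ts2", "truck3", "truck4"): 0.3}
# Sums to 1, but 0.1 is below the lower bound 0.2.
w_bad = {fs("ts2", "truck3"): 0.1,
         fs("ts2", "truck4"): 0.5,
         fs("ts2", "truck3", "truck4"): 0.4}

print(opf_satisfies(w_good, ipf_convoy2), opf_satisfies(w_bad, ipf_convoy2))
```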
51Semantics (Global Satisfaction)
- A global interpretation P satisfies a probabilistic instance I iff D(P) satisfies I.
- Corollary
- A local interpretation p satisfies a probabilistic instance I iff W(p) satisfies I.
52Semantics (Consistency)
- A probabilistic instance is locally consistent iff there is a local interpretation that satisfies it.
- A probabilistic instance is globally consistent iff there is a global interpretation that satisfies it.
- Theorem
- Every probabilistic instance is locally and globally consistent.
53Algebra
- Operators
- Projection
- Selection
- Cross-product
- Path expression
- o.l1.l2...ln
- e.g. S1.convoy.truck
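A path expression can be read as "start at o and follow the labels in order". The slides do not spell out the evaluation rule, so the sketch below assumes the usual reading: the result is the set of objects reached. The function name `eval_path` and the dictionary layout are illustrative.

```python
def eval_path(lch, path):
    """Return the objects reached from o by following l1, ..., ln in order."""
    start, *labels = path.split(".")
    frontier = {start}
    for label in labels:
        # Collect the label-children of every object reached so far.
        frontier = set().union(*(lch.get((o, label), set()) for o in frontier))
    return frontier

lch = {
    ("S1", "convoy"): {"convoy1", "convoy2"},
    ("convoy1", "truck"): {"truck1"},
    ("convoy2", "truck"): {"truck3", "truck4"},
    ("convoy1", "tank"): {"tank1"},
}
print(sorted(eval_path(lch, "S1.convoy.truck")))
```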
54Algebra (Projection)
- Ancestor projection
- Descendant projection
- Single projection
55Algebra (Projection)
Semistructured Instance
56Weak Instance
57Probabilistic Instance
Before projection:
card(I2, convoy) = [1, 1], card(convoy1, ts) = [1, 1], card(convoy1, truck) = [1, 1], card(convoy1, tank) = [1, 1]
ipf(I2, {convoy1}) = 1
ipf(convoy1, {ts1, truck1, tank1}) = [0, 0.3]
ipf(convoy1, {ts1, truck1, tank2}) = [0.1, 0.4]
ipf(convoy1, {ts1, truck2, tank1}) = [0.3, 0.5]
ipf(convoy1, {ts1, truck2, tank2}) = [0.3, 0.6]
After projection:
card(I2, convoy) = [1, 1], card(convoy1, truck) = [1, 1]
ipf(I2, {convoy1}) = 1
Children of convoy1 before: CI2(convoy1) = {ts1, truck1, truck2, tank1, tank2}
Children of convoy1 after: {truck1, truck2}
Let Cd = (children before) − (children after) = {ts1, tank1, tank2}
PC(convoy1) after projection = { {truck1}, {truck2} }
58Probabilistic Instance
[Figure: instance I2 before and after the projection, with the same card and ipf labels as on slide 57]
For each c in PC(convoy1) after projection,
ipf(convoy1, c) = [a, min(1, b)]
where a and b are the sums of the lower and upper bounds of the original child sets that contain c
ipf(convoy1) ← tight(ipf(convoy1))
Dekhtyar, Goldsmith (2002)
59Probabilistic Instance
[Figure: instance I2 before and after the projection, with the same card and ipf labels as on slide 57]
For {truck1},
a = 0.0 + 0.1 = 0.1
b = 0.3 + 0.4 = 0.7
ipf(convoy1, {truck1}) = [0.1, min(1, 0.7)] = [0.1, 0.7]
60Probabilistic Instance
[Figure: instance I2 before and after the projection, with the same card and ipf labels as on slide 57]
For {truck2},
a = 0.3 + 0.3 = 0.6
b = 0.5 + 0.6 = 1.1
ipf(convoy1, {truck2}) = [0.6, min(1, 1.1)] = [0.6, 1]
61Probabilistic Instance
[Figure: instance I2 before and after the projection, with the same card and ipf labels as on slide 57]
ipf(convoy1) ← tight(ipf(convoy1))
Before tight: ipf(convoy1, {truck1}) = [0.1, 0.7], ipf(convoy1, {truck2}) = [0.6, 1]
After tight: ipf(convoy1, {truck1}) = [0.1, 0.4], ipf(convoy1, {truck2}) = [0.6, 0.9]
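The projection arithmetic on the preceding slides can be sketched as two steps: sum the bounds of the original child sets that collapse to the same projected set (capping the upper bound at 1), then tighten. The tightening rule used here (each bound is clipped by what the other intervals leave available when everything must sum to 1) is an assumption about the tight operator that the slides attribute to Dekhtyar and Goldsmith (2002); it does reproduce the intervals shown. Function names and data layout are illustrative.

```python
def fs(*names):
    # Shorthand for an immutable child set.
    return frozenset(names)

def project_ipf(ipf_o, keep):
    """Sum [lb, ub] over original child sets with the same projection onto keep."""
    sums = {}
    for c, (lb, ub) in ipf_o.items():
        key = c & keep
        a, b = sums.get(key, (0.0, 0.0))
        sums[key] = (a + lb, b + ub)
    return {c: (a, min(1.0, b)) for c, (a, b) in sums.items()}

def tight(ipf_o):
    """Clip each interval to values attainable when all probabilities sum to 1."""
    lbs = sum(lb for lb, _ in ipf_o.values())
    ubs = sum(ub for _, ub in ipf_o.values())
    return {c: (max(lb, 1.0 - (ubs - ub)), min(ub, 1.0 - (lbs - lb)))
            for c, (lb, ub) in ipf_o.items()}

ipf_convoy1 = {
    fs("ts1", "truck1", "tank1"): (0.0, 0.3),
    fs("ts1", "truck1", "tank2"): (0.1, 0.4),
    fs("ts1", "truck2", "tank1"): (0.3, 0.5),
    fs("ts1", "truck2", "tank2"): (0.3, 0.6),
}
projected = project_ipf(ipf_convoy1, fs("truck1", "truck2"))
tightened = tight(projected)
for c in sorted(tightened, key=sorted):
    print(sorted(c), [round(x, 4) for x in projected[c]],
          [round(x, 4) for x in tightened[c]])
```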
62HIDE IT
card(S1, convoy) = [2, 2]
wS1({convoy1, convoy2}) = 1
card(convoy1, ts) = [1, 1]
card(convoy1, truck) = [1, 1]
card(convoy1, tank) = [1, 1]
wconvoy1({ts1, truck1, tank1}) = 0.4, wconvoy1({ts1, truck1, tank2}) = 0.6
card(convoy2, ts) = [1, 1]
card(convoy2, truck) = [1, 2]
wconvoy2({ts2, truck3}) = 0.2, wconvoy2({ts2, truck4}) = 0.5, wconvoy2({ts2, truck3, truck4}) = 0.3
63Algebra (Projection)
- Descendant projection
card(I3, truck) = [0, 3], ipf(I3, c) = [0, 1]
One naive strategy
Our better strategy: similar to the one in cross product
64Algebra (Projection)
card(I3, truck) = [0, 3], ipf(I3, c) = [0, 1]
65Algebra (Projection)
Equivalent
66Algebra (Projection)
Equivalent
67Algebra (Projection)
Equivalent
e1 and e2 are each a sequence of zero or more edges. Thus, I.e1.lm can include I.lm, I.l1.lm, I.l2.l3.lm, etc.
68In general non-equivalent
69Algebra (Selection)
- Similar to ancestor projection
- Path expression specifies leaf objects with a
specified value.
70Algebra (Selection)
Semistructured Instance
I1
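71Algebra (Selection)
card(I7, convoy) = [1, 2], wI7({convoy1}) = 0.2, wI7({convoy2}) = 0.5, wI7({convoy1, convoy2}) = 0.3
card(convoy1, tank) = [1, 1], wconvoy1({tank1}) = 0.3, wconvoy1({tank2}) = 0.7
card(convoy2, tank) = [1, 1], wconvoy2({tank2}) = 0.4, wconvoy2({tank3}) = 0.6
[Figure: the compatible instances in D(I7), with probabilities 0.06, 0.14, 0.2, 0.3, 0.036, 0.054, 0.084 and 0.126]
Total probability of the selected instances: 0.14 + 0.3 + 0.054 + 0.036 + 0.084 = 0.614
The probability of each selected instance is divided by 0.614.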
72Algebra (Selection)
card(I7, convoy) = [1, 2], ipf(I7, {convoy1}) = [0.1, 0.3], ipf(I7, {convoy2}) = [0.4, 0.6], ipf(I7, {convoy1, convoy2}) = [0.2, 0.4]
card(convoy1, tank) = [1, 1], ipf(convoy1, {tank1}) = [0.2, 0.4], ipf(convoy1, {tank2}) = [0.6, 0.8]
card(convoy2, tank) = [1, 1], ipf(convoy2, {tank2}) = [0.3, 0.5], ipf(convoy2, {tank3}) = [0.5, 0.7]
[Figure: the compatible instances in D(I7), labeled with the intervals [0.02, 0.12], [0.06, 0.24], [0.08, 0.24], [0.24, 0.48], [0.012, 0.08], [0.02, 0.112], [0.036, 0.16] and [0.06, 0.224]]
Conditionalization of interval probabilities: Dekhtyar, Goldsmith (2002)
73Algebra (Cross product (x))
- Probabilistic conjunction strategies
- Example
- Ignorance
- Positive correlation
- Negative Correlation
- Independence
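The slide names the strategies without giving formulas. As a sketch, the interval formulas below are the ones commonly used for these four strategies in the probabilistic database literature; that the talk used exactly these definitions is an assumption. Each function takes two intervals (lower, upper) and returns the interval for the conjunction; the function names are illustrative.

```python
def conj_ignorance(x, y):
    # Nothing is known about the dependence between the two events.
    (l1, u1), (l2, u2) = x, y
    return (max(0.0, l1 + l2 - 1.0), min(u1, u2))

def conj_positive(x, y):
    # Positive correlation: the events overlap as much as possible.
    (l1, u1), (l2, u2) = x, y
    return (min(l1, l2), min(u1, u2))

def conj_negative(x, y):
    # Negative correlation: the events overlap as little as possible.
    (l1, u1), (l2, u2) = x, y
    return (max(0.0, l1 + l2 - 1.0), max(0.0, u1 + u2 - 1.0))

def conj_independence(x, y):
    # Independent events: probabilities multiply.
    (l1, u1), (l2, u2) = x, y
    return (l1 * l2, u1 * u2)

# Illustrative inputs: a truck interval and a tank interval.
truck1, tank1 = (0.2, 0.7), (0.1, 0.6)
for name, f in [("ignorance", conj_ignorance), ("positive", conj_positive),
                ("negative", conj_negative), ("independence", conj_independence)]:
    lo, hi = f(truck1, tank1)
    print(name, round(lo, 4), round(hi, 4))
```

The choice of strategy matters: for the same inputs the upper bound ranges from the product of the upper bounds under independence to the smaller of the two upper bounds under ignorance or positive correlation.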
74Algebra (Cross product (x))
card(I4, truck) = [1, 1], ipf(I4, {truck1}) = [0.2, 0.7], ipf(I4, {truck2}) = [0.3, 0.8]
card(I5, tank) = [1, 1], ipf(I5, {tank1}) = [0.1, 0.6], ipf(I5, {tank2}) = [0.4, 0.9]
card(I6, truck) = [1, 1], card(I6, tank) = [1, 1]
I4 x I5
78Algebra (Cross product)
- Equivalence
- (I1 x I2) x I3
- I1 x (I2 x I3)
- (I1 x I3) x I2
Equivalent
79Related Work
- Semistructured Probabilistic Objects (SPOs) (Dekhtyar, Goldsmith, Hawkes, 2001)
- SPOs express probabilistic information in a semistructured manner
- The PXML data model stores XML data AND probabilistic information.
80Related Work
- Algebras TAX, SAL
- TAX (Jagadish, Lakshmanan, Srivastava, 2001)
- uses a pattern tree to extract subsets of nodes, one for each embedding of the pattern tree.
- fixed number of children
- SAL (Beeri, Tzaban, 1999)
- bind objects to variables
- original structure is totally lost
81Future Work
- Implement the system
- Query optimization
82Summary
- PXML data model
- Semistructured instance
- Weak instance (add cardinality)
- Probabilistic instance (add ipf)
- Semantics
- Local and Global
- Interpretation
- Satisfaction
- Algebra
- Projections, selection, cross product
87Related Work
- Bayesian net (Pearl, 1988)
- random variables (probability of events)
- ours: existence of children requires existence of parents