Title: Probabilistic/Uncertain Data Management
1 Probabilistic/Uncertain Data Management
- Dalvi, Suciu. Efficient query evaluation on probabilistic databases, VLDB 2004.
- Das Sarma et al. Working models for uncertain data, ICDE 2006.
- Slides based on the Suciu/Dalvi SIGMOD 2005 tutorial
2 What is a Probabilistic Database?
- "An item belongs to the database" is a probabilistic event
  - Tuple-existence uncertainty
  - Attribute-value uncertainty
- "A tuple is an answer to the query" is a probabilistic event
- Can be extended to all data models; we discuss only probabilistic relational data
3 Possible Worlds Semantics
Attribute domains: int, char(30), varchar(55), datetime
# values: $2^{32}$, $2^{120}$, $2^{440}$, $2^{64}$
Relational schema: Employee(name varchar(55), dob datetime, salary int)
# of tuples: $2^{440} \times 2^{64} \times 2^{32}$
# of instances: $2^{2^{440} \times 2^{64} \times 2^{32}}$
Database schema: Employee(. . .), Projects(. . .), Groups(. . .), WorksFor(. . .)
# of instances: N (BIG but finite)
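The counting argument above can be checked with a few lines of arithmetic. This is a minimal sketch: it works with exponents only (the instance count itself is far too large to materialize), takes int as $2^{32}$ values as in the attribute-domain list, and assumes an instance is any set of tuples. The function names are illustrative, not from the slides.

```python
# Domain sizes as powers of two, taken from the attribute-domain list.
NAME_BITS = 440     # varchar(55)
DOB_BITS = 64       # datetime
SALARY_BITS = 32    # int

def log2_num_tuples(*domain_bits: int) -> int:
    """log2 of the number of distinct tuples: domain sizes multiply, so exponents add."""
    return sum(domain_bits)

def log2_num_instances(tuple_bits: int) -> int:
    """log2 of the number of instances, assuming every set of tuples is an instance."""
    return 2 ** tuple_bits

tuple_bits = log2_num_tuples(NAME_BITS, DOB_BITS, SALARY_BITS)
instance_bits = log2_num_instances(tuple_bits)

print(f"# of tuples    = 2^{tuple_bits}")
print(f"# of instances = 2^(2^{tuple_bits})")
# Even the exponent of the instance count needs this many bits to write down.
print(instance_bits.bit_length())
```

The point of the exercise is only that N is finite but hopelessly large, which is why the later slides look for compact representations.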
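4 The Definition
The set of all possible database instances: INST = {I1, I2, I3, . . ., IN}
Definition: A probabilistic database Ip is a probability distribution Pr on INST, with $\sum_{i=1}^{N} \Pr(I_i) = 1$
(we will use Pr or Ip interchangeably)
Definition: A possible world is an instance I s.t. Pr(I) > 0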
5 Query Semantics
Given a query Q and a probabilistic database Ip, what is the meaning of Q(Ip)?
6 Query Semantics
Semantics 1: Possible Answers. A probability distribution on sets of tuples
$\forall A.\ \Pr(Q = A) = \sum_{I \in INST,\ Q(I) = A} \Pr(I)$
Semantics 2: Possible Tuples. A probability function on tuples
$\forall t.\ \Pr(t \in Q) = \sum_{I \in INST,\ t \in Q(I)} \Pr(I)$
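Both semantics can be computed directly from an explicit list of possible worlds. The sketch below is an illustration under stated assumptions: the worlds, their probabilities, and the query `johns_products` are hypothetical toy data, and a world is modeled as a frozenset of (name, product) pairs.

```python
from collections import defaultdict
from fractions import Fraction

def possible_answers(worlds, query):
    """Semantics 1: Pr(Q = A), a distribution over whole answer sets."""
    dist = defaultdict(Fraction)
    for instance, pr in worlds:
        dist[frozenset(query(instance))] += pr
    return dict(dist)

def possible_tuples(worlds, query):
    """Semantics 2: Pr(t in Q), one marginal probability per answer tuple."""
    marginals = defaultdict(Fraction)
    for instance, pr in worlds:
        for t in set(query(instance)):
            marginals[t] += pr
    return dict(marginals)

# Hypothetical probabilistic database: three worlds whose probabilities sum to 1.
worlds = [
    (frozenset({("John", "gadget"), ("Sue", "gadget")}), Fraction(1, 2)),
    (frozenset({("John", "gadget")}), Fraction(1, 4)),
    (frozenset({("Sue", "gizmo")}), Fraction(1, 4)),
]

def johns_products(instance):
    """Hypothetical query: the products bought by John."""
    return {product for name, product in instance if name == "John"}

answers = possible_answers(worlds, johns_products)
tuples_ = possible_tuples(worlds, johns_products)
print(answers)
print(tuples_)
```

Two different worlds produce the same answer set here, so their probabilities are added. That is exactly the summation over all I with Q(I) = A.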
7 Example Query Semantics
[Table: Purchasep]
SELECT DISTINCT x.product
FROM Purchasep x, Purchasep y
WHERE x.name = 'John' and x.product = y.product and y.name = 'Sue'
Pr(I1) = 1/3, Pr(I2) = 1/12, Pr(I3) = 1/2, Pr(I4) = 1/12
[Tables: possible answers semantics; possible tuples semantics]
8 Possible Worlds Query Semantics
- Possible answers semantics
  - Precise
  - Can be used to compose queries
  - Difficult user interface
- Possible tuples semantics
  - Less precise, but simple; sufficient for most apps
  - Cannot be used to compose queries
  - Simple user interface
9 Possible Worlds Semantics Summary
- Complete model: clean formal semantics for SQL queries
- Not very useful as a representation or implementation tool
  - HUGE number of possible worlds!
- Need more effective representation formalisms
  - Something that users can understand/explore
  - Allow more efficient query execution
    - Avoid possible worlds explosion
  - Perhaps giving up completeness
10 Representation Formalisms
- Problem: need a good representation formalism
  - Will be interpreted as possible worlds
  - Several formalisms exist, but no winner
Main open problem in probabilistic db
11 Evaluation of Formalisms
- Completeness?
  - What possible worlds can it represent?
  - What probability distributions on worlds?
- Closure?
  - Is it closed under evaluation of query operators?
12 Outline
- A complete formalism
  - Intensional Databases
- Incomplete formalisms
  - Various expressibility/complexity tradeoffs
  - Focus on Explicit Independent Tuples
13 Intensional Database
[Fuhr, Roelleke 1997]
Atomic event ids: e1, e2, e3, . . .
Probabilities: p1, p2, p3, . . . ∈ [0,1]
Event expressions: ∧, ∨, ¬
e3 ∧ (e5 ∨ e2)
Intensional probabilistic database J: each tuple t has an event attribute t.E
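To make the event attribute concrete, here is a minimal sketch of evaluating the probability of an event expression by brute force. It assumes the atomic events are mutually independent and encodes expressions as nested tuples; both the encoding and the probabilities are illustrative choices, not something the slides prescribe.

```python
from itertools import product

def holds(expr, assignment):
    """Evaluate an event expression under a truth assignment to the atomic events."""
    op = expr[0]
    if op == "var":
        return assignment[expr[1]]
    if op == "not":
        return not holds(expr[1], assignment)
    if op == "and":
        return holds(expr[1], assignment) and holds(expr[2], assignment)
    if op == "or":
        return holds(expr[1], assignment) or holds(expr[2], assignment)
    raise ValueError(f"unknown operator: {op}")

def probability(expr, probs):
    """Sum the weights of all satisfying assignments (exponential in len(probs))."""
    events = sorted(probs)
    total = 0.0
    for values in product([False, True], repeat=len(events)):
        assignment = dict(zip(events, values))
        if holds(expr, assignment):
            weight = 1.0
            for e in events:
                weight *= probs[e] if assignment[e] else 1.0 - probs[e]
            total += weight
    return total

# Hypothetical probabilities for the atomic events used in e3 AND (e5 OR e2).
probs = {"e2": 0.5, "e3": 0.4, "e5": 0.5}
expr = ("and", ("var", "e3"), ("or", ("var", "e5"), ("var", "e2")))
print(probability(expr, probs))
```

The enumeration visits every truth assignment, which is why long event expressions become the bottleneck of this formalism.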
14 Intensional DB ⇒ Possible Worlds
[Figure: intensional database J and the corresponding Ip]
15 Possible Worlds ⇒ Intensional DB
[Figure: Ip with world probabilities p1, p2, p3, p4 and the corresponding J]
Intensional DBs are complete
16 Closure Under Operators
[Fuhr, Roelleke 1997]
Operators shown: Π, −, σ
One still needs to compute the probability of the event expression
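The slide does not spell out how each operator combines event expressions, so the following sketch relies on the usual rules as an assumption: selection keeps the event, projection with duplicate elimination ORs the events of collapsing tuples, product ANDs them, and difference uses AND NOT. Relations are modeled as dicts from tuples to nested-tuple event expressions; all names are illustrative.

```python
def select(rel, pred):
    """Selection keeps each qualifying tuple together with its event expression."""
    return {t: e for t, e in rel.items() if pred(t)}

def project(rel, cols):
    """Projection with duplicate elimination: OR the events of tuples that collapse."""
    out = {}
    for t, e in rel.items():
        key = tuple(t[c] for c in cols)
        out[key] = ("or", out[key], e) if key in out else e
    return out

def product_(left, right):
    """Cartesian product: the combined tuple exists iff both inputs exist (AND)."""
    return {l + r: ("and", ev_l, ev_r)
            for l, ev_l in left.items() for r, ev_r in right.items()}

def difference(left, right):
    """Difference: the tuple is in left AND NOT in right."""
    return {t: (("and", e, ("not", right[t])) if t in right else e)
            for t, e in left.items()}

# Hypothetical intensional relations.
R = {("a", 1): ("var", "e1"), ("a", 2): ("var", "e2")}
S = {("a", 1): ("var", "e3")}

print(project(R, [0]))
print(difference(R, S))
```

Every result is again a relation with one event expression per tuple, which is the closure property. The cost has moved into the expressions, whose probabilities still have to be computed.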
17 Summary on Intensional Databases
- Event expression for each tuple
  - Possible worlds: any subset
  - Probability distribution: any
- Complete, but impractical
  - Must evaluate the probability of long event expressions
- Important abstraction: consider restrictions
- Related to c-tables [Imielinski, Lipski 1984]
18 Restricted Formalisms
- Explicit tuples
  - Have a tuple template for every tuple that may appear in a possible world
- Focus on the case of independent tuple events
19 Explicit Independent Tuples
tuple = independent event
Atomic, distinct. May use TIDs.
Can be easily extended to capture attribute-value uncertainty
20 Explicit Independent Tuples
Tuple-independent probabilistic database
$\Pr(I) = \prod_{t \in I} pr(t) \times \prod_{t \notin I} (1 - pr(t))$
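The product formula can be exercised on a small table. This is a sketch with hypothetical tuple ids and probabilities: it enumerates every subset of the tuple templates, applies the formula, and checks that the world probabilities sum to 1. By linearity of expectation, the expected instance size equals the sum of the tuple probabilities.

```python
from itertools import combinations
from math import prod

def world_probability(world, pr):
    """Pr(I) = prod over t in I of pr(t), times prod over t not in I of (1 - pr(t))."""
    return prod(p if t in world else 1.0 - p for t, p in pr.items())

def all_worlds(pr):
    """Yield every subset of the tuple templates (2**n candidate worlds)."""
    tuples = list(pr)
    for k in range(len(tuples) + 1):
        for subset in combinations(tuples, k):
            yield frozenset(subset)

# Hypothetical tuple-independent table: tuple id -> probability.
pr = {"t1": 0.8, "t2": 0.5, "t3": 1.0}

total = sum(world_probability(w, pr) for w in all_worlds(pr))
expected_size = sum(len(w) * world_probability(w, pr) for w in all_worlds(pr))
print(total, expected_size)
```

Worlds that omit `t3` get probability 0 here, so they are not possible worlds in the sense of the earlier definition even though the enumeration visits them.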
21 Tuple Prob. ⇒ Possible Worlds
[Figure: tuple-independent table J and its possible worlds Ip; the world probabilities sum to 1]
E[size(Ip)] = 2.3 tuples
22 Tuple-Independent DBs are Incomplete
- Very limited: cannot capture correlations across tuples
- Not closed
  - Query operators can introduce complex correlations!
[Figure: Ip with world probabilities p1, p1p2, and 1 - p1 - p1p2]
23 Tuple Prob. ⇒ Query Evaluation
SELECT DISTINCT x.city
FROM Person x, Purchase y
WHERE x.Name = y.Customer and y.Product = 'Gadget'
Answer probabilities:
$1 - \bigl(1 - p_1\,(1 - (1-q_2)(1-q_3))\bigr)\bigl(1 - p_2\,(1 - (1-q_5)(1-q_6))\bigr)$
$p_3\, q_7$
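The nested "1 minus product of complements" pattern in these probabilities can be sketched as follows. The code assumes that Name is a key of Person and that all tuples are independent, and it uses hypothetical rows and probabilities. The function `city_probabilities` is an illustration, not the evaluation algorithm of the cited papers.

```python
from math import prod

def city_probabilities(person, purchase, product_name):
    """Evaluate the DISTINCT city query over tuple-independent tables.

    person:   {name: (city, p)}        (assumes Name is a key of Person)
    purchase: [(customer, product, q)] (each row is an independent tuple)
    """
    by_city = {}
    for name, (city, p) in person.items():
        # Probability that none of this person's matching purchase tuples exists.
        miss = prod(1.0 - q for cust, item, q in purchase
                    if cust == name and item == product_name)
        by_city.setdefault(city, []).append(p * (1.0 - miss))
    # A city is an answer if at least one of its persons qualifies.
    result = {city: 1.0 - prod(1.0 - x for x in xs) for city, xs in by_city.items()}
    return {city: pr for city, pr in result.items() if pr > 0.0}

# Hypothetical data; repeated (customer, product) rows stand for distinct purchases.
person = {"John": ("Seattle", 0.5), "Sue": ("Seattle", 0.4), "Jim": ("Boston", 0.9)}
purchase = [
    ("John", "Gadget", 0.5), ("John", "Gadget", 0.2),
    ("Sue", "Gadget", 0.5),
    ("Jim", "Gizmo", 0.7), ("Jim", "Gadget", 0.3),
]
result = city_probabilities(person, purchase, "Gadget")
print(result)
```

The inner product plays the role of (1-q2)(1-q3) for one person, and the outer product combines the persons of one city.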
24 Application 1: Similarity Predicates
Step 1: evaluate ~ predicates
SELECT DISTINCT x.city
FROM Person x, Purchase y
WHERE x.Name = y.Cust and y.Product = 'Gadget'
  and x.profession ~ 'scientist' and y.category ~ 'music'
25 Application 1: Similarity Predicates
Step 1: evaluate ~ predicates
SELECT DISTINCT x.city
FROM Personp x, Purchasep y
WHERE x.Name = y.Cust and y.Product = 'Gadget'
  and x.profession ~ 'scientist' and y.category ~ 'music'
Step 2: evaluate rest of query
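Step 1 can be pictured as attaching a score to every row and treating it as that row's probability. The sketch below is only an illustration. It uses `difflib.SequenceMatcher` as a hypothetical stand-in for the similarity predicate, and reading the score as a probability is an assumption. The rows and the function names are invented for the example.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Hypothetical stand-in for the similarity predicate: a score in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def to_probabilistic(rows, column, target):
    """Step 1: pair every row with the score of its column value against target."""
    return [(row, similarity(row[column], target)) for row in rows]

# Hypothetical Person rows.
person = [
    {"Name": "John", "city": "Seattle", "profession": "scientist"},
    {"Name": "Sue", "city": "Boston", "profession": "statistician"},
]

# Person with a probability per row, playing the role of Personp.
person_p = to_probabilistic(person, "profession", "scientist")
for row, p in person_p:
    print(row["Name"], round(p, 2))
```

Step 2 would then run the remaining join and projection over the scored tables as over any other tuple-independent tables.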
26 Summary on Explicit Independent Tuples
- Independent tuples
  - Possible worlds: subsets
  - Probability distribution: restricted
  - Closure: no