Probabilistic XML database PRM - PowerPoint PPT Presentation

Slides: 21

Provided by: wsdbE

Transcript and Presenter's Notes

1
Probabilistic XML database: PRM and BN

Qihong SHAO

2
Motivation

Uncertain data exists everywhere in research
Automated data mining and analysis tools are often error-prone
Data integrated from various sources is often uncertain or conflicting
Experimental results are observed under conditions that may be subject to a range of variability in the natural environment
Human errors (typos, operation errors, biases, ...)
Lack of built-in tools to manage data quality and uncertain information!

3
Challenges

Current research attempts to design probabilistic databases and to query uncertain data; each node is associated with an optional confidence value
Assumption: data tuples are independent
This fails to capture data dependence relationships and results in incorrect confidence computation in many applications

4
Outline

What is BN?
What is PRM?
Our project

5
Bayesian Networks

A Bayesian network consists of two components.
1. A directed acyclic graph G
Nodes: the attributes A1, A2, ..., An
Edges: each edge denotes a direct dependence of an attribute Ai on its parents Parents(Ai)
The graph encodes a set of conditional independence assumptions: each node Ai is conditionally independent of its non-descendants given its parents.
2. A CPD (conditional probability distribution) for each attribute
P(Ai | Parents(Ai)) specifies the distribution over the values of Ai given any possible assignment of values to its parents.

6
a) A Bayesian network for the census domain. b)
A tree-structured CPD for the Children node given
its parents Income, Age and Marital-Status.
7
Example

The table contains 12 attributes:
Age, Worker-Class, Education, Marital-Status, Industry, Race, Sex, Child-Support, Earner, Children, Income, and Employment-Type.
The domain sizes for the attributes are, respectively, 18, 9, 17, 7, 24, 5, 2, 3, 3, 42, and 4.
The Children attribute depends on the other attributes only via Income, Age, and Marital-Status:
{Income, Age, Marital-Status} -> Children
Thus, Children is conditionally independent of all other attributes given Income, Age, and Marital-Status.

With the BN, the distribution over Children given
Income ? 17.5K, Age < 55, and Marital-Status = never-married is (0.19, 0.04, 0.07)
Income ? 17.5K, Age < 50, and Marital-Status = married is (0.26, 0.47, 0.27)
Income ? 17.5K, Age < 50, and Marital-Status = widowed is (0.26, 0.47, 0.27)

The conditional independence assumptions associated with the BN, together with the CPDs associated with the nodes, uniquely determine a joint probability distribution over the attributes via the chain rule.
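
For reference, the chain rule for Bayesian networks is usually written as follows, using the attributes A1, ..., An and Parents(Ai) from the definition of a BN:

```latex
P(A_1, \ldots, A_n) = \prod_{i=1}^{n} P\bigl(A_i \mid \mathrm{Parents}(A_i)\bigr)
```

Each factor on the right is exactly one CPD, so the network needs one table (or tree) per attribute rather than a single table over every combination of values.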

10
BNs for Query Estimation

A Bayesian network is a compact representation of
a full joint distribution.
It implicitly contains the answer to any query
about the probability of any assignment of values
to a set of attributes.
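
A minimal sketch of how such a query could be answered, assuming a tiny three-attribute network. The structure, the value names, and every probability below are invented for illustration and are not taken from the census example. The joint is computed with the chain rule, and a query sums the joint over the attributes it does not mention.

```python
from itertools import product

# Hypothetical network: Age -> Income, and {Age, Income} -> Children.
# All values and probabilities are made up for illustration.
DOMAINS = {
    "Age": ["young", "old"],
    "Income": ["low", "high"],
    "Children": ["none", "some"],
}
PARENTS = {"Age": [], "Income": ["Age"], "Children": ["Age", "Income"]}

# Each CPD maps a tuple of parent values to a distribution over the attribute.
CPD = {
    "Age": {(): {"young": 0.6, "old": 0.4}},
    "Income": {
        ("young",): {"low": 0.7, "high": 0.3},
        ("old",): {"low": 0.4, "high": 0.6},
    },
    "Children": {
        ("young", "low"): {"none": 0.8, "some": 0.2},
        ("young", "high"): {"none": 0.6, "some": 0.4},
        ("old", "low"): {"none": 0.3, "some": 0.7},
        ("old", "high"): {"none": 0.2, "some": 0.8},
    },
}


def joint(assignment):
    """Chain rule: multiply P(Ai | Parents(Ai)) over all attributes."""
    p = 1.0
    for attr, parents in PARENTS.items():
        key = tuple(assignment[q] for q in parents)
        p *= CPD[attr][key][assignment[attr]]
    return p


def query(evidence):
    """P(evidence): sum the joint over every attribute not in the evidence."""
    free = [a for a in DOMAINS if a not in evidence]
    total = 0.0
    for values in product(*(DOMAINS[a] for a in free)):
        total += joint({**evidence, **dict(zip(free, values))})
    return total


print(query({"Income": "high"}))
print(query({"Age": "old", "Children": "some"}))
```

Full enumeration is only practical for very small networks; it is used here because it makes the link between the CPDs and the query answer explicit. Multiplying `query(...)` by the number of tuples in a table would give an estimated result size for the corresponding selection.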

11
Problem

Example
A medical database with two tables:
Patient, containing tuberculosis (TB) patients
Contact, containing people with whom a patient has had contact, and who may or may not be infected with the disease
Consider queries involving a join between these two tables, for example:
// Find all patients whose age is over 60 and who have had contact with a roommate
where patient.Age > 60 and contact.Patient = patient.Patient-ID and contact.Contype = roommate

Attributes of different tables are often correlated.
In general, foreign keys are often used to connect tuples in different tables that are semantically related, and hence the attributes of tuples related through foreign key joins are often correlated.
For example, there is a clear correlation between the age of the patient and the type of contacts they have. In fact, elderly patients with roommates are quite rare, so a naive approach that assumes the attributes are independent would overestimate their number.
The probability that two tuples join with each other can also be correlated with various attributes.
For example, middle-aged patients typically have more contacts than older patients.
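
The overestimate can be illustrated on toy data. The sketch below uses a hypothetical four-patient, seven-contact instance (all rows are invented) and compares the true join count with an estimate that assumes Age and Contype are independent and that every patient joins with contacts equally often.

```python
# Hypothetical toy instance; attribute names follow the slides, rows are invented.
patients = {1: "middle", 2: "middle", 3: "elderly", 4: "elderly"}  # Patient-ID -> Age group
contacts = [  # (Patient, Contype)
    (1, "roommate"), (1, "coworker"), (1, "friend"),
    (2, "roommate"), (2, "coworker"),
    (3, "relative"),
    (4, "relative"),
]


def true_count(age, contype):
    """Number of joined (patient, contact) pairs that match both conditions."""
    return sum(1 for pid, ct in contacts if patients[pid] == age and ct == contype)


def naive_estimate(age, contype):
    """Estimate under attribute independence and a uniform join."""
    p_age = sum(1 for a in patients.values() if a == age) / len(patients)
    p_contype = sum(1 for _, ct in contacts if ct == contype) / len(contacts)
    return len(contacts) * p_age * p_contype


print(true_count("elderly", "roommate"))
print(naive_estimate("elderly", "roommate"))
```

In this instance no elderly patient has a roommate contact, yet the independence-based estimate is about one matching pair. The data also shows the second effect: the two middle-aged patients account for five of the seven contacts.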

13
PRM

Probabilistic relational models (PRMs)
Extend Bayesian networks to the relational setting.
Allow us to model correlations not only between attributes of the same tuple, but also between attributes of related tuples in different tables.
A parent of an attribute R.A can be an attribute S.B in another relation S such that R has a foreign key for S.
Also allow dependencies on attributes in relations that are related to R via a longer chain of joins.
14
PRM
15

A PRM for our TB domain is shown below. Here, for example, the type of the contact depends on the age and gender of the patient.
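
A minimal sketch of what this dependency might look like as data, assuming a CPD for Contact.Contype whose parents are Patient.Age and Patient.Gender, reached through the foreign key Contact.Patient. The value names and all probabilities are invented, and the sketch further assumes that Age and Gender are independent and that the join is uniform.

```python
# Hypothetical PRM fragment for the TB domain; every number is made up.
P_AGE = {"under60": 0.8, "over60": 0.2}
P_GENDER = {"F": 0.5, "M": 0.5}

# CPD of Contact.Contype given (Patient.Age, Patient.Gender), where the
# patient is the one referenced by the contact's foreign key.
P_CONTYPE = {
    ("under60", "F"): {"roommate": 0.30, "other": 0.70},
    ("under60", "M"): {"roommate": 0.40, "other": 0.60},
    ("over60", "F"): {"roommate": 0.05, "other": 0.95},
    ("over60", "M"): {"roommate": 0.10, "other": 0.90},
}


def p_joined(age, contype):
    """P(Patient.Age = age, Contact.Contype = contype) for a joined pair,
    summing out Patient.Gender."""
    return sum(
        P_AGE[age] * P_GENDER[g] * P_CONTYPE[(age, g)][contype]
        for g in P_GENDER
    )


print(p_joined("over60", "roommate"))
```

Because the contact type is conditioned on attributes of the related patient, the model can represent that roommate contacts are rare for older patients, which a per-table model cannot.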

16
The key component of PRM

The annotation of a frame with a probability model
A BN represents a distribution over the possible values of the slots in the frame
Each simple slot in the frame is annotated with a local probability model, representing the dependence of its value on the values of related slots

17
PRM is more expressive than BN

Allows the probability model of a slot to depend on a slot chain
Allows the properties of one instance to depend on the properties of other related instances
Allows inheritance
Expresses structural uncertainty:
Number uncertainty about the set of entities in the model (e.g., the number of PhD students in a department)
Reference uncertainty about relations between entities (e.g., which of several conferences a paper appeared in)

18
Our Project

Modeling uncertain XML data
Modeling uncertainty using probabilities
Considering XFDs (XML functional dependencies) in probabilistic XML data
Defining data dependencies in probabilistic XML data
Effective query evaluation on a probabilistic XML database
Define query languages
Compute the probability of a node
Compute the joint probability of an event
Support different query requirements
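
The slides do not say how node and joint probabilities would be computed. As one possible reading, the sketch below assumes a simple convention in which each XML node stores the probability that it exists given that its parent exists, and a node cannot exist without its parent. The tree, the node names, and the numbers are all hypothetical.

```python
# Hypothetical probabilistic XML tree: name -> (parent, P(exists | parent exists)).
TREE = {
    "lab": (None, 1.0),
    "experiment": ("lab", 0.9),
    "result": ("experiment", 0.8),
    "note": ("experiment", 0.5),
}


def node_probability(name):
    """Marginal probability that a node exists: product of the conditionals
    on the path from the node up to the root."""
    p = 1.0
    while name is not None:
        parent, cond = TREE[name]
        p *= cond
        name = parent
    return p


def joint_probability(names):
    """Probability that all given nodes exist, assuming siblings are
    conditionally independent given their parent. Each node on the union
    of the root paths contributes its conditional exactly once."""
    needed = set()
    for name in names:
        while name is not None and name not in needed:
            needed.add(name)
            name = TREE[name][0]
    p = 1.0
    for n in needed:
        p *= TREE[n][1]
    return p


print(node_probability("result"))
print(joint_probability(["result", "note"]))
```

Under these assumptions the joint probability of `result` and `note` is not the product of their marginals, because both depend on the same `experiment` node. This is the kind of dependence that an independence assumption over nodes would miss.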
