Probabilistic XML database PRM

Provided by: wsdbE
Transcript and Presenter's Notes

Title: Probabilistic XML database PRM


1
Probabilistic XML database: PRM and BN
  • Qihong SHAO

2
Motivation
  • Uncertain data exist everywhere in research
  • Automated data mining and analysis tools are
    often error-prone
  • Data integrated from various sources is often
    uncertain or conflicting
  • Experimental results are observed under conditions
    which may be subject to a range of variability in
    the natural environment
  • Human errors (typos, operation errors, biases, ...)
  • Lack of built-in tools to manage data quality
    and uncertain information

3
Challenges
  • Current research attempts to design probabilistic
    databases and query uncertain data; each node is
    associated with an optional confidence
  • Assumption: data tuples are independent
  • This fails to capture dependence relationships in
    the data and results in incorrect confidence
    computation in many applications
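
The effect of the independence assumption can be sketched with a toy example. The two tuples and all probabilities below are hypothetical: they stand for two conflicting alternatives that can never both be present, which is exactly the kind of dependence a per-node confidence cannot express.

```python
# Hypothetical joint distribution over two uncertain tuples t1 and t2.
# Keys are (t1_present, t2_present); the tuples are mutually exclusive.
joint = {
    (True, False): 0.6,
    (False, True): 0.3,
    (False, False): 0.1,
    (True, True): 0.0,
}


def marginal(index):
    """Probability that the tuple at position `index` is present."""
    return sum(p for world, p in joint.items() if world[index])


p_t1 = marginal(0)
p_t2 = marginal(1)

# Confidence of "t1 and t2 both present" if the tuples are treated as independent.
independent_estimate = p_t1 * p_t2

# Confidence of the same event read from the actual joint distribution.
true_probability = joint[(True, True)]

print(independent_estimate, true_probability)
```

Under these assumed numbers the independent estimate is clearly positive although the event is impossible, so any query result that needs both tuples would get a wrong confidence.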

4
Outline
  • What is BN?
  • What is PRM?
  • Our project

5
Bayesian Networks
  • A Bayesian network consists of two components:
  • a directed acyclic graph G
  • nodes: attributes A1, A2, ..., An
  • edges in the graph denote a direct dependence of
    an attribute Ai on its parents Parents(Ai)
  • a set of conditional independence assumptions:
    each node Ai is conditionally independent of its
    non-descendants given its parents
  • CPDs (conditional probability distributions)
  • For each attribute, P(Ai | Parents(Ai)) specifies
    the distribution over the values of Ai given any
    possible assignment of values to its parents.

6
Figure: (a) A Bayesian network for the census domain.
(b) A tree-structured CPD for the Children node given
its parents Income, Age and Marital-Status.
7
Example
  • The table contains 12 attributes:
  • Age, Worker-Class, Education, Marital-Status,
    Industry, Race, Sex, Child-Support, Earner,
    Children, Income, and Employment-Type.
  • The domain sizes for the attributes are,
    respectively, 18, 9, 17, 7, 24, 5, 2, 3, 3, 42,
    and 4.
  • The Children attribute depends on the other
    attributes only via its parents:
  • Income, Age, Marital-Status -> Children.
  • Thus, Children is conditionally independent of
    all other attributes given Income, Age, and
    Marital-Status.

8
  • With the BN, the distribution over Children given
  • Income?17.5K, Age<55, and Marital-Status=never-married
    is (0.19, 0.04, 0.07)
  • Income?17.5K, Age<50, and Marital-Status=married
    is (0.26, 0.47, 0.27)
  • Income?17.5K, Age<50, and Marital-Status=widowed
    is (0.26, 0.47, 0.27)

9
  • The conditional independence assumptions
    associated with the BN, together with the CPDs
    associated with the nodes, uniquely determine a
    joint probability distribution over the
    attributes via the chain rule.
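
Written out in the notation of the Bayesian Networks slide (attributes A1, ..., An and Parents(Ai)), the chain rule factorizes the joint distribution into one CPD per node:

```latex
P(A_1, A_2, \ldots, A_n) \;=\; \prod_{i=1}^{n} P\bigl(A_i \mid \mathrm{Parents}(A_i)\bigr)
```

Each factor is exactly the CPD stored at node A_i, so the network needs only these local tables rather than the full joint table.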

10
BNs for Query Estimation
  • A Bayesian network is a compact representation of
    a full joint distribution.
  • It implicitly contains the answer to any query
    about the probability of any assignment of values
    to a set of attributes.
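
A minimal sketch of how such a query can be answered, assuming a hypothetical three-node network (Income and Age as parents of Children) with made-up domains and CPD values. It multiplies CPD entries to get the joint probability of a full assignment, then sums over the attributes the query does not mention. Brute-force enumeration is used only for clarity; it is exponential in the number of attributes.

```python
from itertools import product

# Hypothetical network: Income and Age have no parents, Children depends on both.
parents = {"Income": (), "Age": (), "Children": ("Income", "Age")}
domains = {
    "Income": ("low", "high"),
    "Age": ("young", "old"),
    "Children": ("no", "yes"),
}

# cpds[node][parent_values][value] = P(node = value | parents = parent_values)
# All numbers are invented for illustration.
cpds = {
    "Income": {(): {"low": 0.7, "high": 0.3}},
    "Age": {(): {"young": 0.6, "old": 0.4}},
    "Children": {
        ("low", "young"): {"no": 0.5, "yes": 0.5},
        ("low", "old"): {"no": 0.8, "yes": 0.2},
        ("high", "young"): {"no": 0.4, "yes": 0.6},
        ("high", "old"): {"no": 0.9, "yes": 0.1},
    },
}


def joint_probability(assignment):
    """Multiply each node's CPD entry given the values of its parents."""
    prob = 1.0
    for node, node_parents in parents.items():
        parent_values = tuple(assignment[p] for p in node_parents)
        prob *= cpds[node][parent_values][assignment[node]]
    return prob


def query_probability(evidence):
    """P(evidence): sum the joint over every attribute not fixed by the query."""
    names = list(domains)
    total = 0.0
    for values in product(*(domains[n] for n in names)):
        assignment = dict(zip(names, values))
        if all(assignment[k] == v for k, v in evidence.items()):
            total += joint_probability(assignment)
    return total


p_old_with_children = query_probability({"Age": "old", "Children": "yes"})
print(p_old_with_children)
```

Multiplying such a probability by the table size gives an estimated result size for the corresponding selection query.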

11
Problem
  • Example
  • A medical database with two tables:
  • Patient, containing tuberculosis (TB) patients
  • Contact, containing people with whom a patient
    has had contact, and who may or may not be
    infected with the disease
  • Queries involving a join between these two
    tables:
  • // find all patients whose age is over 60
    and who have had contact with a roommate
  • where patient.Age > 60 and contact.Patient =
    patient.Patient-ID
  • and contact.Contype = roommate

12
  • Attributes of different tables are often
    correlated.
  • In general, foreign keys are often used to
    connect tuples in different tables that are
    semantically related, and hence the attributes of
    tuples related through foreign key joins are
    often correlated.
  • For example, there is a clear correlation between
    the age of the patient and the type of contacts
    they have. In fact, elderly patients with
    roommates are quite rare, and a naive approach
    that treats the attributes as independent would
    overestimate their number.
  • The probability that two tuples join with each
    other can also be correlated with various
    attributes.
  • For example, middle-aged patients typically have
    more contacts than older patients.
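
The overestimate can be reproduced on a toy instance. The two tables below are hypothetical data invented for this sketch; the naive estimate multiplies the per-attribute selectivities as if Age and Contype were independent and every contact were equally likely to join with any patient.

```python
# Hypothetical Patient table: patient_id -> age.
patients = {1: 30, 2: 45, 3: 70, 4: 80}

# Hypothetical Contact table: (patient_id, contype).
# Younger patients have more contacts, and only they have roommates.
contacts = [
    (1, "roommate"), (1, "coworker"), (1, "friend"),
    (2, "roommate"), (2, "coworker"), (2, "friend"),
    (3, "relative"),
    (4, "relative"),
]

# Exact size of the join result: contacts of type roommate whose patient is over 60.
true_count = sum(
    1 for pid, contype in contacts
    if patients[pid] > 60 and contype == "roommate"
)

# Naive estimate: independent selectivities, uniform join.
sel_age = sum(1 for age in patients.values() if age > 60) / len(patients)
sel_roommate = sum(1 for _, c in contacts if c == "roommate") / len(contacts)
naive_estimate = len(contacts) * sel_age * sel_roommate

print(true_count, naive_estimate)
```

On this instance the naive estimate predicts a non-empty result although no elderly patient has a roommate.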

13
PRM
  • Probabilistic relational models (PRMs)
  • extend Bayesian networks to the relational
    setting.
  • allow us to model correlations not only between
    attributes of the same tuple, but also between
    attributes of related tuples in different tables.
  • a parent of an attribute R.A can be an attribute
    S.B in another relation S such that R has a
    foreign key for S.
  • allow dependencies on attributes in relations
    that are related to R via a longer chain of joins.
14
PRM
15
  • A PRM for our TB domain is shown on this slide.
    Here, for example, the type of the contact
    depends on the age and gender of the patient.
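
A rough sketch of what this cross-table dependency buys, assuming hypothetical CPD values and a simplified setting in which each contact joins with a patient drawn from the Patient distribution (the correlation between attributes and the join itself is ignored here). The CPD for Contact.Contype is indexed by attributes of the related Patient tuple, reached through the foreign key.

```python
# All numbers are invented for illustration.
p_age = {"young": 0.3, "middle": 0.5, "old": 0.2}   # P(Patient.Age)
p_gender = {"F": 0.5, "M": 0.5}                      # P(Patient.Gender)

# P(Contact.Contype = roommate | Patient.Age, Patient.Gender):
# the parents of Contype live in the Patient table.
p_roommate = {
    ("young", "F"): 0.30, ("young", "M"): 0.40,
    ("middle", "F"): 0.20, ("middle", "M"): 0.20,
    ("old", "F"): 0.02, ("old", "M"): 0.04,
}


def prob_roommate_and_age(age):
    """P(Contype = roommate, Patient.Age = age), summing out Gender."""
    return sum(p_age[age] * p_gender[g] * p_roommate[(age, g)] for g in p_gender)


# Estimate that uses the cross-table dependency.
prm_estimate = prob_roommate_and_age("old")

# Estimate that treats Contype and Age as independent.
p_roommate_marginal = sum(prob_roommate_and_age(a) for a in p_age)
independent_estimate = p_age["old"] * p_roommate_marginal

print(prm_estimate, independent_estimate)
```

With these assumed values the independent estimate is several times larger than the one that conditions Contype on the patient's attributes.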

16
The key component of PRM
  • the annotation of a frame with a probability
    model
  • a BN represents a distribution over the possible
    values of the slots in the frame
  • Each simple slot in the frame is annotated with a
    local probability model, representing the dependence
    of its value on the values of related slots

17
PRM is more expressive than BN
  • Allows the probability model of a slot to depend
    on a slot chain
  • Allows the properties of one instance to depend on
    the properties of other related instances
  • Allows inheritance
  • Expresses structural uncertainty:
  • number uncertainty about the set of
    entities in the model (e.g., the number of PhD
    students in a department)
  • reference uncertainty about relations
    between entities (e.g., which of several
    conferences a paper appeared in)

18
Our Project
  • Modeling uncertain XML data
  • Modeling uncertainty using probabilities
  • Considering XFDs (XML functional dependencies) in
    probabilistic XML data
  • Defining data dependencies in probabilistic XML
    data
  • Effective query evaluation on a probabilistic XML
    database
  • Define query languages
  • Compute the probability of a node
  • Compute the joint probability of the event
  • Support different query requirements
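
As a baseline for "compute the probability of a node", here is a minimal sketch of the simple independent-node model mentioned under Challenges, not the project's own method. It assumes a hypothetical `prob` attribute giving each element's probability of existing given that its parent exists; a node's probability is then the product of these values along the path from the root. The document and its numbers are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical probabilistic XML document; `prob` is an assumed attribute name.
doc = """
<patients>
  <patient prob="0.9">
    <name>Ann</name>
    <diagnosis prob="0.7">TB</diagnosis>
  </patient>
</patients>
""".strip()


def node_probabilities(element, parent_prob=1.0, path=""):
    """Yield (path, probability) for each element, multiplying along the path."""
    own = float(element.get("prob", "1.0"))  # missing confidence means certain
    prob = parent_prob * own
    here = f"{path}/{element.tag}"
    yield here, prob
    for child in element:
        yield from node_probabilities(child, prob, here)


# Paths are unique in this small document, so a dict is sufficient here.
probs = dict(node_probabilities(ET.fromstring(doc)))
print(probs["/patients/patient/diagnosis"])
```

This baseline treats sibling nodes as independent given their parent, which is the assumption a PRM-based model is meant to relax.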

20
  • Thank you!