Title: Probabilistic RDF
1. Probabilistic RDF
- Octavian Udrea (1)
- V.S. Subrahmanian (1)
- Zoran Majkic (2)
- (1) University of Maryland, College Park
- (2) University La Sapienza, Rome, Italy
2. Motivation
- Not all information on the Web is easily expressible in classic models (i.e., relational)
- RDF extraction from text
- STORY is the first, very successful prototype
- Need to extend RDF with temporal and uncertainty components
- Goal: build a logical model of RDF with uncertainty and provide query algorithms
3. The Probabilistic RDF idea
- An RDF theory is a set of triples (subject, property, value)
- (USA hasCapital Washington DC)
- (Washington DC hasPopulation 500,000)
- Probabilistic RDF extends this model with uncertainty over the set of values.
- (USA hasCapital (Washington DC, 0.95), (State of Washington, 0.05))
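One way to picture this extension is the minimal Python sketch below, in which the value slot of a triple becomes a mapping from candidate values to probabilities. The class name `PTriple` and its `prob` method are hypothetical and chosen only for illustration; they are not taken from the pRDF implementation.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class PTriple:
    """Hypothetical container for one probabilistic triple (s, p, (V, delta))."""
    subject: str
    prop: str
    values: Dict[str, float]  # candidate value -> probability

    def prob(self, value: str) -> float:
        # Values that are not listed carry no probability mass.
        return self.values.get(value, 0.0)


capital = PTriple(
    "USA",
    "hasCapital",
    {"Washington DC": 0.95, "State of Washington": 0.05},
)
print(capital.prob("Washington DC"))  # 0.95
print(capital.prob("Baltimore"))      # 0.0
```

A plain dictionary keeps the sketch short; a real store would presumably need indexed, disk-based structures.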
4. Probabilistic RDF example (slides 4-7)
Extracted based on www.wrongdiagnosis.com
8. Probabilistic RDF syntax
- Schema uncertainty
- (c subClassOf (C, $\delta$))
- $\sum_{d \in C} \delta(d) \le 1$
- Class-instance uncertainty
- (x rdf:type (C, $\delta$))
- $\sum_{d \in C} \delta(d) \le 1$
- Instance-based uncertainty
- (x p (Y, $\delta$))
- $\sum_{y \in Y} \delta(y) \le 1$
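The constraint shared by all three kinds of uncertainty, that the probabilities attached to a value set sum to at most 1, can be checked mechanically. The sketch below is illustrative only: the function name `is_valid_distribution` is hypothetical, and the additional check that each probability lies in [0, 1] is an assumption not stated on the slide.

```python
import math
from typing import Dict


def is_valid_distribution(delta: Dict[str, float]) -> bool:
    """Check that each probability is in [0, 1] (assumed) and the total is at most 1."""
    if any(p < 0.0 or p > 1.0 for p in delta.values()):
        return False
    total = math.fsum(delta.values())
    # Small tolerance for floating-point rounding.
    return total <= 1.0 + 1e-9


print(is_valid_distribution({"Washington DC": 0.95, "State of Washington": 0.05}))  # True
print(is_valid_distribution({"Pneumonia": 0.455}))  # True: the total may be below 1
print(is_valid_distribution({"a": 0.7, "b": 0.6}))  # False: the total exceeds 1
```

Because the bound is "at most 1" rather than "exactly 1", a theory may leave part of the probability mass unassigned.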
9. Probabilistic RDF syntax
- Sanity requirements
- $(c\ \text{subClassOf}\ (C_1, \delta_1)),\ (c\ \text{subClassOf}\ (C_2, \delta_2)) \Rightarrow (C_1 = C_2 \text{ and } \delta_1 = \delta_2) \text{ or } C_1 \cap C_2 = \emptyset$
- Same applies for other types of uncertainty
- Transitive properties
- Simple inferential capability
- Examples: associatedWith, controlledBy
- P-path
- A set of triples connected by transitive properties
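The sanity requirement says that two triples with the same subject and property must carry either identical value sets with identical probabilities, or disjoint value sets. A minimal sketch of such a check follows; the tuple layout, the function name `sanity_violations`, and the value `Cough` in the usage example are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple

# Assumed layout: (subject, property, {value: probability})
PTriple = Tuple[str, str, Dict[str, float]]


def sanity_violations(theory: List[PTriple]) -> List[Tuple[PTriple, PTriple]]:
    """Return pairs of triples with the same subject and property whose value
    sets overlap without being identical in both values and probabilities."""
    bad = []
    for t1, t2 in combinations(theory, 2):
        (s1, p1, d1), (s2, p2, d2) = t1, t2
        if (s1, p1) != (s2, p2):
            continue
        identical = d1 == d2
        disjoint = not (d1.keys() & d2.keys())
        if not (identical or disjoint):
            bad.append((t1, t2))
    return bad


ok_theory = [
    ("Flu", "associatedWith", {"AcuteBronchitis": 0.7}),
    ("Flu", "associatedWith", {"Cough": 0.9}),  # disjoint value set: allowed
]
bad_theory = [
    ("Flu", "associatedWith", {"AcuteBronchitis": 0.7}),
    ("Flu", "associatedWith", {"AcuteBronchitis": 0.5, "Cough": 0.3}),  # overlaps
]
print(len(sanity_violations(ok_theory)))   # 0
print(len(sanity_violations(bad_theory)))  # 1
```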
10. Example p-path
11. P-path semantics and t-norms
- We cannot generally assume independence between triples on a transitive path
- Flu, AcuteBronchitis, Pneumonia
- T-norms are used to express the user's knowledge of the relationship between triples
- $\otimes$ is associative, commutative
- $0 \otimes x = 0$, $1 \otimes x = x$
- $x \le y,\ z \le w \Rightarrow x \otimes z \le y \otimes w$
- P-path probability: t-norm applied to individual probabilities on the path
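To make the p-path probability concrete, the sketch below folds a t-norm over the probabilities along a path. The product t-norm is the one used in the slides; the minimum and Łukasiewicz t-norms are standard textbook examples added here only for comparison. The function names are hypothetical.

```python
from functools import reduce
from typing import Callable, Sequence

TNorm = Callable[[float, float], float]


def product(x: float, y: float) -> float:
    return x * y


def minimum(x: float, y: float) -> float:
    return min(x, y)


def lukasiewicz(x: float, y: float) -> float:
    return max(0.0, x + y - 1.0)


def path_probability(probs: Sequence[float], tnorm: TNorm) -> float:
    """Fold the t-norm over the probabilities along a p-path.

    Starting from 1.0 is safe because 1 is the identity of every t-norm.
    """
    return reduce(tnorm, probs, 1.0)


path = [0.7, 0.65]  # Flu -> AcuteBronchitis -> Pneumonia
print(round(path_probability(path, product), 3))      # 0.455
print(round(path_probability(path, minimum), 3))      # 0.65
print(round(path_probability(path, lukasiewicz), 3))  # 0.35
```

The three results differ because each t-norm encodes a different assumption about how the triples on the path relate; the product corresponds to treating them as independent.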
12. Example p-path
(Flu, associatedWith, (Pneumonia, 0.455)) w.r.t. the product t-norm
13. pRDF semantics
- A world W is a set of simple triples (with no probabilities)
- An interpretation I associates a probability to each world
- I satisfies a pRDF theory if:
- For each (s, p, (V, $\delta$)) and each $v \in V$: $\delta(v) \le \sum_{W \ni (s,p,v)} I(W)$, where the sum ranges over the worlds W that contain (s, p, v)
- Same applies to paths w.r.t. a given t-norm
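The satisfaction condition for individual triples can be sketched as follows, assuming a finite set of worlds represented as frozensets of simple triples and an interpretation given as a dictionary from worlds to probabilities. The names and data layout are hypothetical, and the sketch covers only the triple condition, not the corresponding condition on paths.

```python
from typing import Dict, FrozenSet, List, Tuple

Triple = Tuple[str, str, str]                    # simple triple (s, p, v)
World = FrozenSet[Triple]                        # a world is a set of simple triples
PTriple = Tuple[str, str, Dict[str, float]]      # (s, p, {v: delta(v)})


def satisfies(interp: Dict[World, float], theory: List[PTriple]) -> bool:
    """True if, for every (s, p, (V, delta)) and every v in V, delta(v) is at
    most the total probability of the worlds that contain (s, p, v)."""
    for s, p, delta in theory:
        for v, lower_bound in delta.items():
            mass = sum(pr for world, pr in interp.items() if (s, p, v) in world)
            if lower_bound > mass + 1e-9:  # tolerance for rounding
                return False
    return True


t_dc = ("USA", "hasCapital", "Washington DC")
t_wa = ("USA", "hasCapital", "State of Washington")
theory = [("USA", "hasCapital", {"Washington DC": 0.95, "State of Washington": 0.05})]

good = {frozenset({t_dc}): 0.95, frozenset({t_wa}): 0.05}
poor = {frozenset({t_dc}): 0.5, frozenset(): 0.5}
print(satisfies(good, theory))  # True
print(satisfies(poor, theory))  # False: only 0.5 of the mass supports Washington DC
```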
14. pRDF semantics
- A theory is consistent iff it has a satisfying interpretation
- Every pRDF theory is consistent
- Entailment: T entails T' iff every satisfying interpretation of T satisfies T'
- Closure of a theory: the entire set of triples entailed by the theory
- Maximal w.r.t. the probability values
15. pRDF fixpoint semantics
- The closure operator adds exactly one entailed triple at each step
- (Flu associatedWith (Acute Bronchitis, 0.7)) and
- (Acute Bronchitis associatedWith (Pneumonia, 0.65)) yields
- (Flu associatedWith (Pneumonia, 0.455))
- w.r.t. the product t-norm
- The closure operator has a fixpoint, which is the theory closure.
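A naive saturation loop illustrates the idea of reaching a fixpoint. This is a sketch under simplifying assumptions, not the operator defined in the talk: each probabilistic triple is flattened into separate (subject, property, value) entries, only transitive properties are chained, and the highest derived probability is kept for each entry. The function name `closure` and the data layout are hypothetical.

```python
from typing import Callable, Dict, Set, Tuple

Key = Tuple[str, str, str]  # (subject, property, value)


def closure(theory: Dict[Key, float], transitive: Set[str],
            tnorm: Callable[[float, float], float]) -> Dict[Key, float]:
    """Chain triples over transitive properties until nothing changes."""
    closed = dict(theory)
    changed = True
    while changed:
        changed = False
        for (s, p, mid), pr1 in list(closed.items()):
            if p not in transitive:
                continue
            for (mid2, p2, v), pr2 in list(closed.items()):
                # Only chain (s p mid) with (mid p v); skip trivial loops back to s.
                if p2 != p or mid2 != mid or v == s:
                    continue
                derived = tnorm(pr1, pr2)
                # Keep the maximal probability derived so far.
                if derived > closed.get((s, p, v), 0.0):
                    closed[(s, p, v)] = derived
                    changed = True
    return closed


theory = {
    ("Flu", "associatedWith", "AcuteBronchitis"): 0.7,
    ("AcuteBronchitis", "associatedWith", "Pneumonia"): 0.65,
}
closed = closure(theory, {"associatedWith"}, lambda x, y: x * y)
print(round(closed[("Flu", "associatedWith", "Pneumonia")], 3))  # 0.455
```

The loop stops once a full pass adds no triple and raises no probability, which is the fixpoint condition.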
16. pRDF query processing
- We will consider only simple queries: a triple with a variable term
- Example: (? associatedWith Pneumonia 0.4)
- What is associated with Pneumonia with probability above 0.4?
- Simple method
- Compute the closure
- Select any triple in the closure that matches the query
- VERY expensive computationally
17. pRDF query processing
- Set of algorithms for answering simple queries and conjunctions
- pRDF_Subject, pRDF_Property, ..., pRDF_conjunction
- Central idea
- Apply the closure operator only in those directions that yield tuples relevant to the query
- Cut off path computations when the threshold can no longer be reached.
- min?(current_probability, threshold)
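The cut-off idea relies on the fact that a t-norm never exceeds either of its arguments, so a path whose probability has dropped to the threshold or below cannot recover. The sketch below answers a query of the form (? prop target threshold) by walking backwards from the target and pruning such paths. It is an illustration only, not the pRDF_Subject algorithm from the talk; the function name, the flattened data layout, and the `Cold` triple are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Tuple

Key = Tuple[str, str, str]  # (subject, property, value)


def query_subjects(theory: Dict[Key, float], prop: str, target: str,
                   threshold: float,
                   tnorm: Callable[[float, float], float]) -> Dict[str, float]:
    """Find subjects linked to `target` via the transitive property `prop`
    with probability strictly above `threshold`."""
    incoming = defaultdict(list)  # value -> [(subject, probability)]
    for (s, p, v), pr in theory.items():
        if p == prop:
            incoming[v].append((s, pr))

    best: Dict[str, float] = {}
    stack = [(target, 1.0)]
    while stack:
        node, prob = stack.pop()
        for s, pr in incoming[node]:
            new_prob = tnorm(pr, prob)
            # Prune: extending the path can only keep or lower its probability.
            if new_prob <= threshold or s == target:
                continue
            if new_prob > best.get(s, 0.0):
                best[s] = new_prob
                stack.append((s, new_prob))
    return best


theory = {
    ("Flu", "associatedWith", "AcuteBronchitis"): 0.7,
    ("AcuteBronchitis", "associatedWith", "Pneumonia"): 0.65,
    ("Cold", "associatedWith", "Flu"): 0.5,  # hypothetical extra triple
}
answers = query_subjects(theory, "associatedWith", "Pneumonia", 0.4,
                         lambda x, y: x * y)
print(sorted(answers))  # ['AcuteBronchitis', 'Flu']
```

In the usage example the path through `Cold` reaches only about 0.23, so it is dropped without exploring anything beyond it.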
18. Experimental results
- Implementation
- Java, 1700 LOC
- Disk-based storage for pRDF theories
- Synthetically generated datasets
- According to varying underlying distributions
- Datasets extracted from Web sources
19. Experimental questions
- Does the underlying distribution affect query running time?
- From a practical point of view, which are the fastest types of queries?
- How does running time vary with the number of atoms in a conjunction?
- What other theory-dependent factors affect running time?
- Theory width
- Number of properties
20. Query running time (Poisson)
21. Query running time (Zipf)
22. Conjunctive queries running time
23. Dependence on property width
24. Number of properties
25. Take away points
- RDF syntax with uncertainty
- Model-theory and fixpoint semantics for pRDF
- Efficient query algorithms for pRDF
26. The end
- http://om.umiacs.umd.edu/
- Thank you!
- Questions? Comments?