Supporting Link Analysis Using Advanced Querying Methods in Semantic Web Databases


1
Supporting Link Analysis Using Advanced Querying
Methods in Semantic Web Databases
Kemafor Anyanwu
Advisory Committee: Amit Sheth (Major), Jay Aronson, Liming Cai, John Miller
2
Outline

Link Analysis
Link Analysis in Semantic Web Databases
Query Language Support
Query Evaluation Support
Result Ranking

3
Link Analysis

From Wikipedia:
a subset of network analysis, exploring
associations between objects
provides the crucial relationships and
associations between very many objects of
different types that are not apparent from
isolated pieces of information
Currently has no database support!!!

4
"Everything's connected, all along the line.
Cause and effect. That's the beauty of it. Our
job is to trace the connections and reveal them."

Jack, in Terry Gilliam's 1985 film Brazil

5
An Example

Find relationships between passengers on flights
to New York or Washington DC, who either
purchased their tickets less than 24hrs before
departure time or with cash, and are linked to
flight training

6
Graph Data Models

Contemporary data models
Stanford's Object Exchange Model (OEM)
W3C's eXtensible Markup Language (XML)
W3C's Resource Description Framework (RDF)
Semantic Web
Many others, more domain-specific
Link analysis queries can be seen as querying
about structure
Example: find a subgraph connecting certain
passengers!
Steiner Tree Problem: NP-hard

7
Link Analysis on the Semantic Web
8
Semantic Web languages - RDF

In RDF
Each resource has an IRI
Triples (subject, property, object) are
statements linking resources
(X:kemafor, Y:hasEmail, anyanwu_at_cs.uga.edu)
X and Y are aliases for namespaces, e.g.
http://somelong.url/
A triple has an equivalent directed labeled graph
RDF Schema provides a metaschema for describing
classes and relation types
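
The correspondence between triples and a directed labeled graph can be sketched in a few lines of Python. This is only an illustration: the triples are hypothetical (loosely modeled on a university example), plain strings stand in for IRIs, and `to_graph` is a helper name introduced here.

```python
from collections import defaultdict

# Hypothetical triples (subject, property, object); strings stand in for IRIs.
triples = [
    ("U1", "offers", "C1"),
    ("C1", "course_title", "Semantic Web"),
    ("S1", "enrolled_in", "C1"),
    ("S2", "enrolled_in", "C1"),
    ("C1", "taught_by", "P1"),
]


def to_graph(triples):
    """Build an adjacency map: subject -> list of (property, object) edges."""
    graph = defaultdict(list)
    for s, p, o in triples:
        graph[s].append((p, o))
    return graph


graph = to_graph(triples)
for subject in sorted(graph):
    for prop, obj in graph[subject]:
        # Each triple is one labeled edge: subject --property--> object
        print(f"{subject} --{prop}--> {obj}")
```

Each subject and object becomes a node and each property becomes the label of a directed edge, which is all the later path queries rely on.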

9
An Example RDF Database
[Figure: example RDF graph. Nodes include U1, C1, C3, S1A1, S2, P1, P2, E1, Pu1 and Pu2; edge labels include offers, course_title (with literal values "Semantic Web" and "Databases"), enrolled_in, taught_by, advisor, author_of and editor_of.]
10
Subgraph Matching contd.

Find universities that offer a course on the
Semantic Web and the students enrolled in those
courses

11
Find universities that offer a course on the
Semantic Web and the students enrolled in those
courses
[Figure: the example RDF graph with the query variables X and Z overlaid.]
12
Find links between S1A1 and P1
[Figure: the example RDF graph annotated with the path variable ??P, with ??P1 and ??P2 marked on the graph.]
14
Ranking

Finding the most important query results
What is most important may vary from context to
context
For link analysis queries: most weighted paths
Need flexible ranking models that allow users to
bump into new information

15
Ranking contd.

Has been addressed for the
Web, Semantic Web, Databases
Three categories
Fixed Models
Seeded Models
Variable Models
Typically, for databases, query results have a
known set of attributes that can be varied
Query results have an unknown structure !!

16
Problem Statement and Contributions

Problem Statement
Develop database infrastructure for Path
Extraction Queries
Query Language Support
Query Evaluation Support
Query Result Management Support
Contributions
Query Language Support (Anyanwu et al. WWW2007,
WWW2003, SIGREC2002, ISWWS2001)
Query Evaluation Infrastructure (Anyanwu et al.
WWW2007)
SemRank: Ranking query results (Anyanwu et al.
WWW2005, IJWS08)

17
Query Language Support
18
Language Design Goals

An integrated query language for expressing both
subgraph matching and extraction
Build on existing languages and standards - SPARQL
Have an expressive language with simple
constructs.

19
Language Requirements

Path Variables
Path Constraint Expressions

20
Constraints in path extraction queries

Paths must contain a specific node type
from an MTB surface molecule, via a
Phosphoinositide 3-Kinase enzyme, to a cellular
response event.
Path lengths are bounded
close connections (less than 4 hops) between
SalesPersonA and CIO-Y.
Paths must contain a specific pattern
paths from organizationO to billB that involve a
sponsorship relationship to congressmanC
associated with a positive vote on the bill

21
Algebraic syntax for SPARQ2L

Let T = RDF terms, i.e.
I - IRIs,
L - literals,
B - blank nodes
A term variable ?x ranges over T
A triple pattern is a triple with a term variable:
(?x, email, ?y)
Graph pattern - set of triple patterns combined
using AND, UNION, FILTER, OPTIONAL:
(?x, email, ?y) AND (?x, age, ?z) FILTER (?z > 15)
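
A minimal sketch of how such a graph pattern could be evaluated, assuming a naive nested-loop join. The data and the helper names `match_triple` and `evaluate` are hypothetical, and only AND and FILTER are covered (UNION and OPTIONAL are left out).

```python
def is_var(term):
    """Term variables are written as strings starting with '?'."""
    return isinstance(term, str) and term.startswith("?")


def match_triple(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    new = dict(binding)
    for p, t in zip(pattern, triple):
        if is_var(p):
            # Bind the variable, or check consistency with an earlier binding.
            if new.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return new


def evaluate(patterns, triples, keep=lambda b: True):
    """AND of triple patterns (nested-loop join), then FILTER via `keep`."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [nb for b in bindings for t in triples
                    if (nb := match_triple(pattern, t, b)) is not None]
    return [b for b in bindings if keep(b)]


# Hypothetical data set.
data = [
    ("alice", "email", "alice@example.org"),
    ("alice", "age", 30),
    ("bob", "email", "bob@example.org"),
    ("bob", "age", 12),
]

# (?x, email, ?y) AND (?x, age, ?z) FILTER (?z > 15)
query = [("?x", "email", "?y"), ("?x", "age", "?z")]
results = evaluate(query, data, keep=lambda b: b["?z"] > 15)
print(results)
```

The shared variable ?x is what turns the two triple patterns into a join; the filter is applied to complete bindings only.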

22
Generalized Graph Patterns

A path variable ??p ranges over 2^T
A generalized triple pattern (gtp) - a triple
pattern that permits a path variable in the
property position
(x, ??p, y)
Generalized graph patterns - multiple gtps
combined similarly, plus the PATHFILTER operator

23
Generalized Graph Patterns

Let VP be the set of path variables
PATHFILTER is based on built-in path functions
containsAny (VP, 2^I) → boolean
containsAll (VP, 2^I) → boolean
isSimple (VP) → boolean
containsPattern (VP, R(T)) → boolean
cost (VP) → R
(X, ??p, Y) PATHFILTER (containsAny(??p, {P3K}))
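
The built-in path functions can be illustrated with a brute-force sketch: enumerate candidate paths, then apply the filter. This is not the evaluation strategy developed later in the talk. The data set and all function names below are hypothetical, and a path is assumed to be a list of triples.

```python
def simple_paths(triples, src, dst):
    """Yield every vertex-simple path from src to dst as a list of triples."""
    out = {}
    for s, p, o in triples:
        out.setdefault(s, []).append((s, p, o))

    def dfs(node, visited, path):
        if node == dst and path:
            yield list(path)
            return
        for s, p, o in out.get(node, []):
            if o not in visited:
                visited.add(o)
                path.append((s, p, o))
                yield from dfs(o, visited, path)
                path.pop()
                visited.remove(o)

    yield from dfs(src, {src}, [])


def path_terms(path):
    """All nodes and edge labels that occur on the path."""
    terms = {path[0][0]}
    for _, p, o in path:
        terms.update((p, o))
    return terms


def contains_any(path, wanted):
    return bool(path_terms(path) & set(wanted))


def contains_all(path, wanted):
    return set(wanted) <= path_terms(path)


def is_simple(path):
    """True if no node is visited twice."""
    nodes = [path[0][0]] + [o for _, _, o in path]
    return len(nodes) == len(set(nodes))


# Hypothetical data: two routes from X to Y, one of them via P3K.
data = [
    ("X", "binds", "P3K"),
    ("P3K", "activates", "Y"),
    ("X", "regulates", "Y"),
]

# (X, ??p, Y) PATHFILTER (containsAny(??p, {P3K}))
matches = [p for p in simple_paths(data, "X", "Y") if contains_any(p, {"P3K"})]
print(matches)
```

Enumerating all paths first is exponential in the worst case, which is why pushing such constraints into evaluation matters.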

24
SPARQ2L's containsPattern()

containsPattern (VP, R(T)) → boolean
Allows regular expressions over triples
Extended Triple Patterns
(s , , . , p, , o) -- matches a
path where
s is the subject of the first triple,
the property/edge in the last triple is p,
the object of last triple is o
and ,
matches arbitrary intermediate nodes/edges on
the path.
If T is the set of all triple patterns, R(T) is
the set of regular expressions over T

25
SPARQ2L's regular expressions

Let ?o, ?c and ?b be term variables ranging over
the sets of organizations, congressmen and bills
respectively.
Find a path from an organization to a bill such
that the organization has sponsored something
linked to a congressman, and that congressman is
linked to a positive vote for the bill.
(?o, ., . , sponsors, . , .) ? ( . , .,
. , ., . , ?c) ? ( ?c, ., . , votesFor,
. , ?b)
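
One way to read a sequence of extended triple patterns is sketched below. The encoding is an assumption made for illustration, not SPARQ2L's concrete syntax: each pattern is a tuple (s, p, o) constraining the first subject, last property and last object of a path segment, `ANY` is a wildcard, the whole path must be covered by the patterns, and term variables are replaced by constants.

```python
ANY = None  # wildcard: leaves a position unconstrained


def segment_matches(segment, pattern):
    """Check one non-empty run of triples against one extended triple pattern.

    s constrains the subject of the first triple; p and o constrain the
    property and object of the last triple. Everything in between is free.
    """
    s, p, o = pattern
    first, last = segment[0], segment[-1]
    return ((s is ANY or first[0] == s)
            and (p is ANY or last[1] == p)
            and (o is ANY or last[2] == o))


def contains_pattern(path, patterns):
    """True if `path` splits into consecutive non-empty segments,
    one per pattern, each matching its pattern."""
    if not patterns:
        return not path
    head, rest = patterns[0], patterns[1:]
    return any(segment_matches(path[:i], head)
               and contains_pattern(path[i:], rest)
               for i in range(1, len(path) + 1))


# Hypothetical path from an organization to a bill.
path = [
    ("orgO", "sponsors", "eventE"),
    ("eventE", "attended_by", "congC"),
    ("congC", "votesFor", "billB"),
]
patterns = [
    ("orgO", "sponsors", ANY),     # the organization sponsors something
    (ANY, ANY, "congC"),           # ... that is linked to the congressman
    ("congC", "votesFor", "billB"),  # ... who votes for the bill
]
print(contains_pattern(path, patterns))
```

The recursion tries every split point, so it is a specification of the matching condition rather than an efficient matcher.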

26
SPARQ2L's formal semantics

Important !!!
Provides a framework for precisely describing
queries
Helps identify valid optimizations
Details for SPARQ2L not given here

27
Complexity

isSimple(): NP-hard
containsAny(): DISJOINTPATH, NP-hard
containsAll(): HAMPATH, NP-hard
containsPattern(): generally NP-hard

28
Comparison to other languages
29
Query Evaluation For Path Extraction Queries
30
Path Extraction Query (PEQ) Evaluation

Most approaches
Graph traversals in main memory dbs
Limited support of constraints i.e. path filter
conditions
Our goal
a good linear representation for general graphs
that provides good performance for most classes
of PEQs

31
Representation wish list

Queries should be answerable in a single scan
Labeling scheme for efficient pruning
Compact representation of partial path
information
Clustering of related path information
minimizing external path length: a bin packing
problem

32
Foundations of evaluation framework [2]
Given a directed graph G = (V, E)
A P-Expression or Path Summary (P, u, v) is a
regular expression P over E such that every
string s ∈ L(P) is a path from u to v; P is then
said to be of type (u, v).
Example: assume E = {(u, p1, w), (u, p2, w),
(w, p3, v)}; then ((u, p1, w) ∪ (u, p2, w)) ·
(w, p3, v) is a p-expression of type (u, v).
A Path Sequence for a graph G is a sequence
(P1, s1, d1), (P2, s2, d2), (P3, s3, d3), ...,
(Pl, sl, dl) of p-expressions such that any
non-empty path p in G can be split as
p = p1, p2, ..., pk, where the pieces are
represented by entries at increasing positions
in the sequence (e.g. positions 2 < f < g < l).
[2] Tarjan, Fast Algorithms for Solving Path
Problems, JACM 1981
33
[Figure: example graph G with vertices 1-6 and edges a-h]
Path Sequence PS for G
d.b: d ∈ L(P2) and b ∈ L(P7)
c.a: c ∈ L(P1) and a ∈ L(P10)
c.a.c: c ∈ L(P1) and a.c ∈ L(P3)
34
[Figure: example graph G with vertices 1-6 and edges a-h]
LU decomposition of a graph's matrix!!!
Path Sequence PS for G
u < v: paths from u to v with all intermediate
vertices w < u.
u ≥ v: paths from u to v with no intermediate
vertex w > v.
PS: p-expressions with u ≤ v in increasing order
of u, followed by p-expressions with u > v in
decreasing order of u
35
[Figure: example graph G with vertices 1-6 and edges a-h]
Solving (2, 6)
(2, 3) (2, 3) ? (2, 2) ? (2, 3) ? ? ((a ? c)
? a ? d) ..
(2, 2) ?
(2, 2) ? (1, 2) (2, 2) ?
(2, 6) (2, 6) ? (2, 3) ? (3, 6)
? ? (a ? c) ? a ?
d ? h .
(2, 2) ? (1, 3) (2, 3) ?
(2, 2) ? (2, 2) (2, 2) ? ? (a ? c)
O(path sequence length)!!
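
The single-scan idea can be sketched as follows. It follows the shape of the SOLVE procedure in Tarjan's paper, under simplifying assumptions: the graph is acyclic, so its edges listed in topological order of their source vertex already form a path sequence, and p-expressions are kept as plain strings with "." for concatenation and "|" for union. The example graph is hypothetical, not the one on the slide.

```python
def solve(path_sequence, source):
    """Scan the path sequence once, left to right.

    Returns {v: p-expression for all paths from source to v}.
    None stands for the empty set, "" for the empty path.
    """
    P = {source: ""}

    def concat(a, b):
        if a is None or b is None:
            return None
        if a == "":
            return b
        if b == "":
            return a
        return f"{a}.{b}"

    def union(a, b):
        if a is None:
            return b
        if b is None:
            return a
        return f"({a} | {b})"

    for expr, u, v in path_sequence:
        if u == v:
            # Entry of type (u, u): extend what already reaches u.
            P[u] = concat(P.get(u), expr)
        else:
            # Entry of type (u, v): add the paths that go through u.
            P[v] = union(P.get(v), concat(P.get(u), expr))
    return P


# Hypothetical DAG with vertices numbered in topological order:
# a: 1->2, b: 1->2, d: 1->3, c: 2->3. Edges are sorted by source vertex.
path_sequence = [("a", 1, 2), ("b", 1, 2), ("d", 1, 3), ("c", 2, 3)]
result = solve(path_sequence, 1)
print(result[3])  # (d | (a | b).c)
```

Each entry is touched exactly once, so the work is linear in the length of the path sequence; for graphs with cycles the sequence itself has to be computed first (by elimination), which this sketch does not attempt.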
36
Managing path sequences

Cluster path information using heuristics:
prunability and prunable equivalence
Using Btree, query answering becomes an extended
range query.

Let Q = (s, d) be a query and PS be the path
sequence for a data graph G.

A p-expression pe is said to be prunable from PS
if Q can be solved using PS - pe
Two p-expressions pe1, pe2 are prunable
equivalent with respect to Q if
prunability of pe1 ⇔ prunability of pe2

37
Labeling a path sequence
[Figure: example graph with numbered nodes and edges labeled course_in, advises, enrolled_in, has_subject_area, author_of, editor_of, required_text, related_to_project, taught_by, current_project and project_in]
P-expressions for SCCs are prunable equivalent,
e.g. 8, 9 and 10
So are those for edges connecting them
May be assigned the same key values
38
[Figure: the example graph with regions marked "Dangling Trees" and "Disconnected"]
Interval of tree subgraph identifiers (sids) is
disjoint from that of non-tree sids
39
Tree-induced prunability equivalence
Refine the partitioning of nodes and edges
in non-tree subgraphs using an optimal spanning
tree (OST)
OST: select edges that lie on the longest path
from the root to a node.
[Figure: example OST with numbered levels; edges advises, teaches, enrolled_in, author_of, course_in and has_subject_area]
Only nodes and edges at level j < i can reach a
node at level i!
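
The level property can be sketched on a small DAG (for example, a graph whose strong components have been collapsed). The assumption here is that a node's OST level is the number of nodes on the longest path from a root to it; the graph and the function names are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def ost_levels(nodes, dag_edges):
    """Level of a node = number of nodes on the longest root-to-node path."""
    preds = {n: set() for n in nodes}
    for u, v in dag_edges:
        preds[v].add(u)
    level = {}
    # static_order() yields every node after all of its predecessors.
    for n in TopologicalSorter(preds).static_order():
        level[n] = 1 + max((level[p] for p in preds[n]), default=0)
    return level


def may_reach(level, u, v):
    """Necessary (not sufficient) condition for u to reach v."""
    return u == v or level[u] < level[v]


# Hypothetical DAG.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("A", "D")]
level = ost_levels(nodes, edges)
print(level)
print(may_reach(level, "C", "B"))  # False: C sits on a deeper level than B
```

Because every edge goes from a lower level to a strictly higher one, a query (u, v) can skip any node whose level falls outside the interval between the levels of u and v.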
40
2-Color Node Labels

Label each SCC with three identifiers: a subgraph
identifier s, an OST-level identifier l and a
preorder identifier t.
Each label takes one of two forms:
slt, if in a non-tree region (level order)
stl, if in a dangling tree region (depth-first
order)
The subgraph identifiers for trees and non-trees
are non-overlapping

41
2-Color Code for a Graph

2-Color code is a sequence of key-value pairs
for scc_i: (2CL(scc_i), 2CL(scc_i)) → PS_i
for scc_x, scc_y connected by edges e1, e2, ..., ek:
(2CL(scc_x), 2CL(scc_y)) → pe_e1, pe_e2, ..., pe_ek
preserves the path sequence property

42
[Figure: the example graph with regions marked "Dangling Trees" and "Disconnected"]
Interval of tree subgraph identifiers (sids) is
disjoint from that of non-tree sids
43

1    (1,1,1), (1,1,1)
2    (1,1,1), (1,2,2)    (advises, 1, 2)
...
16   (2,1,6), (2,1,6)    (enrolled_in, 8, 10),
                         (advises · enrolled_in, 9, 10),
                         ((taught_by · advises · enrolled_in), 10, 10),
                         (taught_by, 10, 9), (advises, 9, 8)
17   (2,1,6), (2,2,4)    (author_of, 8, 13),
                         (required_text, 10, 13)
20   ...
31   (3,1,1), (3,1,1)
32   (3,2,1), (3,1,1)    (project_in, 17, 18)
34   (4,1,1), (2,9,1)    (current_project, 11, 19)

44
2-Color Code Properties

Order Property for GN / GT
u in GN, v in GT ⇒ label(u) precedes label(v).
e = (u, v) ⇒ label(u) precedes label(e)
Trees ⇒ label(e) immediately precedes label(v)
NonReachability Property, for (su, lu, tu)u and
(sv, lv, tv)v
su ≠ sv ⇒ result is empty.
lu ≥ lv ⇒ result is empty.
for query (u, v) with levels i, j, any node w
with level k, where k < i or k > j, is prunable

45
Cost of 2-Color code construction

Find strong components of G - O(n + m)
Find roots of dangling trees - O(n + m)
Find optimal spanning tree - O(n + m)
Find PS for each strong component i in increasing
order of level in OST - O(Σ n_i^3)

46
Evaluation

Strategy
2-color code vs. relational databases using joins
strawman comparison
2-color code vs. randomly chosen topological
orderings
Datasets
Queries
6 query classes
NT-NT, NT-T, T-T
Positive, Negative
40 × 6 queries

47

[Charts: evaluation results for Positive Queries and for Negative Queries]

48
Constrained Path Extraction Queries

Inline Processing
Bounded length paths
associate (P, u, v) with 2 mappings: cost and
shortestpath (Tarjan81)
Post-filtering approaches
Our approach
Compute (P, u, v)
Check if given nodes/edges satisfy constraints
Then filter

49
Bit encoding of p-expressions
[Figure: expression tree and encoding table for a p-expression of type (1, 3) built from sub-expressions of types (1, 2) and (2, 3)]
(a ∪ b) · (c ∪ d) represents the paths
(1, a, 2) (2, c, 3)
(1, a, 2) (2, d, 3)
(1, b, 2) (2, c, 3)
(1, b, 2) (2, d, 3)
Example: containsAll(p, {a, b})
a and b will agree on some suffix beginning with
a 1 in an even position but will disagree on
a preceding odd bit position
50
Ranking Query Results
51
The Relationship Ranking Problem
query q = (1, 3) (a pair of nodes)
[Figure: example graph with nodes 1-5 and edges a-g]
1. Find the subgraph that covers q
2. List the results in order of relevance
(could be done with step 1 or as a separate step)
52
SemRank Ranking PEQ results

Modulative Ranking
Product of functions fi, where each fi is a
function of a single user input m
Metrics
Refraction Count
Measures how different a path is from the paths
at the schema layer
Calculated in terms of a Semantic Summary
Semantic Information Gain - SIG
How much information does a user gain when a
result is given?
-log(Prob(relation)) with respect to node types
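
A minimal sketch of the information-gain idea, under two assumptions that are not stated on the slide: the probability of a relation is estimated as its relative frequency among the properties that are valid between the two node types, and the logarithm is taken to base 2. The counts are hypothetical.

```python
import math


def information_gain(prop, valid_property_counts):
    """-log2 of the relative frequency of `prop` among the valid properties."""
    total = sum(valid_property_counts.values())
    return -math.log2(valid_property_counts[prop] / total)


# Hypothetical frequencies of properties that can link a Student to a Course.
counts = {"enrolled_in": 900, "audits": 90, "teaching_assistant_for": 10}

for prop in counts:
    print(prop, round(information_gain(prop, counts), 2))
```

A frequent, predictable property contributes little information, while a rare one contributes a lot, which is what lets the ranking favor surprising results when the user asks for them.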

53
SemRank Ranking PEQ results

S-Match
Best semantic match with user description (if
provided)
Distance between keyword and related term in
conceptual hierarchy

54
High Information Gain, High Refraction Count, High
S-Match
Low Information Gain, Low Refraction Count, High
S-Match
adjustable search mode
55
Refraction
[Figure: schema layer with classes Student, Course, Professor and Spouse and properties enrolled_in, taught_by and married_to; instance path through nodes 1-4]

The path enrolled_in → taught_by → married_to
does not exist anywhere at the schema layer
We say that the path refracts at node 3
High refraction count in a path ⇒ low
predictability
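
Counting refractions can be sketched as a check of consecutive property pairs against the pairs expected at the schema layer. The set `summary_pairs` below is a hypothetical stand-in for that schema-level information, and a path is reduced to its sequence of property names.

```python
def refraction_count(path_properties, summary_pairs):
    """Count positions where two consecutive properties on the path
    do not form a pair that is expected at the schema layer."""
    return sum((p, q) not in summary_pairs
               for p, q in zip(path_properties, path_properties[1:]))


# Hypothetical schema knowledge: enrolled_in may be followed by taught_by.
summary_pairs = {("enrolled_in", "taught_by")}

path = ["enrolled_in", "taught_by", "married_to"]
print(refraction_count(path, summary_pairs))  # 1: taught_by -> married_to
```

Here the only refraction is at the node where taught_by meets married_to, matching the intuition that the path leaves what the schema would predict at that point.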

56
Semantic Summary
Representative Ontology Class
[Figure: semantic summary graph over classes C1-C5 and properties p1-p5, including a representative ontology class that combines C1 and C3]
57
Properties of a Semantic Summary

All expected consecutive edge pairs are recorded
in the semantic summary
If p1.p2 is not in the semantic summary ⇒ refraction
Given two representative ontology classes, we can
tell all the valid properties/edges that can
connect them
Calculate probability of a property in terms of
all known valid properties
Information gain

58
The Index Subsystem

FDIX: Frequency Distribution IndeX
Stores the frequency distribution of properties
ROIX: Representative Ontology IndeX
Maps classes to Representative Ontology Classes
Stores the semantic summary graph
PHIX: Property Hierarchy IndeX
Encodes the hierarchical relationships using a
Dewey Decimal labeling scheme
Used for computing S-Match (match between
keywords and properties in a path)
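
Dewey-style labels make hierarchy questions answerable from the labels alone. The sketch below assumes a label such as "1.1.2" spells out the path from the root to a node; the property hierarchy and the function names are hypothetical, and how the distance feeds into S-Match is not specified here.

```python
def dewey_distance(a, b):
    """Number of hierarchy edges between two nodes, from their labels."""
    pa, pb = a.split("."), b.split(".")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)


def is_ancestor(a, b):
    """True if the node labeled `a` is a proper ancestor of the one labeled `b`."""
    pa, pb = a.split("."), b.split(".")
    return len(pa) < len(pb) and pb[:len(pa)] == pa


# Hypothetical property hierarchy.
labels = {
    "related_to": "1",
    "affiliated_with": "1.1",
    "employed_by": "1.1.1",
    "member_of": "1.1.2",
    "located_in": "1.2",
}

print(dewey_distance(labels["employed_by"], labels["member_of"]))   # 2
print(dewey_distance(labels["employed_by"], labels["located_in"]))  # 3
print(is_ancestor(labels["affiliated_with"], labels["member_of"]))  # True
```

Comparing label components rather than raw string prefixes avoids confusing "1.1" with "1.10".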

59
Top-K Evaluation
Final Top-k: 1. g.i, 18; 2. c.f, 9
60
Evaluation

Datasets
SWETO + manually added information
Domains
Politics, Entertainment, Sports
58 Human Subjects
24 sample results in 3 groups
Subjects were asked to pick top 3 for
conventional and top 3 discovery
Conducted a statistical test of significance

62
Three groups

Arnold Schwarzenegger and Andre Agassi
(S A)
examples
S supports GrandSlamForKids organized by A.
S starredIn LastActionHero with Bridget Wilson
ex-wife of A
Arnold Schwarzenegger and George W. Bush (S B)
Michael Jordan and Tiger Woods (J W)

63
Statistical test of significance

Null hypothesis: rankings are random
If true, each relationship will occur in the top
3, 3/8 of the time
But if some occur significantly more often, then
reject the null hypothesis
Experiment
Simulate a distribution of rankings under the
null hypothesis
Test the significance of observed results at the
95% confidence level under this distribution
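
A simulation of the null hypothesis could look like the sketch below. It assumes 58 subjects who each pick a top 3 out of 8 candidate relationships (24 sample results in 3 groups), uniformly at random; the number of trials, the seed and the one-sided cut-off are choices made for this illustration, not details taken from the study.

```python
import random


def simulate_null(num_subjects=58, num_items=8, picks=3, trials=2000, seed=0):
    """Under random rankings, count how often one fixed item lands in a
    subject's top `picks`, repeated for `trials` simulated experiments."""
    rng = random.Random(seed)
    counts = []
    for _ in range(trials):
        hits = sum(0 in rng.sample(range(num_items), picks)
                   for _ in range(num_subjects))
        counts.append(hits)
    return sorted(counts)


counts = simulate_null()
mean = sum(counts) / len(counts)
# Approximate 95th percentile of the simulated null distribution.
critical = counts[int(0.95 * len(counts))]

print("mean number of top-3 appearances under the null:", mean)
print("reject the null for a relationship chosen more than", critical, "times")
```

The simulated mean should sit close to 58 × 3/8; a relationship that subjects place in their top 3 far more often than the cut-off is unlikely to owe its position to chance.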

67
Publications

Core Thesis publications
Kemafor Anyanwu, Angela Maduko, Amit Sheth.
SPARQ2L: Towards Support for Subgraph Extraction
Queries in RDF Databases. The 16th International
World Wide Web Conference (WWW2007), Banff,
Canada, May 8-12, 2007. (acceptance rate 14%)

Kemafor Anyanwu, Angela Maduko, Amit Sheth.
SemRank: Ranking Complex Relationship Search
Results on the Semantic Web. The 14th
International World Wide Web Conference
(WWW2005), Chiba, Japan, May 10-14, 2005.
(acceptance rate 14%) (current # of citations:
21)

Kemafor Anyanwu, Amit Sheth. ρ-Queries: Enabling
Querying for Semantic Associations on the
Semantic Web. The 12th International World Wide
Web Conference (WWW2003), Budapest, Hungary, May
20-24, 2003, pp. 823-833. (acceptance rate
13%) (current # of citations: 49)

Kemafor Anyanwu, Amit Sheth. The ρ Operator:
Discovering and Ranking Associations on the
Semantic Web. SIGMOD Record (Special issue on
Amicalola Workshop on DB-IS Research for Semantic
Web and Enterprises), 31 (4), pp. 42-47, 2002.
(current # of citations: 20)

Kemafor Anyanwu, Amit Sheth. Supporting
Knowledge Discovery on the Semantic Web by
Exploiting the Semantics of Complex Relationships.
International Semantic Web Working Symposium
2001, position paper, Stanford University,
California, USA, July 30 - August 1, 2001.

Kemafor Anyanwu, Angela Maduko, Amit Sheth. From
Link Analysis Ranking to Relationship Analysis
Ranking - Adding Semantics to the Mix. (to be
submitted for second round reviews for the
Journal of Web Semantics)

Angela Maduko, Kemafor Anyanwu, Amit Sheth.
Estimating the Cardinality of RDF Graph Patterns.
WWW2007 poster paper.

68
Comments About Impact

3 key papers have over 110 citations in aggregate
SemRank paper acknowledged as notable
contributions to Semantic Web conference at
WWW2005
Invited for journal special issue
Recent paper on SPARQ2L subject of some blogs
about SPARQL extensions

69
Ongoing and Future Work