Title: Supporting Link Analysis Using Advanced Querying Methods in Semantic Web Databases
1Supporting Link Analysis Using Advanced Querying Methods in Semantic Web Databases
Kemafor Anyanwu
Advisory Committee: Amit Sheth (Major), Jay Aronson, Liming Cai, John Miller
2Outline
- Link Analysis
- Link Analysis in Semantic Web Databases
- Query Language Support
- Query Evaluation Support
- Result Ranking
3Link Analysis
- From Wikipedia
- a subset of network analysis, exploring associations between objects
- provides the crucial relationships and associations between very many objects of different types that are not apparent from isolated pieces of information
- Currently has no database support!
4"Everything's connected, all along the line. Cause and effect. That's the beauty of it. Our job is to trace the connections and reveal them."
- Jack in Terry Gilliam's 1985 film Brazil
5An Example
- Find relationships between passengers on flights
to New York or Washington DC, who either
purchased their tickets less than 24hrs before
departure time or with cash, and are linked to
flight training
6Graph Data Models
- Contemporary data models
- Stanford's Object Exchange Model (OEM)
- W3C's eXtensible Markup Language (XML)
- W3C's Resource Description Framework (RDF)
- Semantic Web
- Many others, more domain-specific
- Link analysis queries can be seen as querying about structure
- Example: find a subgraph connecting certain passengers!
- Steiner Tree Problem: NP-Hard
7Link Analysis on the Semantic Web
8Semantic Web languages - RDF
- In RDF
- Each resource has an IRI
- Triples (subject, property, object) are statements linking resources
- (X:kemafor, Y:hasEmail, anyanwu_at_cs.uga.edu)
- X and Y are aliases for namespaces, e.g. http://somelong.url/
- A triple has an equivalent directed labeled graph
- RDF Schema provides a metaschema for describing classes and relation types
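As a minimal sketch of the triple-to-graph correspondence described above, the following Python snippet stores triples as tuples and builds an adjacency map. The namespace URLs, the `advisedBy` property and the `sheth` resource are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical namespace aliases, echoing the X and Y aliases on the slide.
X = "http://somelong.url/people/"
Y = "http://somelong.url/schema/"

# Each triple is (subject, property, object); the second triple is invented.
triples = [
    (X + "kemafor", Y + "hasEmail", "anyanwu_at_cs.uga.edu"),
    (X + "kemafor", Y + "advisedBy", X + "sheth"),
]

def to_graph(triples):
    """Build a directed labeled graph: subject -> list of (property, object)."""
    graph = defaultdict(list)
    for s, p, o in triples:
        graph[s].append((p, o))
    return dict(graph)

graph = to_graph(triples)
```

Each subject becomes a node, and each (property, object) pair becomes a labeled outgoing edge.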
9An Example RDF Database
[Figure: example RDF graph. Node labels: U1, C1, C3, S1A1, S2, P1, P2, E1, Pu1, Pu2, plus the literals "Semantic Web" and "Databases". Edge labels: offers, course_title, enrolled_in, taught_by, advisor, author_of, editor_of]
10Subgraph Matching contd.
- Find universities that offer a course on the
Semantic Web and the students enrolled in those
courses
11Find universities that offer a course on the
Semantic Web and the students enrolled in those
courses
[Figure: the example RDF graph with the query variables X and Z placed on the nodes of the matching subgraph]
12Find links between S1A1 and P1
[Figure: the example RDF graph annotated with the path variable ??P and the paths ??P1 and ??P2 linking S1A1 and P1]
14Ranking
- Finding the most important query results
- What is most important may vary from context to context
- For link analysis queries: most weighted paths
- Need flexible ranking models that allow users to bump into new information
15Ranking contd.
- Has been addressed for the
- Web, Semantic Web, Databases
- Three categories
- Fixed Models
- Seeded Models
- Variable Models
- Typically for databases, query results have a known set of attributes that can be varied
- Here, query results have an unknown structure!
16Problem Statement and Contributions
- Problem Statement
- Develop database infrastructure for Path Extraction Queries
- Query Language Support
- Query Evaluation Support
- Query Result Management Support
- Contributions
- Query Language Support (Anyanwu et al. WWW2007, WWW2003, SIGREC2002, ISWWS2001)
- Query Evaluation Infrastructure (Anyanwu et al. WWW2007)
- SemRank: Ranking query results (Anyanwu et al. WWW2005, IJWS08)
17Query Language Support
18Language Design Goals
- An integrated query language for expressing both subgraph matching and extraction
- Build on existing languages and standards - SPARQL
- Have an expressive language with simple constructs.
19Language Requirements
- Path Variables
- Path Constraint Expressions
20Constraints in path extraction queries
- paths must contain a specific node type
- from an MTB surface molecule via a Phosphoinositide 3-Kinase enzyme to a cellular response event
- path lengths are bounded
- close connections (less than 4 hops) between SalesPersonA and CIO-Y
- paths must contain a specific pattern
- paths from organizationO to billB that involve a sponsorship relationship to congressmanC associated with a positive vote on the bill
21Algebraic syntax for SPARQ2L
- Let T = set of RDF terms, i.e.
- I - IRIs
- L - literals
- B - blank nodes
- A term variable ?x ranges over T
- A triple pattern is a triple with a term variable
- (?x, email, ?y)
- Graph pattern - set of triple patterns combined using AND, UNION, FILTER, OPTIONAL
- (?x, email, ?y) AND (?x, age, ?z) FILTER (?z > 15)
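The graph pattern above can be evaluated by a naive join, sketched here in Python under simple assumptions: data is a list of tuples, variables are strings starting with "?", and the sample data (`ann`, `bob` and their attributes) is invented for illustration. It covers only AND and FILTER, not UNION or OPTIONAL.

```python
def match_triple(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if isinstance(term, str) and term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new

def match_and(patterns, data):
    """Evaluate triple patterns joined by AND; returns a list of bindings."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for t in data
                    if (b2 := match_triple(pattern, t, b)) is not None]
    return bindings

# Hypothetical data set.
data = [
    ("ann", "email", "ann@example.org"), ("ann", "age", 22),
    ("bob", "email", "bob@example.org"), ("bob", "age", 12),
]
patterns = [("?x", "email", "?y"), ("?x", "age", "?z")]

# FILTER (?z > 15) applied to the joined bindings.
results = [b for b in match_and(patterns, data) if b["?z"] > 15]
```

Only the binding for `ann` survives the filter, since `bob` has age 12.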
22Generalized Graph Patterns
- A path variable ??p ranges over 2^T
- A generalized triple pattern (gtp) - triple pattern that permits a path variable in the property position
- (x, ??p, y)
- Generalized graph patterns - multiple gtps combined similarly, plus a PATHFILTER operator
23Generalized Graph Patterns
- Let VP be the set of path variables
- PATHFILTER based on built-in path functions
- containsAny: (VP, 2^I) → boolean
- containsAll: (VP, 2^I) → boolean
- isSimple: VP → boolean
- containsPattern: (VP, R(T)) → boolean
- cost: VP → ℝ
- (X, ??p, Y) PATHFILTER (containsAny(??p, P3K))
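To make the built-in path functions concrete, here is a rough Python sketch that treats a path as a list of triples. It is not the SPARQ2L implementation: the unit edge cost, the brute-force path enumeration and the sample triples (`binds`, `activates`, `inhibits`) are all assumptions made for illustration.

```python
def path_terms(path):
    """All terms (nodes and properties) that occur on a path of triples."""
    return {term for triple in path for term in triple}

def contains_any(path, iris):
    return bool(path_terms(path) & set(iris))

def contains_all(path, iris):
    return set(iris) <= path_terms(path)

def is_simple(path):
    nodes = [path[0][0]] + [o for _, _, o in path]
    return len(nodes) == len(set(nodes))

def cost(path):
    return float(len(path))  # assumption: every edge costs 1

def simple_paths(data, src, dst, prefix=()):
    """Yield every simple path from the start node to dst (bindings of ??p)."""
    if prefix and src == dst:
        yield list(prefix)
        return
    visited = {src} | {s for s, _, _ in prefix}
    for s, p, o in data:
        if s == src and o not in visited:
            yield from simple_paths(data, o, dst, prefix + ((s, p, o),))

# Hypothetical data: two routes from X to Y, one of them through P3K.
data = [("X", "binds", "P3K"), ("P3K", "activates", "Y"), ("X", "inhibits", "Y")]

# (X, ??p, Y) PATHFILTER (containsAny(??p, P3K))
answers = [p for p in simple_paths(data, "X", "Y") if contains_any(p, {"P3K"})]
```

This is the post-filtering style of evaluation: enumerate candidate paths first, then apply the filter.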
24SPARQ2L's containsPattern()
- containsPattern: (VP, R(T)) → Boolean
- Allows regular expressions over triples
- Extended Triple Patterns
- (s , , . , p, , o) matches a path where
- s is the subject of the first triple,
- the property/edge in the last triple is p,
- the object of the last triple is o,
- and the wildcard positions match arbitrary intermediate nodes/edges on the path.
- If T is the set of all triple patterns, R(T) is the set of regular expressions over T
25SPARQ2L's regular expressions
- Let ?o, ?c and ?b be term variables ranging over the sets of organizations, congressmen and bills respectively.
- Find a path from an organization to a bill such that the organization has sponsored something linked to the official and that official is linked to a positive vote for the bill.
- (?o, ., . , sponsors, . , .) · ( . , ., . , ., . , ?c) · ( ?c, ., . , votesFor, . , ?b)
26SPARQ2L's formal semantics
- Important!
- Provides a framework for precisely describing queries
- Helps identify valid optimizations
- Details for SPARQ2L not given here
27Complexity
- isSimple()
- NP-Hard
- ContainsAny()
- DISJOINTPATH NP-Hard
- ContainsALL()
- HAMPATH NP-Hard
- ContainsPattern()
- Generally NP-Hard
28Comparison to other languages
29Query Evaluation For Path Extraction Queries
30Path Extraction Query (PEQ) Evaluation
- Most approaches
- Graph traversals in main memory dbs
- Limited support of constraints, i.e. path filter conditions
- Our goal
- a good linear representation for general graphs that provides good performance for most classes of PEQs
31Representation wish list
- Queries should be answerable in a single scan
- Labeling scheme for efficient pruning
- Compact representation of partial path information
- Clustering of related path information
- minimizing external path length: bin packing problem
32Foundations of evaluation framework [2]
Given a directed graph G = (V, E):
A P-Expression or Path Summary (P, u, v) is a regular expression P over E such that every string s ∈ L(P) is of type (u, v), i.e. represents a path from u to v.
Example: Assume E = {(u, p1, w), (u, p2, w), (w, p3, v)}; then ((u, p1, w) ∪ (u, p2, w)) · (w, p3, v) is a p-expression of type (u, v).
The Path Sequence for a graph G is the sequence (P1, s1, d1), (P2, s2, d2), (P3, s3, d3), ..., (Pf, sf, df), ..., (Pg, sg, dg), ..., (Pl, sl, dl) such that any non-empty path p in G can be partitioned as p = p1, p2, ..., pk, with the pieces matched by p-expressions at increasing positions in the sequence (e.g. positions 2 < f < g < l).
[2] Tarjan, Fast Algorithms for Solving Path Problems, JACM 1981
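The p-expression example on this slide can be checked mechanically. The sketch below, in Python, encodes a star-free p-expression as nested tuples and expands its language; the tuple encoding (`"edge"`, `"union"`, `"concat"`) is an assumption for illustration, and Kleene star is left out so that the language stays finite.

```python
def language(expr):
    """Set of edge sequences denoted by a star-free p-expression."""
    kind = expr[0]
    if kind == "edge":
        return {(expr[1],)}
    left, right = language(expr[1]), language(expr[2])
    if kind == "union":
        return left | right
    if kind == "concat":
        return {a + b for a in left for b in right}
    raise ValueError(f"unknown operator: {kind}")

def is_path(seq, u, v):
    """True if seq is a connected edge sequence from u to v."""
    connected = all(a[2] == b[0] for a, b in zip(seq, seq[1:]))
    return connected and seq[0][0] == u and seq[-1][2] == v

# E = {(u, p1, w), (u, p2, w), (w, p3, v)} from the slide.
e1, e2, e3 = ("u", "p1", "w"), ("u", "p2", "w"), ("w", "p3", "v")

# ((u, p1, w) ∪ (u, p2, w)) · (w, p3, v)
P = ("concat", ("union", ("edge", e1), ("edge", e2)), ("edge", e3))
strings = language(P)
```

Every string in `strings` is a path from u to v, which is what "of type (u, v)" requires.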
33Path Sequence PS for G
[Figure: example graph G with nodes 1 to 6 and edges labeled a to h, shown with its path sequence PS]
- d.b: d ∈ L(P2) and b ∈ L(P7)
- c.a: c ∈ L(P1) and a ∈ L(P10)
- c.a.c: c ∈ L(P1) and a.c ∈ L(P3)
34Path Sequence PS for G
[Figure: the example graph G with nodes 1 to 6 and edges labeled a to h]
LU decomposition of a graph's matrix!
- u < v: paths from u to v with all intermediate vertices w < u.
- u ≥ v: paths from u to v with no intermediate vertex w > v.
- PS: p-expressions with u ≤ v in increasing order of u, followed by p-expressions with u > v in decreasing order of u
35Solving (2, 6)
[Figure: the example graph G and a step-by-step trace of one scan over its path sequence for the query (2, 6); a typical update step is (2, 6) := (2, 6) ∪ (2, 3) · (3, 6)]
O(path sequence length)!
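The single-scan idea can be sketched in Python for the simplest case. This assumes an acyclic graph, where listing the edges by source vertex in topological order is enough to form a path sequence; the four-edge graph below is hypothetical and is not the graph G from the slides. Cyclic graphs need the full p-expressions produced by Tarjan's elimination, which this sketch does not attempt.

```python
def solve(path_sequence, source):
    """One left-to-right scan; returns {v: set of paths from source to v}.

    Paths are tuples of edge labels. Each p-expression is given as a finite
    set of such tuples, so this only covers acyclic graphs.
    """
    reach = {source: {()}}  # the empty path reaches the source itself
    for pexpr, u, v in path_sequence:
        if u not in reach:
            continue  # nothing from the source has arrived at u yet
        extended = {p + q for p in reach[u] for q in pexpr}
        reach.setdefault(v, set()).update(extended)
    return reach

# Hypothetical DAG with edges 1-a->2, 1-b->3, 2-c->3, 3-d->4,
# listed by source vertex in topological order.
seq = [
    ({("a",)}, 1, 2),
    ({("b",)}, 1, 3),
    ({("c",)}, 2, 3),
    ({("d",)}, 3, 4),
]

paths_1_to_4 = solve(seq, 1)[4]
```

Each element of the sequence is visited exactly once, so the work is linear in the length of the path sequence, plus the size of the path sets being built.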
36Managing path sequences
- Cluster path information using heuristics
- prunability and prunable equivalence
- Using a B-tree, query answering becomes an extended range query.
Let Q = (s, d) be a query and PS be the path sequence for a data graph G.
- A p-expression pe is said to be prunable from PS if Q can be solved using PS - pe
- Two p-expressions pe1, pe2 are prunable equivalent with respect to Q if prunability of pe1 ⇔ prunability of pe2
37Labeling a path sequence
[Figure: example graph with numbered nodes and edges labeled advises, enrolled_in, author_of, editor_of, required_text, taught_by, course_in, has_subject_area, related_to_project, current_project, project_in]
- P-expressions for SCCs are prunable equivalent, e.g. 8, 9 and 10
- So are those for edges connecting them
- May be assigned the same key values
38[Figure: the example graph with its regions marked, including Dangling Trees and a Disconnected subgraph]
Interval of tree subgraph identifiers (sids) disjoint from that of non-tree sids
39Tree-induced prunability equivalence
Refine the partitioning of nodes and edges in non-tree subgraphs using an optimal spanning tree (OST).
OST: select edges that lie on the longest path from root to a node.
[Figure: example OST with nodes labeled by level and edges advises, teaches, enrolled_in, author_of, course_in, has_subject_area]
Only nodes and edges at level j < i can reach a node at level i!
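One way to read the OST rule is as a longest-path layering of an acyclic graph (for example, the graph obtained after collapsing strong components). The Python sketch below assumes that reading; the four-node graph is hypothetical, and levels start at 1 by assumption.

```python
from collections import deque

def longest_path_levels(dag):
    """Map each node of a DAG to 1 + length of the longest path reaching it."""
    indegree = {n: 0 for n in dag}
    for succs in dag.values():
        for m in succs:
            indegree[m] = indegree.get(m, 0) + 1
    level = {n: 1 for n in indegree}
    queue = deque(n for n, d in indegree.items() if d == 0)
    while queue:  # Kahn-style topological sweep
        n = queue.popleft()
        for m in dag.get(n, ()):
            level[m] = max(level[m], level[n] + 1)
            indegree[m] -= 1
            if indegree[m] == 0:
                queue.append(m)
    return level

# Hypothetical acyclic graph: node -> successors.
dag = {"A": ["B", "C"], "B": ["C"], "C": ["D"], "D": []}
level = longest_path_levels(dag)

# OST parent of each node: a predecessor that lies on a longest path to it.
ost = {m: n for n in dag for m in dag[n] if level[n] + 1 == level[m]}
```

With this layering every edge goes from a lower level to a strictly higher one, which is the property the slide relies on for pruning.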
40 2-Color Node Labels
- Label each scc with three identifiers: subgraph identifier s, OST-level identifier l and a preorder identifier t.
- Each label can be of one of the forms
- slt if in a non-tree region
- level order
- stl if in a dangling tree region
- depth first order
- The subgraph identifiers for trees and non-trees are non-overlapping
41 2-Color Code for a Graph
- 2-Color code is a sequence of key-value pairs
- for scc_i: (2CL(i), 2CL(i)) → PS_i
- for scc_x, scc_y connected by edges e1, e2, ..., ek: (2CL(x), 2CL(y)) → pe_e1, pe_e2, ..., pe_ek
- preserves path sequence property
42[Figure: the example graph with its regions marked (Dangling Trees, Disconnected), repeated from slide 38]
Interval of tree subgraph identifiers (sids) disjoint from that of non-tree sids
43[Example 2-color code: each numbered entry is a key (pair of labels) → its p-expressions]
- 1: (1,1,1), (1,1,1)
- 2: (1,1,1), (1,2,2) → (advises, 1, 2)
- ...
- 16: (2,1,6), (2,1,6) → (enrolled_in, 8, 10), (advises · enrolled_in, 9, 10), ((taught_by · advises · enrolled_in), 10, 10), (taught_by, 10, 9), (advises, 9, 8)
- 17: (2,1,6), (2,2,4) → (author_of, 8, 13), (required_text, 10, 13)
- 20: ...
- 31: (3,1,1), (3,1,1)
- 32: (3,2,1), (3,1,1) → (project_in, 17, 18)
- 34: (4,1,1), (2,9,1) → (current_project, 11, 19)
44 2-Color Code Properties
- Order Property for GN / GT
- u in GN, v in GT ⇒ label(u) precedes label(v).
- e = (u, v) ⇒ label(u) precedes label(e)
- Trees ⇒ label(e) immediately precedes label(v)
- NonReachability Property for labels (su, lu, tu) of u and (sv, lv, tv) of v
- su ? sv ⇒ result is empty.
- lu ? lv ⇒ result is empty.
- for query (u, v) with levels i, j, any node w with level k with k < i or k > j is prunable
45Cost of 2-Color code construction
- Find strong components of G - O(n + m)
- Find roots of dangling trees - O(n + m)
- Find optimal spanning tree - O(n + m)
- Find PS for each strong component i in increasing order of level in OST - O(Σ ni³)
- Strategy
- 2-color code vs. relational databases using joins
- strawman comparison
- 2-color code vs. randomly chosen topological orderings
- Datasets
- Queries
- 6 query classes
- NT-NT, NT-T, T-T
- Positive, Negative
- 40 × 6 queries
47[Figure: evaluation results for Positive Queries and Negative Queries]
48Constrained Path Extraction Queries
- Inline Processing
- Bounded length paths
- associate (P, u, v) with 2 mappings: cost, shortestpath (Tarjan81)
- Post-filtering approaches
- Our approach
- Compute (P, u, v)
- Check if given nodes/edges satisfy constraints
- Then filter
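For bounded-length paths, the cost and shortestpath mappings can be carried along during a scan instead of building full path sets. The Python sketch below assumes an acyclic graph whose weighted edges are listed by source vertex in topological order; the graph, the unit weights and the function name `shortest` are hypothetical.

```python
def shortest(path_sequence, source):
    """Scan once, keeping only (cost, path) of the cheapest path per node."""
    best = {source: (0, ())}
    for label, u, v, w in path_sequence:
        if u in best:
            cand = (best[u][0] + w, best[u][1] + (label,))
            if v not in best or cand[0] < best[v][0]:
                best[v] = cand
    return best

# Hypothetical DAG with unit weights: (edge label, source, target, weight).
seq = [("a", 1, 2, 1), ("b", 1, 3, 1), ("c", 2, 3, 1), ("d", 3, 4, 1)]
best = shortest(seq, 1)

# Bounded-length constraint handled inline: nodes fewer than 2 hops from 1.
within_bound = {v for v, (c, _) in best.items() if c < 2}
```

Because only the cheapest path per node is kept, a length bound can be checked from the stored cost without enumerating the other paths.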
49Bit encoding of p-expressions
[Figure: expression tree for the p-expression (a ∪ b) · (c ∪ d) of type (1, 3), with subexpressions of types (1, 2) and (2, 3), next to an encoding table]
Paths represented: (1, a, 2) (2, c, 3); (1, a, 2) (2, d, 3); (1, b, 2) (2, c, 3); (1, b, 2) (2, d, 3)
Example: containsAll(p, {a, b}): a and b will agree on some suffix beginning with a 1 in an even position but will disagree on a preceding odd bit position
50Ranking Query Results
51The Relationship Ranking Problem
query q = (1, 3) (a pair of nodes)
[Figure: example graph with nodes 1 to 5 and edges labeled a to g]
1. Find the subgraph that covers q
2. List the results in order of relevance (could be done with step 1 or as a separate step)
52SemRank Ranking PEQ results
- Modulative Ranking
- Product of functions fi, where fi is a function of a single user input m
- Metrics
- Refraction Count
- Measures how different a path is from the paths at the schema layer
- Calculated in terms of a Semantic Summary
- Semantic Information Gain - SIG
- How much information does a user gain when a result is given?
- -log(Prob(relation)) with respect to node types
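As a rough illustration of the information gain metric, the Python sketch below computes -log(Prob(relation)) from a frequency distribution keyed by node types. The class names, property names, counts and the choice of log base 2 are all assumptions; the slides do not give SemRank's exact formula.

```python
import math

# Hypothetical frequency distribution: (domain class, range class) -> counts.
freq = {
    ("Student", "Course"): {"enrolled_in": 90, "teaching_assistant_for": 10},
}

def information_gain(prop, domain, range_):
    """-log2 of the property's probability among properties valid for the types."""
    counts = freq[(domain, range_)]
    return -math.log2(counts[prop] / sum(counts.values()))

common = information_gain("enrolled_in", "Student", "Course")
rare = information_gain("teaching_assistant_for", "Student", "Course")
```

The rarer relation carries more information, so a discovery-oriented search would rank it higher than the common one.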
53SemRank Ranking PEQ results
- S-Match
- Best semantic match with user description (if provided)
- Distance between keyword and related term in conceptual hierarchy
54[Figure: adjustable search mode, ranging between two settings]
- High Information Gain, High Refraction Count, High S-Match
- Low Information Gain, Low Refraction Count, High S-Match
55Refraction
[Figure: schema layer with classes Student, Course, Professor and Spouse and properties enrolled_in, taught_by and married_to, above an instance path through nodes 1, 2, 3, 4]
- The path enrolled_in → taught_by → married_to doesn't exist anywhere at schema layer
- We say that the path refracts at node 3
- High refraction count in a path ⇒ low predictability
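A refraction count can be sketched in Python by checking each consecutive pair of properties on a path against the pairs expected at the schema layer. The set `summary_pairs` is a hypothetical reading of the figure (only enrolled_in followed by taught_by is expected), and nodes are numbered from 1 along the path as on the slide.

```python
# Consecutive property pairs that exist at the schema layer (assumed).
summary_pairs = {("enrolled_in", "taught_by")}

def refraction_nodes(path_properties, summary_pairs):
    """1-based positions of the nodes where the path refracts."""
    pairs = zip(path_properties, path_properties[1:])
    # The pair (i, i+1) of edges meets at node i + 2 when nodes start at 1.
    return [i + 2 for i, pair in enumerate(pairs) if pair not in summary_pairs]

def refraction_count(path_properties, summary_pairs):
    return len(refraction_nodes(path_properties, summary_pairs))

path = ["enrolled_in", "taught_by", "married_to"]
nodes = refraction_nodes(path, summary_pairs)
count = refraction_count(path, summary_pairs)
```

For this path the only unexpected pair is taught_by followed by married_to, which meets at node 3.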
56Semantic Summary
[Figure: semantic summary graph whose nodes are Representative Ontology Classes built from classes C1 to C5 and whose edges are labeled with properties p1 to p5]
57Properties of a Semantic Summary
- All expected consecutive edge pairs are recorded in the semantic summary
- If p1.p2 is not in semantic summary ⇒ refraction
- Given two representative ontology classes, we can tell all the valid properties/edges that can connect them
- Calculate probability of a property in terms of all known valid properties
- Information gain
58The Index Subsystem
- FDIX Frequency Distribution IndeX
- Stores the frequency distribution of properties
- ROIX Representative Ontology IndeX
- Maps classes to Representative Ontology Classes
- Stores the semantic summary graph
- PHIX Property Hierarchy IndeX
- Encodes the hierarchical relationships using a Dewey Decimal labeling scheme
- Used for computing S-Match (match between keywords and properties in a path)
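Dewey-style labels make hierarchy questions cheap to answer, which is presumably why PHIX uses them. The Python sketch below shows an ancestor test and a hop distance computed from labels alone; the property names, their labels and the distance definition are hypothetical and are not taken from SemRank.

```python
# Hypothetical property hierarchy; each label extends its parent's label.
dewey = {
    "associated_with": (1,),
    "works_for": (1, 1),
    "employed_by": (1, 1, 1),
    "funded_by": (1, 2),
}

def is_ancestor(a, b):
    """True if label a is a proper prefix of label b."""
    return len(a) < len(b) and b[:len(a)] == a

def distance(a, b):
    """Number of hierarchy edges between two labels via their common ancestor."""
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)

d_close = distance(dewey["works_for"], dewey["employed_by"])
d_far = distance(dewey["employed_by"], dewey["funded_by"])
```

A smaller distance between a keyword's property and a property on the path would correspond to a better S-Match under this reading.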
59Top-K Evaluation
Final Top_k: 1. g.i, 18  2. c.f, 9
60Evaluation
- Datasets
- SWETO + manually added information
- Domains
- Politics, Entertainment, Sports
- 58 Human Subjects
- 24 sample results in 3 groups
- Subjects were asked to pick top 3 for conventional and top 3 for discovery
- Conducted a statistical test of significance
62Three groups
- Arnold Schwarzenegger and Andre Agassi (S & A)
- examples
- S supports GrandSlamForKids organized by A.
- S starredIn LastActionHero with Bridget Wilson ex-wife of A
- Arnold Schwarzenegger and George W. Bush (S & B)
- Michael Jordan and Tiger Woods (J & W)
63Statistical test of significance
- Null hypothesis: rankings are random
- If true, each relationship will occur in the top 3, 3/8 of the time
- But if some occur more often, then reject the null hypothesis
- Experiment
- Simulate a distribution of rankings under the null hypothesis
- Test the significance of observed results at the 95% significance level under this distribution
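The slide describes a simulated null distribution. As an analytic stand-in, the Python sketch below uses the exact binomial distribution instead: if rankings are random, the number of subjects who place a given relationship in their top 3 follows Binomial(n, 3/8). Using n = 58 assumes every subject ranked every group, which the slides do not state.

```python
from math import comb

def tail_probability(k, n, p=3/8):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def critical_count(n, alpha=0.05, p=3/8):
    """Smallest k whose upper-tail probability is at most alpha."""
    return next(k for k in range(n + 2) if tail_probability(k, n, p) <= alpha)

n_subjects = 58  # assumption: all 58 subjects ranked the same group
k_star = critical_count(n_subjects)
```

A relationship chosen for the top 3 by at least `k_star` subjects would lead to rejecting the null hypothesis at the 5% level under these assumptions.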
67Publications
- Core Thesis publications
- Kemafor Anyanwu, Angela Maduko, Amit Sheth. SPARQ2L: Towards Support For Subgraph Extraction Queries in RDF Databases, The 16th International World Wide Web Conference (WWW2007), Banff, Canada, May 8-12, 2007. (acceptance rate 14%)
- Kemafor Anyanwu, Angela Maduko, Amit Sheth. SemRank: Ranking Complex Relationship Search Results on the Semantic Web, The 14th International World Wide Web Conference (WWW2005), Chiba, Japan, May 10-14, 2005. (acceptance rate 14%) (current # of citations: 21)
- Kemafor Anyanwu, Amit Sheth. ρ-Queries: Enabling Querying for Semantic Associations on the Semantic Web, The 12th International World Wide Web Conference (WWW2003), Budapest, Hungary, May 20-24, 2003, pp. 823-833. (acceptance rate 13%) (current # of citations: 49)
- Kemafor Anyanwu, Amit Sheth. The ρ Operator: Discovering and Ranking Associations on the Semantic Web, SIGMOD Record (Special issue on Amicalola Workshop on DB-IS Research for Semantic Web and Enterprises), 31 (4), pp. 42-47, 2002. (current # of citations: 20)
- Kemafor Anyanwu, Amit Sheth. Supporting Knowledge Discovery on the Semantic Web by Exploiting the Semantics of Complex Relationships, International Semantic Web Working Symposium 2001, position paper, Stanford University, California, USA, July 30 - August 1, 2001.
- Kemafor Anyanwu, Angela Maduko, Amit Sheth. From Link Analysis Ranking to Relationship Analysis Ranking - Adding Semantics to the Mix. (to be submitted for second round reviews for the Journal of Web Semantics)
- Angela Maduko, Kemafor Anyanwu, Amit Sheth. Estimating the Cardinality of RDF Graph Patterns. WWW2007 poster paper
68Comments About Impact
- 3 key papers have over 110 citations in aggregate
- SemRank paper acknowledged as one of the notable Semantic Web contributions at WWW2005
- Invited for journal special issue
- Recent paper on SPARQ2L is the subject of some blogs about SPARQL extensions
69Ongoing and Future Work
- Develop a complete system that includes a parser and support for all path filter conditions
- Integrating SemRank with link analysis approaches
70Thank you!!