Supporting Link Analysis Using Advanced Querying Methods in Semantic Web Databases

1
Supporting Link Analysis Using Advanced Querying
Methods in Semantic Web Databases
Kemafor Anyanwu
Advisory Committee: Amit Sheth (Major), Jay Aronson, Liming Cai, John Miller
2
Outline
  • Link Analysis
  • Link Analysis in Semantic Web Databases
  • Query Language Support
  • Query Evaluation Support
  • Result Ranking

3
Link Analysis
  • From Wikipedia
  • a subset of network analysis, exploring
    associations between objects
  • provides the crucial relationships and
    associations between very many objects of
    different types that are not apparent from
    isolated pieces of information
  • Currently has no database support!!!

4
"Everything's connected, all along the line.
Cause and effect. That's the beauty of it. Our
job is to trace the connections and reveal them."
  • Jack, in Terry Gilliam's 1985 film Brazil

5
An Example
  • Find relationships between passengers on flights
    to New York or Washington DC, who either
    purchased their tickets less than 24hrs before
    departure time or with cash, and are linked to
    flight training

6
Graph Data Models
  • Contemporary data models
  • Stanford's Object Exchange Model (OEM)
  • W3C's eXtensible Markup Language (XML)
  • W3C's Resource Description Framework (RDF)
  • Semantic Web
  • Many others, more domain-specific
  • Link analysis queries can be seen as querying
    about structure
  • Example: find a subgraph connecting certain
    passengers!
  • Steiner Tree Problem: NP-hard

7
Link Analysis on the Semantic Web
8
Semantic Web languages - RDF
  • In RDF
  • Each resource has an IRI
  • Triples (subject, property, object) are
    statements linking resources
  • (X:kemafor, Y:hasEmail, anyanwu_at_cs.uga.edu)
  • X and Y are aliases for namespaces, e.g.
    http://somelong.url/
  • A triple has an equivalent directed labeled graph
  • RDF Schema provides a metaschema for describing
    classes and relation types
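
As a minimal sketch of the triple-to-graph correspondence described on this slide, the snippet below stores triples as labeled, directed edges. The two namespace IRIs and the `expand` helper are assumptions made only for illustration; the slide says only that X and Y stand for some long namespace URL.

```python
from collections import defaultdict

# Hypothetical namespace IRIs standing in for the slide's aliases X and Y.
NAMESPACES = {"X": "http://somelong.url/x#", "Y": "http://somelong.url/y#"}

def expand(term):
    """Expand a prefixed name such as 'X:kemafor' into a full IRI."""
    prefix, sep, local = term.partition(":")
    if sep and prefix in NAMESPACES:
        return NAMESPACES[prefix] + local
    return term  # literals and already-expanded IRIs pass through unchanged

triples = [("X:kemafor", "Y:hasEmail", "anyanwu_at_cs.uga.edu")]

# Each triple becomes one directed edge: subject --property--> object.
graph = defaultdict(list)
for s, p, o in triples:
    graph[expand(s)].append((expand(p), expand(o)))
```

An adjacency list keyed by subject is only one possible layout; it is chosen here because path queries start from a node and follow outgoing edges.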

9
An Example RDF Database
[Figure: example RDF graph. Node labels: U1, C1, C3, S1A1, S2, P1, "P2, E1", Pu1, Pu2, and the course titles Semantic Web and Databases. Edge labels: offers, course_title, enrolled_in, author_of, advisor, editor_of, taught_by.]
10
Subgraph Matching contd.
  • Find universities that offer a course on the
    Semantic Web and the students enrolled in those
    courses

11
Find universities that offer a course on the
Semantic Web and the students enrolled in those
courses
[Figure: the example RDF graph, with the query variables X and Z marked on it.]
12
Find links between S1A1 and P1
[Figure: the example RDF graph, annotated with ??P (path variable), ??P1 and ??P2.]
13
(No Transcript)
14
Ranking
  • Finding the most important query results
  • What is most important may vary from context to
    context
  • For link analysis queries: most weighted paths
  • Need flexible ranking models that allow users
    to bump into new information

15
Ranking contd.
  • Has been addressed for the
  • Web, Semantic Web, Databases
  • Three categories
  • Fixed Models
  • Seeded Models
  • Variable Models
  • Typically, for databases, query results have a
    known set of attributes that can be varied
  • Query results here have an unknown structure!!

16
Problem Statement and Contributions
  • Problem Statement
  • Develop database infrastructure for Path
    Extraction Queries
  • Query language Support
  • Query Evaluation Support
  • Query Result Management Support
  • Contributions
  • Query Language Support (Anyanwu et al WWW2007,
    WWW2003, SIGREC2002, ISWWS2001)
  • Query Evaluation Infrastructure (Anyanwu et al
    WWW2007)
  • SemRank: Ranking query results (Anyanwu et al
    WWW2005, IJWS08)

17
Query Language Support
18
Language Design Goals
  • An integrated query language for expressing both
    subgraph matching and extraction
  • Build on existing languages and standards - SPARQL
  • Have an expressive language with simple
    constructs.

19
Language Requirements
  • Path Variables
  • Path Constraint Expressions

20
Constraints in path extraction queries
  • paths must contain a specific node type
  • from an MTB surface molecule, via a
    Phosphoinositide 3-Kinase enzyme, to a cellular
    response event.
  • path lengths are bounded
  • close connections (less than 4 hops) between
    SalesPersonA and CIO-Y.
  • paths must contain a specific pattern
  • paths from organizationO to billB that involve a
    sponsorship relationship to congressmanC
    associated with a positive vote on the bill

21
Algebraic syntax for SPARQ2L
  • Let T = RDF terms, i.e.
  • I - IRIs,
  • L - literals
  • B - blank nodes
  • A term variable ?x ranges over T
  • A triple pattern is a triple with a term variable
  • (?x, email, ?y)
  • Graph pattern - set of triple patterns combined
    using AND, UNION, FILTER, OPTIONAL
  • (?x, email, ?y) AND (?x, age, ?z) FILTER (?z >
    15)
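
To make the algebra above concrete, here is a small sketch of how a graph pattern could be evaluated over a list of triples: each triple pattern is matched against every triple, AND is a join on shared variables, and FILTER is a predicate over the resulting bindings. This is not the SPARQ2L implementation; the function names and the sample data are hypothetical, and UNION and OPTIONAL are left out.

```python
def is_var(term):
    """Term variables are written as strings starting with '?'."""
    return isinstance(term, str) and term.startswith("?")

def match_triple(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    result = dict(binding)
    for p, t in zip(pattern, triple):
        if is_var(p):
            if result.setdefault(p, t) != t:
                return None  # variable already bound to a different term
        elif p != t:
            return None      # constant does not match
    return result

def evaluate(patterns, triples, keep=lambda b: True):
    """AND together triple patterns, then apply a FILTER predicate."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for t in triples
                    if (b2 := match_triple(pattern, t, b)) is not None]
    return [b for b in bindings if keep(b)]

# Hypothetical data for the slide's pattern.
data = [
    ("alice", "email", "alice@example.org"), ("alice", "age", 30),
    ("bob", "email", "bob@example.org"), ("bob", "age", 12),
]
results = evaluate([("?x", "email", "?y"), ("?x", "age", "?z")],
                   data, keep=lambda b: b["?z"] > 15)
```

Only the binding for alice survives, because the FILTER drops the binding whose ?z is 12.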

22
Generalized Graph Patterns
  • A path variable ??p ranges over 2^T
  • A generalized triple pattern (gtp) - triple
    pattern that permits a path variable in the
    property position
  • (x, ??p, y)
  • Generalized graph patterns: multiple gtp
    patterns combined similarly, plus a PATHFILTER operator

23
Generalized Graph Patterns
  • Let VP be the set of path variables
  • PATHFILTER is based on built-in path functions
  • containsAny: (VP, 2^I) → boolean
  • containsAll: (VP, 2^I) → boolean
  • isSimple: VP → boolean
  • containsPattern: (VP, R(T)) → boolean
  • cost: VP → R
  • (X, ??p, Y) PATHFILTER (containsAny(??p, P3K))
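
The built-in path functions listed above can be sketched for a single, already-extracted path. The sketch assumes a path is a non-empty list of consecutive (subject, property, object) triples and that every edge has unit cost unless a weight function is supplied. The Python names and the sample path are hypothetical, and containsPattern is omitted.

```python
def nodes_and_edges(path):
    """All subjects, properties and objects that occur on the path."""
    return {term for triple in path for term in triple}

def contains_any(path, resources):
    """True if at least one of the given resources occurs on the path."""
    return bool(nodes_and_edges(path) & set(resources))

def contains_all(path, resources):
    """True if every given resource occurs on the path."""
    return set(resources) <= nodes_and_edges(path)

def is_simple(path):
    """True if no node is visited more than once (path must be non-empty)."""
    visited = [path[0][0]] + [obj for _, _, obj in path]
    return len(visited) == len(set(visited))

def cost(path, weight=lambda triple: 1.0):
    """Sum of edge weights; unit weights are an assumption of this sketch."""
    return sum(weight(triple) for triple in path)

# Hypothetical path through a P3K node.
path = [("MTB", "binds", "R1"), ("R1", "activates", "P3K"),
        ("P3K", "triggers", "Response")]
```

Checking a filter on one path is cheap; the NP-hardness discussed later in the talk comes from deciding whether any path satisfying the filter exists.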

24
SPARQ2L's containsPattern()
  • containsPattern: (VP, R(T)) → boolean
  • Allows regular expressions over triples
  • Extended Triple Patterns
  • (s , , . , p, , o) -- matches a
    path where
  • s is the subject of the first triple,
  • the property/edge in the last triple is p,
  • the object of last triple is o
    and ,
  • matches arbitrary intermediate nodes/edges on
    the path.
  • If T is the set of all triple patterns, R(T) is
    the set of regular expressions over T

25
SPARQ2L's regular expressions
  • Let ?o, ?c and ?b be term variables ranging over
    the sets of organizations, congressmen and bills
    respectively.
  • Find a path from an organization to a bill such
    that the organization has sponsored something
    linked to the official and that official is
    linked to a positive vote for the bill.
  • (?o, ., . , sponsors, . , .) ? ( . , .,
    . , ., . , ?c) ? ( ?c, ., . , votesFor,
    . , ?b)

26
SPARQ2L's formal semantics
  • Important !!!
  • Provides a framework for precisely describing
    queries
  • Helps identify valid optimizations
  • Details for SPARQ2L not given here

27
Complexity
  • isSimple()
  • NP-hard
  • containsAny()
  • DISJOINTPATH: NP-hard
  • containsAll()
  • HAMPATH: NP-hard
  • containsPattern()
  • Generally NP-hard

28
Comparison to other languages
29
Query Evaluation For Path Extraction Queries
30
Path Extraction Query (PEQ) Evaluation
  • Most approaches
  • Graph traversals in main memory dbs
  • Limited support of constraints i.e. path filter
    conditions
  • Our goal
  • a good linear representation for general graphs
    that provides good performance for most classes
    of PEQs

31
Representation wish list
  • Queries should be answerable in a single scan
  • Labeling scheme for efficient pruning
  • Compact representation of partial path
    information
  • Clustering of related path information
  • minimizing external path length: a bin packing
    problem

32
Foundations of evaluation framework [2]
Given a directed graph G = (V, E):
A P-Expression or Path Summary (P, u, v) is a
regular expression P over E such that every
s ∈ L(P) is of type (u, v), i.e. represents a
path from u to v.
Example: Assume E = {(u, p1, w), (u, p2, w),
(w, p3, v)}; then ((u, p1, w) ∪ (u, p2, w)) · (w,
p3, v) is a p-expression of type (u, v).
The Path Sequence for a graph G is a sequence (P1,
s1, d1), (P2, s2, d2), (P3, s3, d3), ..., (Pf, sf,
df), ..., (Pg, sg, dg), ..., (Pl, sl, dl) such that
any non-empty path p in G can be partitioned as
p = p1, p2, ..., pk, with the pieces matched by
p-expressions at increasing positions in the
sequence (e.g. 2 < f < g < l).
[2] Tarjan, Fast Algorithms for Solving Path
Problems, JACM 1981
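
The p-expression example on this slide, read as "p1 or p2, followed by p3", can be checked with a short sketch that enumerates the language of a star-free expression and verifies that every string in it is a path of type (u, v). The tuple encoding and helper names are assumptions of this sketch; Kleene star, which real p-expressions need for cycles, is left out so that the language stays finite.

```python
def edge(u, label, v):
    return ("edge", (u, label, v))

def union(*parts):
    return ("union", parts)

def concat(*parts):
    return ("concat", parts)

def language(expr):
    """Set of edge sequences (tuples of triples) denoted by a star-free expression."""
    kind, body = expr
    if kind == "edge":
        return {(body,)}
    if kind == "union":
        return set().union(*(language(part) for part in body))
    result = {()}  # concatenation: start from the empty sequence
    for part in body:
        result = {left + right for left in result for right in language(part)}
    return result

def is_path(seq, u, v):
    """True if consecutive edges chain together and run from u to v."""
    chained = all(a[2] == b[0] for a, b in zip(seq, seq[1:]))
    return chained and seq[0][0] == u and seq[-1][2] == v

expr = concat(union(edge("u", "p1", "w"), edge("u", "p2", "w")),
              edge("w", "p3", "v"))
paths = language(expr)
```

The expression stands for exactly two paths, both of type (u, v), which is what makes it a compact summary of them.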
33
[Figure: example graph G with nodes 1-6 and edges labeled a-h]
Path Sequence PS for G
d.b: d ∈ L(P2) and b ∈ L(P7)
c.a: c ∈ L(P1) and a ∈ L(P10)
c.a.c: c ∈ L(P1) and a.c ∈ L(P3)
34
[Figure: example graph G with nodes 1-6 and edges labeled a-h]
LU decomposition of a graph's matrix!!!
Path Sequence PS for G
u < v: paths from u to v with all intermediate
vertices w < u.
u ? v: paths from u to v with no intermediate
vertex w > v.
PS: p-expressions with u ≤ v in increasing order
of u, followed by p-expressions with u > v in
decreasing order of u
35
[Figure: example graph G with nodes 1-6 and edges labeled a-h]
Solving (2, 6)
(2, 3) (2, 3) ? (2, 2) ? (2, 3) ? ? ((a ? c)
? a ? d) ..
(2, 2) ?
(2, 2) ? (1, 2) (2, 2) ?
(2, 6) (2, 6) ? (2, 3) ? (3, 6)
? ? (a ? c) ? a ?
d ? h .
(2, 2) ? (1, 3) (2, 3) ?
(2, 2) ? (2, 2) (2, 2) ? ? (a ? c)
O(path sequence length)!!
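
The single-scan idea can be sketched as follows, in the spirit of the solve step in Tarjan's path-problem framework: walk the path sequence once and, for each entry (P, u, v), extend what is already known to reach u. To keep the sketch runnable, each P is given as a finite set of label tuples instead of a regular expression, and the input is an acyclic graph whose edges are listed in topological order of their source (for an entry with u = v, P would need to contain the empty tuple). The function name and the data are hypothetical.

```python
def solve(path_sequence, source):
    """One pass over the path sequence; returns, per node, the label
    sequences of all paths from `source` found so far."""
    paths = {source: {()}}  # the empty path reaches the source itself
    for labels, u, v in path_sequence:
        if u not in paths:
            continue        # nothing from the source reaches u yet
        extended = {p + q for p in paths[u] for q in labels}
        if u == v:
            paths[u] = extended
        else:
            paths.setdefault(v, set()).update(extended)
    return paths

# Hypothetical acyclic graph: edges a, b from 1 to 2 and c, d from 2 to 3.
sequence = [
    ({("a",), ("b",)}, 1, 2),
    ({("c",), ("d",)}, 2, 3),
]
answers = solve(sequence, 1)
```

Each entry is visited exactly once, so the number of steps is linear in the length of the path sequence.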
36
Managing path sequences
  • Cluster path information using heuristics
  • prunability and prunable equivalence
  • Using Btree, query answering becomes an extended
    range query.

Let Q = (s, d) be a query and PS be the path
sequence for a data graph G.
  • A p-expression pe is said to be prunable from PS
    if Q can be solved using PS − pe
  • Two p-expressions pe1, pe2 are prunable
    equivalent with respect to Q if
    prunability of pe1 ⇔ prunability of pe2

37
Labeling a path sequence
[Figure: example graph with numbered nodes (1-20) and edges labeled course_in, advises, enrolled_in, has_subject_area, author_of, editor_of, required_text, related_to_project, taught_by, current_project and project_in]
P-expressions for SCCs are prunable equivalent,
e.g. 8, 9 and 10.
So are those for the edges connecting them.
They may be assigned the same key values.
38
[Figure: the example graph divided into numbered subgraphs 1-4, with regions marked "Dangling Trees" and "Disconnected"]
The interval of tree subgraph identifiers (sids) is
disjoint from that of non-tree sids.
39
Tree-induced prunability equivalence
Refine the partitioning of nodes and edges
in non-tree subgraphs using an optimal spanning
tree (OST). OST: select edges that lie on the
longest path from the root to a node.
[Figure: OST levels for a subgraph with edges advises, teaches, enrolled_in, author_of, course_in and has_subject_area]
Only nodes and edges at level j < i can reach a
node at level i!
40
2-Color Node Labels
  • Label each SCC with three identifiers - subgraph
    identifier s, OST-level identifier l and a
    preorder identifier t.
  • Each label can be of one of the forms
  • (s, l, t) if in a non-tree region
  • level order
  • (s, t, l) if in a dangling tree region
  • depth first order
  • The subgraph identifiers for trees and non-trees
    are non-overlapping

41
2-Color Code for a Graph
  • 2-Color code is a sequence of key-value pairs
  • for scc_i: (2CL(i), 2CL(i)) → PS_i
  • for scc_x, scc_y connected by edges e1, e2, ..., ek:
    (2CL(x), 2CL(y)) → pe_e1, pe_e2, ..., pe_ek
  • preserves path sequence property

42
[Figure: the example graph divided into numbered subgraphs 1-4, with regions marked "Dangling Trees" and "Disconnected"]
The interval of tree subgraph identifiers (sids) is
disjoint from that of non-tree sids.
43
  • 1: (1,1,1), (1,1,1) → { }
  • 2: (1,1,1), (1,2,2) → { (advises, 1, 2) }
  • ...
  • 16: (2,1,6), (2,1,6) → { (enrolled_in, 8, 10),
    (advises · enrolled_in, 9, 10),
    ((taught_by · advises · enrolled_in), 10, 10),
    (taught_by, 10, 9), (advises, 9, 8) }
  • 17: (2,1,6), (2,2,4) → { (author_of, 8, 13),
    (required_text, 10, 13) }
  • 20: ...
  • 31: (3,1,1), (3,1,1) → { }
  • 32: (3,2,1), (3,1,1) → { (project_in, 17, 18) }
  • 34: (4,1,1), (2,9,1) → { (current_project, 11,
    19) }

44
2-Color Code Properties
  • Order Property for GN / GT
  • u in GN, v in GT ⇒ label(u) precedes label(v).
  • e = (u, v) ⇒ label(u) precedes label(e)
  • Trees ⇒ label(e) immediately precedes label(v)
  • NonReachability Property: (su, lu, tu) for u, (sv, lv,
    tv) for v
  • su ? sv ? result is empty.
  • lu ? lv ? result is empty.
  • for query (u, v) with levels i, j, any node w
    with level k with k < i or k > j is prunable

45
Cost of 2-Color code construction
  • Find strong components of G - O(n + m)
  • Find roots of dangling trees - O(n + m)
  • Find optimal spanning tree - O(n + m)
  • Find PS for each strong component i in increasing
    order of level in OST - O(Σ ni^3)
46
Evaluation
  • Strategy
  • 2-color code vs. relational databases using joins
  • strawman comparison
  • 2-color code vs. randomly chosen topological
    orderings
  • Datasets
  • Queries
  • 6 query classes
  • NT-NT, NT-T, T-T
  • Positive, Negative
  • 40 × 6 queries

47



[Figure: evaluation results, with one panel for Positive Queries and one for Negative Queries]


48
Constrained Path Extraction Queries
  • Inline Processing
  • Bounded length paths
  • associate (P, u, v) with 2 mappings: cost and
    shortestpath (Tarjan81)
  • Post-filtering approaches
  • Our approach
  • Compute (P, u, v)
  • Check if given nodes/edges satisfy constraints
  • Then filter

49
Bit encoding of p-expressions
[Figure: expression tree for the p-expression (a ∪ b) · (c ∪ d) of type (1, 3), with sub-expressions of type (1, 2) and (2, 3), and its encoding table]
Paths represented: (1, a, 2) (2, c, 3); (1, a, 2) (2, d, 3);
(1, b, 2) (2, c, 3); (1, b, 2) (2, d, 3)
Example: containsAll(p, {a, b})
a and b will agree on some suffix beginning with
a 1 in an even position but . will disagree on
a preceding odd bit position
50
Ranking Query Results
51
The Relationship Ranking Problem
query q = (1, 3) (a pair of nodes)
[Figure: example graph with nodes 1-5 and edges a-g]
1. Find the subgraph that covers q
2. List the results in order of relevance (could be
   done with step 1 or as a separate step)
52
SemRank: Ranking PEQ results
  • Modulative Ranking
  • Product of functions fi, where fi is a function
    of a single user input m
  • Metrics
  • Refraction Count
  • Measures how different a path is from the paths
    at the schema layer?
  • Calculated in terms of a Semantic Summary
  • Semantic Information Gain - SIG
  • How much information does a user gain when result
    is given?
  • -log(Prob(relation)) with respect to node types
53
SemRank: Ranking PEQ results
  • S-Match
  • Best semantic match with user description (if
    provided)
  • Distance between keyword and related term in
    conceptual hierarchy

54
Adjustable search mode:
  • High Information Gain, High Refraction Count,
    High S-Match
  • Low Information Gain, Low Refraction Count,
    High S-Match
55
Refraction
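[Figure: schema layer with classes Student, Course, Professor and Spouse and properties enrolled_in, taught_by and married_to, shown with an instance path through nodes 1, 2, 3 and 4 labeled enrolled_in, taught_by, married_to]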
  • The path enrolled_in → taught_by → married_to
    doesn't exist anywhere at the schema layer
  • We say that the path refracts at node 3
  • High refraction count in a path ⇒ low
    predictability
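
A refraction count can be sketched as the number of consecutive property pairs on a path that are not expected at the schema layer. The set of expected pairs below is hypothetical: it assumes that the schema in this example only connects enrolled_in to taught_by.

```python
def refraction_count(path_properties, schema_pairs):
    """Count consecutive property pairs that the schema does not contain."""
    consecutive = zip(path_properties, path_properties[1:])
    return sum(1 for pair in consecutive if pair not in schema_pairs)

# Assumed schema-level pairs for this example.
schema_pairs = {("enrolled_in", "taught_by")}
path = ["enrolled_in", "taught_by", "married_to"]
count = refraction_count(path, schema_pairs)
```

Here the only unexpected pair is (taught_by, married_to), which meets at node 3, so the count is 1.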

56
Semantic Summary
[Figure: a schema with classes C1-C5 and properties p1-p5, and its semantic summary, in which a Representative Ontology Class combines C1 and C3 and edges carry property sets such as p1, p2]
57
Properties of a Semantic Summary
  • All expected consecutive edge pairs are recorded
    in the semantic summary
  • If p1.p2 is not in the semantic summary ⇒ refraction
  • Given two representative ontology classes, we can
    tell all the valid properties/edges that can
    connect them
  • Calculate probability of a property in terms of
    all known valid properties
  • Information gain

58
The Index Subsystem
  • FDIX: Frequency Distribution IndeX
  • Stores the frequency distribution of properties
  • ROIX: Representative Ontology IndeX
  • Maps classes to Representative Ontology Classes
  • Stores the semantic summary graph
  • PHIX: Property Hierarchy IndeX
  • Encodes the hierarchical relationships using a
    Dewey Decimal labeling scheme
  • Used for computing S-Match (match between
    keywords and properties in a path)
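
A Dewey-style label encodes the route from the root of a hierarchy to a node, so the distance between two properties can be read off the labels alone. The sketch below is one plausible way to compute such a distance; the labels, the property names and the exact distance measure are assumptions, since the slides do not give PHIX's formula.

```python
def dewey_distance(label_a, label_b):
    """Edges between two nodes of a hierarchy, computed from Dewey labels."""
    a = label_a.split(".")
    b = label_b.split(".")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    # Climb from each node up to the deepest common ancestor.
    return (len(a) - common) + (len(b) - common)

# Hypothetical property hierarchy.
labels = {
    "related_to": "1",
    "works_with": "1.1",
    "advises": "1.1.1",
    "co_author_of": "1.1.2",
    "married_to": "1.2",
}
```

For example, advises and co_author_of are siblings at distance 2, while advises and married_to only meet at the root, at distance 3.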

59
Top-K Evaluation
Final Top_k: 1. g.i, 18; 2. c.f, 9
60
Evaluation
  • Datasets
  • SWETO plus manually added information
  • Domains
  • Politics, Entertainment, Sports
  • 58 Human Subjects
  • 24 sample results in 3 groups
  • Subjects were asked to pick the top 3 for
    conventional and the top 3 for discovery
  • Conducted a statistical test of significance

61
(No Transcript)
62
Three groups
  • Arnold Schwarzenegger and Andre Agassi
  • (S A)
  • examples
  • S supports GrandSlamForKids organized by A.
  • S starredIn LastActionHero with Bridget Wilson
    ex-wife of A
  • Arnold Schwarzenegger and George W. Bush (S B)
  • Michael Jordan and Tiger Woods (J W)

63
Statistical test of significance
  • Null hypothesis: rankings are random
  • If true, each relationship will occur in the top
    3, 3/8 of the time
  • But if some occur more times, then reject the
    null hypothesis
  • Experiment
  • Simulate a distribution of rankings under the
    null hypothesis
  • Test the significance of observed results at the 95%
    significance level under this distribution
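
The null hypothesis above can also be checked with an exact binomial tail instead of a simulation. The sketch below makes that substitution explicitly. It assumes that each of n subjects independently places a given relationship in the top 3 with probability 3/8; taking n = 58 from the human-subjects slide is an assumption about how the experiment was set up.

```python
from math import comb

def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def critical_count(n, p, alpha=0.05):
    """Smallest k with P(X >= k) < alpha under the null hypothesis."""
    for k in range(n + 2):
        if binomial_tail(n, k, p) < alpha:
            return k

# Assumed setup: 58 subjects, chance level 3/8 for landing in the top 3.
threshold = critical_count(58, 3 / 8)
```

A relationship chosen for the top 3 by at least `threshold` subjects would be unlikely under random ranking at the 5% level.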

64
(No Transcript)
65
(No Transcript)
66
(No Transcript)
67
Publications
  • Core Thesis publications
  • Kemafor Anyanwu, Angela Maduko, Amit Sheth.
    SPARQ2L: Towards Support For Subgraph Extraction
    Queries in RDF Databases, The 16th International
    World Wide Web Conference, (WWW2007), Banff,
    Canada, May 8-12, 2007. (acceptance rate 14%)
  • Kemafor Anyanwu, Angela Maduko, Amit Sheth.
    SemRank: Ranking Complex Relationship Search
    Results on the Semantic Web, The 14th
    International World Wide Web Conference,
    (WWW2005), Chiba, Japan, May 10-14, 2005.
    (acceptance rate 14%) (current # of citations:
    21)
  • Kemafor Anyanwu, Amit Sheth. ρ-Queries: Enabling
    Querying for Semantic Associations on the
    Semantic Web, The 12th International World Wide
    Web Conference, (WWW2003), Budapest, Hungary, May
    20-24, 2003. pp. 823-833. (acceptance rate
    13%) (current # of citations: 49)
  • Kemafor Anyanwu, Amit Sheth. The ρ Operator:
    Discovering and Ranking Associations on the
    Semantic Web, SIGMOD Record (Special issue on
    Amicalola Workshop on DB-IS Research for Semantic
    Web and Enterprises), 31 (4), pp. 42-47. 2002.
    (current # of citations: 20)
  • Kemafor Anyanwu, Amit Sheth. Supporting
    Knowledge Discovery on the Semantic Web by
    Exploiting the Semantics of Complex Relationships,
    International Semantic Web Working Symposium
    2001, position paper, Stanford University,
    California, USA, July 30 - August 1, 2001.
  • Kemafor Anyanwu, Angela Maduko, Amit Sheth. From
    Link Analysis Ranking to Relationship Analysis
    Ranking - Adding Semantics to the Mix. (to be
    submitted for second round reviews for the
    Journal of Web Semantics)
  • Angela Maduko, Kemafor Anyanwu, Amit Sheth,
    Estimating the Cardinality of RDF Graph Patterns.
    WWW2007 poster paper

68
Comments About Impact
  • 3 key papers have over 110 citations in aggregate
  • SemRank paper acknowledged as a notable
    contribution to the Semantic Web conference at
    WWW2005
  • Invited for journal special issue
  • Recent paper on SPARQ2L subject of some blogs
    about SPARQL extensions

69
Ongoing and Future Work
  • Develop a complete system that includes Parser
    and support for all path filter conditions
  • Integrating SemRank with link analysis approaches

70
Thank you!!