Title: gStore: Answering SPARQL Queries Via Subgraph Matching
1gStore: Answering SPARQL Queries Via Subgraph Matching
- Lei Zou (1), Jinghui Mo (1), Lei Chen (2), M. Tamer Özsu (3), Dongyan Zhao (1)
(1) Peking University, (2) Hong Kong University of Science and Technology, (3) University of Waterloo
2Outline
- Background & Related Work
- Overview of gStore
- Encoding Technique
- VS-tree & Query Algorithm
- Experiments
- Conclusions
4Semantic Web
Semantic Web technologies are a collection of standard technologies for realizing a Web of Data.
5RDF Data Model
(Figure labels: URI, Literals, URI)
6RDF Graph
Literal Vertex
Entity Vertex
7SPARQL Queries
SPARQL Query:
Select ?name Where { ?m <hasName> ?name. ?m <BornOnDate> "1809-02-12". ?m <DiedOnDate> "1865-04-15". }
Query Graph
8Subgraph Match vs. SPARQL Queries
9Naïve Triple Store
SPARQL Query:
Select ?name Where { ?m <hasName> ?name. ?m <BornOnDate> "1809-02-12". ?m <DiedOnDate> "1865-04-15". }
Too many self-joins
SQL:
Select T3.Object
From T as T1, T as T2, T as T3
Where T1.Predicate = 'BornOnDate' and T1.Object = '1809-02-12'
and T2.Predicate = 'DiedOnDate' and T2.Object = '1865-04-15'
and T3.Predicate = 'hasName'
and T1.Subject = T2.Subject and T2.Subject = T3.Subject
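The self-join pattern can be tried out with a small sketch. It assumes a single triple table T with columns Subject, Predicate and Object, uses Python's built-in sqlite3 module, and loads a few illustrative triples; the subject IDs m1 and m2 are hypothetical.

```python
import sqlite3

# One three-column table holds every triple (the naive triple store).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (Subject TEXT, Predicate TEXT, Object TEXT)")
conn.executemany(
    "INSERT INTO T VALUES (?, ?, ?)",
    [
        ("m1", "hasName", "Abraham Lincoln"),
        ("m1", "BornOnDate", "1809-02-12"),
        ("m1", "DiedOnDate", "1865-04-15"),
        ("m2", "hasName", "Charles Darwin"),
        ("m2", "BornOnDate", "1809-02-12"),
        ("m2", "DiedOnDate", "1882-04-19"),
    ],
)

# Each triple pattern of the SPARQL query becomes one alias of T,
# so three patterns need two self-joins on Subject.
self_join_sql = """
    SELECT T3.Object
    FROM T AS T1, T AS T2, T AS T3
    WHERE T1.Predicate = 'BornOnDate' AND T1.Object = '1809-02-12'
      AND T2.Predicate = 'DiedOnDate' AND T2.Object = '1865-04-15'
      AND T3.Predicate = 'hasName'
      AND T1.Subject = T2.Subject AND T2.Subject = T3.Subject
"""
triple_store_names = [row[0] for row in conn.execute(self_join_sql)]
print(triple_store_names)
```

Every additional triple pattern in the query adds another alias of T, which is where the self-joins pile up.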
10Existing Solutions
- Three categories of solutions have been proposed to speed up query processing:
- 1. Property Table
- Jena [K. Wilkinson et al., SWDB 03]
- 2. Vertically Partitioned Solution
- SW-store [D. J. Abadi et al., VLDB 07]
- 3. Exhaustive-Indexing
- RDF-3x [T. Neumann et al., VLDB 08], Hexastore [C. Weiss et al., VLDB 08]
11Existing Solutions-Property Table
SPARQL Query:
Select ?name Where { ?m <hasName> ?name. ?m <BornOnDate> "1809-02-12". ?m <DiedOnDate> "1865-04-15". }
Reduces the number of join steps
SQL:
Select People.hasName from People
where People.BornOnDate = '1809-02-12' and People.DiedOnDate = '1865-04-15'
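For comparison, here is a minimal sketch of the same lookup over a property table. The table layout (one row per person, one column per property) follows the query on this slide; the Subject column and the sample rows are assumptions for illustration.

```python
import sqlite3

# Property table: properties that co-occur on the same subject share a row.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE People ("
    "Subject TEXT PRIMARY KEY, hasName TEXT, BornOnDate TEXT, DiedOnDate TEXT)"
)
conn.executemany(
    "INSERT INTO People VALUES (?, ?, ?, ?)",
    [
        ("m1", "Abraham Lincoln", "1809-02-12", "1865-04-15"),
        ("m2", "Charles Darwin", "1809-02-12", "1882-04-19"),
    ],
)

# All three triple patterns are answered by a single-table selection.
property_sql = (
    "SELECT People.hasName FROM People "
    "WHERE People.BornOnDate = ? AND People.DiedOnDate = ?"
)
property_table_names = [
    row[0] for row in conn.execute(property_sql, ("1809-02-12", "1865-04-15"))
]
print(property_table_names)
```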
12Existing Solutions-Vertically Partitioned
Solution
Fast Merge Join
13Existing Solutions- Exhaustive-Indexing
Range query Merge Join
- Each SPARQL query statement can be translated into one range query.
- SPARQL Query:
- Select ?name Where { ?m <hasName> ?name. ?m <BornOnDate> "1809-02-12". ?m <DiedOnDate> "1865-04-15". }
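A rough sketch of the idea: keep the triples sorted in several orders, answer each triple pattern with a prefix range scan, and combine the sorted results with a merge join. Sorted Python lists stand in for the on-disk indexes of a real system, and the helper names range_scan and merge_join, the subject IDs and the sample triples are all hypothetical.

```python
from bisect import bisect_left

triples = [
    ("m1", "hasName", "Abraham Lincoln"),
    ("m1", "BornOnDate", "1809-02-12"),
    ("m1", "DiedOnDate", "1865-04-15"),
    ("m2", "hasName", "Charles Darwin"),
    ("m2", "BornOnDate", "1809-02-12"),
    ("m2", "DiedOnDate", "1882-04-19"),
]

# Two of the possible orderings: (predicate, object, subject) and
# (subject, predicate, object).
pos_index = sorted((p, o, s) for s, p, o in triples)
spo_index = sorted(triples)

def range_scan(index, prefix):
    """Return every entry of a sorted index that starts with `prefix`."""
    results = []
    position = bisect_left(index, prefix)  # first entry >= prefix
    while position < len(index) and index[position][:len(prefix)] == prefix:
        results.append(index[position])
        position += 1
    return results

def merge_join(left, right):
    """Intersect two sorted lists in one linear pass."""
    i = j = 0
    joined = []
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            joined.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return joined

# One range scan per triple pattern; subjects come back already sorted.
born = [s for _, _, s in range_scan(pos_index, ("BornOnDate", "1809-02-12"))]
died = [s for _, _, s in range_scan(pos_index, ("DiedOnDate", "1865-04-15"))]
subjects = merge_join(born, died)
indexed_names = [
    o for s in subjects for _, _, o in range_scan(spo_index, (s, "hasName"))
]
print(indexed_names)
```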
14Some Limitations
- Difficult to handle wildcard queries.
- Difficult to handle updates.
16Intuition of gStore
Finding Matches over a Large Graph is not a
trivial task.
17Preliminaries
Literal Vertex
Entity Vertex
18Preliminaries
19Preliminaries
20Preliminaries
21Preliminaries
22Storage Schema in gStore
Encode all neighbors of a vertex into a bit-string, called its signature.
23Encoding Technique (1)
- |eSig(e).e| = M
- We employ m different string hash functions Hi (i = 1, ..., m).
- For each hash function Hi, we set the (Hi(eLabel) MOD M)-th bit in eSig(e).e to 1.
- Encoding eSig(e).n is the same:
- |eSig(e).n| = N
- n different hash functions
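The bit-setting rule above can be sketched in a few lines. This is only an illustration under assumptions: the slides do not name the hash functions, so a salted SHA-256 stands in for H1, ..., Hm, and the values M = 12, N = 16 and two hash functions per part are picked for the example. The function names are hypothetical.

```python
import hashlib

def string_hash(text, i):
    """Stand-in for the i-th string hash function (assumption: salted SHA-256)."""
    digest = hashlib.sha256(f"{i}:{text}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def encode_label(label, length, num_hashes):
    """Set the (H_i(label) MOD length)-th bit for each of the hash functions."""
    bits = 0
    for i in range(1, num_hashes + 1):
        bits |= 1 << (string_hash(label, i) % length)
    return bits

def edge_signature(edge_label, neighbor_label, M=12, N=16, m=2, n=2):
    """Return (eSig(e).e, eSig(e).n) as integers of M and N bits."""
    return (
        encode_label(edge_label, M, m),
        encode_label(neighbor_label, N, n),
    )

e_part, n_part = edge_signature("hasName", "Abraham Lincoln")
print(format(e_part, "012b"), format(n_part, "016b"))
```

Two hash functions may land on the same position, so each part has between one and m (or n) bits set.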
24Encoding Technique (2)
3-grams of the literal "Abraham Lincoln": Abr, bra, rah, aha, ...
Hashed bit strings, combined with OR:
0000 0010 0000 0000
1000 0000 0000 0000
0000 0000 0100 0000
0000 0000 0000 0001
OR: 1000 0010 0100 0001

Edge (label, neighbor)         eSig(e).e        eSig(e).n
(hasName, Abraham Lincoln)     0010 0000 0000   1000 0010 0100 0001
(BornOnDate, 1809-02-12)       0100 0000 0000   0100 0010 0100 1000
(DiedOnDate, 1865-04-15)       0000 1000 0000   0000 0010 0100 0000
(DiedIn, y:Washington_D.C)     0000 0010 0000   1000 0010 0100 0001
OR (vertex signature)          0110 1010 0000   1100 0010 0100 1001
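The OR steps on this slide can be checked mechanically. The sketch below assumes one particular reading of the figure: four hashed 3-grams are OR-ed into the neighbor part for "Abraham Lincoln", and the four edge signatures are OR-ed into the vertex signature. The pairing of bit strings with edges is that reading, not something the transcript states outright, and the helper names are hypothetical.

```python
def to_int(bit_string):
    """Parse a spaced bit string such as '0010 0000 0000'."""
    return int(bit_string.replace(" ", ""), 2)

def to_bits(value, width):
    """Format an integer back into groups of four bits."""
    raw = format(value, f"0{width}b")
    return " ".join(raw[i:i + 4] for i in range(0, width, 4))

# Hashed 3-grams of the literal, as read from the slide.
gram_hashes = [
    "0000 0010 0000 0000",
    "1000 0000 0000 0000",
    "0000 0000 0100 0000",
    "0000 0000 0000 0001",
]
neighbor_part = 0
for gram in gram_hashes:
    neighbor_part |= to_int(gram)

# (eSig(e).e, eSig(e).n) for each adjacent edge, as read from the slide.
edge_signatures = {
    "hasName": ("0010 0000 0000", "1000 0010 0100 0001"),
    "BornOnDate": ("0100 0000 0000", "0100 0010 0100 1000"),
    "DiedOnDate": ("0000 1000 0000", "0000 0010 0100 0000"),
    "DiedIn": ("0000 0010 0000", "1000 0010 0100 0001"),
}
vertex_e = 0
vertex_n = 0
for e_bits, n_bits in edge_signatures.values():
    vertex_e |= to_int(e_bits)
    vertex_n |= to_int(n_bits)

print(to_bits(neighbor_part, 16))
print(to_bits(vertex_e, 12), to_bits(vertex_n, 16))
```

Under this reading, both results agree with bit strings that appear on the slide.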
25Encoding Technique (3)
26Encoding Technique (4)
27Encoding Technique (5)
29A Straightforward Solution (1)
(Figure: query vertices u1 and u2 with candidate lists L1 and L2; the lists contain vertices 001, 004, 006 and 002, 003, 006.)
30A Straightforward Solution (2)
Large join space!
(Figure: candidate lists L1 and L2, containing vertices 001, 004, 006 and 002, 003, 006.)
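A minimal sketch of the straightforward approach, assuming the usual signature test (a data vertex is a candidate for a query vertex when it has every bit of the query signature set) followed by a join of the candidate lists over the data edges. The 4-bit signatures and the edges are made up for illustration; only the vertex IDs mirror the figure.

```python
def is_candidate(query_sig, data_sig):
    """A data vertex qualifies if it covers every bit set in the query signature."""
    return (query_sig & data_sig) == query_sig

# Hypothetical data graph: vertex signatures and directed edges.
data_sigs = {
    "001": 0b1010, "002": 0b0011, "003": 0b0101,
    "004": 0b1100, "005": 0b0110, "006": 0b1001,
}
data_edges = {("001", "002"), ("004", "005"), ("006", "003")}

# Hypothetical query: a single edge u1 -> u2.
query_sigs = {"u1": 0b1000, "u2": 0b0001}

# Step 1: scan all data vertices to build one candidate list per query vertex.
candidates = {
    u: [v for v, sig in sorted(data_sigs.items()) if is_candidate(q_sig, sig)]
    for u, q_sig in query_sigs.items()
}

# Step 2: join the lists, keeping pairs that are connected in the data graph.
join_space = len(candidates["u1"]) * len(candidates["u2"])
matches = [
    (v1, v2)
    for v1 in candidates["u1"]
    for v2 in candidates["u2"]
    if v1 != v2 and (v1, v2) in data_edges
]
print(candidates, join_space, matches)
```

Here nine pairs are examined to find two matches, and the gap widens quickly with longer candidate lists and more query vertices.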
31VS-tree
32VS-Tree query definition
33Pruning Technique
Reduced join space!
(Figure: query vertices u1 and u2, the bit string 10010, and candidate vertices 001, 004, 006 and 002, 003, 006.)
34Query Algorithm-Top-Down
35Optimized method
- Too many super edges
- Which level to start search
- No brute-force enumeration
36VS-Tree Insert
- The insertion criterion in the S-tree depends only on the Hamming distance between the signature of u and that of the tree node.
- The criterion in the VS-tree depends on both the node signatures and G's structure.
37Updates- Insertion in G
38Updates- Insertion in VS-tree
39VS-Tree split
- The B+1 entries of the node are partitioned into two new nodes, where B is the maximal fanout of a node in the VS-tree.
- 1. Find the two entries that have the maximal Hamming distance between them and use them as the two seed nodes.
- 2. Associate each remaining entry with the nearest seed node, according to Equation 1.
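The two split steps can be sketched as follows. Equation 1 is not reproduced in this transcript, so the sketch assumes plain Hamming distance for "nearest"; signatures are small integers and the function names are hypothetical.

```python
from itertools import combinations

def hamming(a, b):
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")

def split_entries(signatures):
    """Partition the entries of an overflowing node into two groups."""
    # Step 1: the pair with the maximal Hamming distance becomes the seeds.
    seed_a, seed_b = max(
        combinations(range(len(signatures)), 2),
        key=lambda pair: hamming(signatures[pair[0]], signatures[pair[1]]),
    )
    group_a = [signatures[seed_a]]
    group_b = [signatures[seed_b]]
    # Step 2: every remaining entry joins its nearest seed (ties go to seed A).
    for index, sig in enumerate(signatures):
        if index in (seed_a, seed_b):
            continue
        if hamming(sig, signatures[seed_a]) <= hamming(sig, signatures[seed_b]):
            group_a.append(sig)
        else:
            group_b.append(sig)
    return group_a, group_b

# A node with maximal fanout B = 4 that has received a fifth entry.
overflowing = [0b0000, 0b0001, 0b1111, 0b1110, 0b0111]
left_group, right_group = split_entries(overflowing)
print(left_group, right_group)
```

The sketch does not enforce the minimal fanout b on either group; a fuller implementation would need to.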
40VS-Tree deletion
- Similar to split
- If some node d has fewer than b entries, where b is the minimal fanout of a node in the VS-tree, then d is deleted and its entries are reinserted into the VS-tree.
41Updates- Deletion in VS-tree
To be deleted
42Which Level To Begin
- Define the pruning power of GI with regard to Q, denoted as P(Q, GI).
43Estimate P(Q,GI)
44Finding Valid Child States
- Propose a DFS strategy to find all valid child states of J.
- Start a DFS over G beginning from some vertex vi.
47Datasets
Dataset   Triples      Size
Yago      20 million   3.1 GB
DBLP      8 million    0.8 GB
48Offline Performance
49Exact Queries
50Wildcard Queries
52Conclusions
- A vertex encoding technique
- An efficient index structure: VS-tree
- A novel filtering technique
53Q/A
Thank You!