Title: Keyword Proximity Search on XML Graphs

1
Keyword Proximity Search on XML Graphs

Vagelis Hristidis
Yannis Papakonstantinou
Andrey Balmin
University of California, San Diego

2
Motivation

Keyword search is the dominant information
discovery method in plain text documents
An increasing amount of data is stored in XML databases

3
Motivation

Currently, information discovery in XML databases
requires
Knowledge of the schema
Knowledge of a query language (e.g., XQuery)
Knowledge of the role of the keywords

XKeyword eliminates these requirements

4
Keyword Query - Semantics

Keywords are
in the same XML text node
in the same element
connected through edges

<title> Storage of XML databases </title>
<topics> <topic> XML survey </topic> <topic> Storage of database </topic> </topics>
<topic> XML survey </topic>
idref
<topic> Storage of database </topic>
5
Result of Keyword Query

Result is a tree T of XML nodes where
every keyword is contained in a node of T (total)
no node of T is redundant (minimal)
Score of result
distance of keywords within a text node
distance between keywords in number of edges
weighted distance
PageRank-like methods
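
The two conditions on a result tree can be sketched in a few lines. This is only an illustration under assumptions not in the slides: a tree is given as a node list plus an edge list, `keywords_of` maps a node to the keywords its text contains, and minimality is simplified to "no leaf can be dropped without losing a keyword". The names `is_total`, `is_minimal`, and `size`, and the sample node ids, are hypothetical. Size is counted in edges, one of the scoring options listed above.

```python
from collections import defaultdict

def is_total(nodes, keywords_of, query):
    """True if every query keyword occurs in at least one node of the tree."""
    found = set()
    for n in nodes:
        found |= keywords_of.get(n, set()) & query
    return found == query

def leaves(nodes, edges):
    """Nodes that touch at most one edge."""
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return [n for n in nodes if degree[n] <= 1]

def is_minimal(nodes, edges, keywords_of, query):
    """Simplified check: dropping any single leaf must lose a keyword."""
    for leaf in leaves(nodes, edges):
        rest = [n for n in nodes if n != leaf]
        if is_total(rest, keywords_of, query):
            return False
    return True

def size(edges):
    """Size of a result tree, counted in edges."""
    return len(edges)

# Hypothetical tree: person p1, lineitem l1, part pa1.
nodes = ["p1", "l1", "pa1"]
edges = [("p1", "l1"), ("l1", "pa1")]
keywords_of = {"p1": {"John"}, "pa1": {"VCR"}}
query = {"John", "VCR"}

print(is_total(nodes, keywords_of, query))
print(is_minimal(nodes, edges, keywords_of, query))
print(size(edges))
```

Attaching an extra keyword-free leaf to `l1` keeps the tree total but makes `is_minimal` return `False`, which is the redundancy the slide rules out.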

6
Example - Schema
TPC-H-like schema
7
Example - Data
8
Example Keyword Query
Query: John, VCR
9
Example Keyword Query
Query: John, VCR
Result trees: T1 (size 6), T2 (size 8)
10
Example Keyword Query
Target Objects
11
Presentation

Number of results explodes due to multivalued dependencies (MVDs)
Example

Results
R1. p1-l1-pa3-pa1
R2. p1-l2-pa3-pa2
R3. p1-l2-pa3-pa1
R4. p1-l1-pa3-pa2
R3, R4 are implied by R1, R2!
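
The claim that R3 and R4 carry no new information can be checked mechanically. The sketch below is an illustration, not the system's presentation logic: it assumes that, because all four results share p1 and pa3, the implied results are simply every combination of the per-position values seen in R1 and R2. The helper name `implied_by` is hypothetical.

```python
from itertools import product

# The four results from the slide, as (person, lineitem, part, subpart).
R1 = ("p1", "l1", "pa3", "pa1")
R2 = ("p1", "l2", "pa3", "pa2")
R3 = ("p1", "l2", "pa3", "pa1")
R4 = ("p1", "l1", "pa3", "pa2")

def implied_by(rows):
    """All combinations of the values seen at each position of the rows."""
    columns = [sorted({row[i] for row in rows}) for i in range(len(rows[0]))]
    return set(product(*columns))

closure = implied_by([R1, R2])
extra = closure - {R1, R2}
print(sorted(extra))
```

Here `extra` contains exactly R3 and R4, so showing R1 and R2 together with the shared nodes is enough to present all four.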
12
Presentation

Create a Presentation Graph for each candidate network (CN)

13
Demo

Demo on DBLP dataset available at
www.db.ucsd.edu/XKeyword

14
Demo
15
Demo
16
Demo
17
Architecture
[Figure: architecture diagram. Surviving labels, in order:]
John, VCR
John: person.name; VCR: part.name, product.descr
PersonJohn-Lineitem-ProductVCR, PersonJohn-Lineitem-PartVCR
Person1-Lineitem1-Product1, Person1-Lineitem2-Part1
Person(John, US)-Lineitem(quant 6, Oct 14 2001)-Product(id 2005, descr Set of VCR and DVD),
18
Architecture
19
Candidate Networks Generator

Adaptation of CN generator of DISCOVER
(Hristidis et al. VLDB 2002) to XML databases
Example
CNs of size ≤ 3 for query John, VCR

[Figure: five candidate networks]
PersonJohn -supplier- Lineitem - ProductVCR
PersonJohn -supplier- Lineitem - PartVCR
PersonJohn -supplier- Lineitem - Part -subpart- PartVCR
PersonJohn - Order - Lineitem - PartVCR
PersonJohn - Order - Lineitem - ProductVCR
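
As a rough illustration of where these five networks come from, the sketch below enumerates walks of up to 3 edges in a schema graph, from a node type that can contain John to one that can contain VCR. It is a deliberately simplified stand-in, not the DISCOVER CN generator: real candidate networks are trees rather than only paths, and the generator prunes invalid or redundant ones. The edge list `SCHEMA_EDGES` is an assumption pieced together from the labels on these slides, and `candidate_paths` is a hypothetical name.

```python
# Assumed schema edges (undirected for this sketch).
SCHEMA_EDGES = [
    ("Person", "Order"),
    ("Order", "Lineitem"),
    ("Lineitem", "Person"),    # supplier reference
    ("Lineitem", "Part"),
    ("Lineitem", "Product"),
    ("Part", "Part"),          # subpart
]

def candidate_paths(edges, sources, targets, max_size):
    """Enumerate walks of at most max_size edges from a source to a target type."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    found = []

    def extend(path):
        if len(path) > 1 and path[-1] in targets:
            found.append(tuple(path))
        if len(path) - 1 == max_size:
            return
        for nxt in sorted(adj.get(path[-1], ())):
            extend(path + [nxt])

    for s in sorted(sources):
        extend([s])
    return found

# John matches person.name; VCR matches part.name and product.descr.
paths = candidate_paths(SCHEMA_EDGES, {"Person"}, {"Part", "Product"}, 3)
for p in paths:
    print("-".join(p))
```

On this assumed schema the enumeration yields five walks, which line up with the five networks on the slide. For larger size limits a plain walk enumeration would also produce networks that a real generator rejects.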
20
Architecture
21
Decomposer

Storing Data in XKeyword
Each target object is stored in a CLOB
Connections between target objects are stored in ID Relations

Minimal ID Relations
Lineitem_Part: LPa (L_id, Pa_id)
Lineitem_Person_ref: LPref (L_id, P_id)
Part_Part: PaPa (Pa_id1, Pa_id2)

LPa
L_id  Pa_id
100   123
101   123
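
A minimal sketch of this storage layout, using Python's built-in `sqlite3` module as a stand-in for the relational store. The table `target_object` and the XML snippets in it are hypothetical; the LPa rows are the ones shown above, and the LPref rows are taken from the Decomposer - Rules slide. SQLite has no dedicated CLOB type, so a TEXT column plays that role here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per target object; the XML fragment is kept as a single text blob.
    CREATE TABLE target_object (id INTEGER PRIMARY KEY, xml TEXT);
    -- Minimal ID relations: one per schema edge between target objects.
    CREATE TABLE LPa   (L_id INTEGER, Pa_id INTEGER);
    CREATE TABLE LPref (L_id INTEGER, P_id INTEGER);
    CREATE TABLE PaPa  (Pa_id1 INTEGER, Pa_id2 INTEGER);
""")

# Hypothetical XML content for two target objects.
conn.executemany("INSERT INTO target_object VALUES (?, ?)", [
    (105, "<person><name>John</name></person>"),
    (123, "<part><name>VCR</name></part>"),
])
conn.executemany("INSERT INTO LPa VALUES (?, ?)", [(100, 123), (101, 123)])
conn.executemany("INSERT INTO LPref VALUES (?, ?)", [(100, 105), (101, 105)])

# Persons connected to parts through a lineitem, found by joining ID relations only.
rows = conn.execute("""
    SELECT DISTINCT LPref.P_id, LPa.Pa_id
    FROM LPref JOIN LPa ON LPref.L_id = LPa.L_id
""").fetchall()
print(rows)
```

The join touches only ids; the XML text is fetched from `target_object` once a connection has been found.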
22
Decomposer
Create redundant ID Relations to improve
performance
Examples
PLPa = LPref ⋈ LPa, with columns (P_id, L_id, Pa_id)
PaPaPa = PaPa ⋈ PaPa

[Figure: CN over PersonJohn, Lineitem, Part, and PartVCR with supplier and subpart edges]
Then the CN is evaluated as
PJohn ⋈ PLPa ⋈ PaPaPa ⋈ PaVCR
instead of
PJohn ⋈ LPref ⋈ LPa ⋈ PaPa ⋈ PaPa ⋈ PaVCR
Saves 2 joins!
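
The saving can be reproduced on toy data. In this sketch relations are lists of id tuples and `join` matches the last id of the left tuple with the first id of the right one, a simplification chosen so that a CN can be evaluated as a left-to-right chain. The PaPa rows and the ids 124 and 125 are hypothetical, and LPref is stored as (P_id, L_id) so that it fits the chain.

```python
from functools import reduce

def join(left, right):
    """Join two lists of id tuples where left's last id equals right's first id."""
    return [l + r[1:] for l in left for r in right if l[-1] == r[0]]

def evaluate(plan):
    """Fold join over the plan; return (rows, number of joins performed)."""
    return reduce(join, plan), len(plan) - 1

PJohn = [(105,)]                  # persons containing John
PL = [(105, 100), (105, 101)]     # LPref read as (P_id, L_id)
LPa = [(100, 123), (101, 123)]
PaPa = [(123, 124), (124, 125)]   # hypothetical subpart edges
PaVCR = [(125,)]                  # parts containing VCR

# Redundant ID relations, materialised once ahead of query time.
PLPa = join(PL, LPa)
PaPaPa = join(PaPa, PaPa)

minimal_rows, minimal_joins = evaluate([PJohn, PL, LPa, PaPa, PaPa, PaVCR])
redundant_rows, redundant_joins = evaluate([PJohn, PLPa, PaPaPa, PaVCR])

print(minimal_joins, redundant_joins)
print(minimal_rows == redundant_rows)
```

Both plans return the same rows; the second needs 3 joins at query time instead of 5, at the cost of storing PLPa and PaPaPa.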
23
Decomposer - Rules

Create redundant ID Relations only when they contain no MVDs.
E.g., Person-Order-Lineitem (POL)
Avoid ID Relations that contain MVDs
E.g., Order-Person-Lineitem (OPLref)

[Figure edge labels: ref, supplier]

OPLref
O_id  P_id  L_id
127   105   100
127   105   101
129   105   100
129   105   101

PO
P_id  O_id
105   127
105   129

LPref
L_id  P_id
100   105
101   105

There is always a decomposition with fragments of
maximum size $L = \lceil M/(J+1) \rceil$, s.t. any CN of size
up to M is evaluated with at most J joins
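
Two small checks on this slide, as a sketch. The first rebuilds OPLref from PO and LPref, which shows why it is an MVD relation: every order of person 105 is paired with every lineitem that references 105, so it stores nothing the two smaller relations do not. The second assumes the fragment-size bound reads L = ceil(M / (J + 1)); the function name `max_fragment_size` is hypothetical.

```python
import math

PO = [(105, 127), (105, 129)]      # (P_id, O_id)
LPref = [(100, 105), (101, 105)]   # (L_id, P_id)

# Join PO and LPref on P_id to obtain (O_id, P_id, L_id).
OPLref = sorted((o, p, l) for (p, o) in PO for (l, p2) in LPref if p == p2)
print(OPLref)

def max_fragment_size(M, J):
    """Fragment size L such that J + 1 fragments cover a CN of size M."""
    return math.ceil(M / (J + 1))

print(max_fragment_size(6, 2))
print(max_fragment_size(7, 2))
```

The intuition behind the bound is that J joins connect J + 1 fragments, so fragments of size L cover a CN of up to (J + 1) * L edges.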

24
Clustering

ID Relations are stored clustered.
E.g., POL is clustered on P, O, L
LOP is clustered on L, O, P
Use the ID Relation whose clustering matches the join direction
E.g., use POL and not LOP when evaluating CN
Person-Order-Lineitem-Product from left to right
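
Why the clustering order matters can be illustrated with sorted lists standing in for clustered storage. This is an analogy rather than the system's storage engine: in a list sorted on (P, O, L), all rows for one person are contiguous and can be located by binary search, whereas in a list sorted on (L, O, P) they are scattered and need a scan. The rows and the helper name `rows_with_prefix` are hypothetical, and ids are assumed to be integers.

```python
from bisect import bisect_left

def rows_with_prefix(clustered_rows, key):
    """Rows whose first column equals key, found by binary search on a sorted list."""
    lo = bisect_left(clustered_rows, (key,))
    hi = bisect_left(clustered_rows, (key + 1,))
    return clustered_rows[lo:hi]

# Hypothetical Person-Order-Lineitem rows, clustered on (P, O, L).
POL = sorted([(105, 127, 100), (105, 127, 101), (105, 129, 102), (106, 130, 103)])
# The same connections clustered on (L, O, P).
LOP = sorted((l, o, p) for (p, o, l) in POL)

# Starting from person 105 and joining left to right: one contiguous range in POL.
from_pol = rows_with_prefix(POL, 105)

# The same lookup against LOP has to examine every row.
from_lop = [row for row in LOP if row[2] == 105]

print(from_pol)
print(from_lop)
```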

25
Architecture
26
Execution Module

Get top-K results using nested loops join on each
CN.
Caching intermediate results.
E.g., PartVCR -subpart- Part -subpart- PartTV
Assume we have evaluated p1-p2-x
If we reach partial result p3-p2-x, no need to
join with PartTV again
Multithreaded execution: one thread for each CN
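
The caching idea can be sketched as a nested loops join that remembers, for each middle Part, which PartTV nodes it joins with. This is an illustration under assumptions, not the Execution Module itself: subpart edges are treated as undirected, `p4` is a hypothetical part containing TV, and `probes` counts how often the join with PartTV is actually computed.

```python
def evaluate_cn(vcr_parts, subpart_edges, tv_parts):
    """Nested loops over PartVCR - Part - PartTV with a cache on the middle Part."""
    neighbours = {}
    for a, b in subpart_edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    cache = {}     # middle part -> PartTV nodes it connects to
    probes = 0     # number of joins with PartTV that were really executed
    results = []
    for v in vcr_parts:
        for mid in sorted(neighbours.get(v, ())):
            if mid not in cache:
                probes += 1
                cache[mid] = sorted(
                    t for t in neighbours.get(mid, ()) if t in tv_parts and t != v
                )
            for t in cache[mid]:
                results.append((v, mid, t))
    return results, probes

# p1 and p3 contain VCR; both are connected to p2; p2 is connected to p4 (TV).
results, probes = evaluate_cn(
    vcr_parts=["p1", "p3"],
    subpart_edges=[("p1", "p2"), ("p3", "p2"), ("p2", "p4")],
    tv_parts={"p4"},
)
print(results)
print(probes)
```

When the loop reaches p3-p2, the outcome of joining p2 with PartTV is already cached from p1-p2, so only one probe is made for two results.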
27
Execution Module

Execution is guided by navigation in Presentation
Graph

XML storage
28
Previous Work

DBXplorer (Agrawal et al. ICDE 2002),
DISCOVER (Hristidis et al. VLDB 2002)
Work on relational databases
Execute an SQL statement for each CN
Drawbacks
Redundancy in presentation
No control over storage of data
Keyword Search in Graph Databases (Goldman et al.
VLDB 1998)
Look for hub nodes
No schema

29
Experimentation: Evaluate decompositions

MinClust: minimal, both directions of clustering
MinNClustIndx: minimal, no clustering, indexed
Complete: all ID relations, MVD and non-MVD
XKeyword: all ID relations of size up to 2 (2
edges)

30
Experimentation: Evaluate decompositions

Maximum CN size 6, top-K
2 keywords
DBLP dataset

31
Evaluation of Optimized Execution Algorithm

2 keywords
DBLP dataset

32
Conclusions

XKeyword is a system for plain keyword search in
XML databases
Focus on
Storage of XML in relations
Presentation
Future work
Investigate other relevance semantics,
e.g., ranking based on link structure.

33
Questions?
34
Candidate Networks Generator

A keyword may appear in multiple nodes
candidate networks can be too big (sometimes
unbounded)
Adaptation of CN generator of DISCOVER (Hristidis
et al. VLDB 2002) to XML databases

35
Candidate Network - Examples
CNs of size ≤ 3 for query John, VCR

[Figure: five candidate networks]
PersonJohn -supplier- Lineitem - ProductVCR
PersonJohn -supplier- Lineitem - PartVCR
PersonJohn -supplier- Lineitem - Part -subpart- PartVCR
PersonJohn - Order - Lineitem - PartVCR
PersonJohn - Order - Lineitem - ProductVCR
36
Candidate Networks Generator is Complete and
Non-Redundant

Prove that the set of Candidate Networks
generated is
Complete: every solution is generated by a CN
Non-redundant: there is a database instance where
removing a CN loses a solution

37
Experimentation: Evaluate decompositions