Title: Keyword Proximity Search on XML Graphs
1Keyword Proximity Search on XML Graphs
- Vagelis Hristidis
- Yannis Papakonstantinou
- Andrey Balmin
- University of California, San Diego
2Motivation
- Keyword search is the dominant information discovery method in plain text documents
- Increasing amount of data stored in XML databases
3Motivation
- Currently, information discovery in XML databases requires
- Knowledge of the schema
- Knowledge of a query language (e.g., XQuery)
- Knowledge of the role of the keywords
- XKeyword eliminates these requirements
4Keyword Query - Semantics
- Keywords are
- in same XML text node
- in same element
- connected through edges
<title>Storage of XML databases</title>
<topics><topic>XML survey</topic><topic>Storage of database</topic></topics>
<topic>XML survey</topic>
idref
<topic>Storage of database</topic>
5Result of Keyword Query
- Result is a tree T of XML nodes where
- every keyword is contained in a node of T (total)
- no node of T is redundant (minimal)
- Score of result
- distance of keywords within a text node
- distance between keywords in number of edges
- weighted distance
- PageRank-like methods
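The two conditions (total, minimal) and the edge-count score can be checked mechanically. The sketch below is only an illustration: the node ids, the dictionary-based tree encoding and the helper names are assumptions, not XKeyword's data structures, and only the "number of edges" score is shown.

```python
def is_total(node_keywords, keywords):
    """True if every query keyword occurs in at least one node."""
    return all(any(k in kws for kws in node_keywords.values()) for k in keywords)


def is_minimal(node_keywords, edges, keywords):
    """True if the tree is total and dropping any leaf loses a keyword.

    Checking leaves is enough for a tree: any smaller connected subtree
    misses at least one leaf of the original tree.
    """
    if not is_total(node_keywords, keywords):
        return False
    degree = {n: 0 for n in node_keywords}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    for leaf in (n for n, d in degree.items() if d <= 1):
        rest = {n: kws for n, kws in node_keywords.items() if n != leaf}
        if is_total(rest, keywords):
            return False  # the leaf was redundant
    return True


def size_score(edges):
    """Score as distance between keywords in number of edges (smaller is better)."""
    return len(edges)


# Hypothetical result tree: person p1 mentions "john", part pa1 mentions "vcr".
nodes = {"p1": {"john"}, "l1": set(), "pa1": {"vcr"}}
edges = [("p1", "l1"), ("l1", "pa1")]
query = {"john", "vcr"}
print(is_total(nodes, query), is_minimal(nodes, edges, query), size_score(edges))
```

Attaching an extra keyword-free leaf (say an order node hanging off p1) keeps the tree total but makes `is_minimal` return False.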
6Example - Schema
TPCH-like schema
7Example - Data
8Example Keyword Query
Query: John, VCR
9Example Keyword Query
Query: John, VCR
Result trees: T1 (size 6), T2 (size 8)
10Example Keyword Query
Target Objects
11Presentation
- Number of results explodes due to MVDs
- Example
Results:
R1. p1-l1-pa3-pa1
R2. p1-l2-pa3-pa2
R3. p1-l2-pa3-pa1
R4. p1-l1-pa3-pa2
R3, R4 are implied by R1, R2!
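To see why R3 and R4 carry no new information, one can recombine the values seen at each position of R1 and R2. The sketch below assumes that the branches around the shared node pa3 are independent (the MVD case of this example); the function name is hypothetical and this is not the presentation-graph algorithm itself.

```python
from itertools import product


def implied_results(results):
    """Results obtainable by recombining the values seen at each position.

    Only meaningful when the positions vary independently (an MVD holds).
    """
    columns = [sorted(set(col)) for col in zip(*results)]
    return sorted(set(product(*columns)) - set(results))


r1 = ("p1", "l1", "pa3", "pa1")
r2 = ("p1", "l2", "pa3", "pa2")
print(implied_results([r1, r2]))
# [('p1', 'l1', 'pa3', 'pa2'), ('p1', 'l2', 'pa3', 'pa1')]  i.e. R4 and R3
```

Storing the per-position value sets instead of every combination is the intuition behind showing one compact graph rather than a flat list of trees.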
12Presentation
- Create a Presentation Graph for each CN
13Demo
- Demo on DBLP dataset available at
www.db.ucsd.edu/XKeyword
17Architecture
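[Figure: architecture pipeline; data passed between stages]
Keyword query: John, VCR
Keyword locations: John in person.name; VCR in part.name, product.descr
Candidate networks: Person^John - Lineitem - Product^VCR, Person^John - Lineitem - Part^VCR
Results (ids): Person1 - Lineitem1 - Product1, Person1 - Lineitem2 - Part1
Presented result: Person[John, US] - Lineitem[quant 6, Oct 14 2001] - Product[id 2005, descr Set of VCR and DVD, ...]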
18Architecture
19Candidate Networks Generator
- Adaptation of CN generator of DISCOVER (Hristidis et al. VLDB 2002) to XML databases
- Example
- CNs of size up to 3 for query John, VCR
[Figure: five candidate networks]
Person^John -supplier- Lineitem - Product^VCR
Person^John -supplier- Lineitem - Part^VCR
Person^John -supplier- Lineitem - Part -subpart- Part^VCR
Person^John - Order - Lineitem - Part^VCR
Person^John - Order - Lineitem - Product^VCR
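A much simplified way to picture CN generation for this query is to enumerate walks over the schema graph from the node type holding "John" to a node type holding "VCR", bounded by the maximum size. The sketch below is an assumption-laden illustration, not the DISCOVER or XKeyword generator: it handles only path-shaped networks, treats schema edges as undirected, and uses a schema edge list read off the CN diagrams in this deck.

```python
from collections import deque

# Schema edges as they appear in the CN diagrams (treated as undirected here).
SCHEMA_EDGES = [
    ("Person", "Order"),
    ("Order", "Lineitem"),
    ("Lineitem", "Person"),    # the "supplier" reference
    ("Lineitem", "Part"),
    ("Lineitem", "Product"),
    ("Part", "Part"),          # the "subpart" edge
]


def neighbors(node):
    """Node types reachable from `node` over one schema edge."""
    out = []
    for a, b in SCHEMA_EDGES:
        if a == node:
            out.append(b)
        if b == node and a != b:
            out.append(a)
    return out


def candidate_paths(start, targets, max_size):
    """Enumerate schema walks from `start` to a type in `targets`
    using at most `max_size` edges (breadth-first)."""
    found = []
    queue = deque([(start,)])
    while queue:
        path = queue.popleft()
        if len(path) > 1 and path[-1] in targets:
            found.append(path)
        if len(path) - 1 < max_size:
            for nxt in neighbors(path[-1]):
                queue.append(path + (nxt,))
    return found


# "John" occurs under Person; "VCR" occurs under Part and Product.
for path in candidate_paths("Person", {"Part", "Product"}, max_size=3):
    print(" - ".join(path))
```

On this small schema the enumeration yields the same five node sequences as the example CNs; a real generator additionally has to handle tree-shaped networks and prune redundant ones.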
20Architecture
21Decomposer
- Storing Data in XKeyword
- Each target object is stored in a CLOB
- Connections between target objects are stored in ID Relations
Minimal ID Relations:
Lineitem_Part LPa (L_id, Pa_id)
Lineitem_Person_ref LPref (L_id, P_id)
Part_Part PaPa (Pa_id1, Pa_id2)
LPa:
L_id Pa_id
100 123
101 123
22Decomposer
- Create redundant ID Relations to improve performance
- Examples
- PLPa = LPref ⋈ LPa, with columns (P_id, L_id, Pa_id)
- PaPaPa = PaPa ⋈ PaPa
[Figure: CN over Person^John, Lineitem, Part and Part^VCR with supplier and subpart edges]
Then the CN is evaluated as P^John ⋈ PLPa ⋈ PaPaPa ⋈ Pa^VCR instead of P^John ⋈ LPref ⋈ LPa ⋈ PaPa ⋈ PaPa ⋈ Pa^VCR. Saves 2 joins!
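The effect of a redundant ID relation can be reproduced with a few rows. The sketch below materializes PLPa from LPref and LPa with a small hash join and counts the joins in the two evaluation plans; the `join` helper and the list-of-tuples encoding are assumptions made for illustration, and the rows are the ones shown in the tables of this deck.

```python
def join(left, right, left_col, right_col):
    """Hash join of two lists of tuples on left[left_col] == right[right_col].

    The join column of `right` is dropped from the output rows.
    """
    index = {}
    for row in right:
        index.setdefault(row[right_col], []).append(row)
    out = []
    for lrow in left:
        for rrow in index.get(lrow[left_col], []):
            out.append(lrow + tuple(v for i, v in enumerate(rrow) if i != right_col))
    return out


# Minimal ID relations.
LPref = [(100, 105), (101, 105)]   # (L_id, P_id)
LPa = [(100, 123), (101, 123)]     # (L_id, Pa_id)

# Redundant relation PLPa(P_id, L_id, Pa_id) = LPref joined with LPa on L_id.
PLPa = [(p_id, l_id, pa_id) for (l_id, p_id, pa_id) in join(LPref, LPa, 0, 0)]
print(PLPa)

# Join counts of the two plans for the example CN.
plan_minimal = ["P^John", "LPref", "LPa", "PaPa", "PaPa", "Pa^VCR"]
plan_redundant = ["P^John", "PLPa", "PaPaPa", "Pa^VCR"]
joins_saved = (len(plan_minimal) - 1) - (len(plan_redundant) - 1)
print(joins_saved)
```

The trade-off is storage: PLPa repeats information already held in LPref and LPa, which is why the rules for when to add such relations matter.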
23Decomposer - Rules
- Create redundant ID Relations only when they do not contain MVDs.
- E.g., Person-Order-Lineitem (POL)
- Avoid ID Relations that contain MVDs
- E.g., Order-Person-Lineitem (OPLref)
[Figure: schema fragment with edge labels ref and supplier]
OPLref:
O_id P_id L_id
127 105 100
127 105 101
129 105 100
129 105 101
PO:
P_id O_id
105 127
105 129
LPref:
L_id P_id
100 105
101 105
- There is always a decomposition with fragments of maximum size L = ⌈M/(J+1)⌉, s.t. any CN of size up to M is evaluated with at most J joins
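The fragment-size bound is easy to check numerically for path-shaped CNs. The sketch below computes L = ⌈M/(J+1)⌉ and cuts a path of a given size into consecutive fragments of at most L edges. It assumes that only joins between ID-relation fragments are counted and that the CN is a simple path; the function names are hypothetical.

```python
def fragment_size(max_cn_size, max_joins):
    """L = ceil(M / (J + 1)), computed with integer arithmetic."""
    return (max_cn_size + max_joins) // (max_joins + 1)


def split_path(num_edges, frag_size):
    """Cut a path-shaped CN with num_edges edges into consecutive fragments
    of at most frag_size edges; returns the fragment sizes."""
    return [min(frag_size, num_edges - start)
            for start in range(0, num_edges, frag_size)]


M, J = 6, 2
L = fragment_size(M, J)
for size in range(1, M + 1):
    fragments = split_path(size, L)
    print(size, fragments, "joins:", len(fragments) - 1)
```

With M = 6 and J = 2 the fragment size is 2, so a CN of size 6 is covered by three fragments and evaluated with two joins between them.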
24Clustering
- ID Relations are stored clustered.
- E.g., POL is clustered on P, O, L
- LOP is clustered on L, O, P
- Use the ID Relation whose clustering matches the join direction
- E.g., use POL and not LOP when evaluating CN Person-Order-Lineitem-Product from left to right
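Why the clustering order should follow the join direction can be illustrated with sorted tuples standing in for a clustered relation. In the sketch below the rows are hypothetical and a sorted Python list is only an analogy for on-disk clustering: with POL the rows for a given person are contiguous and found by a range lookup, while with LOP the same lookup has to inspect every row.

```python
from bisect import bisect_left

# Hypothetical rows; POL is kept sorted (clustered) on (P_id, O_id, L_id).
POL = sorted([(105, 127, 100), (105, 127, 101), (105, 129, 100), (106, 130, 102)])
# The same connections clustered on (L_id, O_id, P_id).
LOP = sorted((l_id, o_id, p_id) for (p_id, o_id, l_id) in POL)


def rows_for_person(pol, p_id):
    """Range lookup on the clustering prefix: matching rows are contiguous."""
    lo = bisect_left(pol, (p_id,))
    hi = bisect_left(pol, (p_id + 1,))
    return pol[lo:hi]


def rows_for_person_scan(lop, p_id):
    """With LOP the person id is the last column, so every row is inspected."""
    return [row for row in lop if row[2] == p_id]


print(rows_for_person(POL, 105))
print(rows_for_person_scan(LOP, 105))
```

Both calls return the same connections; the difference is that the first touches only the contiguous range for person 105.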
25Architecture
26Execution Module
- Get top-K results using nested loops join on each CN.
- Caching intermediate results.
- E.g., Part^VCR - Part - Part^TV (subpart edges)
- Assume we have evaluated p1-p2-x
- If we reach partial result p3-p2-x, no need to join with Part^TV again
- Multithreaded execution: one thread for each CN
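The caching idea for the Part^VCR - Part - Part^TV example can be sketched as a nested-loops evaluation that remembers, for each middle part, which Part^TV nodes it connects to. Everything in the sketch is an assumption made for illustration (the adjacency-dictionary encoding of subpart links, the function name, the sample ids), it assumes k is at least 1, and the one-thread-per-CN execution is not shown.

```python
def evaluate_cn(vcr_parts, tv_parts, subpart_links, k):
    """Nested-loops evaluation of Part^VCR - Part - Part^TV, stopping at k results."""
    cache = {}      # middle part -> Part^TV nodes linked to it
    probes = 0      # number of joins with Part^TV actually performed
    results = []
    for p_vcr in vcr_parts:
        for mid in subpart_links.get(p_vcr, []):
            if mid not in cache:
                probes += 1
                cache[mid] = [x for x in subpart_links.get(mid, []) if x in tv_parts]
            for x in cache[mid]:
                results.append((p_vcr, mid, x))
                if len(results) == k:
                    return results, probes
    return results, probes


# p1 and p3 contain "VCR", x contains "TV", p2 links all of them.
links = {"p1": ["p2"], "p3": ["p2"], "p2": ["p1", "p3", "x"]}
results, probes = evaluate_cn(["p1", "p3"], {"x"}, links, k=10)
print(results, probes)
```

Here p2 is joined with Part^TV once while evaluating p1-p2-x; when p3-p2 is reached the cached answer is reused, so only one probe is made for two results.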
27Execution Module
- Execution is guided by navigation in Presentation
Graph
XML storage
28Previous Work
- DBXplorer (Agrawal et al. ICDE 2002),
- DISCOVER (Hristidis et al. VLDB 2002)
- Work on Relational Databases
- Execute an SQL statement for each CN
- Drawbacks
- Redundancy in Presentation
- No control on Storage of data
- Keyword Search in Graph Databases (Goldman et al. VLDB 98)
- Look for hub nodes
- No schema
29Experimentation Evaluate decompositions
- MinClust: minimal, both directions of clustering
- MinNClustIndx: minimal, no clustering, indexed
- Complete: all ID relations, MVD and non-MVD
- XKeyword: all ID relations of size up to 2 (2 edges)
30Experimentation Evaluate decompositions
- Maximum CN size 6, top-K
- 2 keywords
- DBLP dataset
31Evaluation of Optimized Execution Algorithm
32Conclusions
- XKeyword is a system for plain keyword search in XML databases
- Focus on
- Storage of XML in relations
- Presentation
- Future work
- Investigate other relevance semantics.
- E.g., ranking based on link structure.
33Questions?
34Candidate Networks Generator
- A keyword may appear in multiple nodes
- Candidate networks can be too big (sometimes unbounded)
- Adaptation of CN generator of DISCOVER (Hristidis et al. VLDB 2002) to XML databases
35Candidate Network - Examples
CNs of size up to 3 for query John, VCR
[Figure: five candidate networks]
Person^John -supplier- Lineitem - Product^VCR
Person^John -supplier- Lineitem - Part^VCR
Person^John -supplier- Lineitem - Part -subpart- Part^VCR
Person^John - Order - Lineitem - Part^VCR
Person^John - Order - Lineitem - Product^VCR
36Candidate Networks Generator is Complete and Non-Redundant
- Prove that the set of Candidate Networks generated is
- Complete: every solution is generated by some CN
- Non-redundant: for each CN there is a database instance where removing that CN loses a solution
37Experimentation Evaluate decompositions
- Maximum CN size 6, all results
- 2 keywords
- DBLP dataset