Keyword Proximity Search on XML Graphs

1
Keyword Proximity Search on XML Graphs
  • Vagelis Hristidis
  • Yannis Papakonstantinou
  • Andrey Balmin
  • University of California, San Diego

2
Motivation
  • Keyword Search is the dominant information
    discovery method in plain text documents
  • Increasing amount of data stored in XML databases

3
Motivation
  • Currently, information discovery in XML databases
    requires
  • Knowledge of schema
  • Knowledge of a query language (e.g. XQuery)
  • Knowledge of the role of the keywords
  • XKeyword eliminates these requirements

4
Keyword Query - Semantics
  • Keywords are
  • in same XML text node
  • in same element
  • connected through edges

<title> Storage of XML databases </title>
<topics> <topic> XML survey </topic> <topic> Storage of database </topic> </topics>
<topic> XML survey </topic>
idref
<topic> Storage of database </topic>
5
Result of Keyword Query
  • Result is a tree T of XML nodes where
  • every keyword is contained in a node of T (total)
  • no node of T is redundant (minimal)
  • Score of result
  • distance of keywords within a text node
  • distance between keywords in number of edges
  • weighted distance
  • PageRank-like methods
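
The two conditions on a result tree can be checked mechanically. The following is a minimal sketch, assuming a result tree is given as a mapping from node IDs to their text plus a list of edges, and assuming a keyword "is contained" in a node when it occurs as a case-insensitive substring. The node IDs and texts are hypothetical, and `is_minimal` expects a tree that is already total.

```python
from collections import Counter

def is_total(node_text, keywords):
    """True if every keyword occurs in the text of at least one node."""
    return all(
        any(kw.lower() in text.lower() for text in node_text.values())
        for kw in keywords
    )

def is_minimal(node_text, edges, keywords):
    """True if no leaf can be dropped while all keywords stay covered.

    Checking leaves is enough: any smaller total subtree can be reached
    by removing leaves one at a time.
    """
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    for node in node_text:
        if degree[node] <= 1:  # a leaf (or the only node)
            rest = {n: t for n, t in node_text.items() if n != node}
            if is_total(rest, keywords):
                return False
    return True

def size(edges):
    """Size of a result tree, counted in edges."""
    return len(edges)

# Hypothetical tree: person p1 ("John") - lineitem l1 - part pa1 ("VCR").
t1_nodes = {"p1": "John", "l1": "", "pa1": "VCR"}
t1_edges = [("p1", "l1"), ("l1", "pa1")]

# Same tree plus a node o1 that contributes no keyword.
t2_nodes = dict(t1_nodes, o1="")
t2_edges = t1_edges + [("p1", "o1")]
```

Here `t1` is total and minimal with size 2, while `t2` is total but not minimal because the leaf `o1` is redundant.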

6
Example - Schema
TPCH-like schema
7
Example - Data
8
Example Keyword Query
Query: John, VCR
9
Example Keyword Query
Query: John, VCR
Result trees: T1 (size 6), T2 (size 8)
10
Example Keyword Query
Target Objects
11
Presentation
  • Number of results explodes due to multivalued dependencies (MVDs)
  • Example

Results:
R1. p1-l1-pa3-pa1
R2. p1-l2-pa3-pa2
R3. p1-l2-pa3-pa1
R4. p1-l1-pa3-pa2
R3 and R4 are implied by R1 and R2!
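
Why R3 and R4 carry no new information can be shown with a small sketch. It assumes, from the ID prefixes only, that each result is a (person, lineitem, part, subpart) tuple, and that lineitems and subparts vary independently once the person and part are fixed. The `factor` and `expand` helpers are hypothetical names, not part of XKeyword.

```python
from itertools import product

# The four results from the slide.
RESULTS = [
    ("p1", "l1", "pa3", "pa1"),  # R1
    ("p1", "l2", "pa3", "pa2"),  # R2
    ("p1", "l2", "pa3", "pa1"),  # R3
    ("p1", "l1", "pa3", "pa2"),  # R4
]

def factor(results):
    """Group by (person, part); collect lineitems and subparts separately."""
    groups = {}
    for person, lineitem, part, subpart in results:
        lineitems, subparts = groups.setdefault((person, part), (set(), set()))
        lineitems.add(lineitem)
        subparts.add(subpart)
    return groups

def expand(groups):
    """Rebuild every result as the cross product inside each group."""
    return {
        (person, lineitem, part, subpart)
        for (person, part), (lineitems, subparts) in groups.items()
        for lineitem, subpart in product(lineitems, subparts)
    }

# Factoring only R1 and R2 already regenerates all four results.
implied = expand(factor(RESULTS[:2]))
```

The factored form stores two lineitems and two subparts once each, instead of listing all four combinations.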
12
Presentation
  • Create a Presentation Graph for each candidate network (CN)

13
Demo
  • Demo on DBLP dataset available at
    www.db.ucsd.edu/XKeyword

14
Demo
15
Demo
16
Demo
17
Architecture
John, VCR
John: person.name; VCR: part.name, product.descr
Person^John-Lineitem-Product^VCR, Person^John-Lineitem-Part^VCR
Person1-Lineitem1-Product1, Person1-Lineitem2-Part1
Person[John, US] - Lineitem[quant 6, Oct 14 2001] - Product[id 2005, descr "Set of VCR and DVD"], ...
18
Architecture
19
Candidate Networks Generator
  • Adaptation of CN generator of DISCOVER
    (Hristidis et al. VLDB 2002) to XML databases
  • Example
  • CNs of size 3 for query John, VCR

Person^John --supplier-- Lineitem -- Product^VCR
Person^John --supplier-- Lineitem -- Part^VCR
Person^John --supplier-- Lineitem -- Part --subpart-- Part^VCR
Person^John -- Order -- Lineitem -- Part^VCR
Person^John -- Order -- Lineitem -- Product^VCR
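
For a two-keyword query, candidate networks can be approximated by walks over the schema graph that start at an element type containing the first keyword and end at one containing the second. The sketch below is a simplified stand-in and not the DISCOVER generator: it does no pruning and handles only path-shaped networks. The schema edges are an assumption inferred from the element names and the `supplier` and `subpart` labels on the slides, and unlabeled edges are written as `None`.

```python
# Assumed schema graph: (element type, element type, edge label or None).
SCHEMA_EDGES = [
    ("Person", "Order", None),
    ("Order", "Lineitem", None),
    ("Lineitem", "Person", "supplier"),
    ("Lineitem", "Part", None),
    ("Lineitem", "Product", None),
    ("Part", "Part", "subpart"),
]

def build_adjacency(edges):
    """Undirected adjacency list; a self-loop is added only once."""
    adj = {}
    for a, b, label in edges:
        adj.setdefault(a, []).append((b, label))
        if a != b:
            adj.setdefault(b, []).append((a, label))
    return adj

def candidate_paths(start, targets, max_edges, adj):
    """All walks from `start` to a type in `targets` with 1..max_edges edges.

    Each walk is a tuple alternating element types and edge labels.
    """
    found = []

    def walk(node, path, n_edges):
        if n_edges > 0 and node in targets:
            found.append(tuple(path))
        if n_edges == max_edges:
            return
        for nxt, label in adj.get(node, []):
            walk(nxt, path + [label, nxt], n_edges + 1)

    walk(start, [start], 0)
    return found

ADJ = build_adjacency(SCHEMA_EDGES)
# "John" is assumed to occur in Person; "VCR" in Part and Product.
paths = candidate_paths("Person", {"Part", "Product"}, 3, ADJ)
```

Under these assumptions the sketch yields five walks of up to three edges from Person to Part or Product. With larger size limits it would also emit walks that a real generator prunes.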
20
Architecture
21
Decomposer
  • Storing Data in XKeyword
  • Each target object is stored in a CLOB
  • Connections between target objects in ID
    Relations

Minimal ID Relations:
Lineitem_Part: LPa (L_id, Pa_id)
Lineitem_Person_ref: LPref (L_id, P_id)
Part_Part: PaPa (Pa_id1, Pa_id2)

LPa
L_id  Pa_id
100   123
101   123
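
The storage layout can be sketched with an in-memory SQLite database. This is only an illustration: the `target_object` table name, the use of `TEXT` in place of a CLOB, and the XML fragments are assumptions. The `LPa` rows are the ones shown on the slide.

```python
import sqlite3

def build_store():
    """Create a toy store: one blob per target object plus one ID relation."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE target_object (
            id  INTEGER PRIMARY KEY,
            xml TEXT NOT NULL          -- TEXT stands in for a CLOB
        );
        CREATE TABLE LPa (L_id INTEGER, Pa_id INTEGER);
    """)
    conn.executemany(
        "INSERT INTO target_object (id, xml) VALUES (?, ?)",
        [
            (100, '<lineitem id="100"/>'),  # hypothetical XML fragments
            (101, '<lineitem id="101"/>'),
            (123, '<part id="123"/>'),
        ],
    )
    conn.executemany(
        "INSERT INTO LPa (L_id, Pa_id) VALUES (?, ?)",
        [(100, 123), (101, 123)],  # rows from the slide
    )
    return conn

def lineitems_of_part(conn, pa_id):
    """Follow the ID relation first, then fetch the stored XML blobs."""
    rows = conn.execute(
        "SELECT t.xml FROM LPa JOIN target_object AS t ON t.id = LPa.L_id "
        "WHERE LPa.Pa_id = ? ORDER BY t.id",
        (pa_id,),
    )
    return [xml for (xml,) in rows]
```

Joins run over the narrow ID relation, and the XML blobs are fetched only for the objects that end up in a result.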
22
Decomposer
  • Create redundant ID Relations to improve
    performance
  • Examples
  • PLPa = LPref ⋈ LPa
  • PaPaPa = PaPa ⋈ PaPa

PLPa (P_id, L_id, Pa_id)

CN in figure: Person^John, Lineitem, Part, Part^VCR (edge labels: supplier, subpart)

Then the CN is evaluated as P^John ⋈ PLPa ⋈ PaPaPa ⋈ Pa^VCR
instead of P^John ⋈ LPref ⋈ LPa ⋈ PaPa ⋈ PaPa ⋈ Pa^VCR.
Spares 2 joins!
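
A redundant ID relation is a precomputed join of minimal ones. The sketch below builds PLPa from the LPref and LPa rows that appear on the slides and counts the joins in the two evaluation plans. The plan lists and the `joins` helper are illustrative only.

```python
# Minimal ID relations, with the rows shown on the slides.
LPref = [(100, 105), (101, 105)]  # (L_id, P_id)
LPa = [(100, 123), (101, 123)]    # (L_id, Pa_id)

# Redundant relation PLPa(P_id, L_id, Pa_id) = LPref joined with LPa on L_id.
PLPa = sorted(
    (p_id, l_id, pa_id)
    for (l_id, p_id) in LPref
    for (l_id2, pa_id) in LPa
    if l_id == l_id2
)

def joins(plan):
    """A left-deep plan over n relations needs n - 1 joins."""
    return len(plan) - 1

plan_minimal = ["P^John", "LPref", "LPa", "PaPa", "PaPa", "Pa^VCR"]
plan_redundant = ["P^John", "PLPa", "PaPaPa", "Pa^VCR"]
saved = joins(plan_minimal) - joins(plan_redundant)
```

With the slide data, `PLPa` holds two rows and the plan over redundant relations needs 3 joins instead of 5.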
23
Decomposer - Rules
  • Create redundant ID Relations only when they are not MVDs.
  • E.g. Person-Order-Lineitem (POL)
  • Avoid MVD ID Relations
  • E.g. Order-Person-Lineitem (OPLref)



Edge labels in figure: ref, supplier

OPLref
O_id  P_id  L_id
127   105   100
127   105   101
129   105   100
129   105   101

PO
P_id  O_id
105   127
105   129

LPref
L_id  P_id
100   105
101   105
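
The OPLref table shows why MVD relations are wasteful: it is exactly the join of PO and LPref on P_id, so it adds rows without adding information. A minimal check using the rows from the slide:

```python
# Rows from the slide.
OPLref = {(127, 105, 100), (127, 105, 101),
          (129, 105, 100), (129, 105, 101)}  # (O_id, P_id, L_id)
PO = {(105, 127), (105, 129)}                # (P_id, O_id)
LPref = {(100, 105), (101, 105)}             # (L_id, P_id)

# Rejoin the two smaller relations on the shared person ID.
rejoined = {
    (o_id, p_id, l_id)
    for (p_id, o_id) in PO
    for (l_id, p_id2) in LPref
    if p_id == p_id2
}

def stored_ids(relation):
    """Number of ID values a relation stores (rows times columns)."""
    return sum(len(row) for row in relation)

is_lossless = rejoined == OPLref
```

For person 105 the orders {127, 129} and the lineitems {100, 101} are independent, so OPLref stores 12 ID values where PO and LPref together store 8.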
  • There is always a decomposition with fragments of
    maximum size L = ⌈M/(J+1)⌉, s.t. any CN of size
    up to M is evaluated with at most J joins
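
The bound can be sanity-checked numerically. This sketch assumes a path-shaped CN covered by consecutive fragments of at most L edges, and it counts only the joins between fragments. The function names are hypothetical.

```python
def max_fragment_size(m, j):
    """L = ceil(M / (J + 1)), computed with integer arithmetic."""
    return -(-m // (j + 1))

def joins_for_path(cn_size, fragment_size):
    """Joins needed to stitch a path of cn_size edges from such fragments."""
    n_fragments = -(-cn_size // fragment_size)  # ceil(cn_size / fragment_size)
    return n_fragments - 1

# Example: CNs of size up to 6, at most 2 joins allowed.
L = max_fragment_size(6, 2)
worst_case_joins = joins_for_path(6, L)
```

For M = 6 and J = 2 this gives fragments of size 2, and a size-6 path is covered by three fragments with two joins.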

24
Clustering
  • ID Relations are stored clustered.
  • E.g. POL is clustered on P, O, L
  • LOP is clustered on L, O, P
  • Use the ID Relation whose clustering matches the join direction
  • E.g. use POL and not LOP when evaluating CN
    Person-Order-Lineitem-Product from left to right
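
Clustering matters because a relation sorted on (P, O, L) answers "all rows for this person" with one contiguous range scan. The sketch below models clustering as a sorted list and the scan as a binary search followed by a short loop. The rows are hypothetical.

```python
from bisect import bisect_left

# Hypothetical POL rows (P_id, O_id, L_id), kept sorted: clustered on P, O, L.
POL = sorted([(105, 127, 100), (105, 129, 101), (106, 130, 102)])

# The same rows clustered on L, O, P.
LOP = sorted((l_id, o_id, p_id) for (p_id, o_id, l_id) in POL)

def rows_with_prefix(clustered, prefix):
    """Range scan: every row of a sorted relation that starts with `prefix`."""
    start = bisect_left(clustered, prefix)
    out = []
    for row in clustered[start:]:
        if row[:len(prefix)] != prefix:
            break  # rows are contiguous, so the scan can stop here
        out.append(row)
    return out

# Evaluating Person-Order-Lineitem left to right starts from a person ID.
orders_of_person = rows_with_prefix(POL, (105,))
```

The same lookup by person on `LOP` would have to scan every row, which is why the clustering should follow the direction in which the CN is evaluated.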

25
Architecture
26
Execution Module
  • Get top-K results using nested loops join on each
    CN.
  • Caching intermediate results.
  • E.g. Part^VCR --subpart-- Part --subpart-- Part^TV
  • Assume we have evaluated p1-p2-x
  • If we reach partial result p3-p2-x, no need to
    join with Part^TV again
  • Multithreaded execution: one thread for each CN
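
The caching idea can be sketched as a nested-loops evaluation that remembers, for each middle part, which TV parts it connects to. The data and all names are hypothetical, and subpart edges are treated as undirected for simplicity.

```python
# Hypothetical data: p1 and p3 contain "VCR", p4 contains "TV".
EDGES = [("p1", "p2"), ("p3", "p2"), ("p2", "p4")]  # subpart edges
VCR_PARTS = ["p1", "p3"]
TV_PARTS = {"p4"}

def neighbours(part):
    """Parts connected to `part` by a subpart edge, in sorted order."""
    out = set()
    for a, b in EDGES:
        if a == part:
            out.add(b)
        elif b == part:
            out.add(a)
    return sorted(out)

def evaluate(use_cache=True):
    """Nested-loops evaluation of Part^VCR - Part - Part^TV.

    Returns the results and the number of probes made against TV_PARTS.
    """
    cache = {}
    probes = 0
    results = []
    for vcr in VCR_PARTS:
        for mid in neighbours(vcr):
            if use_cache and mid in cache:
                tv_matches = cache[mid]  # p3-p2 reuses the work done for p1-p2
            else:
                probes += 1
                tv_matches = [n for n in neighbours(mid) if n in TV_PARTS]
                cache[mid] = tv_matches
            for tv in tv_matches:
                results.append((vcr, mid, tv))
    return results, probes
```

With the cache enabled, the extension from `p2` to the TV parts is computed once and reused when the partial result `p3-p2` is reached.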
27
Execution Module
  • Execution is guided by navigation in Presentation
    Graph

XML storage
28
Previous Work
  • DBXplorer (Agrawal et al. ICDE 2002),
  • DISCOVER (Hristidis et al. VLDB 2002)
  • Work on Relational Databases
  • Execute an SQL statement for each CN
  • Drawbacks
  • Redundancy in Presentation
  • No control on Storage of data
  • Keyword Search in Graph Databases (Goldman et
    al. VLDB 1998)
  • look for hub nodes
  • No schema

29
Experimentation: Evaluate decompositions
  • MinClust: minimal, both directions of clustering
  • MinNClustIndx: minimal, no clustering, indexed
  • Complete: all ID relations, MVD and non-MVD
  • XKeyword: all ID relations of size up to 2 (2
    edges)

30
Experimentation: Evaluate decompositions
  • Maximum CN size 6, top-K
  • 2 keywords
  • DBLP dataset

31
Evaluation of Optimized Execution Algorithm
  • 2 keywords
  • DBLP dataset

32
Conclusions
  • XKeyword is a system for plain keyword search in
    XML databases
  • Focus on
  • Storage of XML in relations
  • Presentation
  • Future work
  • Investigate other relevance semantics.
  • E.g. ranking based on link structure.

33
Questions?
34
Candidate Networks Generator
  • A keyword may appear in multiple nodes
  • candidate networks can be too big (sometimes
    unbounded)
  • Adaptation of CN generator of DISCOVER (Hristidis
    et al. VLDB 2002) to XML databases

35
Candidate Network - Examples
CNs of size 3 for query John, VCR
Person^John --supplier-- Lineitem -- Product^VCR
Person^John --supplier-- Lineitem -- Part^VCR
Person^John --supplier-- Lineitem -- Part --subpart-- Part^VCR
Person^John -- Order -- Lineitem -- Part^VCR
Person^John -- Order -- Lineitem -- Product^VCR
36
Candidate Networks Generator is Complete and
Non-Redundant
  • Prove that the set of Candidate Networks
    generated is
  • Complete: all solutions are generated by a CN
  • Non-redundant: there is a database instance where
    removing a CN loses a solution

37
Experimentation: Evaluate decompositions
  • Maximum CN size 6, all results
  • 2 keywords
  • DBLP dataset