Title: Flexible Querying of XML Documents
1Flexible Querying of XML Documents
- Krishnaprasad Thirunarayan and Trivikram Immaneni
- Department of Computer Science and Engineering
- Wright State University
- Dayton, OH-45435, USA
2Talk Outline
- Goal (What?)
- Background and Motivation (Why?)
- Query Language and Examples (What?)
- Implementation Details (How?)
- Evaluation and Applications (Why?)
- Conclusions
3Goal
4- Develop a keyword-based XML Query Language and its Semantics that is
  - flexible and sufficiently expressive
  - easy to use (for query formulation)
- Implement, reusing mature software components, for efficient indexing and search
5Background and Motivation
6XML vs Text Documents
- DATA: Exploit metadata/markup and aggregation structure implicit in XML documents
  - For expressiveness and precision
- QUERY: Obtain progressively improved extractions using convenient keyword-based queries, in contrast with accurate extractions using complex XML-based queries
7Relationship to Other Work
- Extends XSEarch (Cohen et al)
- Expressive power: Incorporates attributes and their values
- Equivalence (e.g., RDF)
- <T A="s"/>
- vs
- <T> <A> s </A> </T>
8- Invariance under Refinement
- <T> <A> word_1 and word_2 </A> </T>
- vs
- <T> <A>
- <B> word_1 </B>
- and
- <C> word_2 </C>
- </A> </T>
9Coherence
- Interconnectedness: Infer related pieces of information using aggregation implicit in XML
- Cohen et al: Name equivalence
- XSEarch
- Li et al: Structural equivalence
- Schema-Free XQuery
- Guo et al: Completeness
- XRANK
10- Information Retrieval
- Explore robust relevance ranking strategy to deal with high recall
- Variation on TF-IDF
- Naïve implementation computationally prohibitive
- Extension beyond type-delimited-document unclear
11Query Language and Examples (What?)
12Query Syntax
- Entity-Attribute-Keyword
- Search Terms
- e:a:k
- e:a:, :a:k, e::k
- e::, :a:, ::k
- Signed/optional Search Terms
- +e:a:k vs e:a:k
13Single Search Term Satisfaction
- e:a:k
- The search term e:a:k is satisfied by a tree containing a subtree with the top element e that is associated with the attribute a with value containing k, or a subelement a with descendant text node containing k.
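The satisfaction rule above can be sketched in a few lines. This is only an illustration using Python's standard `xml.etree.ElementTree`, not the Lucene-based implementation described later in the talk. The function name `satisfies` is hypothetical, "subelement" is assumed to mean a direct child, and "containing k" is assumed to mean a case-sensitive substring match.

```python
import xml.etree.ElementTree as ET

def satisfies(tree_root, e, a, k):
    """Return the subtrees with top element e that satisfy the term (e, a, k).

    Assumptions (not from the slides): 'subelement' means direct child,
    and 'containing k' means a case-sensitive substring match.
    """
    hits = []
    for node in tree_root.iter(e):
        # Case 1: e has an attribute a whose value contains k.
        if k in node.get(a, ""):
            hits.append(node)
            continue
        # Case 2: e has a subelement a with a descendant text node containing k.
        for child in node.findall(a):
            if any(k in text for text in child.itertext()):
                hits.append(node)
                break
    return hits

doc = ET.fromstring(
    "<bib>"
    "<author><name>Adam Dingle</name></author>"
    '<author name="A. Dingle"></author>'
    '<article id="3">author = "Adam Dingle and Ed MacNair"</article>'
    "</bib>"
)
hits = satisfies(doc, "author", "name", "Dingle")
```

Both `author` elements are returned, whether the name is recorded as an attribute or as a subelement; the `article` element is not, because its author appears only inside plain text.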
14Example (Mondial)
- <country id="f0_149" name="Austria" capital="f0_1467" population="8023244"
  datacode="AU" total_area="83850" population_growth="0.41"
  infant_mortality="6.2" ... government="federal republic" ...> ...
- </country>
- :name:Vienna is satisfied by
- <province id="f0_17447" name="Vienna" ...>
- <city id="f0_1467" country="f0_149" province="f0_17447" ...>
- <name>Vienna</name> <population year="94">1583000</population> </city>
- </province> ...
- name::Vienna is satisfied by a part of it
- <name>Vienna</name>.
15Example (Heterogeneity)
- <author><name>Adam Dingle</name></author>
- <author name="A. Dingle"></author>
- <article id="3">
- @inproceedings{IMN97,
- author="Adam Dingle and Ed MacNair and Thao Nguyen",
- </article>
- author:name:Dingle misses the last one.
16Query Answer Candidate
- Query Answer Candidate for the query Q(t_1,t_2,...,t_m) is a
- Most preferred satisfying collection of trees (P_1,P_2,...,P_m)
- Precise: smallest enclosing
- Adequate: optional search terms satisfied as much as possible
17Query Answer
- Query Answer for the query Q(t_1,t_2,...,t_m) is a
- Query Answer Candidate (P_1,P_2,...,P_m) in which
- The trees P_i are Interconnected
- Specifies trees related to the same real-world entity
18Interconnectedness (Cohen et al)
- Two subtrees T_a and T_b are said to be interconnected if the paths from their roots to their lowest common ancestor do not contain two distinct nodes with the same element name, or the only distinct nodes with the same element name are these roots.
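A minimal sketch of this check, assuming the document is held as an `xml.etree.ElementTree` tree and that T_a and T_b are identified by their root elements. The helper names (`parent_map`, `connecting_path`, `interconnected_cohen`) are hypothetical and are not taken from XSEarch or from the implementation in this talk.

```python
import xml.etree.ElementTree as ET

def parent_map(root):
    """ElementTree has no parent pointers, so build a child -> parent map."""
    return {child: parent for parent in root.iter() for child in parent}

def ancestors(node, parents):
    """Return [node, parent, grandparent, ..., document root]."""
    chain = [node]
    while chain[-1] in parents:
        chain.append(parents[chain[-1]])
    return chain

def connecting_path(a, b, parents):
    """Nodes on the path a -> lowest common ancestor -> b."""
    up_a, up_b = ancestors(a, parents), ancestors(b, parents)
    on_b_side = set(up_b)
    lca = next(n for n in up_a if n in on_b_side)
    left = up_a[:up_a.index(lca) + 1]   # a ... lca
    right = up_b[:up_b.index(lca)]      # b ... child of lca
    return left + right[::-1]

def interconnected_cohen(a, b, parents):
    """True unless two distinct path nodes other than {a, b} share a tag."""
    path = connecting_path(a, b, parents)
    for i, x in enumerate(path):
        for y in path[i + 1:]:
            if x.tag == y.tag and {x, y} != {a, b}:
                return False
    return True

doc = ET.fromstring(
    "<publications>"
    "<article><title>Anatomy</title><author>Brin</author><author>Page</author></article>"
    "<article><title>Suffix</title><author>Porter</author></article>"
    "</publications>"
)
parents = parent_map(doc)
titles, authors = list(doc.iter("title")), list(doc.iter("author"))
```

In this sample, a title and an author inside the same `article` are interconnected, while a title and an author from different articles are not, because the connecting path passes through two distinct `article` nodes.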
19Interconnectedness (Li et al)
- Two subtrees T_a and T_b are said to be interconnected if the path from T_a's root to their lowest common ancestor in the tree does not contain another node that is the lowest common ancestor of T_a and a distinct subtree T_b', where T_b' has the same root element label as T_b.
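One way to read this definition operationally: T_a and T_b fail to be interconnected when some other subtree with the same root label as T_b meets T_a at a strictly lower common ancestor. The sketch below follows that reading with Python's `xml.etree.ElementTree`. The function names are hypothetical, and skipping T_a itself as a candidate T_b' is an assumption added for the case where both roots carry the same label.

```python
import xml.etree.ElementTree as ET

def depths_and_parents(root):
    """Depth and parent of every element (iter() visits parents first)."""
    depth, parent = {root: 0}, {}
    for node in root.iter():
        for child in node:
            parent[child] = node
            depth[child] = depth[node] + 1
    return depth, parent

def lca(a, b, depth, parent):
    """Lowest common ancestor, found by lifting the deeper node."""
    while a is not b:
        if depth[a] >= depth[b]:
            a = parent[a]
        else:
            b = parent[b]
    return a

def interconnected_li(a, b, root, depth, parent):
    """False if some other node labelled like b has a lower LCA with a."""
    base = depth[lca(a, b, depth, parent)]
    for other in root.iter(b.tag):
        if other is a or other is b:   # assumption: a itself is not a T_b'
            continue
        # LCA(a, other) is an ancestor of a, so it lies on the path from a
        # to LCA(a, b) exactly when it is at least as deep as LCA(a, b).
        if depth[lca(a, other, depth, parent)] > base:
            return False
    return True

doc = ET.fromstring(
    "<publications><book>"
    "<title>Modern Information Retrieval</title><author>Baeza-Yates</author>"
    "<chapter><title>Digital Libraries</title><author>Fox</author></chapter>"
    "</book></publications>"
)
depth, parent = depths_and_parents(doc)
titles, authors = list(doc.iter("title")), list(doc.iter("author"))
```

Under this reading the relation is not symmetric: the chapter title is not interconnected with the book's author, because the chapter's own author is closer, while the book title is still interconnected with the chapter author.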
20Interconnectedness (Two Approaches)
21Implementation Details (How?)
22Tools Used
- Apache Lucene 2.0 APIs in Java
- A high-performance text search engine library with smart indexing strategies.
- Further tuned for memory-centric operation in contrast with disk-centric defaults
- SAXParser APIs
23Mapping to Lucene
- XML documents to Lucene documents for indexing
- XML keyword-based queries to Lucene queries for searching
- ENCODING
- An XML fragment of an XML document is referred to internally using the filename and the XPath (of the XML fragment's root from the XML document root).
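The encoding can be illustrated with a small sketch that assigns every element a (filename, XPath) pair. It uses Python's `xml.etree.ElementTree` rather than the SAX and Lucene APIs named above, the function name `fragment_ids` is hypothetical, and the use of positional predicates such as `article[2]` to tell siblings apart is an assumption; the slides do not state the exact XPath form.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def fragment_ids(filename, xml_text):
    """Yield (filename, xpath, element) for every element in the document.

    Assumption: siblings with the same tag are distinguished by a
    1-based positional predicate, e.g. /publications[1]/article[2].
    """
    root = ET.fromstring(xml_text)

    def walk(node, path):
        yield (filename, path, node)
        seen = Counter()  # how many children with each tag so far
        for child in node:
            seen[child.tag] += 1
            yield from walk(child, f"{path}/{child.tag}[{seen[child.tag]}]")

    yield from walk(root, f"/{root.tag}[1]")

sample = (
    "<publications>"
    "<article><title>A</title></article>"
    "<article><title>B</title></article>"
    "</publications>"
)
ids = [(name, path) for name, path, _ in fragment_ids("pubs.xml", sample)]
```

Each pair is enough to locate a fragment again, so an index entry only needs to store the pair rather than the fragment itself.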
24Evaluation and Application (Why?)
25Experiments
- DATASETS: Sigmod, Mondial, and DBLP.
- PLATFORMS
- For Sigmod and Mondial datasets: HP xw9300 Workstation with 2 GHz AMD Opteron dual-core processor (270), 4 GB of main memory, and 250 GB 7200 rpm hard drive, running 32-bit Windows XP.
- (java -Xms750M -Xmx1500M).
- For DBLP dataset: SUN Ultra-40 Workstation with 2.4 GHz dual AMD Opteron dual-core processor (280), 8 GB of main memory, and 250 GB 7500 rpm hard drive, running 64-bit Solaris 10.
- (java -Xms1000M -Xmx3600M).
26Dataset Sizes
DATASET    SIZE
Sigmod     468 KB
Mondial    1743 KB
DBLP       337 MB
27Dataset Indexing via Lucene
DATASET    INDEXING TIME    INDEX SIZE
Sigmod     32 sec           6 MB
Mondial    180 sec          16 MB
DBLP       36 hrs           4 GB
28Query Answer Computation Time vs Display Time
DATASET    SIMPLE QUERY       COMPLEX QUERY
Sigmod     35 ms / 1 sec      400 ms / 3 min
Mondial    25 ms / 350 ms     1 sec / 2 min
DBLP       335 ms / 1 sec     ---
29More Subtle Example
- In Extended Paper
- A Coherent Keyword-Based XML Query Language
30Pubs.xml
- <publications>
- <book>
- <title>Modern Information Retrieval</title>
- <author>Ricardo Baeza-Yates</author>
- <author>Berthier Ribeiro-Neto</author>
- <chapter>
- <title>Digital Libraries</title>
- <author>Edward A. Fox</author>
- <author>Ohm Sornil</author>
- </chapter>
- </book>
- <article>
- <title>The Anatomy of a Large-Scale Hypertextual Web Search Engine</title>
- <author>Sergey Brin</author>
- <author>Lawrence Page</author>
- </article>
- <article>
- <title>An Algorithm for Suffix Stripping</title>
- <author>M.F.Porter</author>
- </article>
- <article>
- <title>Indexing by Latent Semantic Analysis</title>
31Characteristics of Pubs.xml
- Total number of authors: 7
- Total number of titles: 5
- Title Distribution
- 1 book (with 1 chapter), 3 articles
- Author Distribution
- 2 (2), 2, 1, 0
32Queries to Pubs.xml (Answer counts)
- Arbitrary mix and match of authors and titles: 7 × 5 = 35
- author, title (8 hits)
- author, title (7 hits)
- author, title, author (4 hits)
33Completeness ( pWord, qWord) (Guo et al)
34Conclusions
35- Developed declarative semantics for keyword-based XML Query language with an effective query answering algorithm
- Developed a notion of interconnectedness that provides coherent answers
- Implemented using Lucene 2.0 APIs
- Indexing: Time and Space Intensive
- But
- Query Answering: Quick