Title: Vague Content and Structure (VCAS) Retrieval for Document-Centric XML Collections
1Vague Content and Structure (VCAS) Retrieval for
Document-Centric XML Collections
- Shaorong Liu, Wesley W. Chu, and Ruzan Shahinian
- UCLA Computer Science Department
- {sliu, wwc, ruzan}@cs.ucla.edu
- June 17, 2005
2Motivation
- XML: the standard for representation and exchange
- A growing number of XML collections are available
- Flexible and effective XML retrieval is desired
XML Collections
3Content Only (CO) Retrieval
- Query
- Content condition: what a result should be about
- Example: sensor network
4Strict Content and Structure (SCAS) Retrieval
- Query
- Content condition: what a result should be about
- Structure condition: what a result is
- Example
- //article/body/sec/paragraph[about(., sensor network)]
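The strict interpretation can be illustrated on a toy document. This is a minimal sketch, assuming Python's xml.etree.ElementTree for path evaluation and a naive "all keywords present" test in place of a real ranking function; the sample document and the helper name scas_match are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy document; the root element is <article>.
DOC = """<article><body>
<sec><paragraph>Uncertainty in sensor network</paragraph></sec>
<sec><title>sensor network basics</title></sec>
</body></article>"""


def scas_match(root, path, keywords):
    """Return elements at exactly `path` (relative to root) whose text
    contains every keyword. A real engine would rank, not filter."""
    terms = keywords.lower().split()
    hits = []
    for elem in root.findall(path):
        text = "".join(elem.itertext()).lower()
        if all(term in text for term in terms):
            hits.append(elem)
    return hits


root = ET.fromstring(DOC)
# Structure condition article/body/sec/paragraph, content "sensor network".
strict_hits = scas_match(root, "body/sec/paragraph", "sensor network")
# A slightly wrong path returns nothing, even though relevant text exists.
wrong_path_hits = scas_match(root, "body/paragraph", "sensor network")
```

Under the strict reading, the title element is never returned for the paragraph query even though its content is on topic, and a query whose path is slightly off returns nothing at all.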
5Problem
- XML structure is usually very complex
- An XML collection may contain hundreds of tags
- Vague content and structure (VCAS) retrieval is more desirable
- Fuzzy matching of a query's structure conditions
- Example: //article//paragraph[about(., reliable multicast)]
[Figure: example document tree with nodes article, body, sec, sec, paragraph; node text "A comparison of reliable multicast schemes" and "Uncertainty in sensor network"]
6Challenges
- Many existing XML IR engines support
- Content only retrieval and/or
- Strict content and structure retrieval
- How can structure conditions be processed approximately?
- How can such fuzzy matching be measured?
7Related Work
- XML IR engines
- HyREX (Fuhr et al., 01)
- XXL (Theobald et al., 02)
- JuruXML (Mass et al., 03)
- XSEarch (Cohen et al., 03)
- XRank (Guo et al., 03)
- Existing approaches to XML VCAS retrieval
- Content-oriented approach, e.g., (Sigurbjörnsson et al., 04)
- Simple, but loses the precision benefits of XML structure
- Query-relaxation-based approach (Amer-Yahia et al., 02, 04)
- Systematic and efficient, but may miss relevant answers
8Our Approach
[Figure: processing pipeline. VCAS query -> Decomposition -> SCAS sub-query and CO sub-query -> Retrieval -> SCAS results and CO results -> Combination -> VCAS results]
9Roadmap
- Introduction
- Our Approach
- Decomposition
- Retrieval
- Combination
- Experimental Study
- Conclusion
10Query Language: Narrowed Extended XPath I (NEXI)
- Content-oriented XPath-like language (Trotman et al., 04)
- Syntax: path1[abouts1]//...//pathn[aboutsn]
- abouts: a Boolean combination of about functions
- about function: about(path, string)
- Example
- Q1: //article[about(., sensor network)]//paragraph[about(., reliable multicast)]
- Structural hints to a search engine
- Support: where to search, e.g., article
- Target: what to return, e.g., paragraph
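The query syntax can be made concrete with a small parser. This is a minimal sketch, assuming the bracketed form path[about(path, string)] and at most one about function per step (Boolean combinations are not handled); Step and parse_nexi are hypothetical names, not part of any NEXI tool.

```python
import re
from typing import List, NamedTuple, Optional


class Step(NamedTuple):
    path: str                   # e.g. "//article"
    about_path: Optional[str]   # first argument of about(), e.g. "."
    keywords: Optional[str]     # second argument of about()


# One step: names separated by / or //, optionally followed by a single
# [about(path, string)] filter.
_STEP = re.compile(
    r"(?P<path>(?:/{1,2}[\w*]+)+)"
    r"(?:\[about\((?P<ap>[^,]+),\s*(?P<kw>[^)]*)\)\])?"
)


def parse_nexi(query: str) -> List[Step]:
    """Split a query into its (path, about) steps, left to right."""
    steps: List[Step] = []
    pos = 0
    while pos < len(query):
        m = _STEP.match(query, pos)
        if m is None:
            raise ValueError(f"cannot parse query at offset {pos}: {query[pos:]!r}")
        ap, kw = m.group("ap"), m.group("kw")
        steps.append(Step(m.group("path"),
                          ap.strip() if ap else None,
                          kw.strip() if kw else None))
        pos = m.end()
    return steps


q1 = "//article[about(., sensor network)]//paragraph[about(., reliable multicast)]"
q1_steps = parse_nexi(q1)
```

For Q1 the first step is the support element (article) and the last step is the target element (paragraph).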
11Decomposition
- Method
- Q: path1[abouts1]//...//pathn[aboutsn]
- Example
- Q1: //article[about(., sensor network)]//paragraph[about(., reliable multicast)]
- Q1_SCAS: //article[about(., sensor network)]
- Q1_CO: reliable multicast
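The Q1 example suggests a simple mechanical split. The sketch below is an assumption generalized from that one example, not the general decomposition rule: everything before the final step becomes the SCAS sub-query, and the keywords of the final (target) step become the CO sub-query. The function name decompose is hypothetical, and the bracketed query form is assumed.

```python
import re

# Support part (ends with "]"), then the target path, then the target's
# single about(...) filter. Queries without a support part are rejected.
_QUERY = re.compile(
    r"(?P<support>.+\])"
    r"(?P<target>(?:/{1,2}[\w*]+)+)"
    r"\[about\((?P<ap>[^,]+),\s*(?P<kw>[^)]*)\)\]"
)


def decompose(query: str):
    """Return (scas_subquery, target_path, co_subquery) for a VCAS query."""
    m = _QUERY.fullmatch(query.strip())
    if m is None:
        raise ValueError(f"unsupported query shape: {query!r}")
    return m.group("support"), m.group("target"), m.group("kw").strip()


q1 = "//article[about(., sensor network)]//paragraph[about(., reliable multicast)]"
q1_scas, q1_target, q1_co = decompose(q1)
```

The target path is returned alongside the two sub-queries because the structure condition on the target is still needed later, when approximate answers are compared with exact structural matches.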
12Retrieval
- Use existing XML IR engines to process both sub-queries
- Two result lists
- R_scas: results for the SCAS sub-query
- R_co: results for the CO sub-query
13Combination
- Method
- Example
- Q1: //article[about(., sensor network)]//paragraph[about(., reliable multicast)]
- v: an XML node that is an approximate answer
- vt: an XML node that matches Q's structure condition exactly
14Similarity Measure
- Measure I: path-oriented
- Similarity between two nodes is the similarity between their corresponding paths from the root node
- The more common prefix nodes two paths share, the more similar they are
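One way to turn the common-prefix idea into a number is sketched below. The normalization (shared prefix length divided by the length of the longer path) is an assumption chosen for illustration; the slide states only that a longer shared prefix means higher similarity. The function name path_similarity is hypothetical.

```python
def path_similarity(path_a: str, path_b: str) -> float:
    """Fraction of the longer root-to-node path covered by the common prefix."""
    a = [step for step in path_a.split("/") if step]
    b = [step for step in path_b.split("/") if step]
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    longest = max(len(a), len(b))
    # Two empty paths are treated as identical.
    return common / longest if longest else 1.0


target = "/article/body/sec/paragraph"
same = path_similarity(target, "/article/body/sec/paragraph")   # 1.0
parent = path_similarity(target, "/article/body/sec")           # 0.75
distant = path_similarity(target, "/article/front/title")       # 0.25
```

With this normalization a node's parent section scores higher than a node in a different branch of the same article, which matches the intuition stated above.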
15Similarity Measure
- Measure II: content-oriented
- Similarity between two nodes is the similarity between their content
- Use the vector space model for content similarity
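A vector space comparison of node content can be sketched as cosine similarity over term-frequency vectors. The weighting is an assumption: raw term counts with whitespace tokenization are used here for brevity, whereas a production engine would typically apply tf-idf weighting, stemming, and stop-word removal. The function name content_similarity is hypothetical.

```python
import math
from collections import Counter


def content_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between raw term-frequency vectors of two texts."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


close = content_similarity("reliable multicast", "reliable multicast schemes")
unrelated = content_similarity("reliable multicast", "sensor network")
```

Unlike the path-oriented measure, this one ignores where a node sits in the document and compares only what the nodes say.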
17Experiment Setup
- INEX
- Initiative for the Evaluation of XML Retrieval
- Similar to TREC for text retrieval
- Document collection
- Scientific articles from the IEEE Computer Society, 1995 to 2002
- Each article consists of 1500 XML nodes on average
- About 500 MB in size, with 170 different XML tags
- Query set: all 33 queries in INEX 04's VCAS task
- Gold standard: INEX 04 VCAS relevance assessments
18Experiment Results
- Experiment
- CO (baseline): ignoring structure conditions in a query
- VCAS-1: without using a similarity measure, i.e., ∀ vi, vj: sim(vi, vj) = 1
- VCAS-path: using the path-oriented similarity measure
- VCAS-cont: using the content-oriented similarity measure
- Results
All the experiments were conducted using our XML IR engine (Liu et al., 04)
20Conclusion
- An approach to XML VCAS retrieval using existing engines
- Decomposition
- Retrieval
- Combination
- Experimental results demonstrate the effectiveness of the approach
- Future work
- Investigate more general fuzzy matching of a query's structure conditions
- Incorporate the efficiency aspect
21Thank You!
Questions?