Vague Content and Structure (VCAS) Retrieval for Document-Centric XML Collections

1
Vague Content and Structure (VCAS) Retrieval for
Document-Centric XML Collections
  • Shaorong Liu, Wesley W. Chu, and Ruzan Shahinian
  • UCLA Computer Science Department
  • {sliu, wwc, ruzan}@cs.ucla.edu
  • June 17, 2005

2
Motivation
  • XML: a standard for representation and exchange
  • Increasing number of XML collections available
  • Flexible and effective XML retrieval is desired

[Figure: XML collections]
3
Content Only (CO) Retrieval
  • Query
  • Content condition: what a result should be about
  • Example: sensor network

4
Strict Content and Structure (SCAS) Retrieval
  • Query
  • Content condition: what a result should be about
  • Structure condition: what a result is
  • Example:
  • //article/body/sec/paragraph[about(., sensor network)]

5
Problem
  • XML structure is usually very complex
  • An XML collection may contain hundreds of tags
  • Vague content and structure (VCAS) retrieval is
    more desirable
  • Fuzzy matching of a query's structure conditions
  • Example: //article//paragraph[about(., reliable multicast)]

[Figure: example XML tree with the nodes article, body, sec, sec, and paragraph; text shown in the tree: "A comparison of reliable multicast schemes" and "Uncertainty in sensor network"]
6
Challenges
  • Many existing XML IR engines support:
  • Content-only retrieval, and/or
  • Strict content and structure retrieval
  • How to approximately process structure
    conditions?
  • How to measure such fuzzy matching?

7
Related Work
  • XML IR engines
  • HyREX [Fuhr et al., 01]
  • XXL [Theobald et al., 02]
  • JuruXML [Mass et al., 03]
  • XSEarch [Cohen et al., 03]
  • XRank [Guo et al., 03]
  • Existing approaches to XML VCAS retrieval
  • Content-oriented approach, e.g., [Sigurbjornsson
    et al., 04]
  • Simple, but loses the precision benefits of XML
    structure
  • Query-relaxation-based approach [Amer-Yahia et
    al., 02, 04]
  • Systematic and efficient, but may miss relevant
    answers

8
Our Approach
[Figure: processing flow. VCAS query -> Decomposition -> SCAS sub-query and CO sub-query -> Retrieval -> SCAS results and CO results -> Combination -> VCAS results]
9
Roadmap
  • Introduction
  • Our Approach
  • Decomposition
  • Retrieval
  • Combination
  • Experimental Study
  • Conclusion

10
Query Language: Narrowed Extended XPath I (NEXI)
  • Content-oriented XPath-like language [Trotman et
    al., 04]
  • Syntax: path_1[abouts_1]//...//path_n[abouts_n]
  • abouts: a Boolean combination of about functions
  • about function: about(path, string)
  • Example
  • Q1: //article[about(., sensor network)]//paragraph[about(., reliable multicast)]
  • Structural hints to a search engine
  • Support: where to search, e.g., article
  • Target: what to return, e.g., paragraph
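
To make the syntax concrete, here is a minimal Python sketch that splits a query of this shape into its steps. It is an illustration only, not the authors' parser: the names Step and parse_query are hypothetical, and it assumes each step carries at most one bracketed filter made of about(., text) calls with no nested brackets or parentheses.

```python
import re
from typing import List, NamedTuple


class Step(NamedTuple):
    """One //path[abouts] step of a query (hypothetical helper type)."""
    path: str          # element path of the step, e.g. "article"
    abouts: List[str]  # content strings taken from about(., ...) calls


# A step is "//path", optionally followed by one "[...]" filter, and must be
# followed by the next "//" or by the end of the query.
_STEP = re.compile(r"//\s*([^\[\]]+?)\s*(?:\[(.*?)\])?\s*(?=//|$)")
# Captures the text argument of about(., text).
_ABOUT = re.compile(r"about\(\s*\.\s*,\s*([^)]*?)\s*\)")


def parse_query(query: str) -> List[Step]:
    """Split a query into steps; the last step names the target element."""
    steps = []
    for match in _STEP.finditer(query.strip()):
        path = match.group(1)
        filter_text = match.group(2) or ""
        steps.append(Step(path, _ABOUT.findall(filter_text)))
    return steps


q1 = "//article[about(., sensor network)]//paragraph[about(., reliable multicast)]"
steps = parse_query(q1)
support, target = steps[0], steps[-1]
```

For Q1, the first step (article) plays the role of the support element and the last step (paragraph) is the target element.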

11
Decomposition
  • Method
  • Example
  • Q1: //article[about(., sensor network)]//paragraph[about(., reliable multicast)]
  • Q1_SCAS: //article[about(., sensor network)]
  • Q1_CO: reliable multicast

Q: path_1[abouts_1]//...//path_n[abouts_n]
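
The Q1 example can be reproduced with a short Python sketch. The general rule used here is an assumption generalized from that single example, since the slide's method figure is not in the transcript: everything before the last step becomes the SCAS sub-query, and the about() text of the last step becomes the CO sub-query. The function name decompose is hypothetical.

```python
import re
from typing import Tuple

# Captures the text argument of about(., text).
_ABOUT = re.compile(r"about\(\s*\.\s*,\s*([^)]*?)\s*\)")


def decompose(vcas_query: str) -> Tuple[str, str]:
    """Split a VCAS query into (SCAS sub-query, CO sub-query).

    Assumption: the query has at least two //-separated steps and the
    content strings themselves contain no "//".
    """
    query = vcas_query.strip()
    cut = query.rfind("//")  # start of the last (target) step
    if cut <= 0:
        raise ValueError("expected at least two //-separated steps")
    scas_query = query[:cut]
    target_step = query[cut:]
    # The CO sub-query keeps only the keywords of the target step.
    co_query = " ".join(_ABOUT.findall(target_step))
    return scas_query, co_query


q1 = "//article[about(., sensor network)]//paragraph[about(., reliable multicast)]"
q1_scas, q1_co = decompose(q1)
```

With Q1 as input, q1_scas and q1_co match the two sub-queries listed on the slide.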
12
Retrieval
  • Use existing XML IR engines to process both
    sub-queries
  • Two result lists
  • R_SCAS: results for the SCAS sub-query
  • R_CO: results for the CO sub-query

13
Combination
  • Method
  • Example
  • Q1: //article[about(., sensor network)]//paragraph[about(., reliable multicast)]

v: an XML node that is an approximate answer
v_t: an XML node that matches Q's structure condition exactly
14
Similarity Measure
  • Measure I: path-oriented
  • Similarity between two nodes is the similarity
    between their corresponding paths from the root
    node
  • The more common prefix nodes two paths share,
    the more similar they are
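
The slide gives only the intuition, not a formula, so the following Python sketch is one possible instance rather than the measure used in the talk. It assumes a normalization by the longer path length, and the tag paths in the example are illustrative.

```python
from typing import Sequence


def common_prefix_len(p1: Sequence[str], p2: Sequence[str]) -> int:
    """Number of leading tags the two root-to-node paths share."""
    count = 0
    for tag1, tag2 in zip(p1, p2):
        if tag1 != tag2:
            break
        count += 1
    return count


def path_similarity(p1: Sequence[str], p2: Sequence[str]) -> float:
    """Shared-prefix length divided by the longer path length (assumed form)."""
    if not p1 or not p2:
        return 0.0
    return common_prefix_len(p1, p2) / max(len(p1), len(p2))


# Illustrative root-to-node paths, written as lists of tag names.
exact = ["article", "body", "sec", "paragraph"]
nested = ["article", "body", "sec", "ss1", "paragraph"]
front = ["article", "fm", "abs"]

sim_nested = path_similarity(exact, nested)  # shares article/body/sec
sim_front = path_similarity(exact, front)    # shares only article
```

Under this assumed normalization, identical paths score 1.0 and a path that shares a longer prefix with the exact match scores higher than one that diverges right below the root.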

15
Similarity Measure
  • Measure II: content-oriented
  • Similarity between two nodes is the similarity
    between their content
  • Uses the vector space model for content similarity
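
As a rough illustration of the vector space model, the Python sketch below compares the text of two nodes by the cosine of their term-frequency vectors. The weighting is an assumption: the slide does not say which term weights are used, so raw term counts with whitespace tokenization stand in here.

```python
import math
from collections import Counter


def tf_vector(text: str) -> Counter:
    """Term-frequency vector over lowercased whitespace tokens."""
    return Counter(text.lower().split())


def content_similarity(text1: str, text2: str) -> float:
    """Cosine similarity between the term-frequency vectors of two texts."""
    v1, v2 = tf_vector(text1), tf_vector(text2)
    dot = sum(v1[term] * v2[term] for term in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(count * count for count in v1.values()))
    norm2 = math.sqrt(sum(count * count for count in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


same = content_similarity("reliable multicast", "reliable multicast")
partial = content_similarity("reliable multicast schemes", "reliable multicast")
disjoint = content_similarity("sensor network", "reliable multicast")
```

Texts with no terms in common score 0.0, and the score approaches 1 as the term distributions converge.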

16
Roadmap
  • Introduction
  • Our Approach
  • Decomposition
  • Retrieval
  • Combination
  • Experimental Study
  • Conclusion

17
Experiment Setup
  • INEX
  • Initiative for the Evaluation of XML retrieval
  • Similar to TREC for text retrieval
  • Document collection
  • Scientific articles from the IEEE Computer
    Society, 1995-2002
  • Each article consists of 1500 XML nodes on
    average
  • About 500 MB in size, with 170 different XML tags
  • Query set: all 33 queries in INEX 04's VCAS
    task
  • Gold standard: INEX 04 VCAS relevance assessments

18
Experiment Results
  • Experiment
  • CO (baseline): ignores the structure conditions
    in a query
  • VCAS-1: without a similarity measure, i.e.,
    ∀ v_i, v_j: sim(v_i, v_j) = 1
  • VCAS-path: uses the path-oriented similarity
    measure
  • VCAS-cont: uses the content-oriented similarity
    measure
  • Results

All the experiments were conducted using our XML
IR engine (Liu et al., 04)
19
Roadmap
  • Introduction
  • Our Approach
  • Decomposition
  • Retrieval
  • Combination
  • Experimental Study
  • Conclusion

20
Conclusion
  • An approach to XML VCAS retrieval using existing
    engines
  • Decomposition
  • Retrieval
  • Combination
  • Experimental results demonstrate the
    effectiveness of the approach
  • Future work
  • Investigate more general fuzzy matching of a
    query's structure conditions
  • Incorporate the efficiency aspect

21
Thank You!
Questions?