Title: CoXML: A Cooperative XML Query Answering System
1CoXML: A Cooperative XML Query Answering System
- Shaorong Liu and Wesley W. Chu
- APWeb/WAIM 2007
2Motivation
- XML has become the standard format for information representation and data exchange
- XML schemas are usually very complex
- E.g., the XML schema for the IEEE Computer Society publications contains about 170 distinct tags and more than 1000 distinct paths
- It is often unrealistic for users to fully understand a schema before asking queries
- Exact query answering is inadequate; approximate query answering is more desirable!
3Our Contribution: CoXML
4Roadmap
- Introduction
- Background
- CoXML
- Related Work
- Conclusion
5XML Query Relaxation Types
- Value relaxation: enlarging a value condition's search scope
- Node relabel: changing the label of a node to a similar or more general label, based on domain knowledge
[1] Tree Pattern Relaxation (S. Amer-Yahia et al., 2000)
6XML Query Relaxation Types
- Edge generalization: relaxing a / edge to a // edge
- Node deletion: dropping a node from a query tree
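The structural relaxation types can be illustrated with a minimal sketch. The `Node` class, the `render` format, and the rule that a deleted node's children are re-attached to its parent with `//` edges are assumptions made for this illustration, not the CoXML implementation; labels are assumed unique within a tree, and value relaxation is omitted.

```python
import copy
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str
    axis: str = "/"  # edge from the parent: "/" (child) or "//" (descendant)
    children: List["Node"] = field(default_factory=list)


def render(node: Node) -> str:
    """Serialize a query tree, e.g. /article[/title][/body[/section]]."""
    return node.axis + node.label + "".join(f"[{render(c)}]" for c in node.children)


def find(node: Node, label: str) -> Optional[Node]:
    if node.label == label:
        return node
    for child in node.children:
        hit = find(child, label)
        if hit is not None:
            return hit
    return None


def find_parent(root: Node, label: str) -> Optional[Node]:
    for child in root.children:
        if child.label == label:
            return root
        hit = find_parent(child, label)
        if hit is not None:
            return hit
    return None


def relabel(root: Node, old: str, new: str) -> Node:
    """Node relabel: replace a label with a similar or more general one."""
    tree = copy.deepcopy(root)
    find(tree, old).label = new
    return tree


def generalize_edge(root: Node, label: str) -> Node:
    """Edge generalization: relax the / edge above `label` to //."""
    tree = copy.deepcopy(root)
    find(tree, label).axis = "//"
    return tree


def delete_node(root: Node, label: str) -> Node:
    """Node deletion: drop a non-root node; its children move up with // edges (assumed rule)."""
    tree = copy.deepcopy(root)
    parent = find_parent(tree, label)
    if parent is None:
        raise ValueError("cannot delete the root or a missing node")
    idx = next(i for i, c in enumerate(parent.children) if c.label == label)
    victim = parent.children.pop(idx)
    for child in victim.children:
        child.axis = "//"
    parent.children[idx:idx] = victim.children
    return tree


query = Node("article", "/", [Node("title"), Node("body", "/", [Node("section")])])
print(render(query))                                  # /article[/title][/body[/section]]
print(render(generalize_edge(query, "section")))      # /article[/title][/body[//section]]
print(render(delete_node(query, "body")))             # /article[/title][//section]
print(render(relabel(query, "section", "paragraph"))) # /article[/title][/body[/paragraph]]
```

Each operation works on a copy, so the original query tree stays available for deriving other relaxed trees.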
7XML Relaxation Properties
- Definition
- Relaxation operation: an application of a relaxation type to a specific query node or edge
- Lemma
- Given a query tree with n applicable relaxation operations, there are potentially up to 2^n relaxed trees
- These are the possible combinations of the n operations
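The counting argument behind the lemma can be checked with a short sketch: every subset of the n applicable operations is one candidate combination. The operation names are hypothetical placeholders, and the sketch ignores that some combinations may be invalid or may produce the same tree, which is why the bound is "up to".

```python
from itertools import combinations


def relaxation_combinations(ops):
    """Yield every subset of the applicable relaxation operations."""
    for k in range(len(ops) + 1):
        for combo in combinations(ops, k):
            yield frozenset(combo)


# Hypothetical operations on a small query tree
ops = ["gen(e_article,title)", "del(title)", "del(body)"]
combos = list(relaxation_combinations(ops))
print(len(combos))  # 8, i.e. 2**3; the empty subset is the original, unrelaxed query
```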
8XML Query Relaxation Challenges
- Query relaxation is often user-specific
- Different users may have different approximate matching specifications for a given query tree
- How to provide user-specific approximate query answering?
- A query with n relaxation operations has potentially up to 2^n relaxed queries
- How to systematically relax a query?
- Query relaxation generates a set of approximate answers
- How to effectively rank the returned approximate answers?
9CoXML System Overview
[Figure: CoXML architecture. Components: Relaxation Engine, Ranking Module, Relaxation Index Builder, XML Database Engine. Labeled flows: query, relaxation language, relaxed query, exact answers, results, ranked results.]
10Roadmap
- Introduction
- Background
- CoXML
- Relaxation Language
- Relaxation Index Structure
- Ranking of Approximate Answers
- Experimental Studies
- Related Work
- Conclusion
11Relaxation Language
- A relaxation-enabled query is a tuple (T, R, C, S)
- T: tree-pattern query
- R: relaxation constructs
- E.g., delete or relabel a node, generalize an edge
- C: relaxation controls
- E.g., prefer or reject certain relaxation operations, use certain relaxation types, control the relaxation order, etc.
- S: stop condition
- E.g., the minimum number of approximate answers to be returned
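One way to picture the (T, R, C, S) tuple is as a small record type. The sketch below is illustrative only: the class names, field names, and the string encoding of operations such as `del(body)` are assumptions, not the syntax of the CoXML relaxation language.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class RelaxationControl:  # C
    preferred: List[str] = field(default_factory=list)  # operations to try first
    rejected: Set[str] = field(default_factory=set)     # operations never to apply
    allowed_types: Optional[Set[str]] = None            # None means all types allowed


@dataclass
class RelaxationQuery:
    tree_pattern: str                                    # T
    constructs: List[str]                                # R
    control: RelaxationControl = field(default_factory=RelaxationControl)  # C
    min_answers: int = 10                                # S: stop condition


def op_type(op: str) -> str:
    """'del(body)' -> 'del'."""
    return op.split("(", 1)[0]


def candidate_operations(q: RelaxationQuery) -> List[str]:
    """Filter R by C, then order preferred operations first."""
    c = q.control
    allowed = [
        op for op in q.constructs
        if op not in c.rejected
        and (c.allowed_types is None or op_type(op) in c.allowed_types)
    ]
    rank = {op: i for i, op in enumerate(c.preferred)}
    return sorted(allowed, key=lambda op: rank.get(op, len(rank)))  # stable sort


q = RelaxationQuery(
    tree_pattern="/article/body/section",
    constructs=["gen(e_body,section)", "del(body)", "del(section)"],
    control=RelaxationControl(preferred=["del(body)"], rejected={"del(section)"}),
    min_answers=20,
)
print(candidate_operations(q))  # ['del(body)', 'gen(e_body,section)']
```

Keeping the controls separate from the constructs means the same query tree can be relaxed differently for different users by swapping only the control part.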
12Relaxation Language Example
<inex_topic topic_id="267">
<castitle>//article//fm//atl[about(., "digital libraries")]</castitle>
<description>Articles containing "digital libraries" in their title.</description>
<narrative>I'm interested in articles discussing Digital Libraries as their main subject. Therefore I require that the title of any relevant article mentions "digital library" explicitly. Documents that mention digital libraries only under the bibliography are not relevant, as well as documents that do not have the phrase "digital library" in their title.</narrative>
</inex_topic>
13How to Relax Queries?
- Naïve approach
- Generate all possible relaxed queries, then iteratively select the best relaxed query to derive approximate answers
- Exhaustive, but not scalable
- Observation
- Many queries share the same (or similar) tree structures
- Our approach: relaxation index structure
- Consider the structure of a query tree T as a template
- Build indexes on the relaxed trees of T
- Use the index to guide the relaxation of any query with the same (or similar) tree structure as that of T
14Relaxation Index Structure - XTAH
- XTAH
- A hierarchical, multi-level, labeled cluster of relaxed trees for a given query tree
- Building an XTAH
- Given a query structure template T, generate all possible relaxed trees
- Each relaxed tree uses a unique set of relaxation operations
- Cluster the relaxed trees into groups based on relaxation operations and distances, similar to suffix-tree clustering
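The grouping idea can be sketched in a much simplified form. The code below groups relaxed trees only by the set of relaxation types they use; it is a stand-in for illustration and does not implement the distance-based, suffix-tree-like clustering mentioned above. The operation names are hypothetical, and the sketch does not check whether a combination of operations is actually valid for the template.

```python
from collections import defaultdict
from itertools import combinations


def op_type(op: str) -> str:
    """'gen(e_a,b)' -> 'gen'."""
    return op.split("(", 1)[0]


def build_groups(ops):
    """Map each type signature to the operation sets (relaxed trees) that use it."""
    groups = defaultdict(list)
    for k in range(1, len(ops) + 1):
        for combo in combinations(ops, k):
            signature = tuple(sorted({op_type(o) for o in combo}))
            groups[signature].append(combo)
    return dict(groups)


# Hypothetical operations for a template a/b/c
ops = ["gen(e_a,b)", "gen(e_b,c)", "del(b)"]
groups = build_groups(ops)
for signature, trees in groups.items():
    print(signature, len(trees))
```

With three operations this yields three groups: trees using only edge generalization, the tree using only node deletion, and trees mixing both. Looking up a group by its signature is what lets relaxed trees of a given kind be found without scanning all of them.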
15XTAH Example for Template Structure T
gen(e_{u,v}): relaxing the edge between u and v
del(u): deleting the node u
16XTAH Properties
- Each group consists of a set of relaxed trees derived from similar relaxation operations
- Relaxed trees can be located efficiently based on the type of relaxation operation
- A higher-level group in the XTAH yields less relaxation than a lower-level group
- A query can be relaxed to different levels of granularity by traversing up and down the XTAH
17Ranking of XML Approximate Answers
- Content similarity: cont_sim(A, Q)
- An extended vector space model [2]
- Structure similarity: struct_dist(A, Q)
- Uses tree editing distance to measure structure similarity
- Proposes a cost model that assigns operation costs based on relaxation semantics
- Overall relevancy: sim(A, Q)
- A ranking model combining both content similarity and structure distance
α is a small constant between 0 and 1
[2] Configurable Indexing and Ranking for XML Information Retrieval (S. Liu et al., 2004)
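The combining formula itself is not preserved in these slides. As a rough illustration only, the sketch below assumes the overall relevancy is the content similarity discounted by a constant between 0 and 1 (named `alpha` here) raised to the structure distance, and that the structure distance is the sum of the costs of the relaxation operations used. Both choices, the cost values, and all function names are assumptions.

```python
def structure_distance(ops, cost):
    """Assumed: total cost of the relaxation operations that produced the answer."""
    return sum(cost[op] for op in ops)


def overall_similarity(cont_sim, struct_dist, alpha=0.5):
    """Assumed form: the larger the structure distance, the stronger the discount."""
    return (alpha ** struct_dist) * cont_sim


def rank(answers, cost, alpha=0.5):
    """answers: list of (answer_id, cont_sim, ops). Returns ids, best first."""
    scored = [
        (overall_similarity(cs, structure_distance(ops, cost), alpha), aid)
        for aid, cs, ops in answers
    ]
    return [aid for _, aid in sorted(scored, key=lambda s: s[0], reverse=True)]


# Hypothetical operation costs: deleting a node costs more than generalizing an edge
cost = {"del(body)": 1.0, "gen(e_body,section)": 0.5}
answers = [
    ("a1", 0.9, ["del(body)"]),            # strong content match, heavy relaxation
    ("a2", 0.6, []),                       # exact structural match
    ("a3", 0.8, ["gen(e_body,section)"]),  # light relaxation
]
print(rank(answers, cost))  # ['a2', 'a3', 'a1']
```

Under these assumptions an exact structural match with moderate content similarity can outrank a heavily relaxed answer with higher content similarity, which is the trade-off the constant controls.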
18Experimental Studies
- Experiment setup
- INEX (INitiative for the Evaluation of XML retrieval) 05 test collection
- Document collection
- Query set
- Gold standard
- Evaluation metrics
- nxCG (normalized extended cumulative gain)
- The official evaluation metric used in INEX 05
- Given a number i (i ≥ 1), nxCG_at_i, similar to precision_at_i, measures the relative gain users have accumulated up to rank i
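A simplified sketch of the metric: the gain accumulated by the system ranking up to rank i is divided by the gain an ideal ranking would have accumulated by the same rank. The function name and the sample gain values are made up for illustration, and the official INEX 05 procedure involves further details (such as quantization of assessments and overlap handling) that are left out here.

```python
def nxcg_at(i, run_gains, recall_base_gains):
    """Normalized cumulative gain at rank i (i >= 1).

    run_gains: gain of each returned answer, in ranked order.
    recall_base_gains: gains of all relevant answers in the gold standard.
    """
    ideal = sorted(recall_base_gains, reverse=True)  # best possible ordering
    achieved = sum(run_gains[:i])
    best = sum(ideal[:i])
    return achieved / best if best > 0 else 0.0


run_gains = [1.0, 0.0, 0.5, 0.0]          # hypothetical system ranking
recall_base_gains = [1.0, 1.0, 0.5, 0.5]  # hypothetical gold standard
print(nxcg_at(1, run_gains, recall_base_gains))  # 1.0
print(nxcg_at(2, run_gains, recall_base_gains))  # 0.5
```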
19Retrieval performance improvements with semantic cost model
- Query set: all content-and-structure queries in INEX 05
[Chart: nxCG_at_10 for each (α, cost model) setting]
- Assigning relaxation operations different costs based on the similarity of the nodes being operated on improves retrieval performance!
- nxCG_at_25 and nxCG_at_50 yield similar results
20Evaluation of Relaxation Control
Relaxation control enables the system to provide
answers with greater relevancy!
21Related Work
- Relaxation based on schema conversions (LC01, LMC01, LMC03)
- Without structure relaxation
- Native XML relaxation
- Proposed structure relaxation types (e.g., KS01, ACS02)
- We use the relaxation types of ACS02 in our work
- Investigated efficient algorithms for deriving top-K answers based on relaxation types (e.g., Sch02, ACS02, ALP04, AKM05)
- Without relaxation control
22Conclusion
- Cooperative XML (CoXML) query answering
- A relaxation-enabled query language allows users to effectively express relaxed query conditions and to control the relaxation process
- XTAH provides systematic query relaxation guidance
- Both content and structure similarity metrics are used to evaluate the relevancy of approximate answers
- Evaluation studies with the INEX test collections validate the effectiveness of our methodology