Title: XCluster Synopses for Structured XML Content
1. XCluster Synopses for Structured XML Content
- Alkis Polyzotis (UC Santa Cruz)
- Minos Garofalakis (Intel Research, Berkeley)
2. XML Summarization
XML Data
- Synopses are essential for XML data management
- Statistics for XML query optimization
- Approximate query answering
- Active research topic in the field of XML databases
- Markov Tables, XSketch, XPathLearner, CSTs, TreeSketch, ...
3. Content Heterogeneity
[Figure: a sample paper element, its value types, and the predicate type each one supports]
- Numerical (Range): 2003
- String (Substring): The history of histograms (abridged); Yannis Ioannidis
- Text (Term Containment): The history of histograms is long and rich, full of detailed information in every step. It...
Query: //paper[year>2000][author contains(Ioannidis)]//abstract[ftcontains(histograms,history)]
4. Synopses and Heterogeneity
Query: //paper[year>2000][author contains(Ioannidis)]//abstract[ftcontains(histograms,history)]
- Mixed predicates → unified summarization model
- Path structure
- Values of different types
- Correlations between and across
- Summarization for textual values
5. XCluster Synopses
- Data synopses for heterogeneous XML content
- Unified summarization for path structure and numerical, string, and textual content
- Support for twig queries with mixed predicates
- XCluster model: element clustering
- Tight cluster: similar structure and values
- Extensibility to other value types
- Principled compression framework
- Experimental results: high accuracy with low storage requirements
6. Outline
- Preliminaries
- XCluster Model
- XCluster Compression
- Construction Algorithm
- Experimental Study
7. Data and Query Model
[Figure: a data tree with Numerical, String, and Text values next to a query tree with variables q0 to q3 and Range, Substring, and Term Containment predicates]
Query: for q0 in /, q1 in q0/p[y>1999], q2 in q1/t[contains(XML)], q3 in q1/ab[ftcontains(synopsis,data)]
- Tree data with heterogeneous value content
- Tree-pattern queries with XPath expressions
- Result set of binding tuples
8. Problem Definition
- Problem: build a data synopsis that can estimate the selectivity of any query
- Challenges
- Heterogeneity of content
- Data correlations
9. XCluster Model
10. Structural Summarization
[Figure: a data tree and its XCluster]
- Node: elements of the same tag
- Statistical information: node- and edge-counts
- Node-count: number of elements in cluster
- Edge-count: average number of children
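The node- and edge-counts above can be illustrated with a minimal sketch. It assumes the coarsest possible clustering, one node per tag, whereas an XCluster may keep several nodes for the same tag; the function name `tag_summary` and the sample document are hypothetical.

```python
from collections import Counter, defaultdict
import xml.etree.ElementTree as ET

def tag_summary(root):
    """Cluster elements by tag and return (node_count, edge_count).

    node_count[tag]            = number of elements in the cluster
    edge_count[(parent,child)] = average number of `child` children
                                 per element of `parent`
    """
    node_count = Counter()
    child_total = defaultdict(Counter)
    for elem in root.iter():              # root.iter() includes the root itself
        node_count[elem.tag] += 1
        for child in elem:                # direct children only
            child_total[elem.tag][child.tag] += 1
    edge_count = {
        (parent, child): total / node_count[parent]
        for parent, children in child_total.items()
        for child, total in children.items()
    }
    return dict(node_count), edge_count

# Hypothetical sample data, not taken from the slides.
doc = ET.fromstring(
    "<dblp>"
    "<paper><year>2003</year><author>A</author><author>B</author></paper>"
    "<paper><year>1999</year><author>C</author></paper>"
    "</dblp>"
)
node_count, edge_count = tag_summary(doc)
print(node_count)
print(edge_count)
```

In this sample the two paper elements have three author children in total, so the paper-to-author edge-count is an average rather than an exact per-element value.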
11. Value Summarization
[Figure: a data tree and its XCluster]
- Value summary: fractional value distribution
- Single-dimensional
- Approximation method depends on value type
12. Types of Value Summaries
- Numerical content: histograms
- String content: Pruned Suffix Tries
- Text content: end-biased term histograms
[Figure: the text "The history of histograms is long and rich, full of detailed information in every step. It..." mapped to a term matrix and then to a term histogram]
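As a rough illustration of an end-biased term histogram, the sketch below keeps the k most frequent terms with exact fractions and averages every other term into a single uniform bucket. The function names, the whitespace tokenizer, and the plain Python set used to remember bucket membership are assumptions for illustration; a real synopsis would need a compact encoding for that membership.

```python
from collections import Counter

def end_biased_term_histogram(texts, k):
    """Return (top, rest_terms, uniform).

    top        : exact fraction of texts containing each of the k most frequent terms
    rest_terms : terms that fell into the uniform bucket
    uniform    : average fraction shared by all terms in the uniform bucket
    """
    n = len(texts)
    doc_freq = Counter()
    for text in texts:
        doc_freq.update(set(text.lower().split()))   # count each term once per text
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    top = {term: count / n for term, count in ranked[:k]}
    rest = ranked[k:]
    uniform = sum(c for _, c in rest) / (n * len(rest)) if rest else 0.0
    return top, {t for t, _ in rest}, uniform

def term_fraction(hist, term):
    """Estimated fraction of texts that contain `term`."""
    top, rest_terms, uniform = hist
    if term in top:
        return top[term]
    return uniform if term in rest_terms else 0.0

# Hypothetical sample texts.
texts = ["history of histograms", "histograms of data", "xml data"]
hist = end_biased_term_histogram(texts, k=2)
print(term_fraction(hist, "histograms"), term_fraction(hist, "xml"))
```

Frequent terms keep their exact fraction, while rare terms such as "xml" are answered with the bucket average, which is where the approximation error comes from.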
13. XCluster Model
- A node aggregates information about its elements
- Correspondence to clustering: node ↔ cluster, centroid ↔ element
- Basic assumptions: independence and uniformity
- Tight clusters → valid assumptions
14. Estimation Example
[Figure: a query embedded in an XCluster; the steps of the embedding are annotated "1 element", "2 children", "1st children", and "1/2sk children"]
sel(Q) = (1) · (2) · (1st) · (1/2sk)
- Two-step estimation algorithm
- Identify embeddings
- Estimate selectivity of each embedding
- Accuracy depends on tightness of centroids
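The second step of the estimation algorithm can be sketched as a product of counts and fractions, assuming the independence and uniformity assumptions hold for every node on the embedding. The function names and all numbers below are hypothetical and do not reproduce the slide's figure.

```python
from math import prod

def embedding_selectivity(node_count, edge_counts, predicate_fractions):
    """Estimate the number of binding tuples produced by one embedding.

    node_count          : elements in the node where the embedding starts
    edge_counts         : average child counts along the embedding's edges
    predicate_fractions : fraction of elements satisfying each value predicate
    Independence lets the factors simply be multiplied.
    """
    return node_count * prod(edge_counts) * prod(predicate_fractions)

def query_selectivity(embeddings):
    """A query's estimate is the sum over all of its embeddings."""
    return sum(embedding_selectivity(*e) for e in embeddings)

# Hypothetical embedding: 1 root element, edge-counts 2, 1 and 1/2,
# and two value predicates satisfied by 40% and 20% of the elements.
estimate = query_selectivity([(1, [2, 1, 0.5], [0.4, 0.2])])
print(estimate)
```

The estimate is only as good as the averages it multiplies, which is why accuracy depends on how tight each cluster is around its centroid.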
15. XCluster Compression
16. Structural Compression
- Merge two nodes of same tag
- New node acquires aggregate characteristics
- Node- and edge-counts are aggregated
- Value summaries are fused
- Conceptually equivalent to cluster merging
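A merge of two same-tag nodes might look like the following sketch. The slides only say that counts are aggregated and value summaries are fused; the count-weighted averaging used here, the `Node` class, and the representation of a value summary as a dictionary of fractions are assumptions. Updating the edges of parent nodes is left out.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    tag: str
    count: int                                   # node-count
    edges: dict = field(default_factory=dict)    # child -> average children per element
    values: dict = field(default_factory=dict)   # value -> fraction of elements

def merge_nodes(u, v):
    """Merge two nodes of the same tag into one node with aggregate statistics."""
    if u.tag != v.tag:
        raise ValueError("only nodes of the same tag can be merged")
    total = u.count + v.count

    def weighted(a, b):
        # Count-weighted average, so the merged node matches the combined cluster.
        return {k: (u.count * a.get(k, 0.0) + v.count * b.get(k, 0.0)) / total
                for k in set(a) | set(b)}

    return Node(u.tag, total, weighted(u.edges, v.edges), weighted(u.values, v.values))

# Hypothetical nodes.
u = Node("paper", 2, {"author": 1.5}, {2003: 0.5, 1999: 0.5})
v = Node("paper", 1, {"author": 3.0, "year": 1.0}, {2003: 1.0})
w = merge_nodes(u, v)
print(w)
```

The merged node is the centroid of the combined cluster: it still reports the correct total number of author children, but it no longer distinguishes the elements that had 1.5 authors on average from the one that had 3.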
17. Value-Based Compression
- Reduce the storage of a single value summary
- Specifics depend on type of summary
- Histogram: merge k buckets
- Pruned Suffix Trie: prune k nodes
- Remove leaf nodes based on statistical independence
- Term Histogram: move k terms to the uniform bucket
18. Compression vs. Accuracy
Original XCluster S
Compressed XCluster S'
- Δ(S,S'): difference in accuracy between S and S'
- Key idea: apply operations with low Δ(S,S')
- Absolute vs. Relative metric
[Figure: the Absolute and Relative metrics illustrated with S, S', and R]
19. Distance Metric Δ(S,S')
- µ-query: basic query involving structure + values
- u[s]/c: the number of children in c per element in u that satisfies value predicate s
- Intuition: capture centroid information pertaining to c and s
- Δ(S,S'): difference of estimates for µ-queries
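One way to turn "difference of estimates for µ-queries" into a number is sketched below. The slides do not give the formula, so both the µ-query estimate (predicate fraction times edge-count) and the count-weighted sum of squared differences are assumptions; the dictionary layout and function names are hypothetical.

```python
def mu_estimate(node, predicate, child):
    """Estimate for the µ-query node[predicate]/child under independence:
    fraction of elements satisfying the predicate times the average child count."""
    return node["values"].get(predicate, 0.0) * node["edges"].get(child, 0.0)

def delta(originals, merged, predicates, children):
    """Count-weighted squared difference between the estimates of the
    original nodes and the estimates of the node that replaces them."""
    total = 0.0
    for node in originals:
        for s in predicates:
            for c in children:
                diff = mu_estimate(node, s, c) - mu_estimate(merged, s, c)
                total += node["count"] * diff ** 2
    return total

# Hypothetical nodes: same value distribution, different fan-out.
u = {"count": 2, "edges": {"author": 1.0}, "values": {"year>2000": 0.5}}
v = {"count": 2, "edges": {"author": 3.0}, "values": {"year>2000": 0.5}}
m = {"count": 4, "edges": {"author": 2.0}, "values": {"year>2000": 0.5}}
d = delta([u, v], m, ["year>2000"], ["author"])
print(d)
```

Merging two identical nodes gives a distance of zero under this sketch, so operations on tight clusters are the ones that come out cheapest.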
20. XCluster Construction
21. XCluster Construction
[Figure: pipeline from XML Data (Step 1) to the Reference Summary, an XCluster with detailed value distributions, and then (Steps 2 and 3) to the final XCluster]
- Step 1: build reference synopsis
- Count stability, detailed value summaries
- Step 2: compress structural information
- Step 3: compress value-based information
22. Structural Compression
- Algorithm sketch
- Generate pool of candidate merge operations
- Apply operations in increasing order of Δ(S,S')
- Repeat until size
- A-priori generation of candidates
- Merges at level l trigger merges at level l-1
- Adaptive, leaf-to-root merging of nodes
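The algorithm sketch above can be illustrated with a simplified greedy loop. This is not the construction algorithm itself: each node is reduced to a tag, a count, and a single hypothetical `fanout` number standing in for its centroid, the merge cost is a squared-error stand-in for Δ(S,S'), the budget is a node count, and the level-by-level candidate generation is omitted.

```python
import heapq
from itertools import combinations

def merged(a, b):
    """Merge two same-tag nodes; fanout becomes the count-weighted average."""
    total = a["count"] + b["count"]
    fanout = (a["count"] * a["fanout"] + b["count"] * b["fanout"]) / total
    return {"tag": a["tag"], "count": total, "fanout": fanout}

def merge_cost(a, b):
    """Stand-in for the distance metric: squared error introduced by the merge."""
    m = merged(a, b)
    return sum(n["count"] * (n["fanout"] - m["fanout"]) ** 2 for n in (a, b))

def compress_structure(node_list, budget):
    nodes = dict(enumerate(node_list))
    next_id = len(nodes)
    while len(nodes) > budget:
        # Generate the pool of candidate merges (same-tag pairs only).
        pool = [(merge_cost(nodes[i], nodes[j]), i, j)
                for i, j in combinations(nodes, 2)
                if nodes[i]["tag"] == nodes[j]["tag"]]
        if not pool:
            break                                  # nothing left to merge
        heapq.heapify(pool)
        consumed = set()
        # Apply candidates in increasing order of cost, skipping stale ones.
        while pool and len(nodes) > budget:
            _, i, j = heapq.heappop(pool)
            if i in consumed or j in consumed:
                continue
            consumed.update((i, j))
            nodes[next_id] = merged(nodes.pop(i), nodes.pop(j))
            next_id += 1
    return list(nodes.values())

# Hypothetical reference synopsis with four nodes and a budget of three.
example = [
    {"tag": "paper", "count": 2, "fanout": 1.0},
    {"tag": "paper", "count": 2, "fanout": 1.2},
    {"tag": "paper", "count": 1, "fanout": 5.0},
    {"tag": "author", "count": 3, "fanout": 0.0},
]
result = compress_structure(example, budget=3)
print(result)
```

With these numbers the two paper nodes with similar fan-out are merged first, and the outlier with fan-out 5.0 is left alone.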
23. Value-Based Compression
- Algorithm sketch
- Generate one operation for each value summary
- Apply value compression with least Δ(S,S')
- Repeat until size
- Generate operations of least effect
- Histograms: merge buckets with least difference
- PSTs: prune leaves with max independence
- Term Histograms: remove singletons of least freq.
24. Experimental Study
25. Methodology
- Data sets
- Workloads: random twig queries
- Structure only and with predicates
- Biased toward high selectivities
- Metrics
- Absolute relative error: |true - estim| / max(true, s)
- Absolute error: |true - estim|
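The two metrics can be computed as in the short sketch below. The slide does not define s; it is treated here as a lower bound on the denominator that keeps very small true counts from inflating the relative error, which is an assumption, and the sample numbers are hypothetical.

```python
def absolute_error(true, estim):
    return abs(true - estim)

def absolute_relative_error(true, estim, s):
    # max(true, s) keeps the denominator from collapsing for tiny true counts.
    return abs(true - estim) / max(true, s)

# Hypothetical (true, estimated) selectivities with s = 10.
print(absolute_relative_error(200, 150, s=10))   # large true count: divided by true
print(absolute_relative_error(2, 6, s=10))       # small true count: divided by s
print(absolute_error(2, 6))
```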
26. Accuracy of XClusters
[Plot: IMDB data set]
27. XCluster vs. TreeSketch
[Plot: XMark data set]
28. Conclusions
- XML synopses are essential for XML query optimization
- Our contribution: XCluster Synopses
- XML summaries for heterogeneous content
- Support for twig queries with numerical, string, and textual predicates
- XCluster model: generalized element clustering
- Principled construction algorithm
- Experimental results: high accuracy with low storage requirements