XCluster Synopses for Structured XML Content

1
XCluster Synopses for Structured XML Content
  • Alkis Polyzotis (UC Santa Cruz)
  • Minos Garofalakis (Intel Research, Berkeley)

2
XML Summarization
XML Data
  • Synopses are essential for XML data management
  • Statistics for XML query optimization
  • Approximate query answering
  • Active research topic in the field of XML
    databases
  • Markov Tables, XSketch, XPathLearner, CSTs,
    TreeSketch,...

3
Content Heterogeneity
  • Data
  • Numerical: 2003
  • String: "The history of histograms (abridged)", "Yannis Ioannidis"
  • Text: "The history of histograms is long and rich, full of
    detailed information in every step. It..."
  • Queries
  • Range predicates on numerical values
  • Substring predicates on string values
  • Term containment predicates on text values
  • Example: //paper[year>2000][author contains(Ioannidis)]//abstract[ftcontains(histograms,history)]

4
Synopses and Heterogeneity
//paper[year>2000][author contains(Ioannidis)]//abstract[ftcontains(histograms,history)]
  • Mixed predicates: unified summarization model
  • Path structure
  • Values of different types
  • Correlations between and across
  • Summarization for textual values

5
XCluster Synopses
  • Data synopses for heterogeneous XML content
  • Unified summarization for path structure and
    numerical, string, and textual content
  • Support for twig queries with mixed predicates
  • XCluster model: element clustering
  • Tight cluster: similar structure and values
  • Extensibility to other value types
  • Principled compression framework
  • Experimental results: high accuracy with low
    storage requirements

6
Outline
  • Preliminaries
  • XCluster Model
  • XCluster Compression
  • Construction Algorithm
  • Experimental Study

7
Data and Query Model
[Figure: data tree with Numerical, String, and Text values; query tree with nodes q0 to q3 and Range, Substring, and Term Containment predicates]
for q0 in /, q1 in q0/p[y>1999], q2 in q1/t[contains(XML)], q3 in q1/ab[ftcontains(synopsis,data)]
  • Tree data with heterogeneous value content
  • Tree-pattern queries with XPath expressions
  • Result set of binding tuples

8
Problem Definition
Synopsis
  • Problem: build a data synopsis that can estimate
    the selectivity of any query
  • Challenges
  • Heterogeneity of content
  • Data correlations

9
XCluster Model
10
Structural Summarization
XCluster
Data
  • Node: elements of the same tag
  • Statistical information: node- and edge-counts
  • Node-count: number of elements in the cluster
  • Edge-count: average number of children
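
The node- and edge-counts above can be sketched as a small Python structure. This is a minimal sketch rather than the authors' implementation: the names `SynopsisNode`, `count`, `edges`, and `expected_children`, as well as the toy numbers, are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SynopsisNode:
    """One synopsis node: a cluster of elements sharing a tag (hypothetical layout)."""
    tag: str
    count: int  # node-count: number of elements in the cluster
    # child node id -> edge-count (average number of children per element)
    edges: dict = field(default_factory=dict)


def expected_children(node: SynopsisNode, child_id: str) -> float:
    """Estimated total number of child elements reached through one edge."""
    return node.count * node.edges.get(child_id, 0.0)


# Toy example: 4 paper elements with 2.5 author children on average.
paper = SynopsisNode(tag="paper", count=4, edges={"author": 2.5})
total_authors = expected_children(paper, "author")
```

Storing the edge-count as a per-element average lets a single number stand in for every element of the cluster, which is accurate only when the cluster is tight.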

11
Value Summarization
XCluster
Data
  • Value summary: fractional value distribution
  • Single-dimensional
  • Approximation method depends on value type

12
Types of Value Summaries
  • Numerical content: histograms
  • String content: pruned suffix tries
  • Text content: end-biased term histograms

[Figure: text value ("The history of histograms is long and rich, full of detailed information in every step. It...") → term matrix → term histogram]
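
Of the three summary types, the end-biased term histogram is the easiest to illustrate. The sketch below assumes one common reading of "end-biased": the k most frequent terms keep their exact frequencies and all remaining terms share one uniform bucket holding their average. The function names, the tuple layout, and the toy frequencies are hypothetical, not taken from the slides.

```python
def build_term_histogram(term_freqs, k):
    """Keep the k most frequent terms exactly; summarize the rest by their average."""
    ranked = sorted(term_freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    exact = dict(ranked[:k])
    tail = ranked[k:]
    tail_terms = frozenset(term for term, _ in tail)
    tail_avg = sum(freq for _, freq in tail) / len(tail) if tail else 0.0
    return exact, tail_terms, tail_avg


def estimate_frequency(hist, term):
    """Exact frequency for retained terms, bucket average for the rest, 0 if unseen."""
    exact, tail_terms, tail_avg = hist
    if term in exact:
        return exact[term]
    return tail_avg if term in tail_terms else 0.0


# Toy fractional term frequencies (assumed values).
freqs = {"histograms": 0.75, "history": 0.5, "rich": 0.25, "long": 0.125}
hist = build_term_histogram(freqs, k=2)
```

With k=2, "histograms" and "history" are answered exactly, while "rich" and "long" both receive the bucket average.
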
13
XCluster Model
  • A node aggregates information about its elements
  • Correspondence to clustering: a node acts as the
    cluster centroid of its elements
  • Basic assumptions: independence and uniformity
  • Tight clusters: the assumptions are valid

14
Estimation Example
[Figure: query and XCluster, annotated with 1 element, 2 children, 1 · s_t children, 1/2 · s_k children]
sel(Q) = (1) · (2) · (1 · s_t) · (1/2 · s_k)
  • Two-step estimation algorithm
  • Identify embeddings
  • Estimate selectivity of each embedding
  • Accuracy depends on tightness of centroids
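
The second step of the estimation algorithm can be sketched as a product of counts, under the independence and uniformity assumptions of the model. This is an illustrative sketch only: embeddings are assumed to be already identified, each one is represented as a root node-count plus one `(edge_count, predicate_selectivity)` pair per query edge, and all names and numbers are hypothetical.

```python
from math import prod


def embedding_selectivity(root_count, steps):
    """Expected number of binding tuples produced by one embedding.

    steps holds one (edge_count, predicate_selectivity) pair per query edge.
    """
    return root_count * prod(edge * sel for edge, sel in steps)


def estimate(embeddings):
    """Total estimate: sum of the estimates of all embeddings of the query."""
    return sum(embedding_selectivity(root, steps) for root, steps in embeddings)


# Toy embedding: 1 root element, 2 children per element, then two branches
# whose value predicates are each satisfied by half of the elements.
embedding = (1, [(2.0, 1.0), (1.0, 0.5), (0.5, 0.5)])
sel = estimate([embedding])
```

Multiplying the factors treats every edge and predicate as independent, which is why the estimate degrades when clusters are loose.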

15
XCluster Compression
16
Structural Compression
  • Merge two nodes of same tag
  • New node acquires aggregate characteristics
  • Node- and edge-counts are aggregated
  • Value summaries are fused
  • Conceptually equivalent to cluster merging
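
A merge of two same-tag nodes can be sketched as follows. The slide says only that counts are aggregated; the sketch assumes node-counts are summed and edge-counts are averaged with node-count weights, so that the total number of children is preserved. Fusing value summaries and redirecting parent edges are omitted, and the dictionary layout is hypothetical.

```python
def merge_nodes(a, b):
    """Merge two synopsis nodes of the same tag into one aggregate node."""
    if a["tag"] != b["tag"]:
        raise ValueError("only nodes of the same tag can be merged")
    total = a["count"] + b["count"]
    children = set(a["edges"]) | set(b["edges"])
    # Weighted average keeps count * edge_count (total children) unchanged.
    edges = {
        c: (a["count"] * a["edges"].get(c, 0.0)
            + b["count"] * b["edges"].get(c, 0.0)) / total
        for c in children
    }
    return {"tag": a["tag"], "count": total, "edges": edges}


# Toy nodes (assumed values).
p1 = {"tag": "paper", "count": 1, "edges": {"author": 2.0}}
p2 = {"tag": "paper", "count": 3, "edges": {"author": 4.0, "year": 1.0}}
merged = merge_nodes(p1, p2)
```

The merged node describes the average element of the combined cluster, so information is lost exactly to the extent that the two original nodes differed.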

17
Value-Based Compression
  • Reduce the storage of a single value summary
  • Specifics depend on type of summary
  • Histogram: merge k buckets
  • Pruned Suffix Trie: prune k nodes
  • Remove leaf nodes based on statistical
    independence
  • Term Histogram: move k terms to the uniform bucket
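
For the histogram case, "merge k buckets" might look like the sketch below. It assumes buckets are `(lo, hi, freq)` triples sorted by range and that each step merges the adjacent pair with the closest density; that selection rule, the function names, and the toy data are assumptions for illustration.

```python
def density(bucket):
    """Frequency per unit of value range for one (lo, hi, freq) bucket."""
    lo, hi, freq = bucket
    return freq / (hi - lo)


def merge_once(buckets):
    """Merge the adjacent pair of buckets whose densities are closest."""
    i = min(range(len(buckets) - 1),
            key=lambda j: abs(density(buckets[j]) - density(buckets[j + 1])))
    lo, _, f1 = buckets[i]
    _, hi, f2 = buckets[i + 1]
    return buckets[:i] + [(lo, hi, f1 + f2)] + buckets[i + 2:]


def compress_histogram(buckets, k):
    """Apply up to k merges, stopping early if one bucket remains."""
    for _ in range(k):
        if len(buckets) < 2:
            break
        buckets = merge_once(buckets)
    return buckets


# Toy histogram over a numerical value such as a year (assumed values).
hist = [(1990, 1995, 0.125), (1995, 2000, 0.125), (2000, 2005, 0.75)]
smaller = compress_histogram(hist, k=1)
```

Merging buckets of similar density changes the estimates least, since the uniformity assumption inside the merged bucket remains nearly true.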

18
Compression vs. Accuracy
Original XCluster: S
Compressed XCluster: S'
  • Δ(S,S'): difference in accuracy between S and S'
  • Key idea: apply operations with low Δ(S,S')
  • Absolute vs. relative metric

[Figure: absolute vs. relative metric, relating S, S', and R]

19
Distance Metric Δ(S,S')
  • µ-query: basic query involving structure + values
  • u[s]/c: the number of children in c per element
    in u that satisfies value predicate s
  • Intuition: capture centroid information
    pertaining to c and s
  • Δ(S,S'): difference of estimates for µ-queries

20
XCluster Construction
21
XCluster Construction
[Figure: XML Data → (Step 1) → Reference Summary → (Step 2) → XCluster with detailed value distributions → (Step 3) → XCluster]
  • Step 1: Build reference synopsis
  • Count stability, detailed value summaries
  • Step 2: Compress structural information
  • Step 3: Compress value-based information

22
Structural Compression
[Figure: construction pipeline, Steps 1 to 3]
  • Algorithm sketch
  • Generate pool of candidate merge operations
  • Apply operations in increasing order of Δ(S,S')
  • Repeat until the size budget is met
  • A-priori generation of candidates
  • Merges at level l trigger merges at level l-1
  • Adaptive, leaf-to-root merging of nodes
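
The greedy loop in the algorithm sketch can be illustrated with a deliberately simplified toy. The assumptions are significant: nodes are `(tag, count, avg_children)` triples, size is the number of nodes, the candidate pool is regenerated on every iteration instead of being built a priori level by level, and the `delta` callback is a stand-in for the distance metric, not the metric defined in the talk.

```python
import itertools


def greedy_merge(nodes, budget, delta):
    """Merge same-tag nodes in increasing order of delta until len(nodes) <= budget.

    nodes maps a node id to a (tag, count, avg_children) triple.
    """
    nodes = dict(nodes)
    while len(nodes) > budget:
        candidates = [
            (delta(nodes[a], nodes[b]), a, b)
            for a, b in itertools.combinations(sorted(nodes), 2)
            if nodes[a][0] == nodes[b][0]
        ]
        if not candidates:
            break  # nothing left that can legally be merged
        _, a, b = min(candidates)
        (tag, ca, xa), (_, cb, xb) = nodes.pop(a), nodes.pop(b)
        nodes[a + "+" + b] = (tag, ca + cb, (ca * xa + cb * xb) / (ca + cb))
    return nodes


def toy_delta(u, v):
    """Stand-in distance: gap between the two average child counts."""
    return abs(u[2] - v[2])


synopsis = {
    "a1": ("author", 2, 1.0),
    "a2": ("author", 2, 1.5),
    "a3": ("author", 4, 5.0),
    "p1": ("paper", 1, 3.0),
}
result = greedy_merge(synopsis, budget=3, delta=toy_delta)
```

Here the two most similar author nodes are merged first, and the loop stops as soon as the node budget is reached.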

23
Value-Based Compression
[Figure: construction pipeline, Steps 1 to 3]
  • Algorithm sketch
  • Generate one operation for each value summary
  • Apply value compression with least Δ(S,S')
  • Repeat until the size budget is met
  • Generate operations of least effect
  • Histograms: merge buckets with least difference
  • PSTs: prune leaves with max independence
  • Term Histograms: remove singletons of least frequency

24
Experimental Study
25
Methodology
  • Data sets
  • Workloads: random twig queries
  • Structure only and with predicates
  • Biased toward high selectivities
  • Metrics
  • Absolute relative error: |true - estim| / max(true, s)
  • Absolute error: |true - estim|
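
The two metrics translate directly into code. In this sketch the parameter name `bound` stands for the quantity written s on the slide; the name and the sample numbers are assumptions.

```python
def absolute_error(true, estim):
    """Absolute error: |true - estim|."""
    return abs(true - estim)


def absolute_relative_error(true, estim, bound):
    """Absolute relative error: |true - estim| / max(true, bound)."""
    return abs(true - estim) / max(true, bound)


# Toy values: one query above the bound and one below it.
high = absolute_relative_error(true=200, estim=150, bound=100)
low = absolute_relative_error(true=10, estim=30, bound=100)
```

The max in the denominator keeps queries with very small true selectivity from producing inflated relative errors.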

26
Accuracy of XClusters
IMDB
27
XCluster vs. TreeSketch
XMark
28
Conclusions
  • XML synopses are essential for XML query
    optimization
  • Our contribution: XCluster synopses
  • XML summaries for heterogeneous content
  • Support for twig queries with numerical, string,
    and textual predicates
  • XCluster model: generalized element clustering
  • Principled construction algorithm
  • Experimental results: high accuracy with low
    storage requirements