XCluster Synopses for Structured XML Content

1
XCluster Synopses for Structured XML Content

Alkis Polyzotis (UC Santa Cruz)
Minos Garofalakis (Intel Research, Berkeley)

2
XML Summarization
XML Data

Synopses are essential for XML data management
Statistics for XML query optimization
Approximate query answering
Active research topic in the field of XML databases
Markov Tables, XSketch, XPathLearner, CSTs, TreeSketch, ...

3
Content Heterogeneity

Data (value types)
Numerical: 2003
String: "The history of histograms (abridged)", "Yannis Ioannidis"
Text: "The history of histograms is long and rich, full of detailed information in every step. It..."

Queries (predicate types)
Range, Substring, Term Containment
//paper[year>2000][author contains(Ioannidis)]//abstract[ftcontains(histograms,history)]
4
Synopses and Heterogeneity
//paper[year>2000][author contains(Ioannidis)]//abstract[ftcontains(histograms,history)]

Mixed predicates: unified summarization model
Path structure
Values of different types
Correlations between and across
Summarization for textual values

5
XCluster Synopses

Data synopses for heterogeneous XML content
Unified summarization for path structure and numerical, string, and textual content
Support for twig queries with mixed predicates
XCluster model: element clustering
Tight cluster: similar structure and values
Extensibility to other value types
Principled compression framework
Experimental results: high accuracy with low storage requirements

6
Outline

Preliminaries
XCluster Model
XCluster Compression
Construction Algorithm
Experimental Study

7
Data and Query Model
Data (figure): tree with Numerical, String, and Text values
Query (figure): tree pattern over q0, q1, q2, q3 with Range, Substring, and Term Containment predicates
for q0 in /, q1 in q0/p[y>1999], q2 in q1/t[contains(XML)], q3 in q1/ab[ftcontains(synopsis,data)]

Tree data with heterogeneous value content
Tree-pattern queries with XPath expressions
Result set of binding tuples

8
Problem Definition
Synopsis

Problem: build a data synopsis that can estimate the selectivity of any query
Challenges
Heterogeneity of content
Data correlations

9
XCluster Model
10
Structural Summarization
XCluster
Data

Node: elements of the same tag
Statistical information: node-counts and edge-counts
Node-count: number of elements in the cluster
Edge-count: average number of children
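
The node-count and edge-count definitions can be illustrated with a small sketch. It assumes the coarsest possible clustering, one node per tag; an actual XCluster may keep several nodes for the same tag. The function name `structural_summary` and the sample document are hypothetical.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

def structural_summary(xml_text):
    """Cluster elements by tag (assumption: one cluster per tag).

    Returns (node_count, edge_count) where
      node_count[tag]            = number of elements in the cluster
      edge_count[(parent, child)] = average number of `child` children
                                    per element of the `parent` cluster
    """
    root = ET.fromstring(xml_text)
    node_count = defaultdict(int)
    child_total = defaultdict(int)
    for elem in root.iter():
        node_count[elem.tag] += 1
        for child in elem:
            child_total[(elem.tag, child.tag)] += 1
    edge_count = {
        edge: total / node_count[edge[0]]
        for edge, total in child_total.items()
    }
    return dict(node_count), edge_count

# Hypothetical sample document: two papers with two and one authors.
doc = "<bib><paper><author/><author/></paper><paper><author/></paper></bib>"
nodes, edges = structural_summary(doc)
```

Here the `paper` cluster holds 2 elements with 3 `author` children in total, so the stored edge-count is the average 1.5 rather than the exact per-element counts.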

11
Value Summarization
XCluster
Data

Value summary: fractional value distribution
Single-dimensional
Approximation method depends on value type

12
Types of Value Summaries

Numerical content: histograms
String content: pruned suffix tries
Text content: end-biased term histograms

(Figure: Text → Term Matrix → Term Histogram, for the text "The history of histograms is long and rich, full of detailed information in every step. It...")
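
A minimal sketch of an end-biased term histogram, under stated assumptions: the most frequent terms keep their exact fraction, and every remaining term shares one average fraction in a uniform bucket. The names `end_biased_term_histogram` and `estimate` are hypothetical, and the membership set for the uniform bucket is stored explicitly only for clarity; a real synopsis would need a compact encoding.

```python
from collections import Counter

def end_biased_term_histogram(texts, top_k):
    """Summarize term containment over a cluster of text values.

    Returns (exact, rest_terms, uniform):
      exact      : term -> fraction of texts containing it (top_k terms)
      rest_terms : terms folded into the uniform bucket
      uniform    : average fraction shared by all terms in rest_terms
    """
    n = len(texts)
    doc_freq = Counter()
    for text in texts:
        doc_freq.update(set(text.lower().split()))
    frac = {term: count / n for term, count in doc_freq.items()}
    # Most frequent first; ties broken alphabetically for determinism.
    ranked = sorted(frac.items(), key=lambda kv: (-kv[1], kv[0]))
    exact = dict(ranked[:top_k])
    rest = ranked[top_k:]
    uniform = sum(f for _, f in rest) / len(rest) if rest else 0.0
    return exact, {t for t, _ in rest}, uniform

def estimate(term, hist):
    """Estimated fraction of texts that contain `term`."""
    exact, rest_terms, uniform = hist
    if term in exact:
        return exact[term]
    return uniform if term in rest_terms else 0.0

# Hypothetical text values of three elements in one cluster.
hist = end_biased_term_histogram(
    ["xml synopsis data", "xml histograms", "xml data"], top_k=2
)
```

With `top_k=2`, the terms `xml` and `data` keep exact fractions, while `synopsis` and `histograms` are both estimated by the uniform bucket.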
13
XCluster Model

A node aggregates information about its elements
Correspondence to clustering: node ↔ cluster, centroid ↔ element
Basic assumptions: independence and uniformity
Tight clusters: the assumptions are valid

14
Estimation Example
(Figure: Query matched against the XCluster)
1 element
2 children
1·σt children
1/2·σk children
sel(Q) = (1) · (2) · (1·σt) · (1/2·σk)

Two-step estimation algorithm
Identify embeddings
Estimate selectivity of each embedding
Accuracy depends on tightness of centroids
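
The second step of the estimation can be sketched as a product along each embedding, relying on the independence and uniformity assumptions: start from the node-count of the root match and multiply by the edge-count and the value-predicate fraction of every query edge. This is a minimal sketch; the function names and the numbers in the example are hypothetical and do not come from the slides.

```python
from math import prod

def embedding_selectivity(root_count, steps):
    """Estimate the number of binding tuples for one embedding.

    root_count : node-count of the synopsis node matched by the query root
    steps      : one (edge_count, predicate_fraction) pair per query edge;
                 predicate_fraction is 1.0 when the edge has no predicate
    """
    return root_count * prod(edge * frac for edge, frac in steps)

def query_selectivity(embeddings):
    """Sum the estimates over all embeddings found in step one."""
    return sum(embedding_selectivity(rc, steps) for rc, steps in embeddings)

# Hypothetical embedding: 1 root element, 2 children each, then two
# predicate edges with assumed value-predicate fractions 0.5 and 0.4.
example = [(1, [(2, 1.0), (1, 0.5), (0.5, 0.4)])]
estimate = query_selectivity(example)
```

Because every factor is a centroid average, the estimate is only as good as the cluster is tight: elements that deviate strongly from the averages break the independence and uniformity assumptions.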

15
XCluster Compression
16
Structural Compression

Merge two nodes of same tag
New node acquires aggregate characteristics
Node- and edge-counts are aggregated
Value summaries are fused
Conceptually equivalent to cluster merging
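
A minimal sketch of the count aggregation in a merge, assuming node-counts add up and each edge-count becomes the node-count-weighted average of the two originals. The dictionary layout and the name `merge_nodes` are hypothetical; fusing value summaries and redirecting incoming edges are omitted.

```python
def merge_nodes(u, v):
    """Merge two synopsis nodes of the same tag.

    Each node is a dict with keys:
      "tag"   : element tag
      "count" : node-count (number of elements in the cluster)
      "edges" : child node id -> edge-count (average children per element)
    """
    if u["tag"] != v["tag"]:
        raise ValueError("only nodes of the same tag can be merged")
    total = u["count"] + v["count"]
    children = set(u["edges"]) | set(v["edges"])
    edges = {
        c: (u["count"] * u["edges"].get(c, 0.0)
            + v["count"] * v["edges"].get(c, 0.0)) / total
        for c in children
    }
    return {"tag": u["tag"], "count": total, "edges": edges}

# Hypothetical nodes: 2 papers with 2 authors each and no year,
# 6 papers with 1 author and 1 year each.
u = {"tag": "paper", "count": 2, "edges": {"author": 2.0}}
v = {"tag": "paper", "count": 6, "edges": {"author": 1.0, "year": 1.0}}
merged = merge_nodes(u, v)
```

The weighted average preserves the total number of child elements, but the merged centroid no longer distinguishes papers that have a year from those that do not.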

17
Value-Based Compression

Reduce the storage of a single value summary
Specifics depend on type of summary
Histogram: merge k buckets
Pruned Suffix Trie: prune k nodes
Remove leaf nodes based on statistical independence
Term Histogram: move k terms to the uniform bucket
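
For the histogram case, a minimal sketch of one bucket merge and its effect on a range estimate. It assumes buckets are stored as (low, high, fraction) triples over half-open intervals and that values are uniform inside a bucket; the function names and the sample buckets are hypothetical.

```python
def merge_adjacent(buckets, i):
    """Merge bucket i with bucket i+1; fractions add, the range is the union."""
    lo, _, f1 = buckets[i]
    _, hi, f2 = buckets[i + 1]
    return buckets[:i] + [(lo, hi, f1 + f2)] + buckets[i + 2:]

def range_fraction(buckets, a, b):
    """Estimated fraction of values in [a, b), uniform within each bucket."""
    total = 0.0
    for lo, hi, frac in buckets:
        overlap = max(0.0, min(b, hi) - max(a, lo))
        total += frac * overlap / (hi - lo)
    return total

# Hypothetical histogram over publication years.
original = [(1990, 1995, 0.1), (1995, 2000, 0.3), (2000, 2005, 0.6)]
compressed = merge_adjacent(original, 1)

before = range_fraction(original, 2000, 2005)
after = range_fraction(compressed, 2000, 2005)
```

The merge saves one bucket, but the estimate for the range [2000, 2005) drops from 0.6 to 0.45 because the skew between the two merged buckets is averaged away.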

18
Compression vs. Accuracy
Original XCluster S
Compressed XCluster S'

Δ(S,S'): difference in accuracy between S and S'
Key idea: apply operations with low Δ(S,S')
Absolute vs. Relative metric

(Figure: S, S', and R under the Absolute and Relative metrics)
19
Distance Metric Δ(S,S')

µ-query: basic query involving structure + values
u[σ]/c: the number of children in c per element in u that satisfies value predicate σ
Intuition: capture centroid information pertaining to c and σ
Δ(S,S'): difference of estimates for µ-queries
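
A minimal sketch of how such a distance could steer compression. The slides do not give the exact formula, so this assumes the distance is the sum of squared differences between the µ-query estimates of the two synopses, and that each candidate operation is judged by the estimates it would produce. All names and numbers are hypothetical.

```python
def distance(est_before, est_after):
    """Sum of squared differences of µ-query estimates (assumed formula).

    Each argument maps a µ-query, written here as a
    (parent, predicate, child) tuple, to its estimated value.
    """
    keys = set(est_before) | set(est_after)
    return sum(
        (est_before.get(k, 0.0) - est_after.get(k, 0.0)) ** 2 for k in keys
    )

def cheapest_operation(est_before, candidates):
    """Pick the candidate operation whose result stays closest to the original."""
    return min(candidates, key=lambda name: distance(est_before, candidates[name]))

# Hypothetical µ-query estimates on the original synopsis.
original = {
    ("paper", "year>2000", "author"): 1.2,
    ("paper", "true", "title"): 1.0,
}
# Hypothetical estimates after each candidate compression step.
candidates = {
    "merge_a": {("paper", "year>2000", "author"): 1.1, ("paper", "true", "title"): 1.0},
    "merge_b": {("paper", "year>2000", "author"): 0.6, ("paper", "true", "title"): 1.0},
}
best = cheapest_operation(original, candidates)
```

In this example `merge_a` perturbs the estimates far less than `merge_b`, so it is the operation a greedy compression step would apply first.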

20
XCluster Construction
21
XCluster Construction
(Figure labels: XML Data; Reference Summary; XCluster with detailed value distributions; XCluster; Step 1, Step 2, Step 3)