Clustering XML Documents for Query Performance Enhancement

Provided by: happy7

Transcript and Presenter's Notes

1
Clustering XML Documents for Query Performance
Enhancement

2
Outline

3
Related Work

Besides storing XML documents in their native
format, storing them in an RDBMS is an
established trend.
There are two main approaches to storing XML
documents in an RDBMS:
Schema-mapping
Structure-mapping

4
Related Work(cont)

In schema-mapping
A database schema is derived from the DTD of the
XML documents, so different DTDs generate
different database schemas.
In structure-mapping
The database schema is fixed by defining a set of
generic tables.
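
To make the structure-mapping idea concrete, here is a minimal sketch that flattens one XML document into rows of a single generic table. The row layout (node_id, parent_id, name, text), the function name shred_to_edge_rows, and the sample document are all illustrative assumptions; the presentation does not say which generic tables it uses.

```python
import xml.etree.ElementTree as ET


def shred_to_edge_rows(xml_text):
    """Flatten one XML document into rows of a single generic table.

    Each row is (node_id, parent_id, name, text). This column layout is an
    illustrative assumption, not the schema used in the presentation.
    """
    rows = []
    root = ET.fromstring(xml_text)
    # Stack of (element, parent_id) pairs for an explicit depth-first walk.
    stack = [(root, None)]
    while stack:
        elem, parent_id = stack.pop()
        node_id = len(rows) + 1
        text = (elem.text or "").strip() or None
        rows.append((node_id, parent_id, elem.tag, text))
        # Push children in reverse so they are popped in document order.
        for child in reversed(list(elem)):
            stack.append((child, node_id))
    return rows


# Hypothetical sample document.
doc = "<conference><name>VLDB</name><author>Ann</author></conference>"
rows = shred_to_edge_rows(doc)
for row in rows:
    print(row)
```

Whatever the document looks like, the table layout stays the same. Rebuilding a path such as conference/author then requires joining the table with itself on parent_id, which is the join cost discussed next.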

5
Motivation

With either schema mapping or structure mapping,
documents must be cut into pieces and inserted
into tables.
To answer a query, the tables must be joined.
As the tables grow larger, the join cost may
become very high.
Our observation is that if a collection contains
documents of different structures, then
clustering the documents by structure may reduce
the join cost.

6
Motivation(cont)

7
Motivation(cont)

8
Motivation(cont)

9
Motivation(cont)

10
Motivation(cont)

Suppose we want to answer the XPath query
/conference/author/text()
With the unpartitioned schema, tables conference
(2 tuples) and author (9 tuples) must be joined.
With the partitioned schema, tables conference1
(2 tuples) and author1 (3 tuples) must be joined.
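
As a rough illustration of the difference, the sketch below counts the tuple pairs a naive nested-loop join would examine for the table sizes quoted above. The nested-loop cost model and the function name nested_loop_pairs are assumptions made for illustration; a real RDBMS may use indexes or hash joins and behave differently.

```python
def nested_loop_pairs(outer_size, inner_size):
    """Number of tuple pairs a naive nested-loop join examines."""
    return outer_size * inner_size


# Table sizes taken from the example above.
unpartitioned = nested_loop_pairs(2, 9)  # conference x author
partitioned = nested_loop_pairs(2, 3)    # conference1 x author1
print(unpartitioned, partitioned)  # prints: 18 6
```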

11
Our Approach

An XML document is a mixture of structural
information and data values. In our context, only
the structural information is used for clustering.
We need a suitable distance function before
applying any clustering algorithm.

12
S-Graph

Given a set of XML documents C, the structure
graph (s-graph) of C, sg(C) = (N, E), is a
directed graph such that N is the set of all the
elements and attributes in the documents in C and
(a, b) ∈ E if and only if a is a parent element of
b in some document in C (b can be an element or
an attribute).
The s-graph does not capture all the structural
information of the documents, but it does capture
the parent-child relationship, which is valuable
for evaluating path expressions.
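
The definition above can be sketched in a few lines of Python. This version collects only the edge set E, since N can be read off the edges. The function name s_graph_edges, the "@" prefix for attribute names, and the sample document are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def s_graph_edges(xml_texts):
    """Return the edge set of the s-graph for a collection of XML documents."""
    edges = set()
    for xml_text in xml_texts:
        root = ET.fromstring(xml_text)
        # root.iter() visits every element, including the root itself.
        for parent in root.iter():
            for child in parent:
                edges.add((parent.tag, child.tag))
            for attr_name in parent.attrib:
                # Assumption: prefix attributes with "@" so they cannot
                # collide with element names.
                edges.add((parent.tag, "@" + attr_name))
    return edges


# Hypothetical sample collection with a single document.
docs = ["<conference><author id='1'><name/></author></conference>"]
edges = s_graph_edges(docs)
print(sorted(edges))
```

Because the result is a set, repeated parent-child pairs within or across documents contribute a single edge, matching the "if and only if" in the definition.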

13
Distance Function

For two sets of XML documents, C1 and C2, the
distance between them is defined as
dist(C1, C2) = 1 - |sg(C1) ∩ sg(C2)| / max{|sg(C1)|, |sg(C2)|}
where |sg(Ci)| is the number of edges in
sg(Ci), i = 1, 2, and sg(C1) ∩ sg(C2) is the set of
common edges of sg(C1) and sg(C2).
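
A minimal sketch of such a distance, assuming it is computed as 1 minus the number of common edges divided by the larger of the two edge counts. The function name s_graph_distance, the handling of two empty graphs, and the sample edge sets are hypothetical.

```python
def s_graph_distance(edges1, edges2):
    """Distance between two s-graphs given as sets of (parent, child) edges."""
    common = len(edges1 & edges2)
    larger = max(len(edges1), len(edges2))
    if larger == 0:
        # Assumption: two empty s-graphs are treated as identical.
        return 0.0
    return 1.0 - common / larger


# Hypothetical edge sets.
g1 = {("a", "b"), ("a", "c"), ("b", "d"), ("b", "e")}
g2 = {("a", "b"), ("a", "c"), ("b", "d")}
g3 = {("x", "y")}

print(s_graph_distance(g1, g2))  # 3 common edges out of 4: prints 0.25
print(s_graph_distance(g1, g3))  # no common edges: prints 1.0
```

The value is 0 for identical s-graphs and 1 for s-graphs that share no edge.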

14
Distance Function(cont)

Dist(doc1, doc2) = 1 and Dist(doc2, doc3) = 0.25
Tree-dist(doc1, doc2) = Tree-dist(doc2, doc3) = 1

15
Clustering Algorithm

16
Clustering Algorithm(cont)

Complexity
n = |X|, m = |SG|
Time complexity
Pre-clustering: the upper bound is O(nm); in
general, it can be reduced to O(n).
Iterative merging: O(m^2 log m)
Space complexity
O(m^2)
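
The slides do not spell out the pre-clustering step. One plausible reading, consistent with the later remark that 200,000 documents yield only 233 distinct s-graphs, is that documents with identical s-graphs are grouped together before any merging. The sketch below follows that assumption. The function name pre_cluster and the input format are hypothetical, and the hash-based grouping is only one way the O(nm) comparison bound could drop toward O(n).

```python
from collections import defaultdict


def pre_cluster(doc_edges):
    """Group document ids by identical s-graph edge sets.

    doc_edges maps a document id to its set of (parent, child) edges.
    """
    groups = defaultdict(list)
    for doc_id, edges in doc_edges.items():
        # frozenset is hashable, so equal edge sets land in the same bucket
        # with one expected dictionary lookup per document.
        groups[frozenset(edges)].append(doc_id)
    return dict(groups)


# Hypothetical input: three documents, two distinct s-graphs.
doc_edges = {
    "d1": {("a", "b")},
    "d2": {("a", "b")},
    "d3": {("a", "c")},
}
groups = pre_cluster(doc_edges)
print(len(groups))  # prints: 2
```

Any later merging step then works on the distinct s-graphs rather than on individual documents.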

17
Experimental Results

The clustering algorithm was tested on real data,
the DBLP XML records, which contain about
200,000 documents composed of 36 elements.
Pre-clustering is effective: after scanning the
200,000 documents, only 233 distinct s-graphs
remain, so the subsequent clustering takes less
than 2 seconds.

18
Experimental Results(cont)

After setting the number of clusters to 3, we
get three clusters containing about 193,000
documents: one for article and the other two for
inproceedings.
Interestingly, in the two inproceedings clusters,
one s-graph is a subgraph of the other.

19
Experimental Results(cont)