Clustering XML Documents for Query Performance Enhancement


1
Clustering XML Documents for Query Performance
Enhancement
  • Wang Lian

2
Outline
  • Related Work
  • Motivation
  • Our Approach
  • S-Graph
  • Distance Function
  • Clustering Algorithm
  • Experimental Results

3
Related Work
  • Besides storing XML documents in their native
    format, using RDBMS is an established trend.
  • There are two main approaches to storing XML
    documents in an RDBMS:
  • Schema-mapping
  • Structure-mapping

4
Related Work(cont)
  • In schema-mapping
  • A database schema is derived from the DTD of the XML
    documents, so different DTDs produce
    different database schemas.
  • In structure-mapping
  • The database schema is fixed by defining a set of
    generic tables.
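
The structure-mapping idea can be sketched as follows. This is a minimal illustration, not the table layout used in the presentation: the single generic table `node` and its columns, the helper `shred`, and the sample document are all assumptions made for the example. It stores each element as one row and then answers a parent/child path with a self-join.

```python
import sqlite3
import xml.etree.ElementTree as ET

def shred(conn, xml_text):
    """Store every element of one document as a row in a generic table.

    The table name `node` and its columns are hypothetical.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS node ("
        "id INTEGER PRIMARY KEY, parent_id INTEGER, tag TEXT, text TEXT)"
    )

    def insert(elem, parent_id):
        cur = conn.execute(
            "INSERT INTO node (parent_id, tag, text) VALUES (?, ?, ?)",
            (parent_id, elem.tag, (elem.text or "").strip()),
        )
        for child in elem:
            insert(child, cur.lastrowid)

    insert(ET.fromstring(xml_text), None)

conn = sqlite3.connect(":memory:")
shred(conn, "<conference><author>A</author><author>B</author></conference>")

# Answering /conference/author/text() needs a join between parent and child rows.
rows = conn.execute(
    "SELECT c.text FROM node p JOIN node c ON c.parent_id = p.id "
    "WHERE p.tag = 'conference' AND c.tag = 'author' ORDER BY c.id"
).fetchall()
print(rows)
```

Because the schema never changes, any document can be loaded without consulting a DTD, at the price of a join for every step of a path.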

5
Motivation
  • With either schema-mapping or structure-mapping, documents
    must be cut into pieces and inserted into tables.
    To answer a query, the tables must be joined to
    provide the answers.
  • As the tables grow larger, the join cost
    may become very high.
  • Our observation is that if a collection contains
    documents of different structures, then
    clustering the documents by structure may reduce
    the join cost.

6
Motivation(cont)
  • An example
  • DTD is

7
Motivation(cont)
  • Documents in 3 clusters

8
Motivation(cont)
  • Unpartitioned schema

9
Motivation(cont)
  • Partitioned schema

10
Motivation(cont)
  • Suppose we want to answer the XPath query
    /conference/author/text()
  • Using the unpartitioned schema, tables conference (2
    tuples) and author (9 tuples) must be joined.
  • Using the partitioned schema, tables conference1 (2
    tuples) and author1 (3 tuples) must be joined.
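
To make the saving concrete, here is a small worked calculation. It assumes a naive nested-loop join, whose work is proportional to the product of the two table sizes; the presentation does not say which join method the database uses, and the function name is hypothetical.

```python
def nested_loop_pairs(outer_rows, inner_rows):
    """Number of tuple pairs a naive nested-loop join compares (assumed cost model)."""
    return outer_rows * inner_rows

unpartitioned = nested_loop_pairs(2, 9)  # conference joined with author
partitioned = nested_loop_pairs(2, 3)    # conference1 joined with author1
print(unpartitioned, partitioned)  # 18 6
```

Under this cost model the partitioned schema compares a third as many tuple pairs for the same query.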

11
Our Approach
  • An XML document is a mixture of structural
    information and data values. In our context, only
    the structural information is used for clustering.
  • We need a suitable distance function before applying
    any clustering algorithm.

12
S-Graph
  • Given a set of XML documents C, the structure
    graph (s-graph) of C, sg(C) = (N, E), is a
    directed graph such that N is the set of all the
    elements and attributes in the documents in C, and
    (a, b) ∈ E if and only if a is a parent element of
    b in some document in C (b can be an element or an
    attribute).
  • An s-graph does not capture all the structural
    information of the documents, but it does capture the
    parent-child relationships, which are valuable for
    evaluating path expressions.
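
A minimal sketch of building the edge set of an s-graph with Python's standard XML parser. The function name `s_graph` and the "@" prefix used to mark attribute nodes are assumptions of this example, not part of the definition; the prefix keeps an attribute apart from an element of the same name.

```python
import xml.etree.ElementTree as ET

def s_graph(documents):
    """Return the edge set of sg(C) for an iterable of XML strings."""
    edges = set()
    for text in documents:
        stack = [ET.fromstring(text)]
        while stack:
            node = stack.pop()
            # Attributes count as children of the element that carries them.
            for name in node.attrib:
                edges.add((node.tag, "@" + name))  # "@" prefix is an assumption
            for child in node:
                edges.add((node.tag, child.tag))
                stack.append(child)
    return frozenset(edges)

doc = '<conference year="2002"><title/><author><name/></author></conference>'
print(sorted(s_graph([doc])))
```

Repeated children add no new edges, so two documents that differ only in how many times an element occurs share the same s-graph.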

13
Distance Function
  • For two sets, C1 and C2, of XML documents, the
    distance between them is
    dist(C1, C2) = 1 - |sg(C1) ∩ sg(C2)| / max(|sg(C1)|, |sg(C2)|)
  • where |sg(Ci)| is the number of edges in
    sg(Ci), i = 1, 2, and sg(C1) ∩ sg(C2) is the set of
    common edges of sg(C1) and sg(C2).
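
A minimal sketch of this distance in Python. It assumes that s-graphs are represented as sets of (parent, child) edges and that the distance is one minus the number of common edges divided by the larger of the two edge counts; the function name and the sample graphs are hypothetical.

```python
def dist(sg1, sg2):
    """Distance between two s-graphs given as sets of (parent, child) edges."""
    larger = max(len(sg1), len(sg2))
    if larger == 0:
        return 0.0  # assumption: two empty s-graphs are treated as identical
    return 1.0 - len(sg1 & sg2) / larger

# Hypothetical s-graphs: three of four edges in common.
g1 = {("a", "b"), ("a", "c"), ("a", "d"), ("d", "e")}
g2 = {("a", "b"), ("a", "c"), ("a", "d"), ("d", "f")}
g3 = {("x", "y")}

print(dist(g1, g2))  # 0.25
print(dist(g1, g3))  # 1.0
```

The value is 0 for identical s-graphs and 1 when no edge is shared.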

14
Distance Function(cont)
Dist(doc1, doc2) = 1 and Dist(doc2, doc3) = 0.25
Tree-dist(doc1, doc2) = Tree-dist(doc2, doc3) = 1

15
Clustering Algorithm
  • Input: X, the set of XML documents
  • Input: k, the number of clusters specified by the user
  • SG = pre-clustering(X)
  • While (number of remaining clusters > k)
  • Merge the clusters Ci and Cj that maximize a
    predefined goodness function
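
The two phases can be sketched as below. This is an illustration under stated assumptions, not the presenter's implementation: documents are passed in as ready-made s-graphs (sets of edges), pre-clustering groups documents with identical s-graphs, and, since the goodness function is not given in the slides, the default here simply prefers the pair with the smallest s-graph distance. All function names are hypothetical.

```python
from itertools import combinations

def dist(sg1, sg2):
    """Assumed distance: 1 - common edges / larger edge count."""
    larger = max(len(sg1), len(sg2))
    return 0.0 if larger == 0 else 1.0 - len(sg1 & sg2) / larger

def pre_cluster(sgraphs):
    """Group document indices by identical s-graph in one pass."""
    groups = {}
    for i, sg in enumerate(sgraphs):
        groups.setdefault(frozenset(sg), []).append(i)
    return list(groups.items())

def cluster(sgraphs, k, goodness=lambda a, b: -dist(a, b)):
    """Merge pre-clusters until k remain; `goodness` is a stand-in assumption."""
    clusters = pre_cluster(sgraphs)
    while len(clusters) > max(k, 1):
        # Choose the pair of clusters with the highest goodness.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: goodness(clusters[p[0]][0], clusters[p[1]][0]),
        )
        # The merged cluster's s-graph is the union of the two edge sets.
        merged = (clusters[i][0] | clusters[j][0], clusters[i][1] + clusters[j][1])
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return clusters
```

This loop rescans every pair after each merge, so it is simpler but slower than a heap-based merge; it is meant only to show the control flow.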

16
Clustering Algorithm(cont)
  • Complexity
  • n = |X|, m = |SG|
  • Time complexity
  • The upper bound of pre-clustering is O(nm); in
    general, it can be reduced to O(n).
  • Iterative merging: O(m² log m)
  • Space complexity
  • O(m²)

17
Experimental Results
  • The clustering algorithm is tested on real data,
    the DBLP XML records, which contain about
    200,000 documents composed of 36 elements.
  • Pre-clustering is effective: after scanning the 200,000
    documents, only 233 distinct s-graphs
    remain, so the subsequent clustering takes
    less than 2 seconds.

18
Experimental Results(cont)
  • After setting the number of clusters to 3, we
    get three clusters containing about 193,000
    documents: one for article and the other two for
    inproceedings.
  • Interestingly, in the two inproceedings
    clusters, one s-graph is a subgraph of
    the other.

19
Experimental Results(cont)
  • We use Oracle 8.1.5 to store all the documents in
    4 versions:
  • Version 1: unpartitioned schema-mapping
  • Version 2: partitioned schema-mapping
  • Version 3: unpartitioned structure-mapping
  • Version 4: partitioned structure-mapping

20
Experimental Results
  • Query types
  • Q1: /A1/A2/.../Ak, all possible absolute XPaths
    in the documents.
  • Q2: /A1/A2/.../Ak[text()="value"]/text()
  • absolute XPaths in which Ai, i = 1, ..., k, are
    randomly picked and "value" is the value of Ak in
    some documents.
  • Q3: /A1/A2/.../Ak[contains(., "substring")]/text()
  • same as Q2 except that the condition tested is
    that Ak contains a substring
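
A minimal sketch of how such queries could be generated from a document. It assumes Q1 enumerates every distinct root-to-element path, Q2 adds an equality test on the text of the last step, and Q3 replaces that test with a substring test; the function names and the sample document are hypothetical, and this is not the query generator used in the experiments.

```python
import xml.etree.ElementTree as ET

def absolute_paths(xml_text):
    """Q1: every distinct absolute path /A1/A2/.../Ak in one document."""
    paths = set()

    def walk(elem, prefix):
        path = prefix + "/" + elem.tag
        paths.add(path)
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return sorted(paths)

def q2(path, value):
    """Q2: select the text of Ak where it equals `value`."""
    return f'{path}[text()="{value}"]/text()'

def q3(path, substring):
    """Q3: select the text of Ak where it contains `substring`."""
    return f'{path}[contains(., "{substring}")]/text()'

doc = "<article><author>x</author><title>y</title><author>z</author></article>"
print(absolute_paths(doc))
print(q2("/article/title", "y"))
print(q3("/article/title", "y"))
```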

21
Experimental Results

22
Question?