Naive clustering of a large XML document collection

1
Naive clustering of a large XML document collection

Antoine Doucet
University of Helsinki
Department of Computer Science
1st INEX Workshop
Schloss Dagstuhl, 10.12.2002

2
Introduction

Conjecture
As structure is information supplementary to the raw text,
there must exist a way to exploit it
that improves the clustering quality
Document Clustering
in IR
XML Document Clustering

3
Introduction

Clustering experiments
Various feature sets
k-means clustering
Clustering quality evaluation
Results and future work

4
Document Clustering
5
Document Clustering in IR

Prior to the query
Goal: form a taxonomy (Yahoo!)
Post-Retrieval Clustering
Cluster Hypothesis

6
XML Document Clustering

Most work is on data-centric XML, that is:
regular structure
few/poor text content
Tree edit distance
used as a preprocessing step for automated DTD
construction methods

7
Clustering Experiments
8
Vector Space Model

Various feature sets
tags only (183)
text only (188,417)
text and tags (188,600)
normalized tf-idf
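
The three feature sets above can be illustrated with a small sketch. This is not the code used in the experiments: the tokenizer, the `log(N/df)` idf variant, and the function names `features` and `tfidf_vectors` are assumptions made for the example. Tag features are written as `<name>` so they cannot collide with words.

```python
import math
import re
import xml.etree.ElementTree as ET
from collections import Counter

def features(xml_string, use_text=True, use_tags=True):
    """Count tag names and/or lower-cased words in one XML document."""
    counts = Counter()
    for elem in ET.fromstring(xml_string).iter():
        if use_tags:
            counts["<" + elem.tag + ">"] += 1
        if use_text:
            # elem.text precedes the children, elem.tail follows the element.
            for chunk in (elem.text, elem.tail):
                counts.update(re.findall(r"[a-z]+", (chunk or "").lower()))
    return counts

def tfidf_vectors(count_list):
    """Turn raw counts into unit-length tf-idf vectors (sparse dicts)."""
    n = len(count_list)
    df = Counter(f for counts in count_list for f in counts)
    vectors = []
    for counts in count_list:
        vec = {f: tf * math.log(n / df[f]) for f, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({f: w / norm for f, w in vec.items()} if norm else vec)
    return vectors
```

Calling `features(doc, use_text=False)` gives the tags-only representation, `use_tags=False` the text-only one, and the defaults give text and tags together.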

9
Clustering

Similarity measure: cosine
k-means algorithm

10
k-means clustering algorithm

Initialisation
k points chosen as centroids
Assign each point to the closest centroid
Iterate
Re-compute the centroid of each cluster
Assign each point to the closest centroid
Stop Condition
As soon as the centroids are stable
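
The steps above can be sketched in a few lines. This is a minimal illustration on dense lists, assuming random initial centroids drawn from the data and cosine as the similarity measure; the names `cosine` and `kmeans` and the `seed` and `max_iter` parameters are choices made for the example, not details from the experiments.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Initialisation: k points chosen as centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(max_iter):
        # Assign each point to the closest (most similar) centroid.
        assign = [max(range(k), key=lambda c: cosine(p, centroids[c]))
                  for p in points]
        # Re-compute the centroid of each cluster (keep it if the cluster is empty).
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            new_centroids.append(
                [sum(col) / len(members) for col in zip(*members)]
                if members else centroids[c])
        # Stop as soon as the centroids are stable.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return assign, centroids
```

Each iteration costs one similarity computation per point and per centroid.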

11
Clustering

Similarity measure: cosine
k-means algorithm
partitional
linear

12
Clustering Evaluation

Internal Quality Measures
based on average inter- and intra-cluster
similarities
for example
cohesiveness (a.k.a. overall similarity)
Inappropriate for our experiments
Because we use different feature sets, and
similarity measures are intrinsically tied to
the document representation, similarity
values are not comparable across different
feature sets (that is, across our
experiments).
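
For illustration, one common definition of cohesiveness is the average pairwise cosine similarity within a cluster, self-pairs included. The sketch below assumes that definition; the slides do not spell out a formula, and the function names are hypothetical.

```python
import math
from itertools import product

def cosine(a, b):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def cohesiveness(cluster):
    """Average cosine similarity over all ordered pairs of the cluster."""
    pairs = list(product(cluster, repeat=2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)
```

The value depends entirely on the vectors passed in, so the same grouping of documents scores differently under tag vectors and under text vectors.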

13
Clustering Evaluation

External Quality Measures
Based on an existing manual classification
Entropy is based on the class distribution within
a cluster.
Purity measures how much a cluster
specializes in a single class.
F-measure originates from IR.
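
Entropy and purity can be sketched as follows. The weighting by cluster size and the normalization of entropy by the logarithm of the number of classes (so that it falls between 0 and 1) are assumptions made for this example; the slides do not give the exact formulas.

```python
import math
from collections import Counter

def entropy_and_purity(clusters, labels):
    """clusters[i] is the cluster id of document i, labels[i] its known class."""
    n = len(labels)
    n_classes = len(set(labels))
    total_entropy = 0.0
    total_purity = 0.0
    for c in set(clusters):
        members = [lab for cl, lab in zip(clusters, labels) if cl == c]
        probs = [cnt / len(members) for cnt in Counter(members).values()]
        h = -sum(p * math.log(p) for p in probs)
        if n_classes > 1:
            h /= math.log(n_classes)  # assumed normalization to [0, 1]
        weight = len(members) / n     # larger clusters count more
        total_entropy += weight * h
        total_purity += weight * max(probs)
    return total_entropy, total_purity
```

Lower entropy and higher purity both indicate clusters that match the reference classes more closely.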

14
Clustering Evaluation

External Quality Measures
Which classification can we use ?
Usually (as in TREC), manual assessments are used as
classes
INEX assessments were not yet available
The only available classes are the journals in
which the articles were published

15
Clustering Evaluation

Problems with the journal title classification
The classes are disjoint.
This obviously does not suit text documents
The classes are probably too strict.
e.g., there should not be a strict border between
Transactions on Computers and Transactions on
Parallel and Distributed Systems
Finally, we used the 18 journal classes and an
extra volume class

16
Results
17
Results

15-way clustering

Features   Text     Tags    Text + Tags
Entropy    0.633    0.798   0.678
Purity     0.379    0.228   0.372
Time       754 s.   11 s.   837 s.

Text is best !
Tags only is very fast

18
Results

The volume class exception

Features    Text     Tags    Text + Tags
Entropy     0.722    0.016   1
Purity      0.295    0.992   0
Precision   28       99      100
Time        754 s.   11 s.   837 s.

The most discriminant text features are
misleading: january, society, publish
The most discriminant tag features are not:
<entity>, <title>, <sec1>

19
Results

This hints at a better method that uses the tag
features as a preprocessing step
k-means clustering based on tags only
The n clusters with a cohesiveness above a given
threshold are kept
(k-n)-means clustering of the remaining documents
based on text features only
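
The two-stage procedure above can be sketched as a small driver function. The clustering routine and the cohesiveness measure are passed in as `cluster_fn` and `cohesion_fn`; both names, like `two_stage` itself, are hypothetical placeholders for whatever implementations are used.

```python
def two_stage(tag_vecs, text_vecs, k, threshold, cluster_fn, cohesion_fn):
    """Pre-cluster on tags, keep cohesive clusters, cluster the rest on text.

    cluster_fn(vectors, k) returns one cluster id per vector;
    cohesion_fn(vectors) returns the cohesiveness of one cluster.
    """
    # Stage 1: k-means clustering based on tags only.
    tag_assign = cluster_fn(tag_vecs, k)
    kept = []
    for c in sorted(set(tag_assign)):
        idx = [i for i, a in enumerate(tag_assign) if a == c]
        # Keep the clusters whose cohesiveness is above the threshold.
        if cohesion_fn([tag_vecs[i] for i in idx]) > threshold:
            kept.append(idx)
    n = len(kept)
    final = [None] * len(tag_vecs)
    for label, idx in enumerate(kept):
        for i in idx:
            final[i] = label
    # Stage 2: (k-n)-means clustering of the remaining documents on text only.
    rest = [i for i, lab in enumerate(final) if lab is None]
    if rest:
        text_assign = cluster_fn([text_vecs[i] for i in rest],
                                 min(k - n, len(rest)))
        for i, a in zip(rest, text_assign):
            final[i] = n + a
    return final
```

The kept tag clusters receive the labels 0 to n-1, and the text clusters are numbered from n onwards, so the result is a single k-way partition.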

20
Results

15-way clustering with and without pre-clustering

           Standard Text   Tags pre-clustering
Entropy    0.633           0.630
Purity     0.379           0.394
Time       754 s.          11742 s.

Pre-clustering does slightly better indeed

21
Conclusion - Discussion
22
Conclusion

Recall: the evaluation is not 100% reliable
Combining structural similarity and content
similarity improves the clustering quality
But it is not straightforward !
We shall look for more sophisticated similarity
measures

23
Next

We need
More collections to validate the techniques
A better classification (manual assessments ?)

24
Next

More sophisticated similarity measures
How to weight XML elements ?
tf-idf ?
Element size ? Average element size ? Include the
sub-elements size ? How to normalize the
document vectors then ?
Store trees in the bag of words ?

25
Next

Structure feature selection
full path expression ?
local path expression ?
classes of tag labels ?
(tfmath, sgmlmath, math) -> metamath
Automatically ?
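
To make the first two options concrete, here is a sketch of how path features could be extracted. Reading "local path expression" as the last few tags of the path is an assumption, and `path_features` and its `depth` parameter are hypothetical names.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def path_features(xml_string, depth=None):
    """Count tag paths in one document.

    depth=None keeps the full path from the root;
    depth=d keeps only the last d tags of each path (a local path).
    """
    counts = Counter()

    def walk(elem, path):
        path = path + [elem.tag]
        local = path if depth is None else path[-depth:]
        counts["/".join(local)] += 1
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_string), [])
    return counts
```

With `depth=1` this reduces to plain tag counts, while larger values distinguish, for example, a title directly under the article from a title inside a section.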
