Experiments in Clustering Homogeneous XML Documents to Validate an Existing Typology

I-Know 2005.


1
Experiments in Clustering Homogeneous XML
Documents to Validate an Existing Typology

Thierry Despeyroux
Yves Lechevallier
Brigitte Trousse
Anne-Marie Vercoustre
Inria
Projet Axis
E-mail: firstname.surname@inria.fr

2
Scientific Activity Report at Inria
3
Homogeneous presentation
4
Some activity report (RA) figures

146 files
229,000 lines of text
14.8 MB of data
One DTD
Optional sections
Free style and content

5
Grouping by Themes (2003)
6
Grouping by Themes (2004)
7
Problem

Presentation by Research themes
That varies over time
Not politically neutral (funding, evaluation)
Is there any natural grouping?
What is the role of different parts of the report
in highlighting the themes?

8
Methodology

Select specific parts by using the XML structure
Select significant words by using a tool for
syntactic typing and stemming (TreeTagger)
Cluster the documents into disjoint clusters
Evaluate those clusters
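
The first step, selecting parts of a report through its XML structure, might look like the following minimal sketch. The slides do not show the DTD, so the element names used here (raweb, presentation, foundations, keywords, kw) and the sample content are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature report: the real DTD and its element names
# are not given in the slides.
SAMPLE_REPORT = """
<raweb project="A3">
  <presentation><p>Designs methods and tools used by compilers.</p></presentation>
  <foundations>
    <keywords><kw>code analysis</kw><kw>compilation</kw></keywords>
    <p>Program transformation for performance.</p>
  </foundations>
</raweb>
"""

def section_text(root, section):
    """Return the text found under every <section> element, whitespace-normalised."""
    chunks = []
    for node in root.iter(section):
        # Join text fragments with a space so adjacent elements do not merge.
        chunks.append(" ".join(" ".join(node.itertext()).split()))
    return " ".join(chunks)

def keywords(root, section=None):
    """Return <kw> texts, restricted to one section when `section` is given."""
    scopes = root.iter(section) if section else [root]
    return [kw.text.strip() for scope in scopes for kw in scope.iter("kw")]

root = ET.fromstring(SAMPLE_REPORT)
t_p = section_text(root, "presentation")   # input for a T-P style experiment
k_f = keywords(root, "foundations")        # input for a K-F style experiment
k_all = keywords(root)                     # input for a K-all style experiment
print(t_p)
print(k_f)
```

Keeping the selection in two small functions makes it easy to vary which parts of the report feed each experiment without touching the later steps.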

9
Various experiments

K-F: keywords from the foundations sections
K-all: all keywords
T-P: text in the presentation section
T-PF: text in the presentation and foundations
sections
T-C: names of conferences, workshops, congresses,
etc. in the bibliography

10
TreeTagger

XML                            TreeTagger
A3  presentation  a3           JJ   <unknown>
A3  presentation  designs      NNS  design
A3  presentation  methods      NNS  method
A3  presentation  and          CC   and
A3  presentation  tools        NNS  tool
A3  presentation  used         VVN  use
A3  presentation  by           IN   by
A3  presentation  compilers    NNS  compiler
A3  presentation  or           CC   or
A3  presentation  users        NNS  user
A3  presentation  for          IN   for
A3  presentation  code         NN   code
A3  presentation  analysis     NN   analysis
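
A minimal sketch of how significant words could be picked from tagged lines in the five-column layout above. The slides do not say which part-of-speech tags were retained, so keeping only nouns (tags starting with NN) and dropping unknown lemmas is an assumption, as is reading the five fields as project, section, word, tag and lemma.

```python
# Tagged lines in the five-column layout shown on the slide.
TAGGED = """\
A3 presentation a3 JJ <unknown>
A3 presentation designs NNS design
A3 presentation methods NNS method
A3 presentation and CC and
A3 presentation tools NNS tool
A3 presentation used VVN use
A3 presentation by IN by
A3 presentation compilers NNS compiler
A3 presentation or CC or
A3 presentation users NNS user
A3 presentation for IN for
A3 presentation code NN code
A3 presentation analysis NN analysis
"""

# Assumption: only nouns are treated as significant words.
KEPT_PREFIXES = ("NN",)

def significant_lemmas(tagged_text, kept_prefixes=KEPT_PREFIXES):
    """Return the lemmas of the words whose tag starts with a kept prefix."""
    lemmas = []
    for line in tagged_text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        _project, _section, _word, tag, lemma = fields
        if lemma == "<unknown>":
            continue  # the tagger could not lemmatise this token
        if tag.startswith(kept_prefixes):
            lemmas.append(lemma)
    return lemmas

lemmas = significant_lemmas(TAGGED)
print(lemmas)
```

Working on lemmas rather than surface forms means "designs" and "design" count as the same vocabulary entry in the later frequency computation.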

11
Clustering Method
The objective of the third step is to cluster the
documents into a set of disjoint classes, using
the vocabularies selected for the five
experiments. We use a partitioning method close to
the k-means algorithm, in which the distance between
documents is based on word frequency.
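
As an illustration of the general idea, here is a plain k-means sketch over word-frequency vectors. It is not the authors' method: the slides only say the method is close to k-means, so the squared Euclidean distance on relative frequencies, the deterministic choice of initial centres and the toy documents are all assumptions.

```python
from collections import Counter

def frequency_vector(words, vocabulary):
    """Relative frequency of each vocabulary word in one document."""
    counts = Counter(words)
    total = sum(counts[w] for w in vocabulary) or 1
    return [counts[w] / total for w in vocabulary]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, max_iter=100):
    """Assign each vector to one of k disjoint clusters; returns a label list."""
    # Assumption: the first k vectors serve as initial centres (deterministic).
    centres = [list(v) for v in vectors[:k]]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centres[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centre if a cluster becomes empty
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical documents, already reduced to lists of lemmas.
docs = {
    "p1": ["code", "compiler", "code", "analysis"],
    "p2": ["image", "database", "image", "mining"],
    "p3": ["compiler", "code", "design"],
    "p4": ["database", "mining", "web", "image"],
}
vocabulary = sorted({w for words in docs.values() for w in words})
vectors = [frequency_vector(words, vocabulary) for words in docs.values()]
labels = kmeans(vectors, k=2)
print(dict(zip(docs, labels)))
```

Each document ends up in exactly one cluster, which is what makes the result a partition that can later be compared with the existing themes.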
12
K-F-a experiment: list of representative keywords

Class 1: 3d approximation, computer,
differential, environment, modeling,
processing, programming, vision
Class 2: computing, equation, grid, problem,
transformation
Class 3: code, design, event, network,
processor, time, traffic
Class 4: calculus, database, datum, image,
indexing, information, integration,
knowledge, logic, mining, pattern, recognition,
user, web

Each cluster can be associated with the list of its
most representative words. Those words can be
interpreted as a summary of the class.
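
The slides do not say how the representative words were chosen. One simple possibility, sketched here purely as an assumption, is to rank each word by how much more frequent it is inside a cluster than in the whole collection. The documents and labels are hypothetical.

```python
from collections import Counter

def representative_words(docs, labels, cluster, top_n=3):
    """Rank words by in-cluster relative frequency minus overall relative frequency."""
    inside = Counter()
    overall = Counter()
    for words, label in zip(docs, labels):
        overall.update(words)
        if label == cluster:
            inside.update(words)
    n_inside = sum(inside.values()) or 1
    n_overall = sum(overall.values()) or 1
    scores = {w: inside[w] / n_inside - overall[w] / n_overall for w in inside}
    # Highest score first; ties are broken alphabetically for reproducibility.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_n]

# Hypothetical documents (lists of lemmas) and cluster labels.
docs = [
    ["code", "compiler", "code", "analysis"],
    ["image", "database", "image", "mining"],
    ["compiler", "code", "design"],
    ["database", "mining", "web", "image"],
]
labels = [0, 1, 0, 1]
print(representative_words(docs, labels, cluster=0))
print(representative_words(docs, labels, cluster=1))
```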
13
Distribution of clusters compared to themes 2003
14
Distribution of themes 2003 compared to clusters
15
Partition of projects
16
Partition of projects
17
External Evaluation
The quality of the clusters can be evaluated by
comparing the resulting clusters with the two
lists of themes used by INRIA.
$n_{ik}$ is the number of research projects whose
report is classed in cluster $U_i$ and which are
allocated to group $C_k$ (theme $k$). $n_{i\cdot}$ is
the number of research reports in cluster $U_i$,
$n_{\cdot k}$ is the number of research projects
allocated to group $C_k$, and $n$ is the total
number of research projects analysed.
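
The counts defined above form a contingency table between clusters and themes. A minimal sketch of how such a table could be built from two labelings of the same projects follows. The cluster numbers and theme labels in the example are hypothetical.

```python
from collections import Counter

def contingency(clusters, themes):
    """Return (table, row_totals, col_totals, n) for two labelings of the same projects.

    table[i][k] counts the projects in cluster i and theme k,
    row_totals[i] is the size of cluster i,
    col_totals[k] is the size of theme k.
    """
    if len(clusters) != len(themes):
        raise ValueError("both labelings must cover the same projects")
    rows = sorted(set(clusters))
    cols = sorted(set(themes))
    pair_counts = Counter(zip(clusters, themes))
    table = [[pair_counts[(r, c)] for c in cols] for r in rows]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return table, row_totals, col_totals, len(clusters)

# Hypothetical example: six projects, three clusters, three themes.
clusters = [0, 0, 1, 1, 1, 2]
themes = ["1a", "1a", "2b", "2b", "1a", "3a"]
table, row_totals, col_totals, n = contingency(clusters, themes)
print(table)
```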
18
Two evaluation measures

The F-measure proposed by (Jardine and van
Rijsbergen, 1971) combines the precision and
recall measures between $U_i$ and $C_k$.
Recall is defined by $R(i,k) = n_{ik} / n_{i\cdot}$
Precision is defined by $P(i,k) = n_{ik} / n_{\cdot k}$

An F-measure is also defined between the a priori
partition U into K groups and the partition C of
INRIA projects produced by the clustering method.
The corrected Rand index (CR) proposed by
(Hubert and Arabie, 1985) is used to compare two
partitions.
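
Both measures can be computed from a contingency table, as in the sketch below. The partition-level F-measure formula is not reproduced in this transcript, so the code assumes a widely used form: the harmonic mean of recall and precision for each pair, then the best value per row weighted by the row's share of the projects. The corrected Rand index follows the usual Hubert and Arabie formulation. The example tables are hypothetical.

```python
from math import comb

def f_measure(table):
    """Assumed form: sum over rows i of (n_i. / n) * max over k of F(i, k)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        best = 0.0
        for k, n_ik in enumerate(row):
            if n_ik == 0:
                continue  # F(i, k) is zero for an empty cell
            recall = n_ik / row_totals[i]
            precision = n_ik / col_totals[k]
            best = max(best, 2 * recall * precision / (recall + precision))
        total += row_totals[i] / n * best
    return total

def corrected_rand(table):
    """Corrected Rand index computed from pair counts in the contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    if n < 2:
        raise ValueError("at least two objects are needed")
    pairs = comb(n, 2)
    sum_cells = sum(comb(v, 2) for row in table for v in row)
    sum_rows = sum(comb(v, 2) for v in row_totals)
    sum_cols = sum(comb(v, 2) for v in col_totals)
    expected = sum_rows * sum_cols / pairs
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:
        return 1.0  # convention for degenerate but identical partitions
    return (sum_cells - expected) / (maximum - expected)

identical = [[2, 0], [0, 3]]                # two partitions that agree exactly
mixed = [[2, 0, 0], [1, 2, 0], [0, 0, 1]]   # partial agreement
print(f_measure(identical), corrected_rand(identical))
print(f_measure(mixed), corrected_rand(mixed))
```

Both measures reach 1 when the two partitions coincide. The corrected Rand index is close to 0 when the agreement is no better than chance, which makes it easier to compare across experiments with different numbers of clusters.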
19
Results
20
Conclusion

Combination of selection by structure and by
linguistic terms
Evaluation of clustering compared to an existing
typology
The quality of clustering strongly depends on
the parts selected in the activity reports (which
in turn gives an indication of where the reports
could be improved)
Future work
Measuring the stability of clusters when K varies
Evolution of classes over time
Experiments with other collections
