Title: Experiments in clustering homogeneous XML documents to Validate an Existing Typology
1Experiments in clustering homogeneous XML
documents to Validate an Existing Typology
- Thierry Despeyroux
- Yves Lechevallier
- Brigitte Trousse
- Anne-Marie Vercoustre
- Inria
- Projet Axis
- E_mail firstname.surname_at_inria.fr
2Scientific Activity Report at Inria
3Homogeneous presentation
4Some RA figures
- 146 files
- 229 000 text lines
- 14.8 MB of data
- one DTD
- Optional sections
- Free style and content
5Grouping by Themes (2003)
6Grouping by Themes (2004)
7 Problem
- Presentation by Research themes
- That varies over time
- Not politically neutral (funding, evaluation)
- Is there any natural grouping?
- What is the role of different parts of the report
in highlighting the themes?
8Methodology
- Select specific parts by using the XML structure
- Select significant words by using a tool for
part-of-speech tagging and lemmatization (TreeTagger)
- Cluster the documents into disjoint clusters
- Evaluate those clusters
9Various experiments
- K-F: keywords from the foundations sections
- K-all: all keywords
- T-P: text in the presentation section
- T-PF: text in the presentation and foundations sections
- T-C: names of conferences, workshops, congresses,
etc. in the bibliography
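The selection by structure behind these experiments can be sketched as follows. This is a minimal illustration, not the authors' tool: the element names (`report`, `presentation`, `foundations`, `keyword`) and the `EXPERIMENT_PATHS` table are hypothetical, since the activity-report DTD is not shown in the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names; the real activity-report DTD is not given here.
EXPERIMENT_PATHS = {
    "K-F": [".//foundations/keyword"],            # keywords of the foundations sections
    "K-all": [".//keyword"],                      # every keyword in the report
    "T-P": [".//presentation"],                   # full text of the presentation section
    "T-PF": [".//presentation", ".//foundations"],
}

def select_text(xml_source, experiment):
    """Return the text chunks selected for one experiment, in document order per path."""
    root = ET.fromstring(xml_source)
    chunks = []
    for path in EXPERIMENT_PATHS[experiment]:
        for element in root.findall(path):
            # itertext() also collects text nested in child elements;
            # split/join normalizes whitespace.
            chunks.append(" ".join("".join(element.itertext()).split()))
    return chunks

SAMPLE = """<report project="A3">
  <presentation>Design methods and <b>tools</b> used by compilers.</presentation>
  <foundations>
    <keyword>code analysis</keyword>
    Static analysis of programs.
  </foundations>
  <results><keyword>scheduling</keyword></results>
</report>"""

print(select_text(SAMPLE, "K-F"))
print(select_text(SAMPLE, "T-P"))
```

A T-C variant would follow the same pattern with a path pointing at the venue fields of the bibliography, whatever they are called in the actual DTD.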
10TreeTagger
- XML Tree Tagger
- A3 presentation a3 JJ <unknown>
- A3 presentation designs NNS design
- A3 presentation methods NNS method
- A3 presentation and CC and
- A3 presentation tools NNS tool
- A3 presentation used VVN use
- A3 presentation by IN by
- A3 presentation compilers NNS compiler
- A3 presentation or CC or
- A3 presentation users NNS user
- A3 presentation for IN for
- A3 presentation code NN code
- A3 presentation analysis NN analysis
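A tagged listing like the one above can be turned into a per-project vocabulary along the following lines. This sketch assumes that each line holds five whitespace-separated fields (project, section, token, part-of-speech tag, lemma) and that only nouns with a known lemma are kept; the actual selection rule used in the experiments is not stated on the slide, so `KEPT_TAG_PREFIXES` is an illustrative choice.

```python
from collections import Counter, defaultdict

TAGGED = """\
A3 presentation a3 JJ <unknown>
A3 presentation designs NNS design
A3 presentation methods NNS method
A3 presentation and CC and
A3 presentation tools NNS tool
A3 presentation used VVN use
A3 presentation by IN by
A3 presentation compilers NNS compiler
A3 presentation code NN code
A3 presentation analysis NN analysis
"""

# Assumption: keep nouns only (tags starting with NN).
KEPT_TAG_PREFIXES = ("NN",)

def vocabulary_by_project(tagged_text):
    """Count the lemmas of the kept words, grouped by project."""
    counts = defaultdict(Counter)
    for line in tagged_text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # skip lines that do not match the assumed layout
        project, _section, _token, tag, lemma = fields
        if lemma == "<unknown>" or not tag.startswith(KEPT_TAG_PREFIXES):
            continue
        counts[project][lemma] += 1
    return counts

print(vocabulary_by_project(TAGGED)["A3"])
```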
11Clustering Method
The objective of the third step is to cluster the
documents into a set of disjoint classes, using the
vocabularies selected for the five
experiments. We use a partitioning method close to
the k-means algorithm, where the distance between
documents is based on word frequencies.
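As an illustration of this kind of partitioning, here is a plain k-means sketch over word-frequency vectors. It is not the authors' method: the squared Euclidean distance on relative frequencies, the seeding with the first k documents, and the toy `DOCS` data are all assumptions made for the example.

```python
def relative_frequencies(counts, vocabulary):
    """Map a word-count dict to a vector of relative frequencies over the vocabulary."""
    total = sum(counts.get(word, 0) for word in vocabulary) or 1
    return [counts.get(word, 0) / total for word in vocabulary]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def kmeans(vectors, k, max_iter=100):
    """Return one cluster label per vector (plain k-means, deterministic seeding)."""
    centers = [list(v) for v in vectors[:k]]  # assumption: seed with the first k documents
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centers[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable: converged
        labels = new_labels
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = mean(members)
    return labels

# Hypothetical word counts for four projects.
DOCS = {
    "P1": {"image": 4, "vision": 3},
    "P2": {"network": 5, "traffic": 2},
    "P3": {"image": 2, "vision": 2, "network": 1},
    "P4": {"network": 3, "traffic": 3},
}
VOCABULARY = sorted({word for counts in DOCS.values() for word in counts})
VECTORS = [relative_frequencies(counts, VOCABULARY) for counts in DOCS.values()]
LABELS = kmeans(VECTORS, k=2)
print(dict(zip(DOCS, LABELS)))
```

Each document ends up in exactly one cluster, which is what makes the result a partition that can later be compared with the existing themes.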
12K-F-a experiment list of representative Keywords
- Class 1: 3d approximation, computer,
differential, environment, modeling,
processing, programming, vision
- Class 2: computing, equation, grid, problem,
transformation
- Class 3: code, design, event, network,
processor, time, traffic
- Class 4: calculus, database, datum, image,
indexing, information, integration,
knowledge, logic, mining, pattern, recognition,
user, web
Each cluster can be associated with the list of its
most representative words. Those words can be
read as a summary of the class.
13Distribution of clusters compared to themes 2003
14Distribution of themes 2003 compared to clusters
15Partition of projects
16Partition of projects
17External Evaluation
The quality of the clusters can be evaluated by
comparing the resulting clusters with the
two lists of themes used by INRIA.
$n_{ik}$ is the number of research projects whose
report is classed in cluster $U_i$ and which are
allocated to group $C_k$ (theme $k$); $n_{i.}$ is the
number of research reports in cluster $U_i$; $n_{.k}$ is
the number of research projects allocated to group
$C_k$; $n$ is the total number of research projects
analysed.
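These counts form a contingency table between the two partitions. A small sketch of how it can be built from two label lists follows; the function names and the sample labels are hypothetical and only serve the example.

```python
from collections import Counter

def contingency(clusters, themes):
    """Cross-tabulate two labelings of the same projects.

    Returns (row labels, column labels, table), where table[i][k] is the
    number of projects in cluster i that are allocated to theme k.
    """
    rows = sorted(set(clusters))
    cols = sorted(set(themes))
    pairs = Counter(zip(clusters, themes))
    table = [[pairs[(r, c)] for c in cols] for r in rows]
    return rows, cols, table

def margins(table):
    """Return the row totals, the column totals, and the grand total n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return row_totals, col_totals, sum(row_totals)

# Hypothetical labels for six projects.
CLUSTERS = [0, 0, 1, 1, 1, 2]
THEMES = ["A", "A", "B", "B", "A", "C"]
ROWS, COLS, TABLE = contingency(CLUSTERS, THEMES)
print(TABLE)
print(margins(TABLE))
```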
18Two evaluation measures
- The F-measure proposed by (Jardine and
Rijsbergen, 1963) combines the precision and
recall measures between $U_i$ and $C_k$.
- recall is defined by $R(i,k) = n_{ik} / n_{i.}$
- precision is defined by $P(i,k) = n_{ik} / n_{.k}$
The F-measure is then computed between the a priori
partition U into K groups and the partition C of INRIA
projects obtained by the clustering method.
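The global formula itself is not legible in this text, so the sketch below relies on an assumption: it uses the common aggregation in which each row of the contingency table contributes its best harmonic mean of recall and precision, weighted by the row's share of the projects. Recall and precision follow the definitions given above; the aggregation may differ from the one the authors used.

```python
def f_measure(table):
    """Global F-measure of a contingency table (rows i, columns k).

    Assumed form: F = sum_i (n_i. / n) * max_k F(i, k),
    with F(i, k) = 2 * R(i, k) * P(i, k) / (R(i, k) + P(i, k)).
    """
    row_totals = [sum(row) for row in table]        # n_i.
    col_totals = [sum(col) for col in zip(*table)]  # n_.k
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        best = 0.0
        for k, n_ik in enumerate(row):
            if n_ik == 0:
                continue  # F(i, k) is 0 when the two groups do not overlap
            recall = n_ik / row_totals[i]
            precision = n_ik / col_totals[k]
            best = max(best, 2 * recall * precision / (recall + precision))
        total += row_totals[i] / n * best
    return total

# Hypothetical table: 3 groups of one partition against 3 groups of the other.
EXAMPLE_TABLE = [[2, 0, 0], [1, 2, 0], [0, 0, 1]]
print(f_measure(EXAMPLE_TABLE))
```

With this form the measure equals 1 only when the two partitions coincide, and it decreases as groups of one partition are spread over several groups of the other.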
- The corrected Rand index (CR), proposed by
Hubert and Arabie (1985), compares two
partitions.
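For reference, the corrected (adjusted) Rand index of Hubert and Arabie can be computed from the same contingency table by counting pairs of projects. The sketch below uses the standard pair-counting formula; the function name, the sample table, and the convention of returning 1.0 when the denominator vanishes are choices made for this example.

```python
from math import comb

def corrected_rand(table):
    """Corrected Rand index of a contingency table between two partitions."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    if n < 2:
        raise ValueError("at least two objects are needed to count pairs")
    # Pairs of objects placed together in both partitions.
    together_both = sum(comb(cell, 2) for row in table for cell in row)
    # Pairs placed together in each partition taken separately.
    together_rows = sum(comb(total, 2) for total in row_totals)
    together_cols = sum(comb(total, 2) for total in col_totals)
    expected = together_rows * together_cols / comb(n, 2)
    maximum = (together_rows + together_cols) / 2
    if maximum == expected:
        return 1.0  # convention for degenerate cases, e.g. two single-group partitions
    return (together_both - expected) / (maximum - expected)

# Hypothetical contingency table for six projects.
SAMPLE_TABLE = [[2, 0, 0], [1, 2, 0], [0, 0, 1]]
print(corrected_rand(SAMPLE_TABLE))
```

The index is 1 for identical partitions and close to 0 when the agreement is no better than chance, which makes values comparable across experiments with different numbers of clusters.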
19Results
20Conclusion
- Combination of selection by structure and by
linguistic terms
- Evaluation of clustering compared to an existing
typology
- The quality of clustering strongly depends on
the selected parts of the activity reports (which
in turn gives an indication of where the reports
could be improved)
- Future work
- Measuring the stability of clusters when K varies
- Evolution of classes over time
- Experiments with other collections