Experiments in clustering homogeneous XML documents to Validate an Existing Typology

I-Know 2005


1
Experiments in clustering homogeneous XML
documents to Validate an Existing Typology
  • Thierry Despeyroux
  • Yves Lechevallier
  • Brigitte Trousse
  • Anne-Marie Vercoustre
  • Inria
  • Axis project
  • E-mail: firstname.surname@inria.fr

2
Scientific Activity Report at Inria
3
Homogeneous presentation
4
Some activity report (RA) figures
  • 146 files
  • 229,000 lines of text
  • 14.8 MB of data
  • one DTD
  • Optional sections
  • Free style and content

5
Grouping by Themes (2003)
6
Grouping by Themes (2004)
7
Problem
  • Presentation by Research themes
  • That varies over time
  • Not politically neutral (funding, evaluation)
  • Is there any natural grouping?
  • What is the role of the different parts of the
    report in highlighting the themes?

8
Methodology
  • Select specific parts by using the XML structure
  • Select significant words by using a tool for
    syntactic typing and stemming (TreeTagger)
  • Cluster the documents into disjoint clusters
  • Evaluate those clusters
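
The first step, selecting parts of a report through its XML structure, can be sketched as follows. This is a minimal illustration using Python's standard library; the sample document and its element names (`raweb`, `presentation`, `foundations`, `results`, `kw`) are assumptions made for the example and are not taken from the actual DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical report fragment; the element names are illustrative only.
SAMPLE = """<raweb>
  <presentation><p>Design methods and tools for compilers.</p></presentation>
  <foundations><keywords><kw>code analysis</kw><kw>compiler</kw></keywords></foundations>
  <results><p>Not selected in this experiment.</p></results>
</raweb>"""

def select_text(xml_string, section_tags):
    """Return the concatenated text of every element whose tag is requested."""
    root = ET.fromstring(xml_string)
    parts = []
    for tag in section_tags:
        for section in root.iter(tag):
            # itertext() walks all nested text; whitespace-only chunks are dropped.
            chunks = [t.strip() for t in section.itertext() if t.strip()]
            parts.append(" ".join(chunks))
    return " ".join(parts)

# Two selections, loosely mirroring a "presentation only" and a
# "presentation plus foundations" experiment.
print(select_text(SAMPLE, ["presentation"]))
print(select_text(SAMPLE, ["presentation", "foundations"]))
```

Changing the list of tags is all that is needed to move from one selection to another, which is what makes the XML structure convenient for this kind of experiment.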

9
Various experiments
  • K-F: keywords from the foundations sections
  • K-all: all keywords
  • T-P: text in the presentation section
  • T-PF: text in the presentation and foundations
    sections
  • T-C: names of conferences, workshops, congresses,
    etc. in the bibliography

10
TreeTagger
  • XML Tree Tagger
  • A3 presentation a3 JJ <unknown>
  • A3 presentation designs NNS design
  • A3 presentation methods NNS method
  • A3 presentation and CC and
  • A3 presentation tools NNS tool
  • A3 presentation used VVN use
  • A3 presentation by IN by
  • A3 presentation compilers NNS compiler
  • A3 presentation or CC or
  • A3 presentation users NNS user
  • A3 presentation for IN for
  • A3 presentation code NN code
  • A3 presentation analysis NN analysis
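
The tagged lines above can be filtered to keep only significant lemmas. The sketch below assumes that the five columns are project, section, token, part-of-speech tag and lemma (a reading of the slide, not a documented format), and that nouns are the words worth keeping; the function name `significant_lemmas` and the `keep_prefixes` parameter are hypothetical.

```python
# A few lines copied from the slide, one tagged token per line.
TAGGED = """\
A3 presentation designs NNS design
A3 presentation methods NNS method
A3 presentation and CC and
A3 presentation used VVN use
A3 presentation code NN code
A3 presentation analysis NN analysis
"""

def significant_lemmas(tagged_text, keep_prefixes=("NN",)):
    """Return (project, lemma) pairs for tokens whose tag starts with a kept prefix."""
    lemmas = []
    for line in tagged_text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # skip malformed or empty lines
        project, section, token, pos, lemma = fields
        # Assumption: tokens the tagger could not lemmatise are discarded.
        if pos.startswith(keep_prefixes) and lemma != "<unknown>":
            lemmas.append((project, lemma))
    return lemmas

print(significant_lemmas(TAGGED))
```

Using the lemma rather than the token merges forms such as "designs" and "design" into a single vocabulary entry.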

11
Clustering Method
The objective of the third step is to cluster the
documents into a set of disjoint classes, using the
vocabularies selected for the five
experiments. We use a partitioning method close to
the k-means algorithm, in which the distance between
documents is based on word frequencies.
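
A partitioning step of this kind can be sketched with plain k-means on relative word frequencies. The slides do not specify the exact distance or initialisation, so the squared Euclidean distance, the fixed initial centers and the toy documents below are all assumptions made for illustration.

```python
from collections import Counter

def frequency_vector(words, vocabulary):
    """Relative frequency of each vocabulary word in one document."""
    counts = Counter(words)
    total = sum(counts[w] for w in vocabulary) or 1
    return [counts[w] / total for w in vocabulary]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, initial_centers, iterations=20):
    """Return one cluster label per vector (plain k-means, fixed iteration count)."""
    centers = [list(c) for c in initial_centers]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each document joins its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: squared_distance(v, centers[k]))
            for v in vectors
        ]
        # Update step: each center becomes the mean of its members.
        for k in range(len(centers)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy documents (hypothetical project names and word lists).
docs = {
    "P1": ["compiler", "code", "code", "analysis"],
    "P2": ["code", "compiler", "design"],
    "P3": ["image", "mining", "image", "web"],
    "P4": ["web", "mining", "knowledge"],
}
vocabulary = sorted({w for words in docs.values() for w in words})
vectors = [frequency_vector(words, vocabulary) for words in docs.values()]
labels = kmeans(vectors, initial_centers=[vectors[0], vectors[2]])
print(dict(zip(docs, labels)))
```

Every document receives exactly one label, so the result is a partition into disjoint classes, as the method requires.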
12
K-F-a experiment: list of representative keywords
  • Class 1: 3d approximation, computer,
    differential, environment, modeling,
    processing, programming, vision
  • Class 2: computing, equation, grid, problem,
    transformation
  • Class 3: code, design, event, network,
    processor, time, traffic
  • Class 4: calculus, database, datum, image,
    indexing, information, integration,
    knowledge, logic, mining, pattern, recognition,
    user, web

Each cluster can be associated with the list of its
most representative words. Those words can be
interpreted as summaries of the classes.
13
Distribution of clusters compared to the 2003 themes
14
Distribution of the 2003 themes compared to clusters
15
Partition of projects
16
Partition of projects
17
External Evaluation
The quality of the clusters can be evaluated by
comparing the resulting clusters with the
two lists of themes used by INRIA.
$n_{ik}$ is the number of research projects whose
report is classed in cluster $U_i$ and that are allocated
to group $C_k$ (theme $k$). $n_{i\cdot}$ is the number of
research reports in cluster $U_i$, $n_{\cdot k}$ is the
number of research projects allocated to group $C_k$,
and $n$ is the total number of research projects
analysed.
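
The counts defined above form a contingency table between clusters and themes. A minimal sketch of how it could be built from two label lists is shown here; the function name `contingency` and the sample labels are hypothetical.

```python
from collections import Counter

def contingency(clusters, themes):
    """Return the cell counts, both marginals, and the total number of projects."""
    n_ik = Counter(zip(clusters, themes))  # projects in cluster i and theme k
    n_i = Counter(clusters)                # row marginal: reports per cluster
    n_k = Counter(themes)                  # column marginal: projects per theme
    return n_ik, n_i, n_k, len(clusters)

# One entry per project: its cluster and its a priori theme (made-up values).
clusters = [1, 1, 2, 2, 2, 3]
themes = ["A", "A", "A", "B", "B", "B"]

n_ik, n_i, n_k, n = contingency(clusters, themes)
for (i, k), count in sorted(n_ik.items()):
    print(f"cluster {i}, theme {k}: {count}")
```

Both evaluation measures depend only on these counts, so the table can be computed once and reused.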
18
Two evaluation measures
  • The F-measure, proposed by Jardine and
    van Rijsbergen (1971), combines the precision and
    recall measures between $U_i$ and $C_k$.
  • Recall is defined by $R(i,k) = n_{ik} / n_{i\cdot}$
  • Precision is defined by $P(i,k) = n_{ik} / n_{\cdot k}$

The F-measure between the a priori partition U in
K groups and the partition C of INRIA projects produced by
the clustering method is (formula not reproduced in the transcript).
The corrected Rand index (CR) was proposed by
Hubert and Arabie (1985) to compare two
partitions.
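
Both measures can be computed from the cluster and theme labels of the projects. The sketch below is not the authors' code. The corrected Rand index follows the standard Hubert and Arabie formula. For the overall F-measure, which the transcript does not give, it assumes a commonly used form: for each theme, take the best F(i,k) = 2RP/(R+P) over all clusters and weight it by the theme's share of projects.

```python
from collections import Counter
from math import comb

def f_measure(clusters, themes):
    """Weighted best-match F-measure (assumed form, see lead-in)."""
    n_ik = Counter(zip(clusters, themes))
    n_i = Counter(clusters)
    n_k = Counter(themes)
    n = len(clusters)
    total = 0.0
    for k, size_k in n_k.items():
        # With R = n_ik / n_i. and P = n_ik / n_.k, the harmonic mean
        # 2RP / (R + P) simplifies to 2 * n_ik / (n_i. + n_.k).
        best = max(2 * n_ik[(i, k)] / (n_i[i] + size_k) for i in n_i)
        total += size_k / n * best
    return total

def corrected_rand(clusters, themes):
    """Corrected (adjusted) Rand index between two partitions."""
    n_ik = Counter(zip(clusters, themes))
    sum_ik = sum(comb(c, 2) for c in n_ik.values())
    sum_i = sum(comb(c, 2) for c in Counter(clusters).values())
    sum_k = sum(comb(c, 2) for c in Counter(themes).values())
    expected = sum_i * sum_k / comb(len(clusters), 2)
    # Undefined (division by zero) when both partitions are trivial.
    return (sum_ik - expected) / ((sum_i + sum_k) / 2 - expected)

same = ([1, 1, 2, 2], ["A", "A", "B", "B"])
crossed = ([1, 1, 2, 2], ["A", "B", "A", "B"])
print(f_measure(*same), corrected_rand(*same))
print(f_measure(*crossed), corrected_rand(*crossed))
```

The corrected Rand index is 1 for identical partitions and close to 0 for agreement at chance level, which makes it easier to compare across experiments than the raw Rand index.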
19
Results
20
Conclusion
  • Combination of selection by structure and by
    linguistic terms
  • Evaluation of clustering compared to an existing
    typology
  • The quality of clustering strongly depends on
    the selected parts in the activity reports (which
    in turn gives an indication on where the report
    could be improved)
  • Future work:
    • Measuring the stability of clusters when K varies
    • Evolution of classes over time
    • Experiments with other collections