Title: GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING
1GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING
- by
- Istvan Jonyer,
- Lawrence B. Holder and
- Diane J. Cook
- The University of Texas at Arlington
2Outline
- What is hierarchical conceptual clustering?
- Overview of Subdue
- Conceptual clustering in Subdue
- Evaluation of hierarchical clusterings
- Experiments and results
- Conclusions
3What is clustering?
4What is hierarchical conceptual clustering?
- Unsupervised concept learning
- Generating hierarchies to explain data
- Applications
- Hypothesis generation and testing
- Prediction based on groups
- Finding taxonomies
5Example hierarchical conceptual clustering
6The Problem
- Hierarchical conceptual clustering in discrete-valued structural databases
- Existing systems
- Continuous-valued
- Discrete but unstructured
- We can do better! (Field underexplored)
7Related Work
- Cobweb
- Labyrinth
- AutoClass
- Snob
- In Euclidean space: Chameleon, Cure
- Unsupervised learning algorithms
8The Solution
- Take Subdue and extend it!
9Overview of Subdue
- Data mining in graph representations of
structural databases
10Overview of Subdue
- Iteratively searching for best substructure by
MDL heuristic
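A minimal sketch of how an MDL-style score could rank candidate substructures: the value is the description length of the substructure S plus the description length of the graph G compressed with S, and smaller is better. The bit encoding in `graph_bits`, the function names, and the assumption of non-overlapping instances are illustrative simplifications, not Subdue's actual encoding.

```python
import math

def graph_bits(num_vertices, num_edges, num_labels):
    """Crude description length of a labeled graph in bits (assumed encoding)."""
    if num_vertices == 0:
        return 0.0
    label_bits = math.log2(max(num_labels, 2))
    vertex_bits = num_vertices * label_bits
    # Each edge stores two endpoint indices plus one label.
    edge_bits = num_edges * (2 * math.log2(max(num_vertices, 2)) + label_bits)
    return vertex_bits + edge_bits

def mdl_value(graph, sub, instances, num_labels):
    """DL(S) + DL(G|S); graph and sub are (num_vertices, num_edges) pairs.

    Assumes the instances of the substructure do not overlap.
    """
    gv, ge = graph
    sv, se = sub
    # Each instance collapses to a single vertex; its internal edges vanish.
    compressed_vertices = gv - instances * (sv - 1)
    compressed_edges = ge - instances * se
    # The compressed graph uses one extra label for the substructure vertex.
    return (graph_bits(sv, se, num_labels)
            + graph_bits(compressed_vertices, compressed_edges, num_labels + 1))
```

Under this toy encoding, a substructure with more instances yields a lower value, so the search would prefer it over one that occurs less often.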
11Overview of Subdue
- Compress using best substructure
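The compression step can be sketched as follows: every instance of the chosen substructure is replaced by one new vertex, edges inside an instance are dropped, and edges crossing the instance boundary are redirected to the new vertex. The graph representation (a vertex dictionary plus an edge list) and the name `compress` are assumptions made for illustration only.

```python
def compress(vertices, edges, instances, sub_label):
    """Replace each instance (a set of vertex ids) by one vertex labeled sub_label.

    vertices: dict mapping vertex id -> label
    edges: list of (source id, target id, edge label)
    instances: list of disjoint sets of vertex ids
    """
    owner = {}          # original vertex id -> id of the replacing vertex
    new_vertices = {}
    for i, instance in enumerate(instances):
        new_id = f"{sub_label}#{i}"
        new_vertices[new_id] = sub_label
        for v in instance:
            owner[v] = new_id
    # Vertices outside every instance are kept unchanged.
    for v, label in vertices.items():
        if v not in owner:
            new_vertices[v] = label
    new_edges = []
    for u, v, label in edges:
        cu, cv = owner.get(u, u), owner.get(v, v)
        if cu == cv and u in owner:
            continue    # edge internal to an instance: absorbed by the new vertex
        new_edges.append((cu, cv, label))
    return new_vertices, new_edges
```

For example, compressing two "A on B" instances that both touch a vertex "C" leaves three vertices and two edges.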
12Overview of Subdue
- Fuzzy match
- Inexact matching of subgraphs
- Applications
- Defining fuzzy concepts
- Evaluation of clusterings
13Conceptual Clustering with Subdue
- Use Subdue to identify clusters
- The best subgraph in an iteration defines a cluster
- When to stop within an iteration?
- Use limit option
- Use size option
- Use first minimum heuristic (new)
14The First Minimum Heuristic
- Use subgraph at first local minimum
- Detect it using prune2 option
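The stopping idea can be illustrated with a short sketch that scans a sequence of description-length values (one per successively larger candidate subgraph) and returns the position of the first local minimum. How the prune2 option detects this internally is not described here; the function name and the input format are hypothetical.

```python
def first_local_minimum(values):
    """Return the index of the first value that is followed by a larger one.

    If the sequence never increases, the last index is returned.
    """
    if not values:
        raise ValueError("values must be non-empty")
    for i in range(len(values) - 1):
        if values[i + 1] > values[i]:
            return i
    return len(values) - 1
```

With the values `[10, 8, 6, 7, 5, 9]` the function returns index 2, even though the global minimum sits at index 4.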
15The First Minimum Heuristic
- Not a greedy heuristic!
- Although the first local minimum is usually the global minimum
- First local minimum is caused by a smaller, more frequently occurring subgraph
- Subsequent minima are caused by bigger, less frequently occurring subgraphs
- => First subgraph is more general
16The First Minimum Heuristic
- A multi-minimum search space
17Lattice vs. Tree
- Previous work defined classification trees
- Inadequate in structured domains
- Better hierarchical description: classification lattice
- A cluster can have more than one parent
- A parent can be at any level (not only one level above)
18Hierarchical Clustering in Subdue
- Subdue can compress by a subgraph after each iteration
- Subsequent clusters may be defined in terms of previously defined clusters
- This results in a hierarchy
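A minimal sketch of how such a hierarchy could be recorded: each cluster lists the labels used in its defining subgraph, and any label that names an earlier cluster becomes a parent. Because a definition may mention several earlier clusters, the result is a lattice, not a tree. The data layout, the `root` placeholder, and the name `build_lattice` are assumptions for illustration.

```python
def build_lattice(definitions):
    """Map each cluster to its parents.

    definitions: dict in discovery order, cluster name -> set of labels that
    appear in the cluster's defining subgraph. Labels equal to the name of an
    earlier cluster are treated as references to that cluster.
    """
    parents = {}
    for name, labels in definitions.items():
        # Only clusters discovered earlier are already keys of `parents`.
        found = sorted(label for label in labels if label in parents)
        parents[name] = found or ["root"]
    return parents
```

A cluster defined with the labels `{"C1", "C2", "E"}` after `C1` and `C2` were found gets both as parents.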
19Hierarchical Conceptual Clustering of an
Artificial Domain
20Hierarchical Conceptual Clustering of an
Artificial Domain
21Evaluation of Clusterings
- Traditional evaluation
- Not applicable to hierarchical domains
- No known evaluation for hierarchical clusterings
- Most hierarchical evaluations are anecdotal
22New Evaluation Heuristic for Hierarchical
Clusterings
- Properties of a good clustering
- Small number of clusters
- Large coverage => good generality
- Big cluster descriptions
- More features => more inferential power
- Minimal or no overlap between clusters
- More distinct clusters => better defined concepts
23New Evaluation Heuristic for Hierarchical
Clusterings
- Big clusters: bigger distance between disjoint clusters
- Overlap: less overlap => bigger distance
- Few clusters: averaging comparisons
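To make the three properties concrete, here is a toy stand-in for such a measure; it is not the heuristic used in this work. Each cluster description is modeled as a set of features, the distance between two clusters is the size of their symmetric difference, and the score is the average over all cluster pairs. The feature-set representation and the name `toy_quality` are assumptions.

```python
from itertools import combinations

def toy_quality(clusters):
    """Mean symmetric-difference size over all pairs of cluster descriptions.

    clusters: list of sets of features (a stand-in for graph descriptions).
    """
    pairs = list(combinations(clusters, 2))
    if not pairs:
        return 0.0
    # Bigger, disjoint descriptions give larger distances; shared features
    # reduce the distance; averaging keeps the score comparable across
    # clusterings with different numbers of clusters.
    return sum(len(a ^ b) for a, b in pairs) / len(pairs)
```

In this toy measure, two disjoint three-feature clusters score 6.0, the same clusters sharing one feature score 4.0, and two disjoint single-feature clusters score 2.0.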
24Experiments and Results
- Validation in an artificial domain
- Validation in unstructured domains
- Comparison to existing systems
- Real world applications
25The Animal Domain
26Hierarchical Clustering of the Animal Domain
27Hierarchical Clustering of the Animal Domain by
Cobweb
28Comparison of Subdue and Cobweb
- Quality of Subdue's lattice (tree): 2.60
- Quality of Cobweb's tree: 1.74
- Therefore Subdue is better
- Reasons for a higher score
- Better generalization resulting in fewer clusters
- Eliminating overlap between (reptile) and
(amphibian/fish)
29Chemical Application: Clustering of a DNA sequence
30Chemical Application: Clustering of a DNA sequence
31Conclusions
- Goal of hierarchical conceptual clustering of structured databases was achieved
- Synthesized classification lattice
- Developed new evaluation heuristic for hierarchical clusterings
- Good performance in comparison to other systems, even in unstructured domains
32Future Work
- More experiments on real-world domains
- Comparison to other systems
- Incorporation of evaluation tool into Subdue