Improving Web Clustering by Cluster Selection

1
Improving Web Clustering by Cluster Selection
Daniel Crabtree, Xiaoying Gao, Peter Andreae
Victoria University of Wellington, New Zealand
2
Web Search
  • Iterative Process
  • Problems with Standard Web Search
    • Many Irrelevant Results
    • Single Long List
  • Solution
    • Identify and Present Implicit Clusters

3
Web Clustering
Search Results for Jaguar: 1-6 of 70,000,000
  1. Jaguar: Official worldwide web site of Jaguar Cars.
  2. Apple - Mac OS X: The Apple Mac OS X product page.
  3. Jaguar UK - R is for Racing: The essence of the Jaguar breed
  4. Jaguar: General information from Big Cats Online.
  5. Jaguar AU - Jaguar Cars: Services and news
  6. Jaguar -- Defenders of Wildlife: Size, appearance, life span and diet.
Results grouped into one cluster (the animal pages):
  4. Jaguar: General information from Big Cats Online.
  6. Jaguar -- Defenders of Wildlife: Size, appearance, life span and diet.
Clusters
  1. Car
  2. Animal
  3. Mac OS
  4. Other
4
Web Clustering Algorithms
  • Many standard clustering algorithms
  • Text oriented clustering algorithms
    • STC - Suffix Tree Clustering
    • ESTC - Improvement on STC

5
Suffix Tree Clustering
Reference: Zamir and Etzioni
6
STC: Identify Base Clusters
7
STC: Combining Base Clusters
Merge Clusters Based On Overlap
[Figure: base cluster diagram; extracted labels: 30, 18, 7, 12, 6]
Merged Cluster Score is the sum of the base cluster scores
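
The merging step can be sketched in Python as follows. The slide says only that clusters are merged "based on overlap". The 0.5 threshold applied in both directions follows the usual description of STC by Zamir and Etzioni and should be read as an assumption here, as should the names `merge_base_clusters` and `base`, and the example scores (which are not the figure's values).

```python
from itertools import combinations


def merge_base_clusters(base_clusters, threshold=0.5):
    """Merge base clusters whose document sets overlap heavily.

    base_clusters maps a label to (set of document ids, score).
    Two base clusters are linked when the shared documents exceed
    `threshold` of each cluster's size; merged clusters are the
    connected components of that link graph.
    """
    names = list(base_clusters)
    parent = {n: n for n in names}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in combinations(names, 2):
        docs_a, docs_b = base_clusters[a][0], base_clusters[b][0]
        shared = len(docs_a & docs_b)
        if shared > threshold * len(docs_a) and shared > threshold * len(docs_b):
            parent[find(a)] = find(b)

    merged = {}
    for n in names:
        docs, score = base_clusters[n]
        group = merged.setdefault(find(n), {"members": [], "docs": set(), "score": 0})
        group["members"].append(n)
        group["docs"] |= docs
        # STC scoring: the merged score is the plain sum of base scores.
        group["score"] += score
    return list(merged.values())


# Hypothetical base clusters (document ids follow the Jaguar example).
base = {
    "jaguar cars": ({1, 3, 5}, 6),
    "racing": ({3, 5}, 4),
    "big cats": ({4, 6}, 5),
    "mac os": ({2}, 3),
}
merged_clusters = merge_base_clusters(base)
print(sorted(c["score"] for c in merged_clusters))
```

In this example "jaguar cars" and "racing" share two documents, which is more than half of each, so they merge into one cluster with score 10.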
8
STC: Rank/Select Clusters
  • Sort Clusters by Score
  • Select Best N
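
A minimal sketch of this rank-and-select step, assuming each cluster is a dictionary with a `score` field (the field names and the example values are hypothetical):

```python
def select_top_n(clusters, n):
    """Return the n highest-scoring clusters, best first."""
    return sorted(clusters, key=lambda c: c["score"], reverse=True)[:n]


scored = [
    {"label": "mac os", "score": 3},
    {"label": "car", "score": 10},
    {"label": "other", "score": 1},
    {"label": "animal", "score": 5},
]
top = select_top_n(scored, 2)
print([c["label"] for c in top])
```

Selection here looks only at scores, so nothing stops the chosen clusters from covering the same documents.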

9
Problems with STC
  • STC is better than many other algorithms
  • BUT not good enough
  • Scores
    • Poor Cluster Quality Measure
  • Selection
    • Poor Coverage
    • Excessive Overlap

10
ESTC: Better Cluster Scoring
  • Base Cluster Scores OK
  • Combined Cluster Scores BAD
    • Overlap between clusters is over-counted in the sum
    • Example - Particularly Similar Pages

11
ESTC: Scoring Solution
  • Solution
    • Eliminate the over-counting of the overlap
  • Merged Cluster Score
    • Sum over document scores
  • Document Score
    • Average phrase score of base clusters containing
      the document in the merged cluster
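
The scoring rule above can be sketched as follows. The sketch assumes each base cluster carries a per-document "phrase score" (in STC terms, the weight of its phrase, so that a base cluster's own score is its size times that weight). That reading, the function names, and the example numbers are assumptions, not details given on the slide.

```python
def estc_merged_score(members):
    """Score a merged cluster as a sum of per-document scores.

    members is a list of (set of document ids, phrase score), one entry
    per base cluster in the merged cluster. A document's score is the
    average phrase score of the base clusters that contain it, so a
    document shared by several base clusters is counted only once.
    """
    docs = set().union(*(d for d, _ in members))
    total = 0.0
    for doc in docs:
        scores = [s for d, s in members if doc in d]
        total += sum(scores) / len(scores)
    return total


def stc_merged_score(members):
    """STC-style score: sum of base cluster scores (size * phrase score)."""
    return sum(len(d) * s for d, s in members)


# Two hypothetical base clusters that share documents 1, 2 and 3.
example = [({1, 2, 3}, 2.0), ({1, 2, 3, 4}, 1.0)]
print(stc_merged_score(example), estc_merged_score(example))
```

With these numbers the summed score is 10.0, while the per-document score is 5.5: documents 1 to 3 contribute 1.5 each and document 4 contributes 1.0.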

12
ESTC: Better Cluster Selection
  • Top N Clusters BAD
    • Dominant Topic over-represented

13
ESTC Smarter Selection The Search
  • ESTC Smarter selection
  • Heuristic
    • Minimize Overlap
    • Maximize Coverage
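
One way such a heuristic could be computed is sketched below. The slide does not give the actual ESTC formula, so the linear form "coverage minus weighted overlap", the `overlap_weight` parameter, and all names are assumptions made for illustration.

```python
from collections import Counter


def coverage_and_overlap(selected):
    """Count covered documents and repeated appearances.

    selected is a list of sets of document ids. Coverage is the number
    of distinct documents; overlap counts every appearance of a document
    beyond its first.
    """
    counts = Counter(doc for cluster in selected for doc in cluster)
    coverage = len(counts)
    overlap = sum(c - 1 for c in counts.values())
    return coverage, overlap


def heuristic(selected, overlap_weight=1.0):
    """Higher is better: reward coverage, penalise overlap."""
    coverage, overlap = coverage_and_overlap(selected)
    return coverage - overlap_weight * overlap


candidate = [{1, 3, 5}, {3, 5, 7}, {4, 6}]
print(coverage_and_overlap(candidate), heuristic(candidate))
```

For the candidate set above, six distinct documents are covered and documents 3 and 5 each appear twice, giving coverage 6, overlap 2 and a heuristic value of 4.0.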

14
ESTC: The Search
  • Incremental
  • Greedy
  • Look-ahead Protection
  • Sophisticated Branch and Bound Pruning
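
The incremental, greedy part of the search can be sketched as below. This is a simplified illustration only: the gain function (newly covered documents minus re-covered ones) is an assumption, and the look-ahead and branch and bound pruning named on the slide are not reproduced.

```python
def greedy_select(clusters, max_clusters):
    """Pick clusters one at a time, each time taking the best gain.

    clusters maps a label to a set of document ids. The gain of a
    candidate is the number of documents it newly covers minus the
    number it shares with clusters already chosen. Selection stops when
    no candidate has a positive gain or max_clusters is reached.
    """
    chosen, covered = [], set()
    remaining = dict(clusters)

    def gain(label):
        docs = remaining[label]
        return len(docs - covered) - len(docs & covered)

    while remaining and len(chosen) < max_clusters:
        best = max(remaining, key=gain)
        if gain(best) <= 0:
            break
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen


# Hypothetical clusters; "racing" lies entirely inside "car".
candidates = {
    "car": {1, 3, 5},
    "racing": {3, 5},
    "animal": {4, 6},
    "mac os": {2},
}
picked = greedy_select(candidates, 3)
print(picked)
```

Here "racing" is never picked, because every document it contains is already covered once "car" has been chosen.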

15
Evaluation Method
  • Gold Standard - Ideal Clustering
  • 2 Searches and 2 Types of Input Data
    • Jaguar and Salsa
    • Snippets and Full Text
  • Precision
    • Cluster accuracy against the best matching ideal cluster
  • Recall
    • Coverage of ideal cluster in matched clusters
  • F-measure
    • Combination of precision and recall
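
These measures can be sketched as follows. The slide does not say how matches are chosen, how values are averaged, or which combination the F-measure uses. The sketch assumes matching by largest overlap, unweighted averages over clusters and ideal clusters, and the harmonic mean for F. All names and data are hypothetical.

```python
def evaluate(clusters, ideals):
    """Return (precision, recall, f_measure) against an ideal clustering.

    clusters and ideals are non-empty lists of non-empty sets of
    document ids. Each cluster is matched to the ideal cluster it
    overlaps most. Precision is the share of a cluster that lies in its
    match; recall is the share of an ideal cluster covered by all the
    clusters matched to it.
    """
    matched = {i: set() for i in range(len(ideals))}
    precisions = []
    for cluster in clusters:
        best = max(range(len(ideals)), key=lambda i: len(cluster & ideals[i]))
        precisions.append(len(cluster & ideals[best]) / len(cluster))
        matched[best] |= cluster

    recalls = [len(matched[i] & ideals[i]) / len(ideals[i]) for i in range(len(ideals))]
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


ideal_clusters = [{1, 3, 5}, {4, 6}, {2}]
found_clusters = [{1, 3}, {2, 4, 6}]
p, r, f = evaluate(found_clusters, ideal_clusters)
print(round(p, 3), round(r, 3), round(f, 3))
```

In this example the second cluster wrongly absorbs document 2, which lowers precision, and the one-document ideal cluster is matched by nothing, which lowers recall.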

16
Results: STC, STC-NS, ESTC
[Figure: Jaguar Full Text Clustering Results]
17
Results: ESTC vs Grokker
  • Similar performance without page titles
  • Page titles are often very useful

  Algorithm   Input                      F-measure
  ESTC        Snippets                   58%
  Grokker     Snippets and Page Titles   62%
  ESTC        Full Text                  74%

18
Conclusions
  • ESTC has
    • A new cluster scoring function
    • A new cluster selection algorithm
  • ESTC is better than STC, and compares favourably with Grokker.
  • The ESTC scoring function is applicable to any agglomerative
    clustering algorithm.
  • The ESTC cluster selection algorithm is more widely applicable.

19
Future Work
  • Make improvements to other stages of STC
    • Particularly Combining Base Clusters
  • Apply cluster selection method to other algorithms
  • Improve cluster selection heuristic