Title: Improving Web Page Clustering with Global Document Analysis
1 Improving Web Page Clustering with Global Document Analysis
- Daniel Crabtree
- Supervisors: Dr Peter Andreae, Dr Xiaoying Gao
2 Outline
- The problem with search engines
- QDC: a new web page clustering algorithm
- Evaluation showing QDC is significantly better than other algorithms
3 Problem: Find Relevant Web Pages
- Current Approach: Search Engines
- Many Irrelevant Results
- Long Result List
4 Solution - Organise result set
Search Results for Jaguar: 1-6 of 70,000,000
1. Jaguar: Official worldwide web site of Jaguar Cars.
2. Apple - Mac OS X: The Apple Mac OS X product page.
3. Jaguar UK - R is for Racing: The essence of the Jaguar breed.
4. Jaguar: General information from Big Cats Online.
5. Jaguar AU - Jaguar Cars: Services and news.
6. Jaguar -- Defenders of Wildlife: Size, appearance, life span and diet.
Clusters: 1. Car 2. Animal 3. Mac OS 4. Other
Results shown for the Animal cluster:
4. Jaguar: General information from Big Cats Online.
6. Jaguar -- Defenders of Wildlife: Size, appearance, life span and diet.
5 Related Work: Web Page Clustering
- All Standard Algorithms
- partitioning (k-means), hierarchical (agglomerative, divisive)
- Web Features
- structure, hyperlinks, colour
- Textual Features
- STC: phrases; Lingo: latent semantic indexing
- Word Semantics
- Global document analysis, co-occurrence statistics
6 QDC: Query Directed Clustering
7 QDC 1: Construct Clusters
[Diagram: Score 1; Score 2 = distance(cluster, query); Score 3 = f(base clusters)]
- Clean Pages
- Identify Base Clusters
- Prune 1
- Prune 2
- Subsume
- Prune 3
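The first two steps of this pipeline can be sketched roughly as follows. This is a minimal illustration, not the QDC implementation: it assumes one base cluster per word (the set of pages containing that word), and the size-based score passed to `prune` is a hypothetical stand-in, since the slides do not define the scores beyond the labels in the diagram. The names `identify_base_clusters` and `prune` are also assumptions.

```python
from collections import defaultdict


def identify_base_clusters(pages):
    """Map each word to the set of page ids whose text contains it."""
    clusters = defaultdict(set)
    for page_id, text in pages.items():
        for word in set(text.lower().split()):
            clusters[word].add(page_id)
    return dict(clusters)


def prune(clusters, score, threshold):
    """Keep only the clusters whose score reaches the threshold."""
    return {
        word: members
        for word, members in clusters.items()
        if score(word, members) >= threshold
    }


# Toy, already-cleaned page texts (hypothetical data).
pages = {
    1: "jaguar cars official site",
    2: "jaguar big cats wildlife",
    3: "jaguar cars services",
    4: "jaguar wildlife diet",
}
base = identify_base_clusters(pages)

# Hypothetical score: cluster size. QDC's real scores are not given here.
kept = prune(base, lambda word, members: len(members), threshold=2)
```

Passing the score as a function keeps the pruning step independent of how a score is computed, so a query-directed score could be swapped in without changing `prune`.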
8 QDC 1: Construct Clusters
- Removes Many Base Clusters
- Normally Negative Effect on Performance
- Query Directed Score
- Reliable Guide to Cluster Quality
- Removes just Low Quality Clusters
- Improves Performance
9 QDC 2: Merge Clusters
- Single-link Clustering
- Similarity Function
- Extension (by page overlap)
- Intension (by description similarity)
- Global document analysis: co-occurrence frequency relative to the frequency expected if the words were independent
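The co-occurrence measure in the last bullet can be illustrated with a short sketch. It assumes each word is represented by the set of documents that contain it in some large reference collection; the function name `cooccurrence_ratio` and the toy numbers are assumptions, and the exact statistic QDC uses may differ.

```python
def cooccurrence_ratio(docs_a, docs_b, total_docs):
    """Observed co-occurrence frequency over the frequency expected
    if the two words occurred independently.

    observed = |A & B| / N, expected = (|A| / N) * (|B| / N),
    so the ratio simplifies to |A & B| * N / (|A| * |B|).
    """
    if not docs_a or not docs_b:
        return 0.0
    both = len(docs_a & docs_b)
    return both * total_docs / (len(docs_a) * len(docs_b))


# Toy example: 100 documents, each word appears in 20, 10 contain both.
docs_with_a = set(range(0, 20))
docs_with_b = set(range(10, 30))
ratio = cooccurrence_ratio(docs_with_a, docs_with_b, total_docs=100)
```

A ratio above 1 means the two words appear together more often than independence would predict, and a ratio near 1 suggests no particular relationship.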
10 QDC 2: Merge Clusters
- Incorporating description similarity
- Allows page overlap threshold to be reduced
- More semantically related clusters merge
- Increasing cluster coverage
- Fewer semantically unrelated clusters merge
- Increasing cluster quality
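A rough sketch of single-link merging with a combined similarity test is given below. It assumes two clusters are linked only when both their page overlap and their descriptions are similar enough; how QDC actually combines the two is not stated in the slides. The names `single_link_merge`, `page_overlap`, and `related`, and the 0.5 threshold, are hypothetical.

```python
def page_overlap(pages_a, pages_b):
    """Shared pages as a fraction of the smaller cluster."""
    if not pages_a or not pages_b:
        return 0.0
    return len(pages_a & pages_b) / min(len(pages_a), len(pages_b))


def single_link_merge(clusters, similar):
    """Merge clusters into the connected components of `similar`."""
    names = list(clusters)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the root, halving paths on the way.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if similar(a, b):
                parent[find(a)] = find(b)

    merged = {}
    for name in names:
        merged.setdefault(find(name), set()).update(clusters[name])
    return list(merged.values())


clusters = {"car": {1, 3, 5}, "cars": {1, 3}, "cat": {4, 6}, "wildlife": {4, 5, 6}}
# Hypothetical description similarity: pairs of words judged related.
related = {frozenset(("car", "cars")), frozenset(("cat", "wildlife"))}


def similar(a, b):
    return (page_overlap(clusters[a], clusters[b]) >= 0.5
            and frozenset((a, b)) in related)


result = single_link_merge(clusters, similar)
```

In this toy data "car" and "wildlife" share a page but their descriptions are unrelated, so they stay apart, which mirrors the point that description similarity blocks semantically unrelated merges.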
11 QDC 3: Select Clusters
- ESTC cluster selection algorithm
- Heuristic based hill-climbing search with look-ahead and advanced branch and bound pruning
- Original heuristic
- Page Coverage and Cluster Overlap
- New heuristic
- Page Coverage, Cluster Overlap, Pages Not Covered, Number of Clusters, and Cluster Quality
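The idea of heuristic-guided selection can be sketched with a plain greedy hill climb. This is a simplification: it has no look-ahead and no branch and bound pruning, and its heuristic uses only page coverage and cluster overlap. The function names and the `overlap_weight` of 0.5 are assumptions, not values from ESTC or QDC.

```python
def heuristic(selected, overlap_weight=0.5):
    """Pages covered, minus a penalty for pages covered more than once."""
    covered = set().union(*selected)
    overlap = sum(len(cluster) for cluster in selected) - len(covered)
    return len(covered) - overlap_weight * overlap


def hill_climb_select(clusters, max_clusters):
    """Add the cluster that most improves the heuristic until none helps."""
    selected = []
    remaining = list(clusters)
    while remaining and len(selected) < max_clusters:
        best = max(remaining, key=lambda c: heuristic(selected + [c]))
        if heuristic(selected + [best]) <= heuristic(selected):
            break  # no remaining cluster improves the score
        selected.append(best)
        remaining.remove(best)
    return selected


candidates = [{1, 2, 3, 4}, {3, 4, 5}, {1, 2}, {6, 7}]
chosen = hill_climb_select(candidates, max_clusters=3)
```

Extending `heuristic` with terms for uncovered pages, the number of clusters, and cluster quality would follow the direction of the new heuristic listed above, though the actual weighting is not given in the slides.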
12 Evaluation
- Algorithm Speed
- Ten Times Faster than STC
- One Hundred Times Faster than K-means
- Algorithm Performance
- External Evaluation against a rich gold standard
- Real World Usability
- Informal Usability Comparison with four algorithms
- K-means, ESTC, Lingo, Vivisimo
13 Evaluation: Algorithm Performance
- External Evaluation against a rich gold standard
- Four Algorithms
- STC, ESTC, K-means, Random
- Four Data Sets
- Salsa, Jaguar, GP, Victoria University
- Eleven Measurements
- Average and Weighted Quality, Coverage, Precision, Recall, and Entropy; Mutual Information
- Snippets and Full Page Text
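For readers unfamiliar with two of the measurements above, the sketch below computes precision and recall for one cluster against one gold-standard topic. It uses the standard set-based definitions; the averaged and weighted variants used in the evaluation are not specified in the slides, and the function name and data are hypothetical.

```python
def precision_recall(cluster, topic):
    """Precision and recall of a cluster against a gold-standard topic.

    precision = share of the cluster's pages that belong to the topic
    recall    = share of the topic's pages that the cluster contains
    """
    hits = len(cluster & topic)
    precision = hits / len(cluster) if cluster else 0.0
    recall = hits / len(topic) if topic else 0.0
    return precision, recall


# Toy example: the cluster holds all three topic pages plus one extra.
cluster_pages = {1, 3, 5, 6}
topic_pages = {1, 3, 5}
precision, recall = precision_recall(cluster_pages, topic_pages)
```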
14 Evaluation: Quality and Coverage
15 Evaluation: Improvement over Random
16 Evaluation: Precision and Recall
17 Evaluation: Entropy and Mutual Information
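As a reminder of what these two measures capture, here is a minimal sketch using the standard information-theoretic definitions. It assumes every page has exactly one cluster and one gold-standard topic, which is a simplification because QDC clusters may overlap. The function names and data are hypothetical.

```python
import math
from collections import Counter


def cluster_entropy(topic_labels):
    """Shannon entropy, in bits, of the topic labels inside one cluster."""
    total = len(topic_labels)
    counts = Counter(topic_labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mutual_information(cluster_ids, topic_ids):
    """Mutual information, in bits, between cluster and topic per page."""
    total = len(cluster_ids)
    joint = Counter(zip(cluster_ids, topic_ids))
    cluster_counts = Counter(cluster_ids)
    topic_counts = Counter(topic_ids)
    return sum(
        (n / total)
        * math.log2(n * total / (cluster_counts[c] * topic_counts[t]))
        for (c, t), n in joint.items()
    )


topics = ["car", "car", "animal", "animal"]
mixed = cluster_entropy(topics)                       # one cluster holding both topics
aligned = mutual_information([0, 0, 1, 1], topics)    # clusters match topics
unrelated = mutual_information([0, 1, 0, 1], topics)  # clusters ignore topics
```

Lower entropy means a cluster is dominated by a single topic, and higher mutual information means the clustering as a whole tells you more about the gold-standard topics.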
18 Evaluation: Real World Usability
- QDC finds broader topics
- Maximizes probability of refinement
- Simplifies the user's decision process
- Fewer choices
- Less chance of multiple relevant choices
- Fewer semantically meaningless clusters
19 Evaluation: Real World Usability
- Performance better than indicated by external evaluation
- No penalty for overly specific clusters since the gold standard included them
- External evaluation shows QDC clusters
- Contain fewer irrelevant pages
- Cover more relevant pages
20 Conclusion
- QDC: New Web Page Clustering Algorithm
- 3 key innovations
- Query Directed Scoring
- Merging using cluster descriptions
- Improved ESTC cluster selection heuristic
- Vastly improved performance over other algorithms
- External evaluation
- Informal usability evaluation
21 Future Work
- Use Phrases rather than just Words
- STC, Lingo show large improvement possible
- Improve cluster description similarity merging to consider entire description
- Formal usability evaluation