Improving Web Page Clustering with Global Document Analysis

1
Improving Web Page Clustering with Global
Document Analysis
  • Daniel Crabtree
  • Supervisors: Dr Peter Andreae and Dr Xiaoying Gao

2
Outline
  • The problem with search engines
  • QDC: a new web page clustering algorithm
  • Evaluation showing QDC is significantly better
    than other algorithms

3
Problem: Find Relevant Web Pages
  • Current Approach: Search Engines
  • Many Irrelevant Results
  • Long Result List

4
Solution: Organise the Result Set
  • Search results for "Jaguar": 1-6 of 70,000,000
  • 1. Jaguar: Official worldwide web site of Jaguar Cars.
  • 2. Apple - Mac OS X: The Apple Mac OS X product page.
  • 3. Jaguar UK - R is for Racing: The essence of the Jaguar breed.
  • 4. Jaguar: General information from Big Cats Online.
  • 5. Jaguar AU - Jaguar Cars: Services and news.
  • 6. Jaguar -- Defenders of Wildlife: Size, appearance, life span and diet.
  • Results 4 and 6 shown together:
  • 4. Jaguar: General information from Big Cats Online.
  • 6. Jaguar -- Defenders of Wildlife: Size, appearance, life span and diet.
  • Clusters: 1. Car 2. Animal 3. Mac OS 4. Other

5
Related Work: Web Page Clustering
  • All Standard Algorithms: partitioning (k-means), hierarchical
    (agglomerative, divisive)
  • Web Features: structure, hyperlinks, colour
  • Textual Features: STC (phrases), Lingo (latent semantic indexing)
  • Word Semantics: global document analysis, co-occurrence
    statistics

6
QDC: Query Directed Clustering

7
QDC 1: Construct Clusters
  • Clean Pages
  • Identify Base Clusters
  • Prune 1
  • Prune 2
  • Subsume
  • Prune 3
  • Score 1
  • Score 2: distance(cluster, query)
  • Score 3: f(base clusters)
8
QDC 1: Construct Clusters
  • Removes Many Base Clusters
  • Normally Negative Effect on Performance
  • Query Directed Score
  • Reliable Guide to Cluster Quality
  • Removes just Low Quality Clusters
  • Improves Performance

9
QDC 2: Merge Clusters
  • Single-link Clustering
  • Similarity Function
  • Extension (by page overlap)
  • Intension (by description similarity)
  • Global document analysis: co-occurrence frequency
    relative to expected frequency if independent
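
The last bullet can be illustrated with a small calculation. This is a sketch under assumptions: it measures co-occurrence at the page level and divides the observed co-occurrence frequency by the frequency expected if the two words were independent. The function name `cooccurrence_ratio` is hypothetical, and the presentation does not say how QDC turns this ratio into a description similarity.

```python
def cooccurrence_ratio(pages_with_a, pages_with_b, total_pages):
    """Observed co-occurrence frequency over the frequency expected
    under independence.

    observed = n_ab / N, expected = (n_a / N) * (n_b / N),
    so the ratio simplifies to n_ab * N / (n_a * n_b).
    A value near 1 suggests independence; values above 1 suggest the
    words appear together more often than chance would predict.
    """
    n_a, n_b = len(pages_with_a), len(pages_with_b)
    if n_a == 0 or n_b == 0:
        return 0.0
    n_ab = len(pages_with_a & pages_with_b)
    return (n_ab * total_pages) / (n_a * n_b)

# Two words that each occur on 20 of 100 pages and share 10 of them.
related = cooccurrence_ratio(set(range(0, 20)), set(range(10, 30)), 100)

# Two words whose overlap is exactly what independence predicts.
independent = cooccurrence_ratio(set(range(50)), set(range(0, 100, 2)), 100)
print(related, independent)
```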

10
QDC 2: Merge Clusters
  • Incorporating description similarity
  • Allows page overlap threshold to be reduced
  • More semantically related clusters merge
  • Increasing cluster coverage
  • Fewer semantically unrelated clusters merge
  • Increasing cluster quality
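
The effect described above can be sketched as single-link merging in which two clusters are linked only when they both overlap in pages and have related descriptions. Everything specific here is an assumption made for illustration: the overlap measure (shared pages divided by the smaller cluster's size), the threshold value, and the hand-written `related` set that stands in for a real description similarity.

```python
def page_overlap(a, b):
    """Shared pages as a fraction of the smaller cluster (assumed measure)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def single_link_merge(clusters, similar):
    """Merge clusters into groups: any chain of similar pairs joins a group.

    clusters: dict mapping cluster description -> set of page ids.
    similar: function taking two descriptions and returning True or False.
    """
    names = list(clusters)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the group representative.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if similar(a, b):
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())

clusters = {
    "car": {1, 2, 3, 4},
    "cars": {3, 4, 5, 6},
    "cat": {7, 8, 9},
    "os": {3, 4, 10},
}
related = {frozenset({"car", "cars"})}  # hypothetical description similarity
OVERLAP_THRESHOLD = 0.4                 # hypothetical, deliberately low

def similar(a, b):
    overlap_ok = page_overlap(clusters[a], clusters[b]) >= OVERLAP_THRESHOLD
    return overlap_ok and frozenset({a, b}) in related

merged = single_link_merge(clusters, similar)
print(merged)
```

In this toy data "os" overlaps "car" on two of its three pages, yet it stays separate because the descriptions are unrelated, while "car" and "cars" merge at an overlap of only 0.5.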

11
QDC 3: Select Clusters
  • ESTC cluster selection algorithm
  • Heuristic-based hill-climbing search with
    look-ahead and advanced branch-and-bound pruning
  • Original heuristic
  • Page Coverage and Cluster Overlap
  • New heuristic
  • Page Coverage, Cluster Overlap, Pages Not
    Covered, Number of Clusters, and Cluster Quality
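
To make the selection step concrete, here is a deliberately simplified hill-climbing sketch. It is not the ESTC algorithm: it has no look-ahead and no branch-and-bound pruning, and its heuristic uses only page coverage and cluster overlap, as in the original heuristic named above. The `overlap_weight` parameter and the exact scoring formula are assumptions.

```python
def heuristic(selected, clusters, overlap_weight=0.5):
    """Pages covered minus a penalty for pages covered more than once."""
    covered = set()
    overlap = 0
    for name in selected:
        pages = clusters[name]
        overlap += len(covered & pages)
        covered |= pages
    return len(covered) - overlap_weight * overlap

def hill_climb_select(clusters, max_clusters=10):
    """Repeatedly add the cluster that most improves the heuristic.

    Stops when no remaining cluster improves the score or when
    max_clusters have been chosen.
    """
    selected = []
    best = 0.0  # score of the empty selection
    while len(selected) < max_clusters:
        candidates = [
            (heuristic(selected + [name], clusters), name)
            for name in sorted(clusters)
            if name not in selected
        ]
        if not candidates:
            break
        score, name = max(candidates)
        if score <= best:
            break  # local optimum: no single addition helps
        selected.append(name)
        best = score
    return selected

clusters = {
    "car": {1, 2, 3, 4},
    "animal": {5, 6, 7},
    "vehicle": {1, 2, 3},
    "os": {8, 9},
}
chosen = hill_climb_select(clusters)
print(chosen)
```

The "vehicle" cluster is never chosen because every page it covers is already covered by "car", so adding it only incurs the overlap penalty.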

12
Evaluation
  • Algorithm Speed
  • Ten Times Faster than STC
  • One Hundred Times Faster than K-means
  • Algorithm Performance
  • External Evaluation against a rich gold standard
  • Real World Usability
  • Informal Usability Comparison with four
    algorithms
  • K-means, ESTC, Lingo, Vivisimo

13
Evaluation: Algorithm Performance
  • External Evaluation against a rich gold standard
  • Four Algorithms
  • STC, ESTC, K-means, Random
  • Four Data Sets
  • Salsa, Jaguar, GP, Victoria University
  • Eleven Measurements
  • Average and Weighted Quality, Coverage,
    Precision, Recall, and Entropy, plus Mutual
    Information
  • Snippets and Full Page Text
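
The slides name the measurements without defining them. As a rough illustration, the sketch below computes three of them using common textbook definitions, which may differ from the exact definitions used in this evaluation: precision and recall of one cluster against one gold-standard topic, and the entropy of a cluster over gold-standard topic labels. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def precision_recall(cluster, topic):
    """Precision: fraction of the cluster's pages that belong to the topic.
    Recall: fraction of the topic's pages that the cluster contains."""
    hits = len(cluster & topic)
    precision = hits / len(cluster) if cluster else 0.0
    recall = hits / len(topic) if topic else 0.0
    return precision, recall

def cluster_entropy(cluster, gold_labels):
    """Entropy in bits of the gold topic labels inside one cluster.

    0 means every page has the same topic; higher means more mixed.
    Assumes every page in the cluster has a gold label.
    """
    counts = Counter(gold_labels[page] for page in cluster)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

gold_labels = {1: "car", 2: "car", 3: "animal", 4: "animal", 5: "os"}
car_topic = {page for page, label in gold_labels.items() if label == "car"}

precision, recall = precision_recall({1, 2, 3}, car_topic)
pure = cluster_entropy({3, 4}, gold_labels)
mixed = cluster_entropy({1, 3}, gold_labels)
print(precision, recall, pure, mixed)
```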

14
Evaluation: Quality and Coverage

15
Evaluation: Improvement over Random

16
Evaluation: Precision and Recall

17
Evaluation: Entropy and Mutual Information

18
Evaluation: Real World Usability
  • QDC finds broader topics
  • Maximizes probability of refinement
  • Simplifies the user's decision process
  • Fewer choices
  • Less chance of multiple relevant choices
  • Fewer semantically meaningless clusters

19
Evaluation: Real World Usability
  • Performance better than indicated by external
    evaluation
  • No penalty for overly specific clusters since
    gold standard included them
  • External evaluation shows QDC clusters:
  • Contain fewer irrelevant pages
  • Cover more relevant pages

20
Conclusion
  • QDC: New Web Page Clustering Algorithm
  • 3 key innovations
  • Query Directed Scoring
  • Merging using cluster descriptions
  • Improved ESTC cluster selection heuristic
  • Vastly improved performance over other algorithms
  • External evaluation
  • Informal usability evaluation

21
Future Work
  • Use Phrases rather than just Words
  • STC and Lingo show a large improvement is possible
  • Improve cluster description similarity merging to
    consider entire description
  • Formal usability evaluation