Title: Evaluation of Bipartite-graph-based Web Page Clustering


1
Evaluation of Bipartite-graph-based Web Page
Clustering
  • Shim Wonbo
  • M1 Chikayama-Taura Lab

2
Background
3
Open Directory Project
  • Used by Google, Lycos, etc.
  • Categorizing Web pages by hand
  • Accurate
  • Slow to update
  • Unscalable

4
World Wide Web
  • Rapid growth (the number of clusters changes)
  • Updated daily (cluster centers move)
  • Because of these two properties of the Web,
  • a Web page clustering system that needs no human effort
    is required.

5
Purpose
  • Constructing a Web page clustering system which
  • finds clusters without human help
  • is scalable
  • clusters Web pages at high speed
  • clusters Web pages accurately

6
Agenda
  • Introduction
  • Related Work
  • Proposal
  • Comparison
  • Conclusion

7
Clustering Algorithm
  • Text-based clustering
  • Uses words as features
  • Used for clustering in general
  • Link-based clustering
  • Focuses on link structure
  • Used especially for clustering Web pages

8
k-means Algorithm
k = 3; each point is the vector representation of a document
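As a rough illustration of the k-means procedure named on this slide, the sketch below alternates an assignment step and an update step. The initialization (the first k points), the Euclidean distance, and the toy 2-D "document vectors" are assumptions made for the example; none of them comes from the presentation.

```python
from math import dist

def kmeans(points, k, iterations=20):
    """Plain k-means (Lloyd's algorithm) on a list of equal-length tuples."""
    # Assumption: the first k points are the initial centers (deterministic).
    centers = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return labels, centers

# Toy data: three visually separate groups of two points each.
docs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (5.0, 5.0), (5.1, 4.9)]
labels, centers = kmeans(docs, k=3)
```

Note that k has to be supplied by the caller, which is the first of the problems listed on the next slide.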
9
Problems of k-means Algorithm
  • k depends on the data set.
  • The clustering result is sensitive to outliers.

10
Hierarchical Clustering
  • BIRCH [Zhang 96], CURE [Guha 98], Chameleon
    [Karypis 99], ROCK [Guha 00]

11
Hierarchical Clustering
  • The number of clusters can be determined by a condition.
  • Clustering a large number of points (pages)
    results in many I/O accesses.

12
Use of Link Structure
  • Web pages include not only text but also links.
  • People link Web pages to other related pages.

Linked Web pages may share the same topic
13
Extraction of Web Community based on Link Analysis
  • An Approach to Find Related Communities Based on
    Bipartite Graphs [P. Krishna Reddy et al., 2001]

14
Terminology
  • Fans and Centers
  • Bipartite Graph
  • Complete BG
  • Dense BG

[Figure: fans and centers with parameters p and q; (a) CBG, (b) DBG]
15
An Approach to Find Related Communities Based on
Bipartite Graphs
  • Definition: The set T contains the members of the
    community if there exists a dense bipartite graph
    DBG(T, I, p, q), where
  • T: fans
  • I: centers
  • p: number of out-links
  • q: number of in-links

[Figure: example DBG(T, I, 2, 3), with p and q marked]
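The definition can be made concrete with a small membership check. This sketch assumes the usual reading of DBG(T, I, p, q): every fan in T has at least p out-links into I, and every center in I has at least q in-links from T. The "at least" reading, the edge-list format, and the function name `is_dbg` are assumptions for illustration.

```python
def is_dbg(fans, centers, links, p, q):
    """Check whether (fans, centers) form a DBG(T, I, p, q) under `links`.

    links: set of (source, target) pairs (hypothetical edge-list format).
    """
    fans, centers = set(fans), set(centers)
    # Every fan needs at least p out-links into the center set.
    for t in fans:
        if sum((t, i) in links for i in centers) < p:
            return False
    # Every center needs at least q in-links from the fan set.
    for i in centers:
        if sum((t, i) in links for t in fans) < q:
            return False
    return True

# Three fans that each link to both centers: 2 out-links per fan, 3 in-links per center.
links = {("a", "x"), ("a", "y"), ("b", "x"), ("b", "y"), ("c", "x"), ("c", "y")}
ok = is_dbg({"a", "b", "c"}, {"x", "y"}, links, p=2, q=3)
too_strict = is_dbg({"a", "b", "c"}, {"x", "y"}, links, p=3, q=3)
```

Under this reading the toy graph satisfies DBG(T, I, 2, 3) but not DBG(T, I, 3, 3), since no fan has three out-links.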
16
DBG Extraction Algorithm (p_t = 2, q_t = 3)
  1. Gathering related nodes

[Figure: gathering related nodes, threshold = 1]
17
DBG Extraction Algorithm (p_t = 2, q_t = 3)
  2. Extracting a DBG

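One plausible way to carry out the extraction step is iterative pruning: repeatedly drop links whose source has too few out-links or whose target has too few in-links, until nothing changes. The sketch below is an assumption about how such a step could look, not the algorithm as published; the function name `extract_dbg` and the edge-set input are hypothetical.

```python
def extract_dbg(links, p_t, q_t):
    """Prune a candidate link set until every remaining fan has >= p_t
    out-links and every remaining center has >= q_t in-links."""
    edges = set(links)
    while True:
        # Recount degrees over the links that are still alive.
        out_deg, in_deg = {}, {}
        for src, dst in edges:
            out_deg[src] = out_deg.get(src, 0) + 1
            in_deg[dst] = in_deg.get(dst, 0) + 1
        kept = {(s, d) for s, d in edges
                if out_deg[s] >= p_t and in_deg[d] >= q_t}
        if kept == edges:
            break  # fixed point reached: no link was removed
        edges = kept
    fans = {s for s, _ in edges}
    centers = {d for _, d in edges}
    return fans, centers

# Fans a, b, c link to both x and y; d and e are weakly connected.
candidate_links = {
    ("a", "x"), ("a", "y"), ("b", "x"), ("b", "y"),
    ("c", "x"), ("c", "y"), ("d", "x"), ("e", "z"),
}
fans, centers = extract_dbg(candidate_links, p_t=2, q_t=3)
```

Each pass touches every surviving link once, which is in the spirit of the link-proportional cost claimed on the next slide, although the number of passes depends on the graph.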
18
DBG-based Web Community
  • Pro: high speed (O(number of links))
  • Pro: finds topics across the Web
  • Con: may extract groups of unrelated Web pages

19
Comparison
  • Text-based clustering
  • Accurate
  • Difficult to determine cluster centers
  • Community topology based on DBG
  • Inaccurate
  • Can be used for topic selection

Refined Web Community
Center of Cluster
20
Agenda
  • Introduction
  • Related Work
  • Proposal
  • Comparison
  • Conclusion

21
Proposal
  1. Extract DBGs through link analysis
  2. Refine communities and fix centers with DBSCAN
  3. Assign the remaining pages to the nearest center

22
Community Extraction
  • Extract DBGs from the Web Graph
  • Do not allow the same page to be included in more
    than one Web community

Web Graph
23
Cluster Center Refinement
  • Find meaningful page sets
  • Does each DBG really have a topic?
  • Is there any page in the community that is not
    related to the topic?
  • Feature terms of extracted pages
  • DBSCAN [Martin Ester et al., A Density-Based
    Algorithm for Discovering Clusters in Large
    Spatial Databases with Noise, 1996]

24
DBSCAN
Parameters: radius r, minP = m
[Figure: core points and density-reachable points within radius r form a community (center of cluster)]
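To make the terms on this slide concrete, here is a minimal DBSCAN sketch with the two parameters named above (radius r and a minimum point count). It uses Euclidean distance on small tuples and a brute-force neighbor search; both are simplifying assumptions, and the feature vectors used in the actual experiments are not shown in the transcript.

```python
from math import dist

def dbscan(points, r, min_pts):
    """Minimal DBSCAN: returns one label per point, -1 meaning noise."""
    NOISE, UNSEEN = -1, None
    labels = [UNSEEN] * len(points)

    def neighbors(i):
        # All points within radius r of point i (including i itself).
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= r]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE        # not a core point (may become a border point later)
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: density-reachable from a core point
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is also a core point, so keep expanding
                queue.extend(nbrs)
        cluster += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
cluster_labels = dbscan(pts, r=1.5, min_pts=3)
```

In the toy data the two dense groups become clusters and the isolated point is labeled noise, which mirrors the idea of discarding pages that do not fit a community's topic.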
25
Partitioning Remaining Pages
  • Based on the appearance of feature terms
  • Calculate the distance between each remaining page and
    each center
  • If the distance to the nearest center is shorter
    than a threshold, attach the page to that cluster
  • Otherwise, attach the page to the Unclassified
    cluster
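The assignment rule on this slide can be sketched in a few lines. The Euclidean distance, the dictionary-based inputs, and the name `assign_pages` are assumptions for illustration; the slide does not state which distance measure was used.

```python
from math import dist

def assign_pages(pages, centers, threshold):
    """Attach each page to its nearest center, or to "Unclassified" when too far.

    pages:   {page_id: feature vector}       (hypothetical format)
    centers: {cluster_name: center vector}   (hypothetical format)
    """
    result = {}
    for page_id, vec in pages.items():
        # Find the nearest center and its distance.
        name, d = min(((c, dist(vec, cv)) for c, cv in centers.items()),
                      key=lambda pair: pair[1])
        result[page_id] = name if d < threshold else "Unclassified"
    return result

# Toy vectors: one component per feature term (1 = term appears).
toy_centers = {"editors": (1, 1, 0, 0), "compilers": (0, 0, 1, 1)}
toy_pages = {"p1": (1, 0, 0, 0), "p2": (0, 0, 1, 1), "p3": (0, 0, 0, 0)}
assignment = assign_pages(toy_pages, toy_centers, threshold=1.2)
```

Here p3 contains none of the feature terms, lies farther than the threshold from both centers, and therefore lands in the Unclassified cluster.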

26
Agenda
  • Introduction
  • Related Word
  • Proposal
  • Experimental Result
  • Conclusion

27
Target
  • Seed: 3,000 pages categorized under
    Computer/Software by ODP
  • 70,000 pages within 2 hops of the seed pages
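Collecting pages within a fixed number of hops is a bounded breadth-first search. The sketch below assumes that only out-links are followed and that the link graph is available as an in-memory dictionary; the transcript does not say how the crawl was actually performed.

```python
from collections import deque

def pages_within_hops(seeds, out_links, max_hops=2):
    """Breadth-first collection of pages reachable from the seeds in <= max_hops links.

    out_links: {page: iterable of linked pages} (hypothetical format).
    Returns {page: hop distance from the nearest seed}.
    """
    hops = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        if hops[page] == max_hops:
            continue  # do not follow links beyond the hop limit
        for nxt in out_links.get(page, ()):
            if nxt not in hops:
                hops[nxt] = hops[page] + 1
                queue.append(nxt)
    return hops

toy_graph = {"s": ["a"], "a": ["b"], "b": ["c"]}
reached = pages_within_hops(["s"], toy_graph, max_hops=2)
```

In the toy chain, page c is three hops from the seed and is therefore left out.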

28
Preprocess
  • Word ID
  • Use the words of a dictionary as base vectors
  • Assign the same ID to words sharing the same
    derivation
  • Add terms which appear in many documents (IDF < 8)
  • Total: 29,347
  • Link Extraction
  • Eliminate links to pages which were not
    collected.
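The IDF condition in the preprocessing step can be illustrated as follows. The formula idf(t) = log(N / df(t)) with base 2 is an assumption, since the transcript does not state the logarithm base or the exact IDF variant; the function names are hypothetical.

```python
from math import log

def idf_table(documents, base=2):
    """IDF per term, idf(t) = log_base(N / df(t)); documents is a list of token lists."""
    n = len(documents)
    df = {}
    for tokens in documents:
        for term in set(tokens):  # count each term once per document
            df[term] = df.get(term, 0) + 1
    return {term: log(n / count, base) for term, count in df.items()}

def frequent_terms(documents, max_idf, base=2):
    """Terms whose IDF is below max_idf, i.e. terms that appear in many documents."""
    return {t for t, v in idf_table(documents, base).items() if v < max_idf}

toy_docs = [
    ["linux", "kernel"],
    ["linux", "editor"],
    ["linux", "kernel", "patch"],
    ["editor"],
]
common = frequent_terms(toy_docs, max_idf=0.5)
```

A lower IDF means a term occurs in more documents, so an upper bound on IDF keeps the widely used terms and drops the rare ones.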

29
Communities
30
Community Members (p_t = 3, q_t = 3)
31
Community Members
32
Variance of Terms
33
After DBSCAN
34
Conclusion
35
Future Work
  • Applying to a larger data set
  • This may need parallel processing
  • Analyzing with