Evaluation of Bipartite-graph-based Web Page Clustering


Slides: 36

Provided by: edd5


Transcript and Presenter's Notes

Title: Evaluation of Bipartite-graph-based Web Page Clustering

1
Evaluation of Bipartite-graph-based Web Page
Clustering

Shim Wonbo
M1 Chikayama-Taura Lab

2
Background
3
Open Directory Project

Used by Google, Lycos, etc.
Categorizing Web pages by hand
Accurate
Updated late
Unscalable

4
World Wide Web

Rapid increase (number of clusters changes)
Updated daily (cluster centers move)
Due to these two properties of the Web,
a Web page clustering system that needs no human effort
is needed.

5
Purpose

Constructing a Web page clustering system which
finds clusters without human help
is scalable
clusters Web pages at high speed
clusters Web pages accurately

6
Agenda

Introduction
Related Work
Proposal
Comparison
Conclusion

7
Clustering Algorithm

Text-based clustering
Use of word as feature
Generally used algorithm
Link-based clustering
Focus on link structure
Especially used in clustering Web pages

8
k-means Algorithm
k = 3; each point is the vector representation of a document
9
Problems of k-means Algorithm

k depends on the data set.
Outliers strongly affect the clustering result.
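
To make the outlier problem concrete, here is a minimal sketch of plain k-means (Lloyd's algorithm) in pure Python. The function name `kmeans`, the explicit initial centers, and the toy 2-D points are assumptions made for illustration; the slides do not show an implementation.

```python
def kmeans(points, centers, iters=20):
    """Plain Lloyd's k-means on tuples of numbers; `centers` are the initial centers."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center (squared Euclidean).
        labels = [
            min(range(len(centers)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(old)  # keep the center of an empty cluster unchanged
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return labels, centers


clean = [(0, 0), (0, 2), (10, 10), (10, 12)]
print(kmeans(clean, [(0, 0), (10, 10)])[0])
print(kmeans(clean + [(100, 100)], [(0, 0), (10, 10)])[0])
```

With k = 2, the clean data splits into its two visible groups. Adding the single far-away point (100, 100) pulls one center toward it, so the two genuine groups end up merged and the outlier takes a cluster of its own.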

10
Hierarchical Clustering

BIRCH [Zhang 96], CURE [Guha 98], Chameleon
[Karypis 99], ROCK [Guha 00]

11
Hierarchical Clustering

The number of clusters can be determined by a condition.
Clustering a large number of points (pages)
results in many I/O accesses.

12
Use of Link Structure

Web pages include not only text but also links.
People link Web pages to other related pages.

Linked Web pages may share the same topic
13
Extraction of Web Community based on Link Analysis

An Approach to Find Related Communities Based on
Bipartite Graphs [P. Krishna Reddy et al., 2001]

14
Terminology

Fans and Centers
Bipartite Graph
Complete BG
Dense BG

[Figure: (a) CBG and (b) DBG, with labels Fan, Center, p, q]
15
An Approach to Find Related Communities Based on
Bipartite Graphs

Definition: The set T contains the members of the
community if there exists a dense bipartite graph
DBG(T, I, p, q) where
T: Fans
I: Centers
p: number of out-links
q: number of in-links

[Figure: example DBG(T, I, 2, 3), with p and q marked]
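
The definition can be checked mechanically. The sketch below assumes that p and q are lower bounds: every fan in T has at least p out-links into I, and every center in I has at least q in-links from T. That reading, the function name `is_dbg`, and the toy edge list are assumptions for illustration, not taken from the slides.

```python
def is_dbg(edges, fans, centers, p, q):
    """Check the DBG(T, I, p, q) conditions on directed (fan, center) edges.

    Assumption: p and q are minimum degrees (at least p out-links per fan,
    at least q in-links per center).
    """
    fans, centers = set(fans), set(centers)
    if fans & centers:
        return False  # the two sides of a bipartite graph must be disjoint
    out_deg = {t: 0 for t in fans}
    in_deg = {i: 0 for i in centers}
    for src, dst in set(edges):  # set() ignores duplicate links
        if src in fans and dst in centers:
            out_deg[src] += 1
            in_deg[dst] += 1
    return (all(d >= p for d in out_deg.values())
            and all(d >= q for d in in_deg.values()))


edges = [(t, i) for t in "abc" for i in "xy"]  # every fan links to every center
print(is_dbg(edges, "abc", "xy", p=2, q=3))
```

In this toy graph each of the three fans links to both centers, so the conditions for DBG(T, I, 2, 3) hold. Removing one edge, for example ("c", "y"), leaves fan c with a single out-link and the check fails for p = 2.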
16
DBG Extraction Algorithm (pt = 2, qt = 3)

Gathering related nodes

threshold = 1
17
DBG Extraction Algorithm (pt = 2, qt = 3)

Extracting a DBG

[Figure: worked example of DBG extraction; the numeric node labels are not recoverable from the transcript]
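
Since the figure for this step is lost, the following is only a sketch of one plausible extraction procedure: repeatedly discard fans with fewer than p out-links and centers with fewer than q in-links until nothing changes. Iterative pruning, the function name `extract_dbg`, and the toy edges are assumptions; the presenter's actual procedure may differ.

```python
def extract_dbg(edges, p, q):
    """Iteratively drop edges whose fan has < p out-links or whose center has < q in-links.

    Assumption: this pruning loop stands in for the extraction step, which the
    transcript does not spell out.
    """
    edges = set(edges)
    while True:
        out_deg, in_deg = {}, {}
        for src, dst in edges:
            out_deg[src] = out_deg.get(src, 0) + 1
            in_deg[dst] = in_deg.get(dst, 0) + 1
        kept = {(s, d) for s, d in edges if out_deg[s] >= p and in_deg[d] >= q}
        if kept == edges:
            break  # fixed point: every remaining node meets its threshold
        edges = kept
    fans = {s for s, _ in edges}
    centers = {d for _, d in edges}
    return fans, centers


edges = [(t, i) for t in "abc" for i in "xy"] + [("d", "x"), ("d", "z")]
print(extract_dbg(edges, p=2, q=3))
```

Here center z has only one in-link and is pruned first. That leaves fan d with a single out-link, so d is pruned in the next pass, and the surviving fans a, b, c and centers x, y meet the thresholds 2 and 3.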
18
DBG-based Web Community

O High speed (O(number of links))
O Finds topics across the Web
X May extract groups of unrelated Web pages

19
Comparison

Text-based clustering
Accurate
Difficult to determine the center of cluster
Community topology based on DBG
Inaccurate
Can be used as topic selection

Refined Web Community
Center of Cluster
20
Agenda

Introduction
Related Work
Proposal
Comparison
Conclusion

21
Proposal

Extract DBGs through link analysis
Refine communities and fix centers with DBSCAN
Partition other pages to the nearest center

22
Community Extraction

Extract DBGs from the Web Graph
Do not allow the same page to be included in more
than one Web community

Web Graph
23
Cluster Center Refinement

Find meaningful page sets
Does each DBG really have a topic?
Is there any page in the community that is not
related to the topic?
Feature terms of extracted pages
DBSCAN [Martin Ester et al., A Density-Based
Algorithm for Discovering Clusters in Large
Spatial Databases with Noise, 1996]

24
DBSCAN
radius = r, minP = m
[Figure: core point with radius r; density-reachable points; community (center of cluster)]
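
As a reminder of how DBSCAN uses the radius r and the minimum point count m, here is a minimal pure-Python sketch. The function name `dbscan`, the Euclidean distance, and the toy 2-D points are assumptions; the slides apply DBSCAN to feature-term vectors of pages and do not give the distance measure.

```python
import math


def dbscan(points, r, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    def neighbors(i):
        # All points within radius r of point i (the point itself included).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= r]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # noise for now; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # j is a core point: keep expanding
    return labels


pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(pts, r=1.5, min_pts=3))
```

The two tight groups become clusters 0 and 1, and the isolated point is labeled -1 (noise). In the proposal, pages that end up as noise would be the ones dropped from a community during refinement.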
25
Partitioning Remaining Pages

Feature terms appearance
Calculate distance between a remaining page and
each center
If the distance to the nearest center is shorter
than a threshold, attach the page to that cluster
Otherwise, attach the page to the Unclassified
cluster
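
The partitioning rule above can be sketched as follows. The function name `assign_pages`, the Euclidean distance, and the toy vectors are assumptions; the slides do not say which distance is used on the feature-term vectors.

```python
import math


def assign_pages(pages, centers, threshold):
    """Map each page id to the index of its nearest center, or to "Unclassified"."""
    result = {}
    for page_id, vec in pages.items():
        dists = [math.dist(vec, c) for c in centers]
        best = min(range(len(centers)), key=dists.__getitem__)
        # "Shorter than threshold" is read as a strict inequality.
        result[page_id] = best if dists[best] < threshold else "Unclassified"
    return result


pages = {"a": (0, 1), "b": (9, 9), "c": (50, 50)}
print(assign_pages(pages, centers=[(0, 0), (10, 10)], threshold=5))
```

Pages a and b fall within the threshold of centers 0 and 1 respectively, while page c is too far from both and goes to the Unclassified cluster.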

26
Agenda

Introduction
Related Work
Proposal
Experimental Result
Conclusion

27
Target

Seed: 3,000 pages categorized under
Computer/Software by ODP
70,000 pages within 2 hops of the seed pages

28
Preprocess

Word ID
Use dictionary words as base vectors
Assign the same ID to words sharing the same
derivation
Add terms which appear in many documents (IDF <
8)
Total: 29,347
Link Extraction
Eliminate links to pages which were not
collected.
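
Two of the preprocessing steps can be sketched in a few lines: computing IDF per term and dropping links to uncollected pages. The function names, the base-2 logarithm, and the toy data are assumptions; the slides give only the cutoff "IDF < 8" and not the formula or log base used.

```python
import math


def idf(docs):
    """IDF per term for a list of token lists: log2(N / document frequency).

    Assumption: base-2 logarithm; the slides do not state the base.
    """
    n = len(docs)
    df = {}
    for tokens in docs:
        for term in set(tokens):  # count each term once per document
            df[term] = df.get(term, 0) + 1
    return {term: math.log2(n / count) for term, count in df.items()}


def drop_uncollected_links(links, collected):
    """Keep only (source, target) links whose target page was collected."""
    collected = set(collected)
    return [(s, t) for s, t in links if t in collected]


docs = [["web", "page"], ["web", "graph"], ["web"], ["link"]]
scores = idf(docs)
frequent = sorted(t for t, v in scores.items() if v < 1)  # toy cutoff, not the slides' 8
print(frequent)
print(drop_uncollected_links([("a", "b"), ("a", "z")], collected={"a", "b"}))
```

A low IDF means a term appears in many documents, which is why the slide's condition is an upper bound on IDF. The cutoff of 1 here is chosen only to suit the four-document toy corpus.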

29
Communities
30
Community Members (pt = 3, qt = 3)
31
Community Members
32
Variance of Terms
33
After DBSCAN
34
Conclusion
35
Future Work

Applying to a larger data set
This may need parallel processing
Analyzing with
