Title: Evaluation of Bipartite-graph-based Web Page Clustering
1Evaluation of Bipartite-graph-based Web Page
Clustering
- Shim Wonbo
- M1 Chikayama-Taura Lab
2Background
3Open Directory Project
- Used by Google, Lycos, etc.
- Categorizing Web pages by hand
- Accurate
- Slow to update
- Unscalable
4World Wide Web
- Rapid increase (# of clusters changes)
- Updated daily (cluster centers move)
- Because of these two properties of the Web, a Web page clustering system that requires no human effort is needed.
5Purpose
- Constructing a Web page clustering system which
- finds clusters without human help
- is scalable
- clusters Web pages at high speed
- clusters Web pages accurately
6Agenda
- Introduction
- Related Work
- Proposal
- Comparison
- Conclusion
7Clustering Algorithm
- Text-based clustering
- Uses words as features
- Commonly used approach
- Link-based clustering
- Focus on link structure
- Used especially for clustering Web pages
8k-means Algorithm
(Figure: k = 3; each point is the vector representation of a document)
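The k-means procedure on this slide can be sketched as follows. This is a minimal illustration of the standard Lloyd iteration, not the implementation used in the talk. The function name `kmeans`, the `init` parameter, and the use of squared Euclidean distance are assumptions made for the example.

```python
import random

def kmeans(points, k, init=None, iters=100, seed=0):
    """Lloyd's algorithm on a list of equal-length numeric tuples.

    `init` optionally fixes the starting centers (an assumption added
    here so that runs are reproducible); otherwise k points are sampled.
    """
    rng = random.Random(seed)
    centers = list(init) if init is not None else rng.sample(points, k)
    assign = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        new_assign = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in points
        ]
        if new_assign == assign:
            break  # no point changed cluster, so the result is stable
        assign = new_assign
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign, centers
```

The result depends on the starting centers and on the choice of k, which is the first problem listed on the next slide.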
9Problems of k-means Algorithm
- k depends on the data set.
- Outliers strongly affect the clustering result.
10Hierarchical Clustering
- BIRCH [Zhang 96], CURE [Guha 98], Chameleon [Karypis 99], ROCK [Guha 00]
11Hierarchical Clustering
- # of clusters can be determined by a condition.
- Clustering a large number of points (pages) results in many I/O accesses.
12Use of Link Structure
- Web pages include not only text but also links.
- People link Web pages to other related pages.
- So linked Web pages may share the same topic
13Extraction of Web Community based on Link Analysis
- An Approach to Find Related Communities Based on Bipartite Graphs [P. Krishna Reddy et al., 2001]
14Terminology
- Fans and Centers
- Bipartite Graph
- Complete bipartite graph (CBG)
- Dense bipartite graph (DBG)
(Figure: (a) CBG and (b) DBG, with labels Fan, Center, p, q)
15An Approach to Find Related Communities Based on
Bipartite Graphs
- Definition: The set T contains the members of the community if there exists a dense bipartite graph DBG(T, I, p, q), where
- T: fans
- I: centers
- p: # of out-links
- q: # of in-links
(Figure: example DBG(T, I, 2, 3))
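The definition can be checked mechanically. The sketch below assumes the usual reading of DBG(T, I, p, q): every fan in T links to at least p centers in I, and every center in I is linked from at least q fans in T. The function name `is_dbg` and the dict-of-sets link representation are choices made for this example only.

```python
def is_dbg(links, fans, centers, p, q):
    """Check the DBG(T, I, p, q) condition.

    links: dict mapping a page to the set of pages it links to.
    fans (T) and centers (I): iterables of page names.
    """
    fans = set(fans)
    centers = set(centers)
    # Every fan must link to at least p of the centers.
    if any(len(links.get(f, set()) & centers) < p for f in fans):
        return False
    # Every center must be linked from at least q of the fans.
    for c in centers:
        in_links = sum(1 for f in fans if c in links.get(f, set()))
        if in_links < q:
            return False
    return True
```

With p = 2 and q = 3 the check does not require every fan to link to every center, which is what separates a DBG from a CBG.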
16DBG Extraction Algorithm (pt = 2, qt = 3)
- Gathering related nodes (threshold = 1)
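One possible reading of the gathering step is sketched below. It assumes that a page counts as related to a seed page when the two share at least `threshold` link targets; the slide itself gives only the threshold value, so this rule, the function name `gather_related`, and the data layout are all assumptions.

```python
def gather_related(links, seed, threshold=1):
    """Return pages that share at least `threshold` link targets with `seed`.

    links: dict mapping a page to the set of pages it links to.
    The seed itself is included when it has enough out-links.
    """
    seed_targets = links.get(seed, set())
    return {
        page
        for page, targets in links.items()
        if len(targets & seed_targets) >= threshold
    }
```

Raising the threshold shrinks the candidate set, so fewer loosely related pages reach the extraction step.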
17DBG Extraction Algorithm (pt = 2, qt = 3)
- Extracting a DBG
(Figure: graph nodes annotated with link counts)
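The extraction step can be sketched as iterative pruning. This is a hedged reconstruction rather than the exact algorithm from the talk: it assumes that centers with fewer than qt in-links and fans with fewer than pt out-links are removed repeatedly until nothing changes. The thresholds appear as `p_t` and `q_t`, and `extract_dbg` is a name invented for the example.

```python
def extract_dbg(links, fans, p_t, q_t):
    """Prune a candidate fan set down to a dense bipartite graph.

    links: dict mapping a page to the set of pages it links to.
    Returns (fans, centers); both are empty if no DBG survives.
    """
    fans = set(fans)
    # Candidate centers are all pages linked from the candidate fans.
    centers = set().union(*(links.get(f, set()) for f in fans))
    while True:
        # Remove centers with fewer than q_t in-links from remaining fans.
        weak_centers = {
            c for c in centers
            if sum(1 for f in fans if c in links.get(f, set())) < q_t
        }
        centers -= weak_centers
        # Remove fans with fewer than p_t out-links to remaining centers.
        weak_fans = {f for f in fans if len(links.get(f, set()) & centers) < p_t}
        fans -= weak_fans
        if not weak_centers and not weak_fans:
            return fans, centers
```

Each pass only removes nodes, so the loop terminates once both sets are stable.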
18DBG-based Web Community
- Pro: high speed (O(# of links))
- Pro: finds topics across the Web
- Con: may extract groups of unrelated Web pages
19Comparison
- Text-based clustering
- Accurate
- Difficult to determine the center of cluster
- Community topology based on DBG
- Inaccurate
- Can be used for topic selection
(Figure labels: Refined Web Community, Center of Cluster)
20Agenda
- Introduction
- Related Work
- Proposal
- Comparison
- Conclusion
21Proposal
- Extract DBGs through link analysis
- Refine communities and fix centers with DBSCAN
- Assign the remaining pages to the nearest center
22Community Extraction
- Extract DBGs from the Web Graph
- Do not allow the same page to be included in more than one Web community
Web Graph
23Cluster Center Refinement
- Find meaningful page sets
- Do the DBGs really have a topic?
- Is there any page in the community that is not related to the topic?
- Feature terms of extracted pages
- DBSCAN [Martin Ester et al., A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, 1996]
24DBSCAN
- Parameters: radius = r, minP = m
(Figure labels: Core, Density reachable, Community (Center of cluster))
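A compact sketch of DBSCAN with the two parameters named on this slide follows. It assumes Euclidean distance over page feature vectors and a brute-force neighbor search; `dbscan`, `r`, and `min_pts` are names chosen for the example. Points labeled -1 are noise, which in the refinement step would correspond to community pages that do not fit the topic.

```python
import math

def dbscan(points, r, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    def neighbors(i):
        # Brute-force search; the point itself counts as its own neighbor.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= r]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(nbrs)
    return labels
```

The brute-force neighbor search is quadratic in the number of points; a spatial index would be needed for large page sets.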
25Partitioning Remaining Pages
- Feature terms appearance
- Calculate distance between a remaining page and each center
- If the distance to the nearest center is shorter than threshold, attach the page to that cluster
- Otherwise, attach the page to Unclassified cluster
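The assignment rule above can be sketched as follows. The slide does not name the distance measure, so Euclidean distance is used here as an assumption; `assign_pages` and the dict layout are likewise illustrative.

```python
import math

def assign_pages(pages, centers, threshold):
    """Attach each remaining page to its nearest center or to "Unclassified".

    pages: dict mapping page name to feature vector.
    centers: dict mapping cluster name to center vector.
    """
    result = {}
    for name, vec in pages.items():
        # Find the nearest center for this page.
        nearest = min(centers, key=lambda c: math.dist(vec, centers[c]))
        if math.dist(vec, centers[nearest]) < threshold:
            result[name] = nearest
        else:
            result[name] = "Unclassified"
    return result
```

The threshold trades coverage against purity: a larger value classifies more pages but admits pages further from the topic.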
26Agenda
- Introduction
- Related Word
- Proposal
- Experimental Result
- Conclusion
27Target
- Seed: 3,000 pages categorized under Computer/Software by ODP
- 70,000 pages within 2 hops of the seed pages
28Preprocess
- Word ID
- Use words of a dictionary as base vectors
- Attribute the same ID to words sharing the same derivation
- Add terms which appear in many documents (IDF < 8)
- Total: 29347
- Link Extraction
- Elimination of links to pages which are not collected.
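Two of the preprocessing steps can be illustrated with a short sketch. The IDF formula is not given on the slide; the code assumes IDF = log2(N / df), so the base of the logarithm and the function names `frequent_terms` and `drop_uncollected_links` are assumptions.

```python
import math

def frequent_terms(docs, idf_max=8.0):
    """Return terms whose IDF is below idf_max (terms found in many documents).

    docs: list of sets of terms, one set per document.
    """
    n = len(docs)
    df = {}
    for terms in docs:
        for t in terms:
            df[t] = df.get(t, 0) + 1
    # Assumed definition: IDF = log2(N / document frequency).
    return {t for t, d in df.items() if math.log2(n / d) < idf_max}

def drop_uncollected_links(links, collected):
    """Keep only links whose source and target were both collected."""
    return {
        page: targets & collected
        for page, targets in links.items()
        if page in collected
    }
```

Under this assumed definition, a low IDF means a high document frequency, which matches the bullet about terms that appear in many documents.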
29Number of Communities
30Number of Community Members (pt = 3, qt = 3)
31Number of Community Members
32Variance of Terms
33After DBSCAN
34Conclusion
35Future Work
- Applying to larger data sets
- This may need parallel processing
- Analyzing with