Title: Clustering of Web Documents
1- Clustering of Web Documents
- Jinfeng Chen
2- Zhong Su, Qiang Yang, HongHiang Zhang, Xiaowei Xu and Yuhen Hu, Correlation-based Document Clustering using Web Logs, 2001.
- Hua-Jun Zeng, Qi cai He, Zheng Chen, Weiyin Ma and Jinwen Ma, Learning to Cluster Web Search Results
3 Correlation-based Document Clustering using Web Logs
- Introduction
- Web log data is used to construct clusters.
- Frequent simultaneous visits to two seemingly unrelated documents should indicate that they are in fact closely related.
- The basic algorithm is DBSCAN, which groups neighboring objects of the database into clusters based on local distance information.
4 DBSCAN
- Does not require the user to pre-specify the number of clusters.
- Only one scan through the database.
- Takes two parameters: a radius value e and a value Mpts.
- e: distance measure (radius)
- Mpts: minimum number of points that should occur around a dense object
5 DBSCAN algorithm (cont'd)
- Algorithm DBSCAN(DB, e, Mpts)
- for each o in DB do
  - if o is not yet assigned to a cluster
    - if o is a core-object then
      - collect all objects density-reachable from o according to e and Mpts
      - assign them to a new cluster
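The loop above can be sketched in Python. This is a minimal illustration, not the implementation from the cited paper: the names `dbscan` and `region_query`, the 1-D sample points, the absolute-difference distance, and the convention that a point counts toward its own Mpts are all assumptions made for the example.

```python
from collections import deque

def region_query(points, dist, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

def dbscan(points, dist, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)           # None = not yet visited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue                        # already assigned or marked as noise
        neighbours = region_query(points, dist, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = -1                  # not a core object; noise for now
            continue
        cluster += 1                        # i is a core object: start a new cluster
        labels[i] = cluster
        queue = deque(neighbours)
        while queue:                        # collect everything density-reachable from i
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster         # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, dist, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is a core object too: keep expanding
    return labels

# Assumed toy data: two dense groups and one isolated point.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 9.0]
labels = dbscan(points, lambda a, b: abs(a - b), eps=0.15, min_pts=2)
print(labels)
```

Each `region_query` call here scans every point, so the sketch is quadratic; a spatial index would be needed for large logs.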
6 Limitations of DBSCAN in clustering web documents
- Performs clustering using a fixed threshold value to determine dense regions in the document space.
- Thus the algorithm often cannot distinguish between dense and loose points; often the entire document space is lumped into a single cluster.
7 RDBC algorithm (Recursive Density-Based Clustering)
- The key difference between RDBC and DBSCAN is that in RDBC, the identification of core points is performed separately from the clustering of each individual data point.
- Different values of e and Mpts are used in RDBC to identify this core point set, Cset.
8 RDBC algorithm (cont'd)
- To avoid connecting too many clusters through bridges
- Set initial values e = e1 and Mpts = Mpts1
- WebPageSet = web_log
- RDBC(e, Mpts, WebPageSet)
  - use e, Mpts to get the core point set Cset
  - if size(Cset) > size(WebPageSet)/2
    - DBSCAN(e, Mpts, WebPageSet)
  - else
    - e = e/2; Mpts = Mpts/4
    - RDBC(e, Mpts, WebPageSet)
    - Collect all other points in (WebPageSet - Cset) around clusters found in last step according to e2
9 Construct WebPageSet from web logs
- Step 1
- Step 2 Delete visits of image files.
- Step 3 Extract sessions from the data.
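Steps 2 and 3 can be sketched as follows. The slides do not say how a session is delimited, so this example assumes a 30-minute inactivity timeout per client; the record layout `(client, timestamp_seconds, url)`, the extension list, and the function names are likewise hypothetical.

```python
IMAGE_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png")  # assumed list of image types
SESSION_TIMEOUT = 30 * 60  # seconds; an assumption, the slides give no value

def is_image(url):
    """True if the URL path (query string ignored) ends in an image extension."""
    return url.lower().split("?")[0].endswith(IMAGE_EXTENSIONS)

def extract_sessions(records, timeout=SESSION_TIMEOUT):
    """records: iterable of (client, timestamp, url). Returns a list of URL lists."""
    by_client = {}
    for client, ts, url in sorted(records, key=lambda r: (r[0], r[1])):
        if is_image(url):
            continue                          # Step 2: drop image requests
        by_client.setdefault(client, []).append((ts, url))
    sessions = []
    for visits in by_client.values():         # Step 3: split each client's visits
        current, last_ts = [], None
        for ts, url in visits:
            if last_ts is not None and ts - last_ts > timeout:
                sessions.append(current)      # gap too long: close the session
                current = []
            current.append(url)
            last_ts = ts
        if current:
            sessions.append(current)
    return sessions

# Made-up log records for illustration.
log = [
    ("10.0.0.1", 0, "/index.html"),
    ("10.0.0.1", 5, "/logo.gif"),
    ("10.0.0.1", 60, "/papers.html"),
    ("10.0.0.1", 4000, "/index.html"),
    ("10.0.0.2", 10, "/courses.html"),
]
sessions = extract_sessions(log)
print(sessions)
```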
10 Construct WebPageSet (cont'd)
- Step 4 Create a distance matrix
- 1) Determine the size of a moving window, within which URL requests will be regarded as co-occurrences.
- 2) Calculate the co-occurrence count Ni,j and the counts Ni, Nj of this pair of URLs.
11 Construct WebPageSet (cont'd)
- Step 4 Create a distance matrix
- 3) P(pi | pj) = Ni,j / Nj
- 4) Three distance functions
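Sub-steps 1 to 3 of Step 4 can be sketched as below. How exactly the window slides and how Ni is counted are assumptions of this example (each request is paired with the next `window - 1` requests of the same session, and Ni is the number of requests for URL i); the function names are hypothetical, and the three distance functions are not reproduced because they did not survive in the slides.

```python
from collections import Counter

def cooccurrence_counts(sessions, window):
    """Count single URLs (N_i) and unordered URL pairs (N_ij) seen inside the window."""
    n_single, n_pair = Counter(), Counter()
    for session in sessions:
        n_single.update(session)
        for start, url_a in enumerate(session):
            # Pair url_a with the requests that follow it inside the moving window.
            for url_b in session[start + 1:start + window]:
                if url_a != url_b:
                    n_pair[frozenset((url_a, url_b))] += 1
    return n_single, n_pair

def conditional_probability(n_single, n_pair, p_i, p_j):
    """P(p_i | p_j) = N_ij / N_j."""
    return n_pair[frozenset((p_i, p_j))] / n_single[p_j]

# Made-up sessions; window=2 pairs each request with the next one only.
sessions = [["/a", "/b", "/c"], ["/a", "/b"], ["/b", "/d"]]
n_single, n_pair = cooccurrence_counts(sessions, window=2)
p_a_given_b = conditional_probability(n_single, n_pair, "/a", "/b")
p_b_given_a = conditional_probability(n_single, n_pair, "/b", "/a")
print(p_a_given_b, p_b_given_a)
```

Note that the probability is asymmetric: in the sample data "/b" always accompanies "/a", but "/a" accompanies only two of the three visits to "/b".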
12 Experimental Validation
13 Conclusions
- A new algorithm for clustering web documents based only on the log data.
- Because it changes the parameters intelligently during the recursive process, RDBC can give clustering results superior to those of DBSCAN.
14 Learning to Cluster Web Search Results
- Introduction
- This algorithm is based on salient phrases extracted from the document contents.
- Fast enough to be used in an online calculation engine.
15 Characteristics of clustered web search results
- Existing search engines such as Google, Yahoo and MSN often return a long list of search results.
- Clustering similar search results helps users find relevant results.
16 Clustered Search results
17 Conventional Search results
18 Procedure of the algorithm
- Step 1 Search result fetching
- Step 2 Document parsing and phrase property calculation
- Step 3 Salient phrase ranking
19 Search result fetching
- Input a query to a conventional web search engine.
- Get the web page of results returned by the engine.
- Extract the titles and snippets.
20 Document parsing
- Step 1 Cleaning
- Stemming (using the Porter algorithm)
- Sentence boundary identification
- Step 2 Post-processing
- Punctuation elimination
- Filter out stop-words, e.g. "too", "are"
- Filter out query words
- Ex: "Microsoft software is available to students."
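The parsing steps can be sketched on the example sentence as follows. This is a simplified illustration under stated assumptions: Porter stemming is omitted because it is not in the Python standard library, the stop-word list is a tiny made-up sample, sentence boundaries are found with a naive regular expression, and the function names are hypothetical.

```python
import re

STOP_WORDS = {"is", "to", "too", "are", "the", "a", "of"}  # tiny illustrative list

def sentences(text):
    """Naive sentence boundary identification: split after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def post_process(sentence, query_words):
    """Drop punctuation, stop-words and query words (no stemming in this sketch)."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())   # punctuation elimination
    skip = STOP_WORDS | {w.lower() for w in query_words}
    return [t for t in tokens if t not in skip]

snippet = "Microsoft software is available to students."
tokens = [post_process(s, ["Microsoft"]) for s in sentences(snippet)]
print(tokens)
```

With the query word "Microsoft", only "software", "available" and "students" remain as material for candidate phrases.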
21 Phrase property calculation
- Five properties
- 1. Phrase Frequency/Inverted Document Frequency
- 2. Phrase Length
- LEN = n, e.g. LEN("big") = 1
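The first two properties can be sketched as below. The slide's own TFIDF formula did not survive, so this example assumes the common form TF * log(N / DF), where TF is the phrase's total frequency, DF the number of documents containing it, and N the number of documents; the function names and sample documents are hypothetical.

```python
import math

def phrase_tfidf(phrase, documents):
    """TF * log(N / DF) for a phrase (list of words) over tokenised documents."""
    n = len(phrase)

    def count(doc):
        # Number of positions where the phrase occurs as a contiguous word run.
        return sum(1 for i in range(len(doc) - n + 1) if doc[i:i + n] == phrase)

    counts = [count(doc) for doc in documents]
    tf = sum(counts)
    df = sum(1 for c in counts if c > 0)
    return tf * math.log(len(documents) / df) if df else 0.0

def phrase_length(phrase):
    """LEN = number of words in the phrase."""
    return len(phrase)

# Made-up tokenised snippets.
docs = [["big", "apple", "pie"], ["apple", "pie"], ["big", "data"], ["small", "data"]]
score = phrase_tfidf(["apple", "pie"], docs)
print(score, phrase_length(["big"]))
```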
22 Phrase property calculation (cont'd)
- 3. Intra-Cluster Similarity
- o: centroid
- Here di = (TFIDF1, TFIDF2, ...)
- Each component of the vectors represents the TFIDF of a phrase
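One plausible reading of this property, since the formula itself is missing from the slide: compute the centroid o of the document vectors di that contain the phrase, then average the cosine similarity between each di and o. The averaging of cosines, the function names, and the sample vectors are assumptions of this sketch.

```python
import math

def centroid(vectors):
    """Component-wise mean of the document vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def intra_cluster_similarity(vectors):
    """Average cosine similarity between each document vector and the centroid."""
    o = centroid(vectors)
    return sum(cosine(d, o) for d in vectors) / len(vectors)

# Made-up TFIDF vectors: a tight cluster and a loose one.
tight = intra_cluster_similarity([[1.0, 2.0], [2.0, 4.0]])
loose = intra_cluster_similarity([[1.0, 0.0], [0.0, 1.0]])
print(tight, loose)
```

A phrase whose documents point in the same direction scores close to 1, while documents with nothing in common pull the score down.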
23 Phrase property calculation (cont'd)
- 4. Cluster Entropy
- 5. Phrase Independence
- Ex three vectors has
- with some vectors
be
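Both remaining properties can be read as entropy measures, although the slide's formulas are missing, so everything below is an assumption: cluster entropy is taken over the overlap between the documents containing the phrase and the documents containing each other candidate phrase, and phrase independence is the average entropy of the words seen immediately to the left and to the right of the phrase. All names and sample values are hypothetical.

```python
import math
from collections import Counter

def entropy(counts, total):
    """-sum(p * log p) with p = count / total; zero counts contribute nothing."""
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def cluster_entropy(docs_with_phrase, docs_by_other_phrase):
    """docs_with_phrase: set of document ids; docs_by_other_phrase: {phrase: set of ids}."""
    overlaps = [len(docs_with_phrase & d) for d in docs_by_other_phrase.values()]
    return entropy(overlaps, len(docs_with_phrase))

def phrase_independence(left_words, right_words):
    """Average entropy of the left-context and right-context word distributions."""
    def side(words):
        return entropy(Counter(words).values(), len(words)) if words else 0.0
    return (side(left_words) + side(right_words)) / 2

ce = cluster_entropy({1, 2, 3, 4}, {"x": {1, 2}, "y": {3, 4}})
free = phrase_independence(["a", "b"], ["c", "d"])    # varied context
bound = phrase_independence(["a", "a"], ["b", "b"])   # always the same context
print(ce, free, bound)
```

Under this reading, a phrase that always appears with the same neighbouring words gets independence 0, which suggests it is only a fragment of a longer phrase.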
24 Learning to rank key phrases
- A regression model combines the above five properties into a single salience score for each phrase.
- Regression tries to determine the relationship between two random variables x = (x1, x2, ..., xn) and y.
- Here x = (TFIDF, LEN, ICS, CE, IND)
25 Learning to rank key phrases
- Three regression models
- Linear Regression
- Logistic Regression
- Support Vector Regression
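The linear case can be sketched as a weighted sum of the five properties followed by a sort. The weights and feature values below are made up for illustration; in the supervised setting they would be learned from labelled training phrases, which this sketch does not do. The function names are hypothetical.

```python
FEATURES = ("TFIDF", "LEN", "ICS", "CE", "IND")

def salience(x, weights, bias=0.0):
    """Linear model: y = bias + sum over features of weight * value."""
    return bias + sum(weights[name] * x[name] for name in FEATURES)

def rank_phrases(candidates, weights, bias=0.0):
    """Return phrase names sorted by descending salience score."""
    return sorted(candidates,
                  key=lambda p: salience(candidates[p], weights, bias),
                  reverse=True)

# Hypothetical coefficients and made-up property values.
weights = {"TFIDF": 0.2, "LEN": 0.3, "ICS": 0.1, "CE": 0.3, "IND": 0.1}
candidates = {
    "click": {"TFIDF": 1.0, "LEN": 1, "ICS": 0.2, "CE": 0.1, "IND": 0.3},
    "jaguar car": {"TFIDF": 2.0, "LEN": 2, "ICS": 0.8, "CE": 0.5, "IND": 0.6},
}
ranking = rank_phrases(candidates, weights)
print(ranking)
```

The top-ranked phrases would then serve as cluster names, with each cluster holding the search results that contain the phrase.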
26 Evaluation
27 Conclusions
- Recasts the search result clustering problem as a supervised salient phrase ranking problem.
- Generates correct clusters with short names, and thus could improve users' browsing efficiency through search results.
28 Thanks!