Web Document Clustering presentation

About This Presentation

Transcript and Presenter's Notes

Title: Web Document Clustering

1
Web Document Clustering

2
Why web document clustering is needed?

3
How to present a web document in a general model?

TF-IDF
Each web document is consisted by words.
The more words they share, the more likely they
are similar.
Each Web document D can be represented by the
following form
D d1,d2, dn
Where n means that there are totally n
different words in the document collection.
di represents the appearance of the ith word in
the document.(1 means exist, 0 means non-exist)
The order of di is determined by the weight.

4
How to calculate the weight?

tfij is number of occurrences of the word tj in
the Web document Di.
idfj is Inverse document frequency.
dfj is the number of Web documents in which word
tj occurs in the document collection.
n is the total number of Web documents in the
document collection.

5
How to calculate the similarity between two web
documents

6
Agglomerative Hierarchical clustering

Start with regarding each document as an
individual cluster
Merge the most similar pair of documents or
document clusters.(use the similarity measure)
Step 2 is iteratively executed until all objects
are contained within a single cluster, which
become the root of the tree.

7
K-means clustering

Arbitrary select K documents as seeds, they are
the initial centroids of each cluster.
Assign all other documents to the closest
centroid
Compute the centroid of each cluster again. Get
new centroid of each cluster
Repeat step2,3, until the centroid of each
cluster doesnt change.

8
Some other refinement algorithm using TF-IDF model

9
Bisecting K-means

1.Select a cluster to split (There are several
ways to select which cluster to split. No
significance difference exists in terms of
clustering accuracy). We normally choose the
largest cluster or the one with the least overall
similarity
2.Employ the basic k-means algorithm to subdivide
the chosen cluster.
3.Repeat step 2 for a constant number of times.
Then perform the split that produces clusters
with the highest overall similarity
4.Repeat the above step1,2,3, until the desired
number of clusters is reached

10
How to present a web document in STC model

What is STC?
Suffix Tree clustering
The whole web document is treated as a string
The identification of base clusters is the
creation of an inverted index of strings for the
web document collection

11
A suffix tree example(courtesy form zemair)

12
STC algorithm(cont)

1.Document cleaning
Delete the word prefix and suffix, reduce plural
to singular. Sentence boundaries are marked and
non-word tokens (such as numbers, HTML tags and
most punctuation) are stripped.
2.Identify Base Cluster.
Create an inverted index of strings from the web
document collection with using a suffix tree.
Each node of the suffix tree represents a group
of documents and a string that is common to all
of them. The label of the node represents the
common string. Each node represents a base
cluster.

13
STC algorithm(cont)

3.Score base clusters.
?Each base cluster is assigned a score
The score formula S(B)Bf(P)
B is the number of documents in base cluster B
P is the number of words in string P that has a
non-zero score
The function f penalizes single word, linear for
string that is two to six words long. And become
constant for longer string.

14
STC algorithm

4.Combine base clusters.
The similarity measure used to combine base
clusters is based on the overlap of their
document sets
?Bx and By with size Bx and By
?Bx?By represents the number of documents
common to both base clusters.
?Define the similarity of Bx and By to be 1 if
Bx ? By/Bxgt0.5 and Bx ? By/Bygt0.5.
Otherwise is 0.
?Two base clusters are connected if they have
similarity of 1. Using a single-link clustering
algorithm, all the connected base clusters are
clustering together. All the documents in these
base clusters constitute a web document cluster.

15
Link Based Model

Idea Web pages that share common links each
other are very likely to be tightly related
Each web document P is represented as 2 vectors
Pout(N-dimension) and Pin(M-dimension)
Pout,i represents whether the web document P has
a out-link in the ith item of vector Pout
Pin,j represents whether the web document P has a
in-link in the jth item of vector Pin
For example
Pout( link1, link2,,linkn) represents all the
out-link in web document collection.
Document Pout,2 1 means this document has link2
as out-link.

16
Link based algorithm

1.Filter irrelevant web documents
?A document is regarded irrelevant if the sum of
in-links and out-links less than 2
2.Use near-common link of cluster to grantee
intra-cluster cohesiveness
?Every cluster should have at least one 30 near
common link
3.Assign each web document to cluster, generate
base clusters.
? Similarity between the document and the
corresponding cluster is above the similarity
threshold
? The document has a link in common with near
common links of the corresponding cluster
4.Generate final clusters by merging base
clusters

17
How to evaluate the quality of the result
clusters (cont)

Entropy
1)For each cluster, the class distribution of the
data(we usually use TREC5,TREC6 document
collection) is calculated first.
2)Using this class distribution, the entropy of
each cluster j is calculated.
Ej -Spijlog(pij)
3) The best quality is that all the documents in
the cluster fall into the same class that is
known before clustering

18
How to evaluate the quality of the result clusters

19
Algorithm evaluation and comparison

20
Future work

Each algorithm has its advantage and
disadvantage. We need to refine these algorithms.
Sometime we need trade off.
Still some room to make it better.
1.increase the entropy or F-measure value of the
result clusters(The evaluation value is under 0.6
in almost all algorithm,while the best is 1)
2.decrease the response time(we often need to
process a large document collection. We need a
fast algorithm)

21
End

Write a Comment

User Comments (0)

About PowerShow.com

Web Document Clustering PowerPoint PPT Presentation