Transcript and Presenter's Notes

Title: Clustering


1
Clustering / Memory-Based Reasoning
Bamshad Mobasher DePaul University
2
What is Clustering in Data Mining?
Clustering is a process of partitioning a set of
data (or objects) into a set of meaningful
sub-classes, called clusters
Helps users understand the natural grouping or
structure in a data set
  • Cluster
  • a collection of data objects that are similar
    to one another and thus can be treated
    collectively as one group
  • but as a collection, they are sufficiently
    different from other groups
  • Clustering
  • unsupervised classification
  • no predefined classes

3
Distance or Similarity Measures
  • Measuring Distance
  • In order to group similar items, we need a way to
    measure the distance between objects (e.g.,
    records)
  • Note: distance is the inverse of similarity
  • Often based on the representation of objects as
    feature vectors

Term Frequencies for Documents
An Employee DB
Which objects are more similar?
4
Distance or Similarity Measures
  • Properties of Distance Measures
  • for all objects A and B, dist(A, B) ≥ 0, and
    dist(A, B) = dist(B, A)
  • for any object A, dist(A, A) = 0
  • dist(A, C) ≤ dist(A, B) + dist(B, C)
    (the triangle inequality)
  • Representation of objects as vectors
  • Each data object (item) can be viewed as an
    n-dimensional vector, where the dimensions are
    the attributes (features) in the data
  • Example (employee DB): Emp. ID 2 = <M, 51,
    64000>
  • Example (Documents): DOC2 = <3, 1, 4, 3, 1, 2>
  • The vector representation allows us to compute
    distance or similarity between pairs of items
    using standard vector operations, e.g.,
  • Dot product
  • Cosine of the angle between vectors
  • Euclidean distance

5
Distance or Similarity Measures
  • Common Distance Measures
  • Manhattan distance
  • Euclidean distance
  • Cosine similarity
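A rough sketch of these three measures over feature vectors (plain Python; the function names are ours, not from the slides):

import math

def manhattan(x, y):
    # sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    # square root of the sum of squared differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_sim(x, y):
    # dot product divided by the product of the vector norms
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)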

6
Distance (Similarity) Matrix
  • Similarity (Distance) Matrix
  • based on the distance or similarity measure we
    can construct a symmetric matrix of distance (or
    similarity) values
  • (i, j) entry in the matrix is the distance
    (similarity) between items i and j

Note that d_ij = d_ji (i.e., the matrix is
symmetric), so we only need the lower triangle
of the matrix. The diagonal is all 1s
(similarity) or all 0s (distance).
7
Example: Term Similarities in Documents
  • Suppose we want to cluster terms that appear in a
    collection of documents with different
    frequencies
  • We need to compute a term-term similarity matrix
  • For simplicity we use the dot product as the
    similarity measure (note that this is the
    non-normalized version of cosine similarity)
  • Example

Each term can be viewed as a vector of term
frequencies (weights), where:
N = total number of dimensions (in this case, documents)
w_ik = weight of term i in document k
Sim(T1, T2) = <0,3,3,0,2> · <4,1,0,1,2>
= 0x4 + 3x1 + 3x0 + 0x1 + 2x2 = 7
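A quick check of this computation in Python (vectors taken from the example above):

T1 = [0, 3, 3, 0, 2]
T2 = [4, 1, 0, 1, 2]

def dot(u, v):
    # non-normalized cosine: plain dot product
    return sum(a * b for a, b in zip(u, v))

print(dot(T1, T2))  # 0x4 + 3x1 + 3x0 + 0x1 + 2x2 = 7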
8
Example: Term Similarities in Documents
Term-Term Similarity Matrix
9
Similarity (Distance) Thresholds
  • A similarity (distance) threshold may be used to
    mark pairs that are sufficiently similar

Using a threshold value of 10 in the previous
example
10
Graph Representation
  • The similarity matrix can be visualized as an
    undirected graph
  • each item is represented by a node, and edges
    represent the fact that two items are similar (a
    one in the similarity threshold matrix)

If no threshold is used, then the matrix can be
represented as a weighted graph
11
Simple Clustering Algorithms
  • If we are interested only in threshold (and not
    the degree of similarity or distance), we can use
    the graph directly for clustering
  • Clique Method (complete link)
  • all items within a cluster must be within the
    similarity threshold of all other items in that
    cluster
  • clusters may overlap
  • generally produces small but very tight clusters
  • Single Link Method
  • any item in a cluster must be within the
    similarity threshold of at least one other item
    in that cluster
  • produces larger but weaker clusters
  • Other methods
  • star method - start with an item and place all
    related items in that cluster
  • string method - start with an item, place one
    related item in that cluster, then place another
    item related to the last item entered, and so on

12
Simple Clustering Algorithms
  • Clique Method
  • a clique is a completely connected subgraph of a
    graph
  • in the clique method, each maximal clique in the
    graph becomes a cluster

[Graph over terms T1-T8 built from the threshold matrix]
Maximal cliques (and therefore the clusters) in
the previous example are: {T1, T3, T4, T6},
{T2, T4, T6}, {T2, T6, T8}, {T1, T5}, {T7}
Note that, for example, {T1, T3, T4} is also a
clique, but it is not maximal.
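A sketch of the clique method with the networkx library (the library choice is ours; the edge list encodes the threshold graph implied by the cliques above):

import networkx as nx

G = nx.Graph()
G.add_edges_from([("T1", "T3"), ("T1", "T4"), ("T1", "T5"), ("T1", "T6"),
                  ("T3", "T4"), ("T3", "T6"), ("T4", "T6"),
                  ("T2", "T4"), ("T2", "T6"), ("T2", "T8"), ("T6", "T8")])
G.add_node("T7")  # isolated item forms its own cluster

# each maximal clique becomes one (possibly overlapping) cluster
for clique in nx.find_cliques(G):
    print(sorted(clique))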
13
Simple Clustering Algorithms
  • Single Link Method
  • 1. select an item not in a cluster and place it in
    a new cluster
  • 2. place all other similar items in that cluster
  • 3. repeat step 2 for each item added to the cluster
    until nothing more can be added
  • 4. repeat steps 1-3 for each item that remains
    unclustered

[Graph over terms T1-T8 built from the threshold matrix]
In this case the single link method produces only
two clusters: {T1, T3, T4, T5, T6, T2, T8} and
{T7}. Note that the single link method does not
allow overlapping clusters, thus partitioning the
set of items.
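Under the same assumed threshold graph, the single link clusters are simply the connected components; a networkx sketch:

import networkx as nx

G = nx.Graph()
G.add_edges_from([("T1", "T3"), ("T1", "T4"), ("T1", "T5"), ("T1", "T6"),
                  ("T3", "T4"), ("T3", "T6"), ("T4", "T6"),
                  ("T2", "T4"), ("T2", "T6"), ("T2", "T8"), ("T6", "T8")])
G.add_node("T7")

# each connected component is one (non-overlapping) cluster
for component in nx.connected_components(G):
    print(sorted(component))  # {T1,...,T6,T8} and {T7}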
14
Clustering with Existing Clusters
  • The notion of comparing item similarities can be
    extended to clusters themselves, by focusing on a
    representative vector for each cluster
  • cluster representatives can be actual items in
    the cluster or other virtual representatives
    such as the centroid
  • this methodology reduces the number of similarity
    computations in clustering
  • clusters are revised successively until a
    stopping condition is satisfied, or until no more
    changes to clusters can be made
  • Partitioning Methods
  • reallocation method - start with an initial
    assignment of items to clusters and then move
    items from cluster to cluster to obtain an
    improved partitioning
  • Single pass method - simple and efficient, but
    produces large clusters, and depends on order in
    which items are processed
  • Hierarchical Agglomerative Methods
  • starts with individual items and combines into
    clusters
  • then successively combine smaller clusters to
    form larger ones
  • grouping of individual items can be based on any
    of the methods discussed earlier

15
K-Means Algorithm
  • The basic algorithm (based on reallocation
    method)
  • 1. Select K initial clusters by (possibly) random
    assignment of some items to clusters and compute
    each of the cluster centroids.
  • 2. Compute the similarity of each item xi to each
    cluster centroid and (re-)assign each item to the
    cluster whose centroid is most similar to xi.
  • 3. Re-compute the cluster centroids based on the
    new assignments.
  • 4. Repeat steps 2 and 3 until there is no change
    in clusters from one iteration to the next.
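A minimal sketch of these four steps in plain Python; for concreteness it re-assigns by Euclidean distance rather than a similarity measure, and all names are illustrative:

import random

def kmeans(items, k, iters=100):
    centroids = random.sample(items, k)            # step 1: initial centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in items:                            # step 2: (re-)assign items
            d = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
            clusters[d.index(min(d))].append(x)
        new_centroids = [                          # step 3: recompute centroids
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centroids[i]
            for i, cl in enumerate(clusters)]
        if new_centroids == centroids:             # step 4: stop when stable
            break
        centroids = new_centroids
    return clusters, centroids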

Example: Clustering Documents
Initial (arbitrary) assignment:
C1 = {D1, D2}, C2 = {D3, D4}, C3 = {D5, D6}
Cluster Centroids
16
Example: K-Means
Now compute the similarity (or distance) of each
item with each cluster, resulting in a
cluster-document similarity matrix (here we use
the dot product as the similarity measure).
For each document, reallocate the document to the
cluster to which it has the highest similarity
(shown in red in the above table). After the
reallocation we have the following new clusters.
Note that the previously unassigned D7 and D8
have been assigned, and that D1 and D6 have been
reallocated from their original assignment.
C1 = {D2, D7, D8}, C2 = {D1, D3, D4, D6}, C3 = {D5}
This is the end of the first iteration (i.e., the
first reallocation). Next, we repeat the process
for another reallocation.
17
Example: K-Means
C1 = {D2, D7, D8}, C2 = {D1, D3, D4, D6}, C3 = {D5}
Now compute new cluster centroids using the
original document-term matrix
This will lead to a new cluster-doc similarity
matrix similar to the previous slide. Again, the
items are reallocated to the clusters with highest
similarity.
C1 = {D2, D6, D8}, C2 = {D1, D3, D4}, C3 = {D5, D7}
New assignment
Note: this process is now repeated with the new
clusters. However, the next iteration in this
example will show no change to the clusters, thus
terminating the algorithm.
18
Single Pass Method
  • The basic algorithm
  • 1. Assign the first item T1 as representative for
    C1
  • 2. for each subsequent item T_i, calculate the
    similarity S with the centroid of each existing
    cluster
  • 3. If S_max is greater than the threshold value,
    add the item to the corresponding cluster and
    recalculate the centroid; otherwise use the item
    to initiate a new cluster
  • 4. If another item remains unclustered, go to
    step 2
  • See Example of Single Pass Clustering Technique
  • This algorithm is simple and efficient, but has
    some problems
  • generally does not produce optimum clusters
  • order dependent - using a different order of
    processing items will result in a different
    clustering
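A sketch of the single pass algorithm, assuming dot-product similarity against cluster centroids (the measure, threshold handling, and names are our choices):

def single_pass(items, threshold):
    clusters, centroids = [], []
    for x in items:
        sims = [sum(a * b for a, b in zip(x, c)) for c in centroids]
        if sims and max(sims) > threshold:
            # add to the most similar cluster and recompute its centroid
            j = sims.index(max(sims))
            clusters[j].append(x)
            centroids[j] = [sum(col) / len(clusters[j])
                            for col in zip(*clusters[j])]
        else:
            # otherwise the item initiates a new cluster
            clusters.append([x])
            centroids.append(list(x))
    return clusters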

19
K-Means Algorithm
  • Strength of the k-means
  • Relatively efficient: O(tkn), where n is the number
    of objects, k is the number of clusters, and t is
    the number of iterations. Normally, k, t << n
  • Often terminates at a local optimum
  • Weakness of the k-means
  • Applicable only when the mean is defined; what
    about categorical data?
  • Need to specify k, the number of clusters, in
    advance
  • Unable to handle noisy data and outliers
  • Variations of K-Means usually differ in
  • Selection of the initial k means
  • Dissimilarity calculations
  • Strategies to calculate cluster means

20
Hierarchical Algorithms
  • Use distance matrix as clustering criteria
  • does not require the no. of clusters as input,
    but needs a termination condition

[Diagram: agglomerative clustering runs from Step 0 to Step 4,
successively merging a, b → ab; c, d → cd; cd, e → cde; and
finally ab, cde → abcde. Divisive clustering runs the same
steps in reverse, from Step 4 back to Step 0.]
21
Hierarchical Agglomerative Clustering
  • Dendrogram for a hierarchy of clusters

[Dendrogram over items A, B, C, D, E, F, G, H, I]
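One way to produce such a dendrogram is SciPy's hierarchy module; a sketch with random data (the library choice and data are ours):

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

X = np.random.rand(9, 4)           # 9 items (A..I) with 4 features each
Z = linkage(X, method="single")    # agglomerative single link merges
dendrogram(Z, labels=list("ABCDEFGHI"))
plt.show()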
22
Clustering Application: Web Usage Mining
  • Discovering Aggregate Usage Profiles
  • Goal: to effectively capture user segments
    based on their common usage patterns from
    potentially anonymous click-stream data
  • Method: cluster user transactions to obtain user
    segments automatically, then represent each
    cluster by its centroid
  • Aggregate profiles are obtained from each
    centroid after sorting by weight and filtering
    out low-weight items in each centroid
  • Note that profiles are represented as weighted
    collections of pageviews
  • weights represent the significance of pageviews
    within each cluster
  • profiles are overlapping, so they capture common
    interests among different groups/types of users
    (e.g., customer segments)

23
Profile Aggregation Based on Clustering
Transactions (PACT)
  • Discovery of Profiles Based on Transaction
    Clusters
  • cluster user transactions - features are
    significant pageviews identified in the
    preprocessing stage
  • derive usage profiles (set of pageview-weight
    pairs) based on characteristics of each
    transaction cluster
  • Deriving Usage Profiles from Transaction Clusters
  • each cluster contains a set of user transactions
    (vectors)
  • for each cluster compute centroid as cluster
    representative
  • to derive the set of pageview-weight pairs for a
    transaction cluster C, select each pageview p_i
    whose weight in the cluster centroid is greater
    than a pre-specified threshold
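A small sketch of this derivation step, assuming each transaction is a weight vector aligned with the list of pageviews (names and the threshold value are illustrative):

def profile_from_cluster(transactions, pageviews, threshold=0.5):
    # centroid = mean weight of each pageview over the cluster's transactions
    n = len(transactions)
    centroid = [sum(t[i] for t in transactions) / n
                for i in range(len(pageviews))]
    # keep only the pageview-weight pairs above the threshold
    return {p: w for p, w in zip(pageviews, centroid) if w > threshold}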

24
PACT - An Example
Original Session/user data
Given an active session A → B, the best matching
profile is Profile 1. This may result in a
recommendation for page F.html, since it appears
with high weight in that profile.
Result of Clustering
PROFILE 0 (Cluster Size = 3)
--------------------------------------
1.00 C.html
1.00 D.html

PROFILE 1 (Cluster Size = 4)
--------------------------------------
1.00 B.html
1.00 F.html
0.75 A.html
0.25 C.html

PROFILE 2 (Cluster Size = 3)
--------------------------------------
1.00 A.html
1.00 D.html
1.00 E.html
0.33 C.html
25
Web Usage Mining: clustering example
  • Transaction Clusters
  • Clustering similar user transactions and using
    centroid of each cluster as an aggregate usage
    profile (representative for a user segment)

Sample cluster centroid from dept. Web site
(cluster size = 330)
Support  URL  Pageview Description
1.00  /courses/syllabus.asp?course=450-96-303&q=3&y=2002&id=290  SE 450 Object-Oriented Development class syllabus
0.97  /people/facultyinfo.asp?id=290  Web page of the lecturer who taught the above course
0.88  /programs/  Current Degree Descriptions 2002
0.85  /programs/courses.asp?depcode=96&deptmne=se&courseid=450  SE 450 course description in SE program
0.82  /programs/2002/gradds2002.asp  M.S. in Distributed Systems program description
26
Clustering Application: Discovery of Content
Profiles
  • Content Profiles
  • Goal: automatically group together pages which
    partially deal with similar concepts
  • Method:
  • identify concepts by clustering features
    (keywords) based on their common occurrences
    among pages (can also be done using association
    discovery or correlation analysis)
  • cluster centroids represent pages in which
    features in the cluster appear frequently
  • Content profiles are derived from centroids after
    filtering out low-weight pageviews in each
    centroid
  • Note that each content profile is represented as
    a collection of pageview-weight pairs (similar
    to usage profiles)
  • however, the weight of a pageview in a profile
    represents the degree to which features in the
    corresponding cluster appear in that pageview.

27
Content Profiles: An Example
PROFILE 0 (Cluster Size = 3)
--------------------------------------
1.00 C.html (web, data, mining)
1.00 D.html (web, data, mining)
0.67 B.html (data, mining)

PROFILE 1 (Cluster Size = 4)
--------------------------------------
1.00 B.html (business, intelligence, marketing, ecommerce)
1.00 F.html (business, intelligence, marketing, ecommerce)
0.75 A.html (business, intelligence, marketing)
0.50 C.html (marketing, ecommerce)
0.50 E.html (intelligence, marketing)

PROFILE 2 (Cluster Size = 3)
--------------------------------------
1.00 A.html (search, information, retrieval)
1.00 E.html (search, information, retrieval)
0.67 C.html (information, retrieval)
0.67 D.html (information, retrieval)

Filtering threshold = 0.5
28
What is Memory-Based Reasoning?
  • Basic Idea: classify new instances based on their
    similarity to instances we have seen before
  • also called instance-based learning
  • Simplest form of MBR: Rote Learning
  • learning by memorization
  • save all previously encountered instances; given a
    new instance, find the one from the memorized set
    that most closely resembles it, and assign the
    new instance to the same class as this nearest
    neighbor
  • more general methods try to find k nearest
    neighbors rather than just one
  • but, how do we define "resembles"?
  • MBR is lazy
  • defers all of the real work until a new instance is
    obtained; no attempt is made to learn a
    generalized model from the training set
  • less data preprocessing and model evaluation, but
    more work has to be done at classification time

29
What is Memory-Based Reasoning?
  • MBR is simple to implement and usually works
    well, but has some problems
  • may not be scalable if the number of instances
    becomes very large
  • outliers and noise may adversely affect accuracy
  • there are no explicit structures or models that
    are learned (instances by themselves don't
    describe the patterns in the data)
  • Improving MBR Effectiveness
  • keep only some of the instances for each class
  • find stable regions surrounding instances with
    similar classes
  • however, discarded instances may later prove to be
    important
  • keep prototypical examples of each class to use as
    a sort of explicit knowledge representation
  • increase the value of k; the trade-off is in run-time
    efficiency
  • can use cross-validation to determine the best value
    for k
  • some good results are being reported on combining
    clustering with MBR (in collaborative filtering
    research)

30
Basic Issues in Applying MBR
  • Choosing the right set of instances
  • can't do just random sampling, since unusual
    records may be missed (e.g., in the movie
    database, popular movies will dominate the random
    sample)
  • usual practice is to keep roughly the same number
    of records for each class
  • Computing Distance
  • general distance functions like those discussed
    before can be used
  • issues are how to normalize and what to do with
    missing values
  • Finding the right combination function
  • how many nearest neighbors need to be used
  • how to combine answers from nearest neighbors
  • basic approaches: democracy, weighted voting

31
Combination Functions
  • Voting: the democracy approach
  • poll the neighbors for the answer and use the
    majority vote
  • the number of neighbors (k) is often taken to be
    odd in order to avoid ties
  • works when the number of classes is two
  • if there are more than two classes, take k to be
    the number of classes plus 1
  • Impact of k on predictions
  • in general different values of k affect the
    outcome of classification
  • we can associate a confidence level with
    predictions (this can be the percentage of
    neighbors that are in agreement)
  • a problem is that no single category may get a
    majority vote
  • if there are strong variations in results for
    different choices of k, this is an indication that
    the training set is not large enough
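A minimal sketch of majority voting with agreement as the confidence (names are ours):

from collections import Counter

def knn_vote(neighbor_labels):
    # neighbor_labels: class labels of the k nearest neighbors
    votes = Counter(neighbor_labels)
    label, count = votes.most_common(1)[0]
    confidence = count / len(neighbor_labels)  # fraction in agreement
    return label, confidence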

32
Voting Approach - Example
Will a new customer respond to solicitation?
Using the voting method without confidence
Using the voting method with confidence
33
Combination Functions
  • Weighted Voting: not so democratic
  • similar to voting, but the votes of some neighbors
    count more
  • shareholder democracy?
  • the question is: which neighbors' votes count more?
  • How can weights be obtained?
  • Distance-based
  • closer neighbors get higher weights
  • value of the vote is the inverse of the
    distance (may need to add a small constant)
  • the weighted sum for each class gives the
    combined score for that class
  • to compute confidence, need to take weighted
    average
  • Heuristic
  • weight for each neighbor is based on
    domain-specific characteristics of that neighbor
  • Advantage of weighted voting
  • introduces enough variation to prevent ties in
    most cases
  • helps distinguish between competing neighbors
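A sketch of distance-based weighted voting, adding a small constant to avoid division by zero (names and the constant are illustrative):

def weighted_vote(neighbors, eps=1e-6):
    # neighbors: list of (distance, label) pairs
    scores = {}
    for dist, label in neighbors:
        # each vote counts as the inverse of the neighbor's distance
        scores[label] = scores.get(label, 0.0) + 1.0 / (dist + eps)
    best = max(scores, key=scores.get)
    confidence = scores[best] / sum(scores.values())  # weighted share
    return best, confidence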

34
Dealing with Numerical Values
  • Voting schemes only work well for categorical
    attributes; what if we want to predict the
    numerical value of the class?
  • Interpolation
  • simplest approach is to take the average value of
    the class attribute for all of the nearest neighbors
  • this will be the predicted value for the new
    instance
  • Regression
  • basic idea is to fit a function (e.g., a line) to
    a number of points
  • usually we can get the best results by first
    computing the nearest neighbors and then doing
    regression on the neighbors
  • this has the advantage of better capturing
    localized variations in the data
  • the predicted class value for the new instance
    will be the value of the function applied to the
    instance attribute values
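Both options can be sketched briefly; the regression variant fits a line over the neighbors only, which is one possible function class (an assumption on our part):

import numpy as np

def predict_average(neighbor_values):
    # interpolation: mean class value of the nearest neighbors
    return sum(neighbor_values) / len(neighbor_values)

def predict_local_regression(neighbor_x, neighbor_y, new_x):
    # fit y = a*x + b on the neighbors, then apply it to the new instance
    a, b = np.polyfit(neighbor_x, neighbor_y, deg=1)
    return a * new_x + b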

35
MBR: Collaborative Filtering
  • Collaborative Filtering Example
  • A movie rating system
  • Ratings scale: 1 = detest; 7 = love it
  • Historical DB of users includes ratings of movies
    by Sally, Bob, Chris, and Lynn
  • Karen is a new user who has rated 3 movies, but
    has not yet seen Independence Day; should we
    recommend it to her?

Will Karen like Independence Day?
36
MBR: Collaborative Filtering
  • Collaborative Filtering or Social Learning
  • idea is to give recommendations to a user based
    on the ratings of objects by other users
  • usually assumes that features in the data are
    similar objects (e.g., Web pages, music, movies,
    etc.)
  • usually requires explicit ratings of objects by
    users based on a rating scale
  • there have been some attempts to obtain ratings
    implicitly based on user behavior (mixed results;
    the problem is that implicit ratings are often
    binary)
  • Nearest Neighbors Strategy
  • Find similar users and predict the (weighted)
    average of their ratings
  • We can use any distance or similarity measure to
    compute similarity among users (user ratings on
    items viewed as a vector)
  • In case of ratings, often the Pearson r algorithm
    is used to compute correlations

37
MBR: Collaborative Filtering
  • Pearson Correlation
  • weight by degree of correlation between user U
    and user J
  • 1 means very similar, 0 means no correlation, -1
    means dissimilar
  • Works well in case of user ratings (where there
    is at least a range of 1-5)
  • Not always possible (in some situations we may
    only have implicit binary values, e.g., whether a
    user did or did not select a document)
  • Alternatively, a variety of distance or
    similarity measures can be used

Pearson(U, J) = Σ_i (r_U,i − r̄_U)(r_J,i − r̄_J) /
SQRT(Σ_i (r_U,i − r̄_U)² × Σ_i (r_J,i − r̄_J)²),
where r̄_J is the average rating of user J on all items.
38
Collaborative Filtering (k Nearest Neighbor
Example)
Prediction
K is the number of nearest neighbors used to
find the average predicted rating of Karen on
Indep. Day.
Example computation:
Pearson(Sally, Karen)
= ((7-5.33)(7-4.67) + (6-5.33)(4-4.67) + (3-5.33)(3-4.67)) /
SQRT(((7-5.33)² + (6-5.33)² + (3-5.33)²) × ((7-4.67)² + (4-4.67)² + (3-4.67)²))
= 0.85
Note: in MS Excel, the Pearson Correlation can be
computed using the CORREL function, e.g.,
CORREL(B7:D7, B2:D2).
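The same value can be checked with numpy (ratings taken from the slide):

import numpy as np

sally = [7, 6, 3]  # Sally's ratings on the three movies Karen has rated
karen = [7, 4, 3]
r = np.corrcoef(sally, karen)[0, 1]
print(round(r, 2))  # 0.85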
39
Collaborative Filtering (k Nearest Neighbor)
  • In practice a more sophisticated approach is used
    to generate the predictions based on the nearest
    neighbors
  • To generate a prediction for a target user a on an
    item i:
    pred(a, i) = r̄_a + Σ_u sim(a, u) × (r_u,i − r̄_u) / Σ_u |sim(a, u)|
  • r̄_a = mean rating for user a
  • u_1, ..., u_k are the k nearest neighbors of a
  • r_u,i = rating of user u on item i
  • sim(a, u) = Pearson correlation between a and u
  • This is a weighted average of deviations from the
    neighbors' mean ratings (and closer neighbors
    count more)
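A direct sketch of this weighted-deviation formula (the input layout is our assumption):

def predict_rating(a_mean, neighbors):
    # neighbors: (sim(a,u), r_u_i, u_mean) for the k nearest neighbors
    num = sum(sim * (r_ui - u_mean) for sim, r_ui, u_mean in neighbors)
    den = sum(abs(sim) for sim, _, _ in neighbors)
    if den == 0:
        return a_mean
    return a_mean + num / den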

40
Item-based Collaborative Filtering
  • Find similarities among the items based on
    ratings across users
  • Often measured based on a variation of Cosine
    measure
  • Prediction of item i for user a is based on the
    past ratings of user a on items similar to i.
  • Suppose
  • Then, the predicted rating for Karen on Indep.
    Day will be 7, because she rated Star Wars 7
  • That is, if we only use the most similar item
  • Otherwise, we can use the k-most similar items
    and again use a weighted average

sim(Star Wars, Indep. Day) > sim(Jur. Park,
Indep. Day) > sim(Termin., Indep. Day)
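A sketch of the weighted-average variant over the k most similar items (the data layout is our assumption):

def item_based_predict(user_ratings, item_sims):
    # user_ratings: {item: target user's past rating}
    # item_sims: {item: sim(item, target_item)} for the k most similar items
    rated = [j for j in item_sims if j in user_ratings]
    num = sum(item_sims[j] * user_ratings[j] for j in rated)
    den = sum(item_sims[j] for j in rated)
    return num / den if den else None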
41
Collaborative Filtering: Pros and Cons
  • Advantages
  • Ignores the content, only looks at who judges
    things similarly
  • If Pam liked the paper, I'll like the paper
  • If you liked Star Wars, you'll like Independence
    Day
  • Rating based on ratings of similar people
  • Works well on data relating to taste
  • Something that people are good at predicting
    about each other too
  • can be combined with meta information about
    objects to increase accuracy
  • Disadvantages
  • early ratings by users can bias ratings of future
    users
  • small number of users relative to number of items
    may result in poor performance
  • scalability problems: as the number of users
    increases, nearest neighbor calculations become
    computationally intensive
  • because of the (dynamic) nature of the
    application, it is difficult to select only a
    portion of the instances as the training set.

42
Profile Injection Attacks
  • Consists of a number of "attack profiles"
  • profiles engineered to bias the system's
    recommendations
  • Called Shilling in some previous work
  • "Push attack"
  • designed to promote a particular product
  • attack profiles give a high rating to the pushed
    item
  • includes other ratings as necessary
  • Other attack types
  • nuke attacks
  • system-wide attacks

43
Amazon blushes over sex link gaffe
By Stefanie Olsen
http://news.com.com/Amazon+blushes+over+sex+link+gaffe/2100-1023_3-976435.html
Story last modified Mon Dec 09 13:46:31 PST 2002

In an incident that highlights the pitfalls of online
recommendation systems, Amazon.com on Friday removed a
link to a sex manual that appeared next to a listing
for a spiritual guide by well-known Christian
televangelist Pat Robertson. The two titles were
temporarily linked as a result of technology that
tracks and displays lists of merchandise perused and
purchased by Amazon visitors. Such promotions appear
below the main description for products under the
title, "Customers who shopped for this item also
shopped for these items." Amazon's automated results
for Robertson's "Six Steps to Spiritual Revival"
included a second title by Robertson as well as a book
about anal sex for men. Amazon conducted an
investigation and determined hundreds of customers
were going to the same items while they were shopping
on the site.
44
Amazon.com and Pat Robertson
  • It turned out that a loosely organized group who
    didn't like the right-wing evangelist Pat
    Robertson managed to trick the Amazon recommender
    into linking his book "Six Steps to a Spiritual
    Life" with a book on anal sex for men
  • Robertson's book was the target of a profile
    injection attack.

45
Attacking Collaborative Filtering Systems
Item1 Item 2 Item 3 Item 4 Item 5 Item 6 Correlation with Alice
Alice 5 2 3 3 ?
User 1 2 4 4 1 -1.00
User 2 2 1 3 1 2 0.33
User 3 4 2 3 2 1 0.90
User 4 3 3 2 3 1 0.19
User 5 3 2 2 2 -1.00
User 6 5 3 1 3 2 0.65
User 7 5 1 5 1 -1.00
Bestmatch
Prediction ?
Using k-nearest neighbor with k = 1
46
A Successful Push Attack
Item1 Item 2 Item 3 Item 4 Item 5 Item 6 Correlation with Alice
Alice 5 2 3 3 ?
User 1 2 4 4 1 -1.00
User 2 2 1 3 1 2 0.33
User 3 4 2 3 2 1 0.90
User 4 3 3 2 3 1 0.19
User 5 3 2 2 2 -1.00
User 6 5 3 1 3 2 0.65
User 7 5 1 5 1 -1.00
Attack 1 2 3 2 5 -1.00
Attack 2 3 2 3 2 5 0.76
Attack 3 3 2 2 2 5 0.93
BestMatch
Prediction ?
user-based algorithm using k-nearest neighbor
with k = 1
47
Item-Based Collaborative Filtering
Prediction ?
Item1 Item 2 Item 3 Item 4 Item 5 Item 6
Alice 5 2 3 3 ?
User 1 2 4 4 1
User 2 2 1 3 1 2
User 3 4 2 3 2 1
User 4 3 3 2 3 1
User 5 3 2 2 2
User 6 5 3 1 3 2
User 7 5 1 5 1
Item similarity 0.76 0.79 0.60 0.71 0.75
Bestmatch
But, what if the attacker knows, independently,
that Item1 is generally popular?
48
A Push Attack Against Item-Based Algorithm
Prediction ?
Item1 Item 2 Item 3 Item 4 Item 5 Item 6
Alice 5 2 3 3 ?
User 1 2 4 4 1
User 2 2 1 3 1 2
User 3 4 2 3 2 1
User 4 3 3 2 3 1
User 5 3 2 2 2
User 6 5 3 1 3 2
User 7 5 1 5 1
Attack 1 5 1 1 1 1 5
Attack 2 5 1 1 1 1 5
Attack 3 5 1 1 1 1 5
Item similarity 0.89 0.53 0.49 0.70 0.50
BestMatch