Title: Clustering
1. Clustering / Memory-Based Reasoning
Bamshad Mobasher, DePaul University
2. What is Clustering in Data Mining?
Clustering is a process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters. It helps users understand the natural grouping or structure in a data set.
- Cluster
  - a collection of data objects that are similar to one another and thus can be treated collectively as one group
  - but, as a collection, they are sufficiently different from other groups
- Clustering
  - unsupervised classification
  - no predefined classes
3. Distance or Similarity Measures
- Measuring Distance
  - In order to group similar items, we need a way to measure the distance between objects (e.g., records)
  - Note: distance is the inverse of similarity
  - Often based on the representation of objects as feature vectors

Examples: term frequencies for documents; an employee DB. Which objects are more similar?
4. Distance or Similarity Measures
- Properties of Distance Measures
  - for all objects A and B, dist(A, B) >= 0, and dist(A, B) = dist(B, A)
  - for any object A, dist(A, A) = 0
  - dist(A, C) <= dist(A, B) + dist(B, C)
- Representation of objects as vectors
  - Each data object (item) can be viewed as an n-dimensional vector, where the dimensions are the attributes (features) in the data
  - Example (employee DB): Emp. ID 2 = <M, 51, 64000>
  - Example (documents): DOC2 = <3, 1, 4, 3, 1, 2>
  - The vector representation allows us to compute distance or similarity between pairs of items using standard vector operations, e.g.,
    - Dot product
    - Cosine of the angle between vectors
    - Euclidean distance
5. Distance or Similarity Measures
- Common Distance Measures
  - Manhattan distance: dist(X, Y) = |x1 - y1| + ... + |xn - yn|
  - Euclidean distance: dist(X, Y) = sqrt((x1 - y1)^2 + ... + (xn - yn)^2)
  - Cosine similarity: sim(X, Y) = (X . Y) / (|X| |Y|)
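The three measures above can be sketched in a few lines of Python, with feature vectors represented as plain lists of numbers (the example values used below are made up):

```python
import math

def manhattan(x, y):
    # Sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine(x, y):
    # Dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)
```

Note that cosine similarity is scale-invariant (two vectors pointing in the same direction have similarity 1 regardless of length), while the two distance measures are not.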
6. Distance (Similarity) Matrix
- Similarity (Distance) Matrix
  - based on the distance or similarity measure, we can construct a symmetric matrix of distance (or similarity) values
  - the (i, j) entry in the matrix is the distance (similarity) between items i and j

Note that dij = dji (i.e., the matrix is symmetric), so we only need the lower triangle of the matrix. The diagonal is all 1's (similarity) or all 0's (distance).
7. Example: Term Similarities in Documents
- Suppose we want to cluster terms that appear in a collection of documents with different frequencies
- We need to compute a term-term similarity matrix
- For simplicity we use the dot product as the similarity measure (note that this is the non-normalized version of cosine similarity)
- Example

Each term can be viewed as a vector of term frequencies (weights), where N = total number of dimensions (in this case, documents) and wik = weight of term i in document k.

Sim(T1, T2) = <0,3,3,0,2> . <4,1,0,1,2> = 0x4 + 3x1 + 3x0 + 0x1 + 2x2 = 7
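The dot-product computation above is easy to check programmatically (term vectors copied from the example):

```python
# Dot product of two term-frequency vectors (the non-normalized cosine).
def dot(t1, t2):
    return sum(a * b for a, b in zip(t1, t2))

T1 = [0, 3, 3, 0, 2]   # weights of term T1 across the 5 documents
T2 = [4, 1, 0, 1, 2]
sim = dot(T1, T2)      # 0x4 + 3x1 + 3x0 + 0x1 + 2x2 = 7
```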
8. Example: Term Similarities in Documents
Term-Term Similarity Matrix
9. Similarity (Distance) Thresholds
- A similarity (distance) threshold may be used to mark pairs that are sufficiently similar

Using a threshold value of 10 in the previous example:
10. Graph Representation
- The similarity matrix can be visualized as an undirected graph
  - each item is represented by a node, and edges represent the fact that two items are similar (a one in the similarity threshold matrix)

If no threshold is used, then the matrix can be represented as a weighted graph.
11. Simple Clustering Algorithms
- If we are interested only in the threshold (and not the degree of similarity or distance), we can use the graph directly for clustering
- Clique Method (complete link)
  - all items within a cluster must be within the similarity threshold of all other items in that cluster
  - clusters may overlap
  - generally produces small but very tight clusters
- Single Link Method
  - any item in a cluster must be within the similarity threshold of at least one other item in that cluster
  - produces larger but weaker clusters
- Other methods
  - star method - start with an item and place all related items in that cluster
  - string method - start with an item; place one related item in that cluster; then place another item related to the last item entered, and so on
12. Simple Clustering Algorithms
- Clique Method
  - a clique is a completely connected subgraph of a graph
  - in the clique method, each maximal clique in the graph becomes a cluster

[Figure: threshold graph over terms T1-T8]

The maximal cliques (and therefore the clusters) in the previous example are {T1, T3, T4, T6}, {T2, T4, T6}, {T2, T6, T8}, {T1, T5}, and {T7}. Note that, for example, {T1, T3, T4} is also a clique, but is not maximal.
13. Simple Clustering Algorithms
- Single Link Method
  - 1. select an item not in a cluster and place it in a new cluster
  - 2. place all other similar items in that cluster
  - 3. repeat step 2 for each item in the cluster until nothing more can be added
  - 4. repeat steps 1-3 for each item that remains unclustered

[Figure: threshold graph over terms T1-T8]

In this case the single link method produces only two clusters: {T1, T2, T3, T4, T5, T6, T8} and {T7}. Note that the single link method does not allow overlapping clusters, thus partitioning the set of items.
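The single link clusters are exactly the connected components of the threshold graph, so steps 1-4 above amount to a graph traversal. A minimal sketch, using the edges implied by the maximal cliques of the example (T7 has no edges):

```python
def connected_components(nodes, edges):
    # Build an adjacency map, then repeatedly grow a cluster from an
    # unvisited node (steps 1-4 of the single link method).
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, clusters = set(), []
    for n in nodes:
        if n in seen:
            continue
        stack, comp = [n], set()
        while stack:
            cur = stack.pop()
            if cur in comp:
                continue
            comp.add(cur)
            stack.extend(adj[cur] - comp)
        seen |= comp
        clusters.append(comp)
    return clusters

nodes = ["T1", "T2", "T3", "T4", "T5", "T6", "T7", "T8"]
edges = [("T1", "T3"), ("T1", "T4"), ("T1", "T6"), ("T3", "T4"),
         ("T3", "T6"), ("T4", "T6"), ("T2", "T4"), ("T2", "T6"),
         ("T2", "T8"), ("T6", "T8"), ("T1", "T5")]
clusters = connected_components(nodes, edges)
```

This reproduces the two clusters of the example: one containing everything except T7, and the singleton {T7}.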
14. Clustering with Existing Clusters
- The notion of comparing item similarities can be extended to clusters themselves, by focusing on a representative vector for each cluster
  - cluster representatives can be actual items in the cluster or other "virtual" representatives such as the centroid
  - this methodology reduces the number of similarity computations in clustering
  - clusters are revised successively until a stopping condition is satisfied, or until no more changes to clusters can be made
- Partitioning Methods
  - reallocation method - start with an initial assignment of items to clusters and then move items from cluster to cluster to obtain an improved partitioning
  - single pass method - simple and efficient, but produces large clusters, and depends on the order in which items are processed
- Hierarchical Agglomerative Methods
  - start with individual items and combine them into clusters
  - then successively combine smaller clusters to form larger ones
  - grouping of individual items can be based on any of the methods discussed earlier
15. K-Means Algorithm
- The basic algorithm (based on the reallocation method):
  - 1. Select K initial clusters by (possibly) random assignment of some items to clusters and compute each of the cluster centroids.
  - 2. Compute the similarity of each item xi to each cluster centroid and (re-)assign each item to the cluster whose centroid is most similar to xi.
  - 3. Re-compute the cluster centroids based on the new assignments.
  - 4. Repeat steps 2 and 3 until there is no change in clusters from one iteration to the next.

Example: Clustering Documents
Initial (arbitrary) assignment: C1 = {D1, D2}, C2 = {D3, D4}, C3 = {D5, D6}
Cluster Centroids
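Steps 1-4 can be sketched directly in Python. This is a minimal version using Euclidean distance rather than the dot-product similarity of the document example, and the data points and initial assignment below are made up for illustration:

```python
import math

def kmeans(points, k, assign, iters=10):
    # points: list of numeric vectors; assign: initial cluster index per point.
    for _ in range(iters):
        # Step 3: recompute each centroid as the mean of its members.
        dim = len(points[0])
        centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            centroids.append([sum(p[d] for p in members) / len(members)
                              for d in range(dim)])
        # Step 2: reassign each point to the nearest centroid.
        new_assign = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        # Step 4: stop when nothing changes from one iteration to the next.
        if new_assign == assign:
            return new_assign, centroids
        assign = new_assign
    return assign, centroids

assign, centroids = kmeans([[1, 1], [1, 2], [9, 9], [8, 9]], 2, [0, 1, 1, 1])
```

Starting from the deliberately poor initial assignment, one reallocation pass moves [1, 2] into the first cluster and the algorithm converges on the next pass.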
16. Example: K-Means
Now compute the similarity (or distance) of each item with each cluster, resulting in a cluster-document similarity matrix (here we use the dot product as the similarity measure).

For each document, reallocate the document to the cluster to which it has the highest similarity (shown in red in the above table). After the reallocation we have the following new clusters. Note that the previously unassigned D7 and D8 have been assigned, and that D1 and D6 have been reallocated from their original assignment.

C1 = {D2, D7, D8}, C2 = {D1, D3, D4, D6}, C3 = {D5}

This is the end of the first iteration (i.e., the first reallocation). Next, we repeat the process for another reallocation.
17. Example: K-Means
C1 = {D2, D7, D8}, C2 = {D1, D3, D4, D6}, C3 = {D5}

Now compute the new cluster centroids using the original document-term matrix. This will lead to a new cluster-doc similarity matrix similar to the previous slide. Again, the items are reallocated to the clusters with highest similarity.

New assignment: C1 = {D2, D6, D8}, C2 = {D1, D3, D4}, C3 = {D5, D7}

Note: This process is now repeated with the new clusters. However, the next iteration in this example will show no change to the clusters, thus terminating the algorithm.
18. Single Pass Method
- The basic algorithm:
  - 1. Assign the first item T1 as the representative for C1
  - 2. For item Ti, calculate its similarity S with the centroid of each existing cluster
  - 3. If Smax is greater than a threshold value, add the item to the corresponding cluster and recalculate the centroid; otherwise, use the item to initiate a new cluster
  - 4. If any item remains unclustered, go to step 2
- See "Example of Single Pass Clustering Technique"
- This algorithm is simple and efficient, but has some problems
  - generally does not produce optimum clusters
  - order dependent - using a different order of processing items will result in a different clustering
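A minimal sketch of the single pass method over 1-D items, with the similarity threshold replaced by a maximum distance to the nearest centroid (the data values are illustrative):

```python
def single_pass(items, max_dist):
    clusters = []  # list of (centroid, members) pairs
    for x in items:
        if clusters:
            # Step 2: find the nearest existing centroid.
            best = min(range(len(clusters)),
                       key=lambda i: abs(clusters[i][0] - x))
            c, members = clusters[best]
            # Step 3: join that cluster if close enough, and update centroid.
            if abs(c - x) <= max_dist:
                members.append(x)
                clusters[best] = (sum(members) / len(members), members)
                continue
        # Otherwise (or for the first item), start a new cluster.
        clusters.append((x, [x]))
    return [members for _, members in clusters]
```

The order dependence is easy to see: `single_pass([1, 4, 7], 3)` groups 1 with 4 and leaves 7 alone, while `single_pass([7, 4, 1], 3)` groups 7 with 4 and leaves 1 alone.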
19. K-Means Algorithm
- Strengths of k-means
  - Relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations; normally, k, t << n
  - Often terminates at a local optimum
- Weaknesses of k-means
  - Applicable only when the mean is defined; what about categorical data?
  - Need to specify k, the number of clusters, in advance
  - Unable to handle noisy data and outliers
- Variations of k-means usually differ in
  - Selection of the initial k means
  - Dissimilarity calculations
  - Strategies to calculate cluster means
20. Hierarchical Algorithms
- Use the distance matrix as the clustering criterion
- do not require the number of clusters as input, but need a termination condition

[Figure: agglomerative clustering merges a and b into ab, c and d into cd, cd and e into cde, and ab and cde into abcde over steps 0-4; divisive clustering performs the same splits in reverse order, steps 4-0]
21. Hierarchical Agglomerative Clustering
- Dendrogram for a hierarchy of clusters

[Figure: dendrogram over items A B C D E F G H I]
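A naive agglomerative sketch using single-link distance: start with every item in its own cluster and repeatedly merge the two closest clusters, recording each merge. The recorded merge list is exactly what a dendrogram draws (the 1-D data values are made up):

```python
def agglomerate(points):
    clusters = [[p] for p in points]
    merges = []  # (merge distance, sorted merged cluster)
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest single-link
        # (closest pair of members) distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        merges.append((d, sorted(merged)))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

merges = agglomerate([1, 2, 10, 11, 12])
```

The low-distance merges happen first ({1, 2}, then {10, 11}, then {10, 11, 12}), and the final merge joining the two distant groups has the largest distance, which is why dendrograms show tight clusters joined near the bottom.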
22. Clustering Application: Web Usage Mining
- Discovering Aggregate Usage Profiles
  - Goal: to effectively capture "user segments" based on their common usage patterns from potentially anonymous click-stream data
  - Method: cluster user transactions to obtain user segments automatically, then represent each cluster by its centroid
  - Aggregate profiles are obtained from each centroid after sorting by weight and filtering out low-weight items in each centroid
- Note that profiles are represented as weighted collections of pageviews
  - weights represent the significance of pageviews within each cluster
  - profiles are overlapping, so they capture common interests among different groups/types of users (e.g., customer segments)
23. Profile Aggregation Based on Clustering Transactions (PACT)
- Discovery of Profiles Based on Transaction Clusters
  - cluster user transactions - features are significant pageviews identified in the preprocessing stage
  - derive usage profiles (sets of pageview-weight pairs) based on the characteristics of each transaction cluster
- Deriving Usage Profiles from Transaction Clusters
  - each cluster contains a set of user transactions (vectors)
  - for each cluster, compute the centroid as the cluster representative
  - to obtain a set of pageview-weight pairs for a transaction cluster C, select each pageview pi whose weight (in the cluster centroid) is greater than a pre-specified threshold
24. PACT - An Example
Original session/user data

Given an active session A -> B, the best matching profile is Profile 1. This may result in a recommendation for page F.html, since it appears with high weight in that profile.

Result of Clustering:

PROFILE 0 (Cluster Size = 3)
--------------------------------------
1.00 C.html
1.00 D.html

PROFILE 1 (Cluster Size = 4)
--------------------------------------
1.00 B.html
1.00 F.html
0.75 A.html
0.25 C.html

PROFILE 2 (Cluster Size = 3)
--------------------------------------
1.00 A.html
1.00 D.html
1.00 E.html
0.33 C.html
25. Web Usage Mining: clustering example
- Transaction Clusters
  - Clustering similar user transactions and using the centroid of each cluster as an aggregate usage profile (representative for a user segment)

Sample cluster centroid from the dept. Web site (cluster size = 330):
Support URL Pageview Description
1.00 /courses/syllabus.asp?course450-96-303q3y2002id290 SE 450 Object-Oriented Development class syllabus
0.97 /people/facultyinfo.asp?id290 Web page of a lecturer who taught the above course
0.88 /programs/ Current Degree Descriptions 2002
0.85 /programs/courses.asp?depcode96deptmnesecourseid450 SE 450 course description in SE program
0.82 /programs/2002/gradds2002.asp M.S. in Distributed Systems program description
26. Clustering Application: Discovery of Content Profiles
- Content Profiles
  - Goal: automatically group together pages which partially deal with similar concepts
  - Method:
    - identify concepts by clustering features (keywords) based on their common occurrences among pages (can also be done using association discovery or correlation analysis)
    - cluster centroids represent pages in which features in the cluster appear frequently
  - Content profiles are derived from centroids after filtering out low-weight pageviews in each centroid
  - Note that each content profile is represented as a collection of pageview-weight pairs (similar to usage profiles)
    - however, the weight of a pageview in a profile represents the degree to which features in the corresponding cluster appear in that pageview.
27. Content Profiles: An Example

PROFILE 0 (Cluster Size = 3)
--------------------------------------
1.00 C.html (web, data, mining)
1.00 D.html (web, data, mining)
0.67 B.html (data, mining)

PROFILE 1 (Cluster Size = 4)
--------------------------------------
1.00 B.html (business, intelligence, marketing, ecommerce)
1.00 F.html (business, intelligence, marketing, ecommerce)
0.75 A.html (business, intelligence, marketing)
0.50 C.html (marketing, ecommerce)
0.50 E.html (intelligence, marketing)

PROFILE 2 (Cluster Size = 3)
--------------------------------------
1.00 A.html (search, information, retrieval)
1.00 E.html (search, information, retrieval)
0.67 C.html (information, retrieval)
0.67 D.html (information, retrieval)

Filtering threshold = 0.5
28. What is Memory-Based Reasoning?
- Basic idea: classify new instances based on their similarity to instances we have seen before
  - also called "instance-based learning"
- Simplest form of MBR: Rote Learning
  - learning by memorization
  - save all previously encountered instances; given a new instance, find the one from the memorized set that most closely "resembles" the new one; assign the new instance to the same class as this "nearest neighbor"
  - more general methods try to find the k nearest neighbors rather than just one
  - but, how do we define "resembles"?
- MBR is "lazy"
  - defers all of the real work until a new instance is obtained; no attempt is made to learn a generalized model from the training set
  - less data preprocessing and model evaluation, but more work has to be done at classification time
29. What is Memory-Based Reasoning?
- MBR is simple to implement and usually works well, but has some problems
  - may not be scalable if the number of instances becomes very large
  - outliers and noise may adversely affect accuracy
  - there are no explicit structures or models that are learned (instances by themselves don't describe the patterns in the data)
- Improving MBR Effectiveness
  - keep only some of the instances for each class
    - find stable regions surrounding instances with similar classes
    - however, a discarded instance may later prove to be important
  - keep prototypical examples of each class to use as a sort of explicit knowledge representation
  - increase the value of k; the trade-off is in run-time efficiency
    - can use cross-validation to determine the best value for k
  - some good results are being reported on combining clustering with MBR (in collaborative filtering research)
30. Basic Issues in Applying MBR
- Choosing the right set of instances
  - can't do just random sampling, since unusual records may be missed (e.g., in the movie database, popular movies will dominate the random sample)
  - usual practice is to keep roughly the same number of records for each class
- Computing Distance
  - general distance functions like those discussed before can be used
  - issues are how to normalize and what to do with missing values
- Finding the right combination function
  - how many nearest neighbors need to be used?
  - how to combine answers from nearest neighbors?
  - basic approaches: democracy (simple voting), weighted voting
31. Combination Functions
- Voting: the "democracy" approach
  - poll the neighbors for the answer and use the majority vote
  - the number of neighbors (k) is often taken to be odd in order to avoid ties
    - works when the number of classes is two
    - if there are more than two classes, take k to be the number of classes plus 1
- Impact of k on predictions
  - in general, different values of k affect the outcome of classification
  - we can associate a confidence level with predictions (this can be the percentage of neighbors that are in agreement)
  - the problem is that no single category may get a majority vote
  - if there is strong variation in results for different choices of k, this is an indication that the training set is not large enough
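The voting combination function can be sketched as below: find the k most similar stored instances and poll them for the majority class, reporting the fraction of agreeing neighbors as the confidence (the training instances and query are made up):

```python
from collections import Counter

def knn_vote(train, query, k):
    # train: list of (vector, label) pairs; squared Euclidean distance.
    dist = lambda x, y: sum((a - b) ** 2 for a, b in zip(x, y))
    neighbors = sorted(train, key=lambda t: dist(t[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    label, count = votes.most_common(1)[0]
    return label, count / k  # predicted class and confidence

train = [([1, 1], "no"), ([1, 2], "no"), ([6, 5], "yes"),
         ([7, 6], "yes"), ([6, 7], "yes")]
label, conf = knn_vote(train, [6, 6], 3)
```

Here all three nearest neighbors of the query agree, so the prediction is "yes" with confidence 1.0.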
32. Voting Approach - Example
Will a new customer respond to solicitation?

Using the voting method without confidence
Using the voting method with confidence
33. Combination Functions
- Weighted Voting: "not so democratic"
  - similar to voting, but the votes of some neighbors count more
  - "shareholder democracy?"
  - the question is: which neighbors' votes count more?
- How can weights be obtained?
  - Distance-based
    - closer neighbors get higher weights
    - the "value" of the vote is the inverse of the distance (may need to add a small constant)
    - the weighted sum for each class gives the combined score for that class
    - to compute confidence, need to take the weighted average
  - Heuristic
    - weight for each neighbor is based on domain-specific characteristics of that neighbor
- Advantages of weighted voting
  - introduces enough variation to prevent ties in most cases
  - helps distinguish between competing neighbors
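A sketch of the distance-based variant: each neighbor's vote counts 1/(distance + epsilon), where the small constant guards against division by zero when a neighbor coincides with the query (the data values are illustrative):

```python
def weighted_vote(train, query, k, eps=0.01):
    # train: list of (vector, label) pairs; Euclidean distance.
    dist = lambda x, y: sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5
    neighbors = sorted(train, key=lambda t: dist(t[0], query))[:k]
    scores = {}
    for vec, label in neighbors:
        # Inverse-distance weight, summed per class.
        scores[label] = scores.get(label, 0.0) + 1.0 / (dist(vec, query) + eps)
    total = sum(scores.values())
    best = max(scores, key=scores.get)
    return best, scores[best] / total  # class and weighted confidence

train = [([0, 0], "a"), ([1, 0], "b"), ([4, 0], "b")]
label, conf = weighted_vote(train, [0.1, 0], 3)
```

With k = 3, plain majority voting would predict "b" (two votes to one), but the single very close "a" neighbor outweighs both "b" neighbors here, so the weighted vote predicts "a".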
34. Dealing with Numerical Values
- Voting schemes only work well for categorical attributes; what if we want to predict the numerical value of the class?
- Interpolation
  - the simplest approach is to take the average value of the class attribute over all of the nearest neighbors
  - this will be the predicted value for the new instance
- Regression
  - the basic idea is to fit a function (e.g., a line) to a number of points
  - usually we can get the best results by first computing the nearest neighbors and then doing regression on the neighbors
  - this has the advantage of better capturing localized variations in the data
  - the predicted class value for the new instance will be the value of the function applied to the instance's attribute values
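The interpolation approach is a one-liner: average the class values of the k nearest neighbors (the 1-D training pairs below are made up):

```python
def knn_average(train, query, k):
    # train: list of (feature, class_value) pairs.
    neighbors = sorted(train, key=lambda t: abs(t[0] - query))[:k]
    return sum(value for _, value in neighbors) / k

train = [(1, 10.0), (2, 12.0), (3, 14.0), (10, 50.0)]
pred = knn_average(train, 2.5, 2)  # neighbors are (2, 12.0) and (3, 14.0)
```

Note how restricting attention to the nearest neighbors keeps the distant outlier (10, 50.0) from distorting the prediction, which is the same motivation given above for doing regression locally.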
35. MBR: Collaborative Filtering
- Collaborative Filtering Example
  - A movie rating system
  - Ratings scale: 1 = "detest" ... 7 = "love it"
  - Historical DB of users includes ratings of movies by Sally, Bob, Chris, and Lynn
  - Karen is a new user who has rated 3 movies, but has not yet seen "Independence Day"; should we recommend it to her?

Will Karen like "Independence Day"?
36. MBR: Collaborative Filtering
- Collaborative Filtering or "Social" Learning
  - the idea is to give recommendations to a user based on the ratings of objects by other users
  - usually assumes that features in the data are similar objects (e.g., Web pages, music, movies, etc.)
  - usually requires explicit ratings of objects by users based on a rating scale
  - there have been some attempts to obtain ratings implicitly based on user behavior (mixed results; the problem is that implicit ratings are often binary)
- Nearest Neighbors Strategy
  - Find similar users and predict a (weighted) average of their ratings
  - We can use any distance or similarity measure to compute similarity among users (user ratings on items viewed as a vector)
  - In the case of ratings, the Pearson r algorithm is often used to compute correlations
37. MBR: Collaborative Filtering
- Pearson Correlation
  - weight by the degree of correlation between user U and user J:

    r(U, J) = sum_i (r_Ui - rbar_U)(r_Ji - rbar_J) / sqrt( sum_i (r_Ui - rbar_U)^2 * sum_i (r_Ji - rbar_J)^2 )

    where rbar_J is the average rating of user J on all items
  - 1 means very similar, 0 means no correlation, -1 means dissimilar
  - Works well in the case of user ratings (where there is at least a range, e.g., 1-5)
  - Not always possible (in some situations we may only have implicit binary values, e.g., whether a user did or did not select a document)
  - Alternatively, a variety of distance or similarity measures can be used
38. Collaborative Filtering (k Nearest Neighbor Example)
Prediction

K is the number of nearest neighbors used to find the average predicted rating of Karen on Indep. Day.

Example computation:
Pearson(Sally, Karen) = ((7-5.33)(7-4.67) + (6-5.33)(4-4.67) + (3-5.33)(3-4.67)) / SQRT(((7-5.33)^2 + (6-5.33)^2 + (3-5.33)^2) * ((7-4.67)^2 + (4-4.67)^2 + (3-4.67)^2)) = 0.85

Note: in MS Excel, the Pearson correlation can be computed using the CORREL function, e.g., CORREL(B7:D7, B2:D2).
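The hand computation above can be verified in a few lines (Sally's and Karen's ratings on the three movies Karen has rated, as in the example):

```python
def pearson(u, v):
    # Pearson correlation of two equal-length rating lists.
    mu = sum(u) / len(u)
    mv = sum(v) / len(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = (sum((a - mu) ** 2 for a in u) *
           sum((b - mv) ** 2 for b in v)) ** 0.5
    return num / den

r = pearson([7, 6, 3], [7, 4, 3])  # Sally vs. Karen, about 0.85
```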
39. Collaborative Filtering (k Nearest Neighbor)
- In practice, a more sophisticated approach is used to generate the predictions based on the nearest neighbors
- To generate a prediction for a target user a on an item i:

    p(a, i) = rbar_a + sum_u sim(a, u) * (r(u, i) - rbar_u) / sum_u |sim(a, u)|

  - rbar_a = mean rating for user a
  - u1, ..., uk are the k nearest neighbors to a
  - r(u, i) = rating of user u on item i
  - sim(a, u) = Pearson correlation between a and u
- This is a weighted average of deviations from the neighbors' mean ratings (and closer neighbors count more)
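A sketch of this deviation-based prediction: the target user's mean rating plus a similarity-weighted average of each neighbor's deviation from that neighbor's own mean (the similarity and rating values below are made up):

```python
def predict(ra_mean, neighbors):
    # neighbors: list of (sim, rating_on_i, neighbor_mean) triples.
    num = sum(s * (r - m) for s, r, m in neighbors)
    den = sum(abs(s) for s, _, _ in neighbors)
    return ra_mean + num / den

# A highly correlated neighbor rating 1 above their mean pulls the
# prediction up; a weaker neighbor rating below their mean pulls it down.
p = predict(4.0, [(0.9, 5.0, 4.0), (0.5, 2.0, 3.0)])
```

Working with deviations from each neighbor's mean compensates for users who rate everything high or low; only how far a rating sits above or below that user's norm is transferred to the target user.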
40. Item-Based Collaborative Filtering
- Find similarities among the items based on ratings across users
  - often measured based on a variation of the cosine measure
- Prediction of item i for user a is based on the past ratings of user a on items similar to i
- Suppose:

    sim(Star Wars, Indep. Day) > sim(Jur. Park, Indep. Day) > sim(Termin., Indep. Day)

- Then the predicted rating for Karen on Indep. Day will be 7, because she rated Star Wars 7
  - that is, if we only use the most similar item
  - otherwise, we can use the k most similar items and again use a weighted average
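The weighted-average variant can be sketched as below. The item similarity values are made up, but ordered as on the slide (Star Wars most similar to Independence Day), and Karen's ratings follow the earlier example:

```python
def item_based_predict(user_ratings, sims):
    # user_ratings: the target user's past ratings, keyed by item name.
    # sims: similarity of each of those items to the target item.
    num = sum(sims[item] * r for item, r in user_ratings.items())
    den = sum(sims[item] for item in user_ratings)
    return num / den

karen = {"Star Wars": 7, "Jurassic Park": 4, "Terminator 2": 3}
sims = {"Star Wars": 0.9, "Jurassic Park": 0.6, "Terminator 2": 0.4}
p = item_based_predict(karen, sims)
```

With only the single most similar item (k = 1) this reduces to Karen's Star Wars rating of 7; averaging over all three items pulls the prediction down toward her lower ratings on the less similar movies.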
41. Collaborative Filtering: Pros & Cons
- Advantages
  - Ignores the content; only looks at who judges things similarly
    - If Pam liked the paper, I'll like the paper
    - If you liked Star Wars, you'll like Independence Day
    - rating based on the ratings of similar people
  - Works well on data relating to "taste"
    - something that people are good at predicting about each other too
    - can be combined with meta information about objects to increase accuracy
- Disadvantages
  - early ratings by users can bias the ratings of future users
  - a small number of users relative to the number of items may result in poor performance
  - scalability problems: as the number of users increases, nearest neighbor calculations become computationally intensive
  - because of the (dynamic) nature of the application, it is difficult to select only a portion of the instances as the training set
42. Profile Injection Attacks
- Consist of a number of "attack profiles"
  - profiles engineered to bias the system's recommendations
  - called "shilling" in some previous work
- "Push attack"
  - designed to promote a particular product
  - attack profiles give a high rating to the pushed item
  - include other ratings as necessary
- Other attack types
  - "nuke" attacks
  - system-wide attacks
43. "Amazon blushes over sex link gaffe", by Stefanie Olsen
http://news.com.com/Amazonblushesoversexlinkgaffe/2100-1023_3-976435.html
Story last modified: Mon Dec 09, 2002 (PST)

In an incident that highlights the pitfalls of online recommendation systems, Amazon.com on Friday removed a link to a sex manual that appeared next to a listing for a spiritual guide by well-known Christian televangelist Pat Robertson. The two titles were temporarily linked as a result of technology that tracks and displays lists of merchandise perused and purchased by Amazon visitors. Such promotions appear below the main description for products under the title, "Customers who shopped for this item also shopped for these items." Amazon's automated results for Robertson's "Six Steps to Spiritual Revival" included a second title by Robertson as well as a book about anal sex for men. Amazon conducted an investigation and determined that hundreds of customers were going to the same items while they were shopping on the site.
44. Amazon.com and Pat Robertson
- It turned out that a loosely organized group who didn't like the right-wing evangelist Pat Robertson managed to trick the Amazon recommender into linking his book "Six Steps to a Spiritual Life" with a book on anal sex for men
- Robertson's book was the target of a profile injection attack.
45. Attacking Collaborative Filtering Systems
Item1 Item 2 Item 3 Item 4 Item 5 Item 6 Correlation with Alice
Alice 5 2 3 3 ?
User 1 2 4 4 1 -1.00
User 2 2 1 3 1 2 0.33
User 3 4 2 3 2 1 0.90
User 4 3 3 2 3 1 0.19
User 5 3 2 2 2 -1.00
User 6 5 3 1 3 2 0.65
User 7 5 1 5 1 -1.00
Best match
Prediction: ?
Using k-nearest neighbor with k = 1
46. A Successful Push Attack
Item1 Item 2 Item 3 Item 4 Item 5 Item 6 Correlation with Alice
Alice 5 2 3 3 ?
User 1 2 4 4 1 -1.00
User 2 2 1 3 1 2 0.33
User 3 4 2 3 2 1 0.90
User 4 3 3 2 3 1 0.19
User 5 3 2 2 2 -1.00
User 6 5 3 1 3 2 0.65
User 7 5 1 5 1 -1.00
Attack 1 2 3 2 5 -1.00
Attack 2 3 2 3 2 5 0.76
Attack 3 3 2 2 2 5 0.93
Best match
Prediction: ?
User-based algorithm using k-nearest neighbor with k = 1
47. Item-Based Collaborative Filtering
Prediction: ?
Item1 Item 2 Item 3 Item 4 Item 5 Item 6
Alice 5 2 3 3 ?
User 1 2 4 4 1
User 2 2 1 3 1 2
User 3 4 2 3 2 1
User 4 3 3 2 3 1
User 5 3 2 2 2
User 6 5 3 1 3 2
User 7 5 1 5 1
Item similarity 0.76 0.79 0.60 0.71 0.75
Best match
But, what if the attacker knows, independently,
that Item1 is generally popular?
48. A Push Attack Against the Item-Based Algorithm
Prediction: ?
Item1 Item 2 Item 3 Item 4 Item 5 Item 6
Alice 5 2 3 3 ?
User 1 2 4 4 1
User 2 2 1 3 1 2
User 3 4 2 3 2 1
User 4 3 3 2 3 1
User 5 3 2 2 2
User 6 5 3 1 3 2
User 7 5 1 5 1
Attack 1 5 1 1 1 1 5
Attack 2 5 1 1 1 1 5
Attack 3 5 1 1 1 1 5
Item similarity 0.89 0.53 0.49 0.70 0.50
BestMatch