Title: Web Data Mining
1 Web Data Mining
- Clustering of Web Documents
- Similarity Metrics
- A.A. 2006/2007
2 Learning: What Does It Mean?
- Definition from Wikipedia:
- Machine learning is an area of artificial intelligence concerned with the development of techniques which allow computers to "learn".
- More specifically, machine learning is a method for creating computer programs by the analysis of data sets.
- Machine learning overlaps heavily with statistics, since both fields study the analysis of data, but unlike statistics, machine learning is concerned with the algorithmic complexity of computational implementations.
3 And What About Clustering?
- Again from Wikipedia:
- Data clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics. Clustering is the classification of similar objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait - often proximity according to some defined distance measure.
- Machine learning typically regards data clustering as a form of unsupervised learning.
4 What Is Clustering?
5 Clustering
- Example where Euclidean distance is the distance metric.
- Hierarchical clustering dendrogram.
6 Clustering: K-Means
- Randomly generate k clusters and determine the cluster centers, or directly generate k seed points as cluster centers.
- Assign each point to the nearest cluster center.
- Recompute the new cluster centers.
- Repeat until some convergence criterion is met (usually that the assignment hasn't changed).
7 K-means Clustering
- Each point is assigned to the cluster with the closest center point.
- The number K must be specified.
- Basic algorithm (a minimal code sketch follows below):
- 1. Select K points as the initial center points.
- Repeat:
- 2. Form K clusters by assigning all points to the closest center point.
- 3. Recompute the center point of each cluster.
- Until the center points don't change.
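A minimal sketch of this basic algorithm in Python with NumPy (function names, the random initialization and the convergence test are illustrative choices, not taken from the slides):

import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-means: returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # 1. Select K points as the initial center points.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Form K clusters by assigning each point to the closest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute the center point of each cluster (keep the old center
        #    if a cluster happens to become empty).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop when the center points don't change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels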
8 Example 1: data points
Imagine the k-means clustering result.
9 K-means clustering result
10 Importance of Choosing Initial Guesses (1)
11 Importance of Choosing Initial Guesses (2)
12 Local optima of K-means
Supplement to K-means
- The local optima of K-means:
- K-means tries to minimize the sum of point-centroid distances; this optimization is difficult.
- After each iteration of K-means the MSE (mean square error) decreases, but K-means may converge to a local optimum. K-means is therefore sensitive to the initial guesses (a common remedy is sketched below).
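A common remedy is to run K-means several times with different initial guesses and keep the solution with the lowest sum of squared point-centroid distances. A sketch, reusing the kmeans() function from the earlier block and purely illustrative synthetic data:

import numpy as np

# Illustrative data: three Gaussian blobs.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
                  for c in ([0, 0], [5, 0], [0, 5])])

def sse(points, centers, labels):
    """Sum of squared point-to-centroid distances (the quantity K-means decreases)."""
    return float(((points - centers[labels]) ** 2).sum())

# Run K-means with several random initial guesses and keep the best result.
best = None
for seed in range(10):
    centers, labels = kmeans(data, k=3, seed=seed)   # kmeans() from the sketch above
    score = sse(data, centers, labels)
    if best is None or score < best[0]:
        best = (score, centers, labels)
print("best SSE over 10 restarts:", best[0])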
13 Example 2: data points
14 Imagine the clustering result
15 Example 2: K-means clustering result
16 Limitations of K-means
- K-means has problems when clusters have differing:
- Sizes
- Densities
- Non-globular shapes
- K-means has problems when the data contains outliers.
- One solution is to use many clusters:
- Find parts of clusters, which then need to be put together.
17 Clustering: K-Means
- The main advantages of this algorithm are its simplicity and speed, which allow it to run on large datasets. Its disadvantage is that it does not yield the same result with each run, since the resulting clusters depend on the initial random assignments.
- It maximizes inter-cluster (or, equivalently, minimizes intra-cluster) variance, but does not ensure that the result is a global minimum of the variance.
18 d_intra or d_inter? That is the question!
- d_intra is the distance among elements (points or objects, or whatever) of the same cluster.
- d_inter is the distance among clusters.
- Questions:
- Should we use distance or similarity?
- Should we care about inter-cluster distance?
- Should we care about cluster shape?
- Should we care about clustering?
19 Distance Functions
- Informal definition:
- The distance between two points is the length of a straight line segment between them.
- A more formal definition:
- A distance between two points P and Q in a metric space is d(P,Q), where d is the distance function that defines the given metric space.
- We can also define the distance between two sets A and B in a metric space as the minimum (or infimum) of the distances between any two points P in A and Q in B.
20 Distance or Similarity?
- In a very straightforward way we can define the similarity function sim: S x S -> [0,1] as sim(o1,o2) = 1 - d(o1,o2), where o1 and o2 are elements of the space S.
21 What does similar (or distant) really mean?
- Learning (either supervised or unsupervised) is impossible without ASSUMPTIONS:
- Watanabe's Ugly Duckling theorem
- Wolpert's No Free Lunch theorem
- Learning is impossible without some sort of bias.
22 The Ugly Duckling Theorem
The theorem gets its fanciful name from the following counter-intuitive statement: assuming similarity is based on the number of shared predicates, an ugly duckling is as similar to a beautiful swan A as a beautiful swan B is to A, given that A and B differ at all. It was proposed and proved by Satosi Watanabe in 1969.
23 Satosi's Theorem
- Let n be the cardinality of the universal set S.
- We have to classify its objects without prior knowledge of the essence of categories.
- The number of different classes, i.e. the different ways to group the objects into clusters, is given by the cardinality of the power set of S: |Pow(S)| = 2^n.
- Without any prior information, the most natural way to measure the similarity between two distinct objects is to count the number of classes they share.
- Oooops! They all share exactly the same number of classes, namely 2^(n-2).
24 The ugly duckling and 3 beautiful swans
- S = {o1, o2, o3, o4}
- Pow(S) = { {}, {o1}, {o2}, {o3}, {o4}, {o1,o2}, {o1,o3}, {o1,o4}, {o2,o3}, {o2,o4}, {o3,o4}, {o1,o2,o3}, {o1,o2,o4}, {o1,o3,o4}, {o2,o3,o4}, {o1,o2,o3,o4} }
- How many classes do oi and oj (i ≠ j) have in common?
- o1 and o3: 4
- o1 and o4: 4
25 The ugly duckling and 3 beautiful swans
- In binary: 0000, 0001, 0010, 0100, 1000, ...
- Choose two objects.
- Reorder the bits so that the chosen objects are represented by the first two bits.
- How many strings share the first two bits set to 1?
- 2^(n-2)
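A small brute-force check of this count in Python for n = 4 (object names are illustrative):

from itertools import combinations

objects = ["o1", "o2", "o3", "o4"]
n = len(objects)

# Enumerate the power set of S.
power_set = [set(c) for r in range(n + 1) for c in combinations(objects, r)]
assert len(power_set) == 2 ** n  # 16 classes

# Count the classes shared by every pair of distinct objects.
for a, b in combinations(objects, 2):
    shared = sum(1 for cls in power_set if a in cls and b in cls)
    print(a, b, shared)   # always 4, i.e. 2**(n-2)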
26 Wolpert's No Free Lunch Theorem
- For any two algorithms, A and B, there exist datasets for which algorithm A outperforms algorithm B in prediction accuracy on unseen instances.
- Proof: Take any Boolean concept. If A outperforms B on unseen instances, reverse the labels and B will outperform A.
27 So Let's Get Back to Distances
- In a metric space a distance is a function d: S x S -> R such that, if a, b, c are elements of S:
- d(a,b) >= 0
- d(a,b) = 0 iff a = b
- d(a,b) = d(b,a)
- d(a,c) <= d(a,b) + d(b,c)
- The fourth property (triangle inequality) holds only if we are in a metric space.
28 Minkowski Distance
- Let's consider two elements of a set S described by their feature vectors:
- x = (x1, x2, ..., xn)
- y = (y1, y2, ..., yn)
- The Minkowski distance is parametric in p >= 1.
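The formula itself appeared only as a figure on the slide; the standard Minkowski distance for the two feature vectors above is:

d_p(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}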
29 p = 1: Manhattan Distance
- If p = 1 the distance is called the Manhattan distance.
- It is also called the taxicab distance, because it is the distance a car would drive in a city laid out in square blocks (if there are no one-way streets).
30 p = 2: Euclidean Distance
- If p = 2 the distance is the well-known Euclidean distance.
31 p -> ∞: Chebyshev Distance
- If p -> ∞ then we must take the limit of the Minkowski distance.
- It is also called the chessboard distance.
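A compact sketch of the three special cases in Python with NumPy (function names and test vectors are illustrative):

import numpy as np

def minkowski(x, y, p):
    """Minkowski distance between two feature vectors (p >= 1)."""
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

def manhattan(x, y):          # p = 1
    return float(np.sum(np.abs(x - y)))

def euclidean(x, y):          # p = 2
    return float(np.sqrt(np.sum((x - y) ** 2)))

def chebyshev(x, y):          # limit for p -> infinity
    return float(np.max(np.abs(x - y)))

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])
print(manhattan(x, y), euclidean(x, y), chebyshev(x, y))  # 5.0  ~3.606  3.0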
32 2D Cosine Similarity
- It's easy to explain in 2D.
- Let's consider a = (x1, y1) and b = (x2, y2).
[Figure: vectors a and b with the angle θ between them]
33 Cosine Similarity
- Let's consider two points x, y in R^n.
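The formula was a figure on the slide; for x, y in R^n the cosine similarity is:

\cos(x, y) = \frac{x \cdot y}{\|x\|\,\|y\|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\;\sqrt{\sum_{i=1}^{n} y_i^2}}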
34 Jaccard Distance
- Another commonly used distance is the Jaccard distance.
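The formula was also a figure on the slide; in its usual set form, the Jaccard distance between two sets A and B is:

d_J(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}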
35 Binary Jaccard Distance
- In the case of binary feature vectors the Jaccard distance can be simplified to the form below.
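The simplified formula was a figure on the slide; the usual binary form counts positions where both vectors have a one versus positions where at least one does, d_J(x, y) = 1 - M11 / (M01 + M10 + M11). A sketch in Python with NumPy, assuming 0/1 vectors (names are illustrative):

import numpy as np

def binary_jaccard_distance(x, y):
    """Jaccard distance for 0/1 feature vectors."""
    both = np.sum((x == 1) & (y == 1))        # positions where both are 1
    either = np.sum((x == 1) | (y == 1))      # positions where at least one is 1
    if either == 0:
        return 0.0
    return 1.0 - both / either

x = np.array([1, 0, 1, 1, 0])
y = np.array([1, 1, 0, 1, 0])
print(binary_jaccard_distance(x, y))  # 1 - 2/4 = 0.5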
36 Edit Distance
- The Levenshtein distance (or edit distance) between two strings is given by the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character.
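A standard dynamic-programming sketch in Python (illustrative code, not taken from the slides):

def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions turning s into t."""
    m, n = len(s), len(t)
    # prev[j] holds the distance between s[:i-1] and t[:j]; start with s empty.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

print(levenshtein("kitten", "sitting"))  # 3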
37 Binary Edit Distance
- The binary edit distance, d(x,y), from a binary vector x to a binary vector y is the minimum number of single-bit flips required to transform one vector into the other.
- Example: x = (0,1,0,0,1,1), y = (1,1,0,1,0,1), d(x,y) = 3.
- The binary edit distance is equivalent to the Manhattan distance (Minkowski, p = 1) for binary feature vectors.
38 The Curse of High Dimensionality
- Dimensionality is one of the main problems to face when clustering data.
- Roughly speaking, the higher the dimensionality, the lower the power to recognize similar objects.
39 Volume of the Unit-Radius Sphere
40 Sphere/Cube Volume Ratio
- Unit-radius sphere vs. cube whose edge length is 2.
41 Sphere/Sphere Volume Ratio
- Two embedded spheres, with radii 1 and 0.9.
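These ratios can be computed directly: the unit-radius ball in d dimensions has volume pi^(d/2) / Gamma(d/2 + 1). A short Python sketch (the list of dimensions is chosen for illustration):

import math

def sphere_volume(d, r=1.0):
    """Volume of a d-dimensional ball of radius r."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r ** d

for d in (1, 2, 3, 5, 10, 20):
    cube = 2.0 ** d                                           # cube with edge length 2
    print(d,
          round(sphere_volume(d) / cube, 6),                  # sphere/cube ratio -> 0
          round(sphere_volume(d, 0.9) / sphere_volume(d), 6)) # inner/outer sphere = 0.9**d -> 0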
42 Concentration of the Norm Phenomenon
- Gaussian distributions of points (std. dev. 1), in dimensions 1, 2, 3, 5, 10, and 20.
- Probability density function of finding a point, drawn according to such a Gaussian distribution, at distance r from the center of the distribution.
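A quick numerical illustration of the phenomenon (a sketch; the sample size is arbitrary): draw points from a d-dimensional standard Gaussian and look at their distance from the center.

import numpy as np

rng = np.random.default_rng(0)
for d in (1, 2, 3, 5, 10, 20):
    # 100,000 points drawn from a d-dimensional standard Gaussian.
    points = rng.normal(size=(100_000, d))
    r = np.linalg.norm(points, axis=1)    # distance from the center
    # As d grows the mean distance grows (roughly like sqrt(d)) while the spread
    # stays about the same, so the norm concentrates around its mean.
    print(d, round(r.mean(), 3), round(r.std(), 3))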
43 Web Document Representation
- The Web can be characterized in three different ways:
- Content
- Structure
- Usage
- We are concerned with Web content information.
44 Bag-of-Words vs. Vector-Space
- Let C be a collection of N documents d1, d2, ..., dN.
- Each document is composed of terms drawn from a term-set of dimension T.
- Each document can be represented in two different ways:
- Bag-of-Words
- Vector-Space
45 Bag-of-Words
- In the bag-of-words model the document is represented as d = {Apple, Banana, Coffee, Peach}.
- Each term is represented.
- No information on frequency.
- Binary encoding: a T-dimensional bit vector.
- Example document: Apple Peach Apple Banana Apple Banana Coffee Apple Coffee
- d = (1,0,1,1,0,0,0,1)
46 Vector-Space
- In the vector-space model the document is represented as d = {<Apple,4>, <Banana,2>, <Coffee,2>, <Peach,1>}.
- Information about term frequency is recorded.
- T-dimensional vectors.
- Example document: Apple Peach Apple Banana Apple Banana Coffee Apple Coffee
- d = (4,0,2,2,0,0,0,1)
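A sketch of the two representations in Python. The 8-term vocabulary below is an illustrative assumption, chosen so that the resulting vectors match the 8-dimensional examples above:

from collections import Counter

# Illustrative fixed term-set of dimension T = 8 (only four terms come from the slides).
term_set = ["Apple", "Avocado", "Banana", "Coffee", "Grape", "Lemon", "Mango", "Peach"]

doc = "Apple Peach Apple Banana Apple Banana Coffee Apple Coffee".split()

# Bag-of-words: binary T-dimensional bit vector (term present or not).
bag_of_words = [1 if term in doc else 0 for term in term_set]

# Vector-space: T-dimensional vector of term frequencies.
counts = Counter(doc)
vector_space = [counts[term] for term in term_set]

print(bag_of_words)   # [1, 0, 1, 1, 0, 0, 0, 1]
print(vector_space)   # [4, 0, 2, 2, 0, 0, 0, 1]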
47 Typical Web Collection Dimensions
- No. of documents: 20B.
- No. of terms: approx. 150,000,000!!!
- Very high dimensionality.
- Each document contains from 100 to 1000 terms.
- Classical clustering algorithms cannot be used.
- Each document is very similar to the others, due to the geometric properties just seen.
48 We Need to Cope with High Dimensionality
- Possible solutions?
- JUST ONE: Reduce Dimensions!!!!
49 Dimensionality Reduction
- Dimensionality reduction approaches can be divided into two categories:
- Feature selection approaches try to find a subset of the original features. Optimal feature selection for supervised learning problems requires an exhaustive search of all possible subsets of features of the chosen cardinality.
- Feature extraction applies a mapping of the multidimensional space into a space of fewer dimensions. This means that the original feature space is transformed, e.g. by applying a linear transformation via a principal components analysis.
50 PCA - Principal Component Analysis
- In statistics, principal components analysis (PCA) is a technique that can be used to simplify a dataset.
- More formally, it is a linear transformation that chooses a new coordinate system for the data set such that the greatest variance by any projection of the data set comes to lie on the first axis (then called the first principal component), the second greatest variance on the second axis, and so on.
- PCA can be used for reducing dimensionality in a dataset while retaining those characteristics of the dataset that contribute most to its variance, by eliminating the later principal components (by a more or less heuristic decision).
51 The Method
- Suppose you have a random vector population x:
- x = (x1, x2, ..., xn)^T
- with mean
- μ_x = E{x}
- and covariance matrix
- C_x = E{(x - μ_x)(x - μ_x)^T}
52 The Method
- The components of C_x, denoted by c_ij, represent the covariances between the random variable components x_i and x_j. The component c_ii is the variance of the component x_i. The variance of a component indicates the spread of the component values around its mean value. If two components x_i and x_j of the data are uncorrelated, their covariance is zero (c_ij = c_ji = 0).
- The covariance matrix is, by definition, always symmetric.
- Given a sample of vectors x1, x2, ..., xM, we can calculate the sample mean and the sample covariance matrix as estimates of the mean and the covariance matrix.
53 The Method
- From a symmetric matrix such as the covariance matrix, we can calculate an orthogonal basis by finding its eigenvalues and eigenvectors. The eigenvectors e_i and the corresponding eigenvalues λ_i are the solutions of the equation C_x e_i = λ_i e_i, i.e. |C_x - λ_i I| = 0.
- By ordering the eigenvectors in order of descending eigenvalues (largest first), one can create an ordered orthogonal basis with the first eigenvector having the direction of largest variance of the data. In this way, we can find the directions in which the data set has the most significant amounts of energy.
54 The Method
- As above, order the eigenvectors by descending eigenvalue, so that the first eigenvector has the direction of largest variance of the data.
- To reduce the dimensionality from n to k, take only the first k eigenvectors.
55 A Graphical Example
56 PCA Eigenvectors Projection
- Project the data onto the selected eigenvectors (a code sketch follows below).
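A minimal NumPy sketch of the whole PCA pipeline described above, from sample mean and covariance to projection onto the first k eigenvectors (data and names are illustrative):

import numpy as np

def pca_project(X, k):
    """Project the rows of X (M samples, n features) onto the first k principal components."""
    # Sample mean and centered data.
    mu = X.mean(axis=0)
    Xc = X - mu
    # Sample covariance matrix (n x n), symmetric by construction.
    C = np.cov(Xc, rowvar=False)
    # Eigenvalues/eigenvectors of the symmetric covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(C)
    # Order by descending eigenvalue and keep the first k eigenvectors.
    order = np.argsort(eigvals)[::-1]
    W = eigvecs[:, order[:k]]
    # Project the centered data onto the selected eigenvectors.
    return Xc @ W

X = np.random.default_rng(0).normal(size=(200, 5))
print(pca_project(X, 2).shape)   # (200, 2)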
57 Another Example
58 Singular Value Decomposition
- A technique used for reducing dimensionality, based on some properties of symmetric matrices.
- Will be the subject of a talk given by one of you!!!!
59 Locality-Sensitive Hashing
- The key idea of this approach is to create a small signature for each document, to ensure that similar documents have similar signatures.
- There exists a family H of hash functions such that for each pair of pages u, v we have Pr[mh(u) = mh(v)] = sim(u,v), where the hash function mh is chosen at random from the family H.
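One well-known family with exactly this property is minhash, where sim(u, v) is the Jaccard similarity of the two documents' term sets. A small sketch (the hashing scheme, the example documents and the parameters below are illustrative assumptions, not the slides' own construction):

import random

def minhash_signature(terms, num_hashes=64, seed=0):
    """Small signature: for each salted hash function keep the minimum hash over the term set."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [min(hash((salt, t)) for t in terms) for salt in salts]

def estimated_similarity(sig_u, sig_v):
    """Fraction of agreeing signature components approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_u, sig_v)) / len(sig_u)

u = {"web", "data", "mining", "clustering"}
v = {"web", "data", "mining", "classification"}
print(estimated_similarity(minhash_signature(u), minhash_signature(v)))  # roughly 0.6 (true Jaccard is 3/5)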
60 Locality-Sensitive Hashing
- Will be the subject of a talk given by one of you!!!!!!!