Title: Web Data Mining
1 Web Data Mining
- Clustering of Web Documents
- Similarity Metrics
- A.A. 2006/2007
2 Learning: What Does It Mean?
- Definition from Wikipedia:
- Machine learning is an area of artificial intelligence concerned with the development of techniques which allow computers to "learn".
- More specifically, machine learning is a method for creating computer programs by the analysis of data sets.
- Machine learning overlaps heavily with statistics, since both fields study the analysis of data, but unlike statistics, machine learning is concerned with the algorithmic complexity of computational implementations.
3 And What About Clustering?
- Again from Wikipedia:
- Data clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics. Clustering is the classification of similar objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait - often proximity according to some defined distance measure.
- Machine learning typically regards data clustering as a form of unsupervised learning.
4 What Is Clustering?
5 Clustering
- Example where Euclidean distance is the distance metric.
- Hierarchical clustering dendrogram.
6 Clustering: K-Means
- Randomly generate k clusters and determine the cluster centers, or directly generate k seed points as cluster centers.
- Assign each point to the nearest cluster center.
- Recompute the new cluster centers.
- Repeat until some convergence criterion is met (usually that the assignment hasn't changed).
7 K-means Clustering
- Each point is assigned to the cluster with the closest center point.
- The number K must be specified.
- Basic algorithm (a minimal code sketch follows below):
- 1. Select K points as the initial center points.
- Repeat:
- 2. Form K clusters by assigning all points to the closest center point.
- 3. Recompute the center point of each cluster.
- Until the center points don't change.
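A minimal sketch of this basic algorithm in Python with NumPy (function names, the random initialization and the convergence test are illustrative choices, not taken from the slides):

import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-means: returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # 1. Select K points as the initial center points.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Form K clusters by assigning each point to the closest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute the center point of each cluster (keep the old center
        #    if a cluster happens to become empty).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop when the center points don't change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels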
8 Example 1: data points
Imagine the k-means clustering result.
9 K-means clustering result
10 Importance of Choosing Initial Guesses (1)
11 Importance of Choosing Initial Guesses (2)
12 Local optima of K-means
Supplement to K-means
- The local optima of K-means:
- K-means tries to minimize the sum of point-centroid distances; this optimization is difficult.
- After each iteration of K-means the MSE (mean square error) decreases, but K-means may converge to a local optimum. K-means is therefore sensitive to the initial guesses (a common remedy is sketched below).
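A common remedy is to run K-means several times with different initial guesses and keep the solution with the lowest sum of squared point-centroid distances. A sketch, reusing the kmeans() function from the earlier block and purely illustrative synthetic data:

import numpy as np

# Illustrative data: three Gaussian blobs.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
                  for c in ([0, 0], [5, 0], [0, 5])])

def sse(points, centers, labels):
    """Sum of squared point-to-centroid distances (the quantity K-means decreases)."""
    return float(((points - centers[labels]) ** 2).sum())

# Run K-means with several random initial guesses and keep the best result.
best = None
for seed in range(10):
    centers, labels = kmeans(data, k=3, seed=seed)   # kmeans() from the sketch above
    score = sse(data, centers, labels)
    if best is None or score < best[0]:
        best = (score, centers, labels)
print("best SSE over 10 restarts:", best[0])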
13 Example 2: data points
14 Imagine the clustering result
15 Example 2: K-means clustering result
16 Limitations of K-means
- K-means has problems when clusters have differing:
- Sizes
- Densities
- Non-globular shapes
- K-means has problems when the data contains outliers.
- One solution is to use many clusters:
- Find parts of clusters, which then need to be put together.
17 Clustering: K-Means
- The main advantages of this algorithm are its simplicity and speed, which allow it to run on large datasets. Its disadvantage is that it does not yield the same result with each run, since the resulting clusters depend on the initial random assignments.
- It maximizes inter-cluster (or, equivalently, minimizes intra-cluster) variance, but does not ensure that the result is a global minimum of the variance.
18 d_intra or d_inter? That is the question!
- d_intra is the distance among elements (points or objects, or whatever) of the same cluster.
- d_inter is the distance among clusters.
- Questions:
- Should we use distance or similarity?
- Should we care about inter-cluster distance?
- Should we care about cluster shape?
- Should we care about clustering?
19 Distance Functions
- Informal definition:
- The distance between two points is the length of a straight line segment between them.
- A more formal definition:
- A distance between two points P and Q in a metric space is d(P,Q), where d is the distance function that defines the given metric space.
- We can also define the distance between two sets A and B in a metric space as the minimum (or infimum) of the distances between any two points P in A and Q in B.
20 Distance or Similarity?
- In a very straightforward way we can define the similarity function sim: S x S -> [0,1] as sim(o1,o2) = 1 - d(o1,o2), where o1 and o2 are elements of the space S.
21 What does similar (or distant) really mean?
- Learning (either supervised or unsupervised) is impossible without ASSUMPTIONS:
- Watanabe's Ugly Duckling theorem
- Wolpert's No Free Lunch theorem
- Learning is impossible without some sort of bias.
22 The Ugly Duckling Theorem
The theorem gets its fanciful name from the following counter-intuitive statement: assuming similarity is based on the number of shared predicates, an ugly duckling is as similar to a beautiful swan A as a beautiful swan B is to A, given that A and B differ at all. It was proposed and proved by Satosi Watanabe in 1969.
23 Satosi's Theorem
- Let n be the cardinality of the universal set S.
- We have to classify its objects without prior knowledge of the essence of categories.
- The number of different classes, i.e. the different ways to group the objects into clusters, is given by the cardinality of the power set of S: |Pow(S)| = 2^n.
- Without any prior information, the most natural way to measure the similarity between two distinct objects is to count the number of classes they share.
- Oooops! They all share exactly the same number of classes, namely 2^(n-2).
24 The ugly duckling and 3 beautiful swans
- S = {o1, o2, o3, o4}
- Pow(S) = { {}, {o1}, {o2}, {o3}, {o4}, {o1,o2}, {o1,o3}, {o1,o4}, {o2,o3}, {o2,o4}, {o3,o4}, {o1,o2,o3}, {o1,o2,o4}, {o1,o3,o4}, {o2,o3,o4}, {o1,o2,o3,o4} }
- How many classes do oi and oj (i ≠ j) have in common?
- o1 and o3: 4
- o1 and o4: 4
25 The ugly duckling and 3 beautiful swans
- In binary: 0000, 0001, 0010, 0100, 1000, ...
- Choose two objects.
- Reorder the bits so that the chosen objects are represented by the first two bits.
- How many strings share the first two bits set to 1?
- 2^(n-2)
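A small brute-force check of this count in Python for n = 4 (object names are illustrative):

from itertools import combinations

objects = ["o1", "o2", "o3", "o4"]
n = len(objects)

# Enumerate the power set of S.
power_set = [set(c) for r in range(n + 1) for c in combinations(objects, r)]
assert len(power_set) == 2 ** n  # 16 classes

# Count the classes shared by every pair of distinct objects.
for a, b in combinations(objects, 2):
    shared = sum(1 for cls in power_set if a in cls and b in cls)
    print(a, b, shared)   # always 4, i.e. 2**(n-2)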
26 Wolpert's No Free Lunch Theorem
- For any two algorithms, A and B, there exist datasets for which algorithm A outperforms algorithm B in prediction accuracy on unseen instances.
- Proof: Take any Boolean concept. If A outperforms B on unseen instances, reverse the labels and B will outperform A.
27 So Let's Get Back to Distances
- In a metric space a distance is a function d: S x S -> R such that, if a, b, c are elements of S:
- d(a,b) >= 0
- d(a,b) = 0 iff a = b
- d(a,b) = d(b,a)
- d(a,c) <= d(a,b) + d(b,c)
- The fourth property (triangle inequality) holds only if we are in a metric space.
28 Minkowski Distance
- Let's consider two elements of a set S described by their feature vectors:
- x = (x1, x2, ..., xn)
- y = (y1, y2, ..., yn)
- The Minkowski distance is parametric in p >= 1.
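The formula itself appeared only as a figure on the slide; the standard Minkowski distance for the two feature vectors above is:

d_p(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}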
29 p = 1: Manhattan Distance
- If p = 1 the distance is called the Manhattan distance.
- It is also called the taxicab distance, because it is the distance a car would drive in a city laid out in square blocks (if there are no one-way streets).
30 p = 2: Euclidean Distance
- If p = 2 the distance is the well-known Euclidean distance.
31 p -> ∞: Chebyshev Distance
- If p -> ∞ then we must take the limit of the Minkowski distance.
- It is also called the chessboard distance.
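A compact sketch of the three special cases in Python with NumPy (function names and test vectors are illustrative):

import numpy as np

def minkowski(x, y, p):
    """Minkowski distance between two feature vectors (p >= 1)."""
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

def manhattan(x, y):          # p = 1
    return float(np.sum(np.abs(x - y)))

def euclidean(x, y):          # p = 2
    return float(np.sqrt(np.sum((x - y) ** 2)))

def chebyshev(x, y):          # limit for p -> infinity
    return float(np.max(np.abs(x - y)))

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])
print(manhattan(x, y), euclidean(x, y), chebyshev(x, y))  # 5.0  ~3.606  3.0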
32 2D Cosine Similarity
- It's easy to explain in 2D.
- Let's consider a = (x1, y1) and b = (x2, y2).
[Figure: vectors a and b with the angle θ between them]
33 Cosine Similarity
- Let's consider two points x, y in R^n.
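The formula was a figure on the slide; for x, y in R^n the cosine similarity is:

\cos(x, y) = \frac{x \cdot y}{\|x\|\,\|y\|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\;\sqrt{\sum_{i=1}^{n} y_i^2}}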
34 Jaccard Distance
- Another commonly used distance is the Jaccard distance.
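The formula was also a figure on the slide; in its usual set form, the Jaccard distance between two sets A and B is:

d_J(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}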
35 Binary Jaccard Distance
- In the case of binary feature vectors the Jaccard distance can be simplified to the form below.
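The simplified formula was a figure on the slide; the usual binary form counts positions where both vectors have a one versus positions where at least one does, d_J(x, y) = 1 - M11 / (M01 + M10 + M11). A sketch in Python with NumPy, assuming 0/1 vectors (names are illustrative):

import numpy as np

def binary_jaccard_distance(x, y):
    """Jaccard distance for 0/1 feature vectors."""
    both = np.sum((x == 1) & (y == 1))        # positions where both are 1
    either = np.sum((x == 1) | (y == 1))      # positions where at least one is 1
    if either == 0:
        return 0.0
    return 1.0 - both / either

x = np.array([1, 0, 1, 1, 0])
y = np.array([1, 1, 0, 1, 0])
print(binary_jaccard_distance(x, y))  # 1 - 2/4 = 0.5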
36 Edit Distance
- The Levenshtein distance (or edit distance) between two strings is given by the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character.
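A standard dynamic-programming sketch in Python (illustrative code, not taken from the slides):

def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions turning s into t."""
    m, n = len(s), len(t)
    # prev[j] holds the distance between s[:i-1] and t[:j]; start with s empty.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

print(levenshtein("kitten", "sitting"))  # 3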
37 Binary Edit Distance
- The binary edit distance, d(x,y), from a binary vector x to a binary vector y is the minimum number of single-bit flips required to transform one vector into the other.
- Example: x = (0,1,0,0,1,1), y = (1,1,0,1,0,1), d(x,y) = 3.
- The binary edit distance is equivalent to the Manhattan distance (Minkowski, p = 1) for binary feature vectors.
38 The Curse of High Dimensionality
- Dimensionality is one of the main problems to face when clustering data.
- Roughly speaking, the higher the dimensionality, the lower the power to recognize similar objects.
39 Volume of the Unit-Radius Sphere
40 Sphere/Cube Volume Ratio
- Unit-radius sphere vs. cube whose edge length is 2.
41 Sphere/Sphere Volume Ratio
- Two embedded spheres, with radii 1 and 0.9.
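These ratios can be computed directly: the unit-radius ball in d dimensions has volume pi^(d/2) / Gamma(d/2 + 1). A short Python sketch (the list of dimensions is chosen for illustration):

import math

def sphere_volume(d, r=1.0):
    """Volume of a d-dimensional ball of radius r."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r ** d

for d in (1, 2, 3, 5, 10, 20):
    cube = 2.0 ** d                                           # cube with edge length 2
    print(d,
          round(sphere_volume(d) / cube, 6),                  # sphere/cube ratio -> 0
          round(sphere_volume(d, 0.9) / sphere_volume(d), 6)) # inner/outer sphere = 0.9**d -> 0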
42 Concentration of the Norm Phenomenon
- Gaussian distributions of points (std. dev. 1), in dimensions 1, 2, 3, 5, 10, and 20.
- Probability density function of finding a point, drawn according to such a Gaussian distribution, at distance r from the center of the distribution.
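A quick numerical illustration of the phenomenon (a sketch; the sample size is arbitrary): draw points from a d-dimensional standard Gaussian and look at their distance from the center.

import numpy as np

rng = np.random.default_rng(0)
for d in (1, 2, 3, 5, 10, 20):
    # 100,000 points drawn from a d-dimensional standard Gaussian.
    points = rng.normal(size=(100_000, d))
    r = np.linalg.norm(points, axis=1)    # distance from the center
    # As d grows the mean distance grows (roughly like sqrt(d)) while the spread
    # stays about the same, so the norm concentrates around its mean.
    print(d, round(r.mean(), 3), round(r.std(), 3))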
43 Web Document Representation
- The Web can be characterized in three different ways:
- Content
- Structure
- Usage
- We are concerned with Web content information.
44 Bag-of-Words vs. Vector-Space
- Let C be a collection of N documents d1, d2, ..., dN.
- Each document is composed of terms drawn from a term-set of dimension T.
- Each document can be represented in two different ways:
- Bag-of-Words
- Vector-Space
45 Bag-of-Words
- In the bag-of-words model the document is represented as d = {Apple, Banana, Coffee, Peach}.
- Each term is represented.
- No information on frequency.
- Binary encoding: a T-dimensional bit vector.
- Example document: Apple Peach Apple Banana Apple Banana Coffee Apple Coffee
- d = (1,0,1,1,0,0,0,1)
46 Vector-Space
- In the vector-space model the document is represented as d = {<Apple,4>, <Banana,2>, <Coffee,2>, <Peach,1>}.
- Information about term frequency is recorded.
- T-dimensional vectors.
- Example document: Apple Peach Apple Banana Apple Banana Coffee Apple Coffee
- d = (4,0,2,2,0,0,0,1)
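A sketch of the two representations in Python. The 8-term vocabulary below is an illustrative assumption, chosen so that the resulting vectors match the 8-dimensional examples above:

from collections import Counter

# Illustrative fixed term-set of dimension T = 8 (only four terms come from the slides).
term_set = ["Apple", "Avocado", "Banana", "Coffee", "Grape", "Lemon", "Mango", "Peach"]

doc = "Apple Peach Apple Banana Apple Banana Coffee Apple Coffee".split()

# Bag-of-words: binary T-dimensional bit vector (term present or not).
bag_of_words = [1 if term in doc else 0 for term in term_set]

# Vector-space: T-dimensional vector of term frequencies.
counts = Counter(doc)
vector_space = [counts[term] for term in term_set]

print(bag_of_words)   # [1, 0, 1, 1, 0, 0, 0, 1]
print(vector_space)   # [4, 0, 2, 2, 0, 0, 0, 1]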
47 Typical Web Collection Dimensions
- No. of documents: 20B.
- No. of terms: approx. 150,000,000!!!
- Very high dimensionality.
- Each document contains from 100 to 1000 terms.
- Classical clustering algorithms cannot be used.
- Each document is very similar to the others, due to the geometric properties just seen.
48 We Need to Cope with High Dimensionality
- Possible solutions?
- JUST ONE: Reduce Dimensions!!!!
49 Dimensionality Reduction
- Dimensionality reduction approaches can be divided into two categories:
- Feature selection approaches try to find a subset of the original features. Optimal feature selection for supervised learning problems requires an exhaustive search of all possible subsets of features of the chosen cardinality.
- Feature extraction applies a mapping of the multidimensional space into a space of fewer dimensions. This means that the original feature space is transformed, e.g. by applying a linear transformation via a principal components analysis.
50 PCA - Principal Component Analysis
- In statistics, principal components analysis (PCA) is a technique that can be used to simplify a dataset.
- More formally, it is a linear transformation that chooses a new coordinate system for the data set such that the greatest variance by any projection of the data set comes to lie on the first axis (then called the first principal component), the second greatest variance on the second axis, and so on.
- PCA can be used for reducing dimensionality in a dataset while retaining those characteristics of the dataset that contribute most to its variance, by eliminating the later principal components (by a more or less heuristic decision).
51 The Method
- Suppose you have a random vector population x:
- x = (x1, x2, ..., xn)^T
- with mean
- μ_x = E{x}
- and covariance matrix
- C_x = E{(x - μ_x)(x - μ_x)^T}
52 The Method
- The components of C_x, denoted by c_ij, represent the covariances between the random variable components x_i and x_j. The component c_ii is the variance of the component x_i. The variance of a component indicates the spread of the component values around its mean value. If two components x_i and x_j of the data are uncorrelated, their covariance is zero (c_ij = c_ji = 0).
- The covariance matrix is, by definition, always symmetric.
- Given a sample of vectors x1, x2, ..., xM, we can calculate the sample mean and the sample covariance matrix as estimates of the mean and the covariance matrix.
53 The Method
- From a symmetric matrix such as the covariance matrix, we can calculate an orthogonal basis by finding its eigenvalues and eigenvectors. The eigenvectors e_i and the corresponding eigenvalues λ_i are the solutions of the equation C_x e_i = λ_i e_i, i.e. |C_x - λ_i I| = 0.
- By ordering the eigenvectors in order of descending eigenvalues (largest first), one can create an ordered orthogonal basis with the first eigenvector having the direction of largest variance of the data. In this way, we can find the directions in which the data set has the most significant amounts of energy.
54 The Method
- As above, order the eigenvectors by descending eigenvalue, so that the first eigenvector has the direction of largest variance of the data.
- To reduce the dimensionality from n to k, take only the first k eigenvectors.
55 A Graphical Example
56 PCA Eigenvectors Projection
- Project the data onto the selected eigenvectors (a code sketch follows below).
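A minimal NumPy sketch of the whole PCA pipeline described above, from sample mean and covariance to projection onto the first k eigenvectors (data and names are illustrative):

import numpy as np

def pca_project(X, k):
    """Project the rows of X (M samples, n features) onto the first k principal components."""
    # Sample mean and centered data.
    mu = X.mean(axis=0)
    Xc = X - mu
    # Sample covariance matrix (n x n), symmetric by construction.
    C = np.cov(Xc, rowvar=False)
    # Eigenvalues/eigenvectors of the symmetric covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(C)
    # Order by descending eigenvalue and keep the first k eigenvectors.
    order = np.argsort(eigvals)[::-1]
    W = eigvecs[:, order[:k]]
    # Project the centered data onto the selected eigenvectors.
    return Xc @ W

X = np.random.default_rng(0).normal(size=(200, 5))
print(pca_project(X, 2).shape)   # (200, 2)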
57 Another Example
58 Singular Value Decomposition
- A technique used for reducing dimensionality, based on some properties of symmetric matrices.
- Will be the subject of a talk given by one of you!!!!
59 Locality-Sensitive Hashing
- The key idea of this approach is to create a small signature for each document, to ensure that similar documents have similar signatures.
- There exists a family H of hash functions such that for each pair of pages u, v we have Pr[mh(u) = mh(v)] = sim(u,v), where the hash function mh is chosen at random from the family H.
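One well-known family with exactly this property is minhash, where sim(u, v) is the Jaccard similarity of the two documents' term sets. A small sketch (the hashing scheme, the example documents and the parameters below are illustrative assumptions, not the slides' own construction):

import random

def minhash_signature(terms, num_hashes=64, seed=0):
    """Small signature: for each salted hash function keep the minimum hash over the term set."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [min(hash((salt, t)) for t in terms) for salt in salts]

def estimated_similarity(sig_u, sig_v):
    """Fraction of agreeing signature components approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_u, sig_v)) / len(sig_u)

u = {"web", "data", "mining", "clustering"}
v = {"web", "data", "mining", "classification"}
print(estimated_similarity(minhash_signature(u), minhash_signature(v)))  # roughly 0.6 (true Jaccard is 3/5)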
60 Locality-Sensitive Hashing
- Will be the subject of a talk given by one of you!!!!!!!