Title: Similarity and Dissimilarity, Biclustering
UNIT-IV
- Measuring (dis)similarity
- Evaluating the output of clustering methods
- Spectral clustering
- Hierarchical clustering
- Agglomerative clustering
- Divisive clustering
- Choosing the number of clusters
- Clustering data points and features
- Bi-clustering
- Multi-view clustering
- K-Means clustering
- K-medoids clustering
- Application: image segmentation using K-means clustering
What are similarity and dissimilarity measures?
- Similarities are usually non-negative and are often between 0 (no similarity) and 1 (complete similarity). The dissimilarity between two objects is the numerical measure of the degree to which the two objects are different. Dissimilarity is lower for more similar pairs of objects, as the sketch below illustrates.
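A minimal sketch of this relationship in Python, using Euclidean distance as the dissimilarity and one common (but not the only) way of converting it into a bounded similarity; the conversion s = 1 / (1 + d) is an illustrative choice, not something the slides specify:

```python
import numpy as np

def dissimilarity(a, b):
    """Euclidean distance: 0 for identical objects, larger for more different ones."""
    return np.linalg.norm(np.asarray(a) - np.asarray(b))

def similarity(a, b):
    """One common way to map a distance into (0, 1]: s = 1 / (1 + d).
    Identical objects get s = 1 (complete similarity)."""
    return 1.0 / (1.0 + dissimilarity(a, b))

print(similarity([1, 2], [1, 2]))  # 1.0 (complete similarity)
print(similarity([1, 2], [4, 6]))  # 1 / (1 + 5) ~= 0.167 (less similar)
```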
- Manhattan distance is a metric in which the distance between two points is the sum of the absolute differences of their Cartesian coordinates.
- In simple language, it is the sum of the absolute differences between the x-coordinates and y-coordinates of the two points.
- Suppose we have a point A and a point B. To find the Manhattan distance between them, we just have to sum up the absolute variation along the x and y axes.
- We find the Manhattan distance between two points by measuring along axes at right angles. In a plane with p1 at (x1, y1) and p2 at (x2, y2), the Manhattan distance is |x1 - x2| + |y1 - y2|.
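A minimal sketch of this formula in Python with NumPy (the example points are made up for illustration):

```python
import numpy as np

def manhattan_distance(p1, p2):
    """Sum of absolute coordinate differences: |x1 - x2| + |y1 - y2|
    (and so on for higher dimensions)."""
    return np.sum(np.abs(np.asarray(p1) - np.asarray(p2)))

print(manhattan_distance((1, 2), (4, 6)))  # |1-4| + |2-6| = 7
```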
Clustering in Machine Learning
- Clustering or cluster analysis is a machine learning technique which groups the unlabelled dataset.
- It can be defined as "a way of grouping the data points into different clusters, consisting of similar data points. The objects with the possible similarities remain in a group that has less or no similarities with another group."
- It does this by finding some similar patterns in the unlabelled dataset, such as shape, size, color, behavior, etc., and divides the data as per the presence and absence of those similar patterns.
- It is an unsupervised learning method, hence no supervision is provided to the algorithm, and it deals with the unlabelled dataset.
- After applying this clustering technique, each cluster or group is given a cluster ID. An ML system can use this ID to simplify the processing of large and complex datasets.
- Example: Let's understand the clustering technique with the real-world example of a shopping mall. When we visit any shopping mall, we can observe that things with similar usage are grouped together: t-shirts are grouped in one section and trousers in another; similarly, in the vegetable section, apples, bananas, mangoes, etc., are grouped separately, so that we can easily find things.
- The clustering technique works in the same way. Another example of clustering is grouping documents according to topic.
- The clustering technique can be widely used in various tasks. Some of the most common uses of this technique are:
- Market segmentation
- Statistical data analysis
- Social network analysis
- Image segmentation
- etc.
- Apart from these general usages, it is used by Amazon in its recommendation system to provide recommendations as per the past search of products.
- Netflix also uses this technique to recommend movies and web series to its users as per their watch history.
- The below diagram explains the working of the clustering algorithm: we can see the different fruits are divided into several groups with similar properties.
(Diagram: different fruits divided into several groups with similar properties)
What is the spectral clustering algorithm?
- Spectral clustering is a technique with roots in graph theory, where the approach is used to identify communities of nodes in a graph based on the edges connecting them.
- The method is flexible and allows us to cluster non-graph data as well.
To perform spectral clustering we need 3 main steps (a sketch follows this list):
- Create a similarity graph between our N objects to cluster.
- Compute the first k eigenvectors of its Laplacian matrix to define a feature vector for each object.
- Run k-means on these features to separate objects into k classes.
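A minimal sketch of these three steps in Python, assuming scikit-learn, toy two-moons data, and an RBF similarity graph (the gamma value and dataset are illustrative choices, not part of the original slides):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.cluster import KMeans

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# Step 1: similarity graph between the N objects (RBF affinity here).
A = rbf_kernel(X, gamma=20.0)

# Step 2: first k eigenvectors of the unnormalized graph Laplacian L = D - A.
D = np.diag(A.sum(axis=1))
L = D - A
k = 2
eigvals, eigvecs = np.linalg.eigh(L)  # eigh: L is symmetric; eigenvalues ascend
features = eigvecs[:, :k]             # one k-dimensional feature vector per object

# Step 3: k-means on the spectral features.
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
print(labels[:20])
```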
Spectral Clustering: Matrix Representation
- Adjacency and Affinity Matrix (A)
Degree Matrix (D)
- A Degree Matrix is a diagonal matrix where each diagonal value is the degree of the corresponding node, i.e. the number of edges connected to it.
- We can also obtain the degrees of the nodes by taking the sum of each row of the adjacency matrix, as sketched below.
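A minimal sketch with a small, made-up adjacency matrix (the graph here is purely illustrative):

```python
import numpy as np

# Adjacency matrix of a small undirected graph:
# edges 0-1, 1-2, 2-0, 2-3
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])

# Degree matrix: row sums of A placed on the diagonal.
D = np.diag(A.sum(axis=1))
print(D)
# [[2 0 0 0]
#  [0 2 0 0]
#  [0 0 3 0]
#  [0 0 0 1]]
```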
Laplacian Matrix (L)
- This is another representation of the graph/data points, which carries the beautiful properties leveraged by Spectral Clustering.
- One such representation is obtained by subtracting the Adjacency Matrix from the Degree Matrix (i.e. L = D - A).
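Continuing the small example from the Degree Matrix sketch above:

```python
# Unnormalized graph Laplacian: L = D - A.
L = D - A
print(L)
# [[ 2 -1 -1  0]
#  [-1  2 -1  0]
#  [-1 -1  3 -1]
#  [ 0  0 -1  1]]
```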
Choosing the number of clusters in hierarchical clustering
- To get the optimal number of clusters for hierarchical clustering, we make use of a dendrogram, which is a tree-like chart that shows the sequence of merges or splits of clusters.
- If two clusters are merged, the dendrogram joins them in the chart, and the height of the join is the distance between those clusters, as the sketch below shows.
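A minimal sketch of drawing such a dendrogram with SciPy, assuming toy 2-D data (the data and the Ward linkage choice are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))      # toy data

Z = linkage(X, method='ward')     # agglomerative merge sequence
dendrogram(Z)                     # join heights = distances at which clusters merge
plt.xlabel('data points')
plt.ylabel('merge distance')
plt.show()
```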
How do you select the number of clusters in hierarchical clustering?
- Decide the number of clusters (k).
- Select k random points from the data as centroids.
- Assign all the points to the nearest cluster centroid.
- Calculate the centroids of the newly formed clusters.
How do you determine the number of clusters?
- Compute the clustering algorithm (e.g., k-means clustering) for different values of k. ...
- For each k, calculate the total within-cluster sum of squares (WSS).
- Plot the curve of WSS against the number of clusters k; the "elbow" of the curve suggests a good k, as sketched below.
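A minimal sketch of this procedure with scikit-learn, assuming toy blob data (the dataset and the range of k values are illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

wss = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss.append(km.inertia_)       # total within-cluster sum of squares

plt.plot(ks, wss, marker='o')     # look for the "elbow" in this curve
plt.xlabel('number of clusters k')
plt.ylabel('within-cluster sum of squares')
plt.show()
```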
Do we need to define the number of clusters in advance for hierarchical clustering?
- Hierarchical clustering does not require you to pre-specify the number of clusters the way that k-means does, but you do select a number of clusters from your output, e.g. by cutting the dendrogram at a chosen height.
Clustering data points and features
- Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to data points in other groups. In simple words, the aim is to segregate groups with similar traits and assign them into clusters.
What are features in clustering?
- A clustering feature is essentially a summary of the statistics for the given cluster.
- Using a clustering feature, we can easily derive many useful statistics of a cluster, for example the cluster's centroid x0, radius R, and diameter D.
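The formulas for these statistics were lost from the transcript; for reference, the usual textbook definitions for a cluster of n points x_i are:

$$x_0 = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
R = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \lVert x_i - x_0 \rVert^2}, \qquad
D = \sqrt{\frac{\sum_{i=1}^{n}\sum_{j=1}^{n} \lVert x_i - x_j \rVert^2}{n(n-1)}}$$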
What is clustering in business intelligence?
- Cluster analysis, or simply clustering, is the process of partitioning a set of data objects (or observations) into subsets. ... In business intelligence, clustering can be used to organize a large number of customers into groups, where customers within a group share strongly similar characteristics.
What is Biclustering used for?
- Biclustering is a powerful data mining technique that allows clustering of the rows and columns of a matrix-format data set simultaneously. It was first applied to gene expression data in 2000, aiming to identify co-expressed genes under a subset of all the conditions/samples.
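A minimal sketch of biclustering with scikit-learn's spectral co-clustering, assuming a toy matrix with planted biclusters (the data shape and number of clusters are illustrative; rows and columns could stand for genes and conditions):

```python
from sklearn.datasets import make_biclusters
from sklearn.cluster import SpectralCoclustering

# Toy matrix (rows x columns) with 3 planted biclusters.
data, rows, cols = make_biclusters(shape=(30, 20), n_clusters=3,
                                   noise=0.5, random_state=0)

model = SpectralCoclustering(n_clusters=3, random_state=0)
model.fit(data)

print(model.row_labels_)      # bicluster assignment of each row (e.g. gene)
print(model.column_labels_)   # bicluster assignment of each column (e.g. condition)
```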
What is a clustering algorithm in BI?
- The K-Means clustering algorithm is a process by which objects are classified into a number of groups so that they are as dissimilar as possible from one group to another, and as similar as possible within each group. K-Means clustering is a grouping of similar things or data.
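A minimal sketch of K-Means with scikit-learn, assuming toy blob data (the dataset and number of clusters are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)    # each point gets a cluster ID

print(labels[:10])
print(km.cluster_centers_)    # one centroid per group
```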
What is multi-view clustering?
- Multi-view graph clustering: this category of methods seeks to find a fusion graph (or network) across all views and then uses graph-cut algorithms or other technologies (e.g., spectral clustering) on the fusion graph in order to produce the clustering result.
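A minimal sketch of this idea under a deliberately simple fusion rule: the per-view affinities are just averaged here, whereas real multi-view methods learn the fusion graph. The toy views and cluster count are made up for illustration:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
n = 100
view1 = rng.normal(size=(n, 5))   # one feature representation of the n objects
view2 = rng.normal(size=(n, 8))   # another representation of the same objects

# One similarity graph per view, fused by simple averaging (an assumption).
fused = (rbf_kernel(view1) + rbf_kernel(view2)) / 2

# Spectral clustering on the fused affinity graph.
labels = SpectralClustering(n_clusters=3, affinity='precomputed',
                            random_state=0).fit_predict(fused)
print(labels[:20])
```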
K-means clustering
- Steps:
- Take the mean value.
- Find the points nearest to the mean and put them in the cluster.
- Repeat steps 1-2 until we get the same mean.
- A minimal from-scratch sketch of this loop follows.