1
Important clustering methods used in microarray
data analysis
  • Steve Horvath
  • Human Genetics and Biostatistics
  • UCLA

2
Contents
  • Multidimensional scaling plots
  • Relation to principal component analysis
  • k-means clustering
  • Hierarchical clustering

3
Introduction to clustering
4
MDS plot of clusters
5
MDS plot of clusters
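Slides 4 and 5 show MDS plots of clusters; the figures themselves are not reproduced in this transcript. As a rough illustration, the sketch below computes classical metric MDS coordinates by double-centering squared Euclidean distances on a small simulated data set; with Euclidean distances this construction is closely related to principal component analysis. The data and variable names are illustrative, not taken from the presentation.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Simulated samples from 2 clusters (made-up data for the example).
rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, size=(20, 50)),
               rng.normal(2, 1, size=(20, 50))])

# Classical (metric) MDS: double-center the squared distance matrix ...
D2 = squareform(pdist(X)) ** 2
n = D2.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ D2 @ J

# ... and take the top 2 eigenvectors, scaled by sqrt(eigenvalue).
vals, vecs = np.linalg.eigh(B)
order = np.argsort(vals)[::-1][:2]
coords = vecs[:, order] * np.sqrt(vals[order])   # 2-D MDS coordinates

print(coords.shape)   # (40, 2): one point per sample, ready to scatter-plot
```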
6
Two references for clustering
  • T. Hastie, R. Tibshirani, J. Friedman (2002) The Elements of Statistical Learning. Springer Series in Statistics
  • L. Kaufman, P. Rousseeuw (1990) Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Series in Probability and Statistics

7
Introduction to clustering
Cluster analysis aims to group or segment a collection of objects into subsets or "clusters", such that objects within each cluster are more closely related to one another than objects assigned to different clusters.

An object can be described by a set of measurements (e.g. covariates, features, attributes) or by its relation to other objects.

Sometimes the goal is to arrange the clusters into a natural hierarchy, which involves successively grouping or merging the clusters themselves so that at each level of the hierarchy clusters within the same group are more similar to each other than those in different groups.

8
Proximity matrices are the input to most clustering algorithms
Proximity between pairs of objects is measured by a similarity or a dissimilarity. If the original data were collected as similarities, a monotone-decreasing function can be used to convert them to dissimilarities. Most algorithms use (symmetric) dissimilarities (e.g. distances), but the triangle inequality does not have to hold.
Triangle inequality: d(i, j) ≤ d(i, k) + d(k, j) for all objects i, j, k.
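As a rough illustration of the two options above, the sketch below builds a symmetric dissimilarity matrix for a small simulated expression matrix, either directly from Euclidean distances or by converting a correlation similarity with the monotone-decreasing map s → 1 − s; the data are made up for the example.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Simulated expression matrix: 6 samples (rows) x 100 genes (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 100))

# Option 1: Euclidean distances between samples (a true metric).
D_euclidean = squareform(pdist(X, metric="euclidean"))

# Option 2: start from a similarity (Pearson correlation) and convert it
# to a dissimilarity with the monotone-decreasing map s -> 1 - s.
S = np.corrcoef(X)    # similarity matrix, entries in [-1, 1]
D_corr = 1.0 - S      # symmetric dissimilarity; need not satisfy the triangle inequality

print(D_euclidean.shape, D_corr.shape)  # both (6, 6)
```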
9
Different intergroup dissimilarities
Let G and H represent 2 groups (clusters) of objects, and let d(i, j) denote the dissimilarity between objects i and j.
  • Single linkage (nearest neighbor): d_SL(G, H) = the minimum of d(i, j) over i in G, j in H.
  • Complete linkage (furthest neighbor): d_CL(G, H) = the maximum of d(i, j) over i in G, j in H.
  • Group average: d_GA(G, H) = the average of d(i, j) over all pairs i in G, j in H.
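The helper below (an illustrative name, not from the presentation) computes these three intergroup dissimilarities for two small simulated groups using plain NumPy.

```python
import numpy as np

def intergroup_dissimilarities(G, H):
    """Single, complete, and group-average dissimilarity between two groups.

    G, H: arrays of shape (n_G, p) and (n_H, p) holding the objects' features.
    """
    # Pairwise Euclidean distances d(i, j) for i in G, j in H.
    diff = G[:, None, :] - H[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    return {
        "single (min)": d.min(),
        "complete (max)": d.max(),
        "group average": d.mean(),
    }

rng = np.random.default_rng(1)
G = rng.normal(loc=0.0, size=(5, 2))   # simulated group G
H = rng.normal(loc=3.0, size=(7, 2))   # simulated group H
print(intergroup_dissimilarities(G, H))
```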
10
Agglomerative clustering, hierarchical clustering
and dendrograms
11
Hierarchical clustering plot
12
Agglomerative clustering
  • Agglomerative clustering algorithms begin with every observation representing a singleton cluster.
  • At each of the N-1 subsequent steps, the 2 closest (least dissimilar) clusters are merged into a single cluster.
  • Therefore a measure of dissimilarity between 2 clusters must be defined (see the sketch below).
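A minimal sketch of agglomerative clustering with SciPy on simulated data, where the between-cluster dissimilarity is selected via the linkage method; all inputs are made up for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 4))      # 10 observations, 4 features

d = pdist(X, metric="euclidean")  # condensed dissimilarity vector
Z = linkage(d, method="average")  # "single", "complete", or "average"

# Z has N-1 rows; each row records one merge:
# (cluster index 1, cluster index 2, intergroup dissimilarity, new cluster size)
print(Z.shape)                    # (9, 4)
```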

13
Comparing different linkage methods
  • If there is a strong clustering tendency, all 3 methods produce similar results.
  • Single linkage has a tendency to combine observations linked by a series of close intermediate observations ("chaining"). Good for elongated clusters.
  • Drawback: complete linkage may lead to clusters where observations assigned to a cluster can be much closer to members of other clusters than they are to some members of their own cluster. Use for very compact clusters (like pearls on a string).
  • Group average clustering represents a compromise between the extremes of single and complete linkage. Use for ball-shaped clusters (see the comparison sketched below).
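The comparison sketched below runs single, complete, and average linkage on the same simulated two-group data and cuts each tree into two clusters; with such a strong clustering tendency the three partitions typically agree. The data are made up for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Two well-separated simulated groups of samples.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, size=(15, 5)),
               rng.normal(4, 1, size=(15, 5))])
d = pdist(X)

for method in ("single", "complete", "average"):
    Z = linkage(d, method=method)
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
    print(method, labels)
```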

14
Dendrogram
  • Recursive binary splitting/agglomeration can be represented by a rooted binary tree.
  • The root node represents the entire data set.
  • The N terminal nodes of the tree represent individual observations.
  • Each nonterminal node ("parent") has two daughter nodes.
  • Thus the binary tree can be plotted so that the height of each node is proportional to the intergroup dissimilarity between its 2 daughters.
  • A dendrogram provides a complete description of the hierarchical clustering in graphical format (a plotting sketch follows below).
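A minimal plotting sketch, assuming Matplotlib is available: it draws the dendrogram of an average-linkage clustering, with node heights equal to the merge dissimilarities; the data are simulated.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

rng = np.random.default_rng(4)
X = rng.normal(size=(12, 3))

Z = linkage(pdist(X), method="average")
dendrogram(Z)                      # node heights = merge dissimilarities
plt.ylabel("intergroup dissimilarity")
plt.show()
```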

15
Comments on dendrograms
  • Caution: different hierarchical methods, as well as small changes in the data, can lead to different dendrograms.
  • Hierarchical methods impose hierarchical structure whether or not such structure actually exists in the data.
  • In general, dendrograms are a description of the results of the algorithm and not a graphical summary of the data.
  • They are a valid summary only to the extent that the pairwise observation dissimilarities obey the ultrametric inequality
    d(i, i') ≤ max{ d(i, k), d(i', k) } for all i, i', k (a numerical check is sketched below).
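As an illustrative check (simulated data; the helper name is made up for the example): the cophenetic dissimilarities read off a dendrogram satisfy the ultrametric inequality by construction, whereas the raw pairwise distances generally need not.

```python
import numpy as np
from itertools import combinations
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist, squareform

def is_ultrametric(D, tol=1e-10):
    """Check d(i, i') <= max(d(i, k), d(i', k)) for all triples i, i', k."""
    n = D.shape[0]
    for i, j in combinations(range(n), 2):
        for k in range(n):
            if D[i, j] > max(D[i, k], D[j, k]) + tol:
                return False
    return True

rng = np.random.default_rng(5)
X = rng.normal(size=(8, 3))
d = pdist(X)
Z = linkage(d, method="average")

coph = squareform(cophenet(Z))        # dissimilarities read off the dendrogram
print(is_ultrametric(squareform(d)))  # usually False for raw Euclidean distances
print(is_ultrametric(coph))           # True: dendrogram heights are ultrametric
```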
16
Figure 1: dendrograms produced by average, complete, and single linkage (panels not reproduced in this transcript).