BINF636 Clustering and Classification (PowerPoint presentation transcript)
1
BINF636 Clustering and Classification
  • Jeff Solka Ph.D.
  • Fall 2008

2
Gene Expression Data
(Figure: the gene expression data matrix, with genes as rows and samples as columns)
x_gi = expression for gene g in sample i
3
The Pervasive Notion of Distance
  • We have to be able to measure similarity or
    dissimilarity in order to perform clustering,
    dimensionality reduction, visualization, and
    discriminant analysis.
  • How we measure distance can have a profound
    effect on the performance of these algorithms.

4
Distance Measures and Clustering
  • Most of the common clustering methods such as
    k-means, partitioning around medoid (PAM) and
    hierarchical clustering are dependent on the
    calculation of distance or an interpoint distance
    matrix.
  • Some clustering methods such as those based on
    spectral decomposition have a less clear
    dependence on the distance measure.

5
Distance Measures and Discriminant Analysis
  • Many supervised learning procedures (a.k.a.
    discriminant analysis procedures) also depend on
    the concept of a distance.
  • nearest neighbors
  • k-nearest neighbors
  • Mixture-models

6
(No Transcript)
7
(No Transcript)
8
(No Transcript)
9
Two Main Classes of Distances
  • Consider two gene expression profiles as
    expressed across I samples. Each of these can be
    considered as points in R^I. We can
    calculate the distance between these two points.
  • Alternatively we can view the gene expression
    profiles as being manifestations of samples from
    two different probability distributions.

10
(No Transcript)
11
A General Framework for Distances Between Points
  • Consider two m-vectors x = (x_1, ..., x_m) and y =
    (y_1, ..., y_m). Define a generalized distance of the
    form
  • We call this a pairwise distance function as the
    pairing of features within observations is
    preserved.
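The formula image from the original slide is not reproduced in this transcript; presumably the generalized distance takes a form like d(x, y) = F(d_1(x_1, y_1), ..., d_m(x_m, y_m)), where each d_i measures the dissimilarity between the i-th features and F combines the featurewise contributions.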

12
Minkowski Metric
  • Special case of our generalized metric
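The Minkowski formula itself is not included in the transcript; in standard notation it is d_lambda(x, y) = ( sum_{i=1}^{m} |x_i - y_i|^lambda )^{1/lambda}, with lambda = 2 giving the Euclidean metric and lambda = 1 the Manhattan (city-block) metric.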

13
Euclidean and Manhattan Metric
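As a minimal R sketch (the matrix X below is a made-up stand-in for an expression matrix with genes in rows), these metrics are available through the built-in dist() function:

    set.seed(1)
    X <- matrix(rnorm(40), nrow = 5)                # 5 hypothetical genes x 8 samples
    d.euc  <- dist(X, method = "euclidean")         # L2 (Euclidean) distances between rows
    d.man  <- dist(X, method = "manhattan")         # L1 (city-block) distances
    d.mink <- dist(X, method = "minkowski", p = 3)  # general Minkowski with lambda = 3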
14
Correlation-based Distance Measures
  • Championed for use within the microarray
    literature by Eisen.
  • Types
  • Pearson's sample correlation distance.
  • Eisen's cosine correlation distance.
  • Spearman sample correlation distance.
  • Kendall's τ sample correlation.
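A hedged R sketch of these distances for two profiles x and y; the absolute value in the Eisen form follows one common statement of the cosine (uncentered) correlation distance and is an assumption here:

    cor.dist   <- function(x, y) 1 - cor(x, y)                       # Pearson (COR)
    spear.dist <- function(x, y) 1 - cor(x, y, method = "spearman")  # Spearman (SPEAR)
    tau.dist   <- function(x, y) 1 - cor(x, y, method = "kendall")   # Kendall's tau (TAU)
    eisen.dist <- function(x, y)                                     # Eisen cosine (EISEN); abs() is an assumption
      1 - abs(sum(x * y)) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))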

15
Pearson Sample Correlation Distance (COR)
16
Eisen Cosine Correlation Distance (EISEN)
17
Spearman Sample Correlation Distance (SPEAR)
18
Kendall's τ Sample Correlation (TAU)
19
Some Observations - I
  • Since we are subtracting the correlation measures
    from 1, things that are perfectly positively
    correlated (correlation measure of 1) will have a
    distance close to 0 and things that are perfectly
    negatively correlated (correlation measure of -1)
    will have a distance close to 2.
  • Correlation measures in general are invariant to
    location and scale transformations and tend to
    group together genes whose expression values are
    linearly related.

20
Some Observations - II
  • The parametric methods (COR and EISEN) tend to be
    more negatively affected by the presence of
    outliers than the non-parametric methods (SPEAR
    and TAU).
  • If we have standardized the data so that x and y
    are m-vectors with zero mean and unit length, then
    there is a simple relationship between the Pearson
    correlation coefficient r(x, y) and the Euclidean
    distance.
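The relationship itself is not shown in the transcript; for zero-mean, unit-length m-vectors the standard identity is d_E(x, y)^2 = 2(1 - r(x, y)), so the Euclidean distance is a monotone function of the correlation distance.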

21
Mahalanobis Distance
  • This allows the directional variability of the data
    to come into play when calculating distances.
  • How do we estimate the covariance matrix S?
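The Mahalanobis formula is not included in the transcript; the usual form is d_M(x, y) = sqrt((x - y)' S^{-1} (x - y)), with S typically estimated by the sample covariance matrix. A minimal R sketch, assuming X is a data matrix and x, y are two profiles:

    S <- cov(X)                                           # sample covariance as an estimate of S
    d.mahal <- sqrt(mahalanobis(x, center = y, cov = S))  # stats::mahalanobis() returns the squared distance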

22
Distances and Transformations
  • Assume that g is an invertible, possibly non-linear
    transformation g: x → x′.
  • This transformation induces a new metric d_g via
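The defining formula is not shown in the transcript; presumably the induced metric is d_g(x, y) = d(g(x), g(y)), i.e., the original distance evaluated on the transformed values.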

23
Distances and Scales
  • Original scanned fluorescence intensities
  • Logarithmically transformed data
  • Data transformed by the general logarithm
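A brief R sketch of the last two transformations; the specific generalized-log form and the constant c are assumptions, since several variants appear in the literature (intensity is a hypothetical vector of scanned intensities):

    x.log  <- log2(intensity)                                # simple log transform
    glog   <- function(x, c = 1) log2(x + sqrt(x^2 + c^2))   # one common generalized-log form (assumption)
    x.glog <- glog(intensity, c = 50)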

24
Experiment-specific Distances Between Genes
  • One might like to use additional experimental
    design information in determining how one
    calculates distances between the genes.
  • One might wish to use smoothed estimates or
    other sorts of statistical fits and measure
    distances between these.
  • In time-course data, distances that honor the time
    ordering of the data are appropriate.

25
Standardizing Genes
26
Standardizing Arrays (Samples)
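A minimal R sketch covering both of the preceding slides, assuming the expression matrix X has genes in rows and arrays (samples) in columns:

    X.gene.std  <- t(scale(t(X)))   # each gene scaled to mean 0, sd 1 across the arrays
    X.array.std <- scale(X)         # each array scaled to mean 0, sd 1 across the genes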
27
Scaling and Its Implication to Data Analysis - I
  • Types of gene expression data
  • Relative (cDNA)
  • Absolute (Affymetrix)
  • x_gi is the expression of gene g in sample i as
    measured on a log scale
  • Let y_gi = x_gi - x_gA, where patient A is our reference
  • The distance between patient samples

28
Scaling and Its Implication to Data Analysis - II
29
Scaling and Its Implication to Data Analysis - III
The distance between two genes is given by
30
Summary of Effects of Scaling on Distance Measures
  • Minkowski distances
  • Distance between samples is the same for relative
    and absolute measures
  • Distance between genes is not the same for
    relative and absolute measures
  • Pearson correlation-based distance
  • Distances between genes are the same for relative
    and absolute measures
  • Distances between samples are not the same for
    relative and absolute measures

31
What is Cluster Analysis?
  • Given a collection of n objects, each of which is
    described by a set of p characteristics or
    variables, derive a useful division into a number
    of classes.
  • Both the number of classes and the properties of
    the classes are to be determined.
  • (Everitt 1993)

32
Why Do This?
  • Organize
  • Prediction
  • Etiology (Causes)

33
How Do We Measure Quality?
  • Multiple Clusters
  • Male, Female
  • Low, Middle, Upper Income
  • Neither True Nor False
  • Measured by Utility

34
Difficulties In Clustering
  • Cluster structure may be manifest in a multitude
    of ways
  • Large data sets and high dimensionality
    complicate matters

35
Clustering Prerequisites
  • Method to measure the distance between
    observations and clusters
  • Similarity
  • Dissimilarity
  • This was discussed previously
  • Method of normalizing the data
  • We discussed this previously
  • Method of reducing the dimensionality of the data
  • We discussed this previously

36
The Number of Groups Problem
  • How Do We Decide on the Appropriate Number of
    Clusters?
  • Duda, Hart and Stork (2001)
  • Form J_e(2)/J_e(1), where J_e(m) is the sum-of-squares
    error criterion for the m-cluster model. The
    distribution of this ratio is usually not known.

37
Optimization Methods
  • Minimizing or Maximizing Some Criteria
  • Does Not Necessarily Form Hierarchical Clusters

38
Clustering Criteria
The Sum of Squared Error Criteria
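The criterion itself is not reproduced in the transcript; in the notation of Duda, Hart, and Stork it is J_e = sum_{i=1}^{c} sum_{x in D_i} ||x - m_i||^2, where m_i is the mean of the samples assigned to cluster D_i.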
39
Spoofing of the Sum of Squares Error Criterion
40
Related Criteria
  • With a little manipulation we obtain
  • Instead of using average squared distances
    between points in a cluster as indicated above,
    we could perhaps use the median or maximum
    distance
  • Each of these will produce its own variant

41
Scatter Criteria
42
Relationship of the Scattering Criteria
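The formulas are not reproduced in the transcript; in the standard scatter-matrix notation, S_W = sum_i sum_{x in D_i} (x - m_i)(x - m_i)' is the within-cluster scatter, S_B = sum_i n_i (m_i - m)(m_i - m)' is the between-cluster scatter, and the total scatter satisfies S_T = S_W + S_B, with S_T independent of how the data are partitioned.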
43
Measuring the Size of Matrices
  • So we wish to minimize S_W while maximizing S_B
  • We will measure the size of a matrix by using its
    trace or determinant
  • These are equivalent in the case of univariate
    data

44
Interpreting the Trace Criteria
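The interpretation is not reproduced in the transcript; the standard observation is that tr(S_W) = sum_i sum_{x in D_i} ||x - m_i||^2 = J_e, so minimizing the trace of the within-cluster scatter is equivalent to minimizing the sum-of-squared-error criterion (and, since tr(S_T) is fixed, to maximizing tr(S_B)).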

45
The Determinant Criteria
  • S_B will be singular if the number of clusters is
    less than or equal to the dimensionality
  • Partitions based on J_e may change under linear
    transformations of the data
  • This is not the case with J_d

46
Other Invariant Criteria
  • It can be shown that the eigenvalues of S_W^-1 S_B
    are invariant under nonsingular linear
    transformations
  • We might choose to maximize

47
k-means Clustering
  • Begin: initialize n, k, m_1, m_2, ..., m_k
  • Do: classify the n samples according to the nearest m_i
  •     Recompute each m_i
  • Until: no change in the m_i
  • Return m_1, m_2, ..., m_k
  • End
  • Complexity of the algorithm is O(ndkT)
  • T is the number of iterations
  • T is typically << n
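A minimal R sketch of the procedure using the built-in kmeans() function (X and the choice of k = 3 are placeholders):

    set.seed(2)
    km <- kmeans(X, centers = 3, nstart = 25)   # 25 random starts to reduce sensitivity to initialization
    km$cluster    # cluster label for each observation (row of X)
    km$centers    # the final means m_1, ..., m_k
    km$withinss   # within-cluster sums of squares, one value per cluster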

48
Example Mean Trajectories
49
Optimizing the Clustering Criterion
  • N(n, g): the number of partitions of n
    individuals into g groups
  • N(15, 3) = 2,375,101
  • N(20, 4) = 45,232,115,901
  • N(25, 8) = 690,223,721,118,368,580
  • N(100, 5) ≈ 10^68
  • Note that 3.15 x 10^17 seconds is the estimated
    age of the universe

50
Hill Climbing Algorithms
  • 1 - Form initial partition into required number
    of groups
  • 2 - Calculate the change in the clustering criterion
    produced by moving each individual from its own
    cluster to another cluster.
  • 3 - Make the change which leads to the greatest
    improvement in the value of the clustering
    criterion.
  • 4 - Repeat steps (2) and (3) until no move of a
    single individual causes the clustering criterion
    to improve.
  • Guarantees local not global optimum

51
How Do We Choose c?
  • Randomly classify points to generate the m_i's
  • Randomly generate the m_i's
  • Base the location of the c-cluster solution on the
    c-1 solution
  • Base the location of the c-cluster solution on a
    hierarchical solution

52
Alternative Methods
  • Simulated Annealing
  • Genetic Algorithms
  • Quantum Computing

53
Hierarchical Cluster Analysis
  • 1 Cluster to n Clusters
  • Agglomerative Methods
  • Fusion of n Data Points into Groups
  • Divisive Methods
  • Separate the n Data Points Into Finer Groupings

54
Dendrograms
  • (Schematic dendrogram: the singletons (1), (2), (3),
    (4), (5) merge first into (1,2) and (4,5), then into
    (3,4,5), and finally into (1,2,3,4,5); reading the
    scale 0 to 4 left to right gives the agglomerative
    view, and 4 to 0 right to left gives the divisive view.)

55
Agglomerative Algorithm(Bottom Up or Clumping)
  • Start: clusters C1, C2, ..., Cn, each containing 1
    data point
  • 1 - Find the nearest pair Ci, Cj; merge Ci and Cj,
    delete Cj, and decrement the cluster count by 1
  • If the number of clusters is greater than 1,
    go back to step 1

56
Inter-cluster Dissimilarity Choices
  • Furthest Neighbor (Complete Linkage)
  • Nearest Neighbor (Single Linkage)
  • Group Average
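A minimal R sketch of these choices using hclust() on an interpoint distance matrix (X is a placeholder data matrix):

    d <- dist(X)
    hc.single   <- hclust(d, method = "single")     # nearest neighbor linkage
    hc.complete <- hclust(d, method = "complete")   # furthest neighbor linkage
    hc.average  <- hclust(d, method = "average")    # group average linkage
    plot(hc.complete)                               # draw the dendrogram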

57
Single Linkage (Nearest Neighbor) Clustering
  • Distance Between Groups is Defined as That of the
    Closest Pair of Individuals Where We Consider 1
    Individual From Each Group
  • This method may be adequate when the clusters are
    fairly well separated Gaussians but it is subject
    to problems with chaining

58
Example of Single Linkage Clustering
          1    2    3    4    5
     1   0.0
     2   2.0  0.0
     3   6.0  5.0  0.0
     4  10.0  9.0  4.0  0.0
     5   9.0  8.0  5.0  3.0  0.0
  After merging clusters (1) and (2):
          (1 2)  3    4    5
   (1 2)   0.0
     3     5.0  0.0
     4     9.0  4.0  0.0
     5     8.0  5.0  3.0  0.0

59
Complete Linkage Clustering (Furthest Neighbor)
  • Distance Between Groups is Defined as That of the
    Most Distant Pair of Individuals

60
Complete Linkage Example
          1    2    3    4    5
     1   0.0
     2   2.0  0.0
     3   6.0  5.0  0.0
     4  10.0  9.0  4.0  0.0
     5   9.0  8.0  5.0  3.0  0.0
  • (1,2) is the first cluster
  • d_(12)3 = max{d_13, d_23} = d_13 = 6.0
  • d_(12)4 = max{d_14, d_24} = d_14 = 10.0
  • d_(12)5 = max{d_15, d_25} = d_15 = 9.0
  • Among the remaining clusters, (3) is the one closest
    to (1,2); note, though, that the smallest inter-cluster
    distance overall is d_45 = 3.0, so the next merge in
    the complete-linkage algorithm joins (4) and (5).

61
Group Average Clustering
  • Distance between clusters is the average of the
    distances over all pairs of individuals, one taken
    from each of the 2 groups
  • A compromise between single linkage and complete
    linkage

62
Centroid Clusters
  • We use the centroid of a group once it is formed.

63
Problems With Hierarchical Clustering
  • Well it really gives us a continuum of different
    clusterings of the data
  • As stated previously there are specific artifacts
    of the various methods

64
Dendrogram
65
Data Color Histogram or Data Image
Orderings of the data matrix were first discussed
by Bertin. Wegman coined the term data color
histogram in 1990. Mike Minnotte and Webster West
subsequently coined the term data image in 1998.
66
Data Image Reveals Obfuscated Cluster Structure
(Panels: a subset of the pairs plot; the data image
sorted on observations; and sorted on observations
and features.)
90 observations in R^100 were drawn from a standard
normal distribution. The first and second groups of 30
rows were shifted by 20 in their first and second
dimensions, respectively. This data matrix was then
multiplied by a 100 x 100 matrix of Gaussian noise.
67
The Data Image in the Gene Expression Community
  • Extracted from

68
Example Dataset
69
Complete Linkage Clustering
70
Single Linkage Clustering

71
Average Linkage Clustering

72
Pruning Our Tree
  • cutree(tree, k = NULL, h = NULL)
  • tree: a tree as produced by hclust. cutree() only
    expects a list with components merge, height,
    and labels, of appropriate content each.
  • k: an integer scalar or vector with the desired
    number of groups
  • h: numeric scalar or vector with heights where
    the tree should be cut.
  • At least one of k or h must be specified; k
    overrides h if both are given.
  • Values returned:
  • cutree returns a vector with group memberships
    if k or h are scalar, otherwise a matrix with
    group memberships is returned where each column
    corresponds to the elements of k or h,
    respectively (which are also used as column
    names).

73
Example Pruning
  • > x.cl2 <- cutree(hclust(x.dist), k = 2)
  • > x.cl2[1:10]
  •  [1] 1 1 1 1 1 1 1 1 1 1
  • > x.cl2[190:200]
  •  [1] 2 2 2 2 2 2 2 2 2 2 2

74
Identifying the Number of Clusters
  • As indicated previously, we really have no way of
    identifying the true cluster structure unless we
    have divine intervention
  • In the next several slides we present some
    well-known methods

75
Method of Mojena
  • Select the number of groups based on the first
    stage of the dendrogram that satisfies the criterion
    below
  • The α_0, α_1, α_2, ..., α_{n-1} are the fusion levels
    corresponding to the stages with n, n-1, ..., 1
    clusters. ᾱ and s_α are the mean and unbiased
    standard deviation of these fusion levels, and k
    is a constant.
  • Mojena (1977): 2.75 < k < 3.5
  • Milligan and Cooper (1985): k = 1.25
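The criterion itself is not shown in the transcript; as usually stated, Mojena's rule selects the first stage j for which the fusion level satisfies α_{j+1} > ᾱ + k·s_α.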

76
Hartigans k-means theory
  • When deciding on the number of clusters,
    Hartigan (1975, pp. 90-91) suggests the
    following rough rule of thumb. If k is the
    result of kmeans with k groups and kplus1 is
    the result with k+1 groups, then it is
    justifiable to add the extra group when
  • (sum(k$withinss)/sum(kplus1$withinss) - 1) * (nrow(x) - k - 1)
  • is greater than 10.
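A small R sketch of this rule of thumb (the function and argument names are illustrative):

    hartigan.stat <- function(x, k, nstart = 25) {
      km.k  <- kmeans(x, centers = k,     nstart = nstart)
      km.k1 <- kmeans(x, centers = k + 1, nstart = nstart)
      (sum(km.k$withinss) / sum(km.k1$withinss) - 1) * (nrow(x) - k - 1)
    }
    # Hartigan's suggestion: values greater than 10 justify moving from k to k + 1 groups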

77
kmeans Applied to our Data Set
78
The 3 term kmeans solution
79
The 4 term kmeans Solution
80
Determination of the Number of Clusters Using the
Hartigan Criteria
81
MIXTURE-BASED CLUSTERING
82
HOW DO WE CHOOSE g?
  • Human Intervention
  • Divine Intervention
  • Likelihood Ratio Test Statistic
  • Wolfe's Method
  • Bootstrap
  • AIC,BIC, MDL
  • Adaptive Mixtures Based Methods
  • Pruning
  • SHIP (AKMM)

83
Akaike's Information criteria (AIC)
  • AIC(g) = -2 log L + 2 N(g), where log L is the
    maximized log-likelihood and N(g) is the number
    of free parameters in the model of size g.
  • We choose g in order to minimize the AIC
    criterion
  • This criterion is subject to the same regularity
    conditions as the likelihood ratio statistic -2 log λ

84
MIXTURE VISUALIZATION 2-d
85
MODEL-BASED CLUSTERING
  • This technique takes a density function approach.
  • Uses finite mixture densities as models for
    cluster analysis.
  • Each component density characterizes a cluster.

86
Minimal Spanning Tree-Based Clustering
Diansheng Guo, Donna Peuquet, and Mark Gahegan (2002),
Opening the black box: interactive hierarchical
clustering for multivariate spatial patterns,
Proceedings of the Tenth ACM International Symposium
on Advances in Geographic Information Systems,
McLean, Virginia, USA
87
What is Pattern Recognition?
  • From Devroye, Györfi and Lugosi
  • Pattern recognition or discrimination is about
    guessing or predicting the unknown nature of an
    observation, a discrete quantity such as black or
    white, one or zero, sick or healthy, real or
    fake.
  • From Duda, Hart and Stork
  • The act of taking in raw data and taking an
    action based on the category of the pattern.

88
Isn't This Just Statistics?
  • Short answer: yes.
  • Breiman (Statistical Science, 2001) suggests
    there are two cultures within statistical
    modeling: Stochastic Modelers and Algorithmic
    Modelers.

89
Algorithmic Modeling
  • Pattern recognition (classification) is concerned
    with predicting class membership of an
    observation.
  • This can be done from the perspective of
    (traditional statistical) data models.
  • Often, the data is high dimensional, complex, and
    of unknown distributional origin.
  • Thus, pattern recognition often falls into the
    algorithmic modeling camp.
  • The measure of performance is whether it
    accurately predicts the class, not how well it
    models the distribution.
  • Empirical evaluations often are more compelling
    than asymptotic theorems.

90
Pattern Recognition Flowchart
91
Pattern Recognition Concerns
  • Feature extraction and distance calculation
  • Development of automated algorithms for
    classification.
  • Classifier performance evaluation.
  • Latent or hidden class discovery based on
    extracted feature analysis.
  • Theoretical considerations.

92
Linear and Quadratic Discriminant Analysis in
Action
93
Nearest Neighbor Classifier
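A minimal R sketch of a k-nearest-neighbor classifier using the class package (train.x, train.y, and test.x are placeholder objects):

    library(class)
    pred <- knn(train = train.x, test = test.x, cl = train.y, k = 3)   # predicted class labels for the test set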
94
SVM Training Cartoon
95
CART Analysis of the Fisher Iris Data
96
Random Forests
  • Create a large number of trees based on random
    samples of our dataset.
  • Use a bootstrap sample for each random sample.
  • Variables used to create the splits are a random
    sub-sample of all of the features.
  • All trees are grown fully.
  • Majority vote determines membership of a new
    observation.
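A minimal R sketch using the randomForest package (not part of base R; train.x, train.y, and test.x are placeholders, and train.y should be a factor of class labels):

    library(randomForest)
    rf <- randomForest(x = train.x, y = train.y, ntree = 500)   # each tree grown on a bootstrap sample,
                                                                # with a random feature subset tried at each split
    pred <- predict(rf, newdata = test.x)                       # class assigned by majority vote over the trees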

97
Boosting and Bagging
98
Boosting
99
Evaluating Classifiers
100
Resubstitution
101
Cross Validation
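A hedged R sketch of k-fold cross-validation for estimating a classifier's error rate, here using the k-NN classifier from the class package; all names are illustrative:

    library(class)
    cv.error <- function(x, y, k.nn = 3, folds = 10) {
      fold <- sample(rep(1:folds, length.out = nrow(x)))   # random fold assignment
      errs <- sapply(1:folds, function(f) {
        pred <- knn(train = x[fold != f, ], test = x[fold == f, ],
                    cl = y[fold != f], k = k.nn)
        mean(pred != y[fold == f])                          # misclassification rate on the held-out fold
      })
      mean(errs)                                            # average error over the folds
    }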
102
Leave-k-Out
103
Cross-Validation Notes
104
Test Set
105
Some Classifier Results on the Golub ALL vs AML
Dataset
106
References - I
  • Richard O. Duda, Peter E. Hart, David G. Stork
    (2001), Pattern Classification, 2nd Edition.
  • Eisen MB, Spellman PT, Brown PO and Botstein D.
    (1998). Cluster Analysis and Display of
    Genome-Wide Expression Patterns. Proc Natl Acad
    Sci U S A 95, 14863-8.
  • Brian S. Everitt, Sabine Landau, Morven Leese
    (2001), Cluster Analysis, 4th Edition, Arnold.
  • Gasch AP and Eisen MB (2002). Exploring the
    conditional coregulation of yeast gene expression
    through fuzzy k-means clustering. Genome Biology
    3(11), 1-22.
  • Gad Getz, Erel Levine, and Eytan Domany (2000),
    Coupled two-way clustering analysis of gene
    microarray data, PNAS, vol. 97, no. 22, pp.
    12079-12084.
  • Hastie T, Tibshirani R, Eisen MB, Alizadeh A,
    Levy R, Staudt L, Chan WC, Botstein D and Brown
    P. (2000). 'Gene Shaving' as a Method for
    Identifying Distinct Sets of Genes with Similar
    Expression Patterns. GenomeBiology.com 1,

107
References - II
  • A. K. Jain, M. N. Murty, P. J. Flynn (1999),
    Data clustering: a review, ACM Computing Surveys
    (CSUR), Volume 31, Issue 3.
  • John Quackenbush (2001), Computational analysis
    of microarray data, Nature Reviews Genetics,
    Volume 2, pp. 418-427.
  • Ying Xu, Victor Olman, and Dong Xu (2002),
    Clustering gene expression data using a
    graph-theoretic approach: an application of
    minimum spanning trees, Bioinformatics, 18,
    536-545.

108
References - III
  • Hastie, Tibshirani, Friedman, The Elements of
    Statistical Learning: Data Mining, Inference, and
    Prediction, 2001.
  • Devroye, Györfi, Lugosi, A Probabilistic Theory
    of Pattern Recognition, 1996.
  • Ripley, Pattern Recognition and Neural Networks,
    1996.
  • Fukunaga, Introduction to Statistical Pattern
    Recognition, 1990.