1
Clustering
  • Petter Mostad

2
Clustering vs. class prediction
  • Class prediction:
  • A learning set of objects with known classes
  • Goal: put new objects into existing classes
  • Also called supervised learning, or
    classification
  • Clustering:
  • No learning set, no given classes
  • Goal: discover the best classes or groupings
  • Also called unsupervised learning, or class
    discovery

3
Overview
  • General clustering theory
  • Steps, methods, algorithms, issues...
  • Clustering microarray data
  • Recommendations for this kind of data
  • Programs for clustering
  • Some other visualization techniques

4
Issues in clustering
  • Used to explore and visualize data, with few
    preconceptions
  • Many subjective choices must be made, so a
    clustering output tends to be subjective
  • It is difficult to get truly statistically
    significant conclusions
  • Algorithms will always produce clusters, whether
    any exist in the data or not

5
Steps in clustering
  • Feature selection and extraction
  • Defining and computing similarities
  • Clustering or grouping objects
  • Assessing, presenting, and using the result

6
1. Feature selection and extraction
  • Deciding which measurements matter for similarity
  • Data reduction
  • Filtering away objects
  • Normalization of measurements

7
The data matrix
  • Every row contains the measurements for one
    object.
  • Similarities are computed between all pairs of
    rows
  • If the measurements are all of the same type, one
    can instead cluster the columns!

(Figure: the data matrix, with objects as rows and measurements as columns)
8
2. Defining and computing similarities
  • Similarity measures for continuous data vectors
  • Euclidean distance:
    d(x, y) = sqrt(sum_i (x_i - y_i)^2)
  • Minkowski distance:
    d(x, y) = (sum_i |x_i - y_i|^p)^(1/p)
    (p = 1 gives the Manhattan metric)
  • Mahalanobis distance:
    d(x, y) = sqrt((x - y)^T S^(-1) (x - y)),
    where S is a covariance matrix
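As a concrete illustration, here is a minimal Python sketch of these three distances using scipy (the toy matrix X is made up for illustration):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy data matrix: 4 objects (rows) x 3 measurements (columns).
X = np.array([[1.0, 2.0, 0.5],
              [1.1, 1.9, 0.4],
              [5.0, 0.2, 3.3],
              [4.8, 0.1, 3.5]])

d_euclid = pdist(X, metric="euclidean")             # pairwise distances between rows
d_manhat = pdist(X, metric="minkowski", p=1)        # Manhattan = Minkowski with p = 1
S = np.cov(X, rowvar=False)                         # covariance matrix of the measurements
d_mahal = pdist(X, metric="mahalanobis", VI=np.linalg.pinv(S))  # pinv guards against singular S

print(squareform(d_euclid))                         # as a full 4 x 4 distance matrix
```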

9
  • Centered and non-centered (absolute) Pearson
    correlation
  • Centered:
    r(x, y) = sum_i (x_i - xbar)(y_i - ybar) /
    (sqrt(sum_i (x_i - xbar)^2) * sqrt(sum_i (y_i - ybar)^2))
  • Non-centered:
    r(x, y) = sum_i x_i y_i /
    (sqrt(sum_i x_i^2) * sqrt(sum_i y_i^2))
  • where xbar and ybar are the means of the vectors
    x and y
  • Spearman rank correlation:
  • Compute the ranking of the numbers in each vector
  • Find the correlation between the ranking numbers
  • ....
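A small sketch of these correlation-based measures in Python, using scipy.stats (the vectors x and y are made-up examples); 1 - r is a common way to turn a correlation into a distance:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.array([0.2, 1.5, 3.1, 4.0, 5.2])
y = np.array([0.1, 1.2, 2.8, 4.4, 5.0])

# Centered Pearson correlation, then the common distance transform 1 - r.
r, _ = pearsonr(x, y)
d_pearson = 1.0 - r

# Non-centered version: skip the mean subtraction (cosine of the angle).
r_uncentered = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Spearman: Pearson correlation of the within-vector ranks.
rho, _ = spearmanr(x, y)
print(d_pearson, 1.0 - r_uncentered, 1.0 - rho)
```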

10
Geometrical view of clustering
  • If measurements are coordinates, objects become
    points in some space
  • If the similarity measure is Euclidean distance,
    the goal is to group nearby points
  • Note: when we have only 2 or 3 measurements per
    object, visual inspection can do better than most
    algorithms

11
Similarity measures for discrete data
  • Comparing two binary vectors, count the numbers
    a,b,c,d of 1-1s, 1-0s, 0-1s, and 0-0s,
    respectively
  • Construct different similarity measurements based
    on these numbers
  • Similarities for other objects, for example
    trees, can also be defined in reasonable ways
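For illustration, a small Python sketch computing two common coefficients (simple matching and Jaccard) from the counts a, b, c, d; the function name is my own:

```python
import numpy as np

def binary_similarities(u, v):
    """Similarity coefficients built from match/mismatch counts of two binary vectors."""
    u, v = np.asarray(u, bool), np.asarray(v, bool)
    a = np.sum(u & v)        # 1-1 matches
    b = np.sum(u & ~v)       # 1-0 mismatches
    c = np.sum(~u & v)       # 0-1 mismatches
    d = np.sum(~u & ~v)      # 0-0 matches
    simple_matching = (a + d) / (a + b + c + d)          # counts 0-0 as agreement
    jaccard = a / (a + b + c) if (a + b + c) else 0.0    # ignores 0-0 pairs
    return simple_matching, jaccard

print(binary_similarities([1, 1, 0, 1, 0], [1, 0, 0, 1, 1]))
```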

12
Similarities using contexts
  • Mutual Neighbour Distance:
    MND(x, y) = NN(x, y) + NN(y, x),
    where NN(y, x) is the neighbour number of x with
    respect to y
  • This is not a metric, but similarities do not
    need to be based on metrics.
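A minimal Python sketch of this idea, assuming Euclidean distance for the underlying neighbour rankings (the function name is hypothetical):

```python
import numpy as np

def mutual_neighbour_distance(X, i, j):
    """MND(i, j) = NN(i, j) + NN(j, i) for rows i, j of the data matrix X.

    rank(a, b): the rank of point b among the neighbours of point a
    (1 = nearest other point), based on Euclidean distance.
    """
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # all pairwise distances
    def rank(a, b):
        order = np.argsort(D[a])          # neighbours of a, nearest first (a itself has rank 0)
        return int(np.where(order == b)[0][0])
    return rank(i, j) + rank(j, i)

X = np.random.default_rng(0).normal(size=(10, 3))
print(mutual_neighbour_distance(X, 2, 7))
```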

13
3. Clustering or grouping
  • Hierarchical clusterings
  • Divisive: starts with one big cluster and
    subdivides one cluster in each step
  • Agglomerative: starts with each object in a
    separate cluster. In each step, joins the two
    closest clusters
  • Partitional clusterings
  • Probabilistic or fuzzy clusterings

14
Hierarchical clustering
  • Agglomerative clustering depends on the type of
    linkage, i.e., how to compute the distance
    between a merged cluster (UV) and an old cluster W
  • d(UV, W) = min(d(U, W), d(V, W)) (single linkage)
  • d(UV, W) = max(d(U, W), d(V, W)) (complete linkage)
  • d(UV, W) = the average over all distances between
    objects in (UV) and objects in W (average
    linkage, or UPGMA: Unweighted Pair Group Method
    with Arithmetic mean)
  • The output is a dendrogram
  • A simplification of average linkage is often
    implemented (average group linkage). It may
    lead to inverted dendrograms!
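These linkages are available, for example, in scipy; a minimal sketch on made-up data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import pdist

X = np.random.default_rng(1).normal(size=(20, 5))   # 20 objects, 5 measurements
d = pdist(X, metric="euclidean")

# method can be "single", "complete", or "average" (UPGMA),
# matching the linkage definitions above.
Z = linkage(d, method="average")

dendrogram(Z)                                       # draws the tree (needs matplotlib)
labels = fcluster(Z, t=3, criterion="maxclust")     # cut the tree into 3 clusters
print(labels)
```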

15
Dendrograms, visualizations
  • The data matrix is often visualized using three
    colors, representing positive, negative, and zero
    values.
  • Hierarchical clustering results are often
    represented with a dendrogram. The similarity at
    which clusters merge should correspond to the
    height of the corresponding horizontal line in
    the dendrogram!
  • To display the dendrogram, the objects (rows or
    columns) must be ordered; each time two clusters
    are merged, this can be done in two ways.

17
Ward's hierarchical clustering
  • Agglomerative.
  • Goal: minimize the Error Sum of Squares (ESS) at
    every step.
  • ESS: the sum, over all clusters, of the squared
    distances from the objects to their cluster
    centroid.
  • When joining two clusters, find the pair that
    results in the smallest increase in ESS.
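In scipy this corresponds to method="ward"; a minimal sketch on made-up data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.default_rng(2).normal(size=(20, 5))

# "ward" merges, at each step, the pair of clusters whose union gives the
# smallest increase in the total within-cluster sum of squares (ESS).
Z = linkage(X, method="ward")   # note: ward requires raw observations / Euclidean distance
```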

18
Partitional clusterings
  • The number of desired clusters is fixed at the
    start
  • K-means clustering
  • Partition the data into k initial clusters
  • Iteratively reassign points to the group with the
    closest centroid. Recompute centroids.
  • Repeat until stable
  • The result may depend on the initial clusters
  • May include a procedure for joining or splitting
    clusters according to size
  • The choice of the number of clusters may not be
    obvious
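A minimal k-means sketch using scikit-learn on made-up data; n_init restarts from several initial clusterings and keeps the best, since the result may depend on them:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(3).normal(size=(100, 2))

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])        # cluster assignment per object
print(km.cluster_centers_)    # final centroids
print(km.inertia_)            # sum of squared distances to the closest centroid
```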

19
Probabilistic or fuzzy clustering
  • The output is, for each object and each cluster,
    a probability or weight that the object belongs
    to the cluster
  • Example: the observations are modelled as draws
    from a number of probability densities (often
    multivariate normal). Parameters are then
    estimated with maximum likelihood (for example
    using the EM algorithm).
  • Example: a fuzzy version of k-means, where the
    weights for the objects are changed iteratively
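As one concrete instance of the first example, scikit-learn's GaussianMixture fits a mixture of multivariate normals by EM and returns a weight per object and cluster; a minimal sketch on made-up data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(4).normal(size=(200, 2))

# Fit a mixture of multivariate normals by maximum likelihood (EM algorithm).
gm = GaussianMixture(n_components=3, random_state=0).fit(X)

# Soft clustering: one probability per object and cluster.
weights = gm.predict_proba(X)
print(weights[:5].round(3))   # rows sum to 1
```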

20
Neural networks for clustering
  • Neural networks are mathematical models inspired
    by biological neural networks
  • They consist of layers of nodes that send out
    signals, possibly probabilistically, based on
    their input signals
  • The best-known uses are classification, i.e.,
    with learning sets
21
Self-Organising Maps (SOM)
22
Clustering as optimization
  • Given a similarity definition and a definition of
    what an optimal clustering is, it can often be
    a huge algorithmic challenge to find the optimum.
  • Example: subdivide many thousands of objects into
    50 clusters, minimizing e.g. the sum of the
    squared distances to the centroids.
  • Algorithms for optimization are therefore central.

23
Genetic algorithms
  • Try to use evolution to obtain good solutions
    to a problem
  • A number of solutions is kept at every step.
    They may then mate or mutate, to produce new
    solutions. The fittest solutions are kept.
  • Can be seen as an optimization algorithm
  • It is a great challenge to design ways of mating
    and mutating that produce an efficient algorithm
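A toy sketch of this idea applied to clustering, where a "solution" is a vector of cluster labels and fitness is the (negated) error sum of squares; all choices here (population size, mutation rate, one-point crossover) are illustrative, not prescriptive:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 2))
K, POP, GENS = 3, 30, 200

def fitness(labels):
    """Negative total within-cluster sum of squares (higher is better)."""
    ess = 0.0
    for k in range(K):
        pts = X[labels == k]
        if len(pts):
            ess += ((pts - pts.mean(axis=0)) ** 2).sum()
    return -ess

# Each "solution" is a vector of cluster labels, one per object.
pop = [rng.integers(0, K, size=len(X)) for _ in range(POP)]
for _ in range(GENS):
    pop.sort(key=fitness, reverse=True)
    survivors = pop[:POP // 2]                    # keep the fittest solutions
    children = []
    while len(survivors) + len(children) < POP:
        a, b = rng.choice(len(survivors), size=2, replace=False)
        cut = rng.integers(1, len(X))             # one-point crossover ("mating")
        child = np.concatenate([survivors[a][:cut], survivors[b][cut:]])
        flip = rng.random(len(X)) < 0.01          # mutation: random reassignments
        child[flip] = rng.integers(0, K, size=flip.sum())
        children.append(child)
    pop = survivors + children

best = max(pop, key=fitness)
print(-fitness(best))   # final ESS
```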

24
Simulated annealing
  • A general optimization technique
  • Iterative: at every step, nearby solutions are
    chosen with probabilities depending on their
    optimality (so even less optimal solutions may be
    chosen)
  • As the algorithm proceeds and the temperature
    sinks, the probability of choosing less optimal
    solutions also sinks.
  • It is a good general way to avoid local optima.
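A toy sketch of simulated annealing for the same label-assignment problem; the cooling schedule and neighbourhood move are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(60, 2))
K = 3

def ess(labels):
    """Total within-cluster sum of squares (the quantity to minimize)."""
    return sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
               for k in range(K) if (labels == k).any())

labels = rng.integers(0, K, size=len(X))
cost, T = ess(labels), 1.0
for step in range(5000):
    i = rng.integers(len(X))                      # nearby solution: move one object
    new = labels.copy()
    new[i] = rng.integers(K)
    new_cost = ess(new)
    # Accept worse solutions with probability exp(-increase / T),
    # which shrinks as the temperature T sinks.
    if new_cost < cost or rng.random() < np.exp((cost - new_cost) / T):
        labels, cost = new, new_cost
    T *= 0.999                                    # cooling schedule
print(cost)
```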

25
4. Assessing and using the result
  • Visualization and summarization of the clusters
  • Note: you should always investigate the
    dependence of your results on the choices you
    have made for the clustering!

26
Examples of applications of clustering
  • Image analysis
  • Speech recognition
  • Data mining
  • ....

27
Clustering microarray data
  • Samples are columns and genes are rows in the
    data matrix
  • What values to cluster?
  • What is a biologically relevant measure of
    similarity?
  • One can cluster genes and/or samples

28
Clustering microarray data
  • Usually, use logged data
  • Data should be on the same scale (but it usually
    is if you use data that is already normalized)
  • You may have to filter away genes that show too
    little variation over the samples.
  • Use a distance measure appropriate for the
    question you want to focus on (Pearson
    correlation often works OK).
  • Use an appropriate clustering algorithm
    (hierarchical average linkage usually works OK).
  • If you draw some conclusion from the clustering
    results, try to vary your clustering choices to
    see how stable these results are.
  • Clustering works best as a tool to generate
    hypotheses and ideas, which may then be tested in
    other ways.
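A minimal sketch putting these recommendations together in Python (made-up expression values; genes as rows, samples as columns):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

rng = np.random.default_rng(7)
expr = rng.lognormal(mean=2.0, sigma=1.0, size=(500, 12))   # genes x samples (toy data)

logged = np.log2(expr)                     # work on logged data

# Filter away genes that show too little variation over the samples.
keep = logged.std(axis=1) > 1.0
logged = logged[keep]

# Pearson-correlation distance between genes: d = 1 - r.
r = np.corrcoef(logged)                    # gene-by-gene correlation matrix
D = 1.0 - r
np.fill_diagonal(D, 0.0)                   # remove numerical noise on the diagonal
Z = linkage(squareform(D, checks=False), method="average")
dendrogram(Z, no_labels=True)              # needs matplotlib
```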

29
Clustering tumor samples
30
Clustering to confirm or reject hypotheses?
  • A clustering may appear to validate, or be
    validated by, a grouping derived by using other
    data
  • Caution: the many different ways to do a
    clustering may make it possible to tweak it to
    produce the clusters you want
  • There is a huge and complex multiple-testing
    problem
  • Note that small changes in the data can change
    the result dramatically
  • If you insist on trying to get significance:
  • Use permutations of the data
  • Use resampling of the data (bootstrapping)
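One way to implement the resampling idea is to recluster bootstrap samples of the genes and record how often pairs of samples land in the same cluster; a toy sketch, with the number of clusters fixed at 2 for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(8)
X = rng.normal(size=(12, 500))            # 12 samples x 500 genes (toy data)
n = len(X)

# Bootstrap over genes: recluster, and count how often each pair of
# samples ends up in the same cluster.
co = np.zeros((n, n))
B = 200
for _ in range(B):
    genes = rng.integers(0, X.shape[1], size=X.shape[1])   # resample genes
    Z = linkage(pdist(X[:, genes]), method="average")
    lab = fcluster(Z, t=2, criterion="maxclust")
    co += (lab[:, None] == lab[None, :])
co /= B
print(co.round(2))   # values near 1: pairs of samples that cluster together stably
```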

31
How to do clustering Programs
  • A good program for clustering and visualization:
    HCE
  • Great visualization options
  • Adapted to microarray data
  • http://www.cs.umd.edu/hcil/hce/
  • Can import similarity matrices
  • Classics for microarray data: Cluster and
    TreeView (Eisen)
  • R/BioConductor: the cluster package, the hclust
    function, the heatmap function, ...
  • Many other programs/packages

32
Other visualization techniques: Principal Components
  • The principal components can be viewed as the
    axes of a better coordinate system for the
    data.
  • Better in the sense that the data is maximally
    spread out along the first principal components.
  • The principal components correspond to
    eigenvectors of the covariance matrix of the
    data.
  • The eigenvalues represent the part of the total
    variance explained by each of the principal
    components.
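A minimal sketch computing principal components directly from the eigendecomposition of the covariance matrix, as described above (made-up data):

```python
import numpy as np

rng = np.random.default_rng(9)
X = rng.normal(size=(50, 10))                  # 50 objects x 10 measurements (toy data)

Xc = X - X.mean(axis=0)                        # center the data
S = np.cov(Xc, rowvar=False)                   # covariance matrix of the measurements
evals, evecs = np.linalg.eigh(S)               # eigenvectors = principal components
order = np.argsort(evals)[::-1]                # sort by explained variance
evals, evecs = evals[order], evecs[:, order]

scores = Xc @ evecs[:, :2]                     # coordinates along the first 2 components
explained = evals / evals.sum()                # fraction of total variance per component
print(explained[:2])
```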

33
Principal component analysis of expression data
34
Principal component analysis of expression data
35
Other visualization techniques: Multidimensional scaling
  • Start with some points in a very high dimension.
  • Goal: display these points in a lower dimension,
    so that the distances between them are similar to
    the distances in the original dimension.
  • May also try to preserve only the ranking of the
    pairwise distances.
  • Makes it possible to use powerful visual
    inspection, in 2 or 3 dimensions.
  • Can sometimes give very convincing pictures
    separating samples in a predicted way.
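A minimal sketch using scikit-learn's MDS on made-up data; metric=False would instead try to preserve only the ranking of the pairwise distances:

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(10)
X = rng.normal(size=(30, 50))                 # 30 points in 50 dimensions (toy data)

# Metric MDS: find 2-D coordinates whose pairwise distances approximate
# the original high-dimensional distances.
mds = MDS(n_components=2, dissimilarity="euclidean", random_state=0)
Y = mds.fit_transform(X)
print(Y.shape, mds.stress_)                   # stress: remaining distance distortion
```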