Statistical analysis of array data: Dimensionality reduction, Clustering
1
Statistical analysis of array data
Dimensionality reduction, Clustering
  • Katja Astikainen, Riikka Kaven
  • 25.2.2005

2
Contents
  • Problems and approaches
  • Dimensionality reduction by PCA
  • Clustering overview
  • Hierarchical clustering
  • K-means
  • Mixture models and EM

3
Problems and approaches
  • The basic idea is to find patterns of expression
    across multiple genes and experiments
  • Models of expression are utilized e.g. in more
    precise classification of diseases (disease
    classification, degree of disease)
  • Expression patterns can be utilized to explore
    cellular pathways
  • With the help of gene expression modeling, and also
    clustering of conditions (experiments), one can find
    genes that are co-regulated
  • Clustering methods can also be used for sequence
    alignments
  • There are several methods for this, but we are
    going to introduce
  • Principal Component Analysis (PCA)
  • Clustering (hierarchical, K-means, EM)

4
Dimensionality reduction by PCA
  • PCA is a statistical data analysis technique
  • a method to reduce dimensionality
  • a method to identify new meaningful underlying
    variables
  • a method to compress the data
  • a method to visualize the data

5
Dimensionality reduction by PCA
  • We have N data points x1, ..., xN in an M-dimensional
    space, where the xi are gene expression
    vectors.
  • With PCA we can reduce the dimension to K, which
    is usually much lower than M.
  • Imagine taking a three-dimensional cloud of
    data points and rotating it so you can view it
    from different perspectives. Certain views
    allow you to separate the data into groups
    better than others.
  • With PCA we can ignore some of the redundant
    experiments (low variance), or use some average
    of the information, without significant loss of
    information.

6
Dimensionality reduction by PCA
  • We are looking for a unit vector u1 such that, on
    average, the squared length of the projection
    of the xi along u1 is maximal (vectors are
    column vectors)
  • Generally, if the first components u1, ..., uk-1 have
    been determined, the next component is the one
    that maximizes the residual variance
  • The principal components of an expression vector x
    are given by ci = ui^T x
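In formulas (a sketch in standard PCA notation, assuming mean-centered data; only u_k, c_k and x come from the slides):

  % first principal direction: unit vector with maximal average squared projection
  u_1 = \arg\max_{\|u\|=1} \frac{1}{N} \sum_{i=1}^{N} (u^\top x_i)^2
  % k-th direction: maximal residual variance, orthogonal to the earlier directions
  u_k = \arg\max_{\|u\|=1,\; u \perp u_1,\dots,u_{k-1}} \frac{1}{N} \sum_{i=1}^{N} (u^\top x_i)^2
  % principal components of an expression vector x
  c_k = u_k^\top x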

7
Dimensionality reduction by PCA
  • How can we find the eigenvectors ui?
  • Find the eigenvectors which show the most
    informative part of the data, i.e. the directions
    of maximal variance of the data.
  • First we calculate the covariance matrix
  • Find the eigenvalues and eigenvectors
    uk of the covariance matrix
  • each eigenvalue is a measure of the proportion of the
    variance explained by the corresponding
    eigenvector
  • Select the ui which are the eigenvectors of the
    sample covariance matrix associated with the K
    largest eigenvalues
  • the eigenvectors which explain most of the
    variance in the data
  • this discovers the important features and patterns in
    the data
  • for data visualization, use two- or three-dimensional
    spaces
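A minimal Python/NumPy sketch of this procedure (the toy data, its shape and the choice of K are illustrative assumptions, not from the slides):

  import numpy as np

  # X: N expression vectors in M-dimensional space (rows = data points)
  rng = np.random.default_rng(0)
  X = rng.normal(size=(100, 20))          # toy data: 100 points, 20 dimensions

  # 1. Center the data and compute the sample covariance matrix
  Xc = X - X.mean(axis=0)
  C = np.cov(Xc, rowvar=False)            # M x M covariance matrix

  # 2. Eigenvalues and eigenvectors of the covariance matrix
  eigvals, eigvecs = np.linalg.eigh(C)    # eigh: C is symmetric
  order = np.argsort(eigvals)[::-1]       # sort by decreasing eigenvalue
  eigvals, eigvecs = eigvals[order], eigvecs[:, order]

  # 3. Keep the K eigenvectors with the largest eigenvalues
  K = 2
  U = eigvecs[:, :K]                      # columns u_1, ..., u_K

  # 4. Principal components c_i = u_i^T x for every data point
  scores = Xc @ U                         # N x K, usable for 2-D visualization

  # Proportion of variance explained by the kept components
  explained = eigvals[:K].sum() / eigvals.sum()
  print(scores.shape, round(float(explained), 3))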

8
Clustering overview
  • Data analysis methods for discovering patterns
    and underlying cluster structures
  • Different kinds of methods, such as hierarchical
    clustering, partitioning-based k-means and the
    Self-Organizing Map (SOM)
  • There is no single method that is best for every
    data set
  • clustering methods are unsupervised methods (like
    k-means)
  • there is no information about the true clusters
    or their number
  • clustering algorithms are used for analysing the
    data
  • the discovered clusters are just estimates of the
    truth (often the result is a local optimum)

9
Clustering overview
  • Data types
  • Typically the clustered data is numerical vector
    data, like gene expression data (expression
    vectors)
  • Numerical data can also be represented in
    relative coordinates
  • Data might also be qualitative (nominal), which
    makes comparing the data elements more challenging
  • The number of clusters is often unknown
  • One way to estimate the number of clusters is to
    analyse the data by PCA
  • you might use the eigenvectors to estimate the
    number of clusters
  • Another way is to make guesses and justify the
    number of clusters by good results (whatever they
    are)

10
Clustering overview
  • Similarity measures
  • Pearson correlation (dot product of normalized
    vectors)
  • Distance measures
  • Euclidean (natural distance between two vectors)
  • It is important to use appropriate
    distance/similarity measures
  • in Euclidean space two vectors might be close to each
    other even though their correlation is 0, e.g. the
    vectors (1,0,...,0,0) and (0,0,...,0,1)
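A small Python check of this point on the two example vectors above (a sketch, NumPy only):

  import numpy as np

  # The two example vectors from the slide
  x = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=float)
  y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1], dtype=float)

  # Euclidean distance between the vectors
  euclidean = np.linalg.norm(x - y)                           # sqrt(2) ~ 1.41

  # Similarity as the normalized dot product (the slide's "Pearson correlation")
  cosine = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))    # 0.0

  print(euclidean, cosine)
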
11
Clustering overview
Cost function and probabilistic interpretation
  • For comparing different ways of clustering the
    same data, we need some kind of cost function for
    the clustering algorithm
  • The goal of clustering is to try to minimize such a
    cost function
  • Generally the cost function depends on quantities
    such as
  • the centers of the clusters
  • the distance of each point in a cluster to the
    cluster center
  • the average degree of similarity of the points in a
    cluster
  • Cost functions are algorithm-specific, so
    comparing the results of different clustering
    algorithms might be almost impossible

12
Clustering overview
Cost function and probabilistic interpretation
  • There are some advantages associated
    with probabilistic models
  • they are often utilized in cost functions
  • A popular choice is to use the negative
    log-likelihood of an underlying probabilistic
    model as the clustering cost function

13
Hierarchical clustering
  • The basic idea is to construct a hierarchical tree
    which consists of nested clusters
  • The algorithm is a bottom-up method where clustering
    starts from single data points (genes) and stops
    when all data points are in the same cluster (the
    root of the tree)
  • Clustering begins by computing pairwise
    similarities between all data points; once
    clusters are formed, the similarity comparison is
    made between clusters.
  • The merging process is repeated at most N-1 times,
    which means that the leaf nodes (genes) are paired
    first and the tree becomes a binary tree.

14
Hierarchical clustering phases
  • Calculate the pairwise similarities between data
    points into a matrix
  • Find the two data points (nodes in the tree) which are
    closest to each other or most similar.
  • Group them together to make a new cluster.
  • Calculate the average vector of the data points, which
    is the expression profile for the cluster (an inner node
    in the tree that joins the leaf nodes'
    data point vectors)
  • Calculate a new correlation matrix
  • calculate the pairwise similarity between the new
    cluster and the other clusters, and repeat
    (see the sketch below).
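A minimal sketch of these phases in Python, using SciPy's agglomerative clustering with a correlation-based distance (the toy data and parameter values are illustrative assumptions; the slides average each cluster's expression vectors, whereas average linkage, used here, averages pairwise distances, which serves as a close stand-in):

  import numpy as np
  from scipy.cluster.hierarchy import linkage, fcluster

  # Toy expression matrix: rows = genes, columns = experiments
  rng = np.random.default_rng(0)
  X = rng.normal(size=(30, 8))

  # Bottom-up (agglomerative) clustering:
  # 'correlation' uses 1 - Pearson correlation as the pairwise distance,
  # 'average' joins the pair of clusters with the smallest mean pairwise distance.
  Z = linkage(X, method="average", metric="correlation")

  # Z encodes the binary tree built by at most N-1 merges;
  # scipy.cluster.hierarchy.dendrogram(Z) would draw it,
  # fcluster cuts it into flat clusters.
  labels = fcluster(Z, t=4, criterion="maxclust")   # e.g. 4 clusters
  print(labels)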

15
Tree Visualization
  • With hierarchical clustering we can find the
    clusters of data points, but the constructed
    tree isn't yet in an optimal order
  • After finding the dendrogram, which tells the
    similarity between nodes and genes, the final and
    optimal linear order for the nodes can be discovered
    with the help of dynamic programming

16
Tree visualization with dynamic programming [2]
[Figure: example dendrogram with leaf order A, B, C, D, E]
  • Goal: quickly and easily arrange the data for
    further inspection

17
Tree visualization with dynamic programming [2]
[Figure: example dendrogram with leaf order A, B, C, D, E]
Greedily join the nearest cluster pair [3]
  • for "nearest" we use the correlation coefficient
    (normalized dot product)
  • other measures can be used as well

18
Tree visualization with dynamic programming [2]
[Figure: example dendrogram with leaf order A, C, B, D, E]
  • Greedily join the nearest cluster pair [3]
  • Optimal ordering minimizes the summed distance
    between consecutive genes
  • Criterion suggested by Eisen [3]

19
Tree visualization with dynamic programming [2]
[Figure: example dendrogram with leaf order B, A, C, E, D]
  • Greedily join the nearest cluster pair [3]
  • Optimal ordering minimizes the summed distance
    between consecutive genes
  • Criterion suggested by Eisen [3]

20
Hierarchical clustering: dynamic programming
  • The optimal linear ordering of the gene expression
    vectors can be computed in O(N^4) steps
  • We would like to maximize the similarity between
    neighbouring nodes, i.e. the summed similarity
    s(x_phi(i), x_phi(i+1)) over consecutive leaves,
    where phi(i) is the i-th leaf when the tree is
    ordered according to phi.
  • The algorithm works from the bottom up towards
    the root by recursively computing the cost of the
    optimal ordering M(V,U,W) [1]

21
Hierarchical clustering: dynamic programming
  • The dynamic programming recurrence is given below
  • The optimal cost M(V) for V is obtained by
    maximizing over all pairs U, W.
  • The global optimal cost is obtained recursively
    when V is the root of the tree, and the optimal
    ordering can be found by standard backtracking. [1]
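A sketch of the standard optimal leaf-ordering recurrence the bullet above refers to (notation assumed, not copied from the slides: V_L and V_R are the left and right subtrees of V, U a leaf of V_L, W a leaf of V_R, and s the pairwise similarity):

  % cost of the best ordering of the subtree rooted at V
  % that starts at leaf U and ends at leaf W
  M(V, U, W) = \max_{R \in V_L,\; S \in V_R}
               \big[\, M(V_L, U, R) + s(R, S) + M(V_R, S, W) \,\big]
  % best ordering of V over all admissible end leaves
  M(V) = \max_{U, W} M(V, U, W)
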
22
k-means algorithm
  • Data points are divided into k clusters
  • Find, by iteration, a set of centroids
    C = {v1, ..., vK} which minimizes the squared distances
    (d^2) between the expression vectors x1, ..., xN and the
    centroid REP(xj, C) of the cluster they belong to,
  • where the distance measure d is Euclidean.
  • In practice the result is an approximation (a local
    optimum).
  • Each expression vector belongs to exactly one cluster.
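Written out, the cost being minimized is (a sketch in the slide's notation, where REP(x_j, C) denotes the centroid of the cluster that x_j is assigned to):

  E(C) = \sum_{j=1}^{N} d^{2}\!\left(x_j,\, \mathrm{REP}(x_j, C)\right),
  \qquad C = \{v_1, \dots, v_K\},
  \qquad d(x, v) = \lVert x - v \rVert_2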

23
k-means algorithm phases
  • Initially put the expression vectors randomly
    into k clusters.
  • Define the cluster centroids by calculating the
    average vector of the expression vectors which
    belong to the cluster.
  • Compute the distances between the expression vectors
    and the centroids.
  • Move every expression vector into the cluster with the
    closest centroid.
  • Define new centroids for the clusters. If the cluster
    centroids are stable or some other stopping
    criterion is met, stop the algorithm. Otherwise
    repeat steps 3-5 (see the sketch below).
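A minimal NumPy sketch of these phases (the toy data, k, and the iteration cap are illustrative assumptions; empty clusters are not handled in this sketch):

  import numpy as np

  def kmeans(X, k, max_iter=100, seed=0):
      """Plain k-means with Euclidean distance; returns (centroids, labels)."""
      rng = np.random.default_rng(seed)
      # 1. Random initial assignment of the expression vectors to k clusters
      labels = rng.integers(0, k, size=len(X))
      for _ in range(max_iter):
          # 2. Centroid of each cluster = average of its expression vectors
          centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
          # 3. Squared Euclidean distances from every vector to every centroid
          d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
          # 4. Move each vector to the cluster with the closest centroid
          new_labels = d2.argmin(axis=1)
          # 5. Stop when the assignments (and hence the centroids) are stable
          if np.array_equal(new_labels, labels):
              break
          labels = new_labels
      return centroids, labels

  X = np.random.default_rng(1).normal(size=(60, 5))   # toy expression vectors
  centroids, labels = kmeans(X, k=3)
  print(centroids.shape, np.bincount(labels))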

24
k-means clustering
Figure 4 [4]: K-means example. 1) Expression
vectors are randomly divided into three clusters.
2) Define the centroids. 3) Compute the expression
vectors' distances to the centroids. 4) Compute the
centroids' new locations. 5) Compute the expression
vectors' distances to the centroids. 6) Compute the
centroids' new locations and finish the clustering
because the centroids have stabilized. The clusters
formed are circled.
25
Mixture models and EM
  • The EM algorithm is based on modelling complex
    distributions by combining simple Gaussian
    distributions, one per cluster
  • The k-means algorithm is an online approximation of the
    EM algorithm
  • it maximizes the quadratic log-likelihood (minimizes
    the squared distances of the data points to their
    clusters' centroids)
  • The EM algorithm is used to optimize the center of
    each cluster (so that the weighted likelihood is
    maximal), which means that we find the maximum
    likelihood estimate for the center of the Gaussian
    distribution of the cluster
  • Some initial guesses have to be made before
    starting
  • the number of clusters (k)
  • the initial centers of the clusters

26
Mixture models and EM
  • The algorithm is an iterative process with two
    optimization tasks
  • E-step: the membership probabilities (hidden
    variables) of each data point for each mixture
    component (cluster) are Estimated
  • The maximum likelihood estimate of the mixing
    coefficient is the sample mean of the
    conditional probabilities that data point di comes
    from component k (see the formulas below)
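In standard mixture-model notation (a sketch; the symbols gamma, lambda, mu and sigma are assumptions, not taken from the slides):

  % E-step: membership (responsibility) of component k for data point d_i
  \gamma_{ik} = P(k \mid d_i)
              = \frac{\lambda_k\, \mathcal{N}(d_i \mid \mu_k, \sigma_k^2)}
                     {\sum_{l=1}^{K} \lambda_l\, \mathcal{N}(d_i \mid \mu_l, \sigma_l^2)}
  % ML estimate of the mixing coefficient: sample mean of the responsibilities
  \lambda_k = \frac{1}{N} \sum_{i=1}^{N} \gamma_{ik}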

27
Mixture models and EM
  • M-step: K separate estimation problems, each
    maximizing the log-likelihood of component k with
    a weight given by the estimated membership
    probabilities
  • In the M-step the means of the Gaussian distributions
    are estimated so that they maximize the likelihood of
    the models (see the sketch below)
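A minimal NumPy sketch of EM for a mixture of spherical Gaussians, following the E-step and M-step described on the last two slides (the toy data, the fixed variance, and the variable names are illustrative assumptions):

  import numpy as np

  def em_gaussian_mixture(X, k, n_iter=50, var=1.0, seed=0):
      """EM for a k-component spherical Gaussian mixture with fixed variance."""
      rng = np.random.default_rng(seed)
      n, d = X.shape
      means = X[rng.choice(n, size=k, replace=False)]   # initial guesses for the centers
      mixing = np.full(k, 1.0 / k)                      # initial mixing coefficients

      for _ in range(n_iter):
          # E-step: membership probabilities of each point for each component
          sq_dist = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
          dens = np.exp(-0.5 * sq_dist / var)           # unnormalized Gaussian densities
          resp = mixing * dens
          resp /= resp.sum(axis=1, keepdims=True)

          # M-step: weighted ML estimates of the means and mixing coefficients
          weight = resp.sum(axis=0)                     # effective points per component
          means = (resp.T @ X) / weight[:, None]
          mixing = weight / n                           # sample mean of the responsibilities
      return means, mixing, resp

  X = np.random.default_rng(1).normal(size=(80, 4))     # toy expression vectors
  means, mixing, resp = em_gaussian_mixture(X, k=3)
  print(means.shape, mixing.round(3))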

28
References
  • [1] Baldi, P. and Hatfield, G. W., DNA Microarrays
    and Gene Expression, Cambridge University Press,
    2002, pp. 73-96.
  • [2] URL http://www-2.cs.cmu.edu/zivbj/class04/lecture11.ppt
  • [3] Eisen, M. B., Spellman, P. T., Brown, P. O. and
    Botstein, D. (1998). Cluster Analysis and Display of
    Genome-Wide Expression Patterns. Proc Natl Acad
    Sci U S A 95, 14863-8.
  • [4] Gasch, A. P. and Eisen, M. B., Exploring the
    conditional coregulation of yeast gene expression
    through fuzzy k-means clustering. Genome Biology,
    3(11) (2002), 122.
    URL http://citeseer.ist.psu.edu/gasch02exploring.html