Title: Statistical analysis of array data: Dimensionality reduction, Clustering
1Statistical analysis of array data
Dimensionality reduction, Clustering
- Katja Astikainen, Riikka Kaven
- 25.2.2005
2Contents
- Problems and approaches
- Dimensionality reduction by PCA
- Clustering overview
- Hierarchical clustering
- K-means
- Mixture models and EM
3Problems and approaches
- Basic idea is to find patterns of expression across multiple genes and experiments
- Models of expression are used, for example, in more precise classification of diseases (disease classification, severity of disease)
- Expression patterns can be used to explore cellular pathways
- By modelling gene expression and also clustering conditions (experiments), one can find genes that are co-regulated
- Clustering methods can also be used for sequence alignments
- There are several methods for this, but we are going to introduce
- Principal Component Analysis (PCA)
- Clustering (hierarchical, K-means, EM)
4Dimensionality reduction by PCA
- PCA is a statistical data analysis technique:
- a method to reduce dimensionality
- a method to identify new, meaningful underlying variables
- a method to compress the data
- a method to visualize the data
5Dimensionality reduction by PCA
- We have N data points x1, ..., xN in an M-dimensional space, where each xi is a gene expression vector.
- With PCA we can reduce the dimension to K, which is usually much lower than M.
- Imagine taking a three-dimensional cloud of data points and rotating it so that you can view it from different perspectives. Certain views allow you to separate the data into groups better than others.
- With PCA we can ignore some of the redundant experiments (low variance), or use an average of them, without losing much information.
6Dimensionality reduction by PCA
- We are looking for a unit vector u1 such that, on average, the squared length of the projection of the xi onto u1 is maximal (vectors are column vectors)
- Generally, once the first components u1, ..., uk-1 have been determined, the next component is the one that maximizes the residual variance
- The principal components of an expression vector x are given by ci = ui^T x
7Dimensionality reduction by PCA
- How can we find the eigenvectors ui? (a NumPy sketch follows below)
- We look for the eigenvectors that capture the most informative part of the data vectors, i.e. the directions of maximal variance of the data.
- First we calculate the covariance matrix
- Then we find the eigenvalues and eigenvectors uk of the covariance matrix
- each eigenvalue is a measure of the proportion of the variance explained by the corresponding eigenvector
- Select the ui which are the eigenvectors of the sample covariance matrix associated with the K largest eigenvalues
- eigenvectors which explain most of the variance in the data
- they reveal the important features and patterns in the data
- for data visualization, use two- or three-dimensional projections
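A minimal NumPy sketch of this procedure; the toy expression matrix X (genes as rows, experiments as columns), the choice K = 3, and the random values are illustrative assumptions, not from the slides.

```python
import numpy as np

# Toy expression matrix: N genes (rows) x M experiments (columns); values are made up.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

K = 3  # number of principal components to keep

# 1) Center the data and compute the sample covariance matrix (M x M).
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# 2) Eigendecomposition of the covariance matrix (eigh: it is symmetric).
eigvals, eigvecs = np.linalg.eigh(cov)

# 3) Sort by decreasing eigenvalue and keep the K largest.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
U = eigvecs[:, :K]              # columns u1 ... uK

# Principal components ci = ui^T x for every expression vector.
C = Xc @ U                      # N x K matrix of projections

# Proportion of variance explained by each kept eigenvector.
explained = eigvals[:K] / eigvals.sum()
print(explained)
```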
8Clustering overview
- Data analysis methods for discovering patterns and underlying cluster structures
- Different kinds of methods, such as hierarchical clustering, partitioning-based k-means, and the Self-Organizing Map (SOM)
- There is no single method that is best for every data set
- Clustering methods are unsupervised methods (like k-means):
- there is no information about the true clusters or their number
- clustering algorithms are used for exploring the data
- the discovered clusters are only estimates of the truth (often the result is a local optimum)
9Clustering overview
- Data types
- Typically the clustered data are numerical vector data, such as gene expression data (expression vectors)
- Numerical data can also be represented in relative coordinates
- Data may also be qualitative (nominal), which makes comparing the data elements more challenging
- The number of clusters is often unknown
- One way to estimate the number of clusters is to analyse the data by PCA; the eigenvectors/eigenvalues can be used to estimate the number of clusters
- Another way is to make guesses and justify the number of clusters by good results (whatever those are)
10Clustering overview
- Similarity measures
- Pearson correlation (dot product of normalized vectors)
- Distance measures
- Euclidean distance (the natural distance between two vectors)
- It is important to use an appropriate distance/similarity measure
- in Euclidean space two vectors may be close to each other even though their correlation is 0, e.g. (1,0,0,0,0,0,0,0,0,0) and (0,0,0,0,0,0,0,0,0,1); a small numerical check follows below
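A tiny check of that example in NumPy (the two vectors are taken straight from the slide):

```python
import numpy as np

# Small Euclidean distance, yet zero normalized dot product ("correlation" in the slide's sense).
x = np.array([1., 0, 0, 0, 0, 0, 0, 0, 0, 0])
y = np.array([0., 0, 0, 0, 0, 0, 0, 0, 0, 1.])

euclidean = np.linalg.norm(x - y)                                  # sqrt(2) ~ 1.41
normalized_dot = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))   # exactly 0.0

print(euclidean, normalized_dot)
```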
11Clustering overview
Cost function and probabilistic interpretation
- To compare different ways of clustering the same data, we need some kind of cost function for the clustering algorithm
- The goal of clustering is to minimize such a cost function
- Generally the cost function depends on quantities such as:
- the centers of the clusters
- the distance of each point in a cluster to the cluster center
- the average degree of similarity of the points in a cluster
- Cost functions are algorithm specific, so comparing the results of different clustering algorithms may be almost impossible
12Clustering overview
Cost function and probabilistic interpretation
- There are some advantages associated with probabilistic models
- they are often utilized in cost functions
- a popular choice is to use as the clustering cost function the negative log-likelihood of an underlying probabilistic model
13Hierarchical clustering
- The basic idea is to construct a hierarchical tree which consists of nested clusters
- The algorithm is a bottom-up method: clustering starts from single data points (genes) and stops when all data points are in the same cluster (the root of the tree)
- Clustering begins by computing pairwise similarities between all data points; once clusters have been formed, the similarity comparisons are made between clusters
- The merging process is repeated at most N-1 times, so the leaf nodes (genes) are paired first and the tree becomes a binary tree
14Hierarchical clustering phases
- Calculate the pairwise similarities between data points into a matrix
- Find the two data points (nodes in the tree) which are closest to each other or most similar
- Group them together to make a new cluster
- Calculate the average vector of the data points, which becomes the expression profile of the cluster (an inner node in the tree that joins the leaf nodes' data vectors)
- Calculate a new correlation matrix
- calculate the pairwise similarity between the new cluster and the other clusters (a sketch of these phases follows below)
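A minimal sketch of these phases, assuming Pearson correlation as the similarity measure and a made-up toy matrix; a real implementation would update the similarity matrix incrementally rather than recomputing all pairs on every round.

```python
import numpy as np

def hierarchical_cluster(X):
    """Bottom-up clustering following the phases above: each cluster is represented
    by the average of its expression vectors, and clusters are compared by the
    Pearson correlation of those profile vectors."""
    def corr(a, b):
        a = (a - a.mean()) / a.std()
        b = (b - b.mean()) / b.std()
        return float((a * b).mean())

    # Start with every data point (gene) as its own cluster.
    clusters = [{'members': [i], 'profile': X[i]} for i in range(len(X))]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters whose profiles are most similar.
        best, pair = -np.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                c = corr(clusters[i]['profile'], clusters[j]['profile'])
                if c > best:
                    best, pair = c, (i, j)
        i, j = pair
        # Join them; the new cluster's profile is the average expression vector.
        members = clusters[i]['members'] + clusters[j]['members']
        merges.append((clusters[i]['members'], clusters[j]['members'], best))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append({'members': members, 'profile': X[members].mean(axis=0)})
    return merges  # N-1 merges, from the first pair up to the root

# Toy expression matrix: 8 genes, 5 experiments (values are made up).
X = np.random.default_rng(0).normal(size=(8, 5))
for left, right, sim in hierarchical_cluster(X):
    print(left, '+', right, f'(correlation {sim:.2f})')
```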
15Tree Visualization
- With hierarchical clustering we can find the clusters of data points, but the constructed tree is not yet in an optimal order
- After finding the dendrogram, which describes the similarity between nodes and genes, the final, optimal linear order for the nodes can be found with the help of dynamic programming (see the sketch below)
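For experimenting with this step, SciPy ships ready-made routines for both the agglomerative tree and optimal leaf ordering; a minimal sketch with made-up data (the linkage method and metric choices are assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, optimal_leaf_ordering, leaves_list

# Toy expression matrix: rows are genes, columns are experiments (values are made up).
X = np.random.default_rng(0).normal(size=(15, 6))

# Agglomerative tree; 'correlation' uses 1 - Pearson correlation as the distance.
Z = linkage(X, method='average', metric='correlation')

# Reorder the leaves so that the summed distance between consecutive genes is minimal.
Z_ordered = optimal_leaf_ordering(Z, X, metric='correlation')

print(leaves_list(Z))          # leaf order produced by the clustering alone
print(leaves_list(Z_ordered))  # leaf order after optimal reordering
```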
16Tree visualization with dynamic programming [2]
(Figure: dendrogram over genes A, B, C, D, E)
- Goal: quickly and easily arrange the data for further inspection
17Tree visualization with dynamic programming [2]
(Figure: dendrogram over genes A, B, C, D, E)
- Greedily join the nearest cluster pair [3]
- "nearest": we use the correlation coefficient (normalized dot product)
- other measures can be used as well
18Tree visualization with dynamic programming [2]
(Figure: dendrogram with leaf order A, C, B, D, E)
- Greedily join the nearest cluster pair [3]
- The optimal ordering minimizes the summed distance between consecutive genes
- Criterion suggested by Eisen [3]
19Tree visualization with dynamic programming [2]
(Figure: dendrogram with leaf order B, A, C, E, D)
- Greedily join the nearest cluster pair [3]
- The optimal ordering minimizes the summed distance between consecutive genes
- Criterion suggested by Eisen [3]
20Hierarchical clustering: dynamic programming
- The optimal linear ordering of the gene expression vectors can be computed in O(N^4) steps [1]
- We would like to maximize the similarity between neighbouring leaves, i.e. the sum over i = 1, ..., N-1 of the similarities s(x_pi(i), x_pi(i+1)), where pi(i) is the i-th leaf when the tree is ordered according to the permutation pi
- The algorithm works bottom-up towards the root by recursively computing the cost M(V, U, W) of the optimal ordering of the subtree rooted at V
21Hierarchical clustering: dynamic programming
- The dynamic programming recurrence combines the optimal orderings of the left and right subtrees of V (a reconstruction is sketched below)
- The optimal cost M(V) for V is obtained by maximizing over all pairs U, W
- The global optimal cost is obtained recursively when V is the root of the tree, and the optimal ordering can be found by standard backtracking [1]
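The recurrence itself did not survive extraction; the following is a reconstruction based on the standard optimal leaf-ordering formulation (V_L and V_R denote the two subtrees of V, U a leaf of V_L, W a leaf of V_R, and s the pairwise similarity), so the notation may differ from the original slide:

```latex
M(V, U, W) = \max_{R \in V_L,\; S \in V_R}
             \Big[\, M(V_L, U, R) + s(R, S) + M(V_R, S, W) \,\Big],
\qquad
M(V) = \max_{U, W} M(V, U, W).
```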
22k-means algorithm
- Data points are divided into k clusters
- The algorithm iteratively searches for a set of centroids C = {v1, ..., vK} that minimizes the squared distances d^2 between the expression vectors x1, ..., xN and the centroids they belong to, i.e. the cost sum over j of d^2(xj, rep(xj, C)), where rep(xj, C) is the centroid closest to xj
- where the distance measure d is Euclidean
- In practice the result is an approximation (a local optimum)
- Each expression vector belongs to exactly one cluster
23k-means algorithm phases
- Initially assign the expression vectors randomly into k clusters.
- Define each cluster's centroid by calculating the average of the expression vectors that belong to the cluster.
- Compute the distances between the expression vectors and the centroids.
- Move every expression vector into the cluster with the closest centroid.
- Define new centroids for the clusters. If the cluster centroids are stable or some other stopping criterion is met, stop the algorithm. Otherwise repeat steps 3-5. (A sketch of these phases follows below.)
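A minimal NumPy sketch of these phases; the toy data, k = 3, and the iteration cap are assumptions made for illustration.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means following the phases above (Euclidean distance)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # 1) Randomly assign the expression vectors to k clusters.
    labels = rng.integers(0, k, size=n)
    for _ in range(n_iter):
        # 2) Each cluster's centroid = average of its expression vectors
        #    (an empty cluster is re-seeded with a random data point).
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else X[rng.integers(0, n)] for j in range(k)])
        # 3) Squared Euclidean distances between every vector and every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        # 4) Move every expression vector to the cluster with the closest centroid.
        new_labels = d2.argmin(axis=1)
        # 5) Stop when the assignment (and hence the centroids) is stable.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids

# Toy data: 30 "expression vectors" with 5 measurements each (values are made up).
X = np.random.default_rng(1).normal(size=(30, 5))
labels, centroids = kmeans(X, k=3)
print(labels)
```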
24k-means clustering
Figure 4 [4]: K-means example. 1) The expression vectors are randomly divided into three clusters. 2) The centroids are defined. 3) The expression vectors' distances to the centroids are computed. 4) The centroids' new locations are computed. 5) The expression vectors' distances to the centroids are computed again. 6) The centroids' new locations are computed and the clustering is finished because the centroids have stabilized. The clusters formed are circled.
25Mixture models and EM
- The EM algorithm is based on modelling complex distributions by combining simple Gaussian distributions, one per cluster
- The k-means algorithm is an online approximation of the EM algorithm
- it maximizes the quadratic log-likelihood (minimizes the quadratic distances of the data points to their clusters' centroids)
- The EM algorithm is used to optimize the center of each cluster, which means that we find the maximum likelihood estimate for the center of the cluster's Gaussian distribution
- Some initial guesses have to be made before starting:
- the number of clusters (k)
- the initial centers of the clusters
26Mixture models and EM
- The algorithm is an iterative process with two optimization steps
- E-step: the membership probabilities (hidden variables) of each data point in each mixture component (cluster) are Estimated
- The maximum likelihood estimate of the mixing coefficient of component k is the sample mean of the conditional probabilities that the data points come from model k
27Mixture models and EM
- M-step: K separate estimation problems, Maximizing the log-likelihood of component k with weights given by the estimated membership probabilities
- In the M-step the means of the Gaussian distributions are estimated so that they maximize the likelihood of the models (a sketch of both steps follows below)
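A compact NumPy sketch of the E- and M-steps for a mixture of spherical Gaussians; the toy data, the spherical-covariance simplification, k = 3, and the fixed number of iterations are assumptions made for illustration.

```python
import numpy as np

def em_gmm(X, k, n_iter=50, seed=0):
    """EM for a mixture of k spherical Gaussians (simplified sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initial guesses: equal mixing coefficients, random data points as centers, unit variances.
    pi = np.full(k, 1.0 / k)
    mu = X[rng.choice(n, size=k, replace=False)]
    var = np.ones(k)
    for _ in range(n_iter):
        # E-step: membership probability of each data point in each component.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)          # (n, k)
        dens = pi * np.exp(-0.5 * d2 / var) / (2 * np.pi * var) ** (d / 2)
        gamma = dens / dens.sum(axis=1, keepdims=True)                    # responsibilities
        # M-step: weighted maximum likelihood estimates.
        Nk = gamma.sum(axis=0)
        pi = Nk / n                          # mixing coefficients = mean membership probabilities
        mu = (gamma.T @ X) / Nk[:, None]     # membership-weighted means (cluster centers)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        var = np.maximum((gamma * d2).sum(axis=0) / (d * Nk), 1e-6)
    return pi, mu, var, gamma

# Toy data standing in for expression vectors (values are made up).
X = np.random.default_rng(1).normal(size=(60, 4))
pi, mu, var, gamma = em_gmm(X, k=3)
print(pi)
```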
28References
- [1] Baldi, P. and Hatfield, G. W., DNA Microarrays and Gene Expression, Cambridge University Press, 2002, pp. 73-96.
- [2] URL: http://www-2.cs.cmu.edu/zivbj/class04/lecture11.ppt
- [3] Eisen, M. B., Spellman, P. T., Brown, P. O. and Botstein, D. (1998). Cluster Analysis and Display of Genome-Wide Expression Patterns. Proc Natl Acad Sci U S A 95, 14863-8.
- [4] Gasch, A. P. and Eisen, M. B., Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. Genome Biology, 3(11) (2002), 1-22. URL: http://citeseer.ist.psu.edu/gasch02exploring.html