Title: Statistical analysis of array data: Dimensionality reduction, Clustering
1Statistical analysis of array data
Dimensionality reduction, Clustering
- Katja Astikainen, Riikka Kaven
- 25.2.2005
2Contents
- Problems and approaches
- Dimensionality reduction by PCA
- Clustering overview
- Hierarchical clustering
- K-means
- Mixture models and EM
3Problems and approaches
- Basic idea is to find patterns of expression across multiple genes and experiments
- Models of expression are used, for example, in more precise classification of diseases (disease classification, severity of disease)
- Expression patterns can be used to explore cellular pathways
- By modelling gene expression and also clustering conditions (experiments), one can find genes that are co-regulated
- Clustering methods can also be used for sequence alignments
- There are several methods for this, but we are going to introduce
- Principal Component Analysis (PCA)
- Clustering (hierarchical, K-means, EM)
4Dimensionality reduction by PCA
- PCA is a statistical data analysis technique:
- a method to reduce dimensionality
- a method to identify new, meaningful underlying variables
- a method to compress the data
- a method to visualize the data
5Dimensionality reduction by PCA
- We have N data points x1, ..., xN in an M-dimensional space, where each xi is a gene expression vector.
- With PCA we can reduce the dimension to K, which is usually much lower than M.
- Imagine taking a three-dimensional cloud of data points and rotating it so that you can view it from different perspectives. Certain views allow you to separate the data into groups better than others.
- With PCA we can ignore some of the redundant experiments (low variance), or use an average of them, without losing much information.
6Dimensionality reduction by PCA
- We are looking for a unit vector u1 such that, on average, the squared length of the projection of the xi onto u1 is maximal (vectors are column vectors)
- Generally, once the first components u1, ..., uk-1 have been determined, the next component is the one that maximizes the residual variance
- The principal components of an expression vector x are given by ci = ui^T x
7Dimensionality reduction by PCA
- How can we find the eigenvectors ui? (a NumPy sketch follows below)
- We look for the eigenvectors that capture the most informative part of the data vectors, i.e. the directions of maximal variance of the data.
- First we calculate the covariance matrix
- Then we find the eigenvalues and eigenvectors uk of the covariance matrix
- each eigenvalue is a measure of the proportion of the variance explained by the corresponding eigenvector
- Select the ui which are the eigenvectors of the sample covariance matrix associated with the K largest eigenvalues
- eigenvectors which explain most of the variance in the data
- they reveal the important features and patterns in the data
- for data visualization, use two- or three-dimensional projections
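A minimal NumPy sketch of this procedure; the toy expression matrix X (genes as rows, experiments as columns), the choice K = 3, and the random values are illustrative assumptions, not from the slides.

```python
import numpy as np

# Toy expression matrix: N genes (rows) x M experiments (columns); values are made up.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

K = 3  # number of principal components to keep

# 1) Center the data and compute the sample covariance matrix (M x M).
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# 2) Eigendecomposition of the covariance matrix (eigh: it is symmetric).
eigvals, eigvecs = np.linalg.eigh(cov)

# 3) Sort by decreasing eigenvalue and keep the K largest.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
U = eigvecs[:, :K]              # columns u1 ... uK

# Principal components ci = ui^T x for every expression vector.
C = Xc @ U                      # N x K matrix of projections

# Proportion of variance explained by each kept eigenvector.
explained = eigvals[:K] / eigvals.sum()
print(explained)
```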
8Clustering overview
- Data analysis methods for discovering patterns and underlying cluster structures
- Different kinds of methods, such as hierarchical clustering, partitioning-based k-means, and the Self-Organizing Map (SOM)
- There is no single method that is best for every data set
- Clustering methods are unsupervised methods (like k-means):
- there is no information about the true clusters or their number
- clustering algorithms are used for exploring the data
- the discovered clusters are only estimates of the truth (often the result is a local optimum)
9Clustering overview
- Data types
- Typically the clustered data are numerical vector data, such as gene expression data (expression vectors)
- Numerical data can also be represented in relative coordinates
- Data may also be qualitative (nominal), which makes comparing the data elements more challenging
- The number of clusters is often unknown
- One way to estimate the number of clusters is to analyse the data by PCA; the eigenvectors/eigenvalues can be used to estimate the number of clusters
- Another way is to make guesses and justify the number of clusters by good results (whatever those are)
10Clustering overview
- Similarity measures
- Pearson correlation (dot product of normalized vectors)
- Distance measures
- Euclidean distance (the natural distance between two vectors)
- It is important to use an appropriate distance/similarity measure
- in Euclidean space two vectors may be close to each other even though their correlation is 0, e.g. (1,0,0,0,0,0,0,0,0,0) and (0,0,0,0,0,0,0,0,0,1); a small numerical check follows below
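A tiny check of that example in NumPy (the two vectors are taken straight from the slide):

```python
import numpy as np

# Small Euclidean distance, yet zero normalized dot product ("correlation" in the slide's sense).
x = np.array([1., 0, 0, 0, 0, 0, 0, 0, 0, 0])
y = np.array([0., 0, 0, 0, 0, 0, 0, 0, 0, 1.])

euclidean = np.linalg.norm(x - y)                                  # sqrt(2) ~ 1.41
normalized_dot = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))   # exactly 0.0

print(euclidean, normalized_dot)
```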
11Clustering overview
Cost function and probabilistic interpretation
- To compare different ways of clustering the same data, we need some kind of cost function for the clustering algorithm
- The goal of clustering is to minimize such a cost function
- Generally the cost function depends on quantities such as:
- the centers of the clusters
- the distance of each point in a cluster to the cluster center
- the average degree of similarity of the points in a cluster
- Cost functions are algorithm specific, so comparing the results of different clustering algorithms may be almost impossible
12Clustering overview
Cost function and probabilistic interpretation
- There are some advantages associated with probabilistic models
- they are often utilized in cost functions
- a popular choice is to use as the clustering cost function the negative log-likelihood of an underlying probabilistic model
13Hierarchical clustering
- The basic idea is to construct a hierarchical tree which consists of nested clusters
- The algorithm is a bottom-up method: clustering starts from single data points (genes) and stops when all data points are in the same cluster (the root of the tree)
- Clustering begins by computing pairwise similarities between all data points; once clusters have been formed, the similarity comparisons are made between clusters
- The merging process is repeated at most N-1 times, so the leaf nodes (genes) are paired first and the tree becomes a binary tree
14Hierarchical clustering phases
- Calculate the pairwise similarities between data points into a matrix
- Find the two data points (nodes in the tree) which are closest to each other or most similar
- Group them together to make a new cluster
- Calculate the average vector of the data points, which becomes the expression profile of the cluster (an inner node in the tree that joins the leaf nodes' data vectors)
- Calculate a new correlation matrix
- calculate the pairwise similarity between the new cluster and the other clusters (a sketch of these phases follows below)
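A minimal sketch of these phases, assuming Pearson correlation as the similarity measure and a made-up toy matrix; a real implementation would update the similarity matrix incrementally rather than recomputing all pairs on every round.

```python
import numpy as np

def hierarchical_cluster(X):
    """Bottom-up clustering following the phases above: each cluster is represented
    by the average of its expression vectors, and clusters are compared by the
    Pearson correlation of those profile vectors."""
    def corr(a, b):
        a = (a - a.mean()) / a.std()
        b = (b - b.mean()) / b.std()
        return float((a * b).mean())

    # Start with every data point (gene) as its own cluster.
    clusters = [{'members': [i], 'profile': X[i]} for i in range(len(X))]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters whose profiles are most similar.
        best, pair = -np.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                c = corr(clusters[i]['profile'], clusters[j]['profile'])
                if c > best:
                    best, pair = c, (i, j)
        i, j = pair
        # Join them; the new cluster's profile is the average expression vector.
        members = clusters[i]['members'] + clusters[j]['members']
        merges.append((clusters[i]['members'], clusters[j]['members'], best))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append({'members': members, 'profile': X[members].mean(axis=0)})
    return merges  # N-1 merges, from the first pair up to the root

# Toy expression matrix: 8 genes, 5 experiments (values are made up).
X = np.random.default_rng(0).normal(size=(8, 5))
for left, right, sim in hierarchical_cluster(X):
    print(left, '+', right, f'(correlation {sim:.2f})')
```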
15Tree Visualization
- With hierarchical clustering we can find the clusters of data points, but the constructed tree is not yet in an optimal order
- After finding the dendrogram, which describes the similarity between nodes and genes, the final, optimal linear order for the nodes can be found with the help of dynamic programming (see the sketch below)
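For experimenting with this step, SciPy ships ready-made routines for both the agglomerative tree and optimal leaf ordering; a minimal sketch with made-up data (the linkage method and metric choices are assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, optimal_leaf_ordering, leaves_list

# Toy expression matrix: rows are genes, columns are experiments (values are made up).
X = np.random.default_rng(0).normal(size=(15, 6))

# Agglomerative tree; 'correlation' uses 1 - Pearson correlation as the distance.
Z = linkage(X, method='average', metric='correlation')

# Reorder the leaves so that the summed distance between consecutive genes is minimal.
Z_ordered = optimal_leaf_ordering(Z, X, metric='correlation')

print(leaves_list(Z))          # leaf order produced by the clustering alone
print(leaves_list(Z_ordered))  # leaf order after optimal reordering
```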
16Tree visualization with dynamic programming [2]
(Figure: dendrogram over genes A, B, C, D, E)
- Goal: quickly and easily arrange the data for further inspection
17Tree visualization with dynamic programming [2]
(Figure: dendrogram over genes A, B, C, D, E)
- Greedily join the nearest cluster pair [3]
- "nearest": we use the correlation coefficient (normalized dot product)
- other measures can be used as well
18Tree visualization with dynamic programming [2]
(Figure: dendrogram with leaf order A, C, B, D, E)
- Greedily join the nearest cluster pair [3]
- The optimal ordering minimizes the summed distance between consecutive genes
- Criterion suggested by Eisen [3]
19Tree visualization with dynamic programming [2]
(Figure: dendrogram with leaf order B, A, C, E, D)
- Greedily join the nearest cluster pair [3]
- The optimal ordering minimizes the summed distance between consecutive genes
- Criterion suggested by Eisen [3]
20Hierarchical clustering: dynamic programming
- The optimal linear ordering of the gene expression vectors can be computed in O(N^4) steps [1]
- We would like to maximize the similarity between neighbouring leaves, i.e. the sum over i = 1, ..., N-1 of the similarities s(x_pi(i), x_pi(i+1)), where pi(i) is the i-th leaf when the tree is ordered according to the permutation pi
- The algorithm works bottom-up towards the root by recursively computing the cost M(V, U, W) of the optimal ordering of the subtree rooted at V
21Hierarchical clustering: dynamic programming
- The dynamic programming recurrence combines the optimal orderings of the left and right subtrees of V (a reconstruction is sketched below)
- The optimal cost M(V) for V is obtained by maximizing over all pairs U, W
- The global optimal cost is obtained recursively when V is the root of the tree, and the optimal ordering can be found by standard backtracking [1]
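The recurrence itself did not survive extraction; the following is a reconstruction based on the standard optimal leaf-ordering formulation (V_L and V_R denote the two subtrees of V, U a leaf of V_L, W a leaf of V_R, and s the pairwise similarity), so the notation may differ from the original slide:

```latex
M(V, U, W) = \max_{R \in V_L,\; S \in V_R}
             \Big[\, M(V_L, U, R) + s(R, S) + M(V_R, S, W) \,\Big],
\qquad
M(V) = \max_{U, W} M(V, U, W).
```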
22k-means algorithm
- Data points are divided into k clusters
- The algorithm iteratively searches for a set of centroids C = {v1, ..., vK} that minimizes the squared distances d^2 between the expression vectors x1, ..., xN and the centroids they belong to, i.e. the cost sum over j of d^2(xj, rep(xj, C)), where rep(xj, C) is the centroid closest to xj
- where the distance measure d is Euclidean
- In practice the result is an approximation (a local optimum)
- Each expression vector belongs to exactly one cluster
23k-means algorithm phases
- Initially assign the expression vectors randomly into k clusters.
- Define each cluster's centroid by calculating the average of the expression vectors that belong to the cluster.
- Compute the distances between the expression vectors and the centroids.
- Move every expression vector into the cluster with the closest centroid.
- Define new centroids for the clusters. If the cluster centroids are stable or some other stopping criterion is met, stop the algorithm. Otherwise repeat steps 3-5. (A sketch of these phases follows below.)
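A minimal NumPy sketch of these phases; the toy data, k = 3, and the iteration cap are assumptions made for illustration.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means following the phases above (Euclidean distance)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # 1) Randomly assign the expression vectors to k clusters.
    labels = rng.integers(0, k, size=n)
    for _ in range(n_iter):
        # 2) Each cluster's centroid = average of its expression vectors
        #    (an empty cluster is re-seeded with a random data point).
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else X[rng.integers(0, n)] for j in range(k)])
        # 3) Squared Euclidean distances between every vector and every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        # 4) Move every expression vector to the cluster with the closest centroid.
        new_labels = d2.argmin(axis=1)
        # 5) Stop when the assignment (and hence the centroids) is stable.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids

# Toy data: 30 "expression vectors" with 5 measurements each (values are made up).
X = np.random.default_rng(1).normal(size=(30, 5))
labels, centroids = kmeans(X, k=3)
print(labels)
```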
24k-means clustering
Figure 4 [4]: K-means example. 1) The expression vectors are randomly divided into three clusters. 2) The centroids are defined. 3) The expression vectors' distances to the centroids are computed. 4) The centroids' new locations are computed. 5) The expression vectors' distances to the centroids are computed again. 6) The centroids' new locations are computed and the clustering is finished because the centroids have stabilized. The clusters formed are circled.
25Mixture models and EM
- The EM algorithm is based on modelling complex distributions by combining simple Gaussian distributions, one per cluster
- The k-means algorithm is an online approximation of the EM algorithm
- it maximizes the quadratic log-likelihood (minimizes the quadratic distances of the data points to their clusters' centroids)
- The EM algorithm is used to optimize the center of each cluster, which means that we find the maximum likelihood estimate for the center of the cluster's Gaussian distribution
- Some initial guesses have to be made before starting:
- the number of clusters (k)
- the initial centers of the clusters
26Mixture models and EM
- The algorithm is an iterative process with two optimization steps
- E-step: the membership probabilities (hidden variables) of each data point in each mixture component (cluster) are Estimated
- The maximum likelihood estimate of the mixing coefficient of component k is the sample mean of the conditional probabilities that the data points come from model k
27Mixture models and EM
- M-step: K separate estimation problems, Maximizing the log-likelihood of component k with weights given by the estimated membership probabilities
- In the M-step the means of the Gaussian distributions are estimated so that they maximize the likelihood of the models (a sketch of both steps follows below)
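A compact NumPy sketch of the E- and M-steps for a mixture of spherical Gaussians; the toy data, the spherical-covariance simplification, k = 3, and the fixed number of iterations are assumptions made for illustration.

```python
import numpy as np

def em_gmm(X, k, n_iter=50, seed=0):
    """EM for a mixture of k spherical Gaussians (simplified sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initial guesses: equal mixing coefficients, random data points as centers, unit variances.
    pi = np.full(k, 1.0 / k)
    mu = X[rng.choice(n, size=k, replace=False)]
    var = np.ones(k)
    for _ in range(n_iter):
        # E-step: membership probability of each data point in each component.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)          # (n, k)
        dens = pi * np.exp(-0.5 * d2 / var) / (2 * np.pi * var) ** (d / 2)
        gamma = dens / dens.sum(axis=1, keepdims=True)                    # responsibilities
        # M-step: weighted maximum likelihood estimates.
        Nk = gamma.sum(axis=0)
        pi = Nk / n                          # mixing coefficients = mean membership probabilities
        mu = (gamma.T @ X) / Nk[:, None]     # membership-weighted means (cluster centers)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        var = np.maximum((gamma * d2).sum(axis=0) / (d * Nk), 1e-6)
    return pi, mu, var, gamma

# Toy data standing in for expression vectors (values are made up).
X = np.random.default_rng(1).normal(size=(60, 4))
pi, mu, var, gamma = em_gmm(X, k=3)
print(pi)
```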
28References
- [1] Baldi, P. and Hatfield, G. W., DNA Microarrays and Gene Expression, Cambridge University Press, 2002, pp. 73-96.
- [2] URL: http://www-2.cs.cmu.edu/zivbj/class04/lecture11.ppt
- [3] Eisen, M. B., Spellman, P. T., Brown, P. O. and Botstein, D. (1998). Cluster Analysis and Display of Genome-Wide Expression Patterns. Proc Natl Acad Sci U S A 95, 14863-8.
- [4] Gasch, A. P. and Eisen, M. B., Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. Genome Biology, 3(11) (2002), 1-22. URL: http://citeseer.ist.psu.edu/gasch02exploring.html