Title: Examples of Clustering Biological Data
1. Examples of Clustering Biological Data
- 6.892 / 7.93 Spring Term 2002
- March 12, 2002
- David Gifford
2. Student Projects
- Ideally groups of four
  - Two biologists
  - Two computer scientists or quantitative experts
- Pick a research project related to the class
- You will have access to data, Matlab, etc.
- Work product
  - Preliminary outline of plans (5 minutes in class)
  - Final presentation (30 minutes)
  - Talk and slides are graded
3. Ordering effect
4. The problem
There are $2^{n-1}$ linear orderings consistent with
the structure of the tree. An optimal linear
ordering, one that maximizes the similarity of
adjacent elements in the ordering, is impractical
to compute.
Eisen et al., PNAS 1998
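The count comes from the fact that each of the n-1 internal nodes of a binary tree can be flipped independently. A minimal Python sketch that enumerates the orderings of a small made-up tree follows; the nested-tuple tree representation and the name `orderings` are illustrative assumptions, not part of the lecture.

```python
from typing import List, Tuple, Union

# Assumed representation: a tree is a leaf label (str)
# or a pair (left_subtree, right_subtree).
Tree = Union[str, Tuple["Tree", "Tree"]]


def orderings(tree: Tree) -> List[List[str]]:
    """Return every leaf ordering obtainable by flipping internal nodes."""
    if isinstance(tree, str):
        return [[tree]]
    left, right = tree
    result = []
    for a in orderings(left):
        for b in orderings(right):
            result.append(a + b)  # children in their original order
            result.append(b + a)  # children flipped at this node
    return result


# Hypothetical 5-leaf tree, so we expect 2 ** (5 - 1) orderings.
tree = (("a", "b"), ("c", ("d", "e")))
all_orders = orderings(tree)
print(len(all_orders))  # 16
```

Enumeration like this is only feasible for very small trees, which is why a smarter algorithm is needed.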
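5. Problem definition
Denote by $\Phi$ the space of the possible linear
orderings consistent with the tree. Denote by $v_1, \ldots, v_n$
the tree leaves. Our goal is to find an
ordering $\phi \in \Phi$ that maximizes the similarity of
adjacent elements,
$$\max_{\phi \in \Phi} \sum_{i=1}^{n-1} S(v_{\phi(i)}, v_{\phi(i+1)}),$$
where S is the similarity matrix and $v_{\phi(i)}$ is the leaf in position i of the ordering $\phi$.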
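The quantity being maximized is the sum of similarities between neighbouring leaves in the ordering. A small sketch of that score, assuming leaves are integer indices into a symmetric similarity matrix stored as a list of lists (the function name `ordering_score` and the matrix values are made up for illustration):

```python
from typing import List, Sequence


def ordering_score(order: Sequence[int], S: List[List[int]]) -> int:
    """Sum of S over adjacent pairs of leaves in the given ordering."""
    return sum(S[order[i]][order[i + 1]] for i in range(len(order) - 1))


# Hypothetical symmetric similarity matrix for three leaves.
S = [
    [0, 9, 1],
    [9, 0, 5],
    [1, 5, 0],
]

print(ordering_score([0, 1, 2], S))  # 9 + 5 = 14
print(ordering_score([1, 0, 2], S))  # 9 + 1 = 10
```

The two orderings contain the same leaves but score differently, which is the ordering effect the slides refer to.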
6. Computing the optimal similarity
Recursively compute the optimal similarity
$O_T(u,w)$ for any pair of leaves (u,w) which could
be on different corners (leftmost and rightmost)
of T. For a leaf $u \in T$, $C_T(u)$ is the set of all
possible corner leaves of T when u is on one
corner of T.
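[Figure: tree T with subtrees $T_1$ and $T_2$; u and m are the corner leaves of $T_1$, k and w are the corner leaves of $T_2$.]
$$O_T(u,w) = \max_{m \in C_{T_1}(u),\; k \in C_{T_2}(w)} O_{T_1}(u,m) + O_{T_2}(k,w) + S(m,k)$$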
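The recurrence can be transcribed directly into a bottom-up recursion over the tree. The sketch below is a naive, unoptimized version written for illustration, not the lecture's implementation: it assumes leaves are integer indices into S, represents the tree as nested pairs, and stores the best ordering next to each score so the result can be read off. The names `optimal_orderings`, `tree` and the matrix values are assumptions.

```python
from typing import Dict, List, Tuple, Union

# Assumed representation: a leaf is an int index into S,
# an internal node is a pair (left_subtree, right_subtree).
Tree = Union[int, Tuple["Tree", "Tree"]]
Table = Dict[Tuple[int, int], Tuple[int, List[int]]]


def optimal_orderings(tree: Tree, S: List[List[int]]) -> Table:
    """Map each corner pair (u, w) to (best score, best ordering from u to w)."""
    if isinstance(tree, int):
        return {(tree, tree): (0, [tree])}
    o1 = optimal_orderings(tree[0], S)
    o2 = optimal_orderings(tree[1], S)
    best: Table = {}
    # u..m spans T1, k..w spans T2, and m and k become adjacent.
    for (u, m), (score1, order1) in o1.items():
        for (k, w), (score2, order2) in o2.items():
            score = score1 + score2 + S[m][k]
            if (u, w) not in best or score > best[(u, w)][0]:
                best[(u, w)] = (score, order1 + order2)
    # Flipping the root gives the mirrored ordering with the same score.
    for (u, w), (score, order) in list(best.items()):
        best[(w, u)] = (score, order[::-1])
    return best


# Hypothetical symmetric similarity matrix for four leaves.
S = [
    [0, 5, 1, 8],
    [5, 0, 2, 3],
    [1, 2, 0, 4],
    [8, 3, 4, 0],
]
tree = ((0, 1), (2, 3))

table = optimal_orderings(tree, S)
best_score, best_order = max(table.values(), key=lambda entry: entry[0])
print(best_score)  # 17
print(best_order)  # an ordering (or its mirror) that achieves the best score
```

In this example the pairs (0,1) and (2,3) always stay adjacent, contributing 5 + 4, so the optimum is decided by the junction: placing leaves 0 and 3 next to each other adds 8, for a total of 17.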
7. Improvement
Worst-case time is still $O(n^4)$, but
8. Running time: biological datasets
Results obtained on a 700 MHz Pentium PC with 512 MB of
memory running Linux
9. Does it help?
Recall the statement we started with: an
optimal linear ordering, one that maximizes the
similarity of adjacent elements in the
ordering, is impractical to compute.
10. Results: hand-generated data
[Figure panel: Input]
11. Biological results
- Spellman identified 800 genes as cell cycle
  regulated in Saccharomyces cerevisiae.
- Genes were assigned to five groups termed
  G1, S, S/G2, G2/M and M/G1, which approximate the
  commonly used cell cycle groups in the
  literature.
- This assignment was performed using a phasing
  method, which is a supervised classification
  algorithm.
- In addition to the phasing method, the authors
  clustered these genes using hierarchical
  clustering, noting:
  - There is no simple relationship between these
    two phasing and clustering methods, although
    there are common features in the results.
12. Cell cycle: 24 experiments of cdc15 temperature-sensitive
mutant
[Figure panel: Hierarchical clustering]
13. 24 experiments of cdc15 temperature-sensitive
mutant
14. 59 experiments, combining cdc15, cdc28 and α-factor
arrest
15. Clustering of the 79 experiments in Eisen's
paper. The numbers to the right of each gene
represent the complex to which it belongs
according to the MIPS database.
[Figure panels: Optimal ordering; Hierarchical clustering]
16. Using optimal ordering to identify the different
clusters. 24 experiments of cdc15 mutant from
Spellman's paper.
17. Clustering Demos
- K-means
  - Demo 1: 3 clusters in data, k = 3
  - Demo 2: 1 cluster in data, k = 2
- Mixture Models
  - Demo 3: 1 cluster in data, k = 2 (same data as Demo 2)
  - Demo 4: 3 clusters in data, k = 2
  - Demo 5: 3 clusters in data, k = 3
- I'll put the code on the web site
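The demo code itself is not reproduced here. As a stand-in, the following is a minimal pure-Python sketch of k-means (Lloyd's algorithm) on synthetic 2-D data, set up like Demos 1 and 2. Everything in it is an assumption made for illustration: the Gaussian blobs, the deterministic farthest-first initialization, and the names `kmeans`, `farthest_first` and `blob`.

```python
import random
from typing import List, Tuple

Point = Tuple[float, float]


def dist2(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def farthest_first(points: List[Point], k: int) -> List[Point]:
    """Deterministic initialization (an assumption, chosen for repeatability)."""
    centers = [points[0]]
    while len(centers) < k:
        # Pick the point whose nearest existing center is farthest away.
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    return centers


def kmeans(points: List[Point], k: int, max_iter: int = 100):
    """Lloyd's algorithm: alternate assignment and mean-update steps."""
    centers = farthest_first(points, k)
    labels: List[int] = []
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: dist2(p, centers[j])) for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centers, labels


rng = random.Random(0)


def blob(cx: float, cy: float, n: int) -> List[Point]:
    """n points drawn from a unit-variance Gaussian centered at (cx, cy)."""
    return [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)) for _ in range(n)]


# Like Demo 1: 3 well-separated clusters in the data, k = 3.
three = blob(0, 0, 30) + blob(20, 0, 30) + blob(0, 20, 30)
centers3, labels3 = kmeans(three, k=3)
sizes3 = sorted(labels3.count(j) for j in range(3))
print(sizes3)

# Like Demo 2: 1 cluster in the data, k = 2.
one = blob(0, 0, 60)
centers2, labels2 = kmeans(one, k=2)
print([labels2.count(j) for j in range(2)])
```

The second run illustrates the point of Demo 2: k-means always returns k groups, so it splits a single cluster in two even though the data give no reason to.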
18. Clustering Summary
- Clustering allows you to organize data and see
  patterns
- Can reduce the dimensionality of data (e.g. PCA)
- Ultimately, we would like to use clusters to
  explain biological phenomena
- The first step is classification