Title: Examples of Clustering Biological Data
1. Examples of Clustering Biological Data
- 6.892 / 7.93 Spring Term 2002
- March 7, 2002
- David Gifford
2. [Figure: binary tree over leaves 1-5]
3. [Figure: binary tree over leaves 1-5]
4.
- For $n$ leaves there are $n-1$ internal nodes
- Each flip of an internal node creates a new linear ordering
- There are $2^{n-1}$ possible linear orderings of the leaves of the tree
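The counting argument can be checked by brute force on a small tree. The sketch below is illustrative only: the nested-tuple tree representation and the 5-leaf tree shape are assumptions, not the tree drawn on the slides.

```python
from itertools import product

def leaf_orderings(tree):
    """Return every leaf ordering reachable by flipping internal nodes.

    A tree is either a leaf label or a 2-tuple (left, right) of subtrees.
    """
    if not isinstance(tree, tuple):
        return [[tree]]
    left, right = tree
    orderings = []
    for a, b in product(leaf_orderings(left), leaf_orderings(right)):
        orderings.append(a + b)  # node kept as drawn
        orderings.append(b + a)  # node flipped
    return orderings

# Hypothetical 5-leaf tree (4 internal nodes).
tree = (((1, 2), 3), (4, 5))
orders = leaf_orderings(tree)
print(len(orders))  # 2 ** (5 - 1) = 16
```

Each internal node contributes one independent binary choice, which is where the exponent $n-1$ comes from.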
5. Cluster analysis and display of genome-wide expression patterns
- Eisen, Spellman, Brown, Botstein, PNAS 95, pp. 14863-14868, December 1998.
- Clustered S. cerevisiae and human data to find genes of similar function
- Goal was to characterize genes of unknown function and uncover patterns that can be interpreted as indications of the status of cellular processes
6. Clustering algorithm overview
- Pairwise average-linkage cluster analysis
- Form of hierarchical clustering
- Also used in sequence and phylogenetic analysis
- Tree created where branch lengths reflect degree of similarity
- Pairwise similarity function is key to algorithm function
- Similarity function used is the normalized vector dot product (correlation coefficient)
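A minimal sketch of this kind of clustering, assuming the standard definition of average linkage (cluster similarity is the mean pairwise similarity of members) and the plain Pearson correlation. The published method may differ in details, and the gene names and profiles below are made up for illustration.

```python
import math

def correlation(x, y):
    """Pearson correlation: dot product of mean-centered, length-normalized vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    norm = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return sum(a * b for a, b in zip(dx, dy)) / norm if norm else 0.0

def average_linkage(profiles):
    """Build a binary tree (nested tuples of names) by repeatedly joining
    the two clusters with the highest mean pairwise correlation."""
    sim = {(a, b): correlation(profiles[a], profiles[b])
           for a in profiles for b in profiles}
    # Each cluster is (tree, member names).
    clusters = [(name, [name]) for name in profiles]

    def cluster_sim(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(sim[p] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        candidates = [(i, j) for i in range(len(clusters))
                      for j in range(i + 1, len(clusters))]
        i, j = max(candidates,
                   key=lambda ij: cluster_sim(clusters[ij[0]], clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

# Hypothetical expression profiles: two rising genes and two falling genes.
profiles = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8.5],
    "g3": [4, 3, 2, 1],
    "g4": [8, 6, 4, 1],
}
tree = average_linkage(profiles)
print(tree)
```

The similarity function is passed through unchanged from leaves to clusters, so swapping in a different measure only requires replacing `correlation`.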
7. Clustering result interpretation
- Order the tree using a heuristic
- Permits biologists to assimilate and explore data in a natural and intuitive manner (soft clustering)
- Display colors (log ratio)
- 0: black
- > 0: red of increasing intensity
- < 0: green of increasing intensity
- Dendrogram attached to display
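The color scheme can be sketched as a small mapping from log ratio to an RGB triple. The `saturation` parameter (the log ratio at which the color reaches full intensity) is an assumption introduced here; the slides do not give a value.

```python
def log_ratio_to_rgb(log_ratio, saturation=3.0):
    """Map a log ratio to an (R, G, B) tuple with channels in 0..255.

    `saturation` is a hypothetical cutoff: values at or beyond it
    are drawn at full intensity.
    """
    intensity = min(abs(log_ratio) / saturation, 1.0)
    level = round(255 * intensity)
    if log_ratio > 0:
        return (level, 0, 0)   # induced: red
    if log_ratio < 0:
        return (0, level, 0)   # repressed: green
    return (0, 0, 0)           # unchanged: black

for value in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(value, log_ratio_to_rgb(value))
```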
8. Sources of Experimental Data
- Printed array technology
- Essentially every ORF in S. cerevisiae represented
- Human: 8600 distinct transcripts represented
- Experimental Conditions
- Diauxic shift
- Mitotic cell division cycle
- Sporulation
- Temperature and reducing shocks
- Human serum starvation and stimulation
9. Serum stimulation, human fibroblasts
Samples at t = 0, 15 min, 30 min, 1 hr, 2 hr, 3 hr, 4 hr, 8 hr, 12 hr, 16 hr, 20 hr, 24 hr
(A) cholesterol biosynthesis; (B) cell cycle; (C) immediate-early response; (D) signaling and angiogenesis; (E) wound healing and tissue remodeling
10. 2,467 S. cerevisiae genes
Alpha factor arrest (18); Elutriation (14); CDC15 TS mutant (15); Sporulation (7); High temperature shock (6); Reducing agents (4); Low temperature (4)
I Stress down regulation
11. Randomized data does not cluster
Permute: (R1) within rows, (R2) within columns, (R3) both
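The three randomizations can be sketched as follows, assuming the expression data is held as a list of rows (genes) by columns (experiments). The function names and the toy matrix are hypothetical.

```python
import random

def shuffle_within_rows(matrix, rng):
    """R1: permute each gene's values across experiments."""
    return [rng.sample(row, len(row)) for row in matrix]

def shuffle_within_columns(matrix, rng):
    """R2: permute each experiment's values across genes."""
    columns = [list(col) for col in zip(*matrix)]
    shuffled = [rng.sample(col, len(col)) for col in columns]
    return [list(row) for row in zip(*shuffled)]

def shuffle_both(matrix, rng):
    """R3: apply the row permutation, then the column permutation."""
    return shuffle_within_columns(shuffle_within_rows(matrix, rng), rng)

rng = random.Random(0)  # fixed seed so the example is repeatable
matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
r1 = shuffle_within_rows(matrix, rng)
r2 = shuffle_within_columns(matrix, rng)
r3 = shuffle_both(matrix, rng)
print(r1, r2, r3, sep="\n")
```

Each variant preserves the overall distribution of values while destroying a different part of the structure, so clustering the shuffled matrices serves as a control.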
12. Serum response of human cells
- Iyer, Eisen, et al., Science 283, 1 January 1999
- More detailed analysis of Figure 1 from PNAS paper
- Printed microarrays
- 8600 different genes (as in PNAS 1998)
- Experiment
- Deprive human fibroblasts of serum
- Removes necessary growth factors
- Add fetal bovine serum
- Study time course of response
13. Clustering of 512 genes whose expression changed substantially in response to serum addition
14. Serum response is rapid
- Immediate response dominated by transcription factors involved in signal transduction
- c-FOS, JUN B, mitogen-activated protein (MAP) kinase phosphatase-1 (MKP1) all detected within 15 minutes after stimulation
- Discovered over 200 previously uncharacterized genes that responded to serum stimulation
15. 12 clusters explored
- Signal transduction; immediate-early transcription factors; other transcription factors; cell cycle and proliferation; coagulation and hemostasis; inflammation; angiogenesis; tissue remodeling; cytoskeletal reorganization; re-epithelialization; unidentified role in wound healing; cholesterol biosynthesis
16. Cell cycle inhibition gene cluster
Hits lows 6-12 hours after stimulation
Coincident with fibroblast passage into G1
17. Comprehensive Identification of Cell Cycle-regulated Genes of S. cerevisiae by Microarray Hybridization
- Spellman, Sherlock, Zhang, Iyer, Anders, Eisen, Brown, Botstein, Futcher
- Molecular Biology of the Cell 9, Issue 12, 3273-3297, December 1998
18. Three types of synchronization used
- Alpha factor arrest
- Added for 120 minutes and then removed
- Elutriation
- Size based synchronization
- Arrest of a cdc15 temperature-sensitive mutant
19. Results
- Identified 800 genes that met defined criteria for cell cycle regulation
- More than half of these 800 genes respond to either the G1 cyclin Cln3p or the B-type cyclin Clb2p
- Analyzed set of cell cycle regulated genes for known and new promoter elements
- Several known elements are predictive of cell cycle regulation
20. Clustering Metrics
- CDC score based on
- Compute modified Fourier transform of each gene's expression pattern for the alpha factor experiments (given division time)
- Assign each gene to one of 5 cell cycle classes by correlation with known genes in each of the 5 classes
- Scale Fourier transform by correlation with best class
- Repeat for cdc15 and cdc28 experiments, calculating phase offset that maximizes resulting vector
- Add all three vectors together with fudge factors
- Take magnitude of final vector; it represents degree of periodicity at specified division interval
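The Fourier step on its own can be sketched as below. This is a simplified illustration, not the full CDC score: it omits the class correlation, the phase-offset search, and the weighting across experiments. The sampling interval (every 7 minutes, 18 samples) and the 66-minute division time are illustrative assumptions.

```python
import math

def fourier_score(values, times, period):
    """Return (magnitude, phase) of a profile's component at the given period."""
    omega = 2 * math.pi / period
    a = sum(v * math.sin(omega * t) for v, t in zip(values, times))
    b = sum(v * math.cos(omega * t) for v, t in zip(values, times))
    return math.hypot(a, b), math.atan2(a, b)

# Assumed sampling scheme and division time (not taken from the slides).
times = [7 * i for i in range(18)]
period = 66.0

periodic = [math.cos(2 * math.pi * t / period) for t in times]  # cycling gene
flat = [0.0] * len(times)                                       # unchanged gene

magnitude, phase = fourier_score(periodic, times, period)
print(magnitude, phase)
print(fourier_score(flat, times, period)[0])
```

The magnitude measures how strongly a profile oscillates at the chosen period, and sorting genes by the returned phase orders them by time of peak expression within the cycle.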
21. CDC Metric
- Selected threshold CDC score that was exceeded by 91% of known cell-cycle regulated genes
- Yikes! This is quite complicated.
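One way to pick such a threshold, sketched under the assumption that a score is already available for each known cell-cycle regulated gene. The scores below are placeholders, not real CDC scores.

```python
def threshold_for_coverage(known_scores, percent):
    """Largest threshold that at least `percent` % of the known scores meet or exceed."""
    ranked = sorted(known_scores, reverse=True)
    keep = max(1, (percent * len(ranked) + 99) // 100)  # integer ceiling
    return ranked[keep - 1]

# Placeholder scores for 100 hypothetical known genes.
known_scores = list(range(1, 101))
threshold = threshold_for_coverage(known_scores, 91)
covered = sum(score >= threshold for score in known_scores)
print(threshold, covered)  # 10 91
```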
22. CDC aggregate score is independent of the cdc28 data
23. Genes sorted by phase of expression based on Fourier analysis
24. Genes clustered using correlation-based algorithm
CLN2 strongly cell cycle regulated, CLN1, CLN2, CLB6, RNR1, CDC9, CDC21, etc.
25. Analysis of errors
- False negatives caused by noise in data set that hides their signal
- False positives can be caused by random fluctuations
- False positives can be caused by cross-hybridization with true positive
- Issue when sequence is 75% similar
- E.g. DBF2 and DBF20
- Or mRNA for regulated gene contains unregulated gene sequence
26. Can we make this simpler?
- Use unsupervised learning technique
- Goal would be to recover temporal structure
without using Fourier analysis
27. Ordering effect
28. The problem
There are $2^{n-1}$ linear orderings consistent with the structure of the tree. An optimal linear ordering, one that maximizes the similarity of adjacent elements in the ordering, is impractical to compute.
Eisen et al., PNAS 1998
29. Problem definition
Denote by $\Phi$ the space of the possible linear orderings consistent with the tree. Denote by $v_1, \ldots, v_n$ the tree leaves. Our goal is to find an ordering that maximizes the similarity of adjacent elements:
$$\max_{\phi \in \Phi} \sum_{i=1}^{n-1} S\left(v_{\phi(i)}, v_{\phi(i+1)}\right)$$
where $S$ is the similarity matrix.
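The objective for a single ordering is just the sum of similarities between neighbors. A minimal sketch, assuming leaves are integer indices into a symmetric similarity matrix; the matrix values are made up for illustration.

```python
def ordering_score(order, S):
    """Sum of similarities between adjacent leaves in the given ordering."""
    return sum(S[a][b] for a, b in zip(order, order[1:]))

# Hypothetical symmetric similarity matrix for 4 leaves.
S = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.3, 0.4],
    [0.1, 0.3, 1.0, 0.8],
    [0.2, 0.4, 0.8, 1.0],
]
print(ordering_score([0, 1, 2, 3], S))  # 0.9 + 0.3 + 0.8
print(ordering_score([0, 1, 3, 2], S))  # 0.9 + 0.4 + 0.8
```

Flipping the node above leaves 2 and 3 changes only the middle term, which is why different orderings of the same tree can score differently.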
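30. Computing the optimal similarity
Recursively compute the optimal similarity $O_T(u,w)$ for any pair of leaves $(u,w)$ which could be on different corners (leftmost and rightmost) of $T$. For a leaf $u \in T$, $C_T(u)$ is the set of all possible corner leaves of $T$ when $u$ is on one corner of $T$.
[Figure: tree $T$ with subtrees $T_1$ and $T_2$; $u$ and $m$ are the corner leaves of $T_1$, $k$ and $w$ are the corner leaves of $T_2$]
$$O_T(u,w) = \max_{m \in C_{T_1}(u),\; k \in C_{T_2}(w)} \left[ O_{T_1}(u,m) + O_{T_2}(k,w) + S(m,k) \right]$$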
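A compact sketch of this recursion, without any of the later speed-ups. It assumes a nested-tuple tree whose leaves are integer indices into a symmetric similarity matrix `S`. The function names are hypothetical, and storing a full ordering with every table entry is a simplification chosen for readability rather than efficiency.

```python
def optimal_orderings(tree, S):
    """Return {(u, w): (score, order)}: the best ordering of `tree`
    that has leaf u leftmost and leaf w rightmost."""
    if not isinstance(tree, tuple):
        return {(tree, tree): (0.0, [tree])}
    left = optimal_orderings(tree[0], S)
    right = optimal_orderings(tree[1], S)
    best = {}
    # u..m spans the first subtree, k..w spans the second; m and k are adjacent.
    for (u, m), (score_l, order_l) in left.items():
        for (k, w), (score_r, order_r) in right.items():
            total = score_l + S[m][k] + score_r
            if (u, w) not in best or total > best[(u, w)][0]:
                order = order_l + order_r
                best[(u, w)] = (total, order)
                # S is symmetric, so the mirrored ordering has the same score.
                best[(w, u)] = (total, order[::-1])
    return best

def optimal_leaf_order(tree, S):
    """Return (score, order) for the best ordering over all corner pairs."""
    return max(optimal_orderings(tree, S).values(), key=lambda item: item[0])

# Hypothetical symmetric similarity matrix for 4 leaves.
S = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.3, 0.4],
    [0.1, 0.3, 1.0, 0.8],
    [0.2, 0.4, 0.8, 1.0],
]
tree = ((0, 1), (2, 3))
score, order = optimal_leaf_order(tree, S)
print(score, order)
```

Only orderings with the first subtree on the left are enumerated explicitly; the mirrored entries cover the cases where the root itself is flipped.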
31. Improvement
Worst-case time is still $O(n^4)$ but
32. Running time: biological datasets
Results obtained on a 700 MHz Pentium PC with 512 MB memory running Linux
33. Does it help?
Recall the statement we started with: "An optimal linear ordering, one that maximizes the similarity of adjacent elements in the ordering, is impractical to compute."
34. Results: hand-generated data
Input
35. Biological results
- Spellman identified 800 genes as cell cycle regulated in Saccharomyces cerevisiae.
- Genes were assigned to five groups termed G1, S, S/G2, G2/M and M/G1, which approximate the commonly used cell cycle groups in the literature.
- This assignment was performed using a phasing method, which is a supervised classification algorithm.
- In addition to the phasing method, the authors clustered these genes using hierarchical clustering, noting:
- There is no simple relationship between these two phasing and clustering methods, although there are common features in the results.
36. Cell Cycle: 24 experiments of cdc15 temperature-sensitive mutant
Hierarchical clustering
37. 24 experiments of cdc15 temperature-sensitive mutant
38. 59 experiments, combining cdc15, cdc28 and α factor arrest
39. Clustering of the 79 experiments in Eisen's paper. The numbers to the right of each gene represent the complex to which it belongs according to the MIPS database.
Optimal ordering
Hierarchical clustering
40. Using optimal ordering to identify the different clusters. 24 experiments of cdc15 mutant from Spellman's paper.
41. Student Projects
- Ideally groups of four
- Two biologists
- Two computer scientists or quantitative experts
- Pick a research project related to the class
- You will have access to data, Matlab, etc.
- Work product
- Preliminary outline of plans (5 minutes in class)
- Final presentation (30 minutes)
- Talk and slides are graded
42. M/G1 clusters
43. G1 Cluster (A) CLN1 Cluster (B) Y chrom. ends
44. Binding site frequencies
45. Bassett, Eisen, Boguski 1999
S. cerevisiae: 3800 genes, 365 experiments
46. Portion of the dendrogram: mating pathway
47(No Transcript)