Examples of Clustering Biological Data

1
Examples of Clustering Biological Data
  • 6.892 / 7.93 Spring Term 2002
  • March 7, 2002
  • David Gifford

2
(Figure: tree with leaves labeled 3, 1, 4, 5, 2)
3
(Figure: tree with leaves labeled 3, 1, 4, 5, 2)
4
  • For n leaves there are n-1 internal nodes
  • Each flip of an internal node creates a new
    linear ordering
  • There are 2^(n-1) possible linear orderings of the
    leaves of the tree
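The count above can be checked by brute force on a small tree. The sketch below is illustrative only: the nested-tuple tree encoding, the function name `leaf_orderings`, and the particular five-leaf tree are assumptions, not taken from the slides.

```python
from itertools import product

def leaf_orderings(tree):
    """Return every leaf order reachable by flipping internal nodes.

    A tree is either a leaf label or a 2-tuple (left, right) of subtrees.
    """
    if not isinstance(tree, tuple):
        return [[tree]]
    left, right = tree
    orders = []
    for a, b in product(leaf_orderings(left), leaf_orderings(right)):
        orders.append(a + b)  # children in their original order
        orders.append(b + a)  # this internal node flipped
    return orders

# Hypothetical tree with n = 5 leaves and n - 1 = 4 internal nodes.
tree = ((3, 1), ((4, 5), 2))
orders = leaf_orderings(tree)
print(len(orders), 2 ** (5 - 1))
```

Each internal node contributes one independent binary choice, which is where the factor of 2 per internal node comes from.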

5
Cluster analysis and display of genome-wide
expression patterns
  • Eisen, Spellman, Brown, Botstein, PNAS 95, pp.
    14863-14868, December 1998.
  • Clustered S. cerevisiae and human data to find
    genes of similar function
  • Goal was to characterize genes of unknown
    function and uncover patterns that can be
    interpreted as indications of the status of
    cellular processes

6
Clustering algorithm overview
  • Pairwise average-linkage cluster analysis
  • Form of hierarchical clustering
  • Also used in sequence and phylogenetic analysis
  • Tree created where branch lengths reflect degree
    of similarity
  • Pairwise similarity function is key to algorithm
    function
  • Similarity function used is the normalized vector
    dot product (correlation coefficient)
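A minimal sketch of the idea on this slide, under stated assumptions: similarity is taken to be the centered (Pearson) correlation, and the similarity between two clusters is the mean of all pairwise gene similarities. The paper's own implementation may differ in both respects; the function names and the toy profiles are hypothetical.

```python
import math

def correlation(x, y):
    """Pearson correlation: dot product of mean-centered, length-normalized vectors.

    Assumes neither vector is constant (otherwise the norm is zero).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    norm = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return sum(a * b for a, b in zip(dx, dy)) / norm

def average_linkage(profiles):
    """Agglomerative clustering; returns a nested-tuple tree over gene names."""
    # Each cluster is (subtree, list of member gene names).
    clusters = [(name, [name]) for name in profiles]

    def similarity(c1, c2):
        total = sum(correlation(profiles[a], profiles[b])
                    for a in c1[1] for b in c2[1])
        return total / (len(c1[1]) * len(c2[1]))

    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        # Merge the two most similar clusters.
        i, j = max(pairs, key=lambda p: similarity(clusters[p[0]], clusters[p[1]]))
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

# Made-up expression profiles: two rising genes and two falling genes.
profiles = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8.5],
    "g3": [4, 3, 2, 1],
    "g4": [8, 6, 4.5, 2],
}
tree = average_linkage(profiles)
print(tree)
```

This naive version recomputes similarities at every merge, which is fine for a handful of genes but far too slow for thousands.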

7
Clustering result interpretation
  • Order the tree using a heuristic
  • Permits biologists to assimilate and explore data
    in a natural and intuitive manner (soft
    clustering)
  • Display colors (log ratio)
  • = 0: black
  • > 0: red of increasing intensity
  • < 0: green of increasing intensity
  • Dendrogram attached to display

8
Sources of Experimental Data
  • Printed array technology
  • Essentially every ORF in S. cerevisiae
    represented
  • Human: 8600 distinct transcripts represented
  • Experimental Conditions
  • Diauxic shift
  • Mitotic cell division cycle
  • Sporulation
  • Temperature and reducing shocks
  • Human serum starvation and stimulation

9
Serum stimulation, human fibroblasts
Samples at t = 0, 15 min, 30 min, 1 hr, 2 hr, 3 hr,
4 hr, 8 hr, 12 hr, 16 hr, 20 hr, 24 hr
Clusters: (A) cholesterol biosynthesis, (B) cell cycle,
(C) immediate-early response, (D) signaling and
angiogenesis, (E) wound healing and tissue
remodeling
10
2,467 S. cerevisiae genes
  • Alpha factor arrest (18)
  • Elutriation (14)
  • CDC15 TS mutant (15)
  • Sporulation (7)
  • High temperature shock (6)
  • Reducing agents (4)
  • Low temperature (4)
I Stress down regulation
11
Randomized data does not cluster
Permute: R1 within rows, R2 within columns, R3
both
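The three randomizations can be sketched as follows. This is an assumption-laden illustration: the function names are hypothetical, and "both" is read here as shuffling within rows and then within columns, which may not match the exact procedure used for the figure.

```python
import random

def permute_within_rows(matrix, rng):
    """Shuffle each gene's values across experiments (R1)."""
    return [rng.sample(row, len(row)) for row in matrix]

def permute_within_columns(matrix, rng):
    """Shuffle each experiment's values across genes (R2)."""
    columns = [rng.sample(list(col), len(col)) for col in zip(*matrix)]
    return [list(row) for row in zip(*columns)]

def permute_both(matrix, rng):
    """Shuffle within rows, then within columns (R3, as interpreted here)."""
    return permute_within_columns(permute_within_rows(matrix, rng), rng)

# Made-up 3 genes x 3 experiments matrix.
data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
rng = random.Random(0)  # seeded so the shuffles are repeatable
print(permute_within_rows(data, rng))
print(permute_within_columns(data, rng))
print(permute_both(data, rng))
```

Clustering the permuted matrices with the same algorithm gives a baseline: structure that survives permutation is not evidence of co-regulation.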
12
Serum response of human cells
  • Iyer, Eisen, et al., Science 283, 1 January 1999
  • More detailed analysis of Figure 1 from PNAS
    paper
  • Printed microarrays
  • 8600 different genes (as in PNAS 1998)
  • Experiment
  • Deprive human fibroblasts of serum
  • Removes necessary growth factors
  • Add fetal bovine serum
  • Study time course of response

13
Clustering of 512 genes whose expression changed
substantially in response to serum addition
14
Serum response is rapid
  • Immediate response dominated by transcription
    factors involved in signal transduction
  • c-FOS, JUN B, mitogen-activated protein (MAP)
    kinase phosphatase-1 (MKP1) all detected within
    15 minutes after stimulation
  • Discovered over 200 previously uncharacterized
    genes that responded to serum stimulation

15
12 clusters explored
  • Signal transduction
  • Immediate-early transcription factors
  • Other transcription factors
  • Cell cycle and proliferation
  • Coagulation and hemostasis
  • Inflammation
  • Angiogenesis
  • Tissue remodeling
  • Cytoskeletal reorganization
  • Re-epithelialization
  • Unidentified role in wound healing
  • Cholesterol biosynthesis

16
Cell cycle inhibition gene cluster
Hit lows 6-12 hours after stimulation
Coincident with fibroblast passage into G1
17
Comprehensive Identification of Cell
Cycle-regulated Genes of S. cerevisiae by
Microarray Hybridization
  • Spellman, Sherlock, Zhang, Iyer, Anders, Eisen,
    Brown, Botstein, Futcher
  • Molecular Biology of the Cell 9, Issue 12,
    3273-3297, December 1998

18
Three types of synchronization used
  • Alpha factor arrest
  • Added for 120 minutes and then removed
  • Elutriation
  • Size based synchronization
  • Arrest of a cdc15 temperature-sensitive mutant

19
Results
  • Identified 800 genes that met defined criteria
    for cell cycle regulation
  • More than half of these 800 genes respond to
    either the G1 cyclin Cln3p or the B-type cyclin
    Clb2p
  • Analyzed set of cell cycle regulated genes for
    known and new promoter elements
  • Several known elements are predictive of cell
    cycle regulation

20
Clustering Metrics
  • CDC score based on
  • Compute modified Fourier transform of each gene's
    expression pattern for the alpha factor
    experiments (given division time)
  • Assign each gene to one of 5 cell cycle classes
    by correlation with known genes in each of the 5
    classes
  • Scale Fourier transform by correlation with best
    class
  • Repeat for cdc15 and cdc28 experiments,
    calculating phase offset that maximizes resulting
    vector
  • Add all three vectors together with fudge factors
  • Take magnitude of final vector, which represents degree
    of periodicity at specified division interval
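The first step of the list above (a Fourier component at the known division period) can be illustrated in isolation. This sketch is not the CDC score: it omits the class correlation, the phase offsets across experiments, and the weighting factors. The function name and the toy time series are assumptions.

```python
import math

def periodicity_score(values, times, period):
    """Magnitude of the Fourier component of a time series at one period."""
    omega = 2 * math.pi / period
    a = sum(v * math.sin(omega * t) for v, t in zip(values, times))
    b = sum(v * math.cos(omega * t) for v, t in zip(values, times))
    return math.hypot(a, b)

# Made-up sampling: every 5 minutes for 120 minutes, 60-minute division time.
times = list(range(0, 120, 5))
periodic = [math.sin(2 * math.pi * t / 60) for t in times]
flat = [1.0] * len(times)

print(periodicity_score(periodic, times, 60))  # large for a cycling gene
print(periodicity_score(flat, times, 60))      # near zero for a constant gene
```

The angle `math.atan2(a, b)` of the same two sums gives a phase, which is the kind of quantity a phase-based ordering of genes would sort on.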

21
CDC Metric
  • Selected threshold CDC score that was exceeded by
    91% of known cell-cycle regulated genes
  • Yikes! This is quite complicated.

22
CDC aggregate score is independent of the cdc28
data
23
Genes sorted by phase of expression based on
Fourier analysis
24
Genes clustered using correlation-based algorithm
CLN2 strongly cell cycle regulated,
CLN1, CLN2, CLB6, RNR1, CDC9, CDC21, etc.
25
Analysis of errors
  • False negatives caused by noise in data set that
    hides their signal
  • False positives can be caused by random
    fluctuations
  • False positives can be caused by
    cross-hybridization with true positive
  • Issue when sequence is 75% similar
  • E.g. DBF2 and DBF20
  • Or mRNA for regulated gene contains unregulated
    gene sequence

26
Can we make this simpler?
  • Use unsupervised learning technique
  • Goal would be to recover temporal structure
    without using Fourier analysis

27
Ordering effect
28
The problem
There are 2^(n-1) linear orderings consistent with
the structure of the tree. An optimal linear
ordering, one that maximizes the similarity of
adjacent elements in the ordering, is impractical
to compute.
Eisen et al., PNAS 1998
29
Problem definition
Denote by $\Phi$ the space of the possible linear
orderings consistent with the tree. Denote by
$v_1, \ldots, v_n$ the tree leaves. Our goal is to find an
ordering $\phi \in \Phi$ that maximizes the similarity of
adjacent elements:
$$\max_{\phi \in \Phi} \sum_{i=1}^{n-1} S(v_{\phi_i}, v_{\phi_{i+1}})$$
where S is the similarity matrix
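As a small illustration of the objective, the score of one candidate ordering is the sum of similarities between neighbors. The function name `ordering_score` and the 3 x 3 similarity matrix are made up for this sketch.

```python
def ordering_score(order, S):
    """Sum of similarities between adjacent leaves in a linear ordering."""
    return sum(S[a][b] for a, b in zip(order, order[1:]))

# Hypothetical symmetric similarity matrix for three leaves.
S = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.5],
    [0.1, 0.5, 1.0],
]
print(ordering_score([0, 1, 2], S))
print(ordering_score([1, 0, 2], S))
```

Only orderings consistent with the tree are candidates, so the maximization is over flips of internal nodes, not over all permutations.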
30
Computing the optimal similarity
Recursively compute the optimal similarity
$O_T(u,w)$ for any pair of leaves $(u,w)$ which could
be on different corners (leftmost and rightmost)
of $T$. For a leaf $u \in T$, $C_T(u)$ is the set of all
possible corner leaves of $T$ when $u$ is on one
corner of $T$.
(Figure: tree $T$ with subtrees $T_1$ and $T_2$; leaves $u$ and $m$ in $T_1$, leaves $k$ and $w$ in $T_2$)
$$O_T(u,w) = \max_{m \in C_{T_1}(u),\; k \in C_{T_2}(w)} O_{T_1}(u,m) + O_{T_2}(k,w) + S(m,k)$$
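The recursion can be sketched directly as a dynamic program over subtrees. This is a naive version under stated assumptions: trees are nested 2-tuples, `S` is a symmetric nested dict, and each table entry also stores its ordering for readability, which costs extra memory. The names `optimal_ordering` and `symmetric` and the toy similarities are hypothetical.

```python
def optimal_ordering(tree, S):
    """Return (best score, best leaf order) over all flips of a binary tree."""
    def solve(t):
        # Maps (leftmost leaf, rightmost leaf) -> (best score, leaf order).
        if not isinstance(t, tuple):
            return {(t, t): (0.0, [t])}
        left, right = solve(t[0]), solve(t[1])
        best = {}
        # Either child may end up on the left side of the ordering.
        for first, second in ((left, right), (right, left)):
            for (u, m), (s1, o1) in first.items():
                for (k, w), (s2, o2) in second.items():
                    score = s1 + s2 + S[m][k]  # m and k become neighbors
                    if (u, w) not in best or score > best[(u, w)][0]:
                        best[(u, w)] = (score, o1 + o2)
        return best
    return max(solve(tree).values(), key=lambda entry: entry[0])

def symmetric(pairs):
    """Build a symmetric nested dict from {(a, b): similarity}."""
    S = {}
    for (a, b), value in pairs.items():
        S.setdefault(a, {})[b] = value
        S.setdefault(b, {})[a] = value
    return S

# Made-up similarities for a four-leaf tree.
S = symmetric({("a", "b"): 1.0, ("c", "d"): 1.0, ("b", "c"): 0.1,
               ("a", "c"): 0.9, ("a", "d"): 0.2, ("b", "d"): 0.3})
score, order = optimal_ordering((("a", "b"), ("c", "d")), S)
print(score, order)
```

In this example the clustering tree is unchanged; only the flip that places "a" next to "c" is chosen, because that junction has the highest similarity.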
31
Improvement
Worst-case time is still O(n^4) but
32
Running time biological datasets
Results obtained on a 700 MHz Pentium PC with 512 MB
memory running Linux
33
Does it help?
Recall the statement we started with - An
optimal linear ordering, one that maximizes the
similarity of adjacent elements in the
ordering, is impractical to compute.

34
Results: hand-generated data
Input
35
Biological results
  • Spellman identified 800 genes as cell cycle
    regulated in Saccharomyces cerevisiae.
  • Genes were assigned to five groups termed
    G1, S, S/G2, G2/M and M/G1, which approximate the
    commonly used cell cycle groups in the
    literature.
  • This assignment was performed using a phasing
    method which is a supervised classification
    algorithm.
  • In addition to the phasing method, the authors
    clustered these genes using hierarchical
    clustering, noting:
  • There is no simple relationship between these
    two phasing and clustering methods, although
    there are common features in the results.

36
Cell cycle: 24 experiments of cdc15
temperature-sensitive mutant
Hierarchical clustering
37
24 experiments of cdc15 temperature sensitive
mutant
38
59 experiments, combining cdc15, cdc28 and α
factor arrest
39
Clustering of the 79 experiments in Eisen's
paper. The numbers to the right of each gene
represent the complex to which it belongs
according to the MIPS database.
Optimal ordering
Hierarchical clustering
40
Using optimal ordering to identify the different
clusters. 24 experiments of cdc15 mutant from
Spellman's paper.
41
Student Projects
  • Ideally groups of four
  • Two biologists
  • Two computer scientists or quantitative experts
  • Pick a research project related to the class
  • You will have access to data, Matlab, etc.
  • Work product
  • Preliminary outline of plans (5 minutes in class)
  • Final presentation (30 minutes)
  • Talk and slides are graded

42
M/G1 clusters
43
G1 Cluster(A) CLN1 Cluster (B) Y chrom. ends
44
Binding site frequencies
45
Bassett, Eisen, Boguski 1999
S. cerevisiae: 3800 genes, 365 experiments
46
Portion of the dendrogram: mating pathway
47
(No Transcript)