Cluster Analysis of Gene Expression Profiles - PowerPoint PPT Presentation

About This Presentation

Title:

Cluster Analysis of Gene Expression Profiles

Description:

Identifying groups of genes that exhibit a similar expression ' ... Systematic determination of genetic network architecture, Nat.Genet., 22, (1999) 281-285. ... – PowerPoint PPT presentation

Number of Views:53

Avg rating:3.0/5.0

Slides: 32

Provided by: mariomed

Learn more at: http://eh3.uc.edu

Category:

more less

Transcript and Presenter's Notes

Title: Cluster Analysis of Gene Expression Profiles

1
Cluster Analysis of Gene Expression Profiles

Identifying groups of genes that exhibit a
similar expression "behavior" across a number of
experimental conditions
Assuming that such "co-expression" will tell us
something about these genes are regulated or even
possibly something about their function
(Functional Genomics)
Using information from multiple genes at a time -
as opposed to the single gene at a time analysis
we did so far
We can also cluster biological samples based on
the expression of some or all of the genes
Example Identifying groups of molecularly
similar tumor
"Molecular phenotyping"
"Unsupervised learning" in the computer science
lingo

2
Cluster analysis

Often a large portion of genes are "not
interesting"
The meaning of the "not interesting" depends on
the context
Possibly we are interested in genes that whose
expression is not constant across all
experimental conditions. To remove
"non-interesting" genes one can apply a
"variation filter".
Various sorts of "filtering" of "non-interesting"
genes generally amounts to performing some kind
of informal statistical testing with a very low
confidence.
For now, we will just play with our data with
some more exciting examples to follow
We have six measurements for each gene and will
try to cluster genes and experimental conditions
using this data

3
Cluster analysis

In our data we will apply the following filter
We will look only at genes for whom there is some
evidence of differential expression between W and
C treatments (p-value lt 0.01in the dye-adjusted
analysis).
For the simplicity, we will also look only on
genes with no flagged observations.
gt DFilterlt-(AnovadBp.value,"W"lt0.01
apply(NNBLimmadataCweights,1,sum)6)
gt CDatalt-NNBLimmadataCMDFilter,
gt dim(CData)
1 153 6
gt range(CData)
1 -4.765045 3.951253

4
Data to Analyse

gt library(limma)
gt library(marray)
gt heatmap(CData,ColvNA,RowvNA,cexCol0.8,cexRow
0.2)
gt maColorBar(seq(-3,3, 0.2), colheat.colors(12),
horizontalFALSE, k7)

5
Clustered Data

gt heatmap(CData,cexCol0.8,cexRow0.2)

6
Displaying it differently for historical reasons

gt pallt-maPalette(low"green", high"red",
mid"black")
gt maColorBar(seq(-3,3, 0.2), colpal,
horizontalFALSE, k7)
gt heatmap(CData,cexCol0.8,cexRow0.2,colpal)

7
Hierarchical Clustering

Calculating the "distance" or "similarity between
each pair of expression profiles
Merging two "closest" profiles, forming a "node"
in the clustering tree and re-calculating the
"distance between such a "sub-cluster" and rest
of the profiles or sub-clusters using on of the
"linkage" principles. Again merge two closest
sub-clusters
Complete linkage - define the distance/similarity
between the two clusters as the maximum/minimum
distance/similarity between pairs of profiles in
which one profile is from the first sub-cluster
and the other profile is from the second
sub-cluster
Average linkage - define the distance/similarity
between the two clusters as the average
distance/similarity between pairs of profiles in
which one profile is from the first sub-cluster
and the other profile is from the second
sub-cluster
Single linkage - define the distance/similarity
between the two clusters as the minimum/maximum
distance/similarity between pairs of profiles in
which one profile is from the first sub-cluster
and the other profile is from the second
sub-cluster

8
Euclidian Distance

R actually operates on distances, so similarities
have to be transformed into distances - usually
straightforward
Euclidian distance

In 2 and 3 dimensions, this is our usual, every
day's distance
gt CDataslt-CData110,
gt EDistanceslt-dist(CDatas,method "euclidean",
diag T, upper T)
gt print(EDistances,digits2)
gt EDistanceslt-dist(CDatas,method "euclidean")
gt print(EDistances,digits2)

9
Distance Matrix

Distance Matrix - whole
1 2 3 4 5 6 7 8 9
10
1 0.00 2.38 1.78 3.02 2.15 1.88 2.07 3.15 0.55
1.23
2 2.38 0.00 0.67 1.19 0.89 0.60 0.45 4.54 2.30
3.08
3 1.78 0.67 0.00 1.41 0.84 0.39 0.50 4.14 1.74
2.54
4 3.02 1.19 1.41 0.00 1.37 1.53 1.50 5.45 3.10
3.89
5 2.15 0.89 0.84 1.37 0.00 0.68 0.90 4.74 2.22
3.03
6 1.88 0.60 0.39 1.53 0.68 0.00 0.33 4.20 1.83
2.60
7 2.07 0.45 0.50 1.50 0.90 0.33 0.00 4.22 1.97
2.69
8 3.15 4.54 4.14 5.45 4.74 4.20 4.22 0.00 2.74
2.23
9 0.55 2.30 1.74 3.10 2.22 1.83 1.97 2.74 0.00
0.99
10 1.23 3.08 2.54 3.89 3.03 2.60 2.69 2.23 0.99
0.00
Distance Matrix - lower triangular
1 2 3 4 5 6 7 8 9
2 2.38
3 1.78 0.67
4 3.02 1.19 1.41
5 2.15 0.89 0.84 1.37

10
Dendrograms - Complete Linkage

gt Clusteringlt-hclust(EDistances,method"complete")
gt plot(Clustering)

Distance Matrix - lower triangular 1 2
3 4 5 6 7 8 9 2 2.38
3 1.78 0.67
4 3.02 1.19 1.41
5 2.15 0.89 0.84
1.37 6 1.88 0.60 0.39
1.53 0.68 7 2.07 0.45 0.50
1.50 0.90 0.33 8 3.15 4.54 4.14
5.45 4.74 4.20 4.22 9 0.55 2.30 1.74
3.10 2.22 1.83 1.97 2.74 10 1.23 3.08 2.54
3.89 3.03 2.60 2.69 2.23 0.99
11
Clustering genes and samples

gt EDistancesSlt-dist(t(CDatas),method
"euclidean")
gt ClusteringSlt-hclust(EDistancesS,method"complete
")
gt heatmap(CDatas,Colvas.dendrogram(ClusteringS),R
owvas.dendrogram(Clustering),cexCol0.8,colpal,c
exRow1)

12
Creating Clusters - "Cutting the Tree"

gt TwoClusterslt-cutree(Clustering,k 2, h NULL)
gt TwoClusters
1 1 2 2 2 2 2 2 1 1 1
gt heatmap(CDatasTwoClusters1,,ColvNA,RowvNA,
cexCol0.8, colpal,cexRow1)

Homework Print names for these genes
13
Clustering by partitioning K-means algorithm

For a pre-specified number of clusters iterate
between calculating cluster "centroides" (i.e.
cluster means) and re-assigning each profile to
the cluster with the closest "centroid"

t1st iteration

iterate until ct1ct

14
Questions

How many clusters there are in the data?
What is the statistical significance of a
clustering?
What is a confidence in assigning any particular
expression profile to any particular cluster?
Difficult questions, particularly difficult to
answer when using heuristic methods like
hierarchical clustering and k-means
Need statistical models

15
Two genes at a time

Are these two genes co-expressed?
By looking at their expression patterns alone,
combined with the null distribution of the
similarity measure in non-co-expressed genes, we
could conclude that this is the case.

16
Another look

What if we knew that there are two and only two
distinct patterns in the data and we know how
they look (thick dashed lines)?
Given this additional information we are likely
to conclude that our two genes actually have
different patterns of expression.

17
Many genes at a time

Simultaneous detection of patterns of
expression defined by groups of expression
profiles and assignment of individual expression
profiles to appropriate patterns.
By looking at all genes at the same time, we
came up with a completely different conclusion
than when looking at only two of them.
Questions How many clusters? How confident are
we in the number of clusters in the data? How
confident are we that our two genes belong to two
different clusters? Is such a confidence
statement taking into account the uncertainty
about the true number of clusters?

18
Gene-specific normalization of the data
19
Clustering using non-normalized data
K-means
Euclidian Distance
Pearson's correlation
20
Clustering using normalized data
K-means
Euclidian Distance
Pearson's correlation
21
Why do we cluster?

Guilt by association

Co-expression
Co-regulation
Functional relationship
Assigning function to genes
22
Dissecting the gene expression regulatory
mechanisms
S.Tavazoie, J.D.Hughes, M.J.Campbell, R.J.Cho,
G.M.Church. Systematic determination of genetic
network architecture, Nat.Genet., 22, (1999)
281-285.
23
Multi-way Anova

Identifying and quantifying sources of variation
Ability to "factor out" certain sources -
("adjusting")
For the beginning, we will reproduce paired
t-test results by assuming that arrays are one of
the factors in a Two-way ANOVA
Second, adjusting for the dye effects in a
Three-way ANOVA
Third, four and more - way ANOVA when having
multiple factors of interest

24
Multi-way Anova

Identifying and quantifying sources of variation
Ability to "factor out" certain sources -
("adjusting")
For the beginning, we will reproduce paired
t-test results by assuming that arrays are one of
the factors in a Two-way ANOVA
Second, adjusting for the dye effects in a
Three-way ANOVA
Third, four and more - way ANOVA when having
multiple factors of interest

25
Sources of variation And the Two-way ANOVA
26
Statistical Inference in Two-way ANOVA
27
Alternative Formulations of the Two-way ANOVA

Parameter Estimates
Gets complicated
Regardless of how the model is parametrized
certain parametersremain unchanged (Trt2-Trt1)
In this sense all formulations are equivalent

28
Re-organizing data for ANOVA
gt DAnovalt-NNBLimmadataC gt DAnovaweightslt-cbind(DA
novaweights,DAnovaweights) gt DAnovaMlt-cbind((DA
novaADAnovaM/2),DAnovaA-(DAnovaM/2)) gt
DAnovaAlt-cbind(DAnovaA,DAnovaA) gt
attributes(DAnova) names 1 "weights" "targets"
"genes" "printer" "M" "A"
class 1 "MAList" attr(,"package") 1 "limma"
gt NNBLimmadataCM1, 51-C1-3-vs-W1-5
60-W2-3-vs-C2-5 72-C3-3-vs-W3-5 79-W4-3-vs-C4-5
82-C5-3-vs-W5-5 97-W6-3-vs-C6-5 -0.05280598
-0.15767422 0.40130216 -0.35292771
-0.22061576 -0.21653047 gt
NNBLimmadataCA1, 51-C1-3-vs-W1-5
60-W2-3-vs-C2-5 72-C3-3-vs-W3-5 79-W4-3-vs-C4-5
82-C5-3-vs-W5-5 97-W6-3-vs-C6-5 6.289658
6.129577 6.483613 6.452317
6.510143 7.106134 gt
DAnovaM1, 51-C1-3-vs-W1-5 60-W2-3-vs-C2-5
72-C3-3-vs-W3-5 79-W4-3-vs-C4-5 82-C5-3-vs-W5-5
97-W6-3-vs-C6-5 6.263255 6.050740
6.684264 6.275853 6.399835
6.997869 51-C1-3-vs-W1-5 60-W2-3-vs-C2-5
72-C3-3-vs-W3-5 79-W4-3-vs-C4-5 82-C5-3-vs-W5-5
97-W6-3-vs-C6-5 6.316061 6.208414
6.282962 6.628781 6.620451
7.214399
29
Setting-up the design matrix for limma
gt targets SlideNumber
FileName Cy3 Cy5 Date 51-C1-3-vs-W1-5
51 51-C1-3-vs-W1-5.gpr C W
11/8/2004 60-W2-3-vs-C2-5 60
60-W2-3-vs-C2-5.gpr W C 11/8/2004 72-C3-3-vs-W
3-5 72 72-C3-3-vs-W3-5.gpr C W
11/8/2004 79-W4-3-vs-C4-5 79
79-W4-3-vs-C4-5.gpr W C 11/8/2004 82-C5-3-vs-W
5-5 82 82-C5-3-vs-W5-5.gpr C W
11/8/2004 97-W6-3-vs-C6-5 97
97-W6-3-vs-C6-5.gpr W C 11/8/2004 gt
trtlt-c(targetsCy5,targetsCy3) gt trt 1 "W"
"C" "W" "C" "W" "C" "C" "W" "C" "W" "C" "W" gt
arraylt-c(16,16) gt array 1 1 2 3 4 5 6 1 2 3
4 5 6
30
Setting-up the design matrix for limma
gt designalt-model.matrix(-1factor(array)factor(t
rt)) gt designa factor(array)1 factor(array)2
factor(array)3 factor(array)4 factor(array)5
factor(array)6 factor(trt)W 1 1
0 0 0
0 0 1 2
0 1 0 0
0 0 0 3
0 0 1
0 0 0 1 4
0 0 0
1 0 0
0 5 0 0 0
0 1 0
1 6 0 0
0 0 0
1 0 7 1 0
0 0 0
0 0 8 0
1 0 0 0
0 1 9 0
0 1 0
0 0 0 10
0 0 0 1
0 0 1 11
0 0 0
0 1 0 0 12
0 0 0
0 0 1
1 attr(,"assign") 1 1 1 1 1 1 1
2 attr(,"contrasts") attr(,"contrasts")"factor(ar
ray)" 1 "contr.treatment" attr(,"contrasts")"f
actor(trt)" 1 "contr.treatment"
31
Setting-up the design matrix for limma
gt colnames(designa)lt-c(paste("A",16,sep""),"W")
gt designa A1 A2 A3 A4 A5 A6 W 1 1 0 0 0
0 0 1 2 0 1 0 0 0 0 0 3 0 0 1 0 0
0 1 4 0 0 0 1 0 0 0 5 0 0 0 0 1 0
1 6 0 0 0 0 0 1 0 7 1 0 0 0 0 0 0 8
0 1 0 0 0 0 1 9 0 0 1 0 0 0 0 10 0
0 0 1 0 0 1 11 0 0 0 0 1 0 0 12 0 0
0 0 0 1 1 attr(,"assign") 1 1 1 1 1 1 1
2 attr(,"contrasts") attr(,"contrasts")"factor(ar
ray)" 1 "contr.treatment" attr(,"contrasts")"f
actor(trt)" 1 "contr.treatment"

Write a Comment

User Comments (0)