Title: Making Sense of Complicated Microarray Data, Part II: Gene Clustering and Data Analysis
1Making Sense of Complicated Microarray Data, Part II: Gene Clustering and Data Analysis
- Gabriel Eichler
- Boston University
- Some slides adapted from MeV documentation slides
2Why Cluster?
- Clustering is a process by which you can explore your data in an efficient manner.
- Visualization of data can help you review the data quality.
- Assumption: "Guilt by association": similar gene expression patterns may indicate a biological relationship.
3Expression Vectors
- Gene Expression Vectors encapsulate the
expression of a gene over a set of experimental
conditions or sample types.
[Figure: one expression vector (-0.8, 1.5, 1.8, 0.5, -1.3, -0.4, 1.5, 0.8) shown three ways: Numeric Vector, Line Graph, Heatmap]
4Expression Vectors As Points in Expression Space
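      t 1    t 2    t 3
G1   -0.8   -0.3   -0.7
G2   -0.8   -0.7   -0.4
G3   -0.4   -0.6   -0.8
G4    0.9    1.2    1.3
G5    1.3    0.9   -0.6
Annotation on the slide: Similar Expression (placed among the rows G1 to G3)
[Figure: the genes plotted as points in a space with axes Experiment 1, Experiment 2, and Experiment 3]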
5Distance and Similarity
- The ability to calculate a distance (or similarity, its inverse) between two expression vectors is fundamental to clustering algorithms.
- Distance between vectors is the basis upon which decisions are made when grouping similar patterns of expression.
- Selection of a distance metric defines the concept of distance.
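6Distance: a measure of similarity between gene expression.
- Some distances (MeV provides 11 metrics)
- Euclidean: $d(A, B) = \sqrt{\sum_{i=1}^{n} (x_{iA} - x_{iB})^2}$, summing over the n experimental conditions
- Pearson correlation
[Figure: two points labeled p0 and p1]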
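The two metrics named on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not MeV's implementation: the function names are hypothetical, and turning Pearson correlation into a distance as `1 - r` is one common convention assumed here. The gene values come from the "Expression Vectors As Points in Expression Space" slide.

```python
import math

def euclidean(a, b):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_distance(a, b):
    """Assumed convention: distance = 1 - Pearson correlation coefficient."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)

# Expression vectors over t1, t2, t3, taken from the earlier slide.
genes = {
    "G1": [-0.8, -0.3, -0.7],
    "G2": [-0.8, -0.7, -0.4],
    "G3": [-0.4, -0.6, -0.8],
    "G4": [0.9, 1.2, 1.3],
    "G5": [1.3, 0.9, -0.6],
}

d12 = euclidean(genes["G1"], genes["G2"])  # two similar profiles
d14 = euclidean(genes["G1"], genes["G4"])  # two dissimilar profiles
print(round(d12, 3), round(d14, 3))

# The same shape at twice the amplitude: far apart by Euclidean distance,
# but identical by Pearson distance.
profile = [1.0, 2.0, 3.0]
scaled = [2.0, 4.0, 6.0]
print(round(euclidean(profile, scaled), 3), round(pearson_distance(profile, scaled), 3))
```

The last comparison shows why the choice of metric matters: Euclidean distance is sensitive to magnitude, while a correlation-based distance only looks at the shape of the profile.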
7Clustering Algorithms
8Clustering Algorithms
- Be wary: confounding computational artifacts are associated with all clustering algorithms.
- You should always understand the basic concepts behind an algorithm before using it.
- Anything will cluster! Garbage in means garbage out.
9Hierarchical Clustering
- IDEA: Iteratively combines genes into groups based on similar patterns of observed expression.
- By combining genes with genes OR genes with groups, the algorithm produces a dendrogram of the hierarchy of relationships.
- Display the data as a heatmap and dendrogram.
- Cluster genes, samples, or both.
(HCL-1)
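The iterative merging described above can be sketched as follows. This is a toy version under stated assumptions, not MeV's HCL code: it uses Euclidean distance with average linkage (the slide does not name a linkage rule), the function names are hypothetical, and the five vectors are reused from the "Expression Vectors As Points in Expression Space" slide.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(c1, c2, data):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(data[i], data[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def hierarchical_cluster(data):
    """Agglomerative clustering; returns the merges in the order they happen.

    Each merge is (left_members, right_members, linkage_distance), which is
    the information a dendrogram is drawn from.
    """
    clusters = [(name,) for name in data]  # start with one cluster per gene
    merges = []
    while len(clusters) > 1:
        # Find the closest pair among the current clusters (gene or group).
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(pair[0], pair[1], data))
        merges.append((a, b, average_linkage(a, b, data)))
        # Replace the pair with a single combined cluster.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

genes = {
    "G1": [-0.8, -0.3, -0.7],
    "G2": [-0.8, -0.7, -0.4],
    "G3": [-0.4, -0.6, -0.8],
    "G4": [0.9, 1.2, 1.3],
    "G5": [1.3, 0.9, -0.6],
}

merges = hierarchical_cluster(genes)
for left, right, dist in merges:
    print(left, "+", right, "at distance", round(dist, 3))
```

With these five vectors, G1 and G2 merge first, G3 then joins that group, and G4 and G5 pair up before the final merge joins the two groups. A different linkage rule or distance metric could change that order.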
10Hierarchical Clustering
11Hierarchical Clustering
12Hierarchical Clustering
13Hierarchical Clustering
14Hierarchical Clustering
15Hierarchical Clustering
16Hierarchical Clustering
17Hierarchical Clustering
18Hierarchical Clustering
[Figure: heatmap with a scale labeled H to L]
19Hierarchical Clustering
[Figure: clustered heatmap with axes labeled Samples and Genes]
The Leaf Ordering Problem
- Find the optimal layout of branches for a given dendrogram architecture
- 2^(N-1) possible orderings of the branches
- For a small microarray dataset of 500 genes there are 1.6E150 branch configurations
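The count on this slide can be checked directly. A binary dendrogram over N leaves has N-1 internal nodes, and each node's two branches can be swapped independently, which gives 2^(N-1) orderings. The helper name below is hypothetical.

```python
def branch_orderings(n_genes: int) -> int:
    """Leaf orderings of a binary dendrogram: one flip choice per internal node."""
    return 2 ** (n_genes - 1)

count = branch_orderings(500)
print(f"{count:.1e}")  # prints 1.6e+150
```

Exhaustively scoring every ordering is therefore out of reach even for small datasets, which is why leaf ordering is treated as an optimization problem in its own right.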
20Hierarchical Clustering
The Leaf Ordering Problem
21Hierarchical Clustering
- Pros
  - Commonly used algorithm
  - Simple and quick to calculate
- Cons
  - Real genes probably do not have a hierarchical organization
22Self-Organizing Maps (SOMs)
Idea: Place genes onto a grid so that genes with similar patterns of expression are placed on nearby squares.
[Figure: grid diagram with labels A, B, C, D and a, b, c, d]
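A bare-bones version of this idea, written as a sketch rather than as the algorithm used by MeV or GEDI: each grid square holds a reference vector, a randomly chosen gene is matched to its closest square, and that square and its grid neighbors are nudged toward the gene. The grid size, learning-rate schedule, neighborhood function, and all names are assumptions made for illustration.

```python
import math
import random

def best_matching_unit(nodes, v):
    """Grid position whose reference vector is closest to v."""
    return min(nodes, key=lambda pos: math.dist(nodes[pos], v))

def train_som(vectors, rows=2, cols=2, epochs=200, seed=0):
    rng = random.Random(seed)
    dim = len(vectors[0])
    # One reference vector per grid square, initialized from random data points.
    nodes = {(r, c): list(rng.choice(vectors))
             for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs
        rate = 0.5 * frac                          # learning rate decays over time
        radius = max(rows, cols) / 2 * frac + 0.5  # neighborhood shrinks over time
        v = rng.choice(vectors)
        best = best_matching_unit(nodes, v)
        for pos, weights in nodes.items():
            grid_dist = math.dist(pos, best)
            if grid_dist <= radius:
                # Squares near the winner on the grid move more than distant ones.
                influence = math.exp(-grid_dist ** 2 / (2 * radius ** 2))
                for k in range(dim):
                    weights[k] += rate * influence * (v[k] - weights[k])
    return nodes

genes = {
    "G1": [-0.8, -0.3, -0.7],
    "G2": [-0.8, -0.7, -0.4],
    "G3": [-0.4, -0.6, -0.8],
    "G4": [0.9, 1.2, 1.3],
    "G5": [1.3, 0.9, -0.6],
}

vectors = list(genes.values())
som = train_som(vectors)
# Place every gene on the square whose reference vector it matches best.
placement = {name: best_matching_unit(som, v) for name, v in genes.items()}
print(placement)
```

Because neighbors on the grid are updated together, squares that are close on the grid tend to end up with similar reference vectors, which is what puts similar genes on nearby squares. The exact placement depends on the random seed and the schedules chosen.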
23Self-Organizing Maps (SOMs)
24Self-organizing Maps (SOMs)
25Self-organizing Maps (SOMs)
26The Gene Expression Dynamics Inspector (GEDI)
Samples (columns) by Genes (rows):
          Group A                 Group B                 Group C
          A1    A2    A3    A4    B1    B2    B3    B4    C1    C2    C3    C4
Gene 1    1.5   1.4   1.7   1.2   .85   .65   .50   .55   2.5   2.8   2.7   2.1
Gene 2    .78   .95   .75   .45   1.1   1.2   1.0   1.3   .56   .62   .78   .89
Gene 3    .45   .23   .15   .05   .82   .71   .62   .49   .11   .16   .11   .95
Gene 4    2.2   4.5   6.7   6.2   2.2   2.5   2.8   2.9   .48   .90   1.5   1.8
Gene 5    2.1   2.0   1.9   1.6   4.2   4.8   5.2   5.5   2.5   2.6   2.0   1.9
Gene 6    1.2   1.1   1.6   2.9   1.1   1.8   1.9   1.4   1.7   1.2   1.1   1.6
- GEDI's Features
- Allows for simultaneous analysis of several time courses or datasets
- Displays the data in an intuitive and comparable mathematically driven visualization
- The same genes map to the same tiles
[Figure: GEDI displays for Group A, Group B, and Group C, panels numbered 1 to 4, with a scale labeled H to L]
27Software Demonstrations
- MeV available at http://www.tigr.org/software/tm4/mev.html
- GEDI available at http://www.chip.org/ge/gedihome.htm
28Comparison of GEDI vs. Hierarchical Clustering
Hierarchical clustering of random data (GIGO)
From CreateGEP_Journal.wpd, random_A
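The GIGO point can be reproduced with a short experiment: feed purely random vectors to an agglomerative procedure and it still returns tidy-looking clusters. This sketch is not the analysis behind the slide; the data are synthetic Gaussian noise, and the average-linkage rule, the cluster count, and all names are assumptions chosen for illustration.

```python
import math
import random
from itertools import combinations

def cluster_into(data, k):
    """Merge the two closest clusters (average linkage) until k remain."""
    clusters = [[i] for i in range(len(data))]

    def linkage(c1, c2):
        total = sum(math.dist(data[i], data[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters = [c for c in clusters if c is not a and c is not b] + [a + b]
    return clusters

# 30 synthetic "genes" over 6 "samples": pure noise, no biological structure.
rng = random.Random(42)
random_genes = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(30)]

clusters = cluster_into(random_genes, 4)
sizes = sorted(len(c) for c in clusters)
print(sizes)
```

The procedure always returns the requested number of groups, so the existence of clusters says nothing by itself about whether the input contained any real structure.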
29Questions