Title: Clustering
1Tutorial 8
2Clustering
- General Methods
- Unsupervised Clustering
- Hierarchical clustering
- K-means clustering
- Expression data
- GEO
- UCSC
- ArrayExpress
- Tools
- EPCLUST
- MeV
3Microarray - Reminder
4Expression Data Matrix
Exp1 Exp2 Exp3 Exp4 Exp5 Exp6
Gene 1 -1.2 -2.1 -3 -1.5 1.8 2.9
Gene 2 2.7 0.2 -1.1 1.6 -2.2 -1.7
Gene 3 -2.5 1.5 -0.1 -1.1 -1 0.1
Gene 4 2.9 2.6 2.5 -2.3 -0.1 -2.3
Gene 5 0.1 2.6 2.2 2.7 -2.1
Gene 6 -2.9 -1.9 -2.4 -0.1 -1.9 2.9
- Each column represents all the gene expression
levels from a single experiment.
- Each row represents the expression of a gene
across all experiments.
5Expression Data Matrix
- Each element is a log ratio, log2(T/R).
- T - the gene expression level in the testing
sample
- R - the gene expression level in the
reference sample
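As a minimal sketch of the definition above, the log ratio can be computed from a pair of intensities. The function name and the intensity values are hypothetical and chosen only for illustration; they are not taken from the tutorial data.

```python
import math

def log_ratio(test, reference):
    """Return log2(T/R) for two positive expression intensities."""
    if test <= 0 or reference <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(test / reference)

# Hypothetical intensities, for illustration only.
print(log_ratio(400.0, 100.0))  # T is four times R  -> 2.0
print(log_ratio(100.0, 100.0))  # T equals R         -> 0.0
print(log_ratio(25.0, 100.0))   # T is a quarter of R -> -2.0
```

Because of the log scale, a fourfold increase (2.0) and a fourfold decrease (-2.0) are symmetric around zero.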
6Microarray Data Matrix
Black indicates a log ratio of zero, i.e. T = R
Green indicates a negative log ratio, i.e. T < R
Grey indicates missing data
Red indicates a positive log ratio, i.e. T > R
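The color code on this slide can be sketched as a small lookup. This is only an illustration: the function name is hypothetical, missing data is represented here as None (an assumption), and real heat maps usually also scale the color intensity with the size of the log ratio, which this sketch ignores.

```python
def heatmap_color(log_ratio):
    """Map a log ratio (None = missing data) to the color named on the slide."""
    if log_ratio is None:
        return "grey"    # missing data
    if log_ratio > 0:
        return "red"     # T > R
    if log_ratio < 0:
        return "green"   # T < R
    return "black"       # T = R

# Hypothetical row of log ratios with one missing value.
row = [0.1, -2.1, None, 0.0]
print([heatmap_color(v) for v in row])  # ['red', 'green', 'grey', 'black']
```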
7Microarray Data: Different representations
[Figure: log ratio plotted against experiment (Exp)
in two representations, with the T > R and T < R
regions marked]
8A real example
500 genes, 3 knockdown conditions. Too complicated
to analyze without help.
9Microarray Data Clusters
10- How to determine the similarity between two
genes? (for clustering)
Patrik D'haeseleer, How does gene expression
clustering work?, Nature Biotechnology 23,
1499-1501 (2005),
http://www.nature.com/nbt/journal/v23/n12/full/nbt1205-1499.html
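The slide does not say which similarity measure to use. Two common choices are the Euclidean distance and a distance based on the Pearson correlation; the sketch below computes both for Gene 1 and Gene 2 from the expression data matrix. The function names are illustrative, not part of any tool mentioned in this tutorial.

```python
import math

# Rows for Gene 1 and Gene 2 from the expression data matrix.
gene1 = [-1.2, -2.1, -3.0, -1.5, 1.8, 2.9]
gene2 = [2.7, 0.2, -1.1, 1.6, -2.2, -1.7]

def euclidean(a, b):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson(a, b):
    """Pearson correlation coefficient, between -1 and 1."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def correlation_distance(a, b):
    """0 for perfectly correlated profiles, 2 for perfectly anti-correlated."""
    return 1.0 - pearson(a, b)

print("Euclidean distance:  ", euclidean(gene1, gene2))
print("Correlation distance:", correlation_distance(gene1, gene2))
```

Euclidean distance is sensitive to the magnitude of the values, while the correlation distance compares only the shape of the two profiles, so the two measures can group genes differently.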
11Unsupervised Clustering
Hierarchical Clustering
12Hierarchical Clustering
Genes with similar expression patterns are
grouped together and are connected by a series of
branches (dendrogram).
[Figure: dendrogram with leaves labeled 2, 1, 6, 3, 5, 4]
Leaves (shapes in our case) represent genes and
the length of the paths between leaves represents
the distances between genes.
13Hierarchical clustering finds an entire hierarchy
of clusters.
If we want a certain number of clusters, we need
to cut the tree at the level that yields that
number (in this case, four).
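The idea of building the hierarchy and then cutting it can be sketched in plain Python. This is a minimal illustration, not the method of any specific tool: it assumes Euclidean distance and average linkage, and it stops merging once the requested number of clusters remains, which is equivalent to cutting the tree at that level. Gene 5 is left out because one of its values is missing from the matrix.

```python
import math

# Complete rows from the expression data matrix (Gene 5 has a missing value).
data = {
    "Gene 1": [-1.2, -2.1, -3.0, -1.5, 1.8, 2.9],
    "Gene 2": [2.7, 0.2, -1.1, 1.6, -2.2, -1.7],
    "Gene 3": [-2.5, 1.5, -0.1, -1.1, -1.0, 0.1],
    "Gene 4": [2.9, 2.6, 2.5, -2.3, -0.1, -2.3],
    "Gene 6": [-2.9, -1.9, -2.4, -0.1, -1.9, 2.9],
}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(c1, c2):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(data[g], data[h]) for g in c1 for h in c2)
    return total / (len(c1) * len(c2))

def hierarchical(n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[g] for g in data]   # every gene starts as its own leaf
    merges = []                      # (left, right, distance): the tree so far
    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs,
                   key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]]))
        dist = average_linkage(clusters[i], clusters[j])
        merges.append((clusters[i], clusters[j], dist))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges

clusters, merges = hierarchical(n_clusters=2)
for members in clusters:
    print(members)
```

The `merges` list records each join and its distance, which is the information a dendrogram draws as branch heights.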
14Hierarchical clustering result
Five clusters
15K-means Clustering
An algorithm to classify the data into K number
of groups.
K = 4
16How does it work?
1. k initial "means" (in this case k = 3) are
randomly selected from the data set (shown in color).
2. k clusters are created by associating every
observation with the nearest mean.
3. The centroid of each of the k clusters becomes
the new mean.
4. Steps 2 and 3 are repeated until convergence has
been reached.
The algorithm iteratively divides the genes into
K groups and calculates the center of each group.
The result is a grouping that minimizes the
distances between genes and their group centers
for K clusters. It is a local optimum and can
depend on the initial means.
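The iteration described on this slide can be sketched in plain Python. This is a minimal illustration under stated assumptions: Euclidean distance, a fixed random seed for reproducibility, and the five complete rows of the expression data matrix (Gene 5 is left out because one value is missing). The function and parameter names are hypothetical.

```python
import math
import random

# Complete rows from the expression data matrix (Genes 1, 2, 3, 4 and 6).
points = [
    [-1.2, -2.1, -3.0, -1.5, 1.8, 2.9],
    [2.7, 0.2, -1.1, 1.6, -2.2, -1.7],
    [-2.5, 1.5, -0.1, -1.1, -1.0, 0.1],
    [2.9, 2.6, 2.5, -2.3, -0.1, -2.3],
    [-2.9, -1.9, -2.4, -0.1, -1.9, 2.9],
]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(members):
    """Column-wise mean of a non-empty list of profiles."""
    return [sum(col) / len(members) for col in zip(*members)]

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    means = rng.sample(points, k)          # step 1: k random initial means
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign every observation to its nearest mean.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, means[c])) for p in points
        ]
        if new_assignment == assignment:   # step 4: nothing changed, converged
            break
        assignment = new_assignment
        # Step 3: the centroid of each cluster becomes the new mean.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                    # an emptied cluster keeps its old mean
                means[c] = centroid(members)
    return assignment, means

assignment, means = kmeans(points, k=2)
print(assignment)
```

Running the sketch with different seeds may give different groupings, which is why k-means is often repeated from several random starts.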
17Different types of clustering, different results
18How to search for expression profiles
- GEO (Gene Expression Omnibus)
- http://www.ncbi.nlm.nih.gov/geo/
- Human genome browser
- http://genome.ucsc.edu/
- ArrayExpress
- http://www.ebi.ac.uk/arrayexpress/
20Searching for expression profiles in the GEO
Datasets - suitable for analysis with GEO tools
Expression profiles by gene
Probe sets
Microarray experiments
Groups of related microarray experiments
21Clustering
Download dataset
Statistical analysis
22Clustering analysis
23Clustering
Download dataset
Statistical analysis
24The expression distribution for different lines
in the cluster
26Searching for expression profiles in the Human
Genome browser.
27Keratin 10 is highly expressed in skin
28ArrayExpress
http://www.ebi.ac.uk/arrayexpress/
30What can we do with all the expression profiles?
Clusters!
How?
EPCLUST
http://www.bioinf.ebc.ee/EP/EP/EPCLUST/
37 In the input matrix, each column should
represent a gene and each row should represent
an experiment (or individual).
Hierarchical clustering
Edit the input matrix: Transpose, Normalize, Randomize
K-means clustering
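Two of the matrix edits named on this slide, Transpose and Normalize, can be sketched as follows. This is only an illustration: the slides do not say how EPCLUST normalizes, so the sketch assumes a row-wise z-score (mean 0, standard deviation 1), and the function names are hypothetical.

```python
import statistics

def transpose(matrix):
    """Swap rows and columns, e.g. genes-by-experiments to experiments-by-genes."""
    return [list(col) for col in zip(*matrix)]

def normalize_rows(matrix):
    """Assumed normalization: scale each row to mean 0 and standard deviation 1."""
    result = []
    for row in matrix:
        mean = statistics.mean(row)
        sd = statistics.pstdev(row)
        # A constant row has sd == 0; map it to zeros instead of dividing by zero.
        result.append([(x - mean) / sd if sd > 0 else 0.0 for x in row])
    return result

# Two complete rows (Gene 1 and Gene 2) from the expression data matrix.
genes_by_experiments = [
    [-1.2, -2.1, -3.0, -1.5, 1.8, 2.9],
    [2.7, 0.2, -1.1, 1.6, -2.2, -1.7],
]
experiments_by_genes = transpose(genes_by_experiments)
print(len(experiments_by_genes), "rows x", len(experiments_by_genes[0]), "columns")
print(normalize_rows(genes_by_experiments)[0])
```

Transposing matters because a clustering tool groups whatever the rows represent, so the orientation of the input decides whether genes or experiments are clustered.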
38Data
Clusters
39 In the input matrix, each column should
represent a gene and each row should represent
an experiment (or individual).
Hierarchical clustering
Edit the input matrix: Transpose, Normalize, Randomize
K-means clustering
40Samples found in cluster
Graphical representation of the cluster
41 10 clusters, as requested
42MultiExperiment Viewer (MeV)
http://www.tm4.org/mev/