Title: Oct 4
1. Presentation volunteer needed
- Oct 4
- A genome-wide study of gene activity reveals developmental signaling pathways in the preimplantation mouse embryo. Wang et al., Dev Cell 2004.
- Gene expression in the preimplantation embryo: in-vitro developmental changes. Reproductive BioMedicine 2005.
2. Dimension Reduction Methods
3. Motivation
- High-dimensional data points are difficult to visualize, which makes it hard to detect or confirm the relationships among them
- In microarray data, one sample point has thousands of genes, and one gene point has tens of samples
4. If applying dimension reduction
- Better visualize the unsupervised clustering results
- Color hierarchical or K-means clusters in reduced dimensions (2D or 3D) to assess cluster tightness and outliers
- Discover clusters visually in lower dimensions
5. Better visualize unsupervised clustering results
-- HC sample clustering uses the 167 filtered genes
-- Three major sample clusters are identified and colored by HC
-- Use the cluster information to project samples from 167 dimensions to 2D using Linear Discriminant Analysis (LDA)
6. Discover clusters visually in lower dimensions
-- HC sample clustering uses the 167 filtered genes
-- Three major sample clusters are identified and colored by HC
-- Do NOT use the cluster information; project samples from 167 dimensions to 2D using Principal Component Analysis (PCA)
-- The visual clustering after PCA may not agree well with the HC results
7. Viewing clustering results through dimension reduction: more examples
8. Viewing clustering results through dimension reduction: more examples
Clustering of gene expression data. a, Hierarchical clustering dendrogram with the cluster of 19 melanomas at the centre. b, MDS three-dimensional plot of all 31 cutaneous melanoma samples showing the major cluster of 19 samples (blue, within cylinder) and the remaining 12 samples (gold). Bittner et al., Nature, Vol. 406, p. 536, 2000.
9. Dimension Reduction Methods
- Principal Component Analysis (PCA)
- Linear Discriminant Analysis (LDA)
- Multi-Dimensional Scaling (MDS)
10. Principal Component Analysis (PCA)
- Given N data vectors in k dimensions, find c < k orthogonal vectors that can best be used to represent the data
- The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions)
- Each data vector is a linear combination of the c principal component vectors
- Project onto the subspace that preserves most of the data variability
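The PCA steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the method used for the course figures: the toy data, the function names (`center`, `covariance`, `top_eigenpair`, `pca`), and the choice of power iteration with deflation to find the eigenvectors are all assumptions made here to keep the example self-contained.

```python
import math
import random


def center(data):
    """Subtract the per-dimension mean from every data vector."""
    n, k = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(k)]
    return [[row[j] - means[j] for j in range(k)] for row in data]


def covariance(centered):
    """Sample covariance matrix (k x k) of mean-centered data."""
    n, k = len(centered), len(centered[0])
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(k)]
            for a in range(k)]


def top_eigenpair(matrix, iterations=500, seed=0):
    """Dominant eigenpair of a symmetric positive semidefinite matrix
    by power iteration."""
    rng = random.Random(seed)
    k = len(matrix)
    v = [rng.random() + 0.1 for _ in range(k)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(matrix[a][b] * v[b] for b in range(k)) for a in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-15:  # no variance left in any direction
            break
        v = [x / norm for x in w]
    mv = [sum(matrix[a][b] * v[b] for b in range(k)) for a in range(k)]
    eigenvalue = sum(v[a] * mv[a] for a in range(k))  # Rayleigh quotient
    return eigenvalue, v


def pca(data, c):
    """Return the first c principal components, their variances, and the
    coordinates (scores) of each data vector on those components."""
    centered = center(data)
    cov = covariance(centered)
    k = len(cov)
    components, variances = [], []
    for _ in range(c):
        lam, v = top_eigenpair(cov)
        components.append(v)
        variances.append(lam)
        # Deflation: remove the variance already explained by v.
        cov = [[cov[a][b] - lam * v[a] * v[b] for b in range(k)]
               for a in range(k)]
    scores = [[sum(r[j] * comp[j] for j in range(k)) for comp in components]
              for r in centered]
    return components, variances, scores


# Toy data (hypothetical): N = 5 vectors in k = 3 dimensions, reduced to c = 2.
data = [[2.0, 1.9, 0.1], [4.0, 4.1, -0.1], [6.0, 6.0, 0.0],
        [8.0, 7.9, 0.1], [10.0, 10.1, -0.1]]
components, variances, scores = pca(data, 2)
print(components[0])  # direction of greatest variance
print(variances)      # variance captured along each component
```

Each row of `scores` is one data vector expressed on the c principal components, which is the reduced data set that would be plotted in 2D.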
11. Principal Component Analysis (PCA)
12. E.g. a sample point X = (gene 1, gene 2, gene 3, ..., gene n); a gene point X = (sample 1, sample 2, ..., sample p)
16. [Figure: projections onto the first three PC directions and onto the first two PC directions]
17. Homework (Due Sept 25)
- Read more on Principal Component Analysis
- http://csnet.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
- (Team) Download a gene expression dataset: http://bioinfor.bioen.uiuc.edu/bioe598/data/colon-cancer.xls. Read it into R. Find the top 200 genes with the largest standard deviation. Use these 200 genes to cluster the samples (the pam function performs k-medoids clustering, a close relative of k-means). Are you satisfied with your clustering result? Are there alternative ways to do this? Hand in your report.
- Inspect the course project datasets on the course webpage. What can you do with them?
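The gene-filtering step in the homework (keep the genes with the largest standard deviation) can be sketched as follows. The homework itself asks for R; this Python version only illustrates the idea. The function name `top_variable_genes`, the gene names, and the toy expression values are hypothetical, and the example keeps 2 genes rather than 200.

```python
import statistics


def top_variable_genes(expression, gene_names, top_k):
    """Keep the top_k genes with the largest standard deviation across samples.

    expression: one row per gene, one column per sample.
    """
    sds = [statistics.stdev(row) for row in expression]
    ranked = sorted(range(len(expression)), key=lambda g: sds[g], reverse=True)
    keep = ranked[:top_k]
    return [gene_names[g] for g in keep], [expression[g] for g in keep]


# Toy matrix (hypothetical): 4 genes measured in 4 samples.
gene_names = ["geneA", "geneB", "geneC", "geneD"]
expression = [
    [1.0, 5.0, 1.0, 5.0],    # moderately variable
    [3.0, 3.1, 2.9, 3.0],    # nearly constant
    [0.0, 10.0, 0.0, 10.0],  # highly variable
    [2.0, 2.0, 2.0, 2.0],    # constant
]
selected_names, selected_rows = top_variable_genes(expression, gene_names, 2)
print(selected_names)
```

The filtered rows would then be passed to whichever clustering routine is used for the samples.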
18. Self-organizing maps (SOM)
Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Tamayo et al., PNAS, Vol. 96, p. 2907, 1999
19. Method
- Choose a geometry of nodes (e.g. a 6 by 5 grid)
- The nodes are mapped into k-dimensional gene expression space (k = no. of conditions), initially at random, and then iteratively adjusted
- Each iteration involves randomly selecting a data point P and moving the nodes in the direction of P. The closest node N_P is moved the most, whereas other nodes are moved by smaller amounts depending on their distance from N_P in the initial geometry.
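The method above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the implementation from Tamayo et al.: the Gaussian neighborhood, the linearly decaying learning rate and radius, the function names (`sq_dist`, `som_step`, `train_som`), and the toy expression profiles are all choices made here for the example.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance in expression space."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def som_step(pos, p, rate, radius):
    """One iteration: find the closest node N_P, then move every node toward P."""
    n_p = min(pos, key=lambda n: sq_dist(pos[n], p))
    for n in pos:
        d = math.hypot(n[0] - n_p[0], n[1] - n_p[1])  # distance on the grid
        # Assumed Gaussian neighborhood: N_P (d = 0) moves the most.
        tau = rate * math.exp(-(d * d) / (2.0 * radius * radius))
        pos[n] = [x + tau * (pj - x) for x, pj in zip(pos[n], p)]
    return n_p


def train_som(data, rows, cols, iterations=2000, seed=0):
    """Fit a rows x cols grid of nodes to k-dimensional data points."""
    rng = random.Random(seed)
    k = len(data[0])
    lo = [min(p[j] for p in data) for j in range(k)]
    hi = [max(p[j] for p in data) for j in range(k)]
    # Initial mapping: each grid node gets a random position in expression space.
    pos = {(r, c): [rng.uniform(lo[j], hi[j]) for j in range(k)]
           for r in range(rows) for c in range(cols)}
    for i in range(iterations):
        p = rng.choice(data)           # randomly selected data point P
        frac = 1.0 - i / iterations    # shrinks from 1 toward 0
        rate = 0.5 * frac              # assumed learning-rate schedule
        radius = max(0.5, 0.5 * max(rows, cols) * frac)
        som_step(pos, p, rate, radius)
    return pos


# Toy expression profiles (hypothetical): two groups of genes, k = 2 conditions.
profiles = [[0.1, 0.2], [0.0, 0.1], [0.2, 0.0],
            [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
nodes = train_som(profiles, rows=2, cols=2)
for node, position in sorted(nodes.items()):
    print(node, [round(x, 2) for x in position])
```

After training, each gene can be assigned to its nearest node, so every node defines one cluster and neighboring nodes on the grid tend to hold similar expression patterns.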
21. The position of node N at iteration i is denoted f_i(N). The initial mapping f_0 is random. On subsequent iterations, a data point P is selected and the node N_P that maps nearest to P is identified. The mapping of nodes is then adjusted by moving points toward P.
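The adjustment described here is commonly written as the update rule below. The formula did not survive in the transcript, so the learning-rate symbol \tau and the grid-distance function d are notation assumed for illustration.

```latex
f_{i+1}(N) = f_i(N) + \tau\bigl(d(N, N_P),\, i\bigr)\,\bigl(P - f_i(N)\bigr)
```

Here d(N, N_P) is the distance between nodes N and N_P in the initial geometry, and \tau is a learning rate that shrinks as that distance grows and as the iteration number i increases, so N_P moves the most and the map settles over time.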
22. Yeast cell cycle data: Cho et al. 1998, Mol. Cell 2, 65-73
23. SOM clustering of periodic genes