Title: SPR (Surface Plasmon Resonance) Chemical Sensing Microsystems
1. Gene Shaving as a method for identifying distinct sets of genes with similar expression patterns
Tim Randolph, Garth Tan
Presentation for Stat 593E, May 15, 2003
2. Presentation Outline
- Biology Background
- Reminder of Principal Component Analysis
- What is Gene Shaving?
- The Gene Shaving Algorithm
- Applications of Gene Shaving
- Conclusions
3. What is gene expression?
- Each cell contains a complete copy of all genes.
- The difference between a skin cell and bone cell
is determined by which genes are producing
proteins - i.e., which genes are being expressed.
- The expression of DNA information occurs in two steps:
  - Transcription: DNA → mRNA
  - Translation: mRNA → protein
- DNA microarrays measure transcription (i.e., the
mRNA produced)
5. [Figure: microarray workflow. The reference cells sample and the test cells sample each undergo transcription, are labeled with dye, and are hybridized to the array.]
6. The Dataset
- N x p expression matrix $X = (x_{ij})$
- p columns (patients)
- N rows (genes)
- Green: under-expressed genes.
- Red: over-expressed genes.
7. The ratio of the red and green intensities for each spot indicates the relative abundance of the corresponding DNA probe in the two nucleic acid target samples.
$$x_{ij} = \log_2(R/G)$$
- $x_{ij} > 0$: the gene is over-expressed in the test sample relative to the reference sample (red).
- $x_{ij} = 0$: the gene is expressed equally.
- $x_{ij} < 0$: the gene is under-expressed in the test sample relative to the reference sample (green).
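As a small illustration of the log-ratio transform, the sketch below computes $\log_2(R/G)$ for three spots. The function name `log_ratio` and the intensity values are hypothetical and chosen only to show a positive, a zero, and a negative entry.

```python
import numpy as np

def log_ratio(red, green):
    """Return log2(R/G) element-wise for positive intensity arrays."""
    red = np.asarray(red, dtype=float)
    green = np.asarray(green, dtype=float)
    return np.log2(red / green)

# Hypothetical spot intensities (not real data).
red = np.array([800.0, 500.0, 100.0])
green = np.array([200.0, 500.0, 400.0])

# Positive where red exceeds green, zero where they match, negative otherwise.
x = log_ratio(red, green)
```

Working on the log scale makes a fourfold increase and a fourfold decrease symmetric about zero (here +2 and -2).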
8. Remarks
- Knowing the list of human genes does not mean we know what they do.
- cDNA arrays help study the variation of gene expression across samples (e.g., tissues or patients).
- A major challenge is interpreting data that consists of the expression levels of, say, 6000 genes and 50 patients.
- Present goal: create a clustering that organizes genes with coherent behavior across samples.
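9. 1st eigengene (principal component of $X^T$)
- Singular value decomposition of $X^T$:
$$X^T = U S V^T, \qquad S = \mathrm{diag}(s_1, \ldots, s_r)$$
- The columns of $X^T$ are the genes $g_1, g_2, \ldots, g_N$; $u_1$ and $v_1$ are the first columns of $U$ and $V$.
- Equivalently, $X^T V = U S$, so
$$s_1 u_1 = X^T v_1$$
is the linear combination of the columns of $X^T$ (genes) with the highest variance.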
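The identity $s_1 u_1 = X^T v_1$ can be checked numerically. The sketch below is a minimal example on random data, assuming each gene (row of $X$) has been centered across patients; the sizes and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 8                              # hypothetical sizes: N genes, p patients
X = rng.normal(size=(N, p))
X = X - X.mean(axis=1, keepdims=True)     # center each gene across patients

Xt = X.T                                  # p x N, columns are genes
U, s, Vt = np.linalg.svd(Xt, full_matrices=False)

v1 = Vt[0]                                # weights on the N genes (first column of V)
eigengene = Xt @ v1                       # linear combination of the gene columns

# Any other unit-norm weight vector gives a combination with no larger norm.
w = rng.normal(size=N)
w /= np.linalg.norm(w)
other = Xt @ w
```

Here `eigengene` is a vector of length p (one value per patient) and coincides with `s[0] * U[:, 0]`.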
10. Introduction
- What is Gene Shaving?
- A new statistical method that identifies subsets of genes with coherent expression patterns and large variation across different conditions.
- Differs from hierarchical clustering and other widely used methods for analyzing gene expression in that genes may belong to more than one cluster.
11. The Gene Shaving Algorithm
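The body of this slide is not preserved in the transcript. As a rough sketch of the shaving step as it is usually described for gene shaving (center each gene, compute the leading principal component, drop the genes with the smallest absolute inner product with it, and repeat until one gene remains), the code below builds the nested sequence of clusters. The names `shaving_sequence` and `shave_frac`, and the 10% default, are assumptions for illustration rather than details taken from the slides.

```python
import numpy as np

def shaving_sequence(X, shave_frac=0.1):
    """Return nested gene-index sets, largest first, ending with one gene.

    X is an N x p matrix (genes x samples). shave_frac is the fraction of
    genes dropped at each step (assumed default of 10%).
    """
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=1, keepdims=True)        # center each gene
    idx = np.arange(X.shape[0])
    sequence = [idx.copy()]
    while idx.size > 1:
        sub = X[idx]
        # First right singular vector = leading principal component over samples.
        _, _, vt = np.linalg.svd(sub, full_matrices=False)
        scores = np.abs(sub @ vt[0])             # |inner product| per gene
        n_drop = max(1, int(np.floor(shave_frac * idx.size)))
        n_drop = min(n_drop, idx.size - 1)       # always keep at least one gene
        keep = np.argsort(scores)[n_drop:]       # discard the smallest scores
        idx = idx[np.sort(keep)]
        sequence.append(idx.copy())
    return sequence

# Hypothetical random data, only to exercise the function.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 6))
seq = shaving_sequence(X_demo)
```

Each element of `seq` is a candidate cluster; choosing among them is the subject of the cluster-size slides that follow.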
12. Estimating the Optimal Cluster Size K
- Gene Shaving requires a quality measure for a cluster.
- To select a good cluster, the method focuses on high coherence between members of the cluster.
13. Estimating the Optimal Cluster Size K (cont.)
- The method defines the following measures of variance for a cluster $S_k$:
- The Between Variance is the variance of the mean gene.
- The Within Variance measures the variability of each gene about the average.
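The formulas themselves are not preserved in the transcript. One standard way to write these two quantities, consistent with the verbal definitions above, is sketched below. It assumes a cluster $S_k$ of $k$ genes measured on $p$ samples, with $\bar{x}_j$ the average over the cluster's genes in sample $j$ and $\bar{x}$ the overall average; the symbols $V_W$, $V_B$, and $V_T$ are notation introduced here.

```latex
\bar{x}_j = \frac{1}{k}\sum_{i \in S_k} x_{ij}, \qquad
\bar{x} = \frac{1}{p}\sum_{j=1}^{p} \bar{x}_j

V_W = \frac{1}{p}\sum_{j=1}^{p}\left[\frac{1}{k}\sum_{i \in S_k}\left(x_{ij}-\bar{x}_j\right)^2\right]
\quad \text{(within variance)}

V_B = \frac{1}{p}\sum_{j=1}^{p}\left(\bar{x}_j-\bar{x}\right)^2
\quad \text{(between variance)}

V_T = \frac{1}{kp}\sum_{i \in S_k}\sum_{j=1}^{p}\left(x_{ij}-\bar{x}\right)^2 = V_W + V_B
\quad \text{(total variance)}
```

The last equality is the usual sum-of-squares decomposition: the deviation of each entry from the overall mean splits into its deviation from the sample average plus the deviation of the sample average from the overall mean.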
14. Estimating the Optimal Cluster Size K (cont.)
- A useful measure for choosing cluster size is the percent variance, $R^2$.
- A large $R^2$ implies a tight cluster of coherent genes.
- Gene Shaving uses this measure for selecting a cluster from the shaving sequence $S_k$.
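The sketch below computes a percent-variance score for a candidate cluster as 100 times the between variance divided by the sum of between and within variance. This specific formula is an assumption consistent with the verbal definitions on the slides; the function name `percent_variance` and the toy matrices are hypothetical.

```python
import numpy as np

def percent_variance(X, members):
    """Percent of total variance explained by the cluster's mean gene."""
    sub = np.asarray(X, dtype=float)[members]        # k x p block of the cluster
    col_avg = sub.mean(axis=0)                       # the "mean gene" across samples
    v_between = np.mean((col_avg - sub.mean()) ** 2) # variance of the mean gene
    v_within = np.mean((sub - col_avg) ** 2)         # spread of genes about the mean gene
    return 100.0 * v_between / (v_between + v_within)

# Two genes with identical profiles: perfectly coherent.
tight = np.array([[1.0, 2.0, 3.0],
                  [1.0, 2.0, 3.0]])
# Two genes with opposite profiles: the mean gene is flat.
loose = np.array([[1.0, 2.0, 3.0],
                  [3.0, 2.0, 1.0]])

r2_tight = percent_variance(tight, [0, 1])
r2_loose = percent_variance(loose, [0, 1])
```

The two toy clusters sit at the extremes of the scale: the coherent pair scores 100 and the opposing pair scores 0.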
15. Estimating the Optimal Cluster Size K (cont.)
- Once a cluster is selected from the sequence, we can proceed to finding the optimal cluster size.
- Let $D_k$ be the $R^2$ measure for the k-th sequence member.
- We wish to find the Gap between this value $D_k$ and $D^{b}_k$, which is the $R^2$ measure for cluster $S^{b}_k$.
- This $S^{b}_k$ is the clustering sequence from a permuted matrix $X^{b}$.
16. Estimating the Optimal Cluster Size K (cont.)
- The Gap function is defined as
$$\mathrm{Gap}(k) = D_k - \bar{D}_k$$
- where $\bar{D}_k$ is the average of $D^{b}_k$ over $b$.
- The optimal cluster size $K$ is selected such that this Gap is the largest:
$$K = \arg\max_k \mathrm{Gap}(k)$$
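The Gap computation can be sketched end to end as follows. This is a minimal illustration under several assumptions not stated on the slides: each permuted matrix is formed by shuffling entries within each row, 10% of genes are shaved per step, and the percent variance is 100 times between variance over total variance. All function names and the planted-cluster demo data are hypothetical.

```python
import numpy as np

def shave(X, frac=0.1):
    """Nested gene-index sets from repeated shaving, largest first."""
    idx = np.arange(X.shape[0])
    out = [idx]
    while idx.size > 1:
        sub = X[idx]
        _, _, vt = np.linalg.svd(sub, full_matrices=False)
        order = np.argsort(np.abs(sub @ vt[0]))      # ascending |inner product|
        n_drop = min(max(1, int(frac * idx.size)), idx.size - 1)
        idx = idx[np.sort(order[n_drop:])]
        out.append(idx)
    return out

def r2(X, members):
    """Percent variance of a cluster (0 if the block is constant)."""
    sub = X[members]
    avg = sub.mean(axis=0)
    vb = np.mean((avg - sub.mean()) ** 2)
    vw = np.mean((sub - avg) ** 2)
    total = vb + vw
    return 100.0 * vb / total if total > 0 else 0.0

def gap_curve(X, n_perm=20, seed=0):
    """Return the shaving sequence and Gap(k) for each member."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=1, keepdims=True)            # center each gene
    seq = shave(X)
    d = np.array([r2(X, s) for s in seq])
    d_perm = np.zeros((n_perm, len(seq)))
    for b in range(n_perm):
        Xb = rng.permuted(X, axis=1)                 # shuffle within each row
        d_perm[b] = [r2(Xb, s) for s in shave(Xb)]
    return seq, d - d_perm.mean(axis=0)

# Hypothetical data: 8 coherent genes planted among 40.
rng = np.random.default_rng(1)
signal = rng.normal(size=10)
X_demo = rng.normal(size=(40, 10))
X_demo[:8] = 3.0 * signal + 0.3 * rng.normal(size=(8, 10))

seq, gap = gap_curve(X_demo)
k_hat = seq[int(np.argmax(gap))].size                # estimated cluster size
```

The permuted sequences have the same sizes as the original one because the number of genes shaved at each step depends only on the current cluster size, so the two curves can be subtracted member by member.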
17. The Gene Shaving Algorithm (cont.)
18. So far: form clusters $S_k$ with
- high variance across samples
- high correlation among genes within a cluster
- low correlation between genes in different clusters.
The procedure seeks clusters $S_k$ by maximizing
$$v(S_k) = \mathrm{var}(\text{vector of column averages})$$
Now incorporate supervision: use information, $y$, about the patients, and seek $S_k$ by maximizing
$$(1 - a)\, v(S_k) + a\, J(v(S_k), y)$$
19.
- The goal is to predict patient survival.
- Find genes whose expression correlates with patient survival.
- Produce groupings of patients that are statistically different in survival.
- Use additional information about the patients, $y = (y_1, \ldots, y_p)$, and combine the unsupervised and supervised criteria into the objective function
$$(1 - a)\, v(S_k) + a\, J(v(S_k), y), \qquad 0 \le a \le 1$$
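To make the weighting concrete, the sketch below evaluates the combined objective for a toy cluster. The information measure used here, `squared_correlation`, is a hypothetical stand-in chosen for simplicity; it is not the Cox score statistic used in the talk. The code also assumes that $J$ is evaluated on the cluster's vector of column averages, and all names and data are illustrative.

```python
import numpy as np

def squared_correlation(col_avg, y):
    """Stand-in information measure J (NOT the Cox score statistic)."""
    return float(np.corrcoef(col_avg, y)[0, 1] ** 2)

def supervised_objective(X, members, y, a, info=squared_correlation):
    """Return (1 - a) * v(S_k) + a * J for the cluster given by `members`."""
    col_avg = np.asarray(X, dtype=float)[members].mean(axis=0)
    v = np.var(col_avg)                   # variance of the column averages
    return (1.0 - a) * v + a * info(col_avg, y)

# Hypothetical cluster of two genes over four patients, with outcome y.
X_demo = np.array([[1.0, 2.0, 3.0, 4.0],
                   [1.0, 2.0, 3.0, 4.0]])
y_demo = np.array([2.0, 4.0, 6.0, 8.0])

unsupervised = supervised_objective(X_demo, [0, 1], y_demo, a=0.0)
mixed = supervised_objective(X_demo, [0, 1], y_demo, a=0.1)
supervised = supervised_objective(X_demo, [0, 1], y_demo, a=1.0)
```

Setting `a=0` recovers the unsupervised criterion and `a=1` uses only the patient information; intermediate values trade the two off.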
20. Maximize $(1 - a)\, v(S_k) + a\, J(v(S_k), y)$
- The information measure $J(v(S_k), y)$ is a quadratic function that depends on the type of patient information, $y$.
- $y = (y_1, \ldots, y_p)$ may identify categories of patients.
- Used here: $y$ = ($p$ patient survival times), and
$$J(v(S_k), y) = g\, g^T$$
where $g$ is the score vector of the Cox model for predicting survival.
21. They chose $a = 0.1$, as it seemed to give a good mix of high gene correlation and low p-value for the Cox model.
22.
- This produced a cluster of 234 genes.
- It includes strong genes for predicting survival (130 of the 200 strongest) as well as some weak genes (e.g., 1332).
23.
- Gap curve for supervised shaving.
- Survival curves in the two groups defined by the low or high expression of the 234 genes.
- Group 1 has high expression of positive genes and low expression of negative genes.
- Group 2 has low expression of positive genes and high expression of negative genes.
- Negative genes are those preceded by a minus sign in Table 2.
24. Conclusions
- The proposed gene shaving methods search for clusters of genes showing both high variation across the samples and correlation across the genes.
- This method is a potentially useful tool for exploration of gene expression data and identification of interesting clusters of genes worth further investigation.