Title: Whole Genome Expression Analysis
1 - Whole Genome Expression Analysis
2 - In this presentation
- Part 1: Gene Expression Microarray Data
- Part 2: Global Expression Sequence Data Analysis
- Part 3: Proteomic Data Analysis
3 - Part 1: Microarray Data
4 - Examples of Problems
Gene sequence problems: given a DNA sequence, state which sections are coding or noncoding regions, which sections are promoters, etc.
Protein structure problems: given a DNA or amino acid sequence, state what structure the resulting protein takes.
Gene expression problems: given DNA/gene microarray expression data, infer either clinical or biological class labels, or the genetic machinery that gives rise to the expression data.
Protein expression problems: study the expression of proteins and their function.
5 - Microarray Technology
Basic idea: the state of the cell is determined by proteins. A gene codes for a protein, which is assembled via mRNA. Measuring the amount of a particular mRNA gives a measure of the amount of the corresponding protein. The number of copies of an mRNA is the expression level of a gene. Microarray technology allows us to measure the expression of thousands of genes at once: we measure the expression of thousands of genes under different experimental conditions and ask what is different and why.
6 - Oligo vs. cDNA Arrays
(Lockhart and Winzeler, 2000)
7 - Format
What is whole genome expression analysis?
Clustering algorithms:
- Hierarchical clustering
- K-means clustering
- Principal component analysis
- Self-organizing maps
Beyond clustering:
- Support vector machines
- Automatic discovery of regulatory patterns in promoter regions
- Bayesian network analysis
8 - What is whole genome expression analysis?
Messenger RNA is only an intermediate of gene expression.
9 - What is whole genome expression analysis? (continued)
Why measure mRNA?
10 - What is whole genome expression analysis? (continued)
11 - Why Separate Feature Selection?
- Most learning algorithms look for non-linear combinations of features, and can easily find many spurious combinations given a small number of records and a large number of genes.
- We first reduce the number of genes by a linear method, e.g. T-values.
- Heuristic: select genes from each class.
- Then apply a favorite machine learning algorithm.
12 - Feature Selection Approach
- Rank genes by a measure and select the top 200-500:
- T-test for mean difference
- Signal-to-noise (S2N)
- Other: information-based, biological?
- Almost any method works well with good feature selection.
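The ranking step above can be sketched as follows. This is an illustrative pure-Python sketch, not code from the slides: each gene is scored by signal-to-noise and the top-k gene indices are kept.

```python
# Sketch of the feature-selection step: rank genes by a linear score
# (here signal-to-noise) and keep the top k. All names are illustrative.

def signal_to_noise(gene, labels):
    """S2N score for one gene: (mean1 - mean2) / (std1 + std2)."""
    g1 = [x for x, y in zip(gene, labels) if y == 1]
    g2 = [x for x, y in zip(gene, labels) if y == 2]
    m1, m2 = sum(g1) / len(g1), sum(g2) / len(g2)
    s1 = (sum((x - m1) ** 2 for x in g1) / len(g1)) ** 0.5
    s2 = (sum((x - m2) ** 2 for x in g2) / len(g2)) ** 0.5
    return (m1 - m2) / (s1 + s2)

def top_genes(expression, labels, k):
    """Rank genes (rows of `expression`) by |S2N|; return top-k row indices."""
    scores = [abs(signal_to_noise(row, labels)) for row in expression]
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
```

In practice k would be the 200-500 mentioned above; a t-statistic can be substituted for the S2N score without changing the structure.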
13 - Heatmap Visualization of Selected Genes
[Heatmap figure: ALL and AML samples against AML-related and ALL-related gene groups]
Heatmap visualization is done by normalizing each gene to mean 0, std. 1 to get a picture like this. Good correlation overall.
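The per-gene normalization used for the heatmap can be sketched as follows (a minimal illustration, not the authors' code): each row of the expression matrix is z-scored so that genes on very different absolute scales share one color map.

```python
# Z-score each gene (row) to mean 0, std 1, as described for the heatmap.

def normalize_genes(expression):
    """Return a copy of `expression` with each row standardized."""
    normalized = []
    for row in expression:
        m = sum(row) / len(row)                                  # gene mean
        s = (sum((x - m) ** 2 for x in row) / len(row)) ** 0.5   # population std
        normalized.append([(x - m) / s for x in row])
    return normalized
```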
14 - Controlling False Positives
Example gene: CD37 antigen

  Expression: 178   105   4174   7133
  Class:        1     1      2      2

Mean difference between classes: T-value = -3.25, significance p = 0.0007.
15 - Controlling False Positives with Randomization
Randomization is less conservative: it preserves the inner structure of the data.

  CD37 antigen:       178   105   4174   7133
  Class:                1     1      2      2
  Randomized class:     2     1      1      2

With the randomized class labels, T-value = -1.1.
16 - Controlling False Positives with Randomization, II
Randomize the class labels 500 times and recompute the T-value for each gene under each randomization.

  Gene:               178   105   4174   7133
  Class:                1     1      2      2
  Randomized class:     2     1      1      2

Bottom 1% T-value: -2.08. Select potentially interesting genes at the 1% level.
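The randomization procedure above can be sketched like this (illustrative code; the exact t-statistic formula used on the slides may differ, so the numbers will not match exactly): shuffle the class labels many times, recompute the T-value each time, and use the bottom tail of this null distribution as the significance cutoff.

```python
# Permutation (randomization) test for one gene: build a null
# distribution of T-values by shuffling class labels.

import random

def t_value(gene, labels):
    """Two-sample t-statistic (unequal-variance form) for one gene."""
    g1 = [x for x, y in zip(gene, labels) if y == 1]
    g2 = [x for x, y in zip(gene, labels) if y == 2]
    m1, m2 = sum(g1) / len(g1), sum(g2) / len(g2)
    v1 = sum((x - m1) ** 2 for x in g1) / (len(g1) - 1)
    v2 = sum((x - m2) ** 2 for x in g2) / (len(g2) - 1)
    return (m1 - m2) / ((v1 / len(g1) + v2 / len(g2)) ** 0.5)

def permutation_threshold(gene, labels, n_perm=500, pct=0.01, seed=0):
    """Bottom-`pct` quantile of T-values under random label shuffles."""
    rng = random.Random(seed)
    null = []
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        null.append(t_value(gene, shuffled))
    null.sort()
    return null[max(0, int(pct * n_perm) - 1)]
```

Genes whose observed T-value falls below `permutation_threshold(...)` would be flagged as potentially interesting at the chosen level.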
17 - Part 2: Microarray Data Classification
18 - Classification
Desired features:
- robust in the presence of false positives
- understandable
- returns confidence/probability
- fast enough
The simplest approaches are the most robust.
19 - Popular Classification Methods
- Decision trees/rules: find the smallest gene sets, but also find false positives
- Neural nets: work well if the number of genes is reduced
- SVM: good accuracy, does its own gene selection, hard to understand
- K-nearest neighbor: robust for small gene sets
- Bayesian nets: simple, robust
20 - Model-building Methodology
- Select the gene set (and other parameters) on a training set.
- When the final model is built, evaluate it on an independent test set that is not used in model building.
21 - Selecting the Best Gene Set
We tested 9 gene sets with 10-fold cross-validation (90 neural nets) and selected the gene set with the lowest average error. Heuristically, use at least 10 genes overall.
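This selection loop can be sketched as follows. This is a hedged illustration: a nearest-centroid classifier stands in for the neural nets used on the slides, and all names are invented for the example. For each candidate gene-set size, the error is estimated by k-fold cross-validation and the size with the lowest average error is kept.

```python
# Pick a gene-set size by cross-validated error, with a simple
# nearest-centroid classifier standing in for the slides' neural nets.

def nearest_centroid(train, test_point):
    """train: list of (features, label). Predict the label of test_point."""
    groups = {}
    for feats, label in train:
        groups.setdefault(label, []).append(feats)
    best_label, best_dist = None, float("inf")
    for label, rows in groups.items():
        c = [sum(col) / len(rows) for col in zip(*rows)]   # class centroid
        d = sum((a - b) ** 2 for a, b in zip(c, test_point))
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label

def cv_error(samples, n_folds):
    """Average error over n_folds interleaved folds."""
    errors = 0
    folds = [samples[i::n_folds] for i in range(n_folds)]
    for i, fold in enumerate(folds):
        train = [s for j, f in enumerate(folds) if j != i for s in f]
        errors += sum(1 for feats, label in fold
                      if nearest_centroid(train, feats) != label)
    return errors / len(samples)

def best_gene_count(samples, ranked_genes, candidate_sizes, n_folds=10):
    """Return the gene-set size with the lowest cross-validated error.
    `ranked_genes` are gene indices sorted by relevance (e.g. |T-value|)."""
    best_size, best_err = None, float("inf")
    for size in candidate_sizes:
        idx = ranked_genes[:size]
        reduced = [([feats[i] for i in idx], label) for feats, label in samples]
        err = cv_error(reduced, n_folds)
        if err < best_err:
            best_size, best_err = size, err
    return best_size
```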
22 - Results on the Test Data
- Evaluation of the 10-genes-per-class model on the test data (34 samples) gives:
- 33 correct predictions (97% accuracy)
- 1 error on sample 66: actual class AML, net prediction ALL, with low net confidence
23 - Classification: Other Applications
- Combining clinical and genetic data
- Outcome/treatment prediction
- Age, sex, and stage of disease are useful
- e.g., if the data comes from a male, it is not ovarian cancer
24 - Multi-Class Classification
- Similar approach:
- select the top genes most correlated to each class
- select the best subset using cross-validation
- build a single model separating all classes
- Advanced:
- build a separate model for each class vs. the rest
- choose the model making the strongest prediction
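The "advanced" one-vs-rest scheme can be sketched as follows (illustrative code; a centroid-margin score stands in for a real per-class classifier): train one model per class against the rest, then predict with whichever model makes the strongest prediction.

```python
# One-vs-rest multi-class sketch: per-class score is how much closer a
# point is to the class centroid than to the rest centroid.

def _centroid(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def train_one_vs_rest(samples, classes):
    """samples: list of (features, label). Return {class: (in, rest) centroids}."""
    models = {}
    for c in classes:
        inside = [f for f, y in samples if y == c]
        rest = [f for f, y in samples if y != c]
        models[c] = (_centroid(inside), _centroid(rest))
    return models

def score(model, x):
    """Margin-like score: distance to rest centroid minus distance to class centroid."""
    c_in, c_out = model
    d_in = sum((a - b) ** 2 for a, b in zip(c_in, x))
    d_out = sum((a - b) ** 2 for a, b in zip(c_out, x))
    return d_out - d_in

def predict(models, x):
    """Choose the class whose one-vs-rest model scores highest."""
    return max(models, key=lambda c: score(models[c], x))
```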
25 - Multi-Class Data Example
- Brain data: Pomeroy et al., Nature 415, Jan 2002
- 42 examples, about 7,000 genes, 5 classes
- Selected the top 100 genes most correlated to each class
- Selected the best subset by testing subsets of 1, 2, ..., 20 genes, with leave-one-out cross-validation for each
26 - Brain Data Results
[Plot of results vs. number of genes (same number from each class)]
27 - Part 3: Classification of Cancer Microarray Data
28 - Cancer Classification
38 examples of myeloid and lymphoblastic leukemias, Affymetrix Human 6800 (7,128 genes including control genes); 34 examples to test the classifier. Results: 33/34 correct. Here d is the perpendicular distance from the hyperplane.
[Plot: d for the test data]
29 - Gene Expression and Coregulation
30 - Nonlinear Classifier
31 - Nonlinear SVM
A nonlinear SVM does not help when using all genes, but it does help when the top genes, ranked by signal-to-noise (Golub et al.), are removed.
32 - Rejections
Golub et al. classified 29 test points correctly and rejected 5, of which 2 were errors, using 50 genes. We need to introduce the concept of rejects to the SVM.
33 - Rejections
34 - Estimating a CDF
35 - The Regularized Solution
36 - Rejections for SVM
[Plot: estimated P(class = 1 | d) as a function of the distance d from the hyperplane]
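One simple way to add a reject option can be sketched as follows. This is an illustration under an assumption: the SVM output d is mapped to a probability-like confidence with a sigmoid, which stands in for the CDF estimate discussed on the slides. The classifier predicts by the sign of d but refuses to predict when the confidence is too close to 1/2.

```python
# SVM-with-reject sketch: map distance d to a confidence, reject when
# the confidence is below a threshold. The sigmoid is an illustrative
# stand-in for an estimated P(class = 1 | d).

import math

def confidence(d, scale=1.0):
    """Map distance-from-hyperplane d to a score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-scale * d))

def classify_with_reject(d, threshold=0.8):
    """Return +1, -1, or 'reject' depending on the confidence at d."""
    p = confidence(d)
    if p >= threshold:
        return 1
    if p <= 1.0 - threshold:
        return -1
    return "reject"
```

Points far from the hyperplane are classified; points in the low-confidence band around it are rejected, as in the results on the next slide.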
37 - Results with Rejections
Results: 31 correct, 3 rejected, of which 1 is an error.
[Plot: d for the test data]
38 - Why Feature Selection
- SVMs as stated use all genes/features.
- Molecular biologists/oncologists seem to be convinced that only a small subset of genes is responsible for particular biological properties, so they want to know which genes are most important in discriminating.
- Practical reasons: a clinical device with thousands of genes is not financially practical.
- Possible performance improvement.
39 - Results with Gene Selection
AML vs. ALL: with 40 genes, 34/34 correct, 0 rejects; with 5 genes, 31/31 correct, 3 rejects of which 1 is an error.
B vs. T cells for ALL: with 10 genes, 33/33 correct, 0 rejects.
40 - Leave-one-out Procedure
41 - The Basic Idea
Use leave-one-out (LOO) bounds for SVMs as a criterion to select features, by searching over all possible subsets of n features for the one that minimizes the bound. When such a search is impossible because of combinatorial explosion, scale each feature by a real-valued variable and compute this scaling via gradient descent on the leave-one-out bound. One can then keep the features corresponding to the largest scaling variables. The rescaling can be done in the input space or in a principal components space.
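The subset-search version of this idea can be sketched as follows. This is illustrative only: a 1-nearest-neighbor classifier and the raw leave-one-out error stand in for the SVM and its LOO bound, which keeps the example self-contained.

```python
# Feature selection by exhaustive subset search, scored by
# leave-one-out error of a 1-nearest-neighbor classifier.

from itertools import combinations

def loo_error(samples, feature_idx):
    """LOO error of 1-NN restricted to the features in feature_idx."""
    errors = 0
    for i, (x, y) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        # nearest neighbor among the remaining points, on chosen features
        pred = min(rest, key=lambda s: sum((s[0][j] - x[j]) ** 2
                                           for j in feature_idx))[1]
        errors += pred != y
    return errors / len(samples)

def best_subset(samples, n_features, size):
    """Search all subsets of the given size; return the one with lowest LOO error."""
    return min(combinations(range(n_features), size),
               key=lambda idx: loo_error(samples, idx))
```

The gradient-descent scaling variant mentioned above replaces this exhaustive search when the number of features makes enumeration infeasible.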