Title: Whole Genome Expression Analysis
1 - Whole Genome Expression Analysis
2 - In this presentation
- Part 1: Gene Expression Microarray Data
- Part 2: Global Expression Sequence Data Analysis
- Part 3: Proteomic Data Analysis
3 - Part 1: Microarray Data
4 - Examples of Problems
Gene sequence problems: given a DNA sequence, state which sections are coding or noncoding regions, which sections are promoters, etc.
Protein structure problems: given a DNA or amino acid sequence, state what structure the resulting protein takes.
Gene expression problems: given DNA/gene microarray expression data, infer either clinical or biological class labels, or the genetic machinery that gives rise to the expression data.
Protein expression problems: study the expression of proteins and their function.
5 - Microarray Technology
Basic idea: the state of the cell is determined by proteins. A gene codes for a protein, which is assembled via mRNA. Measuring the amount of a particular mRNA gives a measure of the amount of the corresponding protein. The number of copies of an mRNA is the expression level of a gene. Microarray technology allows us to measure the expression of thousands of genes at once: we measure the expression of thousands of genes under different experimental conditions and ask what is different and why.
6 - Oligo vs. cDNA Arrays
(Lockhart and Winzeler, 2000)
7 - Format
What is whole genome expression analysis?
Clustering algorithms:
- Hierarchical clustering
- K-means clustering
- Principal component analysis
- Self-organizing maps
Beyond clustering:
- Support vector machines
- Automatic discovery of regulatory patterns in promoter regions
- Bayesian network analysis
8 - What is whole genome expression analysis?
Messenger RNA is only an intermediate of gene expression.
9 - What is whole genome expression analysis? (continued)
Why measure mRNA?
10 - What is whole genome expression analysis? (continued)
11 - Why Separate Feature Selection?
- Most learning algorithms look for non-linear combinations of features, and can easily find many spurious combinations given a small number of records and a large number of genes.
- We first reduce the number of genes by a linear method, e.g. T-values.
- Heuristic: select genes from each class.
- Then apply a favorite machine learning algorithm.
12 - Feature Selection Approach
- Rank genes by a measure and select the top 200-500:
- T-test for mean difference
- Signal-to-noise (S2N)
- Other: information-based, biological?
- Almost any method works well with good feature selection.
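The ranking step above can be sketched as follows. This is an illustrative pure-Python sketch, not code from the slides: each gene is scored by signal-to-noise and the top-k gene indices are kept.

```python
# Sketch of the feature-selection step: rank genes by a linear score
# (here signal-to-noise) and keep the top k. All names are illustrative.

def signal_to_noise(gene, labels):
    """S2N score for one gene: (mean1 - mean2) / (std1 + std2)."""
    g1 = [x for x, y in zip(gene, labels) if y == 1]
    g2 = [x for x, y in zip(gene, labels) if y == 2]
    m1, m2 = sum(g1) / len(g1), sum(g2) / len(g2)
    s1 = (sum((x - m1) ** 2 for x in g1) / len(g1)) ** 0.5
    s2 = (sum((x - m2) ** 2 for x in g2) / len(g2)) ** 0.5
    return (m1 - m2) / (s1 + s2)

def top_genes(expression, labels, k):
    """Rank genes (rows of `expression`) by |S2N|; return top-k row indices."""
    scores = [abs(signal_to_noise(row, labels)) for row in expression]
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
```

In practice k would be the 200-500 mentioned above; a t-statistic can be substituted for the S2N score without changing the structure.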
13 - Heatmap Visualization of Selected Genes
[Heatmap figure: ALL and AML samples against AML-related and ALL-related gene groups]
Heatmap visualization is done by normalizing each gene to mean 0, std. 1 to get a picture like this. Good correlation overall.
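The per-gene normalization used for the heatmap can be sketched as follows (a minimal illustration, not the authors' code): each row of the expression matrix is z-scored so that genes on very different absolute scales share one color map.

```python
# Z-score each gene (row) to mean 0, std 1, as described for the heatmap.

def normalize_genes(expression):
    """Return a copy of `expression` with each row standardized."""
    normalized = []
    for row in expression:
        m = sum(row) / len(row)                                  # gene mean
        s = (sum((x - m) ** 2 for x in row) / len(row)) ** 0.5   # population std
        normalized.append([(x - m) / s for x in row])
    return normalized
```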
14 - Controlling False Positives
Example gene: CD37 antigen

  Expression: 178   105   4174   7133
  Class:        1     1      2      2

Mean difference between classes: T-value = -3.25, significance p = 0.0007.
15 - Controlling False Positives with Randomization
Randomization is less conservative: it preserves the inner structure of the data.

  CD37 antigen:       178   105   4174   7133
  Class:                1     1      2      2
  Randomized class:     2     1      1      2

With the randomized class labels, T-value = -1.1.
16 - Controlling False Positives with Randomization, II
Randomize the class labels 500 times and recompute the T-value for each gene under each randomization.

  Gene:               178   105   4174   7133
  Class:                1     1      2      2
  Randomized class:     2     1      1      2

Bottom 1% T-value: -2.08. Select potentially interesting genes at the 1% level.
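The randomization procedure above can be sketched like this (illustrative code; the exact t-statistic formula used on the slides may differ, so the numbers will not match exactly): shuffle the class labels many times, recompute the T-value each time, and use the bottom tail of this null distribution as the significance cutoff.

```python
# Permutation (randomization) test for one gene: build a null
# distribution of T-values by shuffling class labels.

import random

def t_value(gene, labels):
    """Two-sample t-statistic (unequal-variance form) for one gene."""
    g1 = [x for x, y in zip(gene, labels) if y == 1]
    g2 = [x for x, y in zip(gene, labels) if y == 2]
    m1, m2 = sum(g1) / len(g1), sum(g2) / len(g2)
    v1 = sum((x - m1) ** 2 for x in g1) / (len(g1) - 1)
    v2 = sum((x - m2) ** 2 for x in g2) / (len(g2) - 1)
    return (m1 - m2) / ((v1 / len(g1) + v2 / len(g2)) ** 0.5)

def permutation_threshold(gene, labels, n_perm=500, pct=0.01, seed=0):
    """Bottom-`pct` quantile of T-values under random label shuffles."""
    rng = random.Random(seed)
    null = []
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        null.append(t_value(gene, shuffled))
    null.sort()
    return null[max(0, int(pct * n_perm) - 1)]
```

Genes whose observed T-value falls below `permutation_threshold(...)` would be flagged as potentially interesting at the chosen level.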
17 - Part 2: Microarray Data Classification
18 - Classification
Desired features:
- robust in the presence of false positives
- understandable
- returns confidence/probability
- fast enough
The simplest approaches are the most robust.
19 - Popular Classification Methods
- Decision trees/rules: find the smallest gene sets, but also find false positives
- Neural nets: work well if the number of genes is reduced
- SVM: good accuracy, does its own gene selection, hard to understand
- K-nearest neighbor: robust for small gene sets
- Bayesian nets: simple, robust
20 - Model-building Methodology
- Select the gene set (and other parameters) on a training set.
- When the final model is built, evaluate it on an independent test set that is not used in model building.
21 - Selecting the Best Gene Set
We tested 9 gene sets with 10-fold cross-validation (90 neural nets) and selected the gene set with the lowest average error. Heuristically, use at least 10 genes overall.
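This selection loop can be sketched as follows. This is a hedged illustration: a nearest-centroid classifier stands in for the neural nets used on the slides, and all names are invented for the example. For each candidate gene-set size, the error is estimated by k-fold cross-validation and the size with the lowest average error is kept.

```python
# Pick a gene-set size by cross-validated error, with a simple
# nearest-centroid classifier standing in for the slides' neural nets.

def nearest_centroid(train, test_point):
    """train: list of (features, label). Predict the label of test_point."""
    groups = {}
    for feats, label in train:
        groups.setdefault(label, []).append(feats)
    best_label, best_dist = None, float("inf")
    for label, rows in groups.items():
        c = [sum(col) / len(rows) for col in zip(*rows)]   # class centroid
        d = sum((a - b) ** 2 for a, b in zip(c, test_point))
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label

def cv_error(samples, n_folds):
    """Average error over n_folds interleaved folds."""
    errors = 0
    folds = [samples[i::n_folds] for i in range(n_folds)]
    for i, fold in enumerate(folds):
        train = [s for j, f in enumerate(folds) if j != i for s in f]
        errors += sum(1 for feats, label in fold
                      if nearest_centroid(train, feats) != label)
    return errors / len(samples)

def best_gene_count(samples, ranked_genes, candidate_sizes, n_folds=10):
    """Return the gene-set size with the lowest cross-validated error.
    `ranked_genes` are gene indices sorted by relevance (e.g. |T-value|)."""
    best_size, best_err = None, float("inf")
    for size in candidate_sizes:
        idx = ranked_genes[:size]
        reduced = [([feats[i] for i in idx], label) for feats, label in samples]
        err = cv_error(reduced, n_folds)
        if err < best_err:
            best_size, best_err = size, err
    return best_size
```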
22 - Results on the Test Data
- Evaluation of the 10-genes-per-class model on the test data (34 samples) gives:
- 33 correct predictions (97% accuracy)
- 1 error on sample 66: actual class AML, net prediction ALL, with low net confidence
23 - Classification: Other Applications
- Combining clinical and genetic data
- Outcome/treatment prediction
- Age, sex, and stage of disease are useful
- e.g., if the data comes from a male, it is not ovarian cancer
24 - Multi-Class Classification
- Similar approach:
- select the top genes most correlated to each class
- select the best subset using cross-validation
- build a single model separating all classes
- Advanced:
- build a separate model for each class vs. the rest
- choose the model making the strongest prediction
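The "advanced" one-vs-rest scheme can be sketched as follows (illustrative code; a centroid-margin score stands in for a real per-class classifier): train one model per class against the rest, then predict with whichever model makes the strongest prediction.

```python
# One-vs-rest multi-class sketch: per-class score is how much closer a
# point is to the class centroid than to the rest centroid.

def _centroid(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def train_one_vs_rest(samples, classes):
    """samples: list of (features, label). Return {class: (in, rest) centroids}."""
    models = {}
    for c in classes:
        inside = [f for f, y in samples if y == c]
        rest = [f for f, y in samples if y != c]
        models[c] = (_centroid(inside), _centroid(rest))
    return models

def score(model, x):
    """Margin-like score: distance to rest centroid minus distance to class centroid."""
    c_in, c_out = model
    d_in = sum((a - b) ** 2 for a, b in zip(c_in, x))
    d_out = sum((a - b) ** 2 for a, b in zip(c_out, x))
    return d_out - d_in

def predict(models, x):
    """Choose the class whose one-vs-rest model scores highest."""
    return max(models, key=lambda c: score(models[c], x))
```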
25 - Multi-Class Data Example
- Brain data: Pomeroy et al., Nature 415, Jan 2002
- 42 examples, about 7,000 genes, 5 classes
- Selected the top 100 genes most correlated to each class
- Selected the best subset by testing subsets of 1, 2, ..., 20 genes, with leave-one-out cross-validation for each
26 - Brain Data Results
[Plot of results vs. number of genes (same number from each class)]
27 - Part 3: Classification of Cancer Microarray Data
28 - Cancer Classification
38 examples of myeloid and lymphoblastic leukemias, Affymetrix Human 6800 (7,128 genes including control genes); 34 examples to test the classifier. Results: 33/34 correct. Here d is the perpendicular distance from the hyperplane.
[Plot: d for the test data]
29 - Gene Expression and Coregulation
30 - Nonlinear Classifier
31 - Nonlinear SVM
A nonlinear SVM does not help when using all genes, but it does help when the top genes, ranked by signal-to-noise (Golub et al.), are removed.
32 - Rejections
Golub et al. classified 29 test points correctly and rejected 5, of which 2 were errors, using 50 genes. We need to introduce the concept of rejects to the SVM.
33 - Rejections
34 - Estimating a CDF
35 - The Regularized Solution
36 - Rejections for SVM
[Plot: estimated P(class = 1 | d) as a function of the distance d from the hyperplane]
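One simple way to add a reject option can be sketched as follows. This is an illustration under an assumption: the SVM output d is mapped to a probability-like confidence with a sigmoid, which stands in for the CDF estimate discussed on the slides. The classifier predicts by the sign of d but refuses to predict when the confidence is too close to 1/2.

```python
# SVM-with-reject sketch: map distance d to a confidence, reject when
# the confidence is below a threshold. The sigmoid is an illustrative
# stand-in for an estimated P(class = 1 | d).

import math

def confidence(d, scale=1.0):
    """Map distance-from-hyperplane d to a score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-scale * d))

def classify_with_reject(d, threshold=0.8):
    """Return +1, -1, or 'reject' depending on the confidence at d."""
    p = confidence(d)
    if p >= threshold:
        return 1
    if p <= 1.0 - threshold:
        return -1
    return "reject"
```

Points far from the hyperplane are classified; points in the low-confidence band around it are rejected, as in the results on the next slide.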
37 - Results with Rejections
Results: 31 correct, 3 rejected, of which 1 is an error.
[Plot: d for the test data]
38 - Why Feature Selection
- SVMs as stated use all genes/features.
- Molecular biologists/oncologists seem to be convinced that only a small subset of genes is responsible for particular biological properties, so they want to know which genes are most important in discriminating.
- Practical reasons: a clinical device with thousands of genes is not financially practical.
- Possible performance improvement.
39 - Results with Gene Selection
AML vs. ALL: with 40 genes, 34/34 correct, 0 rejects; with 5 genes, 31/31 correct, 3 rejects of which 1 is an error.
B vs. T cells for ALL: with 10 genes, 33/33 correct, 0 rejects.
40 - Leave-one-out Procedure
41 - The Basic Idea
Use leave-one-out (LOO) bounds for SVMs as a criterion to select features, by searching over all possible subsets of n features for the one that minimizes the bound. When such a search is impossible because of combinatorial explosion, scale each feature by a real-valued variable and compute this scaling via gradient descent on the leave-one-out bound. One can then keep the features corresponding to the largest scaling variables. The rescaling can be done in the input space or in a principal components space.
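The subset-search version of this idea can be sketched as follows. This is illustrative only: a 1-nearest-neighbor classifier and the raw leave-one-out error stand in for the SVM and its LOO bound, which keeps the example self-contained.

```python
# Feature selection by exhaustive subset search, scored by
# leave-one-out error of a 1-nearest-neighbor classifier.

from itertools import combinations

def loo_error(samples, feature_idx):
    """LOO error of 1-NN restricted to the features in feature_idx."""
    errors = 0
    for i, (x, y) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        # nearest neighbor among the remaining points, on chosen features
        pred = min(rest, key=lambda s: sum((s[0][j] - x[j]) ** 2
                                           for j in feature_idx))[1]
        errors += pred != y
    return errors / len(samples)

def best_subset(samples, n_features, size):
    """Search all subsets of the given size; return the one with lowest LOO error."""
    return min(combinations(range(n_features), size),
               key=lambda idx: loo_error(samples, idx))
```

The gradient-descent scaling variant mentioned above replaces this exhaustive search when the number of features makes enumeration infeasible.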