Whole Genome Expression Analysis - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
  • Whole Genome Expression Analysis

2
In this presentation
  • Part 1: Gene Expression Microarray Data
  • Part 2: Global Expression Sequence Data Analysis
  • Part 3: Proteomic Data Analysis

3
Part 1
Microarray Data
4
Examples of Problems
Gene sequence problems: given a DNA sequence,
state which sections are coding or noncoding
regions, which sections are promoters, etc.
Protein structure problems: given a DNA or amino
acid sequence, state what structure the resulting
protein takes.
Gene expression problems: given DNA/gene
microarray expression data, infer either clinical
or biological class labels, or the genetic machinery
that gives rise to the expression data.
Protein expression problems: study the expression
of proteins and their function.
5
Microarray Technology
Basic idea: the state of the cell is determined
by proteins. A gene codes for a protein, which is
assembled via mRNA. Measuring the amount of a
particular mRNA gives a measure of the amount of
the corresponding protein. The number of mRNA
copies is the expression level of a gene.
Microarray technology allows us to measure the
expression of thousands of genes at once: measure
the expression of thousands of genes under
different experimental conditions and ask what is
different and why.
6
Oligo vs cDNA arrays
Lockhart and Winzeler 2000
7
Format
  • What is whole genome expression analysis?
  • Clustering algorithms
    - Hierarchical clustering
    - K-means clustering
    - Principal component analysis
    - Self-organizing maps
  • Beyond clustering
    - Support vector machines
    - Automatic discovery of regulatory patterns in promoter regions
    - Bayesian network analysis
8
What is whole genome expression analysis?
Messenger RNA is only an intermediate of gene
expression
9
What is whole genome expression analysis?
(continued)
Why measure mRNA?
10
What is whole genome expression analysis?
(continued)
11
Why Separate Feature Selection?
  • Most learning algorithms look for non-linear
    combinations of features -- they can easily find
    many spurious combinations given a small number
    of records and a large number of genes
  • We first reduce the number of genes by a linear
    method, e.g. T-values
  • Heuristic: select genes from each class
  • Then apply a favorite machine learning algorithm

12
Feature selection approach
  • Rank genes by a measure; select the top 200-500
  • T-test for mean difference
  • Signal to noise (S2N)
  • Others: information-based, biological?
  • Almost any method works well with good feature
    selection
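Both rankings can be sketched in a few lines. This is an illustrative sketch, not the original code: the function name `rank_genes`, the data layout, and the `top_k` default are assumptions, while the two scoring formulas (S2N and the unpooled t-statistic) are standard.

```python
import numpy as np

def rank_genes(X, y, method="s2n", top_k=200):
    """Rank genes by a simple two-class separation score.

    X : (n_samples, n_genes) expression matrix (hypothetical data)
    y : binary class labels (0/1)
    Returns indices of the top_k highest-scoring genes.
    """
    a, b = X[y == 0], X[y == 1]
    mu_a, mu_b = a.mean(axis=0), b.mean(axis=0)
    sd_a, sd_b = a.std(axis=0, ddof=1), b.std(axis=0, ddof=1)
    if method == "s2n":
        # signal-to-noise ratio: (mu1 - mu2) / (sd1 + sd2)
        score = (mu_a - mu_b) / (sd_a + sd_b)
    else:
        # unpooled two-sample t-statistic
        score = (mu_a - mu_b) / np.sqrt(sd_a**2 / len(a) + sd_b**2 / len(b))
    # rank by absolute score, largest first
    return np.argsort(-np.abs(score))[:top_k]
```

After ranking, the selected columns of `X` are handed to whatever classifier follows, as the slide suggests.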

13
Heatmap Visualization of selected fields
Heatmap visualization is done by normalizing each
gene to mean 0, std. 1 to get a picture like
this. Good correlation overall.
[Heatmap figure: AML-related and ALL-related genes across the ALL and AML samples]
14
Controlling False Positives

  CD37 antigen:  178  105  4174  7133
  Class:           1    1     2     2

Mean difference between classes: T-value = -3.25,
significance p = 0.0007
15
Controlling False Positives with Randomization

  CD37 antigen:      178  105  4174  7133
  Class:               1    1     2     2
  Randomized class:    2    1     1     2

After randomizing the class labels: T-value = -1.1.
Randomization is less conservative and preserves
the inner structure of the data.
16
Controlling false positives with randomization, II

  Gene:        178  105  4174  7133
  Class:         1    1     2     2
  Rand class:    2    1     1     2

Randomize the class labels 500 times. Bottom 1%:
T-value = -2.08. Select potentially interesting
genes at the 1% level.
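The randomization procedure can be sketched as below. The data here are synthetic stand-ins, and the unpooled t-statistic is an assumption (the slides do not specify which t formula was used):

```python
import numpy as np

def t_value(x, labels):
    """Unpooled two-sample t-statistic between label groups 1 and 2."""
    a, b = x[labels == 1], x[labels == 2]
    return (a.mean() - b.mean()) / np.sqrt(
        a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))

def permutation_threshold(x, labels, n_perm=500, pct=1.0, seed=0):
    """Shuffle the class labels n_perm times and return the bottom-pct
    percentile of the null t-values (cf. the slide's 'bottom 1%').
    A gene whose observed t-value falls below this threshold is a
    candidate at the pct% level."""
    rng = np.random.default_rng(seed)
    null_ts = []
    for _ in range(n_perm):
        # permuting labels preserves the inner structure of the data
        perm = rng.permutation(labels)
        null_ts.append(t_value(x, perm))
    return np.percentile(null_ts, pct)
```

Because only the labels are shuffled, each gene's value distribution is untouched, which is what makes the randomized null less conservative than, say, a Bonferroni correction.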
17
Part2
Microarray Data Classification
18
Classification
  • Desired features:
  • robust in the presence of false positives
  • understandable
  • returns confidence/probability
  • fast enough
  • The simplest approaches are the most robust

19
Popular Classification Methods
  • Decision trees/rules
  • find the smallest gene sets, but also false positives
  • Neural nets
  • work well if the number of genes is reduced
  • SVMs
  • good accuracy, does its own gene selection, hard
    to understand
  • K-nearest neighbor - robust for a small number of genes
  • Bayesian nets - simple, robust

20
Model-building Methodology
  • Select the gene set (and other parameters) on a
    training set.
  • When the final model is built, evaluate it on an
    independent test set that was not used in
    model building.

21
Selecting the best gene set
We tested 9 sets with 10-fold cross-validation (90
neural nets) and selected the gene set with the
lowest average error.
Heuristically, use at least 10 genes overall.
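The selection loop can be sketched as follows, with a nearest-centroid classifier standing in for the slides' neural nets and synthetic data in place of the real expression matrix:

```python
import numpy as np

def cv_error(X, y, n_folds=10, seed=0):
    """10-fold cross-validated error rate of a nearest-centroid
    classifier (a simple stand-in for the slides' neural nets)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    errors = 0
    for fold in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, fold)
        c0 = X[train][y[train] == 0].mean(axis=0)
        c1 = X[train][y[train] == 1].mean(axis=0)
        # predict the class of the nearer centroid
        d0 = np.linalg.norm(X[fold] - c0, axis=1)
        d1 = np.linalg.norm(X[fold] - c1, axis=1)
        pred = (d1 < d0).astype(int)
        errors += (pred != y[fold]).sum()
    return errors / len(y)

def pick_gene_set(X, y, candidate_sets):
    """Return the candidate gene set with the lowest average CV error."""
    return min(candidate_sets, key=lambda genes: cv_error(X[:, genes], y))
```

The key point, as the next slide stresses, is that this entire loop runs on the training set only; the held-out test set is touched once, at the end.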
22
Results on the test data
  • Evaluation of the 10 genes per class on the test
    data (34 samples) gives:
  • 33 correct predictions (97% accuracy),
  • 1 error, on sample 66
  • Actual class AML, net prediction ALL
  • net confidence was low

23
Classification - other applications
  • Combining clinical and genetic data
  • Outcome / treatment prediction
  • Age, sex, and stage of disease are useful
  • e.g. if the data are from a male, rule out ovarian cancer

24
Classification: Multi-Class
  • Similar approach:
  • select the top genes most correlated to each class
  • select the best subset using cross-validation
  • build a single model separating all classes
  • Advanced:
  • build a separate model for each class vs. the rest
  • choose the model making the strongest prediction
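The "advanced" one-vs-rest scheme can be sketched as follows. The centroid-distance score is an illustrative stand-in, since the slide does not fix a particular per-class model:

```python
import numpy as np

def one_vs_rest_predict(X_train, y_train, x_new):
    """Build one 'class vs. rest' score per class and return the class
    whose model makes the strongest prediction. Each 'model' here is
    a centroid-distance margin (hypothetical choice of classifier)."""
    scores = {}
    for c in np.unique(y_train):
        in_c = X_train[y_train == c].mean(axis=0)
        rest = X_train[y_train != c].mean(axis=0)
        # higher score = x_new is closer to class c than to the rest
        scores[c] = (np.linalg.norm(x_new - rest)
                     - np.linalg.norm(x_new - in_c))
    return max(scores, key=scores.get)
```

With real per-class classifiers (e.g. one SVM per class), the same structure applies: train each on "class c vs. everything else" and take the argmax of the decision values.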

25
Multi-class Data Example
  • Brain data, Pomeroy et al 2002, Nature (415), Jan
    2002
  • 42 examples, about 7,000 genes, 5 classes
  • Selected the top 100 genes most correlated to each
    class
  • Selected the best subset by testing subsets of
    1, 2, ..., 20 genes, with leave-one-out
    cross-validation for each

26
Brain data results
[Chart: results vs. number of genes (same from each class)]
27
Part3
Classification of Cancer Microarray Data
28
Cancer Classification
38 examples of myeloid and lymphoblastic
leukemias; Affymetrix human 6800 chip (7,128 genes
including control genes); 34 examples to test the
classifier. Results: 33/34 correct.
[Figure: test data plotted by d, the perpendicular
distance from the separating hyperplane]
29
Gene expression and Coregulation
30
Nonlinear classifier
31
Nonlinear SVM
Nonlinear SVM does not help when using all genes
but does help when removing top genes, ranked by
Signal to Noise (Golub et al).
32
Rejections
Using 50 genes, Golub et al. classified 29 test
points correctly and rejected 5, of which 2 were
errors. We need to introduce the concept of
rejects to the SVM.
33
Rejections
34
Estimating a CDF
35
The Regularized Solution
36
Rejections for SVM
[Plot: P(c = 1 | d) as a function of the distance d]
37
Results with rejections
Results: 31 correct, 3 rejected, of which 1 is an
error.
[Figure: test data plotted by distance d from the hyperplane]
38
Why Feature Selection
  • SVMs as stated use all genes/features
  • Molecular biologists/oncologists seem to be
    convinced that only a small subset of genes is
    responsible for particular biological properties,
    so they want to know which genes are most
    important in discriminating
  • Practical reasons: a clinical device with
    thousands of genes is not financially practical
  • Possible performance improvement

39
Results with Gene Selection
AML vs ALL: 40 genes, 34/34 correct, 0 rejects.
5 genes: 31/31 correct, 3 rejects, of which
1 is an error.
B cells vs T cells for ALL: 10 genes, 33/33 correct,
0 rejects.
40
 
Leave-one-out Procedure
 
41
 
The Basic Idea
Use leave-one-out (LOO) bounds for SVMs as a
criterion for selecting features, by searching over
all possible subsets of n features for the one
that minimizes the bound. When such a search is
impossible because of combinatorial explosion,
scale each feature by a real-valued variable and
compute this scaling via gradient descent on the
leave-one-out bound. One can then keep the
features corresponding to the largest scaling
variables. The rescaling can be done in the
input space or in a principal components space.
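A minimal sketch of the exhaustive-search variant, using 1-nearest-neighbor LOO error as a cheap stand-in for the SVM LOO bound, on synthetic data (illustrative only; the gradient-descent scaling variant is not shown):

```python
from itertools import combinations
import numpy as np

def loo_error(X, y):
    """Leave-one-out error of a 1-nearest-neighbor classifier -- a
    cheap surrogate for the SVM LOO bounds described above."""
    errors = 0
    for i in range(len(y)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf  # exclude the held-out point itself
        errors += y[np.argmin(d)] != y[i]
    return errors / len(y)

def best_feature_subset(X, y, k):
    """Exhaustive search over all size-k feature subsets for the one
    minimizing LOO error -- feasible only when the number of features
    is small, which is exactly the combinatorial-explosion caveat."""
    return min(combinations(range(X.shape[1]), k),
               key=lambda s: loo_error(X[:, list(s)], y))
```

When the subset search is infeasible, the text's alternative is to multiply each feature by a learnable scale, minimize the LOO bound over those scales by gradient descent, and keep the features with the largest scales.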
 