SVM presentation | free to download

Title: SVM

1
Statistical Classification for Gene Analysis
based on Micro-array Data

Fan Li, Yiming Yang
hustlf_at_cs.cmu.edu
In collaboration with Judith Klein-Seetharaman

2
Principles of cDNA microarray
[Figure: cDNA microarray workflow. Labels: DNA clones, PCR purification, robot printing; treated sample, reference, reverse transcription, label with fluorescent dyes; hybridize target to microarray; laser 1, laser 2, excitation, emission; computer analysis. Source: G. Gibson et al.]
3
Microarray data: what does it look like?
Expression matrix
Expression level of a gene across treatments
Expression profiles of genes in a certain condition
Typical examples: heat shock, G phase in cell cycle, etc. (conditions); liver cancer patient, normal person, etc. (samples)
4
AML/ALL micro-array dataset

This dataset can be downloaded from http://genome-www.stanford.edu/clustering
Matrix:
Each row: a gene
Each column: a patient (a sample)
Each patient belongs to one of two disease types: AML (acute myeloid leukemia) or ALL (acute lymphoblastic leukemia)
The 72 patient samples are further divided into a training set (27 ALLs and 11 AMLs) and a test set (20 ALLs and 14 AMLs).
The whole dataset covers 7129 probes from 6817 human genes.
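As a rough illustration of the layout described above, the sketch below builds placeholder matrices with the stated dimensions (7129 probes as rows, 38 training and 34 test samples as columns). The random values and the helper names `random_matrix`, `gene_profile` and `sample_profile` are assumptions made for illustration only; they are not the real data or the authors' code.

```python
import random

N_PROBES = 7129            # rows: one per probe
N_TRAIN, N_TEST = 38, 34   # columns: 72 patient samples in total

# Class labels per sample, using the counts given on the slide.
train_labels = ["ALL"] * 27 + ["AML"] * 11
test_labels = ["ALL"] * 20 + ["AML"] * 14


def random_matrix(n_rows, n_cols, seed):
    """Placeholder expression values; real values would be read from the dataset file."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_cols)] for _ in range(n_rows)]


def gene_profile(matrix, row):
    """Expression of one gene (probe) across all samples: a single row."""
    return matrix[row]


def sample_profile(matrix, col):
    """Expression of all genes in one patient sample: a single column."""
    return [r[col] for r in matrix]


train = random_matrix(N_PROBES, N_TRAIN, seed=0)
test = random_matrix(N_PROBES, N_TEST, seed=1)
```

Storing genes as rows keeps a gene's profile contiguous, which is convenient for the per-gene scoring used in feature selection.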

5
Published work on AML/ALL

Classification task: gene expression -> {AML, ALL}
Techniques: Support Vector Machines (SVM), Rocchio-style and logistic regression classifiers
Main findings: classifiers can get better performance when using a small subset (8) of genes instead of thousands
Implication: many genes are irrelevant or redundant?
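Of the techniques named above, a Rocchio-style classifier is the simplest to sketch. The version below is a simplified nearest-centroid variant, assuming Euclidean distance and a toy dataset over two hypothetical selected genes; it is not the published implementation, and the function names are illustrative.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fit_centroids(samples, labels):
    """Compute one prototype (centroid) per class."""
    by_class = {}
    for vec, lab in zip(samples, labels):
        by_class.setdefault(lab, []).append(vec)
    return {lab: centroid(vecs) for lab, vecs in by_class.items()}


def predict(centroids, vec):
    """Assign the class whose centroid is closest to the sample."""
    return min(centroids, key=lambda lab: euclidean(centroids[lab], vec))


# Toy data (assumed): each sample is a vector over two selected genes.
train_x = [[5.0, 1.0], [4.5, 1.2], [1.0, 4.0], [0.8, 4.4]]
train_y = ["ALL", "ALL", "AML", "AML"]
model = fit_centroids(train_x, train_y)
```

With only a handful of selected genes, each sample is a short vector, so a prototype per class is cheap to compute and easy to inspect.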

6
Possible Relationship (Hypothesis)
7
How can we find such a structure?

Find the most informative genes (primary ones)
Statistical feature selection (brief)
Find the genes related (or similar) to the primary ones
Unsupervised clustering (detailed)
based on statistical patterns of genes distributed over microarrays
Bayes network for causal reasoning (future direction)

8
Possible Relationship (Hypothesis)
[Figure; only the label "disease" survives]
9
Feature selection

Feature selection
Choose a small subset of input variables (a few instead of 7000 genes, for example)
In text categorization
Features: words in documents
Output variables: subject categories of a document
In protein classification
Features: amino acid motifs
Output variables: protein categories
In genome micro-array data
Features: useful genes
Output variables: whether a patient is diseased or not

10
Feature selection on micro-array (AML vs ALL)

Golub-Slonim: GS-ranking (filtering method)
Ben-Dor: TNoM-ranking (filtering method)
Isabelle Guyon: recursive SVM (wrapper method)
Selected 8 genes (out of 1000 in that dataset)
Accuracy: 100%
Our work (Fan & Yiming) (best)
Selected 3 genes (using ridge regression)
Accuracy: 100%
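To make the filtering idea concrete, here is a minimal sketch of a GS-style ranking, assuming the signal-to-noise score (mean_ALL - mean_AML) / (std_ALL + std_AML) and ranking by its absolute value. The use of the sample standard deviation, the zero-denominator guard, the function names and the toy matrix are all assumptions for illustration, not details taken from the slides.

```python
from statistics import mean, stdev


def signal_to_noise(values_a, values_b):
    """Difference of class means divided by the sum of class standard deviations."""
    denom = stdev(values_a) + stdev(values_b)
    if denom == 0:
        return 0.0  # assumed convention for constant genes
    return (mean(values_a) - mean(values_b)) / denom


def rank_genes(matrix, labels, class_a="ALL", class_b="AML"):
    """Return gene (row) indices sorted from most to least discriminative."""
    scores = []
    for gene_idx, row in enumerate(matrix):
        a = [v for v, lab in zip(row, labels) if lab == class_a]
        b = [v for v, lab in zip(row, labels) if lab == class_b]
        scores.append((abs(signal_to_noise(a, b)), gene_idx))
    return [idx for _, idx in sorted(scores, reverse=True)]


# Toy expression matrix (assumed values): rows are genes, columns are samples.
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
matrix = [
    [1.0, 1.2, 0.8, 1.0, 1.2, 0.8],  # gene 0: identical in both classes
    [5.0, 5.2, 4.9, 1.0, 1.1, 0.9],  # gene 1: strong class separation
    [2.0, 3.0, 1.0, 2.5, 3.5, 1.5],  # gene 2: small shift, high variance
]
ranking = rank_genes(matrix, labels)
```

A filter like this scores each gene independently of any classifier, which is what distinguishes it from a wrapper method such as recursive SVM.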

11
Feature selection experiments already done on this micro-array data

The 3 genes we found:
Id 1882: CST3, Cystatin C (amyloid angiopathy and cerebral hemorrhage), M27891_at
Id 6201: INTERLEUKIN-8 PRECURSOR, Y00787_at
Id 4211: VIL2, Villin 2 (ezrin), X51521_at

12
Some analysis of the results we got

The first two genes are strongly correlated with each other.
The third gene is very different from the first two genes.
1st gene + 2nd gene is bad (10/34 errors)
1st gene + 3rd gene is good (1/34 error)
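The correlation claim above can be checked with a Pearson coefficient between two expression profiles. The sketch below assumes Pearson correlation as the measure and uses made-up profiles (`gene1`, `gene2`, `gene3`), not the real values of the three selected genes.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical expression profiles across five samples.
gene1 = [1.0, 2.0, 3.0, 4.0, 5.0]
gene2 = [1.1, 2.1, 2.9, 4.2, 4.9]  # tracks gene1 closely
gene3 = [3.0, 1.0, 4.0, 1.5, 3.5]  # follows a different pattern

r12 = pearson(gene1, gene2)
r13 = pearson(gene1, gene3)
```

Two strongly correlated genes carry largely the same information, which is one plausible reading of why pairing the first gene with the second helps less than pairing it with the third.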

13
Question: as the next step, can we find more gene-gene relationships?

Several techniques are available:
Clustering
Bayesian network learning
Independent component analysis

14
Clustering Analysis in micro-array data

Clustering methods are already widely used to find similar genes or common binding sites from micro-array data.
Many different clustering algorithms exist:
Hierarchical clustering
K-means
SOM
CAST
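As one example from the list above, k-means can be sketched in a few lines. This is a minimal version assuming squared Euclidean distance, random initial centers drawn from the data, and a fixed number of iterations; the toy profiles and the function names are illustrative and say nothing about the settings used in the experiments.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(profiles, k, n_iter=20, seed=0):
    """Return a cluster index for each profile (one profile per gene)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(profiles, k)]
    assignment = [0] * len(profiles)
    for _ in range(n_iter):
        # Assignment step: each gene goes to its nearest center.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in profiles
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(profiles, assignment) if a == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Toy gene profiles (assumed): two low-expression and two high-expression genes.
profiles = [
    [1.0, 1.1, 0.9], [1.2, 1.0, 1.1],
    [5.0, 5.2, 4.8], [5.1, 4.9, 5.0],
]
clusters = k_means(profiles, k=2)
```

The cluster indices themselves are arbitrary; only which genes share an index is meaningful.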

15
An example of hierarchical clustering analysis (from Spellman et al.)
16
Our clustering experiment on AML/ALL dataset

Our clustering was run over the top 1000 genes most relevant to the disease.

17
The feature-selection curve
18
Our clustering result in the top 1000 genes
19
Some analysis of the clustering result

The first two genes are always clustered together (cluster 1 in hierarchical clustering, cluster 2 in k-means clustering).
The third gene is never clustered with the first two genes (cluster 23 in hierarchical clustering, cluster 1 in k-means clustering).
This validates our previous analysis.

20
Disadvantages of Clustering

However:
It cannot reveal the internal relationships inside one cluster.
It cannot find the relationships between clusters.
Genes connected to each other may not be in the same cluster.
Clustering vs Bayesian network learning (copied from David K. Gifford, Science, Vol. 293, Sept. 2001)

21
A counterexample to clustering analysis
22
Bayesian network learning

Thus a Bayesian network seems a much better technique if we want to model the relationships among genes.
Researchers have done experiments and constructed Bayesian networks from micro-array data.
They found that a few genes have many connections with other genes.
They used prior biological knowledge to validate their learned edges (interactions between genes) and found them reasonable.
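For reference, a Bayesian network over genes factorizes the joint distribution into one local term per gene, conditioned on that gene's parents in the graph. The notation below is introduced here for illustration (X_i is the expression variable of gene i, Pa(X_i) its parent set) and does not appear in the slides.

```latex
P(X_1, \dots, X_n) = \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)
```

Each learned edge corresponds to one gene appearing in another gene's parent set, which is why a directed graph can express dependencies that a flat cluster assignment cannot.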

23
An example of a Bayesian network

Part of the Bayesian network Nir Friedman constructed. There are 800 genes (nodes) in total in the graph. These 800 genes are all cell-cycle regulated genes.

25
Our plan in genetic regulatory network
construction

There are several possible ways:
Using feature selection techniques to make the network learning task more robust and less computationally costly.
Learning gene regulatory networks on microarray datasets with disease labels (thus we may find pathways relevant to a specific disease).
Using ICA to find hidden variables (hidden layers) and checking their consistency with the Bayes network learning result.

26
Our plan in genetic regulatory network
construction

Use prior biological knowledge in gene networks, like the network motifs. The following example is copied from Shai S. Shen-Orr, Nature Genetics, 2002. Previous network learning algorithms have not considered those characteristics.

28
References

Using Bayesian networks to analyze expression data. Friedman, N., Linial, M., and Nachman, I. Journal of Computational Biology, 7:601-620, 2000.
Gene selection for cancer classification using support vector machines. Guyon, I. et al. Machine Learning, 46:389-422.
Cluster analysis and display of genome-wide expression patterns. Eisen, M.B. et al. PNAS, 95:14863-14868, 1998.
Clustering gene expression patterns. Ben-Dor, A., Shamir, R., and Yakhini, Z. Journal of Computational Biology, 6(3/4):281-297, 1999.
