Title: SVM
 1Statistical Classification for Gene Analysis 
based on Micro-array Data
- Fan Li, Yiming Yang
 - hustlf_at_cs.cmu.edu 
 - In collaboration with Judith Klein-Seetharaman 
 
  2Principles of cDNA microarray
[Figure from G. Gibson et al.: cDNA microarray workflow. Labels: DNA clones; PCR purification; robot printing; treated sample; reference; reverse transcription; label with fluorescent dyes; hybridize target to microarray; excitation by laser 1 and laser 2; emission; computer analysis.]
 3Microarray data: what does it look like?
- Expression matrix
 - Expression level of a gene across treatments
 - Expression profiles of genes in a certain condition
 - Typical conditions: heat shock, G phase in cell cycle, etc.
 - Typical samples: liver cancer patient, normal person, etc.
 4AML/ALL micro-array dataset
- This dataset can be downloaded from http://genome-www.standford.edu/clustering
 - Matrix
 - Each row: a gene
 - Each column: a patient (a sample)
 - Each patient belongs to one of two disease types: AML (acute myeloid leukemia) or ALL (acute lymphoblastic leukemia)
 - The 72 patient samples are further divided into a training set (27 ALLs and 11 AMLs) and a test set (20 ALLs and 14 AMLs). The whole dataset covers 7129 probes from 6817 human genes.
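The layout described above can be sketched as follows. This is only an illustration with random placeholder values, not the real dataset; the variable names are hypothetical, and only the counts (7129 probes, 27/11 training and 20/14 test samples) come from the slide.

```python
import random

# Counts taken from the slide; everything else is a synthetic stand-in.
N_PROBES = 7129
TRAIN_COUNTS = {"ALL": 27, "AML": 11}
TEST_COUNTS = {"ALL": 20, "AML": 14}

def make_labels(counts):
    """Expand {'ALL': 27, 'AML': 11} into a per-sample label list."""
    return [label for label, n in counts.items() for _ in range(n)]

train_labels = make_labels(TRAIN_COUNTS)
test_labels = make_labels(TEST_COUNTS)
labels = train_labels + test_labels

rng = random.Random(0)
# matrix[g][s] = expression level of probe g in sample s (random placeholders).
matrix = [[rng.gauss(0.0, 1.0) for _ in labels] for _ in range(N_PROBES)]

gene_profile = matrix[0]                     # one row: a gene across all patients
sample_profile = [row[0] for row in matrix]  # one column: all genes in one patient

print(len(matrix), len(matrix[0]))           # number of probes, number of samples
```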
  5Published work on AML/ALL
- Classification task: gene expression -> AML, ALL
 - Techniques: Support Vector Machines (SVM), Rocchio-style and logistic regression classifiers
 - Main findings: classifiers can perform better when using a small subset (8) of genes instead of thousands
 - Implication: many genes are irrelevant or redundant?
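As a rough illustration of one of the classifier families listed above, here is a simplified Rocchio-style (nearest-centroid) classifier. It is a sketch under assumptions, not the published setup: Euclidean distance is used for simplicity, and the two-gene toy data are invented.

```python
import math

def centroid(vectors):
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def train_centroids(samples, labels):
    """Return {class label: centroid of that class's training samples}."""
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    return {y: centroid(xs) for y, xs in by_class.items()}

def predict(centroids, x):
    """Assign x to the class whose centroid is nearest (Euclidean distance)."""
    return min(centroids, key=lambda y: math.dist(centroids[y], x))

# Toy data (invented): each sample holds the expression of 2 selected genes.
train_x = [[5.0, 1.0], [6.0, 1.5], [1.0, 4.0], [0.5, 5.0]]
train_y = ["AML", "AML", "ALL", "ALL"]
model = train_centroids(train_x, train_y)
print(predict(model, [5.5, 0.8]), predict(model, [0.8, 4.2]))
```

With only a handful of genes as features, each class is summarized by a single short vector, which is one reason a small gene subset can be enough for a classifier of this kind.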
  6Possible Relationship (Hypothesis) 
 7How can we find such a structure?
- Find the most informative genes (primary ones)
 - Statistical feature selection (brief)
 - Find the genes related (or similar) to the primary ones
 - Unsupervised clustering (detailed), based on statistical patterns of genes distributed over microarrays
 - Bayes network for causal reasoning (future direction)
  8Possible Relationship (Hypothesis)
disease 
 9Feature selection 
- Feature selection
 - Choose a small subset of input variables (a few instead of 7000 genes, for example)
 - In text categorization
 - Features: words in documents
 - Output variables: subject categories of a document
 - In protein classification
 - Features: amino acid motifs
 - Output variables: protein categories
 - In genome micro-array data
 - Features: useful genes
 - Output variables: whether a patient is diseased or not
 
  10Feature selection on micro-array (AML vs ALL)
- Golub-Slonim: GS-ranking (filtering method)
 - Ben-Dor: TNoM-ranking (filtering method)
 - Isabelle Guyon: recursive SVM (wrapper method)
 - Selected 8 genes (out of 1000 in that dataset)
 - Accuracy: 100%
 - Our work (Fan and Yiming) (best)
 - Selected 3 genes (using ridge regression)
 - Accuracy: 100%
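A filtering method such as GS-ranking scores each gene independently of the others. The sketch below uses the signal-to-noise form commonly attributed to Golub et al., (mean1 - mean2) / (std1 + std2); the function names and toy matrix are hypothetical, and a gene with zero spread in both classes would need special handling that is omitted here.

```python
from statistics import mean, stdev

def gs_score(values, labels, pos="AML", neg="ALL"):
    """Signal-to-noise score of one gene: (mean_pos - mean_neg) / (sd_pos + sd_neg)."""
    pos_vals = [v for v, y in zip(values, labels) if y == pos]
    neg_vals = [v for v, y in zip(values, labels) if y == neg]
    return (mean(pos_vals) - mean(neg_vals)) / (stdev(pos_vals) + stdev(neg_vals))

def rank_genes(matrix, labels, top_k):
    """Return indices of the top_k genes by absolute score (rows are genes)."""
    scores = [(abs(gs_score(row, labels)), g) for g, row in enumerate(matrix)]
    scores.sort(reverse=True)
    return [g for _, g in scores[:top_k]]

# Toy matrix (invented): 3 genes x 6 samples.
labels = ["AML", "AML", "AML", "ALL", "ALL", "ALL"]
matrix = [
    [9.0, 10.0, 11.0, 1.0, 2.0, 3.0],  # gene 0: high in AML
    [5.0, 1.0, 9.0, 4.0, 8.0, 2.0],    # gene 1: uninformative
    [1.0, 2.0, 1.5, 7.0, 8.0, 7.5],    # gene 2: high in ALL
]
print(rank_genes(matrix, labels, top_k=2))
```

Because each gene is scored on its own, a filter of this kind cannot tell that two top-ranked genes carry the same information, which is where wrapper methods differ.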
 
  11Feature selection experiments already done on this micro-array data
- The 3 genes we found
 - Id1882: CST3 Cystatin C (amyloid angiopathy and cerebral hemorrhage), M27891_at
 - Id6201: INTERLEUKIN-8 PRECURSOR, Y00787_at
 - Id4211: VIL2 Villin 2 (ezrin), X51521_at
 
  12Some analysis of the results we got
- The first two genes are strongly correlated with each other.
 - The third gene is very different from the first two genes.
 - 1st gene + 2nd gene is bad (10/34 errors)
 - 1st gene + 3rd gene is good (1/34 error)
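One way to check how strongly two genes are correlated is the Pearson correlation of their expression profiles. The slide does not say which measure was used, so Pearson is an assumption here, and the three profiles below are invented.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Invented profiles over 5 samples.
gene1 = [1.0, 2.0, 3.0, 4.0, 5.0]
gene2 = [2.1, 3.9, 6.2, 8.0, 9.8]  # nearly a multiple of gene1
gene3 = [3.0, 1.0, 4.0, 1.0, 5.0]  # a different pattern

print(round(pearson(gene1, gene2), 3), round(pearson(gene1, gene3), 3))
```

A pair like gene1 and gene2 adds little beyond either gene alone, which is consistent with the observation that pairing the first gene with the third works much better than pairing it with the second.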
 
  13Question: as the next step, can we find more gene-gene relationships?
- Several techniques available 
 - Clustering 
 - Bayesian network learning 
 - Independent component analysis 
 -  
 
  14Clustering Analysis in micro-array data
- Clustering methods have already been widely used to find similar genes or common binding sites from micro-array data.
 - Many different clustering algorithms are available:
 - Hierarchical clustering 
 - K-means 
 - SOM 
 - CAST 
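As an illustration of one algorithm from this list, here is a minimal k-means sketch that clusters gene expression profiles. It is a plain textbook version under assumptions (Euclidean distance, random initial centers, a fixed number of iterations), not the configuration used in the experiments; the toy profiles are invented.

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Cluster points into k groups; returns (assignments, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k distinct points as initial centers
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        assign = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centers

# Invented profiles: 4 genes measured under 3 conditions.
profiles = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 2.9],
    [9.0, 8.0, 7.0],
    [9.2, 7.9, 7.1],
]
assign, centers = kmeans(profiles, k=2)
print(assign)
```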
 
  15An example of hierarchical clustering analysis (from Spellman et al.)
 16Our clustering experiment on AML/ALL dataset
- Our clustering was run on the top 1000 genes most relevant to the disease.
  17The feature-selection curve 
 18Our clustering result in the top 1000 genes 
 19Some analysis of the clustering result
- The first two genes are always placed in the same cluster (cluster 1 in hierarchical clustering, cluster 2 in k-means clustering).
 - The third gene is never placed in the same cluster as the first two genes (it is in cluster 23 in hierarchical clustering and in cluster 1 in k-means clustering).
 - This validates our previous analysis.
 
  20Disadvantage of Clustering 
- However
 - It cannot find the internal relationships inside one cluster
 - It cannot find the relationships between clusters
 - Genes connected to each other may not be in the same cluster.
 - Clustering vs Bayesian network learning (copied from David K. Gifford, Science, vol. 293, Sept. 2001)
  21A counterexample of clustering analysis
 22Bayesian network learning
- Thus a Bayesian network seems a much better technique if we want to model the relationships among genes.
 - Researchers have done experiments and constructed Bayesian networks from micro-array data.
 - They found that a few genes have many connections with other genes.
 - They used prior biological knowledge to validate their learned edges (interactions between genes) and found them reasonable.
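Full Bayesian network structure learning is beyond a short example, but a pairwise dependence score gives a feel for what a learned edge is meant to capture. The sketch below discretizes expression values and computes mutual information between two genes. This is a simpler stand-in technique, not the network learning procedure itself, and the thresholds of -0.5 and 0.5 and the toy profiles are assumptions chosen for illustration.

```python
import math
from collections import Counter

def discretize(values, low=-0.5, high=0.5):
    """Map each value to -1 (under-expressed), 0 (baseline) or 1 (over-expressed)."""
    return [-1 if v < low else 1 if v > high else 0 for v in values]

def mutual_information(x, y):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log2((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )

# Invented log-ratio profiles over 8 experiments.
gene_a = [-1.2, -0.8, 0.1, 0.9, 1.3, -0.9, 1.1, 0.0]
gene_b = [-1.0, -0.7, 0.2, 1.0, 0.8, -1.1, 0.9, -0.1]  # follows gene_a
gene_c = [0.1, 0.0, -0.2, 0.3, 0.2, -0.1, 0.0, 0.1]    # stays at baseline

a, b, c = discretize(gene_a), discretize(gene_b), discretize(gene_c)
print(mutual_information(a, b), mutual_information(a, c))
```

Unlike a cluster label, a score of this kind is defined for every pair of genes, so it can point to a dependence between two genes that end up in different clusters.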
  23An example of the Bayesian network
- Part of the Bayesian network that Nir Friedman constructed. There are 800 genes (nodes) in total in the graph, all of them cell-cycle regulated genes.
 25Our plan in genetic regulatory network 
construction
- There are several possible ways
 - Using feature selection techniques to make the network learning task more robust and less computationally costly.
 - Learning gene regulatory networks on microarray datasets with disease labels (thus we may find pathways relevant to a specific disease).
 - Using ICA to find hidden variables (hidden layers) and checking their consistency with the Bayes network learning result.
  26Our plan in genetic regulatory network 
construction
- Use prior biological knowledge about gene networks, such as network motifs. The following example is copied from Shai S. Shen-Orr, Nature Genetics, 2002. Previous network learning algorithms have not considered these characteristics.
 28References
- Using Bayesian networks to analyze expression data. Friedman, N., Linial, M., Nachman, I. Journal of Computational Biology, 7:601-620, 2000.
 - Gene selection for cancer classification using support vector machines. Guyon, I. et al. Machine Learning, 46:389-422.
 - Cluster analysis and display of genome-wide expression patterns. Eisen, M.B. et al. PNAS, 95:14863-14868, 1998.
 - Clustering gene expression patterns. Ben-Dor, A., Shamir, R., and Yakhini, Z. Journal of Computational Biology, 6(3/4):281-297, 1999.