Knowledge-based Analysis of Microarray Gene Expression Data using Support Vector Machines (PPT Transcript)


1
Knowledge-based Analysis of Microarray Gene
Expression Data using Support Vector Machines
  • Michael P. S. Brown, William Noble Grundy, David
    Lin, Nello Cristianini, Charles Sugnet, Terrence
    S. Furey, Manuel Ares, Jr., and David Haussler

Proceedings of the National Academy of Sciences, 2000
2
Overview
  • Objective: Classify genes based on their function
  • Observation: Genes of similar function yield
    similar expression patterns in microarray
    hybridization experiments
  • Method: Use SVMs to build classifiers from
    microarray gene expression data

3
Previous Methods
  • Most methods at the time of publication employed
    unsupervised learning
  • Genes are grouped by clustering algorithms based
    on a distance measure, e.g.
  • Hierarchical clustering
  • Self-organizing maps

4
Advantages of Supervised Learning and SVMs
  • Supervised methods can take advantage of prior
    knowledge
  • SVMs are well suited to extremely
    high-dimensional feature spaces

5
DNA Microarray Data
  • Each data point is the ratio of the expression
    levels of a particular gene under an experimental
    condition and under a reference condition
  • n genes on a single chip
  • m experiments performed
  • The result is an n-by-m matrix of
    expression-level ratios

[Figure: the n-by-m data matrix; rows are the n genes, columns are the m experiments, and each row is the m-element expression vector of a single gene]
6
DNA Microarray Data
  • Normalized logarithmic ratio
  • For gene X in experiment i, define (see the
    sketch below)
  • Xi = log(Ei / Ri)
  • Ei is the expression level in the experimental
    condition
  • Ri is the expression level in the reference state
  • The expression vector X = (X1, X2, ..., Xm) is
    scaled to unit length
  • Xi is positive when the gene is induced (turned
    up)
  • Xi is negative when the gene is repressed (turned
    down)
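
A minimal sketch of this preprocessing in NumPy. The function name and the toy expression values below are illustrative, not from the paper; the unit-length scaling follows the normalization described above.

```python
import numpy as np

def normalized_log_ratios(E, R):
    """Return one unit-length vector of log expression ratios per gene."""
    X = np.log(E / R)   # positive entries: induced; negative: repressed
    return X / np.linalg.norm(X, axis=1, keepdims=True)

# Toy data: 3 genes (rows) measured in 4 experiments (columns).
E = np.array([[2.0, 4.0, 1.0, 8.0],
              [1.0, 0.5, 0.25, 1.0],
              [3.0, 3.0, 3.0, 3.0]])
R = np.ones_like(E)     # expression levels in the reference state
X = normalized_log_ratios(E, R)
print(X.shape)          # (3, 4): a small n-by-m matrix
```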

7
Support Vector Machines
  • An SVM searches for a hyperplane that
  • Maximizes the margin
  • Minimizes violations of the margin

Edda Leopold and Jörg Kindermann
8
Linear Inseparability
  • What if data points are not linearly separable?

Andrew W. Moore
9
Linear Inseparability
  • Map the data to higher-dimension space

Andrew W. Moore
10
Linear Inseparability
  • Problems with mapping data to a
    higher-dimensional space
  • Overfitting
  • SVMs choose the maximum-margin hyperplane, which
    resists overfitting
  • High computational cost
  • SVM kernels involve only dot products between
    points, so the mapping is never computed
    explicitly (cheap!); see the check below
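
A small numeric check of the last point, using the degree-2 polynomial kernel from the next slide: for 2-D points, (X · Y + 1)^2 equals an ordinary dot product in a 6-D mapped space, yet that 6-D map never has to be built when the kernel is used. The vectors are arbitrary examples.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector."""
    x1, x2 = v
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

kernel = (x @ y + 1) ** 2    # computed in the original 2-D space
explicit = phi(x) @ phi(y)   # computed in the 6-D mapped space
print(kernel, explicit)      # both print 4.0
```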

11
SVM Kernels
  • K(X, Y) is a function that computes a measure of
    similarity between X and Y (the three kernels
    used here are implemented below)
  • Dot product
  • K(X, Y) = X · Y
  • Simplest kernel; gives a linear hyperplane
  • Degree-d polynomial
  • K(X, Y) = (X · Y + 1)^d
  • Gaussian
  • K(X, Y) = exp(-||X - Y||^2 / (2σ^2))
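
The three kernels written out as plain NumPy functions; this is a sketch, and the default parameter values are arbitrary choices, not the ones tuned in the paper.

```python
import numpy as np

def linear_kernel(x, y):
    return x @ y

def polynomial_kernel(x, y, d=3):
    return (x @ y + 1) ** d

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

x, y = np.array([1.0, 0.0]), np.array([0.5, 0.5])
print(linear_kernel(x, y), polynomial_kernel(x, y, d=2), gaussian_kernel(x, y))
```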

12
Experimental Dataset
  • Expression data from budding yeast
    (Saccharomyces cerevisiae)
  • 2467 genes (n)
  • 79 experiments (m)
  • Dataset available on the Stanford web site
  • Six functional classes
  • From the Munich Information Centre for Protein
    Sequences (MIPS) Yeast Genome Database
  • Class definitions come from biochemical and
    genetic studies
  • Training data
  • Positive labels: the set of genes that share a
    common function
  • Negative labels: genes known not to be members
    of that functional class

13
Experimental Design
  • Compare the performance of the following methods
    (a reproduction sketch follows the list)
  • SVM (degree-1 polynomial kernel, i.e. linear)
  • SVM (degree-2 polynomial kernel)
  • SVM (degree-3 polynomial kernel)
  • SVM (Gaussian kernel)
  • Parzen windows
  • Fisher's linear discriminant
  • C4.5 decision trees
  • MOC1 decision trees
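
A hypothetical reproduction sketch of the four SVM arms of this comparison using scikit-learn, which is not the software used in the 2000 paper. The data here are random stand-ins; in the real study X would be the 2467-by-79 matrix of normalized log ratios and y the class-membership labels. Three-fold cross-validation matches the paper's setup.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 79))              # stand-in expression matrix
y = (X[:, :5].sum(axis=1) > 0).astype(int)  # toy functional-class labels

classifiers = {
    "SVM, linear kernel": SVC(kernel="linear"),
    "SVM, degree-2 polynomial": SVC(kernel="poly", degree=2),
    "SVM, degree-3 polynomial": SVC(kernel="poly", degree=3),
    "SVM, Gaussian kernel": SVC(kernel="rbf"),
}
for name, clf in classifiers.items():
    pred = cross_val_predict(clf, X, y, cv=3)   # 3-fold cross-validation
    fp = int(((pred == 1) & (y == 0)).sum())
    fn = int(((pred == 0) & (y == 1)).sum())
    print(f"{name}: fp={fp}, fn={fn}")
```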

14
Experimental Design
  • Define the cost of method M as
  • C(M) = fp(M) + 2·fn(M)
  • False negatives are weighted more heavily because
    the number of true negatives is much larger
  • The cost of each method is compared to
  • C(N), the cost of classifying every gene as
    negative
  • The cost saving of method M is
  • S(M) = C(N) - C(M), sketched in code below
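
The cost and savings measures as a short Python sketch. Classifying everything as negative produces no false positives and turns every class member into a false negative, so C(N) = 2 × (number of positives). The fp/fn counts in the example are made up.

```python
def cost(fp, fn):
    """C(M) = fp(M) + 2*fn(M): false negatives are weighted double."""
    return fp + 2 * fn

def savings(fp, fn, n_positives):
    """S(M) = C(N) - C(M), where C(N) = cost of calling everything negative."""
    return cost(0, n_positives) - cost(fp, fn)

# Hypothetical class with 30 member genes; a method that makes
# 8 false positives and 5 false negatives.
print(cost(8, 5))         # 18
print(savings(8, 5, 30))  # 60 - 18 = 42
```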

15
Experimental Results
  • SVMs outperform the other methods
  • All classifiers fail to recognize the
    helix-turn-helix (HTH) protein class
  • This is expected
  • Members of this class are not similarly
    regulated

16
Consistently Misclassified Genes
  • 20 genes are consistently misclassified by the
    four SVM kernels across different experiments
  • This reflects a mismatch between the expression
    data and class definitions based on protein
    structure
  • Many of the false positives are known to be
    important for the functional class (even though
    they are not annotated as members of the class)