Classifying Gene Expression Profiles from Pairwise mRNA Comparisons - PowerPoint PPT Presentation

About This Presentation
Title:

Classifying Gene Expression Profiles from Pairwise mRNA Comparisons

Description:

The problem of Molecular classification. The TSP classifier. Results on three ... Molecular ... Data: G x n matrix, G is number of genes, n is number of samples, ... – PowerPoint PPT presentation

Slides: 20
Provided by: mts3

Transcript and Presenter's Notes

1
Classifying Gene Expression Profiles from
Pairwise mRNA Comparisons
  • Yan Qi
  • Biomedical Engineering Department
  • 10/3/2006

2
Outline
  • The problem of molecular classification
  • The TSP classifier
  • Results on three cancer datasets
  • Extensions of the TSP classifier

3
Molecular Classification
  • Objective: predict class labels, e.g. cancer
    subtypes or disease states, using gene expression
    profiles
  • Data: a G x n matrix, where G is the number of
    genes and n is the number of samples; each column
    is a gene expression profile

4
Mathematical formulation
  • Gene expression profile: X = (X1, X2, ..., XG)
  • Binary class label: Y = 1 or Y = 2
  • Classifier: a mapping f from X to Y
  • Training dataset
  • A: G x n matrix, n = n1 + n2
  • Y: n x 1 vector where n1 entries are 1 and n2
    entries are 2
  • Learning: find a mapping from the training data
    (A, Y) to f that minimizes generalization error

5
Challenges to standard learning methods
  • Statistical dilemma: n << G
  • Examples
  • Consequence: over-fitting, hence poor
    generalizability
  • Practical issue: complex f and decision boundary (DB)
  • Examples: ANN, SVM, random forests
  • Consequence: results are hard to interpret
    biologically and inefficient in diagnostic settings

6
The TSP classifier: motivation and strategy
  • Rank-based scoring exploits the expression levels
    of genes relative to each other and obtains
    invariance to normalization.
  • Reduce model complexity by making the classifier
    parameter-free.
  • Select informative gene pairs by proper LOOCV and
    construct an intuitive, biologically
    interpretable classification rule by voting.

7
TSP classifier: rank-based score
  • Idea: replace expression values by the genes' ranks
    within each profile.
  • Feature
  • Score
  • where
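
The formulas on this slide did not survive in the transcript. As a rough sketch, assuming the usual top-scoring-pair formulation, the feature for genes i and j is the event X_i < X_j, and the score is |p_ij(1) - p_ij(2)|, where p_ij(c) is the fraction of class-c samples in which X_i < X_j. The function name `pair_scores` and the toy data are made up for illustration, and the full G x G x n comparison array is only practical for small G.

```python
import numpy as np

def pair_scores(A, y):
    """Return a G x G matrix of scores |p_ij(1) - p_ij(2)|.

    A : (G, n) expression matrix, one column per sample.
    y : (n,) class labels in {1, 2}.
    p_ij(c) is estimated as the fraction of class-c samples with A[i] < A[j].
    """
    A = np.asarray(A, dtype=float)
    y = np.asarray(y)
    probs = []
    for c in (1, 2):
        Ac = A[:, y == c]                          # (G, n_c) block for class c
        # less[i, j, s] is True when gene i is below gene j in sample s
        less = Ac[:, None, :] < Ac[None, :, :]
        probs.append(less.mean(axis=2))
    return np.abs(probs[0] - probs[1])

# Toy data (hypothetical): gene 0 < gene 1 in class 1, reversed in class 2.
A = np.array([[1.0, 2.0, 9.0, 8.0],
              [5.0, 6.0, 3.0, 4.0],
              [7.0, 7.5, 7.2, 7.1]])
y = np.array([1, 1, 2, 2])
scores = pair_scores(A, y)
```

Because only within-sample comparisons enter the score, replacing each column by its ranks would give the same matrix.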

8
Gene pair selection
  • TSP
  • The number of TSPs and sample size
  • A few when sample size is not too small (> 102).
  • Many when sample size is small (< 102).
  • Example: myocardial tissue gene expression
    profiles
  • G = 22283, n1 = 12, n2 = 10
  • 2460 statistically significant TSPs

9
Classification with one gene pair
  • Let be a unique TSP
  • Suppose
  • TSP classifier
  • Error on training set
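
The expressions on this slide are missing from the transcript. A minimal sketch of a single-pair rule, assuming the common convention that the ordering X_i < X_j votes for whichever class shows it more often in training, might look like this. The function names and the toy matrix are hypothetical.

```python
import numpy as np

def fit_single_pair(A, y, i, j):
    """Return the class (1 or 2) that the event X_i < X_j votes for."""
    A, y = np.asarray(A, dtype=float), np.asarray(y)
    less = A[i] < A[j]                 # one boolean per training sample
    p1 = less[y == 1].mean()           # estimate of P(X_i < X_j | Y = 1)
    p2 = less[y == 2].mean()           # estimate of P(X_i < X_j | Y = 2)
    return 1 if p1 >= p2 else 2

def predict_single_pair(x, i, j, class_if_less):
    """Classify one profile x by comparing two of its entries."""
    return class_if_less if x[i] < x[j] else 3 - class_if_less

# Toy data (hypothetical): 3 genes, 4 samples, two per class.
A = np.array([[1.0, 2.0, 9.0, 8.0],
              [5.0, 6.0, 3.0, 4.0],
              [7.0, 7.5, 7.2, 7.1]])
y = np.array([1, 1, 2, 2])

c = fit_single_pair(A, y, 0, 1)
preds = np.array([predict_single_pair(A[:, s], 0, 1, c)
                  for s in range(A.shape[1])])
train_error = float(np.mean(preds != y))   # fraction of misclassified training samples
```

The fitted rule has no tunable parameters beyond the choice of pair and the direction of the vote, which is what keeps it easy to read biologically.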

10
An Example

11
Classification with multiple gene pairs:
majority vote
  • Seek a mapping from the outputs of multiple single
    TSP classifiers to a final prediction.
  • Let
  • The output of the i-th single TSP classifier
    represents a vote from TSP i.
  • Final prediction: the class that receives the
    majority of the votes.
  • Assuming the features are conditionally
    independent given the class and contribute
    equally, a Naïve Bayes classifier is equivalent to
    using majority vote.
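
The voting step above can be sketched as follows. The `(i, j, class_if_less)` triple format and the function name are assumptions made for this example; an odd number of pairs avoids ties.

```python
import numpy as np

def majority_vote(x, pairs):
    """Combine single-pair votes into one prediction.

    x     : one expression profile (1-D sequence).
    pairs : list of (i, j, class_if_less) triples (hypothetical format);
            the pair votes class_if_less when x[i] < x[j], else the other class.
    """
    votes = [c if x[i] < x[j] else 3 - c for i, j, c in pairs]
    counts = np.bincount(votes, minlength=3)   # counts[1], counts[2] are the tallies
    return 1 if counts[1] > counts[2] else 2

x = [1.0, 5.0, 7.0, 2.0]
pairs = [(0, 1, 1), (1, 2, 2), (0, 3, 1)]
prediction = majority_vote(x, pairs)           # votes are 1, 2, 1
```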

12
Classification with multiple TSPs: majority vote and
Naïve Bayes classifier
  • Assume
  • Where
  • Let
  • Naïve Bayes Classifier
  • Each TSP contributes equally if
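
The slide's own expressions are not preserved. One way to see the equivalence, under assumptions stated here rather than taken from the slide: let u_i in {1, 2} be the vote of TSP i, assume equal class priors, and assume every pair votes for the true class with the same probability q > 1/2. The symbols u_i, q, k_1 and k_2 are introduced only for this sketch.

```latex
\[
\log\frac{P(Y=1\mid u_1,\dots,u_k)}{P(Y=2\mid u_1,\dots,u_k)}
  = \log\frac{P(Y=1)}{P(Y=2)}
  + \sum_{i=1}^{k}\log\frac{P(u_i\mid Y=1)}{P(u_i\mid Y=2)}
\]
% Equal priors remove the first term. With P(u_i = c \mid Y = c) = q for all i,
% each summand is +\log\frac{q}{1-q} when u_i = 1 and -\log\frac{q}{1-q} when u_i = 2:
\[
\log\frac{P(Y=1\mid u_1,\dots,u_k)}{P(Y=2\mid u_1,\dots,u_k)}
  = (k_1 - k_2)\,\log\frac{q}{1-q},
\qquad k_c = \#\{\, i : u_i = c \,\}.
\]
```

Since log(q/(1-q)) > 0, the Naïve Bayes rule picks class 1 exactly when k_1 > k_2, which is the majority vote.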

13
Loop of cross-validation
  • Leave-one-out CV
  • Estimated accuracy: 1 - e/n, where e is the total
    number of errors in cross-validation
  • Only the TSPs are determined inside the CV loop,
    giving an unbiased error estimate.
  • More complicated models, e.g. ANN and decision
    trees, need to include both model topology and
    parameters in the CV loop, so the estimate is more
    likely to be biased.
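
A minimal sketch of the loop, assuming a single top-scoring pair is re-selected on each training fold and ties in the score are broken by taking the first maximum. The function names and toy data are hypothetical.

```python
import numpy as np

def top_pair(A, y):
    """Return (i, j, class_if_less) for one highest-scoring pair."""
    p = []
    for c in (1, 2):
        Ac = A[:, y == c]
        p.append((Ac[:, None, :] < Ac[None, :, :]).mean(axis=2))
    score = np.abs(p[0] - p[1])
    i, j = np.unravel_index(np.argmax(score), score.shape)  # first maximum
    return int(i), int(j), (1 if p[0][i, j] >= p[1][i, j] else 2)

def loocv_accuracy(A, y):
    """Leave-one-out estimate 1 - e/n, with pair selection inside the loop."""
    n = A.shape[1]
    errors = 0
    for s in range(n):
        keep = np.arange(n) != s                 # hold out sample s
        i, j, c = top_pair(A[:, keep], y[keep])  # selection uses training fold only
        pred = c if A[i, s] < A[j, s] else 3 - c
        errors += int(pred != y[s])
    return 1.0 - errors / n

# Toy data (hypothetical): 3 genes, 4 samples, two per class.
A = np.array([[1.0, 2.0, 9.0, 8.0],
              [5.0, 6.0, 3.0, 4.0],
              [7.0, 7.5, 7.2, 7.1]])
y = np.array([1, 1, 2, 2])
accuracy = loocv_accuracy(A, y)
```

Selecting the pair on all n samples first and only then cross-validating the vote would leak the held-out label into feature selection, which is the bias the slide warns about.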

14
Three benchmark cancer datasets
  • Determine lymph node status in breast tumor
    samples (West et al. 2001, G = 7129, n = 49)
  • Classify leukemia subtypes (Golub et al. 1999,
    G = 7129, n = 72)
  • Distinguish prostate tumors from normal samples
    (Singh et al. 2002, G = 12600, n = 102)

15
Statistical significance of the score is
evaluated by a permutation test
  • Repeat, e.g., 1000 times:
  • Keep feature matrix A
  • Randomize class labels, keeping n1 and n2
    unchanged
  • Get top score Zmax
  • Get histogram of Zmax
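
The procedure above can be sketched as follows. Shuffling the label vector keeps n1 and n2 fixed automatically. The p-value formula with the +1 correction is a common convention and an assumption here, since the slide only mentions the histogram of Zmax; the function names and planted toy data are hypothetical.

```python
import numpy as np

def max_score(A, y):
    """Top pair score Zmax = max over (i, j) of |p_ij(1) - p_ij(2)|."""
    p = []
    for c in (1, 2):
        Ac = A[:, y == c]
        p.append((Ac[:, None, :] < Ac[None, :, :]).mean(axis=2))
    return float(np.abs(p[0] - p[1]).max())

def permutation_test(A, y, n_perm=1000, seed=0):
    """Return the observed Zmax, its permutation null sample, and a p-value."""
    rng = np.random.default_rng(seed)
    observed = max_score(A, y)
    # A stays fixed; only the labels are shuffled, so class sizes are preserved.
    null = np.array([max_score(A, rng.permutation(y)) for _ in range(n_perm)])
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, null, p_value

# Toy data (hypothetical): noise plus one planted pair (genes 0 and 1).
rng = np.random.default_rng(1)
y = np.array([1] * 5 + [2] * 5)
A = rng.normal(size=(6, 10))
A[0] = np.where(y == 1, 0.0, 2.0)
A[1] = 1.0
observed, null, p_value = permutation_test(A, y, n_perm=200)
```

A histogram of `null` (for example with `np.histogram`) gives the reference distribution against which the observed top score is judged.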

16
The top-scoring gene pairs for the three cancer
studies

17
What does a TSP represent?
  • Change weak predictors into strong predictors?
  • Change reference: e.g. gene 3 as reference for
    gene 1 and gene 2
  • Combine two markers: e.g. X4 and X5 might be
    individual markers, one for each class; X4 - X5?

18
Performance of TSP classifier compared with
previous studies
  • Breast cancer: k-nearest neighbors (8-26), DLDA
    (8-19), DQDA (11-26), logitboost (9-21),
    random forests (6-20), SVM (7-29)
  • Leukemia: correlation analysis, weighted vote
    (85% on test set and 95% on CV set)
  • Prostate cancer: k-nearest neighbors applied to
    genes chosen by t-statistic

19
Discussion and extensions
  • Multi-class classification
  • Unique TSP: when there are >> 1 TSPs, which is the
    most informative?
  • kTSP: there might be many pairs of genes with
    informative ordering; combine this information
    for more accurate classification?
  • Normalization invariance: a suitable method to
    integrate heterogeneous microarray datasets where
    experimental conditions and normalization schemes
    differ.
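
The normalization-invariance point can be checked numerically. The sketch below assumes the normalization acts on each sample through a strictly increasing transformation (here a per-sample scale factor followed by a log transform); the data and scale factors are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.uniform(1.0, 100.0, size=(5, 4))     # 5 genes, 4 samples (toy data)

# Per-sample scaling followed by log2: strictly increasing within each column.
scale = np.array([0.5, 2.0, 10.0, 1.0])
B = np.log2(A * scale)

def orderings(M):
    """Boolean array with entry [i, j, s] True when gene i < gene j in sample s."""
    return M[:, None, :] < M[None, :, :]

# Every pairwise comparison, and hence every TSP feature, is unchanged.
same = np.array_equal(orderings(A), orderings(B))
```

Transformations that mix information across genes in a non-monotone way within a sample would not carry this guarantee.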