Title: Induction of Comprehensible Models for Gene Expression Data Sets


1
Induction of Comprehensible Models for Gene Expression Data Sets
  • Dragan Gamberger, RBI Zagreb
  • Nada Lavrač, IJS Ljubljana
  • Filip Železný, CVUT Prague
  • Jakub Tolar, UMN Minneapolis

2
Modeling Gene Expression Data
  • Predictive classification task
  • Input: gene expression (GE) vector
  • Output: disease class
  • To train a predictor
  • Use examples of existing GE vectors
  • Associated with a known disease class
  • Character of the data
  • Lots of attributes (e.g. 20,000 GE values)
  • Few examples (e.g. 20 patients), as illustrated in the sketch below
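A minimal sketch of the data layout described above, assuming a NumPy representation; the patient count, gene count, and class labels are illustrative placeholders rather than the actual data.

```python
import numpy as np

# Illustrative shapes only: ~20 patients, ~20,000 gene expression values each.
rng = np.random.default_rng(0)
n_patients, n_genes = 20, 20000

X = rng.normal(size=(n_patients, n_genes))   # one gene expression vector per patient
y = np.array(["AML"] * 10 + ["ALL"] * 10)    # known disease class for each patient

# A predictor is trained on (X, y) and must map an unseen expression vector to a class;
# with far more attributes than examples, many spurious patterns fit the training data.
print(X.shape, y[:3])
```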

3
Modeling Gene Expression Data
  • Domain prone to overfitting
  • Due to the abundance of possible patterns, many seem
    good by chance
  • Poor prediction on unseen examples
  • Mainstream solution
  • Robustness through large, redundant numeric classifiers
  • Usually 10s to 1000s of genes employed in
    classification
  • Voting of informative genes
  • Support vector machines (a baseline of this kind is sketched below)
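A hedged sketch of one such mainstream baseline, a support vector machine over all genes, using scikit-learn on synthetic stand-in data; the matrix sizes and parameters are placeholders, not the setup used in the slides.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for a gene expression training set (38 samples x 7129 genes).
rng = np.random.default_rng(0)
X = rng.normal(size=(38, 7129))
y = np.array([0] * 19 + [1] * 19)     # two disease classes

# A linear SVM combines weighted contributions of thousands of genes at once:
# robust against overfitting, but hard to interpret gene by gene.
clf = SVC(kernel="linear", C=1.0)
print(cross_val_score(clf, X, y, cv=3).mean())
```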

4
Modeling Gene Expression Data
  • Problem
  • Complex / numeric classifiers not appropriate for
    expert evaluation
  • Difficult interpretation
  • Single genes (disease markers) with high voting
    power can be extracted from the predictors
  • But then
  • prediction assessment results no longer valid
  • logical connections are lost
  • (such as G1 expressed AND G2 not expressed)

5
Modeling Gene Expression Data
  • Challenge: Can we induce predictors that are
  • Logic rules (→ easy to read)
  • Simple (few employed attributes)
  • Accurate (on test examples)
  • Meaningful (for a biologist)
  • → "Induction of Comprehensible Models ...", J. Biomed.
    Informatics (Elsevier), to appear in 2004

6
The Methodology
Gene Expression Data
  • Discretize real expression values to Absent / Marginal / Present
  • Search for Relevant Features
  • Search for Relevant Logic Rules
  • Assess Predictive Accuracy on test data
  • Assess Meaningfulness by expert interpretation
7
Discretization
  • Converting real expression values to three
    values
  • A (absent / not expressed)
  • M (marginal)
  • P (present / expressed)
  • Using the Affymetrix discretization (a minimal sketch follows below)
  • May not be ideal, but ready for improvement
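A minimal sketch of the three-valued encoding; the slides rely on the Affymetrix absent/marginal/present detection calls, whereas the fixed cut-offs below are made-up placeholders just to show the idea.

```python
import numpy as np

def discretize(expression, low=0.3, high=0.7):
    """Map real expression values to 'A' (absent), 'M' (marginal), 'P' (present).

    `low` and `high` are illustrative thresholds; the actual method uses the
    Affymetrix detection calls rather than fixed cut-offs.
    """
    calls = np.full(expression.shape, "M", dtype="<U1")
    calls[expression < low] = "A"
    calls[expression > high] = "P"
    return calls

print(discretize(np.array([0.05, 0.40, 0.90, 0.65, 0.72])))  # ['A' 'M' 'P' 'M' 'P']
```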

8
Feature Construction
  • Simple form
  • g = A
  • g = P
  • Marginal values cannot build a feature
  • Relevancy
  • Absolute: too small support on the target class or
    too large support on the non-target classes
  • Relative: a feature with better coverage exists (both filters are sketched below)
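A sketch of feature construction and the two relevancy filters under the assumptions above; the thresholds `min_pos` and `max_neg` and the toy data are hypothetical, and the dominance test is only one way to read "a feature with better coverage exists".

```python
import numpy as np

def build_features(calls):
    """For each gene build the boolean features (g == 'A') and (g == 'P');
    marginal ('M') values never form a feature."""
    feats, names = [], []
    for j in range(calls.shape[1]):
        for value in ("A", "P"):
            feats.append(calls[:, j] == value)
            names.append(f"g{j}={value}")
    return np.array(feats), names

def relevance_filter(feats, names, y, target, min_pos=2, max_neg=0):
    """Keep features that are neither absolutely nor relatively irrelevant.

    Absolutely irrelevant: too few covered target examples or too many covered
    non-target examples (placeholder thresholds).  Relatively irrelevant: some
    other feature covers at least the same target examples and at most the same
    non-target examples, strictly better on one side.
    """
    pos, neg = (y == target), (y != target)
    kept = []
    for i, f in enumerate(feats):
        if f[pos].sum() < min_pos or f[neg].sum() > max_neg:
            continue                               # absolutely irrelevant
        dominated = any(
            k != i
            and np.all(f[pos] <= g[pos]) and np.all(g[neg] <= f[neg])
            and (np.any(f[pos] < g[pos]) or np.any(g[neg] < f[neg]))
            for k, g in enumerate(feats)
        )
        if not dominated:
            kept.append(names[i])
    return kept

calls = np.array([["P", "A"], ["P", "M"], ["A", "P"], ["M", "P"]])  # 4 patients x 2 genes
y = np.array(["ALL", "ALL", "AML", "AML"])
feats, names = build_features(calls)
print(relevance_filter(feats, names, y, target="ALL"))              # ['g0=P']
```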

9
Rule Construction
  • Subgroup Discovery (Gamberger, Lavrač, JAIR
    17/2002)
  • Forms features into conjunctive rules (defining
    subgroups of the target class)
  • Such as (g1002 = P AND g211 = A)
  • Balances precision and support
  • The precision / support trade-off is the search
    heuristic (see the greedy sketch below)
  • May induce impure rules
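A greedy sketch of growing one conjunctive rule from the constructed features, scoring candidates with a simple TP / (FP + g) precision/support trade-off; this is only an illustration, not the actual SD beam search of Gamberger and Lavrač (JAIR 17/2002), and the toy features and parameter `g` are hypothetical.

```python
import numpy as np

def induce_rule(feats, names, y, target, g=2.0, max_len=3):
    """Greedily add the feature that most improves TP / (FP + g) for the
    target class; stop when nothing improves or the rule reaches max_len."""
    pos, neg = (y == target), (y != target)
    covered = np.ones(len(y), dtype=bool)      # the empty conjunction covers everything
    rule, best = [], -np.inf
    for _ in range(max_len):
        scored = [
            ((covered & f)[pos].sum() / ((covered & f)[neg].sum() + g), i)
            for i, f in enumerate(feats) if names[i] not in rule
        ]
        if not scored:
            break
        score, i = max(scored)
        if score <= best:
            break                              # precision/support no longer improves
        best, covered = score, covered & feats[i]
        rule.append(names[i])
    return " AND ".join(rule)

# Toy usage with two hand-made boolean features over six examples:
feats = np.array([[True, True, True, False, True, True],    # g1002=P
                  [True, True, True, True, False, False]])  # g211=A
names = ["g1002=P", "g211=A"]
y = np.array(["T", "T", "T", "F", "F", "F"])
print(induce_rule(feats, names, y, target="T"))   # g211=A AND g1002=P
```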

10
Experimental Domains
  • AML / ALL: distinguish between
  • Acute Myeloid Leukemia
  • Acute Lymphoblastic Leukemia
  • 38 training samples, 34 testing samples, 7129
    attributes (genes measured)
  • Multi-Class: distinguish between
  • 14 types of cancer
  • 144 training samples, 54 testing samples, 16063
    attributes

11
Impact of Relevancy Filter
  • AML/ALL: 7129 original attributes (genes)
  • 2844 absolutely irrelevant, 3633 relatively irrelevant, 622 relevant
  • Multi-Class: 16063 original attributes
  • on average: 72% absolutely or relatively irrelevant, 28% relevant
12
Predictive Performance Assessment
  • AML / ALL Classification

13
Predictive Performance Assessment
  • Multi-Class Classification
  • Quick growth of precision with the number of examples

14
Comparison to Previous Results
  • 9: Chow et al., Physiol. Genomics, 2001
  • 19: Golub et al., Science, 1999
  • 45: Ramaswamy et al., Proc. NAS, 2001

15
Rule Examples (AML/ALL)
  • Co-activity of Leptin and GST described in a
    previous biological study (Balasubramaniyan,
    Pharm. Research, 2003)
  • → Plausible relevance to AML?

16
Rule Examples (Multi-Class)
  • Routinely used lymphoma marker
  • Plausible co-factor

17
Conclusions
  • It is feasible to induce simple, logic-based
    classifiers for some GE modeling problems
  • Given very few positive examples, we could
    not prevent overfitting
  • Given reasonably few examples, we found well-generalizing,
    plausible rules
  • Optimism: larger data sets can be expected in the
    near future