Title: Induction of Comprehensible Models for Gene Expression Data Sets
1. Induction of Comprehensible Models for Gene Expression Data Sets
- Dragan Gamberger, RBI Zagreb
- Nada Lavrac, IJS Ljubljana
- Filip Železný, CVUT Prague
- Jakub Tolar, UMN Minneapolis
2. Modeling Gene Expression Data
- Predictive classification task
- Input: Gene Expression (GE) vector
- Output: Disease Class
- To train a predictor
- Use examples of existing GE vectors
- Associated with a known disease class
- Characteristics
- Lots of attributes (e.g. 20,000 GE values)
- Few examples (e.g. 20 patients)
3. Modeling Gene Expression Data
- Domain prone to overfitting
- Due to the abundance of possible patterns, many seem good by chance
- Poor prediction on unseen examples
- Mainstream solution
- Robustness: large, redundant numeric classifiers
- Usually 10s to 1000s of genes employed in classification
- Voting of informative genes
- Support vector machines
4. Modeling Gene Expression Data
- Problem
- Complex / numeric classifiers not appropriate for expert evaluation
- Difficult interpretation
- Single genes (disease markers) with high voting power can be extracted from the predictors
- But then
- prediction assessment results are no longer valid
- logical connections are lost
- (such as G1 expressed AND G2 not expressed)
5. Modeling Gene Expression Data
- Challenge: Can we induce predictors that are
- Logic rules (easy to read)
- Simple (few employed attributes)
- Accurate (on test examples)
- Meaningful (for a biologist)
- "Induction of Comprehensible Models ...", J. Biomed. Informatics (Elsevier), to appear in 2004
6. The Methodology
- Gene Expression Data
- Discretize real expression values to Absent / Marginal / Present
- Search for Relevant Features
- Search for Relevant Logic Rules
- Assess Predictive Accuracy on test data
- Assess Meaningfulness by expert interpretation
7. Discretization
- Converting real expression values to three values
- A (absent, not expressed)
- M (marginal)
- P (present, expressed)
- Using the Affymetrix discretization
- May not be ideal, but ready for improvement
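The A/M/P conversion can be sketched as a simple thresholding step. This is only an illustration: the actual Affymetrix calls are produced by the array software from detection statistics, and the cut-off values `LOW` and `HIGH` below are made-up assumptions, not Affymetrix parameters.

```python
# Illustrative sketch of discretizing real expression values into A / M / P.
# The real Affymetrix calls come from the array software; the fixed
# thresholds LOW and HIGH here are hypothetical stand-ins.

LOW, HIGH = 20.0, 100.0  # assumed cut-offs, not Affymetrix values

def discretize(value: float) -> str:
    """Map one real expression value to 'A', 'M', or 'P'."""
    if value < LOW:
        return "A"   # absent: not expressed
    if value < HIGH:
        return "M"   # marginal: builds no feature later
    return "P"       # present: expressed

def discretize_vector(expression):
    """Discretize a whole gene-expression vector."""
    return [discretize(v) for v in expression]
```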
8. Feature Construction
- Simple form
- g = A
- g = P
- Marginal values cannot build a feature
- Relevancy
- Absolute: too small support on the target class or too large support on the non-target classes
- Relative: a feature with better coverage exists
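A minimal sketch of feature construction and both relevancy filters, assuming samples are discretized A/M/P tuples; the support thresholds `min_pos` and `max_neg_frac` are illustrative assumptions, and the paper's exact criteria may differ.

```python
# Sketch of feature construction and relevancy filtering on discretized
# data. Samples are tuples of 'A'/'M'/'P'; thresholds are assumptions.

def build_features(n_genes):
    # Each gene yields at most two features, g = A and g = P;
    # marginal ('M') values never form a feature.
    return [(g, v) for g in range(n_genes) for v in "AP"]

def covers(feature, sample):
    gene, value = feature
    return sample[gene] == value

def absolutely_relevant(f, pos, neg, min_pos=2, max_neg_frac=0.5):
    # Absolute filter: discard features with too small support on the
    # target class or too large support on the non-target classes.
    p = sum(covers(f, s) for s in pos)
    n = sum(covers(f, s) for s in neg)
    return p >= min_pos and n <= max_neg_frac * len(neg)

def relatively_relevant(f, features, pos, neg):
    # Relative filter: discard f if some other feature strictly dominates
    # it (covers at least f's positives and at most f's negatives).
    fp = {i for i, s in enumerate(pos) if covers(f, s)}
    fn = {i for i, s in enumerate(neg) if covers(f, s)}
    for other in features:
        if other == f:
            continue
        op = {i for i, s in enumerate(pos) if covers(other, s)}
        on = {i for i, s in enumerate(neg) if covers(other, s)}
        if fp <= op and on <= fn and (fp, fn) != (op, on):
            return False
    return True
```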
9. Rule Construction
- Subgroup Discovery [Gamberger, Lavrac, JAIR 17/2002]
- Forms features into conjunctive rules (defining subgroups of the target class)
- Such as (g1002 = P AND g211 = A)
- Balances precision and support
- The precision / support trade-off is the search heuristic
- May induce impure rules
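The sketch below builds one conjunctive rule by repeatedly adding the feature that most improves a TP / (FP + g) quality quotient, which trades precision against support. The greedy search is a simplification of the cited subgroup-discovery algorithm (which uses beam search), and the parameter values g = 1 and max_len = 2 are illustrative choices only.

```python
# Greedy sketch of building one conjunctive rule from features.
# Quality TP / (FP + g) balances precision against support; the greedy
# search simplifies the cited SD algorithm (which uses beam search).

def covers(feature, sample):
    gene, value = feature
    return sample[gene] == value

def quality(tp, fp, g=1.0):
    return tp / (fp + g)

def greedy_rule(features, pos, neg, max_len=2):
    rule, cur_pos, cur_neg = [], pos, neg
    best = quality(len(pos), len(neg))
    for _ in range(max_len):
        cand = None
        for f in features:
            if f in rule:
                continue
            p = [s for s in cur_pos if covers(f, s)]
            n = [s for s in cur_neg if covers(f, s)]
            q = quality(len(p), len(n))
            if q > best:
                best, cand = q, (f, p, n)
        if cand is None:
            break  # no feature improves the rule; it may stay impure
        f, cur_pos, cur_neg = cand
        rule.append(f)
    return rule
```

A returned rule such as `[(0, "P"), (1, "A")]` reads as the conjunction "gene0 = P AND gene1 = A", matching the (g1002 = P AND g211 = A) form on the slide.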
10. Experimental Domains
- AML / ALL: distinguish between
- Acute Myeloid Leukemia
- Acute Lymphoblastic Leukemia
- 38 training samples, 34 testing samples, 7129 attributes (genes measured)
- Multi-Class: distinguish between
- 14 types of cancer
- 144 training samples, 54 testing samples, 16063 attributes
11. Impact of Relevancy Filter
- AML/ALL: of 7129 original attributes (genes), 2844 absolutely irrelevant, 3633 relatively irrelevant, 622 relevant
- Multi-Class: of 16063 original attributes, on average 72% absolutely or relatively irrelevant, 28% relevant
12. Predictive Performance Assessment
13. Predictive Performance Assessment
- Multi-Class classification
- Quick growth of precision with the number of examples
14. Comparison to Previous Results
- [9] Chow et al., Physiol. Genomics, 2001
- [19] Golub et al., Science, 1999
- [45] Ramaswamy et al., Proc. NAS, 2001
15. Rule Examples (AML/ALL)
- Co-activity of Leptin and GST described in a previous biological study [Balasubramaniyan, Pharm. Research, 2003]
- Plausible relevance to AML
16. Rule Examples (Multi-Class)
- Routinely used lymphoma marker
- Plausible co-factor
17. Conclusions
- It is feasible to induce simple, logic-based classifiers for some GE modeling problems
- Given very few positive examples, we could not prevent overfitting
- Given reasonably few examples, we found well-generalizing, plausible rules
- Optimism: larger data sets can be expected in the near future