Title: Induction of Comprehensible Models for Gene Expression Data Sets
1. Induction of Comprehensible Models for Gene Expression Data Sets
- Dragan Gamberger, RBI Zagreb
- Nada Lavrac, IJS Ljubljana
- Filip Železný, CVUT Prague
- Jakub Tolar, UMN Minneapolis
2. Modeling Gene Expression Data
- Predictive classification task
- Input: Gene Expression (GE) vector
- Output: Disease Class
- To train a predictor
- Use examples of existing GE vectors
- Associated with a known disease class
- Characteristics
- Lots of attributes (e.g. 20,000 GE values)
- Few examples (e.g. 20 patients)
3. Modeling Gene Expression Data
- Domain prone to overfitting
- Due to the abundance of possible patterns, many seem good by chance
- Poor prediction on unseen examples
- Mainstream solution
- Robustness: large, redundant numeric classifiers
- Usually 10s to 1000s of genes employed in classification
- Voting of informative genes
- Support vector machines
4. Modeling Gene Expression Data
- Problem
- Complex / numeric classifiers not appropriate for expert evaluation
- Difficult interpretation
- Single genes (disease markers) with high voting power can be extracted from the predictors
- But then
- prediction assessment results are no longer valid
- logical connections are lost
- (such as G1 expressed AND G2 not expressed)
5. Modeling Gene Expression Data
- Challenge: Can we induce predictors that are
- Logic rules (easy to read)
- Simple (few employed attributes)
- Accurate (on test examples)
- Meaningful (for a biologist)
- "Induction of Comprehensible Models ...", J. Biomed. Informatics (Elsevier), to appear in 2004
6. The Methodology
- Gene Expression Data
- Discretize real expression values to Absent / Marginal / Present
- Search for Relevant Features
- Search for Relevant Logic Rules
- Assess Predictive Accuracy on test data
- Assess Meaningfulness by expert interpretation
7. Discretization
- Converting real expression values to three values
- A (absent, not expressed)
- M (marginal)
- P (present, expressed)
- Using the Affymetrix discretization
- May not be ideal, but ready for improvement
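The A/M/P conversion can be sketched as a simple thresholding step. This is only an illustration: the actual Affymetrix calls are produced by the array software from detection statistics, and the cut-off values `LOW` and `HIGH` below are made-up assumptions, not Affymetrix parameters.

```python
# Illustrative sketch of discretizing real expression values into A / M / P.
# The real Affymetrix calls come from the array software; the fixed
# thresholds LOW and HIGH here are hypothetical stand-ins.

LOW, HIGH = 20.0, 100.0  # assumed cut-offs, not Affymetrix values

def discretize(value: float) -> str:
    """Map one real expression value to 'A', 'M', or 'P'."""
    if value < LOW:
        return "A"   # absent: not expressed
    if value < HIGH:
        return "M"   # marginal: builds no feature later
    return "P"       # present: expressed

def discretize_vector(expression):
    """Discretize a whole gene-expression vector."""
    return [discretize(v) for v in expression]
```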
8. Feature Construction
- Simple form
- g = A
- g = P
- Marginal values cannot build a feature
- Relevancy
- Absolute: too small support on the target class or too large support on the non-target classes
- Relative: a feature with better coverage exists
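A minimal sketch of feature construction and both relevancy filters, assuming samples are discretized A/M/P tuples; the support thresholds `min_pos` and `max_neg_frac` are illustrative assumptions, and the paper's exact criteria may differ.

```python
# Sketch of feature construction and relevancy filtering on discretized
# data. Samples are tuples of 'A'/'M'/'P'; thresholds are assumptions.

def build_features(n_genes):
    # Each gene yields at most two features, g = A and g = P;
    # marginal ('M') values never form a feature.
    return [(g, v) for g in range(n_genes) for v in "AP"]

def covers(feature, sample):
    gene, value = feature
    return sample[gene] == value

def absolutely_relevant(f, pos, neg, min_pos=2, max_neg_frac=0.5):
    # Absolute filter: discard features with too small support on the
    # target class or too large support on the non-target classes.
    p = sum(covers(f, s) for s in pos)
    n = sum(covers(f, s) for s in neg)
    return p >= min_pos and n <= max_neg_frac * len(neg)

def relatively_relevant(f, features, pos, neg):
    # Relative filter: discard f if some other feature strictly dominates
    # it (covers at least f's positives and at most f's negatives).
    fp = {i for i, s in enumerate(pos) if covers(f, s)}
    fn = {i for i, s in enumerate(neg) if covers(f, s)}
    for other in features:
        if other == f:
            continue
        op = {i for i, s in enumerate(pos) if covers(other, s)}
        on = {i for i, s in enumerate(neg) if covers(other, s)}
        if fp <= op and on <= fn and (fp, fn) != (op, on):
            return False
    return True
```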
9. Rule Construction
- Subgroup Discovery [Gamberger, Lavrac, JAIR 17/2002]
- Forms features into conjunctive rules (defining subgroups of the target class)
- Such as (g1002 = P AND g211 = A)
- Balances precision and support
- The precision / support trade-off is the search heuristic
- May induce impure rules
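The sketch below builds one conjunctive rule by repeatedly adding the feature that most improves a TP / (FP + g) quality quotient, which trades precision against support. The greedy search is a simplification of the cited subgroup-discovery algorithm (which uses beam search), and the parameter values g = 1 and max_len = 2 are illustrative choices only.

```python
# Greedy sketch of building one conjunctive rule from features.
# Quality TP / (FP + g) balances precision against support; the greedy
# search simplifies the cited SD algorithm (which uses beam search).

def covers(feature, sample):
    gene, value = feature
    return sample[gene] == value

def quality(tp, fp, g=1.0):
    return tp / (fp + g)

def greedy_rule(features, pos, neg, max_len=2):
    rule, cur_pos, cur_neg = [], pos, neg
    best = quality(len(pos), len(neg))
    for _ in range(max_len):
        cand = None
        for f in features:
            if f in rule:
                continue
            p = [s for s in cur_pos if covers(f, s)]
            n = [s for s in cur_neg if covers(f, s)]
            q = quality(len(p), len(n))
            if q > best:
                best, cand = q, (f, p, n)
        if cand is None:
            break  # no feature improves the rule; it may stay impure
        f, cur_pos, cur_neg = cand
        rule.append(f)
    return rule
```

A returned rule such as `[(0, "P"), (1, "A")]` reads as the conjunction "gene0 = P AND gene1 = A", matching the (g1002 = P AND g211 = A) form on the slide.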
10. Experimental Domains
- AML / ALL: distinguish between
- Acute Myeloid Leukemia
- Acute Lymphoblastic Leukemia
- 38 training samples, 34 testing samples, 7129 attributes (genes measured)
- Multi-Class: distinguish between
- 14 types of cancer
- 144 training samples, 54 testing samples, 16063 attributes
11. Impact of Relevancy Filter
- AML/ALL: of 7129 original attributes (genes), 2844 absolutely irrelevant, 3633 relatively irrelevant, 622 relevant
- Multi-Class: of 16063 original attributes, on average 72% absolutely or relatively irrelevant, 28% relevant
12. Predictive Performance Assessment
13. Predictive Performance Assessment
- Multi-Class classification
- Quick growth of precision with the number of examples
14. Comparison to Previous Results
- [9] Chow et al., Physiol. Genomics, 2001
- [19] Golub et al., Science, 1999
- [45] Ramaswamy et al., Proc. NAS, 2001
15. Rule Examples (AML/ALL)
- Co-activity of Leptin and GST described in a previous biological study [Balasubramaniyan, Pharm. Research, 2003]
- Plausible relevance to AML
16. Rule Examples (Multi-Class)
- Routinely used lymphoma marker
- Plausible co-factor
17. Conclusions
- It is feasible to induce simple, logic-based classifiers for some GE modeling problems
- Given very few positive examples, we could not prevent overfitting
- Given reasonably few examples, we found well-generalizing, plausible rules
- Optimism: larger data sets can be expected in the near future