Title: Gene Expression Rule Modeling
1Gene Expression Rule Modeling
- Jonathan Rudolph
- CS Advisor Carolina Ruiz
- BBT Advisor Elizabeth Ryder
- Worcester Polytechnic Institute
- April 18, 2006
2Problem
- The process involved in gene expression is not fully understood.
- DNA contains all genes
- Different cell-types express different genes
- Promoter Region determines transcription
- Motifs: short sequences on promoter region
figure taken from http://www.garlandscience.com
3Process overview
- Gene Expression dataset
- Rule Mining Process
- M1first M5second => Neural
- Rule Modeling Process
- M1first M5second => Neural
- M8first M8second => Muscle
- Apply model to test case
- Model Evaluations
4Software overview
Items in blue added by this MQP
- Dharmesh Thakkar MS thesis
- Rule Mining Process
- Added constraints to Keith Pray MS Thesis
- Rule Modeling Process
- Senthil Palanisamy's MS Thesis
- Associative Classification
- extended to handle Gene Expression
- CBA model
- TopN CBA model
- Per-Class CBA model
- Model Evaluations
- Added new ways to evaluate E-measure (error) for
unclassified instances.
5Dataset
- Model species: C. elegans is a common roundworm
- Genome fully sequenced
- Neural gene expression mapped
Pictures courtesy wormatlas www.wormatlas.org
6Annotated Promoter Regions: motifs with orientation
[Figure: promoter regions of genes annotated with motifs and their orientation. Each gene is labeled with the cell types in which it is expressed, drawn from ASH, ASI, ASK, ASL, ASE, ADL, ALM, HSN, PHA, and CAN.]
Annotated by MAST http://meme.sdsc.edu/
- Dataset
- 36 motifs of length 8, 10, or 12
- 80 promoter regions
- 4 Cell expression patterns
Dataset created from information available at
www.wormbase.org. Thanks also to the C. elegans
Consortium.
7Rule Mining
- Observe connections between attributes
- Association Rule mining is applicable
- Antecedent => Consequent
- M1first M5second => Neural (Supp 0.25, Conf 0.5)
- Support
- % of instances that contain A U C
- Confidence
- Of instances which contain A, % of instances that contain A U C
- P-value
- probability of obtaining result by chance, assuming null hypothesis is true.
- E-measure
- value produced from comparing two sets; an error measure.
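The Support and Confidence definitions above can be sketched in Python as follows. This is only an illustration: each instance is modeled as a set of items (an assumption about representation), and the four toy instances are made up, chosen so that the example rule reproduces the Supp 0.25 and Conf 0.5 values quoted on the slide.

```python
def support(instances, antecedent, consequent):
    """Fraction of all instances that contain every item of A and of C."""
    items = set(antecedent) | set(consequent)
    hits = sum(1 for inst in instances if items <= inst)
    return hits / len(instances)


def confidence(instances, antecedent, consequent):
    """Of the instances that contain A, the fraction that also contain C."""
    a = set(antecedent)
    c = set(consequent)
    with_a = [inst for inst in instances if a <= inst]
    if not with_a:
        return 0.0
    return sum(1 for inst in with_a if c <= inst) / len(with_a)


# Hypothetical toy data, not taken from the C. elegans dataset.
instances = [
    {"M1first", "M5second", "Neural"},
    {"M1first", "M5second", "Muscle"},
    {"M8first", "Muscle"},
    {"M2first", "Neural"},
]
rule_a = {"M1first", "M5second"}
rule_c = {"Neural"}

print(support(instances, rule_a, rule_c))     # 0.25
print(confidence(instances, rule_a, rule_c))  # 0.5
```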
8One rule: M21first M12second M10third => exprALM
taken from Dharmesh Thakkar's Visualization module
9Model Creation
- If we mine, and have many rules, how do we pick the best?
- Modeling
- selecting rules that work well together
- Descriptive Power
- Use as few rules as possible to classify our instances
- Use rules that represent our data
- Predictive power
- Verify by evaluating those rules on test data
10Models: what's out there?
- Simple: a model of All Rules
- lots of detailed descriptive power
- makes many predictions
- more errors, higher E-measure
- A little better: Top N Rules
- Small number of rules, sorted by confidence
- For large N, same model as All Rules
- More Sophisticated
- CBA
- TopN-CBA
- Per-Class CBA
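The Top N Rules model above amounts to a sort followed by a cut. A minimal sketch, assuming each rule is a (name, confidence) pair; the rule names and confidence values are hypothetical.

```python
def top_n_rules(rules, n):
    """Return the n highest-confidence rules; each rule is (name, confidence)."""
    return sorted(rules, key=lambda r: r[1], reverse=True)[:n]


# Hypothetical rules for illustration only.
rules = [("r1", 0.5), ("r2", 0.9), ("r3", 0.7)]

print(top_n_rules(rules, 2))   # [('r2', 0.9), ('r3', 0.7)]
print(top_n_rules(rules, 10))  # all three rules: same model as All Rules
```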
12CBA
- Sort rules by Confidence, Support, order of generation
- Pass through instances once per rule
- If any instances are covered, mark rule and remove instances.
- If a rule applies to no instances, it is not marked, and not used in the model.
- Repeat until no more rules or no more instances.
- Evaluate set of marked rules incrementally: one rule, two, three, etc.
- When the average E-measure for the model of n rules is worse than the E-measure for n-1 rules, return the n-1 rules in order as our completed model.
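The steps above can be sketched roughly as follows. This is a simplified reading of the slide, not the MQP's actual implementation: "covered" is taken to mean that the rule's antecedent is contained in the instance, and `avg_error` is a hypothetical callback standing in for the average E-measure of a candidate model. The `Rule` type and all toy values are assumptions.

```python
from typing import Callable, FrozenSet, List, NamedTuple


class Rule(NamedTuple):
    antecedent: FrozenSet[str]
    consequent: str
    confidence: float
    support: float


def mark_rules(rules: List[Rule], instances: List[FrozenSet[str]]) -> List[Rule]:
    """Keep each rule that covers at least one still-uncovered instance."""
    # Sort by confidence, then support (both descending). Python's sort is
    # stable, so ties keep their order of generation.
    ordered = sorted(rules, key=lambda r: (-r.confidence, -r.support))
    remaining = list(instances)
    marked = []
    for rule in ordered:
        if not remaining:
            break
        if any(rule.antecedent <= inst for inst in remaining):
            marked.append(rule)
            remaining = [i for i in remaining if not rule.antecedent <= i]
    return marked


def cut_model(marked: List[Rule],
              avg_error: Callable[[List[Rule]], float]) -> List[Rule]:
    """Grow the model one rule at a time; stop when rule n makes it worse."""
    for n in range(2, len(marked) + 1):
        if avg_error(marked[:n]) > avg_error(marked[:n - 1]):
            return marked[:n - 1]
    return marked


# Hypothetical toy rules and instances.
r_muscle = Rule(frozenset({"M8first"}), "Muscle", 0.6, 0.2)
r_neural = Rule(frozenset({"M1first", "M5second"}), "Neural", 0.9, 0.25)
r_loose = Rule(frozenset({"M1first"}), "Neural", 0.8, 0.3)
instances = [
    frozenset({"M1first", "M5second"}),
    frozenset({"M1first", "M5second", "M3third"}),
    frozenset({"M8first"}),
]

marked = mark_rules([r_muscle, r_neural, r_loose], instances)
# r_loose is dropped: r_neural already removed every instance it covers.
print([r.consequent for r in marked])  # ['Neural', 'Muscle']
```

With an error callback that gets worse when the second rule is added, `cut_model` would return only the first marked rule; otherwise it returns the whole marked list.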
13CBA
- Sophisticated model: Classification Based on Associations, CBA
- (Liu, Hsu, Ma '98, National Univ. Singapore)
- Accurate rules that cover the most instances first
- Only rules that decrease errors
- Not designed for strong descriptive power
14New Model: TopN-CBA
- Problem: Existing models provide either descriptive or predictive power. We want a model that does both.
- Approach: Start with a CBA generated model, include 1 additional rule from each predicted class.
15TopN-CBA
- generate CBA model (1)
- separate model by predicted type (blue, red and green circles) (2)
- gather n best rules from each predicted type, including CBA model rules (for this example n = 3) (3)
- return list of top n rules for each prediction (4)
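One possible reading of steps (2) to (4) is sketched below. Several choices here are assumptions, since the slide does not spell them out: only classes already predicted by the CBA model are considered, CBA rules are always kept, and the remaining slots up to n are filled with the best other mined rules of that class, ranked by confidence and then support. The `Rule` type and toy values are hypothetical.

```python
from typing import Dict, List, NamedTuple


class Rule(NamedTuple):
    antecedent: frozenset
    consequent: str
    confidence: float
    support: float


def top_n_cba(cba_model: List[Rule], all_rules: List[Rule],
              n: int) -> Dict[str, List[Rule]]:
    """Group CBA rules by predicted class, then top each group up to n rules."""
    by_class: Dict[str, List[Rule]] = {}
    for rule in cba_model:                       # step (2)
        by_class.setdefault(rule.consequent, []).append(rule)
    ranked = sorted(all_rules, key=lambda r: (-r.confidence, -r.support))
    for rule in ranked:                          # step (3)
        chosen = by_class.get(rule.consequent)
        if chosen is not None and len(chosen) < n and rule not in chosen:
            chosen.append(rule)
    return by_class                              # step (4)


# Hypothetical mined rules for two cell types.
a1 = Rule(frozenset({"M1first"}), "ALM", 0.9, 0.3)
a2 = Rule(frozenset({"M2first"}), "ALM", 0.8, 0.3)
a3 = Rule(frozenset({"M3first"}), "ALM", 0.7, 0.3)
h1 = Rule(frozenset({"M4first"}), "HSN", 0.95, 0.2)
h2 = Rule(frozenset({"M5first"}), "HSN", 0.5, 0.2)

model = top_n_cba(cba_model=[h1, a2], all_rules=[a1, a2, a3, h1, h2], n=2)
print({cls: len(rules) for cls, rules in model.items()})  # {'HSN': 2, 'ALM': 2}
```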
16TopN-CBA model
19New Model: Per Class CBA
- Problem: TopN-CBA still adds some redundant rules.
- Approach: Separate the rules by class before selecting, then select separately for each class. Combine the partial models.
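The approach above might look like the sketch below. It is an interpretation, not the MQP's code: the slide does not say how the partial models are combined, so here they are simply concatenated and re-sorted by confidence and support, and every class is selected against the full instance list. `covering_select` is a hypothetical stand-in for the CBA selection step, and the `Rule` type and toy data are made up.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    antecedent: frozenset
    consequent: str
    confidence: float
    support: float


def covering_select(rules, instances):
    """Stand-in selector: keep a rule only if it covers an uncovered instance."""
    remaining = list(instances)
    kept = []
    for rule in sorted(rules, key=lambda r: (-r.confidence, -r.support)):
        if any(rule.antecedent <= inst for inst in remaining):
            kept.append(rule)
            remaining = [i for i in remaining if not rule.antecedent <= i]
    return kept


def per_class_cba(rules, instances, select=covering_select):
    """Split rules by predicted class, select per class, combine the results."""
    groups = {}
    for rule in rules:
        groups.setdefault(rule.consequent, []).append(rule)
    combined = []
    for consequent in sorted(groups):
        combined.extend(select(groups[consequent], instances))
    return sorted(combined, key=lambda r: (-r.confidence, -r.support))


# Hypothetical toy data.
a1 = Rule(frozenset({"M1first"}), "ALM", 0.9, 0.3)
a2 = Rule(frozenset({"M1first", "M2second"}), "ALM", 0.8, 0.2)
h1 = Rule(frozenset({"M1first"}), "HSN", 0.7, 0.3)
instances = [
    frozenset({"M1first", "M2second"}),
    frozenset({"M1first"}),
    frozenset({"M3first"}),
]

model = per_class_cba([a1, a2, h1], instances)
print([r.consequent for r in model])  # ['ALM', 'HSN']
```

In this toy case a single global selection pass would drop the HSN rule, because the ALM rule already covers the same instances; selecting per class keeps one rule for each cell type.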
21Model Analysis
- Best: Per Class CBA
- CBA better predictor
- Rules from each cell-type
- few rules (1-50)
- Applicable for
- Brief models
- Summarize each cell-type
- Prediction for low support rules
- Note
- A smaller number of motifs and a lower minsupport gave us better models all around.
22Problem revisited
- How will this help us understand gene expression?
- Before
- mine for rules in C. elegans
- Now
- Variety of models
- 25 best rules instead of 14,000
- Ahead
- Conduct lab experiments in the locations and
motifs nominated by the model
23Questions?
24Weka System Mining
Lift = (Confidence of a rule) / (Confidence assuming independence of Consequent from Antecedent)
Lift > 1 => interesting
25E-measure
developed by Senthil Palanisamy
- E-measure is a metric for combining Precision and Recall
- it is a value in [0, 1]
- An E-measure of 0 indicates two identical sets.
- An E-measure of 1 indicates two sets that have no items in common.
- It is calculated from Precision, Recall, and beta
- beta is a way of changing the E-measure to give a greater penalty to either a low Precision or low Recall
- a beta of 1 gives equal penalty
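The slide does not give the formula. A common definition that matches every property listed above is E = 1 - (1 + beta^2) * P * R / (beta^2 * P + R), where P is Precision and R is Recall. The sketch below assumes that form; whether the thesis uses exactly this expression is not confirmed by the slides. The cell-type sets in the example are made up.

```python
def e_measure(predicted, actual, beta=1.0):
    """E-measure between two sets, assuming E = 1 - F_beta (see lead-in)."""
    predicted, actual = set(predicted), set(actual)
    if not predicted and not actual:
        return 0.0                      # two empty sets are identical
    common = len(predicted & actual)
    if common == 0:
        return 1.0                      # nothing in common
    precision = common / len(predicted)
    recall = common / len(actual)
    b2 = beta * beta
    return 1.0 - (1.0 + b2) * precision * recall / (b2 * precision + recall)


print(e_measure({"ALM", "HSN"}, {"ALM", "HSN"}))  # 0.0
print(e_measure({"ALM", "HSN"}, {"PHA"}))         # 1.0
print(e_measure({"ALM", "HSN"}, {"ALM"}))         # about 0.33
```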