Title: MotifBooster
Slide 1: MotifBooster: A Boosting Approach for Constructing TF-DNA Binding Classifiers
Pengyu Hong, 10/06/2005
Slide 2: Motivation
- Understand transcriptional regulation
[Figure: a TF binding upstream of Gene X]
- Model transcriptional regulatory networks
Slide 3: Motivation
Previous work on motif finding:
- AlignACE (Hughes et al 2000)
- ANN-Spec (Workman et al 2000)
- BioProspector (Liu et al 2001)
- Consensus (Hertz et al 1999)
- Gibbs Motif Sampler (Lawrence et al 1993)
- LogicMotif (Keles et al 2004)
- MDScan (Liu et al 2002)
- MEME (Bailey and Elkan 1995)
- Motif Regressor (Conlon et al 2003)
Slide 4: Motivation
A widely used model: the motif weight matrix (Stormo et al 1982)
Pos:    1      2      3      4      5      6      7      8
A:    0.19   1.11  -0.17   1.65  -2.65  -2.66  -1.98   0.92
C:   -0.14  -0.49   1.89  -1.81   1.70   2.32   2.14  -2.07
G:   -1.39   0.25  -1.22  -1.07  -2.07  -2.07  -2.07   1.13
T:    0.86  -1.39  -2.65  -2.65   0.41  -2.65  -1.16  -1.80
Score of the site: 10.84, compared against a threshold.
A sequence is a target if it contains a binding site (score > threshold).
Computational << Molecular
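The weight-matrix scoring on this slide (sum the per-position entries for each base of a candidate site, then compare against a threshold) can be sketched as follows. The matrix values are the ones in the table above; the scanned sequence and threshold below are hypothetical examples, not the slide's 10.84 case.

```python
# Weight matrix from the slide: rows are bases, columns are positions 1-8.
WEIGHT_MATRIX = {
    "A": [0.19, 1.11, -0.17, 1.65, -2.65, -2.66, -1.98, 0.92],
    "C": [-0.14, -0.49, 1.89, -1.81, 1.70, 2.32, 2.14, -2.07],
    "G": [-1.39, 0.25, -1.22, -1.07, -2.07, -2.07, -2.07, 1.13],
    "T": [0.86, -1.39, -2.65, -2.65, 0.41, -2.65, -1.16, -1.80],
}

def score_site(site):
    """Sum the per-position scores of an 8-bp candidate site."""
    return sum(WEIGHT_MATRIX[base][pos] for pos, base in enumerate(site))

def best_site_score(sequence, width=8, threshold=0.0):
    """Scan every window; the sequence is a target if any window's
    score exceeds the threshold (the linear model on this slide)."""
    best = max(score_site(sequence[i:i + width])
               for i in range(len(sequence) - width + 1))
    return best, best > threshold
```

Because the model is a single linear score per window, one threshold decides all sites, which is exactly the limitation the next slide addresses.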
Slide 5: Motivation
Non-linear binding effects, e.g., different binding modes.
Consensus: CA[C/T]CC[A/G]TACAT
Preferred binding:
- Mode 1: CACCCATACAT
- Mode 2: CATCCGTACAT
Non-preferred binding:
- Mode 3: CACCCGTACAT
- Mode 4: CATCCATACAT
Slide 6: Modeling
Model a TF-DNA binding classifier as an ensemble model:
F(S) = Σ_m α_m q_m(S)
where q_m is the mth base classifier, α_m is its weight, and F is the ensemble model.
Slide 7: Modeling
The mth base classifier q_m is built on a sequence scoring function, where f_m(s_ik) is a site scoring function (a weight matrix with a threshold).
The sequence scoring function considers (a) the number of matching sites and (b) the degree of matching.
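The slide states only the two criteria, not the exact formula, so the sketch below is one hedged reading: a sequence score that grows both with how many sites pass the threshold and with how strongly each one passes it, by summing each site's excess over the threshold.

```python
def sequence_score(site_scores, threshold):
    """One possible sequence scoring function (an assumption, not the
    paper's exact formula): every site whose score f_m(s_ik) exceeds the
    threshold contributes its excess, so both the number of matching
    sites and their degree of matching raise the sequence score."""
    return sum(s - threshold for s in site_scores if s > threshold)
```

Under this reading, two weak matches can outscore one strong match, which a single best-site threshold rule cannot express.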
Slide 8: Training: Boosting
Modify the confidence-rated boosting (CRB) algorithm (Schapire et al. 1999) to train ensemble models.
Slide 9: Why Boosting?
Boosting is a Newton-like technique that iteratively adds base classifiers to minimize an upper bound on the training error (Schapire et al. 1998).
Slide 10: Challenges
Positive sequences: targets of a TF. Negative sequences: non-targets.
- Sequences are labeled, but the sites within the sequences are not.
- The two classes cannot be well separated by the (linear) weight matrix model.
- Number of negative sequences >> number of positive sequences.
Slide 11: Boosting
Initialization:
- Total weight of the positive samples = total weight of the negative samples.
- Since the motif must be an enriched pattern in the positive sequences, use Motif Regressor to find a seed motif matrix W0.
[Figure: positive and negative samples]
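The balanced initialization above can be sketched in a few lines. Splitting each class's half of the weight uniformly across its samples is an assumption; the slide only requires the two class totals to be equal.

```python
def init_weights(labels):
    """Give each class total weight 0.5, spread uniformly within the
    class (uniform split is an assumption), so the many negatives
    cannot dominate the few positives."""
    n_pos = sum(1 for y in labels if y > 0)
    n_neg = len(labels) - n_pos
    return [0.5 / n_pos if y > 0 else 0.5 / n_neg for y in labels]
```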
Slide 12: Boosting
Train a base classifier (BC):
- Use the seed matrix W0 to initialize the mth base classifier q_m(θ_m) and set the weight α_m = 1.
- Refine α_m and the parameters θ_m of q_m to minimize Z_m = Σ_i d_i^m exp(-y_i α_m q_m(S_i)), where y_i is the label of S_i and d_i^m is the weight of S_i in the mth round.
- Negative information is explicitly used to train q_m(θ_m) and α_m.
[Figure: positive and negative samples; base classifier BC 1]
Slide 13: Boosting
Adjust the sample weights, giving higher weights to previously misclassified samples:
d_i^{m+1} ∝ d_i^m exp(-y_i α_m q_m(S_i))
- y_i is the label of S_i.
- d_i^m is the weight of S_i in the mth round; d_i^{m+1} is the new weight.
[Figure: positive and negative samples; base classifier BC 1]
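The confidence-rated weight update described on this slide can be sketched directly: samples the current base classifier gets wrong (y_i and q_m(S_i) disagree in sign) have their weights multiplied by a factor greater than one, then the weights are renormalized.

```python
import math

def update_weights(weights, labels, scores, alpha):
    """d_i^{m+1} ∝ d_i^m * exp(-alpha * y_i * q_m(S_i)), renormalized.
    `scores` holds the base classifier outputs q_m(S_i); `labels` holds
    the y_i in {-1, +1}. Misclassified samples get larger new weights."""
    new = [d * math.exp(-alpha * y * q)
           for d, y, q in zip(weights, labels, scores)]
    z = sum(new)  # normalization factor Z_m
    return [d / z for d in new]
```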
Slide 14: Boosting
Add a new base classifier.
[Figure: positive and negative samples with base classifiers BC 1 and BC 2]
Slide 15: Boosting
Add a new base classifier.
[Figure: positive and negative samples with the updated decision boundary]
Slide 16: Boosting
Adjust the sample weights again.
[Figure: positive and negative samples with the decision boundary]
Slide 17: Boosting
Add one more base classifier.
[Figure: positive and negative samples with base classifier BC 3]
Slide 18: Boosting
Add one more base classifier.
[Figure: positive and negative samples with the updated decision boundary]
Slide 19: Boosting
- Stop if the result is perfect or the performance on the internal validation sequences drops.
[Figure: positive and negative samples with the final decision boundary]
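The training procedure walked through on the boosting slides (balanced initialization, round-by-round base classifier fitting, weight updates, validation-based stopping) can be sketched as one loop. The helper names `train_base_classifier` and `evaluate` are hypothetical placeholders for the paper's inner steps, not its API.

```python
import math

def boost(samples, labels, train_base_classifier, evaluate, max_rounds=50):
    """Sketch of the boosting loop: train_base_classifier(samples,
    labels, weights) -> (q, alpha) and evaluate(ensemble) -> validation
    score are assumed callables, not the paper's actual functions."""
    # initialization: total positive weight equals total negative weight
    n_pos = sum(1 for y in labels if y > 0)
    n_neg = len(labels) - n_pos
    weights = [0.5 / n_pos if y > 0 else 0.5 / n_neg for y in labels]
    ensemble, best_val = [], -float("inf")
    for m in range(max_rounds):
        q, alpha = train_base_classifier(samples, labels, weights)
        ensemble.append((alpha, q))
        # up-weight misclassified samples, then renormalize
        weights = [d * math.exp(-alpha * y * q(s))
                   for d, y, s in zip(weights, labels, samples)]
        z = sum(weights)
        weights = [d / z for d in weights]
        val = evaluate(ensemble)
        if val < best_val:   # stop when validation performance drops
            ensemble.pop()
            break
        best_val = val
        if val == 1.0:       # or when the result is perfect
            break
    return ensemble
```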
Slide 20: Results
Data: ChIP-chip data of Saccharomyces cerevisiae (Lee et al. 2002)
- Positive sequences: binding p-value < 0.001; number of positive sequences ≥ 25.
- Negative sequences: p-value ≥ 0.05; negative-to-positive ratio ≈ 1.
This yielded 40 TFs.
Slide 21: Results
Leave-one-out test results: boosted models vs. seed weight matrices.
[Figure: vertical axis shows the improvement in specificity; horizontal axis lists the TFs]
Slide 22: Results
Capturing position correlation: RAP1
[Figure: motif representations of the seed weight matrix, base classifiers 1-3, and the boosted model]
Slide 23: Results
Capturing position correlation: REB1
[Figure: motif representations of the seed weight matrix, base classifiers 1-2, and the boosted model]