Title: Cis-regulatory module
1Cis-regulatory module
2TFs often work synergistically
(Harbison 2004)
3Combinatorial control
4λ phage
[Figure: λ phage in E. coli, with lytic growth and lysogenic growth (source: Gary Kaiser)]
5λ operon
[Figure: the λ operon, labeled cro, cI, and OR]
6λ operon
[Figure: lysogenic growth; labels: on, off, cro, cI, OR]
7λ operon
[Figure: lytic growth; labels: off, on, cro, cI, OR, with operator sites OR1, OR2, OR3]
8λ operon
[Figure: lysogenic state; labels: Pol II, cro, cI]
9Cis-regulatory module (CRM)
- A CRM is a DNA segment, typically a few hundred
base pairs in length containing multiple binding
sites, that recruits several cooperating factors
to a particular genomic location.
(Ji and Wong 2006)
10Statistical Methods
- Predict modules when the motifs are known (simpler):
LRA, by Wasserman and Fickett (1998).
- Predict modules when the motifs also need to be
discovered (more difficult): CisModule, by Zhou and
Wong (2004); EMCModule, by Gupta and Liu (2005).
11LRA
12LRA
- Basic idea: true regulatory regions are likely to
have multiple motif sites.
- Model the probability that a sequence is regulatory.
13LRA
- The model relates the probability of being a
regulatory region to the highest motif matching
score within a given sequence, weighted by a
regression coefficient.
- Training data contain a subset of known
regulatory and control regions.
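The equation on this slide did not survive extraction. A standard logistic regression consistent with the description above would take the following form. The symbols p, alpha, beta_i, and x_i, and the presence of an intercept, are assumptions made here for illustration and are not taken from the slide.

```latex
\log\frac{p}{1-p} \;=\; \alpha + \sum_{i} \beta_i \, x_i
```

Here p would be the probability that a sequence is a regulatory region, x_i the highest matching score of motif i within that sequence, beta_i the regression coefficient for motif i, and alpha an intercept.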
14Application: skeletal-muscle gene regulation
- 5 muscle-specific TFs are known
- Mef-2, Myf, SRF, Tef, Sp-1
- 29 regulatory regions are known.
- Can we predict the regulatory regions just from
sequence motif information?
15Computational Procedure
- Motif matrices are identified by Gibbs sampling
using sequence information from the 29 regulatory
regions.
- For some TFs, motifs cannot be found by the de
novo approach. Use literature motifs instead.
- The top two matching scores for each TF are included
as covariates.
- Apply the LRA model. Use leave-one-out
cross-validation to evaluate model performance.
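The last two steps of the procedure can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the PWM representation (a list of per-position score dictionaries), all function names, and the plain gradient-ascent fit are assumptions made for the example.

```python
import math

def best_scores(seq, pwm, k=2):
    """Return the k highest window scores of `pwm` along `seq`.

    `pwm` is assumed to be a list of {base: score} dictionaries, one per
    motif position (for example log-odds scores).
    """
    w = len(pwm)
    scores = [sum(pwm[j][seq[i + j]] for j in range(w))
              for i in range(len(seq) - w + 1)]
    return sorted(scores, reverse=True)[:k]

def features(seq, pwms, k=2):
    """Covariates for one region: the top-k scores of every motif."""
    return [s for pwm in pwms for s in best_scores(seq, pwm, k)]

def predict(beta, x):
    """Logistic model: beta[0] is the intercept, beta[1:] the coefficients."""
    z = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    z = max(-30.0, min(30.0, z))  # keep exp() in a safe numeric range
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, steps=2000, lr=0.1):
    """Fit by plain gradient ascent on the mean log-likelihood."""
    beta = [0.0] * (len(X[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for xi, yi in zip(X, y):
            err = yi - predict(beta, xi)
            grad[0] += err
            for j, v in enumerate(xi):
                grad[j + 1] += err * v
        beta = [b + lr * g / len(X) for b, g in zip(beta, grad)]
    return beta

def leave_one_out(X, y):
    """Predict each region with a model trained on all other regions."""
    preds = []
    for i in range(len(X)):
        beta = fit_logistic(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        preds.append(predict(beta, X[i]))
    return preds
```

In use, one would build `X` by calling `features` on every known regulatory and control region, set `y` to 1 or 0 accordingly, and compare the `leave_one_out` predictions with the true labels.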
16Results
- Single motifs are highly non-specific.
- Simple multi-site analysis improves specificity
at the cost of reducing sensitivity.
18Results
- Single motifs are highly non-specific.
- Simple multi-site analysis improves specificity
at the cost of reducing sensitivity.
- Logistic regression further improves specificity
at a smaller cost in sensitivity.
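For reference, the two quantities compared on these slides can be computed from true and predicted labels as follows. This is a generic sketch; the function name and the 0/1 label encoding are assumptions, and no numbers from the original study are reproduced here.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for 0/1 labels.

    Sensitivity = TP / (TP + FN): fraction of true regulatory regions found.
    Specificity = TN / (TN + FP): fraction of control regions correctly rejected.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec
```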
19Limitations of LRA
- Motifs must be known in advance.
- When known regulatory sequences are few, it is
difficult to identify motifs by using traditional
methods.
- Objective: integrate motif discovery and module
finding in a single statistical model.
20De novo module identification
- Two tasks
- Identify TF motifs
- Identify CRMs.
21Why a module approach can help motif discovery
- Because specificity is poor, a short sequence can
appear enriched simply by chance.
- The probability of random matches is much
smaller for motif co-occurrence.
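A back-of-the-envelope calculation illustrates the second point. The sketch below assumes that each motif matches any given random position independently with a fixed probability; that independence assumption, the function names, and the example numbers are all illustrative and not from the slides.

```python
def p_hit(p_site, window):
    """Chance of at least one random match of one motif in `window` positions."""
    return 1.0 - (1.0 - p_site) ** window

def p_cooccur(p_sites, window):
    """Chance that every motif in `p_sites` matches at random in the same window."""
    prob = 1.0
    for p in p_sites:
        prob *= p_hit(p, window)
    return prob

# Illustrative numbers: a motif that matches 1 in 1000 random positions,
# scanned over a 200-position window.
single = p_hit(0.001, 200)
triple = p_cooccur([0.001, 0.001, 0.001], 200)
print(f"one motif: {single:.3f}, three motifs together: {triple:.4f}")
```

Under these assumptions a single motif matches by chance in roughly one window in five or six, while three motifs co-occur by chance in well under one window in a hundred.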
22CisModule
- Basic idea
- A two-level hierarchical mixture model (HMx).
- Level 1: modules within sequences
(Zhou and Wong 2004)
23CisModule
- Basic idea
- A two-level hierarchical mixture model (HMx).
- Level 1: modules within sequences
- Level 2: motifs within modules
(Zhou and Wong 2004)
24HMx Model as a Stochastic Process
- Treat the HMx model as a stochastic machine that
generates sequences.
- Starting from the first sequence position, make a
series of random decisions: either initiate a
module of length l or generate a letter from the
background model.
- Inside a module, if a site for the kth motif is
initiated at position n, generate w_k letters
from its PWM and place them at positions
n, ..., n + w_k - 1; otherwise generate a letter
from the background.
- After reaching the end of the current module,
decide again whether to sample from the background
or initiate a new module.
(Zhou and Wong 2004)
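The generative process described on this slide can be simulated directly. The sketch below is a simplified reading of it, not the published model: the parameter names, the representation of a PWM as a list of per-position probability dictionaries, and the rule that a site or module is only started if it fits in the remaining space are all assumptions.

```python
import random

def draw(dist, rng):
    """Draw one letter from a {letter: probability} dictionary."""
    letters = list(dist)
    return rng.choices(letters, weights=[dist[a] for a in letters])[0]

def pick_motif(u, site_probs):
    """Map a uniform draw u to a motif index, or None for background."""
    acc = 0.0
    for k, q in enumerate(site_probs):
        acc += q
        if u < acc:
            return k
    return None

def generate(length, module_len, p_module, site_probs, pwms, background, rng):
    """Generate one sequence plus its module intervals and motif sites.

    p_module:   chance of starting a module at a background position.
    site_probs: chance of starting a site of each motif at a module position.
    """
    seq, modules, sites = [], [], []
    while len(seq) < length:
        fits = len(seq) + module_len <= length
        if fits and rng.random() < p_module:
            start, end = len(seq), len(seq) + module_len
            modules.append((start, end))
            while len(seq) < end:
                k = pick_motif(rng.random(), site_probs)
                if k is not None and len(seq) + len(pwms[k]) <= end:
                    sites.append((k, len(seq)))          # motif k starts here
                    seq.extend(draw(col, rng) for col in pwms[k])
                else:
                    seq.append(draw(background, rng))   # within-module background
        else:
            seq.append(draw(background, rng))           # outside any module
    return "".join(seq), modules, sites
```

Calling `generate` with a seeded `random.Random` instance returns the sequence together with the hidden module and site locations, which is convenient for checking whether an inference method recovers them.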
25Model inference: Gibbs sampling
- Given model parameters, update module/motif
locations.
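The location-update idea can be illustrated with a much simpler sampler than CisModule's: one site of fixed width per sequence, in the style of the classic Gibbs motif sampler. This is a sketch under stated assumptions (the function name, the pseudocount, and the one-site-per-sequence model are all illustrative), and it does not sample module boundaries as the full model would.

```python
import random

def gibbs_sites(seqs, w, iters, rng, pseudo=1.0):
    """Sample one motif start per sequence (sequences over A, C, G, T)."""
    alphabet = "ACGT"
    pos = [rng.randrange(len(s) - w + 1) for s in seqs]
    # Background letter frequencies estimated from all sequences.
    total = sum(len(s) for s in seqs)
    bg = {a: (sum(s.count(a) for s in seqs) + pseudo) / (total + 4 * pseudo)
          for a in alphabet}
    for _ in range(iters):
        for i, s in enumerate(seqs):
            # Parameters given locations: counts from the other sequences' sites.
            counts = [{a: pseudo for a in alphabet} for _ in range(w)]
            for j, t in enumerate(seqs):
                if j != i:
                    for c in range(w):
                        counts[c][t[pos[j] + c]] += 1
            n = len(seqs) - 1 + 4 * pseudo
            # Locations given parameters: weight each start by motif/background odds.
            weights = []
            for start in range(len(s) - w + 1):
                ratio = 1.0
                for c in range(w):
                    a = s[start + c]
                    ratio *= (counts[c][a] / n) / bg[a]
                weights.append(ratio)
            pos[i] = rng.choices(range(len(s) - w + 1), weights=weights)[0]
    return pos
```

The returned list holds the last sampled start position for each sequence. In practice one would run several chains from different seeds and summarize across iterations rather than trust a single final state.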
26A numerical experiment
- Merge the 29 regulatory regions with a set of
sequences randomly selected from ENSEMBL
promoters.
- Test the ability of CisModule to identify motifs
in a noisy environment.
27Results
28Limitations of CisModule
- The module length and the number of motifs must be
provided externally.
- Convergence can be slow. Multiple cycles
are needed, each starting from a new seed.
- Combinations of different motifs are assumed
to be independent.
29EMCModule
- Gupta and Liu (2005) developed a similar approach
called EMCModule.
- Main differences:
- They use the collection of literature motifs as
initial seeds for motif discovery.
- Their method improves the convergence speed.
- Their definition of a CRM is a little different:
the number of motifs is fixed within one module,
but the order of and distance between different
motifs can vary.
30Further issues
- A comparative genomic approach can also be
incorporated into module discovery (Zhou and
Wong 2007).
- The modules identified by these methods can be
viewed as belonging to one type. New methods
need to be developed to discover multiple module
types.
- While a module-based approach is helpful for
finding cooperative motifs, it may hurt the
discovery of single motifs.
31(Yuh et al. 1998)
35Reading List
- Wasserman and Fickett (1998): LRA. One of the
first works on cis-regulatory modules.
- Zhou and Wong (2004): CisModule. A statistical
method to identify cis-regulatory modules without
knowledge of motif information.
- Yuh et al. (1998): An influential biological
paper on how information can be integrated from
different modules to regulate gene expression.