Title: Computational detection of genomic cisregulatory modules applied to body patterning in the early Dro
1Computational detection of genomic cis-regulatory
modules applied to body patterning in the early
Drosophila embryo
- N. Rajewsky, M. Vergassola, U. Gaul and E. Siggia
- Presented by Bin Tan
2Cis-regulatory modules (CRM)
- In higher eukaryotes, many genes show complex spatial-temporal expression patterns.
- The gene transcription regulation apparatus is largely organized in the form of separable cis-regulatory modules.
- A module integrates inputs from several transcription factors and regulates another gene's expression, forming a regulatory network.
3Structural features of modules
- Hundreds of nucleotides in length
- Contains binding sites for as many as 4-5 different transcription factors
- Possibly multiple binding sites for the same transcription factor
- Certain combinations of binding sites: correlations between different transcription factors
4Why computational methods?
- Purely experimental methods such as promoter bashing are tedious.
- It is easier to screen a modest list of candidates suggested by a computational method.
5About this paper
- Uses data on body patterning of the early Drosophila embryo
- Makes statistically significant predictions of regulatory modules using three different levels of prior information
  - Binding sites (motifs)
  - Several related modules
  - Only genome
6Three levels of prior information
1. Binding sites (motifs)
2. Several related modules
3. Only genome
7The Ahab algorithm
- Uses known binding site (motif) information
- Scans the genome in windows
- Scores each window according to how well the sequence can be stochastically generated from the motifs
- Outputs windows with high ranks
8Ahab features
- (As compared to Mobydick)
- Uses positional weight matrices as the motif model
- Introduces a local background to remove the influence of local variations in sequence composition
- Allows binding sites to overlap
- Allows weak binding sites to contribute to the score
- No parameter tuning (other than the window size)
9Algorithm details
- Background model: k-th order Markov chain (each nucleotide depends only on the preceding k nucleotides)
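A k-th order Markov background can be sketched in a few lines. This is a minimal illustration, not the Ahab implementation: the function names, the pseudocount, and the choice to estimate the model from a single training string are all assumptions made for the example.

```python
from collections import defaultdict
import math

def train_markov_background(seq, k, pseudocount=1.0):
    """Estimate a k-th order Markov chain from seq (assumed to contain only A, C, G, T)."""
    counts = defaultdict(lambda: defaultdict(float))
    for i in range(k, len(seq)):
        # The context is the k bases preceding position i.
        counts[seq[i - k:i]][seq[i]] += 1.0

    def prob(context, base):
        # Pseudocounts (an assumption of this sketch) keep unseen contexts from giving zero.
        row = counts.get(context, {})
        total = sum(row.values()) + 4 * pseudocount
        return (row.get(base, 0.0) + pseudocount) / total

    return prob

def background_log_prob(seq, k, prob):
    """Log-probability of seq[k:] conditioned on the first k bases."""
    return sum(math.log(prob(seq[i - k:i], seq[i])) for i in range(k, len(seq)))
```

For a local background in the sense of the Ahab features slide, one would presumably train such a model on the sequence around each window rather than on the whole genome; the sketch leaves that choice to the caller.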
10Algorithm details (cont.)
- Sequence S = s1 s2 ..
- Weight matrices w1 w2 .. for motifs
- Background wb
- Probabilistic generation of S
  - Choose a motif or background wk (k = 1, 2, .., b) with probability pk
  - Sample a sequence according to wk and append it to S
  - Repeat until S reaches a certain length
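The generative process described on this slide can be sketched as follows. The sketch assumes a zeroth-order background (independent bases) for brevity, whereas the talk uses a k-th order Markov chain; `generate_sequence`, `sample_site`, and the truncation to the target length are hypothetical choices made for the example.

```python
import random

BASES = "ACGT"

def sample_site(matrix, rng):
    """Draw one binding site from a weight matrix given as a list of
    per-position probabilities over A, C, G, T."""
    return "".join(rng.choices(BASES, weights=column)[0] for column in matrix)

def generate_sequence(matrices, background, probs, length, seed=0):
    """matrices: motif weight matrices w1, w2, ..
    background: four base probabilities (zeroth-order simplification)
    probs: p1, p2, .., pb with the background probability last."""
    rng = random.Random(seed)
    pieces = []
    total = 0
    while total < length:
        # Choose a motif or the background with probability pk.
        k = rng.choices(range(len(probs)), weights=probs)[0]
        if k < len(matrices):
            piece = sample_site(matrices[k], rng)
        else:
            piece = rng.choices(BASES, weights=background)[0]
        pieces.append(piece)
        total += len(piece)
    # Truncating the last piece is a convenience of this sketch.
    return "".join(pieces)[:length]
```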
11Algorithm details (cont.)
- Unknown parameters? p1 p2 .. pb
- Maximize the likelihood of S with respect to p1 .. pb
- Conjugate descent or EM algorithm
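One way to picture the fit of p1 .. pb is a forward-backward recursion over all ways of segmenting S into motif sites and background bases, followed by an EM update. The sketch below is an illustration under assumptions, not the published algorithm: it treats the background as a one-column weight matrix (zeroth order), works in plain floating point (long windows would need rescaling or log space), and all function names are hypothetical.

```python
BASES = "ACGT"

def emission(seq, start, element):
    """Probability that element (a list of columns over A, C, G, T) emits the
    bases of seq beginning at start; 0.0 if it would run past the end."""
    if start + len(element) > len(seq):
        return 0.0
    prob = 1.0
    for offset, column in enumerate(element):
        prob *= column[BASES.index(seq[start + offset])]
    return prob

def forward_backward(seq, elements, probs):
    """fwd[i]: total weight of all segmentations of seq[:i];
    bwd[i]: total weight of all segmentations of seq[i:]."""
    n = len(seq)
    fwd = [1.0] + [0.0] * n
    bwd = [0.0] * n + [1.0]
    for i in range(n):
        for element, p in zip(elements, probs):
            e = emission(seq, i, element)
            if e > 0.0:
                fwd[i + len(element)] += fwd[i] * p * e
    for i in range(n - 1, -1, -1):
        for element, p in zip(elements, probs):
            e = emission(seq, i, element)
            if e > 0.0:
                bwd[i] += p * e * bwd[i + len(element)]
    return fwd, bwd

def em_step(seq, elements, probs):
    """One EM update: new pk is proportional to the expected number of times
    element k is used. Returns (new_probs, likelihood under the old probs)."""
    fwd, bwd = forward_backward(seq, elements, probs)
    likelihood = fwd[len(seq)]
    counts = [0.0] * len(elements)
    for i in range(len(seq)):
        for k, (element, p) in enumerate(zip(elements, probs)):
            e = emission(seq, i, element)
            if e > 0.0:
                counts[k] += fwd[i] * p * e * bwd[i + len(element)] / likelihood
    total = sum(counts)
    return [c / total for c in counts], likelihood

def fit_probs(seq, elements, n_iter=50):
    """Start from uniform probabilities and iterate EM a fixed number of times."""
    probs = [1.0 / len(elements)] * len(elements)
    for _ in range(n_iter):
        probs, _ = em_step(seq, elements, probs)
    return probs
```

In this sketch `elements` would be the motif matrices followed by the background, for example `[motif, [[0.25, 0.25, 0.25, 0.25]]]`. Because overlapping placements of different sites all contribute to the sum, weak and overlapping sites can add to a window's likelihood.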
12Experiment setup
- Input: weight matrices for 8 transcription factors, constructed from 11 modules
- Window size: 500 bp
- 27 modules known to receive maternal/gap gene input
13Results
- 146 highly significant modules found
- For 27 known modules
  - 11+6 recovered
  - 3 missed when filtering for at least 3 different factors
  - 3 missed because they contain only other factors
  - 4 ranked very low (700)
- For 15 novel predictions
  - one of the adjacent genes is patterned in the blastoderm
14Estimation of positive rate
- Scramble the columns in the weight matrices: half as many predictions
  - 50% false positive rate
- (6+15)×3/(146-11) ≈ 50% positive rate
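The scrambling control and the positive-rate arithmetic can be sketched as below. Two assumptions are made loudly here: "scrambling the columns" is taken to mean permuting the positions of a weight matrix, and the slide's expression is read as (6 + 15) × 3 / (146 − 11), with the operators lost in extraction. The function name is hypothetical.

```python
import random

def scramble_columns(matrix, seed=0):
    """Return a copy of a weight matrix (list of per-position columns) with
    the columns in random order. Base composition is kept, the motif is not."""
    rng = random.Random(seed)
    shuffled = list(matrix)
    rng.shuffle(shuffled)
    return shuffled

# Assumed reading of the slide's expression: (6 + 15) * 3 / (146 - 11).
estimated_positive_rate = (6 + 15) * 3 / (146 - 11)  # about 0.47
```

Under that reading the two estimates agree: scrambled matrices yield about half as many predictions, and the count-based estimate also comes out near one half.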
15Experiment variations
- Remove the least specific matrix (tailless) from input
  - 75 of the predictions without using tailless are also present in the list of 146
- Vary window size to 700 bp
  - 58 in the list of 146 are also among the top 200 of the 700 bp set
- Interesting new predictions
16Three levels of prior information
1. Binding sites (motifs)
2. Several related modules
3. Only genome
17Motivation
- For most transcription factors, binding site information is rarely known
- Modules obtained by experimental methods (e.g. promoter bashing) are more common
18The method
- Uses standard motif finders to recover weight matrices from input modules
- Feeds the motifs to Ahab to find similarly regulated genes
19The method (cont.)
- Gibbs sampler algorithm
  - Lawrence et al. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. (Presented by Xin He)
- Customizations
  - Search for only one binding site at a time.
  - Mask only the central 1-2 bases of each motif before iterating.
    - Results are more reproducible between runs.
    - Motifs are allowed to overlap.
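A basic site sampler in the spirit of Lawrence et al., together with a masking helper, might look like the sketch below. It assumes one site per sequence, a uniform background, and a fixed motif width; `mask_center` is a hypothetical interpretation of "mask only the central 1-2 bases", not the presenters' code.

```python
import random

BASES = "ACGT"

def build_profile(sites, pseudocount=0.5):
    """Column-wise base frequencies of aligned sites; masked bases are skipped."""
    profile = []
    for j in range(len(sites[0])):
        column = [pseudocount] * 4
        for site in sites:
            if site[j] in BASES:
                column[BASES.index(site[j])] += 1.0
        total = sum(column)
        profile.append([c / total for c in column])
    return profile

def site_weight(site, profile, background):
    """Likelihood ratio of site under the profile versus the background;
    0.0 if the site touches a masked position."""
    weight = 1.0
    for j, base in enumerate(site):
        if base not in BASES:
            return 0.0
        idx = BASES.index(base)
        weight *= profile[j][idx] / background[idx]
    return weight

def gibbs_site_sampler(seqs, width, iterations=200, seed=0):
    """Return one sampled start position per sequence."""
    rng = random.Random(seed)
    background = [0.25] * 4  # uniform background, an assumption of this sketch
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for held_out, seq in enumerate(seqs):
            # Build the profile from every sequence except the held-out one.
            others = [s[p:p + width]
                      for idx, (s, p) in enumerate(zip(seqs, starts))
                      if idx != held_out]
            profile = build_profile(others)
            weights = [site_weight(seq[p:p + width], profile, background)
                       for p in range(len(seq) - width + 1)]
            if sum(weights) > 0.0:
                # Sample the new start in proportion to its weight.
                starts[held_out] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

def mask_center(seq, start, width, n_mask=2):
    """Replace the central n_mask bases of a found site with N, so that a later
    run cannot rediscover the same site but may still overlap its flanks."""
    left = start + (width - n_mask) // 2
    return seq[:left] + "N" * n_mask + seq[left + n_mask:]
```

A run for several motifs would then alternate `gibbs_site_sampler` and `mask_center`, finding one motif at a time.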
20Experiment results
- Testing on modules with known binding site information
  - Gibbs sampling predicts that 30-50% of the sequence is covered by motifs
  - Gibbs motifs have higher specificity
  - Recovers half of the known motifs
  - Predicts several interesting new motifs
21Experiment results (cont.)
- Input: 3 modules receiving inputs from 6 transcription factors
- 6 highly significant weight matrices found
  - Kr, Kni, (Hb+Cad) + 3 new
- Ahab finds 63 highly significant modules
  - 4 overlap with the input modules
  - 13 contiguous to genes patterned in the blastoderm
- Comparable positive rates
22Three levels of prior information
1. Binding sites (motifs)
2. Several related modules
3. Only genome
23The Argos algorithm
- Only uses the genome data (unsupervised)
- Motivation: is the redundancy of binding sites inside modules strong enough to predict modules alone?
- The first successful attempt to do this for a metazoan genome
24The Argos algorithm
- To determine whether a motif is locally overrepresented: score its frequency in the sequence against its expected frequency (according to the genome-wide background).
- Enumerate all possible motifs of length 8.
- Compute their frequency in the genome (background counts), allowing 2 mutations
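Counting a motif with up to 2 mutations can be done by tabulating exact 8-mers once and then summing over the motif's mismatch neighborhood. This is a sketch under assumptions: it counts one strand only, treats "mutations" as substitutions, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations, product

BASES = "ACGT"

def kmer_counts(seq, k=8):
    """Exact counts of every k-mer occurring in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def neighbors(motif, max_mismatches=2):
    """All strings within Hamming distance max_mismatches of motif."""
    result = {motif}
    for d in range(1, max_mismatches + 1):
        for positions in combinations(range(len(motif)), d):
            # At each chosen position, try the three other bases.
            choices = [[b for b in BASES if b != motif[p]] for p in positions]
            for replacement in product(*choices):
                chars = list(motif)
                for p, b in zip(positions, replacement):
                    chars[p] = b
                result.add("".join(chars))
    return result

def count_with_mismatches(motif, exact_counts, max_mismatches=2):
    """Occurrences of motif allowing substitutions, from a table of exact counts."""
    return sum(exact_counts[n] for n in neighbors(motif, max_mismatches))
```

For length 8 and 2 substitutions each neighborhood holds 1 + 24 + 252 = 277 strings, so scoring all 4^8 = 65,536 motifs against a precomputed table stays tractable.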
25The Argos algorithm (cont.)
- Move a sliding window S over the genome
- Compute a motif's local count c in S
- Compute the motif's expected count from the background
- Rank the motifs by their Poisson scores
- The motifs are often related to each other
- Greedily select the top motif and eliminate related ones (under shifts and up to 4 mutations)
- Repeat until 5 motifs have been produced
- Use the sum of the selected motifs' scores as the score for S
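The per-window scoring can be sketched as below. Several details are assumptions, since the slides do not spell them out: the Poisson score is taken to be the negative log of the upper-tail probability P(X >= c), the expected count is assumed to be positive and of modest size, and `related` is a hypothetical relatedness test that counts overhanging positions of a shift as mismatches.

```python
import math

def poisson_score(count, expected):
    """-log P(X >= count) for X ~ Poisson(expected), with expected > 0."""
    if count == 0:
        return 0.0
    # Log of the first tail term, P(X = count).
    log_first = -expected + count * math.log(expected) - math.lgamma(count + 1)
    # Sum the remaining tail terms relative to the first one.
    total, ratio, j = 1.0, 1.0, count
    while True:
        j += 1
        ratio *= expected / j
        total += ratio
        if j > expected and ratio < 1e-15 * total:
            break
    return -(log_first + math.log(total))

def related(a, b, max_mismatches=4, max_shift=2):
    """Assumed test: some shift of b against a leaves at most max_mismatches
    differences, with overhanging positions counted as differences."""
    for shift in range(-max_shift, max_shift + 1):
        overlap = [(a[i], b[i - shift]) for i in range(len(a))
                   if 0 <= i - shift < len(b)]
        mismatches = sum(x != y for x, y in overlap) + abs(shift)
        if mismatches <= max_mismatches:
            return True
    return False

def score_window(local_counts, expected_counts, n_motifs=5):
    """Greedy selection: walk down the ranked motifs, skip any motif related
    to one already chosen, stop at n_motifs, and sum the chosen scores."""
    ranked = sorted(((poisson_score(c, expected_counts[m]), m)
                     for m, c in local_counts.items()), reverse=True)
    selected = []
    for score, motif in ranked:
        if all(not related(motif, chosen) for _, chosen in selected):
            selected.append((score, motif))
        if len(selected) == n_motifs:
            break
    return sum(s for s, _ in selected), [m for _, m in selected]
```

Here `local_counts` maps each motif to its count c in the window, and `expected_counts` maps it to the count expected from the genome-wide background frequency scaled to the window length.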
26Experiment results
- For a certain set of modules, Argos recovers half of them
  - 50% false negative rate
- For several genes with 15 known modules, Argos recovers 7 when looking over 10 kbp upstream of translation start
- Genome-wide, roughly one module per gene
27Experiment results