Title: Discovering motifs in DNA sequences using AMADEUS
1. Discovering motifs in DNA sequences using AMADEUS
C. Linhart, Y. Halperin, R. Shamir
TAU workshop, May 07
2. Promoter Analysis: Extremely brief intro
- Transcription is regulated primarily by transcription factors (TFs): proteins that bind to DNA subsequences, called binding sites (BSs)
- TFBSs are located mainly (not always!) in the gene's promoter: the DNA sequence upstream of the gene's transcription start site (TSS)
- TFs can promote or repress transcription
[Figure label: TSS]
3. Promoter Analysis (cont.): TFBS models
- The BSs of a particular TF share a common pattern, or motif, which is often modeled using:
- Consensus string
- TASDAC (S = {C,G}, D = {A,G,T})
- Position weight matrix (PWM / PSSM)
- Threshold = 0.01: TACACC (0.06), TAGAGC (0.06), TACAAT (0.015)
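To make the PWM model concrete, here is a minimal sketch of scoring candidate sites against a threshold. The matrix values are hypothetical (they are not the matrix from the slide), and `pwm_score` and `find_sites` are illustrative names rather than part of any tool mentioned here.

```python
from math import prod

# Hypothetical 6-position PWM: one dict of base probabilities per position.
# Illustrative numbers only; this is NOT the matrix shown on the slide.
PWM = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    {"A": 0.4, "C": 0.1, "G": 0.3, "T": 0.2},
    {"A": 0.6, "C": 0.2, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.6, "G": 0.1, "T": 0.2},
]


def pwm_score(site, pwm=PWM):
    """Probability of `site` under the PWM: product of per-position probabilities."""
    if len(site) != len(pwm):
        raise ValueError("site length must equal PWM length")
    return prod(column[base] for column, base in zip(pwm, site))


def find_sites(seq, threshold, pwm=PWM):
    """Return (offset, k-mer, score) for every window scoring above `threshold`."""
    k = len(pwm)
    hits = []
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        score = pwm_score(window, pwm)
        if score > threshold:
            hits.append((i, window, score))
    return hits
```

Unlike a consensus string, which either matches or does not, the PWM gives every k-mer a graded score, and the threshold decides which k-mers count as binding sites.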
4. Promoter Analysis (cont.): Goals
- Reverse-engineer the transcriptional regulatory network: find the TFs (and their BSs) that regulate the studied biological process
- Input: a set of co-expressed genes
- Output: interesting motif(s)
- Known motifs: PRIMA, ROVER, ...
- Novel motifs: MEME, AlignACE, ...
- A group of co-occurring motifs, i.e., a cis-regulatory module (CRM): MITRA, CREME, ...
AMADEUS
5. Promoter Analysis (cont.): Challenges
- Why is it so difficult?
- BSs are short and degenerate (non-specific)
- Promoters are long and complex (hard to model)
- Multiple BSs of several TFs
- Old (non-functional) BSs
- Other genetic/structural signals (e.g., GC content)
- Search space is huge
- 15^10 (more than 500 billion) consensus strings of length 10
- 1 Kbp promoter × 2 strands × 20K genes in human = 40 Mbp
- Which score to use: what makes a motif interesting?
- Enrichment: over-representation w.r.t. the background (BG) model
- Location and/or strand bias
- Conservation across related species
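The search-space figures quoted above can be checked with a few lines of arithmetic. This sketch assumes the base 15 refers to the 15 IUPAC nucleotide codes (every non-empty subset of A, C, G, T); the variable names are illustrative.

```python
# Assumption: 15 symbols per position = the 15 IUPAC nucleotide codes
# (A, C, G, T plus the 11 ambiguity codes such as S, D, N).
IUPAC_SYMBOLS = 15
MOTIF_LENGTH = 10
consensus_strings = IUPAC_SYMBOLS ** MOTIF_LENGTH

# Sequence volume: 1 Kbp promoter x 2 strands x 20K human genes.
promoter_bp = 1_000
strands = 2
human_genes = 20_000
total_bp = promoter_bp * strands * human_genes

print(f"{consensus_strings:,} consensus strings of length {MOTIF_LENGTH}")
print(f"{total_bp / 1e6:.0f} Mbp of sequence to scan")
```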
7. Promoter Analysis (cont.): Status of motif discovery tools
- Extant tools perform reasonably well for:
- Finding known/novel motifs in organisms with short, simple promoters, e.g., yeast
- Identifying some of the known motifs in complex species, e.g., TFs whose BSs are usually close to the TSS
- ... but often fail in other cases!
- Each tool is custom-built for a specific target score
- Comparison of tools: Tompa et al. 05
8. AMADEUS
A Motif Algorithm for Detecting Enrichment in mUltiple Species
- Research platform
- Extensible: add new algs, scores, motif models
- Flexible: control params, algs, scores of execution
- Experimental tool
- Sensitive: find subtle signals
- Efficient: handle huge amounts of data
- Informative: show lots of info on motifs
- User-friendly: nice GUI
9. Main features: I/O
- Input
- Target set in one or more species (or expression data)
- Sequence region (promoter, 1st intron, 3' UTR, ...)
- Various parameters
- Output
- Non-redundant set of motifs
- Rich info per output motif
- Graphical motif logo
- Multiple scores and a combined p-value
- Similarity to known TFBS models
- List of target genes
- BS localization graph
10. Main features: alg.
- Algorithm: multiple refinement phases
- Each phase receives the best candidates of the previous phase and refines them (e.g., uses a more complex motif model)
- First phases are simple and fast (e.g., try all k-mers); last phases are more complex (e.g., optimize PWM using EM)
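The phase structure described above can be illustrated with a toy two-phase pipeline. This is a sketch under stated assumptions, not AMADEUS's actual implementation: the scoring function, the wildcard refinement step, and all function names are invented for illustration, and the EM-based PWM optimization mentioned on the slide is not shown.

```python
from itertools import product


def matches(motif, seq):
    """True if `motif` (with 'N' as a wildcard) occurs anywhere in `seq`."""
    k = len(motif)
    return any(
        all(m == "N" or m == s for m, s in zip(motif, seq[i:i + k]))
        for i in range(len(seq) - k + 1)
    )


def score(motif, targets, background):
    """Toy enrichment score: fraction of target hits minus fraction of background hits."""
    t = sum(matches(motif, s) for s in targets) / len(targets)
    b = sum(matches(motif, s) for s in background) / len(background)
    return t - b


def phase_all_kmers(_candidates, k):
    """Phase 1 (simple, fast): ignore the input and enumerate every exact k-mer."""
    return ["".join(p) for p in product("ACGT", repeat=k)]


def phase_add_wildcard(candidates, k):
    """Phase 2 (richer model): add every single-wildcard variant of each candidate."""
    out = set(candidates)
    for motif in candidates:
        for i in range(k):
            out.add(motif[:i] + "N" + motif[i + 1:])
    return sorted(out)


def run_phases(targets, background, k=4, keep=5):
    """Run the phases in order; only the `keep` best candidates reach the next phase."""
    candidates = []
    for phase in (phase_all_kmers, phase_add_wildcard):
        expanded = phase(candidates, k)
        ranked = sorted(expanded, key=lambda m: score(m, targets, background), reverse=True)
        candidates = ranked[:keep]
    return candidates
```

The point of the design is that the expensive model is applied only to the few candidates that survive the cheap exhaustive phase, which keeps the overall search tractable.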
11. Main features: scores
- Motif scores
- User selects scores to use, a subset of:
- Target-set over/under-representation
- Hyper-geometric
- GC-content and length binned binomial
- Localization
- Strand bias
- (additional scores for expression data)
- Scores are combined into a single p-value
- Don't assume specific models for the distribution of BSs
- (most tools use a statistical model, e.g., Markov Model, to describe the promoter sequences)
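As an illustration of the hyper-geometric over-representation score listed above, the following sketch computes the standard hypergeometric tail probability. The function and parameter names are illustrative assumptions, and AMADEUS's own implementation may differ in detail.

```python
from math import comb


def hypergeom_enrichment_pvalue(total_genes, genes_with_motif, target_size, target_hits):
    """P(X >= target_hits), where X is the number of motif-containing genes in a
    random set of `target_size` genes drawn without replacement from
    `total_genes`, of which `genes_with_motif` contain the motif."""
    max_hits = min(genes_with_motif, target_size)
    denominator = comb(total_genes, target_size)
    tail = sum(
        comb(genes_with_motif, x) * comb(total_genes - genes_with_motif, target_size - x)
        for x in range(target_hits, max_hits + 1)
    )
    return tail / denominator
```

For example, `hypergeom_enrichment_pvalue(10, 5, 5, 5)` asks how likely it is that all 5 target genes carry a motif present in 5 of 10 genes overall. A small p-value means the motif is over-represented in the target set relative to the remaining genes, without assuming any sequence model for the promoters.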
12. Main features: misc.
- GUI
- Control all parameters
- Save/load parameters from file
- Save textual and graphical output to file
- TFBS viewer
- Other
- Ignore redundant sequences (with identical subsequence)
- Applicable to multiple genome-scale promoter sequences
- Bootstrapping: empirical p-value estimation using random target sets / shuffled data
- Execution modes: GUI, batch
- Interoperability: Java application
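The bootstrapping idea (empirical p-values from random target sets) can be sketched as follows. This is a generic illustration, not the tool's code: `score_fn`, the `+1` pseudocount, and the default of 1000 random sets are assumptions made for the example.

```python
import random


def empirical_pvalue(score_fn, target_set, all_genes, n_random=1000, seed=0):
    """Fraction of random gene sets (same size as `target_set`) that score at
    least as high as the real target set, with a +1 pseudocount so the
    estimate is never exactly zero."""
    rng = random.Random(seed)
    observed = score_fn(target_set)
    as_good = 0
    for _ in range(n_random):
        random_set = rng.sample(all_genes, len(target_set))
        if score_fn(random_set) >= observed:
            as_good += 1
    return (as_good + 1) / (n_random + 1)
```

Here `all_genes` is expected to be a list, and `score_fn` is any function that maps a gene set to a motif score where higher means more interesting. The smallest p-value this estimate can report is `1 / (n_random + 1)`, so more random sets are needed to resolve very significant motifs.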
13. Combining p-values
- Each motif receives p-values from various sources (several scores, multiple species): $p_1, p_2, \ldots, p_n$
- We combine them into a single p-value $p$
- $p = \mathrm{Prob}\left[f_1 \cdot f_2 \cdots f_n \le p_1 \cdot p_2 \cdots p_n\right]$, where $f_i \sim U[0,1]$
- Denote $\rho = p_1 \cdot p_2 \cdots p_n$
- $p = \rho \cdot \sum_{i=0}^{n-1} \frac{(\ln 1/\rho)^i}{i!}$
- Also developed a weighted version, for when each p-value has a different weight
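The unweighted combination can be computed directly from the closed form for the probability that a product of n independent uniform variables falls at or below the product of the observed p-values. The sketch below assumes the p-values are independent; the function name is illustrative, and the weighted version is not shown because the slides do not give its formula.

```python
from math import factorial, log, prod


def combine_pvalues(pvalues):
    """P(f_1 * ... * f_n <= p_1 * ... * p_n) for independent f_i ~ U[0,1].

    Closed form: rho * sum_{i=0}^{n-1} (ln(1/rho))^i / i!,
    where rho is the product of the input p-values."""
    n = len(pvalues)
    if n == 0:
        raise ValueError("need at least one p-value")
    rho = prod(pvalues)
    if rho <= 0.0:
        return 0.0  # a zero p-value (or floating-point underflow) gives a zero product
    log_inv_rho = log(1.0 / rho)
    return rho * sum(log_inv_rho ** i / factorial(i) for i in range(n))
```

With a single p-value the result is that p-value itself. With several, the combined value is larger than the raw product, which corrects for the fact that multiplying many numbers below 1 always yields a small number. For long lists of very small p-values the product can underflow, so working in log space would be more robust.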
14. Results I: E2F targets (Ren et al. 02)
[Figure labels: E2F, NF-Y]
15. Results I: G2 and G2/M phases of human cell cycle (Whitfield et al. 02)
[Figure labels: CHR (not in TRANSFAC), NF-Y]
Module: CHR and NF-Y motifs co-occur (module was reported in Linhart et al. 05, Tabach et al. 05)
16. Results II: Localized human and mouse motifs
- Input
- All human and mouse promoters (2 x 21K)
- Region: -500 to +100 (w.r.t. TSS)
- Total sequence length: 26 Mbp
- No target-set / expression data
- Score: localization
- Results
- Recovered known TFs: Sp1, NF-Y, Elk-1, TATA, Nrf-1, ATF/CREB, cMyc, RFX1
- Recovered the splice donor site
- Identified several novel motifs
17. Results III: Yeast TF target sets (Harbison et al. 04)
Number of motifs successfully recovered by AMADEUS and 3 other motif finders (similarity was computed w.r.t. the motifs reported by Harbison et al.; the 3 top-scoring motifs found by each alg were considered)
18. Questions?