Discovering motifs in DNA sequences using AMADEUS

1
Discovering motifs in DNA sequences
using AMADEUS
C. Linhart, Y. Halperin, R. Shamir
TAU workshop, May 07
2
Promoter Analysis: Extremely brief intro
  • Transcription is regulated primarily by
    transcription factors (TFs): proteins that bind
    to DNA subsequences, called binding sites (BSs)
  • TFBSs are located mainly (not always!) in the
    gene's promoter: the DNA sequence upstream of the
    gene's transcription start site (TSS)
  • TFs can promote or repress transcription

[Figure label: TSS]
3
Promoter Analysis (cont.): TFBS models
  • The BSs of a particular TF share a common
    pattern, or motif, which is often modeled using:
  • Consensus string
  • TASDAC (S = {C,G}, D = {A,G,T})
  • Position weight matrix (PWM / PSSM)

[PWM figure not transcribed]
Threshold = 0.01: TACACC (0.06), TAGAGC (0.06), TACAAT (0.015)
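
The two motif models above can be sketched in a few lines of Python. This is only an illustration: the IUPAC table is the standard nucleotide code, but the matrix `HYPOTHETICAL_PWM` is invented for the example (the PWM shown on the slide is not available in this transcript), and the function names are assumptions.

```python
# Standard IUPAC nucleotide codes, as used in consensus strings.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def matches_consensus(site, consensus):
    """True if every base of `site` is allowed by the consensus symbol at that position."""
    return len(site) == len(consensus) and all(
        base in IUPAC[code] for base, code in zip(site, consensus)
    )

# Hypothetical PWM: one column of base probabilities per motif position.
# These numbers are made up for illustration; they are NOT the slide's matrix.
HYPOTHETICAL_PWM = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    {"A": 0.3, "C": 0.1, "G": 0.3, "T": 0.3},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]

def pwm_score(site, pwm):
    """Product of the per-position probabilities of the bases in `site`."""
    score = 1.0
    for base, column in zip(site, pwm):
        score *= column[base]
    return score

def pwm_hits(sequence, pwm, threshold):
    """Return (offset, site, score) for every window scoring above `threshold`."""
    width = len(pwm)
    hits = []
    for i in range(len(sequence) - width + 1):
        site = sequence[i:i + width]
        score = pwm_score(site, pwm)
        if score > threshold:
            hits.append((i, site, score))
    return hits
```

A consensus string gives a yes/no answer per site, whereas a PWM gives a graded score that is then cut by a threshold, which is why a site that violates the consensus at one position can still be reported by the PWM.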
4
Promoter Analysis (cont.): Goals
  • Reverse-engineer the transcriptional regulatory
    network: find the TFs (and their BSs) that
    regulate the studied biological process
  • Input: A set of co-expressed genes
  • Output: Interesting motif(s)
  • Known motifs: PRIMA, ROVER, ...
  • Novel motifs: MEME, AlignACE, ...
  • A group of co-occurring motifs, i.e. a
    cis-regulatory module (CRM): MITRA, CREME, ...

AMADEUS
5
Promoter Analysis (cont.): Challenges
  • Why is it so difficult?
  • BSs are short and degenerate (non-specific)
  • Promoters are long and complex (hard to model)
  • Multiple BSs of several TFs
  • Old (non-functional) BSs
  • Other genetic/structural signals (e.g., GC
    content)
  • Search space is huge
  • 15^10 (more than 500 billion) consensus strings of
    length 10
  • 1 Kbp promoter × 2 strands × 20K genes in human
    = 40 Mbp
  • Which score to use: what makes a motif
    interesting?
  • Enrichment: over-representation w.r.t. the
    background (BG) model
  • Location and/or strand bias
  • Conservation across related species
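
The two search-space figures above can be checked with a few lines of arithmetic. This sketch assumes the base of 15 counts the IUPAC nucleotide symbols (every non-empty subset of {A, C, G, T}), which the slide does not state explicitly; the variable names are illustrative.

```python
# Number of degenerate symbols: non-empty subsets of the 4 bases (assumed origin of "15").
iupac_symbols = 2 ** 4 - 1

# Consensus strings of length 10 over that alphabet.
motif_length = 10
consensus_space = iupac_symbols ** motif_length

# Sequence to scan: 1 Kbp promoter, both strands, about 20K human genes.
promoter_bp = 1_000
strands = 2
genes = 20_000
total_bp = promoter_bp * strands * genes

print(f"{consensus_space:,} candidate consensus strings")
print(f"{total_bp / 1e6:.0f} Mbp of promoter sequence")
```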

6
(No Transcript)
7
Promoter Analysis (cont.): Status of motif discovery tools
  • Extant tools perform reasonably well for:
  • Finding known/novel motifs in organisms with
    short, simple promoters, e.g., yeast
  • Identifying some of the known motifs in complex
    species, e.g., TFs whose BSs are usually close to
    the TSS
  • ... but often fail in other cases!
  • Each tool is custom-built for a specific target
    score
  • Comparison of tools: Tompa et al. '05

8
AMADEUS
A Motif Algorithm for Detecting Enrichment in
mUltiple Species
  • Research platform
  • Extensible: add new algs, scores, motif models
  • Flexible: control params, algs, scores of
    execution
  • Experimental tool
  • Sensitive: find subtle signals
  • Efficient: handle huge amounts of data
  • Informative: show lots of info on motifs
  • User-friendly: nice GUI

9
Main features: I/O
  • Input
  • Target set in one or more species (or expression
    data)
  • Sequence region (promoter, 1st intron, 3' UTR, ...)
  • Various parameters
  • Output
  • Non-redundant set of motifs
  • Rich info per output motif
  • Graphical motif logo
  • Multiple scores and a combined p-value
  • Similarity to known TFBS models
  • List of target genes
  • BS localization graph

10
Main features: alg.
  • Algorithm: Multiple refinement phases
  • Each phase receives the best candidates of the
    previous phase, and refines them (e.g., uses a more
    complex motif model)
  • First phases are simple and fast (e.g., try all
    k-mers)
  • Last phases are more complex (e.g., optimize PWM
    using EM)
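
The phased design can be illustrated with a toy two-phase pipeline. This is a sketch of the general idea only, not AMADEUS's actual phases or scores: phase 1 ranks exact k-mers by how many target sequences contain them, and phase 2 re-scores the survivors under a slightly richer model (one mismatch allowed). The function names and the one-mismatch model are assumptions made for the example.

```python
from collections import Counter

def phase1_exact_kmers(targets, k, keep):
    """Fast first phase: rank every k-mer by the number of targets containing it."""
    hits = Counter()
    for seq in targets:
        # Use a set so each k-mer counts at most once per sequence.
        for kmer in {seq[i:i + k] for i in range(len(seq) - k + 1)}:
            hits[kmer] += 1
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(hits, key=lambda kmer: (-hits[kmer], kmer))
    return ranked[:keep]

def has_near_match(kmer, seq, max_mismatches=1):
    """True if some window of `seq` differs from `kmer` in at most `max_mismatches` positions."""
    k = len(kmer)
    return any(
        sum(a != b for a, b in zip(kmer, seq[i:i + k])) <= max_mismatches
        for i in range(len(seq) - k + 1)
    )

def phase2_one_mismatch(candidates, targets, keep):
    """Slower second phase: re-score only the surviving candidates with a richer model."""
    scored = [(sum(has_near_match(c, s) for s in targets), c) for c in candidates]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [c for _, c in scored[:keep]]

targets = ["AATACGTAA", "CCTACGTCC", "GGTACGAGG", "TTTTTTTTT"]
candidates = phase1_exact_kmers(targets, k=5, keep=3)
refined = phase2_one_mismatch(candidates, targets, keep=1)
print(candidates, refined)
```

The point of the structure is cost control: the cheap phase sees the whole search space, while the expensive model is only evaluated on a short list of candidates.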

11
Main features: scores
  • Motif scores
  • User selects which scores to use, a subset of:
  • Target-set over/under-representation
  • Hyper-geometric
  • GC-content- and length-binned binomial
  • Localization
  • Strand bias
  • (additional scores for expression data)
  • Scores are combined into a single p-value
  • Don't assume specific models for the distribution
    of BSs
  • (most tools use a statistical model, e.g., a
    Markov Model, to describe the promoter sequences)
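
As an illustration of the hyper-geometric over-representation score, the sketch below computes the upper tail of the hypergeometric distribution with the standard library. The argument names (`total_genes`, `genes_with_hit`, `target_size`, `targets_with_hit`) are assumptions for this example; the slides do not specify how AMADEUS parameterizes or computes the score.

```python
from math import comb

def hypergeom_enrichment_pvalue(total_genes, genes_with_hit, target_size, targets_with_hit):
    """P(X >= targets_with_hit) when `target_size` genes are drawn without
    replacement from `total_genes`, of which `genes_with_hit` contain the motif."""
    denominator = comb(total_genes, target_size)
    upper = min(target_size, genes_with_hit)
    numerator = sum(
        comb(genes_with_hit, i) * comb(total_genes - genes_with_hit, target_size - i)
        for i in range(targets_with_hit, upper + 1)
    )
    return numerator / denominator

# Example: 20,000 genes, 1,000 with a motif hit, 100 targets of which 15 have a hit.
p = hypergeom_enrichment_pvalue(20_000, 1_000, 100, 15)
print(f"enrichment p-value: {p:.3g}")
```

A small p-value means the target set contains more motif hits than expected if the targets were a random sample of all genes.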

12
Main features: misc.
  • GUI
  • Control all parameters
  • Save/load parameters from file
  • Save textual and graphical output to file
  • TFBS viewer
  • Other
  • Ignore redundant sequences (with identical
    subsequence)
  • Applicable to multiple genome-scale promoter
    sequences
  • Bootstrapping: Empirical p-value estimation using
    random target sets / shuffled data
  • Execution modes: GUI, batch
  • Interoperability: Java application
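
The bootstrapping item can be illustrated with a generic empirical p-value estimate over random target sets. This is a minimal sketch under stated assumptions: `score_fn` is any hypothetical function that maps a gene set to a score where larger means more interesting, and the add-one correction is a common convention rather than something the slides specify.

```python
import random

def empirical_pvalue(score_fn, target_set, all_genes, n_random=1000, seed=0):
    """Fraction of random gene sets (same size as the target set) that score
    at least as high as the real target set, with an add-one correction."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    observed = score_fn(target_set)
    at_least_as_good = sum(
        score_fn(rng.sample(all_genes, len(target_set))) >= observed
        for _ in range(n_random)
    )
    return (at_least_as_good + 1) / (n_random + 1)

# Toy data: 100 genes, of which genes 0..9 carry a motif hit.
all_genes = list(range(100))
genes_with_hit = set(range(10))

def count_hits(gene_set):
    return sum(g in genes_with_hit for g in gene_set)

p_enriched = empirical_pvalue(count_hits, list(range(10)), all_genes)
p_background = empirical_pvalue(count_hits, list(range(50, 60)), all_genes)
print(p_enriched, p_background)
```

The smallest p-value this estimator can report is 1 / (n_random + 1), so the number of random sets bounds the resolution of the estimate.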

13
Combining p-values
  • Each motif receives p-values from various sources
    (several scores, multiple species): $p_1, p_2, \ldots, p_n$
  • We combine them into a single p-value $p$:
  • $p = \Pr(f_1 f_2 \cdots f_n \le p_1 p_2 \cdots p_n)$,
    where the $f_i \sim U[0,1]$ are independent
  • Denote $\rho = p_1 p_2 \cdots p_n$
  • $p = \rho \sum_{i=0}^{n-1} \frac{(\ln 1/\rho)^i}{i!}$
  • Also developed a weighted version, for when each
    p-value has a different weight
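
A short sketch of the unweighted combination follows. It uses the standard closed form for the probability that a product of n independent U[0,1] variables is at most rho, namely rho times the sum of (ln 1/rho)^i / i! for i = 0..n-1. The function name is an assumption, and the weighted version mentioned above is not covered.

```python
from math import factorial, log, prod

def combine_pvalues(pvalues):
    """Probability that a product of len(pvalues) independent U[0,1]
    variables is at most the product of the given p-values."""
    rho = prod(pvalues)
    if rho == 0.0:
        return 0.0  # avoid log(1/0); the combined p-value is 0 in the limit
    x = log(1.0 / rho)
    n = len(pvalues)
    return rho * sum(x ** i / factorial(i) for i in range(n))

print(combine_pvalues([0.05, 0.05]))
print(combine_pvalues([0.01, 0.2, 0.5]))
```

With a single p-value the sum has one term equal to 1, so the function returns that p-value unchanged, which is a quick sanity check on the formula.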

14
Results I: E2F targets (Ren et al. '02)
[Motif logos: E2F, NF-Y]
15
Results I: G2 and G2/M phases of human cell cycle
(Whitfield et al. '02)
CHR (not in TRANSFAC)
NF-Y
Module: CHR and NF-Y motifs co-occur
(Module was reported in Linhart et al. '05,
Tabach et al. '05)
16
Results II: Localized human and mouse motifs
  • Input
  • All human and mouse promoters (2 × 21K)
  • Region: -500 to +100 (w.r.t. TSS)
  • Total sequence length: 26 Mbp
  • No target-set / expression data
  • Score: localization
  • Results
  • Recovered known TFs: Sp1, NF-Y, Elk-1, TATA,
    Nrf-1, ATF/CREB, cMyc, RFX1
  • Recovered the splice donor site
  • Identified several novel motifs

17
Results III: Yeast TF target sets (Harbison et al. '04)
Number of motifs successfully recovered by
AMADEUS and 3 other motif finders (similarity was
computed w.r.t. the motifs reported by Harbison
et al.; the 3 top-scoring motifs found by each
alg were considered)
18
Questions?