Class: Motif Finding CS-67693, Spring 2005 - PowerPoint PPT Presentation

Slides: 53
Provided by: yoseph3
1
Class: Motif Finding, CS-67693, Spring 2005
A few slides were adapted and edited from
www.cs.ucsb.edu/ambuj/Courses/bioinformatics/motif20finding.ppt
  • School of Computer Science & Engineering
  • Hebrew University, Jerusalem

2
Background
  • Basic dogma
  • Information is coded in the genome
  • Information includes
  • Where the genes are coded, including
  • Transcription Start
  • UTR
  • Exons and Introns
  • Alternative splicing

3
Eukaryotic Gene
Adapted in part from http://online.itp.ucsb.edu/online/infobio01/burge/
4
Background
  • Basic dogma
  • Information is coded in the genome
  • Information includes
  • Where the genes are coded, including
  • Transcription Start
  • UTR
  • Exons and Introns
  • Alternative splicing
  • Functional units in proteins

5
Proteins: Local structure motifs
I-sites Library: a catalog of local
sequence-structure correlations
diverging type-2 turn
Frayed helix
Type-I hairpin
Serine hairpin
glycine helix N-cap
alpha-alpha corner
Proline helix C-cap
6
Background
  • Basic dogma
  • Information is coded in the genome
  • Information includes
  • Where the genes are coded, including
  • Transcription Start
  • UTR
  • Exons and Introns
  • Alternative splicing
  • Functional units in proteins
  • RNA family structure

7
RNA: Multiple alignment and structure
Biological Sequence Analysis, Durbin, Eddy,
Krogh, Mitchison, Cambridge University Press, 1998
8
Background
  • Basic dogma
  • Information is coded in the genome
  • Information includes
  • Where the genes are coded, including
  • Transcription Start
  • UTR
  • Exons and Introns
  • Alternative splicing
  • Functional units in proteins
  • RNA family structure
  • How to control which gene to turn on/off and when

9
Background
  • In many cases, we can relate such functions to
    recurring motifs in the genome
  • Splice/start/end site signals in coding genes
  • Binding sites of regulatory elements controlling
    transcription of nearby genes
  • A certain function of a protein domain.

The definition of a sequence motif
depends on the context!
10
Background
  • Basic dogma
  • Information is coded in the genome
  • Information includes
  • Where the genes are coded, including
  • Transcription Start
  • UTR
  • Exons and Introns
  • Alternative splicing
  • Functional units in proteins
  • RNA family structure
  • How to control which gene to turn on/off and when

Future Classes
11
Expression of Genes in Cells
  • To produce a protein, a gene (DNA) has to be
    converted to an intermediary molecule called RNA,
    in a process called transcription.
  • Each cell contains the same genome. Different
    cells have a different set of genes which are
    turned on (expressed) by allowing the genes to be
    transcribed.
  • Different cells have different mixtures of gene
    regulatory proteins to turn genes on or off.

12
Regulation of Gene Expression
  • Gene regulatory proteins bind to specific places
    (regulatory sites) on DNA. These sites are
    usually close to the gene.

[Figure: two diagrams of a gene and its regulatory site, labeled "off" and "on", with a regulatory protein]
13
Regulatory Sites
  • Regulatory sites are sometimes divided into two
    types
  • Promoter sites: usually upstream of a gene in
    non-translated (non-coding) regions. In some
    cases, these sites can be in exonic or intronic
    regions.
  • Enhancer sites: can be very far away (either
    upstream or downstream).
  • Regulatory proteins recognize sites by conserved
    DNA patterns, which consist of a short stretch of
    partially specific nucleotide sequences.

14
lac operon in E. coli

15
Figure 13.16 The lac Operon of E. coli
16
Promoter
17

18

19
Transcription Factor Binding Sites
We want to describe this site
Non-coding regions → gene regulation
20
Difficulty of Finding Regulatory Elements
  • Regulatory sites are short (up to 30
    nucleotides).
  • Non-coding regions are very long (they include all
    regions which are not translated into proteins).
  • Experiments to find regulatory sites are tedious
    and time-consuming. One approach is to mutate
    different combinations of nucleotides until
    functionality changes.
  • We don't have a good understanding of what makes a
    site active, or how active, in terms of the
    chemical/physical constraints

21
Why Not Use Multiple Alignment?
  • The motif is short and may appear at different
    locations in different sequences. Most other
    areas are random
  • Not all positions within a binding site should
    be treated in the same way, and usually we don't
    know in advance how. Therefore the use of a
    general scoring matrix is not adequate
  • The problem is made more complicated since not
    every sequence contains a motif, because:
  • The upstream region used may not be long enough
    to include a regulatory site in every sequence
  • Usually, potentially co-regulated genes are used to
    construct the sample, which means that we don't
    know for sure whether all these genes are really
    co-regulated

22
Computational Approach
  • Identify a set of genes believed to be controlled
    by the same regulatory mechanism (co-regulated
    genes).
  • Extract regulatory regions of the genes (usually
    upstream sequences) to form a sample of
    sequences.
  • Find some way to identify conserved elements in
    these sequences, resulting in a list of potential
    regulatory sites.

23
How to Find Regulatory Sites
[Figure: a sample of five genes, each with a regulatory site]
24
Formulating Motif Finding Task
  • Given a set of sequences, find a common motif
    shared by these sequences.
  • Steps:
  • Construct a model of what we mean by a common
    motif.
  • Solve the problem within the model on simulated
    samples.
  • Evaluate performance on real-life biological
    samples.

25
Formulating Motif Finding Task (2)
  • This means we need to define:
  • Input of the algorithm: this implicitly defines
    various assumptions we have on the problem (e.g.
    do we have a different belief for each sequence
    that it belongs to the group?)
  • Type of motif class
  • Search algorithm: how do we search the space of
    possible motifs?
  • Scoring function: how do we score putative motifs?
  • Output of the algorithm: should it give us just
    putative sites, or maybe a binding site model to
    predict sites?
  • Evaluation technique: how do we test our
    algorithm?

26
Task Definition Example
  • Given a sample of sequences and an unknown
    pattern (motif) that appears at different unknown
    positions in each sequence, can we find the
    unknown pattern?
  • Input: a set of sequences, each one with an
    unknown pattern at an unknown position.
  • Output: a set of starting positions of the
    pattern in each sequence.

27
Pattern: Subsequence
Subsequence: AAAAAAAAGGGGGGG
  • atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagata
    aacgtatgaagtacgttagactcggcgccgccgacccctattttttga
    gcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaat
    aAAAAAAAAGGGGGGGatgagtatccctgggatgacttAAAAAAAAGG
    GGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccg
    agctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaac
    caacgcggacccaaaggcaagaccgataaaggagatcccttttgcggt
    aatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAA
    AAAAAGGGGGGGcttataggtcaatcatgttcttgtgaatggatttAA
    AAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcg
    caacggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGc
    aattatgagagagctaatctatcgcgtgcgtgttcataacttgagttA
    AAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagtt
    aatgctgtatgacactatgtattggcccattggctaaaagcccaactt
    gacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaag
    ggaagctggtgagcaacgacagattcttacgtgcattagctcgcttcc
    ggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa

28
Pattern: (l,d)
  • First formulated by Pevzner (ISMB 2000)
  • Pattern: a subsequence of length l with exactly d
    random mismatches in it
  • The rest of the sequence is assumed random
  • Assumes exactly one true occurrence of the
    motif in each sequence
  • atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataa
    acgtatgaagtacgttagactcggcgccgccgacccctattttttgag
    cagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaata
    cAAtAAAAcGGcGGGatgagtatccctgggatgacttAAAAtAAtGGa
    GtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
    gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaacc
    aacgcggacccaaaggcaagaccgataaaggagatcccttttgcggta
    atgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAA
    tAAAGGaaGGGcttataggtcaatcatgttcttgtgaatggatttAAc
    AAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgc
    aacggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGcca
    attatgagagagctaatctatcgcgtgcgtgttcat

AgAAgAAAGGttGGG
All variants of AAAAAAAAGGGGGGG
.......
cAAtAAAAcGGcGGG
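
The (l,d) formulation can be illustrated with a brute-force search. This is only a sketch for small l, not the method from the cited work: it enumerates all 4^l candidate patterns and keeps those that occur in every sequence with at most d mismatches (a relaxation of "exactly d"). The function names and the toy sequences are assumptions made for illustration.

```python
from itertools import product

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def best_match(candidate, seq):
    """Return (min Hamming distance, start) of the best window in seq."""
    l = len(candidate)
    return min((hamming(candidate, seq[i:i + l]), i)
               for i in range(len(seq) - l + 1))

def planted_motif_search(seqs, l, d):
    """Keep every length-l candidate found in all sequences with <= d mismatches."""
    hits = []
    for letters in product("ACGT", repeat=l):
        cand = "".join(letters)
        matches = [best_match(cand, s) for s in seqs]
        if all(dist <= d for dist, _ in matches):
            hits.append((cand, [pos for _, pos in matches]))
    return hits

# Toy sample: ACGTAC planted with 0, 1 and 1 mismatches.
toy = ["TTGACGTACCC", "GACCTACTTT", "CCCCAGGTAC"]
hits = planted_motif_search(toy, l=6, d=1)
```

The cost grows as 4^l, which is why the realistic (15,4) instances on this slide need the pruning or sampling techniques listed later.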
29
Formulating Motif Finding Task (2)
  • We need to define:
  • Input of the algorithm: this implicitly defines
    various assumptions we have on the problem (e.g.
    do we have a different belief for each sequence
    that it belongs to the group?)
  • Type of motif class
  • Search algorithm: how do we search the space of
    possible motifs?
  • Scoring function: how do we score putative motifs?
  • Output of the algorithm: should it give us just
    putative sites, or maybe a binding site model to
    predict sites?
  • Evaluation technique: how do we test our
    algorithm?
  • Think:
  • How does the (l,d) problem define these?
  • How does it relate to real biology?

30
How to Define Motif Class?
  • Subsequences: ACTCTT
  • IUPAC alphabet: A, C, G, T, R, Y, M, K, S, W, B,
    D, H, V, N (all non-empty subsets of {A,C,G,T})
  • PSSM / PWM (Position Specific Score Matrix or
    Position Weight Matrix)
  • More general probabilistic/other models, e.g.
    using the Bayesian network modeling language
  • Refined definition based on prior knowledge:
  • Homo/hetero dimers
  • Variable gaps
  • Bias to some characteristic information profile
    (Van, 2003)
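
As a small illustration of the IUPAC motif class, the sketch below expands each IUPAC letter into the set of bases it stands for and scans a sequence with a regular expression. The helper names `iupac_to_regex` and `find_sites` are made up for this example; the code table itself is the standard IUPAC nucleotide code.

```python
import re

# Standard IUPAC nucleotide codes: each letter is a subset of {A, C, G, T}.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def iupac_to_regex(motif):
    """Turn an IUPAC motif such as 'ACRY' into a regex character-class string."""
    return "".join("[" + IUPAC[c] + "]" for c in motif.upper())

def find_sites(motif, seq):
    """Start positions of all (possibly overlapping) matches of motif in seq."""
    # A lookahead is used so that overlapping occurrences are all reported.
    pattern = re.compile("(?=" + iupac_to_regex(motif) + ")")
    return [m.start() for m in pattern.finditer(seq.upper())]

sites = find_sites("ACRY", "TTACGTACACAT")
```

An IUPAC motif either matches or does not; it cannot say that one base is preferred over another at a position, which is what the PSSM adds.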

31
PSSM Representation of Binding Sites
  • Position Specific Score Matrix: each possible
    k-mer $s = s_1 \dots s_k$ gets a score for being a binding
    site, which is
    $\mathrm{Score}(s) = \sum_{i=1}^{k} w_{i,s_i}$
    where $w_{i,c}$ is the weight of letter c at position i
  • Probabilistic interpretation:
    $P(s \mid M) = \prod_{i=1}^{k} P_i(s_i)$
  • NOTE
  • Independence assumption between binding site
    positions!
  • The score used in a probabilistic setting is the
    log-odds score
  • In many cases the BG is a simple, fixed
    background distribution (Q) over ACGT.
  • The entries in the matrix can be $P_i(a)$,
    $\log P_i(a)$ or $\log(P_i(a)/Q(a))$, depending on
    the context of its usage!
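
A minimal sketch of log-odds scoring with a PSSM, assuming the matrix is given as one dictionary of base probabilities per position and that no probability is zero. The function names and the toy matrix are illustrative only.

```python
import math

def log_odds_matrix(probs, background):
    """w[i][a] = log2(P_i(a) / Q(a)) for each position i and base a."""
    return [{a: math.log2(p[a] / background[a]) for a in "ACGT"} for p in probs]

def score(w, kmer):
    """Sum of per-position weights; valid because positions are assumed independent."""
    return sum(w[i][c] for i, c in enumerate(kmer))

def scan(w, seq):
    """Score every window of len(w) bases in seq; returns (start, score) pairs."""
    k = len(w)
    return [(i, score(w, seq[i:i + k])) for i in range(len(seq) - k + 1)]

# Toy 2-position PSSM and a uniform background Q.
P = [{"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125},
     {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}]
Q = {a: 0.25 for a in "ACGT"}
W = log_odds_matrix(P, Q)
window_scores = scan(W, "GAC")
```

A positive score means the window is more likely under the motif model than under the background; the second position here is uninformative and contributes 0.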

32
PSSM: enables representing low/high affinity
in different positions. Trade-off between sensitivity and
specificity in genome-wide scans. Huge search space:
how to cover it efficiently?
PSSM vs. IUPAC
ABF1 example (targets by Lee et al., 2002)
>YAL011W CGTGTTAGATGA
33
How to Learn PSSM Motif?
  • Easier task: we have aligned samples to learn
    from
  • We have a set of known BS, all of length k
    (e.g. verified by some biological experiment)
  • Compute counts for each base in each
    position, and normalize (ML estimator):
    $P_i(a) = N_a / N$
  • N: number of sequences; $N_a$: number of a's in
    position i
  • Note
  • This is the ML solution. As in many other cases,
    this might be problematic when we have very few
    samples to learn from (e.g. we can get
    probability 0 for base A in position i simply
    because we did not see enough examples).
  • Solution: use pseudocounts or some prior (e.g.
    a Dirichlet prior)
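
The counting step can be sketched as follows, assuming the sites are equal-length strings over ACGT. Adding the same pseudocount to every base is one simple choice (it corresponds to a symmetric Dirichlet prior); the function name and the default value are assumptions for illustration.

```python
def estimate_pssm(sites, pseudocount=1.0):
    """Per-position base frequencies from aligned sites, smoothed by a pseudocount."""
    k = len(sites[0])
    n = len(sites)
    pssm = []
    for i in range(k):
        counts = {a: pseudocount for a in "ACGT"}
        for s in sites:
            counts[s[i]] += 1
        total = n + 4 * pseudocount
        pssm.append({a: counts[a] / total for a in "ACGT"})
    return pssm

known_sites = ["AC", "AG", "AT"]
ml = estimate_pssm(known_sites, pseudocount=0.0)        # plain ML: N_a / N
smoothed = estimate_pssm(known_sites, pseudocount=1.0)  # no zero entries
```

With three sites, the plain ML estimate assigns probability 0 to C, G and T at the first position; the smoothed estimate gives each of them 1/7 instead.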

34
How to Learn PSSM Motif ? (2)
Remember: in the motif finding problem we have a
much harder task. The input is a set of (long)
sequences suspected to contain a common motif
(a PSSM, according to our current model assumption),
but we don't know where! The output: prediction
of new BS based on our learned PSSM motif.
[Figure labels: BS Model; Predictions; Input Sequence
(dark blue marks BS positions, which are hidden from us
and which we are trying to learn)]
35
How to Learn PSSM Motif ? (3) MEME Algorithm
(Bailey T.L. and Elkan C.P. 1995 )
  • (Still) one of the most commonly used tools for
    motif (PSSM) search

36
How to Learn PSSM Motif ? (3) MEME Algorithm
(Bailey T.L. and Elkan C.P. 1995 )
  • The basic probabilistic framework used by MEME:
  • Input: N sequences
  • Assume a generative model: sequence is generated
    either by the BS model M (PSSM) or from a fixed
    background distribution BG
  • Assume each sequence has exactly 1 BS in it.
  • Scoring function: P(Seq | M, BG)
  • Try to maximize the likelihood scoring function by
    adjusting M's (PSSM) parameters.

37
How to Learn PSSM Motif ? (4)
  • What's the problem? Why is it hard?
  • Think of the positions of the BS in each sequence
    as H, where H is a vector of dimension N
  • Given H we have complete data. Then inferring M's
    ML parameters is just as we saw for the aligned
    case → easy
  • Problem 1: We don't have H; we are trying to
    learn it too, and the ML parameters of M for each
    position become dependent if H is not given → we
    have no closed form to compute them analytically,
    and going over all possible H assignments is not
    feasible → we need to resort to some method to
    search the space of possible assignments to M's
    parameters
  • Problem 2: The landscape of the likelihood
    function is typically far from convex → many
    local optima

38
How to Learn PSSM Motif ? (5) MEME Algorithm
  • MEME uses a technique called EM to search the
    space of model M's parameters
  • EM: Expectation Maximization
  • We review how EM is used in the MEME algorithm in
    class.
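
Since the EM details are left for class, here is a minimal sketch of EM under the assumptions stated earlier: exactly one BS per sequence, a fixed uniform background, and a uniform prior over start positions. It is not the MEME implementation; the seed-based initialization, the pseudocount and all names are assumptions made for illustration.

```python
import math

ALPHABET = "ACGT"

def init_pssm(seed, weight=0.7):
    """Start from a seed k-mer: its letter gets `weight`, the others share the rest."""
    rest = (1.0 - weight) / 3
    return [{a: (weight if a == c else rest) for a in ALPHABET} for c in seed]

def e_step(seqs, pssm, bg):
    """Posterior over the hidden start position H_n in each sequence."""
    k = len(pssm)
    posteriors = []
    for s in seqs:
        # Likelihood ratio of "site starts at j" versus pure background.
        ratios = [math.prod(pssm[i][s[j + i]] / bg[s[j + i]] for i in range(k))
                  for j in range(len(s) - k + 1)]
        total = sum(ratios)
        posteriors.append([r / total for r in ratios])
    return posteriors

def m_step(seqs, posteriors, k, pseudocount=0.1):
    """Re-estimate the PSSM from expected (posterior-weighted) base counts."""
    counts = [{a: pseudocount for a in ALPHABET} for _ in range(k)]
    for s, post in zip(seqs, posteriors):
        for j, p in enumerate(post):
            for i in range(k):
                counts[i][s[j + i]] += p
    return [{a: col[a] / sum(col.values()) for a in ALPHABET} for col in counts]

def em_motif(seqs, seed, iterations=50):
    """Alternate E and M steps; return the PSSM and the most probable starts."""
    bg = {a: 0.25 for a in ALPHABET}
    pssm = init_pssm(seed)
    for _ in range(iterations):
        pssm = m_step(seqs, e_step(seqs, pssm, bg), len(seed))
    posteriors = e_step(seqs, pssm, bg)
    starts = [max(range(len(p)), key=p.__getitem__) for p in posteriors]
    return pssm, starts

toy = ["TTGACGTACCC", "GACGTACTTT", "CCCCACGTAC"]
model, starts = em_motif(toy, seed="ACGTAC")
```

Here the seed equals the planted motif, so the search starts next to the optimum. With a poor seed the same code can converge to a different local optimum.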

39
Problems with MEME & Other Models
  • Think: in light of what we discussed, what
    assumptions are made in this model? What might
    cause us problems in real-life data?
  • MEME also has other variants we did not discuss
    here (OOPS, ZOOPS, etc.)
  • Also, EM is very sensitive to the starting point →
    need a good way to find good ones

40
Other Algorithmic Techniques for Motif Finding
  • MEME (Expectation Maximization)
  • GibbsDNA, AlignAce (Gibbs Sampling)
  • CONSENSUS (greedy multiple alignment)
  • WINNOWER (Clique finding in graphs)
  • SP-STAR (Sum of pairs scoring)
  • MITRA (Mismatch trees to prune exhaustive search
    space)

More than one way to skin a cat.
41
How to find Binding Sites- Revisited
Classical Solutions
  • Find a common motif in gene set (CONSENSUS,
    MITRA, MEME, AlignACE)
  • Main problem: in many cases the motif is common
    not just to the subset of sequences we have, but
    to many others as well → not a good candidate to
    explain regulation
Discriminative Solutions
Find a common unique motif in genes
Extract the relevant bit from sequences
Promoter
A simple hyper-geometric approach for
discovering putative transcription factor binding
sites WABI 01
42
Finding Discriminative Motifs
Step 1: Define the space of motifs: mimic motifs with a
simpler class for efficient search
Step 2: Refine motifs
A simple hyper-geometric approach for
discovering putative transcription factor binding
sites WABI 01
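
One common way to score how discriminative a motif is uses a hypergeometric tail probability. This is a sketch of that general idea and not necessarily the exact statistic of the cited WABI 01 paper. The parameter names are assumptions: N genes in the genome, K of them containing the motif, n genes in the co-regulated set, k of those containing the motif.

```python
from math import comb

def hypergeom_tail(N, K, n, k):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which contain the motif."""
    upper = min(K, n)
    favourable = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return favourable / comb(N, n)

# Toy numbers: the motif is in 5 of 10 genes overall and in all 5 selected genes.
p_enriched = hypergeom_tail(N=10, K=5, n=5, k=5)
```

A small value means the motif is over-represented in the gene set relative to the rest of the genome, which addresses the "common to many other sequences as well" problem above.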
43
Binding Sites - Revisited
  • ? independence
    assumption
  • Two relevant questions
  • Are there dependencies in binding sites?
  • Do we gain an edge in computational tasks if we
    model such dependencies?

[Figure: gene, promoter and binding site, with nucleotide labels A, C, T]
Modeling Dependencies in Protein-DNA Binding
Sites, RECOMB 03
44
How to model binding sites?
Represent a distribution of binding sites:
  • Profile: independence model
  • Tree: direct dependencies
  • Mixture of Profiles: global dependencies
  • Mixture of Trees: both types of dependencies
Modeling Dependencies in Protein-DNA Binding
Sites, RECOMB 03
45
Learning models: Aligned binding sites
Learning machinery: select maximum likelihood
model
  • Learning procedure for Bayesian networks

Modeling Dependencies in Protein-DNA Binding
Sites, RECOMB 03
46
Arabidopsis ABA binding factor 1 (49 examples)
Profile
Test LL per instance: -19.93
Test LL per instance: -18.70 (+1.23) (improvement
in likelihood > 2-fold)
Modeling Dependencies in Protein-DNA Binding
Sites, RECOMB 03
47
Rap1 Example (Harbison et al. 04) (171 examples)
Profile
48
Likelihood improvement over profiles
Significant improvement in generalization →
Data often exhibits dependencies
Modeling Dependencies in Protein-DNA Binding
Sites, RECOMB 03
49
Learning models: unaligned data
  • Use the EM algorithm to simultaneously:
  • Identify binding site positions
  • Learn a dependency model

[Figure: EM algorithm on unaligned data, alternating between learning a model and identifying binding sites]
Modeling Dependencies in Protein-DNA Binding
Sites, RECOMB 03
50
Evaluating Performance
  • Detect target genes on a genomic scale

[Figure: promoter sequence ACGTAT..AGGGATGC from -1000 to 0, with a candidate site GAGC at -473]
Probability by binding site model
Scoring rule. Crucial issue: p-value of scores
Background model (order-3 Markov chain)
CIS: Compound Importance Sampling Method for
Protein-DNA Binding Site p-value Estimation,
Bioinformatics, 2004, ISMB 04
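
An order-3 Markov background predicts each base from the three bases before it. The sketch below estimates such a model from counts and scores a sequence with it. The pseudocount, the uniform fallback for unseen contexts and the function names are assumptions for illustration; sequences are expected to be uppercase ACGT.

```python
import math
from collections import defaultdict

def train_markov(seqs, order=3, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) from training sequences."""
    counts = defaultdict(lambda: {a: pseudocount for a in "ACGT"})
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1
    return {ctx: {a: c[a] / sum(c.values()) for a in "ACGT"}
            for ctx, c in counts.items()}

def log_prob(model, seq, order=3):
    """log2 probability of seq[order:] given its first `order` bases.
    Contexts never seen in training fall back to a uniform distribution."""
    uniform = {a: 0.25 for a in "ACGT"}
    return sum(math.log2(model.get(seq[i - order:i], uniform)[seq[i]])
               for i in range(order, len(seq)))

background = train_markov(["ACGTACGTACGT"])
lp_seen = log_prob(background, "ACGT")    # context ACG was followed by T three times
lp_unseen = log_prob(background, "AAAA")  # context AAA never seen: uniform fallback
```

Using this in place of the fixed distribution Q in a log-odds score makes the background term depend on local sequence context.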
51
Example: ROC curve of HSF1
60 FP
Modeling Dependencies in Protein-DNA Binding
Sites, RECOMB 03
52
Evaluation: Localization Data, 5-fold Cross
Validation (Lee et al., 2002)
  • specificity (TP/Predicted)
  • sensitivity (TP/True)
Improvement by Mix of Trees over PSSM
Modeling Dependencies in Protein-DNA Binding
Sites, RECOMB 03
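
The two measures as defined on this slide can be computed from sets of predicted and true targets. Note that specificity here is TP/Predicted, as the slide defines it, which other texts often call precision. The function name and the gene labels are made up for the example.

```python
def evaluate(predicted, true):
    """Return (specificity, sensitivity) using the slide's definitions:
    specificity = TP / Predicted, sensitivity = TP / True."""
    predicted, true = set(predicted), set(true)
    tp = len(predicted & true)
    specificity = tp / len(predicted) if predicted else 0.0
    sensitivity = tp / len(true) if true else 0.0
    return specificity, sensitivity

# Toy example: 4 predicted targets, 3 true targets, 2 in common.
spec, sens = evaluate({"g1", "g2", "g3", "g4"}, {"g2", "g3", "g5"})
```

In a 5-fold cross validation, these would be computed on each held-out fold and then averaged.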
53
Motif Finding - Evaluation
  • Still an open problem
  • We have seen several examples of how performance
    can be evaluated in different ways
  • There is (still) no absolute solution for this
  • Main problems:
  • no large data sets of known sites
  • no real annotation of negative samples
  • How to define a success measure?
  • Differences in input/output assumptions
  • A recent effort in this direction: "Assessing
    computational tools for the discovery of
    transcription factor binding sites" (Nat.
    Biotech., Jan 05)
  • compared publicly available tools on the web on
    (small) data sets of known binding sites based on
    the TRANSFAC DB