Title: Class: Motif Finding CS-67693, Spring 2005
1Class: Motif Finding, CS-67693, Spring 2005
A few slides were adapted and edited from
www.cs.ucsb.edu/ambuj/Courses/bioinformatics/motif20finding.ppt
- School of Computer Science and Engineering
- Hebrew University, Jerusalem
2Background
- Basic dogma
- Information is coded in the genome
- Information includes
- Where the genes are coded, including
- Transcription Start
- UTR
- Exons and Introns
- Alternative splicing
3Eukaryotic Gene
Adapted in part from http://online.itp.ucsb.edu/online/infobio01/burge/
4Background
- Basic dogma
- Information is coded in the genome
- Information includes
- Where the genes are coded, including
- Transcription Start
- UTR
- Exons and Introns
- Alternative splicing
- Functional units in proteins
5Proteins: Local structure motifs
I-sites Library: a catalog of local sequence-structure correlations
[Figure: example I-sites motifs: diverging type-2 turn, frayed helix, type-I hairpin, serine hairpin, glycine helix N-cap, alpha-alpha corner, proline helix C-cap]
6Background
- Basic dogma
- Information is coded in the genome
- Information includes
- Where the genes are coded, including
- Transcription Start
- UTR
- Exons and Introns
- Alternative splicing
- Functional units in proteins
- RNA family structure
7RNA: Multiple Alignment and Structure
Biological Sequence Analysis, Durbin, Eddy, Krogh, Mitchison, Cambridge University Press, 1998
8Background
- Basic dogma
- Information is coded in the genome
- Information includes
- Where the genes are coded, including
- Transcription Start
- UTR
- Exons and Introns
- Alternative splicing
- Functional units in proteins
- RNA family structure
- How to control which gene to turn on/off and when
9Background
- In many cases, we can relate such functions to recurring motifs in the genome:
- Splice/start/end site signals in coding genes
- Binding sites of regulatory elements controlling transcription of nearby genes
- A certain function of a protein domain
The definition of a sequence motif depends on the context!
10Background
- Basic dogma
- Information is coded in the genome
- Information includes
- Where the genes are coded, including
- Transcription Start
- UTR
- Exons and Introns
- Alternative splicing
- Functional units in proteins
- RNA family structure
- How to control which gene to turn on/off and when
Future Classes
11Expression of Genes in Cells
- To produce a protein, a gene (DNA) has to be converted to an intermediary molecule called RNA, in a process called transcription.
- Each cell contains the same genome. Different cells have a different set of genes turned on (expressed) by allowing those genes to be transcribed.
- Different cells have different mixtures of gene regulatory proteins that turn genes on or off.
12Regulation of Gene Expression
- Gene regulatory proteins bind to specific places (regulatory sites) on DNA. These sites are usually close to the gene.
[Figure: a gene and its regulatory site in the off and on states, with a regulatory protein]
13Regulatory Sites
- Regulatory sites are sometimes divided into two types:
- Promoter sites: usually upstream of a gene, in non-translated (non-coding) regions. In some cases, these sites can be in exonic or intronic regions.
- Enhancer sites: can be very far away (either upstream or downstream).
- Regulatory proteins recognize sites by conserved DNA patterns, which consist of a short stretch of partially specific nucleotide sequences.
14lac operon in E. coli
15Figure 13.16 The lac Operon of E. coli
16Promoter
17 18 19Transcription Factor Binding Sites
We want to describe this site
Non-coding regions → gene regulation
20Difficulty of Finding Regulatory Elements
- Regulatory sites are short (up to 30 nucleotides).
- Non-coding regions are very long (they include all regions that are not translated into proteins).
- Experiments to find regulatory sites are tedious and time-consuming. One approach is to mutate different combinations of nucleotides until functionality changes.
- We don't have a good understanding of what makes a site active, or how active, in terms of chemical/physical constraints.
21Why Not Use Multiple Alignment?
- The motif is short and may appear at a different location in each sequence. Most other areas are random.
- Not all positions within a binding site should be treated in the same way, and usually we don't know in advance how. Therefore a general scoring matrix is not adequate.
- The problem is further complicated because not every sequence contains a motif, due to:
- The upstream region used may not be long enough to include a regulatory site in every sequence.
- Usually, potentially co-regulated genes are used to construct the sample, which means that we don't know for sure whether all these genes are really co-regulated.
22Computational Approach
- Identify a set of genes believed to be controlled by the same regulatory mechanism (co-regulated genes).
- Extract the regulatory regions of the genes (usually upstream sequences) to form a sample of sequences.
- Find some way to identify conserved elements in these sequences, resulting in a list of potential regulatory sites.
23How to Find Regulatory Sites
[Figure: a sample of sequences, each with a regulatory site near its gene]
24Formulating Motif Finding Task
- Given a set of sequences, find a common motif shared by these sequences.
- Steps:
- Construct a model of what we mean by a common motif.
- Solve the problem within the model on simulated samples.
- Evaluate performance on real-life biological samples.
25Formulating Motif Finding Task (2)
- This means we need to define:
- Input of the algorithm: this implicitly defines various assumptions we make about the problem (e.g. do we hold a different belief for each sequence that it belongs to the group?)
- Type of motif class
- Search algorithm: how do we search the space of possible motifs?
- Scoring function: how do we score putative motifs?
- Output of the algorithm: should it give us just putative sites, or a binding site model to predict sites?
- Evaluation technique: how do we test our algorithm?
26Task Definition Example
- Given a sample of sequences and an unknown pattern (motif) that appears at different unknown positions in each sequence, can we find the unknown pattern?
- Input: a set of sequences, each one with an unknown pattern at an unknown position.
- Output: a set of starting positions of the pattern in each sequence.
27Pattern Subsequence
Subsequence AAAAAAAAGGGGGGG
- atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagata
aacgtatgaagtacgttagactcggcgccgccgacccctattttttga
gcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaat
aAAAAAAAAGGGGGGGatgagtatccctgggatgacttAAAAAAAAGG
GGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccg
agctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaac
caacgcggacccaaaggcaagaccgataaaggagatcccttttgcggt
aatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAA
AAAAAGGGGGGGcttataggtcaatcatgttcttgtgaatggatttAA
AAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcg
caacggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGc
aattatgagagagctaatctatcgcgtgcgtgttcataacttgagttA
AAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagtt
aatgctgtatgacactatgtattggcccattggctaaaagcccaactt
gacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaag
ggaagctggtgagcaacgacagattcttacgtgcattagctcgcttcc
ggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa
28Pattern (l,d)
- First formulated by Pevzner (ISMB 2000)
- Pattern: a subsequence of length l with exactly d random mismatches in it
- All other sequence is assumed random
- Assumes exactly one true occurrence of the motif in each sequence
- atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataa
acgtatgaagtacgttagactcggcgccgccgacccctattttttgag
cagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaata
cAAtAAAAcGGcGGGatgagtatccctgggatgacttAAAAtAAtGGa
GtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaacc
aacgcggacccaaaggcaagaccgataaaggagatcccttttgcggta
atgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAA
tAAAGGaaGGGcttataggtcaatcatgttcttgtgaatggatttAAc
AAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgc
aacggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGcca
attatgagagagctaatctatcgcgtgcgtgttcat
AgAAgAAAGGttGGG
All variants of AAAAAAAAGGGGGGG
.......
cAAtAAAAcGGcGGG
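The (l,d) formulation can be made concrete with a small brute-force sketch. It is only an illustration, not one of the algorithms discussed in this class: it enumerates all 4^l candidate patterns, so it is feasible only for small l, and it accepts "at most d" mismatches rather than exactly d. The function names (hamming, best_match, planted_motifs) are hypothetical.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def best_match(pattern: str, seq: str) -> tuple[int, int]:
    """Return (distance, start) of the window of seq closest to pattern."""
    l = len(pattern)
    return min((hamming(pattern, seq[i:i + l]), i)
               for i in range(len(seq) - l + 1))


def planted_motifs(seqs: list[str], l: int, d: int) -> list[str]:
    """All l-mers that occur in every sequence with at most d mismatches.

    Assumption: exhaustive enumeration over 4**l patterns (small l only).
    """
    hits = []
    for kmer in map("".join, product("ACGT", repeat=l)):
        if all(best_match(kmer, s)[0] <= d for s in seqs):
            hits.append(kmer)
    return hits


# The two variants listed on the slide each differ from the pattern in 4 places.
print(hamming("AAAAAAAAGGGGGGG", "AGAAGAAAGGTTGGG"))
print(hamming("AAAAAAAAGGGGGGG", "CAATAAAACGGCGGG"))
```

The exhaustive search makes the difficulty visible: for l = 15 there are about 10^9 candidate patterns, which is why pruning techniques are needed in practice.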
29Formulating Motif Finding Task (2)
- We need to define:
- Input of the algorithm: this implicitly defines various assumptions we make about the problem (e.g. do we hold a different belief for each sequence that it belongs to the group?)
- Type of motif class
- Search algorithm: how do we search the space of possible motifs?
- Scoring function: how do we score putative motifs?
- Output of the algorithm: should it give us just putative sites, or a binding site model to predict sites?
- Evaluation technique: how do we test our algorithm?
- Think:
- How does the (l,d) problem define these?
- How does it relate to real biology?
30How to Define Motif Class?
- Subsequences: ACTCTT
- IUPAC alphabet: A, C, G, T, R, Y, M, K, S, W, B, D, H, V, N (all non-empty subsets of {A, C, G, T})
- PSSM / PWM (Position Specific Score Matrix or Position Weight Matrix)
- More general probabilistic or other models, e.g. using the Bayesian network modeling language
- Refined definitions based on prior knowledge:
- Homo/hetero dimers
- Variable gaps
- Bias towards some characteristic information profile (Van, 2003)
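As an illustration of the IUPAC motif class, the sketch below turns an IUPAC string into a regular expression and reports every position where it matches. The helper names (iupac_to_regex, find_sites) are hypothetical, and only one strand is scanned.

```python
import re

# Standard IUPAC nucleotide codes: each symbol stands for a subset of ACGT.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iupac_to_regex(motif: str) -> re.Pattern:
    """Compile an IUPAC motif into a regex that finds overlapping matches."""
    parts = []
    for ch in motif.upper():
        bases = IUPAC[ch]
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    # A lookahead keeps each match zero-width, so overlapping sites are found.
    return re.compile("(?=(" + "".join(parts) + "))")


def find_sites(motif: str, seq: str) -> list[int]:
    """Start positions (0-based) of all matches of the motif in seq."""
    return [m.start() for m in iupac_to_regex(motif).finditer(seq.upper())]


print(find_sites("ACTCTT", "ggACTCTTaa"))
print(find_sites("RY", "ACGT"))
```

An IUPAC motif either matches a site or does not, which is the main contrast with the graded scores of a PSSM.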
31PSSM Representation of Binding Sites
- Position Specific Score Matrix: each possible k-mer $s = s_1 \dots s_k$ gets a score for being a binding site, which is $\mathrm{Score}(s) = \sum_{i=1}^{k} w_{i,s_i}$, where $w_{i,c}$ is the weight of letter c at position i
- Probabilistic interpretation: $P(s \mid M) = \prod_{i=1}^{k} P_i(s_i)$
- NOTE:
- Independence assumption between binding site positions!
- The score used in a probabilistic setting is the log-odds score, $\log \frac{P(s \mid M)}{P(s \mid BG)}$
- In many cases the BG is a simple, fixed background distribution (Q) over ACGT.
- The entries in the matrix can be $P_i(a)$, $\log P_i(a)$ or $\log(P_i(a)/Q(a))$, depending on the context of its usage!
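A minimal sketch of log-odds scoring with a PSSM follows, assuming the matrix is given as one dictionary of base probabilities per position and that all probabilities are non-zero. The function names and the toy two-position matrix are made up for illustration.

```python
import math


def log_odds_matrix(probs: list[dict], background: dict) -> list[dict]:
    """Convert per-position probabilities P_i(a) into weights log(P_i(a)/Q(a))."""
    return [{b: math.log(col[b] / background[b]) for b in "ACGT"}
            for col in probs]


def score(weights: list[dict], kmer: str) -> float:
    """Sum of position-specific weights (relies on the independence assumption)."""
    return sum(weights[i][c] for i, c in enumerate(kmer))


def scan(weights: list[dict], seq: str) -> list[tuple[int, float]]:
    """Score every window of the sequence; returns (start, score) pairs."""
    k = len(weights)
    return [(i, score(weights, seq[i:i + k])) for i in range(len(seq) - k + 1)]


# Toy example (hypothetical numbers): a 2-position motif that prefers "AG".
probs = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
uniform = {b: 0.25 for b in "ACGT"}
w = log_odds_matrix(probs, uniform)
best_start, best_score = max(scan(w, "TTAGTT"), key=lambda t: t[1])
print(best_start, round(best_score, 3))
```

Because the score is a sum over positions, a window scores above zero exactly when it is more likely under the motif model than under the background.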
32PSSM vs. IUPAC
- PSSM: enables representing low/high affinity in different positions
- Trade-off between sensitivity and specificity in genome-wide scans
- Huge search space: how to cover it efficiently?
ABF1 example (targets by Lee et al., 2002)
>YAL011W CGTGTTAGATGA
33How to Learn PSSM Motif?
- Easier task: we have aligned samples to learn from
- We have a set of known BS, all of length k (e.g. verified by some biological experiment)
- Compute counts for each base at each position and normalize (ML estimator): $P_i(a) = N_a / N$
- N = number of sequences, $N_a$ = number of a's at position i
- Note:
- This is the ML solution. As in many other cases, it might be problematic when we have very few samples to learn from (e.g. we can get probability 0 for base A at position i simply because we did not see enough examples).
- Solution: use pseudocounts or some prior (e.g. a Dirichlet prior)
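The counting-and-normalizing estimator can be sketched as below. With pseudocount = 0 it is the plain ML estimate; a positive pseudocount added to every base corresponds to the posterior mean under a symmetric Dirichlet prior. The function name and the toy sites are hypothetical.

```python
from collections import Counter


def estimate_pssm(sites: list[str], pseudocount: float = 0.0) -> list[dict]:
    """Estimate P_i(a) from aligned binding sites of equal length.

    P_i(a) = (N_a + pseudocount) / (N + 4 * pseudocount)
    """
    k = len(sites[0])
    if any(len(s) != k for s in sites):
        raise ValueError("all sites must have the same length")
    n = len(sites)
    total = n + 4 * pseudocount
    matrix = []
    for i in range(k):
        counts = Counter(s[i] for s in sites)  # missing bases count as 0
        matrix.append({b: (counts[b] + pseudocount) / total for b in "ACGT"})
    return matrix


sites = ["ACG", "ACG", "ATG", "CCG"]
print(estimate_pssm(sites)[0])                 # ML: G gets probability 0
print(estimate_pssm(sites, pseudocount=1)[0])  # smoothed: no zeros
```

With only four sites, the ML estimate assigns probability 0 to G at the first position, which would give any G-starting site a log score of minus infinity; the pseudocount removes that problem.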
34How to Learn PSSM Motif ? (2)
Remember: in the motif finding problem we have a much harder task.
- The input is a set of (long) sequences suspected to contain a common motif (a PSSM, according to our current model assumption), but we don't know where!
- The output: prediction of new BS based on our learned PSSM motif.
[Figure: input sequences, BS model, predictions. Dark blue marks BS positions, which are hidden from us and which we are trying to learn]
35How to Learn PSSM Motif? (3): MEME Algorithm (Bailey T.L. and Elkan C.P., 1995)
- (Still) one of the most commonly used tools for motif (PSSM) search
36How to Learn PSSM Motif? (3): MEME Algorithm (Bailey T.L. and Elkan C.P., 1995)
- The basic probabilistic framework used by MEME:
- Input: N sequences
- Assume a generative model: a sequence is generated either by the BS model M (PSSM) or by a fixed background distribution BG
- Assume each sequence has exactly 1 BS in it.
- Scoring function: P(Seq | M, BG)
- Try to maximize the likelihood scoring function by adjusting M's (PSSM) parameters.
37How to Learn PSSM Motif ? (4)
- What's the problem? Why is it hard?
- Think of the positions of the BS in each sequence as H, where H is a vector of dimension N.
- Given H we have complete data. Then inferring M's ML parameters is just as we saw for the aligned case → easy.
- Problem 1: we don't have H, and we are trying to learn it too. The ML parameters of M for each position become dependent if H is not given → we have no closed form to compute them analytically, and going over all possible H assignments is not feasible → we need to resort to some method to search the space of possible assignments to M's parameters.
- Problem 2: the landscape of the likelihood function is typically far from convex → many local optima.
38How to Learn PSSM Motif ? (5) MEME Algorithm
- MEME uses a technique called EM to search the space of model M's parameters
- EM: Expectation Maximization
- We review how EM is used in the MEME algorithm in class.
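The following is a minimal EM sketch under the assumptions stated on the previous slides: exactly one site per sequence, a PSSM for the site, and a fixed background. It is not MEME itself. Seeding from a single k-mer, the uniform background, the pseudocount and all function names are choices made here for illustration.

```python
import math

BASES = "ACGT"


def e_step(seqs, pssm, bg):
    """For each sequence, the posterior over the hidden site start (H)."""
    k = len(pssm)
    posteriors = []
    for s in seqs:
        # Likelihood ratio of "site starts at j" versus pure background;
        # background terms outside the site cancel when normalizing.
        ratios = [
            math.prod(pssm[i][s[j + i]] / bg[s[j + i]] for i in range(k))
            for j in range(len(s) - k + 1)
        ]
        total = sum(ratios)
        posteriors.append([r / total for r in ratios])
    return posteriors


def m_step(seqs, posteriors, k, pseudocount=0.1):
    """Re-estimate the PSSM from expected (posterior-weighted) base counts."""
    counts = [{b: pseudocount for b in BASES} for _ in range(k)]
    for s, post in zip(seqs, posteriors):
        for j, z in enumerate(post):
            for i in range(k):
                counts[i][s[j + i]] += z
    return [{b: col[b] / sum(col.values()) for b in BASES} for col in counts]


def em_motif(seqs, init_kmer, bg=None, iters=50, init_weight=0.7):
    """Run EM from a PSSM seeded around init_kmer; return (pssm, starts)."""
    bg = bg or {b: 0.25 for b in BASES}
    rest = (1 - init_weight) / 3
    pssm = [{b: init_weight if b == c else rest for b in BASES}
            for c in init_kmer]
    for _ in range(iters):
        post = e_step(seqs, pssm, bg)
        pssm = m_step(seqs, post, len(init_kmer))
    final = e_step(seqs, pssm, bg)
    starts = [max(range(len(p)), key=p.__getitem__) for p in final]
    return pssm, starts


seqs = ["TTTTACGTTT", "GACGTGGGGG", "CCCCCCACGT", "ATATACGTAT"]
pssm, starts = em_motif(seqs, "ACGT")
print(starts)
```

The E-step fills in the missing H softly (a distribution over positions rather than a single guess), and the M-step is then the same count-and-normalize estimator used for aligned sites, only with fractional counts.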
39Problems with MEME and Other Models
- Think: in light of what we discussed, what assumptions are made in this model? What might cause us problems with real-life data?
- MEME also has other variants we did not discuss here (OOPS, ZOOPS, etc.)
- Also, EM is very sensitive to the starting point → we need a good way to find good ones
40Other Algorithmic Techniques for Motif Finding
- MEME (Expectation Maximization)
- GibbsDNA, AlignACE (Gibbs Sampling)
- CONSENSUS (greedy multiple alignment)
- WINNOWER (Clique finding in graphs)
- SP-STAR (Sum of pairs scoring)
- MITRA (Mismatch trees to prune exhaustive search
space)
More than one way to skin a cat.
41How to Find Binding Sites: Revisited
Classical Solutions
- Find a common motif in a gene set (CONSENSUS, MITRA, MEME, AlignACE)
- Main problem: in many cases the motif is common not just to the subset of sequences we have, but to many others as well → not a good candidate to explain regulation
Discriminative Solutions
Find a common unique motif in genes
Extract the relevant bit from sequences
Promoter
A simple hyper-geometric approach for discovering putative transcription factor binding sites, WABI 01
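One common way to make "common and unique" precise is a hypergeometric tail test. The setup below is an assumption for illustration and not necessarily the exact procedure of the cited paper: out of N sequences in total, K contain the motif; the gene set has n sequences, k of which contain it. The function name and the example numbers are hypothetical.

```python
from math import comb


def hypergeom_pvalue(N: int, K: int, n: int, k: int) -> float:
    """P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: total number of sequences, K: sequences containing the motif,
    n: size of the gene set, k: motif-containing sequences in the gene set.
    """
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


# Hypothetical numbers: a motif found in 300 of 6000 promoters overall,
# and in 20 of the 50 promoters of the gene set.
print(hypergeom_pvalue(6000, 300, 50, 20))
```

A motif that is frequent everywhere gets a large p-value even if it appears in every sequence of the set, which addresses the main problem of the classical solutions.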
42Finding Discriminative Motifs
Step 1
Define the space of motifs: mimic motifs with a simpler class for efficient search
Step 2
Refine motifs
43Binding Sites - Revisited
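- Profiles (PSSMs) make an independence assumption between binding site positions
- Two relevant questions:
- Are there dependencies in binding sites?
- Do we gain an edge in computational tasks if we model such dependencies?
[Figure: a binding site in the promoter of a gene, with example nucleotides A, C and T]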
Modeling Dependencies in Protein-DNA Binding
Sites, RECOMB 03
44How to model binding sites ?
Represent a distribution of binding sites:
- Profile: independence model
- Tree: direct dependencies
- Mixture of Profiles: global dependencies
- Mixture of Trees: both types of dependencies
45Learning Models: Aligned Binding Sites
Learning machinery: select the maximum likelihood model
- Learning procedure for Bayesian networks
46Arabidopsis ABA binding factor 1 (49 examples)
Profile
Test LL per instance: -19.93
Test LL per instance: -18.70 (+1.23) (improvement in likelihood > 2-fold)
47Rap1 Example (Harbison et al. 04) (171 examples)
Profile
48Likelihood improvement over profiles
Significant improvement in generalization → data often exhibits dependencies
49Learning Models: Unaligned Data
- Use the EM algorithm to simultaneously:
- Identify binding site positions
- Learn a dependency model
[Figure: EM algorithm on unaligned data, alternating between learning a model and identifying binding sites]
50Evaluating Performance
- Detect target genes on a genomic scale
[Figure: scanning a promoter (ACGTAT..AGGGATGC) from position -1000 to 0, with an example hit (GAGC) at -473; positions are scored by their probability under the binding site model]
Scoring rule. Crucial issue: p-value of scores
Background model (order-3 Markov chain)
CIS: Compound Importance Sampling Method for Protein-DNA Binding Site p-value Estimation, Bioinformatics, 2004, ISMB 04
51Example ROC curve of HSF1
60 FP
52Evaluation on Localization Data: 5-fold Cross Validation (Lee et al. 2002)
Δ specificity (TP/Predicted)
Δ sensitivity (TP/True)
Improvement by Mixture of Trees over PSSM
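The two measures named on this slide can be computed from a set of predicted targets and a set of true targets, as in the sketch below. It follows the slide's definitions (specificity = TP/Predicted, sensitivity = TP/True); the function name and the gene identifiers are hypothetical.

```python
def sensitivity_specificity(predicted: set, true: set) -> tuple[float, float]:
    """Return (sensitivity, specificity) as defined on the slide.

    sensitivity = TP / |true|, specificity = TP / |predicted|.
    """
    tp = len(predicted & true)
    sensitivity = tp / len(true) if true else 0.0
    specificity = tp / len(predicted) if predicted else 0.0
    return sensitivity, specificity


predicted = {"g1", "g2", "g3", "g4"}
true = {"g2", "g3", "g5"}
sens, spec = sensitivity_specificity(predicted, true)
print(round(sens, 3), spec)
```

Comparing two binding site models then amounts to computing both measures for each model on the same held-out fold and taking the difference.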
53Motif Finding - Evaluation
- Still an open problem
- We have seen several examples of how performance can be evaluated in different ways
- There is (still) no absolute solution for this
- Main problems:
- No large data sets of known sites
- No real annotation of negative samples
- How to define a success measure?
- Differences in input/output assumptions
- A recent effort in this direction: "Assessing computational tools for the discovery of transcription factor binding sites" (Nat. Biotech., Jan 05)
- Compared publicly available tools on the web on (small) data sets of known binding sites based on the TRANSFAC database