Title: Motif finding using the EM algorithm MEME
1Motif finding using the EM algorithm MEME
(Bailey Elkan 1995) http//meme.sdsc.edu/meme/i
ntro.html
EM algorithm Expectation-Maximization In one
run, trains the matrix model and identifies
examples of the matrix
MEME works by iteratively refining matrix and
identifying sites 1. Estimate motif
model a. Start with an n-mer seed (random or
specified) b. Build a matrix by incorporating
some of background frequencies 2.
Identify examples of the model a. For every
n-mer in the input set, identify its probability
given the matrix model 3. Re-estimate the
motif model a. Calculate a new matrix, based on
the weighted frequencies of all n-mers in the
set 4. Iteratively refine the matrix and
identify sites until convergence.
2Problem find a 6-mer motif in 4 sequences
S1 GGCTATTGCAGATGACGAGATGAGGCCCAGACC S2
GGATGACTTATATAAAGGACGATAAGAGATGAC S3
CTAGCTCGTAGCTCGTTGAGATGCGCTCCCCGCTC S4
GATGACGGAGTATTAAAGACTCGATGAGTTATACGA
1. MEME uses an initial EM heuristic to estimate
the best Starting-point matrix G 0.26 0.24 0.18
0.26 0.25 0.26 A 0.24 0.26 0.28 0.24 0.25 0.22 T 0
.25 0.23 0.30 0.25 0.25 0.25 C 0.25 0.27 0.24 0.25
0.25 0.27
32. MEME scores the match of all 6-mers to
current matrix
Here, just consider the underlined 6-mers,
Although in reality all 6-mers are scored
GGCTATTGCATATGACGAGATGAGGCCCAGACC GGATGACTTATAT
AAAGGACCGTGATAAGAGATTAC CTAGCTCGTAGCTCGTTGAGATGCG
CTCCCCGCTC GATGACGGAGTATTAAAGACTCGATGAGTTATACGA
42. MEME scores the match of all 6-mers to
current matrix
GGCTATTGCATATGACGAGATGAGGCCCAGACC GGATGACTTATAT
AAAGGACCGTGATAAGAGATTAC CTAGCTCGTAGCTCGTTGAGATGCG
CTCCCCGCTC GATGACGGAGTATTAAAGACTCGATGAGTTATACGA
The height of the bases above corresponds to how
much that 6-mer counts in calculating the new
matrix
- Reestimate the matrix based on the
- weighted contribution of all 6 mers
- G 0.29 0.24 0.17 0.27 0.24 0.30
- A 0.22 0.26 0.27 0.22 0.28 0.18
- T 0.24 0.23 0.33 0.23 0.24 0.28
- C 0.24 0.27 0.23 0.28 0.24 0.24
5MEME scores the match of all 6-mers to current
matrix
GGCTATTGCATATGACGAGATGAGGCCCAGACC GGATGACTTATAT
AAAGGACCGTGATAAGAGATTAC CTAGCTCGTAGCTCGTTGAGATGCG
CTCCCCGCTC GATGACGGAGTATTAAAGACTCGATGAGTTATACGA
6GGCTATTGCATATGACGAGATGAGGCCCAGACC GGATGACTTATAT
AAAGGACCGTGATAAGAGATTAC CTAGCTCGTAGCTCGTTGAGATGCG
CTCCCCGCTC GATGACGGAGTATTAAAGACTCGATGAGTTATACGA
The height of the bases above corresponds to how
much that 6-mer counts in calculating the new
matrix
- Reestimate the matrix based on the
- weighted contribution of all 6 mers
- G 0.40 0.20 0.15 0.42 0.24 0.30
- A 0.30 0.30 0.20 0.24 0.46 0.18
- T 0.15 0.30 0.45 0.16 0.15 0.28
- C 0.15 0.20 0.20 0.16 0.15 0.24
7MEME scores the match of all 6-mers to current
matrix
GGCTATTGCATATGACGAGATGAGGCCCAGACC GGATGACTTATAT
AAAGGACCGTGATAAGAGATTAC CTAGCTCGTAGCTCGTTGAGATGCG
CTCCCCGCTC GATGACGGAGTATTAAAGACTCGATGAGTTATACGA
Iterations continue until convergence (ie.
numbers dont change much between iterations)
8- Final motif
- G 0.85 0.05 0.10 0.80 0.20 0.35
- A 0.05 0.60 0.10 0.05 0.60 0.10
- T 0.05 0.30 0.70 0.05 0.20 0.10
- C 0.05 0.05 0.10 0.10 0.10 0.35
9An extension of the motif problem finding RNA
motifs
Recently a lot of attention given to small
noncoding RNAs (siRNA, microRNA) that Bind mRNAs
and affect their localization, stability, and
translation
The problem is complicated by the fact that RNA
has more complicated secondary structure, which
is often important in specifying binding.
- One example CMFinder Yao et al. 2006
http//intl-bioinformatics.oxfordjournals.org/cgi/
reprint/22/4/445 - An EM algorithm that also incorporates nt
covariance and structural predictions - Initial sites start with sequences with
predicted secondary structure - Align these sequences with similar secondar
structures - Use EM algorithm to iteratively refine a
probabalistic model of the motifs - Covariate model incorporates both primary
sequences and structure.
10An alternative approach Phylogenetic
footprinting
Rather than look at multiple, different
regulatory regions from one species, look at one
region but across multiple, orthologous regions
from many species. Hypothesis functional
regions of the genome will be conserved more than
nonfunctional regions, due to
selection. Therefore, simply look for regions of
sequence that are conserved above background.
11Simplest case stretches of very highly
conserved sequence
Kellis et al. 2003 Sequencing and comparison of
yeast species to identify genes and regulatory
elements Sequenced 4 closely related
Saccharomyces genomes identified conserved
sequences in multiple alignments of orthologous
sequences from the four species.
12Simplest case stretches of very highly
conserved sequence
- Looked for perfectly conserved short motifs
(triplet-spacer-triplet) in multiple alignments - Extended these seeds to find better boundaries of
largely-conserved sequences. - Once identified, they combined similar
conserved sequences within a genome to get PWMs
13They identified 72 distinct sequence elements
(and refined the list of true yeast genes by
eliminating mis-called S. cerevisiae ORFs)
14Incorporating evolutionary models can improve
motif finding
Remember that evolution acts on functionally
important base pairs Also remember from our
motif finding exercise that not all contiguous
base pairs are equally important (information
content).
Information Profile
bits
Position
15Incorporating evolutionary models can improve
motif finding
Remember that evolution acts on functionally
important base pairs Also remember from our
motif finding exercise that not all contiguous
base pairs are equally important (information
content).
Moses et al. 2003 Position specific variation in
the rate of evolution in transcription factor
binding sites Rate of evolution (ie. degree of
conservation) within a motif is inversely
proportional to the information content
important base pairs evolve slower
16Multiple motif finding methods now work on
multiple alignments of regulatory regions of
coregulated genes. Given 1) group of
regulatory regions of coregulated genes 2)
orthologs of each region, in the form of multiple
alignments
Sinha et al. 2004 PhyME A probabalistic
algorithm for finding motifs in sets of
orthologous sequences Moses et al. 2004
Monkey identification of transcription factor
binding sites in multiple alignments using a
binding site-specific evolutionary
model Siddharthan et al. 2005 PhyloGibbs A
Gibbs sampling motif finder that incorporates
phylogeny.
17Keep in mind that the evolutionary models that
one applies are specific for what one is looking
for (transcription factor binding site, ncRNA,
etc)
Moses et al. 2003 Position specific variation in
the rate of evolution in transcription factor
binding sites Rate of evolution (ie. degree of
conservation) within a motif is inversely
proportional to the information content
important base pairs evolve slower
18VISTA suite for visualizing conservation in
global alignments
Pre-computed multiple global alignments of
mammalian genomes, visualized by conservation
level. -- Uses BLAT local alignment tool to find
seeds of high sequence similarity, then these
seeds are used for global single- or
multiple-genome alignment
Frazer et al. 2004 VISTA computational tools
for comparative genomics
19Which species to compare?
Balance between -- species closely related
enough that 1) Theres enough similar
sequence to get confident pairwise
alignments 2) The sequences of interest and
their corresponding functions have been
conserved -- species distantly enough related
that 1) nonfunctional sequence has had time
to diverge
20Margulies et al. 2005 An initial strategy for
the systematic identification of functional
elements in the human genome by low-redundancy
comparative sequencing.
5 of the human genome is under purifying
selection (estimated through comparisons to the
rat and mouse genomes) only 2 of the human
genome is in coding sequences.
How many genomes and what kind of genomes should
be sequenced to annotate the human genome? --
Theoretical estimates (Eddy 2005 A Model of the
Statistical Power of Comparative Genome
Sequence Analysis) specify what total branch
lengths are required to reach different
degrees of annotation
21Species being sequenced with low coverage by NIH
Comparative Sequencing Program
22Observed sequence alignments for different levels
of sequencing coverage
1X sequencing coverage lt63 of sequences 2X
sequencing coverage covers lt86 of sequences 8X
sequencing coverage covers lt99 of sequence
Modeled using a Poisson distribution