Title: MEME: Multiple Expectation Maximization for Motif Elicitation
1 MEME: Multiple Expectation Maximization for Motif Elicitation
- Quinn S. Lewis
- Bioinformatics (COSC6371)
- 28 November 2006
2 Problem
A motif is a pattern common to a set of nucleic or amino acid subsequences which share some biological property of interest (such as being DNA binding sites for a regulatory protein).
- In biology terms
  - Identify and characterize shared motifs in a set of unaligned genetic or protein sequences
  - Only consider contiguous motifs (i.e. insertions or deletions are not allowed, but appearances of a motif may differ in point mutations)
- In computer science terms
  - Given a set of strings, find a set of non-overlapping, approximately matching substrings
  - Approximately matching substrings must all have the same length
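The computer science formulation can be made concrete with a small sketch. It assumes "approximately matching" means a small Hamming distance (point mutations only, no gaps), and it assumes the pattern is already given, which MEME does not get to assume. The function names and the toy sequences are made up for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def best_window(seq: str, pattern: str) -> tuple[int, int]:
    """Return (start, mismatches) of the window of seq closest to pattern."""
    w = len(pattern)
    scored = [(hamming(seq[j:j + w], pattern), j) for j in range(len(seq) - w + 1)]
    mismatches, start = min(scored)  # ties go to the leftmost window
    return start, mismatches


sequences = ["ACGTTGACAT", "TTGTCATGCA", "GGTTGACACC"]  # made-up toy data
for s in sequences:
    print(s, best_window(s, "TTGACA"))
```

The second sequence matches only with one point mutation (TTGTCA), which is exactly the kind of variation the problem statement allows. The hard part, and the reason for EM, is that neither the pattern nor its start positions are known in advance.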
3 A Solution
- Discover (conserved) motifs in a group of unaligned and related sequences (DNA or protein)
- Automatically choose the following (with little or no prior knowledge)
  - Best width of motifs
  - Number of occurrences in each sequence
  - Composition of each motif
4 Uses
- Find similar biological function and structure in sequences
- Sequence variation can be significant
- Motifs are sometimes small (6-8 base pairs)
- e.g. sequence-specific binding sites for proteins
- Reduce unnecessary wetlab experimentation
5 Context
- Local multiple sequence alignment (MSA)
  - As opposed to global MSA (e.g. CLUSTAL, T-COFFEE)
- Statistical method
  - As opposed to profile or block analysis, which depends on first producing a global MSA
- Unsupervised learning algorithm
  - As opposed to supervised learning, which requires labeled training examples (here, sequences whose motif locations have already been marked by a person)
The word meme was coined in 1976 by Richard
Dawkins, author of The Selfish Gene, and refers
to a unit of cultural information transferable
from one mind to another. A meme propagates
itself as a unit of cultural evolution and
diffusion analogous in many ways to the
behavior of the gene. --Wikipedia
6 Algorithm Components
- Expectation maximization (EM)
- EM-based heuristic for choosing the starting point for EM
- Maximum likelihood ratio-based (LRT-based) heuristic for determining the best number of model free parameters
- Multi-start for searching over possible motif widths
- Greedy search for finding multiple motifs
Another explanation for the name MEME: it is a greedy algorithm, a "me! me!" algorithm.
7 In a perfect world
- If start locations were known, alignment would be easy
- Could leverage the fact that no insertions or deletions exist in the sequences
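To see why known start locations make the problem easy, here is a minimal sketch: with no insertions or deletions, the motif occurrences line up column by column, so the motif's composition is just a table of per-column letter frequencies. The function name, the toy sequences and the start positions are assumptions for illustration.

```python
from collections import Counter


def profile_from_starts(seqs, starts, width, alphabet="ACGT"):
    """Column-wise letter frequencies of the motif windows at known starts."""
    windows = [s[j:j + width] for s, j in zip(seqs, starts)]
    profile = []
    for k in range(width):
        counts = Counter(w[k] for w in windows)
        profile.append({c: counts[c] / len(windows) for c in alphabet})
    return profile


seqs = ["ACGTTGACAT", "TTGTCATGCA", "GGTTGACACC"]  # made-up toy data
starts = [3, 0, 2]                                  # assumed to be known
profile = profile_from_starts(seqs, starts, width=6)
consensus = "".join(max(col, key=col.get) for col in profile)
print(consensus)  # TTGACA
```

In practice the start positions are exactly what is unknown, which is why the remaining slides treat them as hidden variables to be estimated.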
8 Types of Possible Motif Models
- OOPS
  - Exactly one occurrence of the motif per sequence in the dataset
- ZOOPS
  - Zero or one occurrence of the motif per sequence in the dataset
- TCM
  - The motif may appear any number of times in a sequence (two-component mixture)
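The difference between the three model types comes down to how many motif occurrences each sequence is allowed to have. The sketch below only encodes that constraint; the function name and the example counts are hypothetical and it is not part of MEME itself.

```python
def consistent(model: str, counts: list[int]) -> bool:
    """Check per-sequence motif occurrence counts against a model type."""
    if model == "OOPS":    # exactly one occurrence in every sequence
        return all(n == 1 for n in counts)
    if model == "ZOOPS":   # zero or one occurrence in every sequence
        return all(n in (0, 1) for n in counts)
    if model == "TCM":     # any number of occurrences, including none
        return all(n >= 0 for n in counts)
    raise ValueError(f"unknown model type: {model}")


counts = [1, 0, 3]  # made-up occurrence counts for three sequences
print({m: consistent(m, counts) for m in ("OOPS", "ZOOPS", "TCM")})
```

A dataset in which one sequence lacks the motif and another repeats it three times fits only the TCM assumption, so the other two models would be misspecified for it.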
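9 Expectation Maximization
- Expectation step: given the current pattern, estimate the location of the (variable) sequence pattern in each sequence of the set (the first pass starts from an initial guess)
- Maximization step: improve/update the pattern from those estimated locations as the set of sequences is iteratively scanned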
10 Expectation Maximization Idea
11 Expectation Maximization Algorithm
- dataset - unaligned set of sequences (training data) S1, S2, ..., Si, ..., Sn, each of length L
- W - width of motif
- p - matrix representing the probability of character c in column k (the character c will be A, C, G, or T for DNA sequences or one of the 20 protein characters)
- Z - matrix of probabilities that the motif starts in position j in Si
- e - epsilon value (a small threshold used to decide when the estimates have stopped changing)
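The quantities above can be tied together in a minimal EM sketch for the simplest case, one motif occurrence per sequence (OOPS). It is an illustration under stated assumptions, not MEME's implementation: the background is fixed and uniform, a pseudocount of 1 is added to every count, and the names `char_probs` (character probabilities per motif column), `start_probs` (start-position probabilities per sequence) and `eps` (the epsilon threshold) are chosen here for readability.

```python
ALPHABET = "ACGT"


def e_step(seqs, char_probs, background, width):
    """For each sequence, probability that the motif starts at each position."""
    start_probs = []
    for s in seqs:
        scores = []
        for j in range(len(s) - width + 1):
            # Likelihood ratio of the window: motif model vs. background.
            ratio = 1.0
            for k in range(width):
                ratio *= char_probs[k][s[j + k]] / background[s[j + k]]
            scores.append(ratio)
        total = sum(scores)
        start_probs.append([x / total for x in scores])  # normalise per sequence
    return start_probs


def m_step(seqs, start_probs, width, pseudocount=1.0):
    """Re-estimate per-column character probabilities from expected counts."""
    char_probs = []
    for k in range(width):
        counts = {c: pseudocount for c in ALPHABET}
        for s, row in zip(seqs, start_probs):
            for j, z in enumerate(row):
                counts[s[j + k]] += z
        total = sum(counts.values())
        char_probs.append({c: counts[c] / total for c in ALPHABET})
    return char_probs


def em_oops(seqs, init_probs, width, eps=1e-4, max_iter=100):
    """Alternate E and M steps until the motif changes by less than eps."""
    background = {c: 1 / len(ALPHABET) for c in ALPHABET}  # assumption: uniform
    char_probs = init_probs
    for _ in range(max_iter):
        start_probs = e_step(seqs, char_probs, background, width)
        new_probs = m_step(seqs, start_probs, width)
        change = max(abs(new_probs[k][c] - char_probs[k][c])
                     for k in range(width) for c in ALPHABET)
        char_probs = new_probs
        if change < eps:
            break
    return char_probs, e_step(seqs, char_probs, background, width)


seqs = ["ACGTTGACAT", "TTGACAGGCA", "GGTTGACACC", "CATTGACAGG"]  # toy data
# Rough starting guess "TTGTCA": 0.7 on the guessed letter, 0.1 on each other.
init = [{c: (0.7 if c == g else 0.1) for c in ALPHABET} for g in "TTGTCA"]
char_probs, start_probs = em_oops(seqs, init, width=6)
starts = [max(range(len(row)), key=row.__getitem__) for row in start_probs]
consensus = "".join(max(col, key=col.get) for col in char_probs)
print(starts, consensus)
```

On this toy data the starting guess is off by one letter, and the iterations both locate the planted sites and correct the fourth column of the motif. EM only climbs to a nearby local optimum, so a poor starting guess can give a poor motif, which is why the choice of starting point matters.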
12 MEME Algorithm
13 Contributions
- Subsequence-derived starting points for EM
  - May be useful with other methods
  - Saves time (only need to run EM for one iteration from each starting point, then greedily select the best starting point based on the likelihood of the learned model)
- Little or no prior knowledge requirement (unsupervised learning)
- Drops the assumption that each sequence contains exactly one appearance of a motif and fits the n-per model to a dataset, so it can discover motifs in datasets in which many sequences do not contain the motif
- Erases appearances of the motif found after each pass (using the probabilistic weighting scheme), so it finds multiple different motifs and motifs with multiple parts
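The idea of subsequence-derived starting points can be sketched as follows: every width-W subsequence that actually occurs in the dataset is turned into an initial motif model, and each model would then be given one EM iteration and ranked by likelihood (that ranking step is not shown here). The weight `x = 0.5` on the observed letter and both function names are assumptions for illustration.

```python
ALPHABET = "ACGT"


def seed_from_subsequence(sub, x=0.5):
    """Per-column letter probabilities: x on the observed letter, rest shared."""
    rest = (1.0 - x) / (len(ALPHABET) - 1)
    return [{c: (x if c == letter else rest) for c in ALPHABET} for letter in sub]


def candidate_seeds(seqs, width):
    """Every distinct width-long subsequence, in order of first appearance."""
    seen = {}
    for s in seqs:
        for j in range(len(s) - width + 1):
            seen.setdefault(s[j:j + width], None)
    return list(seen)


seqs = ["ACGTTGACAT", "TTGACAGGCA"]  # made-up toy data
seeds = candidate_seeds(seqs, width=6)
model = seed_from_subsequence(seeds[0])
print(len(seeds), seeds[0], model[0])
```

Because a real motif occurrence is itself one of the subsequences, at least one candidate starts EM close to the motif, which is what makes a single iteration per candidate a reasonable screening step.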
14 Areas for Improvement
- Allowances for gaps and substitutions in the conserved regions are not included
- Ability to test significance of the pattern is often not included in the analysis
- Erases input data each time a new motif is discovered, using the assumption that this motif is correct
- Limits the model exclusively to the two-component case
- Time complexity
- Overly pessimistic about an alignment
  - Could conceivably lead to missed signals
15 Other Tools
- MAST - http://meme.sdsc.edu
  - Uses output of MEME
  - Searches biological sequence databases for sequences that contain one or more of a group of known motifs
- ParaMEME - http://meme.sdsc.edu
  - Parallel version of MEME
  - Can download and run
  - Can run from website (http://meme.sdsc.edu)
- MetaMEME - http://metameme.sdsc.edu
  - Toolkit for building and using motif-based hidden Markov models of DNA and proteins
- Taverna - http://taverna.sourceforge.net
  - Uses MEME (among other bioinformatics applications and data) to create and run workflows