MEME: Multiple Expectation Maximization for Motif Elicitation

1
MEME: Multiple Expectation Maximization for Motif
Elicitation

Quinn S. Lewis
Bioinformatics (COSC6371)
28 November 2006

2
Problem
A motif is a pattern common to a set of nucleic
or amino acid subsequences which share some
biological property of interest (such as being
DNA binding sites for a regulatory protein).

In biology terms:
Identify and characterize shared motifs in a set
of unaligned genetic or protein sequences
Only consider contiguous motifs (i.e., insertions
or deletions are not allowed, but appearances of a
motif may differ by point mutations)

In computer science terms:
Given a set of strings, find a set of
non-overlapping, approximately matching substrings
Approximately matching substrings must all have
the same length
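
The computer science framing above can be illustrated with a small sketch. It is not MEME itself: it simply counts point mutations (Hamming distance) between equal-length substrings and finds the closest pair by brute force. The function names and the example strings are made up for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("substrings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def substrings(seq: str, width: int):
    """Yield (start, substring) for every contiguous substring of the given width."""
    for start in range(len(seq) - width + 1):
        yield start, seq[start:start + width]


def best_pair(seq_a: str, seq_b: str, width: int):
    """Return (mismatches, start_a, start_b) for the closest pair of substrings."""
    return min(
        (hamming(word_a, word_b), start_a, start_b)
        for start_a, word_a in substrings(seq_a, width)
        for start_b, word_b in substrings(seq_b, width)
    )


if __name__ == "__main__":
    # Hypothetical sequences: the two sites differ by one point mutation.
    print(best_pair("AATGACTCAGG", "CCTGAGTCATT", 7))
```

Because no insertions or deletions are allowed, comparing two candidate sites never needs more than a position-by-position check.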

3
A Solution

Discover (conserved) motifs in a group of
unaligned and related sequences (DNA or protein)
Automatically choose the following (with little
or no prior knowledge)
Best width of motifs
Number of occurrences in each sequence
Composition of each motif

4
Uses

Find similar biological function and structure in
sequences
Sequence variation can be significant
Motifs sometimes small (6-8 base pairs)
e.g. Sequence-specific binding sites for proteins
Reduce unnecessary wetlab experimentation

5
Context

Local multiple sequence alignment (MSA)
As opposed to global MSA (e.g. CLUSTAL, T-COFFEE)
Statistical method
As opposed to profile or block analysis which
depends on first producing a global MSA
Unsupervised learning algorithm
As opposed to supervised learning, which requires
human-labeled training examples

The word meme was coined in 1976 by Richard
Dawkins, author of The Selfish Gene, and refers
to a unit of cultural information transferable
from one mind to another. A meme propagates
itself as a unit of cultural evolution and
diffusion analogous in many ways to the
behavior of the gene. --Wikipedia

6
Algorithm Components

Expectation maximization (EM)
EM-based heuristic for choosing the starting
point for EM
Maximum likelihood ratio test (LRT)-based
heuristic for determining the best number of
model free parameters
Multi-start for searching over possible motif
widths
Greedy search for finding multiple motifs

Another explanation for the name MEME: it is a
greedy algorithm, a "me! me!" algorithm.

7
In a perfect world

If start locations were known, alignment would be
easy
Could leverage the fact that no insertions or
deletions exist in the sequences
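
A minimal sketch of the perfect-world case: with known start positions and no insertions or deletions, alignment is just slicing, and the motif can be summarized by per-column letter frequencies. The sequences, start positions, and pseudocount below are invented for illustration.

```python
from collections import Counter


def align_known_sites(sequences, starts, width):
    """Cut the motif occurrence out of each sequence; no gaps are needed."""
    return [seq[s:s + width] for seq, s in zip(sequences, starts)]


def column_frequencies(sites, alphabet="ACGT", pseudocount=1.0):
    """Per-column letter frequencies of the stacked sites.

    The pseudocount (an assumed value) keeps unseen letters from getting
    probability zero.
    """
    profile = []
    for column in zip(*sites):
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        profile.append({c: (counts[c] + pseudocount) / total for c in alphabet})
    return profile


if __name__ == "__main__":
    seqs = ["GGTATAATCC", "TATACTGGGG", "CCCCTATAAT"]  # hypothetical data
    sites = align_known_sites(seqs, [2, 0, 4], 6)
    print(sites)
    print(column_frequencies(sites)[0])
```

The hard part of the real problem is that the start positions are unknown, which is what EM is used to estimate.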

8
Types of Possible Motif Models

OOPS
One occurrence per sequence of the motif in the
dataset
ZOOPS
Zero or one motif occurrences per dataset
sequence
TCM
Motif may appear any number of times in a sequence
(two-component mixture)
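
The difference between the three model types comes down to how many occurrences each sequence may contain. The helper below is a hypothetical illustration of those constraints only; it does not fit any model.

```python
def allowed_by_model(model: str, occurrences_per_sequence: list[int]) -> bool:
    """Check whether per-sequence occurrence counts are consistent with a model type."""
    if model == "OOPS":   # exactly one occurrence in every sequence
        return all(n == 1 for n in occurrences_per_sequence)
    if model == "ZOOPS":  # zero or one occurrence in every sequence
        return all(n in (0, 1) for n in occurrences_per_sequence)
    if model == "TCM":    # any number of occurrences, including none
        return all(n >= 0 for n in occurrences_per_sequence)
    raise ValueError(f"unknown model type: {model}")
```

For example, the counts [1, 0, 1] fit ZOOPS and TCM but not OOPS, and [2, 1, 0] fit only TCM.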

9
Expectation Maximization

Expectation step: estimate the location of a
(variable) sequence pattern in a set of sequences,
starting from an initial guess
Maximization step: improve/update the pattern as
the set of sequences is iteratively scanned

10
Expectation Maximization Idea

11
Expectation Maximization Algorithm

dataset - unaligned set of sequences (training
data) S1, S2, ..., Si, ..., Sn, each of length L
W - width of motif
p - matrix representing the probability of
character c in column k (the character c will be
A, C, G, or T for DNA sequences or one of the 20
amino acids for protein sequences)
Z - matrix of probabilities that the motif starts
in position j in Si
e - epsilon value (convergence threshold)
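
The quantities listed above can be tied together in a short sketch of EM for the one-occurrence-per-sequence case. This is a simplified illustration, not the MEME implementation: the background distribution is assumed uniform and held fixed, the pseudocount is an assumed value, and the names `e_step`, `m_step`, and `run_em` are made up here. The motif width is `width`, and `epsilon` is the convergence threshold.

```python
ALPHABET = "ACGT"


def e_step(sequences, motif, background, width):
    """For each sequence, the probability that the motif starts at each position."""
    start_probs = []
    for seq in sequences:
        weights = []
        for j in range(len(seq) - width + 1):
            # Likelihood ratio of "motif starts at j" versus background only.
            ratio = 1.0
            for k in range(width):
                c = seq[j + k]
                ratio *= motif[k][c] / background[c]
            weights.append(ratio)
        total = sum(weights)
        start_probs.append([w / total for w in weights])
    return start_probs


def m_step(sequences, start_probs, width, pseudocount=0.1):
    """Re-estimate per-column letter probabilities from expected counts."""
    counts = [{c: pseudocount for c in ALPHABET} for _ in range(width)]
    for seq, probs in zip(sequences, start_probs):
        for j, z in enumerate(probs):
            for k in range(width):
                counts[k][seq[j + k]] += z
    return [
        {c: col[c] / sum(col.values()) for c in ALPHABET}
        for col in counts
    ]


def run_em(sequences, motif, width, epsilon=1e-4, max_iter=100):
    """Alternate E and M steps until the motif matrix changes by less than epsilon."""
    background = {c: 1.0 / len(ALPHABET) for c in ALPHABET}  # assumed uniform
    start_probs = e_step(sequences, motif, background, width)
    for _ in range(max_iter):
        new_motif = m_step(sequences, start_probs, width)
        change = max(
            abs(new_motif[k][c] - motif[k][c])
            for k in range(width) for c in ALPHABET
        )
        motif = new_motif
        start_probs = e_step(sequences, motif, background, width)
        if change < epsilon:
            break
    return motif, start_probs
```

A run needs an initial motif matrix, for instance one that favors the letters of a single subsequence in each column.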

12
MEME Algorithm

13
Contributions

Subsequence-derived starting points for EM
May be useful with other methods
Saves time (only need to run EM for one iteration
from each starting point, then greedily select
the best starting point based on the likelihood
of the learned model)
Little or no prior knowledge requirement
(unsupervised learning)
Drops the assumption that each sequence contains
exactly one appearance of a motif and fits the
n-per model to the dataset: can discover motifs in
datasets in which many sequences do not contain
the motif
Erases appearances of the motif found after each
pass (using the probabilistic weighting scheme):
finds multiple different motifs and motifs with
multiple parts
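
The erasing idea in the last point can be sketched as follows. Every sequence position carries a weight that starts at 1.0; after a motif is found, positions likely to be covered by an occurrence are down-weighted so that the next pass looks elsewhere. The exact update rule and the name `erase` are assumptions for this sketch rather than the scheme from the slides.

```python
def erase(weights, start_probs, width):
    """Down-weight positions likely to be covered by the motif just found.

    weights[i][j]     : current weight of position j in sequence i (1.0 = untouched)
    start_probs[i][s] : probability that an occurrence starts at position s
    """
    new_weights = []
    for w_row, z_row in zip(weights, start_probs):
        row = list(w_row)
        for s, z in enumerate(z_row):
            # An occurrence starting at s covers positions s .. s + width - 1.
            for j in range(s, s + width):
                row[j] *= 1.0 - z
        new_weights.append(row)
    return new_weights
```

A position covered by a certain occurrence ends up with weight 0, while positions the motif never touches keep weight 1.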

14
Areas for Improvement

Allowances for gaps and substitutions in the
conserved regions are not included
Ability to test significance of the pattern is
often not included in the analysis
Erases input data each time a new motif is
discovered, on the assumption that this motif
is correct
Limits the model exclusively to the two-component
case
Time complexity
Overly pessimistic about an alignment, which
could conceivably lead to missed signals

15
Other Tools

MAST - http://meme.sdsc.edu
Uses output of MEME
Searches biological sequence databases for
sequences that contain one or more of a group of
known motifs
ParaMEME - http://meme.sdsc.edu
Parallel version of MEME
Can download and run
Can run from website (http://meme.sdsc.edu)
MetaMEME - http://metameme.sdsc.edu
Toolkit for building and using motif-based hidden
Markov models of DNA and proteins
Taverna - http://taverna.sourceforge.net
Uses MEME (among other bioinformatics applications
and data) to create and run workflows