MEME: Multiple Expectation Maximization for Motif Elicitation


1
MEME: Multiple Expectation Maximization for Motif
Elicitation
  • Quinn S. Lewis
  • Bioinformatics (COSC6371)
  • 28 November 2006

2
Problem
A motif is a pattern common to a set of nucleic
or amino acid subsequences which share some
biological property of interest (such as being
DNA binding sites for a regulatory protein).
  • In biology terms:
      • Identify and characterize shared motifs in a set
        of unaligned genetic or protein sequences
      • Only consider contiguous motifs (i.e. insertions
        or deletions are not allowed, but appearances of
        a motif may differ by point mutations)
  • In computer science terms:
      • Given a set of strings, find a set of
        non-overlapping, approximately matching substrings
      • Approximately matching substrings must all have
        the same length
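
As a rough illustration of the computer science formulation, the sketch below treats "approximately matching" as a bounded number of point substitutions (Hamming distance) between equal-length substrings. The mismatch threshold and the function names are assumptions made for illustration; MEME itself scores matches with a probabilistic model rather than a fixed mismatch count.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def approximate_matches(sequence: str, pattern: str, max_mismatches: int) -> list:
    """Start positions where `pattern` occurs with at most `max_mismatches`
    substitutions. No insertions or deletions are considered."""
    width = len(pattern)
    return [
        j
        for j in range(len(sequence) - width + 1)
        if hamming(sequence[j:j + width], pattern) <= max_mismatches
    ]


# Hypothetical example: the second occurrence differs by one point mutation.
hits = approximate_matches("ACGTACGAAC", "ACGT", max_mismatches=1)
```

Because insertions and deletions are excluded, every candidate is simply a window of the same width, which is what keeps the search space small.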

3
A Solution
  • Discover (conserved) motifs in a group of
    unaligned and related sequences (DNA or protein)
  • Automatically choose the following (with little
    or no prior knowledge):
      • Best width of motifs
      • Number of occurrences in each sequence
      • Composition of each motif

4
Uses
  • Find similar biological function and structure in
    sequences
  • Sequence variation can be significant
  • Motifs sometimes small (6-8 base pairs)
  • e.g. Sequence-specific binding sites for proteins
  • Reduce unnecessary wetlab experimentation

5
Context
  • Local multiple sequence alignment (MSA)
      • As opposed to global MSA (e.g. CLUSTAL, T-COFFEE)
  • Statistical method
      • As opposed to profile or block analysis, which
        depends on first producing a global MSA
  • Unsupervised learning algorithm
      • As opposed to supervised learning, which requires
        labeled training examples

The word meme was coined in 1976 by Richard
Dawkins, author of The Selfish Gene, and refers
to a unit of cultural information transferable
from one mind to another. A meme propagates
itself as a unit of cultural evolution and
diffusion analogous in many ways to the
behavior of the gene. --Wikipedia

6
Algorithm Components
  • Expectation maximization (EM)
  • EM-based heuristic for choosing the starting
    point for EM
  • Maximum likelihood ratio-based (LRT-based)
    heuristic for determining the best number of
    model free parameters
  • Multi-start for searching over possible motif
    widths
  • Greedy search for finding multiple motifs

Another explanation for the name MEME: it is a
greedy algorithm, a "me! me!" algorithm.

7
In a perfect world
  • If start locations were known, alignment would be
    easy
  • Could leverage the fact that no insertions or
    deletions exist in the sequences
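
To see why known start locations would make the problem easy: with no insertions or deletions, the occurrences line up column by column and the motif can be read off by counting. A minimal sketch, in which the function name `motif_profile`, the pseudocount, and the example sequences are all assumptions for illustration:

```python
def motif_profile(sequences, starts, width, alphabet="ACGT", pseudocount=1.0):
    """Column-wise character probabilities from motif occurrences whose
    start positions are already known."""
    # One count table per motif column, initialised with pseudocounts so
    # that no character ends up with probability zero.
    counts = [{c: pseudocount for c in alphabet} for _ in range(width)]
    for seq, start in zip(sequences, starts):
        for k in range(width):
            counts[k][seq[start + k]] += 1
    profile = []
    for column in counts:
        total = sum(column.values())
        profile.append({c: n / total for c, n in column.items()})
    return profile


# Hypothetical data: occurrences ACGT, ACGT and ACGA at known offsets.
profile = motif_profile(["TTACGT", "ACGTAA", "GACGAG"], [2, 0, 1], width=4)
consensus = "".join(max(col, key=col.get) for col in profile)
```

The hard part of motif discovery is that `starts` is unknown, which is the gap that expectation maximization fills.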

8
Types of Possible Motif Models
  • OOPS
      • Exactly one occurrence of the motif per sequence
        in the dataset
  • ZOOPS
      • Zero or one occurrence of the motif per sequence
        in the dataset
  • TCM
      • The motif may appear any number of times in a
        sequence (two-component mixture)

9
Expectation Maximization
  • Expectation step: starting from an initial guess,
    estimate the location of a (variable) sequence
    pattern in each of a set of sequences
  • Maximization step: improve/update the pattern from
    those estimated locations; both steps repeat as
    the set of sequences is iteratively scanned

10
Expectation Maximization Idea

11
Expectation Maximization Algorithm
  • dataset - unaligned set of sequences (training
    data) S1, S2, ..., Si, ..., Sn, each of length L
  • W - width of motif
  • p - matrix representing the probability of
    character c in column k of the motif (the
    character c will be A, C, G, or T for DNA
    sequences or one of the 20 protein characters)
  • Z - matrix of probabilities that the motif starts
    in position j in Si
  • e - epsilon value (threshold for deciding that EM
    has converged)
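
The quantities listed above can be tied together in a small EM loop. The sketch below is a simplified illustration, not MEME's implementation: it assumes exactly one occurrence per sequence, a uniform prior over start positions, a background model fixed to the overall character frequencies, and a starting motif built from a seed subsequence. All function names, the seed weight of 0.5, and the pseudocount are assumptions.

```python
import math

ALPHABET = "ACGT"


def seed_motif(seed, weight=0.5):
    """Initial motif model from a seed subsequence (hypothetical helper)."""
    rest = (1.0 - weight) / (len(ALPHABET) - 1)
    return [{c: (weight if c == s else rest) for c in ALPHABET} for s in seed]


def background_freqs(sequences):
    """Overall character frequencies, held fixed as the background model."""
    text = "".join(sequences)
    return {c: text.count(c) / len(text) for c in ALPHABET}


def e_step(sequences, motif, background):
    """Probability of each motif start position in each sequence."""
    width = len(motif)
    start_probs = []
    for seq in sequences:
        # Log-odds of motif vs. background for every window of this width.
        scores = [
            sum(math.log(motif[k][seq[j + k]] / background[seq[j + k]])
                for k in range(width))
            for j in range(len(seq) - width + 1)
        ]
        top = max(scores)  # subtract the maximum for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        start_probs.append([w / total for w in weights])
    return start_probs


def m_step(sequences, start_probs, width, pseudocount=0.1):
    """Re-estimate column probabilities from expected character counts."""
    counts = [{c: pseudocount for c in ALPHABET} for _ in range(width)]
    for seq, probs in zip(sequences, start_probs):
        for j, z in enumerate(probs):
            for k in range(width):
                counts[k][seq[j + k]] += z
    motif = []
    for column in counts:
        total = sum(column.values())
        motif.append({c: n / total for c, n in column.items()})
    return motif


def run_em(sequences, seed, epsilon=1e-6, max_iter=200):
    """Alternate E and M steps until the motif changes by less than epsilon."""
    background = background_freqs(sequences)
    motif = seed_motif(seed)
    for _ in range(max_iter):
        start_probs = e_step(sequences, motif, background)
        updated = m_step(sequences, start_probs, len(seed))
        change = max(abs(updated[k][c] - motif[k][c])
                     for k in range(len(seed)) for c in ALPHABET)
        motif = updated
        if change < epsilon:
            break
    return motif, e_step(sequences, motif, background)


# Hypothetical toy data with TATAAT planted once in each sequence.
sequences = ["GCGCTATAATGCGC", "CCTATAATGGCGCG",
             "GGCGCGCTATAATC", "CGTATAATCGCGGC"]
motif, start_probs = run_em(sequences, seed="TATAAT")
starts = [max(range(len(p)), key=p.__getitem__) for p in start_probs]
consensus = "".join(max(col, key=col.get) for col in motif)
```

On this toy dataset the most probable start in each sequence should coincide with the planted position. The code deliberately uses descriptive names (`motif`, `start_probs`) for the two matrices rather than single letters.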

12
MEME Algorithm

13
Contributions
  • Subsequence-derived starting points for EM
      • May be useful with other methods
      • Saves time (EM only needs to run for one
        iteration from each starting point; the best
        starting point is then selected greedily, based
        on the likelihood of the learned model)
  • Little or no prior knowledge requirement
    (unsupervised learning)
  • Drops the assumption that each sequence contains
    exactly one appearance of a motif; fitting the
    n-per model to a dataset can discover motifs in
    datasets in which many sequences do not contain
    the motif
  • Erases appearances of the motif found after each
    pass (using a probabilistic weighting scheme),
    which finds multiple different motifs and motifs
    with multiple parts
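
The erasing idea in the last bullet can be pictured as per-position weights that shrink wherever a discovered motif probably lies, so that the next pass is steered toward a different motif. The sketch below assumes the one-occurrence-per-sequence case, where the probability that a position is covered equals the sum of the start probabilities of the windows containing it. The function name `erase` and the weight layout are assumptions for illustration, not MEME's exact update rule.

```python
def erase(weights, start_probs, width):
    """Down-weight positions that are probably part of the motif just found.

    weights[i][m]     - current weight of position m in sequence i
    start_probs[i][j] - probability that the motif starts at position j
    """
    new_weights = []
    for row, probs in zip(weights, start_probs):
        covered = [0.0] * len(row)
        for j, z in enumerate(probs):
            for k in range(width):
                covered[j + k] += z  # the window starting at j covers j..j+width-1
        # Clamp at 1.0 to guard against floating-point overshoot.
        new_weights.append(
            [w * (1.0 - min(c, 1.0)) for w, c in zip(row, covered)]
        )
    return new_weights


# Hypothetical example: one sequence of length 6 and a motif of width 3.
certain = erase([[1.0] * 6], [[0.0, 1.0, 0.0, 0.0]], width=3)
unsure = erase([[1.0] * 6], [[0.5, 0.5, 0.0, 0.0]], width=3)
```

A confident occurrence zeroes out its positions entirely, while an uncertain one only partially reduces them. This is the sense in which erasing assumes the discovered motif is correct.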

14
Areas for Improvement
  • Allowances for gaps and substitutions in the
    conserved regions are not included
  • Ability to test significance of the pattern is
    often not included in the analysis
  • Erases input data each time a new motif is
    discovered using the assumption that this motif
    is correct
  • Limits the model exclusively to the two-component
    case
  • Time complexity
  • Overly pessimistic about an alignment
      • Could conceivably lead to missed signals

15
Other Tools
  • MAST - http://meme.sdsc.edu
  • Uses output of MEME
  • Searches biological sequence databases for
    sequences that contain one or more of a group of
    known motifs
  • ParaMEME - http://meme.sdsc.edu
  • Parallel version of MEME
  • Can download and run
  • Can run from website (http://meme.sdsc.edu)
  • MetaMEME - http://metameme.sdsc.edu
  • Toolkit for building and using motif-based hidden
    Markov models of DNA and proteins
  • Taverna - http://taverna.sourceforge.net
  • Uses MEME (among other bioinformatics applications
    and data) to create and run workflows