Motif finding in biological sequences

Provided by: hen6153

1
Motif finding in biological sequences
2
Protein families
  • Protein families have sequences in common
  • Evolutionary conservation
  • Functional parts are most conserved, Domains
  • Structural elements are also conserved
  • Twists / turns / helices etc.
  • How do you find a domain?
  • Multiple alignment
  • Statistical model, based on distribution

3
Methods for Characterizing a Protein Family
  • Objective: Given a number of related sequences,
    encapsulate what they have in common in such a
    way that we can recognize other members of the
    family.
  • Some standard methods for characterization
  • Multiple Alignments
  • Regular Expressions
  • Consensus Sequences
  • Hidden Markov Models
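
Two of the simpler characterizations listed above, a consensus sequence and a regular expression, can be sketched in a few lines. This is only an illustration: the toy sequences, the function names and the small subset of PROSITE-style pattern syntax handled here are assumptions, not part of the presentation.

```python
import re
from collections import Counter

def consensus(aligned):
    """Most common residue per column of an ungapped alignment (ties: first seen)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))

def prosite_to_regex(pattern):
    """Translate a small subset of PROSITE-style syntax: '-' separators, x, x(n), [ABC]."""
    parts = []
    for elem in pattern.split("-"):
        elem = elem.strip()
        m = re.fullmatch(r"x(?:\((\d+)\))?", elem)
        if m:
            # 'x' is any residue; 'x(n)' is exactly n arbitrary residues
            parts.append("." if m.group(1) is None else ".{%s}" % m.group(1))
        else:
            parts.append(elem)
    return "".join(parts)

family = ["CAHGC", "CSHGC", "CAHAC", "CTHGC"]  # made-up toy family
cons = consensus(family)
rx = prosite_to_regex("C-x-H-[GA]-C")
members = [s for s in family if re.fullmatch(rx, s)]
```

The regular expression accepts every member of the toy family, while the consensus keeps only the majority residue per column and therefore loses the variation at the second and fourth positions.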

4
Multiple sequence alignment
http://en.wikipedia.org/wiki/Multiple_sequence_alignment
5
Sequence logo
6
Statistical model
  • Hidden Markov model (HMM)
  • State machine (explained later) with hidden
    states and emissions (things you can observe,
    i.e. amino acids)
  • Simple version in bioinformatics
  • profile hidden Markov model

7
Simplest form: matches
8
But also deletions and insertions
  • Skip one location in the profile
  • Delete state
  • Add a gap in the profile (extra residues in the sequence)
  • Insert state
  • For these situations, states are added to the
    model!

9
A typical profile HMM structure
Insert states
Match states
Delete states
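
As a rough sketch of how the three kinds of states connect, the following enumerates the allowed transitions of a length-n profile HMM. It assumes the common textbook layout in which every state at profile position k can reach the insert state at k and the match and delete states at k+1; some implementations omit the insert/delete transitions. The state names and the function name are made up for this illustration.

```python
def profile_hmm_transitions(n):
    """Return the allowed transitions of a length-n profile HMM as (from, to) pairs.

    States: B (begin), E (end), M1..Mn (match), D1..Dn (delete), I0..In (insert).
    """
    def targets(k):
        # From position k: stay in / enter the insert state at k,
        # or advance to the match or delete state at k + 1 (or to E at the end)
        out = ["I%d" % k]
        out += ["M%d" % (k + 1), "D%d" % (k + 1)] if k < n else ["E"]
        return out

    edges = []
    for k in range(n + 1):
        sources = ["B"] if k == 0 else ["M%d" % k, "D%d" % k]
        sources.append("I%d" % k)
        for s in sources:
            edges += [(s, t) for t in targets(k)]
    return edges

edges = profile_hmm_transitions(3)
```

Match and insert states emit residues; delete states are silent, which is how a profile position can be skipped.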
10
State machine
11
Arrows: transitions
(Figure: example state machine; emitted symbols are residue letters such as P, R, Y, F and gap characters "-")
12
So what are the uses for this?
  • Statistical representation of a family of
    proteins
  • Check your sequence against a database of HMMs

13
Database for domains
  • PFAM (protein families)

14
PFAM sequence search
15
PFAM view family
16
PFAM view db entry
17
Another method: BLOCKS
  • Ungapped alignments
  • Automatically generated (Blocks) and annotated
    by hand (PRINTS)
  • the most highly conserved regions of proteins
  • Generated from InterPro (database at EMBL-EBI,
    containing protein families)

18
BLOCKS: 2 algorithms
  • MOTIF algorithm
  • Find spaced triplets
  • Ala-Ala-Ala / Ala-x-Ala-Ala / Cys - x(12) - Ala -
    x(3) - Cys
  • Find overlapping triplets
  • Use these as anchors and perform local alignment
    to find additional matches
  • GIBBS sampler
  • Non-deterministic
  • Random starting points
  • So, answer can be different every time
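
The non-deterministic behaviour of a Gibbs sampler can be illustrated with a toy site sampler. This is a simplified sketch, not the BLOCKS implementation: it assumes exactly one ungapped motif occurrence per sequence, a uniform background, and made-up DNA data and parameter names.

```python
import random

def gibbs_motif(seqs, width, iters=200, seed=None, alphabet="ACGT"):
    """Toy Gibbs site sampler: one ungapped motif occurrence per sequence."""
    rng = random.Random(seed)
    # Random starting points: one start position per sequence
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iters):
        for i, seq in enumerate(seqs):
            # Build column counts (with a pseudocount of 1) from all sequences except i
            counts = [{a: 1.0 for a in alphabet} for _ in range(width)]
            for j, (s, st) in enumerate(zip(seqs, starts)):
                if j != i:
                    for k in range(width):
                        counts[k][s[st + k]] += 1
            total = len(seqs) - 1 + len(alphabet)
            # Weight every candidate window of sequence i by its profile probability
            weights = []
            for st in range(len(seq) - width + 1):
                w = 1.0
                for k in range(width):
                    w *= counts[k][seq[st + k]] / total
                weights.append(w)
            # Sample a new start (rather than taking the best one):
            # this sampling step is what makes the method stochastic
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

seqs = ["GGTACGTT", "TACGTCCA", "CCCTACGA"]  # toy data
runs = [gibbs_motif(seqs, 4, seed=s) for s in (1, 2, 3)]
```

Runs with different seeds may report different start positions, which is why such samplers are usually run several times and the results compared.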

20
Another method: MEME
  • Discover (conserved) motifs in a group of
    unaligned and related sequences (DNA or protein)
  • Automatically choose the following (with little
    or no prior knowledge)
  • Best width of motifs
  • Number of occurrences in each sequence
  • Composition of each motif

21
Context
  • Local multiple sequence alignment (MSA)
  • As opposed to global MSA (e.g. CLUSTAL, T-COFFEE)
  • Statistical method
  • As opposed to profile or block analysis which
    depends on first producing a global MSA
  • Unsupervised learning algorithm
  • As opposed to supervised learning, which requires
    labeled training examples

22
In a perfect world
  • If start locations were known, alignment would be
    easy
  • Could leverage the fact that no insertions or
    deletions exist in the motif occurrences
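
To see why known start locations make the problem easy, here is a minimal sketch that cuts out the ungapped windows and counts residues per column. The sequences, start positions and the pseudocount value are assumptions chosen for the example.

```python
from collections import Counter

def profile_from_starts(seqs, starts, width, alphabet="ACGT", pseudocount=1.0):
    """Column-wise residue frequencies of the windows seqs[i][starts[i]:starts[i]+width]."""
    windows = [s[j:j + width] for s, j in zip(seqs, starts)]
    profile = []
    for col in zip(*windows):
        counts = Counter(col)
        total = len(col) + pseudocount * len(alphabet)
        profile.append({a: (counts[a] + pseudocount) / total for a in alphabet})
    return windows, profile

seqs = ["GGTACGTT", "TACGTCCA", "CCCTACGA"]  # toy data
starts = [2, 0, 3]                           # assumed to be known
windows, profile = profile_from_starts(seqs, starts, width=4)
```

In practice the start locations are unknown, which is what the motif-finding algorithms have to estimate.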

23
Algorithm Components
  • Expectation maximization (EM)
  • EM-based heuristic for choosing the starting
    point for EM
  • Likelihood ratio test (LRT)-based heuristic for
    determining the best number of free model
    parameters
  • Multi-start for searching over possible motif
    widths
  • Greedy search for finding multiple motifs

24
Types of Possible Motif Models
  • OOPS
  • One occurrence per sequence of the motif in the
    dataset
  • ZOOPS
  • Zero or one occurrence of the motif per sequence
    in the dataset
  • TCM
  • Motif can appear any number of times in a sequence
    (two-component mixture)

25
Expectation Maximization
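  • Expectation step: given the current estimate of
    the (variable) sequence pattern, estimate where
    the pattern is located in each sequence of the
    set (the first pass starts from an initial guess)
  • Maximization step: improve/update the pattern
    from those estimated locations; both steps are
    repeated as the set of sequences is iteratively
    scanned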

26
Expectation Maximization Idea
27
Expectation Maximization Algorithm
  • dataset - unaligned set of sequences (training
    data) S1, S2, ..., Si, ..., Sn, each of length L
  • W - width of motif
  • Z - matrix of probabilities that the motif starts
    in position j in Si
  • p - matrix representing the probability of
    character c in column k of the motif (the
    character c will be A, C, G, or T for DNA
    sequences or one of the 20 protein characters)
  • e - epsilon value (threshold used to decide when
    to stop iterating)
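
The quantities listed above can be tied together in a minimal EM sketch for the one-occurrence-per-sequence case. It is an illustration under stated assumptions, not MEME itself: the background is uniform, the starting point is derived from a subsequence of the first sequence, the 0.5 weight on the seed letters and the 0.1 pseudocount are arbitrary choices, and all names (`em_oops`, `start_prob`, `letter_prob`) are made up.

```python
def em_oops(seqs, width, init_start=0, iters=50, eps=1e-6, alphabet="ACGT"):
    """Toy EM for one motif occurrence per sequence with a uniform background."""
    # Starting point: letter probabilities derived from a subsequence of the first sequence
    seed = seqs[0][init_start:init_start + width]
    hi = 0.5                                   # assumed weight of the seed letter
    lo = (1 - hi) / (len(alphabet) - 1)
    letter_prob = [{a: (hi if a == c else lo) for a in alphabet} for c in seed]
    for _ in range(iters):
        # E-step: probability that the motif starts at each position of each sequence
        start_prob = []
        for s in seqs:
            lik = []
            for j in range(len(s) - width + 1):
                x = 1.0
                for k in range(width):
                    x *= letter_prob[k][s[j + k]]
                lik.append(x)
            tot = sum(lik)
            start_prob.append([x / tot for x in lik])
        # M-step: re-estimate letter probabilities from expected counts plus a pseudocount
        new = [{a: 0.1 for a in alphabet} for _ in range(width)]
        for s, sp in zip(seqs, start_prob):
            for j, z in enumerate(sp):
                for k in range(width):
                    new[k][s[j + k]] += z
        for col in new:
            t = sum(col.values())
            for a in alphabet:
                col[a] /= t
        # Stop when the letter probabilities change by less than epsilon
        change = max(abs(new[k][a] - letter_prob[k][a])
                     for k in range(width) for a in alphabet)
        letter_prob = new
        if change < eps:
            break
    return letter_prob, start_prob

seqs = ["GGTACGTT", "TACGTCCA", "CCCTACGA"]  # toy data
letter_prob, start_prob = em_oops(seqs, 4, init_start=2)
best = [max(range(len(r)), key=r.__getitem__) for r in start_prob]
```

Here `start_prob` holds one row per sequence with the probability of the motif starting at each position, and `letter_prob` holds one column per motif position with the probability of each character.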

28
MEME Algorithm
29
Contributions
  • Subsequence derived starting points for EM
  • May be useful with other methods
  • Saves time (only need to run EM for one iteration
    from each starting point, then greedily select
    the best starting point based on the likelihood
    of the learned model)
  • Little or no prior knowledge requirement
    (unsupervised learning)
  • Drops the assumption that each sequence contains
    exactly one appearance of a motif and fits the
    n-per model to the dataset, so it can discover
    motifs in datasets in which many sequences do
    not contain the motif
  • Erases appearances of the motif found after each
    pass (using the probabilistic weighting scheme),
    so it finds multiple different motifs and motifs
    with multiple parts

30
Other Tools
  • MAST - http://meme.sdsc.edu
  • Uses output of MEME
  • Searches biological sequence databases for
    sequences that contain one or more of a group of
    known motifs
  • ParaMEME - http://meme.sdsc.edu
  • Parallel version of MEME
  • Can download and run
  • Can run from website (http://meme.sdsc.edu)
  • MetaMEME - http://metameme.sdsc.edu
  • Toolkit for building and using motif-based hidden
    Markov models of DNA and protein

31
Conclusion
  • There are many ways to search for motifs in
    sequences
  • We've discussed three
  • BLOCKS
  • Profile Hidden Markov Models based on multiple
    alignments, represented in PFAM
  • Expectation Maximization in MEME

33
Hands-on
  • Extract the file from the top right of this page,
    put it on your desktop
  • Open the ClustalX2 program that was installed on
    your laptop
  • Open the text file from your desktop in ClustalX
    (File menu, Load Sequences). You will see the
    unaligned sequences
  • Create an alignment by choosing "Do Complete
    Alignment" from the Alignment menu

34
  • When you made the alignment, it was also saved to
    the desktop, with the extension .aln
  • You can use this alignment to create a sequence
    logo
  • http://weblogo.berkeley.edu/logo.cgi
  • NOTE: some of the mentioned web pages only work
    in Mozilla Firefox at the time of writing. So use
    this browser for the exercises
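
The height of each letter stack in a sequence logo reflects the information content of that alignment column. A minimal sketch of that calculation follows; it assumes a DNA alphabet, ignores the small-sample correction that logo tools typically apply, and uses a made-up alignment.

```python
import math
from collections import Counter

def column_information(aligned, alphabet_size=4):
    """Information content (bits) per alignment column: log2(alphabet size) minus entropy."""
    info = []
    for col in zip(*aligned):
        residues = [c for c in col if c != "-"]   # ignore gap characters
        n = len(residues)
        if n == 0:
            info.append(0.0)                      # all-gap column carries no information
            continue
        entropy = -sum((k / n) * math.log2(k / n) for k in Counter(residues).values())
        info.append(math.log2(alphabet_size) - entropy)
    return info

aln = ["TACG", "TACG", "TACA", "TTCC"]  # toy alignment
info = column_information(aln)
```

Fully conserved columns reach the maximum of 2 bits for DNA (about 4.32 bits for the 20 amino acids), while variable columns score lower.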

35
  • We are now going to build a profile hidden Markov
    model with your alignment
  • Browse to http://mobyle.pasteur.fr/cgi-bin/MobylePortal/portal.py?form=hmmbuild
  • Load your alignment (.aln) into the web program
  • After a while, the program will have created a
    text representation of the profile hidden Markov
    model
  • Save the .hmm file; this is the HMMER file
    format.

36
  • One way to represent the content of the HMM is
    the HMM logo
  • http://www.sanger.ac.uk/cgi-bin/software/analysis/logomat-m.cgi
  • The logo represents the frequencies of the model
    emission values. Additional information is shown
    in the form of the width of the letters. Find out
    what this means (hint: there is a literature
    reference on the page!)
  • Would you call the alignments global or local?

37
MEME
  • The MEME algorithm tries to find short homologous
    areas in the sequences you provide. It builds a
    model similar to the profile hidden Markov
    models.
  • Using your original sequences (the .txt file),
    create a MEME model.
  • http://meme.sdsc.edu/meme4/intro.html
  • Would you choose the profile HMM or the MEME
    method to discover new domains in a protein
    family?