Motif finding in biological sequences - PowerPoint PPT Presentation

Slides: 38

Provided by: hen6153

Transcript and Presenter's Notes

1
Motif finding in biological sequences
2
Protein families

Protein families have sequences in common
Evolutionary conservation
Functional parts are most conserved, Domains
Structural elements are also conserved
Twists / turns / helices etc.
How do you find a domain?
Multiple alignment
Statistical model, based on distribution

3
Methods for Characterizing a Protein Family

Objective: Given a number of related sequences,
encapsulate what they have in common in such a
way that we can recognize other members of the
family.
Some standard methods for characterization
Multiple Alignments
Regular Expressions
Consensus Sequences
Hidden Markov Models
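
As a small illustration of the consensus-sequence idea listed above, the following sketch takes an already aligned set of sequences and reports the most common residue in each column. The function name `consensus` and the toy alignment are assumptions made for this example; they do not come from the slides.

```python
from collections import Counter


def consensus(aligned, gap="-"):
    """Return the most common non-gap character of each alignment column."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    result = []
    for column in zip(*aligned):
        counts = Counter(c for c in column if c != gap)
        # A column made only of gaps has no residue to report; keep the gap.
        result.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(result)


# Hypothetical toy alignment (four sequences, five columns).
alignment = ["ACDEF", "ACDQF", "AC-EF", "TCDEF"]
print(consensus(alignment))
```

A consensus sequence throws away how strongly each column is conserved, which is one reason the later slides move on to statistical models.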

4
Multiple sequence alignment
http://en.wikipedia.org/wiki/Multiple_sequence_alignment
5
Sequence logo
6
Statistical model

Hidden Markov model (HMM)
State machine (explained later) with hidden
states and emissions (things you can observe,
i.e. amino acids)
Simple version in bioinformatics
profile hidden Markov model

7
Simplest form: matches
8
But also deletions and insertions

Delete state
Skip one position in the profile (the sequence
has no residue there)
Insert state
Extra residues in the sequence between two
profile positions (a gap in the profile)
For these situations, states are added to the
model!

9
A typical profile HMM structure
Insert states
Match states
Delete states
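
The structure named above can be sketched by listing which state may follow which. This is a minimal sketch that assumes the common textbook layout (a Begin state, an End state, match and delete states for positions 1..n, insert states for positions 0..n); the function name and the state labels are hypothetical, and real tools may restrict some of these transitions.

```python
def profile_hmm_transitions(n):
    """List (source, target) pairs for a profile with n match positions."""
    edges = []
    for k in range(n + 1):
        # States that sit at profile position k; position 0 is Begin.
        sources = ["B"] if k == 0 else [f"M{k}", f"D{k}"]
        sources.append(f"I{k}")
        # Moving forward reaches position k+1, or End after the last one.
        forward = ["E"] if k == n else [f"M{k + 1}", f"D{k + 1}"]
        for source in sources:
            # An insert state at the same position is always reachable,
            # which is what allows runs of inserted residues.
            for target in forward + [f"I{k}"]:
                edges.append((source, target))
    return edges


edges = profile_hmm_transitions(3)
print(len(edges), "transitions")
print([target for source, target in edges if source == "M1"])
```

Only match and insert states emit residues; delete states are silent, which is how the model skips a profile position.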
10
State machine
11
Arrows: transitions
[Figure: state diagram with example emissions P, R, Y, F and gaps (-)]
12
So what are the uses for this?

Statistical representation of a family of
proteins
Check your sequence against a database of HMMs

13
Database for domains

PFAM (protein families)

14
PFAM sequence search
15
PFAM view family
16
PFAM view db entry
17
Another method: BLOCKS

Ungapped alignments
Automatically annotated (Blocks) and annotated
by hand (PRINTS)
The most highly conserved regions of proteins
Generated from InterPro (database at EMBL-EBI,
containing protein families)

18
BLOCKS: 2 algorithms

MOTIF algorithm
Find spaced triplets
Ala-Ala-Ala / Ala-x-Ala-Ala /
Cys-x(12)-Ala-x(3)-Cys
Find overlapping triplets
Use these as anchors and perform local alignment
to find additional matches
GIBBS sampler
Non-deterministic
Random starting points
So, answer can be different every time
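
The spaced patterns shown for the MOTIF algorithm (for example Cys-x(12)-Ala-x(3)-Cys) can be tried out directly as regular expressions. The sketch below converts such a pattern into a Python regex; the function name `pattern_to_regex` and the test sequence are assumptions for illustration, and this only searches for a given pattern, it is not the MOTIF algorithm itself.

```python
import re

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def pattern_to_regex(pattern):
    """Turn e.g. 'Cys - x(12) - Ala' into a regex over one-letter codes."""
    parts = []
    for token in pattern.split("-"):
        token = token.strip()
        spacer = re.fullmatch(r"x(?:\((\d+)\))?", token)
        if spacer:
            count = spacer.group(1)
            # 'x' is any single residue, 'x(n)' is exactly n residues.
            parts.append("." if count is None else ".{%s}" % count)
        else:
            parts.append(THREE_TO_ONE[token])
    return "".join(parts)


regex = pattern_to_regex("Cys - x(12) - Ala - x(3) - Cys")
# Hypothetical sequence built so that the pattern occurs at offset 2.
sequence = "MK" + "C" + "G" * 12 + "A" + "TTT" + "C" + "LL"
match = re.search(regex, sequence)
print(regex, match.start() if match else None)
```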

19
(No Transcript)
20
Another method: MEME

Discover (conserved) motifs in a group of
unaligned and related sequences (DNA or protein)
Automatically choose the following (with little
or no prior knowledge)
Best width of motifs
Number of occurrences in each sequence
Composition of each motif

21
Context

Local multiple sequence alignment (MSA)
As opposed to global MSA (e.g. CLUSTAL, T-COFFEE)
Statistical method
As opposed to profile or block analysis which
depends on first producing a global MSA
Unsupervised learning algorithm
As opposed to supervised learning, which requires
labelled training examples (human intervention)

22
In a perfect world

If start locations were known, alignment would be
easy
Could leverage the fact that no insertions or
deletions exist in the sequences
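
To see why known start locations make the problem easy, the sketch below cuts the window of width W out of each sequence at its known start and counts characters column by column. The function name, the pseudocount and the toy DNA sequences are assumptions for this example only.

```python
def profile_from_starts(sequences, starts, width, alphabet="ACGT",
                        pseudocount=1.0):
    """Column-wise character frequencies of the windows at `starts`."""
    # The pseudocount keeps unseen characters from getting probability 0.
    counts = [{c: pseudocount for c in alphabet} for _ in range(width)]
    for seq, start in zip(sequences, starts):
        # No insertions or deletions: the motif is one ungapped window.
        for k, char in enumerate(seq[start:start + width]):
            counts[k][char] += 1
    return [
        {c: n / sum(column.values()) for c, n in column.items()}
        for column in counts
    ]


# Hypothetical data: the motif TATA starts at offsets 2, 0 and 3.
sequences = ["GGTATACC", "TATAGGCC", "CCGTATAG"]
profile = profile_from_starts(sequences, [2, 0, 3], width=4)
for column in profile:
    print({c: round(p, 2) for c, p in column.items()})
```

When the starts are not known, they have to be estimated together with the profile, which is what the EM procedure on the following slides does.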

23
Algorithm Components

Expectation maximization (EM)
EM-based heuristic for choosing the starting
point for EM
Maximum likelihood ratio-based (LRT-based)
heuristic for determining the best number of
model free parameters
Multi-start for searching over possible motif
widths
Greedy search for finding multiple motifs

24
Types of Possible Motif Models

OOPS
One occurrence per sequence of the motif in the
dataset
ZOOPS
Zero or one motif occurrences per dataset
sequence
TCM
The motif may appear any number of times in a
sequence (two-component mixture)

25
Expectation Maximization

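Expectation step: given the current motif model
(initially a guess), estimate the probability
that the (variable) sequence pattern starts at
each position in each sequence of the set
Maximization step: improve/update the pattern
from those estimates; both steps are repeated as
the set of sequences is iteratively scanned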

26
Expectation Maximization Idea
27
Expectation Maximization Algorithm

dataset - unaligned set of sequences (training
data) S1, S2, ..., Si, ..., Sn, each of length L
W - width of motif
Z - matrix of probabilities that the motif starts
in position j in Si
p - matrix representing the probability of
character c in column k (the character c will be
A, C, G, or T for DNA sequences or one of the 20
protein characters)
e - epsilon value (stop when p changes by less
than e between iterations)
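
The quantities listed above can be put together in a small EM loop. This is a minimal sketch under several assumptions that are not in the slides: exactly one motif occurrence per sequence, a fixed uniform background, a pseudocount in the update, and a hand-made starting model. The names `start_probs` (probability that the motif starts at position j of sequence i) and `column_probs` (probability of character c in motif column k) are chosen for this example; it is not MEME's actual implementation.

```python
def e_step(sequences, column_probs, background, width):
    """Probability of each start position in each sequence."""
    start_probs = []
    for seq in sequences:
        weights = []
        for j in range(len(seq) - width + 1):
            # Likelihood ratio motif vs. background for the window at j.
            w = 1.0
            for k in range(width):
                c = seq[j + k]
                w *= column_probs[k][c] / background[c]
            weights.append(w)
        total = sum(weights)
        start_probs.append([w / total for w in weights])
    return start_probs


def m_step(sequences, start_probs, width, alphabet, pseudocount=1.0):
    """Re-estimate column probabilities from weighted windows."""
    counts = [{c: pseudocount for c in alphabet} for _ in range(width)]
    for seq, probs in zip(sequences, start_probs):
        for j, z in enumerate(probs):
            for k in range(width):
                counts[k][seq[j + k]] += z
    return [{c: n / sum(col.values()) for c, n in col.items()}
            for col in counts]


def run_em(sequences, column_probs, width, alphabet="ACGT",
           epsilon=1e-6, max_iter=100):
    background = {c: 1.0 / len(alphabet) for c in alphabet}
    start_probs = []
    for _ in range(max_iter):
        start_probs = e_step(sequences, column_probs, background, width)
        updated = m_step(sequences, start_probs, width, alphabet)
        change = max(abs(updated[k][c] - column_probs[k][c])
                     for k in range(width) for c in alphabet)
        column_probs = updated
        if change < epsilon:  # the epsilon stopping rule
            break
    return column_probs, start_probs


# Hypothetical data and a starting model seeded from the subsequence TATA.
sequences = ["GGTATACC", "TATAGGCC", "CCGTATAG"]
seed = [{c: (0.7 if c == s else 0.1) for c in "ACGT"} for s in "TATA"]
model, start_probs = run_em(sequences, seed, width=4)
best_starts = [row.index(max(row)) for row in start_probs]
print(best_starts)
```

Seeding the starting model from a subsequence of the data loosely mirrors the starting-point heuristic the slides attribute to MEME; a random starting model would also work here but may converge to a different answer.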

28
MEME Algorithm
29
Contributions

Subsequence derived starting points for EM
May be useful with other methods
Saves time (only need to run EM for one iteration
from each starting point and then greedily select
the best starting point based on the likelihood
of the learned model)
Little or no prior knowledge requirement
(unsupervised learning)
Drops the assumption that each sequence contains
exactly one appearance of a motif and fits the
n-per model to a dataset: can discover motifs in
datasets which contain many sequences which do
not contain the motif
Erases appearances of the motif found after each
pass (using the probabilistic weighting scheme):
finds multiple different motifs and motifs with
multiple parts

30
Other Tools

MAST - http://meme.sdsc.edu
Uses output of MEME
Searches biological sequence databases for
sequences that contain one or more of a group of
known motifs
ParaMEME - http://meme.sdsc.edu
Parallel version of MEME
Can download and run
Can run from website (http://meme.sdsc.edu)
MetaMEME - http://metameme.sdsc.edu
Toolkit for building and using motif-based hidden
Markov models of DNA and protein

31
Conclusion

There are many ways to search for motifs in
sequences
We've discussed three:
BLOCKS
Profile Hidden Markov Models based on multiple
alignments, represented in PFAM
Expectation Maximization in MEME

32
(No Transcript)
33
Hands-on

Extract the file from the top right of this page,
put it on your desktop
Open the ClustalX2 program, which was installed on
your laptop
Open the text file from your desktop in ClustalX
(File menu, Load Sequences). You will see the
unaligned sequences
Create an alignment by choosing "Do complete
alignment" from the Alignment menu

When you made the alignment, it was also saved to
the desktop, with the extension .aln
You can use this alignment to create a sequence
logo
http://weblogo.berkeley.edu/logo.cgi
NOTE: some of the mentioned web pages only work
in Mozilla Firefox at the time of writing. So use
this browser for the exercises
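
To get a feeling for what the height of a stack in a sequence logo means, the sketch below computes the information content of one alignment column in bits. It assumes the usual definition (log2 of the alphabet size minus the Shannon entropy of the column) and leaves out the small-sample correction that logo tools may apply; the function name is hypothetical, and gap characters should be removed from the column beforehand.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=4):
    """Information content of one column in bits (no sample correction)."""
    counts = Counter(column)
    total = len(column)
    entropy = -sum((n / total) * math.log2(n / total)
                   for n in counts.values())
    # A fully conserved column reaches the maximum, log2(alphabet_size).
    return math.log2(alphabet_size) - entropy


# DNA columns; use alphabet_size=20 for protein alignments.
for column in ["AAAA", "AACC", "ACGT"]:
    print(column, round(column_information(column), 3))
```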

We are now going to build a profile hidden Markov
model with your alignment
Browse to http://mobyle.pasteur.fr/cgi-bin/MobylePortal/portal.py?formhmmbuild
Load your alignment (.aln) into the web program
After a while, the program has created a text
representation of the profile hidden Markov model
Save the .hmm file; this is the HMMER file
format.

One way to represent the content of the hmm is
the hmm logo
http://www.sanger.ac.uk/cgi-bin/software/analysis/logomat-m.cgi
The logo represents the frequencies of the model
emission values. Additional information is shown
in the form of the width of the letters. Find out
what this means (hint: there is a literature
reference on the page!)
Would you call the alignments global or local?

37
MEME

The MEME algorithm tries to find short homologous
areas in the sequences you provide. It builds a
model similar to the profile hidden Markov
models.
Using your original sequences (the .txt file),
create a MEME model.
http://meme.sdsc.edu/meme4/intro.html
Would you choose the profile HMM or the MEME
method to discover new domains in a protein family?
