How to find an optimal sequence alignment

Slides: 31

Provided by: werner6

Transcript and Presenter's Notes

1
How to find an optimal sequence alignment?

The optimal alignment is the alignment with the highest score.
Exhaustive enumeration and scoring of all alignments is not feasible.
The problem is solved by dynamic programming algorithms, which use the fact that the globally optimal alignment can be constructed from optimal alignments of subsequences.

2
Most widely used algorithms

Two basic types of algorithms:
Needleman-Wunsch algorithm [1,2]
Global algorithm which gives an overall best-fit alignment of the entire sequences.
Rigorous algorithm that finds the optimal solution.
Requires a large amount of computing power for long sequences and database searches.
Not sensitive for highly diverged sequences.
Smith-Waterman algorithm [3]
Local alignment procedure which tries to find a subsequence (or several short subsequences) of high similarity.

References:
[1] Needleman, S.B. & Wunsch, C.D. 1970. J. Mol. Biol. 48, 443-453.
[2] Gotoh, O. 1982. J. Mol. Biol. 162, 705-708.
[3] Smith, T.F. & Waterman, M.S. 1981. J. Mol. Biol. 147, 195-197.
3
Dynamic Programming (Needleman-Wunsch) Algorithm
Global best score for aligning two sequences of length m and n.
Example from the textbook Durbin et al., Chapter 2:
HEAGAWGHEE (m = 10), x_i, i = 1,...,10
PAWHEAE (n = 7), y_j, j = 1,...,7
Score matrix: BLOSUM50; gap cost d = 8
Example alignment:
HEAGAWGHEE
P-A--WHEAE
What is the score for this alignment?
How to find an optimal alignment?
What is the best possible scoring value?
4
Score matrix for the example
Example alignment:
         H   E   A   G   A   W   G   H   E   E
         P   -   A   -   -   W   H   E   A   E
Score:  -2  -8   5  -8  -8  15  -2   0  -1   6
Total score: -3
Is this an optimal alignment?
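The column-by-column scoring of the example alignment can be sketched in a few lines of Python. The substitution values below are only the pairs that occur in this alignment, read off the score row of the slide; the function name `column_scores` is a hypothetical helper, not part of any library.

```python
# Substitution scores for the residue pairs that occur in the example
# alignment, taken from the score row of the slide (BLOSUM50 values).
PAIR_SCORES = {
    ("H", "P"): -2, ("A", "A"): 5, ("W", "W"): 15,
    ("G", "H"): -2, ("H", "E"): 0, ("E", "A"): -1, ("E", "E"): 6,
}
GAP_COST = 8  # linear gap cost d from the slide


def column_scores(top, bottom, pair_scores=PAIR_SCORES, d=GAP_COST):
    """Return the score of every column of a gapped pairwise alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    scores = []
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            scores.append(-d)  # residue aligned to a gap
        else:
            scores.append(pair_scores[(a, b)])  # substitution score s(a, b)
    return scores


scores = column_scores("HEAGAWGHEE", "P-A--WHEAE")
print(scores, sum(scores))
```

The per-column values reproduce the score row of the slide, and their sum is the total score of -3.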
5
Needleman Wunsch algorithm

Construct a matrix F(i,j), which is the score of the best alignment of the partial segments x_1..i and y_1..j.
At each aligned position there are three possible events:
substitution of x_i by y_j, which changes F by s(x_i,y_j)
alignment of x_i to a gap, penalty d
alignment of y_j to a gap, penalty d
Gaps are usually counted with a negative value, often referred to as the "gap penalty".

6
Recursive Generation of the F matrix
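The maximum score can be found by working forward along each sequence, successively finding the best score for aligning all pairs of subsequences.
The values are stored in a matrix F, where each element is calculated from its three neighbours as follows:
F(i,j) = max { F(i-1,j-1) + s(x_i,y_j),  F(i-1,j) - d,  F(i,j-1) - d }
with boundary values F(0,0) = 0, F(i,0) = -i d and F(0,j) = -j d.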
7
Finding the optimal alignment

The element F(i,j) contains the best score for aligning the subsequences x_1..i and y_1..j, so F(m,n) is the score of the best global alignment.
The alignment path can be determined by tracing back through the matrix from F(m,n), using pointers that indicate which of the three possibilities was the maximum at each point (i,j).
Best alignment score: 1
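A minimal sketch of the matrix fill and traceback for the example sequences follows. It assumes a linear gap cost d = 8 and uses only the slice of BLOSUM50 needed for the residues in the example (values assumed from the standard BLOSUM50 table). The names `needleman_wunsch` and `s` are illustrative. Instead of stored pointers, the traceback recomputes which of the three cases produced each cell and prefers the diagonal on ties; that is one arbitrary choice among equally good paths.

```python
# Slice of BLOSUM50 covering only the residues of the example
# (assumed values from the standard BLOSUM50 table).
RESIDUES = "AEGHWP"
BLOSUM50_SLICE = {
    "A": (5, -1, 0, -2, -3, -1),
    "E": (-1, 6, -3, 0, -3, -1),
    "G": (0, -3, 8, -2, -3, -2),
    "H": (-2, 0, -2, 10, -3, -2),
    "W": (-3, -3, -3, -3, 15, -4),
    "P": (-1, -1, -2, -2, -4, 10),
}


def s(a, b):
    """Substitution score s(a, b) looked up in the slice above."""
    return BLOSUM50_SLICE[a][RESIDUES.index(b)]


def needleman_wunsch(x, y, d=8):
    """Global alignment with linear gap cost d.

    Returns (score, aligned x, aligned y).
    """
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = -i * d  # x_1..i aligned to gaps only
    for j in range(1, n + 1):
        F[0][j] = -j * d  # y_1..j aligned to gaps only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # substitution
                F[i - 1][j] - d,                          # x_i against a gap
                F[i][j - 1] - d,                          # y_j against a gap
            )
    # Traceback from F(m, n) to F(0, 0).
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[m][n], "".join(reversed(ax)), "".join(reversed(ay))


nw_score, nw_x, nw_y = needleman_wunsch("HEAGAWGHEE", "PAWHEAE")
print(nw_score)
print(nw_x)
print(nw_y)
```

With these inputs the optimal global score is 1, in agreement with the best alignment score quoted on the slide.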

8
local alignment
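The content of this slide did not survive as text. As a hedged illustration of the local (Smith-Waterman) procedure named earlier, the sketch below modifies the global recursion in the usual way: every cell is bounded below by 0, and the traceback starts at the highest-scoring cell and stops at the first 0. The BLOSUM50 slice (assumed standard values), the gap cost d = 8 and the function names are illustrative choices, not taken from the slide.

```python
# Slice of BLOSUM50 for the residues of the example sequences
# (assumed values from the standard BLOSUM50 table).
RESIDUES = "AEGHWP"
BLOSUM50_SLICE = {
    "A": (5, -1, 0, -2, -3, -1),
    "E": (-1, 6, -3, 0, -3, -1),
    "G": (0, -3, 8, -2, -3, -2),
    "H": (-2, 0, -2, 10, -3, -2),
    "W": (-3, -3, -3, -3, 15, -4),
    "P": (-1, -1, -2, -2, -4, 10),
}


def s(a, b):
    """Substitution score s(a, b) looked up in the slice above."""
    return BLOSUM50_SLICE[a][RESIDUES.index(b)]


def smith_waterman(x, y, d=8):
    """Best local alignment with linear gap cost d.

    Returns (score, aligned part of x, aligned part of y).
    """
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(
                0,                                        # start a new local alignment
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # substitution
                F[i - 1][j] - d,                          # x_i against a gap
                F[i][j - 1] - d,                          # y_j against a gap
            )
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Traceback from the maximum cell until a cell with score 0 is reached.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


sw_score, sw_x, sw_y = smith_waterman("HEAGAWGHEE", "PAWHEAE")
print(sw_score, sw_x, sw_y)
```

For the example sequences this yields the local alignment AWGHE / AW-HE with score 28.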
9
Significance of scores

Using a search method (global or local), one always finds a best hit.
What does this mean?
A biologically meaningful alignment (homologous proteins)?
Or the best alignment between two unrelated sequences?
Two approaches:
Bayesian approach
Statistical approach (extreme value distribution)

10
Significance of scores

Probability that two sequences x and y are
related according to a model M as compared to a
random model R.
match model M
random model R
additive scoring system (log-odds ratio)

Which scores are statistically significant?
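The formulas for the two models did not survive in this transcript. As a sketch in the notation of Durbin et al. (an assumption: q_a is the background frequency of residue a, and p_ab is the joint probability of the aligned pair a, b under the match model), the two models and the resulting additive score for an ungapped alignment can be written as:

```latex
\begin{align}
P(x,y \mid R) &= \prod_i q_{x_i} \prod_j q_{y_j} \\
P(x,y \mid M) &= \prod_i p_{x_i y_i} \\
\frac{P(x,y \mid M)}{P(x,y \mid R)} &= \prod_i \frac{p_{x_i y_i}}{q_{x_i}\, q_{y_i}} \\
S = \log \frac{P(x,y \mid M)}{P(x,y \mid R)} &= \sum_i s(x_i, y_i),
\qquad s(a,b) = \log \frac{p_{ab}}{q_a q_b}
\end{align}
```

Taking the logarithm turns the product of odds ratios into a sum, which is why the scoring system is additive.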

11
Bayesian Approach for Significance

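What is the probability that x and y are related, P(M|x,y)?
P(M): a priori probability that x and y are related
P(R) = 1 - P(M)
After rearrangement (Bayes' rule) we find
P(M|x,y) = sigma(S')
with
S' = S + log( P(M) / P(R) ) and sigma(t) = e^t / (1 + e^t),
where S is the log-odds score of the alignment.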
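A small numeric sketch of this posterior, assuming the logistic form from Durbin et al., P(M|x,y) = sigma(S + log(P(M)/P(R))), with S a log-odds score in natural-log units. The function name and the example priors are hypothetical.

```python
import math


def posterior_match_probability(score, prior_match):
    """Posterior P(M | x, y) from a natural-log log-odds score S and prior P(M)."""
    if not 0.0 < prior_match < 1.0:
        raise ValueError("prior_match must lie strictly between 0 and 1")
    prior_log_odds = math.log(prior_match / (1.0 - prior_match))
    s_prime = score + prior_log_odds           # S' = S + log(P(M) / P(R))
    return 1.0 / (1.0 + math.exp(-s_prime))    # logistic function sigma(S')


# The same score is much less convincing when matches are rare a priori.
for prior in (0.5, 1e-2, 1e-4):
    print(prior, posterior_match_probability(5.0, prior))
```

The loop shows how the posterior for a fixed score drops as the prior probability of a true match becomes smaller.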

12
Importance of prior log-odds ratio

The prior log-odds ratio is added to the standard score.
There is a large number of different alignments for a possible match.
The length of the alignment influences the score.
Even if all sequences in the database are unrelated, the best score increases with the number of sequences.
The prior log-odds ratio has to be adjusted to the size of the database.
Other quantities (E value)
Local match: adjustment for the length of the sequences

13
Extreme value distribution (EVD)

Search in a database with N sequences
Distribution of the maximum of the N scores
Asymptotic distribution of the maximum of N independent normal random variables
Karlin & Altschul (1990): The number of unrelated matches of two sequences of length m and n with score greater than S is Poisson distributed with mean
E(S) = K m n e^(-lambda S)
where K and lambda are constants that depend on the scoring system.
The probability that there is a match with score greater than S is
P(x > S) = 1 - e^(-E(S))
Example: score of human cytochrome C in Swiss-Prot
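A short sketch of these two quantities, assuming the Karlin-Altschul form E = K m n exp(-lambda S) and P = 1 - exp(-E). The values of K and lambda used in the example call are placeholders chosen for illustration only; real values depend on the scoring matrix and gap costs.

```python
import math


def expected_random_hits(score, m, n, K, lam):
    """Expected number E of unrelated matches scoring above `score`."""
    return K * m * n * math.exp(-lam * score)


def p_value(score, m, n, K, lam):
    """Probability of at least one unrelated match scoring above `score`."""
    e_value = expected_random_hits(score, m, n, K, lam)
    return -math.expm1(-e_value)  # 1 - exp(-E), accurate for small E


# Placeholder parameters (hypothetical): K = 0.1, lambda = 0.3.
for score in (20, 50, 100):
    print(score,
          expected_random_hits(score, 300, 10**8, K=0.1, lam=0.3),
          p_value(score, 300, 10**8, K=0.1, lam=0.3))
```

For small E the p-value is approximately equal to E, which is why database search tools usually report the E value directly.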

14
Score distribution
15
Heuristic alignment algorithms

Complexity of Needleman-Wunsch: the number of operations needed to fill the matrix is O(mn).
Protein sequence database: 10^8 residues; a typical protein: 10^3 residues; a search takes about 10^11 operations.
a) Basic Local Alignment Search Tool (BLAST), Altschul, Lipman.
search for short stretches with high-scoring matches, use these neighbourhood words as seeds for alignment extensions
b) FASTA, Pearson
look-up table of identically matching words (ktup = 2)
extension of these hits with ungapped alignments
joining of ungapped regions by dynamic programming with gaps.
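The first FASTA step, the look-up table of identical words, can be sketched as follows. This is a simplified illustration under stated assumptions, not the FASTA implementation: it indexes every word of length ktup in the query and groups identical word hits in the target by diagonal (i - j), since hits on the same diagonal are the candidates for ungapped extension. The function names are hypothetical.

```python
from collections import defaultdict


def build_word_table(seq, k=2):
    """Map every word of length k in seq to the list of its start positions."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def word_hits_by_diagonal(query, target, k=2):
    """Group identical k-word hits (i in query, j in target) by diagonal i - j."""
    table = build_word_table(query, k)
    diagonals = defaultdict(list)
    for j in range(len(target) - k + 1):
        for i in table.get(target[j:j + k], ()):
            diagonals[i - j].append((i, j))
    return dict(diagonals)


hits = word_hits_by_diagonal("HEAGAWGHEE", "PAWHEAE")
for diagonal, positions in sorted(hits.items()):
    print(diagonal, positions)
```

Building the table once costs time linear in the query length, and each target word is then found by a single dictionary lookup instead of a scan of the query.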

16

Going beyond sequence alignment: Motifs, patterns, profiles
How should we read protein sequences? Is there a grammar? Can we identify words in this language?
Motifs: contiguous segments in a protein family with statistically significant sequence conservation.
Molegos: contiguous segments with sequence and 3D structure conservation.
17
Patterns, Profiles and Motifs

Pairwise alignments can find similar sequences, i.e. we can measure the similarity by sequence identity, similarity score (sum of substitution matrix values) or the probability that two sequences would match in a random model (similarity).
However, we would like to know whether the two sequences have a common evolutionary origin (homology).
There is no simple general cutoff value (see example cytochromes).
Similarity versus homology
To improve reliability, one has to rely on additional tools: patterns, profiles and motifs.

18
Patterns, Profiles and Motifs
Patterns: string of sequence elements characteristic of a protein family, e.g. PROSITE pattern
Profiles: a quantitative expression for a position-specific signature based on the fraction of amino acids at each position in the sequence
Motifs: any sequence pattern which is predictive for a protein function, a structural feature or family membership
PCP motifs: quantitative definition of conserved physical-chemical properties in a protein family (combination of pattern and profiles)
19
Web sites related to motif search

PCPMer http://landau.utmb.edu:8080/WebPCPMer/HomePage/index.html
used in this course
BLOCKS http://blocks.fhcrc.org/
relies on conserved stretches of protein families
MEME http://meme.sdsc.edu/meme/website/intro.html
combines HMM search with profile methods
PROSITE http://us.expasy.org/prosite/
catalogue of biologically important sequence patterns
HMMER http://hmmer.wustl.edu/
profile hidden Markov models
SAM http://www.soe.ucsc.edu/research/compbio/sam.html
several software packages implementing HMMs
PFAM http://www.sanger.ac.uk/Software/Pfam/
collection of MSAs and hidden Markov models of many common protein domains

20
PROSITE: expert-curated database of patterns hosted by the Swiss Institute of Bioinformatics
Release 18.19, of 16-Jan-2004 (contains 1241 documentation entries that describe 1685 different patterns, rules and profiles/matrices).
Patterns are usually hand-edited by experts who work with proteins in the family and are familiar with its activities and the regions needed for activity.
Language for describing the patterns:
[ ]: any of the amino acids within the brackets can occur
{ }: anything but the amino acids within the braces can occur
x(n,m): a spacer of n to m residues can occur in the pattern
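The pattern language maps closely onto regular expressions. The sketch below translates the elements listed above (single residues, bracketed sets, excluded sets in braces, x, and repeat counts in parentheses) into a Python regular expression. It is a simplified illustration: the function name is hypothetical, anchors such as < and > are not handled, and both the pattern and the test sequence are made up for this example.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split an element such as x(2,4) into its core and repeat counts.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.group(1), match.group(2), match.group(3)
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"    # any residue except those listed
        # Bracketed sets like [LIVM] and single letters are valid regex as-is.
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


example_pattern = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
regex = prosite_to_regex(example_pattern)
found = re.search(regex, "MKCPECGKSFSQKSNLQKHQRTHTG")  # made-up test sequence
print(regex)
print(found.group() if found else None)
```

A scan of a sequence database then reduces to running `re.search` (or `re.finditer`) with the translated expression on each sequence.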
21
Example Prosite pattern
Ref - Durbin, Eddy, Krogh, Mitchison,
Biological sequence analysis Cambridge
University Press.
22
Profiles
multiple sequence alignment
f(j,b): probability of amino acid b at position j
profile p: the expected score for a given sequence y_j to fit into the family
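One way to make f(j,b) concrete is sketched below. The assumptions are mine rather than the slide's: frequencies are estimated from column counts with a pseudocount of 1, and a sequence segment is scored by the sum of log-odds against a uniform background of 1/20. The four aligned fragments are taken from the conserved GTSGSP region of the flavivirus NS3 alignment shown later in this transcript (dengue2, dengue4, kunjin and japenceph rows); the function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(alignment, pseudocount=1.0):
    """Estimate f(j, b) for every column j; gap characters are ignored."""
    profile = []
    for j in range(len(alignment[0])):
        counts = Counter(seq[j] for seq in alignment if seq[j] != "-")
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({b: (counts[b] + pseudocount) / total for b in AMINO_ACIDS})
    return profile


def profile_score(profile, segment):
    """Sum of log(f(j, b) / q_b) over an ungapped segment, with uniform q_b."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background frequency
    return sum(math.log(profile[j][b] / background) for j, b in enumerate(segment))


fragments = ["GTSGSPIV", "GTSGSPII", "GTSGSPIV", "GTSGSPIL"]
profile = column_frequencies(fragments)
print(profile_score(profile, "GTSGSPIV"))
print(profile_score(profile, "AAAAAAAA"))
```

A family member scores well above an unrelated segment because every conserved column contributes a positive log-odds term.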
23
Our approach property based motif search

Each amino acid has a set of physical-chemical
properties.
Hydrophobicity, hydrophilicity, side-chain
length, bulkiness, mol. wt, solubility etc.
3D structure is determined by physical-chemical
properties of residues.
Which properties to choose?
How can we represent these properties, given the differences in their scales?

The protein non-folding problem: determinants of disorder. Williams RM, Obradović Z, Venkatarajan MS et al. 2001. Pac Symp Biocomputing 6: 89-100
24
PCP motifs: http://landau.utmb.edu:8080/WebPCPMer/
25
Reduction of the descriptor space for physical-chemical properties of amino acids
237 physical-chemical properties are reduced to 5 descriptors by multidimensional scaling.
Mathura, V.S. and W. Braun. 2001. New quantitative descriptors of amino acids based on multidimensional scaling of a large number of physical-chemical properties. J Mol Model 7: 445-453
26
Physical-chemical interpretation of the
descriptors
27
Quantitative definition of PCP motifs for a Protein Family

For each column of the multiple alignment (i.e. each residue position):
Measure the significance of conservation by the relative entropy
Calculate the a priori distributions of the 5 descriptors and compare them to the actual distributions of the descriptors in each column
Quantitative definition of motifs
Compute averages and standard deviations of the property vector components

Length Cutoff (L)
Minimum number of positions to be included in
the motif.
Gap Cutoff (G)
Number of insignificant positions between two
significant residue positions allowed in a motif.
Ref
VS Mathura, CH Schein, W. Braun (2003).
Bioinformatics, 19, 1381-1390.
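The roles of the two cutoffs can be illustrated with a small sketch. It is one possible reading of the definitions above, not the PCPMer implementation: columns are flagged as significant or not, a motif starts and ends at a significant column, it may contain runs of at most G insignificant columns, and its span must cover at least L columns. Whether L counts all columns in the span or only the significant ones is an assumption made here. The function name is hypothetical.

```python
def find_motifs(significant, min_length, max_gap):
    """Group significant columns into motifs.

    significant: sequence of booleans, one per alignment column.
    min_length:  length cutoff L, minimum span of a motif in columns.
    max_gap:     gap cutoff G, maximum run of insignificant columns inside a motif.
    Returns a list of (start, end) column indices, both inclusive.
    """
    motifs = []
    start = last = None
    for pos, flag in enumerate(significant):
        if not flag:
            continue
        if start is None:
            start = last = pos
        elif pos - last - 1 <= max_gap:
            last = pos  # gap is short enough, extend the current motif
        else:
            if last - start + 1 >= min_length:
                motifs.append((start, last))
            start = last = pos  # gap too long, start a new candidate
    if start is not None and last - start + 1 >= min_length:
        motifs.append((start, last))
    return motifs


flags = [bool(v) for v in (1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1)]
print(find_motifs(flags, min_length=4, max_gap=1))
print(find_motifs(flags, min_length=3, max_gap=0))
```

With the more permissive gap cutoff the example yields two motifs, columns 0-4 and 8-12; with no gaps allowed only the run 10-12 is long enough.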

28
SIGNIFICANCE OF CONSERVATION
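[Figure: histogram of frequency (axis up to 1.0) over five bins with natural frequencies P(X1) ... P(X5), shown for descriptor E1]
P(X): natural frequency of amino acid occurrence
Q(X): observed frequency calculated from the multiple alignment
b: one of the five bins
i: index over the descriptor vector E1-E5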
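The relative entropy between the observed and natural bin frequencies can be computed as in the sketch below. The standard definition, the sum over bins b of Q(b) log(Q(b)/P(b)), is assumed; the base of the logarithm (natural log here) and the uniform example distribution are illustrative choices, since the slide does not state them.

```python
import math


def relative_entropy(observed, natural):
    """Sum over bins of Q(b) * log(Q(b) / P(b)); bins with Q(b) = 0 contribute 0."""
    if len(observed) != len(natural):
        raise ValueError("both distributions must have the same number of bins")
    return sum(q * math.log(q / p) for q, p in zip(observed, natural) if q > 0)


natural = [0.2, 0.2, 0.2, 0.2, 0.2]      # hypothetical P(X) over the five bins
conserved = [1.0, 0.0, 0.0, 0.0, 0.0]    # all sequences fall into a single bin
variable = [0.2, 0.2, 0.2, 0.2, 0.2]     # column looks like the natural distribution

print(relative_entropy(conserved, natural))
print(relative_entropy(variable, natural))
```

A fully conserved column reaches the maximum value log(5) for a uniform prior over five bins, while a column that follows the natural distribution scores 0.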
29
Illustration of PCP motifs
Multiple sequence alignment of NS3 protein sequences from flaviviruses
dengue2   HKGKRIEPSWADVKKDLISYGGGWKLEGEWKEGEEVQVLALEPGKNPRAVQTKPGLFKTNTGTIG-AVSLDFSPGTSGSPIVDRKGKV
dengue4   HETGRLEPSWADVRNDMISYGGGWRLGDKWDKEEDVQVLAIEPGKNPKHVQTKPGLFKTLTGEIG-AVTLDFKPGTSGSPIINRKGKV
dengue3   HNGKRLEPNWASVKKDLISYGGGWRLSAQWQKGEEVQVIAVEPGKNPKNFQTMPGIFQTTTGEIG-AIALDFKPGTSGSPIINREGKV
dengue1   YQGKRLEPSWASVKKDLISYGGGWRFQGSWNTGEEVQVIAVEPGKNPKNVQTAPGTFKTSEGEVG-AIALDFKPGTSGSPIVNREGKI
kunjin    SGEGRLDPYWGSVKEDRLCYGGPWKLQHKWNGQDEVQMIVVEPGKNVKNVQTKPGVFKTPEGEIG-AVTLDFPTGTSGSPIVDKNGDV
japenceph SGEGKLTPYWGSVREDRIAYGGPWRFDRKWNGTDDVQVIVVEPGKAAVNIQTKPGVFRTPFGEVG-AVSLDYPRGTSGSPILDSNGDI
westnile  SGEGRLDPYWGSVKEDRLCYGGPWKLQHKWNGHDEVQMIVVEPGKNVKNVQTKPGVFKTPEGEIG-AVTLDYPTGTSGSPIVDKNGDV
powassen  VEGATSGPYWADVREDVVCYGGAWGLDKKWG-GEVVQVHAFPPDSGHKIHQCQPGKLNLEGGRVLGAIPIDLPRGTSGSPIINAQGDV
tbe       IDDAVAGPYWADVKEDVVCYGGAWSLEEKWK-GETVQVHAFPPGRAHEVHQCQPGELLLDTGRRIGAVPIDLAKGTSGSPILNSQGVV
PCP-motifs (underlined) are defined as local maxima in the relative entropy scale.
Rel. entropy
       HKGKRIEPSWADVKKDLISYGGGWKLEGEWKEGEEVQVLALEPGKNPRAVQTKPGLFKTNTGTIGAVSLDFSPGTSGSPIVDRKGKVV
1.00   HKGKRIEPSWADVKKDLISYGGGWKLEGEWKEGEEVQVLALEPGKNPRAVQTKPGLFKTNTGTIGAVSLDFSPGTSGSPIVDRKGKVV
1.20   ---KRIEPSWADVKKDLISYGGGWKLEGEWKEGEEVQVLALEPGKNPRAVQTKPGLFKTNTGTIGAVSLDFSPGTSGSPIVDRKGKVV
1.40   ----RIEPSWADVKKDLISYGGGWKLEGEW---EEVQVLALEPGKNPRAVQTKPGLFKTNTGTIGAVSLDFSPGTSGSPIVDRKGKVV
1.60   ----RIEPSWADVKKDLISYGGGWKLEGEW---EEVQVLALEPGKNPRAVQTKPGLF----GTIGAVSLDFSPGTSGSPIVDRKGKVV
1.80   ----RIEPSWADVKKDLISYGGGW---------EEVQVLALEPGKNPRAVQTKPG------------------GTSGSPIVDRKG---
2.00   -------PSWADVKKDLISYGGGW-----------VQVLALEPGK----------------------------GTSGSPI--------
2.20   -------PSWADVKKDLISYGGGW-----------VQVLALEPG-----------------------------GTSGSPI--------
2.40   -------------KKDLISYGGGW-------------------------------------------------GTSGSP---------
(Increasing conservation from top to bottom.)
The degree of variability is color coded: blue residues are the most conserved and red the most variable.
This residue coloring can also be projected onto a 3-D structure to make a Stereochemical Variability Plot (SVP).
30
Procedure of Functional Annotation
