Title: How to find an optimal sequence alignment
1 How to find an optimal sequence alignment?
- The optimal alignment is the alignment with the highest score.
- Exhaustive enumeration and scoring of all alignments is not feasible.
- The problem is solved by dynamic programming algorithms, which use the fact that the globally optimal alignment can be constructed from optimal alignments of subsequences.
2 Most widely used algorithms
- Two basic types of algorithms
- Needleman-Wunsch algorithm (refs. 1, 2)
  - Global algorithm which gives an overall best-fit alignment of the entire sequences.
  - Rigorous algorithm to find the optimal solution.
  - Requires a tremendous amount of computing power.
  - Not sensitive for highly diverged sequences.
- Smith-Waterman algorithm (ref. 3)
  - Local alignment procedure which tries to find a subsequence (or several small subsequences) of high similarity.
Ref:
1. Needleman, S.B. & Wunsch, C.D. 1970. J. Mol. Biol. 48, 443-453.
2. Gotoh, O. 1982. J. Mol. Biol. 162, 705-708.
3. Smith, T.F. & Waterman, M.S. 1981. J. Mol. Biol. 147, 195-197.
3 Dynamic Programming (Needleman-Wunsch) Algorithm
Global best score for aligning two sequences of length m and n.
Example from the textbook (Durbin et al., Chapter 2):
HEAGAWGHEE (m = 10), $x_i$, i = 1,...,10
PAWHEAE (n = 7), $y_j$, j = 1,...,7
Score matrix: BLOSUM50; gap cost d = 8
Example alignment:
HEAGAWGHEE
P-A--WHEAE
What is the score for this alignment? How to find an optimal alignment? What is the best possible scoring value?
4 Score matrix for the example
Example alignment:
HEAGAWGHEE
P-A--WHEAE
Score per column: -2 -8 5 -8 -8 15 -2 0 -1 6
Total score: -3
Is this an optimal alignment?
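The total of -3 can be checked by summing the column scores. The sketch below is only a minimal illustration: the dictionary holds just the substitution scores listed on the slide (BLOSUM50 entries for the aligned pairs), and the names `PAIR_SCORES`, `GAP_COST` and `score_alignment` are hypothetical.

```python
# Substitution scores for the aligned residue pairs of the example, as listed
# on the slide (BLOSUM50 entries). Only the pairs needed here are included.
PAIR_SCORES = {
    ("H", "P"): -2, ("A", "A"): 5, ("W", "W"): 15,
    ("G", "H"): -2, ("H", "E"): 0, ("E", "A"): -1, ("E", "E"): 6,
}
GAP_COST = 8  # linear gap cost d from the example


def score_alignment(top: str, bottom: str) -> int:
    """Sum the column scores of a pairwise alignment; '-' marks a gap."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total -= GAP_COST          # residue aligned with a gap
        else:
            total += PAIR_SCORES[(a, b)]  # substitution score s(a, b)
    return total


print(score_alignment("HEAGAWGHEE", "P-A--WHEAE"))  # -3
```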
5 Needleman-Wunsch algorithm
- Construct a matrix F(i,j) which is the score of the best alignment of the partial segments $x_{1 \ldots i}$ and $y_{1 \ldots j}$.
- At each aligned position there are three possible events:
  - substitution of $x_i$ by $y_j$: change F by $s(x_i, y_j)$
  - alignment of $x_i$ with a gap: penalty d
  - alignment of $y_j$ with a gap: penalty d
- Gaps are usually counted with a negative value, often referred to as the "gap penalty".
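6 Recursive Generation of the F matrix
The maximum score can be found by working forward along each sequence, successively finding the best score for aligning all subsequences. The values are stored in a matrix where each element of F is calculated as follows:
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$$
with the boundary values $F(0,0) = 0$, $F(i,0) = -id$ and $F(0,j) = -jd$.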
7 Finding the optimal alignment
- The element F(i,j) contains the best score for the alignment of the subsequences $x_{1 \ldots i}$ and $y_{1 \ldots j}$. The alignment path can be determined by tracing back through the matrix, using pointers to indicate which of the three possibilities was the maximum at each point (i,j).
- Best alignment: Score = 1
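The matrix fill and the pointer traceback can be sketched as follows for the example sequences. This is a minimal illustration, not a reference implementation: the substitution table is an excerpt of BLOSUM50 restricted to the residues A, E, G, H, P and W (transcribed here as an assumption, to be checked against the full matrix), and the function name `needleman_wunsch` is hypothetical. Ties are broken in favour of the diagonal move, so the alignment returned may be one of several with the same optimal score.

```python
# Assumed excerpt of BLOSUM50 for the residues that occur in the example.
ALPHABET = "AEGHPW"
ROWS = [
    [5, -1, 0, -2, -1, -3],    # A
    [-1, 6, -3, 0, -1, -3],    # E
    [0, -3, 8, -2, -2, -3],    # G
    [-2, 0, -2, 10, -2, -3],   # H
    [-1, -1, -2, -2, 10, -4],  # P
    [-3, -3, -3, -3, -4, 15],  # W
]
S = {(a, b): ROWS[i][j]
     for i, a in enumerate(ALPHABET) for j, b in enumerate(ALPHABET)}


def needleman_wunsch(x, y, d=8):
    """Global alignment with linear gap cost d; returns (score, row_x, row_y)."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    ptr = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # first column: x aligned to gaps
        F[i][0], ptr[i][0] = -i * d, "up"
    for j in range(1, n + 1):          # first row: y aligned to gaps
        F[0][j], ptr[0][j] = -j * d, "left"
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            choices = [
                (F[i - 1][j - 1] + S[(x[i - 1], y[j - 1])], "diag"),
                (F[i - 1][j] - d, "up"),     # x_i aligned with a gap
                (F[i][j - 1] - d, "left"),   # y_j aligned with a gap
            ]
            F[i][j], ptr[i][j] = max(choices, key=lambda c: c[0])
    # Trace back from the bottom-right corner along the stored pointers.
    row_x, row_y = [], []
    i, j = m, n
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "diag":
            row_x.append(x[i - 1]); row_y.append(y[j - 1]); i -= 1; j -= 1
        elif move == "up":
            row_x.append(x[i - 1]); row_y.append("-"); i -= 1
        else:
            row_x.append("-"); row_y.append(y[j - 1]); j -= 1
    return F[m][n], "".join(reversed(row_x)), "".join(reversed(row_y))


score, row_x, row_y = needleman_wunsch("HEAGAWGHEE", "PAWHEAE")
print(score)  # 1
print(row_x)
print(row_y)
```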
8 Local alignment
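The Smith-Waterman procedure differs from the global recursion in two ways: each cell may also take the value 0 (start a new alignment), and the traceback starts at the highest-scoring cell and stops at the first cell with value 0. The sketch below applies this to the same example sequences. The BLOSUM50 excerpt is an assumption transcribed for the residues involved, and `smith_waterman` is a hypothetical name.

```python
# Assumed excerpt of BLOSUM50 for the residues that occur in the example.
ALPHABET = "AEGHPW"
ROWS = [
    [5, -1, 0, -2, -1, -3],    # A
    [-1, 6, -3, 0, -1, -3],    # E
    [0, -3, 8, -2, -2, -3],    # G
    [-2, 0, -2, 10, -2, -3],   # H
    [-1, -1, -2, -2, 10, -4],  # P
    [-3, -3, -3, -3, -4, 15],  # W
]
S = {(a, b): ROWS[i][j]
     for i, a in enumerate(ALPHABET) for j, b in enumerate(ALPHABET)}


def smith_waterman(x, y, d=8):
    """Best local alignment with linear gap cost d; returns (score, row_x, row_y)."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(
                0,                                         # start a new alignment
                F[i - 1][j - 1] + S[(x[i - 1], y[j - 1])],  # substitution
                F[i - 1][j] - d,                           # x_i aligned with a gap
                F[i][j - 1] - d,                           # y_j aligned with a gap
            )
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Trace back from the best cell until a cell with value 0 is reached.
    row_x, row_y = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + S[(x[i - 1], y[j - 1])]:
            row_x.append(x[i - 1]); row_y.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            row_x.append(x[i - 1]); row_y.append("-"); i -= 1
        else:
            row_x.append("-"); row_y.append(y[j - 1]); j -= 1
    return best, "".join(reversed(row_x)), "".join(reversed(row_y))


print(smith_waterman("HEAGAWGHEE", "PAWHEAE"))  # (28, 'AWGHE', 'AW-HE')
```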
9 Significance of scores
- Using a search method (global or local), one always finds a best hit.
- What does this mean?
  - A biologically meaningful alignment (homologous proteins)?
  - Or the best alignment between two unrelated sequences?
- Two approaches:
  - Bayesian approach
  - Statistical approach (extreme value distribution)
10 Significance of scores
- Probability that two sequences x and y are related according to a model M, as compared to a random model R.
- Match model M
- Random model R
- Additive scoring system (log-odds ratio)
- What scores are statistically significant?
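For an ungapped alignment, the two models and the resulting additive score can be written out as below. This is a sketch in the notation of Durbin et al., Chapter 2; the symbols $q_a$ (background frequency of residue a) and $p_{ab}$ (joint probability of the aligned pair a, b) are assumed here and do not appear in the slide text.

```latex
\begin{align}
P(x,y \mid R) &= \prod_i q_{x_i} \prod_j q_{y_j} \\
P(x,y \mid M) &= \prod_i p_{x_i y_i} \\
\frac{P(x,y \mid M)}{P(x,y \mid R)} &= \prod_i \frac{p_{x_i y_i}}{q_{x_i}\, q_{y_i}} \\
S = \log \frac{P(x,y \mid M)}{P(x,y \mid R)} &= \sum_i s(x_i, y_i),
\qquad s(a,b) = \log \frac{p_{ab}}{q_a q_b}
\end{align}
```

Taking the logarithm turns the product of likelihood ratios into a sum over aligned positions, which is why substitution scores can simply be added.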
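11 Bayesian Approach for Significance
- What is the probability that x and y are related, $P(M \mid x,y)$?
- $P(M)$: a priori probability that x and y are related
- $P(R) = 1 - P(M)$
- After rearrangement we find
  $$P(M \mid x,y) = \sigma(S'), \qquad \sigma(z) = \frac{e^{z}}{1 + e^{z}}$$
- with
  $$S' = S + \log \frac{P(M)}{P(R)}$$
  where $S = \log \frac{P(x,y \mid M)}{P(x,y \mid R)}$ is the log-odds score of the alignment.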
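Numerically, the posterior is the logistic function of the alignment score plus the prior log-odds ratio. The sketch below assumes the score is expressed in natural-log units (published matrices such as BLOSUM50 are scaled and rounded, so their raw scores would first need rescaling); `posterior_related` is a hypothetical name.

```python
import math


def posterior_related(score: float, prior_match: float) -> float:
    """P(M | x, y) for a log-odds score in nats and a prior P(M) in (0, 1)."""
    if not 0.0 < prior_match < 1.0:
        raise ValueError("prior_match must lie strictly between 0 and 1")
    prior_log_odds = math.log(prior_match / (1.0 - prior_match))
    s_prime = score + prior_log_odds
    # Logistic function, written so that exp() never overflows.
    if s_prime >= 0:
        return 1.0 / (1.0 + math.exp(-s_prime))
    e = math.exp(s_prime)
    return e / (1.0 + e)


# With a prior of 1 related pair in 10, a score of log(9) only reaches
# even odds, because the prior log-odds ratio is log(1/9).
print(posterior_related(math.log(9), 0.1))
```

The smaller the prior probability of a true match, the larger the score needed to reach the same posterior.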
12 Importance of prior log-odds ratio
- The prior log-odds ratio is added to the standard score.
- There is a large number of different alignments for a possible match.
- The length of the alignment influences the score.
- Even if all sequences in the database are unrelated, the best score increases with an increasing number of sequences.
- The prior log-odds ratio has to be adjusted to the size of the database.
- Other quantities (E value)
- Local match: adjustment for the length of the sequences
13 Extreme value distribution (EVD)
- Search in a database with N sequences
- Distribution of the maximum of the N scores
- Asymptotic distribution of the maximum of N independent normal random variables
- Karlin & Altschul (1990): the number of unrelated matches of two sequences with lengths m and n and with score greater than S is Poisson distributed with mean
  $$E(S) = K m n \, e^{-\lambda S}$$
  where K and $\lambda$ are constants that depend on the scoring system.
- The probability that there is a match with score greater than S is
  $$P(x > S) = 1 - e^{-E(S)}$$
- Example: score of human cytochrome C in Swiss-Prot
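The Karlin-Altschul mean and the resulting probability are simple to evaluate once the two constants of the scoring system are known. In the sketch below, the values of `K` and `lam` are placeholders chosen only for illustration, not fitted parameters for any real matrix, and the function names are hypothetical.

```python
import math


def expected_hits(score, m, n, K, lam):
    """Mean number of unrelated matches above `score`: K * m * n * exp(-lam * score)."""
    return K * m * n * math.exp(-lam * score)


def p_value(score, m, n, K, lam):
    """Probability of at least one unrelated match above `score`: 1 - exp(-E)."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate when E is tiny.
    return -math.expm1(-expected_hits(score, m, n, K, lam))


# Placeholder constants (assumptions for illustration only).
K, lam = 0.1, 0.3
for s in (20, 40, 60):
    print(s, expected_hits(s, 1000, 10**8, K, lam), p_value(s, 1000, 10**8, K, lam))
```

For small E the probability is approximately equal to E itself, which is why the E value is commonly reported directly.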
14 Score distribution
15 Heuristic alignment algorithms
- Complexity of Needleman-Wunsch: the number of operations to calculate is O(mn).
- Protein sequence database: $10^8$ residues; a typical protein: $10^3$ residues; hence $10^{11}$ operations.
- a) Basic Local Alignment Search Tool (BLAST), Altschul, Lipman
  - search for short stretches with high-scoring matches; use these neighbourhood words as seeds for alignment extensions
- b) FASTA, Pearson
  - look-up table of identically matching words (ktup = 2)
  - extension of these hits with ungapped alignments
  - joining of ungapped regions by dynamic programming with gaps
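The first FASTA step, the word look-up table, can be illustrated with a few lines of code. This is a simplified sketch of the idea only, not the FASTA program itself; the function names `ktup_hits` and `diagonal_counts` are hypothetical. Hits that share the same diagonal (i - j) are the candidates for ungapped extension.

```python
from collections import Counter, defaultdict


def ktup_hits(query, target, k=2):
    """Return (i, j) start positions where a k-letter word of query equals one of target."""
    index = defaultdict(list)               # word -> start positions in target
    for j in range(len(target) - k + 1):
        index[target[j:j + k]].append(j)
    hits = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            hits.append((i, j))
    return hits


def diagonal_counts(hits):
    """Count word hits per diagonal i - j; dense diagonals suggest ungapped matches."""
    return Counter(i - j for i, j in hits)


hits = ktup_hits("HEAGAWGHEE", "PAWHEAE", k=2)
print(hits)                   # [(0, 3), (1, 4), (4, 1), (7, 3)]
print(diagonal_counts(hits))  # diagonal -3 holds two hits
```

Building the index takes time proportional to the target length, so each query word is looked up without scanning the whole target.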
16 Going beyond sequence alignment: Motifs, patterns, profiles
How should we read protein sequences? Is there a grammar? Can we identify words in this language?
Motifs: contiguous segments in a protein family with statistically significant sequence conservation.
Molegos: contiguous segments with sequence and 3D structure conservation.
17 Patterns, Profiles and Motifs
- Pairwise alignments can find similar sequences, i.e. we can measure the similarity by sequence identity, similarity score (sum of substitution matrix values) or the probability that two sequences would match in a random model (similarity).
- However, we would like to know if the two sequences have a common evolutionary origin (homology).
- There is no simple general cutoff value (see example cytochromes).
- Similarity versus homology
- To improve reliability, one has to rely on additional tools: patterns, profiles and motifs.
18 Patterns, Profiles and Motifs
Patterns: string of sequence elements characteristic of a protein family, e.g. PROSITE pattern
Profiles: a quantitative expression for a position-specific signature based on the fraction of amino acids at each position in the sequence
Motifs: any sequence pattern which is predictive for a protein function, a structural feature or family membership
PCP motifs: quantitative definition of conserved physical-chemical properties in a protein family (combination of patterns and profiles)
19 Web sites related to motif search
- PCPMer: http://landau.utmb.edu:8080/WebPCPMer/HomePage/index.html
  - used in this course
- BLOCKS: http://blocks.fhcrc.org/
  - relies on conserved stretches of protein families
- MEME: http://meme.sdsc.edu/meme/website/intro.html
  - combines HMM search with profile methods
- PROSITE: http://us.expasy.org/prosite/
  - catalogue of biologically important sequence patterns
- HMMER: http://hmmer.wustl.edu/
  - profile hidden Markov models
- SAM: http://www.soe.ucsc.edu/research/compbio/sam.html
  - several software packages implementing HMMs
- PFAM: http://www.sanger.ac.uk/Software/Pfam/
  - collection of MSAs and hidden Markov models of many common protein domains
20 PROSITE: expert-curated database of patterns
Hosted by the Swiss Institute of Bioinformatics. Release 18.19 of 16-Jan-2004 contains 1241 documentation entries that describe 1685 different patterns, rules and profiles/matrices.
Patterns are usually hand-edited by experts who work with proteins in the family and are familiar with its activities and the areas needed for activity.
Language for describing the patterns:
[ ]     any of the amino acids within the brackets can occur
{ }     anything but the amino acids within the braces can occur
x(n,m)  a spacer of n to m residues can occur in the pattern
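Because these three constructs map directly onto regular-expression syntax, a pattern can be translated and searched with a standard regex engine. The sketch below handles only the elements listed above plus single residues and fixed repeats such as x(3); anchors and other PROSITE extensions are left out. The pattern in the example is made up for illustration and is not a PROSITE entry, and `prosite_to_regex` is a hypothetical name.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        # Split an element such as "x(2,4)" into its core and optional repeat.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError(f"cannot parse element {element!r}")
        core, low, high = match.groups()
        if core == "x":
            token = "."                          # any residue
        elif core.startswith("[") and core.endswith("]"):
            token = core                         # any of the listed residues
        elif core.startswith("{") and core.endswith("}"):
            token = "[^" + core[1:-1] + "]"      # anything but the listed residues
        elif len(core) == 1 and core.isalpha():
            token = core                         # one specific residue
        else:
            raise ValueError(f"unsupported element {element!r}")
        if low and high:
            token += "{%s,%s}" % (low, high)     # spacer of low to high residues
        elif low:
            token += "{%s}" % low                # exactly `low` residues
        parts.append(token)
    return "".join(parts)


regex = prosite_to_regex("[AC]-x(2,4)-{ED}-V")   # made-up pattern
print(regex)                                      # [AC].{2,4}[^ED]V
print(re.search(regex, "GGAKLMQVTT").group())     # AKLMQV
```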
21 Example PROSITE pattern
Ref: Durbin, Eddy, Krogh, Mitchison, Biological Sequence Analysis, Cambridge University Press.
22 Profiles
Multiple sequence alignment
f(j,b): probability of amino acid b at position j
Profile p: the expected score for a given sequence $y_j$ to fit into the family
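One way to make f(j,b) concrete is to count residues column by column in the multiple alignment and then score a sequence by summing position-specific log-odds values. The slide's exact scoring formula is not reproduced in this transcript, so the log-odds form, the uniform background and the pseudocount below are assumptions, and the three-residue alignment is a toy example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(msa, pseudocount=0.0):
    """f(j, b): frequency of amino acid b in column j; gap characters are ignored.

    Assumes every column holds at least one residue when pseudocount is 0.
    """
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for j in range(length):
        counts = Counter(row[j] for row in msa if row[j] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({b: (counts[b] + pseudocount) / total for b in AMINO_ACIDS})
    return profile


def profile_score(seq, profile, background=None):
    """Sum of log(f(j, y_j) / q(y_j)); needs non-zero frequencies (use a pseudocount)."""
    if background is None:
        background = {b: 1.0 / len(AMINO_ACIDS) for b in AMINO_ACIDS}  # assumed uniform
    return sum(math.log(profile[j][b] / background[b]) for j, b in enumerate(seq))


toy_msa = ["ACD", "ACE", "AGD"]          # toy alignment, for illustration only
profile = column_frequencies(toy_msa, pseudocount=1.0)
print(profile_score("ACD", profile), profile_score("AGE", profile))
```

The consensus-like sequence ACD scores higher than AGE because it uses the more frequent residue in columns 2 and 3.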
23 Our approach: property-based motif search
- Each amino acid has a set of physical-chemical properties.
  - Hydrophobicity, hydrophilicity, side-chain length, bulkiness, molecular weight, solubility, etc.
- 3D structure is determined by the physical-chemical properties of residues.
- Which properties to choose?
- How can we represent these properties, given the differences in scales?
Ref: Williams RM, Obradovic Z, Venkatarajan MS et al. 2001. The protein non-folding problem: determinants of disorder. Pac Symp Biocomputing 6:89-100.
24 PCP motifs: http://landau.utmb.edu:8080/WebPCPMer/
25 Reduction of the descriptor space for physical-chemical properties of amino acids
237 physical-chemical properties reduced to 5 descriptors by multidimensional scaling
Ref: Mathura, V.S. and W. Braun. 2001. New quantitative descriptors of amino acids based on multidimensional scaling of a large number of physical-chemical properties. J Mol Model 7:445-453.
26 Physical-chemical interpretation of the descriptors
27 Quantitative definition of PCP motifs for a protein family
- For each column of the multiple alignment (or residue position):
  - Measure the significance of conservation by the relative entropy.
  - Calculate the a priori distributions of the 5 descriptors and compare them to the actual distributions of the descriptors in each column.
- Quantitative definition of motifs
  - Compute averages and standard deviations of the property vector components.
- Length cutoff (L)
  - Minimum number of positions to be included in the motif.
- Gap cutoff (G)
  - Number of insignificant positions allowed between two significant residue positions in a motif.
- Ref: V.S. Mathura, C.H. Schein, W. Braun (2003). Bioinformatics, 19, 1381-1390.
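The roles of the two cutoffs can be illustrated with a small segmentation routine. This is one possible reading of L and G, not the PCPMer implementation: a position counts as significant when its relative entropy reaches a threshold, up to `max_gap` insignificant positions may separate two significant ones, and a motif is kept when it spans at least `min_length` positions. All names and the example values are hypothetical.

```python
def find_motifs(scores, threshold, min_length, max_gap):
    """Return (start, end) index pairs (inclusive) of motifs along the alignment."""
    motifs = []
    start = last = None            # first and most recent significant position
    for pos, value in enumerate(scores):
        if value < threshold:
            continue               # insignificant position
        if start is None:
            start = pos
        elif pos - last - 1 > max_gap:
            # Too many insignificant positions since the last significant one:
            # close the current segment and start a new one here.
            if last - start + 1 >= min_length:
                motifs.append((start, last))
            start = pos
        last = pos
    if start is not None and last - start + 1 >= min_length:
        motifs.append((start, last))
    return motifs


# Made-up per-column relative entropies, for illustration only.
entropies = [0.1, 1.5, 1.6, 0.2, 1.7, 1.8, 0.1, 0.1, 0.1, 1.9, 0.2]
print(find_motifs(entropies, threshold=1.0, min_length=3, max_gap=1))  # [(1, 5)]
```

Raising the threshold or L shortens the list of motifs, while raising G merges neighbouring conserved stretches into longer ones.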
28 SIGNIFICANCE OF CONSERVATION
[Figure: frequency (scale up to 1.0) of descriptor E1 values in five bins, P(X1) to P(X5)]
P(X): natural frequency of amino acid occurrence
Q(X): observed frequency calculated from the multiple alignment
b: one of the five bins
i: one of the vectors E1-E5
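With both distributions expressed over the same five bins, the relative entropy of a column is the sum over bins of Q(b) log(Q(b) / P(b)). The sketch below shows that calculation for a single descriptor. The bin boundaries, the descriptor values and the use of the natural logarithm are assumptions for illustration, and the function names are hypothetical.

```python
import math
from bisect import bisect_right


def bin_frequencies(values, inner_edges):
    """Fraction of values falling in each bin defined by ascending inner boundaries."""
    counts = [0] * (len(inner_edges) + 1)
    for v in values:
        counts[bisect_right(inner_edges, v)] += 1
    return [c / len(values) for c in counts]


def relative_entropy(q, p):
    """Sum over bins of q_b * log(q_b / p_b); bins with q_b = 0 contribute nothing."""
    if len(q) != len(p):
        raise ValueError("both distributions need the same number of bins")
    total = 0.0
    for q_b, p_b in zip(q, p):
        if q_b > 0:
            if p_b <= 0:
                raise ValueError("background frequency must be positive where q > 0")
            total += q_b * math.log(q_b / p_b)
    return total


background = [0.2] * 5                       # assumed uniform background P
edges = [0.2, 0.4, 0.6, 0.8]                 # assumed boundaries of the five bins
column = [0.05, 0.10, 0.15, 0.12, 0.08]      # made-up descriptor values in one column
observed = bin_frequencies(column, edges)    # all values fall in the first bin
print(relative_entropy(observed, background))  # log(5), about 1.609
```

A column whose descriptor values match the background distribution gives a relative entropy of 0, and the value grows as the column concentrates in fewer bins.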
29 Illustration of PCP motifs
Multiple sequence alignment of NS3 protein sequences from flaviviruses:
dengue2    HKGKRIEPSWADVKKDLISYGGGWKLEGEWKEGEEVQVLALEPGKNPRAVQTKPGLFKTNTGTIG-AVSLDFSPGTSGSPIVDRKGKV
dengue4    HETGRLEPSWADVRNDMISYGGGWRLGDKWDKEEDVQVLAIEPGKNPKHVQTKPGLFKTLTGEIG-AVTLDFKPGTSGSPIINRKGKV
dengue3    HNGKRLEPNWASVKKDLISYGGGWRLSAQWQKGEEVQVIAVEPGKNPKNFQTMPGIFQTTTGEIG-AIALDFKPGTSGSPIINREGKV
Dengue1    YQGKRLEPSWASVKKDLISYGGGWRFQGSWNTGEEVQVIAVEPGKNPKNVQTAPGTFKTSEGEVG-AIALDFKPGTSGSPIVNREGKI
kunjin     SGEGRLDPYWGSVKEDRLCYGGPWKLQHKWNGQDEVQMIVVEPGKNVKNVQTKPGVFKTPEGEIG-AVTLDFPTGTSGSPIVDKNGDV
japenceph  SGEGKLTPYWGSVREDRIAYGGPWRFDRKWNGTDDVQVIVVEPGKAAVNIQTKPGVFRTPFGEVG-AVSLDYPRGTSGSPILDSNGDI
westnile   SGEGRLDPYWGSVKEDRLCYGGPWKLQHKWNGHDEVQMIVVEPGKNVKNVQTKPGVFKTPEGEIG-AVTLDYPTGTSGSPIVDKNGDV
powassen   VEGATSGPYWADVREDVVCYGGAWGLDKKWG-GEVVQVHAFPPDSGHKIHQCQPGKLNLEGGRVLGAIPIDLPRGTSGSPIINAQGDV
tbe        IDDAVAGPYWADVKEDVVCYGGAWSLEEKWK-GETVQVHAFPPGRAHEVHQCQPGELLLDTGRRIGAVPIDLAKGTSGSPILNSQGVV
PCP motifs (underlined) are defined as local maxima in the relative entropy scale.
Rel. entropy  HKGKRIEPSWADVKKDLISYGGGWKLEGEWKEGEEVQVLALEPGKNPRAVQTKPGLFKTNTGTIGAVSLDFSPGTSGSPIVDRKGKVV
1.00          HKGKRIEPSWADVKKDLISYGGGWKLEGEWKEGEEVQVLALEPGKNPRAVQTKPGLFKTNTGTIGAVSLDFSPGTSGSPIVDRKGKVV
1.20          ---KRIEPSWADVKKDLISYGGGWKLEGEWKEGEEVQVLALEPGKNPRAVQTKPGLFKTNTGTIGAVSLDFSPGTSGSPIVDRKGKVV
1.40          ----RIEPSWADVKKDLISYGGGWKLEGEW---EEVQVLALEPGKNPRAVQTKPGLFKTNTGTIGAVSLDFSPGTSGSPIVDRKGKVV
1.60          ----RIEPSWADVKKDLISYGGGWKLEGEW---EEVQVLALEPGKNPRAVQTKPGLF----GTIGAVSLDFSPGTSGSPIVDRKGKVV
1.80          ----RIEPSWADVKKDLISYGGGW---------EEVQVLALEPGKNPRAVQTKPG------------------GTSGSPIVDRKG---
2.00          -------PSWADVKKDLISYGGGW-----------VQVLALEPGK----------------------------GTSGSPI--------
2.20          -------PSWADVKKDLISYGGGW-----------VQVLALEPG-----------------------------GTSGSPI--------
2.40          -------------KKDLISYGGGW-------------------------------------------------GTSGSP---------
Increasing conservation: rows with a higher relative entropy cutoff retain only the more conserved positions.
The degree of variability is color coded: blue residues are the most conserved and red the most variable. This residue coloring can also be projected onto a 3D structure to make a Stereochemical Variability Plot (SVP).
30 Procedure of Functional Annotation