How to find an optimal sequence alignment - PowerPoint PPT Presentation

Slides: 31
Provided by: werner6
Transcript and Presenter's Notes

1
How to find an optimal sequence alignment?
  • The optimal alignment is the alignment with
    the highest score.
  • Exhaustive enumeration and scoring of all
    alignments is not feasible.
  • The problem is solved by dynamic programming
    algorithms, which use the fact that the global
    optimal alignment can be constructed from optimal
    alignments of subsequences.

2
Most widely used algorithms
  • Two basic types of algorithms:
  • Needleman-Wunsch algorithm [1,2]
  • Global algorithm which gives an overall best-fit
    alignment of the entire sequences.
  • Rigorous algorithm that finds the optimal solution.
  • Requires considerable computing time and memory
    (both grow with the product of the sequence lengths).
  • Not sensitive for highly diverged sequences.
  • Smith-Waterman algorithm [3]
  • Local alignment procedure which tries to find a
    subsequence (or several short subsequences) of
    high similarity.

Ref:
[1] Needleman, S.B. & Wunsch, C.D. 1970. J. Mol. Biol. 48, 443-453.
[2] Gotoh, O. 1982. J. Mol. Biol. 162, 705-708.
[3] Smith, T.F. & Waterman, M.S. 1981. J. Mol. Biol. 147, 195-197.
3
Dynamic Programming (Needleman-Wunsch) Algorithm
Global best score for aligning two sequences of
length m and n.
Example from the textbook (Durbin et al., Chapter 2):
  HEAGAWGHEE (m = 10): xi, i = 1,...,10
  PAWHEAE (n = 7): yj, j = 1,...,7
Score matrix: BLOSUM50; gap cost d = 8
Example alignment:
  HEAGAWGHEE
  P-A--WHEAE
What is the score for this alignment? How to find
an optimal alignment? What is the best possible
scoring value?
4
Score matrix for the example
Example alignment:
  HEAGAWGHEE
  P-A--WHEAE
Score: -2 -8 +5 -8 -8 +15 -2 +0 -1 +6
Total score: -3
Is this an optimal alignment?
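The total of -3 can be checked column by column. The sketch below is only an illustration: it hard-codes the seven pair scores that appear in the score line above (BLOSUM50 values for those pairs) and the gap cost d = 8; the names `PAIR_SCORES` and `score_alignment` are made up for this example.

```python
# Pair scores taken from the slide's score line (BLOSUM50 values for these pairs only).
PAIR_SCORES = {
    ("H", "P"): -2, ("A", "A"): 5, ("W", "W"): 15, ("G", "H"): -2,
    ("H", "E"): 0, ("E", "A"): -1, ("E", "E"): 6,
}
GAP_COST = 8  # linear gap cost d = 8 from the example


def score_alignment(top: str, bottom: str) -> int:
    """Sum the column scores of a gapped pairwise alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total -= GAP_COST          # residue aligned with a gap
        else:
            total += PAIR_SCORES[(a, b)]  # substitution score s(a, b)
    return total


print(score_alignment("HEAGAWGHEE", "P-A--WHEAE"))  # -3
```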
5
Needleman Wunsch algorithm
  • Construct a matrix F(i,j) which holds the score of
    the best alignment of the partial segments
    x1...xi and y1...yj.
  • At each aligned position there are three possible
    events:
  • substitution of xi by yj, changing F by
    s(xi,yj)
  • alignment of xi with a gap, penalty d
  • alignment of yj with a gap, penalty d
  • Gaps are usually counted with a negative value,
    often referred to as the "gap penalty".

6
Recursive Generation of the F matrix
The maximum score can be found by working
forward along each sequence, successively finding
the best score for aligning all subsequences.
The values are stored in the matrix F, where each
element F(i,j) is the maximum over the three
possible events, computed from the neighbouring
elements F(i-1,j-1), F(i-1,j) and F(i,j-1).
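Assuming the standard Needleman-Wunsch recurrence with a linear gap cost d (the form used in Durbin et al., Chapter 2), the three events listed on the previous slide combine as follows. The boundary conditions are part of that assumption.

```latex
F(i,j) = \max
\begin{cases}
F(i-1,\,j-1) + s(x_i, y_j) & \text{(substitution of } x_i \text{ by } y_j\text{)} \\
F(i-1,\,j) - d             & \text{(} x_i \text{ aligned with a gap)} \\
F(i,\,j-1) - d             & \text{(} y_j \text{ aligned with a gap)}
\end{cases}
\qquad
F(0,0) = 0,\quad F(i,0) = -i\,d,\quad F(0,j) = -j\,d
```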
7
Finding the optimal alignment
  • The element F(i,j) contains the best score for
    aligning the subsequences x1...xi and y1...yj. The
    alignment path can be determined by tracing back
    through the matrix, using pointers that indicate
    which of the three possibilities was the maximum
    at each point (i,j).
  • Best alignment: score 1
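A minimal sketch of the fill and traceback steps for the running example (HEAGAWGHEE vs. PAWHEAE, gap cost d = 8). The substitution table contains only the BLOSUM50 entries needed for these two sequences. All names are illustrative, and the traceback re-derives the choice at each cell instead of storing explicit pointers, which is a simplification.

```python
X_RESIDUES = "HEAGW"
# BLOSUM50 entries for the example residues only: SUB[b][k] is s(X_RESIDUES[k], b).
SUB = {
    "P": [-2, -1, -1, -2, -4],
    "A": [-2, -1, 5, 0, -3],
    "W": [-3, -3, -3, -3, 15],
    "H": [10, 0, -2, -2, -3],
    "E": [0, 6, -1, -3, -3],
}


def s(a, b):
    """Substitution score for residue a (from x) against residue b (from y)."""
    return SUB[b][X_RESIDUES.index(a)]


def needleman_wunsch(x, y, d=8):
    """Return (best global score, aligned x, aligned y) for linear gap cost d."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = -i * d
    for j in range(1, n + 1):
        F[0][j] = -j * d
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # substitution
                          F[i - 1][j] - d,                          # x_i vs gap
                          F[i][j - 1] - d)                          # y_j vs gap
    # Traceback from (m, n): follow one move that reproduces F[i][j].
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return F[m][n], "".join(reversed(top)), "".join(reversed(bottom))


score, ax, ay = needleman_wunsch("HEAGAWGHEE", "PAWHEAE")
print(score)  # 1
print(ax)
print(ay)
```

Several alignments can share the optimal score; which one is printed depends on the order in which ties are resolved in the traceback.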

8
local alignment
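Assuming this slide refers to the Smith-Waterman procedure named earlier, a local alignment score can be sketched by changing the global recurrence in two ways: every cell is bounded below by 0, and the result is the maximum over the whole matrix rather than the last cell. The table again holds only the BLOSUM50 entries needed for the example, and the function name is illustrative.

```python
X_RESIDUES = "HEAGW"
# BLOSUM50 entries for the example residues only: SUB[b][k] is s(X_RESIDUES[k], b).
SUB = {
    "P": [-2, -1, -1, -2, -4],
    "A": [-2, -1, 5, 0, -3],
    "W": [-3, -3, -3, -3, 15],
    "H": [10, 0, -2, -2, -3],
    "E": [0, 6, -1, -3, -3],
}


def s(a, b):
    return SUB[b][X_RESIDUES.index(a)]


def smith_waterman_score(x, y, d=8):
    """Best local alignment score with linear gap cost d (score only, no traceback)."""
    best = 0
    prev = [0] * (len(y) + 1)            # row i-1 of the matrix
    for i in range(1, len(x) + 1):
        cur = [0] * (len(y) + 1)         # row i, first column stays 0
        for j in range(1, len(y) + 1):
            cur[j] = max(0,                                   # start a new local alignment
                         prev[j - 1] + s(x[i - 1], y[j - 1]),
                         prev[j] - d,
                         cur[j - 1] - d)
            best = max(best, cur[j])
        prev = cur
    return best


print(smith_waterman_score("HEAGAWGHEE", "PAWHEAE"))  # 28
```

The score 28 corresponds to aligning AWGHE with AW-HE (5 + 15 - 8 + 10 + 6).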
9
Significance of scores
  • Using a search method (global or local) one
    always finds a best hit.
  • What does this mean?
  • A biologically meaningful alignment (homologous
    proteins)?
  • Or the best alignment between two unrelated
    sequences?
  • Two approaches:
  • Bayesian approach
  • statistical approach (extreme value distribution)

10
Significance of scores
  • Compare the probability that two sequences x and y
    are related according to a match model M with
    their probability under a random model R.
  • match model M
  • random model R
  • additive scoring system (log-odds ratio)
  • Which scores are statistically significant?
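For an ungapped alignment, the additive log-odds score can be written out as below. The symbols p_ab (probability of the aligned pair a, b under M) and q_a (background frequency of a under R) follow the convention of Durbin et al. and are assumptions here, since the slide's own formulas are not in the transcript.

```latex
P(x,y \mid R) = \prod_i q_{x_i} \prod_i q_{y_i}, \qquad
P(x,y \mid M) = \prod_i p_{x_i y_i}

S = \log \frac{P(x,y \mid M)}{P(x,y \mid R)}
  = \sum_i s(x_i, y_i), \qquad
s(a,b) = \log \frac{p_{ab}}{q_a\, q_b}
```

Because the logarithm turns the product of per-position ratios into a sum, each aligned pair contributes independently, which is what makes a substitution matrix usable.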

11
Bayesian Approach for Significance
  • What is the probability that x and y are related,
    P(M|x,y)?
  • P(M): a priori probability that x and y are
    related
  • P(R) = 1 - P(M)
  • After rearrangement (Bayes' rule) we find
    P(M|x,y) = σ(S')
  • with S' = S + log(P(M)/P(R)) and
    σ(x) = e^x / (1 + e^x)

12
Importance of prior log-odds ratio
  • prior log-odds ratio is added to standard score
  • large number of different alignments for a
    possible match
  • length of the alignment influences score
  • Even if all sequences in the database are
    unrelated, the best score increases with the
    number of sequences searched
  • prior log-odds ratio has to be adjusted to the
    size of the database
  • other quantities (E value)
  • local match adjustment for the length of the
    sequences
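The effect of the prior can be made concrete with a small sketch. It assumes the posterior is the logistic function of the score plus the prior log-odds, P(M|x,y) = σ(S + log(P(M)/P(R))), with S in natural-log units, and it uses the hypothetical prior P(M) = 1/N for a database of N sequences. The function name is made up for this example.

```python
import math


def posterior_match(score: float, prior_match: float) -> float:
    """P(M | x, y) for log-odds score `score` (nats) and prior P(M) = prior_match."""
    prior_log_odds = math.log(prior_match / (1.0 - prior_match))
    s_prime = score + prior_log_odds
    return 1.0 / (1.0 + math.exp(-s_prime))   # logistic function sigma(S')


# Same score, growing database: the posterior drops as the prior shrinks.
for n_sequences in (10**3, 10**6):
    print(n_sequences, posterior_match(10.0, 1.0 / n_sequences))
```

A score that looks convincing against a small database can therefore be uninformative against a large one unless the prior log-odds is adjusted.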

13
Extreme value distribution (EVD)
  • search in a database with N sequences
  • distribution of the maximum of the N scores
  • asymptotic distribution of the maximum of N
    independent normal random variables
  • Karlin & Altschul (1990): the number of unrelated
    matches between two sequences of length m and n
    with score greater than S is Poisson distributed
    with mean E(S) = K m n exp(-λS)
  • The probability of finding at least one match with
    score greater than S is P(x > S) = 1 - exp(-E(S))
  • Example: score of human cytochrome C in
    Swiss-Prot
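A minimal sketch of the Karlin-Altschul statistics, assuming the usual form E(S) = K m n exp(-λS) for the expected number of chance matches and P = 1 - exp(-E) for the probability of at least one. The values of K and λ depend on the scoring system; the numbers used below are placeholders, not fitted parameters.

```python
import math


def expected_hits(score, m, n, K, lam):
    """Expected number of unrelated matches scoring above `score` (the E value)."""
    return K * m * n * math.exp(-lam * score)


def p_value(score, m, n, K, lam):
    """Probability of at least one chance match above `score` (Poisson assumption)."""
    return -math.expm1(-expected_hits(score, m, n, K, lam))  # 1 - exp(-E)


# Placeholder parameters (hypothetical): query length 100, database length 10**8.
K, lam = 0.1, 0.3
for score in (50, 100, 150):
    print(score, expected_hits(score, 100, 10**8, K, lam), p_value(score, 100, 10**8, K, lam))
```

For small E the two quantities are nearly equal, which is why E values are often read directly as probabilities in that regime.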

14
Score distribution
15
Heuristic alignment algorithms
  • Complexity of Needleman-Wunsch: the number of
    operations needed to fill the matrix is O(mn).
  • Protein sequence database:
  • 10^8 residues, a typical protein 10^3 residues:
    10^11 operations
  • a) Basic Local Alignment Search Tool (BLAST),
    Altschul, Lipman
  • search for short stretches with high-scoring
    matches; use these neighbourhood words as seeds
    for alignment extensions
  • b) FASTA, Pearson
  • look-up table of identically matching words
    (ktup = 2)
  • extension of these hits with ungapped alignments
  • joining of ungapped regions by dynamic
    programming with gaps
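The first FASTA stage, the look-up table of identical words, can be sketched as follows. This covers only the word look-up and the counting of hits per diagonal; the ungapped extension and the joining step are not shown, and the function names are made up for this example.

```python
from collections import Counter, defaultdict


def word_index(seq, ktup=2):
    """Map every word of length ktup to the list of its start positions in seq."""
    index = defaultdict(list)
    for i in range(len(seq) - ktup + 1):
        index[seq[i:i + ktup]].append(i)
    return index


def diagonal_hits(query, target, ktup=2):
    """Count identical-word hits per diagonal (query position minus target position)."""
    index = word_index(query, ktup)
    hits = Counter()
    for j in range(len(target) - ktup + 1):
        for i in index.get(target[j:j + ktup], ()):
            hits[i - j] += 1
    return hits


hits = diagonal_hits("HEAGAWGHEE", "PAWHEAE")
print(hits.most_common())  # diagonal -3 collects two hits (HE and EA)
```

Diagonals with many hits mark candidate regions; only those are passed on to the more expensive extension and dynamic programming steps.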

16

Going beyond sequence alignment: motifs, patterns,
profiles
How should we read protein sequences? Is there
a grammar? Can we identify words in this
language?
Motifs: contiguous segments in a protein family
with statistically significant sequence conservation.
Molegos: contiguous segments with sequence and
3D structure conservation.
17
Patterns, Profiles and Motifs
  • Pairwise alignments can find similar sequences,
    i.e. we can measure the similarity by sequence
    identity, by a similarity score (sum of substitution
    matrix values) or by the probability that two
    sequences would match in a random model
    (similarity).
  • However, we would like to know whether the two
    sequences have a common evolutionary origin
    (homology).
  • There is no simple general cutoff value (see
    the cytochrome example).
  • Similarity versus homology
  • To improve reliability, one has to
    rely on additional tools: patterns, profiles and
    motifs.

18
Patterns, Profiles and Motifs
Patterns: string of sequence elements
characteristic of a protein family, e.g. a PROSITE
pattern.
Profiles: a quantitative expression for
a position-specific signature based on the
fraction of amino acids at each position in the
sequence.
Motifs: any sequence pattern which is
predictive for a protein function, a structural
feature or family membership.
PCP motifs: quantitative definition of conserved
physical-chemical properties in a protein family
(combination of patterns and profiles).
19
Web sites related to motif search
  • PCPMer: http://landau.utmb.edu:8080/WebPCPMer/HomePage/index.html
  • used in this course
  • BLOCKS: http://blocks.fhcrc.org/
  • relies on conserved stretches of protein families
  • MEME: http://meme.sdsc.edu/meme/website/intro.html
  • combines HMM search with profile methods
  • PROSITE: http://us.expasy.org/prosite/
  • catalogue of biologically important sequence
    patterns
  • HMMER: http://hmmer.wustl.edu/
  • profile hidden Markov models
  • SAM: http://www.soe.ucsc.edu/research/compbio/sam.html
  • several software packages implementing HMMs
  • PFAM: http://www.sanger.ac.uk/Software/Pfam/
  • collection of MSAs and hidden Markov models of
    many common protein domains

20
PROSITE: expert-curated database of patterns,
hosted by the Swiss Institute of Bioinformatics.
Release 18.19 of 16-Jan-2004 contains 1241
documentation entries that describe 1685 different
patterns, rules and profiles/matrices.
Patterns are usually hand-edited by experts working
with proteins in the family, who are familiar with
its activities and the regions needed for activity.
Language for describing the patterns:
[ ]: all amino acids within the square brackets can occur
{ }: anything but the amino acids within the curly braces can occur
x(n,m): a spacer of n to m residues can occur in the pattern
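The pattern language maps almost directly onto regular expressions. The sketch below handles only the elements described above (single residues, [ ] sets, { } exclusions, x and repeat counts); anchors and other PROSITE features are left out. The pattern string is an illustrative example written in PROSITE syntax, and `prosite_to_regex` is a made-up helper, not part of any PROSITE tool.

```python
import re

# One pattern element: a residue, a [set], an {exclusion} or x, with an optional (n) or (n,m).
ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for elem in pattern.rstrip(".").split("-"):
        m = ELEMENT.fullmatch(elem)
        if m is None:
            raise ValueError(f"unsupported pattern element: {elem!r}")
        core, lo, hi = m.groups()
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # any residue except the listed ones
        if lo is not None:
            core += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(core)
    return "".join(parts)


regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")
print(regex)  # C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H
```

The resulting string can be passed to `re.search` to scan a protein sequence for matches.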
21
Example Prosite pattern
Ref: Durbin, Eddy, Krogh, Mitchison,
Biological Sequence Analysis, Cambridge
University Press.
22
Profiles
Multiple sequence alignment
f(j,b): probability of amino acid b at position j
Profile p: the expected score for a given
sequence yj to fit into the family
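One possible way to turn f(j,b) into a score is sketched below: column frequencies are estimated from a multiple alignment and a sequence is scored by summing log(f(j, y_j)/q) over positions. The pseudocount, the uniform background q = 1/20 and the toy alignment are all assumptions made for this illustration; the slide's own scoring formula is not in the transcript.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(msa, pseudocount=1.0):
    """Estimate f(j, b) for every alignment column j; gap characters are ignored."""
    profile = []
    for column in zip(*msa):
        counts = Counter(c for c in column if c in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({b: (counts[b] + pseudocount) / total for b in AMINO_ACIDS})
    return profile


def profile_score(profile, seq, background=1.0 / 20):
    """Log-odds score of seq against the profile, position by position."""
    return sum(math.log(profile[j][b] / background) for j, b in enumerate(seq))


toy_msa = ["GTSG", "GTSG", "GSSG"]           # hypothetical aligned family members
profile = column_frequencies(toy_msa)
print(profile_score(profile, "GTSG"))        # family-like sequence: positive score
print(profile_score(profile, "AAAA"))        # unrelated sequence: negative score
```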
23
Our approach property based motif search
  • Each amino acid has a set of physical-chemical
    properties.
  • Hydrophobicity, hydrophilicity, side-chain
    length, bulkiness, molecular weight, solubility etc.
  • 3D structure is determined by the physical-chemical
    properties of the residues.
  • Which properties to choose?
  • How can we represent these properties, given their
    differences in scale?

The protein non-folding problem: determinants
of disorder. Williams RM, Obradović Z,
Venkatarajan MS et al. 2001. Pac Symp Biocomputing
6:89-100
24
PCP motifs: http://landau.utmb.edu:8080/WebPCPMer/
25
Reduction of the descriptor space for
physical-chemical properties of amino acids
237 physical-chemical properties are reduced to
5 descriptors by multidimensional scaling.
Mathura, V.S. and W. Braun. 2001. New
quantitative descriptors of amino acids based on
multidimensional scaling of a large number of
physical chemical properties. J Mol Model 7:445-453
26
Physical-chemical interpretation of the
descriptors
27
Quantitative definition of PCP- Motifs for a
Protein Family
  • For each column of the multiple alignment
    (residue position):
  • Measure the significance of conservation by the
    relative entropy.
  • Calculate the a priori distributions of
    the 5 descriptors and compare them to the actual
    distributions of the descriptors in each column.
  • Quantitative definition of motifs:
  • Compute averages and standard deviations of the
    property vector components.
  • Length cutoff (L):
  • Minimum number of positions to be included in
    the motif.
  • Gap cutoff (G):
  • Number of insignificant positions allowed between
    two significant residue positions in a motif.
  • Ref:
  • V.S. Mathura, C.H. Schein, W. Braun (2003).
    Bioinformatics 19, 1381-1390.
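The two cutoffs can be illustrated with a short sketch that takes one relative-entropy value per alignment column and groups significant columns into motifs. The significance threshold, the reading of L as the span from the first to the last significant column, and all names are assumptions for this example; this is not the PCPMer implementation.

```python
def find_motifs(entropy, threshold, min_len, max_gap):
    """Return (start, end) column index pairs (inclusive) of candidate motifs.

    entropy   : relative entropy per alignment column
    threshold : columns at or above this value count as significant (assumed rule)
    min_len   : length cutoff L, applied to the span of the motif
    max_gap   : gap cutoff G, insignificant columns tolerated between significant ones
    """
    significant = [i for i, r in enumerate(entropy) if r >= threshold]
    motifs = []
    start = prev = None
    for i in significant:
        if start is None:
            start = prev = i
        elif i - prev - 1 <= max_gap:      # close enough: extend the current motif
            prev = i
        else:                              # gap too long: close the current motif
            if prev - start + 1 >= min_len:
                motifs.append((start, prev))
            start = prev = i
    if start is not None and prev - start + 1 >= min_len:
        motifs.append((start, prev))
    return motifs


values = [0.1, 2.0, 2.1, 0.3, 1.9, 0.2, 0.1, 0.2, 2.2, 0.1]
print(find_motifs(values, threshold=1.5, min_len=3, max_gap=1))  # [(1, 4)]
```

Raising the threshold shortens and splits motifs, which matches the behaviour of the relative-entropy scale shown for the NS3 alignment later in the deck.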

28
SIGNIFICANCE OF CONSERVATION
[Figure: histogram for descriptor E1; vertical axis: frequency (0 to 1.0); bars P(X1) to P(X5), one per bin]
P(X): natural frequency of amino acid occurrence
Q(X): observed frequency calculated from the multiple alignment
b: one of the five bins
i: vector E1-E5
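With these definitions, the relative entropy of one column for one descriptor is usually written as the sum over bins b of Q(b) ln(Q(b)/P(b)). The sketch below assumes that standard form and natural logarithms; the bin frequencies in the example are invented.

```python
import math


def relative_entropy(observed, natural):
    """Sum over bins of Q(b) * ln(Q(b) / P(b)); empty observed bins contribute 0."""
    return sum(q * math.log(q / p) for q, p in zip(observed, natural) if q > 0)


natural = [0.2, 0.2, 0.2, 0.2, 0.2]            # hypothetical P(X) over five bins
print(relative_entropy(natural, natural))       # 0.0: column looks like background
print(relative_entropy([1.0, 0, 0, 0, 0], natural))  # ln 5: fully conserved column
```

A column whose observed distribution equals the natural one scores 0, and the value grows as the column concentrates in fewer bins.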
29
Illustration of PCP motifs
Multiple sequence alignment of NS3 protein
sequences from flaviviruses
dengue2 HKGKRIEPSWADVKKDLISYGGGWKLEGEWKEGEEVQVL
ALEPGKNPRAVQTKPGLFKTNTGTIG-AVSLDFSPGTSGSPIVDRKGKV
dengue4 HETGRLEPSWADVRNDMISYGGGWRLGDKWDKEEDVQVL
AIEPGKNPKHVQTKPGLFKTLTGEIG-AVTLDFKPGTSGSPIINRKGKV
dengue3 HNGKRLEPNWASVKKDLISYGGGWRLSAQWQKGEEVQVI
AVEPGKNPKNFQTMPGIFQTTTGEIG-AIALDFKPGTSGSPIINREGKV
Dengue1 YQGKRLEPSWASVKKDLISYGGGWRFQGSWNTGEEVQVI
AVEPGKNPKNVQTAPGTFKTSEGEVG-AIALDFKPGTSGSPIVNREGKI
kunjin SGEGRLDPYWGSVKEDRLCYGGPWKLQHKWNGQDEVQMI
VVEPGKNVKNVQTKPGVFKTPEGEIG-AVTLDFPTGTSGSPIVDKNGDV
japenceph SGEGKLTPYWGSVREDRIAYGGPWRFDRKWNGTDDVQVI
VVEPGKAAVNIQTKPGVFRTPFGEVG-AVSLDYPRGTSGSPILDSNGDI
westnile SGEGRLDPYWGSVKEDRLCYGGPWKLQHKWNGHDEVQMI
VVEPGKNVKNVQTKPGVFKTPEGEIG-AVTLDYPTGTSGSPIVDKNGDV
powassen VEGATSGPYWADVREDVVCYGGAWGLDKKWG-GEVVQVH
AFPPDSGHKIHQCQPGKLNLEGGRVLGAIPIDLPRGTSGSPIINAQGDV
tbe IDDAVAGPYWADVKEDVVCYGGAWSLEEKWK-GETVQVH
AFPPGRAHEVHQCQPGELLLDTGRRIGAVPIDLAKGTSGSPILNSQGVV
PCP-motifs (underlined) are defined as local
maxima in the relative entropy scale.
Rel Entropy
        HKGKRIEPSWADVKKDLISYGGGWKLEGEWKEGEEVQVLALEPGKNPRAVQTKPGLFKTNTGTIGAVSLDFSPGTSGSPIVDRKGKVV
1.00    HKGKRIEPSWADVKKDLISYGGGWKLEGEWKEGEEVQVLALEPGKNPRAVQTKPGLFKTNTGTIGAVSLDFSPGTSGSPIVDRKGKVV
1.20    ---KRIEPSWADVKKDLISYGGGWKLEGEWKEGEEVQVLALEPGKNPRAVQTKPGLFKTNTGTIGAVSLDFSPGTSGSPIVDRKGKVV
1.40    ----RIEPSWADVKKDLISYGGGWKLEGEW---EEVQVLALEPGKNPRAVQTKPGLFKTNTGTIGAVSLDFSPGTSGSPIVDRKGKVV
1.60    ----RIEPSWADVKKDLISYGGGWKLEGEW---EEVQVLALEPGKNPRAVQTKPGLF----GTIGAVSLDFSPGTSGSPIVDRKGKVV
1.80    ----RIEPSWADVKKDLISYGGGW---------EEVQVLALEPGKNPRAVQTKPG------------------GTSGSPIVDRKG---
2.00    -------PSWADVKKDLISYGGGW-----------VQVLALEPGK----------------------------GTSGSPI--------
2.20    -------PSWADVKKDLISYGGGW-----------VQVLALEPG-----------------------------GTSGSPI--------
2.40    -------------KKDLISYGGGW-------------------------------------------------GTSGSP---------
The degree of variability is color coded: blue
residues are the most conserved and red the most
variable. This residue coloring can also be
projected onto a 3-D structure to make a
Stereochemical Variability Plot (SVP).
Increasing conservation
30
Procedure of Functional Annotation