Title: Pairwise alignment
1Pairwise alignment
2Sequence to function
3Many biosequences are related
Of mice and men - myoglobin
Human  GLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLK
Mouse  GLSDGEWQLVLNVWGKVEADLAGHGQEVLIGLFKTHPETLDKFDKFKNLK

Human  SEDEMKASEDLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIP
Mouse  SEEDMKGSEDLKKHGCTVLTALGTILKKKGQHAAEIQPLAQSHATKHKIP

Human  VKYLEFISECIIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYKELGFQG
Mouse  VKYLEFISEIIIEVLKKRHSGDFGADAQGAMSKALELFRNDIAAKYKELGFQG
4What is homology?
- In biological terms, homology means evolutionarily related.
- In biosequences, homologous sequences are similar.
- We infer homology by comparing biosequences using sequence comparisons (alignments).
5Why look for homology?
- If two sequences are homologous at the sequence level, what can we conclude?
- The three-dimensional structures will be similar.
- The biological functions are very likely, but not always, related.
- Multiple alignments of homologous sequences can help identify residues that are critical for structure or function.
6Evolution
- Evolution is driven by mutations in genes, which lead to changes in proteins.
- Changes in single bases
- Changes in multiple bases
- Addition/deletion of multiple bases
- Moving of part of a gene
- Copying part or all of a gene
- We are looking at chemical evolution
7Sequence differences
Deduction
Evolution
8Orthologs vs. paralogs
Paralogs
Orthologs
9Origins of genes with similar sequence
Paralogs
Orthologs
Analogous
Xenologous
10Modular proteins
- Most proteins larger than 20 kDa are built from
modules
C-type lectin
Fibronectin
EGF domains
11Residue distribution in 3D
- Structures are hydrophilic on the outside and
hydrophobic on the inside
[Figure: chymotrypsin structure. Colour key: hydrophobic, small, acid/amide, basic, active site.]
12Homology search methods
- Dot-plot: simple, graphical, visual
- Dynamic programming: Needleman-Wunsch, Smith-Waterman
- k-tuple programs: FastA, BLAST
- Alignment
- Matrices
- PAM250
- BLOSUM62
13Dot-plot
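[Figure: dot-plot matrix. One sequence (A D K F H K E A C T E) runs along the top and the other down the side; an X marks each cell where the two residues match.]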
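The dot-plot idea can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular program: the function names, the `window` and `min_matches` parameters, and the row sequence `DKAAKEA` are assumptions made for the example. Only the column sequence is taken from the slide.

```python
def dot_plot(seq_x, seq_y, window=1, min_matches=1):
    """Return rows of booleans; rows[j][i] is True when the window starting
    at seq_y[j] and seq_x[i] holds at least min_matches identical residues."""
    rows = []
    for j in range(len(seq_y) - window + 1):
        row = []
        for i in range(len(seq_x) - window + 1):
            matches = sum(seq_x[i + k] == seq_y[j + k] for k in range(window))
            row.append(matches >= min_matches)
        rows.append(row)
    return rows


def render(seq_x, seq_y, rows):
    """Draw the matrix as text: X for a hit, . for no hit."""
    width = len(rows[0]) if rows else 0
    lines = ["  " + " ".join(seq_x[:width])]
    for residue, row in zip(seq_y, rows):
        lines.append(residue + " " + " ".join("X" if hit else "." for hit in row))
    return "\n".join(lines)


if __name__ == "__main__":
    x = "ADKFHKEACTE"  # column sequence from the slide
    y = "DKAAKEA"      # hypothetical row sequence, for illustration only
    print(render(x, y, dot_plot(x, y)))
```

Raising `window` and `min_matches` suppresses isolated chance matches, so that only runs of matches along a diagonal remain visible.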
14Dot-plot
[Figure: dot-plots with human factor X on the X axis and human factor IX (Christmas factor) on the Y axis. Panels: direct comparison, no matrix; direct comparison, Compare 2.]
15DotLet
Distribution of scores of all residue pairs
Log of distribution
Protein C vs factor X
16DotLet - optimizing
17DotLet low e-score
MBL vs. Immunolectin Bombyx mori
18DotLet - optimizing
19Alignment
- An alignment is the best fit between two or more biosequences.
- Proteins having close relationships can be identified by comparing identical residues.

SEDEMKASEDLKKHGATVLTALGGILKKKGHHEA
SEEDMKGSEDLKKHGCTVLTALGTILKKKGQHAA

Human and murine myoglobin
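Counting identical residues in an alignment can be sketched as follows. The function name `percent_identity` and the choice to skip gapped columns are assumptions made for this example; other definitions of percent identity use a different denominator.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of identical residues over the ungapped columns of an alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Assumption: columns containing a gap ('-') are left out of the count.
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    identical = sum(x == y for x, y in columns)
    return 100.0 * identical / len(columns)


if __name__ == "__main__":
    human = "SEDEMKASEDLKKHGATVLTALGGILKKKGHHEA"  # fragment from the slide
    mouse = "SEEDMKGSEDLKKHGCTVLTALGTILKKKGQHAA"  # fragment from the slide
    print(f"{percent_identity(human, mouse):.1f}% identical")
```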
20Alignment - similarities
- When relationships are more distant, you also take similarities into account.

bovine  GVTTSDVVVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSL
Cod     NVKNYHRVVLGEHDRSSNSEGVQVMTVGQVFKHPRYNGF

- When doing the comparison by hand, you often use chemical similarities: L/I/V, F/Y/W, R/K, E/D/Q/N, G/A
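The by-hand comparison with chemical similarity groups can be sketched like this. The groups are the ones listed on the slide; the function names and the marker symbols (`|` for identical, `.` for similar) are assumptions chosen for the example, and the sequences are assumed to be ungapped.

```python
# Similarity groups as listed on the slide.
SIMILARITY_GROUPS = ["LIV", "FYW", "RK", "EDQN", "GA"]


def classify_pair(x, y):
    """Classify a residue pair as 'identical', 'similar' or 'different'."""
    if x == y:
        return "identical"
    if any(x in group and y in group for group in SIMILARITY_GROUPS):
        return "similar"
    return "different"


def match_line(seq_a, seq_b):
    """Build a marker line for two ungapped sequences of equal length."""
    symbols = {"identical": "|", "similar": ".", "different": " "}  # assumed markers
    return "".join(symbols[classify_pair(x, y)] for x, y in zip(seq_a, seq_b))


if __name__ == "__main__":
    bovine = "GVTTSDVVVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSL"
    cod = "NVKNYHRVVLGEHDRSSNSEGVQVMTVGQVFKHPRYNGF"
    print(bovine)
    print(match_line(bovine, cod))
    print(cod)
```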
21Global/local alignment
- For global alignments you try to align complete sequences.
- For local alignments you only align part of the sequence.
Best suited for pattern matching
22Alignment
- In order to maximise an alignment, gaps have to be introduced.

Chymo  QDKTGFHFCGGSLINENWVVTAAHCGVTTS-DVVVAGEFDQGSSSEKIQKLKIAKVFKNS
Fac X  INEENEGFCGGTILSEFYILTAAHCLYQAKRFKVRVGDRNTEQEEGGEAVHEVEVVIKHN

Chymo  KYNSLTINNDITLLKLSTAASFSQTVSAVCLPS---ASDDFAAGTTCVTTGWGLTRYTNA
Fac X  RFTKETYDFDIAVLRLKTPITFRMNVAPACLPERDWAESTLMTQKTGIVSGFGRTHEKGR

Chymo  NTPDRLQQASLPLLSNTNCKKYWGTKIKDAMICAGASG--VSSCMGDSGGPLVCKKNGAW
Fac X  QS-TRLKMLEVPYVDRNSCKLSSSFIITQNMFCAGYDTKQEDACQGDSGGPHVTRFKDTY

Chymo  TLVGIVSWGSSTCSTSTPGVYARVTALVNWVQQTLAAN-------------------
Fac X  FVTGIVSWGESCARKGKYGIYTKVTAFLKWIDRSMKTRGLPKAKSHAPEVITSSPLK

Bovine chymotrypsin / human factor X

Gap penalties are usually implemented as w_s = g + r x, where w_s is the gap penalty, g the gap opening penalty, r the gap extension penalty and x the extension length.
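The gap penalty formula w_s = g + r x can be sketched as a small function. The default values for `g` and `r` below are hypothetical and chosen only for illustration; the function names are likewise assumptions. Note that some programs charge g + r(x - 1) instead, so the first gap position is covered by the opening penalty alone.

```python
import re


def gap_penalty(x, g=10.0, r=1.0):
    """Penalty w_s = g + r * x for one gap of length x (g, r are example values)."""
    if x < 1:
        raise ValueError("gap length must be at least 1")
    return g + r * x


def total_gap_penalty(aligned_seq, g=10.0, r=1.0):
    """Sum the penalties of every run of '-' in one aligned sequence."""
    return sum(gap_penalty(len(run), g, r) for run in re.findall(r"-+", aligned_seq))


if __name__ == "__main__":
    # Fragment of the chymotrypsin line from the slide, containing one gap of length 3.
    print(total_gap_penalty("ASFSQTVSAVCLPS---ASDDFAAGTTCVTTGWGLTRYTNA"))
```

Making the opening penalty much larger than the extension penalty favours a few long gaps over many short ones, which fits the observation that insertions and deletions cluster in loops.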
23Alignment
- In the three-dimensional structure, gaps are usually located in peripheral loop structures.
Bovine chymotrypsin. Insertion and deletion positions (compared to Factor X) are in green.
24The twilight zone
After insertion of gaps, two random sequences can be expected to be 10-20% identical! Two sequences of length 100 residues that are 25% identical are likely to be genuinely related. If the identity is in the 15-25% region, it is worthwhile to do additional tests.
25Matrices
- Instead of just counting amino acid substitutions (e.g. conserved vs. non-conserved residues), we can use biological information.
- The PAM series of substitution matrices is based on global alignments of a limited set of closely related protein sequences (replacements are counted on the branches of a phylogenetic tree). PAM stands for Point Accepted Mutation.
- The BLOSUM series of substitution matrices is based on highly conserved regions in a series of alignments that are not allowed to contain gaps. A BLOSUMxx matrix is built from blocks with at most xx% identity.
26PAM250
C  12
G  -3   5
P  -3  -1   6
S   0   1   1   1
A  -2   1   1   1   2
T  -2   0   0   1   1   3
D  -5   1  -1   0   0   0   4
E  -5   0  -1   0   0   0   3   4
N  -4   0  -1   1   0   0   2   1   2
Q  -5  -1   0  -1   0  -1   2   2   1   4
H  -3  -2   0  -1  -1  -1   1   1   2   3   6
K  -5  -2  -1   0  -1   0   0   0   1   1   0   5
R  -4  -3   0   0  -2  -1  -1  -1   0   1   2   3   6
V  -2  -1  -1  -1   0   0  -2  -2  -2  -2  -2  -2  -2   4
M  -5  -3  -2  -2  -1  -1  -3  -2   0  -1  -2   0   0   2   6
I  -2  -3  -2  -1  -1   0  -2  -2  -2  -2  -2  -2  -2   4   2   5
L  -6  -4  -3  -3  -2  -2  -4  -3  -3  -2  -2  -3  -3   2   4   2   6
F  -4  -5  -5  -3  -4  -3  -6  -5  -4  -5  -2  -5  -4  -1   0   1   2   9
Y   0  -5  -5  -3  -3  -3  -4  -4  -2  -4   0  -4  -5  -2  -2  -1  -1   7  10
W  -8  -7  -6  -2  -6  -5  -7  -7  -4  -5  -3  -3   2  -6  -4  -5  -2   0   0  17
    C   G   P   S   A   T   D   E   N   Q   H   K   R   V   M   I   L   F   Y   W
27BLOSUM62
G   7
P  -2   9
D  -1  -1   7
E  -2   0   2   6
N   0  -2   2   0   6
H  -2  -2   0   0   1  10
Q  -2  -1   0   2   0   1   6
K  -2  -1   0   1   0  -1   1   5
R  -2  -2  -1   0   0   0   1   3   7
S   0  -1   0   0   1  -1   0  -1  -1   4
T  -2  -1  -1  -1   0  -2  -1  -1  -1   2   5
A   0  -1  -2  -1  -1  -2  -1  -1  -2   1   0   5
M  -2  -2  -3  -2  -2   0   0  -1  -1  -2  -1  -1   6
V  -3  -3  -3  -3  -3  -3  -3  -2  -2  -1   0   0   1   5
I  -4  -2  -4  -3  -2  -3  -2  -3  -3  -2  -1  -1   2   3   5
L  -3  -3  -3  -2  -3  -2  -2  -3  -2  -3  -1  -1   2   1   2   5
F  -3  -3  -4  -3  -2  -2  -4  -3  -2  -2  -1  -2   0   0   0   1   8
Y  -3  -3  -2  -2  -2   2  -1  -1  -1  -2  -1  -2   0  -1   0   0   3   8
W  -2  -3  -4  -3  -4  -3  -2  -2  -2  -4  -3  -2  -2  -3  -2  -2   1   3  15
C  -3  -4  -3  -3  -2  -3  -3  -3  -3  -1  -1  -1  -2  -1  -3  -2  -2  -3  -5  12
    G   P   D   E   N   H   Q   K   R   S   T   A   M   V   I   L   F   Y   W   C
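A substitution matrix is used by looking up a score for each aligned residue pair and summing. The sketch below assumes a lower-triangle dictionary layout and the helper names `substitution_score` and `ungapped_score`; the scores themselves are copied from the BLOSUM62 table on the slide, restricted to five residues (K, R, V, I, L) to keep the example short.

```python
# Lower-triangle entries copied from the slide's BLOSUM62 table (five residues only).
LOWER_TRIANGLE = {
    "K": {"K": 5},
    "R": {"K": 3, "R": 7},
    "V": {"K": -2, "R": -2, "V": 5},
    "I": {"K": -3, "R": -3, "V": 3, "I": 5},
    "L": {"K": -3, "R": -2, "V": 1, "I": 2, "L": 5},
}


def substitution_score(a, b, triangle=LOWER_TRIANGLE):
    """Look up a symmetric score from a lower-triangle table."""
    if b in triangle.get(a, {}):
        return triangle[a][b]
    if a in triangle.get(b, {}):
        return triangle[b][a]
    raise KeyError(f"no score for pair {a}/{b} in this reduced table")


def ungapped_score(seq_a, seq_b):
    """Sum the substitution scores of two ungapped sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(substitution_score(a, b) for a, b in zip(seq_a, seq_b))


if __name__ == "__main__":
    print(ungapped_score("KVL", "RIL"))  # conservative replacements still score positively
```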
28The genetic code
29Needleman-Wunsch
- The best global alignment can be found as the best path through a matrix of substitution scores for all residue pairs. The optimal path can be found by incremental extension of a subpath.
- The Smith-Waterman algorithm extends this to optimal local alignment.
- The main disadvantage is that finding the optimal alignment is very computationally intensive.
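The incremental extension of subpaths can be sketched as a dynamic programming table. This is a minimal illustration under stated assumptions: it uses a simple match/mismatch score (`simple_score`, hypothetical values) instead of PAM or BLOSUM, and a linear gap cost rather than separate opening and extension penalties.

```python
def simple_score(x, y, match=1, mismatch=-1):
    """Toy scoring function; a substitution matrix lookup could be used instead."""
    return match if x == y else mismatch


def needleman_wunsch(a, b, score=simple_score, gap=-2):
    """Global alignment with a linear gap cost. Returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # F[i][j] holds the best score for aligning a[:i] with b[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal path.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

The table has (n + 1) x (m + 1) cells and every cell is visited once, which is why the cost grows with the product of the two sequence lengths.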
30 Needleman-Wunsch alignment scheme
31Needleman-Wunsch
32Smith-Waterman
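A local alignment differs from the global scheme in two ways: cell scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The sketch below assumes a toy match/mismatch score and a linear gap cost; the function names are chosen for this example.

```python
def simple_score(x, y, match=1, mismatch=-1):
    """Toy scoring function; a substitution matrix lookup could be used instead."""
    return match if x == y else mismatch


def smith_waterman(a, b, score=simple_score, gap=-2):
    """Local alignment with a linear gap cost. Returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(0,
                          H[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero-scoring cell is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(smith_waterman("TTTACGTTT", "GGACGGG"))
```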
33FastA - BLAST
- Present-day homology search methods are all word-based. The idea is that a true relationship will have at least one word (two or more residues) in common.
- FastA uses the parameter ktup for the number of identical residues that form the basis of a local alignment.
- BLAST has extended this to neighborhood words. For a given word size W, the score has to be at least T using a substitution matrix. W is usually kept constant while varying T. The original implementation did not allow for gaps, but gapped alignment has been implemented from v. 2.0.
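The neighborhood-word idea can be sketched by enumerating every word of size W and keeping those that score at least T against a query word. This is an illustration only, not BLAST's actual implementation: it uses a five-residue alphabet and the corresponding scores from the slide's BLOSUM62 table, and the function names are assumptions.

```python
from itertools import product

# Scores copied from the slide's BLOSUM62 table, restricted to five residues.
TRIANGLE = {
    "K": {"K": 5},
    "R": {"K": 3, "R": 7},
    "V": {"K": -2, "R": -2, "V": 5},
    "I": {"K": -3, "R": -3, "V": 3, "I": 5},
    "L": {"K": -3, "R": -2, "V": 1, "I": 2, "L": 5},
}
ALPHABET = "KRVIL"  # reduced alphabet, for illustration only


def pair_score(a, b):
    """Symmetric lookup in the lower-triangle table."""
    return TRIANGLE[a][b] if b in TRIANGLE[a] else TRIANGLE[b][a]


def query_words(query, w):
    """All overlapping words of size w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def neighborhood(word, threshold, alphabet=ALPHABET):
    """Words of the same size scoring at least `threshold` against `word`."""
    hits = []
    for candidate in product(alphabet, repeat=len(word)):
        total = sum(pair_score(x, y) for x, y in zip(word, candidate))
        if total >= threshold:
            hits.append(("".join(candidate), total))
    return sorted(hits, key=lambda hit: -hit[1])


if __name__ == "__main__":
    for word in query_words("KRVIL", 2):
        print(word, neighborhood(word, threshold=8))
```

Lowering T enlarges each neighborhood, which makes the search more sensitive but produces more seeds to examine.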
34Alignments - seeding
Smith-Waterman
Word seeding
35Clustering and extension
Word clusters Words on the same diagonal
Extending words
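Word seeding and clustering of words on the same diagonal can be sketched as follows. The sketch indexes exact k-tuples (in the spirit of the ktup parameter) and groups hits by the diagonal i - j; the function names and the default `k=2` are assumptions for the example, and the extension step is left out.

```python
from collections import defaultdict


def word_hits(query, subject, k=2):
    """Return (i, j) positions where the k-tuple at query[i] equals the one at subject[j]."""
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)
    hits = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            hits.append((i, j))
    return hits


def hits_by_diagonal(hits):
    """Group hits by diagonal number i - j; hits on one diagonal need no gap between them."""
    diagonals = defaultdict(list)
    for i, j in hits:
        diagonals[i - j].append((i, j))
    return dict(diagonals)


def best_diagonal(hits):
    """Diagonal carrying the most word hits, or None when there are no hits."""
    diagonals = hits_by_diagonal(hits)
    if not diagonals:
        return None
    return max(diagonals, key=lambda d: len(diagonals[d]))


if __name__ == "__main__":
    hits = word_hits("ACDEFG", "XXACDEFG")
    print(hits_by_diagonal(hits))
```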
36Multiple alignments
Rat_CALRETICULIN    ----MLLSVPLLLGLLGLAAAD----------------------------------PAIYFKEQFLDGDAWTNR---------WVESKHKSD--FGKFVL
Human_CALRETICULIN  ----MLLSVPLLLGLLGLAVAE----------------------------------PAVYFKEQFLDGDGWTSR---------WIESKHKSD--FGKFVL
RAT_CALNEXIN        MEGKWLLCLLLVLGTAAIQAHDGHDDDMIDIEDDLDDVIEEVEDSKSKSDTSTPPSPKVTYKAPVPTGEVYFADSFDRGSLSGWILSKAKKDDTDDEIAK
Human_CALNEXIN      MEGKWLLCMLLVLGTAIVEAHDGHDDDVIDIEDDLDDVIEEVEDSKPDT-TAPPSSPKVTYKAPVPTGEVYFADSFDRGTLSGWILSKAKKDDTDDEIAK
Prim.cons.          MEGK2LL2V2L2LG22GLAA2DGHDDD2IDIEDDLDDVIEEVEDSK222DT22P2SP2V22K22222G2V22A2SFDRG2LSGWI2SK2K2DDT222222

Rat_CALRETICULIN    SSGKFYGDQEK------DKGLQTSQDARFYALSARF-EPFSNKGQTLVVQFTVKHEQNIDCGGGYVKLFPGG--LDQKDMHGDSEYNIMFGPDICGPGTK
Human_CALRETICULIN  SSGKFYGDEEK------DKGLQTSQDARFYALSASF-EPFSNKGQTLVVQFTVKHEQNIDCGGGYVKLFPNS--LDQTDMHGDSEYNIMFGPDICGPGTK
RAT_CALNEXIN        YDGKWEVDEMKETKLPGDKGLVLMSRAKHHAISAKLNKPFLFDTKPLIVQYEVNFQNGIECGGAYVKLLSKTSELNLDQFHDKTPYTIMFGPDKCG-EDY
Human_CALNEXIN      YDGKWEVEEMKESKLPGDKGLVLMSRAKHHAISAKLNKPFLFDTKPLIVQYEVNFQNGIECGGAYVKLLSKTPELNLDQFHDKTPYTIMFGPDKCG-EDY
Prim.cons.          22GK222DE2KE2KLPGDKGL22222A222A2SAK2N2PF222222L2VQ22V22222I2CGG2YVKL22KT2EL22D22H2222Y2IMFGPD2CGP222

Rat_CALRETICULIN    KVHVIFNYKGKNVLINKDIRCK----------DDEFTHLYTLIVRPDNTYEVKIDNSQVESGSLEDDWD--FLPPKKIKDPDAAKPEDWDERAKIDDPTD
Human_CALRETICULIN  KVHVIFNYKGKNVLINKDIRCK----------DDEFTHLYTLIVRPDNTYEVKIDNSQVESGSLEDDWD--FLPPKKIKDPDASKPEDWDERAKIDDPTD
RAT_CALNEXIN        KLHFIFRHKNPKTGVYEEKHAKRPDADLKTYFTDKKTHLYTLILNPDNSFEILVDQSVVNSGNLLNDMTPPVNPSREIEDPEDRKPEDWDERPKIADPDA
Human_CALNEXIN      KLHFIFRHKNPKTGIYEEKHAKRPDADLKTYFTDKKTHLYTLILNPDNSFEILVDQSVVNSGNLLNDMTPPVNPSREIEDPEDRKPEDWDERPKIPDPEA
Prim.cons.          K2H2IF22K22222I222222KRPDADLKTYF2D22THLYTLI22PDN22E222D2S2V2SG2L22D22PP22P222I2DP22RKPEDWDER2KIDDPT2

Rat_CALRETICULIN    SKPEDWDK---------------------PEHIPDPDAKKPEDWDEEMDGEWEP-------------------PVIQNPEYKGEWKPRQIDNPDYKGTWI
Human_CALRETICULIN  SKPEDWDK---------------------PEHIPDPDAKKPEDWDEEMDGEWEP-------------------PVIQNPEYKGEWKPRQIDNPDYKGTWI
RAT_CALNEXIN        VKPDDWDEDAPSKIPDEEATKPEGWLDDEPEYIPDPDAEKPEDWDEDMDGEWEAPQIANPKCESAPGCGVWQRPMIDNPNYKGKWKPPMIDNPNYQGIWK
Human_CALNEXIN      VKPDDWDEDAPAKIPDEEATKPEGWLDDEPEYVPDPDAEKPEDWDEDMDGEWEAPQIANPRCESAPGCGVWQRPVIDNPNYKGKWKPPMIDNPSYQGIWK
Prim.cons.          2KP2DWD2DAP2KIPDEEATKPEGWLDDEPE2IPDPDA2KPEDWDE2MDGEWE2PQIANP2CESAPGCGVWQRPVI2NP2YKG2WKP22IDNPDY2G2W2
37Implementing a search algorithm
- Sensitivity: you don't want to miss anything.
- Selectivity: you don't want any false positives.
- Speed: you don't want to wait.
38BLAST
- BLAST: Basic Local Alignment Search Tool
- Although the program handles both nucleotide and protein sequences, you should translate to protein sequence if you have a coding region.
- Always use the longest sequence possible.
- Concentrate on regions having conserved residues: Trp, Cys, Phe, Tyr, Pro.
- You cannot expect to locate a sequence in the database if your search sequence is very short (... residues).
- If not successful, try varying the parameters.
39BLAST
- You start out by selecting the program to run (old style):
- BLASTP: protein against protein
- BLASTN: nucleotide against nucleotide
- BLASTX: nucleotide query translated in all six frames against protein
- TBLASTN: protein sequence against nucleotide database translated in all six frames
- TBLASTX: six-frame translation of query sequence against six-frame translation of nucleotide database
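The frame translation that the translated BLAST programs rely on can be sketched as follows. This uses the standard genetic code written as a compact 64-letter string (codons ordered T, C, A, G at each position); the function names and the frame labels `+1` to `-3` are assumptions for the example, and unknown codons are shown as `X`.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG; '*' marks a stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from the start of the string; unknown codons become X."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Translate three frames on each strand; keys are '+1'..'+3' and '-1'..'-3'."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frame_translation("ATGGCCTAA").items():
        print(name, protein)
```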
40BLAST
41BLAST
42BLAST
43BLAST parameters
- The Expect threshold is the cut-off value for expected scores (E-values). Select a high value (10000) when searching for low similarity (e.g. short sequences) and a low value (10) when searching for high similarity (long sequences).
- Word size (2 or 3) is the length of the sequence words chosen for the initial comparison (word size W). For short sequences choose 2.
44BLAST parameters
- The Scoring matrix can be changed to search for closely matching or diverging sequences. BLOSUM90 has been calculated for analysing very similar sequences, while BLOSUM30 is for highly diverging sequences. Low PAM values are for higher degrees of similarity.
- The Gap Costs fine-tune the search for divergent sequences when you have a long search sequence. High values give a more stringent search. Use high values for short sequences and low values for long sequences.
- Compositional adjustments change the scoring matrix to reflect the composition of the sequences to be compared.
45BLAST parameters
- Filtering will remove low-complexity regions (regions in which some residues are overrepresented) and repetitive elements (like Alu repeats in nucleotide sequences). It should be turned off for short sequences.
- Mask for lookup: extensions of the initial hit are not removed by filtering.
- Mask lower case: a sequence in upper case can be marked for filtering by changing part of it to lower case.
46BLAST
47Output options
48Waiting for results
49Results
- The graphical viewer displays hit sequences in
color lines corresponding to score and position
in sequence. Mouse-over displays target sequence
name, and mouse click displays alignment.
50Results
- The results are listed with the highest scores at the top.
- Each line contains the accession number followed by the name of the protein, the score (higher is better) and the E-value (Expect value: the number of expected occurrences of a sequence with the given composition in the requested database).
- The E-value should be as low as possible (...).
51Results
- One or more HSPs (High-scoring Segment Pairs) are shown after the score list for each protein. Query: input sequence. Sbjct: database sequence. Similar residues are marked by + and are counted as part of the positives.
52Result distance tree
53BLAST distance tree