Pairwise alignment presentation

Title: Pairwise alignment

1
Pairwise alignment
2
Sequence to function
3
Many biosequences are related
Of mice and men - myoglobin
Human  GLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLK
Mouse  GLSDGEWQLVLNVWGKVEADLAGHGQEVLIGLFKTHPETLDKFDKFKNLK

Human  SEDEMKASEDLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIP
Mouse  SEEDMKGSEDLKKHGCTVLTALGTILKKKGQHAAEIQPLAQSHATKHKIP

Human  VKYLEFISECIIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYKELGFQG
Mouse  VKYLEFISEIIIEVLKKRHSGDFGADAQGAMSKALELFRNDIAAKYKELGFQG
4
What is homology?

In biological terms, homology means evolutionarily related.
In biosequence terms, homologous sequences are similar.
We infer homology by comparing biosequences in alignments.

5
Why look for homology?

If two sequences are homologous at the sequence level, what can we conclude?
The three-dimensional structures will be similar.
The biological functions are very likely to be related, but not always.
Multiple alignments of homologous sequences can help identify residues that are critical for structure or function.

6
Evolution

Evolution is driven by mutations in genes -
changes in proteins.
Changes in single bases
Changes in multiple bases
Addition/deletion of multiple bases
Moving of part of a gene
Copying part or all of a gene
We are looking at chemical evolution

7
Sequence differences
Deduction
Evolution
8
Orthologs vs. paralogs
Paralogs
Orthologs
9
Origins of genes with similar sequence
Paralogs
Orthologs
Analogous
Xenologous
10
Modular proteins

Most proteins larger than 20 kDa are built from
modules

C-type lectin
Fibronectin
EGF domains
11
Residue distribution in 3D

Structures are hydrophilic on the outside and
hydrophobic on the inside

Figure legend (chymotrypsin): hydrophobic, small, acid/amide, basic, active site
12
Homology search methods

Dot-plot: simple, graphical, visual
Dynamic programming: Needleman-Wunsch, Smith-Waterman
k-tuple programs: FastA, BLAST

Alignment matrices: PAM250, BLOSUM62

13
Dot-plot
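[Figure: dot-plot with the sequence A D K F H K E A C T E along one axis and a second short peptide along the other; an X marks each cell where the two residues match]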
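The dot-plot idea can be sketched in a few lines of Python. This is only an illustration: the horizontal sequence is the one shown on the slide, while the vertical peptide `DKAKEA` and the `window` and `min_matches` parameters are assumptions made for the example, not values taken from the presentation.

```python
def dot_plot(seq_x, seq_y, window=1, min_matches=1):
    """Return the plot as a list of strings, one row per position in seq_y.

    A cell is marked 'X' when at least `min_matches` of the `window`
    residues starting at that cell are identical in both sequences.
    """
    rows = []
    for i in range(len(seq_y) - window + 1):
        row = []
        for j in range(len(seq_x) - window + 1):
            matches = sum(seq_y[i + k] == seq_x[j + k] for k in range(window))
            row.append("X" if matches >= min_matches else ".")
        rows.append("".join(row))
    return rows


if __name__ == "__main__":
    seq_x = "ADKFHKEACTE"  # horizontal sequence from the slide
    seq_y = "DKAKEA"       # hypothetical second peptide (assumption)
    print("  " + seq_x)
    for residue, row in zip(seq_y, dot_plot(seq_x, seq_y)):
        print(residue + " " + row)
```

Raising `window` and `min_matches` together removes isolated single-residue matches, so that only diagonal runs remain visible.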
14
Dot-plot
X axis: human Factor X. Y axis: human Factor IX (Christmas factor).
Panels: direct comparison, no matrix; direct comparison, compare 2
15
DotLet
Distribution of scores of all residue pairs
Log of distribution
Protein C vs factor X
16
DotLet - optimizing
17
DotLet low e-score
MBL vs. Immunolectin Bombyx mori
18
DotLet - optimizing
19
Alignment

An alignment is the best fit between two or more biosequences.
Proteins having close relationships can be identified by comparing identical residues.

SEDEMKASEDLKKHGATVLTALGGILKKKGHHEA
SEEDMKGSEDLKKHGCTVLTALGTILKKKGQHAA
Human and murine myoglobin
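Counting identical residues is easy to automate. The sketch below uses the two myoglobin fragments from this slide; the function name `percent_identity` and the choice to skip columns containing a gap character are assumptions of the example.

```python
def percent_identity(seq_a, seq_b):
    """Return (identical, compared, percent) for two aligned sequences.

    Columns where either sequence has a gap ('-') are not compared.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not columns:
        return 0, 0, 0.0
    identical = sum(a == b for a, b in columns)
    return identical, len(columns), 100.0 * identical / len(columns)


if __name__ == "__main__":
    human = "SEDEMKASEDLKKHGATVLTALGGILKKKGHHEA"
    mouse = "SEEDMKGSEDLKKHGCTVLTALGTILKKKGQHAA"
    identical, compared, pct = percent_identity(human, mouse)
    print(f"{identical}/{compared} identical ({pct:.1f}%)")
```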

20
Alignment - similarities

When relationships are more distant you introduce similarities.

Bovine  GVTTSDVVVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSL
Cod     NVKNYHRVVLGEHDRSSNSEGVQVMTVGQVFKHPRYNGF

When doing the comparison by hand you often use chemical similarities: L/I/V, F/Y/W, R/K, E/D/Q/N, G/A
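The hand comparison with chemical similarity groups can be mimicked as follows. The groups are the ones listed on the slide; the marker characters (`|` for identical, `.` for similar) are an assumption of this sketch and not necessarily the symbols used in the original figure.

```python
# Similarity groups as listed on the slide.
SIMILARITY_GROUPS = ["LIV", "FYW", "RK", "EDQN", "GA"]


def classify(a, b):
    """Return '|' for identical, '.' for chemically similar, ' ' otherwise."""
    if a == "-" or b == "-":
        return " "  # gap columns are never counted as matches
    if a == b:
        return "|"
    for group in SIMILARITY_GROUPS:
        if a in group and b in group:
            return "."
    return " "


def match_line(seq_a, seq_b):
    """Build the marker line to print between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return "".join(classify(a, b) for a, b in zip(seq_a, seq_b))


if __name__ == "__main__":
    bovine = "GVTTSDVVVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSL"
    cod = "NVKNYHRVVLGEHDRSSNSEGVQVMTVGQVFKHPRYNGF"
    print(bovine)
    print(match_line(bovine, cod))
    print(cod)
```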

21
Global/local alignment

For global alignments you try to align complete
sequences
For local alignments you only align part of the
sequence

Best suited for pattern matching
22
Alignment

In order to maximise an alignment, gaps have to be introduced.

Chymo  QDKTGFHFCGGSLINENWVVTAAHCGVTTS-DVVVAGEFDQGSSSEKIQKLKIAKVFKNS
Fac X  INEENEGFCGGTILSEFYILTAAHCLYQAKRFKVRVGDRNTEQEEGGEAVHEVEVVIKHN

Chymo  KYNSLTINNDITLLKLSTAASFSQTVSAVCLPS---ASDDFAAGTTCVTTGWGLTRYTNA
Fac X  RFTKETYDFDIAVLRLKTPITFRMNVAPACLPERDWAESTLMTQKTGIVSGFGRTHEKGR

Chymo  NTPDRLQQASLPLLSNTNCKKYWGTKIKDAMICAGASG--VSSCMGDSGGPLVCKKNGAW
Fac X  QS-TRLKMLEVPYVDRNSCKLSSSFIITQNMFCAGYDTKQEDACQGDSGGPHVTRFKDTY

Chymo  TLVGIVSWGSSTCSTSTPGVYARVTALVNWVQQTLAAN-------------------
Fac X  FVTGIVSWGESCARKGKYGIYTKVTAFLKWIDRSMKTRGLPKAKSHAPEVITSSPLK

Gap penalties are usually implemented as ws = g + rx, where
ws = gap penalty, g = gap opening, r = gap extension, x = extension length.

Bovine chymotrypsin / human factor X
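A minimal sketch of a gap penalty with an opening and an extension term, read as ws = g + r·x. The default values `g=10.0` and `r=1.0` are placeholders chosen for the example, not parameters recommended in the presentation, and the helper names are hypothetical.

```python
import re


def gap_penalty(x, g=10.0, r=1.0):
    """Penalty ws = g + r * x for a single gap of length x (0 if no gap)."""
    if x < 0:
        raise ValueError("gap length cannot be negative")
    return 0.0 if x == 0 else g + r * x


def gap_lengths(aligned_row):
    """Lengths of all runs of '-' in one row of an alignment."""
    return [len(run) for run in re.findall(r"-+", aligned_row)]


def total_gap_penalty(aligned_row, g=10.0, r=1.0):
    """Sum the penalties of every gap in an aligned row."""
    return sum(gap_penalty(x, g, r) for x in gap_lengths(aligned_row))


if __name__ == "__main__":
    row = "KYNSLTINNDITLLKLSTAASFSQTVSAVCLPS---ASDDFAAGTTCVTTGWGLTRYTNA"
    print(gap_lengths(row), total_gap_penalty(row))
```

Because opening a gap costs more than extending one, a single long gap is penalised less than several short gaps with the same total length.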
23
Alignment

In the three-dimensional structure, gaps are usually located in peripheral loop structures.

Bovine chymotrypsin. Insertion and deletion positions (compared to Factor X) are in green.
24
The twilight zone
After insertion of gaps, two random sequences can be expected to be 10-20% identical! Two sequences of length 100 residues that are 25% identical are likely to be genuinely related. If the identity is in the 15-25% region, it is worthwhile to do additional tests.
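As a rough illustration, the rule of thumb above can be written as a small helper. The cut-offs are the ones quoted on the slide and apply to sequences of about 100 residues; the function name and the wording of the returned labels are assumptions of this sketch.

```python
def interpret_identity(percent_identical):
    """Apply the slide's rule of thumb to a percent identity value."""
    if not 0 <= percent_identical <= 100:
        raise ValueError("percent identity must be between 0 and 100")
    if percent_identical >= 25:
        return "likely to be genuinely related"
    if percent_identical >= 15:
        return "twilight zone: do additional tests"
    return "no more similar than random sequences can be"


if __name__ == "__main__":
    for value in (12, 18, 40):
        print(value, interpret_identity(value))
```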
25
Matrices

Instead of just counting amino acid substitutions (e.g. conserved vs. non-conserved residues) we can use biological information.
The PAM series of substitution matrices is based on global alignments of a limited set of closely related protein sequences (replacements are counted on the branches of a phylogenetic tree). PAM stands for Point Accepted Mutation.
The BLOSUM series of substitution matrices is based on highly conserved regions (blocks) in a series of alignments that are not allowed to contain gaps. A BLOSUMxx matrix is built from blocks with at most xx% identity.

26
PAM250
C   12
G   -3   5
P   -3  -1   6
S    0   1   1   1
A   -2   1   1   1   2
T   -2   0   0   1   1   3
D   -5   1  -1   0   0   0   4
E   -5   0  -1   0   0   0   3   4
N   -4   0  -1   1   0   0   2   1   2
Q   -5  -1   0  -1   0  -1   2   2   1   4
H   -3  -2   0  -1  -1  -1   1   1   2   3   6
K   -5  -2  -1   0  -1   0   0   0   1   1   0   5
R   -4  -3   0   0  -2  -1  -1  -1   0   1   2   3   6
V   -2  -1  -1  -1   0   0  -2  -2  -2  -2  -2  -2  -2   4
M   -5  -3  -2  -2  -1  -1  -3  -2   0  -1  -2   0   0   2   6
I   -2  -3  -2  -1  -1   0  -2  -2  -2  -2  -2  -2  -2   4   2   5
L   -6  -4  -3  -3  -2  -2  -4  -3  -3  -2  -2  -3  -3   2   4   2   6
F   -4  -5  -5  -3  -4  -3  -6  -5  -4  -5  -2  -5  -4  -1   0   1   2   9
Y    0  -5  -5  -3  -3  -3  -4  -4  -2  -4   0  -4  -5  -2  -2  -1  -1   7  10
W   -8  -7  -6  -2  -6  -5  -7  -7  -4  -5  -3  -3   2  -6  -4  -5  -2   0   0  17
     C   G   P   S   A   T   D   E   N   Q   H   K   R   V   M   I   L   F   Y   W
27
BLOSUM62
G    7
P   -2   9
D   -1  -1   7
E   -2   0   2   6
N    0  -2   2   0   6
H   -2  -2   0   0   1  10
Q   -2  -1   0   2   0   1   6
K   -2  -1   0   1   0  -1   1   5
R   -2  -2  -1   0   0   0   1   3   7
S    0  -1   0   0   1  -1   0  -1  -1   4
T   -2  -1  -1  -1   0  -2  -1  -1  -1   2   5
A    0  -1  -2  -1  -1  -2  -1  -1  -2   1   0   5
M   -2  -2  -3  -2  -2   0   0  -1  -1  -2  -1  -1   6
V   -3  -3  -3  -3  -3  -3  -3  -2  -2  -1   0   0   1   5
I   -4  -2  -4  -3  -2  -3  -2  -3  -3  -2  -1  -1   2   3   5
L   -3  -3  -3  -2  -3  -2  -2  -3  -2  -3  -1  -1   2   1   2   5
F   -3  -3  -4  -3  -2  -2  -4  -3  -2  -2  -1  -2   0   0   0   1   8
Y   -3  -3  -2  -2  -2   2  -1  -1  -1  -2  -1  -2   0  -1   0   0   3   8
W   -2  -3  -4  -3  -4  -3  -2  -2  -2  -4  -3  -2  -2  -3  -2  -2   1   3  15
C   -3  -4  -3  -3  -2  -3  -3  -3  -3  -1  -1  -1  -2  -1  -3  -2  -2  -3  -5  12
     G   P   D   E   N   H   Q   K   R   S   T   A   M   V   I   L   F   Y   W   C
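To show how such a matrix is used, the sketch below scores an ungapped alignment by summing one matrix entry per column. Only the handful of entries needed for the example is included, copied from the BLOSUM62 values printed on this slide; the dictionary layout and function names are assumptions of the example.

```python
# A few entries copied from the BLOSUM62 slide (symmetric, stored once).
SCORES = {
    ("S", "S"): 4, ("E", "E"): 6, ("D", "D"): 7, ("M", "M"): 6,
    ("K", "K"): 5, ("A", "A"): 5, ("G", "G"): 7,
    ("D", "E"): 2, ("A", "G"): 0,
}


def pair_score(a, b):
    """Look up the score of a residue pair in either order."""
    if (a, b) in SCORES:
        return SCORES[(a, b)]
    if (b, a) in SCORES:
        return SCORES[(b, a)]
    raise KeyError(f"no score stored for pair {a}/{b}")


def ungapped_score(seq_a, seq_b):
    """Sum the substitution scores over all columns of an ungapped alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(seq_a, seq_b))


if __name__ == "__main__":
    # First eight residues of the myoglobin fragments used earlier.
    print(ungapped_score("SEDEMKAS", "SEEDMKGS"))
```

With a full 20 x 20 matrix the same lookup works for any pair of protein sequences; conservative replacements such as D/E still contribute a positive score.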
28
The genetic code
29
Needleman-Wunsch

The best global alignment can be found as the best path through a matrix of pairwise residue scores. The optimal path can be found by incremental extension of a subpath.
The Smith-Waterman algorithm extends this to optimal local alignment.
The main disadvantage is that finding the optimal alignment is computationally very intensive.
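The incremental path extension can be sketched as a small dynamic programming routine. This is a simplified illustration and not the presenter's implementation: it uses a match/mismatch score instead of a substitution matrix and a linear gap cost, and the default values of `match`, `mismatch` and `gap` are assumptions.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,  # align the two residues
                              score[i - 1][j] + gap,    # gap in b
                              score[i][j - 1] + gap)    # gap in a
    # Trace back from the bottom-right corner to recover one optimal path.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

The table has (n+1) x (m+1) cells, which is why the time and memory needed grow with the product of the two sequence lengths.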

30

Needleman-Wunsch alignment scheme
31
Needleman-Wunsch
32
Smith-Waterman
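A minimal sketch of the local variant: cell scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell rather than in the corner. The scoring values (`match=2`, `mismatch=-1`, `gap=-2`) and the test strings are assumptions chosen for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    h = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(0,                      # start a new local alignment
                          h[i - 1][j - 1] + s,    # align the two residues
                          h[i - 1][j] + gap,      # gap in b
                          h[i][j - 1] + gap)      # gap in a
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Trace back from the best cell until a zero cell is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and h[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("XXHELLOYY", "ZZHELLOWW"))
```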
33
FastA - BLAST

Present-day homology search methods are all word-based. The idea is that a true relationship will have at least one word (two or more residues) in common.
FastA uses the parameter ktup for the number of identical residues that form the basis of a local alignment.
BLAST has extended this to neighborhood words. For a given word size W, the score of a word pair has to be at least T using a substitution matrix. W is usually kept constant while T is varied. The original implementation did not allow for gaps, but gapped alignment has been implemented from v. 2.0.
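The word-based idea can be illustrated with an exact k-tuple lookup in the spirit of FastA's ktup parameter. This sketch only finds identical words; BLAST's neighborhood words, which also accept similar words scoring at least T, are not implemented here. Function names and the example sequences are assumptions.

```python
from collections import defaultdict


def word_index(seq, k):
    """Map every word of length k to the positions where it starts in seq."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    return index


def shared_words(query, subject, k=2):
    """Return (query_pos, subject_pos, word) for every identical k-tuple."""
    subject_index = word_index(subject, k)
    hits = []
    for q_pos in range(len(query) - k + 1):
        word = query[q_pos:q_pos + k]
        for s_pos in subject_index.get(word, []):
            hits.append((q_pos, s_pos, word))
    return hits


if __name__ == "__main__":
    for hit in shared_words("SEDEMKAS", "SEEDMKGS", k=2):
        print(hit)
```

Indexing one sequence first means each query word is looked up in constant time, which is what makes word seeding so much faster than filling a full dynamic programming table.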

34
Alignments - seeding
Smith-Waterman
Word seeding
35
Clustering and extension
Word clusters Words on the same diagonal
Extending words
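A sketch of the two steps named on this slide: grouping word hits by diagonal and extending a seed without gaps. The stopping rule (give up once the running score has dropped more than `x_drop` below the best score seen) is a common heuristic and is used here as an assumption, as are the scoring values and the example strings.

```python
from collections import defaultdict


def group_by_diagonal(hits):
    """Group (query_pos, subject_pos) hits by diagonal = subject_pos - query_pos."""
    diagonals = defaultdict(list)
    for q_pos, s_pos in hits:
        diagonals[s_pos - q_pos].append((q_pos, s_pos))
    return dict(diagonals)


def extend_seed(query, subject, q_pos, s_pos, k,
                match=1, mismatch=-1, x_drop=2):
    """Extend an exact k-residue seed in both directions without gaps.

    Returns (score, query_segment, subject_segment).
    """
    best = run = k * match
    best_right = best_left = 0
    # Extend to the right of the seed.
    i = 0
    while q_pos + k + i < len(query) and s_pos + k + i < len(subject):
        same = query[q_pos + k + i] == subject[s_pos + k + i]
        run += match if same else mismatch
        i += 1
        if run > best:
            best, best_right = run, i
        elif best - run > x_drop:
            break
    # Extend to the left, starting again from the best score so far.
    run = best
    i = 0
    while q_pos - 1 - i >= 0 and s_pos - 1 - i >= 0:
        same = query[q_pos - 1 - i] == subject[s_pos - 1 - i]
        run += match if same else mismatch
        i += 1
        if run > best:
            best, best_left = run, i
        elif best - run > x_drop:
            break
    q_segment = query[q_pos - best_left:q_pos + k + best_right]
    s_segment = subject[s_pos - best_left:s_pos + k + best_right]
    return best, q_segment, s_segment


if __name__ == "__main__":
    print(group_by_diagonal([(0, 0), (1, 2), (4, 4)]))
    print(extend_seed("AAHELLOWORLDBB", "CCHELLOWORLDDD", 4, 4, 2))
```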
36
Multiple alignments
Rat_CALRETICULIN    ----MLLSVPLLLGLLGLAAAD----------------------------------PAIYFKEQFLDGDAWTNR---------WVESKHKSD--FGKFVL
Human_CALRETICULIN  ----MLLSVPLLLGLLGLAVAE----------------------------------PAVYFKEQFLDGDGWTSR---------WIESKHKSD--FGKFVL
RAT_CALNEXIN        MEGKWLLCLLLVLGTAAIQAHDGHDDDMIDIEDDLDDVIEEVEDSKSKSDTSTPPSPKVTYKAPVPTGEVYFADSFDRGSLSGWILSKAKKDDTDDEIAK
Human_CALNEXIN      MEGKWLLCMLLVLGTAIVEAHDGHDDDVIDIEDDLDDVIEEVEDSKPDT-TAPPSSPKVTYKAPVPTGEVYFADSFDRGTLSGWILSKAKKDDTDDEIAK
Prim.cons.          MEGK2LL2V2L2LG22GLAA2DGHDDD2IDIEDDLDDVIEEVEDSK222DT22P2SP2V22K22222G2V22A2SFDRG2LSGWI2SK2K2DDT222222

Rat_CALRETICULIN    SSGKFYGDQEK------DKGLQTSQDARFYALSARF-EPFSNKGQTLVVQFTVKHEQNIDCGGGYVKLFPGG--LDQKDMHGDSEYNIMFGPDICGPGTK
Human_CALRETICULIN  SSGKFYGDEEK------DKGLQTSQDARFYALSASF-EPFSNKGQTLVVQFTVKHEQNIDCGGGYVKLFPNS--LDQTDMHGDSEYNIMFGPDICGPGTK
RAT_CALNEXIN        YDGKWEVDEMKETKLPGDKGLVLMSRAKHHAISAKLNKPFLFDTKPLIVQYEVNFQNGIECGGAYVKLLSKTSELNLDQFHDKTPYTIMFGPDKCG-EDY
Human_CALNEXIN      YDGKWEVEEMKESKLPGDKGLVLMSRAKHHAISAKLNKPFLFDTKPLIVQYEVNFQNGIECGGAYVKLLSKTPELNLDQFHDKTPYTIMFGPDKCG-EDY
Prim.cons.          22GK222DE2KE2KLPGDKGL22222A222A2SAK2N2PF222222L2VQ22V22222I2CGG2YVKL22KT2EL22D22H2222Y2IMFGPD2CGP222

Rat_CALRETICULIN    KVHVIFNYKGKNVLINKDIRCK----------DDEFTHLYTLIVRPDNTYEVKIDNSQVESGSLEDDWD--FLPPKKIKDPDAAKPEDWDERAKIDDPTD
Human_CALRETICULIN  KVHVIFNYKGKNVLINKDIRCK----------DDEFTHLYTLIVRPDNTYEVKIDNSQVESGSLEDDWD--FLPPKKIKDPDASKPEDWDERAKIDDPTD
RAT_CALNEXIN        KLHFIFRHKNPKTGVYEEKHAKRPDADLKTYFTDKKTHLYTLILNPDNSFEILVDQSVVNSGNLLNDMTPPVNPSREIEDPEDRKPEDWDERPKIADPDA
Human_CALNEXIN      KLHFIFRHKNPKTGIYEEKHAKRPDADLKTYFTDKKTHLYTLILNPDNSFEILVDQSVVNSGNLLNDMTPPVNPSREIEDPEDRKPEDWDERPKIPDPEA
Prim.cons.          K2H2IF22K22222I222222KRPDADLKTYF2D22THLYTLI22PDN22E222D2S2V2SG2L22D22PP22P222I2DP22RKPEDWDER2KIDDPT2

Rat_CALRETICULIN    SKPEDWDK---------------------PEHIPDPDAKKPEDWDEEMDGEWEP-------------------PVIQNPEYKGEWKPRQIDNPDYKGTWI
Human_CALRETICULIN  SKPEDWDK---------------------PEHIPDPDAKKPEDWDEEMDGEWEP-------------------PVIQNPEYKGEWKPRQIDNPDYKGTWI
RAT_CALNEXIN        VKPDDWDEDAPSKIPDEEATKPEGWLDDEPEYIPDPDAEKPEDWDEDMDGEWEAPQIANPKCESAPGCGVWQRPMIDNPNYKGKWKPPMIDNPNYQGIWK
Human_CALNEXIN      VKPDDWDEDAPAKIPDEEATKPEGWLDDEPEYVPDPDAEKPEDWDEDMDGEWEAPQIANPRCESAPGCGVWQRPVIDNPNYKGKWKPPMIDNPSYQGIWK
Prim.cons.          2KP2DWD2DAP2KIPDEEATKPEGWLDDEPE2IPDPDA2KPEDWDE2MDGEWE2PQIANP2CESAPGCGVWQRPVI2NP2YKG2WKP22IDNPDY2G2W2

37
Implementing a search algorithm

Sensitivity: you don't want to miss anything.
Selectivity: you don't want any false positives.
Speed: you don't want to wait.

38
BLAST

BLAST: Basic Local Alignment Search Tool
Although the program handles both nucleotide and protein sequences, you should translate to protein sequence if you have a coding region.
Always use the longest sequence possible.
Concentrate on regions having conserved residues: Trp, Cys, Phe, Tyr, Pro.
You cannot expect to locate a sequence in the
database if your search sequence is very short
(residues.
If not successful, try varying the parameters.

39
BLAST

You start out by selecting the program to run (old style):
BLASTP: protein against protein
BLASTN: nucleotide against nucleotide
BLASTX: nucleotide query translated in all six reading frames against protein
TBLASTN: protein sequence against nucleotide database translated in all six reading frames
TBLASTX: six-frame translation of query sequence against six-frame translation of nucleotide database
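The translated searches compare conceptual translations of the nucleotide sequence; NCBI BLAST uses all six reading frames, three on each strand. The sketch below produces those six translations with the standard genetic code. The frame labels (`+1` to `-3`) and the use of `X` for codons containing unknown bases are conventions assumed for this example.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; '*' is a stop, 'X' an unknown codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return a dict mapping frame label ('+1'..'-3') to protein sequence."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frames("ATGGCCTAAGG").items():
        print(label, protein)
```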

40
BLAST
41
BLAST
42
BLAST
43
BLAST parameters

The Expect threshold is the cut-off value for the expected number of chance hits. Select a high value (10000) when searching for low similarity (e.g. short sequences) and a low value (10) when searching for high similarity (long sequences).
Word size (2 or 3) is the size of the sequence chosen for initial comparison (word size W). For short sequences choose 2.

44
BLAST parameters

The Scoring matrix can be changed to search for closely matching or diverging sequences. BLOSUM90 has been calculated for analysing very similar sequences, while BLOSUM30 is for highly diverging sequences. Low PAM values are for higher degrees of similarity.
The Gap Costs fine-tune the search for divergent sequences when you have a long search sequence. High values give a more stringent search. Use high values for short sequences and low values for long sequences.
Compositional adjustments change the scoring matrix to reflect the composition of the sequences to be compared.

45
BLAST parameters

Filtering will remove low-complexity regions (regions in which some residues are overrepresented) and repetitive elements (like Alu repeats in nucleotide sequences). It should be turned off for short sequences.
Mask for lookup: extensions of the initial hit are not removed by filtering.
Mask lower case: a sequence in upper case can be marked for filtering by changing part of it to lower case.

46
BLAST
47
Output options
48
Waiting for results
49
Results

The graphical viewer displays hit sequences as colored lines according to score and position in the sequence. Mouse-over displays the target sequence name, and a mouse click displays the alignment.

50
Results

The results are listed with the highest scores at
the top.
Each line contains the accession number followed by the name of the protein, the score (higher is better) and the E-value (Expect: the number of expected occurrences of a sequence with the given composition in the requested database).
The E-value should be as low as possible (
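For orientation, the E-value is related to the normalised bit score S' by E = m · n · 2^(-S'), where m is the query length and n the total database length. The sketch below applies that relation directly and, as a stated simplification, ignores the effective-length corrections that BLAST applies; the function name and the example numbers are assumptions.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E-value from a bit score: E = m * n * 2 ** (-S')."""
    if query_length <= 0 or database_length <= 0:
        raise ValueError("lengths must be positive")
    return query_length * database_length * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical search: 150-residue query against 50 million residues.
    for bits in (20, 40, 60):
        print(bits, expect_value(bits, 150, 50_000_000))
```

Each additional bit of score halves the E-value, and the same score becomes less significant as the database grows.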

51
Results

One or more HSPs (High-scoring Segment Pairs) are shown after the scoring list for each protein.
Query: input sequence
Sbjct: database sequence
Similar residues are marked by + and are counted as part of positives.

52
Result distance tree
53
BLAST distance tree
