Pairwise alignment - PowerPoint PPT Presentation

Provided by: peterh7
Title: Pairwise alignment


1
Pairwise alignment
2
Sequence to function
3
Many biosequences are related
Of mice and men - myoglobin
GLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLK
GLSDGEWQLVLNVWGKVEADLAGHGQEVLIGLFKTHPETLDKFDKFKNLK

SEDEMKASEDLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIP
SEEDMKGSEDLKKHGCTVLTALGTILKKKGQHAAEIQPLAQSHATKHKIP

VKYLEFISECIIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYKELGFQG
VKYLEFISEIIIEVLKKRHSGDFGADAQGAMSKALELFRNDIAAKYKELGFQG
4
What is homology?
  • In biological terms, homology means evolutionarily
    related
  • In biosequence terms, homologous sequences are
    similar
  • We infer homology by comparing biosequences using
    sequence comparisons (alignments)

5
Why look for homology?
  • If two sequences are homologous at the sequence
    level, what can we conclude?
  • Three-dimensional structure will be similar
  • Biological functions are very likely to be
    related, but not always
  • Multiple alignments of homologous sequences can
    help identify structurally or functionally
    critical residues

6
Evolution
  • Evolution is driven by mutations in genes, which
    lead to changes in proteins.
  • Changes in single bases
  • Changes in multiple bases
  • Addition/deletion of multiple bases
  • Moving of part of a gene
  • Copying part or all of a gene
  • We are looking at chemical evolution

7
Sequence differences
Deduction
Evolution
8
Orthologs vs. paralogs
Paralogs
Orthologs
9
Origins of genes with similar sequence
Paralogs
Orthologs
Analogous
Xenologous
10
Modular proteins
  • Most proteins larger than 20 kDa are built from
    modules

C-type lectin
Fibronectin
EGF domains
11
Residue distribution in 3D
  • Structures are hydrophilic on the outside and
    hydrophobic on the inside

Colour key: hydrophobic, small, acid/amide, basic, active site
Chymotrypsin
12
Homology search methods
  • Dot-plot: simple, graphical, visual
  • Dynamic programming: Needleman-Wunsch, Smith-Waterman
  • k-tuple programs: FastA, BLAST
  • Alignment
  • Matrices
  • PAM250
  • BLOSUM62

13
Dot-plot
[Figure: dot-plot grid with the sequence A D K F H K E A C T E
along one axis; an X marks each cell where the residues on the
two axes are identical.]
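
The idea behind the dot-plot can be sketched in a few lines of Python. This is a minimal illustration, not the method used by any particular program: the function name `dot_plot` is made up here, and the second sequence is a hypothetical example because the slide's vertical axis did not survive in the transcript.

```python
def dot_plot(seq_a: str, seq_b: str) -> list[str]:
    """Return one text row per residue of seq_b.

    An 'X' marks a cell where the residue of seq_a (columns) is
    identical to the residue of seq_b (rows); '.' marks a mismatch.
    """
    return [
        "".join("X" if a == b else "." for a in seq_a)
        for b in seq_b
    ]


if __name__ == "__main__":
    horizontal = "ADKFHKEACTE"  # residue string shown on the slide
    vertical = "DKAAKEA"        # hypothetical second sequence, for illustration only
    print("  " + horizontal)
    for residue, row in zip(vertical, dot_plot(horizontal, vertical)):
        print(residue + " " + row)
```

Related sequences show up as runs of X along a diagonal; isolated X marks are chance matches.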
14
Dot-plot
X axis: human Factor X. Y axis: human Factor IX (Christmas
factor).
Panels: direct comparison, no matrix; direct comparison,
compare 2.
15
DotLet
Distribution of scores of all residue pairs
Log of distribution
Protein C vs factor X
16
DotLet - optimizing
17
DotLet low e-score
MBL vs. Immunolectin Bombyx mori
18
DotLet - optimizing
19
Alignment
  • An alignment is the best fit between two or more
    biosequences.
  • Proteins having close relationships can be
    identified by comparing identical residues.

SEDEMKASEDLKKHGATVLTALGGILKKKGHHEA
SEEDMKGSEDLKKHGCTVLTALGTILKKKGQHAA
Human and murine myoglobin
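
Counting identical residues can be sketched as below. The function name `percent_identity` and the choice to skip columns that contain a gap are assumptions made for this illustration; programs differ in how they normalise identity.

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percentage of aligned columns in which both rows hold the same residue.

    Both rows must already be aligned (equal length). Columns containing
    a gap character '-' are skipped, which is one of several conventions.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    columns = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


if __name__ == "__main__":
    # The two myoglobin fragments shown on the slide.
    fragment_1 = "SEDEMKASEDLKKHGATVLTALGGILKKKGHHEA"
    fragment_2 = "SEEDMKGSEDLKKHGCTVLTALGTILKKKGQHAA"
    print(f"{percent_identity(fragment_1, fragment_2):.1f}% identical")
```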

20
Alignment - similarities
  • When relationships are more distant, you introduce
    similarities.

Bovine  GVTTSDVVVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSL
Cod     NVKNYHRVVLGEHDRSSNSEGVQVMTVGQVFKHPRYNGF

  • When doing the comparison by hand, you often use
    chemical similarities: L/I/V, F/Y/W, R/K,
    E/D/Q/N, G/A
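
The chemical similarity groups listed above can be turned into a simple column classifier. This is a sketch under the assumption that two different residues count as "similar" only when they fall in the same group from the slide; the names `SIMILARITY_GROUPS`, `classify_pair` and `similarity_summary` are invented for the example.

```python
# Groups taken from the slide: L/I/V, F/Y/W, R/K, E/D/Q/N, G/A.
SIMILARITY_GROUPS = ["LIV", "FYW", "RK", "EDQN", "GA"]
GROUP_OF = {
    residue: index
    for index, group in enumerate(SIMILARITY_GROUPS)
    for residue in group
}


def classify_pair(a: str, b: str) -> str:
    """Label one aligned residue pair as identical, similar or different."""
    if a == b:
        return "identical"
    group_a = GROUP_OF.get(a)
    if group_a is not None and group_a == GROUP_OF.get(b):
        return "similar"
    return "different"


def similarity_summary(row_a: str, row_b: str) -> dict[str, int]:
    """Count identical, similar and different columns of an ungapped alignment."""
    counts = {"identical": 0, "similar": 0, "different": 0}
    for a, b in zip(row_a, row_b):
        counts[classify_pair(a, b)] += 1
    return counts


if __name__ == "__main__":
    bovine = "GVTTSDVVVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSL"
    cod = "NVKNYHRVVLGEHDRSSNSEGVQVMTVGQVFKHPRYNGF"
    print(similarity_summary(bovine, cod))
```

A substitution matrix does the same job with graded scores instead of three fixed categories.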

21
Global/local alignment
  • For global alignments you try to align complete
    sequences
  • For local alignments you only align part of the
    sequence

Best suited for pattern matching
22
Alignment
  • In order to maximise an alignment, gaps have to
    be introduced.

Chymo  QDKTGFHFCGGSLINENWVVTAAHCGVTTS-DVVVAGEFDQGSSSEKIQKLKIAKVFKNS
Fac X  INEENEGFCGGTILSEFYILTAAHCLYQAKRFKVRVGDRNTEQEEGGEAVHEVEVVIKHN

Chymo  KYNSLTINNDITLLKLSTAASFSQTVSAVCLPS---ASDDFAAGTTCVTTGWGLTRYTNA
Fac X  RFTKETYDFDIAVLRLKTPITFRMNVAPACLPERDWAESTLMTQKTGIVSGFGRTHEKGR

Chymo  NTPDRLQQASLPLLSNTNCKKYWGTKIKDAMICAGASG--VSSCMGDSGGPLVCKKNGAW
Fac X  QS-TRLKMLEVPYVDRNSCKLSSSFIITQNMFCAGYDTKQEDACQGDSGGPHVTRFKDTY

Chymo  TLVGIVSWGSSTCSTSTPGVYARVTALVNWVQQTLAAN-------------------
Fac X  FVTGIVSWGESCARKGKYGIYTKVTAFLKWIDRSMKTRGLPKAKSHAPEVITSSPLK

Bovine chymotrypsin / human factor X

Gap penalties are usually implemented as ws = g + rx, where
ws is the gap penalty, g the gap opening penalty, r the gap
extension penalty and x the extension length.
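
The gap penalty ws = g + rx can be sketched as a small function. The default values for g and r below are placeholders chosen for illustration, not values from the slides, and the sketch assumes that x is the full length of the gap run (some programs count only the residues after the first).

```python
import re


def gap_penalty(extension_length: int,
                gap_opening: float = 10.0,     # g, illustrative value
                gap_extension: float = 1.0     # r, illustrative value
                ) -> float:
    """Affine gap penalty ws = g + r * x for one gap of length x."""
    if extension_length < 0:
        raise ValueError("extension length must be non-negative")
    return gap_opening + gap_extension * extension_length


def total_gap_penalty(aligned_row: str, **penalties: float) -> float:
    """Sum the penalty over every run of '-' in one aligned row."""
    return sum(
        gap_penalty(len(run.group()), **penalties)
        for run in re.finditer(r"-+", aligned_row)
    )


if __name__ == "__main__":
    row = "CLPS---ASDDFAAGTTCVTTGWGLTRYTNA"  # chymotrypsin fragment with a 3-residue gap
    print(total_gap_penalty(row))
```

Because opening a gap costs more than extending one, a single long gap is cheaper than several short ones of the same total length, which matches the observation that insertions and deletions cluster in loops.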
23
Alignment
  • In the three-dimensional structure, gaps are
    usually located in peripheral loop structures.

Bovine chymotrypsin Insertion and deletion
positions (compared to Factor X) are in green.
24
The twilight zone
After insertion of gaps, two random sequences can
be expected to be 10-20% identical! Two sequences
of length 100 residues that are 25% identical are
likely to be genuinely related. If the identity is
in the 15-25% region, it is worthwhile to do
additional tests.
25
Matrices
  • Instead of just counting amino acid substitutions
    (e.g. conserved vs. non-conserved residues) we
    can use biological information.
  • The PAM series of substitution matrices is based
    on global alignment of a limited set of closely
    related protein sequences (replacements are
    counted on the branches of a phylogenetic tree).
    PAM stands for Point Accepted Mutation.
  • The BLOSUM series of substitution matrices is
    based on highly conserved regions (blocks) in a
    series of alignments that are not allowed to
    contain gaps. A BLOSUMxx matrix is built from
    blocks with at most xx% identity.

26
PAM250
C  12
G  -3   5
P  -3  -1   6
S   0   1   1   1
A  -2   1   1   1   2
T  -2   0   0   1   1   3
D  -5   1  -1   0   0   0   4
E  -5   0  -1   0   0   0   3   4
N  -4   0  -1   1   0   0   2   1   2
Q  -5  -1   0  -1   0  -1   2   2   1   4
H  -3  -2   0  -1  -1  -1   1   1   2   3   6
K  -5  -2  -1   0  -1   0   0   0   1   1   0   5
R  -4  -3   0   0  -2  -1  -1  -1   0   1   2   3   6
V  -2  -1  -1  -1   0   0  -2  -2  -2  -2  -2  -2  -2   4
M  -5  -3  -2  -2  -1  -1  -3  -2   0  -1  -2   0   0   2   6
I  -2  -3  -2  -1  -1   0  -2  -2  -2  -2  -2  -2  -2   4   2   5
L  -6  -4  -3  -3  -2  -2  -4  -3  -3  -2  -2  -3  -3   2   4   2   6
F  -4  -5  -5  -3  -4  -3  -6  -5  -4  -5  -2  -5  -4  -1   0   1   2   9
Y   0  -5  -5  -3  -3  -3  -4  -4  -2  -4   0  -4  -5  -2  -2  -1  -1   7  10
W  -8  -7  -6  -2  -6  -5  -7  -7  -4  -5  -3  -3   2  -6  -4  -5  -2   0   0  17
    C   G   P   S   A   T   D   E   N   Q   H   K   R   V   M   I   L   F   Y   W
27
BLOSUM62
G   7
P  -2   9
D  -1  -1   7
E  -2   0   2   6
N   0  -2   2   0   6
H  -2  -2   0   0   1  10
Q  -2  -1   0   2   0   1   6
K  -2  -1   0   1   0  -1   1   5
R  -2  -2  -1   0   0   0   1   3   7
S   0  -1   0   0   1  -1   0  -1  -1   4
T  -2  -1  -1  -1   0  -2  -1  -1  -1   2   5
A   0  -1  -2  -1  -1  -2  -1  -1  -2   1   0   5
M  -2  -2  -3  -2  -2   0   0  -1  -1  -2  -1  -1   6
V  -3  -3  -3  -3  -3  -3  -3  -2  -2  -1   0   0   1   5
I  -4  -2  -4  -3  -2  -3  -2  -3  -3  -2  -1  -1   2   3   5
L  -3  -3  -3  -2  -3  -2  -2  -3  -2  -3  -1  -1   2   1   2   5
F  -3  -3  -4  -3  -2  -2  -4  -3  -2  -2  -1  -2   0   0   0   1   8
Y  -3  -3  -2  -2  -2   2  -1  -1  -1  -2  -1  -2   0  -1   0   0   3   8
W  -2  -3  -4  -3  -4  -3  -2  -2  -2  -4  -3  -2  -2  -3  -2  -2   1   3  15
C  -3  -4  -3  -3  -2  -3  -3  -3  -3  -1  -1  -1  -2  -1  -3  -2  -2  -3  -5  12
    G   P   D   E   N   H   Q   K   R   S   T   A   M   V   I   L   F   Y   W   C
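
A substitution matrix is used by looking up a score for every aligned residue pair and summing. The sketch below copies a handful of entries (G, D, E, K, R) exactly as transcribed on the slide above; it is not a complete matrix, and the names `SLIDE_SCORES`, `pair_score` and `ungapped_score` are made up for the example.

```python
# Lower-triangle entries for five residues, as transcribed on the slide.
SLIDE_SCORES = {
    ("G", "G"): 7,
    ("D", "G"): -1, ("D", "D"): 7,
    ("E", "G"): -2, ("E", "D"): 2, ("E", "E"): 6,
    ("K", "G"): -2, ("K", "D"): 0, ("K", "E"): 1, ("K", "K"): 5,
    ("R", "G"): -2, ("R", "D"): -1, ("R", "E"): 0, ("R", "K"): 3, ("R", "R"): 7,
}


def pair_score(a: str, b: str) -> int:
    """Look up a symmetric score; only one triangle of the matrix is stored."""
    if (a, b) in SLIDE_SCORES:
        return SLIDE_SCORES[(a, b)]
    return SLIDE_SCORES[(b, a)]  # raises KeyError for residues outside the subset


def ungapped_score(row_a: str, row_b: str) -> int:
    """Sum the pair scores over the columns of an ungapped alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("rows must have the same length")
    return sum(pair_score(a, b) for a, b in zip(row_a, row_b))


if __name__ == "__main__":
    print(ungapped_score("GDEK", "GEDR"))
```

Conservative exchanges such as D/E or K/R still contribute positive scores, which is what makes matrix scoring more sensitive than counting identities.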
28
The genetic code
29
Needleman-Wunsch
  • The best global alignment can be found as the
    best path through a matrix of residue-pair
    scores. The optimal path can be found by
    incremental extension of a subpath.
  • The Smith-Waterman algorithm extends this to
    optimal local alignment.
  • The main disadvantage is that finding the optimal
    alignment is computationally very intensive.
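
The incremental extension of subpaths can be sketched as a dynamic-programming table. The version below is a simplified illustration: it assumes a flat match/mismatch score and a linear gap cost (values chosen arbitrarily) rather than a substitution matrix and affine gaps, and the function name is invented here.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1,
                     gap: int = -2) -> tuple[int, str, str]:
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # extend diagonally
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one optimal path.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

The table has one cell per residue pair, so time and memory grow with the product of the two sequence lengths, which is the cost the last bullet refers to.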

30

Needleman-Wunsch alignment scheme
31
Needleman-Wunsch
32
Smith-Waterman
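
A local alignment differs from the global case in two ways: a cell score is never allowed to drop below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The sketch below assumes flat match/mismatch scores and a linear gap cost chosen for illustration; the function name is invented here.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1,
                   gap: int = -2) -> tuple[int, str, str]:
    """Local alignment; returns (best score, aligned part of a, aligned part of b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,                          # start a fresh local alignment
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a zero is reached.
    row_a, row_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


if __name__ == "__main__":
    print(smith_waterman("XXXACGTXXX", "YYACGTYY"))
```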
33
FastA - BLAST
  • Present-day homology search methods are all
    word-based. The idea is that a true relationship
    will have at least one word (two or more
    residues) in common.
  • FastA uses the parameter ktup for the number of
    identical residues that form the basis of a local
    alignment.
  • BLAST has extended this to neighborhood words.
    For a given word size W the score has to be at
    least T using a substitution matrix. W is usually
    kept constant while varying T. The original
    implementation did not allow for gaps, but gapped
    alignment has been implemented since v. 2.0.
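
Word seeding can be sketched as an index lookup. The code below only finds identical words (the FastA-style ktup idea); it does not implement BLAST's neighborhood words or the threshold T, and all function names are invented for the illustration. Each hit is reported with its diagonal (subject position minus query position), since hits on the same diagonal are the ones that get clustered and extended.

```python
from collections import defaultdict


def index_words(seq: str, k: int) -> dict[str, list[int]]:
    """Map every word of length k to the positions where it starts in seq."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return positions


def shared_word_hits(query: str, subject: str, k: int = 2) -> list[tuple[int, int, int]]:
    """Return (query_pos, subject_pos, diagonal) for every identical word."""
    index = index_words(subject, k)
    hits = []
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            hits.append((q, s, s - q))
    return hits


def hits_per_diagonal(hits: list[tuple[int, int, int]]) -> dict[int, int]:
    """Count hits on each diagonal; a crowded diagonal suggests a local alignment."""
    counts = defaultdict(int)
    for _, _, diagonal in hits:
        counts[diagonal] += 1
    return dict(counts)


if __name__ == "__main__":
    hits = shared_word_hits("SEDEMKASEDLKK", "SEEDMKGSEDLKK", k=2)
    print(hits_per_diagonal(hits))
```

Only the diagonals with hits need to be examined further, which is why word-based searches are much faster than filling a complete dynamic-programming table.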

34
Alignments - seeding
Smith-Waterman
Word seeding
35
Clustering and extension
Word clusters Words on the same diagonal
Extending words
36
Multiple alignments
Rat_CALRETICULIN    ----MLLSVPLLLGLLGLAAAD----------------------------
Human_CALRETICULIN  ----MLLSVPLLLGLLGLAVAE----------------------------
RAT_CALNEXIN        MEGKWLLCLLLVLGTAAIQAHDGHDDDMIDIEDDLDDVIEEVEDSKSKSD
Human_CALNEXIN      MEGKWLLCMLLVLGTAIVEAHDGHDDDVIDIEDDLDDVIEEVEDSKPDT-
Prim.cons.          MEGK2LL2V2L2LG22GLAA2DGHDDD2IDIEDDLDDVIEEVEDSK222D

Rat_CALRETICULIN    ------PAIYFKEQFLDGDAWTNR---------WVESKHKSD--FGKFVL
Human_CALRETICULIN  ------PAVYFKEQFLDGDGWTSR---------WIESKHKSD--FGKFVL
RAT_CALNEXIN        TSTPPSPKVTYKAPVPTGEVYFADSFDRGSLSGWILSKAKKDDTDDEIAK
Human_CALNEXIN      TAPPSSPKVTYKAPVPTGEVYFADSFDRGTLSGWILSKAKKDDTDDEIAK
Prim.cons.          T22P2SP2V22K22222G2V22A2SFDRG2LSGWI2SK2K2DDT222222

Rat_CALRETICULIN    SSGKFYGDQEK------DKGLQTSQDARFYALSARF-EPFSNKGQTLVVQ
Human_CALRETICULIN  SSGKFYGDEEK------DKGLQTSQDARFYALSASF-EPFSNKGQTLVVQ
RAT_CALNEXIN        YDGKWEVDEMKETKLPGDKGLVLMSRAKHHAISAKLNKPFLFDTKPLIVQ
Human_CALNEXIN      YDGKWEVEEMKESKLPGDKGLVLMSRAKHHAISAKLNKPFLFDTKPLIVQ
Prim.cons.          22GK222DE2KE2KLPGDKGL22222A222A2SAK2N2PF222222L2VQ

Rat_CALRETICULIN    FTVKHEQNIDCGGGYVKLFPGG--LDQKDMHGDSEYNIMFGPDICGPGTK
Human_CALRETICULIN  FTVKHEQNIDCGGGYVKLFPNS--LDQTDMHGDSEYNIMFGPDICGPGTK
RAT_CALNEXIN        YEVNFQNGIECGGAYVKLLSKTSELNLDQFHDKTPYTIMFGPDKCG-EDY
Human_CALNEXIN      YEVNFQNGIECGGAYVKLLSKTPELNLDQFHDKTPYTIMFGPDKCG-EDY
Prim.cons.          22V22222I2CGG2YVKL22KT2EL22D22H2222Y2IMFGPD2CGP222

Rat_CALRETICULIN    KVHVIFNYKGKNVLINKDIRCK----------DDEFTHLYTLIVRPDNTY
Human_CALRETICULIN  KVHVIFNYKGKNVLINKDIRCK----------DDEFTHLYTLIVRPDNTY
RAT_CALNEXIN        KLHFIFRHKNPKTGVYEEKHAKRPDADLKTYFTDKKTHLYTLILNPDNSF
Human_CALNEXIN      KLHFIFRHKNPKTGIYEEKHAKRPDADLKTYFTDKKTHLYTLILNPDNSF
Prim.cons.          K2H2IF22K22222I222222KRPDADLKTYF2D22THLYTLI22PDN22

Rat_CALRETICULIN    EVKIDNSQVESGSLEDDWD--FLPPKKIKDPDAAKPEDWDERAKIDDPTD
Human_CALRETICULIN  EVKIDNSQVESGSLEDDWD--FLPPKKIKDPDASKPEDWDERAKIDDPTD
RAT_CALNEXIN        EILVDQSVVNSGNLLNDMTPPVNPSREIEDPEDRKPEDWDERPKIADPDA
Human_CALNEXIN      EILVDQSVVNSGNLLNDMTPPVNPSREIEDPEDRKPEDWDERPKIPDPEA
Prim.cons.          E222D2S2V2SG2L22D22PP22P222I2DP22RKPEDWDER2KIDDPT2

Rat_CALRETICULIN    SKPEDWDK---------------------PEHIPDPDAKKPEDWDEEMDG
Human_CALRETICULIN  SKPEDWDK---------------------PEHIPDPDAKKPEDWDEEMDG
RAT_CALNEXIN        VKPDDWDEDAPSKIPDEEATKPEGWLDDEPEYIPDPDAEKPEDWDEDMDG
Human_CALNEXIN      VKPDDWDEDAPAKIPDEEATKPEGWLDDEPEYVPDPDAEKPEDWDEDMDG
Prim.cons.          2KP2DWD2DAP2KIPDEEATKPEGWLDDEPE2IPDPDA2KPEDWDE2MDG

Rat_CALRETICULIN    EWEP-------------------PVIQNPEYKGEWKPRQIDNPDYKGTWI
Human_CALRETICULIN  EWEP-------------------PVIQNPEYKGEWKPRQIDNPDYKGTWI
RAT_CALNEXIN        EWEAPQIANPKCESAPGCGVWQRPMIDNPNYKGKWKPPMIDNPNYQGIWK
Human_CALNEXIN      EWEAPQIANPRCESAPGCGVWQRPVIDNPNYKGKWKPPMIDNPSYQGIWK
Prim.cons.          EWE2PQIANP2CESAPGCGVWQRPVI2NP2YKG2WKP22IDNPDY2G2W2

37
Implementing a search algorithm
  • Sensitivity: you don't want to miss anything.
  • Selectivity: you don't want any false positives.
  • Speed: you don't want to wait.

38
BLAST
  • BLAST: Basic Local Alignment Search Tool
  • Although the program handles both nucleotide and
    protein sequences, you should translate to protein
    sequence if you have a coding region.
  • Always use the longest sequence possible.
  • Concentrate on regions having conserved
    residues: Trp, Cys, Phe, Tyr, Pro.
  • You cannot expect to locate a sequence in the
    database if your search sequence is very short
    (residues.
  • If not successful, try varying the parameters.

39
BLAST
  • You start out by selecting the program to run
    (old style):
  • BLASTP: protein against protein
  • BLASTN: nucleotide against nucleotide
  • BLASTX: nucleotide translated in all six frames
    against protein
  • TBLASTN: protein sequence against nucleotide
    translated in all six frames
  • TBLASTX: six-frame translation of query sequence
    against six-frame translation of nucleotide
    database
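
The six-frame translation used by the translated searches can be sketched as follows: three reading frames on the given strand and three on its reverse complement. This illustration assumes the standard genetic code and an unambiguous DNA alphabet (any codon with other characters becomes X); the function and variable names are invented here.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str, frame: int = 0) -> str:
    """Translate one reading frame (0, 1 or 2); '*' marks a stop codon."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna: str) -> dict[str, str]:
    """Translate frames +1..+3 of the sequence and -1..-3 of its reverse complement."""
    dna = dna.upper()
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna, offset)
        frames[f"-{offset + 1}"] = translate(reverse, offset)
    return frames


if __name__ == "__main__":
    for name, protein in six_frame_translation("ATGGCCTAA").items():
        print(name, protein)
```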

40
BLAST
41
BLAST
42
BLAST
43
BLAST parameters
  • The Expect threshold is the cut-off E-value for
    reporting hits. Select a high value (10000) when
    searching for low similarity (e.g. short
    sequences) and a low value (10) when searching
    for high similarity (long sequences).
  • Word size (2 or 3) is the size of the sequence
    chosen for initial comparison (word size W). For
    short sequences choose 2.

44
BLAST parameters
  • The scoring matrix can be changed to search for
    closely matching or diverging sequences. BLOSUM90
    has been calculated for analysing very similar
    sequences, while BLOSUM30 is for highly diverging
    sequences. Low PAM values are for higher degrees
    of similarity.
  • The gap costs fine-tune the search for divergent
    sequences when you have a long search sequence.
    High values give a more stringent search. Use
    high values for short sequences and low values
    for long sequences.
  • Compositional adjustments change the scoring
    matrix to reflect the composition of the
    sequences to be compared.

45
BLAST parameters
  • Filtering will remove low-complexity regions
    (regions that have residues that are
    overrepresented) and repetitive elements (like
    Alu repeats in nucleotide sequences). It should
    be turned off for short sequences.
  • Mask for lookup: extensions of the initial hit
    are not removed by filtering.
  • Mask lower case: a sequence in upper case can be
    marked for filtering by changing part of it to
    lower case.

46
BLAST
47
Output options
48
Waiting for results
49
Results
  • The graphical viewer displays hit sequences in
    color lines corresponding to score and position
    in sequence. Mouse-over displays target sequence
    name, and mouse click displays alignment.

50
Results
  • The results are listed with the highest scores at
    the top.
  • Each line contains the accession number followed
    by the name of the protein, the score (higher is
    better) and the E-value (Expect value: the number
    of expected occurrences of a sequence with the
    given composition in the requested database).
  • The E-value should be as low as possible (

51
Results
  • One or more HSPs (high-scoring segment pairs) are
    shown after the scoring list for each protein.
    Query: input sequence. Sbjct: database sequence.
    Similar residues are marked by + and are counted
    as positives.

52
Result distance tree
53
BLAST distance tree