Title: Sequence comparison and Phylogeny
1Sequence comparisonand Phylogeny
based on Chapter 4 Lesk, Introduction to
Bioinformatics
2Contents
- Motivation
- Sequence comparison and alignments
- Dot plots
- Dynamic programming
- Substitution matrices
- Dynamic programming Local and global alignments
and gaps - BLAST
- Significance of alignments
- Multiple sequence alignments
- Phylogenetic trees
3Motivation
- From where are we?
- Recent Africa vs. Multi-regional Hypothese
- In 1999 Encephalitis caused by the West Nile
Virus broke out in New York. How did the virus
come to New York? - How did the nucleus get into the eucaryotic
cells? - To answer such questions we will need sequence
comparison and phylogenetic trees
4Sequence Similarity Searches
- Sequence similarity can be clue to common
evolutionary ancestor - E.g. globin genes in chimpanzees and humans
- or common function
- E.g. v-sys onco genes in simian sarcoma virus
leading to cancer in monkeys and the seemingly
unrelated growth stimulating hormone PDGF, which
stimulates cell growth (first success of
similarity idea, 1983) - In general
- If an unknown sequences is found, deduce its
function/structure indirectly by finding similar
sequences, whose function/structure is known - Assumption Evolution changes sequences slowly
often maintaining main features of a sequences
function/structure
5Sequence alignment
- Substitutions, insertions and deletions can be
interpreted in evolutionary terms - But distinguish chance similarity and real
biological relationship
CCGTAA
TCGTAA
CCGTAT
TCGTAC
TTGTAA
TCGTAG
TAGTAC
6Evolution
- Convergent evolution same sequence evolved from
different ancestors - Back evolution - mutate to a previous sequence
CCGTAA
TCGTAA
CCGTAT
TCGTAC
TAGTAA
TCGTAG
TAGTAC
TAGTAC
CCGTAA
7Similarity vs. Homology
- Any sequence can be similar
- Sequences homologues if evolved from common
ancestor - Homologous sequences
- Orthologs similar biological function
- Paralogs different biological function (after
gene duplication), e.g. lysozyme and
a-lactalbumin, a mammalian regulatory protein - Assumption Similarity indicator for homology
- Note, altered function of the expressed protein
will determine if the organism will survive to
reproduce, and hence pass on the altered gene
8Sequence alignments
- Given two or more sequences, we wish to
- Measure their similarity
- Determine the residue-residue correspondences
- Observe patterns of conservation and variability
- Infer evolutionary relationships
9What is the best alignment?
- Uninformative -------gctgaacg ctataatc------
- - Without gaps gctgaacg ctataatc
- With gaps gctga-a--cg --ct-ataatc
- Another one gctg-aa-cg -ctataatc-
- Formally The best alignment have only a minimal
number of mismatches (insertions, deletions,
replace) - We need a method to systematically explore and to
compute alignments
10Scores for an alignment
- Percentage of matches
- Score each match, mismatch, gap opening, gap
extension - Example
- match 1
- mismatch -1
- Gap opening -3
- Gap extension -1
- Uninformative 0, score -21 -------gctgaacg
ctataatc------- - Without gaps25,score -4 gctgaacg ctataatc
- With gaps 0, score -23 gctga-a--cg --ct-a
taatc - Another one 50, score-12 gctg-aa-cg -ctat
aatc-
11Scores for an alignment
- Percentage of matches
- Score each match, mismatch, gap opening, gap
extension - Example
- match 2
- mismatch -1
- Gap opening -1
- Gap extension -1
- Uninformative 0, score -17 -------gctgaacg
ctataatc------- - Without gaps25,score -2 gctgaacg ctataatc
- With gaps 0, score -15 gctga-a--cg --ct-a
taatc - Another one 50, score0 gctg-aa-cg -ctataa
tc-
12Dot plots
13Dot plots
- A convenient way of comparing 2 sequences
visually - Use matrix, put 1 sequence on X-axis, 1 on Y-axis
- Cells with
- identical characters filled with a 1,
- non-identical with 0
- (simplest scheme - could have weights)
14Dot plots
15Dot plots
16Interpreting dot plots
- What do identical sequences look like?
- What do unrelated sequences look like?
- What do distantly related sequences look like?
- What does reverse sequence look like?
- Relevant for detections of stems in RNA structure
- What does a palindrome look like?
- Relevant for restriction enzymes
- What do repeats look like?
- What does a protein with domains A and B and
another one with domains B and C look like?
17Dot plot for identical sequences
18Dotplot for unrelated sequences
19Dotplot for distantly related sequences
20Dotplot for reverse sequences
21Dotplot for reverse sequences
- Relevant to identify stems in RNA structures
- Plot sequence against its reverse complement
22Palindromes and restriction enzymes
- Madam, I'm Adam
- Able was I ere I saw Elba (supposedly said by
Napoleon) - Doc note I dissent, a fast never prevents a
fatness, I diet on cod. - Because DNA is double stranded and the strands
run antiparallel, palindromes are defined as any
double stranded DNA in which reading 5 to 3
both are the same - The HindIII cutting site
- 5'-AAGCTT-3'
- 3'-TTCGAA-5'
- The EcoRI cutting site
- 5'-GAATTC-3'
- 3'-CTTAAG-5'
23Dotplot of a Palindrome
24Dotplot of repeats
25Dotplot of Repeats/Palindrome
26Dotplot for shared domain
27Result
- Dot plot
- dorothycrowfoothodgkin
- d
- o
- r
- o
- t
- h
- y
- h
- o
- d
- g
- k
- i
- n
28Result
- Dot plot
- dorothycrowfoothodgkin
- d
- o
- r
- o
- t
- h
- y
- h
- o
- d
- g
- k
- i
- n
29Dotplots
- Window size 15
- Dot if
- 6 matches in window
30(No Transcript)
31gtgi1942644pdb1MEG Crystal Structure Of A
Caricain D158e Mutant In Complex With E-64
Length 216 Score 271 bits (693), Expect
1e-73 Identities 142/216 (65), Positives
168/216 (77), Gaps 4/216 (1) Query 1
IPEYVDWRQKGAVTPVKNQGSCGSCWAFSAVVTIEGIIKIRTGNLNQYSE
QELLDCDRRS 60 PE VDWRKGAVTPVQGSCGSC
WAFSAV TEGI KIRTG L SEQELDCRRS Sbjct 1
LPENVDWRKKGAVTPVRHQGSCGSCWAFSAVATVEGINKIRTGKLVELSE
QELVDCERRS 60 Query 61 YGCNGGYPWSALQLVAQYGIHYRN
TYPYEGVQRYCRSREKGPYAAKTDGVRQVQPYNQGA 120
GC GGYP AL VA GIH R YPY Q CR G KT
GV VQP NG Sbjct 61 HGCKGGYPPYALEYVAKNGIHLRSKY
PYKAKQGTCRAKQVGGPIVKTSGVGRVQPNNEGN 120 Query
121 LLYSIANQPVSVVLQAAGKDFQLYRGGIFVGPCGNKVDHAVAAV--
--GYGPNYILIKNS 176 LL IA QPVSVV G
FQLYGGIF GPCG KVHAV AV G YILIKNS Sbjct
121 LLNAIAKQPVSVVVESKGRPFQLYKGGIFEGPCGTKVEHAVTAVGY
GKSGGKGYILIKNS 180 Query 177 WGTGWGENGYIRIKRGTGN
SYGVCGLYTSSFYPVKN 212 WGT WGE GYIRIKR
GNS GVCGLY SSYP KN Sbjct 181 WGTAWGEKGYIRIKRAPGN
SPGVCGLYKSSYYPTKN 216 1 lpenvdwrkk
gavtpvrhqg scgscwafsa vatveginki rtgklvelse
qelvdcerrs 61 hgckggyppy aleyvakngi
hlrskypyka kqgtcrakqv ggpivktsgv grvqpnnegn
121 llnaiakqpv svvveskgrp fqlykggife gpcgtkveha
vtavgygksg gkgyilikns 181 wgtawgekgy
irikrapgns pgvcglykss yyptkn
32(No Transcript)
33- gtgi2624670pdb1AIM Cruzain Inhibited By
Benzoyl-Tyrosine-Alanine- Fluoromethylketone - Length 215
- Score 121 bits (303), Expect 3e-28
- Identities 78/202 (38), Positives 107/202
(52), Gaps 13/202 (6) - Query 2 PEYVDWRQKGAVTPVKNQGSCGSCWAFSAVVTIEGIIKI
RTGNLNQYSEQELLDCDRRSY 61 - P VDWR GAVT VKQG CGSCWAFSA E
L SEQ L CD - Sbjct 2 PAAVDWRARGAVTAVKDQGQCGSCWAFSAIGNVECQWFL
AGHPLTNLSEQMLVSCDKTDS 61 - Query 62 GCNGGYPWSALQLVAQY---GIHYRNTYPY---EGVQRY
CRSREKGPYAAKTDGVRQVQP 115 - GCGG A Q YPY EG
C A T V Q - Sbjct 62 GCSGGLMNNAFEWIVQENNGAVYTEDSYPYASGEGISPP
CTTSGHTVGATITGHVELPQD 121 - Query 116 YNQGALLYSIANQPVSVVLQAAGKDFQLYRGGIFVGPCG
NKVDHAVAAVGYGPN----YI 171 - Q A N PVV A Y GG
DH V VGY Y - Sbjct 122 EAQIAAWLAV-NGPVAVAVDAS--SWMTYTGGVMTSCVS
EALDHGVLLVGYNDSAAVPYW 178 - Query 172 LIKNSWGTGWGENGYIRIKRGT 193
34(No Transcript)
35- gi7546546pdb1EF7B Chain B, Crystal
Structure Of Human Cathepsin X - Length 242
- Score 52.0 bits (123), Expect 2e-07
- Identities 60/231 (25), Positives 94/231
(40), Gaps 34/231 (14) - Query 1 IPEYVDWRQKGAV---TPVKNQ---GSCGSCWAFSAVVT
IEGIIKIRTGNL---NQYSEQ 51 - P DWR V NQ CGSCWA
I I S Q - Sbjct 1 LPKSWDWRNVDGVNYASITRNQHIPQYCGSCWAHASTSA
MADRINIKRKGAWPSTLLSVQ 60 - Query 52 ELLDCDRRSYGCNGGYPWSALQLVAQYGIHYRNTYPYEG
VQRYCR--------SREKGPY 103 - DC C GG S QGI Y
C K - Sbjct 61 NVIDCGNAG-SCEGGNDLSVWDYAHQHGIPDETCNNYQA
KDQECDKFNQCGTCNEFKECH 119 - Query 104 AAKTDGVRQVQPYN-----QGALLYSIANQPVSVVLQAA
GKDFQLYRGGIFVGPCGNK-V 157 - A V Y AN PS A
Y GGI - Sbjct 120 AIRNYTLWRVGDYGSLSGREKMMAEIYANGPISCGIMAT
ER-LANYTGGIYAEYQDTTYI 178 - Query 158 DHAVAAVGY----GPNYILIKNSWGTGWGENGYIRI---
--KRGTGNSYGV 199
36(No Transcript)
37(No Transcript)
38Dynamic programming
39From Dotplots to Alignments
- Obvious best alignment DOROTHYCROWFOOTHODGKIN
DOROTHY--------HODGKIN
40From Dotplots to Alignments
- Find best path from top left corner to bottom
right - Moving east corresponds to - in the second
sequence - Moving south corresponds to - in the first
sequence - Moving southeast corresponds to a match if the
characters are the same or a mismatch otherwise - Can we automate this?
41From Dotplots to Alignments
- Algorithm (Dynamic Programming)
- Insert a row 0 and column 0 initialised with 0
- Starting from the top left, move down row by row
from row 1 and - right column by column from column 1 visiting
each cell - Consider
- The value of the cell north
- The value of the cell west
- The value of the cell northwest if the row/column
character mismatch - 1 the value of the cell northwest if the
row/column character match - Put down the maximum of these values as the value
for the current cell - Trace back the path with the highest values from
the bottom right to the top left and output the
alignment
42From Dotplots to Alignments
- 0 1 2 3 4 5 6 T G C A T A0 1 A2 T3
C4 T5 G6 A7 T
43From Dotplots to Alignments
- 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01
A 02 T 03 C 04 T 05 G 06 A 07 T 0 - Insert a row 0 and column 0 initialised with 0
44From Dotplots to Alignments
- 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01
A 0 0 0 0 1 1 12 T 03 C 04 T 05 G 06 A 07
T 0 -
- Consider
- Value north
- Value west
- Value northwest if the row/column character
mismatch - 1 value northwest if the row/column character
match - Put down the maximum of these values for current
celll
45From Dotplots to Alignments
- 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01
A 0 0 0 0 1 1 12 T 0 1 1 1 1 2 23
C 0 1 1 2 2 2 24 T 0 1 1 2 2 3 35
G 0 1 2 2 2 3 36 A 0 1 2 2 3 3 47
T 0 1 2 2 3 4 4 -
46Reading the Alignment
- 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01
A 0 0 0 0 1 1 12 T 0 1 1 1 1 2 23
C 0 1 1 2 2 2 24 T 0 1 1 2 2 3 35
G 0 1 2 2 2 3 36 A 0 1 2 2 3 3 47
T 0 1 2 2 3 4 4 -
-tgcat-a- at-c-tgat
47Reading the Alignment there are more than one
possibility
- 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01
A 0 0 0 0 1 1 12 T 0 1 1 1 1 2 23
C 0 1 1 2 2 2 24 T 0 1 1 2 2 3 35
G 0 1 2 2 2 3 36 A 0 1 2 2 3 3 47
T 0 1 2 2 3 4 4 -
---tgcata atctg-at-
48FormallyLongest Common Subsequence LCS
- What is the length s(V,W) of the longest common
subsequence of two sequencesVv1..vn and
Ww1..wm ? - Find sequences of indices1 i1 lt lt ik n and
1 j1 lt lt jk msuch that vit wjt for 1
t k - How? Dynamic programming
- si,0 s0,j 0 for all 1 i n and 1 j m
and - si-1,jsi,j max si,j-1 si-1,j-1 1,
if vi wj - Then s(V,W) sn,m is the length of the LCS
49Example LCS
- 0 1 2 3 4 5 6 T G C A T A0 1 A2 T3
C4 T5 G6 A7 T
50Example LCS
- 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01
A 02 T 03 C 04 T 05 G 06 A 07 T 0 - Initialisation si,0 s0,j 0 for all 1 i
n and 1 j m and
51Example LCS
- 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01
A 0 0 0 0 1 1 12 T 03 C 04 T 05 G 06 A 07
T 0 - Computing each cell si-1,j si,j max
si,j-1 si-1,j-1 1, if vi wj -
52Example LCS
- 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01
A 0 0 0 0 1 1 12 T 0 1 1 1 1 2 23
C 0 1 1 2 2 2 24 T 0 1 1 2 2 3 35
G 0 1 2 2 2 3 36 A 0 1 2 2 3 3 47
T 0 1 2 2 3 4 4 - Computing each cell si-1,j si,j max
si,j-1 si-1,j-1 1, if vi wj -
53LCS Algorithm
- Complexity
- LCS has quadratic complexity O(n m)
- LCS(V,W)
- For i 1 to n
- si,0 0
- For j 1 to m
- s0,j 0
- For i 1 to n
- For j 1 to m
- If vi wj and si-1,j-1 1 si-1,j and si-1,j-1
1 si,j-1 Then - si,j si-1,j-1 1
- bi,j North West
- Else if si-1,j si,j-1 Then
- si,j si-1,j
- bi,j North
- Else
- si,j si,j-1
- bi,j West
- Return s and b
54Printing the alignment of LCS
- PRINT-LCS(b,V,i,j)
- If i0 or j0 Then Return
- If bi,j North West Then
- PRINT-LCS(V,b,i-1,j-1)
- Print vi
- Else if bi,j North Then
- PRINT-LCS(V,b,i-1,j)
- Else
- PRINT-LCS(V,b,i,j-1)
55Rewards/Penalities
- We can use different schemes
- -1 for insert/delete/mismatch
- 1 for match
- Consider
- -1 the value of the cell north
- -1 the value of the cell west
- -1 the value of the cell northwest if the
row/column character mismatch - 1 the value of the cell northwest if the
row/column character match - Put down the maximum of these values as the value
for the current cell
56Reading the Alignment
- 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01
A 0 -1 -1 -1 1 0 12 T 0 1 0 -1 0 2 13
C 0 0 -1 1 0 1 14 T 0 1 0 0 0 1 05
G 0 0 2 1 0 0 06 A 0 -1 1 1 2 1 17
T 0 1 0 0 1 3 2 -
---tgcata atctg-at-
57Rewards/Penalities
- Lets refine the schemes
- Transition mutations are more common
- purinelt-gtpurine, alt-gtg
- pyrimidinelt-gtpyrimidine, tlt-gtc
- Transversions (purinelt-gtpyrimidine) are less
common - Use a subsitutation matrix to rate mismatches
- -2 for insert/delete
- Mismatch/match according to substitution matrix
- Consider
- -2 the value of the cell north
- -2 the value of the cell west
- Corresponding value of the substion matrix the
value of the cell northwest - Put down the maximum of these values as the value
for the current cell
58Reading the Alignment
- 0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01
A 0 -2 0 -2 2 0 22 T 0 2 0 0 0 4 23
C 0 0 0 2 0 2 24 T 0 2 0 0 0 2 05
G 0 0 4 2 0 0 26 A 0 -2 2 2 4 2 27
T 0 2 0 2 2 6 4 -
---tgcata atctg-at-
59Substitution matrixes
60How to derive a substitution matrix for amino
acids?
- Amino acids can be classified by physiochemical
properties
61PAM 250 matrix
Cys 12 Ser 0 2 Thr -2 1 3 Pro -3 1 0
6 Ala -2 1 1 1 2 Gly -3 1 0 -1 1 5 Asn -4
1 0 -1 0 0 2 Asp -5 0 0 -1 0 1 2 4 Glu
-5 0 0 -1 0 0 1 3 4 Gln -5 -1 -1 0 0 -1
1 2 2 4 His -3 -1 -1 0 -1 -2 2 1 1 3
6 Arg -4 0 -1 0 -2 -3 0 -1 -1 1 2 6 Lys -5
0 0 -1 -1 -2 1 0 0 1 0 3 5 Met -5 -2 -1
-2 -1 -3 -2 -3 -2 -1 -2 0 0 6 Ile -2 -1 0 -2
-1 -3 -2 -2 -2 -2 -2 -2 -2 2 5 Leu -6 -3 -2 -3
-2 -4 -3 -4 -3 -2 -2 -3 -3 4 2 6 Val -2 -1 0
-1 0 -1 -2 -2 -2 -2 -2 -2 -2 2 4 2 4 Phe -4
-3 -3 -5 -4 -5 -4 -6 -5 -5 -2 -4 -5 0 1 2 -1
9 Tyr 0 -3 -3 -5 -3 -5 -2 -4 -4 -4 0 -4 -4 -2
-1 -1 -2 7 10 Trp -8 -2 -5 -6 -6 -7 -4 -7 -7 -5
-3 2 -3 -4 -5 -2 -6 0 0 17 C S T P A
G N D E Q H R K M I L V F Y W
gt0, likely mutation 0, random mutation lt0,
unlikely
62(No Transcript)
63(No Transcript)
64PAM 250 Interpretation
- Immutable
- Cysteine (Avg-2.8) known to have several
unique, indispensable functions - attachment site of heme group in cytochrome and
of iron sulphur FeS in ferredoxins - Cross links in proteins such as chymotrypsin or
ribonuclease - Seldom without unique function
- Glycine (Avg-1.6) small size maybe advantageous
- Mutable
- Serine often functions in active site, but can be
easily replaced - Self-alignment
- Tryptophan with itself scores very high, as W
occurs rarely
65Point Accepted Mutations PAM
- Substitution matrix using explicit evolutionary
model of how amino acids change over time - Use parsimony method to determine frequency of
mutations - Entry in PAM matrix Likelihood ratio for
residues a and b Probability a-b is a mutation /
probability a-b is chance - PAM x Two sequences V, W have evolutionary
distance of x PAM if a series of accepted point
mutations (and no insertions/deletions) converts
V into W averaging to x point mutation per 100
residues - Mutations here mutations in the DNA
- Because of silent mutations and back mutations n
can be gt100 - PAM 250 most commonly used
66PAM and Sequence Similarity
67PAM
- Dayhoff, Eck, Park A model of evolutionary
change in proteins, 1978 - Accepted point mutation substitution of an
amino acid accepted by natureal selection - Assumption X replacing Y as likely as Y
replacing X - Used cytochrome c, hemoglobin, myoglobin, virus
coat proteins, chymotrypsinogen, glyceraldehyde
3-phosphate dehrydogenase, clupeine, insulin,
ferredoxin - Sequences which are too distantly related have
been omitted as they are more likely to contain
multiple mutations per site
68PAM Step 1
- Step 1 Construct a multiple alignment
- Example
- ACGCTAFKI
- GCGCTAFKI
- ACGCTAFKL
- GCGCTGFKI
- GCGCTLFKI
- ASGCTAFKL
- ACACTAFKL
69PAM Step 2
- Create a phylogenetic tree (parsimony method)
ACGCTAFKI A-gtG
I-gtL GCGCTAFKI ACGCTAFKL
A-gtG A-gtL C-gtS G-gtA GCGCTGFKI
GCGCTLFKI ASGCTAFKL ACACTAFKL
70PAM Step 3
- Note, the following variables
- Residue frequency ri is the number of amino acid
i occurring in the sequences, e.g. rA 10 and
rG10 - Number of residues r is the number of overall
amino acids in all sequences, e.g. r63 - Substitutability si is the number of
substitutions in the tree involving amino acid i
, e.g. sA4 - Substituion frequency si,j is the number of
substitutions involving amino acid i and j (i.e.
the number of i?j and j?i ), e.g. sA,G 3 - Number of substitutions s is the number of
overall substitutions, s6
71PAM Step 4
- Relative mutability is the number of times the
amino acid is substituted by any other amino acid
in the tree divided by the total number of
substitutions that could have affected the
residue - Note, it is assumed that substitutions in both
directions are equally likely - mi 100 x ( si x ri ) / ( 2 s x r )
- Example mA 100 x ( 4 x 10 ) / ( 2 x 6 x 63 )
5.3
72PAM Step 5
- Compute mutation probability
- Mi,j mj x si,j / sj
- Example MG,A 5.3 x 3 / 4 3.975
73PAM Step 6
- Finally the entry in the PAM Matrix
- Ri,j log ( Mi,j / ( ri / r ) )
- Example RG,A log ( 3.975 / (10/63) ) 1.4
74PAM Step 6
- For the entries on the diagonal
- mj relative mutability ? 1-mj relative
immutability - Rj,j log ( (1-mj ) / ( rj / r ) )
75BLOSUM
- Different approach to PAM
- BLOcks SUbstitution Matrix (based on BLOCKS
database) - Generation of BLOSUM x
- Group highly similar sequences and replace them
by a representative sequences. - Only consider sequences with no more than x
similarity - Align sequences (no gaps)
- For any pair of amino acids a,b and for all
columns c of the alignment, let q(a,b) be the
number of co-occurrences of a,b in all columns c. - Let p(a) be the overall probability of a
occurring - BLOSUM entry for a,b is log2 ( q(a,b) / (
p(a)p(b) ) ) - BLOSUM 50 and BLOSUM 62 widely used
76LCS Algorithm (Longest Common Subsequence)
Revisited
- Algorithm (Dynamic Programming) with Substitution
Matrix - Insert a row 0 and column 0 initialised with 0
- Starting from the top left, move down row by row
from row 1 and - right column by column from column 1 visiting
each cell - Consider
- The value of the cell north
- The value of the cell west
- The value of the cell northwest if the row/column
character mismatch - s the value of the cell northwest, where s is
the value in the subsitution matrix for the
residues in row/column - Put down the minimum of these values as the value
for the current cell - Trace back the path with the highest values from
the bottom right to the top left and output the
alignment
77LCS Revisited Formally
- What is the length s(V,W) of the longest common
subsequence of two sequencesVv1..vn and
Ww1..wm ? - Find sequences of indices1 i1 lt lt ik n and
1 j1 lt lt jk msuch that vit wjt for 1
t k - How? Dynamic programming
- si,0 s0,j 0 for all 1 i n and 1 j m
and - si-1,jsi,j max si,j-1 si-1,j-1 t,
where t is the value for vi and wj in
the substitution matrix - Then s(V,W) sn,m is the length of the LCS
78Dynamic programming revisitedlocal and global
alignments and gap
79Evolution and Alignments
- Alignments can be interpreted in evolutionary
terms - Identical letters are aligned. Interpretation
part of the same ancestral sequence and not
changed - Non-identical letters are aligned
(substitution)Interpretation Mutation - GapsInterpretation Insertions and deletions
(indels)
80Evolution and Alignments
- Specific problems aligning DNA
- Frame shift
- DNA triplets code amino acids
- Indel of one nucleotide shifts the whole sequence
of triplets - Thus may have a global effect and change all
coded amino acids - Silent mutation
- Substitution in DNA leaves transcribed amino acid
unchanged - Non-sense mutation
- Substitution to stop-codon
81Local and Global Alignments
- Global alignment (Needleham-Wunsch) algorithm
finds overall best alignment - Example members of a protein family, e.g.
globins are very conserved and have the same
length in different organisms from fruit fly to
humans - Local alignment (Smith-Waterman) algorithm finds
locally best alignment - most widely used, as
- e.g. genes from different organisms retain
similar exons, but may have different introns - e.g. homeobox gene, which regulates embryonic
development occurs in many species, but very
different apart from one region called
homeodomain - e.g. proteins share some domains, but not all
82Local Alignment
- LCS s(V,W) computes globally best alignment
- Often it is better to maximise locally, i.e.
compute maximal s(vivi , wj wi ) for all
substrings of V and W - Can we adapt algorithm?
- Global alignment longest path in matrix s from
(0,0) to (n,m) - Local alignment longest path in matrix s from
any (i,j) to any (i,j) - Modify definition of s adding vertex of weight 0
from source to every other vertex, creating a
free jump to any starting position (i,j)
83Local Alignment
- Modify the definition of s as follows
- si,0 s0,j 0 for all 1 i n and 1 j m
and - 0 si-1,jsi,j max si,j-1 si-1,j-1
t, where t is the value for vi wj in the
substitution matrix - Then s(V,W) max si,j is the length of the
local LCS - This computes longest path in edit graph
- Several local alignment may have biological
significance (consider e.g. two multi-domain
proteins whose domains are re-ordered
84Aligning with Gap Penalties
- Gap is sequence of spaces in alignment
- So far, we consider only insertion and deletion
of single nucleotides or amino acids creating
alignments with many gaps - So far, score of a gap of length l is l
- Because insertion/deletion of monomers is
evolutionary slow process, large numbers of gaps
do not make sense - Instead whole substrings will be deleted or
inserted - We can generalise score of a gap to a score
function A B l, where A is the penalty to open
the gap and B is the penalty to extend the gap
85Aligning with Gap Penalties
- High gap penalties result in shorter,
lower-scoring alignments with fewer gaps and - Lower gap penalties give higher-scoring, longer
alignments with more gaps - Gap opening penalty A mainly influences number of
gaps - Gap extension penalty B mainly influences length
of gaps - E.g. if interested in close relationships, then
choose A, B above default values, for distant
relationships decrease default values
86Aligning with Gap Penalties
- Adapt the definition of s as follows
- s-deli,j max s-deli-1,j - B si-1,j
(AB) - s-insi,j max s-insi,j-1 - B si,j-1
(AB) - 0 s-deli,jsi,j max s-insi,j
si-1,j-1 t, where t is the value for vi,
wj in the substitution matrix Then s(V,W)
max si,j is the length of the local LCS with
gap penalties A and B
87FASTA and BLAST
88Motivation
- As in dotplots, the underlying data structure for
dynamic programming is a table - Given two sequences of length n dynamic
programming takes time proportional to n2 - Given a database with m sequences, comparing a
query sequence to the whole database takes time
proportional to m n2 - What does this mean?
- Imagine you need to fill in the tables by hand
and it takes 10 second to fill in one cell - Assume there are 1.000.000 sequences each 100
amino acids long - How long does it take?
89- 1.000.000 x 100 x 100 x 10 sec 1011 sec
27.777.778h 1157407days 3170 years - Even if a computer does not take 10 sec, but just
0.1ms to fill in one cell, it would still be 12
days. - We cannot do something about the database size,
but can we do something about the table size?
90An idea Prune the search space
91Another idea
- Did we formulate the problem correctly?
- Do we need the alignments for all sequences in
the database? - No, only for reasonable hits ? introduce a
threshold - A reasonable alignment will contain short
stretches of perfect matches - Find these first, then extend them to connect
them as best possible
92FASTA and BLAST
- FASTA and BLAST faster than dynamic programming
(5 times and 50 times respectively) - Underlying idea for a heuristic
- High-scoring alignments will contain short
stretches of identical letters, called words - FASTA and BLAST first search for matches of words
of a given length and score threshold - BLAST for words of length 3 for proteins and 11
for DNA - FASTA for words of length 2 for proteins and 6
for DNA - Next, matches are extended to local (BLAST) and
global (FASTA) alignments
93FASTA and BLAST
- More formallyIf the strings Vv1..vm and
Ww1..wm match with at most k mismatches, then
they share an p-tuple for p ?m/(k1)?, i.e.
vi..vil-1 wj..wjl-1 for some 1 i,j m-p1 - FILTRATION ALGORITHM, which detects all matching
words of length m with up to k mismatches - Potential match detection Find all matches of
p-tuples of V,W (can be done in linear time by
inserting them into a hash table) - Potential match verification Verify each
potential match by extending it to the left and
right until either the first k1 mismatches are
found or the beginning or end of the sequences
are found
94Example for BLAST
- Search SWISSPROT for Immunoglobulin
95Example for BLAST
- Search BLAST (www.ncbi.nlm.nih.gov/BLAST/) for
P11912
96Example for BLAST
97Example for BLAST Top Hits
- Score E Sequences producing significant
alignments Score E-Value gi547896spP11912C7
9A_HUMAN B-cell antigen receptor comp... 473
e-133 gi728993spP40293C79A_BOVIN B-cell
antigen receptor comp... 312 3e-85
gi126779spP11911C79A_MOUSE B-cell antigen
receptor comp... 278 5e-75 gi728994spP40259C
79B_HUMAN B-cell antigen receptor comp... 55
1e-07 gi125781spP01618KV1_CANFA IG KAPPA
CHAIN V REGION GOM 38 0.019 gi125361spP17948
VGR1_HUMAN Vascular endothelial growth ... 37
0.042 gi549319spP35969VGR1_MOUSE Vascular
endothelial growth ... 36 0.052
gi114764spP15530C79B_MOUSE B-cell antigen
receptor comp... 36 0.064 gi1718161spP53767V
GR1_RAT Vascular endothelial growth f... 35
0.080 gi125735spP01681KV01_RAT Ig kappa
chain V region S211 35 0.095
gi1730075spP01625KV4A_HUMAN IG KAPPA CHAIN
V-IV REGION LEN 34 0.26 gi1718188spP52583VGR2
_COTJA Vascular endothelial growth... 33 0.28
gi125833spP06313KV4B_HUMAN IG KAPPA CHAIN
V-IV REGION J... 33 0.30 gi125806spP01658KV3
F_MOUSE IG KAPPA CHAIN V-III REGION ... 33 0.30
gi125808spP01659KV3G_MOUSE IG KAPPA CHAIN
V-III REGION ... 33 0.30 gi1172451spQ05793PG
BM_MOUSE Basement membrane-specific ... 33 0.33
gi125850spP01648KV5O_MOUSE Ig kappa chain V-V
region HP... 33 0.36 gi125830spP06312KV40_HU
MAN Ig kappa chain V-IV region p... 33 0.38
gi2501738spQ06639YD03_YEAST Putative 101.7
kDa transcri... 33 0.41
98Example for BLAST Alignment
gtgi126779spP11911C79A_MOUSE B-cell antigen
receptor complex associated protein
alpha-chain precursor (IG-alpha) (MB-1 membrane
glycoprotein) (Surface-IGM-associated protein)
(Membrane-bound immunoglobulin associated
protein) (CD79A) Length 220 Score 278 bits
(711), Expect 5e-75 Identities 150/226 (66),
Positives 165/226 (73), Gaps 6/226
(2) Query 1 MPGGPGVLQALPATIFLLFLLSAVYLGPGCQA
LWMHKVPASLMVSLGEDAHFQCPHNSSN 60 MPGG
LL LS LGPGCQAL P SL VLGEA C
N Sbjct 1 MPGG----LEALRALPLLLFLSYACLGPGCQAL
RVEGGPPSLTVNLGEEARLTC-ENNGR 55 Query 61
NANVTWWRVLHGNYTWPPEFLGPGEDPNGTLIIQNVNKSHGGIYVCRVQ
EGNESYQQSCG 120 N NTWW L N TWPP
LGPG G L VNK G CV E N SCG Sbjct
56 NPNITWWFSLQSNITWPPVPLGPGQGTTGQLFFPEVNKNTGACTG
CQVIE-NNILKRSCG 114 Query 121
TYLRVRQPPPRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQ
NEKLGLDAGD 180 TYLRVR P
PRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEK GD
D Sbjct 115 TYLRVRNPVPRPFLDMGEGTKNRIITAEGIILLFCA
VVPGTLLLFRKRWQNEKFGVDMPD 174 Query 181
EYEDENLYEGLNLDDCSMYEDISRGLQGTYQDVGSLNIGDVQLEKP
226 YEDENLYEGLNLDDCSMYEDISRGLQGTYQDVG
LIGD QLEKP Sbjct 175 DYEDENLYEGLNLDDCSMYEDISRG
LQGTYQDVGNLHIGDAQLEKP 220
99Example for BLAST
- Lineage Report
- root
- . cellular organisms
- . . Eukaryota eukaryotes
- . . . Fungi/Metazoa group eukaryotes
- . . . . Bilateria animals
- . . . . . Coelomata animals
- . . . . . . Gnathostomata vertebrates
- . . . . . . . Tetrapoda vertebrates
- . . . . . . . . Amniota vertebrates
- . . . . . . . . . Eutheria mammals
- . . . . . . . . . . Homo sapiens (man)
---------------------- 473 33 hits mammals
B-cell antigen receptor complex associated
protein alpha-ch - . . . . . . . . . . Bos taurus (bovine)
..................... 312 2 hits mammals
B-cell antigen receptor complex associated
protein alpha-ch - . . . . . . . . . . Mus musculus (mouse)
.................... 278 31 hits mammals
B-cell antigen receptor complex associated
protein alpha-ch - . . . . . . . . . . Canis familiaris (dogs)
................. 37 1 hit mammals
IG KAPPA CHAIN V REGION GOM - . . . . . . . . . . Rattus norvegicus (brown rat)
........... 35 7 hits mammals
Vascular endothelial growth factor receptor 1
precursor (VE - . . . . . . . . . . Oryctolagus cuniculus
(domestic rabbit) . 29 1 hit mammals
IG KAPPA CHAIN V REGION K29-213 - . . . . . . . . . Coturnix japonica
------------------------- 33 2 hits birds
Vascular endothelial growth factor
receptor 2 precursor (VE - . . . . . . . . . Gallus gallus (chickens)
.................. 31 4 hits birds
CILIARY NEUROTROPHIC FACTOR RECEPTOR ALPHA
PRECURSOR (CNTFR
100How good is an alignment?
- Be careful Fitch/Smith found 17 alignments for
alpha- and beta-chains in chicken haemoglobins - Only one is the correct one (according to the
structure) - Given an alignment, how good is it
- Percentage of matching residues, i.e. number of
matches divided by length of smallest sequence - Advantage independent of sequence length
- E.g. ATC TGAT 4/6 66.67 TGCAT A
- More general also consider gaps, extensions,
101Blast Raw Score
- R a I b X - c O - d G, where
- I is the number of identities in the alignment
and a is the reward for each identity - X is the number of mismatches in the alignment
and b is the reward for each mismatch - O is the number of gaps and c is the penalty for
each gap - G is the number of - characters in the
alignment and d is the penalty for each - The values for a,b,c,d appear at the bottom of a
Blast report. For BLASTn they are a1, b-3, c5,
d2
102Example
- Query 1 atgctctggccacggcacttgcgga
-
- Sbjt107 atgctctggccacggatcttgtgga
- tcccagggtgatctgtgcacctgcgata 53
-
- tccca---tgatatgtgcacctgcgata 156
- R 1 x 46 -3 x 4 - 5 x 1 - 2 x 3 23
- So, given the scores how significant is the
alignment?
103Significance of an alignment
- Significance of an alignment needs to be defined
with respect to a control population - Pairwise alignment How can we get control
population? - Generate sequences randomly? ? Not a good model
of real sequences - Chop up both sequences and randomly reassemble
them - Database search How can we get control
population? - Control whole database
- Align sequence to control population and see how
good result is in comparison - This is captured by Z scores, P-values and
E-values
104Z-score
- Z-score normalises the score S
- Let m be mean of population and std its standard
deviation, then Z-score (S m) / std - Z-score of 0 ? no better than average, hence
might have occurred by chance - The higher the Z-score the better
105P-value
- P-value probability of obtaining a score S
- Range 0 P 1
- Let m be the number of sequences in the control
population with score S - Let p be the size of the control population
- Then P-value m / p
- Rule of thumb
- P 10-100 exact match,
- 10-100 P 10-50 nearly identical (SNPs)
- 10-50 P 10-10 homology certain
- 10-5 P 10-1 usually distant relative
- P gt 10-1 probably insignificant
106E-values
- E-value takes also the database into account
- E-value expected frequency of a score S
- Range 0 E m, where m is the size of the
database - Relationship to P E m P
- E values are calculated from
- the bit score
- the length of the query
- the size of the database
107BLAST Bit score
- The bit score normalizes the raw score S to make
score under different settings comparable - The bit score is obtained from the raw score as
follows - S ( lambda x R - ln(K) ) / ( ln(2),
- where lambda 1.37 and K0.711
- Example
- S ( 1.37 x 23 - ln(0.711) ) / ln(2) 46
108E-value
- The E-value is then calculated as follows
- E m x n x 2 -S , where
- m is the effective length of the query
- n is the effective length of the database
- S is the bit score
- (effective length takes into account that an
alignment cannot start at the end of a sequence) - Example
- m34 (19 nucleotides fewer than the 53 submitted)
- n5,854,611,841
- Result E0.003
109Precision and Recall
- How good are BLAST and FASTA?
- True positives, tp hits which are biologically
meaningful - False positives, fp hits which are not
biologically meaningful - True negatives, tn non-hits which are not
biologically meaningful - False negatives, fn non-hits which are
biologically meaningful - Minimise fp and fn
- Recall tp/(tpfn) (meaningful hits / all
meaningful) - Precision tp/(tpfp) (meaningful hits / all
hits) - But since no objective data available difficult
to judge BLAST and FASTAs sensitivity and
specificity
110Multiple Sequence Alignments
111Multiple Sequence Alignment
- Align more than two sequences
- Choice of sequences
- If too closely related then large redundant
- If very distantly related then difficult to
generate good alignment - Additionally use colour for residues with similar
properties - Yellow Small polar GLy,Ala,Ser,Thr
- Green Hydrophobic Cys,Val,Ile,Leu, Pro,Phe
,Tyr,Met,Trp - Magenta Polar Asn,Gln,His
- Red Negatively charged Asp,Glu
- Blue Positively charged Lys, Arg
112Thioredoxins WCGPCK or R motif
113Thioredoxins Gly/Pro turn
114Thioredoxins every second hydrophobic beta
strand
115Thioredoxins ca. every 4th hydrophobic alpha
helix
116(No Transcript)
117Profiles, PSI-Blast, HMM
118Profiles
- Derive profile from multiple sequence alignment
- Useful to
- Align distantly related sequences
- Conserved regions, which may indicate active site
- Classify subfamilies within homologues
- How can profile be used to search
- Insist on profile (such as WGCPC)? Too strict
- Use frequence distribution of profile
119Consider frequencies
- Score for
- VDFSAS 1316167167
- ADATAA 11601160
- Not good to pick up distant relationships
- Better combine with substitution matrix
- Result position specific substitution matrix
120PSI-Blast
- Globin familiy (oxygen transport ) of proteins
occurs in many species - Proteins have same function and structure and
- But there are pairs of members of the family
sharing less than 10 identical residues - A B C
- PSI-BLAST idea score via intermediaries may be
better than score from direct comparison
50
50
Only 10
121PSI-BLAST
- PSI-BLAST
- 1. BLAST
- 2. Collect top hits
- 3. Build multiple sequence alignment from
significant local matches - 4. Build profile
- 5. Re-probe database with profile
- 6. Go back to 2.
122PSI-BLAST
- But beware of PSI-BLAST
- False positives propagate and spread through
iterations - If protein A consists of domains D and E, and
protein B of domains E and F and protein C of
domain F, then PSI-BLAST will relate A and C
although they do not share any domain
123Hidden Markov Model
- Procedure to generate sequences
- State transition systems with three types of
states - Deletion
- Insertion
- Match, which emits residues
- Follow probability distribution for successor
state - Train model on multiple sequence alignment
124Summary
- Evolutionary model Indels and substitutions
- Homologues vs. similarity
- Dot plots
- Easy visual exploration, but not scalable
- Dynamic programming
- Local, global, gaps
- Substitution matrices (PAM, BLOSUM)
- BLAST and FASTA
- Scores and significance
- Multiple Sequence Alignments
- Profiles, PSI-BLAST, HMM
125Phylogeny
126Motivation
- How did the nucleus get into the eucaryotic
cells? - From where are we?
- Recent Africa vs. Multi-regional Hypothese
- In 1999 Encephalitis caused by the West Nile
Virus broke out in New York. How did the virus
come to New York?
127How did the nucleus get into the eucaryotic cells?
- Simple experiment
- Blast classes genes with related functions in
yeast (Eucaryote) - against Bacteria and
- against Archaea
- And count number of significant hits
128How did the nucleus get into the eucaryotic cells?
- Mitochondria und Energy metabolism
- Significantly more hits in bacteria
- Cell organisation
- Significantly more hits in Archaea
- Fundamental Result without any experiment!
Blue Bacteria Grey Archaea
129Phylogeny
- Taxonomists aim to classify and group organisms
- E.g. Aristoteles, De Partibus Animalium
- Ought we, for instance, to begin by discussing
each separate species man, lion, ox, and the
like taking each kind in hand independently of
the rest, or ought we rather to deal first with
the attributes which they have in common in
virtue of some common element of their nature,
and proceed from this as a basis for
consideration of them separately other
130Schools of Taxonomists
- Goal create taxonomy
- Approach
- Phenotype
- Phylogeny
- 3 schools
- Phenotype only
- Evolutionary TaxonomistsPhenotype ( Phylogeny)
- Cladists Phylogeny (Phenotype)
131Practical Application Westnile virus in NY
- Westnile virus mainly in Africa
- Transmitted by insects and birds
- How did the virus get to NY in 1999
- Hundreds of DNA samples taken
- All 99.8 identical ? single entry to NY!
- Phylogenetic tree allows to deduce origin
132Example Westnil virus in NY
- How can the trees be constructed?
133Three Methods to Generate Phylogenetic Trees
- Distance-based
- Hierarchical clustering
- Character-based
- Parsimony
- Maximum likelihood
134Distance-based Approach
- Single Alignment
- Score 46 matches, 3 mismatches, 1 gap, 3 gap
extensions, - z.B. Score 46x1 - 3x1 - 1x2 - 3x1 38
- Approach
- Define distance between two sequences, e.g.
percentage of mismatches in their alignment - Construct tree, which groups sequences with
minimal distances iteratively together
135Hierarchical Clustering
136Hierarchical Clustering
- Given a distance matrix D(dij) with 1 i,j n
- Result A binary tree of clusters
- Init
- ToDo
- For all i in 1,, n do
- Let ti be a tree without children, i.e. a leaf
- ToDo ToDo ? ti
- Main loop
- While ToDo gt 1 do
- Find i,j such that dij is minimal
- Add a new column and row labelled k (i,j) to D
- For all indices h of D apart from k,i,j do
- dh,k dk,h min dh,i , dh,j // min
single linkage - Let tk be a new tree with children ti and tj
- ToDo ( ToDo ? tk ) - ti ,tj
- Remove columns and rows i,j from D
- Complexity O(n2)
137Hierarchical Clustering
- How to define distance between clusters?
- Single linkage
- dh,k dk,h min dh,i , dh,j
- Example Distance (A,B) to C is 1
- Complete linkage
- dh,k dk,h max dh,i , dh,j
- Example Distance (A,B) is C is 2
- Average linkage
- dh,k dk,h 0.5 dh,i 0.5 dh,j
- Example Distance (A,B) to C is 1.5
- Are dendrograms always the same independent of
the linkage method?
138Parsimony-method
- Approach Generate smallest tree containing all
the sequences as leaves
139Parsimony
- Generate smallest tree
- Informative vs. non-informative sites
- Build pairs with fewest possible substitutions
- Example
- 3 possible trees
- ((a,b),(c,d)) or ((a,c),(b,d)) or ((a,d),(b,c))
- 1,2,3,4 are not informative
- 5,6 are informative
- 5 ((a,b),(c,d))
- 6 ((a,c),(b,d))
140Maximum likelihood
- Assigns quantitative probabilities to mutation
events - Reconstructs ancestors for all nodes in the tree
- Assigns branch lengths based on probabilities of
the mutational events - For each possible tree topology, the assumed
substitution rates are varied to find the
parameters that give the highest likelihood of
producing the observed data
141Problems
- Character-based methods tend to be better (based
on paleontological data) - All make assumptions
- No back mutations
- Same evolutionary rate
142Assessing Quality Bootstrapping
- Given a tree obtained from one of the methods
above - Generate Multiple Alignment
- For a number of interations
- Generate new sequences by selecting columns
(possibly the same column more than once) form
the multiple alignment - Generate tree for the new sequences
- Compare this new tree with the given tree
- For each cluster in the given tree, which also
approach in the new tree, the bootstrap value is
increased - Bootstrap-Value Percentage of trees containing
the same cluster
143From where are we?
- Recent-Africa Hypothesis
- Homo Sapiens came 100-200.000 years ago from
Africa - Multi-regional Hypothesis
- Ancestors of Homo Sapiens left Africa ca.
2.000.000 years ago - Which ones right?
144Experiment
- Mitochondrial DNA form 53 humans in different
regions sequenced - Outgroup Mitochondrial DNA of chimpanzee
145A nice phylogeny (Nature 2004)
Nature October 2004 Volume 431 No. 7012
146Why Mitochondria?
- Simple genetic structure
- No repetitions
- No Pseudo genes
- No Introns
- No recombination
147Molecular Clock
- Based on genetic and paleontological the most
recent common ancestor (mrca) of chimpanzee and
homo sapiens dates back 5.000.000 years - Molecular clock
- ? 1.7 x10-8 nucleotide changes per site and year
- Assumption equal distribution, no silent
mutations - Diversity in Afrikca 3.7 x10-3 nucleotide
changes per site and year - Diversity outside Africa 1.7 x10-3 nucleotide
changes per site and year - Estimated expansion1925 generations ago ca.
40.000 years - Mrca of all humans 171.500 /- 50.000 years ago
- Mrca of African and non-African 52,000 /-
27.500 years ago - Experiment supports recent-Africa hypothesis
148Summary
- Schools of taxonomists
- Assumptions made
- Methods
- Distance-based
- Character-based