Sequence comparison and Phylogeny - PowerPoint PPT Presentation

Provided by: biotecTu

Transcript and Presenter's Notes

Title: Sequence comparison and Phylogeny


1
Sequence comparison and Phylogeny
based on Chapter 4 Lesk, Introduction to
Bioinformatics
2
Contents
  • Motivation
  • Sequence comparison and alignments
  • Dot plots
  • Dynamic programming
  • Substitution matrices
  • Dynamic programming Local and global alignments
    and gaps
  • BLAST
  • Significance of alignments
  • Multiple sequence alignments
  • Phylogenetic trees

3
Motivation
  • Where do we come from?
  • Recent Africa vs. Multi-regional Hypothesis
  • In 1999 Encephalitis caused by the West Nile
    Virus broke out in New York. How did the virus
    come to New York?
  • How did the nucleus get into the eucaryotic
    cells?
  • To answer such questions we will need sequence
    comparison and phylogenetic trees

4
Sequence Similarity Searches
  • Sequence similarity can be a clue to a common
    evolutionary ancestor
  • E.g. globin genes in chimpanzees and humans
  • or to a common function
  • E.g. the v-sis oncogene in simian sarcoma virus,
    which leads to cancer in monkeys, and the seemingly
    unrelated growth-stimulating hormone PDGF, which
    stimulates cell growth (first success of the
    similarity idea, 1983)
  • In general
  • If an unknown sequence is found, deduce its
    function/structure indirectly by finding similar
    sequences whose function/structure is known
  • Assumption: evolution changes sequences slowly,
    often maintaining the main features of a sequence's
    function/structure

5
Sequence alignment
  • Substitutions, insertions and deletions can be
    interpreted in evolutionary terms
  • But distinguish chance similarity and real
    biological relationship

(Figure: tree relating the sequences CCGTAA, TCGTAA, CCGTAT, TCGTAC, TTGTAA, TCGTAG, TAGTAC)
6
Evolution
  • Convergent evolution: the same sequence evolved
    from different ancestors
  • Back evolution: mutation to a previous sequence

(Figure: tree relating the sequences CCGTAA, TCGTAA, CCGTAT, TCGTAC, TAGTAA, TCGTAG, TAGTAC, TAGTAC, CCGTAA)
7
Similarity vs. Homology
  • Any sequences can be similar
  • Sequences are homologous if they evolved from a
    common ancestor
  • Homologous sequences
  • Orthologs: similar biological function
  • Paralogs: different biological function (after
    gene duplication), e.g. lysozyme and
    α-lactalbumin, a mammalian regulatory protein
  • Assumption: similarity is an indicator of homology
  • Note: the altered function of the expressed protein
    will determine whether the organism survives to
    reproduce, and hence passes on the altered gene

8
Sequence alignments
  • Given two or more sequences, we wish to
  • Measure their similarity
  • Determine the residue-residue correspondences
  • Observe patterns of conservation and variability
  • Infer evolutionary relationships

9
What is the best alignment?
  • Uninformative:
      -------gctgaacg
      ctataatc-------
  • Without gaps:
      gctgaacg
      ctataatc
  • With gaps:
      gctga-a--cg
      --ct-ataatc
  • Another one:
      gctg-aa-cg
      -ctataatc-
  • Formally: the best alignment has a minimal
    number of mismatches (insertions, deletions,
    replacements)
  • We need a method to systematically explore and to
    compute alignments

10
Scores for an alignment
  • Percentage of matches
  • Score each match, mismatch, gap opening, gap
    extension
  • Example
  • Match: 1
  • Mismatch: -1
  • Gap opening: -3
  • Gap extension: -1
  • Uninformative: 0% matches, score -21
      -------gctgaacg
      ctataatc-------
  • Without gaps: 25% matches, score -4
      gctgaacg
      ctataatc
  • With gaps: 0% matches, score -23
      gctga-a--cg
      --ct-ataatc
  • Another one: 50% matches, score -12
      gctg-aa-cg
      -ctataatc-

11
Scores for an alignment
  • Percentage of matches
  • Score each match, mismatch, gap opening, gap
    extension
  • Example
  • Match: 2
  • Mismatch: -1
  • Gap opening: -1
  • Gap extension: -1
  • Uninformative: 0% matches, score -17
      -------gctgaacg
      ctataatc-------
  • Without gaps: 25% matches, score -2
      gctgaacg
      ctataatc
  • With gaps: 0% matches, score -15
      gctga-a--cg
      --ct-ataatc
  • Another one: 50% matches, score 1
      gctg-aa-cg
      -ctataatc-
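
The scores on these two slides can be checked with a small scoring routine. The following is a minimal Python sketch, not part of the original slides. The function name `score_alignment` is hypothetical, and the convention that a gap of length l costs one opening penalty plus l extension penalties is an assumption inferred from the numbers shown.

```python
def score_alignment(row1, row2, match=1, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Score a pairwise alignment given as two gapped strings of equal length.

    Assumption: every gap character pays gap_extend, and the first character
    of each gap run additionally pays gap_open.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    score = 0
    gap_row = None  # row (1 or 2) holding the current gap run, or None
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            current = 1 if a == "-" else 2
            if current != gap_row:      # a new gap run starts here
                score += gap_open
            score += gap_extend
            gap_row = current
        else:
            score += match if a == b else mismatch
            gap_row = None
    return score


# First scheme: match 1, mismatch -1, gap opening -3, gap extension -1
print(score_alignment("gctg-aa-cg", "-ctataatc-"))
# Second scheme: match 2, mismatch -1, gap opening -1, gap extension -1
print(score_alignment("gctg-aa-cg", "-ctataatc-", match=2, gap_open=-1))
```

Changing only the parameters changes which alignment ranks best, which is the point the two slides make.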

12
Dot plots
13
Dot plots
  • A convenient way of comparing 2 sequences
    visually
  • Use matrix, put 1 sequence on X-axis, 1 on Y-axis
  • Cells with
  • identical characters filled with a 1,
  • non-identical with 0
  • (simplest scheme - could have weights)
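
The simplest scheme described above can be sketched in a few lines of Python. This is an illustrative sketch only; the names `dot_matrix` and `render` are hypothetical and do not come from the slides.

```python
def dot_matrix(seq_x, seq_y):
    """Return a 0/1 matrix with one row per character of seq_y (Y-axis)
    and one column per character of seq_x (X-axis)."""
    return [[1 if x == y else 0 for x in seq_x] for y in seq_y]


def render(seq_x, seq_y):
    """Render the dot plot as text, using '*' for identical characters."""
    lines = ["  " + seq_x]
    for y, row in zip(seq_y, dot_matrix(seq_x, seq_y)):
        lines.append(y + " " + "".join("*" if cell else " " for cell in row))
    return "\n".join(lines)


print(render("dorothycrowfoothodgkin", "dorothyhodgkin"))
```

Diagonal runs of `*` in the printed plot correspond to stretches the two sequences share.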

14
Dot plots
15
Dot plots
16
Interpreting dot plots
  • What do identical sequences look like?
  • What do unrelated sequences look like?
  • What do distantly related sequences look like?
  • What does a reversed sequence look like?
  • Relevant for the detection of stems in RNA structure
  • What does a palindrome look like?
  • Relevant for restriction enzymes
  • What do repeats look like?
  • What does a protein with domains A and B and
    another one with domains B and C look like?

17
Dot plot for identical sequences
18
Dotplot for unrelated sequences
19
Dotplot for distantly related sequences
20
Dotplot for reverse sequences
21
Dotplot for reverse sequences
  • Relevant to identify stems in RNA structures
  • Plot sequence against its reverse complement

22
Palindromes and restriction enzymes
  • Madam, I'm Adam
  • Able was I ere I saw Elba (supposedly said by
    Napoleon)
  • Doc note I dissent, a fast never prevents a
    fatness, I diet on cod.
  • Because DNA is double stranded and the strands
    run antiparallel, a palindrome is defined as any
    double-stranded DNA in which reading 5' to 3'
    gives the same sequence on both strands
  • The HindIII cutting site
  • 5'-AAGCTT-3'
  • 3'-TTCGAA-5'
  • The EcoRI cutting site
  • 5'-GAATTC-3'
  • 3'-CTTAAG-5'
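
The DNA definition of a palindrome can be checked directly: a site is palindromic if it equals its own reverse complement. A minimal Python sketch follows; the helper names `reverse_complement` and `is_dna_palindrome` are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def is_dna_palindrome(seq):
    """True if both strands read the same in the 5' to 3' direction."""
    return seq.upper() == reverse_complement(seq)


for site in ("AAGCTT", "GAATTC", "GATTACA"):
    print(site, is_dna_palindrome(site))
```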

23
Dotplot of a Palindrome
24
Dotplot of repeats
25
Dotplot of Repeats/Palindrome
26
Dotplot for shared domain
27
Result
  • Dot plot of dorothycrowfoothodgkin (horizontal)
    against dorothyhodgkin (vertical)

28
Result
  • Dot plot of dorothycrowfoothodgkin (horizontal)
    against dorothyhodgkin (vertical)

29
Dotplots
  • Window size 15
  • Dot if
  • 6 matches in window

30
(No Transcript)
31
>gi|1942644|pdb|1MEG| Crystal Structure Of A Caricain D158e Mutant In Complex With E-64
Length = 216
Score = 271 bits (693), Expect = 1e-73
Identities = 142/216 (65%), Positives = 168/216 (77%), Gaps = 4/216 (1%)
Query 1
IPEYVDWRQKGAVTPVKNQGSCGSCWAFSAVVTIEGIIKIRTGNLNQYSE
QELLDCDRRS 60 PE VDWRKGAVTPVQGSCGSC
WAFSAV TEGI KIRTG L SEQELDCRRS Sbjct 1
LPENVDWRKKGAVTPVRHQGSCGSCWAFSAVATVEGINKIRTGKLVELSE
QELVDCERRS 60 Query 61 YGCNGGYPWSALQLVAQYGIHYRN
TYPYEGVQRYCRSREKGPYAAKTDGVRQVQPYNQGA 120
GC GGYP AL VA GIH R YPY Q CR G KT
GV VQP NG Sbjct 61 HGCKGGYPPYALEYVAKNGIHLRSKY
PYKAKQGTCRAKQVGGPIVKTSGVGRVQPNNEGN 120 Query
121 LLYSIANQPVSVVLQAAGKDFQLYRGGIFVGPCGNKVDHAVAAV--
--GYGPNYILIKNS 176 LL IA QPVSVV G
FQLYGGIF GPCG KVHAV AV G YILIKNS Sbjct
121 LLNAIAKQPVSVVVESKGRPFQLYKGGIFEGPCGTKVEHAVTAVGY
GKSGGKGYILIKNS 180 Query 177 WGTGWGENGYIRIKRGTGN
SYGVCGLYTSSFYPVKN 212 WGT WGE GYIRIKR
GNS GVCGLY SSYP KN Sbjct 181 WGTAWGEKGYIRIKRAPGN
SPGVCGLYKSSYYPTKN 216 1 lpenvdwrkk
gavtpvrhqg scgscwafsa vatveginki rtgklvelse
qelvdcerrs 61 hgckggyppy aleyvakngi
hlrskypyka kqgtcrakqv ggpivktsgv grvqpnnegn
121 llnaiakqpv svvveskgrp fqlykggife gpcgtkveha
vtavgygksg gkgyilikns 181 wgtawgekgy
irikrapgns pgvcglykss yyptkn
32
(No Transcript)
33
  • >gi|2624670|pdb|1AIM| Cruzain Inhibited By
    Benzoyl-Tyrosine-Alanine-Fluoromethylketone
  • Length = 215
  • Score = 121 bits (303), Expect = 3e-28
  • Identities = 78/202 (38%), Positives = 107/202
    (52%), Gaps = 13/202 (6%)
  • Query 2 PEYVDWRQKGAVTPVKNQGSCGSCWAFSAVVTIEGIIKI
    RTGNLNQYSEQELLDCDRRSY 61
  • P VDWR GAVT VKQG CGSCWAFSA E
    L SEQ L CD
  • Sbjct 2 PAAVDWRARGAVTAVKDQGQCGSCWAFSAIGNVECQWFL
    AGHPLTNLSEQMLVSCDKTDS 61
  • Query 62 GCNGGYPWSALQLVAQY---GIHYRNTYPY---EGVQRY
    CRSREKGPYAAKTDGVRQVQP 115
  • GCGG A Q YPY EG
    C A T V Q
  • Sbjct 62 GCSGGLMNNAFEWIVQENNGAVYTEDSYPYASGEGISPP
    CTTSGHTVGATITGHVELPQD 121
  • Query 116 YNQGALLYSIANQPVSVVLQAAGKDFQLYRGGIFVGPCG
    NKVDHAVAAVGYGPN----YI 171
  • Q A N PVV A Y GG
    DH V VGY Y
  • Sbjct 122 EAQIAAWLAV-NGPVAVAVDAS--SWMTYTGGVMTSCVS
    EALDHGVLLVGYNDSAAVPYW 178
  • Query 172 LIKNSWGTGWGENGYIRIKRGT 193

34
(No Transcript)
35
  • >gi|7546546|pdb|1EF7|B Chain B, Crystal
    Structure Of Human Cathepsin X
  • Length = 242
  • Score = 52.0 bits (123), Expect = 2e-07
  • Identities = 60/231 (25%), Positives = 94/231
    (40%), Gaps = 34/231 (14%)
  • Query 1 IPEYVDWRQKGAV---TPVKNQ---GSCGSCWAFSAVVT
    IEGIIKIRTGNL---NQYSEQ 51
  • P DWR V NQ CGSCWA
    I I S Q
  • Sbjct 1 LPKSWDWRNVDGVNYASITRNQHIPQYCGSCWAHASTSA
    MADRINIKRKGAWPSTLLSVQ 60
  • Query 52 ELLDCDRRSYGCNGGYPWSALQLVAQYGIHYRNTYPYEG
    VQRYCR--------SREKGPY 103
  • DC C GG S QGI Y
    C K
  • Sbjct 61 NVIDCGNAG-SCEGGNDLSVWDYAHQHGIPDETCNNYQA
    KDQECDKFNQCGTCNEFKECH 119
  • Query 104 AAKTDGVRQVQPYN-----QGALLYSIANQPVSVVLQAA
    GKDFQLYRGGIFVGPCGNK-V 157
  • A V Y AN PS A
    Y GGI
  • Sbjct 120 AIRNYTLWRVGDYGSLSGREKMMAEIYANGPISCGIMAT
    ER-LANYTGGIYAEYQDTTYI 178
  • Query 158 DHAVAAVGY----GPNYILIKNSWGTGWGENGYIRI---
    --KRGTGNSYGV 199

36
(No Transcript)
37
(No Transcript)
38
Dynamic programming
39
From Dotplots to Alignments
  • Obvious best alignment:
      DOROTHYCROWFOOTHODGKIN
      DOROTHY--------HODGKIN

40
From Dotplots to Alignments
  • Find best path from top left corner to bottom
    right
  • Moving east corresponds to - in the second
    sequence
  • Moving south corresponds to - in the first
    sequence
  • Moving southeast corresponds to a match if the
    characters are the same or a mismatch otherwise
  • Can we automate this?

41
From Dotplots to Alignments
  • Algorithm (Dynamic Programming)
  • Insert a row 0 and column 0 initialised with 0
  • Starting from the top left, move down row by row
    from row 1 and
  • right column by column from column 1 visiting
    each cell
  • Consider
  • The value of the cell north
  • The value of the cell west
  • The value of the cell northwest if the row/column
    characters mismatch
  • 1 + the value of the cell northwest if the
    row/column characters match
  • Put down the maximum of these values as the value
    for the current cell
  • Trace back the path with the highest values from
    the bottom right to the top left and output the
    alignment

42
From Dotplots to Alignments
         0  1  2  3  4  5  6
            T  G  C  A  T  A
    0
    1 A
    2 T
    3 C
    4 T
    5 G
    6 A
    7 T

43
From Dotplots to Alignments
         0  1  2  3  4  5  6
            T  G  C  A  T  A
    0    0  0  0  0  0  0  0
    1 A  0
    2 T  0
    3 C  0
    4 T  0
    5 G  0
    6 A  0
    7 T  0
  • Insert a row 0 and column 0 initialised with 0

44
From Dotplots to Alignments
         0  1  2  3  4  5  6
            T  G  C  A  T  A
    0    0  0  0  0  0  0  0
    1 A  0  0  0  0  1  1  1
    2 T  0
    3 C  0
    4 T  0
    5 G  0
    6 A  0
    7 T  0
  • Consider
  • Value north
  • Value west
  • Value northwest if the row/column characters
    mismatch
  • 1 + value northwest if the row/column characters
    match
  • Put down the maximum of these values for the
    current cell

45
From Dotplots to Alignments
         0  1  2  3  4  5  6
            T  G  C  A  T  A
    0    0  0  0  0  0  0  0
    1 A  0  0  0  0  1  1  1
    2 T  0  1  1  1  1  2  2
    3 C  0  1  1  2  2  2  2
    4 T  0  1  1  2  2  3  3
    5 G  0  1  2  2  2  3  3
    6 A  0  1  2  2  3  3  4
    7 T  0  1  2  2  3  4  4

46
Reading the Alignment
         0  1  2  3  4  5  6
            T  G  C  A  T  A
    0    0  0  0  0  0  0  0
    1 A  0  0  0  0  1  1  1
    2 T  0  1  1  1  1  2  2
    3 C  0  1  1  2  2  2  2
    4 T  0  1  1  2  2  3  3
    5 G  0  1  2  2  2  3  3
    6 A  0  1  2  2  3  3  4
    7 T  0  1  2  2  3  4  4

    -tgcat-a-
    at-c-tgat
47
Reading the Alignment: there is more than one
possibility
         0  1  2  3  4  5  6
            T  G  C  A  T  A
    0    0  0  0  0  0  0  0
    1 A  0  0  0  0  1  1  1
    2 T  0  1  1  1  1  2  2
    3 C  0  1  1  2  2  2  2
    4 T  0  1  1  2  2  3  3
    5 G  0  1  2  2  2  3  3
    6 A  0  1  2  2  3  3  4
    7 T  0  1  2  2  3  4  4

    ---tgcata
    atctg-at-
48
Formally: Longest Common Subsequence (LCS)
  • What is the length $s(V,W)$ of the longest common
    subsequence of two sequences $V = v_1 \ldots v_n$ and
    $W = w_1 \ldots w_m$?
  • Find sequences of indices $1 \le i_1 < \ldots < i_k \le n$ and
    $1 \le j_1 < \ldots < j_k \le m$ such that $v_{i_t} = w_{j_t}$ for
    $1 \le t \le k$
  • How? Dynamic programming
  • $s_{i,0} = s_{0,j} = 0$ for all $1 \le i \le n$ and $1 \le j \le m$,
    and
  • $s_{i,j} = \max \{\, s_{i-1,j},\ s_{i,j-1},\ s_{i-1,j-1} + 1 \text{ if } v_i = w_j \,\}$
  • Then $s(V,W) = s_{n,m}$ is the length of the LCS


49
Example LCS
         0  1  2  3  4  5  6
            T  G  C  A  T  A
    0
    1 A
    2 T
    3 C
    4 T
    5 G
    6 A
    7 T

50
Example LCS
         0  1  2  3  4  5  6
            T  G  C  A  T  A
    0    0  0  0  0  0  0  0
    1 A  0
    2 T  0
    3 C  0
    4 T  0
    5 G  0
    6 A  0
    7 T  0
  • Initialisation: $s_{i,0} = s_{0,j} = 0$ for all
    $1 \le i \le n$ and $1 \le j \le m$

51
Example LCS
         0  1  2  3  4  5  6
            T  G  C  A  T  A
    0    0  0  0  0  0  0  0
    1 A  0  0  0  0  1  1  1
    2 T  0
    3 C  0
    4 T  0
    5 G  0
    6 A  0
    7 T  0
  • Computing each cell:
    $s_{i,j} = \max \{\, s_{i-1,j},\ s_{i,j-1},\ s_{i-1,j-1} + 1 \text{ if } v_i = w_j \,\}$


52
Example LCS
         0  1  2  3  4  5  6
            T  G  C  A  T  A
    0    0  0  0  0  0  0  0
    1 A  0  0  0  0  1  1  1
    2 T  0  1  1  1  1  2  2
    3 C  0  1  1  2  2  2  2
    4 T  0  1  1  2  2  3  3
    5 G  0  1  2  2  2  3  3
    6 A  0  1  2  2  3  3  4
    7 T  0  1  2  2  3  4  4
  • Computing each cell:
    $s_{i,j} = \max \{\, s_{i-1,j},\ s_{i,j-1},\ s_{i-1,j-1} + 1 \text{ if } v_i = w_j \,\}$


53
LCS Algorithm
  • Complexity
  • LCS has quadratic complexity O(n m)
  • LCS(V,W)
  •   For i = 0 to n
  •     s[i,0] = 0
  •   For j = 0 to m
  •     s[0,j] = 0
  •   For i = 1 to n
  •     For j = 1 to m
  •       If v_i = w_j and s[i-1,j-1] + 1 ≥ s[i-1,j] and
          s[i-1,j-1] + 1 ≥ s[i,j-1] Then
  •         s[i,j] = s[i-1,j-1] + 1
  •         b[i,j] = North West
  •       Else if s[i-1,j] ≥ s[i,j-1] Then
  •         s[i,j] = s[i-1,j]
  •         b[i,j] = North
  •       Else
  •         s[i,j] = s[i,j-1]
  •         b[i,j] = West
  •   Return s and b

54
Printing the alignment of LCS
  • PRINT-LCS(b,V,i,j)
  •   If i = 0 or j = 0 Then Return
  •   If b[i,j] = North West Then
  •     PRINT-LCS(b,V,i-1,j-1)
  •     Print v_i
  •   Else if b[i,j] = North Then
  •     PRINT-LCS(b,V,i-1,j)
  •   Else
  •     PRINT-LCS(b,V,i,j-1)
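
The two procedures above can be combined into one runnable function. The following Python sketch follows the same tie-breaking rules (prefer North West, then North, then West) but replaces the recursive printing with an iterative traceback; the function name `lcs` and the string labels are choices made for this illustration.

```python
def lcs(v, w):
    """Return (score table s, one longest common subsequence of v and w)."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]      # row 0 and column 0 stay 0
    b = [[None] * (m + 1) for _ in range(n + 1)]   # backtracking pointers
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # the diagonal move is only allowed when the characters match
            diag = s[i - 1][j - 1] + 1 if v[i - 1] == w[j - 1] else -1
            if diag >= s[i - 1][j] and diag >= s[i][j - 1]:
                s[i][j], b[i][j] = diag, "NW"
            elif s[i - 1][j] >= s[i][j - 1]:
                s[i][j], b[i][j] = s[i - 1][j], "N"
            else:
                s[i][j], b[i][j] = s[i][j - 1], "W"
    # follow the pointers from the bottom right back to the border
    out, i, j = [], n, m
    while i > 0 and j > 0:
        if b[i][j] == "NW":
            out.append(v[i - 1])
            i, j = i - 1, j - 1
        elif b[i][j] == "N":
            i -= 1
        else:
            j -= 1
    return s, "".join(reversed(out))


table, subsequence = lcs("ATCTGAT", "TGCATA")
print(table[-1][-1], subsequence)
```

Both loops visit each cell once, which gives the O(n m) running time stated on the slide.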

55
Rewards/Penalties
  • We can use different schemes
  • -1 for insert/delete/mismatch
  • +1 for match
  • Consider
  • -1 + the value of the cell north
  • -1 + the value of the cell west
  • -1 + the value of the cell northwest if the
    row/column characters mismatch
  • 1 + the value of the cell northwest if the
    row/column characters match
  • Put down the maximum of these values as the value
    for the current cell
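
The scheme above changes only the cell update of the dynamic programming table. A minimal Python sketch, assuming row 0 and column 0 stay initialised with 0 as on the earlier slides; the function name `fill_table` and its parameter names are hypothetical.

```python
def fill_table(v, w, match=1, mismatch=-1, gap=-1):
    """Fill the dynamic programming table for v (rows) against w (columns).

    Assumption: row 0 and column 0 are initialised with 0, so leading
    gaps are not penalised.
    """
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = match if v[i - 1] == w[j - 1] else mismatch
            s[i][j] = max(
                s[i - 1][j] + gap,             # cell north
                s[i][j - 1] + gap,             # cell west
                s[i - 1][j - 1] + diagonal,    # cell northwest
            )
    return s


for row in fill_table("ATCTGAT", "TGCATA"):
    print(row)
```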

56
Reading the Alignment
         0  1  2  3  4  5  6
            T  G  C  A  T  A
    0    0  0  0  0  0  0  0
    1 A  0 -1 -1 -1  1  0  1
    2 T  0  1  0 -1  0  2  1
    3 C  0  0 -1  1  0  1  1
    4 T  0  1  0  0  0  1  0
    5 G  0  0  2  1  0  0  0
    6 A  0 -1  1  1  2  1  1
    7 T  0  1  0  0  1  3  2

    ---tgcata
    atctg-at-
57
Rewards/Penalties
  • Let's refine the schemes
  • Transition mutations are more common
  • purine <-> purine, a <-> g
  • pyrimidine <-> pyrimidine, t <-> c
  • Transversions (purine <-> pyrimidine) are less
    common
  • Use a substitution matrix to rate mismatches
  • -2 for insert/delete
  • Mismatch/match according to substitution matrix
  • Consider
  • -2 + the value of the cell north
  • -2 + the value of the cell west
  • Corresponding value of the substitution matrix +
    the value of the cell northwest
  • Put down the maximum of these values as the value
    for the current cell

58
Reading the Alignment
         0  1  2  3  4  5  6
            T  G  C  A  T  A
    0    0  0  0  0  0  0  0
    1 A  0 -2  0 -2  2  0  2
    2 T  0  2  0  0  0  4  2
    3 C  0  0  0  2  0  2  2
    4 T  0  2  0  0  0  2  0
    5 G  0  0  4  2  0  0  2
    6 A  0 -2  2  2  4  2  2
    7 T  0  2  0  2  2  6  4

    ---tgcata
    atctg-at-
59
Substitution matrices
60
How to derive a substitution matrix for amino
acids?
  • Amino acids can be classified by physicochemical
    properties

61
PAM 250 matrix
Cys  12
Ser   0   2
Thr  -2   1   3
Pro  -3   1   0   6
Ala  -2   1   1   1   2
Gly  -3   1   0  -1   1   5
Asn  -4   1   0  -1   0   0   2
Asp  -5   0   0  -1   0   1   2   4
Glu  -5   0   0  -1   0   0   1   3   4
Gln  -5  -1  -1   0   0  -1   1   2   2   4
His  -3  -1  -1   0  -1  -2   2   1   1   3   6
Arg  -4   0  -1   0  -2  -3   0  -1  -1   1   2   6
Lys  -5   0   0  -1  -1  -2   1   0   0   1   0   3   5
Met  -5  -2  -1  -2  -1  -3  -2  -3  -2  -1  -2   0   0   6
Ile  -2  -1   0  -2  -1  -3  -2  -2  -2  -2  -2  -2  -2   2   5
Leu  -6  -3  -2  -3  -2  -4  -3  -4  -3  -2  -2  -3  -3   4   2   6
Val  -2  -1   0  -1   0  -1  -2  -2  -2  -2  -2  -2  -2   2   4   2   4
Phe  -4  -3  -3  -5  -4  -5  -4  -6  -5  -5  -2  -4  -5   0   1   2  -1   9
Tyr   0  -3  -3  -5  -3  -5  -2  -4  -4  -4   0  -4  -4  -2  -1  -1  -2   7  10
Trp  -8  -2  -5  -6  -6  -7  -4  -7  -7  -5  -3   2  -3  -4  -5  -2  -6   0   0  17
      C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
> 0: likely mutation; 0: random mutation; < 0: unlikely mutation
62
(No Transcript)
63
(No Transcript)
64
PAM 250 Interpretation
  • Immutable
  • Cysteine (avg. -2.8): known to have several
    unique, indispensable functions
  • attachment site of heme group in cytochrome and
    of iron sulphur FeS in ferredoxins
  • Cross links in proteins such as chymotrypsin or
    ribonuclease
  • Seldom without unique function
  • Glycine (avg. -1.6): small size may be advantageous
  • Mutable
  • Serine often functions in active site, but can be
    easily replaced
  • Self-alignment
  • Tryptophan with itself scores very high, as W
    occurs rarely

65
Point Accepted Mutations PAM
  • Substitution matrix using explicit evolutionary
    model of how amino acids change over time
  • Use parsimony method to determine frequency of
    mutations
  • Entry in PAM matrix: likelihood ratio for
    residues a and b = probability that a-b is a
    mutation / probability that a-b is chance
  • PAM x: two sequences V, W have an evolutionary
    distance of x PAM if a series of accepted point
    mutations (and no insertions/deletions) converts
    V into W, averaging x point mutations per 100
    residues
  • Mutations here: mutations in the DNA
  • Because of silent mutations and back mutations, x
    can be > 100
  • PAM 250 is the most commonly used

66
PAM and Sequence Similarity
67
PAM
  • Dayhoff, Eck, Park A model of evolutionary
    change in proteins, 1978
  • Accepted point mutation: substitution of an
    amino acid accepted by natural selection
  • Assumption: X replacing Y is as likely as Y
    replacing X
  • Used cytochrome c, hemoglobin, myoglobin, virus
    coat proteins, chymotrypsinogen, glyceraldehyde
    3-phosphate dehydrogenase, clupeine, insulin,
    ferredoxin
  • Sequences which are too distantly related have
    been omitted as they are more likely to contain
    multiple mutations per site

68
PAM Step 1
  • Step 1 Construct a multiple alignment
  • Example
  • ACGCTAFKI
  • GCGCTAFKI
  • ACGCTAFKL
  • GCGCTGFKI
  • GCGCTLFKI
  • ASGCTAFKL
  • ACACTAFKL

69
PAM Step 2
  • Create a phylogenetic tree (parsimony method)

ACGCTAFKI
  A->G: GCGCTAFKI
    A->G: GCGCTGFKI
    A->L: GCGCTLFKI
  I->L: ACGCTAFKL
    C->S: ASGCTAFKL
    G->A: ACACTAFKL
70
PAM Step 3
  • Note, the following variables
  • Residue frequency r_i is the number of occurrences
    of amino acid i in the sequences, e.g. r_A = 10 and
    r_G = 10
  • Number of residues r is the overall number of
    amino acids in all sequences, e.g. r = 63
  • Substitutability s_i is the number of
    substitutions in the tree involving amino acid i,
    e.g. s_A = 4
  • Substitution frequency s_{i,j} is the number of
    substitutions involving amino acids i and j (i.e.
    the number of i->j and j->i), e.g. s_{A,G} = 3
  • Number of substitutions s is the overall number
    of substitutions, s = 6

71
PAM Step 4
  • Relative mutability is the number of times the
    amino acid is substituted by any other amino acid
    in the tree divided by the total number of
    substitutions that could have affected the
    residue
  • Note, it is assumed that substitutions in both
    directions are equally likely
  • m_i = 100 x ( s_i x r_i ) / ( 2 s x r )
  • Example: m_A = 100 x ( 4 x 10 ) / ( 2 x 6 x 63 )
    = 5.3

72
PAM Step 5
  • Compute mutation probability
  • M_{i,j} = m_j x s_{i,j} / s_j
  • Example: M_{G,A} = 5.3 x 3 / 4 = 3.975

73
PAM Step 6
  • Finally the entry in the PAM Matrix
  • R_{i,j} = log ( M_{i,j} / ( r_i / r ) )
  • Example: R_{G,A} = log ( 3.975 / (10/63) ) = 1.4
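
The worked numbers of steps 3 to 6 can be reproduced from the example alignment and tree. The Python sketch below is illustrative only: the variable and function names are hypothetical, the list of substitutions is read off the tree in step 2, and the logarithm is assumed to be base 10 because that reproduces the value 1.4 on the slide.

```python
from collections import Counter
from math import log10

sequences = ["ACGCTAFKI", "GCGCTAFKI", "ACGCTAFKL", "GCGCTGFKI",
             "GCGCTLFKI", "ASGCTAFKL", "ACACTAFKL"]
# substitutions along the branches of the tree in step 2
substitutions = [("A", "G"), ("I", "L"), ("A", "G"),
                 ("A", "L"), ("C", "S"), ("G", "A")]

counts = Counter("".join(sequences))   # residue frequencies r_i
r = sum(counts.values())               # number of residues r
s = len(substitutions)                 # number of substitutions s


def substitutability(aa):
    """s_i: substitutions in the tree that involve amino acid aa."""
    return sum(aa in pair for pair in substitutions)


def pair_count(a, b):
    """s_ij: substitutions a->b or b->a."""
    return sum(set(pair) == {a, b} for pair in substitutions)


def relative_mutability(aa):
    """m_i as defined in step 4."""
    return 100 * (substitutability(aa) * counts[aa]) / (2 * s * r)


m_A = relative_mutability("A")
M_GA = m_A * pair_count("G", "A") / substitutability("A")   # step 5
R_GA = log10(M_GA / (counts["G"] / r))                      # step 6, base 10 assumed
print(round(m_A, 1), round(M_GA, 2), round(R_GA, 1))
```

The slide rounds m_A to 5.3 before step 5, so its intermediate value 3.975 differs slightly from the unrounded result computed here.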

74
PAM Step 6
  • For the entries on the diagonal
  • m_j relative mutability, so 1 - m_j relative
    immutability
  • R_{j,j} = log ( (1 - m_j) / ( r_j / r ) )

75
BLOSUM
  • Different approach to PAM
  • BLOcks SUbstitution Matrix (based on BLOCKS
    database)
  • Generation of BLOSUM x
  • Group highly similar sequences and replace them
    by a representative sequence.
  • Only consider sequences with no more than x%
    similarity
  • Align sequences (no gaps)
  • For any pair of amino acids a,b and for all
    columns c of the alignment, let q(a,b) be the
    number of co-occurrences of a,b in all columns c.
  • Let p(a) be the overall probability of a
    occurring
  • BLOSUM entry for a,b is log2 ( q(a,b) / (
    p(a)p(b) ) )
  • BLOSUM 50 and BLOSUM 62 widely used

76
LCS Algorithm (Longest Common Subsequence)
Revisited
  • Algorithm (Dynamic Programming) with Substitution
    Matrix
  • Insert a row 0 and column 0 initialised with 0
  • Starting from the top left, move down row by row
    from row 1 and
  • right column by column from column 1 visiting
    each cell
  • Consider
  • The value of the cell north
  • The value of the cell west
  • The value of the cell northwest if the row/column
    characters mismatch
  • s + the value of the cell northwest, where s is
    the value in the substitution matrix for the
    residues in row/column
  • Put down the maximum of these values as the value
    for the current cell
  • Trace back the path with the highest values from
    the bottom right to the top left and output the
    alignment

77
LCS Revisited Formally
  • What is the length $s(V,W)$ of the longest common
    subsequence of two sequences $V = v_1 \ldots v_n$ and
    $W = w_1 \ldots w_m$?
  • Find sequences of indices $1 \le i_1 < \ldots < i_k \le n$ and
    $1 \le j_1 < \ldots < j_k \le m$ such that $v_{i_t} = w_{j_t}$ for
    $1 \le t \le k$
  • How? Dynamic programming
  • $s_{i,0} = s_{0,j} = 0$ for all $1 \le i \le n$ and $1 \le j \le m$,
    and
  • $s_{i,j} = \max \{\, s_{i-1,j},\ s_{i,j-1},\ s_{i-1,j-1} + t \,\}$,
    where $t$ is the value for $v_i$ and $w_j$ in
    the substitution matrix
  • Then $s(V,W) = s_{n,m}$ is the length of the LCS


78
Dynamic programming revisited: local and global
alignments and gaps
79
Evolution and Alignments
  • Alignments can be interpreted in evolutionary
    terms
  • Identical letters are aligned. Interpretation:
    part of the same ancestral sequence and not
    changed
  • Non-identical letters are aligned
    (substitution). Interpretation: mutation
  • Gaps. Interpretation: insertions and deletions
    (indels)

80
Evolution and Alignments
  • Specific problems aligning DNA
  • Frame shift
  • DNA triplets code amino acids
  • Indel of one nucleotide shifts the whole sequence
    of triplets
  • Thus may have a global effect and change all
    coded amino acids
  • Silent mutation
  • Substitution in DNA leaves the encoded amino acid
    unchanged
  • Nonsense mutation
  • Substitution to a stop codon

81
Local and Global Alignments
  • Global alignment (Needleman-Wunsch) algorithm
    finds overall best alignment
  • Example members of a protein family, e.g.
    globins are very conserved and have the same
    length in different organisms from fruit fly to
    humans
  • Local alignment (Smith-Waterman) algorithm finds
    locally best alignment
  • most widely used, as
  • e.g. genes from different organisms retain
    similar exons, but may have different introns
  • e.g. homeobox gene, which regulates embryonic
    development occurs in many species, but very
    different apart from one region called
    homeodomain
  • e.g. proteins share some domains, but not all

82
Local Alignment
  • LCS $s(V,W)$ computes the globally best alignment
  • Often it is better to maximise locally, i.e.
    compute the maximal $s(v_i \ldots v_{i'},\ w_j \ldots w_{j'})$ over all
    substrings of V and W
  • Can we adapt the algorithm?
  • Global alignment: longest path in matrix s from
    (0,0) to (n,m)
  • Local alignment: longest path in matrix s from
    any (i,j) to any (i',j')
  • Modify the definition of s by adding edges of
    weight 0 from the source to every other vertex,
    creating a free jump to any starting position (i,j)

83
Local Alignment
  • Modify the definition of s as follows
  • $s_{i,0} = s_{0,j} = 0$ for all $1 \le i \le n$ and $1 \le j \le m$,
    and
  • $s_{i,j} = \max \{\, 0,\ s_{i-1,j},\ s_{i,j-1},\ s_{i-1,j-1} + t \,\}$,
    where $t$ is the value for $v_i$, $w_j$ in the
    substitution matrix
  • Then $s(V,W) = \max_{i,j} s_{i,j}$ is the length of the
    local LCS
  • This computes the longest path in the edit graph
  • Several local alignments may have biological
    significance (consider e.g. two multi-domain
    proteins whose domains are re-ordered)

84
Aligning with Gap Penalties
  • A gap is a sequence of spaces in an alignment
  • So far, we considered only insertion and deletion
    of single nucleotides or amino acids, creating
    alignments with many gaps
  • So far, the score of a gap of length l is l
  • Because insertion/deletion of monomers is an
    evolutionarily slow process, large numbers of gaps
    do not make sense
  • Instead whole substrings will be deleted or
    inserted
  • We can generalise the score of a gap to a score
    function A + B l, where A is the penalty to open
    the gap and B is the penalty to extend the gap

85
Aligning with Gap Penalties
  • High gap penalties result in shorter,
    lower-scoring alignments with fewer gaps and
  • Lower gap penalties give higher-scoring, longer
    alignments with more gaps
  • Gap opening penalty A mainly influences number of
    gaps
  • Gap extension penalty B mainly influences length
    of gaps
  • E.g. if interested in close relationships, then
    choose A, B above default values, for distant
    relationships decrease default values

86
Aligning with Gap Penalties
  • Adapt the definition of s as follows
  • $s^{del}_{i,j} = \max \{\, s^{del}_{i-1,j} - B,\ s_{i-1,j} - (A+B) \,\}$
  • $s^{ins}_{i,j} = \max \{\, s^{ins}_{i,j-1} - B,\ s_{i,j-1} - (A+B) \,\}$
  • $s_{i,j} = \max \{\, 0,\ s^{del}_{i,j},\ s^{ins}_{i,j},\ s_{i-1,j-1} + t \,\}$,
    where $t$ is the value for $v_i$, $w_j$ in the
    substitution matrix
  • Then $s(V,W) = \max_{i,j} s_{i,j}$ is the length of the
    local LCS with gap penalties A and B
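
The three coupled recurrences translate almost line by line into code. The Python sketch below computes only the best score and makes several assumptions: a match/mismatch score stands in for the substitution matrix, the default penalties are arbitrary, and the name `local_affine_score` is hypothetical.

```python
NEG_INF = float("-inf")


def local_affine_score(v, w, A=2, B=1, match=2, mismatch=-1):
    """Best local alignment score when a gap of length l costs A + B*l."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    s_del = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # path ends with a move north
    s_ins = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # path ends with a move west
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # either extend an open gap (cost B) or open a new one (cost A + B)
            s_del[i][j] = max(s_del[i - 1][j] - B, s[i - 1][j] - (A + B))
            s_ins[i][j] = max(s_ins[i][j - 1] - B, s[i][j - 1] - (A + B))
            t = match if v[i - 1] == w[j - 1] else mismatch
            s[i][j] = max(0, s_del[i][j], s_ins[i][j], s[i - 1][j - 1] + t)
            best = max(best, s[i][j])
    return best


# ten matches (2 each) minus one gap of length 2 (2 + 1*2)
print(local_affine_score("ABCDEFGHIJ", "ABCDEXXFGHIJ"))
```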




87
FASTA and BLAST
88
Motivation
  • As in dotplots, the underlying data structure for
    dynamic programming is a table
  • Given two sequences of length n, dynamic
    programming takes time proportional to $n^2$
  • Given a database with m sequences, comparing a
    query sequence to the whole database takes time
    proportional to $m \cdot n^2$
  • What does this mean?
  • Imagine you need to fill in the tables by hand
    and it takes 10 seconds to fill in one cell
  • Assume there are 1,000,000 sequences, each 100
    amino acids long
  • How long does it take?

89
  • 1,000,000 x 100 x 100 x 10 sec = 10^11 sec =
    27,777,778 h = 1,157,407 days = 3170 years
  • Even if a computer does not take 10 sec, but just
    0.1 ms to fill in one cell, it would still be 12
    days.
  • We cannot do anything about the database size,
    but can we do something about the table size?

90
An idea Prune the search space
91
Another idea
  • Did we formulate the problem correctly?
  • Do we need the alignments for all sequences in
    the database?
  • No, only for reasonable hits: introduce a
    threshold
  • A reasonable alignment will contain short
    stretches of perfect matches
  • Find these first, then extend them to connect
    them as best possible

92
FASTA and BLAST
  • FASTA and BLAST are faster than dynamic programming
    (5 times and 50 times respectively)
  • Underlying idea for a heuristic
  • High-scoring alignments will contain short
    stretches of identical letters, called words
  • FASTA and BLAST first search for matches of words
    of a given length and score threshold
  • BLAST for words of length 3 for proteins and 11
    for DNA
  • FASTA for words of length 2 for proteins and 6
    for DNA
  • Next, matches are extended to local (BLAST) and
    global (FASTA) alignments

93
FASTA and BLAST
  • More formally: if the strings $V = v_1 \ldots v_m$ and
    $W = w_1 \ldots w_m$ match with at most k mismatches, then
    they share a p-tuple for $p = \lfloor m/(k+1) \rfloor$, i.e.
    $v_i \ldots v_{i+p-1} = w_j \ldots w_{j+p-1}$ for some $1 \le i, j \le m-p+1$
  • FILTRATION ALGORITHM, which detects all matching
    words of length m with up to k mismatches
  • Potential match detection: find all matches of
    p-tuples of V, W (can be done in linear time by
    inserting them into a hash table)
  • Potential match verification: verify each
    potential match by extending it to the left and
    right until either the first k+1 mismatches are
    found or the beginning or end of the sequences
    is reached
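
The detection step can be sketched with an ordinary dictionary acting as the hash table. The Python below is a minimal illustration of that step only, not of BLAST or FASTA themselves; the function names `word_length` and `shared_words` are hypothetical, and the verification by extension is left out.

```python
from collections import defaultdict


def word_length(m, k):
    """Length p of a word that two length-m strings with at most k
    mismatches are guaranteed to share: floor(m / (k + 1))."""
    return m // (k + 1)


def shared_words(v, w, p):
    """Return all (i, j) with v[i:i+p] == w[j:j+p] (0-based positions)."""
    index = defaultdict(list)
    for i in range(len(v) - p + 1):        # hash every p-tuple of v
        index[v[i:i + p]].append(i)
    hits = []
    for j in range(len(w) - p + 1):        # look up every p-tuple of w
        for i in index.get(w[j:j + p], []):
            hits.append((i, j))
    return hits


print(word_length(12, 2))
print(shared_words("GATTACA", "TTAC", 3))
```

Each hit is a candidate position from which an extension to the left and right would then be attempted.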

94
Example for BLAST
  • Search SWISSPROT for Immunoglobulin

95
Example for BLAST
  • Search BLAST (www.ncbi.nlm.nih.gov/BLAST/) for
    P11912

96
Example for BLAST
  • Distribution of Hits

97
Example for BLAST Top Hits
    Sequences producing significant alignments                          Score  E-Value
    gi|547896|sp|P11912|C79A_HUMAN B-cell antigen receptor comp...        473  e-133
    gi|728993|sp|P40293|C79A_BOVIN B-cell antigen receptor comp...        312  3e-85
    gi|126779|sp|P11911|C79A_MOUSE B-cell antigen receptor comp...        278  5e-75
    gi|728994|sp|P40259|C79B_HUMAN B-cell antigen receptor comp...         55  1e-07
    gi|125781|sp|P01618|KV1_CANFA IG KAPPA CHAIN V REGION GOM              38  0.019
    gi|125361|sp|P17948|VGR1_HUMAN Vascular endothelial growth ...         37  0.042
    gi|549319|sp|P35969|VGR1_MOUSE Vascular endothelial growth ...         36  0.052
    gi|114764|sp|P15530|C79B_MOUSE B-cell antigen receptor comp...         36  0.064
    gi|1718161|sp|P53767|VGR1_RAT Vascular endothelial growth f...         35  0.080
    gi|125735|sp|P01681|KV01_RAT Ig kappa chain V region S211              35  0.095
    gi|1730075|sp|P01625|KV4A_HUMAN IG KAPPA CHAIN V-IV REGION LEN         34  0.26
    gi|1718188|sp|P52583|VGR2_COTJA Vascular endothelial growth...         33  0.28
    gi|125833|sp|P06313|KV4B_HUMAN IG KAPPA CHAIN V-IV REGION J...         33  0.30
    gi|125806|sp|P01658|KV3F_MOUSE IG KAPPA CHAIN V-III REGION ...         33  0.30
    gi|125808|sp|P01659|KV3G_MOUSE IG KAPPA CHAIN V-III REGION ...         33  0.30
    gi|1172451|sp|Q05793|PGBM_MOUSE Basement membrane-specific ...         33  0.33
    gi|125850|sp|P01648|KV5O_MOUSE Ig kappa chain V-V region HP...         33  0.36
    gi|125830|sp|P06312|KV40_HUMAN Ig kappa chain V-IV region p...         33  0.38
    gi|2501738|sp|Q06639|YD03_YEAST Putative 101.7 kDa transcri...         33  0.41

98
Example for BLAST Alignment
>gi|126779|sp|P11911|C79A_MOUSE B-cell antigen
receptor complex associated protein
alpha-chain precursor (IG-alpha) (MB-1 membrane
glycoprotein) (Surface-IGM-associated protein)
(Membrane-bound immunoglobulin associated
protein) (CD79A)
Length = 220
Score = 278 bits (711), Expect = 5e-75
Identities = 150/226 (66%), Positives = 165/226 (73%), Gaps = 6/226 (2%)
Query 1 MPGGPGVLQALPATIFLLFLLSAVYLGPGCQA
LWMHKVPASLMVSLGEDAHFQCPHNSSN 60 MPGG
LL LS LGPGCQAL P SL VLGEA C
N Sbjct 1 MPGG----LEALRALPLLLFLSYACLGPGCQAL
RVEGGPPSLTVNLGEEARLTC-ENNGR 55 Query 61
NANVTWWRVLHGNYTWPPEFLGPGEDPNGTLIIQNVNKSHGGIYVCRVQ
EGNESYQQSCG 120 N NTWW L N TWPP
LGPG G L VNK G CV E N SCG Sbjct
56 NPNITWWFSLQSNITWPPVPLGPGQGTTGQLFFPEVNKNTGACTG
CQVIE-NNILKRSCG 114 Query 121
TYLRVRQPPPRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQ
NEKLGLDAGD 180 TYLRVR P
PRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEK GD
D Sbjct 115 TYLRVRNPVPRPFLDMGEGTKNRIITAEGIILLFCA
VVPGTLLLFRKRWQNEKFGVDMPD 174 Query 181
EYEDENLYEGLNLDDCSMYEDISRGLQGTYQDVGSLNIGDVQLEKP
226 YEDENLYEGLNLDDCSMYEDISRGLQGTYQDVG
LIGD QLEKP Sbjct 175 DYEDENLYEGLNLDDCSMYEDISRG
LQGTYQDVGNLHIGDAQLEKP 220
99
Example for BLAST
  • Lineage Report
  • root
  • . cellular organisms
  • . . Eukaryota eukaryotes
  • . . . Fungi/Metazoa group eukaryotes
  • . . . . Bilateria animals
  • . . . . . Coelomata animals
  • . . . . . . Gnathostomata vertebrates
  • . . . . . . . Tetrapoda vertebrates
  • . . . . . . . . Amniota vertebrates
  • . . . . . . . . . Eutheria mammals
  • . . . . . . . . . . Homo sapiens (man): score 473, 33 hits [mammals],
    B-cell antigen receptor complex associated protein alpha-ch
  • . . . . . . . . . . Bos taurus (bovine): score 312, 2 hits [mammals],
    B-cell antigen receptor complex associated protein alpha-ch
  • . . . . . . . . . . Mus musculus (mouse): score 278, 31 hits [mammals],
    B-cell antigen receptor complex associated protein alpha-ch
  • . . . . . . . . . . Canis familiaris (dogs): score 37, 1 hit [mammals],
    IG KAPPA CHAIN V REGION GOM
  • . . . . . . . . . . Rattus norvegicus (brown rat): score 35, 7 hits [mammals],
    Vascular endothelial growth factor receptor 1 precursor (VE
  • . . . . . . . . . . Oryctolagus cuniculus (domestic rabbit): score 29, 1 hit [mammals],
    IG KAPPA CHAIN V REGION K29-213
  • . . . . . . . . . Coturnix japonica: score 33, 2 hits [birds],
    Vascular endothelial growth factor receptor 2 precursor (VE
  • . . . . . . . . . Gallus gallus (chickens): score 31, 4 hits [birds],
    CILIARY NEUROTROPHIC FACTOR RECEPTOR ALPHA PRECURSOR (CNTFR

100
How good is an alignment?
  • Be careful: Fitch/Smith found 17 alignments for the
    alpha- and beta-chains of chicken haemoglobin
  • Only one of them is correct (according to the
    structure)
  • Given an alignment, how good is it?
  • Percentage of matching residues, i.e. number of
    matches divided by the length of the shorter sequence
  • Advantage: independent of sequence length
  • E.g. ATCTGAT aligned with TGCATA: 4 matches, 4/6 = 66.67%
  • More generally: also consider gaps, gap extensions, ...
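The percentage of matching residues can be sketched in a few lines of Python. The gapped rows used in the example are an assumption: they are one way of aligning ATCTGAT with TGCATA that yields the 4 matches quoted on the slide, not necessarily the layout of the original figure.

```python
def percent_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """Matches divided by the length of the shorter (ungapped) sequence, in percent."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    # A column counts as a match only if both rows carry the same residue.
    matches = sum(1 for x, y in zip(row_a, row_b) if x == y and x != gap)
    len_a = sum(1 for x in row_a if x != gap)
    len_b = sum(1 for y in row_b if y != gap)
    return 100.0 * matches / min(len_a, len_b)


# Assumed gapped layout of the slide's two sequences.
print(round(percent_identity("ATCTG-AT-", "---TGCATA"), 2))
```

Dividing by the shorter sequence keeps the measure from dropping just because one sequence is much longer than the other.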

101
Blast Raw Score
  • R = a x I + b x X - c x O - d x G, where
  • I is the number of identities in the alignment
    and a is the reward for each identity
  • X is the number of mismatches in the alignment
    and b is the (negative) reward for each mismatch
  • O is the number of gaps (gap openings) and c is the
    penalty for each gap
  • G is the number of "-" characters in the
    alignment and d is the penalty for each
  • The values for a, b, c, d appear at the bottom of a
    BLAST report. For BLASTn they are a = 1, b = -3, c = 5,
    d = 2

102
Example
  • Query   1 atgctctggccacggcacttgcggatcccagggtgatctgtgcacctgcgata 53
  • Sbjct 107 atgctctggccacggatcttgtggatccca---tgatatgtgcacctgcgata 156
  • R = 1 x 46 + (-3) x 4 - 5 x 1 - 2 x 3 = 23
  • So, given the scores how significant is the
    alignment?
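As a check on the arithmetic, the raw score of the example alignment can be recomputed from the two aligned rows. This is a minimal sketch: the function name is hypothetical, the default parameters are the BLASTn values quoted on the previous slide, and adjacent gaps in the two rows are treated as a single gap opening for simplicity.

```python
def blast_raw_score(query_row, subject_row, a=1, b=-3, c=5, d=2, gap="-"):
    """R = a*I + b*X - c*O - d*G for two aligned rows of equal length."""
    identities = mismatches = gap_openings = gap_chars = 0
    in_gap = False
    for q, s in zip(query_row, subject_row):
        if q == gap or s == gap:
            gap_chars += 1            # every "-" character counts towards G
            if not in_gap:
                gap_openings += 1     # a new run of "-" counts towards O
            in_gap = True
        else:
            in_gap = False
            if q == s:
                identities += 1
            else:
                mismatches += 1
    return a * identities + b * mismatches - c * gap_openings - d * gap_chars


query = "atgctctggccacggcacttgcggatcccagggtgatctgtgcacctgcgata"
sbjct = "atgctctggccacggatcttgtggatccca---tgatatgtgcacctgcgata"
print(blast_raw_score(query, sbjct))  # 46 identities, 4 mismatches, 1 gap, 3 "-"
```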

103
Significance of an alignment
  • The significance of an alignment needs to be defined
    with respect to a control population
  • Pairwise alignment: how can we get a control
    population?
  • Generate sequences randomly? -> Not a good model
    of real sequences
  • Chop up both sequences and randomly reassemble
    them
  • Database search: how can we get a control
    population?
  • Control: the whole database
  • Align the sequence to the control population and see
    how good the result is in comparison
  • This is captured by Z-scores, P-values and
    E-values

104
Z-score
  • The Z-score normalises the score S
  • Let m be the mean of the population and std its standard
    deviation, then Z-score = (S - m) / std
  • Z-score of 0 -> no better than average, hence
    might have occurred by chance
  • The higher the Z-score the better
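A minimal sketch of the Z-score, assuming the control scores have already been obtained, for instance by aligning against shuffled copies of a sequence. The helper `shuffled` is a hypothetical stand-in for "chop up and randomly reassemble"; it shuffles single residues, and the numbers in the example are made up.

```python
import random
import statistics


def shuffled(sequence: str, rng: random.Random) -> str:
    """Control sequence with the same composition but random order."""
    letters = list(sequence)
    rng.shuffle(letters)
    return "".join(letters)


def z_score(score: float, control_scores: list[float]) -> float:
    """Z = (S - m) / std over the control population."""
    m = statistics.mean(control_scores)
    std = statistics.pstdev(control_scores)
    if std == 0:
        raise ValueError("control scores must not all be identical")
    return (score - m) / std


# Made-up control scores, only to show the arithmetic.
print(round(z_score(23, [10, 12, 14]), 2))
```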

105
P-value
  • P-value = probability of obtaining a score >= S
  • Range: 0 <= P <= 1
  • Let m be the number of sequences in the control
    population with score >= S
  • Let p be the size of the control population
  • Then P-value = m / p
  • Rule of thumb:
  • P <= 10^-100: exact match
  • 10^-100 < P < 10^-50: nearly identical (SNPs)
  • 10^-50 < P < 10^-10: homology certain
  • 10^-5 < P < 10^-1: usually distant relative
  • P > 10^-1: probably insignificant
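The ratio m / p can be computed directly from a list of control scores. This sketch reads "with score S" as "with score at least S", which is an assumption about the comparison symbol lost from the slide; the control scores are made up.

```python
def empirical_p_value(score: float, control_scores: list[float]) -> float:
    """Fraction of the control population scoring at least as well as `score`."""
    m = sum(1 for s in control_scores if s >= score)  # controls reaching the score
    p = len(control_scores)                           # size of the control population
    return m / p


# Made-up control population of five scores.
print(empirical_p_value(23, [5, 30, 12, 23, 8]))
```

With a small control population such an estimate cannot resolve values like 10^-50; those come from an analytic model of the score distribution.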

106
E-values
  • The E-value also takes the database into account
  • E-value = expected frequency of a score >= S
  • Range: 0 <= E <= m, where m is the size of the
    database
  • Relationship to P: E = m x P
  • E-values are calculated from
  • the bit score
  • the length of the query
  • the size of the database

107
BLAST Bit score
  • The bit score normalizes the raw score R to make
    scores under different settings comparable
  • The bit score S is obtained from the raw score R as
    follows:
  • S = (lambda x R - ln(K)) / ln(2),
  • where lambda = 1.37 and K = 0.711
  • Example
  • S = (1.37 x 23 - ln(0.711)) / ln(2) = 46 (rounded)

108
E-value
  • The E-value is then calculated as follows:
  • E = m x n x 2^(-S), where
  • m is the effective length of the query
  • n is the effective length of the database
  • S is the bit score
  • (effective length takes into account that an
    alignment cannot start at the end of a sequence)
  • Example
  • m = 34 (19 nucleotides fewer than the 53 submitted)
  • n = 5,854,611,841
  • Result: E = 0.003
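The two formulas can be checked against the numbers on the slides with a short sketch. The function names are hypothetical; lambda and K are the values quoted for this example and depend on the scoring system in general.

```python
import math


def bit_score(raw_score: float, lam: float = 1.37, k: float = 0.711) -> float:
    """S = (lambda * R - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(m: int, n: int, bits: float) -> float:
    """E = m * n * 2^(-S) for effective query length m and database length n."""
    return m * n * 2.0 ** (-bits)


s = round(bit_score(23))                            # raw score 23 from the example
print(s)
print(round(e_value(34, 5_854_611_841, s), 3))
```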

109
Precision and Recall
  • How good are BLAST and FASTA?
  • True positives, tp: hits which are biologically
    meaningful
  • False positives, fp: hits which are not
    biologically meaningful
  • True negatives, tn: non-hits which are not
    biologically meaningful
  • False negatives, fn: non-hits which are
    biologically meaningful
  • Minimise fp and fn
  • Recall = tp/(tp+fn) (meaningful hits / all
    meaningful)
  • Precision = tp/(tp+fp) (meaningful hits / all
    hits)
  • But since no objective data are available, it is
    difficult to judge the sensitivity and specificity
    of BLAST and FASTA
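A small sketch of the two measures; the counts in the example are invented purely to show the arithmetic.

```python
def recall(tp: int, fn: int) -> float:
    """Meaningful hits divided by everything that is meaningful."""
    return tp / (tp + fn)


def precision(tp: int, fp: int) -> float:
    """Meaningful hits divided by all reported hits."""
    return tp / (tp + fp)


# Hypothetical search: 8 meaningful hits, 8 spurious hits, 2 relatives missed.
print(recall(tp=8, fn=2), precision(tp=8, fp=8))
```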

110
Multiple Sequence Alignments
111
Multiple Sequence Alignment
  • Align more than two sequences
  • Choice of sequences
  • If too closely related, then largely redundant
  • If very distantly related, then difficult to
    generate a good alignment
  • Additionally use colour for residues with similar
    properties
  • Yellow (small polar): Gly, Ala, Ser, Thr
  • Green (hydrophobic): Cys, Val, Ile, Leu, Pro, Phe,
    Tyr, Met, Trp
  • Magenta (polar): Asn, Gln, His
  • Red (negatively charged): Asp, Glu
  • Blue (positively charged): Lys, Arg

112
Thioredoxins: WCGPC(K or R) motif
113
Thioredoxins: Gly/Pro turn
114
Thioredoxins: every second residue hydrophobic in
beta strand
115
Thioredoxins: ca. every 4th residue hydrophobic in
alpha helix
117
Profiles, PSI-Blast, HMM
118
Profiles
  • Derive a profile from a multiple sequence alignment
  • Useful to
  • Align distantly related sequences
  • Identify conserved regions, which may indicate an
    active site
  • Classify subfamilies within homologues
  • How can a profile be used to search?
  • Insist on the profile (such as WCGPC)? Too strict
  • Use the frequency distribution of the profile

119
Consider frequencies
  • Score for
  • VDFSAS 1316167167
  • ADATAA 11601160
  • Not good at picking up distant relationships
  • Better: combine with a substitution matrix
  • Result: position-specific substitution matrix
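Scoring by column frequencies can be sketched as follows. The four-row alignment is a hypothetical toy example built around the thioredoxin motif, not the alignment from the slide, and the function names are illustrative.

```python
from collections import Counter


def build_profile(alignment: list[str]) -> list[dict[str, float]]:
    """Per-column residue frequencies of an ungapped multiple alignment."""
    n = len(alignment)
    return [{res: cnt / n for res, cnt in Counter(col).items()}
            for col in zip(*alignment)]


def frequency_score(profile: list[dict[str, float]], sequence: str) -> float:
    """Sum of the frequencies of the sequence's residues in each column."""
    return sum(col.get(res, 0.0) for col, res in zip(profile, sequence))


toy_alignment = ["WCGPCK", "WCGPCR", "WCAPCK", "WCGPCK"]   # hypothetical rows
profile = build_profile(toy_alignment)
print(frequency_score(profile, "WCGPCK"))   # close family member
print(frequency_score(profile, "WCSPCH"))   # unseen residues contribute nothing
```

A residue that never occurs in a column scores zero even if it is chemically similar to the observed ones, which is why pure frequencies miss distant relatives and a substitution matrix is mixed in.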

120
PSI-Blast
  • The globin family of proteins (oxygen transport)
    occurs in many species
  • The proteins have the same function and structure
  • But there are pairs of members of the family
    sharing less than 10% identical residues
  • Example: A and B are 50% identical, B and C are 50%
    identical, but A and C only 10%
  • PSI-BLAST idea: a score via intermediaries may be
    better than the score from a direct comparison

121
PSI-BLAST
  • PSI-BLAST
  • 1. BLAST
  • 2. Collect top hits
  • 3. Build multiple sequence alignment from
    significant local matches
  • 4. Build profile
  • 5. Re-probe database with profile
  • 6. Go back to 2.

122
PSI-BLAST
  • But beware of PSI-BLAST
  • False positives propagate and spread through
    iterations
  • If protein A consists of domains D and E, and
    protein B of domains E and F and protein C of
    domain F, then PSI-BLAST will relate A and C
    although they do not share any domain

123
Hidden Markov Model
  • Procedure to generate sequences
  • State transition systems with three types of
    states
  • Deletion
  • Insertion
  • Match, which emits residues
  • Follow probability distribution for successor
    state
  • Train model on multiple sequence alignment
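A much simplified sketch of such a generator. Real profile HMMs use transition probabilities that depend on the current state and are trained from the alignment; here two fixed probabilities (`p_insert`, `p_delete`) and hand-written emission tables are assumed purely for illustration.

```python
import random


def generate(match_emissions, p_delete, p_insert, alphabet, rng):
    """Walk through the model positions and emit one sequence."""
    out = []
    for emission in match_emissions:
        # Insertion state: may emit extra residues before this position.
        while rng.random() < p_insert:
            out.append(rng.choice(alphabet))
        # Deletion state: skip this position without emitting anything.
        if rng.random() < p_delete:
            continue
        # Match state: emit a residue from the position-specific distribution.
        residues = list(emission)
        weights = [emission[r] for r in residues]
        out.append(rng.choices(residues, weights=weights)[0])
    return "".join(out)


# Hypothetical three-position model with a variable last position.
model = [{"W": 1.0}, {"C": 1.0}, {"K": 0.75, "R": 0.25}]
print(generate(model, p_delete=0.1, p_insert=0.1, alphabet="ACDEFGHIKLMNPQRSTVWY",
               rng=random.Random(0)))
```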

124
Summary
  • Evolutionary model: indels and substitutions
  • Homology vs. similarity
  • Dot plots
  • Easy visual exploration, but not scalable
  • Dynamic programming
  • Local, global, gaps
  • Substitution matrices (PAM, BLOSUM)
  • BLAST and FASTA
  • Scores and significance
  • Multiple Sequence Alignments
  • Profiles, PSI-BLAST, HMM

125
Phylogeny
126
Motivation
  • How did the nucleus get into eukaryotic
    cells?
  • Where do we come from?
  • Recent-Africa vs. multi-regional hypothesis
  • In 1999, encephalitis caused by the West Nile
    virus broke out in New York. How did the virus
    come to New York?

127
How did the nucleus get into eukaryotic cells?
  • Simple experiment
  • BLAST classes of genes with related functions in
    yeast (a eukaryote)
  • against Bacteria and
  • against Archaea
  • And count the number of significant hits

128
How did the nucleus get into eukaryotic cells?
  • Mitochondria and energy metabolism
  • Significantly more hits in Bacteria
  • Cell organisation
  • Significantly more hits in Archaea
  • Fundamental result without any experiment!

Figure legend: blue = Bacteria, grey = Archaea
129
Phylogeny
  • Taxonomists aim to classify and group organisms
  • E.g. Aristotle, De Partibus Animalium:
  • "Ought we, for instance, to begin by discussing
    each separate species (man, lion, ox, and the
    like), taking each kind in hand independently of
    the rest, or ought we rather to deal first with
    the attributes which they have in common in
    virtue of some common element of their nature,
    and proceed from this as a basis for the
    consideration of them separately?"

130
Schools of Taxonomists
  • Goal: create a taxonomy
  • Approach
  • Phenotype
  • Phylogeny
  • 3 schools
  • Pheneticists: phenotype only
  • Evolutionary taxonomists: phenotype (+ phylogeny)
  • Cladists: phylogeny (phenotype)

131
Practical Application: West Nile virus in NY
  • West Nile virus occurs mainly in Africa
  • Transmitted by insects and birds
  • How did the virus get to NY in 1999?
  • Hundreds of DNA samples taken
  • All 99.8% identical -> single entry to NY!
  • A phylogenetic tree allows the origin to be deduced

132
Example: West Nile virus in NY
  • How can the trees be constructed?

133
Three Methods to Generate Phylogenetic Trees
  • Distance-based
  • Hierarchical clustering
  • Character-based
  • Parsimony
  • Maximum likelihood

134
Distance-based Approach
  • Single alignment
  • Score: 46 matches, 3 mismatches, 1 gap, 3 gap
    extensions, ...
  • E.g. Score = 46x1 - 3x1 - 1x2 - 3x1 = 38
  • Approach
  • Define a distance between two sequences, e.g. the
    percentage of mismatches in their alignment
  • Construct a tree which iteratively groups together
    the sequences with minimal distance

135
Hierarchical Clustering
136
Hierarchical Clustering
  • Given: a distance matrix D = (d_ij) with 1 <= i,j <= n
  • Result: a binary tree of clusters
  • Init
  • ToDo := {}
  • For all i in 1, ..., n do
  • Let t_i be a tree without children, i.e. a leaf
  • ToDo := ToDo union {t_i}
  • Main loop
  • While |ToDo| > 1 do
  • Find i,j with i != j such that d_ij is minimal
  • Add a new column and row labelled k = (i,j) to D
  • For all indices h of D apart from k, i, j do
  • d_h,k = d_k,h = min(d_h,i, d_h,j) // min =
    single linkage
  • Let t_k be a new tree with children t_i and t_j
  • ToDo := (ToDo union {t_k}) - {t_i, t_j}
  • Remove columns and rows i, j from D
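  • Complexity: O(n^3) for the loop as written, since each
    of the n - 1 iterations scans the O(n^2) entries of D;
    O(n^2) is possible for single linkage with a
    specialised algorithm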
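The procedure above can be sketched in Python as follows. This is a minimal, unoptimised version: trees are nested tuples, clusters are identified by integer ids, and the 4 x 4 distance matrix is a made-up example. The `linkage` argument is the rule that combines two distances, with `min` giving single linkage.

```python
from itertools import combinations


def hierarchical_clustering(labels, distance, linkage=min):
    """Return a binary tree (nested tuples) built by agglomerative clustering."""
    trees = {i: label for i, label in enumerate(labels)}      # ToDo set of leaves
    d = {frozenset((i, j)): distance[i][j] for i, j in combinations(trees, 2)}
    next_id = len(labels)
    while len(trees) > 1:
        # Find the pair of active clusters with minimal distance.
        i, j = min(combinations(sorted(trees), 2), key=lambda p: d[frozenset(p)])
        k = next_id
        next_id += 1
        # Distance from every other active cluster h to the new cluster k.
        for h in trees:
            if h not in (i, j):
                d[frozenset((h, k))] = linkage(d[frozenset((h, i))],
                                               d[frozenset((h, j))])
        # Replace t_i and t_j by the new tree t_k.
        trees[k] = (trees.pop(i), trees.pop(j))
    return next(iter(trees.values()))


# Hypothetical symmetric distance matrix for four sequences A-D.
dist = [[0, 1, 4, 5],
        [1, 0, 3, 6],
        [4, 3, 0, 2],
        [5, 6, 2, 0]]
print(hierarchical_clustering(["A", "B", "C", "D"], dist))
```

Passing `max` or an averaging function as `linkage` switches to complete or average linkage without touching the main loop.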

137
Hierarchical Clustering
  • How to define the distance between clusters?
  • Single linkage
  • d_h,k = d_k,h = min(d_h,i, d_h,j)
  • Example: distance of (A,B) to C is 1
  • Complete linkage
  • d_h,k = d_k,h = max(d_h,i, d_h,j)
  • Example: distance of (A,B) to C is 2
  • Average linkage
  • d_h,k = d_k,h = 0.5 x d_h,i + 0.5 x d_h,j
  • Example: distance of (A,B) to C is 1.5
  • Are dendrograms always the same independent of
    the linkage method?
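The three update rules differ only in how two distances are combined. A short sketch, assuming the example figure has d(A,C) = 1 and d(B,C) = 2 (values inferred from the results quoted on the slide, since the figure itself is not in the transcript):

```python
def single_linkage(d_hi: float, d_hj: float) -> float:
    return min(d_hi, d_hj)


def complete_linkage(d_hi: float, d_hj: float) -> float:
    return max(d_hi, d_hj)


def average_linkage(d_hi: float, d_hj: float) -> float:
    return 0.5 * d_hi + 0.5 * d_hj


d_ac, d_bc = 1, 2   # assumed distances of A and B to C
for rule in (single_linkage, complete_linkage, average_linkage):
    print(rule.__name__, rule(d_ac, d_bc))
```

Because the merged distances differ, later merges can be chosen differently, so the dendrograms are not guaranteed to be the same for every linkage method.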

138
Parsimony method
  • Approach: generate the smallest tree containing all
    the sequences as leaves

139
Parsimony
  • Generate the smallest tree
  • Informative vs. non-informative sites
  • Build pairs with the fewest possible substitutions
  • Example
  • 3 possible trees:
  • ((a,b),(c,d)) or ((a,c),(b,d)) or ((a,d),(b,c))
  • Sites 1, 2, 3, 4 are not informative
  • Sites 5, 6 are informative
  • Site 5 supports ((a,b),(c,d))
  • Site 6 supports ((a,c),(b,d))
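Informative sites can be found mechanically: a site is informative if at least two different characters each occur in at least two sequences. The four sequences below are hypothetical and were chosen only so that sites 1 to 4 are uninformative and sites 5 and 6 behave as in the example; they are not the data from the slide's figure.

```python
from collections import Counter, defaultdict


def is_informative(column) -> bool:
    """At least two different characters each occur at least twice."""
    counts = Counter(column)
    return sum(1 for c in counts.values() if c >= 2) >= 2


def informative_sites(sequences: dict[str, str]) -> list[int]:
    """1-based positions of the parsimony-informative columns."""
    rows = list(sequences.values())
    return [i + 1 for i, col in enumerate(zip(*rows)) if is_informative(col)]


def supported_pairs(sequences: dict[str, str], site: int):
    """Group the sequence names that share a character at the given site."""
    groups = defaultdict(list)
    for name, seq in sequences.items():
        groups[seq[site - 1]].append(name)
    return sorted(tuple(g) for g in groups.values() if len(g) >= 2)


toy = {"a": "AAGAGT", "b": "AGCCGC", "c": "AGATAT", "d": "AGAGAC"}  # hypothetical
print(informative_sites(toy))
print(supported_pairs(toy, 5), supported_pairs(toy, 6))
```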

140
Maximum likelihood
  • Assigns quantitative probabilities to mutation
    events
  • Reconstructs ancestors for all nodes in the tree
  • Assigns branch lengths based on probabilities of
    the mutational events
  • For each possible tree topology, the assumed
    substitution rates are varied to find the
    parameters that give the highest likelihood of
    producing the observed data

141
Problems
  • Character-based methods tend to be better (based
    on paleontological data)
  • All make assumptions
  • No back mutations
  • Same evolutionary rate

142
Assessing Quality: Bootstrapping
  • Given a tree obtained from one of the methods
    above
  • Generate a multiple alignment
  • For a number of iterations
  • Generate new sequences by selecting columns
    (possibly the same column more than once) from
    the multiple alignment
  • Generate a tree for the new sequences
  • Compare this new tree with the given tree
  • For each cluster in the given tree which also
    appears in the new tree, the bootstrap value is
    increased
  • Bootstrap value: percentage of trees containing
    the same cluster
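The resampling loop can be sketched as follows. Tree construction is left to a caller-supplied function, here called `build_clusters` (a hypothetical name), which is assumed to return the set of clusters of the tree built from a replicate alignment; clusters are represented as frozensets of sequence names.

```python
import random


def resample_columns(alignment: dict[str, str], rng: random.Random) -> dict[str, str]:
    """Draw as many columns as the alignment has, with replacement."""
    names = list(alignment)
    length = len(alignment[names[0]])
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(alignment[name][c] for c in cols) for name in names}


def bootstrap_support(alignment, given_clusters, build_clusters,
                      iterations=100, seed=0):
    """Percentage of replicate trees that contain each cluster of the given tree."""
    rng = random.Random(seed)
    counts = {cluster: 0 for cluster in given_clusters}
    for _ in range(iterations):
        replicate = resample_columns(alignment, rng)
        new_clusters = build_clusters(replicate)     # tree for the new sequences
        for cluster in given_clusters:
            if cluster in new_clusters:
                counts[cluster] += 1
    return {cluster: 100.0 * n / iterations for cluster, n in counts.items()}
```

Any of the tree-building methods can be plugged in as `build_clusters`, as long as it reports the clusters of its tree in the same frozenset form.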

143
Where do we come from?
  • Recent-Africa hypothesis
  • Homo sapiens came out of Africa 100,000-200,000
    years ago
  • Multi-regional hypothesis
  • Ancestors of Homo sapiens left Africa ca.
    2,000,000 years ago
  • Which one is right?

144
Experiment
  • Mitochondrial DNA from 53 humans in different
    regions sequenced
  • Outgroup: mitochondrial DNA of a chimpanzee

145
A nice phylogeny (Nature 2004)
Nature October 2004 Volume 431 No. 7012
146
Why Mitochondria?
  • Simple genetic structure
  • No repetitions
  • No pseudogenes
  • No Introns
  • No recombination

147
Molecular Clock
  • Based on genetic and paleontological data, the most
    recent common ancestor (MRCA) of chimpanzee and
    Homo sapiens dates back 5,000,000 years
  • Molecular clock
  • -> 1.7 x 10^-8 nucleotide changes per site per year
  • Assumption: equal distribution, no silent
    mutations
  • Diversity in Africa: 3.7 x 10^-3 nucleotide
    changes per site
  • Diversity outside Africa: 1.7 x 10^-3 nucleotide
    changes per site
  • Estimated expansion: 1925 generations ago, ca.
    40,000 years
  • MRCA of all humans: 171,500 +/- 50,000 years ago
  • MRCA of Africans and non-Africans: 52,000 +/-
    27,500 years ago
  • Experiment supports the recent-Africa hypothesis

148
Summary
  • Schools of taxonomists
  • Assumptions made
  • Methods
  • Distance-based
  • Character-based