Title: Exon prediction by Genomic Sequence alignment Burkhard Morgenstern and Oliver Rinner
1 Burkhard Morgenstern, Institut für Mikrobiologie und Genetik
Molecular Evolution and Reconstruction of Phylogenetic Trees,
winter semester 2006/2007
2 - Goal
- Phylogeny reconstruction based on molecular
sequence data (DNA, RNA, protein sequences)
3Multiple sequence alignment
- Molecular phylogeny reconstruction relies on
comparative nucleic acid and protein sequence
analysis
- Alignment is the most important tool for sequence
comparison
- Multiple alignment contains more information than
pair-wise alignment
4Tools for multiple sequence alignment
- Y I M Q E V Q Q E R
-
- Sequence duplicates in history (e.g. speciation
event)
5Tools for multiple sequence alignment
6Tools for multiple sequence alignment
- Y I M Q E V Q Q E R
- Y I M Q E V Q Q E R
-
7Tools for multiple sequence alignment
- Y I M Q E A Q Q E R
- Y L M Q E V Q Q E R
- Substitutions occur
8Tools for multiple sequence alignment
- Y I M Q E A Q Q E R
- Y L M Q E V Q Q E R
9Tools for multiple sequence alignment
- YAI M Q E A Q Q E R
- Y L M - - V Q Q E R V
- Insertions/deletions (indels) occur
10Tools for multiple sequence alignment
- YAI M Q E A Q Q E R
- Y L M - - V Q Q E R V
11Tools for multiple sequence alignment
- Y A I M Q E A Q Q E R
- Y L M V Q Q E R V
- because of insertions/deletions sequence
similarity no longer immediately visible!
12Tools for multiple sequence alignment
- Y A I M Q E A Q Q E R -
- Y - L M V - - Q Q E R V
- Alignment brings together related parts of the
sequences by inserting gaps into sequences
13Tools for multiple sequence alignment
- Y A I M Q E A Q Q E R -
- Y - L M V - - Q Q E R V
14Tools for multiple sequence alignment
- Y A I M Q E A Q Q E R -
- Y - L M V - - Q Q E R V
- Mismatches correspond to substitutions
- Gaps correspond to indels
15Tools for multiple sequence alignment
-
- Pairwise alignment: alignment of two sequences
- Multiple alignment: alignment of N > 2 sequences
16Tools for multiple sequence alignment
- s1 R Y I M R E A Q Y E S A Q
- s2 R C I V M R E A Y E
- s3 Y I M Q E V Q Q E R
- s4 W R Y I A M R E Q Y E
- Assumption: sequence family related by common
ancestry; similarity due to common history
- Sequence similarity not obvious (insertions and
deletions may have happened)
17Tools for multiple sequence alignment
- s1 - R Y I - M R E A Q Y E S A Q
- s2 - R C I V M R E A - Y E - - -
- s3 - - Y I - M Q E V Q Q E R - -
- s4 W R Y I A M R E - Q Y E - - -
- Multiple alignment: arrangement of sequences by
introducing gaps
- Alignment reveals sequence similarities
18Tools for multiple sequence alignment
- s1 - R Y I - M R E A Q Y E S A Q
- s2 - R C I V M R E A - Y E - - -
- s3 - - Y I - M Q E V Q Q E R - -
- s4 W R Y I A M R E - Q Y E - - -
19Tools for multiple sequence alignment
- s1 - R Y I - M R E A Q Y E S A Q
- s2 - R C I V M R E A - Y E - - -
- s3 - - Y I - M Q E V Q Q E R - -
- s4 W R Y I A M R E - Q Y E - - -
20Tools for multiple sequence alignment
- s1 - R Y I - M R E A Q Y E S A Q
- s2 - R C I V M R E A - Y E - - -
- s3 - - Y I - M Q E V Q Q E R - -
- s4 W R Y I A M R E - Q Y E - - -
- General information in multiple alignment
- Functionally important regions more conserved
than non-functional regions
- Local sequence conservation indicates
functionality!
21Tools for multiple sequence alignment
- s1 - R Y I - M R E A Q Y E S A Q
- s2 - R C I V M R E A - Y E - - -
- s3 - - Y I - M Q E V Q Q E R - -
- s4 W R Y I A M R E - Q Y E - - -
- Phylogeny reconstruction based on multiple
alignment
- Estimate pairwise distances between sequences
(distance-based methods for tree reconstruction)
- Estimate evolutionary events in evolution
(parsimony and maximum likelihood methods)
22Tools for multiple sequence alignment
- s1 - R Y I - M R E A Q Y E S A Q
- s2 - R C I V M R E A - Y E - - -
- s3 - - Y I - M Q E V Q Q E R - -
- s4 W R Y I A M R E - Q Y E - - -
- Task in bioinformatics: find the best multiple
alignment for a given sequence set
23Tools for multiple sequence alignment
- s1 - R Y I - M R E A Q Y E S A Q
- s2 - R C I V M R E A - Y E - - -
- s3 - - Y I - M Q E V Q Q E R - -
- s4 W R Y I A M R E - Q Y E - - -
- Astronomical number of possible alignments!
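To get a feeling for how fast the number of possible alignments grows, the sketch below counts the global alignments of just two sequences. It assumes that columns consisting only of gaps are not allowed and that every distinct sequence of columns counts as a separate alignment; the function name `count_alignments` is made up for this illustration. For multiple alignments the numbers are larger still.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(m: int, n: int) -> int:
    """Count global alignments of two sequences of lengths m and n.

    Assumption: no column may contain two gaps. The last column is then
    either (residue, residue), (residue, gap) or (gap, residue).
    """
    if m == 0 or n == 0:
        # Only one way left: align the remaining residues against gaps.
        return 1
    return (
        count_alignments(m - 1, n - 1)  # last column: residue / residue
        + count_alignments(m - 1, n)    # last column: residue / gap
        + count_alignments(m, n - 1)    # last column: gap / residue
    )


# Print the count for a few sequence lengths to show the growth.
for length in (1, 2, 3, 10, 20):
    print(length, count_alignments(length, length))
```

Already for two sequences of length 3 there are 63 alignments under these assumptions.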
24Tools for multiple sequence alignment
- s1 - R Y I - M R E A Q Y E S A Q
- s2 - R C I V M R E A - - - Y E -
- s3 Y I - - - M Q E V Q Q E R - -
- s4 W R Y I A M R E - Q Y E - - -
- Astronomical number of possible alignments!
25Tools for multiple sequence alignment
- s1 - R Y I - M R E A Q Y E S A Q
- s2 - R C I V M R E A - - - Y E -
- s3 Y I - - - M Q E V Q Q E R - -
- s4 W R Y I A M R E - Q Y E - - -
- Computer has to decide which one is best??
26Tools for multiple sequence alignment
- Questions in development of alignment programs
- (1) What is a good alignment?
- -> objective function (score)
- (2) How to find a good alignment?
- -> optimization algorithm
- First question far more important!
27Tools for multiple sequence alignment
- Before defining an objective function (scoring
scheme):
- What is a biologically good alignment?
28Tools for multiple sequence alignment
- Criteria for alignment quality
- 3D structure: align residues at corresponding
positions in the 3D structure of the protein!
29Tools for multiple sequence alignment
- Criteria for alignment quality
30Tools for multiple sequence alignment
- Criteria for alignment quality
- 3D structure: align residues at corresponding
positions in the 3D structure of the protein!
31Tools for multiple sequence alignment
- Species related by common history
32Tools for multiple sequence alignment
- Genes / proteins related by common history
33Tools for multiple sequence alignment
- Criteria for alignment quality
- 3D structure: align residues at corresponding
positions in the 3D structure of the protein!
- Evolution: align residues with common ancestors!
34Tools for multiple sequence alignment
- s1 - R Y I - M R E A Q Y E S A Q
- s2 - R C I V M R E A - Y E - - -
- s3 - - Y I - M Q E V Q Q E R - -
- s4 W R Y I A M R E - Q Y E - - -
- Alignment: hypothesis about sequence evolution
- Mismatches correspond to substitutions
- Gaps correspond to insertions/deletions
35Tools for multiple sequence alignment
- s1 - R Y I - M R E A Q Y E S A Q
- s2 - R C I V M R E A - Y E - - -
- s3 - - Y I - M Q E V Q Q E R - -
- s4 W R Y I A M R E - Q Y E - - -
- Alignment: hypothesis about sequence evolution
- Search for most plausible scenario!
- Estimate probabilities for individual
evolutionary events: insertions/deletions,
substitutions
36Tools for multiple sequence alignment
- s1 - R Y I - M R E A Q Y E S A Q
- s2 - R C I V M R E A - Y E - - -
- s3 - Y - I - M Q E V Q Q E R - -
- s4 W R Y I A M R E - Q Y E - - -
- Alignment: hypothesis about sequence evolution
- Search for most plausible scenario!
- Estimate probabilities for individual
evolutionary events: insertions/deletions,
substitutions
37Tools for multiple sequence alignment
- Compute score s(a,b) for degree of similarity
between amino acids a and b, based on the
probability p_{a,b} of the substitution
a -> b (or b -> a)
- (Extremely simplified!)
38Tools for multiple sequence alignment
39Tools for multiple sequence alignment
- Reason for different substitution probabilities
p_{a,b}
- Different physical and chemical properties of
amino acids
- Amino acids with similar properties are more likely
to be substituted for each other
41Tools for multiple sequence alignment
- Use penalty for gaps introduced into alignment
- Simplest approach: linear gap costs, penalty
proportional to gap length
- Non-linear gap penalties more realistic: long gap
caused by single insertion/deletion
- Most frequently used: affine linear gap
penalties; more realistic, but efficient to
calculate!
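The difference between linear and affine gap costs can be sketched as two small functions. The parameter values and the names `linear_gap_penalty` and `affine_gap_penalty` are assumptions for illustration, and the convention used for the affine case (opening cost for the first gap position, extension cost for each further position) is only one of several in use.

```python
def linear_gap_penalty(length: int, g: float = 2.0) -> float:
    """Linear gap cost: every gap position costs the same amount g."""
    return g * length


def affine_gap_penalty(length: int, gap_open: float = 10.0,
                       gap_extend: float = 0.5) -> float:
    """Affine gap cost: expensive to open a gap, cheap to extend it.

    Assumed convention: the first gap position costs gap_open and each
    additional position costs gap_extend.
    """
    if length == 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


# One long gap is much cheaper than many short ones under affine costs,
# reflecting that a long gap can be caused by a single insertion/deletion.
print(affine_gap_penalty(6), 6 * affine_gap_penalty(1))
```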
42 - Traditional objective functions
- Define score of an alignment as
- Sum of individual similarity scores s(a,b)
- Minus gap penalties
- Needleman-Wunsch scoring system for pairwise
alignment (1970)
43Pair-wise sequence alignment
- T Y W I V
- T - - L V
- Example
- Score = s(T,T) + s(I,L) + s(V,V) - 2 g
- Assumption: linear gap penalty!
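The score of the example alignment can be recomputed with a few lines of code. This is a minimal sketch: the similarity function `toy_similarity` (2 for identical residues, -1 otherwise) and the gap cost g = 1 are invented values, not entries of a real substitution matrix.

```python
def toy_similarity(a: str, b: str) -> int:
    """Hypothetical similarity score: 2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1


def alignment_score(row1: str, row2: str, s, g: int) -> int:
    """Sum of similarity scores minus a linear gap penalty g per gap position."""
    score = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            score -= g          # gap column: subtract the penalty
        else:
            score += s(a, b)    # residue pair: add the similarity score
    return score


# s(T,T) + s(I,L) + s(V,V) - 2 g  =  2 + (-1) + 2 - 2 * 1
print(alignment_score("TYWIV", "T--LV", toy_similarity, g=1))
```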
44Pair-wise sequence alignment
- T Y W I V
- T - - L V
- Dynamic-programming algorithm finds
- alignment with best score.
- (Needleman and Wunsch, 1970)
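A compact sketch of the dynamic-programming idea follows. It assumes a linear gap penalty and a simple match/mismatch score (both values invented for illustration) instead of a real substitution matrix, so it shows the principle rather than the exact scheme of the 1970 paper.

```python
def needleman_wunsch(x: str, y: str, match: int = 2, mismatch: int = -1,
                     gap: int = 1):
    """Return (best score, aligned x, aligned y) for a global alignment."""
    n, m = len(x), len(y)
    # F[i][j] = best score for aligning the prefixes x[:i] and y[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -gap * i
    for j in range(1, m + 1):
        F[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align x[i-1] with y[j-1]
                          F[i - 1][j] - gap,     # x[i-1] against a gap
                          F[i][j - 1] - gap)     # y[j-1] against a gap
    # Traceback from the bottom-right corner to recover one optimal alignment.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                a.append(x[i - 1])
                b.append(y[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] - gap:
            a.append(x[i - 1])
            b.append("-")
            i -= 1
        else:
            a.append("-")
            b.append(y[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(a)), "".join(reversed(b))


score, row1, row2 = needleman_wunsch("TYWIV", "TLV")
print(score)
print(row1)
print(row2)
```

The two nested loops fill a table with (l1 + 1) x (l2 + 1) cells, which is where a running time proportional to the product of the sequence lengths comes from.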
45Pair-wise sequence alignment
- T Y W I V
- T - - L V
- Running time proportional to product of sequence
lengths
- Time complexity O(l1 * l2)
46Pair-wise sequence alignment
- Algorithm for pairwise alignment can be
generalized to multiple alignment of N sequences
- Time complexity O(l1 * l2 * ... * lN)
- Not feasible in practice (running time too long!)
- Heuristic necessary, i.e. a fast algorithm that
does not necessarily produce the mathematically best
alignment
47Progressive Alignment
- Most popular approach to (global) multiple
sequence alignment
- Progressive Alignment
- Since mid-Eighties: Feng/Doolittle,
Higgins/Sharp, Taylor,
48Progressive Alignment
- WCEAQTKNGQGWVPSNYITPVN
- WWRLNDKEGYVPRNLLGLYP
- AVVIQDNSDIKVVPKAKIIRD
- YAVESEAHPGSFQPVAALERIN
- WLNYNETTGERGDFPGTYVEYIGRKKISP
49Progressive Alignment
- WCEAQTKNGQGWVPSNYITPVN
- WWRLNDKEGYVPRNLLGLYP
- AVVIQDNSDIKVVPKAKIIRD
- YAVESEAHPGSFQPVAALERIN
- WLNYNETTGERGDFPGTYVEYIGRKKISP
- Guide tree
50Progressive Alignment
- WCEAQTKNGQGWVPSNYITPVN
- WW--RLNDKEGYVPRNLLGLYP-
- AVVIQDNSDIKVVP--KAKIIRD
- YAVESEASFQPVAALERIN
- WLNYNEERGDFPGTYVEYIGRKKISP
- Profile alignment, once a gap - always a gap
51Progressive Alignment
- WCEAQTKNGQGWVPSNYITPVN
- WW--RLNDKEGYVPRNLLGLYP-
- AVVIQDNSDIKVVP--KAKIIRD
- YAVESEASVQ--PVAALERIN------
- WLN-YNEERGDFPGTYVEYIGRKKISP
- Profile alignment, once a gap - always a gap
52Progressive Alignment
- WCEAQTKNGQGWVPSNYITPVN-
- WW--RLNDKEGYVPRNLLGLYP-
- AVVIQDNSDIKVVP--KAKIIRD
- YAVESEASVQ--PVAALERIN------
- WLN-YNEERGDFPGTYVEYIGRKKISP
- Profile alignment, once a gap - always a gap
53Progressive Alignment
- WCEAQTKNGQGWVPSNYITPVN--------
- WW--RLNDKEGYVPRNLLGLYP--------
- AVVIQDNSDIKVVP--KAKIIRD-------
- YAVESEA---SVQ--PVAALERIN------
- WLN-YNE---ERGDFPGTYVEYIGRKKISP
- Profile alignment, once a gap - always a gap
54Progressive Alignment
- WCEAQTKNGQGWVPSNYITPVN--------
- WW--RLNDKEGYVPRNLLGLYP--------
- AVVIQDNSDIKVVP--KAKIIRD-------
- YAVESEA---SVQ--PVAALERIN------
- WLN-YNE---ERGDFPGTYVEYIGRKKISP
- Most important implementation CLUSTAL W
55Progressive Alignment
- CLUSTAL W: Thompson et al., 1994 (17,000
citations)
- Pairwise distances as 1 - percentage of identity
- Calculate un-rooted tree with Neighbor Joining
- Define root as central position in tree
- Define sequence weights based on tree
- Gap penalties calculated based on various
parameters
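The distance measure named in the list above (1 minus the fraction of identical positions) can be sketched as follows. This is not CLUSTAL W code: the function name `identity_distance` is hypothetical, and the choice to ignore columns that contain a gap is an assumption made for this illustration.

```python
def identity_distance(row1: str, row2: str) -> float:
    """Distance = 1 - fraction of identical residues in a pairwise alignment.

    Assumption: columns containing a gap in either row are ignored.
    """
    pairs = [(a, b) for a, b in zip(row1, row2) if a != "-" and b != "-"]
    if not pairs:
        return 1.0  # nothing aligned: treat as maximally distant
    identical = sum(1 for a, b in pairs if a == b)
    return 1.0 - identical / len(pairs)


# Two of the three aligned residue pairs are identical: distance 1/3.
print(identity_distance("TYWIV", "T--LV"))
```

A matrix of such pairwise distances is the kind of input a tree-building method such as Neighbor Joining works on.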
56Tools for multiple sequence alignment
- Problems with traditional approach
- Results depend on gap penalty
- Heuristic guide tree determines alignment;
alignment used for phylogeny reconstruction
- Algorithm produces global alignments.
57Tools for multiple sequence alignment
- Problems with traditional approach
- But
- Many sequence families share only local
similarity
- E.g. sequences share one conserved motif
58The DIALIGN approach
-
- Morgenstern, Dress, Werner (1996),
PNAS 93, 12098-12103
- Combination of global and local methods
- Assemble multiple alignment from
gap-free local pair-wise alignments
("fragments")
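To make the notion of a fragment concrete, the sketch below searches two sequences for the best-scoring gap-free local alignment, i.e. a pair of equal-length segments lying on one diagonal of the comparison matrix. It uses a plain match/mismatch score, which is an assumption for illustration only; DIALIGN itself weights fragments differently and also has to select a consistent set of them, which is not shown here.

```python
def best_fragment(x: str, y: str, match: int = 1, mismatch: int = -1):
    """Return (score, start in x, start in y, length) of the best gap-free
    local alignment between x and y under a simple match/mismatch score."""
    best = (0, 0, 0, 0)
    # Each diagonal d = j - i pairs x[i] with y[i + d].
    for d in range(-(len(x) - 1), len(y)):
        i, j = (-d, 0) if d < 0 else (0, d)
        run_score, run_start = 0, 0
        k = 0
        while i + k < len(x) and j + k < len(y):
            s = match if x[i + k] == y[j + k] else mismatch
            if run_score <= 0:
                # A non-positive prefix never helps: restart the segment here.
                run_score, run_start = 0, k
            run_score += s
            if run_score > best[0]:
                best = (run_score, i + run_start, j + run_start,
                        k - run_start + 1)
            k += 1
    return best


score, start_x, start_y, length = best_fragment("CCGATTACATT", "AGATTACAGG")
print(score, start_x, start_y, length)
```

In this toy input the shared segment GATTACA is found as a fragment of length 7 without introducing any gap.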
59The DIALIGN approach
-
- atctaatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaagagtatcacccctgaattgaataa
-
60The DIALIGN approach
-
-
- atctaatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaagagtatcacccctgaattgaataa
-
61The DIALIGN approach
-
-
- atctaatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaagagtatcacccctgaattgaataa
-
62The DIALIGN approach
-
- atctaatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaagagtatcacccctgaattgaataa
-
63The DIALIGN approach
-
- atctaatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaagagtatcacccctgaattgaataa
-
64The DIALIGN approach
-
- atctaatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaagagtatcacccctgaattgaataa
-
65The DIALIGN approach
-
- atc------taatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaagagtatcacccctgaattgaataa
-
66The DIALIGN approach
-
- atc------taatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaa--gagtatcacccctgaattgaataa
-
67The DIALIGN approach
-
- atc------taatagttaaactcccccgtgcttag
- cagtgcgtgtattactaacggttcaatcgcg
- caaa--gagtatcacc----------cctgaattgaataa
-
68The DIALIGN approach
-
- atc------taatagttaaactcccccgtgc-ttag
- cagtgcgtgtattactaac----------gg-ttcaatcgcg
- caaa--gagtatcacc----------cctgaattgaataa
-
69The DIALIGN approach
Consistency!
-
- atc------taatagttaaactcccccgtgc-ttag
- cagtgcgtgtattactaac----------gg-ttcaatcgcg
- caaa--gagtatcacc----------cctgaattgaataa
-
70The DIALIGN approach
-
- atc------TAATAGTTAaactccccCGTGC-TTag
- cagtgcGTGTATTACTAAc----------GG-TTCAATcgcg
- caaa--GAGTATCAcc----------CCTGaaTTGAATaa
-
71More methods for multiple alignment
- T-Coffee
- PIMA
- Muscle
- Prrp
- Mafft
- ProbCons
-
-
72Substitution matrices
- Similarity score s(a,b) for amino acids a and b
based on probability p_{a,b} of substitution a -> b
- Idea: it is more reasonable to align amino acids
that are often replaced by each other!
73Substitution matrices
- Assumptions
- p_{a,b} does not depend on sequence position
- Sequence positions independent of each other
- p_{a,b} = p_{b,a} (symmetry!)
74Substitution matrices
- Compute score s(a,b) for degree of similarity
between amino acids a and b
- Probability p_{a,b} of substitution
a -> b (or b -> a)
- Frequency q_a of a
- Define
- s(a,b) = log( p_{a,b} / (q_a q_b) )
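The definition can be tried out numerically. In the sketch below the probabilities and frequencies are invented numbers, and the natural logarithm is an assumption, since the slide does not fix the base.

```python
import math


def log_odds_score(p_ab: float, q_a: float, q_b: float) -> float:
    """s(a,b) = log( p_{a,b} / (q_a * q_b) ), here with the natural log."""
    return math.log(p_ab / (q_a * q_b))


# Pair seen more often than expected by chance: positive score.
print(log_odds_score(0.02, 0.1, 0.1))
# Pair seen less often than expected by chance: negative score.
print(log_odds_score(0.005, 0.1, 0.1))
```

The product q_a q_b is the probability of seeing a and b paired by chance, so the score is positive exactly when the pair occurs more often in related sequences than chance would predict.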
75 Substitution matrices
76Substitution matrices
-
- To calculate p_{a,b}
- Consider alignments of related proteins and count
substitutions a -> b (or b -> a)
-
77Substitution matrices
-
- To calculate p_{a,b}
- Consider alignments of related proteins and count
substitutions a -> b (or b -> a)
- ESWTS-RQWERYTIALMSDQRREVLYWIALY
- ERWTSERQWERYTLALMS-QRREALYWIALY
78Substitution matrices
-
- To calculate p_{a,b}
- Consider alignments of related proteins and count
substitutions a -> b (or b -> a)
- ESWTS-RQWERYTIALMSDQRREVLYWIALY
- ERWTSERQWERYTLALMS-QRREALYWIALY
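Counting can be illustrated on the two example sequences shown on the slide. The sketch treats a -> b and b -> a as the same event (matching the symmetry assumption) and skips columns containing a gap; the function name `count_aligned_pairs` is hypothetical. Real estimates would of course be based on far more data than one pair of sequences.

```python
from collections import Counter


def count_aligned_pairs(row1: str, row2: str) -> Counter:
    """Count unordered residue pairs in the gap-free columns of an alignment."""
    counts = Counter()
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            continue  # gap columns carry no substitution information
        counts[tuple(sorted((a, b)))] += 1
    return counts


row1 = "ESWTS-RQWERYTIALMSDQRREVLYWIALY"
row2 = "ERWTSERQWERYTLALMS-QRREALYWIALY"
counts = count_aligned_pairs(row1, row2)
total = sum(counts.values())

# Observed substitutions (pairs of different residues) and their
# relative frequency as a crude estimate of p_{a,b}.
for (a, b), c in sorted(counts.items()):
    if a != b:
        print(a, b, c, c / total)
```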
79Substitution matrices
- Problems involved
- Probability p_{a,b} depends on time t since
sequences separated in evolution: p_{a,b} = p_{a,b}(t)
- Protein families contain multiple sequences:
phylogenetic tree must be known!
- Alignment of protein families must be known!
- Multiple mutations at one sequence position
80Substitution matrices
- M. Dayhoff et al., Atlas of Protein Sequence and
Structure, 1978
- PAM matrices
81Substitution matrices
- Calculation of p_{a,b}(t)
- Consider multiple alignments of closely related
protein families
- Count occurrence of a and b at corresponding
positions in alignments using phylogenetic tree
- Estimate p_{a,b}(t) for small times t
- Calculate conditional probabilities p(a|b,t) for
small t
- Normalize to distance 1 PAM (= percentage of
accepted mutations)
- Calculate p(a|b,t) for larger evolutionary
distances by matrix multiplication
- Calculate p_{a,b}(t) for larger evolutionary
distances
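The matrix-multiplication step can be sketched on a toy alphabet of three letters. The matrix `M1` below holds invented conditional probabilities for a small evolutionary distance (row: original letter, column: replacement; each row sums to 1); it is not Dayhoff's data. Raising it to the power t gives the conditional probabilities for a distance t times as large.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(M, t: int):
    """Return M multiplied by itself t times (identity matrix for t = 0)."""
    n = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(t):
        result = mat_mul(result, M)
    return result


# Hypothetical conditional probabilities for one small unit of distance.
M1 = [[0.990, 0.007, 0.003],
      [0.007, 0.990, 0.003],
      [0.003, 0.003, 0.994]]

M250 = mat_power(M1, 250)
for row in M250:
    print([round(p, 3) for p in row])
```

Because the powers are computed from the small-distance matrix, several mutations at the same position are accounted for automatically at larger distances.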
82Substitution matrices
83Substitution matrices
- Alternative: BLOSUM matrices
- S. Henikoff and J.G. Henikoff, PNAS, 1992
- Basis: BLOCKS database, gap-free regions of
multiple alignments
- Cluster sequences if percentage of similarity
> L
- Estimate p_{a,b}(t) directly
- Default values: L = 62, L = 50