Title: Why is pairwise sequence alignment different
1Lecture 5
- 1. Why is pairwise sequence alignment different for proteins and for nucleic acids?
- 2. General protein introduction. Scoring systems and matrices for protein data.
- 3. Wet experience for pairwise sequence alignment (for proteins, more options).
- 4. Special BLAST pages.
- 5. Why is multiple alignment better?
- 6. Wet experience for MSA (for proteins).
2Multiple Sequence Alignment Motivation
- Helps identify common structures and functions
- Build gene families.
- Shared homologous regions.
- Conserved regions (consensus).
- Serves as a basis for constructing phylogenetic (evolutionary) trees from homologous sequences.
3Pairwise Alignment (Reminder)
- Given two sequences (AA or DNA) S and T,
- an alignment is a pair of sequences S', T'
- consisting of letters and gaps ( _ ) such that
- 1. S', T' have the same length.
- 2. Removing all gaps from S' we get S.
- 3. Removing all gaps from T' we get T.
- Example: S = ACTG, T = AGT. Three possible alignments:
- S' = AC_TG   S' = ACTG   S' = ACTG
- T' = A_GT_   T' = AGT_   T' = _AGT
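The three conditions of the reminder can be checked mechanically. The following is a minimal sketch; the function name `is_valid_alignment` and the constant `GAP` are illustrative choices, not part of the lecture.

```python
GAP = "_"  # gap symbol used on the slides


def is_valid_alignment(s, t, s_aligned, t_aligned):
    """Check the three conditions of a pairwise alignment."""
    same_length = len(s_aligned) == len(t_aligned)   # condition 1
    s_recovered = s_aligned.replace(GAP, "") == s    # condition 2
    t_recovered = t_aligned.replace(GAP, "") == t    # condition 3
    return same_length and s_recovered and t_recovered


S, T = "ACTG", "AGT"
candidates = [("AC_TG", "A_GT_"), ("ACTG", "AGT_"), ("ACTG", "_AGT")]
results = [is_valid_alignment(S, T, a, b) for a, b in candidates]
print(results)  # [True, True, True]
```

All three alignments from the example pass; a pair such as ("ACTG", "AGT") fails because the two rows differ in length.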
4- There are several ways to score a multiple alignment:
- Sum of pairs: the sum of pairwise distances between all pairs of sequences.
- Distance from consensus: the consensus is a string of the most common character in each column.
5Traveling Salesman Problem (TSP)
- Given
- n nodes.
- Distances for each pair of nodes.
- Find a roundtrip, so that
- Visit each node exactly once.
- Minimal total length.
Well studied; NP-hard (its decision version is NP-complete).
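To make the problem statement concrete, here is a brute-force sketch that tries every round trip. The distance matrix `DIST` is made up for illustration, and the approach is only usable for a handful of nodes, since it examines (n - 1)! orderings.

```python
from itertools import permutations


def brute_force_tsp(dist):
    """Return (length, tour) of a shortest round trip visiting every node once.

    dist is a symmetric matrix given as a list of lists. Node 0 is fixed as
    the start, so (n - 1)! orderings of the remaining nodes are tried.
    """
    n = len(dist)
    best_length, best_tour = float("inf"), None
    for order in permutations(range(1, n)):
        tour = (0,) + order + (0,)
        length = sum(dist[a][b] for a, b in zip(tour, tour[1:]))
        if length < best_length:
            best_length, best_tour = length, tour
    return best_length, best_tour


# Hypothetical distances between 4 nodes.
DIST = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]
length, tour = brute_force_tsp(DIST)
print(length, tour)  # 18 (0, 1, 3, 2, 0)
```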
6Multiple Sequence Alignment
- Given h sequences S1, S2, ..., Sh, a multiple sequence alignment
- is a list of h sequences S1', S2', ..., Sh'
- consisting of letters and gaps ( _ ), such that
- 1. S1', S2', ..., Sh' have the same length.
- 2. Removing all gaps from Si' we get Si (i = 1, ..., h).
- 3. Each column contains at least one letter.
- Example: S1 = ACTG, S2 = AGT, S3 = ACCTG. Three possible alignments:
- S1' = AC_TG   S1' = ACTG_   S1' = AC__TG
- S2' = A_GT_   S2' = AGT__   S2' = A_G_T_
- S3' = ACCTG   S3' = ACCTG   S3' = AC_CTG
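The definition translates directly into a validity check. This is a small sketch; `is_valid_msa` is a hypothetical helper name, and the gap symbol `_` follows the slides.

```python
GAP = "_"


def is_valid_msa(sequences, aligned):
    """Check the three conditions of a multiple sequence alignment."""
    if len(sequences) != len(aligned):
        return False
    # 1. All aligned rows have the same length.
    if len({len(row) for row in aligned}) != 1:
        return False
    # 2. Removing the gaps from each row gives back the original sequence.
    if any(row.replace(GAP, "") != seq for seq, row in zip(sequences, aligned)):
        return False
    # 3. Every column contains at least one letter.
    return all(any(ch != GAP for ch in column) for column in zip(*aligned))


SEQS = ["ACTG", "AGT", "ACCTG"]
ALIGNMENTS = [
    ["AC_TG", "A_GT_", "ACCTG"],
    ["ACTG_", "AGT__", "ACCTG"],
    ["AC__TG", "A_G_T_", "AC_CTG"],
]
valid = [is_valid_msa(SEQS, msa) for msa in ALIGNMENTS]
print(valid)  # [True, True, True]
```

Appending an all-gap column to any of these alignments, for example `["AC_TG_", "A_GT__", "ACCTG_"]`, violates condition 3 and is rejected.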
7MSA and Consensus
A good multiple sequence alignment can be used to
infer a consensus sequence.
Example:
S1 = ACTCGT   S1' = AC__TCGT
S2 = CAGTG    S2' = _CAGT_G_
S3 = ACATCG   S3' = ACA_TCG_
Consensus:          ACAGTCGT
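One way to derive the consensus of this example is to take the most common letter in each column. The sketch below assumes that gaps are ignored when counting and that ties go to the letter seen first; the slide does not state either rule.

```python
from collections import Counter

GAP = "_"


def consensus(aligned):
    """Most common letter per column; gaps are ignored (an assumption)."""
    result = []
    for column in zip(*aligned):
        letters = [ch for ch in column if ch != GAP]
        # A valid MSA has at least one letter in every column.
        result.append(Counter(letters).most_common(1)[0][0])
    return "".join(result)


MSA = ["AC__TCGT", "_CAGT_G_", "ACA_TCG_"]
print(consensus(MSA))  # ACAGTCGT
```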
8Similarity Score of MSA
- Each h-tuple of characters in the alignment gets a value, depending on its identity.
- The similarity score of the alignment is the sum of all h-tuple values.
- A popular way to compute h-tuple values is SP (Sum of Pairs), where each pair gets its score from the similarity matrix (PAM, BLOSUM).
Goal: Find the MSA with the maximum similarity score.
Bad news: This problem is NP-hard.
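The Sum of Pairs score can be sketched as follows. The toy scores in `pair_score` (match +1, mismatch -1, letter against gap -2, gap against gap 0) are assumptions standing in for a real similarity matrix such as PAM or BLOSUM.

```python
from itertools import combinations

GAP = "_"


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Toy pair score; a real run would look the pair up in PAM/BLOSUM."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return gap
    return match if a == b else mismatch


def sum_of_pairs(aligned):
    """Sum the pair scores over every column (h-tuple) of the alignment."""
    total = 0
    for column in zip(*aligned):
        total += sum(pair_score(a, b) for a, b in combinations(column, 2))
    return total


MSA_1 = ["AC_TG", "A_GT_", "ACCTG"]
MSA_2 = ["ACTG_", "AGT__", "ACCTG"]
print(sum_of_pairs(MSA_1), sum_of_pairs(MSA_2))  # -5 -8
```

Under these toy scores the first alignment is preferred, since -5 is greater than -8.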
9The Dimension Problem
If we consider only short sequences and only two
taxa, we can handle the comparison
manually. For example, 2 taxa need a 2-dimensional
matrix (axes: Taxa 1, Taxa 2). But if you were to do this for h = 75
taxa, you'd have to use a 75-dimensional
space. This is tough even for modern computers!
10How is Multiple Sequence Alignment Being Done ?
We are given h sequences of length m each, and
want to find the MSA with the maximum similarity
score. A generalization of the pairwise
dynamic-programming algorithm (Needleman-Wunsch for global
alignment, Smith-Waterman for local alignment) solves the problem in
O(m^h) computational steps, ignoring factors that depend only on h.
This is infeasible for moderate values of m and for h > 3.
NP-hardness means that no algorithm that is both exact and
efficient is known (and none exists unless P = NP).
Solution: Use heuristics.
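A quick calculation shows how fast the exact approach blows up. The sketch counts only the cells of the h-dimensional dynamic-programming table, (m + 1)^h, and ignores the work per cell; `lattice_cells` is an illustrative name and m = 100 is an arbitrary choice.

```python
def lattice_cells(m, h):
    """Cells in the h-dimensional DP table for h sequences of length m."""
    return (m + 1) ** h


for h in (2, 3, 5, 10):
    print(f"h = {h:2d}: {lattice_cells(100, h):.3e} cells")
```

For two sequences of length 100 the table has 10,201 cells; for three it already has 1,030,301.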
11Multiple Sequence Alignment Heuristics
Example - 4 sequences A, B, C, D.
A. Perform all 6 pairwise alignments. Find
scores. Build a similarity tree.
[Figure: similarity tree with leaves B, D, A, C; axis: similarity]
B. Multiple alignment following the tree from A.
1. Align the most similar pair (B, D), allowing gaps to
optimize alignment.
2. Align the next most similar pair (A, C).
3. Now align the alignments, introducing gaps if
necessary to optimize alignment of (BD) with
(AC).
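The order in which the tree dictates the alignments can be sketched with a greedy clustering of the pairwise scores. The similarity values in `SCORES` are invented so that the result matches the slide's example, and average linkage is an assumption; the slide does not say which tree-building method is used.

```python
from itertools import combinations


def merge_order(names, similarity):
    """Greedy average-linkage clustering; returns the merges in order."""
    def cluster_similarity(c1, c2):
        pairs = [(x, y) for x in c1 for y in c2]
        return sum(similarity[frozenset(p)] for p in pairs) / len(pairs)

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        # Pick the two clusters with the highest average pairwise similarity.
        c1, c2 = max(combinations(clusters, 2),
                     key=lambda pair: cluster_similarity(*pair))
        merges.append((c1, c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Hypothetical scores for the 6 pairwise alignments of A, B, C, D.
SCORES = {
    frozenset("BD"): 90, frozenset("AC"): 80,
    frozenset("AB"): 40, frozenset("AD"): 45,
    frozenset("BC"): 35, frozenset("CD"): 30,
}
order = merge_order("ABCD", SCORES)
for left, right in order:
    print("align", "".join(left), "with", "".join(right))
```

With these scores the merges are B with D, then A with C, then (BD) with (AC). Each merge would be carried out by a pairwise alignment of sequences or of existing alignments, which this sketch leaves out.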
13Popular Software for MSA (all heuristic !)
CLUSTAL family
http://www.ebi.ac.uk/clustalw/
http://www.clustalw.genome.ad.jp/
http://bioweb.pasteur.fr/seqanal/interfaces/clustalw.html
http://prodes.toulouse.inra.fr/multalin/multalin.html
http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_clustalw.html
14http://www2.ebi.ac.uk/clustalw/
15Input Format for MSA (ClustalW)
>worm
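Assuming the input shown on this slide is FASTA-style (a header line starting with `>` followed by one or more sequence lines), a minimal reader could look like the sketch below. The residues and the second record name in `EXAMPLE` are made up for illustration and are not the sequences used in the lecture.

```python
def parse_fasta(text):
    """Parse FASTA-style text into a dict mapping record name to sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""  # first word of the header
            records[name] = ""
        elif name is not None:
            records[name] += line  # sequences may span several lines
    return records


# Made-up records in the same style as the slide.
EXAMPLE = """>worm
MKTAYIAKQR
QISFVKSHFS
>fruit_fly
MKTAYLAKQR
"""
records = parse_fasta(EXAMPLE)
print(records)
```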
16Equivalent PAM and Blosum Matrices
The following matrices are roughly equivalent:
PAM100 => Blosum90
PAM120 => Blosum80
PAM160 => Blosum60
PAM200 => Blosum52
PAM250 => Blosum45
17Clustal Parameters
- Scoring matrices. Default values are:
- DNA: DNA identity matrix.
- Protein: Gonnet 250.
- (Gonnet - These matrices were derived using almost the same procedure as the Dayhoff one (above) but are much more up to date and are based on a far larger data set. They appear to be more sensitive than the Dayhoff series. We use the GONNET 40, 80, 120, 160, 250 and 350 matrices).
- Gap parameters.
- Open = 10, Extend = 0.05
18Gap Parameters
Gapopen - Set the penalty for opening a gap. The default value is 10.
Endgap - Set the penalty for closing a gap.
Gapext - Set the penalty for extending a gap. The default value is 0.05.
Gapdist - Set the gap separation penalty, which discourages gaps that are too close together. The default value is 8.
http://www.ebi.ac.uk/2can/tutorials/protein/clustalw4.html
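The effect of separate open and extension penalties can be illustrated with a generic affine gap model. This is a sketch of the general idea only: the formula open + extend * length is an assumption, and ClustalW's actual computation (which also adjusts penalties by position) is not reproduced here.

```python
def affine_gap_penalty(length, gap_open=10.0, gap_extend=0.05):
    """Generic affine gap cost: one opening charge plus a per-position charge.

    The defaults are the Gapopen and Gapext values quoted above; the formula
    itself is an illustrative assumption, not ClustalW's exact rule.
    """
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * length


one_long_gap = affine_gap_penalty(10)
ten_short_gaps = 10 * affine_gap_penalty(1)
print(one_long_gap, ten_short_gaps)
```

With a high opening penalty and a low extension penalty, one gap of length 10 costs far less than ten separate gaps of length 1, so the alignment favors a few long gaps over many scattered ones.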
19ClustalW on Gluco.- Results
20Cluster Gluco. Results (cont.)
[Figure: alignment results; labeled sequences: worm, fruit fly]
http://www.ebi.ac.uk/clustalw/help.html
21T-COFFEE Visualization of Multiple Alignment
http://www.ch.embnet.org/
http://www.ch.embnet.org/software/TCoffee.html
Results
More accurate than ClustalW for sequences
with less than 30% identity, but it is slower.
http://www.ch.embnet.org/software/ClustalW.html
22T-COFFEE Results
23BOXSHADE Visualization of Multiple Sequence Alignment
http://bioweb.pasteur.fr/seqanal/interfaces/boxshade-simple.html
24BOXSHADE - Results