Why is pairwise sequence alignment different - PowerPoint PPT Presentation

Slides: 26

Provided by: bch7

Transcript and Presenter's Notes

Title: Why is pairwise sequence alignment different

1
Lecture 5

Why is pairwise sequence alignment different
for proteins and for nucleic acids?
1. General protein introduction.
2. Scoring systems and matrices for protein data.
3. Wet experience for pairwise sequence alignment
   (for proteins, more options).
4. Special Blast pages.
5. Why is multiple alignment better?
6. Wet experience for MSA (for proteins).

2
Multiple Sequence Alignment Motivation

Helps identify common structures and functions
Build gene families.
Shared homologous regions.
Conserved regions (consensus).
Serves as a basis for constructing phylogeny
(evolutionary) trees from homologous sequences.

3
Pairwise Alignment (Reminder)

Given two sequences (AA or DNA) S and T,
an alignment is a pair of sequences S', T'
consisting of letters and gaps ( _ ) such that
1. S', T' have the same length.
2. Removing all gaps from S' we get S.
3. Removing all gaps from T' we get T.
Example: S = ACTG, T = AGT
S' AC_TG    S' ACTG    S' ACTG
T' A_GT_    T' AGT_    T' _AGT
4

There are several ways to score a
Multiple Alignment
Sum of pairs - the sum of pairwise distances
between all pairs of sequences.
Distance from consensus - the sum of distances of
each sequence from the consensus, where the consensus
is a string of the most common character
in each column.

5
Traveling Salesman Problem (TSP)

Given
n nodes.
Distances for each pair of nodes.
Find a round trip that
visits each node exactly once and
has minimal total length.

A well-studied NP-hard problem (its decision version is NP-complete).
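To make the problem statement concrete, here is a minimal brute-force sketch in Python. It tries every round trip, so it only works for a handful of nodes, which is the point of the NP-hardness remark. The function name and the 4-node distance matrix are made up for illustration.

```python
from itertools import permutations

def shortest_tour(dist):
    """Exact TSP by trying every tour; dist is an n x n distance matrix."""
    n = len(dist)
    best_len, best_tour = float("inf"), None
    # Fix node 0 as the start so rotations of the same tour are not recounted.
    for perm in permutations(range(1, n)):
        tour = (0,) + perm
        # Sum the edges along the tour, including the edge back to the start.
        length = sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))
        if length < best_len:
            best_len, best_tour = length, tour
    return best_len, best_tour

# Hypothetical symmetric distances between 4 nodes.
dist = [
    [0, 1, 4, 3],
    [1, 0, 2, 5],
    [4, 2, 0, 1],
    [3, 5, 1, 0],
]
print(shortest_tour(dist))
```

The loop visits (n-1)! tours, so the running time explodes long before n reaches the sizes of practical interest.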
6
Multiple Sequence Alignment

Given h sequences S1, S2, ..., Sh, a multiple sequence
alignment is a list of h sequences S1', S2', ..., Sh'
consisting of letters and gaps ( _ ), such that
1. S1', S2', ..., Sh' have the same length.
2. Removing all gaps from Si' we get Si (i = 1, ..., h).
3. Each column contains at least one letter.
Example: S1 = ACTG, S2 = AGT, S3 = ACCTG
S1' AC_TG    S1' ACTG_    S1' AC__TG
S2' A_GT_    S2' AGT__    S2' A_G_T_
S3' ACCTG    S3' ACCTG    S3' AC_CTG
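The three conditions of the definition can be checked mechanically. The following Python sketch does that for the example sequences; the function name is an assumption made for illustration, and `_` is used as the gap character as on the slide.

```python
GAP = "_"

def is_valid_msa(originals, aligned):
    """Return True if `aligned` is a multiple alignment of `originals`."""
    if not aligned or len(originals) != len(aligned):
        return False
    # Condition 1: all aligned rows have the same length.
    width = len(aligned[0])
    if any(len(row) != width for row in aligned):
        return False
    # Condition 2: removing gaps from each row recovers the original sequence.
    for seq, row in zip(originals, aligned):
        if row.replace(GAP, "") != seq:
            return False
    # Condition 3: every column contains at least one letter.
    return all(any(ch != GAP for ch in col) for col in zip(*aligned))

seqs = ["ACTG", "AGT", "ACCTG"]
print(is_valid_msa(seqs, ["AC_TG", "A_GT_", "ACCTG"]))      # first example alignment
print(is_valid_msa(seqs, ["AC__TG", "A_G_T_", "AC_CTG"]))   # third example alignment
print(is_valid_msa(seqs, ["ACTG_", "AGT__", "ACCT_"]))      # broken: S3 lost its G
```

With h = 2 the same check covers the pairwise definition.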

7
MSA and Consensus
A good multiple sequence alignment can be used to
infer a consensus sequence.
Example
S1 ACTCGT    S1' AC__TCGT
S2 CAGTG     S2' _CAGT_G_
S3 ACATCG    S3' ACA_TCG_
Consensus        ACAGTCGT
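One way to derive the consensus of this example is to take the most common letter in each column. The Python sketch below assumes that gaps are ignored when counting (otherwise the fourth and last columns would yield a gap) and that ties go to the letter seen first; both choices are assumptions, not something the slide specifies.

```python
from collections import Counter

GAP = "_"

def consensus(aligned):
    """Most common letter in each column, ignoring gaps."""
    result = []
    for col in zip(*aligned):
        letters = [ch for ch in col if ch != GAP]
        # Assumes a valid alignment: every column has at least one letter.
        result.append(Counter(letters).most_common(1)[0][0])
    return "".join(result)

print(consensus(["AC__TCGT", "_CAGT_G_", "ACA_TCG_"]))  # prints ACAGTCGT
```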
8
Similarity Score of MSA

Each h-tuple of characters (one column) in the alignment
gets a value, depending on its identity.
The similarity score of the alignment
is the sum of all h-tuple values.
A popular way to compute h-tuple values is SP
(Sum of Pairs), where each pair in the column gets the
score from the similarity matrix (PAM, BLOSUM).

Goal: Find the MSA with maximum similarity score.
Bad news: This problem is NP-hard.
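The Sum of Pairs score can be sketched in a few lines of Python. To keep the example self-contained, a toy match/mismatch/gap function stands in for a real PAM or BLOSUM lookup; its values, and the choice to score a gap facing a gap as 0, are assumptions made for illustration.

```python
from itertools import combinations

GAP = "_"

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Toy stand-in for a similarity-matrix lookup (values are assumptions)."""
    if a == GAP and b == GAP:
        return 0            # gap against gap: scored 0 in this sketch
    if a == GAP or b == GAP:
        return gap
    return match if a == b else mismatch

def sum_of_pairs(aligned):
    """Add up pair_score over every pair of rows in every column."""
    total = 0
    for col in zip(*aligned):
        total += sum(pair_score(a, b) for a, b in combinations(col, 2))
    return total

print(sum_of_pairs(["AC_TG", "A_GT_", "ACCTG"]))
```

Each column of h characters contributes h(h-1)/2 pair scores, so scoring a given alignment is cheap; the hard part is searching over all possible alignments.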
9
The Dimension Problem
If we consider only short sequences and only two
taxa, we can handle the comparison manually.
For example, 2 taxa need a 2-dimensional matrix.
But if you were to do this for h = 75 taxa,
you'd have to use a 75-dimensional space.
This is tough even for modern computers!
[Figure: 2-dimensional comparison matrix with axes Taxa 1 and Taxa 2]
10
How is Multiple Sequence Alignment Being Done ?
We are given h sequences of length m each, and
want to find the MSA with the maximum similarity
score. A generalization of the
Smith-Waterman algorithm solves the problem in
O(m^h) computational steps. This is infeasible
for moderate values of m and for h > 3.
NP-hardness implies that no exact and efficient
algorithm is known (none exists unless P = NP).
Solution: Use heuristics.
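A quick calculation shows how fast the O(m^h) table grows. The Python sketch below counts the cells of an h-dimensional dynamic programming table, assuming one extra index per dimension for the empty prefix; the function name is hypothetical.

```python
def dp_cells(m, h):
    """Cells in the h-dimensional DP table for h sequences of length m."""
    return (m + 1) ** h

# Sequence length 100, increasing number of sequences.
for h in (2, 3, 4, 10):
    print(h, dp_cells(100, h))
```

For two sequences the table has about ten thousand cells; every additional sequence multiplies that by roughly one hundred.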
11
Multiple Sequence Alignment Heuristics
Example - 4 sequences A, B, C, D.
A. Perform all 6 pairwise alignments. Find
scores. Build a similarity tree.
[Figure: similarity tree with leaves B, D, A, C; axis label: similarity]
B. Multiple alignment following the tree from A.
1. Align the most similar pair (B, D), allowing gaps to
optimize alignment.
2. Align the next most similar pair (A, C).
3. Now align the alignments, introducing gaps if
necessary to optimize alignment of (BD) with
(AC).
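Step A, building the similarity tree from the pairwise scores, can be sketched as a greedy clustering. The Python below repeatedly merges the two most similar clusters using average similarity; this merging rule and the scores are assumptions chosen so that the result matches the tree in the example, not necessarily what any particular program does.

```python
from itertools import combinations

def guide_tree(names, similarity):
    """Greedy average-linkage clustering; returns the tree as nested tuples.

    `similarity` maps frozenset({x, y}) to the pairwise alignment score.
    """
    clusters = {(n,): n for n in names}   # tuple of members -> subtree

    def avg_sim(c1, c2):
        scores = [similarity[frozenset((x, y))] for x in c1 for y in c2]
        return sum(scores) / len(scores)

    while len(clusters) > 1:
        # Pick the most similar pair of clusters and join them.
        c1, c2 = max(combinations(clusters, 2), key=lambda p: avg_sim(*p))
        merged = (clusters.pop(c1), clusters.pop(c2))
        clusters[c1 + c2] = merged
    return next(iter(clusters.values()))

# Hypothetical scores from the 6 pairwise alignments.
scores = {("A", "B"): 3, ("A", "C"): 8, ("A", "D"): 2,
          ("B", "C"): 4, ("B", "D"): 9, ("C", "D"): 1}
similarity = {frozenset(pair): s for pair, s in scores.items()}
print(guide_tree("ABCD", similarity))   # (('B', 'D'), ('A', 'C'))
```

The nested tuples give the order for step B: align B with D, then A with C, then the two alignments with each other.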
12
Lecture 5

Why is pairwise sequence alignment different
for proteins and for nucleic acids?
1. General protein introduction.
2. Scoring systems and matrices for protein data.
3. Wet experience for pairwise sequence alignment
   (for proteins, more options).
4. Special Blast pages.
5. Why is multiple alignment better?
6. Wet experience for MSA (for proteins).

13
Popular Software for MSA (all heuristic!)
CLUSTAL family
http://www.ebi.ac.uk/clustalw/
http://www.clustalw.genome.ad.jp/
http://bioweb.pasteur.fr/seqanal/interfaces/clustalw.html
http://prodes.toulouse.inra.fr/multalin/multalin.html
http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_clustalw.html
14
http://www2.ebi.ac.uk/clustalw/
15
Input Format for MSA (ClustalW)
>worm
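The header line starting with ">" is the FASTA convention, one of the input formats ClustalW accepts. As a rough illustration of that format, here is a minimal Python parser; the sequences and the second record name are invented for the example and are not the ones used in the lecture.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict {header: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()      # header text after '>'
            records[name] = ""
        elif name is not None:
            records[name] += line        # sequence may span several lines
    return records

# Made-up sequences, for illustration only.
example = """>worm
MKTAYIAK
QRQISFVK
>fruit_fly
MKSAYLAK
"""
print(parse_fasta(example))
```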
16
Equivalent PAM and Blosum Matrices
The following matrices are roughly equivalent...
PAM100 ==> Blosum90
PAM120 ==> Blosum80
PAM160 ==> Blosum60
PAM200 ==> Blosum52
PAM250 ==> Blosum45
17
Clustal Parameters

Scoring Matrices. Default values are
DNA: DNA Identity matrix.
Protein: Gonnet 250.
(Gonnet - These matrices were derived using
almost the same procedure as the Dayhoff one
but are much more up to date and are
based on a far larger data set. They appear to be
more sensitive than the Dayhoff series. We use
the GONNET 40, 80, 120, 160, 250 and 350
matrices).
Gap parameters. Default values are
Open: 10, Extend: 0.05

18
Gap Parameters
Gapopen - Set the penalty for opening a gap.
The default value is 10.
Endgap - Set the penalty for closing a gap.
Gapext - Set the penalty for extending a gap.
The default value is 0.05.
Gapdist - Set the gap separation penalty,
which discourages gaps that are too close together.
The default value is 8.
http://www.ebi.ac.uk/2can/tutorials/protein/clustalw4.html
19
ClustalW on Gluco.- Results
20
Cluster Gluco. Results (cont.)
[Figure labels: worm, Fruit fly]
http://www.ebi.ac.uk/clustalw/help.html
21
T-COFFEE Visualization of Multiple Alignment
http://www.ch.embnet.org/
http://www.ch.embnet.org/software/TCoffee.html
Results
A more accurate program than ClustalW for sequences
with less than 30% identity, but it is slower...
http://www.ch.embnet.org/software/ClustalW.html
22
T-COFFEE Results
23
BOXSHADE Visualization of Multiple Sequence
Alignment
http://bioweb.pasteur.fr/seqanal/interfaces/boxshade-simple.html
24
BOXSHADE - Results
25
Lecture 5