Why is pairwise sequence alignment different - PowerPoint PPT Presentation

Slides: 26
Provided by: bch7

1
Lecture 5
  • 1. Why is pairwise sequence alignment different for proteins and for nucleic acids? General protein introduction.
  • 2. Scoring systems and matrices for protein data.
  • 3. Wet experience for pairwise sequence alignment (for proteins, more options).
  • 4. Special BLAST pages.
  • 5. Why is multiple alignment better?
  • 6. Wet experience for MSA (for proteins).

2
Multiple Sequence Alignment Motivation
  • Helps identify common structures and functions.
  • Builds gene families.
  • Reveals shared homologous regions.
  • Reveals conserved regions (consensus).
  • Serves as a basis for constructing phylogenetic (evolutionary) trees from homologous sequences.

3
Pairwise Alignment (Reminder)
  • Given two sequences (AA or DNA) S and T, an alignment is a pair of sequences S', T' consisting of letters and gaps ( _ ) such that:
  • 1. S' and T' have the same length.
  • 2. Removing all gaps from S' gives S.
  • 3. Removing all gaps from T' gives T.
  • Example: S = ACTG, T = AGT. Three possible alignments:
  • S' = AC_TG    S' = ACTG    S' = ACTG
  • T' = A_GT_    T' = AGT_    T' = _AGT


4
  • There are several ways to score a multiple alignment:
  • Sum of pairs: the sum of pairwise distances between all pairs of sequences.
  • Distance from consensus: the consensus is the string made of the most common character in each column.

5
Traveling Salesman Problem (TSP)
  • Given:
  • n nodes.
  • Distances for each pair of nodes.
  • Find a round trip such that:
  • Each node is visited exactly once.
  • The total length is minimal.

A well-studied problem; its decision version is NP-complete.
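To make the problem statement concrete, here is a minimal brute-force sketch. It tries every ordering of the nodes, so it is only usable for a handful of nodes, which is exactly the point of the comparison with MSA. The function name and the 4-node distance matrix are made up for illustration.

```python
from itertools import permutations

def tsp_brute_force(dist):
    """Return (length, tour) of a shortest round trip over all nodes.

    dist is a square matrix (list of lists); dist[i][j] is the distance
    from node i to node j. Every ordering is tried, so this only scales
    to very small n.
    """
    n = len(dist)
    best_len, best_tour = float("inf"), None
    # Fix node 0 as the start so rotations of the same tour are not retried.
    for rest in permutations(range(1, n)):
        tour = (0,) + rest
        length = sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))
        if length < best_len:
            best_len, best_tour = length, tour
    return best_len, best_tour

# Hypothetical 4-node example with symmetric distances.
example = [
    [0, 1, 4, 2],
    [1, 0, 2, 5],
    [4, 2, 0, 3],
    [2, 5, 3, 0],
]
print(tsp_brute_force(example))
```

The number of orderings grows as (n - 1)!, so each added node multiplies the work.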
6
Multiple Sequence Alignment
  • Given h sequences S1, S2, ..., Sh, a multiple sequence alignment is a list of h sequences S1', S2', ..., Sh' consisting of letters and gaps ( _ ), such that:
  • 1. S1', S2', ..., Sh' have the same length.
  • 2. Removing all gaps from Si' gives Si (i = 1, ..., h).
  • 3. Each column contains at least one letter.
  • Example: S1 = ACTG, S2 = AGT, S3 = ACCTG. Three possible alignments:
  • S1' = AC_TG    S1' = ACTG_    S1' = AC__TG
  • S2' = A_GT_    S2' = AGT__    S2' = A_G_T_
  • S3' = ACCTG    S3' = ACCTG    S3' = AC_CTG
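The three conditions of the definition can be checked mechanically. The sketch below is one way to do it; the function name is made up, and the gap character is assumed to be the underscore used on the slides. With two sequences it also covers the first two conditions of the pairwise definition.

```python
GAP = "_"

def is_valid_msa(originals, aligned):
    """Check the conditions of the MSA definition for a candidate alignment."""
    if len(originals) != len(aligned) or not aligned:
        return False
    # 1. All aligned rows have the same length.
    width = len(aligned[0])
    if any(len(row) != width for row in aligned):
        return False
    # 2. Removing gaps from each aligned row gives back the original sequence.
    for seq, row in zip(originals, aligned):
        if row.replace(GAP, "") != seq:
            return False
    # 3. Every column contains at least one letter.
    return all(any(row[c] != GAP for row in aligned) for c in range(width))

seqs = ["ACTG", "AGT", "ACCTG"]
print(is_valid_msa(seqs, ["AC_TG", "A_GT_", "ACCTG"]))
print(is_valid_msa(seqs, ["ACTG__", "AGT___", "ACCTG_"]))  # last column is all gaps
```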


7
MSA and Consensus
A good multiple sequence alignment can be used to infer a consensus sequence.
Example:
S1 = ACTCGT    S1' = AC__TCGT
S2 = CAGTG     S2' = _CAGT_G_
S3 = ACATCG    S3' = ACA_TCG_
Consensus:           ACAGTCGT
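A short sketch of how such a consensus could be computed: take the most common letter in each column. To reproduce the consensus of this example, the sketch assumes that gaps are ignored when counting and that ties go to the letter seen first; neither rule is stated on the slide.

```python
from collections import Counter

GAP = "_"

def consensus(aligned):
    """Most common letter per column, ignoring gaps (ties: first seen)."""
    result = []
    for column in zip(*aligned):
        letters = [ch for ch in column if ch != GAP]
        # Counter.most_common keeps first-seen order among equal counts.
        result.append(Counter(letters).most_common(1)[0][0] if letters else GAP)
    return "".join(result)

msa = ["AC__TCGT", "_CAGT_G_", "ACA_TCG_"]
print(consensus(msa))  # ACAGTCGT
```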
8
Similarity Score of MSA
  • Each h-tuple of characters in the alignment (one column) gets a value, depending on its identity.
  • The similarity score of the alignment is the sum of all h-tuple values.
  • A popular way to compute h-tuple values is SP (Sum of Pairs), where each pair gets the score from the similarity matrix (PAM, BLOSUM).

Goal: Find the MSA with the maximum similarity score.
Bad news: This problem is NP-hard.
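The sum-of-pairs score can be sketched in a few lines. The pair scores below (+1 match, -1 mismatch, -2 letter against gap, 0 gap against gap) are made-up stand-ins for a real PAM or BLOSUM lookup, chosen only to keep the example self-contained.

```python
from itertools import combinations

GAP = "_"

def toy_score(a, b):
    """Hypothetical pair score standing in for a PAM/BLOSUM lookup."""
    if a == GAP and b == GAP:
        return 0      # gap against gap: no cost (assumed convention)
    if a == GAP or b == GAP:
        return -2     # letter against gap
    return 1 if a == b else -1

def sp_score(aligned, score=toy_score):
    """Sum-of-pairs score: add the pair score over every column and every pair of rows."""
    total = 0
    for column in zip(*aligned):
        for a, b in combinations(column, 2):
            total += score(a, b)
    return total

print(sp_score(["AC_TG", "A_GT_", "ACCTG"]))  # -5
print(sp_score(["ACTG_", "AGT__", "ACCTG"]))  # -8
```

Under these toy scores the first alignment is preferred; a different scoring matrix or gap cost could reverse the ranking.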
9
The Dimension Problem
If we consider only short sequences and only two
taxa, we can handle the comparison manually. For
example, two taxa need a two-dimensional matrix.
But if you were to do this for h = 75 taxa, you
would have to use a 75-dimensional space. This is
tough even for modern computers!
[Figure: comparison matrix for 2 taxa, with axes Taxa 1 and Taxa 2]
10
How is Multiple Sequence Alignment Being Done ?
We are given h sequences of length m each, and
want to find the MSA with the maximum similarity
score. A generalization of the Smith-Waterman
algorithm solves the problem in O(m^h)
computational steps. This is infeasible for
moderate values of m and for h > 3.
NP-hardness implies that no exact and efficient
algorithm is known (none exists unless P = NP).
Solution: use heuristics.
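A quick calculation shows how fast the exact approach blows up. A dynamic program over h sequences of length m fills a table with one axis per sequence, so it has (m + 1)^h cells. The sequence length m = 100 below is an arbitrary choice, and the count ignores the extra work needed per cell.

```python
def dp_cells(m, h):
    """Number of cells in an h-dimensional DP table with m + 1 positions per axis."""
    return (m + 1) ** h

# m = 100 is an arbitrary illustrative length.
for h in (2, 3, 5, 75):
    print(f"h = {h:2d}: {dp_cells(100, h):.3e} cells")
```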
11
Multiple Sequence Alignment Heuristics
Example - 4 sequences A, B, C, D.
A. Perform all 6 pairwise alignments. Find scores. Build a similarity tree.
[Figure: similarity tree over the leaves B, D, A, C]
B. Multiple alignment following the tree from A.
  • Align the most similar pair (B, D), allowing gaps to optimize the alignment.
  • Align the next most similar pair (A, C).
  • Now align the alignments, introducing gaps if necessary to optimize alignment of (BD) with (AC).
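The sketch below covers step A and the pairing order used in step B; it does not perform the final alignment of alignments. It assumes a plain global alignment score (+1 match, -1 mismatch, -1 per gap position) and a greedy choice of the most similar disjoint pairs, which is a simplification of the guide trees that real programs build. The four sequences are made up so that B/D and A/C pair up as in the example.

```python
from itertools import combinations

def nw_score(s, t, match=1, mismatch=-1, gap=-1):
    """Global alignment score (Needleman-Wunsch) with a linear gap cost."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap]
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def pairing_order(seqs):
    """Score all pairs, then greedily pick the most similar disjoint pairs."""
    scores = {(a, b): nw_score(seqs[a], seqs[b])
              for a, b in combinations(sorted(seqs), 2)}
    order, used = [], set()
    for (a, b), _ in sorted(scores.items(), key=lambda kv: -kv[1]):
        if a not in used and b not in used:
            order.append((a, b))
            used.update((a, b))
    return scores, order

# Made-up sequences, chosen only so that B/D and A/C pair up.
seqs = {"A": "ACTGACC", "B": "GGTTCAA", "C": "ACTGTCG", "D": "GGTTCAT"}
scores, order = pairing_order(seqs)
print(len(scores), "pairwise scores")
print("alignment order:", order)
```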
12
Lecture 5
  • 1. Why is pairwise sequence alignment different for proteins and for nucleic acids? General protein introduction.
  • 2. Scoring systems and matrices for protein data.
  • 3. Wet experience for pairwise sequence alignment (for proteins, more options).
  • 4. Special BLAST pages.
  • 5. Why is multiple alignment better?
  • 6. Wet experience for MSA (for proteins).

13
Popular Software for MSA (all heuristic !)
CLUSTAL family
http://www.ebi.ac.uk/clustalw/
http://www.clustalw.genome.ad.jp/
http://bioweb.pasteur.fr/seqanal/interfaces/clustalw.html
http://prodes.toulouse.inra.fr/multalin/multalin.html
http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_clustalw.html
14
http://www2.ebi.ac.uk/clustalw/
15
Input Format for MSA (ClustalW)
>worm
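ClustalW accepts FASTA input, in which each record starts with a header line beginning with ">" followed by one or more sequence lines. The sketch below reads such text into a dictionary. The parser name and the amino-acid strings are placeholders for illustration, not data from the slides.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping header text to sequence."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()   # header without the leading ">"
            records[name] = []
        elif name is not None:
            records[name].append(line)
    # Join the wrapped sequence lines of each record.
    return {k: "".join(v) for k, v in records.items()}

# Placeholder sequences, not real data.
example = """>worm
MKTAYIAK
QRQISFVK
>fly
MKSAYLAK
"""
print(parse_fasta(example))
```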
16
Equivalent PAM and Blosum Matrices
The following matrices are roughly equivalent...
PAM100 ==> Blosum90
PAM120 ==> Blosum80
PAM160 ==> Blosum60
PAM200 ==> Blosum52
PAM250 ==> Blosum45
17
Clustal Parameters
  • Scoring matrices. Default values are:
  • DNA: DNA identity matrix.
  • Protein: Gonnet 250.
  • (Gonnet: these matrices were derived using almost the same procedure as the Dayhoff one (above) but are much more up to date and are based on a far larger data set. They appear to be more sensitive than the Dayhoff series. We use the GONNET 40, 80, 120, 160, 250 and 350 matrices.)
  • Gap parameters:
  • Open: 10. Extend: 0.05.

18
Gap Parameters
Gapopen - Set the penalty for opening a gap. The default value is 10.
Endgap - Set the penalty for closing a gap.
Gapext - Set the penalty for extending a gap. The default value is 0.05.
Gapdist - Set the gap separation penalty, which discourages gaps that are too close together. The default value is 8.
http://www.ebi.ac.uk/2can/tutorials/protein/clustalw4.html
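To see how the open and extend penalties interact, here is a minimal sketch of a basic affine gap model using the default values quoted above. It assumes penalty = open + extend × length; some tools charge the extension only from the second position on, and ClustalW additionally adjusts penalties by position, which this sketch does not model.

```python
def affine_gap_penalty(length, gap_open=10.0, gap_ext=0.05):
    """Penalty for one gap of `length` positions under a basic affine model.

    Assumes penalty = gap_open + gap_ext * length (one common convention).
    """
    if length <= 0:
        return 0.0
    return gap_open + gap_ext * length

one_long_gap = affine_gap_penalty(20)
twenty_short_gaps = 20 * affine_gap_penalty(1)
print(one_long_gap, twenty_short_gaps)
```

With a high opening cost and a low extension cost, a single long gap is far cheaper than many short ones, so alignments tend to group gap positions together.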
19
ClustalW on Gluco.- Results
20
Cluster Gluco. Results (cont.)
worm
Fruit fly
http://www.ebi.ac.uk/clustalw/help.html
21
T-COFFEE Visualization of Multiple Alignment
http://www.ch.embnet.org/
http://www.ch.embnet.org/software/TCoffee.html
Results
More accurate than ClustalW for sequences with
less than 30% identity, but it is slower...
http://www.ch.embnet.org/software/ClustalW.html
22
T-COFFEE Results
23
BOXSHADE Visualization of Multiple Sequence
Alignment
http://bioweb.pasteur.fr/seqanal/interfaces/boxshade-simple.html
24
BOXSHADE - Results
25
Lecture 5
  • 1. Why is pairwise sequence alignment different for proteins and for nucleic acids? General protein introduction.
  • 2. Scoring systems and matrices for protein data.
  • 3. Wet experience for pairwise sequence alignment (for proteins, more options).
  • 4. Special BLAST pages.
  • 5. Why is multiple alignment better?
  • 6. Wet experience for MSA (for proteins).