Algorithms for Pairwise Sequence Alignment - PowerPoint PPT Presentation

1 / 26

About This Presentation

Title:

Algorithms for Pairwise Sequence Alignment

Description:

Title: Algorithms for Pairwise Sequence Alignment Author: hygiea Last modified by: hygiea Created Date: 9/12/2002 8:24:52 PM Document presentation format – PowerPoint PPT presentation

Number of Views:238

Avg rating:3.0/5.0

Slides: 27

Provided by: hyg7

Category:

more less

Transcript and Presenter's Notes

Title: Algorithms for Pairwise Sequence Alignment

1
Algorithms for Pairwise Sequence Alignment

Craig A. Struble, Ph.D.
Marquette University

2
Overview

Pairwise Sequence Alignment
Dynamic Programming Solution
Global Alignment
Local Alignment
BLAST and FASTA

3
Pairwise Sequence Alignment

As weve seen, sequence similarity is an
indicator of homology
There are other uses for sequence similarity
Database queries
Comparative genomics

4
Pairwise Sequence Alignment

Example
Which one is better?

HEAGAWGHEE
PAWHEAE
HEAGAWGHE-E
HEAGAWGHE-E
P-A--W-HEAE
--P-AW-HEAE
5
Scoring

To compare two sequence alignments, calculate a
score
PAM or BLOSUM matrices
Matches and mismatches
Gap penalty
Initiating a gap
Gap extension penalty
Extending a gap

6
Example

Gap penalty -8
Gap extension -8

A E G H W
A 5 -1 0 -2 -3
E -1 6 -3 0 -3
H -2 0 -2 10 -3
P -1 -1 -2 -2 -4
W -3 -3 -3 -3 15
HEAGAWGHE-E
--P-AW-HEAE
(-8) (-8) (-1) 5 15 (-8) 10 6
(-8) 6 9
HEAGAWGHE-E
Exercise Calculate for
P-A--W-HEAE
7
Formal Description

Problem PairSeqAlign
Input Two sequences x,y
Scoring matrix s
Gap penalty d
Gap extension penalty e
Output The optimal sequence alignment

8
How Difficult Is This?

Consider two sequences of length n
There are
possible global alignments, and we need to find
an optimal one from amongst those!

9
So what?

So at n 20, we have over 120 billion possible
alignments
We want to be able to align much, much longer
sequences
Some proteins have 1000 amino acids
Genes can have several thousand base pairs

10
Dynamic Programming

General algorithmic development technique
Reuses the results of previous computations
Store intermediate results in a table for reuse
Look up in table for earlier result to build from

11
Global Alignment

Needleman-Wunsch 1970
Idea Build up optimal alignment from optimal
alignments of subsequences

HEAG --P- -25
Add score from table
HEAGA --P-A -20
HEAG- --P-A -33
HEAGA --P -33
Gap with bottom
Top and bottom
Gap with top
12
Global Alignment

Notation
xi ith letter of string x
yj jth letter of string y
x1..i Prefix of x from letters 1 through I
F matrix of optimal scores
F(i,j) represents optimal score lining up x1..i
with y1..j
d gap penalty
s scoring matrix

13
Global Alignment

The work is to build up F
Initialize F(0,0) 0, F(i,0) id, F(0,j)jd
Fill from top left to bottom right using the
recursive relation

14
Global Alignment
yj aligned to gap
F(i-1,j-1) F(i,j-1)
F(i-1,j) F(i,j)
Move ahead in both
s(xi,yj)
d
d
xi aligned to gap
While building the table, keep track of where
optimal score came from, reverse arrows
15
Example
H E A G A W G H E E
0 -8 -16 -24 -32 -40 -48 -56 -64 -72 -80
P -8 -2 -9 -17 -25 -33 -42 -49 -57 -65 -73
A -16
W -24
H -32
E -40
A -48
E -56
16
Completed Table
H E A G A W G H E E
0 -8 -16 -24 -32 -40 -48 -56 -64 -72 -80
P -8 -2 -9 -17 -25 -33 -42 -49 -57 -65 -73
A -16 -10 -3 -4 -12 -20 -28 -36 -44 -52 -60
W -24 -18 -11 -6 -7 -15 -5 -13 -21 -29 -37
H -32 -14 -18 -13 -8 -9 -13 -7 -3 -11 -19
E -40 -22 -8 -16 -16 -9 -12 -15 -7 3 -5
A -48 -30 -16 -3 -11 -11 -12 -12 -15 -5 2
E -56 -38 -24 -11 -6 -12 -14 -15 -12 -9 1
17
Traceback

Trace arrows back from the lower right to top
left
Diagonal both
Up upper gap
Left lower gap

H E A G A W G H E E
0 -8 -16 -24 -32 -40 -48 -56 -64 -72 -80
P -8 -2 -9 -17 -25 -33 -42 -49 -57 -65 -73
A -16 -10 -3 -4 -12 -20 -28 -36 -44 -52 -60
W -24 -18 -11 -6 -7 -15 -5 -13 -21 -29 -37
H -32 -14 -18 -13 -8 -9 -13 -7 -3 -11 -19
E -40 -22 -8 -16 -16 -9 -12 -15 -7 3 -5
A -48 -30 -16 -3 -11 -11 -12 -12 -15 -5 2
E -56 -38 -24 -11 -6 -12 -14 -15 -12 -9 1
HEAGAWGHE-E --P-AW-HEAE
18
Summary

Uses recursion to fill in intermediate results
table
Uses O(nm) space and time
O(n2) algorithm
Feasible for moderate sized sequences, but not
for aligning whole genomes.

19
Local Alignment

Smith-Waterman (1981)
Another dynamic programming solution

20
Example
H E A G A W G H E E
0 0 0 0 0 0 0 0 0 0 0
P 0 0 0 0 0 0 0 0 0 0 0
A 0 0 0 5 0 5 0 0 0 0 0
W 0 0 0 0 2 0 20 12 4 0 0
H 0 10 2 0 0 0 12 18 22 14 6
E 0 2 16 8 0 0 4 10 18 28 20
A 0 0 8 21 13 5 0 4 10 20 27
E 0 0 6 13 18 12 4 0 4 16 26
21
Traceback
Start at highest score and traceback to first 0
H E A G A W G H E E
0 0 0 0 0 0 0 0 0 0 0
P 0 0 0 0 0 0 0 0 0 0 0
A 0 0 0 5 0 5 0 0 0 0 0
W 0 0 0 0 2 0 20 12 4 0 0
H 0 10 2 0 0 0 12 18 22 14 6
E 0 2 16 8 0 0 4 10 18 28 20
A 0 0 8 21 13 5 0 4 10 20 27
E 0 0 6 13 18 12 4 0 4 16 26
AWGHE AW-HE
22
Summary

Similar to global alignment algorithm
For this to work, expected match with random
sequence must have negative score.
Behavior is like global alignment otherwise
Similar extensions for repeated and overlap
matching
Care must be given to gap penalties to maintain
O(nm) time complexity

23
Repeat and Overlap Matches

Repeat matches allow for sections of a sequence
to match repeatedly
Repeated domain or motif
Overlap matches
Matching when the two sequences overlap
Does not penalize overhanging ends

x
x
y
y
24
BLAST

O(n2) algorithms are too slow for large scale
searches
BLAST developed by Altschul et al (1990)
Uses probabilistic approach to searching
Idea True alignments will have a short stretch
of identities (perfect match)

25
BLAST Overview

Make a list of neighborhood words
Length 3 for proteins, 11 for nucleic acids
Match query with score higher than some threshold
Usually 2 bits per residue
Scans database for words
When a hit is obtained, extends the match in both
direction as ungapped alignment

26
FASTA