Algorithms for Pairwise Sequence Alignment - PowerPoint PPT Presentation

Title: Algorithms for Pairwise Sequence Alignment Author: hygiea Last modified by: hygiea Created Date: 9/12/2002 8:24:52 PM Document presentation format – PowerPoint PPT presentation

1
Algorithms for Pairwise Sequence Alignment
  • Craig A. Struble, Ph.D.
  • Marquette University

2
Overview
  • Pairwise Sequence Alignment
  • Dynamic Programming Solution
  • Global Alignment
  • Local Alignment
  • BLAST and FASTA

3
Pairwise Sequence Alignment
  • As we've seen, sequence similarity is an
    indicator of homology
  • There are other uses for sequence similarity
      • Database queries
      • Comparative genomics

4
Pairwise Sequence Alignment
  • Example: align HEAGAWGHEE with PAWHEAE
  • Which one is better?

Alignment 1:
HEAGAWGHE-E
P-A--W-HEAE

Alignment 2:
HEAGAWGHE-E
--P-AW-HEAE
5
Scoring
  • To compare two sequence alignments, calculate a
    score
  • PAM or BLOSUM matrices
      • Matches and mismatches
  • Gap penalty
      • Initiating a gap
  • Gap extension penalty
      • Extending a gap

6
Example
  • Gap penalty: -8
  • Gap extension: -8

Scoring matrix (excerpt):

     A   E   G   H   W
A    5  -1   0  -2  -3
E   -1   6  -3   0  -3
H   -2   0  -2  10  -3
P   -1  -1  -2  -2  -4
W   -3  -3  -3  -3  15

HEAGAWGHE-E
--P-AW-HEAE
(-8) + (-8) + (-1) + (-8) + 5 + 15 + (-8) + 10 + 6 + (-8) + 6 = 1

Exercise: calculate the score for
HEAGAWGHE-E
P-A--W-HEAE
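
The column-by-column scoring on this slide can be sketched in a few lines of Python. This is only an illustration: the names `ROWS`, `S`, `s`, and `column_scores` are made up for the example, the matrix holds just the entries shown on the slide, and it assumes the matrix is symmetric and that every gap column costs -8 (gap opening and extension are equal here).

```python
# Substitution scores copied from the slide (rows A, E, H, P, W; columns A, E, G, H, W).
COLS = "AEGHW"
ROWS = {
    "A": [5, -1, 0, -2, -3],
    "E": [-1, 6, -3, 0, -3],
    "H": [-2, 0, -2, 10, -3],
    "P": [-1, -1, -2, -2, -4],
    "W": [-3, -3, -3, -3, 15],
}
S = {(r, c): v for r, vals in ROWS.items() for c, v in zip(COLS, vals)}
GAP = -8  # assumption: opening and extending a gap both cost -8, as on the slide


def s(a, b):
    """Score for aligning residues a and b (assumes a symmetric matrix)."""
    return S[(a, b)] if (a, b) in S else S[(b, a)]


def column_scores(top, bottom):
    """Return the score of each column of an alignment given as two gapped strings."""
    assert len(top) == len(bottom), "aligned strings must have equal length"
    return [GAP if "-" in (a, b) else s(a, b) for a, b in zip(top, bottom)]


scores = column_scores("HEAGAWGHE-E", "--P-AW-HEAE")
print(scores, sum(scores))  # the eleven column scores sum to 1
```

Calling `column_scores` on the other candidate alignment is one way to check the exercise.
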
7
Formal Description
  • Problem: PairSeqAlign
  • Input:
      • Two sequences x, y
      • Scoring matrix s
      • Gap penalty d
      • Gap extension penalty e
  • Output: the optimal sequence alignment

8
How Difficult Is This?
  • Consider two sequences of length n
  • There are
    \binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}
    possible global alignments, and we need to find
    an optimal one from amongst those!

9
So what?
  • At n = 20, we have over 120 billion possible
    alignments
  • We want to be able to align much, much longer
    sequences
  • Some proteins have 1000 amino acids
  • Genes can have several thousand base pairs
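
To get a feel for the growth, the count can be evaluated directly. The sketch below assumes the number of global alignments of two length-n sequences is the central binomial coefficient C(2n, n), which is consistent with the "over 120 billion" figure at n = 20; the function names are illustrative only.

```python
import math


def count_alignments(n):
    """Number of global alignments of two length-n sequences, assuming C(2n, n)."""
    return math.comb(2 * n, n)


def stirling_estimate(n):
    """Stirling-style approximation 2^(2n) / sqrt(pi * n) of the same count."""
    return 2 ** (2 * n) / math.sqrt(math.pi * n)


for n in (5, 10, 20, 50):
    print(n, count_alignments(n), f"{stirling_estimate(n):.3e}")
# For n = 20 the exact count is 137846528820.
```

Exhaustive enumeration is already hopeless well below the sequence lengths mentioned on this slide, which is the motivation for dynamic programming.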

10
Dynamic Programming
  • General algorithmic development technique
  • Reuses the results of previous computations
  • Store intermediate results in a table for reuse
  • Look up in table for earlier result to build from

11
Global Alignment
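  • Needleman-Wunsch (1970)
  • Idea: build up an optimal alignment from optimal
    alignments of subsequences

Start from
HEAG
--P-    (score -25)
and add a score from the table for each possible extension:
  • Top and bottom: HEAGA over --P-A, score -20
  • Gap with top: HEAG- over --P-A, score -33
  • Gap with bottom: HEAGA over --P--, score -33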
12
Global Alignment
  • Notation
  • x_i: ith letter of string x
  • y_j: jth letter of string y
  • x_{1..i}: prefix of x from letters 1 through i
  • F: matrix of optimal scores
  • F(i,j): optimal score for lining up x_{1..i}
    with y_{1..j}
  • d: gap penalty
  • s: scoring matrix

13
Global Alignment
  • The work is to build up F
  • Initialize F(0,0) = 0, F(i,0) = i*d, F(0,j) = j*d
    (d is negative here, e.g. d = -8)
  • Fill from top left to bottom right using the
    recurrence relation

14
Global Alignment
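F(i,j) = \max \begin{cases}
  F(i-1,j-1) + s(x_i, y_j) & \text{move ahead in both} \\
  F(i-1,j) + d & \text{$x_i$ aligned to gap} \\
  F(i,j-1) + d & \text{$y_j$ aligned to gap}
\end{cases}

While building the table, keep track of where each optimal score came
from (the arrows); the traceback follows these arrows in reverse.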
15
Example
         H    E    A    G    A    W    G    H    E    E
    0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P  -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A -16
W -24
H -32
E -40
A -48
E -56
16
Completed Table
         H    E    A    G    A    W    G    H    E    E
    0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P  -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
W -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
H -32  -14  -18  -13   -8   -9  -13   -7   -3  -11  -19
E -40  -22   -8  -16  -16   -9  -12  -15   -7    3   -5
A -48  -30  -16   -3  -11  -11  -12  -12  -15   -5    2
E -56  -38  -24  -11   -6  -12  -14  -15  -12   -9    1
17
Traceback
  • Trace arrows back from the lower right to the top
    left
  • Diagonal: both letters aligned
  • Up: gap in the upper sequence
  • Left: gap in the lower sequence

         H    E    A    G    A    W    G    H    E    E
    0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P  -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
W -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
H -32  -14  -18  -13   -8   -9  -13   -7   -3  -11  -19
E -40  -22   -8  -16  -16   -9  -12  -15   -7    3   -5
A -48  -30  -16   -3  -11  -11  -12  -12  -15   -5    2
E -56  -38  -24  -11   -6  -12  -14  -15  -12   -9    1

HEAGAWGHE-E
--P-AW-HEAE
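
The table fill and traceback for this example can be sketched as below. This is an illustrative version rather than the original implementation: the function and variable names are invented, the matrix is restricted to the entries shown on the scoring slide (assumed symmetric), gaps use a single linear penalty d = -8, and ties in the traceback are broken in the order diagonal, up, left, which is an arbitrary choice.

```python
COLS = "AEGHW"
ROWS = {
    "A": [5, -1, 0, -2, -3],
    "E": [-1, 6, -3, 0, -3],
    "H": [-2, 0, -2, 10, -3],
    "P": [-1, -1, -2, -2, -4],
    "W": [-3, -3, -3, -3, 15],
}
S = {(r, c): v for r, vals in ROWS.items() for c, v in zip(COLS, vals)}


def s(a, b):
    """Substitution score, assuming the slide's matrix is symmetric."""
    return S[(a, b)] if (a, b) in S else S[(b, a)]


def needleman_wunsch(x, y, d=-8):
    """Return (F, top, bottom); F[j][i] is the best score for x[:i] against y[:j]."""
    n, m = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, n + 1):
        F[0][i] = i * d  # leading letters of x aligned to gaps
    for j in range(1, m + 1):
        F[j][0] = j * d  # leading letters of y aligned to gaps
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            F[j][i] = max(
                F[j - 1][i - 1] + s(x[i - 1], y[j - 1]),  # move ahead in both
                F[j - 1][i] + d,  # y_j aligned to a gap (arrow points up)
                F[j][i - 1] + d,  # x_i aligned to a gap (arrow points left)
            )
    # Traceback from the lower right to the top left.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[j][i] == F[j - 1][i - 1] + s(x[i - 1], y[j - 1]):
            top.append(x[i - 1])
            bottom.append(y[j - 1])
            i, j = i - 1, j - 1
        elif j > 0 and F[j][i] == F[j - 1][i] + d:
            top.append("-")
            bottom.append(y[j - 1])
            j -= 1
        else:
            top.append(x[i - 1])
            bottom.append("-")
            i -= 1
    return F, "".join(reversed(top)), "".join(reversed(bottom))


F, top, bottom = needleman_wunsch("HEAGAWGHEE", "PAWHEAE")
print(F[-1][-1])  # optimal global score in the lower right cell: 1
print(top)        # HEAGAWGHE-E
print(bottom)     # --P-AW-HEAE
```

With a different tie-breaking order the traceback may return another alignment with the same score.
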
18
Summary
  • Uses a recurrence to fill in a table of
    intermediate results
  • Uses O(nm) space and time
  • O(n^2) algorithm for two sequences of length n
  • Feasible for moderate-sized sequences, but not
    for aligning whole genomes

19
Local Alignment
  • Smith-Waterman (1981)
  • Another dynamic programming solution

20
Example
       H   E   A   G   A   W   G   H   E   E
   0   0   0   0   0   0   0   0   0   0   0
P  0   0   0   0   0   0   0   0   0   0   0
A  0   0   0   5   0   5   0   0   0   0   0
W  0   0   0   0   2   0  20  12   4   0   0
H  0  10   2   0   0   0  12  18  22  14   6
E  0   2  16   8   0   0   4  10  18  28  20
A  0   0   8  21  13   5   0   4  10  20  27
E  0   0   6  13  18  12   4   0   4  16  26
21
Traceback
Start at the highest score and trace back to the first 0

       H   E   A   G   A   W   G   H   E   E
   0   0   0   0   0   0   0   0   0   0   0
P  0   0   0   0   0   0   0   0   0   0   0
A  0   0   0   5   0   5   0   0   0   0   0
W  0   0   0   0   2   0  20  12   4   0   0
H  0  10   2   0   0   0  12  18  22  14   6
E  0   2  16   8   0   0   4  10  18  28  20
A  0   0   8  21  13   5   0   4  10  20  27
E  0   0   6  13  18  12   4   0   4  16  26

AWGHE
AW-HE
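
A local-alignment version of the same table fill can be sketched as follows. It assumes the standard Smith-Waterman recurrence, which is the global recurrence with an extra option of 0 in the max and a zero first row and column; the names are illustrative, the matrix is the excerpt from the scoring slide (assumed symmetric), and gaps use a linear penalty d = -8.

```python
COLS = "AEGHW"
ROWS = {
    "A": [5, -1, 0, -2, -3],
    "E": [-1, 6, -3, 0, -3],
    "H": [-2, 0, -2, 10, -3],
    "P": [-1, -1, -2, -2, -4],
    "W": [-3, -3, -3, -3, 15],
}
S = {(r, c): v for r, vals in ROWS.items() for c, v in zip(COLS, vals)}


def s(a, b):
    """Substitution score, assuming the slide's matrix is symmetric."""
    return S[(a, b)] if (a, b) in S else S[(b, a)]


def smith_waterman(x, y, d=-8):
    """Return (best score, top, bottom) for the best local alignment of x and y."""
    n, m = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]  # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            F[j][i] = max(
                0,  # start a new local alignment here
                F[j - 1][i - 1] + s(x[i - 1], y[j - 1]),
                F[j - 1][i] + d,
                F[j][i - 1] + d,
            )
            if F[j][i] > best:
                best, best_pos = F[j][i], (j, i)
    # Trace back from the highest score until a 0 is reached.
    j, i = best_pos
    top, bottom = [], []
    while F[j][i] > 0:
        if F[j][i] == F[j - 1][i - 1] + s(x[i - 1], y[j - 1]):
            top.append(x[i - 1])
            bottom.append(y[j - 1])
            i, j = i - 1, j - 1
        elif F[j][i] == F[j - 1][i] + d:
            top.append("-")
            bottom.append(y[j - 1])
            j -= 1
        else:
            top.append(x[i - 1])
            bottom.append("-")
            i -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("HEAGAWGHEE", "PAWHEAE"))  # (28, 'AWGHE', 'AW-HE')
```
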
22
Summary
  • Similar to global alignment algorithm
  • For this to work, the expected score of a match
    against a random sequence must be negative.
  • Behavior is like global alignment otherwise
  • Similar extensions for repeated and overlap
    matching
  • Care must be given to gap penalties to maintain
    O(nm) time complexity

23
Repeat and Overlap Matches
  • Repeat matches allow sections of a sequence
    to match repeatedly
      • Repeated domain or motif
  • Overlap matches
      • Matching when the two sequences overlap
      • Does not penalize overhanging ends

[Figure: sequences x and y]
24
BLAST
  • O(n^2) algorithms are too slow for large-scale
    searches
  • BLAST developed by Altschul et al. (1990)
  • Uses a probabilistic approach to searching
  • Idea: true alignments will contain a short stretch
    of identities (perfect matches)

25
BLAST Overview
  • Make a list of neighborhood words
      • Length 3 for proteins, 11 for nucleic acids
      • Words that match the query with a score higher
        than some threshold
      • Usually 2 bits per residue
  • Scan the database for these words
  • When a hit is obtained, extend the match in both
    directions as an ungapped alignment
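
The three steps above (neighborhood words, database scan, ungapped extension) can be sketched on a toy scale. Everything here is an assumption made for illustration and is not how BLAST itself is implemented: the four-letter alphabet and scores are the A, E, H, W entries from the scoring slide, the threshold of 11 and the drop-off limit `x_drop` are arbitrary values, and the function names are invented.

```python
from itertools import product

ALPHABET = "AEHW"  # toy alphabet; real protein searches use all 20 residues
SCORES = {
    "A": {"A": 5, "E": -1, "H": -2, "W": -3},
    "E": {"A": -1, "E": 6, "H": 0, "W": -3},
    "H": {"A": -2, "E": 0, "H": 10, "W": -3},
    "W": {"A": -3, "E": -3, "H": -3, "W": 15},
}


def word_score(u, v):
    """Ungapped score of two equal-length words."""
    return sum(SCORES[a][b] for a, b in zip(u, v))


def neighborhood(query, w=3, threshold=11):
    """Map each word scoring >= threshold against a query word to its query offsets."""
    words = {}
    for q in range(len(query) - w + 1):
        for cand in map("".join, product(ALPHABET, repeat=w)):
            if word_score(query[q:q + w], cand) >= threshold:
                words.setdefault(cand, []).append(q)
    return words


def find_hits(query, subject, w=3, threshold=11):
    """Scan the subject for neighborhood words; return (query offset, subject offset) pairs."""
    words = neighborhood(query, w, threshold)
    return [(q, t)
            for t in range(len(subject) - w + 1)
            for q in words.get(subject[t:t + w], [])]


def extend(query, subject, q, t, w=3, x_drop=10):
    """Extend a hit without gaps in both directions; return (score, query start, query end)."""
    score = word_score(query[q:q + w], subject[t:t + w])
    left = right = 0
    run = best = 0
    k = 0
    while q + w + k < len(query) and t + w + k < len(subject):  # extend right
        run += SCORES[query[q + w + k]][subject[t + w + k]]
        k += 1
        if run > best:
            best, right = run, k
        elif best - run > x_drop:
            break
    score += best
    run = best = 0
    k = 1
    while q - k >= 0 and t - k >= 0:  # extend left
        run += SCORES[query[q - k]][subject[t - k]]
        if run > best:
            best, left = run, k
        elif best - run > x_drop:
            break
        k += 1
    score += best
    return score, q - left, q + w + right


hits = find_hits("AWHEAE", "EAWHEE")
print(hits)                                      # [(0, 1), (1, 2), (2, 3)]
print(extend("AWHEAE", "EAWHEE", *hits[0]))      # (36, 0, 4)
```

The hit (2, 3) comes from the neighborhood word HEE, which is not an exact substring of the query; allowing such near matches is what distinguishes a neighborhood list from a plain exact-word index.
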

26
FASTA
  • Pearson and Lipman (1988)
  • Find all matching words of length ktup
      • 1 or 2 for proteins, 4 or 6 for DNA
  • Look for diagonals supporting word matches
  • Extend with ungapped alignment
  • Join ungapped regions with gaps
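
The first two steps (exact ktup matches and the diagonals that support them) can be sketched as below. This is a simplified illustration with invented names; it only counts word matches per diagonal and does not cover the later rescoring, extension, or gap-joining stages.

```python
from collections import Counter, defaultdict


def ktup_diagonals(query, subject, ktup=2):
    """Count exact ktup word matches on each diagonal (subject offset - query offset)."""
    index = defaultdict(list)  # word -> offsets where it occurs in the query
    for q in range(len(query) - ktup + 1):
        index[query[q:q + ktup]].append(q)
    diagonals = Counter()
    for t in range(len(subject) - ktup + 1):
        for q in index.get(subject[t:t + ktup], []):
            diagonals[t - q] += 1
    return diagonals


diags = ktup_diagonals("PAWHEAE", "HEAGAWGHEE", ktup=2)
print(diags.most_common())  # [(-3, 2), (3, 1), (4, 1)]
```

Diagonal -3 collects two adjacent word matches (HE and EA), so it would be the first candidate region for ungapped extension in this toy example.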