Title: Biological Sequence Alignment
1Biological Sequence Alignment
- Objectives
- Terminology surrounding sequence alignment
- Simple Edit Distance Dynamic Programming Algorithm
2Biological Sequence Alignment
- Materials adapted from many sources
- Gerton Lunter
- Paul Higgs, McMaster Univ.
- McClure, Montana State Univ.
- Aoife McLysaght
- For your reference
- http://etutorials.org/Misc/blast/PartIITheory/Chapter3.SequenceAlignment/
- http://lectures.molgen.mpg.de/Alg/
- A good on-line text
- Has an applet (a program-based web page) that illustrates the algorithms. But I like the following applet better:
- http://bibiserv.techfak.uni-bielefeld.de/media/seqanalysis/align-applet.html
- It single-steps through the algorithms more clearly.
- This applet is also associated with an on-line text. It too is good, but it is very expansive.
3Progression of Algorithms
- The basic algorithm
- Edit distance
- Origins: how many typewriter keystrokes to fix
- // (minimum)
- Sequence of refinements
- Global alignment: Needleman-Wunsch (1970)
- Weighting substitutions: Sellers (1974); gaps
- // evolutionary model
- Local alignment: Smith-Waterman (1981)
4Variations on Edit Distance
5Differences (confusion)
- 1. Allowable edits
- e.g. S(Pet, Pep) = 1
- substitute T with P
- One operation
- Or S(Pet, Pep) = 2
- delete T, insert P
- Two operations
6Edit Distance - no substitutions
- (peter, piper)
- Delete e, insert i
- Delete t, insert p
- S(peter, piper) = 4
- (peter, pepper)
- Delete t, insert p
- Insert p
- S(peter, pepper) = 3
7Edit Distance - substitutions
- (peter, piper)
- substitute e/i
- substitute t/p
- S(peter, piper) = 2
- (peter, pepper)
- substitute t/p
- Insert p
- S(peter, pepper) = 2
8Weighting
- Should substitutions be cheaper than insertions?
- substitution 1, insertion 1.2
- S(peter, piper) = 2, S(peter, pepper) = 2.2
- Should vowel substitutions (0.7) be cheaper than consonant substitutions (1.0)?
- S(peter, piper) = 0.7 + 1.0 = 1.7
- S(peter, pepper) = 1.0 + 1.2 = 2.2
- (peter, piper)
- substitute e/i
- substitute t/p
- S(peter, piper) = 2 unweighted
- (peter, pepper)
- substitute t/p
- Insert p
- S(peter, pepper) = 2 unweighted
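The three scoring conventions on the preceding slides (no substitutions, unit-cost substitutions, weighted substitutions) can be compared with one small routine. The sketch below is only an illustration: the names `weighted_edit_distance`, `no_substitution`, `unit_cost` and `vowel_aware` are hypothetical, deletions are assumed to cost the same as insertions, and a vowel/consonant substitution is assumed to cost 1.0, none of which the slides specify.

```python
from typing import Callable

VOWELS = set("aeiou")

def weighted_edit_distance(v: str, w: str,
                           sub_cost: Callable[[str, str], float],
                           indel_cost: float) -> float:
    """Minimum total cost of editing v into w, filled in row by row."""
    # Row 0: the empty prefix of v against every prefix of w (insertions only).
    prev = [j * indel_cost for j in range(len(w) + 1)]
    for i in range(1, len(v) + 1):
        curr = [i * indel_cost]  # prefix of v against empty w (deletions only)
        for j in range(1, len(w) + 1):
            x, y = v[i - 1], w[j - 1]
            diagonal = prev[j - 1] + (0.0 if x == y else sub_cost(x, y))
            insert = curr[j - 1] + indel_cost   # insert w_j
            delete = prev[j] + indel_cost       # delete v_i (assumed same cost)
            curr.append(min(diagonal, insert, delete))
        prev = curr
    return prev[-1]

def no_substitution(x: str, y: str) -> float:
    """Substitutions are not an allowable edit."""
    return float("inf")

def unit_cost(x: str, y: str) -> float:
    """Every substitution costs 1."""
    return 1.0

def vowel_aware(x: str, y: str) -> float:
    """Vowel-for-vowel costs 0.7; anything else is assumed to cost 1.0."""
    return 0.7 if x in VOWELS and y in VOWELS else 1.0

if __name__ == "__main__":
    for label, cost, indel in [("no substitutions", no_substitution, 1.0),
                               ("unit substitutions", unit_cost, 1.0),
                               ("insertion 1.2", unit_cost, 1.2),
                               ("vowel-aware", vowel_aware, 1.2)]:
        for target in ("piper", "pepper"):
            d = weighted_edit_distance("peter", target, cost, indel)
            print(f"{label}: S(peter, {target}) = {d:.1f}")
```

With the vowel-aware costs, (peter, pepper) has no vowel mismatch to exploit, so under these assumptions its cheapest edit script is the consonant substitution t/p plus one insertion.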
9Variations on Edit Distance
10Goal: Model Biology (Evolution)
- Point Mutations
- substitution
- insertion
- deletion
- Impact relative to position in a codon
11To date, biological sequence analysis --> local alignment
- local alignment
- weighted point substitution
- no penalty for boundary problem
- gaps associated with a penalty
- similarity weighting, not distance
12An Alignment Illustration Method
- Peter    Pet_er    Pe_ter
- P  er    Pe  er    Pe  er
- Piper    Pepper    Pepper
13Notation
- Let V, W be two strings
- Let $V = v_1, \ldots, v_n$ and $W = w_1, \ldots, w_m$, where
- n is the length of V, m is the length of W
- $v_i$ or $w_j$ represents the i-th or j-th character
- // capital letters for the strings, small letters for the characters
- Let $V_i$ represent a prefix of the string V, where $V_i = v_1, \ldots, v_i$
- e.g.
- V = betty
- $v_1$ = b, $v_2$ = e, $v_3$ = t, ..., $v_5$ = y
- $V_2$ = be
- W = butter, $W_3$ = but
- Let $S(V_i, W_j) = S_{i,j}$ be the edit distance for the strings $v_1, \ldots, v_i$ and $w_1, \ldots, w_j$
- e.g.
- $S(V_2, W_3)$ = S(be, but)
14More Notation
- Lecture slides are in standard mathematical notation
- string subscripts start at 1 unless stated otherwise
- matrices
- Code uses coding conventions (start at 0)
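To make the 1-based slide notation and 0-based code indexing concrete, here is a minimal sketch of how the two line up in Python. The helper names `char_at` and `prefix` are hypothetical and exist only for this illustration.

```python
def char_at(s: str, i: int) -> str:
    """Return the slide-notation character s_i (subscripts start at 1)."""
    if not 1 <= i <= len(s):
        raise IndexError(f"subscript {i} is outside 1..{len(s)}")
    return s[i - 1]  # Python strings start at 0, so shift by one

def prefix(s: str, i: int) -> str:
    """Return the slide-notation prefix S_i = s_1, ..., s_i (S_0 is empty)."""
    if not 0 <= i <= len(s):
        raise IndexError(f"prefix length {i} is outside 0..{len(s)}")
    return s[:i]

if __name__ == "__main__":
    V, W = "betty", "butter"
    print(char_at(V, 1), char_at(V, 5))  # v_1 and v_5
    print(prefix(V, 2), prefix(W, 3))    # V_2 and W_3
```

The only place the two conventions meet is the `i - 1` shift in `char_at`; a prefix of length i needs no shift because a Python slice excludes its end index.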
15More
- Indel: a hybrid term (combining the words "insertion" and "deletion") used to describe a difference in sequence due to either an insertion or a deletion event; especially used when the evolutionary direction of the change is unspecified
- http://www.yeastgenome.org/help/glossary.html
- pet_er
- pepper
- // the _ marks the indel
16Dynamic Programming Algorithms
- Dynamic programming is a generic template constituting an entire class of algorithms.
- // In the sense that divide and conquer constitutes a class of algorithms
- Sequence alignment problems are mostly solved using dynamic programming.
- // (but not BLAST)
17Aside Principle of Optimality
- If the solution to a problem, P, can be broken into two subproblems, P1 and P2, where combining the solutions to P1 and P2 constitutes solving P
- And
- If the solution to P is optimal, the solutions to P1 and P2 are optimal.
- Principle of optimality holds for alignment problems.
- In general, dynamic programming algorithms are applicable for problems where the principle of optimality holds.
- the subproblems do not have to be of equal size
18S(pete, pipe)
- S(p, p) = 0
- S(pe, p) = 1
- S(pe, pi) = 1
- S(pet, pi) = 2
- S(pet, pip) = 2
- S(pete, pip) = 3
- S(pete, pipe) = 2 // ? What of the principle of optimality?
- Actually 3 ways to make the problem bigger, starting at S(p, p):
- S(p, pi)
- S(pe, pi)
- S(pe, p)
19But the principle of optimality says: pick the best of the smaller problems
- S(p, pi)
- S(pe, pi)
- S(pe, p)
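One way to read the question raised on the previous slide is to write out the three smaller problems that S(pete, pipe) can be built from. The worked line below assumes unit cost for every insertion, deletion and substitution and cost 0 for a match; the value S(pet, pipe) = 3 does not appear on the slides and was computed under that assumption.

```latex
\begin{aligned}
S(\text{pete}, \text{pipe})
  &= \min \begin{cases}
       S(\text{pet}, \text{pip}) + 0 & \text{(e matches e)} \\
       S(\text{pete}, \text{pip}) + 1 & \text{(insert e)} \\
       S(\text{pet}, \text{pipe}) + 1 & \text{(delete e)}
     \end{cases} \\
  &= \min\{\, 2 + 0,\; 3 + 1,\; 3 + 1 \,\} = 2
\end{aligned}
```

The drop from S(pete, pip) = 3 to S(pete, pipe) = 2 is therefore no contradiction: the optimal answer extends the best of the three smaller problems, here S(pet, pip), rather than the one that happened to be listed just before it.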
20Recursive Definition of Simple Edit Distance
- for strings $V = v_1, \ldots, v_n$ and $W = w_1, \ldots, w_m$, where $v_i$ or $w_j$ represents the i-th or j-th character
- Cost of substituting a character x with y is represented as c(x, y)
- indel: an insert or delete, represented by _
$$S_{i,j} = \min \begin{cases} S_{i-1,j-1} + c(v_i, w_j) \\ S_{i,j-1} + c(\_, w_j) \\ S_{i-1,j} + c(v_i, \_) \end{cases} \qquad \text{// when defined}$$
- $S_{0,0} = 0$
- $c(v_i, w_j)$
- if $v_i = w_j$ then $c(v_i, w_j) = 0$
- if $v_i \neq w_j$ then $c(v_i, w_j) = 1$
- $c(\_, w_j) = 1$
- $c(v_i, \_) = 1$
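The recursive definition above can be transcribed almost literally into code. The sketch below is for illustration only: the names `c`, `S` and `GAP` mirror the slide notation but are otherwise hypothetical, the sequences are assumed never to contain the literal character `_`, and the plain recursion recomputes the same subproblems many times, so it is only practical for short strings.

```python
GAP = "_"  # stands for an indel; assumed not to occur inside the sequences

def c(x: str, y: str) -> int:
    """Unit costs: 0 for a match, 1 for a substitution or an indel."""
    if x == GAP or y == GAP:
        return 1
    return 0 if x == y else 1

def S(v: str, w: str) -> int:
    """Edit distance between v and w, straight from the recursive definition."""
    if not v and not w:
        return 0  # S_{0,0} = 0
    candidates = []
    if v and w:
        # S_{i-1,j-1} + c(v_i, w_j): match or substitute the last characters
        candidates.append(S(v[:-1], w[:-1]) + c(v[-1], w[-1]))
    if w:
        # S_{i,j-1} + c(_, w_j): insert w_j
        candidates.append(S(v, w[:-1]) + c(GAP, w[-1]))
    if v:
        # S_{i-1,j} + c(v_i, _): delete v_i
        candidates.append(S(v[:-1], w) + c(v[-1], GAP))
    return min(candidates)  # only the terms that are defined take part

if __name__ == "__main__":
    print(S("pete", "pipe"))
    print(S("be", "but"))
```

The `if` guards play the role of the "when defined" remark: on the boundary, where one prefix is empty, only one of the three terms exists.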
21We Don't Implement as a Recursive Function
- V = PEPPER
- W = PETER
- Create a table so we can save the answers to the smaller problem instances
- We will populate the table starting from the smallest problem, S("", "") // empty strings
- Dirty trick
- In a program, index from 0.
- Strings will still index starting from 1.
- v0 and w0 will mean "" (the empty string).
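The table idea and the indexing trick can be sketched as follows, assuming unit costs for substitutions and indels. The function name `edit_distance_table` is hypothetical; row 0 and column 0 of the table stand for the empty prefixes, so table cell `S[i][j]` lines up with the 1-based characters v_i and w_j.

```python
from typing import List

def edit_distance_table(v: str, w: str) -> List[List[int]]:
    """Fill S[i][j] = edit distance between the prefixes v_1..v_i and w_1..w_j."""
    n, m = len(v), len(w)
    S = [[0] * (m + 1) for _ in range(n + 1)]  # S[0][0] = 0 is the base case
    for i in range(1, n + 1):
        S[i][0] = i  # boundary: delete all of v_1..v_i
    for j in range(1, m + 1):
        S[0][j] = j  # boundary: insert all of w_1..w_j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # v_i is v[i - 1] in Python, w_j is w[j - 1]
            substitution = 0 if v[i - 1] == w[j - 1] else 1
            S[i][j] = min(S[i - 1][j - 1] + substitution,  # match / substitute
                          S[i][j - 1] + 1,                 # insert w_j
                          S[i - 1][j] + 1)                 # delete v_i
    return S

if __name__ == "__main__":
    table = edit_distance_table("PEPPER", "PETER")
    for row in table:
        print(" ".join(str(value) for value in row))
```

For V = PEPPER and W = PETER this gives `S[2][2] == 0`, `S[4][3] == 2` and a final value `S[6][5] == 2`. Every cell is computed once from three already filled neighbours, which is what saving the answers to the smaller problem instances buys over plain recursion.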
22Example - Simple Edit Distance: complete base-case and boundary
$$S_{i,j} = \min \begin{cases} S_{i-1,j-1} + c(v_i, w_j) \\ S_{i,j-1} + c(\_, w_j) \\ S_{i-1,j} + c(v_i, \_) \end{cases}$$
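23Example - Simple Edit Distance: Example Cell S22
$$S_{i,j} = \min\{\, S_{i-1,j-1} + c(v_i, w_j),\; S_{i,j-1} + c(\_, w_j),\; S_{i-1,j} + c(v_i, \_) \,\}$$
$$S_{2,2} = \min\{\, S_{1,1} + c(E, E),\; S_{2,1} + c(\_, E),\; S_{1,2} + c(E, \_) \,\}$$
$$S(PE, PE) = \min\{\, S(P, P) + c(E, E),\; S(PE, P) + c(\_, E),\; S(P, PE) + c(E, \_) \,\}$$
$$S(PE, PE) = \min\{\, 0 + 0,\; 1 + 1,\; 1 + 1 \,\} = 0$$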
24Example - Simple Edit Distance
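25Example - Simple Edit Distance: Cell example 2, S43 - what is special?
$$S_{i,j} = \min\{\, S_{i-1,j-1} + c(v_i, w_j),\; S_{i,j-1} + c(\_, w_j),\; S_{i-1,j} + c(v_i, \_) \,\}$$
$$S_{4,3} = \min\{\, S_{3,2} + c(P, T),\; S_{4,2} + c(\_, T),\; S_{3,3} + c(P, \_) \,\}$$
$$S(PEPP, PET) = \min\{\, S(PEP, PE) + c(P, T),\; S(PEPP, PE) + c(\_, T),\; S(PEP, PET) + c(P, \_) \,\}$$
$$S(PEPP, PET) = \min\{\, 1 + 1,\; 2 + 1,\; 1 + 1 \,\} = 2$$
// a tie for the winner: the first and third terms both give 2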
26Example - Simple Edit Distance: Cell example 2, S4,3 - there is a tie
27Edit distance = 2, what are the edits?
- Traceback step
- mark the winning cases
- more than one if there are ties
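The traceback step can be sketched as below, again assuming unit costs. All function names (`fill_table`, `winning_moves`, `tracebacks`) and the move labels `diag`, `left` and `up` are hypothetical. Each cell reports every case that achieves its minimum, and following all of them from the bottom-right corner enumerates every optimal alignment.

```python
from typing import List, Tuple

def fill_table(v: str, w: str) -> List[List[int]]:
    """Unit-cost edit distance table; row 0 and column 0 are the empty prefixes."""
    S = [[0] * (len(w) + 1) for _ in range(len(v) + 1)]
    for i in range(1, len(v) + 1):
        S[i][0] = i
    for j in range(1, len(w) + 1):
        S[0][j] = j
    for i in range(1, len(v) + 1):
        for j in range(1, len(w) + 1):
            sub = 0 if v[i - 1] == w[j - 1] else 1
            S[i][j] = min(S[i - 1][j - 1] + sub, S[i][j - 1] + 1, S[i - 1][j] + 1)
    return S

def winning_moves(S: List[List[int]], v: str, w: str, i: int, j: int) -> List[str]:
    """All cases that achieve S[i][j]; more than one entry means a tie."""
    moves = []
    if i > 0 and j > 0:
        sub = 0 if v[i - 1] == w[j - 1] else 1
        if S[i][j] == S[i - 1][j - 1] + sub:
            moves.append("diag")  # match or substitution
    if j > 0 and S[i][j] == S[i][j - 1] + 1:
        moves.append("left")      # insert w_j
    if i > 0 and S[i][j] == S[i - 1][j] + 1:
        moves.append("up")        # delete v_i
    return moves

def tracebacks(v: str, w: str) -> List[Tuple[str, str]]:
    """Every optimal alignment of v and w, written with _ for indels."""
    S = fill_table(v, w)
    results: List[Tuple[str, str]] = []

    def walk(i: int, j: int, top: str, bottom: str) -> None:
        if i == 0 and j == 0:
            # The rows were built from the end backwards, so reverse them.
            results.append((top[::-1], bottom[::-1]))
            return
        for move in winning_moves(S, v, w, i, j):
            if move == "diag":
                walk(i - 1, j - 1, top + v[i - 1], bottom + w[j - 1])
            elif move == "left":
                walk(i, j - 1, top + GAP, bottom + w[j - 1])
            else:
                walk(i - 1, j, top + v[i - 1], bottom + GAP)

    walk(len(v), len(w), "", "")
    return results

GAP = "_"

if __name__ == "__main__":
    for top, bottom in tracebacks("PEPPER", "PETER"):
        print(top)
        print(bottom)
        print()
```

For PEPPER and PETER the tie at cell S4,3 is the only branching point on the way back, so the sketch yields two alignments, PEPPER over PE_TER and PEPPER over PET_ER. Enumerating every tie can grow quickly on longer sequences; following just one winning move per cell is enough when a single alignment is wanted.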