Title: Sequence Alignment
1 Sequence Alignment
- Gary Jackoway
- February 26, 2002
- CISC 889 Bioinformatics
2 Sequence Alignment Outline
- Dynamic Programming for Sequence Alignment
- Equivalent Problems
- Algorithm Description
- O(MN) Proof By Example
- Global versus Local Alignment
- Nucleotide Substitution Matrix
3 Sequence Alignment Outline (cont.)
- PAM Substitution Matrix
- BLOSUM Substitution Matrix
- Log Odds Form
- Gap Penalty
- Alignment Issues
- Summary
4 Dynamic Programming for Sequence Alignment
Problem: What is the optimal alignment of two sequences?
Input: Two sequences (either nucleotides or amino acids).
Output: An alignment (a mapping of one sequence onto the other,
possibly with gaps) and a score that defines the quality of the match.
5 Equivalent Problems
- Optical Character Recognition: cornment → comment
- Document Comparison: "Four-score and seven years ago" vs. "Four score and seven years ago"
- Spell Checker / Corrector: mispeld → misspelled
REFERENCE: Skiena's The Algorithm Design Manual, 8.7.4
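All three problems reduce to comparing two strings under insertions, deletions, and substitutions. A minimal sketch of that comparison as an edit-distance calculation follows; the function name `edit_distance` and the unit cost for every operation are assumptions made for illustration, not something taken from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Fewest single-character insertions, deletions, and substitutions
    needed to turn a into b (unit costs are an assumption)."""
    # prev[j] holds the distance between the previous prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j - 1] + (ca != cb),  # substitute (free if characters match)
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
            ))
        prev = curr
    return prev[-1]


print(edit_distance("cornment", "comment"))    # 2
print(edit_distance("mispeld", "misspelled"))  # 3
```

Sequence alignment uses the same table-filling idea, but maximizes a similarity score instead of minimizing a count of edits.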
6 Algorithm Description
DP algorithms have a strong relationship to recursion: define a base
case and prove that you can extend it. If you already have the optimal
solution to

    XY
    AB

then you know the next pair of characters will be one of

    XYZ      XY-      XYZ
    ABC  or  ABC  or  AB-

(where - indicates a gap). So you can extend the match by
determining which of these has the highest score.
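7 Needleman-Wunsch Algorithm: Single Step
[Figure: the cell being filled and its three already-filled neighbors X, Y, and Z]
Candidate values for the new cell:
X + match(a1,b1)
Y + (1 gap)
Z + (1 gap)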
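8 Needleman-Wunsch Algorithm: Single Step (numeric)
[Figure: the same cell and neighbors, with X = 21, Y = 28, Z = 14]
Candidate values for the new cell:
X: 21 + (-3), from match(G,A)
Y: 28 + (-10), from 1 gap
Z: 14 + (-10), from 1 gap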
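The single step can be sketched in a few lines. This reads the slide's numbers as "neighbor value plus move score" (an interpretation of the figure), and the function name `fill_cell` and its parameter names are hypothetical.

```python
def fill_cell(x: int, y: int, z: int, match_score: int, gap: int):
    """Return the best value for a new cell and the move(s) that produce it.

    x is the neighbor extended by a match/mismatch; y and z are the two
    neighbors extended by a gap (labels follow the slide's X, Y, Z).
    """
    candidates = {
        "X + match": x + match_score,
        "Y + gap": y + gap,
        "Z + gap": z + gap,
    }
    best = max(candidates.values())
    # Keep every move that ties for best, so a traceback could follow any of them.
    moves = [name for name, value in candidates.items() if value == best]
    return best, moves


value, moves = fill_cell(x=21, y=28, z=14, match_score=-3, gap=-10)
print(value, moves)  # 18 ['X + match', 'Y + gap']
```

With these numbers two moves tie at 18, which is why an alignment matrix can yield more than one optimal alignment.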
9 O(MN) Proof By Example
We will prove that the dynamic programming algorithm for sequence
alignment can be executed in O(MN) time, where
M = length of first sequence
N = length of second sequence
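One way to see the O(MN) bound is to count the cells the algorithm fills: each interior cell is computed once, from three neighbors, in constant time. The sketch below is a minimal global alignment under assumed scores (match +1, mismatch -1, linear gap -2); the function name and the `cells_filled` counter are illustrative additions.

```python
def needleman_wunsch(a: str, b: str, match=1, mismatch=-1, gap=-2):
    """Global alignment; returns (score, aligned_a, aligned_b, cells_filled)."""
    m, n = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, n + 1):
        score[0][j] = j * gap          # b[:j] aligned against gaps only

    cells_filled = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # pair a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # a[i-1] against a gap
                              score[i][j - 1] + gap)       # b[j-1] against a gap
            cells_filled += 1          # constant work per cell, M * N cells

    # Traceback from the bottom-right corner to the top-left corner.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                top.append(a[i - 1]); bottom.append(b[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return score[m][n], "".join(reversed(top)), "".join(reversed(bottom)), cells_filled


print(needleman_wunsch("ACGT", "AGT"))  # (1, 'ACGT', 'A-GT', 12)
```

For sequences of length 4 and 3 the loop fills 4 x 3 = 12 cells, and the traceback adds at most M + N further steps.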
10 Global versus Local Alignment
Want to find local matching areas, even when far removed from each
other in the sequence:
ACTTAGCAGACTAACGTAAC
CCATGACTAACGGGACCTAC
Smith-Waterman: Use Needleman-Wunsch, but add: IF value < 0, replace
with 0 (and set backtrack to none). When the matrix is complete,
backtrack from all local maxima, creating local matching alignments.
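The Smith-Waterman modification can be sketched as follows. Two simplifications are assumed here: the scores (match +2, mismatch -1, gap -2) are illustrative, and the traceback starts only from the single highest-scoring cell rather than from every local maximum as the slide describes.

```python
def smith_waterman(a: str, b: str, match=2, mismatch=-1, gap=-2):
    """Local alignment; returns (score, aligned_a, aligned_b) for the best cell."""
    m, n = len(a), len(b)
    score = [[0] * (n + 1) for _ in range(m + 1)]  # first row/column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,                          # negative values reset to 0
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a 0 cell (no backtrack) is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("AAAGGG", "TTTGGGCCC"))  # (6, 'GGG', 'GGG')
print(smith_waterman("ACTTAGCAGACTAACGTAAC", "CCATGACTAACGGGACCTAC"))
```

The second call uses the two sequences from the slide, which share the stretch GACTAACG; the exact alignment reported depends on the assumed scores.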
11 Nucleotide Substitution Matrix
- Two options for the Nucleotide Substitution Matrix:
- Use the same penalty for all mismatches.
- Use a lesser penalty for transitions (A↔G, C↔T) than for
transversions (AG ↔ CT).
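Both options can be expressed with one small scoring function. The numeric values below (match +1, transition -1, transversion -2) are assumptions chosen only to show the shape of the matrix; setting the two penalties equal gives the first option.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def nucleotide_score(x: str, y: str, match=1, transition=-1, transversion=-2) -> int:
    """Score one pair of nucleotides (all values are illustrative)."""
    if x == y:
        return match
    # A transition stays within a class: purine<->purine or pyrimidine<->pyrimidine.
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_class else transversion


# Build the full 4 x 4 matrix as a dictionary keyed by nucleotide pair.
matrix = {(x, y): nucleotide_score(x, y) for x in "ACGT" for y in "ACGT"}

for x in "ACGT":
    print(x, [matrix[(x, y)] for y in "ACGT"])
```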
12 PAM: Percent Accepted Mutation Substitution Matrix (Dayhoff)
- Substitution matrices based on sound evolutionary principles.
- Find PAM1 by comparing groups of proteins known to be
evolutionarily closely related.
- Find PAM-n by multiplying PAM1 by itself n times.
- PAM60 ≈ 60% similar, PAM250 ≈ 20% similar.
- The more distant the expected relationship, the higher the n
that should be used in PAM-n.
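The "multiply PAM1 by itself n times" step is ordinary matrix exponentiation of a mutation probability matrix. The sketch below uses a made-up 3-letter alphabet and invented probabilities (`TOY_PAM1` is not real Dayhoff data) purely to show how the chance of a residue staying unchanged falls as n grows.

```python
def mat_mul(p, q):
    """Multiply two square matrices given as lists of rows."""
    size = len(p)
    return [[sum(p[i][k] * q[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_power(p, n):
    """Return p multiplied by itself n times (identity matrix for n = 0)."""
    size = len(p)
    result = [[float(i == j) for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, p)
    return result


# Hypothetical one-step mutation probabilities; each row sums to 1.
TOY_PAM1 = [
    [0.990, 0.007, 0.003],
    [0.007, 0.990, 0.003],
    [0.003, 0.003, 0.994],
]

for n in (1, 60, 250):
    pam_n = mat_power(TOY_PAM1, n)
    # Diagonal entry = probability that the first residue type is unchanged.
    print(n, round(pam_n[0][0], 3))
```

The real PAM matrices are 20 x 20 and are converted to log odds scores before being used in an alignment.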
13 BLOSUM: BLOcks SUbstitution Matrix
- Start with highly-conserved patterns (blocks) in a large set of
closely related proteins.
- Use the likelihood of substitutions found in those sequences to
create a substitution probability matrix.
- BLOSUM-n means that the sequences used were n% identical.
- BLOSUM62 is standard.
14 Log Odds Form
BLOSUM and PAM matrices start as a likelihood of substitution.
Conversion to odds form yields a matrix that gives the odds that a
change is evolutionarily significant versus purely random.
Conversion to log odds form means that as you add each character to
the pattern, you can add the values instead of multiplying them (as
you would need to do for odds form).
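The add-instead-of-multiply property is just the identity log(a·b) = log(a) + log(b). A small numeric check, using invented odds values and base-2 logarithms (the base is an assumption; published matrices also scale and round their entries):

```python
import math

# Hypothetical odds ratios for three aligned positions
# (observed frequency of the pair / frequency expected by chance).
odds = [2.0, 0.5, 4.0]

# Odds form: the values for successive positions are multiplied.
combined_odds = math.prod(odds)

# Log odds form: the values for successive positions are added.
log_odds = [math.log2(o) for o in odds]
combined_log_odds = sum(log_odds)

print(combined_log_odds, 2 ** combined_log_odds)  # 2.0 4.0
```

A positive log odds score means the pair is seen more often than chance predicts; a negative score means less often.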
15 Gap Penalty
- The gap penalty has to work with the substitution matrix. (Ex.: if
an insert / delete pair, which costs two gap penalties, is not more
severe than one substitution, then you will get an insert / delete
pair instead of a substitution.)
- If the gap penalty is too costly, you will get mismatches where a
gap would lead to a better match.
- If the gap penalty is too cheap, you will get meaningless gaps,
just to line up one or two characters.
16 Gap Penalty (cont.)
- It is intuitively appealing to use a gap penalty of the form
g + rx, where x is the length of the gap, g is the penalty for
opening a gap, and r is the gap extension penalty. It is better to
have one big gap than scattered small ones.
- NOTE: If the gap penalty (or extension) is not more costly than all
substitutions, the recurrence relation needs correction: you need to
look back along the current row and column to assure optimality,
because the scoring violates the triangle inequality.
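A quick calculation shows why the g + rx form favors one big gap over scattered small ones. The values g = 10 and r = 1 are assumptions for illustration, and the penalty is returned as a positive cost to be subtracted from the alignment score.

```python
def gap_penalty(length: int, g: int = 10, r: int = 1) -> int:
    """Cost of one gap of the given length under the g + r*x form."""
    if length <= 0:
        return 0  # no gap, no cost
    return g + r * length


one_big_gap = gap_penalty(4)             # a single gap spanning 4 positions
four_small_gaps = 4 * gap_penalty(1)     # four separate gaps of 1 position each

print(one_big_gap, four_small_gaps)  # 14 44
```

Both cases skip four positions, but the scattered gaps pay the opening cost g four times.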
17 How good is my alignment?
(Starting with log odds form helps.) Most online programs give a
number of statistical formulations that attempt to answer the
question.
score: the value calculated for the sequence using the substitution
matrix and the gap penalties.
percent identity: percent of exactly matching symbols.
Expected value (E): how many matches with this score or better would
be expected by chance when comparing random sequences; for small
values it is close to the probability of such a match.
NOTE: different systems use different forms of this statistic.
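Percent identity is the easiest of these statistics to compute by hand. The sketch below divides by the full alignment length, gaps included; that denominator is an assumption, since programs differ on whether gapped columns are counted.

```python
def percent_identity(top: str, bottom: str) -> float:
    """Percent of alignment columns where both rows hold the same symbol."""
    if len(top) != len(bottom) or not top:
        raise ValueError("aligned rows must be non-empty and of equal length")
    # A column counts as identical only if the symbols match and are not gaps.
    matches = sum(1 for x, y in zip(top, bottom) if x == y and x != "-")
    return 100.0 * matches / len(top)


print(percent_identity("ACGT", "A-GT"))  # 75.0
```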
18 Alignment Questions
Should I use a global or a local alignment algorithm?
Which substitution matrix should I use?
What gap penalty structure should I use?
The answer to all of these questions lies in your response to this
question:
What are you trying to find out?
19 What are you trying to find out?
- Are you trying to locate similar domains or motifs?
→ Local alignment is probably best.
- Are you trying to determine whether the sequences are from the same
family? → Use one of the BLOSUM matrices.
- Are you trying to determine how closely related the sequences are
evolutionarily? → Use one of the PAM matrices.
20 Summary
- Sequence Alignment is a powerful tool for determining relatedness
between two sequences.
- There are many options and decisions to make in determining how to
do the alignment.
- It is essential to understand what type of relationship one is
looking for in order to apply the right tool with the right
parameter set.
21 Summary (cont.)
- Online resources can be found in table 3.1 of the book or at
www.bioinformaticsonline.org. Recommended: BCM-SIM, BCM-BLAST2,
FASTA-LALIGN, FASTA-PRSS, BLAST2
- Another interesting resource is the Genome Multimedia Site:
ocelot.bio.brandeis.edu/pages/classes/InterpGenes/Project/menu.htm
- Never underestimate the power of a good spreadsheet!