Sequence Alignment - PowerPoint PPT Presentation

Slides: 22
Provided by: garyja9

Transcript and Presenter's Notes

1
Sequence Alignment
  • Gary Jackoway
  • February 26, 2002
  • CISC 889 Bioinformatics

2
Sequence Alignment Outline
  • Dynamic Programming for Sequence Alignment
  • Equivalent Problems
  • Algorithm Description
  • O(MN) Proof By Example
  • Global versus Local Alignment
  • Nucleotide Substitution Matrix

3
Sequence Alignment Outline (cont)
  • PAM Substitution Matrix
  • BLOSUM Substitution Matrix
  • Log Odds Form
  • Gap Penalty
  • Alignment Issues
  • Summary

4
Dynamic Programming for Sequence Alignment
Problem: What is the optimal alignment of two
biological sequences?
Input: Two sequences (either nucleotides or
amino acids).
Output: An alignment (mapping one sequence onto
the other, possibly with gaps) and a score which
defines the quality of the match.
5
Equivalent Problems
  • Optical Character Recognition
  • Document Comparison
  • Spell Checker / Corrector

cornment → comment
Four-score and seven years ago vs. Four score and
seven years ago
mispeld → misspelled
REFERENCE: Skiena's The Algorithm Design Manual,
8.7.4
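
The spell-checker case can be illustrated with edit distance, which is the same kind of dynamic program as sequence alignment. This is a minimal sketch, assuming unit cost for each insertion, deletion, and substitution; the function name edit_distance is illustrative and not from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character edits turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string costs i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        prev = curr
    return prev[-1]


print(edit_distance("mispeld", "misspelled"))
print(edit_distance("cornment", "comment"))
```

A corrector built on this idea would suggest the dictionary word with the smallest distance to the typed word.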
6
Algorithm Description
DP algorithms have a strong relationship to
recursion: define a base case and prove that you
can extend. If you already have the optimal
solution to

    XY
    AB

then you know the next pair of characters will
be one of

    XYZ      XY-      XYZ
    ABC  or  ABC  or  AB-

(where - indicates a gap). So you can extend the
match by determining which of these has the
highest score.
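
The extension step can be written as a recurrence. The notation here is an assumption, not taken from the slides: F(i,j) is the best score for aligning the first i characters of sequence a with the first j characters of sequence b, s is the substitution score, and d is a (negative) score charged per gap position.

```latex
F(i,j) = \max
\begin{cases}
F(i-1,\,j-1) + s(a_i, b_j) & \text{pair } a_i \text{ with } b_j \\
F(i-1,\,j) + d             & \text{pair } a_i \text{ with a gap} \\
F(i,\,j-1) + d             & \text{pair } b_j \text{ with a gap}
\end{cases}
\qquad
F(0,0) = 0,\quad F(i,0) = i\,d,\quad F(0,j) = j\,d
```

The three cases correspond to the three ways of extending the alignment by one column.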
7
Needleman-Wunsch Algorithm Single Step
[Figure: one matrix cell and its three candidate
predecessors, labeled X, Y, and Z]
X = 0 + match(a1,b1)
Y = (1 gap) + (1 gap)
Z = (1 gap) + (1 gap)
8
Needleman-Wunsch Algorithm Single Step (numeric)
[Figure: one matrix cell and its three candidate
predecessors, labeled X, Y, and Z]
X = 21 + (-3) ← match(G,A)
Y = 28 + (-10) ← (1 gap)
Z = 14 + (-10) ← (1 gap)
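
The numeric step can be checked with a few lines of code. This is a sketch using the slide's values (neighbor scores 21, 28, and 14, a substitution score of -3 for G against A, and a gap score of -10); the function and parameter names are illustrative.

```python
def best_step(x_prev, y_prev, z_prev, match_score, gap_score):
    """Score one cell from its three neighbors; return (best, winning labels)."""
    candidates = {
        "X": x_prev + match_score,  # extend by pairing two characters
        "Y": y_prev + gap_score,    # extend with a gap in one sequence
        "Z": z_prev + gap_score,    # extend with a gap in the other sequence
    }
    best = max(candidates.values())
    winners = [label for label, value in candidates.items() if value == best]
    return best, winners


print(best_step(21, 28, 14, match_score=-3, gap_score=-10))
```

With these numbers X and Y tie, so an implementation needs a tie-breaking rule, or it can record both back-pointers.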
9
O(MN) Proof By Example
We will prove that the dynamic programming
algorithm for sequence alignment can be executed
in O(MN) time, where
M = length of first sequence
N = length of second sequence
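
One way to see the O(MN) bound is to count how many cells the table fill visits. This is a minimal sketch of a Needleman-Wunsch score computation, assuming illustrative scores (match +1, mismatch -1, gap -2) that are not taken from the slides.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return (global alignment score, number of interior cells computed)."""
    m, n = len(a), len(b)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap  # prefix of a aligned against gaps only
    for j in range(1, n + 1):
        score[0][j] = j * gap  # prefix of b aligned against gaps only
    cells = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # pair a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # a[i-1] against a gap
                score[i][j - 1] + gap,      # b[j-1] against a gap
            )
            cells += 1  # constant work per cell
    return score[m][n], cells


print(needleman_wunsch_score("ACGT", "AGT"))
```

Each of the M x N interior cells does a constant amount of work (three additions and a max), which is where the O(MN) running time comes from.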
10
Global versus Local Alignment
Want to find local matching areas, even when far
removed from each other in the sequence:
ACTTAGCAGACTAACGTAAC
CCATGACTAACGGGACCTAC
Smith-Waterman: Use Needleman-Wunsch but add: IF
value < 0, replace with 0 (and set backtrack to
none). When the matrix is complete, backtrack from
all local maxima, creating local matching
alignments.
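
The Smith-Waterman modification can be sketched as follows. The assumptions are illustrative scores (match +1, mismatch -1, gap -2) and, for brevity, a backtrack from only the single highest-scoring cell rather than from every local maximum.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-2):
    """Return (best local score, aligned piece of a, aligned piece of b)."""
    m, n = len(a), len(b)
    h = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 is the Smith-Waterman change: never go negative.
            h[i][j] = max(0, h[i - 1][j - 1] + sub,
                          h[i - 1][j] + gap, h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Backtrack from the best cell until a 0 cell ends the local alignment.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and h[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = smith_waterman("ACTTAGCAGACTAACGTAAC",
                                    "CCATGACTAACGGGACCTAC")
print(score)
print(top)
print(bottom)
```

Reporting every local maximum, as the slide describes, would mean repeating the backtrack from each qualifying cell instead of only the best one.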
11
Nucleotide Substitution Matrix
  • Two options for the Nucleotide Substitution Matrix:
  • Use the same penalty for all mismatches.
  • Use a lesser penalty for transitions (A↔G,
    C↔T) than for transversions (A,G ↔ C,T).
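
The second option can be sketched as a small scoring function. The score values (+1, -1, -2) are illustrative assumptions, not values from the slides.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def nucleotide_score(x, y, match=1, transition=-1, transversion=-2):
    """Score a pair of nucleotides, penalizing transitions less."""
    if x == y:
        return match
    pair = {x, y}
    # A transition stays within one chemical class (purine or pyrimidine).
    if pair <= PURINES or pair <= PYRIMIDINES:
        return transition
    return transversion


for x, y in [("A", "A"), ("A", "G"), ("C", "T"), ("A", "C")]:
    print(x, y, nucleotide_score(x, y))
```

Setting transition equal to transversion recovers the first option, a single penalty for all mismatches.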

12
PAM: Percent Accepted Mutation Substitution
Matrix (Dayhoff)
  • Substitution matrices based on sound evolutionary
    principles.
  • Find PAM1 by comparing groups of proteins known
    to be evolutionarily closely related.
  • Find PAM-n by raising PAM1 to the nth power
    (multiplying n copies of PAM1 together).
  • PAM60 = 60% similar, PAM250 = 20% similar.
  • The more distant the expected relationship, the
    higher the n of the PAM-n matrix that should be
    used.
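
The "multiply PAM1 by itself" step is a matrix power. The sketch below uses a made-up two-letter alphabet with a 1% change per step purely for illustration; it is not the real 20 x 20 Dayhoff matrix, and the names TOY_PAM1 and pam_n are hypothetical.

```python
def mat_mul(p, q):
    """Multiply two square matrices given as lists of rows."""
    size = len(p)
    return [[sum(p[i][k] * q[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def pam_n(pam1, n):
    """Return pam1 raised to the n-th power (n >= 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    result = pam1
    for _ in range(n - 1):
        result = mat_mul(result, pam1)
    return result


# Toy mutation-probability matrix: each letter changes with probability 0.01.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam250 = pam_n(TOY_PAM1, 250)
print(toy_pam250[0][0])  # probability that a letter is unchanged after 250 steps
```

Each row still sums to 1 after any number of steps, and the diagonal shrinks as n grows, which matches the idea that higher PAM-n matrices describe more distant relationships.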

13
BLOSUM: BLOcks SUbstitution Matrix
  • Start with highly-conserved patterns (blocks) in
    a large set of closely related proteins.
  • Use the likelihood of substitutions found in
    those sequences to create a substitution
    probability matrix.
  • BLOSUM-n means that the sequences used were n%
    identical.
  • BLOSUM62 is standard.

14
Log Odds Form
BLOSUM and PAM matrices start as a likelihood of
substitution. Conversion to odds form yields a
matrix that gives the odds that a change is
evolutionarily significant versus purely
random. Conversion to log odds form means that as
you add each character to the pattern, you can
add the values instead of multiplying them (as
you would need to do for odds form).
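
The add-instead-of-multiply property can be checked numerically. This sketch assumes the common definition of the odds for a pair as its observed frequency divided by the frequency expected by chance; the frequencies below are made up for illustration.

```python
import math


def log_odds(p_pair, q_a, q_b):
    """Base-2 log of observed pair frequency over chance expectation."""
    return math.log2(p_pair / (q_a * q_b))


# Hypothetical (observed pair frequency, background of a, background of b)
# for three aligned columns.
columns = [(0.02, 0.1, 0.1), (0.005, 0.1, 0.1), (0.03, 0.1, 0.1)]

odds_product = 1.0
log_sum = 0.0
for p_pair, q_a, q_b in columns:
    odds_product *= p_pair / (q_a * q_b)  # odds form: multiply per column
    log_sum += log_odds(p_pair, q_a, q_b)  # log odds form: add per column

print(math.log2(odds_product), log_sum)
```

The two printed values agree up to floating-point rounding, which is why alignment programs can sum matrix entries column by column.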
15
Gap Penalty
  • The gap penalty has to work with the
    substitution matrix. (Ex.: if two gap penalties
    together are not more severe than one
    substitution, then you will get an insert /
    delete pair instead of a substitution.)
  • If gap penalty is too costly, will get mismatches
    when a gap would lead to a better match.
  • If gap penalty is too cheap, will get meaningless
    gaps, just to line up one or two characters.

16
Gap Penalty (cont.)
  • It is intuitively appealing to use a gap penalty
    of the form g + rx, where x is the length of the
    gap, g is the gap opening penalty, and r is the
    gap extension penalty. It is better to have one
    big gap than scattered small ones.
  • NOTE: If the gap penalty (or extension) is not
    more costly than all substitutions, the
    recurrence relation needs correction: you need
    to look back along the current row and column to
    assure optimality. Such a penalty violates the
    triangle inequality.
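
The preference for one big gap over scattered small ones follows directly from the g + rx form. A minimal sketch, assuming illustrative values g = 10 and r = 1 (not from the slides):

```python
def gap_penalty(length, g=10, r=1):
    """Penalty for a single gap of the given length, in the form g + r*x."""
    if length <= 0:
        return 0
    return g + r * length


one_big_gap = gap_penalty(3)
three_small_gaps = 3 * gap_penalty(1)
print(one_big_gap, three_small_gaps)
```

Because g is charged once per gap, three separate one-character gaps cost far more than a single three-character gap. Some programs define the form slightly differently (for example, charging r only for positions after the first), so check the convention of the tool you use.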

17
How good is my alignment?
(Starting with log odds form helps.) Most online
programs give a number of statistical
formulations that attempt to answer the
question.
  • Score: the value calculated for the sequence
    using the substitution matrix and the gap
    penalties.
  • Percent identity: percent of exact matching
    symbols.
  • Expected value (E): how likely it is that a
    match with this score would be obtained by
    chance when comparing two random sequences.
    NOTE: different systems use different forms of
    this statistic (some report a probability,
    others an expected number of chance matches).
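
Percent identity is the simplest of these statistics to compute by hand. This sketch assumes the two aligned strings have equal length, use "-" for gaps, and that the denominator is the full alignment length; programs differ on that last choice.

```python
def percent_identity(top, bottom):
    """Percent of alignment columns where both rows hold the same symbol."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    if not top:
        return 0.0
    matches = sum(1 for x, y in zip(top, bottom) if x == y and x != "-")
    return 100.0 * matches / len(top)


print(percent_identity("AC-T", "ACGT"))
```

Dividing by the shorter sequence length, or by the number of ungapped columns, gives different percentages for the same alignment, so compare values only when they come from the same program.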
18
Alignment Questions
Should I use a global or a local alignment
algorithm?
Which substitution matrix should I use?
What gap penalty structure should I use?
The answer to all of these questions lies in
your response to this question:
What are you trying to find out?
19
What are you trying to find out?
  • Are you trying to locate similar domains or
    motifs? → Local alignment is probably best.
  • Are you trying to determine whether the sequences
    are from the same family? → Use one of the
    BLOSUM matrices.
  • Are you trying to determine how closely related
    the sequences are evolutionarily? → Use one of
    the PAM matrices.

20
Summary
  • Sequence Alignment is a powerful tool for
    determining relatedness between two sequences.
  • There are many options and decisions to make in
    determining how to do the alignment.
  • It is essential to understand what type of
    relationship one is looking for in order to apply
    the right tool with the right parameter set.

21
Summary (cont)
  • Online resources can be found in table 3.1 of the
    book or www.bioinformaticsonline.org. Recommended:
    BCM-SIM, BCM-BLAST2, FASTA-LALIGN, FASTA-PRSS,
    BLAST2
  • Another interesting resource is the Genome
    Multimedia Site:
    ocelot.bio.brandeis.edu/pages/classes/InterpGenes/Project/menu.htm
  • Never underestimate the power of a good
    spreadsheet!