Sequence Analysis - PowerPoint PPT Presentation

Provided by: hui59

Transcript and Presenter's Notes

1
Sequence Analysis
  • CSC 487/687 Introduction to computing for
    Bioinformatics

2
Aligning Sequences
  • Sequences
  • Representing proteins or nucleic acid (DNA/RNA)
    molecules
  • Order of amino acids (for proteins) or nucleotides (for
    DNA/RNA) along one chain
  • Sequence alignment
  • The identification of residue-residue
    correspondences
  • Any assignment of correspondences that preserves
    the order of residues within the sequences

3
Evolutionary Basis of Sequence Alignment
  • Identity: a quantity that describes how much two sequences
    are alike in the strictest terms.
  • Similarity: a quantity that relates how much two amino acid
    sequences are alike.
  • Homology: a conclusion drawn from data suggesting that two
    genes share a common evolutionary history.

4
Evolutionary Basis of Sequence Alignment
  • Homologous sequences
  • Related by evolution (common ancestors)
  • Alignment of homologous sequences
  • Identifying relationship between the sequence
    elements
  • Match up characters coming from same characters
    in ancestor

5
Alignment and Evolution
  • Assume we know the evolutionary history relating q and d,
    with h as their common ancestor
  • The true alignment can be found using h as a template
  • h: GLVS T
  • q: GLISVT
  • d: GIV--T

6
Alignment and Evolution
  • Given an alignment, several different evolutionary
    histories may be (equally) plausible
  • Example
  • Alignment
  • q: GLISVT
  • d: G-I-VT
  • One possible history
  • Ancestor h = GLIVT
  • Branch from h to q: insertion of S (- -> S), giving
    q = GLISVT
  • Branch from h to d: deletion of L (L -> -), giving
    d = GIVT

7
Global and Local Alignment
  • Global
  • Assuming that the complete sequences are the
    results of evolution from the same ancestor
    sequence
  • Local
  • Align segments of the sequences so that the
    segments are evolutionarily related

8
Pairwise Sequence Alignments vs. Multiple Sequence Alignments
  • Pairwise sequence alignment: an alignment of two sequences
  • Multiple sequence alignment: a mutual alignment of more
    than two sequences

9
The dotplot
10
The dotplot
  • Captures not only the overall similarity of two
    sequences, but also the complete set and relative
    quality of different possible alignments
  • Diagonal move: successive residues are aligned with no gap
    between them
  • Horizontal move: a gap is introduced in the sequence
    indexing the rows
  • Vertical move: a gap is introduced in the sequence
    indexing the columns
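
A dotplot can be sketched in a few lines. The snippet below is only an illustration, not the tool used to produce the slides' figures: it marks a cell whenever the row residue equals the column residue, with no window or threshold filtering. The function names `dotplot` and `render` are made up for this example.

```python
def dotplot(row_seq: str, col_seq: str) -> list[list[bool]]:
    """Return a matrix that is True wherever the row and column residues match."""
    return [[r == c for c in col_seq] for r in row_seq]


def render(row_seq: str, col_seq: str) -> str:
    """Draw the dotplot as text: '*' marks a match, '.' a mismatch."""
    plot = dotplot(row_seq, col_seq)
    header = "  " + col_seq
    rows = [
        r + " " + "".join("*" if cell else "." for cell in line)
        for r, line in zip(row_seq, plot)
    ]
    return "\n".join([header] + rows)


# The two sequences from the "Alignment and Evolution" example.
print(render("GLISVT", "GIVT"))
```

Runs of `*` along a diagonal correspond to stretches that can be aligned without gaps.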

11
Dotplots and alignments
  • A path through the dotplot can be read as an edit script
  • Each move performs an operation: a substitution, an
    insertion or a deletion.
  • When the end of the path is reached, the net effect is to
    change one sequence into the other.
  • Several different sequences of edit operations may convert
    one string to the other in the same number of steps.

12
Dotplots and alignments
  • Although a sequence of edit operations derived from an
    optimal alignment may correspond to an actual evolutionary
    pathway, it is impossible to prove that it does.
  • The larger the edit distance, the larger the number of
    reasonable evolutionary pathways between two sequences.

13
Dotplots and alignments
  • Dotplots between pairs of proteins with increasingly
    distant relationships.
  • Dotplot comparisons of the sulphydryl proteinase papain
    from papaya with four homologues: the close relative kiwi
    fruit actinidin, and the more distant relatives human
    procathepsin L, human cathepsin B, and a homologue from
    Staphylococcus aureus.

14
Example
15
Example
16
Example
17
Example
18
Measures of sequence similarity
  • Hamming distance: the number of positions with mismatching
    characters (defined for sequences of equal length).
  • Edit distance: the minimum number of edit operations
    required to change one string into the other.
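
As a small illustration of the first measure, the Hamming distance can be computed by comparing the two strings position by position. This is a minimal sketch; the function name is chosen for this example, and raising an error on unequal lengths is one possible convention.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


print(hamming_distance("AGTC", "ACTG"))
```

Unlike the edit distance, this measure does not allow gaps, so a single insertion can make two otherwise similar sequences look very different.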

19
What is an Alignment?
  • A global alignment of two sequences A and B
    contains all characters of A and B in the same
    order
  • one symbol from A can be aligned with one symbol
    from B
  • a symbol can be aligned with a blank, written as "-"
  • two blanks cannot be aligned
  • Every symbol from A and from B must be aligned
  • Example
  • A = INVEST, B = INTEREST
  • IN--VEST   INV--EST   IN-V--EST
  • INTEREST   INTEREST   IN-TEREST
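
The rules on this slide can be turned into a small checker. The sketch below is illustrative only; the function name and the choice of "-" as the blank symbol follow the slide's notation, not any particular library.

```python
GAP = "-"


def is_valid_global_alignment(row_a: str, row_b: str, a: str, b: str) -> bool:
    """Check the slide's rules for a global alignment of a and b."""
    # Both rows must have the same number of columns.
    if len(row_a) != len(row_b):
        return False
    # Two blanks cannot be aligned with each other.
    if any(x == GAP and y == GAP for x, y in zip(row_a, row_b)):
        return False
    # Every symbol of a and b must appear, in the original order.
    return row_a.replace(GAP, "") == a and row_b.replace(GAP, "") == b


for top, bottom in [("IN--VEST", "INTEREST"),
                    ("INV--EST", "INTEREST"),
                    ("IN-V--EST", "IN-TEREST")]:
    print(top, bottom, is_valid_global_alignment(top, bottom, "INVEST", "INTEREST"))
```

As transcribed here, the third pair has a column in which two blanks face each other, so the checker rejects it; dropping that column gives back the first alignment.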

20
Computing Alignments
  • A large number of alignments exist for any pair of
    sequences
  • To have a computer carry out the alignment process in a
    meaningful way, we need
  • Scoring scheme: a mathematical way to calculate the
    goodness of candidate alignments
  • Search method: an algorithm able to identify high-scoring
    alignments

21
Choosing Scoring Scheme
  • A scoring scheme should be
  • Simple, to allow for efficient calculation and an
    efficient search for the best alignment
  • Biologically meaningful (give high scores to biologically
    good alignments)

22
Simple Scoring Scheme
  • Assign a score to each column in the alignment
  • Columns are of the following sorts: a pair of aligned
    characters, or a character aligned with a blank (a gap)
  • Alignment score: the sum of the scores over all columns
  • R: a matrix giving the score for all possible character
    pairs (e.g., all pairs of amino acid symbols)

23
Alignment Score Example
  • R: identity matrix (identical characters score 1, unequal
    characters score 0)
  • Gap penalty g = 1
  • ALIGN1
  • V - E I T G E I S T
  • P R E - T E R I - T
  • 0 -1 1 -1 1 0 0 1 -1 1   Score: 1
  • ALIGN2
  • V E I T G E I S T
  • P R E T - E R I T
  • 0 0 0 1 -1 1 0 0 1   Score: 2
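
The column-by-column scoring on this slide can be reproduced with a short function. This is a sketch under the slide's assumptions (identity scoring, and a penalty of 1 subtracted for every column containing a blank); the function and parameter names are made up for this example.

```python
def alignment_score(row_a: str, row_b: str,
                    match: int = 1, mismatch: int = 0, gap: int = 1) -> int:
    """Sum the column scores of an alignment given as two gapped rows."""
    if len(row_a) != len(row_b):
        raise ValueError("both rows must have the same number of columns")
    score = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            score -= gap          # a character aligned with a blank
        elif x == y:
            score += match        # identical characters
        else:
            score += mismatch     # unequal characters
    return score


print(alignment_score("V-EITGEIST", "PRE-TERI-T"))  # ALIGN1
print(alignment_score("VEITGEIST", "PRET-ERIT"))    # ALIGN2
```

With these settings the two calls give 1 and 2, matching the scores shown for ALIGN1 and ALIGN2.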

24
Finding the Minimum Scoring Alignment
  • The number of possible alignments is very large: we cannot
    generate all of them and score each one to find the best
  • Task: align
  • A = a1a2...am and
  • B = b1b2...bn
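
To get a feel for how quickly the number of alignments grows, the sketch below counts them. It assumes the rules given earlier (every column is a pair of characters or a character against a blank, never two blanks) and treats alignments that differ only in the order of adjacent gap columns as distinct. Under those assumptions each alignment ends in one of three column types, so the count follows a three-term recurrence (these are the Delannoy numbers). The function name is chosen for this example.

```python
def count_alignments(m: int, n: int) -> int:
    """Count global alignments of sequences with lengths m and n."""
    # counts[i][j] = number of alignments of a prefix of length i
    # with a prefix of length j; aligning with an empty prefix has one way.
    counts = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            counts[i][j] = (counts[i - 1][j]         # last column (ai, -)
                            + counts[i - 1][j - 1]   # last column (ai, bj)
                            + counts[i][j - 1])      # last column (-, bj)
    return counts[m][n]


for length in (1, 2, 3, 10, 20):
    print(length, count_alignments(length, length))
```

Even for two sequences of length 3 there are already 63 alignments, which is why exhaustive enumeration is not practical for real sequences.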

25
Independence Between Sub-alignments
  • Observations
  • The score of the alignment up to and including character i
    from A and character j from B is independent of how the
    rest of the sequences are aligned
  • The best solution to (i,j) can therefore be fixed and its
    score recorded in D(i,j)
  • D(m,n) is the score of the best global alignment
  • This makes the problem amenable to dynamic programming

26
Dynamic programming algorithm
  • Individual edit operations include
  • Substitution of bj for ai: represented (ai, bj)
  • Deletion of ai from sequence A: represented (ai, -)
  • Deletion of bj from sequence B: represented (-, bj)

27
Dynamic programming algorithm
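  • A cost function d is defined on edit operations
  • d(ai, bj): cost of a mutation in an alignment in which
    position i of sequence A corresponds to position j of
    sequence B
  • d(ai, -) or d(-, bj): cost of a deletion or insertion
  • The minimum weighted distance between sequences A and B is
    defined as
  • D(A,B) = min Σ d(x,y), where the sum runs over the edit
    operations (x,y) of an edit script and the minimum is
    taken over all edit scripts that convert A into B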

28
Three Alternative Alignment Ends
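  • The alignment between a1a2...ai and b1b2...bj ends in one
    of three ways
  • a1..i-1 aligned with b1..j, followed by the column (ai, -)
  • a1..i-1 aligned with b1..j-1, followed by the column
    (ai, bj)
  • a1..i aligned with b1..j-1, followed by the column (-, bj)
  • To calculate D(i,j) we pick the one that gives the lowest
    cost
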
29
Recurrence Relation
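  • Assume that D(i-1,j), D(i-1,j-1) and D(i,j-1) have already
    been calculated
  • Ending in (ai, -): a1..i-1 aligned with b1..j, cost
    D(i-1,j) + d(ai, -)
  • Ending in (ai, bj): a1..i-1 aligned with b1..j-1, cost
    D(i-1,j-1) + d(ai, bj)
  • Ending in (-, bj): a1..i aligned with b1..j-1, cost
    D(i,j-1) + d(-, bj)
  • D(i,j) = min { D(i-1,j) + d(ai, -),
    D(i-1,j-1) + d(ai, bj), D(i,j-1) + d(-, bj) }
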
30
Basis of Recursion
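  • Aligning the empty string with a string of length i (resp.
    j) is done by aligning each of its i (resp. j) characters
    with a blank
  • D(0,0) = 0, D(i,0) = D(i-1,0) + d(ai, -),
    D(0,j) = D(0,j-1) + d(-, bj)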
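
The recurrence and its base cases can be put together in a short dynamic programming routine. The sketch below is one way to do it, not necessarily the implementation behind the slides: the cost function `unit_cost` (0 for a match, 1 for a substitution, insertion or deletion) is an assumption made for this example, and it turns D(m,n) into the plain edit distance.

```python
from typing import Callable

GAP = "-"


def unit_cost(x: str, y: str) -> int:
    """Hypothetical cost function d: 0 for a match, 1 for any other operation."""
    return 0 if x == y else 1


def min_alignment_cost(a: str, b: str,
                       d: Callable[[str, str], int] = unit_cost) -> int:
    """Return D(m,n), the cost of the cheapest global alignment of a and b."""
    m, n = len(a), len(b)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    # Base cases: a prefix aligned entirely against blanks.
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + d(a[i - 1], GAP)
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + d(GAP, b[j - 1])
    # Fill the matrix row by row using the three possible alignment ends.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i - 1][j] + d(a[i - 1], GAP),           # ends in (ai, -)
                D[i - 1][j - 1] + d(a[i - 1], b[j - 1]),  # ends in (ai, bj)
                D[i][j - 1] + d(GAP, b[j - 1]),           # ends in (-, bj)
            )
    return D[m][n]


print(min_alignment_cost("INVEST", "INTEREST"))
```

With unit costs the example prints 3, which is the cost of the alignment IN--VEST / INTEREST (two gaps and one substitution). Passing a different `d` gives a weighted distance; recovering the alignment itself would additionally require a traceback through the matrix, which is not shown here.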

31
Calculating Score of Best Alignment Using Matrix
H matrix
32
Time Complexity
  • Sequences of lengths n and m: O(nm) time, since each of
    the (n+1)(m+1) matrix cells is computed in constant time
  • Two sequences of length l: O(l^2) time